Dan Linstedt, Supercharge Your Data Warehouse


Invaluable Data Modeling Rules to Implement Your Data Vault

Dan Linstedt, Inventor of the Data Vault

ISBN: 978-0-9866757-1-3

Super Charge Your Data Warehouse


Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault

Copyright © Dan Linstedt, 2008-2011
http://LearnDataVault.com

All rights reserved. All images are the property of Dan Linstedt, unless an image source is otherwise noted. No part of this book may be reproduced in any form or by any electronic or mechanical means, including information storage and retrieval systems, without permission in writing from the author. The only exception is by a reviewer, who may quote short excerpts in a review.

Printed in the United States of America
First Printing: December, 2010

Co-Editor: Kent Graziano
Special Thanks: Tom Breur for additional editing

Abstract: The purpose of this book is to present and discuss the technical components of the Data Vault Data Model. The examples in this book provide a strong foundation for how to build and design structures when using the Data Vault modeling technique. This book is the second in a series of books on the Data Vault model and methodology (approach). The target audience is anyone wishing to implement a Data Vault model for integration purposes, whether it be an Enterprise Data Warehouse, Operational Data Warehouse, or Dynamic Data Integration Store.


Table of Contents

Acknowledgements
1.0 Introduction and Terminology
1.1 Do I need to be a Data Modeler to read this book?
1.2 Review of Basic Terminology
1.3 Data Modeling Notations Used in This Text
1.4 Data Models as Ontologies
1.5 Data Model Naming Conventions and Abbreviations
1.6 Introduction to Hubs, Links, and Satellites
1.7 Flexibility of the Data Vault Model
1.8 Data Vault Basis of Commutative Properties and Set Based Math
1.9 Data Vault and Parallel Processing Mathematics
1.10 Introduction to Complexity and the Data Vault
1.11 Loading Processes: Batch Versus Real Time
2.0 Architectural Definitions
2.1 Staging Area
2.2 EDW – Data Vault
2.3 Metrics Vault
2.4 Meta Vault
2.5 Report Collections
2.6 Data Marts
2.7 Business Data Vault
2.8 Operational Data Vault
2.9 Dynamic Data Vault
3.0 Common Attributes
3.1 Sequence Numbers
3.2 Sub Sequence Numbers (Item Numbering)
3.3 Load Dates
3.4 Load End Dates
3.5 Last Seen Dates
3.6 Extract Dates
3.7 Record Creation Dates
3.8 Record Sources
3.9 Process IDs
4.0 Hub Entities
4.1 Hub Definition and Purpose
4.2 What is a Business Key?
4.3 Where do we find Business Keys?
4.4 Why are Business Keys Important?
4.5 How do Business Keys tie to Hubs and Business Processes?
4.6 Why not Surrogate Keys as "Master Keys"?
4.7 Hub Smart Keys, Intelligent Keys
4.8 Hub Composite Business Keys
4.9 Hub Entity Structure
4.10 Hub Examples
4.11 Dependent and Non-dependent Child Keys
4.12 Mining Patterns in the Hub Entity
4.13 Process of Building a Hub Table
4.14 Modeling Rules and Standards for Hub Tables
4.15 What Happens when the Hub Standards Are Broken
5.0 Link Entities
5.1 Link Definition and Purpose
5.2 Reasons for Many To Many Relationships
5.3 Flexibility
5.4 Granularity
5.5 Dynamic Adaptability
5.6 Scalability
5.7 Link Entity Structure
5.8 Link Driving Key
5.9 Link Examples
5.10 Degenerate Fields In Links
5.11 Multi-Temporal Date Structures
5.12 Link-To-Link (Parent/Child Relationships)
5.13 Link Applications
5.14 Hierarchical Links
5.15 Same-As Links
5.16 Begin and End Dating Links
5.17 Low Value Links
5.18 Transactional Links
5.19 Computed Aggregate Links
5.20 Strength and Confidence Ratings in Links
5.21 Exploration Links
6.0 Satellite Entities
6.1 Satellite Definition and Purpose
6.2 Satellite Entity Structure
6.3 Satellite Examples
6.4 Importance of Keeping History
6.5 Splitting Satellites by Classification or Type of Data
6.6 Splitting Satellites by Rate of Change
6.7 Satellites Arranged by Source System
6.8 Overloaded Satellites (The Flip-Flop Effect)
6.9 Satellite Applications
6.9.1 Effectivity Satellites
6.9.2 Record Tracking Satellites
6.9.3 Status Tracking Satellites
6.9.4 Computed Satellites (Quality Generated)
6.9.5 Multiple Active Satellite Rows
6.10 Splitting Satellites
6.11 Consolidating Satellites
7.0 Query Assistant Tables
7.1 Point in Time Tables
7.2 Bridge Tables
8.0 Reference Tables
8.1 No-History Reference Tables
8.2 History Based Reference Tables
8.3 Code and Descriptions
9.0 Conclusions


Table of Figures

Figure 1-1: Example E-R Diagram (Elmasri/Navathe)
Figure 1-2: Crows Foot and Arrow Notation Example
Figure 1-3: Small Example: Ontology for Vehicle
Figure 1-4: Example Abbreviations and Naming Conventions
Figure 1-5: Example Data Vault
Figure 1-6: Flexibility of Adapting to Change
Figure 1-7: 3rd Normal Form Product and Supplier Example
Figure 1-8: Applied Set Theory for the Data Vault
Figure 1-9: Parallel Computing Simplified
Figure 1-10: Logical Data Vault Hyper Cube
Figure 1-11: Physical Data Vault Layout (Starting Point)
Figure 1-12: Physical Data Vault Layout (Partitioned)
Figure 2-1: Enterprise BI Architectural Components
Figure 3-1: Time Series Batch Loaded Data
Figure 3-2: Real-Time Arrival, Data Geology
Figure 3-3: Load Date Time Stamp and Record Source
Figure 3-4: Example Load Date Time Stamp Data
Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle
Figure 3-6: Structures Containing Last Seen Dates
Figure 3-7: Scan All Data in EDW
Figure 3-8: Reduced Scan Set After Applying Last Seen Date
Figure 4-1: Business Key Changing Across Line of Business
Figure 4-2: Hub Example Images
Figure 4-3: Hub Example Data
Figure 4-4: Smart Key Example
Figure 4-5: Composite Business Key Hub Example
Figure 4-6: Example Hub Entity Structure
Figure 4-7: Example Hubs from Adventure Works 2008
Figure 4-8: Example of National Drug Code Data Vault
Figure 4-9: Dependent Child Relationship Modeling
Figure 4-10: Typical Hub Row Sizing
Figure 5-1: Relationship Changes Over Time
Figure 5-2: Link Table Structure Housing Multiple Relationships
Figure 5-3: Starting Model Before Changes
Figure 5-4: Data Vault After Modification
Figure 5-5: Additional Data Vault Model – More Changes
Figure 5-6: Global Data Vault Linking
Figure 5-7: Uncovering Fact Table Grain
Figure 5-8: Data Vault Grain, Representing Star Schema
Figure 5-9: Traditional Data Vault Storage Layout
Figure 5-10: Performance Physical Split Version 1
Figure 5-11: Performance Physical Split Version 2
Figure 5-12: Performance Physical Split Version 3
Figure 5-13: Sample Link Structure
Figure 5-14: Example Driving Key for Link
Figure 5-15: Example of Link Satellite with Driving Key
Figure 5-16: Insert to Link/Sat Based on Driving Key
Figure 5-17: Link Driving Key/Satellite End Dated
Figure 5-18: Example of Link Tables From Adventure Works 2008 Data Vault
Figure 5-19: Example of Link To Link Relationships
Figure 5-20: Step 1, Flattening Link-To-Link Hierarchy
Figure 5-21: Step 2, Flattening Link-To-Link Hierarchy
Figure 5-22: Example Organization Structure
Figure 5-23: Hierarchical Link for Offices
Figure 5-24: Example Hierarchical Link of Employees
Figure 5-25: Same-As Link Example, Business Data
Figure 5-26: Same-As Link Data Vault Model
Figure 5-27: Incorrect Link with Begin/End Date
Figure 5-28: Begin & End Dates in Links
Figure 5-29: Example of Poorly Constructed Link
Figure 5-30: Satellite Effectivity on a Link
Figure 5-31: Transactional Link Example
Figure 5-32: Transactional Link, No Satellite
Figure 5-33: Example of Computed Aggregate Link
Figure 6-1: Example Satellite Entity
Figure 6-2: Example Satellite Entities
Figure 6-3: Satellites Split by Type Of Data Option 1
Figure 6-4: Satellite Data Rate of Change Example
Figure 6-5: Satellite Split by Rate Of Change
Figure 6-6: Customer Satellites Split by Source System
Figure 6-7: Satellite Overload from Many Sources
Figure 6-8: Satellite Effectivity
Figure 6-9: Denormalized Record Source Tracking Satellite
Figure 6-10: Normalized Record Source Tracking Satellite
Figure 6-11: Status Tracking Satellite
Figure 6-12: Multi-Active Satellite Rows
Figure 6-13: Multi-Active Satellite Row Data
Figure 6-14: Multi-Active Satellite with Business Sub-Sequence
Figure 6-15: Step 1: Identify Satellite Split Columns
Figure 6-16: Step 2: Split Satellite Columns, Design New Tables
Figure 6-17: Step 3: Copy Data From Original to New Satellites
Figure 6-18: Step 4: Eliminate Duplicates
Figure 6-19: Step 4: Alternate Elimination of Duplicates
Figure 6-20: Step 5: End Dates Adjusted After Satellite Split
Figure 6-21: Consolidating Satellite Data
Figure 6-22: Load End Dates Calculated in Consolidated Satellite
Figure 7-1: Structure of PIT Table
Figure 7-2: PIT Table Architecture Overview
Figure 7-3: Example PIT Table with Snapshot Dates
Figure 7-4: Bridge Table Structure
Figure 7-5: Bridge Table Architectural Overview
Figure 7-6: Bridge Table Example Data
Figure 8-1: Non-History Reference Table
Figure 8-2: Standard History Based Reference Table
Figure 8-3: Hub/Sat History Based Reference Table
Figure 8-4: Group Code and Description


Acknowledgements

I wish to personally thank Kent Graziano for sticking by me all this time. His relentless editing skills have truly helped to shape and hone this book. It's taken me two years to put this book together, and countless hours of writing, creating graphics and examples in high quality print and color.

In addition to Kent, Tom Breur also assisted me in the editing process – he helped me to draw out important points – and yes, he wanted me to change to single spacing – but that's one thing I just didn't compromise on.

Then, there is Sanjay Pande – he's an IT veteran turned marketing expert who knows his stuff inside and out. He's been an inspiration to me to try new things, and create new titles for the book. He's also helping me with many other aspects of marketing that I wasn't even aware of.

I wish to thank my wife Julie for putting up with me spending hours editing my book (even on my vacations) which I really shouldn't do. My wife also helped me re-formulate the cover art and pick a cool looking design. I'd also like to thank God for blessing me with this knowledge and then finally urging me to trust Him and write it down for others!

Finally, I'd really like to thank YOU, the reader. Many of you know me, or have seen me teach in person – without you, there would be no Data Vault successes in the world today. I love to hear about your trials, as well as your successes with the Data Vault – if you'd like to help me write (yet another book of case-studies) then I want to hear from you! Of course, if you're ever in Saint Albans or even Burlington, Vermont, drop me an email or call me – I'd be delighted to meet you for lunch.

Sincerely,
Daniel Linstedt
[email protected]


1.0 Introduction and Terminology

Welcome to the technical book about Data Vault Modeling. This book is for practitioners and implementers. The content of this book is focused on the Data Vault structures, the definitions of those structures, metadata, and data modeling – it does not cover the loading, querying, or partitioning of the Data Vault. Those feature sets will be covered in the next set of technical books.

1.1 Do I need to be a Data Modeler to read this book?

No, it is not necessary to be a data modeler to read this book. While a data modeling background is helpful, it is not required. The writing covers the basic components of the Data Vault Model and introduces concepts utilized by nearly all relational database systems, so experience with RDBMS engines can be applied directly to the material presented here. This book also assumes you are familiar with the basics of data warehousing as defined by W.H. Inmon and Dr. Ralph Kimball. A common understanding of fields/columns, tables, and key structures (such as referential integrity) is helpful. The next section provides descriptions of common terms used throughout this book.

1.2 Review of Basic Terminology

The terminology in this book consists of basic entity-relationship (E-R) diagramming and data modeling terms. Terminology such as Table, Entity, Attribute, Column, Field, Primary Key, Foreign Key, and Unique Index is utilized throughout. For reference purposes, the following basic definitions of the terms are provided.

Table: A composite grouping of data elements instantiated in a database, making up a concept.

Entity: A table, as referred to in a logical format (e.g., customer, account, etc.).

Attribute: A single data element comprised of a name, data type, length, precision, null flag, and possibly a default value.

Column: An ordered attribute within a table.

Field: Same as Column. See Column definition.

Primary Key: Main set of one or more attributes indicating a unique method for identifying data stored within a table.

Foreign Key: One or more attributes associated with the primary key in another table. Often used as lookup values; may be optional (nullable) or mandatory (non-null). When enabled in a database, foreign keys ensure referential integrity.

Unique Index: One or more attributes combined to form a single unique list of data spanning all rows within a single table.

Business Key: Component used by the business users, business processes, or operational code to access, identify, and associate information within a business operational life-cycle. This key may be represented by one or more attributes.

Natural Key: See Business Key.

Relationship: An association between or across exactly two tables.

Many to 1: A notation used to describe the number of records in the left-hand table as related to the number of records in the right-hand table. Example: many customer records may have one and only one contact record.

Many to Many: An open-ended notation. For example: many customer records may have many contact records.

1 to 1: A notation dictating singular cardinality: one customer record may have one and only one contact record.

Cardinality: In mathematics, the cardinality of a set is a measure of the "number of elements of the set." For example, the set A = {1, 2, 3} contains 3 elements, and therefore A has a cardinality of 3. There are two approaches to cardinality: one which compares sets directly using bijections and injections, and another which uses cardinal numbers. (Reference: http://en.wikipedia.org/wiki/Cardinality)

Constraint: A relationship between two tables that enforces the existence of data in a parent table, or an indicator of a unique or not-null column. A constraint may also indicate basic rules such as defaults, functions to check values, or ranges of data.

Weak Relationship: A constraint that is optional: when the data is not null, the constraint is checked for validity; when the data is null, the constraint is not checked.

Strong Relationship: A constraint that is non-optional. Data is required (not-null) in the child table at all times, and is therefore always checked for validity.

Associative Entity: An element of the entity-relationship model. The relational model does not offer direct support for many-to-many relationships, even though such relationships occur frequently in normal usage. The solution to this problem is the creation of another table to hold the necessary information for the relationship; this new table is called an associative entity. (http://en.wikipedia.org/wiki/Associative_entity)
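To ground a few of these terms, here is a minimal ANSI-style SQL sketch (the table and column names are illustrative assumptions, not taken from the book) of two parent tables joined through an associative entity that resolves a many-to-many relationship:

```sql
-- Two parent tables, each with a surrogate primary key
-- and a unique business (natural) key.
CREATE TABLE customer (
    customer_id  INTEGER     NOT NULL PRIMARY KEY,  -- primary key
    customer_num VARCHAR(20) NOT NULL UNIQUE        -- business key
);

CREATE TABLE contact (
    contact_id INTEGER      NOT NULL PRIMARY KEY,
    email_addr VARCHAR(100) NOT NULL UNIQUE
);

-- Associative entity: many customers related to many contacts.
-- Both foreign keys are strong (non-null) relationships.
CREATE TABLE customer_contact (
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    contact_id  INTEGER NOT NULL REFERENCES contact (contact_id),
    PRIMARY KEY (customer_id, contact_id)           -- unique pairing
);
```

Adjust data types and constraint syntax to your RDBMS; the structure is what matters here.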

Data models are diagrammatic representations of information, and classes of information, to be held within a mechanical storage mechanism such as a database engine; the exception is a conceptual/business model, which should be independent of technology. Common database engines today include: DB2 UDB, Teradata, MySQL, PostgreSQL, Oracle, SQL Server, and Sybase ASE. There are several main notations used for E-R diagrams (e.g., Chen, Barker, IDEF). An example of an E-R diagram using Elmasri/Navathe notation is below:

Figure 1-1: Example E-R Diagram (Elmasri/Navathe)
Image: http://cisnet.baruch.cuny.edu/holowczak/classes/9440/entityrelationship/

Data models (such as E-R diagrams) house linguistic representations of concepts tied together through associations. These associations can also be thought of as ontologies. There are many types of data modeling notations available in the world today. Two main types are focused on in this document: 3rd normal form and Star Schema. For reference purposes, simple definitions of both styles are included below.

3rd Normal Form is defined as follows:

The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd [1] in 1971. Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
• The relation R (table) is in second normal form (2NF)
• Every non-prime attribute of R is non-transitively dependent (i.e., directly dependent) on every key of R.


A non-prime attribute of R is an attribute that does not belong to any candidate key of R. [2] A transitive dependency is a functional dependency in which X → Z (X determines Z) indirectly, by virtue of X → Y and Y → Z (where it is not the case that Y → X). [3]

A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:
• X contains A (that is, X → A is a trivial functional dependency), or
• X is a superkey, or
• A is a prime attribute (i.e., A is contained within a candidate key) [4]

Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent Boyce-Codd normal form (BCNF). BCNF simply eliminates the third alternative ("A is a prime attribute").
http://en.wikipedia.org/wiki/Third_normal_form
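To make the transitive dependency concrete, here is a hypothetical example (mine, not the book's): in the first table below, supplier_city depends on supplier_id, which in turn depends on the key product_id, so the table is not in 3NF; splitting it removes the indirect dependency.

```sql
-- NOT in 3NF: product_id -> supplier_id -> supplier_city,
-- so supplier_city depends on the key only transitively.
CREATE TABLE product_flat (
    product_id    INTEGER     NOT NULL PRIMARY KEY,
    product_name  VARCHAR(50) NOT NULL,
    supplier_id   INTEGER     NOT NULL,
    supplier_city VARCHAR(50) NOT NULL
);

-- In 3NF: every non-prime attribute depends directly on its table's key.
CREATE TABLE supplier (
    supplier_id   INTEGER     NOT NULL PRIMARY KEY,
    supplier_city VARCHAR(50) NOT NULL
);

CREATE TABLE product (
    product_id   INTEGER     NOT NULL PRIMARY KEY,
    product_name VARCHAR(50) NOT NULL,
    supplier_id  INTEGER     NOT NULL REFERENCES supplier (supplier_id)
);
```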

Star schema modeling is defined as follows:

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.
http://en.wikipedia.org/wiki/Star_schema

The Star Schema modeling approach is "called" a star because the model appears to look similar to a star-like shape. Star Schema modeling is championed by Dr. Ralph Kimball.


1.3 Data Modeling Notations Used in This Text

Crow's Foot notation is utilized throughout this text to represent raw data models; in addition to the crow's-foot notation, this text introduces arrows to represent data migration paths (vectors/direction of data flow). It is occasionally easier to describe the vector notation to business users than the crow's-foot notation.

"The "Crow's Foot" notation represents relationships with connecting lines between entities, and pairs of symbols at the ends of those lines to represent the cardinality of the relationship. Crow's Foot notation is used in Barker's Notation and in methodologies such as SSADM and Information Engineering." http://en.wikipedia.org/wiki/Entity-relationship_model

Figure 1-2: Crows Foot and Arrow Notation Example

Arrow notation is less descriptive (mathematically) and shows only the direction or flow of the parent table primary key (e.g., Artist) into the child table (e.g., Song).

1.4 Data Models as Ontologies

Data models function as ontologies: they seek to organize a hierarchy of information into a classification system. Ontologies are extremely powerful notions that can capture augmented or enhanced metadata (information about the data model) that is not represented by the model itself.

"In both computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. Ontologies are used in artificial intelligence, the Semantic Web, software engineering, biomedical informatics, library science, and information architecture as a form of knowledge representation about the world or some part of it."
http://en.wikipedia.org/wiki/Ontology_(computer_science)


Ontologies are one way to represent terms beyond data modeling; much of the Data Vault model is based on ontology concepts. When the Data Vault model form is combined with the functions of data mining and structure mining, new relationships can be discovered, created, and dropped over time. The ontology can be morphed or dynamically altered into new relationships going forward. There is more discussion of this topic in different sections of this book around the flexibility of the Data Vault model.

Figure 1-3: Small Example: Ontology for Vehicle

The ontology in Figure 1-3 is extremely simple and small. It represents the notion of the parent term Vehicle, which contains the sub-classes Car and Truck. Car and Truck are both types of vehicles; however, each has potentially different descriptors. Trucks generally contain larger frames, larger motors, larger wheels, and are capable of towing and hauling heavy loads, where cars generally have a smaller turning radius, use less gas, and can house more people.

Ontologies are powerful categorization and organization techniques. Imagine a set of music on a mobile computing device. Now imagine that there are many different categorizations for that music, ranging from year, to composer, to album, to band, to artist, to lead vocalist, etc. Now stack these categorizations in different orders or hierarchies – they function as indexes into the data set. At the end of the index are the same music files; they are simply categorized differently. This is the basic makeup of ontologies.

However, this description can go deeper; switch the existing categories out for business terms, and begin to describe each category. For instance: Genre. Different people might define "what is classified as rock and roll" differently, but they are both right. Categorization is in the eye of the beholder, and is based on the individual's belief system and knowledge set (or context) surrounding the information at the bottom of the stack – which in this case is the music files.


The deeper into the ontology (or index), the more specialized and differentiated the definition becomes. For example: underneath Rock there might be 70's, 80's, and 90's rock, or there might be classic rock and roll – where an individual who grew up in the 60's considers 60's and 70's music to be part of classic rock, while an individual who grew up in the 90's or later considers any music earlier than 1985 to be classic rock. This is just one of the issues that the Data Vault Model and implementation methodology provides a solution to.

This book will uncover the key to modeling ontologies in enterprise data warehouses for use with Business Intelligence systems. In fact, warehousing, applying, and using ontologies is a critical success factor for handling, managing, and applying unstructured data to a structured data warehouse. It is also a major component of operational data warehousing, along with business rule definition and dissemination of the data within an Enterprise Data Vault.

These are general descriptions of ontologies as used throughout this book. In addition to ontologies, data models typically contain short-hand notations for names of fields, known as abbreviations. These abbreviations can have similar meaning within the same context (i.e., industry vertical) but may have different meanings across different contexts. For example, the abbreviation CONT in health care may mean contagious; in a legal system it may mean continuation. Abbreviations are best separated by vertical industry.

1.5 Data Model Naming Conventions and Abbreviations

Physical data models often contain abbreviations for classifying tables and fields, as many RDBMS engines impose length limits on object names. The desire is to carry metadata meaning within the abbreviations, which results in the creation of a data dictionary. The naming conventions usually read from the left-hand side of the object name to the right in a logical flow, with the different parts of the abbreviation separated by underscores. The typical abbreviation is made up of multiple components, as shown in Figure 1-4:


Figure 1-4: Example Abbreviations and Naming Conventions

NOTE: These naming conventions are for the physical data model. Some of the examples on the ensuing pages are physical, and use this approach, while other examples are logical and display business terms. Please take note of the differences between the logical and physical data models.

Vowels may be kept in order to increase readability; however, in general, vowels are removed from the abbreviations to shorten reference names. There are some notations which do not include underscores; rather, they utilize text case to indicate the start of new terms (e.g., camel case used with SQL Server). The Data Vault physical modeling methodology recommends underscores as the best practice for abbreviations. The Data Vault logical modeling components recommend utilizing full business names, which are demonstrated in this book.

NOTE: Naming conventions are one form of ontology and metadata that can be applied actively within the confines of the Business Intelligence world.

For the ontology listed in Figure 1-3, the abbreviations might be as follows:

• Vehicle = VEH
• Car = CAR
• Truck = TRK
• 2 Wheel Drive = TWOWHDRV, TWDRV
• 4 Wheel Drive = FOURWHDRV or AWD, FORWDRV

The suggested table naming conventions for the Data Vault are as follows:

• HUB, or H – for Hub tables
• LNK, or L – for standard Link tables
• HLNK, or HL – for hierarchical Links
• LSA, SAL, SLNK, SL – for same-as Links
• TLNK, TL – for transactional Links
• SAT, or S – for generic Satellites
• HSAT – for Hub Satellites
• LSAT – for Link Satellites
• PIT, or P – for point-in-time tables
• BR, or B – for Bridge tables
• REF, or R – for reference tables

Within each of the Data Vault tables there are standardized fields (more on this later). The naming convention for these fields is as follows:

• LDTS, LDT – for load date time stamps
• LEDTS, LEDT – for load end date time stamps
• SQN – for sequence numbers
• REC_SRC, RSRC – for record sources
• LSD, LSDT – for last seen dates
• SSQN – for sub-sequencing identifiers

Always document the naming convention and the abbreviations chosen, through a data dictionary, in order to convey meaning to the business and the IT team. Naming conventions are vital to the success and measurement of the project: they allow management, identification, and monitoring of the entire system – no matter how large it grows. Once the naming convention is chosen, it must be adhered to at all times. One way to ensure this is to conduct frequent data model reviews and require non-conforming objects to be renamed.
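One way to automate part of that review is a sketch like the following, which assumes an ANSI INFORMATION_SCHEMA is available (as in SQL Server, MySQL, or PostgreSQL; Oracle exposes ALL_TABLES instead) and that the schema name and accepted prefixes below are adapted to your environment:

```sql
-- Flag tables whose names do not start with an approved Data Vault prefix.
SELECT table_name
FROM   information_schema.tables
WHERE  table_schema = 'edw'                       -- assumed schema name
  AND  table_name NOT LIKE 'HUB\_%'  ESCAPE '\'
  AND  table_name NOT LIKE 'LNK\_%'  ESCAPE '\'
  AND  table_name NOT LIKE 'HLNK\_%' ESCAPE '\'
  AND  table_name NOT LIKE 'SAT\_%'  ESCAPE '\'
  AND  table_name NOT LIKE 'PIT\_%'  ESCAPE '\'
  AND  table_name NOT LIKE 'BR\_%'   ESCAPE '\'
  AND  table_name NOT LIKE 'REF\_%'  ESCAPE '\';
```

Identifier case handling varies by platform, so match the LIKE patterns to how your RDBMS stores object names.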

1.6 Introduction to Hubs, Links, and Satellites

The Data Vault model consists of three basic entity types: Hubs, Links, and Satellites (see Figure 1-5). The Hubs are comprised of unique lists of business keys. The Links are comprised of unique lists of associations (commonly referred to as transactions, or intersections of two or more business keys). The Satellites are comprised of descriptive data about the business key or about the association. The flexibility of the Data Vault model is based in the normalization (or separation) of data fields into corresponding tables.


Figure 1-5: Example Data Vault

Data Vault models are representative of business processes and are tied to the business through the business keys. Business keys indicate how the businesses integrate, connect, and access information in their systems. Data Vault models are built based on the conceptual understanding of the business. Concepts such as customer, product, order, email, sale, inventory, part, service, account, and portfolio are used to represent ideas that cross lines of business. Examples of lines of business may include: sales, finance, marketing, contracting, manufacturing, planning, production, and delivery. These concepts can be represented with business keys that should cross lines of business. The Hubs carry the unique list of these keys and are defined by semantic grain (granularity) and business utilization.

The Links represent associations across the keys (in Figure 1-5 they show an Order actually being invoiced to a Customer for a specific Product). The associations change over time; some have direction (akin to mathematical vectors), others are directionless. Links are physical representations of foreign keys, or in data modeling terms: an associative entity.
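Ahead of the formal structure chapters (4 through 6), here is a minimal DDL sketch of the three entity types using the physical naming conventions from Section 1.5; the specific names and data types are illustrative assumptions, not definitions from the book:

```sql
-- Hub: unique list of customer business keys.
CREATE TABLE HUB_CUST (
    CUST_SQN INTEGER     NOT NULL PRIMARY KEY,  -- sequence number
    CUST_NUM VARCHAR(20) NOT NULL UNIQUE,       -- business key
    LDTS     TIMESTAMP   NOT NULL,              -- load date time stamp
    REC_SRC  VARCHAR(20) NOT NULL               -- record source
);

-- Hub: unique list of order business keys.
CREATE TABLE HUB_ORDR (
    ORDR_SQN INTEGER     NOT NULL PRIMARY KEY,
    ORDR_NUM VARCHAR(20) NOT NULL UNIQUE,
    LDTS     TIMESTAMP   NOT NULL,
    REC_SRC  VARCHAR(20) NOT NULL
);

-- Link: unique list of associations between the two Hubs.
CREATE TABLE LNK_CUST_ORDR (
    LNK_SQN  INTEGER     NOT NULL PRIMARY KEY,
    CUST_SQN INTEGER     NOT NULL REFERENCES HUB_CUST (CUST_SQN),
    ORDR_SQN INTEGER     NOT NULL REFERENCES HUB_ORDR (ORDR_SQN),
    LDTS     TIMESTAMP   NOT NULL,
    REC_SRC  VARCHAR(20) NOT NULL,
    UNIQUE (CUST_SQN, ORDR_SQN)                 -- one row per association
);

-- Satellite: descriptive context for the customer key over time.
CREATE TABLE SAT_CUST_NAME (
    CUST_SQN  INTEGER     NOT NULL REFERENCES HUB_CUST (CUST_SQN),
    LDTS      TIMESTAMP   NOT NULL,             -- load date time stamp
    LEDTS     TIMESTAMP,                        -- load end date time stamp
    CUST_NAME VARCHAR(100),
    REC_SRC   VARCHAR(20) NOT NULL,
    PRIMARY KEY (CUST_SQN, LDTS)
);
```

Notice that descriptive attributes live only in the Satellite, keyed by the Hub sequence plus the load date so that history accumulates row by row.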


Hubs and Links do not contain context. Satellites provide the context defining the keys and associations for a specific point in time. The Satellites contain the descriptive data attributes about a Hub or Link that can change over time. Satellites are the data warehousing portion of the Data Vault model.

You see, Hubs and Links are like the skeleton and ligaments of the human body – without them we have no structure. Without them, our Data Warehouses are blobs of data loosely coupled with each other. But WITH them we have definition, structure, height, depth, and specific features. We as humans couldn't survive without a skeleton. The Data Warehouse cannot survive without Hubs and Links. They form the foundations of how we hook the data together.

For instance, the Hubs are like the different bones in the body: the arm, the leg, the toes, the head, etc. The Links are similar to the ligaments that hold the bones together, give them flexibility, and attach them in a specific order. Finally, the Satellites are added. Satellites are like the skin, muscle, and organs. They add color, hair, eyes, and all the other components we need to be described.

By separating the concepts of descriptive data from structural data, and structural data from linkage data, we can easily begin to assemble a picture or an image of what our companies look like. The Hubs provide the working constructs to which everything is anchored. The Links provide the assembly structure for how these anchors interact, and the Satellites define the context (like hair color, eye color, skin type, etc.) of all of these components.

Remember this: the Data Vault is targeted to be an Enterprise Data Warehouse. Its job is to integrate disparate data from many different sources, and to Link it all together while maintaining source system context. It sits between the source systems and the data marts. Why? Because the data marts are the interpretation layer of the integrated data warehouse data. In human terms, think about it this way: think about a certain event that occurred in your life that you shared with another person. Do you both remember it the same way? Do you both remember the exact details? Or is your interpretation of the event slightly different than that of your friend? Exactly my point: interpretation depends on context, and the context you use to remember the event is different than the context your friend uses. Even if the facts or the event itself is exactly the same (you were both there, you both saw the eclipse of the sun), you each experienced it differently. This is why it's so important to separate interpretation from the facts.


Let your Data Vault house the facts, and build your data marts to house the interpretation.

1.7 Flexibility of the Data Vault Model

The Data Vault model is built for extreme flexibility and extreme scalability. The Link table separates the relationships from the business key structures (the Hubs) and provides for the representation of the relationship to change over time. The Satellites provide the descriptive characteristics of the Hubs or Links as they change over time.

For instance, suppose you own a car and you are the registered driver. You currently have two relationships to the car: one as a driver, and one as an owner. Now suppose you hired a driver. Well, you still own the car, right? Now you have one relationship with the car as the owner, but the person you hired has a relationship with the car as the driver. However, the description of the car has not changed. What if you sold the car to someone else? Then your relationship with the car as an owner would END, and the buyer's relationship with the car would begin. This information about the relationship between business keys is what we keep in the Link structures. Again, the basic description of the car remains unchanged, so the Satellite data is untouched.

The Link table may also be applied to information association discovery. Business changes frequently – redefining relationships and the cardinality of relationships. The Data Vault model approach responds favorably because the designer can quickly change the Link tables with little to no impact to the surrounding data model and load routines.

MAJOR FUNDAMENTAL TENET: The Data Vault model is flexible in its core design. If the design or the architecture is compromised (the standards/rules are broken), then the model becomes inflexible and brittle. By breaking the standards/rules and changing the architecture, re-engineering becomes necessary in order to handle business changes. Once this happens, total cost of ownership over the lifecycle of the data warehouse rises, complexity rises, and the entire value proposition of applying the Data Vault concepts breaks down.

For example, suppose a data warehouse is constructed to house parts, and after three months in operation the business would like to track suppliers. The Data Vault can quickly be adapted by adding a Supplier Hub and Supplier Satellites, followed by a Link table between parts and suppliers. The impact to existing loading routines and the history already held within is minimal (if any).
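As a sketch of that extension, under the same illustrative naming assumptions as the earlier examples: nothing about the existing part structures is altered – the Supplier Hub and the Part-Supplier Link are purely additive (Supplier Satellites would follow the same pattern as any other Satellite).

```sql
-- Already existing in the warehouse (shown only so the example runs).
CREATE TABLE HUB_PART (
    PART_SQN INTEGER     NOT NULL PRIMARY KEY,
    PART_NUM VARCHAR(20) NOT NULL UNIQUE,
    LDTS     TIMESTAMP   NOT NULL,
    REC_SRC  VARCHAR(20) NOT NULL
);

-- New Hub for the supplier business key.
CREATE TABLE HUB_SUPP (
    SUPP_SQN INTEGER     NOT NULL PRIMARY KEY,
    SUPP_NUM VARCHAR(20) NOT NULL UNIQUE,
    LDTS     TIMESTAMP   NOT NULL,
    REC_SRC  VARCHAR(20) NOT NULL
);

-- New Link between parts and suppliers; many-to-many by design,
-- so a later change in business cardinality needs no remodeling.
CREATE TABLE LNK_PART_SUPP (
    LNK_SQN  INTEGER     NOT NULL PRIMARY KEY,
    PART_SQN INTEGER     NOT NULL REFERENCES HUB_PART (PART_SQN),
    SUPP_SQN INTEGER     NOT NULL REFERENCES HUB_SUPP (SUPP_SQN),
    LDTS     TIMESTAMP   NOT NULL,
    REC_SRC  VARCHAR(20) NOT NULL,
    UNIQUE (PART_SQN, SUPP_SQN)
);
```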


Figure 1-6: Flexibility of Adapting to Change

In Figure 1-6 we see the existing data warehouse in purple and the new sections in yellow and orange (to the right of the red dotted line). Placing the associations in a Link table enables the data warehouse design to be flexible in this manner – new components do not affect existing components.

One difference between the Data Vault model and a 3rd normal form model is the use of Links to represent associations across concepts. 3rd normal form represents most relationships by tying parent keys to child tables directly (without an extrapolated association table). Relationships such as 1 to Many, Many to 1, and 1 to 1 are represented in 3rd normal form directly by embedding the parent fields in the child tables. This leads to inflexibility of the model. When the business rules change and the cardinality of the data must change to meet business needs, the model is altered, as is the operational application using the model. An example of a 3rd normal form model is shown below in Figure 1-7.

Figure 1-7: 3rd Normal Form Product and Supplier Example

In Figure 1-7, a Product can have 1 and only 1 Supplier, but a Supplier can supply many Products. With a model like this, the business rule may be: “a Product can only be supplied by a single Supplier”, which means that the operational system that collects the information is coded accordingly. When or if the business changes its rule to say: “a Product can be supplied by more than one Supplier”, then the application must change, as must the underlying data model structure. While this appears to be a small change, it may affect all kinds of underlying information in the operational system, especially if the Product is a PARENT to other tables.

For data warehouses (except Data Vaults) this structure leads to even more complexity. In a data warehouse that contains foreign keys embedded in child tables, this leads to cascading change impacts. In other words, any changes made to parent keys will cascade all the way down into every single child table. The end result?

• You have to rebuild ALL your ETL loading routines
• You have to rebuild ALL your queries against the dependent structures
• You have to re-model ALL your parent and child tables

The end result is massive re-engineering efforts, and that's not all! The problem gets exponentially harder to handle with larger and larger data warehouse models.

This is the #1 reason why Data Warehouse/BI projects are “torn down, stopped, halted, burned, and ripped apart” or labeled failures: the growing and already high cost of re-engineering, caused by poor architectural design and the dependencies built in to your data warehouse model. Don't let this happen to you! Use a Data Vault and avoid this mess up-front.

This is evident when loading history to a data warehouse where the cardinality that exists in today's model did not exist in the past. Data warehousing teams run a high risk of re-engineering the architecture and loading process if the enterprise data warehouse model enforces current relationships. The Data Vault model mitigates this risk by providing Link tables for all relationships, regardless of cardinality.

1.8 Data Vault Basis of Commutative Properties and Set Based Math

The Data Vault is based on raw data sets arriving at the warehouse (with little to no alteration of the data within). This is commonly referred to as “the raw data warehouse.” There is a notion for constructing a business based Data Vault that will be discussed later in this book. One of the founding principles of the Data Vault is: enable re-creation of a source system's data for a specific point in time. The Data Vault achieves this by loading raw data, passively integrating it by business key, and time-stamping the load cycles with the arrival dates of the data set. There is a law in mathematics called the commutative property. The commutative property is defined below:

In mathematics, commutativity is the ability to change the order of something without changing the end result. It is a fundamental property in most branches of mathematics and many proofs depend on it. The commutativity of simple operations was for many years implicitly assumed and the property was not given a name or attributed until the 19th century when mathematicians began to formalize the theory of mathematics. http://en.wikipedia.org/wiki/Commutative

The basic notion is that A = B = C at a specific point in time, where A = a source system/source application, B = staging area, and C = enterprise Data Vault; such that A can be reconstituted for any point in time contained within C. This preserves the auditability of the data set housed within the Data Vault – while offering base level integration across lines of business (see the previous discussion on Hub based business keys).
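A minimal sketch of such a reconstitution query, assuming one Hub and one Satellite with illustrative names: for a chosen point in time, each business key is joined to its most recent Satellite row loaded at or before that point.

    -- Rebuild the source image of customer data as of :AS_OF_DTS (illustrative names)
    SELECT h.CUSTOMER_NUM, s.*
    FROM   HUB_CUSTOMER h
    JOIN   SAT_CUSTOMER s
           ON s.SQN_CUSTOMER = h.SQN_CUSTOMER
    WHERE  s.LOAD_DTS = (SELECT MAX(s2.LOAD_DTS)
                         FROM   SAT_CUSTOMER s2
                         WHERE  s2.SQN_CUSTOMER = s.SQN_CUSTOMER
                         AND    s2.LOAD_DTS    <= :AS_OF_DTS);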

In some cases B can represent the Data Vault while C represents “as-is raw level star schemas.” Raw level star schemas are utilized to show the business what the source systems are collecting, and where the gaps may be between the business rules, business operations, and source system applications. Information quality (IQ) can be improved through resolution of the identified gaps. To find out more about gap analysis, please read the book: The Next Business Supermodel, The Business of Data Vault Modeling.

Another founding principle behind the Data Vault architecture is the use of set logic or set based math. The Hubs and Links are loaded based on union sets of information, while the Satellites are loaded based on delta changes inclusive of the union functionality. Set logic is applied to the loading processes for restartability, scalability, and partitioning of the components. Standard set theory is defined as follows:

Set theory, formalized using first-order logic, is the most common foundational system for mathematics. The language of set theory is used in the definitions of nearly all mathematical objects, such as functions, and concepts of set theory are integrated throughout the mathematics curriculum. Elementary facts about sets and set membership can be introduced in primary school, along with Venn diagrams, to study collections of commonplace physical objects. Elementary operations such as set union and intersection can be studied in this context. More advanced concepts such as cardinality are a standard part of the undergraduate mathematics curriculum. http://en.wikipedia.org/wiki/Set_theory

In the Data Vault approach, set theory is applied to incoming data sets. The set theory applied in loading routines is depicted in Figure 1-8:

Figure 1-8: Applied Set Theory for the Data Vault

The set theory is applied again for Hub and Link loading, where only new data (not previously inserted) is applied or loaded. Set-based logic is applied when single distinct lists of keys are loaded to the target table where they haven't yet been loaded.
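In SQL terms, the Hub load reduces to one set-based insert of the distinct keys that do not yet exist in the target – a sketch with illustrative names (the Link load follows the same pattern against the pair of Hub sequences):

    -- Insert only the new business keys (distinct set, minus what already exists)
    INSERT INTO HUB_CUSTOMER (SQN_CUSTOMER, CUSTOMER_NUM, LOAD_DTS, RECORD_SRC)
    SELECT SEQ_HUB_CUSTOMER.NEXTVAL,
           stg.CUSTOMER_NUM,
           stg.LOAD_DTS,
           'SALES'
    FROM  (SELECT DISTINCT CUSTOMER_NUM, LOAD_DTS
           FROM   STG_SALES) stg
    WHERE  NOT EXISTS (SELECT 1
                       FROM   HUB_CUSTOMER h
                       WHERE  h.CUSTOMER_NUM = stg.CUSTOMER_NUM);

Because the statement is insert-only and driven purely by set membership, it is restartable: running it twice against the same staging set produces no additional rows.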

1.9 Data Vault and Parallel Processing Mathematics

The main purpose of introducing set-theory concepts and the mathematics behind the Data Vault is to provide you with a glimpse of the actual engineering effort behind the Data Vault architecture itself. The architecture is not merely “just another design” of tables strung together – no, it is engineered with specific tolerance levels so that it can scale, and be flexible as necessary. These concepts are foundational to understanding why the architecture is designed the way it is, and what the specific purposes of the design elements are. I hope you find this section enlightening, as it explains some of the background reasons as to why you should stick with the original structures (unmodified) as you implement your Data Vault.

The Data Vault Modeling components are based on parallel processing mathematics (versus serial processing). Massively Parallel Systems typically use a shared-nothing design. The Data Vault Modeling components make use of this design technique to split data in a vertical format: aka vertical partitioning. The vertical partitioning of data is applied to Hub, Link and Satellite structures and is a base part of the architecture. The objective of vertical partitioning within the Data Vault Model is to split the work, so that the database engines can optimize the following:

• Index Coverage
• Data Redundancy (minimize this)
• Parallel Query
• Resource Utilization (split over hardware platforms)

If you are not familiar with the mathematical principles of Massively Parallel Processing (MPP) you can read about generic parallel processing rules and performance speed-up (tasking) on Wikipedia. Figure 1-9 is a graph drawn from Wikipedia that introduces the concepts of parallel processing:

Figure 1-9: Parallel Computing Simplified
Image: http://en.wikipedia.org/wiki/Parallel_computing

The principles at work express themselves in the form of design and processing. The mathematics behind the Data Vault Model can be found by reading about parallel processing; specifically: parallel data processing, parallel task processing, and MPP systems design and architecture. The topology of the computing cluster (database engine) can be any of the desired pieces including: star, ring, tree, hyper-cube, fat hyper-cube, or n-dimensional mesh. The Data Vault splits out the business keys, the relationships (associations), and the descriptive data (repetitive).

Business keys by nature are generally non-repetitive, therefore increasing index coverage dramatically. Business keys by nature are also specific to a set or tuple of data as an identifying marker. The Hubs therefore act as independent sources of information, making it easy (for instance) to split different Hubs across different computing platforms – in other words, applying vertical partitioning at the hardware level. This is the nature of MPP and is known as “scale-out.” Scale-out technology allows the model to grow as large as the data set or the business demands while keeping near linear performance gains (in relation to the scale of the hardware). The Links or associations also follow across multiple platforms, and are sometimes replicated in shared-nothing environments for ease of joins. One term for this is: “join-indexes”.

This is just one way to view the Data Vault Model; it is essentially based on the principles of a scale-free tree, all the way down to the individual table structures built within the model. Multiple scale-free trees are nothing more than more Hubs, Links, and Satellites within the Data Vault, thus producing a “cube-like” structure if desired. An example of a logical design or conceptual view of the Data Vault as a hyper-cube might look something like Figure 1-10:

Figure 1-10: Logical Data Vault Hyper Cube
Image: http://clanbase.ggl.com/img.php?url=fc07.deviantart.com/fs14/i/2007/076/a/b/Hypercube__by_Meninx.jpg

In the physical model, Hubs are connected to Link structures; Links become a physical notion for an association. In the physical Data Vault Model, nodes are connected to each other through Links; they are not directly related. This is a conceptual basis for establishing the premise of the vision. Hubs provide the keys, while Satellites around Hubs “describe” the key for any given point in time. Hyper cubes can be created, as can trees. A simpler view of the Data Vault Model split for parallelism is in Figure 1-11:

Figure 1-11: Physical Data Vault Layout (Starting Point)

This is where it starts, quite simply – no real partitioning of the data, because the size of the data set is not yet large enough. All of the tables go through one, two, or three I/O connections to a SAN or a NAS drive. When the data set grows, physical partitioning (or split-off of tables) can occur. The end result (taken to an extreme) might be as shown in Figure 1-12:

Figure 1-12: Physical Data Vault Layout (Partitioned)

In this case, each table has I/O channels bound to it, along with dedicated disk (DASD) sitting on RAID 0+1 format. This is the ultimate in separation for relational database engines. It allows the relational database engines to operate their “parallel query engines” without disk wait state dependencies across each table. Truly independent hardware levels can achieve very high performance. The next step might be separating the processing power out into different nodes, reaching the MPP level of architecture in hardware.

The Data Vault Model follows a scale-free topology. Scale-free topology is defined as follows:

A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as P(k) ~ k^(–γ), where γ is a constant whose value is typically in the range 2 < γ < 3, although occasionally it may lie outside these bounds. Scale-free networks are noteworthy because many empirically observed networks appear to be scale-free, including protein networks, citation networks, and some social networks.[1]

Source: http://en.wikipedia.org/wiki/Scale-free_network

The mathematics behind scale-free networks applies to the Data Vault Model. Any physical model built from the Data Vault principles will carry the same mathematical properties. Using a spring-graph, or a weighted graph in either 2 dimensions or 3 dimensions, it becomes apparent which of the nodes are the “most important” in the data model. The most interconnected nodes are centralized within the graph, and have the most neighbors.

1.10 Introduction to Complexity and the Data Vault

This topic is really deserving of an entire chapter, perhaps even an entire book. However, in the interest of time, and because this concept must be brought to light, I will make a small introduction to it here. The Data Vault model and methodology make a tremendous contribution to lowering the overall complexity of the systems involved in data warehousing. Current data warehousing systems try to do too much in their loading cycle. They try to address ALL of the following problems in a single load pattern:

• Sourcing Problems:
  o Synchronization / Source Data Availability time windows
  o Cross-System Joins
  o Cross-System Filters
  o Cross-System aggregates
  o Indexing issues, leading to performance problems
  o Disjoint or missing source data sets
  o Missing source keys
  o Bad source data, out of range source data
  o Source system password issues
  o Source system Availability for loading windows
  o Source system CPU, RAM, and Disk Load
  o Source System structure complexity
  o Source system I/O performance
  o Source System transactional record locks
• Transformation problems – often IN STREAM:
  o Cleansing
  o Quality and Alignment
  o Joins
  o Consolidation
  o Aggregation
  o Filtering
  o Sequence Assignment – often leading to lack of parallelism
  o Data type correction
  o Error handling (when the database kicks it back)
  o Error handling (data is: out of bounds, out of range)
  o Size of Memory
  o Lookup issues (more sourcing problems, caching problems, memory problems)
  o Sorting issues (large caches, disk overflows, huge keys)
  o BUSINESS RULES, especially across SPLIT data streams
  o Multiple targets
  o Multiple target errors
  o Multiple sources
  o Single transformation bottleneck (performance, relationships – joins, and so on)

• Target Problems:
  o Lack of database tuning
  o Index updates (deadlocking)
  o Update, Insert, and Delete mixed statements – forcing data ORDER to be specific, cutting off possibilities for executing in parallel
  o Block size issues
  o Multi-target issues (too many connections, error handling in one stream holding up all other targets in the same data stream)
  o WIDE targets (due to business rules being IN-STREAM)
  o Indexes ON TARGETS (because targets ARE the data marts)
  o Lack of control over target partitioning

Along with many more issues. This is the traditional view of the issues that data integration specialists are left to solve. You are expected to construct load after load that answers ALL of these problems in a SINGLE data stream, right? Well, this is no way to do business. It increases complexity to an unimaginable level, and that contributes to the ultimate downfall of the data warehousing project! (See “Quality Software Management” Vol. 1, Gerald M. Weinberg, pp. 135-139.) When you develop your ETL for a star schema EDW, you essentially get a sequential set of (big T) transformations. As that sequence grows in size and complexity, the difficulty of testing it, and of tracing errors back to the source, grows exponentially. Hence, as your (star-schema) EDW grows, you get haunted by ever-growing development cycles and increasingly less control over the testing process, until your EDW has developed into yet another legacy system. And then you know what its fate will be…

The Data Vault motto: DIVIDE AND CONQUER! Believe me, you can win every time with this strategy. Let's analyze this for a minute.

Sourcing problems: Nearly every problem can be addressed through a few simple rules:

• Separate each source load, and land the data in the target – make each load a very simple copy operation where the data is pulled from the source and landed directly in the target – in this case, a STAGING AREA (as defined in the next chapter). Yes, you may source a CDC (Change Data Capture) operation if you so desire. (A minimal sketch of such a copy-only load follows this list.)
• Run the staging loads when the data is ready! Don't wait for other systems, don't perform any other-system joins, and don't force the data to “conform or align” with specific rules or datatypes.
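Here is a minimal sketch of such a copy-only staging load, assuming the illustrative STG_SALES table and the CONTROL_DATE lookup described in section 3.3:

    -- Pure copy: no joins, no filters, no business rules, no datatype conformance
    INSERT INTO STG_SALES (SQN_STG, LOAD_DTS, RECORD_SRC,
                           CUSTOMER_NUM, ORDER_NUM, ORDER_TOTAL)
    SELECT SEQ_STG_SALES.NEXTVAL,
           (SELECT LOAD_DTS FROM CONTROL_DATE),  -- one load date per batch cycle
           'SALES',                              -- hard-coded record source
           src.CUSTOMER_NUM, src.ORDER_NUM, src.ORDER_TOTAL
    FROM   SALES_SOURCE_TABLE src;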

These two simple rules ensure that you can get in, get the data from the source, and get out – when the source data is ready to go. No waiting! No joining! No timing complexity! No performance problems! You can always take a copy operation and partition the target, and partition the load, for MAXIMUM throughput!

Transformation problems: Divide and conquer. The following rules make it much easier to deal with this part of the loading cycle:

• Move the business rules downstream. This includes all the joins, filters, aggregations, quality, cleansing, and alignments that need to happen – between the Data Vault and the data marts. This also allows you to effectively target the PROPER data mart with the PROPER rule set (as deemed appropriate by the business).
• Load raw data in to the Data Vault area; this provides SIMPLE, maintainable, and easy-to-use loading code that meets the needs of the business. It also prevents you from having to “re-engineer” loading routines to add new systems or new data. Sure, you end up with a lot more routines, BUT each one is a thousand times less complex, and easier to manage.

The end result for this? You can PARALLELIZE the loading routines to your data warehouse, AND you can load data to your Data Vault in REAL-TIME at the SAME TIME as your batch loads are running. Just try that with your standard star schema!

Targeting problems: They all but disappear. Why? Because once you divide and conquer, your loading routines will be built for inserts only – high speed inserts at that – and they generally will contain only one or two target tables for loading purposes! No more locking problems, no more worries about wide rows (except when you get to loading data marts; that's another story). High degrees of parallelism, high degrees of partitioning, high performance, and really low complexity scores – what more could you ask for?

1.11 Loading Processes: Batch Versus Real Time

This book introduces these concepts with a small bit of background; it is meant to be only an introduction to the loading patterns and processes used within the Data Vault. The purpose of this entry is to define the basic terms of batch loading and real-time loading.

Batch loading: usually occurring on a scheduled basis, loading any number of rows in a “batch”. The execution timing will vary from every 5 minutes to every 24 hours, to weekly, to monthly, and so on. Any load cycle running every 20 seconds or less tends to fall close to the real-time loading category. All other scheduled cycles tend to be labeled “mini-batches.”

Real-time loading: there is a grey area of definition between what a batch load is and what a real-time load is. For the purposes of this book, real-time loading is any loading cycle that runs continuously (it never ends), loading data from a web-service or queuing service (usually) whenever the transactions appear.

Neither loading paradigm has any effect on the data modeling constructs within the Data Vault. The Hub, Link, and Satellite definitions remain the same and are capable of handling extremely large batches of data and/or extremely fast (millisecond) feeds of data.

2.0 Architectural Definitions

The Data Vault approach (project/methodology) has common architectural components defined. These components are referred to throughout this book. The purpose of this book is to define the Data Vault data model structures; context for those structures is a necessary foundational component of understanding the Data Vault. The common architectural components utilized in the Data Vault approach are defined in Figure 2-1:

Figure 2-1: Enterprise BI Architectural Components

The Data Vault methodology includes each of these components. The architectural components discussed in this book (in detail) include the Staging area and the Data Vault. This section briefly introduces the other sections as part of the architecture for you to consider.

2.1 Staging Area

The staging area consists of tables in the database that house incoming data 1:1 with the source system (with some additional system driven elements). The staging area is refreshed (purged) prior to each batch load cycle; in other words, it should not ever house history of loads. This is often called a transient staging area. Staging tables house no referential integrity and no foreign keys. They house a sequence number which is reset and cycled for each table with each batch cycle. They house a load date stamp and a record source for each table. These components are described in Chapter 3.0, Common Attributes.
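A sketch of one such staging table (the source columns are illustrative; the system driven elements are the ones defined in Chapter 3.0):

    -- Staging table: 1:1 with the source structure, plus system driven elements
    CREATE TABLE STG_SALES (
        SQN_STG       NUMBER(12)    NOT NULL,  -- sequence, reset/cycled each batch
        LOAD_DTS      DATE          NOT NULL,  -- one value per batch cycle
        RECORD_SRC    VARCHAR2(20)  NOT NULL,  -- hard-coded source identifier
        CUSTOMER_NUM  VARCHAR2(30),            -- source columns follow as-is,
        ORDER_NUM     VARCHAR2(30),            -- with no referential integrity
        ORDER_TOTAL   NUMBER(18,2)             -- and no foreign keys
    );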

These tables do not carry any foreign keys, or original primary key definitions. Exceptions: when loading a de-normalized COBOL based file and executing normalization (splitting it into multiple tables), the staging tables will carry parent ID references; likewise, when loading a denormalized XML based file and executing normalization, the staging tables will carry parent ID references.

The staging area may be partitioned in any manner desired. The format is owned and maintained by the data warehousing team. The staging area tables may also contain any indexes needed (post-load) in order to provide the data warehouse/Data Vault loads with the proper performance downstream. Staging area data should be backed up at regular intervals (if the data arrives in real-time); otherwise it will be backed up at scheduled intervals.

The future need for a staging area is in question. In fact, with the operational Data Vault and 100% real-time feeds there appears to be no real need for a staging area. There are already a few Operational Data Vaults built using the principle of by-passing the staging area and loading data directly (from the real-time feeds/web-services) to the Data Vault. The only reasons for staging areas to continue to exist (as of 2010) include the following:

• Data Synchronization – with other static lookup data
• Hot-Data Backup – continuous backup in case the queuing engine (the transactional feed engine) dies
• Batch Data Delivery – reformatting and consolidation
• File Format adjustments / alignment

2.2 EDW – Data Vault

The EDW (enterprise data warehouse), or core historical data repository, consists of the Data Vault modeled tables. The EDW holds data over time at a granular level (raw data sets). The Data Vault is comprised of Hubs, Links, and Satellites (defined in section 1.6 and further defined throughout this book). The Enterprise Data Warehousing Layer is comprised of a Data Vault Model where all raw granular history is stored. Unlike many existing data warehouses today, referential integrity is complete across the model and is enforced at all times. The Data Vault model is a highly normalized architecture. Some Satellites in the Data Vault may be denormalized to a degree under specific circumstances.

The Data Vault model follows all definitions of the data warehouse (as defined by Bill Inmon) except one: the Data Vault is functionally based, not subject oriented – meaning that the business keys are horizontal in nature and provide visibility across lines of business.

The Data Vault modeling architecture has been likened to 3½ normal form. The business keys in the Hub appear to be 6th normal form, while the load date and record source are 3rd normal form. The Data Vault model should represent the lowest possible grain. The Hubs and Links in the Data Vault model provide the back-bone structure to which context (the Satellites) is applied.

2.3 Metrics Vault

A component for capturing technical metrics about the load process: loading time-lines, completion rates, amount of data moved, and growth of tables, files, and indexes. This Data Vault captures the technical metadata for the processes and the database. By capturing growth rate actuals along with run-times, insert numbers, update numbers, and row counts, projections of future storage requirements can be created and managed. This allows the business to monitor their needs, and budget 6 months to 1 year in advance for future hardware.

The Metrics Vault can also be crafted to include information about CPU utilization, RAM access, I/O throughput and I/O wait times. The additional information in the Metrics Vault begins to provide a consistent and concise view of the utilization of the system in conjunction with the growth of the data sets and the hot spots on disk. From all of these metrics, a nearly complete technical management dashboard can be presented to monitor the EDW effort.

2.4 Meta Vault

The Meta Vault contains business metadata (ontologies/taxonomies/definitions), physical data model attribute names, functions (for translation), and the technically implemented business rules that ETL/ELT follows to interpret the data. The Meta Vault allows the business to produce, maintain, and deliver metadata across the board from within their EDW/BI solution set. The Meta Vault is in fact one form of operational data warehouse. The Meta Vault contains metadata for the staging area, EDW Data Vault, Report Collections, Data Marts, and Metrics Vault areas. The metadata is defined through IT, business, and process technologies.

2.5 Report Collections

Report collections are defined as flat-wide denormalized structures, used for high-speed reporting or flat file output access; they may also be used by data mining tools. They are a form of data mart where end-user access is direct. Report collections provide the business users with pre-computed totals at the end of each row. These pre-computed totals allow high speed filtering against patterns of rows that are “out of the normal zone” (in other words, breaking business requirements).

2.6 Data Marts

Data marts are defined as: any point at which generic users directly access the structures and the data for ad-hoc reporting or drill-down analysis. This may or may not be a star schema. It may also include normalized and denormalized tables. Data marts may be virtualized; for example: in-RAM cubes, and dynamically altered information sets. One form of a data mart is an Excel spreadsheet that communicates directly with the Data Vault through an interactive metadata layer (possibly something like Microsoft SharePoint direct to the Data Vault back-end). Direct communication between the user, the metadata management, and the Data Vault is the beginning of an Operational Data Warehouse.

For purposes of auditability and accountability the data is separated into two physical layers: corporate marts and error marts. Corporate marts serve as the standard data marts, containing data that meets the soft business rules. Error marts serve as the landing zone for “bad data”, that is: data that does not meet the soft business rules. The definition of hard and soft business rules is covered in the book: “The Next Business Supermodel, the Business of Data Vault Modeling.”

2.7 Business Data Vault

There is a new component in the architecture (not shown in Figure 2-1) called the “Business Data Vault.” Business users and IT alike are seeing the benefits of the flexibility, scalability, and adaptability of the Data Vault model. They want the benefits, but with the business data embedded. Downstream of the raw Data Vault (between the Data Vault and the Data Marts in Figure 2-1) they are building a new store called the Business Data Vault.

The Business Data Vault (BDV) is a concept: a grouping of specific tables fashioned using Data Vault modeling concepts, but not necessarily following all the raw Data Vault modeling rules. A Business Data Vault (also known as EDW+) can be a group of tables inside the raw Data Vault (where the record source has changed), or can be a completely separate data store. Either way, the data that exists in the BDV has been altered, cleansed and changed to meet the rules of the business, and is downstream of the raw Data Vault. You may be able to dual-purpose the BDV and apply master data rules as well, thus making the BDV a starting point for a Master Data System.

The Business Data Vault contains all business data: altered, aggregated, and cleansed information. IT staff execute the business transformations once, assign more metadata (including master data definitions), and then release (through a simple copy) the data needed in the marts. The Business Data Vault is considered an extra copy of the information; however, it is paired with the business metadata and all of the transformations needed to make virtual cubes and high speed delivery possible.

The argument received from the business is that the data (post-transformation) is used on the financial reports, and as such must also be accountable and auditable. Therefore a second copy of the data (post-transformation) is necessary as another system of record. The technical argument is that the IT staff only wishes to do the transformation once, or that they have a standing order to provide “virtual marts”; which in this case translates to RAM based cubes, and views that look like dimensions and facts.
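As a sketch of what one of those “virtual mart” views might look like over a BDV (all names here are illustrative; note that the internal sequence number stays hidden, consistent with section 3.1):

    -- A view that looks like a customer dimension, computed over BDV tables
    CREATE OR REPLACE VIEW DIM_CUSTOMER AS
    SELECT h.CUSTOMER_NUM,        -- business key only; sequences are never exposed
           s.CUSTOMER_NAME,
           s.CUSTOMER_SEGMENT
    FROM   HUB_CUSTOMER h
    JOIN   SAT_BDV_CUSTOMER s
           ON s.SQN_CUSTOMER = h.SQN_CUSTOMER
    WHERE  s.LOAD_END_DTS IS NULL;  -- current rows only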

2.8 Operational Data Vault

The nature of the raw Data Vault (the EDW as depicted in Figure 2-1) is changing to include operational data. The need to combine/consolidate operational data with the raw Data Vault is being driven by master data initiatives and business needs. The business wants more historical data mixed with current transactions at their finger-tips. In order to meet this demand, data warehousing teams are loading operational data (real-time loading) directly in to the raw Data Vault, thus creating an Operational Data Vault. The full discussion of Operational Data Vaults is outside the scope of this text, and will be covered elsewhere in articles and discussion forums.

WARNING: AN ODV INHERITS ALL THE ISSUES, PROBLEMS, AND RELIABILITY CONCERNS OF AN OPERATIONAL SYSTEM. ITEMS SUCH AS GOVERNANCE, UP-TIME (6x9'S), AND 24x7x365 SUPPORT ALL COME TO BEAR WITH AN OPERATIONAL DATA VAULT. THE DECISION TO BUILD ONE SHOULD NOT BE TAKEN LIGHTLY.

What is an Operational Data Vault? The Operational Data Vault is part data warehouse, and part online transactional data store (operational data store). The Operational Data Vault stores all changes to data as inserts (as does a traditional data warehouse), however at the same time it also offers “update/edit” access to the operational applications sitting directly on top of the data warehouse.

In case you are wondering: “Has this ever been done successfully?” The answer is yes – several times already. A company called Cendant Timeshare Resource Group (Cendant TRG) rebuilt their entire operational layer in Java directly on top of the Data Vault, consolidating data warehousing directly with operational applications. There were no separate systems for reporting, no separate systems for “operational data” or OLTP applications – simply the Data Vault and the Java OLTP application. This example has been in use since 2001.

Another example is a drug manufacturing traceability warehouse built in 2008 for the US Congress. This Data Vault had operational applications driven by drug packaging machines, which assigned unique IDs to every drug package from every manufacturer around the world. These machines fed the data over remote web-services connections directly to the Data Vault every 10 minutes, where the data was encrypted, secured, and stored – only to be accessed every time the drug was scanned at different points in the supply chain, at which time the warehouse would provide different web-service access points to retrieve audit trails of all points where the drug was scanned. In this manner, you (the consumer) could log in to a web-site after purchasing a drug, type in its bar-coded number, and check its authenticity. It was called the Drug Track And Trace anti-counterfeit operation.

2.9 Dynamic Data Vault

The Dynamic Data Vault is an Operational Data Vault with dynamic adaptation of the structure. In other words, the tables, columns, indexes, and keys are all subject to change – automatically. Of course, achieving this state requires a constant, vigilant watch on the metadata, including but not limited to incoming structures. The incoming structures may include XSD, XML, staging tables, or other metadata (including queue based or process metadata) that describe the structure of the incoming data set. The dynamic nature of the Data Vault means: new attributes may be added to Satellites, and new Links and new Hubs may be formed on the fly. ETL/ELT loading code will be adjusted automatically, and BI query views will also inherit certain changes. At the end of all the automatic model changes, emails describing the changes are sent to the IT staff for review in the morning.

3.0 Common Attributes

The Data Vault structures (tables) contain standard attributes that assist with construction, tracking, and querying. The common attributes in the Data Vault are defined here and are applied throughout the Hubs, Links and Satellites. The common attributes include: sequence numbers, sub-sequence numbers (line item numbers), load dates, load end-dates, last seen dates, extract dates, record creation dates, and record sources.

Most of these fields are EDW (enterprise data warehouse) system defined, and EDW system generated/maintained; as a result, the data in these columns is “reference data” and is non-auditable, as it does not exist in the source system. However, record creation dates and line-item numbers are two cases that are auditable, particularly when they exist in the source system.

The Data Vault works on principles similar to geological layering, where data arriving in the warehouse (in a single batch) is stamped with a “geological time based layer” (a load date time stamp). The load dates enforce audit trails and record history based on the one and only controllable system date time available to the EDW loading routines. The only point at which this principle does not apply is during real-time feed processing.

Figure 3-1: Time Series Batch Loaded Data

The Data Vault assists with auditability and recoverability by stamping all participating rows in a single batch with the same load date time stamp. If the loading process fails mechanically (for any reason) it is necessary to examine all rows that were loaded during that process, resulting in removal, replacement, or augmentation of the data set. This is the only mechanism available to recreate the audit trail of the data for that date time stamp. As a side note, these mechanical problems are often not discovered until weeks or months after they have occurred.

Real-time data loads are treated differently. Real-time data loads are stamped based on message arrival time. Real-time latency is typically defined as message arrival in a data loading queue with latency of arrival being less than one minute. Real-time loading is commonly defined in terms of transactions per second. An example image of real-time data stamps is shown in Figure 3-2.

Figure 3-2: Real-Time Arrival, Data Geology

Real-time data arrival time-stamping appears similar to layers of pebbles on the beach. Data is not congruent with time intervals or time-spans. It can be grouped together for analytic purposes, but a single time constant does not represent any fixed layer of information in the enterprise data warehouse. As data loads shift to incorporate real-time data feeds (also known as trickle feeds) the line between constant time (batch loads) and continuous time (real-time loads) blurs.

3.1 Sequence Numbers

Sequence numbers are required by relational database management systems (RDBMS) in order to process joins quickly and efficiently. Without sequence numbers, joins across huge amounts of information would operate comparatively slowly (compared to character based joins). The use of sequence numbers as primary keys for Hubs and Links also eliminates any possible issues maintaining multi-part cascading keys in Satellites or nested Link tables.

Staging area sequences are stored within the staging area. These sequences should be restarted and set to cycle over for each load to a specific table. Staging sequence numbers are utilized only to identify loaded duplicates. Staging area sequences should not ever leave the staging area, and should not be moved forward into the Data Vault.

Duplicates are rows that have 100% completely the same data – from the keys, to the nulls, to the descriptive fields. When the data is 100% duplicate, there needs to be a way to delete the rows from the staging table in order to proceed with loading only one unique copy to the target Data Vault. Without a sequence number, there is no unique identifier on each row. With a sequence number it is easy to “pick” the first or last row as the candidate to leave in place and delete the rest. Before deleting the duplicates, the Metrics Vault should record a history of how many duplicates there are per staging table per business key. By counting the duplicates, auditability can be maintained if the IT staff is ever asked to reproduce the source load: the number of duplicates multiplied by the one remaining row provides an accurate picture. In other words, a Cartesian join product is applied in order to reproduce the original duplicate row set.

Hub and Link sequence numbers are created 1 for 1 with each unique business key and unique association inserted to the respective table. Satellite sequence numbers are generally parent table sequence numbers; in other words, they are inherited from the Hub or Link parent table. It is a recommended practice to set up sequence numbers as number(12). In Oracle there appears to be no byte-storage difference between a number(12) and a number(38). Most sequence numbers will fit within this length, and will not require double or floating point math to resolve at query time.
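A sketch of the duplicate removal that the staging sequence makes possible (run only after the duplicate counts have been recorded in the Metrics Vault; table and column names are illustrative):

    -- Keep the first row of each 100% duplicate set; delete the rest
    DELETE FROM STG_SALES
    WHERE  SQN_STG NOT IN (
           SELECT MIN(SQN_STG)            -- "pick" the first row per duplicate set
           FROM   STG_SALES
           GROUP BY CUSTOMER_NUM, ORDER_NUM, ORDER_TOTAL,
                    LOAD_DTS, RECORD_SRC );  -- group on every column except the sequence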

Sequence numbers in the Data Vault should never be shown to business users, and must not leave the Data Vault going forward. First, sequence numbers are meaningless numbers which are there simply to provide uniqueness to the rows they represent. Second, the numbers are there merely for JOIN purposes at high rates of speed. Third, if I ask you: “please tell me what the number 5 means to you?” Can you define it? Can you make sense of it? No. It's a meaningless NUMBER. There is no context. The sin of this is that once you expose a sequence number to the business, they will forever attach that “customer/product/employee/service” or what-ever-it-is to the number you give them – meaning that they give it context; they force it to mean something to the business! Now you (as IT) no longer have the right or the ability to “change/alter/destroy and rebuild” that number, nor are you allowed to attach different rows to that number. This will cause problems for future re-loading, re-building, or even fixing the data warehouse, regardless of the data modeling technique you choose! DON'T DO THIS – DON'T EXPOSE SEQUENCE NUMBERS TO THE BUSINESS… EVER!

3.2 Sub Sequence Numbers (Item Numbering)

Sub sequences depend on parent tables for context, and within context they have business meaning; however, as stand-alone attributes they hold no business meaning what-so-ever. In this regard sub-sequences do not work well as independent Hub keys. Sub sequences may also be defined as “ghost Hub tables” if logically modeled – but should not ever be physically implemented. For example: a line-item number “5” has no context on its own; however, it is required when discussing a particular detail item on an invoice.

Sub sequence numbers are utilized to order Link or Satellite rows. In Link tables they are part of the unique index; in Satellite tables they can be included in order to provide context, called: “multiple active Satellite rows.” Sub-sequences simply allow multiple rows to be “active” for a single master key. It is a best practice to avoid sub-sequence numbers if at all possible. When used in a Link table they can cause re-engineering of the loads in the future (if the Link structure changes).

WARNING: IF SUB-SEQUENCES APPEAR IN THE MODEL, IT MAY BE A CALL TO RESEARCH FURTHER. TAKE THE TIME TO INVESTIGATE IF A LINK AND NEW HUB TABLES NEED TO BE DEFINED. IT IS COMMON TO MISTAKE THE NEED FOR A SUB SEQUENCE WHEN THE CORRECT MODEL WILL HAVE ONE OR MORE NEW HUBS WITH A LINK IN PLACE – EXCEPT IN REAL-TIME MILLISECOND SYSTEMS.

3.3 Load Dates

Load dates are system generated, system maintained fields. This attribute is applied to the arriving data set in both real-time and batch modes. Load dates represent the date time stamp (according to the EDW machine clock) of the arriving data. Load dates are applied to data sets arriving in the staging area of the Data Vault.

Load dates for real-time data are applied based on the clock time arrival of the transactions housed in the incoming queue. Load dates for batch based data are set once per batch; they can be thought of as a date-time-stamp equivalent of a batch load process identifier. As described above (see Figure 3-1), the Data Vault relies on the notion that load dates are consistently applied per batch for tracking purposes. The load date should not be set by repetitive system calls throughout the life-cycle of a single load, nor should it be changed from one set of staging data to another. The load date time stamp is the identifier that indicates which geological layer (in time series) the data applies to.

Load dates should be looked up from a single table called CONTROL_DATE, which is housed within the staging area and contains a single column and a single row of information. The loads to the staging tables look up the load date (LOAD_DTS) and hard-code the record source. For example, if the batch window is a nightly batch that begins at 22:00 hours and completes at 06:00 hours the following morning, then the LOAD_DTS should be set to 00:00 hours of the following morning's day, across all data in the staging area.
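A sketch of the CONTROL_DATE structure and its once-per-batch maintenance (the staging loads then read this one value, as in the copy-only staging sketch shown earlier):

    -- Single row, single column; set once at the start of each batch cycle
    CREATE TABLE CONTROL_DATE ( LOAD_DTS DATE NOT NULL );

    -- Assuming this runs at the 22:00 start of the nightly window:
    -- stamp the whole batch with 00:00 of the following morning, per the example above
    UPDATE CONTROL_DATE SET LOAD_DTS = TRUNC(SYSDATE) + 1;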

Figure 3-3: Load Date Time Stamp and Record Source

By keeping a consistent load date time stamp on the information, it becomes possible to trace errors and find technical load problems (affecting the data) months after the load has occurred. It also becomes possible to remove that layer of geology and identify how far the problem data has spread. The resulting load cycles become repeatable, consistent, and restartable for any given load cycle across all time.

Figure 3-4: Example Load Date Time Stamp Data

Load dates are also utilized to perform gap analysis between arriving data in the warehouse (from all parts of the world) and the extract dates or creation dates of the data set. By analyzing the gap between the load stamp (when it arrived at the EDW machine) and the extract date, the business can quickly see if there is an unacceptable delay in data arriving at the warehouse.

For example: suppose the load date is 10-14-2000 as in Figure 3-4, but the extract date is 10-02-2000. The business in this case has a service level agreement (SLA) in place to provide the data within 5 days of extraction from the source. By storing both in the Data Vault, the business can quickly determine that their SLA is not being met. Extract dates are discussed in a later section; the bottom line is that neither extract nor creation dates on data sets should be utilized to represent the load date time stamp in the data warehouse.

Historical treatment of the load date time stamp may differ slightly. Load date time stamps for historical data will vary based on the availability of dates in the historical data. It is recommended to state an assumption that: IF the system existed at that historical point THEN the load date time stamp would have been “X”. Based on this premise, set the control date, load the history to the staging tables, and then the processes follow precisely for loading historical geology layers in to the Data Vault.

Note: to assist in managing voluminous data sets, set the compression flag on the load date time stamp column. This will help queries tremendously; however, keep in mind that on most RDBMS engines compression adds overhead to the loading cycle. It may be unwise in a true real-time solution.

On occasion the historical date time stamp must coincide with the creation date of the data, and sometimes the granularity of the creation date may be monthly where the current loads occur daily. For these reasons, controlling the load date time stamp as a single unit provides full flexibility for historical loads at specific grains of data as well, providing snapshot availability.

3.4 Load End Dates

Load end dates are system computed attributes. These are mechanical attributes that exist solely to make queries against the Data Vault easier. Load end dates are NOT necessary for the architecture; they are query attributes only. These attributes indicate the end of the data's lifecycle within the loading time-frame of the Data Vault. Time-series based database engines are capable of computing data life-cycles without resorting to load end date columns.

Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle

Load end dates are set according to the next current row's load date. They may be exclusive (as indicated here), where 1 second has been subtracted from the next most current load date, or inclusive (not indicated here), where they are equal to the load date of the next most recent row. Load end dates of the current row are shown in Figure 3-5 to be NULL. It is optional to configure them to be future dated if desired. Load end dates that are future dated do not need to be relocated on disk when updated (when the end date is reset).

Load end dates which are NULL do not cause the row to migrate to another disk block when updated or end-dated; most RDBMS engines make date/time data types take the same amount of bytes whether NULL or not – this is known as a fixed-length column in the database engine. Load end dates are not auditable, as they are system computed values. Load end dates must be updated in the row-set in order to ensure time-line consistency. Figure 3-5 depicts a Satellite entity with customer names. Satellites are defined in detail in Chapter 5.0.
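A sketch of the end-dating step for the exclusive convention described above (the Satellite name is illustrative; a time-series capable engine could compute this at query time instead):

    -- End-date each superseded row: 1 second before the next row's load date
    UPDATE SAT_CUSTOMER s
    SET    s.LOAD_END_DTS = (SELECT MIN(s2.LOAD_DTS) - INTERVAL '1' SECOND
                             FROM   SAT_CUSTOMER s2
                             WHERE  s2.SQN_CUSTOMER = s.SQN_CUSTOMER
                             AND    s2.LOAD_DTS     > s.LOAD_DTS)
    WHERE  s.LOAD_END_DTS IS NULL
    AND    EXISTS (SELECT 1
                   FROM   SAT_CUSTOMER s3
                   WHERE  s3.SQN_CUSTOMER = s.SQN_CUSTOMER
                   AND    s3.LOAD_DTS     > s.LOAD_DTS);  -- current rows stay NULL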

3.5 Last Seen Dates

Last seen dates are a particular component of the architecture that enables source-system hard-delete monitoring without resorting to complete scans of the data set currently in the Data Vault. Last seen dates are optional within the Hubs and Links. The “last seen” metadata can be tracked in alternative Satellites for better resolution at a lower level of detail. Last seen dates are not required for the architecture of the Data Vault to stand up and work properly. There are other manners in which to track data (discussed in the Satellite chapter) that may provide more information than the last seen date; an alternative architecture is a status-tracking Satellite, or a record source tracking Satellite.

NOTE: LAST SEEN DATES SHOULD NOT BE USED IF THERE IS AN AUDIT TRAIL AVAILABLE. AN AUDIT TRAIL IS MORE ACCURATE FROM THE SOURCE SYSTEM AND ELIMINATES THE NEED TO IMPLEMENT LAST SEEN DATES. AUDIT TRAILS MAY ALSO BE UTILIZED IF GENERATED FROM CHANGE-DATA-CAPTURE (CDC) UPSTREAM IN THE SOURCE SYSTEM.

The problem faced by enterprise data warehouses is: detecting hard-deletes of source data while the “set” of EDW data is continuously growing. The case is as follows: a source system does not provide an audit trail, nor does it provide any event or transaction indicating which rows are being deleted or removed. During every load cycle the entire source table/XML file is simply dumped and loaded to the staging area of the Data Vault. Traditional set theory dictates that in order to find “missing rows” that have been hard-deleted from the source feed, a process takes place that scans everything in the Data Vault that does not exist on any of the source feeds. This is an extremely expensive operation, and cannot be mathematically sustained for high volume data warehouses. At some point, running the full scan on the Data Vault becomes impossible. It is at this point that the set can be contained or limited to a finite size by introducing a system maintained date stamp called a last seen date. An example of the structure can be seen in Figure 3-6 below.

Figure 3-6: Structures Containing Last Seen Dates

The following section addresses technical implementation, which is out of scope for this document. However, this information is necessary to assist in the explanation of the last seen date concept; therefore it is included in this text.

The last seen date can be found within Hubs and Links. The last seen date's functionality is to track the last time the Hub/Link saw the key/relationship on any incoming feed. The last seen date is a system generated and system maintained field; therefore it is a non-auditable attribute existing within the data warehouse. It may be updated in place without affecting the auditability of the underlying warehouse data.

Figure 3-7: Scan All Data in EDW

Last seen dates provide a mechanism to reduce the data set scanned to detect missing rows on the source feed. A different architecture known as status tracking Satellites can provide more detailed information on the appearance and disappearance of the keys. Status tracking Satellites may be used in place of last seen dates. These Satellites are covered in the Satellite chapter.

For example, suppose the Hub_Customer had 800 million customer keys. The source feed has 30 million on a nightly basis. The customer keys arriving on the source feeds originate from three applications: manufacturing, sales, and contracts. The SQL query / code for detecting hard-deleted keys (without utilizing a last seen date) is as follows:

    Select *
    From   HUB_CUSTOMER
    Where  Customer_Acct_Num not in
         ( Select cust_acct         from STG_MANUFACTURING
           UNION ALL
           Select customer_acct_num from STG_SALES
           UNION ALL
           Select cust_num          from STG_CONTRACTS )

First, a last seen date column must be added to the HUB_CUSTOMER table. Second, a new business rule is created and signed-off on by the business in a service level agreement (SLA). The new rule is: data is aged, and not marked as deleted until it hasn't been seen for more than 3 weeks. The keys in HUB_CUSTOMER are tracked by reversing the set logic in the following code (which presumes Last_Seen_Date is a column in the Hub):

    UPDATE HUB_CUSTOMER
       SET Last_Seen_Date = Load_DTS
     WHERE Customer_Acct_Num IN (
           SELECT cust_acct         FROM STG_MANUFACTURING
           UNION ALL
           SELECT customer_acct_num FROM STG_SALES
           UNION ALL
           SELECT cust_num          FROM STG_CONTRACTS
     );


This completely reverses the mathematical set operation so that the database now scans only 30 million incoming records and direct joins against the 800 million (the equal-join means hitting only 30 million rows in the Data Vault Hub). A secondary operation is then run against any HUB_CUSTOMER rows that are "older" than 3 weeks; the resulting set scan is typically a finite 10% of the total data or less. The set size generally stays small (in this case around 80 million rows is scanned compared to a scan of 800 million rows without a last seen date). The reduced scan set is depicted in Figure 3-8.

Figure 3-8: Reduced Scan Set after Applying Last Seen Date

Inserts to a Satellite are made for "soft-deletes" of keys that are deactivated. The rows already "marked as deleted" are ignored from the follow-on scans, reducing the data set size again from 80 million down to 20 or 30 million at most. These are the averages that have been experienced by implementing this solution at large global companies.
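A minimal sketch of the secondary aging scan described above follows; the column names match the earlier examples, while the 3-week window and date arithmetic are illustrative and database-specific:

    -- Find keys not seen on any feed for more than 3 weeks; these become
    -- candidates for the "soft-delete" Satellite inserts described above.
    SELECT Customer_Acct_Num
      FROM HUB_CUSTOMER
     WHERE Last_Seen_Date < CURRENT_DATE - INTERVAL '21' DAY;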

3.6 Extract Dates

Extract dates are wonderful to capture if they are available on the source systems. Extract dates represent the date and time that the data is "extracted" or written to flat-file on the source system. Extract dates typically are not available when direct SQL access is utilized. Extract date and time for SQL extracts is usually stored in the metadata (process logs of the ETL performing the extract) so it is not required to store in the Data Vault structures directly. Extract dates for flat-files are extremely helpful, particularly if the data set is pulled from several areas around the world.


Extract dates are not always reliable: in some cases the extract may be created on a PC where the clock and system date time are in question. In other cases the server performing the extract may be in a different time zone than either the source system or the data warehouse server. Bottom line: the EDW team generally has no control over the extract date and time on the source system, therefore it is non-auditable data. It is reference data in a manner of speaking, and as such should be stored as just another attribute of the Satellites in the Data Vault.

3.7 Record Creation Dates

Record creation dates are wonderful if they are available on the data set. If they are available they should be recorded as attributes in Satellites. They should not be part of the Hub or Link structures, as they are not reflective of the key structures or associations. Record creation dates generally represent the date and time of creation of the source system row (in its entirety). In some cases these date time stamps may be edited by the business users on the source system (which means they can change over time). Regardless of the case, the EDW team has no governance to cover the management and consistency of record creation dates. Furthermore, even if governance procedures existed, it would be a great undertaking to ensure governance over 100% of the source system data. The result is a non-auditable field which must be treated the same as any other source system data: as an attribute in a Satellite.

3.8 Record Sources

Record source columns are row-based metadata that represent where the row originated. These are hard-coded values applied to maintain traceability of the arriving data set. Record sources can be codified with the descriptions residing within reference tables. Record sources should be architected to the lowest level of granularity. For example: SAP.FINANCE.GL (indicating an SAP source system, followed by a financial application, followed by General Ledger). Record sources are metadata that must be carried into the staging area (they are hard-coded in the staging loads to the Data Vault). They can be created as lookup codes or lookup sequences to avoid duplication of the data set in high volume situations. They are then resolved on the way from the Data Vault to the data marts. If they are created as lookup codes, they then are placed in a reference table. The reference tables are covered in the reference data chapter, chapter 8.0.
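As an illustration only (the staging table, columns, and record source string below are hypothetical, not from this book), a staging load might hard-code the record source like this:

    -- Hard-code the record source at the lowest practical granularity
    -- while landing the source rows into the staging area.
    INSERT INTO STG_CUSTOMER (customer_acct_num, customer_name, REC_SRC, LOAD_DTS)
    SELECT customer_acct_num,
           customer_name,
           'SAP.FINANCE.GL',      -- hard-coded record source value
           CURRENT_TIMESTAMP      -- load date time stamp applied by the load
      FROM SRC_CUSTOMER;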


Record sources must remain on the row level as part of a 100% compliant and auditable solution. These fields are used to answer questions about the data: where it came from and, more specifically, from which application. Traceability of the data from the AS-IS data marts all the way back to the source systems provides compliance that meets regulatory standards. Developers, auditors, and business users benefit from having a record source in each row of data across the entire model.

Tech Tip: To manage volume or repeating groups without joining (resolving to a code), compress the column in database engines that support compression. Record source codes are highly repetitive and redundant data.

Record sources may be comprised of reference codes, resolved on the way out of the Data Vault by joining to reference data. Reference codes as record sources allow the data set to be compressed from the start.
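A hedged sketch of resolving record source codes on the way out of the Data Vault; the reference table and its column names are assumptions for illustration:

    -- Resolve the stored record source code to its full description
    -- by joining to a reference table at query time.
    SELECT hub.CUSTOMER_ACCT_NUM,
           ref.RECORD_SOURCE_DESC
      FROM HUB_CUSTOMER hub
      JOIN REF_RECORD_SOURCE ref
        ON ref.RECORD_SOURCE_CODE = hub.HUB_CUSTOMER_RSRC;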

3.9 Process ID’s

Process ID’s are a tracking mechanism for the loading process that brought the data into the warehouse. They are not part of the core architectural components of the Data Vault. Process ID’s may be used as a means to track the data set back to the individual loading process. They are augmentative metadata only. Process ID columns are repetitive in nature, and as such should be set up for column compression.

Process ID’s may replace both record sources and load dates. If process ID’s are tied to technical metadata stored in the Meta Vault, they can replace the two items in Hubs and Links (load dates will always be needed as part of the key for a Satellite). In this situation, the process metadata must be tagged with a record source and a "run/load" date.


4.0 Hub Entities

Hubs are defined by a unique list of business keys. They are surrounded with additional technical metadata elements such as load date time stamp, last seen date (optional), record source and sequence number. Business keys may be composite (made up of more than one field), intelligent (smart keys containing meaning across parts of the key), or sequential in nature. Though not ideal or desirable, over time operational application developers have made the mistake of producing, displaying, and reporting on sequence numbers. The definition of a business key will be discussed in detail starting in section 4.2.

Tech Tip: Meaningless sequence numbers in operational systems can be a design/architecture hazard. If sequence numbers in operational systems are exposed to business users then they become (by default) business keys.

Hub tables are meant as a consolidation point for horizontal business functions. For example: Customer Account Numbers should span multiple lines of business. At the end of the day, having a single customer account number from customer inception to delivery is what the business process needs in order to provide corporate level answer sets.

Unfortunately, in the real world, customer keys (as with so many other business keys) change depending on the system being used. The keys change from one state to another as the customer information passes from one system to another. These changes are typically a manual process, resulting in little to no visibility at the corporate level for where a customer is in the life-cycle of business.


Figure 4-1: Business Key Changing Across Lines of Business

In Figure 4-1 above the key changes through an Excel-managed process when the customer is transferred from the sales system to the procurement system. The ideal would be for the same key to be used horizontally across all lines of business regardless of the system of origin and the system of transfer. What business doesn't realize is just how much money they are losing by "changing" the business key from one line of business to the next.

They also frequently allow this to happen by implementing off-the-shelf products which expose sequence numbers as business keys. Clearly, sequence numbers from Oracle Financials will never match sequence numbers in Siebel or PeopleSoft or SAP, etc. Because the sequence numbers are exposed, the business begins to use them as business keys – automatically losing traceability (and money) when the sequence number for the same customer differs across multiple systems.

One of the "jobs" that a good data warehouse should perform is gap analysis – that is: provide the business with a view of the GAP between the way the business believes they are operating their business, and the way the systems are collecting the data. By examining this gap, the business can quickly locate where they are hemorrhaging money.


4.1 Hub Definition and Purpose

The job of a Hub is to track the first time the Data Vault sees a business key arrive in the warehousing load, and where it came from. The Hub is a business key recording device. The business keys in a Hub should be defined at the same semantic granularity. For example: Customer Individual is a different grain than Customer Corporation. Each of these types of customers should be respectively modeled in two different Hubs as shown in Figure 4-2.

Figure 4-2: Hub Example Images

Hubs have several of the standard fields including sequence number (SQN), Load Date (LOAD_DTS), and Record Source (REC_SOURCE). In special cases, a Hub will also include an encryption key (ENCR_KEY) and potentially a Last Seen Date (LAST_SEEN_DTS). The encryption key is a part of the Hubs when the data set is encrypted. It may be one half of a two-part public key. The encryption key is not standard, which is why it is not listed in Chapter 3.0.

Last seen dates are not required, and are not a part of the core architecture. Last seen dates assist in tracking deleted rows/aging business keys. Business keys in Hubs may be tracked through status tracking Satellites, which are covered in the Satellite chapter. Required in the architecture are the sequence number, load date, and record source.

The purpose of the Hub is to provide a soft-integration point of raw data that is not altered from the source system, but is supposed to have the same semantic meaning. The resulting singular list of keys assists in the discovery of patterns across systems. The Hub key also allows corporate business to track their information across lines of business; this provides a consistent view of the current state of application systems. These systems are supposed to synchronize, but often don't – and when they don't synchronize, business keys begin to be replicated and, worse yet, are then applied to different contextual data sets.

Some examples of Hubs and their data are shown in Figure 4-3:


Figure 4-3: Hub Example Data

In the HUB_CUST_ACCT (Hub Customer Account) it is easy to spot similar patterns, fat-fingered data, and errors in entry – possibly a lack of edit masks. The typical requirement in this case is as follows. The business says: "We always create our customers in contracts. You will always get your customer numbers from contracts first because they are responsible for closing the deals and getting the money."

When the patterns in the data are examined, it is clear that Sales has produced keys (as has Finance) that are not in Contracts. It's up to the business to figure out why; it's the job of the data warehouse to point out the pattern. With this type of analysis, the data warehouse can provide the needed gap analysis between the business requirements and the source systems. In this case there may be broken source system synchronization routines, or worse: a loop-hole in the business process that incentivizes people in sales to enter new customers. All of this is speculation until the business figures out why it's happening and moves to fix it in the source systems or primary processes of the business.

There are ways that these keys can be rolled together for BI reporting purposes. The notion of hierarchical Links and same-as Links is discussed in the Link chapter (chapter 5.0). The data itself stays intact in order to reconstitute the source system as necessary for auditability.

4.2 What is a Business Key?

A business key is something that the business uses to track, locate, and identify information. It is a best practice to have unique business keys. Business keys are also known as natural keys. Business keys should be unique, but often are not. Business keys may actually be source system sequence ID's that have been released to business users and now are embedded in business processes. Business keys are supposed to have meaning to the business. Examples of business keys include:


• Vehicle Identification Number (VIN)
• Auto License Plate Number
• Driver's License Number
• Account Number
• Portfolio Number
• Part Number
• Work Order Number
• Employee Badge Number
• Invoice Number
• Ticket Number
• Bar Code
• Product Number

Each of these keys "stands alone" in business and in the operational systems – they usually are surrounded with descriptive context to give them meaning. In data modeling terms these keys are parents, and do not require any additional keys to provide them with the grain of definition. There are times when business keys are composite keys (such as VIN numbers or bar codes). These are also known as intelligent keys. Business keys may also include the "natural key" plus the corresponding source system surrogate sequence key, because the business failed to make the natural key truly unique – and the source system surrogate is now needed for traceability within the EDW.

4.3 Where do we find Business Keys?

Business keys can be found in source system applications, on-line lookup screens, report headers, source system data models, XML, XSD (schemas), and source COBOL copybooks. Business keys can also be found in business process engines, SQL joins, and source code (COBOL, Java, stored procedures, etc.). Business keys may also be found in Excel spreadsheets used to group items together and label elements used in reports, and listed in OLAP cubes as part of dimensions used for drill down.

The best place to find business keys is within the business process layers. Businesses often identify and track their information sets through business keys. The business process layers allow business users to communicate from one person to another – and translate, send, or attach the information to the business process flow. Business keys may indicate hierarchies, groupings, cross-mapping (from one system to another), physical identification tags, and global traceable information.


NOTE: just because a surrogate is used within a source system does not automatically qualify it as a business key. It must be presented, printed, displayed, or "searched on" – made known to the business user – in order to qualify as a business key. It should also be clearly defined by the business to represent a noun or an object that has context, or is defined to be the "key" to contextual information, in order to become a business key.

However, there are a number of surrogate keys (like Order Number and Invoice Number) which are true surrogate numbers and have business value. Both Order Number and Invoice Number qualify as business keys, as they are used by the business to uniquely identify (and track) data in the source systems. In these cases, ideally only one system maintains and produces these surrogate numbers; that would be the optimal solution.

4.4 Why are Business Keys Important?

Business keys are the most important component of all information systems. Business keys provide Links between business processes and the context that drives decision making. Business keys are the most stable of data elements used by the business. They should be consistent throughout and across lines of business. By listing the business keys of the same semantic grain together, patterns of inconsistencies and consistencies begin to emerge. Typing mistakes are more easily caught, domain overload (domain chaos) is more easily visible, and missed punctuation becomes clearer.

At the time of this writing it is an extremely rare circumstance to locate common business keys that transcend lines of business and the applications in which they are generated, stored, and utilized. Businesses must begin to identify, through metadata, their need for common business keys. This is a sign of true business architecture. The end result of common business keys gives rise to board level visibility of the end-to-end business process in which their data travels. By tracking the data set and their business keys, business users can begin to optimize the business processes. This simple notion is the root of master data management. Master Data will not succeed without the proper identification and management of business keys.

Because the business key (in theory) is supposed to be static and stable, it should be consistently the "smallest" portion of the business and the most unchanging component of the business – regardless of the business units in which it is applied. This is separate from the business metadata that defines the element and the functionality of how the element is applied in business. This is a technical definition of "same semantic grain" that associates the business key with the corresponding context surrounding it.


For example: an automobile's VIN (vehicle identification number) should not change. However, the color of the car, number of doors, windows, seats, length of the car, and size of the engine may all change over time. These are examples of descriptive attributes which are covered in Chapter 6: Satellites. Because business keys are supposed to be the most stable component, separating them in the model into a Hub stabilizes the model itself over time. At the same time as we stabilize individual structures, we can also adapt easily to new business keys at different grains or defined by different criteria. Thus, the adoption of new structures to meet new business operating procedures becomes easier (without losing history in the current system).

Without business keys, IT will not be able to build a master data system and properly tie the data set (context) back to the business processes. Business keys make up "master record locators" that are embedded for information visibility across lines of business. Business keys should never change and should never be re-used. However, it is a well-known fact that business keys do change and are re-used; this has major implications in business life-cycles and will cost the business significant money on a year over year basis.

Figure 4-1 demonstrates the nature of the business key changing from one line of business to another. The end result is: no consistent visibility at the corporate level for maximum optimization. The businesses that have this problem without tracking across the change will not be able to answer the following questions:

• How many customers does my business have today?
• Where in the business life-cycle are my customers?
• How many customers are in Sales, and not yet in Contracts?
• How long does each customer spend in different business units?
• Which business unit takes the longest to process customers through?
• Which customers are most profitable?
• Which customers take the shortest time to process through our business?

These are all master data questions that require consistent and tracked business keys. Stable business keys are not required, as long as the business key changes are tracked across multiple alterations. The bottom line is: business keys are the only way to create auditable and traceable information back to the root business processes and source systems.


4.5 How do Business Keys tie to Hubs and Business Processes?

Business keys are the heartbeat of the data that travels through business processes. Think about it: when you access a source system application to look up a customer, what do you type in? When you look for a part, or a product, or an employee in a source system, what do you search on? If you guessed business keys, you guessed correctly!

Business keys are a part of everyday life. We use computers and their data stores to "remember and track" all the possible information that we collect. We are then left to focus on a product, a portfolio, or a set of customers. From these activities we have to identify, define, trace, and manipulate all of this information within the business processes. These business processes include manual efforts (we print a report and hand it to someone else), source system applications (think data entry), or a dashboard of our top customers that we have to touch every day to see if there's anything we can do for them.

Without business keys involved in these processes, there would be chaos. Without business keys identifying all this information, it would all be ZERO VALUE to us. Which in fact is exactly what happens to the data in the systems if or when the keys to that data are lost. You know the old saying, "out of sight, out of mind." If we can't track, edit, retrieve or manage all the information in our operational systems, then the value of that information drops to zero.

Business keys are tied to the data set in the source application. Business keys are likewise tied to every business process that the data flows through, thus ensuring traceability at the business process level. Business keys are the foundation of the Data Vault, which means that your data warehouse is centered on business keys. These keys are the life-blood of the data warehouse, which is how we can tie the value of the data assets back to the business.

Centering your data warehouse around business keys provides you with a huge advantage in data warehousing valuation as an asset to your business. It gives you the ability to track, and trace, all the information back to the point in the business processes where it makes the most sense. THIS, MY FRIENDS, IS CALLED: GAP ANALYSIS. THIS IS OUR TRUE JOB AS BUSINESS INTELLIGENCE EXPERTS. WE ARE SUPPOSED TO POINT OUT THE GAPS, AND HELP THE BUSINESS CLOSE THEM!


4.6 Why not Surrogate Keys as “Master Keys”?

Surrogate keys are helpful and useful to a machine, particularly when it comes to speeding up joins and processing data sets in order of creation. However, that's where the helpfulness and usefulness stops. Surrogate keys should never – ever – be shown to business users. They should never be placed on reports, search screens, or operational application screens. They should never be mistaken for business keys by the business users. They invariably cause problems (never ending problems) that cost businesses large sums of money over the life of the source system and data warehouse.

Problems begin to arise when the data needs to be re-loaded, and new surrogates must be generated for the rows. This causes confusion in the auditability of the data set, and even calls into question any "previously exposed" surrogate keys that were printed on reports. These old surrogates no longer match up with the newly generated data! So much for the system-of-record source system!

Surrogate keys should remain within the confines of the systems in which they are applied. However, in modeling a Data Vault for source systems (especially those without business keys today), the Data Vault model must accommodate the surrogate keys and (unfortunately for business) treat them as the business key to that source. The loading routines must deal with collision, semantic meaning, and definitional aspects of "simple numbers." Surrogate keys mean nothing to the business, and the business should not be asked to memorize or embed meaningless data into their business operations.

Note: As mentioned in section 4.3, some business keys are in fact surrogate keys. These include keys such as Order Number and Invoice Number. These keys are used as meaningful business keys and should be represented as Hubs when necessary.

4.7 Hub Smart Keys, Intelligent Keys

Some business keys, like bar codes, are called "smart keys" or "intelligent keys", meaning the key is comprised of multiple parts. All parts must be kept together as a UOW (unit of work). The business utilizes the entire key as one unit (one identifier) to represent other information.


Figure 4-4: Smart Key Example

In Figure 4-4 the key represents a single manufacturing work order number. The components underneath are the rules supplied by business as documentation for the make-up of the number. Smart keys are helpful only when they are entered properly in the source systems.

Smart keys may also be known as "intelligent keys" – that is to say that the data within the keys has business meaning by position, value, and format. When intelligent keys are used by the business they must be kept together in a single Hub within the Data Vault. This is consistent with the definition and context of the business processes that search and index this key for purposes of discovering additional context.

Furthermore, it may be that this key is exposed to end-users. For example: a two dimensional bar code. Bar codes are printed on drink products. If the user wanted to know who the manufacturer is, where it was manufactured, and on what date, they may want to "scan" or look up the bar code. Some bar codes also house batch information, so that certain products may be recalled by batch and by date.

Since these bar codes are "kept as a single identifier" within the business, they are to be kept as a single Hub business key within the Data Vault model. It is possible (for exploratory reasons) to break the parts of the bar code into separate Hubs – however it is mandatory that the original bar code Hub remain in place as well.

4.8 Hub Composite Business Keys

Hubs are not required to house their business keys within a single field. There may be times when the data is well-defined enough that it allows the different components of an intelligent key to be split apart into separate fields. The business key is then made up of multiple fields, resulting in a composite identifier. The composite identifier is still the single business key for the Hub, meaning that the key is uniquely indexed.

There may be reasons hidden in performance, indexing, partitioning, or searching to split a business key across multiple fields. Two examples of composite business key Hubs are shown in Figure 4-5.


Figure 4-5: Composite Business Key Hub Example

Tech Tip: In Hub_Bar_Code the composite key (when concatenated) makes up the full bar code that is printed on a container. Each constituent part is a piece of the whole. Since the business uses the entire bar code to track the container, the entire bar code is itself a business key. Multiple fields are simply split apart to represent the composite whole. In other words, the ENTIRE BAR CODE is used as the business key by the business, therefore it is part of a single Hub.

Can we also have Vendor, Product Code, and Production Date in their own Hubs? Yes, of course – as they most likely represent unique data by themselves; however, the nature of a BAR CODE is to be a conjugation of all its constituent parts – and as such, it will remain a single Hub with all composite fields in its own right.

The source system for HUB_DOCTOR is written to be run in different states. The application was set up in Colorado, Denver, and New York. Each "application" assigned doctor ID = 1 to a different doctor in its own state. In order to avoid collisions upon data load, the state ID or state code must be loaded as a composite with the Doctor ID. This maintains traceability back to the source application in each state.
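A minimal DDL sketch of a composite business key Hub, assuming the bar code parts named above; the field names, data types, and lengths are illustrative, not prescriptive:

    CREATE TABLE HUB_BAR_CODE (
        HUB_BAR_CODE_SQN   BIGINT       NOT NULL,  -- surrogate sequence
        VENDOR_ID          VARCHAR(10)  NOT NULL,  -- constituent part of the bar code
        PRODUCT_CODE       VARCHAR(10)  NOT NULL,  -- constituent part of the bar code
        PRODUCTION_DATE    DATE         NOT NULL,  -- constituent part of the bar code
        HUB_BAR_CODE_LDTS  TIMESTAMP    NOT NULL,  -- load date time stamp
        HUB_BAR_CODE_RSRC  VARCHAR(12)  NOT NULL,  -- record source
        CONSTRAINT PK_HUB_BAR_CODE PRIMARY KEY (HUB_BAR_CODE_SQN),
        -- the parts together form ONE business key; they are uniquely indexed as a whole
        CONSTRAINT UK_HUB_BAR_CODE UNIQUE (VENDOR_ID, PRODUCT_CODE, PRODUCTION_DATE)
    );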

4.9 Hub Entity Structure

The Hub entity structure consists of these required elements: a surrogate sequence ID, a business key, a load date stamp, and a record source. There are additional components that are necessary and helpful in order to meet the applied needs of the data set. Items such as last seen date, confidence rating, strength rating, encryption key, and possibly other metadata elements may be added for query purposes, performance purposes, and discovery purposes as business requires.


The Hub entity must NEVER contain foreign keys. If the Hub structure is compromised (i.e., the modeling standards are not adhered to), then the integrity of the data and the flexibility of the model are immediately compromised.

Hubs must stand alone (be a parent to all other tables); they must never be children. Figure 4-6 is an example of the Hub entity structure.

Figure 4-6: Example Hub Entity Structure

Any compromise made in the structure will lead directly to re-engineering, high maintenance costs, difficulty in growth, lack of flexibility, and problems with real-time loading in the near future. Never alter the raw structural definitions of the Data Vault.
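A minimal DDL sketch of the structure in Figure 4-6; the table name, data types, and lengths are illustrative assumptions, while the required elements follow this section:

    CREATE TABLE HUB_CUSTOMER (
        HUB_CUSTOMER_SQN   BIGINT       NOT NULL,  -- surrogate sequence
        CUSTOMER_ACCT_NUM  VARCHAR(25)  NOT NULL,  -- the business key
        HUB_CUSTOMER_LDTS  TIMESTAMP    NOT NULL,  -- load date time stamp
        HUB_CUSTOMER_RSRC  VARCHAR(12)  NOT NULL,  -- record source
        CONSTRAINT PK_HUB_CUSTOMER PRIMARY KEY (HUB_CUSTOMER_SQN),
        CONSTRAINT UK_HUB_CUSTOMER UNIQUE (CUSTOMER_ACCT_NUM)
        -- note: no foreign keys of any kind; the Hub is always a parent
    );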

4.10 Hub Examples

For the examples, there are several different data models that have been utilized. The examples here are created based on the Adventure Works model (circa 2008 from Microsoft). There are a few Hub examples and data sets in Figure 4-7. You can find additional examples and downloads on http://danLinstedt.com


Figure 4-7: Example Hubs from Adventure Works 2008

Legend:

• Field extension SQN = Sequence number, surrogate identity used to manage uniqueness.
• Field extension LDTS = Load Date Time Stamp
• Field extension RSRC = Record Source
• The remaining field(s) are the business keys.

This model has a mix of data types for business keys. There are a few, like DocumentNode and ProductNumber, which match character based business keys. The rest of these keys were derived because of the following reasons:

• There is no source system application to check the rules against.
• There are no source system business users to ask (the Adventure Works model was created by programmers).
• There is only a single system that is integrated here. In multiple systems cases, the process of modeling a Data Vault generally represents alpha-numeric business keys.


Figure 4-8: Example of National Drug Code Data Vault

Figure 4-8 represents the NDC (National Drug Code) Data Vault. More information about NDC source data and the operational system can be found at: http://www.fda.gov/cder/ndc/ (note: if the Link no longer works, search Google for "NDC drugs" and click on the Link available from www.fda.gov). One large difference between this system and the Adventure Works model is that this system has real business users, along with defined metadata for each of the business keys. The fact that a business uses surrogate keys as business keys dictates that those source system surrogate keys are chosen as business keys for defining their Hubs.

As noted in these examples, it is absolutely vital to annotate assumptions, questions, and reasons for designing the Data Vault architecture as the model is built. In the case of Adventure Works there are no business users to speak with, and there is no source system to consult (application logic is missing). Once a standard is chosen, it should be adhered to throughout the life of the design.

4.11 Dependent and Non-dependent Child Keys

Hub business keys may be composite for another reason – dependent business keys. A dependent business key only has context when included with a parent key. However, a dependent business key is important enough to warrant uniqueness, and when coupled with a parent key, uniquely identifies additional data. Dependent business keys are another source for creating composite or multi-field Hub business keys. In Figure 4-9 below, please remember that the table on the left is a source system table represented in 3rd normal form; the dependent child key is the Hub Line Item Number.


A prime example of a dependent child key would be line-item-number. Line-item-numbers exist only within the context of an invoice or an order. They are important in keeping the proper ordering of the line-items on the invoice. Without line-item-numbers, every time the system would print the invoice the line-items would be printed in different ordered sets. Line-item-numbers by themselves make no sense; an attempt to "find" line item 5 (five) by itself would be difficult – if not impossible. Line-item-numbers depend on parent context (such as order number) to exist.

Figure 4-9: Dependent Child Relationship Modeling

NOTE: THE HUB LINE ITEM IS IN RED AND DOTTED LINES BECAUSE IT HAS NO CONTEXT, NO MEANING BY ITSELF. THE LINE-ITEM NUMBER IS WHOLLY DEPENDENT ON THE SURROUNDING KEYS FOR CONTEXTUAL RESOLUTION. THEREFORE, THE HUB LINE ITEM SHOULD NOT BE MODELED IN THE PHYSICAL DATA MODEL.

Line-item numbers are known as a dependent child. They are important as a business key, but not by themselves. They must accompany an additional business key to make sense. The Data Vault modeling standards allow multiple representations of the dependent child keys. They can be included in the same Hub with the parent or stand-alone business key, or they can be modeled within a Link table.
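A hedged sketch of the second option – carrying the dependent child key inside a Link table. The table and Hub names are illustrative assumptions, and Link structures are defined fully in Chapter 5:

    CREATE TABLE LNK_ORDER_LINE (
        LNK_ORDER_LINE_SQN   BIGINT      NOT NULL,  -- surrogate sequence for the Link
        HUB_ORDER_SQN        BIGINT      NOT NULL,  -- parent key: order number Hub
        HUB_PRODUCT_SQN      BIGINT      NOT NULL,  -- parent key: product number Hub
        LINE_ITEM_NUMBER     INTEGER     NOT NULL,  -- dependent child (degenerate field)
        LNK_ORDER_LINE_LDTS  TIMESTAMP   NOT NULL,  -- load date time stamp
        LNK_ORDER_LINE_RSRC  VARCHAR(12) NOT NULL,  -- record source
        CONSTRAINT PK_LNK_ORDER_LINE PRIMARY KEY (LNK_ORDER_LINE_SQN),
        -- the child key only has meaning in combination with its parent keys
        CONSTRAINT UK_LNK_ORDER_LINE UNIQUE (HUB_ORDER_SQN, HUB_PRODUCT_SQN, LINE_ITEM_NUMBER)
    );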


Another example of a dependent child key may be a sub-typed business key representation. The most important question to ask is: does the key stand on its own? Does it have meaning by itself? If the answer is no, and it remains a business key, then it may very well be a dependent child key. Dependent child business keys are not allowed to be modeled explicitly within the model. If they are combined in another parent key's Hub, then they shall not be modeled logically either – Hubs are not allowed to contain foreign keys. However, if they are included in a Link structure (explained in the next chapter), they can be represented logically; this notation is called a "weak Hub." In the Link Chapter (Chapter 5) they are also referred to as degenerate fields.

4.12 Mining patterns in the Hub Entity

The Hub table brings together previously disassociated business keys. It represents lists of these business keys in a single common table. For example: a list of all part numbers that appear across the enterprise. Patterns can be mined from the single list of business keys. By consolidating the business keys from multiple source systems into a single component, it becomes possible to extract business value and meaning. Hubs can be mined for the following information:

• Entry patterns and format masks
• Source system key creation/generation patterns over time
• Possible ontological relationships or hierarchies within keys

By mining the Hubs' data it is possible to discover practical associations and ties across business keys. Hierarchies and ontologies can be discovered, which translates into added business value. The results always need to be checked against the business to see if they are false positives. Complex inter-relationships across the internal data patterns and shifts in entry can be discovered. It is interesting to note that the longer the business key, the more likely it is to make these discoveries.

Entry patterns and format masks can also be established. The percentage of data that meets particular patterns can be assigned. The greater the percentage, the more likely the business rule is out there somewhere being utilized. It is possible to tie strength and confidence ratings to percentages of data meeting specific patterns that have been discovered. Just as with the last case, the more data involved in the discovery, the higher the confidence that the discovered pattern is an applicable business pattern.
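As one hedged illustration of mining entry patterns, the query below derives a simple format mask (letters become 'A', digits become '9') and counts keys per mask. The TRANSLATE syntax shown is Oracle-style; the table follows the HUB_CUST_ACCT example and the key column name is an assumption:

    SELECT TRANSLATE(CUSTOMER_NUMBER,
                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                     'AAAAAAAAAAAAAAAAAAAAAAAAAA9999999999') AS format_mask,
           COUNT(*)                                          AS key_count
      FROM HUB_CUST_ACCT
     GROUP BY TRANSLATE(CUSTOMER_NUMBER,
                     'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                     'AAAAAAAAAAAAAAAAAAAAAAAAAA9999999999')
     ORDER BY key_count DESC;

Masks covering a large percentage of the keys suggest a real (if undocumented) business rule; rare masks point at fat-fingered entry or missing edit masks.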


Source system key creation (or broken business requirements) can be discovered as the data set is loaded. Confidence ratings increase as additional data (new keys) arrives to demonstrate that the business rule truly is broken. For example, the business requirement is: contracts always create new customer keys, but when the data is loaded, the pattern states otherwise. Contracts are responsible for the creation and inflow of customer accounts, and the negotiation of these accounts, before the organization can begin building product for the customers. The data set in the Hub shows that 40% of the new business keys are being created by a financial system. Further discovery shows that it takes 20 days before the customers are synchronized and moved into the contracts system. The business then needs to ask the following questions about their business processes:

• What does this say about the business requirement?
• Can Finance negotiate contracts with the customer?
• How long does it take for the data to move from the contracts system to the financial system?
• How much is it costing the business to NOT have the customers created in contracts first?

What happens when the business key for a certain customer changes when it is passed from Finance to Contracts? What if the programmatic code that changes the key does not record the "from-to", or the business user does not record the "from-to" change when they key it in to the contracts system? What impact to the business does this have? It can be huge, it can be costly, and it can range from the $10 mark to the $10 million mark.

Mining Hub keys for patterns can be a powerful way to validate the data against the business requirements. It provides insight into the gap between the "vision" that the business assumes it's operating under, and the "reality" of their operational systems, coupled with the business process in place today. This is the fundamental idea behind process improvement, monitoring and measuring.


4.13 Process of Building a Hub Table

The process is simplistic in nature; however it requires a consistent check with the business users, business application, and source system data set. At the end of the day the business application collecting the data has the overriding decision. It is the responsibility of the Data Vault to enable reproduction of the source system as-it-stood as of a specific point in time; otherwise the commutative property is broken, and the system of record that exists within the Data Vault is compromised.

1) Find the business key
   a. Go to the business users and watch how they interact with the operational systems. View their print-outs and application screens. Locate the "find" mechanisms, headers of reports, and dimensional groups they use in their MS Excel spreadsheets.
   b. Determine which business keys are truly used in which business units. Do NOT worry or consider HOW to define the business keys; leave that to the business users later in the project.
   c. Locate the business keys in the source system by examining the record join/find code.
   d. Look for business keys in the dusty old data model that is supposed to represent the source system; look for the primary keys and secondary unique indexes.
   e. Pry open the physical data stores on the source systems; look for alternate unique indexes and primary keys.
2) Validate the business keys
   a. Check with the business units; balance the data sets and unique indexes that are physically printed or seen by the business users. Eliminate those keys that are internal only. Many times the internal keys are there for performance reasons.
   b. Validate the business key data by profiling the data set. Discover the consistency and actual uniqueness; develop metrics against the business keys, their patterns, and their associations to other records in other systems.
3) Check business keys against multiple source systems
   a. Develop profiling patterns across multiple source systems that are within scope; discover where the collisions are. Work on resolving the multiple entry patterns that occur. Again, the focus is not to define these keys, but rather simply to identify the business keys.
4) Finally, build the Hub
   a. Define the systems that feed the Hub. Develop data flows that identify potential collisions.
   b. Define what to do in case of a collision. Get this answer from the business users by ASKING them to define which system is the first master, the second master, the third master and so on.
   c. Implement loading paradigms from a staging area to the Hub in the Data Vault (a minimal loading sketch follows this list).
   d. Profile the results to produce metrics and measurements about the patterns of the data sets.
   e. Publish the results to the entire IT team, the business users, and anyone interested in the data warehouse. BEGIN the data quality improvement process as early as possible.


5) Validate the results in the Hub
   a. Reconstruct the LIST of business keys for each system, and balance the keys against each of the source systems to ensure integrity has not been lost.
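A minimal sketch of the loading paradigm in step 4c; the names follow the earlier HUB_CUSTOMER examples, the surrogate sequence is assumed to be generated by the database, and the record source string is illustrative:

    -- Insert only business keys the Data Vault has never seen before.
    INSERT INTO HUB_CUSTOMER (CUSTOMER_ACCT_NUM, HUB_CUSTOMER_LDTS, HUB_CUSTOMER_RSRC)
    SELECT DISTINCT stg.customer_acct_num,
           CURRENT_TIMESTAMP,
           'SAP.SALES.ORDERS'
      FROM STG_SALES stg
     WHERE NOT EXISTS (SELECT 1
                         FROM HUB_CUSTOMER hub
                        WHERE hub.CUSTOMER_ACCT_NUM = stg.customer_acct_num);

Keys that re-arrive simply fall through to the last seen date update shown in Chapter 3; the Hub records each key once, the first time it is seen.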

These are the fundamental steps to building a single Hub within the Data Vault model.

4.14 Modeling Rules and Standards for Hub Tables

The Data Vault model is a repeatable, consistent, scalable and flexible technique. There are rules and standards around each of the table structures that must be followed, or the resulting model will not qualify as a Data Vault model and will be subject to the risks it was designed to avoid. Below are the modeling rules and standards that surround a Hub table.

• A Hub must have at least 1 business key.
• A Hub should not contain a composite set of business keys (see the exceptions below).
• A Hub SHOULD support at least one Satellite. Hubs without Satellites usually indicate "bad source data", poorly defined source data, or business keys that are missing valuable metadata. However, a Hub's Satellites may be hidden because of security restrictions or information hiding paradigms.
• A Hub business key CAN be composite when two of the same operational systems are using the same keys to mean different things AND these keys collide when integrated back together again. In this case, the record source becomes part of the business key. Please be aware: BAD DATA CAUSES BREAKS IN THESE RULES – THESE ARE GUIDING PRINCIPLES. Exceptions to this rule should not happen (but do); also be aware that bad architecture in source systems causes breaks in these rules too.
• A Hub business key MAY also be composite because the key is utilized as a composite key within the business.
• A Hub's business key must stand alone in the environment – either a system created key, or a true business key that is the single basis for "finding" information in the source system. A true business key is often referred to as a NATURAL KEY.
• A Hub should contain a surrogate sequence key (if the database doesn't work well with natural keys).
• A Hub's load date time stamp or observation start date must be an attribute in the Hub, and not a part of the Hub's primary key structure.
• A Hub's PRIMARY KEY cannot contain a record source (though the business key may, as noted above).
• A Hub may contain a Last Seen Date if the desired grain of tracking is needed.

The rules for Data Vault modeling have not changed (architecturally) since 1997, which makes the architecture itself stable and easy to use. The rules and standards for modeling are kept up to date on the following web-site: http://DanLinstedt.com.


4.15 What Happens when the Hub Standards Are Broken

The standards, the design, and the architecture of the Hub are based on mathematics including finite complexity and measurable maintenance effort, including the number of rows per block. If the Hub standards are broken (such as introducing a foreign key directly into the Hub) then the flexibility of the model breaks. The adaptability to future business requirements breaks. The ability to load past history (which may not match the relationship definition) breaks.

When the rules and standards are broken, it also introduces high levels of re-engineering upstream of the data warehouse. It forces business requirements to creep back into the upstream loads. Eventually the business requirements change, and thus force re-engineering to occur in the loading, querying and structuring of the Data Vault. The current architecture of the Data Vault avoids all re-engineering if the rules and standards are adhered to.

If descriptive data is introduced to a Hub, then data over time becomes more difficult to manage. The complexity of the loading cycle increases. The staging area requires additional "copies" of the data set to synchronize it with the final image. It becomes impossible to split data by rate of change or type of information.

It is not recommended nor condoned to break the standards of the Data Vault. The engineering work has been done in order to avert pitfalls encountered on typical enterprise data warehousing projects. In fact, if the standards are broken, the model will not qualify as a Data Vault model.

The only risk a "pure" Hub design has is the width of the business key. If the business key is comprised of multiple fields (is a composite business key), then it may be possible that the number of rows per block exceeds the desired count. When this happens, the number of I/O's increases dramatically to search through the Hub structure and locate the proper business key. The average Hub row size is accounted for as follows:

    Field                    Average Bytes
    Sequence                    8
    Business Key               25
    Load Date Time Stamp        8
    Record Source              12
    TOTAL                      53 bytes

Figure 4-10: Typical Hub Row Sizing


If the block size is 16,384 bytes (16k) then it can fit approximately 309 rows per disk I/O. If the block size is 32k, then the Hub can fit approximately 618 rows per disk I/O. With a block size at 64k the Hub can fit approximately 1,236 rows per disk I/O. The best average is around 1,000 rows per block. The Data Vault implementation book covers the mathematics in detail, along with the loading mechanisms, block sizes, and row widths.

NOTE: THIS INFORMATION IS TECHNICAL IN NATURE, AND WILL BE COVERED IN DEPTH IN THE DATA VAULT IMPLEMENTATION BOOK, AND IN THE COACHING AREA. THIS INFORMATION IS HERE TO CLARIFY THE PRESENTED TOPIC.

Do not break the rules of the design or architecture. If the rules are broken, the design will suffer re-engineering in the near future. It also breaks the ability to keep costs down from a maintenance perspective. The Data Vault model is based on scalability mathematics involved in computing near-linear scalability from an MPP (massively parallel processing) perspective.


5.0 Link Entities

Link entities act as the flexibility component of the Data Vault model. They are the glue that pulls together any related association of two or more business keys. Where business keys interact, Links are created. Link entities are generated as a result of a transaction, discovery, relationship, or interaction between business units, business processes, or business keys themselves.

Links provide flexibility to the Data Vault model by allowing change to the structure over time. Mutability of the model without loss of history is critical to the success and long-term viability of the enterprise data warehouse. In other words, the model itself can now be adapted, morphed, and changed at the speed of business without loss of auditability and compliance. The Data Vault model also gains flexibility from this technique because of the Link entity. The Link entity (in data modeling terms) is commonly referred to as an associative entity.

5.1 Link Definition and Purpose

A Link entity is an intersection of business keys. It contains the surrogate ID's that represent the Hubs' and Links' parent business keys. A Link must have more than one parent table. A Link table's grain is defined by the number of parent keys it contains. Each Link represents a unit of work (UOW) based on source system analysis and business analysis.

The purpose of the Link is to capture and record the past, present, and future relationship (intersection) of data elements at the lowest possible grain. The Link entity also provides flexibility and scalability to the Data Vault modeling technique. Typical examples of Links include: transactions, associations, hierarchies, and re-definition of business terms.

WARNING: ANY CHANGE TO THE LINK STRUCTURE (LIKE ADDING BEGIN/END DATES, ADDING BUSINESS KEYS), OR CHANGING THE ARCHITECTURAL DEFINITION OF THE LINK, WILL RESULT IN THE NEED FOR RE-ENGINEERING LATER. CHANGES TO THE LINK ARCHITECTURE COMPROMISE THE AGILITY AND FLEXIBILITY OF THE DATA MODEL. DO NOT MAKE CHANGES TO THE ARCHITECTURAL DEFINITIONS.

5.2 Reasons for Many To Many Relationships

Within the Data Vault modeling constructs a Link is formed any time there is a 1 to 1, 1 to many, many to 1, or many to many relationship between data elements (business keys). The resulting physical Data Vault can capture "what the relationship was", while it captures "what the relationship is", and can adapt to "what the relationship will be in the future." Many-to-many relationships provide the following benefits:


1. Flexibility
2. Granularity
3. Dynamic adaptability
4. Scalability

Many-to-many relationships allow the physical model to absorb data changes and business rule changes with little to no impact to both existing data sets (history) and existing processes (load and query). Businesses must change at the speed of business, and IT must become more agile and responsive in handling those changes. More and more business rules are changing, faster and faster.

Through the Link entity the Data Vault mitigates the need to restructure/redesign the EDW model because the relationship changes. For example: today the business states "1 portfolio can handle many customers, but each customer must be handled by 1 and only 1 portfolio." If the model is designed in a rigid fashion (that is to say with parent-child dependencies) then it represents the current business rules quite well. All is well until the business (tomorrow, next year, or 2 years ago) decides to change their business rule: "now, a customer may be handled by 3 or 4 different portfolios." Figure 5-1 demonstrates relationship change over time.

Figure 5-1: Relationship Changes Over Time


One of the problems of modeling today's relationship in any data warehouse is that it makes the structures static. It forces the structures to represent today's relationship rules. These relationships have changed in the past, and will change again in the future. This is the dynamic changing nature of the business: grow, change, or die. The problem with introducing static relationships into the model is that it also re-introduces business rules to the loading processes. It also introduces static relationship enforcement into the loading routines. When the relationship does change, IT is forced to re-engineer the loading routines, the modeling architecture, and the queries to get the data set into the Data Warehouse. This is an unacceptable and un-maintainable cost going forward.

The Data Vault must remain flexible, and not introduce the need for re-engineering as the model grows. By modeling the Links as a many-to-many relationship, we can easily accomplish this goal. The Link table functions to future-proof the model and provide maximum flexibility. Figure 5-2 demonstrates the reason for using a Link table:

Figure 5-2: Link Table Structure Housing Multiple Relationships


Many-to-many relationships ensure that the business associations (past, present, and future) can be added to the warehouse without altering the model or the load routines. The metadata that is currently lost is the nature of the relationship (e.g., 1:1, 1:M, M:1) as documented in the source system (what exactly did the operational model look like?). This must be documented in the metadata of the Link table, ideally in the Meta Vault. By capturing the metadata in the Meta Vault (including computational functions that create the relationship, along with how it's used) the business can begin to track changes to business knowledge as they relate to the data set and operational systems over time. The resulting power of this capture mechanism enables the business to monitor the impact of their decisions. Data mining on the Meta Vault and the data set can then perform gap analysis on the quality of the decision and the resulting impact (pre- and post-decision process).

If the business adapts its business process, adding a new Link table can be done easily and quickly without re-engineering the entire existing data warehouse. Load routines are isolated from the impact, as are queries and BI processes. If the model is rigid, then the loading (ETL) designs are also rigid. If the business rule changes and meets a rigid architecture, the result is forced re-engineering. The extent of the impact may cascade into other "child tables"; thus, the larger the EDW model grows, the larger the possibility for impact, the less agile IT can be in response to business rule changes, and conversely the more it costs (over time) to continue to adjust the EDW architecture to meet business needs. This is the common design pattern that occurs in traditionally modeled warehouses.

This impact is completely mitigated by building a Link entity into the Data Vault. The Data Vault therefore is highly scalable, flexible, and now, agile. The Link entity allows the structure to handle changes to business rules without the impact of re-engineering (aka re-factoring), and without the ever-increasing cost curve. However, it is suggested that the business rule itself, along with any calculation that produces this data set, be recorded within a Meta Vault. To learn more about the Meta Vault, check out the one-on-one coaching area at: http://danLinstedt.com

5.3 Flexibility

Many-to-Many relationships provide maximum flexibility and agility. The more flexible the model is, the faster it is to adapt or change. The faster the model can adapt, the less time it takes IT to respond to business changes. The less time it takes for IT to respond to business changes, the more work can be done in a shorter amount of time, leading to increased productivity of the IT staff in the data warehousing environment. Adding new tables (especially Link tables) to the Data Vault is easy.


Figure 5-3: Starting Model Before Changes

Suppose the model starts out with an order tracking system that knows the customer, order, product, and line-items (see Figure 5-3). Time passes and now the business wishes to add a set of product categories and suppliers. How does the model evolve?

Figure 5-4: Data Vault After Modification
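As a hedged illustration of what such an extension might look like physically, the sketch below adds a new Hub and Link as standalone tables; all names and data types are assumptions made for this example:

```sql
-- New structures only; nothing existing is altered or reloaded.
CREATE TABLE HUB_SUPPLIER (
  SUPP_SQN  BIGINT       PRIMARY KEY,        -- surrogate sequence
  SUPP_NUM  VARCHAR(20)  NOT NULL UNIQUE,    -- business key
  LOAD_DTS  TIMESTAMP    NOT NULL,
  REC_SRC   VARCHAR(48)  NOT NULL
);

CREATE TABLE LNK_PRODUCT_SUPPLIER (
  LINK_SQN  BIGINT       PRIMARY KEY,
  PROD_SQN  BIGINT       NOT NULL REFERENCES HUB_PRODUCT (PROD_SQN),
  SUPP_SQN  BIGINT       NOT NULL REFERENCES HUB_SUPPLIER (SUPP_SQN),
  LOAD_DTS  TIMESTAMP    NOT NULL,
  REC_SRC   VARCHAR(48)  NOT NULL,
  UNIQUE (PROD_SQN, SUPP_SQN)
);
```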


As seen in Figure 5-4, changing the model or adding new structures is a simple process. Not much time or effort is required to make the changes occur, and it has no impact on the existing portions of the warehouse. We can add Links to represent the new relationships without having to revise existing structures (to add new foreign key columns) or reload any data. Do not confuse this with the time and effort required to find and establish appropriate business keys. Creating the Hub structures is the first step, and the most important step to take. Suppose the business now wishes to add sales regions to manage both customers and orders as a combined component; how hard might it be to extend the model again?

Figure 5-5: Additional Data Vault Model - More Changes

It is easy to add new entities to the Data Vault model due to the Link structure (see Figure 5-5). There is a discussion in the Business of Data Vault Modeling book about classified information systems. This type of model flexibility lends itself well to protected environments, as the secured or encrypted data can be stored in separate entities and easily Linked to unsecured data entities.

Keep in mind that adding new Links across global environments is also possible. This is especially helpful in managing distributed (yet connected) systems. As indicated in Figure 5-6, the Link tables are stored in different global systems; it is up to the application to keep the keys synchronized for query use.


Figure 5-6: Global Data Vault Linking

In this situation, there are many different applications that are synchronizing the operational Data Vault. Most likely the application loading data to the global Data Vault is comprised of web services and a business rules engine. It is quite possible to have more than one global controller, and to farm out different components of the access depending on the security, geo-location, or other criteria. The control over the data, the loading, and the querying are beyond the scope of this book, and will be discussed in the book titled: Data Vault Implementation.

5.4 Granularity

Granularity is vital to an EDW; the Data Vault is no different. Grain can be measured by the number of "parent" tables a Link contains. For each parent, there is a new (lower) level of grain introduced. The same mode of thinking applies when considering fact tables in a Star Schema. For example, what is the grain of the following fact table (see Figure 5-7), and how can it be accurately described?


Figure 5-7: Uncovering Fact Table Grain

The grain of this fact table can be read as: Customer by Product by Sales by Week/Year/Month, etc. Each dimensional key creates a new level of grain for the facts. Grain as defined by this example simply means detailed level of data. Grain in the Data Vault Link tables is no different. The Link tables represent the level of detail that the data is stored at. After converting this Star Schema to a Data Vault, the grain would look like Figure 5-8.

Figure 5-8: Data Vault Grain, Representing Star Schema


When the business requirements indicate a need to record data at a different grain, new Links should be added to the existing Data Vault – old ones are simply no longer fed incoming data (but are retained, as they contain historical data). The alternative option is to re-engineer the existing Link to add the new Hub surrogate key. Re-engineering is the enemy of flexibility and auditability – and can quickly cause an EDW project to scale out of control.

With regard to auditability, the question is: once a new Hub surrogate key is added to the existing table, how should it be defined to the business? Especially if the definition has to apply to "past historical data" that is already stored in the Link. The very same question plagues changes to star-schema fact tables; adding a dimensional surrogate to a fact table causes the grain of all the data to change. When the business asks the next question – can we reproduce a report from last year and compare it to data from this year? – the answer is: technically yes. But what has to happen to the code that drives that report? It has to split into two parts: one part of the code for grabbing history, and a second part for grabbing "current" data with the new key. Now the project is beginning to take on a much greater cost in terms of maintenance. As changes continue to alter the structure, more code forks are necessary to satisfy the business users' desire for reporting; until one day, the business wakes up and says to IT: we can't afford any more changes, and why is the system such a mess already? This is one of the reasons we advocate using a Data Vault model for your core EDW instead of a dimensional architecture – this kind of change will not break a Data Vault.

What lurks in the shadows is even more troubling. Suppose it's the first change; all is well – and everyone is happy (as long as access to each data set is governed). Then one day, another business unit decides they need to "roll up" the data, or summarize the recent data that has the new key. They then combine these results with the old data that doesn't have the new key, and the numbers no longer match. Now they ask IT: why is the data reporting "bad" numbers? Accountability has just been destroyed.

As stated above, in the situation of new relationships and the added needs of a data warehouse, it is best to always create new Links for these changes and leave the old ones be. A hint from the implementation book: as data degrades in value (gets older), there's a good chance that the old Link and its data will be backed up, and the old Link will no longer be necessary within the warehouse. This is the beginning of a Data Vault model that truly changes with the business needs.
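A minimal sketch of this practice, with all names assumed for illustration: the new Link is created at the new grain, while the old Link simply stops receiving feeds.

```sql
-- New Link at the new grain (adds a Region hub to the combination);
-- the assumed old Link LNK_CUST_PROD_SALES keeps its history but is
-- no longer loaded.
CREATE TABLE LNK_CUST_PROD_SALES_REGION (
  LINK_SQN   BIGINT      PRIMARY KEY,
  CUST_SQN   BIGINT      NOT NULL,
  PROD_SQN   BIGINT      NOT NULL,
  SALES_SQN  BIGINT      NOT NULL,
  REGION_SQN BIGINT      NOT NULL,   -- the new level of grain
  LOAD_DTS   TIMESTAMP   NOT NULL,
  REC_SRC    VARCHAR(48) NOT NULL,
  UNIQUE (CUST_SQN, PROD_SQN, SALES_SQN, REGION_SQN)
);
```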

5.5 Dynamic Adaptability

Link structures enable dynamic adaptability; that is, the ability to "define associations or correlated data sets on the fly." Dynamic adaptability leads to a fluid modeling structure. A data mining tool with specialized algorithms that mine both the metadata (data model ontology) and the data set is capable of discovering new relationships that are not yet represented in the model. The data mining algorithm must include the metadata definitions of terminology that explain data models in order to apply appropriate context when deciding to Link different data sets (i.e., Hubs) together.

Relationships (i.e., new Links in the Data Vault) created in this fashion must include two additional attributes: confidence and strength. In other words, how confident is the mining engine (neural network) that this relationship actually exists and is real, and how strong are the correlations across the data sets? These two metrics are applied to every row of data that is loaded to the newly formed Link.

A fluid Data Vault model is constantly adapting, self-learning. Like any neural network, the alterations and learning must be a guided and corrected process; otherwise the neural network may drive the model to an undesired (possibly unusable) state. Before these notions are dismissed as theoretical in nature, consider that they are already reality. A company known as NetQuote in Denver, Colorado applied this technique (human-based mining) to build an up-sell Linkage, resulting in a 40% profitability increase in the first week. Learning systems, intelligence systems, and military-grade systems may actually see the most benefit from this technique. It allows "testing" of hypotheses without losing any of the historical data which has been captured.

Taking advantage of the fluid model requires automated changes to loading and querying routines; in addition, it requires automated changes to the data marts down-stream. It is possible to create a learning system that is capable of discovering relationships across data sets where none existed previously. It is possible to create a system that adapts to newly arriving elements on XML feeds or web-service transactions. It is possible to create a system that arrives at potentially high-impact information without the need for up-front human intervention. How to build these systems is well beyond the scope of this book, but a well-designed Data Vault is a prerequisite to even starting down this path.

5.6 Scalability

The Link entity also provides the model with unlimited scale-out, the same way MPP (massively parallel processing) relies on scale-out (adding new independent processing nodes) to reach larger and larger data sets. The Link entity enables data to sit in different geographical locations, yet be Linked or associated at run-time (see Figure 5-6 above). The Data Vault scalability is limited only by the imagination and the hardware components applied underneath the model.

What is MPP? Short for Massively Parallel Processing, a type of computing that uses many separate CPUs running in parallel to execute a single program. MPP is similar to symmetric processing (SMP), with the main difference being that in SMP systems all the CPUs share the same memory, whereas in MPP systems each CPU has its own memory. MPP systems are therefore more difficult to program, because the application must be divided in such a way that all the executing segments can communicate with each other. On the other hand, MPP systems don't suffer from the bottleneck problems inherent in SMP systems when all the CPUs attempt to access the same memory at once. http://www.webopedia.com/TERM/M/MPP.html

Physical location of the tables on specific storage devices can be optimized for maximum performance. Figure 5-9 indicates a traditional starting point for the Data Vault architecture on RAID 5 (SAN or NAS disk). This type of architecture provides the lowest-cost entry point for a single system. The Data Vault model is flexible enough to grow with the corresponding needs. As performance grows, as data sets grow, as real-time data arrives – the Data Vault model can scale as desired.

Figure 5-9: Traditional Data Vault Storage Layout


When the performance of this architecture falls below expectations, it can be easily adjusted to a new physical architecture as shown in Figure 5-10. Assuming in this case that Link-Customer-Order and Satellite-Customer-Order are growing at an unprecedented rate, or that they contain a massive volume of information, they can be split off physically on to a specialized DASD (direct attached storage disk) with multiple I/O channels. This allows business management and IT to provide an SLA (service level agreement) which specifies performance-driven metrics around certain queries or processes that load down-stream marts.

Figure 5-10: Performance Physical Split Version 1

The tables can then be further partitioned across multiple I/O channels, additional disk components, or hardware. The architecture allows the performance to be tightly coupled to the physical storage, while allowing the model to be de-coupled from the physical layers. This is an optimal situation for an MPP design. For further performance, additional RAID 0+1 configurations and DASD can be introduced to other table structures, as seen in Figure 5-11.


Figure 5-11: Performance Physical Split Version 2

This process can be repeated again and again across each individual table, and down to the partition level of each individual table. This enables a full scale-out MPP style architecture to be executed at the physical level of the Data Vault. This type of design is geared for extremely large systems, and for the flexibility of breaking off parts of the model on to slower equipment, while other parts of the model are placed on high-speed, high-cost equipment.

Figure 5-12: Performance Physical Split Version 3

Additional I/O channels can be added to each disk device for additional parallel access capacity. Partitioning of the tables further enhances performance and parallelism. Further discussions of physical table structuring can be found in the Data Vault implementation book.

5.7 Link Entity Structure

The Link entity structure consists of basic required elements: surrogate sequence id, multiple business sequence keys, load date stamp, and record source. There are additional components that are necessary and helpful in order to meet the applied needs of the data set. Items such as last seen date, confidence rating, strength rating, encryption key, and possibly other metadata elements may be added for query purposes, performance purposes, and discovery purposes as business requires. As technology advances, items such as last seen dates, metadata (including record source), and encryption key may be "swallowed" by the database functionality.

The Link entity must NEVER contain business keys, or begin and end dates. If a Link structure is compromised, then the flexibility of the model is immediately compromised. If the structure of the Link is compromised, then you are sure to need re-engineering in the future. Adding business keys to a Link table ensures that it depends on the business logic for loading; this raises the complexity of the loading routines. Links must contain two or more key sequence fields (from either Hubs or Links) in order to be considered valid; a Link with a single Hub sequence key is considered a peg leg Link and is invalid. Figure 5-13 is an example of the Link Entity Structure.

Figure 5-13: Sample Link Structure

Warning: Any compromise made in the structure will lead directly to re-engineering, high maintenance costs, difficulty in growth, lack of flexibility, and problematic real-time in the near future. Never alter the raw structural definitions of the Data Vault.
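A hedged sketch of such a structure in SQL follows; the names are assumptions for illustration, and the optional items are shown only as comments:

```sql
CREATE TABLE LNK_CUST_ACCT (
  LINK_SQN BIGINT      PRIMARY KEY,   -- surrogate sequence id
  CUST_SQN BIGINT      NOT NULL,      -- Hub sequence key 1
  ACCT_SQN BIGINT      NOT NULL,      -- Hub sequence key 2 (two or more required)
  LOAD_DTS TIMESTAMP   NOT NULL,      -- load date stamp
  REC_SRC  VARCHAR(48) NOT NULL,      -- record source
  -- Optional, as business requires: LAST_SEEN_DTS, STRENGTH_PCT,
  -- CONFIDENCE_PCT, ENCRYPTION_KEY ...
  UNIQUE (CUST_SQN, ACCT_SQN)         -- composite matches 1:1 with LINK_SQN
);
-- Never: business keys, or begin/end dates, in this structure.
```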

5.8 Link Driving Key

In every Link there is a notion called a driving key. The driving key is the main key that drives the rest of the relationship. The driving key is necessary to identify so that Satellites based on the Link can be appropriately "end-dated" when the relationship changes. In Figure 5-14, the driving key has arbitrarily been assigned to CUST_SQN.


Figure 5-14: Example Driving Key for Link

What this means in this example is: the account and employee sequences can be re-assigned to a specific customer. For instance, when the warehouse sees CUST=11, ACCT=25, and EMP_SQN=12 on October 14, 2000, it's the first time for this relationship. An insert occurs to the Link, establishing Link_SQN = 1. Figure 5-15 adds a Satellite (discussed in Chapter 6) for illustrative purposes.

Figure 5-15: Example of Link Satellite with Driving Key

In this case, the Link record 1 has 1 Satellite record. What happens when the operational system changes the account number that the customer is associated with? What if the operational system changes the employee that deals with the customer? In each of these cases, we see the following insert (intermediate step) occur. Figure 5-16 depicts the "post-insert" of the new row in the Link and Satellite.


Figure 5-16: Insert to Link/Sat Based on Driving Key

To restore order to the data, row 1 in the Satellite requires that it be end-dated. Why? Because the customer sequence in the Link is the driving key, and the Link received a new record that supersedes the old version. The ETL processing must take into account the driving key in order to make the proper determination of Link records and associations to Satellite rows.

Note: the Driving Key may be a composite. It may be representative of the source system Primary Key, but not always. Sometimes it is a combined view (super-set view) of multiple systems. Choose what makes sense to the business.

The final Figure 5-17 below shows the Satellite row properly end-dated by using the Driving Key to detect changes.

Figure 5-17: Link Driving Key/Satellite End Dated
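A hedged sketch of the end-dating logic in SQL, with all table and column names assumed for illustration:

```sql
-- End-date the open Satellite row whose Link row has been superseded by a
-- newer Link row carrying the same driving key (CUST_SQN).
UPDATE SAT_CUST_ACCT_EMP
SET    LOAD_END_DTS = CURRENT_TIMESTAMP
WHERE  LOAD_END_DTS IS NULL
AND    LINK_SQN IN (
         SELECT old_l.LINK_SQN
         FROM   LNK_CUST_ACCT_EMP old_l
         JOIN   LNK_CUST_ACCT_EMP new_l
                ON  old_l.CUST_SQN = new_l.CUST_SQN    -- same driving key
                AND old_l.LINK_SQN <> new_l.LINK_SQN   -- a different pairing
         WHERE  new_l.LOAD_DTS > old_l.LOAD_DTS        -- that arrived later
       );
```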

5.9 Link Examples

For the examples of the Links we have used several different models, including the Microsoft Adventure Works data model and a health-care model. These Link structures do not carry last seen dates nor strength/confidence ratings. Figure 5-18 contains example Link structures found in the current version of the Adventure Works 2008 Data Vault.


Figure 5-18: Example of Link Tables From Adventure Works 2008 Data Vault

In each of these examples, focus on the particular grain of the data set. The last Link (seen bottom right) has a grain of 3 different Hubs: Hub Product, Hub Category, Hub Sub Category. In business terminology this would be read as: Product by Category by Sub Category. These would be known as the dimensions. In technical terms, this is deemed to be the grain of the data which is represented by the Link table.

The combination of Hub Sequences (composite) must form a unique index. This unique index must match 1 to 1 with the generated Link Sequence. The Link Sequence is the primary key of the Link table. This standard is enforced so that the sequences in the Data Vault model can be re-built at any time.

Examine the Link Lnk_WOID_LocID (which stands for: Link Work Order ID by Location ID). Notice its associated keys. It contains the Hub Work Order ID sequence and the Hub Location ID sequence, but it also contains an Oper_Seq. The Oper_Seq turns out to be an operational sequencing number used by the operational system (in this case the Adventure Works application, if there was one) to order all the child records. It also serves to make the combination of the two keys unique. This particular element of a Link table is called a degenerate field.
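A hedged sketch of what that structure might look like in SQL; the physical column names are inferred from the narrative, not copied from the Adventure Works model:

```sql
CREATE TABLE LNK_WOID_LOCID (
  LINK_SQN BIGINT      PRIMARY KEY,
  WO_SQN   BIGINT      NOT NULL,   -- Hub Work Order ID sequence
  LOC_SQN  BIGINT      NOT NULL,   -- Hub Location ID sequence
  OPER_SEQ INTEGER     NOT NULL,   -- degenerate field: orders child records,
                                   -- makes the key combination unique
  LOAD_DTS TIMESTAMP   NOT NULL,
  REC_SRC  VARCHAR(48) NOT NULL,
  UNIQUE (WO_SQN, LOC_SQN, OPER_SEQ)  -- matches 1:1 with LINK_SQN
);
```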


5.10 Degenerate Fields In Links

The degenerate field is also known as a child attribute. It is a degenerate field because it depends on the combination of both of the parents' fields in order to make the relationship unique. The data in this field is meaningless outside the context of the relationship; in other words, the field is not a business key. The field will not function as a Hub. Oper_Seq as a Hub would only contain sequential integers for ordering data.

This field may also be a date value. In the case of an oil well, there may be a need to capture a physical date as to "when" the well was turned on, because until it's turned on, it is not assigned an actual well number. Another example may be an expiration date on a prescription drug bottle. This date is generally worked into the bar-code, making it a part of a larger business key. In this case, it is also part of the relationship between the drug itself and the packaging material. These degenerate keys are necessary in describing relationships to a higher level of detail; however, by themselves they do not provide significant information to cause the creation of a business key. Degenerate fields have the following rules:

• They cannot stand on their own (as Hubs)
• They have no business meaning
• They are "dependent" on other context in order to be defined
• They give meaning and uniqueness to additional relationship information
• They have no "descriptors" of their own

Examples of degenerate Link fields include "sequencing" or numbering information; for instance, the line-item sequence on a purchase order or invoice may be called a degenerate Link key. Dates (on occasion) are also degenerate Link fields. However, this case must be carefully examined, as not all dates (such as start/stop, begin/end, and other descriptive dates) should end up as a part of the Link. We discuss begin- and end-dating Links in a section below. The "degenerate" field that is a date is generally a rare case that should be applied sparingly and with caution. It is usually also an indicator or a composite of a business key, making the relationship unique.

5.11 Multi-Temporal Date Structures

The Data Vault is able to house multi-temporal views of the information. Multiple date-time stamps must be defined as data attributes in Satellite structures (defined in the Satellite chapter). Utilizing the data in a multi-temporal state is accomplished through the query designs.


WARNING: DO NOT ALTER THE ARCHITECTURE, NOR MODIFY THE STRUCTURES OF THE ARCHITECTURE TO "GAIN" A MULTI-TEMPORAL VIEW OF THE DATA. AS STATED PREVIOUSLY, ANY DEVIATION FROM THE STRUCTURE WILL CAUSE A SERIOUS BREAKDOWN OF THE VALUE OF THE DATA VAULT MODEL TO THE BUSINESS. THE REASONS ARE LISTED IN THE RE-ENGINEERING STATEMENTS THAT ARE MADE THROUGHOUT THIS BOOK.

The structures and standards have been built and tested for over 15 years. The standards have been built to avoid the pitfalls and problems that existing data warehousing models suffer today, including but not limited to: cascading change impacts, scalability issues, flexibility problems, absorption of new systems, and so on. By breaking the standards, you will experience many of the same problems that you have today – you will negate the whole reason for moving to a new data modeling structure! It is fine to add attributes to Satellites; it is not okay to change the primary keys of the Link or Hub structures.

There is a tendency by designers to want to add temporality (date/time keys) to Link primary key structures and Hub key structures. The original Data Vault design in 1993 allowed this as an option. By 1995, flaws in this design began to appear – as with cracks in the foundation of a home, these flaws were significant enough to warrant a re-definition of the Link structure. The finalized design was tested and passed with significantly better results – the finalized design allows no temporal date/time elements as part of the primary key of the Links. Allowing temporality as a part of the primary key of the Links caused re-engineering 3 to 6 months later. It is the view of this author that *any* cause of re-engineering should be eliminated if possible, and if not possible, the impacts of changes should be reduced to a minimum. Otherwise, the results are disastrous; akin to an invasive wall-climbing vine that anchors its roots deep in the structure it's climbing – eventually that structure must be torn down and completely re-built. It is the same for the Link table where the primary key introduces temporality.

One of the foundations of the Data Vault is to enable iterative development by consistently minimizing rework and re-engineering. This not only future-proofs the Data Vault, but also facilitates rapid development, because slight omissions or oversights at initial stages of the Data Vault design can be coped with in an elegant and extremely economic way. Not so when temporal structures are incorporated into the primary key structure, because the primary key changes must be migrated down-stream to ALL child tables – leading to cascading change impacts. The cascading change impacts affect re-engineering efforts of ALL the child tables, from loading routines to queries – everything must be changed.


That said: introducing temporality as a reason to produce a new sequence in the Link, and treating the date/time as attributes within the Link table, exhibits the exact same cause and effect. This practice of altering the Link entity should never happen.

**CAVEAT** The only time a Link may contain temporality is when it functions as a FACT table – in other words, it is a transactional Link where the data is time-stamped, and the data set cannot be updated or changed in any way (legally by the business).

5.12 Link-To-Link (Parent/Child Relationships)

A Link-to-Link relationship indicates a parent-child arrangement, or a hierarchy of some sort. In this case, it is equivalent to a nested relationship with different levels of grain. An example of a Link-To-Link is below, in Figure 5-19.

Figure 5-19: Example of Link To Link Relationships

For this example, Hub Product and Hub Supplier are both parents to Link A. Link A and Hub Sales Person are both parents to Link B. Link B and Hub Territory are both parents to Link C. This forces a rise in complexity in loading and querying, and it becomes most evident when Satellites are found as children of each Link.


Data modelers often feel the need to put a Link-to-Link relationship in the model. While these relationships are interesting and may be easy to read logically, they are extremely difficult to implement. From a logical standpoint it is easy to see the parent-child relationships when modeling a Link-to-Link architecture. Implementing this kind of model is difficult because of the parent-child dependencies during the loading cycle. The following is a discussion on best practices for removing Link-to-Link from the physical implementation.

Note: it is fine to logically model Link to Link relationships; the problem is when they are expressed in the physical data model. To avert issues and problems, the Link structures should be flattened out, and the hierarchy dependencies removed.

The consequence (borne by the loading processes) of physically implementing this type of structure is as follows:

• It requires the ETL to load sequentially to each child Link; it removes the ability to load all Links in parallel all the time

Mathematically speaking, if we "denormalize" the Link structures so each Link is connected to its parents' Hubs, we can represent the same data set in a flattened manner. The structure becomes simpler to maintain (going forward, it is "future proof" – especially if the relationship in the source system changes…) and the structure is easier to load as well as query. Figure 5-20 shows the first "flattened" hierarchy, and the new structure.

Figure 5-20: Step 1, Flattening Link-To-Link Hierarchy


The first step is to remove the Link A sequence from Link B. The second step is to put the Hubs from Link A directly into Link B. This moves Link B to a peer status of Link A; however, Link B is a different grain. Remember, Link B always had the following grain: Product by Supplier by Sales Person. In order to simplify the model and the loading and querying processes, the dependency on Link A was removed, or flattened. Now, to finish the job, the process is repeated for Link C. Figure 5-21 below shows the completed Link structures which should be implemented physically.

Figure 5-21: Step 2, Flattening Link-To-Link Hierarchy

The final step unhooks the dependency of Link C on Link B. The Links are now successfully flattened (denormalized). This new structure allows all Links to be loaded in parallel, and allows all queries direct access to the data set through the Hubs. This means that if the queries need access to other Links, it will be available based on direct request. It also has the following effect: there can be records in Link C which do not exist in Link B! There can be records which exist in Link B which do not exist in Link A. This is absolutely vital to having a Data Warehouse capable of absorbing 100% of the data 100% of the time (within scope). If the other dependencies were in place, we would be forced to create parent records for the entries in Link B and Link C just to load the appropriate data set.

Please note: the process of removing Link-To-Link relationships should happen after the logical model has been built, as it is easier to accomplish once the correct relationships have been established.
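A hedged before/after sketch of the flattening, with all names assumed for illustration:

```sql
-- BEFORE (physical Link-to-Link; loads must run sequentially):
--   LNK_B (LINK_SQN, LNKA_SQN, SPERSON_SQN, ...)
-- AFTER (flattened): every Link carries only Hub sequences, so all
-- Links can be loaded in parallel, all the time.
CREATE TABLE LNK_B (
  LINK_SQN    BIGINT      PRIMARY KEY,
  PROD_SQN    BIGINT      NOT NULL,   -- was reached only through Link A
  SUPP_SQN    BIGINT      NOT NULL,   -- was reached only through Link A
  SPERSON_SQN BIGINT      NOT NULL,
  LOAD_DTS    TIMESTAMP   NOT NULL,
  REC_SRC     VARCHAR(48) NOT NULL,
  UNIQUE (PROD_SQN, SUPP_SQN, SPERSON_SQN)  -- grain: Product by Supplier by Sales Person
);
```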


5.13 Link Applications

Link structures are all defined the same way. There are several different applications of Links which require discussion, introduction, and definition. It is these types of applications that are discussed in the following sections. The different types of Links include the following:

• Hierarchical Links
• Same-As Links
• Transactional Links
• Exploration Links
• Low-Value Links
• Computed and Aggregate Links

In some cases, the data within the Links is derived or computed by one or more business processes – thus resulting in a Link which contains non-auditable data, or at the very least, data which never existed in the source system. If this is the case, mark those rows with the appropriate "system generated" record source, or process name. Some of these cases include utilizing a Data Quality engine to produce "similarity" across names and households, business names, product names, etc. Other cases include using aggregate functions to produce corporate vision information that is used to drive the business in a day-to-day decision making function. A majority of the time, these computations, aggregations, or results of processing belong in a business Data Vault, which is defined in the upcoming book: Quick Start Guide to Business Data Vaults.

5.14 Hierarchical Links

Hierarchical Links are just what their name implies: a Link structure which contains N levels of hierarchical data from the same Hub. For example, consider the case of an employee who reports to a manager. Their manager is also an employee – who happens to report to a director, and so on. The Hierarchical Link allows roll-ups and aggregation of lower-level data into a tree topology, or tree-like organization. Hierarchies are a form of ontology. Let's look at an organizational chart, and an example of a Hierarchical Link. Remember, the Hierarchical Link is an application of the standard Link structure. It does not change the Link structure nor violate the rules in any fashion.


Figure 5-22: Example Organization Structure

The following assumptions are made about the organization shown in Figure 5-22:

• Each division office is not a store-front
• The executive office is not a store-front
• Each "office" has its own office business key identifier
• Each "store" has its own store business key identifier

In this case, different stores report to different divisional offices, and the division offices report to an executive office. The Data Vault model would appear as Figure 5-23:

Figure 5-23: Hierarchical Link for Offices

The Hierarchical Link is shown in purple above the Hub Office table. The hierarchical Link contains the office sequence twice: once for the root office, once for the "parent" office. There can be as many office roll-ups as needed. The reason for extrapolating the hierarchy to a many-to-many relationship is that relationships in business change over time. So the representation of the hierarchy today may not be the same as yesterday, or even the same as tomorrow.
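A hedged sketch of such a hierarchical Link, with assumed names, followed by one way to roll the tree up with a standard recursive query (assuming the hierarchy is acyclic):

```sql
CREATE TABLE LNK_OFFICE_HIERARCHY (
  LINK_SQN       BIGINT      PRIMARY KEY,
  OFFICE_SQN     BIGINT      NOT NULL REFERENCES HUB_OFFICE (OFFICE_SQN),
  PARENT_OFC_SQN BIGINT      NOT NULL REFERENCES HUB_OFFICE (OFFICE_SQN),
  LOAD_DTS       TIMESTAMP   NOT NULL,
  REC_SRC        VARCHAR(48) NOT NULL,
  UNIQUE (OFFICE_SQN, PARENT_OFC_SQN)
);

-- Enumerate every office under every ancestor (ANSI recursive CTE):
WITH RECURSIVE office_tree (office_sqn, root_sqn, depth) AS (
  SELECT OFFICE_SQN, OFFICE_SQN, 0 FROM HUB_OFFICE
  UNION ALL
  SELECT l.OFFICE_SQN, t.root_sqn, t.depth + 1
  FROM   LNK_OFFICE_HIERARCHY l
  JOIN   office_tree t ON l.PARENT_OFC_SQN = t.office_sqn
)
SELECT * FROM office_tree;
```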


The Data Vault model is isolated, or decoupled, from the impact of business change!

e of a hierarchical Link miight be the employee stru ucture described earlier. In this Anotther example case e, all employe ees have a badge numbe er, all employe ees are just tthat: employyees. Since the “tree” peated here.. The Data V would look very similar s to the e business tre ee above, it will w not be rep Vault mod del for this strructure would d appear in Figure F 5-24:

Figure 5-24: Example Hierarchical Link of Employees

Remember: the application of the Link does not change the structure of the Link. It merely defines a new use for the Link. Do NOT add begin and end dates to Link structures, including the applications of the Links discussed here. Changing the structure of the Link can break the flexibility of the Data Vault model and introduce re-engineering in the near future (as soon as the business changes).

5.15 Same-As Links

Same-As Links are another type of application of the Link structure. In this case, the data set is applied as resolution information. In other words, all data exists at the same semantic grain – or has the same meaning to the business. All data are peers to one another. In these cases, the differently spelled names of companies all represent the same company. Figure 5-25 below demonstrates business data that identifies the same-as concept.


Figure 5-25: Same-As Link Example, Business Data

There are as many names for corporations as there are corporations in the world today. Why? Because no one can seem to (or wants to) spell them the same way. There are many different reasons why names are multiplied across the company (ranging from incorrect business incentives to simple mistyping), especially if there are different systems or internal and external feeds bringing data together. Whatever the case may be, there are just a few basic rules that need to be followed when preparing the data set to be loaded to the Same-As Link:

• A business user (not IT) must pick a MASTER spelling to which all "similar spellings" will map
• OR: A neural-net data mining engine must pick a best-guess spelling to which all similar spellings will map

In each case for the example above, a master spelling has been chosen. This can be thought of as a step in the direction of defining master data for use in the operational systems. The Data Vault model for this example would appear in Figure 5-26:

Figure 5-26: Same-As Link Data Vault Model

Remember that even though the application (or usage) of the Link varies, the Link structure stays the same.
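A hedged illustration of Same-As Link rows mapping variant spellings to a chosen master; all names and values are invented for this example:

```sql
-- Assumed HUB_CUSTOMER sequences: 100 = "IBM" (master spelling),
-- 101 = "I.B.M.", 102 = "Intl. Business Machines"
INSERT INTO LNK_CUST_SAME_AS (LINK_SQN, MASTER_CUST_SQN, DUP_CUST_SQN, LOAD_DTS, REC_SRC)
VALUES (1, 100, 101, TIMESTAMP '2011-01-10 00:00:00', 'DQ ENGINE'),
       (2, 100, 102, TIMESTAMP '2011-01-10 00:00:00', 'DQ ENGINE');
```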


5.16 Begin and End Dating Links

There is an urge by many data architects to add the notion of time to Link structures. It feels like a natural thing to do; however, it will break the flexibility of the Data Vault and sooner or later (when the business changes its mind on how the relationship is defined), it will require re-engineering of the load processes and the SQL access processes. Figure 5-27 below depicts a Link with "begin/end dates" embedded.

Figure 5-27: Incorrect Link with Begin/End Date

The process of putting a "date" in the Link won't hurt anything technically. However, it increases the chance that the data set will be utilized the wrong way. In other words, the IT person will now have to answer questions like:

• What does the date mean?
• How is it documented?
• How is it computed?
• Can the date change?
• What effect does it have on the associative key structure?
• What does it do to the meaning of the surrogate key?

All of these questions arise – along with complications in loading, querying, and mining – when the structure of the Link is compromised. Every field in the Data Vault has a specific purpose, and exists in a specific place for one or more business reasons. Begin and end dates describe when a relationship is active/inactive. The purpose of a Link is to establish the fact that a relationship exists (remember: right, wrong or indifferent, with no regard to time).


Example: A patient checks in to a hospital and receives an ID tag with a number on it. Immediately their tag is scanned, and the system associates the patient data with the location of the hospital where they checked in. ID is paired up by Location. Whether or not the ID number is right or wrong, or the location in the computer is right or wrong, it's the data that was generated by the check-in event. The Link would model ID by Location.

Once the association has been established, it is a fact that it existed at that point in time in the operational system. The fact stands for all time; the fact is neither right nor wrong, nor does it "start" or "stop". It is a relationship that the source system recorded; therefore the Data Vault records it as well.

Remember, a patient can have many different interactions with that single location at different times. This means that the dates and times in this example are in fact descriptive in nature. Therefore the temporality of the Link data must be described in a Satellite in order to maintain the proper structures. Remember this: adding begin and end dates to Links changes the grain of the data (or business key) in the Link. Figure 5-28 below shows the effect of adding begin and end dates to the Link structure.

Figure 5-28: Begin & End Dates in Links

In this example, the driving key comes into question. Some of the many questions that appear as a result: "What do the BEGIN and END dates mean? What do they represent? How are they generated?" In this case, if the source system creates begin and end dates, then the business users have complete control over these, and they cannot accurately depict a proper time-line for the system-driven relationship. Why? Because the business users can back-date the "begin and end" date sequences. There's a more technical reason that works against this structure. In Figure 5-29 below, it is clear to see that the same relationship can be represented multiple times over time – which extends the meaning of the unique business key sequence.


Figure 5-29: Example of Poorly Constructed Link

Links with begin and end dates cause problems IF and WHEN there are Satellites attached as children. They can also cause queries to produce Cartesian results when joins are made that ignore "current" or "single record" access. Results can be disastrous across joins, and performance will slow to a crawl (just as it does in HUGE fact tables) due to the lack of unique entries in the Link table.

In order to track begin and end cycles of relationships, the best practice solution is to place them in an Effectivity Satellite. These are discussed in the Satellite Chapter (Chapter 6). However, Figure 5-30 below shows an example of an Effectivity Satellite off the Link. Note: other Satellite data (Sat_Cust_Acct_Emp_Details) has been shortened due to screen real-estate.

Figure 5-30: Satellite Effectivity on a Link
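A hedged sketch of what such an Effectivity Satellite might look like; the names are assumed, and the full Satellite treatment is in Chapter 6:

```sql
CREATE TABLE SAT_CUST_ACCT_EMP_EFF (
  LINK_SQN     BIGINT      NOT NULL REFERENCES LNK_CUST_ACCT_EMP (LINK_SQN),
  LOAD_DTS     TIMESTAMP   NOT NULL,
  LOAD_END_DTS TIMESTAMP,            -- NULL while the row is current
  BEGIN_DT     DATE,                 -- business-controlled effectivity
  END_DT       DATE,
  REC_SRC      VARCHAR(48) NOT NULL,
  PRIMARY KEY (LINK_SQN, LOAD_DTS)
);
-- The Link's own key structure is untouched; effectivity lives beside it.
```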


Notice that the "begin and end" dates are now unhooked, or decoupled, from the key associations. These dates can now be set and controlled by front-end applications as well as data mining logic.

5.17 Low Value Links

Low value Links are another application of the Link structure. Low value Links provide associations without any context. They exist as an association for the sake of joining two or more business keys together. They may even be categorized (in some cases) as exploration Links (see below). An example of a Low Value Link might be something that joins part number to secondary supplier id, in cases where primary suppliers are important but secondary suppliers are rarely used. Low value Links may also be called computed aggregate Links (see below). They may supply roll-up aggregation points at higher levels of grain.

5.18 Transactional Links

Transactional Links are defined to be a data set which cannot legally change. In other words, it's transactional history. Any transaction that cannot legally be edited qualifies for the transactional Link. The easiest qualification for a transactional Link would be to call it an unalterable fact. In other words, once issued, the record stays intact as auditable history forever.

There are two ways to model the application of this data within the Data Vault. The first is a traditional method: Link and Satellite (modified to have no history); the second is to place all the data in the Link structure itself. Figure 5-31 indicates the first method for setting up a transactional Link.

Figure 5-31: Transactional Link Example
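A hedged sketch of this first method, with all names assumed for illustration:

```sql
-- Transactional Link: the transaction number keeps the key unique.
CREATE TABLE LNK_TRANSACTION (
  LINK_SQN  BIGINT      PRIMARY KEY,
  CUST_SQN  BIGINT      NOT NULL,
  ACCT_SQN  BIGINT      NOT NULL,
  TRANS_NUM VARCHAR(32) NOT NULL,    -- degenerate field for uniqueness
  LOAD_DTS  TIMESTAMP   NOT NULL,
  REC_SRC   VARCHAR(48) NOT NULL,
  UNIQUE (CUST_SQN, ACCT_SQN, TRANS_NUM)
);

-- No-history Satellite: no load-end-date; rows are insert-only.
CREATE TABLE SAT_TRANSACTION (
  LINK_SQN  BIGINT      NOT NULL REFERENCES LNK_TRANSACTION (LINK_SQN),
  TRANS_DTS TIMESTAMP   NOT NULL,    -- transaction date/time as an attribute
  AMOUNT    DECIMAL(18,2),
  REC_SRC   VARCHAR(48) NOT NULL,
  PRIMARY KEY (LINK_SQN)
);
```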


Transactional data is loaded direct to the transactional tables (both the Link and the Satellite). In the above example, the transaction number is included in the Link for unique key structuring, while the transactional date and time is included in the Satellite. It is possible in this circumstance to use the transactional date as the Load date, if and only if the time at which the transaction is loaded to the Data Vault is relatively close (within seconds) to the actual transaction date itself. Otherwise it is important to separate the data set to accurately represent it.

Although it is not pictured here, the transactional representation of the Satellite does not need to store the Load Date, as in most cases it will match the Load Date housed in the Link parent. There is however an exception to this rule: in some specific cases, the transaction is delivered in two parts from two different streams, just milliseconds apart. In this type of real-time case, the Load Date should be modeled and stored in both the Link parent and the Satellite in order to properly represent the different arrival timings.

Figure 5-31 above also indicates a slightly modified Satellite structure (again, discussed in the Satellite Chapter). In this case, there is no load-end-date in the Satellite, indicating there is no history; in other words, once the data has been added to the Data Vault it cannot be "changed" or superseded with new information.

There is another option within the Data Vault for modeling transactional data, where the information is housed directly in the Link structure. This architecture is not preferred, as it changes the architectural design by introducing decisions to the design process. Therefore it increases the complexity of the maintenance cost and loading routines. In certain circumstances where performance is absolutely required to the millisecond level (or lower), it may be necessary to structure the transactional Link as in Figure 5-32:


Figure 5-32: Transactional Link, No Satellite

The only issue to watch for with this type of Link is the width of the data set. The width can easily become too large, and quickly cut down on the number of rows per block. If the Link becomes too wide, the performance of both the load and the queries will decrease. Transaction Links are generally built to house insert-only, rapid-fire transactions which arrive on a continuous multi-stream basis direct from the operational systems into the Data Vault. The decision to adopt this modeling structure must be made on a case by case basis.

5.19 Computed Aggregate Links

Computed aggregate Links are similar to fact tables in a dimensional model. Computed aggregate Links have a record source that is labeled system generated. Computed aggregate Links are utilized to house pre-computed data sets like totals, summaries, averages, minimums and maximums. They are part of the multi-layer (scale free) architecture that the Data Vault offers. Typically computed aggregate Links are found only in the architectural component known as the business vault. However, there may be times when they provide value to the raw data sets and hence will be found in the raw Data Vault.


Figure 5-33: Example of Computed Aggregate Link

The data found in computed aggregate Links is generally not auditable, as it is machine computed within the Data Vault and is not part of any source system. Caution: if the results housed in the computed aggregate Links are found in financial reports, or on a corporate executive's desktop, they may become auditable – as they are utilized by business users to run the business.

For further exploration: In the example above (Figure 5-33), the suppliers are interested in knowing their total sales of each product by store territory. Rather than produce a separate data mart, the architects decided to include the pre-computed aggregate directly in the Data Vault. The function F(x) determines the business rules for aggregation and possibly cleansing, which may or may not include product roll-ups to higher level assemblies. This type of structural component is an add-on, and is not considered to be part of the core Data Vault model. The add-on is similar to other query-assist tables that provide pre-built answer sets to routines that load down-stream data marts.
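A hedged sketch of how such an aggregate Link might be populated; the tables, the view over the raw Data Vault, and the sequence are all assumptions invented for this example:

```sql
INSERT INTO LNK_SUPP_PROD_TERR_AGG
       (LINK_SQN, SUPP_SQN, PROD_SQN, TERR_SQN, TOTAL_SALES, LOAD_DTS, REC_SRC)
SELECT NEXT VALUE FOR SEQ_LNK_AGG,   -- assumed sequence generator
       s.SUPP_SQN, s.PROD_SQN, s.TERR_SQN,
       SUM(s.SALE_AMOUNT),
       CURRENT_TIMESTAMP,
       'SYSTEM GENERATED'            -- marks the rows as machine computed
FROM   V_SALES_DETAIL s              -- assumed view over the raw Data Vault
GROUP BY s.SUPP_SQN, s.PROD_SQN, s.TERR_SQN;
```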

The implementation book will cover the details of how to load and query such a Link; for the purposes of explanation, implementing a computed aggregate Link directly in the Data Vault assists with the query aspects of virtual data marts.


5.20 Strength and Confidence Ratings in Links

The study of the Data Vault as a neural network introduces a number of concepts, one of which is the idea that the tables in the Data Vault act as the data storage, or nodes, in a fuzzy logic algorithm. In doing so, the neural network needs to establish associations with strength and confidence ratings. In querying the Link structures, the neural network can "learn" the context housed within, and determine if the relationship needs to be improved, if it's the strongest relationship, or if it's the weakest.

The strength rating (when added to the Link) is the result of data mining efforts to establish a correlation across the "related" data sets. In other words, if there are two Hubs with Satellites that both describe cars (maybe different types of cars), then an association or relationship can be formed with a fairly high strength of 90% or above (for example). But it's more specific than that: it's based on EACH BUSINESS KEY association. In this example, there are two cars with different VIN numbers that were recorded by two different systems. They each describe a blue car, front wheel drive, with 250k miles, that is a 1998 make and model match. They are each connected to similar owners/drivers in different states. The inference might be a 90% chance that these are the same car.

The confidence rating must be added to the Link in conjunction with the strength rating so that we know how confident the knowledge engine is in the rating it has provided. In the example of the cars above, the confidence may only be 60%, because the drivers might have different names and never have had the same address. However, maybe the confidence is 90%, because a mining effort across the drivers sees a family relationship between the drivers.

Strength and confidence can be added to the Link structure on a row-by-row basis, and are utilized by analytics routines to filter out important correlations. Of course, these strength and confidence ratings may change depending on the question being asked. At that point, the knowledge that is sought may require these ratings to be recalculated so that they make sense. The neural net engine that is making the assumptions and assigning these calculations should utilize an industry vertical ontology that describes the business terms. Otherwise, spotting the context for associations will be difficult if not impossible.
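A hedged sketch of carrying these two metrics on each Link row, continuing the car example; all names are assumptions for illustration:

```sql
CREATE TABLE LNK_CAR_SAME_AS (
  LINK_SQN       BIGINT       PRIMARY KEY,
  CAR_SQN_A      BIGINT       NOT NULL,
  CAR_SQN_B      BIGINT       NOT NULL,
  STRENGTH_PCT   DECIMAL(5,2),   -- e.g. 90.00: correlation across the data sets
  CONFIDENCE_PCT DECIMAL(5,2),   -- e.g. 60.00: the engine's certainty in its rating
  LOAD_DTS       TIMESTAMP    NOT NULL,
  REC_SRC        VARCHAR(48)  NOT NULL,
  UNIQUE (CAR_SQN_A, CAR_SQN_B)
);
```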


Note: this type of activity brings into focus another application of the Link structure called a Dynamic Link. Dynamic Links are discovered and created by machine learning algorithms. The data in the Dynamic Links are generally not auditable (as fuzzy logic rarely produces the same result twice). They are very similar in nature to exploration Links (described below) – however the difference is that Dynamic Links are machine driven, while exploration Links are manually created.

What you can do with Dynamic Linking is limitless. The Data Vault model is a scale-free architecture, which allows you to explore different Linking constructs until you find the right one that represents the business. It's also the very same reason that the Data Vault model is future proof, in that it can absorb any future change without changing the nature of the historical data that has already been collected.

5.21 Exploration Links

Exploration Links are short-circuits to the joins across the Data Vault and are placed into the Data Warehouse for business reasons only. They are manually generated and maintained; however, if an exploration Link proves to be valuable to the business, the loading cycle can be automated. Exploration Links are a form of computed aggregate Links. They may or may not contain computed attributes, and they are not auditable. The architect and BI team implement exploration Links to cross several different parts of the model. For example, they may span two different Hubs which are spread across the model and not directly Linked by the source systems.

A small company in Denver called NetQuote installed an exploration Link to determine up-sell potential for targeted ads and discounts to their web customers as they clicked through the system. This company saw a 40% increase in profitability as a result of the exploration Link, and found reason to implement it on a consistent basis. The company also built an Operational Data Vault that was loaded at the time each transaction was generated from the web front end; the operational Data Vault was hooked to the message bus for both incoming and outgoing message routing.

Exploration Links are encouraged once the base Data Vault has been constructed and is in operation. These Links can be created, queried, and destroyed at will without destroying history within the Data Warehouse. Exploration Links give the business user and the IT architect a chance to play with the data set within the Data Vault, hopefully resulting in new questions and answers that can be viewed from the data warehouse.
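As an illustrative sketch only (the NetQuote model is not published, so the table and column names below are hypothetical), an exploration Link that short-circuits two Hubs not directly related by the source systems might look like:

    CREATE TABLE LNK_EXPL_CUST_PROD (
        LNK_EXPL_CUST_PROD_SQN BIGINT       NOT NULL,  -- Link surrogate sequence
        HUB_CUST_SQN           BIGINT       NOT NULL,  -- Customer Hub sequence
        HUB_PROD_SQN           BIGINT       NOT NULL,  -- Product Hub sequence
        UPSELL_SCORE           DECIMAL(5,2),           -- optional computed attribute
        LOAD_DTS               TIMESTAMP    NOT NULL,
        RECORD_SOURCE          VARCHAR(30)  NOT NULL,  -- e.g. 'SYSGEN.EXPLORATION'
        PRIMARY KEY (LNK_EXPL_CUST_PROD_SQN)
    );

    -- Because exploration Links are not auditable, the table can be dropped
    -- and rebuilt at will without touching the history in the rest of the Vault.
    DROP TABLE LNK_EXPL_CUST_PROD;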


6.0 Satellite Entities

Satellite entities are the warehousing portion of the Data Vault. Satellites store data over time. They are comprised of descriptive data that provides context to the keys and associations at a point in time or over a time period. Descriptive data in warehouses often changes; the purpose of the Satellite is to capture all deltas (all changes) to any of the descriptive data.

Satellites are typically arranged by type or classification of data, and by rate of change. There are many different manners in which to set up classifications of data within a Satellite. For example, the attributes could be classified by data type, by content, or by context – each of which will yield the same result physically, but a different result in the understanding or interpretation of the model.

Rate of change is yet another classification of Satellite data. Rate of change allows the Satellite to split away groups of fields that change more quickly than others. This prevents or removes the need for column data replication (of the slower changing attributes). By splitting the Satellites by rate of change, the rows are also reduced in size, allowing the data to insert more quickly and be more responsive to real-time feeds. The lower the latency of arrival, the faster the database must respond with insert speed; the nature of these mechanics will be covered in the Data Vault implementation book.

6.1 Satellite Definition and Purpose

A Satellite is a time-dimensional table housing detailed information about the Hub's or Link's business keys. The purpose of the Satellite is to provide context to the business keys. Satellites are the data warehouse portion of the Data Vault. The Satellite tracks data by delta, and only allows data to be loaded if there is at least one change to the record (other than the system fields: sequence, load-date, load-end-date, and record source). A Satellite can have one and only one parent table.

Satellites provide the descriptive data about the business key, or about the relationship of the keys, and they describe how that context changes over time. Their job is to record the information as it is loaded from the source system. They use load dates and load-end-dates to indicate record lifecycles, because most database systems today are not capable of internally representing time series properly. Satellites often provide data normalization for future proofing, scalability, and auditability of the data sets. How normalized a Satellite gets is a function of the design and a choice made by the designer.
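As a minimal sketch of that delta rule in set-based SQL (the staging table, Satellite, and column names are illustrative; NULL-safe comparisons are omitted for brevity):

    INSERT INTO SAT_CUSTOMER (CUST_SQN, LOAD_DTS, RECORD_SOURCE, CUST_NAME, CUST_ADDR)
    SELECT stg.CUST_SQN, stg.LOAD_DTS, stg.RECORD_SOURCE, stg.CUST_NAME, stg.CUST_ADDR
      FROM STG_CUSTOMER stg
      LEFT JOIN SAT_CUSTOMER cur
        ON  cur.CUST_SQN = stg.CUST_SQN
        AND cur.LOAD_END_DTS IS NULL          -- compare against the current row only
     WHERE cur.CUST_SQN IS NULL               -- key never seen before, or...
        OR cur.CUST_NAME <> stg.CUST_NAME     -- ...at least one descriptive change
        OR cur.CUST_ADDR <> stg.CUST_ADDR;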

WARNING: Any change to the Satellite structure will damage the flexibility of the Data Vault model. For instance, adding a foreign key to a Satellite is not allowed – if a foreign key is added, it is no longer a Data Vault model. On the other hand, because Satellites are child tables, a change to their primary key structure is somewhat flexible without too much harm to the modeling aspect.

Remember: all models must serve a purpose and a function. The rules and standards described in this book are golden guidelines, most of which should be adhered to; however, there are some circumstances in which the principles of the Data Vault can be preserved while altering the structure of the data model to fit the needs.

6.2 Satellite Entity Structure

The Satellite entity structure consists of basic required elements: a surrogate sequence id (from the parent table), a load date stamp, a load end date stamp, and a record source. Database engines today do not natively support time-series based table structures. Due to this limitation, the architecture is forced to compensate with Load Date Stamps and Load End Date Stamps. These date stamps have been described in the common attributes chapter (Chapter 3) of this book.

The Satellite entity must NEVER contain foreign keys (except for the single parent on which it relies). If a Satellite structure is compromised, then the flexibility of the model is immediately compromised; in other words, all hope of future proofing the data model is immediately lost. You are then forced to re-engineer the data model in the near future when the business changes the way relationships are structured.

Satellites may contain unknown or not-yet-identified business keys until such time as the business keys become identifiable. While this is not a general practice, it is acceptable. When applying this standard rule, the business key housed in the Satellite is treated in the same manner as the rest of the descriptive data set – as just another descriptive element with changes tracked. However, once the business key becomes identifiable, it will be necessary (at that time) to split the key out to its own Hub and add a Link association to the current parent of the Satellite. Then, the Satellite data must be re-formulated without losing history. The process of reformulating Satellites is covered in the one-on-one coaching section and the Data Vault implementation book.

Satellites must have one and only one parent table; no others are allowed. Figure 6-1 below shows a standard structure of a Satellite entity.


Figure 6-1: Example Satellite Entity

The primary key of the Satellite is most often the parent sequence combined with the load date time stamp. If the Satellite is loading in real time, then it may be necessary to add a sub-sequence number or, if the database allows it, possibly a millisecond timer on the load date. By adding a sub-sequence or millisecond timer, real-time data can easily flow directly into the Satellite without creating duplicate primary keys (as a result of load date collisions).

There are even situations where the sub-sequence may represent the parallel pipe number that is feeding the Satellite with data (this of course is true if the data is arriving at 10,000 transactions per second across 10 parallel pipes); in that case, it may even be necessary to split the Satellite into 10 Satellites (one for each pipelined insert). Otherwise, the database may be overwhelmed by the loading speed. However, that is beyond the scope of this book and is discussed both in the one-on-one coaching section and the Data Vault implementation book.
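A minimal DDL sketch of the structure in Figure 6-1, assuming a Customer parent Hub (names are illustrative; the sub-sequence column is only required for real-time feeds, and defaults to zero otherwise):

    CREATE TABLE SAT_CUSTOMER (
        CUST_SQN      BIGINT        NOT NULL,  -- surrogate sequence from the parent Hub
        LOAD_DTS      TIMESTAMP     NOT NULL,  -- load date stamp
        SUB_SQN       INT DEFAULT 0 NOT NULL,  -- sub-sequence: avoids load date collisions
        LOAD_END_DTS  TIMESTAMP,               -- load end date stamp (NULL = current row)
        RECORD_SOURCE VARCHAR(30)   NOT NULL,
        CUST_NAME     VARCHAR(100),            -- descriptive attributes follow...
        CUST_ADDR     VARCHAR(200),
        PRIMARY KEY (CUST_SQN, LOAD_DTS, SUB_SQN),
        FOREIGN KEY (CUST_SQN) REFERENCES HUB_CUSTOMER (CUST_SQN)  -- the single allowed parent FK
    );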

6.3 Satellite Examples

Figure 6-2 shows several Satellite examples from the AdventureWorks 2008 Data Vault model. Cult = Culture, ProdDesc = Product Description, Prod = Product, Mod = Model, Prod_Loc = Product Location.


Figure 6-2: Example Satellite Entities

The data in each of these Satellites formulates the warehouse. Keep in mind that for this particular Data Vault model we are 1) dealing with a single source system, and 2) working with a model which has no identifiable or meaningful business keys.

6.4 Importance of Keeping History

History is partly what a data warehouse is all about. The Data Vault is no different, except that in the Data Vault, history is raw data. Satellite structures, being what they are, can be changed, altered, and re-designed (as is documented later in this chapter). It is important to remember: when a Satellite changes its design, 100% of the historical data must be preserved – or the Data Vault will no longer pass an audit.


As you continue through this chapter, please be mindful of this principle, and think about how the history can be preserved through the different changes. History serves as the audit trail of the source systems. When the only record available is in the Data Vault, the Data Vault you build becomes a system of record. There are differing opinions about what a system of record really is; however, as the business retires old sources, or as you implement operational data warehousing, the data warehouse is relied upon to make financial decisions. It is at this point that the data warehouse is a system of record.

6.5 Splitting Satellites by Classification or Type of Data

There are many different ways to define type of data. One way is to define type as data type. In this manner, the Satellites can be divided into different pieces based on their data types. History has shown the benefits of this approach to be as follows:

• Create a fixed-width row for bits, integers, dates/times (all non-varchar components)
• Create variable-width rows for all varchar / char attributes
• Create variable-length BLOB / CLOB / LOB objects
• Dramatically increase compression rates for data sets
• Decrease overall storage needs (by reducing the potential for chained rows)
• Easier management and maintenance
• No "guess work" involved in defining new Satellites
• Easier indexing strategies
• Easier partitioning strategies
• Easier query parallelism
• End result? Increased performance

Of course we can't ignore the nature of the query set. When classifying attributes into different Satellites by data type, it is important to remember the queries that will be grabbing the data sets, and to put them in context with the platform the queries are running on. For instance, if the platform is Teradata or IBM DB2 UDB EEE/MPP, the queries and parallelism will work quite well; likewise if the platform is SQL Server 2008 R2/MPP, or Oracle SMP big iron with partitioning and parallel query. If the platform is a DB2-based AS/400, however, then normalizing the Satellite goes against the performance principles of the HFS (hierarchical file system). Also, if the hardware is under-powered or under-sized, or the database has not been tuned appropriately, then the queries might not run so well.
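For illustration, a sketch of such a type-based split for the SAT_PROD Satellite shown in Figure 6-3 below (column names are loosely inspired by the AdventureWorks example and are illustrative only):

    -- Fixed-width Satellite: bits, integers, dates/times, decimals
    CREATE TABLE SAT_PROD_NUM (
        PROD_SQN      BIGINT      NOT NULL,
        LOAD_DTS      TIMESTAMP   NOT NULL,
        LOAD_END_DTS  TIMESTAMP,
        RECORD_SOURCE VARCHAR(30) NOT NULL,
        LIST_PRICE    DECIMAL(12,2),
        SAFETY_STOCK  INT,
        SELL_START_DT DATE,
        PRIMARY KEY (PROD_SQN, LOAD_DTS)
    );

    -- Variable-width Satellite: char / varchar attributes
    CREATE TABLE SAT_PROD_TEXT (
        PROD_SQN      BIGINT      NOT NULL,
        LOAD_DTS      TIMESTAMP   NOT NULL,
        LOAD_END_DTS  TIMESTAMP,
        RECORD_SOURCE VARCHAR(30) NOT NULL,
        PROD_NAME     VARCHAR(100),
        COLOR         VARCHAR(30),
        PRIMARY KEY (PROD_SQN, LOAD_DTS)
    );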


Furthermore, the database industry is changing (and by the time this is published, will have changed). The rise of NoSQL solutions (like Hadoop) and columnar databases will change the way we physically look at partitioning data sets. This is all beyond the scope of this book, and will be covered in the Data Vault implementation book, or in the one-on-one on-line coaching section of my web site at: http://danLinstedt.com

In the interest of discussion, and for the purposes of demonstration, the figure below (Figure 6-3) shows a split of SAT_PROD from Figure 6-2 (above) into multiple Satellites.

Figure 6-3: Satellites Split by Type of Data, Option 1

This is known as vertical partitioning: the act of splitting columns up (or normalizing) the tables into different groups tied together by the same key. This is a concept that the columnar databases follow for performance, but they take it to the extreme by splitting (normalizing) every single column into its own table structure.

One of the benefits of this technique is that it begins to show potential mismatches in data types and naming conventions. For instance, I would argue that size nvarchar(5) might be better off as a numeric, either decimal or float – unless of course the size includes a measurement like cm (centimeters) or in (inches). But in that case, the size and the size measurement should be separate columns – both stored in the numeric Satellite.


Upon further inspection, DaysToManufacture could be seen as a numeric (as it is now), or quite possibly as part of the date data types Satellite. It all depends on how the business desires to set up and manage its metadata ontology. Ontological definitions/classifications for fields in the table are as numerous as the stars in the sky. In other words, there are millions of combinations (ways) to split your data by type; spend some time in the design phase determining which manner of classification or grouping suits your needs best.

6.6 Splitting Satellites by Rate of Change

Rate of change is a similar topic to type or classification of data. Rate of change can be described in many different ways; however, in this particular case the term refers to how fast each element, or group of elements, changes in relationship to the others. For example, the rate of change of cell phone numbers for an individual may be exponentially higher/faster than the rate of change of that person's address. Lumping all this quickly changing data in with other, slower changing data causes data space explosion. Figure 6-4 shows an example of data that is denormalized into a single Satellite and changes at different rates.

Figure 6-4: Satellite Data Rate of Change Example

Although compression in the database offers a little bit of relief from the repetitive information, it still piles up over time. All this "extra data" records little to no change to the information as it flows in. The impact can be seen in longer loading times, longer backup times, bigger log spaces, larger temp areas (needed for queries), and slower queries overall (more I/O going on at the hardware level).


In the above example, the cell phone changes every day. The phone number may change every other day, but the name and address may change once a year or less. The rigor required for loading a new row in this instance can be painful. Not only does the data set in the table explode, the indexes for the fields also explode. What's worse, the coverage (the internal database rating for which index is best to use for a SQL statement) is dramatically reduced by duplicate data; which means that if you index the Name or Address columns, their selectivity will become very poor very quickly.

Note: if you think it looks OK now, or you don't see the harm in this (because you've been doing this for years with a type 2 dimension), then just try to imagine this happening with 10x the amount of data that you currently have. In other words, imagine this table with 100 million rows in it, when in reality you only have 6.6 million names and addresses (assuming each changed about 15 times total), and 100 million cell phone number changes. The performance is dreadful to think about because of the width or row size of the table.

This in turn leads to all kinds of attempts at database optimization, all because of a poor choice in the design of the table. This can be mitigated very easily by splitting the Satellite up by rate of change, which allows the table to settle into predictable patterns of repetition.

Figure 6-5: Satellite Split by Rate of Change

In Figure 6-5, we see a 5x disk space compression immediately take place in the contact name Satellite (because only 5 rows were shown). As it turns out, storing the information once is much more efficient and practical. The contact phone Satellite still has all the changes and all the history, but the row size is much smaller, the index coverage is better, and the performance is faster.

6.7 Satellites Arranged by Source System

When designing Satellites, sometimes the first instinct is to try to combine multiple systems' data directly into a single structure. This can be good and bad. By combining multiple systems' data into a single structure, there are a lot of considerations to be made – these are covered in the next section: Overloaded Satellites (the Flip-Flop Effect). The best practice, and the easiest to get comfortable with, is to split the Satellite data into separate Satellites, one per source system.

The next question that comes to mind is: why then include a record source? To be honest, because the source of the data may still need to be geographically identified, or possibly application identified. For instance, the source may be SAP – sales module, but the SAP sales module may have been implemented across more than one source system (physical machine). It is also possible that it may be implemented in different geographic regions. The key to using a single Satellite per application is to ensure there is a match across the structures, and that the metadata is defined the same way by the business.

Note: The best practice is to split the Satellites across each source system.

What benefits does this provide?

• Allows the designer to add new systems as they come in the door without impacting existing designs and existing data sets.
• Removes the need to "fight" over what the data means, how to integrate it, and whether or not it needs to be split, concatenated, lengthened, shortened, or otherwise manipulated.
• Allows different data sets from different sources to populate their audit trail in accordance with their rate of change and type of data (where in this case, type represents the source system).
• Solves the problem of disparate data arrival times. In other words, if or when the data arrives, it is inserted directly into its Satellite for that system; there is little to no competition (at the I/O or database level) for that resource (table). This allows us to maximize load parallelism.
• Allows real-time data to flow from one system while batch data flows from another – limits the exposure to the risk of having to "merge" data sets on the fly.
• Eliminates the dependencies across multiple systems that would force those systems to have the data ready at the same time.

They say a picture is worth a thousand words; Figure 6-6 provides a generic example of what this might look like in a Data Vault model.


Figure 6-6: Customer Satellites Split by Source System

In this example, there are some overlaps for customers, including Name and Phone Number – but that is where the similarities appear to stop. Each system probably has its own unique way of defining what customer means! But the business stated in their requirements very clearly: if the customer record has the same business key in each system, then it is supposed to represent the same customer. Very rarely does this data ever line up in the beginning, especially once history is loaded to the Satellite.

The job of a good Data Warehouse is to point out or make known the discrepancies (the gap analysis) between the way the business believes it is operating and the way the source systems are truly running. The job of the Data Warehouse is not to filter the information or alter it in any way – to do so would violate compliance and auditability rules. Here, the discrepancies are plain to see, and if you look closely you will notice that statistics can be run across the source systems to see how far out of alignment they are with each other. In other words: what is it costing my business to have broken business rules in different source systems? This question can finally be answered with metrics and measurements.

It is quite possible that from profiling the data in this example, one might learn that contacts really should have their own business keys, because they are totally and uniquely distinct from the notion of customer. Or one might learn the opposite: that all contacts are customers. The point is: the Data Vault should assist in telling the story, and splitting the source systems across multiple Satellites makes it easier to spot these erroneous patterns.

6.8 Overloaded Satellites (The Flip-Flop Effect)

Now suppose there is a need to see all the data in a single Satellite; what do we do then? There are specific reasons why and why not to do this. There are inherent risks as well; some of those risks were covered in the last section. Take a minute to check the last section to ensure you didn't miss anything important. This technique is called overloading because it allows multiple definitions of source system data to insert multiple rows of data into the same table. The hope is that the metadata definitions are the same for the fields, but there is no way to enforce that. Therefore, the data frequently becomes "messy" very quickly.

We can see the effects of overloading when viewing data sets in legacy systems. That is: "a single field used for multiple purposes, and multiple meanings based on character position and appearance." In other words, smart-key data where no edit checks are in place in the application; tacked on to the Cobol copybook are multiple re-defines and programmatic logic to re-define what the data should represent.

Overloading a Satellite is not necessary given today's technology, and brings with it many risks, such as misinterpretation, misunderstanding, inability to see patterns, difficulty discovering problems in the data, and risk of audit problems. Another issue that overloading a structure bubbles to the surface is the question of: "what do we do with the data set? Do we join it all together to make one best-looking row for insert? Do we run rules against it to coalesce it together?" It is a slippery slope that leads right back to where we started: with business rules being implemented up-stream of the EDW. This is not what we want.

However, all that said, Figure 6-7 represents what an overloaded Satellite might look like (from a data perspective), and the following paragraphs explain what (if any) good uses there might be for this type of design.

Figure 6-7: Satellite Overload from Many Sources


First, notice the record source. Each record source indicates a different source system (for purposes of this example the sources are lines of business). It is not entirely clear which source system should be the master system. Of course, all kinds of questions arise from this example:

• Does one row supersede another?
• Which row is the "right" most current row?
• The primary key is duplicated; should we include the record source as part of the primary key to resolve this?
• What about the rest of the fields? What if they are NULL or not available in other sources?
• Should we create a combined/merged row during the load to this table?

These questions and many more will begin to pop out with additional overloaded Satellites. So what's the benefit of overloading if there are so many issues? The only benefit that I've personally experienced in the past is: to get the business to deal with the source of the problem because the Data Vault ran out of disk space. The load cycle would load 15 million x 5 source-feed rows on every load, because the loading mechanism detected a delta.

This brings up another point: when a Satellite is overloaded, the loading cycle begins to take a turn toward the serial path. Loading the Satellite must be done in reverse order (from least important to most important), whereby the last row to delta (be inserted) becomes the most current, and all the others get end-dated. Again, implementation is covered in the other book – this explanation is only necessary to show the gravity (risk) of this design.

The better design is to split the Satellites by source system. This allows each business unit to define which system is their master system, and when building the data marts, each Satellite will then provide the most current row to the process. Furthermore, by splitting the Satellites out (as described in the previous section), the load can happen in parallel. Subsequent to the load, a reporting structure could be built that attempts to merge the multiple Satellites' data into one table for the purposes of cross-system data quality checking.

Let it be known: overloading a Satellite incurs many risks. From metadata to understanding, from load performance to indexing, from data quality to merging – all of these take a toll on the business and on IT.

6.9 Satellite Applications

The Satellite can be utilized in different manners. There are some specific applications for Satellites which are fairly common to Data Vault implementations world-wide. These applications are documented in this book to assist you. The uses include: effectivity tracking, record tracking, status tracking, and computed Satellites, just to name a few.

These types of Satellites are typically called system-driven Satellites, and are not included as a part of the audit trail for the data warehouse. The reason is that application logic determines the contents, and in general (except for computed Satellites) the data in these Satellites can be backed up, rolled off, deleted, and re-built without harm to the end user or business data. The data in these Satellites is generally for IT utilization: to track the data itself, to assist with performance, or to determine start and stop control over the Linking mechanisms. The following sections cover the application usages of Satellites.

6.9.1 Effectivity Satellites

One of the applications for Satellites is called effectivity. This type of Satellite is most often found hanging off a Link table (see Figure 6-8). Its purpose is to track when the Link is active according to the business. It is a temporal based structure that houses begin and end dates for the association, or for the business key it represents.

Figure 6-8: Satellite Effectivity

Keep in mind that these dates are not system computed. These dates must arrive on a source feed, or be set in the source system via a source application, an audit trail, or a change data capture (CDC) program. These dates are auditable, and must be traceable back to the source in order to pass an audit. If (in contrast) you are looking to guess when data appears/disappears from a feed, then the status tracking Satellite (discussed below) is a better fit.
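A minimal sketch of an effectivity Satellite hanging off a Link (table and column names are illustrative):

    CREATE TABLE SAT_LNK_CUST_ORDER_EFF (
        LNK_CUST_ORDER_SQN BIGINT      NOT NULL,  -- parent Link surrogate sequence
        LOAD_DTS           TIMESTAMP   NOT NULL,
        LOAD_END_DTS       TIMESTAMP,
        RECORD_SOURCE      VARCHAR(30) NOT NULL,  -- must trace back to the source feed
        EFF_BEGIN_DT       DATE        NOT NULL,  -- business-supplied: association became active
        EFF_END_DT         DATE,                  -- business-supplied: ended (NULL = still active)
        PRIMARY KEY (LNK_CUST_ORDER_SQN, LOAD_DTS)
    );

Note the distinction in the sketch: the load date stamps are system generated, while the begin and end dates are business data carried from the source feed, which is what keeps this Satellite auditable.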


Note: all children (all Satellites hanging off this Link or Hub) are ended when the effectivity of the association is ended. This status of “ended” must be forwarded to the data marts that are fed from the Data Vault. This is business user / source system data, and must be included in all queries that access this information.

In addition, not all Links and not all Hubs are a good fit for effectivity. These Satellites are strictly for business user (source system) based data. It is not necessary to create an effectivity Satellite for every Hub and Link in the model – unless the source system delivers the data set for every business key and every relationship, in which case it would become a data miner's gold mine!

6.9.2 Record Tracking Satellites

There is another application of Satellites called record tracking. Record tracking data is system generated and is (generally speaking) not auditable. The purpose of the record tracking Satellite is to identify which source applications are feeding which keys and associations on which load cycles.

It originated from a need to capture changes (missing rows) from a source feed: we received a full dump of a legacy system every day, and rows would disappear for three days and then re-appear. We were told that just because rows "disappear" for three days, it doesn't mean they are deleted. Furthermore, we didn't have any CDC in place – so when a record was truly deleted, it went missing for an extended period of time. The business wanted a way of identifying the difference between "missing for a few days" and "was deleted." They settled on a rule that said: for Data X, if the key doesn't show up from this application for 7 consecutive days, then mark it deleted. They had other rules for other data; i.e., for Data Y it was 30 days, and for Data Z it was 5 days. This discussion is similar in nature (and related in concept) to the LAST SEEN DATE discussion in section 3.5.

Record source tracking Satellites indicate each system's arrival, on a load-cycle basis. For each key in each source system, or for each association in each source system, an insert is made to the record tracking Satellite indicating that it was present on the feed during the current load cycle. The load cycle is identified by load date, or by load-cycle-id where load date has been replaced. Because this Satellite is non-auditable (other than IT metrics), its rules for use and definition can be bent without breaking the architecture; the structure itself doesn't change – what does change is the way the data is treated. The following rules apply to record source tracking Satellites:

• A row is inserted (regardless of delta) for every day the key / association appears on the feed. In other words, it is not subject to delta processing.
• To avoid data explosion, each column (or the table itself) must be compressed.
• Because it is system driven, old load-cycle information may be summarized, and rolled off or deleted without harm. By rolled off, you may choose to back it up or move it to slower storage.

Figure 6-9: Denormalized Record Source Tracking Satellite

The data in Figure 6-9 is stored in a denormalized format. The legend is as follows:

• RS_MFG = Record Source Manufacturing
• RS_FIN = Record Source Finance
• RS_SLS = Record Source Sales
• RS_CONTR = Record Source Contracts

Notice that this Satellite has no load-end-date, and has no "record source" itself. The reason is that all rows are SYSGEN (system generated). End-dates are not necessary because, as described above, an entry is inserted (regardless of delta) for each load cycle that is tracked. Row 1 and Row 2 are identical in nature. This is a great format for partitioning, filtering, and querying; it provides extremely fast access to these components and to the discovery of which feeds keys "appear and disappear" on. However, be warned – it introduces an "insert followed by an update" for each record source that is added to the table.

Also note that each record source is specifically named in the metadata (the structure) of the table. This makes the table structurally driven as opposed to data driven. This may or may not be the right approach for your Data Vault, and it does add a level of complexity, since the structure (and loads) would have to change if a source system is added. An alternative method of storage and tracking is available, and yet again requires a minor change to the structure of the Satellite and its corresponding primary key. The alternative is commonly known as a pivot of the above table – in this case, normalized. Figure 6-10 represents the same data in a more dynamic format:


Figure 6-10: Normalized Record Source Tracking Satellite

In this case, it is easy to add new record sources dynamically. There is no limit to parallel inserts; therefore there is no limit to the scalability of this table. It pushes the complexity downstream to the query (for interpretation and pivoting). If necessary, review the record source column definition in section 3.9. Assuming this RS Satellite is a child of Customer, the data might be interpreted as follows (assume Sequence 1 = Customer Key ABC123):

• On 10-14-2000, ABC123 (the key) appeared on the Manufacturing feed; however, it did NOT appear on the finance, contracts, or sales feeds.
• On 10-15-2000, ABC123 appeared on the Manufacturing feed, and did not appear on finance, contracts, or sales.
• On 10-16-2000, ABC123 appeared on all feeds EXCEPT contracts.

If the business provides detailed record sources (that might even indicate the point of origin within a business process), then they might be able to begin tracing the keys through the business processes. An astute data miner could make good use of this information to help the business understand how and when the data is moving through the systems. Someone who misses this concept sees no value in utilizing record source tracking Satellites.
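For illustration, the pivot that the normalized form pushes downstream might be expressed as follows (the table name, column names, and feed codes are hypothetical):

    SELECT CUST_SQN,
           LOAD_DTS,
           MAX(CASE WHEN RECORD_SOURCE = 'MFG'   THEN 1 ELSE 0 END) AS ON_MFG_FEED,
           MAX(CASE WHEN RECORD_SOURCE = 'FIN'   THEN 1 ELSE 0 END) AS ON_FIN_FEED,
           MAX(CASE WHEN RECORD_SOURCE = 'SLS'   THEN 1 ELSE 0 END) AS ON_SLS_FEED,
           MAX(CASE WHEN RECORD_SOURCE = 'CONTR' THEN 1 ELSE 0 END) AS ON_CONTR_FEED
      FROM SAT_CUST_RECTRACK      -- normalized: one row per key, load cycle, and feed
     GROUP BY CUST_SQN, LOAD_DTS
     ORDER BY CUST_SQN, LOAD_DTS;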

6.9.3 Status Tracking Satellites

Status tracking Satellites are put in place to track CRUD (create, update, delete) operations from the source system. These statuses are generally produced by audit trails or change data capture systems. If the status is available and produced by either a source system or a source application (maybe a business process application), then it can and should be tracked. The best way to do this is to track the status separately from the rest of the information (i.e., put it in its own Satellite).

These statuses should represent the state of the business key. In other words, when a business key or relationship is created, a CDC record would arrive in the Data Vault indicating "Create" or "Insert". When it is updated, a new status would be issued – and of course when it is deleted, an insert to the status tracking Satellite would indicate that a delete had occurred on the source.

Status tracking Satellites (see Figure 6-11) allow visibility into the disappearance and reappearance of business keys and relationships. The business book of the Data Vault discusses the notion of business key re-use, and how businesses that re-use their keys (to represent different data) are actually prone to "lose" money. This is because they lose audit trail capacity at that point, and it becomes more difficult to trace the data set back over time. Business users also get confused when business keys are re-used. However, this topic is out of scope for this book.

Figure 6-11: Status Tracking Satellite

Status tracking Satellites should be normalized, and should follow the standard Satellite layout and rules, such as insert only when changes are detected. They should have compression turned on to make best use of the storage space, and if you're lucky, they will help you identify which source applications or source business processes do and do not match business requirements. Statuses may be inserted from multiple sources during the same load cycle. This may or may not lead to multiple active Satellite rows (which are described in section 6.9.5 below).


The best practice is to insert only the "master system status" or to split the status Satellite by source system. If the Satellite is split by source systems, then you have simply postponed the decision of assigning the master system (for query purposes of selecting current status) until you access the information (loading to a data mart).

6.9.4 Computed Satellites (Quality Generated)

At first glance, everything in the Data Vault looks as though it has to be raw data. For the most part, this is true (and is one of the fundamental premises of the approach). Often, though, there is a need to process raw data through quality routines, cleansing routines, and address correction routines; generally the desire is to run these routines once and then distribute all the information downstream to the data marts. Within the Data Vault methodology and architecture there is a place for this data: it's called a computed Satellite.

The computed Satellite is a standard Satellite structure (with all the same rules, formats, and structural integrity). The difference is that the record source is SYSGEN (system generated information) – or potentially the name of the application that is performing the data alterations. Computed Satellites are not auditable (generally). I say "generally" because when or if the data is used to run the business or make financial decisions, there is a good possibility that an auditor will come back and expect to see how, when, and what the data was.

From an implementation perspective, it is suggested that you split the computed Satellites off to their own disk storage area. It may be wise to place them on SSD (solid state disk) if they are highly accessed and need to be extremely fast. At a minimum, they should be placed on their own I/O channels and their own storage so they do not compete for read/write resources with the raw data sets.

6.9.5 Multiple Active Satellite Rows

Multiple active Satellite rows are similar to Satellite overloading, which was discussed in section 6.8 above. The concept here is that there are several rows per key that are "alive, active, and valid" all at the same time. In most cases, they would be arriving in the Satellite from different systems. However, there are times when the data is normalized (as in the example below) that make it a better choice to have multi-active Satellite rows.


For instance, consider a part number which is assigned a status flag: in the manufacturing system it's an ACTIVE part; in the planning system it's an inactive part. This part number may have multiple statuses from multiple systems, and they may or may not be valid (depending on the view point of the user and the definition of the flag). Multiple active Satellite rows can be averted easily (most of the time) by splitting the data by source system, although in some cases you may want to split it further by application within each source system.

Suppose, however, that you have a list of phone numbers on incoming data, and that you never know just how many phone number columns will arrive. Some days your loading process may see 3 phone numbers, other days it may see 5 – and even within the same load batch, the number of phone numbers is variable. In this case, it is extremely difficult to "architect" the right set of phone number columns in a Satellite, and the last thing that should be considered is phone_1, phone_2, phone_3, etc., causing wide rows, sparse population, and a bunch of NULL column values. It is precisely for these reasons that multi-active Satellite rows exist!

Figure 6-12 below demonstrates the loading of hierarchical XML data; it could also represent a hierarchical Cobol data set. Any hierarchically structured information with a variable list length is a candidate for this technique. By normalizing the structure, the architecture is well-suited to absorb an unknown number of elements per parent record. The normalized Satellite has an additional element in the primary key known as a sub-sequence number. Sub-sequence numbers are discussed in section 3.2 of this book. They basically provide a mechanism with which to uniquely identify the data.
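A minimal sketch of such a normalized, multi-active Satellite (names are illustrative):

    CREATE TABLE SAT_CUST_PHONE (
        CUST_SQN      BIGINT      NOT NULL,
        LOAD_DTS      TIMESTAMP   NOT NULL,
        SUB_SQN       INT         NOT NULL,  -- 1..n: one row per phone number on the feed
        LOAD_END_DTS  TIMESTAMP,
        RECORD_SOURCE VARCHAR(30) NOT NULL,
        PHONE_NUMBER  VARCHAR(25),
        PRIMARY KEY (CUST_SQN, LOAD_DTS, SUB_SQN)
    );

Whether three or five phone numbers arrive for a customer on a given load, each one simply becomes another row under the same key and load date.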

Figure 6-12: Multi-Active Satellite Rows


In some cases it might make sense to replace the sub-sequence number with an actual piece of information that the business users understand. In one very special example (not shown here), the architect replaced the sub-sequence number with a copy of the phone number. This technique allowed them to overcome a difficulty in tracking the Satellite data from load to load. While loading and implementation are not a focus of this book, this idea is briefly discussed here as it has relevance to the structure choices made by the architect.

One of the issues of utilizing sub-sequence numbers is that they introduce order-dependency to the load cycle. In other words, from one load to the next, if the order of the phone numbers changes, then it is seen as an entirely new delta for the employee – which means all the phone numbers are re-inserted as delta rows, even if the phone numbers themselves did not change.

This can be mitigated in two ways: one (as described above), using the phone number as the sub-sequence number, which removes the order dependency during delta checking; or two, checking for the existence of the phone number in the Satellite as a currently active row before inserting. Option 1 destroys any chance of reproducing the data set in the proper order in which it arrived (if this is important to you, then sub-sequencing is the only way). Option 2 doesn't catch deleted phone numbers that may have disappeared from the incoming data set.

In cases where there is no numeric column alternative, replacing the sub-sequence with an alphanumeric causes great problems with performance. Unfortunately, this is one of the cases where choosing the best worst-case scenario seems to be the ideal. In such a case, sub-sequencing is always the architectural fall-back; please couple that choice with turning on compression of duplicate data across the table. This will assist with maintaining integrity while allowing an unknown number of elements to flow through the Data Vault.

Figure 6-13: Multi-Active Satellite Row Data


In Figure 6-13, it is easy to see that customer 1 loaded four phone numbers on 10-14-2000. They are all superseded by the load on 10-20-2000, as-is – the loading mechanism doesn't make any attempt to detect deltas; it simply captures changes to the ordering along with new and deleted phone numbers. This is because the business decided that keeping the order of the phone numbers has business importance. This is a prime case for turning on table compression.

Figure 6-14: Multi-Active Satellite with Business Sub-Sequence

In Figure 6-14, the phone number has replaced the actual "ordering" column that was used in Figure 6-13. This allows detection of existing rows and discovery of rows that are deleted from the source system. This is not the preferred technique, as it requires a unique column that is available in numeric format.

6.10 Splitting Satellites

Splitting a Satellite occurs when one or more columns within the Satellite begin changing at different rates (when compared to other columns within the Satellite). The reasons why we split Satellites are discussed in sections 6.5 (classification or type) and 6.6 (rate of change). This section introduces you to how to split the Satellite, and the concepts needed to make it work properly.

There is a set of standards around splitting Satellites, and while most of this information is about loading and moving data around, it is within the purview of this book to discuss, mostly because it also involves architectural changes. The steps for splitting a Satellite are below:

• Identify data in a Satellite that changes more rapidly than the other data in the Satellite
• Group those "common" elements together by rate of change
• Split the Satellite architecturally by creating a new Satellite and moving those elements
• Load the data to the new Satellite by simply copying the existing columns, load dates, and sequences
• Run another process that begins removing duplicates (looks for deltas)
• Run a final process that updates the load-end dates after the dupes are removed

The most important concept to hold to when splitting a Satellite is to maintain 100% of the history of the data. If any of the history is lost by deleting rows that contain deltas, then the new Satellite must be truncated and re-loaded from the original Satellite. Maintaining the audit trail is vitally important. Once the audit trail has been checked and verified in both new Satellites, then the old Satellite can be deleted/removed. It is recommended that you run queries (in parallel to the old Satellite) for a few weeks against the new Satellites to match the results. Once a balance has been established, and they are both showing equal results, then and only then can you delete the old Satellite and replace any affected downstream processes or extracts. A SQL sketch of the copy, de-duplication, and end-date steps follows.
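As a sketch of the data-handling steps above (Postgres-flavored SQL; the Satellite and column names match the illustrative example that follows):

    -- Copy the fast-changing columns (plus system fields) to the new Satellite
    INSERT INTO SAT_CUST_PHONE (CUST_SQN, LOAD_DTS, RECORD_SOURCE, PHONE, CELL)
    SELECT CUST_SQN, LOAD_DTS, RECORD_SOURCE, PHONE, CELL
      FROM SAT_CUSTOMER;

    -- Remove rows that are no longer deltas in the narrower structure:
    -- a row is a pure duplicate if it matches its immediate predecessor
    DELETE FROM SAT_CUST_PHONE
     WHERE (CUST_SQN, LOAD_DTS) IN (
           SELECT CUST_SQN, LOAD_DTS
             FROM (SELECT CUST_SQN, LOAD_DTS, PHONE, CELL,
                          LAG(PHONE) OVER w AS PREV_PHONE,
                          LAG(CELL)  OVER w AS PREV_CELL
                     FROM SAT_CUST_PHONE
                   WINDOW w AS (PARTITION BY CUST_SQN ORDER BY LOAD_DTS)) x
            WHERE x.PHONE = x.PREV_PHONE
              AND x.CELL  = x.PREV_CELL);

    -- Re-derive the load-end-dates once the duplicates are gone
    UPDATE SAT_CUST_PHONE s
       SET LOAD_END_DTS = (SELECT MIN(n.LOAD_DTS)
                             FROM SAT_CUST_PHONE n
                            WHERE n.CUST_SQN = s.CUST_SQN
                              AND n.LOAD_DTS > s.LOAD_DTS);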

Figure 6-15: Step 1: Identify Satellite Split Columns

In Figure 6-15, the column that is changing most frequently is the cell phone. The phone number column also changes – not quite as fast as the cell phone, but still more frequently than the name and address. Therefore, we will split the Satellite into two different structures (see Figure 6-16), grouping the phone and cell phone together in one structure while moving the rest of the data set to another structure.


Figure 6-16: Step 2: Split Satellite Columns, Design New Tables

Now that the columns are properly split apart, and the new structures have been built, the next step is to handle the data (Figure 6-17).

Figure 6-17: Step 3: Copy Data From Original to New Satellites


It is a replicated copy, and at this point it is easy to see that the customer name Satellite contains duplicate entries, which does not follow the standards for deltas to be maintained within the Satellites. The next step (Figure 6-18) is to delete/remove the pure duplicates. This process of duplicate elimination must be run against each new Satellite. Otherwise, you might miss subtle duplicates that reside in the new Satellites.

Figure 6-18: Step 4: Eliminate Duplicates

This particular example does not show all the details of eliminating duplicates; in fact, by simply eliminating the duplicates in this example, we no longer need to run the next processing step: adjusting the begin and end dates. However, these steps will run regardless, as there may be other rows of information that require fine-tuning. The example below (Figure 6-19) shows the data after copy-in, but before running reduction and delta processing.

Figure 6-19: Step 4: Alternate Elimination of Duplicates


After we have eliminated the duplicates, and selected the earliest load date in a series and the last load-end-date of the series, we have succeeded in properly adjusting the Satellite. Figure 6-20 represents the final stage of a Satellite after processing is complete.

Figure 6-20: Step 5: End Dates Adjusted After Satellite Split

Once the Satellites have been successfully split, and the data set made available, it is recommended to release the data in a parallel path; meaning: set up a duplicated data mart area, re-direct copies of the loading routines to the new data mart, then spend a week or so comparing the new information to the old. Once reconciliation has been accomplished, you can switch over to the new structures and drop the old ones, releasing quite a bit of disk space.

Don't forget! Sometimes normalizing the data set and increasing the parallelism leads to higher numbers of I/O calls; however, when run in parallel, it leads to an overall reduction in processing time. A balance needs to be achieved for the best performance possible.

6.11 Consolidating Satellites

After splitting Satellites, occasionally (over time) the data set's rate of change begins to slow down again or, in some cases, the changes disappear. If the source system is retired, or the source data feed is switched over, or the business focus changes, then the rate at which the data changes may slow down or stop altogether. In these situations you may consider consolidating two or more Satellites together. This is commonly known as denormalization.

BE WARE! Over-denormalization leads to rows that are very wide and unwieldy. If you denormalize too far, the performance of the entire system will suffer tremendously, because the I/O count rises, and fragmentation and chained rows will appear all across the processes.


So why would you want to do this? When the rates of change have settled, consolidating reduces the number of Satellites (and therefore joins) around a parent. The steps to consolidate are as follows (a SQL sketch of steps 3 and 4 follows the list):

1. Identify data sets that are changing at the same rates of speed in one or more Satellites related to the same parent.
2. Design the consolidated structure consisting of a combination of all user-based Satellite defined fields.
3. Join the Satellites across parent sequence numbers, ordered by earliest load date; note: this will likely produce a Cartesian join product, which is something we actually want to start with.
4. Find the earliest load date from all the Satellites for each row produced by the join, and select that row for insert – using the load date that was chosen.
5. Run a post-insert process to clean up end-dates.
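A Postgres-flavored sketch of steps 3 and 4 (Satellite names other than SAT_CUST_CONTACT_ADDR are illustrative); it produces one row per distinct load date seen in any of the Satellites, carrying the values in effect as of that date, with NULL where a Satellite has no data yet:

    INSERT INTO SAT_CUST_CONSOL (CUST_SQN, LOAD_DTS, CUST_NAME, CELL, ADDR)
    SELECT d.CUST_SQN, d.LOAD_DTS,
           (SELECT n.CUST_NAME FROM SAT_CUST_NAME n
             WHERE n.CUST_SQN = d.CUST_SQN AND n.LOAD_DTS <= d.LOAD_DTS
             ORDER BY n.LOAD_DTS DESC LIMIT 1),
           (SELECT c.CELL FROM SAT_CUST_CELL c
             WHERE c.CUST_SQN = d.CUST_SQN AND c.LOAD_DTS <= d.LOAD_DTS
             ORDER BY c.LOAD_DTS DESC LIMIT 1),
           (SELECT a.ADDR FROM SAT_CUST_CONTACT_ADDR a
             WHERE a.CUST_SQN = d.CUST_SQN AND a.LOAD_DTS <= d.LOAD_DTS
             ORDER BY a.LOAD_DTS DESC LIMIT 1)
      FROM (SELECT CUST_SQN, LOAD_DTS FROM SAT_CUST_NAME
            UNION
            SELECT CUST_SQN, LOAD_DTS FROM SAT_CUST_CELL
            UNION
            SELECT CUST_SQN, LOAD_DTS FROM SAT_CUST_CONTACT_ADDR) d;  -- union of all change dates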

Figure 6-21 contains an example of "split data" that has settled and needs to be re-consolidated. This section will walk through the process of putting the data back together.

Figure 6-21: Consolidating Satellite Data

This example shows the data set; please note that this example does not show record source propagation. I do, however, discuss this component in the following paragraphs.


The date in purple (highlighted/bold) in SAT_CUST_CONTACT_ADDR is the first and earliest date for this particular customer (08/01/2000). Therefore it results in the first row inserted into the consolidated Satellite. The name and cell are NULL because there is no data for that customer available as of that date in those Satellites. The record sources are all "??" because resolving them can be a matter of interpretation, or a decision made by the business user as to which system is the MASTER system.

Please note: if the data in the original Satellites was previously "split" apart, then there's a chance that the record sources for all rows across all split Satellites would be the same. In this case it is OK to select the one available record source value and assign it to the newly combined row in the consolidated Satellite. If this is not the case, please see the discussion below.

If record sources vary across the multiple split Satellites (from row to row within a given parent key), then a decision must be made during consolidation: which record source to use? This decision should be put forward to the business users for complete resolution and sign-off; however, that is not always possible. For the cases where the business users won't make the decision, the following rule of thumb is provided: first, select the record source from the same table that houses the earliest load date being selected. If this does not produce the desired outcome, then select the master system record source from the Satellite in which it appears.

Unfortunately, during consolidation of multi-system Satellites you may lose metadata. Please be aware that if the metadata is lost, the only way to correctly audit the system will be to restore that day's load to the staging area for further review. At the end of this process of consolidation, run the assignment of load-end-dates to properly adjust the dates and times of the data set. In Figure 6-22, I've included the load-end-date calculations after they've been set.


Figure 6-22: Load End Dates Calculated in Consolidated Satellite

It is important to reconcile the information all the way through to the Data Marts before deleting or destroying the previously split Satellites. Run the data set in parallel for a week or so through to a new data mart in order to balance the information and reconcile the results to the old data mart. When buy-off is achieved, then it will be safe to back up and roll off the old split Satellites. Splitting and consolidating can happen at any time during the life-cycle of the Data Vault. It's a judgment call based on the rates of change in the data set, and the width of the rows.

7.0 Query Assistant Tables

Query assistant tables are built for one reason: performance. The architecture of the Data Vault does not require these tables to survive (i.e., they are optional components). Due to restrictions in current database engines and hardware, these tables are sometimes necessary in order to gain additional performance from the Data Vault models.

These tables have a second reason for existence: buffering queries from streaming real-time data. The business may have an SLA (service level agreement) in place that states they can accommodate data changes every 5 minutes on their dashboard. If data is arriving every 10 seconds, the query must be buffered to a 5 minute increment so that the dashboard is updated in accordance with business needs. The query assistant tables can function as scheduled snapshot tables to buffer real-time feeds.

There are two basic types of query assistant tables: point-in-time (PIT) tables and Bridge tables. These table types are described below. These tables are not needed in columnar databases or flat-wide devices such as Netezza, and generally are not necessary in Teradata either.

7.1 Point in Time Tables

Point in time tables (PIT tables) are structures which surround a single Hub (or Link) and its corresponding Satellites. A PIT table is defined as: a structure which sustains integrity of joins across time to all the Satellites that are connected to the Hub. It is a specialized form of Satellite. There is a single PIT table built for each Hub. These tables cannot and should not span multiple Hubs and Links. Figure 7-1 shows the basic structure of a PIT table.

Figure 7-1: Structure of PIT Table
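As a minimal sketch of the structure in Figure 7-1, here is a hypothetical PIT table for a customer Hub with three surrounding Satellites; all table and column names are illustrative:

CREATE TABLE PIT_CUSTOMER (
    CUST_SQN           INTEGER   NOT NULL, -- Hub surrogate (sequence)
    SNAPSHOT_DTS       TIMESTAMP NOT NULL, -- point-in-time snapshot date
    SAT_CUST_NAME_LDTS TIMESTAMP,          -- load date of the Satellite row
    SAT_CUST_ADDR_LDTS TIMESTAMP,          --   active as of the snapshot,
    SAT_CUST_CELL_LDTS TIMESTAMP,          --   one column per Satellite
    PRIMARY KEY (CUST_SQN, SNAPSHOT_DTS)
);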

Record sources are not necessary, as the PIT table is a system generated table. Should you wish to include record source you may, and the only reason for doing so would be because you have multiple loading processes populating the PIT table. End-dates are not necessary unless you wish to enable BETWEEN queries against the snapshot information. PIT tables provide equal-join access to tables around a Hub rather than focusing on outer-join queries to answer questions. This is why PIT tables are a query assistant table only. PIT tables should not be created until or unless there is a performance problem with accessing the Satellites around a single Hub. Figure 7-2 shows an architectural depiction of where PIT tables fit within the Data Vault.

Figure 7-2: PIT Table Architecture Overview

While it's not shown here, you may add PIT tables to help join Link Satellites as well. In general, PIT tables are not added, and not usually needed, until there are 3 or more Satellites off a parent. A data example for one of the PIT tables is shown in Figure 7-3 to help elaborate on the notion of a PIT table.

Figure 7-3: Example PIT Table with Snapshot Dates

One of the key points to this Figure is: you can schedule the snapshot process as frequently as you need it; the preferred schedule is to load all PIT tables as the very last step in the load. For example, you can run the process as fast as every second, or as slow as every six or 12 hours.

Another function of both the PIT and Bridge tables is to buffer real-time input feeds from alerting processes in tactical reports. It will keep the "image" of the data consistent between snapshot dates/times to ensure that a consistent view of the data can be retrieved, even though the real-time data may be streaming in to other Satellites around the Hub. This requires all extract/retrieval processes to query the Data Vault using the PIT tables.

PIT tables may contain the Hub business key if so desired (to avoid joining back to the Hub in order to retrieve the business key). PIT tables may not contain any other Satellite data. If you wish to construct your own version of a PIT table you may do so under the guise of computed Satellites (discussed in the last chapter). PIT tables are specifically architected for enhancing the performance of queries. It is best, however, to keep the rows as short as possible (width wise) in order to keep the performance as fast as possible. It is also suggested that database columnar compression be turned on (where available) for the PIT tables to make them work more efficiently.

As a final note, it is recommended that the data be contained within a managed window; meaning that the number of rows be kept to a maximum threshold. When the data becomes too "old" it should be rolled off the back (deleted) so that the new data can be added. By keeping this rolling window of data you will be able to easily tune performance.
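A hedged sketch of the snapshot step and the equal-join it enables, using the hypothetical PIT_CUSTOMER structure above (the remaining Satellite columns would follow the same pattern):

-- Scheduled snapshot: one row per Hub key per snapshot date/time.
INSERT INTO PIT_CUSTOMER
       (CUST_SQN, SNAPSHOT_DTS, SAT_CUST_NAME_LDTS, SAT_CUST_ADDR_LDTS)
SELECT h.CUST_SQN,
       CURRENT_TIMESTAMP,
       (SELECT MAX(s.LOAD_DTS) FROM SAT_CUST_NAME s
         WHERE s.CUST_SQN = h.CUST_SQN),          -- most recent active row
       (SELECT MAX(s.LOAD_DTS) FROM SAT_CUST_CONTACT_ADDR s
         WHERE s.CUST_SQN = h.CUST_SQN)
FROM   HUB_CUSTOMER h;

-- Retrieval then becomes equal joins rather than outer joins:
SELECT p.CUST_SQN, n.NAME, a.ADDR
FROM   PIT_CUSTOMER p
JOIN   SAT_CUST_NAME n
       ON n.CUST_SQN = p.CUST_SQN AND n.LOAD_DTS = p.SAT_CUST_NAME_LDTS
JOIN   SAT_CUST_CONTACT_ADDR a
       ON a.CUST_SQN = p.CUST_SQN AND a.LOAD_DTS = p.SAT_CUST_ADDR_LDTS
WHERE  p.SNAPSHOT_DTS = TIMESTAMP '2000-08-01 00:00:00';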

7.2 Bridge Tables

Bridge tables are nearly the same as PIT tables – they serve the same purposes, and have the same goals: to improve performance of the queries. They should not be constructed until and unless performance of the parallel joins across the Data Vault is too slow. The difference between this and the PIT table is that the Bridge construct focuses on joining across multiple Hubs and Links. It is therefore a specialized form of a Link table. The Bridge table can be thought of as a higher-level factless fact (something like that anyhow). It too contains no Satellite data (due to width issues), but contains keys from multiple Hubs and Links. The Bridge table is also a system generated / process generated table, and therefore is not auditable. Remember, the architecture does not need these tables in order to function properly – these tables exist solely for the purpose of query performance.

The basic structure of a Bridge table is provided in Figure 7-4:

Figure 7-4: Bridge Table Structure

The Bridge table may contain Hub business keys; however, be careful, as raising the number of bytes per row will dramatically slow the performance of the table, especially in very large data sets. Another consequence of having this table become too wide is the introduction of chained rows, fragmentation, and over-indexing. Keep in mind that the purpose of this table is to enhance join performance, not kill it.
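As a minimal sketch of the Figure 7-4 structure, here is a hypothetical bridge for the seller/product/part example discussed below; all names are illustrative:

CREATE TABLE BRG_SELLER_PRODUCT_PART (
    SQN          INTEGER     NOT NULL, -- bridge's own sequence (primary key)
    LOAD_DTS     TIMESTAMP   NOT NULL, -- snapshot / record-creation date
    SELLER_SQN   INTEGER     NOT NULL, -- Hub surrogates
    PRODUCT_SQN  INTEGER     NOT NULL,
    PART_SQN     INTEGER     NOT NULL,
    LNK_SELL_SQN INTEGER     NOT NULL, -- Link surrogates being spanned
    LNK_MFG_SQN  INTEGER     NOT NULL,
    SELLER_BK    VARCHAR(30),          -- optional, frequently filtered business key
    PRIMARY KEY (SQN)
);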

In Bridge tables you may choose to have business keys; in others, you might leave them out. Even within a Bridge table, it is not necessary to contain all the business keys. It might be best to contain only those business keys which are most frequently queried from a filter or a LIKE clause. The Bridge table must contain Hub sequences (surrogate keys) and Link sequences (surrogate keys), and it is recommended that a Bridge table only be utilized when you have two or more Links to join with a single query.

Note: the Bridge table is not required to be at the same grain as the Links it is covering. In those cases, a GROUP BY might be utilized to construct a Bridge table with data at a higher grain than that of the Link beneath it. Please keep in mind that by changing the grain, the Bridge table really is representative of an exploration Link as described previously in section 5.21.

Figure 7-5: Bridge Table Architectural Overview

As indicated, this Bridge table spans data from three Hubs and two Links. This example (see Figure 7-5) maintains the lowest possible grain by keeping the Cartesian product intact; no GROUP BY has been executed prior to load of the data set. The data in the Bridge (see Figure 7-6) might be read as: seller by product by parts. Only the keys of the product which have a seller, and are used in the manufactured parts, will be listed in the Bridge table, unless you choose to populate some of the keys with NULL values.
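A hedged sketch of loading the Figure 7-5 bridge at its lowest grain, assuming hypothetical Link tables LNK_SELLER_PRODUCT and LNK_PRODUCT_PART; a GROUP BY variant of the same join would roll the bridge up to a higher grain:

INSERT INTO BRG_SELLER_PRODUCT_PART
       (SQN, LOAD_DTS, SELLER_SQN, PRODUCT_SQN, PART_SQN,
        LNK_SELL_SQN, LNK_MFG_SQN)
SELECT NEXT VALUE FOR SEQ_BRG,        -- sequence syntax is engine-specific
       CURRENT_TIMESTAMP,
       ls.SELLER_SQN, ls.PRODUCT_SQN, lm.PART_SQN,
       ls.LINK_SELL_SQN, lm.LINK_MFG_SQN
FROM   LNK_SELLER_PRODUCT ls
JOIN   LNK_PRODUCT_PART   lm
       ON lm.PRODUCT_SQN = ls.PRODUCT_SQN; -- Cartesian product kept intact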

Figure 7-6: Bridge Table Example Data

The SQN (or sequence) column is the primary key of this table; it's merely meant to keep the insertion order of the records and is unique. It has no meaning otherwise. The load date is the date/time of insertion (record creation or snapshot date). This example of the Bridge table has grouped the sequence with the business key for readability purposes. For performance: it is best to group all sequences at the head of the table, followed by the business key columns. This provides maximum access to fixed-width numeric columns (business keys), while the variable columns are placed at the end of the row.
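To show how a single query can replace joins across multiple Links, here is a sketch against the hypothetical bridge above; the Satellite names are illustrative, and the current-row (load date) filters on the Satellites are omitted for brevity:

SELECT b.SELLER_BK, pr.PRICE, iv.QTY_ON_HAND
FROM   BRG_SELLER_PRODUCT_PART b
JOIN   SAT_PRODUCT_PRICE  pr ON pr.PRODUCT_SQN = b.PRODUCT_SQN
JOIN   SAT_PART_INVENTORY iv ON iv.PART_SQN    = b.PART_SQN
WHERE  b.SELLER_BK LIKE 'ACME%';  -- filtering on an in-bridge business key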

8.0 Reference Tables

Reference tables are just that: tables utilized across the data warehouse as descriptors. To simplify matters, my definition of reference data can be found below. Should reference tables be a part of the Data Vault Model? Yes and no.

Note: I spoke to Bill Inmon about reference data, and I asked him why we have to have it in the Data Warehouse. His answer was: it is necessary in order to store the lookup data without redundancy. He continued on to share with me the following: if we lived in a perfect world, a world without disk space limits, then the reality would be (should be) that all reference data is resolved on the way into the warehouse. This way, the descriptions would be stored as deemed necessary, possibly in addition to the encoded or keyed information. However, since we do not live in a perfect world, it is sometimes necessary to house reference data in separate lookup tables to avoid redundancy across the entire warehouse.

I define reference data as follows: any information deemed necessary to resolve descriptions from codes, or to translate keys in a consistent manner. Many of these fields are "descriptive" in nature and describe a specific state of the other, more important information. As such, reference data lives in separate tables from the raw Data Vault tables. Generally speaking, reference data is neither a business key, nor purely descriptive. It lives in a grey area and covers a number of facets to resolve the information in the warehouse to a better context. For instance, ICD9 / ICD10 (medical diagnosis codes) are an example of reference data. They may be external sources of data governed (in this case) by a world body. These codes are often found used as descriptors of other business keys. While building separate reference tables in the Data Vault and adding the codes to Satellites appears to constitute "foreign keys in Satellites," I will tell you that this should be a logical representation only. If you physically create the foreign keys in the Satellites to reference tables, you can a) blacken your model (too complex to maintain, too many relationships), b) destroy flexibility, and of course c) depart from the perfect world scenario set up above.

In a truly perfect world, we would resolve all reference data on the way into the warehouse, thus making "reference data" obsolete; there would then be no need for foreign keys in the Satellites to begin with. In any case, reference data can and does exist as a part of the Data Vault model, and should be defined as: external data outside your control, data that is commonly used to set up context for / describe other business keys, or quite simply put: standard codes and descriptions or classifications of information.

The structure of the reference tables can vary from 3rd normal form, to star-like, to Hubs, Links and Satellites from the Data Vault. So there is no need to worry or fret about the type of structure that you want to use; just choose the one that works best for you and move on. Some options and scenarios are described in the following sections.

8.1 No-History Reference Tables

Sometimes there is no need to store history of reference data changes; in this case we use a typical 3NF or 2NF type table. The nature of a data warehouse is in fact to store history, but when the business signs off on the expected "no-history requirement" then the EDW team has the go-ahead. A no-history reference table is a structure that has no history! Imagine that! Ok, kidding aside – it's a table with no begin- and no end-dates.

Before I go on, I'll say this: reference tables can be designed as Hubs and Links, or as simple 3rd normal form tables, that is: flat and/or wide; it's up to you. You need to decide what's best, and what fits – then load it and go. What types of data might you see in a no-history reference table? Well, that all depends of course, but here are some examples of what I've run into in my career:

• Medical drug prescription codes and definitions
• Stock exchange symbols
• Medical diagnosis codes
• VIN number codes and definitions (manufacturer codes)
• Calendar dates
• Calendar times
• International currency codes
• US state code abbreviations

And so on… If you'd like to add to this list, I'd love to have your feedback. Just put the example in an email and send it off. An example of a non-history based 3rd normal form reference table is shown in Figure 8-1.

Figure 8-1: Non-History Reference Table

In addition to capturing the code, short description and long description, I like to capture the date the reference row was loaded. Sequence numbers are optional. There are times when I use sequence numbers for join performance, but most times I simply use the CODE as the primary key. That's the whole point of reference tables: to use the code, and leave the code in place across the model. The CODE is the natural key of the table. Remember that a non-history reference table will only and forever show the current value – and hence you can no longer ask the question: what was the description of the CA code last year?
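As a minimal sketch of such a table – assuming the US state codes from the list above; all names are illustrative:

CREATE TABLE REF_STATE (
    STATE_CODE CHAR(2)      NOT NULL, -- natural key, used as-is across the model
    SHORT_DESC VARCHAR(30),
    LONG_DESC  VARCHAR(255),
    LOAD_DTS   TIMESTAMP    NOT NULL, -- when the reference row arrived
    PRIMARY KEY (STATE_CODE)          -- a sequence number is optional
);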

8.2 History Based Reference Tables

History based reference tables are reference data with a requirement or a business need to store the history of descriptions. In other words, we or the business want to track what the description was last year, last month, and so on. The history may become important for certain reference data, especially if the reference data relates to financial reports. Particularly when old reports are reprinted in the future, sometimes the business or the auditor wants to see what the code and description was at a particular point in time.

In this case, I would strongly urge you to create Hubs, Links, and Satellites to house the historical reference data. However, I would discourage you from using SEQUENCE numbers in this situation. Natural keys tend to be much more consistent over time (in the case of reference data), and typically it's the natural keys which appear in the rest of the raw Data Vault model (EDW model) – particularly in the Satellites.

Adding sequence numbers to the history based reference tables usually adds no value, since the codes tend to be static (i.e., ever see the abbreviation for the state of California change?). On the other hand, if you have a valid reason to do so – then don't be shy. Document the reason, and proceed to use the sequences all across your model. An example of a history based reference table is shown in Figure 8-2:

Figure 8-2: Standard History Based Reference Table

In this Figure, you can plainly see the previous code, the time-line for the validity of the previous code, along with the historical values of the previous code. We are, however, using the natural key (the code) as the primary key, but the load date must be included for uniqueness. Figure 8-3 shows a different approach (using a Hub and Satellite) as a history based reference table:
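A minimal sketch of the Figure 8-2 pattern, again using hypothetical state codes; the natural key plus the load date form the primary key, and all names are illustrative:

CREATE TABLE REF_STATE_HIST (
    STATE_CODE   CHAR(2)      NOT NULL, -- natural key (the code itself)
    LOAD_DTS     TIMESTAMP    NOT NULL, -- required for uniqueness over time
    LOAD_END_DTS TIMESTAMP,             -- validity time-line of this row
    PREV_CODE    CHAR(2),               -- prior value of the code, if changed
    SHORT_DESC   VARCHAR(30),
    LONG_DESC    VARCHAR(255),
    PRIMARY KEY (STATE_CODE, LOAD_DTS)
);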

Figure 8-3: Hub/Sat History Based Reference Table

8.3 Code and Descriptions

Codes and descriptions are commonly found in reference data. If you have a lot of codes to model, take the most efficient route – that is: one that makes logical sense. I like to group many of the similar codes together into a single "master code" table. In these cases, I have to also assign a unique "group code" to help make the underlying code unique. Often times the group code is a made-up or manufactured column (hard coded data in the ETL routine).

Because the group-code is system generated, and it has no formal business meaning outside of the EDW (generally), I usually try to keep the group code inside the EDW for joining and uniqueness reasons only. The example in Figure 8-4 is made-up data, but shows how you can apply a master code or a group code to use a single structure and house all your information.

Figure 8-4: Group Code and Description

As you can see, this solution has a few flaws: if the group code changes, or the code itself changes, this leads to confusion with interpretation of the surrogate keys. However, as long as the codes and group codes are consistent, it shouldn't be such a problem.
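A minimal sketch of the Figure 8-4 pattern, assuming a two-part natural key of group code plus code; the names are illustrative:

CREATE TABLE REF_MASTER_CODE (
    GROUP_CODE VARCHAR(10) NOT NULL, -- manufactured grouping, EDW-internal
    CODE       VARCHAR(20) NOT NULL, -- underlying code, unique within its group
    SHORT_DESC VARCHAR(30),
    LONG_DESC  VARCHAR(255),
    LOAD_DTS   TIMESTAMP   NOT NULL,
    PRIMARY KEY (GROUP_CODE, CODE)
);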

9.0 Conclusions

The Data Vault Model and methodology are highly versatile when the standards and rules are followed. It's when you break the rules and standards that you can get into trouble, and I hope I've shown you enough insight to see how to apply the appropriate and proper design. It's by following the rules and standards that you can take advantage of the years of research and design I've put into this, allowing you to overcome and avoid the potential pitfalls and project issues.

I would like nothing more than to help you succeed, and to hear from you about your concerns, questions, or comments. I'm always interested in hearing about customer successes as well as challenges you face in your day to day job. If you become a Data Vault fan along the way, feel free to let me know!

Sincerely,

Dan Linstedt
