Data Warehouse Names

Share Embed Donate


Short Description

This Naming standards document defines Data Warehouse standards to follow for schemas, subject areas, projections, attri...

Description

Matthew Lawler [email protected]

Datawarehouse Names

Datawarehouse Names Matthew Lawler [email protected]

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

1 of 27

Matthew Lawler [email protected]

Datawarehouse Names

INTRODUCTION

3

HIGH LEVEL STANDARD

7

PROJECTION STANDARDS

11

CONSTRAINT STANDARD

17

ATTRIBUTE STANDARD

18

GENERAL RENAMING RULES

23

TIME STANDARD

24

PLATFORM CONSTRAINTS

26

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

2 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Introduction Licence As these are generic software documentation standards, they will be covered by the 'Creative Commons Zero v1.0 Universal' CC0 licence.

Warranty The author does not make any warranty, express or implied, that any statements in this document are free of error, or are consistent with particular standard of merchantability, or they will meet the requirements for any particular application or environment. They should not be relied on for solving a problem whose incorrect solution could result in injury or loss of property. If you do use this material in such a manner, it is at your own risk. The author disclaims all liability for direct or consequential damage resulting from its use. This is really a sample document, illustrating particular choices taken. As each DW environment is different, so these choices will also be different. There is no universal standard for naming conventions.

Purpose This Naming standards document defines the standards to follow for schemas, subject areas, projections, attributes, constraints, boilerplate columns, renaming, and time. This document is needed to make it easier for Business users to understand column and table names, and reduce the cost of egregious renaming.

Audience The primary audience of this document are any staff who use the DW, or do design development or maintenance on the DW. It will also be useful to Business Intelligence end users.

Assumptions It is assumed that the naming conventions will support a variety of DW design patterns such as Data Vault, Kimball, and source standards.

Approach The naming standard will be based on common DW design patterns and other essential design decisions.

Related Documents There are many documents describing the DW design methodologies such as Kimball, Inmon, etc. Some are below. Author

Reference

Publisher

Year

W Inmon

Building the Data Warehouse

Wiley

1996

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

3 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Author

Reference

Publisher

Year

CJ Date

Temporal Data & the Relational Model

Morgan Kaufmann

2002

Ralph Kimball

The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling

Wiley

2002

Definitions Term

Source

Definition

Aggregate

Kimball

An aggregate is a summary table using group by, often based on fact and/or dimension tables.

Bridge

Kimball

A bridge table to capture many-to-many relationships, such as when a fact table row can be associated with more than one value in a dimension.

Data Model

Chris Date

The data model must represent demonstrably true statements about the business area. That is, the entities must represent things that mean something to the business, and the relationships between entities must represent meaningful links.

Data Vault

Linstedt

Data Vault is a database method that is designed to provide long-term historical storage of data coming in from multiple operational systems. It provides a DW pattern that supports the Inmon goals of Subjectorientation, non-volatility, integration and Time-variance.

Dimension

Kimball

An independent entity in a dimensional model that serves as an entry point or as a mechanism for slicing and dicing the additive measure located in the fact table of the dimensional model. For example, all months, quarters, years, etc., make up a time dimension. Based on Measure Theory.

Disemvowel

DB

This is a function that removes all vowels from a text string. E.g. A sentence such as: “The quick brown fox jumps over the lazy dog” would, after being disemvowelled, become: “Th qck brwn fx jmps vr th lzy dg”.

EAV

DB

An Entity-Attribute-Value table. This is a type of RDF table. Data is recorded in only three columns: Entity: the item being described; Attribute or parameter: a foreign key into a table of attribute definitions; Value of the attribute. This pattern is very popular with business users, as they have the opportunity to redefine data in an application. Often, much of the useful reporting data are in these columns. However, it can be quite difficult to pivot this data into usable reporting tables. See the Resource Description Framework

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

4 of 27

Matthew Lawler [email protected]

Term

Source

Datawarehouse Names

Definition (RDF) as an example.

Fact

Kimball

A business performance measurement, typically numeric and additive, that is stored in a fact table. Based on Measure Theory.

Hub

Linstedt

A hub represents a core business concept such as Customer, Vendor, Sale or Product. The hub table is formed around the business key, as well as the source system keys.

Integrated

Inmon

Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible.

Invariant

Algebra

An invariant is a property of a mathematical object that remains unchanged when transformations of a certain type are applied to the object. As an example in a DW, the count of network nodes should be identical in a source database, as well as the target DW, at the same point in time.

Join

Codd

A join is a binary operator on two relations or database tables.

Link

Linstedt

A link represents a natural business relationship between business keys.

Logical name

Chris Date

These are full words from a standard business vocabulary. This contrasts with physical names that are often abbreviated due to name length limitations, etc.

Metadata

DB

Metadata is "data about data". While not often used in reporting, these tables are important in DW standards, and for generating and describing DW components.

Name Hygiene

Computer Science

The prevention of accidental capture of identifiers, or name collision. This occurs especially when generating identifiers or names. In programming languages, these are solved using hygienic macros.

Non-volatile

Inmon

Non-volatile design is essential. Non-volatility literally means that once a row is written, it is never modified. This is necessary to preserve incremental net change history. This, in turn, is required to represent data as of any point in time. When a data row is updated, the past information is destroyed. A fact or total that included the unmodified data can never be recreated.

Object

DB

For the purposes of the DW, an object is a projection over a table or a table join. They are often used in applications to layer and rename physical column names to attribute names. Consequently, the end user is more familiar with the attribute name, rather than the column

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

5 of 27

Matthew Lawler [email protected]

Term

Source

Datawarehouse Names

Definition name.

Pivot

DB

A pivot table is the transformation of an EAV table into columnar form.

Projection

Codd

A projection is a unary operation that selects a subset of attribute names. In database terms, this is a select. This can be implemented as tables, views, XML messages, etc.

RDF

W3C

The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for conceptual description of information and/or knowledge management, which is implemented in web resources, and other formats. An RDF triple is a statement about resources in the form of a subjectpredicate-object expression. DF is a more general form of the original EAV Entity-Attribute-Value design pattern, where the subject is an entity, predicate is an attribute and object is a value. A set of such RDF triples forms an RDF graph.

Relational Algebra

Codd

Relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over relations.

Satellite

Linstedt

A satellite contains the descriptive information (context) for a business key.

Subjectorientation

Inmon

Subject-orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity of needs. This is a departure from serving only either the vertical application views of data (supply-side) or the overlapping departmental needs for data (demand side).

Time Variant

Inmon

Time variance calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. There may be detail for seven years along with weekly, monthly and quarterly aggregates of differing duration. This is critical for maintaining the consistency of reported summaries over time.

Tags Business Intelligence ; Data Mapping ; Metadata ; Standards ; Data Transformation ; Data Warehouse ; Database ; Fact / Dimension ; Data Load ; Data Model ; Data Vault ; Database Design ; Extract Load Transform - ELT ; Extract Transform Load - ETL ; Inmon ; Kimball ; Massive Parallel Processing - MPP ; Netezza ; Oracle ; Data Integration ; Data Lineage ; Data Traceability ; Time Variant ; Metadata Glossary ; Hub / Satellite ; Projection ;

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

6 of 27

Matthew Lawler [email protected]

Datawarehouse Names

High Level Standard Layer Standard These are all defined as schemas in the DW. This assumes a simple 4 layer DW design. The final layers start with A*, so they will appear first in any schema listing. Layer

Aligned Area

Schema Name

Description

Gold Layer

Business

AU_GOLD_ DECISION

This is a schema for tables and views that are available to Business Decision Makers. It represents the final state of the transformed data, suitably cleansed for the primary customer of the DW, the business decision maker.

Silver Layer

Business

AG_SILVER_ INTEGRATION

This is a schema that enables integration across multiple source schemas. It represents transformed Lead data into more valuable Silver data.

Iron Layer

Business

FE_IRON_ METADATA

This is a schema for all metadata data.

Lead Layer

Source

PB_ _ _ _

This contains the raw loaded source data. The primary need is for a unique definition of the data. This naming pattern supports this. The name defines the important characteristics of the load. The first code identifies the source system. The next code identifies the DBName for databases, or the address for web sources or the directory path for file sources. The DBName must always be based on the ultimate production database name, not the development name. In the case of directory or path names, the protocol prefix will need to be stripped off, and illegal characters like ":/" will need to be substituted with underscores. The next code identifies the Schema Name for databases, or the page for web sources, or the file for file based sources. The next code identifies the load type. This more precise method of identifying sources will enable a much large number of different sources to be identified, and to be able to manage multiple loads from the same source, so that they do not impact on each other. This approach is more selfdocumenting, with a clearer self documenting data lineage.

Load Type This is needed to distinguish how the data has been loaded. If data is brought in by separate ETL or CDC processes, it will be subject to different time variance and recovery processes. This is even an issue within CDC, where data can be either refreshed or mirrored. If the target is not distinct, then it

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

7 of 27

Matthew Lawler [email protected]

Datawarehouse Names

may be impossible to properly reason about its state, leading to incorrect replication and, ultimately, erroneous business decisions. Load Type

Load Name

Description

CDC

Change Data Capture

This is the default process for loading data.

ELT

Extract Load Transform

A common pattern in DW is to load the data in before transformation. This results in faster loads in an MPP appliance.

ETL

Extract Transform Load

ETL is used to load data into the DW. The transformation occurs in a separate ETL tool, which is a common anti-pattern which leads to poor single threaded ETL performance.

MNL

Manually loaded

This is practical for data that does not need to be loaded more than once, or very infrequently, which is generally invariant data.

REF

Reference

Used mainly for schema analysis, and possibly prototyping, but not to be used for reporting. A throw away design area.

Just as there can be alternate means of loading, there can also be alternate source types from the same source. In almost all cases, there is not a strict isomorphism between source and target, and the source type plays an important part in determining the amount of change that occurs between the source and target. In other words, the different source types have different constraints that dictate how well the data is loaded. If the source type is also a file type, then they should conform to standard file extensions. The source type may also be appended to the name, but it is not mandatory. File Type

File Type Name

DOC

Microsoft Word Document

HPR

Hyperion

HTML

Hypertext Markup Language

MDB

Microsoft Access Database

MSP

Microsoft SharePoint

NTZ

Netezza

ODS

Operational Data Store

Description

Many source systems currently provide a ODS layer. While well meaning, this anti-pattern often creates a reconciliation

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

8 of 27

Matthew Lawler [email protected]

File Type

Datawarehouse Names

File Type Name

Description

nightmare, as rarely are the ODS mappings back to the source system actually documented. ORA

Oracle

XLS

Microsoft Excel Spreadsheet

XML

Extensible Markup Language

Subject Area Standard The physical name can be used in schema names. The Subject Areas will have a three letter code. The Subject Area codes can be used in projection names, or in primary key values. These subject area names are examples only. SA Code

Subject Area (logical)

Subject Area (physical)

Layer

AAT

Authentication and Authorization

Authentication_Authorization, Athntctn_Athrztn

Iron

GLS

Glossary

Glossary

Silver

BUS

Business

Business

Gold

CTL

Control

CONTROL

Iron

DQ

Data Quality

DATA_QUALITY

Iron

DVT

Data Vault

DATA VAULT

Silver

ERL

Entity & Reference Entity_Reference_Links Links

Iron

GLD

General Ledger

General_Ledger

Gold

HRS

Human Resources

Human_Resources

Gold

INV

Invariant

INVARIANT

Iron

LNG

Lineage

DATA Lineage

Iron

OPS

Operations

Operations

Gold

PRJ

Project

Project_Management

Gold

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

9 of 27

Matthew Lawler [email protected]

SA Code

Subject Area (logical)

Datawarehouse Names

Subject Area (physical)

Layer

Management QLT

Quality

Quality

Iron

SLN

Source Lineage

Source_LinEage

Silver

TRC

Traceability

TRACEABILITY

Iron

Note that some metadata can be placed in other layers. However, if the metadata is shared across schemas or layers, then it is best to place in a separate schema. Lineage can be considered the column to column mappings. Traceability can be considered the actual row value to row value mappings.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

10 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Projection Standards A projection can be thought of as a subset of columns. These are a series of codes or types that can be used as part of a projection name. These codes will help to make the purpose of projection more transparent, through consistent usage of the name patterns. The next chapter will give examples of how they can be combined to create projection names.

Logical Projection Type (LPT) Logical means the primary purpose of this projection type within a given methodology. This purpose will help determine the correct set of columns (or attributes) for the projection. All new projections MUST have a Logical Projection Type as a prefix, except SAL tables copied from source tables. This is necessary to avoid name collisions, as most of these projection types can be used in any layer/schema. As far as possible, each type should belong within a type set, so the types are mutually exclusive, and all types form a complete type set. For example, within the Inmon type set, there can only be 3 types: Ephemeral, Invariant or Time Variant. LPT Code

LPT Name

Standard

Definition

A, ACT

Active (current)

Source

A projection of the logical active columns for an active table, filtered to show current data only, and excludes deleted rows. An active column has a number of distinct values > 0 (greater than zero). An active table has at least one row. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.

B, BRD

Bridge, Helper

Kimball

A join over tables that can be used as a bridge in a Kimball schema.

C, CUB

Cube

Kimball

A projection that is a join of a Kimball Fact table, and related Kimball Dimension tables. This is useful when the users do not know how to join tables. This does not use GROUP BY, so it is different to an Aggregate.

D, DMN

Dimension

Kimball

A join over tables that can be used as a dimension in a Kimball schema. Typical columns are codes which have low cardinality, which allow for grouping. A dimension can be either a singleton view (over a single table), or based on a set of joined tables.

E, EPH

Ephemeral

Inmon

Something that is true at an instant in time, but has no duration or longevity. That is, it only lasts momentarily, or for a very short time. A typical example is a transaction or trade, which is effectively instantaneous. Typically, this data

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

11 of 27

Matthew Lawler [email protected]

LPT Code

LPT Name

Standard

Datawarehouse Names

Definition

is not mutable. That is, it cannot be updated, except for error correction. F, FCT

Fact

Kimball

A join over tables that can be used as a fact in a Kimball schema. Typical columns are numeric which represent either continuous data or countable data. This is often based on a single table.

G, AGR

Aggregate

Kimball

A join over tables that can be used as an aggregate or summary. This denormalisation can be used in all other standards. This uses GROUP BY, so it is different to a Cube.

H, HUB

Hub

Data Vault

A join over tables that can be used as a hub in a Data Vault schema.

I, INV

Invariant

Inmon

Something that is always true. For example, system setup data does not change over the life of a system, so this can be considered invariant. This data is not mutable. That is, it cannot be updated, except for error correction.

K, KEY

Key

Source

This represents cleansed Source tables that have had a surrogate and/or a distribution key added. A surrogate key is needed for Fact/Dimension joins. A distribution key is critical for adequate Netezza performance. Note that these tables would still retain their source names.

L, LNK

Link

Data Vault

A join over tables that can be used as a link in a Data Vault schema.

M, MTD

Metadata Metadata

A join over tables that can be used for metadata. Examples include A&A, Glossary, Data Quality, Data Lineage, Data Traceability, Invariants, etc.

N, NUB

Nub

Source

This represents Source tables that have been cleansed, but without column name changes. For example, cleansing can discard boilerplate columns, de-duplicate rows, add defaults values (e.g. N/A for nulls), convert types (e.g. Text -> Dates), fix column lengths, etc.

O, OTH

Other

Source

This represents some source system based column and/or table grouping which is not defined as a table/view the source RDBMS. For example, this could represent a manually maintained subject area grouping, or some set of data in the application layer.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

12 of 27

Matthew Lawler [email protected]

Datawarehouse Names

LPT Code

LPT Name

Standard

Definition

P, PVT

Pivot (from RDF/EAV)

Source

A join over RDF/EAV tables that pivot or flatten the EAV data into multiple tables in columnar form. They can then be joined to other tables.

R, RDF

Resource Description Framework

S, STL

Satellite

Data Vault

A join over tables that can be used as a satellite in a Data Vault schema.

T, TMV

Time Variant (Historical)

Inmon

A projection of the logical active columns for an active table, which shows current and historical data, and excludes deleted rows. An active column has a number of distinct values > 0 (greater than zero). This is only required for source tables. The non-source DW Design Patterns all support Time Variance. This data is mutable. That is, it changes over time, and a new row is created whenever the source data changes. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.

U, UCN

Unconditional, (Audit)

Source

A projection of all columns for all tables, without filtering the column set or the row set. Therefore, this will show both current and historical data. This also shows deleted rows. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.

Z, ZWK

ZWork

Metadata

Metadata

A join over source tables used to describe source data that is in RDF form. Typically, there are 4 tables needed: Resource (Entity) which defines tables and primary keys, Predicate (Relation) which defines relationships, ResourceType which defines type of each attribute (e.g. date, char, etc.) and RDF (Value) which contains the RDF triples. This data can be used to pivot into standard tables.

These projections satisfy technical requirements, and not reporting requirements. These should not be visible to reporting users. Many will be required by the RDBMS to define projections like indices, etc.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

13 of 27

Matthew Lawler [email protected]

Datawarehouse Names

This hierarchy shows what standards the logical projection types belong to. The projection designer needs to ensure that the projection conforms to these projection types. For example, if the projection is a fact table, it must begin with F_*, and no other prefix.

Logical Projection Metadata

Source

Kimball

Data Vault

Inmon

M Metadata

A Active

B Bridge

H Hub

E Ephemeral

R RDF

K Key

C Cube

L Link

I Invariant

Z Work

N Nub

D Dimension

S Satellite

T Time Variant

O Other

F Fact

P Pivot

G Aggregate

U Uncond

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

14 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Physical Projection Type (PPT) All physical types are defined, which will help minimize name collisions. The designer may choose not to add any suffix, at the risk of subsequent name collisions. Not defining the Physical Projection Type suffix is unsafe practice when generating projections automatically. PPT Code

PPT Name

A

Abstract

F

Function

True

G

Package

True

I

Index

True

L

DB Link

True

M

Materialised True

O

Object View

True

P

Procedure

True

Q

Queue

True

R

Synonym

True

True

S

Sequence

True

True

T

Table

True

True

This covers sub-types such as temporary and external.

U

Type

True

V

View

True

True

Views can be implemented as materialized or dematerialized.

X

XML

True

Z

Trigger

True

Oracle Netezza Comment

An ‘Abstract’ type is also provided, even though this does not exist as a type in any database. The *_A type can be a table or a view, etc. This will enable the designer to replace a view with a table, without having to change the name. This may help reduce impacts on downstream users. True

True

True

Avoid using this type. Use V instead.

True

While a synonym can point to any other physical type, it still needs to be managed as a particular type. Use *_A, if *_R is too restrictive.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

15 of 27

Matthew Lawler [email protected]

Datawarehouse Names

The hierarchy below shows which environments physical projection types belong in. Physical Projection

A Abstract

Shared Types

Oracle Only Types

F Function

G Package

I index

O Object View

M Materialized

L Link DB

P Procedure

Q Queue

R Synonym

U type

S Sequence

X XML

T Table

Z Trigger

V View

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

16 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Constraint Standard Constraints will be implemented as a single character suffix. Note that these are RDBMS specific.

Constraint Code CT Suffix Constraint Type

Oracle Netezza Description

_C

Check

True

_F

Foreign Key

True

_H

Hash Expression

True

_O

Only View Read

True

_P

Primary Key

True

_R

Referential Integrity

True

_S

Supplemental Logging True

_U

Unique Key

True

_V

View Check

True

True

True

Constraint Name Patterns This shows what would be the most common examples. It is assumed that there will be a maximum of 9 unique column sets on any given table. Given that these are logically equivalent to the primary key, more than this is unlikely. Note that the Distribution Key is not a constraint, but a part of the table structure, so it cannot be altered with an add or drop statement. Therefore, there no naming standard applies to a Distribution Key as a constraint. Constraint

Pattern

Example Base Projection

Result Constraint

Primary Key

_P

D_NODE_T

D_NODE_T_P

Foreign Key

< tableName>__F

D_NODE_T

D_NODE_T_1234_F

Unique

< tableName> _Unique_U

D_NODE_T

D_NODE_T_UNIQUE3_U

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

17 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Attribute Standard Attribute Data Type (ADT) These standards only apply to new attributes created for the DW. All source column names and data types should remain unchanged. All non-source columns MUST have an attribute data type. The DB data type can be varied where it makes sense. Some choices are provided. For example, an enumeration can be INTEGER or VARCHAR. However, a date must be DATE. Similarly, length can also be varied where it makes sense. Suffix

Description

Oracle Data Type

Netezza Data type

Definition

_A, _AMT

Amount

NUMBER(28,10)

FLOAT(15)

Any currency or monetary amount, including balances, prices, etc. This can enable fractions of cents.

_B, _BL

Binary Large Object

BLOB

BINARY VARYING

Any Binary Large Object such as an image, audio or video. For example, well-known binary (WKB), which is used to define geometric objects in binary.

_C, _CD

Enumeration (aka code or List of Values)

VARCHAR2(30), NUMBER

VARCHAR(30), INTEGER

An enumeration is a collection of items that is a complete, ordered listing of all of the items in that collection. For example, Frequency which can be Day, Week, Month, etc. DAY_OF_WEEK_C can be VARCHAR2(9) or NUMBER(1,0).

_D, _DT

Date

DATE

DATE

Any date that does not include time. That is, only YYYY-MM-DD.

_E, _EX

Explain

VARCHAR2(255)

VARCHAR(255)

A description or definition statement.

_F, _FL

Float

NUMBER(28,18)

FLOAT(15)

Any numeric that is not an amount, a percentage, an

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

18 of 27

Matthew Lawler [email protected]

Datawarehouse Names

integer, a degree or a rate. _G, _DG

Degree

NUMBER(10,7)

FLOAT(15)

A number that can be used for latitude or longitude.

_I, _ID

Identifier (aka Name or business key)

VARCHAR2(80), NUMBER

VARCHAR(80), INTEGER

An identifier is a name that identifies (that is, labels the identity of) a unique object. This is always provided by the source system. It may even be a surrogate in the source, but is still an id. The length or type may be changed if needed. E.g. NODE_I

_J

Geometry

SDO_GEOMETRY

ST_GEOMETRY

A type that defines a geometic object. This type varies between platforms. The types can be converted to WKB or WKT format.

_K, _KN

Kind

VARCHAR2(255), NUMBER

VARCHAR(255), INTEGER

Any code that must follow rules. E.g. A FSA must be in SAAA format.

_N, _NT

Note

VARCHAR2(2000)

VARCHAR(2000)

Text or string is a sequence of characters. This can be used for any definition or description, except for Identifiers, Codes or Explanations. Oracle limit is 4,000; Netezza limit is 64,000.

_O, _OC

O’clock

TIME

TIME

Any moment in a 24 hour period, independent of date.

_P, _PCT

Percentage

NUMBER(28,18)

FLOAT(15)

A percentage as a fractional number.

_R, _RT

Rate/Percentage

NUMBER(28,18)

FLOAT(15)

A rate as any fractional number.

_S, _SK

Surrogate Key (Natural)

NUMBER(28,0)

BIGINT

An identifier that is based on the natural key of a table. On non-time variant tables, this is identical to the Unique

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

19 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Surrogate Key (_U). For time variant tables, this will change when there is a change to the key. _T, _TS

Timestamp

TIMESTAMP(6)

TIMESTAMP

A date and time down to the micro-second. That is, YYYYMM-DD HH:MM:SS.FFFFFF.

_U, _UK

Unique Surrogate Key

NUMBER(28,0)

BIGINT

A unique identifier for each row of a table. Incremented whenever a new row is added.

_W, _WH

While

NUMBER(28,0)

BIGINT

Any time duration measured in seconds.

_X, _XM

XML

CLOB

VARCHAR

XML Document. For example, well-known text (WKT), which is used to define geometric objects in XML.

_Y, _YN

Boolean (aka, flag or indicator)

VARCHAR(1 CHAR) 'Y' | 'N', VARCHAR(5 CHAR) 'True' | 'False', BIT, 0 | 1 NUMBER(1, 0), 0 | 1

CHAR(1), CHAR(5), BYTEINT

An attribute type with only two possible values: yes or no. It must be expressed as a question with a clear true or false value e.g. is_deleted_YN.

_Z, _ZI

Integer

NUMBER

INTEGER

Any integer. This excludes monetary or fractional numeric. This could be count or sequential value, etc. Z is the default symbol for integer in Maths.

The longer attribute type (e.g. _AMT) may be used if there less than 27 characters in the name. Otherwise, the shorter attribute type must be used, even if that means truncating the name to 28 characters. Why such large Numeric values for ORACLE? Oracle implements NUMBER as a variable length column. So, the column size in bytes for a particular numeric data value NUMBER(p), where p is the precision of a given value, can be calculated using the following formula:

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

20 of 27

Matthew Lawler [email protected]

Datawarehouse Names

ROUND((length(p)+s)/2))+1 where s equals zero if the number is positive and s equals 1 if the number is negative. Therefore, the large number sizes will not impact storage, as they are not fixed.

Boilerplate Physical Columns System Column

Description

SOURCE_CHANGED_ON_TS

The date and time when the record was last modified in the source system.

SOURCE_CREATED_ON_TS

The date and time when the record was initially created in the source system.

EFFECTIVE_AT_TS

This column stores the moment date and time at which the record represents a true value. A value is either assigned by the Unified Data Store or extracted from the source. Note that this column is not needed on most tables. It will only be on event type tables, and certain kinds of fact tables.

EFFECTIVE_FROM_TS

This column stores the date and time from which the record represents a true value. A value is either assigned by the Unified Data Store or extracted from the source. This column will be on all SCD2 tables.

EFFECTIVE_TO_TS

This column stores the date and time after which the record does not represents a true value. The EFFECTIVE_TO_T of the previous row MUST = EFFECTIVE_FROM_T of the next row – 1 TIME UNIT. A value is either assigned by the Unified Data Store or extracted from the source. This column will be on all SCD2 tables.

IS_DELETED_YN

This boolean indicates the deletion status of the record in the source system. A value of YT indicates the record is deleted from the source system and logically deleted from the source aligned layer. A value of NF indicates that the record is active.

PROCESS_INSERT_ID

System field. This column is the unique identifier for the specific ETL batch process used to create insert or update this data row. Both INSERT_PROCESS_ID and UPDATE_PROCESS_ID are necessary in order to be able to easily back out incorrectly loaded data.

PROCESS_UPDATE_ID

This column is the unique identifier for the specific process used to update this row. Both INSERT_PROCESS_ID and UPDATE_PROCESS_ID are necessary in order to be able to easily back out incorrectly loaded data.

_DISTRIBUTION_ID, _DSTR_I

ForeignKeyName is the name of the Primary Key on the table that is being used for distribution over the Netezza nodes.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

21 of 27

Matthew Lawler [email protected]

Datawarehouse Names

System Column

Description

ROW_NATURAL_ID

Concatenated artificial primary key. This is constructed out of the columns that represent the true table key. These values are cast into string type.

SOURCE_SYSTEM_C

This should be a valid 3 character source system code.

TRANSACTION_TYPE_C

This code indicates the kind of load transaction used for this record. There are 4 possible values: I (Insert), U (Update), D (Delete ), R (Refresh).

A flag IS_CURRENT_YN can be added to views, but it will not be needed on the physical tables, as this can be derived from high date for on EFFECTIVE_TO_TS.

Primary and Foreign Key Names This standard applies to all views, in all layers. They also apply to all tables not in SAL. The primary key standard is needed to support the automatic key matching in Tableau and other tools. Name

Pattern

Description

Primary Key Name

_I

Apply in all layers.

Foreign Key Name

_I

The only exception is when a child table has the two relationships to the parent table. In this case, the most important relationship should have the same name. All other keys will have a prefix.

Foreign Unique Column Name

Unique ID

This acts like a "foreign key" column that uses the nth unique key of a parent table. N ranges from 1 to 9. These key types must also match easily in Tableau.

Unique Column Name

Unique ID

This is the nth unique key for this table. N ranges from 1 to 9. This kind of column can also be called an Alternate Primary Key.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

22 of 27

Matthew Lawler [email protected]

Datawarehouse Names

General Renaming Rules These are global rules for all names, whether they are used for schemas, projections, attributes, etc.

Logical to Physical Name Mapping Automated Function This describes the function that changes a logical name to physical name. The key idea is to minimise all renaming, as this only introduces confusion, and forces users to lookup reference lists. The rules are only applied when needed, which is not often. The default is NO change at all. For Oracle, MaxNameLength = 30, and Netezza, MaxNameLength = 128. Restrict the length to (MaxNameLength - 4) chars, so that 2 chars can be prefixed and 2 chars can be suffixed e.g. F_*_T (for fact table). 1. If the name length is 26 or less, then use the name as is, after taking out illegal characters e.g. "Unique GNAFs incl Proposed" -> "Unique GNAFs incl Proposed" or UNIQUE_GNAFS_INCL_PROPOSED 2. If the deduplicated, disemvowelled name length is 26 or less, then de-duplicate and disemvowel the name e.g. "ERP Project Template Parameters" -> "ERP Prjct Tmplt Prmtrs" or ERP_PRJCT_TMPLT_PRMTRS 3. Else de-duplicate, disemvowel, truncate all words to a max of 4 letters and truncate whole name "ERP Actl Brdn Ttl Cst Amn" or ERP_ACTL_BRDN_TTL_CST_AMN Logical to Physical Name Definitions Disemvowelling means removing vowels from a word. e.g. Assignment -> Assgnmnt. The first char is always retained, regardless of whether is a vowel or not. All other vowels are removed. De-duplication removes repeated consonants. E.g. Access -> ACCSS -> ACS. Truncation means removing all but the first 4 letters E.g. Burdened -> BRDND -> BRDN.

Logical to Physical Name Mapping Manual Process This applies where a manual approach is being used. More flexibility is available here, while retaining the overall goal of not forcing the user to look up some dictionary of valid abbreviations. For example, DESC is a standard abbreviation for DESCRIPTION. So "ACCESS TECHNOLOGY DESCRIPTION" could be "ACCESS TECHNOLOGY DESC". However, please note that DESC could also be DESCENDING, or some other thing. So this approach should be used for standard abbreviations only. The automated function above would become "ACCSS TCHNLGY DSCRPTN". However, only use this approach when the name is approaching the Max Name Length (e.g. 26 for Oracle). For example, if the Logical name is Appointment, turning this into APNT may create confusion, as this could also be Apparent, or many other words starting with Ap*n*t. Use this approach wisely.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

23 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Time Standard Time Quantum or Time Delta The DW time quantum will be a second. That is, this will be the smallest unit of time in the data.

High and Low Dates The high date is 31-12-2999 00:00:00. The low date will be the start time for Epoch or UNIX time. That is: (UTC), 1 January 1970 or 01-011970 00:00:00. However, in general, this low date should not be used. Instead, the low date of a row should be the date that the row was first created in the source system. Only use low date when a date is mandatory, and the data is missing from the source system.

Time Zone This is EST or Australian Eastern Standard Time.

Table types and Additional Time keys History Type

Definition

Additional Time Keys

Example

Time Variant (aka Duration)

Data that is true over a period of time. Must contain valid from and valid to as part of primary key. This information is almost always based on load date, rather than on a specific business defined event date.

EFFECTIVE_FROM_TS, EFFECTIVE_TO_TS

Normally an SCD2 Dim table.

Ephemeral (aka Activity)

Data that is true at a point in time. Must contain valid timestamp as part of primary key. This information is based on business information such as transaction data, or effective date. As a last resort, it can be based on load date.

EFFECTIVE_AT_TS

This occurs on transaction grain Fact tables, as well as some source tables.

Invariant (aka No History)

Data that is always true or data whose change is not of interest. Date is not part of key.

Not Applicable

Equivalent to an SCD1 Dim table.

Time periods must be non-overlapping That is, the EFFECTIVE_TO_TS of the previous row = EFFECTIVE_FROM_TS of the next row – 1 TIME QUANTUM. If there is no clear distinction between the timestamps, then a query can result in 2 rows of data being true for the overlapping time period. This will cause great confusion for any analyst querying the data, and impact the credibility of the DW. The rule is to modify the previous EFFECTIVE

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

24 of 27

Matthew Lawler [email protected]

Datawarehouse Names

_FROM_TS to the next EFFECTIVE _TO_TS – 1 TIME QUANTUM. This is a fundamental standard that must be adhered to.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

25 of 27

Matthew Lawler [email protected]

Datawarehouse Names

Platform Constraints Summary Most of these standards are dictated by platform constraints. In many cases, this is obvious, as in the set of Physical Projection Types is clearly dictated by the Oracle and Netezza DB engines. However, there are some other standards that are more subtle. These are listed below. Standard

Platform

Rationale

Time Quantum

Oracle Logging

The minimum granularity of Oracle Logging is 1 second. Therefore, this is the finest/smallest discrete time unit available.

Upper Case only in Target

Datastage

This ETL tool does not support mapping to case sensitive object names. So, for physical tables loaded by Datastage, they will be in upper case. Note that this does not apply to the CDC tool.

Column count limit

Oracle

Oracle has a 1,000 column limit for tables and views.

Column size limit

Oracle

Oracle has a 4K column size limit, for VARCHAR.

Row size limit

Oracle

Also known as record size. No limit.

Column size limit

Netezza

Netezza has a 64K column size limit.

Row size limit

Netezza

Netezza has a 65,535K total row size table limit. This is the sum of the column lengths in a row.

Column count limit

Netezza

Netezza have a 1,600 column limit for tables and views.

Oracle The following list of rules applies to both quoted and nonquoted Schema Object Name identifiers unless otherwise indicated: 1. Names must be from 1 to 30 bytes long. 2. Nonquoted identifiers cannot be Oracle Database reserved words. Quoted identifiers can be reserved words, although this is not recommended. 3. The Oracle has many pseudo reserved words with special meanings, such as DIMENSION, SEGMENT, ALLOCATE, DISABLE, and so forth. These words are not reserved, but as Oracle uses them internally in specific ways, this may lead to unpredictable results. D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

26 of 27

Matthew Lawler [email protected]

Datawarehouse Names

4. Use ASCII characters in names. 5. Nonquoted identifiers must begin with an alphabetic character from your database character set. Quoted identifiers can begin with any character. 6. Nonquoted identifiers can contain only alphanumeric characters from your database character set and the underscore (_), dollar sign ($), and pound sign (#). Oracle strongly discourages you from using $ and # in nonquoted identifiers. 7. Within a namespace, no two objects can have the same name. The following schema objects share one namespace: Tables, Views, Sequences, Private synonyms, Stand-alone procedures, Standalone stored functions, Packages, Materialized views and User-defined types. Each of the following schema objects has its own namespace: Indexes, Constraints, Clusters, Database triggers, Private database links, Dimensions. Because tables and views are in the same namespace, a table and a view in the same schema cannot have the same name. However, tables and indexes are in different namespaces. Therefore, a table and an index in the same schema can have the same name. Each of the following nonschema objects also has its own namespace: User roles, Public synonyms, Public database links, Tablespaces, Profiles, Parameter files (PFILEs) and server parameter files (SPFILEs). Because the objects in these namespaces are not contained in schemas, these namespaces span the entire database. 8. Nonquoted identifiers are not case sensitive. Oracle interprets them as uppercase. Quoted identifiers are case sensitive. 9. Columns in the same table or view cannot have the same name. However, columns in different tables or views can have the same name. 10. Procedures or functions contained in the same package can have the same name, if their arguments are not of the same number and datatypes. Creating multiple procedures or functions with the same name in the same package with different arguments is called overloading the procedure or function.

Netezza Netezza objects include tables, views, and columns. Follow these naming conventions: 1. A name must be from 1 to 128 characters long. 2. A name must begin with a letter (A through Z), diacritic marks, or non-Latin characters (200-377 octal). 3. A name cannot begin with an underscore (_). Leading underscores are reserved for system objects. 4. Names without quotes are not case sensitive. For example, CUSTOMER and Customer are the same, but object names are converted to lowercase when they are stored in the Netezza database. However, if a name is enclosed in quotation marks, then it is case sensitive. For optimal Netezza performance, the distributions keys should be converted to integer data types.

D:\D\Documents\DW Me\0 Publish\DW Names.docx

February 13, 2018

27 of 27

View more...

Comments

Copyright ©2017 KUPDF Inc.
SUPPORT KUPDF