Dimensional Modeling CS 543 – Data Warehousing

Download Report

Transcript Dimensional Modeling CS 543 – Data Warehousing

Dimensional Modeling
CS 543 – Data Warehousing
From Requirements to Data Models
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
2
Logical Data Model

Logical data design includes identification of all data
elements and the structures in which they are connected
 Data
elements
 Data structures

Requirement gathering, and more specifically,
information packages lead to the logical data design
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
3
Dimensional Modeling


A logical data design technique to structure the
business dimensions and the metrics that are analyzed
along these dimensions
Dimensional modeling
 Is
intuitive for business
 Has proven to be efficient for queries and analyses

Information packages are the foundation of
dimensional modeling
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
4
Fact Table
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
5
Dimension Tables
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
6
Desired Characteristics in the Model






The model should provide the best data access
The model should be query-centric
The model must be optimized for queries and analyses
The model must show that the dimension tables interact
with the fact table
It must be structured in such a way that every
dimension can interact equally with the fact table
The model should allow drill down and roll up along
dimension hierarchies
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
7
The Star Schema
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
8
E-R Vs. Dimensional Modeling (1)

Entity-relationship modeling
 Removes
data redundancy
 Ensures data consistency
 Expresses microscopic relationships

Dimensional modeling
 Captures
critical measures
 Views along dimensions
 Intuitive for business users
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
9
E-R Vs. Dimensional Modeling (2)
DM rules more restrictive than for E-R
modeling.
 DM is a simpler logical model.
 E-R representative power is greater due to
variety of constructs supported.
 DM looks like normalized E-R conceptual,
except:

All
relationships mandatory M-1.
Single path between any two relations.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
10
Another Example: Retail Dimensions
Ad Year
Ad Month
Operations
Planning
Subcategory
Region
Zone
SKU
District
Market
Calendar
Merc handising
Fiscal Year
Calendar Year
Department
Fiscal Quarter
Calendar Quarter
Category
Fiscal Period
Calendar Month
Fiscal
Ad/Fiscal
Advertising
aligned
Calendar Week
Ad Week
UPC
Day
Store
Sales
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
11
Star Schema
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
12
Querying Against a Star Schema
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
13
Dimension Tables Characteristics








Dimension table key
Large number of attributes (wide)
Textual attributes
Attributes not directly related
Flattened out, not normalized
Ability to drill down/roll up
Multiple hierarchies
Less number of records
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
14
Fact Table Characteristics








Concatenated fact table key
Grain or level of data identified
Fully additive measures
Semi-additive measures
Large number of records
Only a few attributes
Sparsity of data
Degenerate dimensions
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
15
“Factless” Fact Table

If the metric or unit of analysis is occurrence or non-occurrence
of an event, then the fact table will contain either 1 or nulls
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
16
Data Granularity (1)

Actual events are tied to actual transactions (e.g.,
sales).
 This

corresponds to the lowest grain or highest detail
Accumulated events are the effect of accumulated
transactions (e.g., inventory on hand).
 This
corresponds to a higher grain and lesser detail

Allowable events represent the “ability” to perform a
transaction (e.g., carried products, a.k.a. plan-o-gram).

Actual events are typically more sparse than allowable
events (e.g., a store carries more products than it sells
each day).
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
17
Data Granularity (2)


Low grain designs are easy to change (“graceful”
change)
Low grain designs result in larger storage and
maintenance costs
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
18
Keys (1)

Be careful in picking and using operational system
keys as keys for the dimension tables
 Avoid
built-in meanings in the primary keys of the dimension
tables
 Do not use operational system keys as primary keys of
dimension tables
 Use surrogate keys (system generated keys)

Keep a mapping between the surrogate and primary
keys
 Include the
operational system primary key as an attribute in
the dimension table
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
19
Keys (2)
Primary key options for fact tables
 A single compound primary key whose length is the
total length of the keys of the dimension tables
 Foreign
keys need to be stored as additional attributes in the
fact table
 Increases size of fact table

A concatenated primary key that is the concatenation of
all the primary keys of the dimension tables
 No

need to store the foreign keys separately
A generated primary key independent of the keys of the
dimension tables
 Foreign
keys need to be stored as additional attributes
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
20
Advantages of the Star Schema




Easy for users to understand
Optimizes navigation
More suitable for query processing
Star-join and star-index
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
21
Updates

Updates to the fact table
 Addition
of rows
 Changes in row (adjustments in values)
 Rarely, addition of attributes (new fact or metric)

Updates to dimension tables
 Slow
addition of rows
 Slow addition of attributes
 New dimensions
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
22
Updates to the Dimension Tables







Most dimensions are generally constant over time
Many dimensions, if not constant, change slowly over
time
The key of the source record does not change
The description and other attributes change slowly over
time
In the source OLTP systems, the new values overwrite
the old values
Overwriting is not always the best option for
dimension table attributes
The way updates are made depends on the type of
change
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
23
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
24
Type 1 Changes: Correction of Errors (2)





Overwrite the attribute value in the dimension table
row with the new value
The old value of the attribute is not preserved
No other change are made in the dimension table row
The key of this row or any other key value are not
affected
This type is easiest to implement
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
25
Type 1 Changes (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
26
Type 2 Changes: Preservation of History (1)

Properties





They usually relate to true changes in source systems
There is a need to preserve history in the data warehouse
This type of change partitions the history in the data warehouse
Every change for the same attribute must be preserved
Approach





Add a new dimension table row with the new value of the changed
attribute
An effective data field may be added into the dimension table
There are no changes to the original row of the dimension table
The new row is inserted with a new surrogate key
The key of the original row is not affected
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
27
Type 2 Changes: Preservation of History (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
28
Type 3 Changes: Tentative Soft Revisions (1)

Properties





They usually apply to “soft” or tentative changes in the source systems
There is a need to keep track of history with old and new values of the
changed attribute
They are used to compare performances across the transition
They provide the ability to track forward and backward
Approach






Add an “old” field in the dimension table for the affected attribute
Push down the existing value of the attribute from the “current” field to
the “old” field
Keep the new value of the attribute in the “current” field
Also, you may add a “current” effective date field for the attribute
The key of the row is not affected
No new dimension row is needed
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
29
Type 3 Changes: Tentative Soft Revisions (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
30
Large Dimensions (1)

Large dimensions?
 Large
number of rows (deep)
 Large number of attributes (wide)


Dimensions can become large because of frequent
changes (what type?) and need to have many attributes
for analysis
Consequence
 Slow

and inefficient
Solution
 Proper
logical and physical design
 Indexes
 Optimized algorithms
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
31
Large Dimensions (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
32
Multiple Hierarchies
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
33
Junk Dimensions



Dimensions for a DW are typically taken from
operational source systems
However, source systems contain many additional
attributes (such as flags, text, descriptions, etc) that
may not be useful in a DW
What are the options
 Discard
all such fields in the source systems
 Include them in the fact table
 Include all of them as dimensions
 Select some and add them to a single “junk” dimension table
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
34
The Snowflake Schema

Snowflaking is a method of normalizing the tables in a
star schema
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
35