Data Modelling 101

Download Report

Transcript Data Modelling 101

Dimensional Modelling 102
Author
Presented To
Revised Date
Nick Draper, John Campbell
BearingPoint Internal
September 2008
Global Management and Technology Consulting
Authors
This presentation was prepared by:
John Campbell, Technical Architect
BearingPoint
120 Collins Street
Melbourne Australia
Tel: +61 3 8665 9537
E-mail: [email protected]
Nicholas Draper, Consultant
BearingPoint
120 Collins Street
Melbourne Australia
Tel: +61 3 8665 9694
E-mail: [email protected]
This document is protected under the copyright laws of the United States and other countries. This document
contains information that is proprietary and confidential to BearingPoint, Inc., its subsidiaries, or its alliance partners, which shall not be
disclosed outside or duplicated, used, or disclosed in whole or in part for any purpose other than to evaluate BearingPoint, Inc.
Any use or disclosure in whole or in part of this information without the express written permission of BearingPoint, Inc. is prohibited.
© 2005 BearingPoint, Inc. All rights reserved.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
DIMENSIONAL MODELLING
102- CONTEXT
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Information Management Course Context
Information Management
Methodology (MIKE2)
CUSTOMER
CUSTOMER NUMBER
CUSTOMER NAME
CUSTOMER CITY
CUSTOMER ST
CUSTOMER POST
CUSTOMER ADDR
CUSTOMER PHONE
CUSTOMER FAX
Data
Investigation
Network
Data
Standardisation
Process
Automation
Data
Correction
ETL
Data
Enrichment
Data
Matching
Middleware/ESB
Relational
Data Modeling
Metadata
Management
Operations
Security
Platforms
‘Why’
Information Management Solution Packs (see imCollab)
Data Warehousing
Reference
and
Master Data
Data
Warehousing
DW 101
MIKE2
Information
Governance
SQL 101
ETL 101
Data
Quality
Information
Governance
Dimensional
Modelling
DM 102
Solution
Focused
Internal
Meta Data
Data
Modelling
DM 101
Technique Focused – ‘What’
ERwin
DM 202
© 2008 BearingPoint, Inc.
MetaData
101
Tool Focused – ‘How’
INFORMATION MANAGEMENT
DW Reference Architecture
Source
Systems
SS2
Changes
Changes
6
ODS
4
DDS
Extract,
Transform and Load
SS1
7
5
Extract,
Transform and Load
Source
System
Metadata
10
Value Add
2
Manual
Data
Entry
Source
System
Metadata
3
Extract, Cleanse,
Transform & Load
SS2
Replicate/Copy
SS1
1
Source
System
Replicas
9
8
Data
DataMart
Mart
ODS
Metadata
ETL
Metadata
Data
Warehouse
Metadata
ETL
Metadata
Web Client
Desktop
Client
Manual
Data
Source System ETL
Metadata Metadata
Web
Infrastructure
Data Mart
Metadata
Access
Metadata
Metadata Repository
Focus of this course
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Why are we presenting this course ?
This course is being presented to BearingPoint’s TS technical
personnel
 To establish a base line as regards nomenclature and
knowledge
 To set expectations for how Data Models should be presented
in client documents.
 (In short – so we can all read and understand each others
work.)
 To establish a basic Dimensional Modelling Paradigm
 To supply practical skills
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Contents
Recap Models and Model Types
The Basics
Schema Types
─
Star schema
─
Snowflake schema
Components of a Dimensional Model
─
Facts
─
Dimensions
─
Relationships
─
Grain
Technique Thomsen Diagram
The Process
Slightly Advanced
Internal
•
Date and Time, Time Variant
•
Surrogate keys, Slowly Changing Dimensions
•
Aggregate fact tables
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
MODELS AND MODEL TYPES
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
What is a Model ?
Definitions of ‘Model’ abound
 “the act of representing something (usually on a smaller scale)”
Properties
 They aren’t real

Their function is to aid communication, between users,
technologists, machines
“All Models are wrong, some models are useful.”
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Are there different types of ‘Data Models’ ?
Several levels of ‘Data Models’ are usually used and each has, as a
focus, a different audience
These were covered in the Data Modelling course. All three types are
applicable Dimensional modelling also, however, Logical and Physical
are more prominent.
Logical
Model
Conceptual
Model
Physical
Model
Business
Users
Internal
Technicians
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
THE BASICS
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Dimensional Modelling Vs Normalisation

Normalisation is good for the middle layer of a 3-tier DW
design
 Minimal redundancy improves maintainability – data is
updated in one place.
 Normalised form can unify a diversity of enterprise data
sources in a flexible manner.

Denormalisation is good for Business Intelligence
 Minimal redundancy is not necessary because data is derived
from other sources, not directly maintained in dimensional
form.
 Redundancy improves comprehension and usability of data
structures.
 Data mart SQL tends to consist of complex queries affecting a
large number of tables and columns and returning large result
sets. A simple structure can improve query performance.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Spreadsheet – Two Dimensions
Lets us start with a typical example of two-dimensional data. Anything
that you track, whether it is hours per employee, costs per department,
balance per customer, or complaints per store, can be arrange in a twodimensional format.
Month
January
February
March
April
May
June
July
August
September
October
November
December
Total
Sales
750
700
810
820
900
930
890
740
840
900
830
900
10,010
Direct Costs Indirect Costs Total Costs
Margin
420
100
520
230
500
110
610
90
530
90
620
190
450
130
580
240
410
80
490
410
630
130
760
170
540
100
640
250
550
110
660
80
470
120
590
250
520
150
670
230
430
100
530
300
570
90
660
240
6,020
1,310
7,330
2,680
The data set may be said to be arranged to have two dimensions: a
row-arranged month dimension and a column-arranged measures.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Pivot Table - Three Dimensions
Now, let’s add a THIRD Contextual Dimension to the same spreadsheet –
Products. The spreadsheet now highlights that the Date and Measures
data presented relates specifically to the Product Category – Shoes.
context
Product: shoes
Month
January
February
March
April
May
June
July
August
September
October
November
December
Total
columns
Measures: all
Sales
750
700
810
820
900
930
890
740
840
900
830
900
10,010
Direct Costs Indirect Costs Total Costs Margin
420
100
520
230
500
110
610
90
530
90
620
190
450
130
580
240
410
80
490
410
630
130
760
170
540
100
640
250
550
110
660
80
470
120
590
250
520
150
670
230
430
100
530
300
570
90
660
240
6,020
1,310
7,330
2,680
rows
Time: Months
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
COGNOS 8 – Analysis Studio
This kind of
display can be
found in many
current OLAP
tools.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
SCHEMA TYPES
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Dimensional Modelling
Based around ‘Measures’ (Fact Tables) that are constrained by
‘Dimensions’ (Dimension Tables).
 Very common in Data Warehouse applications.
 Can directly feed other tools such as MOLAP databases..
 Tend to have a very specific focus that is easy for users to
understand
 Users will get confused with more than about six dimensions.
 The model is easy - the ‘Fact’ is unique by ALL of the
‘Dimensions’.
 Is concerned primarily with retrieval needs.
 Is almost always summarised, eg using SUM or MAX and so on.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Star Schema

Always start with
this simple form
Basic form includes a central
table with a number of
descriptive tables joined directly
•
Central table known as the
Fact table
•
Satellite tables known as
Dimension tables
Fact Table
Customer ID
Claim ID
Customer Number
Effective Date
Policy Years
Policy Count
Income
Claim Number
Claim Status Desc
Cause Description
Cause Code
Claim Status Code
Reported Date
Entered Date
Claim Transaction
Coverage
Coverage ID
•

Internal
A simpler design that can be
easily optimized for data
retrieval
All dimension tables completely
denormalised

All dimension tables relate
directly to the fact table

The Grain is set by the
dimensions

Best for straightforward
modelling requirements.
Claim
Customer
Policy Number
Coverage Effective Date
Coverage Expiration Date
Coverage Status
Sum Insured
Claim Count
Customer ID (FK)
Product ID (FK)
Coverage ID (FK)
Claim ID (FK)
Claim Trans Type ID (FK)
Catastrophe ID (FK)
Claim Trans Date ID (FK)
Claim Transaction Type
Claim Trans Type ID
Claim Trans Type Code
Claim Trans Type Desc
Reversal Indicator
Reversal Indicator Desc
Amount
Catastrophe
Catastrophe ID
Product
Product ID
Product
Product
Product
Product
Type Desc
Category Code
Category Desc
Desc
Dimension
Tables
© 2008 BearingPoint, Inc.
Calendar
Calendar ID
Catastrophe Desc
Catastrophe Start Date
Catastrophe End Date
Catastrophe Location
Cal Date
Cal Year
Cal Month
Cal YearMonth
Fin Year
Fin Month
Fin YearMonth
Relationships
INFORMATION MANAGEMENT
Snow Flake Schema
Customer

Described as a variant option

One or more dimension tables
are not completely
denormalised


Some data represented in a
snowflake or outrigger table
Benefits
Claim ID
Customer Number
Effective Date
Policy Years
Policy Count
Income
Claim Number
Claim Status Desc
Cause Description
Cause Code
Claim Status Code
Reported Date
Entered Date
Claim Transaction
Coverage
Coverage ID
Policy Number
Coverage Effective Date
Coverage Expiration Date
Coverage Status
Sum Insured
Claim Count
─ Useful for complex
modelling situations, such
as dynamic hierarchies, or Product
Product ID
shared dimensions
Product Type Desc

Drawbacks
Product Category Code (FK)
Product Desc
─ More navigation needed
and it can get complicated Product Category
Product Category Code
quickly

Internal
Avoid unless requirements
demand it
Claim
Customer ID
Customer ID (FK)
Product ID (FK)
Coverage ID (FK)
Claim ID (FK)
Claim Trans Type ID (FK)
Catastrophe ID (FK)
Claim Trans Date ID (FK)
Claim Transaction Type
Claim Trans Type ID
Claim Trans Type Code
Claim Trans Type Desc
Reversal Indicator
Reversal Indicator Desc
Amount
Catastrophe
Catastrophe ID
Calendar
Calendar ID
Catastrophe Desc
Catastrophe Start Date
Catastrophe End Date
Catastrophe Location
Cal Date
Cal Year
Cal Month
Cal YearMonth
Fin Year
Fin Month
Fin YearMonth
Product Category Desc
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
COMPONENTS OF A
DIMENSIONAL MODEL
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Components of a Dimensional Model
─ Fact Tables
Customer
─ Dimensional Tables
Customer ID
─ Relationships
─ Grain
Fact Table
Customer Number
Effective Date
Policy Years
Policy Count
Income
Claim Transaction
Coverage
Coverage ID
Dimension
Tables
Policy Number
Coverage Effective Date
Coverage Expiration Date
Coverage Status
Sum Insured
Claim Count
Customer ID (FK)
Product ID (FK)
Coverage ID (FK)
Claim ID (FK)
Claim Trans Type ID (FK)
Catastrophe ID (FK)
Claim Trans Date ID (FK)
Claim
Claim ID
Claim Number
Claim Status Desc
Cause Description
Cause Code
Claim Status Code
Reported Date
Entered Date
Claim Transaction Type
Claim Trans Type ID
Claim Trans Type Code
Claim Trans Type Desc
Reversal Indicator
Reversal Indicator Desc
Amount
Catastrophe
Catastrophe ID
Product
Product ID
Product
Product
Product
Product
Calendar
Calendar ID
Type Desc
Category Code
Category Desc
Desc
Catastrophe
Catastrophe
Catastrophe
Catastrophe
Desc
Start Date
End Date
Location
Cal Date
Cal Year
Cal Month
Cal YearMonth
Fin Year
Fin Month
Fin YearMonth
Relationships
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Components of a Dimensional Model - FACT
Tables

Contain measures
─ Usually numeric, measures quantify the business

Most useful measures are additive
─ Additive measures can be meaningfully added across rows

Row population is sparse
─ A row exists only where there are non-zero measures

Fact tables are not denormalised

Examples:
─ Sales
─ Counts
─ Percentage
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Components of a Dimensional Model DIMENSION Tables
 Purpose of a Dimension – To Add Context to the Fact
 Contain attributes
─ Usually textual, attributes describe the business
 Attributes are used for filtering and grouping
 Dimension tables are typically denormalised
─ Increases comprehension
─ Facilitates browsing
 Have unique primary keys to identify every row
 Example Dimensions
─ Date
─ Product
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Components of a Dimensional Model RELATIONSHIPS
 Relationships are one-to-many
─ Dimensions are parents (one)
─ Facts are children (many)
─ Any logical many-to-many relationships must be decomposed
 Fact tables contain foreign keys
─ Point to primary keys in dimension tables
 Referential Integrity is critical
─ Every fact must have a parent row in each dimension table
─ Violations lead to incorrect and inconsistent query results
─ If you have to put a ‘Not Applicable’ value in the dimension so that
the Fact has something to link to.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Components of a Dimensional Model - GRAIN
 ‘Grain’ is the fundamental atomic level of data to be represented in
the fact table.
 Business analysis discovers the level at which the data needs to
be represented
 Fact grain is determined
─ Transaction is the finest grain
─ Data is aggregated if transaction grain is not needed
 Dimension grain is matched to fact grain
 Fine grains have performance implications – hardware must be
adequate to handle the load
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Components of a Dimensional Model - RECAP
Each Fact, (A ‘Claim
Customer
Transaction. Amount’) is defined Customer ID
Customer Number
by
Effective Date
Policy Years
Policy Count
ONE Customer,
Income
ONE Claim,
ONE Claim Transaction Type, Coverage
Coverage ID
ONE Catastrophe,
Policy Number
Coverage Effective Date
ONE Calendar (date),
Coverage Expiration Date
Coverage Status
ONE Product,
Sum Insured
Claim Count
ONE Coverage.
Other than the Claim
Transaction all entities are
Dimensions’
Grain here is at the transaction
level – each ‘claim’. Without
this we would have to, say,
total the amounts across all
claims for the one day.
Internal
Claim
Claim ID
Claim Transaction
Customer ID (FK)
Product ID (FK)
Coverage ID (FK)
Claim ID (FK)
Claim Trans Type ID (FK)
Catastrophe ID (FK)
Claim Trans Date ID (FK)
Claim Number
Claim Status Desc
Cause Description
Cause Code
Claim Status Code
Reported Date
Entered Date
Claim Transaction Type
Claim Trans Type ID
Claim Trans Type Code
Claim Trans Type Desc
Reversal Indicator
Reversal Indicator Desc
Amount
Catastrophe
Catastrophe ID
Product
Product ID
Product
Product
Product
Product
Type Desc
Category Code
Category Desc
Desc
© 2008 BearingPoint, Inc.
Calendar
Calendar ID
Catastrophe
Catastrophe
Catastrophe
Catastrophe
Desc
Start Date
End Date
Location
Cal Date
Cal Year
Cal Month
Cal YearMonth
Fin Year
Fin Month
Fin YearMonth
INFORMATION MANAGEMENT
Components of a Dimensional Model - RECAP
With this model we can ‘drill
down’ to the most granular
fact level (one customer, one
claim etc etc
Customer
Claim
Customer ID
Claim ID
Customer Number
Effective Date
Policy Years
Policy Count
Income
Claim Number
Claim Status Desc
Cause Description
Cause Code
Claim Status Code
Reported Date
Entered Date
Claim Transaction
OR by discounting one or
more dimensions we can
aggregate up to, say, a
whole day or across all
claims of a certain type
Coverage
Coverage ID
Policy Number
Coverage Effective Date
Coverage Expiration Date
Coverage Status
Sum Insured
Claim Count
Claim Transaction Type
Claim Trans Type ID
Claim Trans Type Code
Claim Trans Type Desc
Reversal Indicator
Reversal Indicator Desc
Amount
Catastrophe
Catastrophe ID
Product
Product ID
Product
Product
Product
Product
Internal
Customer ID (FK)
Product ID (FK)
Coverage ID (FK)
Claim ID (FK)
Claim Trans Type ID (FK)
Catastrophe ID (FK)
Claim Trans Date ID (FK)
Type Desc
Category Code
Category Desc
Desc
© 2008 BearingPoint, Inc.
Calendar
Calendar ID
Catastrophe
Catastrophe
Catastrophe
Catastrophe
Desc
Start Date
End Date
Location
Cal Date
Cal Year
Cal Month
Cal YearMonth
Fin Year
Fin Month
Fin YearMonth
INFORMATION MANAGEMENT
TECHNIQUE: THOMSEN
DIAGRAM
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Technique: Thomsen Diagrams
 Purpose – Rich Documentation
/Design Technique
─ particularly good beyond 3
dimensions where the cube
analogy falls apart
 Represents ‘Dimensions’ as line
segment with Hierarchy or
aggregation levels specifically
noted
 Measures / Facts are listed to
the side
Total
Products
Total
All Years
Product
category
Total
Year
Product
Type
Total
Month
Product
Day
(Date)
Measures
(Facts)
* Claim
Transaction
Amount
 Strengths
─ Scalability, Simplicity
─ Hierarchies, and their order,
immediately visible and
verifiable
Dimension
Fact
 Weaknesses
─ 10+ Dimensions
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
PROCESS
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Inputs to Dimensional Modelling
 Requirements documents
 Source data models / database schemas
 Metadata / data dictionary
 Existing reports / analysis models
 Consultation with business users
 Consultation with technical users
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Outputs of Dimensional Modelling
 Logical data model
 Physical data model
 Source-target mapping
 Validated business rules / transformations
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Dimensional Modelling Steps
1. Understanding the Business Problem
2. Choose the Dimensions
3. Choose the Grain of the Fact Table
4. Choose the Measured Facts
5. Choose the Dimension Attributes
6. Deriving the Physical Model from Logical
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Modelling Steps Example
We are going to explore the fictional ‘ACME Bolt company’ and it’s Key
performance Indicator: “Total Bolts Sold per Customer (TBSC)” This
measure has been handed down from ACME’s foreign parent and
everybody’s bonus is related to proving this value has risen over the
latest periods.
1. Understanding the Business Problem
What is our problem ?
Getting our bonus of course but for this exercise lets just satisfy the request.
We will at least need to discover what is meant by ‘Customer’ and ‘Bolts sold’
(does this mean ordered? Invoiced? Delivered?)
Interviews with the clients reveals that its not across the board, individual
regions and branches will be judged also. We have to prove a general trend
across 12 months.
The user would also like to know something of the Customer’s category also so
as to be able to manage trends during the year. Similarly the groupings of
bolts.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Modelling Steps Example (Cont)
2. Choose the Dimensions
 These are readily apparent if using the Thomsen Diagram.
 If not then these will need to be deduced from the Problem
definition in step one.
 Draw a small data model with the measure / fact in the centre
and the other concepts around this – including an entity for
each aggregation level.
 A Star schema would collapse (denormalise) all these levels.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Modelling Steps Example (Cont)
3. Choose the Grain of the Fact Table
4. Choose the Measured Facts

These steps go hand in hand. In our example just one fact has been
requested – Number (count) of bolts sold. In the real world this may also
include ‘profit’ but anything included has to be constrained by the exact same
dimensions.

The Grain may be determined by the request in this case total bolts per day
per customer per product per store

Sometimes extra detail is included eg choose ‘day’ even if only month has
been asked for.
•

Sometimes the detail requested cannot be stored – whilst we aim for
Transaction level some clients have hundreds of millions of low level
transactions which it is just not economical to replicate and manage.
•
Internal
Extra detail is more flexible for the future but costs more today to load and
summarise
Losing detail sacrifices flexibility but can reduce cost.
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Modelling Steps Example (Cont)
5. Choose the Dimension Attributes
 Dimension attributes describe the business. They are used to
filter and group in reports and queries.
 Choose to decode any codes eg don’t just take a ‘region code’
– decode it as well and take ‘Region Name’ – these will
become user selectable items.
 Character fields are almost always attributes.
 Err on the side of including too many attributes from the
source data – the performance penalty is negligible and it is
simple to hide any attributes which are later found to be
useless.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Modelling Steps Exercise
 Lets Build the Thomsen Diagram for our
ACME KPI.
Total
Products
 Step 1 is analysis. This has been completed
and is summarised below
 Step 2 is choosing the dimensions – I have
added one to start us off.
Product
category
 Step 3 is choose the Grain of the Fact Table
Product
(Bolt)
 Step 4 is choose the measured fact
Problem Restatement:
“Total Bolts Sold per Customer (TBSC)”
Additional management info requested.
Internal

Geographic detail of Store, Branch Region

Product and category

Customer and the customer category hierarchy
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Modelling Steps Exercise
─ Here is a finished version
─ As discussed it can also
confirm that Branch is
within Region (not the other
way around)
Total All
Areas
Total All
Customers
Region
Customer
Category
Branch
Customer
Sub Category
Product
category
* Bolts sold
Total
Month
Store
Customer
Product
(Bolt)
Day
(Date)
Total
Total
Products All Years
Total
Year
Measures
(Facts)
Dimension
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Exercise Star Model
From the Thomsen Diagram we can create a data model…
Product
Calendar
Product Code
groups
Product Code (FK)
Date (FK)
Store Code (FK)
Customer Number (FK)
Bolts Sold Quantity
Geography
Store Code
Date
Bolts Sold Fact
Product Description
Product Category Code
Product Categort Description
groups
groups
groups

Customer
Customer Number
Customer Name
Customer Sub Category Code
Customer Sub Category Description
Custoemer Category Code
Customer Category Description
Store Name
Branch Code
Branch Name
Region Code
Region Name

Month
Year
Day Name
Because this is a star schema (i.e. totally denormalised) you can see how we cannot tell
whether Branch or Region is higher.
Remember the purpose of the model is communication
(p.s. I know it not the purest data model)
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
SLIGHTLY ADVANCED
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Slightly Advanced Topics
 Date and Time
 Time Variant
 Surrogate keys
 Hierarchies
 Aggregate fact tables
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Date and Time
 Don’t confuse the two!
 You will meet many situations where it is stated that a
dimension is ‘time’. Almost universally this is actually date.
 Date and Time are ‘static’ reference dimensions and should be
populated in advance – usually as part of initial build.
 The DAK Data standards document has a sample schema for
Calendar which includes extra data columns for ‘is last day of
month’ etc. These can make later queries much easier.
Sample spreadsheets to load also exist.
 Do not be temped to combine into one dimension
─ At the grain of Date there would be 3650 rows to represent 10
years.
─ At the grain of Minute there are 1440 minutes in a day – so 1440
rows needed
─ Combined this would need 5,256,000 rows to represent all the
minutes for 10 years.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Time Variant
 Time Variance: “A characteristic of a data warehouse that
defines the moment in time that the data or variant of the data
is valid. If Order No. 123 has a value of $1,500.00 on Dec 1
and $1,700 on Dec 10, Dec 1 and Dec 10 shows us the time
variance of Order No. 123.”
 Many operational source systems only record one item of
information and if that changes the new value is simply
replaced. E.g. if you move house your doctor or movie rental
company really doesn’t care where you used to live.
 For big organisations this is sometimes overcome by using a
separate data warehouse where each change is noted by
boundary dates. I.e. a start and end date.
 Adding these is not simple as it can subtly changes the overall
granularity – if dates (not times) are used then only one fact
value is possible per day. You now have to be extremely clear
about what value is to be used. The last of the day? The
maximum of the day?
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Natural and Surrogate Keys
 Natural keys are the values usually referred to by people as
the identifiers of entities (customer number, claim number,
etc.).
─ They are often the primary keys in source systems.
─ As a general rule they should not be used in a warehouse - but
they might be in a dimensional model directly off a single source
system.
─ The problem is what happens when a second data source
populates the dimension? Now the jumble of numbers means
nothing and in fact the same identifier could be used in different
systems to refer to different things.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Natural and Surrogate Keys
 A surrogate is something used ‘instead of’. A surrogate is an
artificial, numeric, key generated from a pool of numbers
inside the warehouse.
─ Use these as Primary Keys for Dimensions.
─ Will facilitate efficient Fact to Dimension joins
─ Support Slowly changing dimension (next slide)
 If using surrogates then bring in the source system natural key
along with another field to say which source system this value
came from – i.e. put the context back.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Hierarchies
 Hierarchies are pervasive in the vast majority of
organisations.
Total
Products
 Hierarchies are quite disorganised in the vast
majority of organisations.
─ Multiple independent hierarchies often needed
 Hierarchies within the dimensions are very
important
Product
category
Product
(Bolt)
─ Within the proper tool they enable “drill up/ drill
down”
─ e.g. day, week, month, quarter, year
─ e.g. Product, Product Category, Total Products
 Details usually need to be explicitly stored.
─ E.g. Decode all codes.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Hierarchies – Simple, Static Hierarchies
Simple, static hierarchies are best designed directly into
the dimensions.
Total
Products
 This is what was meant when we said we ‘denormalised’ for a Star Schema
 Easiest to use
Product
category
 Most efficient to query
 e.g. Product, Product Category, Total Products
Product
(Bolt)
 Example: Geography (Store, Branch, Region,
[State, Country])
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Hierarchies – Complex And/Or Dynamic
Hierarchies
If hierarchies are complex, if there are multiple hierarchies on a
dimension or if the hierarchy changes often, it could be messy to
design the hierarchy into the dimension.
 Snowflake the dimension, creating one or more outboard
hierarchy tables.
 Changes to hierarchies do not affect the base dimension.
 Multiple hierarchies can be represented with multiple tables or
with a hierarchy ID column which must be filtered on in any
query.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Hierarchies – Example
Below is a common situation, caused where
different ‘departments’ view the summarised data
differently.
Total All
Warehouses
Region
 In this case it is Critical that it is the SAME
fact with the same granularity – in this
case Store.
 It is just the summaries beyond ‘Store’
Distribution Hierarchy
that
differ.
Distribution Identifier
Warehouse
 Watch for the thesummarises
same item being used Bolts
inSold Fact
Geography
Date (FK)
different contexts e.g.
Region in this
Geography Identifier
Product Identifier (FK)
Geography Identifier (FK)
Store Code
example is NOT the same
thing. groups
Customer Identifier (FK)
Store Name
Geography Managment Hierarchy
Geography Mgt Identifier
Region
Measures
(Facts)
* Bolts sold
Distribution
Node
Distribution Node Code
Distribution Node Name
Warehouse Code
Warehouse Name
Distrubution Region Code
Distrubution Region Name
Distribution Identifier (FK)
Geography Mgt Identifier (FK)
Total All
Areas
Branch
Store
Bolts Sold Quantity
summarises
Branch Code
Branch Name
Management Region Code
Management Region Name
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Aggregate Fact Tables
 Fact tables are very large
 Aggregates (pre-stored summaries) are the most effective way
of improving data warehouse performance
 An aggregate is a fact table records representing a
summarisation of base level fact table records.
 Can be Explicitly Designed and managed or many DBMS now
have inbuilt Aggregations available.
 Aggregate awareness – DBMS implicit and OLAP tool explicit
 Each grain of aggregate should occupy its own fact table, and
be supported by appropriate category dimension tables
─ What will that do to the number of tables? Can be an exponential
blow out.
─ Complexity from end-users point of view? They can be forced to
remember what summaries exist and what they are called.
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Revised Star Model
This is our data model revised to use surrogate keys, some time
variance and one split hierarchy.
Calendar
Date
Month
Year
Day Name
Distribution Hierarchy
Distribution Identifier
Distribution Node Code
Distribution Node Name
Warehouse Code
Warehouse Name
Distrubution Region Code
Distrubution Region Name
Geography Managment Hierarchy
groups
summarises
Geography
Bolts Sold Fact
Geography Identifier
Store Code
Store Name
Distribution Identifier (FK)
Geography Mgt Identifier (FK)
groups
summarises
Product Identifier
Product Code (AK1.1)
Start Date (AK1.2)
End Date
Product Description
Product Category Code
Product Categort Description
Date (FK)
Product Identifier (FK)
Geography Identifier (FK)
Customer Identifier (FK)
groups
groups
Bolts Sold Quantity
Customer
Customer Identifier
Geography Mgt Identifier
Customer Number (AK1.1)
Start Date (AK1.2)
End Date
Customer Name
Customer Sub Category Code
Customer Sub Category Description
Custoemer Category Code
Customer Category Description
Branch Code
Branch Name
Management Region Code
Management Region Name
Internal
Product
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
QUESTIONS
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT
Internal
© 2008 BearingPoint, Inc.
INFORMATION MANAGEMENT