Data Warehouse : Modeling and Design N. L. Sarda NLS/IITB/DWH

Download Report

Transcript Data Warehouse : Modeling and Design N. L. Sarda NLS/IITB/DWH

Data Warehouse :
Modeling and Design
N. L. Sarda
NLS/IITB/DWH
1
Outline
•
•
•
•
Introduction
Warehouse structure
A case study
Dimensional analysis
NLS/IITB/DWH
2
Introduction
• DW is a single, complete and consistent store of
data from different sources to understand &
analyze the business
• Contains history data
• Warehouse to facilitate browsing, navigating,
aggregating and visualization of related data to
understand performance, problems, customer
preferences, trends, etc.
• Warehouse data organized by important
business subjects (customer, product, etc…)
NLS/IITB/DWH
3
Warehouse Structure
• Organized to facilitate ease of access and
aggregation
• warehouse structure decomposed into
dimensions and facts
– Dimensions like ‘independent variables’, represent
entities for analysis
– Fact represents business data; relates to a set of
dimensions
– Eg : customer, time, type of account are dimensions,
and balances are facts
NLS/IITB/DWH
4
Warehouse Structure...
• The complex network of business entities and
their relationships as depicted in an operational
DB (using, say, ER model) is difficult for
navigation and analysis
• A ‘2-level’ structure defined by ‘star schema’ is
performed where a fact is at the center and
dimensions form ‘spokes’
• Data not stored in ‘normalized’ form
NLS/IITB/DWH
5
Star Schema
• Contains a fact table and for each dimension one
dimension table
date, custno, prodno, cityname, ...
Time
Cust
NLS/IITB/DWH
f
a
c
t
Prod
City
6
Dimensions
•
•
•
•
Stored as a database table
Contains many descriptive attributes for analysis
Small and slowly changing data
Data often group-able for analysis
– Customers by age, occupation, income level
– Time by weeks, months, years
– Branches as rural, suburban or by size
• Thus, dimension data viewable as a hierarchy
NLS/IITB/DWH
7
Facts
• Contain business activity data
• May be at detailed level or status level; called
transaction-oriented or snap-shot oriented
• Deciding on granularity : every sale or total sales
of a day ?
• Often contain numeric attributes for aggregation
(additive, semi-additive,…)
• Contain dimensional table keys also
NLS/IITB/DWH
8
Snowflake Schema
• Hierarchies not captured explicitly in a star
schema
• Snowflake schema represents hierarchy directly
• Saves on storage but requires more join
NLS/IITB/DWH
9
Snowflake Schema
• Represent dimensional hierarchy directly by
normalizing tables.
T
i
m
e
c
u
s
t
NLS/IITB/DWH
p
r
o
d
date, custno, prodno, cityname, ...
f
a
c
t
c
i
t
y
r
e
g
i
o
n
10
Conformed Dimensions and Facts
• Goal is to produce a master suite of conformed
dimensions and to standardize facts
• conformed dimension means same thing with
every fact table (eg., customer, time, geography)
• it may contain data brought together from many
sources
• ensures same units and meaning, same time
durations and geographies across marts
NLS/IITB/DWH
11
Financial Services : A Case Study
• A bank offers various products/services like
saving/checking accounts, mortgage loans,
personal loans, TD, credit cards, etc…
• Purpose : track various a/c, customer profiles,
etc…, for marketing and offering new services
• Requirements:
– Get end-of-month summary of a/c for last 5 years
– Valid snapshot as of yesterday for current month
(with full details)
– Ability to group a/c in various ways & compare
balances
– demographic behavior
NLS/IITB/DWH
12
Case Study ...
• Each account type has some unique attributes
(requiring customized dimension and facts for
each)
• Old data (a/c & customers ) may be incomplete
or even different
• The warehouse data may come from multiple
sources :
–
–
–
–
NLS/IITB/DWH
Loan processing system(customer,loan,dues,payment)
Fixed deposit system(customer,TD,…)
Front-office system(customer, account, transaction,..)
Credit-card system customer, transactions, interest,..)
13
Case Study ...
• Must plan extraction, correlation, consistent
representation,…
• Let us consider a possible warehouse design for
the indicated requirements
• Core fact table : balance in each account, # of
transactions, grain : month
• Dimensions : a/c, household, branch, product,
status, time
• A/c and household separate : many accounts per
family; household definitions change
NLS/IITB/DWH
14
Case Study ...
• Product dimension permits hierarchy and
defining specific attributes; separate because it
changes
• Status : active or not, closed, etc. with reasons
• Account contains customer’s data; for historical
reasons, customer to accounts relationship not
well maintained
NLS/IITB/DWH
15
The household data warehouse
account key
primary_name
secondary_name
account_address
account_city
account_state
account_zip
date_opened
primary_age
primary_sex
primary_marital
household key
household_head_name
household_address
household_city
household_state
household_zip
household_income
household_type
NLS/IITB/DWH
Household Facts
account_key
household_key
branch_key
product_key
status_key
time_key
primary_balance
transaction_count
branch key
branch-name
branch_address
branch_city
branch_state
branch_zip
branch_type
product key
product_description
type
category
status key
status_description
status_reason
new_account_flag
closed_account_flag
time key
month
year
fiscal_quarter
16
Case Study ...
• Balance is semi-additive : can not be added
across time
• Products highly heterogeneous : different
attributes characterize different accounts
(balance, deposit options, interest rate, over draft
limit,..)
• Can’t combine all in a dimension as many not
applicable to all products
NLS/IITB/DWH
17
Case Study ...
• Solution: create many facts, customized for each
product, and one core fact with a product
dimension having common attributes; leads to
100% replication, but facilitates clarifications,
browsing, etc. and avoids join of customized and
core facts
• When many facts are to be stored together go for
snapshots (eg. monthly)
NLS/IITB/DWH
18
Case Study ...
• Transaction-grained facts usually have a single
fact (eg. amount) that is directly involved in the
transaction; we need a transaction dimension to
represent these amounts
• In transaction grained fact table, we do not need
customized fact tables per product; instead we
create customized dimension tables
NLS/IITB/DWH
19
Data Warehouse Life Cycle
Project
planning
Business
Requirement
Definition
Technical
Architecture
Design
Product
Selection &
Installation
Dimensional
Modeling
Physical
Design
End-User
Application
Specification
Data Staging
Design &
Development
Deployment
Maintenence &
Growth
End-User
Application
Development
Project Management
NLS/IITB/DWH
20
Life Cycle : summary
• Project planning
• Business requirements definition
• Data track
– Dimensional modeling
– Physical design
– Data staging design and development
• Technology track
– Technical architectural design
– Product selection and and installation
NLS/IITB/DWH
21
Life Cycle...
• Application track
– End user application specification
– End user application development
• Deployment
• Maintenance and growth
• Project management
NLS/IITB/DWH
22
Collecting Requirements...
• Interviews/write-ups
• Requirements findings document
–
–
–
–
–
Project overview
review of business objectives
analytic and information requirements
preliminary source systems analysis
Preliminary success criteria
• Prepare and publish the requirements
NLS/IITB/DWH
23
Collecting Data about Existing Systems
• Understanding the candidate data sources
• Detailed criteria for selecting the data sources
–
–
–
–
Data accessibility
Longevity of the feed
Data accuracy
Project scheduling
• Customer matching and house-holding
• Browsing and data content
• Mapping data from source to target
NLS/IITB/DWH
24
Designing the Data Warehouse /
Data Marts
• Identifying marts and dimensions
• identify marts based on facts likely to be used
together, as a mart is a kind of subject area or
application (divide-and-conquer strategy)
• often based on a single business process or a
single source
• 10 to 30 marts common for a large organization
• build a matrix of marts versus dimensions
NLS/IITB/DWH
25
Designing a Fact
• Define fact grain based on the basic business
facts stored in legacy systems
• Choose dimensions and match them with
granularity of facts
• Combine as many facts as possible with the
context of defined granularity
NLS/IITB/DWH
26
Detailed Design Tips
• Names for dimensions and attributes should be
chosen carefully to refer to corresponding
business entities
• An attribute (in a dimension) is not replicated,
but a fact may be present in many fact tables
• If a dimension occurs multiple times (eg, time), it
is playing multiple roles; name them uniquely
• Every fact should have a default aggregation rule
so that it is not aggregated wrongly
NLS/IITB/DWH
27
Dimension Attributes
• The quality of the data warehouse is measured
by the quality of the dimension attributes
• The user interface responses and final reports
are restricted to the precise contents of the
dimension table attributes
• Properties
– Verbose, descriptive, complete
– Quality assured, indexed
– Equally available, documented
NLS/IITB/DWH
28
Time Dimension
• Every data warehouse fact table is a time series
of some observations
• We always seems to have one or more time
dimensions in our fact table designs
• Provides useful hierarchies : week, month,
quarter, year, etc
• Represents calendar with many useful attributes
like day of week, day of month, week#, day#,
quarter, weekday-flag, last-day-of-month-flag,
holiday flag, etc.
NLS/IITB/DWH
29
Slowly Changing Dimensions
• The product key or customer key does not
change, but the description of the product or
customer does
• The data warehouse has three options for above
changes
– Overwrite the dimension record with the new values,
thereby losing history
• It is used whenever the old value of the attribute has
no significance
• The corrections of any error falls into this category
NLS/IITB/DWH
30
Slowly Changing Dimensions...
– Create a new additional dimension record using a new
value of the surrogate key
• is primary technique for accurately tracking a change
in an attribute within a dimension
• requires use of a surrogate key
• a slowly changing dimension is used when a true
physical change to the dimension entity has taken
place
– Create an “old” field in the dimension record to store
the immediate previous attribute value
• It is used when a change is tentative
NLS/IITB/DWH
31
Time Stamping the Changes
• The design of slowly changing dimension may be
established by adding begin and end time stamps
and a transaction description in each instance of
a dimension record
• This design allows very precise time slicing of the
dimension by itself
NLS/IITB/DWH
32
Large Dimensions
• Data warehouses that store extremely granular
data may require some extremely large
dimensions
• To support large dimensions we must choose the
indexing technologies and data design
approaches that:
– supports rapid browsing of the unconditional
dimension, especially for low cardinality attributes
– Supports efficient browsing of cross-constrained
values in the dimension table
– Find and suppress duplicate entries in the dimension
NLS/IITB/DWH
33
Foreign Key, Primary Key,
Surrogate Key
• All dimensional tables have single keys, which,
by definition, are primary keys
• All data warehouse keys must be meaningless
surrogate keys; you must not use the original
production keys
• A four byte integer makes a good surrogate key
• Surrogate date keys
• Avoid smart keys
• Avoid production keys
NLS/IITB/DWH
34
Heterogeneous Product Schemas
• Multiple fact tables are needed when a business
has heterogeneous products
• The global view needs a single core fact table
crossing all lines of business, whereas local view
focuses on specific product
• There are many attributes and facts which apply
only to a specific product; a single fact table is
not feasible
• create customized fact and (product) dimension
table for each product, and build a core fact
table with attributes that make sense across all
lines of business; this allows to create a single
portfolio (of products) for each customer
NLS/IITB/DWH
35
Transaction Schema
• Every data mart needs two separate models
– Transaction version
– Periodic snapshot version
• ‘rolling’ snapshot containing averages across time
• Snapshots allow us to quickly measure the status of
the enterprise
• The Transaction schema
– low level transactions in the organization makes for a
good dimensional frame work
– The fact record for an individual transaction
frequently contains only a single value
NLS/IITB/DWH
36
Transaction Schema..
• The transaction-based WH commonly used in
–
–
–
–
–
NLS/IITB/DWH
Time of day analysis
Queue analysis
Fraud detection
Basket analysis
Current status
37
Factless Fact Tables
• useful to describe events and their coverage
• an event fact table records occurrence of an
event; has only flag and dimension keys (eg,
student attendance)
• coverage fact table is frequently needed when a
primary fact table in a dimensional data
warehouse is sparse; eg, primary fact table will
not provide items which were on promotion but
did not sale; the coverage table, containing only
dimension keys, lists all items on sale
NLS/IITB/DWH
38
Facts of Different Granularity
• The dimensional model gains power as the
individual fact records become more and more
atomic
• At the lowest level of individual transactions, the
design is most powerful because
– More of the descriptive attributes have single values
– The design withstands surprise in the form of new
facts, new dimensions, or new attributes within
existing dimensions
– More expressiveness at the lowest levels of granularity
NLS/IITB/DWH
39
Metadata Catalog
• It is an integral part of the overall architecture
• It contains information that describes the
warehouse and plays an active role in its
creation, use, and maintenance
• Contains source system metadata (data and
processes), data staging metadata (dimensions,
transformations, aggregations), DBMS metadata
(tables, indexes, stored procedures), and frontroom metadata (users, applications)
NLS/IITB/DWH
40
Technical Architecture
• Metadata driven
– Metadata provides flexibility by buffering the various
components of the system from each other
– The metadata catalog provides parameters and
information that allow the application to perform
their task
NLS/IITB/DWH
41
Conclusion
• Building a corporate-wide data warehouse is a
challenging task
• A systematic methodology essential
• Plan the architecture globally but build it
incrementally
• Keep user requirements at the core of all
development activities
NLS/IITB/DWH
42