Transcript Document
NIC Exposure Level Training
Vijayendra Gururao, Business Intelligence Consultant
Agenda
Day 1: Data Warehousing Concepts
Day 2: Govt Case Study
Day 3: Defense HR Case Study
Day 4: Manufacturing Case Study
Day 5: Data Mining
The Evolution of Business Intelligence
From passive to active analysis, and from human effort to technology:
1986: Report and summarize (MIS) - what's important
1991: Analyze and focus (EIS, OLAP) - what it means
1996: Recommend (Data Mining) - what to do about it
2001: Act (Intelligent Agents)
Data Warehousing
Introduction:
Definitions Legacy Systems Dimensions Data Dependencies Model Dimensional Model
An ER Model
(ER diagram showing entities such as Shipper, Ship Type, Ship To, Product, District Credit, Contract, Order Item, Sales Order, Contract Type, Contact Location, Product Line, Customer, Customer Location, Contact, Product Group, Sales Rep, Sales District, Sales Region, and Sales Division.)
Why Data Warehouses?
To meet the long-sought goal of providing users with more flexible databases containing data that can be accessed “every which way.”
OLTP vs. OLAP
OLTP (online transaction processing) has been the mainstay of IS and data processing for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.
OLAP (online analytical processing) is a set of procedures for defining and using a dimensional framework for decision support.
The Goals for and Characteristics of a DW
Make organizational data accessible; facilitate consistency; adaptable and yet resilient to change; secure and reliable; designed with a focus on supporting decision making.
The Goals for and Characteristics of a DW
Generate an environment in which data can be sliced and diced in multiple ways. It is more than data; it is a set of tools to query, analyze, and present information. The DW is the place where operational data is published (cleaned up, assembled, etc.).
Data Warehousing is Changing!
Application requirements, not just data requirements, are now driving the need.
Examples of driving applications: Customer Relationship Management, Call Center, Campaign Management, Knowledge Management, ERP, Target Marketing, Supply Chain, and E-commerce.
Organization of data in the presentation area of the data warehouse
Data in the warehouse are dimensional, not normalized relations. However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs. Data should be atomic someplace in the warehouse, even if the presentation is aggregate. Uses the bus architecture to support a decentralized set of data marts.
Updates to a data warehouse
For many years, the dogma stated that data warehouses are never updated. This is unrealistic since labels, titles, etc. change. Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)
Basic elements of the data warehouse
Operational Source Systems -> Extract -> Data Staging Area -> Load -> Data Presentation Area -> Access -> Data Access Tools.
Data Staging Area: services to clean, combine, and standardize data and to conform dimensions; no user query services; data store of flat files and relational tables; processing is sorting and sequential.
Data Presentation Area: Data Mart #1 (dimensional; atomic and summary data; based on a single business process); DW Bus (conformed facts and dimensions); Data Mart #2 (similar design).
Data Access Tools: ad hoc query tools, report writers, analytical applications, and modeling (forecasting, scoring, data mining).
Data Staging Area
Extract-Transformation-Load. Extract: reading the source data and copying it to the staging area. Transformation: cleaning, combining, de-duplicating, and assigning keys. Load: present the data to the bulk-loading facilities of the data mart.
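A minimal sketch of this extract-transform-load flow in Python, assuming a hypothetical source file orders.csv with order_id, customer, and amount columns (the file name, column names, and cleaning rules are illustrative, not from the case studies):

    import csv
    import sqlite3

    def extract(csv_path):
        # Extract: read the source data and copy it into the staging area (a plain list here).
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: clean values, de-duplicate, and assign warehouse keys.
        seen, cleaned = set(), []
        for key, row in enumerate(rows, start=1):
            natural_key = row["order_id"].strip()
            if natural_key in seen:          # de-duplicate on the business key
                continue
            seen.add(natural_key)
            cleaned.append((key,             # key assigned in the staging area
                            natural_key,
                            row["customer"].strip().title(),
                            float(row["amount"] or 0)))
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Load: present the data to the bulk-loading facilities of the data mart.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS stg_orders "
                    "(order_key INTEGER, order_id TEXT, customer TEXT, amount REAL)")
        con.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", rows)
        con.commit()
        con.close()

    load(transform(extract("orders.csv")))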
Dimensional Modeling Terms and Concepts
Fact table Dimension tables
Fact Tables
Fact table: a table in the data warehouse that contains numerical performance measures and foreign keys that tie the fact table to the dimension tables.
Fact Tables
Each row records a measurement describing a transaction: Where? When? Who? How much? How many?
The level of detail represented by this data is referred to as the grain of the data warehouse. Questions can only be asked down to a level corresponding with the grain of the data warehouse.
Dimension tables
Tables containing textual descriptors of the business. Dimension tables are usually wide (e.g., 100 columns) and usually shallow (hundreds of thousands to a few million rows). Values in the dimensions usually provide constraints on queries (e.g., view customers by region) and report headings.
Dimension tables
The quality of the dimensions determines the quality of the data warehouse; that is, the DW is only as good as its dimension attributes. Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations (e.g., product part, product, brand). Dimensions are usually highly denormalized.
Dimension tables
The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support
Bringing together facts and dimensions – Building the dimensional Model
Start with the normalized ER model. Group the ER diagram components into segments based on common business processes and model each as a unit. Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table. Denormalize the remaining tables as needed and designate one field as the primary key.
A Dimensional Model
Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type
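A sketch of how this dimensional model could be declared physically, here using Python and SQLite; the column types and the database file name are assumptions for illustration:

    import sqlite3

    con = sqlite3.connect("sales_mart.db")
    con.executescript("""
    CREATE TABLE time_dim (
        time_key     INTEGER PRIMARY KEY,
        day_of_week  TEXT, month TEXT, quarter TEXT, year INTEGER, holiday_flag INTEGER);
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        description  TEXT, brand TEXT, category TEXT);
    CREATE TABLE store_dim (
        store_key    INTEGER PRIMARY KEY,
        store_name   TEXT, address TEXT, floor_plan_type TEXT);
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        store_key    INTEGER REFERENCES store_dim(store_key),
        dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);
    """)
    con.close()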
Kimball Methodology
Conformed Dimensions
Review: A Private Data Mart
A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.
What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?
Understand Conformed Dimensions
A conformed dimension is a dimension that is standardized across all data marts.
(Bus diagram: Sales Fact, Distribution Fact, and Order Fact tables sharing conformed dimensions such as Location, Time, Customer, Product, Distributor, and Promotion.)
Advantages of Conformed Dimensions
Deliver incremental data marts in a short period of time.
Independent data marts become part of a fully integrated data warehouse.
Deliver a consistent view across your business.
Conformed Dimensions Within Bus Architecture
Identifying and designing the conformed dimensions is a critical step in the architecture phase of data warehouse design.
(Bus matrix: one row each for the Sales Fact, Distribution Fact, and Order Fact, one column per shared dimension; an X marks each dimension a fact table uses.)
Design of Conformed Dimensions
A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.
Lay out a broad dimensional map for the enterprise.
Define conformed dimensions at the most granular (atomic) level possible.
Conformed dimensions should always use surrogate keys.
Define standard definitions for dimension and fact attributes.
Granularity in Conformed Dimensions
Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table.
D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
Order Fact: Day Id, Product Id, Customer Id, Cost, NumberOrdered
D_Customer: Customer Id, Last Name, First Name, Address
D_TimeDay: Day Id, Day, Month, Year
Flexibility of Conformed Dimensions
Conformed dimensions are usually designed within star schema data marts. For multiple granularity fact tables, higher level views of dimensions can be used (or a snowflake table).
View or snowflake table used for the Time dimension:
Customer: Customer Id, Last Name, First Name, Address
Order Fact (daily grain): Day Id, Product Id, Customer Id, Cost, NumberOrdered
Time (Day): Day Id, Day, Month Id, Period
Product: Product Id, Description, Product Type, Product Line
Sales Fact (monthly grain): Month Id, Product Id, Customer Id, AmountSold, Revenue
Time (Month) view: Month Id, Month, Period
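One hedged sketch of the "view or snowflake table" idea: a month-level view derived from the day-level time dimension, so a monthly-grain fact table can still conform to the same time dimension. Table and column names loosely follow the example above and are assumptions:

    import sqlite3

    con = sqlite3.connect("conformed_mart.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS d_time_day (
        day_id INTEGER PRIMARY KEY, day TEXT, month_id INTEGER, period TEXT);
    -- Month-level view: one row per month, usable by a monthly-grain fact table.
    CREATE VIEW IF NOT EXISTS v_time_month AS
        SELECT month_id, MIN(period) AS period, COUNT(*) AS days_in_month
        FROM d_time_day
        GROUP BY month_id;
    """)
    con.close()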
So, What is a DW?
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions. (W.H. Inmon, the father of DW)
Subject Oriented
Data in a data warehouse are organized around the major subjects of the organization
Integrated
Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment
Non-Volatile
Once added to the DW, data are not changed (barring the existence of major errors)
Time Variant
The DW captures data at a specific moment; thus, it is a snapshot view of the organization at that moment in time. As these snapshots accumulate, the analyst is able to examine the organization over time (a time series!). The snapshot is called a production data extract.
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases). Separation of operational and informational systems and data, for improved performance (see the comparison of operational and informational systems).
Data Warehouse Architectures
Generic two-level architecture; independent data mart; dependent data mart and operational data store; logical data mart and @ctive warehouse; three-layer architecture. All involve some form of extraction, transformation, and loading (ETL).
Generic two-level architecture
ETL into one company-wide warehouse. Periodic extraction, so data is not completely current in the warehouse.
Independent Data Mart
Data marts: mini-warehouses, limited in scope. Separate ETL for each independent data mart. Data access complexity due to multiple data marts.
Dependent Data Mart and Operational Data Store
The ODS provides an option for obtaining current data. Single ETL for the enterprise data warehouse (EDW); simpler data access. Dependent data marts are loaded from the EDW.
Logical data mart and @ctive data warehouse
The ODS and the data warehouse are one and the same. Near-real-time ETL for the @ctive data warehouse. Data marts are NOT separate databases, but logical views of the data warehouse, so it is easier to create new data marts.
Three-layer architecture
DW Design
Mainly consists of Logical Design Physical Design
Logical Design of DW
Identification of entities, relationships, attributes, and unique identifiers. Conceptual and abstract; results in fact and dimension tables. Created using pen and paper or modeling tools.
Physical Design of DW
Conversion of the structures gathered in logical design into the physical database structure. Mainly driven by query performance.
Data Characteristics Status vs. Event Data
Event = a database action (create/update/delete) that results from a transaction. Status = the state of a record before and after the event, as in a DBMS log entry.
Data Characteristics
Transient vs. Periodic Data. Transient operational data: changes to existing records are written over previous records, thus destroying the previous data content.
Data Characteristics
Transient vs. Periodic Data. Periodic warehouse data: data are never physically altered or deleted once they have been added to the store.
Data Reconciliation
Typical operational data is: Transient – not historical Not normalized (perhaps due to denormalization for performance) Restricted in scope – not comprehensive Sometimes poor quality – inconsistencies and errors After ETL, data should be: Detailed – not summarized yet Historical – periodic Denormalized Comprehensive – enterprise-wide perspective Quality controlled – accurate with full integrity
Extract Transform Load
Extract data from the operational system, transform it, and load it into the data warehouse. Why ETL?
Will your warehouse produce correct information with the current data?
How can I ensure warehouse credibility?
Excuses for NOT Transforming Legacy Data
Old data works fine, so new data will work as well.
Data will be fixed at the point of entry through the GUI.
If needed, data will be cleaned after the new system is populated, after the proof-of-concept pilot.
Keys join the data most of the time.
Users will not agree to modifying or standardizing their data.
Levels of Migration Problem
Existing metadata is insufficient and unreliable: metadata must hold for all occurrences and must represent both business and technical attributes. Data values are incorrectly typed or hard to access: a value's form must be extracted from storage, and its meaning inferred from its content. Entity keys are unreliable or unavailable and must be inferred from related values.
Metadata Challenge
Metadata gets out of sync with the details it summarizes. Business grows faster than the systems designed to capture business information. Not at the right level of detail: multiple values in a single field, multiple meanings for a single field, no fixed format for a value. Expressed in awkward or limited terms: a program/compiler view rather than a business view.
Character-level Challenge
Value instance level: spelling, aliases, abbreviations, truncations, transpositions, inconsistent storage formats. Named type level: multiple meanings, contextual meanings, synonyms, homonyms. Entity level: no common keys or representation, no integrated view across records, files, and systems.
The ETL Process
Capture
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load
The ETL Process
(Diagram: Source Systems feed the Staging Area via Extract; Transform happens in staging; Load delivers data to the Presentation System.)
Source Data
Record the names, locations, and data that exist in the TPS environment.
File names and location; layout; attribute meaning; source business owner; IS owner; platform; location; data source description.
Extraction
Copy specific data directly from the source tables into a working dataset in the staging area.
Mapping template columns: Target Table, Target Column, Data Type, Len, Target Column Description, Source System, Source Table/File, Source Col/Field, Data Transform, Notes.
Transformation (Dimension Tables)
Generate surrogate key in a primary surrogate table. Make this permanent.
Insert the surrogate key into the working dimension tables.
Conduct any editing/cleaning operations you need (usually on the working table) Generate any derived attributes you need.
Generate and retain process logs.
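A hedged sketch of the surrogate-key step for a dimension: generate the key in a permanent surrogate table only when the business key is new, then stamp it onto the working dimension rows. Table names, column names, and the sample product codes are assumptions for illustration:

    import sqlite3

    con = sqlite3.connect("staging.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS sur_product (          -- permanent surrogate table
        product_sur INTEGER PRIMARY KEY AUTOINCREMENT,
        product_code TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS wrk_product (           -- working dimension table
        product_sur INTEGER, product_code TEXT, description TEXT);
    """)

    def assign_surrogates(rows):
        # rows: list of (product_code, description) from the source extract
        for code, desc in rows:
            # Generate a surrogate only if this business key has never been seen.
            con.execute("INSERT OR IGNORE INTO sur_product (product_code) VALUES (?)", (code,))
            sur = con.execute("SELECT product_sur FROM sur_product WHERE product_code = ?",
                              (code,)).fetchone()[0]
            # Insert the surrogate key into the working dimension table.
            con.execute("INSERT INTO wrk_product VALUES (?, ?, ?)", (sur, code, desc))
        con.commit()

    assign_surrogates([("PRX00239", "Soup"), ("PRX00340", "Beans")])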
Transformation (Fact tables)
Join all dimensions to the fact table (using original primary keys). Insert surrogate keys. Generate derived facts. Generate indicator flags. Derived-fact template columns: Chg Flag, Fact Group, Derived Fact Name, Derived Fact Description, Type, Agg Rule, Formula, Constraints, Transformations.
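A sketch of the fact-table step, continuing the dimension example above: join the incoming fact rows to the surrogate tables on the original business keys and write surrogate keys into the fact table. All table and column names are assumptions:

    import sqlite3

    con = sqlite3.connect("staging.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS sur_product  (product_sur INTEGER PRIMARY KEY AUTOINCREMENT,
                                             product_code TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS sur_customer (cust_sur INTEGER PRIMARY KEY AUTOINCREMENT,
                                             cust_code TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS wrk_fact_src (product_code TEXT, cust_code TEXT, amount REAL);
    CREATE TABLE IF NOT EXISTS f_sales      (product_sur INTEGER, cust_sur INTEGER, amount REAL);
    """)
    # Substitute surrogate keys by joining on the original business keys.
    con.execute("""
    INSERT INTO f_sales (product_sur, cust_sur, amount)
    SELECT p.product_sur, c.cust_sur, s.amount
    FROM wrk_fact_src s
    JOIN sur_product  p ON p.product_code = s.product_code
    JOIN sur_customer c ON c.cust_code    = s.cust_code
    """)
    con.commit()
    con.close()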
Target Data
Describe the presentation data structure.
Model, metadata, usage, and constraints. Target template columns: Table Name, Column Name, Data Type, Len, Nulls?, Column Description, PK, PK Order, FK.
Flow Documentation
DFD for the ETL process. ERD for the Source, Staging, and Target databases.
Metadata usage notes.
Steps in data reconciliation
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.
Static extract = capturing a snapshot of the source data at a point in time.
Incremental extract = capturing only the changes that have occurred since the last static extract.
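A small sketch of the difference, assuming a hypothetical source table named orders that carries an updated_at timestamp (the table and columns are illustrative):

    import sqlite3

    def static_extract(con):
        # Snapshot of a chosen subset of the source data at a point in time.
        return con.execute(
            "SELECT order_id, customer, amount, updated_at FROM orders").fetchall()

    def incremental_extract(con, last_extract_time):
        # Only the changes that have occurred since the last static extract.
        return con.execute(
            "SELECT order_id, customer, amount, updated_at FROM orders WHERE updated_at > ?",
            (last_extract_time,)).fetchall()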
Steps in data reconciliation (continued) Scrub = cleanse…uses pattern recognition and AI techniques to upgrade data quality
Fixing errors:
misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also:
decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
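A toy sketch of a few scrubbing rules (misspelling lookup, date reformatting, missing-value logging); the lookup table, field names, and date format are assumptions, and real cleansing tools use far richer pattern recognition than this:

    import re, datetime

    CITY_FIXES = {"Otawa": "Ottawa", "Nwe York": "New York"}   # illustrative misspelling table

    def scrub(row, error_log):
        row["city"] = CITY_FIXES.get(row["city"], row["city"])
        # Reformat dates like 1/3/1999 (month/day/year) to ISO, logging anything that fails.
        m = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})$", row["order_date"])
        if m:
            month, day, year = map(int, m.groups())
            row["order_date"] = datetime.date(year, month, day).isoformat()
        else:
            error_log.append(("bad date", row))
        if not row.get("customer"):                             # detect missing data
            error_log.append(("missing customer", row))
        return row

    errors = []
    print(scrub({"city": "Otawa", "order_date": "1/3/1999", "customer": "Safeway"}, errors))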
Steps in data reconciliation (continued) Transform = convert data from format of operational system to format of data warehouse
Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization).
Field-level: single-field (from one field to one field), multi-field (from many fields to one, or from one field to many).
Steps in data reconciliation (continued) Load/Index= place transformed data into the warehouse and create indexes
Refresh mode:
bulk rewriting of target data at periodic intervals
Update mode:
only changes in source data are written to data warehouse
Single-field transformation: in general, some transformation function translates data from the old form to the new form. An algorithmic transformation uses a formula or logical expression; a table lookup is another approach.
Multi-field transformation: M:1 (from many source fields to one target field) or 1:M (from one source field to many target fields).
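A small sketch of each kind of field-level transformation; the field names and rules are invented for illustration:

    def algorithmic(value_in_cents):
        # Single-field, algorithmic: a formula translates the old form to the new form.
        return value_in_cents / 100.0

    GENDER_LOOKUP = {"M": "Male", "F": "Female"}       # single-field, table lookup

    def many_to_one(first_name, last_name):
        # Multi-field M:1 - many source fields become one target field.
        return f"{last_name}, {first_name}"

    def one_to_many(full_address):
        # Multi-field 1:M - one source field split into many target fields.
        street, city, postcode = [part.strip() for part in full_address.split(",")]
        return {"street": street, "city": city, "postcode": postcode}

    print(algorithmic(1999), GENDER_LOOKUP["F"],
          many_to_one("Mary", "Smith"), one_to_many("12 Main St, Dallas, 75001"))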
Derived Data
Objectives: ease of use for decision support applications; fast response to predefined user queries; customized data for particular target audiences; ad-hoc query support; data mining capabilities. Characteristics: detailed (mostly periodic) data; aggregate (for summary); distributed (to departmental servers). The most common data model is the star schema (also called the dimensional model).
Components of a star schema
Fact tables contain factual or quantitative data. There is a 1:N relationship between dimension tables and fact tables. Dimension tables contain descriptions about the subjects of the business and are denormalized to maximize performance. The star schema is excellent for ad-hoc queries, but bad for online transaction processing.
Star schema example
The fact table provides statistics for sales broken down by product, period, and store dimensions.
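A hedged sketch of how such a fact table is queried: a star join that constrains on dimension attributes and aggregates the measures, assuming the sales mart tables from the earlier DDL sketch exist:

    import sqlite3

    con = sqlite3.connect("sales_mart.db")   # schema from the earlier DDL sketch
    sql = """
    SELECT p.brand, t.quarter, s.store_name, SUM(f.dollars_sold) AS total_sold
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim    t ON t.time_key    = f.time_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    WHERE t.year = 2004                      -- constraint supplied by a dimension attribute
    GROUP BY p.brand, t.quarter, s.store_name
    """
    for row in con.execute(sql):
        print(row)
    con.close()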
Star schema with sample data
Advanced concepts
Slowly Changing dimensions Ragged Hierarchies
What if Our Data is not Static?
Small occasional changes in dimension data are normal in business.
Examples of these changes include: the addition of new members (a new product is launched); changing relationships within the dimension (a sales rep moves to another branch); changed properties of members (a product is reformulated or renamed); deletion of members (this is rare in data warehousing).
Understand Surrogates
4-byte integer key (can hold more than 2 billion positive integers). Internally assigned and meaningless; ensures uniqueness; always known. Used in conjunction with business keys: the business key is often mnemonic (for example, OTA for the Ottawa office), while the surrogate key is numeric (for example, 000128). Surrogate keys are never used in reports; they are used to link dimension tables to fact tables.
Understand Surrogate Keys Used In Operational Systems
Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.
A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.
You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).
Operational surrogate keys are usually considered business keys in the data mart.
Understand Natural Keys: Example
Product Dimension (Prod Code, Name): PR X 002 39 - Soup; PR X 003 40 - Beans; PR Y 003 40 - Peas.
Fact Table (Prod Code, Cust Code, Measures): PR X 002 39 / SA 1 11; PR X 003 40 / LO 2 22; PR Y 003 40 / SE 5 55.
Customer Dimension (Cust Code, Name): SA 1 11 - Safeway; LO 2 22 - Loblaws; SE 5 55 - 7-11.
Understand Surrogate Keys: Example
Product Dimension (Prod Sur, Prod Code, Name): 1, PR X 002 39, Soup; 2, PR X 003 40, Beans; 3, PR Y 003 40, Peas.
Customer Dimension (Cust Sur, Cust Code, Name): 10, SA 1 11, Safeway; 20, LO 2 22, Loblaws; 30, SE 5 55, 7-11.
Fact Table (Prod Sur, Cust Sur, Measures): 1, 10; 2, 20; 3, 30.
Track Dimensional Changes Over Time
Operational systems tend to contain data about the current state of the business.
A data warehouse is expected to hold data for five to 10 years.
Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January/1999 and December/1999?).
If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?
Understand Slowly Changing Dimensions (SCD)
Operational dimensional data may often be thought of as static. It may only need to reflect the current state.
Data warehouse dimensional data often must show how the dimensional data changes over time. It is not static.
The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.
Understand Issues With Slowly Changing Dimensions
Maintaining SCDs can be complex without surrogates.
Imagine a dimension where Emp. No is the business key, Name and Hire Date are normal columns, and Branch, Position, and Salary are Type 2 columns:
Surrogate 1: Emp. No 10001, Jack, OTA, VP, Jan '88, 50K
Surrogate 2: Emp. No 10002, Jane, ARL, MK, Jan '92, 40K
Surrogate 3: Emp. No 10003, Tom, NY, SS, Jan '93, 35K
Surrogate 4: Emp. No 10001, Jack, SJ, VP, Jan '88, 50K (branch change)
Surrogate 5: Emp. No 10001, Jack, SJ, S-VP, Jan '88, 50K (position change)
Surrogate 6: Emp. No 10001, Jack, SJ, S-VP, Jan '88, 60K (salary change)
Without surrogates, the natural key would have to grow to (Emp. No + Branch), then (Emp. No + Branch + Position), then (Emp. No + Branch + Position + Salary). Imagine the effect of having such a large natural key in the fact table.
Use Different Methods of Handling Dimensional Changes
Two most commonly used types of SCDs (according to Kimball): Type 1. Overwrite the old value with the new value (do not track the changes).
Type 2. Add a new dimension record with a new surrogate key (track changes over time).
A single row may have a combination of columns of different types.
Type 1: Overwrite the Original Value
The organization may choose not to track certain data changes because the original data may have been incorrect, or because the change is not considered relevant for tracking.
Sales Rep Dimension Table: Rep Key 00128, Office Dallas. When the office changes, the value is simply overwritten in this row.
Sales Fact Table (Rep Key, Order Date, Cust Key): 00128, 1/1/1999, 12345; 00128, 2/1/1999, 12345; ...
Type 2: Add a New Dimension Record
When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.
Usually, an effective begin/end date is also updated on the new and old rows.
Multiple rows may have the same business key, but they will always have unique surrogate keys.
Sales Fact Table (RepSurKey, Order Date, Cust Key): 11111, 01/01/1999, 12345; 11111, 02/01/1999, 12345; 11112, 03/01/1999, 12345; 11112, 04/01/1999, 12345; ...
Sales Rep Dimension Table (RepSurKey, Rep Key, Name, Office, Eff Date): 11111, 00128, Mary Smith, Dallas, 9901; 11112, 00128, Mary Smith, NYC, 9903.
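A minimal sketch contrasting the two SCD types, using an in-memory list as a stand-in for the sales-rep dimension (the field names are assumptions; the Mary Smith data mirrors the example above):

    import datetime

    dim = []            # stand-in for the sales-rep dimension table
    next_surrogate = 1

    def current_row(rep_key):
        return next((r for r in dim if r["rep_key"] == rep_key and r["end_date"] is None), None)

    def apply_change(rep_key, name, office, scd_type):
        global next_surrogate
        row = current_row(rep_key)
        if row and scd_type == 1:
            row.update(name=name, office=office)          # Type 1: overwrite, no history
            return row["sur_key"]
        if row and scd_type == 2:
            row["end_date"] = datetime.date.today()       # Type 2: close off the old version
        dim.append({"sur_key": next_surrogate, "rep_key": rep_key, "name": name,
                    "office": office, "eff_date": datetime.date.today(), "end_date": None})
        next_surrogate += 1
        return next_surrogate - 1                          # new surrogate for new fact rows

    apply_change("00128", "Mary Smith", "Dallas", scd_type=2)
    apply_change("00128", "Mary Smith", "NYC", scd_type=2)  # office move: new surrogate key
    print(dim)

Both versions share the business key 00128 but carry unique surrogate keys, exactly as in the dimension table above.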
Balanced and Ragged Hierarchies
Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).
Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.
In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.
Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.
Parent-Child Relationships
Parent-child relationships are recursive relationships.
The levels of the hierarchy are determined by rows of the same table.
(Example: the Employees table with its recursive Reports To relationship to itself, and its relationship to Orders.)
Ragged Hierarchies
Leaf nodes have no children.
DecisionStream fact builds only look for leaf nodes at the lowest level.
Employee hierarchy example: Andrew Fuller is at the top. Nancy Davolio, Janet Leverling, Margaret Peacock, and Laura Callahan report to Andrew Fuller and are leaf nodes. Steven Buchanan also reports to Andrew Fuller; Michael Suyama, Robert King, and Anne Dodsworth report to Steven Buchanan and are leaf nodes.
Resolve Ragged Hierarchies: Step 1
Create an auto-level hierarchy to obtain the number of levels.
Create a dimension build to create a physical table that will identify for each row the level it belongs to.
Resolve Ragged Hierarchies: Step 1 (cont’d) Use Auto-Level Hierarchies
The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy.
Top level (Level 1); reports to Andrew Fuller (Level 2); reports to Steven Buchanan (Level 3).
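DecisionStream does this work itself; purely as an illustration of the auto-level idea, the following sketch walks the recursive reports-to relationship to compute each employee's level and to find the leaf nodes, using the example hierarchy above as data:

    REPORTS_TO = {                       # child -> parent (None = top of the hierarchy)
        "Andrew Fuller": None,
        "Nancy Davolio": "Andrew Fuller", "Janet Leverling": "Andrew Fuller",
        "Margaret Peacock": "Andrew Fuller", "Laura Callahan": "Andrew Fuller",
        "Steven Buchanan": "Andrew Fuller",
        "Michael Suyama": "Steven Buchanan", "Robert King": "Steven Buchanan",
        "Anne Dodsworth": "Steven Buchanan",
    }

    def level(employee):
        # Level 1 is the top of the hierarchy; each hop up the parent chain adds one.
        parent = REPORTS_TO[employee]
        return 1 if parent is None else 1 + level(parent)

    parents = set(REPORTS_TO.values())
    for emp in REPORTS_TO:
        is_leaf = emp not in parents      # leaf nodes have no children
        print(level(emp), emp, "(leaf)" if is_leaf else "")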
Issues Regarding Star Schema
Dimension table keys must be surrogate (non-intelligent and non-business-related) because keys may change over time and for length/format consistency.
Granularity of the fact table: what level of detail do you want? Transactional grain is the finest level; aggregated grain is more summarized. A finer grain gives better market basket analysis capability, but a finer grain also means more dimension tables and more rows in the fact table.
The User Interface Metadata (data catalog)
Identify the subjects of the data mart; identify dimensions and facts; indicate how data is derived from the enterprise data warehouse, including derivation rules; indicate how data is derived from the operational data store, including derivation rules; identify available reports and predefined queries; identify data analysis techniques (e.g., drill-down); identify responsible people.
Q & A
Multi-dimensional data
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Relational OLAP (ROLAP)
Traditional relational representation
Multidimensional OLAP (MOLAP)
Cube structure.
OLAP Operations
Cube slicing: come up with a 2-D view of the data.
Drill-down: going from a summary to more detailed views.
Overall Plan
We need fast answers to analytical questions. The relational model may not be the answer. We can restructure data specifically for analysis. First we need to find out how people analyse data.
Overall Plan
Analysing how people analyse data reveals the importance of measures and dimensions. So we structure the data with that in mind. The star schema is the physical structure that emerges. We can implement this as ROLAP, MOLAP, or HOLAP. We achieve our objective: rapid analytical processing.
How do we make databases faster?
Lots of ways: indexing, query design, application design, care with locking, data structuring.
Data structuring – Relational model
Pros: data integrity in the face of user updates; small data volume; good for transactional queries. Cons: poor analytical query performance.
Why are relational databases slow?
Joins Functions Aggregations
Poor analytical query performance
So, there is a tension between transactions and analytical querying. Solution: split them up. Take a copy of the transactional database and structure it in a totally different way that is optimised for analytical querying.
Structure for analytical querying
Great idea, but first we need to find out how people analyse their data
How people analyse their data
People analyse their data in terms of graphs, grids, and reports. Do these have anything in common?
How people analyse their data
Do these have anything in common? Measures: numerical values, typically plotted on the Y axis. Dimensions: discontinuous variables that slice the measures into aggregated groups, typically plotted on the X axis.
How people analyse their data
Dimensions are often hierarchical. People want to analyse: Time by Year, Quarter, Month, Day; Product by Warehouse, Type, Product; Customer by Country, County, Person.
How people analyse their data
So, we need to summarise all of this…..
Measures Dimensions Hierarchies
Customer hierarchy example: Name, Region, Country.
Squashed Octopus
(Diagram: the "squashed octopus" logical model, with measures such as Delay, Profit, and Quantity at the centre and dimension hierarchies radiating outward: Customer, Product (Item, Class), Warehouse, Employee (County, Region), and Time (WeekDay, Month, Quarter, Year).)
Squashed Octopus
The SO is a logical model. What about a physical model?
(Recap: the relational model is slow for analytical querying because of joins, functions, and aggregations.)
Star Schema
Star Schema
What is in the fact table?
Facts
Star Schema
What is in a dimension table?
Dimensional information Hierarchical information
Star Schema
How do dimension and fact tables work together?
Star Schema
Is it faster?
Query / Time (Relational) / Time (Star Schema):
Monthly totals: 70 / 60
Sales in March by Product: 18 / 6
Sales in March 2004 by Product: 12 / 2
Star Schema
Is it faster?
Yes How can we make it even faster?
Aggregation
Query / Time (Relational) / Time (Star Schema) / Time (Aggregated Star Schema):
Monthly totals: 70 / 60 / <1
Sales in March by Product: 18 / 6 / <1
Sales in March 2004 by Product: 12 / 2 / <1
Star Schema
If we leave this as a set of tables, then it is ROLAP, Relational OLAP (OLAP = On-Line Analytical Processing). But it is a pain to manage: all those aggregation tables.
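A sketch of why the aggregation tables are a pain: each one is another table that must be rebuilt whenever the base fact table is reloaded. This assumes the sales mart tables from the earlier DDL sketch exist; the aggregate table name is an assumption:

    import sqlite3

    con = sqlite3.connect("sales_mart.db")   # schema from the earlier DDL sketch
    con.executescript("""
    DROP TABLE IF EXISTS sales_by_month;     -- each aggregate must be rebuilt after every load
    CREATE TABLE sales_by_month AS
        SELECT t.year, t.month, p.brand, SUM(f.dollars_sold) AS dollars_sold,
               SUM(f.units_sold) AS units_sold
        FROM sales_fact f
        JOIN time_dim    t ON t.time_key    = f.time_key
        JOIN product_dim p ON p.product_key = f.product_key
        GROUP BY t.year, t.month, p.brand;
    """)
    con.close()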
Star Schema
So, the answer is MOLAP (Multi-dimensional OLAP)
Star Schema
Finally HOLAP (Hybrid OLAP)
On-Line Analytical Processing (OLAP)
OLAP Operations
Cube slicing
– come up with 2-D view of data
Drill-down
– going from summary to more detailed views
Slicing a data cube.
Example of drill-down: a summary report, then a drill-down with color added.
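A small sketch of slicing and drill-down on an in-memory cube using pandas; the data values are made up purely for illustration:

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2004, 2004, 2004, 2004],
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "product": ["Soup", "Beans", "Soup", "Beans"],
        "store":   ["Dallas", "Dallas", "NYC", "NYC"],
        "dollars": [100, 80, 120, 90],
    })

    # Slice: fix one dimension member (store = Dallas) to get a 2-D view of the cube.
    dallas = sales[sales["store"] == "Dallas"].pivot_table(
        index="product", columns="quarter", values="dollars", aggfunc="sum")
    print(dallas)

    # Drill-down: go from a yearly summary to quarterly detail.
    print(sales.groupby("year")["dollars"].sum())
    print(sales.groupby(["year", "quarter"])["dollars"].sum())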
Summary
We need fast answers to analytical questions. The relational model may not be the answer. We can restructure data specifically for analysis. First we need to find out how people analyse data.
Summary
Analysing how people analyse data reveals the importance of measures and dimensions. So we structure the data with that in mind. The star schema is the physical structure that emerges. We can implement this as ROLAP, MOLAP, or HOLAP. We achieve our objective: rapid analytical processing.