Transcript Document
NIC Exposure Level Training
Vijayendra Gururao, Business Intelligence Consultant
Agenda
Day 1: Data Warehousing Concepts
Day 2: Govt Case Study
Day 3: Defense HR Case Study
Day 4: Manufacturing Case Study
Day 5: Data Mining
The Evolution of Business Intelligence
From passive to active analysis, and from human effort to technology:
1986: Report and summarize (MIS) - what's important
1991: Analyze and focus (EIS, OLAP) - what it means
1996: Recommend (Data Mining) - what to do about it
2001: Act (Intelligent Agents)
Data Warehousing
Introduction:
Definitions Legacy Systems Dimensions Data Dependencies Model Dimensional Model
An ER Model
(ER diagram showing entities such as Shipper, Ship Type, Ship To, Product, District Credit, Contract, Order Item, Sales Order, Contract Type, Contact Location, Product Line, Customer, Customer Location, Contact, Product Group, Sales Rep, Sales District, Sales Region, and Sales Division.)
Why Data Warehouses?
To meet the long-sought goal of providing users with more flexible databases containing data that can be accessed “every which way.”
OLTP vs. OLAP
OLTP (online transaction processing) has been the mainstay of IS and data processing for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.
OLAP (online analytical processing) is a set of procedures for defining and using a dimensional framework for decision support.
The Goals for and Characteristics of a DW
Make organizational data accessible; facilitate consistency; adaptable and yet resilient to change; secure and reliable; designed with a focus on supporting decision making.
The Goals for and Characteristics of a DW
Generate an environment in which data can be sliced and diced in multiple ways. It is more than data; it is a set of tools to query, analyze, and present information. The DW is the place where operational data is published (cleaned up, assembled, etc.).
Data Warehousing is Changing!
Application requirements, not just data requirements, are now driving the need.
Examples of driving applications: Customer Relationship Management, Call Center, Campaign Management, Knowledge Management, ERP, Target Marketing, Supply Chain, and E-commerce.
Organization of data in the presentation area of the data warehouse
Data in the warehouse are dimensional, not normalized relations. However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs. Data should be atomic someplace in the warehouse, even if the presentation is aggregate. Uses the bus architecture to support a decentralized set of data marts.
Updates to a data warehouse
For many years, the dogma stated that data warehouses are never updated. This is unrealistic since labels, titles, etc. change. Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)
Basic elements of the data warehouse
Operational Source Systems -> Extract -> Data Staging Area -> Load -> Data Presentation Area -> Access -> Data Access Tools.
Data Staging Area: services to clean, combine, and standardize data and to conform dimensions; no user query services; data store of flat files and relational tables; processing is sorting and sequential.
Data Presentation Area: Data Mart #1 (dimensional; atomic and summary data; based on a single business process); DW Bus (conformed facts and dimensions); Data Mart #2 (similar design).
Data Access Tools: ad hoc query tools, report writers, analytical applications, and modeling (forecasting, scoring, data mining).
Data Staging Area
Extract-Transformation-Load. Extract: reading the source data and copying it to the staging area. Transformation: cleaning, combining, de-duplicating, and assigning keys. Load: present the data to the bulk-loading facilities of the data mart.
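A minimal sketch of this extract-transform-load flow in Python, assuming a hypothetical source file orders.csv with order_id, customer, and amount columns (the file name, column names, and cleaning rules are illustrative, not from the case studies):

    import csv
    import sqlite3

    def extract(csv_path):
        # Extract: read the source data and copy it into the staging area (a plain list here).
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: clean values, de-duplicate, and assign warehouse keys.
        seen, cleaned = set(), []
        for key, row in enumerate(rows, start=1):
            natural_key = row["order_id"].strip()
            if natural_key in seen:          # de-duplicate on the business key
                continue
            seen.add(natural_key)
            cleaned.append((key,             # key assigned in the staging area
                            natural_key,
                            row["customer"].strip().title(),
                            float(row["amount"] or 0)))
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Load: present the data to the bulk-loading facilities of the data mart.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS stg_orders "
                    "(order_key INTEGER, order_id TEXT, customer TEXT, amount REAL)")
        con.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", rows)
        con.commit()
        con.close()

    load(transform(extract("orders.csv")))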
Dimensional Modeling Terms and Concepts
Fact table Dimension tables
Fact Tables
Fact table: a table in the data warehouse that contains numerical performance measures and foreign keys that tie the fact table to the dimension tables.
Fact Tables
Each row records a measurement describing a transaction: Where? When? Who? How much? How many?
The level of detail represented by this data is referred to as the grain of the data warehouse. Questions can only be asked down to a level corresponding with the grain of the data warehouse.
Dimension tables
Tables containing textual descriptors of the business. Dimension tables are usually wide (e.g., 100 columns) and usually shallow (hundreds of thousands to a few million rows). Values in the dimensions usually provide constraints on queries (e.g., view customers by region) and report headings.
Dimension tables
The quality of the dimensions determines the quality of the data warehouse; that is, the DW is only as good as its dimension attributes. Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations (e.g., product part, product, brand). Dimensions are usually highly denormalized.
Dimension tables
The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support
Bringing together facts and dimensions – Building the dimensional Model
Start with the normalized ER model. Group the ER diagram components into segments based on common business processes and model each as a unit. Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table. Denormalize the remaining tables as needed and designate one field as the primary key.
A Dimensional Model
Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type
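A sketch of how this dimensional model could be declared physically, here using Python and SQLite; the column types and the database file name are assumptions for illustration:

    import sqlite3

    con = sqlite3.connect("sales_mart.db")
    con.executescript("""
    CREATE TABLE time_dim (
        time_key     INTEGER PRIMARY KEY,
        day_of_week  TEXT, month TEXT, quarter TEXT, year INTEGER, holiday_flag INTEGER);
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        description  TEXT, brand TEXT, category TEXT);
    CREATE TABLE store_dim (
        store_key    INTEGER PRIMARY KEY,
        store_name   TEXT, address TEXT, floor_plan_type TEXT);
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        store_key    INTEGER REFERENCES store_dim(store_key),
        dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);
    """)
    con.close()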
Kimball Methodology
Conformed Dimensions
Review: A Private Data Mart
A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.
What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?
Understand Conformed Dimensions
A conformed dimension is a dimension that is standardized across all data marts.
(Bus diagram: Sales Fact, Distribution Fact, and Order Fact tables sharing conformed dimensions such as Location, Time, Customer, Product, Distributor, and Promotion.)
Advantages of Conformed Dimensions
Deliver incremental data marts in a short period of time.
Independent data marts become part of a fully integrated data warehouse.
Deliver a consistent view across your business.
Conformed Dimensions Within Bus Architecture
Identifying and designing the conformed dimensions is a critical step in the architecture phase of data warehouse design.
(Bus matrix: one row each for the Sales Fact, Distribution Fact, and Order Fact, one column per shared dimension; an X marks each dimension a fact table uses.)
Design of Conformed Dimensions
A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.
Lay out a broad dimensional map for the enterprise.
Define conformed dimensions at the most granular (atomic) level possible.
Conformed dimensions should always use surrogate keys.
Define standard definitions for dimension and fact attributes.
Granularity in Conformed Dimensions
Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table.
D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
Order Fact: Day Id, Product Id, Customer Id, Cost, NumberOrdered
D_Customer: Customer Id, Last Name, First Name, Address
D_TimeDay: Day Id, Day, Month, Year
Flexibility of Conformed Dimensions
Conformed dimensions are usually designed within star schema data marts. For multiple granularity fact tables, higher level views of dimensions can be used (or a snowflake table).
View or snowflake table used for the Time dimension:
Customer: Customer Id, Last Name, First Name, Address
Order Fact (daily grain): Day Id, Product Id, Customer Id, Cost, NumberOrdered
Time (Day): Day Id, Day, Month Id, Period
Product: Product Id, Description, Product Type, Product Line
Sales Fact (monthly grain): Month Id, Product Id, Customer Id, AmountSold, Revenue
Time (Month) view: Month Id, Month, Period
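One hedged sketch of the "view or snowflake table" idea: a month-level view derived from the day-level time dimension, so a monthly-grain fact table can still conform to the same time dimension. Table and column names loosely follow the example above and are assumptions:

    import sqlite3

    con = sqlite3.connect("conformed_mart.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS d_time_day (
        day_id INTEGER PRIMARY KEY, day TEXT, month_id INTEGER, period TEXT);
    -- Month-level view: one row per month, usable by a monthly-grain fact table.
    CREATE VIEW IF NOT EXISTS v_time_month AS
        SELECT month_id, MIN(period) AS period, COUNT(*) AS days_in_month
        FROM d_time_day
        GROUP BY month_id;
    """)
    con.close()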
So, What is a DW?
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions. (W.H. Inmon, the father of DW)
Subject Oriented
Data in a data warehouse are organized around the major subjects of the organization
Integrated
Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment
Non-Volatile
Once added to the DW, data are not changed (barring the existence of major errors)
Time Variant
The DW captures data at a specific moment; thus, it is a snapshot view of the organization at that moment in time. As these snapshots accumulate, the analyst is able to examine the organization over time (a time series!). The snapshot is called a production data extract.
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases). Separation of operational and informational systems and data, for improved performance (see the comparison of operational and informational systems).
Data Warehouse Architectures
Generic two-level architecture; independent data mart; dependent data mart and operational data store; logical data mart and @ctive warehouse; three-layer architecture. All involve some form of extraction, transformation, and loading (ETL).
Generic two-level architecture
ETL into one company-wide warehouse. Periodic extraction, so data is not completely current in the warehouse.
Independent Data Mart
Data marts: mini-warehouses, limited in scope. Separate ETL for each independent data mart. Data access complexity due to multiple data marts.
Dependent Data Mart and Operational Data Store
The ODS provides an option for obtaining current data. Single ETL for the enterprise data warehouse (EDW); simpler data access. Dependent data marts are loaded from the EDW.
Logical data mart and @ctive data warehouse
The ODS and the data warehouse are one and the same. Near-real-time ETL for the @ctive data warehouse. Data marts are NOT separate databases, but logical views of the data warehouse, so it is easier to create new data marts.
Three-layer architecture
DW Design
Mainly consists of Logical Design Physical Design
Logical Design of DW
Identification of entities, relationships, attributes, and unique identifiers. Conceptual and abstract; results in fact and dimension tables. Created using pen and paper or modeling tools.
Physical Design of DW
Conversion of the structures gathered in logical design into the physical database structure. Mainly driven by query performance.
Data Characteristics Status vs. Event Data
Event = a database action (create/update/delete) that results from a transaction. Status = the state of a record before and after the event, as in a DBMS log entry.
Data Characteristics
Transient vs. Periodic Data. Transient operational data: changes to existing records are written over previous records, thus destroying the previous data content.
Data Characteristics
Transient vs. Periodic Data. Periodic warehouse data: data are never physically altered or deleted once they have been added to the store.
Data Reconciliation
Typical operational data is: Transient – not historical Not normalized (perhaps due to denormalization for performance) Restricted in scope – not comprehensive Sometimes poor quality – inconsistencies and errors After ETL, data should be: Detailed – not summarized yet Historical – periodic Denormalized Comprehensive – enterprise-wide perspective Quality controlled – accurate with full integrity
Extract Transform Load
Extract data from the operational system, transform it, and load it into the data warehouse. Why ETL?
Will your warehouse produce correct information with the current data?
How can I ensure warehouse credibility?
Excuses for NOT Transforming Legacy Data
Old data works fine, so new data will work as well.
Data will be fixed at the point of entry through the GUI.
If needed, data will be cleaned after the new system is populated, after the proof-of-concept pilot.
Keys join the data most of the time.
Users will not agree to modifying or standardizing their data.
Levels of Migration Problem
Existing metadata is insufficient and unreliable: metadata must hold for all occurrences and must represent both business and technical attributes. Data values are incorrectly typed or hard to access: a value's form must be extracted from storage, and its meaning inferred from its content. Entity keys are unreliable or unavailable and must be inferred from related values.
Metadata Challenge
Metadata gets out of sync with the details it summarizes. Business grows faster than the systems designed to capture business information. Not at the right level of detail: multiple values in a single field, multiple meanings for a single field, no fixed format for a value. Expressed in awkward or limited terms: a program/compiler view rather than a business view.
Character-level Challenge
Value instance level: spelling, aliases, abbreviations, truncations, transpositions, inconsistent storage formats. Named type level: multiple meanings, contextual meanings, synonyms, homonyms. Entity level: no common keys or representation, no integrated view across records, files, and systems.
The ETL Process
Capture
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load
The ETL Process
(Diagram: Source Systems feed the Staging Area via Extract; Transform happens in staging; Load delivers data to the Presentation System.)
Source Data
Record the names, locations, and data that exist in the TPS environment.
File names and location; layout; attribute meaning; source business owner; IS owner; platform; location; data source description.
Extraction
Copy specific data directly from the source tables into a working dataset in the staging area.
Mapping template columns: Target Table, Target Column, Data Type, Len, Target Column Description, Source System, Source Table/File, Source Col/Field, Data Transform, Notes.
Transformation (Dimension Tables)
Generate surrogate key in a primary surrogate table. Make this permanent.
Insert the surrogate key into the working dimension tables.
Conduct any editing/cleaning operations you need (usually on the working table) Generate any derived attributes you need.
Generate and retain process logs.
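A hedged sketch of the surrogate-key step for a dimension: generate the key in a permanent surrogate table only when the business key is new, then stamp it onto the working dimension rows. Table names, column names, and the sample product codes are assumptions for illustration:

    import sqlite3

    con = sqlite3.connect("staging.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS sur_product (          -- permanent surrogate table
        product_sur INTEGER PRIMARY KEY AUTOINCREMENT,
        product_code TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS wrk_product (           -- working dimension table
        product_sur INTEGER, product_code TEXT, description TEXT);
    """)

    def assign_surrogates(rows):
        # rows: list of (product_code, description) from the source extract
        for code, desc in rows:
            # Generate a surrogate only if this business key has never been seen.
            con.execute("INSERT OR IGNORE INTO sur_product (product_code) VALUES (?)", (code,))
            sur = con.execute("SELECT product_sur FROM sur_product WHERE product_code = ?",
                              (code,)).fetchone()[0]
            # Insert the surrogate key into the working dimension table.
            con.execute("INSERT INTO wrk_product VALUES (?, ?, ?)", (sur, code, desc))
        con.commit()

    assign_surrogates([("PRX00239", "Soup"), ("PRX00340", "Beans")])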
Transformation (Fact tables)
Join all dimensions to the fact table (using original primary keys). Insert surrogate keys. Generate derived facts. Generate indicator flags. Derived-fact template columns: Chg Flag, Fact Group, Derived Fact Name, Derived Fact Description, Type, Agg Rule, Formula, Constraints, Transformations.
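A sketch of the fact-table step, continuing the dimension example above: join the incoming fact rows to the surrogate tables on the original business keys and write surrogate keys into the fact table. All table and column names are assumptions:

    import sqlite3

    con = sqlite3.connect("staging.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS sur_product  (product_sur INTEGER PRIMARY KEY AUTOINCREMENT,
                                             product_code TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS sur_customer (cust_sur INTEGER PRIMARY KEY AUTOINCREMENT,
                                             cust_code TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS wrk_fact_src (product_code TEXT, cust_code TEXT, amount REAL);
    CREATE TABLE IF NOT EXISTS f_sales      (product_sur INTEGER, cust_sur INTEGER, amount REAL);
    """)
    # Substitute surrogate keys by joining on the original business keys.
    con.execute("""
    INSERT INTO f_sales (product_sur, cust_sur, amount)
    SELECT p.product_sur, c.cust_sur, s.amount
    FROM wrk_fact_src s
    JOIN sur_product  p ON p.product_code = s.product_code
    JOIN sur_customer c ON c.cust_code    = s.cust_code
    """)
    con.commit()
    con.close()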
Target Data
Describe the presentation data structure.
Model, metadata, usage, and constraints. Target template columns: Table Name, Column Name, Data Type, Len, Nulls?, Column Description, PK, PK Order, FK.
Flow Documentation
DFD for the ETL process. ERD for the Source, Staging, and Target databases.
Metadata usage notes.
Steps in data reconciliation
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.
Static extract = capturing a snapshot of the source data at a point in time.
Incremental extract = capturing only the changes that have occurred since the last static extract.
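A small sketch of the difference, assuming a hypothetical source table named orders that carries an updated_at timestamp (the table and columns are illustrative):

    import sqlite3

    def static_extract(con):
        # Snapshot of a chosen subset of the source data at a point in time.
        return con.execute(
            "SELECT order_id, customer, amount, updated_at FROM orders").fetchall()

    def incremental_extract(con, last_extract_time):
        # Only the changes that have occurred since the last static extract.
        return con.execute(
            "SELECT order_id, customer, amount, updated_at FROM orders WHERE updated_at > ?",
            (last_extract_time,)).fetchall()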
Steps in data reconciliation (continued) Scrub = cleanse…uses pattern recognition and AI techniques to upgrade data quality
Fixing errors:
misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also:
decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
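A toy sketch of a few scrubbing rules (misspelling lookup, date reformatting, missing-value logging); the lookup table, field names, and date format are assumptions, and real cleansing tools use far richer pattern recognition than this:

    import re, datetime

    CITY_FIXES = {"Otawa": "Ottawa", "Nwe York": "New York"}   # illustrative misspelling table

    def scrub(row, error_log):
        row["city"] = CITY_FIXES.get(row["city"], row["city"])
        # Reformat dates like 1/3/1999 (month/day/year) to ISO, logging anything that fails.
        m = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})$", row["order_date"])
        if m:
            month, day, year = map(int, m.groups())
            row["order_date"] = datetime.date(year, month, day).isoformat()
        else:
            error_log.append(("bad date", row))
        if not row.get("customer"):                             # detect missing data
            error_log.append(("missing customer", row))
        return row

    errors = []
    print(scrub({"city": "Otawa", "order_date": "1/3/1999", "customer": "Safeway"}, errors))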
Steps in data reconciliation (continued) Transform = convert data from format of operational system to format of data warehouse
Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization).
Field-level: single-field (from one field to one field), multi-field (from many fields to one, or from one field to many).
Steps in data reconciliation (continued) Load/Index= place transformed data into the warehouse and create indexes
Refresh mode:
bulk rewriting of target data at periodic intervals
Update mode:
only changes in source data are written to data warehouse
Single-field transformation: in general, some transformation function translates data from the old form to the new form. An algorithmic transformation uses a formula or logical expression; a table lookup is another approach.
Multi-field transformation: M:1 (from many source fields to one target field) or 1:M (from one source field to many target fields).
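A small sketch of each kind of field-level transformation; the field names and rules are invented for illustration:

    def algorithmic(value_in_cents):
        # Single-field, algorithmic: a formula translates the old form to the new form.
        return value_in_cents / 100.0

    GENDER_LOOKUP = {"M": "Male", "F": "Female"}       # single-field, table lookup

    def many_to_one(first_name, last_name):
        # Multi-field M:1 - many source fields become one target field.
        return f"{last_name}, {first_name}"

    def one_to_many(full_address):
        # Multi-field 1:M - one source field split into many target fields.
        street, city, postcode = [part.strip() for part in full_address.split(",")]
        return {"street": street, "city": city, "postcode": postcode}

    print(algorithmic(1999), GENDER_LOOKUP["F"],
          many_to_one("Mary", "Smith"), one_to_many("12 Main St, Dallas, 75001"))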
Derived Data
Objectives: ease of use for decision support applications; fast response to predefined user queries; customized data for particular target audiences; ad-hoc query support; data mining capabilities. Characteristics: detailed (mostly periodic) data; aggregate (for summary); distributed (to departmental servers). The most common data model is the star schema (also called the dimensional model).
Components of a star schema
Fact tables contain factual or quantitative data. There is a 1:N relationship between dimension tables and fact tables. Dimension tables contain descriptions about the subjects of the business and are denormalized to maximize performance. The star schema is excellent for ad-hoc queries, but bad for online transaction processing.
Star schema example
The fact table provides statistics for sales broken down by product, period, and store dimensions.
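A hedged sketch of how such a fact table is queried: a star join that constrains on dimension attributes and aggregates the measures, assuming the sales mart tables from the earlier DDL sketch exist:

    import sqlite3

    con = sqlite3.connect("sales_mart.db")   # schema from the earlier DDL sketch
    sql = """
    SELECT p.brand, t.quarter, s.store_name, SUM(f.dollars_sold) AS total_sold
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim    t ON t.time_key    = f.time_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    WHERE t.year = 2004                      -- constraint supplied by a dimension attribute
    GROUP BY p.brand, t.quarter, s.store_name
    """
    for row in con.execute(sql):
        print(row)
    con.close()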
Star schema with sample data
Advanced concepts
Slowly Changing dimensions Ragged Hierarchies
What if Our Data is not Static?
Small occasional changes in dimension data are normal in business.
Examples of these changes include: the addition of new members (a new product is launched); changing relationships within the dimension (a sales rep moves to another branch); changed properties of members (a product is reformulated or renamed); deletion of members (this is rare in data warehousing).
Understand Surrogates
4-byte integer key (can hold more than 2 billion positive integers). Internally assigned and meaningless; ensures uniqueness; always known. Used in conjunction with business keys: the business key is often mnemonic (for example, OTA for the Ottawa office), while the surrogate key is numeric (for example, 000128). Surrogate keys are never used in reports; they are used to link dimension tables to fact tables.
Understand Surrogate Keys Used In Operational Systems
Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.
A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.
You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).
Operational surrogate keys are usually considered business keys in the data mart.
Understand Natural Keys: Example
Product Dimension (Prod Code, Name): PR X 002 39 - Soup; PR X 003 40 - Beans; PR Y 003 40 - Peas.
Fact Table (Prod Code, Cust Code, Measures): PR X 002 39 / SA 1 11; PR X 003 40 / LO 2 22; PR Y 003 40 / SE 5 55.
Customer Dimension (Cust Code, Name): SA 1 11 - Safeway; LO 2 22 - Loblaws; SE 5 55 - 7-11.
Understand Surrogate Keys: Example
Product Dimension (Prod Sur, Prod Code, Name): 1, PR X 002 39, Soup; 2, PR X 003 40, Beans; 3, PR Y 003 40, Peas.
Customer Dimension (Cust Sur, Cust Code, Name): 10, SA 1 11, Safeway; 20, LO 2 22, Loblaws; 30, SE 5 55, 7-11.
Fact Table (Prod Sur, Cust Sur, Measures): 1, 10; 2, 20; 3, 30.
Track Dimensional Changes Over Time
Operational systems tend to contain data about the current state of the business.
A data warehouse is expected to hold data for five to 10 years.
Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January/1999 and December/1999?).
If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?
Understand Slowly Changing Dimensions (SCD)
Operational dimensional data may often be thought of as static. It may only need to reflect the current state.
Data warehouse dimensional data often must show how the dimensional data changes over time. It is not static.
The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.
Understand Issues With Slowly Changing Dimensions
Maintaining SCDs can be complex without surrogates.
Imagine a dimension where Emp. No is the business key, Name and Hire Date are normal columns, and Branch, Position, and Salary are Type 2 columns:
Surrogate 1: Emp. No 10001, Jack, OTA, VP, Jan '88, 50K
Surrogate 2: Emp. No 10002, Jane, ARL, MK, Jan '92, 40K
Surrogate 3: Emp. No 10003, Tom, NY, SS, Jan '93, 35K
Surrogate 4: Emp. No 10001, Jack, SJ, VP, Jan '88, 50K (branch change)
Surrogate 5: Emp. No 10001, Jack, SJ, S-VP, Jan '88, 50K (position change)
Surrogate 6: Emp. No 10001, Jack, SJ, S-VP, Jan '88, 60K (salary change)
Without surrogates, the natural key would have to grow to (Emp. No + Branch), then (Emp. No + Branch + Position), then (Emp. No + Branch + Position + Salary). Imagine the effect of having such a large natural key in the fact table.
Use Different Methods of Handling Dimensional Changes
Two most commonly used types of SCDs (according to Kimball): Type 1. Overwrite the old value with the new value (do not track the changes).
Type 2. Add a new dimension record with a new surrogate key (track changes over time).
A single row may have a combination of columns of different types.
Type 1: Overwrite the Original Value
The organization may choose not to track certain data changes because the original data may have been incorrect, or because the change is not considered relevant for tracking.
Sales Rep Dimension Table: Rep Key 00128, Office Dallas. When the office changes, the value is simply overwritten in this row.
Sales Fact Table (Rep Key, Order Date, Cust Key): 00128, 1/1/1999, 12345; 00128, 2/1/1999, 12345; ...
Type 2: Add a New Dimension Record
When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.
Usually, an effective begin/end date is also updated on the new and old rows.
Multiple rows may have the same business key, but they will always have unique surrogate keys.
Sales Fact Table (RepSurKey, Order Date, Cust Key): 11111, 01/01/1999, 12345; 11111, 02/01/1999, 12345; 11112, 03/01/1999, 12345; 11112, 04/01/1999, 12345; ...
Sales Rep Dimension Table (RepSurKey, Rep Key, Name, Office, Eff Date): 11111, 00128, Mary Smith, Dallas, 9901; 11112, 00128, Mary Smith, NYC, 9903.
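A minimal sketch contrasting the two SCD types, using an in-memory list as a stand-in for the sales-rep dimension (the field names are assumptions; the Mary Smith data mirrors the example above):

    import datetime

    dim = []            # stand-in for the sales-rep dimension table
    next_surrogate = 1

    def current_row(rep_key):
        return next((r for r in dim if r["rep_key"] == rep_key and r["end_date"] is None), None)

    def apply_change(rep_key, name, office, scd_type):
        global next_surrogate
        row = current_row(rep_key)
        if row and scd_type == 1:
            row.update(name=name, office=office)          # Type 1: overwrite, no history
            return row["sur_key"]
        if row and scd_type == 2:
            row["end_date"] = datetime.date.today()       # Type 2: close off the old version
        dim.append({"sur_key": next_surrogate, "rep_key": rep_key, "name": name,
                    "office": office, "eff_date": datetime.date.today(), "end_date": None})
        next_surrogate += 1
        return next_surrogate - 1                          # new surrogate for new fact rows

    apply_change("00128", "Mary Smith", "Dallas", scd_type=2)
    apply_change("00128", "Mary Smith", "NYC", scd_type=2)  # office move: new surrogate key
    print(dim)

Both versions share the business key 00128 but carry unique surrogate keys, exactly as in the dimension table above.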
Balanced and Ragged Hierarchies
Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).
Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.
In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.
Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.
Parent-Child Relationships
Parent-child relationships are recursive relationships.
The levels of the hierarchy are determined by rows of the same table.
(Example: the Employees table with its recursive Reports To relationship to itself, and its relationship to Orders.)
Ragged Hierarchies
Leaf nodes have no children.
DecisionStream fact builds only look for leaf nodes at the lowest level.
Employee hierarchy example: Andrew Fuller is at the top. Nancy Davolio, Janet Leverling, Margaret Peacock, and Laura Callahan report to Andrew Fuller and are leaf nodes. Steven Buchanan also reports to Andrew Fuller; Michael Suyama, Robert King, and Anne Dodsworth report to Steven Buchanan and are leaf nodes.
Resolve Ragged Hierarchies: Step 1
Create an auto-level hierarchy to obtain the number of levels.
Create a dimension build to create a physical table that will identify for each row the level it belongs to.
Resolve Ragged Hierarchies: Step 1 (cont’d) Use Auto-Level Hierarchies
The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy.
Top level (Level 1); reports to Andrew Fuller (Level 2); reports to Steven Buchanan (Level 3).
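DecisionStream does this work itself; purely as an illustration of the auto-level idea, the following sketch walks the recursive reports-to relationship to compute each employee's level and to find the leaf nodes, using the example hierarchy above as data:

    REPORTS_TO = {                       # child -> parent (None = top of the hierarchy)
        "Andrew Fuller": None,
        "Nancy Davolio": "Andrew Fuller", "Janet Leverling": "Andrew Fuller",
        "Margaret Peacock": "Andrew Fuller", "Laura Callahan": "Andrew Fuller",
        "Steven Buchanan": "Andrew Fuller",
        "Michael Suyama": "Steven Buchanan", "Robert King": "Steven Buchanan",
        "Anne Dodsworth": "Steven Buchanan",
    }

    def level(employee):
        # Level 1 is the top of the hierarchy; each hop up the parent chain adds one.
        parent = REPORTS_TO[employee]
        return 1 if parent is None else 1 + level(parent)

    parents = set(REPORTS_TO.values())
    for emp in REPORTS_TO:
        is_leaf = emp not in parents      # leaf nodes have no children
        print(level(emp), emp, "(leaf)" if is_leaf else "")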
Issues Regarding Star Schema
Dimension table keys must be surrogate (non-intelligent and non-business-related) because keys may change over time and for length/format consistency.
Granularity of the fact table: what level of detail do you want? Transactional grain is the finest level; aggregated grain is more summarized. A finer grain gives better market basket analysis capability, but a finer grain also means more dimension tables and more rows in the fact table.
The User Interface Metadata (data catalog)
Identify the subjects of the data mart; identify dimensions and facts; indicate how data is derived from the enterprise data warehouse, including derivation rules; indicate how data is derived from the operational data store, including derivation rules; identify available reports and predefined queries; identify data analysis techniques (e.g., drill-down); identify responsible people.
Q & A
Multi-dimensional data
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Relational OLAP (ROLAP)
Traditional relational representation
Multidimensional OLAP (MOLAP)
Cube structure.
OLAP Operations
Cube slicing: come up with a 2-D view of the data.
Drill-down: going from a summary to more detailed views.
Overall Plan
We need fast answers to analytical questions. The relational model may not be the answer. We can restructure data specifically for analysis. First we need to find out how people analyse data.
Overall Plan
Analysing how people analyse data reveals the importance of measures and dimensions. So we structure the data with that in mind. The star schema is the physical structure that emerges. We can implement this as ROLAP, MOLAP, or HOLAP. We achieve our objective: rapid analytical processing.
How do we make databases faster?
Lots of ways: indexing, query design, application design, care with locking, data structuring.
Data structuring – Relational model
Pros: data integrity in the face of user updates; small data volume; good for transactional queries. Cons: poor analytical query performance.
Why are relational databases slow?
Joins Functions Aggregations
Poor analytical query performance
So, there is a tension between transactions and analytical querying. Solution: split them up. Take a copy of the transactional database and structure it in a totally different way that is optimised for analytical querying.
Structure for analytical querying
Great idea, but first we need to find out how people analyse their data
How people analyse their data
People analyse their data in terms of graphs, grids, and reports. Do these have anything in common?
How people analyse their data
Do these have anything in common? Measures: numerical values, typically plotted on the Y axis. Dimensions: discontinuous variables that slice the measures into aggregated groups, typically plotted on the X axis.
How people analyse their data
Dimensions are often hierarchical. People want to analyse: Time by Year, Quarter, Month, Day; Product by Warehouse, Type, Product; Customer by Country, County, Person.
How people analyse their data
So, we need to summarise all of this…..
Measures Dimensions Hierarchies
Customer hierarchy example: Name, Region, Country.
Squashed Octopus
(Diagram: the "squashed octopus" logical model, with measures such as Delay, Profit, and Quantity at the centre and dimension hierarchies radiating outward: Customer, Product (Item, Class), Warehouse, Employee (County, Region), and Time (WeekDay, Month, Quarter, Year).)
Squashed Octopus
The SO is a logical model. What about a physical model?
(Recap: the relational model is slow for analytical querying because of joins, functions, and aggregations.)
Star Schema
Star Schema
What is in the fact table?
Facts
Star Schema
What is in a dimension table?
Dimensional information Hierarchical information
Star Schema
How do dimension and fact tables work together?
Star Schema
Is it faster?
Query / Time (Relational) / Time (Star Schema):
Monthly totals: 70 / 60
Sales in March by Product: 18 / 6
Sales in March 2004 by Product: 12 / 2
Star Schema
Is it faster?
Yes How can we make it even faster?
Aggregation
Query / Time (Relational) / Time (Star Schema) / Time (Aggregated Star Schema):
Monthly totals: 70 / 60 / <1
Sales in March by Product: 18 / 6 / <1
Sales in March 2004 by Product: 12 / 2 / <1
Star Schema
If we leave this as a set of tables, then it is ROLAP, Relational OLAP (OLAP = On-Line Analytical Processing). But it is a pain to manage: all those aggregation tables.
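A sketch of why the aggregation tables are a pain: each one is another table that must be rebuilt whenever the base fact table is reloaded. This assumes the sales mart tables from the earlier DDL sketch exist; the aggregate table name is an assumption:

    import sqlite3

    con = sqlite3.connect("sales_mart.db")   # schema from the earlier DDL sketch
    con.executescript("""
    DROP TABLE IF EXISTS sales_by_month;     -- each aggregate must be rebuilt after every load
    CREATE TABLE sales_by_month AS
        SELECT t.year, t.month, p.brand, SUM(f.dollars_sold) AS dollars_sold,
               SUM(f.units_sold) AS units_sold
        FROM sales_fact f
        JOIN time_dim    t ON t.time_key    = f.time_key
        JOIN product_dim p ON p.product_key = f.product_key
        GROUP BY t.year, t.month, p.brand;
    """)
    con.close()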
Star Schema
So, the answer is MOLAP (Multi-dimensional OLAP)
Star Schema
Finally HOLAP (Hybrid OLAP)
On-Line Analytical Processing (OLAP)
OLAP Operations
Cube slicing
– come up with 2-D view of data
Drill-down
– going from summary to more detailed views
Slicing a data cube.
Example of drill-down: a summary report, then a drill-down with color added.
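A small sketch of slicing and drill-down on an in-memory cube using pandas; the data values are made up purely for illustration:

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2004, 2004, 2004, 2004],
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "product": ["Soup", "Beans", "Soup", "Beans"],
        "store":   ["Dallas", "Dallas", "NYC", "NYC"],
        "dollars": [100, 80, 120, 90],
    })

    # Slice: fix one dimension member (store = Dallas) to get a 2-D view of the cube.
    dallas = sales[sales["store"] == "Dallas"].pivot_table(
        index="product", columns="quarter", values="dollars", aggfunc="sum")
    print(dallas)

    # Drill-down: go from a yearly summary to quarterly detail.
    print(sales.groupby("year")["dollars"].sum())
    print(sales.groupby(["year", "quarter"])["dollars"].sum())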
Summary
We need fast answers to analytical questions. The relational model may not be the answer. We can restructure data specifically for analysis. First we need to find out how people analyse data.
Summary
Analysing how people analyse data reveals the importance of measures and dimensions. So we structure the data with that in mind. The star schema is the physical structure that emerges. We can implement this as ROLAP, MOLAP, or HOLAP. We achieve our objective: rapid analytical processing.