An Introduction to Data Warehousing 1

Download Report

Transcript An Introduction to Data Warehousing 1

An Introduction
to
Data Warehousing
1
Business Intelligence

Now,if the Estimates made before a Battle indicate Victory,it
is because careful calculations show that your conditions
are more favorable than those of your enemy;if they
indicate defeat ,it is because careful calculations show that
the favorable conditions for a Battle are fewer.With more
careful calculations one can win ; with less one cannot.
How much chance of Victory has one who makes no
calculations at all !!
--- Sun Tzu , The Art of War

Business these days are ,war minus shooting.
-Anonymous
Course Roadmap
• Introduction to Datawarehousing
• Difference between Operational System and DataWarehouse
• Emergence of Decision Support Systems
• DataWarehouse Theoretical Architecture
• DataWarehouse Technical Architecture
• DataWarehouse Bus Architecture
• Data Modelling concepts
• E-R Modelling for OLTP System
• Dimensional Modelling for a Datawarehouse
• Scheme generation for Datawarehouse
• Star Scheme Design
• Snowflake Scheme Design
• Key aspects in designing the Dimensional Model
• Granularity with respect to the Fact Table in the Schemas
• Conformed Facts,Dimensions
Course Roadmap
• Fact less Fact Tables,Aggregate Fact Tables
• Out Trigger Entities in the Schemas
• Types of Relationships to be maintained between Facts
and Dimensions
• Dependencies while generating Physical Scheme for
a DataWarehouse
• Case Study of design of DataWarehouse for an existing
ERmodel
Objectives

At the end of this session, you will know :
– What is Data Warehousing
– The evolution of Data Warehousing
– Need for Data Warehousing
– OLTP Vs Warehouse Applications
– Data marts Vs Data Warehouses
– Operational Data Stores
– Overview of Warehouse Architecture
Objectives
At the end of this lesson, you will know :
– Data Warehouse Architectures
– Components of Data Warehousing Architecture
– An overview of each of the components
– Considerations for Data Warehouse Design
– Common mistakes in Warehouse designs
– An overview of Warehouse on the web
What is a DataWarehouse ?
What is a Data Warehouse ?
A data warehouse is a subject-oriented,
integrated, nonvolatile, time-variant collection
of data in support of management's decisions.
- WH Inmon
WH Inmon - Regarded As Father Of Data Warehousing
Subject-Oriented- Characteristics of a Data Warehouse
Operational
Data
Warehouse
Leads
Prospects
Customers
Products
Quotes
Orders
Regions
Time
Focus is on Subject Areas rather than Applications
Integrated - Characteristics of a Data Warehouse
Appl A - m,f
Appl B - 1,0
Appl C - male,female
Appl A - balance dec fixed (13,2)
Appl B - balance pic 9(9)V99
Appl C - balance pic S9(7)V99 comp-3
m,f
balance dec
fixed (13,2)
Appl A - bal-on-hand
Appl B - current-balance
Appl C - cash-on-hand
Current balance
Appl A - date (julian)
Appl B - date (yymmdd)
Appl C - date (absolute)
date (julian)
Integrated View Is The Essence Of A Data Warehouse
Non-volatile - Characteristics of a Data Warehouse
insert
change
Data
Warehouse
Operational
delete
insert
load
replace
change
Data Warehouse Is Relatively Static In Nature
read only
access
Time Variant - Characteristics of a Data Warehouse
Operational
Current Value data
• time horizon : 60-90 days
Data
Warehouse
Snapshot data
• time horizon : 5-10 years
•data warehouse stores historical
data
Data Warehouse Typically Spans Across Time
Alternate Definitions
A collection of integrated, subject oriented databases
designed to support the DSS function, where each
unit of data is relevant to some moment of time
- Imhoff
Alternate Definitions
Data Warehouse is a repository of data summarized
or aggregated in simplified form from operational
systems. End user orientated data access and
reporting tools let user get at the data for decision
support - Babcock
Evolution of Data Warehousing
1960 - 1985 : MIS Era
• Unfriendly
• Slow
• Dependent on IS programmers
• Inflexible
• Analysis limited to defined reports
Focus on Reporting
Evolution of Data Warehousing
1985 - 1990 : Querying Era
Queries that are
formulated by the user
on the spur of the
moment
• Adhoc, unstructured access to corporate data
• SQL as interface not scalable
• Cannot handle complex analysis
Focus on Online Querying
Evolution of Data Warehousing
1990 - 20xx : Analysis Era
• Trend Analysis
• What If ?
• Cross Dimensional Comparisons
• Statistical profiles
• Automated pattern and rule discovery
Focus on Online Analysis
Need for Data Warehousing

Better business intelligence for end-users

Reduction in time to locate, access, and analyze information

Consolidation of disparate information sources

Strategic advantage over competitors

Faster time-to-market for products and services

Replacement of older, less-responsive decision support
systems

Reduction in demand on IS to generate reports
Business Queries
Typical Business Queries

Which product generated maximum revenue over last two
quarters in a chosen geographical region, city wise, relative to
the previous version of product, compared with the plan

What percent of customer procures product A with B in a chosen
region, broken down by city, season, and income group
OLTP Systems Vs Data Warehouse
Remember
Between OLTP and Data Warehouse systems
users are different
data content is different,
data structures are different
hardware is different
Understanding The Differences Is The Key
OLTP Vs Warehouse
Operational System
Data Warehouse
Transaction Processing
Query Processing
Predictable CPU Usage
Random CPU Usage
Time Sensitive
History Oriented
Operator View
Managerial View
Normalized Efficient
Denormalized Design for
Design for TP
Query Processing
OLTP Vs Warehouse
Operational System
Data Warehouse
Designed for Atmocity,
Consistency, Isolation and
Durability
Designed for quite or static
database
Organized by transactions
(Order, Input, Inventory)
Organized by subject
(Customer, Product)
Relatively smaller database Large database size
Many concurrent users
Relatively few concurrent
users
Volatile Data
Non Volatile Data
OLTP Vs Warehouse
Operational System
Data Warehouse
Stores all data
Stores relevant data
Performance Sensitive
Less Sensitive to performance
Not Flexible
Flexible
Efficiency
Effectiveness
Processing Power
Capacity Planning
Time of day
Processing Load Peaks During the Beginning and End of Day
Examples Of Some Applications
Manufacturers
Retailers

Target Marketing

Market Segmentation

Budgeting

Credit Rating Agencies

Financial Reporting and Consolidation

Market Basket Analysis - POS Analysis

Fraud Management

Profitability Management

Event tracking
Customers
Do we need a separate database ?

OLTP and data warehousing require two very
differently configured systems

Isolation of Production System from Business
Intelligence System

Significant and highly variable resource demands of
the data warehouse

Cost of disk space no longer a concern

Production systems not designed for query
processing
Data Marts

Enterprise wide data warehousing projects have a
very large cycle time

Getting consensus between multiple parties may
also be difficult

Departments may not be satisfied with priority
accorded to them

Sometimes individual departmental needs may be
strong enough to warrant a local implementation

Application/database distribution is also an
important factor
Data Marts
Subject or Application Oriented Business View of
Warehouse
» Finance, Manufacturing, Sales etc.
» Smaller amount of data used for Analytic Processing
» Address a single business process
A Logical Subset of The Complete Data Warehouse
Data Warehouse and Data Mart
Data Warehouse
Data Marts
Scope
 Application Neutral
 Centralized, Shared
 Cross LOB/enterprise
Data
Perspective
 Historical Detailed data
 Some summary
 Specific Application
Requirement
 LOB, department
 Business Process
Oriented
 Detailed (some history)
 Summarized
Subjects
 Multiple subject areas
 Single Partial subject
 Multiple partial subjects
 OLTP snapshots
Data Warehouse and Data Mart
Data Warehouse Data Marts
Data Sources
 Many
 Operational/ External
Data
 Few
 Operational, external
data
 OLTP snapshots
Implement
Time Frame
 9-18 months for first
stage
 Multiple stage
implementation
 Flexible, extensible
 Durable/Strategic
 Data orientation
 4-12 months
Characteristics
 Restrictive, non
extensible
 Short life/tactical
 Project Orientation
Warehouse or Mart First ?
Data Warehouse First
Data Mart first
Expensive
Relatively cheap
Large development cycle
Delivered in < 6 months
Change management is
difficult
Easy to manage change
Difficult to obtain continuous
corporate support
Can lead to independent and
incompatible marts
Technical challenges in
building large databases
Cleansing, transformation,
modeling techniques may be
incompatible
Different kinds of Information Needs



Current
Recent
Historical
Is this medicine available
in stock
What are the tests this
patient has completed so
far
Has the incidence of
Tuberculosis increased in
last 5 years in Southern
region
Operational Data Store - Definition
Can I see credit
report from
Accounts, Sales
from marketing
and open order
report from
order entry for
this customer
Data from multiple
sources is integrated
for a subject
A subject oriented, integrated,
volatile, current valued data store
containing only corporate
Identical queries may
give different results
at different times.
Supports analysis
requiring current
data
detailed data
Data stored only for
current period. Old
Data is either
archived or moved to
Data Warehouse
Operational Data Store

Increasingly becoming integrated with the data
warehouse

Are nothing but more responsive real time data
warehouses

Data Mining has anyway forced Data Warehouses
to store transactional level data
OLTP Vs ODS Vs DWH
Characteristic
OLTP
ODS
Data Warehouse
Operating
Analysts
Managers and
Personnel
analysts
Individual records, Individual records, Set of records,
Data access
transaction driven transaction or
analysis driven
analysis driven
Current, real-time Current and near- Historical
Data content
current
Detailed and lightly Summarized and
Data granularity Detailed
summarized
derived
Subject-oriented
Subject-oriented
Data organization Functional
Audience
Data quality
All application
specific detailed
data needed to
support a business
activity
All integrated data Data relevant to
needed to support a management
business activity
information needs
OLTP Vs ODS Vs DWH
Characteristic
OLTP
ODS
Data Warehouse
Data redundancy
Somewhat
redundant with
operational
databases
Managed
redundancy
Data stability
Non-redundant
within system;
Unmanaged
redundancy among
systems
Dynamic
Data update
Field by field
Field by field
Controlled batch
Data usage
Highly structured,
repetitive
Somewhat
structured, some
analytical
Database size
Moderate
Moderate
Highly
unstructured,
heuristic or
analytical
Large to very large
Somewhat stable
Dynamic
Stable
Database
structure stability
Somewhat dynamic Static
OLTP Vs ODS Vs DWH
Characteristic
OLTP
ODS
Data Warehouse
Development
methodology
Requirements
driven, structured
Data driven,
evolutionary
Operational
priorities
Performance and
availability
Data driven,
somewhat
evolutionary
Availability
Philosophy
Support day-today operation
Predictability
Stable
Response time
Sub-second
Support day-to-day
decisions &
operational
activities
Mostly stable, some Unpredictable
unpredictability
Seconds to minutes Seconds to minutes
Return set
Small amount of
data
Small to medium
amount of data
Access flexibility
and end user
autonomy
Support managing
the enterprise
Small to large
amount of data
Typical Data Warehouse Architecture
Data
Marts
EIS /DSS
Select
Metadata
Query Tools
Extract
Transform
Integrate
Maintain
Data
Warehouse
OLAP/ROLAP
Web Browsers
Operational
Systems/Data
Data
Preparation
Middleware/
API
Multi-tiered Data Warehouse without ODS
Data Mining
Typical Data Warehouse Architecture
Data
Marts
Metadata
Metadata
Select
Select
Extract
Extract
Transform
Integrate
ODS
Transform
Data
Warehouse
Load
Maintain
Operational
Systems/Data
Data
Preparation
Data
Preparation
Multi-tiered Data Warehouse with ODS
Benefits of DWH
These capabilities empower the corporate...
 To formulate effective business, marketing
and sales strategies.
 To precisely target promotional activity.
 To discover and penetrate new markets.
 To successfully compete in the marketplace
from a position of informed strength.
 To build predictive rather than retrospective models.
Warehouse Architecture - 1
EIS /DSS
Metadata
Query Tools
Select
Extract
Transform
Integrate
Data
Warehouse
OLAP/ROLAP
Maintain
Web Browsers
Operational
Systems/Data
Data
Preparation
Middleware/
API
Data Mining
Enterprise Data Warehouse
Warehouse Architecture - 2
Metadata
EIS /DSS
Data Mart
Select
Metadata
Query Tools
Extract
Transform
Data Mart
Integrate
Maintain
OLAP/ROLAP
Metadata
Web Browsers
Operational
Systems/Data
Data Mart
Data
Preparation
Middleware/
API
Data Mining
Single Department Data Mart
Warehouse Architecture - 3
Data
Marts
EIS /DSS
Metadata
Query Tools
Select
Extract
Transform
Data
Warehouse
Integrate
OLAP/ROLAP
Maintain
Web Browsers
Operational
Systems/Data
Data
Preparation
Operational
Data Store
Multi-tiered Data Warehouse
Middleware/
API
Data Mining
Data Warehouse Architectures
There are three schools of thought about DW
architectures
– One supports Dimensional Modeling all through
(Ralph Kimball)
– Second supports ER for Data Warehouse and Star
Schemas for Data Marts
– Third supports ER model for DW (NCR)
Kimball’s View
Operational Systems
Presentation Server
Staging Area
Each Star is
a Data Mart
and has both
summary and
detail data
LAN
Data Warehouse
Server
Processes
•Extract
•Scrubbing
•Transformation
•Load Jobs
•Aggregation Jobs
•Replication
•Monitoring
•Management
•Meta Data Repository
•Meta Data Population
•Meta Data Maintenance
DW is sum
total of all
Data Marts
DW Bus using
Conformed Dimensions
Multiple Data Marts With Conformed Dimensions
Inmon’s View
Operational Systems
Staging Area
Data Warehouse
Data Marts
LAN
Data Warehouse Server
Processes
•Extract
•Scrubbing
•Transformation
•Load Jobs
•Aggregation Jobs
•Replication
•Monitoring
•Management
•Meta Data Repository
•Meta Data Population
•Meta Data Maintenance
Detail Data
in ER format
Summarized Data
in Star formats
Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema)
Components of a Data Warehouse Architecture

Source Databases

Data extraction/transformation/load (ETL) tool

Data warehouse maintenance and administration
tools

Data modeling tool or interface to external data
models

Warehouse databases

End-user data access and analysis tools
Components of a Data Warehouse Architecture
Data
Cleansing
Tools
Source
Databases
Data
Modeling
Tool
ETL Tool
Central
Metadata
ROLAP
Engine
Data Access and
Analysis Tools
-Managed Query
Central
Warehouse
(RDBMS)
RDBMS
-Desktop OLAP
-ROLAP
-MOLAP
Local meta
data
Warehouse
Admin Tool
- Data Mining
MDDB
Architected
Datamarts
Warehouse Databases
Data Warehouse Is Not Just About Data... But Tools Too
Source Databases - Characteristics

Legacy, relational, text or external sources

Designed for high-speed transaction processing

Real-time, current, volatile data

Fast response for larger numbers of concurrent users

Many short transactions

Update-intensive; modifications by row

Inquiry-oriented; access by keys

High integrity, security, recoverability

Source data is often inconsistent and poorly modeled
Data Cleaning Tools

To clean data at the source

Clean up source data in-place on the host

Business rule discovery tools which analyse the
source data and write cleaning rules based on
lexical analysis and AI techniques

Poorly integrated with data warehousing tools

ETL tools have limited yet adequate data cleansing
functionality