No Slide Title

Download Report

Transcript No Slide Title

ACCTG 6910
Building Enterprise &
Business Intelligence Systems
(e.bis)
Introduction to Data Warehouse
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
1
Outline
• Why Data Warehouse?
– Problems, causes and data warehouse solutions
• What is Data Warehouse?
– Characteristics and components
• Current Practices of Data Warehouse
2
Why Data Warehouse?
•
Knowledge Management Problems (Drowning in
data, starving for knowledge)
1. Can’t access data (easily)
E.g., data from different branches, years, functional areas, etc.
2. Give me only what’s important (knowledge)
E.g., Regions and products that have upward sales trends over
the last five years.
3. I need to reduce data to what’s important by slicing
and dicing.
E.g., by branch, product, year, etc.
3
Why Data Warehouse?
4. Data inconsistency and poor data quality
E.g., the 2001 PC sales amount in SLC from the CFO and the SLC
Account Manager are not the same.
5. Need to improve the practices of making informed
decisions.
E.g., Did the VP for Marketing decide on the advertising budgets
for branches in the SW region based on their sales
performances over the last five years?
6. Hard and slow to query the database?
E.g., VP for Marketing, CFO and Account Manager had to wait for
the MIS Department to generate sales performance reports
and analyses.
4
Why Data Warehouse?
•
ROI Problems
7. Can I get more value out of my data?
Ans: Make informed, potent decisions using knowledge extracted
from integrated and consistent data over a long period of
time.
8. Can I do this cost-effectively?
Options: federated (interoperable) databases vs. a data
warehouse
9. Can I easily scale up or change how I get knowledge
out of my data?
E.g., Add more regions, functional areas or years in sales
performance analyses.
5
Causes for the Problems
Cause 1: Isolated databases distributed in
an enterprise
CRM
Sales
Inventory
A Root cause for
problems 1, 4, 5, 6,
7, 8 and 9
6
Why Data Warehouse
• Cause 1: Isolated databases distributed
in an enterprise
Sales
CRM
Inventory
Ad hoc access
solutions cannot
alleviate the
problems
7
Why Data Warehouse
• Cause 2: Historical data is archived in
offline storage systems
Sales
Historical
Sales Data
Another Root cause
for problems 1, 4, 5,
6, 7, 8 and 9
Archive
8
Why Data Warehouse
• Cause 2: Historical data is archived in
offline storage systems
Sales
Historical
Sales Data
Ad hoc accesses are
slow and
inconvenient
Archive
9
Cause 3: Metadata for Transaction DB systems
is Not User Friendly
SSN
Name
SSN
1
Has
M
Dependent
Rank
Name
Address
Instructor
1
Student
M
Take
M
Course
Name
Sex
Phone
Relation
Grade
IS-A
Undergraduate
Major
C-Name
C-No
Graduate
Major
Minor
10
11
Why Data Warehouse
• Cause 4: Query and programming
languages are even less user friendly
– DESB students’ academic grades and GPAs
since the freshman year
– Sales amount distribution by product
category, customer state and year
– Slicing and dicing
– SQL statements???
– Report/screen interface codes???
12
Why Data Warehouse
• Cause 5: Transaction databases are
optimized (normalized) to process
transactions but not to answer decision
support queries
– Bad query performance to join the
normalized tables
– Heavy transaction processing workload
13
What is Data Warehouse
Designed to solve problems associated
with current database practices:
• Isolated, distributed databases
Sales
Inventory
CRM
Extract,
replicate,
integrate,
cleanse &
load
Data
Warehouse
14
Why Data Warehouse
• Historical data is archived in offline
storage systems
Integrate
Historical
Data with
Current Data
Sales
Historical
Sales Data
Archive
Data
Warehouse
15
What is Data Warehouse
• Causes 3, 4 and 5: Hard-to-understand
metadata, and query and programming
languages; poor decision support query
performances
• Solution: In data warehouse, organize data in
subject –oriented way rather than processoriented way – dimensional modeling.
16
Dimensional Modeling (Star Schema)
Course
Instructor
. Name
. Rank
Academic
Performance
. Number
. Title
. Grade
Student
. Name
. UG/PG
. Major
Semester
. Year
. Length
. Start date
17
Dimensional Modeling (Star Schema)
Customer
Branch
. Name
. State
. City
Sales
. Qty
. Amt
Product
. Name
. Category
. Name
. State
. City
Time
. Year
. Quarter
. Month
18
One System for Multiple Uses
Application
Program
Database
Interactive Application
Program
Queries/
Transactions
Database
Management
System (DBMS)
Metadata
Database System
19
Two Worlds -> Two Systems
Executive
Information
System
Operational
Application
Operational
Application
Operational
Application
OLTP
DBs
Operational
Decision
Support
System
(DSS)
Data warehouse Reporting
DSS
20
What is Data Warehouse
• Data Warehouse is a subject-oriented,
integrated, time-variant, non-volatile collection of
data in support of management’s decision
making process.
• 1. Subject-oriented means the data warehouse focuses
on the high-level entities of business such as sales,
products, and customers. This is in contrast to database
systems, which deals with processes such as placing an
order.
21
What is Data Warehouse
 2. Integrated means the data is integrated from
distributed data sources and historical data sources
and stored in a consistent format.
3. Time-variant means the data associates with a
point in time (i.e., semester, fiscal year and pay
period)
4. Non-volatile means the data doesn’t change once
it gets into the warehouse.
22
Characteristics of Data Warehouse
Purpose
OLAP Data
Warehouse
OLTP DB
Decision Support Transaction
Processing
Data
Dimensional
Normalized
Model
Relational
Time Span Historical and
Current Data
Current Data
Query
Scan a
Scan a small
processing substantial
set of data
subset of data
Operation Read-only
Read & Update
23
Data Warehouse and Data Mart
• Data warehouse – defined by its decision
support purpose and other characteristics
– Other characteristics: subject-oriented, integrated
• Data mart – a data warehouse for a more
limited business scope (e.g., a department,
etc.)
• A data warehouse may be built from several
data marts
24
Basic Elements of a Data Warehouse System
Source
System
(Legacy)
Storage:
Relational
The Data Warehouse
Presentation Servers
Data Staging Area
extract
Flat files (fastest); RDBMS;
Other
Populate,
replicate,
recover
Processing:
Flat files
Spreadsheets extract
ERP
Legacy
extract
Uploaded
cleaned
dimensions
Clean; Prune; Combine;
Remove duplicates;
households; standardize;
conform dimensions; store
awaiting replications;
archive; export to data marts
Populate,
replicate,
recover
No user query services
Populate,
replicate,
recover
Data Mart #1: OLAP
( ROLAP and/or
MOLAP) query
services; dimensional!
Subject oriented;
locally implemented;
user group driven;
may store atomic data;
may be frequently
refreshed; conform to
DW Bus
End User
Data Access
feed
Ad Hoc
Query
Tools
feed
Report
Writers
Conformed
feed
dimensions and
facts
DW
BUS
Data Mart #2
DW
BUS
feed
Conformed
dimensions and facts
Data Mart #3
Uploaded
model results
End User
Applications
Models:
forecasting;
scoring;
allocating; data
mining; other
downstream
Systems; other
parameters;
25
special UI
Current Practice of DW*
• Expected DW market value in 2002 was
projected to have grown to $113.5 billion.
• Average DW development cost is $1.5 million
and average maintenance cost is $0.5 million.
• DW development time ranges from 1 to 3 yrs.
26
* Source: H.J. Watson, “ Current Practicing in Data Warehousing”, I.S. Management,
2001
Current Practice of DW*
• Sponsorship for the DW project
Sponsor
Percentage
VP of a business unit 39.8
CIO
26.9
Business unit manager 16.7
CEO
11.1
Other
25.0
27
* Source: H.J. Watson, “ Current Practicing in Data Warehousing”, I.S. Management,
2001
Current Practice of DW*
• DW Benefits
–
–
–
–
Less effort to produce better information
Better decisions
Improvement of business processes
Support for accomplishments of strategic
business objectives
• Return on Investments and Cost of
Ownership?
28
* Source: H.J. Watson, “ Current Practicing in Data Warehousing”, I.S. Management,
2001