Introduction to Data Warehousing
Download
Report
Transcript Introduction to Data Warehousing
Data Warehouse Design
Enrico Franconi
CS 636
Implementing a Warehouse
CS 336
Monitoring: Sending data from sources
Integrating: Loading, cleansing,...
Processing: Query processing, indexing, ...
Managing: Metadata, Design, ...
2
Monitoring
Source Types: relational, flat file, IMS,
VSAM, IDMS, WWW, news-wire, …
How to get data out?
Replication tool
Dump file
Create report
ODBC or third-party “wrappers”
CS 336
3
Monitoring Techniques
CS 336
Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring
4
Monitoring Issues
Frequency
periodic: daily, weekly, …
triggered: on “big” change, lots of changes, ...
Data transformation
convert data to uniform format
remove & add fields (e.g., add date to get history)
Standards (e.g., ODBC)
Gateways
CS 336
5
Wrapper
Converts data and queries from one data model to another
Data
Model
A
Queries
Data
Data
Model
B
Extends query capabilities for sources with limited capabilities
Queries
CS 336
Wrapper
Source
6
Wrapper Generation
Solution 1: Hard code for each source
Solution 2: Automatic wrapper generation
Wrapper
CS 336
Wrapper
Generator
Definition
7
Integration
Data Cleaning
Data Loading
Derived Data
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 336
Source
Source
8
Data Integration
Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
Rule-based
Actions
CS 336
Resolve inconsistencies
Eliminate duplicates
Integrate into warehouse (may not be empty)
Summarize data
Fetch more data from sources (wh updates)
etc.
9
Data Cleaning
Find (& remove) duplicate tuples
e.g., Jane Doe vs. Jane Q. Doe
Detect inconsistent, wrong data
Attribute values that don’t match
Patch missing, unreadable data
Insert default values
Notify sources of errors found
CS 336
10
Data Cleaning
Migration (e.g., yen to dollars)
Scrubbing: use domain-specific knowledge (e.g., social
security numbers)
Fusion (e.g., mail list, customer merging)
billing DB
customer1(Joe)
merged_customer(Joe)
service DB
CS 336
customer2(Joe)
11
Loading Data in the Warehouse
Incremental vs. refresh
Off-line vs. on-line
Frequency of loading
At night, 1x a week/month, continuously
Parallel/Partitioned load
CS 336
12
Warehouse Maintenance
Warehouse data materialized view
Initial loading
View maintenance
Derived Warehouse Data
indexes
aggregates
materialized views
View maintenance
CS 336
13
Materialized Views
Define new warehouse relations using SQL
expressions
sale
prodId
p1
p2
p1
p2
p1
p1
storeId
c1
c1
c3
c2
c1
c2
joinTb
CS 336
date
1
1
1
1
2
2
prodId
p1
p2
p1
p2
p1
p1
amt
12
11
50
8
44
4
name
bolt
nut
bolt
nut
bolt
bolt
product
price
10
5
10
5
10
10
storeId
c1
c1
c3
c2
c1
c2
date
1
1
1
1
2
2
id
p1
p2
amt
12
11
50
8
44
4
name price
bolt
10
nut
5
does not exist
at any source
14
Differs from Conventional View
Maintenance...
Warehouses may be highly aggregated and
summarized
Warehouse views may be over history of
base data
Process large batch updates
Schema may evolve
CS 336
15
Differs from Conventional View
Maintenance...
Base data doesn’t participate in view
maintenance
Simply reports changes
Loosely coupled
Absence of locking, global transactions
May not be queriable
CS 336
16
Warehouse Maintenance Anomalies
Materialized view maintenance in loosely
coupled, non-transactional environment
Simple example
Data
Warehouse
Sold (item,clerk,age)
Sold = Sale
Emp
Integrator
Sales
Sale(item,clerk)
CS 336
Comp.
Emp(clerk,age)
17
Warehouse Maintenance Anomalies
Data
Warehouse
Sold (item,clerk,age)
Integrator
Sales
Sale(item,clerk)
Comp.
Emp(clerk,age)
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) integrator adds Sale
(Mary,25)
4. (2) integrator adds (Computer,Mary)
Emp
5. View incorrect (duplicate tuple)
CS 336
18
Maintenance Anomaly - Solutions
Incremental update algorithms (ECA,
Strobe, etc.)
Research issues: Self-maintainable views
What views are self-maintainable
Store auxiliary views so original + auxiliary
views are self-maintainable
CS 336
19
Self-Maintainability: Examples
Sold(item,clerk,age) =
Sale(item,clerk)
Emp(clerk,age)
Inserts into Emp
If Emp.clerk is key and Sale.clerk is foreign
key (with ref. int.) then no effect
Inserts into Sale
Maintain auxiliary view: Emp-clerk,age(Sold)
Deletes from Emp
Delete from Sold based on clerk
CS 336
20
Self-Maintainability: Examples
Deletes from Sale
Delete from Sold based on {item,clerk}
Unless age at time of sale is relevant
Auxiliary views for self-maintainability
Must themselves be self-maintainable
One solution: all source data
But want minimal set
CS 336
21
Partial Self-Maintainability
Avoid (but don’t prohibit) going to sources
Sold=Sale(item,clerk)
Emp(clerk,age)
Inserts into Sale
Check if clerk already in Sold, go to source if
not
Or replicate all clerks over age 30
Or ...
CS 336
22
Warehouse Specification (ideally)
View Definitions
Warehouse
Configuration
Module
Integration
rules
Warehouse
Change
Detection
Requirements
Integrator
Extractor/
Monitor
Extractor/
Monitor
Metadata
Extractor/
Monitor
...
CS 336
23
Processing
ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 336
Source
Source
24
ROLAP Server
Relational OLAP Server
sale
prodId
p1
p2
p1
date
1
1
2
sum
62
19
48
tools
utilities
ROLAP
server
Special indices, tuning;
Schema is “denormalized”
relational
DBMS
CS 336
25
MOLAP Server
Multi-Dimensional OLAP Server
Sales
M.D. tools
Product
B
A
milk
soda
eggs
soap
1
utilities
CS 336
multidimensional
server
2 3 4
Date
could also
sit on
relational
DBMS
26
Index Structures (sketch)
Traditional Access Methods
B-trees, hash tables, R-trees, grids, …
Popular in Warehouses
inverted lists
bit map indexes
join indexes
text indexes
CS 336
27
What to Materialize?
Store in warehouse results useful for
common queries
Example:
total sales
day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
p1
p2
materialize
CS 336
c1
56
11
c2
4
8
c3
50
...
p1
c1
67
c2
12
c3
50
129
p1
p2
c1
110
19
28
Materialization Factors
CS 336
Type/frequency of queries
Query response time
Storage cost
Update cost
29
Cube Aggregates Lattice
129
all
c1
67
p1
c2
12
c3
50
city
city, product
p1
p2
c1
56
11
c2
4
8
city, date
date
product, date
c3
50
day 2
day 1
CS 336
product
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
city, product, date
use greedy
algorithm to
decide what
to materialize
30
Dimension Hierarchies
all
cities
state
city
c1
c2
state
CA
NY
city
CS 336
31
Dimension Hierarchies
all
city
city, product
product
city, date
city, product, date
date
product, date
state
state, date
state, product
state, product, date
not all arcs shown...
CS 336
32
Interesting Hierarchy
time
all
years
weeks
quarters
months
day
1
2
3
4
5
6
7
8
week
1
1
1
1
1
1
1
2
month
1
1
1
1
1
1
1
1
quarter
1
1
1
1
1
1
1
1
year
2000
2000
2000
2000
2000
2000
2000
2000
conceptual
dimension table
days
CS 336
33
Managing
Metadata
Warehouse Design
Tools
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 336
Source
Source
34
Metadata
Administrative
definition of sources, tools, ...
schemas, dimension hierarchies, …
rules for extraction, cleaning, …
refresh, purging policies
user profiles, access control, ...
CS 336
35
Metadata
Business
business terms & definition
data ownership, charging
Operational
data lineage
data currency (e.g., active, archived, purged)
use stats, error reports, audit trails
CS 336
36
Design Summary
CS 336
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?
37
Tools
Development
design & edit: schemas, views, scripts, rules, queries, reports
Planning & Analysis
what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management
performance monitoring, usage patterns, exception reporting
System & Network Management
measure traffic (sources, warehouse, clients)
Workflow Management
“reliable scripts” for cleaning & analyzing data
CS 336
38
Current State of Industry
Extraction and integration done off-line
Usually in large, time-consuming, batches
Everything copied at warehouse
Not selective about what is stored
Query benefit vs storage & update cost
Query optimization aimed at OLTP
High throughput instead of fast response
Process whole query before displaying anything
CS 336
39
State of Commercial Practice ...
Connectivity to sources
CS 336
Apertus
Information Builders
Informix Enterprise Gateway
Oracle Open Connect
CA-Ingres gateway
MS ODBC
Platinum InfoHub
Data extract, clean,
transform, refresh
CA-Ingres Replicator
ETI-Extract
IBM Data Joiner, Data
Propagator
Prism Warehouse manager
SAS Access
Sybase Replication Server
Trinzic InfoPump
40
… State of Commercial Practice ...
Multidimensional
Database Engines
Arbor Essbase
Oracle RIR Express
Comshare Commader
SAS System
ROLAP Servers
HP Intelligent Warehouse
Informix Metacube
MicroStrategy DSS Server
Information Advantage Asxys
Warehouse Data Servers
CS 336
CA-Ingres
Oracle 8
RedBrick
Sybase IQ
Informix Dynamic Server
IBM DB2
41
… State of Commercial Practice
Query/Reporting
Environments
IBM DataGuide
SAS Access CA Visual Express
Platinum Forest&Trees
Informix ViewPoint
Multidimensional Analysis
Kenan Systems Acumate
Microsoft Excel
Arbor Essbase Analysis server
Cognos PowerPlay
IQ Software IQ/Vision
Lotus 123
SAS OLAP++
Business Objects
Lots and lots of consulting!!
CS 336
42
Future Directions
CS 336
Better performance
Larger warehouses
Easier to use
What are companies & research labs
working on?
43
Research (1)
CS 336
Incremental Maintenance
Data Consistency
Data Expiration
Recovery
Data Quality
Error Handling (Back Flush)
44
Research (2)
CS 336
Rapid Monitor Construction
Temporal Warehouses
Materialization & Index Selection
Data Fusion
Data Mining
Integration of Text & Relational Data
Conceptual Modelling
45
Conclusions
Massive amounts of data and
complexity of queries will push limits
of current warehouses
Need better systems:
easier to use
provide quality information
CS 336
46