Introduction to Data Warehousing

Download Report

Transcript Introduction to Data Warehousing

Data Warehouse Design
Enrico Franconi
CS 636
Implementing a Warehouse




CS 336
Monitoring: Sending data from sources
Integrating: Loading, cleansing,...
Processing: Query processing, indexing, ...
Managing: Metadata, Design, ...
2
Monitoring
 Source Types: relational, flat file, IMS,
VSAM, IDMS, WWW, news-wire, …
 How to get data out?
 Replication tool
 Dump file
 Create report
 ODBC or third-party “wrappers”
CS 336
3
Monitoring Techniques








CS 336
Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring
4
Monitoring Issues
 Frequency
 periodic: daily, weekly, …
 triggered: on “big” change, lots of changes, ...
 Data transformation
 convert data to uniform format
 remove & add fields (e.g., add date to get history)
 Standards (e.g., ODBC)
 Gateways
CS 336
5
Wrapper
Converts data and queries from one data model to another
Data
Model
A
Queries
Data
Data
Model
B
Extends query capabilities for sources with limited capabilities
Queries
CS 336
Wrapper
Source
6
Wrapper Generation
 Solution 1: Hard code for each source
 Solution 2: Automatic wrapper generation
Wrapper
CS 336
Wrapper
Generator
Definition
7
Integration
 Data Cleaning
 Data Loading
 Derived Data
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 336
Source
Source
8
Data Integration
 Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
 Rule-based
 Actions






CS 336
Resolve inconsistencies
Eliminate duplicates
Integrate into warehouse (may not be empty)
Summarize data
Fetch more data from sources (wh updates)
etc.
9
Data Cleaning
 Find (& remove) duplicate tuples
 e.g., Jane Doe vs. Jane Q. Doe
 Detect inconsistent, wrong data
 Attribute values that don’t match
 Patch missing, unreadable data
 Insert default values
 Notify sources of errors found
CS 336
10
Data Cleaning
 Migration (e.g., yen to dollars)
 Scrubbing: use domain-specific knowledge (e.g., social
security numbers)
 Fusion (e.g., mail list, customer merging)
billing DB
customer1(Joe)
merged_customer(Joe)
service DB
CS 336
customer2(Joe)
11
Loading Data in the Warehouse
 Incremental vs. refresh
 Off-line vs. on-line
 Frequency of loading
 At night, 1x a week/month, continuously
 Parallel/Partitioned load
CS 336
12
Warehouse Maintenance
 Warehouse data  materialized view
 Initial loading
 View maintenance
 Derived Warehouse Data
 indexes
 aggregates
 materialized views
 View maintenance
CS 336
13
Materialized Views
 Define new warehouse relations using SQL
expressions
sale
prodId
p1
p2
p1
p2
p1
p1
storeId
c1
c1
c3
c2
c1
c2
joinTb
CS 336
date
1
1
1
1
2
2
prodId
p1
p2
p1
p2
p1
p1
amt
12
11
50
8
44
4
name
bolt
nut
bolt
nut
bolt
bolt
product
price
10
5
10
5
10
10
storeId
c1
c1
c3
c2
c1
c2
date
1
1
1
1
2
2
id
p1
p2
amt
12
11
50
8
44
4
name price
bolt
10
nut
5
does not exist
at any source
14
Differs from Conventional View
Maintenance...
 Warehouses may be highly aggregated and
summarized
 Warehouse views may be over history of
base data
 Process large batch updates
 Schema may evolve
CS 336
15
Differs from Conventional View
Maintenance...
 Base data doesn’t participate in view
maintenance
 Simply reports changes
 Loosely coupled
 Absence of locking, global transactions
 May not be queriable
CS 336
16
Warehouse Maintenance Anomalies
 Materialized view maintenance in loosely
coupled, non-transactional environment
 Simple example
Data
Warehouse
Sold (item,clerk,age)
Sold = Sale
Emp
Integrator
Sales
Sale(item,clerk)
CS 336
Comp.
Emp(clerk,age)
17
Warehouse Maintenance Anomalies
Data
Warehouse
Sold (item,clerk,age)
Integrator
Sales
Sale(item,clerk)
Comp.
Emp(clerk,age)
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1)  integrator adds Sale
(Mary,25)
4. (2)  integrator adds (Computer,Mary)
Emp
5. View incorrect (duplicate tuple)
CS 336
18
Maintenance Anomaly - Solutions
 Incremental update algorithms (ECA,
Strobe, etc.)
 Research issues: Self-maintainable views
 What views are self-maintainable
 Store auxiliary views so original + auxiliary
views are self-maintainable
CS 336
19
Self-Maintainability: Examples
Sold(item,clerk,age) =
Sale(item,clerk)
Emp(clerk,age)
 Inserts into Emp
If Emp.clerk is key and Sale.clerk is foreign
key (with ref. int.) then no effect
 Inserts into Sale
Maintain auxiliary view: Emp-clerk,age(Sold)
 Deletes from Emp
Delete from Sold based on clerk
CS 336
20
Self-Maintainability: Examples
 Deletes from Sale
Delete from Sold based on {item,clerk}
Unless age at time of sale is relevant
 Auxiliary views for self-maintainability
 Must themselves be self-maintainable
 One solution: all source data
 But want minimal set
CS 336
21
Partial Self-Maintainability
 Avoid (but don’t prohibit) going to sources
Sold=Sale(item,clerk)
Emp(clerk,age)
 Inserts into Sale
 Check if clerk already in Sold, go to source if
not
 Or replicate all clerks over age 30
 Or ...
CS 336
22
Warehouse Specification (ideally)
View Definitions
Warehouse
Configuration
Module
Integration
rules
Warehouse
Change
Detection
Requirements
Integrator
Extractor/
Monitor
Extractor/
Monitor
Metadata
Extractor/
Monitor
...
CS 336
23
Processing




ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 336
Source
Source
24
ROLAP Server
 Relational OLAP Server
sale
prodId
p1
p2
p1
date
1
1
2
sum
62
19
48
tools
utilities
ROLAP
server
Special indices, tuning;
Schema is “denormalized”
relational
DBMS
CS 336
25
MOLAP Server
 Multi-Dimensional OLAP Server
Sales
M.D. tools
Product
B
A
milk
soda
eggs
soap
1
utilities
CS 336
multidimensional
server
2 3 4
Date
could also
sit on
relational
DBMS
26
Index Structures (sketch)
 Traditional Access Methods
 B-trees, hash tables, R-trees, grids, …
 Popular in Warehouses
 inverted lists
 bit map indexes
 join indexes
 text indexes
CS 336
27
What to Materialize?
 Store in warehouse results useful for
common queries
 Example:
total sales
day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
p1
p2
materialize
CS 336
c1
56
11
c2
4
8
c3
50
...
p1
c1
67
c2
12
c3
50
129
p1
p2
c1
110
19
28
Materialization Factors




CS 336
Type/frequency of queries
Query response time
Storage cost
Update cost
29
Cube Aggregates Lattice
129
all
c1
67
p1
c2
12
c3
50
city
city, product
p1
p2
c1
56
11
c2
4
8
city, date
date
product, date
c3
50
day 2
day 1
CS 336
product
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
city, product, date
use greedy
algorithm to
decide what
to materialize
30
Dimension Hierarchies
all
cities
state
city
c1
c2
state
CA
NY
city
CS 336
31
Dimension Hierarchies
all
city
city, product
product
city, date
city, product, date
date
product, date
state
state, date
state, product
state, product, date
not all arcs shown...
CS 336
32
Interesting Hierarchy
time
all
years
weeks
quarters
months
day
1
2
3
4
5
6
7
8
week
1
1
1
1
1
1
1
2
month
1
1
1
1
1
1
1
1
quarter
1
1
1
1
1
1
1
1
year
2000
2000
2000
2000
2000
2000
2000
2000
conceptual
dimension table
days
CS 336
33
Managing
 Metadata
 Warehouse Design
 Tools
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
CS 336
Source
Source
34
Metadata
 Administrative
 definition of sources, tools, ...
 schemas, dimension hierarchies, …
 rules for extraction, cleaning, …
 refresh, purging policies
 user profiles, access control, ...
CS 336
35
Metadata
 Business
 business terms & definition
 data ownership, charging
 Operational
 data lineage
 data currency (e.g., active, archived, purged)
 use stats, error reports, audit trails
CS 336
36
Design Summary







CS 336
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?
37
Tools
 Development
 design & edit: schemas, views, scripts, rules, queries, reports
 Planning & Analysis
 what-if scenarios (schema changes, refresh rates), capacity planning
 Warehouse Management
 performance monitoring, usage patterns, exception reporting
 System & Network Management
 measure traffic (sources, warehouse, clients)
 Workflow Management
 “reliable scripts” for cleaning & analyzing data
CS 336
38
Current State of Industry
 Extraction and integration done off-line
 Usually in large, time-consuming, batches
 Everything copied at warehouse
 Not selective about what is stored
 Query benefit vs storage & update cost
 Query optimization aimed at OLTP
 High throughput instead of fast response
 Process whole query before displaying anything
CS 336
39
State of Commercial Practice ...
 Connectivity to sources







CS 336
Apertus
Information Builders
Informix Enterprise Gateway
Oracle Open Connect
CA-Ingres gateway
MS ODBC
Platinum InfoHub
 Data extract, clean,
transform, refresh
 CA-Ingres Replicator
 ETI-Extract
 IBM Data Joiner, Data
Propagator
 Prism Warehouse manager
 SAS Access
 Sybase Replication Server
 Trinzic InfoPump
40
… State of Commercial Practice ...
 Multidimensional
Database Engines




Arbor Essbase
Oracle RIR Express
Comshare Commader
SAS System
 ROLAP Servers




HP Intelligent Warehouse
Informix Metacube
MicroStrategy DSS Server
Information Advantage Asxys
 Warehouse Data Servers






CS 336
CA-Ingres
Oracle 8
RedBrick
Sybase IQ
Informix Dynamic Server
IBM DB2
41
… State of Commercial Practice
 Query/Reporting
Environments
 IBM DataGuide
 SAS Access CA Visual Express
Platinum Forest&Trees
 Informix ViewPoint
 Multidimensional Analysis








Kenan Systems Acumate
Microsoft Excel
Arbor Essbase Analysis server
Cognos PowerPlay
IQ Software IQ/Vision
Lotus 123
SAS OLAP++
Business Objects
 Lots and lots of consulting!!
CS 336
42
Future Directions




CS 336
Better performance
Larger warehouses
Easier to use
What are companies & research labs
working on?
43
Research (1)






CS 336
Incremental Maintenance
Data Consistency
Data Expiration
Recovery
Data Quality
Error Handling (Back Flush)
44
Research (2)







CS 336
Rapid Monitor Construction
Temporal Warehouses
Materialization & Index Selection
Data Fusion
Data Mining
Integration of Text & Relational Data
Conceptual Modelling
45
Conclusions
 Massive amounts of data and
complexity of queries will push limits
of current warehouses
 Need better systems:
 easier to use
 provide quality information
CS 336
46