Transcript Slide 1

Designing the data warehouse
/ data mart
Methodologies and Techniques
Basic principles
Life cycle of the DW
First time load
Operational Databases
Warehouse Database
Refresh
Refresh
Purge or Archive
Refresh
Data transfers into a database
• First time system implementation
– From a manual system
•
•
•
•
Data warehousing projects
Database version upgrade
ERP projects
Migration
– From old to new system
Data transfers between systems
• Dynamic data (eg. sales orders)
– Interface required?
• Static data (eg. customers)
– Conversion required?
What can go wrong
• Data not available
– feature activated from implementation onwards
– Massive data entry
– Eg: different account structure
•
•
•
•
•
Data incomplete
Data inconsistent (eg: engineering vs accounts)
Wrong level of granularity
Data not clean
New system requires changes – new product
codes
Data cleaning must address
• Different department record same info
under different codes
• Multiple records of same company (under
different names)
• Fields missing in input tables (eg: c/o)
• Different depts. Record different
addresses for same customer
• Use of different units for time periods
Labour intensive tasks
•
•
•
•
Data entry
Data checks
Working on solving conflicts
Allocating new codes
• Solution = introduce as much automation as possible
– SQL / SQL loader (Oracle)
– Custom conversion programmes to extract, modify and upload
data
– Filtering
– Parsing (eg: excel)
– Staging areas for conversion in progress
Data utilities
• ORACLE is king of data handling
• Export: to transfer data between DBs
– Extract both table structure and data content
into dump file
• Import: corresponding facility
• SQL*loader automatic import from a
variety of file formats into DB files
– Needs a control file
Control files: using SQLloader
• Data tranfers in and out of DB can be automated
using the loader
– Create a data file with the data(!)
– Create a control file to guide the operation
• Load creates two files
– Log file
– “bad transactions” file
• Also a discard file if control file has selection
criteria in it
Example 1 – the supplier file
New supplier code to include city where firm is based
Assignation of category based on amounts purchased
OLD
Sup code
4 digits
Sup name
Sup address
City
Phone
Example 1 – the supplier file
New supplier code to include city where firm is based
Assignation of category based on amounts purchased
OLD
Sup code
4 digits
Sup name
Sup address
City
Phone
NEW
Sup code
3 letters +
4 digits
Sup name
Sup address… Phone Cat
1,2,3 depending
on total purchases
last year
Example 2 – New Cost
Accounting Structure
Maintenance department expenditure:
1 account => separate accounts for different production activities
OLD
Intervention code
Desc. Date
Labour Parts Total
Example 2 – New Cost
Accounting Structure
Maintenance department expenditure:
1 account => separate accounts for different production activities
OLD
Intervention code
Desc. Date
Labour Parts Total
NEW
Intervention code
Desc. Date
labour Parts Total Account
Example 3: merging files
• Complete customer file based on Accounts
and Sales and Shipping
OLD (finance)
CustID name
address city
account number credit limit
balance
discount rates
rep_name
OLD (sales)
CustID* name
address city
sales_to_date
OLD (Shipping)
CustID**
name
address city
Preferred haulier
Example 4: change of business
practices
• Payment by bank draft for international customers
• Automatic payment into account for national customers
• Payment direct into account for all customers
Data Staging Area
•
•
•
•
The construction site for the warehouse
Required by most scenarios
Connected to wide variety of sources
Clean / aggregate / compute / validate
data
Operational
system
Extract
Data
staging
area
Transform
Transport
(Load)
Warehouse
Remote Staging Model
Data staging area within the warehouse environment
Warehouse environment
Oper. envt.
Operational
system
Extract,
transform,
transport
Data
staging
area
Transport
Transform (Load)
Warehouse
Data staging area in its own environment, avoiding
negative impact on the warehouse environment
Oper. envt.
Operational
system
Staging envt.
Data
staging
area
Warehouse envt.
Transport
Extract,
(Load)
transform, Transform
transport
Warehouse
Onsite Staging Model
Data staging area within the operational
environment,
possibly affecting the operational system
WH envt.
Operational environment
Operational
system
Extract
Data
staging
area
Transform
Transport
(Load)
Warehouse
Data Mart
• A subset of a data warehouse that
supports the requirements of a
particular department or business
function.
• Characteristics include:
– Do not normally contain detailed operational
data unlike data warehouses.
– May contain certain levels of aggregation
Dependent Data Mart
Flat Files
Operational
Systems
Marketing
Marketing
Sales
Finance
Human Resources
Data
Warehouse
Sales
Finance
Data Marts
External Data
Independent Data Mart
Operational
Systems
Flat Files
Sales or Marketing
External Data
Reasons for Creating a Data
Mart
• To give users more flexible access to
the data they need to analyse most
often.
• To provide data in a form that matches
the specific needs of a group of users
• To improve end-user response time.
• Potential users of a data mart are
clearly defined and can be targeted for
support
Reasons for Creating a Data
Mart
• To provide appropriately structured data as
dictated by the requirements of the end-user
access tools.
• Building a data mart is simpler (and much
quicker) compared with establishing a
corporate data warehouse.
• The cost of implementing data marts is far
less than that required to establish a data
warehouse.
Exploiting the DW data
• DW is a platform for creating a wide array
of reports
• It solves data feed problems, but does not
lead to specific decision support
• Need a model for organising data into
meaningful reports
• Need specific interfaces for users
Exploiting the DW data
Static Reporting
Data Staging
Area
Source Systems
Scrutinising
Data Warehouse
Relational Database
on a dedicated Server
Extraction
Cleaning
Transformation
Loading
Multidimensional
Data Cubes
OLAP tools
De normalised, data
Discovering
Data Mining
…….
Multidimensional Models
Customer
Market
Product
Time
SALES
Product
Time
FINANCE
P/L_Line
The data is found at the intersection of
dimensions.
Representing multidimensional data
MOLAP Server
• The application layer
stores data in a
multidimensional structure
DSS client
• The presentation layer
provides the
MOLAP
multidimensional view
Engine
• Efficient storage and processing Application
layer
• Complexity hidden from the
user (but NOT from developer)
• Analysis using preaggregated
summaries and precalculated Warehouse
measures
ROLAP Server
• The warehouse stores
DSS client
atomic data.
• The application layer
ROLAP
generates SQL for the
engine
three- dimensional view.
Application
• The presentation layer Multiple layer
SQL
provides the
multidimensional view.
Warehouse
server
MOLAP
MDDB
Query
Periodic
load
Warehouse
Data
Server
user
ROLAP
Cache
Live
fetch
Query
Data
cache
Warehouse
Data
Server
user
Also Hybrid (HOLAP)
Choosing a Reporting
Architecture
•
•
•
•
•
•
•
Business needs
Good
Potential for growth
Query
interface
Performance
enterprise architecture
OK
Network architecture
Speed of access
Openness
MOLAP
ROLAP
Simple
Complex
Analysis
Modeling
• Warehouses differ from operational
structures:
– Analytical requirements
– Subject orientation
• Data must map to subject oriented
information:
– Identify business subjects
– Define relationships between subjects
– Name the attributes of each subject
• Modeling is iterative
• Modeling tools are available
Modeling the Data Warehouse
1
1. Defining the business
model
2. Creating the dimensional
model
2, 3
3. Modeling summaries
4. Creating the physical model
4
Physical model
Select a
business
process
Identifying Business Rules
Location
Geographic proximity
0 - 1 miles
1 - 5 miles
> 5 miles
Time
Month > Quarter > Year
Product
Type
Monitor
Status
PC
Server
15 inch
17 inch
19 inch
None
New
Rebuilt
Custom
Store
Store > District > Region
Creating the Dimensional Model
Identify fact tables
– Translate business measures into fact
tables
– Analyze source system information for
additional measures
– Identify base and derived measures
– Document additivity of measures
Identify dimension tables
Link fact tables to the dimension
tables
Create views for users
Dimension Tables
Dimension tables have the following
characteristics:
• Contain textual information that
represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a
foreign key reference
Product
Channel
Facts
(units,
price)
Customer
Time
Fact Tables
Fact tables have the following characteristics:
• Contain numeric measures (metrics) of the
business
• May contain summarized (aggregated) data
• May contain date-stamped data
• Are typically additive
• Have key value that is typically a concatenated
key composed of the primary keys of the
dimensions
• Joined to dimension tables through foreign
keys that reference primary keys in the
dimension tables
Dimensional Model (Star
Schema)
Fact table
Product
Channel
Facts
(units,
price)
Customer
Time
Dimension tables
Star Schema Model
Product Table
Product_id
Product_desc
…
• Central fact table
• Radiating dimensions
• Denormalized model
Time Table
Day_id
Month_id
Period_id
Year_id
Store Table
Store_id
District_id
...
Sales Fact Table
Product_id
Store_id
Item_id
Day_id
Sales_dollars
Sales_units
...
Item Table
Item_id
Item_desc
...
Star Schema Model
•
•
•
•
•
•
Easy for users to understand
Fast response to simple queries
Simple metadata
Supported by many front end tools
Less robust to change
Does not support history
Using Summary Data
Phase 3: Modeling summaries
• Provides fast access to precomputed
data
• Reduces use of I/O, CPU, and memory
• Is distilled from source systems and
precalculated summaries
• Usually exists in summary fact tables
Designing Summary Tables
• Average
• Maximum
• Total
• Percentage
Units
Product A
Total
Product B
Total
Product C
Total
Sales(€)
Store
Summary Tables Example
SALES FACTS
Sales Region Month
10,000 North Jan 99
12,000 South Feb 99
11,000 North Jan 99
15,000 West Mar 99
18,000 South Feb 99
20,000 North Jan 99
10,000 East Jan 99
2,000 West Mar 99
SALES BY MONTH/REGION
Month Region Tot_Sales$
Jan 99 North 41,000
Jan 99 East 10,000
Feb 99 South 40,000
Mar 99 West 17,000
SALES BY MONTH
Month Tot_Sales
Jan 99 51,000
Feb 99 40,000
Mar 99 17,000