The Kimball Lifecycle - Doctor


ETL
By Dr. Gabriel
ETL Process
• 4 major components:
– Extracting
• Gathering raw data from source systems and storing it in the ETL staging environment
– Cleaning and conforming
• Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
– Delivering
• Loading data into the data warehouse tables
– Managing
• Management of the ETL environment
ETL: Extracting
• Data profiling
• Identifying data that has changed since the last load
• Extraction
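Identifying changed data is often done by comparing a last-modified timestamp on each source row against the time of the previous load. A minimal sketch of that idea in Python, assuming a hypothetical `updated_at` column on the source rows (the sample data is illustrative, not from the lecture):

```python
from datetime import datetime

# Hypothetical source rows; in practice these would come from a source
# table that carries a last-modified timestamp column.
source_rows = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "name": "Bob",   "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "name": "Carol", "updated_at": datetime(2024, 3, 1)},
]

def extract_changed(rows, last_load_time):
    """Return only the rows created or modified since the previous load."""
    return [r for r in rows if r["updated_at"] > last_load_time]

# Only rows touched after the previous load on 2024-02-01 are extracted.
changed = extract_changed(source_rows, datetime(2024, 2, 1))
```

Real change data capture may instead rely on database logs, triggers, or full-row comparisons when no reliable timestamp exists.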
ETL: Cleaning and Conforming
• Data cleansing
• Recording error events
• Audit dimensions
• Deduping
• Creating and maintaining conformed dimensions and facts
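Deduping in its simplest form keeps the first occurrence of each natural key and drops later duplicates. A minimal sketch, assuming a hypothetical customer feed keyed by email address:

```python
def dedupe(rows, key_fields):
    """Keep the first occurrence of each natural key; drop later duplicates."""
    seen = set()
    out = []
    for r in rows:
        k = tuple(r[f] for f in key_fields)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

# Illustrative input: the third row repeats the first row's natural key.
customers = [
    {"email": "a@x.com", "name": "Alice"},
    {"email": "b@x.com", "name": "Bob"},
    {"email": "a@x.com", "name": "Alice A."},
]
clean = dedupe(customers, ["email"])
```

Production deduping usually also needs survivorship rules (which of the duplicate rows' attribute values wins), fuzzy matching, and an error-event record for each dropped row.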
ETL: Delivering
• Implementation of SCD logic
• Surrogate key generation
• Managing hierarchies in dimensions
• Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
– Mini dimensions
• Used to track changes of dimension attributes when the type 2 technique is infeasible
• Similar to junk dimensions; typically used for large dimensions
• Combinations can be built in advance or on the fly
• Built from dimension table input
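Building a mini-dimension's combinations in advance amounts to enumerating the cross product of the banded, rapidly changing attributes and assigning each combination a surrogate key. A minimal sketch, with hypothetical demographic bands chosen only for illustration:

```python
from itertools import product

# Hypothetical rapidly changing attributes, banded to keep the
# mini-dimension small (4 x 3 = 12 rows in total).
age_bands = ["18-24", "25-34", "35-49", "50+"]
income_bands = ["low", "medium", "high"]

# Build every combination in advance, one row per combination,
# with a sequentially assigned surrogate key.
mini_dimension = [
    {"demo_key": i + 1, "age_band": a, "income_band": inc}
    for i, (a, inc) in enumerate(product(age_bands, income_bands))
]
```

When an attribute change occurs, the fact row simply points at a different pre-existing mini-dimension key instead of creating a new type 2 row in the large dimension.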
ETL: Delivering (Cont)
– Small static dimensions
• Dimensions created by the ETL system without a real source
• Lookup dimensions for translations of codes, etc.
– User-maintained dimensions
• Master dimensions without a real source system
• Descriptions, groupings, and hierarchies created for reporting and analysis purposes
ETL: Delivering (Cont)
• Fact table loading
• Building and maintaining bridge dimension tables
• Handling late arriving data
• Management of conformed dimensions
• Administration of fact tables
• Building aggregations
• Building OLAP cubes
• Transferring DW data to other environments for specific purposes
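Fact table loading centers on the surrogate key pipeline: each incoming fact row's natural key is swapped for the dimension's surrogate key, and unmatched (e.g. late arriving) rows get a placeholder key until the dimension member appears. A minimal sketch with a hypothetical customer dimension lookup:

```python
# Hypothetical dimension lookup built from the customer dimension:
# natural key -> surrogate key.
customer_dim = {"CUST-001": 101, "CUST-002": 102}

def load_fact_row(source_row, dim_lookup, unknown_key=-1):
    """Replace the natural key with the dimension surrogate key.
    Rows with no dimension match get a placeholder key for later repair."""
    surrogate = dim_lookup.get(source_row["customer_id"], unknown_key)
    return {"customer_key": surrogate, "amount": source_row["amount"]}

facts = [load_fact_row(r, customer_dim)
         for r in [{"customer_id": "CUST-002", "amount": 40.0},
                   {"customer_id": "CUST-999", "amount": 15.0}]]
```

The placeholder technique lets the fact load complete; a later pass re-keys the placeholder rows once the late-arriving dimension member has been loaded.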
ETL: Managing
• Management of ETL environment
– Goals
• Reliability
• Availability
• Manageability
– Job scheduler
– Backup system
– Recovery and restart system
– Version control system
ETL: Managing (Cont.)
• Version migration system
• Workflow monitor
• Sorting system
• Analyzing dependencies and lineage
• Problem escalation system
• Parallelization
• Security system
• Compliance manager
• Metadata repository manager
ETL Process
• Planning
– High-level source-to-target data flow diagram
– Selection and implementation of an ETL tool
– Development of default strategies for dimension management, error handling, and other processes
– Development of data transformation diagrams per target table
– Development of job sequencing
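Job sequencing falls out of the dependency structure: dimension loads must finish before the fact loads that reference them, and aggregates come last. A minimal sketch using Python's standard topological sort over hypothetical job names:

```python
from graphlib import TopologicalSorter

# Hypothetical ETL job graph: each job maps to the set of jobs
# it depends on.
deps = {
    "load_customer_dim": set(),
    "load_product_dim": set(),
    "load_sales_fact": {"load_customer_dim", "load_product_dim"},
    "build_aggregates": {"load_sales_fact"},
}

# static_order() yields every predecessor before its dependents,
# so dimensions run before the fact, and the fact before aggregates.
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler would additionally run independent jobs (the two dimension loads here) in parallel and handle restart after failure.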
ETL Process
• Developing one-time historic load
– Build and test the historic dimension and fact
tables load
• Developing incremental load process
– Build and test dimension and fact tables
incremental load processes
– Build and test aggregate table loads and/or
OLAP processing
– Design, build, and test the ETL system
automation
ETL Tools: Build vs Buy
• Many off-the-shelf tools exist
• Benefits are not seen right away
– Setup
– Learning curve
• High-end tools may not justify their cost for smaller warehouses
Off-the-shelf ETL Tools

Tool                                | Vendor
Oracle Warehouse Builder (OWB)      | Oracle
Data Integrator (BODI)              | Business Objects
IBM Information Server (Ascential)  | IBM
SAS Data Integration Studio         | SAS Institute
PowerCenter                         | Informatica
Oracle Data Integrator (Sunopsis)   | Oracle
Data Migrator                       | Information Builders
Integration Services                | Microsoft
Talend Open Studio                  | Talend
DataFlow                            | Group 1 Software (Sagent)
Data Integrator                     | Pervasive
Transformation Server               | DataMirror
Transformation Manager              | ETL Solutions Ltd.
Data Manager                        | Cognos
DT/Studio                           | Embarcadero Technologies
ETL4ALL                             | IKAN
DB2 Warehouse Edition               | IBM
Jitterbit                           | Jitterbit
Pentaho Data Integration            | Pentaho
ETL Specification Document
• Can be as large as 100 pages per business process; in reality, the work starts after the high-level design is documented in a few pages
• Source-to-target mappings
• Data profiling reports
• Physical design decisions
• Default strategy for extracting from each major
source system
• Archival strategy
• Data quality tracking and metadata
• Default strategy for managing changes to
dimension attributes
ETL Specification Document (Cont)
• System availability requirements and strategy
• Design of data auditing mechanism
• Location of staging areas
• Historic and incremental load strategies for each table
– Detailed table design
– Historic data load parameters (# of months) and
volumes (# of rows)
– Incremental data volumes
ETL Specification Document (Cont)
– Handling of late arriving data
– Load frequency
– Handling of changes in each dimension attribute
(types 1,2,3)
– Table partitioning
– Overview of data sources; discussion of source-specific characteristics
– Extract strategy for the source data
– Change data capture logic for each source table
– Dependencies
– Transformation logic (diagram or pseudo code)
ETL Specification Document (Cont)
– Preconditions to avoid error conditions
– Recovery and restart assumptions for each
major step of the ETL pipeline
– Archiving assumptions for each table
– Cleanup steps
– Estimated effort
• Overall workflow
• Job sequencing
• Logical dependencies
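The specification's "handling of changes in each dimension attribute (types 1, 2, 3)" is where SCD logic gets pinned down per attribute. A minimal type 2 sketch in Python, assuming hypothetical `valid_from` / `valid_to` / `is_current` housekeeping columns on the dimension:

```python
from datetime import date

def apply_scd2(dim_rows, incoming, natural_key, tracked_attrs, today):
    """Type 2 handling: when a tracked attribute changes, expire the
    current row and append a new current version of the member."""
    current = next((r for r in dim_rows
                    if r[natural_key] == incoming[natural_key]
                    and r["is_current"]), None)
    changed = current is not None and any(
        current[a] != incoming[a] for a in tracked_attrs)
    if current is None or changed:
        if changed:
            current["valid_to"] = today      # expire the old version
            current["is_current"] = False
        dim_rows.append({**incoming, "valid_from": today,
                         "valid_to": None, "is_current": True})
    return dim_rows

# Illustrative dimension with one current row; the customer then moves city.
dim = [{"customer_id": "C1", "city": "NY", "valid_from": date(2023, 1, 1),
        "valid_to": None, "is_current": True}]
apply_scd2(dim, {"customer_id": "C1", "city": "LA"},
           "customer_id", ["city"], date(2024, 6, 1))
```

A type 1 attribute would instead be overwritten in place on the current row, and a type 3 attribute would be copied into a "previous value" column.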
Loading Pointers
• One-time historic load
– Disable RI constraints (FKs) and re-enable them after the load is complete
– Drop indexes and re-create them after the load is complete
– Use bulk loading techniques
– These techniques are not always applicable
Loading Pointers (Cont)
• Incremental load
– Sometimes the historic and incremental load logic is identical; often it is only similar
– Updating aggregations, if necessary
– Error handling
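Updating aggregations incrementally means folding only the new fact rows into the existing aggregate rather than rebuilding it from scratch. A minimal sketch with a hypothetical daily sales aggregate:

```python
from collections import defaultdict

def apply_increment(aggregate, new_facts):
    """Fold a batch of new fact rows into an existing daily sales aggregate."""
    for f in new_facts:
        aggregate[f["date"]] += f["amount"]
    return aggregate

# Existing aggregate from prior loads, then one incremental batch.
agg = defaultdict(float, {"2024-03-01": 100.0})
apply_increment(agg, [{"date": "2024-03-01", "amount": 25.0},
                      {"date": "2024-03-02", "amount": 10.0}])
```

This works for additive measures like amounts and counts; non-additive measures (e.g. distinct counts) generally force a partial or full rebuild of the affected aggregate rows.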
Sample: Generation of Surrogate Keys on SQL Server

As simple as:

DECLARE @i INTEGER
SELECT @i = MAX(ID) + 1
FROM TableName

But this may not work with concurrent processes: two loaders can read the same MAX(ID) before either inserts.

OR

CREATE PROCEDURE pGetNextID
(@SeedName VARCHAR(32),
 @SeedValue BIGINT OUTPUT)
AS
-- A single atomic UPDATE: T-SQL's compound assignment increments the
-- seed and captures the new value in one statement, so concurrent
-- callers cannot receive the same key
UPDATE Lookup_Seed
SET @SeedValue = SeedValue = SeedValue + 1
WHERE SeedID = @SeedName

Lookup_Seed table:
SeedID    varchar(32)
SeedValue bigint
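The same named-seed idea can be sketched outside the database. The Python below is a hypothetical in-memory analogue of the Lookup_Seed table, using a lock where the stored procedure relies on the atomicity of a single UPDATE:

```python
import threading

class KeySeed:
    """In-memory analogue of the Lookup_Seed table: one named counter
    per dimension, incremented under a lock so concurrent loaders
    never receive the same key."""
    def __init__(self):
        self._lock = threading.Lock()
        self._seeds = {}

    def next_id(self, seed_name):
        with self._lock:
            value = self._seeds.get(seed_name, 0) + 1
            self._seeds[seed_name] = value
            return value

seeds = KeySeed()
```

On SQL Server itself, IDENTITY columns (and, from SQL Server 2012, SEQUENCE objects) provide the same guarantee without a hand-rolled seed table.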
Questions?