Data Warehousing Fundamentals
Course Objectives
 After completing this course, you should be able
to do the following:
 Describe the role of business intelligence (BI) and
data warehousing in today’s marketplace
 Describe data warehousing terminology and the
various technologies that are required to
implement a data warehouse
 Explain the implementation and organizational
issues surrounding a data warehouse project
 Identify data warehouse modeling concepts
 Explain the extraction, transformation, and loading
processes for building a data warehouse
Course Objectives
 Identify management and maintenance
processes that are associated with a data
warehouse project
 Describe methods for refreshing warehouse data
 Explain warehouse metadata concepts
 Identify tools that can be employed at each stage
of the data warehouse project
 Describe user profiles and techniques for
querying the warehouse
 Identify methods and tools for accessing and
analyzing warehouse data
Lessons
1. Business Intelligence and Data Warehousing
2. Defining Data Warehouse Concepts and Terminology
3. Planning and Managing the Data Warehouse Project
4. Modeling the Data Warehouse
5. Building the Data Warehouse: Extracting Data
6. Building the Data Warehouse: Transforming Data
7. Building the Data Warehouse: Loading Warehouse
Data
8. Refreshing Warehouse Data
9. Leaving a Metadata Trail
10. Managing and Maintaining the Data Warehouse
Let’s Get Started
Lesson 1
Lesson 1 Objectives
 After completing this lesson, you should be
able to do the following:
 Describe the role of business intelligence in
today’s marketplace
 Describe why an online transaction processing
system (OLTP) is not suitable for analytical
reporting
 Describe how extract processing for decision
support querying led to data warehouse
solutions that are employed today
 Explain why businesses are driven to employ
data warehouse technology
What Is Business Intelligence?
“Business Intelligence is the process of transforming
data into information and through discovery transforming
that information into knowledge.”
Gartner Group
Purpose of Business Intelligence
The purpose of business intelligence is to convert
the volume of data into business value through
analytical reporting.
[Diagram: the value pyramid, in which data becomes information, information becomes knowledge, and knowledge drives decisions, converting volume into value]
Early Management
Information Systems
 MIS systems provided business data.
 Reports were developed on request.
 Reports provided little analysis capability.
 Decision support tools gave personal ad hoc
access to data.
[Diagram: production platforms feed operational reports and ad hoc access for decision makers]
Analyzing Data from
Operational Systems
 Data structures are complex.
 Systems are designed for high performance and
throughput.
 Data is not meaningfully represented.
 Data is dispersed.
 OLTP systems may be unsuitable for intensive
queries.
[Diagram: production platforms produce operational reports]
Why OLTP Is Not Suitable
for Analytical Reporting
 OLTP provides information to support day-to-day service; analytical reporting needs historical information to analyze.
 OLTP stores data at the transaction level; analytical reporting needs integrated data.
 OLTP database design is normalized; analytical reporting uses a denormalized, star schema design.
Data Extract Processing
 End user computing offloaded from the operational
environment
 User’s own data
[Diagram: operational systems feed extracts used by decision makers]
Management Issues with
Data Extract Programs
[Diagram: extract explosion; layers of extracts multiply between the operational systems and the decision makers]
Productivity Issues with
Extract Processing
 Duplicated effort
 Multiple technologies
 Obsolete reports
 No metadata
Data Quality Issues with
Extract Processing
 No common time basis
 Different calculation algorithms
 Different levels of extraction
 Different levels of granularity
 Different data field names
 Different data field meanings
 Missing information
 No data correction rules
 No drill-down capability
Data Warehousing and
Business Intelligence
[Diagram: legacy data, operations data, and external data flow into the enterprise data warehouse, which feeds data marts for analytical reporting]
Advantages of Warehouse
Processing Environments
 Controlled
 Reliable
 Quality information
 Single source of data
[Diagram: internal and external systems feed a single data warehouse that serves decision makers]
Advantages of Warehouse
Processing Environments
 No duplication of effort
 No need for tools to support many technologies
 No disparity in data, meaning, or representation
 No time period conflict
 No algorithm confusion
 No drill-down restrictions
Success Factors for a Dynamic
Business Environment
 Know the business
 Reinvent to face new challenges
 Invest in products
 Invest in customers
 Retain customers
 Invest in technology
 Improve access to business information
 Provide superior services and products
 Be profitable
Business Drivers for
Data Warehouses
 Provide supporting information systems
 Get quality information:
• Reduce costs
• Streamline the business
• Improve margins
Technological Advances
Enabling Data Warehousing
 Hardware
 Operating system
 Database
 Query tools
 Applications
The enabling advances include large databases, 64-bit architectures, indexing techniques, affordable and cost-effective open systems, robust warehouse tools, and sophisticated end user tools.
The Difference Between the Two Types of Data
Summary
 In this lesson, you should have learned how to:
 Describe the role of business intelligence in
today’s marketplace
 Describe why an online transaction
processing system (OLTP) is not suitable for
analytical reporting
 Describe how extract processing for decision
support querying led to data warehouse
solutions employed today
 Explain why businesses are driven to employ
data warehouse technology
Practice 1-1 Overview
 This practice covers the following topics:
 Answering questions about data warehousing
 Discussing how data warehousing meets business
needs
Lesson 2
Defining Data Warehouse
Concepts and Terminology
Objectives
 After completing this lesson, you should be able to
do the following:
 Identify a common, broadly accepted definition of a
data warehouse
 Describe the differences between dependent and
independent data marts
 Identify some of the main warehouse development
approaches
 Recognize some of the operational properties and
common terminology of a data warehouse
Definition of a Data Warehouse
 “A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.”
— W.H. Inmon, Building the Data Warehouse (1991)
Definition of a Data Warehouse
 “…a data warehouse is nothing more than the collection of all its data marts…”
— Ralph Kimball
 “A data warehouse is the physical implementation of an informational database, used to store shared data that originates in the operational databases. A typical data warehouse is a subject database that helps users discover information in huge stores of operational data, track and respond to business trends, and support business forecasting and planning.”
— DM Review
Definition of a Data Warehouse
“An enterprise structured repository of
subject-oriented, time-variant, historical data
used for information retrieval and decision
support. The data warehouse stores atomic
and summary data.”
— Oracle’s Data Warehouse Definition
“A data warehouse is a process, not a project.”
— another way of describing a data warehouse
Data Warehouse Properties
[Diagram: the four properties of a data warehouse: subject-oriented, integrated, nonvolatile, and time-variant]
Subject-Oriented
 Data is categorized and stored by business subject
rather than by application.
[Diagram: separate OLTP applications (equity plans, shares, insurance, loans, savings) map to a single warehouse subject: customer financial information]
Integrated
 Data on a given subject is defined and stored once.
[Diagram: the savings, current accounts, and loans OLTP applications are integrated into a single customer definition in the data warehouse]
Time-Variant
 Data is stored as a series of snapshots, each
representing a period of time.
Nonvolatile
 Typically data in the data warehouse is not updated
or deleted.
[Diagram: operational systems insert, update, delete, or read; the warehouse is loaded and then only read]
Changing Warehouse Data
[Diagram: the operational databases populate the warehouse database with a first-time load followed by periodic refreshes; aged data is purged or archived]
Data Warehouse Versus OLTP
Property            OLTP                     Data Warehouse
Response time       Subseconds to seconds    Seconds to hours
Operations          DML                      Primarily read-only
Nature of data      30-60 days               Snapshots over time
Data organization   Application              Subject, time
Size                Small to large           Large to very large
Data sources        Operational, internal    Operational, internal, external
Activities          Processes                Analysis
Usage Curves
 Operational system usage is predictable
 Data warehouse:
• Variable
• Random
Enterprisewide Warehouse
 Large scale implementation
 Scopes the entire business
 Data from all subject areas
 Developed incrementally
 Single source of enterprisewide data
 Synchronized enterprisewide data
 Single distribution point to dependent data
marts
Central Ideas of Data Warehouse Design
 Use a grain, or level of detail, suitable for all the data marts
 The design must not block the use of different technologies in the data marts; it must accommodate multidimensional marts, statistics, mining, and exploration warehouses
Data Marts
 A subset of the data warehouse data.
Most of the analytical activity in the BI environment takes place in the data marts. The data in each data mart is usually tailored to a particular function and need not be valid for other uses.
Data Warehouses Versus
Data Marts
Property             Data Warehouse     Data Mart
Scope                Enterprise         Department
Subjects             Multiple           Single subject, LOB
Data sources         Many               Few
Implementation time  Months to years    Months
Dependent Data Mart
[Diagram: operational systems, legacy data, flat files, operations data, and external data load the data warehouse, which in turn feeds dependent marketing, sales, finance, and HR data marts; external data may also feed the marts directly]
Independent Data Mart
[Diagram: operational systems, legacy data, flat files, operations data, and external data load a sales or marketing data mart directly, with no enterprise warehouse in between]
Features of a Data Mart
 Not Real-Time Data
 Consolidation and Cleansing
Warehouse Development Approaches
 “Big bang” approach
 Incremental approach:
• Top-down incremental approach
• Bottom-up incremental approach
“Big Bang” Approach
1. Analyze enterprise requirements.
2. Build the enterprise data warehouse.
3. Report in subsets or store in data marts.
Top-Down Approach
1. Analyze requirements at the enterprise level.
2. Develop a conceptual information model.
3. Identify and prioritize subject areas.
4. Complete a model of the selected subject area.
5. Map to available data.
6. Perform a source system analysis.
7. Implement the base technical architecture.
8. Establish metadata, extraction, and load processes for the initial subject area.
9. Create and populate the initial subject area data mart within the overall warehouse framework.
Bottom-Up Approach
1. Define the scope and coverage of the data warehouse and analyze the source systems within this scope.
2. Define the initial increment based on political pressure, assumed business benefit, and data volume.
3. Implement the base technical architecture and establish metadata, extraction, and load processes as required by the increment.
4. Create and populate the initial subject areas within the overall warehouse framework.
Incremental Approach
to Warehouse Development
 Multiple iterations
 Shorter implementations
 Validation of each phase
[Diagram: each increment cycles through strategy, definition, analysis, design, iterative build, and production]
Data Warehousing
Process Components
 Methodology
 Architecture
 Extraction, Transformation, and Load (ETL)
 Implementation
 Operation and Support
Methodology
 Ensures a successful data warehouse
 Encourages incremental development
 Provides a staged approach to an
enterprisewide warehouse:
• Safe
• Manageable
• Proven
• Recommended
Architecture
 “Provides the planning, structure, and
standardization needed to ensure integration of
multiple components, projects, and processes
across time.”
 “Establishes the framework, standards, and
procedures for the data warehouse at an
enterprise level.”
— The Data Warehousing Institute
Extraction, Transformation,
and Load (ETL)
“Effective data extract, transform and load (ETL)
processes represent the number one success
factor for your data warehouse project and can
absorb up to 70 percent of the time spent on a
typical data warehousing project.”
— DM Review, March 2001
[Diagram: data moves from the source through the staging area to the target]
Implementation
[Diagram: the data warehouse architecture is implemented incrementally: increment 1, increment 2, …, increment n]
Operation and Support
 Data access and reporting
 Refreshing warehouse data
 Monitoring
 Responding to change
Phases of the
Incremental Approach
 Strategy
 Definition
 Analysis
 Design
 Build
 Production
Strategy Phase Deliverables
 Business goals and objectives
 Data warehouse purpose, objectives, and scope
 Enterprise data warehouse logical model
 Incremental milestones
 Source systems data flows
 Subject area gap analysis
Strategy Phase Deliverables
 Data acquisition strategy
 Data quality strategy
 Metadata strategy
 Data access environment
 Training strategy
Summary
 In this lesson, you should have learned how
to:
 Identify a common, broadly accepted definition
of a data warehouse
 Describe the differences between dependent and
independent data marts
 Identify some of the main warehouse
development approaches
 Recognize some of the operational properties
and common terminology of a data warehouse
Practice 2-1 Overview
 This practice covers the following topics:
 Answering questions regarding data warehousing concepts and terminology
 Discussing some of the data warehouse concepts and terminology
Lesson 3
Modeling the Data Warehouse
Objectives
After completing this lesson, you should be
able to do the following:
 Discuss data warehouse environment data
structures
 Discuss data warehouse database design
phases:
• Defining the business model
• Defining the dimensional model
• Defining the physical model
Data Warehouse Modeling Issues
 The main issues that data warehouse data
modelers face include:
 Different data types
 Many ways to use warehouse data
 Many ways to structure the data
 Multiple modeling techniques
 Planned replication
 Large volumes of data
Data Warehouse Environment
Data Structures
 The data modeling structures that are
commonly found in a data warehouse
environment are:
 Third normal form (3NF)
 Star schema
 Snowflake schema
Star Schema Model
[Diagram: a central Sales fact table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, …) joined to denormalized dimensions: Product (Product_id, Product_desc, …), Store (Store_id, District_id, …), Time (Day_id, Month_id, Year_id, …), and Item (Item_id, Item_desc, …)]
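To make the structure concrete, here is a minimal SQL sketch of this star schema. Table and column names follow the diagram; the datatypes and the time_dim name are illustrative assumptions.

CREATE TABLE product (
  product_id   NUMBER PRIMARY KEY,   -- dimension key
  product_desc VARCHAR2(100));

CREATE TABLE time_dim (              -- the diagram's Time table
  day_id   NUMBER PRIMARY KEY,
  month_id NUMBER,
  year_id  NUMBER);

CREATE TABLE sales_fact (            -- central fact table
  product_id   NUMBER REFERENCES product (product_id),
  day_id       NUMBER REFERENCES time_dim (day_id),
  sales_amount NUMBER(12,2),         -- additive measures
  sales_units  NUMBER);

-- The Store and Item dimensions follow the same pattern.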
Snowflake Schema Model
[Diagram: a Sales fact table (Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units) joined to normalized dimension chains: Product (Product_id, Product_desc); Store (Store_id, Store_desc, District_id) > District (District_id, District_desc); Time (Week_id, Period_id, Year_id); Item (Item_id, Item_desc, Dept_id) > Dept (Dept_id, Dept_desc, Mgr_id) > Mgr (Mgr_id, Dept_id, Mgr_name)]
Snowflake Schema Model
 Direct use by some tools
 More flexible to change
 Provides for speedier data loading
 Can become large and unmanageable
 Degrades query performance
 More complex metadata
[Diagram: a Country > State > County > City dimension hierarchy]
Data Warehouse Database
Design Phases
 Phase 1:
Defining the business model
 Phase 2:
Defining the dimensional model
 Phase 3:
Defining the physical model
Phase 1: Defining the Business Model
 Performing strategic analysis
 Creating the business model
 Documenting metadata
Performing Strategic Analysis
 Identify crucial business processes
 Understand business processes
 Prioritize and select the business processes to
implement
[Diagram: candidate processes plotted on a grid of business benefit (low to high) against feasibility (low to high)]
Creating the Business Model
 Defining business requirements:
• Identifying the business measures
• Identifying the dimensions
• Identifying the grain
• Identifying the business definitions and rules
 Verifying data sources
Business Requirements Drive
the Design Process
 Primary input: business requirements
 Secondary input: existing metadata, the production ERD model, and research
Identifying Measures
and Dimensions
Measures (the attribute varies continuously): Balance, Units Sold, Cost, Sales.
Dimensions (the attribute is perceived as constant or discrete): Product, Location, Time, Size.
Using a Business Process Matrix
[Sample business process matrix: business dimensions (Customer, Date, Product, Channel, Promotion) are mapped against business processes (Sales, Returns, Inventory)]
Determining Granularity
YEAR? QUARTER? MONTH? WEEK? DAY?
Identifying Business Rules
 Location: geographic proximity (0-1 miles, 1-5 miles, > 5 miles)
 Time: Month > Quarter > Year
 Product: Type (PC, Server); Monitor (15 inch, 17 inch, 19 inch, None); Status (New, Rebuilt, Custom)
 Store: Store > District > Region
Documenting Metadata
 Documenting metadata should include:
 Documenting the design process
 Documenting the development process
 Providing a record of changes
 Recording enhancements over time
Metadata Documentation Approaches
 Automated
• Data modeling tools
• ETL tools
• End-user tools
 Manual
Phase 2: Defining the Dimensional Model
 Identify fact tables:
• Translate business measures into fact tables
• Analyze source system information for additional
measures
 Identify dimension tables
 Link fact tables to the dimension tables
 Model the time dimension
Star Dimensional Modeling
[Diagram: a Sales fact table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, …) linked to the Product (Product_id, Product_desc, …), Time (Day_id, Month_id, Period_id, Year_id), Store (Store_id, District_id, …), and Item (Item_id, Item_desc, …) dimension tables]
Fact Table Characteristics
 Contain numerical metrics of the business
 Can hold large volumes of data
 Can grow quickly
 Can contain base, derived, and summarized data
 Are typically additive
 Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
[Diagram: the Sales fact table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, …]
Dimension Table Characteristics
 Dimension tables have the following characteristics:
 Contain textual information that represents the
attributes of the business
 Contain relatively static data
 Are joined to a fact table through
a foreign key reference
Star Dimensional
Model Characteristics
 The model is easy for users to understand.
 Primary keys represent a dimension.
 Nonforeign key columns are values.
 Facts are usually highly normalized.
 Dimensions are completely denormalized.
 Fast response to queries is provided.
 Performance is improved by reducing table
joins.
 End users can express complex queries.
 Support is provided by many front-end tools.
Using Time in the Data Warehouse
 Defining standards for time is critical.
 Aggregation based on time is complex.
The Time Dimension
 Time is critical to the data warehouse. A consistent
representation of time is required for extensibility.
[Diagram: a Sales fact table and a Time dimension; where should the element of time be stored?]
Using Data Modeling Tools
 Tools with a GUI enable definition, modeling,
and reporting.
 Avoid a mix of modeling techniques caused by:
• Development pressures
• Developers who lack knowledge
• No strategy
 Determine a strategy.
 Write and publish formally.
 Make available electronically.
Phase 3: Defining the
Physical Model
 Translate the dimensional design to a physical
model for implementation.
 Define storage strategy for tables and indexes.
 Perform database sizing.
 Define initial indexing strategy.
 Define partitioning strategy.
 Update metadata document with physical
information.
Physical Model Design Tasks
 Define naming and database standards.
 Perform database sizing.
 Develop initial indexing strategy.
 Develop data partition strategy.
 Define storage parameters.
 Set initialization parameters.
 Use parallel processing.
 Define summary data.
 Determine hardware architecture.
Database Object Naming Conventions
 Develop a reasonable list of abbreviations.
 List all the objects’ names, and work with the
user community to define them.
 Resolve name disputes.
 Document your naming standards in the
metadata document.
 Plan for the naming standards to be a living
document.
Architectural Requirements
 Scalability
 Manageability
 Availability
 Extensibility
 Integration
 Flexibility
[Diagram: these requirements are balanced against user, budget, business, and technology constraints]
Strategy for Architecture Definition
 Obtain existing architecture plans.
 Obtain existing capacity plans.
 Document existing interfaces.
 Prepare capacity plan.
 Prepare technical architecture.
 Document operating system requirements.
 Develop recovery plans.
 Develop security and control plans.
 Create architecture.
 Create technical risk assessment.
Hardware Requirements
 SMP
 Cluster
 MPP
 Hybrids (employing both SMP and MPP)
Making the Right Choice
 Requirements differ from operational systems.
 Benchmark
• Available from vendors
• Develop your own
• Use realistic queries
 Scalability is important.
Storage and Performance Considerations
 Database sizing
• Test Load Sampling
 Data partitioning
• Horizontal partitioning
• Vertical partitioning
 Indexing
• B-Tree indexes
• Bitmap indexes
• Bitmap-join indexes
 Star query optimization
• Star transformation
Database Sizing
 Sizing influences capacity planning and
systems environment management.
 Sizing is required for:
• The database
• Other storage areas
 Sizing is not an exact science.
 Techniques vary.
Test Load Sampling
 Analyze a representative sample of the data chosen
using proven statistical methods. Ensure that the
sample reflects:
 Test loads for different periods
 Day-to-day operations
 Seasonal data and worst-case scenarios
 Indexes and summaries
Data Partitioning
 Breaking up of data into separate physical
units that can be handled independently
 Data partitioning provides ease of:
• Restructuring
• Reorganization
• Removal
• Recovery
• Monitoring
• Management
• Archiving
• Indexing
Horizontal Partitioning
 Table and index data are split by:
• Time
• Sales region or person
• Geography
• Organization
• Line of business
 Candidate columns appear in
a WHERE clause.
 Analysis determines requirements.
Vertical Partitioning
 You can use vertical partitioning when:
• The speed of query and update actions is improved by it
• Users require access to specific columns
• Some data is changed infrequently
• Descriptive dimension text may be better moved away from the dimension itself
Partitioning Methods
 Range partitioning
 List partitioning
 Hash partitioning
 Composite partitioning
• Composite range-hash partitioning
• Composite range-list partitioning
 Index partitioning
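As a sketch of the most common warehouse choice, range partitioning by time, the DDL might look like this (table, column, and partition names are assumptions):

CREATE TABLE sales_fact_part (
  sales_date   DATE,
  product_id   NUMBER,
  sales_amount NUMBER(12,2))
PARTITION BY RANGE (sales_date) (
  PARTITION sales_1998 VALUES LESS THAN (TO_DATE('01-JAN-1999','DD-MON-YYYY')),
  PARTITION sales_1999 VALUES LESS THAN (TO_DATE('01-JAN-2000','DD-MON-YYYY')),
  PARTITION sales_max  VALUES LESS THAN (MAXVALUE));

Each old partition can then be managed, archived, or dropped as a unit.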
Indexing
 Indexing is used for the following reasons:
 It is a huge cost saving, greatly
improving performance and
scalability.
 It can replace a full table scan by
a quick read of the index followed
by a read of only those disk
blocks that contain the rows
needed.
B-Tree Index
 Most common type of indexing
 Used for high cardinality columns
 Designed for few rows returned
Bitmap Indexes
 Provide performance benefits and storage savings
 Store values as 1s and 0s
 Use instead of B-tree indexes when:
• Tables are large
• Columns have relatively low cardinality
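As a hedged example, a low-cardinality column on a large fact table might be indexed like this (names are illustrative):

CREATE BITMAP INDEX sales_channel_bix
  ON sales_fact (channel_id);   -- few distinct values, many rows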
Bitmap Join Indexes
 A bitmap index for the join of two or more tables:
 They are new to Oracle9i.
 They provide better performance and storage savings.
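A sketch of the Oracle9i syntax, assuming sales_fact and customer tables where cust_id is the customer table's primary key:

CREATE BITMAP INDEX sales_cust_region_bjx
  ON sales_fact (customer.region)               -- dimension column indexed on the fact
  FROM sales_fact, customer
  WHERE sales_fact.cust_id = customer.cust_id;  -- the join condition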
Parallelism
[Diagram: partitions P1-P3 of the Sales and Customers tables are scanned concurrently by parallel execution servers]
Using Summary Data
 Designing summary tables offers the following
benefits:
 Provides fast access to precomputed data
 Reduces use of I/O, CPU, and memory
Summary
 In this lesson, you should have learned how to:
 Describe data warehouse environment data structures
 Define the business model:
• Performing strategic analysis
• Creating the business model
• Identifying business rules
 Define the dimensional model:
• Star dimensional model characteristics
 Define the physical model:
• Physical model design tasks
• Architectural and hardware requirements
• Storage and performance considerations
Practice 3-1 Overview
 This practice covers the following topics:
 Specifying true or false to a series of statements
 Completing a series of sentences accurately
 Practicing identifying a simple business model
 Identifying indexing methods
Lesson 4
Building the Data Warehouse:
Extracting Data
Objectives
 After completing this lesson, you should be
able to do the following:
 Outline the ETL (Extraction, Transformation,
and Loading) processes for building a data
warehouse
 Identify ETL tasks, importance, and cost
 Explain how to examine data sources
 Identify extraction techniques and methods
 Identify analysis issues and design options for
extraction processes
 List the selection criteria for the ETL tools
Extraction, Transformation, Loading
(ETL) Processes
 Extract source data
 Transform/clean data
 Index and summarize
 Load data into warehouse
 Detect changes
 Refresh data
[Diagram: ETL moves data from the operational systems into the data warehouse using programs, gateways, and tools]
ETL: Tasks, Importance, and Cost
[Diagram: ETL extracts, cleans up, consolidates, restructures, loads, maintains, and refreshes data from the operational systems into the warehouse, keeping it relevant, useful, quality, accurate, and accessible]
Extracting Data
 Source systems
• Data from various data sources in various formats
 Extraction Routines
• Developed to select data fields from sources
• Consist of business rules, audit trails, error correction
facilities
[Diagram: data mapping and transformation carry data from the operational databases through the data staging area to the warehouse database]
Examining Data Sources
 Production
 Archive
 Internal
 External
Production Data
 Operating system platforms
 File systems
 Database systems and vertical applications
Examples: IMS, DB2, Oracle, Sybase, Informix, VSAM; SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials
Archive Data
 Historical data
 Useful for analysis over long periods of time
 Useful for first-time load
 May require unique transformations
[Diagram: archived operational data is loaded into the warehouse database]
Internal Data
 Planning, sales, and marketing organization data
 Maintained in the form of:
• Spreadsheets (structured)
• Documents (unstructured)
 Treated like any other source data
[Diagram: planning, marketing, and accounting data feed the warehouse database]
External Data
 Information from outside the organization
 Issues of frequency, format, and predictability
 Described and tracked using metadata
[Diagram: external sources feed the warehouse databases: purchased databases (A.C. Nielsen, IRI, IMS, Walsh America), Dun and Bradstreet, competitive information, economic forecasts, Barron's, and the Wall Street Journal]
Mapping Data
 Mapping data defines:
 Which operational attributes to use
 How to transform the attributes for the warehouse
 Where the attributes exist in the warehouse
[Example: metadata maps File A to Staging File One; F1 "123" becomes Number "USA123", F2 "Bloggs" becomes Name "Mr. Bloggs", and F3 "10/12/56" becomes DOB "10-Dec-56"]
Extraction Techniques
 Programs: C, C++, COBOL, PL/SQL, Java
 Gateways: transparent database access
 Tools:
• In-house developed tools
• Vendor’s data extraction tools
Extraction Methods
 Logical Extraction methods:
• Full Extraction
• Incremental Extraction
 Physical Extraction methods:
• Online Extraction
• Offline Extraction
Designing Extraction Processes
 Analysis:
• Sources, technologies
• Data types, quality, owners
 Design options:
• Manual, custom, gateway, third-party
• Replication, full, or delta refresh
 Design issues:
• Volume and consistency of data
• Automation, skills needed, resources
Maintaining Extraction Metadata
 Source location, type, structure
 Access method
 Privilege information
 Temporary storage
 Failure procedures
 Validity checks
 Handlers for missing data
Extraction Tools
Selection Criteria
 Base functionality
 Interface features
 Metadata repository
 Open API
 Metadata access
 Repository utilities
 Input and output processing
 Cleansing, reformatting, and auditing
 References
 Training requirements
Possible ETL Failures
 A missing source file
 A system failure
 Inadequate metadata
 Poor mapping information
 Inadequate storage planning
 A source structural change
 No contingency plan
 Inadequate data validation
Maintaining ETL Quality
 ETL must be:
• Tested
• Documented
• Monitored and reviewed
 Disparate metadata must be coordinated.
Summary
 In this lesson, you should have learned how to:
 Outline the ETL (Extraction, Transformation,
and Loading) processes for building a data
warehouse
 Identify ETL tasks, importance, and cost
 Explain how to examine data sources
 Identify extraction techniques and methods
 Identify analysis issues and design options for
extraction processes
 List the selection criteria for the ETL tools
 Identify Oracle’s solution for the ETL process
Practice 4-1 Overview
 This practice covers the following topics:
 Answering a series of short questions
 Answering questions based on the business scenario
for Frontier Airways
Lesson 5
Building the Data Warehouse:
Transforming Data
Objectives
 After completing this lesson, you should be
able to do the following:
 Define transformation
 Identify possible staging models
 Identify data anomalies and eliminate them
 Explain the importance of quality data
 Describe techniques for transforming data
 Design transformation processes
Transformation
 Transformation eliminates anomalies from
operational data:
 Cleans and standardizes
 Presents subject-oriented data
[Diagram: data is extracted from the operational systems, transformed (cleaned up, consolidated, restructured) in the data staging area, and loaded into the warehouse]
Possible Staging Models
 Remote staging model
 Onsite staging model
Remote Staging Model
Data staging area within the warehouse environment:
[Diagram: extract from the operational system, transform in a staging area inside the warehouse environment, then load into the warehouse]
Data staging area in its own environment:
[Diagram: extract from the operational system into a separate staging environment, transform there, then load into the warehouse]
On-site Staging Model
 Data staging area within the operational environment,
possibly affecting the operational system
[Diagram: the staging area sits within the operational environment; data is extracted and transformed there, then loaded into the warehouse]
Data Anomalies
 No unique key
 Data naming and coding anomalies
 Data meaning anomalies between groups
 Spelling and text inconsistencies
CUSNUM    NAME                ADDRESS
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA
Transformation Routines
 Cleaning data
 Eliminating inconsistencies
 Adding elements
 Merging data
 Integrating data
 Transforming data before load
Transforming Data:
Problems and Solutions
 Multipart keys
 Multiple local standards
 Multiple files
 Missing values
 Duplicate values
 Element names
 Element meanings
 Input formats
 Referential Integrity constraints
 Name and address
Multipart Keys Problem
 Multipart keys
[Example: Product code = 12 M 654313 45, whose segments encode the country code, sales territory, product number, and salesperson code]
Multiple Local Standards Problem
 Multiple local standards
 Tools or filters to preprocess
[Example: sources use different units and formats: cm, DD/MM/YY, 1,000 GBP; inches, MM/DD/YY, FF 9,990; cm, DD-Mon-YY, USD 600]
Multiple Files Problem
 Added complexity of multiple source files
 Start simple
[Diagram: logic detects the correct source among multiple source files and produces the transformed data]
Missing Values Problem
 Solution:
 Ignore
 Wait
 Mark rows
 Extract when time-stamped
[Example: if the field is NULL, set it to the marker value 'A']
Duplicate Values Problem
 Solution:
 SQL self-join techniques
 RDBMS constraint utilities
[Example: three duplicate "ACME Inc" rows from different sources]

SQL> SELECT ...
     FROM table_a, table_b
     WHERE table_a.key (+)= table_b.key
     UNION
     SELECT ...
     FROM table_a, table_b
     WHERE table_a.key = table_b.key (+);
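A simpler check, not shown on the slide, lists candidate keys that occur more than once; the staging table name is an assumption:

SELECT cusnum, COUNT(*)
FROM   stg_customer
GROUP BY cusnum
HAVING COUNT(*) > 1;   -- keys that appear in more than one row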
Element Names Problem
 Solution:
Common naming conventions
[Diagram: Customer, Client, Contact, and Name fields from different sources map to one common name]
Element Meaning Problem
 Avoid misinterpretation
 Complex solution
 Document meaning in metadata
[Diagram: in different sources, Customer_detail holds the customer’s name, all customer details, or all details except the name]
Input Format Problem
[Examples: EBCDIC "123-73" versus ASCII 12373; values such as "ACME Co.", text in a non-Latin character set, and "Beer (Pack of 8)" arrive in different input formats]
Referential Integrity Problem
 Solution:
 SQL anti-join
 Server constraints
 Dedicated tools
Department table: 10, 20, 30, 40

Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60

(Departments 50 and 60 do not exist in the Department table.)
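The SQL anti-join mentioned above might look like this against the sample tables (column names are assumptions):

SELECT e.emp, e.name, e.department
FROM   emp e
WHERE  NOT EXISTS (SELECT 1
                   FROM   department d
                   WHERE  d.department = e.department);
-- returns Doe (50) and Harris (60), whose departments do not exist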
Name and Address Problem
 Single-field format:
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
 Multiple-field format:
Name: Mr. J. Smith; Street: 100 Main St.; Town: Bigtown; Country: County Luth; Code: 23565
[Example: the same names stored differently.
Database 1: NAME "DIANNE ZIEFELD", LOCATION "N100"; NAME "HARRY H. ENFIELD", LOCATION "M300"
Database 2: NAME "ZIEFELD, DIANNE", LOCATION "100"; NAME "ENFIELD, HARRY H", LOCATION "300"]
Quality Data: Importance and Benefits
 Quality data:
• Key to a successful warehouse implementation
 Quality data helps you in:
• Targeting the right customers
• Determining buying patterns
• Identifying householders: private and commercial
• Matching customers
• Identifying historical data
Data Quality Guidelines
 Operational data:
 Should not be used directly in the warehouse
 Must be cleaned for each increment
 Is not simply fixed by modifying applications
Transformation Techniques
 Merging data
 Adding a Date Stamp
 Adding Keys to Data
Merging Data
 Operational transactions do not usually map
one-to-one with warehouse data.
 Data for the warehouse is merged to provide
information for analysis.
Pizza sales/returns by day, hour, and second:
Sale    1/2/02  12:00:01  Ham Pizza       $10.00
Sale    1/2/02  12:00:02  Cheese Pizza    $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza   $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza   $11.00
Merging Data
Pizza sales/returns by day, hour, and second:
Sale    1/2/02  12:00:01  Ham Pizza       $10.00
Sale    1/2/02  12:00:02  Cheese Pizza    $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza   $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza   $11.00

Pizza sales after merging (the anchovy sale and return cancel out):
Sale    1/2/02  12:00:01  Ham Pizza       $10.00
Sale    1/2/02  12:00:02  Cheese Pizza    $15.00
Sale    1/2/02  12:00:04  Sausage Pizza   $11.00
Adding a Date Stamp
 Time element can be represented as a:
• Single point in time
• Time span
 Add time element to:
• Fact tables
• Dimension data
Adding a Date Stamp:
Fact Tables and Dimensions
[Diagram: a Time_key column is added to the Sales fact table (Item_id, Store_id, Time_key, Sales_dollars, Sales_units) and to the Product (Product_id, Time_key, Product_desc), Store (Store_id, District_id, Time_key), Item (Item_id, Dept_id, Time_key), and Time (Week_id, Period_id, Year_id, Time_key) tables]
Adding Keys to Data
Operational records:
#1  Sale    1/2/98  12:00:01  Ham Pizza       $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza    $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza   $12.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza   $11.00

Warehouse records keyed by data values or artificial keys:
#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00
Summarizing Data
1. During extraction on staging area
2. After loading to the warehouse server
[Diagram: summaries can be built in the staging area during extraction (1) or in the warehouse database after loading (2)]
Maintaining Transformation Metadata
 Transformation metadata contains:
 Transformation rules
 Algorithms and routines
[Diagram: metadata rules govern each step: sources, staging, extract, transform, load, publish, query]
Maintaining Transformation Metadata
 Restructure keys
 Identify and resolve coding differences
 Validate data from multiple sources
 Handle exception rules
 Identify and resolve format differences
 Fix referential integrity inconsistencies
 Identify summary data
Data Ownership and Responsibilities
 Data ownership and responsibilities should be
shared by the:
• Operational team
• Data warehouse team
 Business benefit is gained with a “work together”
approach
Transformation Timing and Location
 Transformation is performed:
• Before load
• In parallel
 Can be initiated at different points:
• On the operational platform
• In a separate staging area
Choosing a Transformation Point
 Workload
 Impact on environment
 CPU usage
 Disk space
 Network bandwidth
 Parallel execution
 Load window time
 User information needs
Monitoring and Tracking
 Transformations should:
 Be self-documenting
 Provide summary statistics
 Handle process exceptions
Designing Transformation Processes
 Analysis:
• Sources and target mappings, business rules
• Key users, metadata, grain
 Design options:
• Third-party tools
• Custom 3GL programs (FORTRAN, COBOL, C, C++, Java)
• 4GLs such as SQL or PL/SQL
• Replication
 Design issues:
• Performance
• Size of the staging area
• Exception handling, integrity maintenance
Transformation Tools
 Third-party tools
 SQL*Loader
 In-house developed programs
Summary
 In this lesson, you should have learned how
to:
 Define transformation
 Identify possible staging models
 Identify data anomalies and eliminate them
 Explain the importance of quality data
 Describe techniques for transforming data
 Design transformation processes
Practice 5-1 Overview
 This practice covers the following topics:
 Answering a series of questions based on the
business scenario for Frontier Airways
 Answering a series of short questions
Lesson 6
Building the Data Warehouse:
Loading Warehouse Data
Objectives
 After completing this lesson, you should be able to
do the following:
 Explain key concepts in loading warehouse data
 Outline how to build the loading process for the initial
load
 Identify loading techniques
 Describe the loading techniques provided by Oracle
 Identify the tasks that take place after data is loaded
 Explain the issues involved in designing the
transportation, loading, and scheduling processes
Loading Data into the Warehouse
 Loading moves the data into the warehouse
 Loading can be time-consuming:
• Consider the load window
• Schedule and automate the loading
 Initial load moves large volumes of data
 Subsequent refresh moves smaller volumes of
data
[Diagram: data is extracted from the operational databases, transformed in the staging area, then transported and loaded into the warehouse database]
Initial Load and Refresh
 Initial Load:
 Single event that populates the database with
historical data
 Involves large volumes of data
 Employs distinct ETL tasks
 Involves large amounts of processing after load
 Refresh:
 Performed according to a business cycle
 Less data to load than first-time load
 Less-complex ETL tasks
 Smaller amounts of post-load processing
Data Refresh Models: Extract
Processing Environment
 After each time interval, build a new snapshot of the database.
 Purge old snapshots.
[Diagram: successive snapshots T1, T2, T3 are built from the operational databases]
Data Refresh Models: Warehouse
Processing Environment
 Build a new database.
 After each time interval, add changes to database.
 Archive or purge oldest data.
[Diagram: changes T1, T2, T3 from the operational databases are added to the warehouse over time]
Building the Loading Process
 Techniques and tools
 File transfer methods
 The load window
 Time window for other tasks
 First-time and refresh volumes
 Frequency of the refresh cycle
 Connectivity bandwidth
Building the Loading Process
 Test the proposed technique
 Document proposed load
 Monitor, review, and revise
Data Granularity
 Important design and operational issue
 Low-level grain: expensive, high level of processing, more disk space, more detail
 High-level grain: cheaper, less processing, less disk space, less detail
Loading Techniques
 Tools
 Utilities and 3GL
 Gateways
 Customized copy programs
 Replication
 FTP
 Manual
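As a sketch of the utility route, a minimal SQL*Loader control file might look like this; the file and column names are assumptions:

LOAD DATA
INFILE 'sales.dat'
APPEND
INTO TABLE sales_fact
FIELDS TERMINATED BY ','
(product_id, store_id, day_id, sales_amount, sales_units)

For large first-time loads, the direct path (DIRECT=TRUE on the sqlldr command line) is typically used.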
Loading Technique Considerations
 Tools are comprehensive, but costly.
 Data-movement utilities are fast and powerful.
 Gateways are suitable for specific instances:
• Access other databases
• Supply dependent data marts
• Support a distributed environment
• Provide real-time access if needed
 Use customized programs as a
last resort.
 Replication is limited by
data-transfer rates.
Post-Processing of Loaded Data
[Diagram: after extract, transform, and load via the staging area, post-processing in the warehouse creates indexes, generates keys, summarizes, and filters the loaded data]
Indexing Data
 Before load: enable indexes at the server
 During load: adds time to the load window; row-by-row approach
 After load: adds time to the load window, but faster than the row-by-row approach
[Diagram: indexes are built as data moves from the staging area into the warehouse database]
Unique Indexes
 Disable constraints before load.
 Enable constraints after load.
 Re-create index if necessary.
[Diagram: disable constraints, load the data, enable constraints and create the index, catch errors, reprocess]
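In Oracle SQL, the flow above might be sketched as follows; the table and constraint names are assumptions, and the EXCEPTIONS table is created by the utlexcpt.sql script:

ALTER TABLE sales_fact DISABLE CONSTRAINT sales_fact_pk;

-- ... load the data ...

ALTER TABLE sales_fact ENABLE CONSTRAINT sales_fact_pk
  EXCEPTIONS INTO exceptions;   -- ROWIDs of offending rows land here for reprocessing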
Creating Derived Keys
 The use of derived or generalized keys is
recommended to maintain the uniqueness of a
row.
 Methods:
• Concatenate the operational key with a number
• Assign a number sequentially from a list
[Example: operational key 109908 becomes "109908 01" by concatenation, or is assigned the sequence number 100]
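The second method is usually implemented with a sequence; a minimal sketch with assumed names:

CREATE SEQUENCE customer_dw_seq;

INSERT INTO customer_dim (customer_key, source_key, customer_name)
VALUES (customer_dw_seq.NEXTVAL, '109908', 'John Doe');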
Summary Management
 Summary tables
 Materialized views
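A hedged sketch of a materialized view that precomputes a monthly summary over the star schema sketched in Lesson 3 (names and refresh options are illustrative):

CREATE MATERIALIZED VIEW sales_by_month_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND   -- FAST refresh would require materialized view logs
ENABLE QUERY REWRITE         -- lets the optimizer rewrite matching queries
AS
SELECT t.month_id,
       SUM(s.sales_amount) AS sales_amount,
       SUM(s.sales_units)  AS sales_units
FROM   sales_fact s, time_dim t
WHERE  s.day_id = t.day_id
GROUP BY t.month_id;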
Filtering Data
 From warehouse to data marts
[Diagram: summary data flows from the warehouse to the data marts]
Verifying Data Integrity
 Load the data into an intermediate file.
 Compare target flash totals with the totals taken before load.
[Diagram: counts and amounts are totaled before the load and compared with the flash totals of the intermediate file and the target; preserve, inspect, fix, then load]
Steps for Verifying Data Integrity
[Diagram, steps 1-7: source files are extracted, checked against a control file, loaded into the target with SQL*Loader, and verified through the .log and .bad files]
Standard Quality Assurance Checks
 Load status
 Completion of the process
 Completeness of the data
 Data reconciliation
 Referential integrity violations
 Reprocessing
 Comparison of counts and amounts
Summary
 In this lesson, you should have learned how to:
 Explain key concepts in loading data into the
warehouse
 Outline how to build the loading process for the
initial load
 Identify loading techniques
 Describe the loading techniques provided by
Oracle
 Identify the tasks that take place after data is
loaded
 Explain the issues involved in designing the
transportation, loading, and scheduling
processes
Practice 6-1 Overview
 This practice covers the following topics:
 Answering a series of short questions
 Answering questions based on the business scenario
for Frontier Airways
Lesson 7
Refreshing Warehouse Data
Objectives
 After completing this lesson, you should be
able to do the following:
 Describe methods for capturing changed data
 Explain techniques for applying the changes
 Describe the Change Data Capture mechanism and
refresh mechanisms supported in Oracle9i
 Describe techniques for purging and archiving
data and outline the techniques supported by
Oracle
 Outline final tasks, such as publishing the data,
controlling access, and automating processes
 List the selection criteria for choosing ETL tools
Developing a Refresh Strategy
for Capturing Changed Data
 Consider load window.
 Identify data volumes.
 Identify cycle.
 Know the technical infrastructure.
 Plan a staging area.
 Determine how to detect changes.
[Diagram: refresh cycles T1, T2, T3 capture changes from the operational databases]
User Requirements and Assistance
 Users define the refresh cycle.
 IT balances requirements against technical issues.
 Document all tasks and processes.
 Employ user skills.
[Diagram: refresh cycles T1, T2, T3 capture changes from the operational databases]
Load Window Requirements
 Time available for the entire ETL process
 Plan
 Test
 Prove
 Monitor
[Diagram: a 24-hour timeline split between the user access period and a load window starting around 3 a.m.]
Planning the Load Window
 Plan and build processes according to a strategy.
 Consider volumes of data.
 Identify technical infrastructure.
 Ensure currency of data.
 Consider user access requirements first.
 High availability requirements may mean a small load
window.
[Diagram: a 24-hour timeline showing the user access period]
Scheduling the Load Window
[Diagram, steps 1-4 starting at 3 a.m.: (1) receive data via FTP; (2) check requirements against the control file: load cycle, file names and types, number of files and loads, first-time load or refresh, file date, date range, record counts, and amount totals; (3-4) a control process opens and reads the files to verify and analyze them]
Scheduling the Load Window
[Diagram, steps 5-9 between 3 a.m. and 9 a.m.: (5) load into the warehouse in parallel; (6) verify, analyze, reapply; (7) index the data; (8) create summaries; (9) update the metadata]
Scheduling the Load Window
[Diagram, steps 10-13 between 6 a.m. and 9 a.m.: (10) back up the warehouse; (11) create views for specialized tools; (12) users access summary data; (13) publish for user access]
Capturing Changed Data for Refresh
 Capture new fact data
 Capture changed dimension data
 Determine method of capture in each case
 Methods:
• Wholesale data replacement
• Comparison of database instances
• Time stamping
• Database triggers
• Database log
Wholesale Data Replacement
 Expensive
 Useful for data marts with less data
 Limited historical data analysis is possible
 Time period often exceeds load window
 Mirroring techniques can be used to provide
access to the users
Comparison of Database Instances
 Delta file:
• Changes to operational data since last refresh
• Used to update the warehouse
 Simple to perform, but expensive in terms of time and
processing
 Efficient for smaller volumes of data
[Diagram: yesterday’s and today’s operational databases are compared; the delta file holds the changed data]
Time and Date Stamping
 Fast scanning for records changed since last refresh
cycle
 Useful for data with updated date field
 No detection of deleted data
[Diagram: records changed since the last refresh are selected from the operational data by time stamp into a delta file]
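With an update-date column on the source, the delta extraction reduces to a time-stamped query (column and bind variable names are assumptions):

SELECT *
FROM   customer
WHERE  last_update_date > :last_refresh_date;   -- rows changed since the last cycle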
Database Triggers
 Changed data is intercepted at the server level
 Extra I/O required
 Maintenance overhead
[Diagram: triggers on the operational server (RDBMS) write changed data to a delta file]
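A minimal sketch of such a trigger, assuming a customer source table and a customer_delta holding table:

CREATE OR REPLACE TRIGGER customer_delta_trg
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
  -- record the key and time of every change for the next refresh
  INSERT INTO customer_delta (cust_id, change_date)
  VALUES (NVL(:NEW.cust_id, :OLD.cust_id), SYSDATE);
END;
/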
Using a Database Log
 Contains before and after images
 Requires system checkpoint
 Common technique
[Diagram: the operational server’s (DBMS) log is analyzed to extract changed data into a delta file]
Choosing a Method
for Change Data Capture
 Consider each method on merit.
 Consider a hybrid approach if one approach is not
suitable.
 Consider current technical, operational, and
application issues.
Applying the Changes to Data
 You have a choice of techniques:
 Overwrite a record
 Add a record
 Add a field
 Maintain history
 Add version numbers
Overwriting a Record
 Easy to implement
 Loses all history
 Not recommended
[Example: record 42135 "John Doe Single" is overwritten with "John Doe Married"]
Adding a New Record
 History is preserved; dimensions grow.
 Time constraints are not required.
 Generalized key is created.
 Metadata tracks usage of keys.
[Example: record 42135 "John Doe Single" is kept, and a new record 42135_01 "John Doe Married" is added under a generalized key]
Adding a Current Field
 Maintains some history
 Loses intermediate values
 Is enhanced by adding an Effective Date field
[Example: record 42135 "John Doe Single" gains a current value "Married" and an effective date of 1-Jan-01]
Limitations of Methods
for Applying Changes
 History is difficult to maintain
 Dimensions may grow large
 Maintenance overhead
Maintaining History: Techniques
 History tables
 One-to-many relationships
 Versioning
 Preserve complete history
Maintaining History: Techniques
 History tables:
 Normalize dimensions
 Hold current and historical data
 One-to-many relationships:
 One current record and many history records
[Diagram: a HIST_CUST history table hangs off the CUSTOMER dimension in a star of Time, Customer, and Product around Sales]
Versioning
 Avoid double counting
 Facts hold version number
Customer table:
CustId  Version  Customer Name
1234    1        Comer
1234    2        Comer

Sales table:
CustId  Version  Sales Facts
1234    1        $11,000
1234    2        $12,000

[Diagram: Time, Customer, and Product dimensions around the Sales fact; the facts carry the version number]
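Joining the fact to the dimension on both the key and the version attaches each fact to exactly one dimension row, so amounts are not double counted; a sketch using the sample data (column names are assumptions):

SELECT c.custid, c.version, SUM(s.sales_amount) AS sales
FROM   sales s, customer c
WHERE  s.custid  = c.custid
AND    s.version = c.version
GROUP BY c.custid, c.version;   -- $11,000 for version 1, $12,000 for version 2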
Preserve Complete History
 Complete history:
• Enables realistic historical analysis
• Retains context of data
 Model must be able to:
• Reflect business changes
• Maintain context between fact and dimension data
• Retain sufficient data to relate old to new
Purging and Archiving Data
 As data ages, its value depreciates.
 Remove old data from the warehouse:
• Archive for later use (if needed)
• Purge without copy
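If the fact table is range partitioned by time, as sketched in Lesson 3, purging old data can be a fast, metadata-only operation; the partition name is an assumption:

-- Archive the partition's data first if it may be needed later
ALTER TABLE sales_fact DROP PARTITION sales_1998;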
Final Tasks
 Update metadata
 Publish data
 Use database roles to control access to the
warehouse
[Diagram: metadata rules span sources, staging, extract, transform, load, publish, and query]
Publishing Data
 Control access using database roles
 Compromise between load action and user access
 Consider:
• Staggering updates
• Using temporary tables
• Using separate tables
ETL Tools: Selection Criteria
 Overlap with existing tools
 Availability of meta model
 Supported data sources
 Ease of modification and maintenance
 Required fine tuning of code
 Ease of change control
 Power of transformation logic
 Level of modularization
 Power of error, exception, resubmission features
 Intuitive documentation
 Performance of code
ETL Tool Selection Criteria
 Activity scheduling and sophistication
 Metadata generation
 Learning curve
 Flexibility
 Supported operating systems
 Cost
Summary
 In this lesson, you should have learned how to:
 Describe methods for capturing changed data
 Explain techniques for applying the changes
 Describe the Change Data Capture mechanism and
refresh mechanisms supported in Oracle9i
 Describe techniques for purging and archiving
data and outline the techniques supported by
Oracle
 Outline final tasks, such as publishing the data,
controlling access, and automating processes
 List the selection criteria for choosing ETL tools
Practice 7-1 Overview
 This practice covers the following topics:
 Answering a series of questions based on the
business scenario for Frontier Airways
 Answering a series of short questions
Lesson 8
Leaving a Metadata Trail
Objectives
 After completing this lesson, you should be
able to do the following:
 Define warehouse metadata, its types, and its
role in a warehouse environment
 Examine each type of warehouse metadata
 Develop a metadata strategy
 Outline the Common Warehouse Metamodel
(CWM)
Defining Warehouse Metadata
 Data about warehouse data and processing
 Vital to the warehouse
 Used by everyone
 The key to understanding warehouse information
Metadata Users
[Diagram: end users, developers, and IT professionals all draw on the metadata repository]
Types of Metadata
 End-user metadata:
• Key to a good warehouse
• Navigation aid
• Information provider
 ETL metadata:
• Maps structure
• Source and target information
• Transformations
• Context
 Operational metadata:
• Load, management, scheduling processes
• Performance
Examining Types of Metadata
 ETL metadata
 End-user metadata
[Diagram: ETL metadata and end-user metadata live in the metadata repository, spanning the external sources, operational data sources, warehouse, and end users]
Examining Metadata: ETL Metadata
 Business rules
 Source tables, fields, and key values
 Ownership
 Field conversions
 Encoding and reference table
 Name changes
 Key value changes
 Default values
 Logic to handle multiple sources
 Algorithms
 Time stamp
Extraction Metadata
 Space and storage requirements
 Source location information
 Diverse source data
 Access information
 Security
 Contacts
 Program names
 Frequency details
 Failure procedures
 Validity checking information
Transformation Metadata
 Duplication routines
 Exception handling
 Key restructuring
 Grain conversions
 Program names
 Frequency
 Summarization
Loading Metadata
 Method of transfer
 Frequency
 Validation procedures
 Failure procedures
 Deployment rules
 Contact information
Examining Metadata: End-User Metadata
 Location of facts and dimensions
 Availability
 Description of contents and algorithms used for derived and summary data
 Data ownership details
End-User Metadata: Context
 Need to know the context of the table queried
 Associate the metadata with its description
Example of End-User Metadata
Table Name  Column Name  Data    Meaning
Product     Prod_ID      739516  Unique identifier for the product
Product     Valid_date   01/97   Last refresh date
Product     Ware_loc     1816    Warehouse location number
Product     Ware_bin     666     Warehouse bin number
Product     Code         15      The color of the product; refer to table COL_REF for details
Product     Weight       17.62   Packed shipping weight in kilograms
Historic Context of Data
 Supports change history
 Maintains the context of information
[Diagram: the metadata repository preserves the context of information for end users across 1994-1998]
Types of Context
 Simple:
• Data structures
• Naming conventions
• Metrics
 Complex:
• Product definitions
• Markets
• Pricing
 External:
• Economic
• Political
Developing a Metadata Strategy
 Define a strategy to ensure high-quality
metadata useful to users and developers.
 Primary strategy considerations:
• Define goals and intended use
• Identify target users
• Choose tools and techniques
• Choose the metadata location
• Manage the metadata
• Manage access to the metadata
• Integrate metadata from multiple tools
• Manage change
Defining Metadata Goals
and Intended Usage
 Define clear goals.
 Identify requirements.
 Identify intended usage.
Identifying Target Metadata Users
 Who are the metadata users?
• Developers
• End users
 What information do they need?
 How will they access the metadata?
Choosing Metadata Tools and
Techniques
 Tools:
• Data modeling
• ETL
• End user (query and analysis)
 Database schema definitions
 COBOL copybooks
 Middleware tools
Choosing the Metadata Location
 Usually the warehouse server
 Possibly on operational platforms
 Desktop tool with metalayer
Managing the Metadata
 Managed by the metadata manager
 Maintained by the metadata architect
 Standards should be followed
Integrating Multiple Sets of Metadata
 Multiple tools may generate their own metadata.
 These metalayers should be properly integrated.
 Metadata exchangeability is desirable.
Managing Changes to Metadata
 Different types of metadata have different rates of
change.
 Consider metadata changes resulting from refresh
cycles.
Additional Metadata Content
and Considerations
 Summarization algorithms
 Relationships
 Stewardship
 Permissions
 Pattern analysis
 Reference tables
Common Warehouse Metamodel
[Diagram: the CWM metadata repository ties together design and administration, any source (ERP, operational, external), the warehouse and marts, data integration, information delivery, and any access (reporting, ad hoc query and analysis, data mining, analytic applications)]
Summary
 In this lesson, you should have learned how
to:
 Define warehouse metadata, its types, and its
role in a warehouse environment
 Examine each type of warehouse metadata
 Develop a metadata strategy
 Outline the Common Warehouse Metamodel
(CWM)
Practice 8-1 Overview
 This practice covers the following topics:
 Answering a series of short questions
 Answering questions based on the business scenario
for Frontier Airways
Lesson 9
Managing and Maintaining
the Data Warehouse
Objectives
 After completing this lesson, you should be able to
do the following:
 Develop a plan for managing the transition from
development to implementation
 Identify challenges pertaining to the growth of the data
warehouse
 Describe backup and archive mechanisms
 Identify data warehouse performance issues
Managing the Transition to Production
 Promoting support for change
 Pilot versus large-scale implementation
 Documentation
 Testing
 Training
 Post-implementation support
 Maintaining the warehouse
Promoting Support for the
Data Warehouse
 Awareness
 Feedback
 Information
 Skills
 Education
 Direction
 Control
Choosing Between Pilot and
Large-Scale Implementation
The Warehouse Pilot
 Demonstrates benefits to:
• Management
• Users
• IT staff
 Relevant to the business
 Low technical risk
 Small and feasible
 Anticipates increased use
 Focused on an initial business issue
 Remains in context
Piloting the Warehouse
 Designers:
• Prove model, data, and access tools
 Users:
• Prove ease of use of the tools
• Check data and query performance
• Identify training requirements
 Developers:
• Resolve ETL and metadata issues
• Determine users' data and training requirements
• Test security and access levels, monitor performance
Documentation
 Produces textual deliverables:
 Glossary
 User and technical documentation
 Online help
 Metadata reference guide
 Warehouse management reference
 New features guide
Testing the Warehouse
 Test every stage.
 Use a realistic test database and environment.
Training
 Users:
• Metadata
• DSS tools
• Ad hoc queries
• Getting help
• Registration of enhancement requests
 Information systems developers:
• Analysis techniques
• Hardware technicalities
• Networking
• Implementing, building, and supporting DSS
Post-Implementation Support
 Evaluate and review the implementation.
 Monitor the warehouse:
• Respond to problems
• Conduct performance tuning
• Roll out metadata, queries, reports, filters, and
conditions
• Implement security
• Incorporate new users
• Distribute data marts and catalogs
• Transfer ownership from IT
Monitoring the Success
of the Data Warehouse
(Chart: number of users measured initially and at 3, 6, 12, and 24 months in the period after implementation)
Measuring the Success
of the Data Warehouse
 Metrics may include:
 Availability
 Response time
 Response to problems
Managing Growth
 Increasing number of users
 Broader usage
 Growth of data volumes
Expansion and Adjustment
 Evaluate continually:
• Changes
• New increments
• Unnecessary components
• Strategies
 Ensure an open environment
 Document development processes for the future:
• Planning
• Cost analysis
• Problem assessment and correction
• Performance assessment
Controlling Expansion
 Ensure the continuity of staff.
 Document processes, solutions, and metrics.
 Establish working test and production
architecture for further increments.
 Create a strategy for maintaining changes to
data.
Sizing Storage
 Consider different methods.
 Determine the best for your needs.
 Know the business requirements.
 Do not underestimate requirements.
 Plan for growth.
 Consider space for unwanted data.
Estimating Storage
 Fact volumes
 Fact lifetime
 Technology availability
 Technology purchase
 Storing presummarized data
 Mirroring or other techniques
requiring disk storage
Objects That Need Space
 ODS
 Indexes and metadata
 Summary data
 Redo logs
 Rollback information
 Sort areas
 Temporary space
 Workspace for backup
and recovery
Other Considerations and Techniques
 Queuing models
 Rule of thumb: total database size is typically three to four times the size of the base fact tables (for example, 200 GB of base fact data suggests roughly 600 to 800 GB overall)
 Consider:
• Sparseness
• Dimensions
• Indexes
• Summaries
• Sort and operational space
Space Management
 Monitor
 Avoid fragmentation
 Test load data
 Plan for growth
 Know business patterns
 Never let space become an issue
Archiving Data
 Determine data life expectancy.
 Identify archive frequency.
 Use read-only tablespaces.
 Include in early specifications.
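As a rough Oracle-syntax illustration of the read-only tablespace point (the tablespace name is hypothetical), a fully loaded historical period can be frozen so it is backed up once and then skipped by routine backups:

-- Freeze a completed historical period.
ALTER TABLESPACE sales_1997 READ ONLY;

-- If that period must later be purged, it can be dropped as a unit:
-- DROP TABLESPACE sales_1997 INCLUDING CONTENTS;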
Purging Data
 Reduce data volumes:
• Create summaries
• Remove unwanted base data
 Choose the most effective method.
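A minimal sketch of the summarize-then-purge approach, assuming a hypothetical sales_fact detail table with a sale_date column:

-- Preserve monthly summaries of detail older than two years...
CREATE TABLE sales_month_summary AS
SELECT product_id,
       TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount
FROM   sales_fact
WHERE  sale_date < ADD_MONTHS(SYSDATE, -24)
GROUP  BY product_id, TRUNC(sale_date, 'MM');

-- ...then remove the detail rows that the summary now covers.
DELETE FROM sales_fact
WHERE  sale_date < ADD_MONTHS(SYSDATE, -24);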
Identifying Data Warehouse
Performance Issues
 Improving query efficiency:
• Use indexes (see the index sketch after this slide).
• Use query governors.
• Run large jobs out of hours.
• Consider a data mart approach.
 Improving network performance:
• Provide sufficient bandwidth and optimize
configuration for access.
• Analyze traffic.
• Deploy data marts at remote locations.
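For the indexing point above, one common warehouse technique is a bitmap index on a low-cardinality fact-table foreign key; this Oracle-syntax sketch uses illustrative names:

-- Bitmap indexes suit low-cardinality columns combined with AND/OR filters.
CREATE BITMAP INDEX sales_fact_prod_bix
  ON sales_fact (product_id);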
Review and Revise
 Monitor the warehouse:
 Usage
 Access
 Accurate grain
 Detail data
 Periodicity
Secret of Success
Think big; start small!
Course Summary
 In this course, you should have learned that
the successful warehouse:
 Is driven by the business
 Focuses on objectives
 Adds value to the business
 Can be understood and used
 Delivers good data
 Performs well
 Belongs to the users
Sample study
Note: we discuss the creation of data marts rather than the perhaps more
familiar data warehouse.
Data warehouses tend to be large, one-stop-shopping
repositories where all the historical data for the
organization would be stored. Nothing is wrong with this
as a concept; however, attempting to create a data
warehouse often led to huge, multiyear technology
projects that were never quite finished or were outdated
when they finally did get done.
Data Mart Structure
 Measures
A Measure is a numeric quantity expressing some
aspect of the organization's performance. The
information represented by this quantity is used to
support or evaluate the decision making and
performance of the organization. A measure can also be
called a fact.
The tables that hold measure information are known as fact tables.
 Dimensions
A Dimension is a categorization used to spread out an aggregate measure to
reveal its constituent parts. Breaking a measure out this way is commonly
called slicing and dicing, and dimensions are what make it possible.
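For example, a query that spreads total sales across one dimension might look like the following sketch (the star-schema table and column names are illustrative):

-- Slice total sales by customer region for a single year.
SELECT   c.region,
         SUM(f.amount) AS total_sales
FROM     sales_fact   f
JOIN     customer_dim c ON c.customer_key = f.customer_key
JOIN     time_dim     t ON t.time_key     = f.time_key
WHERE    t.year = 1997
GROUP BY c.region;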
 The Star Schema
A Star Schema is a relational database schema used to
hold measures and dimensions in a data mart. The
measures are stored in a fact table and dimensions are
stored in dimension tables.
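A minimal sketch of such a schema (the names are illustrative, not a schema from the course):

-- Denormalized dimension tables...
CREATE TABLE product_dim (
  product_key  NUMBER PRIMARY KEY,
  prod_id      NUMBER,
  description  VARCHAR2(100),
  color        VARCHAR2(20)
);

CREATE TABLE time_dim (
  time_key       NUMBER PRIMARY KEY,
  calendar_date  DATE,
  month          NUMBER,
  quarter        NUMBER,
  year           NUMBER
);

-- ...surround a central fact table that holds the measures.
CREATE TABLE sales_fact (
  product_key  NUMBER REFERENCES product_dim (product_key),
  time_key     NUMBER REFERENCES time_dim (time_key),
  amount       NUMBER(12,2),
  quantity     NUMBER
);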
 Attributes
An Attribute is an additional piece of information
pertaining to a dimension member that is not the unique
identifier or the description of the member.
 Hierarchies
A Hierarchy is a structure made up of two or more levels of related
dimensions. A dimension at an upper level of the hierarchy completely
contains one or more dimensions from the next lower level of the
hierarchy.
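For instance, with a time dimension carrying month, quarter, and year levels (illustrative names matching the star schema sketch above), sales can be rolled up to the quarter level:

-- Each year completely contains its quarters, so grouping at the
-- quarter level rolls the detail up the hierarchy.
SELECT   t.year,
         t.quarter,
         SUM(f.amount) AS total_sales
FROM     sales_fact f
JOIN     time_dim   t ON t.time_key = f.time_key
GROUP BY t.year, t.quarter
ORDER BY t.year, t.quarter;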
 The Snowflake Schema
A Snowflake Schema is a variation of the star schema in which some dimension
tables are normalized into additional related tables instead of being stored
in fully denormalized form.
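One way to snowflake the product dimension from the earlier star schema sketch (names remain illustrative):

-- The color attribute moves out of the product dimension into its own table...
CREATE TABLE color_dim (
  color_key  NUMBER PRIMARY KEY,
  color      VARCHAR2(20)
);

-- ...and the dimension table references it, normalizing the schema.
CREATE TABLE product_dim_sf (
  product_key  NUMBER PRIMARY KEY,
  prod_id      NUMBER,
  description  VARCHAR2(100),
  color_key    NUMBER REFERENCES color_dim (color_key)
);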