The Warehouse Components (A Quick Review)

Download Report

Transcript The Warehouse Components (A Quick Review)

The Basic Structure
Data Staging Area
Source Data
Storage:
flat files (fastest);
RDBMS;
other
Extract Processing:
Extract
Populate,
clean;
replicate,
prune;
recover
combine;
remove duplication
standardize
conform dimensions
store awaiting replication
export to data marts
No user query services
Corporate View
Data Mart #1
OLAP (ROLAP,
MOLAP,HOLAP)
dimensional access
subject oriented
user group driven
refresh frequency
conforms to the Bus
DW Bus
Data Mart #2
DW Bus
Data Mart #3
The Basic Structure
Corporate Staging Area
User Access
Data Mart #1
OLAP (ROLAP,
MOLAP,HOLAP)
dimensional access
subject oriented
user group driven
refresh frequency
conforms to the Bus
Ad Hoc Query Tools
Data Feed
Customized Applications
Data Feed
DW Bus
Data Mart #2
DW Bus
Data Mart #3
Reporting Tools and Writers
Data Feed
Models:
forecasting;
scoring;
allocating;
data mining;
scenario analysis;
etc.
The Business Dimensional
Lifecycle
• Project Planning Early Critical Tasks
–
–
–
–
definition
scope
readiness assessment
business justification
• Remaining Tasks
– Resource requirements and identification
– Schedule construction and integrations
The Business Dimensional
Lifecycle
• Business Requirements Definition
– Critical to success
– Designers must understand the business needs
– A plan to extract users needs and to understand
them must be developed.
The Business Dimensional
Lifecycle
• Three project tracks follow the business
requirements definition process:
– Data track
– Technology track
– Application track
The Business Dimensional
Lifecycle
• Data Track
– Dimensional modeling
– Physical design
– Data staging design and development
The Business Dimensional
Lifecycle
• Technology Track
– Technical architecture design
– Things to consider:
• business requirements
• current technical environment
• planned strategic technical directions
The Business Dimensional
Lifecycle
• Application Track
– Product identification, selection, and
installation
– End user application development
• Configuring the metadata repository access
• Building specialized applications
The Business Dimensional
Lifecycle
• Deployment
– The integration of all the pieces of the puzzle.
– The best warehouse will fail if deployment is
not properly planned
– Plan required prior to deployment are:
•
•
•
•
education
user support
feedback
enhancement/maintenance
The Business Dimensional
Lifecycle
• Maintenance and Growth
– Work never stops!!!
– Critical to support and stay connected to the
users to ensure the warehouse meets their
needs.
– Watch performance and plan ahead (the
backroom)
– Collect and analyze metrics regarding use and
operation
The Business Dimensional
Lifecycle
• Maintenance and Growth (cont)
– If you are successful, change is inevitable. Plan
and prioritize future initiatives with user buy-in.
– Always plan for expansion and growth with
each new increment or change.
The Business Dimensional
Lifecycle
• Project Management
–
–
–
–
–
–
–
Monitor project status
Track issues
Control change
Project communication
Project marketing
Project politician
Project visionary
The Business Dimensional
Lifecycle
Technical
Architecture
Design
Product
Selection &
Installation
Business
Project
Planning
Requirement
Dimensional
Modeling
Physical
Design
Data Staging
Design &
Development
Definition
End-User
Application
Specification
End-User
Application
Development
Project Management
Deployment
Maintenance
and
Growth
Project Planning & Management
• Who Wants the Warehouse?
– A single visionary user
• desirable because the focus remains manageable
• requires political leverage to make it work
• the need must have broad and definable impacts to show worth
– Multiple demands
• Many organizations want a data mart or warehouse
• Focus is spread, therefore politics and planning play a vital role
Project Planning & Management
• Who Wants the Warehouse? (cont)
– No identified need
• Organization wanting to get in the “warehouse”
game
• More effort on the warehouse team to identify the
need
• It is highly likely there will be one.
Project Planning & Management
• Determine Warehouse Readiness
– Do you have a strong business sponsor?
•
•
•
•
•
•
•
•
Vision
Politically savvy
Connected
Influential
History of success
Respected
Realistic
Understands the need and the process and can communicate it
Project Planning & Management
• Determine Warehouse Readiness (cont)
–
–
–
–
Without this person you will fail
Try to recruit multiple sponsors.
Is there a real and identifiable business need?
Does a strong partnership exist between IT and
the business groups?
– What is the current analytical environment?
• How are things done now?
• What culture shock will be created?
Project Planning & Management
• Determine Warehouse Readiness (cont)
– What is the feasibility?
• Is the data “dirty” beyond recovery?
• Is the target sources to dispersed and dynamic to
achieve early and significant results?
Project Planning & Management
• Take the Readiness “Litmus Test”
– The test looks at:
•
•
•
•
•
Sponsor
Business Needs
IT/Business Partnership
Current Analytical Environment
Feasibility
– A strong sponsor is the most important to get a high
rating from the test
– Business needs and IT/Business Partnerships are
secondary in importance
Project Planning & Management
• Addressing Readiness Issues
– High-level business requirements analysis
• Identify the strategic initiatives
• Identify the business metrics
• Identify the high impact and ROI areas
– Business Requirements Prioritization
• Look for high impact, ROI, and feasibility
– Proof of Concept
Project Planning & Management
• Develop the Initial Scope
– Keep the scope narrow and short to retain clarity
– The bigger the scope the more difficult it
becomes to retain focus
– Always define the scope based on business
requirements. Try to avoid deadlines or budget
cycles from driving the scope.
Project Planning & Management
• Develop the Initial Scope (cont)
– Scope definition involves both IT and business
representatives
– Make the scope have significance but ensure it is
achievable and timely
– Start with a single or few data sources and a single
business process
– Limit your initial user base (typically 25 - 35 people).
– Determine what management expects so success can be
identified
Project Planning & Management
• Develop the Initial Scope (cont)
– Document the scope definition and success
indicators
– Acknowledge that the scope will likely change
– Develop a plan to manage the change
Project Planning & Management
• Build the Business Justification
– Determine the costs
• Identify hardware and software costs (start-up and
ongoing)
• Identify maintenance costs
• Internal staff needs
• External resources (consultants, etc.)
• Operational support
• Support of growth pains
Project Planning & Management
• Build the Business Justification (cont)
– Determine the benefits (financial and other)
•
•
•
•
•
•
•
Increased revenue
Increased profit
Increased customer satisfaction
Expansion of a market or capability
Increased employee productivity
Reduction of capital investments (storage requirements, etc.)
Protection against fraud and attack
Project Planning & Management
• Build the Business Justification (cont)
– It is important to monitor and track the business
to identify and market impacts the warehouse
has made
– Look for the tangibles and intangibles
Project Planning & Management
• Plan the Project
– Establish project identity
•
•
•
•
Create a name
Create documentation describing your project
Make T-shirts, mugs, etc
Market, market, market!!!
Project Planning & Management
• Plan the Project (cont)
– Staff up
•Project Manager
•Business Lead
•Business Analyst
•Data Modeler
•DW DBA
•Data Staging System Designer
•End User Application Developer
•DW Educator
•Technical/Security Architect
•Technical Support Specialists
•Data Staging Programmers
•Data Steward
•DW QA Analyst
Project Planning & Management
• Develop the Project Plan
– Key (frequently update your plan)
– The nature of a DW project in cyclic and
resembles a spiral approach
– Identify key milestones
– Develop a high-level and detailed plan
Project Planning & Management
• Manage the Project
– Matrix management is often used because of
the numerous interlaced roles
– Data issues may lay waste to the best devised
plans (plan for the unexpected)
– The project will likely increase in visibility
(manage expectations)
– Iterative/sliding window development requires
multiple teams work in sync (communication)
Project Planning & Management
• Manage the Project (cont)
– Conduct a project kickoff meeting
•
•
•
•
•
•
Identify the team, roles, and responsibilities
Identify the scope
Identify goals
Identify the schedule
Review the preliminary PMP
Conduct preliminary education
Project Planning & Management
• Monitor the Project Status
–
–
–
–
–
Frequent communication
Project status meetings
Team meetings
Project status reports
Customer reporting
Collecting the Requirements
• The old theory was not to include the users
in the early stages.
• Build it and they will come.
• This proved to be the demise of many early
warehouse initiatives.
• A formal requirement (but flexible) is
needed to document the users needs of the
warehouse.
Collecting the Requirements
• This is a difficult process for many reasons.
– Key people may feel threatened and are not
willing to cooperate.
– The informal decision process is typically not
well documented and is dispersed.
– People have a difficult time thinking “out of the
box”
– Terminology associated to warehousing often
creates confusion and/or misinformation
Collecting the Requirements
• Talk with the business users first
– Strive to understand how they do business
– Identify how decisions are made today
– Determine how they would like to make
decisions today and tomorrow
– Do not just ask “what data do you need?”
Collecting the Requirements
• Talk with the IT community second
– Wait until some common sources and themes
are identified by the business users before
approaching IT
– Look for feasibility issues
– Start identifying technical issues such as
platforms, formats, access, and politics
– Talk DBAs, DAs, application developers, an
designers
Collecting the Requirements
• Getting the requirements (Interview VS
Facilitation)
– Interviews tend to stay focused and work well
with small groups
– Facilitated sessions work with larger groups
and encourage “brainstorming” and cross
pollination of ideas.
Collecting the Requirements
• Roles of the requirements team
–
–
–
–
–
Lead interviewer
Secondary interviewers
Scribe
Observers
Facilitator
Collecting the Requirements
• Preparation for the interview
– Look at strategic plans that relate to the
company or group you will talk with
– Look at the annual report. Important goals and
initiatives will be identified and taken seriously
by the company.
– Review marketing material
– Search the Internet for information
– Identify past attempts at similar projects
Collecting the Requirements
• Identify who will be interviewed
– Business
• Look horizontally across the organization to see the
big picture
• Get as much detail as possible in the current area of
focus (vertical)
• Request that your sponsor identify who should be
interviewed.
Collecting the Requirements
• Identify who will be interviewed
– Technology
• The data gurus (these people have been around a long time and
know the details)
• Application programmers
• Pseudo technical people within a business area
• DBAs
• Data modelers
• System administrators
• IT management to identify the future
Collecting the Requirements
• Develop an interview questionnaire
• Build an agenda for the interview sessions
• Prepare the interviewees
– Hold a single meeting with all interviewees to discuss
the project, intentions, etc
– Set the tone for all interviews
– Encourage questions
– Enables you to identify good and bad candidates early
(now you can plan for each person)
Collecting the Requirements
• Conduct the interview
– Remain within the roles established for the interview
team
– Validate what you have collected with the user as soon
as possible
– Define terms with the users (profit, revenue, sales)
– Try to talk on their level and avoid using confusing
technology terms (use their business lingo when
possible)
Collecting the Requirements
• Conduct the interview
– Try to remain flexible during the interview process
• Meet with unexpected people
• Run past the allotted time
• Discuss topics somewhat out of the focus of the interview.
– Schedule breaks and limit the number of interview
session per day to about five
– Continue to manage expectations
Collecting the Requirements
• Potential interview questions for an executive:
– What are the objectives of your organization? What are
you trying to accomplish?
– How do you measure success? How do you know you
are doing well? How often do you measure yourself?
– What are the key business issues you face today? What
could prevent you from meeting these objectives? What
would be the impact?
Collecting the Requirements
• Potential interview questions for an analyst:
– What are your groups objectives? How do you
accomplish them? How do you achieve it?
– What are your success metrics? How do you
know you are doing well? How often do you
measure?
– What issues do you currently face?
– Describe your products, vendors, etc? Is there a
natural hierarchy?
Collecting the Requirements
• Potential interview questions for an analyst:
– What type of analysis do you perform? What data is
used? How do you get it? What do you do with it?
– What analysis would you like to perform?
– What dynamic analysis needs do you have? Who drives
these needs? How long does it take to perform? Are you
able to conduct deeper levels of analysis?
– What analytical capabilities would you like?
Collecting the Requirements
• Potential interview questions for an analyst:
– Where are the bottlenecks in obtaining information?
– How much historical information is needed?
– How will improved information access impact you and
your organization? What is the financial impact?
– What reports do you currently use? Which data
elements on the reports are important? How is this
information used? Is it combined with anything else?
Collecting the Requirements
• What to discuss with IT:
– Request an overview of the operational systems
– What are the current tools and technologies used to
share information?
– What types of analyses are performed?
– How are detailed analyses supported and conducted?
– What are the data quality issues?
– Where do bottlenecks exist?
Collecting the Requirements
• What to discuss with IT:
– What concerns do you have about data
warehousing in the organization? What
roadblocks do you see?
– What expectations do you have of the
warehouse?
– How do you expect the warehouse to impact
you?
Collecting the Requirements
• Types of users you will interview
– Abused User
• Involved in earlier attempts
• Unwilling to cooperate
–
–
–
–
Overbooked User (To busy to meet)
Comatose User
Overzealous User
Nonexistent User (Use technology to drive the
needs)
Collecting the Requirements
•
•
•
•
Wrap Up
Review the interview results with the team
Prepare and publish the results
Establish what will be done next
Dimensional Modeling
Jeffrey T. Edgell
The Dimensional Model
• More intuitive structure for presentation and
reporting
• Likely predates the E/R approach
– General Mills & Dartmouth University
developed a fact and dimension structure
– Nielsen Marketing Research used this on
grocery and drug store auditing and scanner
data in the 70s and 80s.
The Dimensional Model
• Dimensions are descriptive
• Facts are likely numeric and are
measurement based
• Additive facts are vital to allow aggregation
of many records during a retrieval
• Page 145 (A typical dimensional model)
The Argument for the
Dimensional Model
• Tools can utilize a standardized framework
• Query tools can leverage against this for
performance optimization
• High performance entry browsing is
possible
• All queries can be initially constrained thus
significantly increasing performance
The Argument for the
Dimensional Model
• Easily adapts to unpredictable queries
• Extends to allow the addition of new tables
or data elements
– will not require rebuilding the database from
scratch
– data does not need to be reloaded
– existing reports and query tools do not need to
be redesigned or implemented
The Argument for the
Dimensional Model
• The model can be altered as follows without
interruption:
– The addition of new facts (consistent with the
defined grain)
– The addition of new dimensions
– The widening of a dimension table
– Changing the detail of a dimension to a lower
level
The Argument for the
Dimensional Model
• The dimensional model exhibits a
predefined set of approaches used to deal
with common issues.
– Slowly changing dimensions
– Heterogeneous products (track different lines of
business i.e. checking & savings)
– Pay-in-advance data bases (look at individual
components as well as the total)
– Event handling (no facts)
The Argument for the
Dimensional Model
• Aggregation in a warehouse allows for
query performance normally delegated to
hardware to solve (greatly increasing $)
• A standard set of schemas for different
business types and applications exist
The Bus
• Supports the incremental approach
• The data mart approach has often lead to
development of warehouse absent of a
corporate framework
• Stovepipe decision structures result
• Produces a uniform global structure
eliminating the pocket or stovepipe data
marts
The Bus
• Look at the entire enterprise as you design
and build the data marts
• A high level architecture must be defined
that explains the entire structure
• A detailed architecture must be developed
to support each data mart as they are
confronted
Conformed Dimensions
• Dimensions used to represent concepts
across the enterprise must be standardized
and agreed upon
–
–
–
–
customer
product
time
potentially not region (sales & management)
Conformed Dimensions
• Conformed dimensions must be carefully
managed, maintained, and published to
ensure consistency
• The conformed dimension represents the
central source description of which
everyone agrees
• If the conformed dimension approach is not
observed, the bus will not properly function
Conformed Dimensions
• With conformed dimensions
– One dimension table relates to multiple facts
– Browsers are consistent with the dimension
providing a unified view
– Rollups and meanings remain consistent across
facts
Conformed Dimensions
• Design
– Lowest level of granularity possible (based on
the lowest level defined)
– Use the sequential numeric key (surrogate key)
Conformed Facts
• Occurs during the definition of conformed
dimensions
• Relates common measurements accurately
– Cost
– Profit
– Unit price
• If facts are different use different names
(marketing profit & sales profit)
• As much political as technical
When the Bus is not Required
• The business you are dealing with is
intentionally segmented
– Components operated autonomously with no
unified corporate view required
– Products or business areas are disjoint
– For example a company sells music and repairs
train engines (no business or product synergy
except at the very top)
The Components of the
Dimensional Model
•
•
•
•
Facts
Dimensions
Attributes
The Bus (optional but highly suggested)
Operations
• Drill down and rollup
– Example on page 168
Snowflakes
• What is it?
The removal of low cardinality fields from a
dimension placed in a new table and linked
back with keys
• Complicates design detail
• Decreases performance
• Saves some space but normally not a
significant amount
• Bit map indexes can not be effectively
utilized
When a Snowflake is OK
• When used as a subdimesnion
– The data in the subd is related to the dimension
are at different levels of granularity
– The data load times for the data are different
– Examples:
• County and state
• District and region
• Ship and battle group
Good Descriptive Dimensions
• Large dimension tables
• Highly descriptive
• Without good descriptive dimensions, the
warehouse is not useful
• Use:
– full words, no missing values (null), QA,
metadata
Common Dimension Techniques
• Time
– example figure 5.7 page 176
• Address
– example page 178
• Commercial address
– example page 179
Slowly Changing Dimensions
• What to do:
– Type 0: Ignore the change
– Type 1: Overwrite the changed attribute
– Type 2: Add a new dimension record with new
value of the surrogate key
– Type 3: Add an “old value” field
Slowly Changing Dimensions
• Ignore the change
– Not typically a good solution to the problem,
but is done.
• Overwrite the changed attribute
– Valid when correcting a value from the source
• Add a new dimension record with a
generalized key
– Retains history of a changed product
Slowly Changing Dimensions
• Add an “old value” field
– Valid when on the previous change is needed
for decision making
Slowly Changing Dimensions
• Type 2 example:
Change in product (bottle changes from
platic to glass)
Key
Type
SKU
001
Plastic
1234
002
Glass
1234
Slowly Changing Dimensions
• Type 3 example:
Regional divisions of a company changes
(only one historical change is supported)
Region
Old Region
Gold Silver Platinum
North South East
Bronze
West
The Monster Dimension
• It is a compromise
• Avoids creating copies of dimension records
in a significantly large dimension
• Done to manage space and changes
efficiently
Example 1
The Monster Dimension
Customer_Key
name
address
city, state
birth_date
date_first_purchase
income
number_children
education
total_purchases
credit_score
Customer_Key
Basically constant
name
address
city,state
birth_date
date_first_purchase
May change
with each
purchase
Demographics_Key
income_band
number_children
education_level
Bands used to
total_purchases_band minimize
credit_group
possibilities
The Monster Dimension
• Case 1 (Rapid change)
– Large dimensions can be dynamic because of
the amount of information contained
– Certain aspects must be maintained in the
dimension, over time, to understand impacts
• demographics
• customer data
• product lines (for companies in acquisition)
The Monster Dimension
• The solution to very dynamic large
dimensions
– identify the dynamic areas of the dimension
– segment the hot areas into there own
independent dimensions
– The relative static information remains in the
original dimension
The Monster Dimension
• The trade off (plus)
– the warehouse can accurately retain significant
changes in a dimension over time
– to slow the rate of change down extremely
dynamic attributes should be banded to slow
the rate of change
– All possible combinations in the dimension
become finite (discrete) and are thus
manageable
The Monster Dimension
• The trade off (minus)
– Loss of detail in the bands (no longer exact)
– Once bands are defined they must be enforced
from that point on
– Slower browse performance required when
combining the segmented table with the
original table
– Impossible to combine the data without a single
instance of a fact (nothing to relate the
dimensions)
Example 2
The Monster Dimension
Employee Table
name
address
date_of_birth
social_security_num
…
title
years_with_company
income
division
purchase_level
Employee Table
name
address
date_of_birth
social_security_num
Corporate Demographics
position_grade
income_band
division
service_years_band
Degenerate Dimensions/Key
Definition - Critical data provided in the
legacy environment that normally remains
independent. Typically the old key from the
current fact information you are using with
no supporting data
Degenerate Dimensions/Key
• Likely found in the header of a file
• The other items have been absorbed in other
dimensions
– customer, date, vendor, item
• The remaining item has no supporting
attributes but is important
– CLIN, Requisition #, Order #
• Useful information and should be absorbed
in the fact table
Degenerate Dimensions/Key
• Useful information and should be absorbed
in the fact table
• If there is other supporting attributes, it
becomes a typical dimension
Junk Dimensions
• Resident flags, status codes, and
miscellaneous information persists after the
dimensional design is near complete
• Alternatives:
– Place the flags in the fact tables
– Make each attribute a dimension
– Remove the attributes completely
Junk Dimensions
• Leave the flags in the fact tables
– likely sparse data
– no real browse entry capability
– can significantly increase the size of the fact
table
• Remove the attributes from the design
– potentially critical information will be lost
– if they provide no relevance, remove them
Junk Dimensions
• Make a flag into it’s own dimension
– may greatly increase the number of
dimensions, increasing the size of the fact table
– can clutter and confuse the design
Combine all relevant flags, etc. into a single
dimension
– the number of possibilities remain finite
– information is retained
Keys, Keys, Keys
• Surrogate keys (always use)
– 4 byte integer (232 or two billion + integers)
• Date keys should use surrogates as well
– dates are typically 8 bytes -- saves 4 bytes per
fact
• Do not use smart keys with embedded
meanings
• Do not use legacy or production keys
Just the Facts
• Attempt to make all facts additive
– simplifies calculations across dimensions
– all numbers are not additive facts
• Semi-additive facts can be used but
understand they are there
– averages, max, min
• Non-additive facts often are avoided but
may have value
– weather conditions (non-discrete), non-discrete
discriptions
Steps to Designing a Fact Table
• Time to choose:
–
–
–
–
data mart (functional business area)
grain of the fact table (what level of detail)
dimensions associated to the data mart
the facts relative to the data mart
Data Mart
• Single operational source data marts
provide the least amount of risk
• Multiple operational source data marts
typical provide more cross functional value
• Examples: (remember, processes you
measure)
–
–
–
–
Marketing
Sales
Inventory
Productivity
Fact Table Grain
• Without this, dimensions can not be
accurately defined
• Select as low of a grain as possible
– handles unexpected queries
– adapts readily to additional facts and
dimensions
– delivers the most comprehensive solution
– Consumes more space
– Performance can be an issue
Fact Loads
• By record
– account for every transaction or activity
recorded (ATM)
• Snapshot
– A picture of the related facts at a specific point
in time (monthly reporting)
• Line item
– track and reflect the status of line item activity
(PO)
Dimensions
• Once the grain is defined, basic dimensions
will be evident from the grain (customer,
time, etc.)
• Addition of other dimensions and junk
dimensions
• All dimensions can not be at a lower level
of granularity than the lowest fact table
grain
Identifying Facts
• The grain of the fact table dictates the facts
• All facts must be at the same level
• Individual transaction tables typically have
1 fact (the numeric value of the transaction)
• Snap shot and line item fact tables will
likely contain multiple facts in that multiple
additive facts are captured
• Keep all three types separated
Fact Table Families
• Process chain (supply chain, linear)
–
–
–
–
fact table represents each step in the process
RFI-RFP-RFQ-Contact-Delivery
supply chain process example: page 200
each fact is connected on the bus
• Value Circle (parallel measurement)
– health care (example page 202)
– retail
Fact Table Families
• Heterogeneous Product Schemas
– Service offered by the business are distinct and
separate
– banking (checking, savings, loans, etc.)
– Insurance (life, home, auto, etc.)
• Transaction an Snapshot Schemas
– Snapshot (periodic picture) example page 210
– Transaction (activity detail) example page 207
Aggregate Families
• Used to improve query performance
• Typically roll ups of facts along a
dimension for anticipated reporting and
querying
• Aggregate tables can also be used to
combine details from two fact tables of
varying granularity
Factless Fact Tables
• Used for two reasons:
– record an activity (student attendance page 213)
• answers what the most popular classes were
• what days are frequently missed
– Coverage (account for activity that may not
have happened) (example page 215)
• An entry is placed in the fact table for all item of
interest
• answers questions regarding what did and did not
have activity