Database Schema Akanksha E. V. Anantha Lakshmi Sze Eng Koon (HT00-6255Y) (HD99-9054A) (HD99-2800H) Material Available at http://www.comp.nus.edu.sg/~szeek/cs6203 Agenda Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables Designing starflake schemas Multidimensional.
Download
Report
Transcript Database Schema Akanksha E. V. Anantha Lakshmi Sze Eng Koon (HT00-6255Y) (HD99-9054A) (HD99-2800H) Material Available at http://www.comp.nus.edu.sg/~szeek/cs6203 Agenda Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables Designing starflake schemas Multidimensional.
Database Schema
Akanksha
E. V. Anantha Lakshmi
Sze Eng Koon
(HT00-6255Y)
(HD99-9054A)
(HD99-2800H)
Material Available at
http://www.comp.nus.edu.sg/~szeek/cs6203
Agenda
Data warehouse
Identifying facts and dimensions
Designing fact tables
Designing dimension tables
Designing starflake schemas
Multidimensional schema
Data Warehouse
Relational database designed for query
and analysis
Contains historical data derived from
transaction data
Enables organization to consolidate data
from several sources
Schema
Consists of DB objects such as tables,
views, indexes
There are many types of schema, e.g.,
star, snowflake
Fact
In decision support DW, a number of
queries ask about an essential fact
E.g., top 20% spending customers
Fact: Loyalty card transactions
Characteristics
Transactions that occurred in past and won’t
change
Analyzed in many ways by cross-referencing
with attributes
Fact Table
A table that contains facts
Contains numeric, additive fields
(measurements of the business)
It has two types of columns:
containing facts
foreign keys to dimension tables
Dimension
A structure that categorizes data to enable
end users to answer business questions
e.g., Time, Location, Customer
Example
Video chain store builds DW to analyze sales of
its products across all stores over time
Facts
Effect of promoting one product on sale of related
product that is not promoted?
Product sales before and after the promotion?
sales (units sold or rented) and profits
Dimensions
products, locations (stores), promotions, time
Major Considerations
Data warehouse supports business process, not
specific queries
Must understand what information will be used
Challenge:
Solution should be effective for a reasonable time
period (3-5 years)
Design should allow unknown queries to be able to
perform
Identifying Facts & Dimensions
Given large entity model
1.
Look for elemental transactions in business
process
2.
Determine key dimensions that apply to facts
3.
Check that a candidate fact is not a dimension
table
4.
Check that a candidate dimension is not a fact
table
If 3 or 4 changes assignment go to step 2
1) Look for Elemental
Transactions in Business
Examine business and identify transactions of
interest
Transactions that describe fundamental events
e.g., record of account transactions made by a
banking customer
Is information operated on by business?
Don’t assume reported transactions are facts
Design to store most detailed transaction to
prevent restructuring in future
Example
Consider sales information in retail shop
Stored record
Store_id, Product_id, Date, Sale
Business operates on basket,
i.e.,individual sales transaction and not
aggregate sales
Fact should be basket transaction
2) Determine Key Dimensions for
Facts
Find out entities that are associated with
the facts
Need to focus on key dimensions
Account transaction
Account_id
Owner_id
Owner_id
Name
Account_type
Address
Balance
Contact_no
account
analyze account
transaction by
account
customer
analyze how
customer uses
services
3) Check If Fact Is Dimension
Entities may appear to be fact tables but
may be combination of both facts and
dimensions
Consider whether attributes were
designed around data entry requirements
Data in fact table rows should not vary
over time
Example
Address Zip Date Connect … Date
cables Date
sub
laid
added
Address
Mistaken for fact table because users will query on
information about address
Example (Cont.)
Code Address Zip
Address
Code Date
Code Date
Date
Cable laid
Date sub
added
•Will affect the DB size
•Query of operational events much easier
4) Check If Dimension Is Fact
An entity could be either dimensions or
facts
Consider focal point of analysis of the
business
Good check: by how many dimensions
entity can be viewed, >3 probably fact
Account transaction
Account_id
Owner_id
Owner_id
Name
Account_type
Address
Balance
Contact_no
account
customer
analyze how
customer uses
services
Customer_id
Name
Age
Location_id
Location
Occupation_id
customer
Location wise % patronage
Occupation
Designing Fact Table
No practical limit to size of fact table
Balance between size of fact table and
value of data
Techniques to Reduce Size of
Fact Tables
Identification of significant historical period
for each supported function
Determination of samples to satisfy
requirements of detailed data
Selection of relevant / appropriate
columns
Techniques (Cont.)
Minimization of column size
Determination of intelligent or nonintelligent keys
Determination of format of storing time
Partition the fact table (Next group)
1.Identification Of Significant
Historical Period
Draw retention period graph showing
period and detail necessary for each
business function
Sales Analysis
Jan ‘98
July ‘98
Jan ‘99
e.g., comparison of sales of seasonal
products such as winter clothing, text
books etc
2. Determine samples to satisfy
requirement of detailed data
Appropriate in situations where analysis
requirement is to analyze trends
The business requirement does not
require all the detailed data
Store samples and aggregate the rest
3.Selection of Relevant /
appropriate columns
Decide whether optional data needs to be
stored in the fact table
e.g., for a table related to bank accounts
The columns typically required are:
Acc#, IC#, Amt#, Bank # or Branch#
Optional columns related to Teller Name and
Time of entry of records to be decided based
on queries’ requirements
4. Minimization of Column Size
Saving per row can have significant effect
on total table size
e.g., a typical transaction table contains 4
billion rows (2 million customers, 3
transactions per day per customer for 2 or
3 yr period) saving of 20 bytes per row will
save 20 x 4 =80 GB
5. Intelligent /Non-intelligent keys
A fact table can be structured in two ways using
Intelligent keys:
Each key represents unique identifier for the item in
the real world
Non-Intelligent:
Each unique key is generated automatically and
refers to unique identifier of the item in the real world
Intelligent Keys
Sales
Salesman_name
Customer
Vehicle_desc
Date
Sales_price
Non – Intelligent Keys
Sales Analysis
Time
Salesman
Time id
Date
Sales id
Salesman Name
Sales
Sales id
Cust id
Vehicle id
Time id
Sales_price
Vehicle
Vehicle id
Vehicle Description
Comparison
Advantages
Intelligent keys improve query performance
by avoiding joins
Disadvantages
For intelligent keys, if any of the unique
identifier changes, a lot of rows in fact table
have to be updated to reflect new identifiers
6. Determination of format of
storing time
Time information can be stored in many
ways within fact table, e.g.,
Use of foreign key into time dimension table
Actual physical dates are stored within
dimension table
Determination of format of storing
time (Cont.)
Possible techniques to store dates are
Storing the physical date
Storing an offset from the inherent start of the
table
Storing a date range
Use physical dates rather than foreign
keys as the cost of storing is minimum
Determination of format of storing
time (Cont.)
A fact table is usually partitioned in a unit
of time (e.g. week, month,quarter etc)
Dates can be referred as offsets from the
inherent start
Storage costs are low. Two bytes enough
to store up to 31 numbers
Determination of format of storing
time (Cont.)
Disadvantages:
Queries have to be constructed to convert physical
dates into date offsets.
E.g.: To return sum of transactions on ‘9-Jan-00’
Select count(*) from customer
where date=‘9-Jan-00’ – ‘1-Jan-00’;
Solution:
If access tools can’t interpret offset, create view that
looks like logical table by adding offset to start date
Date Range
Significant saving in disk capacity and
query performance
Reduces table size
Has significance when changes occur
substantially over time
Date Range (Cont.)
Select count (*) from customer where
date between (st_date,end_date);
A number of access tools may not be able
to process fact tables in this view
Sometimes a Cartesian product is formed
by joining time dimension table against
date range in the fact table
Date Range (Cont.)
Cost of Cartesian product is high, it
requires
Significant processing power
Temporary space
Consider the use of date range if access
tools can cope directly with structure and
do not allow data expansion
Designing Dimension Table
Done after fact tables are designed
dimension tables support querying on fact
tables
Wrong design is not a major disaster
volumes relatively small
restructuring costs relatively small
Considerations
Star dimension
Hierarchies and networks
Dimensions that vary over time
Managing large dimension tables
Star Dimension
Denormalized tables
put reference information into a single table
speed up query performance
Rely on perceived use of information by
typical queries
Not appropriate when additional data is
not accessed often
overhead of scanning expanded table
Star Dimension (cont.)
View enrollment size by department,
faculty
mod#, title, dept#
1
acad_year, mod#, enrollment
Put under same table
n
dept#, department, fac#
1
n
fac#, faculty
mod#, title, dept#, department, faculty#, faculty
Star Dimension (cont.)
Shape of a star
dimension
dimension
dimension
fact
dimension
dimension
Hierarchies and Network
Not possible to denormalize all entities
into star dimensions
Many-to-many relationships should not
be denormalized into star dimension
not efficient
Multiple hierarchies are used to
represent different views
denormalized the hierarchy likely to be
used by most queries
Hierarchies and Network (cont.)
Example
Fact
Dimension
Dimension
Dimension
Dimension
Dimension
Dimension
Dimension
Hierarchies and Network (cont.)
Useful if query profiles do not change to a
point where other hierarchy becomes more
popular
existing star dimension table is useless
For minor changes
add columns
note impact on canned queries if columns are
dropped or modified
Dimensions That Vary Over
Time
Change in business
t-shirts move from menswear to unisex
baked beans move from canned foods to canned
vegetables
May need to support queries that compare
facts based on present and past groupings
for example, to compare sales by departments
between past and present, need to use both old
and new classification
instead of updating record in dimension table,
add new record and validity dates
Dimensions That Vary Over
Time (cont.)
Year 1996
select sum(s.revenue_achieved)
from sales_year_to_date s, product_dimension pd
where s.product_id = pd.id
and pd.department = ‘Menswear’
and pd.end_date > 1-Jan-96
and pd.start_date < 31-Dec-96;
Year 1997
select sum(s.revenue_achieved)
from sales_year_to_date s, product_dimension pd
where s.product_id = pd.id
and pd.department = ‘Menswear’
and pd.end_date > 1-Jan-97
and pd.start_date < 31-Dec-97;
Dimensions That Vary Over
Time (cont.)
If query needs to compare a significant
event in corporate calendar, year on year,
keep in time dimension
Great Singapore Sales in 1998 is from 15 Jul
to 15 Aug
Great Singapore Sales in 1999 is from 10 Jun
to 10 Jul
Instead of just joining sales_year_to_date
and product_dimension tables, now we
also join with time_dimension table.
Managing Large Dimension Tables
Need to watch out for dimension tables that
grow too large over time
Indicated by:
especially for those dimension tables that store
new values instead of updating existing tuples
size similar to that of fact table
full table scan takes too long
Horizontal partition the large dimension tables
product_dimension_1990s
product_dimension_2000s
Designing Starflake Schema
Starflake schema
combination of denormalized star and normalized
snowflake schemas, plus some additional ideas
Difficult to restructure all entities into a set of
distinct dimensions
common for entities to span one or more
dimensions
for example, common for stores to apply local
pricing for products, thus need to join product and
store tables instead of getting the price from
product table
allow a degree of crossover between dimensions
Locations
Department
Business
unit
Products
Style
Color
Sales
transactions
Products
Size
Time
Locations
Region
Time
Facts
Week
Month
Summer
sales
Easter
sales
Star dimensions
Snowflake dimensions
Designing Starflake Schema (cont.)
Crossovers of starflake schema will reduce over time
Degree of change should decrease as we move from
outlying entities towards fact tables
once system is operational
requirements become better understood
expect greatest amount of change for outlying entities
need not spend much time on requirements that use outlying
entities
Need to direct query to most appropriate source
starflake schema not supported by access tools
create views and synonyms on fact tables to be used by
access tools
Multidimensional Schemas
Class of decision support queries that
analyze data by representing facts and
dimensions within a multidimensional cube
Each dimension occupy an axis, and values
within the cube correspond to factual
transactions
Trend analysis
Good for viewing statistical
operations/aggregations
apply functions against planes of cube
cube can be pivoted, sliced, diced
Time
Location
Product
Location
Time
Location
Product
Products
Multidimensional Schemas (cont.)
Cube extracts data from data warehouse
created as physical stores in many access tools
populate on demand or in advance
If data warehouse is constantly being
accessed, performance will be poor
most appropriate solution is when cube satisfies
majority of queries, but require substantial design
effort
Summary
Data warehouse
Identifying facts and dimensions
Designing fact tables
Designing dimension tables
Designing starflake schemas
Multidimensional schema