Presentation Title Up to Four Lines of Text. Lorem Ipsum
Download
Report
Transcript Presentation Title Up to Four Lines of Text. Lorem Ipsum
Best Practices for Query
Performance In a Data Warehouse
Calisto Zuzarte
IBM
[email protected]
Session Code: D09
May 13, 2010 8:30AM–9:30AM
Platform: Linux, Unix and Windows
Data Warehouse Life Cycle
•
Database design / Application design
•
•
Database performance layer implementation
•
•
The Warehouse Application architects and Database Administrators work together
to design the queries and schema before they put the application in production
In order to meet SLAs, DBAs usual go through some iterations augmenting the
database with performance layer objects and set up the initial configuration to get
good performance
Database tuning operations
•
During production, with changing requirements and change in data, there is ongoing tuning required to keep operations smooth.
Motivation
• Data warehouse environments characteristics:
• Large volumes of data
• Millions/Billions of rows involved in some tables
• Large amounts of data rolled-in and rolled-out
• Complex queries
•
•
•
•
Large Joins
Large Sorts,
Large amounts of Aggregations
Many tables involved
• Ad Hoc Queries
• It is important to pay attention to query performance
Objectives
• Provide recommendations so that you can improve data
warehouse query performance
•
•
•
•
Database Design considerations
Application Design considerations
Performance Layer Considerations
Ongoing Tuning Considerations
Agenda
•
•
•
•
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Performance Layer
Best Practices – Configuration and Operations
Best Practices – Database Design
• Best Practices - Parallelism
• Inter-partition Shared nothing parallelism
• Intra-Query Parallelism (SMP)
• Best Practices - Partitioning
•
•
•
•
Database Partitioning
Table Partitioning
Multi-Dimension Clustering
UNION ALL Views
• Best Practices – Schema
• Best Practices - Compression
Best Practices - Parallelism
• Database partition feature (DPF) is recommended
• To achieve parallelism in a data warehouse
• For scalability and query performance
• SMP (Intra-Query Parallelism) not recommended
• In concurrent multi-user environments with heavy CPU usage
• SMP recommended
• When CPUs are highly under utilized
• When DPF is not an option
Partitioning (Complimentary Strategies in DB2)
• Database Partitioning (DPF) … DISTRIBUTE BY HASH
•
Key Benefit : Better scalability and performance through parallelism
• Multidimensional Clustering (MDC) … ORGANIZE BY DIMENSION
•
Key Benefit : Better query performance through data clustering
• Table (Range) Partitioning … PARTITION BY RANGE
•
•
Table Partitioning
Key Benefit : Better data management (roll-in and roll-out of data)
•
•
UNION ALL Views
Key Benefit : Independent branch optimization
Divide And Conquer ! Distribute, Partition, Organize !
Organize By
Partition By
Distribute By
Best Practices – Database Partitioning
• Collocate the fact and largest dimension
• Choose to avoid significant skew on some partitions
• Avoid DATE dimension where active transactions for
current date all fall on one database partition
(TIMESTAMP is good)
• Possibilities for workload isolation for data marts
• Different partition groups but common dimension tables
• Needs replicated tables (discussed later)
Best Practices – Table Partitioning
• Recommend partitioning the fact tables
• Typically based on DATE dimension
• Works better with application key predicates applied directly
• Table or Range Partitioning
• Recommend table or range partitioning (V9.7 :partitioned indexes)
• Choose partitioning based on roll-in / roll-out granularity
• UNION ALL Views
•
•
•
•
Each branch optimized independently
Use with well designed applications (Dangers of materialization)
Large number of branches require time and memory to optimize
Needs predicates with constants for branch elimination
Best Practices – Multidimensional Clustering
(MDC)
• Recommend defining MDC on the fact table
• Guaranteed clustering (Avoids the need to REORG for clustering)
• I/O optimization
• Compact indexes (compact, coexists with regular indexes)
• Choose dimensions based on query predicates
• Recommend the use of 1 to 4 dimensions
• Need to ensure dimensions are chosen such that they do not
waste storage
• Could choose a finer granularity of Table partitioning
range
• For example: Table partition range by month, MDC by date
Star Schema
STORE
PRODUCT
Store_id
Product_id
SALES
Region_id
…
TIME
Date_id
Month_id
Quarter_id
Year_id
Product_id
Store_id
Channel_id
Date_id
Amount
Quantity
…
Class_id
Group_id
Family_id
Line_id
Division_id
…
CHANNEL
Channel_id
…
Product Dimension
Dimension Hierarchy
Division
Level 5
Time Dimension
Line
Level 4
Year
Family
Level 3
Group
Level 2
Class
Level 1
Product
Level 0
Store Dimension
Channel Dimension
Retailer
Channel
Store
Quarter
Month
Date
Sales Fact
Best Practices - Schema
• Surrogate Keys
• As far as possible use application keys themselves
• allows predicates to be applied/transferred directly on the fact table
• DATE is a good candidate (easier to roll-in/roll-out and for MDC )
• Star Schema / Snowflakes
• Separate tables for each dimension hierarchy (snowflake) may
result in a large number of joins
• Flattened dimensions may contain a lot of redundancy (space)
• Define Columns NOT NULL when appropriate
• Many optimizations that are done based on NOT NULL
• Define Uniqueness when appropriate
• Primary Keys / Unique Constraints / Unique Indexes
Compression
• Table, Index and Temp Table compression
• Huge benefits with storage savings
• With table and TEMP compression 30-70%
• With Index compression 30-40%
• Performance gains because
• Less I/O and better use of bufferpools
• TEMP table compression helps operators like Hash
Join, Merge Join, Sorts and Table Queues if they spill
Best Practices - Compression
• Consider compression particularly with the fact table
• Strongly recommend compression on the fact table when
not CPU bound
Agenda
•
•
•
•
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Performance Layer
Best Practices – Configuration and Operations
Best Practices – Application Considerations
• Use constants instead of expressions in the query
• Example
• WHERE DateCol <= CURRENT DATE – 5
• Use VALUES(CURRENT DATE – 5) to get the resulting constant first
and use it in the query
• Avoid expressions on indexed columns
• Example
• WHERE DATECOL – 2 DAYS > ‘2009-10-22’
• WHERE DATECOL > ‘2009-10-22’ + 2 DAYS
Best Practices – Application Considerations
• Avoid mixing data types in join predicates
• Example
• WHERE IntegerCol = DecimalCol
• Use Global Temporary Tables to split a query if it contains
more than 10-15 tables
• Reduces optimization time
Agenda
•
•
•
•
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Performance Layer
Best Practices – Configuration and Operations
Best Practices – Performance Layer
• Indexes
• Statistics
• Distribution Statistics
• Column Group Statistics
• Statistical Views
• Constraints
• Referential Integrity
• Materialized Query Tables
• Replicated Tables
Indexes
• Indexes are a vertical subset of the data in the table
• Indexes provide ORDER
• Indexes may allow for clustered access to the table
Index Considerations
• To get Index Only Access instead of more expensive ISCANFETCH or TSCAN (Table Scan)
• To avoid SORTs particularly those that spill
• To promote index-ORing and index-ANDing
• To promote Star Joins
• When you have range join predicates
• Better possibilities with Nested Loop Join
• Indexes for clustering (MDC)
Cardinality Estimation
• Estimating the size of intermediate results is critical to getting
good query execution plans
• Without sufficient information, the optimizer can only guess
based on some assumptions
• Data skew and statistical correlation between multiple
column values introduce uncertainty
• Pay attention to DATE columns
Best Practices - Statistics
• Collect distribution Statistics when there is skew and
predicates use constants
• Consider a high number of quantile statistics on columns
with DATE range predicates and character string columns
Column Group Statistics
Country
1
2
3
City
Hotel Name
Germany Bremen
Hilton
Germany Bremen
Best Western
Germany Frankfurt
InterCity
Germany Frankfurt
Shangri-La
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
• Example: COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’
• No CGS: Selectivity = ½ * 1/3 = 1/6 …Estimate 1 row
• With CGS: Selectivity = 1/3 …Estimate 2 rows
Problem Scenario - Skew
10000000 rows
CUST Table 100 rows, 100 custids
Frequency Statistics – SALES Table
CUSTID
CNAME
CUSTID
# of Rows
1
‘ABC’
4
2000000
2
‘DEF’
10
700000
3
‘GHI’
8
500000
4
‘IBM’
63
300000
5
‘JKL’
72
100000
6
‘MNO’
9
50000
7
‘PQR’
12
20000
…
…
…
…
…
…
…
…
100
‘XYZ’
5
5
SELECT … FROM SALES, CUST
WHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID
Cardinality Estimate with Uniformity = 100,000
Actual Cardinality : 2,000,000 !!!!!!!!!!!!!!!!!!!!
Best Practices - Statistics
• Collect Column Group Statistics with multiple predicates on the
same table
• WHERE Country = ‘CANADA’ and City = ‘TORONTO’
RUNSTATS … ON ALL COLUMNS AND ON COLUMNS ((country, city) … ) …
• Consider Statistical Views when
• There is skew on the join column
• There is a significant difference in the range of values in the fact and the
dimension
CREATE VIEW SV1 AS
(SELECT C.* FROM CUST C, FACT F WHERE C.CUST_ID = F.CUST_ID)
ALTER VIEW cust_fact ENABLE QUERY OPTIMIZATION
RUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION
Referential Integrity (RI)
• Facilitates aggregation push down
• Example in the appendix section
• Eliminates redundant joins in views
• RI helps determine that queries that do not require data from a
primary key table need not do that join even if it is in the view
• Helps with Materialized Query Table matching
• Allows Queries to match MQTs with more dimension table joins
Consider Materialized Query Tables
Joe’s Query
Q9
Q9
GB
Sue’s
GB
Joe’s Q
JOIN
JOIN
JOIN
Dim2
Bob’s Q
GB
JOIN
Dim2
JOIN
Dim1
GB
Fact
Dim1
JOIN
Dim2
JOIN
Fact
Dim1
JOIN
Dim2
Fact
GB
Sue’s Query
Fact
JOIN
Dim1
JOIN
Bob’s Query
Fact
Dim2
Dim1
MQT
Best Practices - Defining Materialized Query
Tables
• What MQTs should I define ?
• Estimate the size of the candidate MQTs by executing COUNT
queries against base tables.
• Try to achieve at least a 10X reduction in size between fact and
the MQT
• Build MQTs with a reasonable number of GROUP BY columns (3
to 6 dimension keys) at a time based on query patterns
• As far as possible build the MQT from the fact table alone
• Use Table Partitioning for the fact table and the MQTs
Best Practices - MQT Matching
• Define Referential Integrity to help with matching MQTs
that contain more tables than the queries
• Define Functional Dependencies for thinner MQTs
• Use COUNT_BIG instead of COUNT for DPF MQTs
• Define indexes on MQTs
• Keep statistics up-to-date
• Define base table columns NOT NULL as far as possible
• For example we can match SUM(A + B) with SUM(A) + SUM(B)
Best Practices – MQT Maintenance
• REFRESH IMMEDIATE
• Create an index on the GROUP BY columns
• Create the index on the set of columns that form a unique key
• Always keep the base table and MQT statistics up-to-date
• REFRESH DEFERRED
• If log space is an issue, consider NOT LOGGED INITIALLY or LOAD
from cursor
• An MQT can be temporarily toggled into a regular table by using
• ALTER TABLE … DROP MATERIALIZED QUERY
• ALTER TABLE … ADD MATERIALIZED QUERY
• Use ATTACH / DETACH if fact table and MQT are range partitioned
tables
Replicated Tables
JOIN
BTQ
CUST
•
•
•
•
CUST
COPY
JOIN
BTQ
SALES
CUST
COPY
JOIN
BTQ
SALES
CUST
COPY
SALES
Replicate dimension tables (unless collocated with fact )
Benefit : Avoids data movement
Important : Define suitable indexes
If too large, replicate a subset of frequently used columns
Agenda
•
•
•
•
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Performance Layer
Best Practices – Configuration and Operations
Best Practices – Configuration
• Optimization Level 5
• Registry Variables
• DB2_ANTIJOIN=EXTEND
• If slow queries have NOT EXISTS, NOT IN predicates
• DB2_REDUCED_OPTIMIZATION=YES
• If compile time is an issue
• Configuration thumb rules
• BUFFPOOL ~= SHEAPTHRES
• SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)
Best Practices - Statistics
• The DB2 Query Optimizer relies on reasonably accurate
statistics to get a good query plans
• User runs RUNSTATS when data changes (part of ETL)
• Statistics Fabrication (unreliable)
• DB2 keeps UPDATE / DELETE / INSERT counters
• Fabrication limited to a few statistics – Not enough
• Consider configuring Automatic Statistics
• Automatically collects statistics on tables in need
• Runs in the background as a low priority job
• Consider configuring Real Time Statistics
• Collects statistics on-the-fly
Summary – Best Practices
• Database Design :
• Parallelism, Partitioning, Schema, Compression
• Application Design
• SQL Tips
• Performance Layer
• Indexes, Statistics, Referential Integrity, Materialized Query
Tables, Replicated Tables
• Configuration and Operations
• Configuration, Collecting Statistics
Calisto Zuzarte
[email protected]