Transcript Title Slide
0
Best Practices to Improve Query Performance in a Data Warehouse - 1
Calisto Zuzarte, STSM, IBM, [email protected]
December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
1
Data Warehouse Life Cycle Database design / Application design
– The Warehouse Application architects and Database Administrators work together to design the queries and schema before they put the application in production
Database performance layer implementation
– In order to meet SLAs, DBAs usual go through some iterations augmenting the database with performance layer objects and set up the initial configuration to get good performance
Database tuning operations
– During production, with changing requirements and change in data, there is on-going tuning required to keep operations smooth. December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
2
Motivation
Data warehouse environments characteristics: – Large volumes of data • • • • • • Millions/Billions of rows involved in some tables Large Joins Large Sorts, Large Aggregations Many tables involved Large amount of data rolled-in and rolled-out – Complex queries • • Report Queries Ad Hoc Queries It is important to pay attention to query performance December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
3
Objective
Provide recommendations from a DB2 optimizer perspective to improve query performance through the Data Warehouse life cycle
December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
4
Agenda
SESSION 1 Best Practices – Database Design Best Practices – Application Design Best Practices – Configuration and Operations SESSION 2 Best Practices – Performance Layer December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
5
Best Practices – Database Design
Parallelism – – Inter-partition Shared nothing parallelism (DPF) Intra-Query Parallelism (SMP) Partitioning – – Database Partitioning Table Partitioning • • Table (Range) Partitioning UNION ALL Views – Multi-Dimension Clustering Schema December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
6
Best Practices - Parallelism
DPF or SMP or both ?
Database partition feature (DPF) is generally recommended to achieve parallelism in a data warehouse – Achieves scalability and query performance SMP (Intra-Query Parallelism) is NOT recommended in concurrent multi-user environments with heavy CPU usage SMP is only recommended – When CPUs are highly under utilized and when DPF is not an option December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
7
Partitioning (Complimentary Strategies in DB2)
Database Partitioning (DPF) – – CREATE TABLE … DISTRIBUTE BY HASH “Database Partitioning” “Distribution Key”
Key Benefit : Better scalability and performance through parallelism
Table Partitioning – – – Table (Range) Partitioning CREATE TABLE …PARTITION BY RANGE “Table Partitioning” “Table Partitioning Key”
Key Benefit : Better data management (roll-in and roll-out of data)
– – – UNION ALL Views “UNION ALL branch Partitioning” CREATE VIEW V AS (SELECT … FROM F1 UNION ALL … )
Key Benefit : Independent branch optimization
Multidimensional Clustering (MDC) – CREATE TABLE … ORGANIZE BY DIMENSION – “Cells”, “Blocks”, “Dimensions”
Key Benefit : Better query performance through data clustering
December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
Distribute By … Partition By … Organize By ..
CREATE TABLE … DISTRIBUTE BY HASH PARTITION BY RANGE ORGANIZE BY DIMENSION Database Partition 1 Database Partition 2 Database Partition 3 8 TS1 Jan TS2 Feb North South East West North South East West TS1 Jan TS2 Feb TS1 Jan TS2 Feb North South East West North South East West North South East West North South East West December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
9
Best Practices – DPF Partitioning
Collocate the fact and largest frequently joined dimension Choose to avoid significant skew on some partitions Avoid DATE dimension where active transactions for current date all fall on one database partition (TIMESTAMP is good) Possibilities for workload isolation for data marts – – Different partition groups but common dimension tables Recommend that dimension tables be replicated (discussed later) December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
10
Best Practices – Table Partitioning
Recommend partitioning the fact tables Recommend using the DATE dimension Works better with application key predicates applied directly Table (Range) Partitioning – – Consider partitioned indexes with V9.7
Choose partitioning based on roll-in / roll-out granularity UNION ALL Views – Define view predicates or CHECK Constraints to get branch elimination with query predicates (with constants only) –
Use UNION ALL views only with well designed applications
• •
Dangers of materialization with ad hoc queries Large number of branches needs time and memory to optimize
December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
11
Best Practices – Multidimensional Clustering (MDC)
Recommend defining MDC on the fact table – Guaranteed clustering (Avoids the need to REORG for clustering) – – I/O optimization Compact indexes (compact, coexists with regular indexes) Choose dimensions based on query predicates – – Recommend the use of 1 to 4 dimensions Need to ensure dimensions are chosen such that they do not waste storage Could choose a finer granularity of Table partitioning range – For example: Table partition range by month, MDC by date December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
12
Star Schema
STORE Store_id Region_id … TIME Date_id Month_id Quarter_id Year_id PRODUCT SALES Product_id Store_id Channel_id Date_id Amount Quantity … Product_id Class_id Group_id Family_id Line_id Division_id … CHANNEL Channel_id … December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
13
Dimension Hierarchy
Channel Dimension Channel Store Dimension Retailer Store Time Dimension Year Quarter Month Date Product Dimension Division Line Family Group Class Product Level 5 Level 4 Level 3 Level 2 Level 1 Level 0 Sales Fact December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
14
Best Practices - Schema
Surrogate Keys – As far as possible use application keys themselves • • allows predicates to be applied/transferred directly on the fact table DATE is a good candidate (easier to roll-in/roll-out and for MDC ) Star Schema / Snowflakes – Separate tables for each dimension hierarchy (snowflake) may result in a large number of joins – Flattened dimensions may contain a lot of redundancy (space) Define Columns NOT NULL when appropriate – Many optimizations that are done based on NOT NULL Define Uniqueness when appropriate – Primary Keys / Unique Constraints / Unique Indexes December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
15
Agenda
SESSION 1 Best Practices – Database Design Best Practices – Application Design Best Practices – Configuration and Operations SESSION 2 Best Practices – Performance Layer December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
16
Application Considerations - Expressions
Use constants instead of expressions in the query – Example • • SELECT … WHERE DateCol <= CURRENT DATE – 5 Use VALUES(CURRENT DATE – 5) to get the constant first and use it in the query Avoid expressions on indexed columns – Example • SELECT … WHERE DATECOL – 2 DAYS > ‘2009-10-22’ • SELECT … WHERE DATECOL > ‘2009-10-22’ + 2 DAYS Similar recommendation with cast functions – Example • • • SELECT … WHERE INT(CHARCOL) = 2009 SELECT … WHERE CHARCOL = ‘2009’ Note you may lose Errors/Warnings December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
17
Application Considerations – Table Partitioning / MDC
As far as possible put local predicates directly on Table Partition or MDC dimension columns of the fact table SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and T.Date = ‘2009-01-15’ and T.KEYCOL= F.TIMEKEYCOL
Simplify if the TIMEKEYCOL is correlated to the TIME values (For example TIMEKEYCOL= 20090115 for the date ‘2009-01-15’) SELECT ... FROM CUSTDIM C, FACT F WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and F.TIMEKEYCOL = 20090115 December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
18
Application Considerations – Table Partitioning / MDC
Another example … consider SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL
First get the values for MINKEY and MAXKEY SELECT MIN(KEYCOL) FROM TIMEDIM WHERE YEAR=2009 SELECT MAX(KEYCOL) FROM TIMEDIM WHERE YEAR=2009 Then write the SQL as follows SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL AND F.TIMEKEYCOL >= MINKEY AND F.TIMEKEYCOL <= MAXKEY December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
Application Considerations – General Recommendations
Avoid repetitions of complex expressions Use Global Temporary Tables to split a query if it contains more than about 15 tables and compile time is an issue 19 December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
20
Agenda
SESSION 1 Best Practices – Database Design Best Practices – Application Design Best Practices – Configuration and Operations SESSION 2 Best Practices – Performance Layer December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
21
Best Practices – Configuration and Operations
Configuration – – – Database Configuration DBMS Configuration Registry Settings Operations – Collecting Statistics December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
22
Configuration
Optimization Level 5 Avoid multiple bufferpools of the same page size Configuration thumb rules – – BUFFPOOL ~= SHEAPTHRES SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN) December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
Registry Variables
DB2_ANTIJOIN=EXTEND • If slow queries have NOT EXISTS, NOT IN predicates 23 December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
24
Registry Variables
DB2_REDUCED_OPTIMIZATION=YES – Set if compile time is an issue
IBM Service may recommend a more complex setting for example:
– DB2_REDUCED_OPTIMIZATION=
10,15,20
, 00011000….
•
First Part
: DB2_REDUCED_OPTIMIZATION=
A,B,C
– – – IF more than C joins, then "quick greedy" ELSE IF more than B joins, then use “greedy” ELSE IF more than A joins, use reduced “dynamic” strategy.
•
Second Part
not documented (Mainly intended for setting by service) December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
25
Best Practices Optimization Level 5 BUFFERPOOL~=SHEAPTHRES
DB2_ANTIJOIN=EXTEND DB2_REDUCED_OPTIMIZATION=YES
December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
26
Collecting Statistics
The DB2 Query Optimizer relies on reasonably accurate statistics to get a good query plans User runs RUNSTATS when data changes (part of ETL) Statistics Fabrication (unreliable) – – DB2 keeps UPDATE / DELETE / INSERT counters Fabrication limited to a few statistics – Not enough Automatic Statistics – – Automatically collects statistics on tables in need Runs in the background as a low priority job Real Time Statistics – Collects statistics on-the-fly December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
27
AUTO RUNSTATS
Set Under Automatic Table Maintenance hierarchy
– AUTO_RUNSTATS cannot be ON unless AUTO_TBL_MAINT is ON
Automatic maintenance (AUTO_MAINT) = ON
Automatic database backup (AUTO_DB_BACKUP) = OFF
Automatic table maintenance (AUTO_TBL_MAINT) = ON Automatic runstats (AUTO_RUNSTATS) = ON
Automatic statement statistics (AUTO_STMT_STATS) = OFF Automatic statistics profiling (AUTO_STATS_PROF) = OFF Automatic profile updates (AUTO_PROF_UPD) = OFF Automatic reorganization (AUTO_REORG) = OFF December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
REAL TIME STATISTICS
Set Under Automatic Table Maintenance hierarchy
– Real Time Statistics cannot be ON unless AUTO RUNSTATS is ON – AUTO_RUNSTATS cannot be ON unless AUTO_TBL_MAINT is ON 28
Automatic maintenance (AUTO_MAINT) = ON
Automatic database backup (AUTO_DB_BACKUP) = OFF
Automatic table maintenance (AUTO_TBL_MAINT) = ON Automatic runstats (AUTO_RUNSTATS) = ON Automatic statement statistics (AUTO_STMT_STATS) = ON
Automatic statistics profiling (AUTO_STATS_PROF) = OFF Automatic profile updates (AUTO_PROF_UPD) = OFF Automatic reorganization (AUTO_REORG) = OFF December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
29
Best Practices – RUNSTATS
Distribution Statistics – – Collect large Quantile Statistics for Date columns Collect distribution statistics on columns used in predicates Index Statistics – Do not collect DETAILED INDEX statistics . Use SAMPLED DETAILED INDEX statistics instead Avoid statistics on columns you know will never be used in predicates or GROUP BY columns Use TABLESAMPLE option for very large tables and statistical views Use RUNSTATS Profiles to store customized invocations RUNSTATS with ATTACH ?
COMMIT immediately after RUNSTATS of each table December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
30
Collecting Statistics
Automatic RUNSTATS Real Time Statistics SAMPLED DETAILED INDEX TABLESAMPLE Selective column statistic specification Use RUNSTATS PROFILES
December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
31
Summary
Tips and best practices to improve data warehouse query performance have been discussed.
– – – Database Design Application Design Configuration and Operations These include key considerations related to : – – – – – Parallelism Partitioning Schema Application queries Configuration Session 2 will cover the Performance Layer December 01 03, 2009 •Minneapolis, Chicago, Milwaukee
32
Disclaimer
© Copyright IBM Corporation [current year]. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
Please update paragraph below for the particular product or family brand trademarks you mention such as WebSphere, DB2, Maximo, Clearcase, Lotus, etc
IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
If you have mentioned trademarks that are not from IBM, please update and add the following lines: [Insert any special 3rd party trademark names/attributions here] Other company, product, or service names may be trademarks or service marks of others.