Dave Ballantyne

Clear Sky SQL
› Freelance Database Developer/Designer
– Specializing in SQL Server for 15+ years
› SQLLunch
– Lunchtime usergroup
– London & Cardiff, 2nd & 4th Tuesdays
› TSQL Smells script author
– http://tsqlsmells.codeplex.com/
› Email : [email protected]
› Twitter : @DaveBally
› This is also me
– Far too often…
› Estimates are central
› Statistics provide estimates
› Every journey starts with a plan
› Is this the ‘best’ way to Lyon?
– Fastest
– Most efficient
– Shortest
› SQL Server makes similar choices
› SQL Server is a cost based optimizer
– Cost based = compares predicted costs
› Therefore estimations are needed
› Every choice is based on these estimations
› Has to form a plan
– And the plan cannot change in execution if ‘wrong’
› Costs are not actual costs
– Little or no relevance to the execution costs
› Cost is not a metric
– A cost of 1 does not equal 1 of anything (seconds, reads, …)
› Their purpose:
– Pick between different candidate plans for a query
› Included within an index
› Auto Updated and created
– Optimizer decides “It would be useful if I knew..”
– Only on a single column
– Not in read-only databases
› Can be manually updated
› Auto-Creation can operate Async
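For reference, these behaviours are database-level settings and statistics can also be refreshed by hand; a minimal sketch (the database name is just a placeholder):

-- Database-level switches (placeholder database name)
ALTER DATABASE AdventureWorks SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON;
-- Let auto-updates run asynchronously instead of blocking the triggering query
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS_ASYNC ON;
-- Manual update of all statistics on a table
UPDATE STATISTICS Person.Person;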
› DBCC SHOW_STATISTICS(tbl,stat)
– WITH STAT_HEADER
› Display statistics header information
– WITH HISTOGRAM
› Display detailed step information
– WITH DENSITY_VECTOR
› Display only density of columns
– Density = 1 / (count of distinct values)
– WITH STATS_STREAM
› Binary stats blob ( not supported )
› The STAT_HEADER output shows:
– Total rows in the table
– Average byte length of all columns
– Rows read and sampled
– Density (rows / distinct values), excluding boundaries – not used
– Number of steps in the histogram – max 200
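For example (the index name is an assumption; substitute any statistic on the table):

-- Header, density vector and histogram for one statistic
DBCC SHOW_STATISTICS ('Person.Person', IX_Person_LastName_FirstName) WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('Person.Person', IX_Person_LastName_FirstName) WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('Person.Person', IX_Person_LastName_FirstName) WITH HISTOGRAM;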


Each step contains data on a range of values
Range is defined by
◦ <= RANGE_HI_KEY
◦ > Previous range RANGE_HI_KEY
Example (histogram rows 3 and 4): the previous step’s RANGE_HI_KEY is 407 and this step’s RANGE_HI_KEY is 470, so the step covers values > 407 and <= 470
◦ RANGE_ROWS = 17 (rows of data > 407 and < 470)
◦ DISTINCT_RANGE_ROWS = 9 (distinct values > 407 and < 470)
◦ EQ_ROWS = 6 (rows of data = 470)
◦ AVG_RANGE_ROWS = density of the range = 17 / 9 = ~1.89
If the predicate value equals a RANGE_HI_KEY (and is > the previous RANGE_HI_KEY)
◦ Estimate = EQ_ROWS
If the predicate value falls inside a step (> previous RANGE_HI_KEY and <> RANGE_HI_KEY)
◦ Estimate = AVG_RANGE_ROWS
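A sketch of the two cases, using the step above and a hypothetical table dbo.Orders with an integer column Qty:

-- Predicate equals the step's RANGE_HI_KEY (470): estimate = EQ_ROWS = 6
SELECT * FROM dbo.Orders WHERE Qty = 470;
-- Predicate falls inside the step (> 407 and < 470): estimate = AVG_RANGE_ROWS = 17 / 9 = ~1.89
SELECT * FROM dbo.Orders WHERE Qty = 425;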
› Greater accuracy on Range boundary values
– Based upon the ‘maxdiff’ algorithm
› Relevant for leading column
– Estimate for Smiths
– But not Smiths called John
› Additional Columns cumulative density only
– 1/(Count of Distinct Values)
› Density vector = 1 / (count of distinct values)
– 19,517 distinct (LastName, FirstName) combinations
– 1 / 19,517 = ~5.123738E-05
– Estimate = 19,972 (total rows) * 5.123738E-05 = ~1.02331
› All Diazs will estimate to the same:
– As will all Smiths, Jones & Ballantynes
– The statistics do not contain detail on FirstName
– Only how many distinct values there are
– And assumes these are evenly distributed
› Not only across a single Surname
› But ALL Surnames
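So, under the pre-2014 estimator described here (and assuming the AdventureWorks Person.Person table used in the demos), any equality pair on (LastName, FirstName) gets the same guess:

SELECT * FROM Person.Person
WHERE  LastName = N'Diaz' AND FirstName = N'Ken';
-- Estimate = total rows * density of (LastName, FirstName)
--          = 19,972 * (1 / 19,517) = ~1.02
-- The same ~1.02 comes back for any FirstName, and for Smith, Jones or Ballantyne too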
› So far we have only used a single statistic for estimations
› For this query:
› To provide the best estimate the optimizer ideally needs to know about LastName and FirstName
› Correlating Multiple Stats
› AND conditions
› Find the intersection
– e.g. LastName = ‘Sanchez’ AND FirstName = ‘Ken’
› And logic
– Intersection Est = ( Density 1 * Density 2)
› Or logic
– Row Est 1 + Row Est 2 –(Intersection Estimate)
– Avoid double counting
– e.g. 10% * 20% = 2%
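A worked sketch of the two rules, with made-up selectivities of 10% for the surname and 20% for the first name:

-- AND: intersection = 10% * 20% = 2% of the table
SELECT * FROM Person.Person
WHERE  LastName = N'Sanchez' AND FirstName = N'Ken';

-- OR: 10% + 20% - 2% = 28% of the table (subtract the intersection to avoid double counting)
SELECT * FROM Person.Person
WHERE  LastName = N'Sanchez' OR FirstName = N'Ken';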
• To keep statistics fresh they get ‘aged’
• 0 rows to > 0 rows
• <= 6 rows (for temp tables): 6 modifications
• <= 500 rows: 500 modifications
• >= 501 rows: 500 modifications + 20% of the table
• Will cause statistics to be updated on next use
• Will cause statements to be recompiled on next execution
• Temp tables in stored procedures more complex

Trace flag 2371 – dynamically lowers the statistics update threshold for large tables
 > 25,000 rows
 2008 R2 (SP1) & 2012
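A sketch of watching and pre-empting the thresholds (sys.dm_db_stats_properties needs 2008 R2 SP2 / 2012 SP1 or later):

-- Rows, sample size and modifications since the last update, per statistic
SELECT s.name, sp.rows, sp.rows_sampled, sp.modification_counter, sp.last_updated
FROM   sys.stats AS s
CROSS  APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE  s.object_id = OBJECT_ID(N'Person.Person');

-- Refresh by hand rather than waiting for the threshold to trip
UPDATE STATISTICS Person.Person;

-- Opt in to the dynamic threshold for large tables (instance-wide)
DBCC TRACEON (2371, -1);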
› When density vector is not accurate enough
› Manually created statistics only
› Additional where clause can be utilised
Rows Sampled = number of filtered rows sampled
Filter Expression = the filter
Unfiltered Rows = total rows in the table before the filter
Without a matching filtered statistic: estimate = density of ‘London’ * density of ‘Ramos’
With one: the filter is matched and its histogram is used
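A minimal sketch, assuming a hypothetical dbo.Customer table with City and LastName columns:

-- Filtered statistic: histogram over LastName for London rows only
CREATE STATISTICS stat_LastName_LondonOnly
ON dbo.Customer (LastName)
WHERE City = N'London';

-- A query whose WHERE clause matches the filter can use this histogram
SELECT * FROM dbo.Customer
WHERE  City = N'London' AND LastName = N'Ramos';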
› For ‘large’ data sets a smaller sample can be used
› Here 100% of the rows have been sampled
› Here ~52% of the rows have been sampled
› Statistics will assume the same distribution of values through the entire dataset
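The sample size can be controlled when statistics are rebuilt by hand; a sketch:

-- Read every row (100% sampled)
UPDATE STATISTICS Person.Person WITH FULLSCAN;
-- Or read roughly half of the rows and extrapolate
UPDATE STATISTICS Person.Person WITH SAMPLE 52 PERCENT;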
› Also Auto/Forced Parameterization
› Remember the density vector?
› 19,972 (total rows) * 0.0008285004 = ~16.5468
› Stored Procedures
› Enables a better plan to be built
– (most of the time)
– Uses specific values rather than average values
› Values can be seen in properties pane
› Erratic execution costs are often Parameter Sniffing problems
› Force a value to be used in optimization
› A literal value
› Or UNKNOWN
– Falls back to density information
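A sketch of the hint (parameter name and value are illustrative):

-- Compile for a specific representative value, whatever is actually passed in
SELECT * FROM Person.Person WHERE LastName = @LastName
OPTION (OPTIMIZE FOR (@LastName = N'Smith'));

-- Or ignore the sniffed value and fall back to the density ("average") information
SELECT * FROM Person.Person WHERE LastName = @LastName
OPTION (OPTIMIZE FOR (@LastName UNKNOWN));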
› OPTION(RECOMPILE)
– Recompile on every execution
› Because the plans aren’t cached
– No point as by definition the plan won’t be reused
› Uses variables as if literals
– More accurate estimates
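For comparison, a sketch with a local variable, which would otherwise be estimated from the density vector:

DECLARE @LastName nvarchar(50) = N'Smith';
-- Recompiled on every execution; the variable's current value is used like a literal,
-- and the resulting plan is not cached for reuse
SELECT * FROM Person.Person WHERE LastName = @LastName
OPTION (RECOMPILE);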
› The Achilles heel
– A plan is fixed
› But… The facts are:
– More Smiths than Ballantynes
– More Customers in London than Leeds
             London   Leeds
Smith           500      50
Ballantyne       10       1
› This is known as the ‘Plan Skyline’ problem
[Chart: Summarised (20 step) surname stats distribution]
[Chart: Full statistics distribution]
› But wait…
› It gets worse
– That was only EQ_ROWS
– RANGE_ROWS ??
[Chart: Range Rows statistics distribution]
› Variations in plans
– Shape
› Which is the ‘primary’ table?
– Physical Joins
– Index (non)usage
› Bookmark lookups
– Memory grants
› 500 rows need more memory than 50
– Parallel plans
› Can the engine resolve this?
– No!
› We can help though
– And without recompiling
– Aim is to prevent ‘car-crash’ queries
– Not necessarily provide a ‘perfect’ plan
Demo
› “There is always a trace flag”
– Paul White ( @SQL_Kiwi)
› TF 9292
– Show when a statistic header is read
› TF 9204
– Show when statistics have been fully loaded
› TF 8666
– Display internal debugging info in QP
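A sketch of switching these on for a single statement (output goes to the Messages tab via trace flag 3604, QUERYTRACEON needs suitable permissions, and the output only appears when the plan is actually compiled):

SELECT * FROM Person.Person WHERE LastName = N'Smith'
OPTION (RECOMPILE,
        QUERYTRACEON 3604,   -- send trace output to the client
        QUERYTRACEON 9292,   -- show when a statistic header is read
        QUERYTRACEON 9204);  -- show when statistics have been fully loaded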
› Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 (white paper)
› Plan Caching in SQL Server 2008 (white paper)
› Microsoft SQL Server 2008 Internals (Microsoft Press)
› My Email : [email protected]
› Twitter : @DaveBally