Finding out what’s wrong –
in search of “fast=true”
Thomas Kyte
Oracle Corporation
Who am I
• Been with Oracle for 10 years
• User of Oracle for almost 16 years
• The “Tom” behind AskTom in Oracle Magazine – www.oracle.com/oramag
• Expert One on One Oracle
• Beginning Oracle
My Approach
What we’ll be doing
• My 3 best friends – autotrace, tkprof and statspack
• Their friends – dbms_profiler, runstats, and even jdev
• Build(ing) a test environment
• Design to perform, don’t tune to perform (“tune” is a 4 letter word)
• Have metrics – and live up to them (and only them)
• Benchmark, Benchmark, Benchmark
• Instrument your code
• Don’t look for shortcuts
Break Glass in case of Emergency
• Triage
  – In most cases you have no prior information; you are starting from scratch
  – Get a statspack
  – Isolate an application
• Trace it
• If you implement some of the ideas I’ll talk about
  – You won’t have to triage (you can predict)
  – When you need to isolate where the issue is – you can
  – Finding the problem is almost always harder than fixing it! (time entry example)
My three best friends
• Autotrace
  – Too easy
  – Explain plan
  – The statistics
  – Demo007.sql (sketch below)
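For reference, a minimal autotrace session sketch – the table name T is hypothetical, and I’m assuming the PLUSTRACE role (created by plustrce.sql) has been granted to the user:

set autotrace traceonly explain statistics  -- show the plan and statistics, suppress the rows
select count(*) from t;                     -- t is a hypothetical table under test
set autotrace off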
My three best friends
 We have a big customer table which is referenced by many other tables.
 The table's primary key is NUMBER(12). However, the foreign key columns of some of the tables referencing this customer table were defined as plain NUMBER (seen as NUMBER(38) in the data dictionary).
 This datatype difference causes a data conversion during the checking of the foreign keys and thus a performance loss. (invalid conclusion based on guessing – no hard facts to back it up)
 In order to prevent it, we tried to alter the NUMBER columns to the same type as our customer table's primary key, NUMBER(12).
 Since Oracle doesn't allow us to do this unless the column values are NULL, we had to insert them with their rowids into a temporary table, set them to NULL, alter the column to NUMBER(12), and populate the original values from the temporary table again. (doing a TON of work and they will see no gain from this – other than tons of IO)
 This is a very slow process – we couldn't even see its end after 8 hours of work and had to stop it. My question is, could you please offer us a faster way of doing this operation? (yes we can – stop doing it!)
 The total size of the tables we have to alter is about 220 million rows; we are running all DML in parallel and in nologging mode, and we are creating the temporary table with create table as select.
• Review demo008.sql and demo008.prf
• @trace and select count(*) example (sketch below)
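If you want to try the @trace idea yourself, a rough sketch (the table T and the trace file name are hypothetical; a real @trace script would simply wrap the alter session commands):

alter session set timed_statistics = true;  -- needed for meaningful tkprof timings
alter session set sql_trace = true;
select count(*) from t;                     -- the statement under test
alter session set sql_trace = false;

-- then, from the OS, format the trace file found in user_dump_dest:
$ tkprof ora920_ora_1234.trc demo008.prf sys=no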
My three best friends
• Statspack
  – Keep a history
    • I've been asked to look at a site who have just carried out a DB upgrade. The feeling is that there has been some performance degradation since the upgrade. The performance hit has not been quantified yet but is 'felt throughout the day'.
  – Use it in sickness and in health
  – 15-30 minute windows between snaps at most
    • Averaging out over 8 hours is meaningless
  – Don’t try to remove every last wait
    • When looking through the statspack report, I found that the events “direct path read” and “direct path write” are always the top 2 in the “Top 5 Wait Events” section. They had 2.92 seconds of waits – over a 30 minute window!
• Watch for the “so what” waits – control file parallel write
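A minimal snapshot-taking sketch, assuming statspack is already installed (via spcreate.sql) and you are connected as the standard PERFSTAT owner:

exec statspack.snap
-- ... a 15-30 minute window of representative load later ...
exec statspack.snap
@?/rdbms/admin/spreport    -- prompts for the begin/end snap ids and a report name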
My three best friends
• So, you have a 20 page statspack report. Now what?
  – Make sure it is 15-30 minutes long
  – Make sure timed statistics were on

STATSPACK report for

DB Name      DB Id       Instance Inst Num Release    Cluster Host
------------ ----------- -------- -------- ---------- ------- ----
ORA9I        2272536868  ora9i           1 9.2.0.1.0  NO      aria

            Snap Id     Snap Time      Sessions Curs/Sess Comment
            ------- ------------------ -------- --------- -------
Begin Snap:       1 30-Dec-02 09:58:58   67,254       3.0
  End Snap:       2 30-Dec-02 10:14:52   67,260       3.0
   Elapsed:               15.90 (mins)

Cache Sizes (end)
~~~~~~~~~~~~~~~~~
Buffer Cache:         96M    Std Block Size:     8K
Shared Pool Size:    112M        Log Buffer:    512K
My three best friends
• Check hard parses, want almost none
• Executes & TPS are indicators of the “load” on the system
• Others are useful for reference
Load Profile
~~~~~~~~~~~~                        Per Second   Per Transaction
                                 -------------   ---------------
              Redo size:             75,733.92         20,737.70
          Logical reads:              1,535.11            420.35
          Block changes:                449.56            123.10
         Physical reads:                562.99            154.16
        Physical writes:                 62.53             17.12
             User calls:                  8.04              2.20
                 Parses:                 56.43             15.45
            Hard parses:                  0.38              0.10
                  Sorts:                 11.63              3.19
                 Logons:                  0.30              0.08
               Executes:                 94.21             25.80
           Transactions:                  3.65

  % Blocks changed per Read:   29.29    Recursive Call %:    97.14
 Rollback per transaction %:    9.41       Rows per Sort:   752.04
My three best friends
• The ratios – Library hit, Soft parse are the ones I zero in on
• Buffer hit – not so much
• Execute to parse – depends on system type
  – Negative = bad
  – 20-40% not unreasonable for web based (stateless)
  – Client/server should be much much higher

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   94.94    In-memory Sort %:   98.50
            Library Hit   %:   99.81        Soft Parse %:   99.33
         Execute to Parse %:   40.10         Latch Hit %:   99.91
Parse CPU to Parse Elapsd %:   86.12     % Non-Parse CPU:   94.69

 Shared Pool Statistics        Begin   End
                               ------  ------
             Memory Usage %:   86.95   85.16
    % SQL with executions>1:   66.38   67.40
  % Memory for SQL w/exec>1:   67.28   69.68
My three best friends
• Top 5 timed events
  – Not waits in 9iR2 and up – timed events, includes CPU time
  – This was a 4 CPU machine; I used 508 out of 3,600 CPU seconds
  – Direct path write caught my eye
    • Direct loads
    • PDML
    • Uncached lobs
    • Sorts to disk
• The rest of the report helps me figure out the top of the report

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                           % Total
Event                                    Waits    Time (s)  Ela Time
-------------------------------- ------------- ----------- --------
CPU time                                                508    37.39
direct path write                        5,168          279    20.50
db file scattered read                  38,554          270    19.85
log file sync                            2,610           88     6.50
direct path read                         2,702           86     6.29
---------------------------------------------------------------------
Their three best friends
• DBMS_PROFILER (sketch below)
  – Source code profiler for PLSQL
  – Find the low hanging fruit
• Runstats
  – Simple test harness to show differences in approaches
  – Demo009.sql
• Jdeveloper
  – Source code DEBUGGER for PLSQL
  – Interactive develop/compile/debug or…
  – Debug from afar
    • Submit web page
    • Attach debugger to mod_plsql invoked procedure
  – Quick Demo
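A minimal dbms_profiler sketch, assuming the profiler tables were created via proftab.sql; my_procedure is a hypothetical unit under test:

declare
    l_ret binary_integer;
begin
    l_ret := dbms_profiler.start_profiler( 'my test run' );
    my_procedure;                      -- hypothetical code under test
    l_ret := dbms_profiler.stop_profiler;
end;
/

-- find the low hanging fruit: the lines that burned the most time
select u.unit_name, d.line#, d.total_occur, d.total_time
  from plsql_profiler_units u, plsql_profiler_data d
 where u.runid = d.runid
   and u.unit_number = d.unit_number
 order by d.total_time desc;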
Build a Test Environment
• Them: Our production application is behaving badly, it is doing <something>
• Me: When you ran it in test, what was the outcome there?
• Them: Test? What is “test”?
• Me: When you tested this process in your test environment – that duplicate of production – what happened there?
• Them: Oh – well, we don’t have a test environment….
Build a Test Environment
• Certain large institution
  – Upgraded
  – “Seemed Slower”
  – Flipped 16 switches and didn’t have the ability to unswitch them
• A certain city asked on Friday if there were any “gotchas” with upgrading – they just wanted to know for tomorrow
• Testing will
  – Benchmark it, test scale, test performance
  – Verify fixes actually work
  – Assure you that the upgrade script actually works
  – Make sure major things don’t knock you out
Build a Test Environment
• The rules of the game:
  – Don’t test or develop on an empty database
    • Importing stats doesn’t cut it
    • If you can read a query plan and with 100% accuracy say “it’s good” – I want to hire you
  – Don’t test with a single user
  – Don’t test in a dust free lab
    • Stubbed out API’s
    • Less work than in real life
Design to perform, don’t tune it
• Specific models work better than generic
• Try 15 ways to do something
  – Consider features you’ve never used
    • Hash clusters
    • IOTs
    • B*tree clusters
  – People example (tall skinny table as an index)
  – Denormalization example (name search)
  – Demo iot_heap.sql (see the IOT sketch below)
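To make the heap-versus-IOT comparison concrete, a sketch along the lines of the iot_heap.sql idea (the table and column names are my own invention):

-- conventional heap table: rows land wherever there is free space
create table people_heap
( last_name   varchar2(30),
  first_name  varchar2(30),
  empno       number,
  constraint people_heap_pk primary key ( last_name, first_name, empno )
);

-- index organized table: the table IS the index; rows are stored sorted by key,
-- so a scan over a range of names reads a handful of co-located blocks
create table people_iot
( last_name   varchar2(30),
  first_name  varchar2(30),
  empno       number,
  constraint people_iot_pk primary key ( last_name, first_name, empno )
)
organization index;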
Design to perform, don’t tune it
I have a table with a blob field, for example:

create table trx
( trxId    number(18),
  trxType  varchar2(20),
  objValue blob );

The blob field contains a java serialised object – different objects based on type, though all of them implement the same interface. We have always accessed this object through a J2EE container, so it works fine so far. Now users want to run reports using sqlplus, crystal reports etc. So they want a solution to this blob issue.
• Went on to discuss how *slow* it is to parse this out
• Built tons of views to give a relational view of these structures
• Slow as slow can possibly be
• Not to mention a slightly CPU intensive solution
• Mad at the database now
Design to perform, don’t tune it
• Store the data in such a way that the most FREQUENT or CRITICAL queries/applications can use it
  – “we execute this query a billion times an hour – tune it”

select fact.*
  from fact, members
 where ( fact.fk_mem = members.id
      or fact.fk_id1 in (correlated subquery)
      or fact.fk_id2 in (correlated subquery)
      or ( fact.fk_mem is null
           and fact.fk_id1 is null
           and fact.fk_id2 is null )
       )
   and members.id = 2
Have metrics and live up to them
• Make it go faster
• Them: “Tune this, it is going slow”
• Me: “Well, how fast does it have to go?”
• Them: “I don’t know, just needs to go faster”
• Me: “How do we know when we are done?”
• Them: “When it is going fast enough”
• Me: “And what is that?”
• Them: “Don’t know, we’ll know it when we see it”
Have metrics and live up to them
• I have a theory – it can always go 1% faster
  – You never reach “zero” as it is always 1% of a smaller number
  – The cost of each 1% increases exponentially in time, energy and money
• You need things like
  – You’ll have 1,000 users of which 100 will be active concurrently. They must have response times of 0.25 seconds for this <well defined> transaction
• What I usually hear
  – It’s going to have lots and lots of users doing stuff. It’s gotta be really fast.
• You cannot design, let alone size, that system.
Have metrics and live up to them
• Because everyone says so…
• Them: “We rebuild our indexes every week”
• Me: “Why?”
• Them: “Everyone knows you need to”
• Me: “Have you ever put together solid metrics that prove you are doing a good thing?”
• Them: “Why?”
• Me: “Because by rebuilding the indexes <any ‘maintenance operation’ could go here actually>, you might actually be decreasing performance, increasing the workload on the machine, and just in general wasting lots of time and energy”
• Them: “But it is common knowledge that rebuilding indexes is best, we don’t need to prove it”
HELP!!!! Riddle me this batman...
Why does rebuilding an index cause increased redo log generation AFTER the index has been built?
I have a table with 35 million rows and an index (nothing is partitioned). Transactions against this table are constant; it's always 500,000 rows per day. This generally creates 10 logs a day.
Once a month the indexes are rebuilt (alter index rebuild).
On the day following the index rebuild, 50 logs are created. On the following days 45... 40... 35... 30... 25... down to 10, where it remains constant at 10 logs. Mining the logs we see that we have increased INTERNAL INDEX UPDATES.
Why does this happen?? Is this always the case??
Have metrics and live up to them
• Keep metrics
  – Statspack
  – Application level statistics
• Evaluate against them
  – Do an index rebuild
  – Come back tomorrow and verify you did more good than harm (one way to measure is sketched below)
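One way to gather such before/after numbers for an index (the index name is hypothetical; note that validate structure locks the table against DML while it runs):

analyze index emp_pk validate structure;

-- index_stats holds one row, describing the index just validated;
-- capture these numbers before the rebuild and again tomorrow
select height, lf_rows, del_lf_rows, pct_used
  from index_stats;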
Benchmark, Benchmark, Benchmark
• Small time benchmarking
  – Interested in finding the difference between two approaches
  – Look at the number of latches approach 1 takes vs approach 2
  – More important than elapsed time
• Big Time Benchmarking
  – Most people skip this
  – Too expensive
  – You must test with representative data
  – You must test with realistic inputs
  – You must verify/validate the results
  – This is not a chore
Instrumentation
• What the heck is this?
• Common pushback:
  – This is overhead
  – This will slow down the code
  – This is extra code I don’t need
• My answer
  – You’re curious about the v$ tables? Guess what they are
  – SQL_TRACE?
  – The entire Events subsystem
  – Do I practice what I preach?
Instrumentation
• Databases were born to write to – every click on every web site I build comes with a built-in insert
• For the last 24 hours, I have observed that accessing your site – clicking on 'ask question', 'read question', 'Review Question' etc. – has all gone slow (taking around 2-3 minutes instead of a couple of seconds before). Are others facing the same problem?
• I just went to my statistics page, generated right from my audit trail as part of my system
• Owarepl
• A certain Apps implementation I worked on
  – (that didn’t do this)
Instrumentation
• Use DBMS_APPLICATION_INFO everywhere! (sketch below)
  – Demo005.sql
• Debug.f
  – Demo006.sql
• Make it so your applications can all enable
SQL_TRACE!
• If you use java – use the J2SE/EE industry standard logging. Help us help you
• Audit is 5 letters, not 4
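A minimal dbms_application_info sketch – the module and action strings here are invented for illustration, but the calls and the v$session columns are the real ones:

begin
    dbms_application_info.set_module
    ( module_name => 'Order Entry',      -- hypothetical module name
      action_name => 'insert order' );   -- hypothetical action
end;
/

-- from another session, see who is doing what right now:
select sid, module, action
  from v$session
 where module is not null;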
Don’t look for short cuts
• Everyone is looking for fast=true
  – (it doesn’t exist)
• Frequently I’m asked:
  – Tell me about these undocumented parameters
  – Where can I learn about Oracle internals
• These are people who haven’t read the concepts guide yet.
• Undocumented – for a reason
  – _trace_files_public for example
  – Neat event for bitmaps in 7.3.2 – killed intermedia in 8.0
Quote
And now, for something completely different
“Question Authority.”
- Unknown
There are lots of “experts” out there
• Make them prove everything
• Statements that should raise your eyebrows:
  – It is my opinion...
  – I claim...
  – I think...
  – I feel…
• Everything can (and should) be proven
• TKPROF goes a long way here
• Statspack is great
• “Runstats” is a tool I use as well (search asktom for runstats; usage sketch below)
• Things change, expect that
• It only takes a single counter case
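For reference, the runstats usage pattern (assuming you have installed the runStats_pkg package and its supporting objects from asktom):

exec runStats_pkg.rs_start;
-- ... run approach 1 here ...
exec runStats_pkg.rs_middle;
-- ... run approach 2 here ...
exec runStats_pkg.rs_stop(1000);   -- print stats/latches whose counts differ by more than 1,000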
Questions and Answers
Quote
“Rebuilding an index will always save space and increase performance.”
Quote
“Raw is faster, if you are IO bound, go RAW.”
Quote
“When coding in PL/SQL, one should always use explicit cursors.”
[email protected]> declare
  2      l_cnt   number;
  3      l_start number;
  4      cursor c is select count(*) from dual;
  5  begin
  6      l_start := dbms_utility.get_time;
  7      for i in 1 .. 10000
  8      loop
  9          select count(*) into l_cnt from dual;
 10      end loop;
 11      dbms_output.put_line( (dbms_utility.get_time-l_start) || ' hsecs' );
 12
 13      l_start := dbms_utility.get_time;
 14      for i in 1 .. 10000
 15      loop
 16          open c;
 17          fetch c into l_cnt;
 18          close c;
 19      end loop;
 20      dbms_output.put_line( (dbms_utility.get_time-l_start) || ' hsecs' );
 21  end;
 22  /
198 hsecs
260 hsecs

PL/SQL procedure successfully completed.
Quote
“Adding more CPU will make all systems faster for sure.”
Quote
“Index space is never reused.”
Quote
“NOLOGGING stops all redo log from being generated on that object or tablespace.”
Quote
“A table should optimally be in one extent or as few extents as possible.”
Quote
“The most selective fields must be first in an index.”
Quote
“You should commit frequently to save resources and time.”
Quote
“A cold backup is better and/or easier than a hot backup.”