Techniques for Analyzing Performance Problems
Session: What are the key indicators of problems and poor performance that every
database administrator should be tracking? Where do you find this information, and
how can you distinguish the urgent from the mundane? We will learn how to approach
common monitoring problems through live demonstrations and explain how to get
started on the road to proactively managing your database and application. Join this
session to find out the answers to all these questions and more!
Techniques For Analyzing Performance Problems
It’s The Right Thing To Do.
Tom Bascom, White Star Software
[email protected]
A Few Words about the Speaker
• Tom Bascom; Progress user & roaming DBA
since 1987
• President, DBAppraise, LLC
– Remote database management service for OpenEdge.
– Simplifying the job of managing and monitoring the
world’s best business applications.
– [email protected]
• VP, White Star Software, LLC
– Expert consulting services related to all aspects of
Progress and OpenEdge.
– [email protected]
Get the Basic Facts
• What end-user owns the problem? Who confirms that we
have fixed it?
• Is the problem related to a specific screen, program or
business process?
• Is the problem repeatable? On demand? Is there a pattern (e.g. only in the afternoon, or whenever X is also running…)?
• Is there a specific target metric that we need to satisfy?
• Do we have before and after run times or other end-user
metrics related to the identified problem?
• Did the problem start on a particular date? Or after an
identified change?
Establish a Timeline
• When did the problem begin?
– Verify this with independent data! (Logfiles, timestamps on
output etc.)
• What changes are known to have occurred around this
time?
– Software releases – “The System”, OpenEdge, Apache, Java, OS
etc.
– Parameter changes (DB, 4gl client, OS, SAN, network etc.)
• Has the Workload changed?
– Application workload characteristics (business volume, number
of users, transaction volume, CRUD statistics, etc.)
– Environment workload changes: OS, disk, network…
– External workload changes (Shared SAN, Network, Other VMs).
When did the problem begin?
• It isn’t always the same as when the user first
noticed the problem.
• Getting the wrong begin date may lead to
identifying the wrong root cause.
• Independent verification is critical – log files,
timestamps of output files and careful
interviews of end-users are all tools that can
help establish a reliable begin date.
What Changes Occurred?
• A complete and reliable timeline of all
changes is critical.
• It is important to look “outside the box” –
changes sometimes have unexpected side
effects.
• Make sure that changes to any and all systems
involved in the business process being
analyzed are tracked.
Sample Timeline
(Timeline: Normal Performance → Monthly Release → OE 11.2 Upgrade → SAN Upgrade → Record Reads Increase 300% and Latch Timeouts Spike → Users Report Bad Performance)
Application Workload
• Over time more users create more data and engage in more
activity. If parts of the application are inappropriately
sensitive to growth there may be problems.
• DB activity should scale linearly (or less) with business
volume.
• Number of users, record creates, reads, updates and
deletes, number of transactions etc. should all be related to
business activity.
• Reporting and data aggregation will take longer with more
data – but the additional time should be linear and
predictable.
• Any metrics which are growing exponentially faster than
business growth are indicators of an underlying scalability
problem.
Environment Workload
• As the application becomes busier so should the
environment that it is hosted in.
• Just as with the application workload the environment
workload should scale linearly with business activity.
• CPU utilization, network traffic, memory utilization and disk
space should all grow in step with business growth.
• Disk IO operations should grow more slowly than business
growth (IO ops are buffered resources that should follow an
inverse square growth path).
• Any metrics which are growing exponentially faster than
business growth are indicators of an underlying scalability
problem.
External Workload
• Changes outside the immediate environment can
impact performance.
• Network load from other applications may reduce the
available bandwidth and increase latency.
• Disk IO from other applications may impact SAN
performance.
• Demand for CPU from other applications on a shared
VM may impact the scheduling and availability of
shared CPU.
• If the application and environment workloads have
not changed but we are seeing increased latency or
deeper queuing it is likely that the external workload
has increased.
TOOLS
Tools to Narrow the Scope
• kill -USR1 -- 4GL Stack Trace
– Generates a “protrace” file from a running process.
– Does not interfere with the running process.
– Requires that you be the process owner or “root”.
• Progress Profiler
– Gathers detailed line by line execution timing.
– Can be enabled via startup parameters or by manipulating PROFILER
handle.
• Client Logging
– Logs specified 4GL session details to a file.
– Can be very verbose.
• Client Statement Cache
– Shows the 4GL data access statement and stack trace
– Or the SQL statement
About Line Numbers
• Most tools use DEBUG-LIST line numbers.
• The debug listing expands all include files and preprocessor statements.
• COMPILE "program.p" DEBUG-LIST "program.dbg". (See the sketch below.)
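As a quick illustration, here is a minimal sketch (program name hypothetical) of generating a debug listing from ABL so that the line numbers reported by protrace, the profiler, or client logging can be matched back to source:

/* sketch: produce a DEBUG-LIST listing whose line numbers match
   what protrace / profiler / client logging report
   (the program name is hypothetical)                             */
define variable cProgram as character no-undo initial "program.p".

compile value(cProgram) debug-list value(replace(cProgram, ".p", ".dbg")).

if compiler:error then
    message "Compile failed:" cProgram view-as alert-box error.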
Obtaining a USR#
/home/tom> ps -fu f474458
     UID       PID      PPID   C    STIME    TTY  TIME CMD
 f474458  18612454  35669114   0 09:30:43      -  0:00 boks_sshd: f474458@pts/11
 f474458  14171658  46286012   0 12:58:50 pts/20  0:00 /progress/dlc/bin/_progres -p mls.p
 f474458  34488084  44385460   0 09:34:30      -  0:00 boks_sshd: f474458@pts/20
 f474458  37176446  39666876   0   Dec 04 pts/12  0:00 bash
 f474458  39666876  17448590   0   Dec 04      -  0:00 boks_sshd: f474458@pts/12
 f474458  42812578  47989918 120 12:59:02 pts/11  0:00 ps -fu f474458
 f474458  46286012  34488084   0 09:34:31 pts/20  0:00 bash
 f474458  47989918  18612454   0 09:30:43 pts/11  0:00 bash
/home/tom> grep f474458 /pdb_prod/xlos.lg | grep 14171658
[2012/12/06@12:58:50] P-14171658 T-1 I ABL 124: (452) Login by f474458 on /dev/pts/20.
[2012/12/06@12:58:52] P-14171658 T-1 I ABL 124: (708) Userid is now f474458.
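If you would rather script this lookup than grep the log, a minimal sketch of mapping an OS PID to its Progress user number through the _Connect VST follows (the PID is the one from the example above; any ABL session connected to the same database can run it):

/* sketch: map an OS process id to its Progress user number
   via the _Connect VST (PID taken from the example above)    */
define variable iPid as integer no-undo initial 14171658.

for each _Connect no-lock where _Connect._Connect-Pid = iPid:
    display _Connect._Connect-Usr    label "Usr"
            _Connect._Connect-Name   label "Name"
            _Connect._Connect-Device label "Device".
end.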
KILL -USR1
4GL Stack Trace From The Command Line
Generate a 4GL Stack Trace
• Great for batch processes and “hung” users.
• Use “kill -USR1” to generate a stack trace and
establish what code is actually running.
$ ps -fu f474458
     UID       PID      PPID   C    STIME    TTY  TIME CMD
 f474458  33570788  37176446  62 11:21:57 pts/12  0:00 ps -fu f474458
 f474458  37176446  39666876   0 08:44:23 pts/12  0:00 bash
 f474458  39666876  17448590   0 08:44:23      -  0:00 boks_sshd: f474458@pts/12
 f474458  42288226  45303030   0 11:21:30 pts/48  0:00 bash
 f474458  45302816  42288226   0 11:21:37 pts/48  0:00 /progress/dlc/bin/_progres -p mls.p

$ kill -USR1 45302816
$ ls -l /los_prod/protrace.45302816
-rw-rw-r-- 1 f474458 los_prd 2194 Dec  4 11:28 /los_prod/protrace.45302816
• The “protrace” output will be in the process’
current directory.
Sample 4GL Stack Trace
$ cat /los_prod/protrace.45302816
PROGRESS stack trace as of Tue Dec 4 11:28:00 2012
Command line arguments are
/progress/dlc/bin/_progres -p mls.p -pf /los_prod/develop.pf -T /dbtmp
Startup parameters:
-pf /progress/dlc/startup.pf,-T /dbtmp,-clientlog
/los_logs/debug/client_f474458_20121204_112720_45302816.log
... (snip)
+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0
... (snip)
** 4GL Stack Trace **
--> obj/mnu/menu (/los_prod/obj/mnu/menu.r) at line 416
obj/src/startup (/los_prod/obj/src/startup.r) at line 2042
mls.p (/los_prod/mls.r) at line 26
** Persistent procedures/Classes **
Handle   File Name
001010   /los_prod/obj/tools/proclib.r
001009   /los_prod/obj/tools/quick-supers.r
000000   /los_prod/config.r (STATIC)
CODE PROFILER
Line By Line Execution Time
Profiler – The Hard Way
• -profile (startup option)
– Non-intrusive
– Non-selective
• PROFILER: (session handle)
– More selective
– But requires code insertion or “wrappering”
– profiler/profctrl.p
• Analysis Tools
– http://communities.progress.com/pcom/docs/DOC-2808
– http://dbappraise.com
PROFILER Attributes
• DESCRIPTION – optional text describing this session
• LISTINGS – whether or not to create debug listings
• DIRECTORY – where to create debug listings (defaults to -T)
• FILE-NAME – name of output file (default profile.out)
• ENABLED – yes/no; initializes listings and so forth
• PROFILING – turn profiling on or off
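A minimal sketch of setting these attributes from code before the profiler is enabled; the description and file name below are hypothetical:

/* sketch: configure the PROFILER handle before enabling it
   (description and file name are hypothetical)               */
assign
    profiler:description = "order entry hot spot"
    profiler:directory   = session:temp-directory
    profiler:file-name   = "profile.out"
    profiler:listings    = yes
    .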
Profiling an Entire Session
• Create file called profiler.cfg with 3 lines:
-OUTFILE /tmp/profiler.out
-LISTINGS /tmp
-DESCRIBE someDescription
• Add -profile to the session startup:
mpro dbName -p start.p -profile profiler.cfg
• Run normally.
• Terminate cleanly & analyze the output.
• Multiple gigabytes of data may be generated…
Minimal Embedded Usage
define variable i as integer no-undo.

/* turn the profiler on around the code of interest */
assign
    profiler:enabled   = yes
    profiler:profiling = yes
    .

do i = 1 to 1000000:
    /* do something */
end.

/* turn the profiler off and flush the collected data */
assign
    profiler:enabled   = no
    profiler:profiling = no
    .

profiler:write-data().
<Escape>-P Profiler Tool
• Dynamically capture line by line execution time at the
point where issues occur.
• Send output to a user-defined destination.
• <Esc>P to configure and enable. (Don’t forget “Yes”.)
• <Esc>P again to complete the capture and disable.
Sample Profiling Output
Description: XYZZY                                        Date: 10/04/11

Top Total Time Lines
Program                          Line    Avg Time        Time   Calls
------------------------------  -----  ----------  ----------  ------
xtabsms2.p                      19281   93.256652  186.513304       2
getdocprep2.p                     939   63.346967  126.693934       2
proc_create_sitm xtabsms2.p     11926   12.611345   50.445380       4
xtcountry_x2_x3.p                 359    0.000013    0.459658  34,967
proc_read-database sysval.p       536    0.000117    0.336606   2,879
xtcountry_x2_x3.p                 360    0.000009    0.324163  34,894
getdocprep2.p                     741    0.067411    0.269642       4
proc_upd_nref xtmfintb2.p        3582    0.003438    0.233802      68
proc_upd_nref xtmfintb2.p        3298    0.003137    0.213345      68
proc_process_tasks xttskscn.r    3091    0.012560    0.200956      16
findClient sysval.p               328    0.000094    0.165167   1,763
CLIENT LOGGING
Detailed 4GL Session Logging
Client Logging – The Hard Way
• -clientlog filename
– /logs/debug/USER_DATE_TIME_PID.log
• -logginglevel
– 0 = disable
– 1 = errors only
– 2 thru 4 = basic, verbose & extended
• -logentrytypes
– 4GLTrace, 4GLTrans, DynObjects.*, FileID, QryInfo
– LOG-MANAGER:LOG-ENTRY-TYPES = "4GLTrace:2" (see the sketch below)
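The same entry types can also be switched on from inside a running session through the LOG-MANAGER handle; here is a minimal sketch (the log file path and the traced program are hypothetical):

/* sketch: turn client logging on dynamically, run the suspect code,
   then close the log (path and program name are hypothetical)       */
assign
    log-manager:logfile-name    = "/tmp/client_trace.log"
    log-manager:logging-level   = 2
    log-manager:log-entry-types = "4GLTrace:2,QryInfo:3"
    .

run obj/app/suspect-program.p.

log-manager:close-log().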
<Escape>-T Trace Tool
• Dynamically combine LOG-ENTRY-TYPEs &
LOGGING-LEVELs to suit.
• 4GLTrace, QryInfo, DynObjects…
• 2 = Basic, 3 = Verbose, 4 = Extended
• Specify a custom path to log file (/home/SID…)
• Trace only the relevant portion of your session
(not all or nothing).
Client Logging Output
[12/12/06@09:12:22.471-0500] P-41043172 T-000001 1 4GL -- Logging level set to = 1
[12/12/06@09:12:22.471-0500] P-41043172 T-000001 1 4GL -- Log entry types activated: 4GLTrace:2
[12/12/06@09:12:22.490-0500] P-41043172 T-000001 2 4GL 4GLTRACE Run obj/src/connect [Main Block - mls.p @ 25]
Run obj/src/startup [Main Block - mls.p @ 26]
PUBLISH Monitor [Main Block - obj/src/startup @ 270]
Run obj/sec/access "startup no" [Main Block - obj/src/startup @ 366]
Run obj/mnt/setcolor [Main Block - obj/src/startup @ 435]
Func isDbReadOnly [Main Block - obj/src/startup @ 849]
Func get-env [Main Block - obj/src/startup @ 882]
Run obj/prt/genBarcode PERSIST [Main Block - obj/src/startup @ 903]
Run loadCodes "0" [Main Block - obj/src/startup @ 914]
Func addCode "00
00 11011001100 " [loadCodes - obj/prt/genBarcode @ 352]
Func buildPCLString "11011001100 00" [addCode - obj/prt/genBarcode @ 291]
…
Run obj/mnu/menu "main" [Main Block - obj/src/startup @ 2042]
Run obj/app/ckqaread [Main Block - obj/mnu/menu @ 136]
Run obj/sec/access "main no" [Main Block - obj/mnu/menu @ 140]
Run obj/lkp/fixmod [Main Block - obj/mnu/menu @ 189]
Run obj/hlp/helpmess "main Retail Mortgage Processing" [Main Block - obj/mnu/menu @ 404]
Run obj/tools/logout [Main Block - obj/src/startup @ 2119]
Prolog removed to improve readability…
STATEMENT CACHE
Tracing Data Access Statements
Client Statement Cache
• Shows the session’s most recent data access
statement.
– Both 4gl and SQL clients are supported.
• 4gl stack trace is optionally available for 4gl
clients.
• Enable via PROMON or via the _Connect VST (see the sketch below).
• Access results via PROMON or _Connect.
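A minimal sketch of driving the statement cache from ABL through the _Connect VST; the user number comes from the PROMON example below, and the numeric caching-type code is an assumption that should be checked against the _Connect VST documentation for your release:

/* sketch: enable and read back the client statement cache through
   _Connect (user number is illustrative; the caching-type code is
   an assumption - verify against the _Connect VST documentation)   */
define variable iUsr as integer no-undo initial 89.

find _Connect exclusive-lock where _Connect._Connect-Usr = iUsr.
assign _Connect._Connect-CachingType = 3.   /* request an ABL stack */

/* later, once the session has executed something: */
find _Connect no-lock where _Connect._Connect-Usr = iUsr.
display _Connect._Connect-CacheInfo[1] format "x(60)" label "Top of stack".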
PROMON – Statement Cache
User number                    : 89
User name                      : f474458
User type                      : SELF/ABL
Login date/time                : 12/06/12 09:30
Statement caching type         : SQL Statement or Partial ABL Program Stack
Statement caching last updated : 12/06/12 09:33
Statement cache information    :
 101 : obj/mrxv/lnstatex-val
1152 : execute obj/mrxe/50005399-2012-06-13-31926.r
153 : obj/mrxe/50005399-2012-06-13-31926.r
80 : obj/mrx/runcompmtrx
169 : obj/mrx/mtrxeval-stateless
282 : obj/sec/access
4300 : obj/adm/mortype4
575 : obj/mnu/menu
568 : obj/mnu/menu
3018 : flow-proc obj/src/loanflow
1947 : obj/src/loanflow
51 : obj/lkp/retailfl
575 : obj/mnu/menu
568 : obj/mnu/menu
568 : obj/mnu/menu
2042 : obj/src/startup
26 : mls.p
SCENARIOS
Scenarios
• Hung Session
• Runaway
• Slow Program
• Slow System
NON-RESPONSIVE SESSION
Checking a “Hung” Session
• Check for Scroll-Lock (Control-S).
• Examine log files.
• Is the process consuming any CPU ticks?
• Examine table statistics by user – is there any activity? (See the sketch after this list.)
• Check “blocked users” – is the session waiting
on a record lock or other resource?
• Get a 4GL stack trace – examine the active code.
Could we be waiting for something?
– Be especially wary of OS-* commands, file IO and
any external interfaces.
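A minimal sketch of the table-statistics check using the _UserTableStat VST (the user number is illustrative, and the database must be started with a -tablerangesize large enough to cover the tables of interest):

/* sketch: is this user touching any tables at all?
   (user number illustrative; requires an adequate -tablerangesize) */
define variable iUsr as integer no-undo initial 89.

for each _UserTableStat no-lock
    where _UserTableStat._UserTableStat-Conn = iUsr
      and (_UserTableStat._UserTableStat-Read   > 0 or
           _UserTableStat._UserTableStat-Create > 0 or
           _UserTableStat._UserTableStat-Update > 0 or
           _UserTableStat._UserTableStat-Delete > 0):

    find _File no-lock
         where _File._File-Number = _UserTableStat._UserTableStat-Num no-error.

    display (if available _File then _File._File-Name else "?")
                format "x(24)" label "Table"
            _UserTableStat._UserTableStat-Read   label "Reads"
            _UserTableStat._UserTableStat-Update label "Updates".
end.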
CPU Utilization
• Use “ps” to check CPU utilization:
/home/tom> ps -ef | grep [1]4171658; sleep 60; ps -ef | grep [1]4171658
f474458 14171658 46286012 120 12:58:50 pts/20 0:06 _progres -p mls.p
f474458 14171658 46286012 120 12:58:50 pts/20 0:06 _progres -p mls.p
• No change in CPU in 60 seconds – we might
be “hung”.
/home/tom> ps -ef | grep [1]4171658
f474458 14171658 46286012 0 12:58:50 pts/20 0:06 _progres -p mls.p
f474458 50676836 14171658 0 13:48:56 pts/20 0:00 rm -i /tmp/xyzzy
• In this case a sub-process is blocked waiting
for input.
Is There a Problem?
• Is it holding record locks, preventing access, etc.?
• No? Don’t worry about it. Move on.
• Yes? It might be a hung session that needs to be
killed.
– Save log files and other analysis from above.
– Dynamic restart: kill -INT (raises the OpenEdge STOP
condition and re-runs the startup procedure)
– proshut dbname -C disconnect usr#
– Terminate: kill -HUP or kill -TERM
– Check the .lg files for transaction rollback completion
before killing again.
– Do NOT “kill -9” if there is DB activity – you may crash the
db!
Key Metrics
• ProTop – Active TRX, Blocked Users, Table IO
by User
MY LITTLE RUNAWAY
Runaway?
• A particular OS process is consistently
consuming significant CPU resources (usually
something in excess of 90% of a core).
• It may, or may not, be executing as designed
and expected.
• How can we tell?
Checking a Runaway
• Examine log files.
• Is the process consuming a lot of CPU ticks?
• Get a 4GL stack trace.
• Examine table and index statistics by user.
• Verify the Client Statement Cache.
CPU Utilization
• Check CPU utilization over time
• If the utilization is a significant portion of the
time slice then you might have a “runaway”:
/home/tom> ps -ef | grep [1]4171658; sleep 60; ps -ef | grep [1]4171658
f474458 14171658 46286012 120 12:58:50 pts/20 0:08 _progres -p mls.p
f474458 14171658 46286012 120 12:58:50 pts/20 0:46 _progres -p mls.p
0:46 – 0:08 = 38 seconds of CPU time, roughly 63% of the available timeslice.
Performing as Expected?
• Yes? It may be time to refactor.
• No? It might be a runaway that needs to be
killed.
– Save log files and other analysis from above.
– Dynamic restart: kill -INT (raises the OpenEdge STOP
condition and re-runs the startup procedure)
– proshut dbname -C disconnect usr#
– Terminate: kill -HUP or kill -TERM
– Check for transaction rollback completion before
killing again.
– Do NOT “kill -9” if there is DB activity – you may
crash the db!
Key Metrics
• ProTop – Active TRX, Blocked Users, Table IO
by User
RAPID READERS
Is it a “Rapid Reader”?
• Very heavy DB read activity.
• Probably close to the theoretical limit for a
session (200,000 rec/sec for current HW).
• Usually focused on just one table.
• Often a small table held entirely in memory.
• Often associated with a table scan (WHOLE-INDEX search).
• Poorly constructed joins may be involved.
ProTop Example
• Show summary, uio, tableStat data
Fixing a Rapid Reader
• Add an appropriate index.
• Refactor JOIN logic.
– Rewrite the query.
– Use a temp-table.
• Use a temp-table to replace the real table (see the sketch below).
*** Tuesday at 2:30 -- Pick an Index, Any Index
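A minimal sketch of the temp-table approach, assuming a small and mostly static code table (the table and field names are hypothetical): load it once, then read the temp-table inside the hot loop instead of re-reading the database table millions of times.

/* sketch: cache a small, static code table in a temp-table
   (table and field names are hypothetical)                   */
define temp-table ttCountry no-undo like country.

for each country no-lock:
    create ttCountry.
    buffer-copy country to ttCountry.
end.

/* in the hot loop, read the temp-table instead of the db table */
find ttCountry where ttCountry.countryCode = "US" no-error.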
A SPECIFIC PROGRAM IS SLOW
The Easiest Scenario
Slow Program
• A specific program is slow.
• The problem is repeatable and measurable.
• We have used tools to establish which code is
slow.
– If the profiler shows that data access statements
are at fault follow “Data Access” analysis.
– Otherwise follow “Other Cause”.
Data Access Issues
• Possible Data Access Issues:
– Table scans.
– Missing or inappropriate indexes.
– Poorly constructed joins (see the sketch below).
– Non-linear scaling.
• May be sensitive to the amount of data present.
• May be sensitive to data values.
• May not be reproducible outside the problem
environment.
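To make the "poorly constructed join" case concrete, here is a sketch using the sports2000 demo tables that appear later in this deck: wrapping the join field in a function defeats index bracketing and forces a scan of Order for every Customer, while the rewrite brackets on the indexed CustNum field.

/* a join that scans Order for every Customer - the STRING() calls
   prevent the CustNum index from being bracketed                  */
for each Customer no-lock where Customer.Country = "USA",
    each Order no-lock where string(Order.CustNum) = string(Customer.CustNum):
    /* process the order */
end.

/* the rewrite: bracket the inner table on its indexed CustNum field */
for each Customer no-lock where Customer.Country = "USA",
    each Order no-lock where Order.CustNum = Customer.CustNum:
    /* process the order */
end.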
Expected Data Access
• First – determine what you *expect*:
– What business metrics can we relate to this process?
– How much activity should we see for X business?
– How long should that activity take?
– What tables and indexes should be used by this business process?
Example:
Extracting 100 loans of data should result in 100 “loanfile” and
225 “borrower” reads…
Actual Data Access
• Examine Table & Index Statistics By User
– Are the expected tables being accessed at the
expected rates?
– Are the indexes being used at runtime the indexes that you expected? (See the sketch after this list.)
– Do data access operations vary linearly with the
amount of useful work being performed?
– Is any activity approaching the known limits of
the infrastructure?
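One way to answer the index question is to read the _UserIndexStat VST for the user running the business process; a minimal sketch follows (the user number is illustrative, and the database needs an -indexrangesize that covers the indexes of interest):

/* sketch: which indexes is this user actually reading?
   (user number illustrative; requires an adequate -indexrangesize) */
define variable iUsr as integer no-undo initial 89.

for each _UserIndexStat no-lock
    where _UserIndexStat._UserIndexStat-Conn = iUsr
      and _UserIndexStat._UserIndexStat-Read > 0
    by _UserIndexStat._UserIndexStat-Read descending:

    find _Index no-lock
         where _Index._Idx-Num = _UserIndexStat._UserIndexStat-Num no-error.

    display (if available _Index then _Index._Index-Name else "?")
                format "x(24)" label "Index"
            _UserIndexStat._UserIndexStat-Read label "Reads".
end.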
ProTop – CRUD Data
09:38:33          ProTop -- Progress Database Monitor          09/19/04
                    sports2000 [/data/s2k/sports2000]
                  Sample      Rate
Hit Ratio:          14:1      14:1   Commits:        62    65   Local:  51
Miss%:            7.239%    7.140%   Latch Waits:    45    46   Remote:  0
Hit%:            92.761%   92.860%   Tot/Mod Bufs: 1002   370   Batch:  50
Log Reads:         22960     26486   Evict Bufs:  26602  6225   Server:  0
OS Reads:           1662      1891   Lock Table:   8192    11   Other:   1
Chkpts:                1         0   Lock Tbl HWM:  138         TRX:     1
Flushed:               0         0   Old/Curr BI:  6141  6141   Blocked: 5
Area Full:     1 100.00%             After Image: DISABLED      Total:  52
Table Statistics
Tbl# Table Name              Create      Read    Update    Delete
---- -------------------- --------- --------- --------- ---------
   4 OrderLine                    0      5937       152         0
  24 POLine                       0      2641        56         0
  23 PurchaseOrder                0      1699        36         0
  18 Order                        0      1608        37         0
  21 Bin                          0       286        14         0
   2 Customer                     0       206        16         0
  12 Vacation                     0       111         5         0
Theoretical Limits
• Record Reads/sec = 200,000 per session, 1.2M total
(in a “shared” LPAR environment, 2M if “dedicated”).
• Transactions/sec = 18,000
• Disk Read/sec = 20,000
• Disk Write/sec = 5,000
Actual limits may be lower – these are the best
values that we obtained in the performance
lab.
Other Cause
• External Resources
– Slow File IO.
– Interfaces (especially networked services).
• Excessive recursion.
• Dynamic Object allocation.
– May be revealed by memory growth.
• Parameter passing with deep copies.
– May be revealed by memory growth.
• Poor “Big O” algorithm choice.
Key Metrics
• ProTop – Active TRX, Blocked Users, Table IO
by User
“IT’S SLOW”
Let’s Go Fishing…
Slow System
• “The System”, as a whole, is slow.
• The problem is repeatable or predictable.
– From 2pm until 5pm.
– The last Thursday of the month.
– Whenever the X extract is run.
Verify
• Are there workload changes that correlate with
the problem?
– Has business volume increased?
– Have there been changes in how the system is being
used?
– Has there been a change in the mix of products being
originated?
• Are any system limits being approached?
– OS limits: CPU utilization? Disk IO? Network Ops?
– DB limits: TRX/sec? Record Rd/sec? (See the sketch below.)
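A minimal sketch of checking the database-side limits by sampling the table statistics VST over a short interval to get a record-reads-per-second rate (the 10-second interval is arbitrary, and an adequate -tablerangesize is assumed):

/* sketch: sample total record reads over 10 seconds and compare
   the per-second rate against the known limits
   (assumes an adequate -tablerangesize)                          */
define variable iReads1 as int64 no-undo.
define variable iReads2 as int64 no-undo.

for each _TableStat no-lock:
    iReads1 = iReads1 + _TableStat._TableStat-Read.
end.

pause 10 no-message.

for each _TableStat no-lock:
    iReads2 = iReads2 + _TableStat._TableStat-Read.
end.

display (iReads2 - iReads1) / 10
        format ">>>,>>>,>>9.9" label "Record reads/sec".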
Database Activity
• Are there behavioral changes that correlate
with the problem?
– Do certain tables show increased IO activity?
• Have the “top 10” tables changed?
– Do certain indexes become active?
• Have the “top 10” indexes changed?
– Do particular background processes become
exceptionally busy?
– Are any bottleneck metrics worrisome?
• Latch timeouts, blocked users, long transactions
Key Metrics
• Record & Index Creates, Reads, Updates, Deletes, Locks
and Waits per second.
– Aggregate
– Per Table
– Per User
• Transactions (Commits) per second
• Latch Wait Timeouts
• Number of Users:
– Blocked
– With Active Transactions
– With Old Active Transactions
Questions?
Thank You!