Capacity Purchase Planning

Edward Jones IS Capacity Planning and
Performance Management
Jim Poletti
October 23, 2007
About Edward Jones. . .
• Full service investment firm
• 10,000+ branches – US, Canada, UK
• 1 "broker" and 1 branch office administrator
per branch
• Land-line WAN – DSL or T1
• St Louis datacenter is hub for most traffic
• Tempe datacenter primarily DR for mainframe
• 21,000 users signed on to CICS at high-water
IS Capacity Planning & Performance Management
Rich Unnerstall (Director – Data Center Operations)
Art Morlock (Department Leader)
• Jim Poletti (MF Performance Analyst)
• Gerry Oliver (MF Performance
Analyst)
• Greg Volk (Network Performance
Analyst)
• Rick Pranger (Open Systems
Performance Analyst)
• Dwayne Allen (Open Systems
Performance Analyst)
• Tom Siech (Load Tester)
• Brandy Brown (Load Tester)
St. Louis Mainframe Hardware
• All LPARs run on 1 physical mainframe
• IBM z9 2094-707 – 3516 MIPS – z/OS 1.7
• 80 GB memory
• 40 TB DASD – EMC RAID-1 and -7, 5 ms
• Older Symmetrix – replacing with DMX-4
• Data replication to Tempe using SRDF
CPU by LPAR
Production Environment/LPAR
• 1 LPAR (no data-sharing SYSPLEX yet)
• 25 CICS regions – 19 AORs, 5 TORs, 1 FOR
• 32 Million CICS transactions/day = 7 million
user "enters"
• DB2 – 1 subsystem
• IDMS – 5 regions, 15 million run units/day
• RRDF replication in DB2 and IDMS to Tempe
Responsibilities
• Assure system performance and scalability.
• Provide capacity planning support for
purchasing decisions.
• Tune the mainframe hardware "till the
wheels come off", then buy capacity.
• Hotline, war room participation.
• Performance Testing.
Early Morning "System Checks"
• Check system "barometers" from yesterday
• Check performance graphs and reports
• CICS transactions – Volume, CPU, Response
• LPAR CPU
• Memory
• DASD
• DB2
• IDMS
• Development response time – TSO, compiles
Houston, we have a problem!
• Go into detective mode
• Start at high level, look at service classes
within LPAR for abnormalities
Daily Workload Statistics
For 9:30-10:30 on Wed, Oct 17, 2007
Compared to Prior 4 Wednesdays
Service    CPU Util   CPU Util     Change in   % Change   Real Memory   Real Memory
Class      17-Oct     Prior 4      CPU Util    CPU        Gb            Prior 4
                      Wednesdays                                        Wednesdays
BAT_HOT    0.3        0.3          0           -8         7.6           8.6
BAT_1      1.6        1.5          0.1         5          20.1          15.7
BAT_2      3.6        3.6          0           1          52            126.2
CICS_1     11.8       11.2         0.6         6          1490          1490
CICS_2     33.4       34.5         -1.2        -3         2037          2246
CICS_3     0.6        0.8          -0.2        -27        315.5         352.5
DB2_HI     1.6        1.8          -0.2        -11        6648          6636
DB2_LO     0.6        0.6          -0.1        -11        21.9          25.5
IDMS       11.3       11.9         -0.6        -5         1390          1398
MQSERIES   0.3        0.2          0.1         35         775           418.7
NEWWORK    0          0            0           -44        0
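The comparison behind this report can be sketched in a few lines. This is only an illustration, assuming the baseline is the plain average of the prior four observations. The function names, the thresholds and the sample values are hypothetical and are not taken from the Jones repositories.

```python
from statistics import mean

def compare_to_baseline(today, prior):
    """Compare today's value with the mean of prior observations.

    Returns (baseline, change, pct_change). pct_change is None when the
    baseline is zero, because a percentage is undefined there.
    """
    baseline = mean(prior)
    change = today - baseline
    pct_change = None if baseline == 0 else 100.0 * change / baseline
    return baseline, change, pct_change

def abnormal_classes(today_by_class, prior_by_class,
                     pct_limit=20.0, min_change=0.5):
    """Return service classes whose CPU moved by more than both limits.

    pct_limit and min_change are assumed thresholds. Requiring both keeps
    tiny workloads with large percentage swings off the list.
    """
    flagged = []
    for name, today in today_by_class.items():
        _, change, pct = compare_to_baseline(today, prior_by_class[name])
        if pct is not None and abs(pct) >= pct_limit and abs(change) >= min_change:
            flagged.append(name)
    return flagged

# Hypothetical CPU utilization (%) values, not taken from the slide.
today = {"CICS_A": 12.0, "BATCH_A": 0.3, "DB2_A": 4.0}
prior = {
    "CICS_A": [11.0, 11.5, 11.0, 11.3],
    "BATCH_A": [0.2, 0.2, 0.2, 0.2],
    "DB2_A": [2.0, 2.0, 2.0, 2.0],
}
print(abnormal_classes(today, prior))
```

Comparing against the same weekday and hour keeps normal weekly patterns from showing up as false alarms.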
Dig deeper into details of the workload
Program    SUM CPU   CICS+DB2   CPU Time   % Change   DB2 CPU    DB2 CPU   Pct      Resp    Resp Time
Name       Time      CPU Time   Per Tran   CPU Time   Time       Time      Change   Time    Prior 4
           9:30 to   Per Tran   Prior 4               Per Tran   Prior 4   DB2              Weds
           10:30                Weds                             Weds
CMSOC300   884       0.0025     0.0025     1          0.0021     0.0021    1        0.076   0.078
DFHMIRS    424       0.0006     0.0006     -2         0          0                  0.031   0.034
MYDOC016   391       0.0072     0.0075     -3         0.006      0.0062    -3       0.301   0.314
PRTOC515   284       0.0141     0.0145     -3         0.0102     0.0104    -3       0.189   0.21
BRHOC053   190       0.0008     0.0008     1          0.0006     0.0006    1        0.011   0.012
PRTOC630   188       0.0111     0.0116     -4         0.0053     0.0056    -5       0.07    0.077
CMSOC320   187       0.0052     0.0052     1          0.0048     0.0048    1        0.149   0.153
CHSOC120   133       0.0025     0.0025     -2         0.0006     0.0006    -2       0.052   0.057
CMSOC330   95        0.006      0.0059     2          0.0058     0.0057    2        0.182   0.184
BRIOC022   93        0.001      0.001      0          0          0         1        0.018   0.019
IAAOC222   91        0.0156     0.0156     0          0.0116     0.0116    0        0.482   0.485
PRTOC001   84        0.005      0.005      0          0.0019     0.0019    0        0.074   0.08
Once problem is found, find cause
• Run Strobe on the CICS region or
batch job.
• Ask if program was
changed.
• Was a system parm
changed?
• Lurking problem
surfaced when user
patterns changed
• Did a new system go in?
Recommend change to fix problem
• Code fix
• Parameter change
• SQL or IDMS call change
• Run workload at a different time; smooth peaks
• Redesign database or add index
• Completely shut down workload
• If you don't know how to fix it, ask others
It helps to make performance recommendations if…
• You were a programmer in a previous life
• You were a DBA in a previous life
• You are knowledgeable in MVS, CICS, DASD, etc.
Integrity matters
• Be right, study before you speak
• Go for tuning that gives a payback
• If the workload isn't measurable, put in
mechanisms to measure it before doing the
tuning change
• Do some PR work - Send tuning results to
programmer and their management
Mainframe tools
• SAS
• MXG
• Strobe
• Jones built performance repositories
• Our performance website
• RMF 3
• Omegamon
Capacity Management’s Prime Objective:
When Do We Run Out?
• When do we need more of a resource?
• How much lead time do you need?
– Approval cycle
– Floor space
– Vendor Delivery Time
– Installation Time
– Acceptable Risk
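One way to turn "When do we run out?" into a date is to fit a straight line to past peak utilization and subtract the lead time. The sketch below assumes linear growth, which real workloads may not follow. The 90% threshold, the monthly figures and the lead-time values are hypothetical, and floor space and acceptable risk are left out.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def months_until_threshold(months, utilization, threshold):
    """Months from the last observation until the trend reaches threshold.

    Returns None when utilization is flat or falling, because the trend
    line never reaches the threshold in that case.
    """
    slope, intercept = fit_line(months, utilization)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - months[-1])

# Hypothetical monthly peak CPU utilization (%) and lead times (months).
months = [0, 1, 2, 3, 4, 5]
peak_util = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]
lead_time = {"approval": 2, "delivery": 2, "installation": 1}

remaining = months_until_threshold(months, peak_util, threshold=90.0)
decide_within = remaining - sum(lead_time.values())
print(remaining, decide_within)
```

With these sample figures the trend reaches 90% ten months out, so the purchase decision has to be made within five months.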
Forecasting Processes
• Business Forecasts
• Performance and Workload Data Repositories
• Resource Utilization Trends
• Workload Models
• Resource Utilization Models
• Performance Prediction
• Validate, Assess and Revise
Performance Tuning:
• We continually tune hardware and software, as well
as their interrelationships, to improve the
performance of systems.
• Ownership is shared across multiple departments.
• Very highly iterative – never done!
• Why:
– Direct positive impact upon end user experience.
– Tuning leads to cost avoidance.
Performance Tuning: How do we improve programs?
• Divide and Conquer:
– Which program in a batch job takes the longest?
– Which program uses the most CPU?
– Profile Code
– Tune infrastructure (including
network).
– Prioritize process
Performance Tuning
Identify Opportunities for Improvement – aka
"Hawgs" and "Dawgs".
• Which programs are slowest
(Dawgs)?
• Which programs use the most
resources (Hawgs)?
• Which programs are used the
most?
• Business criticality: How
important are they to the business?
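A minimal sketch of the Hawg and Dawg rankings, using three rows from the "Dig deeper" report (summed CPU time and response time for 9:30 to 10:30). The field and function names are hypothetical, and units are whatever the report used.

```python
from typing import NamedTuple

class ProgramStats(NamedTuple):
    name: str
    total_cpu: float   # summed CPU time for the interval
    resp_time: float   # response time per transaction

def top_hawgs(stats, n=10):
    """Programs that use the most CPU, largest first."""
    ranked = sorted(stats, key=lambda p: p.total_cpu, reverse=True)
    return [p.name for p in ranked[:n]]

def top_dawgs(stats, n=10):
    """Programs with the slowest response time, slowest first."""
    ranked = sorted(stats, key=lambda p: p.resp_time, reverse=True)
    return [p.name for p in ranked[:n]]

# Three rows from the 9:30 to 10:30 report.
stats = [
    ProgramStats("BRIOC022", 93, 0.018),
    ProgramStats("IAAOC222", 91, 0.482),
    ProgramStats("PRTOC001", 84, 0.074),
]
print(top_hawgs(stats))
print(top_dawgs(stats))
```

The two lists rarely match: BRIOC022 burns the most CPU of the three but responds fastest, so it is a Hawg and not a Dawg.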
Performance Data Repositories
• We maintain many performance data repositories –
these tend to be collections of statistics not detail
data.
• For example, we will not retain CICS transaction
detail, but we will calculate counts of transactions by
region by transaction name as well as average,
maximum and percentile statistics for a variety of
variables and intervals.
• SAS is our primary tool.
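The summarization step might look like the following sketch. It is written in Python for illustration only and is not the SAS code behind the repositories. The record layout, the nearest-rank percentile method and the sample values are assumptions.

```python
import math
from collections import defaultdict

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]

def summarize(records, pct=95):
    """Reduce (region, tran, response_time) detail rows to statistics.

    Returns {(region, tran): {"count", "avg", "max", "percentile"}} so the
    detail rows themselves do not need to be retained.
    """
    groups = defaultdict(list)
    for region, tran, resp in records:
        groups[(region, tran)].append(resp)

    summary = {}
    for key, values in groups.items():
        values.sort()
        summary[key] = {
            "count": len(values),
            "avg": sum(values) / len(values),
            "max": values[-1],
            "percentile": percentile(values, pct),
        }
    return summary

# Hypothetical detail records: (region, transaction name, response seconds).
records = [
    ("AOR1", "TRNA", 0.10),
    ("AOR1", "TRNA", 0.30),
    ("AOR1", "TRNA", 0.20),
    ("AOR2", "TRNB", 0.05),
]
summary = summarize(records)
print(summary[("AOR1", "TRNA")])
```

Keeping counts and percentiles per interval makes trend reports cheap to run, at the cost of not being able to recompute a different statistic later.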
Performance Data Repositories: Data Sources
• CICS – by day, by tran
• DASD Type 74 – by day, by LPAR, by VOLSER
• Jones application instrumentation
• MVS level – by day, by LPAR
• IDMS – by day, by program
• DB2 – by day, by tran
• Service and report classes – by day, by service class
• PROC SUMMARY, PROC APPEND
Business Metrics and Workloads
• Business Metrics typically use different time frames
than workload metrics.
• Business doesn’t forecast in terms of megabytes of
DASD, cpu seconds used, interactive sessions,
concurrent users or paging rates.
• They refer to branches, IRs, customers, trades,
purchases, $$$, payments, visits, exorbitant cost of
IT,…
Loved Ones: Sorry, all apps are not equal
• What is the business importance of
the application / workload?
• If there are diverse workloads on a
system it is necessary to prioritize
the work to ensure that the work is
processed in an order that reflects its
business priority.
• To understand priorities you have
to understand the business.
• Capacity planning activities should
also ensure that when work is
constrained, the highest priority work
is favored.
Performance testing
• Jones has a clone environment of production
• Use the LoadRunner tool to generate transactions
• Think time is adjustable
• A few hundred users is usually enough
• All major system enhancements are load tested
Load Testing: Objectives
• Is End User Performance acceptable?
• Will the introduction of these new features threaten the
health of other applications?
• How does response & resource utilization compare to
current production levels?
• Reproduce and troubleshoot production problems.
• Will we need to add capacity?
• In stress testing we measure response times at production peak
load and 5x production peak.
• Often identify 'Break Points' to watch for in production.
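A break point can be read off stress test results as the first load level where response time exceeds an acceptable limit. This is a rough sketch: the 1-second limit and the measurements are hypothetical, and load is expressed as a multiple of production peak.

```python
def find_break_point(results, limit):
    """Return the lowest load multiple whose response time exceeds limit.

    results is a list of (load_multiple, avg_response_seconds) pairs.
    Returns None when every tested load stays within the limit.
    """
    for load, resp in sorted(results):
        if resp > limit:
            return load
    return None

# Hypothetical measurements from 1x to 5x production peak.
results = [(1, 0.20), (2, 0.25), (3, 0.40), (4, 1.50), (5, 4.00)]
print(find_break_point(results, limit=1.0))
```

With these sample numbers response time stays under the limit through 3x peak and breaks at 4x, which is the level to watch for in production.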
Interaction with Availability
• A badly performing application is
effectively the same as the application
being unavailable.
• Capacity and Availability Management
share common goals / tools and
complement each other.
• Capacity Management needs to be aware
of Availability techniques deployed, such
as mirroring, load balancers or clustering,
in order to plan accurately for Capacity.
Questions: