Virginia POWER User Group
May 19, 2015
PowerVM
Dynamic Platform Optimizer &
PowerVP
© Copyright IBM Corporation 2015
Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
Optimization Redbook
POWER7 & POWER8
PowerVM Hypervisor
AIX, i & Linux
Java, WAS, DB2…
Compilers & optimization
Performance tools & tuning
Performance Optimization & Tuning Techniques for IBM Processors
© 2015 IBM Corporation
Dynamic Platform Optimizer
Update
What is Dynamic Platform Optimizer: DPO
 DPO is a PowerVM virtualization feature that enables users to
improve partition memory and processor placement (affinity) on
Power Servers after they are up and running.
 DPO performs a sequence of memory and processor
relocations to transform the existing server layout to the optimal
layout based on the server topology.
 Client Benefits
–Ability to run without a platform IPL (entire system)
–Improved performance in cloud or highly virtualized environments
–Dynamically adjust topology after mobility
What is Affinity?
 Affinity is a locality measurement of an entity with respect to physical resources
– An entity could be a thread within an OS instance (AIX/i/Linux) or the
OS/Virtual Machine itself. For this presentation, we focus on the latter.
– Physical resources could be a core, chip, node, socket, cache (L1/L2/L3),
memory controller, memory DIMMs, or I/O buses
 Affinity is optimal when the number of cycles required to access resources is
minimized
Chip
Dual Chip Module
Socket / Node
POWER7+ 760 Planar
Note x & z buses between chips, and A & B
buses between Dual Chip Modules (DCM)
In this model, each DCM is a “node”
DIMM Memory
Power 750/760 D Technical Overview
Thread Affinity
 Performance is closer to optimal when threads stay close to physical
resources. Thread Affinity is a measurement of proximity to a
resource
– Examples of resources: L2/L3 cache, memory, core, chip and node
– Cache Affinity: threads in different domains need to communicate
with each other, or cache needs to move with thread(s) migrating
across domains
– Memory Affinity: threads need to access data held in a different
memory bank not associated with the same chip or node
 Modern highly multi-threaded workloads are architected to have lightweight threads and distributed application memory
– Can span domains with limited impact
– Unix scheduler/dispatch/memory manager mechanisms spread
workloads
How does partition placement work?
 PowerVM knows the chip types and memory configuration, and
attempts to pack partitions onto the smallest number of chips / nodes /
drawers
– Optimizing placement will result in higher exploitation of local CPU
and memory resources
– Dispatches across node boundaries incur longer latencies,
and both AIX and PowerVM actively try to minimize them
via Enhanced Affinity mechanisms
 It considers the partition profiles and calculates optimal placements
– Placement is a function of Desired Entitlement, Desired &
Maximum Memory settings
– Maximum memory defines the size of the Hardware Page Table
maintained for each partition. For POWER7, it is 1/64th of
Maximum and 1/128th on POWER7+ and POWER8
– Ideally, Desired + (Maximum/HPT ratio) < node memory size
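The sizing guideline above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample memory sizes are made up, the ratios are the ones quoted on this slide, and any rounding the firmware applies to the real Hardware Page Table size is not modeled.

```python
# Hypothetical helpers; the 1/64 and 1/128 ratios come from the slide text.
HPT_RATIO = {"POWER7": 64, "POWER7+": 128, "POWER8": 128}


def hpt_size_mb(max_mem_mb, processor):
    """Approximate Hardware Page Table size implied by Maximum Memory."""
    return max_mem_mb / HPT_RATIO[processor]


def fits_in_node(desired_mb, max_mem_mb, node_mem_mb, processor):
    """True when Desired memory plus the HPT stays under one node's memory."""
    return desired_mb + hpt_size_mb(max_mem_mb, processor) < node_mem_mb


if __name__ == "__main__":
    # Made-up example: 96 GB desired, 256 GB maximum, 128 GB per node.
    print(hpt_size_mb(262144, "POWER7"))                   # 4096.0
    print(hpt_size_mb(262144, "POWER8"))                   # 2048.0
    print(fits_in_node(98304, 262144, 131072, "POWER8"))   # True
```

A large Maximum Memory setting costs real memory through the HPT, so an oversized maximum can push an otherwise node-sized partition across a node boundary.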
Partition Affinity: Why is it not always optimal?
Partition placement can become sub-optimal because of:
 Poor choices in Virtual Processor, Entitlement or Memory sizing
–The Hypervisor uses Entitlement & Memory settings to place a
partition. Wide use of 10:1 Virtual Processor to Entitlement settings
gives the Hypervisor no useful information for optimal placement.
–Before you ask, there is no single golden rule, magic formula, or
IBM-wide Best Practice for Virtual Processor & Entitlement sizing.
If you want education in sizing, ask for it.
 Dynamic creation/deletion, processor and memory ops (DLPAR)
 Hibernation (Suspend or Resume)
 Live Partition Mobility (LPM)
 CEC Hot add, Repair, & Maintenance (CHARM)
 Older firmware levels are less sophisticated in placement and
dynamic operations
Partition Affinity: Hypothetical 4 Node Frame
[Figure: Partitions X, Y, and Z and free LMBs laid out across the four nodes, before and after a DPO operation]
Current & Predicted Affinity, System & LPARs
lsmemopt -m managed_system -o currscore -r [sys | lpar]
lsmemopt -m managed_system -o calcscore -r [sys | lpar]
    [--id request_partition_list]
    [--xid protect_partition_list]
sys = system-wide score (default if the -r option is not specified)
lpar = partition scoring
Example: V7R780 firmware affinity scores
Current Scores: System & LPAR
> lsmemopt -m Doc -o currscore -r sys
curr_sys_score=89
> lsmemopt -m Doc -o currscore -r lpar
lpar_name=mdvio1_production,lpar_id=1,curr_lpar_score=100
lpar_name=mdvio2_production,lpar_id=2,curr_lpar_score=100
lpar_name=ec07_sn,lpar_id=7,curr_lpar_score=80
lpar_name=ec09_mm,lpar_id=9,curr_lpar_score=100
lpar_name=ec10_mm,lpar_id=10,curr_lpar_score=100
lpar_name=mhnode1,lpar_id=13,curr_lpar_score=70
Predicted Scores: System or LPAR
> lsmemopt -m Doc -o calcscore -r sys
curr_sys_score=89,predicted_sys_score=100,requested_lpar_ids=none,protected_lpar_ids=none
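Per-LPAR score output like the example on this slide is easy to post-process. The sketch below is an illustration only: it assumes the output has been captured as text, the helper names are hypothetical, and the threshold of 90 is an arbitrary value chosen for the example, not an IBM recommendation.

```python
def parse_record(line):
    """Split one 'key=value,key=value' line into a dict of strings."""
    return dict(field.split("=", 1) for field in line.strip().split(","))


def lpars_below(lines, threshold):
    """Return (name, score) pairs whose curr_lpar_score is under threshold."""
    rows = [parse_record(line) for line in lines if line.strip()]
    return [(r["lpar_name"], int(r["curr_lpar_score"]))
            for r in rows if int(r["curr_lpar_score"]) < threshold]


# Sample text copied from the slide's currscore -r lpar example.
SAMPLE = """\
lpar_name=mdvio1_production,lpar_id=1,curr_lpar_score=100
lpar_name=mdvio2_production,lpar_id=2,curr_lpar_score=100
lpar_name=ec07_sn,lpar_id=7,curr_lpar_score=80
lpar_name=ec09_mm,lpar_id=9,curr_lpar_score=100
lpar_name=ec10_mm,lpar_id=10,curr_lpar_score=100
lpar_name=mhnode1,lpar_id=13,curr_lpar_score=70
"""

if __name__ == "__main__":
    for name, score in lpars_below(SAMPLE.splitlines(), 90):
        print(name, score)
```

With the sample data this lists ec07_sn and mhnode1, the two partitions that pull the system score down to 89. The simple comma split is enough for these lines because none of the fields is quoted.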
HMC CLI: Starting/Stopping a DPO Operation
optmem -m managed_system -t affinity -o start
    [--id requested_partition_list]
    [--xid protect_partition_list]
optmem -m managed_system -t affinity -o stop
Partition lists are comma-separated and can include ranges
Include: --id <1,3,5-8>
Requested partitions: partitions prioritized (default = all LPARs)
Protected partitions: partitions that should not be touched
Exclude by name -x <name,> or by ID number --xid <5,10,16-20>
Optimization type: -t [affinity | mirror]; the latter is for Hypervisor mirroring
Typically exclude partitions that are not DPO aware (more later)
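When building the --id and --xid lists in a script, it can help to expand the range syntax described above into explicit partition IDs, for example to check that no partition appears in both lists. The following is a minimal sketch with a hypothetical helper name; it only mirrors the list syntax shown on this slide and does not call the HMC.

```python
def expand_partition_list(spec):
    """Expand a list such as '1,3,5-8' into [1, 3, 5, 6, 7, 8]."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            ids.extend(range(low, high + 1))
        else:
            ids.append(int(part))
    return ids


if __name__ == "__main__":
    requested = expand_partition_list("1,3,5-8")
    protected = expand_partition_list("5,10,16-20")
    print(requested)
    print(protected)
    # IDs named in both lists deserve a second look before starting DPO.
    print(sorted(set(requested) & set(protected)))
```

With the two example lists from the slide, partition 5 shows up in both the requested and the protected set.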
HMC CLI: DPO Status
> lsmemopt -m managed_system
in_progress=0,status=Finished,type=affinity,opt_id=1,progress=39,requested_lpar_ids=none,protected_lpar_ids=none,"impacted_lpar_ids=106,110"
progress: estimated progress %
impacted_lpar_ids: LPARs impacted by optimization (moved CPU, memory, hypervisor memory)
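The status line needs slightly more care to parse than the score lines, because the impacted_lpar_ids field contains a comma and is therefore quoted. The sketch below assumes the field is wrapped in plain double quotes and lets Python's csv module handle the quoting; the helper names are hypothetical.

```python
import csv


def parse_hmc_record(line):
    """Parse one HMC 'key=value' line, honoring double-quoted fields."""
    fields = next(csv.reader([line.strip()]))
    return dict(field.split("=", 1) for field in fields if field)


def id_list(value):
    """Turn '106,110' into [106, 110]; 'none' becomes an empty list."""
    if value == "none":
        return []
    return [int(x) for x in value.split(",")]


# Status line from the slide, joined onto one line.
STATUS = ('in_progress=0,status=Finished,type=affinity,opt_id=1,'
          'progress=39,requested_lpar_ids=none,protected_lpar_ids=none,'
          '"impacted_lpar_ids=106,110"')

if __name__ == "__main__":
    rec = parse_hmc_record(STATUS)
    print(rec["status"], rec["progress"])
    print(id_list(rec["impacted_lpar_ids"]))
    print(id_list(rec["requested_lpar_ids"]))
```

A script polling this output could wait until in_progress returns to 0 and then report the impacted partitions.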
What’s New: Schedule, Thresholds, Notifications
DPO: Supported Hardware and Firmware levels
 Introduced in fall 2012 (with feature code EB33)
• 770-MMD and 780-MHD with firmware level 760.00
• 795-FHB with firmware level 760.10 (760 with fix pack 1)
• Recommended level: 760_069, which includes the enhancements listed below
 Additional systems added spring 2013 with firmware level 770
– 710,720,730,740 D-models with firmware level 770.00
– 750,760 D-models with firmware level 770.10 (770 with fix pack 1)
– 770-MMC and 780-MHC with firmware level 770.20 (770 with fix pack 2)
– Performance enhancements – DPO memory movement time reduced
– Scoring algorithm improvements
– Recommend firmware at 770_021
 Affinity scoring at the LPAR level with firmware level 780 delivered Dec 2013
 770-MMB, 780-MHB added with 780.00
 795-FHB updated with 780.00
 770-MMD, 780-MHD (AM780_056_040 level released 4/30/2014)
http://www304.ibm.com/support/customercare/sas/f/power5cm/power7.html
Running DPO
 DPO aware Operating Systems
– AIX: 6.1 TL8 or later, AIX 7.1 TL2 or later
– IBM i: 7.1 TR6 or later
– Linux: Some reaffinitization in RHEL7/SLES12
(Fully implemented in follow-on releases)
– VIOS 2.2.2.0 or later
– HMC V7R7.6.1
– Partitions that are DPO aware are notified after DPO completes
 Re-affinitization Required to ensure affinity is as good as a CEC IPL
– Performance team measurements show reaffinitization is critical
– For older OS levels, users can exclude those partitions from optimization, or
reboot them after running DPO
More Information
 IBM PowerVM Virtualization Managing and Monitoring (June 2013)
– SG24-7590-04: http://www.redbooks.ibm.com/abstracts/sg247590.html?Open
 IBM PowerVM Virtualization Introduction and Configuration (June 2013)
– SG24-7940-05: http://www.redbooks.ibm.com/abstracts/sg247940.html?Open
 POWER7 Information Center under logical partitioning topics
– http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=%2Fp7hat%2Fiphblmanagedlparp6.htm
 POWER7 Logical Partitions “Under the Hood”
– http://www03.ibm.com/systems/resources/power_software_i_perfmgmt_processor_lpar.pdf
PowerVP
PowerVP Redbook
IBM PowerVP Introduction & Technical Overview
Review: POWER7+ 750/760 Planar
Chip
Dual Chip Module
Socket / Node
DIMM Memory
Memory Controller
Intra-DCM bus: x & z
Inter-DCM/socket bus: AB
I/O Bus
Power 750/760 D Technical Overview
Review: POWER7+ 770/780 Planar
Not as pretty as 750+ diagram,
note we have x, w & z buses
between chips with this model
and buses to other nodes (not
pictured) and IO are a little
more cryptic
Loc Code Conn Ref
Power 770/780 D Technical Overview
Why PowerVP: Power Virtualization Performance
 During an IPL of the entire Power System, the Hypervisor
determines an optimal resource placement strategy for the server
based on the partition configuration and the hardware topology of
the system.
 There was a desire to have a visual understanding of how the
hardware resources were assigned and being consumed by the
various partitions that were running on the platform.
 It was also desired to have a visual indication showing a
resource’s consumption and when it was going past a warning
threshold (yellow) and when it was entering an overcommitted
threshold (red).
PowerVP Overview
 Graphically displays data from existing and new performance tools
 Converges performance data from across the system
 Shows CEC, node & partition level performance data
 Illustrates topology utilization with colored “heat” threshold settings
 Enables drill down for both physical and logical approaches
 Allows real-time monitoring and recording function
 Simplifies physical/virtual environment, monitoring, and analysis
 Not intended to replace any current monitoring or management product
PowerVP Environment
Partition Collectors
 Required for logical view
 LPAR CPU utilization
 Disk Activity
 Network Activity
 CPI analysis
 Cache analysis
System-wide Collector
 One required per system
 P7 topology information
 P7 chip/core utilizations
 P7 Power bus utilizations
 Memory and I/O utilization
 LPAR entitlements, utilization
[Diagram labels: System Collector, Partition Collector; Operating system (IBM i, AIX, VIOS, Linux); Hypervisor interfaces; FW/Hypervisor; Power Hardware (Chip, Core, Thread PMUs, HPMCs, PMUlets)]
You only need to install a single system-wide collector to see global metrics
PowerVP: System Info, Global Usage, Recording
System
Information
Global
Utilization
Recording/Playback
Control
PowerVP: LPAR List View Options
PowerVP: System, Node and Partition Views
System
Topology
Node
Drill Down
Partition
Drill Down
PowerVP: System Topology
• The initial view shows the
hardware topology of the system
you are logged into
• In this view, we see a Power 795
with all eight books and/or nodes
installed, each with four sockets
• Values within boxes show CPU
usage
• Lines between nodes show SMP
fabric activity
PowerVP: Node drill down
• This view appears when
you click on a node and
allows you to see the
resource assignments or
consumption
• In this view, we see a
POWER7 780 node with
four chips each with four
cores
• Active buses are shown with solid colored lines. These can be
between nodes, chips, memory controllers and IO buses.
PowerVP 1.1: Node Utilization View (P8 S824)
SMP Bus
On systems like the 750+ &
S824, a node is a socket
holding a Dual Chip Module (DCM)
I/O Bus
Chip
Cores &
Utilization
Memory Controller
PowerVP 1.1.2: Node View with Affinity (P7 780)
PowerVP 1.1.2: Chip (POWER7 780 / 4 cores)
SMP Bus
IO
Memory
Controller
Chip
LPAR
Virtual
Processors
&
Memory
DIMM
PowerVP 1.1.2: CPU Affinity
LPAR 7 has 8 VPs. As we select cores, 2 VPs are “homed” to each
core. The fourth core has 4 VPs from four LPARs “homed” to it.
This does not prevent VPs from being dispatched elsewhere in the
pool as utilization requirements demand
PowerVP 1.1.2: Memory Affinity
LPAR 7 Online Memory is 32768 MB, 50% of 64 GB in DIMMs
LPARs listed in color order
PowerVP: Partition drill down
• View allows us to drill down
on resources being used by
selected partition
• In this view, we see CPU,
Memory, Disk IOPS, and
Ethernet being consumed.
We can also get an idea of
cache and memory affinity.
• We can drill down on several of these resources. Example: we can
drill down on the disk transfer or network activity by selecting the
resource
PowerVP: Partition drill down (CPU, CPI)
PowerVP: Partition drill down (Disk)
PowerVP: How do I use this?
 PowerVP is not intended to replace traditional performance
management products. It is not a management tool.
 It does provide an overview of hardware resource activity that
allows you to get a high-level view of
–Node/socket activity
–Cores assigned to dedicated and shared pool
–VM’s Virtual Processors assigned to cores
–VM’s memory assigned to DIMMs
–Memory bus activity
–IO bus activity
 It also provides partition activity related to
–Storage & Network
–System and partition Cycles-Per-Instruction
 PowerVP 1.1.2 is required for POWER8, but memory bus activity
is not currently available
PowerVP: How do I use this? High-Level
 High-level view can allow visual identification of node and bus stress
–Thresholding is largely arbitrary, but if one memory controller is
obviously saturated and others are inactive, you have an
indication more detailed review is required
–Nodes, CPUs, buses with heaviest activity provide a start point to
correlate with DPO information
–Placement issues with CPU & Memory are clearly represented
–There are no rules-of-thumb or best practices for thresholds (yet)
–You can review system Redbooks and determine where you are
with respect to bus performance (not always available, but newer
Redbooks are more informative)
 This tool provides high-level diagnosis with some detailed view (if
partition-level collectors are installed)
PowerVP: How do I use this? Low-Level
 Cycles-Per-Instruction (CPI) is a complicated subject; assessing it in
detail will be beyond the capacity of most customers
–In general, a lower CPI is better
–The fewer CPU cycles each instruction needs, the more
instructions get done
–PowerVP gives you various CPI values. These values, in
conjunction with OS tools can tell you whether you have good
affinity
 Affinity is a measurement of a thread’s locality to physical resources.
Resources can be many things: L1/L2/L3 cache, core(s), chip,
memory controller, socket, node, drawer, etc.
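The basic CPI arithmetic behind "lower is better" can be written out as a short sketch. The function names and the counter values are made up for illustration; PowerVP computes its own CPI figures from hardware counters, and nothing here reproduces its internals.

```python
def cpi(cycles, instructions):
    """Cycles per instruction: CPU cycles spent divided by instructions completed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def instructions_per_second(clock_hz, cpi_value):
    """Instruction throughput a core sustains at a given clock and CPI."""
    return clock_hz / cpi_value


if __name__ == "__main__":
    # Made-up counter readings for two runs of the same workload.
    good = cpi(8_000_000, 4_000_000)      # 2.0 cycles per instruction
    poor = cpi(16_000_000, 4_000_000)     # 4.0 cycles per instruction
    print(good, poor)
    # At the same 4 GHz clock, doubling the CPI halves the throughput.
    print(instructions_per_second(4e9, good))
    print(instructions_per_second(4e9, poor))
```

Poor affinity tends to show up as a higher CPI because instructions stall longer waiting on remote cache or memory, which is why CPI is read together with the OS affinity tools.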
Review: AIX Enhanced Affinity
 AIX on POWER7 and above uses Enhanced Affinity instrumentation
to localize threads by Scheduler Resource Allocation Domain (SRAD)
 AIX Enhanced Affinity measures
– Local: usually a chip
– Near: local node/DCM
– Far: other node/drawer/CEC
 These are logical mappings, which may or may not exactly map 1:1
with physical resources
[Diagrams: POWER7 770/780/795 and POWER8 S824 DCM, with affinity labeled Local (chip), Near (intranode), and Far (internode)]
AIX Affinity: lssrad tool shows logical placement
View of 24-way, two socket POWER7+ 760 with Dual Chip Modules (DCM)
6 cores per chip, 12 in each DCM
5 Virtual Processors x 4-way SMT = 20 logical CPUs
Terms:
REF    Node (drawer or DCM/MCM socket)
SRAD   Scheduler Resource Allocation Domain
# lssrad -av
REF1   SRAD        MEM      CPU
0
          0   12363.94      0-7
          2    4589.00      12-15
1
          1    5104.50      8-11
          3    3486.00      16-19
Node 0 holds SRAD 0 and 2; Node 1 holds SRAD 1 and 3
If a thread’s ‘home’ node was SRAD 0
SRAD 2 would be ‘near’
SRAD 1 & 3 would be ‘far’
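The local/near/far reasoning on this slide can be expressed as a small script. This is a sketch under assumptions: it expects lssrad -av output laid out as a REF1 number on its own line followed by SRAD, MEM, and CPU columns, it does not handle SRADs that lack memory or CPUs, and the function names are hypothetical. The rule used is the one from the slide: same SRAD is local, same REF1 is near, anything else is far.

```python
# Sample values taken from the lssrad -av example on this slide.
SAMPLE = """\
REF1   SRAD        MEM      CPU
0
          0   12363.94      0-7
          2    4589.00      12-15
1
          1    5104.50      8-11
          3    3486.00      16-19
"""


def parse_lssrad(text):
    """Map SRAD id -> (REF1 id, memory in MB, CPU range string)."""
    srads, ref = {}, None
    for line in text.splitlines()[1:]:       # skip the header row
        tokens = line.split()
        if len(tokens) == 1:                 # a REF1 line
            ref = int(tokens[0])
        elif len(tokens) == 3:               # an SRAD line with memory and CPUs
            srads[int(tokens[0])] = (ref, float(tokens[1]), tokens[2])
    return srads


def classify(srads, home):
    """Label every SRAD as local, near, or far relative to the home SRAD."""
    home_ref = srads[home][0]
    labels = {}
    for srad, (ref, _mem, _cpus) in srads.items():
        if srad == home:
            labels[srad] = "local"
        elif ref == home_ref:
            labels[srad] = "near"
        else:
            labels[srad] = "far"
    return labels


if __name__ == "__main__":
    print(classify(parse_lssrad(SAMPLE), home=0))
```

For a home SRAD of 0 this marks SRAD 2 as near and SRADs 1 and 3 as far, matching the slide.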
Affinity: Diagnosis
When may I have a problem?
- SRAD has CPUs but no memory or vice-versa
- When CPU or Memory are very unbalanced
But how do I really know?
- Tools tell you: lssrad/topas/mpstat/svmon (AIX), numactl (Linux),
PowerVP, Dynamic Platform Optimizer
- High percentage of threads with far dispatches
- Disparity in performance between equivalent systems
PowerVM & POWER8 provide a variety of improvements
- PowerVM has come a long way in the last three years – firmware,
AIX, Dynamic Platform Optimizer and PowerVP give you a lot of
options
- Cache (sizes, pre-fetch, L4, Non-Uniform Cache Access logic),
Controller, massive DIMM bandwidth improvement
- Inter-socket latencies and efficiency have progressively improved
from POWER7 to POWER7+ and now POWER8
Review: AIX topas Logical Affinity (‘M’ option)
Topas Monitor for host: claret4        Interval: 2
===================================================================
REF1  SRAD  TOTALMEM  INUSE  FREE   FILECACHE  HOMETHRDS  CPUS
-------------------------------------------------------------------
0     2     4.48G     515M   3.98G  52.9M      134.0      12-15
      0     12.1G     1.20G  10.9G  141M       236.0      0-7
1     1     4.98G     537M   4.46G  59.0M      129.0      8-11
      3     3.40G     402M   3.01G  39.7M      116.0      16-19
===================================================================
CPU   SRAD  TOTALDISP  LOCALDISP%  NEARDISP%  FARDISP%
---------------------------------------------------------
0     0     303.0      43.6        15.5       40.9
2     0     1.00       100.0       0.0        0.0
3     0     1.00       100.0       0.0        0.0
4     0     1.00       100.0       0.0        0.0
5     0     1.00       100.0       0.0        0.0
6     0     1.00       100.0       0.0        0.0
Slide callouts: Node, Chip, Dispatches; local is optimal
What’s a bad FARDISP% rate? There is no rule of thumb, but thousands of far dispatches per
second will likely indicate lower performance
How do we fix it? Entitlement & Memory sizing Best Practices + Current Firmware +
Dynamic Platform Optimizer
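To turn FARDISP% into a dispatch count, multiply each CPU's TOTALDISP by its far percentage. The sketch below does that for the per-CPU rows in the topas example on this slide. The function names are hypothetical, and the slide does not say whether TOTALDISP is a per-second or a per-interval figure, so treat the result as a count in whatever unit topas reports.

```python
# (cpu, srad, totaldisp, localdisp%, neardisp%, fardisp%) from the topas example.
ROWS = [
    (0, 0, 303.0, 43.6, 15.5, 40.9),
    (2, 0, 1.00, 100.0, 0.0, 0.0),
    (3, 0, 1.00, 100.0, 0.0, 0.0),
    (4, 0, 1.00, 100.0, 0.0, 0.0),
    (5, 0, 1.00, 100.0, 0.0, 0.0),
    (6, 0, 1.00, 100.0, 0.0, 0.0),
]


def far_dispatches(rows):
    """Sum of far dispatches across CPUs: TOTALDISP x FARDISP% / 100."""
    return sum(total * far_pct / 100.0
               for _cpu, _srad, total, _local, _near, far_pct in rows)


def far_share(rows):
    """Far dispatches as a percentage of all dispatches in the sample."""
    total = sum(row[2] for row in rows)
    return 100.0 * far_dispatches(rows) / total if total else 0.0


if __name__ == "__main__":
    print(round(far_dispatches(ROWS), 1))   # about 123.9 far dispatches
    print(round(far_share(ROWS), 1))        # about 40.2 percent of all dispatches
```

Weighting by TOTALDISP matters here: five of the six CPUs show 100% local dispatches, but nearly all the activity is on CPU 0, where about 41% of dispatches are far.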
PowerVP Physical Affinity: VM View
• PowerVP can show us physical affinity (local, remote & distant)
• AIX topas can show us logical affinity (local, near & far)
• More local, more ideal
Cache Affinity
DIMM Affinity
Local is optimal
Computed CPI is an inverse calculation, lower is typically better
PowerVP supported Power models and ITEs
 Power System models and ITEs with 770 firmware support
• 710-E1D, 720-E4D, 730-E2D, 740-E6D (also includes Linux D models)
• 750-E8D, 760-RMD
• 770-MMC, 780-MHC, ESE 9109-RMD
• p260-22X, p260-23X, p460-42X, p460-43X, p270-24X, p470-44X, p24L-7FL
• 71R-L1S, 71R-L1C, 71R-L1D, 71R-L1T, 7R2-L2C, 7R2-L2S, 7R2-L2D, 7R2-L2T
 Power System models added with 780 firmware support
– 770-MMB and 780-MHB (eConfig support 1/28/2014)
– 795-FHB Dec 2013
780 Power Firmware: http://www304.ibm.com/support/customercare/sas/f/power5cm/power7.html
 Power System models with 780 firmware support
– 770-MMD, 780-MHD (4/30/2014)
 Pre-770 firmware models do not have instrumentation to support PowerVP
* Some Power models and firmware releases listed above are currently planned for the future and have not yet been announced.
* All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
PowerVP OS Support
 Announced and GA in 4Q 2013
 PowerVP 1.1.2 shipped 6/2014, SP2 8/2014
 Available as standalone product or with PowerVM Enterprise Edition
 Agents will run on IBM i, AIX, Linux on Power and VIOS
–System i V7R1, AIX 6.1/7.1, any VIOS level supporting POWER7
–RHEL 6.4, 6.5, 7.0, SUSE 11 SP 3
–Other Linux variants expected in 2015 update
 Client supported on Windows, Linux, and AIX
–Client requires Java 1.6 or greater
–Installer provided for Windows, Linux, and AIX
–Also includes a Java installer, which has worked on OS X (my own
testing)
–The Java installer has worked on VMware and Mac where the others don’t