Virginia POWER User Group, May 19, 2015
PowerVM Dynamic Platform Optimizer & PowerVP
© Copyright IBM Corporation 2015. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.

Optimization Redbook
Performance Optimization & Tuning Techniques for IBM Processors:
– POWER7 & POWER8
– PowerVM Hypervisor
– AIX, i & Linux
– Java, WAS, DB2…
– Compilers & optimization
– Performance tools & tuning
2 © 2015 IBM Corporation

Dynamic Platform Optimizer Update
3 © 2015 IBM Corporation

What is Dynamic Platform Optimizer: DPO
DPO is a PowerVM virtualization feature that enables users to improve partition memory and processor placement (affinity) on Power Servers after they are up and running. DPO performs a sequence of memory and processor relocations to transform the existing server layout into the optimal layout based on the server topology.
Client benefits:
– Ability to run without a platform IPL (entire system)
– Improved performance in cloud or highly virtualized environments
– Dynamically adjust topology after mobility
4 © 2015 IBM Corporation

What is Affinity?
Affinity is a locality measurement of an entity with respect to physical resources.
– An entity could be a thread within an OS instance (AIX/i/Linux) or the OS/virtual machine itself. For this presentation, we focus on the latter.
– Physical resources could be a core, chip, node, socket, cache (L1/L2/L3), memory controller, memory DIMMs, or I/O buses.
Affinity is optimal when the number of cycles required to access resources is minimized.
[Diagram: POWER7+ 760 planar – chip, Dual Chip Module, socket/node, DIMM memory. Note the x & z buses between chips and the A & B buses between Dual Chip Modules (DCMs); in this model, each DCM is a "node". Source: Power 750/760 D Technical Overview]
5 © 2015 IBM Corporation

Thread Affinity
Performance is closer to optimal when threads stay close to physical resources. Thread affinity is a measurement of proximity to a resource.
– Examples of resources: L2/L3 cache, memory, core, chip and node
– Cache affinity: threads in different domains need to communicate with each other, or cache needs to move with thread(s) migrating across domains
– Memory affinity: threads need to access data held in a different memory bank not associated with the same chip or node
Modern highly multi-threaded workloads are architected to have lightweight threads and distributed application memory.
– Can span domains with limited impact
– Unix scheduler/dispatch/memory manager mechanisms spread workloads
6 © 2015 IBM Corporation

How does partition placement work?
PowerVM knows the chip types and memory configuration, and attempts to pack partitions onto the smallest number of chips / nodes / drawers.
– Optimizing placement results in higher exploitation of local CPU and memory resources.
– Dispatches across node boundaries incur longer latencies, and both AIX and PowerVM actively try to minimize that via Enhanced Affinity mechanisms.
It considers the partition profiles and calculates optimal placements.
– Placement is a function of the Desired Entitlement and the Desired & Maximum Memory settings.
– Maximum memory defines the size of the Hardware Page Table (HPT) maintained for each partition: 1/64th of Maximum on POWER7, and 1/128th on POWER7+ and POWER8.
– Ideally, Desired + (Maximum / HPT ratio) < node memory size, if possible.
7 © 2015 IBM Corporation
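A hedged worked example of that last sizing rule (the memory figures are hypothetical): a partition defined with Desired memory of 96 GB and Maximum memory of 256 GB reserves a Hardware Page Table of 256/64 = 4 GB on POWER7, so roughly 96 + 4 = 100 GB needs to fit within a single node's memory for ideal placement; on POWER7+ and POWER8 the same Maximum reserves only 256/128 = 2 GB, for roughly 98 GB.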
Partition Affinity: Why is it not always optimal?
Partition placement can become sub-optimal because of:
Poor choices in Virtual Processor, Entitlement or Memory sizing
– The Hypervisor uses Entitlement & Memory settings to place a partition. Wide use of 10:1 Virtual Processor to Entitlement settings does not lend any information for optimal placement.
– Before you ask, there is no single golden rule, magic formula, or IBM-wide best practice for Virtual Processor & Entitlement sizing. If you want education in sizing, ask for it.
Dynamic creation/deletion, processor and memory ops (DLPAR)
Hibernation (Suspend or Resume)
Live Partition Mobility (LPM)
CEC Hot add, Repair, & Maintenance (CHARM)
Older firmware levels are less sophisticated in placement and dynamic operations.
8 © 2015 IBM Corporation

Partition Affinity: Hypothetical 4 Node Frame
[Diagram: a hypothetical four-node frame before and after a DPO operation – partitions X, Y and Z and free LMBs are shown spread across nodes, then consolidated.]
9 © 2015 IBM Corporation

Current & Predicted Affinity, System & LPARs
lsmemopt -m managed_system -o currscore -r [sys | lpar]
lsmemopt -m managed_system -o calcscore -r [sys | lpar] [--id request_partition_list] [--xid protect_partition_list]
– sys = system-wide score (default if the -r option is not specified)
– lpar = partition scoring
10 © 2015 IBM Corporation

Example: V7R780 firmware affinity scores
Current scores, system & LPAR:
> lsmemopt -m Doc -o currscore -r sys
curr_sys_score=89
> lsmemopt -m Doc -o currscore -r lpar
lpar_name=mdvio1_production,lpar_id=1,curr_lpar_score=100
lpar_name=mdvio2_production,lpar_id=2,curr_lpar_score=100
lpar_name=ec07_sn,lpar_id=7,curr_lpar_score=80
lpar_name=ec09_mm,lpar_id=9,curr_lpar_score=100
lpar_name=ec10_mm,lpar_id=10,curr_lpar_score=100
lpar_name=mhnode1,lpar_id=13,curr_lpar_score=70
Predicted scores, system or LPAR:
> lsmemopt -m Doc -o calcscore -r sys
curr_sys_score=89,predicted_sys_score=100,requested_lpar_ids=none,protected_lpar_ids=none
11 © 2015 IBM Corporation

HMC CLI: Starting/Stopping a DPO Operation
optmem -m managed_system -t affinity -o start [--id requested_partition_list] [--xid protect_partition_list]
optmem -m managed_system -t affinity -o stop
Partition lists are comma-separated and can include ranges.
– Requested partitions (include with --id <1,3,5-8>): partitions prioritized for optimization (default = all LPARs)
– Protected partitions: partitions that should not be touched; exclude by name (-x <name,>) or by id number (--xid <5,10,16-20>)
– Optimization type is -t [affinity | mirror]; the latter is for Hypervisor memory mirroring
– Typically exclude partitions that are not DPO aware (more later)
12 © 2015 IBM Corporation

HMC CLI: DPO Status
> lsmemopt -m managed_system
in_progress=0,status=Finished,type=affinity,opt_id=1,progress=39,requested_lpar_ids=none,protected_lpar_ids=none,"impacted_lpar_ids=106,110"
– progress: estimated progress %
– impacted_lpar_ids: LPARs impacted by the optimization (moved CPU, memory, hypervisor memory)
13 © 2015 IBM Corporation
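Putting the HMC commands above together, a hedged end-to-end sketch of one DPO pass (the managed system name Doc is the example system used earlier; the protected ids 5 and 10 are hypothetical, standing in for partitions that are not DPO aware):

> lsmemopt -m Doc -o currscore -r sys              (current system-wide affinity score)
> lsmemopt -m Doc -o calcscore -r lpar             (predicted per-LPAR scores if DPO were run)
> optmem -m Doc -t affinity -o start --xid 5,10    (start the optimization, protecting LPARs 5 and 10)
> lsmemopt -m Doc                                  (poll status and progress until status=Finished)
> optmem -m Doc -t affinity -o stop                (only if the operation must be abandoned early)
> lsmemopt -m Doc -o currscore -r sys              (re-check the score afterwards)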
What's New: Schedule, Thresholds, Notifications
14 © 2015 IBM Corporation

DPO: Supported Hardware and Firmware levels
Introduced in fall 2012 (with feature code EB33):
• 770-MMD and 780-MHD with firmware level 760.00
• 795-FHB with firmware level 760.10 (760 with fix pack 1)
• Recommend 760_069, which has the enhancements below
Additional systems added spring 2013 with firmware level 770:
– 710, 720, 730, 740 D-models with firmware level 770.00
– 750, 760 D-models with firmware level 770.10 (770 with fix pack 1)
– 770-MMC and 780-MHC with firmware level 770.20 (770 with fix pack 2)
– Performance enhancements: DPO memory movement time reduced, scoring algorithm improvements
– Recommend firmware at 770_021
Affinity scoring at the LPAR level with firmware level 780, delivered Dec 2013:
– 770-MMB, 780-MHB added with 780.00
– 795-FHB updated with 780.00
– 770-MMD, 780-MHD (AM780_056_040 level released 4/30/2014)
Firmware levels: http://www-304.ibm.com/support/customercare/sas/f/power5cm/power7.html
15 © 2015 IBM Corporation

Running DPO
DPO-aware operating systems:
– AIX 6.1 TL8 or later, AIX 7.1 TL2 or later
– IBM i 7.1 TR6 or later
– Linux: some re-affinitization in RHEL 7 / SLES 12 (fully implemented in follow-on releases)
– VIOS 2.2.2.0 or later
– HMC V7R7.6.1
– Partitions that are DPO aware are notified after DPO completes
Re-affinitization is required to ensure affinity is as good as after a CEC IPL.
– Performance team measurements show re-affinitization is critical.
– For older OS levels, users can exclude those partitions from optimization, or reboot them after running DPO.
16 © 2015 IBM Corporation

More Information
IBM PowerVM Virtualization Managing and Monitoring (June 2013)
– SG24-7590-04: http://www.redbooks.ibm.com/abstracts/sg247590.html?Open
IBM PowerVM Virtualization Introduction and Configuration (June 2013)
– SG24-7940-05: http://www.redbooks.ibm.com/abstracts/sg247940.html?Open
POWER7 Information Center, under the logical partitioning topics
– http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=%2Fp7hat%2Fiphblmanagedlparp6.htm
POWER7 Logical Partitions "Under the Hood"
– http://www-03.ibm.com/systems/resources/power_software_i_perfmgmt_processor_lpar.pdf
17 © 2015 IBM Corporation

PowerVP
18 © 2015 IBM Corporation

PowerVP Redbook
IBM PowerVP Introduction & Technical Overview
19 © 2015 IBM Corporation

Review: POWER7+ 750/760 Planar
[Diagram: chip, Dual Chip Module, socket/node, DIMM memory, memory controller, I/O bus. Intra-DCM buses: x & z; inter-DCM/socket buses: A & B. Source: Power 750/760 D Technical Overview]
20 © 2015 IBM Corporation

Review: POWER7+ 770/780 Planar
Not as pretty as the 750+ diagram. Note that we have x, w & z buses between chips with this model; the buses to other nodes (not pictured) and I/O are a little more cryptic.
[Diagram labels: Loc Code, Conn Ref. Source: Power 770/780 D Technical Overview]
21 © 2015 IBM Corporation

Why PowerVP: Power Virtualization Performance
During an IPL of the entire Power System, the Hypervisor determines an optimal resource placement strategy for the server based on the partition configuration and the hardware topology of the system.
There was a desire to have a visual understanding of how the hardware resources were assigned and being consumed by the various partitions running on the platform.
It was also desirable to have a visual indication of a resource's consumption, showing when it was going past a warning threshold (yellow) and when it was entering an overcommitted threshold (red).
22 © 2015 IBM Corporation

PowerVP Overview
– Graphically displays data from existing and new performance tools
– Converges performance data from across the system
– Shows CEC, node & partition level performance data
– Illustrates topology utilization with colored "heat" threshold settings
– Enables drill down for both physical and logical approaches
– Allows real-time monitoring and a recording function
– Simplifies physical/virtual environment monitoring and analysis
– Not intended to replace any current monitoring or management product
23 © 2015 IBM Corporation

PowerVP Environment
Partition collectors (required for the logical view):
– LPAR CPU utilization
– Disk activity
– Network activity
– CPI analysis
– Cache analysis
System-wide collector (one required per system):
– P7 topology information
– P7 chip/core utilizations
– P7 Power bus utilizations
– Memory and I/O utilization
– LPAR entitlements and utilization
[Diagram: system collector and partition collector data paths – operating system (IBM i, AIX, VIOS, Linux), hypervisor interfaces (FW/Hypervisor), chip/core HPMCs, PMUlets, thread PMUs, Power hardware]
You only need to install a single system-wide collector to see global metrics.
24 © 2015 IBM Corporation

PowerVP: System Info, Global Usage, Recording
[Screenshot callouts: system information, global utilization, recording/playback control]
25 © 2015 IBM Corporation

PowerVP: LPAR List View Options
26 © 2015 IBM Corporation

PowerVP: System, Node and Partition Views
[Screenshot callouts: system topology, node drill down, partition drill down]
27 © 2015 IBM Corporation

PowerVP: System Topology
• The initial view shows the hardware topology of the system you are logged into.
• In this view, we see a Power 795 with all eight books/nodes installed, each with four sockets.
• Values within boxes show CPU usage.
• Lines between nodes show SMP fabric activity.
28 © 2015 IBM Corporation

PowerVP: Node drill down
• This view appears when you click on a node and allows you to see the resource assignments and consumption.
• In this view, we see a POWER7 780 node with four chips, each with four cores.
• Active buses are shown with solid colored lines. These can be between nodes, chips, memory controllers and I/O buses.
29 © 2015 IBM Corporation

PowerVP 1.1: Node Utilization View (P8 S824)
On systems like the 750+ and S824, a node is a socket with a Dual Chip Module (DCM).
[Screenshot labels: SMP bus, I/O bus, chip cores & utilization, memory controller]
30 © 2015 IBM Corporation

PowerVP 1.1.2: Node View with Affinity (P7 780)
31 © 2015 IBM Corporation

PowerVP 1.1.2: Chip (POWER7 780 / 4 cores)
[Screenshot labels: SMP bus, I/O, memory controller, chip, DIMM, LPAR virtual processors & memory]
32 © 2015 IBM Corporation

PowerVP 1.1.2: CPU Affinity
LPAR 7 has 8 VPs. As we select cores, 2 VPs are "homed" to each core. The fourth core has 4 VPs from four LPARs "homed" to it. This does not prevent VPs from being dispatched elsewhere in the pool as utilization requirements demand.
33 © 2015 IBM Corporation

PowerVP 1.1.2: Memory Affinity
LPAR 7 online memory is 32768 MB, 50% of the 64 GB in DIMMs. LPARs are listed in color order.
34 © 2015 IBM Corporation

PowerVP: Partition drill down
• This view allows us to drill down on the resources being used by the selected partition.
• In this view, we see CPU, memory, disk IOPS, and Ethernet being consumed. We can also get an idea of cache and memory affinity.
• We can drill down on several of these resources; for example, we can drill down on disk transfers or network activity by selecting the resource.
35 © 2015 IBM Corporation

PowerVP: Partition drill down (CPU, CPI)
36 © 2015 IBM Corporation

PowerVP: Partition drill down (Disk)
37 © 2015 IBM Corporation
PowerVP: How do I use this?
PowerVP is not intended to replace traditional performance management products. It is not a management tool.
It does provide an overview of hardware resource activity that allows you to get a high-level view of:
– Node/socket activity
– Cores assigned to dedicated and shared pools
– VMs' virtual processors assigned to cores
– VMs' memory assigned to DIMMs
– Memory bus activity
– I/O bus activity
It also provides partition activity related to:
– Storage & network
– System and partition Cycles-Per-Instruction
PowerVP 1.1.2 is required for POWER8, but memory bus activity is not currently available.
38 © 2015 IBM Corporation

PowerVP: How do I use this? High-Level
The high-level view can allow visual identification of node and bus stress.
– Thresholding is largely arbitrary, but if one memory controller is obviously saturated and others are inactive, you have an indication that a more detailed review is required.
– Nodes, CPUs, and buses with the heaviest activity provide a starting point to correlate with DPO information.
– Placement issues with CPU & memory are clearly represented.
– There are no rules-of-thumb or best practices for thresholds (yet).
– You can review system Redbooks and determine where you are with respect to bus performance (not always available, but newer Redbooks are more informative).
This tool provides high-level diagnosis with some detailed views (if partition-level collectors are installed).
39 © 2015 IBM Corporation

PowerVP: How do I use this? Low-Level
Cycles-Per-Instruction (CPI) is a complicated subject; it will be beyond the capacity of most customers to assess in detail.
– In general, a lower CPI is better: the fewer CPU cycles per instruction, the more instructions can get done.
– PowerVP gives you various CPI values. These values, in conjunction with OS tools, can tell you whether you have good affinity.
Affinity is a measurement of a thread's locality to physical resources. Resources can be many things: L1/L2/L3 cache, core(s), chip, memory controller, socket, node, drawer, etc.
40 © 2015 IBM Corporation

Review: AIX Enhanced Affinity
AIX on POWER7 and above uses Enhanced Affinity instrumentation to localize threads by Scheduler Resource Allocation Domain (SRAD).
AIX Enhanced Affinity measures:
– Local: usually a chip (the local chip)
– Near: local node/DCM (intranode)
– Far: other node/drawer/CEC (internode)
These are logical mappings, which may or may not map exactly 1:1 with physical resources.
[Diagram: near (intranode) vs. far (internode) domains on a POWER7 770/780/795 and a POWER8 S824 DCM]
41 © 2015 IBM Corporation

AIX Affinity: lssrad tool shows logical placement
View of a 24-way, two-socket POWER7+ 760 with Dual Chip Modules (DCMs): 6 cores per chip, 12 in each DCM; 5 virtual processors x 4-way SMT = 20 logical CPUs.
Terms:
– REF1: node (drawer, or DCM/MCM socket)
– SRAD: Scheduler Resource Allocation Domain

# lssrad -av
REF1   SRAD        MEM      CPU
0
          0   12363.94      0-7
          2    4589.00      12-15
1
          1    5104.50      8-11
          3    3486.00      16-19

If a thread's 'home' node was SRAD 0, SRAD 2 would be 'near', and SRADs 1 & 3 would be 'far'.
42 © 2015 IBM Corporation
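A hedged aside on scripting the same check (lssrad and numactl are both named on the next slide's tool list; numactl's --hardware option is assumed here as the usual way to print the NUMA layout):

# lssrad -av           (AIX: logical CPU/memory placement by SRAD; rerun after a DPO pass to confirm the OS sees the new layout)
# numactl --hardware   (Linux: NUMA nodes with their CPUs and memory, as seen by the OS)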
Affinity: Diagnosis
When may I have a problem?
- An SRAD has CPUs but no memory, or vice versa
- When CPU or memory are very unbalanced
But how do I really know?
- Tools tell you: lssrad/topas/mpstat/svmon (AIX), numactl (Linux), PowerVP, Dynamic Platform Optimizer
- A high percentage of threads with far dispatches
- Disparity in performance between equivalent systems
PowerVM & POWER8 provide a variety of improvements:
- PowerVM has come a long way in the last three years – firmware, AIX, Dynamic Platform Optimizer and PowerVP give you a lot of options
- Cache (sizes, pre-fetch, L4, Non-Uniform Cache Access logic), memory controller, and massive DIMM bandwidth improvements
- Inter-socket latencies and efficiency have progressively improved from POWER7 to POWER7+ and now POWER8
43 © 2015 IBM Corporation

Review: AIX topas Logical Affinity ('M' option)
Topas Monitor for host: claret4    Interval: 2
(Node / SRAD memory)
REF1  SRAD  TOTALMEM  INUSE  FREE   FILECACHE  HOMETHRDS  CPUS
0     2     4.48G     515M   3.98G  52.9M      134.0      12-15
      0     12.1G     1.20G  10.9G  141M       236.0      0-7
1     1     4.98G     537M   4.46G  59.0M      129.0      8-11
      3     3.40G     402M   3.01G  39.7M      116.0      16-19
(Chip dispatches)
CPU  SRAD  TOTALDISP  LOCALDISP%  NEARDISP%  FARDISP%
0    0     303.0      43.6        15.5       40.9
2    0     1.00       100.0       0.0        0.0
3    0     1.00       100.0       0.0        0.0
4    0     1.00       100.0       0.0        0.0
5    0     1.00       100.0       0.0        0.0
6    0     1.00       100.0       0.0        0.0
Local dispatches are optimal.
What's a bad FARDISP% rate? There is no rule-of-thumb, but thousands of far dispatches per second will likely indicate lower performance.
How do we fix it? Entitlement & memory sizing best practices + current firmware + Dynamic Platform Optimizer.
44 © 2015 IBM Corporation
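A hedged sketch of watching dispatch locality over an interval with the same AIX tools (the -M and -d flags are given as I recall them, and the interval/count values are arbitrary):

# topas -M          (should open topas directly in the memory/CPU affinity panel shown above; pressing 'M' inside topas does the same)
# mpstat -d 2 30    (per-logical-CPU dispatcher statistics every 2 seconds, 30 samples; sustained far dispatches point back to placement)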
PowerVP Physical Affinity: VM View
• PowerVP can show us physical affinity (local, remote & distant).
• AIX topas can show us logical affinity (local, near & far).
• More local is more ideal.
• Computed CPI is an inverse calculation; lower is typically better.
[Screenshot callouts: cache affinity, DIMM affinity – local is optimal]
45 © 2015 IBM Corporation

PowerVP supported Power models and ITEs
Power System models and ITEs with 770 firmware support:
• 710-E1D, 720-E4D, 730-E2D, 740-E6D (also includes Linux D models)
• 750-E8D, 760-RMD
• 770-MMC, 780-MHC, ESE 9109-RMD
• p260-22X, p260-23X, p460-42X, p460-43X, p270-24X, p470-44X, p24L-7FL
• 71R-L1S, 71R-L1C, 71R-L1D, 71R-L1T, 7R2-L2C, 7R2-L2S, 7R2-L2D, 7R2-L2T
Power System models added with 780 firmware support:
– 770-MMB and 780-MHB (eConfig support 1/28/2014)
– 795-FHB, Dec 2013
Power System models with 780 firmware support:
– 770-MMD, 780-MHD (4/30/2014)
Pre-770 firmware models do not have the instrumentation to support PowerVP.
780 Power Firmware: http://www-304.ibm.com/support/customercare/sas/f/power5cm/power7.html
* Some Power models and firmware releases listed above are currently planned for the future and have not yet been announced.
* All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
46 © 2015 IBM Corporation

PowerVP OS Support
Announced and GA in 4Q 2013; PowerVP 1.1.2 shipped 6/2014, SP2 8/2014.
Available as a standalone product or with PowerVM Enterprise Edition.
Agents will run on IBM i, AIX, Linux on Power and VIOS:
– System i V7R1, AIX 6.1/7.1, any VIOS level supporting POWER7
– RHEL 6.4, 6.5, 7.0, SUSE 11 SP3
– Other Linux variants expected in a 2015 update
Client supported on Windows, Linux, and AIX:
– Client requires Java 1.6 or greater
– Installer provided for Windows, Linux, and AIX
– Also includes a Java installer, which has worked on OSX (my own testing); it has worked on VMware and Mac where the others don't
47 © 2015 IBM Corporation