New Performance Updates

Virginia POWER User Group

May 19, 2015

What’s New Performance Features for IBM PowerVM & POWER8

Steve Nasypany

© Copyright IBM Corporation 2015 Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.

General Performance News

© 2015 IBM Corporation

Optimization Redbook

Draft available now!

• POWER7 & POWER8
• PowerVM Hypervisor
• AIX, i & Linux
• Java, WAS, DB2…
• Compilers & optimization
• Performance tools & tuning

http://www.redbooks.ibm.com/redpieces/abstracts/sg248171.html

Quick View of POWER8

• POWER8 Migration & Best Practices
  http://www14.software.ibm.com/webapp/set2/sas/f/best/home.html

• SAP, Oracle, Siebel results linked here
  http://www-03.ibm.com/systems/power/hardware/benchmarks/erp.html

• IBM Power Systems Performance Report – POWER8 Single-Thread, SMT2, SMT4 & SMT8 numbers!
  – Per report
    • Uplift from SMT2 to SMT4 is 30%
    • Uplift from SMT4 to SMT8 is 7%
    • Uplift from Single-Thread to SMT8 is 100%
    • Per SMT thread vs throughput should be very linear as threads are more equally biased in POWER8 (covered later)
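The three uplift figures quoted from the report also imply a Single-Thread to SMT2 uplift, which the slide does not state. A minimal Python sketch of that arithmetic, assuming the quoted percentages compose multiplicatively (an assumption made here for illustration, not something the slide confirms):

```python
# Uplift figures quoted on the slide, expressed as throughput ratios.
SMT2_TO_SMT4 = 1.30  # +30%
SMT4_TO_SMT8 = 1.07  # +7%
ST_TO_SMT8 = 2.00    # +100%


def implied_st_to_smt2(st_to_smt8, smt2_to_smt4, smt4_to_smt8):
    """Back out the ST -> SMT2 ratio, assuming the uplifts multiply."""
    return st_to_smt8 / (smt2_to_smt4 * smt4_to_smt8)


ratio = implied_st_to_smt2(ST_TO_SMT8, SMT2_TO_SMT4, SMT4_TO_SMT8)
print(f"Implied ST -> SMT2 uplift: {(ratio - 1) * 100:.0f}%")
```

Under that assumption the implied Single-Thread to SMT2 uplift is roughly 44%, which shows how most of the SMT gain arrives with the first additional thread.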

Dynamic System Optimizer

• The Dynamic System Optimizer function in AIX is not supported on POWER8
  – AIX function, formerly called Active System Optimizer; (aso) daemon function available for free
  – Additional charged features for
    • Autonomic Large Page (16 MB) Migration
    • Autonomic Processor Pre-Fetch Control
  – AIX asoo commands will not execute anything on POWER8
    • If you migrate from POWER7 with it enabled, it will remain enabled, but aso daemon will not do anything
    • No performance concern, but can disable if you find the aso logs annoying

• Future support is based on two issues
  – Benefit of DSO was not judged a high-priority for Scale Out systems
  – Functional support is not as much a technical issue as a testing resources issue
  – Lab is interested in feedback on customers who want Scale Up support for POWER8. Complain to CTS or me and I will forward to development

• This has no impact on Dynamic Platform Optimizer
  – DSO optimizes threads within a virtual machine (OS) instance
  – DPO optimizes virtual machine placement within a frame

Java

• Java 7.1 SR1 is the preferred level for POWER7 and POWER8
  – Java 6 SR7 is the minimally recommended level for POWER7, as it contains optimizations for POWER7 and the default use of 64KB (versus 4KB) pages for Java Virtual Machines (JVM) in AIX
  – Java 7.1 is optimized to use specific hardware optimizations for POWER8
  – JIT compiler will automatically detect platform architecture and generate code optimized for that platform.
  – WAS 8.5.2.2

• RHEL 6, SLES 11 Linux support use of 64KB pages for JVMs

• As with all legacy levels, Java applications with little memory footprint typically perform better in 32-bit. Applications with larger memory requirements should use 64-bit.

• A variety of other Java optimizations for AIX & Linux are covered in Section 8.3 of the Performance Optimization & Tuning Techniques for IBM Processors, including IBM POWER8 Redbook

Utilization, Simultaneous Multithreading & Virtual Processors

Review: POWER6 vs POWER7/8 SMT Utilization

POWER5/6 utilization does not account for SMT; POWER7/8 is calibrated in hardware.

Reported utilization with one hardware thread (Htc0) busy and all other threads idle, where “busy” = user% + system%:
  – POWER6 SMT2 (Htc0 busy, Htc1 idle): 100% busy
  – POWER7 SMT2 (Htc0 busy, Htc1 idle): ~70% busy
  – POWER7 SMT4 (Htc0 busy, Htc1 to Htc3 idle): ~63% busy
  – POWER8 SMT4 (Htc0 busy, Htc1 to Htc3 idle): ~60% busy
  – POWER8 SMT8 (Htc0 busy, Htc1 to Htc7 idle): ~56% busy

• Simulated single threaded process on 1 core, 1 Virtual Processor, utilization values change. In each case, physical consumption can be reported as 1.0.

• Real world production workloads will involve dozens to thousands of threads, so users may not notice any difference in the “macro” scale

• See Simultaneous Multi-Threading on POWER7 Processors by Mark Funk
  http://www.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf

POWER6 vs POWER7/POWER8 Dispatch

Utilization at which another Virtual Processor is activated:
  – POWER6 SMT2 (Htc0 busy, Htc1 busy): ~80% busy
  – POWER7/8 SMT4 (Htc0 busy, Htc1 to Htc3 idle): ~50% busy

Workloads are distributed across cores differently in POWER7 & POWER8 than in earlier architectures.

• In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded

• In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used. Only once the secondary threads are loaded will the tertiary threads be dispatched. This is referred to as Raw Throughput mode.

Why? Raw Throughput provides the highest per-thread throughput and best response times at the expense of activating more physical cores

Review: POWER6 vs POWER7/8 Dispatch

[Figure: four Virtual Processors, proc0 to proc3, mapped to lcpu 0-3, lcpu 4-7, lcpu 8-11 and lcpu 12-15. POWER6 panel: Primary and Secondary threads. POWER7/POWER8 (Raw Mode) panel: Primary, Secondary and Tertiary threads.]

Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number.

Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be in POWER7/POWER8.
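The lcpu ranges in the figure can be turned into a small lookup. This is a minimal Python sketch, assuming SMT4 with contiguous logical CPU numbering per Virtual Processor (as in the lcpu 0-3, 4-7, 8-11, 12-15 layout) and assuming the first thread of each VP is the primary, the second the secondary and the rest tertiary. The function names are hypothetical, and the utilization thresholds that actually gate dispatch are ignored.

```python
SMT_THREADS = 4  # assumed SMT4 layout: lcpu 0-3, 4-7, 8-11, 12-15

# Hardware thread position within a VP -> role; positions 2 and 3 are tertiary.
ROLES = {0: "primary", 1: "secondary"}


def describe_lcpu(lcpu, smt=SMT_THREADS):
    """Return (VP index, thread role) for a logical CPU number."""
    vp, thread = divmod(lcpu, smt)
    return vp, ROLES.get(thread, "tertiary")


def raw_dispatch_order(num_vps, smt=SMT_THREADS):
    """Logical CPUs in the order Raw mode fills them: every VP's primary
    thread first, then every secondary, then the tertiaries."""
    return [vp * smt + t for t in range(smt) for vp in range(num_vps)]


print(describe_lcpu(5))
print(raw_dispatch_order(4))
```

With four VPs the first four entries of the order are the primaries (lcpu 0, 4, 8, 12), which is why each extra runnable thread tends to pull in another core under Raw mode.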

POWER7/POWER8 Consumption

POWER7/POWER8 will activate more cores at lower utilization levels than earlier architectures when excess VP’s are present

• Customers may complain that the physical consumption metric (reported as physc or pc) is equal to or possibly even higher after migrations from earlier architectures

• Expect every POWER7/POWER8 customer with this complaint to also have significantly higher idle% values than on earlier architectures

• Consolidation of workloads may result in many more VPs assigned to a new POWER7 or POWER8 partition

• Just because we let you set very high ranges of Virtual Processor to Entitlement (20:1 now on some POWER7+ and POWER8) does not mean that is always optimal. Your choices have consequences. There is no magic ratio for all environments. If you want more education on VP vs Entitlement, ask for that education.

• More VPs can result in lower affinity
  – Broader spread across shared pool and memory domains
  – Lower affinity leads to more cycles, more cycles leads to lower perf

Virtual Processor Dispatch

• A recurring question in AIX is “how many Virtual Processors am I using?”
  – The physical consumption metric (physc or pc) could be used to approximate activity if the VP Folding algorithm was understood and the workload was stable (typically, 1 to 2 VPs higher than physc)
  – Tools like sar, mpstat and nmon could be used to display logical CPUs and divine how many Virtual Processors were active by looking at SMT sets (mapping to a VP) and their logical CPU statistics (utilization and context switches)

• A new mpstat option provides information on Virtual Processor activity
  – mpstat -v
  – Displays the delta Virtual Timebase (VTB), which is time charged to a dispatched VP
  – If the Virtual Timebase is 0, the processor statistics associated with that VP will not be shown, simplifying the output
  – AIX 7.1 TL3 SP2

Virtual Processors Dispatched - mpstat -v

vcpu  lcpu      us     sy     wa      id         pbusy            pc  VTB(ms)
----  ----  ------  -----  -----  ------  ------------  ------------  -------
   0         55.88   0.53   0.00   43.59  0.34[ 56.4%]  0.60[119.7%]      649
         0   55.88   0.52   0.00    0.47  0.34[ 56.4%]  0.34[ 56.9%]
         1    0.00   0.00   0.00   13.95  0.00[  0.0%]  0.08[ 13.9%]
         2    0.00   0.00   0.00   15.04  0.00[  0.0%]  0.09[ 15.0%]
         3    0.00   0.01   0.00   14.13  0.00[  0.0%]  0.08[ 14.1%]
   4         56.26   0.92   0.00   42.82  0.07[ 57.2%]  0.13[ 25.5%]      209
         4   56.26   0.87   0.00    1.28  0.07[ 57.1%]  0.07[ 58.4%]
         5    0.00   0.04   0.00   14.11  0.00[  0.0%]  0.02[ 14.1%]
         6    0.00   0.01   0.00   13.69  0.00[  0.0%]  0.02[ 14.8%]
         7    0.00   0.01   0.00   13.75  0.00[  0.0%]  0.02[ 13.9%]
   8         60.92   0.50   0.00   38.58  0.15[ 61.4%]  0.25[ 49.0%]      404
         8   60.92   0.49   0.00    0.64  0.15[ 61.4%]  0.15[ 62.0%]
         9    0.00   0.00   0.00   12.61  0.00[  0.0%]  0.03[ 12.9%]
        10    0.00   0.00   0.00   12.66  0.00[  0.0%]  0.03[ 13.0%]
        11    0.00   0.00   0.00   12.67  0.00[  0.0%]  0.03[ 13.0%]
 ALL        173.05   1.95   0.00  124.99  0.56[175.0%]  0.97[194.2%]     1262

VCPU values appear to be tied to the lowest logical CPU number. In this example there are only 3 active VPs, and VCPU does not represent some internal AIX numbering scheme.
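Counting active VPs from this kind of output can be scripted. The following is a minimal Python sketch that works only from the sample shown on this slide: it assumes a VP summary row is a row whose first field is a number and whose last field is the VTB value, while per-lcpu rows carry no VTB. That row layout is inferred from the sample, not taken from mpstat documentation, and the function name is hypothetical.

```python
import re

# Sample rows copied from the slide (whitespace-separated form).
SAMPLE = """\
vcpu lcpu us sy wa id pbusy pc VTB(ms)
0 55.88 0.53 0.00 43.59 0.34[ 56.4%] 0.60[119.7%] 649
0 55.88 0.52 0.00 0.47 0.34[ 56.4%] 0.34[ 56.9%]
1 0.00 0.00 0.00 13.95 0.00[ 0.0%] 0.08[ 13.9%]
2 0.00 0.00 0.00 15.04 0.00[ 0.0%] 0.09[ 15.0%]
3 0.00 0.01 0.00 14.13 0.00[ 0.0%] 0.08[ 14.1%]
4 56.26 0.92 0.00 42.82 0.07[ 57.2%] 0.13[ 25.5%] 209
4 56.26 0.87 0.00 1.28 0.07[ 57.1%] 0.07[ 58.4%]
5 0.00 0.04 0.00 14.11 0.00[ 0.0%] 0.02[ 14.1%]
6 0.00 0.01 0.00 13.69 0.00[ 0.0%] 0.02[ 14.8%]
7 0.00 0.01 0.00 13.75 0.00[ 0.0%] 0.02[ 13.9%]
8 60.92 0.50 0.00 38.58 0.15[ 61.4%] 0.25[ 49.0%] 404
8 60.92 0.49 0.00 0.64 0.15[ 61.4%] 0.15[ 62.0%]
9 0.00 0.00 0.00 12.61 0.00[ 0.0%] 0.03[ 12.9%]
10 0.00 0.00 0.00 12.66 0.00[ 0.0%] 0.03[ 13.0%]
11 0.00 0.00 0.00 12.67 0.00[ 0.0%] 0.03[ 13.0%]
ALL 173.05 1.95 0.00 124.99 0.56[175.0%] 0.97[194.2%] 1262
"""


def active_vps(text):
    """Map each dispatched VP id to its VTB (ms), per the assumed layout."""
    vps = {}
    for line in text.splitlines():
        # Collapse "[ 56.4%]" to "[56.4%]" so every column is one token.
        tokens = re.sub(r"\[\s*", "[", line).split()
        # VP rows: vcpu, us, sy, wa, id, pbusy, pc, VTB = 8 tokens.
        if len(tokens) == 8 and tokens[0].isdigit():
            vps[int(tokens[0])] = int(tokens[-1])
    return vps


vps = active_vps(SAMPLE)
print(len(vps), "active VPs:", vps)
```

For the sample this reports three active VPs (0, 4 and 8), and their VTB values add up to the 1262 ms shown on the ALL row.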

Migration Guidance

Migrations: Dispatching, SMT… Will I have a problem?

• If you are migrating between POWER7 and POWER8: Not a problem
  – AIX SMT4 default makes these migrations “apples-to-apples”
  – Default dispatcher behaves the same

• If you are migrating from POWER5/POWER6 to POWER8: Maybe a problem
  – POWER7 & POWER8 behave the same way
  – Now that you understand the dispatch behavior, you know why customers may complain
  – What are my options?
    • Get the VP counts right the first time. Do not do 1:1 VP sizings for larger partitions between POWER5/6 and POWER7/8. This will get you into trouble!
    • If a customer ignores updated VP sizings, consider using Scaled Throughput tunings

Use Scaled Throughput tunings: AIX uses more SMT threads before dispatching a VP. See backup material for detail and guidance.

POWER8 SMT Default: Why SMT4?

• AIX 6.1 will only support SMT4. Most customers are still running AIX 6.1

• After early experiences with POWER7, AIX chose the conservative path for POWER8 at the expense of some capacity
  – Most workloads will be fine with SMT4 or SMT8
  – All those problems you thought were SMT issues in POWER7 weren’t. They were firmware, affinity, aggressive dispatcher, too many VPs.
  – We avoid application scalability issues made visible by more SMT threads, but often blamed incorrectly on SMT

• Lab view is most customers do not run at utilization levels (> 80%) to benefit from SMT8. The reality is, many, if not most of our customers do not run at utilization levels to fully exercise SMT4.

• SMT4 is the best of all worlds for now, but there are now more options to exploit SMT. This can be done via the Scaled Throughput tunings which are covered in the backup material

POWER8 SMT: Should I use SMT8?

• Any PoC or benchmark where you’re going to drive to 80% utilization: Absolutely try SMT8, don’t leave capacity on the table
  – You can’t get to the highest rPerf without SMT8
  – OLTP DB, large WAS appservers, etc have seen 5 to 15% increases

• We should be open to letting experienced customers try SMT8
  – These customers typically know what they’re doing and understand if higher SMT is appropriate for their environment
  – It is easy and free to test SMT4 and SMT8 modes, no reboot

• For new customers/applications, need to review software stack
  – If application space is well known on AIX, should not be a problem
  – If application is new to AIX or Linux, it should be tested for scaling issues (product may have never been tested to 24 cores / 192 logical cpus)

POWER8 SMT: Flexible SMT

• POWER7 & POWER8 are different in SMT bias
  – In POWER7, there is a correlation between the Hardware Thread number (logical CPU 0, 1, 2 & 3) and physical resources within the processor. Lower threads may also have a higher priority.
  – POWER8 Hardware Threads are equally biased and provide the same performance regardless of which thread is active. This is true for AIX & Linux. For AIX, you do not need to worry about using bindprocessor or RSET function with various threads, or always “pinning” to a Virtual Processor’s Primary Hardware Thread for the best performance.
  – This topic, called Flexible SMT, is covered in more detail in Section 4.2 of the tuning Redbook

• AIX will dynamically adjust between SMT and ST mode based on the workload utilization. A 1:1 equivalent in Linux does not really exist, but I expect similar function will migrate to Linux and/or PowerKVM
  https://www.ibm.com/developerworks/community/blogs/aixpert/entry/local_near_far_memory_part_4_aggressive_intelligent_threads46?lang=en

POWER8 SMT Opinion: What about Linux?

• The Linux space is a bit more complicated
  – As of right now, there does not appear to be seamless handling of SMT between all Linux distros and PowerKVM comparable to PowerVM hosting AIX and i.
  – Most Linux workloads are more scale out than scale up
    • Smaller partitions
    • More HPC-like, manual SMT tunings, manual bindings to processors

• IBM and the industry are working on this
  – SMT can be dynamically changed
  – Distros have added more SMT awareness, NUMA tooling (numastat)
  – Visibility of SMT through host & client layers may differ in distros
  – “Split-Core” function offered where a single core with SMT8 will be split into four SMT2 “cores” from the guest perspective

• Rely on guidance provided by the Linux OS and application space. LTC is very responsive to DeveloperWorks Community questions:
  https://www.ibm.com/developerworks/community/forums/html/forum?id=a95a744c-e8fd-4228-a57a-1ae837efe457&ps=25

Migrating Memory & Storage I/O

• If your environment has been memory constrained, consider profiling existing workloads for Active Memory Expansion
  – We are getting many field questions about this feature in 2015
  – AIX amepat tool can profile running workloads
  – Generates output report with guidance on recommended expansion factors and CPU use required to implement
  – Can select target architecture of POWER7 or POWER8
  – Supported on AIX 6.1 with POWER6 and above

• For storage I/O, use existing tools, knowledge base for planning
  – Ask for a Disk Magic study
  – Use documents/tools at IBM Techdocs
    • Search for documents on POWER8 or written by Dan Braden, Sue Baker, John Hock
      http://www-03.ibm.com/support/techdocs/atsmastr.nsf/Web/TechDocs
    • For example, the updated Fibre Channel Planning tool estimates adapters required based on IOPS, MB/sec, paths and LUN counts (should work fine for System i & AIX)
      http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS5166

Migrating Network & General Tuning

• For all network efforts, see Steve Knudson’s and/or Alexander Paul’s presentations (10 Gb SEA tuning, SR-IOV, etc)
  – High packet counts (>100K/sec) or low-latency tiny-packets require tuning
  – Learn about mtu_bypass
  – Beware using Large Receive & Send on VIOS with Linux clients
    • Linux does not support this feature (LTC is trying!)
    • Mixing AIX/i clients with Linux virtual ethernet/SEA will result in performance issues
    • Separate Linux clients

• See Lab’s Performance Tuning Best Practices links
  – Single sheets for POWER7 & POWER8
  – Transition and Service Strategy guidance
  – All at: https://www-304.ibm.com/support/customercare/sas/f/best/home.html

Scaled Throughput

What is Scaled Throughput?

• Scaled Throughput is an alternative to the default “Raw” AIX scheduling mechanism
  – An alternative for some customers at the cost of some performance
  – Not an alternative to addressing AIX and pHyp defects, partition placement issues, realistic entitlement settings and excessive Virtual Processor assignments
  – Will dispatch more SMT threads to a VP/core before unfolding more VPs
  – It can be considered to be more like the POWER6 folding mechanism, but this is a generalization, not a technical statement
  – Supported on POWER7/POWER7+, AIX 6.1 TL08 & AIX 7.1 TL02
  – Does not apply to dedicated partitions unless you enable VP folding

• Raw vs Scaled Performance
  – Raw provides the highest per-thread throughput and best response times at the expense of activating more physical cores
  – Scaled provides the highest core throughput at the expense of per-thread response times and throughput. It also provides the highest system-wide throughput per VP because hardware thread capacity is “not left on the table.”

Raw vs Scaled

[Figure: dispatch order across four Virtual Processors (proc0 to proc3, mapped to lcpu 0-3, 4-7, 8-11 and 12-15), showing primary, secondary and tertiary threads for three cases: Raw (default), Scaled Mode 2 and Scaled Mode 4.]

POWER8 Mode + AIX 7.1 supports Scaled Mode 8.

Once a Virtual Processor is dispatched, physical consumption will typically increase to the next whole number.

Scaled Throughput: Tuning

• Tunings are not restricted, but you can be sure that anyone experimenting with this without understanding the mechanism may suffer significant performance impacts
  – Dynamic schedo tunable
  – Actual thresholds used by these modes are not documented and may change at any time

schedo -p -o vpm_throughput_mode=
  0  Legacy Raw mode (default)
  1  Scaled or “Enhanced Raw” mode with a higher threshold than legacy
  2  Scaled mode, use primary and secondary SMT threads
  4  Scaled mode, use all four SMT threads
  8  Scaled mode, use eight SMT threads (POWER8, AIX 7.1 required)

• Tunable schedo vpm_throughput_core_threshold sets a core count at which to switch from Raw to Scaled Mode
  – Allows fine-tuning for workloads depending on utilization level
  – VP’s will “ramp up” quicker to a desired number of cores, and then be more conservative under chosen Scaled mode
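The mode values listed on this slide lend themselves to a small validation helper for change scripts. This is a minimal Python sketch: it only builds the command string shown on the slide and does not run anything. The function name and the power8_aix71 flag are hypothetical, introduced here for illustration.

```python
# Mode values and descriptions as listed on the slide.
MODES = {
    0: "Legacy Raw mode (default)",
    1: 'Scaled or "Enhanced Raw" mode with a higher threshold than legacy',
    2: "Scaled mode, use primary and secondary SMT threads",
    4: "Scaled mode, use all four SMT threads",
    8: "Scaled mode, use eight SMT threads (POWER8, AIX 7.1 required)",
}


def schedo_command(mode, power8_aix71=False):
    """Return the schedo command line for a mode, rejecting invalid values.

    power8_aix71 is a caller-supplied assertion that the target is POWER8
    running AIX 7.1; this sketch does not detect the platform itself.
    """
    if mode not in MODES:
        raise ValueError(f"unsupported vpm_throughput_mode: {mode}")
    if mode == 8 and not power8_aix71:
        raise ValueError("mode 8 requires POWER8 and AIX 7.1")
    return f"schedo -p -o vpm_throughput_mode={mode}"


print(schedo_command(2))
print(MODES[2])
```

Keeping the descriptions next to the values makes it harder to set a mode such as 3, which is not among the values listed above.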

Scaled Throughput: Guidance

• Workloads
  – Workloads with many light-weight threads with short dispatch cycles and low IO (the same types of workloads that benefit well from SMT)
  – Customers who are easily meeting network and I/O SLA’s may find the tradeoff between higher latencies and lower core consumption attractive
  – Customers who will not reduce over-allocated VPs and prefer to see POWER6 behavior
  – Use mpstat (-v) in AIX 7.1 TL3 to view Virtual Processor dispatches

• Performance
  – It depends, we can’t guarantee what all workloads will do
  – Mode 1 may see little or no impact but higher per-core utilization with lower physical consumed (typically 10-15%)
  – Workloads that do not benefit from SMT and use Mode 2 or Mode 4 will see double-digit per-thread performance degradation (higher latency, slower completion times)
  – POWER6 workloads migrating to POWER7 or POWER8 and using Mode 2 will likely perform as well, or better and minimize complaints about higher than expected physical consumption.
  – Many POWER7 workloads could migrate to POWER8 mode 2 and reduce core usage without performance impact.
  – These are non-restricted dynamic tunings, easily tested like SMT mode changes

Raw Throughput: Default and Mode 1

[Chart: Raw Throughput. Active_Threads, Active_VP, Phys_Busy and Phys_Consumed plotted over time, vertical axis 0 to 12.]

• AIX will typically allocate 2 extra Virtual Processors as the workload scales up and is more instantaneous in nature
• VP’s are activated and deactivated one second at a time

[Chart: Scaled Throughput: Mode 1. Active_Threads, Active_VP, Phys_Busy and Phys_Consumed plotted over time, vertical axis 0 to 12.]

• Mode 1 is more of a modification to the Raw (Mode 0) throughput mode, using a higher utilization threshold and moving average to reduce VP oscillation
• It is less aggressive about VP activations. Many workloads may see little or no performance impact

Scaled Throughput: Modes 2 & 4

[Chart: Scaled Throughput: Mode 2. Active_Threads, Active_VP, Phys_Busy and Phys_Consumed plotted over time, vertical axis 0 to 12.]

• Mode 2 utilizes both the primary and secondary SMT threads
• Somewhat like POWER6 SMT2, eight threads are collapsed onto four cores
• “Physical Busy” or utilization percentage reaches ~80% of Physical Consumption

[Chart: Scaled Throughput: Mode 4. Active_Threads, Active_VP, Phys_Busy and Phys_Consumed plotted over time, vertical axis 0 to 12.]

• Mode 4 utilizes the primary, secondary and tertiary SMT threads
• Eight threads are collapsed onto two cores
• “Physical Busy” or utilization percentage reaches 90-100% of Physical Consumption
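The collapse figures quoted for Modes 2 and 4 (eight threads onto four cores and onto two cores) follow from how many SMT threads each mode fills per Virtual Processor. This is a deliberately simplified Python sketch: it assumes steady-state packing only, ignores the undocumented utilization thresholds and the one-second folding cadence, and leaves out Mode 1. The function name and the optional available_vps cap are hypothetical.

```python
import math

# SMT threads filled per VP before another VP is unfolded (simplified model):
# Raw (0) uses primaries only, Mode 2 adds secondaries, Mode 4 adds tertiaries.
THREADS_PER_VP = {0: 1, 2: 2, 4: 4}


def vps_needed(runnable_threads, mode, available_vps=None):
    """Estimate active VPs for a number of runnable software threads."""
    needed = math.ceil(runnable_threads / THREADS_PER_VP[mode])
    if available_vps is not None:
        needed = min(needed, available_vps)
    return needed


for mode in (0, 2, 4):
    print(f"mode {mode}: {vps_needed(8, mode)} VPs for 8 threads")
```

For eight runnable threads the model gives eight VPs in Raw mode, four in Mode 2 and two in Mode 4, matching the slide's figures for the scaled modes.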

Tuning (other)

• Never adjust the legacy vpm_fold_threshold without L3 Support guidance

• Remember that Virtual Processors activate and deactivate on 1 second boundaries. The legacy schedo tunable vpm_xvcpus allows enablement of more VPs than required by the workload. This is rarely needed, and is over-ridden when Scaled Mode is active.

• If you use RSET or bindprocessor function and bind a workload
  – To a secondary thread, that VP will always stay in at least SMT2 mode
  – If you bind to a tertiary thread, that VP cannot leave SMT4 mode
  – POWER8 threads are more balanced whereas lower POWER7 threads typically have a higher priority.
  – These functions should only be used to bind to primary threads unless you know what you are doing or are an application developer familiar with the RSET API
  – Use bindprocessor -s to list primary, secondary and tertiary threads
