OpenVMS Performance Tips & Tricks

Guy Peleg, President, Maklee Engineering

Tuesday, April 28, 2020

Performance – Why should you care?

[Figure: four bar charts titled Application Tuning, System Tuning, Java Tuning, and Oracle Tuning. Series shown: Alpha ES47 original code, RX3600 original code, RX3600 PRONE version (1 CPU), RX3600 PRONE version (4 CPUs); Alpha ES40, Alpha ES40 PRONE version; Alpha GS1280, rx7640 after tuning, rx7640 PRONE version, rx7640 PRONE version + HT; rx2600 1.4 GHz, rx2600 1.4 GHz PRONE version.]

The Golden Rules

The best performing code is the code not being executed
The fastest I/Os are those avoided
Idle CPUs are the fastest CPUs
Look at your code... be ready to be surprised

RMS

RMS holds great potential for improving performance
The C RTL uses RMS
Most C applications would benefit from RMS tuning

RMS

RMS parameters related to performance:

FAB/RAB parameters (should you have access to the code):
ASY, RAH, WBH, DFW, SQO
ALQ & DEQ
MBC & MBF
NOSHR, NQL, NLK

SET RMS ...
/SYSTEM | /PROCESS
/BUFFER_COUNT=n
/BLOCK_COUNT=n

SYSGEN> SET RMS_SEQFILE_WBH 1

Don't be afraid of Global Buffers

FTP Performance & Simple RMS Tuning

FTP into IT13 and transfer the file

Brutel> ftp it13
220 IT13.bruclass.com FTP Server (Version 5.6) Ready.
Connected to ALPH13.BRUCLASS.COM.
Name (ALPH13.BRUCLASS.COM:bru_guy): peleg
331 Username peleg requires a Password
Password:
230 User logged in.
FTP> cd $1$dga703:[000000]
250-CWD command successful.
250 New default directory is $1$DGA703:[000000]
FTP> put HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE
200 TYPE set to IMAGE.
200 PORT command successful.
150 Opening data connection for $1$DGA703:[000000]HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE; (192.168.1.7,49428)
226 Transfer complete.
local: SYS$SYSDEVICE:[BRU_GUY]HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE;1 remote: HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE
286026004 bytes sent in 00:00:49.92 seconds (5594.83 Kbytes/s)
200 TYPE set to ASCII.

FTP Performance & Simple RMS Tuning

$ set rms/sys/exte=60000/seq/block=127/buf=8
$ mc sysgen
SYSGEN> SET RMS_SEQ 1
SYSGEN> W A
SYSGEN> Exit

Throughput increased by more than 50%

FTP> put HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE
200 TYPE set to IMAGE.
200 PORT command successful.
150 Opening data connection for $1$DGA703:[000000]HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE; (192.168.1.7,49432)
226 Transfer complete.
local: SYS$SYSDEVICE:[BRU_GUY]HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE;1 remote: HP-I64VMS-JAVA150-V0105-1-1.PCSI_SFX_I64EXE
286026004 bytes sent in 00:00:31.83 seconds (8773.78 Kbytes/s)
200 TYPE set to ASCII.
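The "more than 50%" claim can be checked directly from the two rates FTP reported. The sketch below only redoes that arithmetic; the variable and function names are illustrative and not part of any tool.

```python
# Figures reported by the two FTP transfers above (before and after RMS tuning).
before = {"seconds": 49.92, "kbytes_per_s": 5594.83}
after = {"seconds": 31.83, "kbytes_per_s": 8773.78}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


throughput_gain = percent_change(before["kbytes_per_s"], after["kbytes_per_s"])
elapsed_change = percent_change(before["seconds"], after["seconds"])

print(f"Throughput change: {throughput_gain:+.1f}%")
print(f"Elapsed time change: {elapsed_change:+.1f}%")
```

The throughput gain is a little under 57%, which corresponds to roughly a third less elapsed time for the same file.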

gZIP & RMS

gZIP is written in C: I/Os eventually reach RMS
1.6 GHz rx2600, MSA30, OpenVMS V8.3

Test 1
Compress 5.67 GB saveset
Decompress 2.74 GB gZIP archive
Default O/S & RMS settings

Test 2
Compress 5.67 GB saveset
Decompress 2.74 GB gZIP archive
SET RMS/BLOCK=127/EXTEN=60000/BUFFER=8, RMS_SEQFILE_WBH=1

gZIP & RMS

[Chart: elapsed time in minutes (less is better) for Test 1 Compress, Test 1 Decompress, Test 2 Compress, Test 2 Decompress]

Smaller MBC for Random Access

Times to read 1,000,000 records randomly (same sequence of records in every run; MBC is passed as the first parameter):

$ frand 1
Elapsed time == 31233ms
$ frand 2
Elapsed time == 31680ms
$ frand 4
Elapsed time == 32607ms
$ frand 8
Elapsed time == 33698ms
$ frand 16
Elapsed time == 36101ms
$ frand 32
Elapsed time == 42823ms
$ frand 64
Elapsed time == 54761ms
$ frand 96
Elapsed time == 66343ms
$ frand 124
Elapsed time == 80122ms
$
$ frand 1
Elapsed time == 31205ms
$
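To make the trend easier to see, the timings above can be expressed relative to the MBC=1 run. This is a small sketch over the reported numbers only; nothing here is measured anew.

```python
# Elapsed times (ms) reported by frand above, keyed by the MBC value used.
timings_ms = {
    1: 31233, 2: 31680, 4: 32607, 8: 33698, 16: 36101,
    32: 42823, 64: 54761, 96: 66343, 124: 80122,
}

baseline = timings_ms[1]
# Slowdown factor of each run compared with MBC=1.
slowdown = {mbc: elapsed / baseline for mbc, elapsed in timings_ms.items()}

for mbc, factor in slowdown.items():
    print(f"MBC={mbc:>3}: {factor:.2f}x the MBC=1 time")
```

For random access, each read transfers a full multiblock bucket of which only one record is needed, so the largest MBC tested takes about two and a half times as long as MBC=1.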

RMS & fsync()

Writing small amounts of data?
Using fsync()?
Slow!

Setting MBC & MBF to 1 is (almost!) identical
Still need to take care of EOF

Sequential Writes

Frequent file expansions are expensive
Typically seen with:
BACKUP savesets
Database imports
FTP'ing large files

The significant amount of time spent expanding files impacts performance
If possible, pre-allocate files (container files)
Limit the number of expansions on a volume:

$ SET VOLUME/EXTEND=65535
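A rough way to see why the extend quantity matters is to count how many extensions a sequentially written file needs. The sketch below is a simplification: it assumes 512-byte blocks, a file that starts with no allocation, and that every extension adds exactly the extend quantity (cluster rounding is ignored). The 5-block quantity is a hypothetical "small" value chosen for contrast, not a documented default.

```python
import math

BLOCK_BYTES = 512  # OpenVMS disk block size


def extends_needed(file_bytes: int, extend_blocks: int) -> int:
    """Number of extensions needed to grow an empty file to file_bytes."""
    blocks = math.ceil(file_bytes / BLOCK_BYTES)
    return math.ceil(blocks / extend_blocks)


FILE_BYTES = 286_026_004  # size of the file transferred in the FTP example

for quantity in (5, 60000, 65535):  # 5 is a hypothetical small extend quantity
    print(f"extend={quantity:>5} blocks -> {extends_needed(FILE_BYTES, quantity)} extensions")
```

Under these assumptions the same file needs over a hundred thousand extensions with a tiny extend quantity, and about ten with the large values used in this talk.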

Black Magic...

What would you say about improving system performance by 5% - 20%?
A typical response would be: "What does it take?"
Nothing!
Just a small change to one SYSGEN parameter
...and some physical memory
Sounds interesting?

Introducing the VHPT

Each CPU contains a translation buffer (TB)
Special cache to hold recent translations of virtual memory addresses to physical addresses
When a TB miss occurs the O/S has to resolve the translation by walking the page tables
Itanium provides an extra layer for resolving addresses: the Virtual Hash Page Table (VHPT)
VHPT: linear array of 32-byte entries
Created by OpenVMS at boot time but not accessed by it

VHPT

Order of use:
CPU TB cache
VHPT
OpenVMS walks the page tables (3-level address translation)

The VHPT is sized by a system parameter, VHPT_SIZE
Default value of 1 means allocate 32KB per CPU for the VHPT

VHPT

Default VHPT settings should be sufficient for small applications (up to 8MB of virtual address space).

Large applications with poor locality would benefit from increasing the VHPT.

Generally speaking – an application that benefits from enabling HT would benefit from an increase to the VHPT.

YMMV !!
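The 8MB figure for the default VHPT is consistent with the sizes quoted earlier. The sketch below assumes that each 32-byte VHPT entry maps one page and that the page size is 8KB; both are assumptions made for this illustration rather than statements from the slides.

```python
VHPT_BYTES_DEFAULT = 32 * 1024  # VHPT_SIZE = 1: 32KB per CPU (from the slide)
VHPT_ENTRY_BYTES = 32           # size of one VHPT entry (from the slide)
PAGE_BYTES = 8 * 1024           # assumption: 8KB pages, one page per entry

entries = VHPT_BYTES_DEFAULT // VHPT_ENTRY_BYTES
coverage_bytes = entries * PAGE_BYTES

print(f"{entries} entries cover {coverage_bytes // (1024 * 1024)}MB of virtual address space")
```

An application whose working set of addresses is much larger than this will miss in the default VHPT often, which is why large applications with poor locality are the ones that benefit from a bigger table.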

VHPT Benchmark

The following charts illustrate the impact that increasing the VHPT had on Oracle batch jobs

rx6600, 8 cores
OpenVMS V8.3-1H1
EVA8000
Oracle 10gR2
HyperThreads enabled
64 GB of physical memory

With VHPT = 10000, 2.5GB of physical memory is allocated for the VHPT.

Oracle Batch job A

[Chart: elapsed time in minutes (less is better) for VHPT = 1, VHPT = 2048, VHPT = 10000]

Oracle Batch job B

[Chart: elapsed time in minutes (less is better) for VHPT = 1, VHPT = 2048, VHPT = 10000]

CPU Power Management (IA64 only)

CPUs may be placed in a "lower power mode" when idle.

Reduces energy costs for the system.

SYSGEN parameter CPU_POWER_MGMT turns this feature on/off. May impact performance.

In a recent engagement we noted a 30% performance improvement on an rx6600 by turning power management off (set CPU_POWER_MGMT=0)

Shadowed RAM disk

Shadowed RAM disk for applications that frequently read data from disk.

The Shadow server will read from memory and will write to both devices.

Forces data to remain resident in memory
Significantly boosts performance when files are opened cluster wide by multiple users
XFC will not help
Beneficial if file update rate is low compared to the read rate
Included in the EOE & MCOE packages

Physical Disk Vs. RAM disk

C application that processes records read from a sequential file
Each I/O 124 blocks
RX2600, OpenVMS V8.3, HSG80

[Chart: elapsed time to read 250MB file (less is better) for $1$DGA704, single member DSA666, DSA666 with RAM disk, file in cache]

V8.3-1H1

When possible upgrade to V8.3-1H1
Performance improvements
Always aspire to stay current with the O/S version
Relink applications using the V8.3-1H1 Linker
The new linker produces smaller images
Reduction between 2% and 18%
0% is also possible
Montvale based systems: there is more than meets the eye...

V8.3-1H1 – Addendum kit

EFICHK operation is performed during the patch installation
Performance improvements

The following product will be installed to destination:
HP I64VMS VMS831H1I_ADDENDUM V1.0 DISK$SYS831H1:[VMS$COMMON.]
Portion done: 0%...10%...20%...30%...40%...50%...70%...80%...90%
%MOUNT-I-FATCHECK, volume created by EFI$CP version V5.2-5 allocation of 127.
checking for errors, repairing, and updating FAT information.
%EFICP-W-BADCCNT, FS0:\EFI\VMS\TOOLS\ACPIDUMP.EFI actual cluster count of 126 does not match the file
Filesize of 258232 bytes, requires 508 blocks (rounded to the cluster factor of 4)
508 blocks shown allocated, but 126 actual clusters (504 blocks) counted in file
The disk storage (258048 bytes) is smaller than the file size (258232 bytes)
Truncating file!
%EFICP-I-FATCHECK, 1 errors found, 1 fixed.
***CHECK CONTENTS FOR VALIDITY***
18 files in 4 folders checked, 12095166 total bytes in 5913 clusters
%EFICP-I-FATCHECK, Updating the FAT EFI$CP version information to V6.0-1, FAT version 1
%EFI-I-COPIED, copied FS0:\EFI\VMS\IPB.EXE to PCSI$DESTINATION:[SYSEXE]FLAG_IPB.EXE
%EFI-I-COPIED, copied PCSI$DESTINATION:[SYSEXE]IPB.EXE to FS0:\EFI\VMS\
%EFI-I-COPIED, copied FS0:\EFI\VMS\IPB.EXE to PCSI$DESTINATION:[SYSEXE]CHECK_IPB.EXE
%EFI-I-COPIED, copied FS0:\EFI\VMS\VMS_LOADER.EFI to PCSI$DESTINATION:[SYSEXE]CHECK_VMS_LOADER.EFI
...100%

Resident Images – a mystery

[Chart: AlphaServer GS1280 7/1150, elapsed time to execute a program (less is better) for Not Installed, Installed Resident]

Resident Images

[Chart: AlphaServer GS1280 7/1150, elapsed time to execute a program (less is better) for Not Installed, /section=(code,data), /section=code]

Resident Images

[Chart: rx6600 4P/8C 1.6 GHz, elapsed time to execute a program (less is better) for Not Installed, /section=(code,data), /section=code]

Resident Images

Alpha
The image activator has to apply the relocations (pagefaults)
Link using /section=code
Avoid /section=data

IA64
Relocations are mapped into memory (the dynamic segment stays in paged pool)

SORTing

HYPERSORT

Multi-threaded
$ define sortshr sys$library:hypersort.exe

Spread work files among disks/controllers/adaptors

Apart from input/output disks
No problem to have input and output on same disk

Sort 100,000,000 Records

[Chart: Sort32 CPU, Sort32 Elapsed, HyperSort CPU, HyperSort Elapsed (time axis 00:00.0 to 36:00.0), rx8620 1.6 GHz 16p]

100 bytes each
19,531,250 blocks
3 work files
~618,000 IO Sort32
~922,000 IO HyperSort
No XFC file caching of input, output or work
HyperSort Elapsed < CPU
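The block count quoted for the sort input follows directly from the record count and record size, assuming 512-byte disk blocks and no per-record overhead. A quick check:

```python
RECORDS = 100_000_000
RECORD_BYTES = 100
BLOCK_BYTES = 512  # OpenVMS disk block size

total_bytes = RECORDS * RECORD_BYTES
blocks = total_bytes // BLOCK_BYTES

print(f"{total_bytes:,} bytes = {blocks:,} blocks")
```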

PEDRIVER Data Compression

OpenVMS V8.3

Reduces traffic between nodes
May be beneficial for Shadow copy and MSCP traffic
Can be enabled system wide or per VC

Turn on compression for one VC

SCACP> set vc it14/comp
SCACP> sh vc

IT13 PEA0 VC Summary 30-JAN-2007 07:43:28.02:

Remote VC Total Channels ECS MaxPkt ReXmt --XmtWindow- Xmt Total ----------- Most Recent ----------
Node    State  Errors  Xmt:TMO   Open  ECS  Pri  Size  TMO(uSec)  Cur  Max  Mgt  Options  Pkts(S+R)  VC Opened Time      VC Closed Time
------  -----  ------  --------  ----  ---  ---  ----  ---------  ---  ---  ---  -------  ---------  ------------------  --------------
ALPH50  Open   4       115444    2     2    0    1426  672330.3   33   64   0             889107     21-JAN 13:34:25.78  (No time)
ALPH40  Open   0       Infinite  2     2    0    1426  516452.3   16   32   0             803545     21-JAN 13:34:25.72  (No time)
IT14    Open   1       790292    2     2    0    1426  223273.5   32   64   0    CMP      1242954    21-JAN 13:34:25.93  (No time)
IT13    Open   0       Infinite  1     1    0    1426  3000000.0  1    8    0             5          21-JAN 13:34:23.05  (No time)

PEDRIVER Data Compression

Copy 250MB file to MSCP served SCSI disk
Both systems are rx2600, running OpenVMS V8.3

[Chart: elapsed time to copy 250MB file (less is better) for Data Compression Turned Off, Data Compression Turned On]

Alignment Faults

No performance talk is complete without mentioning Alignment Faults
Alignment faults on Itanium will have a serious impact on performance
May be a (performance) issue on Alpha as well

What is an Alignment Fault?

When an attempted:

Longword memory access is not aligned on a memory boundary that is divisible by 4
Quadword memory access is not aligned on a memory boundary that is divisible by 8
Word memory access is not aligned on a boundary that is divisible by 2

An alignment fault is generated and control is transferred to code that will complete the load/store through shifting, masking and setting bits.
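The rule above boils down to a single modulo test on the address. The following sketch states it in code; the function and table names are illustrative only.

```python
# Natural alignment, in bytes, for the access sizes named above.
ACCESS_BYTES = {"word": 2, "longword": 4, "quadword": 8}


def is_aligned(address: int, access: str) -> bool:
    """True if an access of this size at this address is naturally aligned."""
    return address % ACCESS_BYTES[access] == 0


for address, access in [(0x10000, "longword"), (0x10001, "longword"),
                        (0x10004, "quadword"), (0x10002, "word")]:
    state = "aligned" if is_aligned(address, access) else "would fault"
    print(f"{access:>8} at {address:#x}: {state}")
```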

Why Worry?

OpenVMS Monitor Utility
ALIGNMENT FAULT STATISTICS
on node DWARF
3-MAY-2007 14:26:56.27

                          CUR        AVE        MIN        MAX
Kernel Fault Rate        0.00       0.66       0.00       1.33
Exec Fault Rate          0.00       0.00       0.00       0.00
Super Fault Rate         0.00       0.00       0.00       0.00
User Fault Rate     640253.31  662505.00  640253.31  684756.68
Total Fault Rate    640253.31  662505.83  640253.31  684758.31

Why Worry?

[MONITOR display: TIME IN PROCESSOR MODES (CUR) on node DWARF, 3-MAY-2007 14:26:59.27]

Let the Compiler Warn You in Advance

$ cc/nomember/warning=enable=alignment align_test

int x;
................^
%CC-I-MISALGNDMEM, This member is at offset 1, which is not a multiple of the member's alignment of longword. Consider padding before this member, rearranging the order of member declarations, or using #pragma member_alignment.
at line number 10 in file SYS$SYSDEVICE:[test]ALIGN_TEST.C;7

int x;
................^
%CC-I-MISALGNDSTRCT, This member requires longword alignment for efficient access, but is contained in a struct containing byte alignment. Consider using #pragma nomember_alignment longword.
at line number 10 in file SYS$SYSDEVICE:[test]ALIGN_TEST.C;7

sub(&z[i].x,&z[i].a);
....................^
%CC-W-ALIGNCONFLICT, In this statement, the address "&z[i].x" has alignment of byte which is less than the alignment requirements of the destination pointer. Dereferencing the destination pointer may cause an alignment fault.
at line number 22 in file SYS$SYSDEVICE:[test]ALIGN_TEST.C;7
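The "offset 1" in the first message is what a packed layout produces when a one-byte member precedes an int. The source of ALIGN_TEST.C is not shown, so the layout below (a char followed by an int) is an assumption made for illustration. The sketch uses Python's `struct` module to compare a packed layout with the platform's naturally aligned one.

```python
import struct

# Assumed member order: a 1-byte char followed by an int (not shown in the slides).
INT_BYTES = struct.calcsize("=i")

# "=" means standard sizes with no padding, similar to compiling with /nomember.
packed_size = struct.calcsize("=ci")
packed_offset_of_x = packed_size - INT_BYTES

# "@" means native sizes and native alignment: padding is inserted before the int.
native_size = struct.calcsize("@ci")
native_offset_of_x = native_size - struct.calcsize("@i")

print(f"packed:  size {packed_size}, x at offset {packed_offset_of_x}")
print(f"aligned: size {native_size}, x at offset {native_offset_of_x}")
```

Padding trades a few bytes per element for naturally aligned accesses, which is the trade the compiler message suggests.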

Reporting Alignment Faults

Analyze alignment faults on Alpha prior to a port

Only works on current process

sys$perm_report_align_fault
sys$perm_dis_align_fault_report

$ r align_test
Address of x == 10001
%SYSTEM-I-ALIGN, data alignment trap, virtual address=0000000000010001, function=00000000, PC=000000001DCF0202, PS=0000001B
%SYSTEM-I-ALIGN, data alignment trap, virtual address=0000000000010001, function=00000001, PC=000000001DCF0212, PS=0000001B
%SYSTEM-I-ALIGN, data alignment trap, virtual address=0000000000010006, function=00000000, PC=000000001DCF0202, PS=0000001B
%SYSTEM-I-ALIGN, data alignment trap, virtual address=0000000000010006, function=00000001, PC=000000001DCF0212, PS=0000001B
%SYSTEM-I-ALIGN, data alignment trap, virtual address=000000000001000B, function=00000000, PC=000000001DCF0202, PS=0000001B
%SYSTEM-I-ALIGN, data alignment trap, virtual address=000000000001000B, function=00000001, PC=000000001DCF0212, PS=0000001B
%SYSTEM-I-ALIGN, data alignment trap, virtual address=0000000000010015, function=00000000, PC=000000001DCF0202, PS=0000001B
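The faulting addresses in this trace advance by 5 bytes and skip 0x10010, which is what an array of 5-byte packed elements would produce. The sketch below reproduces the address sequence under that reading; the base address, element size and member offset are inferred from the trace and are assumptions, since the test program's source is not shown.

```python
BASE_ADDRESS = 0x10000  # assumption: start of the array, inferred from the trace
ELEMENT_BYTES = 5       # assumption: packed element of 1 byte + 4-byte int
X_OFFSET = 1            # offset of member x, as reported by the compiler
LONGWORD_BYTES = 4


def address_of_x(index: int) -> int:
    """Address of member x in array element `index`."""
    return BASE_ADDRESS + index * ELEMENT_BYTES + X_OFFSET


# Elements whose x member is not longword aligned and would therefore fault.
faulting = [hex(address_of_x(i)) for i in range(5)
            if address_of_x(i) % LONGWORD_BYTES != 0]
print(faulting)
```

Element 3 lands on 0x10010, which happens to be longword aligned, so it is the only one of the first five that does not appear in the trace.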

Process Affinity

Running on a large system with a low load?

Running on a large system with heavy load?

Better utilize the CPU caches (data cache, instruction cache & TB) by affinitizing your process to a set of CPUs
In an HT environment, affinitize to one core
Up to 25% performance increase

Generating Primes GS 1280 7/1150

EV7 @ 1150

Compiler qualifiers      Chart value
/NOOPTIMIZE              21.12
/OPTIMIZE                14.56
/OPTIMIZE=TUNE=HOST      14.56
/ARCHITECTURE=HOST       6.43
/ARCH=HOST/OPT=LEV=5     6.42

EV7 has EV68 “core”

Free Hot File Tracking Utility

$ sh mem/cache=(volume=*,topqio)
System Memory Resources on 26-APR-2007 01:39:15.03

Extended File Cache Top QIO File Statistics:

_$1$DGA642: (DISK$ES40), Caching mode is VIOC Compatible

_$1$DGA642:[VMS$COMMON.SYSEXE]RIGHTSLIST.DAT;1 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 9      Total QIOs 107
Read hits 92           Virtual reads 107
Virtual writes 0       Hit rate 85 %
Read aheads 0          Read throughs 107
Write throughs 0       Read arounds 0
Write arounds 0

_$1$DGA642:[VMS$COMMON.SYSEXE]VMS$OBJECTS.DAT;2 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 0      Total QIOs 9
Read hits 0            Virtual reads 9
Virtual writes 0       Hit rate 0 %
Read aheads 0          Read throughs 9
Write throughs 0       Read arounds 0
Write arounds 0

_$1$DGA642:[VMS$COMMON.SYSEXE]VMS$AUDIT_SERVER.DAT;1 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 1      Total QIOs 4
Read hits 0            Virtual reads 4
Virtual writes 0       Hit rate 0 %
Read aheads 0          Read throughs 4
Write throughs 0       Read arounds 0
Write arounds 0

Total of 3 files for this volume

Free Hot File Tracking Utility

_$1$DGA242: (DISK$ITANIUMVMS), Caching mode is VIOC Compatible

_$1$DGA242:[VMS$COMMON.SYSLIB]DECC$SHR.EXE;1 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 303    Total QIOs 1646
Read hits 1561         Virtual reads 1646
Virtual writes 0       Hit rate 94 %
Read aheads 0          Read throughs 1642
Write throughs 0       Read arounds 4
Write arounds 0

_$1$DGA242:[VMS$COMMON.SYSLIB]LIBRTL.EXE;1 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 143    Total QIOs 1165
Read hits 1123         Virtual reads 1165
Virtual writes 0       Hit rate 96 %
Read aheads 0          Read throughs 1164
Write throughs 0       Read arounds 1
Write arounds 0

_$1$DGA242:[VMS$COMMON.SYSLIB]CMA$TIS_SHR.EXE;1 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 12     Total QIOs 720
Read hits 711          Virtual reads 720
Virtual writes 0       Hit rate 98 %
Read aheads 0          Read throughs 720
Write throughs 0       Read arounds 0
Write arounds 0

Avoid caching files that pollute the cache
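The hit rates in this display can be reproduced from the read hits and virtual reads of each file. The sketch below assumes the percentage is truncated rather than rounded; that is an inference from the numbers shown (92 of 107 is 85.98%, displayed as 85), not documented behaviour.

```python
def hit_rate_percent(read_hits: int, virtual_reads: int) -> int:
    """Hit rate as a whole percentage, truncated (assumed from the display)."""
    if virtual_reads == 0:
        return 0
    return read_hits * 100 // virtual_reads


# (read hits, virtual reads) copied from the display above.
files = {
    "RIGHTSLIST.DAT": (92, 107),
    "VMS$OBJECTS.DAT": (0, 9),
    "DECC$SHR.EXE": (1561, 1646),
    "LIBRTL.EXE": (1123, 1165),
    "CMA$TIS_SHR.EXE": (711, 720),
}

rates = {name: hit_rate_percent(hits, reads) for name, (hits, reads) in files.items()}
for name, rate in rates.items():
    print(f"{name:<18} {rate:>3} %")
```

Files with many QIOs and a persistently low hit rate are the candidates for the "pollute the cache" advice.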

Elapsed time for I/Os

SDA> xfc show volume/brief

Summary of XFC Cached Volumes (CVBs)
-----------------------------------

Volume Name      CVB               Open Files  Closed Files  Total I/Os  Read Hits  Read Count  Write Count
DISK$CARFAX      FFFFFFFEE01895E0  0           0             0           0          0           0
DISK$UP          FFFFFFFEE0189380  0           0             0           0          0           0
DISK$ORADAT      FFFFFFFEE0189120  26          3             1872255     0          0           1872255
DISK$ORADSK      FFFFFFFEE0188EC0  73          177           22015701    14108183   21116834    898891
DISK$IA64_V82    FFFFFFFEE0188C60  0           0             0           0          0           0
DISK$82SOURCE    FFFFFFFEE0188A00  0           0             1           0          1           0
DISK$IT14_10292  FFFFFFFEE01887A0  2           0             0           0          0           0
DISK$ES40        FFFFFFFEE0188540  4           3             27676052    27667501   27674665    1387
DISK$IT14_DOSD   FFFFFFFEE01882E0  0           0             0           0          0           0
DISK$SYS831H1    FFFFFFFEE0188080  313         183           2736618     2668894    2713025     23594

Response (Milliseconds), columns Hits, Disk, Average; values other than (N/A): 0.0232, 0.0000, 0.5811, 0.0000, 0.2236, 0.0118, 0.0179, 0.4007, 0.5425, 0.0120, 0.0308

The XFC “overhead”

[Chart: elapsed time to copy 150MB file, rx2600, HSG80, OpenVMS V8.3, for Caching disabled; Caching enabled, first attempt; Caching enabled, second attempt]

RDB users: consider disabling caching of .RDA files

IBM MQ series

MQ is a heavy user of pthreads
Set MULTITHREAD to 1

Thread manager upcalls are enabled; the creation of multiple kernel threads is disabled

Sizing Working Sets

Respect AUTOGEN but don't trust it blindly
AlphaServer ES47, 16GB RAM, maximum process count of 2500 processes
AUTOGEN will set PQL_MWSDEFAULT to 17.38MB
17.38MB x 2500 = 43.45GB RAM
Exceeds physical memory by almost 3 times
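The arithmetic on this slide is easy to replay. The sketch follows the slide in treating 1 GB as 1000 MB; the variable names are illustrative.

```python
PQL_MWSDEFAULT_MB = 17.38  # value AUTOGEN chose in this example
MAX_PROCESSES = 2500
PHYSICAL_MEMORY_GB = 16

# The slide converts MB to GB with a factor of 1000.
demand_gb = PQL_MWSDEFAULT_MB * MAX_PROCESSES / 1000
overcommit = demand_gb / PHYSICAL_MEMORY_GB

print(f"Minimum working sets total {demand_gb:.2f}GB, "
      f"{overcommit:.1f}x physical memory")
```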

Sizing Working Sets

It's not 1980 any more...

Determine the size of XFC cache + MPW_HILIMIT
Subtract the sum from the number of fluid pages on the system (MMG$GQ_FLUID_PGCNT)
Divide by the maximum number of processes that have ever been running on the system (PMS$GL_PROCCNTMAX)
Multiply the result by 16 to translate from pages to pagelets
If you are conservative, take 70% of the result and set working set limit and quota to this value
Working set extent should be 3 times the result
Make sure PGFLQUOTA is properly sized
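The steps above can be sketched as a small function. This is one reading of the procedure, not a supported tool: it assumes every input is expressed in pages, and the sample inputs are hypothetical values chosen only to exercise the arithmetic. On a real system the inputs would come from the cells named on the slide.

```python
PAGELETS_PER_PAGE = 16  # from the slide: multiply pages by 16 to get pagelets


def size_working_sets(fluid_pages: int, xfc_pages: int, mpw_hilimit_pages: int,
                      max_processes: int, conservative: bool = True) -> dict:
    """Suggest working set limit/quota and extent, in pagelets."""
    available_pages = fluid_pages - (xfc_pages + mpw_hilimit_pages)
    per_process = available_pages // max_processes * PAGELETS_PER_PAGE
    if conservative:
        per_process = per_process * 70 // 100  # take 70% of the result
    return {"limit_and_quota": per_process, "extent": 3 * per_process}


# Hypothetical inputs, for illustration only.
suggestion = size_working_sets(fluid_pages=1_900_000, xfc_pages=500_000,
                               mpw_hilimit_pages=100_000, max_processes=500)
print(suggestion)
```

Here the extent is computed as three times the (possibly reduced) limit; if the slide intends three times the unreduced figure, drop the 70% step before tripling.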

TCP/IP & Gigabit Ethernet

Using Gigabit Ethernet?

Turn on Jumbo frames
Frames larger than 1518 bytes: more data per frame -> fewer frames -> fewer interrupts -> better performance
Must be supported by the switch
Must be configured before TCP/IP is started
mc lancp set dev ewa/jumbo
Bit 6 in SYSGEN parameter LAN_FLAGS
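To get a feel for the "fewer frames" argument, the sketch below counts frames for the file used in the FTP example. The payload sizes (1500 bytes for standard frames, 9000 bytes for jumbo frames) are typical values assumed here, and protocol headers are ignored. The mask calculation assumes bits are numbered from 0.

```python
LAN_FLAGS_JUMBO_BIT = 6
jumbo_mask = 1 << LAN_FLAGS_JUMBO_BIT  # value to OR into LAN_FLAGS


def frames_needed(payload_bytes: int, payload_per_frame: int) -> int:
    """Frames needed to carry payload_bytes, ignoring protocol headers."""
    return -(-payload_bytes // payload_per_frame)  # integer ceiling division


FILE_BYTES = 286_026_004  # file size from the FTP example
standard_frames = frames_needed(FILE_BYTES, 1500)  # assumed standard payload
jumbo_frames = frames_needed(FILE_BYTES, 9000)     # assumed jumbo payload

print(f"LAN_FLAGS mask for bit 6: {jumbo_mask}")
print(f"standard: {standard_frames:,} frames, jumbo: {jumbo_frames:,} frames")
```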

Toolbox Overview

Collection of highly valuable, undocumented & unsupported tools, subject to change without notice
Implemented as SDA extensions
Use hooks in the VMS executive
May be loaded and unloaded on the fly

No reboot required

Trace data is stored in ring buffer in S2 space

May be viewed from a crash dump

Toolbox Overview

Extension  Description                        First shipped in
CNX        connection manager tracing         V7.2-2
EXC        exception tracing                  V8.2
FC         Fibrechannel debug and tracing     V7.2-2
FLT        alignment fault tracing            V8.1
IO         buffered and direct I/O tracing    V7.3-2
LCK        lock manager tracing               V7.2-2
LNM        logical name tracing               V7.3-1
MTX        mutex tracing                      V7.3
PCS        PC sampling                        V7.3-2
PRF        performance utility                V8.2
PSH        pshared debug utility              V8.2-1
RDB        Rdb lock decoding and tracing      V7.3-2
RMS        indexed file tracing               V8.2-1
SPL        spinlock tracing                   V7.2-1H1
TQE        timer entry tracing                V7.3-1
TR         debug and trace prints             V7.3
XFC        eXtended File Cache diagnostics    V7.3

Toolbox Overview

Common commands

SDA> xxx help ! Displays brief command
SDA> xxx LOAD
SDA> xxx START TRACE /BUFFER=3000
SDA> xxx SHOW TRACE
SDA> xxx STOP TRACE
SDA> xxx UNLOAD
SDA> READ /EXEC /NOLOG

PRF

PRF is a highly powerful SDA extension for monitoring various performance counters at the processor level.

May be used for PC sampling.

Highlights areas in the application that require performance enhancements.

PRF

SDA> prf load
PRF$DEBUG load status = 00000001
SDA> prf start pc/ind=21E004DA
PC Sampling started...
SDA> prf start collect
SDA>

Now run the application:

$ r prime
$
ELAPSED: 0 00:00:24.16 CPU: 0:00:24.06 BUFIO: 0 DIRIO: 0 FAULTS: 0

To look at the collected data:

SDA> prf show collect

PRF SHOW COLLECT

Start VA           End VA             Image                        Count    Percent
-----------------  -----------------  ---------------------------  -------  -------
FFFFF802.11F00000  FFFFF802.11F01FFF  PRIME                        305113   99.85%
FFFFF802.A1000000  FFFFF802.A1015FFF  Kernel Promote VA            1        0.00%
FFFFFFFF.80000000  FFFFFFFF.800000FF  SYS$PUBLIC_VECTORS           2        0.00%
FFFFFFFF.80000100  FFFFFFFF.800111FF  SYS$BASE_IMAGE               2        0.00%
FFFFFFFF.80011200  FFFFFFFF.800651FF  SYS$PLATFORM_SUPPORT         258      0.08%
FFFFFFFF.800A0000  FFFFFFFF.801DD6FF  SYSTEM_PRIMITIVES            88       0.03%
FFFFFFFF.801DD700  FFFFFFFF.80243BFF  SYSTEM_SYNCHRONIZATION_MIN   9        0.00%
FFFFFFFF.80254600  FFFFFFFF.8026EFFF  SYS$EIDRIVER.EXE             5        0.00%
FFFFFFFF.8026F000  FFFFFFFF.802895FF  SYS$LAN.EXE                  2        0.00%
FFFFFFFF.80289600  FFFFFFFF.802BA1FF  SYS$LAN_CSMACD.EXE           2        0.00%
FFFFFFFF.80440E00  FFFFFFFF.8052B2FF  IO_ROUTINES                  1        0.00%
FFFFFFFF.8053A600  FFFFFFFF.80670DFF  PROCESS_MANAGEMENT           7        0.00%
FFFFFFFF.80670E00  FFFFFFFF.807759FF  SYS$VM                       11       0.00%
FFFFFFFF.80779500  FFFFFFFF.807C76FF  LOCKING                      1        0.00%
FFFFFFFF.807C7700  FFFFFFFF.807F9CFF  MESSAGE_ROUTINES             1        0.00%

PRF SHOW COLLECT

SDA> prf show coll/thresh=2

PC                 Count   Rate    Symbolization  Module  Offset
-----------------  ------  ------  -------------  ------  --------
FFFFF802.11F00170  63410   20.07%  PRIME+10170    PRIME   00010170
  [GENERATE_PRIME+00000170 / GENERATE_PRIME+00000170]
FFFFF802.11F00190  6138    2.01%   PRIME+10190    PRIME   00010190
  [GENERATE_PRIME+00000190 / GENERATE_PRIME+00000190]
FFFFF802.11F001A0  6761    2.21%   PRIME+101A0    PRIME   000101A0
  [GENERATE_PRIME+000001A0 / GENERATE_PRIME+000001A0]
FFFFF802.11F00200  6296    2.06%   PRIME+10200    PRIME   00010200
  [GENERATE_PRIME+00000200 / GENERATE_PRIME+00000200]
FFFFF802.11F00220  8102    2.65%   PRIME+10220    PRIME   00010220
  [GENERATE_PRIME+00000220 / GENERATE_PRIME+00000220]
FFFFF802.11F00290  6804    2.23%   PRIME+10290    PRIME   00010290

Montecito

Source: Wikipedia

Hyperthreading with Stalls vs Hyperthreading with No Stalls

[Diagram: execution timelines for threads A and B (instructions i and i+1, with idle periods where a thread stalls) in four cases: Serial Execution with Stalls (no Hyperthreading), Hyperthreading with Stalls, Serial Execution with No Stalls (no Hyperthreading), Hyperthreading with No Stalls]

Two Cores vs Hyperthreading (NoStalls)

[Diagram: execution timelines for threads A and B (instructions i and i+1) in two cases: Serial Execution with No Stalls on Two Cores, Hyperthreading with No Stalls]

HyperThreads – Impact on Oracle Jobs

[Chart: elapsed time (minutes) to execute 7 jobs, less is better; Job 1 through Job 7, HT Disabled vs HT Enabled]

HyperThreads

HyperThreads have the potential of improving performance
Application has to meet the following criteria:
COM Queue
Poor locality (L2/L3 misses)
No page faulting

PRF may be used to track L2 misses

PRF START PROFILE/CPU=n/CACHE=L2/INDEX=PID
PRF START COLLECT

L2 Cache Misses on TC_CF (13.2% improvement)

                                                          I-Cache Misses     D-Cache Misses     Branch Trace Buf
Start VA           End VA             Image               Latency  Percent   Latency  Percent   Count    Percent
-----------------  -----------------  ------------------  -------  -------   -------  -------   -------  -------
00000000.00000000  00000000.7ADCBFFF  Process Space       17062    1.73%     6072893  96.52%    244963   8.62%
00000000.7ADCC000  00000000.7AEF7FFF  DCL                 101      0.01%     0        0.00%     242      0.01%
FFFFF802.0806C000  FFFFF802.0825DFFF  LIBRTL              4104     0.42%     1217     0.02%     21753    0.77%
FFFFF802.0825E000  FFFFF802.08283FFF  LIBOTS              2150     0.22%     123      0.00%     240662   8.47%
FFFFF802.082E8000  FFFFF802.0837FFFF  SMGSHR              52       0.01%     10       0.00%     211      0.01%
FFFFF802.08404000  FFFFF802.0840DFFF  CMA$TIS_SHR         281      0.03%     0        0.00%     1504     0.05%
FFFFF802.08444000  FFFFF802.084F7FFF  DPML$SHR            5        0.00%     0        0.00%     1        0.00%
FFFFF802.084F8000  FFFFF802.085A9FFF  PTHREAD$RTL         2657     0.27%     294      0.00%     6315     0.22%
FFFFF802.085AA000  FFFFF802.090B3FFF  DECC$SHR            24027    2.43%     6258     0.10%     369765   13.02%
FFFFF804.0E000000  FFFFF804.0E015FFF  Kernel Promote VA   2232     0.23%     0        0.00%     5191     0.18%
FFFFFFFF.80000000  FFFFFFFF.800000FF  SYS$PUBLIC_VECTORS  403      0.04%

L2 Cache Misses on PRIMES_1 (Slight Degradation)

                                                     I-Cache Misses     D-Cache Misses     Branch Trace Buf
Start VA           End VA             Image          Latency  Percent   Latency  Percent   Count    Percent
-----------------  -----------------  -------------  -------  -------   -------  -------   -------  -------
00000000.00000000  00000000.7ADCBFFF  Process Space  5077     2.77%     29968    52.88%    26607    5.27%
00000000.7ADCC000  00000000.7AEF7FFF  DCL            19       0.01%     0        0.00%     22       0.00%
FFFFF802.0806C000  FFFFF802.0825DFFF  LIBRTL         949      0.52%     570      1.01%     3816     0.76%
FFFFF802.0825E000  FFFFF802.08283FFF  LIBOTS         63       0.03%     0        0.00%     201      0.04%
FFFFF802.082E8000  FFFFF802.0837FFFF  SMGSHR         20       0.01%     0        0.00%     46       0.01%
FFFFF802.08404000  FFFFF802.0840DFFF  CMA$TIS_SHR    0        0.00%     0        0.00%     6        0.00%

LNM

The LNM extension allows tracking logical name translations.

Logical name translations are expensive from a performance point of view and should be avoided when possible.

MONITOR IO displays the total number of logical name translations per second.

LNM Example

SDA> lnm show collect

Logical Name Trace Information:
------------------------------

Count        Logical Name
-----------  ------------------------------
5000         SYS$SCRATCH    !SYS$SCRATCH is being translated 5000 times
10           SYS$SHARE
10           SYS$SYSROOT
5            GBL$INS$8DDE9730
5            SYS$COMMON
4            GBL$INS$8DDAE310
4            SYS$OUTPUT
3            GBL$INS$8DDC20D0
3            GBL$INS$8DDD1A60
3            IPC$ACP_NETMBX
2            CMA$TIS_SHR
2            DPML$SHR
2            LIBOTS
2            LIBRTL
2            PAS$RTL
1            GBL$INS$8DDB0B50
1            IAC$DEBUG
1            IAC$DEVO

LNM Example

SDA> lnm show trace

Logical Name Trace Information:
------------------------------

Timestamp               CPU  EPID      Main Image    CallerPC                                   Logical Name
----------------------  ---  --------  ------------  -----------------------------------------  --------------
25-JAN 06:22:15.530026  01   21E0040E  IPCACP        FFFFFFFF.80514560 IOC$TRANDEVNAM_C+007C0   IPC$ACP_NETMBX
25-JAN 06:22:05.530027  01   21E0040E  IPCACP        FFFFFFFF.80514560 IOC$TRANDEVNAM_C+007C0   IPC$ACP_NETMBX
25-JAN 06:21:30.440094  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$OUTPUT
25-JAN 06:21:30.440010  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          PAS$OUTPUT
25-JAN 06:21:30.439846  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439835  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439825  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439814  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439803  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439792  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439782  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439771  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439760  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439750  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
25-JAN 06:21:30.439739  00   21E004DA  MANY_TRNLNMS  00000000.00000000                          SYS$SCRATCH
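A trace like this, with the same logical name translated thousands of times in a row, usually means the application translates inside a loop. One common remedy is to translate once and reuse the result. The sketch below shows the idea in Python; `translate_logical` is a hypothetical stand-in for the real, expensive translation call, and the equivalence string it returns is made up.

```python
from functools import lru_cache

translate_calls = 0  # counts how often the expensive translation really runs


def translate_logical(name: str) -> str:
    """Hypothetical stand-in for an expensive logical name translation."""
    global translate_calls
    translate_calls += 1
    return {"SYS$SCRATCH": "DISK$USER:[SCRATCH]"}.get(name, name)


@lru_cache(maxsize=None)
def cached_translate(name: str) -> str:
    """Translate once per name, then serve the remembered result."""
    return translate_logical(name)


for _ in range(5000):
    scratch = cached_translate("SYS$SCRATCH")

print(f"5000 lookups, {translate_calls} real translation(s)")
```

The trade-off is staleness: if the logical name is redefined while the program runs, a cached value will not notice, so this suits names that stay fixed for the life of the process.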

LNM & Cobol

Do you have an application written in Cobol?

COB$5644

Decoding PCs

New routine to decode PC into module and routine names with offsets (IA64 only)
tf$get_mod_rtn in module TRACE_ELF in SYS$SHARE:VMS$VOLATILE_PRIVATE_INTERFACES.OLB

tf$get_mod_rtn ( entry->spltre$q_pc, &mod_name, &rtn_name, &mod_rel_pc, &rtn_rel_pc );

Questions?

See us at www.maklee.com for:
• Performance improvements
• Oracle Tuning
• Platform Migration
• Custom Engineering solutions
• Custom Training
