Characterizing NAS Benchmark Performance on Shared Heterogeneous Networks
Jaspal Subhlok, Shreenivasa Venkataramaiah, Amitoj Singh
University of Houston
Heterogeneous Computing Workshop, April 15, 2002

Rice01, slide 1

Mapping/Adapting Distributed Applications on Networks

[Diagram: an application task graph (Pre, Data, Model, Stream, Sim 1, Sim 2, Vis) with a question mark indicating how it should be mapped onto a network]

Rice01, slide 2

Automatic node selection

Select 4 nodes for execution: the choice is easy.

[Diagram: compute nodes m-1 through m-8 connected by routers, with a congested route, busy nodes, and the four selected nodes marked]

Rice01, slide 3

Automatic node selection

Select 5 nodes: the choice depends on the application.

[Diagram: the same network of compute nodes m-1 through m-8 and routers, with a congested route, busy nodes, and the five selected nodes marked]

Rice01, slide 4

Mapping/Adapting Distributed Applications on Networks

[Diagram: the application task graph (Pre, Data, Model, Stream, Sim 1, Sim 2, Vis) to be mapped onto a network]

1) Discover application characteristics and model performance in a shared heterogeneous environment
2) Discover network structure and available resources (e.g., NWS, REMOS)
3) Algorithms to map/remap applications to networks

Rice01, slide 5

Methodology for Building an Application Performance Signature

Performance signature = a model to predict application execution time under given network conditions (a simple illustrative form is sketched below)

1. Execute the application on a controlled testbed
2. Measure system-level activity during execution, such as CPU, communication, and memory usage
3. Analyze and discover program-level activity (message sizes, sequences, synchronization waits)
4. Develop a performance signature

No access to source code or libraries is assumed
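As an illustration only, a minimal sketch of one form such a signature could take; this is not the model developed in this work, and the field names, the non-overlap of computation and communication, and the simple additive rescaling are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class PerformanceSignature:
    """Hypothetical signature of one application, measured on a dedicated testbed."""
    cpu_busy_seconds: float        # CPU time consumed per node during one run
    bytes_on_busiest_link: float   # traffic carried by the busiest link (bytes)
    testbed_runtime: float         # measured execution time on the testbed (seconds)
    testbed_bandwidth: float       # testbed link bandwidth (bytes/second)

    def predict_runtime(self, cpu_share: float, link_bandwidth: float) -> float:
        """Estimate execution time when only `cpu_share` of each CPU and
        `link_bandwidth` bytes/s of the busiest link are available.
        Assumes computation and communication do not overlap (a simplification)."""
        compute = self.cpu_busy_seconds / cpu_share
        comm = self.bytes_on_busiest_link / link_bandwidth
        testbed_comm = self.bytes_on_busiest_link / self.testbed_bandwidth
        other = max(self.testbed_runtime - self.cpu_busy_seconds - testbed_comm, 0.0)
        return compute + comm + other

# Example: a run that used 40 s of CPU and put 100 MB on its busiest link
sig = PerformanceSignature(cpu_busy_seconds=40.0, bytes_on_busiest_link=100e6,
                           testbed_runtime=60.0, testbed_bandwidth=12.5e6)
print(sig.predict_runtime(cpu_share=0.5, link_bandwidth=1.25e6))  # half a CPU, 10 Mbps link
```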

Rice01, slide 6

Discovering application characteristics

The executable application code is benchmarked on a controlled testbed and analyzed, and the result is modeled as a performance signature.

Testbed: 500 MHz Pentium Duos on 100 Mbps links through an Ethernet switch (crossbar)

Capture patterns of CPU loads and traffic during execution

Rice01, slide 7

Results in this paper

The executable application code is benchmarked on a controlled testbed, and its performance is then measured with resource sharing (same testbed: 500 MHz Pentium Duos, 100 Mbps links, Ethernet switch (crossbar)).

Capture patterns of CPU loads and traffic during execution. Demonstrate that measured resource usage on a testbed is a good predictor of performance on a shared network for the NAS benchmarks.

Rice01, slide 8

Experiment Procedure

• Resource utilization of the NAS benchmarks was measured on a dedicated testbed
  – CPU probes based on the "top" and "vmstat" utilities (see the sketch after this list)
  – Bandwidth measured using "iptraf", "tcpdump", and SNMP queries
• Performance of the NAS benchmarks was measured with competing loads and limited bandwidth
  – dummynet and NISTnet were employed to limit bandwidth
• All measurements presented are on 500 MHz Pentium Duos, a 100 Mbps network, TCP/IP, and FreeBSD
• All results are for Class A, MPI, NAS benchmarks
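As referenced above, a minimal sketch of the kind of passive CPU probe used here, assuming a script that simply polls "vmstat 1" during a benchmark run and logs CPU-idle samples; the position of the idle column is an assumption that depends on the vmstat variant, and bandwidth would be captured analogously with iptraf or tcpdump:

```python
import csv
import subprocess
import sys
import time

def sample_cpu(duration_s: int, outfile: str = "cpu_trace.csv") -> None:
    """Record one CPU-idle sample per second while a benchmark runs.
    Assumes the last column of each `vmstat 1` data row is the idle percentage,
    which should be verified for the vmstat variant in use."""
    proc = subprocess.Popen(["vmstat", "1"], stdout=subprocess.PIPE, text=True)
    start = time.time()
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["elapsed_s", "cpu_idle_pct"])
        for line in proc.stdout:
            fields = line.split()
            if not fields or not fields[-1].isdigit():
                continue  # skip vmstat header lines
            elapsed = time.time() - start
            writer.writerow([round(elapsed, 1), int(fields[-1])])
            if elapsed >= duration_s:
                break
    proc.terminate()

if __name__ == "__main__":
    sample_cpu(int(sys.argv[1]) if len(sys.argv) > 1 else 60)
```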

Rice01, slide 9

Discovered Communication Structure of NAS Benchmarks

[Diagram: the communication pattern among the four processes (0, 1, 2, 3) discovered for each benchmark: BT, CG, IS, LU, MG, EP, and SP]

Rice01, slide 10

Performance with competing computation loads

[Chart: percentage increase in execution time for EP, BT, CG, IS, LU, MG, and SP under three scenarios: all nodes loaded, most busy node loaded, and least busy node loaded]

• Increases beyond 50% are due to the lack of coordinated (gang) scheduling and to synchronization
• There is a correlation between low CPU utilization and a smaller increase in execution time (e.g., MG shows only ~60% CPU utilization); a simplified illustration follows below
• Execution time is lower if the least busy node carries the competing load (20% difference in the busyness level for CG)

Rice01, slide 11
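One simplified way to see the correlation noted above, assuming the CPU-busy fraction of a run is shared fairly with a compute-bound competing process while the idle/communication fraction is unaffected; this is an illustration of the argument only, not a calculation from the paper, and it ignores the synchronization and scheduling effects mentioned above:

```python
def expected_slowdown(cpu_utilization: float, competing_procs: int = 1) -> float:
    """Relative increase in execution time if the CPU-busy fraction of the run
    is shared fairly with `competing_procs` compute-bound processes while the
    idle/communication fraction is unchanged (ignores synchronization effects)."""
    busy = cpu_utilization
    idle = 1.0 - cpu_utilization
    stretched = busy * (1 + competing_procs) + idle
    return stretched - 1.0

# Under this simplification, a code at ~60% CPU utilization sees a smaller
# relative increase than a fully CPU-bound one:
print(expected_slowdown(0.6), expected_slowdown(1.0))  # -> ~0.6 and 1.0
```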

Performance with Limited Bandwidth (reduced from 100 to 10 Mbps) on one link

[Chart: increase in execution time alongside per-link traffic for CG, IS, MG, SP, BT, LU, and EP]

Close correlation between link utilization and performance with a shared or slow link
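For context, a minimal sketch of how a single link could be throttled to 10 Mbps with FreeBSD's dummynet from a driver script; the pipe number, node name, and the exact ipfw rule are placeholders and not taken from the paper:

```python
import subprocess

def limit_link(dst_host: str, bw: str = "10Mbit/s", pipe: int = 1) -> None:
    """Throttle all IP traffic to `dst_host` through a dummynet pipe.
    Requires root on a FreeBSD kernel with ipfw/dummynet enabled."""
    # Send packets destined for dst_host through pipe `pipe`
    subprocess.run(["ipfw", "add", "pipe", str(pipe),
                    "ip", "from", "any", "to", dst_host], check=True)
    # Configure the pipe's bandwidth (e.g. 10 Mbit/s instead of the native 100)
    subprocess.run(["ipfw", "pipe", str(pipe), "config", "bw", bw], check=True)

if __name__ == "__main__":
    limit_link("m-3")  # "m-3" is a placeholder host name, not from the paper
```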

Rice01, slide 12

Performance with Limited Bandwidth (reduced from 100 to 10 Mbps) on all links

[Chart: increase in execution time alongside total network traffic for IS, CG, SP, MG, BT, LU, and EP]

Close correlation between total network traffic and performance with all shared or slow links

Rice01, slide 13

Results and Conclusions (not the last slide)

• Computation and communication patterns can be captured by passive, nearly non-intrusive monitoring
• The benchmarked resource usage pattern is a strong indicator of performance with sharing:
  – strong correlation between application traffic and performance with low-bandwidth links
  – CPU utilization during normal execution is a good indicator of performance with node sharing
• Synchronization and timing effects were not dominant for the NAS benchmarks

Rice01, slide 14

Discussion and Ongoing Work (the last slide)

• Capture the application-level data exchange pattern from network probes (e.g., MPI message sequences and sizes)
  – slowdown differs for different message sizes
• Infer the main synchronization/waiting patterns
  – impact of unbalanced execution and lack of gang scheduling
• Capture the impact of the CPU scheduling policy for accurate prediction with sharing
  – scheduling policies try to compensate for waits

Goal: build a quantitative “performance signature” to estimate execution time under any given network conditions, and use it in a resource management prototype system

Rice01, slide 15