Preparing for Petascale and Beyond
Celso L. Mendes
http://charm.cs.uiuc.edu/people/cmendes
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Presentation Outline
• Present Status
– HPC Landscape, Petascale, Exascale
• Parallel Programming Lab
– Mission and approach
– Programming methodology
– Scalability results for S&E applications
– Other extensions and opportunities
– Some ongoing research directions
• Happening at Illinois
– Blue Waters, NCSA/IACAT
– Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …
Current HPC Landscape
• Petascale era started!
– Roadrunner@LANL (#1 in Top500):
• Linpack: 1.026 Pflops, Peak: 1.375 Pflops
– Heterogeneous systems starting to spread (Cell, GPUs, …)
– Multicore processors widely used
– Current trends: see the Top500 performance-development chart (source: top500.org)
Current HPC Landscape (cont.)
• Processor counts:
– #1 Roadrunner@LANL: 122K
– #2 BG/L@LLNL: 212K
– #3 BG/P@ANL: 163K
• Exascale: sooner than we imagine…
– U.S. Dep. of Energy town hall meetings in 2007:
• LBNL (April), ORNL (May), ANL (August)
• Goals: discuss exascale possibilities and how to accelerate progress toward them
• Sections:
– Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics,
Math & Algorithms, Software, Hardware
• Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf
Current HPC Landscape (cont.)
• Current reality:
– Steady increase in processor counts
– Systems become multicore or heterogeneous
– “Memory wall” effects worsening
– MPI programming model still dominant
• Challenges (now and into foreseeable future):
– How to exploit new systems’ power
– Capacity vs. Capability: different problems
• Capacity is a concern for system managers
• Capability is a concern for users
– How to program in parallel effectively
• Both multicore (desktop) and million-core (supercomputers)
Parallel Programming Lab
Parallel Programming Lab - PPL
• http://charm.cs.uiuc.edu
• One of the largest research groups at Illinois
• Currently:
– 1 faculty, 3 research scientists, 4 research programmers
– 13 grad students, 1 undergrad student
– Open positions
[Group photo: PPL, April 2008]
PPL Mission and Approach
• To enhance Performance and Productivity in
programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
• Application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of apps
– Develop, use, and test it in the context of real applications
– Embody it in easy-to-use abstractions
– Implementation: Charm++
• Object-oriented runtime infrastructure
• Freely available for non-commercial use
Application-Oriented Parallel Abstractions
Synergy between Computer Science research and applications has been beneficial to both
[Diagram: Charm++ at the center, exchanging application issues and techniques & libraries with NAMD, LeanCP, ChaNGa, rocket simulation, space-time meshing, and other applications]
Programming Methodology
Methodology: Migratable Objects
Programmer: [over]decomposition into objects (“virtual processors” - VPs)
Runtime: assigns VPs to real processors dynamically, during execution
– Enables adaptive runtime strategies
– Implementations: Charm++, AMPI
Benefits of virtualization:
• Software engineering
– Number of virtual processors can be independently controlled
– Separate VP sets for different modules in an application
• Message-driven execution
– Adaptive overlap of computation/communication
• Dynamic mapping
– Heterogeneous clusters: vacate, adjust to speed, share
– Automatic checkpointing
– Change set of processors used
– Automatic dynamic load balancing
– Communication optimization
[Diagram: user view of many virtual processors vs. the system implementation mapping them onto real processors]
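To make the migratable-object idea concrete, here is a minimal sketch (not taken from the talk) of a 1D Charm++ chare array element: a pup() routine lets the runtime serialize the object for migration or checkpointing, and the AtSync()/ResumeFromSync() hooks opt into measurement-based load balancing. The module name block, the class Block, its data member, and the doStep entry method are all hypothetical.

// block.ci -- Charm++ interface file (processed by charmc); hypothetical names
//   module block {
//     array [1D] Block {
//       entry Block();
//       entry void doStep();
//     };
//   };

// block.C -- a mainchare that creates the array and starts doStep() is omitted
#include <vector>
#include "pup_stl.h"     // PUP operators for STL containers
#include "block.decl.h"  // generated from block.ci

class Block : public CBase_Block {
  std::vector<double> data;               // this object's piece of the problem

 public:
  Block() : data(1000, 0.0) {
    usesAtSync = true;                    // opt in to measurement-based load balancing
  }
  Block(CkMigrateMessage *msg) : CBase_Block(msg) {}  // invoked on the destination after migration

  void pup(PUP::er &p) {                  // serialize state for migration/checkpointing
    CBase_Block::pup(p);
    p | data;
  }

  void doStep() {
    // ... compute on 'data', exchange boundaries with neighboring elements ...
    AtSync();                             // hand control to the load balancer here
  }

  void ResumeFromSync() {                 // called once rebalancing (and any migration) is done
    thisProxy[thisIndex].doStep();        // continue with the next timestep
  }
};

#include "block.def.h"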
Adaptive MPI (AMPI): MPI + Virtualization
• Each virtual process implemented as a user-level thread
embedded in a Charm object
– Must properly handle globals and statics (analogous to what’s needed in OpenMP)
– But… thread context-switch is much faster than other techniques
[Diagram: MPI “processes” implemented as virtual processes (user-level migratable threads), several per real processor]
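For illustration (not from the original slides), an ordinary MPI program like the sketch below can run unmodified under AMPI with many more virtual ranks than physical cores; the compiler wrapper name and the +p/+vp run-time options shown in the comment are indicative only, so the AMPI manual should be consulted for the exact commands of a given installation.

// vphello.cpp -- plain MPI code; under AMPI each rank is a user-level,
// migratable thread, so it can run with more ranks than cores, e.g.
// (commands are indicative only):
//   ampicxx vphello.cpp -o vphello
//   ./charmrun ./vphello +p4 +vp32      // 32 virtual ranks on 4 cores
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);  // reports virtual ranks, not physical cores
  std::printf("virtual rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}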
Parallel Decomposition and Processors
• MPI-style:
– Encourages decomposition into P pieces, where P is the
number of physical processors available
– If the natural decomposition is a cube, then the number of
processors must be a cube
– Overlap of computation/communication is a user’s responsibility
• Charm++/AMPI style: “virtual processors”
– Decompose into natural objects of the application
– Let the runtime map them to physical processors
– Decouple decomposition from load balancing
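A tiny sketch of this decoupling, with made-up numbers: the application picks whatever decomposition is natural (here a 10x10x10 grid of objects), and a mapping function standing in for the runtime assigns those objects to however many processors happen to be available.

#include <cstdio>

// Illustrative only: the domain is decomposed into its natural number of
// objects, and a round-robin mapping (standing in for the runtime's load
// balancer) places them on however many processors are available.
int main() {
  const int nx = 10, ny = 10, nz = 10;   // natural decomposition: 1000 objects
  const int numObjects = nx * ny * nz;
  const int P = 48;                      // processors actually available (not a cube)

  for (int obj = 0; obj < numObjects; ++obj) {
    int proc = obj % P;                  // mapping chosen at run time, not hard-wired to P
    if (obj < 3) std::printf("object %d -> processor %d\n", obj, proc);
  }
  std::printf("%d objects mapped onto %d processors\n", numObjects, P);
  return 0;
}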
Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs.
Charm++/AMPI framework
[Diagram: traditional MPI decomposition into exactly P solid and fluid pieces (Solid 1..P, Fluid 1..P) vs. Charm++/AMPI decomposition into many objects (Solid1..Solidn, Fluid1..Fluidm) mapped onto the processors by the runtime]
– Benefits: load balance, communication optimizations,
modularity
Dynamic Load Balancing
• Based on Principle of Persistence
– Computational loads and communication patterns tend to
persist, even in dynamic computations
– Recent past is a good predictor of near future
• Implementation in Charm++:
– Computational entities (nodes, structured grid points,
particles…) are partitioned into objects
– Load from objects may be measured during execution
– Objects are migrated across processors for balancing load
– Much smaller problem than repartitioning entire dataset
– Several available policies for load-balancing decisions
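For intuition, the sketch below shows the kind of greedy policy such a balancer might apply (it is not Charm++’s actual balancer code, and the measured loads are made-up numbers): repeatedly assign the heaviest remaining object to the currently least-loaded processor.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy rebalancing sketch: the heaviest measured object goes to the
// currently least-loaded processor. The loads are made-up numbers standing
// in for the times instrumented by the runtime.
int main() {
  std::vector<double> objLoad = {4.0, 2.5, 2.5, 1.0, 0.8, 0.7, 0.5};
  const int P = 3;

  std::sort(objLoad.rbegin(), objLoad.rend());      // heaviest object first

  using Entry = std::pair<double, int>;             // (current load, processor id)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
  for (int p = 0; p < P; ++p) procs.push({0.0, p});

  for (double load : objLoad) {
    auto [current, proc] = procs.top();             // least-loaded processor so far
    procs.pop();
    std::printf("object with load %.1f -> processor %d\n", load, proc);
    procs.push({current + load, proc});
  }
  return 0;
}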
Typical Load Balancing Phases
[Timeline: regular timesteps, then instrumented timesteps followed by detailed, aggressive load balancing, and later refinement load balancing]
Examples of Science & Engineering
Charm++ Applications
NAMD: A Production MD program
• Fully featured program
• NIH-funded development
• Distributed free of charge
(~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• 20% cycles (NCSA, PSC)
• User training and support
• Large published simulations
• Gordon Bell award in 2002
• URL: www.ks.uiuc.edu/Research/namd
Spatial Decomposition Via Charm++
• Atoms are distributed to cubes (cells, or “patches”) based on their location
• Size of each cube: just a bit larger than the cut-off radius
– Communicate only with neighbors
– Work: one compute for each pair of neighboring objects
– Communication-to-computation ratio: O(1)
• However:
– Load imbalance
– Limited parallelism
• Charm++ is useful to handle this
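A small sketch of the arithmetic behind the patch grid (box dimensions and cut-off radius are made-up values): the box is divided so that each cell edge is at least the cut-off radius, which limits interactions to a patch and its 26 neighbors.

#include <cstdio>

// Illustrative patch-grid arithmetic: pick the number of cells per dimension
// so that each cell edge is just a bit larger than the cut-off radius.
int main() {
  const double boxX = 108.0, boxY = 108.0, boxZ = 78.0;  // box size (made-up values)
  const double cutoff = 12.0;                            // cut-off radius (made-up value)

  const int nx = static_cast<int>(boxX / cutoff);  // rounding down keeps each edge >= cutoff
  const int ny = static_cast<int>(boxY / cutoff);
  const int nz = static_cast<int>(boxZ / cutoff);

  std::printf("patch grid: %d x %d x %d = %d patches\n", nx, ny, nz, nx * ny * nz);
  std::printf("patch edges: %.2f x %.2f x %.2f\n", boxX / nx, boxY / ny, boxZ / nz);
  // each atom then interacts only with atoms in its own patch and its (up to) 26 neighbors
  return 0;
}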
Object-based Parallelization for MD
Force Decomposition + Spatial Decomposition
• Now we have many objects over which to balance load:
• Each diamond can be
assigned to any processor
• Number of diamonds (3D):
– 14 * number of patches (see the worked example after this list)
• 2-away variation:
– Half-size cubes, 5x5x5 interactions
• 3-away interactions: 7x7x7
• Prototype NAMD versions
created for Cell, GPUs
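A quick worked example of these counts, using a hypothetical 10x10x10 patch grid:

#include <cstdio>

// Worked example with a hypothetical patch grid: each patch interacts with
// itself and its 26 neighbors, and each pair is counted once, giving roughly
// 27/2, i.e. ~14 compute objects ("diamonds") per patch.
int main() {
  const int nx = 10, ny = 10, nz = 10;   // 1-away patch grid (made up)
  const int patches = nx * ny * nz;
  std::printf("1-away: %d patches, ~%d compute objects\n", patches, 14 * patches);

  // 2-away variation: half-size cubes give 8x as many patches,
  // each interacting over a 5x5x5 neighborhood
  std::printf("2-away: %d half-size patches, 5x5x5 interaction neighborhood\n", 8 * patches);
  return 0;
}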
Performance of NAMD: STMV
[Plot: time (ms per step) vs. number of cores for STMV, ~1 million atoms]
ChaNGa: Cosmological Simulations
• Collaborative project (NSF ITR)
– with Prof. Tom Quinn, Univ. of Washington
• Components: gravity (done), gas dynamics (almost)
• Barnes-Hut tree code
– Particles represented hierarchically in a tree according to their
spatial position
– “Pieces” of the tree distributed across processors
– Gravity computation:
• “Nearby” particles: computed precisely
• “Distant” particles: approximated by remote node’s center
• Software-caching mechanism, critical for performance
• Multi-timestepping: update frequently only the fastest
particles (see Jetley et al, IPDPS’2008)
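As a sketch of the gravity approximation described above (generic Barnes-Hut in C++, not ChaNGa’s actual implementation): a tree node that appears small enough from the particle’s position is replaced by a point mass at its center of mass; otherwise its children are visited.

#include <cmath>

// Generic Barnes-Hut sketch: a "distant" tree node is replaced by a point
// mass at its center of mass; a "nearby" node is opened and its children
// are visited. Self-interaction handling and softening details are omitted.
struct Node {
  double cx, cy, cz;   // center of mass
  double mass;         // total mass of particles under this node
  double size;         // edge length of the node's bounding box
  Node *child[8];      // children (all nullptr for a leaf)
};

void gravity(const Node *n, double px, double py, double pz,
             double theta, double acc[3]) {
  if (!n || n->mass == 0.0) return;
  const double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
  const double r2 = dx * dx + dy * dy + dz * dz + 1e-12;  // avoid division by zero
  const double r = std::sqrt(r2);

  const bool isLeaf = (n->child[0] == nullptr);
  if (isLeaf || n->size / r < theta) {
    const double f = n->mass / (r2 * r);   // approximate by the node's center of mass
    acc[0] += f * dx; acc[1] += f * dy; acc[2] += f * dz;
  } else {
    for (int i = 0; i < 8; ++i)            // open the node: recurse into children
      gravity(n->child[i], px, py, pz, theta, acc);
  }
}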
ChaNGa Performance
• Results obtained on BlueGene/L
• No multi-timestepping, simple load-balancers
Other Opportunities
MPI Extensions in AMPI
• Automatic load balancing
– MPI_Migrate(): collective operation, possible migration
• Asynchronous collective operations
– e.g. MPI_Ialltoall()
• Post operation, test/wait for completion; do work in between
• Checkpointing support
– MPI_Checkpoint()
• Checkpoint to disk
– MPI_MemCheckpoint()
• Checkpoint in memory, with remote redundancy
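A sketch of how these extensions might be used inside a timestep loop; the call frequencies are arbitrary, and since these routines are AMPI-specific (not standard MPI) their exact signatures may differ across AMPI versions, so this follows the argument-free forms listed above.

// Compiled with an AMPI wrapper (e.g. ampicxx). MPI_Migrate, MPI_Checkpoint
// and MPI_MemCheckpoint are AMPI extensions, not standard MPI, and their
// exact signatures may vary across AMPI versions.
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  for (int step = 0; step < 10000; ++step) {   // loop bounds are placeholders
    // ... one timestep of computation and regular MPI communication ...

    if (step % 100 == 0)
      MPI_Migrate();          // collective: allow the runtime to rebalance/migrate ranks

    if (step % 1000 == 0)
      MPI_MemCheckpoint();    // collective: in-memory checkpoint with remote redundancy
  }

  MPI_Finalize();
  return 0;
}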
Performance Tuning for Future Machines
• For example, Blue Waters will arrive in 2011
– But we need to prepare applications for it, starting now
• Even for extant machines:
– Full size machine may not be available as often as needed for
tuning runs
• A simulation-based approach is needed
• Our approach: BigSim
– Based on the Charm++ virtualization approach
– Full-scale program emulation
– Trace-driven simulation
– History: developed for BlueGene predictions
BigSim Simulation System
• General system organization
• Emulation:
– Run an existing, full-scale MPI, AMPI or Charm++ application
– Uses an emulation layer that pretends to be (say) 100k cores
• Target cores are emulated as Charm++ virtual processors
– Resulting traces (aka logs):
• Characteristics of SEBs (Sequential Execution Blocks)
• Dependences between SEBs and messages
BigSim Simulation System (cont.)
• Trace driven parallel simulation
– Typically run on tens to hundreds of processors
– Multiple resolution simulation of sequential execution:
• from simple scaling factor to cycle-accurate modeling
– Multiple resolution simulation of the Network:
• from simple latency/bw model to detailed packet and
switching port level modeling
– Generates timing traces just as a real application would on the full-scale machine
• Phase 3: Analyze performance
– Identify bottlenecks, even w/o predicting exact performance
– Carry out various “what-if” analysis
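To convey the flavor of trace-driven simulation, here is a toy replay loop (illustrative only, not BigSim’s implementation): each logged SEB has a duration and may send one message that enables another SEB, and the simulator advances per-processor clocks under a simple latency + size/bandwidth network model.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy trace-driven replay (illustrative only, not BigSim): each logged SEB
// has a duration and may send a message that enables another SEB; messages
// cost latency + size/bandwidth. The trace is assumed to be listed in
// dependence order.
struct Seb {
  int proc;          // target processor this block runs on
  double duration;   // measured (or scaled) execution time in seconds
  int enables;       // index of the SEB enabled by this block's message, or -1
  double msgBytes;   // size of that message
};

int main() {
  const double latency = 5e-6, bandwidth = 1e9;   // 5 us, 1 GB/s (made-up network model)
  std::vector<Seb> trace = {
    {0, 1e-3, 1, 1e6},    // SEB 0 on proc 0 sends 1 MB, enabling SEB 1
    {1, 2e-3, 2, 4e5},    // SEB 1 on proc 1 enables SEB 2
    {0, 5e-4, -1, 0.0},   // SEB 2 back on proc 0, no further message
  };

  std::vector<double> procFree(2, 0.0);           // when each target processor becomes idle
  std::vector<double> ready(trace.size(), 0.0);   // earliest start time of each SEB

  for (std::size_t i = 0; i < trace.size(); ++i) {
    const Seb &s = trace[i];
    const double start = std::max(ready[i], procFree[s.proc]);
    const double end = start + s.duration;
    procFree[s.proc] = end;
    if (s.enables >= 0)
      ready[s.enables] = std::max(ready[s.enables],
                                  end + latency + s.msgBytes / bandwidth);
    std::printf("SEB %zu on proc %d: %.6f -> %.6f s\n", i, s.proc, start, end);
  }
  return 0;
}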
Projections: Performance Visualization
BigSim Validation: BG/L Predictions
[Plot: NAMD ApoA1 on BG/L - actual vs. BigSim-predicted execution time (seconds) for 128, 256, 512, 1024 and 2250 simulated processors]
Some Ongoing Research Directions
Load Balancing for Large Machines: I
• Centralized balancers achieve best balance
– Collect object-communication graph on one processor
– But won’t scale beyond tens of thousands of nodes
• Fully distributed load balancers
– Avoid the bottleneck, but achieve poor load balance
– Not adequately agile
• Hierarchical load balancers
– Careful control of what information goes up and down the
hierarchy can lead to fast, high-quality balancers
Load Balancing for Large Machines: II
• Interconnection topology starts to matter again
– Was hidden due to wormhole routing etc.
– Latency variation is still small...
– But bandwidth occupancy (link contention) is a problem
• Topology aware load balancers
– Some general heuristics have shown good performance
• But may require too much compute power
– Also, special-purpose heuristics work fine when applicable
– Preliminary results:
• see Bhatele & Kale’s paper, LSPP@IPDPS’2008
– Still, many open challenges
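One metric that appears in this line of work is hop-bytes: bytes communicated weighted by the number of network hops they traverse, which a topology-aware mapping tries to minimize. The sketch below computes it on a 3D torus with made-up dimensions, placements, and traffic; see the cited paper for the actual formulations and heuristics.

#include <algorithm>
#include <cstdio>
#include <cstdlib>

// Hop-bytes sketch on a 3D torus: bytes exchanged between two objects,
// weighted by the shortest-path (wraparound Manhattan) distance between
// the processors they are placed on.
struct Coord { int x, y, z; };
struct Edge  { Coord a, b; double bytes; };

int torusDist(int a, int b, int dim) {
  const int d = std::abs(a - b);
  return std::min(d, dim - d);               // take the wraparound link if shorter
}

int main() {
  const int X = 8, Y = 8, Z = 8;             // torus dimensions (made up)
  const Edge edges[] = {                     // placements and traffic (made up)
    {{0, 0, 0}, {1, 0, 0}, 1e6},
    {{0, 0, 0}, {4, 7, 3}, 2e5},
    {{2, 5, 1}, {2, 5, 2}, 5e6},
  };

  double hopBytes = 0.0;
  for (const Edge &e : edges) {
    const int hops = torusDist(e.a.x, e.b.x, X)
                   + torusDist(e.a.y, e.b.y, Y)
                   + torusDist(e.a.z, e.b.z, Z);
    hopBytes += hops * e.bytes;
  }
  std::printf("total hop-bytes: %.0f\n", hopBytes);   // lower is better for a mapping
  return 0;
}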
Major Challenges in Applications
• NAMD:
– Scalable PME (long range forces) – 3D FFT
• Specialized balancers for multi-resolution cases
– Ex: ChaNGa running highly-clustered cosmological datasets
and multi-timestepping
[Projections timelines, black = processor activity: (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing]
BigSim: Challenges
• BigSim’s simple diagram hides many complexities
• Emulation:
– Automatic Out-of-core support for large memory footprint apps
• Simulation:
– Accuracy vs. cost tradeoffs
– Interpolation mechanisms for prediction of serial performance
– Memory management optimizations
– I/O optimizations for handling (many) large trace files
• Performance analysis:
– Need scalable tools
• Active area of research
Fault Tolerance
• Automatic checkpointing
– Migrate objects to disk
– In-memory checkpointing as an option
– Both schemes above are available in Charm++
• Scalable fault tolerance
– When a processor out of 100,000 fails, all 99,999 shouldn’t have to run back to their checkpoints!
– Sender-side message logging
– Restart can be sped up by spreading out objects from the failed processor
– IPDPS’07 paper: Chakravorty & Kale
– Ongoing effort to minimize logging protocol overheads
• Proactive fault handling
– Migrate objects to other processors upon detecting an imminent fault
– Adjust processor-level parallel data structures
– Rebalance load after migrations
– HiPC’07 paper: Chakravorty et al.
Higher Level Languages & Interoperability
HPC at Illinois
HPC at Illinois
• Many other exciting developments
– Microsoft/Intel parallel computing research center
– Parallel Programming Classes
• CS-420: Parallel Programming for Sci. and Engineering
• ECE-498: NVIDIA/ECE collaboration
– HP/Intel/Yahoo! Institute
– NCSA’s Blue Waters system approved for 2011
• see http://www.ncsa.uiuc.edu/BlueWaters/
– NCSA/IACAT new institute
• see http://www.iacat.uiuc.edu/
Microsoft/Intel UPCRC
• Universal Parallel Computing Research Center
• 5-year funding, 2 centers:
– Univ. of Illinois & Univ. of California, Berkeley
• Joint effort by Intel/Microsoft: $2M/year
• Mission:
– Conduct research to make parallel programming broadly
accessible and “easy”
• Focus areas:
– Programming, Translation, Execution, Applications
• URL: http://www.upcrc.illinois.edu/
Parallel Programming Classes
• CS-420: Parallel Programming
– Introduction to fundamental issues in parallelism
– Students from both CS and other engineering areas
– Offered every semester, by CS Profs. Kale or Padua
• ECE-498: Progr. Massively Parallel Processors
– Focus on GPU programming techniques
– ECE Prof. Wen-Mei Hwu
– NVIDIA’s Chief Scientist David Kirk
– URL: http://courses.ece.uiuc.edu/ece498/al1
HP/Intel/Yahoo! Initiative
• Cloud Computing Testbed - worldwide
• Goal:
– Study Internet-scale systems, focusing on data-intensive
applications using distributed computational resources
• Areas of study:
– Networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia
• Illinois/CS testbed site:
– 1,024-core HP system with 200 TB of disk space
– External access via an upcoming proposal selection process
• URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html
Our Sponsors
PPL Funding Sources
• National Science Foundation
– BigSim, Cosmology, Languages
• Dep. of Energy
– Charm++ (Load-Bal., Fault-Toler.), Quantum Chemistry
• National Institutes of Health
– NAMD
• NCSA/NSF, NCSA/IACAT
– Blue Waters project, applications
• Dep. of Energy / UIUC Rocket Center
– AMPI, applications
• NASA
– Cosmology/Visualization
Thank you! (Obrigado!)