COMP4300/COMP8300
Parallel Systems
Alistair Rendell and Joseph Antony
Research School of Computer Science
Australian National University
Concept and Rationale

The idea
– Split your program into bits that can be executed simultaneously

Motivation
– Speed, Speed, Speed… at a cost effective price
– If we didn't want it to go faster we would not be bothered with the hassles of parallel programming!
– Reduce the time to solution to acceptable levels
  – No point waiting 1 week for tomorrow's weather forecast
  – Simulations that take months to run are not useful in a design environment
Sample Application Areas

Fluid flow problems
– Weather forecasting/climate modeling
– Aerodynamic modeling of cars, planes, rockets, etc.

Structural mechanics
– Building, bridge, car, etc. strength analysis
– Car crash simulation

Speech and character recognition, image processing
Visualization, virtual reality
Semiconductor design, simulation of new chips
Structural biology, molecular-level design of drugs
Human genome mapping
Financial market analysis and simulation
Data mining, machine learning
Games programming
World Climate Modeling

Atmosphere divided into 3D regions or cells
Complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, velocity
– Conditions change according to neighbouring cells
– Updates repeated frequently as time passes
– Cells are affected by more distant cells the longer-range the forecast

Assume
– Cells are 1x1x1 mile to a height of 10 miles: 5x10^8 cells
– 200 flops to update each cell per timestep
– 10-minute timesteps for a total of 10 days
– 100 days on a 100 Mflop machine
– 10 minutes on a Tflop machine
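The arithmetic behind these figures is simply cells x flops per cell x number of timesteps, divided by the machine's sustained speed. A minimal C sketch of that back-of-envelope calculation follows; the slide's round figures are order-of-magnitude only, and the two machine rates are just the illustrative speeds quoted above, not benchmarks of any particular system.

/* Back-of-envelope cost estimate for the climate example above.
 * All parameters are the slide's assumed, order-of-magnitude figures. */
#include <stdio.h>

int main(void)
{
    const double cells          = 5.0e8;             /* 1x1x1 mile cells to 10 miles height */
    const double flops_per_cell = 200.0;             /* work to update one cell per step    */
    const double steps          = 10.0 * 24.0 * 6.0; /* 10 days of 10-minute timesteps      */

    const double total_flops = cells * flops_per_cell * steps;
    printf("total work: %.2e flops\n", total_flops);

    /* Time to solution at two sustained machine speeds. */
    const double rates[] = { 100.0e6, 1.0e12 };      /* 100 Mflop/s and 1 Tflop/s */
    const char  *names[] = { "100 Mflop/s", "1 Tflop/s" };
    for (int i = 0; i < 2; i++) {
        double seconds = total_flops / rates[i];
        printf("%-12s -> %.2e s (%.1f hours)\n", names[i], seconds, seconds / 3600.0);
    }
    return 0;
}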
ParallelSystems@ANU: NCI

NCI: National Computational Infrastructure
– http://nci.org.au

History: established APAC in 1998 with $19.5M grant from federal government; NCI created in 2007

Current NCI collaboration agreement (2012–15)
– Major Collaborators: ANU, CSIRO, BoM, GA
– Universities: Adelaide, Monash, UNSW, UQ, Sydney, Deakin, RMIT
– University Consortia: Intersect (NSW), QCIF (Queensland)

Co-investment (for recurrent operations):
– 2007: $0M; 2008: $3.4M; 2009: $6.4M; 2011: $7.5M; 2012: $8.5M; 2013: $11M; 2014: $11+M; to provide for all recurrent operations
Current infrastructure: Data Centre

• New Data Centre: $24M (opened Nov. 2012)
• Machine Room: 920 sq. m.
• Power (after 2014 upgrades)
  – 4.5 MW raw capacity; 1 MW UPS
  – 2 x 1.1 MVA Cummins generators
• Cooling in two loops:
  – Server: 2 x 1.8 MW Carrier chillers; 3 x 0.8 MW "free cooling" heat exchangers; 18 deg C; 75 l/sec pump rate
  – Data: 3 x 0.5 MW Carrier chillers; 15 deg C
• PUE: approx. 1.25
NCI: Raijin—Petascale Supercomputer

Raijin – Supercomputer (June 2013 commissioning)
– 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
– 160 TBytes (approx.) of main memory
– Mellanox Infiniband FDR interconnect (52 km of cable)
– 10 PBytes (approx.) of usable fast filesystem (for short-term scratch space, apps, home directories)
– Power: 1.5 MW max. load
– Cooling systems: 100 tonnes of water
– 24th fastest in the world on debut (November 2012); first petaflop system in Australia (November 2014: #52)

• Fastest file-system in the southern hemisphere
• Custom monitoring and deployment
• Custom kernel
• Highly customised PBS Pro scheduler
NCI's integrated high-performance environment

[Diagram: NCI's integrated environment, showing the Internet, NCI data movers, a VMware cloud, Raijin login/data-mover nodes and Raijin HPC compute, 10 GigE links and a connection to the Huxley DC; the /g/data 56 Gb FDR IB fabric serving the persistent global parallel filesystems /g/data1 (~7 PB) and /g/data2 (~6 PB); the Raijin 56 Gb FDR IB fabric serving the Raijin high-speed /short filesystem (7.6 PB) plus /home, /system, /images and /apps; and the Massdata tape archive (1.0 PB cache, 20 PB tape).]
ParallelSystems@DCS

Bunyip: tsg.anu.edu.au/Projects/Bunyip
– 192-processor PC cluster
– winner of the 2000 Gordon Bell prize for best price/performance

High Performance Computing Group
– Jabberwocky cluster
– Saratoga cluster
– Sunnyvale cluster
The Rise of Parallel Computing

Year   Hardware                          Languages
1950   Early designs                     Fortran I (Backus, 57)
1960   Integrated circuits               Fortran 66
1970   Large scale integration           C (72)
1980   RISC and PC                       C++ (83), Python 1.0 (89)
1990   Shared and distributed parallel   MPI, OpenMP, Java (95)
2000   Faster, better, hotter            Python 2.0 (00)
2010   Throughput oriented               CUDA, OpenCL

Parallelism became an issue for programmers from the late 80s
People began compiling lists of big parallel systems

November 2014 Top500 (NCI now number 52)
Planning the Future

[Graph: growth in ANU/NCI's computing performance (measured in TFlops) since 1987; architecture and capability determined by research and innovation drivers.]

[Graph: Top500 Supercomputers — international Top500 supercomputer growth since 1993. Red: #1 machine each year; yellow: #500 machine each year; blue: sum of all machines.]

The graphs show growth factors of between 8 and 9 times every 3 years.
Transitioning Australia to its HPC Future

[Graph: increase in capability usage at NCI with time — cumulative % of use versus number of cores (1 to 16384), plotted for 2008, 2009, 2010, 2011 and 2013.]

The goal is to move the knee in the curve towards higher core counts. This needs expert people and (eventually) accelerated hardware.
We also had Increased Node Performance

Moore's Law: 'Transistor density will double approximately every two years.'
Dennard Scaling: 'As MOSFET features shrink, switching time and power consumption will fall proportionately.'

Until the chips became too big…
– 250nm, 400mm2, 100%
– 180nm, 450mm2, 100%
– 130nm, 566mm2, 82%
– 100nm, 622mm2, 40%
– 70nm, 713mm2, 19%
– 50nm, 817mm2, 6.5%
– 35nm, 937mm2, 1.9%
Agarwal, Hrishikesh, Keckler and Burger, Clock Rate Versus IPC, ISCA 2000
…so multiple cores appeared on chip
– 2004: Sun releases the dual-core UltraSPARC IV, heralding the start of multicore

…until we hit a bigger problem…
…the end of Dennard scaling…

Moore's Law: 'Transistor density will double approximately every two years.'
Dennard scaling: 'As MOSFET features shrink, switching time and power consumption will fall proportionately.'
Dennard, Gaensslen, Yu, Rideout, Bassous and LeBlanc, IEEE JSSC, 1974
…ushering in…
…a new philosophy in processor design is emerging

1960-2010                      2010-?
Few transistors                No shortage of transistors
No shortage of power           Limited power
Maximize transistor utility    Minimize energy
Generalize                     Customize

…and a fundamentally new set of building blocks for our petascale systems
Petascale and Beyond: Challenges and Opportunities

As a whole – sheer number of nodes
• Tianhe-2 has the equivalent of >3M cores
• Challenges/opportunities: programming language/environment; fault tolerance

Within a domain – heterogeneity
• The Tianhe system uses CPUs and GPUs
• Challenges/opportunities: what to use when; co-location of data with the unit processing it

On the chip – energy minimization
• Processors already have frequency and voltage scaling
• Challenges/opportunities: minimize data size and movement, including use of just enough precision (a sketch follows below); specialized cores

In RSCS we are working in all these areas
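As one concrete illustration of the "just enough precision" point above: storing a field in 4-byte floats rather than 8-byte doubles halves the bytes that must move through the memory system and network, at some cost in accuracy. The sketch below is illustrative only; the array size and the simple reduction are arbitrary choices, not taken from any particular code.

/* Sketch: "just enough precision" trades accuracy for data volume.
 * A 10^7-element field occupies 80 MB as doubles but 40 MB as floats,
 * so every sweep over it moves half the data. */
#include <stdio.h>
#include <stdlib.h>

#define N 10000000L

int main(void)
{
    float  *xf = malloc(N * sizeof *xf);   /* 40 MB moved per sweep */
    double *xd = malloc(N * sizeof *xd);   /* 80 MB moved per sweep */
    if (!xf || !xd) return 1;

    for (long i = 0; i < N; i++) {
        xf[i] = 1.0f / (float)(i + 1);
        xd[i] = 1.0  / (double)(i + 1);
    }

    /* The same reduction at the two precisions. */
    double sf = 0.0, sd = 0.0;
    for (long i = 0; i < N; i++) sf += xf[i];
    for (long i = 0; i < N; i++) sd += xd[i];

    printf("float  storage: sum = %.10f\n", sf);
    printf("double storage: sum = %.10f\n", sd);
    printf("difference      = %.3e\n", sd - sf);

    free(xf);
    free(xd);
    return 0;
}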
Parallelisation

Split the program up and run parts simultaneously on different processors
– On N computers the time to solution should (ideally!) be 1/N (see the sketch after this slide)
– Parallel programming: the art of writing the parallel code!
– Parallel computer: the hardware on which we run our parallel code!
– COMP4300 will discuss both

Beyond raw compute power, other motivations include
– Enabling more accurate simulations in the same time (finer grids)
– Providing access to huge aggregate memories
– Providing more and/or better input/output capacity
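As a minimal sketch of what "split the program up and run parts simultaneously" looks like in practice, the fragment below uses OpenMP (one of the programming models covered later) to divide a loop among the available threads; ideally, N threads give close to 1/N of the serial time. The problem size and the harmonic-sum kernel are arbitrary illustrative choices.

/* Minimal OpenMP sketch: loop iterations are split across threads, and the
 * reduction clause combines the per-thread partial sums safely.
 * Compile with, e.g.,  gcc -fopenmp harmonic.c   (filename is illustrative). */
#include <stdio.h>
#include <omp.h>

#define N 100000000L

int main(void)
{
    double sum = 0.0;
    double t0 = omp_get_wtime();

    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= N; i++)
        sum += 1.0 / (double)i;

    double t1 = omp_get_wtime();
    printf("harmonic sum = %.6f  threads = %d  time = %.3f s\n",
           sum, omp_get_max_threads(), t1 - t0);
    return 0;
}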
Parallelism in a Single “CPU” Box

Multiple instruction units:
– Typical processors issue ~4 instructions per cycle

Instruction Pipelining:
– Complicated operations are broken into simple operations that can be overlapped (see the sketch after this list)

Interleaved Memory:
– Multiple paths to memory that can be used at the same time

Graphics Engines:
– Use multiple rendering pipes and processing elements to render millions of polygons a second

Input/Output:
– Disks are striped, with different blocks of data written to different disks at the same time
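The multiple instruction units and pipelining points can be seen even in scalar code: splitting a reduction across several independent accumulators breaks one long dependence chain into several that the core can overlap. The sketch below is illustrative; the 4-way split echoes the "~4 instructions per cycle" figure above but is not a tuned value.

/* Sketch of instruction-level parallelism within one core: four independent
 * accumulators let the pipeline and multiple issue units overlap additions
 * that would otherwise form a single serial dependence chain. */
#include <stdio.h>

#define N 1000000   /* divisible by 4; arbitrary illustrative size */

static double x[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (double)(i + 1);

    /* One accumulator: each add must wait for the previous one. */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += x[i];

    /* Four independent chains, combined at the end. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    printf("single chain: %.12f\n", s);
    printf("four chains:  %.12f\n", s0 + s1 + s2 + s3);
    return 0;
}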
Health Warning!

The course is run every other year
– Drop out this year and it won't be repeated until 2017

It's a 4000/8000 level course; it's supposed to:
– Be more challenging than a 3000 level course!
– Be less well structured
– Have greater expectations of you
– Have more student participation
– Be fun!

Nathan Robertson, 2002 honours student:
– "Parallel systems and thread safety at Medicare: 2/16 understood it - the other guy was a $70/hr contractor"
Learning Objectives

Parallel Architecture:
– Basic issues concerning design and likely performance of parallel systems

Specific Systems:
– Will make extensive use of research systems in our group and also visit the NCI facilities

Parallel Algorithms:
– Numeric and non-numeric

Programming Paradigms:
– Distributed and shared memory, things in between, Grid computing

The future
Course Content

Discussion of Schedule:
http://cs.anu.edu.au/courses/COMP4300/schedule.html
Commitment and Assessment

The pieces
– 2 lectures per week (~30 core lecture hours)
– 6 labs (not marked, solutions provided)
– 2 assignments (40%)
– 1 mid-semester exam (1 hour, 15%)
– 1 final exam (3 hours, 45%)

Final mark is the sum of the assignment, mid-semester and final exam marks
Lectures

Two slots
– Mon 10:00-12:00 PSYC G6
– Thu 11:00-12:00 PSYC G6
– Exact schedule on the web site

Partial notes will be posted on the web site
– bring a copy to lectures

Attendance at lectures and labs is strongly recommended
– Attendance at labs will be recorded
Course Web Site
http://cs.anu.edu.au/courses/comp4300
We will use Wattle only for lecture recordings
Laboratories

Start in week 3 (March 2nd)
– See web page for detailed schedule

4 sessions available
– Mon 15:00-17:00 N113
– Tue 13:00-15:00 N114
– Wed 14:00-16:00 N113
– Fri 12:00-14:00 N113

Who cannot make any of these?
Not assessed, but will be examined
People

Alistair Rendell (Convener)
– CSIT Bldg Rm N226 (and N338)
– [email protected]
– Phone 6125 4386

Joseph Antony (Lecturer)
– Senior HPC Data Specialist, NCI
– NCI Bldg 143 (near JCSMR)
– [email protected]
– Phone 6125 5988

Gaurav Mitra (Tutor)
– PhD student, Computer Systems
– CSIT Bldg Rm 230
– [email protected]
– Phone 6125 9658
Course Communication

Course web page
– cs.anu.edu.au/course/comp4300

Bulletin board (forum – available from streams)
– cs.anu.edu.au/streams

At lectures and in labs

Email
– [email protected]

In person
– Office hours (Alistair): Thu 12:00-13:00 (after the lecture)
– Email for an appointment if you want another time
Useful Books

• Principles of Parallel Programming, Calvin Lin and Lawrence Snyder, Pearson International Edition, ISBN 978-0-321-54942-6
• Introduction to Parallel Computing, 2nd Ed., Grama, Gupta, Karypis, Kumar, Addison-Wesley, ISBN 0201648652 (electronic version accessible online from the ANU library – search for the title)
• Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 2nd edition, ISBN 0131405632
• and others on the web page
Questions so far!?