CS267: Applications of Parallel Computers
Lecture 1: Introduction
Kathy Yelick
[email protected]
http://www.cs.berkeley.edu/~yelick
Outline
• Introduction
• Large important problems require powerful computers
• Why powerful computers must be parallel processors
• Principles of parallel computing performance
• Structure of the course
Administrative Information
• Instructors:
- Kathy Yelick, 777 Soda, [email protected]
- TA: David Bindel, 515 Soda, [email protected]
• Accounts – fill out online registration!
• Class survey – fill out today
• Lecture notes are based on previous semester notes:
- Jim Demmel, David Culler, David Bailey, Bob Lucas, and myself
• Discussion section only “on-demand”
• Most class material and lecture notes are at:
- http://www.cs.berkeley.edu/~dbindel/cs267ta
Why we need powerful computers
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  - Too difficult -- build large wind tunnels.
  - Too expensive -- build a throw-away passenger jet.
  - Too slow -- wait for climate or galactic evolution.
  - Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
  - Based on known physical laws and efficient numerical methods.
Some Particularly Challenging Computations
• Science
- Global climate modeling
- Astrophysical modeling
- Biology: Genome analysis; protein folding (drug design)
• Engineering
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
• Business
- Financial and economic modeling
- Transaction processing, web services and search engines
• Defense
- Nuclear weapons -- test by simulations
- Cryptography
Units of Measure in HPC
• High Performance Computing (HPC) units are:
- Flop/s: floating point operations per second
- Bytes: size of data
• Typical sizes are millions, billions, trillions…

Mega   Mflop/s = 10^6 flop/sec    Mbyte = 10^6 byte (also 2^20 = 1048576)
Giga   Gflop/s = 10^9 flop/sec    Gbyte = 10^9 byte (also 2^30 = 1073741824)
Tera   Tflop/s = 10^12 flop/sec   Tbyte = 10^12 byte (also 2^40 = 1099511627776)
Peta   Pflop/s = 10^15 flop/sec   Pbyte = 10^15 byte (also 2^50 = 1125899906842624)
Economic Impact of HPC
• Airlines:
- System-wide logistics optimization systems on parallel systems.
- Savings: approx. $100 million per airline per year.
• Automotive design:
- Major automotive companies use large systems (500+ CPUs) for:
- CAD-CAM, crash testing, structural integrity and aerodynamics.
- One company has 500+ CPU parallel system.
- Savings: approx. $1 billion per company per year.
• Semiconductor industry:
- Semiconductor firms use large systems (500+ CPUs) for
- device electronics simulation and logic validation
- Savings: approx. $1 billion per company per year.
• Securities industry:
- Savings: approx. $15 billion per year for U.S. home mortgages.
Global Climate Modeling Problem
• Problem is to compute:
f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
- Discretize the domain, e.g., a measurement point every 1km
- Devise an algorithm to predict weather at time t+1 given t
• Uses:
- Predict major events, e.g., El Nino
- Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
- Solve Navier-Stokes problem
- Roughly 100 Flops per grid point with 1 minute timestep
• Computational requirements:
- To match real time, need 5×10^11 flops every 60 seconds ≈ 8 Gflop/s
- Weather prediction (7 days in 24 hours) → 56 Gflop/s
- Climate prediction (50 years in 30 days) → 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation is at least 8x
• Current models are coarser than this
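As a rough check of these rates (a sketch, not part of the original slides), a few lines of Python reproduce the arithmetic. The 8 Gflop/s real-time rate is the slide's rounded figure of 5×10^11 flops per simulated minute; the exact numbers depend on the model resolution.

# Back-of-the-envelope check of the required sustained flop rates.
realtime_rate = 8e9     # ~5e11 flops per simulated minute / 60 s, rounded as on the slide

def required_rate(sim_seconds, wall_seconds):
    """Sustained flop rate to simulate sim_seconds of weather in wall_seconds of computing."""
    return realtime_rate * sim_seconds / wall_seconds

day, year = 24 * 3600, 365 * 24 * 3600
print(required_rate(7 * day, day) / 1e9, "Gflop/s")           # 7 days in 24 hours  -> 56
print(required_rate(50 * year, 30 * day) / 1e12, "Tflop/s")   # 50 years in 30 days -> ~4.9
print(required_rate(50 * year, 12 * 3600) / 1e12, "Tflop/s")  # 50 years in 12 hours -> ~292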
Heart Simulation
• Problem is to compute blood flow in the heart
• Approach:
- Modeled as an elastic structure in an incompressible fluid.
- The “immersed boundary method” due to Peskin and McQueen.
- 20 years of development in model
- Many applications other than the heart: blood clotting, inner ear,
paper making, embryo growth, and others
- Use a regularly spaced mesh (set of points) for evaluating the fluid
• Uses
- Current model can be used to design artificial heart valves
- Can help in understanding the effects of disease (leaky valves)
- Related projects look at the behavior of the heart during a heart attack
- Ultimately: real-time clinical work
Heart Simulation Calculation
This involves solving the Navier-Stokes equations.
- A 64^3 grid was possible on a Cray Y-MP, but 128^3 is required for an accurate model (and would have taken 3 years).
- Done on a Cray C90 -- 100x faster and with 100x more memory
- Until recently, limited to vector machines
- Needs more features:
  - Electrical model of the heart, and details of muscles (e.g., work of Chris Johnson, Andrew McCulloch)
  - Lungs, circulatory systems
Parallel Computing in Web Search
• Functional parallelism: crawling, indexing, sorting
• Parallelism between queries: multiple users
• Finding information amidst junk
• Preprocessing of the web data set to help find information
• General themes of sifting through large, unstructured data sets:
- when to put white socks on sale
- what advertisements should you receive
- finding medical problems in a community
Document Retrieval Computation
• Approach:
- Store the documents in a large (sparse) matrix
- Use Latent Semantic Indexing (LSI), or related algorithms to “partition”
- Needs large sparse matrix-vector multiply
[Figure: sparse term-document matrix, roughly 10 M documents by ~100 K keywords, multiplied by a query vector]
• Matrix is compressed
• “Random” memory access
• Scatter/gather vs. one cache miss per 2 flops
Ten million documents in typical matrix.
Web storage increasing 2x every 5 months.
Similar ideas may apply to image retrieval.
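The kernel behind this is a sparse matrix-vector multiply. Below is a minimal Python sketch of the compressed sparse row (CSR) version; the matrix values and the tiny sizes are purely illustrative, not taken from the LSI system on the slide.

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A*x where A is stored in compressed sparse row (CSR) format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        s = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]];
        # the gather through col_idx is the "random" memory access noted above.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y

# Tiny example: a 3x4 sparse matrix with 5 nonzeros.
values  = [24.0, 65.0, 18.0, 1.0, 2.0]
col_idx = [0, 2, 1, 3, 0]
row_ptr = [0, 2, 3, 5]
x = [1.0, 1.0, 1.0, 1.0]
print(spmv_csr(values, col_idx, row_ptr, x))  # [89.0, 18.0, 3.0]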
Transaction Processing
[Chart: TPC-C throughput (tpmC, up to ~25,000) vs. number of processors (up to ~120), March 15, 1996, for Tandem Himalaya, IBM PowerPC, DEC Alpha, SGI PowerChallenge, HP PA, and other systems]
• Parallelism is natural in relational operators: select, join, etc.
• Many difficult issues: data partitioning, locking, threading.
Why powerful computers are parallel
Technology Trends: Microprocessor Capacity
Moore’s Law: 2X transistors/chip every 1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra
CS267 Lecture 1: Intro
17
Microprocessor Transistors
[Chart: transistor count per microprocessor vs. year, 1970–2005, rising from about 1,000 to about 100,000,000; labeled points include the i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000]
Impact of Device Shrinkage
• What happens when the feature size shrinks by a factor of x?
• Clock rate goes up by x
  - actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
  - typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
  - of which x^3 is devoted either to parallelism or locality
Microprocessor Clock Rate
[Chart: microprocessor clock rate (MHz, log scale from 0.1 to 1000) vs. year, 1970–2005]
Empirical Trends: Microprocessor Performance
[Chart: Linpack MFLOPS vs. year, 1975–2000, comparing Cray vector machines (Cray 1s, X-MP, Y-MP, C90, T94) with microprocessor-based systems (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, HP 9000/735, DEC Alpha AXP, MIPS R4400, IBM Power2/990, DEC 8200), with separate curves for n=100 and n=1000 problem sizes]
How fast can a serial computer be?
[Figure: a 1 Tflop/s, 1 Tbyte sequential machine of radius r = 0.3 mm]
• Consider the 1 Tflop/s sequential machine:
  - Data must travel some distance, r, to get from memory to CPU.
  - To get 1 data element per cycle means 10^12 trips per second at the speed of light, c = 3×10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  - Each word occupies about 3 square Angstroms, or the size of a small atom.
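A short Python sketch of this back-of-the-envelope argument (assumptions as on the slide; the per-word area quoted above depends on the word size assumed):

# How small must a 1 Tflop/s sequential machine be, and how densely packed?
c = 3e8                       # speed of light, m/s
ops_per_s = 1e12              # 1 Tflop/s, one memory fetch per operation
r = c / ops_per_s             # data must cover distance r, 10^12 times per second
print(f"r < {r * 1e3:.1f} mm")                         # 0.3 mm

bytes_total = 1e12            # 1 TByte packed into an r x r square
area_per_byte_m2 = r * r / bytes_total
area_per_byte_A2 = area_per_byte_m2 / (1e-10) ** 2     # in square Angstroms
print(f"~{area_per_byte_A2:.0f} square Angstroms per byte")  # ~9, i.e. ~3 Angstroms on a side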
Microprocessor Transistors and Parallelism
[Chart: the same transistor-count-vs-year curve (i4004 through R10000, 1970–2005), annotated with regions labeled Bit-Level Parallelism, Instruction-Level Parallelism, and Thread-Level Parallelism?]
“Automatic” Parallelism in Modern Machines
• Bit level parallelism: within floating point operations, etc.
• Instruction level parallelism (ILP): multiple instructions execute per
clock cycle.
• Memory system parallelism: overlap of memory operations with
computation.
• OS parallelism: multiple jobs run in parallel on commodity SMPs.
There are limitations to all of these!
Thus to achieve high performance, the programmer needs to identify,
schedule and coordinate parallel tasks and data.
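For illustration only (not from the course materials), here is a minimal Python sketch of what explicitly managed parallelism looks like: the programmer partitions the work, schedules it on worker processes, and coordinates the results. The function and variable names are made up for the example.

from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, tasks = 10_000_000, 8
    chunk = n // tasks
    work = [(i * chunk, (i + 1) * chunk) for i in range(tasks)]   # identify & partition the work
    with Pool(processes=4) as pool:                               # schedule it on 4 workers
        total = sum(pool.map(partial_sum, work))                  # coordinate the partial results
    print(total)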
Trends in Parallel Computing Performance
[Chart: Linpack performance (GFLOPS, log scale from 0.1 to 1000) vs. year, 1985–1995, for parallel machines including the Cray X-MP, Y-MP/832, C90, T932, nCUBE/2, iPSC/860, CM-2, CM-200, CM-5, Intel Delta, Paragon XP/S, Paragon XP/S MP (1024 and 6768 nodes), Cray T3D, Cray VPP, and ASCI Red]
• Performance of several machines on the Linpack
benchmark (dense matrix factorization)
Architectures
[Chart: number of systems, out of the 500 tracked, by architecture class (SIMD, constellation, cluster, MPP, SMP, single processor), June 1993 through November 2000; most recent counts: 112 constellations, 28 clusters, 343 MPPs, 17 SMPs]
Principles of Parallel Computing
• Parallelism and Amdahl’s Law
• Finding and exploiting granularity
• Preserving data locality
• Load balancing
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming more
difficult than sequential programming.
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl’s law
- Let s be the fraction of work done sequentially, so (1-s) is the fraction that is parallelizable.
- P = number of processors.
  Speedup(P) = Time(1)/Time(P)
             <= 1/(s + (1-s)/P)
             <= 1/s
Even if the parallel part speeds up perfectly, we may be
limited by the sequential portion of code.
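A small Python sketch of this bound (the 5% serial fraction is an illustrative assumption, not a figure from the slide):

def amdahl_speedup(s, p):
    """Upper bound on speedup with serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# Even with only 5% serial work, 1000 processors give less than 20x speedup.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.05, p), 1))   # 10 -> 6.9, 100 -> 16.8, 1000 -> 19.6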
Little’s Law
Principle (Little's Law): the relationship of a production
system in steady state is:
Inventory = Throughput × Flow Time
For parallel computing, this means:
Concurrency = latency x bandwidth
Example: 1000-processor system, 1 GHz clock, 100 ns memory latency, 100 words of memory in the data paths between CPU and memory.
- Main memory bandwidth is: ~1000 x 100 words x 10^9/s = 10^14 words/sec.
- To achieve full performance, an application needs: ~10^-7 s x 10^14 words/s = 10^7-way concurrency
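The same arithmetic as a short Python sketch, using exactly the assumptions listed above:

# Concurrency needed to keep the memory system busy: latency x bandwidth.
latency_s = 100e-9                       # 100 ns memory latency
words_in_flight_per_proc = 100           # words in the data paths per CPU
processors = 1000
clock_hz = 1e9                           # 1 GHz clock

bandwidth_words_per_s = processors * words_in_flight_per_proc * clock_hz   # 1e14
concurrency = latency_s * bandwidth_words_per_s                            # 1e7
print(f"{bandwidth_words_per_s:.0e} words/s, {concurrency:.0e}-way concurrency")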
Overhead of Parallelism
• Given enough parallel work, this is the most significant
barrier to getting desired speedup.
• Parallelism overheads include:
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
• Each of these can be in the range of milliseconds
(= millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work
to run fast in parallel (i.e. large granularity), but not so
large that there is not enough parallel work.
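As an illustration of the first overhead above (a sketch, not a course benchmark), the snippet below times how long it takes merely to start and join one worker process; on many systems this lands in the millisecond range, i.e. millions of flops of lost work.

import time
from multiprocessing import Process

def noop():
    pass

if __name__ == "__main__":
    t0 = time.perf_counter()
    p = Process(target=noop)    # cost of starting a process...
    p.start()
    p.join()                    # ...and synchronizing with it
    elapsed = time.perf_counter() - t0
    print(f"spawn+join: {elapsed * 1e3:.1f} ms "
          f"(~{elapsed * 1e9:.0f} flops of work at 1 Gflop/s)")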
Locality and Parallelism
[Figure: conventional storage hierarchy — each processor has its own cache, L2 cache, and L3 cache in front of its memory, with potential interconnects between the memories of different processors]
• Large memories are slow, fast memories are small.
• Storage hierarchies are large and fast on average.
• Parallel processors, collectively, have large, fast memories -- the slow accesses to
“remote” data we call “communication”.
• Algorithm should do most work on local data.
Load Imbalance
• Load imbalance is the time that some processors in the
system are idle due to
- insufficient parallelism (during that phase).
- unequal size tasks.
• Examples of the latter
- adapting to “interesting parts of a domain”.
- tree-structured computations.
- fundamentally unstructured problems.
• Algorithm needs to balance load
- but techniques that balance load often reduce locality
Parallel Programming for Performance is Challenging
[Chart: speedup vs. number of processors (up to ~128) for Amber (chemical modeling), showing three versions of the code — 8/94, 9/94, and 12/94 — with later versions achieving better speedup, up to roughly 60x]
• Speedup(P) = Time(1) / Time(P)
• Applications have “learning curves”
Course Organization
Schedule of Topics
• Introduction
• Parallel Programming Models and Machines
- Shared Memory and Multithreading
- Distributed Memory and Message Passing
- Data parallelism
• Sources of Parallelism in Simulation
• Algorithms and Software Tools
- Dense Linear Algebra
- Partial Differential Equations (PDEs)
- Particle methods
- Load balancing, synchronization techniques
- Sparse matrices
- Visualization (field trip to NERSC)
- Sorting and data management
- Grid computing
• Applications (including guest lectures)
• Project Reports
Reading Materials
• Some on-line texts:
- Demmel’s notes from CS267 Spring 1999, which are similar to 2000
and 2001. However, they contain links to html notes from 1996.
- http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
- Ian Foster’s book, “Designing and Building Parallel Programs”.
- http://www-unix.mcs.anl.gov/dbpp/
• Recommended text:
- “Performance Optimization of Numerically Intensive Codes” by Stefan
Goedecker and Adolfy Hoisie
- This is a practical guide to optimization, mostly for those of you
who have never done any optimization
- It won’t be available in the bookstore for a while, but you can order
online
Requirements
• Fill out on-line account request for Millennium machine.
- See course web page for pointer
- http://www-inst.eecs.berkeley.edu/~cs267
• Fill out survey
- e-mail to David if you missed this lecture
• Four programming assignments (35%).
- Hands-on experience, interdisciplinary teams.
- First one is available now on the above page
• Class participation (15%).
- Based, in part, on reading assignments
• Final Project (50%).
- Teams of 2-3, interdisciplinary is best.
- Interesting applications or advance of systems.
- Presentation (poster session)
- Conference quality paper
First Assignment
• See home page for details.
• Find an application of parallel computing and build a
web page describing it.
- Choose something from your research area.
- Or from the web or elsewhere.
• Evaluate the project. Was parallelism successful?
• Create a web page describing the application.
• Send us ({yelick,dbindel}@cs) the link.
• Due next week, Wednesday (9/5).
What you should get out of the course
In-depth understanding of:
• When is parallel computing useful?
• Understanding of parallel computing hardware options.
• Overview of programming models (software) and tools.
• Some important parallel applications and the algorithms
• Performance analysis and tuning