Principles of Parallel Architecture

Fall 2006
Keys to a happy Life: Diversity and Variety. Diversity in the people that you meet. Variety
in the things that you do
8/30/2006
eleg652-010-06F
1
Contact Information
Instructor
Name: Joseph B. Manzano
Office: 137 Evans Hall
Phone: N/A
Email: [email protected]
Teaching Assistants
Name: Juergen Ributzka
Office: 326 Dupont Hall
Phone: (302) 831 0327
Email: [email protected]
Name: Eunjung Park
Office: 326 Dupont Hall
Phone: (302) 831 0327
Email: [email protected]
Course Webpage: http://www.capsl.udel.edu/courses/eleg652/2006/
Important Course Information
Final Quiz
Wednesday, December 6th, 2006
Final Project Due Date
Friday, December 8th, 2006
Activities
Four homeworks, a comprehensive final
examination and a class project assigned
by the instructor with a mentor
Grade Distribution
Activity         Weight
Homeworks        33.00%
Class Project    33.00%
Final Exam       33.00%
Participation     1.00%
Reference Material
Reference Books
1. John Hennessy and David Patterson,
Computer Architecture: A Quantitative Approach,
Third Edition, Morgan Kaufmann Publishers, Inc., 2003
2. D. E. Culler, J. P. Singh, and A. Gupta,
Parallel Computer Architecture,
Morgan Kaufmann Publishers, Inc., 1999
Supporting Materials
Selected publications from
Journals
IEEE Computer
IEEE Transactions on Computers
IEEE Transactions on Parallel and Distributed Systems
Conference Proceedings
PACT: International Conference on Parallel Architectures and
Compilation Techniques
MICRO: IEEE/ACM International Symposium on Microarchitecture
HPCA: IEEE International Symposium on High-Performance
Computer Architecture
ISCA: International Symposium on Computer Architecture
PLDI: ACM SIGPLAN Conference on Programming Language
Design and Implementation
Course Contents
Provides an overview of technologies that apply to
almost all aspects of computers and will soon be
part of consumer electronics in general.
Shows the principles on which parallel machines are
built and how these concepts have infiltrated other
parts of the computer and entertainment industry.
Provides an understanding of how these concepts, in
their different implementations, affect both the hardware
and the software of the target machine.
Expectations about this Course
You should learn:
A basic idea about the lingo that is used in today's
supercomputer/parallel machine market
Vector Processing and its place in consumer electronics
Different forms of parallelism and their current implementations
Shared memory models
Parallel Programming Models and Synchronization
Multithreaded Architectures
Why Study Parallel Architectures?
Concepts that should soon become ubiquitous
Productively write software that takes advantage of
new features of upcoming or existing hardware
Understand how current technologies have
evolved and how they can be improved
Course Overview
Terminology and General Knowledge
Vector Processing and its Legacy
Instruction Level Parallelism: a brief overview
Multicore and Cellular architectures
Parallel (shared) memory models and
synchronization primitives
Advanced Topics such as Dataflow and
Transactional Memory
Course Introduction
The Role of a Computer Architect
Maximize Productivity and Performance
Productivity = Programmability and a reduction in development time
Performance = “Reasonable” Throughput given technology
and cost limitations
Parallelism
Two or more tasks may execute at the same time
Alternative to higher frequency clocks
Applies to all levels of computer design
Importance has been rising constantly since several “walls” were hit
In the near future, it will become the paradigm in all aspects
of computing
The Transition
Most consumer electronics will have some form of parallel
architecture inside of them by next year (2007)
Reasons for the Change
An evolutionary change in computing due to:
Technology: Decrease in feature size, allowing more
components into a chip
Architecture: Effectively organizing components to
maximize the use of resources and minimize damaging
size effects
Applications: More and more performance and power
hungry applications
Economics: Find cost effective ways to get the desired
performance out of the given hardware / software combo
Applications Requirements
Demand for more cycles = More sophisticated Hardware
Wide Range of Performance Demands
Audio Processing = Real time response with an allowed threshold
of error
Business Loads = A given quantum of time with no error allowed
Applications on parallel computers: Obtain a speedup in application
runtime
Productive Parallel Systems
Current systems work on parallel concepts and designs (e.g.
desktop systems are multithreaded)
Parallel Computing and computers are becoming ubiquitous as we speak
Technology: An Overview
Decrease in Feature Size (Lambda)
Clock rate: roughly proportional to the decrease in Lambda
Number of transistors: grows at least as fast as the square of
the decrease in Lambda
Performance: An increase of roughly 1000x in the last decade
The fastest supercomputer in June 1996 (Tokyo's SR2201)
was 220 GFLOPS
The fastest supercomputer now is 280 TFLOPS (IBM's eServer
Blue Gene Solution)
and an increase of roughly 20x in the same decade with respect to
clock frequency
Intel Pentium Pro at 150 ~ 200 MHz in 1996
Intel Pentium D at 3.2 GHz in 2006
Extra components: Parallelism vs. data locality: fighting for real estate
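As a rough check, the gains implied by the figures quoted on this slide can be computed directly. This is only a minimal sketch: the constant and function names are illustrative, and the inputs are simply the slide's own numbers converted to common units.

```python
# Figures quoted on the slide; names are illustrative only.
GFLOPS_1996 = 220.0          # SR2201, June 1996
GFLOPS_2006 = 280_000.0      # Blue Gene, 280 TFLOPS expressed in GFLOPS
MHZ_1996 = (150.0, 200.0)    # Pentium Pro clock range in MHz
MHZ_2006 = 3200.0            # Pentium D, 3.2 GHz expressed in MHz


def ratio(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


flops_gain = ratio(GFLOPS_2006, GFLOPS_1996)
clock_gain_low = ratio(MHZ_2006, MHZ_1996[1])   # measured against 200 MHz
clock_gain_high = ratio(MHZ_2006, MHZ_1996[0])  # measured against 150 MHz

print(f"Supercomputer performance gain: {flops_gain:.0f}x")
print(f"Clock frequency gain: {clock_gain_low:.0f}x to {clock_gain_high:.0f}x")
```

The gap between the two gains is the point of the slide: most of the performance growth came from something other than faster clocks.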
Intel: An Example of Clock Frequency Growth
Growth has been steady until now!!!!
[Figure: Clock Growth from 1971 to 2004. Vertical axis: frequency in KHz,
0 to 3,750,000. Horizontal axis: Intel microprocessor family (4040, 8080,
8085, 8086, 8088, 80286, 80386 SX, 80486 DX, Pentium, Pentium MMX,
Pentium 2, Celeron, Pentium 3 (t), Pentium 4 (NC), Pentium 4E, Pentium 4F).]
Pentium M
Thermal maps of the Pentium M obtained from simulated power density (left) and IREM
measurement (right). Heat levels go from black (lowest) through red, orange and yellow
to white (highest)
Figures courtesy of Dani Genossar and Nachum Shamir, from their paper "Intel ® Pentium ® M Processor
Power Estimation, Budgeting, Optimization and Validation", Intel Technology Journal, May 21, 2003
Storage and Transistor Count Growth
Transistor Count
Expected to reach one billion during this decade (the 2000s)
Grows faster than clock rate: 40% per year
Storage
Gap between storage and speed is more pronounced
Larger memories = slower = larger memory hierarchies (e.g.
caches, write / read buffers, etc.)
Parallelism and locality inside memory systems: multi-port
memory, parallel caches, RAIDs, parallel disks with
caching, etc.
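To put the 40% per year transistor growth quoted above in perspective, the sketch below compounds that rate over a decade and derives the matching doubling time. The function names are illustrative, and the rate is assumed to hold constant over the whole period.

```python
import math


def compound_growth(rate_per_year: float, years: float) -> float:
    """Total growth factor after `years` at a constant yearly rate."""
    return (1.0 + rate_per_year) ** years


def doubling_time(rate_per_year: float) -> float:
    """Years needed to double at a constant yearly rate."""
    return math.log(2.0) / math.log(1.0 + rate_per_year)


decade_factor = compound_growth(0.40, 10)
years_to_double = doubling_time(0.40)

print(f"Growth over 10 years at 40% per year: {decade_factor:.1f}x")
print(f"Doubling time: {years_to_double:.2f} years")
```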
Moore's Law
The complexity for minimum component costs
has increased at a rate of roughly a factor of
two per year ... Certainly over the short term
this rate can be expected to continue, if not to
increase. Over the longer term, the rate of
increase is a bit more uncertain, although
there is no reason to believe it will not remain
nearly constant for at least 10 years. That
means by 1975, the number of components
per integrated circuit for minimum cost will be
65,000. I believe that such a large circuit can
be built on a single wafer.
Gordon Moore's original statement.
"Cramming more components onto integrated
circuits", Electronics Magazine 19 April 1965
In layman's terms: the number of components on integrated circuits will roughly double every
18 months. With that, the complexity (effort) and the headcount should increase proportionally
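The two doubling periods mentioned on this slide (one year in Moore's original statement, 18 months in the popular version) lead to very different totals over ten years. The sketch below works this out and also derives the 1965 starting point implied by the 65,000 figure. Names are illustrative, and the starting count is computed from the quote, not taken from the paper.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` when the quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


YEARS = 10                      # 1965 to 1975
TARGET_1975 = 65_000            # components per circuit quoted by Moore

yearly_doubling = growth_factor(YEARS, 1.0)
eighteen_month_doubling = growth_factor(YEARS, 1.5)
implied_1965_count = TARGET_1975 / yearly_doubling

print(f"Doubling every year:      {yearly_doubling:.0f}x in {YEARS} years")
print(f"Doubling every 18 months: {eighteen_month_doubling:.0f}x in {YEARS} years")
print(f"Implied 1965 component count: about {implied_1965_count:.0f}")
```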
Architectural Trends
Designed for performance
Higher Frequency == Higher Performance ?
Memory vs. Processor
Architectural Trends: Hide latencies at all costs!!!
Overlap computation with memory accesses [DMA]
Bring frequently used data closer to the processor [memory hierarchies]
Multithreaded execution and sharing of resources [SMT and
HT technologies, MTA]
Give more chip real estate to speculative execution [branch prediction
and prefetching]
Power Problem? Go Multicore!!!!
One core: takes N time to finish a problem of size M using T
amount of power
Two cores: takes N/2 + 2X time to finish a problem of size M using
T/2 amount of power per unit
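The comparison above can be turned into a small model. This is a sketch under stated assumptions, not the slide's own derivation: X is read here as an overhead term for splitting the work, energy is taken as power multiplied by time, and the sample values for N, T and X are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    time: float          # elapsed time to finish the problem
    total_power: float   # power drawn by all units together

    @property
    def energy(self) -> float:
        """Energy as power multiplied by time (an assumption of this sketch)."""
        return self.time * self.total_power


def single_core(n: float, t: float) -> Run:
    """One unit: N time at T power."""
    return Run(time=n, total_power=t)


def dual_core(n: float, t: float, x: float) -> Run:
    """Two units: N/2 + 2X time, each unit drawing T/2 power."""
    return Run(time=n / 2 + 2 * x, total_power=2 * (t / 2))


# Hypothetical sample values: N = 100, T = 10, X = 5 (X assumed to be overhead).
one = single_core(100.0, 10.0)
two = dual_core(100.0, 10.0, 5.0)
speedup = one.time / two.time

print(f"Single core: time={one.time}, power={one.total_power}, energy={one.energy}")
print(f"Dual core:   time={two.time}, power={two.total_power}, energy={two.energy}")
print(f"Speedup: {speedup:.2f}x")
```

With these sample values the dual-core run draws the same total power but finishes sooner, so it uses less energy. The advantage shrinks as X grows relative to N.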
Technology Progress Overview
Processor speeds have become much faster (around 1000x)
Memory (RAM) speeds are increasing too, but at a slower rate
(around 10x)
But memory (RAM) capacity has grown even
faster than processor speed (around 1,000,000x)
Computation is almost free, but bandwidth is very expensive
The Pentium Chip
Intel Pentium 4
Nine Years and Millions of Dollars Later
Next Gen
The Cell Chip Layout
Many of them, simpler and cheaper!!!
The Dawn of Parallelism
• Parallel architectures are becoming more
attractive
• Milestone: the introduction of Pentium D (2005)
and Centrino Duo (2006)
• Future Projects: IBM PERCS project, Cray
Eldorado, Sun Hero, IBM Cell project, etc ...
• All the factors listed contributed to this “epiphany”
in computing technology.
• Parallelism can be exploited at many levels in
many ways
The World's Fastest
Japan Dominance
[Figure: Top supercomputers by year, 1993 to 1996, in GFLOPS.
Numerical Wind Tunnel: 192 GFLOPS. CP PACS: 368 GFLOPS.]
The World's Fastest
USA Takes the Lead
[Figure: Top supercomputers by year, 1993 to 2001, in GFLOPS.
ASCI Red: 1.3 TFLOPS. ASCI White SP Power3 375 MHz: 7.3 TFLOPS.]
The World's Fastest
Japan Second Wind
[Figure: Top supercomputers by year, 1993 to 2003, in GFLOPS.
Earth Simulator: 35 TFLOPS.]
The World's Fastest
and again...
[Figure: Top supercomputers by year, 1993 to 2006, in GFLOPS.
BlueGene L Beta: 70 TFLOPS. BlueGene L eServer Solution: 280 TFLOPS.]
The World's Fastest
BlueGeneL eServer Solution
• 65536 dual-processor nodes arranged in a 32 x 32 x 64 3D
torus network.
• Global tree structure for fast reduction and
broadcast operations over all nodes
• One I/O node per 64 nodes
– Inside a group of 64: tree structure connections between the I/O
node and the computation nodes, with an aggregate bandwidth
of 2.1 GB/s
– Across groups of 64: torus-like connections
• Total Memory: 32 TeraBytes
• Total Power Consumption: 1.5 MegaWatts
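The torus figures above can be checked with a short sketch. It confirms that 32 x 32 x 64 gives 65536 nodes, derives the I/O node count from the one-per-64 ratio, and shows how wraparound gives every node six neighbors. The coordinate scheme and function names are illustrative assumptions, not the machine's actual addressing.

```python
from math import prod
from typing import List, Tuple

Coord = Tuple[int, int, int]

DIMS: Coord = (32, 32, 64)      # torus dimensions from the slide
NODES_PER_IO_NODE = 64          # one I/O node per 64 nodes


def node_count(dims: Coord) -> int:
    """Total number of nodes in the torus."""
    return prod(dims)


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Neighbors one step away along each axis, wrapping at the edges."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (coord[axis] + step) % size  # wraparound link
            neighbors.append(tuple(moved))
    return neighbors


total_nodes = node_count(DIMS)
io_nodes = total_nodes // NODES_PER_IO_NODE
corner_neighbors = torus_neighbors((0, 0, 0), DIMS)

print(f"Nodes: {total_nodes}, I/O nodes: {io_nodes}")
print(f"Neighbors of (0, 0, 0): {corner_neighbors}")
```

Because of the wraparound, even a corner node such as (0, 0, 0) links directly to the far side of each dimension, for example (31, 0, 0) and (0, 0, 63).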
The Next Step
• So what is next?
• Multicore, System on a chip, PIM, etc
– Simpler, colder, cheaper...
• Intel Pentium D and Centrino Duo
• AMD Opteron
• The DARPA HPCS Project
• IBM, Cray and SUN Multicore chips: CELL,
Cyclops, BlueGene
• Alternatives: Clearspeed [Programmable
Co-Processors], etc...