CDA 3101 – Fall 2013
Introduction to Computer Organization
Multicore / Multiprocessor Architectures
22 November 2013
Multicore Architectures
Introduction – What are Multicores?
Why Multicores?
Power and Performance Perspectives
Multiprocessor Architectures
Conclusion
How to Reduce Power Consumption
Multicore
  One core with a frequency of 2 GHz
  Two cores with a frequency of 1 GHz (each)
  Same performance
  Two 1 GHz cores require half the power/energy
    – Power ∝ freq²
    – A 1 GHz core needs one-fourth the power of a 2 GHz core (see the short derivation below)
New challenges – Performance
  How to utilize the cores
  It is difficult to find enough parallelism in programs to keep all these cores busy.
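A short worked version of the power argument above, assuming the slide's model that dynamic power grows with the square of frequency:

```latex
P \propto f^2
\quad\Rightarrow\quad
P_{1\,\text{GHz}} = \left(\tfrac{1}{2}\right)^2 P_{2\,\text{GHz}} = \tfrac{1}{4}\, P_{2\,\text{GHz}}
\quad\Rightarrow\quad
2 \times P_{1\,\text{GHz}} = \tfrac{1}{2}\, P_{2\,\text{GHz}}
```

So two 1 GHz cores deliver comparable aggregate throughput at roughly half the power of one 2 GHz core, under this model.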
Reducing Energy Consumption
  Both processors are running the same multimedia application
    Pentium: max temp = 105.5 °C
    Crusoe: max temp = 48.2 °C  [www.transmeta.com]
  Infrared cameras (FLIR) can be used to detect the thermal distribution
Introduction
  Never-ending story…
    Complex applications
    Faster computation
  How far did we go with uniprocessors?
  Parallel processors now play a major role
    Logical way to improve performance: connect multiple microprocessors
    Not much left with ILP exploitation
    Server and embedded software have parallelism
  Multiprocessor architectures will become increasingly attractive
    Due to the slowdown in advances of uniprocessors
Levels of Parallelism
  Bit-level parallelism: 1970 to ~1985
    4-bit, 8-bit, 16-bit, 32-bit microprocessors
  Instruction-level parallelism: ~1985 to today
    Pipelining
    Superscalar
    VLIW
    Out-of-order execution / dynamic instruction scheduling
  Process-level or thread-level parallelism
    Servers are parallel
    Desktop dual-processor PCs
    Multicore architectures (CPUs, GPUs)
Taxonomy of Parallel Architectures (Flynn Classification)
  SISD (Single Instruction, Single Data)
    Uniprocessors
  MISD (Multiple Instruction, Single Data)
    Multiple processors on a single data stream
    No commercial prototypes; can be thought of as successive refinement of a given set of data by multiple processors (units)
  SIMD (Single Instruction, Multiple Data)
    Examples: Illiac-IV, CM-2
    Simple programming model, low overhead, and flexibility
    All custom integrated circuits
  MIMD (Multiple Instruction, Multiple Data)
    Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    Flexible, but difficult to program – no unifying model of parallelism
    Uses off-the-shelf microprocessors
    MIMD in practice: designs with <= 128 processors
MIMD
  Two types
    Centralized shared-memory multiprocessors
    Distributed-memory multiprocessors
  Exploits thread-level parallelism
    The program should have at least n threads or processes for a MIMD machine with n processors
    Threads can be of different types
      Independent programs
      Parallel iterations of a loop (extracted by the compiler) – see the thread sketch below
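A minimal host-side sketch of the second thread type, the parallel loop (assumptions not from the lecture: plain C++ threads in a CUDA/C++ translation unit, a hypothetical vector-add loop, and a 4-processor machine):

```cuda
// Splits the iterations of a loop across n worker threads,
// one per processor of an n-processor MIMD machine.
// Compiles with nvcc or any C++11 compiler (host code only).
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int n_threads = 4;           // assume a 4-processor machine
    const int N = 1000000;
    std::vector<double> a(N, 1.0), b(N, 2.0), c(N);

    // Each thread executes a disjoint subset of the loop iterations.
    auto worker = [&](int t) {
        for (int i = t; i < N; i += n_threads)
            c[i] = a[i] + b[i];
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t) pool.emplace_back(worker, t);
    for (auto &th : pool) th.join();

    std::printf("c[0] = %f\n", c[0]);  // expect 3.0
    return 0;
}
```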
Centralized Shared-Memory Multiprocessor
  Small number of processors share a centralized memory
    Use multiple buses or switches
    Multiple memory banks
  Main memory has a symmetric relationship to all processors and uniform access time from any processor
    SMP: symmetric shared-memory multiprocessor
    UMA: uniform memory access architecture
  Increases in processor performance and memory bandwidth requirements make the centralized-memory paradigm less attractive
Distributed-Memory Multiprocessors
  Distributing memory has two benefits
    Cost-effective way to scale memory bandwidth
    Reduces local memory access time
  Communicating data between processors is more complex and has higher latency
  Two approaches for data communication
    Shared address space (not centralized memory)
      The same physical address refers to the same memory location
      DSM: distributed shared-memory architectures
      NUMA: non-uniform memory access, since the access time depends on the location of the data
    Logically disjoint address spaces – multicomputers
Small-Scale Shared Memory
  Caches serve to:
    Increase bandwidth versus bus/memory
    Reduce latency of access
    Valuable for both private data and shared data
  What about cache consistency?

  Time  Event                    $A    $B    X (memory)
   0                                             1
   1    CPU A reads X             1              1
   2    CPU B reads X             1     1        1
   3    CPU A stores 0 into X     0     1        0
Example: Cache Coherence Problem
[Figure: three processors P1, P2, and P3, each with a private cache, connected to main memory and I/O devices. Memory initially holds u = 5. P1 and P3 read u into their caches (events 1 and 2), P3 then writes u = 7 in its cache (event 3), and P1 and P2 subsequently read u (events 4 and 5).]
  Processors see different values for u after event 3
  With write-back caches, the value written back to memory depends on which cache flushes or writes back its value
    Processes accessing main memory may see a very stale value
  Unacceptable for programming, and it's frequent!
4 C's: Sources of Cache Misses
  Compulsory misses (aka cold-start misses)
    First access to a block
  Capacity misses
    Due to finite cache size
    A replaced block is later accessed again
  Conflict misses (aka collision misses)
    In a non-fully-associative cache
    Due to competition for entries in a set
    Would not occur in a fully associative cache of the same total size
  Coherence misses
    Caused by invalidations when other processors write to shared data (unique to multiprocessors)
Graphics Processing Units (GPUs)
  Moore's Law will come to an end
    Many complicated solutions
    Simple solution: SPATIAL PARALLELISM
  SIMD model (single instruction, multiple data streams)
  GPUs have a SIMD grid with a local & shared memory model
Graphics Processing Units (GPUs)
  Nvidia Fermi GPU: 3 GB DRAM, 512 cores
  CUDA architecture
    - Thread
    - Thread Block
    - Grid of Thread Blocks
    - Intelligent CUDA compiler
GPUs – Nvidia CUDA Hierarchy
  Map each process to a thread
  Group threads into blocks
  Group blocks into grids for efficient memory access
  Also, memory-coalescing operations for faster data transfer (see the kernel sketch below)
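A minimal sketch of this hierarchy, assuming a hypothetical vector-add kernel (the kernel name, sizes, and use of unified memory are illustrative, not from the lecture): each thread handles one element, threads are grouped into blocks, and blocks into a grid; adjacent threads touch adjacent addresses, so the global-memory accesses coalesce.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles one element of the vectors.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // neighboring threads access neighboring addresses
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);                   // unified memory, visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                        // threads grouped into a block
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // blocks grouped into a grid
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```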
Nvidia Tesla 20xx GPU Board
GPU Problems and Solutions
  GPUs are designed for graphics rendering
  GPUs are not designed for general-purpose computing! (no unifying model of parallelism)
  Memory hierarchy:
    Local memory – fast, small (MBs)
    Shared memory – slower, larger
    Global memory – slow, GBytes
  How to circumvent the data-movement cost? (see the shared-memory sketch below)
    Clever hand coding: costly, application-specific
    Automatic coding: sub-optimal, needs software support
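One common way to blunt the data-movement cost mentioned above, sketched here under assumptions not taken from the lecture (a hypothetical per-block sum kernel, launched with 256 threads per block): stage data once in fast on-chip shared memory and reuse it there rather than re-reading global memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Per-block sum: each element is read from slow global memory exactly once,
// then all further work happens in fast on-chip shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                      // on-chip, shared by the block (launch with 256 threads)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // one coalesced global read
    __syncthreads();

    // Tree reduction entirely in shared memory: no further global traffic
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                   // one global write per block
}

int main() {
    const int n = 1 << 16, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum of first block = %f\n", out[0]);     // expect 256.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```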
Advantages and Disadvantages
  GPUs provide fast parallel computing
  GPUs work best for parallel solutions
    Sequential programs can actually run slower
  Amdahl's Law describes the speedup on N processors (worked example below):
    Speedup = 1 / (S + P/N)
      P is the fraction of the program that is parallel
      S is the fraction of the program that is sequential (S = 1 - P)
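A quick worked instance of the formula above (the numbers are illustrative, not from the lecture): a program that is 90% parallel (P = 0.9, S = 0.1) running on N = 8 processors speeds up by less than a factor of 5.

```latex
\text{Speedup} = \frac{1}{S + P/N}
              = \frac{1}{0.1 + 0.9/8}
              = \frac{1}{0.2125}
              \approx 4.7
```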
Multicore CPUs
  Intel Nehalem
    Servers, HPC arrays
    45 nm circuit technology
  Intel Xeon (dual Nehalem)
    2001-present
    2 to 8 cores
    Workstations
  Multiple cores in laptops
    Heat dissipation?
Intel Multicore CPU Performance
  [Performance chart, with a single core as the baseline]
Conclusions
  Parallel machines need parallel solutions
    Inherently sequential programs don't benefit much from parallelism
  Two main types of parallel architectures
    SIMD – single instruction, multiple data streams
    MIMD – multiple instruction, multiple data streams
  Modern parallel architectures (multicores)
    GPUs – exploit SIMD parallelism for general-purpose parallel computing solutions
    CPUs – multicore CPUs are more amenable to MIMD parallel applications