CDA 3101 - Fall 2013
Introduction to Computer Organization
Multicore / Multiprocessor Architectures
22 November 2013
Multicore Architectures
• Introduction - What are Multicores?
• Why Multicores?
  - Power and Performance Perspectives
• Multiprocessor Architectures
• Conclusion
How to Reduce Power Consumption
Multicore:
• One core at 2 GHz vs. two cores at 1 GHz each
  - Same performance
  - Two 1 GHz cores require half the power/energy
    - Power ∝ freq²
    - A 1 GHz core needs one-fourth the power of a 2 GHz core.
New challenges - Performance:
• How to utilize the cores
• It is difficult to find enough parallelism in programs to keep all these cores busy.
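A quick worked check of the halving claim above, under the slide's simplified model in which power grows with the square of frequency:

```latex
P(f) \propto f^{2}
\;\Rightarrow\;
\frac{P_{1\,\mathrm{GHz}}}{P_{2\,\mathrm{GHz}}} = \left(\frac{1}{2}\right)^{2} = \frac{1}{4},
\qquad
\frac{2\,P_{1\,\mathrm{GHz}}}{P_{2\,\mathrm{GHz}}} = \frac{2}{4} = \frac{1}{2}.
```

So each 1 GHz core draws one-fourth of the 2 GHz core's power, and the pair together draws half, while (ideally) matching its throughput.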
Reducing Energy Consumption
• Both processors are running the same multimedia application.
  - Pentium max temp = 105.5 °C
  - Crusoe max temp = 48.2 °C  [www.transmeta.com]
• Infrared cameras (FLIR) can be used to detect the thermal distribution.
Introduction
• Never-ending story ...
  - Complex applications
  - Faster computation
  - How far did we get with uniprocessors?
• Parallel processors now play a major role
  - Logical way to improve performance: connect multiple microprocessors
  - Not much left to gain from ILP exploitation
  - Server and embedded software have parallelism
• Multiprocessor architectures will become increasingly attractive
  - Due to the slowdown in advances of uniprocessors
Levels of Parallelism
• Bit-level parallelism: 1970 to ~1985
  - 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism: ~1985 to today
  - Pipelining
  - Superscalar
  - VLIW
  - Out-of-order execution / dynamic instruction scheduling
• Process-level or thread-level parallelism
  - Servers are parallel
  - Desktop dual-processor PCs
  - Multicore architectures (CPUs, GPUs)
Taxonomy of Parallel Architectures (Flynn Classification)
• SISD (Single Instruction, Single Data)
  - Uniprocessors
• MISD (Multiple Instruction, Single Data)
  - Multiple processors on a single data stream
  - No commercial prototypes; can be thought of as successive refinement of a given set of data by multiple processors (units)
• SIMD (Single Instruction, Multiple Data)
  - Examples: Illiac-IV, CM-2
  - Simple programming model, low overhead, and flexibility
  - All custom integrated circuits
• MIMD (Multiple Instruction, Multiple Data)
  - Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
  - Flexible, but difficult to program - no unifying model of parallelism
  - Uses off-the-shelf microprocessors
  - MIMD in practice: designs with <= 128 processors
MIMD
• Two types
  - Centralized shared-memory multiprocessors
  - Distributed-memory multiprocessors
• Exploits thread-level parallelism
  - The program should have at least n threads or processes for a MIMD machine with n processors (see the sketch below)
  - Threads can be of different types
    - Independent programs
    - Parallel iterations of a loop (extracted by the compiler)
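A minimal host-side C++ sketch of that idea (not from the slides): the iterations of a single loop are split into contiguous chunks, one chunk per hardware thread, so an n-processor MIMD machine has n runnable threads.

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Split the iterations of one loop (summing an array) across n worker
// threads, roughly one per processor, so every core has work to do.
int main()
{
    const std::size_t N = 1 << 20;
    std::vector<double> data(N, 1.0);

    unsigned n = std::thread::hardware_concurrency();  // ~ number of cores
    if (n == 0) n = 4;                                  // fallback guess
    std::vector<double> partial(n, 0.0);                // one slot per thread
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            // Each thread processes a contiguous chunk of iterations.
            std::size_t lo = N * t / n, hi = N * (t + 1) / n;
            for (std::size_t i = lo; i < hi; ++i)
                partial[t] += data[i];
        });
    }
    for (auto& w : workers) w.join();

    double sum = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::printf("sum = %.1f (expected %zu)\n", sum, N);
    return 0;
}
```

An auto-parallelizing compiler or a library such as OpenMP would normally produce this partitioning automatically; the explicit std::thread version just makes the one-chunk-per-core structure visible.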
Centralized Shared-Memory Multiprocessor
• A small number of processors share a centralized memory
  - Use multiple buses or switches
  - Multiple memory banks
• Main memory has a symmetric relationship to all processors and uniform access time from any processor
  - SMP: symmetric shared-memory multiprocessor
  - UMA: uniform memory access architecture
• Increases in processor performance and memory bandwidth requirements make the centralized-memory paradigm less attractive
Distributed-Memory Multiprocessors
• Distributing memory has two benefits
  - Cost-effective way to scale memory bandwidth
  - Reduces local memory access time
• Communicating data between processors is complex and has higher latency
• Two approaches for data communication
  - Shared address space (not centralized memory)
    - The same physical address refers to the same memory location
    - DSM: distributed shared-memory architectures
    - NUMA: non-uniform memory access, since the access time depends on the location of the data
  - Logically disjoint address spaces - multicomputers
Small-Scale Shared Memory
• Caches serve to:
  - Increase bandwidth versus the bus/memory
  - Reduce access latency
• Valuable for both private data and shared data
• What about cache consistency? (see the table below)
Cache coherence example for memory location X (assuming write-through caches, so the store also updates memory):

Time  Event                    Cache A   Cache B   Memory X
 0                                                    1
 1    CPU A reads X               1                   1
 2    CPU B reads X               1         1         1
 3    CPU A stores 0 into X       0         1         0
Example: Cache Coherence Problem
[Figure: Three processors P1, P2, and P3, each with a private cache, share a bus to memory and I/O devices. Memory initially holds u = 5. P1 and P3 read u into their caches (events 1 and 2), P3 then writes u = 7 (event 3), and the later reads of u by P1 and P2 (events 4 and 5) can still see the stale value 5.]
• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on which cache flushes or writes back the value
  - Processes accessing main memory may see a very stale value
• Unacceptable for programming, and it's frequent!
4 C's: Sources of Cache Misses
• Compulsory misses (aka cold-start misses)
  - First access to a block
• Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
• Conflict misses (aka collision misses)
  - In a non-fully-associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size
• Coherence misses
  - Caused by keeping caches coherent: a block invalidated by another processor's write must be re-fetched
Graphics Processing Units (GPUs)
• Moore's Law will come to an end
  - Many complicated solutions
  - Simple solution: SPATIAL PARALLELISM
• SIMD model (single instruction, multiple data streams)
• GPUs have a SIMD grid with a local & shared memory model
Graphics Processing Units (GPUs)
• Nvidia Fermi GPU: 3 GB DRAM, 512 cores
• CUDA architecture
  - Thread
  - Thread block
  - Grid of thread blocks
  - Intelligent CUDA compiler
GPUs - Nvidia CUDA Hierarchy
• Map each process to a thread
• Group threads into blocks
• Group blocks into grids for efficient memory access
• Also, memory-coalescing operations for faster data transfer (see the sketch below)
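A minimal CUDA sketch (not from the slides) of the thread / thread-block / grid hierarchy: the kernel name scaleArray, the array length, and the block size of 256 are illustrative choices. Each thread handles one array element, and because consecutive threads in a warp access consecutive addresses, the global-memory loads and stores coalesce into a few wide transactions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per element: blockIdx, blockDim, and threadIdx give the
// global index into the array.
__global__ void scaleArray(const float* in, float* out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
    if (i < n)                                      // guard the last block
        out[i] = alpha * in[i];
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> h_in(n, 1.0f), h_out(n);

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                                // thread block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // grid of blocks
    scaleArray<<<blocks, threadsPerBlock>>>(d_in, d_out, 2.0f, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %.1f\n", h_out[0]);   // expect 2.0
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The <<<blocks, threadsPerBlock>>> launch makes the grid-of-blocks / block-of-threads structure explicit; choosing a block size that is a multiple of the warp size (32) keeps the coalescing behavior predictable.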
Nvidia Tesla 20xx GPU Board
[Figure: Nvidia Tesla 20xx GPU board]
GPU Problems and Solutions
• GPUs are designed for graphics rendering
• GPUs are not designed for general-purpose computing! (no unifying model of parallelism)
• Memory hierarchy:
  - Local memory: fast, small (MBs)
  - Shared memory: slower, larger
  - Global memory: slow, GBytes
• How to circumvent the data-movement cost?
  - Clever hand coding → costly, application-specific
  - Automatic coding → sub-optimal, needs software support
Advantages and Disadvantages
• GPUs provide fast parallel computing
• GPUs work best for parallel solutions
  - Sequential programs can actually run slower
  - Amdahl's Law describes the speedup (worked example below):

      Speedup = 1 / (S + P/N)

    where P is the fraction of the program that is parallel, S = 1 - P is the fraction that is sequential, and N is the number of processors.
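A worked instance with illustrative numbers (not from the slides): take P = 0.9, S = 0.1, and N = 8 cores.

```latex
\text{Speedup} \;=\; \frac{1}{S + P/N}
             \;=\; \frac{1}{0.1 + 0.9/8}
             \;=\; \frac{1}{0.2125}
             \;\approx\; 4.7
```

Even with unlimited cores, the speedup is bounded by 1/S = 10, which is why inherently sequential programs gain little from parallel hardware.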
Multicore CPUs
• Intel Nehalem
  - Servers, HPC arrays
  - 45 nm circuit technology
• Intel Xeon
  - 2001 to present
  - 2 to 8 cores
  - Multiple cores for workstations and laptops
  - Heat dissipation?
[Figure: dual-Nehalem system]
Intel Multicore CPU Performance
[Figure: performance of Intel multicore CPUs versus a single core]
Conclusions
• Parallel machines → parallel solutions
  - Inherently sequential programs don't benefit much from parallelism
• Two main types of parallel architectures
  - SIMD: single instruction, multiple data streams
  - MIMD: multiple instruction, multiple data streams
• Modern parallel architectures (multicores)
  - GPUs exploit SIMD parallelism for general-purpose parallel computing solutions
  - Multicore CPUs are more amenable to MIMD parallel applications