18-742 Spring 2011
Parallel Computer Architecture
Lecture 4: Multi-core
Prof. Onur Mutlu
Carnegie Mellon University
Research Project
• Project proposal due: Jan 31
• Project topics
• Does everyone have a topic?
• Does everyone have a partner?
• Does anyone have too many partners?
2
Last Lecture
• Programming model vs. architecture
• Message Passing vs. Shared Memory
• Data Parallel
• Dataflow
• Generic Parallel Machine
3
Readings for Today
• Required:
  – Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008.
  – Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005.
  – Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.
  – Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007.
• Recommended:
  – Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
  – Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” ISCA 2000.
  – Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005.
  – Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
4
Reviews
• Due Today (Jan 21)
  – Seitz, “The Cosmic Cube,” CACM 1985.
  – Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009.
• Due Next Tuesday (Jan 25)
  – Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.
  – Kelm et al., “Cohesion: a hybrid memory model for accelerators,” ISCA 2010.
• Due Next Friday (Jan 28)
  – Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010.
5
Recap of Last Lecture
6
Review: Shared Memory vs. Message Passing
• Loosely coupled multiprocessors
  – No shared global memory address space
  – Multicomputer network
    • Network-based multiprocessors
  – Usually programmed via message passing
    • Explicit calls (send, receive) for communication
• Tightly coupled multiprocessors
  – Shared global memory address space
  – Traditional multiprocessing: symmetric multiprocessing (SMP)
  – Existing multi-core processors, multithreaded processors
  – Programming model similar to uniprocessors (i.e., multitasking uniprocessor) except
    • Operations on shared data require synchronization (see the sketch after this slide)
7
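Below is a minimal sketch (not from the slides) of the tightly coupled programming model: threads communicate through ordinary loads and stores to a shared address space, and the update to shared data is synchronized with a mutex. The names and counts (worker, NTHREADS, ITERS) are arbitrary illustrative choices; in a loosely coupled, message-passing system the same communication would instead be written with explicit send/receive calls.

/* Sketch of the shared-memory model: threads communicate via loads/stores
   to a shared variable; the shared update is protected by a mutex.
   Illustrative only; compile with: cc -pthread file.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4          /* arbitrary illustrative values */
#define ITERS    1000000

static long shared_counter = 0;                        /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);   /* synchronization around shared update */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("counter = %ld\n", shared_counter);  /* NTHREADS * ITERS */
    return 0;
}

Without the mutex the increments would race and the final count would be wrong, which is the point of the last bullet above: shared data needs synchronization even though communication itself is implicit.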
Programming Models vs. Architectures
• Five major models
  – Shared memory
  – Message passing
  – Data parallel (SIMD)
  – Dataflow
  – Systolic
• Hybrid models?
8
Scalability, Convergence, and Some Terminology
9
Scaling Shared Memory Architectures
10
Interconnection Schemes for Shared Memory
• Scalability dependent on interconnect
11
UMA/UCA: Uniform Memory or Cache Access
• All processors have the same uncontended latency to memory
• Latencies get worse as system grows
• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect
[Figure: processors connected through an interconnection network to main memory; long latency, with contention in the network and in the memory banks]
Uniform Memory/Cache Access
+ Data placement unimportant/less important (easier to optimize code and make use of available memory space)
- Scaling the system increases latencies
- Contention could restrict bandwidth and increase latency
Example SMP
• Quad-pack Intel Pentium Pro
14
How to Scale Shared Memory Machines?
• Two general approaches
  – Maintain UMA
    • Provide a scalable interconnect to memory
    • Downside: Every memory access incurs the round-trip network latency
  – Interconnect complete processors with local memory
    • NUMA (Non-uniform memory access)
      – Local memory faster than remote memory
    • Still needs a scalable interconnect for accessing remote memory
      – Not on the critical path of local memory access
15
NUMA/NUCA: NonUniform Memory/Cache Access
• Shared memory as local versus remote memory
+ Low latency to local memory
- Much higher latency to remote memories
+ Bandwidth to local memory may be higher
- Performance very sensitive to data placement (see the sketch after this slide)
[Figure: processors, each with local memory, connected by an interconnection network; short latency to local memory, long latency and network contention for remote memory]
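To make the data-placement sensitivity concrete, here is a minimal sketch assuming a Linux NUMA machine with libnuma installed (link with -lnuma); the allocation size and the choice of node are arbitrary illustrative values, not from the lecture.

/* Sketch: local versus remote memory allocation on a NUMA system,
   using the libnuma API (Linux). Illustrative only. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t bytes = 64UL * 1024 * 1024;   /* arbitrary 64 MB region */

    /* Placed on the node the calling thread runs on: short latency. */
    void *local = numa_alloc_local(bytes);

    /* Pinned to the highest-numbered node, typically remote for a thread
       on node 0: long latency over the interconnect. On a single-node
       machine numa_max_node() is 0 and both regions end up local. */
    void *remote = numa_alloc_onnode(bytes, numa_max_node());

    printf("local = %p, remote = %p (nodes 0..%d)\n",
           local, remote, numa_max_node());

    numa_free(local, bytes);
    numa_free(remote, bytes);
    return 0;
}

Accesses to the second region from a core on another node cross the interconnection network and pay the remote-memory latency shown in the figure, which is why placement matters far more here than on a UMA machine.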
Example NUMA Machines
• Sun Enterprise Server
• Cray T3E
17
Convergence of Parallel Architectures
• Scalable shared memory architecture is similar to scalable message passing architecture
• Main difference: is remote memory accessible with loads/stores?
18
Historical Evolution: 1960s & 70s
• Early MPs
  – Mainframes
  – Small number of processors
  – Crossbar interconnect
  – UMA
[Figure: processors connected to memory banks through a crossbar]
Historical Evolution: 1980s
• Bus-Based MPs
  – Enabler: processor-on-a-board
  – Economical scaling
  – Precursor of today’s SMPs
  – UMA
[Figure: processors with private caches connected by a shared bus to memory modules]
Historical Evolution: Late 80s, mid 90s
• Large Scale MPs (Massively Parallel Processors)
– multi-dimensional interconnects
– each node a computer (proc + cache + memory)
– both shared memory and message passing versions
– NUMA
– still used for “supercomputing”
Historical Evolution: Current
• Chip multiprocessors (multi-core)
• Small to Mid-Scale multi-socket CMPs
• Clusters/Datacenters
  – Use high performance LAN to connect SMP blades, racks
• Driven by economics and cost
  – One module type: processor + caches + memory
  – Smaller systems => higher volumes
  – Off-the-shelf components
• Driven by applications
  – Many more throughput applications (web servers)
  – … than parallel applications (weather prediction)
  – Cloud computing
Historical Evolution: Future
• Cluster/datacenter on a chip?
• Heterogeneous multi-core?
• Bounce back to small-scale multi-core?
• ???
23
Multi-Core Processors
24
Moore’s Law
Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.
25
Multi-Core
• Idea: Put multiple processors on the same die.
• Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area
• What else could you do with the die area you dedicate to multiple processors?
  – Have a bigger, more powerful core
  – Have larger caches in the memory hierarchy
  – Simultaneous multithreading
  – Integrate platform components on chip (e.g., network interface, memory controllers)
26
Why Multi-Core?
• Alternative: Bigger, more powerful single core
  – Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to programmer, compiler
- Very difficult to design (scalable algorithms for improving single-thread performance remain elusive)
- Power hungry – many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this remain elusive)
27
Large Superscalar vs. Multi-Core
• Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
28
Multi-Core vs. Large Superscalar
• Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications
• Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming; see the Amdahl’s Law sketch after this slide)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
29
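The first disadvantage (the need for parallel software) is quantified by Amdahl’s Law from the required readings (Amdahl, AFIPS 1967; Hill and Marty, IEEE Computer 2008): with a parallelizable fraction f of the work and n cores, speedup = 1 / ((1 - f) + f / n). A tiny sketch of the arithmetic follows; the values of f and n are arbitrary examples.

/* Amdahl's Law: speedup of a program on n cores when a fraction f of its
   work is parallelizable (Amdahl 1967; Hill and Marty 2008).
   The f and n values below are arbitrary illustrative choices. */
#include <stdio.h>

static double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    const double fractions[] = { 0.5, 0.9, 0.99 };
    const int    cores[]     = { 4, 16, 64 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("f = %.2f, n = %2d -> speedup = %6.2f\n",
                   fractions[i], cores[j],
                   amdahl_speedup(fractions[i], cores[j]));
    return 0;
}

Even with f = 0.9, 64 cores yield only about an 8.8x speedup; the serial fraction dominates, which is why multi-core performance depends on parallel programming and on shrinking serial bottlenecks (the topic of the Suleman et al. ASPLOS 2009 reading).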
Large Superscalar vs. Multi-Core
• Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
• Technology push
  – Instruction issue queue size limits the cycle time of the superscalar, OoO processor → diminishing performance
    • Quadratic increase in complexity with issue width
  – Large, multi-ported register files to support large instruction windows and issue widths → reduced frequency or longer RF access, diminishing performance
• Application pull
  – Integer applications: little parallelism?
  – FP applications: abundant loop-level parallelism
  – Others (transaction proc., multiprogramming): CMP better fit
30
Why Multi-Core?
• Alternative: Bigger caches
+ Improves single-thread performance transparently to programmer, compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate memory hierarchy
31
Cache vs. Core
[Figure: number of transistors devoted to cache versus the microprocessor core over time]
32
Why Multi-Core?
• Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files, larger issue width (and associated costs) to have many threads → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing at the pipeline and memory system reduces both single-thread and parallel application performance
33