Hyperthreading
Hyper-Threading, Chip Multiprocessors, and Both
Zoran Jovanovic
To Be Tackled in Multithreading
Review of Threading Algorithms
Hyper-Threading Concepts
Hyper-Threading Architecture
Advantages/Disadvantages
Threading Algorithms
Time-slicing
  The processor switches between threads at fixed time intervals.
  High overhead, especially if one of the processes is in a wait state. (Fine grain)
Switch-on-event
  Task switching in case of long pauses.
  While waiting for data from a relatively slow source, CPU resources are given to other processes. (Coarse grain)
Threading Algorithms (cont.)
Multiprocessing
  Distribute the load over many processors.
  Adds extra cost.
Simultaneous multi-threading
  Multiple threads execute on a single processor without switching.
  The basis of Intel’s Hyper-Threading technology.
Hyper-Threading Concept
At any point in time, only part of the processor’s resources is used to execute the code of a thread.
Unused resources can also be loaded, for example, with the parallel execution of another thread/application.
Extremely useful in desktop and server applications where many threads are used.
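As an added illustration of this idea (not part of the original slides), here is a minimal C++ sketch: two unrelated threads with different resource demands run side by side, so that on an SMT/Hyper-Threading core the arithmetic-heavy thread can use execution units that would otherwise sit idle while the memory-bound thread waits on cache misses. The workloads and sizes are arbitrary assumptions; the actual benefit depends on the hardware.

```cpp
// Hypothetical illustration: two threads with different resource demands.
// On an SMT core, the compute-bound thread can fill execution slots left
// idle while the memory-bound thread stalls on cache misses.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<int> big(1 << 24, 1);   // ~64 MB of ints: the traversal mostly misses in cache
    long long sum = 0;                  // written only by the memory-bound thread
    double x = 1.0;                     // written only by the compute-bound thread

    std::thread memory_bound([&] {
        for (std::size_t i = 0; i < big.size(); i += 16)  // roughly one load per cache line
            sum += big[i];
    });
    std::thread compute_bound([&] {
        for (int i = 0; i < 50000000; ++i)                // pure ALU/FPU work
            x = x * 1.0000001 + 0.5;
    });

    memory_bound.join();
    compute_bound.join();
    std::printf("sum = %lld, x = %f\n", sum, x);
    return 0;
}
```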
Quick Recall: Many Resources IDLE!
For an 8-way superscalar.
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.
Slide source: John Kubiatowicz
(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)
Simultaneous Multithreading (SMT)
Example: the new Pentium with “Hyperthreading”
Key idea: exploit ILP across multiple threads!
  i.e., convert thread-level parallelism into more ILP
Exploit the following features of modern processors:
  Multiple functional units
    Modern processors typically have more functional units available than a single thread can utilize.
  Register renaming and dynamic scheduling
    Multiple instructions from independent threads can co-exist and co-execute!
Hyper-Threading Architecture
First used in the Intel Xeon MP processor.
Makes a single physical processor appear as multiple logical processors.
Each logical processor has its own copy of the architecture state.
The logical processors share a single set of physical execution resources.
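A minimal sketch of how this looks from software (an addition, not from the original slides): the standard C++ runtime reports the number of logical processors the OS exposes, which with Hyper-Threading enabled is typically twice the physical core count.

```cpp
// Minimal sketch: query how many logical processors the OS exposes.
#include <cstdio>
#include <thread>

int main() {
    // With Hyper-Threading enabled this is typically 2x the physical core count;
    // the call may return 0 if the value is not computable.
    unsigned logical = std::thread::hardware_concurrency();
    std::printf("logical processors visible to the OS: %u\n", logical);
    return 0;
}
```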
Hyper-Threading Architecture
Operating systems and user programs can schedule processes or threads to logical processors as if they were physical processors in a multiprocessing system.
From an architecture perspective, we have to worry about the logical processors sharing resources:
  caches, execution units, branch predictors, control logic, and buses.
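To make the scheduling point concrete, here is a hedged, Linux-specific sketch (an addition, not from the original slides) that pins one worker thread to each logical processor using pthread_setaffinity_np; other operating systems expose different affinity APIs.

```cpp
// Linux-specific sketch: pin one worker thread to each logical processor.
// Uses glibc's pthread_setaffinity_np (compile with -pthread; g++ defines
// _GNU_SOURCE by default on Linux).
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();  // number of logical processors
    std::vector<std::thread> workers;
    for (unsigned cpu = 0; cpu < n; ++cpu) {
        workers.emplace_back([cpu] {
            std::printf("worker for logical CPU %u\n", cpu);
            // ... thread work ...
        });
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        // Ask the OS to keep this thread on one logical processor.
        pthread_setaffinity_np(workers.back().native_handle(),
                               sizeof(set), &set);
    }
    for (auto& t : workers) t.join();
    return 0;
}
```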
Power5 dataflow ...
Why only two threads? With four, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.
Cost: the Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.
Advantages
The extra architecture adds only about 5% to the total die area.
No performance loss if only one thread is active.
Increased performance with multiple threads.
Better resource utilization.
Disadvantages
To take advantage of hyper-threading performance, serial execution cannot be used.
Threads are non-deterministic and involve extra design effort.
Threads have increased overhead.
Shared resource conflicts.
Multicore
Multiprocessors on a single chip
Basic Shared Memory Architecture
Processors are all connected to a large shared memory.
Where are the caches?
[Figure: processors P1, P2, ..., Pn connected through an interconnect to memory]
• Now take a closer look at structure, costs, limits, programming
What About Caching???
[Figure: processors P1 ... Pn, each with its own cache ($), connected by a bus to memory and I/O devices]
Want high performance for shared memory: use caches!
  Automatic replication of data closer to the processor
  More important for a multiprocessor than for a uniprocessor: latencies are longer
Normal uniprocessor mechanisms are used to access data
  Each processor has its own cache (or multiple caches)
  Place data from memory into the cache
  Write-back cache: don’t send all writes over the bus to memory
Caches reduce average latency
Loads and stores form a very low-overhead communication primitive
Slide source: John Kubiatowicz
Problem: Cache Coherence!
Example Cache Coherence Problem
[Figure: processors P1, P2, and P3, each with a private cache ($), on a shared bus with memory and I/O devices. u is initially 5 in memory. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7 into its cache, (4) P1 reads u, (5) P2 reads u.]
Things to note:
  Processors could see different values for u after event 3.
  With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back the value, and when.
How to fix with a bus: a coherence protocol
  Use the bus to broadcast writes or invalidations.
  Simple protocols rely on the presence of a broadcast medium.
  A bus is not scalable beyond about 64 processors (max): capacity and bandwidth limitations.
Slide source: John Kubiatowicz
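The hardware protocol itself is invisible to software, but the same sharing pattern shows up at the language level. A minimal C++ sketch (an illustration added here, not from the slides): when one thread updates u while others read it, the shared location should be an atomic (or otherwise synchronized) so that the access is well-defined; each reader then sees either the old value 5 or the new value 7, with the cache-coherence protocol propagating the update underneath.

```cpp
// Illustration of the sharing pattern in the example: one thread updates u
// while two others read it. std::atomic makes the concurrent access well-defined;
// the underlying cache-coherence protocol propagates the new value.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> u{5};   // shared location, initially 5 (the memory copy)

int main() {
    std::thread writer([] {
        u.store(7, std::memory_order_release);       // like event 3: P3 writes u = 7
    });
    std::thread reader1([] {
        int v = u.load(std::memory_order_acquire);   // like event 4: sees 5 or 7
        std::printf("reader1 saw u = %d\n", v);
    });
    std::thread reader2([] {
        int v = u.load(std::memory_order_acquire);   // like event 5: sees 5 or 7
        std::printf("reader2 saw u = %d\n", v);
    });
    writer.join();
    reader1.join();
    reader2.join();
    return 0;
}
```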
Limits of Bus-Based Shared Memory
[Figure: processors (PROC), each with a cache, on a shared bus to memory (MEM) and I/O; about 5.2 GB/s of demand per processor at the cache, about 140 MB/s per processor on the bus]
Assume:
  a 1 GHz processor w/o cache
  => 4 GB/s instruction BW per processor (32-bit instructions)
  => 1.2 GB/s data BW at 30% load-store
Suppose a 98% instruction hit rate and a 95% data hit rate
  => 80 MB/s instruction BW per processor
  => 60 MB/s data BW per processor
  => 140 MB/s combined BW per processor
Assuming 1 GB/s bus bandwidth
  => about 8 processors will saturate the bus
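The same back-of-the-envelope estimate, written out as a small C++ calculation (an added sketch; all numbers are the slide's assumptions):

```cpp
// Reproduces the slide's estimate of when the shared bus saturates.
#include <cstdio>

int main() {
    const double inst_demand = 4.0e9;       // 1 GHz * 4-byte instructions
    const double data_demand = 1.2e9;       // 30% load-store * 4 bytes at 1 GHz
    const double inst_miss   = 1.0 - 0.98;  // 98% instruction-cache hit rate
    const double data_miss   = 1.0 - 0.95;  // 95% data-cache hit rate
    const double bus_bw      = 1.0e9;       // assumed 1 GB/s bus

    // Traffic that actually reaches the bus, per processor.
    double per_proc = inst_demand * inst_miss + data_demand * data_miss;

    std::printf("per-processor bus traffic: %.0f MB/s\n", per_proc / 1e6);    // ~140 MB/s
    std::printf("processors to saturate the bus: %.1f\n", bus_bw / per_proc); // ~7.1, i.e. about 8
    return 0;
}
```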
Cache Organizations for Multi-cores
• L1 caches are always private to a core
• L2 caches can be private or shared
• Advantages of a shared L2 cache:
  efficient dynamic allocation of space to each core
  data shared by multiple cores is not replicated
  every block has a fixed “home”, hence it is easy to find the latest copy
• Advantages of a private L2 cache:
  quick access to the private L2: good for small working sets
  a private bus to the private L2: less contention
A Reminder: SMT (Simultaneous Multi-Threading)
SMT vs. CMP
A Single Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer 1997.
For the same area (a billion-transistor DRAM area), compare Superscalar (SS), SMT, and CMP.
Superscalar and SMT: very complex
• Wide issue
• Advanced branch prediction
• Register renaming
• Out-of-order (OOO) instruction issue
• Non-blocking data caches
SS and SMT vs. CMP
CPU cores: three main hardware design problems (of SS and SMT):
• Area increases quadratically with core complexity
  • Number of registers ~ O(instruction window size)
  • Register ports ~ O(issue width)
  CMP solves this problem (area roughly linear in issue width)
• Longer cycle times
  • Long wires, many MUXes and crossbars
  • Large buffers, queues, and register files
  Clustering (which decreases ILP) or deep pipelining (branch misprediction penalties)
  CMP allows a small cycle time with little effort: cores are small and fast, but rely on software to schedule work (poor single-thread ILP)
• Complex design and verification
SS and SMT vs. CMP
Memory:
• A 12-issue SS or SMT requires a multiported data cache (4-6 ports)
  • 2 x 128 KByte (2-cycle latency)
• CMP: 16 x 16 KByte (single-cycle latency), but the secondary cache is slower (multiported)
• Shared memory: write-through caches
Performance comparison (SMT vs. CMP)
• Compress (integer app): low ILP and no TLP
• Mpeg-2 (multimedia app): high ILP and TLP, moderate memory requirements (parallelized by hand)
  + SMT utilizes core resources better
  + But CMP has 16 issue slots instead of 12
• Tomcatv (FP application): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler)
  + CMP has large memory bandwidth on the primary cache
  - SMT’s fundamental problem: a unified and slow cache
• Multiprogram: an integer multiprogramming workload, all computation-intensive (low ILP, high PLP)
CMP Motivation
How to utilize the available silicon?
• Speculation (aggressive superscalar)
• Simultaneous multithreading (SMT, Hyperthreading)
• Several processors on a single chip
What is a CMP (Chip MultiProcessor)?
• Several processors (several masters)
• Both shared and distributed memory architectures
• Both homogeneous and heterogeneous processor types
Why?
• Wire delays
• Diminishing returns of uniprocessors
• Very long design and verification times for modern processors
A Single Chip Multiprocessor
L. Hammond et al. (Stanford), IEEE Computer 1997
• TLP and PLP will become widespread in future applications
  • various multimedia applications
  • compilers and OS
This favours CMP:
• Better performance with simple hardware
• Higher clock rates, better memory bandwidth
• Shorter pipelines
SMT has better utilization, but CMP has more resources (no wide-issue logic).
Although CMP is bad when there is no TLP and little ILP (compress), SMT and SS are not much better.
A Reminder: SMT (Simultaneous Multi-Threading) vs. CMP
SMT:
• Pool of execution units (wide machine)
• Several logical processors
• A copy of the architecture state for each
• Threads run concurrently
• Better utilization and latency tolerance
CMP:
• Simple cores
• Moderate amount of parallelism
• Multiple threads run concurrently on different cores
SMT Dual-core: all four threads can run concurrently
[Figure: block diagram of two SMT cores. Each core has its own bus interface, L2 cache and control, BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, and an L1 D-cache with D-TLB. Threads 1 and 3 run on one core; threads 2 and 4 run on the other.]