Hyper-Threading, Chip Multiprocessors, and Both
Zoran Jovanovic
To Be Tackled in Multithreading

• Review of threading algorithms
• Hyper-Threading concepts
• Hyper-Threading architecture
• Advantages/disadvantages
Threading Algorithms

• Time-slicing
  – The processor switches between threads at fixed time intervals.
  – High overhead, especially if one of the threads is in a wait state. Fine grain.
• Switch-on-event
  – The task is switched out on long pauses.
  – While a thread waits for data from a relatively slow source, CPU resources are given to other threads. Coarse grain.
(Both policies are contrasted in the sketch below.)
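A minimal C++ sketch of how the two policies look from software's point of view (the thread counts, loop bounds, and the sleep standing in for slow I/O are illustrative assumptions):

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // Time-slicing: more compute-bound threads than logical CPUs forces the
    // OS to preempt them at fixed intervals (fine grain).
    unsigned hw = std::thread::hardware_concurrency();
    unsigned n = hw ? 2 * hw : 4;  // hardware_concurrency() may report 0
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([] {
            volatile unsigned long x = 0;
            for (unsigned long j = 0; j < 100000000UL; ++j) x += j;  // pure compute
        });

    // Switch-on-event: a thread that blocks (a sleep stands in for slow I/O)
    // hands its CPU to the other threads until the event completes (coarse grain).
    std::thread blocker([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });

    blocker.join();
    for (auto& w : workers) w.join();
    std::cout << n << " compute threads shared " << hw << " logical CPUs\n";
}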
Threading Algorithms (cont.)

• Multiprocessing
  – Distributes the load over many processors.
  – Adds extra cost.
• Simultaneous multithreading (SMT)
  – Multiple threads execute on a single processor without switching.
  – The basis of Intel's Hyper-Threading technology.
Hyper-Threading Concept

• At any point in time, only part of the processor's resources is used to execute a thread's code.
• The unused resources can be put to work as well, for example by executing another thread or application in parallel.
• Extremely useful in desktop and server applications where many threads are used.
Quick Recall: Many Resources IDLE!

[Figure: issue-slot utilization of an 8-way superscalar, with most slots idle. From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.]
Slide source: John Kubiatowicz
[Figure: four issue-slot diagrams, (a) through (d).]
(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)
Simultaneous Multithreading (SMT)

• Example: Intel's Pentium 4 with "Hyper-Threading"
• Key idea: exploit ILP across multiple threads! That is, convert thread-level parallelism into more ILP.
• Exploit the following features of modern processors:
  – Multiple functional units: modern processors typically have more functional units available than a single thread can utilize.
  – Register renaming and dynamic scheduling: multiple instructions from independent threads can coexist and co-execute!
Hyper-Threading Architecture

• First used in the Intel Xeon MP processor.
• Makes a single physical processor appear as multiple logical processors.
• Each logical processor has its own copy of the architectural state.
• Logical processors share a single set of physical execution resources.
Hyper-Threading Architecture

• Operating systems and user programs can schedule processes or threads onto the logical processors as if they were physical processors in a multiprocessing system.
• From an architectural perspective, we have to worry about the logical processors sharing resources:
  – caches, execution units, branch predictors, control logic, and buses.
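To software, Hyper-Threading simply shows up as a larger logical CPU count. A minimal C++ sketch (reading the count as "twice the physical cores" is an assumption that holds for two-way Hyper-Threading parts, not a guarantee):

#include <iostream>
#include <thread>

int main() {
    // hardware_concurrency() reports logical processors, so on a two-way
    // Hyper-Threading CPU it is typically twice the physical core count
    // (and it may report 0 if the count is unknown).
    unsigned logical = std::thread::hardware_concurrency();
    std::cout << "logical processors visible to this program: " << logical << '\n';
    // The OS schedules threads onto these logical CPUs exactly as it would
    // onto physical ones; the shared execution resources are invisible here.
}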
Power 5 dataflow ...

• Why only two threads?
  – With four, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
• Cost:
  – The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.
Advantages

• The extra architecture adds only about 5% to the total die area.
• No performance loss if only one thread is active.
• Increased performance with multiple threads.
• Better resource utilization.
Disadvantages

• To take advantage of Hyper-Threading, the work cannot be left as a single serial stream; it must be split into threads (see the sketch below).
• Threads are non-deterministic and involve extra design effort.
• Threads have increased overhead.
• Shared resource conflicts.
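A minimal C++ sketch of splitting a serial reduction into two threads so that both logical processors of a Hyper-Threading core can be busy (the two-way split and the data size are illustrative):

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    long long lo = 0, hi = 0;
    auto mid = data.begin() + data.size() / 2;

    // Two threads can occupy the two logical processors of one physical
    // core; a purely serial loop could use only one of them.
    std::thread t1([&] { lo = std::accumulate(data.begin(), mid, 0LL); });
    std::thread t2([&] { hi = std::accumulate(mid, data.end(), 0LL); });
    t1.join();
    t2.join();

    // Deterministic only because the two halves are disjoint; overlapping
    // updates would need synchronization (one of the listed overheads).
    std::cout << "sum = " << lo + hi << '\n';
}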
Multicore

• Multiprocessors on a single chip.
Basic Shared Memory Architecture

• Processors are all connected to a large shared memory.
  – Where are the caches?

[Figure: P1, P2, ..., Pn connected through an interconnect to memory.]

• Now take a closer look at structure, costs, limits, and programming.
Slide source: CS267 Lecture 6
What About Caching???

[Figure: P1 ... Pn, each with its own cache ($), connected by a bus to memory and I/O devices.]

• Want high performance for shared memory: use caches!
  – Each processor has its own cache (or multiple caches)
  – Place data from memory into the cache
  – Writeback cache: don't send all writes over the bus to memory
• Caches reduce average latency
  – Automatic replication closer to the processor
  – More important for a multiprocessor than a uniprocessor: latencies are longer
• Normal uniprocessor mechanisms are used to access data
  – Loads and stores form a very low-overhead communication primitive (see the sketch below)
• Problem: Cache Coherence!
Slide source: John Kubiatowicz
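Loads and stores as a communication primitive, in a minimal C++ sketch: a flag-and-data handoff between two threads, where "send" and "receive" are just memory operations and the coherence machinery does the delivery (the atomic flag and memory orders are what make this correct in standard C++):

#include <atomic>
#include <iostream>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

int main() {
    // Producer: a plain store plus a release store of the flag is the whole
    // "send"; the cache-coherence protocol carries it to the other core.
    std::thread producer([] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });
    // Consumer: spin on a load; "receive" is just reading memory.
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        std::cout << "got " << payload << '\n';
    });
    producer.join();
    consumer.join();
}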
Example Cache Coherence Problem

[Figure: P1, P2, and P3 with private caches ($) on a bus to memory and I/O devices; u is 5 in memory. Events: (1) P1 reads u; (2) P3 reads u; (3) P3 writes u = 7; (4) P1 reads u again; (5) P2 reads u.]

• Things to note:
  – Processors could see different values for u after event 3.
  – With writeback caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when.
• How to fix with a bus: a coherence protocol
  – Use the bus to broadcast writes or invalidations.
  – Simple protocols rely on the presence of a broadcast medium.
  – A bus is not scalable beyond about 64 processors (max): capacity and bandwidth limitations.
Slide source: John Kubiatowicz
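Coherence traffic is easy to provoke from software. A minimal C++ sketch of false sharing, where two threads write to the same cache line and the protocol ping-pongs it between cores; the 64-byte line size and iteration count are assumptions:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Shared {
    std::atomic<long> a{0};              // same cache line as b: every write
    std::atomic<long> b{0};              // invalidates the other core's copy
};
struct Padded {
    alignas(64) std::atomic<long> a{0};  // 64-byte alignment (assumed line
    alignas(64) std::atomic<long> b{0};  // size) keeps the counters apart
};

template <class T> long run() {
    T s;
    auto work = [](std::atomic<long>& c) {
        for (long i = 0; i < 20000000L; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(s.a)), t2(work, std::ref(s.b));
    t1.join(); t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::cout << "shared line: " << run<Shared>() << " ms\n";
    std::cout << "padded:      " << run<Padded>() << " ms\n";
}

On typical multicore hardware the padded version runs markedly faster, even though both versions do exactly the same arithmetic.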
Limits of Bus-Based Shared Memory

[Figure: processors (PROC) with caches on a shared bus to memory (MEM) and I/O; 5.2 GB/s between each processor and its cache, 140 MB/s per processor on the bus.]

Assume a 1 GHz processor without a cache:
  => 4 GB/s instruction bandwidth per processor (32-bit instructions)
  => 1.2 GB/s data bandwidth at a 30% load-store fraction
Suppose a 98% instruction hit rate and a 95% data hit rate:
  => 80 MB/s instruction bandwidth per processor
  => 60 MB/s data bandwidth per processor
  => 140 MB/s combined bandwidth per processor
Assuming 1 GB/s bus bandwidth:
  ∴ 8 processors will saturate the bus (the arithmetic is checked below)
Slide source: CS267 Lecture 6
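The slide's arithmetic as a quick back-of-the-envelope check in C++ (all rates are the slide's stated assumptions):

#include <iostream>

int main() {
    // Assumptions: 1 GHz, 32-bit (4-byte) instructions, 30% load-stores.
    double inst_bw = 1e9 * 4;            // 4 GB/s instruction fetch per processor
    double data_bw = 1e9 * 0.30 * 4;     // 1.2 GB/s data traffic per processor

    // Only misses go over the bus: 2% of instructions, 5% of data accesses.
    double bus_inst = inst_bw * 0.02;        // 80 MB/s
    double bus_data = data_bw * 0.05;        // 60 MB/s
    double per_proc = bus_inst + bus_data;   // 140 MB/s per processor

    double bus = 1e9;                        // assumed 1 GB/s bus
    std::cout << "per-processor bus traffic: " << per_proc / 1e6 << " MB/s\n";
    std::cout << "processors to saturate bus: " << bus / per_proc << "\n";  // ~7.1, so 8
}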
Cache Organizations for Multicores

• L1 caches are always private to a core.
• L2 caches can be private or shared.
• Advantages of a shared L2 cache:
  – efficient dynamic allocation of space to each core
  – data shared by multiple cores is not replicated
  – every block has a fixed "home", hence it is easy to find the latest copy
• Advantages of a private L2 cache:
  – quick access to the private L2: good for small working sets
  – a private bus to the private L2 => less contention
SMT vs. CMP
"A Single-Chip Multiprocessor," L. Hammond et al. (Stanford), IEEE Computer, 1997

[Figure: floorplans of a superscalar (SS), an SMT, and a CMP design for the same area (a billion-transistor, DRAM-process die).]

Superscalar and SMT: very complex
• Wide issue
• Advanced branch prediction
• Register renaming
• Out-of-order (OOO) instruction issue
• Non-blocking data caches
SS and SMT vs. CMP

CPU cores: three main hardware design problems of SS and SMT:
• Area increases quadratically with core complexity
  – the number of registers is O(instruction window size)
  – register ports are O(issue width)
  – CMP solves this problem: area is roughly linear in issue width
• Longer cycle times
  – long wires, many MUXes and crossbars
  – large buffers, queues, and register files
  – the alternatives are clustering (which decreases ILP) or deep pipelining (which raises branch misprediction penalties)
  – CMP allows a small cycle time with little effort: small, fast cores that rely on software to schedule (poor ILP)
• Complex design and verification
SS and SMT vs. CMP

Memory:
• A 12-issue SS or SMT requires a multiported data cache (4-6 ports)
  – e.g., 2 × 128 KB (2-cycle latency)
• CMP: 16 × 16 KB (single-cycle latency), but the secondary cache is slower (multiported)
• Shared memory: write-through caches

[Figure: the SMT and CMP memory hierarchies.]
Performance comparison

• Compress (integer app): low ILP and no TLP.
• Mpeg-2 (multimedia app): high ILP and TLP, moderate memory requirements (parallelized by hand).
  + SMT utilizes core resources better
  + but CMP has 16 issue slots instead of 12
• Tomcatv (FP app): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler).
  + CMP has large memory bandwidth on the primary cache
  – SMT's fundamental problem: a unified, slower cache
• Multiprogram: an integer multiprogramming workload, all computation-intensive (low ILP, high process-level parallelism).
CMP Motivation

• How to utilize the available silicon?
  – Speculation (aggressive superscalar)
  – Simultaneous multithreading (SMT, Hyper-Threading)
  – Several processors on a single chip
• What is a CMP (Chip MultiProcessor)?
  – Several processors (several masters)
  – Both shared- and distributed-memory architectures
  – Both homogeneous and heterogeneous processor types
• Why?
  – Wire delays
  – Diminishing returns from uniprocessors
  – Very long design and verification times for modern processors
A Single-Chip Multiprocessor
L. Hammond et al. (Stanford), IEEE Computer, 1997

• TLP and PLP will become widespread in future applications:
  – various multimedia applications
  – compiler and OS support
  => favours CMP
• CMP:
  – better performance with simple hardware
  – higher clock rates, better memory bandwidth
  – shorter pipelines
• SMT has better utilization, but CMP has more resources (no wide-issue logic).
• Although CMP is bad when there is neither TLP nor ILP (compress), SMT and SS are not much better there.
A Reminder: SMT (Simultaneous Multithreading)

SMT:
• Pool of execution units (a wide machine)
• Several logical processors
• A copy of the architectural state for each
• Threads run concurrently
• Better utilization and latency tolerance

CMP:
• Simple cores
• Moderate amount of parallelism per core
• Multiple threads run concurrently on different cores
SMT Dual-Core: All Four Threads Can Run Concurrently

[Figure: two SMT cores side by side, each with its own L2 cache and control, bus interface, decoder, trace cache, uCode ROM, BTB and I-TLB, rename/alloc stage, uop queues, schedulers, integer and floating-point units, and L1 D-cache with D-TLB. Threads 1 and 3 run on one core; threads 2 and 4 run on the other.]
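On such a part, software can place one thread on each of the four logical CPUs. A minimal Linux-specific C++ sketch using pthread affinity (the assumption that logical CPUs 0-3 exist, and how they map onto cores and SMT threads, is platform-dependent and purely illustrative):

// Build with: g++ -pthread pin.cpp   (glibc; g++ defines _GNU_SOURCE by default)
#include <pthread.h>
#include <sched.h>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // Assume 4 logical CPUs (2 cores × 2-way SMT).
    std::vector<std::thread> threads;
    for (int cpu = 0; cpu < 4; ++cpu) {
        threads.emplace_back([cpu] {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);                 // restrict this thread to one logical CPU
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
            // ... this thread's work now runs on logical CPU `cpu` ...
        });
    }
    for (auto& t : threads) t.join();
    std::cout << "all four threads ran, one per logical CPU\n";
}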