16.482 / 16.561
Computer Architecture and Design
Instructor: Dr. Michael Geiger
Summer 2014

Lecture 6: Speculation

Lecture outline

- Announcements/reminders
  - HW 5 to be posted; due 6/13
- Today's lecture
  - Speculation

Review: Dynamic scheduling

- Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
  - Key idea: allow instructions behind a stall to proceed
  - Allows out-of-order execution and out-of-order completion
  - We use Tomasulo's Algorithm
- Decode stage now handles:
  - Issue: check for structural hazards and assign instruction to a functional unit (via reservation station)
  - Check for register values
- Reservation stations implicitly perform register renaming (see the sketch below)
  - Resolves potential WAW, WAR hazards
- Results broadcast over common data bus

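To make the renaming idea concrete, here is a minimal sketch assuming a simple dict-based rename map; the names rename_table, issue, and station_tag are illustrative, not from the lecture:

    rename_table = {}  # architectural register name -> producing station's tag

    def issue(dest, src1, src2, station_tag):
        # Each source operand is either an already-available register value or
        # the tag of the reservation station that will produce it (the consumer
        # then watches the CDB for that tag)
        operands = [rename_table.get(r, ("value", r)) for r in (src1, src2)]
        # Renaming the destination to this station's tag resolves WAW/WAR
        # hazards: later instructions wait on the tag, not the register name
        rename_table[dest] = ("tag", station_tag)
        return operands

For example, after a load has renamed F0, issue("F4", "F0", "F2", "Mult1") returns a tag for F0 but an immediate value for F2.
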
Speculation to greater ILP

- 3 components of HW-based speculation:
  1. Dynamic branch prediction
     - Need BTB to get target in 1 cycle
     - Ability to speculate past branches
  2. Dynamic scheduling
  3. In Tomasulo's algorithm, separate instruction completion from commit
     - Once instruction is non-speculative, it can update registers/memory
     - Reorder buffer (ROB) tracks program order
       - Head of ROB can commit when ready
       - ROB supplies data between complete and commit

Reorder Buffer Entry

Each entry in the ROB contains four fields:
1. Instruction type
   - A branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
2. Destination
   - Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
3. Value
   - Value of instruction result until the instruction commits
4. Ready
   - Indicates that the instruction has completed execution, and the value is ready

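These four fields map naturally onto a record type. A minimal sketch in Python; the class and field names are illustrative, not from the lecture:

    from dataclasses import dataclass
    from typing import Optional, Union

    @dataclass
    class ROBEntry:
        instr_type: str  # "branch", "store", or "register op" (ALU or load)
        destination: Union[int, str, None]  # register number, memory address,
                                            # or None for a branch
        value: Optional[int] = None  # holds the result until the entry commits
        ready: bool = False          # set once execution completes
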
Speculative Tomasulo's Algorithm

1. Instruction fetch: get instruction from memory; place in Op Queue
2. Issue: get instruction from FP Op Queue
   - If reservation station and reorder buffer slot are free, issue instruction & send operands & reorder buffer number for destination (this stage sometimes called "dispatch")
3. Execution: operate on operands (EX)
   - When both operands ready, execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
4. Memory access: if needed (MEM)
   - NOTE: stores update memory at commit, not MEM
5. Write result: finish execution (WB)
   - Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available
6. Commit: update register with reorder result (see the sketch below)
   - When instruction at head of reorder buffer & result present, update register with result (or store to memory) and remove instruction from reorder buffer; a mispredicted branch flushes the reorder buffer (sometimes called "graduation")

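Here is a minimal sketch of the in-order commit step, assuming the ROBEntry record above plus hypothetical regs and memory dicts (none of these names come from the lecture):

    def commit(rob, regs, memory):
        # Only the head of the ROB may commit, so architectural state is
        # always updated in program order
        while rob and rob[0].ready:
            entry = rob.pop(0)
            if entry.instr_type == "store":
                # Stores write memory here, at commit, not in the MEM stage
                memory[entry.destination] = entry.value
            elif entry.instr_type != "branch":
                # ALU ops and loads write the architectural register file
                regs[entry.destination] = entry.value
            # A mispredicted branch would instead flush the entire ROB
            # (not shown in this sketch)
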
Tomasulo's With Reorder Buffer

[Figure: datapath with the FP Op Queue, a reorder buffer of entries ROB1 (oldest) through ROB7 (newest), the register file, and reservation stations feeding the FP adders and FP multipliers, with paths to and from memory. In the example shown, LD F0,10(R2) occupies ROB1 with destination F0, its address calculation (10+R2) sits in a reservation station, and its "Done?" bit is still N.]

Revisiting stores with speculation

- With ROB, store buffers are eliminated
- Why?
  - Can't write memory until you know the value is non-speculative
  - Once the address is calculated, store it in the "destination" field of the ROB entry
- Need an additional field in the ROB for stores: equivalent to the "Q" fields in reservation stations (see the sketch below)
  - Indicates which instruction is writing the value to be stored

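Extending the earlier ROBEntry sketch, the store's "Q"-field equivalent might look like this (the field name value_src is illustrative):

    @dataclass
    class ROBStoreEntry(ROBEntry):
        # Tag of the instruction producing the value to be stored; cleared
        # once that value arrives over the CDB (the "Q"-field equivalent)
        value_src: Optional[int] = None
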
Reorder buffer example

- Given the following code:

      Loop: L.D    F0, 0(R1)
            MUL.D  F4, F0, F2
            S.D    F4, 0(R1)
            DADDIU R1, R1, #-8
            BNE    R1, R2, Loop

- Walk through two iterations of the loop
- Assume
  - 2 cycles for add, load
  - 1 cycle for address calculation
  - 6 cycles for multiply
  - Forwarding via CDB

Reorder buffer example: key points

- Execution stages
  - Fetch & issue: always in order
  - Execution & completion: may be out of order
  - Commit: always in order
- Hardware
  - Reservation stations
    - Occupied from IS to WB
  - Reorder buffer
    - Occupied from IS to C
    - Used to
      - Maintain program order for in-order commit
      - Supply register values between WB and C
  - Register result status
    - Rename registers based on ROB entries

Memory hazards, exceptions

- Reorder buffer helps limit memory hazards
  - With additional logic for disambiguation (determine if addresses match)
  - WAW / WAR automatically removed
  - RAW maintained by (see the sketch below)
    - Stalling loads if a store with the same address is in flight
    - Ensuring that effective addresses are computed in order
- Precise exceptions: logical extension of ROB
  - If instruction causes exception, flag it in the ROB
  - Handle exception when instruction commits

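A minimal sketch of the disambiguation check for loads, assuming a hypothetical list of the ROB entries that precede the load in program order:

    def load_must_stall(load_addr, earlier_entries):
        # RAW through memory: the load must wait if any earlier, uncommitted
        # store targets the same address, or has not computed its address yet
        for entry in earlier_entries:
            if entry.instr_type == "store":
                if entry.destination is None or entry.destination == load_addr:
                    return True
        return False
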
Getting CPI below 1

- CPI ≥ 1 if we issue only 1 instruction every clock cycle
- Multiple-issue processors come in 3 flavors:
  1. Statically scheduled superscalar processors
  2. Dynamically scheduled superscalar processors
  3. VLIW (very long instruction word) processors
- The 2 types of superscalar processors issue varying numbers of instructions per clock
  - Use in-order execution if they are statically scheduled
  - Use out-of-order execution if they are dynamically scheduled

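- Example (illustrative numbers, not from the lecture): a 4-issue processor sustaining 2 instructions per clock has IPC = 2, so CPI = 1/IPC = 0.5, below the single-issue limit of 1
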
Performance beyond single thread ILP

- There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
- Explicit thread-level parallelism or data-level parallelism
- Thread: process with own instructions and data
  - A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data-level parallelism: perform identical operations on data, and lots of data

Thread Level Parallelism (TLP)

- ILP exploits implicit parallel operations within a loop or straight-line code segment
- TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
- Goal: use multiple instruction streams to improve
  - Throughput of computers that run many programs
  - Execution time of multi-threaded programs
- TLP could be more cost-effective to exploit than ILP

New Approach: Multithreaded Execution

- Multithreading: multiple threads share the functional units of 1 processor via overlapping
  - Processor must duplicate the independent state of each thread
    - e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  - Memory shared through the virtual memory mechanisms, which already support multiple processes
  - HW for fast thread switch; much faster than a full process switch, which takes 100s to 1000s of clocks
- When to switch?
  - Alternate instruction per thread (fine grain)
  - When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading

- Switch on each instruction
- Usually done in a round-robin fashion, skipping any stalled threads (see the sketch below)
- CPU must be able to switch threads every clock
- Advantage: hides both short and long stalls
- Disadvantage: slows individual threads

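A minimal sketch of the round-robin thread pick, assuming each hardware thread context exposes a stalled flag (all names are illustrative):

    def pick_next_thread(threads, last_picked):
        # Rotate through the thread contexts, skipping any stalled ones
        n = len(threads)
        for i in range(1, n + 1):
            candidate = (last_picked + i) % n
            if not threads[candidate].stalled:
                return candidate
        return None  # every thread is stalled this cycle
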
Coarse-Grained Multithreading

- Switches only on costly stalls, such as L2 cache misses (see the sketch below)
- Advantages
  - Relieves need to have very fast thread switching
  - Doesn't slow down the individual thread
- Disadvantage: hard to overcome throughput losses on shorter stalls, due to pipeline start-up costs
  - Since the CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
  - New thread must fill the pipeline before instructions can complete
- Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time

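By contrast, a coarse-grained policy stays on one thread until a costly stall hits. A hedged sketch reusing pick_next_thread from above (the threshold value and the stall_cycles field are illustrative):

    COSTLY_STALL = 2  # switch only on stalls longer than this many cycles

    def coarse_grained_pick(threads, current):
        # Short stalls are absorbed; switching pays off only when the
        # remaining stall time far exceeds the pipeline refill time
        if threads[current].stall_cycles <= COSTLY_STALL:
            return current
        return pick_next_thread(threads, current)
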
Do both ILP and TLP?

- TLP and ILP exploit two different kinds of parallel structure in a program
- Could a processor oriented at ILP exploit TLP?
  - Functional units are often idle in a datapath designed for ILP because of either stalls or dependences in the code
- Could TLP be used as a source of independent instructions that might keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Simultaneous Multi-threading ...

[Figure: two issue-slot charts over 9 cycles, each cycle offering 8 units (M M FX FX FP FP BR CC). The "One thread, 8 units" chart leaves many slots idle; the "Two threads, 8 units" chart fills far more slots per cycle. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.]

Simultaneous Multithreading (SMT)

- Simultaneous multithreading (SMT): insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  - Large set of virtual registers that can be used to hold the register sets of independent threads
  - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  - Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
- Just add a per-thread renaming table and keep separate PCs (see the sketch below)
  - Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"

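The per-thread rename table can be sketched by extending the earlier renaming example (NUM_THREADS and rename are illustrative names, not from the lecture):

    NUM_THREADS = 2
    rename_tables = {tid: {} for tid in range(NUM_THREADS)}

    def rename(tid, reg):
        # The same architectural register in different threads maps to
        # different tags, so mixed instructions never confuse operands
        return rename_tables[tid].get(reg, ("arch", tid, reg))
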
Multithreaded Categories

[Figure: issue slots plotted against time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Shading distinguishes Thread 1 through Thread 5; unshaded boxes are idle slots.]

Design Challenges in SMT

- Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  - A preferred-thread approach sacrifices neither throughput nor single-thread performance (?)
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
- Larger register file needed to hold multiple contexts
- Not affecting clock cycle time, especially in
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

Multithreading examples

- Assume a processor with the following characteristics
  - 4 functional units
    - 2 ALU
    - 1 memory port (either load or store)
    - 1 branch
  - In-order scheduling
- Given 3 threads, show execution using
  - Fine-grained multithreading
  - Coarse-grained multithreading
    - Assume any stall longer than 2 cycles causes a switch
  - Simultaneous multithreading (see the sketch below)
    - Thread 1 is preferred, followed by Thread 2 & Thread 3
- Assume any two instructions without stalls between them are independent

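As a starting point for working the SMT case, here is a hedged sketch of a single issue cycle with a preferred thread; the slot mix comes from the slide, but ready_instructions, kind, and tid are hypothetical names:

    SLOTS = {"alu": 2, "mem": 1, "branch": 1}  # the slide's 4 functional units

    def smt_issue_cycle(threads):
        # Fill slots from the preferred thread first (Thread 1, then 2, then 3)
        free = dict(SLOTS)
        issued = []
        for t in threads:  # threads listed in preference order
            for instr in t.ready_instructions():
                if free.get(instr.kind, 0) > 0:
                    free[instr.kind] -= 1
                    issued.append((t.tid, instr))
        return issued
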
Final notes

- Next time
  - Memory hierarchies
- Reminders
  - HW 5 to be posted; due 6/13