Pipelining difficulties and the MIPS R4000

ENGS 116 Lecture 6
Pipelining Difficulties and MIPS R4000
Vincent H. Berk
October 6, 2008
Reading for today: A.3 – A.4, article: Yeager
Reading for Wednesday: A.5 – A.6, article: Smith&Pleszkun
FRIDAY: NO CLASS
Exception Characterization
Synchronous vs. Asynchronous
– Synchronous: event occurs same place every time
– Asynchronous: caused by devices external to CPU & memory,
also hw malfunctions
User requested vs. user coerced
– Requested: user task asks for it
– Coerced: hw event not under control of user program
User maskable vs. user nonmaskable
– Maskable: event that can be disabled by user task
Within vs. between instructions
– Within: during execution of task, hard to handle, usually
synchronous since instruction is trigger
Resume vs. terminate
– Terminating: execution always stops after the interrupt
Exception Handling
Table of Interrupt vector addresses
• Base register of this table stored in CPU by OS
• Addresses of Interrupt handling routines are stored in table
• On interrupt, CPU jumps to: base + 4 * int_num
• Usually 16 or 32 interrupts
• Physical pins on CPU, as well as software calls
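The table lookup above can be sketched in a few lines of Python. This is only an illustration: it reads the slide as saying that the entry at base + 4 * int_num holds the address of the handler routine, and the names `vector_entry_address`, `handler_address`, the table base 0x1000, and the handler addresses are all hypothetical.

```python
WORD_SIZE = 4  # bytes per table entry, matching the "4 * int_num" on the slide


def vector_entry_address(base: int, int_num: int, num_vectors: int = 32) -> int:
    """Return the address of the table entry for interrupt `int_num`."""
    if not 0 <= int_num < num_vectors:
        raise ValueError(f"interrupt number {int_num} out of range")
    return base + WORD_SIZE * int_num


def handler_address(memory: dict, base: int, int_num: int) -> int:
    """Read the handler routine address stored in the table entry."""
    return memory[vector_entry_address(base, int_num)]


# Assumed setup: the OS places the table at 0x1000 and installs two handlers.
BASE = 0x1000
memory = {
    vector_entry_address(BASE, 0): 0x8000,  # e.g. an I/O request handler
    vector_entry_address(BASE, 3): 0x8400,  # e.g. a page fault handler
}

entry = vector_entry_address(BASE, 3)
target = handler_address(memory, BASE, 3)
```

With these assumed values, interrupt 3 selects the entry at 0x100C and the CPU continues at the handler address stored there.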
Exception Examples
(see also: figure A.27)
• I/O request: device requests attention from CPU
• System call or Supervisor call from software
• Breakpoint or instruction tracing: software debugging, single-step
• Arithmetic: Integer or FP, overflow, underflow, division by zero
• Page fault: requested virtual address was not present in main memory
• Misaligned address: bus error
• Memory protection: read/write/execute forbidden on requested address
• Invalid opcode: CPU was given a wrongly formatted instruction
• Hardware malfunction: CRC errors, component failure
• Power failure
Pipelining Complications
• Exceptions: 5 instructions executing in 5-stage pipeline
– How to stop the pipeline?
– How to restart the pipeline?
– Who caused the exception?
Stage   Problem exceptions occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation
Pipelining Complications
• Simultaneous exceptions in more than one pipeline stage, e.g.,
– Load with data page fault in MEM stage
– Add with instruction page fault in IF stage
– Add fault will happen BEFORE load fault, even though the load comes first in program order
• Solution #1
– Interrupt status vector per instruction
– Defer check till last stage, kill state update if exception
• Solution #2
– Interrupt ASAP
– Restart everything that is incomplete
Another advantage for state update late in pipeline!
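Solution #1 can be sketched as a small simulation. The model below is an assumption-laden illustration, not a description of real hardware: it uses a classic 5-stage pipeline with no stalls, treats WB as the "last stage" where the status vector is checked, and the names `Instr`, `faults_in`, and `run` are made up for this example.

```python
from dataclasses import dataclass, field

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


@dataclass
class Instr:
    name: str
    faults_in: dict = field(default_factory=dict)  # stage -> exception name
    status: list = field(default_factory=list)     # per-instruction status vector


def run(program):
    """One instruction enters per cycle; exceptions are only posted when they
    occur and acted on when the instruction reaches WB (assumed last stage)."""
    posted_log = []  # (cycle, instruction, exception) in the order faults occur
    cycle = 0
    while True:
        cycle += 1
        # Walk oldest-first so an older instruction is checked before a younger one.
        for i, ins in enumerate(program):
            stage_idx = cycle - 1 - i
            if not 0 <= stage_idx < len(STAGES):
                continue
            stage = STAGES[stage_idx]
            if stage in ins.faults_in:
                ins.status.append(ins.faults_in[stage])
                posted_log.append((cycle, ins.name, ins.faults_in[stage]))
            if stage == "WB" and ins.status:
                # Kill the state update and handle the first posted exception.
                return ins.name, ins.status[0], posted_log
        if cycle >= len(program) + len(STAGES) - 1:
            return None, None, posted_log


# The two-instruction case from the slide.
load = Instr("LD", {"MEM": "data page fault"})
add = Instr("ADD", {"IF": "instruction page fault"})
handled_instr, handled_exc, log = run([load, add])
```

In this sketch the add's fault is posted in cycle 2 and the load's in cycle 4, yet the load's exception is the one handled first, because the check only happens when each instruction reaches its last stage.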
Pipelining Complications
• Complex addressing modes and instructions
• Address modes: Autoincrement causes register change during
instruction execution
– Interrupts? Need to restore register state
– Adds WAR and WAW hazards since writes no longer in last stage
• Memory-memory move instructions
– Must be able to handle multiple page faults
– Long-lived instructions: partial state save on interrupt
• Floating point: long execution time; out of order completion
Stopping and Starting Execution
Most difficult exception occurrences have 2 properties
– They occur within instructions
– They must be restartable
The pipeline must be shut down safely and the state must be saved for
correct restarting
Restarting is usually done by saving PC of instruction at which to start
Branches and delayed branches require special treatment
Precise exceptions allow instructions just before the faulting instruction to be
completed, while instructions after it are restarted from scratch
[Figure A.29: IF and ID feed four parallel EX units (integer unit, FP/integer multiply, FP adder, FP/integer divider), followed by MEM and WB]
Figure A.29 The MIPS pipeline with three additional unpipelined, floating-point, functional units.
[Figure A.31: IF and ID feed the integer unit (EX), the FP/integer multiply unit (M1 to M7), the FP adder (A1 to A4), and the FP/integer divider (DIV), followed by MEM and WB]
Figure A.31 A pipeline that supports multiple outstanding FP operations
Clock Cycle Number
Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
LD F4, 0(R2)      IF    ID    EX    MEM   WB
MULTD F0, F4, F6        IF    ID    stall M1    M2    M3    M4    M5    M6    M7    MEM   WB
ADDD F2, F0, F8               IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    MEM
SD 0(R2), F2                              IF    stall stall stall stall stall stall ID    EX    stall stall stall MEM
Figure A.33 A typical FP code sequence showing the stalls arising from RAW hazards.
Case Study: MIPS R4000
(100 MHz to 200 MHz)
• 8 Stage Pipeline:
– IF: first half of fetching of instruction; PC selection happens here as
well as initiation of instruction cache access.
– IS: second half of access to instruction cache.
– RF: instruction decode and register fetch, hazard checking and also
instruction cache hit detection.
– EX: execution, which includes effective address calculation, ALU
operation, and branch target computation and condition evaluation.
– DF: data fetch, first half of access to data cache.
– DS: second half of access to data cache.
– TC: tag check, determine whether the data cache access hit.
– WB: write back for loads and register-register operations.
• 8 Stages: What is impact on Load delay? Branch delay? Why?
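One way to reason about the question is to count stages. The sketch below assumes full forwarding, that load data becomes available at the end of DS (before the tag check completes), and that a branch is resolved at the end of EX; the helper name `delay` and the 5-stage comparison (load data after MEM, branch resolved in ID) are illustrative assumptions.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
FIVE_STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def delay(stages, available_after, needed_in):
    """Cycles a dependent instruction must trail behind its producer.

    `available_after` is the stage at whose end the value exists;
    `needed_in` is the stage at whose start the consumer needs it.
    """
    return stages.index(available_after) - stages.index(needed_in)


# Load: data leaves the cache after DS, a dependent ALU op needs it in EX.
r4000_load_delay = delay(R4000_STAGES, "DS", "EX")
# Branch: outcome known after EX, the next fetch starts in IF.
r4000_branch_delay = delay(R4000_STAGES, "EX", "IF")

# Classic 5-stage pipeline for comparison (branch resolved in ID assumed).
five_load_delay = delay(FIVE_STAGES, "MEM", "EX")
five_branch_delay = delay(FIVE_STAGES, "ID", "IF")
```

Under these assumptions the deeper pipeline lengthens both delays, because the cache accesses that took one stage each are now spread over several.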
[Figure A.37: pipeline stages IF, IS, RF, EX, DF, DS, TC, WB drawn over the instruction memory, Reg, ALU, data memory, and Reg blocks]
Figure A.37 The eight-stage pipeline structure of the R4000 uses pipelined instruction and data cache accesses.
Case Study: MIPS R4000
TWO Cycle Load Latency
THREE Cycle Branch Latency (conditions evaluated during EX phase)
– Delay slot plus two stalls
– Branch likely cancels delay slot if not taken
[Pipeline timing diagrams: successive instructions passing through the IF, IS, RF, EX, DF, DS, TC, WB stages]
MIPS R4000 Floating Point
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:
Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers
R4000 Performance
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)
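The list above amounts to CPI = ideal CPI + stall cycles per instruction, summed over the stall sources. A minimal sketch follows; the stall frequencies in `example_stalls` are made-up placeholders chosen only to show the arithmetic and are not R4000 measurements.

```python
def pipeline_cpi(stalls_per_instr, ideal_cpi=1.0):
    """Effective CPI: ideal CPI plus average stall cycles per instruction."""
    return ideal_cpi + sum(stalls_per_instr.values())


# Hypothetical average stall cycles per instruction for each source.
example_stalls = {
    "load": 0.10,
    "branch": 0.25,
    "fp_result": 0.05,
    "fp_structural": 0.01,
}

cpi = pipeline_cpi(example_stalls)
```

With these placeholder numbers the stalls add 0.41 cycles per instruction on top of the ideal CPI of 1.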
Instruction Level Parallelism
Want to exploit parallelism among instruction sequences
Branches interfere with parallelism - gcc has branch every 5 or 6
instructions (on average)
Need to find sequences of unrelated instructions that can be overlapped
Often see loop-level parallelism
for (i = 0; i < 100; i = i + 1)
    x[i] = x[i] + y[i];
Want to convert loop-level parallelism to instruction-level parallelism
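One common way to make that conversion is loop unrolling, sketched here in Python for illustration. Each iteration of the loop touches a different element, so several iterations can be written side by side as independent statements. The function names and the unroll factor of 4 are choices made for this example, not something the slide prescribes.

```python
def add_vectors(x, y):
    """Reference version: one element per iteration, as in the loop above."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + y[i]
    return out


def add_vectors_unrolled(x, y):
    """Same computation with the loop body unrolled four times."""
    out = list(x)
    n = len(out)
    main = n - n % 4
    for i in range(0, main, 4):
        # These four statements use disjoint elements, so none depends on
        # another; a pipeline could overlap them freely.
        out[i] = out[i] + y[i]
        out[i + 1] = out[i + 1] + y[i + 1]
        out[i + 2] = out[i + 2] + y[i + 2]
        out[i + 3] = out[i + 3] + y[i + 3]
    for i in range(main, n):  # leftover iterations when n is not a multiple of 4
        out[i] = out[i] + y[i]
    return out


x = list(range(100))
y = list(range(100, 200))
same = add_vectors(x, y) == add_vectors_unrolled(x, y)
```

The unrolled version also executes the loop overhead (index update and branch) once per four elements instead of once per element.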
FP Loop: Where are the Hazards?
Loop: LD    F0, 0(R1)     ; F0=vector element
      ADDD  F4, F0, F2    ; add scalar in F2
      SD    0(R1), F4     ; store result
      SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
      BNEZ  R1, Loop      ; branch R1!=zero
      NOP                 ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
FP Loop Hazards
Loop: LD    F0, 0(R1)     ; F0=vector element
      ADDD  F4, F0, F2    ; add scalar in F2
      SD    0(R1), F4     ; store result
      SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
      BNEZ  R1, Loop      ; branch R1!=zero
      NOP                 ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

• Where are the stalls?
FP Loop Showing Stalls
1  Loop: LD    F0, 0(R1)     ; F0=vector element
2        stall
3        ADDD  F4, F0, F2    ; add scalar in F2
4        stall
5        stall
6        SD    0(R1), F4     ; store result
7        SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
8        stall               ; wait for result R1
9        BNEZ  R1, Loop      ; branch R1!=zero
10       stall               ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

• Rewrite code to minimize stalls?
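The stall positions can be reproduced from the latency table with a short model, which is also a convenient way to score a candidate reordering. This is a sketch under stated assumptions: instructions issue in order, one per cycle; a consumer must wait the table's latency after its producer; the integer-op-to-branch latency of 1 is inferred from the stall at cycle 8; and the delay slot costs one cycle. The names `LATENCY`, `LOOP`, and `issue_cycles` are hypothetical.

```python
# (producer kind, consumer kind) -> stall cycles required between them
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption, taken from the stall at cycle 8
}

# (mnemonic, kind, destination register or None, source registers)
LOOP = [
    ("LD", "load", "F0", ["R1"]),
    ("ADDD", "fpalu", "F4", ["F0", "F2"]),
    ("SD", "store", None, ["F4", "R1"]),
    ("SUBI", "int", "R1", ["R1"]),
    ("BNEZ", "branch", None, ["R1"]),
]


def issue_cycles(code, latency=LATENCY):
    """Return the cycle in which each instruction issues (first cycle is 1)."""
    cycles = []
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    for _op, kind, dest, srcs in code:
        earliest = cycles[-1] + 1 if cycles else 1
        for reg in srcs:
            if reg in last_writer:
                prod_cycle, prod_kind = last_writer[reg]
                gap = latency.get((prod_kind, kind), 0)
                earliest = max(earliest, prod_cycle + 1 + gap)
        cycles.append(earliest)
        if dest is not None:
            last_writer[dest] = (earliest, kind)
    return cycles


cycles = issue_cycles(LOOP)
total_cycles = cycles[-1] + 1  # plus one cycle for the branch delay slot
```

For the loop as written, the model issues LD, ADDD, SD, SUBI, BNEZ in cycles 1, 3, 6, 7, 9 and counts 10 cycles per iteration, matching the listing. Passing a reordered instruction list to `issue_cycles` shows how many stalls remain.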