Untitled Slides - VLSI Laboratory


Appendix A: Basic Pipelining: Basic and Intermediate Concepts
Rung-Bin Lin
Appendix-1
Appendix A. Pipelining: Basic and Intermediate Concepts
• What is Pipelining?
– Pipelining is an implementation technique whereby multiple
instructions are overlapped in execution.
– Pipe stage (pipe segment)
– Throughput
– Machine cycle: the time required to move an instruction one
step down the pipeline. It is determined by the time required for the
slowest pipe stage.
– In a computer, the machine cycle is usually one clock cycle.
– The pipeline designer's goal is to balance the length of each pipe
stage.
– If the stages are perfectly balanced,
$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
A Simple Implementation of A RISC ISA
• Five-cycle implementation
– Instruction fetch cycle (IF)
– Instruction decode/register fetch cycle (ID)
• Operand fetches;
• Sign-extending the immediate field;
• Decoding is done in parallel with reading registers. This technique is
known as fixed-field decoding;
• Test the branch condition and compute the branch target address; the
branch is completed by the end of this cycle.
– Execution/effective address cycle (EX)
• Memory reference;
• Register-Register ALU instruction;
• Register-Immediate ALU instruction;
– Memory access/branch completion cycle (MEM)
– Write-back cycle (WB)
• Register-Register ALU instruction;
• Register-Immediate ALU instruction;
• Load instruction;
Performance of the Five-Cycle Implementation
• CPI=4.54
– Branch instructions (12%) take 2 cycles
– Store instructions (10%) require 4 cycles
– All other instructions (78%) take 5 cycles
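The CPI figure on this slide is a weighted average over the instruction mix. A minimal sketch of that arithmetic, using only the frequencies and cycle counts listed above (the names `mix` and `average_cpi` are illustrative):

```python
# Instruction mix from the slide: (class, frequency, cycles).
mix = [
    ("branch", 0.12, 2),
    ("store", 0.10, 4),
    ("other", 0.78, 5),  # the remaining 78% of instructions
]

def average_cpi(mix):
    """Weighted average of cycles per instruction over the mix."""
    return sum(freq * cycles for _, freq, cycles in mix)

print(f"CPI = {average_cpi(mix):.2f}")  # prints "CPI = 4.54"
```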
The Classic Five-Stage Pipeline for a RISC Processor
The RISC Pipeline with Registers
Instruction Issue
• The process of letting an instruction move from the
instruction decode stage (ID) into the execution stage
(EX) of this pipeline.
Basic Performance Issues in Pipelining
– Pipelining increases instruction throughput, but it does not
reduce the execution time of an individual instruction. In fact, it
slightly increases that time because of pipeline overhead:
• Pipeline register delay
• Clock skew
– Practical pipeline depth is limited by
• Pipeline latency
• Pipe stage imbalance
• Pipeline overhead
– Example on page A-10.
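The effect of pipeline overhead on speedup can be sketched numerically. The numbers below are illustrative assumptions, not values from these slides: a 1 ns unpipelined clock, a mix of 40% ALU operations (4 cycles), 20% branches (4 cycles) and 40% memory operations (5 cycles), and 0.2 ns of overhead added to each pipelined cycle. The sketch also assumes an ideal pipelined CPI of 1.

```python
def pipelining_speedup(clock_ns, mix, overhead_ns):
    """Speedup of an ideal (CPI = 1) pipeline over the unpipelined machine.

    mix is a list of (frequency, cycles) pairs for the unpipelined machine.
    The pipelined cycle is the unpipelined clock plus overhead (assumption).
    """
    avg_unpipelined_ns = clock_ns * sum(freq * cycles for freq, cycles in mix)
    pipelined_ns = clock_ns + overhead_ns
    return avg_unpipelined_ns / pipelined_ns

# Hypothetical instruction mix: (frequency, cycles).
example_mix = [(0.4, 4), (0.2, 4), (0.4, 5)]
print(f"{pipelining_speedup(1.0, example_mix, 0.2):.2f}")
```

With these assumed numbers the speedup is 4.4 / 1.2, well short of the five-fold gain that a five-stage pipeline would give with no overhead.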
The Major Hurdle of Pipelining - Pipeline Hazards
– A hazard is a situation that prevents the next instruction in
the instruction stream from executing during its designated
clock cycle.
– Three classes of hazards
• Structural hazard: arises from resource conflicts.
• Data hazard: arises when an instruction depends on the result of a
previous instruction.
• Control hazard: arises from branches and other instructions that
change the PC.
– A hazard can force the pipeline to stall. When an instruction is stalled,
• Instructions issued later than the stalled instruction are also stalled.
• Instructions issued earlier than the stalled one must continue;
otherwise the hazard would never clear.
– Note that a cache miss stalls the whole pipeline.
Performance of Pipeline with Stalls
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$
– When pipelining is thought of as decreasing the CPI,
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
– When pipelining is thought of as improving the clock cycle
time,
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
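Both views of pipelining end in the same expression, pipeline depth divided by one plus the stall cycles per instruction. A minimal sketch of that formula (the function name and the sample values are illustrative):

```python
def speedup_with_stalls(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)

# A 5-stage pipeline with no stalls reaches the ideal speedup of 5;
# an assumed 0.25 stall cycles per instruction reduces it to 4.
print(speedup_with_stalls(5, 0.0))   # 5.0
print(speedup_with_stalls(5, 0.25))  # 4.0
```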
Structural Hazards
– Caused by resource conflicts (example on page A-14):
• Some functional unit is not fully pipelined.
• Some resource has not been duplicated enough.
Data Hazards
– An operand access depends on the result of an earlier instruction
that has not yet finished.
Forwarding (Bypassing) ALU Results To
Minimize Hazards
Forwarding (Bypassing) Results to Store
Bypassing Results of LOAD
Data Hazard Classification
– Consider two instructions i and j, with i occurring before j,
the possible hazards are,
• RAW (read after write) : j tries to read a source before i writes it.
• WAW (write after write): j tries to write an operand before it is
written by i. For example,
LW   R1, 0(R2)       IF   ID   EX   MEM1 MEM2 WB
DADD R1, R2, R3           IF   ID   EX   WB
• WAR (write after read): j tries to write a destination before it is read
by i. For example, when the read is done in the second half of MEM2 and
the write is done in the first half of WB,
SW   0(R1), R2       IF   ID   EX   MEM1 MEM2 WB
DADD R2, R3, R4           IF   ID   EX   WB
• RAR (read after read): not a hazard.
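The classification above can be sketched as set intersections over the registers each instruction reads and writes. This is a simplified illustration: it reports which hazard classes are possible between i and j, and whether a hazard actually occurs still depends on the pipeline timing. The function name and the set-based representation are assumptions made for the example.

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Return the possible hazard classes between instruction i and a later j."""
    hazards = set()
    if i_writes & j_reads:
        hazards.add("RAW")  # j reads what i writes
    if i_writes & j_writes:
        hazards.add("WAW")  # both write the same register
    if i_reads & j_writes:
        hazards.add("WAR")  # j overwrites what i still has to read
    return hazards

# LW R1, 0(R2) followed by DADD R1, R2, R3
print(classify_hazards({"R2"}, {"R1"}, {"R2", "R3"}, {"R1"}))  # {'WAW'}
# SW 0(R1), R2 followed by DADD R2, R3, R4
print(classify_hazards({"R1", "R2"}, set(), {"R3", "R4"}, {"R2"}))  # {'WAR'}
```

Two reads of the same register produce an empty set, matching the note that RAR is not a hazard.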
Data Hazards Requiring Stalls
– Pipeline interlock
• A piece of hardware that detects a hazard and stalls the pipeline
until the hazard is cleared.
– Load interlock
• Example (Fig. A.10 on page A-21)
Control Hazards
– Caused by instructions that change the PC.
– Some basics
• If a branch changes the PC to its target address, it is a taken
branch. If it does not change the PC, it falls through or it is not
taken.
• Recall that if an instruction i is a taken branch, the PC is normally
not changed until the end of ID. A stall cycle is required.
Branch instruction    IF  ID  EX  MEM WB
Branch successor          IF  IF  ID  EX  MEM WB
Branch successor+1                IF  ID  EX  MEM WB
Branch successor+2                    IF  ID  EX  MEM WB
Branch Penalty
– Branch delay: the length of a control hazard, in clock cycles.
– Branch penalty: the cycles actually lost to a branch. Unless the
branch delay is dealt with, it turns into branch penalty.
– The deeper the pipeline, the worse the branch penalty.
– The number of branch stalls can be reduced in two ways
• Find out whether the branch is taken or not taken earlier in the
pipeline.
• Compute the taken PC (i.e., the address of the branch target)
earlier.
– Branch behavior in programs
• Average frequency of taken branches : 67%
– 60% of the forward branches are taken.
– 85% of the backward branches are taken.
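The cost of branches can be sketched as extra stall cycles per instruction. The example below is a rough illustration under stated assumptions: it reuses the 12% branch frequency from the five-cycle implementation slide, a one-cycle branch delay, and the 67% taken rate above. The function name is hypothetical.

```python
def branch_stall_cpi(branch_frequency, penalty_cycles, penalized_fraction=1.0):
    """Stall cycles per instruction contributed by branches.

    penalized_fraction is the share of branches that actually pay the
    penalty (1.0 when every branch stalls).
    """
    return branch_frequency * penalized_fraction * penalty_cycles

# Every branch stalls for one cycle.
print(branch_stall_cpi(0.12, 1))
# Only taken branches (67%) pay the one-cycle penalty.
print(branch_stall_cpi(0.12, 1, penalized_fraction=0.67))
```

The second call models a scheme in which untaken branches cost nothing, which is why knowing the branch outcome and target early matters.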
Reducing Pipeline Branch Penalties
• Static branch prediction methods (Compile-time guess).
– Freeze or flush the pipeline
• Holding or deleting any instructions after the branch until the branch
destination is known.
– Predict-not-taken (untaken) (Fig. A.12 in A-23)
– Predict-taken
• Does it have any advantage in this pipeline? No: the target address is
known no earlier than the branch outcome.
– Delayed branch: the execution cycle with a branch delay of n is
• Branch instruction
• Sequential successor 1
• Sequential successor 2
• …
• Sequential successor n (n = 1 for MIPS)
• Branch target if taken
Scheduling the Branch Delay Slot
Effectiveness of Scheduling Branch Delay Slots
– When each scheduling strategy improves performance
• Scheduling from before: always
• Scheduling from target: only when the branch is taken
• Scheduling from fall-through: only when the branch is not taken
– The limitations on delayed-branch scheduling arise from
• The restrictions on the instructions that can be scheduled into the
delay slots.
• The ability to predict at compile time whether a branch is likely to
be taken or not.
– Canceling (nullifying) branches relieve these limits
• In a canceling branch, the instruction includes the direction that
the branch was predicted. When the branch behaves as predicted,
the instruction in the branch delay slot is simply executed.
Otherwise, the instruction in the branch delay slot is simply turned
into a No-Op.
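The rules on this slide can be written as two small predicates. This is a sketch only; the strategy strings and function names are made up for the illustration.

```python
def slot_is_useful(strategy, branch_taken):
    """True if the delay-slot instruction does useful work."""
    if strategy == "from before":
        return True               # always useful
    if strategy == "from target":
        return branch_taken       # useful only when the branch is taken
    if strategy == "from fall-through":
        return not branch_taken   # useful only when the branch is not taken
    raise ValueError(f"unknown strategy: {strategy}")

def canceling_slot_executes(predicted_taken, branch_taken):
    """Canceling branch: the slot executes only if the prediction was right;
    otherwise it is turned into a no-op."""
    return predicted_taken == branch_taken

print(slot_is_useful("from target", branch_taken=False))                  # False
print(canceling_slot_executes(predicted_taken=True, branch_taken=False))  # False
```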
How Is Pipelining Implemented?
• Unpipelined 5-cycle implementation
Simple Pipelining Implementation for MIPS
Implementing the Control for MIPS Pipeline
• Implementing the control focuses on detecting hazards and
generating the control signals for forwarding.
– Hazard detection
• All data hazards can be checked, and the forwarding control
signals set, during the ID phase. If a data hazard exists, the
instruction is stalled before it is issued.
• Alternatively, hazards and forwarding are checked at the beginning
of each clock cycle that uses an operand (EX and MEM for the MIPS
pipeline).
– Implementing the logic for hazard detection
• Hazards are detected by comparing the destination and source registers
of adjacent instructions (Fig. A.20 on page A-34).
• An example detects all load interlocks while the instruction that
uses the load result is in the ID stage (Fig. A.21 on page A-34).
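A load interlock check of the kind described here can be sketched as a comparison between the load in the ID/EX register and the instruction in IF/ID. The `Instr` record, the `"LD"` opcode string and the function name are assumptions for this example. The check is conservative because it treats every source register alike, whereas real logic distinguishes cases by opcode.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    op: str                        # opcode mnemonic (illustrative)
    dest: Optional[str]            # destination register, if any
    sources: Tuple[str, ...] = ()  # source registers

def needs_load_interlock(in_id_ex: Optional[Instr], in_if_id: Instr) -> bool:
    """Stall if the instruction being decoded reads the register that the
    load just ahead of it (now in ID/EX) has not yet fetched from memory."""
    return (
        in_id_ex is not None
        and in_id_ex.op == "LD"
        and in_id_ex.dest in in_if_id.sources
    )

load = Instr("LD", "R1", ("R2",))
user = Instr("DSUB", "R4", ("R1", "R5"))
print(needs_load_interlock(load, user))  # True: DSUB must wait one cycle
```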
Implementing Forwarding Logic
– Forwarding sources: ALU or data memory output.
– Forwarding destination: ALU input, data memory input,
or zero detection unit (for BRANCH).
• The forwarding can be implemented by checking the following
conditions
– EX/MEM.IR.destination = ID/EX.IR.source ?
– MEM/WB.IR.destination = ID/EX.IR.source ?
– MEM/WB.IR.destination = EX/MEM.IR.source ?
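The first two comparisons select the value for an ALU input. A minimal sketch, assuming each pipeline register exposes the destination register of the instruction it holds, or `None` when that instruction produces no forwardable result. The function name and return labels are illustrative.

```python
def forward_select(id_ex_source, ex_mem_dest, mem_wb_dest):
    """Choose where an ALU operand should come from.

    The EX/MEM result is checked first because it is the most recent
    value written to the register.
    """
    if ex_mem_dest is not None and ex_mem_dest == id_ex_source:
        return "EX/MEM"
    if mem_wb_dest is not None and mem_wb_dest == id_ex_source:
        return "MEM/WB"
    return "register file"

# Both older instructions wrote R1: the newer result (EX/MEM) wins.
print(forward_select("R1", ex_mem_dest="R1", mem_wb_dest="R1"))  # EX/MEM
print(forward_select("R1", ex_mem_dest="R4", mem_wb_dest="R1"))  # MEM/WB
```

The third comparison (MEM/WB destination against EX/MEM source) follows the same pattern for the data memory input.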
Forwarding Data to the Two ALU Inputs
Dealing with Branches in the Pipeline
What Makes Pipelining Hard to Implement
• Exceptions (interrupts, faults) make pipelining
difficult to implement.
• Instruction set complications
Types of Exceptions
• Types
– I/O device request
– Invoking an OS service from a user program
– Tracing instruction execution
– Breakpoint
– Integer arithmetic overflow or underflow
– FP arithmetic anomaly
– Page fault
– Misaligned memory access
– Memory-protection violation
– Using an undefined instruction
– Hardware malfunction
– Power failure
• Exceptions in different architectures (Fig. A.26 on page A-40).
Classification of Exceptions
– Synchronous versus asynchronous
• If the event occurs at the same place every time that the program is
executed with the same data and memory allocation, the event is
called synchronous.
– User requested versus coerced
– User maskable versus nonmaskable
– Within versus between instruction
• Depends on whether the event prevents instruction completion by
occurring in the middle of execution or whether it is recognized
between instructions.
– Resume versus terminate (fig. 3.40 on page 182).
Action Requirements for Different Exception
Types (Fig. A.27 on page A-42)
• Actions
– Resume
– Terminate
• The most difficult exceptions have two properties:
– They occur within instructions (i.e., in the EX or MEM stage).
– They must be restartable (must save the PC of the
instruction at which to restart).
Exception Handling
– Stopping and restarting execution
• Force a trap instruction on the next IF
• Until the trap is taken, turn off all writes for the faulting instruction and
for all instructions that follow in the pipeline.
• After the exception-handling routine in the operating system receives
control, it immediately saves the PC of the faulting instruction.
                    IF  ID  EX  MEM WB   <--- Faulting instruction
                        IF  ID  EX  MEM WB
                            IF  ID  EX  MEM WB
                                IF  ID  EX  MEM
                                    IF  ID  EX  MEM
Trap instruction ->                     IF  ID  EX  WB
• If delayed branch is used, we need to save and restore as many PCs as the
length of the branch delay plus one.
Precise Interrupt
• An interrupt is precise if the pipeline can be stopped so that the
instructions just before the faulting instruction are completed
and those after it can be restarted from scratch.
– Supporting precise interrupts is a requirement in many
systems.
• Exceptions in DLX
– With pipelining, multiple exceptions may occur in the same
clock cycle. (fig. A.28 on page A-44).
Implementations of Precise Exceptions
– Principle
• The pipeline should be able to handle the exceptions caused by
instruction i prior to the exceptions caused by instruction i+1.
– Implementation
• Hardware posts all exceptions caused by a given instruction in a
status vector associated with that instruction.
• Once an exception indication is set in the exception status vector,
any control signal that may cause a data value to be written is
turned off.
• When an instruction enters WB, the exception status vector is
checked. If any exceptions are posted, they are handled in the order
in which they would occur in time on an unpipelined machine.
– This will guarantee that all exceptions will be seen on instruction i
before any are seen on i+1.
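The status-vector scheme can be sketched in a few lines. The representation below is an assumption made for illustration: each instruction carries a list of (stage, exception) pairs, and instructions are examined in program order as they reach WB. A load followed by an add is used to show that the load's later MEM fault is still handled before the add's earlier IF fault.

```python
# Order in which exceptions would occur in time on an unpipelined machine.
STAGE_ORDER = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3}

def first_exception_to_handle(instructions):
    """instructions: list of (name, status_vector) in program order, where
    status_vector is a list of (stage, exception_name) pairs posted so far.
    Returns (name, stage, exception_name) or None if nothing was posted."""
    for name, status_vector in instructions:  # WB is reached in program order
        if status_vector:
            stage, exc = min(status_vector, key=lambda e: STAGE_ORDER[e[0]])
            return name, stage, exc
    return None

program = [
    ("LD", [("MEM", "data page fault")]),
    ("DADD", [("IF", "instruction page fault")]),
]
print(first_exception_to_handle(program))  # ('LD', 'MEM', 'data page fault')
```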
Instruction Committed
– When an instruction is guaranteed to complete, it is called
committed.
– In the MIPS pipeline, all instructions are committed when
they reach the end of the MEM stage, and no instruction
updates the state before that stage. Thus precise exceptions
are straightforward.
Instruction Set Complications
– Some machines have instructions that change the state in
the middle of instruction execution.
• VAX: Autoincrement addressing mode.
• VAX or IBM 360: String copy.
• Implicitly set condition code.
– This causes difficulties in scheduling any pipeline delays between
setting the condition code and the branch.
ADD XXX      <--- Sets condition code C.
…            <--- Cannot place instructions that change C here.
BR C, YYY    <--- Uses C for the branch.
– In fact, the condition code must be treated as an operand that
requires hazard detection for RAW hazards with branches, regardless of
whether the condition code is set implicitly or explicitly.
• Multicycle operations in VAX
Extending the MIPS Pipeline to Handle Multicycle Operations
– Assuming four separate functional units in our MIPS
implementation
• Integer unit
– Handle loads and stores, ALU operations and branches.
• FP and integer multiplier
• FP adder
• FP and integer divider
– If an instruction cannot proceed to the EX stage, the entire
pipeline behind that instruction will be stalled.
MIPS Pipeline with Multi-cycle Functional
Units
Pipelining Multi-cycle Functional Units
Latency and Initiation (Repeat) Interval
– Latency
• The number of intervening cycles between an instruction that
produces a result and an instruction that uses the result.
– Initiation (repeat) interval
• The number of cycles that must elapse between issuing two
operations of a given type.
– Latency and initiation interval for pipelining multi-cycle
functional units
Functional unit          Latency   Initiation interval
Integer ALU              0         1
Data memory access       1         1
FP add                   3         1
FP (integer) multiply    6         1
FP (integer) divide      24        25
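The two definitions translate directly into a minimum spacing between instruction issues. The sketch below uses the latency and initiation-interval values from this slide; the dictionary keys and function names are illustrative.

```python
# (latency, initiation interval) per functional unit, values from the slide.
UNITS = {
    "integer ALU": (0, 1),
    "data memory access": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}

def min_issue_gap(unit, dependent=False, same_unit=False):
    """Minimum number of cycles between issuing an operation on `unit`
    and issuing the next instruction.

    dependent: the next instruction uses the result (latency applies).
    same_unit: the next instruction needs the same unit (interval applies).
    """
    latency, interval = UNITS[unit]
    gap = 1  # back-to-back issue when nothing constrains it
    if dependent:
        gap = max(gap, latency + 1)
    if same_unit:
        gap = max(gap, interval)
    return gap

def stall_cycles(unit, dependent=False, same_unit=False):
    """Stall cycles relative to issuing in consecutive cycles."""
    return min_issue_gap(unit, dependent, same_unit) - 1

print(stall_cycles("FP add", dependent=True))     # 3
print(stall_cycles("FP divide", same_unit=True))  # 24
```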
Hazards and Forwarding in Longer Latency
Pipelines
– Aspects of hazard detection and forwarding that change in this pipeline:
• Structural hazards can occur because the divide unit is not fully
pipelined.
• The number of register writes required in a cycle can be larger than 1
because instructions have varying running times.
• WAW hazards are possible because instructions no longer reach WB in
order; WAR hazards are not possible because register reads always occur in ID.
• Instructions can complete in a different order than they were
issued, causing problems with exceptions.
• Stalls for RAW hazards will be more frequent because of longer
latency.
• Assuming all hazard detection is done in ID, three checks must be
done before issuing an instruction:
– Check for structural hazards
– Check for a RAW data hazard
– Check for a WAW data hazard
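The three checks can be sketched as one function evaluated in ID. This is a coarse illustration under simplifying assumptions: `busy_units` holds the units that cannot accept a new operation, `pending_dests` holds the destination registers of instructions still in flight, and timing details such as forwarding and register write-port conflicts are ignored. All names are hypothetical.

```python
def issue_check(unit, sources, dest, busy_units, pending_dests):
    """Return the reason an instruction must stall in ID, or None if it can issue."""
    if unit in busy_units:
        return "structural"      # e.g. the divide unit is not fully pipelined
    if pending_dests & set(sources):
        return "RAW"             # a source is still being produced
    if dest is not None and dest in pending_dests:
        return "WAW"             # an earlier write to dest is still pending
    return None

# ADD.D F2, F0, F8 while a DIV.D writing F0 is still executing.
print(issue_check("FP add", ["F0", "F8"], "F2",
                  busy_units={"FP divide"}, pending_dests={"F0"}))  # RAW
```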
RAW Hazards Caused by Longer Pipeline
• Fig. A.33
Structural Hazards in Longer Pipeline
• Fig. A.34
Maintaining Precise Exceptions (1)
– Problems caused by out-of-order completion
DIV.D F0, F2, F4
ADD.D F10, F10, F8
SUB.D F12, F12, F14
– Four possible approaches
• Ignore the problem and settle for imprecise exceptions
• Buffer the results of an operation until all the operations that were
issued earlier are completed.
– History file approach: Buffer the original register values.
– Future file approach: Keep the newer values of registers.
• Allow the exceptions to become somewhat imprecise, but keep
enough information so that the trap-handling routines can create a
precise sequence for exceptions. This means knowing what
operations were in the pipeline and their PCs.
Maintaining Precise Exceptions (2)
Worst-case scenario for the third approach (somewhat imprecise exceptions):
Instruction 1: a long-running instruction that eventually interrupts.
Instruction 2: not completed.
…
Instruction n-1: not completed.
Instruction n: completed. <-- The latest completed instruction.
The software must simulate instructions 1 through n-1 and then restart execution at instruction n+1.
• Allow instruction issue to continue only if it is certain that all
the instructions before the issuing instruction will complete without
causing an exception. This sometimes means stalling the machine
to maintain precise exceptions.
Number of Stalls per FP Operation
Performance of a MIPS FP Pipeline
Overview of The MIPS R4000 Pipeline
• An implementation of MIPS64
• Eight pipeline stages (superpipelining)
Load Delay in MIPS R4000
Branch Delay in MIPS R4000
CPI of MIPS R4000
Concluding Remarks
• We can spend a little money to buy a very powerful
computer today.
