Transcript hazards
Lecture 7: Pipeline Hazards
CS510 Computer Architectures

Pipelining Lessons
[Figure: timeline from 6 PM to 9 PM showing tasks A, B, C and D in task order along the time axis]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

It’s Not That Easy to Achieve the Promised Performance
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches and other instructions that change the PC
• The common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles”, i.e., idle clock cycles, into the pipeline

Structural Hazards / Memory
[Figure: pipeline timing diagram over clock cycles CC1 to CC9; instruction order is LOAD, Instr 1, Instr 2, Instr 3, Instr 4]
Operation on memory by 2 different instructions in the same clock cycle

Structural Hazards with Single-Port Memory
[Figure: pipeline timing diagram over clock cycles CC1 to CC9; LOAD, Instr 1 and Instr 2 proceed while Instr 3 is stalled]
3 cycles stall with 1-port memory

Avoiding Structural Hazard with Dual-Port Memory
[Figure: pipeline timing diagram over clock cycles CC1 to CC9 with separate instruction memory (IM) and data memory (DM) accesses; instruction order is LOAD, Instr 1 through Instr 5]
No stall with 2-port memory

Speed Up Equation for Pipelining
$$\text{Speedup from pipelining} = \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
$$\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}} \quad \text{(pipeline depth = number of pipeline stages)}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Ideal CPI for pipelined machines is almost always 1

Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr} = 1 + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Dual-Port vs Single-Port Memory
• Machine A: 2-port memory (needs no stall for loads); same clock cycle as the unpipelined machine
• Machine B: 1-port memory (needs a 3-cycle stall for each load); clock rate 1.05 times faster than the unpipelined machine
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = [Pipeline Depth/(1 + 0)] x (clock_unpipe/clock_pipe) = Pipeline Depth
SpeedUpB = [Pipeline Depth/(1 + 0.4 x 3)] x (clock_unpipe/(clock_unpipe/1.05)) = (Pipeline Depth/2.2) x 1.05 = 0.48 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.48 x Pipeline Depth) = 2.1
Machine A is about 2.1 times faster

Data Hazard on Registers
[Figure: pipeline timing diagram over clock cycles starting at CC1; the instruction sequence includes XOR R10,R11,R1, which reads R1]
[Figure, continued (CC2 to CC9): ADD R1,R2,R3 followed by SUB R4,R1,R3, AND R6,R1,R7 and OR R8,R1,R9]

Data Hazard on Registers
Registers can be made to read and store in the same cycle: data is stored in the first half of the clock cycle and can be read in the second half of the same clock cycle.
[Figure: one clock cycle of register Ri, with "Store into Ri" in the first half and "Read from Ri" in the second half]

Data Hazard on Registers
[Figure: pipeline timing diagram over clock cycles CC1 to CC9; ADD R1,R2,R3 followed by SUB R4,R1,R3, AND R6,R1,R7, OR R8,R1,R9 and XOR R10,R11,R1]
Needs to stall 2 cycles

Three Generic Data Hazards
Instr_i followed by Instr_j
• Read After Write (RAW): Instr_j tries to read an operand before Instr_i writes it
Instr_i: LW R1, 0(R2)
Instr_j: SUB R4, R1, R5

Three Generic Data Hazards
Instr_i followed by Instr_j
• Write After Read (WAR): Instr_j tries to write an operand before Instr_i reads it
Instr_i: ADD R1, R2, R3
Instr_j: LW R2, 0(R5)
Can’t happen in the DLX 5-stage pipeline because:
– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5

Three Generic Data Hazards
Instr_i followed by Instr_j
• Write After Write (WAW): Instr_j tries to write an operand before Instr_i writes it
– Leaves the wrong result (that of Instr_i, not Instr_j)
Instr_i: LW R1, 0(R2)
Instr_j: LW R1, 0(R3)
Can’t happen in the DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes

Forwarding to Avoid Data Hazards
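Forwarding targets RAW hazards in particular. As a minimal sketch of how the three hazard classes defined on the preceding slides can be told apart, the Python below compares the registers written and read by an earlier and a later instruction. The `Instr` record and the `classify` function are hypothetical names introduced for illustration. The sketch only reports register dependences; whether a dependence becomes an actual stall depends on the pipeline (the slides note that WAR and WAW cannot occur in the 5-stage DLX pipeline).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: registers written and registers read."""
    text: str
    writes: frozenset
    reads: frozenset


def classify(first: Instr, second: Instr) -> set:
    """Return the dependence classes between `first` and a later `second`."""
    found = set()
    if first.writes & second.reads:
        found.add("RAW")  # second reads a register that first writes
    if first.reads & second.writes:
        found.add("WAR")  # second writes a register that first reads
    if first.writes & second.writes:
        found.add("WAW")  # both write the same register
    return found


# The three instruction pairs used as examples on the slides.
raw_pair = (
    Instr("LW R1, 0(R2)", frozenset({"R1"}), frozenset({"R2"})),
    Instr("SUB R4, R1, R5", frozenset({"R4"}), frozenset({"R1", "R5"})),
)
war_pair = (
    Instr("ADD R1, R2, R3", frozenset({"R1"}), frozenset({"R2", "R3"})),
    Instr("LW R2, 0(R5)", frozenset({"R2"}), frozenset({"R5"})),
)
waw_pair = (
    Instr("LW R1, 0(R2)", frozenset({"R1"}), frozenset({"R2"})),
    Instr("LW R1, 0(R3)", frozenset({"R1"}), frozenset({"R3"})),
)

for first, second in (raw_pair, war_pair, waw_pair):
    kinds = sorted(classify(first, second))
    print(f"{first.text:16} then {second.text:16}: {kinds}")
```

Using sets of register names keeps the check independent of operand order, at the cost of ignoring memory operands, which this sketch does not model.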
[Figure: pipeline timing diagram over clock cycles CC1 to CC9; ADD R1,R2,R3 followed by SUB R4,R1,R3, AND R6,R1,R7, OR R8,R1,R9 and XOR R10,R11,R1]

HW Change for Forwarding
[Figure: datapath fragment showing the D/A, A/M and M/W buffers, MUXes, the ALU, the Zero? test and data memory]

Load Delay Due to Data Hazard
[Figure: pipeline timing diagram; LOAD R1,0(R2) followed by SUB R4,R1,R6, AND R6,R1,R7 and OR R8,R1,R9]
Load delay = 2 cycles

Load Delay with Forwarding
[Figure: pipeline timing diagram; LOAD R1,0(R2) followed by SUB R4,R1,R6, AND R6,R1,R7 and OR R8,R1,R9]
Load delay with forwarding = 1 cycle
We need to add HW, called Pipeline Interlock

Software Scheduling to Avoid Load Hazards
Try to produce fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code (with forwarding):
LW   Rb,b
LW   Rc,c
ADD  Ra,Rb,Rc    (stall, RAW)
SW   a,Ra        (stall, RAW)
LW   Re,e
LW   Rf,f
SUB  Rd,Re,Rf    (stall, RAW)
SW   d,Rd        (stall, RAW)

Fast code:
LW   Rb,b
LW   Rc,c
LW   Re,e
ADD  Ra,Rb,Rc
LW   Rf,f
SW   a,Ra
SUB  Rd,Re,Rf
SW   d,Rd        (stall, RAW)

Compiler Avoiding Load Stalls
% loads stalling pipeline (chart axis 0% to 80%):
Program   Unscheduled   Scheduled
gcc       54%           31%
spice     42%           14%
tex       65%           25%

Pipelined DLX Datapath
[Figure: datapath with IF, ID, EX, Mem and WB stages; components include the PC, instruction memory, the F/D, D/A, A/M and M/W buffers, register file, 16-bit sign extend, ALU, Zero? test, adder, data memory, LMD, SMD and MUXes]
Figure annotations:
• Branch address calculation
• Decide condition
• Branch decision for target address

Control Hazard on Branches: Three Stall Cycles
[Figure: pipeline timing diagram over clock cycles CC1 to CC9; program execution order in instructions: 40 BEQ R1,R3,36; 44 AND R12,R2,R5; 48 OR R13,R6,R2; 52 ADD R14,R2,R2; 80 LD R4,R7,100]
Figure annotations: branch target available; the instructions fetched after the branch shouldn’t be executed when the branch condition is true!
Branch delay = 3 cycles

Control Hazard on Branches: Three Stall Cycles
Branch instruction      IF    ID    EX    MEM   WB
Branch successor              IF    ID    EX    MEM
Branch successor + 1                IF    ID    EX
Branch successor + 2                      IF    ID
• We don’t know yet that the instruction being executed is a branch. Fetch the branch successor.
• Now, we know the instruction being executed is a branch. But stall until the branch target address is known.
• Now, the target address is available.
3 wasted clock cycles for the TAKEN branch

Branch Stall Impact
• If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1.9
– About half of the ideal speed
• Two-part solution:
– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
– Compute the TAKEN branch address (branch target) earlier
• DLX branch tests if register = 0 or ≠ 0
• DLX solution: get the new PC earlier
– Move the Zero test to the ID stage
– Additional ADDER to calculate the new PC (taken PC) in the ID stage
– 1 clock cycle penalty for a branch, in contrast to 3 cycles

Pipelined DLX Datapath
[Figure: revised datapath with IF, ID, EX, Mem and WB stages, including an additional adder]
Figure annotations:
• To get the target addr. earlier
• To get the condition earlier
• Target address available after ID
• When a branch instruction is in the Execute stage, the next address is available here
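The gain from resolving branches earlier can be checked with the CPI arithmetic of the Branch Stall Impact slide. The sketch below is a simplified model, not the DLX control logic: it assumes the next fetch has to wait until the branch has completed a given stage, so the penalty equals the number of stages between IF and that stage. The stage list, `branch_penalty` and `cpi_with_branches` are hypothetical names, and resolution in MEM (original datapath) versus ID (revised datapath) is an assumption chosen to match the 3-cycle and 1-cycle penalties quoted on the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_penalty(resolve_stage: str) -> int:
    """Fetch slots lost if the next fetch waits until `resolve_stage` completes."""
    return STAGES.index(resolve_stage)


def cpi_with_branches(branch_freq: float, penalty: float, ideal_cpi: float = 1.0) -> float:
    """CPI = ideal CPI + branch frequency x branch penalty."""
    return ideal_cpi + branch_freq * penalty


# 30% branches, as on the Branch Stall Impact slide.
original = cpi_with_branches(0.30, branch_penalty("MEM"))  # penalty of 3 cycles
revised = cpi_with_branches(0.30, branch_penalty("ID"))    # penalty of 1 cycle

print(f"original datapath CPI: {original:.2f}")
print(f"revised datapath CPI:  {revised:.2f}")
```

The model treats every branch as paying the full penalty, which matches the "stall until branch direction is clear" case only.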
Branch Behavior in Programs
• Conditional branch frequencies
– integer average: 14 to 16%
– floating point: 3 to 12%
• Forward and backward taken branches
– forward taken: 60%
– backward taken: 85%
– the average of all conditional branches: 67%

4 Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict branch NOT TAKEN
• Predict branch TAKEN
• Delayed branch

4 Branch Hazard Alternatives: (1) STALL
Stall until branch direction is clear
Branch instruction      IF    ID    EX    MEM   WB
Branch successor              stall stall stall IF    ID    EX    MEM
Branch successor + 1                                  IF    ID    EX
Branch successor + 2                                        IF    ID
3 cycle penalty

Revised DLX pipeline (get the branch address at EX)
Branch instruction      IF    ID    EX    MEM   WB
Branch successor              stall IF    ID    EX    MEM   WB
Branch successor + 1                      IF    ID    EX    MEM
Branch successor + 2                            IF    ID
1 cycle penalty (Branch Delay Slot)

4 Branch Hazard Alternatives: (2) Predict Branch “NOT TAKEN”
• Execute successor instructions in sequence
• PC+4 is already calculated, so use it to get the next instruction
• Flush instructions in the pipeline if the branch is actually TAKEN
• Advantage of late pipeline state update
• 47% of DLX branches are NOT TAKEN on average

NOT TAKEN branch:
Instruction i           IF    ID    EX    MEM   WB
Instruction i+1               IF    ID    EX    MEM   WB
Instruction i+2                     IF    ID    EX    MEM   WB
No penalty

TAKEN branch:
Instruction i           IF    ID    EX    MEM   WB
Instruction i+1               IF    ID    EX    MEM   WB    (flush this instruction in progress)
Instruction T                       IF    ID    EX    MEM   WB
1 cycle penalty

4 Branch Hazard Alternatives: (3) Predict Branch “TAKEN”
– 53% of DLX branches are TAKEN on average
– Branch target address is available after ID in DLX
• DLX still incurs 1 cycle branch
penalty for TAKEN branch
• Other machines: branch target known before outcome

NOT TAKEN branch (TAKEN address not available at the time of the stall):
Instruction i           IF    ID    EX    MEM   WB
Instruction T                 stall IF
Instruction i+1                           IF    ID    EX    MEM   WB
2 cycle penalty in DLX (1 in other machines)

TAKEN branch (TAKEN address available):
Instruction i           IF    ID    EX    MEM   WB
Instruction T                 stall IF    ID    EX    MEM   WB
Instruction T+1                           IF    ID    EX    MEM   WB
1 cycle penalty in DLX (0 in other machines)

4 Branch Hazard Alternatives: (4) Delayed Branch
Delayed Branch
– Delay the branch so that it takes place AFTER a successor instruction
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
(successors 1 through n form a delayed branch of length n)
– A 1-slot delayed branch allows proper decision and branch target address in the 5-stage DLX pipeline with the control hazard improvement

Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch TAKEN
– From fall through: only valuable when branch NOT TAKEN
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single delayed branch slot:
– Fills about 60% of delayed branch slots
– About 80% of instructions executed in delayed branch slots are useful in computation
– About 50% (60% x 80%) of slots usefully filled

4 Branch Hazard Alternatives: Delayed Branch

From before
Before scheduling: ADD R1, R2, R3 / if R2=0 then / (delay slot)
After scheduling: if R2=0 then / ADD R1, R2, R3 (in the delay slot)
- Always improves performance
- Branch must not depend on the rescheduled instructions

From target
Before scheduling: SUB R4, R5, R6 (at the branch target) / ADD R1, R2, R3 / if R1=0 then / (delay slot)
After scheduling: ADD R1, R2, R3 / if R1=0 then / SUB R4, R5, R6 (in the delay slot)
- Improves performance when TAKEN (loop)
- Must be alright to execute the rescheduled instructions if NOT TAKEN
- May need to duplicate the instruction if it is the target of another branch instr.

From fall through
Before scheduling: ADD R1, R2, R3 / if R1=0 then / (delay slot) / SUB R4, R5, R6
After scheduling: ADD R1, R2, R3 / if R1=0 then / SUB R4, R5, R6 (in the delay slot)
- Improves performance when NOT TAKEN
- Must be alright to execute the instructions if TAKEN

Limitations on Delayed Branch
• Difficulty in finding useful instructions to fill the delayed branch slots
• Solution: squashing
– Delayed branch associated with a branch prediction
– Instructions in the predicted path are executed in the delayed branch slot
– If the branch outcome is mispredicted, instructions in the delayed branch slot are squashed (discarded)

Canceling Branch
• Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
– Restrictions on scheduling instructions in the delay slots
– Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN
• Instruction includes the direction in which the branch was predicted
– When the
branch behaves as predicted, the instructions in the delay slot are executed
– When the branch is incorrectly predicted, the instructions in the delay slot are turned into no-ops
• A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements

Evaluating Branch Alternatives
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
Conditional and unconditional branches collectively: 14% frequency; 65% of branches are TAKEN

Scheduling scheme    Branch penalty   CPI                 Speedup vs unpipelined   Speedup vs stall
Stall pipeline       3                1+0.14x3=1.42       5/1.42=3.5               1.0
Predict Taken        1                1+0.14x1=1.14       5/1.14=4.4               1.26
Predict Not Taken    1                1+0.14x0.65=1.09    5/1.09=4.5               1.29
Delayed branch       0.5              1+0.14x0.5=1.07     5/1.07=4.6               1.31

Static (Compiler) Prediction of Taken/Untaken Branches
Code Motion
   LW   R1, 0(R2)
   SUB  R1, R1, R3      (depends on LW, need to stall)
   BEQZ R1, L
   OR   R4, R5, R6
   ADD  R10, R4, R3
L: ADD  R7, R8, R9
• If the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), this move (of OR R4, R5, R6) can increase speed
• If the branch is almost always TAKEN, R7 is not needed, and R8 and R9 are not modified on the fall-through path, this move (of ADD R7, R8, R9) can increase speed

Static (Compiler) Prediction of Taken/Untaken Branches
[Figure: two bar charts over the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256 and tomcatv, labeled "Always taken" and "Taken backwards, Not Taken forwards"; the y-axes are misprediction rate (0% to 70%) and frequency of misprediction (0% to 14%)]
• Improves strategy for placing instructions in delay slot
• Two strategies
– Direction-based prediction: TAKEN backward branch, NOT TAKEN forward branch
– Profile-based
prediction: record branch behaviors, predict branch based on the prior run(s)

Evaluating Static Branch Prediction Strategies
• Misprediction rate ignores the frequency of branches
• Instructions between mispredicted branches is a better metric
[Figure: instructions per mispredicted branch (log scale, 1 to 100000) for profile-based and direction-based prediction over the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256 and tomcatv]

Pipelining Summary
• Just overlap tasks; easy if tasks are independent
• Speedup <= Pipeline depth; if ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: dynamic prediction, delayed branch slot, static (compiler) prediction
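The summary formula can be tried out numerically. The Python below is an illustration under the lecture's assumptions (ideal CPI of 1, stalls as the only loss); `pipeline_speedup`, `branch_table` and their parameter names are hypothetical. It is applied to the four schemes from the Evaluating Branch Alternatives slide, with a 14% branch frequency, a pipeline depth of 5 and the per-branch penalties used in that slide's CPI column (0.65 for predict-not-taken, since only the 65% of branches that are taken pay the 1-cycle penalty).

```python
def pipeline_speedup(depth: float, stall_cpi: float, clock_ratio: float = 1.0) -> float:
    """depth / (1 + stall CPI) x (unpipelined clock cycle / pipelined clock cycle)."""
    return depth / (1.0 + stall_cpi) * clock_ratio


BRANCH_FREQ = 0.14  # conditional and unconditional branches together
DEPTH = 5           # pipeline depth used on the slide

# Average penalty per branch, in cycles, as used in the slide's CPI column.
EFFECTIVE_PENALTY = {
    "Stall pipeline": 3.0,
    "Predict Taken": 1.0,
    "Predict Not Taken": 0.65,
    "Delayed branch": 0.5,
}


def branch_table() -> dict:
    """Map each scheme to (CPI, speedup over the unpipelined machine)."""
    rows = {}
    for scheme, penalty in EFFECTIVE_PENALTY.items():
        stall_cpi = BRANCH_FREQ * penalty
        rows[scheme] = (1.0 + stall_cpi, pipeline_speedup(DEPTH, stall_cpi))
    return rows


for scheme, (cpi, speedup) in branch_table().items():
    print(f"{scheme:18} CPI = {cpi:.2f}   speedup = {speedup:.2f}")
```

The slide reports speedups to one decimal place, so small differences in the last digit against the printed values are expected.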