CPE 631 Review: Pipelining
Electrical and Computer Engineering, University of Alabama in Huntsville
Aleksandar Milenkovic, [email protected]
http://www.ece.uah.edu/~milenka

Outline
• Pipelined Execution
• 5 Steps in MIPS Datapath
• Pipeline Hazards
  - Structural
  - Data
  - Control

AM LaCASA

Laundry Example (by David Patterson)
• Four loads of clothes: A, B, C, D
• Task: each one to wash, dry, and fold
• Resources
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - “Folder” takes 20 minutes

Sequential Laundry
[Figure: task order (A, B, C, D) vs. time, 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry
[Figure: task order (A, B, C, D) vs. time, 6 PM to midnight; time segments of 30, 40, 40, 40, 40, 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

Computer Pipelines
• Computers execute billions of instructions, so throughput is what matters
• What is desirable in instruction sets for pipelining?
  - Variable-length instructions vs. all instructions the same length?
  - Memory operands part of any operation vs. memory operands only in loads or stores?
  - Register operands in many places in the instruction format vs. registers located in the same place?
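
The laundry timings quoted on the earlier slides (6 hours sequential, 3.5 hours pipelined) can be reproduced with a short sketch. This is only an illustration, not part of the original slides: the function names are made up here, and the model assumes each stage handles one load at a time and that finished loads can wait between stages for as long as needed.

```python
def sequential_minutes(stage_minutes, loads):
    """Total time when each load finishes all stages before the next one starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Total time when a load enters a stage as soon as the stage is free.

    Assumption (not stated on the slides): loads may wait between stages,
    e.g. washed clothes sit in a basket until the dryer is free.
    """
    stage_free_at = [0] * len(stage_minutes)  # time at which each stage becomes idle
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for stage, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[stage])
            stage_free_at[stage] = start + minutes
            ready = stage_free_at[stage]
    return stage_free_at[-1]


if __name__ == "__main__":
    stages = [30, 40, 20]  # washer, dryer, folder (minutes)
    print(sequential_minutes(stages, 4) / 60, "hours sequential")
    print(pipelined_minutes(stages, 4) / 60, "hours pipelined")
```

In this model the 40-minute dryer sets the pace once the pipeline is full, which is the "slowest stage limits the rate" lesson in numeric form.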
A "Typical" RISC
• Registers
  - 32 64-bit general-purpose (integer) registers (R0-R31)
  - 32 64-bit floating-point registers (F0-F31)
• Data types
  - 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double words for integer data
  - 32-bit single- or 64-bit double-precision numbers
• Addressing modes for MIPS data transfers
  - Load-store architecture: Immediate, Displacement
  - Memory is byte addressable with a 64-bit address
  - Mode bit to select Big Endian or Little Endian

MIPS64 Instruction Formats
Format                 Bits 31-26  25-21  20-16  15-11  10-6   5-0
Register-Register      Op          Rs     Rt     Rd     shamt  funct
Register-Immediate     Op          Rs     Rt     immediate (15-0)
Jump / Call            Op          address (25-0)
Floating-point (FR)    Op          Fmt    Ft     Fs     Fd     funct
Floating-point (FI)    Op          Fmt    Ft     immediate (15-0)

MIPS64 Instructions
MIPS Operations (See Appendix B, Figure B.26)
• Data Transfers (LB, LBU, SB, LH, LHU, SH, LW, LWU, SW, LD, SD, L.S, L.D, S.S, S.D, MFC0, MTC0, MOV.S, MOV.D, MFC1, MTC1)
• Arithmetic/Logical (DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU, MADD, AND, ANDI, OR, ORI, XOR, XORI, LUI, DSLL, DSRL, DSRA, DSLLV, DSRLV, DSRAV, SLT, SLTI, SLTU, SLTIU)
• Control (BEQZ, BNEZ, BEQ, BNE, BC1T, BC1F, MOVN, MOVZ, J, JR, JAL, JALR, TRAP, ERET)
• Floating Point (ADD.D, ADD.S, ADD.PS, SUB.D, SUB.S, SUB.PS, MUL.D, MUL.S, MUL.PS, MADD.D, MADD.S, MADD.PS, DIV.D, DIV.S, DIV.PS, CVT._._, C._.D, C._.S)

5 Steps of Simple RISC Datapath
[Figure: datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, Next SEQ PC, Adder (+4), Inst Memory, Reg File (RS1, RS2, RD), Imm, Sign Extend, MUXes, Zero?, ALU, Data Memory, LMD, WB Data]

5 Steps of Simple RISC Datapath (cont’d)
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the five steps; RD is carried along through the pipeline registers]
• Data stationary control
  - local decode for each instruction phase / pipeline stage

Visualizing Pipeline
[Figure: instructions in program order vs. time (clock cycles CC 1 to CC 7); each instruction passes through IM, Reg, ALU, DM, Reg]

Instruction Flow through Pipeline
[Figure: contents of the pipeline in clock cycles CC 1 to CC 4 as Add R1,R2,R3; Lw R4,0(R2); Sub R6,R5,R7; and Xor R9,R8,R1 enter in turn; stages not yet reached hold Nop]

Simple RISC Pipeline Definition: IF, ID
Stage IF
  IF/ID.IR ← Mem[PC];
  if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT}
  else {IF/ID.NPC, PC ← PC + 4};
Stage ID
  ID/EX.A ← Regs[IF/ID.IR[6…10]];
  ID/EX.B ← Regs[IF/ID.IR[11…15]];
  ID/EX.Imm ← (IF/ID.IR[16])^16 ## IF/ID.IR[16…31];
  ID/EX.NPC ← IF/ID.NPC;
  ID/EX.IR ← IF/ID.IR;

Simple RISC Pipeline Definition: EX
ALU
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or
  EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  EX/MEM.cond ← 0;
load/store
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.B ← ID/EX.B;
  EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
  EX/MEM.cond ← 0;
branch
  EX/MEM.ALUOUT ← ID/EX.NPC + (ID/EX.Imm << 2);
  EX/MEM.cond ← (ID/EX.A func 0);

Simple RISC Pipeline Def.: MEM, WB
Stage MEM
  ALU
    MEM/WB.IR ← EX/MEM.IR;
    MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
  load/store
    MEM/WB.IR ← EX/MEM.IR;
    MEM/WB.LMD ← Mem[EX/MEM.ALUOUT]; or
    Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
Stage WB
  ALU
    Regs[MEM/WB.IR[16…20]] ← MEM/WB.ALUOUT; or
    Regs[MEM/WB.IR[11…15]] ← MEM/WB.ALUOUT;
  load
    Regs[MEM/WB.IR[11…15]] ← MEM/WB.LMD;

It's Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards
[Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order vs. time (Cycle 1 to Cycle 7); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

One Memory Port/Structural Hazards (cont’d)
[Figure: Load, Instr 1, Instr 2, Stall, Instr 3 in program order vs. time (Cycle 1 to Cycle 7); the stall appears as a row of bubbles and Instr 3 starts one cycle later]

Data Hazard on R1
[Figure: pipeline stages IF, ID/RF, EX, MEM, WB vs. time (clock cycles) for the sequence below]
  dadd r1,r2,r3
  dsub r4,r1,r3
  and  r6,r1,r7
  or   r8,r1,r9
  xor  r10,r1,r11

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5

Forwarding to Avoid Data Hazard
[Figure: pipeline stages Ifetch, Reg, ALU, DMem, Reg vs. time (clock cycles) for the sequence below, with the result for r1 forwarded to the later instructions]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

HW Change for Forwarding
[Figure: datapath fragment with NextPC, Registers, Immediate, ID/EX, input muxes in front of the ALU, EX/MEM, Data Memory, and MEM/WR]

Forwarding to DM input
[Figure: clock cycles CC 1 to CC 7 for the sequence below]
  add R1,R2,R3
  lw  R4,0(R1)
  sw  12(R1),R4
- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)
- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)
- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)

Forwarding to DM input (cont’d)
[Figure: clock cycles CC 1 to CC 6 for the sequence below]
  add R1,R2,R3
  sw  0(R4),R1
- Forward R1 from MEM/WB.ALUOUT to DM input

Forwarding to Zero
[Figure: clock cycles CC 1 to CC 6 for the two sequences below; Z marks the zero test]
- Forward R1 from EX/MEM.ALUOUT to Zero
    add  R1,R2,R3
    beqz R1,50
- Forward R1 from MEM/WB.ALUOUT to Zero
    add  R1,R2,R3
    sub  R4,R5,R6
    bneq R1,50

Data Hazard Even with Forwarding
[Figure: pipeline stages Ifetch, Reg, ALU, DMem, Reg vs. time (clock cycles) for the sequence below]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

Data Hazard Even with Forwarding
[Figure: the same sequence with a bubble inserted after the lw, delaying sub, and, and or by one cycle]

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

Control Hazard on Branches: Three Stage Stall
[Figure: pipeline stages Ifetch, Reg, ALU, DMem, Reg vs. time for the sequence below]
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11

Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two-part solution:
  - Determine whether the branch is taken or not sooner, AND
  - Compute the taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS solution:
  - Move the Zero test to the ID/RF stage
  - Add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for a branch versus 3

Pipelined Simple RISC Datapath
[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; the Zero? test and a second Adder sit in the decode stage]
• Data stationary control
  - local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - “Squash” instructions in the pipeline if the branch is actually taken
  - Advantage of late pipeline state update
  - 47% of MIPS branches are not taken on average
  - PC+4 is already calculated, so use it to get the next instruction

Branch not Taken
Branch is untaken (determined during ID): we have fetched the fall-through and just continue; no wasted cycles.
  branch (not taken)   IF   ID   Ex   Mem  WB
  Ii+1                      IF   ID   Ex   Mem  WB
  Ii+2                           IF   ID   Ex   Mem  WB
Branch is taken (determined during ID): restart the fetch at the branch target; one cycle wasted.
  branch (taken)       IF   ID   Ex   Mem  WB
  Ii+1                      IF   idle idle idle idle
  branch target                  IF   ID   Ex   Mem  WB
  branch target+1                     IF   ID   Ex   Mem  WB
(Rows are instructions; columns are time in clocks.)

Four Branch Hazard Alternatives
#3: Predict Branch Taken
  - Treat every branch as taken
  - 53% of MIPS branches are taken on average
  - But MIPS hasn’t calculated the branch target address yet, so it still incurs a 1-cycle branch penalty
  - Makes sense only when the branch target is known before the branch outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
  - Define the branch to take place AFTER a following instruction
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n
      branch target if taken
    (branch delay of length n)
  - A 1-slot delay allows proper decision and branch target address in the 5-stage pipeline
  - MIPS uses this

Delayed Branch
Where to get instructions to fill the branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken

Scheduling the branch delay slot: From Before
  ADD R1,R2,R3
  if (R2=0) then
    <Delay Slot>
Becomes
  if (R2=0) then
    <ADD R1,R2,R3>
• Delay slot is scheduled with an independent instruction from before the branch
• Best choice, always improves performance

Scheduling the branch delay slot: From Target
  SUB R4,R5,R6
  ...
  ADD R1,R2,R3
  if (R1=0) then
    <Delay Slot>
Becomes
  ...
  ADD R1,R2,R3
  if (R1=0) then
    <SUB R4,R5,R6>
• Delay slot is scheduled from the target of the branch
• Must be OK to execute that instruction if the branch is not taken
• Usually the target instruction will need to be copied because it can be reached by another path, so programs are enlarged
• Preferred when the branch is taken with high probability

Scheduling the branch delay slot: From Fall Through
  ADD R1,R2,R3
  if (R2=0) then
    <Delay Slot>
  SUB R4,R5,R6
Becomes
  ADD R1,R2,R3
  if (R2=0) then
    <SUB R4,R5,R6>
• Delay slot is scheduled from the fall-through (not-taken) path
• Must be OK to execute that instruction if the branch is taken
• Improves performance when the branch is not taken

Delayed Branch Effectiveness
• Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches
• Assume the solution was stalling for 3 cycles
• If 30% branch, stall 3 cycles

  Op      Freq  Cycles  CPI(i)  (% Time)
  Other   70%   1       0.7     (37%)
  Branch  30%   4       1.2     (63%)

=> new CPI = 1.9, or almost 2 times slower

Example 2: Speed Up Equation for Pipelining
\[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction} \]
\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]
For a simple RISC pipeline, CPI = 1:
\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]

Example 3: Evaluating Branch Alternatives (for 1 program)

  Scheduling scheme   Branch penalty  CPI   Speedup v. stall
  Stall pipeline      3               1.42  1.0
  Predict taken       1               1.14  1.26
  Predict not taken   1               1.09  1.29
  Delayed branch      0.5             1.07  1.31

Conditional and unconditional branches = 14% of instructions; 65% of them change the PC

Example 4: Dual-port vs. Single-port
• Machine A: dual-ported memory (“Harvard Architecture”)
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads and stores are 40% of instructions executed

Extended Simple RISC Pipeline
• DLX pipe with three unpipelined FP functional units
[Figure: IF and ID feed four parallel execution units (EX Int, EX FP/I Mult, EX FP Add, EX FP/I Div), which feed Mem and WB]
• In reality, the intermediate results are probably not cycled around the EX unit; instead, the EX stage has some number of clock delays larger than 1

Extended Simple RISC Pipeline (cont’d)
• Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations
• Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result

  Functional unit       Latency  Initiation interval
  Integer ALU           0        1
  Data Memory           1        1
  FP Add                3        1
  FP/Integer Multiply   6        1
  FP/Integer Divide     24       25

Extended Simple RISC Pipeline (cont’d)
[Figure: IF and ID feed parallel execution paths Ex, M1 to M7, A1 to A4, and "..", which rejoin at M and WB]
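
The latency and initiation-interval definitions above translate directly into stall counts. The sketch below is an illustration under stated assumptions, not the slides' own method: the dictionary keys and function names are invented here, results are assumed to be forwarded as early as possible, and the consumer is assumed to need its operand at the start of execution (a store, which needs its data later, is not modelled).

```python
# (latency, initiation interval) per functional unit, copied from the table above.
# The keys are hypothetical labels chosen for this example.
UNITS = {
    "int_alu": (0, 1),
    "data_mem": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (24, 25),
}


def raw_stalls(producer_unit, independent_between=0):
    """Stall cycles for a consumer of the producer's result.

    independent_between counts unrelated instructions issued between
    the producer and the consumer; each one hides a cycle of latency.
    """
    latency, _ = UNITS[producer_unit]
    return max(0, latency - independent_between)


def structural_stalls(unit, independent_between=0):
    """Stall cycles for a second operation that needs the same unit."""
    _, interval = UNITS[unit]
    # Back-to-back issue is already one cycle apart, hence interval - 1.
    return max(0, interval - 1 - independent_between)


if __name__ == "__main__":
    print("consumer right after FP multiply:", raw_stalls("fp_mul"))
    print("consumer right after load:", raw_stalls("data_mem"))
    print("second divide right after a divide:", structural_stalls("fp_div"))
```

With these definitions a fully pipelined unit (initiation interval 1) never causes a structural stall, while the unpipelined divider blocks a following divide for 24 cycles.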

Extended Simple RISC Pipeline (cont’d)
• Multiple outstanding FP operations
  - FP/I Adder and Multiplier are fully pipelined
  - FP/I Divider is not pipelined
• Pipeline timing for independent operations

  MUL.D   IF  ID  M1  M2  M3  M4  M5  M6  M7  Mem WB
  ADD.D   IF  ID  A1  A2  A3  A4  Mem WB
  L.D     IF  ID  Ex  Mem WB
  S.D     IF  ID  Ex  Mem WB

Hazards and Forwarding in Longer Pipes
• Structural hazard: divide unit is not fully pipelined
  - detect it and stall the instruction
• Structural hazard: number of register writes can be larger than one due to varying running times
• WAW hazards are possible
• Exceptions!
  - instructions can complete in a different order than they were issued
• RAW hazards will be more frequent

Examples
• Stalls arising from RAW hazards (columns are clock cycles)

Instruction     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
L.D   F4,0(R2)  IF    ID    EX    Mem   WB
MUL.D F0,F4,F6        IF    ID    stall M1    M2    M3    M4    M5    M6    M7    Mem   WB
ADD.D F2,F0,F8              IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    Mem   WB
S.D   0(R2),F2                          IF    stall stall stall stall stall stall ID    EX    stall stall stall Mem

• Three instructions that want to perform a write back to the FP register file simultaneously

Instruction     1   2   3   4   5   6   7   8   9   10  11
MUL.D F0,F4,F6  IF  ID  M1  M2  M3  M4  M5  M6  M7  Mem WB
...                 IF  ID  EX  Mem WB
...                     IF  ID  EX  Mem WB
ADD.D F2,F4,F6              IF  ID  A1  A2  A3  A4  Mem WB
...                             IF  ID  EX  Mem WB
...                                 IF  ID  EX  Mem WB
L.D   F2,0(R2)                          IF  ID  EX  Mem WB

Solving Register Write Conflicts
• First approach: track the use of the write port in the ID stage and stall an instruction before it issues
  - use a shift register that indicates when already issued instructions will use the register file
  - if there is a conflict with an already issued instruction, stall the instruction for one clock cycle
  - on each clock cycle the reservation register is shifted one bit
• Alternative approach: stall a conflicting instruction when it tries to enter the MEM or WB stage
  - we can stall either instruction
  - e.g. give priority to the unit with the longest latency
  - Pros: the conflict does not have to be detected until the entrance of the MEM or WB stage
  - Cons: complicates pipeline control; stalls can now arise from two different places

WAW Hazards
[Figure: pipeline timing in which ADD.D F2,F4,F6 (IF ID A1 A2 A3 A4 Mem WB) and a later L.D F2,0(R2) (IF ID EX Mem WB) both write F2, with single-cycle instructions (IF ID EX Mem WB) around them]
• Result of ADD.D is overwritten without any instruction ever using it
  - WAWs occur when a useless instruction is executed
  - still, we must detect them and provide correct execution. Why?

        BNEZ   R1, foo
        DIV.D  F0, F2, F4   ; delay slot from fall-through
        ...
  foo:  L.D    F0, qrs

Solving WAW Hazards
• First approach: delay the issue of the load instruction until ADD.D enters MEM
• Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADD.D does not write; L.D issues right away
• Detect the hazard in ID when L.D is issuing
  - stall L.D, or
  - make ADD.D a no-op
• Luckily this hazard is rare

Hazard Detection in ID Stage
• Possible hazards
  - hazards among FP instructions
  - hazards between an FP instruction and an integer instruction
    (FP and integer registers are distinct, except for FP load-stores and FP-integer moves)
• Assume that the pipeline does all hazard detection in the ID stage

Hazard Detection in ID Stage (cont’d)
• Check for structural hazards
  - wait until the required functional unit is not busy and make sure that the register write port is available
• Check for RAW data hazards
  - wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result
• Check for WAW data hazards
  - determine if any instruction in A1, .. A4, M1, .. M7, D has the same register destination as this instruction; if so, stall the issue of the instruction in ID

Forwarding Logic
• Check if the destination register in any of the EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of an FP instruction
• If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data
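
The WAW check and the forwarding check described above can be sketched in a few lines. Everything named here (`InFlight`, `waw_stall`, `forwarding_sources`, the dictionary layout) is hypothetical and chosen for illustration; real control logic is combinational hardware, and the priority order among pipeline registers used below is an assumption, not something the slides specify.

```python
from dataclasses import dataclass

# Multi-cycle execution stages named on the slide: A1..A4, M1..M7, D.
MULTICYCLE_STAGES = (
    {f"A{i}" for i in range(1, 5)} | {f"M{i}" for i in range(1, 8)} | {"D"}
)

# Pipeline registers checked by the forwarding logic, in the order listed
# on the slide. Using this order as a priority is an assumption.
FORWARD_SOURCES = ("EX/MEM", "A4/MEM", "M7/MEM", "D/MEM", "MEM/WB")


@dataclass(frozen=True)
class InFlight:
    dest: str   # destination register, e.g. "F2"
    stage: str  # stage currently occupied, e.g. "A2" or "M5"


def waw_stall(dest, in_flight):
    """True if an instruction in A1..A4, M1..M7, or D writes the same register."""
    return any(i.dest == dest and i.stage in MULTICYCLE_STAGES for i in in_flight)


def forwarding_sources(srcs, pipeline_regs):
    """Map each source register to the pipeline register that should feed it.

    pipeline_regs maps a pipeline-register name to the destination register
    it currently holds (or None). Sources with no match are omitted.
    """
    chosen = {}
    for src in srcs:
        for name in FORWARD_SOURCES:
            if pipeline_regs.get(name) == src:
                chosen[src] = name
                break
    return chosen


if __name__ == "__main__":
    busy = [InFlight("F2", "A2"), InFlight("F0", "M5")]
    print("stall L.D F2?", waw_stall("F2", busy))
    regs = {"EX/MEM": None, "A4/MEM": "F2", "MEM/WB": "F2"}
    print(forwarding_sources(["F2", "F6"], regs))
```

A source register that matches no pipeline register simply reads the register file, which is why unmatched sources are left out of the returned mapping.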