5. Microarchitecture of Superscalars (3): Branch Prediction
Dezső Sima, Fall 2006

Branch prediction
• 1. Introduction
• 2. Basic branch prediction mechanisms
• 3. Auxiliary branch prediction mechanisms
• 4. Accessing the branch target path

1.1 The branch processing problem of pipelining

Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline
[Pipeline diagram, stages F, D, E, W. Steps: branch fetching; branch detection and BTA calculation; BTI fetching. Cost: 2 bubbles.]

Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline with immediate condition resolution
[Pipeline diagram. Steps: bc fetching; bc detection; condition checking (branch!) and BTA calculation; BTI fetching. Cost: 3 bubbles.]

Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline, with delayed condition resolution
[Pipeline diagram. Steps: bc fetching; bc detection; condition checking repeated over several cycles (dynamic stop) until the condition resolves (branch!) and the BTA is calculated; BTI fetching. Cost: a large number of bubbles.]

Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors
[Chart, number of pipeline stages (0 to 40) versus year (1990 to 2005): Pentium (5), Pentium Pro (~12), K6 (6), Athlon (6), Pentium 4 (~20), P4 Prescott (~30), Athlon-64 (12), Core Duo / Conroe (14).]

1.2 Branch statistics

Figure 1.5: Dynamic ratio of branches

Figure 1.6: Ratio of the main instruction types
Source: Stephens et al. „Instruction level profiling and evaluation of the IBM RS/6000”, Proc. 18th ISCA, pp. 137-146

Figure 1.7: Grohoski’s estimate of branch statistics
Branches
  Unconditional branches (simple unconditional branch, branch to subroutine, return from subroutine): ~ 1/3, taken
  Conditional branches
    Loop-closing conditional branches: ~ 1/3, taken for the first (n-1) iterations
    Other conditional branches: ~ 1/3, of which taken ~ 1/6 and not taken ~ 1/6
  In total: taken ~ 5/6, not taken ~ 1/6
Source: Grohoski, G.F, IBM J. Res. Develop., 34 Jan. pp. 37-58

Figure 1.8: Frequency of taken and not taken branches
Reference               Frequency of taken branches   Frequency of not taken branches
Lee, Smith 1984         57 - 99 %                     1 - 43 %
Edenfield & al. 1990    75 %                          25 %
Grohoski 1990           ~ 5/6                         ~ 1/6
Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 303

1.3 The principle of branch prediction

Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four stage pipeline
[Pipeline diagram. At bc detection the branch is predicted (branch!) and the BTA is calculated; the BTI is fetched and decoded speculatively while condition checking continues. When the condition resolves (branch!), the speculative execution is acknowledged. Cost: 2 bubbles.]

Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four stage pipeline
[Pipeline diagram. The branch is predicted taken (branch!), the BTA is calculated and the BTI is fetched and decoded speculatively. Condition checking finally yields "no branch!", so fetching has to restart with the sequential instruction. Cost: a large number of bubbles.]

Figure 1.11: Branch misprediction penalty on a long pipeline
[Pipeline diagram, stages F1, F2, F3, D1, D2, E1, E2, W. Steps: bc fetching; bc detection, BTA calculation and branch prediction (no branch!); condition checking reveals the misprediction (branch!); BTI fetching. The cycles lost between the prediction and the BTI fetch form the misprediction penalty.]

1.4 Branch prediction accuracy/penalty

Figure 1.12: Branch prediction accuracy
Columns: Processor; guessing method (relevant for prediction accuracy); implementation; prediction accuracy; reference.
• Am 29000 (1987): implicit dynamic; 32-entry two-way set associative BTIC; 60 % for repetitive branches; Weiss 1987
• MC 88110 (1991): implicit dynamic, overridden by opcode-based static; 32-entry fully associative BTIC; 70 % on SPEC; Diefendorff, Allen 1992
• MC 68060 (1993): 2-bit dynamic; 256-entry BTAC; > 90 %; Circello, Goodrich 1993
• MIPS R10000 (1996): 2-bit dynamic; 512-entry BHT; 90 %; Halfhill, 1994
• PowerPC 620 (1995): implicit dynamic, augmented with 2-bit dynamic; 256-entry fully associative BTAC, 2-Kentry BHT; 90 %; Thomson, Ryan 1994
• PA-8000 (1995): implicit dynamic, overridden by 3-bit dynamic or compiler based static; 32-entry fully associative BTAC, 256-entry BHT; 80 % on SPECint92; Gwennap 1994
• UltraSparc (1995): 2-bit dynamic; 2 K-entries in the IC, each shared among two instructions; 88 % on SPECint92, 94 % on SPECfp92; Wayner 1994
BHT: Branch history table, BTIC: Branch target instruction cache, BTAC: Branch target address cache, IC: Instruction cache
Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 340

Effective penalty of branch processing (simplified):

  $P = f_c \cdot P_c + f_m \cdot P_m$

  $f_c$: Probability (frequency) of correctly predicted branches
  $f_m$: Probability (frequency) of mispredicted branches
  $P_c$: Penalty of correctly predicted branches
  $P_m$: Penalty of mispredicted branches

If $P_c = 0$, then $P = f_m \cdot P_m$.

Examples:
Processor        $f_m$    $P_m$        $P$
PPro             0.1      10 cycles    1
P4 Willamette    0.05     20 cycles    1
P4 Prescott      0.05     30 cycles    1.5

2. Basic branch prediction mechanisms

2.1 Introduction

Branch processing consists of:
• Branch detection
• Branch prediction
• Accessing the branch target path

Branch prediction mechanisms:
• Basic branch prediction mechanism
• Auxiliary branch prediction mechanism

Basic branch prediction mechanism:
• Processor based
  • Local
  • Global (2-level)
  • Combined (Choice prediction)
• Compiler hints

Figure 2.1.: Local prediction
Prediction depends only on the behaviour of the branch considered.

Figure 2.2.: Global prediction
Prediction depends on the actual execution path, that is on all branches executed. [The diagram shows two execution paths (Path 1, Path 2) with different taken/not taken patterns leading to the branch to be predicted.]

2.2. Local prediction

Local prediction is either 1-level or 2-level.

1-level (local) prediction:
• Fixed prediction: always the same prediction ('always not taken' approach or 'always taken' approach)
• Static prediction: based on the object code (displacement-based or opcode-based)
• Dynamic prediction: based on the execution history (1-bit prediction)
Processors shown in the chart (their assignment to the categories is not recoverable from the transcript): 80486 (1989), MC 68040 (1990), SuperSparc (1992), R4000 (1992), POWER1 (1990), POWER2 (1993), R8000 (1994), PPC 601 (1993). PPC: PowerPC

Figure 2.3: Principle of the 1-bit dynamic prediction
[The IFA addresses an entry of the BHT (Branch History Table) holding one bit x. x = 0: sequential continuation, x = 1: branch.]

Figure 2.4: State transition diagram of the 1-bit dynamic prediction
[Two states, "Not taken" and "Taken". T (branch has been taken) leads to "Taken", NT (branch has not been taken) leads to "Not taken".]
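The 1-bit scheme of Figures 2.3 and 2.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table size, the indexing by the low-order IFA bits, the initial bit value 0, and the names `OneBitPredictor` and `count_mispredictions` are hypothetical and do not come from the slides.

```python
class OneBitPredictor:
    """1-bit dynamic prediction: one history bit per BHT entry."""

    def __init__(self, entries=256):
        # Assumption: entries is a power of two, so masking selects the index.
        self.mask = entries - 1
        self.bht = [0] * entries  # 0: sequential continuation, 1: branch

    def predict(self, ifa):
        # Predict "taken" if the branch was taken the last time.
        return self.bht[ifa & self.mask] == 1

    def update(self, ifa, taken):
        # The bit simply records the latest outcome (Figure 2.4).
        self.bht[ifa & self.mask] = 1 if taken else 0


def count_mispredictions(predictor, ifa, outcomes):
    """Replay a sequence of outcomes for one branch and count the misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# A loop-closing branch: taken three times, then not taken, for five passes.
loop_branch = [True, True, True, False] * 5
print(count_mispredictions(OneBitPredictor(), 0x1234, loop_branch))  # prints 10
```

In this replay the 1-bit scheme mispredicts twice per pass of the loop: once at the loop exit and once again when the loop is re-entered.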
Local prediction (continued)

The classification of 1-level (local) prediction (fixed, static, dynamic) is extended by 2-bit prediction as a second form of dynamic prediction based on the execution history.
Processors shown in the chart (their assignment to the categories is not recoverable from the transcript): 80486 (1989), Pentium (1993), MC 68040 (1990), MC 68060 (1993), SuperSparc (1992), UltraSparc (1995), R4000 (1992), R8000 (1994), R10000 (1996), POWER1 (1990), POWER2 (1993), PPC 601 (1993), PPC 604 (1995), PPC 620 (1996). PPC: PowerPC

Figure 2.6: Principle of the 2-bit dynamic prediction
[The IFA addresses an entry of the BHT (Branch History Table) holding two bits xx. xx = 00, 01: sequential continuation; xx = 10, 11: branch.]

Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm)
States:
  11 Strongly taken        Prediction: "Taken"
  10 Weakly taken          Prediction: "Taken"
  01 Weakly not taken      Prediction: "Not Taken"
  00 Strongly not taken    Prediction: "Not Taken"
Transitions: AT (branch actually taken) moves one state towards "Strongly taken"; ANT (branch actually not taken) moves one state towards "Strongly not taken". An entry is initialised when its branch is taken first.

Accessing BHTs/BTACs

Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers
• Indexed access: index bits of the IFA select a counter (C) in the BHT. For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interference and thus in degraded prediction accuracy.
  Examples: 16K entry local BHT (Power4), 16K entry global BHT (Power4), 16K entry selector table (Power4)
• Cache-like access (direct / set associative): index bits of the IFA select a set, tag bits of the IFA are compared with the stored tags (e.g. two-way set associative). Reduces interference but increases cost.
  Examples: 128*4 way BHT/BTAC (Pentium Pro), 1K*4 way BHT/BTAC (Pentium II, III, 4), 128*2 way BTAC (Power3)
• Associative access: the IFA is compared with all stored tags. Avoids interference but strongly increases cost.
  Example: 64 entry BTAC (PPC 604)
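A minimal Python sketch of the 2-bit scheme (Figures 2.6 and 2.7) combined with the indexed access of Figure 2.5 is given below. The class name `TwoBitPredictor`, the table size of 512 entries, and the choice of 11 (strongly taken) as the initial state are assumptions made for illustration only.

```python
class TwoBitPredictor:
    """2-bit saturating counters (Smith algorithm) in an indexed BHT."""

    def __init__(self, entries=512, initial=0b11):
        # Assumptions: entries is a power of two; every counter starts in
        # state 11 (strongly taken). Real designs may initialise differently.
        self.mask = entries - 1
        self.bht = [initial] * entries

    def predict(self, ifa):
        # States 10 and 11 predict "taken", 00 and 01 "not taken".
        return self.bht[ifa & self.mask] >= 0b10

    def update(self, ifa, taken):
        i = ifa & self.mask
        if taken:
            self.bht[i] = min(self.bht[i] + 1, 0b11)  # AT: towards 11
        else:
            self.bht[i] = max(self.bht[i] - 1, 0b00)  # ANT: towards 00


predictor = TwoBitPredictor()
misses = 0
# A loop-closing branch: taken three times, then not taken, for five passes.
for taken in [True, True, True, False] * 5:
    if predictor.predict(0x040) != taken:
        misses += 1
    predictor.update(0x040, taken)
print(misses)  # prints 5: only the loop exit is mispredicted in each pass

# Interference with indexed access: 0x040 and 0x240 share entry 0x040
# in a 512-entry table, so training one branch changes the other's prediction.
shared = TwoBitPredictor()
shared.update(0x040, False)
shared.update(0x040, False)
print(shared.predict(0x240))  # prints False
```

The second part of the example shows the interference mentioned for small indexed tables: two branches that differ only in address bits above the index share one counter.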
Local prediction (continued)

Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined, 1. and 2. generation superscalars
1-level (local) prediction:
• Fixed prediction: always the same prediction ('always not taken' approach or 'always taken' approach)
• Static prediction: based on the object code (displacement-based or opcode-based)
• Dynamic prediction: based on the execution history (1-bit prediction, 2-bit prediction, 3-bit prediction)
Processors shown in the chart (their assignment to the categories is not recoverable from the transcript): 80486 (1989), Pentium (1993), MC 68040 (1990), MC 68060 (1993), SuperSparc (1992), UltraSparc (1995), R4000 (1992), R8000 (1994), R10000 (1996), POWER1 (1990), POWER2 (1993), PPC 601 (1993), PPC 604 (1995), PPC 620 (1996). PPC: PowerPC

Local prediction:
• 1-level: fixed prediction (always the same prediction), static prediction (based on the object code), dynamic prediction (based on the execution history)
• 2-level

2-level local branch prediction
(1.-level: branch patterns, 2.-level: history bits)
• Shared counters, with a shared table for all patterns (Alpha 21264): the IFA selects an entry of a local BHT (e.g. 1K×10 bit); the 10-bit pattern read out (e.g. 1100101001) selects an entry of a second local BHT (e.g. 1K×3 bit, e.g. 101: branch). The 21264 uses 3-bit saturating counters whose most significant bit provides the prediction.
• Individual counters, with individual tables for different patterns (Pentium Pro): the IFA selects an entry of a local BHT (e.g. 128×4 bit, e.g. 4 ways each); the 4-bit pattern read out (e.g. 0110) selects one counter of a table belonging to that entry (e.g. 16×2 bit, e.g. 10: branch).

Figure 2.9.: The principle of Pentium Pro’s 128x4 way set associative BHT
[The linear BTA is split into tag and index fields. The index selects one of 128 sets (0 to 127). Each of the four ways (Way 0 to Way 3) holds tags, a 4-bit history and 16 two-bit counters (0 to 15). The 4-bit history selects the counter xx; xx = 00/01: not taken, 10/11: taken.]

Figure 2.10.: The actual layout of Pentium Pro’s 128x4 way set associative BHT
[Each of the 128 sets (0 to 127) holds four entries, each consisting of a tag (Tag), a history (H) and counters (C).]

2.3. Global prediction

Global prediction:
• Simple global
• Gshare
• Gselect

Figure 2.11.: Simple global prediction
[The global history, kept in a shift register (e.g. 0 1 1 0 0 1 1), addresses the BHT entry x that holds the branch history.]

Figure 2.12.: Principle of the Gshare prediction
[The global history (e.g. 0 1 1 0 0 1 1) is XORed with bits of the IFA (e.g. 1 0 0 1 1 0 0); the result addresses the BHT entry x.]

Figure 2.13.: Principle of the Gselect prediction
[Bits of the global history (e.g. 0 1 1 0 0 1 1) and bits of the IFA (e.g. 0 1 1 0) together address the BHT entry x.]

2.4. Combined prediction

Figure 2.14.: Principle of the combined local and global prediction (as used in the Alpha 21264, or the POWER 4)
[The IFA addresses the local BHT, the global history addresses the global BHT. A "best choice" BHT selects either the local or the global prediction as the resulting prediction. The local prediction, the global prediction and the actual outcome are fed back for updating.]

Figure 2.15.: Implementation alternatives of the combined prediction
Alpha 21264
• 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
• 2. prediction: simple 2-level global prediction (12-bit global history/4K * 2 bits)
• Choice: global history referenced choice table (12-bit global history/4K * 2-bits)
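The combined scheme of Figures 2.14 and 2.15 can be sketched as follows, using the table sizes given for the Alpha 21264. This is a simplified illustration, not the actual hardware: the class name `CombinedPredictor`, the helper `bump`, the index derived from `ifa >> 2`, the initial counter values, and the rule of training the choice table only when the two predictions disagree are all assumptions.

```python
def bump(value, up, maximum):
    """Saturating counter step: one up or one down within 0..maximum."""
    return min(value + 1, maximum) if up else max(value - 1, 0)


class CombinedPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024  # 1K x 10-bit local histories
        self.local_ctr = [3] * 1024   # 1K x 3-bit counters, MSB = prediction
        self.global_ctr = [1] * 4096  # 4K x 2-bit counters
        self.choice = [1] * 4096      # 4K x 2-bit, >= 2 means "use global"
        self.ghist = 0                # 12-bit global history

    def _parts(self, ifa):
        slot = (ifa >> 2) & 1023      # assumed index: 4-byte instructions
        pattern = self.local_hist[slot]
        local = self.local_ctr[pattern] >= 4      # most significant bit set
        glob = self.global_ctr[self.ghist] >= 2
        return slot, pattern, local, glob

    def predict(self, ifa):
        _, _, local, glob = self._parts(ifa)
        return glob if self.choice[self.ghist] >= 2 else local

    def update(self, ifa, taken):
        slot, pattern, local, glob = self._parts(ifa)
        g = self.ghist
        if local != glob:
            # Move the choice towards the component that was right.
            self.choice[g] = bump(self.choice[g], glob == taken, 3)
        self.local_ctr[pattern] = bump(self.local_ctr[pattern], taken, 7)
        self.global_ctr[g] = bump(self.global_ctr[g], taken, 3)
        self.local_hist[slot] = ((pattern << 1) | int(taken)) & 1023
        self.ghist = ((g << 1) | int(taken)) & 4095


predictor = CombinedPredictor()
hits = []
for i in range(100):
    taken = (i % 2 == 0)              # alternating taken / not taken
    hits.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)
print(all(hits[20:]))  # prints True: the pattern is learned after warm-up
```

An alternating branch defeats a plain 2-bit counter, but the 2-level local component learns it once its 10-bit history has filled, which is what the replay demonstrates.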
Combined prediction (continued)

Figure 2.16.: The combined predictor of the Alpha 21264
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
Source: Microprocessor Report, 10/28/96

Figure 2.17.: Implementation alternatives of the combined prediction
Alpha 21264: as given in Figure 2.15.
POWER 4
• 1. prediction: 1-level local dynamic prediction (16K * 1-bit)
• 2. prediction: 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
• Choice: accessed in the same way as the global counter table (16K * 1-bit)

Figure 2.18.: The principle of the combined predictor of the POWER 4
[The 11-bit global history (1 bit per group) is XORed with bits of the IFA to form a 14-bit index into the global history table (16K*1bit) and into the selector table (16K*1bit). A 14-bit index taken from the IFA addresses the local history table (16K*1bit). The selector selects the better of the local and the global prediction. All three tables are updated.]

2.5. Overview of the basic branch prediction mechanisms

Figure 2.20.: Trends of branch prediction schemes used in 2. and 3. generation superscalars
Columns of the chart (basic prediction mechanism):
• Local: fixed prediction; static prediction; dynamic 1-level (1-bit, 2-bit, 3-bit); dynamic 2-level (shared counters, individual counters)
• Global (2-level): simple global, Gshare, Gselect
• Combined (Choice prediction)
Entries as far as recoverable (the assignment of the processors to the columns is not recoverable from the transcript): Pentium¹ (256*2), Pentium Pro (512*2), P4 Will/Northw. (4K*2), P4 Prescott (4K*2), K6 (8K*2), K7, K8 (16K*2), PPC 604 (512*2), PPC 620 (2K*2), POWER 3 (2K*2), POWER 4 with the components (16K*1) and (11-bit/16K*1), Alpha 21164¹ (2K*2), Alpha 21264 with the components (1K*10/1K*3) and (12 bits/4K*2), PA-8000 (256*3), PA-8500/8700, UltraSPARC-III (12-bits/16K*2).
1: 1. generation superscalars

3. Auxiliary branch prediction mechanisms

Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars
Columns of the chart (auxiliary branch prediction mechanisms):
• Backup use of static prediction
• Preemptive use of compiler hints
• Dedicated prediction: RAS, loop detector, indirect branch prediction
Processors covered: Pentium¹, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha 21164¹, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III.
RAS sizes shown: K6 (16-entries), K7 (12-entries), K8 (12-entries), Alpha 21164 (12-entries), Alpha 21264 (32-entries), UltraSPARC-III (8-entries).
Entries marked with footnote 2: POWER 4², POWER 5².
The assignment of the remaining processors to the columns is not recoverable from the transcript.
1: 1. generation superscalars
2: Supported by compiler hints
RAS: Return Address Stack

Figure 3.2: Static branch prediction algorithm of the Pentium Pro
Source: Shanley T., „Pentium Pro Processor System Architecture„, Addison-Wesley Developers Press, 1996

Return Address Stack (RAS)
• RAS: PUSH the return address on a CALL, POP the return address on a RET. The RAS is used to continue execution speculatively from the popped return address.
• Architectural stack: PUSH the return address on a CALL, POP the return address on a RET, with preserved sequential consistency.

The problem of RASs: A procedure, such as printf(), might be called from many different locations, so there are many different return addresses. During speculative out-of-order execution, however, the logical sequence of the related PUSH and POP operations (CALL and RET instructions) may be disturbed, so the predicted return address may be wrong. To check the prediction, the RET instruction is executed, and on a misprediction a repair mechanism is activated (to cancel wrongly executed instructions and repair the corrupted RAS).

4. Accessing the branch target path

4.1. Overview

Figure 4.1.: Alternatives to generate the BTA
• Calculated on the fly
• Accessed from the BTAC
• From the I$

Figure 4.2.: Principle of calculating the BTA on the fly
[The instruction fetch address register (IFAR) supplies the instruction fetch address (IFA) to the I-cache, which delivers the sequential instructions I, I+1, I+2, I+3. An adder provides the sequential address (IIFA); a separate unit computes the BTA, which then addresses the branch target instructions BTI, BTI+1, BTI+2, BTI+3.]
This scheme is employed in earlier scalar (pipeline) processors as well as in a number of superscalar processors, such as: Z 80000 (1984), i486 (1989), MC 68040 (1990), Sparc CY7C601 (1988), SuperSparc (1992p), Power PC 601 (1993), 603 (1993), Power1 (1990), Power2 (1993), POWER4 (2001), POWER5 (2005), 21064 (1992), 21064A (1994), 21164 (1995), R4000 (1992), R 10000 (1996), Ultra SPARC III (2003).
Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 303

Figure 4.3.: Principle of the BTAC scheme to access the branch target path
[The IFA addresses the I-cache and the BTAC in parallel. The next IFA is either the sequential address (IIFA) or the branch target address (BTA) read from the BTAC.]
The Branch Target Address Cache (BTAC) contains branch target addresses (BTAs). These BTAs are read from the BTAC when the instruction immediately preceding a branch is fetched. (Their addresses are designated as BA-1.)

Figure 4.4.: The principle of branch prediction using both a BHT and a BTAC (C: counter) (Designated as BTB (Branch Target Buffer) by Intel)
[The IFA addresses the I$, the BHT (counters C) and the BTAC (tags, BTA) in parallel. The next IFA is the IIFA if the BTAC misses, the BTA if the BTAC hits, and the corrected address if a misprediction is detected. After further processing of the branch, the BHT is updated with the branch result, and the BTAC is updated (BTAC entry created or deleted, BTA written) if the BHT initiates it.]

Figure 4.5.: Examples of processors using the BTAC scheme
Processor                             Number of BTAC entries   Implementation of the BTAC
ES/9000 520-based procs (1992p)       4K                       2-way associative
Pentium (1994)                        256                      Fully associative
Pentium Pro                           512                      4-way associative
Pentium 4                             4K                       4-way associative
MC 68060 (1993)                       256                      4-way associative
R 8000 (1994)¹                        1K
PA 8000 (1995)                        32                       Fully associative
Power PC 604 (1994)                   64                       Fully associative
Power PC 620 (1995)                   256                      Fully associative
1: Each entry is shared among 4 instructions

Figure 4.6.: The physical implementation of branch prediction in Intel’s P4 Northwood and Prescott cores
Source: de Vries H., „Looking at Intel’s Prescott die, part II.”, http://www.chip-architect.com, April 2003

Figure 4.7.: Principle of the BTIC scheme to access the branch target path
[The IFA addresses the I-cache and the BTIC in parallel. A BTIC entry holds BA, BTI and BTA+. Either the instruction from the I-cache or the BTI from the BTIC is passed to decoding.]
The BTIC contains the addresses of the most recently taken branches (BA), the corresponding branch target instructions (BTI) and the addresses of the instructions following the BTIs (BTA+). When there is an entry in the BTIC for the actual IFA, the corresponding BTI is fetched from the BTIC and selected for decoding instead of the instruction from the I-cache. The address of the subsequent instruction along the taken path is also read from the BTIC and becomes the next IFA.
Examples: Gmicro/200 (1988), AM 29000 (1988), MC 88110 (1993).

Figure 4.8.: Trends to generate the BTA
[Chart with the alternatives "Calculated on the fly", "Accessed from BTAC" and "From the I$". Examples shown: PPro/PII/PIII/P4, 21264, Ultra SPARC III, K6, Power 4, 5, K7/K8, Power 3. Their assignment to the alternatives is not recoverable from the transcript.]

4.2. Case example 1: K7

To each 16-byte long fetch block a 16-bit selector block is allocated:
[Diagram: the bytes 15 to 0 of the fetch block (16-Byte) are mapped to the bits 15 to 0 of the selector block (16-bit).]

The selector block identifies the branches included in the associated fetch block. Two bits of the selector block correspond to two bytes of the fetch block.
RETs are a single byte long; all other branches are at least two bytes long. Assuming at most a single RET in the fetch block, there may be at most one branch ending in any pair of bytes.
In a fetch block there may be up to a single RET and two non-RET branches. More branches in a fetch block lead to conflicts in the prediction logic.

Each two-bit entry indicates whether or not there is a branch ending in the corresponding two bytes of the fetch block; if yes, it identifies the type of the branch as well.
A branch instruction that crosses the 16-byte boundary is counted to the second 16-byte window.

Coding of the two bits (assumed):
00: no branch
01: RET
10: There is a conditional branch whose branch target is in the BTA0 field of the BTAC
11: There is a conditional branch whose branch target is in the BTA1 field of the BTAC

Characteristic examples of selector settings:
Selector block             Meaning                                                   Next fetch address
xx 00 00 00 00 00 00 00    No branch                                                 IFA+16
xx 00 01 00 00 00 00 00    A RET instruction                                         Return address of the RET
xx 00 00 00 10 00 00 00    A cond. branch (its BTA is in the BTAC 0 field)           BTA0 if taken, else IFA+16
xx 00 00 10 00 11 00 00    Two cond. branches (their BTAs are in the BTAC 0          BTA0 if BC1 is taken; else BTA1 if BC2
                           and BTAC 1 fields)                                        is taken; else IFA+16

During predecoding, instruction boundaries as well as branch instructions are detected and the appropriate selector entries are marked accordingly.
Predecoding is performed not faster than 4 bytes/cycle.
If a cache line (64 bytes = 4 fetch blocks) is replaced, all associated selector blocks are invalidated.
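The selector examples above can be turned into a small Python sketch that derives the next fetch address from a selector block. It relies on the two-bit coding that the slides themselves mark as assumed. Further assumptions made here for illustration: the eight entries are passed in program order, the leading "xx" entry is treated as 00, the IFA is aligned to the 16-byte fetch block, and `predict_taken` is a hypothetical stand-in for the global predictor. The function name `next_fetch_address` is likewise hypothetical.

```python
NO_BRANCH, RET, COND_BTA0, COND_BTA1 = 0b00, 0b01, 0b10, 0b11


def next_fetch_address(entries, ifa, bta0, bta1, ret_addr, predict_taken):
    """Return the next IFA for one fetch block.

    entries       -- the eight 2-bit selector entries, in program order
    predict_taken -- callable, asked once per conditional branch reached
    """
    for code in entries:
        if code == RET:
            return ret_addr                 # return address of the RET
        if code == COND_BTA0 and predict_taken():
            return bta0                     # first conditional branch taken
        if code == COND_BTA1 and predict_taken():
            return bta1                     # second conditional branch taken
    return ifa + 16                         # no taken branch: next fetch block


IFA, BTA0, BTA1, RET_ADDR = 0x1000, 0x2000, 0x3000, 0x4000

# No branch in the fetch block.
print(hex(next_fetch_address([0, 0, 0, 0, 0, 0, 0, 0], IFA, BTA0, BTA1,
                             RET_ADDR, lambda: True)))   # prints 0x1010

# Two conditional branches: BC1 predicted not taken, BC2 predicted taken.
guesses = iter([False, True])
print(hex(next_fetch_address([0, 0, 0, 0b10, 0, 0b11, 0, 0], IFA, BTA0, BTA1,
                             RET_ADDR, lambda: next(guesses))))  # prints 0x3000
```

The scan stops at the first RET or predicted-taken branch, which mirrors the BC1/BC2 decision chain of the last selector example.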
Case example 1: K7 (continued)

The selector table is shared between the upper and lower part of the I$, and an extra address bit (A) identifies whether the entry belongs to the upper or the lower part of the I$.
Source: Kaiser, A. ,”K7 Branch Prediction”, Dec. 1999, http://www.s.netic.de

Figure 4.9.: Assumed simplified scheme of accessing the branch target path in the K7, without showing the global prediction (A: address bit, C: Conditional branch, W: Way)
[Block diagram. The IFA held in the IFAR is split into tag [31:15] and index fields. The 2-way set associative I$ (Way 0, Way 1, each 1K*16B fetch blocks with tags and predecode bits, 16B+P) is indexed by IFA [14:4] and delivers a 16-byte fetch block. The BTAC (1K x 2 addr., fields BTA 0 and BTA 1) is indexed by IFA [13:4]; its entries are written by the fetch unit during predecoding and with the BTA from execution. The selector table, shared for the upper and lower parts of the I$, delivers a 16-bit selector block, in which IFA [3:1] selects the entry. Depending on the selector entry the next address is: the sequential address (+16, no branch), the RET address from a 12-entry return address stack, or BTA 0 / BTA 1. A conditional branch is taken or not according to the global prediction; an unconditional branch is taken. Instructions are decoded and issued beginning with the given address.]

4.2. Case example 2: K8

The K8 doubled the size of the selector table, so each fetch block has its own selector entry.
The K8 allows any mix of up to 3 branches (CALL, JMP, RET, conditional) per fetch block; the coding of the selector entries is modified accordingly.
When instruction cache lines are evicted to the L2 cache, branch selectors and predecode information are also stored in the L2 cache.
The K8 uses 48-bit addresses, but the BTAC keeps only the 15 least significant bits to identify the next address.
Each BTA entry holds the least significant 15 bits of the IFA as well as additional information, such as 3 bits consisting of the old IFA bits 16 and 15 and a W bit (way identifier).

Figure 4.10.: Assumed simplified scheme of accessing the branch target path in the K8, without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address)
[Block diagram, similar to Figure 4.9. The 2-way set associative I$ (Way 0, Way 1, each 1K*16B fetch blocks with tags and predecode bits) is indexed by IFA [14:4]. The BTAC (512 x 4 addr., fields BTA 0, BTA 1, BTA 2, marked with "?" in the figure) is indexed by IFA [12:4]; a BTA calculator (also marked with "?") and predecoding supply its entries. A BTA entry contains the old IFA bits 16 and 15, the W bit, the new IFA bits 14 to 0, and the R and C bits. The selector table delivers a 16-bit selector block per fetch block. Depending on the selector entry the next address is: the sequential address (+16, no branch), BTA0/RET, BTA1/RET or BTA2/RET, with the RET address taken from a 12-entry return address stack. A conditional branch is taken or not according to the global prediction; an unconditional branch is taken. Instructions are decoded and issued beginning with the given address.]

Figure 4.11.: Logical view of Opteron’s (K8’s) instruction fetch and decode stages
Source: de Vries H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, http://www.chip-archtect.com, Sept., 2003

Figure 4.12.: Physical implementation of Opteron’s (K8’s) instruction cache and decoding
Source: de Vries H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, http://www.chip-archtect.com, Sept., 2003