The Anatomy of a Modern Superscalar Processor
Constantinos Kourouyiannis, Madhava Rao Andagunda
Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out-of-order execution
- Prediction-Speculation

Introduction
- Superscalar processing is the ability to initiate multiple instructions during the same cycle. It aims at producing ever faster microprocessors.
- A typical superscalar processor fetches and decodes several instructions at a time.
- Instructions are executed in parallel based on the availability of their operand data, rather than on their original program sequence.
- Upon completion, instructions are re-sequenced so that they update the process state in the correct program order.

Microarchitecture
- Instruction Fetch and Branch Prediction
- Decode and Register Dependence Analysis
- Issue and Execution
- Memory Operation Analysis and Execution
- Instruction Reorder and Commit

[Figure: Organization of a superscalar processor]

Instruction Fetch and Branch Prediction
- The fetch phase supplies instructions to the rest of the processing pipeline.
- An instruction cache is used to reduce the latency and increase the bandwidth of the instruction fetch process.
- The PC is used to search the cache contents to determine whether the instruction being addressed is present in one of the cache lines.
- In a superscalar implementation, the fetch phase fetches multiple instructions per cycle from the cache.

Branch Instructions
- Recognizing conditional branches: decode information (extra bits) is held in the instruction cache with every instruction.
- Determining the branch outcome: branch prediction using information about the past history of branch outcomes.
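The history-based outcome prediction mentioned above can be illustrated with a classic 2-bit saturating-counter table. This is a deliberately minimal sketch, not the 21264's tournament predictor; the class name, table size, and indexing are illustrative assumptions:

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor.
# Illustrative only: real predictors (e.g., the 21264's) combine
# local and global history, which this sketch omits.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)  # low PC bits index the table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        # Saturating update: move toward taken (3) or not taken (0).
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

After a couple of taken outcomes the counter saturates and the branch is predicted taken; the 2-bit hysteresis means a single anomalous outcome does not flip the prediction.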
Branch Instructions (cont.)
- Computing the branch target: usually an integer addition (PC + offset value). A Branch Target Buffer holds the target address used the last time the branch was executed.
- Transferring control: if the branch is taken, there is at least one clock cycle of delay to recognize the branch, modify the PC, and fetch instructions from the target address.

Instruction Decode
- Instructions are removed from the fetch buffers and examined, and data dependence linkages are set up.
- Data dependences:
  - True dependences: can cause a read-after-write (RAW) hazard.
  - Artificial dependences: can cause write-after-read (WAR) and write-after-write (WAW) hazards.
- Hazards:
  - RAW: occurs when a consuming instruction reads a value before the producing instruction writes it.
  - WAR: occurs when an instruction writes a value before a preceding instruction reads it.
  - WAW: occurs when multiple instructions update the same storage location, but not in the proper order.

Instruction Decode (cont.)
[Figure: Example of data hazards]
- For each instruction, the decode phase sets up the operation to be executed, the identities of the storage elements where the inputs reside, and the location where the result must be placed.
- Artificial dependences are eliminated through register renaming.

Instruction Issue and Parallel Execution
- Run-time checking for availability of data and resources.
- An instruction is ready to execute as soon as its input operands are available. However, there are other constraints, such as execution units and register file ports.
- An issue queue holds the instructions until their input operands are available.
- Out-of-order execution: the ability to execute instructions not in program order, but as soon as their operands are ready.

Handling Memory Operations
- For memory operations, the decode phase cannot identify the memory locations that will be accessed.
- Determining the memory location that will be accessed requires an address calculation, usually an integer addition.
- Once a valid address is obtained, the load or store operation is submitted to memory.
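The RAW/WAR/WAW definitions from the decode discussion can be sketched as a small classification function over register operands. This is purely illustrative; the tuple representation and function name are assumptions, not how real decode hardware works:

```python
# Illustrative sketch: classifying the data hazards defined above
# between an earlier instruction and a later one. Each instruction
# is represented as (destination_register, set_of_source_registers).

def classify_hazards(earlier, later):
    dst1, srcs1 = earlier
    dst2, srcs2 = later
    hazards = set()
    if dst1 in srcs2:            # later reads what earlier writes
        hazards.add("RAW")       # true dependence
    if dst2 in srcs1:            # later writes what earlier reads
        hazards.add("WAR")       # artificial dependence
    if dst1 == dst2:             # both write the same register
        hazards.add("WAW")       # artificial dependence
    return hazards

# LD R4, 10(R5) followed by ADD R6, R4, R8 -> RAW on R4
print(classify_hazards(("R4", {"R5"}), ("R6", {"R4", "R8"})))  # {'RAW'}
```

Only the RAW case reflects a real flow of data; the WAR and WAW cases arise from register-name reuse, which is why renaming can eliminate them.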
Committing State
- The effects of an instruction are allowed to modify the logical process state.
- The purpose of this phase is to implement the appearance of a sequential execution model, even though the actual execution is not sequential.
- Machine state is separated into physical and logical: the physical state is updated as operations complete, while the logical state is updated in sequential program order.

Alpha 21264 (EV6)

Instruction Fetch
- Fetches 4 instructions per cycle.
- Large 64 KB 2-way associative instruction cache.
- Branch predictor: dynamically chooses between local and global history.

Register Renaming
- Assignment of a unique storage location to each write-reference to a register.
- The allocated register becomes part of the architectural state only when the instruction commits.
- Eliminates WAW and WAR dependences while preserving the RAW dependences necessary for correct computation.
- 64 architectural + 41 integer + 41 floating-point registers are available to hold speculative results prior to instruction retirement, in an 80-instruction in-flight window.

Issue Queues
- 20-entry integer queue: can issue 4 instructions per cycle.
- 15-entry floating-point queue: can issue 2 instructions per cycle.
- A list of pending instructions is kept, and each cycle the queues select from these instructions as their input data become ready.
- The queues issue instructions speculatively, and older instructions are given priority over newer ones in the queue.
- An issue queue entry becomes available when the instruction issues or is squashed due to mis-speculation.

Execution Engine
- All execution units require access to the register file.
- The integer register file is split into two clusters that contain duplicates of the 80-entry register file.
- Two pipes access a single register file to form a cluster, and the two clusters are combined to support 4-way integer execution.
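The renaming scheme described above, a rename map plus a free list of physical registers, can be sketched in a few lines. This is a minimal model: the class name and register counts are illustrative, and the real 21264 rename hardware also tracks mappings for squash and retirement:

```python
# Minimal register-renaming sketch: a map from architectural to
# physical registers plus a free list. Illustrative only.

class Renamer:
    def __init__(self, num_arch, num_phys):
        # Initially, architectural register Ri maps to physical Pi.
        self.map = {f"R{i}": f"P{i}" for i in range(num_arch)}
        self.free = [f"P{i}" for i in range(num_arch, num_phys)]

    def rename(self, dst, srcs):
        # Sources read the current mapping (preserves RAW dependences).
        phys_srcs = [self.map[s] for s in srcs]
        # The destination gets a fresh physical register, so later writes
        # to the same architectural register cannot collide with earlier
        # reads or writes (eliminates WAR and WAW hazards).
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs

r = Renamer(num_arch=4, num_phys=8)
print(r.rename("R1", ["R2"]))  # ('P4', ['P2'])
print(r.rename("R1", ["R1"]))  # ('P5', ['P4']) - the read sees the new P4
```

Note how the second write to R1 receives a different physical register (P5) while its read of R1 correctly picks up the previous rename (P4), which is exactly how WAW hazards disappear without breaking RAW ordering.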
- Two floating-point execution pipes are organized in a single cluster with a single 72-entry register file.

Memory System
- Supports in-flight memory references and out-of-order operation.
- Receives up to 2 memory operations from the integer execution pipes every cycle.
- 64 KB 2-way set-associative data cache and a direct-mapped level-two cache (ranging from 1 to 16 MB).
- 3-cycle latency for integer loads and 4 cycles for FP loads.

Store/Load Memory Ordering
- The memory system supports the capabilities of out-of-order execution but maintains an in-order architectural memory model.
- It would be wrong for a later load to issue prior to an earlier store to the same address. This RAW memory dependence cannot be handled by the rename logic, because the memory address is not known before instruction issue.
- If a load is incorrectly issued before an earlier store to the same address, the 21264 trains the out-of-order execution core to avoid this on subsequent executions of the same load: it sets a bit in a load wait table, which forces the issue point of the load to be delayed until all prior stores have issued.

Load Hit/Miss Prediction
- To achieve the 3-cycle integer load hit latency, it is necessary to speculatively issue consumers of integer load data before knowing whether the load hit or missed in the data cache.
- If the load eventually misses, two integer cycles are squashed, and all integer instructions that issued during those cycles are pulled back into the issue queue to be re-issued later.
- The 21264 predicts when loads will miss and, in that case, does not speculatively issue the consumers of the load.
- Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss.

Sim-alpha simulator
- Sim-alpha is a simulator that models the Alpha 21264.
- It models the implementation constraints and low-level features of the 21264.
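In the spirit of such a simulator, the load wait table described under Store/Load Memory Ordering can be sketched as follows. The table size, indexing, and method names are assumptions for illustration, not the 21264's actual design:

```python
# Illustrative sketch of a load wait table: a load that once issued
# past a conflicting older store is marked, and on later executions
# it is held until all prior stores have issued.

class LoadWaitTable:
    def __init__(self, entries=1024):
        self.wait_bit = [False] * entries

    def _index(self, load_pc):
        return (load_pc >> 2) % len(self.wait_bit)

    def train(self, load_pc):
        """Called after a load was wrongly issued before an older store."""
        self.wait_bit[self._index(load_pc)] = True

    def must_wait(self, load_pc, prior_stores_issued):
        """A marked load may issue only after all prior stores issue."""
        return self.wait_bit[self._index(load_pc)] and not prior_stores_issued
```

The first execution of an offending load pays the cost of the ordering violation and calls `train()`; every subsequent execution of the same load PC is then delayed by `must_wait()` until the prior stores have issued.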
- Allows the user to vary different parameters of the processor, such as fetch width, reorder buffer size, and issue queue sizes.

Out-Of-Order Execution
Why out-of-order execution?
- In-order processors stall: the pipeline may not be full because of the frequent stalls.
- [Figure: example of in-order pipeline stalls]
- Out-of-order processors move an instruction to execution when it has no outstanding dependences, i.e., they allow the instructions that are ready to proceed.

Out-Of-Order Execution (cont.)
Theme: stall only on RAW hazards and structural hazards.
- RAW hazard example:
  LD  R4, 10(R5)
  ADD R6, R4, R8
- Structural hazard: occurs because of resource conflicts. Example: if the CPU is designed with a single interface to memory, that interface is always used during IF and is also used in MEM for load or store operations; when a load or store gets into the MEM stage, IF must stall.

Prediction & Speculation
Problem: serialization due to dependences.
- Waiting until the producing instruction executes conflicts with the need for optimal utilization of resources.
Solution:
- Predict the unknown information and allow the processor to proceed, assuming the predicted information is correct.
- If the prediction is correct, proceed normally.
- If not, squash the speculatively executed instructions, restore the state, and restart in the correct direction.

Taxonomy of Speculation
- A branch outcome has two possibilities - worst case: 50% accuracy.
- A 32-bit register value has many more possibilities - worst case: ?

Prediction & Speculation (cont.)
Control speculation:
- Current branch predictors are highly accurate.
- Implemented in commercial processors.
Data speculation:
- A lot of research (value profiling, value prediction, etc.).
- Not implemented in current processors; until now it has had very little effect on current designs.
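The predict/proceed/squash loop described above can be sketched end to end. This is purely illustrative; the function name, path representation, and return values are assumptions, not any processor's actual recovery mechanism:

```python
# Minimal sketch of control speculation with squash-and-restart:
# instructions down the predicted path execute speculatively; on a
# misprediction they are discarded and the correct path is fetched.

def run_with_speculation(predicted_taken, actual_taken,
                         taken_path, not_taken_path):
    speculative = taken_path if predicted_taken else not_taken_path
    executed = list(speculative)       # proceed assuming the prediction
    if predicted_taken == actual_taken:
        return executed, False         # prediction correct: keep results
    # Misprediction: squash the speculative work, restore state,
    # and restart down the correct direction.
    correct = taken_path if actual_taken else not_taken_path
    return list(correct), True

insts, squashed = run_with_speculation(
    predicted_taken=True, actual_taken=False,
    taken_path=["ADD", "MUL"], not_taken_path=["SUB", "DIV"])
# insts == ["SUB", "DIV"], squashed == True
```

The squash path is why prediction accuracy matters: every misprediction throws away the speculative work and pays the restart penalty, which is tolerable for two-valued branch outcomes but far harder for 32-bit data values.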