The anatomy of a modern
superscalar processor
Constantinos Kourouyiannis
Madhava Rao Andagunda
Outline






Introduction
Microarchitecture
Alpha 21264 processor
Sim-alpha simulator
Out of order execution
Prediction-Speculation
Introduction



Superscalar processing is the ability to initiate
multiple instructions during the same cycle.
It aims at producing ever faster microprocessors.
A typical superscalar processor fetches and
decodes several instructions at a time. Instructions
are executed in parallel based on the availability of
operand data rather than their original program
sequence. Upon completion, instructions are re-sequenced so that they update the
process state in the correct program order.
Outline






Introduction
Microarchitecture
Alpha 21264 processor
Sim-alpha simulator
Out of order execution
Prediction-Speculation
Microarchitecture





Instruction Fetch and Branch Prediction
Decode and Register Dependence Analysis
Issue and Execution
Memory Operation Analysis and Execution
Instruction Reorder and Commit
Organization of a superscalar processor
Instruction Fetch and Branch Prediction




The fetch phase supplies instructions to the rest of
the processing pipeline.
An instruction cache is used to reduce the latency
and increase the bandwidth of the instruction fetch
process.
The PC is used to search the cache contents and determine
whether the instruction being addressed is present in one of the
cache lines.
In a superscalar implementation, the fetch phase
fetches multiple instructions per cycle from the cache
memory, as sketched below.
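
As a rough illustration of the lookup just described, the C sketch below splits the PC into tag, index and offset to probe a small 2-way set-associative instruction cache and pull several instructions from the hit line. The structure names, sizes and fetch width here are illustrative assumptions, not details taken from the slides.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative parameters: a small 2-way set-associative I-cache. */
#define WAYS        2
#define SETS        64
#define LINE_BYTES  64          /* 16 x 4-byte instructions per line */
#define FETCH_WIDTH 4           /* instructions fetched per cycle    */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint32_t insts[LINE_BYTES / 4];
} ICacheLine;

static ICacheLine icache[SETS][WAYS];

/* Probe the cache with the PC; on a hit, copy up to FETCH_WIDTH
 * instructions starting at the PC into 'out'. Returns the number fetched,
 * or 0 on a miss (a real fetch unit would then start a cache fill). */
static int icache_fetch(uint64_t pc, uint32_t out[FETCH_WIDTH])
{
    uint64_t offset = (pc % LINE_BYTES) / 4;      /* word within the line */
    uint64_t index  = (pc / LINE_BYTES) % SETS;   /* which set            */
    uint64_t tag    = pc / (LINE_BYTES * SETS);   /* remaining high bits  */

    for (int way = 0; way < WAYS; way++) {
        ICacheLine *line = &icache[index][way];
        if (line->valid && line->tag == tag) {
            int n = 0;
            while (n < FETCH_WIDTH && offset + n < LINE_BYTES / 4) {
                out[n] = line->insts[offset + n];
                n++;
            }
            return n;                             /* hit: n instructions */
        }
    }
    return 0;                                     /* miss */
}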
Branch Instructions




Recognizing conditional branches
 Decode information (extra bits) is held in the instruction cache
with every instruction
Determining the branch outcome
 Branch prediction using information regarding past history of
branch outcomes.
Computing the branch target
 Usually an integer addition (PC + offset value)
 A Branch Target Buffer holds the target address used the last time the
branch was executed
Transferring control
 If the branch is taken, there is at least one clock cycle of delay to recognize the
branch, modify the PC and fetch instructions from the target address
(a small predictor/BTB sketch follows this slide)
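
A minimal sketch of the two mechanisms listed above, assuming a simple 2-bit saturating-counter predictor indexed by the PC and a small direct-mapped branch target buffer. The table sizes and names are illustrative assumptions, not the design of any particular processor.

#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 1024
#define BTB_ENTRIES  256

static uint8_t counters[PRED_ENTRIES];        /* 2-bit counters: 0..3 */
static struct { uint64_t pc, target; bool valid; } btb[BTB_ENTRIES];

/* Predict taken when the counter is in one of the two "taken" states. */
static bool predict_taken(uint64_t pc)
{
    return counters[pc % PRED_ENTRIES] >= 2;
}

/* Target used the last time this branch executed, if the BTB remembers it;
 * otherwise fall back to the sequential path (PC + 4). */
static uint64_t predict_target(uint64_t pc)
{
    unsigned i = pc % BTB_ENTRIES;
    if (btb[i].valid && btb[i].pc == pc)
        return btb[i].target;
    return pc + 4;
}

/* After the branch resolves, update the outcome history and the BTB. */
static void train(uint64_t pc, bool taken, uint64_t target)
{
    uint8_t *c = &counters[pc % PRED_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;

    unsigned i = pc % BTB_ENTRIES;
    btb[i].pc = pc;
    btb[i].target = target;
    btb[i].valid = true;
}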
Instruction Decode



Instructions are removed from fetch buffers, examined and data
dependence linkages are set up.
Data dependences
 True dependences: can cause a read after write (RAW) hazard
 Artificial dependences: can cause write after read (WAR) and write after write (WAW) hazards.
Hazards
 RAW: occurs when a consuming instruction reads a value before
the producing instruction writes it.
 WAR: occurs when an instruction writes a value before a
preceding instruction reads it.
 WAW: occurs when multiple instructions update the same storage
location but not in the proper order.
Instruction Decode (cont.)

Example of data hazards
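
Since the figure for this example is not reproduced in the transcript, here is a small hedged illustration in C: given two toy instructions described only by their source and destination registers, the sketch classifies the dependence between an earlier and a later instruction. The struct and register numbers are illustrative.

#include <stdio.h>

/* A toy instruction: one destination register and up to two sources
 * (-1 means "unused"). */
typedef struct { int dst, src1, src2; } Inst;

static int reads(Inst i, int reg)  { return i.src1 == reg || i.src2 == reg; }

static void classify(Inst earlier, Inst later)
{
    if (earlier.dst >= 0 && reads(later, earlier.dst))
        puts("RAW: the later instruction reads a value the earlier one writes");
    if (later.dst >= 0 && reads(earlier, later.dst))
        puts("WAR: the later instruction writes a register the earlier one reads");
    if (later.dst >= 0 && later.dst == earlier.dst)
        puts("WAW: both instructions write the same register");
}

int main(void)
{
    Inst load = { .dst = 4, .src1 = 5, .src2 = -1 };   /* R4 <- MEM[R5]  */
    Inst add  = { .dst = 5, .src1 = 4, .src2 = 8 };    /* R5 <- R4 + R8  */
    classify(load, add);      /* prints RAW (on R4) and WAR (on R5) */
    return 0;
}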

For each instruction, the decode phase sets up the
operation to be executed, the identities of the storage
elements where its inputs reside, and the locations
where its results must be placed.
Artificial dependences are eliminated through
register renaming.

Instruction Issue and Parallel Execution




Run-time checking for availability of data and
resources.
An instruction is ready to execute as soon as its
input operands are available. However, there are
other constraints, such as the availability of execution
units and register file ports.
An issue queue holds instructions until their input operands
are available (a small wakeup/select sketch follows this slide).
Out-of-order execution: the ability to execute
instructions not in program order but as soon as
their operands are ready.
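
A very small sketch of the wakeup-and-select idea described above, assuming a single generic execution unit and a scoreboard of register readiness. All names and sizes are illustrative assumptions.

#include <stdbool.h>

#define NREGS 32
#define QSIZE 8

typedef struct { int dst, src1, src2; bool valid; } IQEntry;

static bool    reg_ready[NREGS];     /* scoreboard: is the value available? */
static IQEntry iq[QSIZE];            /* the issue queue                     */

/* One select cycle: pick the oldest entry whose operands are both ready
 * (program order == queue order in this toy model). Returns the queue
 * slot issued, or -1 if nothing is ready. */
static int select_one(void)
{
    for (int i = 0; i < QSIZE; i++) {
        IQEntry *e = &iq[i];
        if (e->valid && reg_ready[e->src1] && reg_ready[e->src2]) {
            e->valid = false;           /* the entry becomes free again    */
            return i;
        }
    }
    return -1;                          /* everyone is waiting on operands */
}

/* Wakeup: when an execution unit produces register 'dst', mark it ready
 * so dependent instructions can be selected in a later cycle. */
static void wakeup(int dst)
{
    reg_ready[dst] = true;
}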
Handling Memory Operations



For memory operations, the decode phase
cannot identify the memory locations that will
be accessed.
The determination of the memory location
that will be accessed requires an address
calculation, usually integer addition.
Once a valid address is obtained, the load or
store operation is submitted to memory.
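
For example, a load written as LD R4, 10(R5) must first compute its effective address with an integer addition before the access can be sent to memory; a minimal sketch of that step (the function name is illustrative):

#include <stdint.h>

/* Effective address of a load/store: base register value plus the
 * displacement encoded in the instruction (an integer addition).
 * Only once this address is known can the access be sent to memory. */
static uint64_t effective_address(uint64_t base_reg_value, int64_t displacement)
{
    return base_reg_value + (uint64_t)displacement;
}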
Committing State



The effects of an instruction are allowed to
modify the logical process state.
The purpose of this phase is to implement the
appearance of a sequential execution model,
even though the actual execution is not
sequential.
Machine state is separated into physical and
logical state. The physical state is updated as
operations complete, while the logical state is updated
in sequential program order.
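
A minimal sketch of the separation just described, assuming a simple reorder buffer: results update the physical state as they complete, possibly out of order, while the logical (architectural) state is only updated when the oldest instruction commits. Structure names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 16
#define NREGS    32

typedef struct {
    bool     busy, done;
    int      dest;              /* architectural register written        */
    uint64_t value;             /* physical (speculative) result         */
} ROBEntry;

static ROBEntry rob[ROB_SIZE];
static int      rob_head;                   /* oldest in-flight instruction  */
static uint64_t logical_regs[NREGS];        /* committed architectural state */

/* Physical state: an instruction records its result whenever it finishes,
 * regardless of program order. */
static void complete(int slot, uint64_t value)
{
    rob[slot].value = value;
    rob[slot].done  = true;
}

/* Logical state: only the oldest instruction may commit, so the
 * architectural registers are always updated in program order. */
static void try_commit(void)
{
    ROBEntry *e = &rob[rob_head];
    if (e->busy && e->done) {
        logical_regs[e->dest] = e->value;
        e->busy = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}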
Outline






Introduction
Microarchitecture
Alpha 21264 processor
Sim-alpha simulator
Out of order execution
Prediction-Speculation
Alpha 21264 (EV6)
Instruction Fetch



Fetches 4 instructions per cycle
Large 64 KB, 2-way set-associative instruction
cache
Branch predictor: dynamically chooses
between local-history and global-history predictions
(a rough sketch of the chooser follows this slide)
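
A rough sketch of the "choose between local and global history" idea: a 2-bit chooser counter per entry learns which of the two component predictors has been more accurate for a given branch. The table size and PC-based indexing here are simplified assumptions, not the 21264's actual tables.

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 1024

static uint8_t chooser[ENTRIES];   /* 2-bit counter: >= 2 means "trust global" */

/* Pick between the two component predictions for this branch. The local
 * and global predictions would come from separate history tables. */
static bool tournament_predict(uint64_t pc, bool local_pred, bool global_pred)
{
    return (chooser[pc % ENTRIES] >= 2) ? global_pred : local_pred;
}

/* After the branch resolves, move the chooser toward whichever component
 * predictor was right (no change if both were right or both were wrong). */
static void tournament_train(uint64_t pc, bool outcome,
                             bool local_pred, bool global_pred)
{
    uint8_t *c = &chooser[pc % ENTRIES];
    bool local_ok  = (local_pred  == outcome);
    bool global_ok = (global_pred == outcome);
    if (global_ok && !local_ok && *c < 3) (*c)++;
    if (local_ok  && !global_ok && *c > 0) (*c)--;
}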
Register Renaming




Assignment of a unique storage location to each
write reference to a register (a small renaming
sketch follows this slide).
The register allocated becomes part of the
architectural state only when the instruction
commits.
Eliminates WAW and WAR dependences but
preserves the RAW dependences necessary for
correct computation.
In addition to the 64 architectural registers, 41 integer
and 41 floating-point registers are available to hold
speculative results prior to instruction retirement,
within an 80-instruction in-flight window.
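
A small sketch of renaming as described above: each write to an architectural register is given a fresh physical register from a free list, which removes WAW and WAR conflicts on the architectural name, while sources keep pointing at the producing instruction's physical register, preserving RAW dependences. The sizes and names are illustrative.

#include <stdio.h>

#define ARCH_REGS 32
#define PHYS_REGS 80          /* e.g. 32 architectural + extra speculative */

static int rename_map[ARCH_REGS];           /* arch reg -> physical reg    */
static int free_list[PHYS_REGS];
static int free_top;

static void rename_init(void)
{
    for (int i = 0; i < ARCH_REGS; i++) rename_map[i] = i;
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--)
        free_list[free_top++] = p;
}

/* Rename one instruction "dst <- src1 op src2". Sources read the current
 * mapping (preserving RAW); the destination gets a brand-new physical
 * register (eliminating WAR and WAW on the architectural name). */
static void rename(int dst, int src1, int src2,
                   int *pdst, int *psrc1, int *psrc2)
{
    *psrc1 = rename_map[src1];
    *psrc2 = rename_map[src2];
    *pdst  = free_list[--free_top];   /* assumes a free register exists */
    rename_map[dst] = *pdst;          /* later readers of dst see this  */
}

int main(void)
{
    rename_init();
    int d, s1, s2;
    rename(4, 5, 6, &d, &s1, &s2);    /* R4 <- R5 + R6                  */
    printf("R4 now lives in physical register P%d\n", d);
    rename(4, 4, 7, &d, &s1, &s2);    /* R4 <- R4 + R7: fresh P for R4  */
    printf("second write of R4 -> P%d (source R4 read P%d)\n", d, s1);
    return 0;
}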
Issue Queues

20-entry integer queue: can issue 4 instructions per cycle
15-entry floating-point queue: can issue 2 instructions per cycle
A list of pending instructions is kept and each cycle
these queues select from these instructions as their
input data are ready.
Queues issue instructions speculatively and older
instructions are given priority over newer in the queue.

An issue queue entry becomes available when the
instruction issues or is squashed due to mis-speculation.
Execution Engine




All execution units require access to the
register file.
The register file is split into two clusters that
contain duplicates of the 80-entry register file.
Two pipes access a single register file to form
a cluster and the two clusters are combined
to support 4-way integer execution.
Two floating-point execution pipes are
organized in a single cluster with a single 72-entry register file.
Memory System




Supports in-flight memory references and out-of-order operation
Receives up to 2 memory operations from the
integer execution pipes every cycle
64 KB, 2-way set-associative data cache and a
direct-mapped level-two cache (ranging from 1
to 16 MB)
3-cycle latency for integer loads and 4 cycles
for FP loads
Store/Load Memory Ordering




Memory system supports capabilities of out-of-order
execution but maintains an in-order architectural memory
model.
It would be incorrect for a later load to issue before an earlier store
to the same address, since the load would miss the value being stored.
This RAW memory dependence cannot be handled by the
rename logic, because the memory address is not known
before the instruction issues.
If a load is incorrectly issued before an earlier store to the
same address, the 21264 trains the out-of-order execution
core to avoid this on subsequent executions of the same load: it
sets a bit in a load wait table that forces the issue point of
the load to be delayed until all prior stores have issued
(a sketch of this mechanism follows this slide).
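
A hedged sketch of the load-wait-table mechanism described above: a load that was once issued too early sets a bit indexed by its PC, and on later executions that bit forces the load to wait until all earlier stores have issued. The table size and indexing are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WAIT_ENTRIES 1024

static bool load_wait[WAIT_ENTRIES];   /* one "be careful" bit per slot */

/* Called when the core detects that this load issued before an older
 * store to the same address: remember to delay it next time. */
static void train_store_load_violation(uint64_t load_pc)
{
    load_wait[load_pc % WAIT_ENTRIES] = true;
}

/* Issue check for a load: normally it may issue as soon as its address
 * operand is ready, but a marked load also waits for all prior stores. */
static bool load_may_issue(uint64_t load_pc, bool operands_ready,
                           bool all_prior_stores_issued)
{
    if (!operands_ready)
        return false;
    if (load_wait[load_pc % WAIT_ENTRIES] && !all_prior_stores_issued)
        return false;
    return true;
}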
Load Hit/ Miss Prediction



To achieve the 3-cycle integer load hit latency, it is
necessary to speculatively issue consumers of
integer load data before knowing if the load hit or
missed in the data cache.
If the load eventually misses, two integer cycles
are squashed and all integer instructions that
issued during those cycles are pulled back into the
issue queue to be re-issued later.
The 21264 predicts when loads will miss and does
not speculatively issue the consumers of the load
in that case (an illustrative predictor sketch follows
this slide). Effective load latency: 5 cycles for an
integer load hit that is incorrectly predicted to miss.
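
A minimal sketch of one way such a hit/miss predictor could work, using a saturating counter per table entry that leans toward "predict miss" after recent misses. This is an illustrative scheme under assumed sizes and indexing, not the 21264's actual predictor.

#include <stdbool.h>
#include <stdint.h>

#define HM_ENTRIES 256

static uint8_t miss_counter[HM_ENTRIES];   /* 0..3: higher = expect a miss */

/* If a hit is predicted, consumers of the load are issued speculatively
 * 3 cycles behind it; if a miss is predicted they are held back, which is
 * why a load hit that was predicted to miss pays 5 cycles instead of 3. */
static bool predict_miss(uint64_t load_pc)
{
    return miss_counter[load_pc % HM_ENTRIES] >= 2;
}

/* Train with the real outcome once the cache lookup completes. */
static void train_hit_miss(uint64_t load_pc, bool missed)
{
    uint8_t *c = &miss_counter[load_pc % HM_ENTRIES];
    if (missed  && *c < 3) (*c)++;
    if (!missed && *c > 0) (*c)--;
}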
Outline






Introduction
Microarchitecture
Alpha 21264 processor
Sim-alpha simulator
Out of order execution
Prediction-Speculation
Sim-alpha simulator



Sim-alpha is a simulator that models the Alpha
21264.
It models the implementation constraints and
low-level features of the 21264.
Allows the user to vary different parameters
of the processor, such as the fetch width, reorder
buffer size and issue queue sizes (an illustrative
set of such parameters is sketched below).
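
Purely as an illustration of the kinds of machine parameters such a simulator lets the user vary, a hypothetical configuration record is sketched below. This is not sim-alpha's actual interface or option names; the baseline numbers simply restate the 21264 figures given earlier in these slides.

/* Illustrative only: a hypothetical configuration record, NOT sim-alpha's
 * real API or option names. */
typedef struct {
    int fetch_width;        /* instructions fetched per cycle     */
    int rob_size;           /* reorder buffer (in-flight window)  */
    int int_issue_q_size;   /* integer issue queue entries        */
    int fp_issue_q_size;    /* floating-point issue queue entries */
} SimConfig;

/* Baseline resembling the 21264 described earlier in these slides. */
static const SimConfig baseline = {
    .fetch_width      = 4,
    .rob_size         = 80,
    .int_issue_q_size = 20,
    .fp_issue_q_size  = 15,
};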
Outline






Introduction
Microarchitecture
Alpha 21264 processor
Sim-alpha simulator
Out of order execution
Prediction-Speculation
Out-Of-Order Execution

Why out-of-order execution?
- In-order processors stall: the pipeline may not be kept full because of
  the frequent stalls (an example of such a stall is the RAW hazard on the
  next slide)
In an out-of-order processor:
- An instruction with no unresolved dependences is moved to execution
- In other words, instructions that are ready are allowed to execute
Out-Of-Order Execution

Theme: stall only on RAW hazards and structural hazards
(a small sketch of this rule follows this slide)
- RAW hazard
  Example:
  LD  R4, 10(R5)    ; R4 <- Mem[R5 + 10]
  ADD R6, R4, R8    ; must wait for R4 to be loaded
- Structural hazard
  - Occurs because of resource conflicts
  - Example: if the CPU is designed with a single interface to memory,
    that interface is always used during IF and is also used in the MEM
    stage for LOAD or STORE operations. When a load or store reaches the
    MEM stage, IF must stall.
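
A toy sketch of the "stall only on RAW and structural hazards" rule from the last two slides: an instruction may be sent to execution unless a source value has not been produced yet, or the resource it needs is busy. All structures are illustrative assumptions.

#include <stdbool.h>

#define NREGS 32

typedef struct { int dst, src1, src2; bool needs_mem; } Inst;

static bool reg_ready[NREGS];       /* has the value been produced yet?   */
static bool mem_port_busy;          /* single memory interface (IF/MEM)   */

/* RAW hazard: a source operand is not ready yet. */
static bool raw_hazard(Inst i)
{
    return !reg_ready[i.src1] || !reg_ready[i.src2];
}

/* Structural hazard: the single memory port is already in use, e.g. by
 * instruction fetch, so a load/store in MEM would conflict with IF. */
static bool structural_hazard(Inst i)
{
    return i.needs_mem && mem_port_busy;
}

/* The core stalls an instruction only for these two reasons; otherwise it
 * is free to execute ahead of older, stalled instructions. */
static bool must_stall(Inst i)
{
    return raw_hazard(i) || structural_hazard(i);
}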
Outline






Introduction
Microarchitecture
Alpha 21264 processor
Sim-alpha simulator
Out of order execution
Prediction-Speculation
Prediction & Speculation

Problem: serialization due to dependences
- A consumer must wait until the producing instruction executes
- But we need optimal utilization of resources

Solution:
- Predict the unknown information
- Allow the processor to proceed
  (assuming the predicted information is correct)
- If the prediction is correct, proceed normally
- If not, squash the speculatively executed instructions,
  restore the state and restart in the correct direction
  (a small checkpoint/restore sketch follows this slide)
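
A small sketch of the squash-and-restore step described above, assuming the processor checkpoints its register map when it speculates and discards younger work on a misprediction. All names are illustrative.

#include <stdbool.h>
#include <string.h>

#define ARCH_REGS 32

static int rename_map[ARCH_REGS];          /* current speculative mapping   */
static int checkpoint_map[ARCH_REGS];      /* saved at the predicted branch */

/* Take a checkpoint before proceeding down the predicted path. */
static void speculate(void)
{
    memcpy(checkpoint_map, rename_map, sizeof rename_map);
}

/* When the prediction resolves: if it was wrong, squash the speculatively
 * executed instructions (not modeled here) and restore the saved state so
 * execution can restart in the correct direction. */
static void resolve(bool prediction_was_correct)
{
    if (!prediction_was_correct)
        memcpy(rename_map, checkpoint_map, sizeof rename_map);
    /* If correct, proceed normally: the checkpoint is simply discarded. */
}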
Taxonomy of Speculation
A branch outcome has two possibilities
- Worst case: 50% accuracy
A 32-bit register value has 2^32 possibilities
- Worst case: ? (a blind guess among 2^32 values is almost never right)
Prediction & Speculation

Control speculation
- Current branch predictors are highly accurate
- Implemented in commercial processors

Data speculation
- A lot of research (value profiling, value prediction, etc.)
- Not implemented in current processors
- Until now, very little effect on current designs