
Reducing Issue Logic Complexity
in Superscalar Microprocessors
Survey Project
CprE 585 – Advanced Computer Architecture
David Lastine
Ganesh Subramanian

Introduction

The ultimate goal of any computer architect – designing a fast machine
Approaches
  Increasing the clock rate (help from VLSI)
  Increasing bus width
  Increasing pipeline depth
  Superscalar architectures
Tradeoffs between hardware complexity and clock speed
  Given a particular technology, the more complex the hardware, the lower the achievable clock rate

A New Paradigm

Retain the effective functionality of complex superscalar processors
Target the bottleneck in present-day microprocessors
Increase the clock rate
Instruction scheduling is the throughput limiter
Need to handle register renaming, the issue window and the wakeup/select logic effectively
Rethinking circuit design methodologies
Modifying architectural design strategies
Wanting to have the cake and eat it too?
  Aim at reducing power consumption too

Approaches to Handle Issue Logic Complexity

Performance = IPC * Clock Frequency (see the worked example after this list)
Pipelining the scheduling logic reduces IPC
Non-pipelined scheduling logic reduces the clock rate
Architectural solutions
  Non-pipelined scheduling with dependence-queue-based issue logic – Complexity Effective [1]
  Pipelined scheduling with speculative wakeup [2]
  Generic speedup and power conservation using tag elimination [3]
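
To make the Performance = IPC * Clock Frequency tradeoff concrete, here is a small illustration with hypothetical numbers (not taken from the surveyed papers): a scheme that gives up a little IPC can still win overall if it allows a noticeably faster clock.

    # Hypothetical numbers for illustration only -- not from the surveyed papers.
    def performance(ipc, clock_ghz):
        """Instructions retired per nanosecond = IPC * clock frequency."""
        return ipc * clock_ghz

    baseline   = performance(ipc=2.0, clock_ghz=1.0)   # complex, non-pipelined scheduler
    dependence = performance(ipc=1.9, clock_ghz=1.25)  # slightly lower IPC, faster clock
    print(f"baseline:   {baseline:.2f} instructions/ns")
    print(f"dependence: {dependence:.2f} instructions/ns")  # higher despite the IPC loss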

Baseline Superscalar Model

The rename and wakeup/select stages of the generic superscalar pipeline model need to be targeted
Consider VLSI effects when deciding which design component to redesign

Analyzing Baseline Implementations

Physical layout implementations of microprocessor circuits optimized for speed
  Use of dynamic logic for bottleneck circuits
  Manual sizing of transistors on the critical path
  Logic optimizations such as two-level decomposition
Components analyzed
  Register rename logic
  Wakeup logic / issue window
  Selection logic
  Bypass logic

Register Rename Logic

RAM vs. CAM implementations (see the sketch after this list)
Focus on RAM due to scalability
Decreasing feature sizes scale down logic delays, but not wire delays correspondingly
Delay grows quadratically with issue width, but is effectively linear over the design space considered
Wordline and bitline delays will need to be handled in the future
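
A minimal behavioral sketch of RAM-style renaming, where the map table is indexed directly by the logical register number (table sizes and the class interface are illustrative, and freeing of old mappings is omitted):

    # Illustrative RAM-style rename map: lookup is a plain array read indexed by
    # the logical register number; a free physical register is allocated for the
    # destination and the map table entry is overwritten.
    class RenameMap:
        def __init__(self, num_logical, num_physical):
            self.map_table = list(range(num_logical))            # logical -> physical
            self.free_list = list(range(num_logical, num_physical))

        def rename(self, src_regs, dest_reg):
            srcs = [self.map_table[r] for r in src_regs]         # read ports
            new_phys = self.free_list.pop(0)                     # allocate destination
            self.map_table[dest_reg] = new_phys                  # write port
            return srcs, new_phys

    rm = RenameMap(num_logical=32, num_physical=80)
    print(rm.rename(src_regs=[1, 2], dest_reg=3))                # ([1, 2], 32)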

Wakeup Logic

CAM is preferred (see the sketch after this list)
Tag drive times are quadratic functions of both window size and issue width
Tag match times are quadratic functions of issue width only
All delays are effectively linear over the design space considered
Broadcast operation delays will need to be handled in the future
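
A behavioral sketch of CAM-style wakeup: every window entry compares each broadcast destination tag against its outstanding operand tags, and requests issue once all operands are ready (the entry layout below is illustrative):

    # Behavioral model of CAM-style wakeup: each broadcast tag is compared against
    # every operand tag of every window entry (one comparator per operand per entry).
    class WindowEntry:
        def __init__(self, op_tags):
            self.op_tags = list(op_tags)           # outstanding source operand tags
            self.ready = [False] * len(op_tags)

        def wakeup(self, broadcast_tags):
            for i, tag in enumerate(self.op_tags):
                if tag in broadcast_tags:
                    self.ready[i] = True
            return all(self.ready)                 # raise the issue request when all ready

    entry = WindowEntry(op_tags=[7, 12])
    print(entry.wakeup({7}))                       # False: still waiting on tag 12
    print(entry.wakeup({12, 3}))                   # True: both operands now ready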

Selection Logic

Tree of arbiters (see the sketch after this list)
Requests flow from the issue window through the arbiter tree, and functional-unit grants flow back to the issue window
Necessity of a selection policy (oldest first / leftmost first)
Delays are proportional to the logarithm of the window size
All delays considered are logic delays
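
A small sketch of the arbiter tree with a leftmost-first policy: in hardware the subtrees are evaluated in parallel, so the delay grows with the tree depth, i.e. the logarithm of the window size (the recursive form below is only a behavioral model):

    # Tree-of-arbiters selection, leftmost-first policy: requests propagate toward
    # the root and a single grant is steered back to exactly one requesting entry.
    def select(requests, lo=0, hi=None):
        """Return the index of the granted window entry, or None if none request."""
        if hi is None:
            hi = len(requests)
        if hi - lo == 1:                           # leaf arbiter cell
            return lo if requests[lo] else None
        mid = (lo + hi) // 2
        left = select(requests, lo, mid)           # left subtree has priority
        return left if left is not None else select(requests, mid, hi)

    print(select([False, False, True, True, False, True, False, False]))  # -> 2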

Bypass Logic

The number of bypass paths depends on pipeline depth (linearly) and issue width (quadratically) – see the sketch after this list
Composed of operand MUXes and buffer drivers
Delays are quadratically proportional to the length of the result wires, and hence to issue width
Becomes significant compared to the other delays as feature size shrinks, since wire delays do not scale down with logic delays
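
As a concrete reading of the scaling above, the path count below uses the commonly cited relation that an issue width of IW and S pipeline stages after a result is produced require on the order of 2 * IW^2 * S bypass paths (the numbers are a toy calculation, not results from the paper):

    # Bypass path count: each of IW result buses in each of S post-execute stages
    # must be able to feed both source operands of all IW functional-unit inputs.
    def bypass_paths(issue_width, stages_after_result, operands_per_fu=2):
        return operands_per_fu * issue_width * issue_width * stages_after_result

    for iw in (2, 4, 8):
        print(iw, bypass_paths(issue_width=iw, stages_after_result=1))   # 8, 32, 128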

Complexity Effective Microarchitecture Design Premises

Retain the benefits of complex issue schemes but enable faster clocking
Design assumption: wakeup + select and data bypassing should not be pipelined, since these operations must be atomic if dependent instructions are to execute in consecutive cycles

Dependence Based Microarchitecture

Replace the issue window with FIFOs, each queue composed of dependent instructions
Steer instructions to the appropriate FIFO in the rename stage using heuristics (see the sketch after this list)
'SRC_FIFO' and 'reservation tables' handle dependencies and wakeup
IPC drops slightly, but the clock rate increases enough to give a faster overall implementation
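
A simplified sketch of the steering idea: SRC_FIFO records which FIFO holds the producer of each register, and a new instruction joins that FIFO only when its producer sits at the tail, otherwise it falls back to an empty FIFO (the heuristic details and structure names here are simplified relative to [1]):

    # Simplified dependence-based FIFO steering.
    from collections import deque

    NUM_FIFOS = 4
    fifos = [deque() for _ in range(NUM_FIFOS)]
    producer = {}      # register -> id of the instruction that produces it
    src_fifo = {}      # register -> index of the FIFO holding that producer

    def steer(insn_id, outstanding_srcs, dest_reg):
        for reg in outstanding_srcs:
            f = src_fifo.get(reg)
            if f is not None and fifos[f] and fifos[f][-1] == producer[reg]:
                break                                   # chain behind the producer
        else:
            empties = [i for i, q in enumerate(fifos) if not q]
            if not empties:
                return None                             # no suitable FIFO: stall dispatch
            f = empties[0]
        fifos[f].append(insn_id)
        producer[dest_reg] = insn_id
        src_fifo[dest_reg] = f
        return f

    print(steer("i1", outstanding_srcs=[], dest_reg=1))   # 0: starts a new chain
    print(steer("i2", outstanding_srcs=[1], dest_reg=2))  # 0: queued behind i1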

Clustering Dependence Based Microarchitectures

Reduce bypass delays by reducing the length of the bypass paths
Minimize inter-cluster communication; an extra-cycle penalty is incurred otherwise

Clustered Microarchitecture Types

Single window, execution-driven steering
Two windows, dispatch-driven steering – best performer
Two windows, random steering

Pipelining Dynamic Instruction Scheduling Logic

Wakeup + select was held atomic in the previous implementation
Increase performance by pipelining it, while retaining execution of dependent instructions in consecutive cycles
Speculate on the wakeup by predicting based on both parent and grandparent instructions
Integrated into the Tomasulo approach

Wakeup Logic Details

Tags are broadcast as soon as an instruction begins execution
The broadcast-to-execution-completion latency is specified as shown
The match bit acts as a sticky bit to enable the delay countdown (see the sketch below)
Need not always be correct due to unexpected stalls
Select logic remains as in the previous work
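
A behavioral sketch of the countdown: once the operand's tag matches a broadcast, the sticky match bit stays set and a counter equal to the producer's broadcast-to-completion latency counts down to readiness (field names and timing granularity are illustrative):

    # Sticky match bit plus latency countdown for one operand of a window entry.
    class OperandTracker:
        def __init__(self, tag, latency):
            self.tag = tag
            self.latency = latency       # producer's broadcast-to-completion latency
            self.matched = False
            self.counter = latency

        def tick(self, broadcast_tags):
            if self.tag in broadcast_tags:
                self.matched = True      # sticky: stays set after the first match
            if self.matched and self.counter > 0:
                self.counter -= 1
            return self.matched and self.counter == 0   # operand value will be available

    op = OperandTracker(tag=7, latency=2)
    print(op.tick({7}))     # False: matched, countdown started
    print(op.tick(set()))   # True: countdown expired, producer has completed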

Pipelining Rename Logic

A child instruction assumes its parent will broadcast its tag in the next cycle IF the grandparent instructions broadcast their tags
Speculative wakeup on receiving the grandparent tags, so the child can be considered for selection in the next cycle
Speculative, since selection of the parent for execution is not guaranteed
Requires modifications to the rename map and dependence analysis logic

Wakeup and Select Logic

A wakeup request is sent after examining the ready bits derived from the parents' and grandparents' tags (see the sketch below)
A multi-cycle parent's field can be ignored
In addition to the speculative readiness signified by the request line, a confirm line is asserted when all parents are ready
False selections involve non-confirmed requests
Problematic only when truly ready instructions are not selected
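
A behavioral sketch of the request/confirm pair: an entry raises its request line speculatively once the grandparents' tags have been seen, and raises confirm only once the parents' tags have also been seen, so the select logic can tell a confirmed grant from a speculative one (the field names below are illustrative):

    # Speculative wakeup with a confirm line: request from grandparent tags
    # (speculative), confirm only once all parent tags have been broadcast.
    class SchedulerEntry:
        def __init__(self, parent_tags, grandparent_tags):
            self.parents = set(parent_tags)
            self.grandparents = set(grandparent_tags)
            self.seen = set()

        def observe(self, broadcast_tags):
            self.seen |= set(broadcast_tags)
            request = self.grandparents <= self.seen    # speculative readiness
            confirm = self.parents <= self.seen         # parents definitely ready
            return request, confirm

    e = SchedulerEntry(parent_tags=[5], grandparent_tags=[2, 3])
    print(e.observe({2, 3}))   # (True, False): speculative request, not yet confirmed
    print(e.observe({5}))      # (True, True):  selecting this entry is now safe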

Implementation & Experimentation Details

Use of a cycle-accurate, execution-driven simulator for the Alpha ISA
  Baseline – conventional 2-cycle scheduling pipeline
  Budget / Deluxe – speculatively woken-up scheduling
  Ideal – 1-cycle scheduling pipeline
Factors such as issue width and reservation station depth were considered
Significant reduction in the critical path with only minor IPC impact
Enables higher clock frequencies, deeper pipelines and larger instruction windows for better performance

Paradigm Shift

So far we have added hardware to improve performance
However, the issue window can also be improved by removing hardware

Current Situation of Issue Windows

Content Addressable Memory (CAM) latency dominates instruction window latency
The load capacitance of the CAM is a major limiting factor for speed
Parasitic capacitance also wastes power
Issue logic uses a large share of the power budget
  16% for the Pentium Pro
  18% for the Alpha 21264

Unnecessary Circuitry

Observation: reservation stations compare broadcast tags against both operands; often this is unnecessary
Only 25% to 35% of architectural instructions have two operands
Simulation of SPEC2K programs shows only 10% to 20% of instructions need two comparators at runtime

Simulation

Used SimpleScalar
Varied the instruction window size: 16, 64 and 256 entries
Load/store queue of half the window size

Removing Extra Comparators

Specialize the reservation stations (see the sketch below)
  The number of comparators per station varies from 2 down to 0
  Stall if no station with the required minimum number of comparators is available
Remove further comparators by speculating on which operand will be the last to complete
  Needs a predictor
  Misprediction penalty
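
A toy allocation policy for the specialized stations: an instruction is placed in a free station that has at least as many comparators as it has outstanding operand tags, preferring the cheapest that fits, and dispatch stalls when none is available (the station mix and the policy are illustrative, not taken from [3]):

    # Toy allocation of reservation stations specialized with 2, 1, or 0 tag
    # comparators; an instruction needs one comparator per outstanding operand tag.
    from collections import Counter

    free = Counter({2: 4, 1: 8, 0: 4})           # illustrative station mix

    def allocate(outstanding_tags):
        for comparators in sorted(free):         # prefer the cheapest sufficient station
            if comparators >= outstanding_tags and free[comparators] > 0:
                free[comparators] -= 1
                return comparators
        return None                              # no suitable station free: stall dispatch

    print(allocate(0))   # 0-comparator station
    print(allocate(2))   # 2-comparator station
    print(allocate(1))   # 1-comparator station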

Predictor

The paper discusses a GSHARE predictor
It is based on a branch predictor not seen in class
The idea behind it starts by noting that good indices for selecting binary predictors are
  The branch address
  The global history
Thus, if both are good, XORing them together should produce an index embodying more information than either alone (see the sketch below)
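
A minimal sketch of gshare indexing in the spirit of McFarling's scheme: the branch address is XORed with a global history register to index a table of 2-bit saturating counters (table size and history length are illustrative; in the tag-elimination work this style of predictor is repurposed to predict the last-arriving operand rather than a branch direction):

    # Minimal gshare-style predictor: index = PC bits XOR global history,
    # entries are 2-bit saturating counters.
    TABLE_BITS = 12
    table = [1] * (1 << TABLE_BITS)      # start at weakly not-taken
    history = 0

    def index(pc):
        return ((pc >> 2) ^ history) & ((1 << TABLE_BITS) - 1)

    def predict(pc):
        return table[index(pc)] >= 2     # MSB of the 2-bit counter

    def update(pc, taken):
        global history
        i = index(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        history = ((history << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)

    print(predict(0x4001c))              # False before any training
    for _ in range(20):                  # train on an always-taken branch
        update(0x4001c, True)
    print(predict(0x4001c))              # True once history and counter have settled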

Predictor II

Here is how GSHARE does for various sizes of the prediction table.

Misprediction

The Alpha has a scoreboard of valid registers called RDY
Check whether all operands are available in the register read stage; if not, flush the pipeline in the same fashion as a latency misprediction (see the sketch below)
RDY must be expanded so that the number of read ports matches the issue width
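
A tiny sketch of that check: in the register read stage every source register is looked up in the RDY scoreboard, and any miss triggers a flush just as a latency misprediction would (the names and the set-based model are illustrative):

    # Illustrative register-read-stage check against an RDY-style scoreboard.
    RDY = set()                           # registers whose values are ready

    def register_read_check(srcs):
        """True if the instruction may proceed; False means flush and reschedule."""
        return all(r in RDY for r in srcs)

    RDY.update({4, 9})
    print(register_read_check([4, 9]))    # True: proceed to execute
    print(register_read_check([4, 11]))   # False: flush, as on a latency mispredict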

IPC Losses

Reservation stations with two comparators can be exhausted, causing stalls for SPEC2K benchmarks like swim
Adding last-tag prediction improves swim's performance, but causes 1-3% losses for benchmarks such as crafty and gcc due to mispredictions

Simulation

The format shown is the number of two-tag / one-tag / zero-tag reservation stations
The last-tag predictor is used only for configurations with no two-tag reservation stations

Benefits of Comparator Removal

In most cases the clock rate can be 25-45% faster, since
  The tag bus no longer has to reach all reservation stations
  Removing comparators removes load capacitance
The energy saved from the capacitance removal is 30-60%
Power savings do not track the energy savings, since the clock rate can now be increased

Simulation Results for Benefits

References

1. Complexity-Effective Superscalar Processors – Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith
2. On Pipelining Dynamic Instruction Scheduling Logic – J. Stark, M. D. Brown, and Yale N. Patt
3. Efficient Dynamic Scheduling Through Tag Elimination – Dan Ernst and Todd Austin
4. Combining Branch Predictors – Scott McFarling
Questions?