Processor Architecture


Chapter 4
Multiple-Issue Processors
Multiple-issue processors

This chapter concerns multiple-issue processors, i.e. superscalar and VLIW
(very long instruction word) processors.

Most of today's general-purpose microprocessors are four- or six-issue superscalars, often with an enhanced Tomasulo scheme.

VLIW is the choice for most signal processors.

VLIW is proposed as EPIC (explicitly parallel instruction computing) by Intel for
its IA-64 ISA.
Components of a superscalar processor
[Block diagram: the I-cache and its MMU feed an instruction fetch unit steered by a branch unit with BHT, BTAC, and RAS. The instruction decode and register rename unit fills an instruction buffer, from which the instruction issue unit (tracked by a reorder buffer) issues to load/store unit(s) (with their own MMU and the D-cache), floating-point unit(s), integer unit(s), and multimedia unit(s). A retire unit writes back to the floating-point, general-purpose, multimedia, and rename registers; a bus interface unit connects to memory.]
Floorplan of the PowerPC 604

[Chip floorplan figure]
Superscalar pipeline
(PowerPC- and enhanced Tomasulo-scheme)

[Pipeline figure: Instruction Fetch → Instruction Decode and Rename → Instruction Window → Issue into reservation stations in front of the FUs → Execution → Retire and Write Back]
Instructions in the instruction window are free from control dependences due to branch prediction, and free from name dependences due to register renaming.
So, only (true) data dependences and structural conflicts remain to be resolved.
Superscalar pipeline without reservation stations

[Pipeline figure: Fetch → Decode → Rename → issue window (Wakeup/Select) → Register Read → Execute (Bypass) → D-cache Access → Register Write → Commit; the register file is read after selection, with bypassing around the data cache access.]
Superscalar pipeline with decoupled instruction windows

[Pipeline figure: Fetch → Decode → Rename → separate issue windows per FU group (each with Wakeup/Select) → Register Read → Execute (Bypass) → D-cache Access → Register Write → Commit]
Issue

The issue logic examines the waiting instructions in the instruction window and simultaneously assigns (issues) a number of instructions to the FUs, up to a maximum issue bandwidth.
Several instructions can be issued simultaneously (the issue bandwidth).
The program order of the issued instructions is stored in the reorder buffer.
Instruction issue from the instruction window can be:
– in-order (only in program order) or out-of-order,
– subject to checking data dependences and resource constraints simultaneously,
– or divided into two (or more) stages:
• checking structural conflicts in the first stage and data dependences in the next stage (or vice versa);
• in the case of structural conflicts first, the instructions are issued to reservation stations (buffers) in front of the FUs, where the issued instructions await missing operands (PowerPC/enhanced Tomasulo scheme).
Reservation station(s)

Two definitions in the literature:
– A reservation station is a buffer for a single instruction with its operands (original Tomasulo paper, Flynn's book, Hennessy/Patterson book).
– A reservation station is a buffer (in front of one or more FUs) with one or more entries, where each entry can buffer an instruction with its operands (e.g. PowerPC literature).
Depending on the specific processor, reservation stations can be central to a number of FUs, or each FU can have one or more reservation stations of its own.
Instructions await their operands in the reservation stations, as in the Tomasulo algorithm.
Dispatch (PowerPC- and enhanced Tomasulo-scheme)

An instruction is said to be dispatched from a reservation station to the FU when all operands are available and execution starts.
If all its operands are available during issue and the FU is not busy, an instruction is dispatched immediately, starting execution in the cycle after the issue.
So, dispatch is usually not a pipeline stage of its own.
An issued instruction may stay in the reservation station for zero to several cycles.
Dispatch and execution are performed out of program order.
Other authors interchange the meanings of issue and dispatch, or use different semantics.
Completion

When the FU finishes the execution of an instruction and the result is ready for forwarding and buffering, the instruction is said to complete.
Instruction completion is out of program order.
During completion the reservation station is freed and the state of the execution is noted in the reorder buffer.
The state of the reorder buffer entry can denote an interrupt occurrence.
An instruction can be completed and still be on a speculative path, which is also monitored in the reorder buffer.
Commitment

After completion, operations are committed in order.
An instruction can be committed:
– if all instructions preceding it in program order are already committed or can be committed in the same cycle,
– if no interrupt occurred before or during its execution, and
– if the instruction is no longer on a speculative path.
Upon or after commitment, the result of an instruction is made permanent in the architectural register set,
– usually by writing the result back from the rename register to the architectural register.
Precise interrupt (precise exception)

If an interrupt occurs, all instructions that are in program order before the interrupt-signaling instruction are committed, and all later instructions are removed.
A precise exception means that all instructions before the faulting instruction are committed and those after it can be restarted from scratch.
Depending on the architecture and the type of exception, the faulting instruction is either committed or removed without any lasting effect.
Retirement

An instruction retires when its reorder buffer slot is freed, either
– because the instruction commits (the result is made permanent) or
– because the instruction is removed (without making permanent changes).
A result is made permanent by copying the result value from the rename register to the architectural register.
– This is often done in a separate stage after the commitment of the instruction, with the effect that the rename register is freed one cycle after commitment.
Explanation of the term “superscalar”

Definition:
Superscalar machines are distinguished by their ability to (dynamically) issue
multiple instructions each clock cycle from a conventional linear instruction
stream.

In contrast to superscalar processors, VLIW processors use a long instruction
word that contains a usually fixed number of instructions that are fetched,
decoded, issued, and executed synchronously.
Explanation of the term “superscalar”

Instructions are issued from a sequential stream of normal instructions (in
contrast to VLIW where a sequential stream of instruction tuples is used).

The instructions that are issued are scheduled dynamically by the hardware (in
contrast to VLIW processors which rely on a static scheduling by the compiler).

More than one instruction can be issued each cycle (motivating the term
superscalar instead of scalar).

The number of issued instructions is determined dynamically by the hardware; the actual number of instructions issued in a single cycle can range from zero up to the maximum instruction issue bandwidth.
(In contrast, in VLIW the number of scheduled instructions is fixed, because instruction slots that cannot be filled are padded with no-ops.)
Explanation of the term “superscalar”

Dynamic issue in superscalar processors can be restricted to in-order issue, or it can also allow issue of instructions out of program order.
– Only in-order issue is possible with VLIW processors.
Dynamic instruction issue complicates the hardware scheduler of a superscalar processor compared with a VLIW.
The scheduler complexity increases further when multiple instructions are issued out-of-order from a large instruction window.
Superscalar execution presumes that multiple FUs are available.
– The number of available FUs is at least the maximum issue bandwidth, but often higher to diminish potential resource conflicts.
The superscalar technique is a microarchitecture technique, not an architecture technique.
Please recall: architecture, ISA, microarchitecture

The architecture of a processor is defined as the instruction set architecture (ISA), i.e. everything that is visible outside of a processor.
In contrast, the microarchitecture comprises implementation techniques
– like the number and type of pipeline stages, issue bandwidth, number of FUs, size and organization of on-chip cache memories, etc.
– The maximum issue bandwidth and the internal structure of the processor can be changed without changing the architecture.
– Several architecturally compatible processors may exist with different microarchitectures, all able to execute the same code.
An optimizing compiler may also use knowledge of the microarchitecture.
Sections of a superscalar processor

The ability to issue and execute instructions out of order partitions a superscalar pipeline into three distinct sections:
– an in-order section with the instruction fetch, decode, and rename stages (the issue stage also belongs to the in-order section in the case of in-order issue),
– an out-of-order section starting with the issue stage in the case of an out-of-order issue processor, including the execution stage and usually the completion stage, and again
– an in-order section that comprises the retirement and write-back stages.
Temporal vs. spatial parallelism

Instruction pipelining, superscalar and VLIW techniques all exploit fine-grain
(instruction-level) parallelism.

Pipelining utilizes temporal parallelism.

Superscalar and VLIW techniques also utilize spatial parallelism.

Performance can be increased by longer pipelines (deeper pipelining) and faster
transistors (a faster clock) emphasizing an improved pipelining.

Provided that enough fine-grain parallelism is available, performance can also be
increased by more FUs and a higher issue bandwidth using more transistors in
the superscalar and VLIW cases.
I-cache access and instruction fetch

Harvard architecture: separate instruction and data memory and access paths
– used internally in high-performance microprocessors with separate on-chip primary I-cache and D-cache.
The I-cache is less complicated to control than the D-cache, because
– it is read-only and
– it is not subject to cache coherence, in contrast to the D-cache.
Sometimes the instructions in the I-cache are predecoded on their way from the memory interface to the I-cache to simplify the decode stage.
Instruction fetch

The main problem of instruction fetching is control transfer performed by jump, branch, call, return, and interrupt instructions:
– If the starting PC address is not the address of the cache line, fewer instructions than the fetch width are returned.
– Instructions after a control transfer instruction are invalidated.
– A fetch from multiple cache lines at different locations may be needed in future very wide-issue processors, where often more than one branch will be contained in a single contiguous fetch block.
Problem with target instruction addresses that are not aligned to the cache line addresses:
– A self-aligned instruction cache reads and concatenates two consecutive lines within one cycle so that the full fetch bandwidth can always be returned.
Implementation:
• either by use of a dual-ported I-cache,
• by performing two separate cache accesses in a single cycle,
• or by a two-banked I-cache (preferred).
Prefetching and instruction fetch prediction

Prefetching improves the instruction fetch performance,
but fetching is still limited because instructions after a control transfer must be
invalidated.

Instruction fetch prediction helps to determine the next instructions to be
fetched from the memory subsystem.

Instruction fetch prediction is applied in conjunction with branch prediction.
Branch prediction



Branch prediction foretells the outcome of conditional branch instructions.
Excellent branch handling techniques are essential for today's and for future
microprocessors.
The task of high performance branch handling consists of the following
requirements:
– an early determination of the branch outcome (the so-called branch
resolution),
– buffering of the branch target address in a BTAC after its first calculation
and an immediate reload of the PC after a BTAC match,
– an excellent branch predictor (i.e. branch prediction technique) and
speculative execution mechanism,
– often another branch is predicted while a previous branch is still
unresolved, so the processor must be able to pursue two or more
speculation levels,
– and an efficient rerolling mechanism when a branch is mispredicted
(minimizing the branch misprediction penalty).
Misprediction penalty

The performance of branch prediction depends on the prediction accuracy and the cost of misprediction.
Prediction accuracy can be improved by inventing better branch predictors.
The misprediction penalty depends on many organizational features:
– the pipeline length (favoring shorter pipelines over longer pipelines),
– the overall organization of the pipeline,
– whether misspeculated instructions can be removed from internal buffers, or have to be executed and can only be removed in the retire stage,
– the number of speculative instructions in the instruction window or the reorder buffer (typically only a limited number of instructions can be removed each cycle).
Rerolling when a branch is mispredicted is expensive:
– 4 to 9 cycles in the Alpha 21264,
– 11 or more cycles in the Pentium II.
Branch-Target Buffer or Branch-Target Address Cache

The Branch Target Buffer (BTB) or Branch-Target Address Cache (BTAC) stores branch and jump target addresses.
It should already be known in the IF stage whether the as-yet-undecoded instruction is a jump or branch.
The BTB is accessed during the IF stage.
The BTB consists of a table with branch addresses, the corresponding target addresses, and prediction information.
Variations:
Branch Target Cache (BTC): additionally stores one or more target instructions.
Return Address Stack (RAS): a small stack of return addresses for procedure calls and returns, used in addition to and independently of a BTB.
Branch-Target Buffer or Branch-Target Address Cache

[Table figure: each entry holds a branch address, the corresponding target address, and prediction bits.]
Static branch prediction

Static branch prediction always predicts the same direction for the same branch during the whole program execution.
It comprises hardware-fixed prediction and compiler-directed prediction.
Simple hardware-fixed direction mechanisms can be:
– predict always not taken,
– predict always taken,
– backward branch predict taken, forward branch predict not taken.
Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction.
Dynamic branch prediction

In a dynamic branch prediction scheme the hardware adjusts the prediction while execution proceeds.
The prediction is decided on the basis of the program's computation history.
After a start-up phase of the program execution, during which a static branch prediction might be effective, history information is gathered and dynamic branch prediction becomes effective.
In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.
One-bit predictor

[State diagram: two states, Predict Taken and Predict Not Taken; a taken branch (T) moves to or stays in Predict Taken, a not-taken branch (NT) moves to or stays in Predict Not Taken.]
One-bit vs. two-bit predictors

A one-bit predictor correctly predicts a branch at the end of a loop iteration as long as the loop does not exit.
In nested loops, a one-bit prediction scheme causes two mispredictions for the inner loop:
– one at the end of the loop, when the iteration exits the loop instead of looping again, and
– one when executing the first iteration of the next loop run, when it predicts exit instead of looping.
Such a double misprediction in nested loops is avoided by a two-bit predictor scheme.
Two-bit prediction: a prediction must miss twice before it is changed.
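A minimal C sketch of such a two-bit scheme, in the saturation counter variant shown in the next state diagram; the type and function names are illustrative assumptions.

/* Two-bit saturation counter: 0 = strongly not taken ... 3 = strongly taken. */
typedef unsigned char Counter2;

int predict_taken(Counter2 c)            /* states 10 and 11 predict taken  */
{
    return c >= 2;
}

void train(Counter2 *c, int taken)
{
    if (taken)  { if (*c < 3) (*c)++; }  /* move towards strongly taken     */
    else        { if (*c > 0) (*c)--; }  /* move towards strongly not taken */
}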
Two-bit predictors (saturation counter scheme)

[State diagram: four states (11) Predict Strongly Taken, (10) Predict Weakly Taken, (01) Predict Weakly Not Taken, (00) Predict Strongly Not Taken; a taken branch (T) moves one state towards (11), a not-taken branch (NT) moves one state towards (00).]
Two-bit predictors (hysteresis scheme)

[State diagram: the same four states (11), (10), (01), (00), but a misprediction in a weak state switches directly to the strong state of the opposite direction: from (10) Predict Weakly Taken a not-taken branch moves to (00) Predict Strongly Not Taken, and from (01) Predict Weakly Not Taken a taken branch moves to (11) Predict Strongly Taken.]
Two-bit predictors

The two-bit prediction scheme is extendable to an n-bit scheme.
Studies showed that a two-bit prediction scheme does almost as well as an n-bit scheme with n > 2.
Two-bit predictors can be implemented in the Branch Target Buffer (BTB) by assigning two state bits to each entry in the BTB.
Another solution is to use a BTB for target addresses and a separate Branch History Table (BHT) as prediction buffer.
A misprediction in the BHT occurs for two reasons:
– either a wrong guess for that branch,
– or the branch history of a wrong branch is used, because the table is indexed.
In an indexed table lookup, part of the instruction address is used as an index to identify a table entry.
Two-bit predictors and correlation-based prediction

Two-bit predictors work well for programs which contain many frequently
executed loop-control branches (floating-point intensive programs).

Shortcomings arise from dependent (correlated) branches, which are frequent
in integer-dominated programs.
Example:

if (d==0)    /* branch b1 */
    d=1;
if (d==1)    /* branch b2 */
    ...

      bnez R1, L1      ; branch b1 (d ≠ 0)
      addi R1, R0, #1  ; d==0, so d=1
L1:   subi R3, R1, #1
      bnez R3, L2      ; branch b2 (d ≠ 1)
      ...
L2:   ...

Consider a sequence where d alternates between 0 and 2
⇒ a sequence of NT-T-NT-T-NT-T for branches b1 and b2.

The execution behavior is given in the following table:

initial d | d==0 | b1 | d before b2 | d==1 | b2
    0     | yes  | NT |      1      | yes  | NT
    2     | no   | T  |      2      | no   | T
One-bit predictor initialized to “predict taken”

      bnez R1, L1      ; branch b1 (d ≠ 0)
      addi R1, R0, #1  ; d==0, so d=1
L1:   subi R3, R1, #1
      bnez R3, L2      ; branch b2 (d ≠ 1)
      ...
L2:   ...

d alternates between 0 and 2:

      Initial prediction | d==0 | d==2 | d==0
b1:          T           |  NT  |  T   |  NT
b2:          T           |  NT  |  T   |  NT

⇒ the predictor state flips after every execution; every branch is mispredicted.
Two-bit saturation counter predictor initialized to “predict weakly taken”

[Saturation counter state diagram as above: (11) strongly taken, (10) weakly taken, (01) weakly not taken, (00) strongly not taken.]

      bnez R1, L1      ; branch b1 (d ≠ 0)
      addi R1, R0, #1  ; d==0, so d=1
L1:   subi R3, R1, #1
      bnez R3, L2      ; branch b2 (d ≠ 1)
      ...
L2:   ...

d alternates between 0 and 2:

      Initial prediction | d==0 | d==2 | d==0
b1:          WT          | WNT  |  WT  | WNT
b2:          WT          | WNT  |  WT  | WNT

⇒ the counter oscillates between weakly taken and weakly not taken; every branch is mispredicted.
Two-bit predictor (hysteresis counter) initialized to “predict weakly taken”

[Hysteresis state diagram as above: a miss in a weak state switches directly to the strong state of the opposite direction.]

      bnez R1, L1      ; branch b1 (d ≠ 0)
      addi R1, R0, #1  ; d==0, so d=1
L1:   subi R3, R1, #1
      bnez R3, L2      ; branch b2 (d ≠ 1)
      ...
L2:   ...

d alternates between 0 and 2:

      Initial prediction | d==0 | d==2 | d==0
b1:          WT          | SNT  | WNT  | SNT
b2:          WT          | SNT  | WNT  | SNT

⇒ after the initial misses the predictor settles in the not-taken states, so only the taken executions (d==2) are mispredicted, i.e. every second branch execution.
Predictor behavior in the example

A one-bit predictor initialized to “predict taken” for branches b1 and b2
⇒ every branch is mispredicted.
A two-bit predictor of the saturation counter scheme starting from the state “predict weakly taken”
⇒ every branch is mispredicted.
The two-bit predictor (hysteresis counter) of the UltraSPARC mispredicts every second execution of b1 and b2.
A (1,1) correlating predictor takes advantage of the correlation of the two branches; it mispredicts only in the first iteration when d = 2.
Correlation-based predictor

The two-bit predictor scheme uses only the recent behavior of a single branch to predict the future of that branch.
Correlations between different branch instructions are not taken into account.
Correlation-based predictors (correlating predictors) are branch predictors that additionally use the behavior of other branches to make a prediction.
While two-bit predictors use self-history only, the correlating predictor additionally uses neighbor history.
Notation: an (m,n)-correlation-based predictor or (m,n)-predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
Branch history register (BHR): the global history of the most recent m branches can be recorded in an m-bit shift register, where each bit records whether the branch was taken or not taken.
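A minimal C model of an (m,n)-predictor with m = 2 and n = 2, i.e. the (2,2)-predictor shown on the next slide. The table size and the indexing by low-order address bits are illustrative assumptions, not taken from the slides.

#include <stdint.h>

#define M    2                       /* history length: last M branches     */
#define ROWS 1024                    /* rows selected by the branch address */

static uint8_t pht[ROWS][1 << M];    /* 2-bit counters, one per (row, BHR)  */
static uint8_t bhr;                  /* global M-bit branch history         */

int predict(uint32_t branch_addr)    /* 1 = predict taken                   */
{
    return pht[branch_addr % ROWS][bhr] >= 2;
}

void update(uint32_t branch_addr, int taken)
{
    uint8_t *c = &pht[branch_addr % ROWS][bhr];
    if (taken  && *c < 3) (*c)++;    /* saturate upwards                    */
    if (!taken && *c > 0) (*c)--;    /* saturate downwards                  */
    /* shift the newest outcome into the history register */
    bhr = (uint8_t)(((bhr << 1) | (taken != 0)) & ((1 << M) - 1));
}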
Correlation-based prediction: (2,2)-predictor

[Figure: the branch address selects a row in each of four pattern history tables (PHTs) of 2-bit predictors; the 2-bit branch history register (BHR), a shift register holding the outcomes of the last two branches (e.g. 10), selects which PHT supplies the prediction.]
Prediction behavior of the (1,1) correlating predictor

      bnez R1, L1      ; branch b1 (d ≠ 0)
      addi R1, R0, #1  ; d==0, so d=1
L1:   subi R3, R1, #1
      bnez R3, L2      ; branch b2 (d ≠ 1)
      ...
L2:   ...

d alternates between 0 and 2. The PHT holds a 1-bit predictor per branch for each BHR value (row 0: last branch not taken, row 1: last branch taken); all entries start at 1 (predict taken), and the BHR starts at 1.

[Figure sequence over five slides, showing PHT and BHR after each step:]
– Initial state: PHT row 0: 1 1, row 1: 1 1, BHR = 1; b1 and b2 are both predicted T.
– First iteration (d==0): b1 and b2 are both not taken, so both initial T predictions miss; the updated entries give PHT row 1: 0 1 (after b1), row 0: 1 0 (after b2), BHR = 0.
– Second iteration (d==2): b1 (BHR = 0, row 0) and b2 (BHR = 1, row 1) are both predicted T and are taken: correct, BHR = 1.
– Third iteration (d==0): both are predicted NT and are not taken: correct.

Recorded predictions:

      Initial prediction | d==0 | d==2
b1:          T           |  NT  |  T
b2:          T           |  NT  |  T

After the mispredictions of the first iteration, the BHR distinguishes the d==0 and d==2 contexts and all further predictions are correct.
Two-level adaptive predictors

Developed by Yeh and Patt at the same time (1992) as the correlation-based prediction scheme.
The basic two-level predictor uses a single global branch history register (BHR) of k bits to index into a pattern history table (PHT) of 2-bit counters.
Global history schemes correspond to correlation-based predictor schemes.
Notation:
GAg:
– a single global BHR (denoted G) and
– a single global PHT (denoted g),
– A stands for adaptive.
All PHT implementations of Yeh and Patt use 2-bit predictors.
A GAg predictor with a 4-bit BHR is denoted GAg(4).
Implementation of a GAg(4)-predictor

[Figure: the 4-bit branch history register (BHR), shifting in the latest outcome, holds e.g. 1100; this pattern indexes entry 1100 of the branch pattern history table (PHT), whose 2-bit counter (here 11) yields the prediction “taken”.]

In the GAg predictor schemes the PHT lookup depends entirely on the bit pattern in the BHR and is completely independent of the branch address.
Mispredictions can be reduced by additionally using:

– the full branch address to distinguish multiple PHTs (called per-address PHTs),
– a subset of branches (e.g. n bits of the branch address) to distinguish multiple PHTs (called per-set PHTs),
– the full branch address to distinguish multiple BHRs (called per-address BHRs),
– a subset of branches to distinguish multiple BHRs (called per-set BHRs),
– or a combination scheme.
Implementation of a GAp(4) predictor

[Figure: the 4-bit BHR (e.g. 1100) indexes into per-address PHTs; the branch address selects which PHT is used.]

GAp(4) means a 4-bit BHR and a PHT for each branch.
GAs(4, 2^n)

[Figure: the 4-bit BHR (e.g. 1100) indexes into per-set PHTs; n bits of the branch address select among the 2^n PHTs.]

GAs(4, 2^n) means a 4-bit BHR; n bits of the branch address are used to choose among 2^n PHTs with 2^4 = 16 entries each.
Compare correlation-based (2,2)-predictor (left) with two-level adaptive GAs(4, 2^n) predictor (right)

[Figure: on the left, the (2,2)-predictor selects among four PHT columns with a 2-bit BHR and indexes the rows by the branch address; on the right, the GAs(4, 2^n) predictor indexes per-set PHTs with a 4-bit BHR, with n branch-address bits selecting the PHT.]
Two-level adaptive predictors: per-address history schemes

The first-level branch history refers to the last k occurrences of the same branch instruction (using self-history only!).
Therefore a BHR is associated with each branch instruction.
The per-address branch history registers are combined in a table called the per-address branch history table (PBHT).
In the simplest per-address history scheme, the BHRs index into a single global PHT.
⇒ denoted PAg (multiple per-address indexed BHRs and a single global PHT).
PAg(4)

[Figure: the branch address selects a BHR in the per-address BHT (e.g. 1100); this history pattern indexes the single global PHT, whose 2-bit counter yields the prediction.]
PAp(4)

[Figure: the addresses of branches b1 and b2 each select their own BHR in the per-address BHT and their own per-address PHT; each BHR (e.g. 1100) indexes its PHT to yield the prediction.]
Two-level adaptive predictors: per-set history schemes

Per-set history schemes (SAg, SAs, and SAp): the first-level branch history refers to the last k occurrences of branch instructions from the same subset.
Each BHR is associated with a set of branches.
Possible set attributes:
– branch opcode,
– the branch class assigned by the compiler, or
– the branch address (most important!).
SAg(4)

[Figure: n bits of the branch address select a BHR in the per-set BHT (e.g. 1100); this history indexes the single global PHT to yield the prediction.]
SAs(4)

[Figure: n bits of the branch address select a shared BHR for the set containing b1 and b2 in the per-set BHT; the history (e.g. 1100) indexes a per-set PHT, also selected by n address bits, to yield the prediction.]
Two-level adaptive predictors
Full table:

                    single global PHT | per-set PHTs | per-address PHTs
single global BHR         GAg         |     GAs      |       GAp
per-address BHT           PAg         |     PAs      |       PAp
per-set BHT               SAg         |     SAs      |       SAp
Estimation of hardware costs

Scheme name   | BHR length | No. of PHTs | Hardware cost
GAg(k)        |     k      |      1      | k + 2^k * 2
GAs(k, p)     |     k      |      p      | k + p * 2^k * 2
GAp(k)        |     k      |      b      | k + b * 2^k * 2
PAg(k)        |     k      |      1      | b * k + 2^k * 2
PAs(k, p)     |     k      |      p      | b * k + p * 2^k * 2
PAp(k)        |     k      |      b      | b * k + b * 2^k * 2
SAg(k)        |     k      |      1      | s * k + 2^k * 2
SAs(k, s * p) |     k      |      p      | s * k + p * 2^k * 2
SAp(k)        |     k      |      b      | s * k + b * 2^k * 2

In the table, b is the number of PHTs or BHT entries for the per-address schemes; p and s denote the number of PHTs or BHT entries for the per-set schemes.
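As a worked example of these formulas (our own, not from the original table): a GAg(12) predictor needs a 12-bit BHR plus a single PHT with 2^12 entries of 2 bits each, i.e. 12 + 2^12 * 2 = 8204 bits, roughly 1 Kbyte of predictor state, while a PAg(12) with b = 512 BHT entries needs 512 * 12 + 2^12 * 2 = 14336 bits.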
Two-level adaptive predictors: simulations of Yeh and Patt using the SPEC89 benchmarks

The performance of the global history schemes is sensitive to the branch history length.
Interference of different branches that map to the same pattern history table is decreased by lengthening the global BHR.
Similarly, adding PHTs reduces the possibility of pattern history interference by mapping interfering branches into different tables.
Global history schemes are better than the per-address schemes for the integer SPEC89 programs:
– they utilize branch correlation, which is often present in the frequent if-then-else statements of integer programs.
Per-address schemes are better for the floating-point intensive programs:
– they are better at predicting loop-control branches, which are frequent in the floating-point SPEC89 benchmark programs.
The per-set history schemes are in between the two other schemes.
gselect and gshare predictors

gselect predictor: concatenates some lower-order bits of the branch address and the global history.
gshare predictor: uses the bitwise exclusive OR of part of the branch address and the global history as hash function.
McFarling: gshare is slightly better than gselect.

Branch address | BHR      | gselect4/4 | gshare8/8
00000000       | 00000001 | 00000001   | 00000001
00000000       | 00000000 | 00000000   | 00000000
11111111       | 00000000 | 11110000   | 11111111
11111111       | 10000000 | 11110000   | 01111111
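A minimal C sketch of the two index functions from the table above (a software model of the index formation only, not a hardware description):

#include <stdint.h>

/* gselect 4/4: concatenate the low 4 address bits and 4 history bits. */
uint32_t gselect4_4(uint32_t addr, uint32_t hist)
{
    return ((addr & 0xF) << 4) | (hist & 0xF);
}

/* gshare 8/8: XOR 8 address bits with 8 history bits. */
uint32_t gshare8_8(uint32_t addr, uint32_t hist)
{
    return (addr ^ hist) & 0xFF;
}

For the third table row, gselect4_4(0xFF, 0x00) yields 11110000 while gshare8_8(0xFF, 0x00) yields 11111111, reproducing the values above.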
Hybrid predictors

The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.
Two or more predictors and a predictor selection mechanism are necessary in a combining or hybrid predictor:
– McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
– Young and Smith: compiler-based static branch prediction combined with a two-level adaptive type,
– and many more combinations!
Hybrid predictors are often better than single-type predictors.
Simulations [Grunwald]: SAg, gshare and McFarling's combining predictor

            committed     conditional  taken     misprediction rate (%)
Application instructions  branches     branches  SAg    gshare  combining
            (in millions) (in millions) (%)
compress        80.4         14.4       54.6     10.1    10.1      9.9
gcc            250.9         50.4       49.0     12.8    23.9     12.2
perl           228.2         43.8       52.6      9.2    25.9     11.4
go             548.1         80.3       54.5     25.6    34.4     24.1
m88ksim        416.5         89.8       71.7      4.7     8.6      4.7
xlisp          183.3         41.8       39.5     10.3    10.2      6.8
vortex         180.9         29.1       50.1      2.0     8.3      1.7
jpeg           252.0         20.0       70.0     10.3    12.5     10.4
mean           267.6         46.2       54.3      8.6    14.5      8.1
Results

Simulations of Keeton et al. (1998) using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.
Two different conclusions may be drawn from these simulation results:
– branch predictors should be further improved,
– and/or branch prediction is only effective if the branch is predictable.
If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior.
⇒ Question: what is the confidence of a branch prediction?
Predicated instructions and multipath execution: confidence estimation

Confidence estimation is a technique for assessing the quality of a particular prediction.
Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.
A low-confidence branch is a branch which frequently changes its branch direction in an irregular way, making its outcome hard to predict or even unpredictable.
Four classes are possible:
– correctly predicted with high confidence C(HC),
– correctly predicted with low confidence C(LC),
– incorrectly predicted with high confidence I(HC), and
– incorrectly predicted with low confidence I(LC).
Implementation of a confidence estimator

Information from the branch prediction tables is used:
– use of saturation counter information to construct a confidence estimator
⇒ speculate more aggressively when the confidence level is higher;
– use of a miss distance counter table (MDC):
each time a branch is predicted, the value in the MDC is compared to a threshold; if the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise;
– a small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme:
the confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.
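A minimal C sketch of the miss-distance-counter idea described above; the threshold value and the saturation bound are assumed for illustration.

/* One MDC entry: correct predictions since the last miss, saturating at 15. */
typedef struct { unsigned mdc; } ConfEntry;

#define THRESHOLD 4   /* assumed threshold value */

int high_confidence(const ConfEntry *e)
{
    return e->mdc >= THRESHOLD;
}

void confidence_update(ConfEntry *e, int predicted_correctly)
{
    if (predicted_correctly) { if (e->mdc < 15) e->mdc++; }
    else                     e->mdc = 0;      /* a miss resets the distance */
}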
Predicated instructions

Provide predicated or conditional instructions and one or more predicate registers.
Predicated instructions use a predicate register as an additional input operand.
The Boolean result of a condition test is recorded in a (one-bit) predicate register.
Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions.
How far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved depends on the processor architecture:
– A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
– Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but commits only if the predicate is true; otherwise the result is discarded.
Predication example

if (x == 0) {   /* branch b1 */
    a = b + c;
    d = e - f;
}
g = h * i;      /* instruction independent of branch b1 */

Pred = (x == 0)         /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c; /* the operations are only performed            */
if Pred then d = e - f; /* if Pred is set to true                       */
g = h * i;
Predication

Predication is able to eliminate a branch and therefore the associated branch prediction ⇒ increasing the distance between mispredictions.
The run length of a code block is increased ⇒ better compiler scheduling.
Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.
Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.
Eager (multipath) execution

Execution proceeds down both paths of a branch, and no prediction is made.
When a branch resolves, all operations on the non-taken path are discarded.
Oracle execution: eager execution with unlimited resources
– gives the same theoretical maximum performance as perfect branch prediction.
With limited resources, the eager execution strategy must be employed carefully.
A mechanism is required that decides when to employ prediction and when eager execution: e.g. a confidence estimator.
Rarely implemented (IBM mainframes), but subject of some research projects:
– Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution.
[Figure: speculation trees for (a) single-path speculative execution, (b) full eager execution, and (c) disjoint eager execution, annotated with cumulative path probabilities for a branch taken with probability .7: single-path follows the most likely path (.7, .49, .34, .24, ...), full eager execution forks at every branch (.7/.3, .49/.21/.21/.09, ...), and disjoint eager execution spends each speculation slot on the most probable not-yet-followed path overall.]
Prediction of indirect branches

Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
Indirect branches occur frequently in machine code compiled from object-oriented programs like C++ and Java programs.
One simple solution is to update the PHT to include the branch target addresses.
Branch handling techniques and implementations

Technique                              Implementation examples
No branch prediction                   Intel 8086
Static prediction
  always not taken                     Intel i486
  always taken                         Sun SuperSPARC
  backward taken, forward not taken    HP PA-7x00
  semistatic with profiling            early PowerPCs
Dynamic prediction:
  1-bit                                DEC Alpha 21064, AMD K5
  2-bit                                PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                   Intel PentiumPro, Pentium II, AMD K6
Hybrid prediction                      DEC Alpha 21264
Predication                            Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201 and many others
Eager execution (limited)              IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution               none yet
High-bandwidth branch prediction

Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle,
– e.g. the GAg predictor is independent of the branch address.
When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
– Possible solution: a trace cache in combination with next-trace prediction.
Most likely a combination of branch handling techniques will be applied,
– e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.
Details of the superscalar pipeline

In-order section:
– Instruction fetch (BTAC access, simple branch prediction)
⇒ fetch buffer
– Instruction decode
• often: more complex branch prediction techniques
• register rename
⇒ instruction window
Out-of-order section:
– Instruction issue to FU or reservation station
– Execute till completion
In-order section:
– Retire (commit or remove)
– Write-back
Decode stage

Superscalar processor: in-order delivery of instructions to the out-of-order execution kernel!
Instruction delivery:
– Fetch and decode instructions at a higher bandwidth than they are executed.
– Delivery task: keep the instruction window full
⇒ the deeper instruction look-ahead allows more instructions to be found for issue to the execution units.
Today a processor fetches and decodes about 1.4 to 2 times as many instructions as it commits (because of mispredicted branch paths).
Typically the decode bandwidth is the same as the instruction fetch bandwidth.
Multiple instruction fetch and decode is supported by a fixed instruction length.
Decoding variable-length instructions

Variable instruction length: often the case for legacy CISC instruction sets such as the Intel x86 ISA.
⇒ a multistage decode is necessary:
– The first stage determines the instruction boundaries within the instruction stream.
– The second stage decodes the instructions, generating one or several micro-ops from each instruction.
Complex CISC instructions are split into micro-ops which resemble ordinary RISC instructions.
Predecoding

Predecoding can be done when the instructions are transferred from memory or secondary cache to the I-cache
⇒ the decode stage becomes simpler.
MIPS R10000: predecodes each 32-bit instruction into a 36-bit format stored in the I-cache.
– The four extra bits indicate which functional unit should execute the instruction.
– The predecoding also rearranges operand- and destination-select fields to be in the same position for every instruction, and
– modifies opcodes to simplify decoding of integer or floating-point destination registers.
The decoder can decode this expanded format more rapidly than the original instruction format.
Rename stage

Aim of register renaming: remove anti and output dependences dynamically by means of the processor hardware.
Register renaming is the process of dynamically associating physical registers (rename registers) with the architectural registers (logical registers) referred to in the instruction set of the architecture.
Implementation:
– a mapping table;
– a new physical register is allocated from the free list of available registers for every destination register specified in an instruction.
Each physical register is written only once after each assignment.
If a subsequent instruction needs its value, that instruction must wait until it is written (true data dependence).
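A minimal C sketch of this bookkeeping with a mapping table and a free list. Register counts and names are illustrative assumptions; a real renamer is multi-ported and stalls when the free list is empty.

#include <stdint.h>

#define NLOG  32                    /* architectural (logical) registers */
#define NPHYS 64                    /* physical (rename) registers       */

static int map_table[NLOG];         /* logical -> physical mapping       */
static int free_list[NPHYS];        /* stack of free physical registers  */
static int free_top;                /* number of free physical registers */

void rename_init(void)
{
    for (int r = 0; r < NLOG; r++)  map_table[r] = r;   /* identity map  */
    free_top = 0;
    for (int r = NLOG; r < NPHYS; r++) free_list[free_top++] = r;
}

/* Rename one instruction "dst = src1 op src2": the sources read the
 * current mapping, the destination gets a fresh physical register, which
 * removes anti and output dependences. */
void rename_instr(int src1, int src2, int dst,
                  int *psrc1, int *psrc2, int *pdst)
{
    *psrc1 = map_table[src1];
    *psrc2 = map_table[src2];
    *pdst  = free_list[--free_top]; /* allocate (stall if free_top == 0) */
    map_table[dst] = *pdst;
}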
Two principal techniques to implement renaming

Separate sets of architectural registers and rename (physical) registers are provided:
– the physical registers contain the values of completed but not yet retired instructions,
– the architectural (or logical) registers store the committed values,
– after commitment of an instruction, its result must be copied from the rename register to the architectural register.
Only a single set of registers is provided, and architectural registers are dynamically mapped to physical registers:
– the physical registers contain committed values and temporary results,
– after commitment of an instruction, the physical register is made permanent and no copying is necessary.
An alternative to dynamic renaming is the use of a large register file as defined for the Intel IA-64 (Itanium).
Register rename logic

Access a multi-ported map table with logical register designators as index.
Additionally, dependence check logic detects cases where a logical register is written by an earlier instruction of the same rename group ⇒ it sets up the output MUXes.

[Figure: the logical source registers index the map table to deliver the physical source registers; logical destination registers are assigned physical destination registers. Dependence check logic (one slice per logical source register R) compares each logical source against the logical destinations of earlier instructions and, on a match, selects via MUX the newly assigned physical register instead of the map-table entry.]
Issue and dispatch

The notion of the instruction window comprises all the waiting stations between the decode (rename) and execute stages.
The instruction window isolates the decode/rename stages from the execution stages of the pipeline.
Instruction issue is the process of initiating instruction execution in the processor's functional units:
– issue to a FU or a reservation station,
– dispatch, if a second issue stage exists, denotes when an instruction starts to execute in the functional unit.
The instruction-issue policy is the protocol used to issue instructions.
The processor's lookahead capability is the ability to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute.
Instruction window organizations

Single-stage issue out of a central instruction window.
Multi-stage issue: operand availability and resource availability checking is split into two separate stages.
Decoupling of instruction windows: each instruction window is shared by a group of (usually related) functional units; most common are a separate floating-point window and integer window.
Combination of multi-stage issue and decoupling of instruction windows:
– in a two-stage issue scheme with resource-dependent issue preceding the data-dependent dispatch,
the first stage is performed in-order,
the second stage out-of-order.
The following issue schemes are commonly used

Single-level, central issue: single-level issue out of a central window, as in the Pentium II processor.

[Figure: Decode and Rename → central window → Issue and Dispatch → Functional Units]
Single-level, two-window issue

Single-level, two-window issue: single-level issue with instruction window decoupling using two separate windows,
– most common: separate floating-point and integer windows, as in the HP PA-8000 processor.

[Figure: Decode and Rename → two windows, each with Issue and Dispatch → their respective Functional Units]
Two-level issue with multiple windows

Two-level issue with multiple windows, with a centralized window in the first stage and separate windows (reservation stations) in the second stage (PowerPC 604 and 620 processors).

[Figure: Decode and Rename → central window (Issue) → reservation stations per FU (Dispatch) → Functional Units]
Wakeup logic

[Figure: each of the N instruction window entries holds two operand tags (tagL, tagR) with ready flags (rdyL, rdyR). A completing result's tag is broadcast to all instructions in the window and compared against both operand tags of every entry; on a match the corresponding rdyL or rdyR flag is set. When both operands are ready, the entry's ready flag is set and its REQ signal is raised.]
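A minimal C model of the wakeup step for one window entry (a software sketch of the comparator behavior above; type and names are illustrative):

#include <stdbool.h>

typedef struct {
    int  tagL, tagR;     /* rename tags of the two source operands        */
    bool rdyL, rdyR;     /* set when the matching result tag is broadcast */
} WindowEntry;

/* Broadcast one result tag to an entry; returns true if the entry now has
 * both operands ready and therefore raises its REQ signal. */
bool wakeup(WindowEntry *e, int result_tag)
{
    if (e->tagL == result_tag) e->rdyL = true;
    if (e->tagR == result_tag) e->rdyR = true;
    return e->rdyL && e->rdyR;
}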
Selection logic

[Figure: selection logic for an issue window. REQ signals are raised when all operands of an entry are available. An arbiter tree of cells, each taking req0..req3 and returning grant0..grant3, ORs its requests into an anyreq signal propagated towards the root cell; the root's enable travels back down, and a priority encoder in each cell grants exactly one requesting entry.]
Execution stages

Various types of FUs, classified as:
– single-cycle (latency of one) or
– multiple-cycle (latency of more than one) units.
Single-cycle units produce a result one cycle after an instruction starts execution. Usually they are also able to accept a new instruction each cycle (throughput of one).
Multi-cycle units perform more complex operations that cannot be implemented within a single cycle.
Multi-cycle units
– can be pipelined to accept a new operation each cycle or every other cycle,
– or they are non-pipelined.
Another class of units performs operations with variable cycle times.
Types of FUs

Single-cycle (single-latency) units:
– (simple) integer and (integer-based) multimedia units.
Multicycle units that are pipelined (throughput of one):
– complex integer, floating-point, and (floating-point-based) multimedia units (also called multimedia vector units).
Multicycle units that are pipelined but do not accept a new operation each cycle (throughput of 1/2 or less):
– often the 64-bit floating-point operations in a floating-point unit.
Multicycle units that are often not pipelined:
– division unit, square root units, complex multimedia units.
Variable cycle time units:
– load/store unit (depending on cache misses) and special implementations of e.g. floating-point units.
Media processors and multimedia units

Media processing (digital multimedia information processing) is the decoding, encoding, interpretation, enhancement, and rendering of digital multimedia information.
Today's video and 3D graphics require high bandwidth and processing performance:
– separate special-purpose video chips, e.g. for MPEG-2, 3D graphics, etc., and multi-algorithm video chip sets,
– programmable video processors (very sophisticated DSPs): TMS320C82, Siemens Tricore, Hyperstone,
– media processors and media coprocessors: Chromatics MPACT media processor, Philips Trimedia TM-1, MicroUnity Media processor,
– multimedia units: multimedia extensions for general-purpose processors (VIS, MMX, MAX).
Media processors and multimedia units

Utilization of subword parallelism (data-parallel instructions, SIMD):

[Figure: registers R1 (subwords x1..x4) and R2 (subwords y1..y4) are multiplied element-wise, giving R3 = (x1*y1, x2*y2, x3*y3, x4*y4).]

Saturation arithmetic.
Additional arithmetic instructions, e.g. pavgusb (average instruction), masking and selection instructions, reordering and conversion.
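A minimal C model of such a data-parallel multiply with saturation. It is a software sketch of the semantics (four signed 16-bit subwords in a 64-bit register), not any real SIMD instruction.

#include <stdint.h>

uint64_t packed_mul_sat(uint64_t r1, uint64_t r2)
{
    uint64_t r3 = 0;
    for (int i = 0; i < 4; i++) {
        int32_t x = (int16_t)(r1 >> (16 * i));   /* extract subword x_i   */
        int32_t y = (int16_t)(r2 >> (16 * i));   /* extract subword y_i   */
        int32_t p = x * y;
        if (p >  32767) p =  32767;              /* saturate, don't wrap  */
        if (p < -32768) p = -32768;
        r3 |= (uint64_t)(uint16_t)p << (16 * i); /* insert result subword */
    }
    return r3;
}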
Multimedia extensions in today's microprocessors

Multimedia acceleration extensions (MAX-1, MAX-2) for HP PA-8000 and PA-8500,
Visual instruction set (VIS) for UltraSPARC,
Matrix manipulation extensions (MMX, MMX2) for the Intel P55C and Pentium II,
AltiVec extensions for Motorola processors,
Motion video instructions (MVI) for Alpha processors, and
MIPS digital media extensions (MDMX) for MIPS processors.
3D graphical enhancements:
ISSE (internet streaming SIMD extension) extends MMX in the Pentium III,
3DNow! of AMD K6-2 and Athlon.
3D graphical enhancement

The ultimate goal is the integrated real-time processing of multiple audio, video, and 2D and 3D graphics streams on a system CPU.
To speed up 3D applications on the main processor, fast low-precision floating-point operations are required:
– reciprocal instructions are of specific importance,
– e.g. square root reciprocal with low precision.
3D graphical enhancements employ so-called vector operations:
– execute two paired single-precision floating-point operations in parallel on two single-precision floating-point values stored in a 64-bit floating-point register.
Such vector operations are defined by the 3DNow! extension of AMD and by ISSE of Intel's Pentium III.
3DNow! defines 21 new instructions, which are mainly paired single-precision floating-point operations.
Finalizing pipelined execution: completion, commitment, retirement and write-back

An instruction is completed when the FU has finished its execution and the result is made available for forwarding and buffering.
– Instruction completion is out of program order.
Committing an operation means that the results of the operation have been made permanent and the operation is retired from the scheduler.
Retiring means removal from the scheduler with or without the commitment of operation results, whichever is appropriate.
– Retiring an operation does not imply that its results are either permanent or non-permanent.
A result is made permanent
– either by making the mapping of the architectural to the physical register permanent (if no separate physical registers exist) or
– by copying the result value from the rename register to the architectural register (in the case of separate physical and architectural registers),
in a separate write-back stage after the commitment!
Precise interrupts


An interrupt or exception is called precise if the saved processor state
corresponds with the sequential model of program execution where one
instruction execution ends before the next begins.
The saved state should fulfil the following conditions:
– All instructions preceding the instruction indicated by the saved program
counter have been executed and have modified the processor state
correctly.
– All instructions following the instruction indicated by the saved program
counter are unexecuted and have not modified the processor state.
– If the interrupt is caused by an exception condition raised by an instruction
in the program, the saved program counter points to the interrupted
instruction.
– The interrupted instruction may or may not have been executed, depending
on the definition of the architecture and the cause of the interrupt.
Whichever is the case, the interrupted instruction has either ended
execution or has not started.
Precise interrupts

Interrupts belong to two classes:
– Program interrupts or traps result from exception conditions detected during fetching and execution of specific instructions:
• illegal opcodes, numerical errors such as overflow, or
• part of normal execution, e.g., page faults.
– External interrupts are caused by sources outside of the currently executing instruction stream:
• I/O interrupts and timer interrupts.
• For such interrupts, restarting from a precise processor state should be made possible.
When an exception condition can be detected prior to issue, instruction issuing is simply halted and the processor waits until all previously issued instructions are retired.
Processors often have two modes of operation:
one mode guarantees precise exceptions, and another mode, which is often 10 times faster, does not.
Reorder buffers

The reorder buffer keeps the original program order of the instructions after instruction issue and allows result serialization during the retire stage.
State bits store whether an instruction is on a speculative path and, when the branch is resolved, whether the instruction is on a correct path or must be discarded.
When an instruction completes, the state is marked in its entry.
Exceptions are marked in the reorder buffer entry of the triggering instruction.
The reorder buffer is implemented as a circular FIFO buffer.
Reorder buffer entries are allocated in the (first) issue stage and deallocated serially when the instructions retire.
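A minimal C sketch of such a circular-FIFO reorder buffer with in-order retirement. Field names and the commit action are illustrative; the 40-entry size follows the Pentium II/III ROB discussed later in this chapter.

#include <stdbool.h>

#define ROB_SIZE 40

typedef struct {
    bool valid, completed, speculative, exception;
    int  dest_phys;          /* rename register holding the result */
} RobEntry;

static RobEntry rob[ROB_SIZE];
static int head, tail;       /* retire from head, allocate at tail */

/* Retire in order: stop at the first entry that has not completed, is
 * still speculative, or signals an exception. */
void retire_step(void)
{
    while (rob[head].valid && rob[head].completed &&
           !rob[head].speculative && !rob[head].exception) {
        /* commit: copy the rename register to the architectural register */
        rob[head].valid = false;
        head = (head + 1) % ROB_SIZE;
    }
}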
Reorder buffer variations

The reorder buffer holds only instruction execution states (results are in rename registers):
– Johnson describes a reorder buffer in combination with a so-called future file; the future file is similar to a set of rename registers that are separate from the architectural registers.
– In contrast, Smith and Pleszkun describe a reorder buffer in combination with a future file, whereby the reorder buffer and the future file receive and store results at the same time.
Another reorder buffer type: the reorder buffer holds the result values of completed instructions instead of rename registers.
Moreover, the instruction window can be combined with the reorder buffer into a single buffer unit.
Other recovery mechanisms

Checkpoint repair mechanism:
– The processor provides a set of logical spaces, where each logical space consists of a full set of software-visible registers and memory.
– One is used for current execution; the others contain back-up copies of the in-order state that correspond to previous points in execution.
– At various times during execution, a checkpoint is made by copying the architectural state of the current logical space to the back-up space.
– Restarting is accomplished by loading the contents of the appropriate back-up space into the current logical space.
History buffer:
– The (architectural) register file contains the current state, and the history buffer contains old register values which have been replaced by new values.
– The history buffer is managed as a LIFO stack, and the old values are used to restore a previous state if necessary.
Relaxing in-order retirement

Result serialization is demanded by the serial instruction flow of the von Neumann architecture:
a fully parallel and highly speculative processor must look like a simple von Neumann processor, as it was state-of-the-art in the fifties.
The only relaxation existing today concerns the order of load and store instructions.
A further possible relaxation:
– Assume an instruction sequence A ends with a branch that predicts an instruction sequence B, and B is followed by a sequence C which is not dependent on B.
– Thus C is executed independently of the branch direction.
– Therefore, instructions in C can start to retire before B.
The Intel P5 and P6 family

Family    Year  Type                Transistors  Technology  Clock     Issue  Word    L1 cache         L2 cache
                                    (x1000)      (µm)        (MHz)            format
P5        1993  Pentium               3100       0.8         66        2      32-bit  2 x 8 kB         -
P5        1994  Pentium               3200       0.6         75-100    2      32-bit  2 x 8 kB         -
P5        1995  Pentium               3200       0.6/0.35    120-133   2      32-bit  2 x 8 kB         -
P5        1996  Pentium               3300       0.35        150-166   2      32-bit  2 x 8 kB         -
P5        1997  Pentium MMX           4500       0.35        200-233   2      32-bit  2 x 16 kB        -
P5        1998  Mobile Pentium MMX    4500       0.25        200-233   2      32-bit  2 x 16 kB        -
P6        1995  PentiumPro            5500       0.35        150-200   3      32-bit  2 x 8 kB         256/512 kB
P6        1997  PentiumPro            5500       0.35        200       3      32-bit  2 x 8 kB         1 MB
P6        1998  Intel Celeron         7500       0.25        266-300   3      32-bit  2 x 16 kB        -
P6        1998  Intel Celeron        19000*      0.25        300-333   3      32-bit  2 x 16 kB        128 kB
P6        1997  Pentium II            7000       0.25        233-450   3      32-bit  2 x 16 kB        256 kB/512 kB
P6        1998  Mobile Pentium II     7000       0.25        300       3      32-bit  2 x 16 kB        256 kB/512 kB
P6        1998  Pentium II Xeon       7000       0.25        400-450   3      32-bit  2 x 16 kB        512 kB/1 MB
P6        1999  Pentium II Xeon       7000       0.25        450       3      32-bit  2 x 16 kB        512 kB/2 MB
P6        1999  Pentium III           8200       0.25        450-1000  3      32-bit  2 x 16 kB        512 kB
P6        1999  Pentium III Xeon      8200       0.25        500-1000  3      32-bit  2 x 16 kB        512 kB
NetBurst  2000  Pentium 4            42000       0.18        1500      3      32-bit  8 kB / 12k µOps  256 kB

* including L2 cache
Micro-dataflow in the PentiumPro (1995)

“... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ...”

R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.
PentiumPro and Pentium II/III

The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family.
This three-way superscalar, pipelined microarchitecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages.
The Pentium II/III processor has twelve stages with a pipestage time 33 percent less than the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.
A wide instruction window is provided by an instruction pool.
Optimized scheduling requires the fundamental “execute” phase to be replaced by decoupled “issue/execute” and “retire” phases. This allows instructions to be started in any order but always to be retired in the original program order.
Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.
Pentium® Pro Processor and Pentium II/III Microarchitecture

[Microarchitecture block diagram figure]
[Block diagram of the Pentium II/III: the bus interface unit connects the external bus and the L2 cache to the memory reorder buffer; the instruction fetch unit (with I-cache) is steered by the branch target buffer and feeds the instruction decode unit with its microcode instruction sequencer; the register alias table leads to the reservation station unit, which issues to the functional units, the D-cache unit, and the memory interface unit; the reorder buffer and retirement register file complete the pipeline.]
Pentium II/III: the in-order section

The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit.
The Next IP unit provides the I-cache index, based on inputs from the BTB, trap/interrupt status, and branch-misprediction indications from the integer FUs.
Branch prediction:
– two-level adaptive scheme of Yeh and Patt,
– the BTB contains 512 entries and maintains branch history information and the predicted branch target address,
– branch misprediction penalty: at least 11 cycles, on average 15 cycles.
The instruction decoder unit (IDU) is composed of three separate decoders.
Pentium II/III: the in-order section (continued)

A decoder breaks an IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length:
– most IA-32 instructions are converted directly into single µops (by any of the three decoders),
– some instructions are decoded into one to four µops (by the general decoder),
– more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 register references are converted into references to physical registers.
Then, with added status information, the µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).
The fetch/decode unit

[Figure: (a) in-order section: the I-cache feeds the instruction fetch unit (with Next_IP and alignment), steered by the branch target buffer; IA-32 instructions go to the instruction decode unit and on to the register alias table. (b) instruction decoder unit (IDU): two simple decoders and one general decoder, plus the microcode instruction sequencer, produce up to three µops per cycle.]
The out-of-order execute section





When the µops flow into the ROB, they effectively take a place in program
order.
The µops also go to the RSU, which forms a central instruction window with 20
reservation stations (RS), each capable of hosting one µop.
µops are issued to the FUs according to dataflow constraints and resource
availability, without regard to the original program order.
After completion, a result goes to two different places, the RSU and the ROB.
The RSU has five ports and can issue at a peak rate of 5 µops per cycle.
112
Latencies and throughput for Pentium II/III FUs
RSU Port  FU                          Latency  Throughput
0         Integer arithmetic/logical  1        1
          Shift                       1        1
          Integer mul                 4        1
          Floating-point add          3        1
          Floating-point mul          5        0.5
          Floating-point div          long     nonpipelined
          MMX arithmetic/logical      1        1
          MMX mul                     3        1
1         Integer arithmetic/logical  1        1
          MMX arithmetic/logical      1        1
          MMX shift                   1        1
2         Load                        3        1
3         Store address               3        1
4         Store data                  1        1
113
[Figure: Issue/Execute unit — the reservation station unit (to/from the reorder
buffer) issues over five ports: port 0 to an integer, a floating-point, and an
MMX functional unit; port 1 to an integer, an MMX, and a jump functional unit;
port 2 to the load functional unit; port 3 and port 4 to the store functional
units.]
114
The in-order retire section

A µop can be retired
– if its execution is completed,
– if it is its turn in program order,
– and if no interrupt, trap, or misprediction occurred.

Retirement means taking data that was speculatively created and writing it into
the retirement register file (RRF).

Three µops per clock cycle can be retired.
115
Retire unit
[Figure: retire unit — the reservation station unit and the memory interface
unit (to/from the D-cache) feed the retirement register file; all three connect
to/from the reorder buffer.]
116
The Pentium II/III pipeline
[Figure: (a) fetch/decode pipeline — BTB0/BTB1 (BTB access), IFU0-IFU2 (I-cache
access, fetch and predecode), IDU0/IDU1 (decode), RAT (register renaming), ROB
read; (b) issue pipeline — reorder buffer read, reservation station (RSU) issue
over ports 0-4, execution and completion; (c) retirement pipeline — reorder
buffer write-back (ROB write), retirement to the RRF.]
117
Pentium® Pro processor basic execution
environment
[Figure: an address space* from 0 to 2^32-1; eight 32-bit general purpose
registers; six 16-bit segment registers; the 32-bit EFLAGS register; the 32-bit
EIP (instruction pointer) register.
* The address space can be flat or segmented.]
118
Application programming registers
119
Pentium III
120
Pentium II/III summary and offspring

Pentium III in 1999, initially at 450 MHz (0.25 micron technology), formerly
code-named Katmai

two 32 kB caches, faster floating-point performance

Coppermine is a shrink of Pentium III down to 0.18 micron.
121
Pentium 4








Was announced for mid-2000 under the code name Willamette
native IA-32 processor with Pentium III processor core
running at 1.5 GHz
42 million transistors
0.18 µm
20 pipeline stages (integer pipeline), IF and ID not included
trace execution cache (TEC) for the decoded µOps
NetBurst micro-architecture
122
Pentium 4 features
Rapid Execution Engine:


Intel: “Arithmetic Logic Units (ALUs) run at twice the processor frequency”
Fact: two ALUs running at the processor frequency, connected by a multiplexer
running at twice the processor frequency
Hyper Pipelined Technology:


Twenty-stage pipeline to enable high clock rates
Frequency headroom and performance scalability
123
Advanced dynamic execution

Very deep, out-of-order, speculative execution engine
– Up to 126 instructions in flight (3 times larger than the Pentium III
processor)
– Up to 48 loads and 24 stores in pipeline (2 times larger than the Pentium III
processor)

Branch prediction
– based on µOPs
– 4K entry branch target array (8 times larger than the Pentium III processor)
– new algorithm (not specified) that reduces mispredictions by about one third
compared to the gshare scheme of the P6 generation
124
First level caches

12k µOP Execution Trace Cache (~100 k)

Execution Trace Cache that removes decoder latency from main execution
loops

Execution Trace Cache integrates path of program execution flow into a single
line

Low-latency 8 kByte data cache with a 2-cycle access latency
125
Second level caches

Included on the die

Size: 256 kB

Full-speed, unified, 8-way 2nd-level on-die Advanced Transfer Cache

256-bit data bus to the level 2 cache

Delivers ~45 GB/s data throughput (at 1.4 GHz processor frequency; see the
arithmetic below)

Bandwidth and performance increase with processor frequency
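The ~45 GB/s figure is simply the bus width times the clock: a 256-bit bus
transfers 32 bytes per cycle, and 32 byte × 1.4 GHz = 44.8 GB/s ≈ 45 GB/s.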
126
NetBurst microarchitecture
127
Streaming SIMD extensions 2 (SSE2)
technology

SSE2 extends MMX and SSE technology with the addition of 144 new
instructions, which include support for:
– 128-bit SIMD integer arithmetic operations.
– 128-bit SIMD double precision floating point operations.
– Cache and memory management operations.

Further enhances and accelerates video, speech, encryption, image and photo
processing.
128
400 MHz Intel NetBurst microarchitecture
system bus

Provides 3.2 GB/s throughput (3 times the bus bandwidth of the Pentium III
processor; see the arithmetic below).

Quad-pumped 100 MHz scalable bus clock to achieve a 400 MHz effective speed.

Split-transaction, deeply pipelined.

128-byte lines with 64-byte accesses.
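The 3.2 GB/s figure follows from the bus width and the effective clock: the
NetBurst system bus has a 64-bit (8-byte) data path — a width not stated on the
slide, but documented for the Pentium 4 — so 8 byte × 400 MHz = 3.2 GB/s.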
129
Pentium 4 data types
130
Pentium 4
131
Pentium 4 offspring






Foster
Pentium 4 with external L3 cache and DDR-SDRAM support
intended for servers
clock rate 1.7 - 2 GHz
to be launched in Q2/2001
Northwood
0.13 µm technology
new 478-pin socket
132
VLIW or EPIC

VLIW (very long instruction word):
The compiler packs a fixed number of instructions into a single VLIW instruction.
The instructions within a VLIW instruction are issued and executed in parallel.
Example: high-end signal processors (TMS320C6201)

EPIC (explicitly parallel instruction computing):
Evolution of VLIW
Example: Intel’s IA-64, exemplified by the Itanium processor
133
VLIW



VLIW (very long instruction word) processors use a long instruction word that
contains a usually fixed number of operations that are fetched, decoded,
issued, and executed synchronously.
All operations specified within a VLIW instruction must be independent of one
another.
Some of the key issues of a (V)LIW processor:
– (very) long instruction word (up to 1 024 bits per instruction),
– each instruction consists of multiple independent parallel operations,
– each operation requires a statically known number of cycles to complete,
– a central controller that issues a long instruction word every cycle,
– multiple FUs connected through a global shared register file.
134
VLIW and superscalar









Sequential stream of long instruction words.
Instructions scheduled statically by the compiler.
Number of simultaneously issued instructions is fixed during compile-time.
Instruction issue is less complicated than in a superscalar processor.
Disadvantage: VLIW processors cannot react to dynamic events,
e.g. cache misses, with the same flexibility as superscalars.
The number of instructions in a VLIW instruction word is usually fixed.
Padding VLIW instructions with no-ops is needed whenever the full issue
bandwidth cannot be used. This increases code size. More recent VLIW
architectures use a denser code format that allows the no-ops to be removed.
VLIW is an architectural technique, whereas superscalar is a microarchitecture
technique.
VLIW processors take advantage of spatial parallelism.
135
EPIC: a paradigm shift

Superscalar RISC solution
– Based on sequential execution semantics
– Compiler’s role is limited by the instruction set architecture
– Superscalar hardware identifies and exploits parallelism

EPIC solution – (the evolution of VLIW)
– Based on parallel execution semantics
– EPIC ISA enhancements support static parallelization
– Compiler takes greater responsibility for exploiting parallelism
– Compiler / hardware collaboration often resembles superscalar
136
EPIC: a paradigm shift

Advantages of pursuing EPIC architectures
– Make wide issue & deep latency less expensive in hardware
– Allow processor parallelism to scale with additional VLSI density

Architect the processor to do well with in-order execution
– Enhance the ISA to allow static parallelization
– Use compiler technology to parallelize program
– However, a purely static VLIW is not appropriate for general-purpose use
137
The fusion of VLIW and superscalar techniques



Superscalars need improved support for static parallelization
– Static scheduling
– Limited support for predicated execution
VLIWs need improved support for dynamic parallelization
– Caches introduce dynamically changing memory latency
– Compatibility: issue width and latency may change with new hardware
– Application requirements - e.g. object oriented programming with dynamic
binding
EPIC processors exhibit features derived from both
– Interlock & out-of-order execution hardware are compatible with EPIC (but
not required!)
– EPIC processors can use dynamic translation to parallelize in software
138
Many EPIC features are taken from VLIWs

Minisupercomputer products stimulated VLIW research (FPS, Multiflow,
Cydrome)
Minisupercomputers were specialized, costly, and short-lived
Traditional VLIWs not suited to general purpose computing
VLIW resurgence in single chip DSP & media processors

Minisupercomputers exaggerated forward-looking challenges:
Long latency
Wide issue
Large number of architected registers
Compile-time scheduling to exploit exotic amounts of parallelism

EPIC exploits many VLIW techniques
139
Shortcomings of early VLIWs

Expensive multi-chip implementations

No data cache

Poor "scalar" performance

No strategy for object code compatibility
140
EPIC design challenges

Develop architectures applicable to general-purpose computing
– Find substantial parallelism in ”difficult to parallelize” scalar programs
– Provide compatibility across hardware generations
– Support emerging applications (e.g. multimedia)

Compiler must find or create sufficient ILP

Combine the best attributes of VLIW & superscalar RISC
(incorporated best concepts from all available sources)

Scale architectures for modern single-chip implementation
141
EPIC Processors, Intel's IA-64 ISA and Itanium


Joint R&D project by Hewlett-Packard and Intel (announced in June 1994)
This resulted in the explicitly parallel instruction computing (EPIC) design style:
– specifying ILP explicitly in the machine code, that is, the parallelism is
encoded directly into the instructions, similarly to VLIW;
– a fully predicated instruction set;
– an inherently scalable instruction set (i.e., the ability to scale to a lot of
FUs);
– many registers;
– speculative execution of load instructions
142
IA-64 Architecture



Unique architecture features & enhancements
– Explicit parallelism and templates
– Predication, speculation, memory support, and others
– Floating-point and multimedia architecture
IA-64 resources available to applications
– Large, application visible register set
– Rotating registers, register stack, register stack engine
IA-32 & PA-RISC compatibility models
143
Today’s architecture challenges



Performance barriers:
– Memory latency
– Branches
– Loop pipelining and call / return overhead
Headroom constraints:
– Hardware-based instruction scheduling
• Unable to efficiently schedule parallel execution
– Resource constrained
• Too few registers
• Unable to fully utilize multiple execution units
Scalability limitations:
– Memory addressing efficiency
144
Intel's IA-64 ISA

Intel 64-bit Architecture (IA-64) register model:
– 128 64-bit general purpose registers GR0-GR127
to hold values for integer and multimedia computations
• each register has one additional NaT (Not a Thing) bit to indicate
whether the value stored is valid,
– 128 82-bit floating-point registers FR0-FR127
• registers f0 and f1 are read-only with values +0.0 and +1.0,
– 64 1-bit predicate registers PR0-PR63
• the first register PR0 is read-only and always reads 1 (true)
– 8 64-bit branch registers BR0-BR7 to specify the target addresses of
indirect branches
145
IA-64’s large register file
[Figure: IA-64's large register file — 128 82-bit floating-point registers
FR0-FR127 (FR0 = +0.0, FR1 = +1.0; 32 static, 96 rotating); 128 64-bit integer
registers GR0-GR127, each with a NaT bit (32 static, 96 stacked/rotating);
64 1-bit predicate registers PR0-PR63 (PR0 = 1; 16 static, 48 rotating);
8 64-bit branch registers BR0-BR7.]
146
Intel's IA-64 ISA
– IA-64 instructions are 41 bits (previously stated: 40 bits) long and consist of
• op-code,
• predicate field (6 bits),
• two source register addresses (7 bits each),
• destination register address (7 bits), and
• special fields (including ones for integer and floating-point arithmetic).
– The 6-bit predicate field in each IA-64 instruction refers to a set of 64 predicate
registers.
– 6 types of instructions:
• A: Integer ALU  I-unit or M-unit
• I: Non-ALU integer  I-unit
• M: Memory  M-unit
• B: Branch  B-unit
• F: Floating-point  F-unit
• L: Long immediate  I-unit
– IA-64 instructions are packed by the compiler into bundles.
147
IA-64 bundles





A bundle is a 128-bit long instruction word (LIW) containing three 41-bit IA-64
instructions along with a so-called 5-bit template that contains instruction
grouping information.
IA-64 does not insert no-op instructions to fill slots in the bundles.
The template explicitly indicates (ADAG):
– first 4 bits: types of instructions
– last bit (stop bit): whether the bundle can be executed in parallel with the
next bundle
– (earlier literature): whether the instructions in the bundle can be executed
in parallel or whether one or more must be executed serially (no longer in the
ADAG description)
Bundled instructions don't have to be in their original program order, and they
can even represent entirely different paths of a branch.
Also, the compiler can mix dependent and independent instructions together in
a bundle, because the template keeps track of which is which.
148
IA-64 : Explicitly parallel architecture
[Figure: a 128-bit bundle holds three 41-bit instructions (instruction 2,
instruction 1, instruction 0) plus a 5-bit template; the example shown is an
(MMI) bundle: memory (M), memory (M), integer (I).]
 The IA-64 template specifies
– the type of operation for each instruction
• MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB
– intra-bundle relationships
• M / MI or MI / I
– inter-bundle relationships
 Most common combinations are covered by templates
– headroom for additional templates
 Simplifies hardware requirements
 Scales compatibly to future generations
(M = Memory, F = Floating-point, I = Integer, L = Long Immediate, B = Branch)
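A minimal sketch of a bundle in IA-64 assembly syntax (the concrete instructions
are illustrative assumptions, not from the slides): the braces delimit one
bundle, .mmi selects the template, and ;; marks a stop between instruction
groups:

{ .mmi
      ld8  r4 = [r5]       // M-slot: load
      ld8  r6 = [r7]       // M-slot: load
      add  r8 = r9, r10    // I-slot: integer add
} ;;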
149
IA-64 scalability

A single bundle containing three instructions corresponds to a set of three FUs.

If an IA-64 processor had n sets of three FUs each then, using the template
information, it would be possible to chain bundles to create an instruction
word n bundles in length.

This is the way to provide scalability of IA-64 to any number of FUs.
150
Predication in IA-64 ISA





Branch prediction: paying a heavy penalty in lost cycles if mispredicted.
IA-64 compilers use predication to remove the penalties caused by
mispredicted branches and by the need to fetch from noncontiguous target
addresses when jumping over blocks of code beyond branches.
When the compiler finds a branch statement, it marks all the instructions that
represent each path of the branch with a unique identifier called a predicate.
IA-64 defines a 6-bit field (predicate register address) in each instruction to
store this predicate.
 64 unique predicates are available at one time.
Instructions that share a particular branch path share the same predicate.
IA-64 also defines an advanced branch prediction mechanism for branches
which cannot be removed.
151
If-then-else statement
152
Predication in IA-64 ISA




At run time, the CPU scans the templates, picks out the independent
instructions, and issues them in parallel to the FUs.
Predicated branch: the processor executes the code for every possible branch
outcome.
Although the processor has probably executed some instructions
from both possible paths, none of the (possible) results is stored yet.
To decide which results to keep, the processor checks the predicate register of
each of these instructions.
– If the predicate register contains a 1,
 the instruction is on the TRUE path (i.e., the valid path),
so the processor retires the instruction and stores the result.
– If the register contains a 0,
 the instruction is invalid, so the processor discards the result.
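A minimal sketch in IA-64 assembly of a predicated if-then-else (register
numbers are illustrative assumptions): the compare writes two complementary
predicates; both paths are executed, but only the instruction whose predicate
reads 1 has its result stored:

      cmp.eq p1, p2 = r8, r9 ;;   // p1 = (r8 == r9), p2 = its complement
(p1)  add    r10 = r11, r12       // then path, kept only if p1 == 1
(p2)  sub    r10 = r11, r13       // else path, kept only if p2 == 1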
153
Speculative loading





The idea is to load data from memory well before the program needs it, and thus
effectively minimize the impact of memory latency.
Speculative loading is a combination of compile-time and run-time
optimizations.  compiler-controlled speculation
The compiler looks for any instruction that will need data from memory
and, whenever possible, hoists the load to an earlier point in the instruction
stream, ahead of the instruction that will actually use the data.
Today's superscalar processors:
– a load can only be hoisted up to the first preceding branch instruction,
which represents a barrier.
Speculative loading combined with predication gives the compiler more
flexibility to reorder instructions and to shift loads above branches.
154
Speculative loading - “control speculation”
155
Speculative loading
– speculative load instruction: ld.s
– speculative check instruction: chk.s


The compiler:
– inserts the matching check immediately before the particular instruction
that will use the data,
– rearranges the surrounding instructions so that the processor can issue
them in parallel.
At run-time:
– the processor encounters the ld.s instruction first and tries to retrieve
the data from the memory.
– ld.s performs memory fetch and exception detection (e.g., checks the
validity of the address).
– If an exception is detected, ld.s does not deliver the exception.
– Instead, ld.s only marks the target register (by setting a token bit).
156
Speculative loading “data speculation”

The mechanism can also be used to move a load above a store,
even if it is not known whether the load and the store reference overlapping
memory locations.

ld.a       // advanced load
...
chk.a      // check
use data
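A minimal sketch in IA-64 assembly (register numbers and the intervening store
are illustrative assumptions; in the real ISA the advanced load is recorded in
the ALAT, the advanced load address table):

      ld8.a  r6 = [r8] ;;     // advanced load, address recorded in the ALAT
      ...
      st8    [r9] = r7 ;;     // possibly aliasing store
      chk.a  r6, recover      // overlap detected --> branch to recovery code
      add    r5 = r6, r4      // use the (validated) data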
157
Speculative loading/checking

Exception delivery is the responsibility of the matching chk.s instruction.
– When encountered, chk.s calls the operating system routine if the target
register is marked (i.e., if the corresponding token bit is set), and does
nothing otherwise.

Whether the chk.s instruction will be encountered may depend on the
outcome of the branch instruction.
 Thus, it may happen that an exception detected by ld.s is never
delivered.

Speculative loading with the ld.s/chk.s machine-level instructions resembles
the TRY/CATCH statements of some high-level programming languages (e.g.,
Java).
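A minimal sketch in IA-64 assembly of the ld.s/chk.s pair (register numbers and
the label are illustrative assumptions; the recovery code would re-execute the
load non-speculatively and branch back):

      ld8.s  r6 = [r8] ;;     // hoisted load; a fault only marks r6 (NaT bit)
      ...                     // code the load was hoisted above, incl. branches
      chk.s  r6, recover      // r6 marked? --> deliver the deferred exception
      add    r7 = r6, r9      // the instruction that actually uses the data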
158
Software pipelining via rotating registers
Software pipelining improves performance by overlapping the execution of
different loop iterations: more loop work executes in the same amount of time.
[Figure: sequential loop execution vs. software-pipelined loop execution over time]

Traditional architectures need complex software loop unrolling for pipelining
– Results in code expansion --> Increases cache misses --> Reduces
performance
IA-64 utilizes rotating registers to achieve software pipelining
– Avoids code expansion --> Reduces cache misses --> Higher performance
159
SW pipelining by modulo scheduling
Modulo scheduling (Cydrome): successive loop iterations are started at a fixed
distance, the Initiation Interval (II).
Minimum Initiation Interval (MII) =
MAX(resource-constrained MII, recurrence-constrained MII)
 Specialized branch and rotating registers eliminate code replication
160
SW pipelining by register rotation


Rotating registers
– floating-point: f32-f127
– general-purpose: r32-r127; the size of the rotating region can be set by an
alloc instruction
– predicate: p16-p63
Additional registers needed:
– current frame marker CFM: describes the state of the general register stack
plus three register rename base values used in register rotation:
• rr.pr (6 bits)
• rr.fr (7 bits)
• rr.gr (7 bits)
– within the “application registers”:
• loop count LC (64-bit register): decremented by counted-loop-type
branches
• epilog count EC (6-bit register): for counting the epilog stages
161
SW pipelining by register rotation
- Counted loop example
L1:  ld4 r4 = [r5],4 ;;    // cycle 0, load with post-increment 4
     add r7 = r4,r9 ;;     // cycle 2
     st4 [r6] = r7,4       // cycle 3, store with post-increment 4
     br.cloop L1 ;;        // cycle 3

All instructions from iteration X are executed before iteration X+1.
Assume the store from iteration x is independent of the load from iteration x+1:
 conceptual view of a single SW-pipelined iteration
(;; separates instruction groups):

Stage 1: (p16) ld4 r4 = [r5],4
Stage 2: (p17) -------------     // empty stage
Stage 3: (p18) add r7 = r4,r9
Stage 4: (p19) st4 [r6] = r7,4 ;;
162
SW pipelining by register rotation
- Counted loop example
Stage 1: (p16) ld4 r4 = [r5],4
Stage 2: (p17) -------------     // empty stage
Stage 3: (p18) add r7 = r4,r9
Stage 4: (p19) st4 [r6] = r7,4

is translated to:

     mov lc = 199            // LC = loop count - 1
     mov ec = 4              // EC = epilog stages + 1
     mov pr.rot = 1<<16 ;;   // PR16 = 1, rest = 0
L1:
     (p16) ld4 r32 = [r5],4  // Cycle 0
     (p18) add r35 = r34,r9  // Cycle 0
     (p19) st4 [r6] = r36,4  // Cycle 0
     br.ctop L1 ;;           // Cycle 0
163
SW pipelining by register rotation
- Optimizations and limitations






Register rotation removes the requirement that kernel loops be unrolled to
allow software renaming of the registers.
Speculation can further increase loop performance by removing dependence
barriers.
Technique works also for while loops.
Works also with predicated instructions (instead of assigning stage predicates).
Also possible for multiple-exit loops (the epilog gets more complicated).
Limitation:
– Loops with very small trip counts may decrease performance when
pipelined.
– It is not desirable to pipeline a floating-point loop that contains a function
call (the number of fp registers needed is not known, and it may be hard to find
empty slots for the instructions needed to save and restore the caller-saved
floating-point registers across the function call).
164
IA-64 register stack
[Figure: traditional register stacks vs. the IA-64 register stack — procedures
A, B, C, D mapped onto the register file]
 Traditional register stacks eliminate the need for save / restore by reserving
fixed blocks in the registers;
however, fixed blocks waste resources.
 IA-64 is able to reserve variable block sizes
 no wasted resources.
165
IA-64 support for procedure calls

A subset of the general registers is organized as a logically infinite set of
stack frames that are allocated from a finite pool of physical registers.

The stacked registers are GR32 up to a user-configurable maximum of GR127:
– a called procedure specifies the size of its new stack frame using the alloc
instruction,
– the output registers of the caller are overlapped with the input registers of
the called procedure.
Register Stack Engine:
– management of register stack by hardware
– moves contents of physical registers between general register file and
memory
– provides programming model that looks like unlimited register stack
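A minimal sketch in IA-64 assembly (frame sizes and register numbers are
illustrative assumptions; saving and restoring the return branch register b0 is
omitted): alloc saves the previous function state and carves out a frame of 2
input, 3 local, and 1 output register, so the callee's first input register
overlaps the caller's output register:

func: alloc r34 = ar.pfs, 2, 3, 1, 0  // 2 in (r32-r33), 3 local (r34-r36),
                                      // 1 out (r37), 0 rotating
      ...
      mov   r37 = r33                 // pass an argument in the output register
      br.call.sptk b0 = callee ;;     // callee sees our r37 as its r32
      mov   ar.pfs = r34              // restore the caller's frame state
      br.ret.sptk b0                  // return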
166
Full binary IA-32 instruction compatibility
[Figure: IA-64 hardware in IA-32 mode and in IA-64 mode, each with registers,
execution units, and system resources; control passes from the IA-32
instruction set to the IA-64 instruction set via a “jump to IA-64” and via
intercepts, exceptions, and interrupts, and back via a “branch to IA-32”.]
• IA-32 instructions supported through shared hardware resources
• Performance similar to volume IA-32 processors
167
Full binary compatibility for PA-RISC



Transparency:
– Dynamic object code translator in HP-UX automatically converts PA-RISC
code to native IA-64 code
– Translated code is preserved for later reuse
Correctness:
– Has passed the same tests as the PA-8500
Performance:
– Close PA-RISC to IA-64 instruction mapping
– Translation on average takes 1-2% of the execution time;
native instruction execution takes the remaining 98-99%
– Optimization done for wide instructions, predication, speculation, large
register sets, etc.
– PA-RISC optimizations carry over to IA-64
168
Delivery of streaming media



Audio and video functions regularly perform the same operation on arrays of
data values.
– IA-64 manages its resources to execute these functions efficiently
• able to manage the general registers as 8x8-, 4x16-, or 2x32-bit elements
• multimedia operands/results reside in the general registers
IA-64 accelerates compression / decompression algorithms
– parallel ALU, multiply, and shift operations
– pack/unpack: converts between different element sizes
Fully compatible with
– IA-32 MMX technology,
– Streaming SIMD Extensions, and
– PA-RISC MAX2
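A minimal sketch of the corresponding parallel-add instructions in IA-64
assembly (register numbers are illustrative; the mnemonics are from the IA-64
ISA, though not listed on the slide):

      padd1  r8 = r9, r10   // eight parallel 8-bit additions (8x8)
      padd2  r8 = r9, r10   // four parallel 16-bit additions (4x16)
      padd4  r8 = r9, r10   // two parallel 32-bit additions (2x32)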
169
IA-64 3D graphics capabilities



Many geometric calculations (transforms and lighting) use 32-bit floating-point
numbers
IA-64 configures registers for maximum 32-bit floating-point performance
– Floating-point registers treated as 2x32 bit single precision registers
– Able to execute fast divide
– Achieves up to 2X performance boost in 32-bit data floating-point operations
Full support for Pentium® III processor Streaming SIMD Extensions (SSE)
170
IA-64 for scientific analysis


Variety of software optimizations supported
– Load double pair: doubles the bandwidth between L1 and the registers (see
the sketch below)
– Full predication and speculation support
• NaT Value to propagate deferred exceptions
• Alternate IEEE flag sets allow preserving architectural flags
– Software pipelining for large loop calculations
High precision & range internal format: 82 bits
– Mixed operations supported: single, double, extended, and 82-bit
– Interfaces easily with memory formats
• Simple promotion/demotion on loads/stores
– Iterative calculations converge faster
– Ability to handle numbers much larger than the RISC competition can,
without overflow
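A minimal sketch of the load-double-pair mentioned above (register numbers are
illustrative; ldfpd is the IA-64 mnemonic for loading a pair of
double-precision values in a single access):

      ldfpd  f6, f7 = [r5], 16   // load two adjacent doubles into f6 and f7,
                                 // then post-increment r5 by 16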
171
IA-64 Floating-Point Architecture
(82 bit floating point numbers)
[Figure: a 128-entry FP register file with multiple read ports and multiple
write ports, connected to memory and to several FMAC units; each FMAC computes
D = A × B + C.]
128 registers
– Allows parallel execution of multiple floating-point operations
Simultaneous multiply-accumulate (FMAC)
– 3-input, 1-output operation: a * b + c = d
– Shorter latency than independent multiply and add
– Greater internal precision and single rounding error
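This corresponds to the IA-64 fma instruction (register numbers are
illustrative):

      fma.d  f6 = f7, f8, f9   // f6 = f7 * f8 + f9, computed with full internal
                               // precision and a single final rounding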
172
Memory support for high performance technical
computing



Scientific analysis, 3D graphics, and other technical workloads tend to be
predictable and memory bound.
IA-64 data pre-fetching allows fast access to critical
information
– reduces the memory latency impact
IA-64 is able to specify cache allocation
– cache hints on load / store operations allow data to be placed at a specific
cache level (see the sketch below)
– efficient use of caches, efficient use of bandwidth
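A minimal sketch of such hints in IA-64 assembly (register numbers are
illustrative; the completers are from the IA-64 ISA, not from the slide):

      lfetch.nt2  [r8]       // prefetch the line, biased away from L1
      ld8.nt1     r5 = [r6]  // load marked non-temporal at level 1
      st8.nta     [r7] = r4  // store marked non-temporal at all levels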
173
IA server/workstation roadmap
[Figure: IA server/workstation roadmap — performance vs. time ('98-'03, 0.25µ
to 0.18µ to 0.13µ): IA-32 line with the Pentium II Xeon processor, Pentium III
Xeon processor, Foster, and future IA-32 processors; IA-64 line with Itanium,
McKinley, then Madison (IA-64 performance) and Deerfield (IA-64
price/performance).]
174
Itanium













64-bit processor  not in the Pentium, PentiumPro, Pentium II/III line
Targeted at servers with moderate to large numbers of processors
full compatibility with Intel’s IA-32 ISA
EPIC (explicitly parallel instruction computing) is applied
6-wide (3 EPIC instructions) pipeline
10-stage pipeline
4 integer, 4 multimedia, 2 load/store, 3 branch, 2 extended floating-point, and
2 single-precision floating-point units
Multi-level branch prediction besides predication
16 KB 4-way set-associative D- and I-caches
96 KB 6-way set-associative L2 cache
4 MB L3 cache (on package)
800 MHz, 0.18 micron process (at the beginning of 2001)
shipments end of 1999 or mid-2000 or ??
175
Conceptual view of Itanium
176
Itanium processor core pipeline
ROT: instruction rotation
pipelined access of the large register file:
– WLD: word line decode
– REG: register read
DET: exception detection (~ retire stage)
177
Itanium processor
178
Itanium die plot
179
Itanium vs. Willamette (P4)






Itanium announced at 800 MHz
P4 announced at 1.2 GHz
The P4 may be faster running IA-32 code than the Itanium running IA-64 code
Itanium probably won‘t compete with contemporary IA-32 processors,
but Intel will complete the Itanium design anyway
Intel’s hopes rest on the Itanium successor McKinley, which will be out only
one year later
180