
A Critical Look At IA-64
Massive Resources,
Massive ILP,
But Can It Deliver?
Martin Hopkins, IBM Research
2/7/00
Sampoorani, Sivakumar and Joshua
Design decisions common to modern processors
Pipelining
Micro Ops
Large ROB
Single path execution
Dynamic scheduling

At what cost?
Accurate Branch Prediction
 Dependency Checking
 Register Renaming
 Alias Detection Hardware

Performance of IA-64
Execution time = Cycle Time * IC * CPI (IC = instruction count, CPI = cycles per instruction)
No improvement reported in frequency
Possible Reasons?
 Reducing CPI at the cost of cycle time
 Compares and branches in same cycle
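The performance equation above can be made concrete with a small calculation. This is only a sketch: the two machines and all of their numbers are hypothetical, chosen to show how a lower CPI can be cancelled out by a slower clock and a longer dynamic path. They are not measurements from the talk.

```python
def execution_time_ns(cycle_time_ns, instruction_count, cpi):
    """Execution time = cycle time * IC * CPI, in nanoseconds."""
    return cycle_time_ns * instruction_count * cpi

# Hypothetical machine A: faster clock, shorter path, higher CPI.
time_a = execution_time_ns(cycle_time_ns=1.0, instruction_count=1_000_000, cpi=1.0)

# Hypothetical machine B: lower CPI, but slower clock and longer path.
time_b = execution_time_ns(cycle_time_ns=1.25, instruction_count=1_300_000, cpi=0.7)

print(f"A: {time_a / 1e6:.3f} ms")
print(f"B: {time_b / 1e6:.3f} ms")
```

With these made-up inputs, machine B loses despite its better CPI, because all three factors multiply.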

Predicated Execution
=> more FUs
=> more complexity + longer wires
=> limit on frequency
=> more power
Dynamic Path Length (IC)
Longer than other architectures
Reasons?
 Speculation
 Check operations and recovery code
 Predication
 No sign extended loads
 No integer multiply or divide
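One way to see why predication lengthens the dynamic path is to count issued instructions. The sketch below is a toy model, not IA-64 semantics: it assumes a branchy version costs a compare plus a branch plus one arm, while a predicated version costs a compare plus both arms (the arm whose predicate is false is issued but nullified). The function names and arm lengths are hypothetical.

```python
def branchy_path_length(cond, then_len, else_len):
    """Compare + branch, then only the arm that is actually taken."""
    return 2 + (then_len if cond else else_len)

def predicated_path_length(then_len, else_len):
    """Compare, then both arms are issued; one of them is nullified."""
    return 1 + then_len + else_len

# Hypothetical if/else with a 3-instruction and a 4-instruction arm.
branchy = branchy_path_length(cond=True, then_len=3, else_len=4)
predicated = predicated_path_length(then_len=3, else_len=4)
print(branchy, predicated)
```

The predicated form removes the branch, but every execution pays for both arms.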
Dynamic Path Length (IC)

Loads and Stores – only post-execution update of base register
ldsz.ldtype.ldhint r1 = [r3]         no base update form
ldsz.ldtype.ldhint r1 = [r3], r2     register base update
ldsz.ldtype.ldhint r1 = [r3], imm    immediate base update
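The effect of having only post-update loads can be sketched with a small register/memory model. The helper `load_post_inc`, the register names, and the memory contents are all hypothetical. The model shows two cases: striding through an array, where the post-increment replaces a separate add, and reading a field at base+8, where an explicit add is needed because there is no base+displacement form.

```python
def load_post_inc(regs, mem, dst, base, inc=0):
    """Model of `ld dst = [base], inc`: load first, then update the base."""
    regs[dst] = mem[regs[base]]
    regs[base] += inc

mem = {100: 7, 108: 9, 116: 11}           # hypothetical memory contents
regs = {"r1": 0, "r2": 0, "r3": 100}

# Case 1: array walk. The post-increment advances the pointer for free.
total = 0
for _ in range(3):
    load_post_inc(regs, mem, "r1", "r3", inc=8)
    total += regs["r1"]

# Case 2: field at base+8. The address must be formed by an extra add.
regs["r3"] = 100
regs["r2"] = regs["r3"] + 8               # the extra add instruction
load_post_inc(regs, mem, "r1", "r2")
field = regs["r1"]
print(total, field)
```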
CPI

Cache Effects
Larger code footprint
 128 bit bundle - 3 instructions
 Restrictions on placing instructions
 Branch target - beginning of bundle

Recovery code
 Pollutes I-Cache and/or triggers page faults

Speculative loads - Pollute D-cache
Stalls possible
Example
load ra =
load rb = ;; // end of bundle
add rx = ra
load ry = [rb];;
If load ra causes a cache miss, stall.
Superscalar out-of-order processors can execute non-dependent instructions in parallel with the cache miss.
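The stall example can be sketched as a toy timing model. Everything here is an assumption for illustration: the latencies are made up, the in-order machine holds a whole issue group (the slide's "bundle") until every source in it is ready, and the out-of-order machine is idealized (unlimited window, no resource limits).

```python
MISS, HIT, ALU = 100, 2, 1  # hypothetical latencies in cycles

# (name, destination, sources, latency); inner lists are issue groups.
groups = [
    [("load ra", "ra", [], MISS), ("load rb", "rb", [], HIT)],
    [("add rx", "rx", ["ra"], ALU), ("load ry", "ry", ["rb"], HIT)],
]

def in_order_finish_times(groups):
    """A group issues only when all sources used in it are ready."""
    ready, finish, cycle = {}, {}, 0
    for group in groups:
        start = max([cycle] + [ready[s] for _, _, srcs, _ in group for s in srcs])
        for name, dst, _, lat in group:
            ready[dst] = start + lat
            finish[name] = start + lat
        cycle = start + 1
    return finish

def out_of_order_finish_times(groups):
    """Each instruction starts as soon as its own sources are ready."""
    ready, finish = {}, {}
    for group in groups:
        for name, dst, srcs, lat in group:
            start = max([0] + [ready[s] for s in srcs])
            ready[dst] = start + lat
            finish[name] = start + lat
    return finish

in_order = in_order_finish_times(groups)
out_of_order = out_of_order_finish_times(groups)
print(in_order["load ry"], out_of_order["load ry"])
```

In this model the independent `load ry` waits behind the miss on `ra` in the in-order case, while the idealized out-of-order machine completes it during the miss.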
Comparing Complexities

Support for speculative execution
– Superscalar processors
» reorder buffer
» register renaming hardware
– EPIC
» need to expose parallelism, speculation
» hardware just does what the compiler says
IA-64: Exposing Speculative Execution
Control speculation (moving loads above branches)
Data speculation (moving loads above stores)

Control Speculation

Hardware for deferring exceptions exposed to software
– NaT (Not a Thing) bits, also called poison bits
» set NaT bit associated with a register on exception
» perform an explicit check before using the register
– Increase in machine state
» 2 NaT registers
» instructions to modify, test, and retrieve NaT values
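The deferred-exception mechanism can be sketched in a few lines. This is a simplified model, not the IA-64 definition: the `Reg` class, the function names, and the use of a missing dictionary key to stand in for a faulting address are all assumptions. The recovery callback here simply returns a substitute value, whereas real recovery code would redo the work non-speculatively.

```python
class Reg:
    """A register value with its NaT (poison) bit."""
    def __init__(self, value=0, nat=False):
        self.value = value
        self.nat = nat

def speculative_load(mem, addr):
    """Speculative load: a fault sets NaT instead of raising an exception."""
    if addr not in mem:
        return Reg(0, nat=True)
    return Reg(mem[addr])

def add(a, b):
    """The NaT bit propagates through dependent computation."""
    return Reg(a.value + b.value, nat=a.nat or b.nat)

def check(reg, recovery):
    """Explicit check before use: run recovery code if the value is poisoned."""
    return recovery() if reg.nat else reg

mem = {0x10: 5}                      # hypothetical memory; 0x99 is unmapped
recover = lambda: Reg(-1)            # placeholder recovery code

good = check(add(speculative_load(mem, 0x10), Reg(1)), recover)
bad = check(add(speculative_load(mem, 0x99), Reg(1)), recover)
print(good.value, bad.value)
```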
Data Speculation

Explicit memory-alias-detection table
– ALAT (Advanced Load Address Table)
» loads place their entries in the ALAT
» stores remove an entry if the addresses match
– Hardware cost:
» ALAT is 32-entry, 2-way set associative
» recovery code requires its operands to be kept live until the store is seen
» increased register requirements (128 Int + 128 FP)
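The ALAT behaviour described above can be sketched as a small table keyed by destination register. This is a simplified model under stated assumptions: the 32-entry capacity comes from the slide, but the 2-way set associativity is not modelled, addresses match only when exactly equal (no access sizes or partial overlap), and the oldest-first eviction is a guess made for the example.

```python
class ALAT:
    """Toy Advanced Load Address Table: register -> address of advanced load."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}

    def advanced_load(self, reg, addr):
        """An advanced load records its address (evicting the oldest if full)."""
        if reg not in self.entries and len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))
        self.entries[reg] = addr

    def store(self, addr):
        """A store removes every entry whose address matches."""
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg):
        """True if the speculated load is still valid; False means recover."""
        return reg in self.entries

alat = ALAT()
alat.advanced_load("r1", 0x100)
alat.store(0x200)                    # different address: entry survives
valid_before = alat.check("r1")
alat.store(0x100)                    # aliasing store: entry removed
valid_after = alat.check("r1")
print(valid_before, valid_after)
```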
Data Speculation Hardware Costs

Increased register pressure implies
– more state to be saved across function calls
– to avoid this:
» Register stacking (like SPARC register windows)
» Registers 0-31 are global; the others are dynamically mapped
» CFM (Current Frame Marker)
» Register Stack Engine
» Should also handle stack overflows
Additional complexity due to rotating registers
Hardware Costs

Superscalar
– Reorder buffer
– Register rename mechanism

EPIC (IA-64)
– NaT bits, associated instructions
– ALAT
– Increased number of registers
– Reg Stack Engine
» Additional complexities due to rotating registers, page faults, …
Runtime Information

Information about behavior of programs
– Can’t be predicted at compile time
– Profiling helps
» But costly…

Superscalar machines
– Dynamic selection of instructions to execute
– Rely upon information known at run time
EPIC
Depends mostly on the compiler
– Run-time information is not used as much
Consider the following code sequence
cmp p1, p2 = ..                          /* set predicate registers */
(p1) br.cond low_probability_path ;;     /* if (p1) goto ... */
l ra = [rb];;
add rc = ra, rd;;
use of (rc)
4 bundles, load not hoisted over a branch (which is not usually taken)
As Scheduled by IA-64 Compiler

Optimize for the most probable path
l.s ra = [rb];;
add rc = ra, rd
cmp p1, p2 = ...
(p1) br.cond low_probability_path ;;
check.s rc, recovery_code
use of (rc)

3 bundles
When Low Probability Path Is Taken
Superscalar processor
– Execute the load as early as possible
– Cancel if found to be mis-speculated
– Change assumptions dynamically
EPIC
– Load has to complete, since the dependent add is in the next bundle
– May take 100s of cycles if the pointer is random
– Heavy penalty if the compiler gets the probabilities wrong
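The cost of a wrong static guess can be estimated with simple expected-value arithmetic. This is a back-of-the-envelope sketch, not a model from the talk: it assumes the hoisted load is wasted work only when the low-probability path is taken, that it then stalls for the full miss penalty if it misses, and every probability and the penalty below are hypothetical.

```python
def expected_wasted_cycles(p_low_path, p_miss, miss_penalty):
    """Average cycles lost per execution to a useless speculative load."""
    return p_low_path * p_miss * miss_penalty

# What the profile suggested versus what happens on real inputs (made up).
profiled = expected_wasted_cycles(p_low_path=0.01, p_miss=0.5, miss_penalty=200)
actual = expected_wasted_cycles(p_low_path=0.30, p_miss=0.5, miss_penalty=200)

print(f"profiled: {profiled:.1f} cycles, actual: {actual:.1f} cycles")
```

Because the schedule is fixed at compile time, the loss scales directly with how far the real branch probability drifts from the profiled one.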
Dependence on Profiling
RISC and CISC find profiling useful, but not essential
IA-64 is much more dependent on profiling
Difficulties involved with profiling

– Additional responsibility for programmer
– Creating a representative test suite
– Using profiling in demanding, diverse development environments
Code Bloat
RISC instructions
 3 instructions per 128 bits
 Avg of 2 instructions per bundle
 Branch target at beginning of bundle
 Check ops
 Recovery code
 No base+disp addressing
 No sign-extended loads
 Predication
 Optimizations
IA-64 code should be 4.8 times x86 code

Overheads shown on the slide (percent): 50, 33, 33, 10, 20, 15, 30
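One plausible reading of the seven figures on the slide (50, 33, 33, 10, 20, 15, 30) is that they are per-factor percentage overheads that compound multiplicatively. That reading is an assumption, as is treating the last figure as the optimization overhead; the slide's pairing of figures with factors is not known here. Under those assumptions the first six figures compound to roughly 4, which is in line with the "about 4 times" estimate that excludes optimization overhead.

```python
from math import prod

# Figures from the slide, read as percent overhead per factor (assumption).
overheads_percent = [50, 33, 33, 10, 20, 15, 30]

def compound(percentages):
    """Multiply the growth factors (1 + p/100) together."""
    return prod(1 + p / 100 for p in percentages)

all_factors = compound(overheads_percent)
without_last = compound(overheads_percent[:-1])  # last assumed = optimizations

print(f"all seven: {all_factors:.2f}x, first six: {without_last:.2f}x")
```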
Some things that may reduce code size

Post-increment loads can eliminate an add in a loop
– e.g. accessing an array in strides
Combining a compare and a logical op
r1 + r2 + 1
Rotating register files for s/w pipelining
All the above amount to <5% difference.
So net code bloat is about 4 times (excluding optimization overhead).
Code bloat => more memory bandwidth requirement.

Performance comparison
800 MHz Itanium
SPECint
– <68% Alpha 21264 (1 GHz) (20% less power)
– <60% P4 (2 GHz)
SPECfp
– >20% Alpha 21264
– >8% P4
Power – a major hurdle
Conclusion
The IA-64 gamble – power is not going to be a critical limitation in the future.
This allows use of massive resources
