Transcript PPT

Is Out-of-Order Out of Date?
IA-64’s parallel architecture will improve
processor performance
William S. Worley Jr., HP Labs
Jerry Huck, IA-64 Architecture Lab
Slides by Selvin, Pascal, Pavel
The prelude to the IA-64
• The need for greater processing power is increasing
• New innovative computing technologies
• Traditional computing has increasing problem sizes
• Architecture designed from the ground up to support ILP
• Enables the compiler to express more parallelism (EPIC)
• Reduces hardware cost of scheduling parallel instructions
• Current approaches
• Legacy architectures were not designed primarily for high ILP
• Parallelism is exploited by non-architectural means, principally OOO dynamic superscalar hardware
• IA-64
• Growing market for high-performance 64-bit architecture
• No existing Intel 64-bit binaries
Not just for ILP
• Better building block for high performance systems
• Multi-programming gives limited improvements
• Parallelism has to be improved at all levels in the system
• Solely hardware-based multithreading cannot compensate for lack of
parallelism in the basic processing element.
• SMT, CMP apply equally to RISC, CISC and EPIC
• Integrated hardware multithreading is orthogonal to EPIC
• Inter-thread interference in SMT processors
• Hardware Resource Utilization vs. Complexity
• Transistors: the PA-8000's re-order buffer used roughly as many as an entire PA-7200
• Complexity scales roughly quadratically for a 1.5x or 2x increase in issue width
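A rough way to see the quadratic scaling claimed above (an illustrative model, not a figure from the article): with issue width w and p pipeline stages whose results must be forwarded, the bypass network needs on the order of

    \text{bypass paths} \approx w^{2} \cdot p

paths, since each of the w results per stage may feed any of the w consumers. For example, going from w = 4 to w = 8 with p = 2 raises the count from 32 to 128, a factor of four.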
Architecture vs. Implementation
• Speed of functional units is architecture-independent
• Memory and Data-cache hierarchy
• Largely independent of the architecture
• OOO RISC designs achieve better utilization
• With additional cost, it is possible to realise better designs
• The IA-64 memory system balances cost and performance
• Cycle time of IA-64 is determined by:
• IC process, number of registers, register ports, bypass network, number of cache ports
• Critical path is found in functional units and bypass networks
• IA-64 has higher utilization of these fundamental structures
IA-64 Parallelism Capabilities
• Predication (illustrated in the sketch after this slide):
• fewer branches encountered
• fewer mispredicted branches
• more parallelism
• Larger register set:
• new coding strategies (impossible with RISC)
• more efficient than register renaming (RISC)
• less data loss in the event of an interruption
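A minimal C sketch of the if-conversion that predication makes cheap (illustrative only; the IA-64 forms mentioned in the comments are indicative, and a C compiler may or may not lower the select to branch-free code):

    /* Branchy form: what a conventional compiler emits; the branch is
     * encountered and possibly mispredicted on every call. */
    int abs_diff_branchy(int a, int b) {
        if (a > b)
            return a - b;
        else
            return b - a;
    }

    /* If-converted form: both results are computed unconditionally and
     * one is selected.  On IA-64 a compare would set a pair of predicate
     * registers and the two subtractions would issue in parallel, each
     * guarded by one predicate, with no branch at all. */
    int abs_diff_predicated(int a, int b) {
        int d1 = a - b;
        int d2 = b - a;
        return (a > b) ? d1 : d2;
    }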
IA-64 Parallelism Capabilities (2)
• Features to deal with memory latency (see the sketch after this slide):
• earlier access to variables
• not restricted to fixed hardware algorithms for:
– correctly predicting execution path
– triggering memory fetches
• heuristics to identify speculative load candidates
– compiler involved
– control of the degree of speculation by the programmer
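A C-level approximation of control speculation on loads (illustrative only; plain C cannot express IA-64's deferred-fault ld.s/chk.s pair, so the hand-hoisted version below is safe only if p is known to be dereferenceable):

    /* Original form: the load cannot start until the guard branch resolves,
     * so its full latency sits on the critical path. */
    int guarded_load_original(int *p, int cond) {
        if (cond)
            return *p + 1;
        return 0;
    }

    /* Speculative form: the load is hoisted above the branch so its latency
     * overlaps the test.  On IA-64 the hoisted load would be a ld.s and the
     * use site a chk.s, letting a faulting speculative load be deferred and
     * recovered instead of crashing the program. */
    int guarded_load_speculative(int *p, int cond) {
        int v = *p;
        if (cond)
            return v + 1;
        return 0;
    }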
IA-64 Parallelism Capabilities (3)
• Register Stack Engine (RSE) (a toy model follows this slide):
• increases the utilization of the register file
• reduces the cost of procedure calls and returns
• especially valuable for object-oriented code
• straightforward hardware design
• Mechanisms to deliver instructions to the processor:
• eliminate effects of increased code size
• modest design costs
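A toy C model of the register-stack idea (an assumption-laden sketch, not the actual RSE: the file size, frame size, and spill policy are made up, and the real engine spills and fills in the background):

    #include <stdio.h>

    #define PHYS_REGS 96      /* assumed size of the stacked register file */

    static int top = 0;       /* physical registers currently in use */
    static int spilled = 0;   /* registers pushed to the backing store */

    /* Procedure entry, roughly like IA-64's alloc: grab a fresh frame and
     * spill old registers only if the physical file overflows. */
    static void alloc_frame(int size) {
        while (top + size > PHYS_REGS) {
            spilled++;
            top--;
        }
        top += size;
    }

    /* Procedure return: the frame is simply released (a real RSE would
     * also refill registers that had been spilled). */
    static void free_frame(int size) {
        top -= size;
    }

    static void call_chain(int depth) {
        alloc_frame(12);               /* hypothetical 12-register frame */
        if (depth > 0)
            call_chain(depth - 1);
        free_frame(12);
    }

    int main(void) {
        call_chain(9);                 /* 10 nested calls: 120 > 96 regs */
        printf("registers spilled to memory: %d\n", spilled);
        return 0;
    }

The point of the model: short call/return sequences reuse the physical register file and touch memory only on overflow, which is why calls and returns get cheaper.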
Results
• Comparison between PA-RISC and IA-64
• 15 codes (encryption, decryption and keying for five AES algorithms)
• 8/15 IA-64 codes used more than 32 registers
• 6/15 IA-64 codes smaller
• 2/15 IA-64 codes 4 times smaller
• overall code size 27% larger (could have been reduced to 10%)
Compilers and IA-64
IA-64 uses existing compiler techniques to
exploit parallelism:
• data prefetch
• branch hints
• loop unrolling
• profile-based path instructions
• other
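A hand-written C sketch of two of the techniques listed above, loop unrolling and data prefetch (illustrative only: an IA-64 compiler would apply them automatically, __builtin_prefetch is a GCC/Clang builtin standing in for a hardware prefetch instruction, and the unroll factor and prefetch distance are assumptions):

    /* Sum an array with the loop unrolled by four and a software prefetch
     * issued well ahead of the current position. */
    void sum_array(const double *a, int n, double *out) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __builtin_prefetch(&a[i + 64]);    /* assumed prefetch distance */
            s0 += a[i];                        /* four independent chains   */
            s1 += a[i + 1];                    /* give the scheduler ILP    */
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double s = s0 + s1 + s2 + s3;
        for (; i < n; i++)                     /* remainder loop */
            s += a[i];
        *out = s;
    }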
Need for compiler support
IA-64 does require well-prepared code (profiled, with branch hints, etc.) to achieve high performance, but this is also true for out-of-order processors.
Lack of code profiling is equally harmful to both IA-64 and OOO architectures.
With profiled code, IA-64 is superior to OOO, as shown by benchmark tests (specFP64).
Critical path instructions
(e.g. long latency operations)
• Compilers for OOO processors don't distinguish them, so such instructions often incur high execution cost
• IA-64 compilers must detect such instructions and make sure they start first (*)
(*) Cost of mispredictions is minimized by prefetches issued by the compiler
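A small C illustration of the scheduling point above (source order is only suggestive here, since a C compiler may reorder either version; the idea is the instruction schedule an IA-64 compiler must produce explicitly and an OOO core produces in hardware):

    /* Late schedule: the long-latency load issues after the arithmetic,
     * so its miss latency adds directly to the critical path. */
    int lookup_late(const int *table, int key, int bias) {
        int adjusted = (bias * 3 + 7) & 0xff;
        int v = table[key];
        return v + adjusted;
    }

    /* Early schedule: the load issues first, and the independent
     * arithmetic overlaps its latency. */
    int lookup_early(const int *table, int key, int bias) {
        int v = table[key];
        int adjusted = (bias * 3 + 7) & 0xff;
        return v + adjusted;
    }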
Dealing with cache misses
Compiler contribution:
• static code generation (i.e., fewer branches)
• branch hints
Hardware mechanisms:
• sample instructions on timer ticks to get information about actual program flow (HP)
• feed information on cache misses back to the program (Intel Itanium)
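A small sketch of feeding profile knowledge back as a branch hint (hedged: __builtin_expect is a GCC/Clang builtin used here as a stand-in for the hint fields on IA-64 branch instructions, and "errors are rare" is an assumed profile result):

    /* The compiler lays out the expected path as straight-line code and can
     * encode the prediction in the branch it emits. */
    int process_record(int status) {
        if (__builtin_expect(status != 0, 0)) {
            return -1;             /* rare error path, moved out of line */
        }
        return 1;                  /* common fall-through path */
    }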
Dynamic prediction mechanisms (2)
IA-64 has hint fields in most branch and memory instructions that allow the program to collect flow information and pass it to the processor.
These features allow software to improve performance at run time, without recompilation.
Current and Future IA-64
implementations
Initial implementation (as always) focuses on the
most important architectural elements only.
It uses the ideas of EPIC while providing
compatibility with IA-32 and PA-RISC processors.
Future implementations will deliver even more ILP.
Its creators state that the IA-64 architecture will not remain fixed.