Itanium Processor Microarchitecture

Transcript

by Harsh Sharangpani and Ken Arora
Presented by Teresa Watkins, 4/16/02


• First implementation of the IA-64 instruction set architecture
• Targets memory latency, memory address disambiguation, and control-flow dependencies
• 0.18-micron process, 800 MHz
• EPIC design style shifts more responsibility to the compiler

Challenge

Try to identify which improvements discussed in this class found their way into the Itanium.


Idea: the compiler has a larger instruction window than the hardware, so communicate to the hardware more of the information gleaned at compile time.


• Six instructions wide and ten stages deep
• Tries to minimize the latency of the most frequent operations
• Hardware support for compile-time indeterminacies

• Software-initiated prefetch (requests filtered by the instruction cache); a prefetch must issue 12 cycles before the branch to hide latency
• L2 -> streaming buffer -> instruction cache
• Four-level branch-predictor hierarchy to prevent a 9-cycle pipeline stall
• Decoupling buffer holds up to 8 bundles of code

Compiler provides branch hint directives:
• explicit branch predict (BRP) instructions
• hint specifiers on branch instructions, which provide:
  • branch target addresses
  • static hints on branch direction
  • indicators for when to use the dynamic predictors

Four types of predictors:
• Resteer 1: single-cycle predictor (4 BRP-programmed TARs)
• Resteer 2: adaptive multi-way and return predictors (dynamic)
• Resteers 3 & 4: branch address calculation and correction (Resteer 3 includes the “perfect-loop-exit predictor”)
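To make the "adaptive" dynamic stage concrete, here is a minimal sketch of a 2-bit saturating-counter predictor, the classic building block behind hardware branch predictors. The table size and PC indexing are assumptions for illustration; the actual Itanium Resteer-2 stage uses a more elaborate multi-way and return-prediction scheme than this.

```python
# Sketch of a 2-bit saturating-counter branch predictor.
# Counter states: 0-1 predict not-taken, 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self, entries=256):
        self.counters = [1] * entries   # start weakly not-taken
        self.entries = entries

    def predict(self, pc):
        # Index the counter table by (hashed) PC; predict taken if >= 2.
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        # Saturating increment/decrement toward the observed outcome.
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

pred = TwoBitPredictor()
pc = 0x4000
for _ in range(4):                  # loop branch taken repeatedly
    pred.update(pc, taken=True)
assert pred.predict(pc)             # trained to "taken"
pred.update(pc, taken=False)        # a single loop exit...
assert pred.predict(pc)             # ...does not flip the prediction
```

The hysteresis of the 2-bit counter is what makes loop-closing branches cheap: one mispredicted exit per loop rather than two.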

Plentiful resources:
• four integer units
• four multi-media units
• two load/store units
• three branch units
• two single-precision FP units
• two extended-precision FP units
• SIMD allows up to 20 parallel operations per clock

Organized around 9 issue ports:
• two memory
• two integer
• two FP
• three branch

Dispersal follows the high-level semantics provided by the IA-64 ISA. Checks for:
• independence (determined by stop bits)
• oversubscription (determined by the 8-bit instruction template)
The template allows for simplified dispersal routing.
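A minimal sketch of the dispersal idea: instructions between stop bits form an issue group, each instruction class needs a free port, and a group that oversubscribes a port class is split across cycles. The port counts come from the slide (2 memory, 2 integer, 2 FP, 3 branch); the greedy splitting logic is a simplification of the template-driven routing, not the real dispersal network.

```python
# Sketch of issue-group dispersal with oversubscription splitting.
# Instruction classes: M = memory, I = integer, F = FP, B = branch.
PORTS = {'M': 2, 'I': 2, 'F': 2, 'B': 3}

def disperse(group):
    """Split one stop-bit-delimited group into issue cycles."""
    cycles, current = [], []
    used = dict.fromkeys(PORTS, 0)
    for insn_type in group:
        if used[insn_type] == PORTS[insn_type]:
            # Port class oversubscribed: close this cycle, start a new one.
            cycles.append(current)
            current, used = [], dict.fromkeys(PORTS, 0)
        used[insn_type] += 1
        current.append(insn_type)
    cycles.append(current)
    return cycles

# Two memory ops fit in one cycle; a third forces a second cycle.
assert disperse(['M', 'M', 'I']) == [['M', 'M', 'I']]
assert disperse(['M', 'M', 'M']) == [['M', 'M'], ['M']]
```

In the real machine the bundle template encodes the instruction classes directly, which is what lets the hardware do this routing with simple decoding instead of full dependency analysis.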

Two types of register renaming (virtual register addressing):
• Register stacking reduces function call and return overhead by stacking a new register frame on top of the old frame, avoiding an explicit save of the caller’s registers (not supported for FP registers).
• Register rotation supports software pipelining by accessing the registers through an indirection based on the iteration count.
If software allocates more virtual registers than are physically available (overflow), the Register Stack Engine takes control of the pipeline to spill register values to memory, and fills them back on underflow. No pipeline flushes required :)
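The rotation indirection can be sketched in a few lines: an architectural register name in the rotating region is mapped to a physical register by adding a rotating base that the loop-closing branch advances each iteration. The region size, base register number, and rotation direction here are simplified assumptions, not the exact IA-64 RRB mechanics.

```python
# Sketch of register rotation for software pipelining.
ROT_SIZE = 96      # assumed size of the rotating region
ROT_FIRST = 32     # assumed first rotating register (static regs below)

def physical_reg(arch_reg, rrb):
    """Map an architectural register to a physical one via the rotation base."""
    if arch_reg < ROT_FIRST:
        return arch_reg                       # static region: no rotation
    return ROT_FIRST + (arch_reg - ROT_FIRST + rrb) % ROT_SIZE

rrb = 0
p_iter0 = physical_reg(33, rrb)      # iteration 0 writes through "r33"
rrb = (rrb + 1) % ROT_SIZE           # loop-closing branch rotates the base
# The same physical register is now reachable under a neighboring name,
# so each iteration gets fresh registers without any copying.
assert physical_reg(32, rrb) == p_iter0
```

This is why software pipelining on IA-64 needs no explicit register copies between loop iterations: renaming is a modular add, done once per loop-back branch.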

Integer register file:
• 128 entries
• 8 read ports
• 6 write ports
• post-increment performed by idle ALU and write ports

FP register file:
• 128 entries
• 8 read ports
• 4 write ports, separated into odd and even banks
• supports double extended-precision arithmetic

Predicate register file: 1-bit entries with 15 read and 11 write ports

Non-blocking cache with a scoreboard-based stall-on-use control strategy: the pipeline stalls only when data is actually needed, not on other hazards. The deferred-stall strategy (hazard evaluation in the REG stage) allows more time for dependencies to resolve; the stall itself occurs in the EXE stage, where input latches snoop returning data values for the correct data using the existing register bypass hardware.
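A tiny model of the stall-on-use idea, under the assumption that the scoreboard is just a set of pending destination registers: a load miss marks its destination pending and the pipeline keeps going; a stall is raised only when a consumer actually reads a pending register.

```python
# Sketch of scoreboard-based stall-on-use (not the Itanium circuit).
class Scoreboard:
    def __init__(self):
        self.pending = set()             # registers with a miss in flight

    def issue_load_miss(self, dest):
        self.pending.add(dest)           # miss outstanding; keep executing

    def data_return(self, dest):
        self.pending.discard(dest)       # fill arrives, clears the entry

    def must_stall(self, sources):
        # Stall only if a consumer needs a still-pending register.
        return any(r in self.pending for r in sources)

sb = Scoreboard()
sb.issue_load_miss('r4')
assert not sb.must_stall(['r1', 'r2'])   # independent work proceeds
assert sb.must_stall(['r4'])             # consumer stalls on use
sb.data_return('r4')
assert not sb.must_stall(['r4'])
```

The payoff is that compiler-scheduled independent work between a load and its use hides miss latency for free.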

Predication turns a control dependency into a data dependency by executing both sides of a branch and squashing the incorrect instructions before they change the machine state (speculative predicate register file vs. architectural predicate register file). Executes up to three parallel branches per cycle, using priority encoding to determine the earliest taken branch.
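Predication can be sketched as follows: a compare produces a predicate pair, both arms of the if/else issue, and each result commits only if its guarding predicate is true. The register-file-as-dict model and the specific add/sub arms are illustrative assumptions, not IA-64 syntax.

```python
# Sketch of predicated execution: both arms run, predicates gate commit.
def predicated_execute(regs, a, b):
    # cmp.eq-style compare generates a complementary predicate pair.
    p1 = regs[a] == regs[b]
    p2 = not p1
    # Both "arms" issue; each carries its guarding predicate.
    results = [
        ('x', regs[a] + 1, p1),   # (p1) then-arm: x = a + 1
        ('x', regs[b] - 1, p2),   # (p2) else-arm: x = b - 1
    ]
    for dest, value, pred in results:
        if pred:                  # false-predicated result is squashed
            regs[dest] = value
    return regs

regs = predicated_execute({'a': 5, 'b': 5}, 'a', 'b')
assert regs['x'] == 6             # equal: only the then-arm committed
```

No branch ever executes for this if/else, so there is nothing to mispredict; the cost is issuing both arms.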


Exception tokens

In FP registers, deferred exceptions are flagged by storing the special NaTVal encoding in the NaN space; integer registers instead carry an extra bit, the NaT (Not a Thing) exception token. Because the extra bit does not fit in memory, NaT bits must be saved to a special UNaT register when registers are spilled, and restored during fills.
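The token flow can be modeled simply: a speculative load that would fault writes NaT instead of raising, every consumer propagates NaT, and a check at the original (non-speculative) program point triggers recovery if NaT reached it. The Python sentinel object and function names are modeling assumptions standing in for IA-64's ld.s/chk.s behavior.

```python
# Sketch of deferred exception tokens (NaT propagation).
NAT = object()   # stands in for the integer NaT bit / FP NaTVal

def spec_load(memory, addr):
    """Speculative load: defer the fault by returning NaT."""
    return memory.get(addr, NAT)     # unmapped address ~ would-be fault

def add(a, b):
    """Any use of a NaT input yields NaT (token propagation)."""
    if a is NAT or b is NAT:
        return NAT
    return a + b

def check(value, recovery):
    """Non-speculative check: run recovery code if NaT reached here."""
    return recovery() if value is NAT else value

memory = {0x100: 7}
v = add(spec_load(memory, 0x100), 1)
assert check(v, recovery=lambda: -1) == 8      # no fault: result is used
v = add(spec_load(memory, 0x999), 1)           # faulting speculative load
assert check(v, recovery=lambda: -1) == -1     # check triggers recovery
```

Deferral is what makes hoisting loads above branches safe: a fault on a path that is never taken simply evaporates, because its NaT is never checked.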

ALAT structure

If a store writes to a memory address between the time an advanced (speculative) load reads that address and the time the value is consumed, the ALAT invalidates the speculative load value and recovery is initiated. ALAT checks can be issued in parallel with the consuming instruction.
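A minimal model of that mechanism: the advanced load records its address in the table, later stores knock out any entry with a matching address, and the check at the original load point re-loads only if the entry is gone. Entry capacity, eviction, and partial-overlap matching are all simplified away here.

```python
# Sketch of the ALAT (Advanced Load Address Table).
class ALAT:
    def __init__(self):
        self.entries = {}                # dest register -> load address

    def advanced_load(self, memory, reg, addr):
        self.entries[reg] = addr         # record the hoisted load
        return memory[addr]

    def store(self, memory, addr, value):
        memory[addr] = value
        # Invalidate any advanced load whose address this store hits.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, memory, reg, addr, spec_value):
        # At the original load point: entry survived => value is valid.
        if self.entries.get(reg) == addr:
            return spec_value
        return memory[addr]              # conflict detected: re-load

memory = {0x40: 10}
alat = ALAT()
v = alat.advanced_load(memory, 'r8', 0x40)     # load hoisted above store
alat.store(memory, 0x40, 99)                   # aliasing store fires
assert alat.check(memory, 'r8', 0x40, v) == 99 # recovery re-loads
```

This lets the compiler hoist loads above stores it cannot disambiguate at compile time, paying the re-load cost only on an actual conflict.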


First Level Cache

• separate data and instruction caches
• 16 Kbytes each, 32-byte line size (6 instructions/cycle from the I-cache)
• four-way set-associative
• dual-ported
• 2-cycle latency, fully pipelined
• write-through
• physically addressed and tagged
• single-cycle, 64-entry, fully associative iTLB (backed up by an on-chip hardware page walker)
• iTLB and cache tags have an additional port to check an address for a miss

Second Level Cache

• combined data and instructions
• 96 Kbytes
• six-way set-associative
• 64-byte line size
• two banks
• four-state MESI for multiprocessor coherence
• delivers 4 double-precision operands per clock to the FP register file

Third Level Cache

• 4 Mbytes
• 64-byte line size
• four-way set-associative
• 128-bit bus running at core speed (12 Gbytes/s bandwidth)
• MESI protocol

Optimal Cache Management

• Memory locality hints: guide allocation and replacement strategies
• Bias hints: optimize MESI latency

• 64-bit system bus with source-synchronous data transfer (2.1 Gbytes/s)
• Multi-drop shared system bus using the MESI coherence protocol
• Glueless four-way multiprocessor support (4 processor nodes)
• Multiple nodes connected through high-speed interconnects
• Transaction-based bus protocol allows 56 pending transactions
• ‘Defer mechanism’ for out-of-order data transfers and transaction completion

• Non-blocking caches, as seen in “Lockup-free instruction fetch cache organization”
• Prefetch:
  • decoupled prefetch based on branch hints, as seen in “A Scalable Front-End Architecture for Fast Instruction Delivery”
  • software-initiated prefetch, as seen in “Design and Evaluation of a Compiler Algorithm for Prefetching”
• Memory locality hints for more efficient use of the caches
• Speculation: extra bit for deferred exception tokens

What else?

Do you think they made a simple, scalable hardware implementation?
