Intel Pentium 4 Processor

Presented by
Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)
Outline

Introduction (Zhijian)
– Willamette (11/2000)
Instruction Set Architecture (Zhijian)
Instruction Stream (Steve)
Data Stream (Zhijian)
What went wrong (Steve)
Pentium 4 revisions
– Northwood (1/2002)
– Xeon (Prestonia, ~2002)
– Prescott (2/2004)
Dual Core
– Smithfield
Introduction

Intel Pentium 4 processor
– Latest IA-32 processor equipped with a full set
of IA-32 SIMD operations

First implementation of a new microarchitecture called “NetBurst” by Intel
(11/2000)
IA-32

Intel architecture 32-bit (IA-32)
– 80386 instruction set (1985)
– CISC, 32-bit addresses
“Flat” memory model
 Registers

– Eight 32-bit registers
– Eight FP stack registers
– 6 segment registers
IA-32 (cont’d)

Addressing modes
– Register indirect (mem[reg])
– Base + displacement (mem[reg + const])
– Base + scaled index (mem[reg + (2^scale x index)])
– Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])
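The addressing modes above all reduce to one effective-address formula. A minimal Python sketch, assuming 32-bit wraparound and a scale exponent of 0 to 3 (factors 1, 2, 4, 8); the function name effective_address is made up for illustration:

```python
def effective_address(base, index=0, scale=0, displacement=0):
    """Return base + (2**scale * index) + displacement, wrapped to 32 bits.

    Illustrative only: 'scale' is the exponent, so the multiplier is
    1, 2, 4 or 8, as in the IA-32 scaled-index forms listed above.
    """
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale exponent must be 0..3")
    return (base + (index << scale) + displacement) & 0xFFFFFFFF


# Register indirect: mem[reg]
addr_indirect = effective_address(0x1000)
# Base + scaled index + displacement: mem[reg + (4 x index) + 8]
addr_full = effective_address(0x1000, index=3, scale=2, displacement=8)
```

Leaving index and displacement at their defaults gives the simpler modes, so one function covers all four list entries.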
SIMD instruction sets
– MMX (Pentium MMX, Pentium II)
» Eight 64-bit MMX registers, integer ops only
– SSE (Streaming SIMD Extension, Pentium III)
» Eight 128-bit registers
Pentium III vs. Pentium 4 Pipeline
Comparison Between Pentium III and Pentium 4
Execution on MPEG4 Benchmarks @ 1 GHz
Instruction Set Architecture

Pentium4 ISA =
Pentium3 ISA +
SSE2 (Streaming SIMD Extensions 2)

SSE2 is an architectural enhancement to
the IA-32 architecture
SSE2
Extends MMX and the SSE extensions with
144 new instructions:
 128-bit SIMD integer arithmetic operations
 128-bit SIMD double precision floating
point operations
 Enhanced cache and memory management
operations
Comparison Between SSE and SSE2



Both support operations on 128-bit XMM registers
SSE only supports 4 packed single-precision
floating-point values
SSE2 supports more:
2 packed double-precision floating-point values
16 packed byte integers
8 packed word integers
4 packed doubleword integers
2 packed quadword integers
Double quadword
Packing
128 bits (word = 2 bytes)
Quadword:   64 bit | 64 bit
Doubleword: 32 bit | 32 bit | 32 bit | 32 bit
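The packing layout can be illustrated with Python's struct module. This is only a sketch: it treats a 16-byte string as a stand-in for one 128-bit XMM register, assumes little-endian byte order, and uses made-up values.

```python
import struct

# Four 32-bit doublewords packed into one 128-bit (16-byte) value.
doublewords = (0x11111111, 0x22222222, 0x33333333, 0x44444444)
xmm = struct.pack("<4I", *doublewords)

# The same 128 bits viewed as the other packed integer types.
as_bytes = struct.unpack("<16B", xmm)      # 16 packed bytes
as_words = struct.unpack("<8H", xmm)       # 8 packed words
as_quadwords = struct.unpack("<2Q", xmm)   # 2 packed quadwords
```

No data moves between the views; only the interpretation of the 128 bits changes, which is what lets one register file serve every packed type.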
Hardware Support for SSE2
Adder and multiplier units in the SSE2
engine are 128 bits wide, twice the width of
those in the Pentium III
Increased load/store bandwidth for
floating-point values
Loads and stores are 128 bits wide
One load plus one store can be completed
between an XMM register and the L1 cache in one
clock cycle

SSE2 Instructions (1)



Data movements
Move data between XMM registers and between
XMM registers and memory
Double precision floating-point operations
Arithmetic instructions on both scalar and
packed values
Logical Instructions
Perform logical operations on packed double
precision floating-point values
SSE2 Instructions (2)



Compare instructions
Compare packed and scalar double precision
floating-point values
Shuffle and unpack instructions
Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
Conversion instructions
Conversion between doubleword and double-precision floating-point, or between single-precision and double-precision floating-point values
SSE2 Instructions (3)



Packed single-precision floating-point instructions
Convert between single-precision floating-point
and double word integer operands
128-bit SIMD integer instructions
Operations on integers contained in XMM
registers
Cacheability Control and Instruction Ordering
More operations for caching of data when storing
from XMM registers to memory and additional
control of instruction ordering on store operations
Conclusion
Pentium4 is equipped with the full set of
IA-32 SIMD technology. All existing
software can run correctly on it.
 AMD has decided to embrace and
implement SSE and SSE2 in future CPUs

Instruction Stream

What’s new?
– Added Trace Cache
– Improved branch predictor

Terminology
– µop – Micro-op, an already decoded RISC-like
instruction
– Front end – instruction fetch and issue
Front End
Prefetches instructions that are likely to be
executed
Fetches instructions that haven’t been
prefetched
Decodes instructions into µops
Generates µops for complex instructions or
special purpose code
Predicts branches

Prefetch

Three methods of prefetching:
Instructions only – Hardware
 Data only – Software
 Code or data – Hardware

Decoder
Single decoder that can operate at a
maximum of 1 instruction per cycle
 Receives instructions from L2 cache 64 bits
at a time
 Some complex instructions must enlist the
help of the microcode ROM

Trace Cache
Primary instruction cache in NetBurst
architecture
Stores decoded µops
~12K µop capacity
On a Trace Cache miss, instructions are
fetched and decoded from the L2 cache

What is a Trace Cache?
Code:
I1 …
I2 br r2, L1
I3 …
I4 …
I5 …
L1: I6
I7 …
Traditional instruction cache: I1 I2 I3 I4
Trace cache: I1 I2 I6 I7
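The difference between the two cache lines in the example above can be sketched in Python. This is a toy model, not the Pentium 4 mechanism: the dictionary-based instruction format, the build_trace helper and the 4-entry line length are all assumptions made for illustration.

```python
def build_trace(program, start, predict_taken, max_len=4):
    """Collect instruction names along the predicted path of execution."""
    labels = {ins["label"]: i for i, ins in enumerate(program) if "label" in ins}
    trace, pc = [], start
    while pc < len(program) and len(trace) < max_len:
        ins = program[pc]
        trace.append(ins["name"])
        if "target" in ins and predict_taken(ins):
            pc = labels[ins["target"]]   # follow the predicted-taken branch
        else:
            pc += 1                      # fall through to the next instruction
    return trace


program = [
    {"name": "I1"},
    {"name": "I2", "target": "L1"},      # br r2, L1
    {"name": "I3"},
    {"name": "I4"},
    {"name": "I5"},
    {"name": "I6", "label": "L1"},
    {"name": "I7"},
]

# A traditional cache line holds instructions in address order.
traditional_line = [ins["name"] for ins in program[:4]]
# A trace line holds them in predicted execution order.
trace_line = build_trace(program, 0, predict_taken=lambda ins: True)
```

With the branch at I2 predicted taken, the trace line skips I3 to I5 and continues at the branch target, so the instructions after the branch are available without a second fetch.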
Pentium 4 Trace Cache
Has its own branch predictor that directs
where instruction fetching needs to go next
in the Trace Cache
 Removes

– Decoding costs on frequently decoded
instructions
– Extra latency to decode instructions upon
branch mispredictions
Microcode ROM
Used for complex IA-32 instructions (> 4
µops), such as string move, and for fault
and interrupt handling
When a complex instruction is encountered,
the Trace Cache jumps into the microcode
ROM, which then issues the µops
After the microcode ROM finishes, the
front end of the machine resumes fetching
µops from the Trace Cache

Branch Prediction

Predicts ALL near branches
– Includes conditional branches, unconditional
calls and returns, and indirect branches

Does not predict far transfers
– Includes far calls, irets, and software interrupts
Branch Prediction
Dynamically predict the direction and target
of branches based on PC using BTB
 If no dynamic prediction is available,
statically predict

– Taken for backwards looping branches
– Not taken for forward branches

Traces are built across predicted branches to
avoid branch penalties
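The dynamic-then-static fallback can be sketched as follows. This is a simplified Python model under stated assumptions: the BTB is represented as a plain dictionary from branch address to last predicted direction, and predict_branch is a made-up name; real hardware also predicts the target.

```python
def predict_branch(pc, target, btb):
    """Return (taken, source) for a conditional branch at address pc.

    btb: hypothetical dict mapping branch address -> predicted direction.
    """
    if pc in btb:
        return btb[pc], "dynamic"       # a dynamic prediction is available
    # Static rule: backward (looping) branches taken, forward not taken.
    return target <= pc, "static"


loop_branch = predict_branch(pc=0x120, target=0x100, btb={})
forward_branch = predict_branch(pc=0x100, target=0x140, btb={})
known_branch = predict_branch(pc=0x100, target=0x140, btb={0x100: True})
```

The static rule only needs the sign of the branch displacement, so it is available even the first time a branch is seen.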
Branch Target Buffer
Uses a branch history table and a branch
target buffer to predict
 Updating occurs when branch is retired

Return Address Stack
16 entries
 Predicts return addresses for procedure calls
 Allows branches and their targets to coexist
in a single cache line

– Increases parallelism since decode bandwidth is
not wasted
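A return address stack is simple enough to model directly. The Python sketch below assumes the 16 entries given above; dropping the oldest entry on overflow is an assumption for illustration, as are the class and method names.

```python
from collections import deque


class ReturnAddressStack:
    """Toy return address stack; overflow policy is assumed, not documented."""

    def __init__(self, entries=16):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """Record the address of the instruction after a call."""
        self._stack.append(return_address)

    def predict_return(self):
        """Predict the target of a return, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)   # outer call
ras.on_call(0x2008)   # nested call
inner = ras.predict_return()
outer = ras.predict_return()
```

Because calls and returns nest, the most recent call always supplies the prediction for the next return.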
Branch Hints
P4 permits software to provide hints to the
branch prediction and trace formation
hardware to enhance performance
Hints take the form of prefixes to conditional
branch instructions
Hints are used only at trace build time and have no
effect on already built traces

Out-of-Order Execution
Designed to optimize performance by
handling the most common operations in
the most common context as fast as possible
126 µops can be in flight at once

– Up to 48 loads / 24 stores
Issue
Instructions are fetched and decoded by
the translation engine
Translation engine builds instructions into
sequences of µops
Stores µops to trace cache
Trace cache can issue 3 µops per cycle

Execution
Can dispatch up to 6 µops per cycle
Exceeds trace cache and retirement µop
bandwidth

– Allows for greater flexibility in issuing µops to
different execution units
Execution Units
Double-pumped ALUs

ALU executes an operation on both rising
and falling edges of clock cycle
Retirement
Can retire 3 µops per cycle
Precise exceptions
Reorder buffer to organize completed µops
Also keeps track of branches and sends
updated branch information to the BTB

Execution Pipeline
Data Stream of Pentium 4 Processor
Register Renaming
Register Renaming (2)
8-entry architectural register file
128-entry physical register file
2 RATs: Frontend RAT and Retirement RAT
Data does not need to be copied between
register files when the instruction retires
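Why retirement needs no data copy is easier to see in a toy model. The Python sketch below assumes the sizes listed above (8 architectural, 128 physical registers); the Renamer class, its free list, and the initial identity mapping are illustrative assumptions, not a description of the actual hardware.

```python
class Renamer:
    """Toy model: two RATs map architectural registers to physical ones."""

    def __init__(self, arch_regs=8, phys_regs=128):
        # Assume architectural register i initially lives in physical register i.
        self.frontend_rat = list(range(arch_regs))
        self.retirement_rat = list(range(arch_regs))
        self.free_list = list(range(arch_regs, phys_regs))

    def rename(self, dest, sources):
        """Map sources through the Frontend RAT, then allocate a new dest."""
        phys_sources = [self.frontend_rat[s] for s in sources]
        phys_dest = self.free_list.pop(0)
        self.frontend_rat[dest] = phys_dest
        return phys_dest, phys_sources

    def retire(self, dest, phys_dest):
        """Commit by repointing the Retirement RAT; no register data moves."""
        previous = self.retirement_rat[dest]
        self.retirement_rat[dest] = phys_dest
        self.free_list.append(previous)   # old physical register is reusable


r = Renamer()
p1, s1 = r.rename(dest=0, sources=[1, 2])   # r0 = r1 op r2
p2, s2 = r.rename(dest=3, sources=[0])      # r3 = f(r0), reads the renamed r0
r.retire(dest=0, phys_dest=p1)
```

Retirement only updates a pointer in the Retirement RAT, so the result stays in the physical register where it was produced.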

On-chip Caches



L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
Parameters:
The caches are not inclusive, and a pseudo-LRU
replacement algorithm is used
L1 Instruction Cache
Execution Trace Cache stores decoded
instructions
 Remove decoder latency from main
execution loops
 Integrate path of program execution flow
into a single line

L1 Data Cache




Nonblocking
Supports up to 4 outstanding load misses
Load latency
2-clock for integer
6-clock for floating-point
1 load and 1 store per clock
Speculative load
Assume the access will hit the cache
“Replay” the dependent instructions when a miss
happens
L2 Cache
Load latency
Net load access latency of 7 cycles
 Nonblocking
 Bandwidth
One load and one store in one cycle
A new cache operation can begin every 2 cycles
256-bit wide bus between L1 and L2
48 Gbytes per second @ 1.5 GHz

Data Prefetcher in L2 Cache
Hardware prefetcher monitors the reference
patterns
Brings in cache lines automatically
Attempts to stay 256 bytes ahead of current
data access location
Prefetches for up to 8 simultaneous
independent streams

Store and Load
Out of order store and load operations
Stores are always in program order
 48 loads and 24 stores can be in flight
 Store buffers and load buffers are allocated
at the allocation stage
Total 24 store buffers and 48 load buffers

Store
Store operations are divided into two parts:
Store data
Store address
Store data is dispatched to the fast ALU,
which operates twice per cycle
Store address is dispatched to the store
AGU, once per cycle

Store-to-Load Forwarding
Forward data from pending store buffer to
dependent load
 Load stalls still happen when the bytes of
the load operation are not exactly the same
as the bytes in the pending store buffer
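The forwarding rule as stated on this slide can be sketched in Python. This is a simplified model: the store buffer is a list of (address, size, data) tuples from oldest to youngest, the exact-match condition is taken literally from the slide, and try_forward is a made-up name.

```python
def try_forward(load_addr, load_size, pending_stores):
    """Decide how a load is satisfied given the pending store buffer.

    pending_stores: list of (addr, size, data), oldest first (assumed layout).
    Returns ("forward", data), ("stall", None) or ("cache", None).
    """
    for addr, size, data in reversed(pending_stores):   # youngest store first
        overlaps = load_addr < addr + size and addr < load_addr + load_size
        if not overlaps:
            continue
        if addr == load_addr and size == load_size:
            return ("forward", data)    # same bytes: forward from the buffer
        return ("stall", None)          # partial overlap: load must wait
    return ("cache", None)              # no dependence: read the L1 cache


stores = [(0x100, 4, 0xAAAA), (0x200, 8, 0xBBBB)]
exact = try_forward(0x100, 4, stores)
partial = try_forward(0x202, 2, stores)
independent = try_forward(0x300, 4, stores)
```

Searching from the youngest store backwards ensures the load sees the most recent value written to its address.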

System Bus
Delivers data at 3.2 Gbytes/s
64-bit wide bus
Four data phases per clock cycle (quad
pumped)
100 MHz clocked system bus
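The 3.2 Gbytes/s figure follows from the bus width, the clock, and the four data phases per clock. A small Python check, assuming decimal megabytes (1 Gbyte = 1000 Mbytes); the helper name is made up:

```python
def peak_bandwidth_mbytes(bus_width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bandwidth in Mbytes/s = bytes per transfer x transfers per clock x MHz."""
    return (bus_width_bits // 8) * transfers_per_clock * clock_mhz


# System bus: 64 bits wide, 100 MHz, quad pumped.
system_bus = peak_bandwidth_mbytes(64, 100, transfers_per_clock=4)
# L1-L2 bus: 256 bits wide at the 1.5 GHz core clock, one transfer per clock.
l1_l2_bus = peak_bandwidth_mbytes(256, 1500)
```

The same formula reproduces the 48 Gbytes/s quoted for the 256-bit bus between L1 and L2 at 1.5 GHz, if one transfer per core clock is assumed.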
Conclusion
Reduced Cache Size
VS
Increased Bandwidth and Lower Latency
What Went Wrong
No L3 cache
Original plans called for a 1 MB L3 cache
 Intel’s idea was to strap a separate memory
chip, perhaps an SDRAM, on the back of
the processor to act as the L3
 But that added another 100 pads to the
processor, and would have also forced Intel
to devise an expensive cartridge package to
contain the processor and cache memory

Small L1 Cache

Only 8k!
– Doubled size of L2 cache to compensate

Compare with
– AMD Athlon – 128k
– Alpha 21264 – 64k
– PIII – 32k
– Itanium – 16k
Loses consistently to AMD
In terms of performance, the Pentium 4 is as
slow as or slower than existing Pentium III and
AMD Athlon processors
 In terms of price, an entry level Pentium 4
sells for about double the cost of a similar
Pentium III or AMD Athlon based system
 1.5GHz clock rate is more hype than
substance

Northwood
1/2002
Differences from Willamette
– Socket 478
– 21 stage pipeline
– 512 KB L2 cache
– 2.0 GHz, 2.2 GHz clock frequency
– 0.13 µm fabrication process (130 nm)
» 55 million transistors
Prescott
2/2004
Differences
– 31 stage pipeline!
– 1 MB L2 cache
– 3.8 GHz clock frequency
– 0.09 µm fabrication process (90 nm)
– SSE3