Intel Pentium 4 Processor
Presented by
Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)
Outline
Introduction (Zhijian)
– Willamette (11/2000)
Instruction Set Architecture (Zhijian)
Instruction Stream (Steve)
Data Stream (Zhijian)
What went wrong (Steve)
Pentium 4 revisions
– Northwood (1/2002)
– Xeon (Prestonia, ~2002)
– Prescott (2/2004)
Dual Core
– Smithfield
Introduction
Intel Pentium 4 processor
– Latest IA-32 processor equipped with a full set
of IA-32 SIMD operations
First implementation of a new microarchitecture called “NetBurst” by Intel
(11/2000)
IA-32
Intel architecture 32-bit (IA-32)
– 80386 instruction set (1985)
– CISC, 32-bit addresses
“Flat” memory model
Registers
– Eight 32-bit registers
– Eight FP stack registers
– 6 segment registers
IA-32 (cont’d)
Addressing modes
– Register indirect (mem[reg])
– Base + displacement (mem[reg + const])
– Base + scaled index (mem[reg + (2^scale x index)])
– Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])
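The most general addressing mode above can be written out as a small calculation. This is a minimal sketch, not hardware-accurate: the function name and argument names are illustrative, and `scale` is taken to be the exponent from the slide (0 to 3, giving a factor of 1, 2, 4 or 8).

```python
MASK32 = 0xFFFFFFFF  # IA-32 effective addresses are 32 bits wide

def effective_address(base, index=0, scale=0, disp=0):
    """Return base + (2**scale * index) + disp, wrapped to 32 bits.

    scale is the exponent (0..3), so the index is multiplied by 1, 2, 4 or 8.
    """
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale exponent must be 0, 1, 2 or 3")
    return (base + (index << scale) + disp) & MASK32

# Element 5 of an array of 4-byte values at 0x1000, skipping an 8-byte header
print(hex(effective_address(0x1000, index=5, scale=2, disp=8)))  # 0x101c
```

Leaving `index`, `scale` and `disp` at their defaults reproduces the simpler modes (register indirect, base + displacement).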
SIMD instruction sets
– MMX (Pentium II)
» Eight 64-bit MMX registers, integer ops only
– SSE (Streaming SIMD Extension, Pentium III)
» Eight 128-bit registers
Pentium III vs. Pentium 4 Pipeline
Comparison Between Pentium3 and Pentium4
Execution on MPEG4 Benchmarks @ 1 GHz
Instruction Set Architecture
Pentium4 ISA =
Pentium3 ISA +
SSE2 (Streaming SIMD Extensions 2)
SSE2 is an architectural enhancement to
the IA-32 architecture
SSE2
Extends MMX and the SSE extensions with
144 new instructions:
128-bit SIMD integer arithmetic operations
128-bit SIMD double precision floating
point operations
Enhanced cache and memory management
operations
Comparison Between SSE and SSE2
Both support operations on 128-bit XMM registers
SSE only supports 4 packed single-precision
floating-point values
SSE2 supports more:
2 packed double-precision floating-point values
16 packed byte integers
8 packed word integers
4 packed doubleword integers
2 packed quadword integers
Packing
128 bits (word = 2 bytes)
– Double quadword: 1 x 128 bit
– Quad word: 2 x 64 bit
– Double word: 4 x 32 bit
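The packed data types can be illustrated by reinterpreting the same 16 bytes at different lane widths. This is only a sketch in Python's `struct` module: the variable `xmm` is a stand-in for register contents, and little-endian byte order is assumed, as on IA-32.

```python
import struct

# 16 bytes stand in for the contents of one 128-bit XMM register.
xmm = bytes(range(16))

# The same bits viewed as each packed type ("<" = little-endian, standard sizes).
views = {
    "packed bytes": struct.unpack("<16B", xmm),
    "packed words": struct.unpack("<8H", xmm),
    "packed doublewords": struct.unpack("<4I", xmm),
    "packed quadwords": struct.unpack("<2Q", xmm),
    "packed single-precision": struct.unpack("<4f", xmm),
    "packed double-precision": struct.unpack("<2d", xmm),
}

for name, lanes in views.items():
    print(f"{name}: {len(lanes)} lanes of {128 // len(lanes)} bits")
```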
Hardware Support for SSE2
Adder and multiplier units in the SSE2
engine are 128 bits wide, twice the width of
those in Pentium3
Increased bandwidth in load/store for
floating-point values
Loads and stores are 128 bits wide
One load plus one store can be completed
between XMM register and L1 cache in one
clock cycle
SSE2 Instructions (1)
Data movements
Move data between XMM registers and between
XMM registers and memory
Double precision floating-point operations
Arithmetic instructions on both scalar and
packed values
Logical Instructions
Perform logical operations on packed double
precision floating-point values
SSE2 Instructions (2)
Compare instructions
Compare packed and scalar double precision
floating-point values
Shuffle and unpack instructions
Shuffle or interleave double-precision floating-point
values in packed double-precision floating-point operands
Conversion Instructions
Conversion between double word and double-precision
floating-point or between single-precision and
double-precision floating-point values
SSE2 Instructions (3)
Packed single-precision floating-point instructions
Convert between single-precision floating-point
and double word integer operands
128-bit SIMD integer instructions
Operations on integers contained in XMM
registers
Cacheability Control and Instruction Ordering
More operations for caching of data when storing
from XMM registers to memory and additional
control of instruction ordering on store operations
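As an illustration of the 128-bit SIMD integer instructions listed above, the sketch below models a packed add on eight 16-bit lanes with wraparound on overflow. It is a software model only; the function name is made up for this example, and the real instructions operate on XMM registers rather than Python `bytes`.

```python
import struct

def packed_add_words(a: bytes, b: bytes) -> bytes:
    """Add eight 16-bit lanes independently; each lane wraps around on overflow."""
    lanes_a = struct.unpack("<8H", a)
    lanes_b = struct.unpack("<8H", b)
    sums = [(x + y) & 0xFFFF for x, y in zip(lanes_a, lanes_b)]
    return struct.pack("<8H", *sums)

a = struct.pack("<8H", 1, 2, 3, 4, 5, 6, 7, 0xFFFF)
b = struct.pack("<8H", 10, 20, 30, 40, 50, 60, 70, 1)
print(struct.unpack("<8H", packed_add_words(a, b)))
# (11, 22, 33, 44, 55, 66, 77, 0)
```

The last lane shows that a carry out of one lane does not spill into its neighbour, which is the point of packed arithmetic.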
Conclusion
Pentium4 is equipped with the full set of
IA-32 SIMD technology. All existing
software can run correctly on it.
AMD has decided to embrace and
implement SSE and SSE2 in future CPUs
Instruction Stream
What’s new?
– Added Trace Cache
– Improved branch predictor
Terminology
– μop – Micro-op, an already decoded RISC-like
instruction
– Front end – instruction fetch and issue
Front End
Prefetches instructions that are likely to be
executed
Fetches instructions that haven’t been
prefetched
Decodes instructions into μops
Generates μops for complex instructions or
special purpose code
Predicts branches
Prefetch
Three methods of prefetching:
Instructions only – Hardware
Data only – Software
Code or data – Hardware
Decoder
Single decoder that can operate at a
maximum of 1 instruction per cycle
Receives instructions from L2 cache 64 bits
at a time
Some complex instructions must enlist the
help of the microcode ROM
Trace Cache
Primary instruction cache in NetBurst
architecture
Stores decoded μops
~12K μops capacity
On a Trace Cache miss, instructions are
fetched and decoded from the L2 cache
What is a Trace Cache?
I1 …
I2 br r2, L1
I3 …
I4 …
I5 …
L1: I6
I7 …
Traditional instruction cache (static program order)
I1
I2
I3
I4
Trace cache (dynamic execution order)
I1
I2
I6
I7
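The difference between the two caches can be sketched in a few lines. This is a toy model using the seven-instruction example from the slide: the names `PROGRAM`, `BRANCH_TARGETS` and `build_trace` are illustrative, the branch at I2 is assumed to be taken, and real trace construction works on decoded μops rather than instruction labels.

```python
# Static program layout from the slide: I2 branches to label L1 (instruction I6).
PROGRAM = ["I1", "I2", "I3", "I4", "I5", "I6", "I7"]
BRANCH_TARGETS = {"I2": "I6"}  # I2: br r2, L1

def build_trace(program, branch_targets, taken, max_len):
    """Follow the executed path and record instructions in dynamic order."""
    trace = []
    pc = 0
    while pc < len(program) and len(trace) < max_len:
        instr = program[pc]
        trace.append(instr)
        if instr in branch_targets and instr in taken:
            pc = program.index(branch_targets[instr])  # follow the taken branch
        else:
            pc += 1  # fall through to the next instruction
    return trace

print(PROGRAM[:4])                                                  # ['I1', 'I2', 'I3', 'I4']
print(build_trace(PROGRAM, BRANCH_TARGETS, taken={"I2"}, max_len=4))  # ['I1', 'I2', 'I6', 'I7']
```

A conventional cache line holds the first result (I3 and I4 are fetched but not executed), while a trace line holds the second.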
Pentium 4 Trace Cache
Has its own branch predictor that directs
where instruction fetching needs to go next
in the Trace Cache
Removes
– Decoding costs on frequently decoded
instructions
– Extra latency to decode instructions upon
branch mispredictions
Microcode ROM
Used for complex IA-32 instructions (> 4
μops), such as string move, and for fault
and interrupt handling
When a complex instruction is encountered,
the Trace Cache jumps into the microcode
ROM which then issues the μops
After the microcode ROM finishes, the
front end of the machine resumes fetching
μops from the Trace Cache
Branch Prediction
Predicts ALL near branches
– Includes conditional branches, unconditional
calls and returns, and indirect branches
Does not predict far transfers
– Includes far calls, irets, and software interrupts
Branch Prediction (cont’d)
Dynamically predict the direction and target
of branches based on PC using BTB
If no dynamic prediction is available,
statically predict
– Taken for backwards looping branches
– Not taken for forward branches
Traces are built across predicted branches to
avoid branch penalties
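The fallback from dynamic to static prediction described above can be sketched as follows. This is a simplified model: the dictionary `btb` stands in for the dynamic predictor's state, and the function names are hypothetical.

```python
def static_predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Backward branch (target at or before the branch): taken. Forward: not taken."""
    return target_pc <= branch_pc

def predict_taken(btb: dict, branch_pc: int, target_pc: int) -> bool:
    """Use the dynamic prediction if one exists, otherwise fall back to the static rule."""
    if branch_pc in btb:
        return btb[branch_pc]
    return static_predict_taken(branch_pc, target_pc)

btb = {0x400: False}  # one branch with recorded history: predicted not taken
print(predict_taken(btb, 0x400, 0x3F0))  # False (dynamic prediction wins)
print(predict_taken(btb, 0x500, 0x4E0))  # True  (no history, backward loop branch)
print(predict_taken(btb, 0x500, 0x520))  # False (no history, forward branch)
```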
Branch Target Buffer
Uses a branch history table and a branch
target buffer to predict
Updating occurs when branch is retired
Return Address Stack
16 entries
Predicts return addresses for procedure calls
Allows branches and their targets to coexist
in a single cache line
– Increases parallelism since decode bandwidth is
not wasted
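A 16-entry return address stack can be modelled as a bounded stack. This is a sketch only: the class name is illustrative, and the overflow behaviour (discarding the oldest entry) is an assumption, since the slides do not say what the hardware does when more than 16 calls are outstanding.

```python
from collections import deque

class ReturnAddressStack:
    def __init__(self, entries=16):
        # Assumption: a bounded deque, so a 17th push silently drops the oldest entry.
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """Record the address of the instruction following a call."""
        self._stack.append(return_address)

    def predict_return(self):
        """Pop the predicted return address, or None if nothing is recorded."""
        return self._stack.pop() if self._stack else None

ras = ReturnAddressStack()
ras.on_call(0x1004)   # outer call
ras.on_call(0x2008)   # nested call
print(hex(ras.predict_return()))  # 0x2008 (innermost return first)
print(hex(ras.predict_return()))  # 0x1004
```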
Branch Hints
P4 permits software to provide hints to the
branch prediction and trace formation
hardware to enhance performance
Take the forms of prefixes to conditional
branch instructions
Used only at trace build time and have no
effect on already built traces
Out-of-Order Execution
Designed to optimize performance by
handling the most common operations in
the most common context as fast as possible
126 μops can be in flight at once
– Up to 48 loads / 24 stores
Issue
Instructions are fetched and decoded by
translation engine
Translation engine builds instructions into
sequences of μops
Stores μops to trace cache
Trace cache can issue 3 μops per cycle
Execution
Can dispatch up to 6 μops per cycle
Exceeds trace cache and retirement μop
bandwidth
– Allows for greater flexibility in issuing μops to
different execution units
Execution Units
Double-pumped ALUs
ALU executes an operation on both rising
and falling edges of clock cycle
Retirement
Can retire 3 μops per cycle
Precise exceptions
Reorder buffer to organize completed μops
Also keeps track of branches and sends
updated branch information to the BTB
Execution Pipeline
Data Stream of Pentium 4 Processor
Register Renaming
Register Renaming (2)
8-entry architectural register file
128-entry physical register file
2 RATs (register alias tables)
Frontend RAT and Retirement RAT
Data does not need to be copied between
register files when the instruction retires
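The two-table scheme above can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, the free-list policy is invented for illustration, and branch recovery is omitted. The register names are the eight IA-32 general-purpose registers.

```python
ARCH_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
NUM_PHYS = 128  # size of the physical register file

class Renamer:
    def __init__(self):
        # Initially architectural register i lives in physical register i.
        self.frontend_rat = {r: i for i, r in enumerate(ARCH_REGS)}
        self.retirement_rat = dict(self.frontend_rat)
        self.free_list = list(range(len(ARCH_REGS), NUM_PHYS))

    def rename(self, dest, sources):
        """Map sources through the frontend RAT, then give dest a fresh physical register."""
        phys_sources = [self.frontend_rat[s] for s in sources]
        phys_dest = self.free_list.pop(0)
        self.frontend_rat[dest] = phys_dest
        return phys_dest, phys_sources

    def retire(self, dest, phys_dest):
        """Retirement only repoints the retirement RAT; no register data is copied."""
        old = self.retirement_rat[dest]
        self.retirement_rat[dest] = phys_dest
        self.free_list.append(old)  # the previous committed copy can be reused

r = Renamer()
p1, s1 = r.rename("eax", ["ebx", "ecx"])  # eax <- ebx op ecx
p2, s2 = r.rename("eax", ["eax", "edx"])  # eax <- eax op edx, reads the first result
print(p1, s1)  # 8 [3, 1]
print(p2, s2)  # 9 [8, 2]
```

Both writes to `eax` get distinct physical registers, so the second instruction depends only on the true data flow and not on the register name.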
On-chip Caches
L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
Parameters:
None of the caches is inclusive, and all use a
pseudo-LRU replacement algorithm
L1 Instruction Cache
Execution Trace Cache stores decoded
instructions
Remove decoder latency from main
execution loops
Integrate path of program execution flow
into a single line
L1 Data Cache
Nonblocking
Support up to 4 outstanding load misses
Load latency
2-clock for integer
6-clock for floating-point
1 Load and 1 Store per clock
Speculative Load
Assume the access will hit the cache
“Replay” the dependent instructions when a miss
happens
L2 Cache
Load latency
Net load access latency of 7 cycles
Nonblocking
Bandwidth
One load and one store in one cycle
A new cache operation can begin every 2 cycles
256-bit wide bus between L1 and L2
48Gbytes per second @ 1.5GHz
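The 48 Gbytes per second figure follows from the bus width and the clock rate. The check below assumes one full 256-bit transfer per core clock, which is the assumption that reproduces the number on the slide.

```python
bus_width_bits = 256
core_clock_hz = 1_500_000_000  # 1.5 GHz

bytes_per_transfer = bus_width_bits // 8            # 32 bytes
l1_l2_bandwidth = bytes_per_transfer * core_clock_hz

print(l1_l2_bandwidth / 1e9)  # 48.0 (Gbytes per second)
```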
Data Prefetcher in L2 Cache
Hardware prefetcher monitors the reference
patterns
Brings in cache lines automatically
Attempt to stay 256 bytes ahead of current
data access location
Prefetch for up to 8 simultaneous
independent streams
Store and Load
Out of order store and load operations
Stores are always in program order
48 loads and 24 stores can be in flight
Store buffers and load buffers are allocated
at the allocation stage
Total 24 store buffers and 48 load buffers
Store
Store operations are divided into two parts:
Store data
Store address
Store data is dispatched to the fast ALU,
which operates twice per cycle
Store address is dispatched to the store
AGU, once per cycle
Store-to-Load Forwarding
Forward data from pending store buffer to
dependent load
Load stalls still happen when the bytes of
the load operation are not exactly the same
as the bytes in the pending store buffer
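The forwarding rule stated above can be sketched as a lookup in the pending store buffer. This is a deliberately simplified model that follows the slide's wording (forward only on an exact byte match, stall on any other overlap); the names `PendingStore` and `load` are hypothetical, and the actual hardware conditions may differ.

```python
from dataclasses import dataclass

@dataclass
class PendingStore:
    address: int
    size: int
    data: bytes

def load(store_buffer, address, size):
    """Return ('forwarded', data), ('stall', None) or ('cache', None)."""
    # Search the youngest pending store first.
    for store in reversed(store_buffer):
        store_end = store.address + store.size
        overlaps = address < store_end and store.address < address + size
        if not overlaps:
            continue
        if store.address == address and store.size == size:
            return "forwarded", store.data  # bytes match exactly
        return "stall", None                # partial overlap: wait for the store
    return "cache", None                    # no pending store touches these bytes

buffer = [PendingStore(0x100, 4, b"\x01\x02\x03\x04")]
print(load(buffer, 0x100, 4))  # ('forwarded', b'\x01\x02\x03\x04')
print(load(buffer, 0x102, 4))  # ('stall', None)
print(load(buffer, 0x200, 4))  # ('cache', None)
```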
System Bus
Delivers data at 3.2 Gbytes/s
64-bit wide bus
Four data phases per clock cycle (quad
pumped)
100 MHz clocked system bus
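The system bus figure can be checked from the three parameters on this slide: a 64-bit bus, four data phases per clock, and a 100 MHz bus clock.

```python
bus_width_bytes = 64 // 8        # 64-bit wide bus
data_phases_per_clock = 4        # quad pumped
bus_clock_hz = 100_000_000       # 100 MHz

bus_bandwidth = bus_width_bytes * data_phases_per_clock * bus_clock_hz
print(bus_bandwidth / 1e9)  # 3.2 (Gbytes per second)
```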
Conclusion
Reduced Cache Size
VS
Increased Bandwidth and Lower Latency
What Went Wrong
No L3 cache
Original plans called for a 1 MB L3 cache
Intel’s idea was to strap a separate memory
chip, perhaps an SDRAM, on the back of
the processor to act as the L3
But that added another 100 pads to the
processor, and would have also forced Intel
to devise an expensive cartridge package to
contain the processor and cache memory
Small L1 Cache
Only 8k!
– Doubled size of L2 cache to compensate
Compare with
– AMD Athlon – 128k
– Alpha 21264 – 64k
– PIII – 32k
– Itanium – 16k
Loses consistently to AMD
In terms of performance, the Pentium 4 is as
slow as or slower than existing Pentium III and
AMD Athlon processors
In terms of price, an entry level Pentium 4
sells for about double the cost of a similar
Pentium III or AMD Athlon based system
1.5GHz clock rate is more hype than
substance
Northwood
1/2002
Differences from Willamette
– Socket 478
– 21 stage pipeline
– 512 KB L2 cache
– 2.0 GHz, 2.2 GHz clock frequency
– 0.13 μm fabrication process (130 nm)
» 55 million transistors
Prescott
2/2004
Differences
– 31 stage pipeline!
– 1 MB L2 cache
– 3.8 GHz clock frequency
– 0.09 μm fabrication process (90 nm)
– SSE3