Intel Pentium 4

ENCM 515 - 2002
Jonathan Bienert
Tyson Marchuk
Overview:
• Product review
• Specialized architectural features (NetBurst)
• SIMD instructional capabilities (MMX, SSE2)
• SHARC 2106x comparison
Intel Pentium 4
• Reworked micro-architecture for high-bandwidth applications
• Internet audio and streaming video, image
processing, video content creation, speech, 3D,
CAD, games, multi-media, and multi-tasking user
environments
• These are DSP intensive applications!
– What about uses other than in PC?
Hardware Features:
(NetBurst micro-architecture)
• Hyper pipelined technology
• Advanced dynamic execution
• Cache (data, L1, L2)
• Rapid ALU execution engines
• 400 MHz bus
• Out-of-order execution (OOE)
• Microcode ROM
Hyper Pipeline
• 20-stage pipeline!!!
• breaks down complex CISC instructions
– sub-stages mimic RISC
– faster execution
Filling the pipeline...
• Review of next 126 instructions to be
executed
• Branch prediction
– If mispredicted, must flush the 20-stage pipeline!!!
– Branch target buffer (BTB)
– 4K branch history table (BHT)
– Assembly instruction hints
Cache
• 8 KB Data Cache
• L1 Execution Trace Cache
– stores 12K previously decoded micro-instructions
– saves having to translate them again
• L2 Advanced Transfer Cache
– 256 KB for data
– 256-bit transfer every cycle
• allows ~77 GB/s data transfer at 2.4 GHz
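The ~77 GB/s figure follows directly from the transfer width and the clock rate quoted on this slide. A minimal sketch of that arithmetic in C; the function name `peak_bandwidth_gb_per_s` is hypothetical, and the result is a theoretical peak that assumes one full-width transfer on every clock cycle:

```c
#include <assert.h>

/* Theoretical peak bandwidth in GB/s for a path that moves bus_bits
   on every clock cycle. The name is illustrative, not an Intel API. */
double peak_bandwidth_gb_per_s(unsigned bus_bits, double clock_ghz)
{
    assert(bus_bits % 8 == 0);                 /* whole bytes only */
    double bytes_per_cycle = bus_bits / 8.0;   /* 256 bits -> 32 bytes */
    return bytes_per_cycle * clock_ghz;        /* bytes/cycle * 1e9 cycles/s */
}
```

Calling `peak_bandwidth_gb_per_s(256, 2.4)` gives 32 bytes × 2.4 GHz = 76.8 GB/s, which the slide rounds to 77 GB/s.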
Rapid ALU Execution Engines
• 2 ALUs
– allow parallel operations
• Many arithmetic operations take 1/2 cycle
– each 2X ALU can have 2 operations per cycle
Software Features:
• Multimedia Extensions (MMX)
– 8 MMX registers
• Streaming SIMD Extensions (SSE2)
– 8 SSE/SSE2 registers
• Standard x86 Registers
– EAX, EBX, ECX, EDX, ESI, etc.
– Register rename to over 100
MMX (Multimedia Extensions)
• Accelerated performance through SIMD
• multimedia, communication, internet applications
• 64-bit packed INTEGER data
– signed/unsigned
SSE2 (Streaming SIMD Extensions 2)
• Accelerate a broad range of applications
– video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
• 128-bit SIMD instruction formats
– 4 single precision FP values
– 2 double precision FP values
– 16 byte values
– 8 word values
– 4 double word values
– 2 quad word values
– 1 128-bit integer value
SIMD Example
(16-tap FIR filter - Real numbers)
• Applications for real FIR filters
• general purpose filters in image processing, audio,
and communication algorithms
• Will utilize SSE2 SIMD instruction set
Thinking about SIMD
• SSE2 instruction format is 128 bits
• 128-bit SSE2 registers
• Many data formats!
• What precision do we want?
• Let's use 32-bit floating point for coefficients, input, and output
– 4 data values × 32 bits = 128 bits
Parallelizing
• Requires many individual multiplications (coefficients × inputs), then adding the results to form the output!
• Multiplications…
• then need to perform additions...
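Before vectorizing, it may help to see the multiply-then-add structure in scalar form. The sketch below is a plain C reference for a real FIR filter, y[n] = sum over k of h[k] * x[n-k]; the function name `fir_scalar` and its argument layout are assumptions made for illustration, not code from the slides or from the Intel application note:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference FIR: for each output, multiply every coefficient by
   the matching input sample, then add the products together.
   Writes n_in - taps + 1 outputs (only positions with full history). */
void fir_scalar(const float *h, size_t taps,
                const float *x, size_t n_in, float *y)
{
    assert(taps > 0 && n_in >= taps);
    for (size_t n = taps - 1; n < n_in; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; ++k)
            acc += h[k] * x[n - k];      /* one multiply, one add per tap */
        y[n - (taps - 1)] = acc;
    }
}
```

For the 16-tap example every output costs 16 multiplies and 15 additions, which is the work the SIMD version spreads across four 32-bit lanes.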
Using SSE2 format
• Can hold 4 elements of an array (of 32-bit
data) in each 128-bit register
• 4 single precision floating point ops per
cycle (32-bit)
Additions...
• In both registers, now have 4 32-bit results
– First add the results into an accumulator register
• 4 single precision floating point ops per
cycle (32-bit)
Additions...
• In a register, now have 4 32-bit results
– however, NO SSE2 instruction to add these 4!
– But can use other instructions
• Some bit interleaving (shuffling the 32-bit elements between registers)… then add
– This will give results for several output values!
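The multiply, accumulate, and shuffle-then-add steps can be sketched with compiler intrinsics. This is a simplified illustration that computes a single output sample, not the multi-output scheme the slides attribute to the Intel application note. The function name `fir_dot_sse` is hypothetical, the window is assumed to be already arranged so that `window[k]` pairs with `coeff[k]`, and a scalar fallback is included for builds without SSE support. The packed single-precision operations used here were introduced with SSE and are available on any SSE2-capable processor.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define FIR_HAVE_SSE 1
#else
#define FIR_HAVE_SSE 0
#endif

/* One FIR output: sum of coeff[k] * window[k], k = 0 .. n_taps-1.
   Assumes n_taps is a multiple of 4 (16 in the slides' example). */
float fir_dot_sse(const float *coeff, const float *window, size_t n_taps)
{
    assert(n_taps % 4 == 0);
#if FIR_HAVE_SSE
    __m128 acc = _mm_setzero_ps();               /* 4 running partial sums */
    for (size_t k = 0; k < n_taps; k += 4) {
        __m128 c = _mm_loadu_ps(coeff + k);      /* 4 coefficients */
        __m128 x = _mm_loadu_ps(window + k);     /* 4 input samples */
        acc = _mm_add_ps(acc, _mm_mul_ps(c, x)); /* 4 multiplies, 4 adds */
    }
    /* No single SSE/SSE2 instruction adds the 4 lanes of one register,
       so rearrange the lanes and add twice. */
    __m128 swapped = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 pairs   = _mm_add_ps(acc, swapped);     /* lane0 = a0 + a1 */
    __m128 high    = _mm_movehl_ps(pairs, pairs);  /* lane0 = a2 + a3 */
    return _mm_cvtss_f32(_mm_add_ss(pairs, high)); /* a0 + a1 + a2 + a3 */
#else
    float total = 0.0f;                          /* portable fallback */
    for (size_t k = 0; k < n_taps; ++k)
        total += coeff[k] * window[k];
    return total;
#endif
}
```

Because the four lanes are summed separately and combined at the end, the floating-point result can differ in the last bits from a strictly sequential scalar sum.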
ADI SHARC 21k vs. P4
Disadvantages
• Slower clock speed (40MHz vs 2400MHz)
• Fewer opportunities for parallelism (5 vs 11)
• Much less memory (Cache and System)
– Limited algorithm applicability
– Limited applications
• Older (Less support – compiler)
– 1994 vs 2001
ADI SHARC 21k vs. P4
Advantages
• Hardware loops
• Easier to program for optimal speed
• Cheaper
• Lower power consumption
• Runs cooler
FIR Performance
• Hard to obtain P4 performance numbers
• Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full.
– 2 × 2.4 GHz ≈ 4.8 billion multiplies per second
– If ~4 multiplies per element & 44000 samples/s
– FIR length > ~25k taps
• SHARC => ~200 taps (Lab 4)
• Factor of ~125x
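The tap estimate can be reproduced with a few lines of arithmetic. A minimal sketch, assuming the slide's figures (2 multiplies per clock, 2.4 GHz, ~4 multiplies per element, 44000 samples/s) and a pipeline that never stalls; the function name `max_fir_taps` is hypothetical:

```c
#include <assert.h>

/* Longest FIR filter (in taps) that keeps up with the sample rate,
   given a multiply throughput and a per-tap multiply cost. */
double max_fir_taps(double multiplies_per_s,
                    double multiplies_per_tap,
                    double sample_rate_hz)
{
    assert(multiplies_per_tap > 0.0 && sample_rate_hz > 0.0);
    return multiplies_per_s / (multiplies_per_tap * sample_rate_hz);
}
```

With `max_fir_taps(2.0 * 2.4e9, 4.0, 44000.0)` the result is about 27,273 taps, which the slide rounds down to ~25k. Dividing 25,000 by the SHARC's ~200 taps gives the quoted factor of ~125.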
IIR Performance
• Hard to obtain P4 performance numbers
• No hardware circular buffers
• Does have BTB, BHT, etc.
• Prefetches ~256 bytes ahead of current position in code.
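Without hardware circular buffers, the wrap-around of a filter's delay line has to be done in software. One common way is sketched below, assuming a power-of-two buffer length so the wrap is a single AND; the names `delay_line`, `delay_push`, `delay_get`, and the length `DELAY_LEN` are all hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_LEN 8u   /* assumed length; must be a power of two */

typedef struct {
    float  data[DELAY_LEN];
    size_t head;           /* index of the most recent sample */
} delay_line;

/* Store a new sample, wrapping the index with a mask instead of the
   automatic modulo addressing a DSP would provide in hardware. */
void delay_push(delay_line *d, float sample)
{
    d->head = (d->head + 1u) & (DELAY_LEN - 1u);
    d->data[d->head] = sample;
}

/* Read the sample pushed `age` steps ago (0 = most recent). */
float delay_get(const delay_line *d, size_t age)
{
    assert(age < DELAY_LEN);
    return d->data[(d->head + DELAY_LEN - age) & (DELAY_LEN - 1u)];
}
```

Each access costs an extra add and mask on a general-purpose processor, which is the overhead the slide is pointing at.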
FFT Performance
• Hard to obtain P4 performance numbers
• Prime95 uses FFTs to run the Lucas-Lehmer test for Mersenne primes
– Involves FFT, squaring and iFFT, etc.
• 256k points on P4 2.3GHz ~ 10.517ms
• Compare to SHARC 2048 point FFT ~0.37ms
• If SHARC could do 256k, 46.25ms (But…)
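The 46.25 ms figure appears to come from scaling the SHARC's 2048-point time linearly with the number of points, taking 256k as 256,000. A small sketch of that extrapolation; the function name `scale_linear_ms` is hypothetical and the linear model is an assumption inferred from the numbers:

```c
#include <assert.h>

/* Extrapolate a measured time to a larger transform size, assuming
   the time grows in direct proportion to the number of points. */
double scale_linear_ms(double base_ms, double base_points,
                       double target_points)
{
    assert(base_points > 0.0);
    return base_ms * (target_points / base_points);
}
```

`scale_linear_ms(0.37, 2048.0, 256000.0)` gives 0.37 ms × 125 = 46.25 ms, roughly 4.4 times the P4 figure. Since FFT cost grows as N log N rather than N, a linear extrapolation is optimistic for the larger transform.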
Optimization Example
• Hard to optimize Pentium 4 assembly
• Example of multiplying by a constant, 10
• Taken mainly from:
www.emulators.com/docs/pentium_1.htm
Multiplying by 10
• Slowest way:
– IMUL EAX, 10
• Usually optimal way (Visual C++ 6.0)
– LEA EAX, [EAX+EAX*4]
– SHL EAX, 1
– Shift – Add – Shift
– On most x86 processors takes 2 cycles
– Pentium MMX and before 3 cycles
– On Pentium 4 takes 6 cycles!
Multiplying by 10
• Optimal for Pentium 4
– LEA ECX, [EAX + EAX]
– LEA EAX, [ECX+EAX*8]
– On most x86 still takes 2 cycles
– On Pentium 4 takes ~ 3 cycles (OOE - Ops)
– But on older processors Pentium MMX and before this now takes 4 cycles!
Multiplying by 10
• Best generic case
– LEA EAX, [EAX + EAX*4]
– ADD EAX, EAX
– On most x86 still takes 2 cycles
– On older processors Pentium MMX and before this now takes 3 cycles again
– On Pentium 4 this takes 4 cycles
• Obviously really hard to optimize
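All three instruction sequences compute the same product and differ only in how 10x is decomposed. The C sketch below mirrors each decomposition so the equivalence can be checked. The function names are hypothetical, and the code says nothing about cycle counts, which depend on the processor as listed above:

```c
#include <assert.h>
#include <stdint.h>

/* LEA EAX,[EAX+EAX*4] ; SHL EAX,1   ->  (x + 4x) << 1 */
uint32_t mul10_lea_shl(uint32_t x)
{
    uint32_t t = x + x * 4u;
    return t << 1;
}

/* LEA ECX,[EAX+EAX] ; LEA EAX,[ECX+EAX*8]   ->  2x + 8x */
uint32_t mul10_lea_lea(uint32_t x)
{
    uint32_t c = x + x;
    return c + x * 8u;
}

/* LEA EAX,[EAX+EAX*4] ; ADD EAX,EAX   ->  5x + 5x */
uint32_t mul10_lea_add(uint32_t x)
{
    uint32_t t = x + x * 4u;
    return t + t;
}

/* Check all three against a plain multiply (the IMUL EAX,10 case).
   Unsigned 32-bit arithmetic wraps the same way in every variant. */
void check_mul10(uint32_t x)
{
    assert(mul10_lea_shl(x) == x * 10u);
    assert(mul10_lea_lea(x) == x * 10u);
    assert(mul10_lea_add(x) == x * 10u);
}
```

In practice a compiler picks one of these forms based on its target processor setting, which is why the "usually optimal" choice of one compiler generation can be the slow path on a newer core.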
REFERENCES
• Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions
• graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html