CPU - Nanjing University


Lecture 6
Processor Technology
1
Advance in Hardware
 INTEL Family: (8086/1978 -- Pentium II/1998)
 exponential performance improvement over time
• number of transistors: increased about 260 times (29 K --> 7.5 M)
• clock rate: 45 times (10 MHz -> 450 MHz)
2
Moore’s Law (1965)
 The number of transistors on a microchip
doubles about every 18-24 months,

assuming the price of the chip stays the same
 The speed of a microprocessor doubles about
every 18-24 months,

assuming price stays the same
 The price of a microchip drops about 48% every
18-24 months,

assuming the performance metric (processor speed
or memory capacity) of the chip stays the same.
3
Milestones of Chip Density
[Figure: Number of Transistors per Chip (log scale, 1K to 1G) against Year (1970 to 2002). ▲ = Memory (DRAM), rising from 1K to 1G. • = Microprocessor and Logic: 4004, 8080, 8085, 8086, 68000, 80286, 68020, 80386, 80486, Pentium, Pentium Pro (MPU only; MPU and cache), and P7, plus IBM and LSI Logic gate arrays. Source: ICE.]
Memory Increase = 1.5/year
MPU Increase = 1.35/year
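The two yearly growth factors on this chart can be turned into doubling times for comparison with Moore's Law. This is a minimal sketch; the function name `doubling_time_years` is illustrative and not part of the lecture.

```python
import math

def doubling_time_years(annual_factor: float) -> float:
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(annual_factor)

memory_years = doubling_time_years(1.5)   # DRAM density: 1.5x per year
mpu_years = doubling_time_years(1.35)     # microprocessor density: 1.35x per year

print(f"memory doubles every {memory_years * 12:.1f} months")
print(f"MPU doubles every {mpu_years * 12:.1f} months")
```

Under these factors memory density doubles in well under two years, while microprocessor density takes somewhat more than two.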
4
Outline
 Instruction Set Architecture (ISA)
 Pipelining Concepts
 Processor Technology
    CISC, RISC, superscalar, VLIW
 Case Study
 Future processors
5
Part 1:
Instruction Set Architecture
(ISA)
6
Computer Architecture’s
Changing Definition
 1950s to 1960s: Computer Arithmetic
 1970s to mid 1980s: Instruction Set Design,
especially ISA appropriate for compilers
 1990s: Design of CPU, memory system, I/O
system, Multiprocessors
7
Instruction Set
Architecture (ISA)
[Figure: the instruction set is the interface layer between software (above) and hardware (below).]
8
Interface Design
A good interface:
• Lasts through many implementations (portability, compatibility)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
[Figure: one interface, with several uses above it and implementations imp 1, imp 2, imp 3 succeeding one another below it over time.]
9
What Operations Should be in
Instruction Set?
 How many are possible ?
 Which ones do we need ?
 Circuit complexity ?
 How frequently is each used ?
 How much slower would each be, if
implemented in terms of simpler ones ?
10
Typically include:
 ALU (25-40 % frequency of use)
 Data transfer (~15-40 %)
 Control flow (~15-25 %)
 System (~ 2%)
 Floating point (≤ ~15 %)
 Decimal and string (≤ ~15 %)
11
4 types of control-flow operations:
 Conditional branches: ~73 %
 Unconditional branches (jumps): ~14 %
 Procedure calls and returns (counted together): ~13 %
12
Data types and sizes ?
 How many are possible ?
 Which ones do we need ?
 How frequently are they used ?
 How much slower if implemented in
software ?
13
Data Types
 Integer: short, long, extra long.
 floating-point: single-, double-, quad-
precision.
 characters: char, strings.
 bit fields.
 binary coded decimal.
14
Other Issues
 What are the most common accesses
(profile) ?
 What should the instruction format be ?
15
Conflicting goals:
 code compactness
    fewer instructions in the program (at machine level, after compilation)
    less memory, less I-fetch bandwidth
 easy decoding
    want fixed format
    less expensive and faster I-decode
16
Ways to get code compactness:
(An ideal case)
 Huffman encoding, e.g.:
    50 % 'A' -- '0'
    25 % 'B' -- '10'
    12.5 % 'C' -- '110'
    12.5 % 'D' -- '111'
 Variable length according to frequency
 Easy to implement ? Cost ?
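The Huffman example can be reproduced in a few lines. This is a minimal sketch that uses the four symbol frequencies from the slide; the helper name `huffman_code` is illustrative, and real instruction encodings are not built this way directly.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix code from {symbol: probability} with Huffman's algorithm."""
    tiebreak = count()  # unique counter so equal weights never compare the dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, low = heapq.heappop(heap)    # two least frequent subtrees
        w1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]

freqs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(freqs)
avg_bits = sum(freqs[s] * len(code[s]) for s in freqs)
print(code, avg_bits)
```

The code lengths come out as 1, 2, 3, and 3 bits, for an average of 1.75 bits per symbol against 2 bits for a fixed-length encoding of four symbols. The price is variable-length decoding, which is the conflict with easy decoding noted earlier.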
17
Evolution of Instruction Sets
 Design decisions must take into account:
 technology
 machine organization
 programming languages
 compiler technology
 operating systems
 And they in turn influence these
18
Aspects of CPU
Performance
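$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

What affects each factor:

             | Inst Count | CPI | Clock Rate
Program      | X          |     |
Compiler     | X          | (X) |
Inst. Set.   | X          | X   |
Organization |            | X   | X
Technology   |            |     | X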
19
Cycles Per Instruction (CPI)
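“Average Cycles per Instruction”

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

“Instruction Frequency”

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}}$$

Invest Resources where time is Spent!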
20
Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix

Op     | Freq | Cycles | CPI(i) | (% Time)
ALU    | 50%  | 1      | .5     | (33%)
Load   | 20%  | 2      | .4     | (27%)
Store  | 10%  | 2      | .2     | (13%)
Branch | 20%  | 2      | .4     | (27%)
Total  |      |        | 1.5    |
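The CPI table can be recomputed directly from the frequency-weighted formula. This is a minimal sketch; the dictionary layout and the function names `average_cpi` and `time_share` are illustrative choices.

```python
# Instruction mix of the base machine: op -> (frequency, cycles)
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    cpi = average_cpi(mix)
    return {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

cpi = average_cpi(mix)
share = time_share(mix)
for op, fraction in share.items():
    print(f"{op:6s} {fraction:.0%}")
print(f"CPI = {cpi:.1f}")
```

ALU instructions are half of the mix but only a third of the time, which is why the slide says to invest resources where time is spent rather than where instructions are most frequent.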
21
Part 2: Pipelining
22
Pipelining: It’s Natural!
 Laundry Example
 Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
 Washer takes 30 minutes
 Dryer takes 40 minutes
 “Folder” takes 20 minutes
23
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D in task order, one after another, each taking 30 + 40 + 20 minutes.]
 Sequential laundry takes 6 hours for 4 loads
 If they learned pipelining, how long would laundry take?
24
Pipelined Laundry: (Start work ASAP)
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D in task order, overlapped; the stage times along the time axis read 30, 40, 40, 40, 40, 20 minutes.]
 Pipelined laundry takes 3.5 hours for 4 loads
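Both laundry totals can be checked with a small timing model. This is a minimal sketch, assuming each load moves to the next machine as soon as both the load and the machine are free, and that a finished load can wait between machines; the function names are illustrative.

```python
def sequential_minutes(stage_times, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_times)

def pipelined_minutes(stage_times, loads):
    """Finish time when loads overlap, one load per machine at a time."""
    free_at = [0] * len(stage_times)  # time at which each machine is next free
    done = 0
    for _ in range(loads):
        t = 0  # every load is available at the start (6 PM)
        for stage, duration in enumerate(stage_times):
            start = max(t, free_at[stage])  # wait for the machine if it is busy
            t = start + duration
            free_at[stage] = t
        done = t
    return done

stages = [30, 40, 20]  # washer, dryer, "folder"
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 minutes: after the first wash, the 40-minute dryer sets the pace for every load.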
25
Computer Pipelining
 Overlapping the execution of instructions.
    Instruction fetch (IF)
    Instruction decode (ID)
    Execute (EX)
    Write back (WB)
 Some operation (e.g., IF, ID, EX) is performed on every instruction in the pipeline.
26
Computer Pipelining
 Pipelining increases the throughput
    Throughput = no. of instructions executed in a given time period
 Hence, it reduces the average execution time per instruction (or CPI).
27
Pipelining Speedup:
 k-stage linear pipeline, n tasks.
    Pipelined = k + (n - 1) cycles.
    Unpipelined = n x k cycles.
    See the laundry example.
 Speedup: $S_k = \frac{nk}{k + n - 1}$
 $S_k \to k$ as $n \to \infty$.
28
Speedup (explanation)
 At time t=0, the first pipeline operation
enters the pipe.
 After k pipeline clock cycles, the 1st result
exits.
 Then, 1 result exits per clock cycle.
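The speedup formula is easy to evaluate for a few task counts. This is a minimal sketch, assuming k equal-length stages and no stalls; the function names are illustrative.

```python
def pipelined_cycles(k: int, n: int) -> int:
    """k cycles until the first result, then one result per cycle."""
    return k + (n - 1)

def unpipelined_cycles(k: int, n: int) -> int:
    """Every task occupies the whole unit for k cycles."""
    return n * k

def speedup(k: int, n: int) -> float:
    return unpipelined_cycles(k, n) / pipelined_cycles(k, n)

for n in (1, 10, 100, 1000):
    print(n, round(speedup(5, n), 2))
```

With a single task there is no gain at all (speedup 1.0), and the speedup approaches k = 5 only when the number of tasks is much larger than the number of stages.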
29
Pipelining Lessons
 Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
 Pipeline rate is limited by the slowest pipeline stage
 Multiple tasks operate simultaneously
 Potential speedup = number of pipe stages
 Unbalanced lengths of pipe stages reduce speedup
[Figure: the pipelined laundry timeline, time axis marked from 6 PM to 9, loads A to D in task order with stage times 30, 40, 40, 40, 40, 20.]
30
Computer Pipelines
 Execute billions of instructions, so
throughput is what matters
 desirable features:
    all instructions same length,
    registers located in same place in instruction format,
    memory operands only in loads or stores
31
MIPS example: 5 Stage Pipelining
[Figure: datapath divided into five stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. The IR and LMD registers are labeled.]
32
Visualizing Pipelining
[Figure: instructions in program order (vertical axis) against time in clock cycles (horizontal axis).]
33
Limits to pipelining: Hazards !!
 Hazards prevent next instruction from executing during its designated clock cycle
    Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
    Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
    Control hazards: Pipelining of branches & other instructions that change the flow of control
 Common solution: stall the pipeline until the hazard is resolved, inserting “bubbles” in the pipeline
34
One Memory Port/Structural Hazards
[Figure: pipeline diagram of Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time (clock cycles); two points in the diagram are labeled "Use reg A".]
35
One Memory Port/Structural Hazards
[Figure: pipeline diagram of Load, Instr 1, Instr 2, a stall (wait for one cycle), then Instr 3, in instruction order against time (clock cycles).]
36
Data Hazard on R1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: the five instructions in instruction order against time (clock cycles), each passing through IF, ID/RF, EX, MEM, WB. The last four all wait for the result of r1.]
37
3 Generic “Data Hazards”
Assume InstrI followed by InstrJ
 Read After Write (RAW)
    InstrJ tries to read operand before InstrI writes it
38
Read After Write (RAW)
[Figure: stages 1-5 of Inst i and of Inst j, offset in time; Inst i's Write comes after Inst j's Read.]
Inst j reads the old data.
39
Write After Read (WAR)
 InstrJ tries to write operand before InstrI reads it
    InstrI gets the wrong operand
 Can’t happen in MIPS 5 stage pipeline because:
    All instructions take 5 stages, and
    Reads are always in stage 2, and
    Writes are always in stage 5
40
Write After Read (WAR)
[Figure: stages 1-5 of Inst i and of Inst j, offset in time; Inst i's Read (stage 2) comes before Inst j's Write (stage 5).]
Always read the correct data.
41
Write After Write (WAW)
 InstrJ tries to write operand before InstrI writes it
    Leaves wrong result ( InstrI not InstrJ )
 Can’t happen in MIPS’s 5 stage pipeline because:
    All instructions take 5 stages, and
    Writes are always in stage 5
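The three dependence types (RAW, WAR, WAW) can be told apart from register names alone. This is a minimal sketch; the `Instr` record and the function `data_hazards` are hypothetical helpers, and the result lists potential hazards only, since whether one actually causes a wrong result depends on the pipeline, as the MIPS cases show.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str     # register written
    srcs: tuple   # registers read

def data_hazards(i: Instr, j: Instr) -> set:
    """Dependences between an earlier instruction i and a later instruction j."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes
    if j.dest in i.srcs:
        found.add("WAR")  # j writes what i reads
    if i.dest == j.dest:
        found.add("WAW")  # both write the same register
    return found

add = Instr("r1", ("r2", "r3"))   # add r1,r2,r3
sub = Instr("r4", ("r1", "r3"))   # sub r4,r1,r3
print(data_hazards(add, sub))     # {'RAW'}
```

In the 5-stage MIPS pipeline only the RAW case needs attention, because reads always happen in stage 2 and writes in stage 5.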
42
Write After Write (WAW)
[Figure: stages 1-5 of Inst i and of Inst j, offset in time; both Write in stage 5, so Inst i writes before Inst j.]
43
Part 3: Processor Technology
44
Evolution of Processor
Design
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers
(Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model
from Implementation
High-level Language Based
(B5000 1963)
Concept of a Family
(IBM 360 1964)
General Purpose Register Machines
Complex Instruction Sets
(Vax, Intel 432 1977-80)
Mixed (1998)
Load/Store Architecture
(CDC 6600, Cray 1 1963-76)
RISC
(Mips,Sparc,HP-PA,IBM RS6000, . . .1987)
45
CISC: Complex Instruction
Set Computer.
 more than 300 instructions ISA
 variable instruction/data formats
 small set of 8 to 24 general-purpose registers
 allow many memory reference operations
(addressing modes)
 CPI: 1 to 20 cycles, average CPI: 4 cycles
 Examples: INTEL x86 series (Pentium, Pentium Pro,
Pentium II), Motorola M680X0, Digital VAX 8600, IBM
390, AMD 486, Cyrix 686
46
Intel CPU Family
Year  | Model                        | Features
1981  | Intel 8088                   | 16-bit, 29K, max speed 10 MHz
1982  | Intel 80286                  | 16-bit, 130K, max speed 12 MHz
1985  | Intel 80386                  | 32-bit, 275K, max speed 20 MHz
1989  | Intel 80486                  | 32-bit, 2M, 25 MHz
1993- | Intel Pentium                | 3.11M, 66 MHz, CMOS, 2-issue, 266 MHz with MMX (16/16 KB cache)
1995- | Intel Pentium Pro (Socket 8) | 5.5M, 200 MHz, 8/8 KB cache, Bi-CMOS, 64-bit data bus, 3-issue, no MMX
1997- | Intel Pentium II (Slot 1)    | 16/16 KB L1 cache, 256/512 KB L2 cache, 300 MHz, 32-bit, 3-issue, with MMX
1998  | Intel Pentium II             | 512 KB L2 (half speed), 450 MHz, 7.5 M transistors
1998  | Intel Pentium II Xeon        | 512 KB L2 (full speed), 450 MHz, 32-bit
47
RISC: Reduced Instruction Set
Computer.
 Observation:
    Only 25% of a complex instruction set is used frequently, about 95% of the time
    75% of the hardware-supported instructions are rarely used
 All instructions are of the same length
 Push rarely used inst into software
 Adding cache and Floating Point Units (FPU) in
processor chips
48
RISC: other features
 Instruction set: less than 100 instructions
 Fixed 32- or 64-bit instruction format
 Only 3 to 5 simple addressing modes
 Single address mode for load/store: base +
displacement
 no indirection
 Simple branch conditions
49
RISC
 Large register files:
    32 integer registers + 32 floating point registers, some have over 100
 executes the majority of instructions in one cycle (average CPI: 1.5)
 higher clock rate
 easy for compiler optimization
50
Examples of RISC processors
 SUN: SPARC, MicroSPARC, SuperSPARC,
UltraSPARC,
 MIPS: R2000/3000/4000/5000/8000, R10000,
 INTEL: i860,
 Digital : Alpha 21164, 21264, 21364,
 IBM, Apple, Motorola : PowerPC 601, 603, 604e,
620, 630,
 IBM : POWER2 (SP2), POWER 3 (ASCI Blue Pacific),
 HP: HP PA-RISC PA-8000
51
Example: MIPS
Register-Register: Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0] (a tick also marks the boundary between bits 6 and 5)
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch: Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call: Op [31:26] | target [25:0]
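Fixed field positions are what make decoding cheap: every field comes out with one shift and one mask. This is a minimal sketch, assuming the register-register layout Op [31:26], Rs1 [25:21], Rs2 [20:16], Rd [15:11], Opx [10:0]; the helper names `bits` and `decode_register_register` are illustrative.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract the bit field word[hi:lo], inclusive at both ends."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_register_register(word: int) -> dict:
    """Split a 32-bit register-register instruction into its fields."""
    return {
        "op":  bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2": bits(word, 20, 16),
        "rd":  bits(word, 15, 11),
        "opx": bits(word, 10, 0),
    }

# 0x00430820 is the MIPS encoding of add $1, $2, $3
fields = decode_register_register(0x00430820)
print(fields)
```

Because the register fields sit in the same place in every format, the register file can be read in parallel with decoding the opcode, before the instruction type is known.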
52
Advanced Pipelining
[Figure: four pipeline timing diagrams, successive instructions against time in cycles.]
(a) Single-issue base pipeline
(b) 3-issue superscalar pipeline
(c) Single-issue superpipeline
(d) 3-issue superscalar superpipeline
53
Superscalar Processors (RISC+CISC)
 Multiple instructions issued per cycle (IPC > 1).
 Clock rate matches that of generic scalar RISC.
 CPI is lower than generic scalar RISC.
 Examples:
 Alpha 21064 (2-issue), 21164 (4-issue),
 PowerPC: 604e (4), 620(4)
 HP PA-7200 (2), PA-8000 (4),
 MIPS: R5000 (2), R10000 (4),
 SUN: MicroSPARC (2), UltraSparc-2 (4),
 INTEL i860 (RISC, 2 issues), Pentium (CISC, 2),
Pentium Pro (3), Pentium II (3)
54
SUN SPARC (RISC)
 Scalable Processor ARChitecture, a
specification, not a chip.
 Larger register set: SPARC 128-144.
 Generations:
 SPARC (1987), MicroSPARC, SuperSPARC (1993),
UltraSPARC (1995), UltraSPARC II, UltraSPARC III, UltraSPARC IV, ...
 Machines:
 CM-5 : SPARC 33 MHz.
 CS-2 : SuperSparc 40 MHz (viking).
 SUN Sunfire (Enterprise 1000): UltraSparc
55
UltraSPARC Roadmap
 UltraSPARC II: 64-bit, 0.25 micron (same as
Pentium II, AMD K6-2), Max 360 MHz, 30 W,
(400 MHz later 1998)
 UltraSPARC III: 600 MHz late 1999, 1000 MHz
 UltraSPARC IV: mid-2000, 1000 MHz, 0.15
micron, Sun’s first copper-based chip
 UltraSPARC V: 1500 MHz, 0.07 micron, by 2002
56
PowerPC (RISC)
 1991, by Apple, IBM, and Motorola.
 OS: IBM AIX, Apple Mac OS, NetWare 4, OS/2, Sun Solaris, Windows NT, and MS-DOS.
 Technology update:
    September 1998: IBM 400-MHz PowerPC (copper wiring)
57
PowerPC Family :
 First generation :
 PowerPC 601: desktop PCs.
 The 2nd generation :
 PowerPC 603 (603e, 166 MHz, 3 W, 81 mm2):
portable+battery-powered applications.
 PowerPC 604 (604e, 5.6 M transistors, Power 10 W,
dynamic branch prediction logic, 4-issue, 6-stage):
sophisticated PCs and servers.
 PowerPC 620: integrated L2 controller and dedicated
cache interface, 4-issue, 5-stage, 30 W, used in
servers or supercomputer.
58
PowerPC 3rd generation: G3
 L2 cache support, new bus architecture
 32-bit processor, used in iMAC (1998)
 0.25-micron, 67 mm2, 250 MHz, 5 W, 6.35 M
transistors.
 6 execution units (similar to 603e, which has 5):
    1 floating point unit,
    1 branch unit,
    1 load/store unit,
    2 single-cycle integer units (603e only 1),
    1 system unit
59
PowerPC 3rd generation: G3
 4-stage pipeline:
    fetch,
    decode-dispatch,
    execute,
    complete-writeback
 fetch unit fetches 4 instructions per clock
 peak rate: complete 3 instructions per clock
 Two 32 KB on-chip L1 caches (data +
instruction) : same as 604e, 8-way set
associative
60
PowerPC 3rd generation: G3
 On-chip L2 cache: 2-way set associative of
sizes 256 KB, 512 KB or 1 MB
 Performance: 250 MHz CPU clock, 50 MHz
system bus, half-speed 1-MB L2 cache: 10
SPECint95
 Bus protocol: MEI (modified, exclusive, invalid), used for single or dual-processor design
61
IBM POWER
 Performance Optimized With Enhanced RISC
 POWER2, 66.7 MHz, 6-issue, used in IBM SP2
 4 floating-point operations per cycle.
 peak performance: 266 Mflops (66.7 x 4).
 ASCI Blue Pacific : using POWER3
 3.9 trillion calculations per second
 15,000 times faster than the desktop PC
 at Lawrence Livermore National Lab.
 2.6 trillion bytes of memory
 1 second = 63,000 years using hand calculator
 4096 POWER3 CPUs (8 CPUs per node)
62
POWER3: Superscalar RISC
 8-issue, (most other processors 4-issue)
 200 MHz (slow but fast),
 0.25 micron, 5 metal layers, 1088 pin
 IBM’s first 64-bit microprocessor.
 Used in ASCI Blue Pacific, 4096 nodes,
 Memory subsystem: 6.4 GB/s
 POWER3 workstation, Oct. 23 1998: RS/6000 43P
Model 260, single or dual-processor, up to 8 (in SMP
form), 4MB L2 instruction cache+256 MB SDRAM
memory
 Compatible with PowerPC design
63
IBM
 1999: 0.18 micron, copper wiring
 2000: silicon-on-insulator
 2001: “gigachip” (POWER4 ??)
64
MIPS:
 R4000 (1991, 64-bit), R8000 (1994),
 R10000
    1995 or 1996-, 30 W.
    5.9 M transistors, 32/32 KB cache
    5-7 pipeline stages, 4-issue
 SGI Power Challenge
    up to 18 x R8000 or 36 x R10000
65
Other Commodity Processors
 HP PA-RISC (Precision Architecture):
    PA-RISC 7200 (CONVEX Exemplar, 128 processors)
 DEC: Alpha 21064 (CRAY T3D), 21164 (300 MHz, T3E), 21264 (T3E1200)
 Intel 80860 (i860):
    “Cray on a Chip”: 66 MFLOPS (Cray 1S = 85 MFLOPS)
66
VLIW: Very Long Instruction Word.
 Uses even more functional units than a superscalar processor.
 All instructions are the same length
 The operations in each word are chosen by the compiler.
 CPI is even lower than that of superscalar.
 Clock rate is slow.
 Microprogrammed control; synchronization of parallel operations is done entirely at compile time
 No commodity processor uses a VLIW design (but it is coming back !! INTEL 64-bit Merced)
67
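[Figure (a): A VLIW processor architecture and instruction format. Main memory and a register file feed load/store, FP add, FP multiply, integer ALU, and branch units; the instruction word has one field per unit (Load/Store, FP Add, FP Multiply, Branch, ..., Integer ALU).]
[Figure (b): Pipeline execution of VLIW instructions, against time in cycles.]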
68
Summary
                       | CISC | RISC | VLIW
Instruction complexity | Varies from simple to complex | One simple operation | Many simple, independent operations
Instruction size       | Varies | One size, usually 32 bits | One size
Instruction format     | Field placement varies | Regular, consistent field placement | Regular, consistent field placement
Memory reference       | Bundled with operations | Not bundled, load/store architecture | Not bundled, load/store architecture
Hardware design focus  | Microcoded implementations; one or more pipelines | No microcode; one or more pipelines | Multiple pipelines, no microcode, no complex dispatch logic
Registers              | Few, sometimes special | Many, general purpose | Many, general purpose
69
Processor Performance
Processor     | Clock (MHz) | IPC    | Stages | SPECint95 | SPECfp95
Alpha 21164   | 500         | 4(2+2) | 7-12   | 12.6      | 18.3
PowerPC 620   | 200         | 4      | 5      | 9.0       | 9.0
PowerPC 604e  | 225         | 4      | 6      | 8.5       | 7.0
UltraSPARC II | 250         | 4      | 6-9    | 8.5       | 15
HP PA-8000    | 180         | 4      | 7-9    | 10.8      | 18.3
MIPS R10000   | 200         | 4      | 5-7    | 8.9       | 17.2
Pentium Pro   | 200         | 3(2+1) | 12-14  | 8.7       | 6.0
Note: Only 1 floating point unit active at a time
70
Case Study 1:
INTEL Processors
71
Pentium (430 TX Mother Board)
[Figure: block diagram.]
 Pentium processor and L2 cache (max 512 KB) on the Host Bus (3.3 V; 60-66 MHz; 32-bit address, 64-bit data; 500 MB/s (8x60))
 Memory controller 82439TX (MTXC), connected to Main Memory (DRAM, max 256 MB)
 PCI Bus (3.3 V or 5 V, 30/33 MHz) with PCI devices
 82371AB (PIIX4): bus master IDE (33 MB/s) for HD and CD-ROM, two USB ports
 ISA/EIO Bus (3.3 V; 5 V) with EISA/ISA devices (up to 5)
72
Pentium Pro
 cache: 8 KB data + 8 KB instruction cache (L1)
 On-board L2 cache: 256 KB or 512 KB
 40 general purpose registers
 Data TLB: 64 entries
 No MMX
 up to 200 MHz, 35W:
    integration of high-speed CPU with high-speed cache is not easy
73
Execution in Pentium Pro
 Five functional units:
    Store data unit
    Store address unit
    Load address unit
    Integer ALU unit
    Floating point/integer unit
 3-issue but only one floating point op
 Peak flop rate= 200 MFLOP at 200 MHz
74
Pentium Pro (P6)
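[Figure: on one substrate, the Pentium Pro core (8 KB L1 data cache, 8 KB L1 instruction cache, local APIC, bus interface unit) and a 256/512 KB unified L2 cache joined by the backside bus; the bus interface unit also drives the external bus. Speed labels in the figure: full-speed, half-speed.]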
75
Pentium Pro Memory Subsystem
 L1 cache (Data cache 8 KB):
    supports one load and one store per cycle (full-speed), peak bandwidth of 3.2 GB/s on a 200 MHz Pentium Pro
 L2 cache:
    runs at full CPU clock speed, can transfer 64 bits per cycle (1.6 GB/sec on a 200 MHz Pentium Pro)
 External bus: 64-bit, 64-bit per bus cycle
 SMP support
 Full cache coherence up to 4 processors
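The two cache bandwidth figures follow from clock rate times bytes moved per cycle. This is a minimal sketch; it assumes each L1 load or store moves 64 bits, which is consistent with the 3.2 GB/s figure but is not stated on the slide, and the function name is illustrative.

```python
def peak_bandwidth(clock_hz: int, bits_per_transfer: int, transfers_per_cycle: int = 1) -> int:
    """Peak bytes per second for a port that moves a fixed width every cycle."""
    return clock_hz * (bits_per_transfer // 8) * transfers_per_cycle

CLOCK_HZ = 200_000_000  # 200 MHz Pentium Pro

# L1 data cache: one load plus one store per cycle (64 bits each is an assumption)
l1_bytes_per_s = peak_bandwidth(CLOCK_HZ, 64, transfers_per_cycle=2)
# L2 cache: 64 bits per cycle at full CPU clock speed
l2_bytes_per_s = peak_bandwidth(CLOCK_HZ, 64)

print(l1_bytes_per_s / 1e9, "GB/s")  # 3.2
print(l2_bytes_per_s / 1e9, "GB/s")  # 1.6
```

These are peak rates; sustained bandwidth is lower whenever an access misses or a port sits idle.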
76
INTEL Pentium Pro SMP
 Processor Bus: 532 MB/s (66.6 MHz x 64 bits)
 Four-way interleaved DRAMs, EDO or
synchronous DRAM
 Interface to EISA or PCI
 Bus operations:
 write-back cache, MESI protocol
 pipeline depth: 8
77
[Figure: block diagram of a four-processor Pentium Pro SMP system.]
 Four P6 processors on the Pentium Pro processor bus (32-bit address, 64-bit data, 500 MB/s)
 Memory controller: DRAM controller and data path, mem data (72 bits), four MICs, interleave data (288 bits)
 PCI bridge to the PCI bus (32-bit address, 32-bit data, 132 MB/s) with PCI devices
 EISA/ISA bridge to the EISA/ISA bus with EISA/ISA devices
MIC: memory interface controller
78
Pentium II
 Larger L1 cache: 16/16 KB
 L2 cache: 512 KB unified cache thru backside bus
 Add MMX features back (like Pentium MMX)
 Slot 1 architecture (240 pins)
 Clock speed improved: 233, 266, 300,... 450 MHz.
 SMP support: up to TWO only
 Deschutes version (1998): >= 333 MHz, 100 MHz
external bus (440BX chipset), AGP, SMP support: 4
processors, Slot 2 (330 pins) ?
79
Pentium II (P6)
[Figure: on one substrate, the Pentium II core (full-speed 16 KB L1 data cache and 16 KB L1 instruction cache, local APIC, bus interface unit) and a 512 KB half-speed unified L2 cache; the bus interface unit also drives the external bus.]
80
AMD K6-2
 3DNow technology: MMX support
 4 floating point units (4-issue); Pentium II only
one floating point unit
 300 MHz AMD K6-2: 1.2 GFLOPS > Pentium II
450 MHz
 Socket 7, 100 MHz external bus, 0.25 micron
 9.3 M transistors
 K6-3 (SharkTooth): 350, 450 MHz, 256 KB on-
chip L2 cache, 21.3 M transistors
81
Intel Processors for Each Market
Segment
Segment                                             | ’97                                     | ’98 (0.25 micron P6 Microarchitecture Core)
Mid- to High-End Server/Workstation                 | Pentium® Pro Processor                  | Pentium® II Xeon™ Processor
Entry-level Server/Workstation; Performance Desktop | Pentium II Processor                    | Pentium II Processor
Basic PC Desktop                                    | Pentium® Processor with MMX™ Technology | Intel® Celeron™ Processor
Mobile PC                                           | Pentium Processor with MMX Technology   | Mobile Pentium II Processor
82
INTEL Merced (mid-2000)
 64-bit processor
 VLIW concept?
 Need good compiler technique
 run UNIX (more scalable than NT)
83
INTEL McKinley (late-2001)
 64-bit, 0.13 micron
 More cache memory than any other INTEL processor
 aims at 1000 MHz, 2x faster than Merced
 Need good compiler technique
84
INTEL Foster
 32-bit, 0.13 micron
 1000 MHz
 high-end PC server
 longer pipeline + “instruction trace cache”
85
Intel Roadmap
32-bit:
P5: Pentium (1993),
P6: Pentium Pro (1995), Pentium II (450 MHz)
Celeron 2, Pentium II Xeon (450 MHz),
Tanner and Cascades chip (1999),
P7: Willamette (desktop),
Foster (1000 MHz, high-end PC server)
64-bit: Merced and McKinley
86
Intel vs. Compaq 64-bit
roadmaps:
Year      | Intel's IA-64        | Compaq's Alpha
1998      | in progress          | 21264 at 575 MHz
1999      | first samples        | 21264 at 750 MHz to 1 GHz
mid-2000  | Merced at 800 MHz +  | 21364 at 1 GHz +
late 2001 | McKinley at 1 GHz +  | EV8
2002      | Madison              |
2003(?)   | Deerfield            |
87
Digital Alpha 21164
 0.35 m , 500 MHz,
RISC
 4-way issue superscalar
 Up to 2 Integer and 2
Floating point instructions
issues per CPU cycle
 Large on-chip L2 cache
 96 KB, writable, 3-way set
associative
 9.3 M transistors
 Fully pipelined
 7-stage integer pipeline
 9-stage floating point
pipeline
 High-through memory
subsystem (400 MB/s)
88
Alpha 21164 Block Diagram
[Figure: block diagram. Inst Unit: instruction cache (8 KB) and 4-way issue unit. Exec Unit: two integer units, FP add (FP +) and FP multiply (FP *). Memory Unit: merge logic, write-through data cache (8 KB), write-back L2 cache (96 KB). Bus Interface Unit: 40-bit address and 128-bit data to the L3 cache. 128-bit internal data bus.]
89
Current Status of Processor
Technology
 Still don’t work well for some applications:
 data bases, CAD tools, sparse matrix,..
 Alpha 21164, 300 MHz, 4-way superscalar
 Running Microsoft SQLserver database on Windows NT
 It operates at 12% of peak performance
 Caches don’t work. Speed is tied to memory
bandwidth.
90
Microprocessor-DRAM
performance gap
 full cache miss time = 100s of instructions
    Alpha 7000 server: 340 ns / 5.0 ns = 68 clks (2-issue, x 2 = 136 insts)
    Alpha 8400 server: 266 ns / 3.3 ns = 80 clks (21164 processor, 4-issue, x 4 = 320 insts)
 Rely on locality + caches to bridge gap
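The miss costs for the two Alpha servers can be recomputed from the miss latency, the cycle time, and the issue width. This is a minimal sketch; the function name `miss_cost` is illustrative.

```python
def miss_cost(miss_ns: float, cycle_ns: float, issue_width: int):
    """Return (clock cycles lost, instruction issue slots lost) for one full cache miss."""
    clocks = miss_ns / cycle_ns
    return clocks, clocks * issue_width

clks_7000, slots_7000 = miss_cost(340, 5.0, 2)   # Alpha 7000, 2-issue
clks_8400, slots_8400 = miss_cost(266, 3.3, 4)   # Alpha 8400, 21164, 4-issue

print(clks_7000, slots_7000)
print(round(clks_8400, 1), round(slots_8400))
```

The newer machine has the lower miss latency in nanoseconds yet loses more than twice as many issue slots per miss, because both its clock rate and its issue width went up.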
91
Processor-Memory Gap
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale, 1 to 1000) against time, 1980 to 2000.]
 µProc (CPU): 60%/yr. (“Moore’s Law”, 2X/1.5 yr)
 DRAM: 9%/yr. (2X/10 yrs)
 Processor-Memory Performance Gap: grows 50% / year
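The growth of the gap follows from the two improvement rates in the figure, 60% per year for processors and 9% per year for DRAM. This is a minimal sketch, assuming both rates stay constant; the function name `gap_after` is illustrative.

```python
def gap_after(years: float, cpu_rate: float = 0.60, dram_rate: float = 0.09) -> float:
    """Ratio of processor to DRAM performance after `years`, starting from 1."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years

yearly_growth = gap_after(1) - 1
print(f"gap grows about {yearly_growth:.0%} per year")
print(f"gap after 10 years: about {gap_after(10):.0f}x")
```

The ratio 1.60 / 1.09 gives a little under 50% per year, which matches the rounded figure on the chart, and it compounds to a factor of several tens within a decade.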
92
Future Processors
 Specialized, very long instruction word (VLIW)
machines
 Wide, simultaneous multithreaded (SMT)
uniprocessor
 Single-chip multiprocessor
 Memory-centric computing engines (IRAM,
PPRAM,CRAM)
93
IRAM: Berkeley
 Growing performance gap between CPU
and memory access speed
 Microprocessor and DRAM on single chip
 Bridge the processor-memory
performance gap via on-chip latency and
bandwidth
 improve power-performance
94
CRAM (Univ. Toronto)
 Computation moved from the CPU into
the memory
 CRAM= RAM+SIMD
 PetaOPS performance ($10^{15}$ operations per second)
 Bandwidth internal to memory: 2.9 TB/s
 Cache/CPU: 800 MB/s
95
Other Projects
 PPRAM Project (Kyushu Univ., Japan):
Parallel Processing RAM Chip
 CMP Project (Stanford)
    billion-transistor processor architecture
    single-chip multiprocessor (4 to 16)
 New ISAs
96