The Alpha 21364 and 21464 Microprocessors: Shubu Mukherjee, Ph.D.


The Alpha 21364 and 21464 Microprocessors:
Continuing the Performance Lead Beyond Y2K
Shubu Mukherjee, Ph.D.
Principal Hardware Engineer
VSSAD Labs, Alpha Development Group
Compaq Computer Corporation
Shrewsbury, Massachusetts
Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)
Better answers
Alpha Microprocessor Roadmap
[Figure: roadmap chart. Vertical axis: Higher Performance. Horizontal axis: First System Ship, 1998 to 2003.
21264: EV6 (0.35 µm), EV67 (0.28 µm), EV68 (0.18 µm)
21364: EV7 (0.18 µm), EV78 (0.125 µm)
21464: EV8 (0.125 µm)]
Alpha 21264 Microprocessor

Architectural Features
First “Out-of-Order” Alpha
Four-wide superscalar
 …



Performance


World’s Fastest Microprocessor (www.spec.org, 11/17/99)
39 SPECint95, 68 SPECfp95 @ 700 MHz
Intel Pentium III @ 733 MHz delivers 36 SPECint95, 30 SPECfp95
Alpha Microprocessor Roadmap
[Figure: the roadmap chart again, showing the 21264 (EV6, EV67, EV68), 21364 (EV7, EV78), and 21464 (EV8) against Higher Performance and First System Ship, 1998 to 2003.]
Alpha 21364 Goals


Leadership single stream performance

Higher operating frequency

Integrated memory interface
Leadership multiprocessor performance

Integrated system / multiprocessor interface
Alpha 21364 Features

System-on-a-Chip
Alpha 21264 core with enhancements
Integrated L2 Cache
Integrated memory controller
Integrated network interface
Fault-Tolerance
Support for lock-step operation to enable high-availability systems.
21364 Chip Block Diagram
[Figure: chip block diagram. Labels: 21264 Core with 64K Icache and 64K Dcache; 16 L1 Miss Buffers; 16 L1 Victim Buf; L2 Cache; 16 L2 Victim Buf; Memory Controller connected to RAMBUS; Network Interface with N, S, E, W, and I/O ports; Address In and Address Out.]
21364 Core
[Figure: core pipeline diagram, stages numbered 0 to 6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE.
Fetch: Branch Predictors, Next-Line Address, L1 Ins. Cache (64KB, 2-Set).
Integer: Int Reg Map, Int Issue Queue (20), two Reg Files (80), Exec and Addr units.
Floating point: FP Reg Map, FP Issue Queue (15), Reg File (72), FP ADD, Div/Sqrt, FP MUL.
Memory: L1 Data Cache (64KB, 2-Set), Victim Buffer, Miss Address, L2 cache (1.5MB, 6-Set).
Annotations: 4 Instructions / cycle; 80 in-flight instructions plus 32 loads and 32 stores.]
Integrated L2 Cache
1.5 MB
6-way set associative
16 GB/s total read/write bandwidth
16 Victim buffers for L1 -> L2
16 Victim buffers for L2 -> Memory
ECC SECDED code
12 ns load-to-use latency
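
The capacity and associativity figures above fix the number of cache sets once a line size is chosen. The following is a minimal sketch of that arithmetic, assuming a 64-byte line; the slide does not state the line size, and the function and constant names are illustrative only.

```python
def cache_sets(capacity_bytes: int, ways: int, line_bytes: int) -> int:
    """Return the number of sets in a set-associative cache."""
    bytes_per_set = ways * line_bytes
    if capacity_bytes % bytes_per_set != 0:
        raise ValueError("capacity is not a whole number of sets")
    return capacity_bytes // bytes_per_set


L2_CAPACITY = 3 * 512 * 1024  # 1.5 MB, from the slide
L2_WAYS = 6                   # 6-way set associative, from the slide
LINE_BYTES = 64               # ASSUMPTION: not stated on the slide

sets = cache_sets(L2_CAPACITY, L2_WAYS, LINE_BYTES)
index_bits = sets.bit_length() - 1         # log2(sets) when sets is a power of two
offset_bits = LINE_BYTES.bit_length() - 1  # log2(line size)

print(sets, index_bits, offset_bits)
```

Under this assumption each of the six ways holds 256 KB, so the set count is a power of two even though the total capacity is not.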

Integrated Memory Controller

Direct RAMbus
High data capacity per pin
800 MHz operation
30 ns CAS latency pin to pin
6 GB/sec read or write bandwidth
100s of open pages
Directory-based cache coherence
ECC SECDED
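
SECDED (single-error-correct, double-error-detect) protection costs a fixed number of check bits per protected word. The sketch below counts them for an extended Hamming code, which is one standard way to build SECDED; the slides do not say which code or word width the 21364 uses, so the 64-bit example is an assumption.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming SECDED code over data_bits of data.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1
    to correct one error; one extra overall parity bit adds
    double-error detection.
    """
    if data_bits < 1:
        raise ValueError("data_bits must be positive")
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


# ASSUMPTION: a 64-bit protected word, chosen only for illustration.
print(secded_check_bits(64))
```

Wider protected words amortize the check bits better: doubling the word from 64 to 128 bits adds only one check bit in this scheme.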

Integrated Network Interface
Direct processor-to-processor interconnect
10 GB/second per processor
15 ns processor-to-processor latency
Out-of-order network with adaptive routing
Asynchronous clocking between processors
3 GB/second I/O interface per processor
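
Adaptive routing means a router may choose among several output ports that all move a packet closer to its destination, typically preferring the least congested one. The sketch below illustrates the idea on a 2D torus with N, S, E, W ports. The torus topology, the coordinate convention (north means increasing y), and the queue-depth congestion measure are all assumptions for illustration; the slides give only the port names and the phrase "adaptive routing", and deadlock avoidance is not modeled here.

```python
def productive_directions(src, dst, width, height):
    """Return the ports that lie on a minimal path from src to dst."""
    sx, sy = src
    dx, dy = dst
    dirs = []
    east = (dx - sx) % width   # hops needed going east (with wraparound)
    west = (sx - dx) % width   # hops needed going west
    if east != 0:
        if east <= west:
            dirs.append("E")
        if west <= east:
            dirs.append("W")
    north = (dy - sy) % height
    south = (sy - dy) % height
    if north != 0:
        if north <= south:
            dirs.append("N")
        if south <= north:
            dirs.append("S")
    return dirs


def choose_port(src, dst, width, height, queue_depth):
    """Pick the least-loaded productive port, or None if already at dst."""
    candidates = productive_directions(src, dst, width, height)
    if not candidates:
        return None
    return min(candidates, key=lambda d: queue_depth[d])


# Hypothetical 4x4 torus with a congested east port.
depth = {"N": 0, "E": 3, "S": 0, "W": 0}
print(choose_port((0, 0), (1, 1), 4, 4, depth))
```

Because different packets of one message can take different ports, they may arrive out of order, which is consistent with the "out-of-order network" bullet above.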

21364 System Block Diagram
[Figure: array of 21364 processors (labeled 364), each with its own memory (M) and I/O (IO).]
Alpha 21364 Technology
0.18 µm CMOS
1000+ MHz
100 Watts @ 1.5 volts
3.5 cm²
6 Layer Metal
100 million transistors
8 million logic
92 million RAM
Alpha 21364 Status
70 SPECint95 (estimated)
120 SPECfp95 (estimated)
RTL model running
Tapeout: Summer 2000

21364 Summary: System on a Chip

Integrated L2 cache and memory controller


outstanding single processor performance
Integrated network interface


high performance multi-processor systems
scales to large number of processors
Alpha Microprocessor Overview
[Figure: the roadmap chart again, showing the 21264 (EV6, EV67, EV68), 21364 (EV7, EV78), and 21464 (EV8) against Higher Performance and First System Ship, 1998 to 2003.]
Alpha 21464 Goals


Leadership single stream performance

Higher operating frequency / better technology

New microarchitecture

Integrated memory interface (like 21364)
Leadership multiprocessor performance

Simultaneous Multithreading (with minimal change/cost)

Integrated system / multiprocessor interface (like 21364)
Alpha 21464 Technology Overview

Leading edge process technology – 1.2-2.0GHz





0.125µm CMOS
SOI-compatible
Cu interconnect
low-k dielectrics
Chip characteristics


~1.2V Vdd
~250 Million transistors
Alpha 21464 Architecture Overview
Enhanced out-of-order execution
8-wide superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)
Instruction Issue
[Figure: function unit usage over time]
Reduced function unit utilization due to dependencies
Superscalar Issue
[Figure: function unit usage over time]
Superscalar leads to more performance, but lower utilization
Predicated Issue
[Figure: function unit usage over time]
Adds to function unit utilization, but results are thrown away
Chip Multiprocessor
[Figure: function unit usage over time]
Limited utilization when only running one thread
Fine Grained Multithreading
[Figure: function unit usage over time]
Intra-thread dependencies still limit performance
Simultaneous Multithreading
[Figure: function unit usage over time]
Maximum utilization of function units by independent operations
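
The utilization argument across these issue slides can be made concrete with a toy model. The sketch below counts how many issue slots get filled on a 4-wide machine under three policies: one thread only, fine-grained multithreading (one thread per cycle, round-robin), and SMT (any thread may fill any slot). The per-cycle counts of ready instructions are made-up numbers, the model ignores carry-over of unissued instructions, and none of it comes from the slides.

```python
def single_thread_util(ready, width):
    """Fraction of issue slots filled when only one thread runs."""
    issued = sum(min(width, r) for r in ready)
    return issued / (width * len(ready))


def fine_grained_util(ready_by_thread, width):
    """One thread owns all slots each cycle, chosen round-robin."""
    n = len(ready_by_thread)
    cycles = len(ready_by_thread[0])
    issued = sum(min(width, ready_by_thread[c % n][c]) for c in range(cycles))
    return issued / (width * cycles)


def smt_util(ready_by_thread, width):
    """Any thread may fill any slot in the same cycle."""
    cycles = len(ready_by_thread[0])
    issued = sum(
        min(width, sum(thread[c] for thread in ready_by_thread))
        for c in range(cycles)
    )
    return issued / (width * cycles)


# HYPOTHETICAL data: ready[t][c] = independent instructions that
# thread t could issue in cycle c.
ready = [
    [2, 0, 1, 0],  # thread 0
    [1, 2, 0, 1],  # thread 1
    [0, 1, 2, 0],  # thread 2
    [1, 1, 1, 2],  # thread 3
]
WIDTH = 4

print(single_thread_util(ready[0], WIDTH))
print(fine_grained_util(ready, WIDTH))
print(smt_util(ready, WIDTH))
```

In this toy data, fine-grained multithreading removes fully idle cycles but still leaves slots empty within a cycle, while SMT fills those slots with instructions from other threads.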
Basic Out-of-order Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures: PC, Icache, Register Map, Regs, Dcache, Regs. Annotation: Thread-blind.]
SMT Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures: PC, Icache, Register Map, Regs, Dcache, Regs.]
Changes for SMT

Basic pipeline – unchanged

Replicated resources



Program counters
Register maps
Shared resources





Register file (size increased)
Instruction queue
First and second level caches
Translation buffers
Branch predictor
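
The split between replicated and shared resources can be sketched as a data structure: each thread keeps its own program counter and register map, while the physical register file and instruction queue are shared. The code below is an illustrative model only. The class names, sizes, and allocation policy are hypothetical and are not taken from the 21464 design.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Replicated per-thread state: program counter and register map."""
    pc: int = 0
    reg_map: dict = field(default_factory=dict)  # architectural -> physical


class SmtCore:
    """Toy SMT core with a shared register file and instruction queue."""

    def __init__(self, num_threads=4, num_physical_regs=16):
        self.threads = [ThreadContext() for _ in range(num_threads)]
        self.free_regs = list(range(num_physical_regs))  # shared register file
        self.instruction_queue = []                       # shared queue

    def _allocate(self, tid, arch_reg):
        # Raises IndexError when the shared register file is exhausted.
        phys = self.free_regs.pop(0)
        self.threads[tid].reg_map[arch_reg] = phys
        return phys

    def dispatch(self, tid, dest, srcs):
        """Rename one instruction for thread tid and enqueue it."""
        ctx = self.threads[tid]
        phys_srcs = [
            ctx.reg_map[s] if s in ctx.reg_map else self._allocate(tid, s)
            for s in srcs
        ]
        phys_dest = self._allocate(tid, dest)
        # After renaming, the entry names only physical registers, so the
        # queue itself does not need to know which thread it came from.
        self.instruction_queue.append((phys_dest, phys_srcs))
        ctx.pc += 4  # Alpha instructions are 4 bytes long


core = SmtCore(num_threads=2, num_physical_regs=8)
core.dispatch(0, dest="r1", srcs=[])
core.dispatch(1, dest="r1", srcs=["r1"])
print(core.threads[0].reg_map, core.threads[1].reg_map)
```

Two threads writing the same architectural register end up with different physical registers, which is why the shared register file has to grow while the stages after renaming can stay unchanged.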
Multiprogrammed workload
[Figure: bar chart, vertical axis 0% to 250%, bars for 1T, 2T, 3T, and 4T. Workloads: SpecInt, SpecFP, Mixed Int/FP.]
Decomposed SPEC95 Applications
[Figure: bar chart, vertical axis 0% to 250%, bars for 1T, 2T, 3T, and 4T. Applications: Turb3d, Swm256, Tomcatv.]
Multithreaded Applications
[Figure: bar chart, vertical axis 0% to 300%, bars for 1T, 2T, and 4T. Applications: Barnes, Chess, Sort, TP.]
Architectural Abstraction
1 Processor with 4 Thread Processing Units (TPUs)
Shared hardware resources
[Figure: TPU0, TPU1, TPU2, and TPU3 sharing the Icache, TLB, Dcache, and Scache.]
21464 System Block Diagram
[Figure: array of EV8 processors, each with its own memory (M) and I/O (IO); thread labels 0 to 3 appear on one processor.]
Alpha 21464 Summary


Leadership single stream performance

Higher operating frequency / better technology

New microarchitecture

Integrated memory interface (like 21364)
Leadership multiprocessor performance

Simultaneous Multithreading (with minimal changes/cost)

Integrated system / multiprocessor interface (like 21364)
Maintain Performance Lead Beyond Y2K

Alpha 21364



Reuses 21264 microprocessor core
System on a chip
Alpha 21464
New microarchitecture
System on a chip
Simultaneous Multithreading
My Current Research: Beyond 21464?

The Truth Project (w/ Joel Emer)


The Multinet Project (w/ Rick Kessler)


Tightly-coupled multiprocessor networks
The Reliant Project (w/ Steve Reinhardt)


Examines different microarchitectural issues
Self-Checking Microprocessors using SMT, ISCA submission
Asim (w/ VSSAD Labs)

Performance Model for Alphas beyond 21464