11. Multicore Processors
Dezső Sima
Fall 2006
© D. Sima, 2006
Overview
• 1 Overview of MCPs
• 2 Attaching L2 caches
• 3 Attaching L3 caches
• 4 Connecting memory and I/O
• 5 Case examples
1. Overview of MCPs (1)
Figure 1.1: Processor power density trends
Source: D. Yen: Chip Multithreading Processors Enable Reliable High Throughput Computing
http://www.irps.org/05-43rd/IRPS_Keynote_Yen.pdf
1. Overview of MCPs (2)
Figure 1.2: Single-stream performance vs. cost
Source: Marr D.T. et al.: „Hyper-Threading Technology Architecture and Microarchitecture”,
Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16
1. Overview of MCPs (2)
Superscalar RISC processors, grouped on the slide as dual core/single threaded, dual core/dual threaded, multi core/single threaded and multi core/multi threaded (mtrs. = million transistors):
• IBM: POWER4 (2001), 0.18 µ/184 mtrs.; POWER5 (2004), 0.13 µ/276 mtrs.; Cell (2006), 0.09 µ/234 mtrs.
• Sun: UltraSPARC IV (Jaguar) (2004), 2*USIII, 0.13 µ/66 mtrs.; Gemini (2004), 2*USIII, 0.13 µ/80 mtrs.; UltraSPARC IV+ (Panther) (2005), 0.09 µ/295 mtrs.; UltraSPARC T1 (Niagara) (2005), 8 cores/4T, 0.09 µ/279 mtrs.
• HP: PA 8800 (Mako) (2004), 2*PA8700, 0.13 µ/300 mtrs.; PA 8900 (Shortfin) (5/2005), 0.13 µ/317 mtrs.
Figure 1.2: Dual/multi-core processors (1)
1. Overview of MCPs (3)
Superscalar CISC processors, grouped on the slide as dual core/single threaded, dual core/dual threaded, multi core/single threaded and multi core/multi threaded (mtrs. = million transistors):
• Intel: Pentium D 820-840 (Smithfield) (4/2005), 0.09 µ/230 mtrs.; Pentium EE 840 (4/2005), 0.09 µ/230 mtrs.; Pentium EE 955/965 (Presler) (4/2005), 0.065 µ/2*188 mtrs.; Pentium M Core Duo (Yonah) (2005), 0.065 µ/151 mtrs.
• AMD: Athlon 64 X2 (6/2005), 0.09 µ/233 mtrs.
VLIW processors:
• Intel: Montecito (2006?), 2*Itanium 2 (Madison), 0.09 µ/1730 mtrs.
Figure 1.3: Dual/multi-core processors (2)
1. Overview of MCPs (4)
Macro architecture of dual/multi-core processors (MCPs)
• Layout of the cores
• Attaching of L2 caches
• Attaching of L3 caches (if available)
• Layout of the I/O and memory architecture
2. Attaching L2 caches
2.1 Main aspects of attaching L2 caches to MCPs (1)
Attaching L2 caches to MCPs
• Allocation to the cores
• Use by instructions/data
• Inclusion policy
• Integration of L2 caches to the proc. chip
• Banking policy
Allocation of L2 caches to the cores
• Private L2 cache for each core: UltraSPARC IV (2004), Smithfield (2005), Athlon 64 X2 (2005), Montecito (2006?)
• Shared L2 cache for all cores: UltraSPARC T1 (2005), Yonah (2006), Core Duo (2006), POWER4 (2001), POWER5 (2005)
Expected trend
2.1 Main aspects of attaching L2 caches to MCPs (2)
Inclusion policy of L2 caches
• Inclusive L2 (L1 - L2 - Memory)
• Exclusive L2 (L1 - L2 - Memory)
  - Lines replaced (victimized) in the L1 are written into the L2.
  - References to data in the L2 initiate reloading of that cache line into the L1.
  - The L2 usually operates as a write-back cache (only modified data that is replaced in the L2 is written back to memory).
  - Unmodified data that is replaced in the L2 is deleted.
Figure 1.1: Implementation of exclusive L2 caches
Source: Zheng, Y., Davis, B.T., Jordan, M.: “Performance evaluation of exclusive cache hierarchies”,
2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),
2004, pp. 89-96.
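The exclusive policy listed on this slide can be made concrete with a small model. The following is a minimal Python sketch, not taken from the cited paper: the class name ExclusiveHierarchy, the LRU replacement and the fully associative organisation are assumptions chosen only to keep the example short.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy model of an exclusive L1/L2 pair (hypothetical, LRU, fully associative)."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # line address -> dirty flag, oldest entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.writebacks = []     # lines written back to memory

    def _fill_l2(self, addr, dirty):
        if len(self.l2) >= self.l2_lines:
            victim, victim_dirty = self.l2.popitem(last=False)
            if victim_dirty:
                # Only modified lines replaced in the L2 go back to memory.
                self.writebacks.append(victim)
            # Unmodified victims are simply dropped.
        self.l2[addr] = dirty

    def _fill_l1(self, addr, dirty):
        if len(self.l1) >= self.l1_lines:
            victim, victim_dirty = self.l1.popitem(last=False)
            # Lines victimized in the L1 are written into the L2.
            self._fill_l2(victim, victim_dirty)
        self.l1[addr] = dirty

    def access(self, addr, write=False):
        if addr in self.l1:
            dirty = self.l1.pop(addr) or write
            self.l1[addr] = dirty  # re-insert as most recently used
            return "L1 hit"
        if addr in self.l2:
            # The line moves from the L2 into the L1, so it is never in both.
            dirty = self.l2.pop(addr) or write
            self._fill_l1(addr, dirty)
            return "L2 hit"
        self._fill_l1(addr, write)  # miss: the line is loaded into the L1 only
        return "miss"
```

With one line per level, accessing line 0 (as a write), then line 1, then line 0 again yields a miss, a miss and an L2 hit, and at no point does a line sit in both levels at once.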
Inclusion policy of L2 caches
• Exclusive L2: Athlon 64 X2 (2005)
• Inclusive L2: most implementations
Expected trend
2.1 Main aspects of attaching L2 caches to MCPs (3)
Use by instructions/data
• Unified instr./data cache(s): UltraSPARC IV (2004), UltraSPARC T1 (2005), POWER4 (2001), POWER5 (2005), Smithfield (2005), Yonah (2006), Core Duo (2006), Athlon 64 X2 (2005)
• Split instr./data caches: Montecito (2006?)
Expected trend
2.1 Main aspects of attaching L2 caches to MCPs (4)
Banking policy
• Single-banked implementation
• Multi-banked implementation
2.1 Main aspects of attaching L2 caches to MCPs (5)
Integration to the processor chip
• On-chip L2 tags/contr., off-chip data
• Entire L2 on chip
Examples: UltraSPARC IV (2004), UltraSPARC V (2005), POWER4 (2001), POWER5 (2005), Smithfield (2005), Presler (2005), Athlon 64 X2 (2005)
Expected trend
2.2 Examples of attaching L2 caches to MCPs (1)
Private L2 caches for each core
• Unified instruction/data caches
  - On-chip L2 tags/contr., off-chip data: UltraSPARC IV (2004)
  - Entire L2 on-chip: Smithfield (2005), Presler (2005), Athlon 64 X2 (2005) (exclusive L2)
• Split instruction/data caches
  - On-chip L2 tags/contr., off-chip data
  - Entire L2 on-chip: Montecito (2006?)
Block diagrams (components shown):
• UltraSPARC IV: two cores, each with on-chip L2 tags/contr. and off-chip L2 data; interconn. network; syst. if. to the Fire Plane bus; mem. contr. to the memory
• Smithfield/Presler: two cores, each with its own L2; syst. if. to the FSB
• Athlon 64 X2: two cores, each with its own L2; System Request Queue; Xbar; mem. contr. to the memory; HT-bus contr. to the HT-bus
• Montecito: two cores, each with L2 I, L2 D and L3; syst. if. to the FSB
2.2 Examples of attaching L2 caches to MCPs (2)
Shared L2 caches for all cores
• Dual core/single banked L2: Yonah Duo (2006), Core (2006)
  Block diagram: two cores, L2 contr., shared L2, system if. to the FSB
• Dual core/multi banked L2: POWER4 (2001), POWER5 (2005)
  Block diagram: two cores, X-bar, L2 banks, Fabric Bus Contr., GX contr. to the GX bus, L3 tags/contr., mem. contr. to the memory
  Mapping of addresses to the banks: The 128-byte long L2 cache lines are hashed across the 3 modules. Hashing is performed by modulo 3 arithmetic applied to a large number of real address bits.
• Multi core/multi banked L2: UltraSPARC T1 (2005) (Niagara) (8 cores/4xL2 banks)
  Block diagram: eight cores, X-bar, four L2 banks, mem. contr. to the memory
  Mapping of addresses to the banks: The four L2 modules are interleaved at 64-byte blocks (address bits 7 and 6 select the module).
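The two bank-mapping rules described on this slide can be illustrated numerically. The Python sketch below is a simplification: the function names are hypothetical, and the POWER4-style function applies modulo 3 to the whole line number, whereas the slide only states that the hardware hashes "a large number of real address bits".

```python
def modulo3_bank(real_addr, line_size=128, banks=3):
    """Simplified stand-in for hashing 128-byte lines across 3 L2 modules."""
    line_number = real_addr // line_size
    return line_number % banks


def interleaved_bank(addr, block_size=64, banks=4):
    """Interleaving at 64-byte blocks across 4 L2 modules.

    With these defaults the result equals address bits 7..6.
    """
    return (addr // block_size) % banks


if __name__ == "__main__":
    for a in (0, 128, 256, 384):
        print("modulo-3 mapping:", a, "->", modulo3_bank(a))
    for a in (0, 64, 128, 192, 256):
        print("64-byte interleaving:", a, "->", interleaved_bank(a))
```

Consecutive cache lines (or blocks) land in consecutive banks under both rules, so sequential accesses are spread across all L2 modules rather than queueing on one.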
3. Attaching L3 caches
3.1 Main aspects of attaching L3 caches to MCPs (1)
Attaching L3 caches to MCPs
• Allocation to the L2 cache(s)
• Use by instructions/data
• Inclusion policy
• Integration of L3 caches to the proc. chip
• Banking policy
Allocation of L3 caches to the L2 caches
• Private L3 cache for each L2
• Shared L3 cache for all L2s
Examples: POWER5 (2005), POWER4 (2001), UltraSPARC IV+ (2005), Montecito (2006?)
3.1 Main aspects of attaching L3 caches to MCPs (2)
Inclusion policy of L3 caches
• Inclusive L3 (L2 - L3 - Memory)
• Exclusive L3 (L2 - L3 - Memory)
  - Lines replaced (victimized) in the L2 are written into the L3.
  - References to data in the L3 initiate reloading of that cache line into the L2.
  - The L3 usually operates as a write-back cache (only modified data that is replaced in the L3 is written back to memory).
  - Unmodified data that is replaced in the L3 is deleted.
Inclusion policy of L3 caches
• Exclusive L3: POWER5 (2005), UltraSPARC IV+ (2005)
• Inclusive L3: POWER4 (2001), Montecito (2006?)
Expected trend
3.1 Main aspects of attaching L3 caches to MCPs (3)
Use by instructions/data
• Unified instr./data cache(s): the L3 caches of all multicore processors unveiled so far hold both instructions and data
• Split instr./data caches
3.1 Main aspects of attaching L3 caches to MCPs (4)
Banking policy
• Single-banked implementation
• Multi-banked implementation
3.1 Main aspects of attaching L3 caches to MCPs (5)
Integration to the processor chip
• On-chip L3 tags/contr., off-chip data: UltraSPARC IV+ (2005), POWER4 (2001), POWER5 (2005)
• Entire L3 on chip: Montecito (2006?)
Expected trend
3.2 Examples of attaching L3 caches to MCPs (1)
Inclusive L3 cache
• Private L3 caches for each L2 cache bank
  - On-chip L3 tags/contr., off-chip data
  - Entire L3 on-chip
• Shared L3 cache for all cache banks
  - On-chip L3 tags/contr., off-chip data
  - Entire L3 on-chip
Examples: POWER4 (2001), Montecito (2006?)
Block diagrams (components shown):
• POWER4: L2 banks, Fabric Bus Contr., on-chip L3 tags/contr., off-chip L3 data, mem. contr., memory
• Montecito: L2 I, L2 D and L3 for each core; Arbiter; system if. to the FSB
3.2 Examples of attaching L3 caches to MCPs (2)
Exclusive L3 cache
• Private L3 caches for each L2 cache bank
  - On-chip L3 tags/contr., off-chip data
  - Entire L3 on-chip
• Shared L3 cache for all cache banks
  - On-chip L3 tags/contr., off-chip data
  - Entire L3 on-chip
Examples: POWER5 (2005), UltraSPARC IV+ (2005)
Block diagrams (components shown):
• POWER5: L2 banks, each with L3 tags/contr. and L3 data; Fabric Bus Contr.; memory contr. to the memory
• UltraSPARC IV+: two cores, L2, L3 tags/contr., L3 data, interconn. network, syst. if. to the Fire Plane bus, mem. contr. to the memory
4. Connecting memory and I/O
4.1 Overview
Layout of the I/O and memory architecture in dual/multi-core processors
• Connection policy of I/O and memory
• Integration of the memory controller to the processor chip
4.2 Connection policy (1)
Connection policy of I/O and memory
• Connecting both I/O and memory via the system bus: PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?)
• Dedicated connection of I/O and memory
  - Asymmetric connection of I/O and memory: POWER4 (2001), UltraSPARC T1 (2005)
  - Symmetric connection of I/O and memory: POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), Athlon 64 X2 (2005)
4.2 Connection policy (2)
Connecting both I/O and memory via the system bus
Examples (block diagrams, components shown):
• Smithfield/Presler (2005/2005): two cores, each with its own L2; syst. bus if. to the FSB
• Yonah Duo/Core (2006/2006): two cores with a shared L2; syst. bus if. to the FSB
• Montecito (2006): two cores, each with L2 I/L2 D and L3; syst. bus if. to the FSB
• PA-8800 (2004)/PA-8900 (2005): two cores, L2 contr., L2; syst. bus if. to the FSB
4.2 Connection policy (3)
Connection policy of I/O and memory
• Connecting both I/O and memory via the system bus: PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?)
• Dedicated connection of I/O and memory
  - Asymmetric connection of I/O and memory (connecting I/O via the internal interconnection network, and memory via the L2/L3 cache): POWER4 (2001), UltraSPARC T1 (2005)
  - Symmetric connection of I/O and memory (connecting both I/O and memory via the internal interconnection network): POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), Athlon 64 X2 (2005)
4.2 Connection policy (4)
Asymmetric connection of I/O and memory
Examples (block diagrams, components shown):
• POWER4 (2001): L2 banks; Fabric Bus Contr.; chip-to-chip/mem.-to-mem. interconn.; GX contr. to the GX-bus; L3 dir./contr. with L3 data; mem. contr. to the memory
• UltraSPARC T1 (2005): Core 0 to Core 7; Xbar; four L2 banks, each with its own M. contr. and memory; bus if. to the JBus
4.2 Connection policy (6)
Symmetric connection of I/O and memory (1)
Examples (block diagrams, components shown):
• POWER5 (2005): L2 banks; Fabric Bus Contr.; chip-chip/mem.-mem. interconn.; GX contr. to the GX bus; L3; mem. contr. to the memory
• UltraSPARC IV (2004): two cores, each with on-chip L2 tags/contr. and off-chip L2 data; interconn. network; syst. if. to the Fire Plane bus; mem. contr. to the memory
4.2 Connection policy (7)
Symmetric connection of I/O and memory (2)
Examples (block diagrams, components shown):
• Athlon 64 X2 (2005): two cores, each with its own L2; System Request Queue; Xbar; HT-bus contr. to the HT-bus; mem. contr. to the memory
• UltraSPARC IV+ (2005): two cores; L2; L3 tags/contr. with L3 data; interconn. network; syst. if. to the Fire Plane bus; mem. contr. to the memory
4.3 Integration of the memory controller to the processor chip
• Off-chip memory controller
• On-chip memory controller
Examples: POWER4 (2001), POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), UltraSPARC T1 (2005), PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?), Athlon 64 X2 (2005)
Expected trend
5. Case examples
5.1 Intel MCPs (1)
[Roadmap chart "The Move to Intel Multi-core", 2005 to 2007+, by platform: Itanium® processor, MP Server, DP Server/WS, Desktop Client, Mobile Client. All products and dates are preliminary and subject to change without notice; refer to the ‘fact sheet’ for specific product timings.]
Figure 5.1: The move to Intel multi-core
Source: A. Loktu: Itanium 2 for Enterprise Computing
http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps
5.1 Intel MCPs (2)
Figure 5.2: Processor specifications of Intel’s Pentium D family (90 nm)
Source: http://www.intel.com/products/processor/index.htm
5.1 Intel MCPs (3)
ED: Execute Disable Bit
Malicious buffer overflow attacks pose a significant security threat.
In a typical attack, a malicious worm creates a flood of code that overwhelms the processor,
allowing the worm to propagate itself to the network and to other computers.
When combined with a supporting operating system, the Execute Disable Bit can help prevent
certain classes of malicious buffer overflow attacks.
It allows the processor to classify areas in memory
by where application code can execute and where it cannot.
When a malicious worm attempts to insert code in the buffer,
the processor disables code execution, preventing damage and worm propagation.
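The idea of classifying memory areas by whether code may execute there can be sketched with a toy page table. This is an illustrative Python model only, not Intel's mechanism or any operating system's API: the names ToyMemory, Page and ExecuteDisableFault and the 4096-byte page size are assumptions made for the example.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass
class Page:
    writable: bool
    executable: bool


class ExecuteDisableFault(Exception):
    """Raised when an instruction fetch targets a non-executable page."""


class ToyMemory:
    """Hypothetical page table that marks each page executable or not."""

    def __init__(self):
        self.pages = {}

    def map_page(self, page_no, writable, executable):
        self.pages[page_no] = Page(writable, executable)

    def fetch_instruction(self, addr):
        page = self.pages[addr // PAGE_SIZE]
        if not page.executable:
            # Code injected into a data buffer ends up here and cannot run.
            raise ExecuteDisableFault(f"fetch from non-executable page at {addr:#x}")
        return addr


if __name__ == "__main__":
    mem = ToyMemory()
    mem.map_page(0, writable=False, executable=True)   # code page
    mem.map_page(1, writable=True, executable=False)   # buffer/stack page
    mem.fetch_instruction(0x10)                        # allowed
    try:
        mem.fetch_instruction(PAGE_SIZE + 8)           # inside the buffer page
    except ExecuteDisableFault as fault:
        print("blocked:", fault)
```

The operating system's role in the real feature corresponds to the map_page calls here: it decides which areas hold code and which hold data.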
VT: Virtualization Technology
VT is a set of hardware enhancements to Intel’s server and client platforms
that can improve the performance and robustness of traditional software-based virtualization solutions.
Virtualization solutions allow a platform to run multiple
operating systems and applications in independent partitions.
Using virtualization capabilities, one computer system can function as multiple "virtual" systems.
EIST: Enhanced Intel SpeedStep Technology
First delivered in Intel’s mobile and server platforms,
EIST allows the system to dynamically adjust processor voltage and core frequency,
which can result in decreased average power consumption
and decreased average heat production.
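Why lowering voltage and frequency together reduces power can be shown with a short calculation. The slide gives no formula; the Python sketch below assumes the common first-order CMOS approximation that dynamic power scales with C·V²·f, and the function name and scaling factors are chosen only for illustration.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to nominal, assuming P_dyn ~ C * V^2 * f.

    Both arguments are fractions of the nominal value (1.0 = unchanged).
    Leakage power is ignored in this approximation.
    """
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Hypothetical operating point: 80 % of nominal voltage and frequency.
    print(relative_dynamic_power(0.8, 0.8))
```

Under this assumption, running at 80 % of nominal voltage and 80 % of nominal frequency leaves roughly half of the nominal dynamic power, which is why adjusting both is more effective than lowering frequency alone.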
5.1 Intel MCPs (4)
Figure 5.3: Processor specifications of Intel’s Pentium D family (65 nm)
Source: http://www.intel.com/products/processor/index.htm
5.1 Intel MCPs (5)
Figure 5.4: Specifications of Intel’s Pentium Processor Extreme Edition models 840/955/965
Source: http://www.intel.com/products/processor/index.htm
5.1 Intel MCPs (6)
Figure 5.5: Processor specifications of Intel’s Yonah Duo (Core Duo) family
Source: http://www.intel.com/products/processor/index.htm
5.1 Intel MCPs (7)
Figure 5.6: Specifications of Intel’s Core Processors
Source: http://www.intel.com/products/processor_number/chart/core2duo.htm
5.1 Intel MCPs (8)
Category        Code name                 Cores                   Cache                 Market intro.
Desktop         Kentsfield                Dual core, multi-die    4 MB                  Mid 2007
Desktop         Conroe                    Dual core, single die   4 MB shared           End 2006
Desktop         Allendale                 Dual core, single die   2 MB shared           End 2006
Desktop         Cedar Mill (NetBurst/P4)  Single core             512 kB, 1 MB, 2 MB    Early 2006
Desktop         Presler (NetBurst/P4)     Dual core, dual die     4 MB                  Early 2006
Desktop/Mobile  Millville                 Single core             1 MB                  Early 2007
Mobile          Yonah2                    Dual core, single die   2 MB                  Early 2006
Mobile          Yonah1                    Single core             1/2 MB                Mid 2006
Mobile          Stealey                   Single core             512 kB                Mid 2007
Mobile          Merom                     Dual core, single die   2/4 MB shared         End 2006
Enterprise      Sossaman                  Dual core, single die   2 MB                  Early 2006
Enterprise      Woodcrest                 Dual core, single die   4 MB                  Mid 2006
Enterprise      Clovertown                Quad core, multi-die    4 MB                  Mid 2007
Enterprise      Dempsey (NetBurst/Xeon)   Dual core, dual die     4 MB                  Mid 2006
Enterprise      Tulsa                     Dual core, single die   4/8/16 MB             End 2006
Enterprise      Whitefield                Quad core, single die   8 MB, 16 MB shared    Early 2008
Figure 5.7: Future 65 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered
5.1 Intel MCPs (9)
Category        Code name     Cores                   Cache                Market intro.
Desktop         Wolfdale      Dual core, single die   3 MB shared          2008
Desktop         Ridgefield    Dual core, single die   6 MB shared          2008
Desktop         Yorkfield     8 cores, multi-die      12 MB shared         2008+
Desktop         Bloomfield    Quad core, single die   -                    2008+
Desktop/Mobile  Perryville    Single core             2 MB                 2008
Mobile          Penryn        Dual core, single die   3 MB, 6 MB shared    2008
Mobile          Silverthorne  -                       -                    2008+
Enterprise      Harpertown    8 cores, multi-die      12 MB shared         2008
Figure 5.8: Future 45 nm processors (overview)
Source: P. Schmid: Top Secret Intel Processor Plans Uncovered
www.tomshardware.com/2005/12/04/top_secret_intel_processor_plans_uncovered
5.2 Athlon 64 X2
Figure 5.9: AMD Athlon 64 X2 dual-core processor architecture
Source: AMD Athlon 64 X2 Dual-Core Processor for Desktop – Key Architecture Features,
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041.00.html
5.3 Sun’s UltraSPARC IV/IV+ (1)
ARB: Arbiter
Figure 5.10: UltraSPARC IV (Jaguar)
Source: C. Boussard: Architecture des processeurs
http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf
5.3 Sun’s UltraSPARC IV/IV+ (2)
Figure 5.11: UltraSPARC IV+ (Panther)
Source: C. Boussard: Architecture des processeurs
http://laser.igh.cnrs.fr/IMG/pdf/SUN-CNRS-archi-cpu-3.pdf
5.4 POWER4/POWER5 (1)
Core Interface Unit (crossbar)
Service Processor
Power On Reset
Built-In Self-Test
Non-Cacheable Unit
MultiChip Module
Figure 5.12: POWER4 chip logical view
Source: J.M. Tendler, S. Dodson, S. Fields, H. Le, B. Sinharoy: Power4 System Microarchitecture, IBM Server,
Technical White Paper, October 2001
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4.pdf
5.4 POWER4/POWER5 (2)
Figure 5.13: POWER4 chip
Source: R. Kalla, B. Sinharoy, J. Tendler: Simultaneous Multi-threading Implementation in Power5 –
IBM’s Next Generation POWER Microprocessor, 2003
http://www.hotchips.org/archives/hc15/3_Tue/11.ibm.pdf
5.4 POWER4/POWER5 (3)
Fabric Controller
Figure 5.14: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor,
IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.
5.5 Cell (1)
SPE: Synergistic Processing Element
EIB: Element Interconnect Bus
MFC: Memory Flow Controller
PPE: Power Processing Element
AUC: Atomic Update Cache
Figure 5.15: Cell (BE) microarchitecture
Source: IBM: „Cell Broadband Engine™ processor-based systems”,
IBM Corp., 2006
5.5 Cell (2)
Figure 5.16: Cell SPE architecture
Source: Blachford N.: „Cell Architecture Explained Version 2”,
http://www.blachford.info/computer/Cell/Cell1_v2.html
5.5 Cell (3)
Figure 5.17: Cell floorplan
Source: Blachford N.: „Cell Architecture Explained Version 2”,
http://www.blachford.info/computer/Cell/Cell1_v2.html