Microprocessor Design 2002

Advanced Computer Architecture
5MD00 / 5Z033
TOP 500
supercomputers
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
TUEindhoven
2011
Topics
• How to cross the Petaflop boundary
• Ranking
– Nov 2008: crossing the Petaflop/s boundary
– Nov 2009 / Nov 2010: what has changed
– Nov 2011: Japan's "K Computer" on top: 10.51 Petaflop/s on Linpack using 705024 SPARC64 cores
• 2nd : Chinese Tianhe-1A: 2.57 Petaflop/s
• Examples
– Roadrunner (IBM)
– Jaguar (Cray)
– SGI Altix
– BlueGene
7/21/2015
ACA H.Corporaal
2
How to build a Petaflop supercomputer?
Some examples from 2008:
• Opteron cluster (e.g. ~2X Ranger/TACC)
– 32,000 quad-core Opterons (130K cores)
• Cray XT3/4 (e.g. Baker/ORNL sooner)
– 32,000 quad-core Opterons (130K cores)
• IBM BlueGene/P (bigger sooner)
– 80,000 BG/P PPC processors (320K cores)
• IBM Cell-accelerated Roadrunner cluster
– 10,000 Cells (80K Cell SPUs)
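The processor counts above imply how much each core has to deliver. A minimal sketch of that arithmetic, assuming 4 cores per quad-core processor and 8 SPUs per Cell as listed; the names `configs` and `required_gflops_per_core` are illustrative only:

```python
# Rough sizing for a 1 Petaflop/s machine, using the processor counts listed above.
PETAFLOP = 1e15  # floating-point operations per second

# (name, number of processors, cores per processor)
configs = [
    ("Opteron cluster", 32_000, 4),
    ("Cray XT3/4", 32_000, 4),
    ("IBM BlueGene/P", 80_000, 4),
    ("Roadrunner (Cell SPUs only)", 10_000, 8),
]


def required_gflops_per_core(processors: int, cores_per_processor: int) -> float:
    """Peak GFlop/s each core must deliver for the machine to reach 1 Pflop/s."""
    return PETAFLOP / (processors * cores_per_processor) / 1e9


for name, procs, cores in configs:
    total = procs * cores
    per_core = required_gflops_per_core(procs, cores)
    print(f"{name:30s} {total:7d} cores -> {per_core:5.2f} GFlop/s per core")
```

These are peak figures; a sustained Linpack Petaflop needs more headroom than this.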
Supercomputer Ranking
• Started in 1993
• Jack Dongarra, University of Tennessee
• Based on LINPACK benchmark
– linear algebra (LU factorization)
• Superseded by LAPACK
– based on BLAS (Basic Lin. Alg. Subprograms)
– exploits caches
• Measures Floating Point performance
• Fortran code
• see http://www.top500.org
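To make the benchmark idea concrete, here is a small sketch of LU factorization with partial pivoting plus a flop-rate estimate. It is not the LINPACK/HPL code (which is Fortran/C and heavily tuned); it is plain Python, and it uses only the leading-order operation count of about 2/3 n^3 for an n x n matrix.

```python
import random
import time


def lu_factorize(a):
    """In-place LU factorization with partial pivoting; returns the row permutation."""
    n = len(a)
    perm = list(range(n))
    for k in range(n):
        # Pick the row with the largest entry in column k as pivot row.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        perm[k], perm[pivot] = perm[pivot], perm[k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            row_i[k] /= row_k[k]          # multiplier, stored in the L part
            factor = row_i[k]
            for j in range(k + 1, n):
                row_i[j] -= factor * row_k[j]   # update the trailing submatrix
    return perm


def lu_flops(n):
    """Leading-order operation count for LU on an n x n matrix."""
    return 2.0 * n ** 3 / 3.0


if __name__ == "__main__":
    n = 120
    random.seed(0)
    matrix = [[random.random() for _ in range(n)] for _ in range(n)]
    start = time.perf_counter()
    lu_factorize(matrix)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {lu_flops(n) / elapsed / 1e6:.1f} MFlop/s (pure Python, untuned)")
```

The reported rate divides an operation count by wall-clock time, which is the same principle behind the Rmax column in the ranking tables.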
Single-Chip GPU vs. Fastest Supercomputers
ref: http://www.llnl.gov/str/JanFeb05/Seager.html
Performance Ranking Nov. 2008
#     Name                          N_PE     Rmax (Tflop/s)   Rpeak (Tflop/s)   P (kW)
1     Roadrunner IBM                129600   1105             1456              2483
2     Cray XT5                      150152   1059             1381              6950
3     SGI Altix ICE                 51200    487              608               2090
4     BlueGene IBM                  212992   478              596               2329
100   Cluster Platform (Xeons)      5120     27               51                -
52    Power 575 SARA (Amst)         3328     49               63                532
75    BlueGene Astron               12288    35               42                95
496   Cluster in Gent Univ.         1568     13               16                51
Performance Ranking
2008: we crossed the Petaflop boundary
(same Nov. 2008 ranking as on the previous slide; Roadrunner at 1105 Tflop/s and Cray XT5 at 1059 Tflop/s Rmax are the first two systems above 1000 Tflop/s)
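Two derived figures are useful when reading the ranking: Linpack efficiency (Rmax / Rpeak) and power efficiency (Rmax per watt). A small sketch using the top four rows of the Nov. 2008 table; the function names are illustrative:

```python
# Nov. 2008 top four, copied from the table:
# (name, Rmax in Tflop/s, Rpeak in Tflop/s, power in kW)
systems = [
    ("Roadrunner IBM", 1105, 1456, 2483),
    ("Cray XT5", 1059, 1381, 6950),
    ("SGI Altix ICE", 487, 608, 2090),
    ("BlueGene IBM", 478, 596, 2329),
]


def linpack_efficiency(rmax_tflops, rpeak_tflops):
    """Fraction of the theoretical peak actually reached on Linpack."""
    return rmax_tflops / rpeak_tflops


def mflops_per_watt(rmax_tflops, power_kw):
    """Power efficiency: 1 Tflop/s = 1e6 Mflop/s, 1 kW = 1e3 W."""
    return (rmax_tflops * 1e6) / (power_kw * 1e3)


for name, rmax, rpeak, power in systems:
    print(f"{name:16s} efficiency {linpack_efficiency(rmax, rpeak):.0%}, "
          f"{mflops_per_watt(rmax, power):.0f} MFlops/Watt")
```

MFlops/Watt is the metric the Green500 list ranks on.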
Update November 2009
#   Name                                   N_PE     Rmax (Tflop/s)   Rpeak (Tflop/s)   P (kW)
1   Jaguar-Cray XT5-HE, Oak Ridge, USA     224162   1759             2331              6951
2   Roadrunner IBM, DOE, USA               122400   1042             1376              2346
3   Kraken Cray XT5-HE, Tennessee, USA     98928    832              1029              -
4   BlueGene IBM, Juelich, Germany         294912   826              1003              2268
5   Tianhe Xeon / ATI cluster, China       71680    563              1206              -
Update November 2010
#   Name                                          N_PE     Rmax (Tflop/s)   Rpeak (Tflop/s)   P (kW)
1   Tianhe-1A, China (Intel + NVIDIA GPU)         186368   2566             4701              4040
2   Jaguar-Cray XT5, DOE, USA (Opteron 6-cores)   224162   1759             2331              6950
3   Nebulae, China (Intel + NVIDIA GPU)           120640   1271             2984              2580
4   TSUBAME, NEC, Japan (Intel + NVIDIA GPU)      73278    1192             2287              1399
5   Hopper-Cray XE6                               138368   1050             1254              4590
Update Nov 2011
• 1st: K Computer:
– 10.51 Petaflop/s on Linpack
– 705024 SPARC64 cores (Fujitsu design)
– Tofu interconnect (6-D torus)
– 12.7 Megawatt
• 2nd: Chinese Tianhe-1A:
– 2.57 Petaflop/s
– 186368 cores (Xeon + NVIDIA processors)
– 4.0 Megawatt
Alternative ranking: Green500
• Most Power efficient Supercomputers
– See www.green500.org
• 2008: best result = 536 MFlops/Watt =>
1.87 nJ / FloatingPt_operation
• 2009: best result = 723 MFlops/Watt =>
1.38 nJ / FloatingPt_operation
– Cell cluster, ranking 110 in top500
• 2010: best result = 1684 MFlops/Watt =>
594 pJ / FloatingPt operation
– IBM BlueGene/Q prototype, ranking 101 in top500, Peakperf: 65 TFlops;
see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/
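The step from MFlops/Watt to energy per operation follows from 1 W = 1 J/s: energy per operation is the reciprocal of operations per second per watt. A short sketch that reproduces the numbers above (the function name is illustrative):

```python
def joules_per_flop(mflops_per_watt: float) -> float:
    """Energy per floating-point operation, given MFlops per Watt."""
    return 1.0 / (mflops_per_watt * 1e6)


for year, best in [(2008, 536), (2009, 723), (2010, 1684)]:
    nanojoules = joules_per_flop(best) * 1e9
    print(f"{year}: {best} MFlops/Watt -> {nanojoules:.2f} nJ per operation")
```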
Energy cost
At ~$1M per MW, energy costs are substantial
• 1 petaflop in 2010 will use 3 MW
• 1 exaflop in 2018 possible in 200 MW with “usual” scaling
• 1 exaflop in 2018 at 20 MW is DOE target
[Figure: normal scaling vs. desired scaling]
from: Kathy Yelick, Berkeley
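The three data points above translate into required power efficiency as follows. This is a sketch only: the slide gives ~$1M per MW without a billing period, so the one-year period used here is an assumption, and the names are illustrative.

```python
COST_PER_MW = 1e6  # dollars per MW (from the slide); per year is an ASSUMPTION


def flops_per_watt(flops: float, megawatts: float) -> float:
    """Sustained operations per second delivered per watt of power."""
    return flops / (megawatts * 1e6)


scenarios = {
    "1 petaflop, 2010, 3 MW": (1e15, 3),
    "1 exaflop, usual scaling, 200 MW": (1e18, 200),
    "1 exaflop, DOE target, 20 MW": (1e18, 20),
}

for label, (flops, mw) in scenarios.items():
    gflops_w = flops_per_watt(flops, mw) / 1e9
    cost_musd = mw * COST_PER_MW / 1e6
    print(f"{label:34s} {gflops_w:6.2f} GFlops/W, ~${cost_musd:.0f}M per year (assumed)")

# Improvement in Flops/W needed to go from the 2010 point to the DOE target.
print(flops_per_watt(1e18, 20) / flops_per_watt(1e15, 3))
```

The last line prints a factor of about 150, against a factor of about 15 for the "usual" scaling case.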
Nr1 (2008): Roadrunner
• IBM cluster
• 6480 nodes with
– Dual core Opteron 1.8 GHz
– 2 * PowerXCell 8i 3.2 GHz (12.8 GFlops)
• Infiniband connection fabric (16 Gbit/s per link)
– FAT tree interconnect
• 100 Tbyte DRAM memory
• 216 I/O nodes
• MPI programming
• 2.35 MW power !!
• Size: 296 racks, 5500 ft^2
This is huge !!
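The node description can be cross-checked against the N_PE column of the Nov. 2008 table. The sketch below assumes each Cell counts as 1 PPE + 8 SPEs and that the quoted 12.8 GFlops is the peak of one SPE; both are assumptions about how the slide's figures are meant.

```python
NODES = 6480
OPTERON_CORES_PER_NODE = 2   # one dual-core Opteron per node
CELLS_PER_NODE = 2           # 2 * PowerXCell 8i per node
SPES_PER_CELL = 8
PPES_PER_CELL = 1
SPE_GFLOPS = 12.8            # ASSUMPTION: the 12.8 GFlops figure is per SPE


def processing_elements() -> int:
    """Total number of cores, counting Opteron cores, PPEs and SPEs."""
    per_node = OPTERON_CORES_PER_NODE + CELLS_PER_NODE * (PPES_PER_CELL + SPES_PER_CELL)
    return NODES * per_node


def spe_peak_tflops() -> float:
    """Peak contributed by the SPEs alone, in Tflop/s."""
    return NODES * CELLS_PER_NODE * SPES_PER_CELL * SPE_GFLOPS / 1e3


print(processing_elements())        # compare with N_PE = 129600 in the table
print(f"{spe_peak_tflops():.0f}")   # compare with Rpeak = 1456 Tflop/s
```

Under these assumptions the SPEs account for most of the listed peak, which is why Roadrunner is called Cell-accelerated.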
Cell/B.E. – the architecture
1 x PPE 64-bit PowerPC
L1: 32 KB I$ + 32 KB D$
L2: 512 KB
8 x SPE cores:
Local store: 256 KB
128 x 128-bit vector registers
Hybrid memory model:
PPE: Rd/Wr
SPEs: Asynchronous DMA
• EIB: 205 GB/s sustained aggregate bandwidth
• Processor-to-memory bandwidth: 25.6 GB/s
• Processor-to-processor: 20 GB/s in each direction
Roadrunner: TriBlade = 2 nodes
For more details: Presentation slides of Ken Koch, March 2008
Nr2 (2008): Jaguar Cray XT5 QC
• I guess 5 times:
– 7832 quad-core 2.1 GHz AMD Opteron
– 62 TB memory (= 2GB / core)
– 600 TB file system
– 250 TFlop
• In total 150152 cores
• SeaStar2+ interconnect (from Cray)
• Note 2009: quad-cores replaced by six-cores
– now nr 1
– 224,256 cores
– 1.75 PetaFlop/s on Linpack (Rmax)
– paper: Bland A.S., Kendall R.A., Kothe D.B., Rogers J.H., Shipman G.M., "Jaguar: The World's Most Powerful Computer"
Jaguar
Nr3 (2008): SGI Altix ICE8200
• 92 racks of Altix ICE 8200EX
– with 3.0 GHz Intel Xeon quad-core processors
– 47,104 cores
• 8 racks of Altix ICE 8200
– with 2.66 GHz Intel quad-core processors
– 4096 cores
• 51 TB main memory
• DDR InfiniBand
Nr4 (2008): BlueGene/L IBM
• Based on ASIC with PowerPC 440, 700 MHz, each 2.8 GFlops
• 106,496 nodes (2 cores each, giving the 212992 PEs in the ranking)
• 3D torus interconnect for p2p communication + collective network
[Figure: 3D torus, rack, complete system]
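In a 3D torus every node has six direct neighbours, and each dimension wraps around. A minimal sketch of neighbour lookup and hop counting; the 8 x 8 x 8 size in the usage lines is purely illustrative and is not the real BlueGene/L configuration:

```python
def torus_neighbors(coord, dims):
    """Six nearest neighbours of a node in a 3D torus (wrap-around per axis)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, +1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal hop count: per axis, take the shorter way around the ring."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))


dims = (8, 8, 8)  # hypothetical torus size, for illustration only
print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (7, 4, 1), dims))
```

Thanks to the wrap-around links, the worst-case distance per axis is half the ring length, which keeps point-to-point latency bounded as the machine grows.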
BlueGene/L ASIC node
BlueGene/L Node board
• 16 cards with 2 ASICs each
• 8 GB
• 180 Gflop
2009: BlueGene/P
• ASIC: 13.6 GFlops, 8 MB EDRAM
• Processor card: one 4-processor chip, 13.6 GFlops, 2-4 GB
• Node card: 32 processor cards, 435 GFlops, 64-128 GB
• Rack: 32 node cards, 13.9 TF/s, 2-4 TB
• System: 256 racks, 3.56 PFlops, up to 1 PB
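The packaging hierarchy multiplies out level by level. A short sketch that rolls the per-chip peak up to the full system, using only the fan-out numbers from the slide (the names `levels` and `rollup` are illustrative):

```python
CHIP_GFLOPS = 13.6   # one 4-processor chip per processor card

# (level name, how many units of the previous level it contains)
levels = [
    ("node card", 32),   # processor cards per node card
    ("rack", 32),        # node cards per rack
    ("system", 256),     # racks per system
]


def rollup(base_gflops, levels):
    """Peak GFlops at each packaging level, multiplying by the fan-out."""
    result = {}
    perf = base_gflops
    for name, fanout in levels:
        perf *= fanout
        result[name] = perf
    return result


for name, gflops in rollup(CHIP_GFLOPS, levels).items():
    print(f"{name:10s} {gflops:14.1f} GFlops")
```

The products are about 435 GFlops per node card, 13.9 TFlops per rack and 3.56 PFlops for 256 racks, matching the slide.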
BlueGene/P ASIC
PPC450: Exploiting SIMD
• Two FPUs
– 2 x 32 64-bit registers
• SIMD
– Datapath width = 16 bytes
– Feeds two FPUs, with 8 bytes each, every cycle
• Two FP multiply-add operations per cycle
– 3.4 GFLOP/s peak performance
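The 3.4 GFLOP/s figure follows from the clock rate and the two multiply-adds per cycle. A sketch of the arithmetic, taking the 850 MHz clock from the BlueGene/P ASIC slide and counting a multiply-add as two operations:

```python
CLOCK_GHZ = 0.85        # 850 MHz, from the BlueGene/P ASIC slide
FMAS_PER_CYCLE = 2      # one multiply-add per FPU, two FPUs
FLOPS_PER_FMA = 2       # a multiply-add counts as two floating-point operations
CORES_PER_CHIP = 4      # one 4-processor chip


def core_peak_gflops() -> float:
    """Peak of one PPC450 core."""
    return CLOCK_GHZ * FMAS_PER_CYCLE * FLOPS_PER_FMA


def chip_peak_gflops() -> float:
    """Peak of the whole 4-core ASIC."""
    return core_peak_gflops() * CORES_PER_CHIP


print(f"{core_peak_gflops():.1f} GFLOP/s per core, {chip_peak_gflops():.1f} GFLOP/s per chip")
```

The per-chip result is the 13.6 GFlops quoted for the BlueGene/P ASIC.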
BlueGene/P ASIC
• 208M transistors
• 850 MHz
• 16 W
• 90 nm
BlueGene/P node card
Next: BlueGene/Q
• 10 PFlops in 2011-2012
• see www.research.ibm.com/bluegene
Can we match the human brain ???
• Performance = 100 Billion (10^11) Neurons * 1000 (10^3) Connections/Neuron * 200 (2 * 10^2) Calculations Per Second Per Connection = 2 * 10^16 Calculations Per Second
• Memory = 100 Billion (10^11) Neurons * 1000 (10^3) Connections/Neuron * 10 bytes (information about connection strength and address of output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB
How far off are we?
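A rough answer can be computed from numbers elsewhere in this deck. The sketch below equates one "calculation" with one floating-point operation, which is a strong simplifying assumption, and compares against the K Computer Linpack result and the Roadrunner DRAM size:

```python
NEURONS = 1e11
CONNECTIONS_PER_NEURON = 1e3
CALCS_PER_SECOND_PER_CONNECTION = 200
BYTES_PER_CONNECTION = 10

brain_ops = NEURONS * CONNECTIONS_PER_NEURON * CALCS_PER_SECOND_PER_CONNECTION
brain_bytes = NEURONS * CONNECTIONS_PER_NEURON * BYTES_PER_CONNECTION

K_COMPUTER_FLOPS = 10.51e15     # Linpack, Nov 2011 update
ROADRUNNER_DRAM_BYTES = 100e12  # 100 Tbyte DRAM, Roadrunner slide


def shortfall(target: float, available: float) -> float:
    """How many times larger the target is than what the machine offers."""
    return target / available


# ASSUMPTION: one neural "calculation" is treated as one floating-point operation.
print(f"compute: {shortfall(brain_ops, K_COMPUTER_FLOPS):.1f}x the K Computer")
print(f"memory:  {shortfall(brain_bytes, ROADRUNNER_DRAM_BYTES):.0f}x Roadrunner's DRAM")
```

By this crude measure raw compute is within a factor of two, and memory capacity within a factor of ten.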
Blue brain research
• Software replica of one column of the neocortex
– cortex: 85% of the brain's total mass
– required for language, learning, memory and complex thought
– the essential first step to simulating the whole brain
• Next: include circuitry from other brain regions, and eventually the whole brain.
Latest news: factorization of RSA768
• RSA is used to encipher text using both a public and a private key
• EPFL, CWI and others have broken RSA768
• This means: factorizing a 768-bit number into 2 primes
• Using 1700 AMD 2.2 GHz cores for 1 year (1700 * 8760 h) => 15 Mh (single core) compute time
• Current RSA standard uses 1024 bits
– still safe for some years
RSA (Rivest, Shamir, Adleman)
• choose 2 (large) primes p and q
• n = p*q
• choose e such that e and (p-1)(q-1) are coprime (i.e. do not share prime factors)
• choose d such that d*e = 1 mod ((p-1)(q-1))
• public key = (n,e)
private key = (n,d)
• Encryption of message m: c = m^e mod n
• Decryption of cypher c: m = c^d mod n
• see wikipedia for details and working example
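The steps above can be sketched in a few lines. The primes 61 and 53, the exponent 17 and the message 65 are toy values chosen for illustration (far too small for real use), and the function names are not from any library:

```python
from math import gcd


def make_keys(p, q, e):
    """Build (public, private) key pairs from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to (p-1)(q-1)")
    d = pow(e, -1, phi)   # modular inverse of e; needs Python 3.8+
    return (n, e), (n, d)


def encrypt(m, public_key):
    n, e = public_key
    return pow(m, e, n)   # c = m^e mod n


def decrypt(c, private_key):
    n, d = private_key
    return pow(c, d, n)   # m = c^d mod n


public, private = make_keys(61, 53, 17)   # toy primes, illustration only
cipher = encrypt(65, public)
print(public, private, cipher, decrypt(cipher, private))
```

Breaking the scheme means recovering p and q from n, which is the factorization task discussed on the RSA768 slides.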
RSA factorization result
• factorization of RSA768, the following 768-bit, 232-digit
number from RSA's challenge list:
• 1230186684530117755130494958384962720772853569595334
79219732245215172640050726365751874520219978646938995
6474942774063845925192557326303453731548268507917026
1221429134616704292143116022212404792747377940806653
51419597459856902143413
=
3347807169895689878604416984821269081770479498371376
8568912431388982883793878002287614711652531743087737
814467999489
*
3674604366679959042824463379962795263227915816434308
7642676032283815739666511279233373417143396810270092
798736308917
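Checking a claimed factorization is cheap even though finding it took about 15 million core-hours. The sketch below multiplies the two factors listed above and reports the size of the product, which can then be compared digit by digit with the challenge number:

```python
# The two prime factors as listed above, split across lines for readability.
p = int(
    "3347807169895689878604416984821269081770479498371376"
    "8568912431388982883793878002287614711652531743087737"
    "814467999489"
)
q = int(
    "3674604366679959042824463379962795263227915816434308"
    "7642676032283815739666511279233373417143396810270092"
    "798736308917"
)

n = p * q   # Python integers have arbitrary precision
print(len(str(n)), "decimal digits,", n.bit_length(), "bits")
print(n)
```

The asymmetry between this one-line multiplication and the cost of factoring is what RSA relies on.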