Microprocessor Design 2002


Advanced Computer Architecture 5MD00 / 5Z033 TOP 500 supercomputers

Henk Corporaal www.ics.ele.tue.nl/~heco/courses/aca

TUEindhoven 2009

Topics

• How to cross the Petaflop boundary
• Ranking
  – Nov 2008
  – Nov 2009: what has changed
• Examples
  – Roadrunner (IBM)
  – Jaguar (Cray)
  – SGI Altix
  – BlueGene

4/30/2020 ACA H.Corporaal


How to build a Petaflop supercomputer?

• Opteron cluster (e.g. ~2X Ranger/TACC)
  – 32,000 quad-core Opterons (130K cores)
• Cray XT3/4 (e.g. Baker/ORNL sooner)
  – 32,000 quad-core Opterons (130K cores)
• IBM BlueGene/P (bigger sooner)
  – 80,000 BG/P PPC processors (320K cores)
• IBM Cell-accelerated Roadrunner cluster
  – 10,000 Cells (80K Cell SPUs)
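The core counts on this slide follow from simple multiplication. The sketch below reproduces them and derives what each core would have to deliver for the machine to reach 1 PFlops peak. That per-core figure is not on the slide, and all names are illustrative. Note that 32,000 x 4 gives 128K, which the slide rounds to 130K.

```python
PETAFLOP = 1e15  # floating-point operations per second

# (system, processor count, cores per processor) as listed on the slide
options = [
    ("Opteron cluster", 32_000, 4),
    ("Cray XT3/4", 32_000, 4),
    ("IBM BlueGene/P", 80_000, 4),
    ("Cell-accelerated Roadrunner", 10_000, 8),  # 8 SPUs per Cell
]


def total_cores(processors, cores_per_processor):
    """Total number of cores in the machine."""
    return processors * cores_per_processor


def gflops_per_core_needed(cores, target=PETAFLOP):
    """Peak GFlops each core must supply to reach `target` in aggregate."""
    return target / cores / 1e9


for name, procs, per_proc in options:
    cores = total_cores(procs, per_proc)
    need = gflops_per_core_needed(cores)
    print(f"{name}: {cores} cores, {need:.1f} GFlops per core needed")
```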


Supercomputer Ranking

• Started in 1993
• Jack Dongarra, University of Tennessee
• Based on LINPACK benchmark
  – linear algebra (LU factorization)
• LINPACK library superseded by LAPACK
  – based on BLAS (Basic Linear Algebra Subprograms)
  – exploits caches
• Measures floating-point performance
• Fortran code
• see http://www.top500.org
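To make the benchmark concrete, here is a minimal plain-Python sketch of what LINPACK measures: solving a dense system A x = b by Gaussian elimination with partial pivoting, the textbook form of LU factorization. It is not the tuned Fortran/BLAS code used for the actual ranking. The function names are illustrative, and the operation count uses only the leading 2/3 n^3 term.

```python
def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination (LU) with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
            b[i] -= f * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def lu_flops(n):
    """Leading-order operation count of an n x n solve: 2/3 n^3."""
    return 2.0 * n**3 / 3.0


def gflops_rate(n, seconds):
    """Achieved rate in GFlops for an n x n solve that took `seconds`."""
    return lu_flops(n) / seconds / 1e9


print(lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

Rmax in the ranking is this kind of rate, measured on the full machine for the largest problem it can hold.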


Performance Ranking Nov. 2008

#    Name                      N_PE    Rmax (Tflop)  Rpeak (Tflop)  P (kW)
1    Roadrunner IBM            129600  1105          1456           2483
2    Cray XT5                  150152  1059          1381           6950
3    SGI Altix ICE             51200   487           608            2090
4    BlueGene IBM              212992  478           596            2329
100  Cluster Platform (Xeons)  5120    27            51
52   Power 575 SARA (Amst)     3328    49            63
75   BlueGene Astron           12288   35            42
496  Cluster in Gent Univ.     1568    13            16

P (kW), remaining values as listed: 532 95 51
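Two derived figures help when reading the Nov. 2008 top four: the fraction of peak that LINPACK actually achieves (Rmax / Rpeak) and the power efficiency (Rmax / P). The sketch below computes both from the values in the ranking; the function names are illustrative.

```python
# (name, Rmax in TFlops, Rpeak in TFlops, power in kW) for the Nov. 2008 top four
top4 = [
    ("Roadrunner IBM", 1105, 1456, 2483),
    ("Cray XT5", 1059, 1381, 6950),
    ("SGI Altix ICE", 487, 608, 2090),
    ("BlueGene IBM", 478, 596, 2329),
]


def linpack_efficiency(rmax, rpeak):
    """Fraction of theoretical peak reached on LINPACK."""
    return rmax / rpeak


def mflops_per_watt(rmax_tflops, power_kw):
    """Power efficiency; 1 TFlops = 1e6 MFlops and 1 kW = 1e3 W."""
    return rmax_tflops * 1e6 / (power_kw * 1e3)


for name, rmax, rpeak, kw in top4:
    eff = linpack_efficiency(rmax, rpeak)
    mpw = mflops_per_watt(rmax, kw)
    print(f"{name}: {eff:.0%} of peak, {mpw:.0f} MFlops/W")
```

On these numbers the Cell-based Roadrunner comes out roughly three times as power-efficient as the Cray XT5.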


Performance Ranking

2008: we crossed the Petaflop boundary (Rmax above 1000 Tflop for numbers 1 and 2)

Update November 2009

#  Name                                N_PE    Rmax (Tflop)  Rpeak (Tflop)  P (kW)
1  Jaguar-Cray XT5-HE, Oak Ridge, USA  224162  1759          2331           6951
2  Roadrunner IBM, DOE, USA            122400  1042          1376           2346
3  Kraken Cray XT5-HE, Tennessee, USA  98928   832           1029           -
4  BlueGene IBM, Juelich, Germany      294912  826           1003           2268
5  Tianhe Xeon / ATI cluster, China    71680   563           1206           -


Alternative ranking: Green500

• Most power-efficient supercomputers
• 2008: best result = 536 MFlops/Watt => 1.87 nJ / FloatingPt_operation
• 2009: best result = 723 MFlops/Watt => 1.38 nJ / FloatingPt_operation
  – Cell cluster, ranking 110 in top500
• See www.green500.org
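The conversion from MFlops/Watt to energy per operation is a unit inversion: one watt is one joule per second, so X MFlops/W means X * 10^6 operations per joule. A small sketch (function name illustrative) reproduces the two figures above:

```python
def nanojoules_per_flop(mflops_per_watt):
    """Energy per floating-point operation, in nJ.

    1 W = 1 J/s, so X MFlops/W means X * 1e6 operations per joule.
    """
    joules_per_flop = 1.0 / (mflops_per_watt * 1e6)
    return joules_per_flop * 1e9


for year, best in [(2008, 536), (2009, 723)]:
    print(year, f"{nanojoules_per_flop(best):.2f} nJ per operation")
```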


Nr1 (2008): Roadrunner

• IBM cluster
• 6480 nodes with
  – Dual core Opteron 1.8 GHz
  – 2 * PowerXCell 8i 3.2 GHz (12.8 GFlops)
• Infiniband connection fabric (16 Gbit/s per link)
  – FAT tree interconnect
• 100 Tbyte DRAM memory
• 216 I/O nodes
• MPI programming
• 2.35 MW power !!
• Size: 296 racks, 5500 ft^2

This is huge !!


Cell/B.E. – the architecture

• 1 x PPE: 64-bit PowerPC
  – L1: 32 KB I$ + 32 KB D$
  – L2: 512 KB
• 8 x SPE cores
  – Local store: 256 KB
  – 128 x 128 bit vector registers
• Hybrid memory model
  – PPE: Rd/Wr
  – SPEs: Asynchronous DMA
• EIB: 205 GB/s sustained aggregate bandwidth
• Processor-to-memory bandwidth: 25.6 GB/s
• Processor-to-processor: 20 GB/s in each direction
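Combining the Roadrunner node description with the Cell core counts above gives the number of processing elements reported in the Nov. 2008 ranking (N_PE = 129600). The sketch below does the bookkeeping. It assumes that the 12.8 GFlops quoted for the PowerXCell 8i is the peak per SPE, which is my reading rather than something the slide states; the constant names are illustrative.

```python
NODES = 6480
OPTERON_CORES_PER_NODE = 2   # one dual-core Opteron per node
CELLS_PER_NODE = 2           # two PowerXCell 8i per node
PPE_PER_CELL = 1
SPE_PER_CELL = 8
GFLOPS_PER_SPE = 12.8        # assumption: the slide's 12.8 GFlops is per SPE

cells = NODES * CELLS_PER_NODE
cores = (NODES * OPTERON_CORES_PER_NODE
         + cells * (PPE_PER_CELL + SPE_PER_CELL))
spe_peak_tflops = cells * SPE_PER_CELL * GFLOPS_PER_SPE / 1000

print(cells, "Cell processors")
print(cores, "processing elements in total")
print(f"{spe_peak_tflops:.0f} TFlops peak from the SPEs alone")
```

Under that assumption the SPEs account for most of the 1456 TFlops Rpeak. The rest would come from the PPE and Opteron cores, which the slides do not quantify.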


Roadrunner: TriBlade = 2 nodes

For more details: Presentation slides of Ken Koch, March 2008


Nr2 (2008): Jaguar Cray XT5 QC

• I guess 5 times:
  – 7832 quad-core 2.1 GHz AMD Opteron
  – 62 TB memory (= 2 GB / core)
  – 600 TB file system
  – 250 TFlop
• In total 150152 cores
• SeaStar2+ interconnect (from Cray)
• Note 2009: quad-cores replaced by six-cores
  – now nr 1
  – 224,256 cores
  – 1.75 PetaFlop (Rmax)


Jaguar


Nr3 (2008): SGI Altix ICE8200

• 92 racks of Altix ICE 8200EX
  – with 3.0 GHz Intel Xeon quad-core processors
  – 47,104 cores
• 8 racks of Altix ICE 8200
  – with 2.66 GHz Intel quad-core
  – 4096 cores
• 51 TB main memory
• DDR InfiniBand


Nr4 (2008): BlueGene/L IBM

• Based on ASIC with PowerPC 440, 700 MHz, each 2.8 GFlops
• 106,496 nodes
• 3D torus interconnect for p2p communication + collective network

(Figures: 3D torus, rack, complete system)

BlueGene/L ASIC node


BlueGene/L Node board

• 16 cards with 2 ASICs each
• 8 GB
• 180 Gflop
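The BlueGene/L figures are consistent with each other, which can be checked from the per-core peak. The sketch below assumes two PowerPC 440 cores per ASIC. That count is not spelled out on the slides, but it is what makes the 180 Gflop node board and the Rpeak in the Nov. 2008 ranking come out. Constant names are illustrative.

```python
GFLOPS_PER_CORE = 2.8      # PowerPC 440 at 700 MHz, from the slide
CORES_PER_ASIC = 2         # assumption: dual-core BlueGene/L ASIC
ASICS_PER_CARD = 2
CARDS_PER_NODE_BOARD = 16
SYSTEM_CORES = 212_992     # N_PE from the Nov. 2008 ranking

node_board_gflops = (CARDS_PER_NODE_BOARD * ASICS_PER_CARD
                     * CORES_PER_ASIC * GFLOPS_PER_CORE)
system_tflops = SYSTEM_CORES * GFLOPS_PER_CORE / 1000

print(f"node board: {node_board_gflops:.1f} GFlops")
print(f"system:     {system_tflops:.0f} TFlops peak")
```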


2009: BlueGene/P

• ASIC: 13.6 GFlops, 8 MB EDRAM
• Processor card: one 4-processor chip, 13.6 GFlops, 2-4 GB
• Node card: 32 processor cards, 435 GFlops, 64-128 GB
• Rack: 32 node cards, 13.9 TFlops, 2-4 TB
• System: 256 racks, 3.56 PFlops, up to 1 PB
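The BlueGene/P packaging hierarchy is a chain of multiplications, so the per-level figures can be rebuilt from the chip alone. A minimal sketch, with an illustrative helper name:

```python
def scale(per_chip, chips_per_node_card=32, node_cards_per_rack=32, racks=256):
    """Scale a per-chip quantity up to node card, rack and full system."""
    node_card = per_chip * chips_per_node_card
    rack = node_card * node_cards_per_rack
    system = rack * racks
    return node_card, rack, system


# Peak performance, starting from 13.6 GFlops per chip
node_card_gf, rack_gf, system_gf = scale(13.6)
print(f"node card: {node_card_gf:.1f} GFlops")
print(f"rack:      {rack_gf / 1e3:.2f} TFlops")
print(f"system:    {system_gf / 1e6:.3f} PFlops")

# Memory, starting from the maximum of 4 GB per processor card
_, _, system_gb_max = scale(4)
print(f"memory:    {system_gb_max / 1024**2:.0f} PB maximum")
```

The 1 PB memory figure only works out with binary prefixes (1024 GB per TB).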

BlueGene/P ASIC


PPC450: Exploiting SIMD

• Two FPUs
  – 2 x 32 64-bit registers
• SIMD
  – Datapath width = 16 bytes
  – Feeds two FPUs with 8 bytes each every cycle
• Two FP multiply-add operations per cycle
  – 3.4 GFLOP/s peak performance


BlueGene/P ASIC

• 208M transistors
• 850 MHz
• 16 W
• 90 nm
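The 3.4 GFLOP/s per PPC450 core and the 13.6 GFlops per chip both follow from the 850 MHz clock. The sketch below spells that out, counting a multiply-add as two floating-point operations (the usual convention). The chip-level MFlops/W figure is derived here and is not on the slides.

```python
CLOCK_GHZ = 0.85       # 850 MHz
FPUS_PER_CORE = 2      # two FPUs fed by the 16-byte SIMD datapath
FLOPS_PER_FMA = 2      # one multiply-add counts as two operations
CORES_PER_CHIP = 4     # "one 4-processor chip"
CHIP_POWER_W = 16

core_gflops = CLOCK_GHZ * FPUS_PER_CORE * FLOPS_PER_FMA
chip_gflops = core_gflops * CORES_PER_CHIP
chip_mflops_per_watt = chip_gflops * 1000 / CHIP_POWER_W

print(f"core: {core_gflops:.1f} GFlops peak")
print(f"chip: {chip_gflops:.1f} GFlops peak")
print(f"chip: {chip_mflops_per_watt:.0f} MFlops/W peak (chip only)")
```

The chip-only number ignores memory, network and cooling, so it is an upper bound rather than something comparable to a Green500 entry.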


BlueGene/P node card


Next: BlueGene/Q

• 10 PFlops in 2011-2012
• see www.research.ibm.com/bluegene


Can we match the human brain ???

• Performance = 100 Billion (10^11) Neurons * 1000 (10^3) Connections/Neuron * 200 (2 * 10^2) Calculations Per Second Per Connection = 2 * 10^16 Calculations Per Second
• Memory = 100 Billion (10^11) Neurons * 1000 (10^3) Connections/Neuron * 10 bytes (information about connection strength and address of output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB

How far off are we?
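A rough answer can be computed from the numbers already given. The sketch below redoes the brain estimate and compares it with the Rmax of the Nov. 2009 number 1 (Jaguar, 1759 Tflop). Treating one "calculation" as one floating-point operation is an assumption made only for this comparison.

```python
NEURONS = 1e11
CONNECTIONS_PER_NEURON = 1e3
CALCS_PER_SECOND_PER_CONNECTION = 200
BYTES_PER_CONNECTION = 10

brain_ops_per_s = (NEURONS * CONNECTIONS_PER_NEURON
                   * CALCS_PER_SECOND_PER_CONNECTION)
brain_bytes = NEURONS * CONNECTIONS_PER_NEURON * BYTES_PER_CONNECTION

JAGUAR_RMAX_FLOPS = 1759e12  # Nov. 2009 number 1, Rmax in flop/s

# Assumption: one "calculation" is comparable to one floating-point operation.
gap = brain_ops_per_s / JAGUAR_RMAX_FLOPS

print(f"brain:  {brain_ops_per_s:.0e} calculations/s, {brain_bytes:.0e} bytes")
print(f"factor: about {gap:.0f}x the 2009 number 1 machine")
```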


Blue brain research

• Software replica of one column of the neocortex
  – cortex: 85% of brain's total mass
  – required for language, learning, memory and complex thought
  – the essential first step to simulating the whole brain
• Next: include circuitry from other brain regions and eventually the whole brain.


Latest news: factorization of RSA768

• RSA is used to encipher text using a public and a private key
• EPFL, CWI and others have broken RSA768
• This means: factorize a 768-bit number into 2 primes
• Using 1700 AMD 2.2 GHz cores for 1 year => about 15 million hours (single core) compute time
• Current RSA standard uses 1024 bits
  – still safe for some years
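The compute-time figure is just cores times hours in a year, as this short check shows (names illustrative):

```python
CORES = 1700
HOURS_PER_YEAR = 365 * 24

core_hours = CORES * HOURS_PER_YEAR
print(f"{core_hours / 1e6:.1f} million single-core hours")
```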


RSA (Rivest, Shamir, Adleman)

• choose 2 (large) primes p and q
• n = p*q
• choose e such that e and (p-1)(q-1) are coprime (i.e. do not share prime factors)
• choose d such that d*e = 1 mod ((p-1)(q-1))
• public key = (n,e), private key = (n,d)
• Encryption of message m: c = m^e mod n
• Decryption of cypher c: m = c^d mod n
• see wikipedia for details and working example
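The steps above can be tried with toy numbers. This sketch uses tiny primes (p = 61, q = 53, e = 17), chosen purely for illustration and far too small to be secure. The function names are hypothetical, and `pow(e, -1, phi)` for the modular inverse needs Python 3.8 or later.

```python
from math import gcd


def make_keys(p, q, e):
    """Build (public, private) RSA keys from primes p, q and exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to (p-1)(q-1)")
    d = pow(e, -1, phi)  # d*e = 1 mod (p-1)(q-1)
    return (n, e), (n, d)


def encrypt(m, public_key):
    n, e = public_key
    return pow(m, e, n)  # c = m^e mod n


def decrypt(c, private_key):
    n, d = private_key
    return pow(c, d, n)  # m = c^d mod n


public, private = make_keys(p=61, q=53, e=17)
c = encrypt(65, public)
print("public:", public, "private:", private)
print("cipher:", c, "decrypted:", decrypt(c, private))
```

Anyone who can factor n into p and q can recompute d the same way, which is why the RSA768 factorization matters.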


RSA factorization result

• factorization of RSA768, the following 768-bit, 232-digit number from RSA's challenge list:

1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413

=

33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489

*

36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917
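Checking a claimed factorization is trivial compared with finding it: multiply the two factors and inspect the result. The sketch below does this with Python's arbitrary-precision integers, using the two 116-digit factors from the slide.

```python
# The two prime factors as given on the slide (116 digits each)
p = int(
    "3347807169895689878604416984821269081770479498371376"
    "8568912431388982883793878002287614711652531743087737"
    "814467999489"
)
q = int(
    "3674604366679959042824463379962795263227915816434308"
    "7642676032283815739666511279233373417143396810270092"
    "798736308917"
)

n = p * q
print(n)
print(len(str(n)), "decimal digits,", n.bit_length(), "bits")
```

The product should have 232 decimal digits and 768 bits, as stated for RSA768.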
