apeNEXT
Piero Vicini
INFN Roma ([email protected])
Paris, May 2003
APE keywords
Parallel system
Massively parallel 3D array of computing nodes with periodic boundary conditions
Custom system
- Processor: extensive use of VLSI
- Native implementation of the complex type
- Large register file
- VLIW microcode
- Native "normal" operation a x b + c (complex numbers)
Node interconnections
- Optimized for nearest-neighbor communication
Software tools
- Apese, TAO, OS, machine simulator
Dense system
- Reliable and safe HW solution
- Custom mechanics for "wide" integration
Cheap system
- 0.5 €/MFlops
- Very low maintenance cost
The APE family: our line of home-made computers
Machine (year): architecture, # nodes, topology, memory, # registers (word size), clock speed, total computing power

APE (1988): SIMD, 16 nodes, flexible 1D, 256 MB, 64 registers (x32), 8 MHz, ~1.5 GFlops
APE100 (1993): SIMD, 2048 nodes, rigid 3D, 8 GB, 128 registers (x32), 25 MHz, ~250 GFlops
APEmille (1999): SIMD, 2048 nodes, flexible 3D, 64 GB, 512 registers (x32), 66 MHz, ~2 TFlops
apeNEXT (2003): SIMD++, 4096 nodes, flexible 3D, 1 TB, 512 registers (x64), 200 MHz, ~8-20 TFlops
APE (‘88) 1 GFlops
APE100 (1993) - 100 GFlops; PB (8 nodes) ~ 400 MFlops
APEmille – 1 TFlops
- 2048 VLSI processing nodes (0.5 GFlops each)
- SIMD, synchronous communications
- Fully integrated "host computer": 64 PCs, cPCI based
- Building blocks: computing node; "Processing Board" (PB), 8 nodes, 4 GFlops; "Torre" (tower), 32 PB, 128 GFlops
APEmille installations
Bielefeld: 130 GF (2 crates)
Zeuthen: 520 GF (8 crates)
Milan: 130 GF (2 crates)
Bari: 65 GF (1 crate)
Trento: 65 GF (1 crate)
Pisa: 325 GF (5 crates)
Rome 1: 650 GF (10 crates)
Rome 2: 130 GF (2 crates)
Orsay: 16 GF (1/4 crate)
Swansea: 65 GF (1 crate)

Grand total: ~1966 GF
The apeNEXT architecture
- 3D mesh of computing nodes
- Custom VLSI processor, 200 MHz (J&T)
- 1.6 GFlops per node (complex "normal" operation)
- 256 MB (up to 1 GB) memory per node
- First-neighbor communication network, "loosely synchronous"
  - Y, Z links on the backplane, X on cables
  - r = 8/16 => 200 MB/s per channel
- Scalable from 25 GFlops to 6 TFlops:
  - Processing Board: 4 x 2 x 2 nodes, ~26 GF
  - Crate (16 PB): 4 x 8 x 8 nodes, ~0.5 TF
  - Rack (32 PB): 8 x 8 x 8 nodes, ~1 TF
  - Large systems: (8*n) x 8 x 8 nodes
- Linux PCs as host system

[Figure: PB node numbering with X+ (cables), Y+ and Z+ (backplane) links; each node has a J&T processor and DDR memory]
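The configurations listed above follow directly from the 1.6 GFlops-per-node figure (peak = number of nodes x 1.6 GFlops; the TFlops values on the slide are rounded). A minimal sketch of that arithmetic, purely illustrative and not part of any apeNEXT software:

```c
#include <stdio.h>

/* Peak performance of an nx x ny x nz apeNEXT partition, assuming the
 * quoted 1.6 GFlops per node (one complex "normal" per cycle at 200 MHz). */
static double peak_gflops(int nx, int ny, int nz)
{
    return 1.6 * nx * ny * nz;
}

int main(void)
{
    printf("PB    4 x 2 x 2 : %7.1f GFlops\n", peak_gflops(4, 2, 2));   /* ~26 GF  */
    printf("Crate 4 x 8 x 8 : %7.1f GFlops\n", peak_gflops(4, 8, 8));   /* ~0.5 TF */
    printf("Rack  8 x 8 x 8 : %7.1f GFlops\n", peak_gflops(8, 8, 8));   /* ~1 TF   */
    printf("Large 64 x 8 x 8: %7.1f GFlops\n", peak_gflops(64, 8, 8));  /* ~6 TF   */
    return 0;
}
```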
Design methodology

- VHDL incremental model of the (almost) whole system
- Custom (VLSI and/or FPGA) components derived from the VHDL model via synthesis tools
- Stand-alone simulation of each component's VHDL model + simulation of the "global" VHDL model
- Powerful test-bed for test-vector generation
=> First-Time-Right Silicon

Software design environment:
- Simplified but complete model of HW-host interaction
- Test environment for development of the compilation chain and OS
- Performance (architecture) evaluation at design time
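A sketch of the kind of test-vector comparison such a flow relies on: the same stimuli are fed to the synthesizable VHDL model and to the reference machine simulator, and the two output dumps are diffed word by word. The file names and format here are assumptions made for illustration, not the actual APE test-bed.

```c
#include <stdio.h>

/* Compare two test-vector dumps (one 64-bit hex word per line), e.g. RTL
 * simulation output against the reference simulator output. Illustrative only. */
int main(void)
{
    FILE *rtl = fopen("vectors_rtl.txt", "r");
    FILE *ref = fopen("vectors_sim.txt", "r");
    if (!rtl || !ref) { perror("fopen"); return 1; }

    unsigned long long a, b;
    long nwords = 0, mismatches = 0;
    while (fscanf(rtl, "%llx", &a) == 1 && fscanf(ref, "%llx", &b) == 1) {
        ++nwords;
        if (a != b) {
            printf("word %ld: rtl=%016llx ref=%016llx\n", nwords, a, b);
            ++mismatches;
        }
    }
    printf("%ld words compared, %ld mismatches\n", nwords, mismatches);
    fclose(rtl);
    fclose(ref);
    return mismatches != 0;
}
```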
Assembling apeNEXT…
[Photos: J&T ASIC, J&T module, PB, backplane, rack]
Overview of the J&T Architecture
- Peak floating-point performance of about 1.6 GFlops
- IEEE-compliant double precision
- Integer arithmetic performance of about 400 MIPS
- Link bandwidth of about 200 MByte/s in each direction, full duplex
- 7 links: X+, X-, Y+, Y-, Z+, Z-, "7th" (I/O)
- Support for current-generation DDR memory
- Memory bandwidth of 3.2 GByte/s (400 Mword/s)
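The memory and link figures follow from the 200 MHz clock and the quoted datapath widths (128-bit memory channel, 8-bit links). A small consistency check, with constants taken from the slide:

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz = 200e6;

    /* 128-bit local memory channel, one transfer per cycle */
    double mem_bytes_s = clock_hz * (128.0 / 8.0);   /* 3.2 GB/s            */
    double mem_words_s = mem_bytes_s / 8.0;          /* 400 M 64-bit words/s */

    /* each LVDS link carries 8 bits per cycle in each direction */
    double link_bits_s  = clock_hz * 8.0;            /* 1.6 Gb/s */
    double link_bytes_s = link_bits_s / 8.0;         /* 200 MB/s */

    printf("memory bandwidth: %.1f GB/s (%.0f Mword/s)\n",
           mem_bytes_s / 1e9, mem_words_s / 1e6);
    printf("link bandwidth  : %.1f Gb/s (%.0f MB/s per direction)\n",
           link_bits_s / 1e9, link_bytes_s / 1e6);
    return 0;
}
```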
J&T: Top Level Diagram
The J&T Arithmetic BOX
- Pipelined complex "normal" operation a*b + c (8 flops) per cycle
- 4 multipliers, 4 adder/subtractors
- At 200 MHz (fully pipelined): 8 flops x 200 MHz = 1.6 GFlops
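The 8-flop count comes from expanding the complex multiply-add into real arithmetic: 4 real multiplies plus 4 real additions/subtractions, matching the 4 multipliers and 4 adder/subtractors above. A scalar C illustration of the operation the pipeline evaluates each cycle (a sketch, not apeNEXT code):

```c
/* Complex "normal" operation n = a*b + c expanded into real arithmetic:
 * 4 multiplies + 4 add/sub = 8 flops per result.
 * One result per cycle at 200 MHz gives the quoted 1.6 GFlops peak. */
typedef struct { double re, im; } cplx;

static cplx normal_op(cplx a, cplx b, cplx c)
{
    cplx n;
    n.re = a.re * b.re - a.im * b.im + c.re;  /* 2 mul, 1 sub, 1 add */
    n.im = a.re * b.im + a.im * b.re + c.im;  /* 2 mul, 2 add        */
    return n;
}
```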
The J&T remote IO
FIFO-based communication:
- LVDS, 1.6 Gb/s per link (8 bit @ 200 MHz)
- 6 (+1) independent links
J&T summary
- CMOS 0.18 um, 7 metal layers (ATMEL)
- 200 MHz
- Double-precision complex "normal" operation
- 64-bit AGU
- 8 KW program cache
- 128-bit local memory channel
- 6+1 LVDS 200 MB/s links
- BGA package, 600 pins
PB
- Collaboration with NEURICAM spa
- 16 nodes, 3D-interconnected, 4x2x2 topology
- 26 GFlops, 4.6 GB memory
- Light system: J&T module connectors, glue logic (clock tree, 10 MHz), global signal interconnection (FPGA), DC-DC converters (48 V to 3.3/2.5 V)
- Dominant technologies:
  - LVDS: 1728 (16*6*2*9) differential signals at 200 Mb/s; 144 routed via cables, 576 via backplane, on 12 controlled-impedance layers
  - High-speed differential connectors: Samtec QTS (J&T module), Erni ERMET-ZD (backplane)
J&T Module
- J&T processor
- 9 DDR-SDRAM chips, 256 Mbit (x16)
- 6 LVDS links, up to 400 MB/s
- Host fast I/O link (7th link)
- I2C link (slow control network)
- Dual power 2.5 V + 1.8 V, 7-10 W estimated
- Dominant technologies: SSTL-II (memory interface), LVDS (network interface + I/O)
NEXT BackPlane
- 16 PB slots + root slot
- Size 447 x 600 mm2
- 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
- 16 controlled-impedance layers (32 total)
- Press-fit only, Erni/Tyco ERMET-ZD connectors
- Providers: APW (primary), ERNI (2nd source)
- Connector kit cost: 7 kEuro (!)
- PB insertion force: 80-150 kg (!)
PB Mechanics
PB constraints:
- Power consumption: up to 340 W
- PB-backplane insertion force: 80-150 kg (!)
- Fully populated PB weight: 4-5 kg
- Board-to-board connector
- Detailed study of airflow
- Custom design of card frame and insertion tool

[Figure: apeNEXT PB top view, showing DC/DC converters, J&T modules with frames, and air-flow channels]
Rack mechanics
Problem:
- PB weight: 4-5 kg; PB power consumption: 340 W (est.)
- 32 PB + 2 root boards per rack
- Power supply: < 48 V x 150 A per crate
- Integrated host PCs
- Forced-air cooling; robust, expandable/modular, CE and EMC compliant...

Solution:
- 42U rack (h: 2.10 m): EMC proof, efficient cable routing
- 19"-1U slots for 9 "host PCs" (rack mounted)
- Hot-swap power supply cabinet (modular)
- Custom design of "card cage" and "tie bar"
- Custom design of cooling system
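As a sanity check on the power-supply rating (figures taken from the slide; root boards, fans and conversion losses not included), a crate of 16 PB at the estimated 340 W per board stays within the 48 V x 150 A budget:

```c
#include <stdio.h>

int main(void)
{
    const double pb_watts     = 340.0;         /* estimated, per Processing Board */
    const int    pb_per_crate = 16;
    const double supply_watts = 48.0 * 150.0;  /* < 48 V x 150 A per crate */

    double crate_watts = pb_watts * pb_per_crate;            /* ~5.4 kW */
    printf("crate load   : %.1f kW\n", crate_watts / 1e3);
    printf("crate supply : %.1f kW\n", supply_watts / 1e3);  /* 7.2 kW  */
    printf("headroom     : %.0f %%\n",
           100.0 * (supply_watts - crate_watts) / supply_watts);
    return 0;
}
```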
Host I/O Architecture
- 7th link (200 MB/s)
- I2C: bootstrap & control
Host I/O Interface
PCI board, Altera APEX II based:
- QDR memory bank: Quad Data Rate memory (x32)
- 7th link: 1 (2) bidirectional channels
- I2C: 4 independent ports
- PCI interface: 64 bit, 66 MHz (PLDA core)
- PCI master mode for the 7th link, PCI target mode for I2C

[Block diagram: PCI master/target controllers, FIFO, QDR memory controller, two 7th-link controllers and I2C controller inside the Altera APEX II]
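From the host side, bulk data goes through the 7th link in PCI master (DMA) mode, while slow control goes over I2C in target mode. The sketch below only illustrates that usage pattern; the device node name and the plain open/write interface are assumptions, not the actual apeNEXT driver API.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device node exposed by a driver for the PCI interface board. */
    int fd = open("/dev/ape7link0", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Bulk data (e.g. a lattice configuration) sent to the machine over the
     * 7th link; the board would DMA it in PCI master mode at up to 200 MB/s. */
    size_t nbytes = 1 << 20;
    char *buf = calloc(1, nbytes);
    if (!buf) { close(fd); return 1; }

    ssize_t written = write(fd, buf, nbytes);
    if (written < 0)
        perror("write");
    else
        printf("sent %zd bytes over the 7th link\n", written);

    free(buf);
    close(fd);
    return 0;
}
```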
Status and expected schedule
- J&T ready for test in September '03
  - We will receive between 300 and 600 chips; we need 256 processors to assemble a crate!
- We expect them to work!
  - The same team designed 7 ASICs of the same complexity
  - Extensive, fully detailed simulations of multiple J&T systems
  - The more one simulates, the less one has to test!
- PB, J&T module, backplane and mechanics have been built and tested
- Within days/weeks the first working apeNEXT computer should be operating; mass production will follow ASAP
  - Mass production will start at the end of 2003...
  - The INFN requirement is 8-12 TFlops of computing power!
Software
- TAO compiler and linker: READY
  - All existing APE programs will run with no changes
  - Physics code has already been run on the simulator
  - Kernels of physics codes were used to benchmark the efficiency of the FP unit
- C compilers: gcc (2.93) and lcc have been retargeted; lcc works (almost).
http://www.cs.princeton.edu/software/lcc/
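As an illustration of the kind of kernel these compilers target, here is a plain C complex multiply-accumulate loop, the pattern the hardware maps onto its native "normal" operation. Whether the retargeted lcc exposes complex numbers through C99 _Complex, a built-in type, or only through TAO is not stated here, so treat this as a sketch of the workload, not as apeNEXT source code.

```c
#include <complex.h>
#include <stddef.h>

/* Accumulation of complex multiply-adds, typical of LQCD-style kernels.
 * Each iteration is one complex "normal" a*b + c, i.e. 8 real flops. */
double complex cmac_sum(const double complex *a,
                        const double complex *b,
                        size_t n)
{
    double complex acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc = a[i] * b[i] + acc;   /* complex "normal" operation */
    return acc;
}
```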
Project Costs
- Total development cost: 1700 k€
  - 1050 k€ for VLSI development
  - 550 k€ non-VLSI
- Manpower involved: 20 man-years
- Mass production cost: ~0.5 €/MFlops
Future R&D activities
- Computing node architecture
  - Adaptable/reconfigurable computing node
  - Fat operators, short/custom FP data formats, multiple-node integration
  - Evaluation/integration of commercial processors in APE systems
- Interconnection architecture and technologies
  - Custom APE-like network
  - Interface to the host, PC interconnection
- Mechanical assemblies (performance/volume, reliability)
  - Rack, cables, power distribution, etc.
- Software
  - Full support for standard languages (C): compiler, linker...
  - Distributed OS
  - APE system integration in "GRID" environments
Conclusions
- J&T in fab, ready Summer '03 (300-600 chips)
- Everything else is ready and tested!
- If tests are OK, mass production starts in 4Q03
- All components are over-dimensioned (cooling and LVDS tested at 400 Mb/s, on-board power supplies, ...), which makes a technology step possible with no extra design and relatively low test effort
- Installation plans:
  - The INFN theoretical group requires 8-12 TFlops (10-15 cabinets), upon delivery of a working machine
  - DESY is considering between 8 and 16 TFlops
  - Paris...
APE in SciParC
APE is the de-facto European computing platform for large-volume LQCD applications. But...

"Interdisciplinarity" is on our pathway (i.e., APE is not only for QCD):
- Fluid dynamics (lattice Boltzmann, weather forecasting)
- Complex systems (spin glasses, real glasses, protein folding)
- Neural networks
- Seismic migration
- Plasma physics (astrophysics, thermonuclear engines)
- ...

So, in our opinion, it is strategic to build "general purpose" massively parallel computing platforms dedicated to large-scale computational problems coming from different fields of research.

The APE group can (and wants to) contribute to the development of such future machines.