apeNEXT
Piero Vicini
INFN Roma ([email protected])
Paris, May 2003
APE keywords
Parallel system
Massively parallel 3D array of computing nodes with periodic boundary conditions
Custom system
- Processor: extensive use of VLSI
- Native implementation of the complex type
- Large register file
- VLIW microcode
- Native "normal" operation a x b + c (complex numbers)
Node interconnections
- Optimized for nearest-neighbor communication
Software tools
- Apese, TAO, OS, machine simulator
Dense system
- Reliable and safe HW solution
- Custom mechanics for "wide" integration
Cheap system
- 0.5 €/MFlops
- Very low maintenance cost
The APE family: our line of home-made computers
Machine (year): architecture, # nodes, topology, memory, # registers (word size), clock speed, total computing power

APE (1988): SIMD, 16 nodes, flexible 1D, 256 MB, 64 registers (x32), 8 MHz, ~1.5 GFlops
APE100 (1993): SIMD, 2048 nodes, rigid 3D, 8 GB, 128 registers (x32), 25 MHz, ~250 GFlops
APEmille (1999): SIMD, 2048 nodes, flexible 3D, 64 GB, 512 registers (x32), 66 MHz, ~2 TFlops
apeNEXT (2003): SIMD++, 4096 nodes, flexible 3D, 1 TB, 512 registers (x64), 200 MHz, ~8-20 TFlops
APE (‘88) 1 GFlops
APE100 (1993) - 100 GFlops; PB (8 nodes) ~ 400 MFlops
APEmille – 1 TFlops
- 2048 VLSI processing nodes (0.5 GFlops each)
- SIMD, synchronous communications
- Fully integrated "host computer": 64 PCs, cPCI based
- Building blocks: computing node; "Processing Board" (PB), 8 nodes, 4 GFlops; "Torre" (tower), 32 PB, 128 GFlops
APEmille installations
Bielefeld: 130 GF (2 crates)
Zeuthen: 520 GF (8 crates)
Milan: 130 GF (2 crates)
Bari: 65 GF (1 crate)
Trento: 65 GF (1 crate)
Pisa: 325 GF (5 crates)
Rome 1: 650 GF (10 crates)
Rome 2: 130 GF (2 crates)
Orsay: 16 GF (1/4 crate)
Swansea: 65 GF (1 crate)

Grand total: ~1966 GF
The apeNEXT architecture
- 3D mesh of computing nodes
- Custom VLSI processor, 200 MHz (J&T)
- 1.6 GFlops per node (complex "normal" operation)
- 256 MB (up to 1 GB) memory per node
- First-neighbor communication network, "loosely synchronous"
  - Y, Z links on the backplane, X on cables
  - r = 8/16 => 200 MB/s per channel
- Scalable from 25 GFlops to 6 TFlops:
  - Processing Board: 4 x 2 x 2 nodes, ~26 GF
  - Crate (16 PB): 4 x 8 x 8 nodes, ~0.5 TF
  - Rack (32 PB): 8 x 8 x 8 nodes, ~1 TF
  - Large systems: (8*n) x 8 x 8 nodes
- Linux PCs as host system

[Figure: PB node numbering with X+ (cables), Y+ and Z+ (backplane) links; each node has a J&T processor and DDR memory]
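The configurations listed above follow directly from the 1.6 GFlops-per-node figure (peak = number of nodes x 1.6 GFlops; the TFlops values on the slide are rounded). A minimal sketch of that arithmetic, purely illustrative and not part of any apeNEXT software:

```c
#include <stdio.h>

/* Peak performance of an nx x ny x nz apeNEXT partition, assuming the
 * quoted 1.6 GFlops per node (one complex "normal" per cycle at 200 MHz). */
static double peak_gflops(int nx, int ny, int nz)
{
    return 1.6 * nx * ny * nz;
}

int main(void)
{
    printf("PB    4 x 2 x 2 : %7.1f GFlops\n", peak_gflops(4, 2, 2));   /* ~26 GF  */
    printf("Crate 4 x 8 x 8 : %7.1f GFlops\n", peak_gflops(4, 8, 8));   /* ~0.5 TF */
    printf("Rack  8 x 8 x 8 : %7.1f GFlops\n", peak_gflops(8, 8, 8));   /* ~1 TF   */
    printf("Large 64 x 8 x 8: %7.1f GFlops\n", peak_gflops(64, 8, 8));  /* ~6 TF   */
    return 0;
}
```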
Design methodology

- VHDL incremental model of the (almost) whole system
- Custom (VLSI and/or FPGA) components derived from the VHDL model via synthesis tools
- Stand-alone simulation of each component's VHDL model + simulation of the "global" VHDL model
- Powerful test-bed for test-vector generation
=> First-Time-Right Silicon

Software design environment:
- Simplified but complete model of HW-host interaction
- Test environment for development of the compilation chain and OS
- Performance (architecture) evaluation at design time
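A sketch of the kind of test-vector comparison such a flow relies on: the same stimuli are fed to the synthesizable VHDL model and to the reference machine simulator, and the two output dumps are diffed word by word. The file names and format here are assumptions made for illustration, not the actual APE test-bed.

```c
#include <stdio.h>

/* Compare two test-vector dumps (one 64-bit hex word per line), e.g. RTL
 * simulation output against the reference simulator output. Illustrative only. */
int main(void)
{
    FILE *rtl = fopen("vectors_rtl.txt", "r");
    FILE *ref = fopen("vectors_sim.txt", "r");
    if (!rtl || !ref) { perror("fopen"); return 1; }

    unsigned long long a, b;
    long nwords = 0, mismatches = 0;
    while (fscanf(rtl, "%llx", &a) == 1 && fscanf(ref, "%llx", &b) == 1) {
        ++nwords;
        if (a != b) {
            printf("word %ld: rtl=%016llx ref=%016llx\n", nwords, a, b);
            ++mismatches;
        }
    }
    printf("%ld words compared, %ld mismatches\n", nwords, mismatches);
    fclose(rtl);
    fclose(ref);
    return mismatches != 0;
}
```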
Assembling apeNEXT…
[Photos: J&T ASIC, J&T module, PB, backplane, rack]
Overview of the J&T Architecture
- Peak floating-point performance of about 1.6 GFlops
- IEEE-compliant double precision
- Integer arithmetic performance of about 400 MIPS
- Link bandwidth of about 200 MByte/s in each direction, full duplex
- 7 links: X+, X-, Y+, Y-, Z+, Z-, "7th" (I/O)
- Support for current-generation DDR memory
- Memory bandwidth of 3.2 GByte/s (400 Mword/s)
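The memory and link figures follow from the 200 MHz clock and the quoted datapath widths (128-bit memory channel, 8-bit links). A small consistency check, with constants taken from the slide:

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz = 200e6;

    /* 128-bit local memory channel, one transfer per cycle */
    double mem_bytes_s = clock_hz * (128.0 / 8.0);   /* 3.2 GB/s            */
    double mem_words_s = mem_bytes_s / 8.0;          /* 400 M 64-bit words/s */

    /* each LVDS link carries 8 bits per cycle in each direction */
    double link_bits_s  = clock_hz * 8.0;            /* 1.6 Gb/s */
    double link_bytes_s = link_bits_s / 8.0;         /* 200 MB/s */

    printf("memory bandwidth: %.1f GB/s (%.0f Mword/s)\n",
           mem_bytes_s / 1e9, mem_words_s / 1e6);
    printf("link bandwidth  : %.1f Gb/s (%.0f MB/s per direction)\n",
           link_bits_s / 1e9, link_bytes_s / 1e6);
    return 0;
}
```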
J&T: Top Level Diagram
The J&T Arithmetic BOX
- Pipelined complex "normal" operation a*b + c (8 flops) per cycle
- 4 multipliers, 4 adder/subtractors
- At 200 MHz (fully pipelined): 8 flops x 200 MHz = 1.6 GFlops
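The 8-flop count comes from expanding the complex multiply-add into real arithmetic: 4 real multiplies plus 4 real additions/subtractions, matching the 4 multipliers and 4 adder/subtractors above. A scalar C illustration of the operation the pipeline evaluates each cycle (a sketch, not apeNEXT code):

```c
/* Complex "normal" operation n = a*b + c expanded into real arithmetic:
 * 4 multiplies + 4 add/sub = 8 flops per result.
 * One result per cycle at 200 MHz gives the quoted 1.6 GFlops peak. */
typedef struct { double re, im; } cplx;

static cplx normal_op(cplx a, cplx b, cplx c)
{
    cplx n;
    n.re = a.re * b.re - a.im * b.im + c.re;  /* 2 mul, 1 sub, 1 add */
    n.im = a.re * b.im + a.im * b.re + c.im;  /* 2 mul, 2 add        */
    return n;
}
```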
The J&T remote IO
FIFO-based communication:
- LVDS, 1.6 Gb/s per link (8 bit @ 200 MHz)
- 6 (+1) independent links
J&T summary
- CMOS 0.18 um, 7 metal layers (ATMEL)
- 200 MHz
- Double-precision complex "normal" operation
- 64-bit AGU
- 8 KW program cache
- 128-bit local memory channel
- 6+1 LVDS 200 MB/s links
- BGA package, 600 pins
PB
- Collaboration with NEURICAM spa
- 16 nodes, 3D-interconnected, 4x2x2 topology
- 26 GFlops, 4.6 GB memory
- Light system: J&T module connectors, glue logic (clock tree, 10 MHz), global signal interconnection (FPGA), DC-DC converters (48 V to 3.3/2.5 V)
- Dominant technologies:
  - LVDS: 1728 (16*6*2*9) differential signals at 200 Mb/s; 144 routed via cables, 576 via backplane, on 12 controlled-impedance layers
  - High-speed differential connectors: Samtec QTS (J&T module), Erni ERMET-ZD (backplane)
J&T Module
- J&T processor
- 9 DDR-SDRAM chips, 256 Mbit (x16)
- 6 LVDS links, up to 400 MB/s
- Host fast I/O link (7th link)
- I2C link (slow control network)
- Dual power 2.5 V + 1.8 V, 7-10 W estimated
- Dominant technologies: SSTL-II (memory interface), LVDS (network interface + I/O)
NEXT BackPlane
- 16 PB slots + root slot
- Size 447 x 600 mm2
- 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
- 16 controlled-impedance layers (32 total)
- Press-fit only, Erni/Tyco ERMET-ZD connectors
- Providers: APW (primary), ERNI (2nd source)
- Connector kit cost: 7 kEuro (!)
- PB insertion force: 80-150 kg (!)
PB Mechanics
PB constraints:
- Power consumption: up to 340 W
- PB-backplane insertion force: 80-150 kg (!)
- Fully populated PB weight: 4-5 kg
- Board-to-board connector
- Detailed study of airflow
- Custom design of card frame and insertion tool

[Figure: apeNEXT PB top view, showing DC/DC converters, J&T modules with frames, and air-flow channels]
Rack mechanics
Problem:
- PB weight: 4-5 kg; PB power consumption: 340 W (est.)
- 32 PB + 2 root boards per rack
- Power supply: < 48 V x 150 A per crate
- Integrated host PCs
- Forced-air cooling; robust, expandable/modular, CE and EMC compliant...

Solution:
- 42U rack (h: 2.10 m): EMC proof, efficient cable routing
- 19"-1U slots for 9 "host PCs" (rack mounted)
- Hot-swap power supply cabinet (modular)
- Custom design of "card cage" and "tie bar"
- Custom design of cooling system
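As a sanity check on the power-supply rating (figures taken from the slide; root boards, fans and conversion losses not included), a crate of 16 PB at the estimated 340 W per board stays within the 48 V x 150 A budget:

```c
#include <stdio.h>

int main(void)
{
    const double pb_watts     = 340.0;         /* estimated, per Processing Board */
    const int    pb_per_crate = 16;
    const double supply_watts = 48.0 * 150.0;  /* < 48 V x 150 A per crate */

    double crate_watts = pb_watts * pb_per_crate;            /* ~5.4 kW */
    printf("crate load   : %.1f kW\n", crate_watts / 1e3);
    printf("crate supply : %.1f kW\n", supply_watts / 1e3);  /* 7.2 kW  */
    printf("headroom     : %.0f %%\n",
           100.0 * (supply_watts - crate_watts) / supply_watts);
    return 0;
}
```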
Host I/O Architecture
- 7th link (200 MB/s)
- I2C: bootstrap & control
Host I/O Interface
PCI board, Altera APEX II based:
- QDR memory bank: Quad Data Rate memory (x32)
- 7th link: 1 (2) bidirectional channels
- I2C: 4 independent ports
- PCI interface: 64 bit, 66 MHz (PLDA core)
- PCI master mode for the 7th link, PCI target mode for I2C

[Block diagram: PCI master/target controllers, FIFO, QDR memory controller, two 7th-link controllers and I2C controller inside the Altera APEX II]
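From the host side, bulk data goes through the 7th link in PCI master (DMA) mode, while slow control goes over I2C in target mode. The sketch below only illustrates that usage pattern; the device node name and the plain open/write interface are assumptions, not the actual apeNEXT driver API.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device node exposed by a driver for the PCI interface board. */
    int fd = open("/dev/ape7link0", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Bulk data (e.g. a lattice configuration) sent to the machine over the
     * 7th link; the board would DMA it in PCI master mode at up to 200 MB/s. */
    size_t nbytes = 1 << 20;
    char *buf = calloc(1, nbytes);
    if (!buf) { close(fd); return 1; }

    ssize_t written = write(fd, buf, nbytes);
    if (written < 0)
        perror("write");
    else
        printf("sent %zd bytes over the 7th link\n", written);

    free(buf);
    close(fd);
    return 0;
}
```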
Status and expected schedule
- J&T ready for test in September '03
  - We will receive between 300 and 600 chips; we need 256 processors to assemble a crate!
- We expect them to work!
  - The same team designed 7 ASICs of the same complexity
  - Extensive, fully detailed simulations of multiple J&T systems
  - The more one simulates, the less one has to test!
- PB, J&T module, backplane and mechanics have been built and tested
- Within days/weeks the first working apeNEXT computer should be operating; mass production will follow ASAP
  - Mass production will start at the end of 2003...
  - The INFN requirement is 8-12 TFlops of computing power!
Software
- TAO compiler and linker: READY
  - All existing APE programs will run with no changes
  - Physics code has already been run on the simulator
  - Kernels of physics codes were used to benchmark the efficiency of the FP unit
- C compilers: gcc (2.93) and lcc have been retargeted; lcc works (almost).
http://www.cs.princeton.edu/software/lcc/
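As an illustration of the kind of kernel these compilers target, here is a plain C complex multiply-accumulate loop, the pattern the hardware maps onto its native "normal" operation. Whether the retargeted lcc exposes complex numbers through C99 _Complex, a built-in type, or only through TAO is not stated here, so treat this as a sketch of the workload, not as apeNEXT source code.

```c
#include <complex.h>
#include <stddef.h>

/* Accumulation of complex multiply-adds, typical of LQCD-style kernels.
 * Each iteration is one complex "normal" a*b + c, i.e. 8 real flops. */
double complex cmac_sum(const double complex *a,
                        const double complex *b,
                        size_t n)
{
    double complex acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc = a[i] * b[i] + acc;   /* complex "normal" operation */
    return acc;
}
```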
Project Costs
- Total development cost: 1700 k€
  - 1050 k€ for VLSI development
  - 550 k€ non-VLSI
- Manpower involved: 20 man-years
- Mass production cost: ~0.5 €/MFlops
Future R&D activities
- Computing node architecture
  - Adaptable/reconfigurable computing node
  - Fat operators, short/custom FP data formats, multiple-node integration
  - Evaluation/integration of commercial processors in APE systems
- Interconnection architecture and technologies
  - Custom APE-like network
  - Interface to the host, PC interconnection
- Mechanical assemblies (performance/volume, reliability)
  - Rack, cables, power distribution, etc.
- Software
  - Full support for standard languages (C): compiler, linker...
  - Distributed OS
  - APE system integration in "GRID" environments
Conclusions
- J&T in fab, ready Summer '03 (300-600 chips)
- Everything else is ready and tested!
- If tests are OK, mass production starts in 4Q03
- All components are over-dimensioned (cooling and LVDS tested at 400 Mb/s, on-board power supplies, ...), which makes a technology step possible with no extra design and relatively low test effort
- Installation plans:
  - The INFN theoretical group requires 8-12 TFlops (10-15 cabinets), upon delivery of a working machine
  - DESY is considering between 8 and 16 TFlops
  - Paris...
APE in SciParC
APE is the de-facto European computing platform for large-volume LQCD applications. But...

"Interdisciplinarity" is on our pathway (i.e., APE is not only for QCD):
- Fluid dynamics (lattice Boltzmann, weather forecasting)
- Complex systems (spin glasses, real glasses, protein folding)
- Neural networks
- Seismic migration
- Plasma physics (astrophysics, thermonuclear engines)
- ...

So, in our opinion, it is strategic to build "general purpose" massively parallel computing platforms dedicated to large-scale computational problems coming from different fields of research.

The APE group can (and wants to) contribute to the development of such future machines.