TERAFLUX FP7-ICT-2009-4

University of Siena

Exploiting Dataflow Parallelism in Teradevice Computing

(a Proposal to Harness the Future Multicores)

An Overview of a TERAFLUX-like Architecture

Roberto Giorgi – University of Siena (coordinator)
CASTNESS'11 Workshop (Computing Architectures, Software Tools and Nano-technologies for Numerical and Embedded Scalable Systems), Rome, 18/01/2011

Partners include: Barcelona Supercomputing Center, University of Augsburg, University of Cyprus, INRIA, University of Manchester

Technologies for the coming years

• Many new technologies on the horizon: graphene, junctionless transistors… paving the way for the 1-TERA-device chip/package in a few years
• Feasibility also explored in EU FET projects like TRAMS

Which Multicore Architecture for 2020?

• The classical 1000-Billion-Euro question!
• Lessons from the past:
  – Message-Passing based architectures have poor PROGRAMMABILITY
  – Shared-Memory based architectures have limited scalability or are quite COMPLEX TO DESIGN
  – Failure in part of the computation compromises the whole computation: poor RELIABILITY

CMP of the future == 3D stacking

• 1000-Billion- or 1-TERA-device computing platforms pose new challenges:
  – (at least) programmability, complexity of design, reliability
• TERAFLUX context:
  – High performance computing and applications (not necessarily embedded)
• TERAFLUX scope:
  – Exploiting a less exploited path (DATAFLOW) at each level of abstraction

G. Hendry, K. Bergman, "Hybrid On-chip Data Networks", HotChips-22, Stanford, CA, Aug. 2010

What we propose

• Exploiting dataflow concepts both
  – at task level and
  – inside the threads
• Offload and manage accelerated codes
  – to localize the computation
  – to respect the power/performance/temperature/reliability envelope
  – to efficiently handle the parallelism and have an easy and powerful execution model

PUSHING THE DATA WHERE IT IS NEEDED
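As a rough illustration of this idea (all names below are hypothetical, not TERAFLUX interfaces), the difference between pulling operands through the shared-memory hierarchy and pushing them into the frame the consumer thread runs from can be sketched in C:

/* Hypothetical sketch: "pull" vs. "push" of operands to a consumer thread. */
typedef struct { double a, b; } frame_t;   /* the consumer's local, DF-frame-like storage */

/* Pull style: the consumer fetches its operands from shared memory,
 * paying the remote-access latency itself. */
double consume_pull(const double *shared, int i, int j)
{
    return shared[i] + shared[j];
}

/* Push style: the producers have already stored a and b directly into the
 * consumer's frame, so the consumer executes entirely out of local data. */
double consume_push(const frame_t *f)
{
    return f->a + f->b;
}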

Some techniques proposed:

Giorgi, R., Popovic, Z., Puzovic, N., "DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems", Proc. 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2007), pp. 263-270, 24-27 Oct. 2007. DOI: http://dx.doi.org/10.1109/SBAC-PAD.2007.27

M. Aater Suleman, Onur Mutlu, Jose A. Joao, Khubaib, and Yale N. Patt, "Data Marshaling for Multi-core Architectures", Proc. 37th International Symposium on Computer Architecture (ISCA), Saint-Malo, France, June 2010.

Our pillars

• FIXED and MOST-USED ISA (x86)
• MANYCORE FULL SYSTEM SIMULATOR (COTSon)
• REAL WORLD APPLICATIONS (e.g. GROMACS)
• SYNCHRONIZATION: TRANSACTIONAL MEMORY
• GCC-based TOOL-CHAIN
• OFF-THE-SHELF COMPONENTS FOR CORES, OS, NOC
• FDU AND TSU (Fault Detection Unit and Thread Scheduling Unit)

Recent Reference Numbers

• 26/05/2010 – CEA/BULL – TERA100 is the first European supercomputer to reach 1 Petaflops (TOP500 #6 as of Nov. 2010)
  – Cluster of 4370 nodes with 4 x86-64 CPUs per node (17480 CPUs or 139840 cores); 300 TB of memory; 20 PB of secondary storage; benchmark: LINPACK; 5 MW
• 11/11/2010 – National Supercomputing Center in Tianjin – Tianhe-1A – fastest supercomputer in the world (2.5 Petaflops)
  – 7168 NVIDIA® Tesla™ M2050 GPUs (about 3.2 million CUDA cores) and 14336 CPUs or 86016 cores; 230 TB of memory; benchmark: LINPACK; 4 MW
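For a quick cross-check of these figures, here is a small back-of-the-envelope computation; the per-socket core counts (8 for TERA100, 6 for Tianhe-1A) and the 448 CUDA cores per Tesla M2050 are assumptions inferred from the totals, not taken from the slide:

#include <stdio.h>

int main(void)
{
    /* TERA100: 4370 nodes x 4 x86-64 CPUs, assumed 8 cores per CPU */
    printf("TERA100   : %d CPUs, %d cores\n", 4370 * 4, 4370 * 4 * 8);          /* 17480, 139840 */
    /* Tianhe-1A: 14336 CPUs (assumed 6 cores each), 7168 GPUs x 448 CUDA cores */
    printf("Tianhe-1A : %d CPU cores, %d CUDA cores\n", 14336 * 6, 7168 * 448);  /* 86016, 3211264 */
    return 0;
}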

A possible TERAFLUX architectural instance

[Figure: a mesh of AC cores connected through a NoC to DRAM, a Service Core and I/O cores IO1 (disk), IO2 (keyboard), IO3 (NIC). Core = PE, L1$, L2$-partition; Uncore = TSU/FDU, NoC-tap, …]
AC = Auxiliary Core; SC = Service Core; IOx = I/O or SC Core; TSU = Thread Scheduling Unit; FDU = Fault Detection Unit
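To make the picture concrete, one node of such an instance could be described roughly as follows; this is an illustrative sketch only, and every type and field name is hypothetical rather than a TERAFLUX definition.

/* Illustrative data layout of one TERAFLUX-like node (names hypothetical). */
enum core_role { CORE_AC, CORE_SC, CORE_IO };   /* auxiliary, service, I/O */

struct tsu;        /* Thread Scheduling Unit (per-node uncore logic) */
struct fdu;        /* Fault Detection Unit                           */
struct noc_tap;    /* node's attachment point to the NoC / DRAM      */

struct core {                      /* PE + private L1 + slice of the L2 */
    enum core_role role;
    unsigned       l1_kib, l2_slice_kib;
};

struct node {
    struct core     cores[16];     /* e.g. 16 cores per node (baseline) */
    struct tsu     *tsu;           /* schedules DF-threads              */
    struct fdu     *fdu;           /* monitors cores, informs the TSU   */
    struct noc_tap *tap;           /* NoC tap of the uncore             */
};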

Holistic approach

[Figure: the TERAFLUX work-package stack. Source code passes through the Programming Model (WP2/WP3: data dependencies, transactional memory) and the Compilation Tools (WP4: extract TLP, locality optimizations), producing threads T1, T2, …; the Abstraction Layer and Reliability Layer (WP5/WP6) map virtual CPUs (VCPUs) onto the physical CPUs (PCPUs) of the simulated teradevice hardware (WP7), possibly 1,000-10,000 cores.]

TERAFLUX: toward a different world

• Relying on existing architectures as much as possible and introducing key modifications to enhance programmability, simplicity of design, reliability.

  – not a brand new language, but leverage and extend other open efforts [C+TM, SCALA, OPEN-MP]
  – not a brand new system, but leverage and extend other open software frameworks [GCC]
  – not a brand new CPU architecture, but leverage and extend industry standard commodities [x86]
• However: the implications on "classical limitations" can be huge
  – requirements of the hardware memory architecture which limit extensibility (a.k.a. scalability) can be relaxed significantly

Turning the dataflow model into a general-purpose approach through the addition of transactions
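As a flavour of the "no brand new language" direction, the sketch below combines plain C, OpenMP-style tasking and the transactional-memory construct later provided by mainline GCC (__transaction_atomic, enabled with -fgnu-tm). It is only an approximation of the intended programming style; the actual TERAFLUX annotations and tool-chain extensions may look different.

/* Sketch only: dataflow-style tasks plus a transactional update, in plain C.
 * Build with a recent GCC:  gcc -fopenmp -fgnu-tm tasks_tm.c              */
#include <stdio.h>

static long shared_sum = 0;

static long work(long x) { return x * x; }   /* pure, side-effect-free body */

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (long i = 0; i < 8; i++) {
        #pragma omp task firstprivate(i)
        {
            long r = work(i);           /* runs as an independent task     */
            __transaction_atomic {      /* isolated commit of the result   */
                shared_sum += r;        /* TM instead of a lock            */
            }
        }
    }
    /* implicit barrier: all tasks have completed here */
    printf("sum = %ld\n", shared_sum);
    return 0;
}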


Top-Level ARCHITECTURAL design: A view from 1000 feet

• Pool of MANY asymmetric cores based on the x86-64 ISA on a single chip (e.g. 1000 cores or more)
• Some NoC, some memory hierarchy, some I/O, some physical layout (e.g. 3D multi-chip), off-the-shelf LINUX: not in the scope of TERAFLUX
  – Some options will however be proposed/explored
• TERAFLUX Baseline Machine: the simplest thing we have now
  – E.g., 64 nodes by 16 cores, L1, L2, hierarchical interconnections
  – Need to evolve this architecture WITHOUT binding the software to it, to let the architecture fully explore dataflow concepts at machine level
• Major "cross-challenge": how to integrate the contribution of each WP so that the work is done toward a higher goal we could NOT reach as separate WPs
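A trivial way to pin down the baseline-machine numbers above (cache sizes are deliberately left out, since the slide does not give them):

/* Baseline machine as stated above: 64 nodes x 16 cores per node. */
enum {
    NODES          = 64,
    CORES_PER_NODE = 16,
    TOTAL_CORES    = NODES * CORES_PER_NODE    /* = 1024 cores */
};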


TERAFLUX key results we are aiming at & long term impact

• Coarse grain dataflow model (or fine grain multithreaded model)
  – fine grain transactional isolation
  – scalable to many cores and distributed memory
  – with built-in application-unaware resilience
  – with novel hardware support structures as needed
• A solid and open evaluation platform: an x86 full-system simulator based on COTSon by TERAFLUX partner HP Labs (http://cotson.sourceforge.net/)
  – enables leveraging the large software body out there (OS, middleware, libraries, applications)
• We are available for cooperation on COTSon also with other EU projects (especially TERACOMP projects)
• RESEARCH PAPERS: http://teraflux.eu/Publications
• TERADEVICE SIMULATOR: http://cotson.sourceforge.net


Conclusions: Major Technical Innovations in TERAFLUX

• Fragmenting the applications into finer-grained DF-threads:
  – DF-threads allow an easy way to decouple memory accesses, therefore hiding memory latencies, balancing the load, and managing fault and temperature information without fine grain intervention of the software.

• Possibility to repeat the execution of a DF-thread in case this thread happened to run on a core later discovered to be faulty
• Taking advantage of a "direct" dataflow communication of the data (through what we call DF-frames)

• Synchronizing threads while taking advantage of native dataflow mechanisms (e.g. several threads can be synchronized at a barrier)
  – DF-threads allow (atomic) transactional semantics (DF meets TM)
• A Thread Scheduling Unit (TSU) would allow fast thread switching and scheduling, besides the OS scheduler; scalable and distributed
• A Fault Detection Unit (FDU) works in conjunction with the TSU
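A rough sketch of how these pieces could fit together is given below. It only illustrates the principle described above (a DF-thread becomes runnable when all its inputs have arrived and, having no visible side effects until it completes, can simply be re-executed if its core turns out to be faulty); every name and signature here is hypothetical, not the project's interface.

/* Hypothetical TSU/FDU interplay for DF-threads (illustrative only). */
struct df_thread {
    void (*entry)(void *frame);   /* side-effect-free thread body         */
    void  *frame;                 /* DF-frame holding the thread's inputs */
    int    sync_count;            /* inputs still missing                 */
};

static void tsu_dispatch(struct df_thread *t)
{
    /* placeholder: a real TSU would hand 't' to a free, non-faulty core */
    t->entry(t->frame);
}

/* Called when a producer has written one more input into t's DF-frame. */
void tsu_input_arrived(struct df_thread *t)
{
    if (__sync_sub_and_fetch(&t->sync_count, 1) == 0)
        tsu_dispatch(t);          /* all inputs present: thread is ready */
}

/* Called by the FDU when the core that ran 't' is later found faulty:
 * the uncommitted results are dropped and the thread simply runs again. */
void fdu_core_faulty(struct df_thread *t)
{
    tsu_dispatch(t);
}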