Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor


George C. Caragea – University of Maryland
A. Beliz Saybasili – LCB Branch, NHLBI, NIH
Xingzhi Wen – NVIDIA Corporation
Uzi Vishkin – University of Maryland

www.umiacs.umd.edu/users/vishkin/XMT

Hardware prototypes of PRAM-On-Chip

• 64-core, 75 MHz FPGA prototype [SPAA’07, Computing Frontiers’08]
  – Original explicit multi-threaded (XMT) architecture [SPAA’98] (Cray started to use “XMT” ~7 years later)
• Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects’07]
• Same design as 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype
• The design scales to 1000+ cores on-chip

Objective of current paper

Meaningful comparison of
1. Our FPGA design, with
2. State-of-the-art (Intel) processor

XMT: A PRAM-On-Chip Vision

• Manycores are coming. But after 40 years of parallel computing, there has never been a successful general-purpose parallel computer (easy to program, good speedups, up&down scalable).
• IF you could program it → great speedups. XMT: fix the IF.
• XMT: designed from the ground up to address that for on-chip parallelism.
• Unlike matching current HW (some other SPAA papers).
• Tested HW & SW prototypes.
• Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. Latent, though not widespread, knowledgebase.
• This paper: ~10X relative to Intel Core 2 Duo.

If there is time: really serious about ease of programming.

Objective for programmer’s model

• Emerging: not sure, but the analysis should be work-depth. But why not design for your analysis? (like serial)

[Figure: #ops versus time for two paradigms. Serial Paradigm: Time = Work. Natural (Parallel) Paradigm, “What could I do in parallel at each step assuming unlimited hardware”: Work = total #ops, Time << Work.]

• XMT: Design for work-depth. Unique among manycores.
  – 1 operation now. Any #ops next time unit.
  – Competitive on nesting. (To be published.)
  – No need to program for locality.
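The gap between work and time in the two paradigms can be illustrated with a minimal Python sketch. It sums n numbers by rounds of pairwise additions and counts the total number of additions (work) and the number of rounds (depth), assuming unlimited hardware in each round. The function name `tree_sum_work_depth` is illustrative only and is not part of the XMT toolchain.

```python
def tree_sum_work_depth(values):
    """Sum values by pairwise rounds; return (total, work, depth).

    Each round models one parallel time step in which all pairs are
    added at once (unlimited hardware is assumed).
    """
    level = list(values)
    work = 0   # total number of additions performed
    depth = 0  # number of parallel rounds
    while len(level) > 1:
        # Add neighbouring pairs; in the model these happen concurrently.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        work += len(nxt)
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        depth += 1
    total = level[0] if level else 0
    return total, work, depth


total, work, depth = tree_sum_work_depth(range(1, 1025))
print(total, work, depth)  # 524800 1023 10
```

A serial loop performs the same 1023 additions in 1023 steps (Time = Work), while the pairwise schedule needs only 10 rounds (Time << Work).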

Programmer’s Model: Engineering Workflow

• Arbitrary CRCW work-depth algorithm.
  – Reason about correctness & complexity in synchronous model.
• SPMD reduced synchrony
  – Threads advance at own speed, not lockstep.
  – Main construct: spawn-join block. Note: can start any number of processes at once.
  – Prefix-sum (ps). Independence of order semantics (IOS).
  – Circumvents “The problem with threads”, e.g., [Lee].
  – Establish correctness & complexity by relating to WD analyses.
• Tune (compiler or expert programmer): (i) length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]
• Trial&error contrast: similar start → while insufficient inter-thread bandwidth do {rethink algorithm to take better advantage of cache}

[Figure: sequence of spawn-join blocks: spawn, join, spawn, join.]
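The spawn-join block, prefix-sum, and independence of order semantics can be sketched with a small Python simulation of array compaction. This is not XMTC and not the XMT hardware primitive: `prefix_sum` and `compact_nonzero` are hypothetical names, the shared counter is a plain list, and the "threads" run one after another in a shuffled order to mimic threads advancing at their own speed.

```python
import random


def prefix_sum(counter, increment):
    """Model of an atomic prefix-sum: add increment to the shared
    counter and return the value it held before the addition."""
    old = counter[0]
    counter[0] += increment
    return old


def compact_nonzero(a, seed=None):
    """Copy the nonzero elements of a into a dense array.

    One virtual thread per element. Threads are run in a shuffled order
    to mimic a spawn-join block with no lockstep guarantee.
    """
    counter = [0]                  # shared base for the prefix-sum
    b = [None] * len(a)
    order = list(range(len(a)))
    random.Random(seed).shuffle(order)
    for tid in order:              # "spawn": body of thread tid
        if a[tid] != 0:
            slot = prefix_sum(counter, 1)  # unique slot per thread
            b[slot] = a[tid]
    # "join": every thread has finished at this point
    return b[:counter[0]]


a = [0, 7, 0, 3, 9, 0, 4]
runs = [compact_nonzero(a, seed=s) for s in range(5)]
# The order of elements may differ between runs, but the contents do not.
assert all(sorted(r) == [3, 4, 7, 9] for r in runs)
```

Because each thread only needs some unique slot, any execution order yields a valid result, which is the sense in which the outcome is independent of order.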

XMT Architecture Overview

• One serial core: master thread control unit (MTCU)
• Parallel cores (TCUs) grouped in clusters
• Global memory space evenly partitioned in cache banks using hashing
• No local caches at TCU
  – Avoids expensive cache coherence hardware

[Figure: block diagram. MTCU and Hardware Scheduler/Prefix-Sum Unit; Clusters 1 to C; Parallel Interconnection Network; Shared Memory made of Memory Banks 1 to M; DRAM Channels 1 to D.]

Paraleap: XMT PRAM-on-chip silicon

• Built FPGA prototype
• Announced in SPAA’07
• Built using 3 FPGA chips
  – 2 Virtex-4 LX200
  – 1 Virtex-4 FX100

Clock rate        75 MHz     No. TCUs        64
DRAM size         1GB        Clusters        8
DRAM channels     1          Cache modules   8
Mem. data rate    0.6GB/s    Shared cache    256KB

With no prior design experience, X. Wen completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years. X. Wen is one person → basic simplicity of the XMT architecture → faster time to market, lower implementation cost.

Benchmarks

• Sparse Matrix – Vector Multiplication (SpMV)
  – Matrix stored in Compact Sparse Row (CSR) format
  – Serial version: iterate through rows
  – Parallel version: one thread per row
• 1-D FFT
  – Fixed-point arithmetic implementation
  – Serial version: Radix-2 Cooley-Tukey algorithm
  – Parallel version: parallelized each stage of serial algorithm
• Quicksort
  – Serial version: standard textbook implementation
  – Parallel version: two phases
    • Phase 1: For large sub-arrays, parallelize partitioning operation using atomic prefix-sum
    • Phase 2: Process all partitions in parallel using serial partitioning algorithm
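The SpMV benchmark structure can be sketched in Python as follows. This is an illustration of the CSR layout and the one-thread-per-row decomposition, not the benchmark code from the paper. The names `spmv_csr_row`, `spmv_csr`, `row_ptr`, `col_idx` and `vals` are assumptions, and the rows are processed serially here where the parallel version would assign each row to its own thread.

```python
def spmv_csr_row(row, row_ptr, col_idx, vals, x):
    """Dot product of one CSR row with x: the work of a single thread."""
    acc = 0
    for k in range(row_ptr[row], row_ptr[row + 1]):
        acc += vals[k] * x[col_idx[k]]
    return acc


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A * x for A stored in CSR form.

    Rows are independent of each other, so each iteration could run as
    its own thread; this sketch simply iterates through the rows.
    """
    n_rows = len(row_ptr) - 1
    return [spmv_csr_row(r, row_ptr, col_idx, vals, x) for r in range(n_rows)]


# A = [[10, 0, 0],
#      [ 0, 0, 2],
#      [ 3, 4, 0]]
row_ptr = [0, 1, 2, 4]   # row r occupies vals[row_ptr[r]:row_ptr[r + 1]]
col_idx = [0, 2, 0, 1]   # column of each stored value
vals = [10, 2, 3, 4]     # nonzero values in row-major order
x = [1, 2, 3]
print(spmv_csr(row_ptr, col_idx, vals, x))  # [10, 6, 11]
```

Each row writes only its own output element, so no synchronization between rows is needed in this decomposition.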

Experimental Platforms

                  XMT Paraleap FPGA                        Intel Core 2 Duo
Processor         1 MTCU, 64 TCUs                          Dual Core, E6300
Clock             75 MHz                                   1.86GHz
Cache             256KB shared L1 cache                    2x64KB L1, 2x2MB L2
DRAM              1GB DDR2                                 2GB DDR2
Data rate         0.6GB/s                                  6.4GB/s
Compiler          XMTCC (GCC 4.0.2-based)                  GCC 4.1.0; Intel C++ Professional Compiler (ICC) v11
Compiler Optim.   -O3, data prefetch, read-only buffers    -O3, SSE3 SIMD, data prefetching, auto parallelization

For meaningful comparison: compare cycle count

Input Datasets

             small                large
Program      N       Footprint    N      Footprint
SpMV         22K     200KB        4M     33MB
FFT          8K      192KB        4M     96MB
Quicksort    100K    781KB        20M    153MB

• Large dataset represents realistic input sizes
  – Recommended by Intel engineer for comparison
  – Gives Intel Core 2 advantage because of larger cache
• Small dataset
  – Fits in both Paraleap and Intel Core 2 cache
  – Provides most fair comparison for current XMT generation

Clock-Cycle Speedup

Computed as: speedup = #ClockCycles for Core 2 / #ClockCycles for Paraleap

                Core 2 – ICC        Core 2 – GCC
Program         small     large     small     large
SpMV            6.7       3.3       6.3       3.26
FFT             9.51      2.51      8.76      2.71
Quicksort(*)    13.07     7.75      13.89     8.18

• Paraleap outperforms Intel Core 2 on all benchmarks
• Lower speed-ups for large dataset because of smaller cache size
  – Will not be an issue for future implementations of XMT
• Silicon area of 64-TCU XMT roughly the same as one core of Intel Core 2 Duo
• No reason for clock frequency of XMT to fall behind
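The clock-cycle speedup defined above can be worked through in a short Python sketch. The clock rates (1.86 GHz and 75 MHz) come from the platform table. The wall-clock timings in the usage line are hypothetical values chosen only to show the arithmetic, and the function names are illustrative.

```python
def cycles(seconds, clock_hz):
    """Clock cycles elapsed during a run of the given duration."""
    return seconds * clock_hz


def clock_cycle_speedup(t_core2_s, t_paraleap_s,
                        f_core2_hz=1.86e9, f_paraleap_hz=75e6):
    """speedup = #ClockCycles for Core 2 / #ClockCycles for Paraleap."""
    return cycles(t_core2_s, f_core2_hz) / cycles(t_paraleap_s, f_paraleap_hz)


# Hypothetical timings, NOT measurements from the paper:
# Core 2 takes 1.0 ms and the FPGA prototype takes 4.0 ms.
print(round(clock_cycle_speedup(1.0e-3, 4.0e-3), 2))  # 6.2
```

Comparing cycle counts factors out the 24.8x clock-rate gap between the 75 MHz FPGA prototype and the 1.86 GHz processor, so a run that is slower in wall-clock time can still show a speedup in cycles.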

Conclusion

• XMT provides a viable answer to the biggest challenges for the field
  – Ease of programming
  – Scalability (up&down)
• Preliminary evaluation shows good results for the XMT architecture versus the state-of-the-art Intel Core 2 platform
• ICPP’08 paper compares with GPUs.

Software release

Lets you use your own computer for programming in an XMT environment and experimenting with it, including:
(i) Cycle-accurate simulator of the XMT machine
(ii) Compiler from XMTC to that machine

Also provided: extensive material for teaching or self-studying parallelism, including:
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of 9/15/07 HS tutorial (300 minutes)
(iv) Video recording of grad Parallel Algorithms lectures (30+ hours)

www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Next Major Objective

Industry-grade chip. Requires 10X in funding.

Ease of Programming

• Benchmark: can any CS major program your manycore?
  – Cannot really avoid it.
• Teachability demonstrated so far:
  – To freshman class with 11 non-CS students. Some prog. assignments: merge-sort, integer-sort & sample-sort.
• Other teachers:
  – Magnet HS teacher. Downloaded simulator, assignments, class notes from XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do so not just for embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up keynote at CS4HS’09@CMU + interview with teacher.
  – High school & middle school (some 10 year olds) students from underrepresented groups, by HS math teacher.