Supercomputing in Plain English: Multicore Madness


Lecture 27: Computer Architecture
Adapted from: Supercomputing in Plain English

Part VII: Multicore Madness

Henry Neeman, Director
OU Supercomputing Center for Education & Research, University of Oklahoma
Wednesday, October 17, 2007

Outline

 The March of Progress
 Multicore/Many-core Basics
 Software Strategies for Multicore/Many-core
 A Concrete Example: Weather Forecasting

The March of Progress

OU’s TeraFLOP Cluster, 2002

 10 racks @ 1,000 lbs per rack
 270 Pentium4 Xeon CPUs, 2.0 GHz, 512 KB L2 cache
 270 GB RAM, 400 MHz FSB
 8 TB disk
 Myrinet2000 interconnect
 100 Mbps Ethernet interconnect
 OS: Red Hat Linux
 Peak speed: 1.08 TFLOP/s (1.08 trillion calculations per second)
 One of the first Pentium4 clusters!

boomer.oscer.ou.edu


TeraFLOP: prototype in 2006, on sale in 2011. Nine years from room to chip!

http://news.com.com/2300-1006_3-6119652.html

Moore’s Law

In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.

He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months.

It turns out that computer speed is roughly proportional to the number of transistors per unit area.

Moore wrote a paper about this concept, which became known as “Moore’s Law.”

Moore’s Law in Practice

[Figure, built up across five consecutive slides: technology trend curves plotted against Year.]

Fastest Supercomputer vs. Moore

[Figure: speed of the fastest supercomputer in the world (www.top500.org) vs. a Moore’s Law trend line, on a log scale from 1 to 1,000,000, plotted against Year from 1992 to 2008.]

The Tyranny of the Storage Hierarchy

The Storage Hierarchy

From fast, expensive, and few down to slow, cheap, and a lot: [5]

 Registers
 Cache memory
 Main memory (RAM)
 Hard disk
 Removable media (e.g., DVD)
 Internet [6]

RAM is Slow

The speed of data transfer between main memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.

 CPU: 351 GB/sec [7]
 RAM (the bottleneck): 10.66 GB/sec [9] (3%)

Why Have Cache?

Cache is nearly the same speed as the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!

 CPU: 351 GB/sec [7]
 Cache: 253 GB/sec [8] (72%)
 RAM: 10.66 GB/sec [9] (3%)
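The percentages are just each bandwidth divided by the CPU’s rate:

\[
\frac{253}{351} \approx 72\%, \qquad \frac{10.66}{351} \approx 3\%.
\]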

Storage Use Strategies

 Register reuse: Do a lot of work on the same data before working on new data.
 Cache reuse: The program is much more efficient if all of the data and instructions fit in cache; if not, try to use what’s in cache a lot before using anything that isn’t in cache.
 Data locality: Try to access data that are near each other in memory before data that are far.
 I/O efficiency: Do a bunch of I/O all at once rather than a little bit at a time; don’t mix calculations and I/O.
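As a minimal sketch of the data locality strategy (the routine name here is hypothetical, not from the slides): Fortran stores arrays column-major, so the innermost loop should run over the leftmost index, making consecutive iterations touch adjacent memory.

SUBROUTINE sum_matrix (a, nr, nc, total)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc
  REAL,DIMENSION(nr,nc),INTENT(IN) :: a
  REAL,INTENT(OUT) :: total
  INTEGER :: r, c
  total = 0.0
  ! Fortran arrays are column-major, so r (the leftmost index)
  ! varies fastest in the innermost loop: consecutive iterations
  ! touch adjacent memory, which is the data locality strategy above.
  DO c = 1, nc
    DO r = 1, nr
      total = total + a(r,c)
    END DO
  END DO
END SUBROUTINE sum_matrix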

A Concrete Example

 OSCER’s big cluster, topdawg, has Irwindale CPUs: single core, 3.2 GHz, 800 MHz front side bus.
 The theoretical peak CPU speed is 6.4 GFLOPs (double precision) per CPU, and in practice we’ve gotten as high as 94% of that.
 So, in theory each CPU could consume 143 GB/sec.
 The theoretical peak RAM bandwidth is 6.4 GB/sec, but in practice we get about half that.
 So, any code that does less than 45 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.
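The 45 falls out of dividing what the CPU could consume by what the RAM actually delivers (half of 6.4 GB/sec):

\[
\frac{143\ \text{GB/sec}}{3.2\ \text{GB/sec}} \approx 45\ \text{calculations per byte}.
\]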

Good Cache Reuse Example

A Sample Application

Matrix-Matrix Multiply

Let A, B and C be matrices of sizes $nr \times nc$, $nr \times nk$ and $nk \times nc$, respectively:

\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,nc} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,nc} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
a_{nr,1} & a_{nr,2} & a_{nr,3} & \cdots & a_{nr,nc}
\end{pmatrix}
\qquad
B = \begin{pmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,nk} \\
b_{2,1} & b_{2,2} & b_{2,3} & \cdots & b_{2,nk} \\
b_{3,1} & b_{3,2} & b_{3,3} & \cdots & b_{3,nk} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
b_{nr,1} & b_{nr,2} & b_{nr,3} & \cdots & b_{nr,nk}
\end{pmatrix}
\]

\[
C = \begin{pmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,nc} \\
c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,nc} \\
c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
c_{nk,1} & c_{nk,2} & c_{nk,3} & \cdots & c_{nk,nc}
\end{pmatrix}
\]

The definition of $A = B \cdot C$ is

\[
a_{r,c} \;=\; \sum_{k=1}^{nk} b_{r,k}\, c_{k,c}
\;=\; b_{r,1} c_{1,c} + b_{r,2} c_{2,c} + b_{r,3} c_{3,c} + \cdots + b_{r,nk} c_{nk,c}
\]

for $r \in \{1, \dots, nr\}$, $c \in \{1, \dots, nc\}$.

Matrix Multiply: Naïve Version

SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q
  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive

Performance of Matrix Multiply

[Figure: performance of the naïve matrix-matrix multiply (higher is better, 0 to 800 on the vertical axis) vs. total problem size in bytes (nr*nc + nr*nq + nq*nc), from 0 to 60,000,000.]

Tiling

[Figure: a problem domain broken into small rectangular tiles.]

Tiling

 Tile: A small rectangular subdomain of a problem domain. Sometimes called a block or a chunk.
 Tiling: Breaking the domain into tiles.
 Tiling strategy: Operate on each tile to completion, then move on to the next tile.
 Tile size can be set at runtime, according to what’s best for the machine that you’re running on.

Tiling Code

SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
 &           rtilesize, ctilesize, qtilesize)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
  INTEGER :: rstart, rend, cstart, cend, qstart, qend
  DO cstart = 1, nc, ctilesize
    cend = cstart + ctilesize - 1
    IF (cend > nc) cend = nc          ! Clip the last tile in each dimension.
    DO rstart = 1, nr, rtilesize
      rend = rstart + rtilesize - 1
      IF (rend > nr) rend = nr
      DO qstart = 1, nq, qtilesize
        qend = qstart + qtilesize - 1
        IF (qend > nq) qend = nq
        CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
 &             rstart, rend, cstart, cend, qstart, qend)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_by_tiling

Multiplying Within a Tile

SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
 &           rstart, rend, cstart, cend, qstart, qend)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
  INTEGER :: r, c, q
  DO c = cstart, cend
    DO r = rstart, rend
      ! Initialize dst only on the first tile in the q direction.
      IF (qstart == 1) dst(r,c) = 0.0
      DO q = qstart, qend
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_tile
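As a usage sketch (the 64-element tile sizes are illustrative, not tuned values from the slides), the tiled routine might be driven like this:

PROGRAM test_tiling
  IMPLICIT NONE
  INTEGER,PARAMETER :: nr = 1024, nc = 1024, nq = 1024
  REAL,DIMENSION(nr,nc) :: dst
  REAL,DIMENSION(nr,nq) :: src1
  REAL,DIMENSION(nq,nc) :: src2
  ! Fill the inputs with something arbitrary.
  CALL RANDOM_NUMBER(src1)
  CALL RANDOM_NUMBER(src2)
  ! Illustrative tile sizes; in practice you would sweep these
  ! at runtime and keep whatever is fastest on your machine.
  CALL matrix_matrix_mult_by_tiling(dst, src1, src2, nr, nc, nq, &
 &                                  64, 64, 64)
END PROGRAM test_tiling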

Reminder: Naïve Version, Again

SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q
  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive

Performance with Tiling

[Figure: matrix-matrix multiply via tiling, plotted linearly and log-log; performance (higher is better) vs. tile size in bytes, for problem sizes 512x256, 512x512, 1024x512, 1024x1024, and 2048x1024.]

The Advantages of Tiling

 It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster!
 It’s a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
 If you don’t need tiling (because of the hardware, the compiler or the problem size), then you can turn it off by simply setting the tile size equal to the problem size.

Why Does Tiling Work Here?

Cache optimization works best when the number of calculations per byte is large.

For example, with matrix-matrix multiply on an n × n matrix, there are O(n³) calculations (on the order of n³), but only O(n²) bytes of data.

So, for large n, there are a huge number of calculations per byte transferred between RAM and cache.
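A back-of-the-envelope count (assuming 4-byte REALs, as in the code above, and counting a multiply plus an add per inner iteration) makes the ratio explicit:

\[
\frac{\text{calculations}}{\text{bytes}} \approx \frac{2 n^3}{3 n^2 \times 4} = \frac{n}{6},
\]

so at n = 2048 that is roughly 340 calculations per byte, comfortably above the ~45 per byte needed to escape the RAM bandwidth limit on topdawg.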

Multicore/Many-core Basics

What is Multicore?

 In the olden days (i.e., the first half of 2005), each CPU chip had one “brain” in it.
 More recently, each CPU chip has 2 cores (brains), and, starting in late 2006, 4 cores.
 Jargon: Each CPU chip plugs into a socket, so these days, to avoid confusion, people refer to sockets and cores, rather than CPUs or processors.
 Each core is just like a full-blown CPU, except that it shares its socket with one or more other cores, and therefore shares its bandwidth to RAM.

Dual Core

[Diagram: 2 cores on one chip.]

Quad Core

[Diagram: 4 cores on one chip.]

Oct Core

[Diagram: 8 cores on one chip.]

The Challenge of Multicore: RAM

 Each socket has access to a certain amount of RAM, at a fixed RAM bandwidth per SOCKET.
 As the number of cores per socket increases, the contention for RAM bandwidth increases too.
 At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
 So, applications that are cache optimized will get big speedups.
 But applications whose performance is limited by RAM bandwidth are going to speed up only as fast as RAM bandwidth speeds up.
 RAM bandwidth speeds up much slower than CPU speeds up.
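To see the squeeze, take the 10.66 GB/sec per socket from earlier in these slides and split it evenly:

\[
\text{bandwidth per core} = \frac{10.66\ \text{GB/sec}}{\text{cores per socket}}:
\qquad 5.33\ \text{GB/sec at 2 cores}, \quad \approx 0.67\ \text{GB/sec at 16 cores}.
\]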

The Challenge of Multicore: Network

 Each node has access to a certain number of network ports, at a fixed number of network ports per NODE.
 As the number of cores per node increases, the contention for network ports increases too.
 At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
 So, applications that do minimal communication will get big speedups.
 But applications whose performance is limited by the number of MPI messages are going to speed up very, very little (and may even crash the node).

Multicore/Many-core Problem

 Most multicore chip families have relatively small cache per core (e.g., 2 MB), and this problem seems likely to remain.
 Small TLBs make the problem worse: effectively 512 KB per core rather than 2 MB.
 So, to get good cache reuse, you need to partition your algorithm so that each subproblem needs no more than 512 KB; one way to pick tile sizes accordingly is sketched below.
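For instance, a quick way to pick a square tile that honors the 512 KB budget (a sketch; the function name is hypothetical, and it assumes 4-byte REALs with three tiles resident at once):

INTEGER FUNCTION max_square_tile (budget_bytes)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: budget_bytes
  ! Three t-by-t REAL(4) tiles (one each of src1, src2, dst)
  ! must fit:  3 * t*t * 4 <= budget_bytes,
  ! so  t <= SQRT(budget_bytes / 12).
  max_square_tile = INT(SQRT(REAL(budget_bytes) / 12.0))
END FUNCTION max_square_tile

For a 512 KB budget this returns t = 209, so tiles of 209 x 209 (or smaller) for each of src1, src2 and dst keep the working set within the TLB-covered footprint.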

The T.L.B. on a Current Chip

On the Intel Core Duo (“Yonah”):
 Cache size is 2 MB per core.
 Page size is 4 KB.
 A core’s data TLB size is 128 page table entries.
 Therefore, the D-TLB only covers 512 KB of cache.
 The cost of a TLB miss is 49 cycles, equivalent to as many as 196 calculations! (4 FLOPs per cycle)

http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
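Both numbers follow from simple arithmetic:

\[
128\ \text{entries} \times 4\ \text{KB/page} = 512\ \text{KB}, \qquad
49\ \text{cycles} \times 4\ \text{FLOPs/cycle} = 196\ \text{calculations}.
\]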

What Do We Need?

 We need much bigger caches!
 TLBs must be big enough to cover the entire cache.
 It’d be nice to have RAM speed increase as fast as core counts increase, but let’s not kid ourselves.

To Learn More Supercomputing

http://www.oscer.ou.edu/education.php


Computer Architecture Spring 2012 Lecture 27. CMPs & SMTs

Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji) and from Computer Organization and Design, Patterson & Hennessy, © 2005.

Multithreading on A Chip

 Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions.
 Multithreading: increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
 The processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread.
 The caches, TLBs, BHT, and BTB can be shared (although the miss rates may increase if they are not sized accordingly).
 The memory can be shared through virtual memory mechanisms.
 Hardware must support efficient thread context switching.

Types of Multithreading on a Chip

 Fine-grain: switch threads on every instruction issue.
  - Round-robin thread interleaving (skipping stalled threads).
  - The processor must be able to switch threads on every clock cycle.
  - Advantage: can hide throughput losses that come from both short and long stalls.
  - Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads.
 Coarse-grain: switch threads only on costly stalls (e.g., L2 cache misses).
  - Advantages: thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread.
  - Disadvantage: limited in its ability to overcome throughput loss, due to pipeline start-up costs: the pipeline must be flushed and refilled on thread switches.

Simultaneous Multithreading (SMT)

 A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP).
 Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., than have ILP).
 With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them.
  - Need separate rename tables (ROBs) for each thread.
  - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle.
 Intel’s Pentium 4 SMT is called hyperthreading.
  - Supports just two threads (doubles the architecture state).
  - Typically, each core of newer multicore processors is hyperthreaded.

Threading on a 4-way SS Processor Example

[Figure: issue-slot utilization over time on a 4-way superscalar processor under coarse-grained MT, fine-grained MT, and SMT, with slots filled by threads A, B, C, and D.]

William Stallings, Computer Organization and Architecture, 8th Edition

Chapter 18: Multicore Computers

Hardware Performance Issues

 Microprocessors have seen an exponential increase in performance:
  - Improved organization.
  - Increased clock frequency.
 Increase in parallelism:
  - Pipelining.
  - Superscalar execution.
  - Simultaneous multithreading (SMT).
 Diminishing returns:
  - More complexity requires more logic.
  - Increasing chip area goes to coordination and signal transfer logic, which is harder to design, make, and debug.

Alternative Chip Organizations

Intel Hardware Trends

Increased Complexity

 Power requirements grow exponentially with chip density and clock frequency.
  - More chip area can instead be used for cache, which is smaller and has order-of-magnitude lower power requirements.
 By 2015: 100 billion transistors on a 300 mm² die.
  - Cache of 100 MB.
  - 1 billion transistors for logic.
 Pollack’s rule: performance is roughly proportional to the square root of the increase in complexity.
  - Doubling complexity gives about 40% more performance.
 Multicore has the potential for near-linear improvement.
 It is unlikely that one core can use all that cache effectively.
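Written out, Pollack’s rule says:

\[
\text{performance} \propto \sqrt{\text{complexity}}, \qquad \sqrt{2} \approx 1.4,
\]

so doubling the transistor budget of a single core buys only about 40% more performance, while spending the same transistors on a second core could, in the best case, buy close to 100%.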

Power and Memory Considerations

Chip Utilization of Transistors

Software Performance Issues

 Performance benefits depend on effective exploitation of parallel resources.
 Even small amounts of serial code impact performance: 10% inherently serial code on an 8-processor system gives only 4.7 times speedup.
 Communication, distribution of work, and cache coherence add overhead.
 Some applications effectively exploit multicore processors.
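The 4.7 comes from Amdahl’s law with serial fraction f = 0.1 on p = 8 processors:

\[
S = \frac{1}{f + (1-f)/p} = \frac{1}{0.1 + 0.9/8} \approx 4.7.
\]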

Effective Applications for Multicore Processors

 Databases.
 Servers handling independent transactions.
 Multi-threaded native applications: Lotus Domino, Siebel CRM.
 Multi-process applications: Oracle, SAP, PeopleSoft.
 Java applications: the Java VM is multi-threaded, with scheduling and memory management; Sun’s Java Application Server, BEA’s WebLogic, IBM WebSphere, Tomcat.
 Multi-instance applications: one application running multiple times, e.g., Valve game software.

Multicore Organization

 Number of core processors on chip.
 Number of levels of cache on chip.
 Amount of shared cache.
 The next slide gives examples of each organization:
  - (a) ARM11 MPCore (deleted)
  - (b) AMD Opteron (deleted)
  - (c) Intel Core Duo
  - (d) Intel Core i7

Multicore Organization Alternatives

Advantages of shared L2 Cache

 Constructive interference reduces the overall miss rate.
 Data shared by multiple cores is not replicated at the cache level.
 With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic: threads with less locality can have more cache.
 Easy inter-process communication through shared memory.
 Cache coherency is confined to L1.
 A dedicated L2 cache, by contrast, gives each core more rapid access: good for threads with strong locality.
 A shared L3 cache may also improve performance.

Individual Core Architecture

 Intel Core Duo uses superscalar cores.
 Intel Core i7 uses simultaneous multithreading (SMT), scaling up the number of threads supported: 4 SMT cores, each supporting 4 threads, appear as 16 cores.

Intel x86 Multicore Organization - Core Duo (1)

 2006  Two x86 superscalar, shared L2 cache  Dedicated L1 cache per core  32KB instruction and 32KB data  Thermal control unit per core  Manages chip heat dissipation  Maximize performance within constraints  Improved ergonomics  Advanced Programmable Interrupt Controlled (APIC)  Inter-process interrupts between cores  Routes interrupts to appropriate core  Includes timer so OS can interrupt core

Intel x86 Multicore Organization - Core Duo (2)

 Power management logic:
  - Monitors thermal conditions and CPU activity.
  - Adjusts voltage and power consumption.
  - Can switch individual logic subsystems on and off.
 2 MB shared L2 cache:
  - Dynamic allocation.
  - MESI support for the L1 caches, extended to support multiple Core Duo chips in an SMP, with L2 data shared between local cores or external.
 Bus interface.

Intel Core Duo Block Diagram

Intel x86 Multicore Organization - Core i7

 November 2008.
 Four x86 SMT processors.
 Dedicated L2, shared L3 cache.
 Speculative pre-fetch for caches.
 On-chip DDR3 memory controller:
  - Three 8-byte channels (192 bits) giving 32 GB/s.
  - No front side bus.
 QuickPath Interconnect:
  - Cache-coherent point-to-point link.
  - High-speed communications between processor chips.
  - 6.4G transfers per second, 16 bits per transfer.
  - Dedicated bi-directional pairs.
  - Total bandwidth 25.6 GB/s.
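The QuickPath bandwidth is just transfer rate times width, doubled for the two directions:

\[
6.4\ \text{GT/s} \times 2\ \text{bytes/transfer} = 12.8\ \text{GB/s per direction}, \qquad 2 \times 12.8 = 25.6\ \text{GB/s}.
\]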

Intel Core i7 Block Diagram

Performance Effect of Multiple Cores

Computer Architecture

Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji) and from Computer Organization and Design, Patterson & Hennessy, © 2005.

Multicore Xbox360 – “Xenon” processor

 Goal: to provide game developers with a balanced and powerful platform.
 Three SMT processors; 32 KB L1 D$ and I$; 1 MB unified L2 cache.
 165M transistors total.
 3.2 GHz, near-POWER ISA.
 2-issue, 21-stage pipeline, with 128 128-bit registers.
 Weak branch prediction, supported by software hinting.
 In-order instruction execution.
 Narrow cores: 2 INT units, 2 128-bit VMX units, 1 of anything else.
 An ATI-designed 500 MHz GPU with 512 MB of DDR3 DRAM:
  - 337M transistors, 10 MB framebuffer.
  - 48 pixel shader cores, each with 4 ALUs.

Xenon Diagram

[Diagram: three cores (each with L1D and L1I) sharing a 1 MB unified L2, connected through the BIU/IO interface to 512 MB of DRAM and the GPU with 10 MB EDRAM and 3D core; peripherals include video out, DVD, HDD port, front USBs (2), wireless MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, flash, and the system control / analog chip.]

The PS3 “Cell” Processor Architecture

 Composed of a non-SMP architecture.
 234M transistors @ 4 GHz.
 1 Power Processing Element (PPE), 8 “Synergistic” (SIMD) Processing Elements (SPEs).
 512 KB L2 cache; a massively high-bandwidth (200 GB/s) bus connects it to everything else.
 The PPE is strangely similar to one of the Xenon cores: almost identical, really, with slight ISA differences and fine-grained MT instead of real SMT.
 The real differences lie in the SPEs (21M transistors each):
  - An attempt to “fix” the memory latency problem by giving each processor complete control over its own 256 KB “scratchpad” (14M transistors), direct-mapped for low latency.
  - 4 vector units per SPE, 1 of everything else (7M transistors).


How to make use of the SPEs

What about the Software?

 Makes use of a special IBM “Hypervisor”:
  - Like an OS for OSes.
  - Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI).
 Software must be specially coded to run well:
  - The single PPE will be quickly bogged down.
  - Must make use of the SPEs wherever possible.
  - This isn’t easy, by any standard.
 What about Microsoft?
  - The development suite identifies which 6 threads you’re expected to run.
  - Four of them are DirectX-based, and handled by the OS.
  - You only need to write two threads, functionally.

http://ps3forums.com/showthread.php?t=22858

Next Lecture and Reminders

 Reminders:
  - The final is Wednesday, May 2, from 1-2:50 PM in ITT 328.