Lecture 2 - University of Cincinnati


Today’s topics

• Single processors and the Memory Hierarchy
• Busses and Switched Networks
• Interconnection Network Topologies
• Multiprocessors
• Multicomputers
• Flynn's Taxonomy
• Modern clusters – hybrid

Processors and the Memory Hierarchy

• Registers (1 clock cycle, 100s of bytes)
• 1st-level cache (3–5 clock cycles, 100s of KBytes)
• 2nd-level cache (~10 clock cycles, MBytes)
• Main memory (~100 clock cycles, GBytes)
• Disk (milliseconds, 100 GB to ginormous)

(Figure: CPU registers, 1st-level instruction cache, 1st-level data cache, 2nd-level unified cache for instructions & data)
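To make the latency gaps concrete, here is a small C sketch (added for illustration, not part of the lecture) that sums the same matrix twice: once row by row, so consecutive accesses hit in the 1st- and 2nd-level caches, and once column by column, so most accesses fall through to main memory. The matrix size N and the use of clock() for timing are assumptions; the exact ratio depends on the machine.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* assumed matrix dimension, large enough to exceed the 2nd-level cache */

/* Sum the matrix row by row: consecutive addresses, cache friendly. */
double sum_row_major(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}

/* Sum the same matrix column by column: each access jumps N*8 bytes,
   so most references miss in the 1st- and 2nd-level caches. */
double sum_col_major(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof(double));
    for (long k = 0; k < (long)N * N; k++) a[k] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_row_major(a);
    clock_t t1 = clock();
    double s2 = sum_col_major(a);
    clock_t t2 = clock();

    printf("row-major: %.2fs  col-major: %.2fs  (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a);
    return 0;
}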

IBM Dual Core

From Intel® 64 and IA-32 Architectures Optimization Reference Manual http://www.intel.com/design/processor/manuals/248966.pdf

Interconnection Network Topologies - Bus

• Bus – a single shared data path
  – Pros
    • Simplicity – cache coherence, synchronization
  – Cons
    • Fixed bandwidth – does not scale well

(Figure: several CPUs sharing a single bus to a global memory)

Interconnection Network Topologies – Switch based

• Switch based
  – m×n switches
  – Many possible topologies
• Characterized by
  – Diameter
    • Worst-case number of switches between two processors
    • Impacts latency
  – Bisection width
    • Minimum number of connections that must be removed to split the network into two
    • Communication bandwidth limitation
  – Edges per switch
    • Best if this is independent of the size of the network

(Figure: CPUs connected through a switched network)

Interconnection Network Topologies - Mesh

• 2-D mesh – a 2-D array of processors
• Torus / wraparound mesh – processors on the edge of the mesh are connected
• Characteristics (n nodes)
  – Diameter = 2(√n − 1)
  – Bisection width = √n
  – Switch size = 4
  – Number of switches = n
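A quick worked example (added here for concreteness, not on the original slide): a 4 × 4 mesh without wraparound has n = 16 nodes, so its diameter is 2(√16 − 1) = 6, its bisection width is √16 = 4, and it uses n = 16 switches of size 4.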

Interconnection Network Topologies - Hypercube

• Hypercube – a d-dimensional hypercube has n = 2^d processors
  – Each processor is directly connected to d other processors
  – Shortest path between a pair of processors is at most d
• Characteristics (n = 2^d nodes)
  – Diameter = d
  – Bisection width = n/2
  – Switch size = d
  – Number of switches = n

(Figures: 3-D hypercube, 4-D hypercube)
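A short C sketch (added for illustration, not from the slides) of the hypercube structure: if the n = 2^d nodes are labelled with d-bit numbers, a node's d neighbors are obtained by flipping one bit at a time, which is also why the shortest path between any two nodes is at most d link traversals.

#include <stdio.h>

/* Print the d neighbors of node `id` in a d-dimensional hypercube.
   Two nodes are directly connected exactly when their labels differ
   in one bit, so flipping each of the d bits enumerates the neighbors. */
void print_neighbors(unsigned id, unsigned d) {
    printf("node %u:", id);
    for (unsigned bit = 0; bit < d; bit++)
        printf(" %u", id ^ (1u << bit));   /* flip one dimension */
    printf("\n");
}

int main(void) {
    unsigned d = 4;                        /* 4-D hypercube: n = 2^4 = 16 nodes */
    for (unsigned id = 0; id < (1u << d); id++)
        print_neighbors(id, d);
    return 0;
}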

Multistage Networks

• Butterfly
• Omega
• Perfect shuffle
• Characteristics for an Omega network (n = 2^d nodes)
  – Diameter = d − 1
  – Bisection width = n/2
  – Switch size = 2×2
  – Number of switches = d · n/2

(Figure: an 8-input, 8-output Omega network of 2×2 switches)
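The standard destination-tag routing rule for an Omega network can be sketched in a few lines of C (added for illustration, not taken from the slides): between stages the lines are permuted by a perfect shuffle (rotate the d-bit line label left by one bit), and each 2×2 switch forwards to its upper or lower output according to the next bit of the destination address. The source/destination pair below is arbitrary.

#include <stdio.h>

/* Perfect shuffle on d-bit line labels: rotate the bits left by one. */
unsigned shuffle(unsigned p, unsigned d) {
    unsigned n = 1u << d;
    return ((p << 1) | (p >> (d - 1))) & (n - 1);
}

int main(void) {
    unsigned d = 3;                       /* 8-input Omega network: d = 3 stages */
    unsigned src = 2, dst = 6;            /* arbitrary source and destination */

    printf("routing %u -> %u\n", src, dst);
    unsigned line = src;
    for (unsigned stage = 0; stage < d; stage++) {
        line = shuffle(line, d);                       /* shuffle wiring between stages */
        unsigned bit = (dst >> (d - 1 - stage)) & 1u;  /* next bit of the destination */
        line = (line & ~1u) | bit;                     /* switch picks upper (0) or lower (1) output */
        printf("after stage %u: line %u\n", stage, line);
    }
    return 0;                             /* line now equals dst */
}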

Shared Memory

• One or more memories
• Global address space (all system memory visible to all processors)
• Transfer of data between processors is usually implicit – just read from or write to a given address (OpenMP; see the sketch below)
• Cache-coherency protocol to maintain consistency between processors

(Figure: UMA – uniform memory access – shared-memory system: CPUs and memories connected by an interconnection network)
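A minimal OpenMP sketch of this shared-memory style (illustrative only; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp): the array and the reduction variable live in the single global address space, so threads exchange data simply by reading and writing memory.

#include <stdio.h>
#include <omp.h>

#define N 1000000   /* assumed problem size */

int main(void) {
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for              /* threads share a[] implicitly */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];                      /* reads what other threads wrote */

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}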

Distributed Shared Memory

• Single address space with implicit communication
• Hardware support for read/write to non-local memories, cache coherency
• Latency for a memory operation is greater when accessing non-local data than when accessing data within a CPU's own memory

(Figure: NUMA – non-uniform memory access – shared-memory system: each CPU paired with its own memory, connected by an interconnection network)

Distributed Memory

• Each processor has access to its own memory only
• Data transfer between processors is explicit – the user calls message-passing functions
• Common libraries for message passing – MPI, PVM (see the MPI sketch below)
• User has complete control/responsibility for data placement and management

(Figure: distributed-memory system: each CPU paired with its own memory, connected by an interconnection network)
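A minimal MPI sketch of explicit message passing (illustrative, not from the slides; build with mpicc and run with something like mpirun -np 2): rank 0 owns a value in its private memory and must explicitly send it before rank 1 can see it.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double value = 0.0;
    if (rank == 0 && size > 1) {
        value = 3.14;                       /* data that lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %.2f from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}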

Hybrid Systems

• Distributed memory system with multiprocessor shared memory nodes.

• Most common architecture for the current generation of parallel machines

(Figure: shared-memory multiprocessor nodes, each with a network interface, joined by an interconnection network)
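A hedged sketch of the programming style that typically goes with this architecture (MPI between the distributed-memory nodes, OpenMP among the cores sharing memory inside each node); the loop and its bounds are made up for illustration.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000; i++)        /* threads share this node's memory */
        local += 1.0;

    double total = 0.0;                   /* explicit communication across nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f\n", total);

    MPI_Finalize();
    return 0;
}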

Flynn’s Taxonomy

(figure 2.20 from Quinn)

• Classification by instruction stream and data stream:
  – SISD (single instruction, single data): uniprocessor
  – SIMD (single instruction, multiple data): processor arrays, pipelined vector processors
  – MISD (multiple instruction, single data): systolic array
  – MIMD (multiple instruction, multiple data): multiprocessors, multicomputers

Top 500 List

• Some highlights from http://www.top500.org/
  – On the new list, the IBM BlueGene/L system, installed at DOE's Lawrence Livermore National Laboratory (LLNL), retains the No. 1 spot with a Linpack performance of 280.6 teraflops (trillions of calculations per second, or Tflop/s).

  – The new No. 2 system is Sandia National Laboratories' Cray Red Storm supercomputer, only the second system ever recorded to exceed the 100 Tflop/s mark, with 101.4 Tflop/s. The initial Red Storm system was ranked No. 9 in the last listing.

  – Slipping to No. 3 from No. 2 last June is the IBM eServer Blue Gene Solution system, installed at IBM's Thomas Watson Research Center, with 91.20 Tflop/s Linpack performance.

  – The new No. 5 is the largest system in Europe, an IBM JS21 cluster installed at the Barcelona Supercomputing Center. The system reached 62.63 Tflop/s.

Linux/Beowulf cluster basics

• Goal
  – Get supercomputing power at the cost of a few PCs
• How
  – Commodity components: PCs and networks
  – Free, open-source software

CPU nodes

• A typical configuration
  – Dual-socket, dual-core AMD or Intel nodes
  – 4 GB memory per node

Network Options

From the Research Overview presentation on D.K. Panda's Nowlab website at Ohio State, http://nowlab.cse.ohio-state.edu/

Challenges

• Cooling
• Power constraints
• Reliability
• System Administration