Chapter 8 Notes

Parallel Computer Architectures
Chapter 8
Parallel Computer Architectures
(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor.
(d) A multicomputer. (e) A grid.
Parallelism
a) Introduced at various levels
b) Within the CPU chip (multiple instructions per cycle)
   – Instruction level: VLIW (Very Long Instruction Word)
   – Superscalar
   – On-chip multithreading
   – Single-chip multiprocessors
c) Extra CPU boards (coprocessors)
d) Multiprocessor / multicomputer
e) Computer grids
f) Tightly coupled – computationally intimate
g) Loosely coupled – computationally remote
Instruction-Level Parallelism
(a) A CPU pipeline. (b) A sequence of VLIW instructions.
(c) An instruction stream with bundles marked.
The TriMedia VLIW CPU (1)
A typical TriMedia instruction, showing five possible operations.
The TriMedia VLIW CPU (2)
The TM3260 functional units, their quantity, latency,
and which instruction slots they can use.
The TriMedia VLIW CPU (3)
The major groups of TriMedia custom operations.
The TriMedia VLIW CPU (4)
(a) An array of 8-bit elements. (b) The transposed array.
(c) The original array fetched into four registers.
(d) The transposed array in four registers.
Multithreading
a) Fine-grained multithreading (a toy model follows this list)
   – Run multiple threads, one instruction from each
   – Will never stall if there are enough active threads
   – Requires hardware to track which instruction belongs to which thread
b) Coarse-grained multithreading
   – Run a thread until it stalls, then switch (one cycle wasted)
c) Simultaneous multithreading
   – Coarse-grained with no cycle wasted
d) Hyperthreading
   – A 5% increase in chip size gives a 25% gain
   – Resource sharing
     • Partitioned
     • Full resource sharing
     • Threshold sharing
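A minimal C++ sketch of the fine-grained idea above (a toy single-issue-slot model, not real hardware; the miss latency, miss pattern, and thread lengths are assumptions made up for illustration):

    #include <cstdio>
    #include <vector>

    const int MISS_LATENCY = 3;        // assumed cache-miss penalty, in cycles
    const int INSTRS_PER_THREAD = 8;   // every 4th instruction of a thread misses (assumption)

    int simulate(int num_threads) {
        std::vector<int> done(num_threads, 0);      // instructions retired per thread
        std::vector<int> ready_at(num_threads, 0);  // cycle at which a thread may issue again
        int cycle = 0, idle = 0, rr = 0;
        int remaining = num_threads * INSTRS_PER_THREAD;
        while (remaining > 0) {
            bool issued = false;
            for (int k = 0; k < num_threads && !issued; ++k) {   // round-robin scan
                int t = (rr + k) % num_threads;
                if (done[t] < INSTRS_PER_THREAD && ready_at[t] <= cycle) {
                    if (++done[t] % 4 == 0)                      // this instruction misses
                        ready_at[t] = cycle + MISS_LATENCY;      // park the thread
                    --remaining;
                    rr = t + 1;
                    issued = true;
                }
            }
            if (!issued) ++idle;    // no thread was ready: the issue slot is wasted this cycle
            ++cycle;
        }
        printf("%d thread(s): %d cycles, %d idle\n", num_threads, cycle, idle);
        return idle;
    }

    int main() {
        // Idle cycles shrink as threads are added; with enough active threads
        // the issue slot never stalls, which is the point of fine-grained multithreading.
        for (int n = 1; n <= 4; ++n) simulate(n);
    }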
On-Chip Multithreading (1)
(a) – (c) Three threads. The empty boxes indicate that the thread
has stalled waiting for memory. (d) Fine-grained multithreading.
(e) Coarse-grained multithreading.
On-Chip Multithreading (2)
Multithreading with a dual-issue superscalar CPU.
(a) Fine-grained multithreading.
(b) Coarse-grained multithreading.
(c) Simultaneous multithreading.
Hyperthreading on the Pentium 4
Sharing between two threads, white and gray.
Resource sharing between threads in the
Pentium 4 NetBurst microarchitecture.
Single-Chip Multiprocessor
a) Two areas of interest: servers and consumer electronics
b) Homogeneous chips
   – 2 pipelines, one CPU
   – 2 CPUs (same design)
c) Heterogeneous chips
   – CPUs for DVD players or cell phones
   – More software => slower but cheaper
   – Many different cores (essentially libraries)
Sample chip
a) Cores on a chip for a DVD player:
   – Control
   – MPEG video
   – Audio decoder
   – Video encoder
   – Disk controller
   – Cache
b) Cores require interconnect:
   – IBM CoreConnect
   – AMBA (Advanced Microcontroller Bus Architecture)
   – VCI (Virtual Component Interconnect)
   – OCP-IP (Open Core Protocol)
Homogeneous Multiprocessors on a Chip
Single-chip multiprocessors.
(a) A dual-pipeline chip. (b) A chip with two cores.
Heterogeneous Multiprocessors on a Chip (1)
The logical structure of a simple DVD player: a heterogeneous
multiprocessor containing multiple cores for different functions.
Heterogeneous Multiprocessors on a Chip (2)
An example of the IBM CoreConnect architecture.
Coprocessors
a) Come in a variety of sizes
   – Separate cabinets for mainframes
   – Separate boards
   – Separate chips
b) Primary purpose is to offload work and assist the main processor
c) Different types
   – I/O
   – DMA
   – Floating point
   – Network
   – Graphics
   – Encryption
Introduction to Networking (1)
How users are connected to servers on the Internet.
Networks
a) LAN – Local Area Network
b) WAN – Wide Area Network
c) Packet – chunk of data on the network, 64–1500 bytes
d) Store-and-forward packet switching – what a router does
e) Internet – series of WANs linked by routers
f) ISP – Internet Service Provider
g) Firewall – specialized computer that filters traffic
h) Protocols – sets of formats, exchange sequences, and rules
i) HTTP – HyperText Transfer Protocol
j) TCP – Transmission Control Protocol
k) IP – Internet Protocol
Networks
a) CRC – Cyclic Redundancy Check
b) TCP header – information about the data for the TCP level
c) IP header – routing header: source, destination, hops
d) Ethernet header – next-hop address, address, CRC
e) ASIC – Application-Specific Integrated Circuit
f) FPGA – Field-Programmable Gate Array
g) Network processor – programmable device that handles
incoming and outgoing packets at wire speed
h) PPE – Protocol/Programmable/Packet Processing Engine
Introduction to Networking (2)
A packet as it appears on the Ethernet.
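As a rough sketch of how the headers listed above nest on the wire (field names and packing are simplified; treat the layout as illustrative rather than authoritative):

    #include <cstdint>
    #include <cstdio>

    #pragma pack(push, 1)
    struct EthernetHeader {              // link layer: next-hop (MAC) addresses
        uint8_t  dst_mac[6];
        uint8_t  src_mac[6];
        uint16_t ethertype;              // 0x0800 for IPv4
    };
    struct IpHeader {                    // network layer: end-to-end routing
        uint8_t  version_ihl;
        uint8_t  tos;
        uint16_t total_length;
        uint16_t id;
        uint16_t flags_fragment;
        uint8_t  ttl;                    // hop limit
        uint8_t  protocol;               // 6 = TCP
        uint16_t header_checksum;
        uint32_t src_addr;
        uint32_t dst_addr;
    };
    struct TcpHeader {                   // transport layer: information about the data
        uint16_t src_port, dst_port;
        uint32_t seq, ack;
        uint16_t offset_flags, window;
        uint16_t checksum, urgent;
    };
    #pragma pack(pop)

    int main() {
        // On the Ethernet the frame is laid out as:
        //   EthernetHeader | IpHeader | TcpHeader | payload | 32-bit CRC (trailer)
        printf("Ethernet %zu, IP %zu, TCP %zu bytes\n",
               sizeof(EthernetHeader), sizeof(IpHeader), sizeof(TcpHeader));
    }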
Introduction to Network Processors
A typical network processor board and chip.
Packet Processing
a) Checksum verification (see the checksum sketch after this list)
b) Field extraction
c) Packet classification
d) Path selection
e) Destination network determination
f) Route lookup
g) Fragmentation and reassembly
h) Computation (compression/encryption)
i) Header management
j) Queue management
k) Checksum generation
l) Accounting
m) Statistics gathering
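A minimal sketch of checksum generation and verification using the 16-bit ones'-complement Internet checksum (the style of checksum IP uses); the header bytes in main are made-up example data:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sum the data as big-endian 16-bit words, fold the carries back in,
    // and return the ones'-complement of the result.
    uint16_t internet_checksum(const uint8_t* data, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i + 1 < len; i += 2)
            sum += (uint32_t(data[i]) << 8) | data[i + 1];
        if (len & 1)
            sum += uint32_t(data[len - 1]) << 8;     // pad an odd trailing byte
        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);      // end-around carry
        return uint16_t(~sum);
    }

    int main() {
        // Generation: compute with the checksum field zeroed, then store it.
        // Verification: a sum over the header *including* the stored checksum
        // must come out as all ones, i.e. the function returns 0.
        std::vector<uint8_t> hdr = {0x45, 0x00, 0x00, 0x28, 0x1c, 0x46, 0x40, 0x00,
                                    0x40, 0x06, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
                                    0xc0, 0xa8, 0x00, 0xc7};
        uint16_t c = internet_checksum(hdr.data(), hdr.size());
        hdr[10] = c >> 8; hdr[11] = c & 0xFF;        // generation
        printf("checksum = 0x%04x, verify = %s\n", c,
               internet_checksum(hdr.data(), hdr.size()) == 0 ? "ok" : "bad");
    }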
Improving Performance
a) Performance is the name of the game
b) How to measure it (a worked example follows this list):
   – Packets per second
   – Bytes per second
c) Ways to speed it up:
   – Performance is not linear with clock speed
   – Introduce more PPEs
   – Specialized processors
   – More internal buses
   – Widen existing buses
   – Replace SDRAM with SRAM
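A back-of-the-envelope sketch of why packets per second and bytes per second are different measures (the 10-Gb/s link speed and the 20 bytes of per-frame preamble/gap overhead are assumptions for the example):

    #include <cstdio>

    int main() {
        const double link_bps = 10e9;              // assumed 10 Gb/s link
        const double overhead_bytes = 8 + 12;      // Ethernet preamble + inter-frame gap (assumption)
        const double frame_sizes[] = {64, 1500};   // the 64-1500 byte range mentioned above
        for (double frame : frame_sizes) {
            double pps = link_bps / ((frame + overhead_bytes) * 8);   // 8 bits per byte
            printf("%4.0f-byte frames: %5.2f Mpackets/s, %6.0f Mbytes/s of frame data\n",
                   frame, pps / 1e6, pps * frame / 1e6);
        }
        // Small frames require roughly 18x more per-packet work (lookups,
        // classification) for less payload, which is why packets per second
        // is the harder metric to sustain at wire speed.
    }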
The Nexperia Media Processor
The Nexperia heterogeneous multiprocessor on a chip.
Multiprocessors
(a) A multiprocessor with 16 CPUs sharing a common memory.
(b) An image partitioned into 16 sections, each being analyzed
by a different CPU.
Shared-Memory Multiprocessors
a) Multiprocessor – has shared memory
b) SMP (Symmetric Multiprocessor) – every CPU can access any I/O device
c) Multicomputer (distributed memory system) – each computer has its own memory
d) A multiprocessor has one address space
e) A multicomputer has one address space per computer
f) Multicomputers pass messages to communicate (see the sketch after this list)
g) Ease of programming vs. ease of construction
h) DSM – Distributed Shared Memory: shared memory implemented with page faults on a multicomputer
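A tiny C++ sketch of the programming-model difference: on a shared-memory multiprocessor the two threads below simply share one variable; on a multicomputer the same exchange would need an explicit message (the send/receive calls named in the final comment are hypothetical):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> shared_value{0};    // visible to both threads: one address space

    void producer() { shared_value.store(42); }                       // just write it
    void consumer() {
        while (shared_value.load() == 0) std::this_thread::yield();   // wait for the write
        printf("consumer saw %d\n", shared_value.load());
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join(); c.join();
        // On a multicomputer there is no shared_value: the producer would call
        // something like send(consumer_node, &value) and the consumer would
        // call receive(), with the runtime moving the bytes over the interconnect.
    }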
Multicomputers (1)
(a) A multicomputer with 16 CPUs, each with its own private memory.
(b) The bit-map image of Fig. 8-17 split up among the 16 memories.
Multicomputers (2)
Various layers where shared memory can be implemented. (a) The
hardware. (b) The operating system. (c) The language runtime system.
Taxonomy of Parallel Computers (1)
Flynn’s taxonomy of parallel computers.
Taxonomy of Parallel Computers (2)
A taxonomy of parallel computers.
MIMD categories
a) UMA – Uniform Memory Access
b) NUMA – NonUniform Memory Access
c) COMA – Cache Only Memory Access
d) Multicomputers are NORMA (No Remote Memory Access)
   – MPP – Massively Parallel Processor
Consistency Models
a) How hardware and software will work with memory
b) Strict consistency – any read of location X returns the most recent value written to location X
c) Sequential consistency – values will be returned in the order they are written (one true order)
d) Processor consistency
   – Writes by any CPU are seen in the order they are written
   – For every memory word, all CPUs see all writes to it in the same order
e) Weak consistency – no guarantee unless synchronization is used (see the sketch after this list)
f) Release consistency – writes must complete before the critical section is reentered
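A hedged C++ sketch relating two of these models to code: release/acquire operations behave like weak/release consistency (ordering is only guaranteed around the synchronization operation itself), while default seq_cst atomics give the sequential-consistency picture:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int>  data{0};
    std::atomic<bool> ready{false};

    void writer() {
        data.store(99, std::memory_order_relaxed);     // ordinary write
        ready.store(true, std::memory_order_release);  // "release": earlier writes become visible with it
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) // "acquire": pairs with the release
            std::this_thread::yield();
        printf("data = %d\n", data.load(std::memory_order_relaxed));  // guaranteed to print 99
    }

    int main() {
        std::thread w(writer), r(reader);
        w.join(); r.join();
        // With default (memory_order_seq_cst) operations, all threads would also
        // agree on a single interleaving of ALL atomic operations, which is the
        // sequential-consistency picture described above.
    }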
Sequential Consistency
(a) Two CPUs writing and two CPUs reading a common memory
word. (b) - (d) Three possible ways the two writes and four
reads might be interleaved in time.
Weak Consistency
Weakly consistent memory uses synchronization operations to
divide time into sequential epochs.
UMA Symmetric Multiprocessor Architectures
Three bus-based multiprocessors. (a) Without caching. (b) With
caching. (c) With caching and private memories.
Cache as Cache can
a) A cache coherence protocol keeps a block of memory in at most one cache (e.g., write through)
b) A snooping cache monitors the bus for accesses to memory it has cached
c) Choose between an update strategy and an invalidate strategy
d) The MESI protocol is named after its states (a state-machine sketch follows this list):
   – Invalid
   – Shared
   – Exclusive
   – Modified
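A minimal state-machine sketch of the MESI transitions, seen from one cache and one line (data movement and bus signalling are omitted; only the states are modeled):

    #include <cstdio>

    enum class State { Modified, Exclusive, Shared, Invalid };
    enum class Event { LocalRead, LocalWrite, BusRead, BusWrite };

    // 'others_have_copy' matters only on a local read miss: the line comes back
    // Shared if some other cache already holds it, Exclusive if not.
    State next(State s, Event e, bool others_have_copy) {
        switch (e) {
        case Event::LocalRead:
            if (s == State::Invalid)
                return others_have_copy ? State::Shared : State::Exclusive;
            return s;                                    // hit: state unchanged
        case Event::LocalWrite:
            return State::Modified;                      // other copies are invalidated first
        case Event::BusRead:                             // another CPU reads this line
            if (s == State::Modified || s == State::Exclusive) return State::Shared;
            return s;                                    // (a Modified line is also written back)
        case Event::BusWrite:                            // another CPU writes this line
            return State::Invalid;
        }
        return s;
    }

    int main() {
        State s = State::Invalid;
        s = next(s, Event::LocalRead, false);    // -> Exclusive
        s = next(s, Event::LocalWrite, false);   // -> Modified
        s = next(s, Event::BusRead, false);      // -> Shared (and write back)
        s = next(s, Event::BusWrite, false);     // -> Invalid
        printf("final state = %d (3 = Invalid)\n", (int)s);
    }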
Snooping Caches
The write through cache coherence protocol.
The empty boxes indicate that no action is taken.
The MESI Cache Coherence Protocol
The MESI cache coherence protocol.
UMA Multiprocessors Using Crossbar Switches
(a) An 8 × 8 crossbar switch.
(b) An open crosspoint.
(c) A closed crosspoint.
UMA Multiprocessors Using Multistage Switching
Networks (1)
(a) A 2 × 2 switch.
(b) A message format.
UMA Multiprocessors Using Multistage Switching
Networks (2)
An omega switching network.
NUMA Multiprocessors
A NUMA machine based on two levels of buses. The Cm* was
the first multiprocessor to use this design.
Cache Coherent NUMA Multiprocessors
(a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit
memory address into fields. (c) The directory at node 36.
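A hedged sketch of how such a directory-based machine might split a 32-bit address; the exact field widths used here (8-bit node for 256 nodes, 18-bit line, 6-bit offset for 64-byte lines) are an assumption for illustration:

    #include <cstdint>
    #include <cstdio>

    struct AddressFields { uint32_t node, line, offset; };

    AddressFields split(uint32_t addr) {
        return {
            addr >> 24,             // which node's memory holds the line (and its directory entry)
            (addr >> 6) & 0x3FFFF,  // which cache line within that node
            addr & 0x3F             // byte offset within the 64-byte line
        };
    }

    int main() {
        AddressFields f = split(0x24000108);   // arbitrary example address (node 36)
        printf("node %u, line %u, offset %u\n",
               (unsigned)f.node, (unsigned)f.line, (unsigned)f.offset);
        // The home node's directory entry for 'line' records which caches,
        // if any, currently hold a copy of that line.
    }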
The Sun Fire E25K NUMA Multiprocessor (1)
The Sun Microsystems E25K multiprocessor.
The Sun Fire E25K NUMA Multiprocessor (2)
The Sun Fire E25K uses a four-level interconnect. Dashed lines
are address paths. Solid lines are data paths.
Message-Passing Multicomputers
A generic multicomputer.
Topology
Various topologies. The heavy dots represent switches. The CPUs
and memories are not shown. (a) A star. (b) A complete interconnect.
(c) A tree. (d) A ring. (e) A grid. (f) A double torus.
(g) A cube. (h) A 4D hypercube.
BlueGene (1)
The BlueGene/L custom processor chip.
BlueGene (2)
The BlueGene/L. (a) Chip. (b) Card. (c) Board.
(d) Cabinet. (e) System.
Red Storm (1)
Packaging of the Red Storm components.
Red Storm (2)
The Red Storm system as viewed from above.
A Comparison of BlueGene/L and Red Storm
A comparison of
BlueGene/L and
Red Storm.
Google (1)
Processing of a Google query.
Google (2)
A typical Google
cluster.
Scheduling
Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking.
(c) Tiling. The shaded areas indicate idle CPUs.
Distributed Shared Memory (1)
A virtual address space consisting of 16 pages
spread over four nodes of a multicomputer.
(a) The initial situation. ….
Distributed Shared Memory (2)
A virtual address space consisting of 16 pages
spread over four nodes of a multicomputer. …
(b) After CPU 0 references page 10. …
Distributed Shared Memory (3)
A virtual address space consisting of 16 pages
spread over four nodes of a multicomputer. …
(c) After CPU 1 references page 10, here assumed to be a read-only page.
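A conceptual C++ sketch (not a real DSM runtime) of what the DSM layer does on a page fault in these figures: a write reference migrates the page, while a reference to a read-only page replicates it:

    #include <cstdio>
    #include <map>
    #include <set>

    struct DsmDirectory {
        std::map<int, int>           owner;      // page -> node holding the writable copy
        std::map<int, std::set<int>> copies;     // page -> nodes holding read-only copies
        std::map<int, bool>          read_only;  // page -> is the page read-only?

        void page_fault(int node, int page, bool is_write) {
            if (read_only[page] && !is_write) {
                copies[page].insert(node);                // replicate: both nodes keep a copy
                printf("node %d gets a copy of page %d\n", node, page);
            } else {
                copies[page].clear();                     // invalidate any stale copies
                owner[page] = node;                       // move the page to the faulting node
                printf("page %d migrates to node %d\n", page, node);
            }
        }
    };

    int main() {
        DsmDirectory d;
        d.owner[10] = 1; d.read_only[10] = false;
        d.page_fault(0, 10, true);    // like (b): CPU 0 references page 10, so it moves
        d.read_only[10] = true;
        d.page_fault(1, 10, false);   // like (c): the read-only page 10 is replicated instead
    }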
Linda
Three Linda tuples.
Orca
A simplified ORCA stack object, with internal data and two
operations.
Software Metrics (1)
Real programs achieve less than the perfect speedup
indicated by the dotted line.
Software Metrics (2)
(a) A program has a sequential part and a parallelizable part.
(b) Effect of running part of the program in parallel.
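The figure illustrates Amdahl's law. With n CPUs and a fraction f of the execution that is inherently sequential, the achievable speedup is

    \text{speedup} = \frac{n}{1 + (n - 1)\,f}

For example, with n = 16 and f = 0.1 (an arbitrary illustrative value), the speedup is 16 / (1 + 15 × 0.1) = 6.4, far below the perfect speedup of 16.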
Achieving High Performance
(a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system.
(c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
Grid Computing
The grid layers.