Parallel Computing Explained

Parallel Computing Overview
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
2
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
3
Parallel Computing Overview
 Who should read this chapter?
 New Users – to learn concepts and terminology.
 Intermediate Users – for review or reference.
 Management Staff – to understand the basic concepts – even if
you don’t plan to do any programming.
 Note: Advanced users may opt to skip this chapter.
4
Introduction to Parallel Computing
 High performance parallel computers
 can solve large problems much faster than a desktop computer
 fast CPUs, large memory, high speed interconnects, and high speed
input/output
 able to speed up computations
 by making the sequential components run faster
 by doing more operations in parallel
 High performance parallel computers are in demand
 need for tremendous computational capabilities in science,
engineering, and business.
 require gigabytes/terabytes of memory and gigaflops/teraflops of
performance
 scientists are striving for petascale performance
5
Introduction to Parallel Computing
 High performance parallel computers (HPPC) are used in a wide variety of disciplines.
 Meteorologists: prediction of tornadoes and thunderstorms
 Computational biologists: analyze DNA sequences
 Pharmaceutical companies: design of new drugs
 Oil companies: seismic exploration
 Wall Street: analysis of financial markets
 NASA: aerospace vehicle design
 Entertainment industry: special effects in movies and commercials
 These complex scientific and business applications all need to
perform computations on large datasets or large equations.
6
Parallelism in our Daily Lives
 There are two types of processes that occur in computers and
in our daily lives:
 Sequential processes
 occur in a strict order
 it is not possible to do the next step until the current one is completed.
 Examples
 The passage of time: the sun rises and the sun sets.
 Writing a term paper: pick the topic, research, and write the paper.
 Parallel processes
 many events happen simultaneously
 Examples
 Plant growth in the springtime
 An orchestra
7
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.2.1 Data Parallelism
1.1.2.2 Task Parallelism
1.1.3 Parallelism in Computers
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
8
Parallelism in Computer Programs
 Conventional wisdom:
 Computer programs are sequential in nature
 Only a small subset of them lend themselves to parallelism.
 Algorithm: the "sequence of steps" necessary to do a computation.
 For the first 30 years of computer use, programs were run sequentially.
 The 1980's saw great successes with parallel computers.
 Dr. Geoffrey Fox published a book entitled Parallel Computing
Works!
 many scientific accomplishments resulting from parallel computing
 The new conventional wisdom:
 Computer programs are parallel in nature
 Only a small subset of them need to be run sequentially
9
Parallel Computing
 What a computer does when it carries out more than one
computation at a time using more than one processor.
 By using many processors at once, we can speed up the execution.
 If one processor can perform the arithmetic in time t,
 then ideally p processors can perform the arithmetic in time t/p.
 What if I use 100 processors? What if I use 1000 processors? (See the
worked example at the end of this slide.)
 Almost every program has some form of parallelism.
 You need to determine whether your data or your program can be
partitioned into independent pieces that can be run simultaneously.
 Decomposition is the name given to this partitioning process.
 Types of parallelism:
 data parallelism
 task parallelism.
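To make the ideal t/p scaling concrete, here is a small worked example (the numbers are illustrative, not from the slides):

    ideal time on p processors = t / p
    e.g., if t = 1000 seconds on 1 processor, then ideally
          p = 100   ->  1000/100  = 10 seconds
          p = 1000  ->  1000/1000 =  1 second

Real programs fall short of this ideal because of overheads such as communication between processors and imperfect load balancing, which later slides discuss.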
10
Data Parallelism
 The same code segment runs concurrently on each processor,
but each processor is assigned its own part of the data to
work on.
 Do loops (in Fortran) define the parallelism.
 The iterations must be independent of each other.
 Data parallelism is called "fine grain parallelism" because the
computational work is spread into many small subtasks.
 Example
 Dense linear algebra, such as matrix multiplication, is a perfect
candidate for data parallelism.
11
An example of data parallelism
Original Sequential Code

DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO

Parallel Code

!$OMP PARALLEL DO
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
Quick Intro to OpenMP
 OpenMP is a portable standard for parallel directives
covering both data and task parallelism.
 More information about OpenMP is available on the OpenMP
website.
 We will have a lecture on Introduction to OpenMP later.
 With OpenMP, the loop that is performed in parallel is the
loop that immediately follows the Parallel Do directive.
 In our sample code, it's the K loop:
 DO K=1,N
13
OpenMP Loop Parallelism

Iteration-Processor Assignments (4 processors, N=20):

Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I, 1:5),   B(1:5, J)
proc1       K=6:10            A(I, 6:10),  B(6:10, J)
proc2       K=11:15           A(I, 11:15), B(11:15, J)
proc3       K=16:20           A(I, 16:20), B(16:20, J)

The code segment running on each processor:

DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO
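To observe such an iteration-processor assignment on a real machine, a small probe program can print which OpenMP thread executes each iteration of the K loop. This is a minimal, hypothetical sketch (not part of the original slides); the exact split of iterations depends on the number of threads and the OpenMP schedule.

PROGRAM WHO_RUNS_K
  USE OMP_LIB                    ! standard OpenMP runtime module
  IMPLICIT NONE
  INTEGER :: K
!$OMP PARALLEL DO
  DO K = 1, 20
     ! each thread reports the iterations it was assigned
     PRINT *, 'iteration K =', K, ' ran on thread', OMP_GET_THREAD_NUM()
  END DO
!$OMP END PARALLEL DO
END PROGRAM WHO_RUNS_K

Compiled with an OpenMP flag (for example, gfortran -fopenmp) and run with OMP_NUM_THREADS=4, the output shows a division of K=1..20 among four threads similar to the table above.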
OpenMP Style of Parallelism
 can be done incrementally as follows:
 1. Parallelize the most computationally intensive loop.
 2. Compute performance of the code.
 3. If performance is not satisfactory, parallelize another loop.
 4. Repeat steps 2 and 3 as many times as needed.
 The ability to perform incremental parallelism is considered a
positive feature of data parallelism. A small sketch of the incremental
style is given below.
 It is contrasted with the MPI (Message Passing Interface)
style of parallelism, which is an "all or nothing" approach.
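A minimal sketch of the incremental style (the loops are hypothetical, not code from the tutorial): the most computationally intensive loop receives an OpenMP directive first, and the second loop is left serial until measurement shows it is worth parallelizing as well.

PROGRAM INCREMENTAL
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100000
  REAL :: A(N), B(N), S
  INTEGER :: I
  A = 1.0
!$OMP PARALLEL DO
  DO I = 1, N
     B(I) = SQRT(A(I)) * 2.0     ! step 1: the most intensive loop is parallelized
  END DO
!$OMP END PARALLEL DO
  S = 0.0
  DO I = 1, N                    ! left serial for now; parallelize only if step 2
     S = S + B(I)                ! (measuring performance) shows it is worthwhile
  END DO
  PRINT *, 'checksum =', S
END PROGRAM INCREMENTAL

If the second loop is parallelized later, it would need a REDUCTION(+:S) clause on its directive because S accumulates a sum across iterations.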
Task Parallelism
 Task parallelism may be thought of as the opposite of data
parallelism.
 Instead of the same operations being performed on different parts
of the data, each process performs different operations.
 You can use task parallelism when your program can be split into
independent pieces, often subroutines, that can be assigned to
different processors and run concurrently.
 Task parallelism is called "coarse grain" parallelism because the
computational work is spread into just a few subtasks.
 More code is run in parallel because the parallelism is
implemented at a higher level than in data parallelism.
 Task parallelism is often easier to implement and has less overhead
than data parallelism.
Task Parallelism
 The abstract code shown in the diagram is decomposed into
4 independent code segments that are labeled A, B, C, and D.
The right hand side of the diagram illustrates the 4 code
segments running concurrently.
17
Task Parallelism
Original Code

program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end

Parallel Code

program main
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
OpenMP Task Parallelism
 With OpenMP, the code that follows each SECTION directive is
allocated to a different processor. In our sample parallel code, the
allocation of code segments to processors is as follows (a runnable
sketch appears after the table).

Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D
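The abstract segments can be any independent pieces of work. Here is a minimal runnable sketch, with print statements standing in for the real code segments (an illustration, not the tutorial's code):

PROGRAM TASK_PAR
  IMPLICIT NONE
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
  PRINT *, 'code segment A'
!$OMP SECTION
  PRINT *, 'code segment B'
!$OMP SECTION
  PRINT *, 'code segment C'
!$OMP SECTION
  PRINT *, 'code segment D'
!$OMP END SECTIONS
!$OMP END PARALLEL
END PROGRAM TASK_PAR

With at least four OpenMP threads, each section can run on its own processor; with fewer threads, the runtime simply assigns several sections to the same thread.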
Parallelism in Computers
 How parallelism is exploited and enhanced within the
operating system and hardware components of a parallel
computer:
 operating system
 arithmetic
 memory
 disk
20
Operating System Parallelism
 All of the commonly used parallel computers run a version of the
Unix operating system. In the table below each OS listed is in fact
Unix, but the name of the Unix OS varies with each vendor.
Parallel Computer      OS
SGI Origin2000         IRIX
HP V-Class             HP-UX
Cray T3E               Unicos
IBM SP                 AIX
Workstation Clusters   Linux
 For more information about Unix, a collection of Unix documents
is available.
21
Two Unix Parallelism Features
 background processing facility
 With the Unix background processing facility you can run the
executable a.out in the background and simultaneously view the
man page for the etime function in the foreground. There are
two Unix commands that accomplish this:
a.out > results &
man etime
 cron feature
 With the Unix cron feature you can submit a job that will run at
a later time.
22
Arithmetic Parallelism
 Multiple execution units
 facilitate arithmetic parallelism.
 The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are
each done in a separate execution unit. This allows several execution units to be
used simultaneously, because the execution units operate independently.
 Fused multiply and add
 is another parallel arithmetic feature.
 Parallel computers are able to overlap multiply and add. This arithmetic is named
MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on
HP computers. In either case, the two arithmetic operations are overlapped and
can complete in hardware in one computer cycle.
 Superscalar arithmetic
 is the ability to issue several arithmetic operations per computer cycle.
 It makes use of the multiple, independent execution units. On superscalar
computers there are multiple slots per cycle that can be filled with work. This
gives rise to the name n-way superscalar, where n is the number of slots per
cycle. The SGI Origin2000 is called a 4-way superscalar computer.
23
Memory Parallelism
 memory interleaving
 memory is divided into multiple banks, and consecutive data elements are
interleaved among them. For example if your computer has 2 memory banks,
then data elements with even memory addresses would fall into one bank, and
data elements with odd memory addresses into the other.
 multiple memory ports
 Port means a bi-directional memory pathway. When the data elements that are
interleaved across the memory banks are needed, the multiple memory ports
allow them to be accessed and fetched in parallel, which increases the memory
bandwidth (MB/s or GB/s).
 multiple levels of the memory hierarchy
 There is global memory that any processor can access. There is memory that is
local to a partition of the processors. Finally there is memory that is local to a
single processor, that is, the cache memory and the memory elements held in
registers.
 Cache memory
 Cache is a small memory that has fast access compared with the larger main
memory and serves to keep the faster processor filled with data.
24
Memory Parallelism
[Diagram: the Memory Hierarchy, including Cache Memory]
Disk Parallelism
 RAID (Redundant Array of Inexpensive Disks)
 RAID disks are on most parallel computers.
 The advantage of a RAID disk system is that it provides a
measure of fault tolerance.
 If one of the disks goes down, it can be swapped out, and the
RAID disk system remains operational.
 Disk Striping
 When a data set is written to disk, it is striped across the RAID
disk system. That is, it is broken into pieces that are written
simultaneously to the different disks in the RAID disk system.
When the same data set is read back in, the pieces are read in
parallel, and the full data set is reassembled in memory.
26
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
27
Performance Measures
 Peak Performance
 is the top speed at which the computer can operate.
 It is a theoretical upper limit on the computer's performance.
 Sustained Performance
 is the highest consistently achieved speed.
 It is a more realistic measure of computer performance.
 Cost Performance
 is used to determine if the computer is cost effective.
 MHz
 is a measure of the processor speed.
 The processor speed is commonly measured in millions of cycles per second,
where a computer cycle is defined as the shortest time in which some work can be
done.
 MIPS
 is a measure of how quickly the computer can issue instructions.
 Millions of instructions per second is abbreviated as MIPS, where the
instructions are computer instructions such as: memory reads and writes,
logical operations, floating point operations, integer operations, and
branch instructions. (A rough worked example combining these measures
follows.)
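These measures combine in a simple way. As a rough, illustrative calculation (the numbers are hypothetical, not vendor figures):

    peak rate ≈ (operations issued per cycle) × (cycles per second)
              = 4 slots/cycle × 250 MHz = 1000 million operations per second

so a 4-way superscalar processor clocked at 250 MHz could issue at most about 1000 MIPS; its peak Mflops depends on how many of those slots can hold floating point operations. Sustained performance on real applications is typically well below such peak numbers.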
Performance Measures
 Mflops (Millions of floating point operations per second)
 measures how quickly a computer can perform floating-point operations
such as add, subtract, multiply, and divide.
 Speedup
 measures the benefit of parallelism.
 It shows how your program scales as you compute with more processors,
compared to the performance on one processor.
 Ideal speedup happens when the performance gain is linearly proportional
to the number of processors used. (A formula is given after this list.)
 Benchmarks
 are used to rate the performance of parallel computers and parallel
programs.
 A well known benchmark that is used to compare parallel computers is the
Linpack benchmark.
 Based on the Linpack results, a list is produced of the Top 500
Supercomputer Sites. This list is maintained by the University of Tennessee
and the University of Mannheim.
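In symbols (a standard definition consistent with the description above, not an additional result): if T(1) is the run time on one processor and T(p) is the run time on p processors, then

    Speedup(p) = T(1) / T(p)

Ideal (linear) speedup means Speedup(p) = p; measured speedups are usually lower because of overheads such as communication and load imbalance.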
29
More Parallelism Issues
 Load balancing
 is the technique of evenly dividing the workload among the processors.
 For data parallelism it involves how iterations of loops are allocated to
processors (see the scheduling sketch at the end of this slide).
 Load balancing is important because the total time for the program to complete is
the time spent by the longest executing thread.
 The problem size
 must be large and must be able to grow as you compute with more processors.
 In order to get the performance you expect from a parallel computer you need to
run a large application with large data sizes, otherwise the overhead of passing
information between processors will dominate the calculation time.
 Good software tools
 are essential for users of high performance parallel computers.
 These tools include:
 parallel compilers
 parallel debuggers
 performance analysis tools
 parallel math software
 The availability of a broad set of application software is also important.
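A minimal sketch of one way to improve load balance in a data parallel loop (illustrative only; the DYNAMIC schedule and chunk size are assumptions, not a recommendation from the slides). When iterations do different amounts of work, a dynamic schedule hands out small chunks of iterations to whichever thread is free:

PROGRAM BALANCE
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL :: WORK(N)
  INTEGER :: I, J
!$OMP PARALLEL DO PRIVATE(J) SCHEDULE(DYNAMIC, 10)
  DO I = 1, N
     WORK(I) = 0.0
     DO J = 1, I                  ! iteration I does I units of work, so the
        WORK(I) = WORK(I) + 1.0   ! workload is uneven across iterations
     END DO
  END DO
!$OMP END PARALLEL DO
  PRINT *, 'total work =', SUM(WORK)
END PROGRAM BALANCE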
More Parallelism Issues
 The high performance computing market is risky and chaotic. Many
supercomputer vendors are no longer in business, making the
portability of your application very important.
 A workstation farm
 is defined as a fast network connecting heterogeneous workstations.
 The individual workstations serve as desktop systems for their owners.
 When they are idle, large problems can take advantage of the unused
cycles in the whole system.
 An application of this concept is the SETI project. You can participate in
searching for extraterrestrial intelligence with your home PC. More
information about this project is available at the SETI Institute.
 Condor
 is software that provides resource management services for applications that
run on heterogeneous collections of workstations.
 Miron Livny at the University of Wisconsin at Madison is the director of the
Condor project, and has coined the phrase high throughput computing to describe
this process of harnessing idle workstation cycles. More information is available
at the Condor Home Page.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
32
Comparison of Parallel Computers
 Now you can explore the hardware components of parallel
computers:
 kinds of processors
 types of memory organization
 flow of control
 interconnection networks
 You will see what is common to these parallel computers,
and what makes each one of them unique.
33
Kinds of Processors
 There are three types of parallel computers:
1. computers with a small number of powerful processors
 Typically have tens of processors.
 The cooling of these computers often requires very sophisticated and
expensive equipment, making these computers very expensive for computing
centers.
 They are general-purpose computers that perform especially well on
applications that have large vector lengths.
 The examples of this type of computer are the Cray SV1 and the Fujitsu
VPP5000.
34
Kinds of Processors
 There are three types of parallel computers:
2. computers with a large number of less powerful processors
 Named a Massively Parallel Processor (MPP), these typically have
thousands of processors.
 The processors are usually proprietary and air-cooled.
 Because of the large number of processors, the distance between the furthest
processors can be quite large, requiring a sophisticated internal network that
allows distant processors to communicate with each other quickly.
 These computers are suitable for applications with a high degree of
concurrency.
 The MPP type of computer was popular in the 1980s.
 Examples of this type of computer were the Thinking Machines CM-2
computer, and the computers made by the MasPar company.
Kinds of Processors
 There are three types of parallel computers:
3. computers that are medium scale, in between the two extremes
 Typically have hundreds of processors.
 The processor chips are usually not proprietary; rather they are commodity
processors like the Pentium III.
 These are general-purpose computers that perform well on a wide range of
applications.
 The most common example of this class is the Linux Cluster.
36
Trends and Examples
 Processor trends:

Decade   Processor Type                    Computer Example
1970s    Pipelined, Proprietary            Cray-1
1980s    Massively Parallel, Proprietary   Thinking Machines CM-2
1990s    Superscalar, RISC, Commodity      SGI Origin2000
2000s    CISC, Commodity                   Workstation Clusters

 The processors on today's commonly used parallel computers:

Computer               Processor
SGI Origin2000         MIPS RISC R12000
HP V-Class             HP PA 8200
Cray T3E               Compaq Alpha
IBM SP                 IBM Power3
Workstation Clusters   Intel Pentium III, Intel Itanium
Memory Organization
 The following paragraphs describe the three types of
memory organization found on parallel computers:
 distributed memory
 shared memory
 distributed shared memory
38
Distributed Memory
 In distributed memory computers, the total memory is partitioned
into memory that is private to each processor.
 There is a Non-Uniform Memory Access time (NUMA), which is
proportional to the distance between the two communicating
processors.
 On NUMA computers, data is accessed the quickest from a private
memory, while data from the most distant processor takes the longest
to access.
 Some examples are the Cray T3E, the IBM SP, and workstation clusters.
Distributed Memory
 When programming distributed memory computers, the
code and the data should be structured such that the bulk of
a processor’s data accesses are to its own private (local)
memory.
 This is called having good data locality.
 Today's distributed memory computers use message passing, such as
MPI, to communicate between processors, as in the sketch below.
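The original slide shows a figure at this point; in its place, here is a minimal, hedged sketch of message passing between two processors (illustrative only, not the slide's code). Processor 0 sends one value from its private memory to processor 1:

PROGRAM SEND_RECV
  USE MPI                              ! standard MPI Fortran module
  IMPLICIT NONE
  INTEGER :: RANK, IERR, STATUS(MPI_STATUS_SIZE)
  REAL :: X
  CALL MPI_INIT(IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
  IF (RANK == 0) THEN
     X = 3.14                          ! data in processor 0's local memory
     CALL MPI_SEND(X, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, IERR)
  ELSE IF (RANK == 1) THEN
     CALL MPI_RECV(X, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, STATUS, IERR)
     PRINT *, 'processor 1 received', X
  END IF
  CALL MPI_FINALIZE(IERR)
END PROGRAM SEND_RECV

Run with, for example, mpirun -np 2 ./a.out (assuming an MPI library such as MPICH or Open MPI is installed).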
Distributed Memory
 One advantage of distributed memory computers is that they
are easy to scale. As the demand for resources grows,
computer centers can easily add more memory and
processors.
 This is often called the LEGO block approach.
 The drawback is that programming of distributed memory
computers can be quite complicated.
41
Shared Memory
 In shared memory computers, all processors have access to a single pool
of centralized memory with a uniform address space.
 Any processor can address any memory location at the same speed so
there is Uniform Memory Access time (UMA).
 Processors communicate with each other through the shared memory.
 The advantages and disadvantages of shared memory machines are
roughly the opposite of distributed memory computers.
 They are easier to program because they resemble the programming
of single processor machines.
 But they don't scale like their distributed memory counterparts.
Distributed Shared Memory
 In Distributed Shared Memory (DSM) computers, a cluster or partition of
processors has access to a common shared memory.
 It accesses the memory of a different processor cluster in a NUMA fashion.
 Memory is physically distributed but logically shared.
 Attention to data locality again is important.
 Distributed shared memory computers combine the best features of
both distributed memory computers and shared memory computers.
 That is, DSM computers have both the scalability of distributed
memory computers and the ease of programming of shared memory
computers.
 Some examples of DSM computers are the SGI Origin2000 and the
HP V-Class computers.
Trends and Examples
 Memory organization trends:

Decade   Memory Organization         Example
1970s    Shared Memory               Cray-1
1980s    Distributed Memory          Thinking Machines CM-2
1990s    Distributed Shared Memory   SGI Origin2000
2000s    Distributed Memory          Workstation Clusters

 The memory organization of today's commonly used parallel computers:

Computer               Memory Organization
SGI Origin2000         DSM
HP V-Class             DSM
Cray T3E               Distributed
IBM SP                 Distributed
Workstation Clusters   Distributed
Flow of Control
 When you look at the flow of control, you will see three types
of parallel computers:
 Single Instruction Multiple Data (SIMD)
 Multiple Instruction Multiple Data (MIMD)
 Single Program Multiple Data (SPMD)
45
Flynn’s Taxonomy
 Flynn’s Taxonomy, devised in 1972 by Michael Flynn of Stanford
University, describes computers by how streams of instructions interact
with streams of data.
 There can be single or multiple instruction streams, and there can be
single or multiple data streams. This gives rise to 4 types of computers,
as shown in the table below:

                         Single Data Stream   Multiple Data Streams
Single Instruction       SISD                 SIMD
Multiple Instruction     MISD                 MIMD

 Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD, and MIMD.
 Of these 4, only SIMD and MIMD are applicable to parallel computers.
 Another computer type, SPMD, is a special case of MIMD.
SIMD Computers
 SIMD stands for Single Instruction Multiple Data.
 Each processor follows the same set of instructions.
 With different data elements being allocated to each processor.
 SIMD computers have distributed memory with typically thousands of simple processors,
and the processors run in lock step.
 SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications,
such as neural networks.
 Some examples of SIMD computers were the Thinking Machines CM-2
computer and the computers from the MasPar company.
 The processors are commanded by the global controller that sends
instructions to the processors.
 It says add, and they all add.
 It says shift to the right, and they all shift to the right.
 The processors are like obedient soldiers, marching in unison.
47
MIMD Computers
 MIMD stands for Multiple Instruction Multiple Data.
 There are multiple instruction streams with separate code segments
distributed among the processors.
 MIMD is actually a superset of SIMD, so that the processors can run
the same instruction stream or different instruction streams.
 In addition, there are multiple data streams; different data elements
are allocated to each processor.
 MIMD computers can have either distributed memory or shared memory.
 While the processors on SIMD computers run in lock step, the processors
on MIMD computers run independently of each other.
 MIMD computers can be used for either data parallel or task parallel
applications.
 Some examples of MIMD computers are the SGI Origin2000 computer and
the HP V-Class computer.
SPMD Computers
 SPMD stands for Single Program Multiple Data.
 SPMD is a special case of MIMD.
 SPMD execution happens when a MIMD computer is programmed to have the
same set of instructions per processor.
 With SPMD computers, while the processors are running the same code
segment, each processor can run that code segment asynchronously.
 Unlike SIMD, the synchronous execution of instructions is relaxed.
 An example is the execution of an if statement on an SPMD computer.
 Because each processor computes with its own partition of the data
elements, it may evaluate the condition of the if statement differently
from another processor.
 One processor may take a certain branch of the if statement, and another
processor may take a different branch of the same if statement.
 Hence, even though each processor has the same set of instructions, those
instructions may be evaluated in a different order from one processor to
the next.
 The analogies we used for describing SIMD computers can be modified for
MIMD computers.
 Instead of the SIMD obedient soldiers, all marching in unison, in the
MIMD world the processors march to the beat of their own drummer.
Summary of SIMD versus MIMD

                     SIMD                      MIMD
Memory               distributed memory        distributed memory or shared memory
Code Segment         same per processor        same or different
Processors Run In    lock step                 asynchronously
Data Elements        different per processor   different per processor
Applications         data parallel             data parallel or task parallel
Trends and Examples
 Flow of control trends:

Decade   Flow of Control   Computer Example
1980's   SIMD              Thinking Machines CM-2
1990's   MIMD              SGI Origin2000
2000's   MIMD              Workstation Clusters

 The flow of control on today's commonly used parallel computers:

Computer               Flow of Control
SGI Origin2000         MIMD
HP V-Class             MIMD
Cray T3E               MIMD
IBM SP                 MIMD
Workstation Clusters   MIMD
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
52
Interconnection Networks
 What exactly is the interconnection network?
 The interconnection network is made up of the wires and cables that define how the
multiple processors of a parallel computer are connected to each other and to the
memory units.
 The time required to transfer data is dependent upon the specific type of the
interconnection network.
 This transfer time is called the communication time.
 What network characteristics are important?
 Diameter: the maximum distance that data must travel for 2 processors to
communicate.
 Bandwidth: the amount of data that can be sent through a network connection.
 Latency: the delay on a network while a data packet is being stored and forwarded.
 Types of Interconnection Networks
The network topologies (geometric arrangements of the computer network
connections) are:
 Bus
 Cross-bar Switch
 Hypercube
 Tree
Interconnection Networks
 The aspects of network issues are:
 Cost
 Scalability
 Reliability
 Suitable Applications
 Data Rate
 Diameter
 Degree
 General Network Characteristics
 Some networks can be compared in terms of their degree and diameter.
 Degree: how many communicating wires are coming out of each processor.
 A large degree is a benefit because it has multiple paths.
 Diameter: This is the distance between the two processors that are farthest
apart.
 A small diameter corresponds to low latency.
54
Bus Network
 Bus topology is the original coaxial cable-based Local Area Network
(LAN) topology in which the medium forms a single bus to which all
stations are attached.
 The positive aspects
 It is a mature technology that is well known and reliable.
 The cost is very low.
 It is simple to construct.
 The negative aspects
 limited data transmission rate.
 not scalable in terms of performance.
 Example: SGI Power Challenge.
 Only scaled to 18 processors.
Cross-Bar Switch Network
 A cross-bar switch is a network that works through a switching mechanism to
access shared memory.
 It scales better than the bus network, but it costs significantly more.
 The telephone system uses this type of network. An example of a computer
with this type of network is the HP V-Class.
 Here is a diagram of a cross-bar switch network which shows the
processors talking through the switchboxes to store or retrieve data
in memory.
 There are multiple paths for a processor to communicate with a
certain memory.
 The switches determine the optimal route to take.
Hypercube Network
 In a hypercube network, the processors are connected as if they
were corners of a multidimensional cube. Each node in an N
dimensional cube is directly connected to N other nodes.
 The fact that the number of directly connected, "nearest neighbor"
nodes increases with the total size of the network is also highly
desirable for a parallel computer.
 The degree of a hypercube network is log n and the diameter is log n,
where n is the number of processors. (A worked example follows.)
 Examples of computers with this type of network are the CM-2,
NCUBE-2, and the Intel iPSC/860.
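As a quick check of these formulas (a worked example, not from the slides): with n = 16 processors arranged as a 4-dimensional hypercube,

    degree   = log2(16) = 4 wires per processor
    diameter = log2(16) = 4 hops between the farthest processors

so both the wiring per node and the worst-case communication distance grow only logarithmically with the machine size.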
Tree Network
 The processors are the bottom nodes of the tree. For a processor
to retrieve data, it must go up in the network and then go back down.
 This is useful for decision making applications that can be mapped
as trees.
 The degree of a tree network is 1. The diameter of the network is
2 log (n+1) - 2, where n is the number of processors. (A worked
example follows.)
 The Thinking Machines CM-5 is an example of a parallel computer
with this type of network.
 Tree networks are very suitable for database applications because
they allow multiple searches through the database at a time.
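Plugging a number into the stated formula (a worked example, not from the slides): for n = 15 processors, the diameter is 2 log2(15+1) - 2 = 2 × 4 - 2 = 6, so a message between the two most distant processors crosses at most 6 links.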
Interconnection Networks
 Torus Network: A mesh with wrap-around connections in both the x and
y directions.
 Multistage Network: A network with more than one networking unit.
 Fully Connected Network: A network where every processor is connected
to every other processor.
 Hypercube Network: Processors are connected as if they were corners
of a multidimensional cube.
 Mesh Network: A network where each interior processor is connected
to its four nearest neighbors.
Interconnection Networks
 Bus Based Network: Coaxial cable based LAN topology in
which the medium forms a single bus to which all stations are
attached.
 Cross-bar Switch Network: A network that works through a
switching mechanism to access shared memory.
 Tree Network: The processors are the bottom nodes of the
tree.
 Ring Network: Each processor is connected to two others
and the line of connections forms a circle.
60
Summary of Parallel Computer
Characteristics
 How many processors does the computer have?
 10s?
 100s?
 1000s?
 How powerful are the processors?
 what's the MHz rate
 what's the MIPS rate
 What's the instruction set architecture?
 RISC
 CISC
61
Summary of Parallel Computer
Characteristics
 How much memory is available?
 total memory
 memory per processor
 What kind of memory?
 distributed memory
 shared memory
 distributed shared memory
 What type of flow of control?
 SIMD
 MIMD
 SPMD
62
Summary of Parallel Computer
Characteristics
 What is the interconnection network?
 Bus
 Crossbar
 Hypercube
 Tree
 Torus
 Multistage
 Fully Connected
 Mesh
 Ring
 Hybrid
Design decisions made by some of the major parallel computer vendors:

Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet, Tree
64
Summary
 This completes our introduction to parallel computing.
 You have learned about parallelism in computer programs, and
also about parallelism in the hardware components of parallel
computers.
 In addition, you have learned about the commonly used parallel
computers, and how these computers compare to each other.
 There are many good texts which provide an introductory
treatment of parallel computing. Here are two useful references:
Highly Parallel Computing, Second Edition
George S. Almasi and Allan Gottlieb
Benjamin/Cummings Publishers, 1994
Parallel Computing Theory and Practice
Michael J. Quinn
McGraw-Hill, Inc., 1994
65