Parallel Computing Explained
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Parallel Computing Overview
Who should read this chapter?
New Users – to learn concepts and terminology.
Intermediate Users – for review or reference.
Management Staff – to understand the basic concepts – even if
you don’t plan to do any programming.
Note: Advanced users may opt to skip this chapter.
Introduction to Parallel Computing
High performance parallel computers
can solve large problems much faster than a desktop computer
fast CPUs, large memory, high speed interconnects, and high speed
input/output
able to speed up computations
by making the sequential components run faster
by doing more operations in parallel
High performance parallel computers are in demand
need for tremendous computational capabilities in science,
engineering, and business.
require gigabytes/terabytes of memory and gigaflops/teraflops of
performance
scientists are striving for petascale performance
Introduction to Parallel Computing
High performance parallel computers (HPPC) are used in a wide variety of disciplines.
Meteorologists: prediction of tornadoes and thunderstorms
Computational biologists: analyze DNA sequences
Pharmaceutical companies: design of new drugs
Oil companies: seismic exploration
Wall Street: analysis of financial markets
NASA: aerospace vehicle design
Entertainment industry: special effects in movies and
commercials
These complex scientific and business applications all need to
perform computations on large datasets or large equations.
Parallelism in our Daily Lives
There are two types of processes that occur in computers and
in our daily lives:
Sequential processes
occur in a strict order
it is not possible to do the next step until the current one is completed.
Examples
The passage of time: the sun rises and the sun sets.
Writing a term paper: pick the topic, research, and write the paper.
Parallel processes
many events happen simultaneously
Examples
Plant growth in the springtime
An orchestra
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.1.1 Parallelism in our Daily Lives
1.1.2 Parallelism in Computer Programs
1.1.2.1 Data Parallelism
1.1.2.2 Task Parallelism
1.1.3 Parallelism in Computers
1.1.3.4 Disk Parallelism
1.1.4 Performance Measures
1.1.5 More Parallelism Issues
1.2 Comparison of Parallel Computers
1.3 Summary
Parallelism in Computer Programs
Conventional wisdom:
Computer programs are sequential in nature
Only a small subset of them lend themselves to parallelism.
Algorithm: the "sequence of steps" necessary to do a computation.
For the first 30 years of computer use, programs were run sequentially.
The 1980s saw great successes with parallel computers.
Dr. Geoffrey Fox published a book entitled Parallel Computing Works!
many scientific accomplishments resulting from parallel computing
Computer programs are parallel in nature
Only a small subset of them need to be run sequentially
Parallel Computing
What a computer does when it carries out more than one
computation at a time using more than one processor.
By using many processors at once, we can speed up the execution.
If one processor can perform the arithmetic in time t,
then ideally p processors can perform the arithmetic in time t/p.
What if I use 100 processors? What if I use 1000 processors?
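As a rough illustration of the ideal case only, the following Python sketch computes t/p for the processor counts asked about above. The one-hour serial run time and the function name ideal_time are assumptions made for this example; real programs rarely reach the ideal because of overhead and portions that must run sequentially.

```python
def ideal_time(t, p):
    """Ideal run time on p processors for work that takes time t on one."""
    if p < 1:
        raise ValueError("need at least one processor")
    return t / p

# Assumed serial run time: one hour, expressed in seconds.
serial_seconds = 3600.0

for processors in (1, 100, 1000):
    print(processors, "processors:", ideal_time(serial_seconds, processors), "seconds")
```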
Almost every program has some form of parallelism.
You need to determine whether your data or your program can be
partitioned into independent pieces that can be run simultaneously.
Decomposition is the name given to this partitioning process.
Types of parallelism:
data parallelism
task parallelism.
Data Parallelism
The same code segment runs concurrently on each processor,
but each processor is assigned its own part of the data to
work on.
Do loops (in Fortran) define the parallelism.
The iterations must be independent of each other.
Data parallelism is called "fine grain parallelism" because the
computational work is spread into many small subtasks.
Example
Dense linear algebra, such as matrix multiplication, is a perfect
candidate for data parallelism.
An example of data parallelism
Original Sequential Code
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
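Parallel Code
! REDUCTION(+:C) gives each thread its own copy of C and sums the copies
! at the end; without it, threads working on different K would update the
! same C(I,J) at the same time.
!$OMP PARALLEL DO REDUCTION(+:C)
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO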
Quick Intro to OpenMP
OpenMP is a portable standard for parallel directives
covering both data and task parallelism.
More information about OpenMP is available on the OpenMP
website.
We will have a lecture on Introduction to OpenMP later.
With OpenMP, the loop that is performed in parallel is the
loop that immediately follows the Parallel Do directive.
In our sample code, it's the K loop:
DO K=1,N
OpenMP Loop Parallelism
Iteration-Processor Assignments
Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I,1:5), B(1:5,J)
proc1       K=6:10            A(I,6:10), B(6:10,J)
proc2       K=11:15           A(I,11:15), B(11:15,J)
proc3       K=16:20           A(I,16:20), B(16:20,J)
The code segment running on each processor
DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO
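The iteration-to-processor assignment in the table above (N=20 on 4 processors) can be reproduced with a short Python sketch. It assumes a simple contiguous block split; the function name block_partition is hypothetical, and the schedule an OpenMP implementation actually uses depends on the compiler and on any schedule settings.

```python
def block_partition(n_iterations, n_procs):
    """Split iterations 1..n_iterations into contiguous, nearly equal blocks."""
    base, extra = divmod(n_iterations, n_procs)
    blocks = []
    start = 1
    for proc in range(n_procs):
        # The first `extra` processors take one additional iteration.
        size = base + (1 if proc < extra else 0)
        blocks.append((start, start + size - 1))
        start += size
    return blocks

for proc, (first, last) in enumerate(block_partition(20, 4)):
    print(f"proc{proc}: K={first}:{last}")
```

When the iteration count does not divide evenly, this sketch gives the leftover iterations to the lowest-numbered processors.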
OpenMP Style of Parallelism
can be done incrementally as follows:
1. Parallelize the most computationally intensive loop.
2. Compute performance of the code.
3. If performance is not satisfactory, parallelize another loop.
4. Repeat steps 2 and 3 as many times as needed.
The ability to perform incremental parallelism is considered a
positive feature of data parallelism.
It is contrasted with the MPI (Message Passing Interface)
style of parallelism, which is an "all or nothing" approach.
Task Parallelism
Task parallelism may be thought of as the opposite of data
parallelism.
Instead of the same operations being performed on different parts
of the data, each process performs different operations.
You can use task parallelism when your program can be split into
independent pieces, often subroutines, that can be assigned to
different processors and run concurrently.
Task parallelism is called "coarse grain" parallelism because the
computational work is spread into just a few subtasks.
More code is run in parallel because the parallelism is
implemented at a higher level than in data parallelism.
Task parallelism is often easier to implement and has less overhead
than data parallelism.
Task Parallelism
The abstract code shown in the diagram is decomposed into
4 independent code segments that are labeled A, B, C, and D.
The right hand side of the diagram illustrates the 4 code
segments running concurrently.
Task Parallelism
Original Code
program main
code segment labeled A
code segment labeled B
code segment labeled C
code segment labeled D
end
Parallel Code
program main
!$OMP PARALLEL
!$OMP SECTIONS
code segment labeled A
!$OMP SECTION
code segment labeled B
!$OMP SECTION
code segment labeled C
!$OMP SECTION
code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
OpenMP Task Parallelism
With OpenMP, the code that follows each SECTION(S)
directive is allocated to a different processor. In our sample
parallel code, the allocation of code segments to processors is
as follows.
Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D
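The same structure, four independent code segments handed to four workers, can be sketched in Python with the standard concurrent.futures module. The four segment functions below are hypothetical stand-ins for segments A to D. The sketch shows the shape of task parallelism rather than its performance, since Python threads do not run CPU-bound work truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Four hypothetical, independent pieces of work standing in for A, B, C, D.
def segment_a():
    return sum(range(10))

def segment_b():
    return max(3, 7, 5)

def segment_c():
    return "abc".upper()

def segment_d():
    return sorted([3, 1, 2])

def run_sections():
    """Submit each segment as its own task and collect results by label."""
    segments = {"A": segment_a, "B": segment_b, "C": segment_c, "D": segment_d}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {label: pool.submit(func) for label, func in segments.items()}
        return {label: future.result() for label, future in futures.items()}

results = run_sections()
print(results)
```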
Parallelism in Computers
How parallelism is exploited and enhanced within the
operating system and hardware components of a parallel
computer:
operating system
arithmetic
memory
disk
Operating System Parallelism
All of the commonly used parallel computers run a version of the
Unix operating system. In the table below each OS listed is in fact
Unix, but the name of the Unix OS varies with each vendor.
Parallel Computer      OS
SGI Origin2000         IRIX
HP V-Class             HP-UX
Cray T3E               Unicos
IBM SP                 AIX
Workstation Clusters   Linux
For more information about Unix, a collection of Unix documents
is available.
Two Unix Parallelism Features
background processing facility
With the Unix background processing facility you can run the
executable a.out in the background and simultaneously view the
man page for the etime function in the foreground. There are
two Unix commands that accomplish this:
a.out > results &
man etime
cron feature
With the Unix cron feature you can submit a job that will run at
a later time.
Arithmetic Parallelism
Multiple execution units
facilitate arithmetic parallelism.
The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are
each done in a separate execution unit. This allows several execution units to be
used simultaneously, because the execution units operate independently.
Fused multiply and add
is another parallel arithmetic feature.
Parallel computers are able to overlap multiply and add. This arithmetic is named
MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on
HP computers. In either case, the two arithmetic operations are overlapped and
can complete in hardware in one computer cycle.
Superscalar arithmetic
is the ability to issue several arithmetic operations per computer cycle.
It makes use of the multiple, independent execution units. On superscalar
computers there are multiple slots per cycle that can be filled with work. This
gives rise to the name n-way superscalar, where n is the number of slots per
cycle. The SGI Origin2000 is called a 4-way superscalar computer.
Memory Parallelism
memory interleaving
memory is divided into multiple banks, and consecutive data elements are
interleaved among them. For example if your computer has 2 memory banks,
then data elements with even memory addresses would fall into one bank, and
data elements with odd memory addresses into the other.
multiple memory ports
Port means a bi-directional memory pathway. When the data elements that are
interleaved across the memory banks are needed, the multiple memory ports
allow them to be accessed and fetched in parallel, which increases the memory
bandwidth (MB/s or GB/s).
multiple levels of the memory hierarchy
There is global memory that any processor can access. There is memory that is
local to a partition of the processors. Finally there is memory that is local to a
single processor, that is, the cache memory and the memory elements held in
registers.
Cache memory
Cache is a small memory that has fast access compared with the larger main
memory and serves to keep the faster processor filled with data.
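The memory interleaving described above can be sketched in a few lines of Python. The sketch assumes addresses are counted in data elements and that the bank is chosen by the address modulo the number of banks; the function names are hypothetical, and real hardware may map addresses to banks differently.

```python
def bank_for_address(address, n_banks):
    """Bank that holds a data element under simple low-order interleaving."""
    return address % n_banks

def interleave(addresses, n_banks):
    """Group addresses by the bank they fall into."""
    banks = [[] for _ in range(n_banks)]
    for address in addresses:
        banks[bank_for_address(address, n_banks)].append(address)
    return banks

# With 2 banks, even addresses land in bank 0 and odd addresses in bank 1.
print(interleave(range(8), 2))
```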
Memory Parallelism
[Figures: Memory Hierarchy, Cache Memory]
Disk Parallelism
RAID (Redundant Array of Inexpensive Disks)
RAID disks are on most parallel computers.
The advantage of a RAID disk system is that it provides a
measure of fault tolerance.
If one of the disks goes down, it can be swapped out, and the
RAID disk system remains operational.
Disk Striping
When a data set is written to disk, it is striped across the RAID
disk system. That is, it is broken into pieces that are written
simultaneously to the different disks in the RAID disk system.
When the same data set is read back in, the pieces are read in
parallel, and the full data set is reassembled in memory.
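A minimal Python sketch of striping follows, assuming fixed-size chunks dealt out round-robin to the disks. The disks are plain in-memory lists, the chunk size and function names are hypothetical, and the redundancy that gives RAID its fault tolerance is left out. A real system would also write and read the pieces simultaneously rather than in a loop.

```python
def stripe(data, n_disks, chunk_size):
    """Deal fixed-size chunks of data round-robin onto n_disks."""
    disks = [[] for _ in range(n_disks)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for index, chunk in enumerate(chunks):
        disks[index % n_disks].append(chunk)
    return disks

def reassemble(disks):
    """Read the chunks back in their original order and join them."""
    pieces = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                pieces.append(disk[row])
    return b"".join(pieces)

data = b"ABCDEFGHIJ"
disks = stripe(data, n_disks=3, chunk_size=2)
print(disks)
print(reassemble(disks) == data)
```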
Performance Measures
Peak Performance
is the top speed at which the computer can operate.
It is a theoretical upper limit on the computer's performance.
Sustained Performance
is the highest consistently achieved speed.
It is a more realistic measure of computer performance.
Cost Performance
is used to determine if the computer is cost effective.
MHz
is a measure of the processor speed.
The processor speed is commonly measured in millions of cycles per second,
where a computer cycle is defined as the shortest time in which some work can be
done.
MIPS
is a measure of how quickly the computer can issue instructions.
Millions of instructions per second is abbreviated as MIPS, where the instructions
are computer instructions such as: memory reads and writes, logical operations,
floating point operations, integer operations, and branch instructions.
Performance Measures
Mflops (Millions of floating point operations per second)
measures how quickly a computer can perform floating-point operations
such as add, subtract, multiply, and divide.
Speedup
measures the benefit of parallelism.
It shows how your program scales as you compute with more processors,
compared to the performance on one processor.
Ideal speedup happens when the performance gain is linearly proportional to
the number of processors used.
Benchmarks
are used to rate the performance of parallel computers and parallel
programs.
A well known benchmark that is used to compare parallel computers is the
Linpack benchmark.
Based on the Linpack results, a list is produced of the Top 500
Supercomputer Sites. This list is maintained by the University of Tennessee
and the University of Mannheim.
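To make the speedup measure above concrete, the sketch below computes speedup as the one-processor run time divided by the p-processor run time. The timings are hypothetical, not measurements. The efficiency helper (speedup divided by p) is an addition for illustration; a value of 1.0 corresponds to the ideal linear speedup.

```python
def speedup(t_one, t_p):
    """Speedup: run time on one processor divided by run time on p processors."""
    return t_one / t_p

def efficiency(t_one, t_p, p):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup(t_one, t_p) / p

# Hypothetical timings in seconds, keyed by processor count.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

for p, t_p in timings.items():
    s = speedup(timings[1], t_p)
    e = efficiency(timings[1], t_p, p)
    print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```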
More Parallelism Issues
Load balancing
is the technique of evenly dividing the workload among the processors.
For data parallelism it involves how iterations of loops are allocated to processors.
Load balancing is important because the total time for the program to complete is
the time spent by the longest executing thread.
The problem size
must be large and must be able to grow as you compute with more processors.
In order to get the performance you expect from a parallel computer you need to
run a large application with large data sizes, otherwise the overhead of passing
information between processors will dominate the calculation time.
Good software tools
are essential for users of high performance parallel computers.
These tools include:
parallel compilers
parallel debuggers
performance analysis tools
parallel math software
The availability of a broad set of application software is also important.
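The load balancing point above, that the program takes as long as its longest executing thread, can be illustrated with a small Python sketch. The per-thread work figures are hypothetical; both allocations contain the same total work.

```python
def completion_time(work_per_thread):
    """The program finishes when its slowest thread finishes."""
    return max(work_per_thread)

# Hypothetical work (in seconds) assigned to each of four threads.
balanced = [25, 25, 25, 25]
unbalanced = [10, 10, 10, 70]

print("total work:", sum(balanced), sum(unbalanced))
print("balanced finishes after:", completion_time(balanced))
print("unbalanced finishes after:", completion_time(unbalanced))
```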
More Parallelism Issues
The high performance computing market is risky and chaotic. Many
supercomputer vendors are no longer in business, making the
portability of your application very important.
A workstation farm
is defined as a fast network connecting heterogeneous workstations.
The individual workstations serve as desktop systems for their owners.
When they are idle, large problems can take advantage of the unused
cycles in the whole system.
An application of this concept is the SETI project. You can participate in
searching for extraterrestrial intelligence with your home PC. More
information about this project is available at the SETI Institute.
Condor
is software that provides resource management services for applications that
run on heterogeneous collections of workstations.
Miron Livny at the University of Wisconsin at Madison is the director of the
Condor project, and has coined the phrase high throughput computing to describe
this process of harnessing idle workstation cycles. More information is available
at the Condor Home Page.
Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
1.2 Comparison of Parallel Computers
1.2.1 Processors
1.2.2 Memory Organization
1.2.3 Flow of Control
1.2.4 Interconnection Networks
1.2.4.1 Bus Network
1.2.4.2 Cross-Bar Switch Network
1.2.4.3 Hypercube Network
1.2.4.4 Tree Network
1.2.4.5 Interconnection Networks Self-test
1.2.5 Summary of Parallel Computer Characteristics
1.3 Summary
Comparison of Parallel Computers
Now you can explore the hardware components of parallel
computers:
kinds of processors
types of memory organization
flow of control
interconnection networks
You will see what is common to these parallel computers,
and what makes each one of them unique.
Kinds of Processors
There are three types of parallel computers:
1. computers with a small number of powerful processors
Typically have tens of processors.
The cooling of these computers often requires very sophisticated and
expensive equipment, making these computers very expensive for computing
centers.
They are general-purpose computers that perform especially well on
applications that have large vector lengths.
The examples of this type of computer are the Cray SV1 and the Fujitsu
VPP5000.
Kinds of Processors
There are three types of parallel computers:
2. computers with a large number of less powerful processors
Called Massively Parallel Processors (MPPs), they typically have thousands of
processors.
The processors are usually proprietary and air-cooled.
Because of the large number of processors, the distance between the furthest
processors can be quite large requiring a sophisticated internal network that
allows distant processors to communicate with each other quickly.
These computers are suitable for applications with a high degree of
concurrency.
The MPP type of computer was popular in the 1980s.
Examples of this type of computer were the Thinking Machines CM-2
computer, and the computers made by the MasPar company.
Kinds of Processors
There are three types of parallel computers:
3. computers that are medium scale, in between the two extremes
Typically have hundreds of processors.
The processor chips are usually not proprietary; rather they are commodity
processors like the Pentium III.
These are general-purpose computers that perform well on a wide range of
applications.
The most common example of this class is the Linux Cluster.
Trends and Examples
Processor trends:
Decade   Processor Type                    Computer Example
1970s    Pipelined, Proprietary            Cray-1
1980s    Massively Parallel, Proprietary   Thinking Machines CM-2
1990s    Superscalar, RISC, Commodity      SGI Origin2000
2000s    CISC, Commodity                   Workstation Clusters
The processors on today’s commonly used parallel computers:
Computer               Processor
SGI Origin2000         MIPS RISC R12000
HP V-Class             HP PA 8200
Cray T3E               Compaq Alpha
IBM SP                 IBM Power3
Workstation Clusters   Intel Pentium III, Intel Itanium
Memory Organization
The following paragraphs describe the three types of
memory organization found on parallel computers:
distributed memory
shared memory
distributed shared memory
Distributed Memory
In distributed memory computers, the total memory is partitioned
into memory that is private to each processor.
There is a Non-Uniform Memory Access time (NUMA), which is
proportional to the distance between the two communicating
processors.
On NUMA computers, data is accessed the quickest from a private memory,
while data from the most distant processor takes the longest to access.
Some examples are the Cray T3E, the IBM SP, and workstation clusters.
Distributed Memory
When programming distributed memory computers, the
code and the data should be structured such that the bulk of
a processor’s data accesses are to its own private (local)
memory.
This is called having good data locality.
Today's distributed memory computers use message passing such as MPI to
communicate between processors as shown in the following example:
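The following Python sketch imitates the message passing idea on a single machine. It is not MPI: two threads stand in for two processors, each sums only its own local data, and a queue stands in for the interconnect. All names here are hypothetical. On a real distributed memory computer the send and receive steps would be MPI calls such as MPI_Send and MPI_Recv.

```python
import queue
import threading

def worker(rank, local_data, inbox, outbox, results):
    """Sum private data; rank 1 sends its sum to rank 0, which combines them."""
    partial = sum(local_data)          # touches only this worker's own data
    if rank == 1:
        outbox.put(partial)            # "send" the partial sum to rank 0
    else:
        remote = inbox.get()           # "receive" blocks until the message arrives
        results["total"] = partial + remote

channel = queue.Queue()                # stands in for the interconnect
results = {}
threads = [
    threading.Thread(target=worker, args=(0, [1, 2, 3], channel, None, results)),
    threading.Thread(target=worker, args=(1, [4, 5, 6], None, channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results["total"])
```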
Distributed Memory
One advantage of distributed memory computers is that they
are easy to scale. As the demand for resources grows,
computer centers can easily add more memory and
processors.
This is often called the LEGO block approach.
The drawback is that programming of distributed memory
computers can be quite complicated.
Shared Memory
In shared memory computers, all processors have access to a single pool
of centralized memory with a uniform address space.
Any processor can address any memory location at the same speed so
there is Uniform Memory Access time (UMA).
Processors communicate with each other through the shared memory.
The advantages and disadvantages of shared memory machines are roughly
the opposite of those of distributed memory computers.
They are easier to program because programming them resembles programming
single processor machines.
But they don't scale like their distributed memory counterparts.
Distributed Shared Memory
In Distributed Shared Memory (DSM) computers, a cluster or partition of
processors has access to a common shared memory.
It accesses the memory of a different processor cluster in a NUMA fashion.
Memory is physically distributed but logically shared.
Attention to data locality again is important.
Distributed shared memory computers combine the best features of both
distributed memory computers and shared memory computers.
That is, DSM computers have both the scalability of distributed memory
computers and the ease of programming of shared memory computers.
Some examples of DSM computers are the SGI Origin2000 and the HP V-Class
computers.
Trends and Examples
Memory organization trends:
Decade   Memory Organization         Example
1970s    Shared Memory               Cray-1
1980s    Distributed Memory          Thinking Machines CM-2
1990s    Distributed Shared Memory   SGI Origin2000
2000s    Distributed Memory          Workstation Clusters
The memory organization of today’s commonly used parallel computers:
Computer               Memory Organization
SGI Origin2000         DSM
HP V-Class             DSM
Cray T3E               Distributed
IBM SP                 Distributed
Workstation Clusters   Distributed
Flow of Control
When you look at the flow of control you will see three types
of parallel computers:
Single Instruction Multiple Data (SIMD)
Multiple Instruction Multiple Data (MIMD)
Single Program Multiple Data (SPMD)
Flynn’s Taxonomy
Flynn’s Taxonomy, devised in 1972 by Michael Flynn of Stanford
University, describes computers by how streams of instructions interact
with streams of data.
There can be single or multiple instruction streams, and there can be
single or multiple data streams. This gives rise to 4 types of computers as
shown in the diagram below:
Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD.
Of these 4, only SIMD and MIMD are applicable to parallel computers.
Another computer type, SPMD, is a special case of MIMD.
SIMD Computers
SIMD stands for Single Instruction Multiple Data.
Each processor follows the same set of instructions, with different data
elements being allocated to each processor.
SIMD computers have distributed memory with typically thousands of simple processors,
and the processors run in lock step.
SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications,
such as neural networks.
Some examples of SIMD computers were the Thinking Machines CM-2 computer
and the computers from the MasPar company.
The processors are commanded by the global controller that sends
instructions to the processors.
It says add, and they all add.
It says shift to the right, and they all shift to the right.
The processors are like obedient soldiers, marching in unison.
MIMD Computers
MIMD stands for Multiple Instruction Multiple Data.
There are multiple instruction streams with separate code segments distributed
among the processors.
MIMD is actually a superset of SIMD, so that the processors can run the same
instruction stream or different instruction streams.
In addition, there are multiple data streams; different data elements are allocated
to each processor.
MIMD computers can have either distributed memory or shared memory.
While the processors on SIMD computers run in lock step, the processors on
MIMD computers run independently of each other.
MIMD computers can be used for either data parallel or task parallel
applications.
Some examples of MIMD computers are the SGI Origin2000 computer and the
HP V-Class computer.
SPMD Computers
SPMD stands for Single Program Multiple Data.
SPMD is a special case of MIMD.
SPMD execution happens when a MIMD computer is programmed to have the
same set of instructions per processor.
With SPMD computers, while the processors are running the same code
segment, each processor can run that code segment asynchronously.
Unlike SIMD, the synchronous execution of instructions is relaxed.
An example is the execution of an if statement on a SPMD computer.
Because each processor computes with its own partition of the data elements, it
may evaluate the right hand side of the if statement differently from another
processor.
One processor may take a certain branch of the if statement, and another
processor may take a different branch of the same if statement.
Hence, even though each processor has the same set of instructions, those
instructions may be evaluated in a different order from one processor to the next.
The analogies we used for describing SIMD computers can be modified for
MIMD computers.
Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world
the processors march to the beat of their own drummer.
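The if statement example above can be sketched in Python. Every processor runs the same function, program, but the branch taken depends on its own partition of the data. The partitions, the threshold of 10, and the function name are all hypothetical, and the calls run one after another here rather than on separate processors.

```python
def program(rank, partition):
    """The single program every processor runs on its own partition."""
    total = sum(partition)
    if total > 10:                      # outcome depends on the local data
        return f"proc{rank}: large partition, total={total}"
    return f"proc{rank}: small partition, total={total}"

# Hypothetical partitioning of the data across four processors.
partitions = [[1, 2, 3], [4, 5, 6], [0, 1, 0], [7, 8, 9]]

# Run serially here; on an SPMD machine each call would run on its own processor.
for rank, partition in enumerate(partitions):
    print(program(rank, partition))
```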
Summary of SIMD versus MIMD
                    SIMD                      MIMD
Memory              distributed memory        distributed memory or shared memory
Code Segment        same per processor        same or different
Processors Run In   lock step                 asynchronously
Data Elements       different per processor   different per processor
Applications        data parallel             data parallel or task parallel
Trends and Examples
Flow of control trends:
Decade   Flow of Control   Computer Example
1980s    SIMD              Thinking Machines CM-2
1990s    MIMD              SGI Origin2000
2000s    MIMD              Workstation Clusters
The flow of control on today's commonly used parallel computers:
Computer               Flow of Control
SGI Origin2000         MIMD
HP V-Class             MIMD
Cray T3E               MIMD
IBM SP                 MIMD
Workstation Clusters   MIMD
Interconnection Networks
What exactly is the interconnection network?
The interconnection network is made up of the wires and cables that define how the
multiple processors of a parallel computer are connected to each other and to the
memory units.
The time required to transfer data is dependent upon the specific type of the
interconnection network.
This transfer time is called the communication time.
What network characteristics are important?
Diameter: the maximum distance that data must travel for 2 processors to
communicate.
Bandwidth: the amount of data that can be sent through a network connection.
Latency: the delay on a network while a data packet is being stored and forwarded.
Types of Interconnection Networks
The network topologies (geometric arrangements of the computer network
connections) are:
Bus
Cross-bar Switch
Hypercube
Tree
Interconnection Networks
The aspects of network issues are:
Cost
Scalability
Reliability
Suitable Applications
Data Rate
Diameter
Degree
General Network Characteristics
Some networks can be compared in terms of their degree and diameter.
Degree: how many communicating wires are coming out of each processor.
A large degree is a benefit because it has multiple paths.
Diameter: This is the distance between the two processors that are farthest
apart.
A small diameter corresponds to low latency.
Bus Network
Bus topology is the original coaxial cable-based Local Area Network
(LAN) topology in which the medium forms a single bus to which all
stations are attached.
The positive aspects
It is a mature technology that is well known and reliable.
The cost is very low.
It is simple to construct.
The negative aspects
It has a limited data transmission rate.
It is not scalable in terms of performance.
Example: SGI Power Challenge, which only scaled to 18 processors.
Cross-Bar Switch Network
A cross-bar switch is a network that works through a switching mechanism to
access shared memory.
It scales better than the bus network, but it costs significantly more.
The telephone system uses this type of network. An example of a computer
with this type of network is the HP V-Class.
Here is a diagram of a cross-bar switch network which shows the processors
talking through the switchboxes to store or retrieve data in memory.
There are multiple paths for a processor to communicate with a certain
memory.
The switches determine the optimal route to take.
Hypercube Network
In a hypercube network, the processors are connected as if they
were corners of a multidimensional cube. Each node in an N
dimensional cube is directly connected to N other nodes.
The fact that the number of directly connected, "nearest neighbor", nodes
increases with the total size of the network is also highly desirable for a
parallel computer.
The degree of a hypercube network is log n and the diameter is log n,
where n is the number of processors and the logarithm is base 2.
Examples of computers with this type of network are the CM-2, NCUBE-2,
and the Intel iPSC/860.
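The degree and diameter of a hypercube can be checked with a short Python sketch. It assumes the usual labeling in which each of the n processors gets a number from 0 to n-1 and two processors are directly connected when their binary labels differ in exactly one bit; the function names are hypothetical.

```python
def neighbors(node, dimension):
    """Nodes directly connected to `node`: flip one bit of its label at a time."""
    return [node ^ (1 << bit) for bit in range(dimension)]

def hops(a, b):
    """Minimum number of links between two nodes: the count of differing bits."""
    return bin(a ^ b).count("1")

dimension = 3                       # a 3-dimensional cube
n = 2 ** dimension                  # 8 processors

# Degree: every node has `dimension` = log2(n) direct neighbors.
print(neighbors(0, dimension))

# Diameter: the largest hop count between any pair of nodes.
diameter = max(hops(a, b) for a in range(n) for b in range(n))
print(diameter)
```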
Tree Network
The processors are the bottom nodes of the tree. For a processor
to retrieve data, it must go up in the network and then go back
down.
This is useful for decision making applications that can be mapped
as trees.
The degree of a tree network is 1. The diameter of the network is
2 log (n+1)-2 where n is the number of processors.
The Thinking Machines CM-5 is an example of a parallel computer with this
type of network.
Tree networks are very suitable for database applications because they
allow multiple searches through the database at a time.
Interconnected Networks
Torus Network: A mesh with wrap-around connections in
both the x and y directions.
Multistage Network: A network with more than one
networking unit.
Fully Connected Network: A network where every processor
is connected to every other processor.
Hypercube Network: Processors are connected as if they
were corners of a multidimensional cube.
Mesh Network: A network where each interior processor is
connected to its four nearest neighbors.
Interconnected Networks
Bus Based Network: Coaxial cable based LAN topology in
which the medium forms a single bus to which all stations are
attached.
Cross-bar Switch Network: A network that works through a
switching mechanism to access shared memory.
Tree Network: The processors are the bottom nodes of the
tree.
Ring Network: Each processor is connected to two others
and the line of connections forms a circle.
Summary of Parallel Computer
Characteristics
How many processors does the computer have?
10s?
100s?
1000s?
How powerful are the processors?
what's the MHz rate
what's the MIPS rate
What's the instruction set architecture?
RISC
CISC
Summary of Parallel Computer
Characteristics
How much memory is available?
total memory
memory per processor
What kind of memory?
distributed memory
shared memory
distributed shared memory
What type of flow of control?
SIMD
MIMD
SPMD
Summary of Parallel Computer
Characteristics
What is the interconnection network?
Bus
Crossbar
Hypercube
Tree
Torus
Multistage
Fully Connected
Mesh
Ring
Hybrid
Design decisions made by some of the major parallel computer vendors
Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet, Tree
Summary
This completes our introduction to parallel computing.
You have learned about parallelism in computer programs, and
also about parallelism in the hardware components of parallel
computers.
In addition, you have learned about the commonly used parallel
computers, and how these computers compare to each other.
There are many good texts which provide an introductory
treatment of parallel computing. Here are two useful references:
George S. Almasi and Allan Gottlieb, Highly Parallel Computing, Second Edition,
Benjamin/Cummings Publishers, 1994
Michael J. Quinn, Parallel Computing: Theory and Practice,
McGraw-Hill, Inc., 1994
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
2.1 Automatic Compiler Parallelism
2.2 Data Parallelism by Hand
2.3 Mixing Automatic and Hand Parallelism
2.4 Task Parallelism
2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
How to Parallelize a Code
This chapter describes how to turn a single processor
program into a parallel one, focusing on shared memory
machines.
Both automatic compiler parallelization and parallelization by
hand are covered.
The details for accomplishing both data parallelism and task
parallelism are presented.
Automatic Compiler Parallelism
Automatic compiler parallelism enables you to use a
single compiler option and let the compiler do the work.
The advantage of it is that it’s easy to use.
The disadvantages are:
The compiler only does loop level parallelism, not task
parallelism.
The compiler wants to parallelize every do loop in your code.
If you have hundreds of do loops this creates way too much
parallel overhead.
Automatic Compiler Parallelism
To use automatic compiler parallelism on a Linux system
with the Intel compilers, specify the following.
ifort -parallel -O2 ... prog.f
The compiler creates conditional code that will run with any
number of threads.
Specify the number of threads with setenv, then run the program and
make sure you still get the right answers:
setenv OMP_NUM_THREADS 4
a.out > results
Data Parallelism by Hand
First identify the loops that use most of the CPU time (the Profiling
lecture describes how to do this).
By hand, insert into the code OpenMP directive(s) just before the
loop(s) you want to make parallel.
Some code modifications may be needed to remove data dependencies
and other inhibitors of parallelism.
Use your knowledge of the code and data to assist the compiler.
For the SGI Origin2000 computer, insert into the code an OpenMP
directive just before the loop that you want to make parallel.
!$OMP PARALLEL DO
do i=1,n
… lots of computation ...
end do
!$OMP END PARALLEL DO
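For C programmers, the same data-parallel pattern can be sketched as follows. This is only an illustration, not part of the original course material: the routine name vector_add and its arguments are assumptions, and the pragma takes effect only when the compiler's OpenMP option is switched on (otherwise the loop simply runs serially).

```c
#include <assert.h>

/* Hypothetical example routine: c[i] = a[i] + b[i] for i = 0..n-1.
   Every iteration is independent, so the loop is safe to run in parallel. */
void vector_add(int n, const double *a, const double *b, double *c)
{
    int i;
    assert(n >= 0);
    /* C counterpart of !$OMP PARALLEL DO; ignored when OpenMP is not enabled. */
#pragma omp parallel for
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Because the loop index i is the parallel loop variable, OpenMP makes it private to each thread automatically; the arrays are shared.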
Data Parallelism by Hand
Compile with the -mp compiler option.
f90 -mp ... prog.f
As before, the compiler generates conditional code that will run with any
number of threads.
If you want to rerun your program with a different number of threads, you do
not need to recompile, just re-specify the setenv command.
setenv OMP_NUM_THREADS 8
a.out > results2
The setenv command can be placed anywhere before the a.out command.
The setenv command must be typed exactly as indicated. If you have a typo,
you will not receive a warning or error message. To make sure that the setenv
command is specified correctly, type:
setenv
It produces a listing of your environment variable settings.
Mixing Automatic and Hand Parallelism
You can have one source file parallelized automatically by the
compiler, and another source file parallelized by hand.
Suppose you split your code into two files named prog1.f and
prog2.f.
f90 -c -apo … prog1.f (automatic // for prog1.f)
f90 -c -mp … prog2.f (by hand // for prog2.f)
f90 prog1.o prog2.o (creates one executable)
a.out > results (runs the executable)
Task Parallelism
You can accomplish task parallelism as follows:
!$OMP PARALLEL
!$OMP SECTIONS
… lots of computation in part A …
!$OMP SECTION
… lots of computation in part B ...
!$OMP SECTION
… lots of computation in part C ...
!$OMP END SECTIONS
!$OMP END PARALLEL
Compile with the -mp compiler option.
f90 -mp … prog.f
Use the setenv command to specify the number of threads.
setenv OMP_NUM_THREADS 3
a.out > results
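A C version of the sections construct might look like the sketch below. The three work routines part_a, part_b and part_c are hypothetical stand-ins for "part A", "part B" and "part C"; what matters is that they are independent of each other and write to different locations.

```c
#include <assert.h>

/* Three hypothetical, independent pieces of work. */
static double part_a(int n) { double s = 0.0; int i; for (i = 1; i <= n; i++) s += i;             return s; }
static double part_b(int n) { double s = 0.0; int i; for (i = 1; i <= n; i++) s += (double)i * i; return s; }
static double part_c(int n) { double s = 0.0; int i; for (i = 1; i <= n; i++) s += 2.0 * i;       return s; }

/* Runs the three parts as separate tasks; each writes its own slot of out[]. */
void run_tasks(int n, double out[3])
{
    assert(n >= 0);
    /* C counterpart of !$OMP PARALLEL / !$OMP SECTIONS; ignored without OpenMP. */
#pragma omp parallel sections
    {
#pragma omp section
        out[0] = part_a(n);
#pragma omp section
        out[1] = part_b(n);
#pragma omp section
        out[2] = part_c(n);
    }
}
```

With three sections there is work for at most three threads, which is why the example above sets OMP_NUM_THREADS to 3.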
Parallelism Issues
There are some issues to consider when parallelizing a
program.
Should data parallelism or task parallelism be used?
Should automatic compiler parallelism or parallelism
by hand be used?
Which loop in a nested loop situation should be the
one that becomes parallel?
How many threads should be used?
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
3.1 Recompile
3.2 Word Length
3.3 Compiler Options for Debugging
3.4 Standards Violations
3.5 IEEE Arithmetic Differences
3.6 Math Library Differences
3.7 Compute Order Related Differences
3.8 Optimization Level Too High
3.9 Diagnostic Listings
3.10 Further Information
Porting Issues
To run a computer program that presently runs on a workstation, a
mainframe, a vector computer, or another parallel computer on a new
parallel computer, you must first "port" the code.
After porting the code, it is important to have some benchmark results
you can use for comparison.
To do this, run the original program on a well-defined dataset, and save the
results from the old or “baseline” computer.
Then run the ported code on the new computer and compare the results.
If the results are different, don't automatically assume that the new results
are wrong – they may actually be better. There are several reasons why this
might be true, including:
Precision Differences - the new results may actually be more accurate than the baseline
results.
Code Flaws - porting your code to a new computer may have uncovered a hidden flaw in
the code that was already there.
Detection methods for finding code flaws, solutions, and workarounds
are provided in this lecture.
Recompile
Some codes just need to be recompiled to get accurate results.
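The compilers available on the NCSA computer platforms are
shown in the following table ("-" means not available):
Language | SGI Origin2000: MIPSpro | SGI Origin2000: Portland Group | IA-32 Linux: Intel | IA-32 Linux: GNU | IA-32 Linux: Portland Group | IA-64 Linux: Intel | IA-64 Linux: GNU
Fortran 77 | f77 | - | ifort | g77 | pgf77 | ifort | g77
Fortran 90 | f90 | - | ifort | - | pgf90 | ifort | -
Fortran 95 | f95 | - | ifort | - | - | ifort | -
High Performance Fortran | - | pghpf | - | - | pghpf | - | -
C | cc | - | icc | gcc | pgcc | icc | gcc
C++ | CC | - | icpc | g++ | pgCC | icpc | g++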
Word Length
Code flaws can occur when you are porting your code to a
different word length computer.
For C, the size of a long integer (and of a pointer) differs depending
on the machine and on how the code is compiled. On the IA32 and IA64
Linux clusters, the size of a long is 4 and 8 bytes, respectively,
while a plain int is 4 bytes on both. On the SGI Origin2000, the
corresponding value is 4 bytes if the code is compiled with the -n32
flag, and 8 bytes if compiled without any flags or explicitly with
the -64 flag.
For Fortran, the SGI MIPSpro and Intel compilers contain the
following flags to set default variable size.
-in where n is a number: set the default INTEGER to INTEGER*n.
The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux
clusters.
-rn where n is a number: set the default REAL to REAL*n. The value
of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
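One way to guard C code against word-length surprises is to print the sizes on each platform and to use fixed-width types where the width matters. The sketch below assumes a C99 compiler (for stdint.h); the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Print the sizes that typically change between 32-bit and 64-bit builds. */
void print_word_lengths(void)
{
    printf("int:     %u bytes\n", (unsigned)sizeof(int));
    printf("long:    %u bytes\n", (unsigned)sizeof(long));
    printf("pointer: %u bytes\n", (unsigned)sizeof(void *));
}

/* Sum of 1..n held in a 64-bit integer on every platform, so the result
   does not silently overflow where long happens to be only 4 bytes. */
int64_t sum_to(int64_t n)
{
    assert(n >= 0);
    return n * (n + 1) / 2;
}
```

Running print_word_lengths on the baseline and on the new computer shows at a glance whether the port crosses a word-length boundary.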
Compiler Options for Debugging
On the SGI Origin2000, the MIPSpro compilers include
debugging options via the -DEBUG: option group. The syntax is as
follows:
-DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
Array-bound checking: check for subscripts out of range at
runtime.
-DEBUG:subscript_check=ON
Force all un-initialized stack, automatic and dynamically
allocated variables to be initialized.
-DEBUG:trap_uninitialized=ON
Compiler Options for Debugging
On the IA32 Linux cluster, the Fortran compiler is
equipped with the following -C flags for runtime
diagnostics:
-CA: pointers and allocatable references
-CB: array and subscript bounds
-CS: consistent shape of intrinsic procedure
-CU: use of uninitialized variables
-CV: correspondence between dummy and actual
arguments
Standards Violations
Code flaws can occur when the program has non-ANSI
standard Fortran coding.
ANSI standard Fortran is a set of rules for compiler writers that
specify, for example, the value of the do loop index upon exit
from the do loop.
Standards Violations Detection
To detect standards violations on the SGI Origin2000 computer
use the -ansi flag.
This option generates a listing of warning messages for the use
of non-ANSI standard coding.
On the Linux clusters, the -ansi[-] flag enables/disables
assumption of ANSI conformance.
IEEE Arithmetic Differences
Code flaws occur when the baseline computer conforms to the
IEEE arithmetic standard and the new computer does not.
The IEEE Arithmetic Standard is a set of rules governing arithmetic
roundoff and overflow behavior.
For example, it prohibits the compiler writer from replacing x/y
with x*recip(y), since the two results may differ slightly for some
operands. You can make your program strictly conform to the IEEE
standard.
To make your program conform to the IEEE Arithmetic Standards
on the SGI Origin2000 computer use:
f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3.
This option specifies the level of conformance to the IEEE
standard where 1 is the most stringent and 3 is the most liberal.
On the Linux clusters, the Intel compilers can achieve
conformance to IEEE standard at a stringent level with the -mp
flag, or a slightly relaxed level with the -mp1 flag.
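The x/y versus x*recip(y) difference is easy to observe directly. The C sketch below compares the two forms for a given pair of operands; the function name is made up for this illustration, and the volatile qualifiers are there only to keep the compiler from merging the two expressions. It assumes IEEE double precision arithmetic.

```c
#include <assert.h>

/* Returns 1 if x/y and x*(1/y) produce exactly the same double, 0 otherwise. */
int reciprocal_matches_division(double x, double y)
{
    volatile double quotient, reciprocal, product;
    assert(y != 0.0);
    quotient   = x / y;          /* one rounding */
    reciprocal = 1.0 / y;        /* first rounding */
    product    = x * reciprocal; /* second rounding */
    return quotient == product;
}
```

For example, 49.0/49.0 is exactly 1.0, but 49.0*(1.0/49.0) is slightly less than 1.0 because the reciprocal has already been rounded once before the multiply.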
Math Library Differences
Most high-performance parallel computers are equipped with
vendor-supplied math libraries.
On the SGI Origin2000 platform, there are SGI/Cray Scientific
Library (SCSL) and Complib.sgimath.
SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms
(BLAS), LAPACK and Fast Fourier Transform (FFT) routines.
SCSL can be linked with -lscs for the serial version, or
-mp -lscs_mp for the parallel version.
The complib library can be linked with -lcomplib.sgimath for the
serial version, or -mp -lcomplib.sgimath_mp for the parallel
version.
The Intel Math Kernel Library (MKL) contains the complete set of
functions from BLAS, the extended BLAS (sparse), the complete
set of LAPACK routines, and Fast Fourier Transform (FFT)
routines.
Math Library Differences
On the IA32 Linux cluster, the libraries to link to are:
For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
When calling MKL routines from C/C++ programs, you also
need to link with -lF90.
On the IA64 Linux cluster, the corresponding libraries are:
For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
When calling MKL routines from C/C++ programs, you also
need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins
Compute Order Related Differences
Code flaws can occur because of the non-deterministic computation of
data elements on a parallel computer. The compute order in which the
threads will run cannot be guaranteed.
For example, in a data parallel program, the 50th index of a do loop may be
computed before the 10th index of the loop. Furthermore, the threads may
run in one order on the first run, and in another order on the next run of the
program.
Note: If your algorithm depends on data being computed in a specific order,
your code is inappropriate for a parallel computer.
Use the following method to detect compute order related differences:
If your loop looks like
DO I = 1, N
change it to
DO I = N, 1, -1
The results should not change if the iterations are independent.
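The loop-reversal check can be tried on two small C loops, one with independent iterations and one with a dependency between iterations. Both routines and the reversed argument are invented for this sketch.

```c
#include <assert.h>

/* Independent iterations: c[i] depends only on a[i] and b[i]. */
void add_arrays(int n, const double *a, const double *b, double *c, int reversed)
{
    int i;
    assert(n >= 0);
    if (!reversed) {
        for (i = 0; i < n; i++) c[i] = a[i] + b[i];
    } else {
        for (i = n - 1; i >= 0; i--) c[i] = a[i] + b[i];
    }
}

/* Dependent iterations: x[i] needs the already updated x[i-1]. */
void running_sum(int n, double *x, int reversed)
{
    int i;
    assert(n >= 0);
    if (!reversed) {
        for (i = 1; i < n; i++) x[i] += x[i - 1];
    } else {
        for (i = n - 1; i >= 1; i--) x[i] += x[i - 1];
    }
}
```

add_arrays gives the same answer in either direction, so it is a candidate for data parallelism. running_sum on {1,1,1,1} gives {1,2,3,4} forward but {1,2,2,2} reversed, which exposes the dependency.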
Optimization Level Too High
Code flaws can occur when the optimization level has been set too
high, thus trading accuracy for speed.
The compiler reorders and optimizes your code based on
assumptions it makes about your program. This can sometimes cause
answers to change at higher optimization levels.
Setting the Optimization Level
Both the SGI Origin2000 computer and the IBM Linux clusters provide
Level 0 (no optimization) to Level 3 (most aggressive) optimization,
using the -O{0,1,2, or 3} flag. One should bear in mind that Level 3
optimization may carry out loop transformations that affect the
correctness of calculations. Checking correctness and precision of
calculation is highly recommended when -O3 is used.
For example, on the Origin2000,
f90 -O0 … prog.f
turns off all optimizations.
Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems using
the method of binary chop.
To do this, divide your program prog.f into halves. Name them prog1.f
and prog2.f.
Compile the first half with -O0 and the second half with -O3:
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
If the results are correct, the optimization problem lies in prog1.f.
Next divide prog1.f into halves. Name them prog1a.f and prog1b.f.
Compile prog1a.f with -O0 and prog1b.f with -O3:
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
Continue in this manner until you have isolated the section of code that is
producing incorrect results.
Diagnostic Listings
The SGI Origin 2000 compiler will generate all
kinds of diagnostic warnings and messages, but
not always by default. Some useful listing options
are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...
Further Information
SGI
man f77/f90/cc
man debug_group
man math
man complib.sgimath
MIPSpro 64-Bit Porting and Transition Guide
Online Manuals
Linux clusters pages
ifort/icc/icpc -help (IA32, IA64, Intel64)
Intel Fortran Compiler for Linux
Intel C/C++ Compiler for Linux
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.3 Vendor Tuned Code
4.4 Further Information
Scalar Tuning
If you are not satisfied with the performance of your
program on the new computer, you can tune the scalar code
to decrease its runtime.
This chapter describes many of these techniques:
The use of the most aggressive compiler options
The improvement of loop unrolling
The use of subroutine inlining
The use of vendor supplied tuned code
The detection of cache problems, and their solution are
presented in the Cache Tuning chapter.
Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters the main
optimization switch is
-On where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 do beneficial optimizations that will not
affect the accuracy of results.
-O3 specifies the most aggressive optimizations. It takes
the most compile time, may produce changes in
accuracy, and turns on software pipelining.
Aggressive Compiler Options
It should be noted that -O3 might carry out loop
transformations that produce incorrect results in some codes.
It is recommended that one compare the answer obtained from
Level 3 optimization with one obtained from a lower-level
optimization.
On the SGI Origin2000 and the Linux clusters, -O3 can be
used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3)
and -mp (or -mp1), respectively, to enforce operation
conformance to IEEE standard at different levels.
On the SGI Origin2000, the option
-Ofast=ip27
is also available. This option specifies the most aggressive
optimizations that are specifically tuned for the Origin2000
computer.
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.2.1 Statement Level
4.2.2 Block Level
4.2.3 Routine Level
4.2.4 Software Pipelining
4.2.5 Loop Unrolling
4.2.6 Subroutine Inlining
4.2.7 Optimization Report
4.2.8 Profile-guided Optimization (PGO)
4.3 Vendor Tuned Code
4.4 Further Information
Compiler Optimizations
The various compiler optimizations can be classified as
follows:
Statement Level Optimizations
Block Level Optimizations
Routine Level Optimizations
Software Pipelining
Loop Unrolling
Subroutine Inlining
Each of these is described in the following sections.
Statement Level
Constant Folding
Replace simple arithmetic operations on constants with the pre-
computed result.
y = 5+7 becomes y = 12
Short Circuiting
Avoid executing parts of conditional tests that are not necessary.
if (I.eq.J .or. I.eq.K) expression
when I=J immediately compute the expression
Register Assignment
Put frequently used variables in registers.
Block Level
Dead Code Elimination
Remove unreachable code and code that is never executed or
used.
Instruction Scheduling
Reorder the instructions to improve memory pipelining.
Routine Level
Strength Reduction
Replace expressions in a loop with an expression that takes fewer
cycles.
Common Subexpressions Elimination
Expressions that appear more than once are computed once, and the
result is substituted for each occurrence of the expression.
Constant Propagation
Compile time replacement of variables with constants.
Loop Invariant Elimination
Expressions inside a loop that don't change with the do loop index are
moved outside the loop.
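To make the routine-level optimizations concrete, the C sketch below shows a loop as a programmer might write it and a hand-transformed version of roughly what the optimizer produces. Both functions are hypothetical; a real compiler performs these steps internally and may order them differently.

```c
#include <assert.h>

/* As written: a*b is repeated and loop invariant, and 4*i is a multiply. */
void fill_naive(int n, double a, double b, const double *x, double *y)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        y[i] = (a * b) * x[i] + (a * b) + 4 * i;
    }
}

/* After the optimizations described above, applied by hand. */
void fill_optimized(int n, double a, double b, const double *x, double *y)
{
    int i;
    int offset = 0;           /* strength reduction: an add replaces 4*i      */
    const double ab = a * b;  /* common subexpression, moved outside the loop */
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        y[i] = ab * x[i] + ab + offset;
        offset += 4;
    }
}
```

The two versions perform the same floating point operations in the same order, so they are intended to give the same results while doing less work per iteration.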
Software Pipelining
Software pipelining allows the mixing of operations from
different loop iterations in each iteration of the hardware
loop. It is used to get the maximum work done per clock
cycle.
Note: On the R10000s there is out-of-order execution of
instructions, and software pipelining may actually get in the
way of this feature.
Loop Unrolling
The loop's stride (or step) value is increased, and the body of the loop is
replicated. It is used to improve the scheduling of the loop by giving a
longer sequence of straight line code. An example of loop unrolling
follows:
Original Loop
do I = 1, 99
c(I) = a(I) + b(I)
enddo
Unrolled Loop
do I = 1, 99, 3
c(I) = a(I) + b(I)
c(I+1) = a(I+1) + b(I+1)
c(I+2) = a(I+2) + b(I+2)
enddo
There is a limit to the amount of unrolling that can take place because there
are a limited number of registers.
On the SGI Origin2000, loops are unrolled to a level of 8 by default.
You can unroll to a level of 12 by specifying:
f90 -O3 -OPT:unroll_times_max=12 ... prog.f
On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0
for unrolling and no unrolling, respectively.
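The Fortran example above works out evenly because 99 is a multiple of 3. For a general iteration count, an unrolled loop needs a short cleanup loop for the leftover iterations. A hand-unrolled C sketch (the function name is illustrative; compilers generate the equivalent automatically):

```c
#include <assert.h>

/* c[i] = a[i] + b[i], unrolled by 3, with a cleanup loop for the remainder. */
void add_unrolled3(int n, const double *a, const double *b, double *c)
{
    int i;
    assert(n >= 0);
    /* Main loop: three iterations of the original loop per trip. */
    for (i = 0; i + 2 < n; i += 3) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
    }
    /* Cleanup loop: the 0, 1 or 2 iterations that are left over. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```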
Subroutine Inlining
Subroutine inlining replaces a call to a subroutine with
the body of the subroutine itself.
One reason for using subroutine inlining is that when a
subroutine is called inside a do loop that has a huge
iteration count, subroutine inlining may be more
efficient because it cuts down on subroutine call overhead.
However, the chief reason for using it is that do loops
that contain subroutine calls may not parallelize.
Subroutine Inlining
On the SGI Origin2000 computer, there are several options to
invoke inlining:
Inline all routines except those specified to -INLINE:never
f90 -O3 -INLINE:all … prog.f
Inline no routines except those specified to -INLINE:must
f90 -O3 -INLINE:none … prog.f
Specify a list of routines to inline at every call
f90 -O3 -INLINE:must=subrname … prog.f
Specify a list of routines never to inline
f90 -O3 -INLINE:never=subrname … prog.f
On the Linux clusters, the following flags can invoke function inlining:
-ip: inline function expansion for calls defined within the current source file
-ipo: inline function expansion for calls defined in separate files
Optimization Report
Intel 9.x and later compilers can generate reports that provide
useful information on optimization done on different parts of your
code.
To generate such optimization reports in a file filename, add the flag
-opt-report-file filename.
If you have a lot of source files to process simultaneously, and you use
a makefile to compile, you can also use make's "suffix" rules to have
optimization reports produced automatically, each with a unique
name. For example,
.f.o:
ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
creates optimization reports that are named identically to the original
Fortran source but with the suffix ".f" replaced by ".opt".
Optimization Report
To help developers and performance analysts navigate through the
usually lengthy optimization reports, the NCSA program OptView is
designed to provide an easy-to-use and intuitive interface that allows the
user to browse through their own source code, cross-referenced with
the optimization reports.
OptView is installed on NCSA's IA64 Linux cluster under the directory
/usr/apps/tools/bin. You can either add that directory to your UNIX
PATH or you can invoke optview using an absolute path name. You'll
need to be using the X-Window system and to have set your DISPLAY
environment variable correctly for OptView to work.
OptView can provide a quick overview of which loops in a source code
or source codes among multiple files are highly optimized and which
might need further work. For a detailed description of how to use
OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/
Profile-guided Optimization (PGO)
Profile-guided optimization allows Intel compilers to use
valuable runtime information to make better decisions about
function inlining and interprocedural optimizations to
generate faster codes. Its methodology is illustrated as
follows:
Profile-guided Optimization (PGO)
First, you do an instrumented compilation by adding the -prof-gen
flag in the compile process:
icc -prof-gen -c a1.c a2.c a3.c
icc a1.o a2.o a3.o -lirc
Then, you run the program with a representative set of data to
generate the dynamic information files given by the .dyn suffix.
These files contain valuable runtime information for the compiler
to do better function inlining and other optimizations.
Finally, the code is recompiled with the -prof-use flag to use
the runtime information.
icc -prof-use -ipo -c a1.c a2.c a3.c
A profile-guided optimized executable is generated.
Vendor Tuned Code
Vendor math libraries have codes that are optimized for their
specific machine.
On the SGI Origin2000 platform, Complib.sgimath and SCSL
are available.
On the Linux clusters, Intel MKL is available. Ways to link to
these libraries are described in Section 3 - Porting Issues.
Further Information
SGI IRIX man and www pages
man opt
man lno
man inline
man ipa
man perfex
Performance Tuning for the Origin2000 at
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
Linux clusters help and www pages
ifort/icc/icpc -help (Intel)
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/
(Intel64)
http://perfsuite.ncsa.uiuc.edu/OptView/
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
5.1 Sequential Code Limitation
5.2 Parallel Overhead
5.3 Load Balance
5.3.1 Loop Schedule Types
5.3.2 Chunk Size
Parallel Code Tuning
This chapter describes several of the most common
techniques for parallel tuning, the type of programs that
benefit, and the details for implementing them.
The majority of this chapter deals with improving load
balancing.
Sequential Code Limitation
Sequential code is a part of the program that cannot be run with
multiple processors. Some reasons why it cannot be made data
parallel are:
The code is not in a do loop.
The do loop contains a read or write.
The do loop contains a dependency.
The do loop has an ambiguous subscript.
The do loop has a call to a subroutine or a reference to a function
subprogram.
Sequential Code Fraction
As shown by Amdahl’s Law, if the sequential fraction is too large,
there is a limitation on speedup. If you think too much sequential
code is a problem, you can calculate the sequential fraction of code
using the Amdahl’s Law formula.
Sequential Code Limitation
Measuring the Sequential Code Fraction
Decide how many processors to use; this is p.
Run and time the program with 1 processor to give T(1).
Run and time the program with p processors to give T(p).
Form the ratio of the 2 timings, T(1)/T(p); this is the speedup SP.
Substitute SP and p into the Amdahl’s Law formula, already solved for f:
f = (1/SP - 1/p)/(1 - 1/p)
The result f is the fraction of sequential code.
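The measurement steps above can be written as a small C helper. The function names and the sample timings in the note below are made up for illustration; only the formula comes from the slide.

```c
#include <assert.h>

/* Speedup SP = T(1) / T(p). */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Sequential fraction from Amdahl's Law: f = (1/SP - 1/p) / (1 - 1/p). */
double sequential_fraction(double t1, double tp, int p)
{
    double sp;
    assert(p > 1);  /* with p = 1 the formula divides by zero */
    sp = speedup(t1, tp);
    return (1.0 / sp - 1.0 / p) / (1.0 - 1.0 / p);
}
```

For example, if a run takes 100 seconds on 1 processor and 40 seconds on 4 processors, then SP = 2.5 and f = (0.4 - 0.25)/0.75 = 0.2, so about 20% of the run time is sequential code.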
Decreasing the Sequential Code Fraction
The compilation optimization reports list which loops could not be
parallelized and why. You can use this report as a guide to improve
performance on do loops by:
Removing dependencies
Removing I/O
Removing calls to subroutines and function subprograms
Parallel Overhead
Parallel overhead is the processing time spent
creating threads
spinning or blocking threads
starting and ending parallel regions
synchronizing at the end of parallel regions
When the computational work done by the parallel processes is too
small, the overhead time needed to create and control the parallel
processes can be disproportionately large, limiting the savings due to
parallelism.
Measuring Parallel Overhead
To get a rough under-estimate of parallel overhead:
Run and time the code using 1 processor.
Parallelize the code.
Run and time the parallel code using only 1 processor.
Subtract the 2 timings.
Parallel Overhead
Reducing Parallel Overhead
To reduce parallel overhead:
Don't parallelize all the loops.
Don't parallelize small loops.
To benefit from parallelization, a loop needs about 1000 floating
point operations or 500 statements in the loop. You can use the IF
modifier in the OpenMP directive to control when loops are
parallelized.
!$OMP PARALLEL DO IF(n > 500)
do i=1,n
... body of loop ...
end do
!$OMP END PARALLEL DO
Use task parallelism instead of data parallelism. It doesn't generate as
much parallel overhead and often more code runs in parallel.
Don't use more threads than you need.
Parallelize at the highest level possible.
Load Balance
Load balance
is the even assignment of subtasks to processors so as to keep each
processor busy doing useful work for as long as possible.
Load balance is important for speedup because the end of a do loop is
a synchronization point where threads need to catch up with each
other.
If processors have different work loads, some of the processors will
idle while others are still working.
Measuring Load Balance
On the SGI Origin, to measure load balance, use the perfex tool
which is a command line interface to the R10000 hardware counters.
The command
perfex -e16 -mp a.out > results
reports per thread cycle counts. Compare the cycle counts to
determine load balance problems. The master thread (thread 0)
always uses more cycles than the slave threads. If the counts are vastly
different, it indicates load imbalance.
Load Balance
For Linux systems, the thread CPU times can be compared
with ps. A thread with unusually high or low time compared
to the others may not be working efficiently (high CPU time
could be the result of a thread spinning while waiting for
other threads to catch up).
ps uH
Improving Load Balance
To improve load balance, try changing the way that loop
iterations are allocated to threads by
changing the loop schedule type
changing the chunk size
These methods are discussed in the following sections.
Loop Schedule Types
On the SGI Origin2000 computer, 4 different loop schedule
types can be specified by an OpenMP directive. They are:
Static
Dynamic
Guided
Runtime
If you don't specify a schedule type, the default will be used.
Default Schedule Type
The default schedule type allocates 20 iterations on 4 threads as:
Loop Schedule Types
Static Schedule Type
The static schedule type with a chunk size of 1 is used when some of
the iterations do more work than others. With this schedule, iterations
are allocated in a round-robin fashion to the threads.
An Example
Suppose you are computing on the
upper triangle of a 100 x 100
matrix, and you use 2 threads,
named t0 and t1. With default
scheduling, workloads are uneven.
Loop Schedule Types
Whereas with static scheduling, the columns of the matrix
are given to the threads in a round robin fashion, resulting in
better load balance.
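The upper-triangle example can be checked with a few lines of C. The sketch assumes that column j of the upper triangle holds j elements (diagonal included), that default scheduling hands each thread one contiguous block of columns, and that round-robin scheduling deals the columns out one at a time. The function name and arguments are invented for this illustration.

```c
#include <assert.h>

/* Number of upper-triangle elements processed by thread tid when columns
   1..ncols are shared among nthreads threads.
   cyclic == 0: one contiguous block of columns per thread.
   cyclic != 0: columns are dealt out round robin. */
long triangle_work(int ncols, int nthreads, int tid, int cyclic)
{
    long work = 0;
    int j, block;
    assert(ncols >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    block = (ncols + nthreads - 1) / nthreads;  /* columns per block */
    for (j = 1; j <= ncols; j++) {
        int owner = cyclic ? (j - 1) % nthreads : (j - 1) / block;
        if (owner == tid) {
            work += j;  /* column j contributes j elements */
        }
    }
    return work;
}
```

For the 100 x 100 matrix on 2 threads, the block allocation gives t0 1275 elements and t1 3775 elements, while the round-robin allocation gives 2500 and 2550.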
Loop Schedule Types
Dynamic Schedule Type
The iterations are dynamically allocated to threads at runtime. Each
thread is given a chunk of iterations. When a thread finishes its work,
it goes into a critical section where it’s given another chunk of
iterations to work on.
This type is useful when you don’t know the iteration count or work
pattern ahead of time. Dynamic gives good load balance, but at a high
overhead cost.
Guided Schedule Type
The guided schedule type is dynamic scheduling that starts with large
chunks of iterations and ends with small chunks of iterations. That is,
the number of iterations given to each thread depends on the number
of iterations remaining. The guided schedule type reduces the number
of entries into the critical section, compared to the dynamic schedule
type. Guided gives good load balancing at a low overhead cost.
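The shrinking chunk sizes of guided scheduling can be sketched in C. The rule used here, each new chunk is the remaining iteration count divided by the number of threads and rounded up, is an assumption chosen for illustration; actual OpenMP implementations may use a different formula.

```c
#include <assert.h>

/* Fills sizes[] with successive chunk sizes for n iterations and returns the
   number of chunks produced (at most max_chunks).
   Assumed rule: chunk = ceil(remaining / nthreads). */
int guided_chunks(int n, int nthreads, int *sizes, int max_chunks)
{
    int remaining = n;
    int count = 0;
    assert(n >= 0 && nthreads > 0 && max_chunks >= 0);
    while (remaining > 0 && count < max_chunks) {
        int chunk = (remaining + nthreads - 1) / nthreads;
        sizes[count++] = chunk;
        remaining -= chunk;
    }
    return count;
}
```

Under this rule, 20 iterations on 4 threads are handed out in 9 chunks of sizes 5, 4, 3, 2, 2, 1, 1, 1, 1, compared with 20 trips into the critical section for dynamic scheduling with a chunk size of 1.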
Chunk Size
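The word chunk refers to a grouping of iterations. Chunk size means
how many iterations are in the grouping. The static and dynamic
schedule types can be used with a chunk size. If a chunk size is not
specified for the dynamic schedule type, the chunk size is 1. For the
static schedule type, the OpenMP specification instead divides the
iterations into one contiguous block per thread when no chunk size is
given, so specify a chunk size of 1 explicitly if you want round-robin
allocation.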
Suppose you specify a chunk size of 2 with the static schedule type.
Then 20 iterations are allocated on 4 threads:
The schedule type and chunk size are specified as follows:
!$OMP PARALLEL DO SCHEDULE(type, chunk)
…
!$OMP END PARALLEL DO
Where type is STATIC, or DYNAMIC, or GUIDED and chunk is any
positive integer.
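How SCHEDULE(STATIC, chunk) maps iterations to threads can be expressed as a one-line C function. The sketch numbers iterations and threads from 0, and the function name is illustrative.

```c
#include <assert.h>

/* Thread that executes iteration i (0-based) under a static schedule with
   the given chunk size: chunks are dealt out to the threads round robin. */
int static_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}
```

With a chunk size of 2 and 4 threads, the 20 iterations 0..19 are allocated as: thread 0 gets 0-1, 8-9, 16-17; thread 1 gets 2-3, 10-11, 18-19; thread 2 gets 4-5, 12-13; thread 3 gets 6-7, 14-15.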
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information
Timing and Profiling
Now that your program has been ported to the new
computer, you will want to know how fast it runs.
This chapter describes how to measure the speed of a
program using various timing routines.
The chapter also covers how to determine which parts of the
program account for the bulk of the computational load so
that you can concentrate your tuning efforts on those
computationally intensive parts of the program.
Timing
In the following sections, we’ll discuss timers and review the
profiling tools ssrun and prof on the Origin and vprof and gprof
on the Linux Clusters. The specific timing functions described are:
Timing a section of code
FORTRAN
etime, dtime, cpu_time for CPU time
time and f_time for wallclock time
C
clock for CPU time
gettimeofday for wallclock time
Timing an executable
time a.out
Timing a batch run
busage
qstat
qhist
CPU Time
etime
A section of code can be timed using etime.
It returns the elapsed CPU time in seconds since the program
started.
real*4 tarray(2),time1,time2,timeres
… beginning of program
time1=etime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=etime(tarray)
timeres=time2-time1
CPU Time
dtime
A section of code can also be timed using dtime.
It returns the elapsed CPU time in seconds since the last call to
dtime.
real*4 tarray(2),timeres
… beginning of program
timeres=dtime(tarray)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
timeres=dtime(tarray)
… rest of program
CPU Time
The etime and dtime Functions
User time.
This is returned as the first element of tarray.
It’s the CPU time spent executing user code.
System time.
This is returned as the second element of tarray.
It’s the time spent executing system calls on behalf of your program.
Sum of user and system time.
This is the function value that is returned.
It’s the time that is usually reported.
Metric.
Timings are reported in seconds.
Timings are accurate to 1/100th of a second.
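C programs can obtain the same split into user and system time with the POSIX getrusage call. The wrapper below mimics the etime calling convention; the name cpu_seconds and the two-element array are assumptions made for this sketch, and the resolution depends on the operating system.

```c
#include <assert.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* Like etime: stores user time in tarray[0] and system time in tarray[1],
   and returns their sum, all in seconds since the program started. */
double cpu_seconds(double tarray[2])
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;  /* avoids an unused-variable warning when asserts are disabled */
    tarray[0] = tv_seconds(ru.ru_utime);
    tarray[1] = tv_seconds(ru.ru_stime);
    return tarray[0] + tarray[1];
}
```

As with etime, call it once before and once after the section to be timed and subtract the two return values.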
CPU Time
Timing Comparison Warnings
For the SGI computers:
The etime and dtime functions return the MAX time over all threads
for a parallel program.
This is the time of the longest thread, which is usually the master
thread.
For the Linux Clusters:
The etime and dtime functions are contained in the VAX compatibility
library of the Intel FORTRAN Compiler.
To use this library include the compiler flag -Vaxlib.
Another warning: Do not put calls to etime and dtime inside a do
loop. The overhead is too large.
CPU Time
cpu_time
The cpu_time routine is available only on the Linux clusters as it is
a component of the Intel FORTRAN compiler library.
It provides substantially higher resolution and has substantially
lower overhead than the older etime and dtime routines.
It can be used as an elapsed timer.
real*8 time1, time2, timeres
… beginning of program
call cpu_time (time1)
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
call cpu_time(time2)
timeres=time2-time1
… rest of program
CPU Time
clock
For C programmers, one can call the cpu_time routine using a
FORTRAN wrapper or call the intrinsic function clock that can be
used to determine elapsed CPU time.
#include <time.h>
static const double iCPS =
1.0/(double)CLOCKS_PER_SEC;
double time1, time2, timeres;
…
time1=(clock()*iCPS);
…
/* do some work */
…
time2=(clock()*iCPS);
timeres=time2-time1;
Wall clock Time
time
For the Origin, the function time returns the time since
00:00:00 GMT, Jan. 1, 1970.
It is a means of getting the elapsed wall clock time.
The wall clock time is reported in integer seconds.
external time
integer*4 time1,time2,timeres
… beginning of program
time1=time( )
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=time( )
timeres=time2 - time1
Wall clock Time
f_time
For the Linux clusters, the appropriate FORTRAN function for elapsed
time is f_time.
integer*8 f_time
external f_time
integer*8 time1,time2,timeres
… beginning of program
time1=f_time()
… start of section of code to be timed
… lots of computation
… end of section of code to be timed
time2=f_time()
timeres=time2 - time1
As above for etime and dtime, the f_time function is in the VAX
compatibility library of the Intel FORTRAN Compiler. To use this
library include the compiler flag -Vaxlib.
Wall clock Time
gettimeofday
For C programmers, wallclock time can be obtained by using the very
portable routine gettimeofday.
#include <stddef.h> /* definition of NULL */
#include <sys/time.h> /* definition of timeval struct and
prototype of gettimeofday */
double t1,t2,elapsed;
struct timeval tp;
int rtn;
....
....
rtn=gettimeofday(&tp, NULL);
t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
....
/* do some work */
....
rtn=gettimeofday(&tp, NULL);
t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
elapsed=t2-t1;
Timing an Executable
To time an executable (if using a csh or tcsh shell, explicitly
call /usr/bin/time)
time …options… a.out
where options can be ‘-p’ for a simple output or ‘-f format’
which allows the user to display more than just time related
information.
Consult the man pages on the time command for format
options.
Timing a Batch Job
Time of a batch job running or completed.
Origin
busage jobid
Linux clusters
qstat jobid # for a running job
qhist jobid # for a completed job
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information
Profiling
Profiling determines where a program spends its time.
It detects the computationally intensive parts of the code.
Use profiling when you want to focus attention and
optimization efforts on those loops that are responsible for
the bulk of the computational load.
Most codes follow the 90-10 Rule.
That is, 90% of the computation is done in 10% of the code.
Profiling Tools
Profiling Tools on the Origin
On the SGI Origin2000 computer there are profiling tools named
ssrun and prof.
Used together they do profiling, or what is called hot spot analysis.
They are useful for generating timing profiles.
ssrun
The ssrun utility collects performance data for an executable that you
specify.
The performance data is written to a file named
"executablename.exptype.id".
prof
The prof utility analyzes the data file created by ssrun and produces a
report.
Example
ssrun -fpcsamp a.out
prof -h a.out.fpcsamp.m12345 > prof.list
Profiling Tools
Profiling Tools on the Linux Clusters
On the Linux clusters the profiling tools are still maturing. There are
currently several efforts to produce tools comparable to the ssrun,
prof and perfex tools.
gprof
Basic profiling information can be generated using the OS utility gprof.
First, compile the code with the compiler flags -qp -g for the Intel
compiler (-g on the Intel compiler does not change the optimization
level) or -pg for the GNU compiler.
Second, run the program.
Finally analyze the resulting gmon.out file using the gprof utility: gprof
executable gmon.out.
efc -O -qp -g -o foo foo.f
./foo
gprof foo gmon.out
Profiling Tools
Profiling Tools on the Linux Clusters
vprof
On the IA32 platform there is a utility called vprof that provides
performance information using the PAPI instrumentation
library.
To instrument the whole application requires recompiling and
linking to vprof and PAPI libraries.
setenv VMON PAPI_TOT_CYC
ifc -g -O -o md md.f \
/usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
./md
/usr/apps/tools/vprof/bin/cprof -e md vmon.out
Profile Listings
Profile Listings on the Origin
Prof Output First Listing
Cycles        %    Cum%   Secs  Proc
--------  -----  ------  -----  ------
42630984  58.47   58.47   0.57  VSUB
 6498294   8.91   67.38   0.09  PFSOR
 6141611   8.42   75.81   0.08  PBSOR
 3654120   5.01   80.82   0.05  PFSOR1
 2615860   3.59   84.41   0.03  VADD
 1580424   2.17   86.57   0.02  ITSRCG
 1144036   1.57   88.14   0.02  ITSRSI
  886044   1.22   89.36   0.01  ITJSI
  861136   1.18   90.54   0.01  ITJCG
The first listing gives the number of cycles executed in each
procedure (or subroutine). The procedures are listed in
descending order of cycle count.
Profile Listings
Profile Listings on the Origin
Prof Output Second Listing
Cycles        %    Cum%  Line  Proc
--------  -----  ------  ----  ------
36556944  50.14   50.14  8106  VSUB
 5313198   7.29   57.43  6974  PFSOR
 4968804   6.81   64.24  6671  PBSOR
 2989882   4.10   68.34  8107  VSUB
 2564544   3.52   71.86  7097  PFSOR1
 1988420   2.73   74.59  8103  VSUB
 1629776   2.24   76.82  8045  VADD
  994210   1.36   78.19  8108  VSUB
  969056   1.33   79.52  8049  VADD
  483018   0.66   80.18  6972  PFSOR
The second listing gives the number of cycles per source
code line.
The lines are listed in descending order of cycle count.
Profile Listings
Profile Listings on the Linux Clusters
gprof Output First Listing
Flat profile:
Each sample counts as 0.000976562 seconds.
  %   cumulative      self                 self      total
 time    seconds   seconds     calls    us/call    us/call  name
-----  ---------  --------  --------  ---------  ---------  -----------
38.07       5.67      5.67       101   56157.18  107450.88  compute_
34.72      10.84      5.17  25199500       0.21       0.21  dist_
25.48      14.64      3.80                                  SIND_SINCOS
 1.25      14.83      0.19                                  sin
 0.37      14.88      0.06                                  cos
 0.05      14.89      0.01     50500       0.15       0.15  dotr8_
 0.05      14.90      0.01       100      68.36      68.36  update_
 0.01      14.90      0.00                                  f_fioinit
 0.01      14.90      0.00                                  f_intorange
 0.01      14.90      0.00                                  mov
 0.00      14.90      0.00         1       0.00       0.00  initialize_
The listing gives a 'flat' profile of functions and routines
encountered, sorted by 'self seconds' which is the number of
seconds accounted for by this function alone.
Profile Listings
Profile Listings on the Linux Clusters
gprof Output Second Listing
Call graph:
index  % time   self  children  called               name
-----  ------  -----  --------  -------------------  ----------------
[1]      72.9   0.00     10.86                       main [1]
                5.67      5.18  101/101                  compute_ [2]
                0.01      0.00  100/100                  update_ [8]
                0.00      0.00  1/1                      initialize_ [12]
---------------------------------------------------------------------
                5.67      5.18  101/101                  main [1]
[2]      72.8   5.67      5.18  101                  compute_ [2]
                5.17      0.00  25199500/25199500        dist_ [3]
                0.01      0.00  50500/50500              dotr8_ [7]
---------------------------------------------------------------------
                5.17      0.00  25199500/25199500        compute_ [2]
[3]      34.7   5.17      0.00  25199500             dist_ [3]
---------------------------------------------------------------------
                                                         <spontaneous>
[4]      25.5   3.80      0.00                       SIND_SINCOS [4]
…
…
The second listing gives a 'call-graph' profile of functions and routines encountered. The
definitions of the columns are specific to the line in question. Detailed information is
contained in the full output from gprof.
Profile Listings
Profile Listings on the Linux Clusters
vprof Listing
Columns correspond to the following events:
PAPI_TOT_CYC - Total cycles (1956 events)
File Summary:
100.0% /u/ncsa/gbauer/temp/md.f
Function Summary:
84.4% compute
15.6% dist
Line Summary:
67.3% /u/ncsa/gbauer/temp/md.f:106
13.6% /u/ncsa/gbauer/temp/md.f:104
9.3% /u/ncsa/gbauer/temp/md.f:166
2.5% /u/ncsa/gbauer/temp/md.f:165
1.5% /u/ncsa/gbauer/temp/md.f:102
1.2% /u/ncsa/gbauer/temp/md.f:164
0.9% /u/ncsa/gbauer/temp/md.f:107
0.8% /u/ncsa/gbauer/temp/md.f:169
0.8% /u/ncsa/gbauer/temp/md.f:162
0.8% /u/ncsa/gbauer/temp/md.f:105
The above listing (produced with the -e option to cprof) displays not only the cycles consumed by
functions (a flat profile) but also the lines in the code that contribute to those functions.
Profile Listings
Profile Listings on the Linux Clusters
vprof Listing (cont.)
0.7% /u/ncsa/gbauer/temp/md.f:149
0.5% /u/ncsa/gbauer/temp/md.f:163
0.2% /u/ncsa/gbauer/temp/md.f:109
0.1% /u/ncsa/gbauer/temp/md.f:100
…
…
100   0.1%  do j=1,np
101           if (i .ne. j) then
102   1.5%      call dist(nd,box,pos(1,i),pos(1,j),rij,d)
103             ! attribute half of the potential energy to particle 'j'
104  13.6%      pot = pot + 0.5*v(d)
105   0.8%      do k=1,nd
106  67.3%        f(k,i) = f(k,i) - rij(k)*dv(d)/d
107   0.9%      enddo
108           endif
109   0.2%  enddo
Profiling Analysis
The program being analyzed in the previous Origin example has
approximately 10000 source code lines, and consists of many
subroutines.
The first profile listing shows that over 50% of the computation is done
inside the VSUB subroutine.
The second profile listing shows that line 8106 in subroutine VSUB
accounted for 50% of the total computation.
Going back to the source code, line 8106 is a line inside a do loop.
Putting an OpenMP compiler directive in front of that do loop you can get
50% of the program to run in parallel with almost no work on your part.
Since the compiler has rearranged the source lines the line numbers
given by ssrun/prof give you an area of the code to inspect.
To view the rearranged source use the option
f90 … -FLIST:=ON
cc … -CLIST:=ON
For the Intel compilers, the appropriate options are
ifort … -E …
icc … -E …
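As a hedged illustration of the OpenMP directive mentioned above, the C sketch below parallelizes a vector-subtraction loop. The function name vsub_sketch and its arguments are hypothetical stand-ins, not the actual VSUB code of the profiled program. It assumes the loop iterations are independent. Build it with the compiler's OpenMP option (for example -fopenmp with GCC); without that option the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a hot loop such as the one found in VSUB:
   z(i) = x(i) - y(i). Each iteration is independent of the others,
   so the iterations can be divided among threads. */
void vsub_sketch(size_t n, const double *x, const double *y, double *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    long i;
    long count = (long)n;   /* signed loop index, accepted by all OpenMP versions */
#pragma omp parallel for
    for (i = 0; i < count; i++) {
        z[i] = x[i] - y[i];
    }
}
```

The single pragma line is the only change to the serial loop, which is why profiling first to find the dominant loop pays off.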
Further Information
SGI Irix
man etime
man 3 time
man 1 time
man busage
man timers
man ssrun
man prof
Origin2000 Performance Tuning and Optimization Guide
Linux Clusters
man 3 clock
man 2 gettimeofday
man 1 time
man 1 gprof
man 1B qstat
Intel Compilers Vprof on NCSA Linux Cluster
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.1.1 Memory Hierarchy
7.1.2 Cache Mapping
7.1.3 Cache Thrashing
7.1.4 Cache Coherence
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache Concepts
The CPU time required to perform an operation is the sum of the
clock cycles executing instructions and the clock cycles waiting
for memory.
The CPU cannot be performing useful work if it is waiting for
data to arrive from memory.
Clearly then, the memory system is a major factor in determining
the performance of your program and a large part is your use of
the cache.
The following sections will discuss the key concepts of cache
including:
Memory subsystem hierarchy
Cache mapping
Cache thrashing
Cache coherence
Memory Hierarchy
The different subsystems in the memory hierarchy have different
speeds, sizes, and costs.
Smaller memory is faster
Slower memory is cheaper
The hierarchy is set up so that the fastest memory is closest to the
CPU, and the slower memories are further away from the CPU.
Memory Hierarchy
It's a hierarchy because every level is a subset of a level further away.
All data in one level is found in the level below.
The purpose of cache is to improve the memory access time to the
processor.
There is an overhead associated with it, but the benefits outweigh the cost.
Registers
Registers are the sources and destinations of CPU data operations.
They hold one data element each and are 32 bits or 64 bits wide.
They are on-chip and built from SRAM.
Computers usually have 32 or 64 registers.
The Origin MIPS R10000 has 64 physical 64-bit registers of which 32
are available for floating-point operations.
The Intel IA64 has 328 registers for general-purpose (64 bit),
floating-point (80 bit), predicate (1 bit), branch and other functions.
Register access speeds are comparable to processor speeds.
Memory Hierarchy
Main Memory Improvements
A hardware improvement called interleaving reduces main memory access
time.
In interleaving, memory is divided into partitions or segments called
memory banks.
Consecutive data elements are spread across the banks.
Each bank supplies one data element per bank cycle.
Multiple data elements are read in parallel, one from each bank.
The problem with interleaving is that the memory interleaving improvement
assumes that memory is accessed sequentially.
If there is 2-way memory interleaving, but the code accesses every other
location, there is no benefit.
The bank cycle time is 4-8 times the CPU clock cycle time so the main
memory can’t keep up with the fast CPU and keep it busy with data.
Large main memory with a cycle time comparable to the processor is not
affordable.
Memory Hierarchy
Principle of Locality
The way your program operates follows the Principle of Locality.
Temporal Locality: When an item is referenced, it will be referenced again soon.
Spatial Locality: When an item is referenced, items whose addresses are nearby
will tend to be referenced soon.
Cache Line
The overhead of the cache can be reduced by fetching a chunk or block of data
elements.
When a main memory access is made, a cache line of data is brought into the
cache instead of a single data element.
A cache line is defined in terms of a number of bytes.
For example, a cache line is typically 32 or 128 bytes.
This takes advantage of spatial locality.
The additional elements in the cache line will most likely be needed soon.
The cache miss rate falls as the size of the cache line increases, but there is a
point of negative returns on the cache line size.
When the cache line size becomes too large, the transfer time increases.
Memory Hierarchy
Cache Hit
A cache hit occurs when the data element requested by the
processor is in the cache.
You want to maximize hits.
The Cache Hit Rate is defined as the fraction of cache hits.
It is the fraction of the requested data that is found in the cache.
Cache Miss
A cache miss occurs when the data element requested by the
processor is NOT in the cache.
You want to minimize cache misses. Cache Miss Rate is defined as
1.0 - Hit Rate
Cache Miss Penalty, or miss time, is the time needed to retrieve the data
from a lower level (downstream) of the memory hierarchy. (Recall
that the lower levels of the hierarchy have a slower access time.)
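The definitions above can be combined into a rough cost estimate. The C sketch below computes the miss rate as 1.0 - hit rate and then applies the standard textbook relation (average access time = hit time + miss rate x miss penalty), which is not stated on the slide itself. The function names and the cycle counts used in the usage note are illustrative assumptions.

```c
#include <assert.h>

/* Miss rate as defined above: 1.0 - hit rate. */
double miss_rate(double hits, double accesses)
{
    assert(accesses > 0.0 && hits >= 0.0 && hits <= accesses);
    return 1.0 - hits / accesses;
}

/* Average access time in cycles: every access pays the hit time,
   and the fraction of accesses that miss also pays the miss penalty. */
double avg_access_cycles(double hit_cycles, double rate, double penalty_cycles)
{
    assert(rate >= 0.0 && rate <= 1.0);
    return hit_cycles + rate * penalty_cycles;
}
```

For example, with 3 hits out of 4 accesses, a 1-cycle hit time and an assumed 60-cycle miss penalty, avg_access_cycles(1.0, miss_rate(3.0, 4.0), 60.0) gives 16 cycles per access, which shows how quickly misses dominate.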
Memory Hierarchy
Levels of Cache
It used to be that there were two levels of cache: on-chip and off-chip.
L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors.
Caches closer to the CPU are called Upstream. Caches further from the CPU are called
Downstream.
The on-chip cache is called First level, L1, or primary cache.
An on-chip cache performs the fastest but the computer designer makes a trade-off between
die size and cache size. Hence, on-chip cache has a small size. When the on-chip cache has a
cache miss the time to access the slower main memory is very large.
The off-chip cache is called Second Level, L2, or secondary cache.
A cache miss is very costly. To solve this problem, computer designers have implemented a
larger, slower off-chip cache. This cache reduces the on-chip cache miss penalty. L1 cache
misses are handled quickly. L2 cache misses have a larger performance penalty.
The cache external to the chip is called Third Level, L3.
The newer Intel IA-64 processor has 3 levels of cache
Memory Hierarchy
Split or Unified Cache
In unified cache, typically L2, the cache is a combined instruction-data
cache.
A disadvantage of a unified cache is that when the data access and instruction access
conflict with each other, the cache may be thrashed, resulting in a high cache miss rate.
In split cache, typically L1, the cache is split into 2 parts:
one for the instructions, called the instruction cache
another for the data, called the data cache.
The 2 caches are independent of each other, and they can have independent
properties.
Memory Hierarchy Sizes
Memory hierarchy sizes are specified in the following units:
Cache Line: bytes
L1 Cache: Kbytes
L2 Cache: Mbytes
Main Memory: Gbytes
Cache Mapping
Cache mapping determines which cache location should be used
to store a copy of a data element from main memory. There are 3
mapping strategies:
Direct mapped cache
Set associative cache
Fully associative cache
Direct Mapped Cache
In direct mapped cache, a line of main memory is mapped to only a
single line of cache.
Consequently, a particular cache line can be filled from (size of main
memory / size of cache) different lines from main memory.
Direct mapped cache is inexpensive but also inefficient and very
susceptible to cache thrashing.
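A minimal sketch of the direct mapping described above, assuming the usual scheme in which the cache line index is the memory line number modulo the number of lines in the cache. The function names are hypothetical, and real hardware does this with address bits rather than division.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the cache line that a byte address maps to in a
   direct-mapped cache. All sizes are in bytes. */
uint64_t direct_mapped_line(uint64_t address, uint64_t cache_bytes,
                            uint64_t line_bytes)
{
    assert(line_bytes > 0 && cache_bytes >= line_bytes
           && cache_bytes % line_bytes == 0);
    uint64_t lines_in_cache = cache_bytes / line_bytes;
    return (address / line_bytes) % lines_in_cache;
}

/* Number of main-memory lines that compete for each cache line. */
uint64_t memory_lines_per_cache_line(uint64_t memory_bytes, uint64_t cache_bytes)
{
    assert(cache_bytes > 0);
    return memory_bytes / cache_bytes;
}
```

Two addresses that differ by exactly the cache size map to the same line, which is the situation that makes a direct mapped cache prone to thrashing.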
Cache Mapping
Direct Mapped Cache
http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html
Cache Mapping
Fully Associative Cache
For fully associative cache, any line of cache can be loaded with any line from
main memory.
This technology is very fast but also very expensive.
http://www.xbitlabs.com/images/video/radeon-x1000/caches.png
Cache Mapping
Set Associative Cache
For N-way set associative cache, you can think of cache as being divided into N sets
(usually N is 2 or 4).
A line from main memory can then be written to its cache line in any of the N sets.
This is a trade-off between direct mapped and fully associative cache.
http://www.alasir.com/articles/cache_principles/cache_way.png
Cache Mapping
Cache Block Replacement
With direct mapped cache, a cache line can only be mapped to one unique
place in the cache. The new cache line replaces the cache block at that
address. With set associative cache there is a choice of 3 strategies:
1. Random
There is a uniform random replacement within the set of cache blocks. The
advantage of random replacement is that it’s simple and inexpensive to implement.
2. LRU (Least Recently Used)
The block that gets replaced is the one that hasn’t been used for the longest time.
The principle of temporal locality tells us that recently used data blocks are likely
to be used again soon. An advantage of LRU is that it preserves temporal locality. A
disadvantage of LRU is that it’s expensive to keep track of cache access patterns. In
empirical studies, there was little performance difference between LRU and
Random.
3. FIFO (First In First Out)
Replace the block that was brought in N accesses ago, regardless of the usage
pattern. In empirical studies, Random replacement generally outperformed FIFO.
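To make the LRU strategy concrete, here is a small software model of one cache set, assuming 4 ways and a time stamp per way. The type and function names are hypothetical, and hardware typically uses cheaper approximations than full time stamps, so treat this as an illustration of the policy only.

```c
#include <assert.h>
#include <stddef.h>

#define WAYS 4   /* assumed associativity for this sketch */

typedef struct {
    long tag[WAYS];                 /* memory line held in each way, -1 = empty */
    unsigned long last_used[WAYS];  /* time stamp of the most recent access */
    unsigned long clock;            /* counts accesses to this set */
} lru_set;

void lru_init(lru_set *s)
{
    assert(s != NULL);
    for (int w = 0; w < WAYS; w++) {
        s->tag[w] = -1;
        s->last_used[w] = 0;
    }
    s->clock = 0;
}

/* Access one memory line. Returns 1 on a hit and 0 on a miss.
   On a miss the least recently used way (or an empty way) is replaced. */
int lru_access(lru_set *s, long tag)
{
    assert(s != NULL && tag >= 0);
    int victim = 0;
    s->clock++;
    for (int w = 0; w < WAYS; w++) {
        if (s->tag[w] == tag) {
            s->last_used[w] = s->clock;   /* refresh on a hit */
            return 1;
        }
        if (s->last_used[w] < s->last_used[victim])
            victim = w;                   /* oldest way seen so far */
    }
    s->tag[victim] = tag;
    s->last_used[victim] = s->clock;
    return 0;
}
```

After lines 1, 2, 3, 4 fill the set, touching line 1 again and then loading line 5 evicts line 2, the block that has gone unused the longest.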
Cache Thrashing
Cache thrashing is a problem that happens when a frequently used
cache line gets displaced by another frequently used cache line.
Cache thrashing can happen for both instruction and data caches.
The CPU can’t find the data element it wants in the cache and must
make another main memory cache line access.
The same data elements are repeatedly fetched into and displaced
from the cache.
Cache thrashing happens because the computational code
statements have too many variables and arrays for the needed data
elements to fit in cache.
Cache lines are discarded and later retrieved.
The arrays are dimensioned too large to fit in cache. The arrays are
accessed with indirect addressing, e.g. a(k(j)).
Cache Coherence
Cache coherence
is maintained by an agreement between data stored in cache,
other caches, and main memory.
When the same data is being manipulated by different
processors, they must inform each other of their modification
of data.
The term Protocol is used to describe how caches and main
memory communicate with each other.
It is the means by which all the memory subsystems maintain
data coherence.
Cache Coherence
Snoop Protocol
All processors monitor the bus traffic to determine cache line
status.
Directory Based Protocol
Cache lines contain extra bits that indicate which other
processor has a copy of that cache line, and the status of the
cache line – clean (cache line does not need to be sent back to
main memory) or dirty (cache line needs to update main
memory with content of cache line).
Hardware Cache Coherence
Cache coherence on the Origin computer is maintained in the
hardware, transparent to the programmer.
Cache Coherence
False sharing
happens in a multiprocessor system as a result of maintaining
cache coherence.
Both processor A and processor B have the same cache line.
A modifies the first word of the cache line.
B wants to modify the eighth word of the cache line.
But A has sent a signal to B that B’s cache line is invalid.
B must fetch the cache line again before writing to it.
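One common remedy for false sharing, not described on the slide, is to pad data so that values written by different processors land in different cache lines. The C sketch below only checks the layout; the struct names are hypothetical and the 128-byte line size is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 128   /* assumed cache line size for this sketch */

/* Two per-processor counters packed together: both fall in the same
   cache line, so a write by A invalidates B's copy and vice versa. */
struct packed_counters {
    long a;
    long b;
};

/* The same counters with padding, so b starts a full line after a. */
struct padded_counters {
    long a;
    char pad[LINE_BYTES - sizeof(long)];
    long b;
};

/* Returns 1 if two byte offsets inside a line-aligned object fall in
   the same cache line, 0 otherwise. */
int same_line(size_t offset1, size_t offset2, size_t line_bytes)
{
    assert(line_bytes > 0);
    return offset1 / line_bytes == offset2 / line_bytes;
}
```

The padded layout trades a little memory for the removal of the invalidate-and-refetch traffic described above.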
Cache Coherence
A cache miss creates a processor stall.
The processor is stalled until the data is retrieved from the
memory.
The stall is minimized by continuing to load and execute
instructions, until the data that is stalling is retrieved.
These techniques are called:
Prefetching
Out of order execution
Software pipelining
Typically, the compiler will do these at -O3 optimization.
Cache Coherence
The following is an example of software pipelining:
Suppose you compute
Do I=1,N
y(I)=y(I) + a*x(I)
End Do
In pseudo-assembly language, this is what the Origin compiler will do:
cycle t+0    ld y(I+3)
cycle t+1    ld x(I+3)
cycle t+2    st y(I-4)
cycle t+3    st y(I-3)
cycle t+4    st y(I-2)
cycle t+5    st y(I-1)
cycle t+6    ld y(I+4)
cycle t+7    ld x(I+4)
cycle t+8    ld y(I+5)
cycle t+9    ld x(I+5)
cycle t+10   ld y(I+6)
cycle t+11   ld x(I+6)
Four fused multiply-adds (madd I, madd I+1, madd I+2, madd I+3) are issued alongside these loads and stores.
Cache Coherence
Since the Origin processor can only execute 1 load or 1 store
at a time, the compiler places loads in the instruction
pipeline well before the data is needed.
It is then able to continue loading while simultaneously
performing a fused multiply-add (a+b*c).
The code above gets 8 flops in 12 clock cycles.
The peak is 24 flops in 12 clock cycles for the Origin.
The Intel Pentium III (IA-32) and the Itanium (IA-64) will
have differing versions of the code above but the same
concepts apply.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.2.1 Cache on the SGI Origin2000
7.2.2 Cache on the Intel Pentium III
7.2.3 Cache on the Intel Itanium
7.2.4 Cache Summary
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Cache on the SGI Origin2000
L1 Cache (on-chip primary cache)
Cache size: 32KB floating point data
32KB integer data and instruction
Cache line size: 32 bytes
Associativity: 2-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 4MB per processor
Cache line size: 128 bytes
Associativity: 2-way set associative
Replacement: LRU
Coherence: Directory based
2-way interleaved (2 banks)
Cache on the SGI Origin2000
Bandwidth L1 cache-to-processor
1.6 GB/s/bank
3.2 GB/sec overall possible
Latency: 1 cycle
Bandwidth between L1 and L2 cache
1GB/s
Latency: 11 cycles
Bandwidth between L2 cache and local memory
.5 GB/s
Latency: 61 cycles
Average 32 processor remote memory
Latency: 150 cycles
Cache on the Intel Pentium III
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data
16KB integer data and instruction
Cache line size: 16 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 256 KB per processor
Cache line size: 32 bytes
Associativity: 8-way set associative
Replacement: pseudo-LRU
Coherence: interleaved (8 banks)
Cache on the Intel Pentium III
Bandwidth L1 cache-to-processor
16 GB/s
Latency: 2 cycles
Bandwidth between L1 and L2 cache
11.7 GB/s
Latency: 4-10 cycles
Bandwidth between L2 cache and local memory
1.0 GB/s
Latency: 15-21 cycles
Cache on the Intel Itanium
L1 Cache (on-chip primary cache)
Cache size: 16KB floating point data
16KB integer data and instruction
Cache line size: 32 bytes
Associativity: 4-way set associative
L2 Cache (off-chip secondary cache)
Cache size: 96KB unified data and instruction
Cache line size: 64 bytes
Associativity: 6-way set associative
Replacement: LRU
L3 Cache (off-chip tertiary cache)
Cache size: 4MB per processor
Cache line size: 64 bytes
Associativity: 4-way set associative
Replacement: LRU
Cache on the Intel Itanium
Bandwidth L1 cache-to-processor
25.6 GB/s
Latency: 1 - 2 cycles
Bandwidth between L1 and L2 cache
25.6 GB/sec
Latency: 6 - 9 cycles
Bandwidth between L2 and L3 cache
11.7 GB/sec
Latency: 21 - 24 cycles
Bandwidth between L3 cache and main memory
2.1 GB/sec
Latency: 50 cycles
Cache Summary
Chip           MIPS R10000    Pentium III     Itanium
#Caches        2              2               3
Associativity  2/2            4/8             4/6/4
Replacement    LRU            Pseudo-LRU      LRU
CPU MHz        195/250        1000            800
Peak Mflops    390/500        1000            3200
LD,ST/cycle    1 LD or 1 ST   1 LD and 1 ST   2 LD or 2 ST
Only one load or store may be performed each CPU cycle on the R10000.
This indicates that loads and stores may be a bottleneck.
Efficient use of cache is extremely important.
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.4.1 Measuring Cache Performance on the SGI Origin2000
7.4.2 Measuring Cache Performance on the Linux Clusters
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Code Optimization
Gather statistics to find out where the bottlenecks are in your
code so you can identify what you need to optimize.
The following questions can be useful to ask:
How much time does the program take to execute?
Use /usr/bin/time a.out for CPU time
Which subroutines use the most time?
Use ssrun and prof on the Origin or gprof and vprof on the Linux clusters.
Which loop uses the most time?
Put etime/dtime or other recommended timer calls around loops for CPU time.
For more information on timers see Timing and Profiling section.
What is contributing to the CPU time?
Use the Perfex utility on the Origin or perfex or hpmcount on the Linux
clusters.
Code Optimization
Some useful optimizing and profiling tools are
etime/dtime/time
perfex
ssusage
ssrun/prof
gprof
cvpav, cvd
See the NCSA web pages on Compiler, Performance, and
Productivity Tools
http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/
for information on which tools are available on NCSA platforms.
Measuring Cache Performance on the
SGI Origin2000
The R10000 processors of NCSA’s Origin2000 computers
have hardware performance counters.
There are 32 events that are measured and each event is
numbered.
0 = cycles
1 = Instructions issued
...
26 = Secondary data cache misses
...
View man perfex for more information.
The Perfex Utility
The hardware performance counters can be measured using the
perfex utility.
perfex [options] command [arguments]
Measuring Cache Performance on the
SGI Origin2000
where the options are:
-e counter1 -e counter2
This specifies which events are to be counted. You enter the number
of the event you want counted. (Remember to have a space in
between the "e" and the event number.)
-a
sample ALL the events
-mp
Report all results on a per thread basis.
-y
Report the results in seconds, not cycles.
-x
Gives extra summary info including Mflops.
command
Specify the name of the executable file.
arguments
Specify the input and output arguments to the executable file.
Measuring Cache Performance on the
SGI Origin2000
Examples
perfex -e 25 -e 26 a.out
- outputs the L1 and L2 cache misses
- the output is reported in cycles
perfex -a -y a.out > results
- outputs ALL the hardware performance counters
- the output is reported in seconds
Measuring Cache Performance on the
Linux Clusters
The Intel Pentium III and Itanium processors provide
hardware event counters that can be accessed from
several tools.
perfex for the Pentium III and pfmon for the
Itanium
To view usage and options for perfex and pfmon:
perfex -h
pfmon --help
To measure L2 cache misses:
perfex -eP6_L2_LINES_IN a.out
pfmon --events=L2_MISSES a.out
Measuring Cache Performance on the
Linux Clusters
psrun [soft add +perfsuite]
Another tool that provides access to the hardware
event counter and also provides derived statistics is
perfsuite.
To add perfsuite's psrun to the current shell
environment :
soft add +perfsuite
To measure cache misses:
psrun a.out
psprocess a.out*.xml
Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information
Locating the Cache Problem
For the Origin, the perfex output is a first-pass detection of a
cache problem.
If you then use the CaseVision tools, you can locate the cache
problem in your code.
The CaseVision tools are
cvpav for performance analysis
cvd for debugging
CaseVision is not available on the Linux clusters.
Tools like vprof and libhpm provide routines for users to
instrument their code.
Using vprof with the PAPI cache events can provide detailed
information about where poor cache utilization is occurring.
Cache Tuning Strategy
The strategy for performing cache tuning on your code is
based on data reuse.
Temporal Reuse
Use the same data elements on more than one iteration of the loop.
Spatial Reuse
Use data that is encached as a result of fetching nearby data elements from
downstream memory.
Strategies that take advantage of the Principle of Locality will
improve performance.
Preserve Spatial Locality
Check loop nesting to ensure stride-one memory access.
The following code does not preserve spatial locality:
do I=1,n
do K=1,n
do J=1,n
C(I,J)=C(I,J) + A(I,K) * B(K,J)
end do …
It is not wrong but runs much slower than it could.
To ensure stride-one access modify the code using loop interchange.
do J=1,n
do K=1,n
do I=1,n
C(I,J)=C(I,J) + A(I,K) * B(K,J)
end do …
For Fortran the innermost loop index should be the leftmost index of
the arrays. The code has been modified for spatial reuse.
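The same idea applies in C, with the index order reversed: C stores arrays row by row, so the innermost loop should run over the rightmost index. The sketch below is an assumed C counterpart of the interchanged loop, using flat row-major arrays; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* C = C + A*B for n x n matrices stored row-major in flat arrays.
   The innermost loop runs over j, the rightmost index, so c[i][j]
   and b[k][j] are both accessed with stride one. */
void matmul_stride_one(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];   /* invariant in the inner loop */
            for (size_t j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

The loop order i, k, j here plays the role that J, K, I plays in the Fortran version: both walk memory contiguously in the innermost loop.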
Locality Problem
Suppose your code looks like:
DO J=1,N
DO I=1,N
A(I,J)=B(J,I)
ENDDO
ENDDO
The loop as it is typed above does not have unit-stride access
on loads.
If you interchange the loops, the code doesn’t have unit-stride access on stores.
Use the optimized, intrinsic-function transpose from the
FORTRAN compiler instead of hand-coding it.
Grouping Data Together
Consider the following code segment:
d=0.0
do I=1,n
j=index(I)
d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
Since the arrays are accessed with indirect accessing, it is likely
that 3 new cache lines need to be brought into the cache for each
iteration of the loop. Modify the code by grouping together x, y,
and z into a 2-dimensional array named r.
d=0.0
do I=1,n
j=index(I)
d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) +
r(3,j)*r(3,j))
Since r(1,j), r(2,j), and r(3,j) are contiguous in memory, it is likely
they will be in one cache line. Hence, 1 cache line, rather than 3,
is brought in for each iteration of I. The code has been modified
for cache reuse.
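In C the same grouping is usually expressed as an array of structures. The sketch below is an assumed C analogue of the modified loop; the type point3 and the function name are hypothetical, and the square root from the slide's loop is left out so the example needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* x, y and z of one point sit next to each other in memory, so one
   indirect access usually touches a single cache line, not three. */
typedef struct {
    double x, y, z;
} point3;

/* Sum of squared lengths of the points selected through index[]. */
double sum_squared_lengths(size_t n, const size_t *index, const point3 *r)
{
    assert(index != NULL && r != NULL);
    double d = 0.0;
    for (size_t i = 0; i < n; i++) {
        const point3 *p = &r[index[i]];   /* indirect access, as in a(k(j)) */
        d += p->x * p->x + p->y * p->y + p->z * p->z;
    }
    return d;
}
```

Three separate arrays x, y, z would give the same result but could pull in three cache lines for every value of index[i].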
Cache Thrashing Example
This example thrashes a 4MB direct mapped cache.
parameter (max = 1024*1024)
common /xyz/ a(max), b(max)
do I=1,max
something = a(I) + b(I)
enddo
The cache lines for both a and b have the same cache address.
To avoid cache thrashing in this example, pad common with the
size of a cache line.
parameter (max = 1024*1024)
common /xyz/ a(max),extra(32),b(max)
do I=1,max
something=a(I) + b(I)
enddo
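To see why a(I) and b(I) collide, and why the padding helps, it may be useful to compute which cache line each element maps to. This is a rough sketch in Python under stated assumptions: 4-byte array elements and a 128-byte cache line (so that extra(32) is exactly one line), with a and b laid out back to back in the common block. The constant and function names are hypothetical.

```python
CACHE_BYTES = 4 * 1024 * 1024   # 4MB direct-mapped cache, from the example
LINE_BYTES = 128                # assumed cache line size
REAL_BYTES = 4                  # assumed size of one array element
MAX = 1024 * 1024               # parameter (max = 1024*1024)
NUM_LINES = CACHE_BYTES // LINE_BYTES


def cache_line(byte_offset):
    """Line index in a direct-mapped cache for a given byte offset."""
    return (byte_offset // LINE_BYTES) % NUM_LINES


def lines_for(i, pad_elements):
    """Cache lines holding a(i) and b(i) (1-based i) when pad_elements
    array elements sit between a and b in the common block."""
    a_offset = (i - 1) * REAL_BYTES
    b_offset = (MAX + pad_elements + (i - 1)) * REAL_BYTES
    return cache_line(a_offset), cache_line(b_offset)


print("no padding:  ", lines_for(1, 0))
print("extra(32):   ", lines_for(1, 32))
```

Without padding the two arrays are exactly one cache size apart, so a(I) and b(I) always map to the same line and evict each other. With one line of padding, b(I) lands in the neighbouring line instead.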
Improving cache utilization is often the key to getting good
performance.
Not Enough Cache
Ideally you want the inner loop’s arrays and variables to fit
into cache.
If a scalar program won’t fit in cache, its parallel version may
fit in cache with a large enough number of processors.
This often results in super-linear speedup.
Loop Blocking
This technique is useful when the arrays are too large to fit
into the cache.
Loop blocking uses strip mining of loops and loop interchange.
A blocked loop accesses array elements in sections that
optimally fit in the cache.
It allows for spatial and temporal reuse of data, thus minimizing
cache misses.
The following example (next slide) illustrates loop blocking
of matrix multiplication.
The code in the PRE column depicts the original code, the
POST column depicts the code when it is blocked.
Loop Blocking
PRE
do k=1,n
  do j=1,n
    do i=1,n
      c(i,j)=c(i,j)+a(i,k)*b(k,j)
    enddo
  enddo
enddo
POST
do kk=1,n,iblk
  do jj=1,n,iblk
    do ii=1,n,iblk
      do j=jj,jj+iblk-1
        do k=kk,kk+iblk-1
          do i=ii,ii+iblk-1
            c(i,j)=c(i,j)+a(i,k)*b(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
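As a sanity check that blocking only reorders the work, the blocked loop nest above can be mirrored in a few lines of Python and compared against the unblocked version. This is an illustrative sketch, not a performance demonstration: Python lists do not show the cache effect. The function names are hypothetical, and the `min(...)` upper bounds are an addition so that n need not be a multiple of iblk, which the Fortran version assumes.

```python
def matmul_naive(a, b, n):
    """Unblocked triple loop, same loop order as the PRE code."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            for i in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, iblk):
    """Blocked version: the outer three loops step over iblk-sized tiles,
    the inner three loops work inside one tile, as in the POST code."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, iblk):
        for jj in range(0, n, iblk):
            for ii in range(0, n, iblk):
                for j in range(jj, min(jj + iblk, n)):
                    for k in range(kk, min(kk + iblk, n)):
                        for i in range(ii, min(ii + iblk, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n = 5
a = [[float(i + 2 * j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n, 2) == matmul_naive(a, b, n))
```

In practice iblk would be chosen so that the three tiles being worked on fit in the cache together.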
Further Information
Computer Organization and Design: The Hardware/Software Interface,
David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc.
Computer Architecture: A Quantitative Approach, John L. Hennessy and
David A. Patterson, Morgan Kaufmann Publishers, Inc.
The Cache Memory Book, Jim Handy, Academic Press
High Performance Computing, Charles Severance, O’Reilly and
Associates, Inc.
A Practitioner’s Guide to RISC Microprocessor Architecture, Patrick H.
Stakem, John Wiley & Sons, Inc.
Tutorial on Optimization of Fortran, John Levesque, Applied
Parallel Research
Intel® Architecture Optimization Reference Manual
Intel® Itanium® Processor Manuals
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.6 Benchmarks
8.7 Summary
9 About the IBM Regatta P690
Parallel Performance Analysis
Now that you have parallelized your code and run it on
a parallel computer using multiple processors, you may want
to know the performance gain that parallelization has
achieved.
This chapter describes how to compute parallel code
performance.
Often the performance gain is not perfect, and this chapter
also explains some of the reasons for limitations on parallel
performance.
Finally, this chapter covers the kinds of information you
should provide in a benchmark, and some sample
benchmarks are given.
Speedup
The speedup of your code tells you how much performance gain is
achieved by running your program in parallel on multiple
processors.
A simple definition is the time it takes a program to run on a
single processor divided by the time it takes to run on
multiple processors.
Speedup generally ranges between 0 and p, where p is the number of
processors.
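The definition above amounts to a single division. A minimal Python sketch follows; the function name `speedup` and the timings are hypothetical illustration values, not measurements from the course.

```python
def speedup(t_serial, t_parallel):
    """Speedup = time on one processor / time on p processors."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("times must be positive")
    return t_serial / t_parallel


# Hypothetical wall-clock times in seconds, keyed by processor count.
timings = {1: 120.0, 2: 63.0, 4: 34.0, 8: 20.0}

for p, t in timings.items():
    print(p, "processors: speedup", round(speedup(timings[1], t), 2))
```

Both times should be wall-clock times for the same problem size, otherwise the ratio is not meaningful.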
Scalability
When you compute with multiple processors in a parallel
environment, you will also want to know how your code scales.
The scalability of a parallel code is defined as its ability to achieve
performance proportional to the number of processors used.
As you run your code with more and more processors, you want to
see the performance of the code continue to improve.
Computing speedup is a good way to measure how a program scales
as more processors are used.
Speedup
Linear Speedup
If it takes one processor an amount of time t to do a task and if
p processors can do the task in time t / p, then you have perfect
or linear speedup (Sp= p).
That is, running with 4 processors improves the time by a factor of 4,
running with 8 processors improves the time by a factor of 8, and so on.
This is shown in the following illustration.
Speedup Extremes
The extremes of speedup happen when speedup is
greater than p, called super-linear speedup,
less than 1.
Super-Linear Speedup
You might wonder how super-linear speedup can occur. How can
speedup be greater than the number of processors used?
The answer usually lies with the program's memory use. When using multiple
processors, each processor only gets part of the problem compared to the
single processor case. It is possible that the smaller problem can make better
use of the memory hierarchy, that is, the cache and the registers. For
example, the smaller problem may fit in cache when the entire problem
would not.
When super-linear speedup is achieved, it is often an indication that the
sequential code, run on one processor, had serious cache miss problems.
The most common programs that achieve super-linear speedup
are those that solve dense linear algebra problems.
Speedup Extremes
Parallel Code Slower than Sequential Code
When speedup is less than one, it means that the parallel code
runs slower than the sequential code.
This happens when there isn't enough computation to be done
by each processor.
The overhead of creating and controlling the parallel threads
outweighs the benefits of parallel computation, and it causes the
code to run slower.
To eliminate this problem you can try to increase the problem
size or run with fewer processors.
Efficiency
Efficiency is a measure of parallel performance that is closely
related to speedup and is often also presented in a description
of the performance of a parallel program.
Efficiency with p processors is defined as the ratio of speedup
with p processors to p.
Efficiency is a fraction that usually ranges between 0 and 1.
Ep=1 corresponds to perfect speedup of Sp= p.
You can think of efficiency as describing the average speedup
per processor.
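Efficiency follows directly from speedup. The sketch below is a small Python illustration with a hypothetical function name and made-up timings.

```python
def efficiency(t_serial, t_parallel, p):
    """Efficiency with p processors: speedup divided by p."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return (t_serial / t_parallel) / p


# Hypothetical example: 100 s on one processor, 20 s on eight.
print(efficiency(100.0, 20.0, 8))
```

A value near 1 means the processors are being used well; a value well below 1 suggests that overhead, sequential code, or idle time is eating into the gain.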
Amdahl's Law
An alternative formula for speedup is Amdahl's Law, attributed to
Gene Amdahl, one of America's great computer scientists.
This formula, introduced in 1967, states that no matter how many
processors are used in a parallel run, a program's speedup will be limited by its
fraction of sequential code.
That is, almost every program has a fraction of the code that doesn't lend itself to
parallelism.
This is the fraction of code that will have to be run with just one processor, even
in a parallel run.
Amdahl's Law defines speedup with p processors as follows:
Sp = 1 / (f + (1 - f) / p)
where the term f stands for the fraction of operations done sequentially
with just one processor, and the term (1 - f) stands for the fraction of
operations done in perfect parallelism with p processors.
Amdahl's Law
The sequential fraction of code, f, is a unitless measure
ranging between 0 and 1.
When f is 0, meaning there is no sequential code, then speedup
is p, or perfect parallelism. This can be seen by substituting f =
0 in the formula above, which results in Sp = p.
When f is 1, meaning there is no parallel code, then speedup is
1, or there is no benefit from parallelism. This can be seen by
substituting f = 1 in the formula above, which results in Sp = 1.
This shows that Amdahl's speedup ranges between 1 and
p, where p is the number of processors used in a parallel
processing run.
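The two endpoints discussed above can be checked numerically. This sketch assumes the usual form of Amdahl's Law, Sp = 1 / (f + (1 - f) / p); the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(f, p):
    """Amdahl's Law: f is the sequential fraction (0..1), p the processor count."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)


p = 16
print("f = 0.0:", amdahl_speedup(0.0, p))   # no sequential code
print("f = 0.1:", amdahl_speedup(0.1, p))   # 10% sequential code
print("f = 1.0:", amdahl_speedup(1.0, p))   # no parallel code
```

Any sequential fraction strictly between 0 and 1 gives a speedup strictly between 1 and p.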
Amdahl's Law
The interpretation of Amdahl's Law is that speedup is limited
by the fact that not all parts of a code can be run in parallel.
Substituting in the formula, when the number of processors goes to
infinity, your code's speedup is still limited by 1 / f.
Amdahl's Law shows that the sequential fraction of code has a
strong effect on speedup.
This helps to explain the need for large problem sizes when using
parallel computers.
It is well known in the parallel computing community that you
cannot take a small application and expect it to show good
performance on a parallel computer.
To get good performance, you need to run large applications, with
large data array sizes, and lots of computation.
The reason for this is that as the problem size increases, the
opportunity for parallelism grows and the sequential fraction
shrinks, so it matters less for speedup.
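The 1 / f ceiling mentioned on this slide follows in one step, assuming the usual form of Amdahl's Law with sequential fraction f and p processors. As p grows, the parallel term (1 - f) / p vanishes and only the sequential term remains:

```latex
\[
S_p = \frac{1}{f + \dfrac{1 - f}{p}}
\qquad\Longrightarrow\qquad
\lim_{p \to \infty} S_p = \frac{1}{f}
\]
```

For example, a code that is 5% sequential (f = 0.05) can never run more than 20 times faster, and one that is 1% sequential (f = 0.01) tops out at 100 times, however many processors are used.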
Agenda
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.5.1 Memory Contention Limitation
8.5.2 Problem Size Limitation
8.6 Benchmarks
8.7 Summary
Speedup Limitations
This section covers some of the reasons why a program
doesn't get perfect speedup. Some of the reasons for
limitations on speedup are:
Too much I/O
Speedup is limited when the code is I/O bound.
That is, when there is too much input or output compared to the amount
of computation.
Wrong algorithm
Speedup is limited when the numerical algorithm is not suitable for a
parallel computer.
You need to replace it with a parallel algorithm.
Too much memory contention
Speedup is limited when there is too much memory contention.
You need to redesign the code with attention to data locality.
Cache reutilization techniques will help here.
Speedup Limitations
Wrong problem size
Speedup is limited when the problem size is too small to take best advantage
of a parallel computer.
In addition, speedup is limited when the problem size is fixed.
That is, when the problem size doesn't grow as you compute with more
processors.
Too much sequential code
Speedup is limited when there's too much sequential code.
This is shown by Amdahl's Law.
Too much parallel overhead
Speedup is limited when there is too much parallel overhead compared to the
amount of computation.
Parallel overhead is the additional CPU cycles spent creating parallel regions,
creating threads, synchronizing threads, spinning or blocking threads, and ending
parallel regions.
Load imbalance
Speedup is limited when the processors have different workloads.
The processors that finish early will be idle while they are waiting for the
other processors to catch up.
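The effect of load imbalance can be sketched with a simple model: the parallel run finishes only when the busiest processor finishes. The Python below is an idealized illustration that assumes no parallel overhead and that the serial time equals the sum of all the work; the function name and workloads are hypothetical.

```python
def speedup_with_imbalance(work_per_processor):
    """Idealized speedup when each processor is given a fixed amount of work.
    Serial time is the total work; parallel time is the largest share."""
    serial_time = sum(work_per_processor)
    parallel_time = max(work_per_processor)
    return serial_time / parallel_time


balanced = [25.0, 25.0, 25.0, 25.0]
imbalanced = [40.0, 20.0, 20.0, 20.0]

print("balanced:  ", speedup_with_imbalance(balanced))
print("imbalanced:", speedup_with_imbalance(imbalanced))
```

Both cases do the same total work on four processors, but in the second case three processors sit idle while the fourth finishes its larger share.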
Memory Contention Limitation
Gene Golub, a professor of Computer Science at Stanford University,
writes in his book on parallel computing that the best way to define
memory contention is with the word delay.
When different processors all want to read or write into the main memory,
there is a delay until the memory is free.
On the SGI Origin2000 computer, you can determine whether your
code has memory contention problems by using SGI's perfex utility.
The perfex utility is covered in the Cache Tuning lecture in this course.
You can also refer to SGI's manual page, man perfex, for more details.
On the Linux clusters, you can use the hardware performance counter
tools to get information on memory performance.
On the IA32 platform, use perfex, vprof, hmpcount, psrun/perfsuite.
On the IA64 platform, use vprof, pfmon, psrun/perfsuite.
Memory Contention Limitation
Many of these tools can be used with the PAPI performance counter
interface.
Be sure to refer to the man pages and webpages on the NCSA website for
more information.
If the output of the utility shows that memory contention is a problem, you
will want to use some programming techniques for reducing memory
contention.
A good way to reduce memory contention is to access elements from the
processor's cache memory instead of the main memory.
Some programming techniques for doing this are:
Access arrays with unit stride.
Order nested do loops (in Fortran) so that the innermost loop index is the leftmost
index of the arrays in the loop. For the C language, the order is the opposite of
Fortran.
Avoid specific array sizes that are the same as the size of the data cache or that are
exact fractions or exact multiples of the size of the data cache.
Pad common blocks.
These techniques are called cache tuning optimizations. The details for
performing these code modifications are covered in the section on Cache
Optimization of this lecture.
Problem Size Limitation
Small Problem Size
Speedup is almost always an increasing function of problem size.
If there's not enough work to be done by the available
processors, the code will show limited speedup.
The effect of small problem size on speedup is shown in the
following illustration.
Problem Size Limitation
Fixed Problem Size
When the problem size is fixed, you can reach a point of
negative returns when using additional processors.
As you compute with more and more processors, each
processor has less and less computation to perform.
The additional parallel overhead, compared to the amount of
computation, causes the speedup curve to start turning
downward as shown in the following figure.
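The downturn described here can be reproduced with a toy model. The sketch below is an assumption made for illustration, not a model taken from the course: the computation divides perfectly among p processors, and each additional processor adds a fixed amount of overhead. All names and numbers are hypothetical.

```python
def modeled_speedup(t_serial, p, overhead_per_processor):
    """Toy model: parallel time = perfectly divided work + overhead that
    grows with the number of extra processors."""
    t_parallel = t_serial / p + overhead_per_processor * (p - 1)
    return t_serial / t_parallel


T_SERIAL = 100.0   # hypothetical serial run time in seconds
OVERHEAD = 0.5     # hypothetical overhead in seconds per extra processor

for p in (1, 2, 4, 8, 16, 32, 64):
    print(p, round(modeled_speedup(T_SERIAL, p, OVERHEAD), 2))

# Processor count with the best modeled speedup in the range tried.
best_p = max(range(1, 65), key=lambda p: modeled_speedup(T_SERIAL, p, OVERHEAD))
print("best processor count in this model:", best_p)
```

Under these assumptions the speedup rises, peaks, and then falls as the overhead term overtakes the shrinking per-processor computation. Increasing T_SERIAL, which stands in for a larger problem, moves the peak to a higher processor count.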
Benchmarks
It will finally be time to report the parallel performance
of your application code.
You will want to show a speedup graph with the
number of processors on the x axis, and speedup on
the y axis.
Some other things you should report and record are:
the date you obtained the results
the problem size
the computer model
the compiler and the version number of the compiler
any special compiler options you used
Benchmarks
When doing computational science, it is often helpful to find
out what kind of performance your colleagues are obtaining.
In this regard, NCSA has a compilation of parallel performance
benchmarks online at
http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
You might be interested in looking at these benchmarks to
see how other people report their parallel performance.
In particular, the NAMD benchmark is a report about the
performance of the NAMD program that does molecular
dynamics simulations.
Summary
There are many good texts on parallel computing which treat
the subject of parallel performance analysis. Here are two
useful references:
Scientific Computing An Introduction with Parallel Computing, Gene
Golub and James Ortega, Academic Press, Inc.
Parallel Computing Theory and Practice, Michael J. Quinn,
McGraw-Hill, Inc.
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
About the IBM Regatta P690
To obtain your program’s top performance, it is important to
understand the architecture of the computer system on
which the code runs.
This chapter describes the architecture of NCSA's IBM p690.
Technical details on the size and design of the processors,
memory, cache, and the interconnect network are covered
along with technical specifications for the compute rate,
memory size and speed, and interconnect bandwidth.
IBM p690 General Overview
The p690 is IBM's latest Symmetric Multi-Processor (SMP)
machine with Distributed Shared Memory (DSM).
This means that memory is physically distributed and logically
shared.
It is based on the Power4 architecture and is a successor to the
Power3-II based RS/6000 SP system.
IBM p690 Scalability
The IBM p690 is a flexible, modular, and scalable architecture.
It scales in these terms:
Number of processors
Memory size
I/O and memory bandwidth and the Interconnect bandwidth
Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.2.1 Power4 Core
9.2.2 Multi-Chip Modules
9.2.3 The Processor
9.2.4 Cache Architecture
9.2.5 Memory Subsystem
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
IBM p690 Building Blocks
An IBM p690 system is built from a number of fundamental
building blocks.
The first of these building blocks is the Power4 Core, which
includes the processors and L1 and L2 caches.
At NCSA, four of these Power4 Cores are linked to form a
Multi-Chip Module.
Each module includes the L3 cache, and four Multi-Chip
Modules are linked to form a 32-processor system (see figure
on the next slide).
Each of these components will be described in the following
sections.
32-processor IBM p690 configuration
(Image courtesy of IBM)
Power4 Core
The Power4 Chip contains:
Two processors
Local caches (L1)
External cache for each processor (L2)
I/O and Interconnect interfaces
The POWER4 chip
(Image courtesy of IBM)
Multi-Chip Modules
Four Power4 Chips are assembled to form a Multi-Chip
Module (MCM) that contains 8 processors.
Each MCM also supports the L3 cache for each Power4 chip.
Multiple MCM interconnection (Image courtesy of IBM)
The Processor
The processors at the heart of the Power4 Core are speculative
superscalar out-of-order execution chips.
The Power4 is a 4-way superscalar RISC architecture running
instructions on its 8 pipelined execution units.
Speed of the Processor
The NCSA IBM p690 has CPUs running at 1.3 GHz.
64-Bit Processor Execution Units
There are 8 independent fully pipelined execution units.
2 load/store units for memory access
2 identical floating point execution units capable of fused multiply/add
2 fixed point execution units
1 branch execution unit
1 logic operation unit
The Processor
The units are capable of 4 floating point operations, fetching 8
instructions and completing 5 instructions per cycle.
It is capable of handling up to 200 in-flight instructions.
Performance Numbers
Peak Performance:
4 floating point instructions per cycle
1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS
MIPS Rating:
5 instructions per cycle
1.3 Gcycles/sec * 5 instructions/cycle yields 6500 MIPS
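The peak figures are simple products of the clock rate and the per-cycle limits quoted above. A short Python check, with hypothetical constant names, works them out per processor: 1.3 billion cycles per second times 5 instructions per cycle is 6.5 billion instructions per second, that is, 6500 MIPS.

```python
CLOCK_HZ = 1_300_000_000       # 1.3 GHz
FLOPS_PER_CYCLE = 4            # floating point operations per cycle
INSTRUCTIONS_PER_CYCLE = 5     # instructions completed per cycle

peak_gflops = CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
peak_mips = CLOCK_HZ * INSTRUCTIONS_PER_CYCLE / 1e6

print("peak GFLOPS per processor:", peak_gflops)
print("peak MIPS per processor:  ", peak_mips)
```

These are theoretical upper bounds. Real codes reach them only when every cycle issues the maximum number of operations and all data comes from the L1 cache.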
Instruction Set
The instruction set (ISA) on the IBM p690 is the PowerPC AS
Instruction set.
Cache Architecture
Each Power4 Core has both a primary (L1) cache associated with each processor and
a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has an L3 cache.
Level 1 Cache
The Level 1 cache is in the processor core. It has split instruction and data caches.
L1 Instruction Cache
The properties of the Instruction Cache are:
64KB in size
direct mapped
cache line size is 128 bytes
L1 Data Cache
The properties of the L1 Data Cache are:
32KB in size
2-way set associative
FIFO replacement policy
2-way interleaved
cache line size is 128 bytes
Peak speed is achieved when the data accessed in a loop is entirely contained in the L1
data cache.
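The size, associativity, and line size listed above determine how many lines and sets each L1 cache has. The Python below is a small sketch of that arithmetic with a hypothetical helper name; the 8-byte element size used at the end is an assumption (double precision).

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (number of lines, number of sets) for a set-associative cache.
    A direct-mapped cache is the special case ways = 1."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets


l1_data = cache_geometry(32 * 1024, 128, 2)          # 32KB, 2-way, 128-byte lines
l1_instruction = cache_geometry(64 * 1024, 128, 1)   # 64KB, direct mapped

print("L1 data cache (lines, sets):       ", l1_data)
print("L1 instruction cache (lines, sets):", l1_instruction)

# Assuming 8-byte (double precision) elements, one 128-byte line holds:
print("elements per line:", 128 // 8)
```

The L1 data cache holds only a few thousand double precision values in total, which is why a loop reaches peak speed only when its working set is small.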
Cache Architecture
Level 2 Cache on the Power4 Chip
When the processor can't find a data element in the L1
cache, it looks in the L2 cache. The properties of the L2
Cache are:
external from the processor
unified instruction and data cache
1.41MB per Power4 chip (2 processors)
8-way set associative
split between 3 controllers
cache line size is 128 bytes
pseudo LRU replacement policy for cache coherence
124.8 GB/s peak bandwidth from L2
Cache Architecture
Level 3 Cache on the Multi-Chip Module
When the processor can't find a data element in the L2
cache, it looks in the L3 cache. The properties of the L3
Cache are:
external from the Power4 Core
unified instruction and data cache
128MB per Multi-Chip Module (8 processors)
8-way set associative
cache line size is 512 bytes
55.5 GB/s peak bandwidth from L2
Memory Subsystem
The total memory is physically distributed among the
Multi-Chip Modules of the p690 system (see the
diagram in the next slide).
Memory Latencies
The latency penalties for each of the levels of the
memory hierarchy are:
L1 Cache - 4 cycles
L2 Cache - 14 cycles
L3 Cache - 102 cycles
Main Memory - 400 cycles
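To get a feel for these latencies, the cycle counts can be converted to time at the 1.3 GHz clock and combined into an average cost per access. This is a rough Python sketch: the access mix is a made-up illustration, not a measurement, and the function names are hypothetical.

```python
CLOCK_GHZ = 1.3
LATENCY_CYCLES = {"L1": 4, "L2": 14, "L3": 102, "memory": 400}


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a latency in cycles to nanoseconds at the given clock rate."""
    return cycles / clock_ghz


def average_latency_cycles(fractions):
    """Weighted average latency. fractions maps each level to the share of
    accesses satisfied at that level; the shares must sum to 1."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * share for level, share in fractions.items())


# Hypothetical access mix, for illustration only.
hypothetical_mix = {"L1": 0.90, "L2": 0.06, "L3": 0.03, "memory": 0.01}

print("main memory latency in ns:", round(cycles_to_ns(400), 1))
print("average latency in cycles:", average_latency_cycles(hypothetical_mix))
```

Even when only 1% of accesses go to main memory in this made-up mix, those accesses contribute more to the average than all the L1 hits combined, which is the motivation for the cache tuning techniques covered earlier.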
Memory distribution within an MCM
Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information
Features Performed by the Hardware
The following is done completely by the hardware,
transparent to the user:
Global memory addressing (makes the system memory shared)
Address resolution
Maintaining cache coherency
Automatic page migration from remote to local memory (to
reduce interconnect memory transactions)
The Operating System
The operating system is AIX. NCSA's p690 system is
currently running version 5.1 of AIX. Version 5.1 is a full 64-bit file system.
Compatibility
AIX 5.1 is highly compatible with both BSD and System V Unix.
Further Information
Computer Architecture: A Quantitative Approach
John Hennessy, et al. Morgan Kaufmann Publishers, 2nd Edition,
1996
Computer Organization and Design: The Hardware/Software Interface
David A. Patterson, et al. Morgan Kaufmann Publishers, 2nd
Edition, 1997
IBM P Series [595] at the URL:
http://www-03.ibm.com/systems/p/hardware/highend/590/index.html
IBM p690 Documentation at NCSA at the URL:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/