Tuesday, September 19, 2006
The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around.
- Numerical Recipes, C Edition
Reference Material
Lectures 1 & 2
"Parallel Computer Architecture" by David Culler et al., Chapter 1.
"Sourcebook of Parallel Computing" by Jack Dongarra et al., Chapters 1 and 2.
"Introduction to Parallel Computing" by Grama et al., Chapter 1 and Chapter 2 §2.4.
www.top500.org
Lecture 3
"Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.3.
"Introduction to Parallel Computing", Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
Lectures 4 & 5
"Techniques for Optimizing Applications" by Garg et al., Chapter 9.
"Software Optimizations for High Performance Computing" by Wadleigh et al., Chapter 5.
"Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.1 and §2.2.
Software Optimizations
Optimize serial code before parallelizing it.
Loop Unrolling
Assumption: n is divisible by 4.

do i = 1, n
  A(i) = B(i)
enddo

do i = 1, n, 4
  A(i)   = B(i)
  A(i+1) = B(i+1)
  A(i+2) = B(i+2)
  A(i+3) = B(i+3)
enddo
• Unrolled by 4.
• Some compilers allow users to specify unrolling depth.
• Avoid excessive unrolling: register pressure / spills can hurt performance.
• Pipelining to hide instruction latencies.
• Reduces overhead of index increment and conditional check.
Loop Unrolling
do j = 1, N
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
  enddo
enddo
Unroll outer loop by 2.
do j = 1, N, 2
  do i = 1, N
    Z(i,j)   = Z(i,j)   + X(i)*Y(j)
    Z(i,j+1) = Z(i,j+1) + X(i)*Y(j+1)
  enddo
enddo
Number of load operations can be reduced, e.g. half as many loads of X.
Loop Fusion
Beneficial in loop-intensive programs.
Decreases index calculation overhead.
Can also improve instruction-level parallelism.
Beneficial if same data structures are used in different loops.
Loop Fusion

[C code example truncated in the transcript: the same loops shown separately and after fusion]

Check for register pressure before fusing.
Loop Fission
Conditional statements can hurt pipelining. Split the loop into two: one with the conditional statements and one without.
The compiler can then optimize the condition-free loop, e.g. by unrolling it.
Beneficial for fat loops that may lead to register spills.
Loop Fission

[C code example truncated in the transcript: a loop containing a conditional, split into a condition-free loop and a loop holding the conditional]
Reductions

[C code example truncated in the transcript: a loop accumulating all elements into a single sum]

Hide floating point instruction latency?

nend = (n>>2)<<2;   (round n down to a multiple of 4; the rest of the unrolled reduction is truncated in the transcript)
a**0.5 vs sqrt(a)

Appropriate include files can help the compiler generate faster code, e.g. math.h.
The time to access memory has not kept pace with CPU clock speeds.
Performance of a program can be suboptimal because the data needed for an operation has not been delivered from memory to the registers by the time the processor is ready to use it.
Wasted CPU cycles: CPU starvation.
Ability of the memory system to feed data to the processor is determined by:
Memory latency
Memory bandwidth
Effect of Memory Latency
1 GHz processor (1 ns clock)
Capable of executing 4 instructions in each 1 ns cycle
DRAM with latency 100 ns (no caches); memory block: 1 word
Peak processor rating?
Peak processor rating: 4 GFLOPS (4 instructions per 1 ns cycle).
Dot product of two vectors: peak speed of computation?
One floating point operation completes every 100 ns, i.e. a speed of 10 MFLOPS: each operation must wait for its operands to arrive from memory.
Effect of Memory Latency: Introduce Cache
1 GHz processor (1ns clock)
Capable of executing 4 instructions in each cycle of 1ns
DRAM with latency 100 ns; memory block: 1 word; cache: 32 KB with 1 ns latency
Multiply two matrices A and B of 32x32 words, with the result in C.
(Note: the previous example had no data reuse.)
Assume ideal cache placement and enough cache capacity to hold A, B and C.
32x32 = 1K words per matrix; the two input matrices total 2K words. Multiplying two n x n matrices requires 2n^3 operations. Total time taken?
Fetching two matrices = 2K words: 2K * 100 ns = 200 µs.
Multiplying two matrices requires 2n^3 = 2*32^3 = 64K operations.
At 4 operations per cycle this takes 64K/4 cycles = 16 µs.
Total time = 200 + 16 = 216 µs.
Computation rate = 64K operations / 216 µs ≈ 303 MFLOPS.
Effect of Memory Bandwidth
1 GHz processor (1ns clock)
Capable of executing 4 instructions in each cycle of 1ns
DRAM with latency 100 ns; memory block: 4 words; cache: 32 KB with 1 ns latency
Dot product example again; memory bandwidth increased 4-fold.
Reduce cache misses.
Spatial locality
Temporal locality
Impact of strided access
for (i = 0; i < 1000; i++) {
  column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
    column_sum[i] += b[j][i];
}
Eliminating strided access
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
  for (i = 0; i < 1000; i++)
    column_sum[i] += b[j][i];

Assumption: the vector column_sum is retained in the cache.
do i = 1, N
  do j = 1, N
    A(i) = A(i) + B(j)
  enddo
enddo

N is large, so B(j) cannot remain in cache until it is used again in the next iteration of the outer loop.
Little reuse between touches. How many cache misses for A and B?