Tuesday, September 19, 2006
The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around.
- Numerical Recipes, C Edition
Reference Material
Lectures 1 & 2
"Parallel Computer Architecture" by David Culler et al., Chapter 1.
"Sourcebook of Parallel Computing" by Jack Dongarra et al., Chapters 1 and 2.
"Introduction to Parallel Computing" by Grama et al., Chapter 1 and Chapter 2 §2.4.
www.top500.org
Lecture 3
"Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.3.
"Introduction to Parallel Computing", Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
Lectures 4 & 5
"Techniques for Optimizing Applications" by Garg et al., Chapter 9.
"Software Optimizations for High Performance Computing" by Wadleigh et al., Chapter 5.
"Introduction to Parallel Computing" by Grama et al., Chapter 2 §2.1 and §2.2.
Software Optimizations
Optimize serial code before parallelizing it.
Loop Unrolling
Assumption: n is divisible by 4.

do i = 1, n
  A(i) = B(i)
enddo

do i = 1, n, 4
  A(i)   = B(i)
  A(i+1) = B(i+1)
  A(i+2) = B(i+2)
  A(i+3) = B(i+3)
enddo
• Unrolled by 4.
• Some compilers allow users to specify unrolling depth.
• Avoid excessive unrolling: register pressure / spills can hurt performance.
• Pipelining to hide instruction latencies.
• Reduces overhead of index increment and conditional check.
Loop Unrolling
do j = 1, N
  do i = 1, N
    Z(i,j) = Z(i,j) + X(i)*Y(j)
  enddo
enddo
Unroll outer loop by 2.
do j = 1, N, 2
  do i = 1, N
    Z(i,j)   = Z(i,j)   + X(i)*Y(j)
    Z(i,j+1) = Z(i,j+1) + X(i)*Y(j+1)
  enddo
enddo
Number of load operations can be reduced, e.g. half as many loads of X.
Loop Fusion
Beneficial in loop-intensive programs.
Decreases index calculation overhead.
Can also improve instruction-level parallelism.
Beneficial if same data structures are used in different loops.
Loop Fusion

[C code example truncated in the transcript: the same loops shown separately and after fusion]

Check for register pressure before fusing.
Loop Fission
Conditional statements can hurt pipelining. Split the loop into two: one with the conditional statements and one without.
The compiler can then optimize the condition-free loop, e.g. by unrolling it.
Beneficial for fat loops that may lead to register spills.
Loop Fission

[C code example truncated in the transcript: a loop containing a conditional, split into a condition-free loop and a loop holding the conditional]
Reductions

[C code example truncated in the transcript: a loop accumulating all elements into a single sum]

Hide floating point instruction latency?

nend = (n>>2)<<2;   (round n down to a multiple of 4; the rest of the unrolled reduction is truncated in the transcript)
a**0.5 vs sqrt(a)

Appropriate include files can help the compiler generate faster code, e.g. math.h.
The time to access memory has not kept pace with CPU clock speeds.
Performance of a program can be suboptimal because the data needed for an operation has not been delivered from memory to the registers by the time the processor is ready to use it.
Wasted CPU cycles: CPU starvation.
Ability of the memory system to feed data to the processor is determined by:
Memory latency
Memory bandwidth
Effect of Memory Latency
1 GHz processor (1 ns clock)
Capable of executing 4 instructions in each 1 ns cycle
DRAM with latency 100 ns (no caches); memory block: 1 word
Peak processor rating?
Peak processor rating: 4 GFLOPS (4 instructions per 1 ns cycle).
Dot product of two vectors: peak speed of computation?
One floating point operation completes every 100 ns, i.e. a speed of 10 MFLOPS: each operation must wait for its operands to arrive from memory.
Effect of Memory Latency: Introduce Cache
1 GHz processor (1ns clock)
Capable of executing 4 instructions in each cycle of 1ns
DRAM with latency 100 ns; memory block: 1 word; cache: 32 KB with 1 ns latency
Multiply two matrices A and B of 32x32 words, with the result in C.
(Note: the previous example had no data reuse.)
Assume ideal cache placement and enough cache capacity to hold A, B and C.
32x32 = 1K words per matrix; the two input matrices total 2K words. Multiplying two n x n matrices requires 2n^3 operations. Total time taken?
Fetching two matrices = 2K words: 2K * 100 ns = 200 µs.
Multiplying two matrices requires 2n^3 = 2*32^3 = 64K operations.
At 4 operations per cycle this takes 64K/4 cycles = 16 µs.
Total time = 200 + 16 = 216 µs.
Computation rate = 64K operations / 216 µs ≈ 303 MFLOPS.
Effect of Memory Bandwidth
1 GHz processor (1ns clock)
Capable of executing 4 instructions in each cycle of 1ns
DRAM with latency 100 ns; memory block: 4 words; cache: 32 KB with 1 ns latency
Dot product example again; memory bandwidth increased 4-fold.
Reduce cache misses.
Spatial locality
Temporal locality
Impact of strided access
for (i = 0; i < 1000; i++) {
  column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
    column_sum[i] += b[j][i];
}
Eliminating strided access
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
  for (i = 0; i < 1000; i++)
    column_sum[i] += b[j][i];

Assumption: the vector column_sum is retained in the cache.
do i = 1, N
  do j = 1, N
    A(i) = A(i) + B(j)
  enddo
enddo

N is large, so B(j) cannot remain in cache until it is used again in the next iteration of the outer loop.
Little reuse between touches. How many cache misses for A and B?