Transcript ppt

15-213
“The course that gives CMU its Zip!”
Memory System Performance
March 22, 2001
Topics
• Impact of cache parameters
• Impact of memory reference patterns
– memory mountain range
– matrix multiply
class19.ppt
Basic Cache Organization
• Address space: N = 2^n bytes
• Cache: C = S x E x B bytes
  – S = 2^s sets
  – E lines per set
  – B = 2^b bytes per line (the line size)
• An n-bit address is divided into t tag bits, s set-index bits, and b block-offset bits
  (n = t + s + b)
• Each cache line holds a valid bit (1 bit), a tag (t bits), and a B-byte block of data
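A minimal sketch (added for illustration, not part of the slides) of how the set index
and tag are pulled out of an address; the address and the field widths s and b below are
made-up values:

  /* Hypothetical example: decompose an address into tag / set / offset. */
  #include <stdio.h>

  int main(void)
  {
      unsigned addr = 0x1234ABCD;   /* made-up address                     */
      int s = 7, b = 5;             /* made-up: 2^7 = 128 sets, 32-B lines */

      unsigned offset = addr & ((1u << b) - 1);        /* low b bits       */
      unsigned set    = (addr >> b) & ((1u << s) - 1); /* next s bits      */
      unsigned tag    = addr >> (s + b);               /* remaining bits   */

      printf("tag=0x%x set=%u offset=%u\n", tag, set, offset);
      return 0;
  }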
Multi-Level Caches
Options: separate data and instruction caches, or a unified cache

Hierarchy: processor (regs, TLB, L1 Dcache, L1 Icache) → L2 cache → main memory → disk

              size          speed   $/Mbyte    line size
  regs        200 B         3 ns               8 B
  L1 caches   8-64 KB       3 ns               32 B
  L2 cache    1-4 MB SRAM   4 ns    $100/MB    32 B
  memory      128 MB DRAM   60 ns   $1.50/MB   8 KB
  disk        30 GB         8 ms    $0.05/MB

• Moving away from the processor: larger, slower, cheaper
• Lower levels also use larger line sizes, higher associativity, and are more likely to
  write back
Key Features of Caches
Accessing Word Causes Adjacent Words to be Cached
• B bytes having same bit pattern for upper n–b address bits
• Example: the four 4-byte words at block offsets 0000, 0100, 1000, and 1100 share the
  same tag bits (00) and set index (101), so accessing the word at offset 1000 brings
  all four into the cache
• In anticipation that the program will soon reference these words (spatial locality)
Loading a Block into the Cache Causes an Existing Block to be Evicted
• One that maps to the same set
• If E > 1, generally choose the least recently used line
Cache Performance Metrics
Miss Rate
• Fraction of memory references not found in cache
– misses/references
• Typical numbers:
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
• Time to deliver a line in the cache to the processor
– includes time to determine whether the line is in the cache
• Typical numbers:
1 clock cycle for L1
3-8 clock cycles for L2
Miss Penalty
• Additional time required because of a miss
– Typically 25-100 cycles for main memory
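These three metrics combine in the standard average memory access time formula (added
here as a worked example; the numbers are hypothetical values picked from the typical
ranges above):

  AMAT = hit time + miss rate x miss penalty
       = 1 cycle + 0.05 x 50 cycles
       = 3.5 cycles per reference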
Categorizing Cache Misses
Compulsory (“Cold”) Misses:
• First ever access to a memory line
– since lines are only brought into the cache on demand, this is
guaranteed to be a cache miss
• Programmer/system cannot reduce these
Capacity Misses:
• Active portion of memory exceeds the cache size
• Programmer can reduce by rearranging data & access patterns
Conflict Misses:
• Active portion of address space fits in cache, but too many lines map to
the same cache entry
• Programmer can reduce by changing data structure sizes
– Avoid powers of 2
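A minimal sketch of the powers-of-2 problem (added illustration; the array names and
sizes are made up):

  /* Two arrays whose sizes are exact powers of 2 can land in the same cache
     sets, so alternating accesses may keep evicting each other.            */
  #define N 4096                /* hypothetical: 4096 doubles = 32 KB each  */

  double a[N];
  double b[N];                  /* may map set-for-set onto a[]             */
  /* double b[N + 8];              padding the size shifts b to other sets  */

  double sum_pairs(void)
  {
      double s = 0.0;
      int i;
      for (i = 0; i < N; i++)
          s += a[i] + b[i];     /* a[i] and b[i] can conflict in a
                                   direct-mapped cache                      */
      return s;
  }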
Measuring Memory Bandwidth
  int data[MAXSIZE];

  int test(int size, int stride)
  {
      int i;
      int result = 0;
      int wsize = size / sizeof(int);

      for (i = 0; i < wsize; i += stride)
          result += data[i];
      return result;
  }
[Diagram: test() is swept over two parameters, data set size (bytes) and stride (words)]
Measuring Memory Bandwidth (cont.)
Measurement
• Time repeated calls to test
– If size sufficiently small, then can hold array in cache
Characteristics of Computation
• Stresses read bandwidth of system
• Increasing stride yields decreased spatial locality
  – Each cache block supplies only about B/(stride*4) accesses, i.e., roughly
    stride*4/B of the accesses miss (capped at one miss per access once stride*4 ≥ B)
• Increasing size increases size of “working set”
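A sketch of how the measurement might be driven (added; the timing method and
repetition count are assumptions, not the course's actual harness):

  /* Estimate read bandwidth in MB/s by timing repeated calls to test(). */
  #include <time.h>

  double measure_mbps(int size, int stride)
  {
      int i, reps = 100;                    /* made-up repetition count    */
      clock_t start;
      double secs, bytes;

      test(size, stride);                   /* warm-up pass                */
      start = clock();
      for (i = 0; i < reps; i++)
          test(size, stride);
      secs = (double)(clock() - start) / CLOCKS_PER_SEC;

      /* each call touches roughly size/stride bytes of the array          */
      bytes = (double)reps * ((size / sizeof(int)) / stride) * sizeof(int);
      return bytes / secs / (1024.0 * 1024.0);
  }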
Alpha Memory Mountain Range
DEC Alpha 21164, 466 MHz
8 KB L1, 96 KB L2, 2 MB L3

[Surface plot ("memory mountain"): read throughput in MB/s (0-600) vs. data set size
(.5k-16m bytes) and stride (s1-s15 words). Plateaus are labeled L1 Resident, L2
Resident, L3 Resident, and Main Memory Resident.]
Effects Seen in Mountain Range
Cache Capacity
• Sudden drops in throughput as the working set size grows past each cache size
Cache Block Effects
• Performance degrades as the stride increases
  – Less spatial locality
• Levels off
  – Once the stride reaches a single access per line
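Worked check (added): with 4-byte array elements and B = 32-byte lines, a stride of s
words touches each line 32/(4*s) times, so the curves flatten once s >= 8 (one access
per line).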
Alpha Cache Sizes
[Line graph: MB/s (0-600) vs. data set size, from .5k to 16m, at stride = 16]

• MB/s for stride = 16
• Ranges:
  .5k – 8k     Running in L1 (high overhead for small data sets)
  16k – 64k    Running in L2
  128k         Indistinct cutoff (since the cache is 96 KB)
  256k – 2m    Running in L3
  4m – 16m     Running in main memory
Alpha Line Size Effects
[Line graph: MB/s (0-600) vs. stride (s1-s16) for data set sizes 8k, 32k, 1024k, 16m]

Observed Phenomenon
• Each time the stride doubles, the number of accesses per block halves
• Until the point where there is just one access per block
• The line size shows up as the transition from downward slope to horizontal line
  – Sometimes indistinct
Alpha Line Sizes
[Line graph: MB/s (0-600) vs. stride (s1-s16) for data set sizes 8k, 32k, 1024k, 16m]

Measurements
  8k      Entire array L1 resident; effectively flat (except for overhead)
  32k     Shows that the L1 line size = 32 B
  1024k   Shows that the L2 line size = 32 B
  16m     L3 line size = 64 B?
Xeon Memory Mountain Range
Pentium III Xeon, 550 MHz
16 KB L1, 512 KB L2

[Surface plot ("memory mountain"): read throughput in MB/s (0-1200) vs. data set size
(.5k-16m bytes) and stride (s1-s15 words). Plateaus are labeled L1 Resident, L2
Resident, and Main Memory Resident.]
Xeon Cache Sizes
[Line graph: MB/s (0-1000) vs. data set size, from .5k to 16m, at stride = 16]

• MB/s for stride = 16
• Ranges:
  .5k – 16k    Running in L1 (overhead at the high end)
  32k – 256k   Running in L2
  512k         Running in main memory (even though L2 is supposed to be 512 KB!)
  1m – 16m     Running in main memory
Xeon Line Sizes
[Line graph: MB/s (0-1200) vs. stride (s1-s16) for data set sizes 4k, 256k, 16m]

Measurements
  4k      Entire array L1 resident; effectively flat (except for overhead)
  256k    Shows that the L1 line size = 32 B
  16m     Shows that the L2 line size = 32 B
Interactions Between Program & Cache
Major Cache Effects to Consider
• Total cache size
– Try to keep heavily used data in cache closest to processor
• Line size
  – Exploit spatial locality

Example Application (the variable sum is held in a register):

  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }
• Multiply N x N matrices
• O(N^3) total operations
• Accesses
– N reads per source element
– N values summed per destination
» but may be able to hold in register
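Worked check (added): each of the N^2 destination elements requires N multiplies and
N additions, so the total work is about 2N^3 floating-point operations.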
Matrix Mult. Performance: Sparc20
[Line graph: double-precision MFLOPS (0-20) vs. matrix size n (50-200), one curve per
loop ordering: ikj, kij, ijk, jik, jki, kji]
• As matrices grow in size, they eventually exceed cache capacity
• Different loop orderings give different performance
– cache effects
– whether or not we can accumulate partial sums in registers
Miss Rate Analysis for Matrix Multiply
Assume:
• Line size = 32B (big enough for 4 64-bit words)
• Matrix dimension (N) is very large
– Approximate 1/N as 0.0
• Cache is not even big enough to hold multiple rows
Analysis Method:
• Look at access pattern of inner loop
[Diagram: in the inner loop, A is indexed by (i,k), B by (k,j), and C by (i,j)]
Layout of Arrays in Memory
C arrays allocated in row-major order
• each row in contiguous memory locations
Stepping through columns in one row:
for (i = 0; i < N; i++)
sum += a[0][i];
• accesses successive elements
• if line size (B) > 8 bytes, exploit spatial locality
– compulsory miss rate = 8 bytes / B
Stepping through rows in one column:
for (i = 0; i < N; i++)
sum += a[i][0];
• accesses distant elements
• no spatial locality!
– compulsory miss rate = 1 (i.e. 100%)
Memory Layout (256 x 256 array of doubles, starting at 0x80000):
  0x80000   a[0][0]
  0x80008   a[0][1]
  0x80010   a[0][2]
  0x80018   a[0][3]
  •••
  0x807F8   a[0][255]
  0x80800   a[1][0]
  0x80808   a[1][1]
  0x80810   a[1][2]
  0x80818   a[1][3]
  •••
  0x80FF8   a[1][255]
  •••
  0xFFFF8   a[255][255]
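Worked check (added): with row-major layout, &a[i][j] = 0x80000 + (256*i + j)*8;
for example &a[1][0] = 0x80000 + 256*8 = 0x80800, matching the layout above.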
Matrix Multiplication (ijk)
  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

Inner loop access pattern:
• A (i,*): row-wise
• B (*,j): column-wise
• C (i,j): fixed

Misses per Inner Loop Iteration:
  A      B      C
  0.25   1.0    0.0
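How these numbers follow from the earlier assumptions (added note): with 32-byte lines
and 8-byte doubles, a row-wise sweep misses once per 4 elements (8/32 = 0.25), a
column-wise sweep misses on every element (1.0), and c[i][j] stays in a register (0.0),
for a total of 1.25 misses per inner-loop iteration.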
Matrix Multiplication (jik)
  /* jik */
  for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

Inner loop access pattern:
• A (i,*): row-wise
• B (*,j): column-wise
• C (i,j): fixed

Misses per Inner Loop Iteration:
  A      B      C
  0.25   1.0    0.0
Matrix Multiplication (kij)
  /* kij */
  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

Inner loop access pattern:
• A (i,k): fixed
• B (k,*): row-wise
• C (i,*): row-wise

Misses per Inner Loop Iteration:
  A      B      C
  0.0    0.25   0.25
Matrix Multiplication (ikj)
  /* ikj */
  for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

Inner loop access pattern:
• A (i,k): fixed
• B (k,*): row-wise
• C (i,*): row-wise

Misses per Inner Loop Iteration:
  A      B      C
  0.0    0.25   0.25
Matrix Multiplication (jki)
  /* jki */
  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

Inner loop access pattern:
• A (*,k): column-wise
• B (k,j): fixed
• C (*,j): column-wise

Misses per Inner Loop Iteration:
  A      B      C
  1.0    0.0    1.0
Matrix Multiplication (kji)
  /* kji */
  for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

Inner loop access pattern:
• A (*,k): column-wise
• B (k,j): fixed
• C (*,j): column-wise

Misses per Inner Loop Iteration:
  A      B      C
  1.0    0.0    1.0
Summary of Matrix Multiplication
ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25

  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5

  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0

  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }
Matrix Mult. Performance: DEC5000
[Line graph: double-precision MFLOPS (0-3) vs. matrix size n (50-200). The curves group
by miss rate: ikj & kij (misses/iter = 0.5), ijk & jik (misses/iter = 1.25), jki & kji
(misses/iter = 2.0).]
Matrix Mult. Performance: Sparc20
Multiple columns of B fit in cache
[Line graph: double-precision MFLOPS (0-20) vs. matrix size n (50-200). The curves group
by miss rate: ikj & kij (misses/iter = 0.5), ijk & jik (misses/iter = 1.25), jki & kji
(misses/iter = 2.0).]
Matrix Mult. Performance: Alpha 21164
[Line graph: double-precision MFLOPS (0-160) vs. matrix size n, with markers where the
matrices become too big for the L1 cache and too big for the L2 cache. The curves group
by miss rate: kij & ikj (misses/iter = 0.5), ijk & jik (misses/iter = 1.25), jki & kji
(misses/iter = 2.0).]
Matrix Mult.: Pentium III Xeon
[Line graph: double-precision MFLOPS (0-90) vs. matrix size n (25-500) for ijk, ikj,
jik, jki, kij, kji. The orderings with misses/iter = 0.5 or 1.25 cluster together; jki
and kji (misses/iter = 2.0) form a separate, slower group.]
Blocked Matrix Multiplication
• “Block” (in this context) does not mean “cache block”
- instead, it means a sub-block within the matrix
Example: N = 8; sub-block size = 4
  [ A11  A12 ]   [ B11  B12 ]   [ C11  C12 ]
  [ A21  A22 ] X [ B21  B22 ] = [ C21  C22 ]
Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
C11 = A11B11 + A12B21
C12 = A11B12 + A12B22
C21 = A21B11 + A22B21
C22 = A21B12 + A22B22
Blocked Matrix Multiply (bijk)
  for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
      for (j=jj; j < min(jj+bsize,n); j++)
        c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
      for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
          sum = 0.0;
          for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
          }
          c[i][j] += sum;
        }
      }
    }
  }
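Note (added): the code assumes a min helper is available; a minimal sketch would be

  #define min(a, b) ((a) < (b) ? (a) : (b))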
Blocked Matrix Multiply Analysis
• Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X
bsize block of B and accumulates into 1 X bsize sliver of C
• Loop over i steps through n row slivers of A & C, using same B
  for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
      sum = 0.0;
      for (k=kk; k < min(kk+bsize,n); k++) {    /* innermost loop pair */
        sum += a[i][k] * b[k][j];
      }
      c[i][j] += sum;
    }
  }

[Diagram: a 1 x bsize row sliver of A (row i, columns kk...) times a bsize x bsize block
of B (rows kk..., columns jj...) updates a 1 x bsize sliver of C (row i, columns jj...).
The row sliver is accessed bsize times, successive elements of the C sliver are updated,
and the B block is reused n times in succession.]
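Worked check (added, using made-up numbers): with bsize = 25, the B block is
25 x 25 x 8 = 5000 bytes, so the block plus one row sliver each of A and C
(2 x 25 x 8 = 400 bytes) easily fits in a 16 KB L1 cache and can be reused from cache
across all n row slivers.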
Blocked Matrix Mult. Perf: DEC5000
[Line graph: double-precision MFLOPS (0-3) vs. matrix size n (50-200) for bijk, bikj,
ikj, and ijk]
Blocked Matrix Mult. Perf: Sparc20
[Line graph: double-precision MFLOPS (0-20) vs. matrix size n (50-200) for bijk, bikj,
ikj, and ijk]
Blocked Matrix Mult. Perf: Alpha 21164
[Line graph: double-precision MFLOPS (0-160) vs. matrix size n (50-500) for bijk, bikj,
ijk, and ikj]
Blocked Matrix Mult. : Xeon
[Line graph: double-precision MFLOPS (0-90) vs. matrix size n (50-500) for ijk, ikj,
bijk, and bikj]
Observations
Programmer Can Optimize for Cache Performance
• How data structures are organized
• How data is accessed
  – Nested loop structure
  – Blocking is a general technique
All Machines Like “Cache Friendly Code”
• Getting the absolute optimum performance is very platform-specific
  – Cache sizes, line sizes, associativities, etc.
• Can get most of the advantage with generic code
  – Keep the working set reasonably small