Transcript Slides

Temporal Memory Streaming
Babak Falsafi
Team: Mike Ferdman, Brian Gold, Nikos Hardavellas,
Jangwoo Kim, Stephen Somogyi, Tom Wenisch
Collaborators: Anastassia Ailamaki & Andreas Moshovos
STEMS
Computer Architecture Lab
Carnegie Mellon
http://www.ece.cmu.edu/CALCM
© 2005 Babak Falsafi
The Memory Wall
[Figure: log-scale plot of clocks per DRAM access vs. clocks per instruction for VAX/1980, PPro/1996, and 2010+; core and memory speeds diverge steadily.]
Logic/DRAM speed gap continues to increase!
Current Approach
Cache hierarchies:
• Trade off capacity for speed
• Exploit “reuse”
[Figure: hierarchy of CPU; L1 64K @ 1 clk; L2 2M @ 10 clk; L3 64M @ 100 clk; DRAM 100G @ 1000 clk]
But, in modern servers:
• Only 50% utilization of one proc. [Ailamaki, VLDB’99]
• Much bigger problem in MPs
What is wrong?
• Demand fetch/repl. of data
Prior Work (SW-Transparent)
Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04]
 Simple patterns or low accuracy
Large Exec. Windows / Runahead [Mutlu 03]
 Fetch dependent addresses serially
Coherence Optimizations [Stenström 93] [Lai 00] [Huh 04]
 Limited applicability (e.g., migratory)
Need solutions for arbitrary access patterns
Our Solution:
Spatio-Temporal Memory Streaming (STEMS)
Observation:
• Data spatially/temporally correlated
• Arbitrary, yet repetitive, patterns
Approach: Memory Streaming
• Extract spatial/temporal patterns
• Stream data to/from CPU
 Manage resources for multiple blocks
 Break dependence chains
• In HW, SW or both
[Figure: CPU with L1; recorded pattern “.. C A B D”; stream fetched from DRAM or other CPUs]
Contribution #1:
Temporal Shared-Memory Streaming
• Recent coherence miss sequences recur
 50% of misses closely follow a previous sequence
 Large opportunity to exploit MLP
• Temporal streaming engine
 Ordered streams allow practical HW
 Performance improvement:
  7%-230% in scientific apps.
  6%-21% in commercial Web & OLTP apps.
Contribution #2:
Last-Touch Correlated Data Streaming
• Last-touch prefetchers
 Cache block deadtime >> livetime
 Fetch on a predicted “last touch”
 But, designs impractical (> 200MB on-chip)
• Last-touch correlated data streaming
 Miss order ~ last-touch order
 Stream table entries from off-chip
 Eliminates 75% of all L1 misses with ~200KB
Outline
• STEMS Overview
• Example Streaming Techniques
1. Temporal Shared-Memory Streaming
2. Last-Touch Correlated Data Streaming
• Summary
Temporal Shared-Memory Streaming
[ISCA’05]
• Record sequences of memory accesses
• Transfer data sequences ahead of requests
[Figure: Baseline system: CPU miss A, Mem fill A; miss B, fill B. Streaming system: CPU miss A, Mem fills A, B, C, …]
• Accelerates arbitrary access patterns
 Parallelizes critical path of pointer-chasing
Relationship Between Misses
• Intuition: Miss sequences repeat
 Because code sequences repeat
Miss seq.: Q W A B C D E R … T A B C D E Y
• Observed for uniprocessors in [Chilimbi’02]
• Temporal Address Correlation
 Same miss addresses repeat in the same order
Correlated miss sequence = stream
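The correlation above can be sketched in a few lines: record the global miss sequence, and on a recurring miss address replay the addresses that followed its previous occurrence. This is an illustrative model only, not the hardware mechanism; the function name and log representation are hypothetical.

```python
# Illustrative sketch of temporal address correlation: on a miss,
# find the most recent earlier occurrence of the same address in a
# recorded miss log and predict the addresses that followed it.
def predict_stream(miss_log, miss_addr, max_len=4):
    """Return the candidate stream: addresses that followed the most
    recent earlier occurrence of miss_addr in the recorded sequence."""
    for i in range(len(miss_log) - 1, -1, -1):
        if miss_log[i] == miss_addr:
            return miss_log[i + 1 : i + 1 + max_len]
    return []

# Miss seq. from the slide: Q W A B C D E R ...; then A recurs.
log = list("QWABCDERT")
print(predict_stream(log, "A"))  # ['B', 'C', 'D', 'E']
```

A second occurrence of A thus predicts the stream B, C, D, E without waiting for each dependent miss.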
Relationship Between Streams
• Intuition: Streams exhibit temporal locality
 Because working set exhibits temporal locality
 For shared data, repetition often across nodes
Node 1: Q W A B C D E R
Node 2: T A B C D E Y
• Temporal Stream Locality
 Recent streams likely to recur
Addr. correlation + stream locality = temporal correlation
Memory Level Parallelism
• Streams create MLP for dependent misses
[Figure: Baseline: CPU fetches A, then B, then C serially (must wait to follow pointers). Temporal streaming: A, B, C fetched in parallel.]
• Not possible with larger windows / runahead
Temporal streaming breaks dependence chains
Temporal Streaming
[Figure: Node 1, Directory, Node 2; four animation steps]
1. Record: Node 1 misses on A, B, C, D; the miss sequence is recorded.
2. Miss: Node 2 misses on A.
3. Locate: the directory locates Node 1’s recorded stream for A.
4. Stream: B, C, D are fetched to Node 2 ahead of use; subsequent accesses Hit B, Hit C.
Temporal Streaming Engine: Record
• Coherence Miss Order Buffer (CMOB)
 ~1.5MB circular buffer per node
 In local memory
 Addresses only
 Coalesced accesses
[Figure: CPU and cache; Fill E appends E to the CMOB in local memory: Q W A B C D E R T Y]
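A minimal sketch of the CMOB idea, assuming a fixed-size circular buffer and a coalescing buffer of one cache block’s worth of entries; class and parameter names are illustrative, not the paper’s.

```python
# Sketch of a CMOB: a circular buffer of coherence-miss addresses
# kept in local memory; appends are coalesced into cache-block-sized
# chunks so each buffer write is a single coalesced access.
class CMOB:
    def __init__(self, capacity, block_entries=8):
        self.buf = [None] * capacity
        self.head = 0                    # next write position (wraps)
        self.pending = []                # coalescing buffer (one cache block)
        self.block_entries = block_entries

    def record(self, addr):
        """Buffer a miss address; when a full block of entries has
        accumulated, commit them all at once and return the new
        head pointer (to be advertised to the directory)."""
        self.pending.append(addr)
        if len(self.pending) < self.block_entries:
            return None
        for a in self.pending:
            self.buf[self.head] = a
            self.head = (self.head + 1) % len(self.buf)
        self.pending.clear()
        return self.head
```

Because only addresses are stored and writes are coalesced, the recording bandwidth cost stays small even at ~1.5MB per node.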
Temporal Streaming Engine: Locate
• Annotate directory
 Already has coherence info for every block
 CMOB append: send pointer to directory
 Coherence miss: forward stream request
[Directory example: A shared, Node 4 @ CMOB[23]; B modified, Node 11 @ CMOB[401]]
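The directory-side annotation can be sketched as a map from block address to a CMOB pointer, piggybacked on the existing coherence lookup. This is a toy model; the function names and message tuples are hypothetical.

```python
# Sketch of directory annotation: alongside coherence state, each
# block records which node's CMOB last appended it and at what index.
# A later coherence miss triggers a forwarded stream request.
directory = {}  # block addr -> (state, cmob_node, cmob_index)

def on_cmob_append(addr, node, index):
    """A node appended addr to its CMOB; store the pointer."""
    state = directory.get(addr, ("shared", None, None))[0]
    directory[addr] = (state, node, index)

def on_coherence_miss(addr, requester):
    """On a coherence miss, forward a stream request to the node
    whose CMOB can replay the sequence starting at the pointer."""
    state, node, index = directory.get(addr, (None, None, None))
    if node is not None:
        return ("stream_request", node, index, requester)
    return ("normal_fill", addr)
```

The key point the slide makes survives in the sketch: the directory already sees every coherence miss, so locating a stream adds a pointer per entry rather than a new lookup structure.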
Temporal Streaming Engine: Stream
• Fetch data to match use rate
 Addresses in FIFO stream queue
 Fetch into streamed value buffer (~32 entries)
[Figure: Node i sends stream {A,B,C…}; FIFO stream queue holds F E D C B; Fetch A returns data into the streamed value buffer beside the L1 $]
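The stream-side structures can be sketched as a FIFO of predicted addresses feeding a small buffer of fetched data; sizes and names below are illustrative assumptions, not the exact hardware.

```python
# Sketch of the stream consumer: a FIFO stream queue of addresses
# and a ~32-entry streamed value buffer (SVB) holding fetched data
# until the CPU uses it; the oldest entry is evicted when full.
from collections import deque, OrderedDict

class Streamer:
    def __init__(self, svb_entries=32):
        self.queue = deque()           # FIFO stream queue (addresses)
        self.svb = OrderedDict()       # streamed value buffer
        self.svb_entries = svb_entries

    def receive_stream(self, addrs):
        self.queue.extend(addrs)

    def fetch_next(self, memory):
        """Fetch one queued address into the SVB; fetch rate is
        paced to the CPU's use rate by calling this on demand."""
        if self.queue:
            addr = self.queue.popleft()
            if len(self.svb) == self.svb_entries:
                self.svb.popitem(last=False)   # evict oldest
            self.svb[addr] = memory[addr]

    def lookup(self, addr):
        """A hit moves the streamed data toward the L1."""
        return self.svb.pop(addr, None)
```

Pacing fetches to the use rate is what keeps the buffer small: the queue holds cheap addresses, and only ~32 data blocks are in flight at once.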
Practical HW Mechanisms
• Streams recorded/followed in order
 FIFO stream queues
 ~32-entry streamed value buffer
 Coalesced cache-block-size CMOB appends
• Predicts many misses from one request
 More lookahead
 Allows off-chip stream storage
 Leverages existing directory lookup
Methodology: Infrastructure
SimFlex [SIGMETRICS’04]
• Statistical sampling → uArch sim. in minutes
• Full-system MP simulation (boots Linux & Solaris)
• Uni, CMP, DSM timing models
• Real server software (e.g., DB2 & Oracle)
• Software publicly available for download:
 http://www.ece.cmu.edu/~simflex
Methodology:
Benchmarks & Parameters
Benchmark Applications
• Scientific
 em3d, moldyn, ocean
• OLTP: TPC-C 3.0, 100 WH
 IBM DB2 7.2
 Oracle 10g
• SPECweb99 w/ 16K connections
 Apache 2.0
 Zeus 4.3
Model Parameters
• 16 4GHz SPARC CPUs
• 8-wide OoO; 8-stage pipe
• 256-entry ROB/LSQ
• 64K L1, 8MB L2
• TSO w/ speculation
TSE Coverage Comparison
[Figure: % of coherent read misses covered (and discarded) by Stride, G/DC, G/AC, and TSE for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus; y-axis 0%-250%]
TSE outperforms Stride and GHB for coherence misses
Stream Lengths
[Figure: cumulative % of all hits vs. stream length (# of streamed blocks, 0 to 128K, log scale) for Apache, DB2, em3d, moldyn, ocean, Oracle, Zeus]
• Comm: Short streams; low base MLP (1.2-1.3)
• Sci: Long streams; high base MLP (1.6-6.6)
• Temporal Streaming addresses both cases
TSE Performance Impact
[Figure: normalized execution-time breakdown (busy, other stalls, coherent read stalls) and speedup with 95% CI, base vs. TSE, for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus; speedups up to 3.3]
• TSE eliminates 25%-95% of coherent read stalls
• 6% to 230% performance improvement
TSE Conclusions
• Temporal Streaming
 Intuition: Recent coherence miss sequences recur
 Impact: Eliminates 50-100% of coherence misses
• Temporal Streaming Engine
 Intuition: In-order streams enable practical HW
 Impact: Performance improvement:
  7%-230% in scientific apps.
  6%-21% in commercial Web & OLTP apps.
Outline
• Big Picture
• Example Streaming Techniques
1. Temporal Shared-Memory Streaming
2. Last-Touch Correlated Data Streaming
• Summary
Enhancing Lookahead
Observation [Mendelson, Wood&Hill]:
• Few live sets
 Used until last “hit”
 Data reuse: high hit rate
 ~80% dead frames!
[Figure: L1 snapshots at times T1 and T2: live sets vs. dead sets]
Exploit for lookahead:
• Predict last “touch” prior to “death”
• Evict, predict and fetch next line
How Much Lookahead?
[Figure: distribution of frame deadtimes (cycles), annotated with L2 and DRAM latencies]
Predicting last-touches will eliminate all latency!
Dead-Block Prediction
[ISCA’00 & ’01]
• Per-block trace of memory accesses to a block frame
 Predicts repetitive last-touch events
Accesses to a block frame:
 PC0: load/store A0 (hit)
 PC1: load/store A1 (miss) ← first touch
 PC3: load/store A1 (hit)
 PC3: load/store A1 (hit) ← last touch
 PC5: load/store A3 (miss)
Trace = A1 → (PC1, PC3, PC3)
Dead-Block Prefetcher (DBCP)
• History & correlation tables
 History ~ L1 tag array
 Correlation ~ memory footprint
• Encoding: truncated addition
• Two-bit saturating counter
[Example: the History Table (HT) holds the latest trace for A1 = (PC1, PC3); the current access by PC3 completes the signature A1,PC1,PC3,PC3, which the Correlation Table maps to A3 with counter 11 → evict A1, fetch A3]
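The two tables and the counter can be sketched together; this is an illustrative software model under assumed widths and table shapes, not the hardware organization.

```python
# Sketch of DBCP: the history table accumulates a per-block signature
# (truncated addition of PCs); the correlation table maps a completed
# last-touch signature to the predicted next block, guarded by a
# 2-bit saturating counter.
class DBCP:
    def __init__(self, bits=16):
        self.mask = (1 << bits) - 1
        self.history = {}   # block addr -> running signature
        self.corr = {}      # signature -> [predicted next addr, 2-bit ctr]

    def access(self, addr, pc):
        """Fold the PC into the block's signature; if the result is a
        known last-touch signature with a confident counter, predict:
        evict addr and return the block to fetch."""
        sig = (self.history.get(addr, addr & self.mask) + pc) & self.mask
        self.history[addr] = sig
        entry = self.corr.get(sig)
        if entry and entry[1] >= 2:
            return entry[0]
        return None

    def train(self, addr, next_addr):
        """On an actual eviction, correlate the final signature with
        the block fetched next; saturate the counter up on a match,
        decay it on a mismatch."""
        sig = self.history.pop(addr, None)
        if sig is None:
            return
        entry = self.corr.setdefault(sig, [next_addr, 0])
        if entry[0] == next_addr:
            entry[1] = min(3, entry[1] + 1)
        else:
            entry[1] = max(0, entry[1] - 1)
```

After the same access trace has been seen (and trained) twice, the counter reaches 2 and the final touch of the block triggers the prefetch of its successor.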
DBCP Coverage with
Unlimited Table Storage
[Figure: % of cache misses (correct, incorrect, train, early) for Olden, SPEC INT, SPEC FP; y-axis 0%-120%]
• High average L1 miss coverage
• Low misprediction (2-bit counters)
Impractical On-Chip Storage Size
[Figure: % of achievable coverage (average and worst-case) vs. on-chip correlation table size, from 160KB to 160MB]
Needs over 150MB to achieve full potential!
Our Observation:
Signatures are Temporally Correlated
Signatures need not reside on chip
1. Last-touch sequences recur
 Much as cache miss sequences recur [Chilimbi’02]
 Often due to large structure traversals
2. Last-touch order ~ cache miss order
 Off by at most L1 cache capacity
Key implications:
• Can record last touches in miss order
• Store & stream signatures from off-chip
Last-Touch Correlated Data Streaming
(LT-CORDS)
• Streaming signatures on chip
 Keep all sigs. in sequences in off-chip DRAM
 Retain sequence “heads” on chip
 “Head” signals a stream fetch
• Small (~200KB) on-chip stream cache
 Tolerates order mismatch
 Lookahead for stream startup
DBCP coverage with moderate on-chip storage!
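A toy model of the LT-CORDS idea, under simplifying assumptions (sequences as lists, the stream cache as a set, one head per sequence); all names and sizes are illustrative.

```python
# Sketch of LT-CORDS: full signature sequences live off-chip; only
# sequence "heads" stay on chip. A matching head cues a stream fetch
# of the rest of its sequence into a small on-chip stream cache,
# whose set-style lookup tolerates modest order mismatch.
class LTCords:
    def __init__(self, off_chip_seqs):
        self.off_chip = off_chip_seqs                       # sigs in DRAM
        self.heads = {seq[0]: i for i, seq in enumerate(off_chip_seqs)}
        self.stream_cache = set()                           # ~200KB on chip

    def observe(self, sig):
        """Process one observed signature; return True on a hit."""
        if sig in self.heads:                  # head cues the stream fetch
            self.stream_cache.update(self.off_chip[self.heads[sig]][1:])
            return True
        hit = sig in self.stream_cache         # order-tolerant lookup
        self.stream_cache.discard(sig)
        return hit
```

The on-chip cost is the heads plus the stream cache, independent of the application’s footprint, which is the property the next slides quantify.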
DBCP Mechanisms
[Figure: Core, L1, HT with sigs. (160MB), L2, DRAM]
All signatures in a random-access on-chip table
What LT-CORDS Does
[Figure: Core, L1, HT, L2, DRAM; signatures stored off-chip; only a subset is on chip at a time, streamed only in order; a “head” serves as the cue for the stream]
LT-CORDS Mechanisms
[Figure: Core, L1, HT, L2, stream cache (SC) holding heads (10K); sigs. (200K) in DRAM]
On-chip storage independent of footprint
Methodology
• SimpleScalar CPU model with Alpha ISA
 SPEC CPU2000 & Olden benchmarks
• 8-wide out-of-order processor
 2-cycle L1, 16-cycle L2, 180-cycle DRAM
 FU latencies similar to Alpha EV8
 64KB 2-way L1D, 1MB 8-way L2
• LT-CORDS with 214KB on-chip storage
• Apps. with significant memory stalls
LT-CORDS vs. DBCP Coverage
[Figure: % of cache misses (correct, incorrect, train, early) for DBCP vs. LT-CORDS on Olden, SPEC INT, SPEC FP; y-axis 0%-120%]
LT-CORDS matches the coverage of infinite-storage DBCP
LT-CORDS Speedup
[Figure: speedup (1x-5x) over baseline for Infinite Cache, LT-CORDS, and GHB (PC/DC) on gcc, parser, sixtrack, treeadd, mgrid, swim, equake, applu, mcf, fma3d, art, em3d, facerec, wupwise, bh, ammp]
LT-CORDS hides a large fraction of memory latency
LT-CORDS Conclusions
• Intuition: Signatures temporally correlated
 Cache miss & last-touch sequences recur
 Miss order ~ last-touch order
• Impact: eliminates 75% of all misses
 Retains DBCP coverage, lookahead, accuracy
 On-chip storage indep. of footprint
 2x speedup over best prior work
For more information
Visit our website:
http://www.ece.cmu.edu/CALCM