The Case of a Scalable Coherence Protocol for Complex On


MOSAIC :
Lucía G. Menezo
Valentín Puente
José Ángel Gregorio
University of Cantabria (Spain)


Motivation
Directory Schemas
◦ In-cache
◦ Sparse

MOSAIC Coherence Protocol
◦ Examples


Evaluation Results
Conclusions
University of Cantabria
Edinburgh - PACT 2013

Performance improvement: more processors per chip
Major challenge: the off-chip bandwidth wall
Introduce caches into the chip
Complex on-chip cache hierarchies

Coherence protocol: fundamental role to play




Which coherence protocol to use with a large number of cores?
◦ Broadcast-based protocols → high energy requirements
◦ Directory-based protocols → more storage needed for sharing information

MOSAIC: a new coherence protocol
◦ Directory without inclusiveness
◦ Token Coherence to guarantee correctness


Directory Schemas





Each block in the LLC includes tag, data and the sharers information
LLC receives requests → needs precise knowledge
Inclusiveness is necessary: any block in the private levels needs to be allocated in the LLC
Advantage: less complex coherence protocol
Disadvantage: all LLC blocks have storage overhead
[Figure: in-cache directory. Processors (P) and private caches connect through the interconnection network to the LLC + in-cache directory. Each LLC block holds tag (@), data and sharers; the sharers field is marked as overhead.]




Directory entries separated from data
Allocated on demand
Overhead proportional to the aggregate size of the private levels (not the LLC)
Capacity and associativity have to be sufficient to keep the private-level cache tags
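
The storage argument for the two schemas can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: a full-map sharers vector with one presence bit per core, and hypothetical values for the cache sizes and the directory tag width (the function names and the 96 KB and 32-bit figures are not from the slides).

```python
KB = 1024
MB = 1024 * KB


def in_cache_overhead_bits(llc_bytes, block_bytes, cores):
    """Sharer bits when every LLC block carries one presence bit per core."""
    llc_blocks = llc_bytes // block_bytes
    return llc_blocks * cores


def sparse_overhead_bits(private_bytes_per_core, block_bytes, cores, tag_bits):
    """Directory bits with one entry (tag + presence bits) per private block."""
    entries = cores * (private_bytes_per_core // block_bytes)
    return entries * (tag_bits + cores)


# Hypothetical system: 16 cores, 64-byte blocks, 32 MB LLC,
# 96 KB of private cache per core, 32-bit directory tags (all assumed values).
in_cache = in_cache_overhead_bits(32 * MB, 64, 16)
sparse = sparse_overhead_bits(96 * KB, 64, 16, tag_bits=32)

print(f"in-cache: {in_cache // (8 * KB)} KB of sharer bits")
print(f"sparse:   {sparse // (8 * KB)} KB of directory entries")
```

With these assumed numbers the in-cache cost scales with the LLC size, while the sparse cost scales with the aggregate private capacity, which is the point the slide makes.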
[Figure: sparse directory. Processors (P) and private caches connect through the interconnection network to the LLC, whose blocks hold only tag (@) and data, and to a separate sparse directory whose entries hold tag (@) and sharers.]
Duplicate-tag directory: holding all the tags of the private levels
◦ Associativity = # cores × private-cache associativity
◦ # sets = # private-cache sets
◦ Example: 16 cores with 4-way 32KB L1 → 64-way
[Figure: directory sets, each holding one tag per way of every private cache.]
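
The geometry of a duplicate-tag directory follows directly from the two rules above. The short sketch below reproduces the 64-way example; the helper name is hypothetical, and the 64-byte block size is an assumption taken from the system configuration shown later in the talk.

```python
def duplicate_tag_geometry(cores, cache_bytes, ways, block_bytes=64):
    """Sets, associativity and tag count of a duplicate-tag directory."""
    sets = cache_bytes // (ways * block_bytes)  # same as one private cache
    associativity = cores * ways                # one way per private-cache way
    return {"sets": sets,
            "associativity": associativity,
            "tags": sets * associativity}


# 16 cores, each with a 4-way 32KB L1
geom = duplicate_tag_geometry(cores=16, cache_bytes=32 * 1024, ways=4)
print(geom)
```

The associativity grows linearly with the core count, which is why this organization becomes impractical for large systems.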
Decrease associativity: now << # cores × private-cache associativity
Increase number of sets
One tag may be in various private caches
More than 1 tag per entry → conflicts
Inclusiveness needed → invalidate private data (recall messages)
[Figure: directory sets in which each entry holds a tag and its sharers.]


MOSAIC Coherence Protocol






In-cache or sparse → it doesn't matter
No inclusiveness
No invalidations of data in private caches
Reconstruction of sharing information on demand
Uses token counting to avoid extra traffic and guarantee correctness
Token Coherence protocol:
◦ Initially, each block has a fixed number of tokens (equal to the number of processors)
◦ Read request: data and 1 token
◦ Write request: data and all tokens
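
The token rules can be sketched as simple bookkeeping for one block. This is a minimal illustration, not the MOSAIC implementation: the class and method names are hypothetical, data movement is not modeled, and it assumes that all tokens start at a single "home" location.

```python
class TokenBlock:
    """Token bookkeeping for one cache block (data movement is not modeled)."""

    def __init__(self, num_procs):
        self.total = num_procs
        # Assumption: all tokens start at a single home location.
        self.held = {"home": num_procs}

    def _move(self, src, dst, count):
        if self.held.get(src, 0) < count:
            raise ValueError(f"{src} does not hold {count} token(s)")
        self.held[src] -= count
        if self.held[src] == 0:
            del self.held[src]
        self.held[dst] = self.held.get(dst, 0) + count
        assert sum(self.held.values()) == self.total  # tokens are conserved

    def read(self, requester, supplier):
        """A read obtains (data and) one token from the supplier."""
        self._move(supplier, requester, 1)

    def write(self, requester):
        """A write collects (data and) every token in the system."""
        for holder in list(self.held):
            if holder != requester:
                self._move(holder, requester, self.held[holder])

    def can_read(self, node):
        return self.held.get(node, 0) >= 1

    def can_write(self, node):
        return self.held.get(node, 0) == self.total


block = TokenBlock(num_procs=4)
block.read("P0", supplier="home")
block.read("P1", supplier="home")
readers_before_write = sorted(n for n in ("P0", "P1") if block.can_read(n))
block.write("P2")
print(readers_before_write, block.held)
```

Because a writer must hold every token, no reader can still hold one, which is how token counting guarantees correctness independently of what the directory knows.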
[Figure: MOSAIC example system. Private caches of P0 (state I, 0 tokens, no data), P1 (state O, 2 tokens, data) and P2 (state S, 1 token, data) connect through the on-chip network to the last-level cache (data slice and directory slice with state and sharers) and to the memory controller (number of tokens and data). Numbered steps 1 to 5 mark the message sequence.]





When data is not present in the LLC → broadcast for reconstruction
Private caches report the number of tokens they hold
Token counting avoids negative acknowledgements or timeouts
Reconstruction message piggybacks the type of request and the requestor
Key: the directory may replace entries silently → no invalidations
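
One way to read the token-counting point is that the directory can tell when a reconstruction is complete by adding up the reported tokens, so caches that hold nothing never need to answer. The sketch below illustrates that reading only. It assumes that all tokens of the block are held on chip and that each response carries an owner flag (borrowed from Token Coherence's owner token); the function name and the response format are hypothetical.

```python
def reconstruct(total_tokens, responses):
    """Rebuild a directory entry from private-cache responses.

    responses: iterable of (cache_id, tokens_held, is_owner) in arrival order.
    Returns (sharers, owner, complete).
    """
    sharers, owner, seen = [], None, 0
    for cache_id, tokens, is_owner in responses:
        if tokens <= 0:
            continue  # caches without tokens are not sharers
        sharers.append(cache_id)
        seen += tokens
        if is_owner:
            owner = cache_id
        if seen == total_tokens:
            break  # every token is accounted for: no need to wait any longer
    return sharers, owner, seen == total_tokens


# Four processors, so four tokens: P2 holds 1 token, P1 holds 3 and is the owner.
sharers, owner, complete = reconstruct(4, [("P2", 1, False), ("P1", 3, True)])
print(sharers, owner, complete)
```

In this run the entry is complete after two responses, without any negative acknowledgement from P0 or P3 and without a timeout.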
[Figure: read example. P0 issues a Read; P1 holds 3 tokens and P2 holds 1 token; the block is Invalid in the directory/LLC, which starts a Reconstruction. The directory entry evolves as follows: Sharers [P2], Owner: ?; Sharers [P2, P1], Owner: P1; Sharers [P2, P1, P0], Owner: P1; Sharers [P2, P1, P0, P3], Owner: P1. States shown: IS, S, A, O, C.]
[Figure: write example. P0 issues a Write; P1 holds 3 tokens and P2 holds 1 token; the block is Invalid in the directory/LLC, which starts a Reconstruction. Resulting directory entry: Sharers [P0], Owner: P0. Labels and states shown: Directory Eviction, IS, S, O, C, A, M, IM.]


Evaluation Results
System configuration

Parameter                           Config 1          Config 2
Number of cores                     8 @3GHz           16 @3GHz
IWin size / Issue width             128, 4-way (both)
Block size                          64B (both)
L1 (private) size / associativity   32KB I/D, 2-way (both)
L2 (private) size / associativity   64KB, 4-way, exclusive with L1 (both)
L3 (shared) size / associativity    16MB, 16-way      32MB, 16-way
L3 NUCA mapping                     Static, interleaved across slices (both)
Memory capacity                     4GB (both)
Max. outstanding mem. operations    16 (both)
Topology                            4×4 Mesh          6×6 Mesh

[Figure: mesh floorplans for both configurations, with cores and L3 slices attached to routers (R).]

GEMS: full-system evaluation
◦ SLICC: Specification Language for Implementing Cache Coherence

Multithreaded workloads
◦ 4 Wisconsin Commercial Workloads
◦ 3 NAS Parallel Benchmarks

Multiprogrammed workloads
◦ 3 SPEC 2006 (rate mode)
[Figure: normalized execution time for directory configurations 64w128KB, 32w128KB, 2w128KB and 1w128KB.]
128KB → 16K entries (8 bytes per entry)
[Figure: normalized number of misses (L1D, L1I and L2), BASE vs. MOSAIC, for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus.]
[Figure: normalized execution time for directory configurations 64w16KB, 32w16KB, 2w16KB and 1w16KB.]
128KB → 16K entries (8 bytes per entry)
16KB → 2K entries
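
The entry counts quoted for the two directory sizes follow from the 8-byte entry size. The sketch below reproduces them and also derives the number of sets, under the assumption that a label such as 64w128KB means a 64-way, 128KB directory; the helper names are hypothetical.

```python
KB = 1024


def directory_entries(dir_bytes, entry_bytes=8):
    """Number of entries that fit in a directory of the given size."""
    return dir_bytes // entry_bytes


def directory_sets(dir_bytes, ways, entry_bytes=8):
    """Number of sets, assuming the entries are split evenly across the ways."""
    return directory_entries(dir_bytes, entry_bytes) // ways


for size_kb, ways in [(128, 64), (128, 1), (16, 64), (16, 1)]:
    entries = directory_entries(size_kb * KB)
    sets = directory_sets(size_kb * KB, ways)
    print(f"{ways}w{size_kb}KB: {entries} entries in {sets} sets")
```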
[Figure: latency (processor cycles) broken down into Local L1, Private L2, Other L1, Other L2 and L3, BASE vs. MOSAIC with 64-, 2- and 1-way directories, for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus.]
16KB → 2K entries
[Figure: average network link utilization for directory configurations 64w128KB, 64w64KB, 64w32KB, 64w8KB, 2w128KB, 2w64KB and 2w16KB.]
[Figure: normalized network link utilization for directory configurations 2w128KB, 2w64KB and 2w16KB; a difference of 40% is highlighted.]
16 cores configuration
[Figure: normalized link utilization for directory configurations 128w256KB, 2w256KB, 128w128KB, 2w128KB, 128w64KB, 2w64KB, 128w32KB and 2w32KB.]
Bandwidth scalability of a directory
Elegance of Token Coherence
MOSAIC Coherence Protocol

Low complexity and great scalability
Very low storage overhead
No noticeable energy cost
Alternative for future many-core cache-coherent CMPs
L1: 4-way 32KB / L2: 8-way 256KB
[Figure: normalized execution time for directory configurations 16w512KB, 16w256KB, 16w128KB, 16w64KB and 16w32KB, with annotations "x2 full dir" and "1/10 full dir".]
Same experiment with BASE: 20% impact in some cases
[Figure: normalized dynamic energy broken down into L1, L2, L3, sparse directory and network, BASE vs. MOSAIC, for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus.]