Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Jason Bosko
March 5th, 2008
Based on “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation” by Sangyeun Cho and Lei Jin, appearing in the IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2006.
Outline
Background and Motivation
Page Allocation
Specifics of Page Allocation
Evaluation of Page Allocation
Conclusion
Motivation
With multicore processors, on-chip memory design and management become crucial
Increasing L2 cache sizes result in non-uniform cache access latencies, which complicate the management of these caches
Private Caches
A cache slice is associated with a specific processor core
Data must be replicated across processors as it is accessed
Advantages?
Data is always close to the processor, reducing hit latency
Disadvantages?
Limits overall cache space, resulting in more capacity misses
[Figure: tiles T0-T3, each with a private L2 slice; blocks 0-15 in memory]
Shared Caches
S = A mod N
Each memory block uniquely maps to one (and only one) cache slice that all processors will access
Advantages?
Increases effective L2 cache size
Easier to implement coherence protocols (data only exists in one place)
Disadvantages?
Requested data is not always close, so hit latency increases
Increased network traffic due to movement of data that is not close to the requesting processor
[Figure: tiles T0-T15, each with a shared L2 slice; blocks 0-15 in memory]
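As a quick illustrative example of the S = A mod N mapping (the numbers are chosen here, not taken from the paper): with N = 16 slices, a block at address A = 37 maps to slice 37 mod 16 = 5, so every core that requests that block must reach slice 5, wherever it sits on the chip.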
Page Allocation
S = PPN mod N
Add another level of indirection – pages!
Built on top of a shared cache architecture
Use the physical page number (PPN) to map physical pages to the correct cache slice
The OS controls the mapping of virtual pages to physical pages – if the OS knows where a physical page maps to, then it can assign virtual pages based on which cache slice it desires!
[Figure: tiles T0-T15; physical pages 0-15 in memory; virtual pages a-h mapped onto them by the OS]
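A minimal C sketch of that idea, assuming a 16-slice chip and a hypothetical array of free physical pages (free_ppns, n_free, and pick_ppn_for_slice are illustrative names, not from the paper): to place a virtual page on a desired slice, the OS simply picks a free physical page whose PPN maps to that slice.

/* Minimal sketch, not the paper's code: choose a free physical page whose
   PPN maps to the desired cache slice under S = PPN mod N. */
#include <stddef.h>
#include <stdint.h>

#define N_SLICES 16                     /* assumed 4x4 tiled CMP */

/* Returns a free PPN that maps to 'slice', or -1 if none is available. */
long pick_ppn_for_slice(const uint64_t *free_ppns, size_t n_free, int slice)
{
    for (size_t i = 0; i < n_free; i++) {
        if (free_ppns[i] % N_SLICES == (uint64_t)slice)
            return (long)free_ppns[i];
    }
    return -1;                          /* caller may fall back to another slice */
}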
How does Page Allocation work?
A Congruence Group (CGi) is the partition of physical pages that map to the unique processor core i
Each congruence group needs to maintain a “free list” of available pages
To implement private caching, when a page is requested by processor i, allocate a free page from CGi
To implement shared caching, when any page is requested, allocate a page from any CG
To implement hybrid caching, split the CGs into K groups, keeping track of which CG maps to which group – when a page is requested, allocate a page from any CG in the correct group
All of this is controlled by the OS without any additional hardware support!
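A minimal C sketch of the three policies just described, assuming hypothetical per-CG free lists (cg_freelist_pop, N_CORES, and K_GROUPS are illustrative names, not the authors' code):

/* Per-core congruence groups with free lists, and the three allocation
   policies described above. */
#define N_CORES  16
#define K_GROUPS 4                      /* hybrid: 4 cores share one group */

extern long cg_freelist_pop(int cg);    /* free PPN from CGi, or -1 if empty */

/* Private-style: always allocate from the requester's own congruence group. */
long alloc_private(int requesting_core)
{
    return cg_freelist_pop(requesting_core);
}

/* Shared-style: any CG will do; here, round-robin across all cores. */
long alloc_shared(void)
{
    static int next = 0;
    long ppn = cg_freelist_pop(next);
    next = (next + 1) % N_CORES;
    return ppn;                         /* caller retries if this CG was empty */
}

/* Hybrid: allocate from any CG in the requester's group of cores. */
long alloc_hybrid(int requesting_core)
{
    int per_group  = N_CORES / K_GROUPS;
    int group_base = (requesting_core / per_group) * per_group;
    for (int i = 0; i < per_group; i++) {
        long ppn = cg_freelist_pop(group_base + i);
        if (ppn >= 0)
            return ppn;
    }
    return -1;                          /* group exhausted */
}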
Page Spreading & Page Spilling
If the OS always allocates pages from the CG corresponding to the requesting processor, then it acts like a private cache
The OS can choose to direct allocations to cache slices in other cores in order to increase the effective cache size. This is page spreading
When available pages in a CG drop below some threshold, the OS may be forced to allocate pages from another group. This is page spilling
Each tile is on a specific tier that corresponds to how close it is to the target tile
[Figure: a target tile surrounded by its tier-1 tiles]
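A minimal C sketch of page spilling, with an assumed low-water threshold and hypothetical helpers (SPILL_THRESHOLD, cg_free_count, and tier1_neighbors are illustrative, not from the paper):

/* When the requester's CG runs low on free pages, fall back to a
   tier-1 neighbor's CG instead. */
#define SPILL_THRESHOLD 32              /* assumed low-water mark, in pages */

extern int  cg_free_count(int cg);      /* free pages remaining in CGi */
extern int  tier1_neighbors(int core, int *out);  /* fills adjacent tile ids, returns count */
extern long cg_freelist_pop(int cg);

long alloc_with_spilling(int requesting_core)
{
    if (cg_free_count(requesting_core) > SPILL_THRESHOLD)
        return cg_freelist_pop(requesting_core);

    int neighbors[8];
    int n = tier1_neighbors(requesting_core, neighbors);
    for (int i = 0; i < n; i++) {
        if (cg_free_count(neighbors[i]) > SPILL_THRESHOLD)
            return cg_freelist_pop(neighbors[i]);   /* page spills to a nearby CG */
    }
    return cg_freelist_pop(requesting_core);        /* last resort: drain own CG */
}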
Cache Pressure
Add hardware support for counting “unique” page accesses in a cache
But aren’t we supposed to avoid extra hardware support? It still doesn’t hurt!
When cache pressure is measured to be high, pages are allocated to other tiles on the same tier, or to tiles on the next tier
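A minimal C sketch of pressure-guided spreading, assuming hypothetical per-slice pressure counters and a tier lookup (HIGH_PRESSURE, pressure[], tier_tiles, and alloc_on_slice are illustrative names, not the paper's):

/* When the local slice is under high pressure, allocate on the
   least-pressured tile of the closest tier that still has room. */
#define N_CORES       16
#define HIGH_PRESSURE 512               /* assumed proxy for "more unique pages than the slice holds" */

extern unsigned pressure[N_CORES];      /* per-slice unique-page counters (hardware-assisted) */
extern int  tier_tiles(int core, int tier, int *out);  /* tiles at a given distance, returns count */
extern long alloc_on_slice(int slice);  /* free PPN that maps to this slice */

long alloc_with_pressure(int requesting_core)
{
    if (pressure[requesting_core] < HIGH_PRESSURE)
        return alloc_on_slice(requesting_core);

    /* Local slice is under pressure: try same-tier tiles, then farther tiers. */
    for (int tier = 1; tier <= 3; tier++) {
        int tiles[N_CORES];
        int n = tier_tiles(requesting_core, tier, tiles);
        int best = -1;
        for (int i = 0; i < n; i++)
            if (best < 0 || pressure[tiles[i]] < pressure[best])
                best = tiles[i];
        if (best >= 0 && pressure[best] < HIGH_PRESSURE)
            return alloc_on_slice(best);
    }
    return alloc_on_slice(requesting_core);          /* everyone is busy; stay local */
}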
Home allocation policy
Profitability of choosing a home cache slice depends on different factors:
Recent miss rates of L2 caches
Recent network contention levels
Current page allocation
QoS requirements
Processor configuration (# of processors, etc.)
The OS can easily find the cache slice with the highest profitability
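One way such a profitability score could look, as an illustrative C sketch (the weights and the inputs recent_miss_rate, recent_contention, and hop_distance are assumptions, not the paper's actual cost function):

/* Combine recent miss rate, recent network contention, and distance,
   then pick the slice with the highest score. */
#define N_CORES 16

extern double recent_miss_rate[N_CORES];   /* recent miss rate of each L2 slice */
extern double recent_contention[N_CORES];  /* recent network contention near each tile */
extern int    hop_distance(int from, int to);

static double profitability(int requester, int slice)
{
    /* Higher is better: penalize misses, contention, and distance. */
    return -(1.0 * recent_miss_rate[slice]
           + 0.5 * recent_contention[slice]
           + 0.1 * hop_distance(requester, slice));
}

int pick_home_slice(int requester)
{
    int best = 0;
    for (int s = 1; s < N_CORES; s++)
        if (profitability(requester, s) > profitability(requester, best))
            best = s;
    return best;
}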
Virtual Multicore (VM)
For parallel applications, the OS should try to coordinate page allocation to minimize latency and traffic – schedule a parallel application onto a set of cores in close proximity
When cache pressure increases, pages can still be allocated outside of the VM
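A minimal C sketch of VM-aware allocation, assuming a hypothetical per-slice pressure estimate and allocator helpers (pressure[], HIGH_PRESSURE, alloc_on_slice, and alloc_anywhere are illustrative names):

/* Round-robin across the cores that make up the virtual multicore,
   escaping the VM only when every member slice is under high pressure. */
#define HIGH_PRESSURE 512

extern unsigned pressure[];             /* per-slice pressure estimate */
extern long alloc_on_slice(int slice);  /* free PPN mapping to this slice */
extern long alloc_anywhere(void);       /* fall back to any slice on the chip */

long alloc_in_vm(const int *vm_cores, int vm_size)
{
    static int next = 0;
    for (int i = 0; i < vm_size; i++) {
        int slice = vm_cores[(next + i) % vm_size];
        if (pressure[slice] < HIGH_PRESSURE) {
            next = (next + i + 1) % vm_size;
            return alloc_on_slice(slice);
        }
    }
    return alloc_anywhere();            /* VM is saturated; spread outside it */
}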
Hardware Support
The best feature of OS-level page allocation is that it can be built on a simple shared cache organization with no hardware support
But additional hardware support can still be leveraged!
Data replication
Data migration
Bloom filter
Evaluation
Use the SimpleScalar tool set to model a 4x4 mesh multicore processor chip
Demand paging – every memory access is checked against allocated pages; when a memory access is the first access to an unallocated page, a physical page is allocated based on the desired policy
No page spilling was ever experienced
Used single-threaded, multiprogrammed, and parallel workloads
Single-threaded = a variety of SPEC2k benchmarks, both integer and floating-point programs
Multiprogrammed = one core (core 5 in the experiments) runs a target benchmark, while the other processors run a synthetic benchmark that continuously generates memory accesses
Parallel = SPLASH-2 benchmarks
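A minimal C sketch of that demand-paging hook, with hypothetical helper names (lookup_ppn, policy_alloc, and install_mapping are illustrative):

/* On the first access to an unmapped virtual page, a physical page is
   chosen according to the active allocation policy. */
#include <stdint.h>

extern long lookup_ppn(uint64_t vpn);              /* -1 if the page is not yet allocated */
extern long policy_alloc(int core, uint64_t vpn);  /* allocate per the active policy (SP, SP-RR, ...) */
extern void install_mapping(uint64_t vpn, long ppn);

long translate_on_access(int core, uint64_t vaddr, unsigned page_shift)
{
    uint64_t vpn = vaddr >> page_shift;
    long ppn = lookup_ppn(vpn);
    if (ppn < 0) {                      /* first access to this page */
        ppn = policy_alloc(core, vpn);
        install_mapping(vpn, ppn);
    }
    return ppn;
}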
Performance on single-threaded workloads
PRV: private
PRV8: 8MB cache size (instead of 512KB)
SL: shared
SP: OS-based page allocation
SP-RR: round-robin allocation
SP-80: 80% allocated locally, 20% spread across tier-1 cores
Performance on single-threaded workloads
Decreased sharing = higher miss rate
Decreased sharing = less on-chip traffic
Performance on multiprogrammed workloads
SP40-CS: use controlled spreading to limit spreading of unrelated pages onto cores that have data of the target application
Synthetic benchmarks produce low, mid, or high traffic
SP40 usually performs better in high traffic, but performance is similar to SL in low traffic
Not shown here, but SP40 reduces on-chip network traffic by 50% (compared to SL)
Performance on parallel workloads
VM: virtual multicore with round-robin page allocations on participating cores
lu and ocean have higher L1 miss rates, so the L2 cache policy had a greater effect on performance
No real difference here!
VM outperforms the rest!
Related Issues
Remember NUMA? They used a page scanner that maintained reference counters and generated page faults to allow the OS to take some control
In CC-NUMA, hardware-based counters affected OS decisions
Big difference: NUMA deals with main memory, while the OS-level page allocation presented here deals with distributed L2 caches
Conclusion
Page allocation allows for a very simple shared cache architecture, but how can we use advances in architecture for our benefit?
Can we also apply OS-level modifications?
Architecture can provide more detailed information about the current state of the cores
CMP-NuRAPID, victim replication, cooperative caching
Page coloring and page recoloring
We are trading hardware complexity for software complexity – where is the right balance?