Fair Queuing Memory Systems

Kyle Nesbit, Nidhi Aggarwal, Jim Laudon*, and Jim Smith
University of Wisconsin – Madison
Department of Electrical and Computer Engineering
*Sun Microsystems
Motivation: Multicore Systems

- Significant memory bandwidth limitations
- Bandwidth-constrained operating points will occur more often in the future
- Systems must perform well at bandwidth-constrained operating points
- Systems must respond in a predictable manner
Bandwidth Interference

[Bar chart: IPC (0 to 1) of vpr running alone, vpr with crafty, and vpr with art.]

- Desktops: soft real-time constraints
- Servers: fair sharing / billing
- Bandwidth interference decreases overall throughput
Solution

- A memory scheduler based on:
  - First-Ready FCFS (FR-FCFS) memory scheduling
  - Network Fair Queuing (FQ)
- System software allocates memory system bandwidth to individual threads
- The proposed FQ memory scheduler:
  1. Offers threads their allocated bandwidth
  2. Distributes excess bandwidth fairly
Background

- Memory Basics
- Memory Controllers
- First-Ready FCFS Memory Scheduling
- Network Fair Queuing

Background: Memory Basics
Micron DDR2-800 timing constraints (measured in DRAM address bus cycles)

tRCD   Activate to read                           5 cycles
tCL    Read to data bus valid                     5 cycles
tWL    Write to data bus valid                    4 cycles
tCCD   CAS to CAS (CAS is a read or a write)      2 cycles
tWTR   Write to read                              3 cycles
tWR    Internal write to precharge                6 cycles
tRTP   Internal read to precharge                 3 cycles
tRP    Precharge to activate                      5 cycles
tRRD   Activate to activate (different banks)     3 cycles
tRAS   Activate to precharge                     18 cycles
tRC    Activate to activate (same bank)          22 cycles
BL/2   Burst length (cache line size / 64 bits)   4 cycles
tRFC   Refresh to activate                       51 cycles
tRFC   Max refresh to refresh                28,000 cycles
Background: Memory Controller

[Diagram: a CMP with two processors, each with private L1 caches and a private L2 cache, sharing an on-chip memory controller; the SDRAM sits outside the chip boundary.]
Background: Memory Controller

- Translates memory requests into SDRAM commands: Activate, Read, Write, and Precharge
- Tracks SDRAM timing constraints, e.g., the activate latency tRCD and the CAS latency tCL (see the sketch below)
- Buffers and reorders requests in order to improve memory system throughput
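To make "tracks SDRAM timing constraints" concrete, here is a minimal C sketch of how a controller might decide whether a command is ready. The state layout and function names are hypothetical; a real controller enforces many more of the constraints in the table above (tCCD, tWTR, tRRD across banks, and so on).

```c
#include <stdbool.h>
#include <stdint.h>

/* DDR2-800 constraints from the table above (address bus cycles). */
enum { tRCD = 5, tRP = 5, tRC = 22 };

/* Hypothetical per-bank state a controller might keep in order to
 * decide whether a command is "ready". */
typedef struct {
    uint64_t last_activate;   /* cycle of the most recent Activate */
    uint64_t last_precharge;  /* cycle of the most recent Precharge */
    bool     row_open;        /* a row is currently active */
} bank_state_t;

/* Activate is ready once tRP has elapsed since the bank's Precharge
 * and tRC since its previous Activate. */
static bool activate_ready(const bank_state_t *b, uint64_t now) {
    return !b->row_open &&
           now >= b->last_precharge + tRP &&
           now >= b->last_activate + tRC;
}

/* A Read or Write (CAS) is ready once the row has been open for
 * tRCD cycles. */
static bool cas_ready(const bank_state_t *b, uint64_t now) {
    return b->row_open && now >= b->last_activate + tRCD;
}
```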
Background: Memory Scheduler

[Diagram: memory requests arrive from the processor data bus and are assigned arrival times; cache line read and write buffers sit on the data path; a transaction buffer holds per-bank request queues (bank 1 ... bank n); the FR-FCFS scheduler issues commands on the SDRAM address bus, and data moves over the SDRAM data bus.]
Background: FR-FCFS Memory Scheduler

- A First-Ready FCFS scheduler prioritizes, in order:
  1. Ready commands
  2. CAS commands over RAS commands
  3. Commands with the earliest arrival time
- "Ready" is with respect to the SDRAM timing constraints
- FR-FCFS is a good general-purpose scheduling policy [Rixner 2004]
  - But it has multithreaded issues, as the following example shows
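The three rules translate directly into a priority comparator. A sketch, assuming a hypothetical pending-command record:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pending-command record. */
typedef struct {
    bool     ready;        /* satisfies all SDRAM timing constraints */
    bool     is_cas;       /* column access (read/write) vs. row access */
    uint64_t arrival_time; /* when the request entered the controller */
} cmd_t;

/* FR-FCFS: (1) ready first, (2) CAS over RAS, (3) oldest first. */
static const cmd_t *frfcfs_higher_priority(const cmd_t *a, const cmd_t *b) {
    if (a->ready != b->ready)
        return a->ready ? a : b;
    if (a->is_cas != b->is_cas)
        return a->is_cas ? a : b;
    return (a->arrival_time <= b->arrival_time) ? a : b;
}
```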
Example: Two Threads

[Timeline: Thread 1 has bursty memory-level parallelism and is bandwidth constrained; its requests arrive in bursts (a1–a4, then a5–a8) separated by short computation phases. Thread 2 issues isolated, latency-sensitive misses (a1–a5), each followed by computation.]

First Come First Serve

[Timeline: under FCFS, both threads share the memory system; Thread 1's bursts (a1–a4, a5–a8) and Thread 2's isolated requests (a1, a2, ...) are serviced in arrival order.]
Background: Network Fair Queuing

- Network Fair Queuing (FQ) provides QoS in communication networks
  - Network flows are allocated bandwidth on each network link along the flow's path
  - Routers use FQ algorithms to offer flows their allocated bandwidth
  - Minimum bandwidth bounds end-to-end communication delay through the network
- We leverage FQ theory to provide QoS in memory systems
Background: Virtual Finish-Time Algorithm

- The kth packet on flow i is denoted p_i^k
- p_i^k's virtual start-time: S_i^k = max{ a_i^k, F_i^(k-1) }
- p_i^k's virtual finish-time: F_i^k = S_i^k + L_i^k / φ_i
  - φ_i is flow i's share of the network link
- A virtual clock (VC) determines the arrival time a_i^k
  - The VC algorithm determines the fairness policy (see the sketch below)
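Read as code, the per-flow update is two lines. A minimal sketch (illustrative names), keeping F_i^(k-1) and φ_i per flow:

```c
/* Per-flow virtual-time bookkeeping for the finish-time algorithm
 * above. Names are illustrative; times are in virtual-clock units. */
typedef struct {
    double last_finish; /* F_i^(k-1), 0.0 before the first packet */
    double phi;         /* flow i's share of the link, 0 < phi <= 1 */
} flow_t;

/* Returns the packet's virtual finish-time F_i^k and updates the flow.
 * arrival = a_i^k (virtual clock), length = L_i^k. */
static double virtual_finish_time(flow_t *f, double arrival, double length) {
    double start = (arrival > f->last_finish) ? arrival : f->last_finish; /* S_i^k */
    f->last_finish = start + length / f->phi;                            /* F_i^k */
    return f->last_finish;
}
```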
Quality of Service

- Each thread is allocated a fraction φ_i of the memory system bandwidth
  - Desktop: soft real-time applications
  - Server: differentiated service and billing
- The proposed FQ memory scheduler:
  1. Offers threads their allocated bandwidth, regardless of the load on the memory system
  2. Distributes excess bandwidth according to the FQ memory scheduler's fairness policy
Quality of Service

- Minimum bandwidth ⇒ QoS
  - A thread allocated a fraction φ_i of the memory system bandwidth will perform as well as the same thread on a private memory system operating at φ_i of the frequency
Fair Queuing Memory Scheduler

- A virtual time memory system (VTMS) is used to calculate memory request deadlines
  - Request deadlines are virtual finish-times
- The FQ scheduler selects:
  1. The first-ready pending request
  2. With the earliest deadline first (EDF)

(A sketch of the selection rule follows.)

[Diagram: each thread's requests (thread 1 ... thread m) feed a per-thread VTMS; the deadline / finish-time algorithm stamps requests into the transaction buffer, and the FQ scheduler issues them to SDRAM.]
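Putting the two rules together, selection reduces to a scan of the transaction buffer for the ready request with the smallest deadline. A sketch with a hypothetical entry layout:

```c
#include <stdbool.h>

/* Hypothetical transaction-buffer entry: the FQ scheduler needs a
 * readiness bit plus the VTMS-computed deadline. */
typedef struct {
    bool   ready;    /* ready w.r.t. SDRAM timing constraints */
    double deadline; /* virtual finish-time from the thread's VTMS */
} fq_req_t;

/* First-ready, earliest-deadline-first: among ready requests, pick
 * the one with the smallest virtual finish-time (-1 if none ready). */
static int fq_select(const fq_req_t *buf, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!buf[i].ready)
            continue;
        if (best < 0 || buf[i].deadline < buf[best].deadline)
            best = i;
    }
    return best;
}
```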
Fair Queuing Memory Scheduler

[Timeline: the same two threads under FQ scheduling. Deadlines for Thread 1's bursts (a1–a4, a5–a8) and Thread 2's isolated requests (a1–a4) are computed in virtual time, with each request's memory latency dilated by the reciprocal of the thread's share φ_i.]
Virtual Time Memory System

- Each thread has its own VTMS to model its private memory system
- A VTMS consists of multiple resources: banks and channels
- In hardware, a VTMS consists of one register for each memory bank and channel resource
  - A VTMS register holds the virtual time at which the virtual resource will be ready to start the next request
Virtual Time Memory System

- A request's deadline is its virtual finish-time
  - The time the request would finish if the request's thread were running on a private memory system operating at φ_i of the frequency
- The VTMS model captures fundamental SDRAM timing characteristics
  - It abstracts away some details in order to apply network FQ theory
Priority Inversion

- First-ready scheduling is required to improve bandwidth utilization
- However, low-priority ready commands can block higher-priority (earlier virtual finish-time) commands
- Most priority inversion blocking occurs at active banks, e.g., during a sequence of row hits
Bounding Priority Inversion Blocking Time

1. While a bank is inactive, and during the first tRAS cycles after the bank has been activated, prioritize requests first-ready, virtual finish-time first (FR-VFTF)
2. After a bank has been active for tRAS cycles, the FQ scheduler selects the command with the earliest virtual finish-time and waits for it to become ready

(A sketch of this two-mode rule follows.)
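One way to read the two rules is as a per-bank mode test: readiness may lead the priority order only while the bank is idle or within tRAS of its Activate; afterwards the scheduler must wait on the earliest virtual finish-time. A hedged sketch, with hypothetical state:

```c
#include <stdbool.h>
#include <stdint.h>

enum { tRAS_CYCLES = 18 }; /* from the DDR2-800 table */

/* Hypothetical per-bank state for the mode test. */
typedef struct {
    bool     active;        /* a row is currently open */
    uint64_t activated_at;  /* cycle of the most recent Activate */
} bank_t;

/* Mode 1 (FR-VFTF): readiness may be the first priority rule.
 * Mode 2 (pure earliest virtual finish-time): returns false, and the
 * scheduler waits for the earliest-deadline command to become ready. */
static bool allow_first_ready(const bank_t *b, uint64_t now) {
    if (!b->active)
        return true;                             /* bank inactive */
    return now < b->activated_at + tRAS_CYCLES;  /* early active window */
}
```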
Evaluation

- Simulator originally developed at IBM Research
- Structural model
  - Adopts the ASIM modeling methodology
  - Detailed model of finite memory system resources
- We simulate 20 statistically representative 100M-instruction SPEC2000 traces
4GHz Processor – System Configuration

Issue Buffer          64 entries
Issue Width           8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
Reorder Buffer        128 entries
Load / Store Queues   32-entry load reorder queue, 32-entry store reorder queue
I-Cache               32KB private, 4-way, 64-byte lines, 2-cycle latency, 8 MSHRs
D-Cache               32KB private, 4-way, 64-byte lines, 2-cycle latency, 16 MSHRs
L2 Cache              512KB private, 8-way, 64-byte lines, 12-cycle latency, 16 store merge buffer entries, 32 transaction buffer entries
Memory Controller     16 transaction buffer entries per thread, 8 write buffer entries per thread, closed page policy
SDRAM Channels        1 channel
SDRAM Ranks           1 rank
SDRAM Banks           8 banks
Evaluation: Single Thread Data Bus Utilization

[Bar chart: data bus utilization (0–100%) per SPEC2000 benchmark: crafty, perlbmk, sixtrack, mesa, vpr, gzip, bzip2, ammp, gap, twolf, wupwise, apsi, mgrid, swim, gcc, lucas, facerec, mcf, equake, and art.]

We use data bus utilization to roughly approximate "aggressiveness".
Evaluation

- We present results for two-thread workloads that stress the memory system
  - We construct 19 workloads by combining each benchmark (the subject thread) with art, the most aggressive benchmark (the background thread)
  - Static partitioning of memory bandwidth: φ_i = 0.5
- IPC is normalized to QoS IPC
  - The benchmark's IPC on a private memory system at 0.5 the frequency (0.5 the bandwidth)
- More results are in the paper
[Bar charts, FR-FCFS vs. FQ, one bar per subject thread of the two-thread workload (background thread is art): (1) normalized IPC of the subject thread, (2) normalized IPC of the background thread (art), and (3) throughput as the harmonic mean of normalized IPCs, each with an hmean summary bar across benchmarks.]
Summary and Conclusions

- Existing techniques can lead to unfair sharing of memory bandwidth resources ⇒ destructive interference
- Fair queuing is a good technique to provide QoS in memory systems
- Providing threads QoS eliminates destructive interference, which can significantly improve system throughput
Backup Slides
Generalized Processor Sharing

- Ideal generalized processor sharing (GPS)
  - Each flow i is allocated a share φ_i of the shared network link
  - A GPS server services all backlogged flows simultaneously, in proportion to their allocated shares

[Diagram: four flows (φ_1 ... φ_4) sharing one link in proportion to their shares.]
Background: Network Fair Queuing

- Network FQ algorithms model each flow as if it were on a private link
  - Flow i's private link has φ_i the bandwidth of the real link
- The algorithms calculate packet deadlines
  - A packet's deadline is the virtual time at which the packet finishes its transmission on its private link
Virtual Time Memory System: Finish-Time Algorithm

- Thread i's kth memory request is denoted m_i^k
- m_i^k's bank j virtual start-time: B_j.S_i^k = max{ a_i^k, B_j.F_i^(k-1)' }
  (the prime denotes thread i's previous request to bank j)
- m_i^k's bank j virtual finish-time: B_j.F_i^k = B_j.S_i^k + B_j.L_i^k / φ_i
- m_i^k's channel virtual start-time: C.S_i^k = max{ B_j.F_i^k, C.F_i^(k-1) }
- m_i^k's channel virtual finish-time: C.F_i^k = C.S_i^k + C.L_i^k / φ_i

(A sketch of the update follows.)
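A sketch of the update in C. Names and layout are illustrative stand-ins for the per-bank and per-channel VTMS registers, and the channel transfer is assumed to start no earlier than the request's own bank access completes.

```c
/* Per-thread VTMS sketch: one virtual finish-time register per bank
 * plus one for the channel. */
#define NUM_BANKS 8

typedef struct {
    double bank_fin[NUM_BANKS]; /* B_j.F of this thread's last request to bank j */
    double chan_fin;            /* C.F of this thread's last request */
    double phi;                 /* thread's allocated share */
} vtms_t;

/* Compute the request's deadline (its channel virtual finish-time)
 * and update the VTMS registers. */
static double vtms_deadline(vtms_t *v, int bank, double arrival,
                            double bank_len, double chan_len) {
    /* B_j.S = max{ a, B_j.F of previous request to bank j } */
    double bs = arrival > v->bank_fin[bank] ? arrival : v->bank_fin[bank];
    /* B_j.F = B_j.S + B_j.L / phi */
    v->bank_fin[bank] = bs + bank_len / v->phi;
    /* C.S = max{ B_j.F, C.F of previous request } */
    double cs = v->bank_fin[bank] > v->chan_fin ? v->bank_fin[bank]
                                                : v->chan_fin;
    /* C.F = C.S + C.L / phi */
    v->chan_fin = cs + chan_len / v->phi;
    return v->chan_fin;
}
```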
Fairness Policy

- FQMS fairness policy: distribute excess bandwidth to the thread that has consumed the least excess bandwidth (relative to its service share) in the past
- This differs from the fairness policy commonly used in networks, because a memory system is an integral part of a closed system

(A sketch of the selection step follows.)
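A minimal sketch of this policy's selection step. The excess-bandwidth accounting itself is not specified on this slide, so modeling it as service received beyond the thread's share of total service is an assumption:

```c
/* Hedged sketch of the FQMS fairness policy's selection step. */
#define NTHREADS 2

typedef struct {
    double service[NTHREADS]; /* service received by each thread */
    double phi[NTHREADS];     /* allocated shares, summing to <= 1 */
    double total;             /* total service delivered so far */
} fair_t;

/* Give excess bandwidth to the thread that has consumed the least
 * excess (relative to its share) in the past. The excess metric
 * below is an assumed placeholder, not taken from the slide. */
static int pick_excess_recipient(const fair_t *f) {
    int best = 0;
    double best_x = f->service[0] - f->phi[0] * f->total;
    for (int t = 1; t < NTHREADS; t++) {
        double x = f->service[t] - f->phi[t] * f->total;
        if (x < best_x) { best_x = x; best = t; }
    }
    return best;
}
```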
Background: SDRAM Memory Systems

- SDRAM has a 3D structure: banks, rows, and columns
- SDRAM commands:
  - Activate a row
  - Read or write columns
  - Precharge the bank
Virtual Time Memory System: Service Requirements

SDRAM Command   B_cmd.L                       C_cmd.L
Activate        tRCD                          n/a
Read            tCL                           BL/2
Write           tWL                           BL/2
Precharge       tRP + (tRAS - tRCD - tCL)     n/a

- The tRAS timing constraint overlaps the read and write bank timing constraints
- The precharge bank service requirement accounts for the overlap

(The table maps directly into code; see the sketch below.)
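A pair of lookup functions for the table above, plugging in the DDR2-800 values from the earlier timing table; this is illustrative code under those assumptions, not the paper's implementation.

```c
/* Service requirements per SDRAM command, DDR2-800 values (cycles). */
enum { RCD = 5, CL = 5, WL = 4, RP = 5, RAS = 18, BL_HALF = 4 };
enum sdram_cmd { CMD_ACTIVATE, CMD_READ, CMD_WRITE, CMD_PRECHARGE };

/* Bank service requirement B_cmd.L, in cycles. */
static int bank_service(enum sdram_cmd c) {
    switch (c) {
    case CMD_ACTIVATE:  return RCD;
    case CMD_READ:      return CL;
    case CMD_WRITE:     return WL;
    case CMD_PRECHARGE: return RP + (RAS - RCD - CL); /* = 13 */
    }
    return 0;
}

/* Channel service requirement C_cmd.L, in cycles (0 where n/a). */
static int chan_service(enum sdram_cmd c) {
    return (c == CMD_READ || c == CMD_WRITE) ? BL_HALF : 0;
}
```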