PTask: Operating System Abstractions to Manage GPUs as Compute Devices
Chris Rossbach, Jon Currey, Microsoft Research
Mark Silberstein, Technion
Baishakhi Ray, Emmett Witchel, UT Austin
SOSP October 25, 2011
There are lots of GPUs
◦ 3 of top 5 supercomputers use GPUs
◦ In all new PCs, smart phones, tablets
◦ Great for gaming and HPC/batch
◦ Unusable in other application domains
GPU programming challenges
◦ GPU+main memory disjoint
◦ Treated as I/O device by OS
PTask SOSP 2011
These two things are related: "Unusable in other application domains" and "GPU programming challenges"
◦ We need OS abstractions
The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion
Traditional OS stack
◦ Programmer-visible interface
◦ OS-level abstractions
◦ Hardware interface
1:1 correspondence between OS-level and user-level abstractions
GPU stack
◦ Programmer-visible interface: GPGPU APIs, Shaders/Kernels, Language Integration
◦ DirectX/CUDA/OpenCL Runtime
◦ 1 OS-level abstraction!
1. No kernel-facing API
2. No OS resource-management
3. Poor composability
GPU benchmark throughput (higher is better)
[Chart: throughput with no CPU load vs. high CPU load]
CPU scheduler and GPU scheduler not integrated!
• Image-convolution in CUDA
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230
OS cannot prioritize cursor updates
[Chart: flatter lines are better]
• WDDM + DWM + CUDA == dysfunction
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230
Raw images → capture → xform → filter → detect → "Hand" events
◦ capture: capture camera images
◦ xform: geometric transformation (noisy point cloud)
◦ filter: noise filtering
◦ detect: detect gestures
High data rates
Data-parallel algorithms
… good fit for GPU
NOT Kinect: this is a harder problem!
#> capture | xform | filter | detect &
◦ capture, detect: CPU
◦ xform, filter: GPU
Modular design
◦ flexibility, reuse
Utilize heterogeneous hardware
◦ Data-parallel components → GPU
◦ Sequential components → CPU
Using OS provided tools
◦ processes, pipes
GPUs cannot run OS: different ISA
Disjoint memory space, no coherence
Host CPU must manage GPU execution
◦ Program inputs explicitly transferred/bound at runtime
◦ Device buffers pre-allocated
User-mode apps must implement
◦ Copy inputs (main memory → GPU memory)
◦ Send commands (CPU → GPU)
◦ Copy outputs (GPU memory → main memory)
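The copy-in / send-commands / copy-out sequence that user-mode apps must implement can be sketched as follows. This is a minimal simulation in plain Python, not a real GPU API: `FakeDevice`, `scale_kernel`, and `run_on_device` are hypothetical names introduced only for illustration, and the "device memory" is just a separate dictionary standing in for a disjoint memory space.

```python
class FakeDevice:
    """Stand-in for a GPU with its own memory space (hypothetical, no real GPU API)."""

    def __init__(self):
        self.buffers = {}    # "device memory": buffer name -> list of values
        self.transfers = 0   # number of host<->device copies performed

    def alloc(self, name, size):
        # Device buffers are pre-allocated before any data moves.
        self.buffers[name] = [0.0] * size

    def copy_in(self, name, host_data):
        # Explicit host -> device transfer.
        self.buffers[name][:] = host_data
        self.transfers += 1

    def launch(self, kernel, inputs, output):
        # "Send commands": run the kernel on device-resident buffers only.
        self.buffers[output][:] = kernel(*(self.buffers[n] for n in inputs))

    def copy_out(self, name):
        # Explicit device -> host transfer.
        self.transfers += 1
        return list(self.buffers[name])


def scale_kernel(xs):
    """Toy data-parallel kernel: double every element."""
    return [2.0 * x for x in xs]


def run_on_device(dev, host_input):
    """Everything the host program has to orchestrate for one kernel run."""
    n = len(host_input)
    dev.alloc("in", n)
    dev.alloc("out", n)
    dev.copy_in("in", host_input)             # copy inputs
    dev.launch(scale_kernel, ["in"], "out")   # send commands
    return dev.copy_out("out")                # copy outputs
```

Even for one trivial kernel the host pays two transfers; chaining several such stages through separate processes repeats this cost at every stage.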
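#> capture | xform | filter | detect &
Data path with processes and pipes (user processes above the OS executive; camdrv, GPU driver, HIDdrv below)
◦ capture: read() from camdrv, write()
◦ xform: read(), copy to GPU, Run!, copy from GPU, write()
◦ filter: read(), copy to GPU, Run!, copy from GPU, write()
◦ detect: read(), IRP to HIDdrv
◦ Each copy to/from GPU is a PCI-xfer through the GPU driver (4 PCI-xfers in total)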
GPU Analogues for:
◦ Process API
◦ IPC API
◦ Scheduler hints
Abstractions that enable:
◦ Fairness/isolation
◦ OS use of GPU
◦ Composition/data movement optimization
PTask: Dataflow for GPUs
ptask (parallel task)
◦ Has priority for fairness
◦ Analogous to a process for GPU execution
◦ List of input/output resources (e.g. stdin, stdout…)
ports
◦ Can be mapped to ptask input/outputs
◦ A data source or sink
channels
◦ Similar to pipes, connect arbitrary ports
◦ Specialize to eliminate double-buffering
graph
◦ DAG: connected ptasks, ports, channels
datablocks
◦ Memory-space transparent buffers
• OS objects → OS RM possible
• data: specify where, not how
#> capture | xform | filter | detect &
ptask graph
◦ capture (process) → port rawimg → xform (ptask) → port cloud → channel → port f-in → filter (ptask) → port f-out → detect (process)
◦ Datablocks move through mapped mem and GPU mem
◦ Legend: process (CPU), ptask (GPU), port, channel, datablock
Optimized data movement
Data arrival triggers computation
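"Data arrival triggers computation" can be illustrated with a toy dataflow runner. This is a sketch under stated assumptions, not the PTask API: the `Channel`, `PTask`, `run_graph`, and `demo` names and the stand-in lambdas are hypothetical, and everything runs on the CPU. Only the port names (rawimg, cloud, f-in, f-out) come from the slide.

```python
from collections import deque


class Channel:
    """Pipe-like FIFO connecting an output port to an input port."""

    def __init__(self):
        self.queue = deque()


class PTask:
    """Node that fires once every input port has a datablock waiting."""

    def __init__(self, name, fn, inputs, outputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs      # input port name -> Channel
        self.outputs = outputs    # output port name -> Channel

    def ready(self):
        return all(ch.queue for ch in self.inputs.values())

    def fire(self):
        # Consume one datablock per input port, produce one per output port.
        args = {port: ch.queue.popleft() for port, ch in self.inputs.items()}
        for port, block in self.fn(**args).items():
            self.outputs[port].queue.append(block)


def run_graph(ptasks):
    """Fire ready ptasks until nothing is ready; return the firing order."""
    order = []
    progress = True
    while progress:
        progress = False
        for task in ptasks:
            if task.ready():
                task.fire()
                order.append(task.name)
                progress = True
    return order


def demo():
    raw, mid, out = Channel(), Channel(), Channel()
    # Stand-in computations; the real stages are GPU kernels.
    xform = PTask("xform", lambda rawimg: {"cloud": [p + 1 for p in rawimg]},
                  {"rawimg": raw}, {"cloud": mid})
    filt = PTask("filter", lambda f_in: {"f_out": [p for p in f_in if p % 2 == 0]},
                 {"f_in": mid}, {"f_out": out})
    raw.queue.append([1, 2, 3])         # capture pushes one frame
    order = run_graph([filt, xform])    # listing order does not decide execution
    return order, list(out.queue)
```

Calling `demo()` runs xform before filter even though filter is listed first, because filter only becomes ready once xform's output arrives on the channel between cloud and f-in.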
Graphs scheduled dynamically
◦ ptasks queue for dispatch when inputs ready
Queue: dynamic priority order
◦ ptask priority user-settable
◦ ptask prio normalized to OS prio
Transparently support multiple GPUs
◦ Schedule ptasks for input locality
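The two scheduling ideas above, dynamic priority order and input locality, can be sketched roughly as below. The linear aging formula, the `aging_rate` parameter, and the function and field names are assumptions made for illustration; the slides do not give the actual priority computation.

```python
from collections import Counter


def effective_priority(static_prio, waited, aging_rate=0.5):
    """Assumed aging rule: priority grows linearly with time spent waiting."""
    return static_prio + aging_rate * waited


def pick_ptask(ready, now):
    """Pick the ready ptask with the highest aged priority.

    ready: list of dicts with keys "name", "prio", "since" (enqueue time).
    Ties go to the ptask that has waited longest.
    """
    return max(
        ready,
        key=lambda t: (effective_priority(t["prio"], now - t["since"]), -t["since"]),
    )


def pick_gpu(input_locations, free_gpus):
    """Prefer the free GPU that already holds the most inputs (locality).

    input_locations: memory space of each input, e.g. ["gpu1", "main"].
    free_gpus: list of available GPU names; earlier entries win ties.
    """
    counts = Counter(loc for loc in input_locations if loc in free_gpus)
    return max(free_gpus, key=lambda g: (counts[g], -free_gpus.index(g)))
```

With these assumed numbers, a priority-2 ptask that has waited 10 ticks (aged priority 7.0) is dispatched ahead of a priority-5 ptask that has waited 2 ticks (6.0), which is the starvation-avoidance effect that aging is meant to provide.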
Datablock
space   V   M   RW   data
main    1   1   11   Main Memory
gpu0    0   1   10   GPU 0 Memory
gpu1    1   1   10   GPU 1 Memory
…
Logical buffer
◦ backed by multiple physical buffers
◦ buffers created/updated lazily
◦ mem-mapping used to share across process boundaries
Track buffer validity per memory space
◦ writes invalidate other views
Flags for access control/data placement
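The validity tracking described above can be sketched as a small class. This is an illustrative model only: the `Datablock` class below, its method names, and the `copies` counter are hypothetical, the physical buffers are plain Python lists, and the M and RW flags are omitted.

```python
class Datablock:
    """Logical buffer with one lazily created physical buffer per memory space."""

    def __init__(self, spaces=("main", "gpu0", "gpu1")):
        self.buffers = {s: None for s in spaces}   # physical buffers, created lazily
        self.valid = {s: False for s in spaces}    # V flag per memory space
        self.copies = 0                            # cross-space transfers performed

    def write(self, space, data):
        # A write makes this view current and invalidates every other view.
        self.buffers[space] = list(data)
        for s in self.valid:
            self.valid[s] = (s == space)

    def read(self, space):
        # Materialize the view on demand, copying only if it is stale.
        if not self.valid[space]:
            sources = [s for s, ok in self.valid.items() if ok]
            if not sources:
                raise ValueError("datablock has no valid view yet")
            self.buffers[space] = list(self.buffers[sources[0]])
            self.valid[space] = True
            self.copies += 1
        return self.buffers[space]
```

Repeated reads in the same memory space cost no transfer, so a block written by one GPU stage and read by the next stage on the same GPU never travels back to main memory.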
#> capture | xform | filter …
[Diagram: datablock V, M, RW flags for the main and gpu memory spaces as data flows capture → rawimg → xform → cloud → f-in → filter, with buffers in Main Memory and GPU Memory]
◦ Legend: process, ptask, port, channel, datablock
• 1-1 correspondence between programmer and OS abstractions
• GPU APIs can be built on top of new OS abstractions
Evaluation
Windows 7
◦ Full PTask API implementation
◦ Stacked UMDF/KMDF driver
  • Kernel component: mem-mapping, signaling
  • User component: wraps DirectX, CUDA, OpenCL
◦ syscalls → DeviceIoControl() calls
Linux 2.6.33.2
◦ Changed OS scheduling to manage GPU
  • GPU accounting added to task_struct
Windows 7, Core2-Quad, GTX580 (EVGA)
Implementations
◦ pipes: capture | xform | filter | detect
◦ modular: capture+xform+filter+detect, 1 process
◦ handcode: data movement optimized, 1 process
◦ ptask: ptask graph
Configurations
◦ real-time: driven by cameras
◦ unconstrained: driven by in-memory playback
Results relative to handcode (lower is better)
[Chart: runtime, user, sys for handcode, modular, pipes, ptask]
compared to hand-code
• 11.6% higher throughput
• lower CPU util: no driver program
compared to pipes
• ~2.7x less CPU usage
• 16x higher throughput
• ~45% less memory usage
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)
PTask invocations/second vs. PTask priority (higher is better)
[Chart: x-axis PTask priority 2 to 8; series fifo, priority, ptask]
PTask provides throughput proportional to priority
• FIFO – queue invocations in arrival order
• ptask – aged priority queue w/ OS priority
• graphs: 6x6 matrix multiply
• priority same for every PTask node
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)
Speedup over 1 GPU (higher is better)
[Chart: series priority, data-aware]
• Synthetic graphs: varying depths
• Data-aware == priority + locality
• Graph depth > 1 req. for any benefit
Data-aware provides best throughput, preserves priority
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• 2 x GTX580 (EVGA)
Linux 2.6.33: EncFS on GPU
◦ user-prgs: R/W bnc, cuda-1, cuda-2
◦ user-libs: EncFS, FUSE, libc, PTask
◦ OS: Linux 2.6.33
◦ HW: SSD1, SSD2, GPU
Simple GPU usage accounting
• Restores performance
         GPU/CPU   cuda-1 Linux   cuda-2 Linux   cuda-1 PTask   cuda-2 PTask
Read     1.17x     -10.3x         -30.8x         1.16x          1.16x
Write    1.28x     -4.6x          -10.3x         1.21x          1.20x
• EncFS: nice -20
• cuda-*: nice +19
• AES: XTS chaining
• SATA SSD, RAID
• seq. R/W 200 MB
Related Work
OS support for heterogeneous platforms:
◦ Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
GPU Scheduling
◦ TimeGraph [Kato 11], Pegasus [Gupta 11]
Graph-based programming models
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP Offload [Currid 04]
Tasking
◦ Tessellation, Apple GCD, …
OS abstractions for GPUs are critical
◦ Enable fairness & priority
◦ OS can use the GPU
Dataflow: a good fit abstraction
◦ system manages data movement
◦ performance benefits significant
Thank you. Questions?