GPUTeraSort: High Performance Graphics
Co-processor Sorting for Large Data Management
Naga K. Govindaraju
Ritesh Kumar
Jim Gray
Dinesh Manocha
http://gamma.cs.unc.edu/GPUTERASORT
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL & MICROSOFT RESEARCH
Sorting
“I believe that virtually every important
aspect of programming arises somewhere
in the context of sorting or searching!”
-Don Knuth
Sorting
Well studied
High performance computing
Databases
Computer graphics
Programming languages
...
Google MapReduce algorithm
SPEC benchmark routine!
Massive Databases
Terabyte-data sets are common
Google sorts more than 100 billion terms in its
index
> 1 trillion records indexed on the web!
Database sizes are rapidly increasing!
Max DB size increases 3x per year
(http://www.wintercorp.com)
Processor improvements not matching
information explosion
CPU vs. GPU
CPU (3 GHz), 2 x 1 MB Cache
System Memory (2 GB)
AGP Memory (512 MB)
PCI-E Bus (4 GB/s)
GPU (690 MHz)
Video Memory (512 MB)
External Memory Sorting
Performed on Terabyte-scale databases
Two-phase algorithm [Vitter01, Salzberg90,
Nyberg94, Nyberg95]
Limited main memory
First phase – partitions input file into large data chunks
and writes sorted chunks known as “Runs”
Second phase – Merge the “Runs” to generate the sorted
file
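The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not the GPUTeraSort implementation: it assumes newline-terminated text records, measures the run size in records rather than bytes, and sorts each run on the CPU. The function name `external_sort` and its parameters are hypothetical.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, run_size):
    """Two-phase external merge sort over newline-terminated records.

    run_size is the number of records per run (an assumption made for
    simplicity; a real system would size runs in bytes).
    Returns the number of runs written in phase 1.
    """
    run_paths = []
    # Phase 1: read one chunk at a time, sort it in memory, write it as a run.
    with open(input_path, "r") as src:
        while True:
            chunk = list(itertools.islice(src, run_size))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)

    # Phase 2: merge all sorted runs into the output file.
    runs = [open(path, "r") for path in run_paths]
    try:
        with open(output_path, "w") as dst:
            # heapq.merge reads lazily from each run, so memory use stays
            # proportional to the number of runs, not the file size.
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for path in run_paths:
            os.remove(path)
    return len(run_paths)
```

Larger runs mean fewer files to merge in phase 2, which is why the run size matters for I/O performance.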
External Memory Sorting
Performance mainly governed by I/O
Salzberg Analysis: Given the main
memory size M and the file size N, if the
I/O read size per run is T in phase 2,
external memory sorting achieves efficient
I/O performance if the run size R in phase
1 is given by R ≈ √(TN)
Salzberg Analysis
If N=100GB, T=2MB, then
R ≈ 230MB
Large data sorting on CPUs can
achieve high I/O performance by
sorting large runs
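As a quick check of the formula, the snippet below evaluates R ≈ √(TN) exactly as written, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes). The helper name `salzberg_run_size` is hypothetical. With N = 100 GB and T = 2 MB the bare formula gives about 447 MB, roughly twice the ≈ 230 MB quoted on the slide, so the slide's figure presumably includes a constant factor that is not shown in this transcript.

```python
import math

MB = 10**6  # decimal megabyte (assumption; the slide does not state its units)
GB = 10**9  # decimal gigabyte


def salzberg_run_size(file_size_bytes, read_size_bytes):
    """Run size R = sqrt(T * N), with no constant factor applied."""
    return math.sqrt(read_size_bytes * file_size_bytes)


r = salzberg_run_size(100 * GB, 2 * MB)
print(f"R = {r / MB:.0f} MB")               # run size from the bare formula
print(f"runs = {100 * GB / r:.0f}")         # number of runs, N / R
```

Either way, the run size grows only with the square root of the file size, so runs of a few hundred megabytes are enough for a 100 GB file.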
Massive Data Handling on CPUs
Require random memory accesses
Small CPU caches (< 2MB)
Slower than even sequential disk accesses –
bottleneck shift from I/O to memory
Widening memory to compute gap!
External memory sorting on CPUs can
have low performance due to
High memory latency on account of cache misses
Or low I/O performance
Sorting is hard!
Graphics Processing Units (GPUs)
Commodity processor for graphics
applications
Massively parallel vector processors
High memory bandwidth
Low memory latency pipeline
Programmable
High growth rate
GPU: Commodity Processor
Cell phones
Laptops
Consoles
PSP
Desktops
Graphics Processing Units (GPUs)
Commodity processor for graphics
applications
Massively parallel vector processors
10x more operations per sec than CPUs
High memory bandwidth
Low memory latency pipeline
Programmable
High growth rate
Parallelism on GPUs
Graphics FLOPS
GPU – 1.3 TFLOPS
CPU – 25.6 GFLOPS
Graphics Processing Units (GPUs)
Commodity processor for graphics
applications
Massively parallel vector processors
High memory bandwidth
10x more memory bandwidth than CPUs
Better hides memory latency
Programmable
High growth rate
Graphics Pipeline
Hides memory latency!!
Low pipeline depth
56 GB/s
[Pipeline diagram: vertex, polygon, pixel, texture and image data flow through the stages below to memory]
programmable vertex processing (fp32)
polygon setup, culling, rasterization (setup, rasterizer)
programmable per-pixel math (fp32)
per-pixel texture, fp16 blending
Z-buf, fp16 blending, anti-alias (MRT)
memory
NON-Graphics Pipeline Abstraction
Courtesy: David Kirk, Chief Scientist, NVIDIA
[Pipeline diagram: data and lists flow through the stages below to memory]
programmable MIMD processing (fp32)
SIMD “rasterization” (setup, rasterizer)
programmable SIMD processing (fp32)
data fetch, fp16 blending
predicated write, fp16 blend, multiple output
memory
Technology Trends: CPU and GPU
[Chart: log of relative processing power, 2002 to 2008, for the Enthusiast / Specialty, Mainstream Desktop, DT ‘Replacement’ and Mobile segments; annotated “Cooling (Cost) Limitations”]
Architecture of Phase 1:
GPUTeraSort
GPUs for Sorting: Issues
No support for arbitrary writes
Optimized CPU algorithms do not map!
Requires new algorithms – sorting networks
Lack of support for general data types
Out-of-core algorithms
Limited GPU memory
Difficult to program
General Sorting on GPUs
Sorting networks: No data dependencies
Utilize high parallelism on GPUs
To handle large keys, use bitonic radix
sort
Perform bitonic sort on the 4 most significant bytes
(MSBs) using the GPU, find the groups of sorted records with
equal 4 MSBs, then proceed to the next 4 bytes within those groups,
and so on
Can handle any length keys
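The chunk-by-chunk refinement described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: keys are fixed-length byte strings, Python's built-in `sorted` stands in for the GPU bitonic sort of each 4-byte chunk, and the name `sort_wide_keys` is hypothetical.

```python
def sort_wide_keys(keys, chunk=4):
    """Sort fixed-length byte-string keys by refining on successive chunks."""
    keys = list(keys)

    def refine(lo, hi, offset):
        # Sort keys[lo:hi] on the chunk starting at `offset`.
        # sorted() is a stand-in for the GPU bitonic sort.
        keys[lo:hi] = sorted(keys[lo:hi],
                             key=lambda k: k[offset:offset + chunk])
        # Find each group of records whose current chunk is equal and,
        # if more key bytes remain, sort that group on the next chunk.
        start = lo
        while start < hi:
            head = keys[start][offset:offset + chunk]
            end = start + 1
            while end < hi and keys[end][offset:offset + chunk] == head:
                end += 1
            if end - start > 1 and offset + chunk < len(keys[start]):
                refine(start, end, offset + chunk)
            start = end

    if keys:
        refine(0, len(keys), 0)
    return keys
```

For random keys most groups have size one after the first pass, so the later passes touch few records; heavily duplicated prefixes require more passes.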
GPU-Based Sorting Networks
Represent data as 2D arrays
Multi-stage algorithm
Each stage involves multiple steps
In each step
1. Compare one array element against exactly one
other element at fixed distance
2. Perform a conditional assignment (MIN or MAX) at
each element location
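A bitonic sorting network follows exactly this stage/step structure. The sketch below is a sequential Python version for illustration, assuming a power-of-two input length; the function name `bitonic_sort` is hypothetical. In each step every output element is computed independently from a read-only copy of the input, which mirrors a GPU pass that only gathers (reads) and never scatters (writes to arbitrary locations).

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")

    stage = 2
    while stage <= n:              # each stage merges bitonic blocks of this size
        step = stage // 2
        while step >= 1:           # each step compares elements `step` apart
            src = a[:]             # read-only input for this step
            for i in range(n):
                partner = i ^ step                 # exactly one partner at fixed distance
                ascending = (i & stage) == 0       # direction of this block
                keep_min = (i < partner) == ascending
                if keep_min:
                    a[i] = min(src[i], src[partner])   # conditional assignment: MIN
                else:
                    a[i] = max(src[i], src[partner])   # conditional assignment: MAX
            step //= 2
        stage *= 2
    return a
```

The comparison pattern depends only on the indices, never on the data, so all n assignments in a step can run in parallel.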
2D Memory Addressing
GPUs optimized for 2D representations
Map 1D arrays to 2D arrays
Minimum and maximum regions mapped to row-aligned or column-aligned quads
1D – 2D Mapping
[Figure: MIN and MAX regions in the 2D array]
Effectively reduce instructions
per element
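One way to see why the comparison regions line up with rows or columns is a row-major mapping onto a 2D array whose width is a power of two. The sketch below assumes that layout; the helper names `to_2d`, `to_1d` and `partner_2d` are hypothetical and the slides do not specify the exact mapping used.

```python
def to_2d(i, width):
    """Row-major mapping of 1D index i to (x, y)."""
    return i % width, i // width


def to_1d(x, y, width):
    """Inverse of to_2d."""
    return y * width + x


def partner_2d(x, y, width, step):
    """2D location of the element compared with (x, y) at distance `step`.

    Assumes width and step are powers of two, as in a bitonic network.
    """
    return to_2d(to_1d(x, y, width) ^ step, width)
```

With this layout, a step smaller than the width pairs elements in the same row (only x changes), and a step of at least the width pairs elements in the same column (only y changes), so whole rectangular regions share the same MIN or MAX operation.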
Sorting on GPU: Pipelining and
Parallelism
Input Vertices
Texturing, Caching
and 2D Quad
Comparisons
Sequential Writes
Comparison with GPU-Based
Algorithms
3-6x faster than
prior GPU-based
algorithms!
GPU vs. High-End Multi-Core
CPUs
2-2.5x faster than
Intel high-end
processors
Single GPU
performance
comparable to
high-end dual core
Athlon
Hand-optimized CPU code from Intel Corporation!
Slashdot News and Tom's Hardware News Headlines
Super-Moore’s Law Growth
50 GB/s on a
single GPU
Peak Performance:
Effectively hide
memory latency
with 15 GOP/s
Download URL: http://gamma.cs.unc.edu/GPUSORT
Implementation & Results
Pentium IV PC ($170)
NVIDIA 7800 GT ($270)
2 GB RAM ($152)
9 80GB SATA disks ($477)
SuperMicro Motherboard & SATA Controller
($325)
Windows XP
PC costs $1469
Implementation & Results
Indy SortBenchmark
10 byte random string keys
100 byte long records
Sort maximum amount in 644 seconds
Overall Performance
Faster and more scalable than Dual Xeon processors (3.6 GHz)!
Performance/$
1.8x faster than
current Terabyte
sorter
World’s best
price-to-performance
system
http://research.microsoft.com/barc/SortBenchmark
Analysis: I/O Performance
Salzberg Analysis:
100 MB Run Size
Peak sequential throughput in MB/s
Analysis: I/O Performance
Salzberg Analysis:
100 MB Run Size
Pentium IV: 25MB run size (to reduce memory latency)
Less work and only 75% IO efficient!
Analysis: I/O Performance
Salzberg Analysis:
100 MB Run Size
Dual 3.6 GHz Xeons: 25MB run size (to reduce memory latency)
More cores, less work but only 85% IO efficient!
Analysis: I/O Performance
Salzberg Analysis:
100 MB Run Size
7800 GT: 100MB run size
Ideal work, and 92% IO efficient with single CPU!
Task Parallelism
Performance limited by IO and memory
[Chart series: Reorder or Sequential IO; Sorting 100MB on GPU]
Sorting 100MB on GPU: 3x > reorder or sequential IO
Why GPU-like Architectures for
Large Data Management?
[Chart labels: GPU; Plateau: Data Management Performance Crisis]
Advantages
Exploit high memory bandwidth on
GPUs
Higher memory performance than CPU-based
algorithms
High I/O performance due to large
run sizes
Advantages
Offload work from CPUs
CPU cycles well-utilized for resource management
Scalable solution for large databases
Best performance/price solution for
terabyte sorting
Limitations
May not work well on variable-sized
keys and almost sorted databases
Requires programmable GPUs (GPUs
manufactured after 2003)
Conclusions
Designed new sorting algorithms on
GPUs
Handles wide keys and long records
Achieves 10x higher memory
performance
Memory-efficient sorting algorithm with peak
memory performance of 50 GB/s on GPUs
15 GOP/sec on a single GPU
Conclusions
Novel external memory sorting
algorithm as a scalable solution
Achieves peak I/O performance on CPUs
Best performance/price solution – world’s fastest
sorting system
High performance growth rate
characteristics
Improve 2-3 times/yr
Future Work
Designed high performance/price solutions
High wattage and cooling requirements of CPUs and GPUs
To exploit GPUs, we need easy-to-use
programming APIs
Promising directions: BrookGPU, Microsoft Accelerator, Sh,
etc.
Scientific libraries utilizing high parallelism
and memory bandwidth
Scientific routines on LU, QR, SVD, FFT, etc.
BLAS library on GPUs
Eventually, build GPU-LAPACK and Matlab routines
N. Govindaraju, S. Larsen, J. Gray and D. Manocha,
Proc. of ACM SuperComputing, 2006 (to appear)
GPUFFTW
4x faster than IMKL on high-end Quad cores
Slashdot Headlines, May 2006
Download URL: http://gamma.cs.unc.edu/GPUFFTW
GPU Roadmap
GPUs are becoming more general
purpose
Fewer limitations in Microsoft DirectX10 API
• Better and consistent floating point support,
• Integer instruction support,
• More programmable stages, etc.
Significant advance in performance
GPUs are being widely adopted in
commercial applications
E.g., Microsoft Vista
Call to Action
Don’t put all your eggs
in the Multi-core basket
If you want TeraOps
– go where they are
If you want memory bandwidth
– go where the memory bandwidth is.
CPU-GPU gap is widening
Microsoft Xbox is ½ TeraOP today.
40 gops, 40 gBps
Acknowledgements
Research Sponsors:
Army Research Office
Defense Advanced Research Projects Agency
National Science Foundation
Naval Research Laboratory
Intel Corporation
Microsoft Corporation
Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou
NVIDIA Corporation
RDECOM
Acknowledgements
David Tuft (UNC)
UNC Systems, GAMMA and Walkthrough
groups
Thank You
Questions or Comments?
{naga,ritesh,dm}@cs.unc.edu
[email protected]
http://www.cs.unc.edu/~naga
http://research.microsoft.com/~Gray