GP2: General Purpose Computation
using Graphics Processors
Dinesh Manocha & Avneesh Sud
Lecture 2: January 17, 2007
http://gamma.cs.unc.edu/GPGP
Spring 2007
Department of Computer Science
UNC Chapel Hill
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Class Schedule
♦ Current Time Slot: 2:00 – 3:15pm, Mon/Wed, SN011
♦ Office hours: TBD
♦ Class mailing list: [email protected] (should be up and running)
GPGP
• The GPU on commodity video cards has evolved
into an extremely flexible and powerful processor
♦ Programmability
♦ Precision
♦ Power
• This course will address how to harness that power for general-purpose computation (non-rasterization)
♦ Algorithmic issues
♦ Programming and systems
♦ Applications
Capabilities of Current GPUs
• Modern GPUs are deeply programmable
♦ Programmable pixel, vertex, video engines
♦ Solidifying high-level language support
• Modern GPUs support 32-bit floating-point precision
♦ Great development in the last few years
♦ 64-bit arithmetic may be coming soon
♦ Almost IEEE FP compliant
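What "almost IEEE FP compliant" 32-bit precision means can be demonstrated on the CPU with a quick round-trip through IEEE single precision (a stdlib-only sketch; GPUs of this era deviate from IEEE 754 mainly in rounding and denormal handling, but the 24-bit significand limit below applies to any fp32 unit):

```python
import struct

def to_fp32(x):
    # Round-trip a Python double through a 32-bit IEEE single.
    return struct.unpack('f', struct.pack('f', x))[0]

# fp32 has a 24-bit significand, so integers above 2**24
# are no longer exactly representable:
print(to_fp32(16777216.0))  # 16777216.0 (2**24, exact)
print(to_fp32(16777217.0))  # 16777216.0 (2**24 + 1 rounds back down)
```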
The Potential of GPGP
• The power and flexibility of GPUs makes them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation and geometric applications to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
Check out http://www.gpgpu.org
GPGP: Challenges
• GPUs designed for and driven by video games
♦ Programming model is unusual & tied to computer graphics
♦ Programming environment is tightly constrained
• Underlying architectures are:
♦ Inherently parallel
♦ Rapidly evolving (even in basic feature set!)
♦ Largely secret
♦ No clear standards (besides DirectX, imposed by MSFT)
• Can’t simply “port” code written for the CPU!
• Is there a formal class of problems that can be solved using current GPUs?
Importance of Data Parallelism
• GPUs are designed for the graphics and gaming industry
♦ Highly parallel tasks
• GPUs process independent vertices & fragments
♦ Temporary registers are zeroed
♦ No shared or static data
♦ No read-modify-write buffers
• Data-parallel processing
♦ GPU architecture is ALU-heavy
• Multiple vertex & pixel pipelines, multiple ALUs per pipe
♦ Hide memory latency (with more computation)
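As a Python analogy (not GPU code): a fragment "kernel" is a pure function of its own inputs with no shared state, which is exactly what lets the hardware run many instances in parallel. The thread pool below is a hypothetical stand-in for the pixel pipelines:

```python
from multiprocessing.dummy import Pool  # thread pool stands in for the pixel pipelines

def kernel(fragment):
    # Pure function of its own input: no shared or static data,
    # no read-modify-write of a common buffer.
    r, g, b = fragment
    return (min(r * 2, 255), min(g * 2, 255), min(b * 2, 255))

fragments = [(10, 20, 30), (200, 100, 50)]
with Pool(4) as pool:
    out = pool.map(kernel, fragments)   # every element processed independently
print(out)  # [(20, 40, 60), (255, 200, 100)]
```

Because no kernel invocation can observe another's temporaries, the runtime is free to schedule them in any order, on any number of pipes.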
Goals of this Course
• A detailed introduction to general-purpose
computing on graphics hardware
• Emphasis includes:
♦ Core computational building blocks
♦ Strategies and tools for programming GPUs
♦ Cover many applications and explore new
applications
♦ Highlight major research issues
Course Organization
• Survey lectures
♦ Instructors, other faculty, senior graduate
students
♦ Breadth and depth coverage
♦ Student presentations
Course Contents
♦ Overview of GPUs: architecture and features
♦ Models of computation for GPU-based algorithms
♦ System issues: Cache and data management; Languages and
compilers
♦ Numerical and Scientific Computations: Linear algebra computations; optimization; FFT; rigid body simulation; fluid dynamics
♦ Geometric computations: Proximity computations; distance
fields; motion planning and navigation
♦ Database computations: database queries: predicates,
booleans, aggregates; streaming databases and data mining;
sorting & searching
♦ GPU Clusters: Parallel computing environments for GPUs
♦ Rendering: Ray-tracing, photon mapping; Shadows
Student Load
♦ Stay awake in classes!
♦ One class lecture
♦ Read a lot of papers
♦ 1-2 small assignments
♦ A MAJOR COURSE PROJECT WITH RESEARCH COMPONENT
Course Projects
♦ Work by yourself or part of a small team
♦ Develop new algorithms for simulation,
geometric problems, database computations
♦ Formal model for GPU algorithms or GPU
hacking
♦ Issues in developing GPU clusters for scientific
computation
♦ Look into new architecture and parallel
programming trends
Course Projects: Importance
♦ If you are planning to take this course for credit, start thinking about the course project ASAP
♦ It is important that your project has some novelty to it:
• Shouldn’t be just a GPU hack
• You need to work on a problem or application for which GPUs are a good candidate
– For example, GPUs are not a good solution for many problems
• It is OK to work in groups of 2 or 3 (for a large project)
• Periodic milestones to monitor the progress
– Project proposals due by February 10
– Monthly progress reports (will count towards the final grade)
Course Projects: Possible Topics
• We are also interested in comparing GPU capabilities with other emerging architectures (e.g. Cell, multi-core, other data parallel processors)
• Numerical computations: some of the prime candidates for GPU acceleration
– Sparse matrix computations
– Numerical linear algebra (SVD, QR computations)
– Applications (like WWW search)
• Power efficiency of GPU algorithms
• Programming environments of GPUs (talk to Jan Prins)
• GPU clusters and high performance computing using GPUs
– Scientific computations (possible collaboration with RENCI)
– Data mining algorithms (talk to Wei Wang or Jan Prins)
– Physically-based simulation, e.g. fluid simulation (talk to Ming Lin)
• Others …
Course Topics & Lectures
• Focus on Breadth
• Quite a few guest and student lectures
– Overview of OpenGL and GPU Programming (Wendt on Jan. 22)
– Cell processor (Stephen Olivier on Jan. 24)
– NVIDIA G80 Architecture (Steve Molnar, Jan. 29)
– CUDA Programming Environment (Lars Nyland, Jan. 31)
– Lectures on CTM (ATI)
Heterogeneous Computing Systems & GPUs
What are Heterogeneous Computing Systems?
Develop computer systems and applications that are scalable from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors
What are Heterogeneous Computing Systems?
Heterogeneous computing systems are those with a range of
diverse computing resources that can be local to one another
or geographically distributed. The pervasive use of networks
and the internet by all segments of modern society means that
the number of connected computing resources is growing
tremendously.
From “International Workshop on Heterogeneous Computing”, early 1990s
Computing using Accelerators
• GPU is one type of accelerator (commodity and easily available)
• Other accelerators:
– Cell processor
– ClearSpeed
Organization
• Current architectures
• Use of Accelerators
• Programming environments for accelerators
Current Architectures
• Multi-core architectures
• Processors lowering communication costs
• Heterogeneous processors
Multi-Core Architectures
http://gamma.cs.unc.edu/EDGE/SLIDES/agarwal.pdf
What is a multicore processor?
Three properties (Agarwal’06):
♦ Single chip
♦ Multiple distinct processing engines
♦ Multiple, independent threads of control (or program counters – MIMD)
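The third property (independent program counters) is what distinguishes MIMD from SIMD. A minimal stdlib sketch of two distinct instruction streams running concurrently, for illustration only:

```python
import threading

results = {}

def count_up():        # one instruction stream
    results['up'] = list(range(3))

def count_down():      # a different instruction stream on the same chip
    results['down'] = list(range(3, 0, -1))

# Two threads of control, each following its own program counter (MIMD).
threads = [threading.Thread(target=count_up),
           threading.Thread(target=count_down)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # both streams ran to completion, independently
```

A SIMD machine, by contrast, would step every lane through the *same* instruction stream on different data.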
Multi-Core: Motivation
[Figure-only slide]
Multi-Core: Growth Rate
[Figure-only slide]
Sun’s Niagara Chip: Chip Multi-Threaded Processor
http://gamma.cs.unc.edu/EDGE/SLIDES/shoaib.pdf
Efficient Processors
Reduce communication costs [Dally’03]
• PCA architectures: http://www.darpa.mil/ipto/Programs/pca/index.htm
• GPUs
• Streaming processors: http://cva.stanford.edu/publications/2004/spqueue.pdf
• Other data parallel processors (PPUs, ClearSpeed)
• FPGAs
Current Architectures
• Multi-core architectures
• Processors lowering communication costs
• Heterogeneous processors
– Combining different types of processors on one chip
Heterogeneous Processors
• Cell BE Processor
• AMD Fusion Architecture
Cell BE Processor Overview
♦ IBM, SCEI/Sony, Toshiba Alliance formed in 2000
♦ Design Center opened in March 2001
♦ Based in Austin, Texas
♦ ~$400M investment
♦ February 7, 2005: first technical disclosures
♦ Designed for Sony PlayStation 3
– Commodity processor
♦ Cell is an extension to the IBM Power family of processors
♦ Sets new performance standards for computation & bandwidth
♦ High affinity to HPC workloads
– Seismic processing, FFT, BLAS, etc.
Cell BE Processor Features
♦ Heterogeneous multi-core system architecture
– Power Processor Element (PPE) for control tasks: a 64-bit Power Architecture core with VMX
– Synergistic Processor Elements (SPEs) for data-intensive processing
♦ Each SPE consists of
– a Synergistic Processor Unit (SPU) with its own local store (LS)
– Synergistic Memory Flow Control (SMF): data movement and synchronization; interface to the high-performance Element Interconnect Bus
[Block diagram: 8 SPEs and the PPE (PXU, L1, L2) attach to the Element Interconnect Bus (EIB, up to 96B/cycle) at 16B/cycle each; the MIC connects to dual XDR™ memory and the BIC to FlexIO™]
Cell BE Architecture
♦ Combines multiple high performance processors in one chip
♦ 9 cores, 10 threads
– A 64-bit Power Architecture™ core (PPE)
– 8 Synergistic Processor Elements (SPEs) for data-intensive processing
♦ Current implementation: roughly 10 times the performance of a Pentium for computationally intensive tasks
♦ Clock: 3.2 GHz (measured at >4 GHz in lab)
                      Cell          Pentium D
Peak I/O BW           75 GB/s       ~6.4 GB/s
Peak SP Performance   ~230 GFLOPS   ~30 GFLOPS
Area                  221 mm²       206 mm²
Total Transistors     234M          ~230M
Peak GFLOPs (Cell SPEs only)
[Bar chart, single vs. double precision: FreeScale DC 1.5 GHz, PPC 970 2.2 GHz, AMD DC 2.2 GHz, Intel SC 3.6 GHz, Cell 3.0 GHz; Cell’s single-precision peak is far higher than the others]
Cell BE Processor Can Support Many Systems
♦ Game console systems
♦ Blades
♦ HDTV
♦ Home media servers
♦ HPC
♦ …
[Diagram: configurations ranging from one Cell BE processor with XDR™ memory and IOIF, to two processors coupled over the BIF, to four processors connected through a switch]
AMD’s Fusion Architecture
[Four figure-only slides]
Organization
• Current architectures
• Use of Accelerators
– Single workstation (real-world) applications
– High performance computing
• Programming environments for accelerators
Non-Graphics Pipeline Abstraction (GPGPU)
(Courtesy: David Kirk, NVIDIA)
[Diagram: data and lists flow through programmable MIMD processing (fp32) for setup, SIMD “rasterization”, programmable SIMD processing (fp32) with data fetch and fp16 blending, then predicated write with fp16 blend and multiple outputs, into memory]
Sorting and Searching
“I believe that virtually every important
aspect of programming arises somewhere
in the context of sorting or searching!”
-Don Knuth
Massive Databases
♦ Terabyte data sets are common
– Google sorts more than 100 billion terms in its index
– > 1 trillion records in web indexed (unconfirmed sources)
♦ Database sizes are rapidly increasing!
– Max DB size increases 3x per year (http://www.wintercorp.com)
♦ Processor improvements not matching information explosion
General Sorting on GPUs
♦ Design sorting algorithms with deterministic memory accesses – “texturing” on GPUs
– 86 GB/s peak memory bandwidth (NVIDIA 8800)
– Can better hide the memory latency!
♦ Require only minimum and maximum computations – “blending functionality” on GPUs
– Low branching overhead
♦ No data dependencies
– Utilize high parallelism on GPUs
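These properties (data-independent access pattern, only min/max compare-exchanges, no data dependencies within a pass) are exactly those of a sorting network. A plain-Python bitonic sort, shown only to illustrate the structure such GPU sorters map onto texturing and blending (not the authors' actual algorithm):

```python
def bitonic_sort(a):
    """In-place bitonic sorting network for a power-of-two length list.
    Every pass performs the same compare-exchanges regardless of the
    data, so memory accesses are deterministic, and each exchange is
    just a min and a max."""
    n = len(a)
    k = 2
    while k <= n:                       # size of bitonic sequences
        j = k // 2
        while j > 0:                    # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    lo, hi = min(a[i], a[partner]), max(a[i], a[partner])
                    a[i], a[partner] = (lo, hi) if ascending else (hi, lo)
            j //= 2
        k *= 2
    return a
```

Because the (i, partner) pairs depend only on the indices, a GPU can fetch both operands with predictable texture reads and write min/max results with blending, with no branching on the data.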
Sorting on GPU: Pipelining and Parallelism
[Pipeline: input vertices → texturing, caching and 2D quad comparisons → sequential writes]
Comparison with Prior GPU-Based Algorithms
3-6x faster than prior GPU-based algorithms!
Sorting: GPU vs. Multi-Core CPUs
♦ 2-2.5x faster than Intel high-end processors
♦ Single GPU performance comparable to a high-end dual-core Athlon
Hand-optimized CPU code from Intel Corporation!
N. Govindaraju, J. Gray, R. Kumar and D. Manocha,
Proc. of ACM SIGMOD 2006
External Memory Sorting
♦ Performed on terabyte-scale databases
♦ Two-phase algorithm (limited main memory)
– First phase: partitions the input file into large data chunks and writes sorted chunks known as “runs”
– Second phase: merges the “runs” to generate the sorted file
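The two phases can be sketched in a few lines of stdlib Python (run generation, then a k-way merge). This is a simplified, hypothetical illustration of the classic scheme, not the GPUTeraSort implementation:

```python
import heapq
import os
import tempfile

def external_sort(records, run_size):
    # Phase 1: cut the input into chunks that fit in "memory",
    # sort each chunk, and write it out as a sorted "run".
    runs = []
    for i in range(0, len(records), run_size):
        run = sorted(records[i:i + run_size])
        f = tempfile.NamedTemporaryFile('w+', delete=False)
        f.write('\n'.join(run))
        f.seek(0)
        runs.append(f)
    # Phase 2: k-way merge of the sorted runs (heapq.merge streams
    # them, never holding more than one record per run in memory).
    streams = [(line.rstrip('\n') for line in f) for f in runs]
    merged = list(heapq.merge(*streams))
    for f in runs:
        f.close()
        os.unlink(f.name)
    return merged
```

In the real setting the runs live on disk arrays and `run_size` is chosen to fill main memory; the GPU work described next accelerates the in-memory sort inside Phase 1.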
External Memory Sorting using GPUs
♦ External memory sorting on CPUs can have low performance due to
– High memory latency
– Low I/O performance
♦ Our GPU-based algorithm
– Sorts large data arrays on GPUs
– Performs I/O operations in parallel on CPUs
GPUTeraSort
[Figure: Govindaraju et al., SIGMOD 2006]
Overall Performance
Faster and more scalable than Dual Xeon processors (3.6 GHz)!
Performance/$
♦ 1.8x faster than current terabyte sorter
♦ World’s best performance/$ system
GPUTeraSort: PennySort Winner 2006
“These results paint a clear picture for progress on processor speeds. When you measure records-sorted-per-cpu-second, the speed plateaued in 1995 at about 200k records/second/cpu. This year saw a breakthrough with GpuTeraSort which uses the GPU interface to drive the memory more efficiently (and uses the 10x more memory bandwidth inside the GPU). GpuTeraSort gave a 3x records/second/cpu improvement. There is a lot of effort on multi-core processors, and comparatively little effort on addressing the “core” problems: (1) the memory architecture, and (2) the way processors access memory. Sort demonstrates those problems very clearly.” By Jim Gray (Microsoft) [NY Times, November 2006]
N. Govindaraju, S. Larsen, J. Gray and D. Manocha,
SuperComputing 2006
GPUFFTW (1D & 2D FFT)
♦ 4x faster than IMKL on high-end quad cores
♦ SlashDot headlines, May 2006
Download URL: http://gamma.cs.unc.edu/GPUFFTW
Digital Breast Tomosynthesis (DBT)
♦ Pioneering DBT work at Massachusetts General Hospital
♦ 100X reconstruction speed-up with NVIDIA Quadro FX 4500 GPU
– From hours to minutes; facilitates clinical use
♦ Improved diagnostic value: clearer images, fewer obstructions, earlier detection
♦ Advanced Imaging Solution of the Year
♦ 11 low-dose X-ray projections; extremely computationally intense reconstruction
[Figure labels: X-ray tube, axis of rotation, compression paddle, compressed breast, digital detector]
“Mercury reduced reconstruction time from 5 hours to 5 minutes, making DBT clinically viable. …among 70 women diagnosed with breast cancer, DBT pinpointed 7 cases not seen with mammography”
© 2006 Mercury Computer Systems, Inc.
Electromagnetic Simulation
♦ 3D finite-difference and finite-element modeling of:
– Cell phone irradiation
– MRI design / modeling
– Printed circuit boards
– Radar cross section (military)
♦ Computationally intensive!
♦ Large speedups with Quadro GPUs
[Bar chart: performance (Mcells/s) on a pacemaker-with-transmit-antenna model vs. number of Quadro FX 4500 GPUs: 1X on a single 3.x GHz CPU (commercial, optimized, mature software), rising to 5X, 10X, and 18X with 1, 2, and 4 GPUs]
Copyright © NVIDIA Corporation 2004
Havok FX Physics on NVIDIA GPUs
♦ Physics-based effects on a massive scale
– 10,000s of objects at high frame rates
– Rigid bodies, particles, fluids, cloth, and more
Copyright © NVIDIA Corporation 2004
Dedicated Performance for Physics
[Performance measurement, 15,000-boulder scene (Dual Core P4EE 955 at 3.46 GHz, GeForce 7900 GTX SLI, CPU multi-threading enabled): CPU physics 6.2 fps vs. GPU physics 64.5 fps]
Copyright © NVIDIA Corporation 2004
GPUs: High Memory Throughput
♦ 50 GB/s on a single GPU (NVIDIA 7900)
♦ Peak performance: effectively hide memory latency with 15 GOP/s
Microsoft Vista & GPUs
Windows Vista is the first Windows operating system that directly utilizes the power of a dedicated GPU. High-end GPUs are essential for accelerating the Windows Vista experience by offering an enriched 3D user interface, increased productivity, vibrant photos, smooth high-definition videos, and realistic games.
GPUs as Accelerators
♦ GPUs are primarily designed for rasterization
♦ GPUs are programmed using graphics APIs
♦ Specialized algorithms for different applications to demonstrate higher performance
♦ In spite of these limitations, good speedups were demonstrated
♦ What if we have the right API and programming environment for GPUs?
Accelerators for HPC
♦ Recent trend is to use accelerators to achieve 100-1000 TFLOP performance
– RoadRunner (LANL): plans to use 16,000 Cell processors (expected PetaFlop performance)
– Tsubame cluster (Tokyo): 360 ClearSpeed accelerators (47 TFLOP performance)
Thread parallelism is upon us (Smith’06)
♦ Uniprocessor performance is leveling off
– Instruction-level parallelism is nearing its limit
– Power per chip is painfully high for client systems
♦ Meanwhile, logic cost ($ per gate-Hz) continues to fall
– What are we going to do with all that hardware?
♦ Newer microprocessors are multi-core and/or multithreaded
– So far, it’s just “more of the same” architecturally
– Now we also have heterogeneous processors
Thread parallelism
♦ We expect new “killer apps” will need more performance
– Semantic analysis and query
– Improved human-computer interfaces (e.g. speech, vision)
– Games
♦ Which and how much thread parallelism can we exploit?
– This is a good question for both hardware and software
Programming the Accelerators
• Data parallel processors
• Improved APIs and interfaces
Possible Approaches
♦ Extend existing high-level languages with new data-parallel array types
– Ease of programming
♦ Implement as a library so programmers can use it now
– Eventually fold into base languages
♦ Build implementations with compelling performance
– Target GPUs and multi-core CPUs
♦ Create examples and applications
– Educate programmers, provide sample code
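The library-first approach above can be sketched with a toy data-parallel array type (hypothetical names, stdlib only; a real implementation would dispatch the elementwise operations to a GPU or multi-core backend rather than a Python loop):

```python
class ParArray:
    """Toy data-parallel array: whole-array elementwise operations
    expressed through operator overloading. A backend could batch
    these expressions and run them on a GPU or multi-core CPU."""

    def __init__(self, data):
        self.data = list(data)

    def _elementwise(self, other, op):
        # One op applied independently to every element pair:
        # no loop, branch, or index appears in user code.
        return ParArray(op(a, b) for a, b in zip(self.data, other.data))

    def __add__(self, other):
        return self._elementwise(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._elementwise(other, lambda a, b: a * b)

# Usage: ordinary-looking expressions over whole arrays.
x = ParArray([1, 2, 3])
y = ParArray([4, 5, 6])
z = x * y + x           # elementwise: [1*4+1, 2*5+2, 3*6+3]
print(z.data)           # [5, 12, 21]
```

Shipping such a type as a library lets programmers adopt the model immediately, which is why the slide suggests folding it into the base language only later.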
Challenges in using GPUs
♦ Need a non-graphics interface
– For more flexibility
– Less execution overhead
♦ Need native GPU support
– Replace library with language built-ins
♦ Need to learn from users
♦ Retarget for multi-core
Research Issues
♦ Languages for mainstream parallel computing
♦ Compilation techniques for parallel programs
♦ Debugging and performance tuning of parallel programs
♦ Operating systems for parallel computing at all scales
♦ Computer architecture for mainstream parallel computing