Many-core Computing: A Disruptive Technology Enabling Low-cost, Low-power Desktop Supercomputing David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA Systor Keynote May 24, 2010
Many-core Computing: A Disruptive
Technology Enabling Low-cost, Low-power
Desktop Supercomputing
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
Systor Keynote May 24, 2010
Current trends in Many-core
Computing
The CPU industry has elected to jump off the
cycle-time scaling bandwagon
Power/thermal constraints have become a limiting
factor
We now see CPU vendors placing multiple (10’s
of) cores on a single chip
Clock speeds have not changed
The memory wall persists and multiple cores that
assume a shared-memory model place further
pressure on this problem
Software vendors are looking for new
parallelization technology
Multi-core aware operating systems
Semi-automatic parallelizing compilers
Current trends in Many-core
Computing
There has been a renewed interest in parallel
computing paradigms and languages
Existing many-core architectures are being
considered for general-purpose platforms (e.g., Cell,
GPUs, DSPs)
Heterogeneous systems are becoming a common
theme
Are we returning to the days of the x87 co-processor?
How can we combine multi-core and many-core
systems into a single design?
Graphics Processors
Graphics Processing Units
More than 64% of Americans played a video game in 2009
High-end - primarily used for 3-D rendering for videogame
graphics and movie animation
Mid/low-end – primarily used for computer displays
Manufacturers include NVIDIA, AMD/ATI, IBM-Cell
Very competitive commodities market
GPU Performance
GPUs provide a path for performance growth
Cost and power usage numbers are also impressive
Near-exponential growth in performance for GPUs!
Source: NVIDIA 2009
Comparison of CPU and GPU
Hardware Architectures
CPU: Cache heavy,
focused on individual
thread performance
GPU: ALU heavy,
massively parallel,
throughput-oriented
CPU/GPU Relationship
CPU
(host)
GPU w/
local DRAM
(device)
A wide range of GPU apps
3D image analysis
Adaptive radiation therapy
Acoustics
Astronomy
Audio
Automobile vision
Bioinformatics
Biological simulation
Broadcast
Cellular automata
Fluid dynamics
Computer vision
Cryptography
CT reconstruction
Data mining
Digital cinema / projections
Electromagnetic simulation
Equity trading
Film
Protein folding
Financial
Quantum chemistry
Languages
Ray tracing
GIS
Radar
Holographic cinema
Reservoir simulation
Machine learning
Robotic vision / AI
Mathematics research
Robotic surgery
Military
Satellite data analysis
Mine planning
Seismic imaging
Molecular dynamics
Surgery simulation
MRI reconstruction
Surveillance
Multispectral imaging
Ultrasound
N-body simulation
Video conferencing
Network processing
Telescope
Neural network
Oceanographic research
Video
Visualization
Optical inspection
Wireless
Particle physics
X-Ray
GPU as a General Purpose
Computing Platform
Speedups are impressive and ever increasing!
Genetic Algorithm: 2600 X
Real Time Elimination of Undersampling Artifacts: 2300 X
Lattice-Boltzmann Method for Numerical Fluid Mechanics: 1840 X
Total Variation Modeling: 1000 X
Fast Total Variation for Computer Vision: 1000 X
Monte Carlo Simulation of Photon Migration: 1000 X
Stochastic Differential Equations: 675 X
K-Nearest Neighbor Search: 470 X
Source: CUDA Zone at www.nvidia.com/cuda/
GPGPU is becoming
mainstream research
Research activities are expanding significantly
Search result for keyword “GPGPU” in IEEE and ACM
NVIDIA GT200 architecture
• 240 shader cores
• 1.4B transistors
• Up to 2GB onboard memory
• ~150GB/sec BW
• 1.06 SP TFLOPS
• CUDA and OpenCL support
• Programmable memory spaces
• Tesla S1070 provides 4 GPUs in a 1U unit
[Figure: the Streaming Processor Array of ten Texture Processor Clusters (TPCs) runs the grid of thread blocks; each TPC holds a Texture Unit and Streaming Multiprocessors (SMs), which run multiple thread blocks and many warps of threads; each SM holds Streaming Processors (SPs) and SFUs, which run individual threads]
AMD/ATI Radeon HD 5870
• Codename “Evergreen”
• 1600 SIMD cores
• L1/L2 memory
architecture
• 153GB/sec memory
bandwidth
• 2.72 TFLOPS SP
• OpenCL and DirectX11
• Hidden memory
microarchitecture
• Provides for vectorized
operation
Comparison of CPU and GPU
Hardware Architectures
CPU/GPU         Single-precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
NVIDIA 285      1.06                      240     5.8           $3.12
NVIDIA 295      1.79                      480     6.2           $3.80
AMD HD 5870     2.72                      1600    14.5          $0.16
AMD HD 4890     1.36                      800     7.2           $0.18
Intel I-7 965   0.051                     4       0.39          $11.02
Source: NVIDIA, AMD and Intel
How to exploit GPUs
Identify hot kernels
Eliminate kernels that have inherent data sharing (e.g., loop
carried dependence)
Identify the right grain of parallelism (i.e., the loop nest) to
expose parallelism
Generate thousands of independent threads…
Identify appropriate GPU memory subsystems for storage
of data used by kernel…
Consider additional performance optimizations
Warp occupancy - NVIDIA
Vectorization – AMD
Register usage - both
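The first two steps can be illustrated in plain C. This is only a sketch: the function names `scale_independent` and `prefix_sum_dependent` are hypothetical and do not come from the talk. The first loop has independent iterations and is a candidate for one-thread-per-element mapping; the second, as written, carries a dependence from one iteration to the next.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so each
 * iteration could be handed to its own GPU thread. */
void scale_independent(const float *in, float *out, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Loop-carried dependence: iteration i reads the value written by
 * iteration i-1, so the iterations cannot simply run as independent
 * threads without restructuring the algorithm. */
void prefix_sum_dependent(const float *in, float *out, size_t n) {
    if (n == 0) return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}
```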
AMD vs. NVIDIA
                        AMD                               NVIDIA
Hardware architecture   Vector                            Scalar
Programming language    Brook+, IL, OpenCL                CUDA, OpenCL
Programming model       SIMD vector                       SIMT
Thread hierarchy        Single level                      Two level
Memory exposure         Uniform space                     Multiple space
Source of horsepower    Vectorization and                 Memory spaces utilization
                        multiple output                   including shared memory
Pros                    Easier programming                More flexible programming
Challenges              Harnessing the potential horsepower (both)
Vector Addition Example (CPU)
#include <stdlib.h>

/* Computational kernel */
void vecAdd(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 4096;
    /* Allocate memory */
    float *A = (float *)malloc(sizeof(float)*N);
    float *B = (float *)malloc(sizeof(float)*N);
    float *C = (float *)malloc(sizeof(float)*N);
    /* Initialize memory (init() is not shown on the slide) */
    init(A); init(B);
    /* Call kernel */
    vecAdd(A, B, C, N);
    /* Deallocate memory */
    free(A); free(B); free(C);
}
Vector Addition Example (GPU)
• Map multi-dimensional data structures to sequential GPU threads
/* GPU computational kernel: one thread per element */
__global__
void gpuVecAdd(float *A, float *B, float *C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    C[tid] = A[tid] + B[tid];
}
[Figure: a GRID of thread blocks, BLOCK (0,0), BLOCK (1,0), ..., each holding threads (0,0), (1,0), (2,0), ... (31,0); blockDim.x = 32]
Vector Addition Example (GPU)
int main() {
    int N = 4096;
    float *A = (float *)malloc(sizeof(float)*N);
    float *B = (float *)malloc(sizeof(float)*N);
    float *C = (float *)malloc(sizeof(float)*N);
    init(A); init(B);

    /* Allocate memory on GPU */
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, sizeof(float)*N);
    cudaMalloc(&d_B, sizeof(float)*N);
    cudaMalloc(&d_C, sizeof(float)*N);

    /* Initialize memory on GPU */
    cudaMemcpy(d_A, A, sizeof(float)*N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, sizeof(float)*N, cudaMemcpyHostToDevice);

    /* Configure threads */
    dim3 blocks(N/32,1);
    dim3 threads(32,1);

    /* Run kernel (on GPU) */
    gpuVecAdd <<< blocks,threads >>> (d_A, d_B, d_C);

    /* Copy results back to CPU */
    cudaMemcpy(C, d_C, sizeof(float)*N, cudaMemcpyDeviceToHost);

    /* Deallocate memory on GPU */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Deallocate memory on CPU */
    free(A);
    free(B);
    free(C);
}
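The index arithmetic in the kernel can be checked on a CPU. The sketch below is plain C, not CUDA: two nested loops stand in for the grid of blocks and the threads within a block, and `emulate_launch` and `vec_add_thread` are hypothetical names introduced only for this illustration. On a GPU the loop iterations would run as concurrent threads.

```c
/* Body of the vector-add kernel for one (block, thread) pair.
 * block_idx, block_dim and thread_idx stand in for CUDA's
 * blockIdx.x, blockDim.x and threadIdx.x. */
static void vec_add_thread(const float *A, const float *B, float *C,
                           int block_idx, int block_dim, int thread_idx) {
    int tid = block_idx * block_dim + thread_idx;
    C[tid] = A[tid] + B[tid];
}

/* Sequential stand-in for a launch with num_blocks blocks of
 * block_dim threads each. Every element index in
 * [0, num_blocks * block_dim) is visited exactly once. */
void emulate_launch(const float *A, const float *B, float *C,
                    int num_blocks, int block_dim) {
    for (int b = 0; b < num_blocks; b++)
        for (int t = 0; t < block_dim; t++)
            vec_add_thread(A, B, C, b, block_dim, t);
}
```

With N = 4096 and 32 threads per block, `emulate_launch(A, B, C, N / 32, 32)` covers all N elements. The kernel has no bounds check, so this mapping relies on N being a multiple of the block size.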
Experiences with migrating
applications to a GPU
3-D Cardiac CT Imaging
Iterative Least Squares Back
Projection
3-D Breast Cancer Screening
Maximum Likelihood Estimation
Intrusion Detection Systems
K-Nearest Neighbor Outlier
Detection
Physics-based Simulation
Surgical simulation
Trends in Medical Imaging
The Medical Imaging field is rapidly deploying new 3-D and
4-D imaging technologies to improve patient outcomes
This move has created an avalanche of image data
Image reconstruction and image analysis have become major bottlenecks
Accurate 3-D and 4-D image reconstruction requires compute-intensive
algorithms
The use of multi-modality imaging (e.g., CT and Ultrasound) further
exacerbates this problem
Many-core computing can play a large role in addressing
these challenges
Developing a suite of Biomedical Image
Reconstruction Libraries – CUDA/OpenCL
Target applications:
Deformable registration - radiation oncology
3-D Iterative reconstruction – cardiovascular imaging
Maximum likelihood estimation – Digital
Breast Tomosynthesis
Motion compensation in PET/CT images – cardiovascular imaging
Hyperspectral imaging – skin cancer
screening
Image segmentation – brain imaging
$1.3M NSF Award EEC-0946463
State-of-the-art in Cardiac Imaging
Currently, coronary heart disease (CHD) is the single
leading cause of death in America
Health care costs related to CHD >$150B/year
U.S. in 2006 (American Heart Association)
Approximately 1,255,000 coronary attacks
Approximately 425,425 deaths
Invasive coronary angiography is the state-of-the-art
for assessing coronary blockages
Inject dye into the bloodstream and then X-ray the heart
8% complication rate
0.2% mortality rate
3-D Cardiovascular Plaque Imaging
3D CT imaging can be used to identify vulnerable plaque
A helical scan of the body is performed
Provides for more accurate imaging of the cardio-vascular system
Produces a detailed 3-D view of the blockage
Possesses few negative side effects
Scanning geometry produces a tremendous amount of data
to process
Image reconstruction can take
days to generate a single view!!
Impacting heart disease with GPUs
A single 8800 GTX (128 cores) achieves a speedup over multi-threaded
dual-core Intel CPU execution of 20.3x forward / 17.8x backward
A series of optimizations, including the use of multiple GPUs, raises this to
71.3x forward / 137x backward
Forward projection
Backward projection
* Collaboration with Synho Do (MGH), Clem Karl (BU) and Homer Pien (MGH)
3-D Cardiovascular Plaque Imaging
3-D Spiral Cone-Beam Cardiac Image
Reconstruction
Reconstruction performance is a barrier to improving image
quality
A single NVIDIA 285 GPU card ($300) can speed up
3-D reconstruction by 700X
Processing time reduced from 10’s of hours to
seconds
Supercomputing performance in a single GPU card
Digital Breast Tomosynthesis (DBT)
A new technology developed at MGH to:
Produce a 3-D image of the breast utilizing 15 or more 2-D
projections
3-D imagery can help address the following issues related
to 2-D mammography
Increase the correct detection rate of cancers
Reduce the rate of misdiagnosed cancers – avoid unneeded
biopsies
[Figure: paired 2-D and DBT images. Cancer case: DBT increases the correct detection rate. Hamartoma case: DBT decreases the false positive rate]
Tomosynthesis Image
Reconstruction
• Utilizes a limited angle tomography approach using many 2-D images to generate a 3-D image
• Performs an iterative Maximum Likelihood Estimation for 3-D image reconstruction
• Reconstruction time is a barrier to image-guided biopsy
[Figure: X-ray source (15 views) producing X-ray projections on a detector (1196x2304); iterative loop: set 3D volume (guess), compute projections (forward), correct 3D volume (backward); 3D volume (1196x2304x45)]
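The guess / forward-project / correct loop can be sketched on a toy problem. The slides do not give the system matrix or the exact update rule, so everything below is an assumption: it uses the textbook ML-EM multiplicative update, a tiny dense matrix `A` in place of the real projection geometry, and hypothetical names (`forward_project`, `mlem_iteration`, `N_MEAS`, `N_VOX`).

```c
#define N_MEAS 3   /* number of projection measurements (toy size) */
#define N_VOX  2   /* number of volume elements (toy size) */

/* Forward projection: proj = A * vol, with A stored row-major. */
static void forward_project(const float *A, const float *vol, float *proj) {
    for (int i = 0; i < N_MEAS; i++) {
        proj[i] = 0.0f;
        for (int j = 0; j < N_VOX; j++)
            proj[i] += A[i * N_VOX + j] * vol[j];
    }
}

/* One ML-EM iteration: forward-project the current guess, compare it
 * with the measured projections, and back-project the ratio to
 * correct the volume. */
void mlem_iteration(const float *A, const float *measured, float *vol) {
    float proj[N_MEAS];
    forward_project(A, vol, proj);
    for (int j = 0; j < N_VOX; j++) {
        float back = 0.0f;   /* back-projected measured/estimated ratio */
        float sens = 0.0f;   /* sensitivity: column sum of A */
        for (int i = 0; i < N_MEAS; i++) {
            if (proj[i] > 0.0f)
                back += A[i * N_VOX + j] * measured[i] / proj[i];
            sens += A[i * N_VOX + j];
        }
        if (sens > 0.0f)
            vol[j] *= back / sens;
    }
}
```

Calling `mlem_iteration` repeatedly from a positive initial guess drives the forward projection of `vol` toward the measurements. In the real application both projection steps run over a 1196x2304x45 volume, which is why they are the natural targets for GPU kernels.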
Reconstruction Computing Performance
What can a GPU buy you?
VGUARD - Intrusion Detection on
Virtualized Servers
Targeted platforms
  Server application appliances
  Embedded systems
Two-tiered architecture
Front-end subsystem
  Profiling integrated in a virtualization layer
  Monitors run-time environment below the OS
Back-end subsystem
  Utilizes machine learning and pattern classification algorithms
  Identifies potential malicious behavior through modeling normal behavior
[Figure: Apps running on Guest OS1 (Linux/Win) and Guest OS2 (Linux/Win), above the VMM layer (execution profiling), above the hardware; protection embedded in the virtualization layer]
Intrusion Detection on
Virtualized Servers
• VGUARD utilizes virtualization to obtain feature-rich
execution profiles
[Figure: test bed with MySQL, Exchange, SpamAssassin and Apache workloads on Windows, Ubuntu and RedHat guests, running on VMWare ESX + VGUARD on a Dell PowerEdge server; 300+ real-world malware samples tested]
Intrusion Detection on
Virtualized Servers
[Figure: processing pipeline: data sets / streaming data, then feature construction, then anomaly detection with the LOF method, then alarm]
Local Outlier Factor Uses Density –
computes multiple K-nearest neighbors
[Figure: clusters C1 and C2 with outlying points p1 and p2]
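A compact CPU version of LOF shows where the repeated K-nearest-neighbor searches come from. This is a sketch under stated assumptions, not the VGUARD implementation: it uses 2-D points, brute-force search, Manhattan distance (to stay within standard C), exactly k neighbors per point with ties broken by index order, and distinct points. The names `Point`, `knn` and `lof` are hypothetical.

```c
#define MAX_PTS 64
#define MAX_K   8

typedef struct { float x, y; } Point;

/* Manhattan distance (an assumption; LOF works with any metric). */
static float dist(Point a, Point b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
}

/* Brute-force k-nearest-neighbor search for point i (excluding i).
 * Writes neighbor indices, nearest first, into nbr[0..k-1]. */
static void knn(const Point *p, int n, int i, int k, int *nbr) {
    int count = 0;
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        float d = dist(p[i], p[j]);
        int pos = count < k ? count : k;
        while (pos > 0 && dist(p[i], p[nbr[pos - 1]]) > d) pos--;
        if (pos >= k) continue;              /* not among the k nearest */
        int last = count < k ? count : k - 1;
        for (int m = last; m > pos; m--) nbr[m] = nbr[m - 1];
        nbr[pos] = j;
        if (count < k) count++;
    }
}

/* Computes the Local Outlier Factor of every point into out[].
 * Returns 0 on success, -1 if the sizes are out of range. */
int lof(const Point *p, int n, int k, float *out) {
    static int nbr[MAX_PTS][MAX_K];
    float kdist[MAX_PTS], lrd[MAX_PTS];
    if (n > MAX_PTS || k < 1 || k > MAX_K || k >= n) return -1;
    for (int i = 0; i < n; i++) {            /* one KNN search per point */
        knn(p, n, i, k, nbr[i]);
        kdist[i] = dist(p[i], p[nbr[i][k - 1]]);
    }
    for (int i = 0; i < n; i++) {            /* local reachability density */
        float sum = 0.0f;
        for (int m = 0; m < k; m++) {
            int o = nbr[i][m];
            float d = dist(p[i], p[o]);
            sum += d > kdist[o] ? d : kdist[o];   /* reachability distance */
        }
        lrd[i] = (float)k / sum;
    }
    for (int i = 0; i < n; i++) {            /* ratio of neighbor densities */
        float sum = 0.0f;
        for (int m = 0; m < k; m++) sum += lrd[nbr[i][m]];
        out[i] = sum / ((float)k * lrd[i]);
    }
    return 0;
}
```

Points inside a uniform cluster score close to 1, while isolated points score well above 1. The KNN search is quadratic in the number of points, and it is this part that maps naturally onto many independent GPU threads.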
Intrusion Detection on Virtualized Servers
Detects >95% of all
malware samples with a low false
positive rate
Intrusion Detection on Virtualized
Servers
Moving KNN/LOF to a GPU makes this
algorithm feasible to deploy in practice
Physics Simulation on a GPU
GPUs in Surgical Simulation
Goal:
Provide highly accurate physics
simulation in near real time
Target operations:
Surgical training for Army doctors
Modeling deformation of soft tissues
and cutting
• Improves parallel
performance within a dynamic
environment designed for
physics-modeling
• Solves iterative systems
expressed in terms of Physics
and not matrix operations
Architecture Aware Data Structures
Bridging the gap between the convenience of “Physics
Derived” data structures and “Architecture Aware” data
structures
Abstract architecture-specific requirements like
alignment and data layouts within an intermediate object
layer
Allow rapid development and code reuse across
algorithms
Minimal changes to pre-existing GPU kernel
Our data structures are slid in from underneath
Applied to a range of algorithms for surgical simulation
for tissue deformation studies
QuasiStatic Simulation
Backward Euler Systems
Multigrid Simulation
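One common instance of this idea is converting between an array-of-structures layout, which is convenient for physics code, and a structure-of-arrays layout, which gives neighboring threads neighboring addresses. The slides do not say which layouts the object layer uses, so this is an assumed illustration in plain C with hypothetical names (`ParticleAoS`, `ParticlesSoA`, `aos_to_soa`, `soa_to_aos`).

```c
#include <stddef.h>

/* "Physics derived" layout: one struct per particle. */
typedef struct { float x, y, z; } ParticleAoS;

/* "Architecture aware" layout: one contiguous array per field.
 * The caller owns the three arrays. */
typedef struct { float *x, *y, *z; } ParticlesSoA;

/* Intermediate layer: repack before handing data to a kernel... */
void aos_to_soa(const ParticleAoS *in, ParticlesSoA out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
}

/* ...and unpack afterwards, so algorithm code keeps the convenient form. */
void soa_to_aos(ParticlesSoA in, ParticleAoS *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i].x = in.x[i];
        out[i].y = in.y[i];
        out[i].z = in.z[i];
    }
}
```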
PhysBAM Data Structure Performance
[Figure: measured speedups of 8.32x, 4.31x, 4.67x and 6.95x]
GPU Strengths
Supercomputing on the desktop
Easy to program (small learning curve)
Already have demonstrated success with
several complex biomedical imaging and
machine learning applications
Impressive speedups competitive with Beowulf
clusters
Excellent cost/performance
CUDA allows us to read and write data at
any location in the device memory
Memory close to the processors (registers +
shared memory)
GPU Limitations
Porting applications to the latest-and-greatest
hardware becomes a time-consuming task
Suggests we need to raise the abstraction level
Some hardwired graphic components are hidden
Performance optimization requires deep knowledge of
the microarchitecture
Better tools are needed
Register usage
Memory blocking and layout
Aggressive threading schemes
Multi-GPU exploration
What do researchers want??
Semi-automatic tuning
GPU Optimizations
Vectorization
Memory space selection and memory
transformation
Multi-GPU exploitation
GPU Virtualization
PTX Optimization
GPU Vectorization
Loop Vectorization – targeting the vector architecture provided
on the AMD Firestream platform
Targets linearizing data to increase the number of loops that
can be vectorized on AMD GPUs
Before: A[0:N][0:M], B[0:N][0:M]
for (i1=0;i1<N;i1++)
  for (i2=0;i2<=M;i2++)
    A[i1][i2]=B[i1][M-i2]+1;
Transform array B (reverse its second dimension)
After: A[0:N][0:M], B[0:N][M:0]
for (i1=0;i1<N;i1++)
  for (i2=0;i2<=M;i2++)
    A[i1][i2]=B[i1][i2]+1;
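The two loop nests can be checked against each other in plain C. This is a minimal sketch with toy sizes; `reverse_rows`, `original_loop` and `transformed_loop` are hypothetical names, and the one-time reversal of B is written as an explicit copy.

```c
#define N 4   /* rows (toy size) */
#define M 7   /* last column index, so each row has M + 1 elements */

/* Original loop nest: B is read in reverse along its second dimension. */
void original_loop(int A[N][M + 1], int B[N][M + 1]) {
    for (int i1 = 0; i1 < N; i1++)
        for (int i2 = 0; i2 <= M; i2++)
            A[i1][i2] = B[i1][M - i2] + 1;
}

/* Data transformation: Bt holds each row of B in reverse order. */
void reverse_rows(int B[N][M + 1], int Bt[N][M + 1]) {
    for (int i1 = 0; i1 < N; i1++)
        for (int i2 = 0; i2 <= M; i2++)
            Bt[i1][i2] = B[i1][M - i2];
}

/* Transformed loop nest: A and Bt are accessed with the same
 * increasing index, which is the form a vectorizer can use. */
void transformed_loop(int A[N][M + 1], int Bt[N][M + 1]) {
    for (int i1 = 0; i1 < N; i1++)
        for (int i2 = 0; i2 <= M; i2++)
            A[i1][i2] = Bt[i1][i2] + 1;
}
```

The reversal costs one extra pass over B, so the transformation pays off when the reversed data is reused or when the copy can be folded into an earlier step.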
Current GPU Optimizations
Loop vectorization with the data transformation above obtained up to 11X speedup
over scalar code for Livermore Loops*
*“Data transformations enabling loop vectorization on multithreaded data parallel architectures,” Jang et al.,
PPoPP 2010.
Memory Access Pattern Modeling
Classification of memory access patterns
Best memory access pattern for GPUs
[Figure: classification of memory access patterns and their mathematical representation in the model. Gray: elements being accessed; C: constant number; Z: random number]
Current GPU Optimizations
Memory Selection and Coalescing on NVIDIA GPUs
Multiple memory spaces are exposed to the program on NVIDIA
GPUs – a remnant of graphics
Memory     Location   Cached   Access   Scope
Global     Off Chip   No       R/W      Thread Grid
Constant   Off Chip   Yes      R        Thread Grid
Texture    Off Chip   Yes      R        Thread Grid
Local      Off Chip   No       R/W      Thread
Shared     On Chip    N/A      R/W      Thread Block
Register   On Chip    N/A      R/W      Thread
Mathematical framework developed that characterizes loop-based
array iteration spaces
Applied mapping framework to Parboil and PhysBAM programs
Speedups ranged from 1.3X to 15X versus a baseline GPU
implementation
Multi-GPU Design Space
Exploration
Predict performance for GPU programs while
scaling either the number of GPUs or the input
data size
Select the optimal configuration of GPUs
(distributed/multi-system or shared-memory/multi-processing, and how many)
without having to purchase hardware
Avoid architecture-specific optimizations which
limit scalability and portability to future
generations of hardware
Requirements for
Performance Prediction
System-specific Inputs
  Network bandwidth
  PCIe bandwidth to GPU
  Disk throughput
  RAM size
Algorithm-specific Inputs
  Communication requirements
  Reference (single-GPU) implementation
Variables
  Number of GPUs
  Data set sizes
  GPU Configurations
Model output
  Predicted execution times
Multi-GPU Modeling
Developed quantitative GPU models to capture
various aspects of multi-GPU execution and
communication
Allows for accurate performance prediction of
“traditional” CUDA applications across multiple
GPUs
Average difference between modeled and actual
is 11%
Able to account for changes in number of GPUs and
changes in data set sizes
Future work explores modeling larger numbers of
GPUs
Ongoing work with UIUC/NCSA and 3Leaf Systems
D. Schaa and D. Kaeli, “Exploring the Multiple-GPU Design
Space,” IEEE International Parallel and Distributed Processing
Symposium, Best Paper Award, May 2009.
GPU Virtualization
GPUs on many heterogeneous systems can be
accessed from a remote system over the network
[Figure: a user's machine sees virtual GPUs that are backed, over the network, by GPUs in a server, a workstation and a cluster]
Exploit Virtualization to
Access Remote GPUs
Each system presents this abstraction of GPUs
  Multiple virtual machines can be present per system
Manager module
  Determines availability
  Schedules execution
  Communicates data and commands to other manager units
Any number of virtual GPUs can be connected to a system
No change to GPU programming model (no distributed coding)
Access to remote (diverse) resources
No local GPUs required!
[Figure: a Virtual Machine (Guest OS, CUDA App, CUDA Driver, PCIe interface) runs above the Hypervisor (Manager, PCIe interface); the Manager coordinates with other manager modules and reaches physical GPUs over the network]
Improving the Open64 Backend
for GPUs
Goal: Reduce register pressure in PTX using
instruction scheduling and rematerialization
Benefits:
More blocks per core (higher occupancy)
More threads per block
Fewer register spills (to local memory)
Challenges:
PTX is only an intermediate representation
Little to no visibility into the NVIDIA backend
device driver
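The link between register pressure and occupancy is simple arithmetic. The sketch below is a simplification and an assumption-laden one: it considers only registers and the thread/block limits, ignores shared memory and register allocation granularity, and uses hypothetical names (`SmLimits`, `resident_blocks`, `occupancy_percent`). The limits in the usage note (8192 registers, 768 threads and 8 blocks per multiprocessor) are the published figures for compute capability 1.0 parts such as the GeForce 8800 GTX.

```c
/* Per-multiprocessor limits, supplied by the caller. */
typedef struct {
    int regs_per_sm;         /* 32-bit registers per multiprocessor */
    int max_threads_per_sm;  /* maximum resident threads */
    int max_blocks_per_sm;   /* maximum resident thread blocks */
} SmLimits;

/* Thread blocks that fit on one multiprocessor when only registers
 * and the thread/block limits are considered. Assumes positive inputs. */
int resident_blocks(SmLimits sm, int regs_per_thread, int threads_per_block) {
    int by_regs    = sm.regs_per_sm / (regs_per_thread * threads_per_block);
    int by_threads = sm.max_threads_per_sm / threads_per_block;
    int blocks = by_regs < by_threads ? by_regs : by_threads;
    return blocks < sm.max_blocks_per_sm ? blocks : sm.max_blocks_per_sm;
}

/* Occupancy as a whole-number percentage of the maximum resident threads. */
int occupancy_percent(SmLimits sm, int regs_per_thread, int threads_per_block) {
    int blocks = resident_blocks(sm, regs_per_thread, threads_per_block);
    return 100 * blocks * threads_per_block / sm.max_threads_per_sm;
}
```

With `SmLimits g80 = {8192, 768, 8}` and 256 threads per block, 10 registers per thread lets three blocks fit, while 11 registers per thread lets only two fit. A single extra register per thread can therefore remove a whole block from each multiprocessor, which is the motivation for scheduling and rematerialization at the PTX level.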
Improving the Open64 Backend
for GPUs
correlation = 0.94
Target system: NVIDIA GeForce 8800 GTX
SDK: CUDA Toolkit 2.3
Target applications: CUDA SDK 2.3 examples, CUBLAS, PhysX
OpenCL – The future for many-core
computing
Open Computing Language
LLVM compiler
A framework for writing programs that execute on
heterogeneous systems
Presently runs on NVIDIA GPUs and AMD multi-core
CPUs/GPUs, and on the Apple Snow Leopard OS
Being developed by Khronos Group – a non-profit
Modeled as four parts
• Platform Model
• Execution Model
• Memory Model
• Programming Model
Fermi and Fusion –
The next steps in GPGPU
Fermi (4/15/10)
512 CUDA cores
8X the current double
precision FP
performance
16 Concurrent kernels
ECC support
Power issues delaying
the system
Fusion
CPU/GPU Integration
Details later this year
Summary and Future Work
GPUs are revolutionizing desktop
supercomputing
A number of critical applications have been
migrated successfully
We will see shortly if heterogeneous CPU/GPU
systems will be adopted as the status quo for the
desktop market
The key will be power/performance/cost
GPUs have already demonstrated their value in selected domains
OpenCL is the future for heterogeneous computing
The low-end and the high-end are meeting in the
middle!
For more info on GPGPUs
IEEE Transactions on Parallel and Distributed
Systems special issues on Hardware
Accelerators – focused on GPUs
Proceedings for GPGPU 1, 2, and 3
Published in JPDC and ACM digital library
Also check out: http://www.gpgpu.org