ACCELERATING COMPUTER VISION AND IMAGE PROCESSING VIA
HETEROGENEOUS COMPUTE… TRANSPARENTLY!
HARRIS GASPARAKIS, PH.D.
RAGHUNATH RAO, PH.D.
VISION AND IMAGE PROCESSING ARE COMPUTATIONALLY DEMANDING!
 Automatic Inspection
 Medical image analysis
 Autonomous Navigation
 Human Machine Interfaces
 Augmented Reality
 Robotics
 Security/Surveillance
 Data Analytics and Organization… and more

 Millions of pixels per image
 100s of calculations per pixel
 100s–1000s of image frames per second
 Complex + constantly evolving algorithms
 Hungry for PERFORMANCE
 But needs to be PROGRAMMABLE
 And within POWER & COST budgets
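As a rough back-of-the-envelope illustration of the scale (the figures below are assumed for illustration; they are not from the slide):

    2,000,000 pixels/frame × 200 operations/pixel × 500 frames/s ≈ 2 × 10^11 operations/s = 200 GOPS

This is why the 100s–1000s of GFLOPS of data-parallel GPU compute described on the following slides are attractive for these workloads.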
THE TRADITIONAL SOLUTION
 CPU for system control, IO and UI
 Hardware offload for compute-intensive processing (DSP / FPGA / ASIC)
 Various tradeoffs of Performance, Programmability, Power, Cost
[Block diagram: a CPU with system memory alongside one or more DSP/FPGA/ASIC offload engines, each with its own device memory]
EVOLUTION OF HETEROGENEOUS COMPUTE
1. GPU Compute: dGPU brought 100s–1000s of GFLOPS of data-parallel performance.
[Diagram: multi-core CPU with host memory connected over PCIe® to a dGPU with its own GPU device memory and pinned PCIe memory]
But...
 Data copy overheads
 Kernel launch constraints
 Expert programming needed
2. APU with iGPU: Easier-to-use SoC; unified memory eliminates some copies.
[Diagram: CPU and iGPU on one die sharing main memory, partitioned into host memory, device-visible host memory, host-visible device memory, and device memory]
3. Heterogeneous Systems Architecture (HSA) APU: True heterogeneous compute across CPU/GPU.
[Diagram: CPU and HSA iGPU, each with its own cache, sharing unified (bidirectionally coherent, pageable) virtual memory backed by physical memory]
 Share pointers freely
 Move work freely across CPU and GPU
 Use standard programming languages
OPEN-SOURCE COMPUTER VISION LIBRARY
 Core team: Itseez
 Contributors: Willow Garage, Intel, Nvidia, Google, ISCAS, MultiCoreWare
 ~3K algorithms/functions/samples
 BSD license
 ~8M downloads
Timeline: 2000 – first public release; 2008–2009 – v2.0, C++ API; 2012–2013 – @github, v2.4.3, OpenCL™; present – v3.0, T-API
OPEN-SOURCE COMPUTER VISION LIBRARY
Filters
Robust features
Transformations
Optical Flow
Background subtraction
Edges
Segmentation
Depth
Detection/recognition
Calibration
End applications
Images courtesy of Itseez
OPENCV SUPPORTS MULTIPLE PLATFORMS
In OpenCV 2.4.x, the CPU and OpenCL™ code paths are similar yet distinct.
// CPU code path (cv::Mat)
// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
Mat frame, frameGray;
vector<Rect> faces;
for(;;){
    vcap >> frame;
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);
}

// OpenCL code path (ocl::oclMat)
// initialization
VideoCapture vcap(...);
ocl::OclCascadeClassifier fd("haar_ff.xml");
ocl::oclMat frame, frameGray;
vector<Rect> faces;
for(;;){
    vcap >> frame;   // in practice: decode into a cv::Mat, then upload to the oclMat
    ocl::cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    ocl::equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);
}
INTRODUCING THE TRANSPARENT API
// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
UMat frame, frameGray;   // UMat instead of Mat is the only change
vector<Rect> faces;
for(;;){
    vcap >> frame;
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);
}
T-API: UNDER THE HOOD
Easy transition path from 2.x to 3.x: code that worked in 2.x should still work, so cv::Mat is still around.
 Mat: getUMat(…) to obtain a UMat view of the same data
 UMat: getMat(…) to obtain a Mat view of the same data
 UMatData: reference counts, dirty bits, opaque handles (e.g. clBuffer), CPU data, GPU data; handles data synchronization efficiently
Both Mat and UMat are views into UMatData, which does the heavy lifting.
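For illustration, here is a minimal sketch (not from the slides) of how a Mat and a UMat interoperate through getUMat()/getMat() in OpenCV 3.0; the input file name is an assumption:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
using namespace cv;

int main(){
    Mat img = imread("input.png", IMREAD_GRAYSCALE);  // hypothetical input file
    UMat uimg = img.getUMat(ACCESS_READ);             // UMat view of the same data
                                                      // (keep img alive while uimg is in use)
    UMat ublur;
    GaussianBlur(uimg, ublur, Size(5, 5), 1.5);       // may execute via OpenCL if available
    Mat result = ublur.getMat(ACCESS_READ);           // Mat view for CPU-side access
    return 0;
}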
T-API: HOW DOES IT WORK?
[Layer diagram of the OpenCV stack:]
 Application
 Language bindings: C++, C, Python, Java
 OpenCV
 OS: Windows®, Linux, Mac OS X, iOS, Android, WinRT
 Multi-threading API: Windows® Concurrency, TBB, GCD
 T-API implementations and the hardware they target: C++ (CPU), IPP (CPU), OpenCL™ (CPU, GPU, …), NEON (ARM®), CUDA (NVIDIA dGPU)
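As a minimal sketch (not from the slides), an application can inspect and steer which path the T-API dispatches to through the cv::ocl module in OpenCV 3.0:

#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main(){
    // Report whether an OpenCL device is available to the T-API.
    if (cv::ocl::haveOpenCL()){
        cv::ocl::Device dev = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device: " << dev.name().c_str() << std::endl;
    }
    // Force the pure C++/IPP code paths (useful for comparing execution paths).
    cv::ocl::setUseOpenCL(false);
    std::cout << "OpenCL in use: " << cv::ocl::useOpenCL() << std::endl;
    return 0;
}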
EXPERIMENTAL RESULTS
[Chart: speedup (uplift) of OpenCL™ execution over the C++ CPU path on the RX-427B, measured with the imgproc module performance test suite at 3840x2160 resolution. Functions compared: Accumulate, AccumulateSquare, Bilateral, Blur, CalcBackProj, CLAHE, CornerHarris, CvtColor, EqualizeHist, Filter2D, HoughLines, Integral2, matchTemplate, MorphologyEx, PyrDown, Remap, Scharr, SqrBoxFilter, Threshold, WarpPerspective]
Average uplift over the runs under comparison:
 AMD Radeon™ RX-427B OpenCL iGPU vs. RX-427B C++ CPU: 5.6x
 AMD Radeon™ E8860 OpenCL dGPU vs. RX-427B C++ CPU: 15.5x
 AMD Radeon™ R9-290X OpenCL dGPU vs. RX-427B C++ CPU: 65.5x
 Performance transparently scales based on platform capabilities
 T-API enables comparison of various execution paths (C++, IPP, OpenCL™)
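As a hedged illustration (this is not the actual imgproc performance test suite), a minimal sketch of how a single Mat-vs-UMat comparison could be timed; the input file name and kernel size are assumptions:

#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
using namespace cv;

static double timeBlur(InputArray src, OutputArray dst){
    int64 t0 = getTickCount();
    blur(src, dst, Size(15, 15));          // the same call dispatches to CPU or OpenCL
    ocl::finish();                         // wait for any queued OpenCL work to complete
    return (getTickCount() - t0) / getTickFrequency();
}

int main(){
    Mat  cpuImg = imread("frame_4k.png", IMREAD_COLOR);  // hypothetical 3840x2160 frame
    UMat oclImg; cpuImg.copyTo(oclImg);
    Mat  cpuDst; UMat oclDst;
    timeBlur(oclImg, oclDst);              // warm-up (kernel compilation, uploads)
    double tCpu = timeBlur(cpuImg, cpuDst);
    double tOcl = timeBlur(oclImg, oclDst);
    std::cout << "uplift: " << tCpu / tOcl << "x" << std::endl;
    return 0;
}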
VISION ALGORITHMS ARE DIFFERENT… AND COMPLEX
 CPUs and GPUs are complementary compute cores
‒ GPUs do well with parallelizable algorithms, large data sets, dense data access (cache locality), and high arithmetic complexity (compute-to-memory ratio, occupancy)
‒ CPUs do well on serial algorithms, single-thread execution, branchy code, and memory-irregular/intensive operations
 Vision algorithms are complex and would benefit from flexible partitioning across both CPU and GPU (see the sketch after this list)
Example 1: Machine learning (Viola-Jones face detection). During the AdaBoost cascade, each stage increases sparsity, reduces data locality, and reduces occupancy: ideally run the first stages on the GPU and end on the CPU.
Example 2: Object recognition
 Dense feature detection: GPU
 Keypoint finalization: CPU or GPU depending on density
 Descriptors: CPU/GPU depending on density
 Model update (e.g. Bag of Words, deformable parts model, etc.): CPU
 Recognition (dictionary lookup or energy minimization): CPU or GPU based on available libraries for the chosen algorithm
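As a hedged illustration of such a split using the T-API (the ORB detector, brute-force matcher and file names below are my assumptions, not the slide's pipeline): dense feature detection and description run on UMat inputs, where the OpenCL path can be used, while the sparser, branchier matching step runs on the CPU via Mat:

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>
using namespace cv;

int main(){
    UMat img1, img2;                                      // UMat inputs: detection may use OpenCL
    imread("query.png", IMREAD_GRAYSCALE).copyTo(img1);   // hypothetical file names
    imread("train.png", IMREAD_GRAYSCALE).copyTo(img2);

    Ptr<ORB> orb = ORB::create(1000);
    std::vector<KeyPoint> kp1, kp2;
    UMat desc1, desc2;
    orb->detectAndCompute(img1, noArray(), kp1, desc1);
    orb->detectAndCompute(img2, noArray(), kp2, desc2);

    // Matching (the sparse, branchy part) on the CPU using Mat views of the descriptors.
    BFMatcher matcher(NORM_HAMMING);
    std::vector<DMatch> matches;
    matcher.match(desc1.getMat(ACCESS_READ), desc2.getMat(ACCESS_READ), matches);
    return 0;
}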
HSA PLATFORMS ENABLE FLEXIBLE CPU/GPU COMPUTE
[Diagram: a Heterogeneous Systems Architecture (HSA) APU with CPU cores 1…N and GPU compute units (CU) 1…M connected through heterogeneous queuing (hQ) and hUMA to a unified coherent memory]
 Unified Coherent Memory enables data sharing across all processors
 Processors architected to operate cooperatively
 Designed to enable the application to run on different processors at different times
 GPU and CPU have uniform visibility into the entire memory space
 GPU and CPU have equal flexibility to be used to create and dispatch work items
HSA PLATFORMS ENABLE EASY PROGRAMMABILITY
 Shared Virtual Memory (SVM)
‒ Coarse-grained SVM: sharing of complex data structures containing pointers
‒ Fine-Grained Buffer SVM: concurrent access from CPU & GPU without map/unmap (platform atomics sync CPU/GPU); see the sketch below
‒ Fine-Grained System SVM: use any system memory pointer (malloc, new, stack, etc.)
 Platform Atomics
‒ Allow fine-grained atomics within a kernel
‒ Synchronize host/device while a kernel is running (can keep state live on GPU)
 Dynamic Parallelism (a.k.a. Device Enqueue)
‒ Enqueue child kernels
‒ Solve non-gridded problems
[Diagram: Python, OpenMP, C++ (AMP, HC) and OpenCL™ applications build against their various runtimes, HSA helper libraries and the HSAIL runtime; at execution, the HSAIL finalizer and the HSAIL kernel driver run the HSAIL kernels]
HSA platforms support mainstream programming languages
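For illustration, a minimal OpenCL 2.0 host-side sketch (an assumption of mine, not from the slides) of fine-grained buffer SVM: the host reads and writes the allocation directly, with no map/unmap. A context, queue and a kernel whose first argument is a global int* are assumed to exist already, and error checking is omitted:

#include <CL/cl.h>

// Assumes the device reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER
// in CL_DEVICE_SVM_CAPABILITIES.
void runWithFineGrainedSVM(cl_context ctx, cl_command_queue queue,
                           cl_kernel kernel, size_t n)
{
    // Fine-grained buffer SVM: the host can read and write this memory
    // directly; no clEnqueueSVMMap/clEnqueueSVMUnmap is needed.
    int* data = static_cast<int*>(clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(int), 0));

    for (size_t i = 0; i < n; ++i)             // host writes directly
        data[i] = static_cast<int>(i);

    clSetKernelArgSVMPointer(kernel, 0, data); // pass the raw SVM pointer
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    int first = data[0];                       // host reads results directly
    (void)first;

    clSVMFree(ctx, data);
}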
CONCLUSIONS
Heterogeneous Compute (HC) is very effective in accelerating Computer Vision and Image Processing
workloads and offers an excellent alternative to custom hardware.
OpenCV, one of the most popular libraries for vision and image processing, is HC-accelerated.
The Transparent API (T-API) introduced in OpenCV 3.0 makes it even easier to use HC acceleration.
Results show strong acceleration and excellent scaling across multiple platforms with single source
programming and a single binary.
The next generation of HC platforms based on HSA open up even more value for developers to flexibly
map workloads across CPU/GPU while still programming in mainstream high-level languages.
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product
releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the
right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL
DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational
purposes only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used by permission by Khronos.