Parallel processing infrastructure


Embedded OpenCV Acceleration
Dario Pennisi
Introduction
Open-Source Computer Vision Library
Over 2500 algorithms and functions
Cross-platform, portable API
Windows, Linux, OS X, Android, iOS
Real-time performance
BSD license
Professionally developed and maintained
06/07/2015
History
Launched in 1999 by Intel
Showcasing Intel Performance Library
First Alpha released in 2000
1.0 version released in 2006
Corporate support by Willow Garage in 2008
2.0 version released in 2009
Improved C++ interface
Releases every 6 months
In 2014 maintainership taken over by Itseez
3.0 in beta now
Drops C API support
Application structure
Building blocks to ease vision applications
Pipeline stages and the OpenCV modules that serve them:
Image Retrieval: highgui
Pre-Processing: imgproc
Feature Extraction: features2d
Object Detection: objdetect
Analysis: video
Reconstruction: calib3d, stitching
Decision Making: ml
Environment
Software stack, from application down to hardware acceleration:
Application: C++, Java, Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
OS
Acceleration: CUDA, OpenCL, SSE/AVX/NEON
Desktop vs Embedded
                    Desktop                     Industrial Embedded
Cores/Threads       8/16                        4/4
Core Frequency      >4 GHz                      >1.4 GHz
L1 Cache            32K+32K                     32K+32K
L2 Cache            256K per core               2M shared
L3 Cache            20M                         -
DDR Controllers     4x64-bit DDR4 @ 1066 MHz    2x32-bit DDR3 @ 800 MHz
TDP                 140 W (CPU)                 10 W (SoC)
GPU cores           2880                        1+4+16
System Engineering
Dimensioning the system is fundamental
Understand your algorithm
Carefully choose your toolbox
Embedded means no chance for “one size fits all”
Acceleration Strategies
Optimize Algorithms
Profile
Optimize
Partition (CPU/GPU/DSP)
FPGA acceleration
High level synthesis
Custom DSP
RTL coding
Brute Force
Increase number of CPUs
Increase CPU Frequency
Accelerated libraries
NEON
OpenCL/CUDA
Bottlenecks
Know your enemy
Memory
Access to external memory is expensive
CPU load instructions are slow
Memory has latency
Memory bandwidth is shared among CPUs
Cache
Prevents the CPU from accessing external memory
Caches both data and instructions
Disordered accesses
What happens when we have cache miss?
Fetch data from the same memory row: 13 clocks
Fetch data from a different row: 23 clocks
A cache line is usually 32 bytes
8 clocks to fill a line (32-bit data bus)
Memory bandwidth efficiency:
38% on the same row
26% on a different row
Bottlenecks - Cache
1920x1080 YCbCr 4:2:2 (Full HD): 4 MB
Double the size of the biggest ARM L2 cache
1280x720 YCbCr 4:2:2 (HD): 1.8 MB
Just fits the L2 cache… OK if reading and writing the same frame
720x576 YCbCr 4:2:2 (SD): 800 KB
Two images fit in the L2 cache…
OpenCV Algorithms
Mostly designed for PCs
Well structured
General purpose
Optimized functions for SSE/AVX
Relatively optimized
Small number of accelerated functions:
• NEON
• CUDA (NVIDIA GPU/Tegra)
• OpenCL (GPUs, multicore processors)
Multicore ARM/NEON
NEON SIMD instructions operate on vector registers
Load-process-store philosophy
Load/store costs 1 cycle only if the data is in the L1 cache
• 4-12 cycles if in L2
• 25 to 35 cycles on an L2 cache miss
SIMD instructions take from 1 to 5 clocks
A fast clock is useless on big datasets with little computation per element
Generic DSP
Very similar to ARM/NEON
High-speed pipeline impaired by an inefficient memory access subsystem
When smart DMA is available, it is very complex to program
When the DSP is integrated in a SoC, it shares the ARM's memory bandwidth
OpenCL on GPU
OpenCL on Vivante GC2000
Claimed capability: up to 16 GFLOPS
Real applications:
• operating only on internal registers: 13.8 GFLOPS
• computing a 1000x1000 matrix: 600 MFLOPS
Bandwidth and inefficiencies:
only 1K of local memory and a 64-byte memory cache
OpenCL on FPGA
Same code can run on FPGA and GPU
Transforms selected functions into hardware
Automated memory access coalescing
Each function requires dedicated logic
Large FPGAs required
Partial reconfiguration may solve this
Significant compilation time
HLS on FPGA
High Level Synthesis
Convert C to hardware
HLS requires the code to be heavily modified
Pragmas to instruct the compiler
Code restructuring
The code is no longer portable
Each function requires dedicated logic
Large FPGAs required
Partial reconfiguration may solve this
Significant compilation time
A different approach
Demanding algorithms on low-cost/low-power hardware
Start from algorithm analysis, then partition:
Memory access pattern → DMA
Data-intensive processing → DSP / NEON / custom instruction (RTL)
Decision making → ARM program
External co-processing
[Diagram: an ARM with its own memory connected over PCIe to a GPU or an FPGA; alternatively, the FPGA has its own dedicated memory alongside the ARM's.]
Co-processor details
FPGA Co-Processor
Separate memory
• Adds bandwidth
• Reduces access conflict
Algorithm-aware DMA
• Accesses memory in an ordered way
• Adds caching through embedded RAM
Algorithm-specific processors
• HLS/OpenCL-synthesized IP blocks
• DSPs with custom instructions
• Hardcoded IP blocks
[Diagram: a DMA processor feeds block-capture units; each captured tile lands in a DPRAM buffer serving a DSP core or IP block.]
Co-processor details
Flex DMA
Dedicated processor with a DMA custom instruction
Software-defined memory access pattern
Block Capture
Extracts the data for each tile
DPRAM
Local, high-speed cache
DSP Core
Dedicated processor with algorithm-specific custom instructions
[Diagram: ARM and external memory feed Flex DMA engines; block-capture units fill DPRAM tile buffers that in turn feed DSP cores/IP blocks.]
Environment
The same stack, now with OpenVX and FPGA in the acceleration layer:
Application: C++, Java, Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
OS
Acceleration: OpenVX, OpenCL, CUDA, FPGA, SSE/AVX/NEON
OpenVX
OpenVX Graph Manager
Graph Construction
Allocates resources
Logical representation of the algorithm
Graph Execution
Concatenates nodes, avoiding intermediate memory storage
Tiling extensions
A single node's execution can be split into multiple tiles
[Diagram: Memory → Node1 → Memory → Node2 → Memory, versus fused execution where Node1 and Node2 run back-to-back and multiple accelerators execute a single task in parallel.]
Summary
• OpenCV today is mainly PC oriented
• ARM, CUDA and OpenCL support is growing
What we learned:
• Existing acceleration covers only selected functions
• Embedded CV requires good partitioning among resources
• When ASSPs are not enough, FPGAs are key
• OpenVX provides a consistent HW acceleration platform, not only for OpenCV
Questions
Thank you