OpenCL
Ryan Renna
Overview
 Introduction
 History
 Anatomy of OpenCL
 Execution Model
 Memory Model
 Implementation
 Applications
 The Future
Goals
 Knowledge that is transferable to all APIs
 Overview of concepts rather than API specific
terminology
 Avoid coding examples as much as possible
Introduction
What is OpenCL
A Language:
 Open Computing Language; it’s C-like!
 Execute code across mixed platforms consisting
of CPUs, GPUs and other processors.
An API:
 Runs on the “Host” to manipulate and control OpenCL objects and code.
 Deals with devices as abstract processing units
Why Use GPUs?
 Modern GPUs are made up of many highly parallel processing units, known as “Stream Processors”
 Most modern PCs have a GPU that sits idle during most day-to-day processing
 Putting the GPU to work on non-graphics tasks is known as “General-Purpose Computation on Graphics Processing Units” or GPGPU
The Stream Processor
 Any device capable of Stream Processing, related to SIMD
 Given a set of data (the stream) a series of functions (called
Kernel functions) are applied to each element
 On-chip memory is used, to minimize external memory
bandwidth
Did you know:
The Cell processor, invented
by Toshiba, Sony & IBM is a
Stream Processor?
Streams
 Most commonly 2D grids (Textures)
 Maps well to Matrix Algebra, Image Processing, Physics
simulations, etc
Did you know:
The latest ATI card has 1600
individual Stream
Processors?
Kernel Functions
Traditional sequential method, over 100 elements of 4 components each:
for (int i = 0; i < 100 * 4; i++)
{
    result[i] = source0[i] + source1[i];
}
The same process, using the kernel “vector_sum” on one 4-component element at a time:
for (int el = 0; el < 100; el++)
{
    vector_sum(result[el], source0[el], source1[el]);
}
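As a rough illustration of the kernel idea, here is a minimal sketch in plain C (no OpenCL required). The `vec4` type and the `run_vector_sum` driver are assumptions made for this example; the slide does not define the element type or how the kernel is dispatched. The kernel only ever sees one element; the driver stands in for the device that applies it to the whole stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-component element; the slide does not define this type. */
typedef struct { float v[4]; } vec4;

/* Kernel: handles exactly one element of the stream. */
static void vector_sum(vec4 *result, const vec4 *a, const vec4 *b)
{
    for (int c = 0; c < 4; c++)
        result->v[c] = a->v[c] + b->v[c];
}

/* Stand-in for the device: applies the kernel to every element.
   The iterations are independent, so a stream processor could
   run them in parallel instead of one after another. */
static void run_vector_sum(vec4 *result, const vec4 *a, const vec4 *b,
                           size_t count)
{
    assert(result != NULL && a != NULL && b != NULL);
    for (size_t el = 0; el < count; el++)
        vector_sum(&result[el], &a[el], &b[el]);
}
```

Because no iteration reads another iteration's result, the order in which elements are processed does not matter, which is what makes the loop a good fit for a kernel.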
An “Open” Computing Language
 Multiple CPU machines with multiple GPUs, all from
different vendors, can work together.
History
GPGPU
 General-Purpose Computation on Graphics Processing Units
 Coined in 2002, with the rise of using GPUs for non-graphics applications
 Hardware specific GPGPU APIs have been created:
Close To Metal ATI 2006
CUDA NVidia 2007
The next step
 OpenCL:
 Initially developed by Apple
 Refined in collaboration with AMD, Intel, IBM and NVidia
 Submitted to the Khronos Group
 The specification for OpenCL 1.0 was finished 5 months later
You may remember me from such open standards as…
 OpenGL – 2D and 3D graphics API
 OpenAL – 3D audio API
 OpenGL ES – OpenGL for embedded systems. Used in most smartphones.
 Collada – XML-based schema for storing 3D assets.
Anatomy of OpenCL
API – Platform Layer
 Compute Device
 A processor that executes data-parallel programs. Contains Compute Units
 Compute Unit
 A Processing element.
 Example: a CORE of a CPU
 Queues
 Submit work to a compute device. Can be in-order or out-of-order.
 Context
 Collection of compute devices. Enables memory sharing across devices.
 Host
 Container of Contexts. Represents the computer itself.
Host Example
 A host computer with
one device group
 A Dual-core CPU
 A GPU with 8 Stream
Processors
API – Runtime Layer
 Memory Objects
 Buffers
 Blocks of memory, accessed as
arrays, pointers or structs
 Images
 2D or 3D images
Caveat: Each image can be read or written in a kernel, but not both.
 Executable Objects
 Kernel
 A data-parallel function that
is executed by a compute
device
 Program
 A group of kernels and
functions
 Synchronization:
 Events
Example Flow
 Compile Code: a Program with a collection of Kernels is compiled into CPU & GPU Binaries
 Create Data & Arguments: Memory Objects (Buffers, Images)
 Send to Execution: through an In-Order Queue or an Out-of-Order Queue to the Compute Device
Execution Model of OpenCL
N-D Space
 The N-Dimensional computation domain is called the N-D Space (NDRange in the API); it defines the total number of elements of execution
 Defines the Global Dimensions
 Each element of execution, representing an instance of a
kernel, is called a work-item
 Work-items are grouped in local workgroups
 Size is defined by Local Dimensions
Work-Items
 Work-items in different workgroups run in parallel, independently of each other (no synchronization)
 Work-items within the same workgroup can be synchronized with each other, and share workgroup memory
 Each work-item runs as its own thread
 Thousands of lightweight threads can be running at a time, and are managed by the device
 Each work-item is assigned a unique global id and a local id within its workgroup, and each workgroup is assigned a workgroup id
Example – Image Filter
Executed on a 128 x 128 image, our Global Dimensions are 128, 128. We will have 16,384 work-items in total.
We can then define Local Dimensions of 16, 16, which divide the Global Dimensions evenly into 8 x 8 workgroups.
Since workgroups are executed together, and work-items can only be synchronized within workgroups, picking your Global and Local Dimensions is problem specific.
Along one dimension, if we asked for the local id of the work-item with global id 17, we’d receive 1, as ids start at 0 and it’s the 2nd work-item of the 2nd workgroup.
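The relationship between global id, workgroup id and local id along one dimension can be sketched in plain C. This is only an illustration, assuming zero-based ids and a local size of 16 (a value that divides 128 evenly); `work_item_pos`, `locate` and `total_work_items` are hypothetical names for this example, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work-item along one dimension (zero-based ids). */
typedef struct {
    size_t group_id;  /* which workgroup the work-item falls in */
    size_t local_id;  /* its index inside that workgroup */
} work_item_pos;

/* Splits a global id into workgroup id and local id. */
static work_item_pos locate(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos p = { global_id / local_size, global_id % local_size };
    return p;
}

/* Total number of work-items for 2D Global Dimensions. */
static size_t total_work_items(size_t global_x, size_t global_y)
{
    return global_x * global_y;
}
```

For example, `locate(17, 16)` yields workgroup 1 and local id 1, and `total_work_items(128, 128)` gives the 16,384 work-items of the image filter.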
Memory Model of OpenCL
Memory Model
 Private
 Per work-item
 Local
 Shared within a workgroup
 Global/Constant
 Not synchronized, per device
 Host Memory
Diagram: each Work Item has its own Private memory; the Work Items of a Compute Unit (Compute Unit 1, Compute Unit 2) share that unit’s Local Memory; all Compute Units of a Compute Device share Global / Constant Memory; Host Memory belongs to the Host.
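To make the memory levels more concrete, the following plain C sketch emulates how they are often used together in a sum: each workgroup accumulates into its own scratch value, writes one partial result to global memory, and the host combines the partial results. It runs sequentially and only mimics the roles of the memory levels; the function names and the summing task are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one workgroup. `global_in` and `partial` play the role of
   global memory, `local_sum` of local memory (one per workgroup),
   and `x` of a work-item's private memory. */
static void workgroup_sum(const float *global_in, float *partial,
                          size_t group_id, size_t local_size)
{
    float local_sum = 0.0f;
    for (size_t lid = 0; lid < local_size; lid++) {
        float x = global_in[group_id * local_size + lid]; /* private copy */
        local_sum += x;
    }
    partial[group_id] = local_sum; /* one write back to global memory */
}

/* Host side: launches every workgroup, then combines the partial sums.
   `partial` must have room for global_size / local_size floats. */
static float host_sum(const float *in, float *partial,
                      size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t groups = global_size / local_size;
    for (size_t g = 0; g < groups; g++)
        workgroup_sum(in, partial, g, local_size);

    float total = 0.0f; /* lives in host memory */
    for (size_t g = 0; g < groups; g++)
        total += partial[g];
    return total;
}
```

On a real device the work-items of a workgroup would run concurrently and need a barrier before the workgroup's result is written out; that synchronization is left out of this sequential emulation.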
Intermission
Implementation
Identifying Parallelizable Routines
 Key thoughts:
 Work-items should be independent of each other
 Work-items within a workgroup can share data, but separate workgroups cannot be synchronized, so workgroups cannot depend on each other’s results
 Find tasks that are independent and highly repeated; pay attention to loops
 Transferring data over a PCI bus has overhead, so parallelization on a GPU is only justified for large data sets, or ones with lots of mathematical computations
An Example – Class Average
 Let’s imagine we were writing an application that
computed the class average
 There are two tasks we’d need to perform:
 Compute the final grade for each student
 Obtain a class average by averaging the final grades
Pseudo Code
 Compute the final grade for each student
foreach(student in class)
{
    grades = student.getGrades();
    sum = 0;
    count = 0;
    foreach(grade in grades)
    {
        sum += grade;
        count++;
    }
    student.averageGrade = sum/count;
}
 This code can be isolated: the body of the outer loop becomes a kernel, and each work-item handles one student.
__kernel void calcGrade(__global const float* input, __global float* output)
{
    int i = get_global_id(0);
    // Do work on class[i]
}
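The kernel body is left empty on the slide. As a minimal sketch of what "do work on class[i]" could look like, here is a plain C emulation (no OpenCL required). It assumes the grades are stored in one flat array with the same number of grades per student; that layout, the `grades_per_student` parameter and the function names are assumptions for this example, not something the slides specify.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one work-item; `i` plays the role of get_global_id(0).
   Assumed layout: student i's grades start at input[i * grades_per_student]. */
static void calc_grade(const float *input, float *output,
                       size_t grades_per_student, size_t i)
{
    float sum = 0.0f;
    for (size_t g = 0; g < grades_per_student; g++)
        sum += input[i * grades_per_student + g];
    output[i] = sum / (float)grades_per_student;
}

/* Host-side emulation of both tasks: one "work-item" per student,
   then the class average over the final grades. */
static float class_average(const float *input, float *output,
                           size_t students, size_t grades_per_student)
{
    assert(students > 0 && grades_per_student > 0);
    for (size_t i = 0; i < students; i++) /* iterations are independent */
        calc_grade(input, output, grades_per_student, i);

    float total = 0.0f;
    for (size_t i = 0; i < students; i++)
        total += output[i];
    return total / (float)students;
}
```

The first task maps directly onto work-items because no student's grade depends on another's. The second task combines all results, so this sketch simply leaves it on the host.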
Determining the Data Dimensions
 First decide how to represent your problem; this determines how many dimensions your Global and Local dimensions have.
 Global dimensions are problem specific
 Local dimensions are algorithm specific
 Local dimensions must have the same number of dimensions as Global.
 Local dimensions must divide the global space evenly
 Passing NULL as the workgroup size argument lets OpenCL decide how to split the work into workgroups, but since you no longer control the workgroup size, kernels should not rely on synchronization between specific work-items
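One common way to satisfy the "must divide evenly" rule when the problem size is not a multiple of the local size is to round the global size up and have the kernel skip the padding work-items (for example with a check like `if (i < n)`). The helpers below are a sketch of that arithmetic in plain C; `round_up_global` and `num_groups` are hypothetical names, not OpenCL functions.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of `local` that is greater than or equal to `n`. */
static size_t round_up_global(size_t n, size_t local)
{
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}

/* Number of workgroups along one dimension; only defined when the
   local size divides the global size evenly. */
static size_t num_groups(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}
```

For 100 data elements and a local size of 16, `round_up_global(100, 16)` gives a global size of 112, i.e. 7 workgroups, with the last 12 work-items doing no work.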
Execution Steps
 An OpenCL calculation needs to perform 6 key steps:
 Initialization
 Allocate Resources
 Creating Programs/Kernels
 Execution
 Read the Result(s)
 Clean Up
Warning! Code Ahead
Initialization
 Store Kernel in string/char array
const char* kernel_source =
    "__kernel void calcGrade(__global const float* input,\n"
    "                        __global float* output)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    // Do work on class[i]\n"
    "}\n";
Initialization
 Selecting a device and creating a context in which to run the calculation
cl_int err;
cl_platform_id platform;
cl_context context;
cl_device_id devices;
cl_command_queue cmd_queue;
// Use the first available platform and its first GPU device
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
context = clCreateContext(NULL, 1, &devices, NULL, NULL, &err);
cmd_queue = clCreateCommandQueue(context, devices, 0, &err);
Allocation
 Allocate the memory/storage that will be used on the device and push the data to it
// Read-only input buffer on the device
cl_mem ax_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, atom_buffer_size, NULL, &err);
// Blocking write: copy the host array "values" into the buffer
err = clEnqueueWriteBuffer(cmd_queue, ax_mem, CL_TRUE, 0, atom_buffer_size, (void*)values, 0, NULL, NULL);
Program/Kernel Creation
 Programs and Kernels are read in as source, then built into device binaries at run time
cl_program program[1];
cl_kernel kernel[1];
program[0] = clCreateProgramWithSource(context, 1, (const char**)&kernel_source, NULL, &err);
// Build for all devices in the context, with default options
err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);
kernel[0] = clCreateKernel(program[0], "calcGrade", &err);
Execution
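 Arguments to the kernel are set and the kernel is executed on all data
size_t global_work_size[1], local_work_size[1];
global_work_size[0] = x;
local_work_size[0] = x/2;  // must divide x evenly and fit the device's workgroup size limit
// Kernel arguments are the device buffers (cl_mem), not the host arrays
err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &ax_mem);
// val_mem: the output buffer read back in the next step (its creation is not shown on the slides)
err = clSetKernelArg(kernel[0], 1, sizeof(cl_mem), &val_mem);
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);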
Read the Result(s)
 We read back the results to the Host
err = clEnqueueReadBuffer(cmd_queue, val_mem, CL_TRUE, 0, grid_buffer_size, val, 0, NULL, NULL);
Note:
If we were working on images, the function clEnqueueReadImage() would be called instead.
Clean Up
 Clean up memory, release all OpenCL objects.
 Can check OpenCL reference count and ensure it equals zero
clReleaseMemObject(ax_mem);
clReleaseMemObject(val_mem);
clReleaseKernel(kernel[0]);
clReleaseProgram(program[0]);
clReleaseCommandQueue(cmd_queue);
clReleaseContext(context);
Advanced Techniques
 Instead of finding the first GPU, we could create a context out of all OpenCL devices, or dynamically decide which devices and dimensions would perform best.
 Debugging can be done more efficiently on the CPU than on a GPU; printf calls can work inside a kernel running on a CPU device
Applications
Applications
 Raytracing
 Weather forecasting, Climate research
 Physics Simulations
 Computational finance
 Computer Vision
 Signal processing, Speech processing
 Cryptography / Cryptanalysis
 Neural Networks
 Database operations
 …Many more!
The Future
OpenGL Interoperability
 OpenCL + OpenGL
 Efficient, inter-API communication
 OpenCL efficiently shares resources with OpenGL (doesn’t copy)
 OpenCL objects can be created from OpenGL objects
 OpenGL 4.0 has been designed to align both standards to work closely together
 Example Implementation:
 Vertex and Image data generated with OpenCL
 Rendered with OpenGL
 Post Processed with OpenCL Kernels
Competitor
 DirectCompute by Microsoft
 Bundled with DirectX 11
 Requires a DX10 or 11 graphics card
 Requires Windows Vista or 7
 Close to OpenCL feature-wise
 Internet Explorer 9 and Firefox 3.7 both use DirectX to speed up DOM tree rendering (Windows only)
Overview
 With OpenCL
 Leverage CPUs, GPUs and other processors to accelerate parallel
computation
 Get dramatic speedups for computationally intensive applications
 Write accelerated portable code across different devices and
architectures
Getting Started…
 ATI Stream SDK
 Support for OpenCL/OpenGL interoperability
 Support for OpenCL/DirectX interoperability
 http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx
 CUDA Toolkit
 http://developer.nvidia.com/object/cuda_3_0_downloads.html
 OpenCL.NET
 OpenCL Wrapper for .NET languages
 http://www.hoopoe-cloud.com/Solutions/OpenCL.NET/Default.aspx
The End? No… The Beginning