Transcript Slide 1
Monte Carlo Simulation
and its Efficient Implementation
Robert Tong
28 January 2010
Experts in numerical algorithms
and HPC services
Introduction to NAG
Founded in 1970 as a co-operative project in UK
Operates as a commercial, not-for-profit organization
Worldwide operations
Oxford & Manchester, UK
Chicago, US
Tokyo, Japan
Taipei, Taiwan
Over 3,000 customer sites worldwide
Staff of ~100, over 50% technical, over 25 PhDs
£7m+ financial turnover
2
Portfolio
Numerical Libraries
Highly flexible for use in many computing languages, programming
environments, hardware platforms and for high performance
computing methods
Connector Products for Excel, MATLAB and Maple and
Giving users of the Excel and mathematical software packages MATLAB
and Maple access to NAG’s library of highly optimized and often
superior numerical routines and allowing easy integration
NAG Fortran Compiler and GUI based Windows
Compiler: Fortran Builder
Visualization and graphics software
Build data visualization applications with NAG’s IRIS Explorer
Consultancy services
3
Outline
Why use Monte Carlo simulation?
Higher order methods and convergence
GPU acceleration
The need for numerical libraries
4
Why use Monte Carlo methods?
Essential for high dimensional problems – many
degrees of freedom
For applications with uncertainty in inputs
In finance
Important in risk modelling
Pricing/hedging derivatives
5
The elements of Monte Carlo simulation
Derivative pricing
Simulate a path of asset values
Compute payoff from path
Compute option value
Numerical components
Pseudo-random number generator
Discretization scheme
6
The demand for ever increasing performance
In the past
Faster solution has been provided by increasing processor
speeds
Want a quicker solution? Buy a new processor
Present
Multi-core/Many-core architectures, without increased
processor clock speeds
A major challenge for existing numerical algorithms
The escalator has stopped... or gone into reverse!
Existing codes may well run slower on multi-core
7
Ways to improve performance in Monte Carlo
simulation
1. Use higher order discretization
2. Keep low order (Euler) discretization –
make use of multi-core potential
e.g. GPU (Graphics Processing Unit)
3. Use high order discretization on GPU
4. Use quasi-random sequence (Sobol’, …) and
Brownian Bridge
5. Implement Sobol’ sequence and Brownian Bridge
on GPU
8
Higher order methods – 1
(work by Kai Zhang, University of Warwick, UK)
9
Higher order methods – 2
10
Numerical example – 1
11
Numerical example – 1a
12
Numerical example – 1b
13
Numerical example – 2a
14
Numerical example – 2b
15
GPU acceleration
Retain low order Euler discretization
Use multi-core GPU architecture to achieve speed-up
16
The Emergence of GPGPU Computing
Initially – computation carried out by CPU (scalar,
serial execution)
CPU
evolves to add cache, SSE instructions, ...
GPU
added to speed graphics display – driven by gaming needs
multi-core, SIMT, limited flexibility
CPU and GPU move closer
CPU becomes multi-core
GPU becomes General Purpose (GPGPU) – fully
programmable
17
Current GPU architecture – e.g. NVIDIA Tesla
18
Tesla – processing power
SM – Streaming Multiprocessor
8 X SP - Streaming Processor core
2 X Special Function Unit
MT – Multithreaded instruction fetch and issue unit
Instruction cache
Constant cache (read only)
Shared memory (16 Kb, read/write)
C1060 – adds double precision
30 double precision cores
240 single precision cores
19
Tesla C1060 memory
(from: M A Martinez-del-Amor et al. (2008) based on E Lindholm et al. (2008))
20
Programming GPUs – CUDA and OpenCL
CUDA (Compute Unified Device Architecture,
developed by NVIDIA)
Extension of C to enable programming of GPU devices
Allows easy management of parallel threads executing on
GPU
Handles communication with ‘host’ CPU
OpenCL
Standard language for multi-device programming
Not tied to a particular company
Will open up GPU computing
Incorporates elements of CUDA
21
First step – obtaining and installing CUDA
FREE download from
http://www.nvidia.com/object/cuda_learn.html
See: Quickstart Guide
Require:
CUDA capable GPU – GeForce 8, 9, 200, Tesla, many Quadro
Recent version of NVIDIA driver
CUDA Toolkit – essential components to compile and build applications
CUDA SDK – example projects
Update environment variables (Linux default shown)
PATH
LD_LIBRARY_PATH
/usr/local/cuda/bin
/usr/local/cuda/lib
CUDA compiler nvcc works with gcc (Linux) MS VC++ (Windows)
22
Host (CPU) – Device (GPU) Relationship
Application program initiated on Host (CPU)
Device ‘kernels’ execute on GPU in SIMT (Single
Instruction Multiple Thread) manner
Host program
Transfers data from Host memory to Device (GPU) memory
Specifies number and organisation of threads on Device
Calls Device ‘kernel’ as a C function, passing parameters
Copies output from Device back to Host memory
23
Organisation of threads on GPU
SM (Streaming Multiprocessor) manages up to 1024
threads
Each thread is identified by an index
Threads execute as Warps of 32 threads
Threads are grouped in blocks (user specifies number
of threads per block)
Blocks make up a grid
24
Memory hierarchy
• On device can
•
Read/write per-thread
•
•
•
•
•
Registers
Local memory
Read/write per-block shared memory
Read/write per-grid global memory
Read only per-grid constant memory
• On host (CPU) can
•
Read/write per-grid
•
•
Global memory
Constant memory
25
CUDA terminology
‘kernel’ – C function executing on the GPU
__global__ declares function as a kernel
Executed on the Device
Callable only from the Host
void return type
__device__ declares function that is
Executed on the Device
Callable only from the Device
26
Application to Monte Carlo simulation
Monte Carlo paths lead to highly parallel
algorithms
• Applications in finance e.g. simulation based
on SDE (return on asset)
dSt
dt dWt
St
drift + Brownian motion
• Requires fast pseudorandom or
Quasi-random number generator
• Additional techniques improve efficiency:
Brownian Bridge, stratified sampling, …
27
Random Number Generators:
choice of algorithm
Must be highly parallel
Implementation must satisfy statistical
tests of randomness
Some common generators do not
guarantee randomness properties when
split into parallel streams
A suitable choice: MRG32k3a (L’Ecuyer)
28
MRG32k3a: skip ahead
Generator combines 2 recurrences:
xn,1 a1xn2,1 b1xn3,1 modm1
xn,2 a2 xn1,2 b2 xn3,2 modm2
Each recurrence of form (M Giles, note on
implementation)
yn Ayn1
Precompute
xn
yn xn 1
x
n2
A p in O(log p) operations on CPU,
yn p A p yn modm
29
MRG32k3a: modulus
Combined and individual recurrences
zn xn,1 xn, 2 modm1
Can compute using double precision divide – slow
Use 64 bit integers (supported on GPU) – avoid
divide
Bit shift – faster (used in CPU implementations)
Note: speed of different possibilities subject to
change as NVIDIA updates floating point
capability
30
MRG32k3a: memory coalescence
GPU performance limited by memory access
Require memory coalescence for fast transfer of data
Order RNG output to retain consecutive memory
accesses
xn,t ,b thenth elementgeneratedby threadt in block b
is stored at
t Nt n Nt N p b
sequential ordering
n N pt Nt N pb
Nt num threads,N p num pointsper thread
(Implementation by M Giles)
31
MRG32k3a: single – double precision
L’Ecuyer’s example implementation in double
precision floating point
Double precision on high end GPUs – but
arithmetic operations much slower in execution
than single precision
GPU implementation in integers – final output
cast to double
Note: output to single precision gives sequence
that does not pass randomness tests
32
MRG32k3a: GPU benchmarking –
double precision
GPU – NVIDIA Tesla C1060
CPU – serial version of integer implementation running on
single core of quad-core Xeon
VSL – Intel Library MRG32k3a
ICC – Intel C/C++ compiler
VC++ – Microsoft Visual C++
GPU
Samples/
sec
CPU-ICC
CPU-VC++
3.00E+09 3.46E+07 4.77E+07
VSL-ICC
VSL-VC++
9.35E+07 9.32E+07
33
MRG32k3a: GPU benchmarking –
single precision
Note: for double precision all sequences were identical
For single precision GPU and CPU identical
GPU and VSL differ
max abs err 5.96E-008
Which output preferred?
use statistical tests of randomness
GPU
Samples/
sec
CPU-ICC
CPU-VC++
3.49E+09 3.58E+07 5.24E+07
VSL-ICC
VSL-VC++
1.02E+08 9.75E+07
34
LIBOR Market Model on GPU
Equally weighted portfolio of 15 swaptions each with
same maturity, but different lengths and
different strikes
35
Numerical Libraries for GPUs
The problem
The time-consuming work of writing basic numerical
components should not be repeated
The general user should not need to spend many days
writing each application
The solution
Standard numerical components should be available as
libraries for GPUs
36
NAG Routines for GPUs
37
nag_gpu_mrg32k3a_uniform
38
nag_gpu_mrg32k3a_uniform
39
nag_gpu_mrg32k3a_uniform
40
Example program: generate random numbers on GPU
...
// Allocate memory on Host
host_array = (double *)calloc(N,sizeof(double));
// Allocate memory on GPU
cudaMalloc((void **)&device_array, sizeof(double)*N);
// Call GPU functions
// Initialise random number generator
nag_gpu_mrg32k3a_init(V1, V2, offset);
// Generate random numbers
nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);
// Read back GPU results to host
cudaMemcpy(host_array,gpu_array,sizeof(double)*N,cudaMem
cpyDeviceToHost);
...
41
NAG Routines for GPUs
42
nag_gpu_mrg32k3a_next_uniform
43
nag_gpu_mrg32k3a_next_uniform
44
Example program – kernel function
__global__ void mrg32k3a_kernel(int np, FP *d_P){
unsigned int v1[3], v2[3];
int n, i0;
FP x, x2 = nanf("");
// initialisation for first point
nag_gpu_mrg32k3a_stream_init(v1, v2, np);
// now do points
i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
for (n=0; n<np; n++) {
nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
}
d_P[i0] = x;
i0 += blockDim.x;
}
45
Library issues: Auto-tuning
Performance affected by mapping of algorithm to
GPU via threads, blocks and warps
Implement a code generator to produce variants
using the relevant parameters
Determine optimal performance
Li, Dongarra & Tomov (2009)
46
Early Success with BNP Paribas
Working with Fixed Income Research & Strategies
Team (FIRST)
NAG mrg32k3a works well in BNP Paribas CUDA “Local Vol
Monte-Carlo”
Passes rigorous statistical tests for randomness properties
(Diehard, Dieharder,TestU01)
Performance good
Being able to match the GPU random numbers with the
CPU version of mrg32k3a has been very valuable for
establishing validity of output
47
BNP Paribas Results – local vol example
48
And with Bank of America Merrill Lynch
“The NAG GPU libraries are helping us enormously
by providing us with fast, good quality algorithms.
This has let us concentrate on our models and deliver
GPGPU based pricing much more quickly.”
49
“A N Other Tier 1” Risk Group
“Thank you for the GPU code, we have achieved
speed ups of x120”
In a simple uncorrelated loss simulation:
Number of simulations
50,000
Time taken in seconds
2.373606
Simulations per second
21065
Simulated default rate
311.8472
Theoretical default rate311.9125
24 trillion numbers in 6 hours
50
NAG routines for GPUs – 1
Currently available
Random Number Generator (L’Ecuyer mrg32k3a)
Uniform distribution
Normal distribution
Exponential distribution
Gamma distribution
Sobol sequence for Quasi-Monte Carlo (to 19,000
dimensions)
Brownian Bridge
51
NAG routines for GPUs – 2
Future plans
Random Number Generator – Mersenne Twister
Linear algebra components for PDE option pricing methods
Time series analysis – wavelets ...
52
Summary
GPUs offer high performance computing for specific
massively parallel algorithms such as Monte Carlo
simulations
GPUs are lower cost and require less power than
corresponding CPU configurations
Numerical libraries for GPUs will make these an
important computing resource
Higher order methods for GPUs being considered
53
Acknowledgments
Mike Giles (Mathematical Institute, University of
Oxford) – algorithmic input
Technology Strategy Board through Knowledge
Transfer Partnership with Smith Institute
NVIDIA for technical support and supply of Tesla
C1060 and Quadro FX 5800
See
www.nag.co.uk/numeric/GPUs/
Contact:
[email protected]
54