Use of CUDA for Continuous Space Language Model


Use of CUDA for Continuous Space Language Model
Elizabeth A. Thompson, Ph.D. (a)
Timothy R. Anderson, Ph.D. (b)
(a) Purdue University, Fort Wayne, Fort Wayne, IN, USA 46805
(b) Air Force Research Lab, Wright Patterson Air Force Base, Dayton, OH, USA 45433
Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions
Continuous-Space Language Models (CSLM)
This work was based on the article "Continuous-Space Language Models for Statistical Machine Translation" by Holger Schwenk of the University of Le Mans, France, published in the Prague Bulletin of Mathematical Linguistics, January 2010, and his corresponding open-source implementation.
CSLM (Cont'd)
• The CSLM consists of a 3-layer neural network: a projection layer, a hidden layer, and an output layer.
• Input: a 3-word sequence.
• Output: the probability of every word in the vocabulary being the 4th word in the sequence.
Training of the CSLM
• The neural network must be trained through a process of adaptive learning.
• It is trained using a series of 63,070 4-grams, in which the 4th word is the target word, for example:
  • Prague Stock Market falls
  • Stock Market falls to
  • Market falls to minus
  • falls to minus by
Training of the CSLM (Cont'd)
• The text file vocab.txt contains the list of vocabulary terms.
• Each of the 14,024 terms in vocab.txt is assigned a numerical index, which is used for training the neural network:

  Index   Term
  0       >
  1       …
  619     abandon
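As an illustration only (not taken from Schwenk's code), a minimal sketch of how vocab.txt could be read and each term assigned its line number as its index; the file name and the 14,024-term count come from the slides, everything else is assumed:

    /* Hypothetical sketch: assign each vocabulary term its line number (0-based) as index. */
    #include <stdio.h>
    #include <string.h>

    #define VOCAB_SIZE   14024
    #define MAX_TERM_LEN 64

    static char vocab[VOCAB_SIZE][MAX_TERM_LEN];

    /* Load terms from vocab.txt; returns the number of terms read, or -1 on error. */
    int load_vocab(const char *path)
    {
        FILE *fp = fopen(path, "r");
        if (!fp) return -1;
        int n = 0;
        while (n < VOCAB_SIZE && fgets(vocab[n], MAX_TERM_LEN, fp)) {
            vocab[n][strcspn(vocab[n], "\r\n")] = '\0';  /* strip newline */
            n++;
        }
        fclose(fp);
        return n;
    }

    /* Linear search for a term's index; a real implementation would use a hash map. */
    int term_index(const char *term, int n)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(vocab[i], term) == 0) return i;
        return -1;
    }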
Training the Neural Network
• In the training stage, values are propagated in the forward direction through the neural network in order to apply weighting values to the input data, and then errors are propagated in the reverse direction to improve these weighting factors.
Projection Layer
• The projection layer maps each of the 3 input words to a unique length-256 sequence.
• Initially, these sequences are generated as uniformly distributed random values, but their values change as the neural network is trained.
• For each input word, the corresponding length-256 sequence is the output of the projection layer.
Projection Layer
• The projection layer consists of a lookup table.

  [Lookup table: word indices 0 through 14023, each mapped to a row of floating-point values, e.g. -0.100000 0.009774 …, -0.099803 0.001762 …, -0.091674 -0.081308 …]
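A minimal sketch of this lookup, under the assumption that the table is stored as a dense 14024 × 256 array of floats and that the three rows selected by the input word indices are concatenated into a 768-value vector; all names are illustrative:

    #include <stddef.h>

    #define VOCAB_SIZE 14024
    #define PROJ_DIM   256   /* length of each word's projection      */
    #define NGRAM_IN   3     /* number of input words                 */

    /* table: VOCAB_SIZE x PROJ_DIM floats, initialized with uniform random
       values and updated during training.
       out: NGRAM_IN * PROJ_DIM = 768 values, the projection-layer output
       for one 3-word input. */
    void projection_forward(const float *table, const int word_idx[NGRAM_IN], float *out)
    {
        for (int w = 0; w < NGRAM_IN; w++) {
            const float *row = table + (size_t)word_idx[w] * PROJ_DIM;
            for (int j = 0; j < PROJ_DIM; j++)
                out[w * PROJ_DIM + j] = row[j];
        }
    }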
Hidden Layer
• For the forward pass, the output of the projection layer is fed as input to the hidden layer:

  D = tanh(M C + B)

  where M is the 192 × 768 weight matrix, C is the 768 × 128 output of the projection layer, and B is the 192 × 128 bias matrix.
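For reference, a plain-C sketch of the hidden-layer equation with the dimensions given above; treating the 128 columns as a batch of training examples is an interpretation of the slide, not a statement from it:

    #include <math.h>

    #define H 192   /* hidden-layer size                  */
    #define P 768   /* projection output: 3 words x 256   */
    #define N 128   /* columns (assumed batch of examples)*/

    /* D = tanh(M*C + B); matrices stored row-major.
       M: H x P weights, C: P x N projection outputs, B: H x N biases, D: H x N. */
    void hidden_forward(const float *M, const float *C, const float *B, float *D)
    {
        for (int i = 0; i < H; i++) {
            for (int j = 0; j < N; j++) {
                float acc = B[i * N + j];
                for (int k = 0; k < P; k++)
                    acc += M[i * P + k] * C[k * N + j];
                D[i * N + j] = tanhf(acc);
            }
        }
    }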
Output Layer
• For the forward pass, the output of the hidden layer is fed as input to the output layer:

  O = V D + K

  where V is the 14024 × 192 weight matrix, D is the 192 × 128 output of the hidden layer, and K is the 14024 × 128 bias matrix.
• After applying these weights and biases, a softmax normalization is applied.
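A matching sketch of the output layer, O = V D + K followed by a per-column softmax; only the dimensions come from the slide, the code itself is illustrative:

    #include <math.h>

    #define V_ROWS 14024  /* vocabulary size */
    #define H      192
    #define N      128

    /* O = softmax(V*D + K), softmax applied independently to each column.
       V: V_ROWS x H weights, D: H x N hidden outputs, K: V_ROWS x N biases. */
    void output_forward(const float *V, const float *D, const float *K, float *O)
    {
        for (int j = 0; j < N; j++) {
            /* affine part for column j */
            for (int i = 0; i < V_ROWS; i++) {
                float acc = K[i * N + j];
                for (int k = 0; k < H; k++)
                    acc += V[i * H + k] * D[k * N + j];
                O[i * N + j] = acc;
            }
            /* softmax normalization over the column: exponentiate, then scale by 1/sum */
            float sum = 0.0f;
            for (int i = 0; i < V_ROWS; i++) {
                O[i * N + j] = expf(O[i * N + j]);
                sum += O[i * N + j];
            }
            for (int i = 0; i < V_ROWS; i++)
                O[i * N + j] /= sum;
        }
    }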
Backward Pass for Training
• The error of the output compared to the target value is propagated backward through the network.
• The weights and biases in the output layer and then the hidden layer are updated.
• Finally, the projection-layer table is updated to reflect the results of the forward pass.
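As a rough illustration of the kind of update involved (textbook softmax/cross-entropy backpropagation, not necessarily Schwenk's exact scheme), the output-layer error for one example reduces to the difference between the softmax output and the one-hot target:

    #define V_ROWS 14024
    #define H      192

    /* One-example sketch, purely illustrative: given softmax output o[V_ROWS],
       target word index t, and hidden output d[H], apply a gradient step of
       size lr to the output-layer weights V (V_ROWS x H) and biases k (V_ROWS). */
    void output_layer_update(float *V, float *k, const float *o, int t,
                             const float *d, float lr)
    {
        for (int i = 0; i < V_ROWS; i++) {
            float grad = o[i] - (i == t ? 1.0f : 0.0f);  /* error at output i */
            k[i] -= lr * grad;
            for (int j = 0; j < H; j++)
                V[i * H + j] -= lr * grad * d[j];
        }
    }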
Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions
CUDA for CSLM
• The GPU is specialized for compute-intensive, highly parallel computation.
• All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor.
• However, there is overhead associated with using the GPU.
GPU Overhead
• To use the GPU, memory must be allocated both on the host CPU and on the GPU.
• Variables to be used in the computation must be transferred to the GPU.
• The computation is then performed on the GPU.
• The results must be transferred back to the host CPU.
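These four steps map directly onto the CUDA runtime API; a minimal sketch (array size and names are illustrative, not from the original code):

    #include <cuda_runtime.h>

    /* Illustrative round trip: allocate, upload, compute, download, free. */
    void gpu_roundtrip(const float *host_in, float *host_out, int n)
    {
        float *dev_buf;
        size_t bytes = (size_t)n * sizeof(float);

        cudaMalloc((void **)&dev_buf, bytes);                          /* allocate on GPU  */
        cudaMemcpy(dev_buf, host_in, bytes, cudaMemcpyHostToDevice);   /* host -> device   */

        /* ... launch kernels or CUBLAS/NPP calls operating on dev_buf ... */

        cudaMemcpy(host_out, dev_buf, bytes, cudaMemcpyDeviceToHost);  /* device -> host   */
        cudaFree(dev_buf);                                             /* release GPU mem  */
    }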
Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions
CUDA Architecture
[Diagram: the GPU is composed of streaming multiprocessors, each containing multiple processors (cores).]
CUDA Architecture (Cont'd)
• The CUDA programmer defines functions called kernels.
• A kernel is executed as a grid of thread blocks.
• The number of threads per block and threads per multiprocessor depend on the compute capability of the CUDA device.
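A minimal example of the kernel/grid/block terminology; the elementwise tanh kernel is chosen only for illustration:

    #include <cuda_runtime.h>

    /* Each thread applies tanh to one element of the array. */
    __global__ void tanh_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = tanhf(data[i]);
    }

    void launch_tanh(float *dev_data, int n)
    {
        int threads_per_block = 256;   /* within the 512-thread limit of these devices */
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        tanh_kernel<<<blocks, threads_per_block>>>(dev_data, n);  /* grid of thread blocks */
    }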
Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions
Implementation of CSLM using CUDA
• The CSLM algorithm is highly computationally intensive and a good candidate for implementation with CUDA.
• The matrix multiplications in the hidden and output layers, in both the forward and backward passes, are highly parallel.
CUBLAS Routines for CSLM
• CUBLAS is a CUDA implementation of BLAS (Basic Linear Algebra Subprograms), which performs matrix multiplication operations.
• The CUBLAS routines provide matrix multiplications and handle all overhead issues regarding the programming of threads; they do not require the programmer to define kernels, grids, or thread blocks.
CUBLAS Implementation of CSLM
• The matrix operations were replaced with the CUBLAS function cublasSgemm(), which performs the operation

  C = α A B + β C

  where A, B, and C are matrices containing single-precision values (floats), and α and β are scalars.
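As a sketch of how the hidden-layer product M C + B could be expressed this way (shown with the modern CUBLAS v2 API and column-major storage; the calls in the original code may differ), one can preload the result matrix with the bias B and use α = β = 1:

    #include <cublas_v2.h>

    /* Sketch: D_pre = 1.0 * M * C + 1.0 * D_pre, where D_pre was preloaded with B.
       M: 192x768, C: 768x128, D_pre/B: 192x128, all column-major on the device. */
    void hidden_gemm(cublasHandle_t handle,
                     const float *d_M, const float *d_C, float *d_Dpre)
    {
        const int m = 192, n = 128, k = 768;
        const float alpha = 1.0f, beta = 1.0f;

        cublasSgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,   /* no transposition            */
                    m, n, k,
                    &alpha,
                    d_M, m,                     /* A = M, leading dimension 192 */
                    d_C, k,                     /* B = C, leading dimension 768 */
                    &beta,
                    d_Dpre, m);                 /* C = D_pre (holds bias B)     */
    }
    /* The tanh nonlinearity is then applied elementwise, e.g. with a small
       kernel like the one sketched earlier. */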
CUBLAS Implementation of CSLM (Cont'd)
• NVIDIA Performance Primitives Library (NPP):
  • nppsExp_32f_I – performs an exponential operation "in place" on single-precision values.
  • nppsMulC_32f_I – performs "in place" multiplication of a single-precision matrix by a constant.
• These functions were used to implement the softmax normalization operations.
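A sketch of how the softmax normalization for one output column might be assembled from these primitives; using cublasSasum for the sum is an assumption on my part (the slides name only the two NPP calls), and works here because all values are positive after the exponential:

    #include <npps.h>
    #include <cublas_v2.h>

    /* Softmax over one length-14024 column already resident on the GPU.
       Illustrative composition only; the original implementation may differ. */
    void softmax_column(cublasHandle_t handle, Npp32f *d_col, int len)
    {
        float sum = 0.0f;

        nppsExp_32f_I(d_col, len);                 /* in-place exponential            */
        cublasSasum(handle, len, d_col, 1, &sum);  /* sum (values positive after exp) */
        nppsMulC_32f_I(1.0f / sum, d_col, len);    /* in-place scale by 1/sum         */
    }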
Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions
CUBLAS CSLM on various platforms

CUDA device: Quadro FX 380 LP
  Compute capability: 1.2 | Multiprocessors: 2 | CUDA cores: 16 | Max threads per block: 512 | Max threads per MP: 1024
  CPU platform: HP Z200 SFF workstation, 4 Intel Core i3-530 processors, 2.93 GHz
  Operating system: Fedora (kernel 2.6.33.3-85.fc13.x86_64)
  Execution time per epoch: 3 min

CUDA device: Quadro FX 2700M
  Compute capability: 1.1 | Multiprocessors: 6 | CUDA cores: 48 | Max threads per block: 512 | Max threads per MP: 768
  CPU platform: Intel Core 2 Duo T9600, 2.8 GHz
  Operating system: Scientific Linux 6.0
  Execution time per epoch: 2.5 min

CUDA device: Quadro FX 5800
  Compute capability: 1.3 | Multiprocessors: 30 | CUDA cores: 240 | Max threads per block: 512 | Max threads per MP: 1024
  CPU platform: HP Z800 workstation, 12 Intel Xeon X5660 processors, 2.8 GHz
  Operating system: CentOS Linux (kernel 2.6.32-71.29.1.el6.x86_64)
  Execution time per epoch: 1.33 min
Comparison of revised CUDA version using Quadro FX 5800 vs. original Schwenk algorithm using MKL

Algorithm                     Time per epoch (sec)
Original Schwenk using MKL    36
CUDA version                  26
26
Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions
Conclusions
• A framework has been provided to introduce CUDA to the CSLM, and a time savings over the traditional CPU approach has been demonstrated.
• The CUBLAS and NPP libraries provide a good starting point for the use of GPUs.
• For best performance, avoid redundant uploading and downloading of interim results.
Conclusions (Cont'd)
• GPUs provide a substantial performance benefit at relatively low cost, making high-performance computing accessible to the average user.
• The availability of GPUs on laptops may make them more appealing and practical than a supercomputer in some applications.
Questions?