Deep Learning and HPC
Adam Coates
Visiting Scholar at IU Informatics
Post-doc at Stanford CS
What do we want computers to do with our data?
• Images/video
  – Label: “Motorcycle”
  – Suggest tags
  – Image search
  – …
• Audio
  – Speech recognition
  – Music classification
  – Speaker identification
  – …
• Text
  – Web search
  – Anti-spam
  – Machine translation
  – …
Computer vision is hard!
[Slide shows many different images, each labeled “Motorcycle.”]
What do we want computers to do with our data?
• Images/video
  – Label: “Motorcycle”
  – Suggest tags
  – Image search
  – …
• Audio
  – Speech recognition
  – Music classification
  – Speaker identification
  – …
• Text
  – Web search
  – Anti-spam
  – Machine translation
  – …
Machine learning performs well on many of these problems, but is a lot of work.
What is it about machine learning that makes it so hard to use?
Machine learning for image classification
“Motorcycle”
Why is this hard?
You see this: [image of a motorcycle]
But the camera sees this: [the raw pixel values]
Machine learning and feature representations
[Diagram: raw image pixels (pixel 1, pixel 2) fed as input to a learning algorithm; a plot of pixel 1 vs. pixel 2 with Motorbikes and “Non”-Motorbikes examples.]
What we want
[Diagram: raw image (pixel 1, pixel 2) → feature representation (e.g., does it have handlebars? wheels?) → learning algorithm; a plot of the Handlebars and Wheels features with Motorbikes and “Non”-Motorbikes examples.]
How is computer perception done?
• Images/video: Image → Vision features → Detection
• Audio: Audio → Audio features → Speaker ID
• Text: Text → Text features → Text classification, Machine translation, Information retrieval, …
Coming up with features is difficult, time-consuming, requires expert knowledge.
When working on applications of learning, we spend a lot of time tuning the features.
Deep Learning
• Find algorithms that can learn representations/features from data.
  – Deep neural networks.
  – “Unsupervised feature learning”
    • Learn representations without knowing the task.
Deep Learning
• Build multi-stage pipelines from simple pieces.
  – Classic system: deep neural net.
    [Diagram: a deep neural net producing the label “Motorcycle”; optimize weights inside the network to give correct answers on training data.]
  – Generally: compositions of differentiable functions.
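To make the “compositions of differentiable functions” point concrete, here is a minimal NumPy sketch; the two-layer structure, ReLU nonlinearity, and softmax output are illustrative assumptions, not the specific network from the talk.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def forward(x, params):
    """A deep net as a composition of simple differentiable functions:
    f(x) = softmax(W2 @ relu(W1 @ x + b1) + b2)."""
    W1, b1, W2, b2 = params
    h = relu(W1 @ x + b1)      # stage 1: learned feature representation
    scores = W2 @ h + b2       # stage 2: linear classifier on those features
    return softmax(scores)     # class probabilities, e.g. P("Motorcycle")
```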
Basic algorithmic components
• In a loop over entire training set:
  1. Evaluate deep network.
     • Usually process a batch of training examples (e.g., 100) at once.
  2. Compute gradient of loss function w.r.t. parameters.
     • Sum up gradients over batch of examples.
  3. Update trainable parameters using gradient.
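The three steps above are ordinary minibatch gradient descent. A hedged sketch follows; the `forward_backward` callback, batch size of 100, and learning rate are placeholders, not details of the Stanford code.

```python
def train_epoch(examples, labels, params, forward_backward, lr=0.01, batch=100):
    """One loop over the entire training set (minibatch SGD)."""
    for start in range(0, len(examples), batch):
        xb = examples[start:start + batch]
        yb = labels[start:start + batch]
        # Steps 1-2: evaluate the deep network on the batch and compute the
        # gradient of the loss w.r.t. the parameters, summed over the batch.
        loss, grads = forward_backward(params, xb, yb)
        # Step 3: update the trainable parameters using the gradient.
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```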
Scaling Up Deep Learning at Stanford
• Most DL networks built on a few primitives.
– Mostly large dense matrix/vector operations.
– A few “block” matrices for widely-used cases.
– Communication hidden in distributed arrays.
• Most operations are hardware-friendly.
– Not far from sgemm throughput.
– Relatively low communication / IO needs.
• But hard to avoid doing many iterations.
– Have to focus on making each loop very fast.
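As an example of why throughput stays close to sgemm: the forward and backward passes of a fully connected layer reduce to a few large dense matrix multiplies. A rough sketch (the layer shapes are assumptions):

```python
import numpy as np

def fc_forward(X, W, b):
    """Fully connected layer over a batch: one dense GEMM plus a bias add.
    X: (batch, n_in), W: (n_in, n_out), b: (n_out,)."""
    return X @ W + b

def fc_backward(X, W, dY):
    """Backward pass: two more GEMMs give parameter and input gradients."""
    dW = X.T @ dY           # gradient w.r.t. weights
    db = dY.sum(axis=0)     # gradient w.r.t. bias
    dX = dY @ W.T           # gradient w.r.t. layer input
    return dW, db, dX
```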
Scaling Up Deep Learning at Stanford
• In-house MPI+CUDA infrastructure.
– Up to 11.2B parameter networks.
– Typical experiment: ~14M images (Image-Net).
[Chart: factor speedup vs. number of GPUs (1, 4, 9, 16, 36, 64), compared against linear scaling; model sizes at each point: 185M, 680M, 1.9B, 3.0B, 6.9B, and 11.2B parameters.]
[Coates et al., ICML 2013]
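One standard way to combine MPI with the training loop is data parallelism: each rank computes gradients on its own shard of the batch and the gradients are summed with an allreduce. The sketch below uses mpi4py purely to illustrate that pattern; it is not the in-house MPI+CUDA infrastructure described here.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def average_gradients(local_grads):
    """Sum each rank's gradients and average, so every rank applies
    the same parameter update (simple data-parallel SGD)."""
    averaged = []
    for g in local_grads:
        total = np.empty_like(g)
        comm.Allreduce(g, total, op=MPI.SUM)   # sum over all ranks
        averaged.append(total / comm.Get_size())
    return averaged
```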
Scaling Up Deep Learning at Stanford
• Duplicated “Google Brain” with 3 machines.
– Compared to 1000+ machines.
– Unsupervised learning from 10M YouTube frames.
• Largest artificial neural nets ever trained.
– 6.5x larger than previous system.
… but what should we do with it!?
Surprisingly hard to find a problem big enough that such models matter!
[Coates et al., ICML 2013]
Applications
• Building universal representations
  – “One neural net to rule them all.”
  [Diagram: a shared representation feeding many tasks: object recognition, localization, tagging, depth estimation, …]
[E.g., Collobert et al., 2011]
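A hedged sketch of the “one neural net to rule them all” idea: a single shared representation feeding several task-specific heads. The ReLU trunk, shapes, and head names below are invented for illustration.

```python
import numpy as np

def multi_task_forward(x, trunk, heads):
    """Compute one shared representation, then apply each task's head.
    trunk: (W, b); heads: dict of task name -> (W_task, b_task)."""
    W, b = trunk
    rep = np.maximum(0.0, W @ x + b)   # shared representation for all tasks
    return {task: Wt @ rep + bt for task, (Wt, bt) in heads.items()}

# Hypothetical usage: heads = {"recognition": (Wr, br), "tagging": (Wt, bt)}
```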
Applications
• Autonomous Driving
  – 1 year × 1 Hz ≈ 30M frames (about 31.5M seconds in a year).
  – [Actually have to drive for 1 year!]
  – Can we train from a few hundred 1080p frames per second?
Applications: why these?
• High impact.
– Universal representations: many applications with diffused value.
– Driving: single application with high value.
• Train once, deploy everywhere.
– Training is hard, expensive.
– Deploying is easy, cheap.
– A supercomputer can generate an artifact that gets reused by others.
Things that work
• Find common cases; tightly optimize
– Surprisingly few core pieces. E.g., 10.
• Distributed arrays
– Massive time-saver; easy to think about.
– Easy to save and restore from Lustre.
– Load shards and sanity-check them in Matlab.
• High-level language bindings
– Low-level code in C++/CUDA (JIT)
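A minimal illustration of the “low-level code, JIT-compiled, driven from a high-level language” pattern: compile a C snippet at runtime and call it through ctypes. The real system JIT-compiles C++/CUDA; this sketch only shows the shape of the idea, and the `cc` compiler invocation is an assumption about the build environment.

```python
import ctypes
import os
import subprocess
import tempfile

C_SOURCE = r"""
void saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) y[i] += a * x[i];
}
"""

def jit_compile(source):
    """Compile a C snippet into a shared library at runtime and load it."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "kernel.c")
    lib = os.path.join(workdir, "kernel.so")
    with open(src, "w") as f:
        f.write(source)
    subprocess.check_call(["cc", "-O2", "-shared", "-fPIC", "-o", lib, src])
    return ctypes.CDLL(lib)

# Usage sketch: mod = jit_compile(C_SOURCE), then call mod.saxpy(...) with ctypes types.
```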
Challenges
• Experiment turn-around time is still long.
– Maybe 3-5 experiments running at once.
– Weeks for big models / big datasets.
• Productivity is still much lower than, e.g., Matlab.
– Lack of strong tools at every level except lowest.
• Many DL hackers are not systems hackers.
• Lots of hard-won lessons that are trapped in our group.
Laundry list from Stanford infrastructure
• Job control and scripting is painful
  – Zombies
  – PBS/Torque mostly works
• JIT compilation
  – JIT compile C/C++ code
    • Flexible enough to do many things.
    • Easier to use CUDA runtime, templatizing, etc.
    • Easier to link with high-level languages.
  – Needs to be thread-savvy
    • Caching of compiled modules
    • Avoiding deadlocks or locking problems in cache(s)
  – Ideally invisible to users
    • But first use of kernels is really slow.
• Debugging
  – Unclear what to do here. Support for common tools? NVTX, VampirTrace…?
• Distributed arrays
  – Stanford implementation is rough. Should have pursued more standard approach.
  – MATLAB’s Co-distributed arrays; ScaLapack-style arrays.
    • Multi-dimensional array with a “distributor” that maps indices to ranks (see the sketch after this list).
    • Support to re-distribute array.
    • Support to save/load arrays even when process grid changes.
    • Distribution-aware implementations of most functionality.
• Execution structure
  – Imperative programming is just easier (esp. with students + scientists).
    • DAGs, etc. are static and difficult to alter. Works OK for us; but many headaches.
    • CUDA streams+events semantics is really nice.
      – Solves the same problem: hide massive parallelism from the caller.
      – But allows arbitrary scheduling on the fly. Easy to understand behavior as viewed by the host.
  – If you want custom functionality, you just have to write the parallel code.
    • In CUDA, you have to write the kernel.
    • For ScaLapack, you had to write code on top of BLACS.
  – Single-rank case should look like 100-rank case.
    • Avoids Driver API, which is much less convenient.
    • Students can prototype single-rank. Easier to think about.
• IO tools
  – We spend a lot of time writing file loaders.
    • Application-specific, but lots of boiler-plate.
    • Many common cases in ML. E.g., a list of samples, where each sample = video, image, string, vector.
  – Currently difficult to handle distributed saving/loading of large arrays of data.
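As noted in the distributed-arrays item above, here is a toy sketch of the “distributor” idea: an object that maps global indices to ranks, so single-rank and many-rank code look the same. The 1-D block layout and class name are invented for illustration.

```python
class BlockDistributor:
    """Maps global row indices to (rank, local offset) for a 1-D block layout."""

    def __init__(self, n_rows, n_ranks):
        self.n_rows = n_rows
        self.n_ranks = n_ranks
        self.block = -(-n_rows // n_ranks)   # ceil(n_rows / n_ranks)

    def owner(self, i):
        """Which rank holds global row i."""
        return i // self.block

    def local_offset(self, i):
        """Row i's position within its owner's shard."""
        return i % self.block

    def local_rows(self, rank):
        """The slice of global rows stored on a given rank."""
        lo = rank * self.block
        return slice(lo, min(lo + self.block, self.n_rows))

# With n_ranks == 1, every index is local, so the single-rank case
# looks exactly like the 100-rank case.
```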