
How Microsoft Has Made Deep Learning Red-Hot in the IT Industry
Zhijie Yan, Microsoft Research Asia
USTC visit, May 6, 2014
Self Introduction
 @MSRA鄢志杰
 996 – studied at USTC from 1999 to 2008
 Graduate student – studied at the iFlytek speech lab from 2003 to 2008, supervised by Prof. Renhua Wang
 Intern – worked at MSR Asia from 2005 to 2006
 Visiting scholar – visited Georgia Tech in 2007
 FTE – has worked at MSR Asia since 2008
 Research interests
 Speech, deep learning, large-scale machine learning
In Today’s Talk
 Deep learning has become very hot in the past few years
 How Microsoft has made deep learning hot in the IT industry
 Deep learning basics
 Why Microsoft can turn all these ideas into reality
 Further reading materials
How Hot is Deep Learning
 “This announcement comes on the heels of a $600,000 gift Google awarded Professor Hinton’s research group to support further work in the area of neural nets.” – U. of T. website
How Hot is Deep Learning
[several slides of press coverage; figures omitted]
Microsoft Has Made Deep Learning Hot in the IT Industry
 Initial attempts by the University of Toronto showed promising results using DL for speech recognition on the TIMIT phone recognition task
 Prof. Hinton’s student visited MSR as an intern, and good results were obtained on the Microsoft Bing voice search task
 MSR Asia and Redmond collaborated and got amazing results on the Switchboard task, which shocked the whole industry
Microsoft Has Made Deep Learning Hot in the IT Industry
*figure borrowed from MSR principal researcher Li DENG
Microsoft Has Made Deep Learning Hot in the IT Industry
 Others followed, and the results were confirmed on a variety of speech recognition tasks
 Google / IBM / Apple / Nuance / Baidu / iFlytek
 Continuously advanced by MSR and others
 Expanding to solve more and more problems
 Image processing
 Natural language processing
 Search
 …
Deep Learning From Speech to Image
 ILSVRC-2012 competition on ImageNet
 Classification task: classify an image into 1 of 1,000 classes, scored on your 5 best guesses (top-5 error)
 [example test images: lifeboat, airliner, school bus]

Institution               Top-5 error rate (%)
University of Amsterdam   29.6
XRCE/INRIA                27.1
Oxford                    27.0
ISI                       26.2
Deep Learning From Speech to Image
 ILSVRC-2012 competition on ImageNet – the same leaderboard, with the winning entry revealed

Institution               Top-5 error rate (%)
University of Amsterdam   29.6
XRCE/INRIA                27.1
Oxford                    27.0
ISI                       26.2
SuperVision               16.4
Deep Learning Basics
 Deep learning ≈ deep neural networks ≈ multi-layer perceptron (MLP) with a deep structure (many hidden layers)
[Figure: a shallow MLP (input layer → hidden layer → output layer, with weights W0, W1) beside a deep MLP with three hidden layers (weights W0 through W3)]
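To make "many hidden layers" concrete, here is a minimal sketch of a deep MLP forward pass in Python/NumPy; the layer sizes and the sigmoid activation are illustrative assumptions, not the exact networks drawn on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Each layer applies a linear transform followed by a nonlinearity."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)  # a real output layer might use softmax instead
    return h

rng = np.random.default_rng(0)
# Illustrative sizes: 40-dim input, three 256-unit hidden layers, 10 outputs
sizes = [40, 256, 256, 256, 10]
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = mlp_forward(rng.standard_normal(40), weights, biases)
print(y.shape)  # (10,)
```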
Deep Learning Basics
 Doesn’t sound new at all? Sounds like something you’ve already learned in class?
 Things that have not changed over the years
 Network topology / activation functions / …
 Backpropagation (BP)
 Things that have changed recently
 Data → big data
 General-purpose computing on graphics processing units (GPGPU)
 “A bag of tricks” accumulated over the years
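Since backpropagation is the one training algorithm that has not changed, here is a minimal sketch of a single BP gradient step for a one-hidden-layer sigmoid network with squared-error loss; all sizes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
# Toy network: 4 inputs -> 8 hidden units -> 3 outputs, one training pair (x, t)
W0, b0 = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
W1, b1 = 0.1 * rng.standard_normal((3, 8)), np.zeros(3)
x, t = rng.standard_normal(4), np.array([0.0, 1.0, 0.0])
lr = 0.1  # learning rate

# Forward pass
h = sigmoid(W0 @ x + b0)
y = sigmoid(W1 @ h + b1)

# Backward pass: chain rule, layer by layer (loss = 0.5 * ||y - t||^2)
dy = (y - t) * y * (1.0 - y)        # error signal at the output layer
dh = (W1.T @ dy) * h * (1.0 - h)    # error signal propagated to the hidden layer

# Gradient-descent parameter update
W1 -= lr * np.outer(dy, h); b1 -= lr * dy
W0 -= lr * np.outer(dh, x); b0 -= lr * dh
```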
E.g. Deep Neural Network for Speech Recognition
 Three key components that make DNN-HMM work
 Many layers of nonlinear feature transformation
 Tied triphones as the basic units for HMM states
 Long window of frames
*figure borrowed from MSR senior researcher Dong YU
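The "long window of frames" means the DNN input is a splice of many consecutive acoustic frames rather than a single frame. A minimal sketch (the 11-frame window, i.e. 5 frames of context on each side, is an assumption for illustration):

```python
import numpy as np

def splice_frames(frames, context=5):
    """Stack each frame with `context` neighbors on each side.
    frames: (T, D) array of acoustic feature vectors (e.g. filterbanks).
    Returns (T, (2*context+1)*D); edges are padded by repeating boundary frames."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    T = frames.shape[0]
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

feats = np.random.default_rng(2).standard_normal((100, 40))  # 100 frames, 40-dim
dnn_input = splice_frames(feats, context=5)
print(dnn_input.shape)  # (100, 440) -- an 11-frame window per DNN input
```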
E.g. Deep Neural Network
for Image Classification
 The ILSVRC-2012 winning solution
*figure copied from Krizhevsky, et al., “ImageNet Classification with Deep Convolutional Neural Networks”
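The winning solution is a deep convolutional network; as background, here is a minimal sketch of its core operation, a single 'valid' 2-D convolution followed by a ReLU, in NumPy with toy sizes (not the actual layers of the winning network):

```python
import numpy as np

def conv2d_relu(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as in most DL frameworks)
    over a single-channel image, followed by a ReLU nonlinearity."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU

img = np.random.default_rng(3).standard_normal((8, 8))
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # toy vertical-edge detector
print(conv2d_relu(img, kernel).shape)  # (7, 7)
```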
Scale Out Deep Learning
 Training speed was a major problem of DL
 A speech recognition model trained on 1,800 hours of data (~650,000,000 frames) takes 2 weeks using 1 GPU
 An image classification model trained on ~1,000,000 images takes 1 week using 2 GPUs*
 How do we scale out when 10x or 100x training data becomes available?
*Krizhevsky, et al., “ImageNet Classification with Deep Convolutional Neural Networks”
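As a sanity check on the frame count, assuming the standard 10 ms frame shift (100 frames per second, which the slide does not state explicitly):

\[
1800\ \mathrm{h} \times 3600\ \mathrm{s/h} \times 100\ \mathrm{frames/s} = 648{,}000{,}000 \approx 6.5 \times 10^{8}\ \mathrm{frames}
\]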
DNN-GMM-HMM
 Joint work with USTC-MSRA Ph.D. program student,
Jian XU (许健, 0510)
 The “DNN-GMM-HMM” approach for speech recognition*
 DNN as a hierarchical nonlinear feature extractor, trained using a subset of the training data
 GMM-HMM as the acoustic model, trained using the full data
*Z.-J. Yan, Q. Huo, and J. Xu, “A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR”
DNN-GMM-HMM
 GMM-HMM modeling of DNN-derived features: combine the best of both worlds

DNN-derived features → PCA → HLDA → tied-state WE-RDLT → MMI sequence training → CMLLR unsupervised adaptation
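A minimal sketch of the first two stages of this pipeline, tapping a trained DNN's hidden-layer activations as features and decorrelating them with PCA; the layer sizes, the 39-dim target, and the helper names are hypothetical, and the HLDA/WE-RDLT/MMI/CMLLR stages are not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_derived_features(frames, layers):
    """Pass (T, D) spliced frames through the DNN layers below the tap point;
    the activations of the last layer given serve as the derived features."""
    h = frames
    for W, b in layers:
        h = sigmoid(h @ W.T + b)
    return h

def pca_reduce(feats, k):
    """Decorrelate and reduce features by projecting onto the top-k principal components."""
    centered = feats - feats.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T

# Illustrative use: 1000 frames of 440-dim spliced input, two 1024-unit hidden layers
rng = np.random.default_rng(4)
layers = [(0.05 * rng.standard_normal((1024, 440)), np.zeros(1024)),
          (0.05 * rng.standard_normal((1024, 1024)), np.zeros(1024))]
feats = dnn_derived_features(rng.standard_normal((1000, 440)), layers)
print(pca_reduce(feats, 39).shape)  # (1000, 39) -- GMM-HMM-friendly dimensionality
```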
Experimental Results
 300hr DNN (18k states, 7 hidden layers) + 2,000hr
GMM-HMM (18k states)*
 Training time reduced from 2 weeks to 3-5 days
System                  Word Error Rate (%)
DNN-HMM (CE)            15.4
DNN-GMM-HMM (RDLT)      14.7
DNN-GMM-HMM (MMI)       13.8  (10% relative WERR)
DNN-GMM-HMM (UA)        13.1  (15% relative WERR)

*Z.-J. Yan, Q. Huo, and J. Xu, “A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR”
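Here WERR denotes the relative word-error-rate reduction with respect to the DNN-HMM (CE) baseline:

\[
\mathrm{WERR} = \frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{baseline}}},
\qquad \frac{15.4 - 13.1}{15.4} \approx 15\%.
\]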
A New Optimization Method
 Joint work with USTC-MSRA Ph.D. program student, Kai
Chen (陈凯, 0700)
 Using 20 GPUs, time needed to train a 1,800-hour
acoustic model is cut from 2 weeks to 12 hours,
without accuracy loss
 The magic is to be published
 We believe the scalability issue in DNN training for
speech recognition is now solved!
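The actual method was unpublished at the time of this talk, so it cannot be shown here; purely as background, this is a generic sketch of synchronous data-parallel SGD (each GPU computes a gradient on its own shard of the minibatch, and the averaged gradient is applied once), one common way to use many GPUs, and not necessarily the method referred to above.

```python
import numpy as np

def parallel_sgd_step(params, grads_per_worker, lr=0.01):
    """One synchronous data-parallel step: each worker computes a gradient on its
    own shard of the minibatch; gradients are averaged and applied once."""
    avg_grad = sum(grads_per_worker) / len(grads_per_worker)
    return params - lr * avg_grad

# Toy illustration with 20 'workers' (e.g. GPUs), each contributing a gradient
rng = np.random.default_rng(5)
params = rng.standard_normal(1000)
grads = [rng.standard_normal(1000) for _ in range(20)]
params = parallel_sgd_step(params, grads)
```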
Why Microsoft Can Do All
These Good Things
 Research
 Bridge the gap between academia and industry via our
intern and visiting scholar programs
 Scale out from toy problems to real-world industry-scale
applications
 Product team
 Solve practical issues and deploy technologies to serve
users worldwide via our services
 All together
 We continuously improve our work towards larger scale, higher accuracy, and more challenging tasks
 Finally
 We have big data + a world-leading computational infrastructure
If You Want to Know More
About Deep Learning
 Neural networks for machine learning:
https://class.coursera.org/neuralnets-2012-001
 Prof. Hinton’s homepage:
http://www.cs.toronto.edu/~hinton/
 DeepLearning.net: http://deeplearning.net/
 Open-source
 Kaldi (speech): http://kaldi.sourceforge.net/
 cuda-convnet (image): http://code.google.com/p/cuda-convnet/
Thanks!