Distributed Parameter Synchronization in DNN
Hucheng Zhou (MSRA)
Zheng Zhang (MSRA)
Minjie Wang (SJTU)
Model
• Several GBs of model size
• Several layers
• Millions of edges between two layers
• Thousands of neurons per layer
Training Data
• TBs of data
DNN model training can take weeks or even longer.
What if we could train a DNN model in one day?
It is still a dream.
Fast training needs parallelism, even in a distributed fashion.
Model Parallelism
• The model is partitioned across machines and trained in parallel
Model Parallelism
• Network traffic is bounded
• Non-linear speedup
• Training is still slow with large data sets
Another dimension of parallelism, data parallelism, is required.
Data Parallelism
1. Training data is partitioned, and multiple model replicas are trained in parallel
2. Intermediate trained results (model parameters) are synchronized
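A minimal sketch of these two steps, assuming a toy 1-D linear model and simple parameter averaging; the names train_step and synchronize are illustrative, not part of any system described in this talk.

```python
# Data parallelism in miniature: partition the data, train one model
# replica per partition, then synchronize the replicas' parameters.
import random

def train_step(params, batch, lr=0.5):
    """Toy SGD step for a 1-D linear model y = w * x (squared loss)."""
    w = params["w"]
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return {"w": w - lr * grad}

def synchronize(replicas):
    """Step 2: synchronize the intermediate parameters (here: average them)."""
    w_avg = sum(r["w"] for r in replicas) / len(replicas)
    return [{"w": w_avg} for _ in replicas]

# Step 1: partition the training data and train one replica per partition
# (sequentially here, for clarity).
data = [(i / 100, 3.0 * i / 100) for i in range(100)]   # ground truth w = 3
shards = [data[i::4] for i in range(4)]                 # 4 data partitions
replicas = [{"w": random.random()} for _ in shards]     # 4 model replicas

for _ in range(200):
    replicas = [train_step(p, s) for p, s in zip(replicas, shards)]
    replicas = synchronize(replicas)                    # exchange parameters

print(round(replicas[0]["w"], 2))                       # approaches 3.0
```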
Outline
• Problem statement
• Design goals
• Design
• Evaluation
It is not a good idea to combine model training and model synchronization
Parameter Server
• Separate model training from model synchronization
• Build a dedicated system, PS (Parameter Server), to synchronize the intermediate model parameters
• DistBelief (NIPS 2012)
Outline
• Problem statement
• Design goals
• Design
• Evaluation
How to build a scalable, reliable and still efficient parameter server?
A Centralized Approach
[Figure: model workers train on the data and push updates to a central parameter server, which applies p' = p + ∆p and then p'' = p' + ∆p']
Asynchronous Stochastic Gradient Descent (A-SGD)
A Centralized Approach
Parameter Server: p' = p + ∆p
• ∆p is a vector or matrix of floats, rather than key-value pairs
• p' = p + ∆p is commutative and associative, which makes synchronization in bulk possible
However, it does not scale when there are many model workers.
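A minimal sketch of this centralized approach, assuming a NumPy parameter vector, a thread-based server, and illustrative class/function names: workers accumulate float deltas locally and push them in bulk, relying on p' = p + ∆p being commutative and associative.

```python
# Centralized A-SGD sketch: workers push float deltas, server applies p' = p + dp.
import threading
import numpy as np

class CentralParameterServer:
    def __init__(self, dim):
        self.p = np.zeros(dim)
        self.lock = threading.Lock()

    def push(self, dp):
        with self.lock:                     # p' = p + dp
            self.p += dp

    def pull(self):
        with self.lock:
            return self.p.copy()

def worker(server, steps, dim, rng):
    dp_local = np.zeros(dim)                # accumulate updates locally ...
    for step in range(steps):
        p = server.pull()
        dp = -0.01 * rng.normal(size=dim)   # stand-in for a real gradient
        dp_local += dp
        if (step + 1) % 10 == 0:            # ... and push them in bulk
            server.push(dp_local)
            dp_local[:] = 0.0

server = CentralParameterServer(dim=4)
threads = [threading.Thread(target=worker,
                            args=(server, 100, 4, np.random.default_rng(i)))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.pull())
```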
Parameter Server
The load on the centralized parameter server depends on:
• The size of the model parameters (240 MB)
• The model update rate (3 times/s, thus 720 MB/s)
• The number of model workers (overloaded if n is large)
• The GPU scenario
Model parameter partition helps
Parameter Server
[Figure: the model parameters are partitioned across multiple parameter server shards; each worker pushes its update ∆pi to the shard that owns the corresponding partition]
• Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing, Petuum: A Framework for Iterative-Convergent Distributed ML, arXiv:1312.7651, Dec 2013.
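A minimal sketch of the partitioning idea above, assuming contiguous index-range shards; the class ShardedParameterServer and its shard boundaries are illustrative, not Petuum's or any real system's API.

```python
# Parameter sharding sketch: split the parameter vector into contiguous
# shards, each owned by one parameter-server process, so that no single
# server carries the full load.
import numpy as np

class ShardedParameterServer:
    def __init__(self, dim, num_shards):
        bounds = np.linspace(0, dim, num_shards + 1, dtype=int)
        self.ranges = [(int(lo), int(hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]
        self.shards = [np.zeros(hi - lo) for lo, hi in self.ranges]

    def push(self, dp):
        # Each slice would go to a different machine in a real deployment.
        for (lo, hi), shard in zip(self.ranges, self.shards):
            shard += dp[lo:hi]

    def pull(self):
        return np.concatenate(self.shards)

ps = ShardedParameterServer(dim=10, num_shards=3)
ps.push(np.ones(10))
print(ps.ranges)   # [(0, 3), (3, 6), (6, 10)]
print(ps.pull())
```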
A local cache (model slaves) of model parameters helps
Parameter Server
[Figure: a parameter master backed by parameter slaves that cache the model; each worker pushes its update ∆pi to a nearby parameter slave]
However, the parameter master may still be the bottleneck.
A decentralized (peer-to-peer) system design is motivated.
And what if faults happen?
Parameter Server
1. Network delay or outage
2. Machine crash and restart
3. Software crash, data loss, job preemption
Again, it is not reliable without fault-tolerance support.
A fault-tolerant system design is motivated.
How about performance if staleness (consistency) is required?
Staleness is required
Parameter Server
[Figure: a fast worker pushes ∆p1 and the server advances to p1 = p + ∆p1, while the slower and slowest workers still hold the old p and later push ∆p2 computed against the stale model]
Staleness is required for fast model convergence
[Figure: trajectories of model updates t1, t2, t3 from the same initialization, comparing model synchronization with and without coordination. With coordination, the updates by worker 1 and worker 2 move toward the global optimum; without coordination, worker 2 works on an overly stale model and the trajectory ends in a local optimum.]
The working pace of each worker should be coordinated.
Parameter Server
[Figure: a centralized coordinator (e.g., for L-BFGS) paces the model workers attached to the parameter server]
However, a centralized coordinator is costly, and the system performance (parallelism) is not fully exploited.
A balance between system performance and model convergence rate is motivated.
Outline
• Problem statement
• Design goals
• Design
• Evaluation
System Architecture
1. Each worker machine has a local parameter server (model replica), and the system is responsible for parameter synchronization
• Reduced network traffic, by exchanging only the accumulated updates (commutative and associative)
• Non-blocking training
• Asynchronous updates
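A minimal sketch of such a local parameter server, assuming a NumPy parameter vector (all names are illustrative): the trainer updates locally without blocking, and a separate step drains only the accumulated delta for exchange with a neighbor.

```python
# Local parameter server sketch: non-blocking local updates, bulk exchange
# of the accumulated (commutative, associative) delta.
import numpy as np

class LocalParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)     # local model replica
        self.accum = np.zeros(dim)      # accumulated, not-yet-sent updates

    def local_update(self, dp):
        """Called by the trainer; never blocks on the network."""
        self.params += dp
        self.accum += dp

    def take_accumulated(self):
        """Called by the synchronization step: drain the accumulator."""
        out, self.accum = self.accum, np.zeros_like(self.accum)
        return out

    def apply_remote(self, dp):
        """Apply a neighbor's accumulated update."""
        self.params += dp

a, b = LocalParameterServer(3), LocalParameterServer(3)
a.local_update(np.array([1.0, 0.0, 0.0]))
a.local_update(np.array([0.0, 2.0, 0.0]))
b.apply_remote(a.take_accumulated())    # one bulk exchange covers two updates
print(b.params)                         # [1. 2. 0.]
```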
2. How to mutually exchange parameter updates between two connected local parameter servers, with fault tolerance against network delay or even outage?
Pairwise fault-tolerant update exchange protocol
Pairwise Protocol Invariants
1. Node p's "belief" of the model (Θp) equals its own contribution (xp) plus the contributions received from its neighbors (φqp):
Θp = xp + ∑q∈Np φqp    (1)
Pairwise Protocol Invariants
2. A node p also propagates updates to a neighbor q, combining its own contribution with the accumulated updates from its other neighbors r:
φpq = Θp − φqp    (2)
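A minimal sketch that encodes these two invariants directly, assuming a toy Node class with NumPy vectors; the names and the example topology are illustrative.

```python
# Invariants (1) and (2) for a node p with neighbors N_p.
# phi_in[q] stores what neighbor q has contributed to p (phi_qp).
import numpy as np

class Node:
    def __init__(self, dim, neighbors):
        self.x = np.zeros(dim)                        # own contribution x_p
        self.phi_in = {q: np.zeros(dim) for q in neighbors}

    def belief(self):
        # Invariant (1): Theta_p = x_p + sum of phi_qp over neighbors
        return self.x + sum(self.phi_in.values())

    def update_for(self, q):
        # Invariant (2): phi_pq = Theta_p - phi_qp, i.e. everything p knows
        # except what it originally learned from q (avoids echoing back).
        return self.belief() - self.phi_in[q]

p = Node(2, neighbors=["q", "r"])
p.x += np.array([1.0, 1.0])                    # p's own training updates
p.phi_in["r"] += np.array([0.5, 0.0])          # received from neighbor r
print(p.belief())                              # [1.5 1. ]
print(p.update_for("q"))                       # [1.5 1. ]  (phi_qp is 0)
```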
Pairwise Protocol Details (worked example)
Each side keeps a state [X.D, <X.low, X.high>]: the accumulated value X.D and the version interval it covers. Process p tracks the send window [X.D, <Xsend.low, Xsend.high>]p and process q the receive window [X.D, <Xrecv.low, Xrecv.high>]q; both start at [0, <0,0>].
• p accumulates a local update of 5 (X.D = 5) and pushes it; q applies it and records [5, <0,1>]. A re-delivery of the same interval is detected and skipped ("skip update").
• p keeps training and accumulates 2 and then 1 more (X.D = 7, then 8); when this push reaches q, q records [8, <1,3>].
• p accumulates another 3 (X.D = 11) before the previous interval is acknowledged, so the next push extends the interval ("extend update"); q ends up with [11, <1,4>].
Because the sender keeps accumulating and every push covers all versions since the last acknowledged one, delayed or lost messages do not lose updates.
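A minimal sketch of one possible reading of this walkthrough (not the authors' code): the sender tracks its accumulated delta and the version interval it covers, an unacknowledged push is folded into a later, extended push, and the receiver skips intervals it has already applied.

```python
# Pairwise exchange sketch with version intervals ("skip" and "extend").
class Sender:
    def __init__(self):
        self.delta, self.low, self.high = 0.0, 0, 0

    def local_update(self, d):
        self.delta += d
        self.high += 1                      # one more unsent version

    def message(self):
        # Resending after a lost ack naturally "extends" the interval,
        # because `low` only advances when an ack arrives.
        return (self.delta, self.low, self.high)

    def ack(self, high):
        self.low = high                     # versions up to `high` are safe
        self.delta = 0.0

class Receiver:
    def __init__(self):
        self.applied_high = 0
        self.total = 0.0

    def receive(self, delta, low, high):
        if high <= self.applied_high:
            return "skip update"            # duplicate delivery
        self.total += delta                 # delta covers <low, high>
        self.applied_high = high
        return f"applied [{delta}, <{low}, {high}>]"

p, q = Sender(), Receiver()
p.local_update(5)
print(q.receive(*p.message()))              # applied [5.0, <0, 1>]
print(q.receive(*p.message()))              # skip update (duplicate)
p.ack(1)
p.local_update(2); p.local_update(1); p.local_update(3)
print(q.receive(*p.message()))              # applied [6.0, <1, 4>]
print(q.total)                              # 11.0, as in the walkthrough
```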
3. How about flow control?
Straightforward: control the timing of synchronization, e.g. by a timer, by the version gap, or even dynamically adjusted.
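A minimal sketch of such a flow controller, assuming a timer trigger and a version-gap trigger; the thresholds and names are illustrative.

```python
# Flow-control sketch: decide *when* to synchronize.
import time

class FlowController:
    def __init__(self, max_gap=8, max_interval_s=1.0):
        self.max_gap = max_gap                  # version-gap trigger
        self.max_interval_s = max_interval_s    # timer trigger
        self.last_push_time = time.monotonic()
        self.last_push_version = 0

    def should_push(self, current_version):
        gap = current_version - self.last_push_version
        elapsed = time.monotonic() - self.last_push_time
        return gap >= self.max_gap or elapsed >= self.max_interval_s

    def pushed(self, current_version):
        self.last_push_time = time.monotonic()
        self.last_push_version = current_version

fc = FlowController(max_gap=3)
for version in range(1, 10):
    if fc.should_push(version):
        print(f"push at version {version}")    # versions 3, 6, 9
        fc.pushed(version)
```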
4. How about fault tolerance?
NOT based on redundancy (multiple copies):
• Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola, Parameter Server for Distributed Machine Learning, Big Learning Workshop, NIPS 2013
Instead,
1. get the history back from a neighbor (Θp − φqp), or
2. keep the accumulated local updates in a persistent store
This covers:
• Temporary outage
• Scheduled failure
• Permanent failure
Dynamically adding or removing model replicas follows the same logic as fault tolerance.
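A minimal sketch of recovery option 2, keeping the accumulated local updates in a persistent store and replaying them after a restart; the file and helper names are illustrative, and option 1 would instead pull the previously sent contribution back from a neighbor.

```python
# Recovery sketch: persist accumulated updates, replay them after a crash.
import json
import os

STATE_FILE = "accumulated_updates.json"

def persist(accumulated):
    """Write the accumulated local updates durably around each push."""
    with open(STATE_FILE, "w") as f:
        json.dump(accumulated, f)

def recover():
    """After a temporary outage or restart, replay the persisted updates."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return []        # nothing persisted: fall back to a neighbor, or start fresh

persist([5.0, 2.0, 1.0, 3.0])       # the updates from the walkthrough example
print(sum(recover()))               # 11.0 recovered after a simulated crash
```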
5. How are local parameter servers connected (topology)?
• The right topology is hard for the system to determine; it depends on the application, such as the model size, update rate, network bandwidth, the number of neighbors, etc.
• Therefore, topology configuration is motivated
• Furthermore, as workers leave and join, the topology should be adjusted
• For example, incrementally added model replicas would be helpful for DNN training
• Therefore, topology re-configuration is necessary
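A minimal sketch of why (re-)configuration can stay cheap when the overlay is plain data, assuming adjacency-list topologies; the builder names are illustrative.

```python
# Topology configuration sketch: the overlay is just an adjacency list, so
# switching layouts or adding replicas is a data change, not a code change.
def master_slave(n):
    """Node 0 is the master; every other replica connects to it."""
    return {i: ([0] if i else list(range(1, n))) for i in range(n)}

def chain(n):
    """Each replica talks only to its immediate neighbors."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def add_replica(topology, neighbor):
    """Re-configuration: a new replica joins next to an existing node."""
    new_id = max(topology) + 1
    topology[new_id] = [neighbor]
    topology[neighbor].append(new_id)
    return new_id

topo = chain(4)
print(topo)                      # {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(add_replica(topo, 3), topo)
```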
Master-slave
• Shortest propagation delay (one hop)
• But high workload on the master
Tree-based topology (decentralized)
• Longer propagation delay (multiple hops)
• But no bottleneck
Scalability is sensitive to topology
[Figure: convergence time (s) versus number of machines (0-12) for master/slave (m/s), chain, and twin topologies]
Topology affects staleness
[Figure: staleness versus elapsed time (s) for 4, 8, and 10 slaves]
6. How to set the right staleness to balance system performance and model convergence rate?
Application-defined staleness is supported, such as
• Best effort (no extra requirement)
• Maximal delayed time (block push if previous n pushes not complete)
• User-defined filters (only push significant update)
• SSP* (bound the max gap between the fastest and slowest worker)
• Bound the update version gap
• Bound the parameter value gap
* Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing, More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, NIPS 2013.
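A minimal sketch of the SSP-style bound from the list above, assuming a shared vector clock and a condition variable; this illustrates the bounded-gap idea, not the cited system's implementation.

```python
# SSP sketch: a worker may run ahead of the slowest worker by at most max_gap.
import threading

class SSPClock:
    def __init__(self, num_workers, max_gap):
        self.clocks = [0] * num_workers
        self.max_gap = max_gap
        self.cond = threading.Condition()

    def advance(self, worker_id):
        """Called by a worker at the end of each iteration."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # Block while this worker is more than max_gap ahead of the
            # slowest one; the slowest worker never waits, so progress holds.
            while self.clocks[worker_id] - min(self.clocks) > self.max_gap:
                self.cond.wait()

def worker(clock, worker_id, iters):
    for _ in range(iters):
        pass                      # a training step would go here
        clock.advance(worker_id)

clock = SSPClock(num_workers=4, max_gap=2)
threads = [threading.Thread(target=worker, args=(clock, i, 50))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(clock.clocks)               # all workers reach 50; the gap never exceeded 2
```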
Outline
• Problem statement
• Design goals
• Design
• Evaluation
Recap
• Re-configurability is king in system design
• The layered design is beautiful
• Pure p2p design
• Pairwise protocol
• Flow control
• Fault tolerance
• Node joining in or leaving
• Topology configurable
• Staleness configurable
Future work
• Parameter server design is not only for DNN, but also for general inference problems
  • Generalized linear models with a single massive vector
  • Topic models with sparse vectors
  • Graphical models with plates
• The design also works for areas other than machine learning
  • Scenarios with structured data where the aggregation is both commutative and associative, such as sensor networks computing aggregated data
Related work
• Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng, Large Scale Distributed Deep Networks, NIPS 2012.
• Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing, More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, NIPS 2013.
• Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho and E. P. Xing, Consistent Bounded-Asynchronous Parameter Servers for Distributed ML, arXiv:1312.7869, Dec 2013.
• Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing, Petuum: A Framework for Iterative-Convergent Distributed ML, arXiv:1312.7651, Dec 2013.
• Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola, Parameter Server for Distributed Machine Learning, Big Learning Workshop, NIPS 2013.
Thanks! and Questions?