Distributed Parameter Synchronization in DNN
Hucheng Zhou (MSRA), Zheng Zhang (MSRA), Minjie Wang (SJTU)

Model
• Several GBs of model size
• Several layers, thousands of neurons per layer, millions of edges between two layers
Training Data
• TBs of data
DNN model training could take weeks or even more. What if we could train a DNN model in one day? It is still a dream: fast training needs parallelism, even in a distributed fashion.

Model Parallelism
• The model is partitioned and trained in parallel across machines.
• Network traffic is bounded, but the speedup is non-linear.
• Training is still slow with large data sets.
Another dimension of parallelism, data parallelism, is required.

Data Parallelism
1. The training data is partitioned, and multiple models are trained in parallel.
2. The intermediate training results (model parameters) are synchronized.

Outline
• Problem statement
• Design goals
• Design
• Evaluation

It is not a good idea to combine model training and model synchronization.

Parameter Server
• Separate model training from model synchronization.
• Build a dedicated system, a PS (Parameter Server), to synchronize the intermediate model parameters.
• DistBelief (NIPS 2012)

Outline
• Problem statement
• Design goals
• Design
• Evaluation

How to build a scalable, reliable and still efficient parameter server?

A Centralized Approach
[Figure: model workers push ∆p and ∆p' to the parameter server, which applies p' = p + ∆p and then p'' = p' + ∆p']
• Asynchronous Stochastic Gradient Descent (A-SGD)
• ∆p is a vector or matrix of floats, rather than a key-value pair.
• p' = p + ∆p is commutative and associative, which makes synchronization in bulk possible.
However, this does not scale when there are many model workers.
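To make the bulk, asynchronous update concrete, the sketch below shows a minimal centralized A-SGD parameter server and worker loop. It is an illustration only, not the system presented here; the names (ParameterServer, Worker, push, pull) and the random placeholder gradient are assumptions, and the "network" is reduced to direct method calls.

```python
import numpy as np

class ParameterServer:
    """Centralized store for one parameter block (a float vector/matrix)."""
    def __init__(self, shape):
        self.p = np.zeros(shape, dtype=np.float32)

    def push(self, delta):
        # p' = p + delta is commutative and associative, so the arrival
        # order of the workers' accumulated updates does not matter.
        self.p += delta

    def pull(self):
        return self.p.copy()

class Worker:
    """A-SGD worker: trains on its own data shard against a local replica."""
    def __init__(self, server, lr=0.01, sync_every=10):
        self.server, self.lr, self.sync_every = server, lr, sync_every
        self.replica = server.pull()
        self.accum = np.zeros_like(self.replica)  # accumulated local updates

    def gradient(self, batch):
        # Placeholder for back-propagation on one mini-batch.
        return np.random.randn(*self.replica.shape).astype(np.float32)

    def train(self, batches):
        for step, batch in enumerate(batches, start=1):
            delta = -self.lr * self.gradient(batch)
            self.replica += delta            # keep training without blocking
            self.accum += delta              # remember what to synchronize
            if step % self.sync_every == 0:
                self.server.push(self.accum)       # bulk, asynchronous push
                self.accum[:] = 0
                self.replica = self.server.pull()  # refresh the local replica
```

Because the update is plain addition over the whole float array, many local deltas can be folded into one message, which is what keeps the network traffic bounded in the bulk-synchronization argument above.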
Parameter Server
The load on the centralized server depends on:
• The size of the model parameters (240 MB)
• The model update rate (3 times/s, thus 720 MB/s)
• The number of model workers, n (the server is overloaded if n is large)
• The GPU scenario
Partitioning the model parameters helps (cf. Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing, Petuum: A Framework for Iterative-Convergent Distributed ML, arXiv:1312.7651, Dec 2013).

A local cache (model slaves) of the model parameters also helps: a parameter master backed by parameter slaves close to the model workers. However, the parameter master may still be the bottleneck, which motivates a decentralized (peer-to-peer) system design.

And what if faults happen?
1. Network delay or outage
2. Machine crash and restart
3. Software crash, data loss, job preemption
Again, the system is not reliable without fault-tolerance support, which motivates a fault-tolerant system design.

How about performance if staleness (consistency) is required?

Staleness is required
[Figure: a fast worker pushes ∆p1 and the server advances to p1 = p + ∆p1, while the slower and slowest workers still compute ∆p2 against the stale p]

Staleness is required for fast model convergence
[Figure: starting from the same initialization, updates t1, t2, t3 from worker 1 and worker 2 with model synchronization; with coordination the model reaches the global optimum, without coordination worker 2 works on an over-staled model and training ends in a local optimum]
The working pace of each worker should be coordinated.
[Figure: a centralized Coordinator (L-BFGS) between the model workers and the parameter server]
However, a centralized coordinator is costly, and the system performance (parallelism) is not fully exploited. A balance between system performance and model convergence rate is needed.

Outline
• Problem statement
• Design goals
• Design
• Evaluation

System Architecture
1. Each worker machine has a local parameter server (model replica), and the system is responsible for parameter synchronization.
• Reduced network traffic, by exchanging only the accumulated updates (commutative and associative)
• Non-blocking training
• Asynchronous operation

2. How can two connected local parameter servers mutually exchange parameter updates, with fault tolerance against network delay or even outage? With a pairwise fault-tolerant update exchange protocol.

Pairwise Protocol Invariants
1. Node p's "belief" of the model (Θp) equals its own contribution (xp) plus the contributions received from its neighbors (φqp):
   Θp = xp + ∑q∈Np φqp   (1)
2. A node p propagates to a neighbor q its own contribution plus the accumulated updates from its other neighbors r, i.e. everything except what came from q:
   φpq = Θp − φqp   (2)
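The two invariants map almost directly onto per-neighbor bookkeeping: each peer keeps its own accumulated contribution xp and, for every neighbor q, the accumulated contribution φqp it has received from q. The following sketch illustrates that reading only; it treats the model as one numpy array and omits the [X.D, <low, high>] version bookkeeping the actual protocol uses to skip and extend updates, so the class and method names (LocalParameterServer, message_to, on_message) are assumptions rather than the system's API.

```python
import numpy as np

class LocalParameterServer:
    """One peer p of the pairwise update-exchange protocol (simplified)."""
    def __init__(self, shape, neighbors):
        self.x = np.zeros(shape, dtype=np.float32)     # own contribution x_p
        # phi_in[q]: accumulated updates received from neighbor q (phi_qp)
        self.phi_in = {q: np.zeros(shape, dtype=np.float32) for q in neighbors}

    def belief(self):
        # Invariant (1): Theta_p = x_p + sum over neighbors q of phi_qp
        return self.x + sum(self.phi_in.values())

    def local_update(self, delta):
        # Fold in a locally computed update (e.g., -lr * gradient).
        self.x += delta

    def message_to(self, q):
        # Invariant (2): phi_pq = Theta_p - phi_qp, i.e. everything p knows
        # except what it already learned from q (no echoing q's own updates).
        return self.belief() - self.phi_in[q]

    def on_message(self, sender, phi):
        # The message carries the sender's accumulated contribution, so a
        # newer message simply replaces the stored value; a lost or delayed
        # message is repaired by whichever message arrives next.
        self.phi_in[sender] = phi
```

Two peers p and q would periodically call q.on_message('p', p.message_to('q')) and vice versa; how often they do so is exactly the flow-control question addressed next.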
Pairwise Protocol Details (example exchange between processes p and q)
• Initially, process p holds [X.D, <Xsend.low, Xsend.high>]p = [0, <0,0>]p and process q holds [X.D, <Xrecv.low, Xrecv.high>]q = [0, <0,0>]q.
• p accumulates a local update of 5, so X.D = 5; the push to q may be skipped ("skip update"), and once a message is delivered q holds [5, <0,1>]q.
• Further local updates (2, then 3, among others) raise p's accumulator to X.D = 7, 8 and finally 11. Some pushes are again skipped, but because the peers exchange accumulated state tagged with a version range, a later message extends over the skipped ones ("extend update"), taking q to [8, <1,3>]q and finally [11, <1,4>]q.

3. How about flow control?
Straightforward: control the timing of synchronization, e.g., with a timer, a version gap, or even dynamic adjustment.

4. How about fault tolerance?
It is NOT based on redundancy (multiple copies) (cf. Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola, Parameter Server for Distributed Machine Learning, Big Learning Workshop, NIPS 2013). Instead,
1. recover the history from the neighbors (Θp − φqp), or
2. keep the accumulated local updates in a persistent store.
This handles:
• Temporary outages
• Scheduled failures
• Permanent failures
Dynamically adding or removing model replicas follows the same logic as fault tolerance.

5. How are the local parameter servers connected (topology)?
• The right topology is hard for the system to determine; it depends on the application: model size, update rate, network bandwidth, number of neighbors, etc. Therefore, topology configuration is motivated.
• Furthermore, as workers leave and join, the topology should be adjusted; for example, incrementally adding model replicas is helpful for DNN training. Therefore, topology re-configuration is necessary.

Master-slave topology
• Shortest propagation delay (one hop)
• But a high workload on the master
Tree-based topology
• Longer propagation delay (multiple hops)
• Decentralized, without a bottleneck

Scalability is sensitive to topology.
[Figure: convergence time (s) versus number of machines (0 to 12) for master-slave (m/s), chain and twin topologies]

Topology affects staleness.
[Figure: staleness versus elapsed time (s) for 4, 8 and 10 slaves]

6. How to set the right staleness to balance system performance against the model convergence rate?
Application-defined staleness is supported, such as:
• Best effort (no extra requirement)
• Maximal delayed time (block a push if the previous n pushes have not completed)
• User-defined filters (only push significant updates)
• SSP* (bound the maximum gap between the fastest and the slowest worker; see the sketch below)
• Bound the update version gap
• Bound the parameter value gap
*: Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing, More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, NIPS 2013.
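Of the policies above, SSP has a particularly compact formulation: a worker may run ahead of the slowest worker by at most s clock ticks. Below is a minimal, thread-based sketch of that check; SSPClock and tick are hypothetical names, and a real system would track clocks across machines rather than threads.

```python
import threading

class SSPClock:
    """Stale Synchronous Parallel: bound the clock gap between the fastest
    and the slowest worker to at most `staleness` ticks."""
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def tick(self, worker_id):
        # Called once per iteration by each worker.
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()  # a slow worker advancing may unblock others
            # Block while this worker is more than `staleness` ticks ahead.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
```

Setting staleness to 0 degenerates to fully synchronous (BSP) execution, while an effectively unbounded value recovers the best-effort policy at the top of the list.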
Outline
• Problem statement
• Design goals
• Design
• Evaluation

Recap
• Re-configurability is king in system design.
• The layered design is beautiful:
  • Pure p2p design
  • Pairwise protocol
  • Flow control
  • Fault tolerance
  • Nodes joining and leaving
  • Configurable topology
  • Configurable staleness

Future work
• The parameter server design is not only for DNN, but also for general inference problems:
  • Generalized linear models with a single massive vector
  • Topic models with sparse vectors
  • Graphical models with plates
• The design also works for areas other than machine learning: scenarios with structured data whose aggregation is both commutative and associative, such as sensor networks computing aggregated data.

Related work
• Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large Scale Distributed Deep Networks. NIPS 2012.
• Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. NIPS 2013.
• Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho and E. P. Xing. Consistent Bounded-Asynchronous Parameter Servers for Distributed ML. arXiv:1312.7869, Dec 2013.
• Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing. Petuum: A Framework for Iterative-Convergent Distributed ML. arXiv:1312.7651, Dec 2013.
• Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola. Parameter Server for Distributed Machine Learning. Big Learning Workshop, NIPS 2013.

Thanks! Questions?