Scaling Up Graphical Model Inference

Graphical Models
• View observed data and unobserved properties as random variables
• Graphical models: compact graph-based encoding of probability distributions (high-dimensional, with complex dependencies)
πœƒ
π‘₯𝑖𝑗
𝑦𝑖
𝐷
𝑁
• Generative/discriminative/hybrid models; unsupervised, semi-supervised, and supervised learning
– Bayesian networks (directed), Markov random fields (undirected), hybrids, extensions, etc.: HMM, CRF, RBM, M3N, HMRF, ...
• Enormous research area with a number of excellent tutorials
– [J98], [M01], [M04], [W08], [KF10], [S11]
Graphical Model Inference
• Key issues:
– Representation: syntax and semantics (directed/undirected, variables/factors, ...)
– Inference: computing probabilities and most likely assignments/explanations
– Learning: estimating model parameters from observed data; relies on inference!
• Inference is NP-hard (numerous results, incl. approximation hardness)
• Exact inference works only for a very limited subset of models/structures
– E.g., chains, trees, or other low-treewidth graphs
• Approximate inference: highly computationally intensive
– Deterministic: variational methods, loopy belief propagation, expectation propagation
– Numerical sampling (Monte Carlo): e.g., Gibbs sampling
Inference in Undirected Graphical Models
• Factor graph representation
p(x_1, ..., x_n) = (1/Z) ∏_{i,j : x_j ∈ N(x_i)} ψ_ij(x_i, x_j)
• Potentials capture compatibility of related observations
– e.g., ψ(x_i, x_j) = exp(−c |x_i − x_j|)
• Loopy belief propagation = message passing
– Iterate (read, update, send); a minimal code sketch follows below
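To make the message-passing loop concrete, here is a minimal synchronous loopy BP sketch for a discrete pairwise MRF (Python/NumPy). The function and argument names (sync_loopy_bp, unary, pairwise, edges) and the toy chain at the end are illustrative assumptions, not code from the talk.

import numpy as np

def sync_loopy_bp(n_vars, n_states, unary, pairwise, edges, n_iters=50):
    """unary[i]: (n_states,) node potentials; pairwise[(i, j)]: (n_states, n_states)
    edge potentials oriented as psi[x_i, x_j]; edges: undirected (i, j) pairs."""
    nbrs = {i: [] for i in range(n_vars)}
    m = {}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        m[(i, j)] = np.ones(n_states) / n_states   # message i -> j
        m[(j, i)] = np.ones(n_states) / n_states   # message j -> i

    for _ in range(n_iters):
        new_m = {}
        for (i, j) in m:
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            # Read: product of node potential and all inbound messages except from j.
            prod = unary[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod = prod * m[(k, i)]
            # Update: sum over x_i, then normalize.
            msg = psi.T @ prod
            new_m[(i, j)] = msg / msg.sum()
        # Send: all messages replaced simultaneously (synchronous schedule).
        m = new_m

    beliefs = []
    for i in range(n_vars):
        b = unary[i].copy()
        for k in nbrs[i]:
            b = b * m[(k, i)]
        beliefs.append(b / b.sum())
    return beliefs

# Toy example: 3-node chain with smoothing potential psi(x_i, x_j) = exp(-c*|x_i - x_j|)
c, S = 1.0, 2
psi = np.exp(-c * np.abs(np.subtract.outer(np.arange(S), np.arange(S))))
print(sync_loopy_bp(3, S,
                    unary=[np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.2, 0.8])],
                    pairwise={(0, 1): psi, (1, 2): psi},
                    edges=[(0, 1), (1, 2)]))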
Synchronous Loopy BP
• Natural parallelization: associate a processor with every node
– Simultaneous receive, update, send
• Inefficient – e.g., for a linear chain: 2n/p time per iteration, n iterations to converge
[SUML-Ch10]
Optimal Parallel Scheduling
• Partition, local forward-backward for center, then cross-boundary
[Diagram: synchronous vs. optimal schedules across Processors 1–3, contrasting the parallel component, the sequential component, and the gap between them]
Splash: Generalizing Optimal Chains
1) Select root, grow fixed-size BFS spanning tree
2) Forward pass: compute all messages at each vertex
3) Backward pass: compute all messages at each vertex
• Parallelization (a minimal sketch follows this list):
– Partition graph
• Maximize computation, minimize communication
• Over-partition and randomly assign
– Schedule multiple Splashes
• Priority queue for selecting roots
• Belief residual: cumulative change from inbound messages
• Dynamic tree pruning
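Below is a minimal sketch of a single Splash and the residual-driven scheduling loop (Python), under stated assumptions: update_vertex is a hypothetical callback that recomputes all outgoing messages of a vertex and maintains the residual table as a side effect; the names and the simplified convergence handling are illustrative, not the talk's implementation.

import heapq
from collections import deque

def splash(root, graph, update_vertex, max_tree_size):
    """graph: dict vertex -> iterable of neighbors."""
    # 1) Select root, grow a fixed-size BFS spanning tree.
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < max_tree_size:
        v = frontier.popleft()
        order.append(v)
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    # 2) Forward pass: compute all messages at each vertex, leaves towards root.
    for v in reversed(order):
        update_vertex(v)
    # 3) Backward pass: compute all messages at each vertex, root towards leaves.
    for v in order:
        update_vertex(v)

def run_splashes(graph, update_vertex, residual, max_tree_size=50, tol=1e-3):
    """residual: dict vertex -> belief residual (cumulative change from inbound
    messages); assumed to be increased by update_vertex for affected neighbors."""
    pq = [(-r, v) for v, r in residual.items() if r > tol]   # max-heap via negation
    heapq.heapify(pq)
    while pq:
        _, v = heapq.heappop(pq)
        if residual[v] <= tol:
            continue                       # stale entry: vertex already converged
        splash(v, graph, update_vertex, max_tree_size)
        residual[v] = 0.0
        for u in graph[v]:                 # re-enqueue neighbors whose residual grew
            if residual[u] > tol:
                heapq.heappush(pq, (-residual[u], u))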
DBRSplash: MLN Inference Experiments
• 8K variables, 406K factors; single-CPU runtime: 1 hour; cache efficiency critical
• 1K variables, 27K factors; single-CPU runtime: 1.5 minutes; network costs limit speedups
[Speedup plots: speedup vs. number of CPUs (0–120), with no over-partitioning vs. 5x over-partitioning]
Topic Models
• Goal: unsupervised detection of topics in corpora
– Desired result: topic mixtures, per-word and per-document topic assignments
[B+03]
Directed Graphical Models: Latent Dirichlet Allocation [B+03, SUML-Ch11]
• Generative model for document collections (a code sketch follows below)
– K topics; topic k: Multinomial(φ_k) over words
– D documents; document j:
• Topic distribution θ_j ∼ Dirichlet(α)
• N_j words; word x_ij:
– Sample topic z_ij ∼ Multinomial(θ_j)
– Sample word x_ij ∼ Multinomial(φ_{z_ij})
[Plate diagram: prior α → document's topic distribution θ_j → word's topic z_ij → word x_ij, inside plates N_j and D; prior β → topic's word distribution φ_k, inside plate K]
• Goal: infer posterior distributions
– Topic word mixtures {φ_k}
– Document mixtures θ_j
– Word-topic assignments {z_ij}
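As a concrete illustration of the generative process above, here is a minimal NumPy sketch. The vocabulary size V, the hyperparameter values, and the Poisson document-length choice are illustrative assumptions, not part of the model as stated on the slide.

import numpy as np

def generate_lda_corpus(K=5, D=100, V=1000, mean_doc_len=80,
                        alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Topic k: Multinomial(phi_k) over the V words, phi_k ~ Dirichlet(beta).
    phi = rng.dirichlet(np.full(V, beta), size=K)
    docs, topics = [], []
    for j in range(D):
        # Document j: topic distribution theta_j ~ Dirichlet(alpha).
        theta_j = rng.dirichlet(np.full(K, alpha))
        N_j = rng.poisson(mean_doc_len)              # document length (assumed Poisson)
        z_j = rng.choice(K, size=N_j, p=theta_j)     # z_ij ~ Multinomial(theta_j)
        x_j = np.array([rng.choice(V, p=phi[z]) for z in z_j])  # x_ij ~ Multinomial(phi_{z_ij})
        docs.append(x_j)
        topics.append(z_j)
    return docs, topics, phi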
Gibbs Sampling
• Full joint probability:
p(θ, z, φ, x | α, β) = ∏_{k=1..K} p(φ_k | β) · ∏_{j=1..D} [ p(θ_j | α) ∏_{i=1..N_j} p(z_ij | θ_j) p(x_ij | φ_{z_ij}) ]
• Gibbs sampling: sample φ, θ, z in turn, each conditioned on the rest
• Problem: slow convergence (slow mixing)
• Collapsed Gibbs sampling
– Integrate out φ and θ analytically
p(z | x, d, α, β) ∝ [(N′_xz + β) / Σ_x (N′_xz + β)] · [(N′_dz + α) / Σ_z (N′_dz + α)]
(primed counts exclude the word currently being resampled)
– Until convergence (sketched below):
• Resample z_ij ∼ p(z_ij | x_ij, α, β)
• Update counts: N_z, N_dz, N_xz
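A minimal single-machine sketch of the collapsed sampler above (Python/NumPy). The counts N_xz, N_dz, N_z match the slide's notation; the hyperparameter values and random initialization are illustrative assumptions.

import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """docs: list of integer arrays of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    N_xz = np.zeros((V, K))                                # word-topic counts
    N_dz = np.zeros((D, K))                                # document-topic counts
    N_z = np.zeros(K)                                      # topic counts
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for x, k in zip(doc, z[d]):
            N_xz[x, k] += 1; N_dz[d, k] += 1; N_z[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, x in enumerate(doc):
                k = z[d][i]
                # Remove the current word from the counts: these are the primed counts N'.
                N_xz[x, k] -= 1; N_dz[d, k] -= 1; N_z[k] -= 1
                # p(z | x, d) ∝ (N'_xz + beta) / (N'_z + V*beta) * (N'_dz + alpha),
                # since Σ_x(N'_xz + beta) = N'_z + V*beta and the document-side
                # denominator is constant for a given document.
                p = (N_xz[x] + beta) / (N_z + V * beta) * (N_dz[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                N_xz[x, k] += 1; N_dz[d, k] += 1; N_z[k] += 1
    return z, N_xz, N_dz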
Parallel Collapsed Gibbs Sampling [SUML-Ch11]
• Synchronous version (MPI-based; a sketch follows at the end of this slide):
– Distribute documents among p machines
– Global topic and word-topic counts N_z, N_wz
– Local document-topic counts N_dz
– After each local iteration, AllReduce N_z, N_wz
• Asynchronous version: gossip (P2P)
– Random pairs of processors exchange statistics upon pass completion
– Approximate global posterior distribution (experimentally not a problem)
– Additional estimation to properly account for previous counts from neighbor
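A minimal sketch of the synchronous, MPI-based scheme using mpi4py, under stated assumptions: local_sweep is a hypothetical callback standing in for a collapsed Gibbs sweep over this worker's documents (e.g., the sampler sketched above restricted to local data); only the word-topic count array is shown being AllReduced, and the topic totals N_z can be recovered as a column sum.

import numpy as np
from mpi4py import MPI

def parallel_lda_sync(local_docs, V, K, local_sweep, n_iters=100):
    """local_docs: this worker's share of the documents.
    local_sweep(docs, N_wz, N_dz): resamples z for local docs, mutating the counts."""
    comm = MPI.COMM_WORLD
    N_wz = np.zeros((V, K))                   # global word-topic counts (replicated copy)
    N_dz = np.zeros((len(local_docs), K))     # local document-topic counts
    for _ in range(n_iters):
        before = N_wz.copy()
        local_sweep(local_docs, N_wz, N_dz)   # local Gibbs iteration
        delta = N_wz - before                 # this worker's count changes
        # AllReduce: every worker receives the sum of all workers' deltas.
        total = np.empty_like(delta)
        comm.Allreduce(delta, total, op=MPI.SUM)
        N_wz = before + total                 # apply the combined global update
        # N_z = N_wz.sum(axis=0) whenever topic totals are needed.
    return N_wz, N_dz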
Parallel Collapsed Gibbs Sampling [SN10, S11]
• Multithreading to maximize concurrency
– Parallelize both local and global updates of N_xz counts
– Key trick: N_z and N_xz are effectively constant for a given document
• No need to update them continuously: update once per document in a separate thread
• Enables multithreading the samplers
– Global updates are asynchronous → no blocking (see the sketch below)
[S11]
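A minimal sketch of this multithreaded pattern in Python, under stated assumptions: sample_document is a hypothetical callback that resamples one document against a snapshot of the global counts and returns the resulting count delta. It illustrates the decoupling of sampler threads from the global update thread rather than a tuned implementation (CPython's GIL limits true concurrency for pure-Python samplers).

import queue
import threading
import numpy as np

def run_multithreaded_lda(docs, V, K, sample_document, n_samplers=4):
    N_xz = np.zeros((V, K))          # shared word-topic counts
    updates = queue.Queue()          # per-document deltas from samplers to the updater

    def updater():
        while True:
            delta = updates.get()
            if delta is None:        # shutdown signal
                return
            np.add(N_xz, delta, out=N_xz)   # asynchronous global update, in place

    def sampler(my_docs):
        for doc in my_docs:
            # N_xz is treated as effectively constant while one document is sampled.
            delta = sample_document(doc, N_xz)
            updates.put(delta)       # hand off the update; never block on the global table

    upd = threading.Thread(target=updater)
    upd.start()
    workers = [threading.Thread(target=sampler, args=(docs[i::n_samplers],))
               for i in range(n_samplers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    updates.put(None)                # stop the updater once all samplers are done
    upd.join()
    return N_xz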
Scaling Up Graphical Models: Conclusions
• Extremely high parallelism is achievable, but variance is high
– Strongly data-dependent
• Network and synchronization costs can be explicitly accounted for in algorithms
• Approximations are essential to removing barriers
• Multi-level parallelism allows maximizing utilization
• Multiple caches allow super-linear speedups
References
[SUML-Ch11] A. Asuncion, P. Smyth, M. Welling, D. Newman, I. Porteous, and S. Triglia. Distributed Gibbs Sampling for Latent Variable Models. In "Scaling Up Machine Learning", Cambridge U. Press, 2011.
[B+03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[B11] D. Blei. Introduction to Probabilistic Topic Models. Communications of the ACM, 2011.
[SUML-Ch10] J. Gonzalez, Y. Low, and C. Guestrin. Parallel Belief Propagation in Factor Graphs. In "Scaling Up Machine Learning", Cambridge U. Press, 2011.
[KF10] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2010.
[M01] K. Murphy. An Introduction to Graphical Models, 2001.
[M04] K. Murphy. Approximate Inference in Graphical Models. AAAI Tutorial, 2004.
[S11] A. J. Smola. Graphical Models for the Internet. MLSS Tutorial, 2011.
[SN10] A. J. Smola and S. Narayanamurthy. An Architecture for Parallel Topic Models. VLDB, 2010.
[W08] M. Wainwright. Graphical Models and Variational Methods. ICML Tutorial, 2008.