Gibbs Sampling for LDA - Brigham Young University
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
CS 679: Text Mining
Lecture #13: Gibbs Sampling for LDA
Credit: Many slides are from presentations by Tom Griffiths of Berkeley.
Announcements
• Required reading for today:
  - Griffiths & Steyvers: “Finding Scientific Topics”
• Final Project Proposal
  - Clear, detailed: ideally, the first half of your project report!
  - Talk to me about ideas
  - Teams are an option
  - Due date to be specified
Objectives
• Gain further understanding of LDA
• Understand the intractability of inference with the model
• Gain further insight into Gibbs sampling
• Understand how to estimate the parameters of interest in LDA using a collapsed Gibbs sampler
Latent Dirichlet Allocation
(slightly different symbols this time)
(Blei, Ng, & Jordan, 2001; 2003)

• Dirichlet priors: α and β
• θ^(d) ∼ Dirichlet(α): distribution over topics for each document d = 1, …, D
• φ^(j) ∼ Dirichlet(β): distribution over words for each topic j = 1, …, T
• z_i ∼ Categorical(θ^(d)): topic assignment for each word position i = 1, …, N_d in document d
• w_i ∼ Categorical(φ^(z_i)): word generated from its assigned topic
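To make the generative story concrete, here is a minimal NumPy sketch of drawing a tiny synthetic corpus from this model; the corpus sizes, hyperparameter values, and variable names are illustrative assumptions, not anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, W, N_d = 2, 5, 6, 10           # topics, documents, vocabulary size, words per document
alpha, beta = 0.5, 0.1               # symmetric Dirichlet hyperparameters (illustrative values)

phi = rng.dirichlet([beta] * W, size=T)      # phi[j]: word distribution for topic j
theta = rng.dirichlet([alpha] * T, size=D)   # theta[d]: topic distribution for document d

docs = []
for d in range(D):
    z = rng.choice(T, size=N_d, p=theta[d])                # topic assignment for each word
    w = np.array([rng.choice(W, p=phi[j]) for j in z])     # word drawn from its assigned topic
    docs.append((w, z))
```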
The Statistical Problem of Meaning
• Generating data from parameters is easy
• Learning parameters from data is hard
• What does it mean to identify the “meaning” of a document?
Estimation of the LDA Generative Model
• Maximum likelihood estimation (EM)
  - Similar to the method presented by Hofmann for pLSI (1999)
• Deterministic approximate algorithms
  - Variational EM (Blei, Ng & Jordan, 2001, 2003)
  - Expectation propagation (Minka & Lafferty, 2002)
• Markov chain Monte Carlo: our focus
  - Full Gibbs sampler (Pritchard et al., 2000)
  - Collapsed Gibbs sampler (Griffiths & Steyvers, 2004)
  - The papers you read for today
Review: Markov Chain Monte Carlo (MCMC)
• Sample from a Markov chain
  - Converges to a target distribution
• Allows sampling from an unnormalized posterior distribution
• Can compute approximate statistics from intractable distributions
(MacKay, 2002)
Review: Gibbs Sampling
• Most straightforward kind of MCMC
• For variables x_1, x_2, …, x_n
• Requires the full (or “complete”) conditional distribution for each variable:
  Draw x_i^(t) from P(x_i | x_{-i}) = P(x_i | MB(x_i)),
  where x_{-i} = x_1^(t), x_2^(t), …, x_{i-1}^(t), x_{i+1}^(t-1), …, x_n^(t-1)
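As a toy illustration of this recipe (my own sketch, not from the slides), the code below runs a Gibbs sampler for a bivariate normal with correlation rho, where both full conditionals are available in closed form; the target distribution and parameter values are assumptions chosen for the example.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, iters=5000, rng=None):
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).

    Each full conditional is univariate normal:
      x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
    """
    rng = rng or np.random.default_rng(0)
    x1, x2 = 0.0, 0.0
    samples = np.empty((iters, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for t in range(iters):
        x1 = rng.normal(rho * x2, sd)   # draw x1 from P(x1 | x2)
        x2 = rng.normal(rho * x1, sd)   # draw x2 from P(x2 | x1), using the fresh x1
        samples[t] = (x1, x2)
    return samples

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws[1000:].T))      # empirical correlation is close to 0.8 after burn-in
```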
Bayesian Inference in LDA
• We would like to reason with the full joint distribution:
$$P(w, z, \Phi, \Theta \mid \alpha, \beta) = P(w, z \mid \Phi, \Theta)\, P(\Phi \mid \beta)\, P(\Theta \mid \alpha)$$
• Given w, the distribution over the latent variables is desirable, but the denominator (the marginal likelihood) is intractable to compute:
$$P(z, \Phi, \Theta \mid w, \alpha, \beta) = \frac{P(w, z, \Phi, \Theta \mid \alpha, \beta)}{P(w \mid \alpha, \beta)}$$
• We marginalize the model parameters out of the joint distribution so that we can focus on the words in the corpus (w) and their assigned topics (z):
$$P(w, z \mid \alpha, \beta) = \int_\Phi \int_\Theta P(w, z \mid \Phi, \Theta)\, P(\Phi \mid \beta)\, P(\Theta \mid \alpha)\, d\Theta\, d\Phi$$
• This leads to our use of the term “collapsed sampler”
Posterior Inference in LDA
• From this marginalized joint distribution, we can compute the posterior distribution over topics for a given corpus (w):
$$P(z \mid w, \alpha, \beta) = \frac{P(w, z \mid \alpha, \beta)}{P(w \mid \alpha, \beta)} = \frac{P(w, z \mid \alpha, \beta)}{\sum_{z} P(w, z \mid \alpha, \beta)}$$
• But there are T^n possible topic assignments z, where n is the number of tokens in the corpus!
  - i.e., inference is still intractable!
• Working with this topic posterior is only tractable up to a constant multiple:
$$P(z \mid w, \alpha, \beta) \propto P(w, z \mid \alpha, \beta)$$
Collapsed Gibbs Sampler for LDA
• Since we’re now focusing on the topic posterior, namely:
$$P(z \mid w, \alpha, \beta) \propto P(w, z \mid \alpha, \beta) = P(w \mid z, \alpha, \beta)\, P(z \mid \alpha, \beta)$$
• Let’s find these factors by marginalizing separately:
$$P(z \mid \alpha, \beta) = \int P(z \mid \Theta)\, p(\Theta \mid \alpha)\, d\Theta = \prod_{d=1}^{D} \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \cdot \frac{\prod_{j} \Gamma\!\big(n_j^{(d)} + \alpha\big)}{\Gamma\!\big(\sum_{j} n_j^{(d)} + T\alpha\big)}$$
$$P(w \mid z, \alpha, \beta) = \int P(w \mid z, \Phi)\, p(\Phi \mid \beta)\, d\Phi = \prod_{j=1}^{T} \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \cdot \frac{\prod_{w} \Gamma\!\big(n_j^{(w)} + \beta\big)}{\Gamma\!\big(\sum_{w} n_j^{(w)} + W\beta\big)}$$
Where:
• n_j^(w) is the number of times word w is assigned to topic j
• n_j^(d) is the number of times topic j is used in document d
Collapsed Gibbs Sampler for LDA
• We only sample each z_i!
• Complete (or full) conditionals can now be derived for each z_i in z:
$$P(z_i = j \mid z_{-i}, w) \propto P(w_i \mid z_i = j, z_{-i}, w_{-i})\, P(z_i = j \mid z_{-i}) = \frac{n_{-i,j}^{(w_i)} + \beta}{\sum_{w} n_{-i,j}^{(w)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{\sum_{k} n_{-i,k}^{(d_i)} + T\alpha}$$
Where:
• d_i is the document in which word w_i occurs
• n_{-i,j}^(w) is the number of times word w is assigned to topic j, ignoring position i
• n_{-i,j}^(d) is the number of times topic j is used in document d, ignoring position i
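A minimal sketch, assuming dense NumPy count matrices named nwt and ndt (my own notation, not the slides’), of evaluating this full conditional for a single token:

```python
import numpy as np

def conditional_z(w_i, d_i, nwt, ndt, alpha, beta):
    """Full conditional P(z_i = j | z_-i, w) over topics j for one token.

    nwt[w, j]: count of word w assigned to topic j, with token i already removed.
    ndt[d, j]: count of topic j used in document d, with token i already removed.
    """
    W, T = nwt.shape
    left = (nwt[w_i, :] + beta) / (nwt.sum(axis=0) + W * beta)        # word-given-topic factor
    right = (ndt[d_i, :] + alpha) / (ndt[d_i, :].sum() + T * alpha)   # topic-given-document factor
    p = left * right
    return p / p.sum()    # normalize so the T probabilities can be sampled from directly
```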
Steps for deriving the complete conditionals
1. Begin with the full joint distribution over the data, latent variables, and model parameters, given the fixed hyperparameters α and β of the prior distributions.
2. Write out the desired collapsed joint distribution and set it equal to the appropriate integral over the full joint in order to marginalize over Φ and Θ.
3. Perform algebra and group like terms.
4. Expand the generic notation by applying the closed-form definitions of the Multinomial, Categorical, and Dirichlet distributions.
5. Transform the representation: change the product indices from products over documents and word sequences to products over cluster labels and token counts.
6. Simplify by combining products, adding exponents, and pulling constant multipliers outside of integrals.
7. When you have integrals over terms that are in the form of the kernel of the Dirichlet distribution, consider how to convert the result into a familiar distribution.
8. Once you have the expression for the joint, derive the expression for the conditional distribution.
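As a worked illustration of steps 2 through 7 (my own sketch, not from the slides), here is the Θ integral carried out; it reproduces the P(z | α, β) factor quoted two slides earlier:

$$P(z \mid \alpha, \beta) = \int P(z \mid \Theta)\, p(\Theta \mid \alpha)\, d\Theta = \prod_{d=1}^{D} \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \int \prod_{j=1}^{T} \big(\theta_j^{(d)}\big)^{\,n_j^{(d)} + \alpha - 1}\, d\theta^{(d)} = \prod_{d=1}^{D} \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \cdot \frac{\prod_{j} \Gamma\big(n_j^{(d)} + \alpha\big)}{\Gamma\big(\sum_{j} n_j^{(d)} + T\alpha\big)}$$

The middle expression groups the Categorical factors into the counts n_j^(d) (step 5) and pulls the Dirichlet normalizing constant outside the integral (step 6); what remains under the integral is the kernel of a Dirichlet(n^(d) + α) distribution, so it evaluates to the ratio of Gamma functions shown (step 7). The Φ integral is handled the same way, and dividing the resulting joint by the same expression with token i withheld yields the full conditional on the earlier slide (step 8).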
Collapsed Gibbs Sampler for LDA
For t = 1 to burn + length:
  For the variables z = z_1, z_2, …, z_n (i.e., for i = 1 to n):
    Draw z_i^(t) from P(z_i | z_{-i}, w), where
    z_{-i} = z_1^(t), z_2^(t), …, z_{i-1}^(t), z_{i+1}^(t-1), …, z_n^(t-1)
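Putting the pieces together, here is a compact sketch of the whole collapsed sampler under assumed data structures (documents as a list of word-id arrays, dense count matrices); it is an illustration of the loop above, not the authors’ code.

```python
import numpy as np

def collapsed_gibbs(docs, W, T, alpha, beta, iters=1000, rng=None):
    """docs: list of 1-D int arrays of word ids. Returns assignments and counts."""
    rng = rng or np.random.default_rng(0)
    z = [rng.integers(T, size=len(doc)) for doc in docs]      # random initialization
    nwt = np.zeros((W, T))                                    # word-topic counts
    ndt = np.zeros((len(docs), T))                            # document-topic counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1
            ndt[d, z[d][i]] += 1
    for _ in range(iters):                                    # burn-in + sampling sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                nwt[w, j] -= 1; ndt[d, j] -= 1                # remove token i from the counts
                p = ((nwt[w] + beta) / (nwt.sum(axis=0) + W * beta)
                     * (ndt[d] + alpha) / (ndt[d].sum() + T * alpha))
                j = rng.choice(T, p=p / p.sum())              # draw z_i from its full conditional
                z[d][i] = j
                nwt[w, j] += 1; ndt[d, j] += 1                # add token i back with its new topic
    return z, nwt, ndt
```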
Collapsed Gibbs Sampler for LDA
• This is nicer than your average Gibbs sampler:
  - Memory: the counts (the n_j^(w) and n_j^(d) statistics) can be cached in two sparse matrices
  - No special functions, simple arithmetic
  - The distributions on Φ and Θ are analytic given the topic assignments z and the words w, and can later be recomputed from the samples in a given iteration of the sampler:
    Φ from w | z
    Θ from z
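Continuing that sketch, the smoothed point estimates of Φ and Θ can be read directly off the cached count matrices; nwt and ndt are the assumed names from the illustration above.

```python
import numpy as np

def estimate_phi_theta(nwt, ndt, alpha, beta):
    """Point estimates of Phi (W x T) and Theta (D x T) from one sample's count matrices."""
    W, T = nwt.shape
    phi = (nwt + beta) / (nwt.sum(axis=0, keepdims=True) + W * beta)      # phi[w, j] estimates P(w | topic j)
    theta = (ndt + alpha) / (ndt.sum(axis=1, keepdims=True) + T * alpha)  # theta[d, j] estimates P(topic j | doc d)
    return phi, theta
```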
Gibbs sampling in LDA
An illustration with T = 2 topics, M = 5 documents, N_d = 10 words per document (50 tokens in all).

iteration 1:
 i   w_i           d_i   z_i
 1   MATHEMATICS    1     2
 2   KNOWLEDGE      1     2
 3   RESEARCH       1     1
 4   WORK           1     2
 5   MATHEMATICS    1     1
 6   RESEARCH       1     2
 7   WORK           1     2
 8   SCIENTIFIC     1     1
 9   MATHEMATICS    1     2
10   WORK           1     1
11   SCIENTIFIC     2     1
12   KNOWLEDGE      2     1
 .       .          .     .
50   JOY            5     2
Gibbs sampling in LDA (iteration 2)
Each z_i is resampled in turn from its full conditional,
$$P(z_i = j \mid z_{-i}, w) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{\sum_{w} n_{-i,j}^{(w)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{\sum_{k} n_{-i,k}^{(d_i)} + T\alpha},$$
where the counts exclude the token currently being sampled. Stepping through the first few tokens of the table above:
• i = 1 (MATHEMATICS, d = 1): previous z_1 = 2; new draw z_1 = 2
• i = 2 (KNOWLEDGE, d = 1): previous z_2 = 2; new draw z_2 = 1
• i = 3 (RESEARCH, d = 1): previous z_3 = 1; new draw z_3 = 1
• i = 4 (WORK, d = 1): previous z_4 = 2; new draw z_4 = 2
• i = 5 (MATHEMATICS, d = 1): previous z_5 = 1; sampled next, and so on through i = 50
Gibbs sampling in LDA (iterations 1, 2, …, 1000)
After many sweeps, the sampled assignments look like:

 i   w_i           d_i   z_i (iter. 1)   z_i (iter. 2)   …   z_i (iter. 1000)
 1   MATHEMATICS    1         2               2          …         2
 2   KNOWLEDGE      1         2               1          …         2
 3   RESEARCH       1         1               1          …         2
 4   WORK           1         2               2          …         1
 5   MATHEMATICS    1         1               2          …         2
 6   RESEARCH       1         2               2          …         2
 7   WORK           1         2               2          …         2
 8   SCIENTIFIC     1         1               1          …         1
 9   MATHEMATICS    1         2               2          …         2
10   WORK           1         1               2          …         2
11   SCIENTIFIC     2         1               1          …         2
12   KNOWLEDGE      2         1               2          …         2
 .       .          .         .               .          …         .
50   JOY            5         2               1          …         1
A Visual Example: Bars
• Sample each pixel from a mixture of topics: pixel = word, image = document
• A toy problem; just a metaphor for inference on text
• [Figures: documents generated from the topics; evolution of the topics (the φ matrix)]
Interpretable decomposition
• SVD gives a basis for the data, but not an interpretable one
• The true basis is not orthogonal, so rotation does no good
Effects of Hyper-parameters
• α and β control the relative sparsity of Θ and Φ
  - Smaller α: fewer topics per document
  - Smaller β: fewer words per topic
• Good assignments z are a compromise in sparsity
Bayesian model selection
• How many topics do we need?
• A Bayesian would consider the posterior: P(T | w) ∝ P(w | T) P(T)
• This involves summing over assignments z
Sweeping T [figure]
Analysis of PNAS abstracts
• Used all D = 28,154 abstracts from 1991-2001
• Used any word occurring in at least five abstracts and not on the “stop” list (W = 20,551)
• Segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus
• Also used the PNAS class designations for 2001 (Acknowledgment: Kevin Boyack)
Running the algorithm
• Memory requirements linear in T(W + D); runtime proportional to nT
• T = 50, 100, 200, 300, 400, 500, 600, (1000)
• Ran 8 chains for each T, with a burn-in of 1000 iterations and 10 samples/chain at a lag of 100
• All runs completed in under 30 hours on the Blue Horizon supercomputer at San Diego
How many topics? [figure]
Topics by Document Length [figure]
A Selection of Topics
Each line lists the most probable words under P(w | z) for one discovered topic:
• FORCE, SURFACE, MOLECULES, SOLUTION, SURFACES, MICROSCOPY, WATER, FORCES, PARTICLES, STRENGTH, POLYMER, IONIC, ATOMIC, AQUEOUS, MOLECULAR, PROPERTIES, LIQUID, SOLUTIONS, BEADS, MECHANICAL
• HIV, VIRUS, INFECTED, IMMUNODEFICIENCY, CD4, INFECTION, HUMAN, VIRAL, TAT, GP120, REPLICATION, TYPE, ENVELOPE, AIDS, REV, BLOOD, CCR5, INDIVIDUALS, ENV, PERIPHERAL
• MUSCLE, CARDIAC, HEART, SKELETAL, MYOCYTES, VENTRICULAR, MUSCLES, SMOOTH, HYPERTROPHY, DYSTROPHIN, HEARTS, CONTRACTION, FIBERS, FUNCTION, TISSUE, RAT, MYOCARDIAL, ISOLATED, MYOD, FAILURE
• STRUCTURE, ANGSTROM, CRYSTAL, RESIDUES, STRUCTURES, STRUCTURAL, RESOLUTION, HELIX, THREE, HELICES, DETERMINED, RAY, CONFORMATION, HELICAL, HYDROPHOBIC, SIDE, DIMENSIONAL, INTERACTIONS, MOLECULE, SURFACE
• NEURONS, BRAIN, CORTEX, CORTICAL, OLFACTORY, NUCLEUS, NEURONAL, LAYER, RAT, NUCLEI, CEREBELLUM, CEREBELLAR, LATERAL, CEREBRAL, LAYERS, GRANULE, LABELED, HIPPOCAMPUS, AREAS, THALAMIC
• TUMOR, CANCER, TUMORS, HUMAN, CELLS, BREAST, MELANOMA, GROWTH, CARCINOMA, PROSTATE, NORMAL, CELL, METASTATIC, MALIGNANT, LUNG, CANCERS, MICE, NUDE, PRIMARY, OVARIAN
Cold topics and hot topics
[Figure: trends of the coldest and hottest topics across the years of the corpus]

Hot topics:
• Topic 2: SPECIES, GLOBAL, CLIMATE, CO2, WATER, ENVIRONMENTAL, YEARS, MARINE, CARBON, DIVERSITY, OCEAN, EXTINCTION, TERRESTRIAL, COMMUNITY, ABUNDANCE
• Topic 134: MICE, DEFICIENT, NORMAL, GENE, NULL, MOUSE, TYPE, HOMOZYGOUS, ROLE, KNOCKOUT, DEVELOPMENT, GENERATED, LACKING, ANIMALS, REDUCED
• Topic 179: APOPTOSIS, DEATH, CELL, INDUCED, BCL, CELLS, APOPTOTIC, CASPASE, FAS, SURVIVAL, PROGRAMMED, MEDIATED, INDUCTION, CERAMIDE, EXPRESSION

Cold topics:
• Topic 37: CDNA, AMINO, SEQUENCE, ACID, PROTEIN, ISOLATED, ENCODING, CLONED, ACIDS, IDENTITY, CLONE, EXPRESSED, ENCODES, RAT, HOMOLOGY
• Topic 289: KDA, PROTEIN, PURIFIED, MOLECULAR, MASS, CHROMATOGRAPHY, POLYPEPTIDE, GEL, SDS, BAND, APPARENT, LABELED, IDENTIFIED, FRACTION, DETECTED
• Topic 75: ANTIBODY, ANTIBODIES, MONOCLONAL, ANTIGEN, IGG, MAB, SPECIFIC, EPITOPE, HUMAN, MABS, RECOGNIZED, SERA, EPITOPES, DIRECTED, NEUTRALIZING
Conclusions
• Estimation/inference in LDA is more or less straightforward using Gibbs sampling
  - i.e., easy!
  - Not so easy in all graphical models
Coming Soon
• Topical n-grams