Probabilistic topic models - LEAR

Probabilistic topic models - basics and extensions
Reading group “Advanced Topics in Manifold Learning”
May 15, 2012
Jakob Verbeek
LEAR team, INRIA, Grenoble, France
Goals of this talk

Basics
► Review of Probabilistic Latent Semantic Analysis (PLSA)
► Review of Latent Dirichlet Allocation (LDA)

Extensions
► Model selection with Chinese Restaurant Process (CRP)
► Hierarchical structure learning with Nested CRP
Motivation

Exploratory data analysis of text documents
► Modeling of document-level co-occurrence structure as “topics”
 Sports topic: match, player, wins, season, …
 Finance topic: bank, money, dollar, IMF, …
► Ignores finer structure, e.g. paragraphs, sentences, parts of speech, …

Applies to a much wider variety of data
► For example “Bag-of-words” image representation,
► Visual topics such as “building”, “beach”, …

Unsupervised learning of topic representation useful for other tasks
► Clustering
► Classification
► Retrieval
► Visualization
► …
Motivation

A form of low-rank matrix decomposition, like PCA but
► Different error measure
► Different constraints on coefficient matrices

C = A B,    C ∈ ℝ^(n×m), A ∈ ℝ^(n×k), B ∈ ℝ^(k×m)

Word-Document count matrix C decomposed into
► Word × Topic matrix A
► Topic × Document matrix B
► n words, m documents, k topics (see the sketch below)
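As a concrete illustration of the decomposition above, here is a minimal Python sketch; the toy matrix sizes and the use of scikit-learn's NMF with a KL-divergence loss are illustrative assumptions (KL-based NMF is closely related to, but not identical to, the PLSA model discussed next):

```python
import numpy as np
from sklearn.decomposition import NMF

n_words, n_docs, n_topics = 1000, 200, 10   # n, m, k in the slide's notation

# Toy word-document count matrix C (n x m), non-negative integers
rng = np.random.default_rng(0)
C = rng.poisson(lam=0.05, size=(n_words, n_docs)).astype(float)

# Non-negative factorization C ~ A B with a KL-divergence error measure
nmf = NMF(n_components=n_topics, beta_loss="kullback-leibler",
          solver="mu", init="random", random_state=0, max_iter=500)
A = nmf.fit_transform(C)      # word x topic matrix, shape (n, k)
B = nmf.components_           # topic x document matrix, shape (k, m)

print(A.shape, B.shape)       # (1000, 10) (10, 200)
```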
Probabilistic Latent Semantic Analysis (PLSA)

“Unsupervised Learning by Probabilistic Latent Semantic Analysis”
► Thomas Hofmann, Machine Learning Journal, 42(1), 2001, pp. 177-196

Learning of “contexts of word usage”
► Synonyms: should be part of the same “context” or topic
► Polysemy (bank): same word can appear in various contexts or topics
 “river bank”, “financial bank”

Probabilistic version of Latent Semantic Analysis (Deerwester et al 1990)
► Uses SVD / PCA to project word count vectors to compact vectors
► L2 reconstruction error implies inappropriate Gaussian noise model
► Reconstruction might contain negative counts …
Latent Semantic Analysis

“Data sparseness” or “zero-frequency” problem
► (Small) text documents typically lead to very sparse count vectors ~ 1%
► Document matching hard without using notions of similarity between terms
 Synonym matching might be pessimistic (might miss matching meanings)
 Polysemy might be too optimistic (might match different meanings)

Approach: map term frequency vectors to “Latent Semantic Space”
► Dimension reduction of ~3 orders of magnitude, e.g. from 10⁵ to 10² dims.
► Restriction to linear transformation
► Minimize Frobenius norm: sum of squared element-wise errors
► Solution given by SVD / PCA

New representation is generally dense instead of sparse
► Does not require exact term matching to generate similarity
► Similar terms are expected to be mapped to similar latent coordinates
Probabilistic Latent Semantic Analysis

Probabilistic model for term frequencies
► Associates a latent variable “z” with each term occurrence
► Encodes in which of the K different contexts / topics the term appears

p(w∣d) = ∑k p(z=k∣d) p(w∣z=k)

► p(w|d) : probability to find word w (“cat”) when picking a word at random
from a specific given document d
► p(z=k|d) : probability to find topic k when picking a word context/topic
(e.g. animals) at random from that specific document d
► p(w|z) : probability to find word w when picking a word at random from
context / topic z

Non-negative decomposition: Reconstructions are strictly non-negative
► Reconstructed p(w|d) also sums to 1 over words
Probabilistic Latent Semantic Analysis

Probabilistic model for term frequencies; to sample a new word, we
► Sample a new topic, conditioned on the document, from p(z|d)
► Sample a new word, conditioned on the topic, from p(w|z)
p(w∣d) = ∑k p(z=k∣d) p(w∣z=k)

Models word frequency in documents as a convex combination of topics
► Goal: identify p(w|z) that enables accurate reconstructions of this form
► Maximum likelihood criterion (see the numerical sketch below)
L = ∑n ∑m C_nm log p(w=n∣d=m)
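A small NumPy sketch of this sampling view and of the likelihood criterion; all values (vocabulary size, topic distributions, counts) are toy choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_docs, n_topics = 50, 10, 3

# Toy PLSA parameters: p(w|z) and p(z|d) stored as column-stochastic arrays
p_w_given_z = rng.dirichlet(np.ones(n_words), size=n_topics).T   # (n_words, n_topics)
p_z_given_d = rng.dirichlet(np.ones(n_topics), size=n_docs).T    # (n_topics, n_docs)

# Reconstruction p(w|d) = sum_k p(z=k|d) p(w|z=k): a convex combination of topics
p_w_given_d = p_w_given_z @ p_z_given_d                          # (n_words, n_docs)

# Sampling a word for document d: first a topic from p(z|d), then a word from p(w|z)
d = 0
z = rng.choice(n_topics, p=p_z_given_d[:, d])
w = rng.choice(n_words, p=p_w_given_z[:, z])

# Maximum-likelihood criterion L = sum_{n,m} C_nm log p(w=n|d=m) on a toy count matrix
C = rng.poisson(1.0, size=(n_words, n_docs))
L = np.sum(C * np.log(p_w_given_d))
print(z, w, L)
```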
Probabilistic Latent Semantic Analysis

Learning with EM algorithm
► Tool used for parameter estimation of many latent variable models
 Mixture models most famous example
► Here, each document is modeled as a mixture of topics
 Mixing weights vary per document, and need to be estimated
 Topics are shared across documents

Iterations guaranteed to converge to a (local) maximum of the log-likelihood
► E-step: Inference of context/topic of each word in each document
p(z=k∣w,d) = p(w∣z=k) p(z=k∣d) / p(w∣d)
► M-step: Update of parameter estimates given inference result (see the EM sketch below)
p(w∣z=k) ∝ ∑d C_wd p(z=k∣d,w)
p(z=k∣d) ∝ ∑w C_wd p(z=k∣d,w)
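A minimal NumPy implementation of these E- and M-step updates on a toy word × document count matrix; the dense 3-D posterior array is an illustrative simplification that only works for small vocabularies:

```python
import numpy as np

def plsa_em(C, n_topics, n_iter=100, seed=0):
    """EM for PLSA on a word x document count matrix C (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = C.shape
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics).T   # p(w|z), (n_words, n_topics)
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs).T    # p(z|d), (n_topics, n_docs)
    for _ in range(n_iter):
        # E-step: p(z=k|w,d) ∝ p(w|z=k) p(z=k|d), one (n_words, n_topics, n_docs) array
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True).clip(min=1e-12)
        # M-step: p(w|z=k) ∝ sum_d C_wd p(z=k|d,w);  p(z=k|d) ∝ sum_w C_wd p(z=k|d,w)
        weighted = C[:, None, :] * post                         # C_wd * p(z=k|d,w)
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d

# Toy usage
C = np.random.default_rng(1).poisson(1.0, size=(30, 8))
p_w_z, p_z_d = plsa_em(C, n_topics=4)
```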
Probabilistic Latent Semantic Analysis

Continuous document representation by mixing weights of the topics
p(w∣d) = ∑k p(z=k∣d) p(w∣z=k)
► Latent variables are discrete per-word contexts / topics
► Maps term frequency vector to a point on the (K-1)-dimensional simplex
 Given topics, optimization of mixing weights is convex (not closed form)
 Minimizes KL divergence between empirical and model version of p(w|d)
Alternatives to Probabilistic Latent Semantic Analysis

PLSA: Continuous document representation by topic mixing weights
p({w}∣d) = ∏i ∑k p(wi∣zi=k) p(zi=k∣d)
► Topic mixing for documents relatively sparse

Alternatively: associate a single topic / cluster label to each document
p({w}∣d) = ∑k p(z=k) ∏i p(wi∣z=k)
► The topic posteriors p(z|{wi}) are mapped near corners of the simplex

Alternatively: independent model with document-independent mixing weights
p({w}∣d) = p({w}) = ∏i ∑k p(wi∣zi=k) p(zi=k)
► Topic of a word cannot be disambiguated by other words in the document
► Words are individually mapped to the topic simplex (and then averaged)
► Would lead to less sparse topic vectors than PLSA
Example of topics learned on abstracts about “clustering”
Most likely terms in 4 topics
[Figure: most likely terms in 4 topics; panel titles include medical imaging, video segmentation, boundary detection, and phonetic segmentation]
Experimental results

Comparison between PLSA and LSA in terms of perplexity
► Perplexity is a form of log-likelihood on held-out test data
► Better predictions using PLSA
► Tempered EM algorithm better than naive EM implementation

Comparison between PLSA and LSA in terms of retrieval performance
► PLSA consistently improves on three data sets over the complete recall range
► LSA not always better than direct use of cosine similarity
Experimental results

Correlation between data likelihood and retrieval performance
► Varying the beta parameter of the Tempered-EM training algorithm
► Optimal values for likelihood and retrieval are similar
► Log-likelihood may serve as a proxy for measures that are difficult to optimize directly
Goals of this talk

Basics
► Review of Probabilistic Latent Semantic Analysis (PLSA)
► Review of Latent Dirichlet Allocation (LDA)

Extensions
► Model selection with Chinese Restaurant Process (CRP)
► Hierarchical structure learning with Nested CRP
Latent Dirichlet Allocation

“Latent Dirichlet Allocation”
► Blei, Ng, Jordan, J. of Machine Learning Research, 3, 2003, pp. 993-1022

As a probabilistic model PLSA fails to account for the topic mixing
proportions p(z|d)
p({w}∣d) = ∏i ∑k p(wi∣zi=k) p(zi=k∣d)
► How do we define this probability if we do not have p(z|d) ?!
► Nr of parameters grows with each additional document: overfitting !

Instead we will treat the topic-document distribution as a latent variable
► A-priori unknown
► Define a distribution over the topic-document distributions
► Compute posterior on p(z|d) given the document word counts
Latent Dirichlet Allocation

An exchangeable model for document-word counts
► Ordering of words (or documents) does not affect modeling
► Words are not independent and identically distributed (iid)
► Roughly: conditioned on a latent variable the data is iid

De Finetti's representation theorem:
► Any collection of exchangeable random variables has a representation as
a mixture distribution – potentially an infinite mixture

Thus if we wish to use exchangeable representations we need to
consider mixture models
► Does not imply simple iid data models, word-count representations, or
linear transformations thereof
► Exchangeability does bring a major simplification of text modeling
Dirichlet distribution on topic proportions

Probability density function on discrete probability distributions
► Vectors of non-negative numbers that sum to 1

p(θ) = ( Γ(∑k αk) / ∏k Γ(αk) ) ∏k θk^(αk − 1)

Expected value: L1 normalization of parameters
► E_p(θ)[θk] = αk / ∑k αk

Effect of the additional degree of freedom
► Small alphas: sparse, few non-zero entries in theta
► Large alphas: dense, many non-zero entries in theta (see the sampling sketch below)
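A quick NumPy illustration of this sparsity effect of the concentration parameters (the values 0.1 and 10 are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10

# Small alphas: samples concentrate near the corners/edges of the simplex (near-sparse theta)
theta_sparse = rng.dirichlet(np.full(K, 0.1))
# Large alphas: samples concentrate near the centre of the simplex (dense theta)
theta_dense = rng.dirichlet(np.full(K, 10.0))

print(np.round(theta_sparse, 3))   # a few entries carry most of the mass
print(np.round(theta_dense, 3))    # mass spread evenly, close to E[theta_k] = alpha_k / sum_k alpha_k
```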
LDA: generative model

To sample words in a new document (of given length)
► Sample a topic distribution for the complete document (from Dirichlet)
► Per word, independently:
 Sample a topic from the topic distribution (from multinomial)
 Sample word from topic-word distribution (from multinomial)

Similar to PLSA, documents associated with various topics
► Topic mixing proportions are document specific
► Topic-word distributions are shared across all documents in corpus
► Words can be associated with multiple topics

Difference with PLSA
► The document-specific topic mixing proportions are treated as latent variable
► Dirichlet prior over topic proportions is shared across documents in corpus (see the sketch below)
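A minimal sketch of this generative process in Python; the vocabulary size, number of topics, symmetric Dirichlet parameters alpha and eta, and document length are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, alpha, eta, doc_len = 1000, 5, 0.5, 0.01, 50  # illustrative values

# Topic-word distributions are shared across the corpus
beta = rng.dirichlet(np.full(n_words, eta), size=n_topics)       # (n_topics, n_words)

def sample_document(doc_len):
    # 1) Sample document-specific topic proportions from the Dirichlet prior
    theta = rng.dirichlet(np.full(n_topics, alpha))
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)          # 2) topic from multinomial(theta)
        w = rng.choice(n_words, p=beta[z])         # 3) word from multinomial(beta_z)
        words.append(w)
    return words

doc = sample_document(doc_len)
```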
LDA: generative model

Joint likelihood on all variables (observed and latent)
p(θ, {z}, {w}) = p(θ) ∏i p(zi∣θ) p(wi∣zi)
► topic likelihood given theta: p(zi=k∣θ) = θk

Marginal likelihood on observed words
p({w}) = ∫θ p(θ) ∏i ∑zi p(zi∣θ) p(wi∣zi)
LDA and exchangeability

Consider probability on a word given the topic mixing proportions theta
p(w∣θ) = ∑z p(z∣θ) p(w∣z)

Marginal likelihood on observed words
p({w}) = ∫θ p(θ) ∏i p(wi∣θ)

This form is a necessary consequence of de Finetti's theorem
► Exchangeable variables are iid given latent variable theta
► Marginalizing the latent variable gives a mixture of iid data models

In this case the latent variable can take any value in the simplex
► Results in a continuous mixture of iid models
Example of LDA distribution

Vocabulary of 3 words:
► Corners of the outer simplex in the plane
► Simplex contains all possible distributions over the three words

4 different topics
► Each topic, p(w|z), is a point in the word simplex
► Sparse Dirichlet prior on the topics
Comparison of LDA with other models

Unigram model
► All words in all documents generated from a global multinomial
p({w}) = ∏i p(wi)

Mixture of unigrams
► All words in a given document generated from a single topic
p({w}) = ∑z p(z) ∏i p(wi∣z)

PLSA
► Document-specific mixture of topics, different topic choice per word
► No modeling of topic mixing proportions, max. likelihood fitting to data
p({w}∣d) = ∏i ∑k p(zi=k∣d) p(wi∣zi=k)
Example of LDA distribution

Mixture of unigrams
► Document associated with a single topic

PLSA
► Empirical distribution on topic simplex

LDA
► Smooth density over the topic simplex
Inference of topic proportions in LDA model

Given a document, what can we say about the underlying topic proportions?
► Suppose we knew the word-topic associations
p(θ∣{z}) = Dir(α'),   α'k = αk + nk
► Posterior is Dirichlet, parameters increased by topic counts nk in the document (see the sketch below)

If we only have the document, we average over topic assignments
p(θ∣{w}) = ∑{z} p(θ∣{z}, {w}) p({z}∣{w}) = ∑{z} p(θ∣{z}) p({z}∣{w})
► Where we use that words are independent of theta given topic assignments
► Exponentially many topic assignments ( #topics ^ word count )
► Posterior is a mixture of Dirichlets with exponential nr of components

Computation and representation of posterior on theta is intractable
► Computation of data likelihood suffers from the same problem
p({w}) = ∑{z} p({z}) p({w}∣{z})
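The first case, where the topic assignments {z} are known, is easy to write down; a tiny NumPy sketch with hypothetical assignments and an illustrative symmetric alpha:

```python
import numpy as np

K = 4
alpha = np.full(K, 0.5)                        # prior parameters (illustrative)
z = np.array([0, 0, 2, 1, 0, 2, 2, 2, 1, 0])   # hypothetical word-topic assignments {z}

# Posterior p(theta | {z}) = Dir(alpha'), with alpha'_k = alpha_k + n_k
n = np.bincount(z, minlength=K)                # topic counts n_k in the document
alpha_post = alpha + n

# Posterior mean of theta (expected topic proportions)
print(alpha_post / alpha_post.sum())
```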
Approximation with variational inference

Difficulty due to coupling between theta and z
► Solution: Assume that they are independent in the posterior
q(θ, {z}) = q(θ) ∏i q(zi)

Fit approximate posterior to true posterior by minimizing KL divergence
D( q(θ, {z}) ∥ p(θ, {z}∣{w}) )
► Coordinate descent alternates:
 fix q(z), fit q(theta)
 fix q(theta), fit q(z)

Data log-likelihood bounded by free-energy
log p({w}) ≥ log p({w}) − D( q(θ, {z}) ∥ p(θ, {z}∣{w}) )
► Can be efficiently evaluated
► Parameter estimation based on maximizing this bound
 Results in three-step coordinate-ascent iterations (a single-document sketch follows below)
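A minimal mean-field sketch for a single document with fixed, known topics beta, using the standard LDA updates φ_ik ∝ β_k,wi exp(ψ(γ_k)) and γ_k = α_k + ∑_i φ_ik; the full variational EM that also re-estimates beta is not shown, and all numeric values are illustrative:

```python
import numpy as np
from scipy.special import digamma

def lda_var_inference(word_ids, beta, alpha, n_iter=50):
    """Mean-field q(theta)q(z) for one document, with known topics beta (K x V)."""
    K = beta.shape[0]
    gamma = alpha + len(word_ids) / K             # variational Dirichlet parameters of q(theta)
    phi = np.full((len(word_ids), K), 1.0 / K)    # q(z_i), one row per word
    for _ in range(n_iter):
        # Fix q(theta), fit q(z): phi_ik ∝ beta[k, w_i] * exp(digamma(gamma_k))
        phi = beta[:, word_ids].T * np.exp(digamma(gamma))[None, :]
        phi /= phi.sum(axis=1, keepdims=True)
        # Fix q(z), fit q(theta): gamma_k = alpha_k + sum_i phi_ik
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

# Toy usage with random topics (illustrative values only)
rng = np.random.default_rng(0)
K, V = 3, 20
beta = rng.dirichlet(np.ones(V), size=K)
gamma, phi = lda_var_inference(word_ids=[1, 5, 5, 7], beta=beta, alpha=np.full(K, 0.5))
```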
Example of LDA in action
Goals of this talk

Basics
► Review of Probabilistic Latent Semantic Analysis (PLSA)
► Review of Latent Dirichlet Allocation (LDA)

Extensions
► Model selection with Chinese Restaurant Process (CRP)
► Hierarchical structure learning with Nested CRP
Model selection with Chinese Restaurant Process

How to select the number of topics in an LDA model ?
► Cross-validation
► Bayesian inference

Define an infinite pool of topics for potential use
► Parameters of topics (distributions over words) treated as latent variables
► Prior on choice of topic (Dirichlet prior in LDA) needs to be modified to
accommodate an infinite nr of topics
Chinese Restaurant Process

Distribution on partitions of integers
► Imagine a process by which M customers sit down in a Chinese Restaurant
with an infinite number of tables
► First customer sits at the first table
► m-th customer sits at a table drawn from
p(occupied table i ∣ previous customers) = mi / (γ + (m−1))
p(next unoccupied table ∣ previous customers) = γ / (γ + (m−1))
► With mi the number of previous customers at table i

Generalization of the symmetric Dirichlet distribution over infinite partitions
► Also known as the Dirichlet process
► Single concentration parameter gamma controls the sparsity (see the sketch below)
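A short Python sketch of this seating process; the function name and parameter values are illustrative:

```python
import numpy as np

def sample_crp(M, gamma, seed=0):
    """Sample a partition of M customers via the Chinese Restaurant Process (a sketch)."""
    rng = np.random.default_rng(seed)
    counts = []                               # counts[i] = nr of customers at table i
    assignments = []
    for m in range(1, M + 1):                 # m-th customer (first one opens table 0)
        probs = np.array(counts + [gamma], dtype=float) / (gamma + m - 1)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                  # sit at a new, unoccupied table (prob ∝ gamma)
        else:
            counts[table] += 1                # join occupied table i (prob ∝ m_i)
        assignments.append(table)
    return assignments, counts

assignments, counts = sample_crp(M=20, gamma=1.5)
print(counts)     # occupancy of each table; the nr of tables is itself random
```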
Prior on topic-word distributions

Topic-word distributions treated as latent variables with associated Dirichlet
prior distributions
► Probability to draw word v from topic k denoted as p(wi=v∣zi=k) = βvk
► Beta parameters are unknown, with prior
p(βk) = Dir(βk; η) = ( Γ(∑v ηv) / ∏v Γ(ηv) ) ∏v βkv^(ηv − 1)
► Parameters of topics are iid sampled from the Dirichlet prior, and shared
across the document corpus

Number of parameters that needs to be estimated from data does not
depend on the number of topics that is used
► Number of parameters equals the size of the vocabulary
► Allows an infinite nr of topics, while fitting a finite nr of parameters
Generative process for Chinese Restaurant version of LDA

For the complete corpus
► Sample topic-word distributions beta
► For each document
 For each word i
 Sample topic assignment zi given previous assignments using the
Chinese Restaurant process
 Sample word wi from the corresponding topic distribution

The number of topics that is effectively used to generate words is
► Finite: since only a finite number of words is generated at any point
► Data adaptive: as more data comes in new topics can be allocated

Same principle used to define “infinite” versions of classic models, e.g.
► Mixture of Gaussians [Rasmussen, NIPS 2000]
► Hidden Markov models [Beal et al, NIPS 2002]
Inference for Chinese Restaurant version of LDA

Exact inference is again intractable

Approximate inference using Gibbs sampling
► Fix all-but-one of the word-topic assignments
► Sample the held-out topic assignment from its exact posterior

Unknown topic-word distributions can be integrated out analytically

Given all topic assignments, the held-out assignment can be sampled
from a simple-to-evaluate multinomial distribution (see the sketch below)
► nkv : count of word v associated with topic k (disregarding current word)
► mk : count of associations to topic k in document (disregarding current word)

p(zi ∣ {z}−i, {w}) ∝ p(zi ∣ {z}−i) p(wi ∣ {z}, {w}−i)

p(zi=k ∣ {z}−i) = 1/(γ + N − 1) × { γ for a new topic ; mk for an existing topic }

p(wi=v ∣ zi=k, {z}−i, {w}−i) = (nkv + ηv) / ∑v' (nkv' + ηv')
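A sketch of one collapsed Gibbs sampling step following the equations above, assuming a symmetric η for simplicity; the helper name, count arrays, and toy values are illustrative:

```python
import numpy as np

def sample_topic_assignment(n_kv, m_k, w_i, gamma, eta, V, rng):
    """Sample z_i for one word from its collapsed conditional (a sketch).

    n_kv : (K, V) word-topic counts excluding the current word
    m_k  : (K,) per-document topic counts excluding the current word
    w_i  : index of the current word
    """
    K = n_kv.shape[0]
    # Prior term: existing topic k ∝ m_k, a new topic ∝ gamma (the shared 1/(gamma+N-1) cancels)
    prior = np.append(m_k, gamma).astype(float)
    # Likelihood term: (n_kv + eta) / sum_v'(n_kv' + eta); a new topic has empty counts
    like_existing = (n_kv[:, w_i] + eta) / (n_kv.sum(axis=1) + V * eta)
    like_new = eta / (V * eta)
    probs = prior * np.append(like_existing, like_new)
    probs /= probs.sum()
    return rng.choice(K + 1, p=probs)         # returning K means: open a new topic

# Toy usage (illustrative values only)
rng = np.random.default_rng(0)
K, V = 3, 10
n_kv = rng.integers(0, 5, size=(K, V))
m_k = rng.integers(0, 4, size=K)
z_i = sample_topic_assignment(n_kv, m_k, w_i=2, gamma=1.0, eta=0.1, V=V, rng=rng)
```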
Goals of this talk

Basics
► Review of Probabilistic Latent Semantic Analysis (PLSA)
► Review of Latent Dirichlet Allocation (LDA)

Extensions
► Model selection with Chinese Restaurant Process (CRP)
► Hierarchical structure learning with Nested CRP
Extension of CRP to hierarchies

“Hierarchical topic models and the Nested Chinese Restaurant Process”
► Blei, Griffiths, Jordan, Tenenbaum, NIPS '04

Suppose there is an infinite number of Chinese Restaurants
► Each restaurant has an infinite number of tables
► Each table refers to another restaurant
► Each restaurant is referred to by only one table
► There is one “root” restaurant
► Then, all restaurants are organized into an infinitely branched tree
Extension of CRP to hierarchies

On an L-day vacation a tourist dines as follows
► On day 1, he goes to the root restaurant, and picks a table:
 Empty table with probability proportional to gamma
 Non-empty table with probability proportional to the nr of seated customers
► On subsequent nights, he
 goes to the restaurant referred to by the table of the night before
 chooses a table in the same manner

After the vacation the tourist has visited L restaurants
► Corresponds to a single path in the infinitely branching tree

After M tourists take L-day vacations, they have traced M paths
► Corresponds to a subtree with at most M leaves (see the sketch below)
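A small Python sketch of this path-sampling process; the dictionary-of-child-counts representation of the tree and the function name are my own illustrative choices:

```python
import numpy as np

def sample_ncrp_paths(M, L, gamma, seed=0):
    """Sample M paths of length L through the nested CRP tree (a sketch)."""
    rng = np.random.default_rng(seed)
    children = {(): []}                       # node (a path prefix) -> child occupancy counts
    paths = []
    for _ in range(M):                        # each "tourist" takes an L-day vacation
        node = ()
        for _level in range(L):
            counts = children.setdefault(node, [])
            probs = np.array(counts + [gamma], dtype=float)
            probs /= probs.sum()              # occupied table ∝ nr of customers, new table ∝ gamma
            c = rng.choice(len(probs), p=probs)
            if c == len(counts):
                counts.append(1)              # open a new table / branch
            else:
                counts[c] += 1
            node = node + (c,)                # follow the chosen table to the next restaurant
        paths.append(node)
    return paths

paths = sample_ncrp_paths(M=5, L=3, gamma=1.0)
print(paths)    # M paths; together they form a subtree with at most M leaves
```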
Extension of CRP to hierarchies

Can be used as a prior on L-level topic hierarchies
► Standard CRP models uncertainty about nr of topics in LDA
► Nested CRP models uncertainty about possible L-level trees
 Lower-level topics used by fewer documents: more specific/sparse topics
Extension of CRP to hierarchies

Associate a (latent) topic distribution over words with each restaurant
► Dirichlet prior on topic-word distributions with parameter eta

Given a path through the L-level hierarchy
► Draw vector of L mixing weights from Dirichlet prior with parameter alpha

Corpus-wide
► Sample (infinitely many) topic distributions beta from Dir(eta)

Per document
► Sample path in L-level tree using nested CRP
► Sample L level-proportions theta from Dir(alpha)
► Per word
 Sample level z from Mult(theta)
 Sample word w from topic found on level z of the path using Mult(beta) (see the sketch below)
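A minimal sketch of the per-document generative steps, assuming the document's path through the tree has already been drawn (e.g. with the nested CRP sketch above); here the L topics on the path are simply drawn from Dir(eta) instead of being looked up in a shared tree, and all numeric values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, L, alpha, eta, doc_len = 500, 3, 1.0, 0.05, 40      # illustrative values

# One topic-word distribution per restaurant on the document's path (each ~ Dir(eta))
path_topics = rng.dirichlet(np.full(V, eta), size=L)   # (L, V), level 0 = root

# Per document: level proportions theta from Dir(alpha), then per word a level and a word
theta = rng.dirichlet(np.full(L, alpha))
words = []
for _ in range(doc_len):
    z = rng.choice(L, p=theta)                 # level z ~ Mult(theta)
    w = rng.choice(V, p=path_topics[z])        # word w ~ Mult(beta at level z of the path)
    words.append(w)
```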
Graphical model representation

T denotes the (infinite) collection of paths in L-level trees
Approximate inference using Gibbs sampling

Not surprisingly, inference is intractable in this model

Gibbs sampler iterates over sampling
► Path in tree taken by each document
► Level in tree used to produce the individual words

Multinomial parameters are integrated out
► Level distributions theta
► Topic-word distributions beta

Four parameters control the model
► L: Number of levels in the tree
► Gamma: concentration parameter of nested CRP
► Alpha: concentration parameter of prior over level distributions
► Eta: concentration parameter of prior over topic-word distributions
Example of nested CRP trained on 1717 NIPS abstracts '87-'99

Top 8 words shown for each topic
[Figure: learned topic hierarchy with branches for function words, machine learning, neuroscience, speech / character recognition, and neural nets / control]
Summary of this talk

Basics
► Review of Probabilistic Latent Semantic Analysis (PLSA)
► Review of Latent Dirichlet Allocation (LDA)

Extensions
► Model selection with Chinese Restaurant Process
► Hierarchical structure learning with Nested Chinese Restaurant Process

PLSA is a form of (non-negative) matrix factorization

LDA defines proper generative model for same structure (Dirichlet prior)

Chinese restaurant process defines distributions over infinite partitions

Nested version introduces hierarchical coarse-to-fine structure