Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng & Michael I. Jordan
presented by Tilaye Alemu & Anand Ramkissoon
Motivation for LDA
In lay terms:
document modelling
text classification
collaborative filtering
...
...in the context of Information Retrieval
The principal focus in this paper is on document classification within a corpus
Structure of this talk
Part 1:
Theory
Background
(some) other approaches
Part 2:
Experimental results
some details of usage
wider applications
LDA: conceptual features
Generative
Probabilistic
Collections of discrete data
3-level hierarchical Bayesian model
mixture models
efficient approximate inference techniques
variational methods
EM algorithm for empirical Bayes parameter estimation
How to classify text documents
Word (term) frequency
tf-idf
term-by-document matrix
discriminative sets of words
fixed-length lists of numbers
little statistical structure
Dimensionality reduction techniques
Latent Semantic Indexing
Singular value decomposition
not generative
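As a concrete illustration of this tf-idf/LSI pipeline, a minimal sketch using scikit-learn; the toy corpus and the number of retained components are assumptions for the example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today"]   # toy corpus (assumed)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)         # tf-idf weighted document-term matrix (transpose of term-by-document)
svd = TruncatedSVD(n_components=2)    # LSI: low-rank approximation via truncated SVD
X_lsi = svd.fit_transform(X)          # each document becomes a fixed-length list of numbers
print(X_lsi.shape)                    # (3, 2)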
How to classify text documents (cont'd)
probabilistic LSI (PLSI)
each word generated by one topic
each document generated by a mixture of topics
a document is represented as a list of mixing proportions for topics
No generative model for these numbers
Number of parameters grows linearly with the size of the corpus
Overfitting
No clear way to classify documents outside the training set
A major simplifying assumption
A document is a “bag of words”
A corpus is a “bag of documents”
order is unimportant
exchangeability
de Finetti representation theorem
any collection of exchangeable random variables has a representation as a (generally infinite) mixture distribution
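In symbols, de Finetti's theorem says an infinitely exchangeable sequence of random variables w_1,...,w_N has a representation of the form

p(w_1, \ldots, w_N) = \int \Big( \prod_{n=1}^{N} p(w_n \mid \theta) \Big) p(\theta) \, d\theta

for some latent parameter θ with mixing distribution p(θ).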
A note about exchangeability
Does not mean that random variables are iid
They are iid when conditioned on an underlying latent parameter of a probability distribution
Conditionally, the joint distribution is simple and factored
Notation
word: unit of discrete data, an item from a vocabulary indexed {1,...,V}
each word is represented as a unit-basis vector of length V
document: a sequence of N words w = (w_1,...,w_N)
corpus: a collection of M documents D = (w_1,...,w_M)
Each document is considered a random mixture over latent topics
Each topic is considered a distribution over words
LDA assumes a generative process for each document in the corpus
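As a concrete illustration, a minimal Python sketch that samples one document from this generative process; the dimensions and parameter values are assumptions, and the Poisson draw of the document length used in the paper is replaced by a fixed N:

import numpy as np

rng = np.random.default_rng(0)
k, V, N = 3, 10, 20                        # topics, vocabulary size, words per document (assumed)
alpha = np.full(k, 0.5)                    # symmetric Dirichlet parameter (assumed)
beta = rng.dirichlet(np.ones(V), size=k)   # k topic-word distributions; each row sums to 1

theta = rng.dirichlet(alpha)               # per-document topic proportions: theta ~ Dir(alpha)
doc = []
for _ in range(N):
    z = rng.choice(k, p=theta)             # choose a topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])           # choose a word w_n from p(w_n | z_n, beta)
    doc.append(w)
print(doc)                                 # the document as a bag of vocabulary indices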
Probability density for the Dirichlet random variable
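For a k-dimensional Dirichlet random variable θ (θ_i ≥ 0, Σ_i θ_i = 1), the density is

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}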
Joint distribution of a topic mixture
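Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)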
Marginal distribution of a document
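Integrating over θ and summing over z gives the marginal distribution of a single document

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta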
Probability of a corpus
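Taking the product of the per-document marginals over the M documents gives

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d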
The word distribution: marginalize over z
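Marginalizing the topic variable z out gives the word distribution for a fixed topic mixture θ

p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta) \, p(z \mid \theta)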
The generative process
a Unigram Model
probabilistic Latent Semantic Indexing
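For comparison, the generative models behind these two alternatives are

\text{unigram model:} \quad p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

\text{pLSI:} \quad p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)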
Inference from LDA
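The key inferential problem is the posterior distribution of the hidden variables given a document,

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

which is intractable to compute exactly.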
Variational Inference
A family of distributions on the latent variables
The Dirichlet parameter γ and the multinomial parameters φ are the free variational parameters
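The variational family factorizes as

q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

with q(θ | γ) a Dirichlet and each q(z_n | φ_n) a multinomial.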
The update equations
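The coordinate-ascent updates are

\phi_{ni} \propto \beta_{i w_n} \exp\{ \mathrm{E}_q[\log \theta_i \mid \gamma] \} \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

where E_q[log θ_i | γ] = Ψ(γ_i) − Ψ(Σ_j γ_j) and Ψ is the digamma function.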
Minimize the Kullback-Leibler divergence between the variational distribution and the true posterior
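Formally, the variational parameters are chosen document by document as

(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} \, \mathrm{D}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)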
Variational Inference Algorithm
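A minimal Python sketch of the per-document coordinate-ascent procedure, assuming β is a k×V array of topic-word probabilities and the document is a list of word indices; the initialisation and updates follow the equations above, while the names and tolerance are illustrative:

import numpy as np
from scipy.special import digamma

def variational_inference(doc, alpha, beta, max_iter=100, tol=1e-6):
    # doc: list of word indices w_1..w_N; alpha: (k,) Dirichlet parameter; beta: (k, V) topic-word probabilities
    N, k = len(doc), len(alpha)
    phi = np.full((N, k), 1.0 / k)        # phi_ni initialised to 1/k (recomputed in the first pass)
    gamma = alpha + N / k                 # gamma_i initialised to alpha_i + N/k
    for _ in range(max_iter):
        old_gamma = gamma.copy()
        # phi_ni proportional to beta_{i,w_n} * exp(E_q[log theta_i | gamma])
        log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, doc].T * np.exp(log_theta)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - old_gamma).sum() < tol:
            break
    return gamma, phi

# Example (reusing alpha, beta and doc from the generative sketch earlier):
# gamma, phi = variational_inference(doc, alpha, beta)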