Author-Topic Models - Stanford University
Download
Report
Transcript Author-Topic Models - Stanford University
Modeling Documents
Amruta Joshi
Department of Computer Science
Stanford University
6th June 2005
Research in Algorithms for the InterNet
1
Outline
Topic Models
Topic
Extraction2
Author Information
Modeling Topics
Modeling Authors
Author Topic Model
Inference
Integrating topics and syntax
Probabilistic
Models
Composite Model
Inference
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
2
Motivation
Identifying content of a document
Identifying its latent structure
More specifically
Given
a collection of documents we want to
create a model to collect information about
Authors
Topics
Syntactic constructs
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
3
Topics & Authors
Why model topics?
Observe
topic trends
How documents relate to one-another
Tagging abstracts
Why model authors’ interests?
Identifying
what author writes about
Identifying authors with similar interests
Authorship attribution
Creating reviewer lists
Finding unusual work by an author
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
4
Topic Extraction: Overview
Supervised Learning
Techniques
Learn
from labeled document
collection
But Unlabeled documents,
Rapidly changing fields (Yang
1998)
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
rivers
In floods, the
banks of a
river overflow
5
Topic Extraction: Overview
Dimensionality Reduction
Represent documents in
Vector Space of terms
Map to low-dimensionality
Non-linear dim. reduction
WEBSOM (Lagus et. al. 1999)
Linear Projection
LSI (Berry, Dumais, O’Brien
1995)
Regions represent topics
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
6
Topic Extraction: Overview
Cluster documents on semantic content
Typically,
each cluster has just 1 topic
Aspect Model
Topic
modeled as distribution over words
Documents generated from multiple topics
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
7
Author Information: Overview
Analyzing text using
Stylometry
statistical analysis using
literary style, frequency of
word usage, etc
Semantics
Content of document
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
As doth the lion in
the Capitol, A man
no mightier than
thyself or me …
8
Author Information: Overview
Graph-based models
D1
D2
Build
Interactive
ReferralWeb using citations
D3
D4
Kautz, Selman, Shah 1997
Build
Co-Author Graphs
White & Smith
Page-Rank for analysis
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
9
The Big Idea
Topic Model
Author Model
Model topics as distribution over words
Model author as distribution over words
Author-Topic Model
Probabilistic Model for both
Model topics as distribution over words
Model authors as distribution over topics
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
10
Bayesian Networks
Pneumonia
Tuberculosis
nodes = random variables
edges = direct probabilistic
influence
Lung Infiltrates
XRay
Sputum Smear
Topology captures independence:
XRay conditionally independent of Pneumonia given
Infiltrates
Slide Credit: Lisa Getoor, UMD College Park
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
11
Bayesian Networks
Pneumonia
Tuberculosis
Lung Infiltrates
XRay
Sputum Smear
P T
P(I |P, T )
p
t
0.7
0.3
p
t
0.6
0.4
p
t
0.2
0.8
p
t
0.01 0.99
Associated
with each node Xi there is a conditional
probability distribution P(Xi|Pai:) — distribution over
Xi for each assignment to parents
If variables are discrete, P is usually multinomial
P can be linear Gaussian, mixture of Gaussians, …
Slide Credit: Lisa Getoor, UMD College Park
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
12
BN Learning
P
I
Inducer
Data
T
X
S
BN models can be learned from empirical data
parameter
estimation via numerical optimization
structure learning via combinatorial search.
Slide Credit: Lisa Getoor, UMD College Park
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
13
Generative Model
Probabilistic Generative Process
Mixture
components
Mixture
weights
Amruta Joshi, Stanford Univ.
Statistical Inference
Bayesian approach: use priors
Mixture weights
~ Dirichlet( a )
Mixture components ~ Dirichlet( b )
Research in Algorithms for the InterNet
14
Bayesian Network for modeling
document generation
Doc 1
T1
…
T2
Z
Z
TT
w1
w2
…
wv
W
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
W
15
Topic Model: Plate Notation
Document specific
distribution over
topics
Document
Topic
Topic distribution
over words
z
w
T
Word
Nd
D
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
16
Topic Model:
Geometric Representation
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
17
Modeling Authors with words
Uniform
distribution over
authors of doc
Document
ad
Distribution of
authors over words
Author
x
Word
w
A
Amruta Joshi, Stanford Univ.
Nd
Research in Algorithms for the InterNet
D
18
Author-Topic Model
Uniform
distribution of
documents over
authors
Document
ad
Author
Distribution of
authors over
topics
x
Topic
z
A
Topic
distribution
over words
w
T
Amruta Joshi, Stanford Univ.
Word
Nd
Research in Algorithms for the InterNet
D
19
Inference
Expectation Maximization
But poor results (local Maxima)
Gibbs Sampling
Parameters: ,
Start
with initial random assignment
Update parameter using other parameters
Converges after ‘n’ iterations
Burn-in time
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
20
Inference and Learning for
Documents
Prob. that ith topic is
assigned to topic j
keeping other topic
assn unchanged
# of times
word m is
assigned to
topic j
Amruta Joshi, Stanford Univ.
mj
Research in Algorithms for the InterNet
# of times
topic j has
occurred in
document d
dj
21
Matrix Factorization
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
22
Topic Model: Inference
River
River
Stream
Stream
Bank
Bank
Money
Money
Loan
Loan
documents
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Can we recover the original topics and topic mixtures from this data?
Slide Credit: Padhraic Smyth, UC Irvine
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
23
Example of Gibbs Sampling
Assign word tokens randomly to topics
(●=topic 1; ●=topic 2 )
River
River
Stream
Stream
Bank
Bank
Money
Money
Loan
Loan
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Slide Credit: Padhraic Smyth, UC Irvine
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
24
After 1 iteration
Apply sampling equation to each word
token
River
River
Stream
Stream
Bank
Bank
Money
Money
Loan
Loan
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Slide Credit: Padhraic Smyth, UC Irvine
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
25
After 4 iterations
River
River
Stream
Bank
Stream
Bank
Money
Money
Loan
Loan
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Slide Credit: Padhraic Smyth, UC Irvine
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
26
After 32 iterations
●
●
topic 1
stream .40
bank .35
river .25
River
River
Stream
Bank
Stream
Bank
topic 2
bank .39
money .32
loan .29
Money
Money
Loan
Loan
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Slide Credit: Padhraic Smyth, UC Irvine
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
27
Results
Tested on Scientific Papers
NIPS
Dataset
V=13,649 D=1,740 K=2,037
#Topics = 100
#tokens = 2,301,375
CiteSeer
Dataset
V=30,799 D=162,489 K=85,465
#Topics = 300
#tokens = 11,685,514
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
28
Evaluating Predictive Power
Perplexity
Indicates
ability to predict words on new
unseen documents
Lower the
better
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
29
Results: Perplexity
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
30
Recap
First
Author Model
Topic Model
Then
Author-Topic Model
Next…
Integrating Topics & Syntax
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
31
Integrating topics & syntax
Probabilistic Models
Short-range
Syntactic Constraints
Represented as distinct syntactic classes
HMM, Probabilistic CFGs
Long-range
dependencies
dependencies
Semantic Constraints
Represented as probabilistic distribution
Bayes Model, Topic Model
New Idea! Use both
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
32
How to integrate these?
Mixture of Models
Product of Models
Each word exhibits either short or long range
dependencies
Each word exhibits both short or long range
dependencies
Composite Model
Asymmetric
All words exhibit short-range dependencies
Subset of words exhibit long-range
Research in Algorithms for the InterNet
Amruta Joshi, Stanforddependencies
Univ.
33
The Composite Model 1
Capturing asymmetry
Replace
probability distribution over words with
semantic model
Syntactic model chooses when to emit content
word
Semantic model chooses which word to emit
Methods
Syntactic
component is HMM
Semantic component is Topic model
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
34
Generating phrases
0.9
in
with
for
on
...
0.5
0.4
0.1
network
neural
output
networks
...
image
images
object
objects
...
kernel
support
svm
vector
...
0.9
0.2
0.7
used
trained
obtained
described
...
network used for images
image obtained with kernel
output described with objects
neural network trained with svm images
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
35
The Composite Model 2
(Graphical)
Doc’s distribution
over topics
Topics
z1
z2
z3
z4
Words
w1
w2
w3
w4
Classes
c1
Amruta Joshi, Stanford Univ.
c2
c3
c4
Research in Algorithms for the InterNet
36
The Composite Model 3
(d) : document’s distribution over topics
Transitions between classes ci-1 and ci follow
distribution (Ci-1)
A document is generated as:
For each word wi in document
Draw zi from (d)
Draw ci from (Ci-1)
If ci=1, then draw wi from (zi),
else draw wi from (ci)
Amruta Joshi, Stanford Univ.
d
Research in Algorithms for the InterNet
37
Results
Tested on
Brown
corpus (tagged with word types)
Concatenated Brown & TASA corpus
HMM & Topic Model
20
T
Classes
start/end Markers Class + 19 classes
= 200
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
38
Results
Identifying Syntactic classes & semantic topics
Clean
separation observed
Identifying function words & content words
“control”
: plain verb (syntax) or semantic word
Part-of-Speech Tagging
Identifying
syntactic class
Document Classification
Brown
corpus: 500 docs => 15 groups
Results similar
to plain Topic Model
Research in Algorithms for the InterNet
Amruta Joshi, Stanford Univ.
39
Extensions to Topic Model
Integrating link information (Cohn,
Hofmann 2001)
Learning Topic Hierarchies
Integrating Syntax & Topics
Integrate authorship info with content
(author-topic model)
Grade-of-membership Models
Random sentence generation
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
40
Conclusion
Identifying its latent structure
Document Content is modeled for
– topic model
Authorship - author topic model
Syntactic Constructs – HMM
Semantic Associations
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
41
Acknowledgements
Prof. Rajeev Motwani
Advice and guidance regarding topic
selection
T.
K. Satish Kumar
Help on Probabilistic Models
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
42
Thank you!
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
43
References
Primary
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic
Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. Seattle,
Washington.
Steyvers, M. & Griffiths, T. Probabilistic topic models.
(http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted
.pdf)
Rosen-Zvi, M., Griffiths T., Steyvers, M., & Smyth, P. (2004). The Author-Topic
Model for Authors and Documents. In 20th Conference on Uncertainty in
Artificial Intelligence. Banff, Canada
Griffiths, T.L., & Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (in press).
Integrating Topics and Syntax. In: Advances in Neural Information Processing
Systems, 17.
Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of
the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
Amruta Joshi, Stanford Univ.
Research in Algorithms for the InterNet
44