2013_tvcg_utopian - Georgia Institute of Technology
UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization
Jaegul Choo1*, Changhyun Lee1, Chandan K. Reddy2, and Haesun Park1
1Georgia Institute of Technology, 2Wayne State University
*e-mail: [email protected]
Intro: Topic Modeling
[Figure: four documents (Document 1 to Document 4) connected to three topics (Topic 1 to Topic 3), which are in turn built from keywords such as brain, evolve, dna, genetic, gene, nerve, neuron, life, organism]
• Topic: a distribution over keywords
• Document: a distribution over topics
Latent Dirichlet Allocation (LDA) in Visual Analytics
• LDA has been widely used in visual analytics.
• TIARA [Wei et al. KDD10], iVisClustering [Lee et al. EuroVis12], ParallelTopics [Dou et al. VAST12], TopicViz [Eisenstein et al. CHI-WIP12], …
*Image courtesy of original papers.
Overview of Our Work
• Proposes nonnegative matrix factorization (NMF) for topic modeling.
• Highlights advantages of NMF over LDA in visual analytics.
• Presents UTOPIAN, an NMF-based interactive topic modeling system.
[Figure: UTOPIAN interface with callouts for topic merging, topic splitting, keyword-induced topic creation, and doc-induced topic creation]
What is Nonnegative Matrix Factorization?
Nonnegative Matrix Factorization (NMF)
• Lower-rank approximation with nonnegativity constraints: $A \approx WH$
  $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
• Why nonnegativity? Easy interpretation and semantically meaningful output
• Algorithm: alternating nonnegativity-constrained least squares [Kim et al., 2008]
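The alternating scheme named above can be illustrated with a minimal sketch. It is not the solver from [Kim et al., 2008]: it swaps in SciPy's generic `nnls` routine for each subproblem, and the function name `nmf_anls`, the random initialization, and the fixed iteration count are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, k, n_iter=50, seed=0):
    """Approximate a nonnegative m x n matrix A as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))          # assumed random start; other choices are possible
    H = np.zeros((k, n))
    for _ in range(n_iter):
        # Fix W: each column of H solves min ||A[:, j] - W h||_2 subject to h >= 0.
        for j in range(n):
            H[:, j], _ = nnls(W, A[:, j])
        # Fix H: each row of W solves the transposed problem with w >= 0.
        for i in range(m):
            W[i, :], _ = nnls(H.T, A[i, :])
    return W, H
```

Each half-step is a convex nonnegative least-squares problem, which is why the two factors are updated in turn rather than jointly. The column-by-column loops keep the sketch short; they are not meant to be fast.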
NMF as Topic Modeling
[Figure: $A \approx WH$ drawn over the topic-modeling diagram: four documents, three topics, and keywords such as brain, evolve, dna, genetic, gene, nerve, neuron, life, organism]
• Topic: a distribution over keywords
• Document: a distribution over topics
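To make the reading of the two factors concrete, here is a small sketch. It assumes that A is a keyword-by-document matrix, so that each column of W describes a topic over keywords and each column of H describes a document over topics. The helper names and the tiny vocabulary are hypothetical.

```python
import numpy as np

def top_keywords(W, vocab, n_top=2):
    """Return the n_top highest-weighted keywords of each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n_top]      # row indices, best first
    return [[vocab[i] for i in order[:, t]] for t in range(W.shape[1])]

def topic_membership(H):
    """Assign each document (column of H) to its highest-weighted topic."""
    return np.argmax(H, axis=0)

def to_distributions(M):
    """Scale every column to sum to 1; all-zero columns are left as zeros."""
    s = M.sum(axis=0, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M, dtype=float), where=s > 0)

vocab = ["brain", "dna", "gene", "neuron"]
W = np.array([[0.9, 0.0], [0.1, 0.8], [0.0, 0.7], [0.8, 0.1]])
H = np.array([[1.0, 0.2], [0.1, 0.9]])
print(top_keywords(W, vocab))   # [['brain', 'neuron'], ['dna', 'gene']]
print(topic_membership(H))      # [0 1]
```

NMF itself only guarantees nonnegative weights, so the column normalization in `to_distributions` is an extra step needed to read the columns as distributions.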
Why NMF in Visual Analytics?
Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions
NMF vs. LDA: Consistency from Multiple Runs
Documents’ topical membership changes among 10 runs
[Plots: InfoVis/VAST paper data set; 20 newsgroup data set]
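The slide does not say how membership changes across runs were counted. One plausible measure, sketched below purely as an assumption, first matches topic labels between two runs (topic indices are arbitrary from run to run) and then reports the fraction of documents that still end up in a different topic. The function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def membership_change_rate(labels_a, labels_b, k):
    """Fraction of documents whose topic differs between two runs,
    after relabeling run B's topics to best match run A's."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # overlap[i, j] = number of documents in topic i of run A and topic j of run B
    overlap = np.zeros((k, k), dtype=int)
    np.add.at(overlap, (labels_a, labels_b), 1)
    # Hungarian matching picks the one-to-one topic pairing with maximal overlap.
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    return 1.0 - overlap[rows, cols].sum() / labels_a.size

run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 1]      # same grouping, different topic numbering
print(membership_change_rate(run_a, run_b, k=3))   # 0.0
```

A value of 0 means the two runs group the documents identically; averaging the rate over all pairs among the 10 runs would give a single consistency score per method.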
NMF vs. LDA: Empirical Convergence
Documents’ topical membership changes between iterations
[Plots: InfoVis/VAST paper data set; NMF: 48 seconds, LDA: 10 minutes]
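A convergence curve of this kind can be computed from the sequence of document-topic factors. The sketch below assumes each iterate H has topics as rows and documents as columns, and that a document's membership is its highest-weighted topic. Both helper names are hypothetical.

```python
import numpy as np

def count_membership_changes(H_prev, H_curr):
    """Number of documents (columns) whose strongest topic changed."""
    return int(np.sum(np.argmax(H_prev, axis=0) != np.argmax(H_curr, axis=0)))

def changes_per_iteration(H_history):
    """Membership changes between each pair of consecutive iterates."""
    return [count_membership_changes(a, b)
            for a, b in zip(H_history, H_history[1:])]

H0 = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
H1 = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
print(changes_per_iteration([H0, H1, H1]))   # [1, 0]
```

When the count stays at zero for several iterations, further iterations no longer move documents between topics, which is the practical sense of "empirical convergence" used here.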
NMF vs. LDA: Topic Summary (Top Keywords)
InfoVis/VAST paper data set
Topic   | NMF Run #1            | NMF Run #2            | LDA Run #1              | LDA Run #2
Topic 1 | visualization, design | visualization, design | documents, similarities | documents, query
Topic 2 | information, user     | information, user     | knowledge, edge         | analysts, scatterplot
Topic 3 | analysis, system      | analysis, system      | query, collaborative    | spatial, collaborative
Topic 4 | graph, layout         | graph, layout         | social, tree            | text, documents
Topic 5 | visual, analytics     | visual, analytics     | measures, multivariate  | multidimensional, high
Topic 6 | data, sets            | data, sets            | tree, animation         | tree, aggregation
Topic 7 | color, weaving        | color, weaving        | dimensions, treemap     | dimensions, treemap
Topics are more consistent in NMF than in LDA.
Topic quality is comparable between NMF and LDA.
Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
• $W_r$, $H_r$: reference matrices for $W$ and $H$
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of $W$ and $H$
• Provides flexible yet intuitive means for user interaction.
• Maintains the same computational complexity as original NMF.
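To show how the reference and masking matrices enter the objective, the sketch below simply evaluates it for given factors. It is not the authors' solver. The slide does not define $D_H$, so the code treats it as a k x k matrix supplied by the caller, which is an assumption, and the function name is hypothetical.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2
    + beta ||M_H (H - D_H H_r)||_F^2 for given matrices."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    # M_W multiplies on the right, so it weights or masks columns of W.
    w_pen = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    # M_H multiplies on the left, so it weights or masks rows of H.
    h_pen = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_pen + beta * h_pen

rng = np.random.default_rng(0)
m, n, k = 5, 4, 2
A, W, H = rng.random((m, n)), rng.random((m, k)), rng.random((k, n))
W_r, H_r = rng.random((m, k)), rng.random((k, n))
M_W = np.diag([1.0, 0.0])   # pull only the first column of W toward W_r
M_H = np.zeros((k, k))      # leave H unconstrained
value = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, np.eye(k), 1.0, 1.0)
```

A zero on the diagonal of a masking matrix removes the corresponding column of W or row of H from the penalty, so supervision can be applied to some parts of the factors and withheld from the rest.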
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF
UTOPIAN Overview
Supervised t-distributed stochastic neighbor embedding (t-SNE)
User interactions supported
• Keyword refinement
• Topic merging/splitting
• Keyword-/document-induced topic creation
Real-time interaction via PIVE (Per-Iteration Visualization Environment)
Supervised t-SNE
Original t-SNE
• Documents are often too noisy to work with.
Supervised t-SNE
• $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.
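The distance update can be sketched directly on a pairwise distance matrix. The value of $\alpha$ is not given on the slide. The sketch assumes $0 < \alpha < 1$, so that documents sharing a topic are pulled closer together, and the function name is hypothetical.

```python
import numpy as np

def supervise_distances(D, labels, alpha=0.5):
    """Return a copy of D in which d(x_i, x_j) is multiplied by alpha
    whenever documents i and j belong to the same topic cluster."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]    # True where topics match
    return np.where(same, alpha * D, D)

D = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 6.0],
              [4.0, 6.0, 0.0]])
print(supervise_distances(D, labels=[0, 0, 1], alpha=0.5))
# [[0. 1. 4.]
#  [1. 0. 6.]
#  [4. 6. 0.]]
```

The adjusted matrix can then be passed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE(metric="precomputed", init="random")`.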
PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]
[Figure: standard approach vs. PIVE approach]
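The contrast between the two approaches can be sketched as follows, assuming the idea is to hand every intermediate iterate to the visualization instead of waiting for the final result. This is an illustration only, not the PIVE implementation: all names are hypothetical, and simple multiplicative-update NMF steps stand in for the iterative method.

```python
import numpy as np

def nmf_iterates(A, k, n_iter=100, seed=0, eps=1e-9):
    """Yield (iteration, W, H) after every multiplicative-update step."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    for it in range(1, n_iter + 1):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        yield it, W.copy(), H.copy()

def run_standard(A, k, **kw):
    """Standard approach: block until the last iteration, then return once."""
    *_, last = nmf_iterates(A, k, **kw)
    return last

def run_per_iteration(A, k, on_update, **kw):
    """Per-iteration approach: pass every intermediate result to a callback."""
    for it, W, H in nmf_iterates(A, k, **kw):
        on_update(it, W, H)     # e.g. redraw the view with the current topics
```

In an interactive system the callback would typically run alongside a user-interface thread, so the user can inspect partial results and act on them before the computation has finished.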
Demo Video
http://tinyurl.com/UTOPIAN2013
Usage Scenario: Hyundai Genesis Review Data
[Figures: initial result; after interaction]
Summary
• Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF.
• Highlighted the advantages of NMF over LDA in visual analytics.
  • Reliable algorithmic behaviors
    • Consistency from multiple runs
    • Early empirical convergence
  • Flexible support for user interactions
    • Keyword refinement
    • Topic merging/splitting
    • Keyword-/document-induced topic creation
More in the paper & On-going Work
• A general taxonomy of user interactions with computational methods
  • Keyword-based vs. document-based
  • Template-based vs. from-scratch-based
• Algorithmic details about supported user interactions
• Implementation details
• More usage scenarios
On-going Work
• Scaling up the system with parallel distributed NMF
Thank you!
Jaegul Choo
[email protected]
http://www.cc.gatech.edu/~joyfull/
http://tinyurl.com/UTOPIAN2013
For more details, please find me at ‘Meet the Candidate’, A601 + A602, 6PM today.