2013_tvcg_utopian - Georgia Institute of Technology

UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization
Jaegul Choo1*, Changhyun Lee1, Chandan K. Reddy2, and Haesun Park1
1Georgia Institute of Technology, 2Wayne State University
*e-mail: [email protected]
Intro: Topic Modeling
Document 1 Document 2 Document 3 Document 4
Document: a distribution over topics
Topic 1 Topic 2 Topic 3
Topic: a distribution over keywords
Keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism
Latent Dirichlet Allocation (LDA) in Visual Analytics
• LDA has been widely used in visual analytics.
• TIARA [Wei et al. KDD10], iVisClustering [Lee et al. EuroVis12], ParallelTopics [Dou et al. VAST12], TopicViz [Eisenstein et al. CHI-WIP12], …
*Image courtesy of original papers.
Overview of Our Work
• Proposes nonnegative matrix factorization (NMF) for topic modeling.
• Highlights advantages of NMF over LDA in visual analytics.
• Presents UTOPIAN, an NMF-based interactive topic modeling system.
Figure callouts: topic merging, topic splitting, keyword-induced topic creation, doc-induced topic creation
What is Nonnegative Matrix Factorization?
Nonnegative Matrix Factorization (NMF)
Lower-rank approximation with nonnegativity constraints
$A \approx WH$
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
Why nonnegativity?
• Easy interpretation and semantically meaningful output
Algorithm
• Alternating nonnegativity-constrained least squares [Kim et al., 2008]
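As a rough illustration of the factorization on this slide, the following NumPy sketch approximates a small nonnegative matrix A by a product WH. For brevity it swaps in Lee and Seung's multiplicative update rule; it is not the alternating nonnegativity-constrained least squares algorithm cited above. The function name `nmf_multiplicative`, the toy matrix, and all parameter values are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate nonnegative A (m x n) by W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random nonnegative starting point.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical keyword-by-document count matrix (6 keywords, 4 documents).
A = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 0, 1, 1],
], dtype=float)

W, H = nmf_multiplicative(A, k=2)
rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative error: {rel_error:.3f}")
```

Because both factors stay nonnegative, each column of W can be read as an additive combination of keywords, which is the interpretability argument made above.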
NMF as Topic Modeling
$A \approx WH$
Document 1 Document 2 Document 3 Document 4
Document: a distribution over topics
Topic 1 Topic 2 Topic 3
Topic: a distribution over keywords
Keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism
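One way to read NMF factors as topics is sketched below, assuming A is a keyword-by-document matrix so that columns of W describe topics and columns of H describe documents. The numbers in `W` and `H` are made up for illustration (they are not output from UTOPIAN), and `top_keywords` is a hypothetical helper.

```python
import numpy as np

vocab = ["brain", "nerve", "neuron", "dna", "gene",
         "genetic", "evolve", "life", "organism"]

# Hypothetical factors: W is keywords x topics, H is topics x documents.
W = np.array([
    [0.9, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.7, 0.0, 0.1],
    [0.0, 0.9, 0.1],
    [0.0, 0.8, 0.0],
    [0.0, 0.7, 0.2],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.0, 0.1, 0.7],
])
H = np.array([
    [1.0, 0.0, 0.2, 0.6],
    [0.0, 1.0, 0.3, 0.4],
    [0.0, 0.2, 1.0, 0.0],
])

def top_keywords(W, vocab, n=3):
    """Return the n highest-weighted keywords for each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n]  # shape (n, number of topics)
    return [[vocab[i] for i in order[:, t]] for t in range(W.shape[1])]

# Normalize columns so they can be read as distributions.
topic_word = W / W.sum(axis=0, keepdims=True)   # topic: distribution over keywords
doc_topic = H / H.sum(axis=0, keepdims=True)    # document: distribution over topics
dominant_topic = doc_topic.argmax(axis=0)       # hard topical membership per document

for t, words in enumerate(top_keywords(W, vocab), start=1):
    print(f"Topic {t}: {', '.join(words)}")
```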
Why NMF in Visual Analytics?
Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions
NMF vs. LDA
Consistency from Multiple Runs
Documents’ topical membership changes among 10 runs
InfoVis/VAST paper data set
20 newsgroup data set
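The slide does not say how membership changes across runs were counted. One possible measure, sketched here as an assumption, compares two runs by the fraction of document pairs that are grouped together in one run but not in the other; this avoids having to match topic indices between runs. The function name `comembership_disagreement` is hypothetical.

```python
import numpy as np

def comembership_disagreement(labels_a, labels_b):
    """Fraction of document pairs whose same-topic status differs between two runs."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    # Look at each unordered pair of documents once.
    iu = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[iu] != same_b[iu]))

# Same grouping with permuted topic indices: no disagreement.
print(comembership_disagreement([0, 0, 1, 1], [1, 1, 0, 0]))
# A genuinely different grouping.
print(comembership_disagreement([0, 0, 1, 1], [0, 1, 0, 1]))
```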
NMF vs. LDA
Empirical Convergence
Documents’ topical membership changes between iterations
InfoVis/VAST paper data set
NMF: 48 seconds
LDA: 10 minutes
NMF vs. LDA
Topic Summary (Top Keywords)
InfoVis/VAST paper data set
NMF
Topic   | Run #1                  | Run #2
Topic 1 | visualization, design   | visualization, design
Topic 2 | information, user       | information, user
Topic 3 | analysis, system        | analysis, system
Topic 4 | graph, layout           | graph, layout
Topic 5 | visual, analytics       | visual, analytics
Topic 6 | data, sets              | data, sets
Topic 7 | color, weaving          | color, weaving

LDA
Topic   | Run #1                  | Run #2
Topic 1 | documents, similarities | documents, query
Topic 2 | knowledge, edge         | analysts, scatterplot
Topic 3 | query, collaborative    | spatial, collaborative
Topic 4 | social, tree            | text, documents
Topic 5 | measures, multivariate  | multidimensional, high
Topic 6 | tree, animation         | tree, aggregation
Topic 7 | dimensions, treemap     | dimensions, treemap

• Topics are more consistent in NMF than in LDA.
• Topic quality is comparable between NMF and LDA.
Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions
Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
• $W_r$, $H_r$: reference matrices for W and H
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of W and H
Provides flexible yet intuitive means for user interaction.
Maintains the same computational complexity as original NMF.
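To make the roles of the reference and masking matrices concrete, the sketch below only evaluates the objective as written on this slide; it does not solve the minimization. The matrix shapes (M_W, M_H, and D_H taken as k x k diagonal matrices, with D_H treated as a diagonal scaling matrix), the function name `ws_nmf_objective`, and the toy values are assumptions for illustration.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A-WH||_F^2 + alpha||(W-W_r)M_W||_F^2 + beta||M_H(H-D_H H_r)||_F^2."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_penalty = alpha * np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_penalty = beta * np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + w_penalty + h_penalty

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 keywords x 2 topics
H = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])     # 2 topics x 3 documents
A = W @ H                                            # exact fit for the demo

# Hypothetical user input: a reference for W whose second topic column
# disagrees with the current W.
W_r = W.copy()
W_r[:, 1] += 5.0
H_r, D_H, M_H = H.copy(), np.eye(2), np.eye(2)

# Masking out topic 2 (zero on the diagonal) ignores that disagreement.
masked = ws_nmf_objective(A, W, H, W_r, H_r, np.diag([1.0, 0.0]), M_H, D_H, 0.5, 0.5)
# Weighting both topics penalizes it.
unmasked = ws_nmf_objective(A, W, H, W_r, H_r, np.eye(2), M_H, D_H, 0.5, 0.5)
print(masked, unmasked)
```

In this reading, a zero diagonal entry means the user has expressed no preference for that topic, while a positive entry pulls the corresponding column of W (or row of H) toward its reference.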
UTOPIAN:
User-Driven Topic Modeling Based on Interactive NMF
Figure callouts: topic merging, topic splitting, keyword-induced topic creation, doc-induced topic creation
UTOPIAN Overview
Supervised t-distributed stochastic neighbor embedding (t-SNE)
User interactions supported
• Keyword refinement
• Topic merging/splitting
• Keyword-/document-induced topic creation
Real-time interaction via PIVE (Per-Iteration Visualization Environment)
Figure callouts: topic merging, topic splitting, keyword-induced topic creation, doc-induced topic creation
Supervised t-SNE
Original t-SNE
• Documents are often too noisy to work with.
Supervised t-SNE
• $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.
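The distance rescaling above can be sketched in a few lines of NumPy. This example assumes Euclidean distances and a value of α below 1 (so that documents in the same topic cluster are pulled closer together); the slide specifies neither, and `supervised_distances` is a hypothetical helper name.

```python
import numpy as np

def supervised_distances(X, labels, alpha=0.3):
    """Pairwise Euclidean distances, scaled by alpha within the same topic cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    same_cluster = labels[:, None] == labels[None, :]
    # d(xi, xj) <- alpha * d(xi, xj) only for pairs sharing a topic cluster.
    return np.where(same_cluster, alpha * D, D)

X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
labels = [0, 0, 1]
D = supervised_distances(X, labels, alpha=0.5)
print(D)
```

The resulting matrix could then be handed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE` with `metric="precomputed"` and `init="random"`.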
PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]
Standard approach
PIVE approach
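The contrast between the two approaches can be illustrated with a generator that hands back intermediate factors after every iteration instead of only the final result. This is a minimal sketch of the per-iteration idea, not the PIVE implementation; it reuses multiplicative-update NMF as a stand-in iterative method, and `nmf_iterations` and the random test matrix are assumptions.

```python
import numpy as np

def nmf_iterations(A, k, n_iter=30, seed=0, eps=1e-10):
    """Yield (iteration, W, H) after every update so a view can be redrawn early."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for it in range(1, n_iter + 1):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        yield it, W.copy(), H.copy()

A = np.random.default_rng(1).random((8, 5))

errors = []
for it, W, H in nmf_iterations(A, k=2):
    # Stand-in for a visualization refresh: record and report the current fit.
    errors.append(float(np.linalg.norm(A - W @ H)))
    if it in (1, 30):
        print(f"iter {it:2d}  error {errors[-1]:.4f}")
```

A standard pipeline would consume only the last item; consuming every item lets the interface show a usable, gradually improving result while the computation is still running.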
Demo Video
http://tinyurl.com/UTOPIAN2013
Usage Scenario:
Hyundai Genesis Review Data
Initial result
After interaction
Summary
• Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF.
• Highlighted the advantages of NMF over LDA in visual analytics.
  • Reliable algorithmic behaviors
    • Consistency from multiple runs
    • Early empirical convergence
  • Flexible support for user interactions
    • Keyword refinement
    • Topic merging/splitting
    • Keyword-/document-induced topic creation
More in the paper & On-going Work
• A general taxonomy of user interactions with computational methods
  • Keyword-based vs. document-based
  • Template-based vs. from-scratch-based
• Algorithmic details about supported user interactions
• Implementation details
• More usage scenarios
On-going Work
• Scaling up the system with parallel distributed NMF
Thank you!
Jaegul Choo
[email protected]
http://www.cc.gatech.edu/~joyfull/
http://tinyurl.com/UTOPIAN2013
For more details, please find me at ‘Meet the Candidate’, A601 + A602, 6PM today.