Research on Intelligent Text
Information Management
ChengXiang Zhai
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology
Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Contains joint work with Xuehua Shen, Bin Tan, Qiaozhu Mei, Yue Lu,
and other members of the TIMan group
© 2009 ChengXiang Zhai
1
Research Roadmap
[Figure: research roadmap diagram. Labels: Text; Natural Language Content Analysis; Entity/Relation Extraction; Information Access; Information Organization; Knowledge Acquisition; Search; Filtering; Categorization; Clustering; Summarization; Visualization; Extraction; Mining; Search Applications; Mining Applications; Web, Email, and Bioinformatics]
Current focus:
- Personalized
- Retrieval models
- Topic map
- Recommender
- Comparative text mining
- Opinion integration
- Controversy discovery
2
Sample Projects
• User-Centered Adaptive Information Retrieval
• Multi-Resolution Topic Map for Browsing
• Comparative Text Mining
• Opinion Integration and Summarization
3
Project 1:
User-Centered Adaptive IR (UCAIR)
• A novel retrieval strategy emphasizing
– user modeling (“user-centered”)
– search context modeling (“adaptive”)
– interactive retrieval
• Implemented as a personalized search agent that
– sits on the client-side (owned by the user)
– integrates information around a user (1 user vs. N
sources as opposed to 1 source vs. N users)
– collaborates with other agents
– goes beyond search toward task support
4
Non-Optimality of
Document-Centered Search Engines
Query = Jaguar (as of Oct. 17, 2005)
Result labels: Car, Car, Software, Car, Animal, Car
Mixed results, unlikely optimal for any particular user
5
The UCAIR Project (NSF CAREER)
[Figure: a personalized search agent handling the query “jaguar”. Labels: WEB, Search Engine (three instances), Viewed Web pages, Desktop Files, Email, Query History, ...]
6
Potential Benefit of Personalization
Result labels for “jaguar”: Car, Car, Software, Car, Animal, Car
Suppose we know:
1. Previous query = “racing cars” vs. “Apple OS”
2. “car” occurs far more frequently than “Apple” in pages browsed by the user in the last 20 days
3. User just viewed an “Apple OS” document
7
Intelligent Re-ranking of Unseen Results
When a user clicks on the “back” button after viewing a document,
UCAIR reranks unseen results to
pull up documents similar to the one the user has viewed
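The re-ranking idea on this slide can be illustrated with a minimal sketch. It assumes bag-of-words vectors and cosine similarity, which are stand-ins chosen for illustration; the slide does not say which similarity measure UCAIR uses, and the function names and sample results below are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rerank_unseen(viewed_doc, unseen_results):
    """Order unseen results by similarity to the document just viewed."""
    viewed = bow(viewed_doc)
    return sorted(unseen_results,
                  key=lambda doc: cosine(viewed, bow(doc)),
                  reverse=True)

# Hypothetical snippets for the query "jaguar".
unseen = [
    "jaguar big cat habitat in south america",
    "jaguar car dealer and used car prices",
    "jaguar mac os software update",
]
# The user viewed a software page, then pressed "back".
reranked = rerank_unseen("apple mac os x jaguar software release", unseen)
```

In this toy run the software result moves to the top because it shares the most terms with the viewed page.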
8
UCAIR Outperforms Google
[Shen et al. 05]
Precision at N documents

Ranking Method   prec@5   prec@10   prec@20   prec@30
Google           0.538    0.472     0.377     0.308
UCAIR            0.581    0.556     0.453     0.375
Improvement      8.0%     17.8%     20.2%     21.8%
UCAIR toolbar available at http://sifaka.cs.uiuc.edu/ir/ucair/
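For readers who want to check the numbers, here is a small sketch of precision at N and of the relative improvement reported on this slide. The function names are illustrative; the precision values are the ones quoted above, and the improvement is assumed to be (UCAIR − Google) / Google.

```python
def precision_at_n(relevance, n):
    """Fraction of the top-n ranked results judged relevant.

    relevance: list of 0/1 judgments in ranked order.
    """
    return sum(relevance[:n]) / n

def relative_improvement(baseline, system):
    """Relative improvement of system over baseline, in percent."""
    return (system - baseline) / baseline * 100.0

# Precision values quoted on the slide.
google = {5: 0.538, 10: 0.472, 20: 0.377, 30: 0.308}
ucair = {5: 0.581, 10: 0.556, 20: 0.453, 30: 0.375}

improvements = {n: round(relative_improvement(google[n], ucair[n]), 1)
                for n in google}
```

Under that assumption, `improvements` reproduces the percentages shown for each cutoff.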
9
Future: Personal Information Agent
[Figure: personal information agent. Information sources: WWW, Desktop, Intranet, Email, IM, Blog, E-COM, … Components: User Profile, Active Info Service, Security Handler, Task Support, Personal Content Index, Frequently Accessed Info. Example topics: Sports, …, Literature]
10
Ongoing Work
• UCAIR system
• Recommendation and advertising on social networks
11
Project 2: Multi-Resolution Topic Map
for Browsing
• Promoting browsing as a “first-class citizen”
• Multi-resolution topic map for browsing
– Enable a user to find information through navigation
– Very useful when a user can’t formulate effective
queries or uses a small screen device
• Search log as information footprints
– Organize search log into a topic map
– Allow a user to follow information footprints of
previous users
– Enable social surfing
12
Querying vs. Browsing
13
Information Seeking as Sightseeing
• Know the address of an attraction site?
– Yes: take a taxi and go directly to the site
– No: walk around or take a taxi to a nearby place
then walk around
• Know what exactly you want to find?
– Yes: use the right keywords as a query and find
the information directly
– No: browse the information space or start with a
rough query and then browse
When querying fails, browsing comes to the rescue…
14
Current Support for Browsing is Limited
• Hyperlinks
– Only page-to-page
– Mostly manually constructed
– Browsing step is very small
Beyond hyperlinks?
• Web directories (e.g., ODP)
– Manually constructed
– Fixed categories
– Only support vertical navigation
Beyond fixed categories?
How to promote browsing as a “first-class citizen”?
15
Sightseeing Analogy Continues…
16
Topic Map for Touring Information Space
[Figure: multi-resolution topic map with three levels (Level 1, Level 2, Level 3). Zoom in and zoom out move between resolutions; horizontal navigation moves between topic regions.]
Example topic regions:
auto, insurance, car, cars, rental, loan
car::parts, car::rental, rental::boat, car::pictures, car::used, car::blue+book
national+car+rental, alamo+car+rental, enterprise+car+rental, exotic+car+rental, advantage+car+rental
17
Topic-Map based Browsing
Querying
Multi-Resolution
Topic Map
Topic
Region
Parents
Current
Position
Demo
Horizontal
Neighbors
18
How can we construct such a
multi-resolution topic map?
Multiple possibilities…
19
Search Logs as Information Footprints
Footprints in information space
User 2722 searched for "national car rental" [!] at 2006-03-09 11:24:29
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.valoans.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://benefits.military.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.avis.com)
User 2722 searched for "enterprise rent a car" [!] at 2006-04-05 23:37:42 (found http://www.enterprise.com)
User 2722 searched for "meineke car care center" [!] at 2006-05-02 09:12:49 (found http://www.meineke.com)
User 2722 searched for "car rental" [!] at 2006-05-25 15:54:36
User 2722 searched for "autosave car rental" [!] at 2006-05-25 23:26:54 (found http://eautosave.com)
User 2722 searched for "budget car rental" [!] at 2006-05-25 23:29:53
User 2722 searched for "alamo car rental" [!] at 2006-05-25 23:56:13
……
20
Information Footprints → Topic Map
• Challenges
– How to define/construct a topic region
– How to control granularities/resolutions of topic
regions
– How to connect topic regions to support effective
browsing
• Two approaches
– Multi-granularity clustering
– Query editing
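One way to picture the "query editing" approach is the sketch below. It is only an illustrative reading of the idea, not the authors' algorithm: each distinct query is treated as a topic region (a set of terms), adding a term zooms in, removing a term zooms out, and replacing one term gives a horizontal neighbor. The function name and the dictionary layout are hypothetical.

```python
def build_topic_map(queries):
    """Link logged queries by single-term edits.

    Returns {region: {"parents": set, "children": set, "neighbors": set}},
    where each region is a frozenset of query terms.
    """
    regions = {frozenset(q.lower().split()) for q in queries}
    topic_map = {r: {"parents": set(), "children": set(), "neighbors": set()}
                 for r in regions}
    for a in regions:
        for b in regions:
            if a == b:
                continue
            if len(b) == len(a) + 1 and a < b:
                # b adds exactly one term to a: zoom in from a to b.
                topic_map[a]["children"].add(b)
                topic_map[b]["parents"].add(a)
            elif len(a) == len(b) and len(a - b) == 1:
                # a and b differ in exactly one term: horizontal move.
                topic_map[a]["neighbors"].add(b)
    return topic_map

# Queries taken from the search log example in this deck.
log_queries = [
    "car", "car rental", "national car rental",
    "alamo car rental", "budget car rental",
]
topic_map = build_topic_map(log_queries)
car_rental = frozenset({"car", "rental"})
```

Here `topic_map[car_rental]["children"]` holds the three brand-specific rental queries, and those three are horizontal neighbors of one another.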
21
Collaborative Surfing
New queries become new footprints
Clickthroughs become new footprints
Navigation trace enriches map structures
Browse logs offer more opportunities to understand user interests and intents
22
Project 3:
Comparative Text Mining
• Documents are often associated with context (metadata)
– Direct context: time, location, source, authors,…
– Indirect context: events, policies, …
• Many applications require “contextual text analysis”:
– Discovering topics from text in a context-sensitive way
– Analyzing variations of topics over different contexts
– Revealing interesting patterns (e.g., topic evolution,
topic variations, topic communities)
23
Example 1:
Comparing News Articles
Possible comparisons:
- Vietnam War vs. Afghan War vs. Iraq War
- CNN vs. Fox vs. Blog
- Before 9/11 vs. During Iraq war vs. Current
- US blog vs. European blog vs. Others

Common Themes      “Vietnam” specific   “Afghan” specific   “Iraq” specific
United nations     …                    …                   …
Death of people    …                    …                   …
…                  …                    …                   …

What’s in common? What’s unique?
24
More Contextual Analysis Questions
• What positive/negative aspects did people say about
X (e.g., a person, an event)? Trends?
• How does an opinion/topic evolve over time?
• What are emerging topics? What topics are fading
away?
• How can we characterize a social network?
25
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a unified
approach?
• How can we bring humans into the loop?
26
Contextual Probabilistic
Latent Semantics Analysis ([KDD 2006]…)
[Figure: generation of a document with context]
Generation steps: Choose a view; Choose a coverage; Choose a theme; Draw a word from θ_i
Views: View1, View2, View3
Themes (sample word probabilities):
- government: government 0.3, response 0.2, ..
- donation: donate 0.1, relief 0.05, help 0.02, ..
- New Orleans: city 0.2, new 0.1, orleans 0.05, ..
Document context: Time = July 2005; Location = Texas; Author = xxx; Occup. = Sociologist; Age Group = 45+
Theme coverages: Texas; July 2005; sociologist; …; document
Sample document text: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”
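The generation steps named on this slide (choose a view, choose a coverage, choose a theme, draw a word) can be sketched as a toy sampler. All probabilities, view names, and coverage names below are invented for illustration, and the step order is an assumption; this is not the estimation procedure from the KDD 2006 paper, only a picture of how a word could be generated.

```python
import random

def draw(rng, dist):
    """Sample one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

# Toy parameters (hypothetical values).
VIEW_PROBS = {"common": 0.5, "texas": 0.5}
COVERAGE_PROBS = {"document": 0.6, "july_2005": 0.4}
COVERAGES = {  # each coverage is a distribution over themes
    "document": {"government": 0.5, "donation": 0.3, "new_orleans": 0.2},
    "july_2005": {"government": 0.2, "donation": 0.2, "new_orleans": 0.6},
}
VIEWS = {  # each view holds its own version of every theme
    "common": {
        "government": {"government": 0.6, "response": 0.4},
        "donation": {"donate": 0.5, "relief": 0.3, "help": 0.2},
        "new_orleans": {"city": 0.5, "new": 0.3, "orleans": 0.2},
    },
    "texas": {
        "government": {"government": 0.5, "response": 0.5},
        "donation": {"donate": 0.4, "relief": 0.2, "help": 0.4},
        "new_orleans": {"city": 0.4, "new": 0.3, "orleans": 0.3},
    },
}

def generate_document(rng, n_words):
    """Generate words one at a time: view, coverage, theme, then word."""
    words = []
    for _ in range(n_words):
        view = draw(rng, VIEW_PROBS)
        coverage = draw(rng, COVERAGE_PROBS)
        theme = draw(rng, COVERAGES[coverage])
        words.append(draw(rng, VIEWS[view][theme]))
    return words

sample = generate_document(random.Random(0), 20)
```

The design point the sketch tries to show is that context enters twice: it selects which version of the themes (the view) is used and how strongly each theme is covered.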
27
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars

               Cluster 1            Cluster 2          Cluster 3
Common Theme   united 0.042         killed 0.035       …
               nations 0.04         month 0.032
               …                    deaths 0.023
                                    …
Iraq Theme     n 0.03               troops 0.016       …
               Weapons 0.024        hoon 0.015
               Inspections 0.023    sanches 0.012
               …                    …
Afghan Theme   Northern 0.04        taleban 0.026      …
               alliance 0.04        rumsfeld 0.02
               kabul 0.03           hotel 0.012
               taleban 0.025        front 0.011
               aid 0.02             …
               …

Collection-specific themes indicate different roles of “United Nations” in the two wars
28
Spatiotemporal Patterns in Blog Articles
• Query = “Hurricane Katrina”
• Topics in the results:
Government Response
bush 0.071
president 0.061
federal 0.051
government 0.047
fema 0.047
administrate 0.023
response 0.020
brown 0.019
blame 0.017
governor 0.014
New Orleans
city 0.063
orleans 0.054
new 0.034
louisiana 0.023
flood 0.022
evacuate 0.021
storm 0.017
resident 0.016
center 0.016
rescue 0.012
Oil Price
price 0.077
oil 0.064
gas 0.045
increase 0.020
product 0.020
fuel 0.018
company 0.018
energy 0.017
market 0.016
gasoline 0.012
Praying and Blessing
god 0.141
pray 0.047
prayer 0.041
love 0.030
life 0.025
bless 0.025
lord 0.017
jesus 0.016
will 0.013
faith 0.012
Aid and Donation
donate 0.120
relief 0.076
red 0.070
cross 0.065
help 0.050
victim 0.036
organize 0.022
effort 0.020
fund 0.019
volunteer 0.019
Personal
i 0.405
my 0.116
me 0.060
am 0.029
think 0.015
feel 0.012
know 0.011
something 0.007
guess 0.007
myself 0.006
• Spatiotemporal patterns
29
Theme Life Cycles (“Hurricane Katrina”)
Oil Price
price 0.0772
oil 0.0643
gas 0.0454
increase 0.0210
product 0.0203
fuel 0.0188
company 0.0182
…
New Orleans
city 0.0634
orleans 0.0541
new 0.0342
louisiana 0.0235
flood 0.0227
evacuate 0.0211
storm 0.0177
…
30
Theme Snapshots (“Hurricane Katrina”)
Week1: The theme is the strongest along the Gulf of Mexico
Week2: The discussion moves towards the north and west
Week3: The theme distributes more uniformly over the states
Week4: The theme is again strong along the east coast and the Gulf of Mexico
Week5: The theme fades out in most states
31
Theme Life Cycles (KDD Papers)
[Figure: normalized strength of theme (y-axis, 0 to 0.02) over time in years (x-axis, 1999 to 2004) for the themes Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business]
Sample theme word distributions:
gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
rules 0.0142, association 0.0064, support 0.0053, …
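A theme life cycle of this kind can be approximated with a short sketch, assuming each document already has a year and a per-theme coverage value (for example from a fitted topic model). Averaging coverage over the documents of each year is one possible normalization chosen here for illustration; the slide does not state how the strength is normalized, and the toy values are invented.

```python
from collections import defaultdict

def theme_life_cycles(docs):
    """Average theme coverage per year.

    docs: iterable of (year, {theme: coverage}) pairs.
    Returns {theme: {year: mean coverage over that year's documents}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    doc_counts = defaultdict(int)
    for year, coverage in docs:
        doc_counts[year] += 1
        for theme, p in coverage.items():
            totals[theme][year] += p
    return {
        theme: {year: s / doc_counts[year] for year, s in sorted(by_year.items())}
        for theme, by_year in totals.items()
    }

# Toy input with invented coverage values.
toy_docs = [
    (1999, {"Association Rule": 0.5, "Biology Data": 0.0}),
    (1999, {"Association Rule": 0.25, "Biology Data": 0.25}),
    (2004, {"Association Rule": 0.0, "Biology Data": 0.5}),
]
cycles = theme_life_cycles(toy_docs)
```

In the toy data the "Association Rule" theme weakens from 1999 to 2004 while "Biology Data" strengthens, which is the kind of trend the plot is meant to show.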
32
Theme Evolution Graph: KDD
[Figure: theme evolution graph over the years 1999 to 2004; each node is a theme shown by its top words]
Theme nodes:
- SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
- decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
- web 0.009, classification 0.007, features 0.006, topic 0.005, …
- mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
- Classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
- Information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
- topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
33
Multi-Faceted Sentiment Summary
(query=“Da Vinci Code”)
Facet 1: Movie
Neutral:
- ... Ron Howards selection of Tom Hanks to play Robert Langdon.
- Directed by: Ron Howard Writing credits: Akiva Goldsman ...
- After watching the movie I went online and some research on ...
Positive:
- Tom Hanks stars in the movie,who can be mad at that?
- Tom Hanks, who is my favorite movie star act the leading role.
- Anybody is interested in it?
Negative:
- But the movie might get delayed, and even killed off if he loses.
- protesting ... will lose your faith by ... watching the movie.
- ... so sick of people making such a big deal about a FICTION book and movie.
Facet 2: Book
Neutral:
- I remembered when i first read the book, I finished the book in two days.
- I’m reading “Da Vinci Code” now.
Positive:
- Awesome book.
- So still a good book to past time.
Negative:
- ... so sick of people making such a big deal about a FICTION book and movie.
- This controversy book cause lots conflict in west society.
…
34
Separate Theme Sentiment Dynamics
“book”
“religious beliefs”
35
Event Impact Analysis: IR Research
Theme: retrieval models
term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
[Figure: timeline of SIGIR papers by year with two events marked]
Events:
- 1992: Starting of the TREC conferences
- 1998: Publication of the paper “A language modeling approach to information retrieval”
Versions of the theme shown around the events:
- vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
- xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …
- probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
- model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …
36
Topic Modeling + Social Networks
Authors writing about the same topic form a community
Separation of 3 research communities: IR, ML, Web
Topic Model Only
Topic Model + Social Network
37
On-Going Work
• Combining contextual text analysis with visualization
• More detailed semantic modeling (entities, relations,…)
• Integration of search and contextual text analysis to develop an analyst’s workbench:
– Interactive semantic navigation and probing
– Synthesis of information/knowledge
– Personalized/customized service
38
Project 4:
Opinion Integration and Summarization
• Increasing popularity of Web 2.0 applications
– more people express opinions on the Web
How to digest all?
190,451
posts
4,773,658
results
39
Motivation:
Two kinds of opinions
Expert opinions
• CNET editor’s review
• Wikipedia article
• Well-structured
• Easy to access
• Maybe biased
• Outdated soon
Ordinary opinions
• Forum discussions
• Blog articles
• Represent the majority
• Up to date
• Hard to access
• Fragmental
190,451 posts
4,773,658 results
How to benefit from both?
40
Problem Definition
Input
- Topic: iPod
- Expert review with aspects: Design, Battery, Price, ..
- Text collection of ordinary opinions, e.g. Weblogs
Output: Integrated Summary

Review Aspects   Similar opinions   Supplementary opinions
Design           cute… tiny…        ..thicker..
Battery          last many hrs      die out soon
Price            could afford it    still expensive

Extra Aspects
iTunes           … easy to use…
warranty         …better to extend..
41
Methods
• Semi-Supervised Probabilistic Latent Semantic
Analysis (PLSA)
– The aspects extracted from expert reviews serve as
clues to define a conjugate prior on topics
– Maximum a Posteriori (MAP) estimation
– Repeated applications of PLSA to integrate and align
opinions in blog articles to expert review
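The role of the conjugate prior can be shown with a small sketch of the usual MAP re-estimate of a topic's word distribution under a Dirichlet prior, p(w | θ) = (c(w) + μ p(w | aspect)) / (Σ c(w') + μ). The slide does not give the formula, so this form, the prior strength name `mu`, and the toy numbers are assumptions made for illustration.

```python
def map_word_distribution(expected_counts, prior, mu):
    """MAP estimate of p(w | topic) with a Dirichlet prior.

    expected_counts: {word: expected count assigned to this topic (E-step)}
    prior: {word: p(w | review aspect)}, assumed to sum to 1
    mu: prior strength (pseudo-count mass); mu = 0 gives the ML estimate
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu
    return {
        w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
        for w in vocab
    }

# Toy example: a "battery" aspect from an expert review guides the topic.
counts = {"battery": 3.0, "life": 1.0}
aspect_prior = {"battery": 0.5, "hours": 0.5}
topic = map_word_distribution(counts, aspect_prior, mu=4.0)
```

With `mu=4.0` the word "hours" receives probability even though it was not observed in the blog text, which is how aspect words from the review can pull a topic toward that aspect.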
Results: Product (iPhone)
• Opinion Integration with review aspects
Aspect: Activation
- Review article: You can make emergency calls, but you can't use any other functions…
- Similar opinions: N/A
- Supplementary opinions: … methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware…
  (Callout: Unlock/hack iPhone)
Aspect: Battery
- Review article: rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use.
- Similar opinions: iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback
  (Callout: Confirm the opinions from the review)
- Supplementary opinions: Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off).
  (Callout: Additional info under real usage)
43
Results: Product (iPhone)
• Opinions on extra aspects
Support 15: You may have heard of iASign … an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole.
  (Callout: Another way to activate iPhone)
Support 13: Cisco has owned the trademark on the name "iPhone" since 2000, when it acquired InfoGear Technology Corp., which originally registered the name.
  (Callout: iPhone trademark originally owned by Cisco)
Support 13: With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match...
  (Callout: A better choice for smart phones?)
44
Results: Product (iPhone)
• Support statistics for review aspects
People care about
price
Controversy: activation
requires contract with
AT&T
People comment a lot
about the unique wi-fi
feature
45
Summarization of Contradictory Opinions
[Kim & Zhai CIKM 09]
[Same “Da Vinci Code” table of facets (Movie, Book) by sentiment (Neutral, Positive, Negative) as on the Multi-Faceted Sentiment Summary slide]
How can we help analysts digest and interpret contradictory opinions?
46
Contrastive Opinion Summarization
[Figure: two sets of opinion sentences, X = {x1, x2, …, xn} and Y = {y1, y2, …, ym}]
47
Contrastive Opinion Summarization
[Figure: from X = {x1, …, xn} and Y = {y1, …, ym}, select U ⊂ X and V ⊂ Y and align them as k pairs (u1, v1), (u2, v2), …, (uk, vk)]
Contrastive Opinion Summary
48
Problem Formulation
[Figure: same diagram of X, Y, U = {u1, …, uk}, and V = {v1, …, vk}, annotated with two criteria]
Representativeness
Contrastiveness
49
Problem Formulation
Representativeness
$$r(S) = \frac{1}{|X|}\sum_{x \in X}\max_{i \in [1,k]}\phi(x, u_i) + \frac{1}{|Y|}\sum_{y \in Y}\max_{i \in [1,k]}\phi(y, v_i)$$
Contrastiveness
$$c(S) = \frac{1}{k}\sum_{i=1}^{k}\psi(u_i, v_i)$$
50
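Summarization as Optimization
$$S^* = \arg\max_{S}\big(\lambda\, r(S) + (1-\lambda)\, c(S)\big)$$
$$= \arg\max_{S}\Big(\frac{\lambda}{|X|}\sum_{x \in X}\max_{i \in [1,k]}\phi(x, u_i) + \frac{\lambda}{|Y|}\sum_{y \in Y}\max_{i \in [1,k]}\phi(y, v_i) + \frac{1-\lambda}{k}\sum_{i=1}^{k}\psi(u_i, v_i)\Big)$$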
1. Define an appropriate content similarity function Ф
2. Define an appropriate contrastive similarity function ψ
3. Solve the optimization problem efficiently.
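The three steps above can be sketched on a toy scale. This is a minimal illustration under stated assumptions, not the method of [Kim & Zhai CIKM 09]: the content similarity is plain Jaccard word overlap, the contrastive similarity is Jaccard overlap after dropping a small hypothetical list of sentiment words, and the "optimization" is a brute-force search for a single pair (k = 1). All names and sample sentences are invented.

```python
def tokens(sentence):
    """Lowercased whitespace tokens as a set."""
    return set(sentence.lower().split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical sentiment word list used only by the contrastive similarity.
SENTIMENT_WORDS = {"good", "great", "bad", "poor", "fast", "slow"}

def phi(s, t):
    """Content similarity (assumed here: Jaccard word overlap)."""
    return jaccard(tokens(s), tokens(t))

def psi(u, v):
    """Contrastive similarity (assumed here: overlap ignoring sentiment words)."""
    return jaccard(tokens(u) - SENTIMENT_WORDS, tokens(v) - SENTIMENT_WORDS)

def representativeness(X, Y, pairs):
    r_x = sum(max(phi(x, u) for u, _ in pairs) for x in X) / len(X)
    r_y = sum(max(phi(y, v) for _, v in pairs) for y in Y) / len(Y)
    return r_x + r_y

def contrastiveness(pairs):
    return sum(psi(u, v) for u, v in pairs) / len(pairs)

def objective(X, Y, pairs, lam=0.5):
    """lam * r(S) + (1 - lam) * c(S) for a summary S given as aligned pairs."""
    return lam * representativeness(X, Y, pairs) + (1 - lam) * contrastiveness(pairs)

def best_single_pair(X, Y, lam=0.5):
    """Brute-force search over all (x, y) pairs; feasible only for tiny inputs."""
    candidates = [(x, y) for x in X for y in Y]
    return max(candidates, key=lambda p: objective(X, Y, [p], lam))

X = ["battery life is good", "screen is great"]
Y = ["battery life is bad", "price is bad"]
best = best_single_pair(X, Y)
```

On this toy input the selected pair is the two battery sentences: they are about the same thing once sentiment words are ignored, and each also overlaps with the other sentence in its own set. For realistic sizes of k the search space grows quickly, so an approximate search would be needed instead of brute force.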
51
Sample Results
No. 1
Positive: oh ... and file transfers are fast & easy .
Negative: you need the software to actually transfer files
No. 2
Positive: i noticed that the micro adjustment knob and collet are well made and work well too.
Negative: the adjustment knob seemed ok, but when lowering the router, i have to practically pull it down while turning the knob.
No. 3
Positive: the navigation is nice enough , but scrolling and searching through thousands of tracks , hundreds of albums or artists , or even dozens of genres is not conducive to save driving
Negative: difficult navigation - i wo n’t necessarily say " difficult ,“ but i do n’t enjoy the scrollwheel to navigate .
No. 4
Positive: i imagine if i left my player untouched (no backlight) it could play for considerably more than 12 hours at a low volume level.
Negative: there are 2 things that need fixing first is the battery life. it will run for 6 hrs without problems with medium usage of the buttons.
52
Sample Result
[Same table of four positive/negative sentence pairs as on the Sample Results slide]
Callout: Different polarities of opinions made from different perspectives.
53
Sample Result
[Same table of four positive/negative sentence pairs as on the Sample Results slide]
Callouts: Positive vs. negative; Not much disagreement
54
Sample Result
[Same table of four positive/negative sentence pairs as on the Sample Results slide]
Callout: Judgments revealing detailed conditions
55
Ongoing Work
• Discovery of controversy and contrastive
summarization
• Information trustworthiness
56
The End
Thank You!
More information about our research can be found at
http://timan.cs.uiuc.edu/
57