Putting Query Representation and
Understanding in Context:
A Decision-Theoretic Framework for Optimal Interactive
Retrieval through Dynamic User Modeling
ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Including joint work with Xuehua Shen and Bin Tan

SIGIR 2010 Workshop on Query Representation and Understanding, July 23, 2010, Geneva, Switzerland
What is a query?
Example query: "iPhone battery"
Query = a sequence of keywords?
Query = a sequence of keywords that describe the
information need of a particular user at a particular
time for finishing a particular task
Rich context!
Query must be put in a context
What queries did the user type in before this query?
What documents were just viewed by this user?
What documents were skipped by this user?
What other users looked for similar information?
……
Example query: "Jaguar"
Car? Animal? Mac OS?
Context helps query understanding
Suppose we know:
1. The previous query was "racing cars" (suggests Car) vs. "Apple OS" (suggests Software)
2. "car" occurs far more frequently than "Apple" in pages browsed by the user in the last 20 days (suggests Car)
3. The user just viewed an "Apple OS" document (suggests Software)
Questions
• How can we model a query in a context-sensitive way?
→ Generalize query representation to a user model
• How can we model the dynamics of user information needs?
→ Dynamic updating of user models
• How can we put query representation into a retrieval framework to improve search?
→ A framework for optimal interactive retrieval
Rest of the talk: UCAIR Project
1. A decision-theoretic framework
2. Statistical language models for implicit
feedback (personalized search without
extra user effort)
3. Open challenges
UCAIR Project
• UCAIR = User-Centered Adaptive IR
– user modeling (“user-centered”)
– search context modeling (“adaptive”)
– interactive retrieval
• Implemented as a personalized search
agent that
– sits on the client side (owned by the user)
– integrates information around a user (1 user vs. N sources, as opposed to 1 source vs. N users)
– collaborates with other users' agents
– goes beyond search toward task support
Main Idea: Putting the User in the Center!
A search agent can know about
a particular user very well
Diagram: the personalized search agent runs on the user's own machine, between the user and multiple search engines on the Web. It integrates information around the user, such as email, viewed Web pages, query history, and desktop files, when serving a query such as "java".
1. A Decision-Theoretic Framework
for Optimal Interactive Retrieval
IR as Sequential Decision Making
The user (with an information need) and the system (with a model of that information need) take turns making decisions:
– A1: the user enters a query; the system decides which documents to present and how to present them, returning results R1
– The user decides which documents to view
– A2: the user views a document; the system decides which part of the document to show and how, returning document content R'
– The user decides whether to view more
– A3: the user clicks the "Back" button; the system responds again, and so on (Ri: results, i = 1, 2, 3, …)
Retrieval Decisions
Let the interaction history be H = {(A_i, R_i)}, i = 1, …, t-1. Given the user U, the document collection C, the current action A_t, and H, choose the best response R_t from r(A_t), the set of all possible responses to A_t.

Example: user U enters the query "Jaguar" (A_1), the system returns R_1, the interaction continues (A_2, R_2, …, A_{t-1}, R_{t-1}), and the user now clicks the "Next" button (A_t). What should R_t be?
– If r(A_t) = all possible rankings of C: the best ranking for the query
– If r(A_t) = all possible rankings of unseen docs: the best ranking of the documents the user has not yet seen
A Risk Minimization Framework
Observed: the user U, the interaction history H, the current user action A_t, and the document collection C.
Inferred user model: M = (S, θ_U, …), where S = seen documents and θ_U = information need.
All possible responses to A_t: r(A_t) = {r_1, …, r_n}, each scored by a loss function L(r_i, A_t, M).
The optimal response r* is the one with minimum Bayes risk:

R_t = \arg\min_{r \in r(A_t)} \int_M L(r, A_t, M)\, P(M \mid U, H, A_t, C)\, dM
A Simplified Two-Step Decision-Making Procedure
• Approximate the Bayes risk by the loss at the mode of the posterior distribution:

R_t = \arg\min_{r \in r(A_t)} \int_M L(r, A_t, M)\, P(M \mid U, H, A_t, C)\, dM
    \approx \arg\min_{r \in r(A_t)} L(r, A_t, M^*)\, P(M^* \mid U, H, A_t, C)
    = \arg\min_{r \in r(A_t)} L(r, A_t, M^*)
where M^* = \arg\max_M P(M \mid U, H, A_t, C)

• Two-step procedure
– Step 1: Compute an updated user model M* based on the currently available information
– Step 2: Given M*, choose a response to minimize the loss function
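A minimal sketch of this two-step procedure in code (all names and the toy loss/model below are illustrative assumptions, not from the slides): step 1 estimates the posterior mode M*, step 2 scores each candidate response by its loss under M* and returns the minimizer.

```python
# Hypothetical sketch of the two-step decision procedure; names and the toy
# loss/model are illustrative assumptions, not the authors' implementation.

def optimal_response(candidates, estimate_model, loss, user, history, action, collection):
    """Step 1: M* = argmax_M P(M | U, H, A_t, C); Step 2: r* = argmin_r L(r, A_t, M*)."""
    m_star = estimate_model(user, history, action, collection)   # posterior mode
    return min(candidates, key=lambda r: loss(r, action, m_star))

# Toy usage: candidate "responses" are two document rankings, the "user model" is a
# preferred sense of "jaguar", and the loss is the rank of the first matching document.
if __name__ == "__main__":
    docs = {"d1": "jaguar car review", "d2": "jaguar mac os"}
    rankings = [("d1", "d2"), ("d2", "d1")]
    estimate = lambda u, h, a, c: "car" if any("car" in q for q in h) else "os"
    loss = lambda r, a, m: [m in docs[d] for d in r].index(True)
    best = optimal_response(rankings, estimate, loss,
                            user="u1", history=["racing cars"], action="jaguar",
                            collection=docs)
    print(best)  # ('d1', 'd2'): the car-sense document is ranked first
```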
Optimal Interactive Retrieval
Diagram: the user issues action A_1; the IR system infers a user model M*_1 from P(M_1 | U, H, A_1, C) over the collection C, minimizes L(r, A_1, M*_1), and returns response R_1. The user then issues A_2; the system updates the model to M*_2 via P(M_2 | U, H, A_2, C), minimizes L(r, A_2, M*_2), and returns R_2; and so on for A_3, ….
Refinement of Risk Minimization
• r(A_t): decision space (A_t dependent)
– r(A_t) = all possible subsets of C (document selection)
– r(A_t) = all possible rankings of docs in C
– r(A_t) = all possible rankings of unseen docs
– r(A_t) = all possible subsets of C + summarization strategies
• M: user model
– Essential component: θ_U = user information need
– S = seen documents
– n = "topic is new to the user"
• L(R_t, A_t, M): loss function
– Generally measures the utility of R_t for a user modeled as M
– Often encodes retrieval criteria (e.g., using M to select a ranking of docs)
• P(M | U, H, A_t, C): user model inference
– Often involves estimating a unigram language model θ_U
Case 1: Context-Insensitive IR
– A_t = "enter a query Q"
– r(A_t) = all possible rankings of docs in C
– M = θ_U, a unigram language model (word distribution)
– p(M | U, H, A_t, C) = p(θ_U | Q)

L(r_i, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\mathrm{viewed} \mid d_i)\, D(\theta_U \,\|\, \theta_{d_i})

Since p(viewed | d_1) ≥ p(viewed | d_2) ≥ …, the optimal ranking R_t is given by ranking documents by D(θ_U ‖ θ_{d_i}) (smallest divergence first).
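As a concrete illustration of this ranking rule, here is a toy sketch (not the authors' code; the Dirichlet smoothing choice and all names are assumptions) that builds smoothed document models and ranks documents by increasing KL divergence from the user model.

```python
import math
from collections import Counter

def doc_lm(text, collection_lm, mu=2000):
    """Dirichlet-smoothed document model p(w | theta_d) (a common choice, assumed here)."""
    counts, length = Counter(text.split()), len(text.split())
    return lambda w: (counts[w] + mu * collection_lm(w)) / (length + mu)

def kl_divergence(theta_u, theta_d, vocab):
    """D(theta_U || theta_d), summed over the user-model vocabulary."""
    return sum(p * math.log(p / theta_d(w)) for w in vocab if (p := theta_u(w)) > 0)

def rank_case1(theta_u, vocab, docs, collection_lm):
    """Case 1 ranking: order documents by increasing divergence from theta_U."""
    models = {d: doc_lm(text, collection_lm) for d, text in docs.items()}
    return sorted(docs, key=lambda d: kl_divergence(theta_u, models[d], vocab))

# Toy usage: theta_U is the empirical model of the query "jaguar car".
docs = {"d1": "jaguar car speed review", "d2": "jaguar habitat rainforest animal"}
all_words = " ".join(docs.values()).split()
coll = Counter(all_words)
collection_lm = lambda w: (coll[w] + 0.01) / (len(all_words) + 0.01 * len(coll))
query = "jaguar car".split()
theta_u = lambda w: query.count(w) / len(query)
print(rank_case1(theta_u, set(query), docs, collection_lm))  # ['d1', 'd2']
```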
Case 2: Implicit Feedback
– A_t = "enter a query Q"
– r(A_t) = all possible rankings of docs in C
– M = θ_U, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, A_t, C) = p(θ_U | Q, H)

The loss function is the same as in Case 1:

L(r_i, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\mathrm{viewed} \mid d_i)\, D(\theta_U \,\|\, \theta_{d_i})

so the optimal ranking again orders documents by D(θ_U ‖ θ_{d_i}); the difference is that θ_U is now estimated from both Q and the history H.
Case 3: General Implicit Feedback
– A_t = "enter a query Q", or click the "Back" or "Next" button
– r(A_t) = all possible rankings of unseen docs in C
– M = (θ_U, S), where S = seen documents
– H = {previous queries} + {viewed snippets}
– p(M | U, H, A_t, C) = p(θ_U | Q, H)

The loss function has the same form as above, so the optimal ranking orders the unseen documents by D(θ_U ‖ θ_{d_i}).
Case 4: User-Specific Result Summary
– A_t = "enter a query Q"
– r(A_t) = {(D, η)}, where D ⊆ C, |D| = k, and η ∈ {"snippet", "overview"}
– M = (θ_U, n), where n ∈ {0, 1} indicates "topic is new to the user"
– p(M | U, H, A_t, C) = p(θ_U, n | Q, H), with mode M* = (θ*, n*)

L(r_i, A_t, M) = L(D_i, \eta_i, \theta^*, n^*) = L(D_i, \theta^*) + L(\eta_i, n^*) = \sum_{d \in D_i} D(\theta^* \,\|\, \theta_d) + L(\eta_i, n^*)

The first term selects the k most relevant documents. The second term is a 0/1 loss over the summary type:

                   n* = 1   n* = 0
η_i = snippet        1        0
η_i = overview       0        1

If the topic is new to the user (n* = 1), give an overview summary; otherwise, give a regular snippet summary.
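A tiny illustrative sketch of the summary-type part of this decision (names assumed, not from the slides): with the 0/1 loss table above, the optimal choice simply follows the inferred novelty flag n*.

```python
# Illustrative sketch of the Case 4 summary-type decision using the 0/1 loss table above.
LOSS = {("snippet", 1): 1, ("snippet", 0): 0,   # L(eta, n*)
        ("overview", 1): 0, ("overview", 0): 1}

def choose_summary_type(n_star):
    """Pick the summary type eta minimizing L(eta, n*)."""
    return min(("snippet", "overview"), key=lambda eta: LOSS[(eta, n_star)])

print(choose_summary_type(n_star=1))  # overview: topic is new to the user
print(choose_summary_type(n_star=0))  # snippet: topic is familiar
```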
2. Statistical Language Models
for implicit feedback
(Personalized search without
extra user effort)
Risk Minimization for Implicit Feedback
– A_t = "enter a query Q"
– r(A_t) = all possible rankings of docs in C
– M = θ_U, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, A_t, C) = p(θ_U | Q, H)  ← need to estimate a context-sensitive LM

L(r_i, A_t, M) = L((d_1, \ldots, d_N), \theta_U) = \sum_{i=1}^{N} p(\mathrm{viewed} \mid d_i)\, D(\theta_U \,\|\, \theta_{d_i})

Since p(viewed | d_1) ≥ p(viewed | d_2) ≥ …, the optimal ranking R_t is given by ranking documents by D(θ_U ‖ θ_{d_i}).
Estimate a Context-Sensitive LM
Query history and clickthrough history:
– Q_1: user query, e.g., "Apple software"
– C_1 = {C_{1,1}, C_{1,2}, C_{1,3}, …}: user clickthrough, e.g., "Apple - Mac OS X: The Apple Mac OS X product page. Describes features in the current version of Mac OS X, …"
– Q_2, C_2 = {C_{2,1}, C_{2,2}, C_{2,3}, …}
– …
– Q_k: current user query, e.g., "Jaguar"

User model: p(w \mid \theta_k) = p(w \mid Q_k, Q_{k-1}, \ldots, Q_1, C_{k-1}, \ldots, C_1) = ?
Short-term vs. long-term implicit feedback
• Short-term implicit feedback
– context = current retrieval session
– past queries in the context are closely related to the current query
– clickthroughs reflect the user's current interests
• Long-term implicit feedback
– context = all search interaction history
– not all past queries/clickthroughs are related to the current query
"Bayesian interpolation" for short-term implicit feedback

Average the user's query history and clickthrough history into two history models:

p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i), \qquad
p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)

Then use them as a Dirichlet prior when estimating the model of the current query Q_k:

p(w \mid \theta_k) = \frac{c(w, Q_k) + \mu\, p(w \mid H_Q) + \nu\, p(w \mid H_C)}{|Q_k| + \mu + \nu}

Intuition: trust the current query Q_k more if it is longer.
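A minimal sketch of this estimate in code (a toy implementation, not the original system; the tokenization and parameter defaults are assumptions), with μ and ν as the Dirichlet prior weights from the formula above.

```python
from collections import Counter

def history_lm(texts):
    """Average unigram model over past queries (H_Q) or clicked snippets (H_C)."""
    models = []
    for t in texts:
        c = Counter(t.split())
        n = sum(c.values())
        models.append({w: cnt / n for w, cnt in c.items()})
    vocab = set().union(*(m.keys() for m in models)) if models else set()
    return {w: sum(m.get(w, 0.0) for m in models) / len(models) for w in vocab}

def bayesint(current_query, past_queries, clicked_snippets, mu=0.2, nu=5.0):
    """p(w | theta_k) = (c(w,Q_k) + mu*p(w|H_Q) + nu*p(w|H_C)) / (|Q_k| + mu + nu)."""
    h_q = history_lm(past_queries)
    h_c = history_lm(clicked_snippets)
    c_q = Counter(current_query.split())
    q_len = sum(c_q.values())
    denom = q_len + mu + nu
    vocab = set(c_q) | set(h_q) | set(h_c)
    return {w: (c_q[w] + mu * h_q.get(w, 0.0) + nu * h_c.get(w, 0.0)) / denom
            for w in vocab}

# Toy usage mirroring the slide's example: earlier query "apple software",
# one clicked Mac OS X snippet, and current query "jaguar".
theta_k = bayesint("jaguar", ["apple software"],
                   ["apple mac os x product page describes features of mac os x"])
print(sorted(theta_k.items(), key=lambda kv: -kv[1])[:5])
```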
Overall Effect of Search Context

            FixInt             BayesInt           OnlineUp           BatchUp
            (α=0.1, β=1.0)     (μ=0.2, ν=5.0)     (μ=5.0, ν=15.0)    (μ=2.0, ν=15.0)
Query       MAP     pr@20      MAP     pr@20      MAP     pr@20      MAP     pr@20
Q3          0.0421  0.1483     0.0421  0.1483     0.0421  0.1483     0.0421  0.1483
Q3+HQ+HC    0.0726  0.1967     0.0816  0.2067     0.0706  0.1783     0.0810  0.2067
Improve     72.4%   32.6%      93.8%   39.4%      67.7%   20.2%      92.4%   39.4%
Q4          0.0536  0.1933     0.0536  0.1933     0.0536  0.1933     0.0536  0.1933
Q4+HQ+HC    0.0891  0.2233     0.0955  0.2317     0.0792  0.2067     0.0950  0.2250
Improve     66.2%   15.5%      78.2%   19.9%      47.8%   6.9%       77.2%   16.4%

• Short-term context helps the system improve retrieval accuracy
• BayesInt is better than FixInt; BatchUp is better than OnlineUp
Using Clickthrough Data Only
BayesInt (μ=0.0, ν=5.0)

Clickthrough is the major contributor:
Query      MAP      pr@20
Q3         0.0421   0.1483
Q3+HC      0.0766   0.2033
Improve    81.9%    37.1%
Q4         0.0536   0.1930
Q4+HC      0.0925   0.2283
Improve    72.6%    18.1%

Performance on unseen docs:
Query      MAP      pr@20
Q3         0.0331   0.125
Q3+HC      0.0661   0.178
Improve    99.7%    42.4%
Q4         0.0442   0.165
Q4+HC      0.0739   0.188
Improve    67.2%    13.9%

Snippets for non-relevant docs are still useful:
Query      MAP      pr@20
Q3         0.0421   0.1483
Q3+HC      0.0521   0.1820
Improve    23.8%    23.0%
Q4         0.0536   0.1930
Q4+HC      0.0620   0.1850
Improve    15.7%    -4.1%
Mixture model with dynamic weighting for long-term implicit feedback

Diagram: each past search session S_i (i = 1, …, t-1), with its query q_i, result documents D_i, and clickthroughs C_i, is summarized by a session language model θ_{S_i} with an unknown weight λ_i; the current query q_t (with documents D_t) is modeled by θ_q with weight λ_q. The session models are combined into a history model θ_H, which is interpolated with θ_q (weights λ_q and 1-λ_q) to form the final model θ_{q,H}. The weights {λ} are selected to maximize P(D_t | θ_{q,H}) using the EM algorithm.
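A rough sketch of how such mixture weights can be fit with EM (a simplified illustration under assumed component models, not the paper's exact algorithm): each fixed component model competes to explain the words in D_t, and the weights are re-estimated from the expected assignments.

```python
from collections import Counter

def em_mixture_weights(doc_words, components, iters=50):
    """Fit mixture weights over fixed component unigram models to maximize the
    likelihood of doc_words (a simplified stand-in for maximizing P(D_t | theta_{q,H}))."""
    k = len(components)
    weights = [1.0 / k] * k
    counts = Counter(doc_words)
    for _ in range(iters):
        # E-step: expected assignment of each word occurrence to each component.
        expected = [0.0] * k
        for w, c in counts.items():
            probs = [weights[j] * components[j].get(w, 1e-9) for j in range(k)]
            z = sum(probs)
            for j in range(k):
                expected[j] += c * probs[j] / z
        # M-step: renormalize expected counts into new weights.
        total = sum(expected)
        weights = [e / total for e in expected]
    return weights

# Toy usage: component 0 is the current query model, 1 and 2 are past-session models.
theta_q  = {"jaguar": 0.5, "car": 0.5}
theta_s1 = {"racing": 0.4, "car": 0.4, "speed": 0.2}     # car-related past session
theta_s2 = {"apple": 0.5, "os": 0.3, "software": 0.2}    # software-related past session
dt_words = "jaguar car racing speed car".split()         # stand-in for D_t
print(em_mixture_weights(dt_words, [theta_q, theta_s1, theta_s2]))
```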
Results: Different Individual Search Models

Summary of the results figure:
– recurring ≫ fresh
– combination ≈ clickthrough > docs > query, contextless
Results: Different Weighting Schemes for Overall History Model

Summary of the results figure:
– hybrid ≈ EM > cosine > equal > contextless
3. Open Challenges
• What is a query?
• How to collect as much context information as possible
without infringing user privacy?
• How to store and organize the collected context
information?
• How to accurately interpret/exploit context information?
• How to formally represent the evolving information need of
a user?
• How to optimize search results for an entire session?
• What’s the right architecture (client-side, server-side, and
client-server combo)?
References
• Framework
– Xuehua Shen, Bin Tan, and ChengXiang Zhai. "Implicit User Modeling for Personalized Search." Proceedings of CIKM 2005, pp. 824-831.
– ChengXiang Zhai and John Lafferty. "A Risk Minimization Framework for Information Retrieval." Information Processing and Management, 42(1), Jan. 2006, pp. 31-55.
• Short-term implicit feedback
– Xuehua Shen, Bin Tan, and ChengXiang Zhai. "Context-Sensitive Information Retrieval Using Implicit Feedback." Proceedings of SIGIR 2005, pp. 43-50.
• Long-term implicit feedback
– Bin Tan, Xuehua Shen, and ChengXiang Zhai. "Mining Long-Term Search History to Improve Search Accuracy." Proceedings of KDD 2006, pp. 718-723.
Thank You!