USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS Ablimit Aji, Yu Wang Eugene Agichtein, Evgeniy Gabrilovich Oct.


USING THE PAST TO SCORE THE PRESENT:
EXTENDING TERM WEIGHTING MODELS
WITH REVISION HISTORY ANALYSIS
Ablimit Aji, Yu Wang
Eugene Agichtein, Evgeniy Gabrilovich
Oct. 28, 2010
1
Revisions of “Topology” on Wikipedia
[Figure: the 1st revision, the 250th revision, and the current revision]
2
Observable Document Generation Process
Revision #i-1 (95th revision):
In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.
Revision #i (96th revision):
In mathematics, '''topology''' is a branch concerned with the study of topological spaces.
Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms
3
How Revision History Analysis Could Help Retrieval
Revision History Analysis
4
Selected Prior Work
• J. Elsas and S. Dumais. Leveraging temporal dynamics of document
content in relevance ranking. In Proc. of WSDM, 2010.
• M. Efron. Linear time series models for term weighting in information
retrieval. JASIST, 2010.
• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned
document collections. In CIKM, New York, NY, USA, 2009.
5
Revision History Analysis (RHA)
RHA redefines term frequency (TF):
- TF is a key indicator of document relevance
- TF can be naturally integrated into ranking models
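BM25:
\[ S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)} \]
Language Model:
\[ S(Q, D) = D(Q \| D) = \sum_{t \in V} P(t|Q) \cdot \log \frac{P(t|Q)}{P(t|D)} \]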
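To make the role of TF in the BM25 formula concrete, here is a minimal Python sketch of the per-document score. The function name, the dictionary-based inputs, and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions (common textbook defaults), not values taken from the slides.

```python
def bm25_score(query_terms, doc_tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Sum the BM25 contribution of every query term for one document.

    doc_tf maps term -> TF(t, D); idf maps term -> IDF(t).
    k1 and b are assumed defaults, not parameters reported in the talk.
    """
    # Length normalization term: k1 * (1 - b + b * |D| / avgdl)
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0.0)
        if tf > 0.0:
            score += idf.get(t, 0.0) * tf * (k1 + 1.0) / (tf + norm)
    return score
```

Because TF enters the score only through `doc_tf`, any redefined term frequency can be passed in without changing the ranking function itself.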
6
Model 1: Steady growth
First revision:
Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.
Current version:
Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example
…..
basic examples include compactness and connectedness
7
Model 1 (continued)
8
RHA Global Model: definition
Define the term frequency over the whole document
generation process
– a document grows steadily over time
– a term is relatively important if it appears in the early
revisions.
\[ TF_{global}(t, d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}} \]
where c(t, v_j) is the frequency of term t in revision v_j, and j^α is the decay factor.
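The global model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: revisions are represented as token lists ordered oldest first, and the default alpha = 1.1 is an arbitrary placeholder, since the slides do not report the decay value.

```python
def tf_global(term, revisions, alpha=1.1):
    """Decayed term frequency over all revisions v_1..v_n (oldest first).

    revisions: list of token lists; alpha is an assumed decay exponent.
    """
    total = 0.0
    for j, tokens in enumerate(revisions, start=1):
        # c(t, v_j) / j^alpha: occurrences in early revisions weigh more
        total += tokens.count(term) / (j ** alpha)
    return total
```

With alpha = 0 the sum reduces to the raw count of the term across all revisions; larger values of alpha shift the weight toward the earliest revisions.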
9
But… Some pages are different:
“Avatar(2009 film)”
[Figure: the 1st revision, the 500th revision, and the current revision]
10
Model 2: Bursty Growth
Burst of Document (Length) & Change of Term Frequency
Time         TF “Pandora”    TF “James Cameron”    Document Length
Nov. 2009    9               23                    2576
Dec. 2009    25              50                    6306
Burst of Edit Activity & Associated Events
Month (2009)     Jul.   Aug.   Sep.   Oct.   Nov.   Dec.
Edit Activity    89     224    67     154    232    1892
Associated events: first photo & trailer released; movie released
Global Model might be insufficient
11
RHA Burst Model: Definition
• A burst resets the decay clock for a term.
• The weight will decrease after a burst.
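\[ TF_{burst}(t, d) = \sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}} \]
where c(t, v_k) is the frequency of term t in revision v_k, and (k - b_j + 1)^β is the decay factor for the jth burst.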
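The burst model differs from the global model only in where the decay clock starts. A minimal Python sketch, assuming revisions are token lists ordered oldest first, burst positions b_1..b_m are given as 1-based revision indices, and beta = 1.1 is a placeholder value not taken from the slides:

```python
def tf_burst(term, revisions, burst_starts, beta=1.1):
    """Decayed term frequency where each burst b_j restarts the decay.

    revisions: list of token lists (v_1..v_n, oldest first).
    burst_starts: 1-based indices of the revisions where bursts begin.
    """
    n = len(revisions)
    total = 0.0
    for b_j in burst_starts:
        for k in range(b_j, n + 1):
            count = revisions[k - 1].count(term)  # c(t, v_k)
            # The revision at the burst itself gets weight 1; later ones decay
            total += count / ((k - b_j + 1) ** beta)
    return total
```

For example, with a single burst at revision 2 and beta = 1, a term seen twice in revision 2 and once in revision 3 scores 2/1 + 1/2 = 2.5.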
12
Burst Detection (1): Content-based
Relative content change Δc (potential burst):
\[ \mathcal{B}_c(v_j) = \begin{cases} 1, & \frac{|v_j| - |v_{j-1}|}{|v_{j-1}|} > \alpha \\ 0, & \text{otherwise} \end{cases} \]
[Figure: Content-based Burst for “Avatar”]
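A content-based detector along these lines could look like the following Python sketch. It assumes document length is measured by a single number per revision (for example a token count) and uses a hypothetical threshold of 0.1; neither choice is specified on the slide.

```python
def content_bursts(lengths, threshold=0.1):
    """Return 1-based indices j where the relative growth from v_{j-1} to v_j
    exceeds the threshold.

    lengths: |v_1|, |v_2|, ... in revision order; threshold is an assumed value.
    """
    bursts = []
    for j in range(1, len(lengths)):
        prev, cur = lengths[j - 1], lengths[j]
        # Skip empty predecessors to avoid dividing by zero
        if prev > 0 and (cur - prev) / prev > threshold:
            bursts.append(j + 1)
    return bursts
```

Only growth counts as a burst here because the formula uses the signed difference; a large deletion produces a negative ratio and is ignored.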
13
Burst Detection (2): Activity Based
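Intensive edit activity within Δt (potential bursts):
\[ \mathcal{B}_a(ep_j) = \begin{cases} 1, & \text{revision counts in } \Delta t > \mu + \sigma \\ 0, & \text{otherwise} \end{cases} \]
where μ is the average revision count and σ is the deviation.
[Figure: Activity-based Burst for “Avatar”]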
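The activity-based rule can be tried directly on the monthly edit counts shown for “Avatar”. The sketch below is an assumption-laden illustration: it treats each month as one period Δt and uses the population standard deviation for σ, while the slides do not say which period length or which deviation estimate was used.

```python
from statistics import mean, pstdev

def activity_bursts(counts):
    """Return 0-based indices of periods whose revision count exceeds mu + sigma.

    counts: number of revisions in each period of length delta-t.
    """
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation (an assumption)
    return [i for i, c in enumerate(counts) if c > mu + sigma]

# Monthly edit activity for Jul. to Dec. 2009, as listed on the earlier slide
avatar_edits = [89, 224, 67, 154, 232, 1892]
print(activity_bursts(avatar_edits))
```

On these six values μ is 443 and σ is about 651, so only December (1892 edits) exceeds μ + σ at monthly granularity.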
14
Burst Detection (3): Combined Model
15
Putting it All Together: RHA Term Frequency
Combining the global model and the burst model
RHA Term Frequency:
\[ TF_{rha}(t, D) = \lambda_1 \cdot TF_g(t, D) + \lambda_2 \cdot TF_b(t, D) + \lambda_3 \cdot TF(t, D), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]
λ1, λ2 and λ3 indicate the weights of the RHA global model, the burst model, and the original term frequency (probability).
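As a small illustration, the linear combination can be written as a Python helper. The function name is hypothetical; the default weights are the values reported for BM25 on the results slides (0.3, 0.4, 0.3).

```python
def tf_rha(tf_g, tf_b, tf, lambdas=(0.3, 0.4, 0.3)):
    """Interpolate global, burst, and original term frequency.

    lambdas = (lambda1, lambda2, lambda3) must sum to 1.
    """
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("lambda weights must sum to 1")
    return l1 * tf_g + l2 * tf_b + l3 * tf
```

Setting the weights to (0, 0, 1) recovers the original term frequency, so the baseline ranking is a special case of the combined model.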
16
Integrating RHA into Retrieval Models
BM25 + RHA
\[ S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF_{rha}(t, D) \cdot (k_1 + 1)}{TF_{rha}(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)} \]
Statistical Language Models + RHA
\[ S(Q, D) = D(Q \| D) = \sum_{t \in V} P(t|Q) \cdot \log \frac{P(t|Q)}{P_{rha}(t, D)} \]
RHA Term Probability:
\[ P_{rha}(t, D) = \lambda_1 \cdot P_g(t, D) + \lambda_2 \cdot P_b(t, D) + \lambda_3 \cdot P(t, D) \]
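For the language-model variant, a hedged Python sketch of the divergence computation is shown below. The function names, the dictionary representation of the distributions, and the small floor `eps` used to avoid log of zero are all assumptions for illustration; the slides do not describe a smoothing method. The default weights are the LM values reported on the results slides (0.3, 0.2, 0.5).

```python
import math

def p_rha(p_g, p_b, p, lambdas=(0.3, 0.2, 0.5)):
    """Interpolated RHA term probability for one term."""
    l1, l2, l3 = lambdas
    return l1 * p_g + l2 * p_b + l3 * p

def kl_divergence(p_query, p_doc, eps=1e-10):
    """D(Q || D) = sum over terms of P(t|Q) * log(P(t|Q) / P(t|D)).

    p_query, p_doc: dicts mapping term -> probability.
    eps is an assumed floor for terms missing from the document model.
    """
    total = 0.0
    for t, pq in p_query.items():
        if pq > 0.0:  # terms with zero query probability contribute nothing
            total += pq * math.log(pq / max(p_doc.get(t, 0.0), eps))
    return total
```

A divergence of zero means the two distributions are identical, so a ranker built on this quantity would place documents with smaller divergence first.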
17
Experimental Setup
18
Datasets
INEX: a well-established forum for structured retrieval tasks (based on the Wikipedia collection)
TREC: performance comparison on a different set of queries, and general applicability
Corpus construction:
INEX (65 topics): Wiki dump -> top 1000 retrieved articles -> 1000 revisions for each article -> corpus for INEX
TREC (68 topics): Wiki dump -> top 1000 retrieved articles -> 1000 revisions for each article -> corpus for TREC
19
Results
20
INEX Results
Model       bpref             MAP               R-precision
BM25        0.354             0.354             0.314
BM25+RHA    0.375 (+5.93%)    0.360 (+1.69%)    0.337 (+7.32%)
LM          0.357             0.370             0.348
LM+RHA      0.372 (+4.20%)    0.378 (+2.16%)    0.359 (+3.16%)
Parameters tuned on the INEX query set:
BM25: λ1 = 0.3, λ2 = 0.4, λ3 = 0.3
LM: λ1 = 0.3, λ2 = 0.2, λ3 = 0.5
21
TREC Results
Model       bpref               MAP                 NDCG
BM25        0.524               0.548               0.634
BM25+RHA    0.547** (+4.39%)    0.568** (+3.65%)    0.656** (+3.47%)
LM          0.527               0.556               0.645
LM+RHA      0.532 (+0.95%)      0.567 (+1.98%)      0.653 (+1.24%)
Parameters tuned on the INEX query set. ** indicates a statistically significant difference at the 0.01 significance level (two-tailed paired t-test).
BM25: λ1 = 0.3, λ2 = 0.4, λ3 = 0.3
LM: λ1 = 0.3, λ2 = 0.2, λ3 = 0.5
Lab members manually labeled the top 20 results for each topic.
22
Performance Analysis
Performance improvements on bpref for BM25+RHA over the baseline (BM25)
[Figure panels: INEX, TREC]
INEX: significant improvement on 40% of queries
TREC: significant improvement on 37% of queries
Examples: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)
23
Summary
o RHA captures an importance signal from the document authoring process.
o Introduced the RHA term weighting approach.
o Natural integration with state-of-the-art retrieval models.
o Consistent improvement over baseline retrieval models.
24
Thank you!
Using the Past to Score the Present:
Extending Term Weighting Models with Revision History Analysis
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Research partially supported by:
25
Query Sets and Evaluation Metrics
• Queries and Labels:
– INEX: provided
– TREC: subset of ad-hoc track
• Metrics:
– Bpref (robust to missing judgments)
– MAP: mean average precision
– R-prec: precision at position R
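Two of these metrics are simple enough to sketch in Python. The helper names and the list/set inputs are illustrative assumptions; MAP is the mean of `average_precision` over all queries, and bpref is omitted here because its exact variant is not stated on the slide.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision at position R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking d1, d2, d3, d4 with relevant documents d1 and d3, average precision is (1/1 + 2/3) / 2 and R-precision is 1/2.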
26
RHA in Statistical Language Models
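o \( P_{rha}(w, D) = \lambda_1 \cdot P_g(w, D) + \lambda_2 \cdot P_b(w, D) + \lambda_3 \cdot P(w, D) \)
o Global Model:
\[ P_g(w, D) = \frac{\sum_{j=1}^{n} \frac{c(w, v_j)}{j^{\alpha}}}{\sum_{w' \in D} \sum_{j=1}^{n} \frac{c(w', v_j)}{j^{\alpha}}} \]
o Burst Model:
\[ P_b(w, D) = \frac{\sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(w, v_k)}{(k - b_j + 1)^{\beta}}}{\sum_{w' \in D} \sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(w', v_k)}{(k - b_j + 1)^{\beta}}} \]
o \( \lambda_1 + \lambda_2 + \lambda_3 = 1 \)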
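The global term probability is the decayed term weight normalized over the terms of the document. A minimal Python sketch follows, assuming that the normalizing sum runs over the terms of the latest revision (taken here as the document D), that revisions are token lists ordered oldest first, and that alpha = 1.1 is a placeholder value.

```python
from collections import Counter

def p_global(revisions, alpha=1.1):
    """Return {term: P_g(term, D)} for the terms of the latest revision.

    revisions: list of token lists (v_1..v_n, oldest first).
    """
    if not revisions:
        return {}
    vocab = set(revisions[-1])  # assumption: D is the latest revision
    weights = Counter()
    for j, tokens in enumerate(revisions, start=1):
        for w, c in Counter(tokens).items():
            if w in vocab:
                weights[w] += c / (j ** alpha)  # c(w, v_j) / j^alpha
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()} if z else {}
```

The burst probability can be normalized the same way by swapping the decay term for the burst-relative one.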
27
Cross validation on INEX
Model       bpref             MAP               R-precision
BM25        0.307             0.281             0.324
BM25+RHA    0.312 (+1.63%)    0.291 (+3.56%)    0.320 (-1.23%)
LM          0.311             0.284             0.348
LM+RHA      0.338 (+8.68%)    0.298 (+4.93%)    0.359 (+0.61%)
5-fold cross validation on the INEX 2008 query set
Model       bpref             MAP               R-precision
BM25        0.354             0.354             0.314
BM25+RHA    0.363 (+2.54%)    0.348 (-1.70%)    0.333 (+6.05%)
LM          0.357             0.370             0.348
LM+RHA      0.366 (+2.52%)    0.375 (+1.35%)    0.352 (+1.15%)
5-fold cross validation on the INEX 2009 query set
28