Modeling and Solving Term Mismatch for Full-Text Retrieval
Dissertation Presentation, Le Zhao
Committee: Jamie Callan (Chair), Jaime Carbonell, Yiming Yang, Bruce Croft (UMass)
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
July 26, 2012
What is Full-Text Retrieval
• The task: a user issues a query, the retrieval engine matches it against the document collection and returns results to the user
• The Cranfield evaluation [Cleverdon 1960]
  – abstracts away the user,
  – allows objective & automatic evaluations
Where are We (Going)?
• Current retrieval models
  – formal models from the 1970s, the best ones from the 1990s
  – based on simple collection statistics (tf.idf), no deep understanding of natural language text
• Perfect retrieval
  – Query: "information retrieval", answer: "… text search …": the answer should imply the query
  – Textual entailment (a difficult natural language task)
  – Searcher frustration [Feild, Allan and Jones 2010]
  – Still far away; what has been holding us back?
Two Long Standing Problems in Retrieval
• Term mismatch
  – [Furnas, Landauer, Gomez and Dumais 1987]
  – No clear definition in retrieval
• Relevance (query dependent term importance, P(t | R))
  – Traditionally approximated by idf (rareness) rather than P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998]
  – Few clues about estimation
• This work
  – connects the two problems,
  – shows they can result in huge gains in retrieval, and
  – uses a predictive approach toward solving both problems.
What is Term Mismatch & Why Care?
• Job search
  – You look for "information retrieval" jobs on the market; they want "text search" skills.
  – Costs you job opportunities (50%, even if you are careful)
• Legal discovery
  – You look for "bribery or foul play" in corporate documents; they say "grease", "pay off".
  – Costs you cases
• Patent/publication search
  – Costs businesses
• Medical record retrieval
  – Costs lives
Prior Approaches
• Document side:
  – Full text indexing (instead of only indexing key words)
  – Stemming (include morphological variants)
  – Document expansion (inlink anchor text, user tags)
• Query side:
  – Query expansion, reformulation
• Both:
  – Latent Semantic Indexing
  – Translation based models
Main Questions Answered
• Definition
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution
Definition of Mismatch

P(t̄ | R_q) = |{d : t ∉ d and d ∈ R_q}| / |R_q|

• mismatch P(t̄ | R_q) == 1 - term recall P(t | R_q)
• In the job-search example: R_q is the set of all relevant jobs; only some of them contain t ("retrieval"); the rest of the relevant jobs in the collection are mismatched.
• Directly calculated given relevance judgments for q (see the sketch below). [CIKM 2010]
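The definition above maps directly to code. A minimal sketch (not from the dissertation; representing documents as term sets is an assumption for illustration):

```python
def term_recall(term, relevant_docs):
    """P(t | R_q): fraction of the relevant documents that contain the term.

    relevant_docs: the documents judged relevant for query q, each
    represented as a set of (stemmed) terms.
    """
    if not relevant_docs:
        return 0.0
    matched = sum(1 for doc in relevant_docs if term in doc)
    return matched / len(relevant_docs)

def term_mismatch(term, relevant_docs):
    """P(t-bar | R_q) = 1 - term recall."""
    return 1.0 - term_recall(term, relevant_docs)

# Toy example: four relevant "job postings" and the query term "retrieval".
relevant = [
    {"information", "retrieval", "engineer"},
    {"text", "search", "developer"},
    {"search", "relevance", "engineer"},
    {"retrieval", "ranking", "scientist"},
]
print(term_recall("retrieval", relevant))    # 0.5
print(term_mismatch("retrieval", relevant))  # 0.5
```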
How Often do Terms Match?

Term in Query (example TREC-3 topics)                                          P(t | R)
Oil Spills                                                                     0.9914
Term limitations for US Congress members                                       0.9831
Insurance Coverage which pays for Long Term Care                               0.6885
School Choice Voucher System and its effects on the US educational program    0.2821
Vitamin the cure or cause of human ailments                                    0.1071
Main Questions
• Definition
  – P(t̄ | R) or P(t | R): simple, estimated from relevant documents, enables analysis of mismatch
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution
Term Mismatch & Probabilistic Retrieval Models

• Binary Independence Model [Robertson and Spärck Jones 1976]
  – Optimal ranking score for each document d: a sum, over matched query terms, of a weight with a term recall factor and an idf (rareness) factor (the weight is written out below)
  – This is the term weight in Okapi BM25
  – Other advanced models behave similarly
  – Used as effective features in Web search engines
• The term recall factor is the classic "Relevance Weight" or "Term Relevance"
• P(t | R) is the only part of the weight that is about the query and about relevance
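For reference, the Robertson-Spärck Jones weight that the slide shows as a figure can be written out. This is the standard BIM formula rather than anything recovered from the slide image; p = P(t | R), and q is usually approximated from document frequency:

```latex
w_t = \log \frac{p\,(1-q)}{q\,(1-p)}
    = \underbrace{\log \frac{p}{1-p}}_{\text{term recall part}}
    + \underbrace{\log \frac{1-q}{q}}_{\approx\ \mathrm{idf}},
\qquad p = P(t \mid R),\quad q = P(t \mid \bar{R}) \approx \frac{df_t}{N}.
```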
Main Questions
• Definition
• Significance
  – Theory: weighted alongside idf & the only part about relevance
  – Practice?
• Mechanism (what causes the problem)
• Model and solution
Without Term Recall

• The emphasis problem for tf.idf retrieval models
  – Emphasize high idf (rare) terms in the query
  – "prognosis/viability of a political third party in U.S." (Topic 206)
Ground Truth (Term Recall)

Query: prognosis/viability of a political third party

Term          True P(t | R)   idf
party         0.9796          2.402
political     0.7143          2.513
third         0.5918          2.187
viability     0.0408          5.017
prognosis     0.0204          7.471

Recall puts the emphasis on "party", "political", "third"; idf puts it on "viability" and "prognosis", the wrong emphasis.
Top Results (Language Model)

Query: prognosis/viability of a political third party
1. … discouraging prognosis for 1991 …
2. … Politics … party … Robertson's viability as a candidate …
3. … political parties …
4. … there is no viable opposition …
5. … A third of the votes …
6. … politics … party … two thirds …
7. … third ranking political movement …
8. … political parties …
9. … prognosis for the Sunday school …
10. … third party provider …

All are false positives: an emphasis/mismatch problem, not a precision problem. (Large Web search engines do better on this query, but still return false positives in the top 10; emphasis/mismatch is also a problem for them.)
Without Term Recall

• The emphasis problem for tf.idf retrieval models
  – Emphasize high idf (rare) terms in the query
  – "prognosis/viability of a political third party in U.S." (Topic 206)
  – False positives throughout the rank list, especially detrimental at top ranks
  – Ignoring term recall hurts precision at all recall levels
• How significant is the emphasis problem?
Failure Analysis of 44 Topics from TREC 6-8

• RIA workshop 2003: 7 top research IR systems, >56 expert*weeks
• Failure causes: emphasis 64%, mismatch 27%, precision 9%
  – Emphasis failures → addressed by recall-based term weighting
  – Mismatch failures → addressed by mismatch-guided expansion
  – Basis for both: term mismatch prediction
• Failure analyses of retrieval models & techniques are still standard today
Main Questions
• Definition
• Significance
  – Theory: weighted alongside idf & the only part about relevance
  – Practice: explains common failures and other behavior (personalization, WSD, structured retrieval)
• Mechanism (what causes the problem)
• Model and solution
Failure Analysis of 44 Topics from TREC 6-8 (recap)

• Emphasis 64% → recall term weighting; mismatch 27% → mismatch-guided expansion; precision 9%
• Basis: term mismatch prediction (RIA workshop 2003: 7 top research IR systems, >56 expert*weeks)
True Term Recall Effectiveness

• +100% over BIM (in precision at all recall levels) [Robertson and Spärck Jones 1976]
• +30-80% over Language Model and BM25 (in MAP): this work
• For a new query without relevance judgments, term recall needs to be predicted
  – Predictions don't need to be very accurate to show performance gains
Main Questions
• Definition
• Significance
  – Theory: weighted alongside idf & the only part about relevance
  – Practice: explains common failures and other behavior; +30 to 80% potential from term weighting
• Mechanism (what causes the problem)
• Model and solution
How Often do Terms Match? Same Term, Different Recall

Term in Query (examples from TREC-3 topics)                                    P(t | R)   idf
Oil Spills                                                                     0.9914     5.201
Term limitations for US Congress members                                       0.9831     2.010
Insurance Coverage which pays for Long Term Care                               0.6885     2.010
School Choice Voucher System and its effects on the US educational program    0.2821     1.647
Vitamin the cure or cause of human ailments                                    0.1071     6.405

• The same term ("term" in the second and third queries) has different recall in different queries
• Recall varies from 0 to 1 and differs from idf
Statistics

• Term recall across all query terms averages ~55-60%
  – TREC 3 titles (4.9 terms/query): average 55% term recall
  – TREC 9 descriptions (6.3 terms/query): average 59% term recall
  (Histograms of per-term recall P(t | R).)
Statistics

• Term recall on shorter queries averages ~70%
  – TREC 9 titles (2.5 terms/query): average 70% term recall
  – TREC 13 titles (3.1 terms/query): average 66% term recall
  (Histograms of per-term recall P(t | R).)
Statistics

• Term recall is query dependent (but for many terms, the variance is small)
  (Figure: term recall for repeating terms; 364 recurring words from TREC 3-7, 350 topics.)
P(t | R) vs. idf

(Scatter plot of P(t | R) against df/N for TREC 4 desc query terms, cf. Greiff, 1998.)
Prior Prediction Approaches

• Croft/Harper combination match (1979)
  – treats P(t | R) as a tuned constant (or estimates it from pseudo-relevance feedback); when it is > 0.5, documents that match more query terms are rewarded
• Greiff's (1998) exploratory data analysis
  – used idf to predict the overall term weight
  – improved over basic BIM
• Metzler's (2008) generalized idf
  – used idf to predict P(t | R)
  – improved over basic BIM
• Simple feature (idf), limited success
  – Missing piece: P(t | R) = term recall = 1 - term mismatch
What Factors Can Cause Mismatch?

• Topic centrality (is the concept central to the topic?)
  – "Laser research related or potentially related to defense"
  – "Welfare laws propounded as reforms"
• Synonyms (how often do they replace the original term?)
  – "retrieval" == "search" == …
• Abstractness
  – "Laser research … defense", "Welfare laws"
  – "Prognosis/viability" (rare & abstract)
Main Questions
• Definition
• Significance
• Mechanism
  – Causes of mismatch: unnecessary concepts, terms replaced by synonyms or more specific terms
• Model and solution
Designing Features to Model the Factors

• We need to identify synonyms/searchonyms of a query term, in a query dependent way
• External resources? (WordNet, Wikipedia, or query logs)
  – Biased (coverage problems, collection independent)
  – Static (not query dependent)
  – Not easy; not used here
• Instead: term-term similarity in a concept space
  – Local LSI (Latent Semantic Indexing): run the query against the document collection, take the top (500) results, and build a concept space (150 dimensions) from them (a sketch follows below)
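A minimal sketch of the local LSI idea, assuming scikit-learn; `top_docs` stands in for the texts of the top-ranked documents returned by the retrieval engine, and the 150 dimensions follow the slide. The dissertation's implementation (the LSI code in the Lemur toolkit) and its exact similarity definition may differ:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_similarities(query_term, top_docs, n_dims=150, n_similar=10):
    """Terms most similar to a query term in a local LSI concept space.

    top_docs: texts of the top-ranked documents for this query (e.g. top 500).
    Returns (term, similarity) pairs.  The query term's own entry (its
    self-similarity) relates to 'term centrality'; the average similarity of
    the other top terms relates to 'concept centrality'.
    """
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(top_docs)                 # docs x terms
    k = max(1, min(n_dims, min(X.shape) - 1))         # LSI dimensionality
    svd = TruncatedSVD(n_components=k, random_state=0)
    svd.fit(X)
    # Term vectors in the concept space: rows of V * Sigma (terms x k).
    term_vecs = (svd.components_ * svd.singular_values_[:, None]).T
    vocab = tfidf.vocabulary_
    if query_term not in vocab:
        return []
    q_vec = term_vecs[vocab[query_term]]
    sims = term_vecs @ q_vec                          # inner products
    terms = tfidf.get_feature_names_out()
    top = np.argsort(-sims)[:n_similar]
    return [(terms[i], float(sims[i])) for i in top]
```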
Synonyms from Local LSI

• Local LSI gives each query term's similarity to its top similar terms (charts for three example query terms):
  – "Term limitation for US Congress members": P(t | R_q) = 0.9831
  – "Insurance Coverage which pays for Long Term Care": P(t | R_q) = 0.6885
  – "Vitamin the cure or cause of human ailments": P(t | R_q) = 0.1071
• Three signals can be read off these similarity profiles:
  (1) Magnitude of self similarity → term centrality
  (2) Average similarity of the supporting terms → concept centrality
  (3) How likely synonyms replace term t in the collection → replaceability
Features that Model the Factors

• Term centrality: self-similarity (length of t after dimension reduction)
  – correlation with P(t | R): 0.3719 (idf: -0.1339)
• Concept centrality: average similarity of the supporting terms (top synonyms)
  – correlation: 0.3758
• Replaceability: how frequently synonyms appear in place of the original query term in collection documents
  – correlation: -0.1872
• Abstractness: users modify abstract terms with concrete terms
  – e.g. "effects on the US educational program", "prognosis of a political third party"
  – correlation: 0.1278
Prediction Model

• Regression modeling
  – Model M: <f1, f2, …, f5> → P(t | R)
  – Train on one set of queries (known relevance)
  – Test on another set of queries (unknown relevance)
  – RBF kernel support vector regression (sketch below)
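A minimal sketch of the regression setup, assuming scikit-learn's RBF-kernel support vector regression. The toy arrays stand in for the real training data (five feature values per query term, with the true P(t | R) computed from relevance judgments); the hyperparameters are illustrative, not the dissertation's:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy stand-ins for the real training data: one row per training-query term,
# columns = [term centrality, concept centrality, replaceability,
#            abstractness, idf]; targets = true P(t | R) from judgments.
rng = np.random.default_rng(0)
train_X = rng.random((200, 5))
train_y = rng.random(200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(train_X, train_y)

# Predict recall for the terms of a new query (features computed the same
# way), clipping to [0, 1] because the target is a probability.
test_X = rng.random((4, 5))
predicted_recall = np.clip(model.predict(test_X), 0.0, 1.0)
print(predicted_recall)
```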
A General View of Retrieval Modeling as Transfer Learning

• The traditional, restricted view sees a retrieval model as a document classifier for a given query.
• The more general view: a retrieval model really is a meta-classifier, responsible for many queries, mapping a query to a document classifier.
• Learning a retrieval model == transfer learning
  – using knowledge from related tasks (training queries) to classify documents for a new task (the test query)
  – our features and model facilitate the transfer
• The more general view enables more principled investigations and more advanced techniques.
Experiments

• Term recall prediction error
  – L1 loss (absolute prediction error)
• Term-recall-based term weighting retrieval
  – Mean Average Precision (overall retrieval success)
  – Precision at top 10 (precision at the top of the rank list)
  (Reference implementations of these measures are sketched below.)
Term Recall Prediction Example

Query: prognosis/viability of a political third party. (Trained on TREC 3)

Term          True P(t | R)   Predicted
party         0.9796          0.7585
political     0.7143          0.6523
third         0.5918          0.6236
viability     0.0408          0.3080
prognosis     0.0204          0.2869

The predicted values recover the correct emphasis.
Term Recall Prediction Error

• Average absolute error (L1 loss) on TREC 4; the lower, the better
• Compared predictors: the training-set average (a constant), idf only, all 5 features, all 5 features with tuned meta-parameters, and TREC 3 recurring words
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
  – Term recall can be predicted; a framework to design and evaluate features
Using P(t | R) in Retrieval Models

• In BM25: via the Binary Independence Model weight
• In Language Modeling (LM): via the Relevance Model [Lavrenko and Croft 2001]
• Only term weighting, no expansion. (One possible way to plug the prediction into a BM25-style weight is sketched below.)
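One concrete, hypothetical way to use the prediction is to substitute it into the RSJ-style weight of a BM25 scorer in place of the usual constant assumption about P(t | R). The sketch below illustrates that idea only; the dissertation's exact weighting and smoothing may differ:

```python
import math

def bm25_score(doc_tf, doc_len, avg_doc_len, df, N, recall, k1=1.2, b=0.75):
    """BM25-style score contribution of one query term in one document.

    doc_tf: term frequency in the document; df: document frequency in the
    collection; N: collection size; recall: predicted P(t | R) in (0, 1).
    The weight is the RSJ form log[p(1-q) / (q(1-p))] with p = recall and
    q approximated from df / N.
    """
    p = min(max(recall, 1e-4), 1 - 1e-4)     # keep the log finite
    q = (df + 0.5) / (N + 1.0)
    rsj = math.log(p * (1 - q) / (q * (1 - p)))
    tf_part = (doc_tf * (k1 + 1)) / (doc_tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return rsj * tf_part

def score_document(query_terms, doc_tfs, doc_len, avg_doc_len, dfs, N, recalls):
    """Sum per-term contributions; recalls maps term -> predicted P(t | R)."""
    return sum(
        bm25_score(doc_tfs.get(t, 0), doc_len, avg_doc_len, dfs[t], N,
                   recalls.get(t, 0.5))
        for t in query_terms
    )

# Example: one term occurring 3 times in a 120-word document.
print(bm25_score(doc_tf=3, doc_len=120, avg_doc_len=100,
                 df=5000, N=500000, recall=0.98))
```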
MAP

• Predicted-recall weighting vs. baseline LM (desc queries): 10-25% gain in MAP
• Datasets (train -> test): 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14
• "*": significantly better by sign & randomization tests
Prec@10

• Predicted-recall weighting vs. baseline LM (desc queries): 10-20% gain in top precision
• Datasets (train -> test): 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14
• "*": Prec@10 significantly better; "!": Prec@20 significantly better
vs. Relevance Model

• Relevance Model [Lavrenko and Croft 2001]: unsupervised; weights terms by their occurrence in the top query-likelihood documents
• The RM weight of a term correlates with its term recall: P_m(t1 | R) ~ P(t1 | R), P_m(t2 | R) ~ P(t2 | R)
• Supervised term recall prediction is 5-10% better than the unsupervised RM weights (scatter plots on TREC 7 and TREC 13)
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
  – Term weighting solves the emphasis problem for long queries
  – What about the mismatch problem?
Failure Analysis of 44 Topics from TREC 6-8 (recap)

• Mismatch 27% → mismatch-guided expansion; emphasis 64% → recall term weighting; precision 9%
• Basis: term mismatch prediction (RIA workshop 2003: 7 top research IR systems, >56 expert*weeks)
Recap: Term Mismatch

• Term mismatch ranges 30%-50% on average
• Relevance matching can degrade quickly for multi-word queries
• Solution: fix every query term [SIGIR 2012]
Conjunctive Normal Form (CNF) Expansion

Example keyword query: placement of cigarette signs on television watched by children

Manual CNF:
(placement OR place OR promotion OR logo OR sign OR signage OR merchandise)
AND (cigarette OR cigar OR tobacco)
AND (television OR TV OR cable OR network)
AND (watch OR view)
AND (children OR teen OR juvenile OR kid OR adolescent)

• Expressive & compact (1 CNF == 100s of keyword alternatives; see the sketch below)
• Highly effective (this work: 50-300% over the base keyword query)
• Used by lawyers, librarians and other expert searchers
• But tedious & difficult to create, and little researched
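To make the structure concrete, a CNF query can be represented as a list of ANDed synonym groups, each a set of ORed alternatives. The sketch below is an illustration of Boolean CNF matching, not the ranked retrieval model used in the experiments:

```python
# The manual CNF from the slide: a list of ANDed concept groups,
# each group a set of ORed alternatives.
cnf = [
    {"placement", "place", "promotion", "logo", "sign", "signage", "merchandise"},
    {"cigarette", "cigar", "tobacco"},
    {"television", "tv", "cable", "network"},
    {"watch", "view"},
    {"children", "teen", "juvenile", "kid", "adolescent"},
]

def matches(doc_terms, cnf):
    """Boolean CNF match: every concept group must share a term with the document."""
    return all(group & doc_terms for group in cnf)

# One CNF stands in for this many plain keyword conjunctions:
n_alternatives = 1
for group in cnf:
    n_alternatives *= len(group)
print(n_alternatives)                     # 7 * 3 * 4 * 2 * 5 = 840

doc = {"tobacco", "logo", "cable", "view", "kid", "promotion"}
print(matches(doc, cnf))                  # True: each concept matched by some synonym
```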
Diagnostic Intervention

Query: placement of cigarette signs on television watched by children

Diagnosis: select the query terms to expand, either the low-P(t | R) terms or the high-idf (rare) terms.

Expansion (CNF), expanding only the two selected terms:
• Low-recall diagnosis:
  (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent)
• High-idf diagnosis:
  (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND (television OR tv OR cable OR network) AND watch AND children

• Goal
  – Least amount of user effort → near-optimal performance
  – E.g. expanding 2 terms → 90% of the total improvement
Diagnostic Intervention

Query: placement of cigarette signs on television watched by children

Diagnosis: low-P(t | R) terms vs. high-idf (rare) terms.

Expansion (weighted bag-of-words), expanding only the two selected terms:
• Low-recall diagnosis:
  [ 0.9 (placement cigar television watch children)
    0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 teen 0.4 juvenile 0.2 kid 0.1 adolescent) ]
• High-idf diagnosis:
  [ 0.9 (placement cigar television watch children)
    0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 tv 0.4 cable 0.2 network) ]

• Goal
  – Least amount of user effort → near-optimal performance
  – E.g. expanding 2 terms → 90% of the total improvement
Diagnostic Intervention (We Hoped to)

Intended user-study loop:
• The user issues a keyword query, e.g. (child AND cigar)
• The diagnosis system (P(t | R) or idf) identifies problem query terms, e.g. child > cigar
• The user supplies expansion terms, e.g. child → teen
• Query formulation (CNF or keyword) produces, e.g., (child OR teen) AND cigar
• The retrieval engine runs the query, followed by evaluation
We Ended up Using Simulation

• Offline: an expert user creates the full CNF, e.g. (child OR teen) AND (cigar OR tobacco); the keyword query, e.g. (child AND cigar), is derived from it
• Online simulation:
  – The diagnosis system (P(t | R) or idf) selects problem query terms, e.g. child > cigar
  – User expansion is simulated by taking the expansion terms, e.g. child → teen, from the full CNF
  – Query formulation (CNF or keyword) produces, e.g., (child OR teen) AND cigar
  – The retrieval engine runs the query, followed by evaluation
Diagnostic Intervention Datasets

• Document sets
  – TREC 2007 Legal track: 7 million tobacco company documents
  – TREC 4 Ad hoc track: 0.5 million newswire documents
• CNF queries, 50 topics per dataset
  – TREC 2007 by lawyers, TREC 4 by Univ. of Waterloo
• Relevance judgments
  – TREC 2007 sparse, TREC 4 dense
• Evaluation measures
  – TREC 2007 statAP, TREC 4 MAP
Results – Diagnosis

• P(t | R) vs. idf diagnosis: diagnostic CNF expansion on TREC 4 and TREC 2007
• x-axis: number of query terms selected for expansion (0 = no expansion, 1, 2, 3, 4, …, All = full expansion); y-axis: percentage of the full-expansion MAP gain achieved
• Curves: P(t | R) on TREC 2007, idf on TREC 2007, P(t | R) on TREC 4, idf on TREC 4
• P(t | R)-guided selection reaches most of the full-expansion gain after expanding only a few terms
Results – Form of Expansion

• CNF vs. bag-of-words expansion: retrieval performance (MAP) on TREC 4 and TREC 2007 as a function of the number of query terms selected for expansion (P(t | R) guided)
• 50% to 300% gain over the unexpanded query, with a similar level of gain in top precision
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
  – Term weighting for long queries
  – Term mismatch prediction diagnoses problem terms, and produces simple & effective CNF queries
Efficient P(t | R) Prediction

• 3-10X speedup (close to simple keyword retrieval), while maintaining 70-90% of the gain
• Predict using P(t | R) values from similar, previously-seen queries [CIKM 2012] (a sketch of the idea follows below)
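A minimal sketch of the idea, under the assumption (from the backup slides) that a term's recall varies little across queries: cache recall values from past queries and reuse their average, falling back to a global prior for unseen terms. The class name and the prior value are illustrative:

```python
from collections import defaultdict

class HistoricRecallPredictor:
    """Predict P(t | R) for a new query from values seen in past queries."""

    def __init__(self, default_recall=0.6):
        # ~55-70% average term recall was observed on TREC queries,
        # so a constant in that range is a plausible fallback prior.
        self.default_recall = default_recall
        self.history = defaultdict(list)   # term -> list of past P(t | R)

    def record(self, term, recall):
        """Store a recall value computed (or predicted) for a past query."""
        self.history[term].append(recall)

    def predict(self, term):
        """Average of past values; fall back to the prior for unseen terms."""
        past = self.history[term]
        return sum(past) / len(past) if past else self.default_recall

predictor = HistoricRecallPredictor()
predictor.record("term", 0.9831)       # from "Term limitations for US Congress members"
predictor.record("term", 0.6885)       # from "... Long Term Care"
print(predictor.predict("term"))       # ~0.8358, reused for a new query with "term"
print(predictor.predict("viability"))  # 0.6, fallback prior for an unseen term
```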
Contributions

• Two long-standing problems: mismatch & P(t | R)
• Definition and initial quantitative analysis of mismatch
  – Do better or new features and prediction methods exist?
• Role of term mismatch in basic retrieval theory
  – Principled ways to solve term mismatch
  – What about advanced learning to rank, transfer learning?
• Ways to automatically predict term mismatch
  – Initial modeling of the causes of mismatch, features
  – Efficient prediction using historic information
  – Are there better analyses or modeling of the causes?
Contributions (continued)

• Effectiveness of ad hoc retrieval
  – Term weighting & diagnostic expansion
  – How to do automatic CNF expansion?
  – Better formalisms: transfer learning, & more tasks?
• Diagnostic intervention
  – Mismatch diagnosis guides targeted expansion
  – How to diagnose specific types of mismatch problems, or different problems (mismatch/emphasis/precision)?
• Guide NLP, personalization, etc. to solve the real problem
  – How to proactively identify search and other user needs?
Acknowledgements
• Committee: Jamie Callan, Jaime Carbonell, Yiming Yang, Bruce Croft
• Ni Lao, Frank Lin, Siddharth Gopal, Jon Elsas, Jaime Arguello, Hui (Grace) Yang, Stephen Robertson, Matthew Lease, Nick Craswell, Yi Zhang (and her group), Jin Young Kim, Yangbo Zhu, Runting Shi, Yi Wu, Hui Tan, Yifan Yanggong, Mingyan Fan, Chengtao Wen: discussions, references & feedback
• Reviewers: papers & NSF proposal
• David Fisher, Mark Hoy, David Pane: maintaining the Lemur toolkit
• Andrea Bastoni and Lorenzo Clemente: maintaining the LSI code for the Lemur toolkit
• SVM-light, Stanford parser
• TREC: data
• NSF Grant IIS-1018317
• Xiangmin Jin, and my whole family and volleyball packs at CMU & SF Bay
END
Prior Definition of Mismatch
• Vocabulary mismatch (Furnas et al., 1987)
  – How likely two people are to disagree in vocabulary choice
  – Domain experts disagree 80-90% of the time
  – Led to Latent Semantic Indexing (Deerwester et al., 1988)
  – Query independent: roughly Avg_q P(t̄ | R_q), the query-dependent term mismatch averaged over queries
How Necessity Explains the Behavior of IR Techniques

• Why weight query bigrams 0.1 and query unigrams 0.9?
  – Adding a bigram decreases term recall; the weight reflects recall
• Why don't bigrams give stable improvements?
  – Term recall is more of a problem
• Why doesn't using document structure (fields, semantic annotation) improve performance?
  – It improves precision; structural mismatch needs to be solved first
• Word sense disambiguation
  – Enhances precision; instead, it should be used in mismatch modeling!
    • Identify the query term sense, for searchonym identification or learning across queries
    • Disambiguate collection term senses for more accurate replaceability
• Personalization
  – Biases results toward what a community/person likes to read (precision)
  – May work well in a mobile setting, with short queries
Why Necessity? System Failure Analysis

• Reliable Information Access (RIA) workshop (2003)
  – Failure analysis for 7 top research IR systems
    • 11 groups of researchers (both academia & industry)
    • 28 people directly involved in the analysis (senior & junior)
    • >56 human*weeks (analysis + running experiments)
    • 45 topics selected from 150 TREC 6-8 topics (difficult topics)
  – Causes (necessity in various disguises):
    • Emphasize 1 aspect, missing another aspect (14+2 topics)
    • Emphasize 1 aspect, missing another term (7 topics)
    • Missing either 1 of 2 aspects, need both (5 topics)
    • Missing a difficult aspect that needs human help (7 topics)
    • Need to expand a general term, e.g. "Europe" (4 topics)
    • Precision problem, e.g. "euro", not "euro-…" (4 topics)
Local LSI Top Similar Terms

• Oil spills: spill 0.5828, oil 0.4210, tank 0.0986, crude 0.0972, water 0.0830
• Insurance coverage which pays for long term care: term 0.3310, long 0.2173, nurse 0.2114, care 0.1694, home 0.1268
• Term limitations for US Congress members: term 0.3339, limit 0.1696, ballot 0.1115, elect 0.1042, care 0.0997
• Vitamin the cure or cause of human ailments: ail 0.4415, health 0.0825, disease 0.0720, basler 0.0718, dr 0.0695
Error Plot of Necessity Predictions

(Plot of necessity truth vs. predicted necessity, with the prediction trend shown as a 3rd order polynomial fit.)
Necessity vs. idf (and emphasis)
True Necessity Weighting
TREC                      4          6          8            9          10         12         14
Document collection       disk 2,3   disk 4,5   d4,5 w/o cr  WT10g      WT10g      .GOV       .GOV2
Topic numbers             201-250    301-350    401-450      451-500    501-550    TD1-50     751-800
LM desc – Baseline        0.1789     0.1586     0.1923       0.2145     0.1627     0.0239     0.1789
LM desc – Necessity       0.2703     0.2808     0.3057       0.2770     0.2216     0.0868     0.2674
Improvement               51.09%     77.05%     58.97%       29.14%     36.20%     261.7%     49.47%
p – randomization         0.0000     0.0000     0.0000       0.0000     0.0000     0.0000     0.0001
p – sign test             0.0000     0.0000     0.0000       0.0005     0.0000     0.0000     0.0002
Multinomial-abs           0.1988     0.2088     0.2345       0.2239     0.1653     0.0645     0.2150
Multinomial RM            0.2613     0.2660     0.2969       0.2590     0.2259     0.1219     0.2260
Okapi desc – Baseline     0.2055     0.1773     0.2183       0.1944     0.1591     0.0449     0.2058
Okapi desc – Necessity    0.2679     0.2786     0.2894       0.2387     0.2003     0.0776     0.2403
LM title – Baseline       N/A        0.2362     0.2518       0.1890     0.1577     0.0964     0.2511
LM title – Necessity      N/A        0.2514     0.2606       0.2058     0.2137     0.1042     0.2674
Predicted Necessity Weighting
10-25% gain in MAP (necessity weighting), 10-20% gain in top precision

TREC train sets                3        3-5      3-7      7
Test/x-validation              4        6        8        8
LM desc – Baseline (MAP)       0.1789   0.1586   0.1923   0.1923
LM desc – Necessity (MAP)      0.2261   0.1959   0.2314   0.2333
Improvement                    26.38%   23.52%   20.33%   21.32%
P@10 – Baseline                0.4160   0.2980   0.3860   0.3860
P@10 – Necessity               0.4940   0.3420   0.4220   0.4380
P@20 – Baseline                0.3450   0.2440   0.3310   0.3310
P@20 – Necessity               0.4180   0.2900   0.3540   0.3610
Predicted Necessity Weighting (ctd.)
TREC train sets                3-9      9        11       13
Test/x-validation              10       10       12       14
LM desc – Baseline (MAP)       0.1627   0.1627   0.0239   0.1789
LM desc – Necessity (MAP)      0.1813   0.1810   0.0597   0.2233
Improvement                    11.43%   11.25%   149.8%   24.82%
P@10 – Baseline                0.3180   0.3180   0.0200   0.4720
P@10 – Necessity               0.3280   0.3400   0.0467   0.5360
P@20 – Baseline                0.2400   0.2400   0.0211   0.4460
P@20 – Necessity               0.2790   0.2810   0.0411   0.5030
vs. Relevance Model
Relevance Model query:
#weight( (1 - λ) #combine( t1 t2 t3 )  λ #weight( w1 t1  w2 t2  w3 t3 ) )
where the RM weights correlate with term recall: w1 ~ P(t1 | R), w2 ~ P(t2 | R).

• Weighting only ≈ expansion
• Supervised > unsupervised (by 5-10%)

Test set                     4        6        8        8        10       10       12       14
LM desc – Baseline           0.1789   0.1586   0.1923   0.1923   0.1627   0.1627   0.0239   0.1789
Relevance Model desc         0.2423   0.1799   0.2352   0.2352   0.1888   0.1888   0.0221   0.1774
RM reweight-Only desc        0.2215   0.1705   0.2435   0.2435   0.1700   0.1700   0.0692   0.1945
RM reweight-Trained desc     0.2330   0.1921   0.2542   0.2563   0.1809   0.1793   0.0534   0.2258
Feature Correlation
Correlation with true necessity (TREC 4 test set):

f1 – Term centrality          0.3719
f2 – Concept centrality       0.3758
f3 – Replaceability          -0.1872
f4 – DepLeaf (abstractness)   0.1278
f5 – idf                     -0.1339

Predicted necessity           0.7989
RM weight (RMw)               0.6296
vs. Relevance Model
• Weighting only ≈ expansion; supervised > unsupervised (by 5-10%)
• (Chart: Baseline LM desc, Relevance Model desc, RM Reweight-Only desc, RM Reweight-Trained desc, across the train -> test datasets.)
• RM is unstable
Efficient Prediction of Term Recall

• Currently: slow, query dependent features that require a retrieval run
• Can prediction be more effective and more efficient?
  – Need to understand the causes of the query dependent variation
  – Design a minimal set of efficient features to capture the query dependent variation
Causes of Query Dependent Variation (1)-(4)

• Different word sense
• Different word use, e.g. term in a phrase vs. not
• Different Boolean semantics of the queries, AND vs. OR
• Different association level with the topic
Efficient P(t | R) Prediction (2)

• Causes of P(t | R) variation for the same term in different queries
  – Different query semantics: "Canada or Mexico" vs. "Canada"
  – Different word sense: bear (verb) vs. bear (noun)
  – Different word use: "Seasonal affective disorder syndrome (SADS)" vs. "Agoraphobia as a disorder"
  – Difference in association level with the topic
• Use historic occurrences to predict the current query
  – 70-90% of the total gain
  – 3-10X faster, close to simple keyword retrieval
Efficient P(t | R) Prediction (2)

• Low variation of the same term across different queries
• Use historic occurrences to predict the current query
  – 3-10X faster, close to the slower method & real time
• (MAP chart: Baseline LM desc vs. Necessity LM desc vs. Efficient Prediction, across train -> test pairs 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14; "*" marks significant differences.)
Using Document Structure
• Stylistic: XML
• Syntactic/semantic: POS, semantic role labels
• Current approaches are all precision oriented
  – Need to solve mismatch first?
Motivation
• Search is important: an information portal
• Search is research worthy
  – Retrieval: SIGIR, WWW, CIKM, ASIST, ECIR, AIRS, …
  – Search is difficult
  – Must adapt to the changing requirements of the mobile, social and semantic Web
• Modeling the user's needs: the retrieval model maps the user's query and activities to results over the collections
Online or Offline Study?
• Controlling confounding variables
  – Quality of expansion terms
  – User's prior knowledge of the topic
  – Interaction form & effort
• Enrolling many users and repeating experiments
• Offline simulations can avoid all these and still make reasonable observations
Simulation Assumptions
• Real full CNFs are used to simulate partial expansions
• 3 assumptions about the user expansion process
  – Expansion of individual terms is independent of the others
    • A1: always the same set of expansion terms for a given query term, no matter which subset of query terms gets expanded
    • A2: the same sequence of expansion terms, no matter …
  – A3: the keyword query is re-constructed from the CNF query
    • A procedure ensures the vocabulary is faithful to that of the original keyword description
• Highly effective CNF queries ensure a reasonable keyword baseline
Take Home Message for Ordinary Search Users (people and software)
Be mean!
Is the term Necessary for doc relevance?
If not, remove, replace or expand.