
Modeling and Solving Term Mismatch for Full-Text Retrieval

Dissertation Presentation: Le Zhao
Committee: Jamie Callan (Chair), Jaime Carbonell, Yiming Yang, Bruce Croft (UMass)
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
July 26, 2012

What is Full-Text Retrieval?

• The task: a user issues a query, the retrieval engine searches the document collection, and results are returned to the user
• The Cranfield evaluation [Cleverdon 1960]
  – abstracts away the user,
  – allows objective & automatic evaluations

Where are We (Going)?

• Current retrieval models
  – formal models from the 1970s, the best ones from the 1990s
  – based on simple collection statistics (tf.idf), no deep understanding of natural language text
• Perfect retrieval
  – Query: "information retrieval", Answer: "… text search …"; the answer should imply (entail) the query
  – Textual entailment (a difficult natural language task)
  – Searcher frustration [Feild, Allan and Jones 2010]
  – Still far away; what has been holding us back?


Two Long-Standing Problems in Retrieval

• Term mismatch
  – [Furnas, Landauer, Gomez and Dumais 1987]
  – No clear definition in retrieval
• Relevance (query-dependent term importance, P(t | R))
  – Traditionally, idf (rareness)
  – P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998]
  – Few clues about estimation
• This work
  – connects the two problems,
  – shows they can result in huge gains in retrieval, and
  – uses a predictive approach toward solving both problems.


What is Term Mismatch & Why Care?

• Job search
  – You look for "information retrieval" jobs on the market. They want "text search" skills.
  – Costs you job opportunities (50%, even if you are careful)
• Legal discovery
  – You look for "bribery or foul play" in corporate documents. They say "grease", "pay off".
  – Costs you cases
• Patent/publication search
  – Costs businesses
• Medical record retrieval
  – Costs lives

Prior Approaches

• Document side:
  – Full-text indexing (instead of indexing only keywords)
  – Stemming (include morphological variants)
  – Document expansion (inlink anchor text, user tags)
• Query side:
  – Query expansion, reformulation
• Both:
  – Latent Semantic Indexing
  – Translation-based models

Main Questions Answered

• Definition
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution


Definition of Mismatch

[Venn diagram: within the collection, the documents that contain t ("retrieval") overlap the relevant set R_q (all relevant jobs); the relevant jobs that do not contain t are the mismatched jobs.]

mismatch P(t̄ | R_q) = 1 − term recall P(t | R_q)

Directly calculated given relevance judgments for q (see the sketch below):

P(t̄ | R_q) = |{d : t ∉ d & d ∈ R_q}| / |R_q|

[CIKM 2010]
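With judged relevant documents in hand, the computation is direct. A minimal Python sketch; the judgment and document representations are hypothetical placeholders, not the dissertation's data format:

def term_recall(term, relevant_docs, doc_terms):
    """P(t | R_q): fraction of relevant documents that contain `term`.

    relevant_docs: iterable of doc ids judged relevant for query q.
    doc_terms: dict mapping doc id -> set of (stemmed) terms in that doc.
    """
    relevant = list(relevant_docs)
    if not relevant:
        return 0.0
    matched = sum(1 for d in relevant if term in doc_terms[d])
    return matched / len(relevant)

def term_mismatch(term, relevant_docs, doc_terms):
    # Mismatch is simply the complement of term recall.
    return 1.0 - term_recall(term, relevant_docs, doc_terms)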


How Often do Terms Match?

Term in Query                                                                P(t | R)
Oil Spills                                                                    0.9914
Term limitations for US Congress members                                      0.9831
Insurance Coverage which pays for Long Term Care                              0.6885
School Choice Voucher System and its effects on the US educational program    0.2821
Vitamin the cure or cause of human ailments                                   0.1071

(Example TREC-3 topics)

Main Questions

• Definition
  – P(t̄ | R) or P(t | R): simple,
  – estimated from relevant documents,
  – used to analyze mismatch
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution


Term Mismatch & Probabilistic Retrieval Models

• Binary Independence Model [Robertson and Spärck Jones 1976]
  – Optimal ranking score for each document d sums, over the query terms that match d, the weight
      w_t = log [ P(t | R)(1 − P(t | R̄)) / ((1 − P(t | R)) P(t | R̄)) ]
    where the P(t | R) factor is term recall and the P(t | R̄) factor supplies idf (rareness)
  – This is the term weight used in Okapi BM25 (sketched in code below)
  – Other advanced models behave similarly
  – Used as effective features in Web search engines
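For concreteness, a small sketch of the RSJ/BIM term weight, contrasting the full weight (which needs P(t | R)) with the idf-only weight obtained when P(t | R) is fixed at 0.5; this follows the standard textbook form of the weight rather than any specific implementation from the dissertation:

import math

def rsj_weight(p_t_R, df, N, eps=1e-6):
    """Robertson/Sparck Jones term weight.

    p_t_R : P(t | R), term recall.
    df, N : document frequency of t and collection size; df/N approximates P(t | not R).
    """
    p = min(max(p_t_R, eps), 1 - eps)
    u = min(max(df / N, eps), 1 - eps)
    return math.log(p * (1 - u) / ((1 - p) * u))

def idf_only_weight(df, N, eps=1e-6):
    # With P(t | R) fixed at 0.5 the weight reduces to log((N - df) / df),
    # i.e. the familiar idf-like weight that ignores relevance.
    return rsj_weight(0.5, df, N, eps)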


Term Mismatch & Probabilistic Retrieval Models

• Binary Independence Model [Robertson and Spärck Jones 1976]
  – Optimal ranking score for each document d: term recall × idf (as above)
  – P(t | R) is historically called the "Relevance Weight" or "Term Relevance"
  – P(t | R) is the only part of the weight about the query & relevance

Main Questions

• Definition
• Significance
  – Theory (enters the model alongside idf & is the only part about relevance)
  – Practice?
• Mechanism (what causes the problem)
• Model and solution


Term Mismatch & Probabilistic Retrieval Models

• Binary Independence Model [Robertson and Spärck Jones 1976]
  – Optimal ranking score for each document d: term recall × idf (as above)
  – "Relevance Weight", "Term Relevance"
  – P(t | R): the only part about the query & relevance


Without Term Recall

• The emphasis problem for tf.idf retrieval models
  – They emphasize high-idf (rare) terms in the query
  – Example: "prognosis/viability of a political third party in U.S." (Topic 206)


Ground Truth (Term Recall)

Query: prognosis/viability of a political third party

Term         True P(t | R)   idf
party          0.9796        2.402
political      0.7143        2.513
third          0.5918        2.187
viability      0.0408        5.017
prognosis      0.0204        7.471

The terms that deserve emphasis (party, political, third) have low idf; tf.idf instead emphasizes viability and prognosis: the wrong emphasis.


Top Results (Language model)

Query: prognosis/viability of a political third party
1. … discouraging prognosis for 1991 …
2. … Politics party … Robertson's viability as a candidate …
3. … political parties …
4. … there is no viable opposition …
5. … A third of the votes …
6. … politics party … two thirds …
7. … third ranking political movement …
8. … political parties …
9. … prognosis for the Sunday school …
10. … third party provider …

All are false positives. This is an emphasis/mismatch problem, not a precision problem. (Large search engines do better, but still return false positives in the top 10; emphasis/mismatch is a problem for them too.)


Without Term Recall

• The emphasis problem for tf.idf retrieval models
  – They emphasize high-idf (rare) terms in the query
  – Example: "prognosis/viability of a political third party in U.S." (Topic 206)
  – False positives throughout the rank list, especially detrimental at top rank
  – Missing term recall hurts precision at all recall levels
• How significant is the emphasis problem?


Failure Analysis of 44 Topics from TREC 6-8

Failure causes: Emphasis 64% (addressed by recall-based term weighting), Mismatch 27% (addressed by mismatch-guided expansion), Precision 9%. Both fixes build on term mismatch prediction.

(RIA workshop 2003: 7 top research IR systems, >56 expert-weeks. Failure analyses of retrieval models & techniques are still standard today.)

Main Questions

• Definition
• Significance
  – Theory: enters the model alongside idf & is the only part about relevance
  – Practice: explains common failures and other behavior (personalization, WSD, structured retrieval)
• Mechanism (what causes the problem)
• Model and solution


Failure Analysis of 44 Topics from TREC 6-8 (recap)

Emphasis 64% → recall term weighting; Mismatch 27% → mismatch-guided expansion; Precision 9%. Basis: term mismatch prediction. (RIA workshop 2003: 7 top research IR systems, >56 expert-weeks.)


True Term Recall Effectiveness

• +100% over BIM (in precision at all recall levels) [Robertson and Spärck Jones 1976]
• +30-80% over Language Model and BM25 (in MAP): this work
• For a new query without relevance judgments,
  – we need to predict term recall
  – predictions don't need to be very accurate to show performance gain

Main Questions

• Definition
• Significance
  – Theory: enters the model alongside idf & is the only part about relevance
  – Practice: explains common failures and other behavior; +30 to 80% potential from term weighting
• Mechanism (what causes the problem)
• Model and solution


How Often do Terms Match?

Term in Query                                                                P(t | R)   idf
Oil Spills                                                                    0.9914    5.201
Term limitations for US Congress members                                      0.9831    2.010
Insurance Coverage which pays for Long Term Care                              0.6885    2.010
School Choice Voucher System and its effects on the US educational program    0.2821    1.647
Vitamin the cure or cause of human ailments                                   0.1071    6.405

P(t | R) varies from 0 to 1; the same term can have different recall in different queries (e.g. "term" in the second and third queries), and recall clearly differs from idf. (Examples from TREC 3 topics)


Statistics

Term recall P(t | R) across all query terms (average ~55-60%).
[Histograms of per-term recall:]
  – TREC 3 titles, 4.9 terms/query: average 55% term recall
  – TREC 9 descriptions, 6.3 terms/query: average 59% term recall


Statistics

Term recall P(t | R) on shorter queries (average ~70%).
[Histograms of per-term recall:]
  – TREC 9 titles, 2.5 terms/query: average 70% term recall
  – TREC 13 titles, 3.1 terms/query: average 66% term recall


Statistics

Term recall is query dependent (but for many terms the variance is small).
[Chart: term recall for repeating terms; 364 recurring words from TREC 3-7, 350 topics.]

P(t | R) vs. idf

[Scatter plot of P(t | R) against df/N (idf) for TREC 4 desc query terms, after Greiff (1998).]


Prior Prediction Approaches

• Croft/Harper combination match (1979)
  – treats P(t | R) as a tuned constant, or estimates it from pseudo-relevance feedback; when > 0.5 it rewards documents that match more query terms
• Greiff's (1998) exploratory data analysis
  – used idf to predict the overall term weight
  – improved over basic BIM
• Metzler's (2008) generalized idf
  – used idf to predict P(t | R)
  – improved over basic BIM
• A single simple feature (idf), limited success
  – Missing piece: P(t | R) = term recall = 1 − term mismatch


What Factors can Cause Mismatch?

• Topic centrality (is the concept central to the topic?)
  – "Laser research related or potentially related to defense"
  – "Welfare laws propounded as reforms"
• Synonyms (how often do they replace the original term?)
  – "retrieval" == "search" == …
• Abstractness
  – "Laser research … defense", "Welfare laws …"
  – "Prognosis/viability" (rare & abstract)

Main Questions

• Definition
• Significance
• Mechanism
  – Causes of mismatch: unnecessary concepts, terms replaced by synonyms or by more specific terms
• Model and solution


Designing Features to Model the Factors

• We need to
  – identify synonyms/searchonyms of a query term,
  – in a query-dependent way
• External resources? (WordNet, Wikipedia, or query logs)
  – biased (coverage problems, collection independent)
  – static (not query dependent)
  – not easy, not used here
• Term-term similarity in concept space!
  – Local LSI (Latent Semantic Indexing): run the query, take the top (500) results, and build a concept space (150 dimensions) from them (see the sketch below)
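A rough sketch of the local LSI step, assuming the texts of the top-ranked documents from an initial retrieval are already available; the 500-document and 150-dimension settings follow the slide, while the use of scikit-learn and the exact similarity function are illustrative assumptions:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_term_vectors(top_doc_texts, n_dims=150):
    """Build a concept space from the top retrieved documents and return
    term vectors in that space, keyed by term string."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(top_doc_texts)            # docs x terms
    svd = TruncatedSVD(n_components=min(n_dims, X.shape[1] - 1))
    svd.fit(X)
    term_vecs = svd.components_.T                          # terms x concepts
    return dict(zip(vectorizer.get_feature_names_out(), term_vecs))

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def top_searchonyms(query_term, term_vecs, k=10):
    """Terms most similar to the query term in the local concept space."""
    if query_term not in term_vecs:
        return []
    q = term_vecs[query_term]
    sims = {t: cosine(q, v) for t, v in term_vecs.items() if t != query_term}
    return sorted(sims.items(), key=lambda x: -x[1])[:k]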


Synonyms from Local LSI

Example query terms and their term recall:
  – "Term limitations for US Congress members": P(t | R_q) = 0.9831
  – "Insurance Coverage which pays for Long Term Care": P(t | R_q) = 0.6885
  – "Vitamin the cure or cause of human ailments": P(t | R_q) = 0.1071
[Bar charts: similarity of the top similar terms with each query term, on a 0-0.5 scale.]


Synonyms from Local LSI

The same three example query terms and their top-similar-term profiles suggest three signals:
(1) the magnitude of self-similarity: term centrality
(2) the average similarity of the supporting terms: concept centrality
(3) how likely the synonyms are to replace term t in the collection


Features that Model the Factors

Correlation of each feature with P(t | R) (idf for comparison: -0.1339; a computation sketch for the first two features follows the list):
• Term centrality: 0.3719
  – self-similarity (length of t's vector) after dimension reduction
• Concept centrality: 0.3758
  – average similarity of the supporting terms (top synonyms)
• Replaceability: -0.1872
  – how frequently synonyms appear in place of the original query term in collection documents
• Abstractness: 0.1278
  – users modify abstract terms with concrete terms, e.g. "effects on the US educational program", "prognosis of a political third party"
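A sketch of how the first two features could be computed from the local-LSI term vectors of the previous sketch (reusing its top_searchonyms helper); replaceability and abstractness additionally need collection statistics and a dependency parse, and the dissertation's exact feature definitions may differ:

import numpy as np

def term_centrality(query_term, term_vecs):
    """Self-similarity: length of the term vector after dimension reduction."""
    v = term_vecs.get(query_term)
    return float(np.linalg.norm(v)) if v is not None else 0.0

def concept_centrality(query_term, term_vecs, k=10):
    """Average similarity of the top-k supporting terms (searchonyms)."""
    sims = [s for _, s in top_searchonyms(query_term, term_vecs, k)]
    return float(np.mean(sims)) if sims else 0.0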


Prediction Model

• Regression modeling
  – Model M: <f1, f2, .., f5> → P(t | R)
  – Train on one set of queries (known relevance),
  – test on another set of queries (unknown relevance)
  – RBF-kernel support vector regression (see the sketch below)
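A sketch of that regression step with scikit-learn; the feature matrix and targets are assumed to come from the feature extraction and from the true P(t | R) of the training queries, and the SVR hyperparameters shown are placeholders:

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_recall_predictor(X_train, y_train):
    """X_train: (n_terms, 5) feature matrix; y_train: true P(t | R) per term."""
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
    model.fit(X_train, y_train)
    return model

def predict_term_recall(model, X_test):
    # Clip to [0, 1], since raw SVR output is unbounded.
    return np.clip(model.predict(X_test), 0.0, 1.0)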


A General View of Retrieval Modeling as Transfer Learning

• The traditional, restricted view sees a retrieval model as
  – a document classifier for a given query.
• The more general view: a retrieval model really is
  – a meta-classifier, responsible for many queries,
  – mapping a query to a document classifier.
• Learning a retrieval model == transfer learning
  – using knowledge from related tasks (training queries) to classify documents for a new task (test query)
  – our features and model facilitate the transfer.
• More general view → more principled investigations and more advanced techniques


Experiments

• Term recall prediction error
  – L1 loss (absolute prediction error)
• Term-recall-based term weighting retrieval
  – Mean Average Precision (overall retrieval success)
  – Precision at top 10 (precision at the top of the rank list)


Term Recall Prediction Example

Query: prognosis/viability of a political third party. (Trained on TREC 3)

Term         True P(t | R)   Predicted
party          0.9796         0.7585
political      0.7143         0.6523
third          0.5918         0.6236
viability      0.0408         0.3080
prognosis      0.0204         0.2869

The predicted values recover the correct emphasis.


Term Recall Prediction Error

Average absolute error (L1 loss) on TREC 4; the lower, the better.
[Bar chart comparing predictors: the average (a constant), idf only, all 5 features, tuned meta-parameters, and TREC 3 recurring words.]

Main Questions

• Definition
• Significance
• Mechanism
• Model and solution
  – Term recall can be predicted; a framework to design and evaluate features


Using P(t | R) in Retrieval Models

• In BM25: through the Binary Independence Model term weight
• In Language Modeling (LM): through the Relevance Model [Lavrenko and Croft 2001]
• Only term weighting, no expansion (a query-construction sketch follows)
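As one concrete way to apply the predicted weights in a language-modeling engine, a small helper that emits an Indri-style query following the #weight / #combine template shown in the backup slides; the interpolation weight λ and the example recall values are placeholders:

def weighted_indri_query(term_recalls, lam=0.5):
    """term_recalls: dict mapping query term -> predicted P(t | R)."""
    terms = " ".join(term_recalls)
    weighted = " ".join(f"{w:.4f} {t}" for t, w in term_recalls.items())
    return f"#weight( {1 - lam:.2f} #combine( {terms} ) {lam:.2f} #weight( {weighted} ) )"

# Example (hypothetical predicted values):
# weighted_indri_query({"party": 0.76, "political": 0.65, "third": 0.62,
#                       "viability": 0.31, "prognosis": 0.29})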


MAP

Predicted recall weighting vs. baseline LM desc: 10-25% gain (MAP).
[Bar chart over datasets (train -> test): 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14.]
"*": significantly better by sign & randomization tests


Prec@10

Predicted recall weighting vs. baseline LM desc: 10-20% gain (top precision).
[Bar chart over datasets (train -> test): 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14.]
"*": Prec@10 is significantly better. "!": Prec@20 is significantly better.


vs. Relevance Model

• Relevance Model [Lavrenko and Croft 2001]: term occurrence in the top documents; unsupervised, based on query likelihood
• The RM weight (x) correlates with term recall (y): P_m(t1 | R) ~ P(t1 | R), P_m(t2 | R) ~ P(t2 | R)
• Supervised term-recall weighting is 5-10% better than the unsupervised Relevance Model
[Scatter plots: TREC 7, TREC 13.]

Main Questions

• Definition
• Significance
• Mechanism
• Model and solution
  – Term weighting solves the emphasis problem for long queries
  – What about the mismatch problem?


Failure Analysis of 44 Topics from TREC 6-8 (recap)

Emphasis 64% → recall term weighting; Mismatch 27% → mismatch-guided expansion; Precision 9%. Basis: term mismatch prediction. (RIA workshop 2003: 7 top research IR systems, >56 expert-weeks.)


Recap: Term Mismatch

• Term mismatch ranges from 30% to 50% on average
• Relevance matching can degrade quickly for multi-word queries
• Solution: fix every query term [SIGIR 2012]


Conjunctive Normal Form (CNF) Expansion

Example keyword query: placement of cigarette signs on television watched by children

→ Manual CNF:
(placement OR place OR promotion OR logo OR sign OR signage OR merchandise)
AND (cigarette OR cigar OR tobacco)
AND (television OR TV OR cable OR network)
AND (watch OR view)
AND (children OR teen OR juvenile OR kid OR adolescent)

– Expressive & compact (1 CNF == 100s of alternative queries; see the matcher sketch below)
– Highly effective (this work: 50-300% over the base keyword query)
– Used by lawyers, librarians and other expert searchers
– But tedious & difficult to create, and little researched
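To make the CNF semantics concrete, a minimal boolean matcher: a CNF query is an AND of OR-groups, and a document matches if every group contributes at least one term (stemming and phrase handling are ignored in this sketch):

def matches_cnf(cnf_groups, doc_terms):
    """cnf_groups: list of lists of alternative terms (an AND of ORs).
    doc_terms: set of terms appearing in the document."""
    return all(any(t in doc_terms for t in group) for group in cnf_groups)

cnf = [
    ["placement", "place", "promotion", "logo", "sign", "signage", "merchandise"],
    ["cigarette", "cigar", "tobacco"],
    ["television", "tv", "cable", "network"],
    ["watch", "view"],
    ["children", "teen", "juvenile", "kid", "adolescent"],
]
print(matches_cnf(cnf, {"tobacco", "tv", "promotion", "kid", "view"}))  # True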


Diagnostic Intervention

Query: placement of cigarette signs on television watched by children

Diagnosis: select the query terms to expand, either the low-P(t | R) terms or the high-idf (rare) terms.

Expansion (partial CNF):
  – P(t | R) diagnosis: (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent)
  – idf diagnosis: (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND (television OR tv OR cable OR network) AND watch AND children

• Goal
  – Least amount of user effort → near-optimal performance
  – E.g. expanding 2 terms → 90% of the total improvement (a term-selection sketch follows)
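The diagnosis step itself is just a ranking of the query terms. A sketch, where predicted_recall would come from the trained model above, the budget is the user-effort limit, and the example numbers are hypothetical:

def diagnose_terms(query_terms, predicted_recall, budget=2):
    """Pick the `budget` query terms most likely to mismatch
    (lowest predicted P(t | R)); an idf-based diagnosis would sort
    by descending idf instead."""
    ranked = sorted(query_terms, key=lambda t: predicted_recall.get(t, 1.0))
    return ranked[:budget]

# diagnose_terms(["placement", "cigar", "television", "watch", "children"],
#                {"placement": 0.3, "cigar": 0.9, "television": 0.7,
#                 "watch": 0.6, "children": 0.4})
# -> ["placement", "children"]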


Diagnostic Intervention

Query: placement of cigarette signs on television watched by children

Diagnosis: select the query terms to expand, either the low-P(t | R) terms or the high-idf (rare) terms.

Expansion (weighted bag of words):
  – P(t | R) diagnosis: [ 0.9 (placement cigar television watch children) 0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 teen 0.4 juvenile 0.2 kid 0.1 adolescent) ]
  – idf diagnosis: [ 0.9 (placement cigar television watch children) 0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 tv 0.4 cable 0.2 network) ]

• Goal
  – Least amount of user effort → near-optimal performance
  – E.g. expanding 2 terms → 90% of the total improvement


Diagnostic Intervention (What We Hoped to Do)

[Intended online user study:]
User issues a keyword query (child AND cigar)
→ Diagnosis system (P(t | R) or idf) ranks the problem query terms (child > cigar)
→ User expansion supplies expansion terms (child: teen)
→ Query formulation (CNF or keyword): (child OR teen) AND cigar
→ Retrieval engine → Evaluation




We Ended up Using Simulation

[Simulation setup:]
Offline: an expert user creates the full CNF, e.g. (child OR teen) AND (cigar OR tobacco); the keyword query (child AND cigar) is reconstructed from it.
Online simulation: the diagnosis system (P(t | R) or idf) ranks the problem query terms (child > cigar); user expansion is simulated by taking the expansion terms (child: teen) from the full CNF; query formulation (CNF or keyword) yields e.g. (child OR teen) AND cigar; retrieval engine; evaluation.


Diagnostic Intervention Datasets

• Document sets
  – TREC 2007 Legal track: 7 million tobacco-company documents
  – TREC 4 Ad hoc track: 0.5 million newswire documents
• CNF queries, 50 topics per dataset
  – TREC 2007 by lawyers, TREC 4 by Univ. of Waterloo
• Relevance judgments
  – TREC 2007 sparse, TREC 4 dense
• Evaluation measures
  – TREC 2007 statAP, TREC 4 MAP


Results – Diagnosis

P(t | R) vs. idf diagnosis; diagnostic CNF expansion on TREC 4 and TREC 2007.
[Line chart: gain in retrieval (MAP), as a percentage of the full-expansion gain, vs. the number of query terms selected (0 = no expansion, 1-4, All = full expansion); curves for P(t | R) and idf diagnosis on each dataset.]


Results – Form of Expansion

CNF vs. bag-of-word expansion; P(t | R) guided expansion on TREC 4 and TREC 2007.
[Line chart: retrieval performance (MAP) vs. the number of query terms selected (0-4, All); curves for CNF and bag-of-word expansion on each dataset. 50% to 300% gain; a similar level of gain in top precision.]

Main Questions

• Definition
• Significance
• Mechanism
• Model and solution
  – Term weighting for long queries
  – Term mismatch prediction diagnoses problem terms and produces simple & effective CNF queries


Efficient P(t | R) Prediction

• 3-10X speedup (close to simple keyword retrieval), while maintaining 70-90% of the gain
• Predict using P(t | R) values from similar, previously seen queries [CIKM 2012]


Contributions

• Two long-standing problems: mismatch & P(t | R)
• Definition and initial quantitative analysis of mismatch
  – Do better/new features and prediction methods exist?
• Role of term mismatch in basic retrieval theory
  – Principled ways to solve term mismatch
  – What about advanced learning to rank, transfer learning?
• Ways to automatically predict term mismatch
  – Initial modeling of the causes of mismatch; features
  – Efficient prediction using historic information
  – Are there better analyses or modeling of the causes?


Contributions

• Effectiveness of ad hoc retrieval
  – Term weighting & diagnostic expansion
  – How to do automatic CNF expansion?
  – Better formalisms: transfer learning & more tasks?
• Diagnostic intervention
  – Mismatch diagnosis guides targeted expansion
  – How to diagnose specific types of mismatch problems, or different problems (mismatch/emphasis/precision)?
• Guide NLP, personalization, etc. to solve the real problem
  – How to proactively identify search and other user needs?

Acknowledgements

• Committee: Jamie Callan, Jaime Carbonell, Yiming Yang, Bruce Croft
• Ni Lao, Frank Lin, Siddharth Gopal, Jon Elsas, Jaime Arguello, Hui (Grace) Yang, Stephen Robertson, Matthew Lease, Nick Craswell, Yi Zhang (and her group), Jin Young Kim, Yangbo Zhu, Runting Shi, Yi Wu, Hui Tan, Yifan Yanggong, Mingyan Fan, Chengtao Wen: discussions, references & feedback
• Reviewers of papers & the NSF proposal
• David Fisher, Mark Hoy, David Pane: maintaining the Lemur toolkit
• Andrea Bastoni and Lorenzo Clemente: maintaining the LSI code for the Lemur toolkit
• SVM-light, Stanford parser
• TREC: data
• NSF Grant IIS-1018317
• Xiangmin Jin, and my whole family and volleyball packs at CMU & the SF Bay

END


Prior Definition of Mismatch

• Vocabulary mismatch (Furnas et al., 1987)
  – How likely two people are to disagree in vocabulary choice
  – Domain experts disagree 80-90% of the time
  – Led to Latent Semantic Indexing (Deerwester et al., 1988)
  – Query independent: ≈ Avg_q P(t̄ | R_q), i.e. term mismatch averaged over queries

How Necessity Explains the Behavior of IR Techniques

• Why weight query bigrams 0.1 but query unigrams 0.9?
  – A bigram decreases term recall; the weight reflects recall
• Why do bigrams not give stable improvements?
  – Term recall is more of a problem
• Why does using document structure (fields, semantic annotations) not improve performance?
  – It improves precision; structural mismatch needs to be solved first
• Word sense disambiguation
  – Enhances precision; instead, it should be used in mismatch modeling!
  – Identify the query term's sense, for searchonym identification or for learning across queries
  – Disambiguate collection term senses for more accurate replaceability
• Personalization
  – Biases results toward what a community/person likes to read (precision)
  – May work well in a mobile setting with short queries

Why Necessity?

System Failure Analysis

• Reliable Information Access (RIA) workshop (2003)
  – Failure analysis for 7 top research IR systems
    • 11 groups of researchers (both academia & industry)
    • 28 people directly involved in the analysis (senior & junior)
    • >56 human-weeks (analysis + running experiments)
    • 45 topics selected from 150 TREC 6-8 topics (difficult topics)
  – Causes (necessity in various disguises)
    • Emphasize one aspect, missing another aspect (14+2 topics)
    • Emphasize one aspect, missing another term (7 topics)
    • Missing either one of two aspects, need both (5 topics)
    • Missing a difficult aspect that needs human help (7 topics)
    • Need to expand a general term, e.g. "Europe" (4 topics)
    • Precision problem, e.g. "euro", not "euro-…" (4 topics)


Local LSI Top Similar Terms

Query (query term): top similar terms in the local concept space
  – Oil spills (spill): spill 0.5828, oil 0.4210, tank 0.0986, crude 0.0972, water 0.0830
  – Insurance coverage which pays for long term care (term): term 0.3310, long 0.2173, nurse 0.2114, care 0.1694, home 0.1268, care 0.0997
  – Term limitations for US Congress members (term): term 0.3339, limit 0.1696, ballot 0.1115, elect 0.1042
  – Vitamin the cure or cause of human ailments (ail): ail 0.4415, health 0.0825, disease 0.0720, basler 0.0718, dr 0.0695

[Error plot of necessity predictions: necessity truth vs. predicted necessity, with the prediction trend (3rd-order polynomial fit).]

Necessity vs. idf (and emphasis)


True Necessity Weighting

Results of true necessity weighting (MAP):

                         TREC 4     TREC 6     TREC 8        TREC 9     TREC 10    TREC 12    TREC 14
Document collection      disk 2,3   disk 4,5   d4,5 w/o CR   WT10g      WT10g      .GOV       .GOV2
Topic numbers            201-250    301-350    401-450       451-500    501-550    TD1-50     751-800
LM desc – Baseline       0.1789     0.1586     0.1923        0.2145     0.1627     0.0239     0.1789
LM desc – Necessity      0.2703     0.2808     0.3057        0.2770     0.2216     0.0868     0.2674
Improvement              51.09%     77.05%     58.97%        29.14%     36.20%     261.7%     49.47%
p – randomization        0.0000     0.0000     0.0000        0.0000     0.0000     0.0000     0.0001
p – sign test            0.0000     0.0000     0.0000        0.0005     0.0000     0.0000     0.0002
Multinomial-abs          0.1988     0.2088     0.2345        0.2239     0.1653     0.0645     0.2150
Multinomial RM           0.2613     0.2660     0.2969        0.2590     0.2259     0.1219     0.2260
Okapi desc – Baseline    0.2055     0.1773     0.2183        0.1944     0.1591     0.0449     0.2058
Okapi desc – Necessity   0.2679     0.2786     0.2894        0.2387     0.2003     0.0776     0.2403
LM title – Baseline      N/A        0.2362     0.2518        0.1890     0.1577     0.0964     0.2511
LM title – Necessity     N/A        0.2514     0.2606        0.2058     0.2137     0.1042     0.2674

Predicted Necessity Weighting

10-25% gain (necessity weighting), 10-20% gain (top precision)

Train sets   Test   LM desc Baseline   LM desc Necessity   Improvement   P@10 Base   P@10 Necessity   P@20 Base   P@20 Necessity
3            4      0.1789             0.2261              26.38%        0.4160      0.4940           0.3450      0.4180
3-5          6      0.1586             0.1959              23.52%        0.2980      0.3420           0.2440      0.2900
3-7          8      0.1923             0.2314              20.33%        0.3860      0.4220           0.3310      0.3540
7            8      0.1923             0.2333              21.32%        0.3860      0.4380           0.3310      0.3610

Predicted Necessity Weighting (ctd.)

Train sets   Test   LM desc Baseline   LM desc Necessity   Improvement   P@10 Base   P@10 Necessity   P@20 Base   P@20 Necessity
3-9          10     0.1627             0.1813              11.43%        0.3180      0.3280           0.2400      0.2790
9            10     0.1627             0.1810              11.25%        0.3180      0.3400           0.2400      0.2810
11           12     0.0239             0.0597              149.8%        0.0200      0.0467           0.0211      0.0411
13           14     0.1789             0.2233              24.82%        0.4720      0.5360           0.4460      0.5030

vs. Relevance Model

Relevance Model: #weight( (1−λ) #combine( t1 t2 t3 ) λ #weight( w1 t1 w2 t2 w3 t3 ) )
  – Reweighting only ≈ expansion; the RM weights track term recall: w1 ~ P(t1 | R), w2 ~ P(t2 | R)
  – Supervised > unsupervised (by 5-10%)
[Scatter plot: RM weight (x) vs. term recall (y).]

Test/x-validation   LM desc Baseline   Relevance Model desc   RM reweight-Only desc   RM reweight-Trained desc
4                   0.1789             0.2423                 0.2215                  0.2330
6                   0.1586             0.1799                 0.1705                  0.1921
8                   0.1923             0.2352                 0.2435                  0.2542
8                   0.1923             0.2352                 0.2435                  0.2563
10                  0.1627             0.1888                 0.1700                  0.1809
10                  0.1627             0.1888                 0.1700                  0.1793
12                  0.0239             0.0221                 0.0692                  0.0534
14                  0.1789             0.1774                 0.1945                  0.2258

Feature Correlation

Correlation with term recall on the TREC 4 test set:
  f1 Term centrality:        0.3719
  f2 Concept centrality:     0.3758
  f3 Replaceability:        -0.1872
  f4 DepLeaf (abstractness): 0.1278
  f5 idf:                   -0.1339
  Predicted necessity:       0.7989
  RM weight:                 0.6296

vs. Relevance Model

• Reweighting only ≈ expansion
• Supervised > unsupervised (5-10%)
• RM is unstable
[Bar chart over datasets (train -> test): Baseline LM desc, Relevance Model desc, RM Reweight-Only desc, RM Reweight-Trained desc.]


Efficient Prediction of Term Recall

• Currently: slow, query-dependent features that require an initial retrieval
• Can they be made more effective and more efficient?
  – Need to understand the causes of the query-dependent variation
  – Design a minimal set of efficient features to capture that variation

Causes of Query Dependent Variation (1)

• Cause: different word sense

Causes of Query Dependent Variation (2)

• Cause: different word use, e.g. the term in a phrase vs. not

Causes of Query Dependent Variation (3)

• Cause: different Boolean semantics of the queries, AND vs. OR

Causes of Query Dependent Variation (4)

• Cause: different association level with the topic

Efficient P(t | R) Prediction (2)

• Causes of P(t | R) variation for the same term in different queries
  – Different query semantics: "Canada or Mexico" vs. "Canada"
  – Different word sense: bear (verb) vs. bear (noun)
  – Different word use: "Seasonal affective disorder syndrome (SADS)" vs. "Agoraphobia as a disorder"
  – Difference in association level with the topic
• Use historic occurrences to predict the current query (see the sketch below)
  – 70-90% of the total gain
  – 3-10X faster, close to simple keyword retrieval
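A sketch of the historic-lookup idea: cache the P(t | R) values observed for a term in past judged queries and reuse their average for new queries, falling back to a prior for unseen terms; the data structures and the fallback value are assumptions, not the dissertation's exact method:

from collections import defaultdict

class HistoricRecallPredictor:
    """Cache P(t | R) values from previously seen (judged) queries."""

    def __init__(self, default=0.6):
        # Prior for unseen terms; average term recall was roughly 55-70%.
        self.history = defaultdict(list)
        self.default = default

    def update(self, term, observed_recall):
        self.history[term].append(observed_recall)

    def predict(self, term):
        past = self.history.get(term)
        return sum(past) / len(past) if past else self.default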

Efficient P(t | R) Prediction (2)

• Low variation of the same term across different queries
• Use historic occurrences to predict the current query
  – 3-10X faster, close to the slower method & real time
[Bar chart: MAP of Baseline LM desc, Necessity LM desc, and Efficient Prediction over datasets (train -> test): 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14. "*": significantly better.]

Using Document Structure

• Stylistic structure: XML
• Syntactic/semantic structure: POS, semantic role labels
• Current approaches are all precision oriented
• Need to solve mismatch first?

Motivation

• Search is important: an information portal
• Search is research-worthy
  – Retrieval venues: SIGIR, WWW, CIKM, ASIST, ECIR, AIRS, …
  – Search is difficult
  – Must adapt to the changing requirements of the mobile, social and semantic Web
• Modeling the user's needs
  [Diagram: the user sends a query and activities to the retrieval model, which searches the collections and returns results.]

Online or Offline Study?

• An online study requires controlling confounding variables
  – quality of the expansion terms
  – the user's prior knowledge of the topic
  – interaction form & effort
• It also requires enrolling many users and repeating experiments
• Offline simulations can avoid all of these and still make reasonable observations

Simulation Assumptions

• Use real full CNFs to simulate partial expansions
• Three assumptions about the user expansion process
  – Expansions of individual terms are independent of each other
    • A1: always the same set of expansion terms for a given query term, no matter which subset of query terms gets expanded
    • A2: the same sequence of expansion terms, no matter …
  – A3: the keyword query is reconstructed from the CNF query
    • a procedure ensures the vocabulary is faithful to that of the original keyword description
    • highly effective CNF queries ensure a reasonable keyword baseline

Take Home Message for Ordinary Search Users (people and software)


Be mean!

Is the term Necessary for document relevance?

If not, remove, replace or expand it.
