Transcript Document

Cumulative Progress in Language
Models for Information Retrieval
Antti Puurula
6/12/2013
Australasian Language Technology Workshop
University of Waikato
Ad-hoc Information Retrieval
• Ad-hoc retrieval forms the basic task in Information Retrieval (IR):
• Given a query, retrieve and rank documents in a collection
• Origins:
• Cranfield 1 (1958-1960), Cranfield 2 (1962-1966), SMART (1961-1999)
• Major evaluations:
• TREC Ad-hoc (1990-1999), TREC Robust (2003-2005), CLEF (2000-2009), INEX
(2009-2010), NTCIR (1999-2013), FIRE (2008-2013)
Illusionary Progress in Ad-hoc IR
• TREC ad-hoc evaluations stopped in 1999, as progress plateaued
• More diverse tasks became the foci of research
• “There is little evidence of improvement in ad-hoc retrieval
technology over the past decade” (Armstrong et al. 2009)
• Weak baselines, non-cumulative improvements
• ⟶ “no way of using LSI achieves a worthwhile improvement in retrieval
accuracy over BM25” (Atreya & Elkan, 2010)
• ⟶ “there remains very little room for improvement in ad hoc search”
(Trotman & Keeler, 2011)
Progress in Language Models for IR?
• Language Models (LM) form one of the main approaches to IR
• Many improvements to LMs not adopted generally or evaluated
systematically
• TF-IDF feature weighting
• Pitman-Yor Process smoothing
• Feedback models
• Are these improvements consistent across standard datasets,
cumulative, and do they improve on a strong baseline?
Query Likelihood Language Models
• Query Likelihood (QL) (Kalt 1996, Hiemstra 1998, Ponte & Croft
1998) is the basic application of LMs for IR
• Unigram case: using count vectors to represent documents 𝒅𝒎 and
queries 𝒘, rank documents 𝑚 given a query according to 𝑝(𝒅𝒎 |𝒘)
• Assuming a generative model 𝑝(𝒅𝒎|𝒘) = 𝑝(𝒅𝒎, 𝒘)/𝑝(𝒘) and uniform priors over 𝑚: 𝑝(𝒅𝒎|𝒘) ≈ 𝑝(𝒘|𝒅𝒎)
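Spelling this step out (a short sketch of the standard Bayes derivation):

p(\boldsymbol{d}_m \mid \boldsymbol{w}) = \frac{p(\boldsymbol{w} \mid \boldsymbol{d}_m)\, p(\boldsymbol{d}_m)}{p(\boldsymbol{w})} \propto p(\boldsymbol{w} \mid \boldsymbol{d}_m) \quad \text{when } p(\boldsymbol{d}_m) = 1/M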
Query Likelihood Language Models 2
• The unigram QL-score for each document 𝑚 becomes:
• where 𝑍(𝒘) is the Multinomial coefficient, and document models
𝑝𝑚 (𝑛) are given by the Maximum Likelihood estimates:
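As a sketch, the standard unigram QL score and Maximum Likelihood estimate consistent with these definitions, with w_n and d_{mn} the query and document counts of word n, are:

p(\boldsymbol{w} \mid \boldsymbol{d}_m) = Z(\boldsymbol{w}) \prod_n p_m(n)^{w_n}, \qquad p_m(n) = \frac{d_{mn}}{\sum_{n'} d_{mn'}}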
Pitman-Yor Process Smoothing
• Standard methods for smoothing in IR LMs are Dirichlet Prior (DP)
and 2-Stage Smoothing (2SS) (Zhai & Lafferty 2004, Smucker &
Allan 2007)
• A more recently suggested improvement is Pitman-Yor Process smoothing
(PYP), an approximation to inference on a Pitman-Yor Process
(Momtazi & Klakow 2010, Huang & Renals 2010)
• All methods interpolate unsmoothed parameters with a
background distribution. PYP additionally discounts the
unsmoothed counts
Pitman-Yor Process Smoothing 2
• All methods share the form of interpolating the document model 𝑝𝑚(𝑛) with the background model 𝑝𝑐(𝑛):
• DP: Dirichlet Prior smoothing, with concentration parameter 𝜇
• 2SS: Dirichlet Prior smoothing followed by Jelinek-Mercer interpolation with weight 𝜆
• PYP: Dirichlet-style interpolation with an additional discount 𝛿 subtracted from the raw counts
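A rough sketch of the three estimators in their standard textbook forms (the exact parameterization in the talk may differ), with |𝒅ₘ| the document length and U_m the number of distinct words in document m:

p_m^{\mathrm{DP}}(n) = \frac{d_{mn} + \mu\, p_c(n)}{|\boldsymbol{d}_m| + \mu}

p_m^{\mathrm{2SS}}(n) = (1-\lambda)\, \frac{d_{mn} + \mu\, p_c(n)}{|\boldsymbol{d}_m| + \mu} + \lambda\, p_c(n)

p_m^{\mathrm{PYP}}(n) = \frac{\max(d_{mn} - \delta,\, 0) + (\mu + \delta\, U_m)\, p_c(n)}{|\boldsymbol{d}_m| + \mu}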
Pitman-Yor Process Smoothing 3
• The background model 𝑝𝑐 (𝑛) is most commonly estimated by
concatenating all collection documents into a single document:
• Less commonly, a uniform background model is used:
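A sketch of the two estimates just described, with N the vocabulary size:

p_c(n) = \frac{\sum_m d_{mn}}{\sum_m |\boldsymbol{d}_m|}, \qquad p_u(n) = \frac{1}{N}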
TF-IDF Feature Weighting
• Multinomial modelling assumptions of text can be corrected with
TF-IDF weighting (Rennie et al. 2003, Frank & Bouckaert 2006)
• Traditional view: IDF-weighting unnecessary with IR LMs (Zhai &
Lafferty 2004)
• Recent view: combination is complementary (Smucker & Allan
2007, Momtazi et al. 2010)
TF-IDF Feature Weighting 2
• Dataset documents can be weighted by TF-IDF:
• where 𝒅’’ is the unweighted count vector, 𝑀 the number of documents, and 𝑀𝑛 the number of documents in which word 𝑛 occurs
• First factor is TF log transform using unique length normalization
(Singhal et al. 1996)
• Second factor is Robertson-Walker IDF (Robertson & Zaragoza 2009)
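As an illustrative sketch only, the weighting has the shape of a log TF transform multiplied by a log IDF term; the simplest such combination (an assumption here, not the exact variants cited above) is:

d_{mn} = \log(1 + d''_{mn}) \cdot \log\frac{M}{M_n}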
TF-IDF Feature Weighting 3
• IDF has an overlapping function with collection smoothing (Hiemstra &
Kraaij 1998)
• The interaction is taken into account by replacing the collection model with a
uniform model in smoothing:
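For example, under Dirichlet Prior smoothing this substitution gives (a sketch, with N the vocabulary size):

p_m^{\mathrm{DP}}(n) = \frac{d_{mn} + \mu / N}{|\boldsymbol{d}_m| + \mu}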
Model-based Feedback
• Pseudo-feedback is a traditional method in Ad-hoc IR:
• Using the documents retrieved for the original query 𝒘’, construct a new query 𝒘 and re-rank with it
• With LMs two different formalizations enable model-based
feedback:
• KL-Divergence Retrieval (Zhai & Lafferty 2001)
• Relevance Models (Lavrenko & Croft 2001)
• Both enable replacing the original query counts 𝒘’ by a model
Model-based Feedback 2
• Many modeling choices exist for the feedback models, such as:
• Using top 𝐾 retrieved documents (commonly 𝐾 = 50)
• Truncating the word vector to words present in the original query
• Weighting the feedback documents using 𝑝(𝑚|𝒘’)
• Interpolating the feedback model with the original query
• These modeling choices are combined here
Model-based Feedback 3
• The interpolated query model 𝒘 is estimated for the query words
𝒘′𝑛 > 0 from the top 𝐾 = 50 document models 𝑝𝑘 (𝑛):
• where 𝜆 is the interpolation weight and 𝑍 is a normalizer:
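One interpolation consistent with the choices listed on the previous slide (a sketch; the document weighting by p(k|𝒘′) and the scaling of 𝑍 are assumptions here):

w_n = (1 - \lambda)\, w'_n + \lambda\, Z \sum_{k=1}^{K} p(k \mid \boldsymbol{w}')\, p_k(n) \quad \text{for } w'_n > 0, \qquad Z = \sum_{n'} w'_{n'}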
Experimental Setup
• Ad-hoc IR experiments conducted on
13 standard datasets
• TREC1-5 split according to data source
• OHSU-TREC
• FIRE 2008-2011 English
• Preprocessing: stopword removal, short word (< 3 characters) removal, and Porter stemming
• Each dataset split into development
and evaluation subsets
Experimental Setup 2
• Software used for experiments was the SGMWeka 1.44 toolkit:
• http://sourceforge.net/projects/sgmweka/
• Smoothing parameters optimized on development sets using
Gaussian Random Searches (Luke 2009)
• Evaluation performed on evaluation sets, using Mean Average
Precision of top 50 documents (MAP@50)
• Significance tested with paired one-tailed t-tests across the datasets, at 𝑝 < 0.05
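A minimal sketch of the development-set parameter optimization, assuming a hypothetical evaluate_map_at_50(params) scorer (the SGMWeka internals are not reproduced here):

import random

def gaussian_random_search(evaluate, init_params, sigma=0.3, iterations=200, seed=0):
    # Local random search: perturb the current best parameters with Gaussian
    # noise and keep a proposal whenever it improves the development score.
    rng = random.Random(seed)
    best_params = dict(init_params)
    best_score = evaluate(best_params)
    for _ in range(iterations):
        proposal = {k: max(1e-6, v + rng.gauss(0.0, sigma * v))
                    for k, v in best_params.items()}
        score = evaluate(proposal)
        if score > best_score:
            best_params, best_score = proposal, score
    return best_params, best_score

# Example usage with the hypothetical scorer:
# best, score = gaussian_random_search(evaluate_map_at_50,
#                                      {"mu": 1000.0, "delta": 0.7, "lambda": 0.5})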
Results
• Significant differences (TI = TF-IDF weighting, FB = model-based feedback):
• PYP > DP
• PYP+TI > 2SS
• PYP+TI+FB > PYP+TI
• PYP+TI+FB improves on 2SS by
4.07 MAP@50 absolute, a
17.1% relative improvement
Discussion
• The 3 evaluated improvements in language models for IR:
• require little additional computation
• can be implemented with small modifications to existing IR systems
• are substantial, significant and cumulative across 13 standard datasets,
compared to DP and 2SS baselines (4.07 MAP@50 absolute, 17.1% relative)
• Further improvements requiring more computation are possible
• document neighbourhood smoothing, word correlation models, passage-based LMs, bigram LMs, …
• More extensive evaluations needed for confirming progress