IR Challenges and Language Modeling
IR Achievements
Search engines
Meta-search
Cross-lingual search
Factoid question answering
Filtering
Statistical approach to language
Evaluation methodology
Effectiveness and efficiency
The importance of users
Current Status
Everyone is a Web or language technology person today...
SIGMOD, VLDB
WWW
ACL, EMNLP
ICML
KDD
IJCAI, AAAI, ....
Funding agencies have declared some problems
“solved”
Defining the Research Challenges
What are the driving forces?
What should we work on?
What are the grand challenges?
What should be funded?
cf. Asilomar Report produced by the database
community
Language Modeling
One challenge: Defining the formal basis for IR
retrieval models
indexing models
Lots of papers, any consensus?
Relationship to real systems?
Language models are an attempt to provide a different
perspective for retrieval models
shown promise in describing a range of IR “tasks”
potential for better integration with other language technologies
Why retrieval models?
“Why do we need new retrieval models now that we
have Google?”
Web search ≠ IR
Typical web queries ≠ information needs
Google shows that, for some types of queries, effective
ranking can be obtained by combining an AND query
with a number of other features
effect of scale - ranking within the top group
features such as links, anchor text, tagging used
Retrieval models provide frameworks for improving
effectiveness in more general contexts
LM for IR
What is a language model?
Query-likelihood and document models
Document-likelihood and query models
KL divergence comparison of models
Other models
Applications
What is a Language Model?
• A statistical model for generating text
– Probability distribution over strings in a given language
P(w₁ w₂ w₃ w₄ | M) = P(w₁ | M) · P(w₂ | M, w₁) · P(w₃ | M, w₁ w₂) · P(w₄ | M, w₁ w₂ w₃)
© Victor Lavrenko, Aug. 2002
Unigram and higher-order models
P(w₁ w₂ w₃ w₄) = P(w₁) P(w₂ | w₁) P(w₃ | w₁ w₂) P(w₄ | w₁ w₂ w₃)
• Unigram Language Models
P(w₁) P(w₂) P(w₃) P(w₄)
• N-gram Language Models (e.g. bigram)
P(w₁) P(w₂ | w₁) P(w₃ | w₂) P(w₄ | w₃)
• Other Language Models
– Grammar-based models, etc.
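The unigram and bigram factorizations above can be sketched on a toy corpus; the corpus, counts, and word choices below are invented for the example.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "the frog said the frog croaked".split()
unigrams = Counter(corpus)                   # single-word counts
bigrams = Counter(zip(corpus, corpus[1:]))   # adjacent-pair counts
total = len(corpus)

def unigram_prob(words):
    """Unigram model: P(w1..wn) = product of independent P(wi)."""
    p = 1.0
    for w in words:
        p *= unigrams[w] / total
    return p

def bigram_prob(words):
    """Bigram model: P(w1..wn) = P(w1) * product of P(wi | w(i-1))."""
    p = unigrams[words[0]] / total
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p
```

On this corpus "the frog" is more probable under the bigram model, since "frog" always follows "the" even though each word alone is rare.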
The fundamental problem of LMs
• Usually we don’t know the model M
– But have a sample of text representative of that model
P(w₁ w₂ w₃ w₄ | M(sample))
• Estimate a language model from a sample
• Then compute the observation probability
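The two steps can be sketched with maximum-likelihood estimation; the sample text is invented for the example.

```python
from collections import Counter

def estimate_model(sample):
    """MLE unigram model from a sample: P(w|M) = count(w) / sample length."""
    counts = Counter(sample)
    n = len(sample)
    return {w: c / n for w, c in counts.items()}

def observation_prob(text, model):
    """P(text|M) under the unigram model; unseen words get probability 0."""
    p = 1.0
    for w in text:
        p *= model.get(w, 0.0)
    return p

# Estimate M from an (invented) sample, then score new observations.
model = estimate_model("a b a".split())
```

The zero probability assigned to any unseen word is exactly the estimation problem that the smoothing techniques on the later slides address.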
Models of Text Generation
Searcher → P(M | Searcher) → Query Model → P(Query | M) → Query
Writer → P(M | Writer) → Doc Model → P(Doc | M) → Doc
Is this the same model?
Retrieval Using Language Models
Query → Query Model: P(w | Query)
Doc → Doc Model: P(w | Doc)
Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)
Query Likelihood
P(Q|Dm)
Major issue is estimating document model
i.e. smoothing techniques instead of tf.idf weights
cf. Van Rijsbergen’s P(D → Q) and InQuery’s P(I|D)
Good retrieval results
e.g. UMass, BBN, Twente, CMU
Problems dealing with relevance feedback, query
expansion, structured queries
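The query-likelihood scoring with a smoothed document model can be sketched as below; Jelinek-Mercer interpolation is used as one common smoothing choice, and the documents, query, and λ value are invented for the example.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|Dm): document model smoothed with collection statistics
    (Jelinek-Mercer interpolation) instead of tf.idf weights."""
    dc, cc = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_doc = dc[w] / len(doc)          # MLE document probability
        p_col = cc[w] / len(collection)   # background (collection) probability
        # Assumes every query word occurs somewhere in the collection,
        # so the smoothed probability is never zero.
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

# Invented two-document collection for illustration.
docs = ["language model for retrieval".split(),
        "web search engine ranking".split()]
collection = [w for d in docs for w in d]
query = "retrieval model".split()
```

Smoothing is what lets the second document receive a finite (if low) score even though it contains no query word, which is where tf.idf-style weighting effects emerge from the model.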
Document Likelihood
Rank by likelihood ratio P(D|R)/P(D|N)
treat as a generation problem
P(w|R) is estimated by P(w|Qm)
Qm is the query or relevance model
P(w|N) is estimated by collection probabilities P(w)
Issue is estimation of query model
Treat query as generated by mixture of topic and background
Estimate relevance model from related documents (query
expansion)
Relevance feedback is easily incorporated
Good retrieval results
e.g. UMass at SIGIR 01
inconsistent with heterogeneous document collections
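The likelihood-ratio ranking above can be sketched as follows; the relevance and background models, and the floor value for missing words, are invented for the example.

```python
import math
from collections import Counter

def log_likelihood_ratio(doc, rel_model, col_model, floor=1e-9):
    """log [ P(D|R) / P(D|N) ]: each document word contributes
    count(w) * log( P(w|R) / P(w|N) ). The floor is a hypothetical
    back-off that keeps the logs finite for unmodeled words."""
    score = 0.0
    for w, c in Counter(doc).items():
        score += c * math.log(rel_model.get(w, floor) / col_model.get(w, floor))
    return score

# Invented relevance (query) model P(w|R) and background model P(w|N).
rel_model = {"retrieval": 0.4, "model": 0.4, "the": 0.2}
col_model = {"retrieval": 0.05, "model": 0.05, "the": 0.5, "web": 0.4}
```

A document dominated by relevance-model words scores above zero, while one made of background words scores below it, which is the generation view of ranking the slide describes.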
Model Comparison
Estimate query and document models and compare
Obvious measure is KL divergence D(Qm||Dm)
equivalent to query-likelihood approach if simple empirical
distribution used for query model
More general risk minimization framework has been
proposed
Zhai and Lafferty
Consistently better results than query-likelihood or
document-likelihood approaches
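Ranking by KL divergence can be sketched as below. Since the query-model entropy term of D(Qm||Dm) is constant across documents, ranking by −D(Qm||Dm) reduces to Σ_w P(w|Qm) log P(w|Dm), which recovers query likelihood when Qm is the empirical query distribution. The models and the eps floor are invented for the example.

```python
import math

def neg_kl_score(query_model, doc_model, eps=1e-9):
    """Rank by -D(Qm || Dm); dropping the document-independent entropy
    term leaves sum_w P(w|Qm) * log P(w|Dm). eps is a hypothetical
    floor for words the document model misses."""
    return sum(p * math.log(doc_model.get(w, eps))
               for w, p in query_model.items())

# Invented empirical query model and (smoothed) document models.
qm = {"retrieval": 0.5, "model": 0.5}
dm_good = {"retrieval": 0.3, "model": 0.3, "language": 0.4}
dm_bad = {"web": 0.5, "search": 0.5}
```

The generality comes from being able to plug in any estimated query model Qm (e.g. an expanded or feedback-based one) without changing the comparison function.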
Other Approaches
HMMs (BBN)
Probabilistic Latent Semantic Indexing (Hofmann)
assume documents are generated by a mixture of “aspect”
models
estimation more difficult
Translation model (Berger and Lafferty)
Applications
CLIR
TDT
Novelty and redundancy
Links
Distributed retrieval
QA
Filtering
Summarization
The Future of IR and LM