IR Challenges and Language Modeling
IR Achievements
Search engines
Meta-search
Cross-lingual search
Factoid question answering
Filtering
Statistical approach to language
Evaluation methodology
Effectiveness and efficiency
The importance of users
Current Status
Everyone is a Web or language technology person today...
SIGMOD, VLDB
WWW
ACL, EMNLP
ICML
KDD
IJCAI, AAAI, ....
Funding agencies have declared some problems
“solved”
Defining the Research Challenges
What are the driving forces?
What should we work on?
What are the grand challenges?
What should be funded?
cf. Asilomar Report produced by the database
community
Language Modeling
One challenge: Defining the formal basis for IR
retrieval models
indexing models
Lots of papers, any consensus?
Relationship to real systems?
Language models are an attempt to provide a different
perspective for retrieval models
shown promise in describing a range of IR “tasks”
potential for better integration with other language technologies
Why retrieval models?
“Why do we need new retrieval models now that we
have Google?”
Web search ≠ IR
Typical web queries ≠ information needs
Google shows that, for some types of queries, effective
ranking can be obtained by combining an AND query
with a number of other features
effect of scale - ranking within the top group
features such as links, anchor text, tagging used
Retrieval models provide frameworks for improving
effectiveness in more general contexts
LM for IR
What is a language model?
Query-likelihood and document models
Document-likelihood and query models
KL divergence comparison of models
Other models
Applications
What is a Language Model?
• A statistical model for generating text
– Probability distribution over strings in a given language
P(w₁ w₂ w₃ w₄ | M) = P(w₁ | M) · P(w₂ | M, w₁) · P(w₃ | M, w₁ w₂) · P(w₄ | M, w₁ w₂ w₃)
© Victor Lavrenko, Aug. 2002
Unigram and higher-order models
P(w₁ w₂ w₃ w₄) = P(w₁) P(w₂ | w₁) P(w₃ | w₁ w₂) P(w₄ | w₁ w₂ w₃)
• Unigram Language Models
P(w₁) P(w₂) P(w₃) P(w₄)
• N-gram Language Models (e.g. bigram)
P(w₁) P(w₂ | w₁) P(w₃ | w₂) P(w₄ | w₃)
• Other Language Models
– Grammar-based models, etc.
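The unigram and bigram factorizations above can be sketched on a toy corpus; the corpus, counts, and word choices below are invented for the example.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "the frog said the frog croaked".split()
unigrams = Counter(corpus)                   # single-word counts
bigrams = Counter(zip(corpus, corpus[1:]))   # adjacent-pair counts
total = len(corpus)

def unigram_prob(words):
    """Unigram model: P(w1..wn) = product of independent P(wi)."""
    p = 1.0
    for w in words:
        p *= unigrams[w] / total
    return p

def bigram_prob(words):
    """Bigram model: P(w1..wn) = P(w1) * product of P(wi | w(i-1))."""
    p = unigrams[words[0]] / total
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p
```

On this corpus "the frog" is more probable under the bigram model, since "frog" always follows "the" even though each word alone is rare.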
The fundamental problem of LMs
• Usually we don’t know the model M
– But have a sample of text representative of that model
P(w₁ w₂ w₃ w₄ | M(sample))
• Estimate a language model from a sample
• Then compute the observation probability
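The two steps can be sketched with maximum-likelihood estimation; the sample text is invented for the example.

```python
from collections import Counter

def estimate_model(sample):
    """MLE unigram model from a sample: P(w|M) = count(w) / sample length."""
    counts = Counter(sample)
    n = len(sample)
    return {w: c / n for w, c in counts.items()}

def observation_prob(text, model):
    """P(text|M) under the unigram model; unseen words get probability 0."""
    p = 1.0
    for w in text:
        p *= model.get(w, 0.0)
    return p

# Estimate M from an (invented) sample, then score new observations.
model = estimate_model("a b a".split())
```

The zero probability assigned to any unseen word is exactly the estimation problem that the smoothing techniques on the later slides address.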
Models of Text Generation
Searcher → P(M | Searcher) → Query Model → P(Query | M) → Query
Writer → P(M | Writer) → Doc Model → P(Doc | M) → Doc
Is this the same model?
Retrieval Using Language Models
Query → Query Model: P(w | Query)
Doc → Doc Model: P(w | Doc)
Retrieval: Query likelihood (1), Document likelihood (2), Model comparison (3)
Query Likelihood
P(Q|Dm)
Major issue is estimating document model
i.e. smoothing techniques instead of tf.idf weights
cf. Van Rijsbergen’s P(D → Q) and InQuery’s P(I|D)
Good retrieval results
e.g. UMass, BBN, Twente, CMU
Problems dealing with relevance feedback, query
expansion, structured queries
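The query-likelihood scoring with a smoothed document model can be sketched as below; Jelinek-Mercer interpolation is used as one common smoothing choice, and the documents, query, and λ value are invented for the example.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|Dm): document model smoothed with collection statistics
    (Jelinek-Mercer interpolation) instead of tf.idf weights."""
    dc, cc = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_doc = dc[w] / len(doc)          # MLE document probability
        p_col = cc[w] / len(collection)   # background (collection) probability
        # Assumes every query word occurs somewhere in the collection,
        # so the smoothed probability is never zero.
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

# Invented two-document collection for illustration.
docs = ["language model for retrieval".split(),
        "web search engine ranking".split()]
collection = [w for d in docs for w in d]
query = "retrieval model".split()
```

Smoothing is what lets the second document receive a finite (if low) score even though it contains no query word, which is where tf.idf-style weighting effects emerge from the model.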
Document Likelihood
Rank by likelihood ratio P(D|R)/P(D|N)
treat as a generation problem
P(w|R) is estimated by P(w|Qm)
Qm is the query or relevance model
P(w|N) is estimated by collection probabilities P(w)
Issue is estimation of query model
Treat query as generated by mixture of topic and background
Estimate relevance model from related documents (query
expansion)
Relevance feedback is easily incorporated
Good retrieval results
e.g. UMass at SIGIR 01
inconsistent with heterogeneous document collections
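The likelihood-ratio ranking above can be sketched as follows; the relevance and background models, and the floor value for missing words, are invented for the example.

```python
import math
from collections import Counter

def log_likelihood_ratio(doc, rel_model, col_model, floor=1e-9):
    """log [ P(D|R) / P(D|N) ]: each document word contributes
    count(w) * log( P(w|R) / P(w|N) ). The floor is a hypothetical
    back-off that keeps the logs finite for unmodeled words."""
    score = 0.0
    for w, c in Counter(doc).items():
        score += c * math.log(rel_model.get(w, floor) / col_model.get(w, floor))
    return score

# Invented relevance (query) model P(w|R) and background model P(w|N).
rel_model = {"retrieval": 0.4, "model": 0.4, "the": 0.2}
col_model = {"retrieval": 0.05, "model": 0.05, "the": 0.5, "web": 0.4}
```

A document dominated by relevance-model words scores above zero, while one made of background words scores below it, which is the generation view of ranking the slide describes.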
Model Comparison
Estimate query and document models and compare
Obvious measure is KL divergence D(Qm||Dm)
equivalent to query-likelihood approach if simple empirical
distribution used for query model
More general risk minimization framework has been
proposed
Zhai and Lafferty
Consistently better results than query-likelihood or
document-likelihood approaches
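Ranking by KL divergence can be sketched as below. Since the query-model entropy term of D(Qm||Dm) is constant across documents, ranking by −D(Qm||Dm) reduces to Σ_w P(w|Qm) log P(w|Dm), which recovers query likelihood when Qm is the empirical query distribution. The models and the eps floor are invented for the example.

```python
import math

def neg_kl_score(query_model, doc_model, eps=1e-9):
    """Rank by -D(Qm || Dm); dropping the document-independent entropy
    term leaves sum_w P(w|Qm) * log P(w|Dm). eps is a hypothetical
    floor for words the document model misses."""
    return sum(p * math.log(doc_model.get(w, eps))
               for w, p in query_model.items())

# Invented empirical query model and (smoothed) document models.
qm = {"retrieval": 0.5, "model": 0.5}
dm_good = {"retrieval": 0.3, "model": 0.3, "language": 0.4}
dm_bad = {"web": 0.5, "search": 0.5}
```

The generality comes from being able to plug in any estimated query model Qm (e.g. an expanded or feedback-based one) without changing the comparison function.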
Other Approaches
HMMs (BBN)
Probabilistic Latent Semantic Indexing (Hofmann)
assume documents are generated by a mixture of “aspect”
models
estimation more difficult
Translation model (Berger and Lafferty)
Applications
CLIR
TDT
Novelty and redundancy
Links
Distributed retrieval
QA
Filtering
Summarization
The Future of IR and LM