Statistical Language Models for
Biomedical Literature Retrieval
ChengXiang Zhai
Department of Computer Science,
Institute for Genomic Biology ,
And Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign
Motivation
• Biomedical literature serves as a "complete" documentation of the biomedical knowledge discovered by scientists
• Medline: > 10,000,000 literature abstracts (1966-)
• Effective access to biomedical literature is essential for
  – Understanding related existing discoveries
  – Formulating new hypotheses
  – Verifying hypotheses
  – …
• Biologists routinely use PubMed to access literature (http://www.ncbi.nlm.nih.gov/PubMed)
Challenges in
Biomedical Literature Retrieval
• Tokenization
– Many names are irregular with special characters
such as “/”, “-”, etc. E.g., MIP-1-alpha, (MIP)-1alpha
– Ambiguous words: “was” and “as” can be genes
• Semi-structured queries
– It is often desirable to expand a query about a gene
with synonyms of the gene; the expanded query
would have several fields (original name + symbols)
– “Find the role of gene A in disease B” (3 fields)
•…
TREC Genomics Track
• TREC (Text REtrieval Conference):
– Started 1992; sponsored by NIST
– Large-scale evaluation of information retrieval (IR)
techniques
• Genomics Track
– Started in 2003
– Still continuing
– Evaluation of IR for biomedical literature search
Typical TREC Cycle
• Feb: Application for participation
• Spring: Preliminary (training) data available
• Beginning of Summer: Official test data
available
• End of Summer: Result submission
• Early Fall: Official evaluation; results are out
in Oct
• Nov: TREC Workshop; plan for next year
UIUC Participation
• 2003: Obtained initial experience; recognized
the problem of “semi-structured queries”
• 2005: Continued developing semi-structured
language models
• 2006: Applied hidden Markov models to
passage retrieval
Outline
• Standard IR Techniques
• Semi-structured Query Language Models
• Parameter Estimation
• Experiment Results
• Conclusions and Future Work
What is Text Retrieval (TR)?
• There exists a collection of text documents
• User gives a query to express the information
need
• A retrieval system returns relevant documents
to users
• More commonly known as “Information
Retrieval” (IR)
• Known as “search technology” in industry
TR is Hard!
• Under/over-specified query
– Ambiguous: “buying CDs” (money or music?)
– Incomplete: what kind of CDs?
– What if “CD” is never mentioned in document?
• Vague semantics of documents
– Ambiguity: e.g., word-sense, structural
– Incomplete: Inferences required
• Even hard for people!
– 80% agreement in human judgments
TR is “Easy”!
• TR CAN be easy in a particular case
– Ambiguity in query/document is RELATIVE to the
database
– So, if the query is SPECIFIC enough, just one
keyword may get all the relevant documents
• PERCEIVED TR performance is usually better
than the actual performance
– Users can NOT judge the completeness of an
answer
Formal Formulation of TR
• Vocabulary V = {w1, w2, …, wN} of the language
• Query q = q1,…,qm, where qi ∈ V
• Document di = di1,…,dimi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
  – Generally unknown and user-dependent
  – Query is a "hint" on which doc is in R(q)
• Task = compute R'(q), an "approximate R(q)"
Computing R(q)
• Strategy 1: Document selection
  – R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
  – System must decide if a doc is relevant or not ("absolute relevance")
• Strategy 2: Document ranking
  – R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) ∈ ℜ is a relevance measure function; θ is a cutoff
  – System must decide if one doc is more likely to be relevant than another ("relative relevance")
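As a toy illustration of the two strategies (my own sketch, not from the talk; the scoring function and cutoff are placeholders), document selection applies an absolute threshold while ranking simply sorts by score:

```python
# Hypothetical word-overlap relevance measure f(d, q); any scoring function would do.
def f(doc: str, query: str) -> float:
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

docs = ["interferon beta and multiple sclerosis",
        "cooking with olive oil",
        "gene expression in multiple sclerosis"]
query = "interferon beta multiple sclerosis"

# Strategy 1: document selection ("absolute relevance", cutoff theta)
theta = 0.5
selected = [d for d in docs if f(d, query) > theta]

# Strategy 2: document ranking ("relative relevance", the user decides where to stop)
ranked = sorted(docs, key=lambda d: f(d, query), reverse=True)

print(selected)
print(ranked)
```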
Document Selection vs. Ranking
[Figure: the true relevant set R(q) contains both relevant (+) and non-relevant (-) documents. Document selection applies a binary f(d,q) ∈ {0,1} and returns a set R'(q) that inevitably mixes + and - documents. Document ranking assigns each document a score (0.98 d1 +, 0.95 d2 +, 0.83 d3, 0.80 d4 +, 0.76 d5, 0.56 d6, 0.34 d7, 0.21 d8 +, 0.21 d9 -), and R'(q) is whatever prefix of the ranked list the user chooses to inspect.]
Problems of Doc Selection
• The classifier is unlikely accurate
– “Over-constrained” query (terms are too specific):
no relevant documents found
– “Under-constrained” query (terms are too general):
over delivery
– It is extremely hard to find the right position
between these two extremes
• Even if it is accurate,
all relevant documents
are not equally relevant
• Relevance is a matter of degree!
Ranking is often preferred
• Relevance is a matter of degree
• A user can stop browsing anywhere, so the
boundary is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking
Principle [Robertson 77]
Evaluation Criteria
• Effectiveness/Accuracy
– Precision, Recall
• Efficiency
– Space and time complexity
• Usability
– How useful for real user tasks?
Methodology: Cranfield Tradition
• Laboratory testing of system components
– Precision, Recall
– Comparative testing
• Test collections
– Set of documents
– Set of questions
– Relevance judgments
The Contingency Table
Doc \ Action     Retrieved              Not Retrieved
Relevant         Relevant Retrieved     Relevant Rejected
Not relevant     Irrelevant Retrieved   Irrelevant Rejected

Precision = Relevant Retrieved / Retrieved
Recall    = Relevant Retrieved / Relevant
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure: two precision-recall curves, with precision on the y-axis and recall on the x-axis; which is better?]
Summarize a Ranking
• Given that n docs are retrieved
  – Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs
  – E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
  – If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
  – Average precision = (p(1)+…+p(k))/k
• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
• Mean Average Precision (MAP)
  – MAP = arithmetic mean of average precision over a set of topics
  – gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
Precision-Recall Curve
[Figure: an example precision-recall curve. Annotations: out of 4728 relevant docs, 3212 are retrieved, so Recall = 3212/4728; Precision@10docs ≈ 0.55 (about 5.5 of the top 10 docs are relevant); the breakeven point is where precision = recall.]
Mean Avg. Precision (MAP)
D1 +
D2 +
D3 –
D4 –
D5 +
D6 -
Total # rel docs = 4
System returns 6 docs
Average Prec = (1/1+2/2+3/5+0)/4
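To make the computation concrete, here is a small sketch (my own illustration, not from the slides) of non-interpolated average precision; it reproduces the example above and averages over topics for MAP:

```python
def average_precision(ranking, total_relevant):
    """Non-interpolated average precision for one topic.

    ranking: list of booleans, True if the doc at that rank is relevant.
    total_relevant: number of relevant docs in the collection; relevant
    docs that are never retrieved contribute precision 0.
    """
    hits = 0
    precisions = []
    for rank, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

def mean_average_precision(all_rankings, all_totals):
    """Arithmetic mean of average precision over a set of topics (MAP)."""
    aps = [average_precision(r, n) for r, n in zip(all_rankings, all_totals)]
    return sum(aps) / len(aps)

# The slide's example: D1+, D2+, D3-, D4-, D5+, D6-, with 4 relevant docs in total.
print(average_precision([True, True, False, False, True, False], 4))  # (1/1 + 2/2 + 3/5 + 0)/4 = 0.65
```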
Typical TR System Architecture
[Diagram: docs and the user's query both pass through the Tokenizer; the Indexer builds the document representation (Index); the Scorer matches the query representation against the Index and returns results to the User; the User's judgments drive Feedback, which updates the query.]
Tokenization
• Normalize lexical units: Words with similar
meanings should be mapped to the same
indexing term
• Stemming: Mapping all inflectional forms of
words to the same root form, e.g.
– computer -> compute
– computation -> compute
– computing -> compute (but king->k?)
• Porter’s Stemmer is popular for English
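For illustration only (a minimal sketch assuming the NLTK package is installed; not part of the original talk), Porter stemming conflates inflectional variants:

```python
# Assumes `pip install nltk`; the Porter stemmer itself needs no extra data downloads.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing", "king"]:
    # The actual Porter root is "comput", and "king" stays "king",
    # illustrating that stemming is heuristic rather than linguistic.
    print(word, "->", stemmer.stem(word))
```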
Relevance Feedback
[Diagram: the Query is sent to the Retrieval Engine, which searches the Document collection and returns scored Results (d1 3.5, d2 2.4, …, dk 0.5, …). The User judges the results (d1 +, d2 -, d3 +, …, dk -), and Feedback uses these judgments to produce an Updated query.]
Pseudo/Blind/Automatic
Feedback
[Diagram: the same loop as relevance feedback, but with no user in it: the top 10 results (d1, d2, d3, …) are simply assumed to be relevant (+), and Feedback uses them to produce the Updated query.]
Traditional approach
= Vector space model
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between
the query vector and document vector in the
vector space
VS Model: illustration
[Figure: documents D1-D11 and the Query plotted as vectors in a three-dimensional term space with axes "Java", "Starbucks", and "Microsoft"; relevance is measured by how close each document vector lies to the query vector, and the "??" marks flag documents whose relevance is in question.]
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and
hopefully accurately
• Many possibilities: Words, stemmed words,
phrases, “latent concept”, …
How to Assign Weights?
• Very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more contents
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
– TF normalization
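To make the TF-IDF heuristics concrete, here is a minimal sketch (my own toy example; real systems add TF and document-length normalization, omitted here) of TF-IDF weighting with cosine similarity:

```python
import math
from collections import Counter

docs = ["java coffee starbucks",
        "java programming microsoft",
        "microsoft windows update"]
query = "java microsoft"

N = len(docs)
doc_tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in doc_tfs for t in tf)          # document frequency
idf = {t: math.log(N / df[t]) for t in df}             # IDF = log(N / df)

def tfidf_vector(tf):
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(Counter(query.split()))
scores = sorted(((cosine(q_vec, tfidf_vector(tf)), i) for i, tf in enumerate(doc_tfs)), reverse=True)
print(scores)  # documents ranked by TF-IDF cosine similarity to the query
```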
Language Modeling Approaches are
becoming more and more popular…
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”)  0.001
– p(“Today Wednesday is”)  0.0000000000001
– p(“The eigenvalue is positive”)  0.00001
• Context-dependent!
• Can also be regarded as a probabilistic
mechanism for “generating” text, thus also called
a “generative” model
The Simplest Language Model
(Unigram Model)
• Generate a piece of text by generating each
word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1) p(w2) … p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is voc. size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
[Illustration: a (unigram) language model θ defines p(w|θ); sampling from it generates a document.

  Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001
    → generates a "text mining paper"

  Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
    → generates a "food nutrition paper"]
Estimation of Unigram LM
[Illustration: estimating the (unigram) language model θ, p(w|θ) = ?, from an observed document.

  Observed document, a "text mining paper" (total #words = 100):
    text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

  Estimated probabilities:
    text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …]
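A minimal sketch (my own, mirroring the toy counts above) of maximum likelihood estimation of a unigram model and of scoring a short piece of text with it:

```python
import math
from collections import Counter

# Toy counts mirroring the slide: 100 words in total.
counts = Counter({"text": 10, "mining": 5, "association": 3, "database": 3,
                  "algorithm": 2, "query": 1, "efficient": 1})
counts["other"] = 100 - sum(counts.values())  # remaining mass of the 100-word document

total = sum(counts.values())
theta = {w: c / total for w, c in counts.items()}  # ML estimate: p(w|theta) = c(w,d) / |d|

def log_likelihood(text, model):
    """log p(text | theta) under the unigram (independence) assumption.
    Unseen words get probability 0, i.e. -inf log-likelihood, which is why smoothing is needed."""
    return sum(math.log(model[w]) if model.get(w, 0.0) > 0 else float("-inf")
               for w in text.split())

print(theta["text"], theta["mining"])          # 0.1, 0.05
print(log_likelihood("text mining query", theta))
```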
Language Models for Retrieval
(Ponte & Croft 98)
[Illustration: each document is represented by its own language model.

  Document 1, a "text mining paper":   text ?, mining ?, association ?, clustering ?, …, food ?, …
  Document 2, a "food nutrition paper": food ?, nutrition ?, healthy ?, diet ?, …

  Query = "data mining algorithms": which model would most likely have generated this query?]
Ranking Docs by Query Likelihood
[Diagram: each document d1, d2, …, dN has an estimated language model θd1, θd2, …, θdN; given the query q, documents are ranked by the query likelihood p(q|θd1), p(q|θd2), …, p(q|θdN).]
Kullback-Leibler (KL) Divergence
Retrieval Model
• Unigram similarity model:

    Sim(d; q) = -D(θ_Q || θ_D)
              = Σ_w p(w|θ_Q) log p(w|θ_D) + [ -Σ_w p(w|θ_Q) log p(w|θ_Q) ]

  (the second term is the query entropy, ignored for ranking)

• Retrieval ⇔ estimation of θ_Q and θ_D; for scoring this reduces to

    sim(q; d) ∝ Σ_{wi ∈ d, p(wi|θ_Q) > 0} p(wi|θ_Q) log [ p_seen(wi|d) / (α_d p(wi|C)) ] + log α_d

• Special case: θ_Q = empirical distribution of q
Estimating p(w|d) (i.e., θ_D)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)

    p(w|d) = (1 - λ) p_ml(w|d) + λ p(w|C)

• Dirichlet prior (Bayesian): assume pseudo counts μ p(w|C)

    p(w|d) = [ c(w;d) + μ p(w|C) ] / ( |d| + μ )
           = ( |d| / (|d| + μ) ) p_ml(w|d) + ( μ / (|d| + μ) ) p(w|C)

• Absolute discounting: subtract a constant δ

    p(w|d) = [ max(c(w;d) - δ, 0) + δ |d|_u p(w|C) ] / |d|
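Here is a small sketch (my own, not the speaker's implementation) of Jelinek-Mercer and Dirichlet smoothing plugged into query-likelihood scoring; the λ and μ values are arbitrary illustrations:

```python
import math
from collections import Counter

collection = ["text mining and text clustering", "food nutrition and healthy diet"]
coll_counts = Counter(w for d in collection for w in d.split())
coll_total = sum(coll_counts.values())
p_C = {w: c / coll_total for w, c in coll_counts.items()}   # background model p(w|C)

def p_jm(w, doc_counts, doc_len, lam=0.5):
    """Jelinek-Mercer: (1 - lambda) * p_ml(w|d) + lambda * p(w|C)."""
    p_ml = doc_counts.get(w, 0) / doc_len
    return (1 - lam) * p_ml + lam * p_C.get(w, 1e-9)

def p_dir(w, doc_counts, doc_len, mu=100):
    """Dirichlet prior: (c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    return (doc_counts.get(w, 0) + mu * p_C.get(w, 1e-9)) / (doc_len + mu)

def score(query, doc, smooth=p_dir):
    """log p(q|theta_d): query likelihood with the chosen smoothing method."""
    dc = Counter(doc.split())
    dl = sum(dc.values())
    return sum(math.log(smooth(w, dc, dl)) for w in query.split())

for d in collection:
    print(d, score("text mining", d))   # the text-mining document scores higher
```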
Estimating θ_Q (Feedback)
[Diagram: the document D is represented by θ_D and the query Q by θ_Q; documents are scored by D(θ_Q' || θ_D), where the updated query model interpolates the original query model with a feedback model θ_F estimated from the feedback documents F = {d1, d2, …, dn} (the results) via a generative model:]

  θ_Q' = (1 - α) θ_Q + α θ_F

  α = 0:  θ_Q' = θ_Q   (no feedback)
  α = 1:  θ_Q' = θ_F   (full feedback)
Generative Mixture Model
[Diagram: each word w in the feedback documents F = {d1, …, dn} is drawn either from the background model p(w|C) with probability λ (background words) or from the topic model p(w|θ) with probability 1 - λ (topic words).]

  log p(F|θ) = Σ_i Σ_w c(w; di) log[ (1 - λ) p(w|θ) + λ p(w|C) ]

Maximum likelihood:  θ_F = argmax_θ log p(F|θ)

λ = noise in the feedback documents
How to Estimate θ_F?
[Illustration: the background model p(w|C) is known (e.g., the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005) and contributes with weight λ = 0.7; the query-topic model p(w|θ_F) is unknown (text = ?, mining = ?, association = ?, word = ?, …) and contributes with weight 1 - λ = 0.3. Given the observed feedback document(s) about "text mining", the ML estimator recovers p(w|θ_F). This would be easy if we knew the identity of each word ...]
Can We Guess the Identity?
Identity ("hidden") variable: zi ∈ {1 (background), 0 (topic)}

  word:  the  paper  presents  a  text  mining  algorithm  the  paper  ...
  zi:     1     1       1      1    0      0        0       1     0    ...

Suppose the parameters are all known; what is a reasonable guess of zi?
- depends on λ (why?)
- depends on p(w|C) and p(w|θ_F) (how?)

E-step:

  p(zi = 1 | wi) = p(zi = 1) p(wi | zi = 1) / [ p(zi = 1) p(wi | zi = 1) + p(zi = 0) p(wi | zi = 0) ]
                 = λ p(wi|C) / [ λ p(wi|C) + (1 - λ) p(wi|θ_F) ]

M-step:

  p_new(wi|θ_F) = c(wi, F) (1 - p(zi = 1 | wi)) / Σ_{wj} c(wj, F) (1 - p(zj = 1 | wj))

Initially, set p(w|θ_F) to some random value, then iterate …
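A minimal, self-contained sketch (my own, with toy documents and a hand-made background model) of this EM procedure for estimating the feedback topic model p(w|θ_F):

```python
import random
from collections import Counter

# Toy feedback documents (assumed pseudo-relevant for the query "text mining").
feedback_docs = ["the paper presents a text mining algorithm the paper",
                 "text mining of text data"]
counts = Counter(w for d in feedback_docs for w in d.split())   # c(w, F)

# Hand-made background model p(w|C): high mass on function words, little on topical words.
p_C = {"the": 0.25, "a": 0.12, "of": 0.10, "paper": 0.02, "presents": 0.02,
       "algorithm": 0.01, "data": 0.01, "text": 0.001, "mining": 0.0005}
z_c = sum(p_C.values())
p_C = {w: v / z_c for w, v in p_C.items()}   # normalize the toy background

lam = 0.7                     # probability that a token is generated by the background
vocab = list(counts)

# Initialize p(w|theta_F) randomly, then alternate E- and M-steps.
random.seed(0)
p_F = {w: random.random() for w in vocab}
z = sum(p_F.values())
p_F = {w: v / z for w, v in p_F.items()}

for _ in range(50):
    # E-step: posterior probability that each word was generated by the background.
    p_z1 = {w: lam * p_C[w] / (lam * p_C[w] + (1 - lam) * p_F[w]) for w in vocab}
    # M-step: re-estimate the topic model from the expected "topic" counts.
    expected = {w: counts[w] * (1 - p_z1[w]) for w in vocab}
    total = sum(expected.values())
    p_F = {w: e / total for w, e in expected.items()}

# Topical words ("text", "mining") should dominate; function words are absorbed by p(w|C).
print(sorted(p_F.items(), key=lambda kv: -kv[1])[:5])
```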
Example of Feedback Query Model
TREC topic 412: "airport security"

Mixture model approach (feedback from the top 10 docs of a Web database):

  λ = 0.9                          λ = 0.7
  W               p(W|θ_F)         W            p(W|θ_F)
  security        0.0558           the          0.0405
  airport         0.0546           security     0.0377
  beverage        0.0488           airport      0.0342
  alcohol         0.0474           beverage     0.0305
  bomb            0.0236           alcohol      0.0304
  terrorist       0.0217           to           0.0268
  author          0.0206           of           0.0241
  license         0.0188           and          0.0214
  bond            0.0186           author       0.0156
  counter-terror  0.0173           bomb         0.0150
  terror          0.0142           terrorist    0.0137
  newsnet         0.0129           in           0.0135
  attack          0.0124           license      0.0127
  operation       0.0121           state        0.0127
  headline        0.0121           by           0.0125
Problem with Standard IR Methods:
Semi-Structured Queries
• TREC-2003 Genomics Track, Topic 1: Find articles about the following gene:

    OFFICIAL_GENE_NAME   activating transcription factor 2
    OFFICIAL_SYMBOL      ATF2
    ALIAS_SYMBOL         HB16
    ALIAS_SYMBOL         CREB2
    ALIAS_SYMBOL         TREB7
    ALIAS_SYMBOL         CRE-BP1

  Bag-of-word representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1

• Problems with unstructured representation
  – Intuitively, matching "ATF2" should be counted more than matching "transcription"
  – Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach
Problem with Standard IR Methods:
Semi-Structured Queries (cont.)
• A topic in TREC-2005 Genomics Track
Find information about the role of the gene interferon-beta in the disease multiple sclerosis
• 3 different fields
• Should be weighted differently?
• What about expansion?
Semi-Structured Language Models
Semi-structured query: Q = (Q1, …, Qk), with field weights λ1, …, λk

Semi-structured query model:

  p(w|θ_Q) = Σ_{i=1..k} λi p(w|θi)

Semi-structured LM estimation: fit a mixture model to pseudo-feedback documents using Expectation-Maximization (EM)
Parameter Estimation
• Synonym queries:
  – Each field is estimated using maximum likelihood:

      p(w|θi) = c(w, Qi) / |Qi|

  – Each field has equal weight: λi = 1/k
• Aspect queries:
  – Use top-ranked documents to estimate all the parameters
  – Similar to the single-aspect model, but use the query as a prior and Bayesian estimation
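As a sketch of the synonym-query case (my own illustration, not the official TREC system): each field gets a maximum likelihood model, and the fields are mixed with equal weights λi = 1/k:

```python
from collections import Counter

# A semi-structured gene query: each field is one Q_i (fields taken from the slide's example).
fields = ["activating transcription factor 2",   # OFFICIAL_GENE_NAME
          "ATF2",                                 # OFFICIAL_SYMBOL
          "HB16", "CREB2", "TREB7", "CRE-BP1"]    # ALIAS_SYMBOLs

def field_model(field_text):
    """Maximum likelihood estimate for one field: p(w|theta_i) = c(w, Q_i) / |Q_i|."""
    tokens = field_text.lower().split()
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

k = len(fields)
lambdas = [1.0 / k] * k                 # equal field weights for synonym queries
models = [field_model(f) for f in fields]

# Mixture query model: p(w|theta_Q) = sum_i lambda_i * p(w|theta_i)
vocab = {w for m in models for w in m}
p_Q = {w: sum(lam * m.get(w, 0.0) for lam, m in zip(lambdas, models)) for w in vocab}

# "atf2" gets weight 1/6, while "transcription" only gets (1/4)/6, matching the intuition
# that matching the gene symbol should count more than matching a common word in the name.
print(p_Q["atf2"], p_Q["transcription"])
```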
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
  – "Best" means "data likelihood reaches maximum":

      θ̂ = argmax_θ P(X|θ)

  – Problem: small sample
• Bayesian estimation
  – "Best" means being consistent with our "prior" knowledge and explaining the data well:

      θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)

  – Problem: how to define the prior?
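As a numeric illustration (my own, not from the talk) of why the prior matters for small samples, compare the ML estimate of a word probability with a Bayesian (MAP) estimate under a Dirichlet prior with pseudo counts μ p(w|C), which is exactly the Dirichlet smoothing formula shown earlier:

```python
# Toy numbers: "mining" occurs 2 times in a 10-word document; background p(mining|C) = 0.001.
c_w, doc_len = 2, 10
p_w_C = 0.001
mu = 100  # prior strength (pseudo counts); an arbitrary illustrative value

p_ml = c_w / doc_len                           # 0.2: trusts the tiny sample completely
p_map = (c_w + mu * p_w_C) / (doc_len + mu)    # ~0.019: pulled toward the prior

print(p_ml, p_map)
```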
Illustration of Bayesian Estimation
[Figure: the posterior p(θ|X) ∝ p(X|θ) p(θ) combines the likelihood p(X|θ) of the data X = (x1, …, xN) with the prior p(θ); the plot marks the prior mode, the posterior mode, and the ML estimate θ_ml along the θ axis.]
Experiment Results
Query Model     TREC 2003 (Uniform weights)           TREC 2005 (Estimated weights)
                Unstruct   Semi-struct   Imp.         Unstruct   Semi-struct   Imp.
MAP             0.16       0.185         +13.5%       0.242      0.258         +6.6%
Pr@10docs       0.14       0.154         +10%         0.382      0.412         +7.8%
More Experiment Results
(with slightly different model)
Conclusions
• Standard IR techniques are effective for
biomedical literature retrieval
• Modeling and exploiting the structure in a
query can improve accuracy
• Overall TREC Genomics Track findings
– Domain-specific resources are very useful
– Sound retrieval models and machine learning
techniques are helpful
Future Work
• Using HMMs to model relevant documents
• Incorporate biomedical resources into
principled statistical models