How to Make Manual Conjunctive
Normal Form Queries Work
in Patent Search
Le Zhao and Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Technology Survey Task @ Chem
• Document Collection
– 1.3 million patents + 0.18 million scientific articles
– Tend to be long, have XML field structure
• Topics
– 6 topics (last year only 2 groups submitted runs,
not reusable)
– About use/detection of chemicals (in certain
applications)
– Similar to Ad hoc retrieval queries
2
Example Topic: TS-20
• <title>tests for HCG hormone</title>
<narrative>The hormone Human Chorionic Gonadotrophin
(HCG) is produced when a woman becomes pregnant.
Tests are usually carried out by analysing blood or urine.
We are looking for articles and patents on these pregnancy
test kits or the chemical tests used to produce
them.</narrative>
<details>
<chemicals>Human Chorionic Gonadotrophin OR
HCG</chemicals>
<condition>pregnancy</condition>
<target>Human Chorionic Gonadotrophin OR
HCG</target>
</details>
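As an illustration of how a topic like TS-20 might be turned into query fields, here is a minimal Python sketch. The `<topic>` wrapper element, the `parse_topic` helper, and the decision to split field values on " OR " are assumptions made for this example; the narrative field is left out for brevity, and nothing here is taken from the track's actual topic files or from the authors' tooling.

```python
import xml.etree.ElementTree as ET

# Fields copied from the example topic above (narrative omitted for brevity).
TOPIC_FRAGMENT = """
<title>tests for HCG hormone</title>
<details>
<chemicals>Human Chorionic Gonadotrophin OR HCG</chemicals>
<condition>pregnancy</condition>
<target>Human Chorionic Gonadotrophin OR HCG</target>
</details>
"""


def parse_topic(fragment):
    """Return {field name: list of alternatives} for every leaf element."""
    # Assumption: the fragment has no single root, so wrap it in a
    # hypothetical <topic> element to make it parseable.
    root = ET.fromstring("<topic>" + fragment + "</topic>")
    fields = {}
    for elem in root.iter():
        if elem is root or len(elem) > 0:
            continue  # skip the wrapper and containers such as <details>
        text = " ".join((elem.text or "").split())  # normalize whitespace
        # Assumption: " OR " separates alternative names within a field.
        fields[elem.tag] = [alt.strip() for alt in text.split(" OR ")]
    return fields


fields = parse_topic(TOPIC_FRAGMENT)
for name, alternatives in fields.items():
    print(name, alternatives)
```

Keeping the alternatives of each field as a list makes it straightforward to reuse them later as one OR-group of a Boolean query.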
3
Our Runs
• Automatic Queries
– Unweighted bag of word baseline
– Weighting and combining words from different
query fields
• Manual Queries
– Interactive search using Boolean CNF queries
• (test OR check OR detection OR detect)
[thesaurus & interaction: check top ranked results]
AND
(HCG OR “Human Chorionic Gonadotrophin” OR
“Chorionic Gonadotropin” OR Choriogonadotropin OR
Choriogonin)
[MeSH etc. thesauri]
• Effective; used by lawyers, librarians, medical searchers, and IR experts
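The CNF structure of the manual query above can be sketched in Python as a list of OR-groups that are ANDed together. This is only an illustration under simple assumptions: the helper names (`to_boolean_string`, `contains`, `matches`) are hypothetical, matching is case-insensitive on whole words with no stemming, and the search system actually used for the runs is not modeled.

```python
import re

# Each inner list is one OR-group (a concept and its synonyms);
# the groups are ANDed together. Terms are those of the example query.
CNF_QUERY = [
    ["test", "check", "detection", "detect"],
    ["HCG", "Human Chorionic Gonadotrophin", "Chorionic Gonadotropin",
     "Choriogonadotropin", "Choriogonin"],
]


def to_boolean_string(cnf):
    """Render the CNF query as a Boolean string, quoting multi-word terms."""
    def fmt(term):
        return f'"{term}"' if " " in term else term
    return " AND ".join(
        "(" + " OR ".join(fmt(t) for t in group) + ")" for group in cnf
    )


def contains(term, text):
    """True if the word or phrase occurs in text (case-insensitive)."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in term.split()) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches(cnf, text):
    """A document matches if every OR-group has at least one matching term."""
    return all(any(contains(t, text) for t in group) for group in cnf)


print(to_boolean_string(CNF_QUERY))
print(matches(CNF_QUERY, "A urine test kit for human chorionic gonadotropin"))
print(matches(CNF_QUERY, "A kit for measuring HCG levels"))
```

The second document fails because no term from the first OR-group occurs in it, which shows why each conjunct needs a thorough synonym list: one missing synonym excludes the document entirely.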
4
Lemur CGI
Identify synonyms
0.5 hours per topic
5
Results at Large (xinfAP)
Not much difference on average
Worst manual queries have reasonable AP
Manual queries lower some high AP topics slightly
Figure credit: Mihai Lupu
6
Observations
• Weighting different query fields helped.
• Boolean CNF query (manual interaction)
– Good
• Expressive
• Helps a lot for hard (low AP) queries
– Bad
• Takes time & care to create & interact
• Manual error in formulating those queries
• Phrase or window restrictions improve top precision,
but destroy lower-level recall/precision
– Difficult to identify from top rank, new tools needed
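To make the first observation concrete, one simple way to weight and combine words from different query fields is sketched below in Python. The field weights, the `weighted_terms` helper, and the even split of a field's weight across its words are all assumptions for illustration; the slides do not report the weights or the combination method actually used.

```python
from collections import Counter

# Hypothetical field weights; the slides do not give the actual values.
FIELD_WEIGHTS = {"title": 0.5, "chemicals": 0.3, "condition": 0.2}


def weighted_terms(topic_fields, field_weights):
    """Combine words from several fields into one weighted bag of words.

    Each field's weight is split evenly among its words, and a word that
    appears in several fields accumulates weight from each of them.
    """
    scores = Counter()
    for field, text in topic_fields.items():
        weight = field_weights.get(field, 0.0)
        # Drop the Boolean connective "OR" used inside some topic fields.
        tokens = [t for t in text.lower().split() if t != "or"]
        if not tokens or weight == 0.0:
            continue
        for tok in tokens:
            scores[tok] += weight / len(tokens)
    return dict(scores)


topic = {
    "title": "tests for HCG hormone",
    "chemicals": "Human Chorionic Gonadotrophin OR HCG",
    "condition": "pregnancy",
}
term_weights = weighted_terms(topic, FIELD_WEIGHTS)
for term, w in sorted(term_weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

Under these assumed weights, a word such as "hcg" that occurs in both the title and the chemicals field ends up weighted more heavily than a word that occurs only in the title.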
7
Comparisons with Best Runs
• Fraunhofer-SCAI
– Semantic search (similar to our CNF queries)
– IPC classification filtering
– Doc field based term weighting
• Topics where our manual queries did better
– TS-22 detect => detection test predict check
determine determination
– TS-29 minimum inhibitory concentration => …
– Expanded all terms, but not all resulted in
8
• Thanks to track organizers
• NSF grant IIS-1018317
• Questions?
9