Transcript Slide 1
Lemur Toolkit
http://net.pku.edu.cn/~wbia
彭波
[email protected]
School of Electronics Engineering and Computer Science, Peking University
3/21/2011
Recap
Information Retrieval Models
Vector Space Model
Probabilistic models
Language model
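Some formulas for Sim(VSM)
Here $a_i$ and $b_i$ are the weights of term $t_i$ in D and Q.
Dot product
$Sim(D, Q) = D \cdot Q = \sum_i a_i b_i$
Cosine
$Sim(D, Q) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2 \cdot \sum_i b_i^2}}$
Dice
$Sim(D, Q) = \frac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}$
Jaccard
$Sim(D, Q) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$
[Figure: document vector D and query vector Q in the term space (t1, t2, t3), separated by angle θ]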
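The four similarity measures can be tried out on small vectors. The following is a minimal Python sketch, assuming D and Q are given as equal-length lists of term weights; the function names and the example vectors are made up for illustration and are not part of the Lemur Toolkit.

```python
import math

def dot(a, b):
    """Dot product of two equal-length weight vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product normalized by the vector lengths."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def dice(a, b):
    """Dice coefficient on weighted vectors."""
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))

def jaccard(a, b):
    """Jaccard coefficient on weighted vectors."""
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

# Hypothetical weights over three terms (t1, t2, t3).
d = [1.0, 1.0, 0.0]
q = [1.0, 0.0, 0.0]

for name, sim in [("dot", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]:
    print(name, round(sim(d, q), 4))
```

All four share the same numerator and differ only in how they normalize by vector length, which is why they can rank documents differently.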
BM25 (Okapi system) – Robertson et al.
Consider tf, qtf, document length
$Score(D, Q) = \sum_{t_i \in Q} c_i \, \frac{(k_1 + 1)\, tf_i}{K + tf_i} \cdot \frac{(k_3 + 1)\, qtf_i}{k_3 + qtf_i} + k_2 \, |Q| \, \frac{avdl - dl}{avdl + dl}$
$K = k_1 \left( (1 - b) + b \, \frac{dl}{avdl} \right)$
TF factors: the two fractions inside the sum
Doc. length normalization: the $k_2 |Q|$ term
k1, k2, k3, b: parameters
qtf: query term frequency
dl: document length
avdl: average document length
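To make the roles of the parameters concrete, here is a minimal Python sketch of the BM25 score as written above. It is not the Okapi or Lemur implementation. The default parameter values, the reading of |Q| as the total number of query term occurrences, and the caller-supplied term weights c (for example an idf-like weight) are all assumptions made for illustration.

```python
def bm25_score(query_tf, doc_tf, c, dl, avdl, k1=1.2, k2=0.0, k3=1000.0, b=0.75):
    """Score one document against one query.

    query_tf: dict term -> qtf (query term frequency)
    doc_tf:   dict term -> tf (term frequency in the document)
    c:        dict term -> term weight c_i (assumed to be supplied by the caller)
    dl, avdl: document length and average document length
    """
    # K folds document length normalization into the tf factor.
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        tf_factor = (k1 + 1) * tf / (K + tf)
        qtf_factor = (k3 + 1) * qtf / (k3 + qtf)
        score += c.get(term, 0.0) * tf_factor * qtf_factor
    # Query-level length correction; |Q| is taken as the number of query term occurrences.
    q_len = sum(query_tf.values())
    score += k2 * q_len * (avdl - dl) / (avdl + dl)
    return score

# Hypothetical one-term query against a document of average length.
example = bm25_score({"lemur": 1}, {"lemur": 2}, {"lemur": 1.0}, dl=100, avdl=100)
print(example)
```

With dl equal to avdl, K reduces to k1 and the k2 term vanishes, so only the two TF factors contribute.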
Standard Probabilistic IR
[Figure: an information need is expressed as a query; the query is matched against each document d1, d2, …, dn in the document collection by estimating P(R | Q, d)]
IR based on Language Model (LM)
[Figure: each document d1, d2, …, dn in the document collection has its own language model M_d1, M_d2, …, M_dn; the query is treated as generated from a document model, with probability P(Q | M_d)]
A query generation process
For an information need, imagine an ideal document
Imagine what words could appear in that document
Formulate a query using those words
Language Modeling for IR
Estimate a multinomial probability distribution from the text
Smooth the distribution with one estimated from the entire collection
$P(w|D) = (1 - \lambda)\, P_{ML}(w|D) + \lambda\, P(w|C)$
($P_{ML}(w|D)$ is the unsmoothed estimate from the document text)
Query Likelihood
$P(Q|D) = \prod_{q \in Q} P(q|D)$
Estimate probability that document generated the query terms
Kullback-Leibler Divergence
$KL(Q|D) = \sum_w P(w|Q) \log \left( P(w|Q) / P(w|D) \right)$
Estimate models for document and query and compare
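The smoothing, query likelihood, and KL divergence formulas can be sketched in a few lines of Python. This is an illustrative sketch, not Lemur's implementation: it assumes documents and queries are already tokenized lists of words, uses a fixed mixing weight lam of 0.5, and assumes every query word occurs somewhere in the collection (otherwise the logarithm is undefined). The tiny documents are made up.

```python
import math
from collections import Counter

def smoothed_prob(w, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(w|D) = (1 - lam) * P_ML(w|D) + lam * P(w|C)."""
    p_doc = doc_tf.get(w, 0) / doc_len
    p_coll = coll_tf.get(w, 0) / coll_len
    return (1 - lam) * p_doc + lam * p_coll

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D): sum of log smoothed probabilities of the query words."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    return sum(
        math.log(smoothed_prob(q, doc_tf, len(doc), coll_tf, len(collection), lam))
        for q in query
    )

def kl_divergence(query, doc, collection, lam=0.5):
    """KL(Q|D) with a maximum-likelihood query model and a smoothed document model."""
    q_tf, doc_tf, coll_tf = Counter(query), Counter(doc), Counter(collection)
    total = 0.0
    for w, n in q_tf.items():
        p_q = n / len(query)
        p_d = smoothed_prob(w, doc_tf, len(doc), coll_tf, len(collection), lam)
        total += p_q * math.log(p_q / p_d)
    return total

# Hypothetical two-document collection.
doc1 = "lemur toolkit supports language models".split()
doc2 = "indri is a search engine".split()
collection = doc1 + doc2
query = ["language", "models"]

for name, doc in [("doc1", doc1), ("doc2", doc2)]:
    print(name, query_log_likelihood(query, doc, collection), kl_divergence(query, doc, collection))
```

A higher query likelihood and a lower KL divergence both indicate a better match, so the two measures should agree on which document ranks first here.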
Question
Among the three classic information retrieval
models, which one would you choose when designing
your retrieval system?
How can you tune the model parameters to
achieve the best performance?
When you have a new idea for a retrieval problem,
how can you show that it works?
A Brief History of IR
Slides from Prof. Ray Larson
University of California, Berkeley
School of Information
http://courses.sims.berkeley.edu/i240/s11/
Experimental IR systems
Probabilistic indexing – Maron and Kuhns, 1960
SMART – Gerard Salton at Cornell – Vector space
model, 1970’s
SIRE at Syracuse
I3R – Croft
Cheshire I (1990)
TREC – 1992
Inquery
Cheshire II (1994)
MG (1995?)
Lemur (2000?)
IS 240 – Spring 2011
Historical Milestones in IR Research
1958 Statistical Language Properties (Luhn)
1960 Probabilistic Indexing (Maron & Kuhns)
1961 Term association and clustering (Doyle)
1965 Vector Space Model (Salton)
1968 Query expansion (Rocchio, Salton)
1972 Statistical Weighting (Sparck Jones)
1975 2-Poisson Model (Harter, Bookstein, Swanson)
1976 Relevance Weighting (Robertson, Sparck Jones)
1980 Fuzzy sets (Bookstein)
1981 Probability without training (Croft)
Historical Milestones in IR Research (cont.)
1983 Linear Regression (Fox)
1983 Probabilistic Dependence (Salton, Yu)
1985 Generalized Vector Space Model (Wong, Raghavan)
1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et
al.)
1990 Latent Semantic Indexing (Dumais,
Deerwester)
1991 Polynomial & Logistic Regression (Cooper,
Gey, Fuhr)
1992 TREC (Harman)
1992 Inference networks (Turtle, Croft)
1994 Neural networks (Kwok)
1998 Language Models (Ponte, Croft)
Information Retrieval – Historical View
Research
Boolean model, statistics of language (1950’s)
Vector space model, probabilistic indexing, relevance feedback (1960’s)
Probabilistic querying (1970’s)
Fuzzy set/logic, evidential reasoning (1980’s)
Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)
Industry
DIALOG, Lexis-Nexis, STAIRS (Boolean based)
Information industry (O($B))
Verity TOPIC (fuzzy logic)
Internet search engines (O($100B?)) (vector space, probabilistic)
Research Systems Software
INQUERY (Croft)
OKAPI (Robertson)
PRISE (Harman)
SMART (Buckley)
MG (Witten, Moffat)
CHESHIRE (Larson)
http://potomac.ncsl.nist.gov/prise
http://cheshire.berkeley.edu
LEMUR toolkit
Lucene
Others
Lemur Toolkit Project
Some slides from
Don Metzler, Paul Ogilvie & Trevor Strohman
Zoology 101
Lemurs are primates
found only in
Madagascar
50 species (17 are
endangered)
Ring-tailed lemurs (Lemur catta)
Zoology 101
The indri is the largest
type of lemur
When first spotted the
natives yelled “Indri!
Indri!”
Malagasy for
"Look! Over there!"
About The Lemur Project
The Lemur Project was started in 2000 by the Center for
Intelligent Information Retrieval (CIIR) at the University
of Massachusetts, Amherst, and the Language
Technologies Institute (LTI) at Carnegie Mellon
University. Over the years, a large number of UMass and
CMU students and staff have contributed to the project.
The project's first product was the Lemur Toolkit, a
collection of software tools and search engines designed
to support research on using statistical language models
for information retrieval tasks. Later the project added the
Indri search engine for large-scale search, the Lemur
Query Log Toolbar for capture of user interaction data,
and the ClueWeb09 dataset for research on web search.
Installation
Linux, OS X:
Extract software/lemur-4.12.tar.gz
./configure --prefix=/install/path
make
make install
Windows
Run software/lemur-4.12-install.exe
Documentation in windoc/index.html
Installation
Use Lemur-4.12 instead~
A Java runtime (JDK 6) is needed for the evaluation tool.
Environment variable: PATH
Linux: modify ~/.bash_profile
Windows: My Computer/Properties…
Indexing
Document Preparation
Indexing Parameters
Time and Space Requirements
Two Index Formats
KeyFile:
Term Positions
Metadata
Offline Incremental
InQuery Query Language
Indri:
Term Positions
Metadata
Fields / Annotations
Online Incremental
InQuery and Indri Query Languages
Indexing – Document Preparation
Document Formats:
The Lemur Toolkit can inherently deal with several
different document format types without any
modification:
HTML
TREC Text
XML
TREC Web
PDF
Plain Text
Mbox
Microsoft Word(*)
Microsoft PowerPoint(*)
(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a
Windows-based machine, and Office must be installed.
Indexing – Document Preparation
If your documents are not in a format that the
Lemur Toolkit can inherently process:
1. If necessary, extract the text from the document.
2. Wrap the plaintext in TREC-style wrappers:
<DOC>
<DOCNO>document_id</DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
– or –
For more advanced users, write your own parser to extend the Lemur
Toolkit.
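The wrapping step is easy to script. The sketch below is one possible way to do it in Python, assuming the extracted text sits in .txt files in a single directory and that each file name (without extension) is an acceptable document ID; the function names and paths are made up for illustration.

```python
from pathlib import Path

def wrap_trec(doc_id, text):
    """Wrap plain text in the TREC-style <DOC> wrapper shown above."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

def wrap_directory(src_dir, out_file):
    """Concatenate every .txt file in src_dir into one TREC Text file."""
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            # Assumption: the file stem is used as the document ID.
            out.write(wrap_trec(path.stem, path.read_text(encoding="utf-8")))

print(wrap_trec("document_id", "Index this document text."))
```

Calling wrap_directory("plain_docs", "corpus.trectext") (both names hypothetical) would produce a single file that can be indexed with the trectext class.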
Indexing - Parameters
Basic usage to build index:
IndriBuildIndex <parameter_file>
Parameter file includes options for
Where to find your data files
Where to place the index
How much memory to use
Stopword, stemming, fields
Many other parameters.
Indexing – Parameters
The standard parameter file is an XML
document:
<parameters>
<option></option>
<option></option>
…
<option></option>
</parameters>
Indexing – Parameters
where to find your source files and what type to expect
BuildIndex
<dataFiles>
name of file containing list of datafiles to index.
IndriBuildIndex
<corpus>
<path>: (required) the path to the source files (absolute or
relative)
<class>: (optional) the document type to expect. If omitted,
IndriBuildIndex will attempt to guess at the filetype based on
the file’s extension.
<parameters>
<corpus>
<path>/path/to/source/files</path>
<class>trectext</class>
</corpus>
</parameters>
Indexing - Parameters
The <index> parameter tells IndriBuildIndex where to
create or incrementally add to the index
If index does not exist, it will create a new one
If index already exists, it will append new documents
into the index.
<parameters>
<index>/path/to/the/index</index>
</parameters>
Indexing - Parameters
<memory> - used to define a “soft-limit” of the
amount of memory the indexer should use before
flushing its buffers to disk.
Use K for kilobytes, M for megabytes, and G for
gigabytes.
<parameters>
<memory>256M</memory>
</parameters>
Indexing - Parameters
Stopwords defined within
<stopwords>filename</stopwords>
IndriBuildIndex
Stopwords can be defined within a <stopper> block,
with each stopword enclosed in <word>
tags.
<parameters>
<stopper>
<word>first_word</word>
<word>next_word</word>
…
<word>final_word</word>
</stopper>
</parameters>
Indexing – Parameters
Term stemming can be used while indexing as
well via the <stemmer> tag.
Specify the stemmer type via the <name> tag within.
Stemmers included with the Lemur Toolkit include the
Krovetz Stemmer and the Porter Stemmer.
<parameters>
<stemmer>
<name>krovetz</name>
</stemmer>
</parameters>
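The individual options above can be combined into one parameter file. The following Python sketch generates such a file with the standard library; it only uses the tags shown on the preceding slides. The function name, the paths, and the stopword list are placeholders, and whether a particular Lemur/Indri version accepts every combination should be checked against its documentation.

```python
import xml.etree.ElementTree as ET

def build_index_parameters(corpus_path, index_path, doc_class="trectext",
                           memory="256M", stopwords=(), stemmer="krovetz"):
    """Return an IndriBuildIndex parameter file as an XML string."""
    params = ET.Element("parameters")
    corpus = ET.SubElement(params, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = doc_class
    ET.SubElement(params, "index").text = index_path
    ET.SubElement(params, "memory").text = memory
    if stopwords:
        stopper = ET.SubElement(params, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word
    if stemmer:
        ET.SubElement(ET.SubElement(params, "stemmer"), "name").text = stemmer
    return ET.tostring(params, encoding="unicode")

# Placeholder paths and stopwords, for illustration only.
xml_text = build_index_parameters(
    "/path/to/source/files", "/path/to/the/index", stopwords=["a", "the", "of"]
)
print(xml_text)
```

Saving the printed text to a file (for example build_index.xml, a hypothetical name) and running IndriBuildIndex build_index.xml would then follow the basic usage described earlier.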
Retrieval
Parameters
Query Formatting
Interpreting Results
Retrieval - Parameters
Basic usage for retrieval:
IndriRunQuery/RetEval <parameter_file>
Parameter file includes options for
Where to find the index
The query or queries
How much memory to use
Formatting options
Many other parameters.
Retrieval - Parameters
Just as with indexing:
A well-formed XML document with options, wrapped
by <parameters> tags:
<parameters>
<options></options>
<options></options>
…
<options></options>
</parameters>
Retrieval - Parameters
The <index> parameter tells
IndriRunQuery/RetEval where to find the
repository.
<parameters>
<index>/path/to/the/index</index>
</parameters>
Retrieval - Parameters
The <query> parameter specifies a query,
either in plain text or using the
Indri query language
<parameters>
<query>
<number>1</number>
<text>this is the first query</text>
</query>
<query>
<number>2</number>
<text>another query to run</text>
</query>
</parameters>
Query file format
<DOC>
<DOCNO> 1 </DOCNO>
What articles exist which
deal with TSS (Time
Sharing System),
an operating system for IBM
computers?
</DOC>
<DOC>
<DOCNO> 2 </DOCNO>
I am interested in articles
written either by Prieve or
Udo Pooch
Prieve, B.
Pooch, U.
</DOC>
Retrieval – Query Formatting
TREC-style topics cannot be processed
directly by IndriRunQuery/RetEval.
Format the queries accordingly:
Format by hand
Write a script to extract the fields (lovely Python~)
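Such a script might look like the following Python sketch. It assumes topics in the usual TREC layout (a <top> block with <num> Number: and <title> lines), keeps only the title field, and emits the <query> parameter format shown earlier. Punctuation is replaced by spaces as a precaution, on the assumption that stray punctuation could be misread as query-language syntax. The function name and the sample topic text are made up for illustration.

```python
import re

TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
NUM_RE = re.compile(r"<num>\s*(?:Number:)?\s*(\d+)")
# Title text runs until the next tag (or end of the topic).
TITLE_RE = re.compile(r"<title>\s*(?:Topic:)?\s*(.*?)\s*(?=<|\Z)", re.DOTALL)

def topics_to_indri(topics_text):
    """Convert TREC-style topics into an IndriRunQuery <parameters> block."""
    lines = ["<parameters>"]
    for body in TOPIC_RE.findall(topics_text):
        num = NUM_RE.search(body)
        title = TITLE_RE.search(body)
        if not (num and title):
            continue  # skip topics missing a number or title
        # Keep letters, digits and spaces only, then collapse whitespace.
        words = re.sub(r"[^A-Za-z0-9 ]+", " ", title.group(1)).split()
        lines.append("<query>")
        lines.append(f"<number>{num.group(1)}</number>")
        lines.append(f"<text>{' '.join(words)}</text>")
        lines.append("</query>")
    lines.append("</parameters>")
    return "\n".join(lines)

# Made-up topic used only to exercise the script.
sample_topics = """<top>
<num> Number: 301
<title> Time-sharing systems

<desc> Description:
Find articles about time-sharing operating systems.
</top>
"""
print(topics_to_indri(sample_topics))
```

Using the description or narrative field instead of the title is a matter of changing the regular expression; which field works best is an experimental question.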
Retrieval – Parameters
To specify a maximum number of results to
return, use the <count> tag:
<parameters>
<count>50</count>
</parameters>
Retrieval - Parameters
Result formatting options:
IndriRunQuery/RetEval has built-in formatting
specifications for TREC and INEX retrieval tasks
Retrieval – Parameters
TREC – Formatting directives:
<runID>: a string specifying the id for a query run,
used in TREC scorable output.
<trecFormat>: true to produce TREC scorable
output, otherwise use false (default).
<parameters>
<runID>runName</runID>
<trecFormat>true</trecFormat>
</parameters>
Outputting INEX Result Format
Must be wrapped in <inex> tags
<participant-id>: specifies the participant-id attribute used
in submissions.
<task>: specifies the task attribute (default CO.Thorough).
<query>: specifies the query attribute (default automatic).
<topic-part>: specifies the topic-part attribute (default T).
<description>: specifies the contents of the description tag.
<parameters>
<inex>
<participant-id>LEMUR001</participant-id>
</inex>
</parameters>
Retrieval - Evaluation
To use trec_eval:
format IndriRunQuery results with appropriate
trec_eval formatting directives in the parameter file:
<runID>runName</runID>
<trecFormat>true</trecFormat>
Resulting output will be in standard TREC format
ready for evaluation:
<queryID> Q0 <DocID> <rank> <score> <runID>
150 Q0 AP890101-0001 1 -4.83646 runName
150 Q0 AP890101-0015 2 -7.06236 runName
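For your own analysis it can help to load this output into a script. The sketch below parses the six-column format shown above into per-query result lists; it is an illustration only, and the Result tuple and function name are made up.

```python
from collections import defaultdict, namedtuple

Result = namedtuple("Result", "query_id doc_id rank score run_id")

def parse_trec_run(lines):
    """Group TREC-format result lines by query ID."""
    runs = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # ignore blank or malformed lines
        query_id, _q0, doc_id, rank, score, run_id = parts
        runs[query_id].append(Result(query_id, doc_id, int(rank), float(score), run_id))
    return runs

sample_run = [
    "150 Q0 AP890101-0001 1 -4.83646 runName",
    "150 Q0 AP890101-0015 2 -7.06236 runName",
]
runs = parse_trec_run(sample_run)
for result in runs["150"]:
    print(result.rank, result.doc_id, result.score)
```

The scores in this sample are negative, which would be expected if they are log probabilities; only their relative order matters for ranking.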
Use RetEval for TF.IDF
First run ParseToFile to convert the document-formatted
queries into a parsed query file
<parameters>
<docFormat>web</docFormat>
<outputFile>filename</outputFile>
<stemmer>stemmername</stemmer>
<stopwords>stopwordfile</stopwords>
</parameters>
ParseToFile paramfile queryfile
http://www.lemurproject.org/lemur/parsing.html#parsetofile
Use RetEval for TF.IDF
Then run RetEval
retModel values: 0 for TF-IDF, 1 for Okapi, 2 for KL-divergence, 5 for cosine similarity
<parameters>
<index>index</index>
<retModel>0</retModel>
<textQuery>query filename</textQuery>
<resultCount>1000</resultCount>
<resultFile>tfidf.res</resultFile>
</parameters>
RetEval paramfile
http://www.lemurproject.org/lemur/retrieval.html#RetEval
Evaluate Results
TREC qrels
Ground truth: judged by human assessors.
ireval tool
java -jar "D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar" result qrels > pr.result
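To see what such an evaluation computes, here is a minimal Python sketch of average precision for one query. It is not the ireval or trec_eval implementation. It assumes qrels lines in the usual four-column TREC layout (query ID, iteration, document ID, relevance), and the document IDs and judgments below are invented for illustration.

```python
def load_qrels(lines):
    """Map each query ID to the set of documents judged relevant."""
    relevant = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # ignore blank or malformed lines
        query_id, _iteration, doc_id, rel = parts
        if int(rel) > 0:
            relevant.setdefault(query_id, set()).add(doc_id)
    return relevant

def average_precision(ranked_doc_ids, relevant_docs):
    """Mean of precision values at the ranks where relevant documents appear."""
    if not relevant_docs:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_docs:
            hits += 1
            total += hits / rank
    return total / len(relevant_docs)

# Invented judgments and ranking for a hypothetical query 150.
qrels = load_qrels(["150 0 docA 1", "150 0 docB 0", "150 0 docC 1"])
ranking = ["docA", "docB", "docC"]
ap = average_precision(ranking, qrels["150"])
print(round(ap, 4))
```

Averaging this value over all queries gives mean average precision, a common single-number summary when comparing retrieval models or parameter settings.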
More Stories about Indri
lemur_sigir_2006
Paul Ogilvie
Trevor Strohman
Thank You!
Q&A