
Lemur Toolkit
http://net.pku.edu.cn/~wbia
彭波 (Peng Bo)
[email protected]
School of Information Science and Technology, Peking University
3/21/2011
Recap

Information Retrieval Models
- Vector Space Model
- Probabilistic models
- Language model
Some formulas for Sim (VSM)

Dot product:

    Sim(D, Q) = D \cdot Q = \sum_i a_i b_i

Cosine:

    Sim(D, Q) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}

Dice:

    Sim(D, Q) = \frac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}

Jaccard:

    Sim(D, Q) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}

[Figure: query Q and document D drawn as vectors in term space (axes t1, t2, t3), with angle θ between them.]
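To make the four measures concrete, here is a minimal Python sketch (not from the original slides) that computes all of them for a pair of term-weight vectors; the example weights are invented for illustration.

    import math

    def similarities(a, b):
        """Compute the four VSM similarity measures between weight
        vectors a and b, aligned by term index i."""
        dot = sum(ai * bi for ai, bi in zip(a, b))
        na2 = sum(ai * ai for ai in a)  # sum_i a_i^2
        nb2 = sum(bi * bi for bi in b)  # sum_i b_i^2
        return {
            "dot":     dot,
            "cosine":  dot / (math.sqrt(na2) * math.sqrt(nb2)),
            "dice":    2 * dot / (na2 + nb2),
            "jaccard": dot / (na2 + nb2 - dot),
        }

    # Invented weights over terms t1, t2, t3:
    D = [0.5, 0.8, 0.0]
    Q = [0.7, 0.2, 0.1]
    print(similarities(D, Q))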
BM25 (Okapi system) – Robertson et al.

Considers tf, qtf, and document length:

    Score(D, Q) = \sum_{t_i \in Q} c_i \frac{(k_1 + 1) tf_i}{K + tf_i} \cdot \frac{(k_3 + 1) qtf_i}{k_3 + qtf_i} + k_2 |Q| \frac{avdl - dl}{avdl + dl}

    K = k_1 \left( (1 - b) + b \frac{dl}{avdl} \right)

The two fractions are the TF factors; K provides the document length normalization.

- k1, k2, k3, b: parameters
- qtf: query term frequency
- dl: document length
- avdl: average document length
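A small Python sketch of this scoring function (an illustration, not Lemur's implementation): the term weight c_i is assumed here to be a Robertson/Sparck Jones style idf, and the parameter defaults are common choices rather than values prescribed by the slide.

    import math

    def bm25_score(query_tf, doc_tf, dl, avdl, N, df,
                   k1=1.2, k2=0.0, k3=8.0, b=0.75):
        """Okapi BM25 score of one document D for a query Q.
        query_tf: term -> qtf; doc_tf: term -> tf in D;
        N: collection size; df: term -> document frequency."""
        K = k1 * ((1 - b) + b * dl / avdl)
        score = 0.0
        for t, qtf in query_tf.items():
            tf = doc_tf.get(t, 0)
            if tf == 0:
                continue
            c = math.log((N - df[t] + 0.5) / (df[t] + 0.5))  # assumed idf-style c_i
            score += c * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
        # Query-length correction; |Q| taken as the number of query terms.
        # The whole term vanishes with the common default k2 = 0.
        score += k2 * len(query_tf) * (avdl - dl) / (avdl + dl)
        return score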
Standard Probabilistic IR

[Diagram: an information need is expressed as a query, which is matched against each document d1, d2, …, dn in the document collection by estimating P(R | Q, d).]
IR based on Language Model (LM)

[Diagram: an information need is expressed as a query; each document di in the collection has its own language model Mdi, and documents are ranked by the generation probability P(Q | Md).]

A query generation process:
- For an information need, imagine an ideal document
- Imagine what words could appear in that document
- Formulate a query using those words
Language Modeling for IR

Estimate a multinomial probability distribution from the text, and smooth it with a distribution estimated from the entire collection:

    P(w|D) = (1 - \lambda) P_{ML}(w|D) + \lambda P(w|C)

Query Likelihood:

    P(Q|D) = \prod_{q \in Q} P(q|D)

Estimate the probability that the document generated the query terms.

Kullback-Leibler Divergence:

    KL(Q \| D) = \sum_w P(w|Q) \log \frac{P(w|Q)}{P(w|D)}

Estimate models for both the document and the query, and compare them.
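A minimal Python sketch of query-likelihood scoring with this linear (Jelinek-Mercer) smoothing; the λ value is an arbitrary illustration, not a Lemur default.

    import math
    from collections import Counter

    def query_log_likelihood(query_terms, doc_tokens, coll_tf, coll_len, lam=0.4):
        """log P(Q|D) with Jelinek-Mercer smoothing:
        P(w|D) = (1 - lam) * tf(w,D)/|D| + lam * cf(w)/|C|.
        Assumes every query term occurs somewhere in the collection,
        so the smoothed probability is never zero."""
        tf = Counter(doc_tokens)
        dlen = len(doc_tokens)
        logp = 0.0
        for w in query_terms:
            p_doc = tf[w] / dlen if dlen else 0.0
            p_coll = coll_tf.get(w, 0) / coll_len
            logp += math.log((1 - lam) * p_doc + lam * p_coll)
        return logp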
Question

- Among the three classic information retrieval models, which one is your best choice when designing a retrieval system?
- How can you tune the model parameters to achieve optimal performance?
- When you have a new idea on a retrieval problem, how can you prove it?
A Brief History of IR
Slides from Prof. Ray Larson
University of California, Berkeley
School of Information
http://courses.sims.berkeley.edu/i240/s11/
Experimental IR systems

- Probabilistic indexing – Maron and Kuhns, 1960
- SMART – Gerard Salton at Cornell – vector space model, 1970's
- SIRE at Syracuse
- I3R – Croft
- Cheshire I (1990)
- TREC – 1992
- Inquery
- Cheshire II (1994)
- MG (1995?)
- Lemur (2000?)
Historical Milestones in IR Research

- 1958 Statistical Language Properties (Luhn)
- 1960 Probabilistic Indexing (Maron & Kuhns)
- 1961 Term association and clustering (Doyle)
- 1965 Vector Space Model (Salton)
- 1968 Query expansion (Rocchio, Salton)
- 1972 Statistical Weighting (Sparck Jones)
- 1975 2-Poisson Model (Harter, Bookstein, Swanson)
- 1976 Relevance Weighting (Robertson, Sparck Jones)
- 1980 Fuzzy sets (Bookstein)
- 1981 Probability without training (Croft)
Historical Milestones in IR Research (cont.)

- 1983 Linear Regression (Fox)
- 1983 Probabilistic Dependence (Salton, Yu)
- 1985 Generalized Vector Space Model (Wong, Raghavan)
- 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
- 1990 Latent Semantic Indexing (Dumais, Deerwester)
- 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
- 1992 TREC (Harman)
- 1992 Inference networks (Turtle, Croft)
- 1994 Neural networks (Kwok)
- 1998 Language Models (Ponte, Croft)
Information Retrieval – Historical View

Research:
- Boolean model, statistics of language (1950's)
- Vector space model, probabilistic indexing, relevance feedback (1960's)
- Probabilistic querying (1970's)
- Fuzzy set/logic, evidential reasoning (1980's)
- Regression, neural nets, inference networks, latent semantic indexing, TREC (1990's)

Industry:
- DIALOG, Lexis-Nexis, STAIRS (Boolean based)
- Information industry (O($B))
- Verity TOPIC (fuzzy logic)
- Internet search engines (O($100B?)) (vector space, probabilistic)
Research Systems Software

- INQUERY (Croft)
- OKAPI (Robertson)
- PRISE (Harman) – http://potomac.ncsl.nist.gov/prise
- SMART (Buckley)
- MG (Witten, Moffat)
- CHESHIRE (Larson) – http://cheshire.berkeley.edu
- LEMUR toolkit
- Lucene
- Others
Lemur Toolkit Project
Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman
Zoology 101

- Lemurs are primates found only in Madagascar
- 50 species (17 are endangered)
- Ring-tailed lemur: Lemur catta
Zoology 101

- The indri is the largest type of lemur
- When first spotted, the natives yelled "Indri! Indri!" – Malagasy for "Look! Over there!"
About The Lemur Project

The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project. The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search.
Installation

Linux, OS X:

    Extract software/lemur-4.12.tar.gz
    ./configure --prefix=/install/path
    make
    make install

Windows:

    Run software/lemur-4.12-install.exe
    Documentation in windoc/index.html
Installation

- Use Lemur 4.12.
- A Java runtime (JDK 6) is needed for the evaluation tool.
- Add the install location to the PATH environment variable:
  - Linux: modify ~/.bash_profile
  - Windows: My Computer / Properties…
Indexing

- Document Preparation
- Indexing Parameters
- Time and Space Requirements

Two Index Formats

KeyFile:
- Term Positions
- Metadata
- Offline Incremental
- InQuery Query Language

Indri:
- Term Positions
- Metadata
- Fields / Annotations
- Online Incremental
- InQuery and Indri Query Languages
Indexing – Document Preparation

Document Formats:

The Lemur Toolkit can inherently deal with several different document format types without any modification:
- TREC Text
- TREC Web
- Plain Text
- Microsoft Word (*)
- Microsoft PowerPoint (*)
- HTML
- XML
- PDF
- Mbox

(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.
Indexing – Document Preparation

If your documents are not in a format that the Lemur Toolkit can inherently process:

1. If necessary, extract the text from the document.
2. Wrap the plain text in TREC-style wrappers (see the sketch after this slide):

<DOC>
<DOCNO>document_id</DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>

– or –

For more advanced users, write your own parser to extend the Lemur Toolkit.
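For instance, a minimal Python sketch (not from the slides) that wraps every plain-text file in a directory in these TREC-style tags; using the filename as the DOCNO is an assumption for illustration.

    import os, sys

    def wrap_trec(src_dir, out_path):
        """Wrap each plain-text file in src_dir as a TREC-style <DOC>,
        using the filename (without extension) as the DOCNO.
        Assumes src_dir contains only text files."""
        with open(out_path, "w", encoding="utf-8") as out:
            for name in sorted(os.listdir(src_dir)):
                doc_id = os.path.splitext(name)[0]
                with open(os.path.join(src_dir, name), encoding="utf-8") as f:
                    text = f.read()
                out.write("<DOC>\n<DOCNO>%s</DOCNO>\n<TEXT>\n%s\n</TEXT>\n</DOC>\n"
                          % (doc_id, text))

    if __name__ == "__main__":
        wrap_trec(sys.argv[1], sys.argv[2])  # e.g. wrap_trec("docs/", "corpus.trectext")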
Indexing – Parameters

Basic usage to build an index:

    IndriBuildIndex <parameter_file>

The parameter file includes options for:
- Where to find your data files
- Where to place the index
- How much memory to use
- Stopwords, stemming, fields
- Many other parameters
Indexing – Parameters

The standard parameter file specification is an XML document:

<parameters>
<option></option>
<option></option>
…
<option></option>
</parameters>
Indexing – Parameters

Where to find your source files and what type to expect:

BuildIndex:
- <dataFiles>: name of a file containing the list of data files to index.

IndriBuildIndex:
- <corpus>
  - <path>: (required) the path to the source files (absolute or relative)
  - <class>: (optional) the document type to expect. If omitted, IndriBuildIndex will attempt to guess the filetype based on the file's extension.

<parameters>
<corpus>
<path>/path/to/source/files</path>
<class>trectext</class>
</corpus>
</parameters>

Indexing – Parameters

The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index:
- If the index does not exist, a new one will be created.
- If the index already exists, new documents will be appended into it.

<parameters>
<index>/path/to/the/index</index>
</parameters>
Indexing – Parameters

<memory> – defines a "soft limit" on the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for gigabytes.

<parameters>
<memory>256M</memory>
</parameters>
Indexing – Parameters

- Stopwords can be defined in a file given within <stopwords>filename</stopwords>.
- IndriBuildIndex: stopwords can also be defined within a <stopper> block, with the individual stopwords enclosed in <word> tags:

<parameters>
<stopper>
<word>first_word</word>
<word>next_word</word>
…
<word>final_word</word>
</stopper>
</parameters>
Indexing – Parameters

Term stemming can be applied while indexing as well, via the <stemmer> tag:
- Specify the stemmer type via the <name> tag within.
- Stemmers included with the Lemur Toolkit are the Krovetz stemmer and the Porter stemmer.

<parameters>
<stemmer>
<name>krovetz</name>
</stemmer>
</parameters>
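Putting the indexing options above together, here is a sketch of a complete IndriBuildIndex parameter file, written out by a short Python helper; the paths and file names are placeholders, and the option set is just the subset covered on the preceding slides.

    # Write an IndriBuildIndex parameter file combining the options above.
    # All paths are placeholders; adjust for your own corpus and index.
    PARAMS = """<parameters>
    <corpus>
    <path>/path/to/source/files</path>
    <class>trectext</class>
    </corpus>
    <index>/path/to/the/index</index>
    <memory>256M</memory>
    <stemmer>
    <name>krovetz</name>
    </stemmer>
    </parameters>
    """

    with open("build_index.param", "w") as f:
        f.write(PARAMS)
    # Then run:  IndriBuildIndex build_index.param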
Retrieval

- Parameters
- Query Formatting
- Interpreting Results
Retrieval – Parameters

Basic usage for retrieval:

    IndriRunQuery/RetEval <parameter_file>

The parameter file includes options for:
- Where to find the index
- The query or queries
- How much memory to use
- Formatting options
- Many other parameters
Retrieval – Parameters

Just as with indexing, a well-formed XML document with options, wrapped in <parameters> tags:

<parameters>
<options></options>
<options></options>
…
<options></options>
</parameters>
Retrieval – Parameters

The <index> parameter tells IndriRunQuery/RetEval where to find the repository:

<parameters>
<index>/path/to/the/index</index>
</parameters>
Retrieval – Parameters

The <query> parameter specifies a query, as plain text or in the Indri query language:

<parameters>
<query>
<number>1</number>
<text>this is the first query</text>
</query>
<query>
<number>2</number>
<text>another query to run</text>
</query>
</parameters>

Query file format

<DOC>
<DOCNO> 1 </DOCNO>
What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?
</DOC>
<DOC>
<DOCNO> 2 </DOCNO>
I am interested in articles written either by Prieve or Udo Pooch
Prieve, B.
Pooch, U.
</DOC>
Retrieval – Query Formatting

TREC-style topics cannot be processed directly by IndriRunQuery/RetEval. Format the queries accordingly:
- Format by hand
- Write a script to extract the fields (lovely Python!) – see the sketch below
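As an example of such a script, here is a minimal Python sketch (an illustration, not part of the toolkit) that pulls the <num> and <title> fields out of a TREC-style topic file and emits the <query> parameter format shown earlier; the regular expressions assume the classic one-tag-per-field topic layout.

    import re, sys

    def topics_to_params(topic_path, param_path):
        """Extract <num>/<title> fields from a TREC topic file and write
        them as Indri <query> entries inside a <parameters> file."""
        text = open(topic_path, encoding="utf-8").read()
        nums = re.findall(r"<num>\s*(?:Number:)?\s*(\d+)", text)
        titles = re.findall(r"<title>\s*(.*?)\s*(?=<)", text, re.S)
        with open(param_path, "w", encoding="utf-8") as out:
            out.write("<parameters>\n")
            for num, title in zip(nums, titles):
                out.write("<query>\n<number>%s</number>\n<text>%s</text>\n</query>\n"
                          % (num, " ".join(title.split())))
            out.write("</parameters>\n")

    if __name__ == "__main__":
        topics_to_params(sys.argv[1], sys.argv[2])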
Retrieval – Parameters

To specify the maximum number of results to return, use the <count> tag:

<parameters>
<count>50</count>
</parameters>
Retrieval – Parameters

Result formatting options: IndriRunQuery/RetEval has built-in formatting specifications for TREC and INEX retrieval tasks.
Retrieval – Parameters

TREC formatting directives:
- <runID>: a string specifying the id for a query run, used in TREC scorable output.
- <trecFormat>: true to produce TREC scorable output, otherwise false (default).

<parameters>
<runID>runName</runID>
<trecFormat>true</trecFormat>
</parameters>
Outputting INEX Result Format

Must be wrapped in <inex> tags:
- <participant-id>: specifies the participant-id attribute used in submissions.
- <task>: specifies the task attribute (default: CO.Thorough).
- <query>: specifies the query attribute (default: automatic).
- <topic-part>: specifies the topic-part attribute (default: T).
- <description>: specifies the contents of the description tag.

<parameters>
<inex>
<participant-id>LEMUR001</participant-id>
</inex>
</parameters>
Retrieval – Evaluation

To use trec_eval, format the IndriRunQuery results with the appropriate trec_eval formatting directives in the parameter file:

<runID>runName</runID>
<trecFormat>true</trecFormat>

The resulting output will be in standard TREC format, ready for evaluation:

<queryID> Q0 <DocID> <rank> <score> <runID>
150 Q0 AP890101-0001 1 -4.83646 runName
150 Q0 AP890101-0015 2 -7.06236 runName
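If you need to post-process a run, the six-column format above is easy to read back; a small Python sketch (illustrative, not part of the toolkit):

    def read_trec_run(path):
        """Parse '<queryID> Q0 <DocID> <rank> <score> <runID>' lines
        into a dict: query id -> list of (doc id, rank, score)."""
        runs = {}
        with open(path) as f:
            for line in f:
                qid, _q0, docid, rank, score, _run = line.split()
                runs.setdefault(qid, []).append((docid, int(rank), float(score)))
        return runs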
Use RetEval for TF.IDF

First run ParseToFile to convert doc-formatted queries into queries:

<parameters>
<docFormat>web</docFormat>
<outputFile>filename</outputFile>
<stemmer>stemmername</stemmer>
<stopwords>stopwordfile</stopwords>
</parameters>

    ParseToFile paramfile queryfile

http://www.lemurproject.org/lemur/parsing.html#parsetofile
Use RetEval for TF.IDF

Then run RetEval:

<parameters>
<index>index</index>
<retModel>0</retModel>
<textQuery>query filename</textQuery>
<resultCount>1000</resultCount>
<resultFile>tfidf.res</resultFile>
</parameters>

(<retModel>: 0 for TF-IDF, 1 for Okapi, 2 for KL-divergence, 5 for cosine similarity)

    RetEval paramfile

http://www.lemurproject.org/lemur/retrieval.html#RetEval
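To chain the two steps, a tiny Python driver is sketched below; the parameter and topic file names are hypothetical, and the Lemur binaries are assumed to be on PATH.

    import subprocess

    # Hypothetical file names; ParseToFile and RetEval are assumed to be on PATH.
    subprocess.run(["ParseToFile", "parse.param", "topics.txt"], check=True)
    subprocess.run(["RetEval", "reteval.param"], check=True)
    # tfidf.res now holds up to 1000 results per query (see <resultFile> above).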
Evaluate Results

TREC qrels:
- Ground truth: judged by human assessors.

IREval tool:

    java -jar "D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar" result qrels > pr.result
More Stories about Indri

- lemur_sigir_2006 (Paul Ogilvie & Trevor Strohman)

Thank You!
Q&A