Transcript Lecture 7

Introduction to Lucene
Rong Jin
What is Lucene?

Lucene is a high-performance, scalable Information Retrieval (IR) library.

- Free, open-source project implemented in Java
- Originally written by Doug Cutting
- Became a project in the Apache Software Foundation in 2001
- The most popular free Java IR library
- Lucene has been ported to Perl, Python, Ruby, C/C++, and C# (.NET)
Lucene Users

- IBM OmniFind Yahoo! Edition
- Technorati
- Wikipedia
- Internet Archive
- LinkedIn
- Eclipse
- JIRA
- Apache Roller
- jGuru
- More than 200 others
The Lucene Family

- Lucene (Apache Lucene, Java Lucene): the IR library itself
- Nutch: Hadoop-loving crawler, indexer, and searcher for web-scale search engines
- Solr: search server
- Droids: standalone framework for writing crawlers
- Lucene.Net: C# port, incubator graduate
- Lucy: C implementation of Lucene
- PyLucene: Python port
- Tika: content analysis toolkit
Indexing Documents

[Diagram: a Document with multiple Fields flows through an Analyzer (Tokenizer and TokenFilter, consulting a Dictionary) into an IndexWriter, which produces the inverted index.]

- Each document is comprised of multiple fields.
- The Analyzer extracts words from the texts.
- The IndexWriter creates the inverted index and writes it to disk.
Indexing Documents
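A minimal indexing sketch in the spirit of these slides, assuming the Lucene 4.x-era API; the index path "index" and the field names "title" and "contents" are illustrative assumptions, not prescribed by the lecture.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("index"));   // index location on disk
    IndexWriterConfig config = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    IndexWriter writer = new IndexWriter(dir, config);

    Document doc = new Document();                         // a document = a set of fields
    doc.add(new TextField("title", "Introduction to Lucene", Field.Store.YES));
    doc.add(new TextField("contents",
        "Lucene is a high-performance IR library.", Field.Store.NO));
    writer.addDocument(doc);                               // analyze and index the fields

    writer.close();                                        // commit segments to disk
    dir.close();
  }
}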
Lucene Classes for Indexing

- Directory class
  - An abstract class representing the location of a Lucene index.
  - FSDirectory stores the index in a directory in the filesystem.
  - RAMDirectory holds all of its data in memory: useful for smaller indices that can be fully loaded in memory and discarded when the application terminates.
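Both concrete types are obtained the same way; a short sketch (the path is illustrative):

Directory onDisk = FSDirectory.open(new File("/path/to/index")); // persisted on disk
Directory inMemory = new RAMDirectory();                         // lives and dies with the JVM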
Lucene Classes for Indexing

- IndexWriter class
  - Creates a new index or opens an existing one, and adds, removes, or updates documents in the index.
- Analyzer class
  - An abstract class for extracting tokens from the texts to be indexed.
  - StandardAnalyzer is the most commonly used one.
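A sketch of the create-or-open behavior, assuming the Lucene 4.x IndexWriterConfig API:

IndexWriterConfig config = new IndexWriterConfig(
    Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // create new, or open existing
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), config);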
Lucene Classes for Indexing

- Document class
  - A document is a collection of fields.
  - Metadata such as author, title, subject, and date modified are indexed and stored separately as fields of the document.
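For instance, a sketch of attaching such metadata (the field names and values are illustrative assumptions):

Document doc = new Document();
doc.add(new StringField("author", "Doug Cutting", Field.Store.YES));     // stored, not tokenized
doc.add(new StringField("dateModified", "2009-03-14", Field.Store.YES));
doc.add(new TextField("title", "Introduction to Lucene", Field.Store.YES)); // tokenized and stored
doc.add(new TextField("contents", "body text ...", Field.Store.NO));        // tokenized only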
Index Segments and Merge

- Each index consists of multiple segments.
- Every segment is actually a standalone index itself, holding a subset of all indexed documents.
- At search time, each segment is visited separately and the results are combined together.

# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen
Index Segments and Merge

- Each segment consists of multiple files (see the listing above):
  - _X.<ext>: X is the segment name and <ext> is the extension that identifies which part of the index the file corresponds to.
  - Separate files hold the different parts of the index (term vectors, stored fields, inverted index, etc.).
- The optimize() operation merges all the segments into one.
  - Involves a lot of disk I/O and is time consuming.
  - Significantly improves search efficiency.
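In 4.x-era releases this operation is exposed on IndexWriter as forceMerge rather than optimize; a one-line sketch:

writer.forceMerge(1);   // merge the index down to a single segment (optimize() in older releases)
writer.close();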
Lucene Classes for Reading Index

- IndexReader class
  - Reads the index from the index files.
- Terms class
  - A container for all the terms in a specified field.
- TermsEnum class
  - Implements the BytesRefIterator interface, providing an interface for accessing each term.
Reading Document Vector

- Enable storing the term vector at indexing step:

FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);                   // keep per-document term vectors
fieldType.setIndexed(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
fieldType.setStored(true);
doc.add(new Field("contents", contentString, fieldType));

- Read the document vector and obtain each term in it:

IndexReader reader = DirectoryReader.open(              // DirectoryReader replaces
    FSDirectory.open(new File(indexPath)));             // IndexReader.open in 4.x
int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
  Terms terms = reader.getTermVector(i, "contents");    // term vector of document i
  TermsEnum termsEnum = terms.iterator(null);
  BytesRef text;
  while ((text = termsEnum.next()) != null) {
    String termText = text.utf8ToString();              // the term itself
    int docFreq = termsEnum.docFreq();                   // document frequency of the term
  }
}
Updating Documents in Index

- IndexWriter.addDocument(): add a document to the existing index
- IndexWriter.deleteDocuments(): remove documents matching a term or query from the existing index
- IndexWriter.updateDocument(): update a document in the existing index (implemented as a delete followed by an add)
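A short usage sketch (the "id" key field is an illustrative assumption):

writer.addDocument(doc);                              // add a new document
writer.deleteDocuments(new Term("id", "42"));         // delete all documents with id:42
writer.updateDocument(new Term("id", "42"), newDoc);  // atomic delete-then-add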
Other Features of Lucene Indexing

- Concurrency
  - Multiple IndexReaders may be open at once on a single index.
  - But only one IndexWriter can be open on an index at once.
  - IndexReaders may be open even while a single IndexWriter is making changes to the index; each IndexReader always shows the index as of the point in time at which it was opened.
Other Features of Lucene Indexing

- A file-based lock is used to prevent two writers from working on the same index.
  - If the file write.lock exists in your index directory, a writer currently has the index open; any attempt to create another writer on the same index will hit a LockObtainFailedException.
Search Documents
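A minimal end-to-end search sketch, assuming the Lucene 4.x-era API; the index path and the field names "contents" and "title" are illustrative assumptions.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleSearcher {
  public static void main(String[] args) throws Exception {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("index")));
    IndexSearcher searcher = new IndexSearcher(reader);

    QueryParser parser = new QueryParser(Version.LUCENE_40, "contents",
                                         new StandardAnalyzer(Version.LUCENE_40));
    Query query = parser.parse("quick brown fox");      // analyze the query text

    TopDocs top = searcher.search(query, 10);           // top-10 ranked results
    for (ScoreDoc sd : top.scoreDocs) {
      Document doc = searcher.doc(sd.doc);              // fetch stored fields by docID
      System.out.println(sd.score + "\t" + doc.get("title"));
    }
    reader.close();
  }
}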
Lucene Classes for Searching

- IndexSearcher class
  - Searches through the index.
- TopDocs class
  - A container of pointers to the top N ranked results.
  - Records the docID and score for each of the top N results (the docID can be used to retrieve the document).
Lucene Classes for Searching

- QueryParser class
  - Parses a text query into a Query object.
  - Needs an analyzer to extract tokens from the text query.
- Searching for a single term
  - Term class: similar to a Field, it is a pair of name and value.
  - Used together with the TermQuery class to create the query.
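Both paths in a short sketch (field name and query text are illustrative assumptions):

// parse free text into a Query
QueryParser parser = new QueryParser(Version.LUCENE_40, "contents",
                                     new StandardAnalyzer(Version.LUCENE_40));
Query parsed = parser.parse("information retrieval");

// or search a single term directly
Query single = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(single, 10);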
Similarity Functions in Lucene

- Many similarity functions are implemented in Lucene:
  - Okapi (BM25Similarity)
  - Language model (LMDirichletSimilarity)
- Example:

Similarity simfn = new BM25Similarity();
searcher.setSimilarity(simfn); // searcher is an IndexSearcher
Similarity Functions in Lucene

- DefaultSimilarity is the default similarity function.
- Subclassing the Similarity base class allows implementing various similarity functions.
Lucene Scoring in DefaultSimilarity

- tf: how often a term appears in the document; sqrt(freq)
- idf: how often the term appears across the index; log(numDocs / (docFreq + 1)) + 1
- coord: number of terms in both the query and the document; overlap / maxOverlap
- lengthNorm: total number of terms in the field; 1 / sqrt(numTerms)
- queryNorm: normalization factor that makes queries comparable; 1 / sqrt(sumOfSquaredWeights)
- boost(index): boost of the field at index time
- boost(query): boost of the field at query time

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
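Putting the factors together, the practical scoring function (as documented for TFIDFSimilarity at the URL above) combines them multiplicatively over the query terms:

score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot \sum_{t \in q} \Big( tf(t \in d) \cdot idf(t)^2 \cdot boost(t) \cdot norm(t,d) \Big)

where norm(t,d) folds the index-time boosts together with lengthNorm.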
Customizing Scoring

- Subclass DefaultSimilarity and override the method you want to customize, for example:
  - Ignore how commonly a term appears across the index.
  - Increase the weight of terms in the "title" field.
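A sketch of the first customization (the class name is illustrative; the idf signature assumes the 4.x TFIDFSimilarity API):

public class FlatIdfSimilarity extends DefaultSimilarity {
  @Override
  public float idf(long docFreq, long numDocs) {
    return 1.0f;   // ignore how common the term is across the index
  }
}

searcher.setSimilarity(new FlatIdfSimilarity());

The title boost, by contrast, can be applied at index time, e.g. calling setBoost(2.0f) on the "title" field before adding the document.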
Queries in Lucene

- Lucene supports many types of queries:
  - RangeQuery
  - PrefixQuery
  - WildcardQuery, BooleanQuery, PhraseQuery, ...
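Illustrative construction sketches (the field name is assumed; in the 4.x-era API, PhraseQuery and BooleanQuery are built up with add()):

Query prefix = new PrefixQuery(new Term("contents", "luc"));   // matches luc*
Query wild = new WildcardQuery(new Term("contents", "l*ene"));

PhraseQuery phrase = new PhraseQuery();                        // "information retrieval"
phrase.add(new Term("contents", "information"));
phrase.add(new Term("contents", "retrieval"));

BooleanQuery bool = new BooleanQuery();
bool.add(prefix, BooleanClause.Occur.MUST);                    // required clause
bool.add(phrase, BooleanClause.Occur.SHOULD);                  // optional, raises score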
Analyzers

- Basic analyzers
- Analyzers for different languages (in analyzers-common)
  - Chinese, Japanese, Arabic, German, Greek, ...
Analysis in Action

"The quick brown fox jumped over the lazy dogs"

WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analysis in Action

"XY&Z Corporation - xyz@example.com"

WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]

SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]

StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
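The outputs above can be reproduced with a small driver; a sketch assuming the 4.x-era API (the analyzer and field name are illustrative):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_40);
    TokenStream stream = analyzer.tokenStream("contents",
        new StringReader("The quick brown fox jumped over the lazy dogs"));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();                            // must be called before incrementToken()
    while (stream.incrementToken()) {
      System.out.print("[" + term.toString() + "] ");
    }
    stream.end();
    stream.close();
  }
}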
Analyzer: Key Structure

- Breaks text into a stream of tokens enumerated by the TokenStream class.
- Producing that TokenStream is the only required method for an Analyzer.
- The TokenStream can be reused across fields and documents, saving space allocation and garbage collection.
TokenStream Class

- Two types of TokenStream:
  - Tokenizer: a TokenStream that tokenizes the input from a Reader, i.e., chunks the input into Tokens.
  - TokenFilter: allows you to chain TokenStreams together, i.e., further modifies the Tokens, including removing them, stemming them, and other actions.
- A chain usually includes 1 Tokenizer and N TokenFilters.
TokenStream Class

- Example: StopAnalyzer

  Text -> LowerCaseTokenizer -> TokenStream -> StopFilter -> TokenStream

- Equivalently, with the lowercasing split into its own filter (a sketch of this chain follows below):

  Text -> LetterTokenizer -> TokenStream -> LowerCaseFilter -> TokenStream -> StopFilter -> TokenStream

- Order matters! StopFilter compares tokens against lowercase stop words, so lowercasing must happen before stop-word removal.
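A sketch of the second chain as a custom Analyzer, assuming the 4.x-era createComponents API (the class name is illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.util.Version;

public class MyStopAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new LetterTokenizer(Version.LUCENE_40, reader);     // chunk into tokens
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_40, source);   // lowercase first
    filter = new StopFilter(Version.LUCENE_40, filter,                     // then drop stop words
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new TokenStreamComponents(source, filter);
  }
}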