Lucene Performance
Grant Ingersoll
November 16, 2007
Atlanta, GA
Overview
• Defining Performance
• Basics
• Indexing
– Parameters
– Threading
• Search
• Document Retrieval
• Search Quality
Defining Performance
• Many factors in assessing Lucene (and search)
performance
• Speed
• Quality of results (subjective)
– Precision
• # relevant retrieved out of # retrieved
– Recall
• # relevant retrieved out of total # relevant (worked example below)
• Size of index
– Compression rate
• Other Factors:
– Local vs. distributed
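A worked illustration of the precision/recall definitions above (made-up numbers): if a query returns 10 documents and 8 of them are relevant, precision = 8/10 = 0.8; if the collection contains 20 relevant documents in total, recall = 8/20 = 0.4.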
Basics
• Consider latest version of Lucene
– Lucene 2.3/Trunk has many performance improvements over prior
versions
• Consider Solr
– Solr employs many Lucene best practices
• contrib/benchmark can help assess many aspects of
performance, including speed, precision and recall
– Task based approach makes for easy extension
• Sanity check your needs
• Profile to identify bottlenecks
Indexing Factors
• Lucene indexes Documents into memory
• On certain occasions, memory is flushed to
the index representation (called a segment)
• Segments are periodically merged
• Lucene's internal indexing model is changing and (drastically) improving performance
IndexWriter factors
• setMaxBufferedDocs controls the minimum # of docs
buffered in memory before they are flushed to a segment
– Larger == faster
– Uses more RAM
• setMergeFactor controls how often segments are
merged
– Smaller == less RAM, better for large # of updates
– Larger == faster, better for batch
• setMaxFieldLength controls the # of terms indexed
from a document
• setUseCompoundFile controls the file format Lucene
uses. Turning off compound file format is faster, but you
could run out of file descriptors
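A minimal sketch of setting these knobs on an IndexWriter, assuming the Lucene 2.x API covered in this talk; the path and parameter values are illustrative, not recommendations:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class TunedIndexing {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/path/to/index");
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(1000);   // buffer more docs before flushing: faster, more RAM
        writer.setMergeFactor(20);         // larger favors batch indexing, smaller favors updates
        writer.setMaxFieldLength(50000);   // cap on terms indexed per document
        writer.setUseCompoundFile(false);  // faster, but watch the file descriptor limit
        // ... writer.addDocument(doc) for each document ...
        writer.close();
      }
    }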
Lucene 2.3 IndexWriter
Changes
• setRAMBufferSizeMB
– New model for automagically controlling indexing
factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs and
setMergeFactor
• Takes storage and term vectors out of the merge
process
• Turn off auto-commit if there are stored fields and
term vectors
• Provides significant performance increase
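A sketch of the 2.3/trunk style described above, assuming the autoCommit constructor and setRAMBufferSizeMB are available in the release used; the path and buffer size are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class Lucene23StyleIndexing {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/path/to/index");
        // autoCommit=false: recommended when stored fields / term vectors are present
        IndexWriter writer = new IndexWriter(dir, false, new StandardAnalyzer(), true);
        // Flush based on RAM usage instead of a fixed document count
        writer.setRAMBufferSizeMB(64.0);
        // ... add documents ...
        writer.close();   // with autoCommit off, readers see the changes at close()
      }
    }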
Analysis
• An Analyzer is a Tokenizer and one or more
TokenFilters
• More complicated analysis, slower indexing
– Many applications could use simpler Analyzers than
the StandardAnalyzer
– StandardTokenizer is now faster in 2.3 (thus
making StandardAnalyzer faster)
• Reuse in 2.3:
– Re-use Token, Document and Field instances
– Use the char[] API with Token instead of String API
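To illustrate the Tokenizer-plus-TokenFilters structure, a sketch of an analyzer simpler than StandardAnalyzer, built from stock 2.x analysis classes (the stop word list is illustrative):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // A simple Analyzer: one Tokenizer followed by two TokenFilters
    public class SimpleLowercaseAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);   // split on whitespace
        stream = new LowerCaseFilter(stream);                   // normalize case
        stream = new StopFilter(stream,
            StopFilter.makeStopSet(new String[]{"the", "a", "and"}));
        return stream;
      }
    }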
Thread Safety
• Use a single IndexWriter for the duration of indexing
• Share IndexWriter between threads
• Parallel Indexing
– Index to separate Directory instances
– Merge when done with IndexWriter.addIndexes()
– Distribute and collect
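A sketch of the parallel-indexing pattern using IndexWriter.addIndexes(); the thread wiring is elided and the paths/thread count are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class ParallelIndexMerge {
      public static void main(String[] args) throws Exception {
        int numThreads = 4;                      // illustrative
        Directory[] partials = new Directory[numThreads];
        for (int i = 0; i < numThreads; i++) {
          partials[i] = new RAMDirectory();      // or one FSDirectory per thread
          // ... each thread opens its own IndexWriter on partials[i],
          //     adds its share of the documents, then closes that writer ...
        }
        // Merge all of the partial indexes into the final index
        IndexWriter merged = new IndexWriter(
            FSDirectory.getDirectory("/path/to/final"), new StandardAnalyzer(), true);
        merged.addIndexes(partials);
        merged.close();
      }
    }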
Other Indexing Factors
• NFS
– Have been some improvements lately, but…
– “proceed with caution”
– Not as good as local filesystem
• Replication
– Index locally and then use rsync to replicate
copies of index to other servers
– Have I mentioned Solr?
Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2
and trunk (2.3)
– contrib/benchmark/conf:
• indexing.alg
• indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
Benchmarking Results
              Records/Sec   Avg. T Mem
2.2                   421          39M
Trunk               2,122          52M
Trunk-mt (4)        3,680          57M
Search Performance
• Many factors influence search speed
– Query Type, size, analysis, # of occurrences,
index size, index optimization, index type
– Known Enemies
• Search Quality also has many factors
– Query formulation, synonyms, analysis, etc.
– How to judge quality?
Query Types
• Some queries in Lucene get rewritten into
simpler queries:
– WildcardQuery rewrites to a
BooleanQuery of all the terms that satisfy
the wildcards
• a* -> abe, apple, an, and, array…
– Likewise with RangeQuery, especially with
date ranges
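The expansion can be seen directly by rewriting such a query against an open reader, as in this sketch (the field name and path are illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    public class WildcardRewriteDemo {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        Query wildcard = new WildcardQuery(new Term("contents", "a*"));
        // rewrite() expands the wildcard into a BooleanQuery over every matching term;
        // a broad pattern can be slow and may hit BooleanQuery's clause limit
        Query rewritten = wildcard.rewrite(reader);
        System.out.println(rewritten);
        reader.close();
      }
    }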
Query Size
• Stopword removal can help reduce size
• Choose expansions carefully
• Consider using fewer fields to search over
• When doing relevance feedback, don't use the
whole document; instead focus on the most
important terms
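One way to pick only the most important terms for relevance feedback is contrib's MoreLikeThis; a sketch, assuming the contrib/queries module is on the classpath (field name, doc id, and limits are illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.similar.MoreLikeThis;   // from contrib/queries

    public class FeedbackQueryDemo {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        int docId = 42;                                      // illustrative document id
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[]{"contents"});
        mlt.setMaxQueryTerms(25);                            // keep only the top-weighted terms
        Query feedback = mlt.like(docId);                    // query built from the document
        System.out.println(feedback);
        reader.close();
      }
    }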
Index Factors for Search
• Size:
– more unique terms, more to search
– Stopword removal and stemming can help
reduce
– Not a linear factor due to index compression
• Type
– RAMDirectory if the index is small enough to fit in memory
– MMapDirectory may perform better
Search Speed Tips
• IndexSearcher
– Thread-safe, so share
– Open once and use as long as possible
• Cache Filters when appropriate
• Optimize if you have the time
• Warm up your Searcher first by sending
it some preliminary queries before making
it live
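A sketch combining these tips: a single shared IndexSearcher, a cached Filter, and a few warming queries. It assumes CachingWrapperFilter/QueryWrapperFilter are available in the release used; the queries, field names, and path are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    public class WarmedSearcher {
      public static void main(String[] args) throws Exception {
        // Open once; IndexSearcher is thread-safe, so share it across request threads
        IndexSearcher searcher = new IndexSearcher("/path/to/index");

        // Cache the filter so its bit set is computed once per reader, not per query
        Filter published = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("status", "published"))));

        // Warm the searcher with representative queries before putting it live
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        String[] warming = {"lucene", "index merge", "performance"};   // illustrative
        for (int i = 0; i < warming.length; i++) {
          searcher.search(parser.parse(warming[i]), published, 10);
        }
        // ... now hand the searcher to the live request threads ...
      }
    }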
Known Enemies
• CPU, Memory, I/O are all known enemies
of performance
– Can’t live without them, either!
• Profile, run benchmarks, look at garbage
collection policies, etc.
• Check your needs
– Do you need wildcards?
– Do you need so many Fields?
Document Retrieval
• Common Search Scenario:
– Many small Fields containing info about the
Document
– One or two big Fields storing content
– Run search, display small Fields to user
– User picks one result to view content
FieldSelector
• Gives developer greater control over how
the Document is loaded
– Load, Lazy, No Load, Load and Break, Size,
etc.
• In previous scenario, lazy load the large
Fields
• Easier to store original content without
performance penalty
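A sketch of the scenario above using a FieldSelector that lazy-loads the large field (field names and doc id are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.FieldSelectorResult;
    import org.apache.lucene.index.IndexReader;

    public class LazyLoadDemo {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        int docId = 0;   // illustrative

        // Load the small metadata fields eagerly; lazy-load the big "content" field
        FieldSelector selector = new FieldSelector() {
          public FieldSelectorResult accept(String fieldName) {
            return "content".equals(fieldName)
                ? FieldSelectorResult.LAZY_LOAD
                : FieldSelectorResult.LOAD;
          }
        };

        Document doc = reader.document(docId, selector);
        String title = doc.get("title");                          // already loaded
        String body = doc.getFieldable("content").stringValue();  // read from disk only now
        reader.close();
      }
    }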
Quality Queries
• Evaluating search quality is difficult and
subjective
• Lucene provides good out of the box quality
by most accounts
• Can evaluate using TREC or other
experiments, but these risk overtuning
• Unfortunately, judging quality is a labor-intensive task
Quality Experiments
• Needs:
– Standard collection of docs - easy
– Set of queries
• Query logs
• Develop in-house
• TREC, other conferences
– Set of judgments
• Labor intensive
• Can use log analysis to determine estimates of which queries
are relevant based on clicks, etc.
Query Formulation
• Invest the time in determining the proper analysis
of the fields you are searching
– Case sensitive search
– Punctuation analysis
– Strict matching
• Stopword policy
– Stopwords can be useful
• Operator choice
• Synonym choices
Effective Scoring
• Similarity class provides callback mechanism
for controlling how some Lucene scoring factors
count towards the score
– tf(), idf(), coord()
• Experiment with different length normalization
factors
– You may find Lucene is overemphasizing shorter or
longer documents
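A sketch of one such experiment: subclass DefaultSimilarity and flatten lengthNorm (the exact damping formula is purely illustrative):

    import org.apache.lucene.search.DefaultSimilarity;

    // Experiment with flatter length normalization so short documents are not over-rewarded
    public class FlatterLengthSimilarity extends DefaultSimilarity {
      public float lengthNorm(String fieldName, int numTerms) {
        // DefaultSimilarity returns 1/sqrt(numTerms); this dampens the effect further
        return (float) (1.0 / Math.sqrt(Math.sqrt(numTerms)));
      }
    }
    // Set it on both sides so index-time norms and query-time scoring agree:
    //   writer.setSimilarity(new FlatterLengthSimilarity());
    //   searcher.setSimilarity(new FlatterLengthSimilarity());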
Effective Scoring
• Can also implement your own Query class
– Ask if anyone else has done it first on java-user
mailing list
• Go beyond the obvious:
– org.apache.lucene.search.function
package provides means for using values of Fields to
change the scores
• Geographic scoring, user ratings, others
• Payloads (stay tuned for next presentation)
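A sketch of using the function package to fold a stored field value into the score; it assumes the trunk-era FieldScoreQuery and CustomScoreQuery classes, and the "rating" field name is an illustrative assumption:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.function.CustomScoreQuery;
    import org.apache.lucene.search.function.FieldScoreQuery;

    public class RatingBoostedQuery {
      public static void main(String[] args) throws Exception {
        Query text = new QueryParser("contents", new StandardAnalyzer())
            .parse("lucene performance");
        // Per-document value pulled from a numeric "rating" field (illustrative field name)
        FieldScoreQuery rating = new FieldScoreQuery("rating", FieldScoreQuery.Type.FLOAT);
        // Combines the text relevance score with the per-document field value
        Query boosted = new CustomScoreQuery(text, rating);
        // ... run boosted against an IndexSearcher as usual ...
      }
    }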
Resources
• Talk available at:
http://lucene.grantingersoll.com/apachecon07/LucenePerformance.ppt
• http://lucene.apache.org
• Mailing List
– [email protected]
• Lucene In Action
– http://www.lucenebook.com