For ITCS 6265
Professor: Wensheng Wu
Presented by TA: Xu Fei
What is Lucene
“Apache Lucene is a high-performance, full-featured
text search engine library written entirely in Java. It is a
technology suitable for nearly any application that
requires full-text search, especially cross-platform.”
a high-performance, scalable Information Retrieval (IR)
library
a project in the Apache Software Foundation
mature, free, open-source
implemented in Java.
full-text indexing and searching
“In text retrieval, full text search refers to a technique
for searching a computer-stored document or
database. In a full text search, the search engine
examines all of the words in every stored document as
it tries to match search words supplied by the user.”
“Search engine indexing collects, parses, and stores
data to facilitate fast and accurate information
retrieval.”
Lucene is popular
a number of ports or integrations to other
programming languages
C/C++, C#, Ruby, Perl, Python, PHP, etc.
1500+ installations:
HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo,
Healthline, Webmail, CNET, Lookout (acquired by
Microsoft), webshots.com (100M docs, 4M queries/day),
Siderean, Monster….
Lucene is just a hammer!
NOT a ready-to-use search application, like Google
a software library, a toolkit
a single compact JAR file (less than 1 MB!)
A number of full-featured search applications have
been built on top of Lucene.
What Lucene can do for you
add search capabilities to your application
index and make searchable any data that you can
extract text from
Lucene doesn’t care about the source of the data, its
format, or even its language, as long as you can derive
text from it.
You can even index data stored in your databases,
indirectly!
Search Application
Components for indexing
Acquire Content
Build Document
Analyze Document
Index Document
Components for searching
Search User Interface
Build Query
Search Query
Render Results
Others
Administration Interface
Analytics Interface
Scale-out
Figure 1. Typical components of search application; the shaded components show which parts Lucene handles.
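As a rough illustration of the "Analyze Document" step listed above, the following sketch turns raw text into index terms by splitting, lowercasing, and dropping stop words. It uses only the Java standard library; the class name `AnalyzeSketch`, the method `analyze`, and the tiny stop-word list are assumptions made for this example and are not Lucene's own Analyzer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the "Analyze Document" step; not Lucene's Analyzer API.
class AnalyzeSketch {
    // Assumed, deliberately tiny stop-word list (illustration only).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    // Split on anything that is not a letter or digit, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the terms that would be handed to the indexing step.
        System.out.println(analyze("The Penn State Football team"));
    }
}
```

The terms produced here are what the "Index Document" step would store; a real analyzer may also apply stemming or other filters.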
Ranking formula
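$$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \bigl( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \bigr)$$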
tf–idf weight (term frequency–inverse document
frequency)
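To make the ranking formula more concrete, here is a minimal sketch of a tf-idf style score in plain Java. It assumes tf is the square root of the term count, idf is 1 + ln(numDocs / (docFreq + 1)), norm(D) is 1 / sqrt(document length), every boost is 1, and queryNorm is left out. The class `ScoreSketch` and its methods are hypothetical names for this example, not Lucene's Similarity API.

```java
import java.util.List;

// Simplified scoring sketch; names are hypothetical, not Lucene's API.
class ScoreSketch {
    // Assumed tf: square root of the raw term count in the document.
    static double tf(String term, List<String> doc) {
        long freq = doc.stream().filter(term::equals).count();
        return Math.sqrt(freq);
    }

    // Assumed idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(String term, List<List<String>> docs) {
        long docFreq = docs.stream().filter(d -> d.contains(term)).count();
        return 1.0 + Math.log((double) docs.size() / (docFreq + 1));
    }

    // coord * sum over query terms of tf * idf^2 * norm.
    // Boosts are fixed at 1 and queryNorm is omitted in this sketch.
    static double score(List<String> query, List<String> doc, List<List<String>> docs) {
        if (query.isEmpty() || doc.isEmpty()) {
            return 0.0;
        }
        double norm = 1.0 / Math.sqrt(doc.size()); // assumed length normalization
        double sum = 0.0;
        int matched = 0;
        for (String term : query) {
            double termFreq = tf(term, doc);
            if (termFreq > 0) {
                matched++;
            }
            double inverseDocFreq = idf(term, docs);
            sum += termFreq * inverseDocFreq * inverseDocFreq * norm;
        }
        double coord = (double) matched / query.size(); // fraction of query terms found
        return coord * sum;
    }
}
```

Calling `ScoreSketch.score` for the same query against each document and sorting by the result gives a ranking; a document that contains none of the query terms scores 0.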
Key index files in Lucene
Segments file
Fields information file
Text information file
Frequency file
Position file
Inverted Index Example
Doc 1: Penn State Football …
Doc 2: Football players … State

Posting Table
id  word      doc    offset
1   football  Doc 1  3
              Doc 1  67
              Doc 2  1
2   penn      Doc 1  1
3   players   Doc 2  2
4   state     Doc 1  2
              Doc 2  13
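The posting table in this example can be approximated with a small in-memory inverted index. The sketch below maps each lowercased word to a list of (document, word offset) pairs. The class `InvertedIndexSketch` is a hypothetical name, and the two input strings are shortened versions of Doc 1 and Doc 2 (the elided text is not reproduced, so the later offsets from the slide do not appear).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory inverted index; not Lucene's on-disk format.
class InvertedIndexSketch {
    // word -> postings; each posting is {docId, 1-based word offset}.
    // A TreeMap keeps the words in sorted order, as in the posting table.
    static Map<String, List<int[]>> build(String[] docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            String[] words = docs[d].toLowerCase(Locale.ROOT).trim().split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                if (words[pos].isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(words[pos], k -> new ArrayList<>())
                     .add(new int[] {d + 1, pos + 1});
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> index =
                build(new String[] {"Penn State Football", "Football players State"});
        for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
            StringBuilder line = new StringBuilder(entry.getKey()).append(":");
            for (int[] posting : entry.getValue()) {
                line.append(" (Doc ").append(posting[0])
                    .append(", offset ").append(posting[1]).append(")");
            }
            System.out.println(line);
        }
    }
}
```

Looking up a word is then a single map access, which is why searching does not need to scan every document.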
Demo
How to install Lucene and run the demo
Boolean retrieval example
apache -lucene
apache +lucene
apache lucene
Luke: http://www.getopt.org/luke/
An online demo (PHP + Lucene): http://tiny.cc/JCA9K
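The three queries in the Boolean retrieval example can be mimicked with a small matcher. It assumes a `-` prefix excludes a term, a `+` prefix requires it, and a bare term is optional (at least one optional term must match when nothing is required). The class `BooleanSketch` and the three sample documents are made up for illustration; this is not Lucene's query parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical Boolean matcher over whitespace-separated words.
class BooleanSketch {
    static boolean matches(String query, String doc) {
        Set<String> words = new HashSet<>(
                Arrays.asList(doc.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+") && clause.length() > 1) {
                hasRequired = true;
                if (!words.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-") && clause.length() > 1) {
                if (words.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (words.contains(clause)) {
                anyOptionalHit = true;
            }
        }
        // A query made only of prohibited terms matches nothing.
        return hasRequired || anyOptionalHit;
    }

    // Return the documents that match, in their original order.
    static List<String> search(String query, List<String> docs) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (matches(query, doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "apache lucene search", "apache http server", "lucene in action");
        for (String query : new String[] {"apache -lucene", "apache +lucene", "apache lucene"}) {
            System.out.println(query + " => " + search(query, docs));
        }
    }
}
```

With these sample documents, the first query keeps only the document that mentions apache without lucene, the second keeps every document containing lucene, and the third keeps all three.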
References:
Lucene: http://lucene.apache.org/
Apache: http://www.apache.org/
“Lucene in Action” Chapter 1 and code: Link
Lucene index: http://www.ibm.com/developerworks/library/walucene/
http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
http://en.wikipedia.org/wiki/Full_text_search
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Tf-idf