Powerful Full-Text Search with Solr
Yonik Seeley
[email protected]
Web 2.0 Expo, Berlin
8 November 2007
download at
http://www.apache.org/~yonik
What is Lucene
• High performance, scalable, full-text search library
• Focus: Indexing + Searching Documents
– “Document” is just a list of name+value pairs
• No crawlers or document parsing
• Flexible Text Analysis (tokenizers + token filters)
• 100% Java, no dependencies, no config files
What is Solr
• A full text search server based on Lucene
• XML/HTTP, JSON Interfaces
• Faceted Search (category counting)
• Flexible data schema to define types and fields
• Hit Highlighting
• Configurable Advanced Caching
• Index Replication
• Extensible Open Architecture, Plugins
• Web Administration Interface
• Written in Java 5, deployable as a WAR
Basic App
[Diagram: An Indexer sends a Document (super_name: Mr. Fantastic, name: Reed Richards, category: superhero, powers: elasticity) to http://solr/update. A Webapp sends a Query (powers:agility) to http://solr/select and receives a Query Response (matching docs). Solr runs inside a Servlet Container and exposes admin (HTML), update, and select endpoints. Updates go through the XML Update Handler or CSV Update Handler; selects go through the Standard request handler or a Custom request handler, with an XML or JSON response writer. Solr is layered on top of Lucene.]
Indexing Data
HTTP POST to http://localhost:8983/solr/update
<add><doc>
<field name="id">05991</field>
<field name="name">Peter Parker</field>
<field name="supername">Spider-Man</field>
<field name="category">superhero</field>
<field name="powers">agility</field>
<field name="powers">spider-sense</field>
</doc></add>
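The add message above can also be generated programmatically. The following is a minimal sketch in Python using only the standard library; the helper name build_add_xml is hypothetical, and the field names are the ones from the slide. List values are written as repeated field elements, which is how the two powers values appear above.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc):
    """Build an <add><doc>...</doc></add> message from a dict.

    A list value becomes several <field> elements with the same name
    (a multi-valued field).
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml({
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
})
print(xml_body)
```

The resulting string would be the body of the HTTP POST to the update URL.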
Indexing CSV data
Iron Man, Tony Stark, superhero, powered armor | flight
Sandman, William Baker|Flint Marko, supervillain, sand transform
Wolverine,James Howlett|Logan, superhero, healing|adamantium
Magneto, Erik Lehnsherr, supervillain, magnetism|electricity
http://localhost:8983/solr/update/csv?
fieldnames=supername,name,category,powers
&separator=,
&f.name.split=true&f.name.separator=|
&f.powers.split=true&f.powers.separator=|
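The CSV update URL above is just a base URL plus per-field parameters, so it can be assembled from a list of field names. A small sketch, assuming Python's standard urllib; the function name csv_update_url is hypothetical. Note that urlencode percent-encodes the comma and the pipe, which is equivalent to the literal characters shown on the slide.

```python
from urllib.parse import urlencode

def csv_update_url(base, fieldnames, split_fields, separator=",", sub_separator="|"):
    """Assemble the CSV update URL with per-field split parameters."""
    params = [("fieldnames", ",".join(fieldnames)), ("separator", separator)]
    for f in split_fields:
        # f.<field>.split / f.<field>.separator make a column multi-valued
        params.append((f"f.{f}.split", "true"))
        params.append((f"f.{f}.separator", sub_separator))
    return base + "?" + urlencode(params)

url = csv_update_url(
    "http://localhost:8983/solr/update/csv",
    ["supername", "name", "category", "powers"],
    split_fields=["name", "powers"],
)
print(url)
```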
Data upload methods
URL=http://localhost:8983/solr/update/csv
• HTTP POST body (curl, HttpClient, etc)
curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
• Multi-part file upload (browsers)
• Request parameter
?stream.body='Cyclops, Scott Summers,…'
• Streaming from URL (must enable)
?stream.url=file://data/info.csv
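The first method (HTTP POST body) does not require curl. As a rough sketch, the same request can be prepared with Python's standard urllib; the helper name make_csv_post is hypothetical, and the request is only built here, not sent, so no running Solr instance is assumed.

```python
import urllib.request

URL = "http://localhost:8983/solr/update/csv"

def make_csv_post(csv_text, url=URL):
    """Prepare (but do not send) an HTTP POST with CSV data in the body."""
    return urllib.request.Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )

# One row taken from the CSV slide
req = make_csv_post("Iron Man, Tony Stark, superhero, powered armor | flight\n")
print(req.get_method(), req.full_url)
# To actually send it to a running server: urllib.request.urlopen(req)
```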
Indexing with SolrJ
// Solr's Java Client API… remote or embedded/local!
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("supername","Daredevil");
doc.addField("name","Matt Murdock");
doc.addField("category","superhero");
server.add(doc);
server.commit();
Deleting Documents
• Delete by Id, most efficient
<delete>
<id>05591</id>
<id>32552</id>
</delete>
• Delete by Query
<delete>
<query>category:supervillain</query>
</delete>
Commit
• <commit/> makes changes visible
– Triggers static cache warming configured in solrconfig.xml
– Triggers autowarming from existing caches
• <optimize/> does the same as commit and also merges all index segments for faster searching
Lucene Index Segments
_0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del
_1.fnm _1.fdt _1.fdx […]
Searching
http://localhost:8983/solr/select?q=powers:agility
&start=0&rows=2&fl=supername,category
<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>
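A client typically builds the select URL and then pulls numFound and the documents out of the XML. The sketch below assumes Python's standard library only; select_url and parse_results are hypothetical helper names, and the sample response is the one from the slide rather than live server output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def select_url(base, q, start=0, rows=10, fl=None):
    """Build a select URL with paging (start, rows) and a field list (fl)."""
    params = {"q": q, "start": start, "rows": rows}
    if fl:
        params["fl"] = ",".join(fl)
    return base + "?" + urlencode(params)

def parse_results(xml_text):
    """Return (numFound, list of dicts) from an XML response."""
    result = ET.fromstring(xml_text).find("result")
    docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs

SAMPLE = """<response>
  <result numFound="427" start="0">
    <doc><str name="supername">Spider-Man</str><str name="category">superhero</str></doc>
    <doc><str name="supername">Mystique</str><str name="category">supervillain</str></doc>
  </result>
</response>"""

url = select_url("http://localhost:8983/solr/select", "powers:agility",
                 start=0, rows=2, fl=["supername", "category"])
num_found, docs = parse_results(SAMPLE)
print(url)
print(num_found, docs)
```

numFound is the total number of matches, while the docs list holds only the requested page (rows=2 here).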
Response Format
• Add &wt=json for JSON formatted response
{"response": {"numFound":427, "start":0,
"docs": [
{"supername":"Spider-Man", "category":"superhero"},
{"supername":"Mystique", "category":"supervillain"}
]
}}
• Also Python, Ruby, PHP, SerializedPHP, XSLT
Scoring
• Query results are sorted by score descending
• VSM – Vector Space Model
• tf – term frequency: number of matching terms in field
• lengthNorm – number of tokens in field
• idf – inverse document frequency
• coord – coordination factor, number of matching terms
• document boost
• query clause boost
http://lucene.apache.org/java/docs/scoring.html
Explain
http://solr/select?q=super fast&indent=on&debugQuery=on
<lst name="debug">
<lst name="explain">
<str name="id=Flash,internal_docid=6">
0.16389132 = (MATCH) product of:
  0.32778263 = (MATCH) sum of:
    0.32778263 = (MATCH) weight(text:fast in 6), product of:
      0.5012072 = queryWeight(text:fast), product of:
        2.466337 = idf(docFreq=5)
        0.20321926 = queryNorm
      0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
        1.4142135 = tf(termFreq(text:fast)=2)
        2.466337 = idf(docFreq=5)
        0.1875 = fieldNorm(field=text, doc=6)
  0.5 = coord(1/2)
</str>
<str name="id=Superman,internal_docid=7">
0.1365761 = (MATCH) product of:
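The numbers in the first explain entry can be reproduced by hand. The sketch below assumes Lucene's default similarity, where tf is the square root of the term frequency and idf is 1 + ln(numDocs / (docFreq + 1)). The index size of 26 documents is an assumption, chosen because it reproduces the idf on the slide; queryNorm and fieldNorm are copied from the explain output rather than derived.

```python
import math

num_docs = 26            # assumption: inferred so that idf matches the slide
doc_freq = 5             # idf(docFreq=5)
term_freq = 2            # tf(termFreq(text:fast)=2)
field_norm = 0.1875      # copied from the explain output
query_norm = 0.20321926  # copied from the explain output
coord = 1 / 2            # 1 of the 2 query terms ("super", "fast") matched

tf = math.sqrt(term_freq)
idf = 1 + math.log(num_docs / (doc_freq + 1))

query_weight = idf * query_norm        # compare with 0.5012072
field_weight = tf * idf * field_norm   # compare with 0.65398633
score = query_weight * field_weight * coord  # compare with 0.16389132

print(tf, idf, query_weight, field_weight, score)
```

Small differences in the last digits are expected, since Lucene computes these values in single precision.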
Lucene Query Syntax
1. justice league
• Equiv: justice OR league
• QueryParser default operator is "OR"/optional
2. +justice +league -name:aquaman
• Equiv: justice AND league NOT name:aquaman
3. "justice league" -name:aquaman
4. title:spiderman^10 description:spiderman
5. description:"spiderman movie"~100
Lucene Query Examples 2
1. releaseDate:[2000 TO 2007]
2. Wildcard searches: sup?r, su*r, super*
3. spider~
• Fuzzy search: Levenshtein distance
• Optional minimum similarity: spider~0.7
4. *:*
5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
DisMax Query Syntax
• Good for handling raw user queries
– Balanced quotes for phrase query
– ‘+’ for required, ‘-’ for prohibited
– Separates query terms from query structure
http://solr/select?qt=dismax
&q=super man                // the user query
&qf=title^3 subject^2 body  // fields to query
&pf=title^2 body            // fields to do phrase queries
&ps=100                     // slop for those phrase q's
&tie=.1                     // multi-field match reward
&mm=2                       // # of terms that should match
&bf=popularity              // boost function
DisMax Query Form
• The expanded Lucene Query:
+( DisjunctionMaxQuery( title:super^3 |
subject:super^2 | body:super)
DisjunctionMaxQuery( title:man^3 |
subject:man^2 | body:man)
)
DisjunctionMaxQuery(title:"super man"~100^2 |
body:"super man"~100)
FunctionQuery(popularity)
• Tip: set up your own request handler with default parameters to avoid clients having to specify them
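Each DisjunctionMaxQuery above scores a term by its best-matching field, and the tie parameter adds back a fraction of the scores from the other fields. A minimal sketch of that combination rule in Python; the function name dismax_score is hypothetical and the per-field scores are made-up illustration values, not output from a real index.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way a disjunction-max query does:
    the maximum score, plus tie times the sum of the remaining scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# Hypothetical scores for one term in title, subject, body
scores = [0.9, 0.3, 0.1]
pure_max = dismax_score(scores, tie=0.0)    # only the best field counts
with_tie = dismax_score(scores, tie=0.1)    # small reward for multi-field matches
plain_sum = dismax_score(scores, tie=1.0)   # behaves like a plain sum
print(pure_max, with_tie, plain_sum)
```

With tie=0 a document matching in one strong field beats one with several weak matches; raising tie toward 1 moves the behavior toward an ordinary sum over fields.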
Function Query
• Allows adding function of field value to score
– Boost recently added or popular documents
• Current parser only supports function notation
• Example: log(sum(popularity,1))
• sum, product, div, log, sqrt, abs, pow
• scale(x, target_min, target_max)
– calculates min & max of x across all docs
• map(x, min, max, target)
– useful for dealing with defaults
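To make the function semantics concrete, here is a rough Python model of a few of the functions listed above. It is a sketch of the behavior, not Solr's implementation: log is assumed to be base 10, scale is modeled over an explicit list of values instead of all documents in an index, and the handling of a constant field in scale is an assumption. The trailing underscores only avoid clashing with Python built-ins.

```python
import math

def sum_(*args):
    """sum(a, b, ...): add the arguments."""
    return sum(args)

def log(x):
    """log(x): assumed to be base 10."""
    return math.log10(x)

def scale(values, target_min, target_max):
    """scale(x, target_min, target_max): find the min and max of x across
    all values, then map them linearly onto [target_min, target_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # assumption: a constant field maps to target_min
        return [float(target_min) for _ in values]
    span = target_max - target_min
    return [target_min + (v - lo) * span / (hi - lo) for v in values]

def map_(x, lo, hi, target):
    """map(x, min, max, target): replace values inside [min, max] by target,
    e.g. turn a default of 0 into 1."""
    return target if lo <= x <= hi else x

boost = log(sum_(9, 1))               # the slide's example with popularity=9
scaled = scale([0, 5, 10], 1, 2)
defaults = [map_(v, 0, 0, 1) for v in [0, 7]]
print(boost, scaled, defaults)
```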
Boosted Query
• Score is multiplied instead of added
– New local params <!...> syntax added
&q=<!boost b=sqrt(popularity)>super man
• Parameter dereferencing in local params
&q=<!boost b=$boost v=$userq>
&boost=sqrt(popularity)
&userq=super man
Analysis & Search Relevancy
Document Indexing Analysis
LexCorp BFG-9000
→ WhitespaceTokenizer: LexCorp | BFG-9000
→ WordDelimiterFilter catenateWords=1: Lex | Corp | LexCorp | BFG | 9000
→ LowercaseFilter: lex | corp | lexcorp | bfg | 9000
Query Analysis
Lex corp bfg9000
→ WhitespaceTokenizer: Lex | corp | bfg9000
→ WordDelimiterFilter catenateWords=0: Lex | corp | bfg | 9000
→ LowercaseFilter: lex | corp | bfg | 9000
A Match!
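The two analysis chains can be imitated in a few lines to see why the query matches. This is a simplified Python sketch, not Solr's WordDelimiterFilter: it splits on case changes, letter/digit boundaries, and punctuation, and with catenate_words it additionally emits the alphabetic parts of a token joined together. All function names are hypothetical.

```python
import re

# A part is an all-caps run, a capitalized or lowercase word, or a digit run.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def whitespace_tokenize(text):
    return text.split()

def word_delimiter(tokens, catenate_words=False):
    out = []
    for tok in tokens:
        parts = _PART.findall(tok)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        # Simplification: join all alphabetic parts of the token
        if catenate_words and len(words) > 1:
            out.append("".join(words))
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

def analyze(text, catenate_words):
    return lowercase(word_delimiter(whitespace_tokenize(text), catenate_words))

index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)
matches = set(query_tokens) <= set(index_tokens)
print(index_tokens, query_tokens, matches)
```

Catenating only at index time means the index also holds lexcorp, so a query typed as a single word would match as well as the split form.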
Configuring Relevancy
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory"
words="stopwords.txt"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
</analyzer>
</fieldType>
Field Definitions
• Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
• Dynamic Fields
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
copyField
• Copies one field to another at index time
• Use case #1: Analyze same field different ways
– copy into a field with a different analyzer
– boost exact-case, exact-punctuation matches
– language translations, thesaurus, soundex
<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>
• Use case #2: Index multiple fields into single searchable field
Facet Query
http://solr/select?q=foo&wt=json&indent=on
&facet=true&facet.field=cat
&facet.query=price:[0 TO 100]
&facet.query=manu:IBM
{"response":{"numFound":26,"start":0,"docs":[…]},
"facet_counts":{
"facet_queries":{
"price:[0 TO 100]":6,
"manu:IBM":2},
"facet_fields":{
"cat":[ "electronics",14, "memory",3,
"card",2, "connector",2]
}}}
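In the JSON response above, facet_fields returns each field as a flat list that alternates value and count. A short Python sketch of turning that into pairs; the helper name facet_pairs is hypothetical, and the sample data is the facet_counts section from the slide.

```python
import json

def facet_pairs(flat):
    """Turn ["electronics", 14, "memory", 3, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))

raw = """{"facet_counts":{
  "facet_queries":{"price:[0 TO 100]":6, "manu:IBM":2},
  "facet_fields":{"cat":["electronics",14, "memory",3, "card",2, "connector",2]}}}"""

counts = json.loads(raw)["facet_counts"]
cat_counts = facet_pairs(counts["facet_fields"]["cat"])
ibm_count = counts["facet_queries"]["manu:IBM"]
print(cat_counts, ibm_count)
```

Keeping the pairs as a list rather than a dict preserves the order in which the server returned them.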
Filters
• Filters are restrictions in addition to the query
• Use in faceting to narrow the results
• Filters are cached separately for speed
1. User queries for memory; the query sent to Solr is
&q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
&q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
&q=memory&fq=inStock:true&fq=size:1GB
&fq=type:DDR2&…
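The drill-down steps above amount to keeping the user query fixed and appending one fq parameter per selection. A sketch in Python with the standard urllib; search_url is a hypothetical helper, and the base URL is the shorthand used on the slides.

```python
from urllib.parse import urlencode

def search_url(base, q, filters=(), facet=True):
    """Build a search URL with one fq parameter per active filter."""
    params = [("q", q)] + [("fq", f) for f in filters]
    if facet:
        params.append(("facet", "true"))
    return base + "?" + urlencode(params)

BASE = "http://solr/select"
filters = ["inStock:true"]
step1 = search_url(BASE, "memory", filters)   # initial query
filters.append("size:1GB")
step2 = search_url(BASE, "memory", filters)   # user picks a memory size
filters.append("type:DDR2")
step3 = search_url(BASE, "memory", filters)   # user picks a memory type
print(step1, step2, step3, sep="\n")
```

Sending each restriction as its own fq, rather than one combined filter, lets each one be cached and reused independently.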
Highlighting
http://solr/select?q=lcd&wt=json&indent=on
&hl=true&hl.fl=features
{"response":{"numFound":5,"start":0,"docs":[
{"id":"3007WFP", "price":899.95}, …]},
"highlighting":{
"3007WFP":{ "features":["30\" TFT active matrix
<em>LCD</em>, 2560 x 1600"]},
"VA902B":{ "features":["19\" TFT active matrix
<em>LCD</em>, 8ms response time, 1280 x
1024 native resolution"]}}}
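Highlighting snippets come back in a separate section keyed by document id, so a client has to join them to the documents itself. A minimal Python sketch, assuming the response shape shown above; attach_highlights is a hypothetical helper, and the sample is trimmed to the first document from the slide.

```python
import json

# Raw string so that \" reaches the JSON parser unchanged
raw = r'''{"response":{"numFound":5,"start":0,"docs":[
  {"id":"3007WFP", "price":899.95}]},
 "highlighting":{
  "3007WFP":{"features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]}}}'''

def attach_highlights(result):
    """Return the docs with their highlight snippets added under 'snippets'."""
    highlighting = result.get("highlighting", {})
    docs = []
    for doc in result["response"]["docs"]:
        merged = dict(doc)
        # Documents without a highlighted field get an empty dict
        merged["snippets"] = highlighting.get(doc["id"], {})
        docs.append(merged)
    return docs

docs = attach_highlights(json.loads(raw))
print(docs[0]["id"], docs[0]["snippets"]["features"][0])
```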
MoreLikeThis
• Selects documents that are “similar” to the documents matching the main query.
&q=id:6H500F0
&mlt=true&mlt.fl=name,cat,features
"moreLikeThis":{
"6H500F0":{"numFound":5,"start":0,
"docs": [
{"name":"Apple 60 GB iPod with Video
Playback Black", "price":399.0,
"inStock":true, "popularity":10, […]
}, […]
]
[…]
High Availability
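[Diagram: HTTP search requests arrive at the Appservers (Dynamic HTML Generation), which query a set of Solr Searchers through a Load Balancer. An Updater reads from the DB and sends updates to the Solr Master; Index Replication copies the index from the Solr Master to the Solr Searchers. An admin terminal issues admin queries.]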
Resources
• WWW
– http://lucene.apache.org/solr
– http://lucene.apache.org/solr/tutorial.html
– http://wiki.apache.org/solr/
• Mailing Lists
– [email protected]
– [email protected]