ResIn : A Combination of Results Caching and Index Pruning


ResIn: A Combination of
Results Caching and Index Pruning
for High-performance Web Search Engines
Gleb Skobeltsyn
Flavio Junqueira
Vassilis Plachouras
Ricardo Baeza-Yates
The 31st Annual International ACM SIGIR Conference
Singapore, 21 July 2008
Gleb Skobeltsyn “Resin: A Combination of Results Caching and Index Pruning for Search Engines”
The 31st Annual International ACM SIGIR Conference, 21 July 2008, Singapore
Motivation
• Caching is crucial for a Web search engine (WSE) to save resources
• Results caching:
+ Is efficient with real queries
– But its hit rate is limited due to singletons (queries that occur only once)
• How to increase the hit rate further?
– Index pruning
Contents
• ResIn architecture
• Original query stream vs. query stream after the results cache (misses)
• Static pruned index:
  • Term pruning
  • Document pruning
  • A combination of both
• Conclusion
ResIn architecture
Query processing:
1. from the main index
[Figure: ResIn architecture. A query passes from the front end to the broker, which forwards it to the back ends (each with a term cache) over the main index; the back ends return their top results.]
• We study Results Caching and Index Pruning together
• … to reduce latency and load on back-end servers
ResIn architecture
Query processing:
2. from the results cache
[Figure: ResIn architecture. The front end checks the results cache: a hit returns the result directly, a miss goes to the broker and the back ends (each with a term cache) over the main index.]
ResIn architecture
Query processing:
3. from the pruned index
[Figure: ResIn architecture. A results cache miss is sent to the pruned index: a hit returns the result, a miss goes to the broker and the back ends (each with a term cache) over the main index.]
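The three query-processing paths (results cache, pruned index, main index) can be sketched as a lookup cascade. This is a minimal illustration, not the authors' implementation: the function name `process_query`, the plain dictionary used as a cache (no eviction), and the choice to store every computed result in the cache are all assumptions made for the example.

```python
def process_query(query, results_cache, pruned_index, main_index):
    """Answer a query from the cheapest component that can serve it.

    results_cache: dict mapping query -> results (hypothetical, no eviction).
    pruned_index:  callable returning results, or None to signal a miss.
    main_index:    callable that always returns results.
    Returns (results, source) where source names the component that answered.
    """
    # 1. Results cache: a hit avoids any index access.
    if query in results_cache:
        return results_cache[query], "results cache"

    # 2. Pruned index: answers only when it can guarantee the correct top-k.
    results = pruned_index(query)
    source = "pruned index"

    # 3. Main index: the fallback that involves the broker and back ends.
    if results is None:
        results = main_index(query)
        source = "main index"

    # Assumption: results from either index are stored for future hits.
    results_cache[query] = results
    return results, source
```

Only queries that miss both the results cache and the pruned index reach the back-end servers, which is where the latency and load savings would come from.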
Original query stream (all queries)
vs.
query stream after the results cache (misses)
All queries vs. Misses: Experimental setup
• Real query log to test results cache and generate a “miss-log”: 185M queries from yahoo.co.uk
[Figure: the original query log (Q1: britney spears, Q2: sigir 2007, Q3: britney spears, Q4: sigir 2008, …, Q185’000’000: last query) is passed through an LRU results cache. Hits (Q3: britney spears, …) are answered by the cache; misses (Q1: britney spears, Q2: sigir 2007, Q4: sigir 2008, …) form the “miss-log”.]
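The miss-log generation can be sketched by replaying a query log through an LRU cache and recording every query that is not found. This is a minimal sketch: the function name `build_miss_log` and the `capacity` parameter are hypothetical, and the cache capacity used in the actual experiment is not stated on the slide.

```python
from collections import OrderedDict


def build_miss_log(queries, capacity):
    """Replay queries through an LRU results cache and return the misses."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = []
    for query in queries:
        if query in cache:
            # Hit: mark the query as most recently used.
            cache.move_to_end(query)
        else:
            # Miss: log it, then cache it, evicting the LRU entry if full.
            misses.append(query)
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses
```

With the four example queries above and a capacity of two entries, the second occurrence of "britney spears" is a hit and the other three queries end up in the miss-log.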
All queries vs. Misses: Number of terms in a query
• Average number of terms for all queries = 2.4, for misses = 3.2
• Most single term queries are hits in the results cache
• Queries with many terms are unlikely to be hits
All queries vs. Misses: Query result size distribution
• Randomly selected 2000 queries from all queries and misses:
• Avg. result size for misses is ~100 times smaller than for all queries
• Approx. half of the misses return fewer than 5000 results (small!)
• Similar results with a “small” UK document collection (78M documents)
All queries vs. Misses: Term popularity distribution
Log sizes: 185M (all queries), 41M (misses)
• Each point -> avg. popularity of 1000 consecutive terms
• The order of terms for misses is the same as for all queries
• Terms which were popular before the results cache remain popular after
Static index pruning
[Figure: ResIn architecture with the results cache and the pruned index in front of the broker and the back ends (each with a term cache) over the main index.]
Static pruned index
• Smaller version of the main index, returns:
  • the top-k response that is the same as the main index’s, or
  • a miss otherwise.
• Assumes Boolean query processing
• Types of pruning:
  • Term pruning: full posting lists for selected terms
  • Document pruning: truncated posting lists
  • Term+Document pruning: combination of both
[Figure: posting lists of terms t1 to t4 in the full index and under term pruning, document pruning, and T+D pruning.]
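The three types of pruning can be illustrated on a toy inverted index. This is a sketch under stated assumptions: the index is a dictionary from term to posting list, each posting list is already sorted so that the postings worth keeping come first, and the function names and the `max_len` parameter are hypothetical.

```python
def term_pruning(index, selected_terms):
    """Keep full posting lists, but only for the selected terms."""
    return {t: list(pl) for t, pl in index.items() if t in selected_terms}


def document_pruning(index, max_len):
    """Keep every term, but truncate each posting list to max_len postings."""
    return {t: list(pl[:max_len]) for t, pl in index.items()}


def term_document_pruning(index, selected_terms, max_len):
    """Combine both: truncated posting lists for the selected terms only."""
    return document_pruning(term_pruning(index, selected_terms), max_len)
```

How the terms are selected and where the lists are cut is what distinguishes concrete pruning strategies; this sketch only shows the shape of the resulting index.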
Term Pruning: Performance
• Term pruning based on profit(t)=popularity(t)/df(t)
• Answers a query if all query terms are in the pruned index
UK document collection, 78M documents:
• Performs well for all queries
• For misses as well: e.g., can process almost 50% of the queries with 25% of the index
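Term selection by profit(t)=popularity(t)/df(t) can be sketched as follows. The slide gives only the profit formula and the answering rule; the greedy fill under a size budget, the budget being counted in postings, and the names `term_prune` and `can_answer` are assumptions made for this example. Posting lists are assumed to be non-empty so that df(t) > 0.

```python
def term_prune(index, popularity, budget):
    """Greedily select terms by profit until the posting budget is used up.

    index:      dict term -> posting list (df(t) is the list length).
    popularity: dict term -> number of occurrences of the term in the query log.
    budget:     maximum total number of postings in the pruned index.
    """
    def profit(term):
        return popularity.get(term, 0) / len(index[term])

    pruned, used = {}, 0
    for term in sorted(index, key=profit, reverse=True):
        size = len(index[term])
        if used + size <= budget:  # skip lists that no longer fit
            pruned[term] = index[term]
            used += size
    return pruned


def can_answer(pruned, query_terms):
    """A query is answerable only if every one of its terms was kept."""
    return all(term in pruned for term in query_terms)
```

A popular term with a short posting list has a high profit: it helps answer many queries while consuming little space.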
Result Caching + Term Pruning
• Results caching performance is independent of the collection size
• Results cache capacity is up to 10% of the full index size
Term pruning: Frequent terms in misses
• MinDF (df of the least frequent query term) correlates with the result size
• MaxDF (df of the most frequent query term) is high for most of the misses
[Figure: example query “Gleb Flavio Vassilis Ricardo” with one bar per term showing its df; MinDF and MaxDF mark the shortest and the longest bar.]
• Many misses contain at least one frequent term
• => the term pruned index has to include large posting lists
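MinDF and MaxDF are straightforward to compute from the posting list lengths. The sketch below also shows why MinDF relates to the result size under Boolean (AND) processing: the intersection of the posting lists can never be larger than the shortest list. The function names and the toy index layout are assumptions for this example.

```python
def min_max_df(query_terms, index):
    """Return (MinDF, MaxDF): the smallest and largest df among the query terms."""
    dfs = [len(index[term]) for term in query_terms]
    return min(dfs), max(dfs)


def boolean_and(query_terms, index):
    """Documents containing all query terms (intersection of posting lists)."""
    lists = sorted((index[term] for term in query_terms), key=len)
    result = set(lists[0])  # start from the shortest list
    for postings in lists[1:]:
        result &= set(postings)
    return result
```

For any query, `len(boolean_and(terms, index))` is at most the MinDF returned by `min_max_df(terms, index)`, while a high MaxDF means the pruned index must hold at least one long posting list to answer the query.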
Document pruning
• Based on Fagin’s top-k intersection algorithm
• Keeps postings with high scores only:
  • Sufficient to compute top-k results for some queries
• Determining the correctness of the result requires computing a scoring threshold, which adds latency
[Figure: posting lists for t1, t2, and t3, each sorted by score. Top-2 results: D1, D2. Score threshold: s(D2,t1)+s(D1,t2)+s(D2,t3).]
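A threshold check over truncated posting lists can be sketched as below. It is written in the spirit of Fagin's threshold-style algorithms but is not the authors' exact procedure: it assumes the document score is the sum of per-term scores, AND semantics, k ≥ 1, non-empty kept lists, and, because a pruned index cannot look up postings that were cut, it bounds partially seen documents conservatively. The data layout and the name `top_k_from_pruned` are hypothetical.

```python
import heapq

NEG_INF = float("-inf")


def top_k_from_pruned(query_terms, pruned, k):
    """Return the top-k doc ids if they are provably correct, else None (miss).

    pruned: dict term -> (postings, truncated), where postings is a non-empty
    list of (doc_id, score) sorted by descending score and truncated tells
    whether lower-scored postings were cut off.
    """
    known, bound = [], []
    for term in query_terms:
        if term not in pruned:
            return None
        postings, truncated = pruned[term]
        known.append(dict(postings))
        # A cut posting scores at most as much as the last kept posting;
        # a complete list has no cut postings at all.
        bound.append(postings[-1][1] if truncated else NEG_INF)

    # Threshold for documents never seen: the sum of the last kept scores.
    best_other = sum(bound)
    full = {}
    for doc in set().union(*known):
        if all(doc in scores for scores in known):
            full[doc] = sum(scores[doc] for scores in known)
        else:
            # Partially seen document: use the bound where the score is unknown.
            upper = sum(s.get(doc, b) for s, b in zip(known, bound))
            best_other = max(best_other, upper)

    top = heapq.nlargest(k, full.items(), key=lambda item: item[1])
    nothing_missing = best_other == NEG_INF
    # Strict comparison so that ties cannot change the result.
    if nothing_missing or (len(top) == k and top[-1][1] > best_other):
        return [doc for doc, _ in top]
    return None
```

The extra pass over the kept postings to establish the bound is the kind of work that makes this check more expensive than a simple term lookup.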
Document pruning: Experimental setup
• Scoring function: [equation not preserved in this transcript]
  • pr(d): query-independent score of the document d (pagerank)
  • ω, k: normalization constants:
    • ω=[0,10,20]
    • k=1
• We try different values of PLLmax (maximum Posting List Length) and choose the one that maximizes the hit rate
• We only look at the upper bound for the hit rate: are the original top-10 results found in the top portions of all posting lists (PLs)?
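The upper-bound test can be sketched as a membership check: a query counts as a potential hit if every document of its original top results appears within the first PLLmax postings of every query term. This is a minimal sketch; the names `is_upper_bound_hit`, `upper_bound_hit_rate`, and `pll_max`, and the index layout (term mapped to a list of (doc_id, score) pairs sorted by descending score) are assumptions for the example.

```python
def is_upper_bound_hit(query_terms, top_docs, index, pll_max):
    """True if all original top documents lie in the top pll_max postings
    of every query term's posting list."""
    heads = [{doc for doc, _ in index[term][:pll_max]} for term in query_terms]
    return all(doc in head for doc in top_docs for head in heads)


def upper_bound_hit_rate(queries, index, pll_max):
    """Fraction of (query_terms, top_docs) pairs that pass the check."""
    if not queries:
        return 0.0
    hits = sum(is_upper_bound_hit(terms, docs, index, pll_max)
               for terms, docs in queries)
    return hits / len(queries)
```

This is only an upper bound because finding the documents in the truncated lists does not by itself prove that they are the correct top results; that still requires a threshold computation.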
Document pruning: performance
• Doc. pruning needs high pagerank weights
• It performs better for all queries than for misses
Term+Document pruning: performance
• T+D pruning is the best but expensive (high latency)
• profit2 is better than profit1
• Improvement is marginal for misses unless the pagerank weight is very high
Conclusions
• Results caching:
  • delivers good hit rates with a constant capacity
  • but hit rate is limited because of singletons
• Index pruning:
  • has no limit on hit rate,
  • but the pruned index size grows with the doc. collection, so it is more expensive
• Static index pruning: addition to results caching, not replacement
  • Term pruning performs well for misses also => “compatible” with results cache
  • Document pruning: all queries - OK, misses - only with high pagerank weights
  • Term+Document pruning slightly improves over document pruning
Lesson learned: Important to consider the interaction between the components
Last slide
Thank you
Questions?