Type Less, Find More:
Fast Autocompletion Search
with a Succinct Index
SIGIR 2006 in Seattle, USA, August 6 - 11
Holger Bast
Max-Planck-Institut für Informatik
Saarbrücken, Germany
joint work with Ingmar Weber
It's useful

Basic Autocompletion
– saves typing
– no more information than necessary
salton
– find out about formulations used
autocomplete, autocompose
– error correction
autocomplit, autocompleet
It's more useful

Complete to phrases
– phrase voronoi diagram → add word voronoi_diagram to index

Complete to subwords
– compound word eigenproblem → add word problem to index

Complete to category names
– author Börkur Sigurbjörnsson → add sigurbjörnsson:börkur::author
börkur::sigurbjörnsson:author

Faceted search
– add ct:conference:sigir
– add ct:author:Börkur_Sigurbjörnsson
– add ct:year:2005
(Workshop on Faceted Search on Thursday)
all via the same mechanism
Related Engines
Basic Problem Definition

Query
– a set D of documents (= hits for the first part of the query)
– a range W of words (= potential completions of last word)

Answer
– all documents D' from D, containing a word from W
– all words W' from W, contained in a document from D

Extensions (see paper)
– ranking (best hits from D' and best completions from W')
– positional information (proximity queries)
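
The definition above can be written down directly as a brute-force reference. This is only a sketch to pin down what D' and W' mean, not an efficient method and not the index from the talk; the function name `complete` and the toy collection are assumptions made for illustration.

```python
def complete(doc_words, D, W):
    """Return (D', W') for a candidate document set D and a word range W.

    doc_words maps a doc id to the set of words it contains (toy data).
    """
    W = set(W)
    # D': documents from D that contain at least one word from W
    d_prime = sorted(d for d in D if doc_words[d] & W)
    # W': words from W that occur in at least one document from D
    w_prime = sorted(w for w in W if any(w in doc_words[d] for d in D))
    return d_prime, w_prime


# Hypothetical toy collection, not taken from the talk.
doc_words = {
    3: {"sigir", "salton"},
    16: {"sigir", "forum"},
    18: {"sigir", "salton", "salvador"},
    24: {"salesman"},
}
D = [3, 16, 18]                          # hits for the first part of the query
W = ["salesman", "salton", "salvador"]   # potential completions of "sal"
print(complete(doc_words, D, W))
```

On this toy input the call returns `([3, 18], ['salton', 'salvador'])`: document 16 contains no word from W, and salesman occurs in no document from D.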

First try: inverted index (INV)
Processing 1-word queries with INV



For example, sigir*
D = all documents
W = all words matching sigir*

Iterate over all words from W
sigir         Doc. 18, Doc. 53, Doc. 591, ...
sigir03       Doc. 3, Doc. 66, Doc. 765, ...
sigir04       Doc. 25, Doc. 98, Doc. 221, ...
sigirlist     Doc. 67, Doc. 189, Doc. 221, ...
sigirforum    Doc. 16, Doc. 110, Doc. 141, ...

Merge the document lists (expensive!)
D' = Doc. 3, Doc. 16, Doc. 18, Doc. 25, …

Output all words from range as completions (trivial for 1-word queries)
W' = sigir, sigir03, sigir04, sigirlist, …
Processing multi-word queries with INV




For example, sigir* sal*
D = Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for sigir*)
W = all words matching sal*

Iterate over all words from W
salary        Doc. 8, Doc. 23, Doc. 291, ...
salesman      Doc. 24, Doc. 36, Doc. 165, ...
salton        Doc. 3, Doc. 18, Doc. 66, ...
salutation    Doc. 56, Doc. 129, Doc. 251, ...
salvador      Doc. 18, Doc. 21, Doc. 25, ...

Intersect each list with D, then merge
(most intersections are empty, but INV has to compute them all!)
D' = Doc. 3, Doc. 18, Doc. 25, …

Output all words with non-empty intersection
W' = salton, salvador
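
The multi-word procedure above can be sketched as follows. This is a minimal illustration, assuming the inverted index is a plain dictionary from word to sorted doc ids; the name `inv_complete` is hypothetical, and the lists contain only the first three doc ids shown on the slide (the slide's lists continue with "...").

```python
def inv_complete(postings, D, prefix):
    """Answer an autocompletion query with an inverted index.

    postings: dict word -> sorted list of doc ids (toy inverted index)
    D:        doc ids matching the first part of the query
    prefix:   the partially typed last word
    """
    D = set(D)
    hits, completions = set(), []
    for word in sorted(postings):
        if not word.startswith(prefix):
            continue
        # One intersection per potential completion, even if it turns out empty.
        common = D.intersection(postings[word])
        if common:
            completions.append(word)
            hits |= common          # merge the non-empty intersections
    return sorted(hits), completions


# Only the doc ids visible on the slide; the real lists are longer.
postings = {
    "salary": [8, 23, 291],
    "salesman": [24, 36, 165],
    "salton": [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador": [18, 21, 25],
}
print(inv_complete(postings, [3, 16, 18, 25], "sal"))
```

With these truncated lists the result is `([3, 18, 25], ['salton', 'salvador'])`, matching D' and W' on the slide. Three of the five intersections are empty but still have to be computed.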
INV — Problems

Asymptotic time complexity is bad (for our problem)
– many intersections (one per potential completion)
– has to merge/sort (the non-empty intersections)

Still hard to beat INV in practice
– highly compressible: half the space on disk means half the time to read it
– very good locality of access: the ratio of random access time to sequential access time is 50,000 for disk, and still 100 for main memory
– simple code: instruction cache, branch prediction, etc.
A Hybrid Index (HYB)

Basic Idea: have lists for ranges of words
salary – salvador
Doc. 3, Doc. 16, Doc.18, Doc. 25, ...

Problem: not enough to show completions

Solution: store the word(s) along with each doc id
salary – salvador
Doc. 3, Doc. 16, Doc.18, Doc. 25, ...
salary salvador salton salary
salton
salvador
But this looks very wasteful
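
To make the idea concrete, here is a minimal sketch of answering a query from one such block. It assumes a single block covers the whole range of candidate words and stores (doc id, word) pairs sorted by doc id; the name `hyb_complete` and the block contents (built from the truncated lists of the earlier INV example) are illustrative assumptions, not the data structure from the paper.

```python
def hyb_complete(block, D):
    """Scan one block of (doc_id, word) pairs, sorted by doc id.

    A single pass yields both the matching documents and the completions,
    instead of one intersection per candidate word.
    """
    D = set(D)
    hits, completions = [], set()
    for doc_id, word in block:
        if doc_id in D:
            if not hits or hits[-1] != doc_id:   # doc ids may repeat in a block
                hits.append(doc_id)
            completions.add(word)
    return hits, sorted(completions)


# Toy word lists (first doc ids only), flattened into one block for salary..salvador.
lists = {
    "salary": [8, 23, 291],
    "salesman": [24, 36, 165],
    "salton": [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador": [18, 21, 25],
}
block = sorted((d, w) for w, docs in lists.items() for d in docs)
print(hyb_complete(block, [3, 16, 18, 25]))
```

The scan returns `([3, 18, 25], ['salton', 'salvador'])`, the same answer INV produces, but from one sequential pass over one list.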
HYB — Details

HYB has a block for each word range, conceptually:
doc ids:  1   3   3   5   5   6   7   8   8   9   11  11  11  12  13  15
words:    D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A

Replace doc ids by gaps and words by frequency ranks:
gaps:     +1  +2  +0  +2  +0  +1  +1  +1  +0  +1  +2  +0  +0  +1  +1  +2
ranks:    3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks such that x → log2 x bits
+0 → 0       1st (A) → 0
+1 → 10      2nd (C) → 10
+2 → 110     3rd (D) → 111
             4th (B) → 110

An actual block of HYB
10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
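
The transformation on this slide can be reproduced with a few lines of code. The sketch below hard-codes the two code tables exactly as shown on the slide; the function names and the tie-breaking rule for equal frequencies (first occurrence wins, which puts D before B) are assumptions chosen to match the slide, not details confirmed by the paper.

```python
from collections import Counter

# Code tables copied from the slide.
GAP_CODE = {0: "0", 1: "10", 2: "110"}
RANK_CODE = {1: "0", 2: "10", 3: "111", 4: "110"}


def gaps_and_ranks(doc_ids, words):
    """Turn sorted doc ids into gaps and words into frequency ranks."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    freq = Counter(words)
    # Most frequent word gets rank 1; ties broken by first occurrence (assumption).
    by_freq = sorted(freq, key=lambda w: (-freq[w], words.index(w)))
    rank = {w: i + 1 for i, w in enumerate(by_freq)}
    return gaps, [rank[w] for w in words]


def encode_block(doc_ids, words):
    """Return the gap bits and rank bits, space-separated for readability."""
    gaps, ranks = gaps_and_ranks(doc_ids, words)
    return (" ".join(GAP_CODE[g] for g in gaps),
            " ".join(RANK_CODE[r] for r in ranks))


doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
words = list("DACABACADAABCACA")
gap_bits, rank_bits = encode_block(doc_ids, words)
print(gap_bits)
print(rank_bits)
```

The two printed lines are the two bit sequences of the block shown above. A real block would store the bits without the separating spaces.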
How well does it compress? Which block size?
INV vs. HYB — Space Consumption
Theorem: The empirical entropy of INV is
Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: The empirical entropy of HYB with block size ε∙n is
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
ni = number of documents containing i-th word, n = number of documents

              MEDICINE         WIKIPEDIA        TREC .GOV
docs          44,015           2,866,503        25,204,013
words         263,817          6,700,119        25,263,176
positions     with positions   with positions   no positions
raw size      452 MB           7.4 GB           426 GB
INV           13 MB            0.48 GB          4.6 GB
HYB           14 MB            0.51 GB          4.9 GB

Nice match of theory and practice
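
The two entropy formulas differ only in the factor (1+ε), which a short calculation makes visible. The sketch below evaluates both sums for a list of document frequencies; the function names and the sample frequencies are hypothetical and are not the collections from the table.

```python
import math


def entropy_inv(doc_freqs, n):
    """Sum over words of n_i * (1/ln 2 + log2(n/n_i)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(doc_freqs, n, eps):
    """Same sum with (1+eps)/ln 2, for HYB with block size eps*n."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


# Hypothetical document frequencies n_i for four words in n = 16 documents.
doc_freqs, n = [8, 4, 2, 2], 16
inv_bits = entropy_inv(doc_freqs, n)
hyb_bits = entropy_hyb(doc_freqs, n, eps=0.1)
print(round(inv_bits, 2), round(hyb_bits, 2))
```

The overhead of HYB over INV is ε ∙ Σ ni / ln 2 bits, so a small ε keeps the two sizes close, which is in line with the nearly equal INV and HYB rows in the table.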
INV vs. HYB — Query Time

Theoretical analysis → see paper

Experiment: type ordinary queries from left to right
– sig , sigi , sigir , sigir sal , sigir salt , sigir salto , sigir salton
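
The typing experiment can be simulated by expanding each query into the sequence of partial queries a user would produce. The sketch below is an assumption about how such a sequence might be generated: the function name is hypothetical, and the minimum prefix length of 3 is inferred from the example above, where the first query is sig and the second word starts at sal.

```python
def keystroke_queries(query, min_prefix=3):
    """Expand a query into the partial queries typed from left to right.

    Each word is only issued once it has at least min_prefix letters
    (or its full length, if the word is shorter than that).
    """
    words = query.split()
    partial = []
    for i, word in enumerate(words):
        finished = words[:i]                     # words already typed in full
        start = min(min_prefix, len(word))
        for k in range(start, len(word) + 1):
            partial.append(" ".join(finished + [word[:k]]))
    return partial


for q in keystroke_queries("sigir salton"):
    print(q)
```

For the query sigir salton this prints the seven partial queries listed above, one per line.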

              MEDICINE             WIKIPEDIA            TREC .GOV
docs          44,015               2,866,503            25,204,013
words         263,817              6,700,119            25,263,176
queries       5,732 real queries   100 random queries   50 TREC queries
proximity     with proximity       with proximity       no proximity
INV avg       0.03 secs            0.17 secs            0.58 secs
INV max       0.38 secs            2.27 secs            16.83 secs
HYB avg       0.003 secs           0.05 secs            0.11 secs
HYB max       0.06 secs            0.49 secs            0.86 secs

HYB is faster by up to an order of magnitude (3 to 10 times on average, up to 20 times in the worst case)
System Design — High Level View
Compute Server (C++)
Web Server (PHP)
User Client (JavaScript)
Debugging such an application is hell!
Summary of Results

Properties of HYB
– highly compressible (just like INV)
– fast prefix-completion queries (perfect locality of access)
– fast indexing (no full inversion necessary)

Autocompletion and more
– phrase and subword completion, semantic completion,
XML support, …
– faceted search (Workshop Talk on Thursday)
– efficient DB joins: author[sigir sigmod]
all with one and the same (efficient) mechanism
INV vs. HYB — Space Consumption
Definition: empirical entropy H = optimal number of bits
Theorem: H(INV) = Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: The empirical entropy of HYB with block size ε∙n is
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
ni = number of documents containing i-th word, n = number of documents

              MED BOOKS     WIKIPEDIA     TREC .GOV
docs          44,015        2,866,503     25,204,013
words         263,817       6,700,119     25,263,176
raw size      452 MB        7.4 GB        426 GB
INV           13 MB         0.48 GB       4.6 GB
HYB           14 MB         0.51 GB       4.9 GB

Perfect match of theory and practice
INV vs. HYB — Space Consumption
Theorem: Entropy(INV) = Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: Entropy(HYB) =
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
We define a notion of empirical entropy in the paper, in terms of
ni = number of documents containing i-th word, n = number of documents
Processing a 1-word Query with INV

Processing a 1-word query, e.g., sigir*
1. Iterate over all words matching sigir*
sigir         Doc. 18, Doc. 53, Doc. 591, ...
sigir03       Doc. 3, Doc. 66, Doc. 765, ...
sigir04       Doc. 25, Doc. 98, Doc. 221, ...
sigir05       Doc. 57, Doc. 99, Doc. 110, ...
sigirlist     Doc. 67, Doc. 189, Doc. 221, ...
sigirforum    Doc. 16, Doc. 110, Doc. 141, ...
2. Merge the documents lists
Hits          Doc. 3, Doc. 16, Doc. 18, ...
Completions   sigir, sigir03, sigir04, sigir05, ...
Using an Inverted Index (INV)
salary        Doc. 18, Doc. 53, Doc. 591, ...
salesman      Doc. 3, Doc. 66, Doc. 765, ...
salient       Doc. 25, Doc. 98, Doc. 221, ...
salton        Doc. 57, Doc. 99, Doc. 110, ...
salutation    Doc. 67, Doc. 189, Doc. 221, ...
salvador      Doc. 16, Doc. 110, Doc. 141, ...
salvucci      Doc. 18, Doc. 25, Doc. 765, ...
salzberg      Doc. 53, Doc. 121, Doc. 187, ...

D    Doc. 57, Doc. 87, Doc. 110, ...
W    salary - salzberg
D'   Doc. 57, Doc. 110, ...
W'   salton, salvador
Problem 1: one intersection per potential completion
Problem 2: merging of non-empty intersections
HYB — Details

HYB has a block for each word range
document ids:        1   3   3   5   5   6   7   8   8   9   11  11  11  12  13  15
words:               D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A
gaps:                +1  +2  +0  +2  +0  +1  +1  +1  +0  +1  +2  +0  +0  +1  +1  +2
ranks by frequency:  3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

universal encoding: small gaps/ranks => short codes
+0 → 0       1st (A) → 0
+1 → 10      2nd (C) → 10
+2 → 110     3rd (D) → 111
             4th (B) → 110

one block of HYB
10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
INV vs. HYB — Query Time

              MED BOOKS     WIKIPEDIA     TREC .GOV
docs          44,015        2,866,503     25,204,013
words         263,817       6,700,119     25,263,176
INV avg       0.03 secs     0.17 secs     0.58 secs
INV max       0.38 secs     2.27 secs     16.83 secs
HYB avg       0.003 secs    0.05 secs     0.11 secs
HYB max       0.06 secs     0.49 secs     0.86 secs

avg = average time per keystroke
max = maximum time per keystroke (outliers removed)
Start with DEMO
autocomp
sig
sigir
sigir sal
sal
Related Search Engine Features

Complete from precompiled list of queries
– Google Suggest
– AllTheWeb Livesearch
–…

Desktop Search engines
– Apple Spotlight
– Copernic Desktop Search
–…