Search Engines & Question Answering
Indexing
Giuseppe Attardi
Università di Pisa
(some slides borrowed from C. Manning, H. Schütze)
Topics
Indexing and Search
– Indexing and inverted files
– Compression
– Postings Lists
– Query processing
Indexing
Inverted index storage
– Compressing dictionaries in memory
Processing Boolean queries
– Optimizing term processing
– Skip list encoding
Wild-card queries
Positional/phrase/proximity queries
Query
Which plays of Shakespeare contain the
words Brutus AND Caesar but NOT
Calpurnia?
Could grep all of Shakespeare’s plays for
Brutus and Caesar then strip out lines
containing Calpurnia?
– Slow (for large corpora)
– NOT is non-trivial
– Other operations (e.g., find the phrase
Romans and countrymen) not feasible
Term-document incidence

            Antony &   Julius   The      Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony          1         1        0        0        0         0
Brutus          1         1        0        1        0         0
Caesar          1         1        0        1        1         1
Calpurnia       0         1        0        0        0         0
Cleopatra       1         0        0        0        0         0
mercy           1         0        1        1        1         1
worser          1         0        1        1        1         0

(1 if the play contains the word, 0 otherwise)
Incidence vectors
So we have a 0/1 vector for each term
To answer query:
– take the vectors for Brutus, Caesar and
Calpurnia (complemented)
– perform bitwise AND
110100 AND 110111 AND 101111 = 100100
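To make this concrete, here is a minimal C++ sketch (mine, not from the slides) that runs the query over the six-play incidence vectors from the table above, using std::bitset:

    #include <bitset>
    #include <iostream>

    int main() {
        // One bit per play, leftmost = Antony and Cleopatra, as in the table.
        std::bitset<6> brutus("110100");
        std::bitset<6> caesar("110111");
        std::bitset<6> calpurnia("010000");

        // Brutus AND Caesar AND (NOT Calpurnia)
        std::bitset<6> answer = brutus & caesar & ~calpurnia;
        std::cout << answer << '\n';   // prints 100100
        return 0;
    }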
Answers to query

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the
Capitol; Brutus killed me.
Bigger corpora

Consider n = 1M documents, each with about 1K terms
Avg 6 bytes/term including spaces/punctuation
– 6 GB of data
Say there are m = 500K distinct terms among these

Can't build the matrix
A 500K x 1M matrix has half a trillion 0's and 1's
But it has no more than one billion 1's
– Why? The matrix is extremely sparse
What's a better representation?
Inverted index

Documents are parsed to extract words and these are saved with the
Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term       Doc #
I            1
did          1
enact        1
julius       1
caesar       1
I            1
was          1
killed       1
i'           1
the          1
capitol      1
brutus       1
killed       1
me           1
so           2
let          2
it           2
be           2
with         2
caesar       2
the          2
noble        2
brutus       2
hath         2
told         2
you          2
caesar       2
was          2
ambitious    2
After all documents have been parsed, the inverted file is sorted by
terms:

Term       Doc #
ambitious    2
be           2
brutus       1
brutus       2
capitol      1
caesar       1
caesar       2
caesar       2
did          1
enact        1
hath         2
I            1
I            1
i'           1
it           2
julius       1
killed       1
killed       1
let          2
me           1
noble        2
so           2
the          1
the          2
told         2
you          2
was          1
was          2
with         2
Multiple term entries in a single document are merged and frequency
information added:

Term       Doc #  Freq
ambitious    2     1
be           2     1
brutus       1     1
brutus       2     1
capitol      1     1
caesar       1     1
caesar       2     2
did          1     1
enact        1     1
hath         2     1
I            1     2
i'           1     1
it           2     1
julius       1     1
killed       1     2
let          2     1
me           1     1
noble        2     1
so           2     1
the          1     1
the          2     1
told         2     1
you          2     1
was          1     1
was          2     1
with         2     1
The file is commonly split into a Dictionary and a Postings file:

Dictionary                       Postings
Term       N docs  Tot Freq      Doc #  Freq
ambitious    1        1            2     1
be           1        1            2     1
brutus       2        2            1     1
                                   2     1
capitol      1        1            1     1
caesar       2        3            1     1
                                   2     2
did          1        1            1     1
enact        1        1            1     1
hath         1        1            2     1
I            1        2            1     2
i'           1        1            1     1
it           1        1            2     1
julius       1        1            1     1
killed       1        2            1     2
let          1        1            2     1
me           1        1            1     1
noble        1        1            2     1
so           1        1            2     1
the          2        2            1     1
                                   2     1
told         1        1            2     1
you          1        1            2     1
was          2        2            1     1
                                   2     1
with         1        1            2     1
Where do we pay in storage?

[Figure: the dictionary (terms, each with its number of docs and
total frequency) shown beside the postings file (one doc pointer and
frequency per term-document pair). We pay for the terms in the
dictionary, and for the doc pointers and frequencies in the postings.]
Two conflicting forces

A term like Calpurnia occurs in maybe one doc out of a million; we
would like to store each pointer to it using log2 1M ~ 20 bits
A term like the occurs in virtually every doc, so 20 bits/pointer is
too expensive
– Prefer a 0/1 vector in this case
Postings file entry
Store list of docs containing a term
in increasing order of doc id
– Brutus: 33,47,154,159,202 …
Consequence: suffices to store gaps
– 33,14,107,5,43 …
Hope: most gaps encoded with far
fewer than 20 bits
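As a small illustration (a sketch of mine, not engine code), turning doc ids into gaps and back is a single pass:

    #include <cstdint>
    #include <vector>

    // Increasing doc ids -> gaps: 33,47,154,159,202 -> 33,14,107,5,43
    std::vector<uint32_t> toGaps(const std::vector<uint32_t>& docs) {
        std::vector<uint32_t> gaps;
        uint32_t prev = 0;
        for (uint32_t d : docs) { gaps.push_back(d - prev); prev = d; }
        return gaps;
    }

    // Gaps -> doc ids, by keeping a running sum
    std::vector<uint32_t> fromGaps(const std::vector<uint32_t>& gaps) {
        std::vector<uint32_t> docs;
        uint32_t cur = 0;
        for (uint32_t g : gaps) { cur += g; docs.push_back(cur); }
        return docs;
    }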
Variable encoding
For Calpurnia, use ~20 bits/gap entry
For the, use ~1 bit/gap entry
If the average gap for a term is G,
want to use ~log2G bits/gap entry
γ codes for gap encoding

Represent a gap G as the pair <length, offset>
– length is in unary and uses ⌊log2 G⌋ + 1 bits to specify the length
  of the binary encoding of the offset
– offset = G - 2^⌊log2 G⌋, written in ⌊log2 G⌋ binary bits
e.g., 9 is represented as 1110001 (unary length 1110, then offset 001)
Encoding G takes 2⌊log2 G⌋ + 1 bits
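A sketch of the encoder and decoder in C++ (using character strings of '0'/'1' for readability; a packed-bit implementation would differ only mechanically):

    #include <cstdint>
    #include <string>

    // Gamma-encode a gap G >= 1: b = floor(log2 G) ones, a terminating 0,
    // then the offset G - 2^b in b binary bits.
    std::string gammaEncode(uint64_t G) {
        int b = 0;
        while ((G >> (b + 1)) != 0) ++b;       // b = floor(log2 G)
        std::string out(b, '1');
        out += '0';                            // unary part: b ones + a zero
        for (int i = b - 1; i >= 0; --i)       // low b bits of G = offset
            out += ((G >> i) & 1) ? '1' : '0';
        return out;                            // gammaEncode(9) == "1110001"
    }

    // Decode one gamma code starting at bits[pos]; advances pos.
    uint64_t gammaDecode(const std::string& bits, size_t& pos) {
        int b = 0;
        while (bits[pos++] == '1') ++b;        // count ones until the zero
        uint64_t G = 1ull << b;                // implicit leading 1
        for (int i = 0; i < b; ++i)
            G |= (uint64_t)(bits[pos++] - '0') << (b - 1 - i);
        return G;
    }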
What we’ve just done
Encoded each gap as tightly as
possible, to within a factor of 2
For better tuning (and a simple
analysis) - need some handle on the
distribution of gap values
Zipf’s law
The kth most frequent term has
frequency proportional to 1/k
Use this for a crude analysis of the
space used by our postings file
pointers
Zipf’s law log-log plot
Rough analysis based on Zipf

Most frequent term occurs in n docs
– n gaps of 1 each
Second most frequent term in n/2 docs
– n/2 gaps of 2 each …
kth most frequent term in n/k docs
– n/k gaps of ~k each; use 2⌊log2 k⌋ + 1 bits for each gap
– net of ~(2n/k) log2 k bits for the kth most frequent term

Sum over k from 1 to 500K

Do this by breaking the values of k into groups:
– group i consists of 2^(i-1) <= k < 2^i
Group i has 2^(i-1) components in the sum, each contributing at most
(2ni)/2^(i-1)
Summing over i from 1 to 19, we get a net estimate of ~340 Mbits,
~45 MB for our index (work out the calculation)
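The grouped sum is easy to check mechanically; this little program (an illustration, not from the slides) evaluates the per-group bound 2ni for i = 1..19 with n = 1M:

    #include <cstdio>

    int main() {
        const double n = 1e6;              // number of documents
        double bits = 0;
        // Group i covers 2^(i-1) <= k < 2^i and has 2^(i-1) terms, each
        // contributing at most (2*n*i)/2^(i-1) bits: at most 2*n*i per group.
        for (int i = 1; i <= 19; ++i)      // 2^19 > 500K terms
            bits += 2 * n * i;
        std::printf("%.0f Mbits ~ %.1f MB\n", bits / 1e6, bits / 8e6);
        // Prints 380 Mbits ~ 47.5 MB: the same order as the slide's
        // ~340 Mbits / ~45 MB (the grouping is a loose upper bound).
        return 0;
    }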
Caveats
This is not the entire space for our index:
– does not account for dictionary storage
– as we get further, we’ll store even more stuff in
the index
Assumes Zipf’s law applies to occurrence
of terms in docs
All gaps for a term taken to be the same
Does not talk about query processing
Issues with index we just built
How do we process a query?
What terms in a doc do we index?
– All words or only “important” ones?
Stopword list: terms that are so
common that they’re ignored for
indexing
– e.g., the, a, an, of, to …
– language-specific
Exercise: Repeat postings size calculation if 100 most
frequent terms are not indexed.
Issues in what to index
Cooper’s concordance of Wordsworth was published in
1911. The applications of full-text retrieval are legion:
they include résumé scanning, litigation support and
searching published journals on-line.
Cooper’s vs. Cooper vs. Coopers
Full-text vs. full text vs. {full, text} vs.
fulltext
Accents: résumé vs. resume
Punctuation
Ne’er: use language-specific,
handcrafted “locale” to normalize
State-of-the-art: break up hyphenated
sequence
U.S.A. vs. USA - use locale
a.out
Numbers
3/12/91
Mar. 12, 1991
55 B.C.
B-52
100.2.86.144
– Generally, don’t index as text
– Creation dates for docs
Case folding
Reduce all letters to lower case
– exception: upper case in mid-sentence
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
Thesauri and soundex

Handle synonyms and homonyms
– Hand-constructed equivalence classes
  • e.g., car = automobile
  • your vs. you're
Index such equivalences, or expand the query?
– More later ...
Spell correction
Look for all words within (say) edit
distance 3 (Insert/Delete/Replace) at
query time
– e.g. Alanis Morisette
Spell correction is expensive and
slows the query (up to a factor of 100)
– Invoke only when index returns zero
matches.
– What if docs contain mis-spellings?
Lemmatization

Reduce inflectional/variant forms to the base form
E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Stemming

Reduce terms to their “roots” before indexing
– language dependent
– e.g. automate(s), automatic, automation all reduced to automat

Before: for example compressed and compression are both accepted as
equivalent to compress.
After: for exampl compres and compres are both accept as equival to
compres.
Porter’s algorithm
Commonest algorithm for stemming
English
Conventions + 5 phases of
reductions
– phases applied sequentially
– each phase consists of a set of
commands
– sample convention: Of the rules in a
compound command, select the one
that applies to the longest suffix
Typical rules in Porter

sses → ss
ies → i
ational → ate
tional → tion
Other stemmers

Other stemmers exist, e.g., the Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single-pass, longest suffix removal (about 250 rules)
Motivated by linguistics as well as IR
Full morphological analysis gives modest benefits for retrieval
Beyond term search
What about phrases?
Proximity: Find Gates NEAR
Microsoft
– Need index to capture position
information in docs
Zones in documents: Find
documents with (author = Ullman)
AND (text contains automata)
Dictionary and postings files:
a fast, compact inverted index

[Figure: the dictionary (term, N docs, Tot Freq), usually kept in
memory, points into the postings file (one doc # and freq per entry),
which is gap-encoded and kept on disk.]
Inverted index storage
Dictionary storage
– Dictionary in main memory, postings on
disk
• This is common, especially for something
like a search engine where high throughput
is essential, but can also store most of it on
disk with small, in-memory index
Tradeoffs between compression and
query processing speed
– Cascaded family of techniques
How big is the lexicon V?

Grows (but more slowly) with corpus size
Empirically okay model:

    V = k N^b

where b ≈ 0.5, k ≈ 30–100; N = # tokens
(Exercise: can one derive this from Zipf's Law?)
For instance the TREC collection (2 GB; 750,000 newswire articles):
~500,000 terms
The number is decreased by case-folding and stemming
Indexing all numbers could make it extremely large (so some search
engines don't)
Spelling errors contribute a fair bit of size
Dictionary storage - first cut

Array of fixed-width entries
– 500,000 terms; 28 bytes/term = 14 MB

Terms (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes each)
a                      999,712            →
aardvark                    71            →
….                         ….
zzzz                        99            →

Allows for fast binary search into the dictionary
Exercises
Is binary search really a good idea?
What are the alternatives?
Fixed-width terms are wasteful

Most of the bytes in the Terms column are wasted
– we allot 20 bytes even for 1-letter terms
– and still can't handle supercalifragilisticexpialidocious
Written English averages ~4.5 characters
– Exercise: Why is/isn't this the number to use for estimating the
  dictionary size?
– Short words dominate token counts
Average word type in English: ~8 characters
Store the dictionary as a string of characters:
– the pointer to the next word marks the end of the current one
– hope to save up to 60% of dictionary space
Compressing the term list

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
 33          →              →
 29          →              →
 44          →              →
126          →              →

Total string length = 500K x 8 bytes = 4 MB
Term pointers resolve 4M positions: log2 4M = 22 bits = 3 bytes
Binary search on these term pointers
Total space for compressed list

4 bytes per term for Freq
4 bytes per term for pointer to Postings
3 bytes per term pointer
Avg. 8 bytes per term in the term string
500K terms → 9.5 MB
(now avg. 11 bytes/term in the dictionary, not 20)
Blocking

Store pointers to every kth term on the term string
Need to store term lengths (1 extra byte each)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.   Postings ptr.   Term ptr. (one per block of k = 4)
 33          →              →
 29          →
 44          →
126          →

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
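A sketch of lookup in a blocked dictionary string (my illustration; real layouts differ): binary search on the first term of each block, then a linear scan of at most k terms inside the block:

    #include <string>
    #include <vector>

    // Each term is stored as a one-byte length followed by its characters;
    // blockPtr[i] points at the start of block i (one pointer per k terms).
    struct BlockedDict {
        std::string terms;             // "7systile9syzygetic8syzygial6syzygy..."
        std::vector<size_t> blockPtr;
        static const int k = 4;

        // Return the index of term t, or -1 if absent.
        long find(const std::string& t) const {
            size_t lo = 0, hi = blockPtr.size();
            while (hi - lo > 1) {              // binary search on block heads
                size_t mid = (lo + hi) / 2;
                if (firstTerm(mid) <= t) lo = mid; else hi = mid;
            }
            size_t p = blockPtr[lo];           // linear scan within the block
            for (int j = 0; j < k && p < terms.size(); ++j) {
                size_t len = (unsigned char)terms[p];
                if (terms.compare(p + 1, len, t) == 0) return (long)(lo * k + j);
                p += 1 + len;
            }
            return -1;
        }

        std::string firstTerm(size_t b) const {
            size_t p = blockPtr[b];
            return terms.substr(p + 1, (unsigned char)terms[p]);
        }
    };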
Exercise
Estimate the space usage (and
savings compared to 9.5 MB) with
blocking, for block sizes of k = 4, 8
and 16
IXE: IndeXing Engine
Design Goals
Specialized tool (indexing and search)
C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance
High performance
Scalability
Simple to maintain
– Hard to deal with autoconf, autoheader,
automake, configure, libtool, …
– Developed my own Make templates
“Keep it as simple as possible, but not simpler.” (Albert Einstein)
Lexicon

[Figure: two-level lexicon. A bigram index (ab: 24930, ac: 24931,
ad: 24932, ae: 24933, …) points into the word index, which stores
null-terminated word fragments (“ate0cent0cute0rial0”); word entries
point to the postings.]
Extreme compression (see MG)
Front-coding:
– sorted words commonly have a long common prefix: store differences
  only (for 3 words in 4); see the sketch after this list
Using perfect hashing to store terms
“within” their pointers
– not good for vocabularies that change
Partition dictionary into pages
– use B-tree on first terms of pages
– pay a disk seek to grab each page
– if we’re paying 1 disk seek anyway to get the
postings, “only” another seek/query term
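A sketch of front-coding (my illustration): keep every 4th term intact, and encode each other term as the length of the prefix it shares with its predecessor plus the remaining suffix:

    #include <string>
    #include <vector>

    struct FrontCoded { int prefixLen; std::string suffix; };

    std::vector<FrontCoded> frontEncode(const std::vector<std::string>& sorted) {
        std::vector<FrontCoded> out;
        for (size_t i = 0; i < sorted.size(); ++i) {
            if (i % 4 == 0) {                       // every 4th term kept whole
                out.push_back({0, sorted[i]});
                continue;
            }
            const std::string &prev = sorted[i - 1], &cur = sorted[i];
            size_t p = 0;                           // shared prefix length
            while (p < prev.size() && p < cur.size() && prev[p] == cur[p]) ++p;
            out.push_back({(int)p, cur.substr(p)});
        }
        return out;
    }
    // After "systile": syzygetic, syzygial, syzygy become
    // (2,"zygetic"), (5,"ial"), (5,"y"); decoding replays the prefixes.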
Is it worth it?
Average lexicon search time:
– IXE: 8 msec
– Front coding: 6 msec
Average query time: 300 msec
Number Encoding
Whole chapter in Managing
Gigabytes
Best solution: local Bernoulli using
Golomb coding
– Roughly: quotient (unary) + remainder
(binary)
– Compression: ~1 bit per posting
Quick and clean solution: eptacode
Eptacode

Use 7 bits in a byte; the sign bit marks continuation

Bytes        Eptacode           Golomb
  1               127              127
  2            16,129           16,384
  3         2,048,383        2,097,152
  4       260,144,641      268,435,456
  5    33,038,369,407   34,359,738,368
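A sketch of a 7-bits-per-byte continuation code in the spirit of eptacode (the exact IXE byte layout is not given in the slides, so treat the convention here as an assumption):

    #include <cstdint>
    #include <vector>

    // Low 7 bits of each byte carry data; the high bit says "more follows".
    std::vector<uint8_t> eptaEncode(uint64_t n) {
        std::vector<uint8_t> out;
        while (n >= 0x80) {
            out.push_back(static_cast<uint8_t>(n & 0x7F) | 0x80);
            n >>= 7;
        }
        out.push_back(static_cast<uint8_t>(n));   // last byte: high bit clear
        return out;
    }

    uint64_t eptaDecode(const uint8_t* p) {
        uint64_t n = 0;
        int shift = 0;
        for (; *p & 0x80; ++p, shift += 7)        // continuation bytes
            n |= static_cast<uint64_t>(*p & 0x7F) << shift;
        return n | (static_cast<uint64_t>(*p) << shift);
    }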
Golomb drawbacks

Need to store the base for the quotient
Postings are non-consecutive
Result: 30% increase in size of index

[Bar chart: index size (%) and query time for no compression, Golomb,
and eptacode.]
Fundamental Ideas
Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk the same as used by the
algorithms
Rely on good data structures and
algorithms
– STL
Specialize data structures
– For indexing
– For search
Indexing
Posting Lists are created in memory
– Provide as much memory as possible to
indexing machines
When size of lists reaches a
threshold, dump partial index to disk
Perform final merging of partial
indexes
Merging operation used also for:
– Incremental indexing
– Distributed indexing
Search

Search mmaps the index:
– the lexicon completely
– postings on demand (too big)
Search can't be done while indexing, and vice versa
However one can:
– dynamically add a collection with new documents to search
– mark documents as deleted
Index Structure
Full-text index file
Postings file
Full-text Index File Structure
FileHeader
Column0 (Lexicon)
…
Column_n (Lexicon)
StopWords
Colors
Colors

Generalization of Google hits properties (anchor, size,
capitalization)
Similar to Fulcrum zones
Used for ranking
– e.g. title words contribute more to the rank of a document
and for selective queries, e.g. combining text matches with metadata
constraints such as author = attardi
Query processing exercises
If the query is friends AND romans
AND (NOT countrymen), how could
we use the freq of countrymen?
How can we perform the AND of two
postings entries without explicitly
building the 0/1 term-doc incidence
vector?
Boolean queries: Exact match
An algebra of queries using AND, OR and
NOT together with query words
– Uses “set of words” document representation
– Precise: document matches condition or not
Primary commercial retrieval tool for 3
decades
– Researchers had long argued superiority of
ranked IR systems, but not much used in practice
until spread of web search engines
– Professional searchers still like boolean queries:
you know exactly what you’re getting
• Cf. Google’s boolean AND criterion
Query optimization

Consider a query that is an AND of t terms
The idea: for each of the t terms, get its term-doc incidence from
the postings, then AND them together
Process in order of increasing freq:
– start with the smallest set, then keep cutting further
(This is why we kept freq in the dictionary)
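A sketch of the standard postings merge (my code, not any specific engine's): intersect the two shortest lists first, so each AND keeps shrinking the candidate set:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Linear merge of two sorted postings lists.
    std::vector<uint32_t> intersect(const std::vector<uint32_t>& a,
                                    const std::vector<uint32_t>& b) {
        std::vector<uint32_t> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] == b[j])      { out.push_back(a[i]); ++i; ++j; }
            else if (a[i] < b[j])  ++i;
            else                   ++j;
        }
        return out;
    }

    // AND of t terms: process lists in order of increasing length (freq).
    std::vector<uint32_t> andQuery(std::vector<std::vector<uint32_t>> lists) {
        std::sort(lists.begin(), lists.end(),
                  [](const auto& x, const auto& y) { return x.size() < y.size(); });
        std::vector<uint32_t> result = lists[0];
        for (size_t t = 1; t < lists.size() && !result.empty(); ++t)
            result = intersect(result, lists[t]);
        return result;
    }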
Small Adaptive Set Intersection

Query compiler
– one cursor on the posting list of each query node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– returns the first result r >= min
A single operator for all kinds of queries, e.g. proximity
SASI example

world: 3, 9, 12, 20, 40, 47
wide:  1, 8, 10, 25, 40, 41
web:   2, 4, 6, 21, 30, 35, 40

(the cursors leapfrog via next(min); all three lists meet at 40)
Speeding up postings merges
Insert skip pointers
Say our current list of candidate
docs for an AND query is 8,13,21
– (having done a bunch of ANDs)
We want to AND with the following
postings entry:
2,4,6,8,10,12,14,16,18,20,22
Linear scan is slow
Skip pointers or skip lists

At indexing time: augment postings with skip pointers
2,4,6,8,10,12,14,16,18,20,22,24, ...
At query time: as we walk the current candidate list, concurrently
walk the inverted file entry - can skip ahead
– (e.g., 8,21).
Skip size: recommend about √(list length)
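A sketch of seeking with skips (illustrative layout: an implicit skip every `skip` entries rather than stored pointers):

    #include <cstdint>
    #include <vector>

    struct SkipList {
        std::vector<uint32_t> docs;   // sorted doc ids
        size_t skip;                  // skip interval, ~sqrt(docs.size())
    };

    // Advance i until docs[i] >= target, taking skips while they undershoot.
    size_t seek(const SkipList& l, size_t i, uint32_t target) {
        while (i + l.skip < l.docs.size() && l.docs[i + l.skip] < target)
            i += l.skip;              // jump a whole skip interval
        while (i < l.docs.size() && l.docs[i] < target)
            ++i;                      // finish with a short linear scan
        return i;
    }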
General query optimization
e.g. (madding OR crowd) AND
(ignoble OR strife)
– Can put any boolean query into CNF
Get freq’s for all terms
Estimate the size of each OR by the
sum of its freq’s (conservative)
Process in increasing order of OR
sizes
Exercise

Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term           Freq
eyes          213,312
kaleidoscope   87,009
marmalade     107,913
skies         271,658
tangerine      46,653
trees         316,812
IXE Architecture

[Diagram: a Crawler fills a local cache of documents; the Indexer
reads the cache and builds the Lexicon and Postings, both accessed
via mmap. The Document Store (DocStore) holds a Table<DocInfo> of
per-document records (name, time, size, title, summary, type).]
Storing Objects in Relational Tables

SQL:

    create table video (
        name    varchar(256),
        caption varchar(2048),
        format  INT,
        PRIMARY KEY(name)
    )

Template metaprogramming (the META attribute drives indexing):

    class Video : public DocInfo {
        char* name;
        char* caption;
        int   format;
        META(Video, (SUPERCLASS(DocInfo),
                     VARKEY(name, 256),
                     VARFIELD(caption, 2048),
                     FIELD(format)));
    };
Programming Applications (C++)

    Collection<Video> videos("CNN");
    videos.insert(video1);
    Query q("caption MATCHES Jordan and format = wav");
    Cursor<Video> cursor(videos, q);
    while (cursor.MoveNext())
        cout << cursor.Current();
Single cursor operator

    struct QueryResult {
        CollectionID cid;
        DocID        did;
        Position     pos;
    };

    QueryResult qmin;
    cursor.next(qmin);

Returns the next result qr (document, or word within a document) such
that qr >= qmin
Normal search: pos = 0
Proximity search: pos = i
Multiple collections search (increment cid or select cid)
'where' clauses (e.g. date > 1/1/2002)
Boolean combinations
Performance

An independent benchmark comparing AltaVista and IXE:

[Bar chart: indexing throughput (doc/sec) and retrieval throughput
(query/sec) on Intel hardware, values up to ~250.]
Independent evaluations
Major portal, Germany
Major portal, France
Major portal, Italy
– Stress test with 300 concurrent queries
– Verity crashed in several cases
Microsoft Redmond
TREC Terabyte 2004

GOV2 collection:
– ~25 million documents from the .gov domain
– ~500 GB of documents
IXE index split into 23 shards

Data Structure                      Size
Lexicon                             4.2 GB
Posting Lists (including offsets)   62.0 GB
Metadata                            26.0 GB
Document cache (optional)           84.0 GB
Total                               176.2 GB
TREC Terabyte 2004

[Bar chart: average query time (sec), from 0 to ~2, for Monash U.,
U. Pisa, MSR Asia, U. Amst., CMU, Sabir, Dublin C.U., RMIT.]
Distributed Search Architecture

[Diagram: a multithreaded HTTP server passes each query to Broker
Modules; the brokers dispatch, via async I/O, to several Query
Servers, each holding one shard of the index.]
Other Features
Snippets
Document cache
Colors
Multiple collections
– Sorted by page rank
– Authoritativeness
– Popularity
Filter/Group by similarity
Index Compression

Impact on search

Binary search down to a 4-term block, then linear search through the
terms in the block
8 entries, binary tree: avg. = 2.6 compares
    = (1 + 2·2 + 4·3 + 4) / 8
Blocks of 4 (binary tree over blocks): avg. = 3 compares
    = (1 + 2·2 + 2·3 + 2·4 + 5) / 8
Compression: Two alternatives
Lossless compression: all information is
preserved, but we try to encode it
compactly
– What IR people mostly do
Lossy compression: discard some
information
– Using a stoplist can be thought of in this way
– Techniques such as Latent Semantic Indexing
can be viewed as lossy compression
– One could prune from postings entries unlikely
to turn up in the top k list for query on word
• Especially applicable to web search with huge
numbers of documents but short queries
• e.g., Carmel et al. SIGIR 2002
Caching
If 25% of your users are searching
for
Britney Spears
then you probably do need spelling
correction, but you don’t need to
keep on intersecting those two
postings lists
Web query distribution is extremely
skewed, and you can usefully cache
results for common queries
Query vs. index expansion

Recall:
– thesauri for term equivalents
– soundex for homonyms
How do we use these?
– Can “expand” the query to include equivalences
  • Query car tyres → car tyres automobile tires
– Can expand the index
  • Index docs containing car under automobile as well
Query expansion

Usually do query expansion
– No index blowup
– Query processing slowed down
  • docs frequently contain equivalences
– May retrieve more junk
  • e.g. puma → jaguar
– Use carefully controlled wordnets
Wild-card queries: *

mon*: find all docs containing any word beginning with “mon”
Easy with a binary tree (or B-tree) lexicon: retrieve all words w in
the range mon ≤ w < moo
*mon: find words ending in “mon”: harder
Permuterm index: for the word hello, index it under:
– hello$, ello$h, llo$he, lo$hel, o$hell
Queries:
– X    → lookup on X$
– X*   → lookup on X*$
– *X   → lookup on X$*
– *X*  → lookup on X*
– X*Y  → lookup on Y$X*
– X*Y*Z → ??? Exercise!
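A sketch of generating the rotations to index (my illustration; it also emits the $hello rotation, which the slide omits):

    #include <string>
    #include <vector>

    // All rotations of word + '$': hello -> hello$, ello$h, llo$he,
    // lo$hel, o$hell, $hello.  To answer X*Y, rotate the pattern so the
    // '*' is at the end (Y$X*) and do a prefix lookup on these rotations.
    std::vector<std::string> permuterm(const std::string& word) {
        std::string s = word + '$';
        std::vector<std::string> out;
        for (size_t i = 0; i < s.size(); ++i)
            out.push_back(s.substr(i) + s.substr(0, i));
        return out;
    }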
Wild-card queries
Permuterm problem: ≈ quadruples lexicon
size
Another way: index all k-grams occurring
in any word (any sequence of k chars)
e.g., from text “April is the cruelest
month” we get the 2-grams (bigrams)
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,h$
– $ is a special word boundary symbol
Processing n-gram wild-cards
Query mon* can now be run as
– $m AND mo AND on
Fast, space efficient
But we’d get a match on moon
Must post-filter these results against
query
Further wild-card refinements
– Cut down on pointers by using blocks
– Wild-card queries tend to have few bigrams
• keep postings on disk
– Exercise: given a trigram index, how do you
process an arbitrary wild-card query?
Phrase search

Search for “to be or not to be”
No longer suffices to store only <term: docs> entries
But could just do this anyway, and then post-filter [i.e., grep] for
phrase matches
– Viable if phrase matches are uncommon
Alternatively, store, for each term, entries of the form:
<number of docs containing term;
 doc1: position1, position2 … ;
 doc2: position1, position2 … ;
 etc.>
Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of these docs could contain “to be or not to be”?
Can compress position values/offsets as we did with docs in the last
lecture
Nevertheless, this expands the postings list substantially in size
Processing a phrase query

Extract the inverted index entries for each distinct term: to, be,
or, not
Merge their doc:position lists to enumerate all positions where
“to be or not to be” begins:

to: 2:1,17,74,222,551; 4:8,27,101,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...

The same general method works for proximity searches
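A sketch of the position-level merge for one adjacent pair (my code; chaining it across to/be/or/not/to/be yields the full phrase, and it needs C++17 for the structured bindings):

    #include <cstdint>
    #include <map>
    #include <vector>

    // doc id -> sorted positions of a term in that doc
    using Postings = std::map<uint32_t, std::vector<uint32_t>>;

    // Docs where `second` occurs exactly one position after `first`;
    // the result keeps the positions of the second word, so calls chain.
    Postings followedBy(const Postings& first, const Postings& second) {
        Postings out;
        for (const auto& [doc, pos1] : first) {
            auto it = second.find(doc);
            if (it == second.end()) continue;
            const std::vector<uint32_t>& pos2 = it->second;
            std::vector<uint32_t> hits;
            size_t j = 0;                      // both lists walked once
            for (uint32_t p : pos1) {
                while (j < pos2.size() && pos2[j] < p + 1) ++j;
                if (j < pos2.size() && pos2[j] == p + 1) hits.push_back(p + 1);
            }
            if (!hits.empty()) out[doc] = hits;
        }
        return out;
    }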
Example: WestLaw
http://www.westlaw.com/
Largest commercial (paying subscribers)
legal search service (started 1975; ranking
added 1992)
About 7 terabytes of data; 700,000 users
Majority of users still use boolean queries
Example query:
– What is the statute of limitations in cases
involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT
/3 CLAIM
Long, precise queries; proximity operators;
incrementally developed; not like web
search
Index size
Stemming/case folding cut
– number of terms by ~40%
– number of pointers by 10-20%
– total space by ~30%
Stop words
– Rule of 30: ~30 words account for ~30%
of all term occurrences in written text
– Eliminating 150 commonest terms from
indexing will cut almost 25% of space
Positional index size

Need an entry for each occurrence, not just once per document
Index size therefore depends on average document size (why?)
– Average web page has < 1000 terms
– SEC filings, books, even some epic poems … easily 100,000 terms
Consider a term with frequency 0.1%:

Document size   Postings   Positional postings
1,000               1              1
100,000             1            100
Rules of thumb
Positional index size factor of 2-4
over non-positional index
Positional index size 35-50% of
volume of original text
Caveat: all of this holds for “English-like” languages
Index construction
Thus far, considered index space
What about index construction time?
What strategies can we use with
limited main memory?
Somewhat bigger corpus

Number of docs = n = 40M
Number of terms = m = 1M
Use Zipf to estimate the number of postings entries:
n + n/2 + n/3 + …. + n/m ~ n ln m = 560M entries
(check for yourself)
No positional info yet
Recall index construction

Documents are parsed to extract words, and these are saved with the
Document ID (Doc 1: “I did enact Julius Caesar …”, Doc 2: “So let it
be with Caesar …”, producing the term / Doc # list shown earlier).
Key step

After all documents have been parsed, the inverted file is sorted by
terms (as in the sorted table shown earlier).
Index construction
As we build up the index, cannot
exploit compression tricks
– parse docs one at a time, final postings
entry for any term incomplete until the
end
– (actually you can exploit compression,
but this becomes a lot more complex)
At 10-12 bytes per postings entry,
demands several temporary
gigabytes
System parameters for design
Disk seek ~ 1 millisecond
Block transfer from disk ~ 1
microsecond per byte (following a
seek)
All other ops ~ 10 microseconds
Bottleneck
Parse and build postings entries one
doc at a time
To now turn this into a term-wise
view, must sort postings entries by
term (then by doc within each term)
Doing this with random disk seeks
would be too slow
If every comparison took 1 disk seek, and n items could be
sorted with nlog2n comparisons, how long would this take?
Sorting with fewer disk seeks
12-byte (4+4+4) records (term, doc, freq)
These are generated as we parse docs
Must now sort 560M such 12-byte records
by term
Define a Block = 10M such records
– can “easily” fit a couple into memory
Will sort within blocks first, then merge
the blocks into one long sorted order
Sorting 56 blocks of 10M records

First, read each block and sort within:
– Quicksort takes about 2 x (10M ln 10M) steps
Exercise: estimate the total time to read each block from disk and
quicksort it
56 times this estimate gives us 56 sorted runs of 10M records each
Need 2 copies of the data on disk throughout
Merging 56 sorted runs

Merge tree of log2 56 ~ 6 layers
During each layer, read runs into memory in blocks of 10M, merge,
write back

[Figure: binary merge tree over the sorted runs on disk.]
Merging 56 runs

Time estimate for disk transfer (each 120 MB block read and written
at 1 microsecond per byte, over 6 layers):

6 x 56 x (120M x 10^-6 s) x 2 ~ 22 hours

At each stage the run size doubles but the number of runs halves
Exercise - fill in this table

Step                                              Time
1  56 initial quicksorts of 10M records each       ?
2  read 2 sorted blocks for merging, write back    ?
3  merge 2 sorted blocks                           ?
4  add (2) + (3) = time to read/merge/write        ?
5  56 times (4) = total merge time                 ?
Large memory indexing
Suppose instead that we had 16GB of
memory for the above indexing task.
Exercise: how much time to index?
Repeat with a couple of values of n, m.
In practice, spidering interlaced with
indexing.
– Spidering bottlenecked by WAN speed and
many other factors - more on this later
Improving on the merge tree

Compressed temporary files
– compress terms in the temporary dictionary runs
Merge more than 2 runs at a time
– maintain a heap of candidates, one from each run
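A sketch of the heap-based k-way merge over in-memory runs (a real indexer would stream runs from disk, refilling buffers instead of holding whole runs):

    #include <cstdint>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Rec { uint32_t term, doc, freq; };   // 12-byte postings record

    bool operator>(const Rec& a, const Rec& b) {
        return a.term != b.term ? a.term > b.term : a.doc > b.doc;
    }

    // k-way merge with a min-heap of one candidate per run, so all 56
    // sorted runs are merged in a single pass over the data.
    std::vector<Rec> mergeRuns(const std::vector<std::vector<Rec>>& runs) {
        using Item = std::pair<Rec, size_t>;    // (record, run it came from)
        auto gt = [](const Item& a, const Item& b) { return a.first > b.first; };
        std::priority_queue<Item, std::vector<Item>, decltype(gt)> heap(gt);
        std::vector<size_t> next(runs.size(), 0);
        for (size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty()) { heap.push({runs[r][0], r}); next[r] = 1; }
        std::vector<Rec> out;
        while (!heap.empty()) {
            auto [rec, r] = heap.top();
            heap.pop();
            out.push_back(rec);
            if (next[r] < runs[r].size()) heap.push({runs[r][next[r]++], r});
        }
        return out;
    }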
Indexing speed in practice
From TREC TeraByte 2004:
24-38 GB/hour on a 1GHz Pentium PC
(depending on HTML parser)
Dynamic indexing
Docs come in over time
– postings updates for terms already in
dictionary
– new terms added to dictionary
Docs get deleted
Simplest approach
Maintain “big” main index
New docs go into “small” auxiliary index
Search across both, merge results
Deletions
– Invalidation bit-vector for deleted docs
– Filter docs output on a search result by this
invalidation bit-vector
Periodically, re-index into one main index
More complex approach
Fully dynamic updates
Only one index at all times
– No big and small indices
Active management of a pool of
space
Fully dynamic updates

Inserting a (variable-length) record
– e.g., a typical postings entry
Maintain a pool of (say) 64KB chunks
A chunk header maintains metadata on the records in the chunk and on
its free space

[Figure: a chunk, with a header followed by records and free space.]
Global tracking
In memory, maintain a global record
address table that says, for each
record, the chunk it’s in.
Define one chunk to be current.
Insertion
– if current chunk has enough free space
• extend record and update metadata.
– else look in other chunks for enough
space
– else open new chunk
Changes to dictionary
New terms appear over time
– cannot use a static perfect hash for
dictionary
OK to use term character string
w/pointers from postings as in
lecture 2
Index on disk vs. memory
Most retrieval systems keep the dictionary
in memory and the postings on disk
Web search engines frequently keep both
in memory
– massive memory requirement
– feasible for large web service installations,
less so for standard usage where
• query loads are lighter
• users willing to wait 2 seconds for a response
More on this when discussing deployment
models
Distributed indexing

Suppose we had several machines available to do the indexing
– how do we exploit the parallelism?
Two basic approaches:
– stripe by dictionary as the index is built up
– stripe by documents
Indexing in the real world
Typically, don’t have all documents sitting
on a local filesystem
– Documents need to be spidered
– Could be dispersed over a WAN with varying
connectivity
– Must schedule distributed spiders/indexers
– Could be (secure content) in
• Databases
• Content management applications
• Email applications
HTTP is often not the most efficient way of fetching these documents;
native API fetching can be faster
Indexing in the real world

Documents come in a variety of formats
– word processing formats (e.g., MS Word)
– spreadsheets
– presentations
– publishing formats (e.g., pdf)
Generally handled using format-specific “filters”
– convert format into text + meta-data
Documents come in a variety of languages
– automatically detect the language(s) in a document
– tokenization and stemming are language-dependent