Search Engines & Question Answering
Indexing
Giuseppe Attardi
Università di Pisa
(some slides borrowed from C. Manning, H. Schütze)
Topics

Indexing and Search
– Indexing and inverted files
– Compression
– Postings Lists
– Query processing
Indexing

Inverted index storage
– Compressing dictionaries in memory
Processing Boolean queries
– Optimizing term processing
– Skip list encoding
Wild-card queries
Positional/phrase/proximity queries

Query

Which plays of Shakespeare contain the
words Brutus AND Caesar but NOT
Calpurnia?

Could grep all of Shakespeare’s plays for
Brutus and Caesar then strip out lines
containing Calpurnia?
– Slow (for large corpora)
– NOT is non-trivial
– Other operations (e.g., find the phrase
Romans and countrymen) not feasible
Term-document incidence
            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony          1         1        0         0        0         0
Brutus          1         1        0         1        0         0
Caesar          1         1        0         1        1         1
Calpurnia       0         1        0         0        0         0
Cleopatra       1         0        0         0        0         0
mercy           1         0        1         1        1         1
worser          1         0        1         1        1         0

1 if play contains word, 0 otherwise
Incidence vectors

So we have a 0/1 vector for each term.
To answer the query:
– take the vectors for Brutus, Caesar and Calpurnia (complemented)
– perform bitwise AND
110100 AND 110111 AND 101111 = 100100
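
A minimal sketch of this bitwise-AND idea, assuming the six plays are ordered as in the table above; the bitset values come from the example, everything else is illustrative:

#include <bitset>
#include <iostream>

int main() {
    // One 0/1 incidence vector per term, one bit per play
    // (leftmost bit = Antony and Cleopatra, as in the table above).
    std::bitset<6> brutus("110100");
    std::bitset<6> caesar("110111");
    std::bitset<6> calpurnia("010000");

    // Brutus AND Caesar AND NOT Calpurnia
    std::bitset<6> answer = brutus & caesar & ~calpurnia;

    std::cout << answer << "\n";   // prints 100100
    return 0;
}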
Answers to query

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the
Capitol; Brutus killed me.
Bigger corpora
Consider n = 1 M documents, each
with about 1 K terms
 Avg 6 bytes/term including
spaces/punctuation

– 6 GB of data

Say there are m = 500 K distinct
terms among these
Can’t build the matrix
500K x 1M matrix has half-a-trillion
0’s and 1’s
 But it has no more than one billion
1’s
Why?

– matrix is extremely sparse

What’s a better representation?
Inverted index

Documents are parsed to extract words and these are saved with the
Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus
killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar
was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

After all documents have been parsed, the inverted file is sorted by
terms:

Term       Doc #
ambitious  2
be         2
brutus     1
brutus     2
capitol    1
caesar     1
caesar     2
caesar     2
did        1
enact      1
hath       2
I          1
I          1
i'         1
it         2
julius     1
killed     1
killed     1
let        2
me         1
noble      2
so         2
the        1
the        2
told       2
you        2
was        1
was        2
with       2

Multiple term entries in a single document are merged and frequency
information is added:

Term       Doc #  Freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
capitol    1      1
caesar     1      1
caesar     2      2
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
you        2      1
was        1      1
was        2      1
with       2      1

The file is commonly split into a Dictionary and a Postings file.

Dictionary (term, number of docs, total freq), usually kept sorted:

Term       N docs  Tot Freq
ambitious  1       1
be         1       1
brutus     2       2
capitol    1       1
caesar     2       3
did        1       1
enact      1       1
hath       1       1
I          1       2
i'         1       1
it         1       1
julius     1       1
killed     1       2
let        1       1
me         1       1
noble      1       1
so         1       1
the        2       2
told       1       1
you        1       1
was        2       2
with       1       1

Postings (doc #, freq), stored in dictionary order; each dictionary
entry points to the start of its postings (term shown in parentheses
for readability):

Doc #  Freq
2      1     (ambitious)
2      1     (be)
1      1     (brutus)
2      1     (brutus)
1      1     (capitol)
1      1     (caesar)
2      2     (caesar)
1      1     (did)
1      1     (enact)
2      1     (hath)
1      2     (I)
1      1     (i')
2      1     (it)
1      1     (julius)
1      2     (killed)
2      1     (let)
1      1     (me)
2      1     (noble)
2      1     (so)
1      1     (the)
2      1     (the)
2      1     (told)
2      1     (you)
1      1     (was)
2      1     (was)
2      1     (with)

Where do we pay in storage?

(Same dictionary and postings as above.)  We pay for:
– the terms themselves and the per-term counts (N docs, Tot Freq) in
  the dictionary
– a pointer from each dictionary entry into the postings file
– a (doc #, freq) pair for every posting
Two conflicting forces

A term like Calpurnia occurs in maybe one doc out of a million -
we would like to store each pointer to it using log2 1M ≈ 20 bits
A term like the occurs in virtually every doc, so 20 bits/pointer is
too expensive
– Prefer a 0/1 vector in this case
Postings file entry

Store list of docs containing a term
in increasing order of doc id
– Brutus: 33,47,154,159,202 …

Consequence: suffices to store gaps
– 33,14,107,5,43 …

Hope: most gaps encoded with far
fewer than 20 bits
Variable encoding

For Calpurnia, use ~20 bits/gap entry
For the, use ~1 bit/gap entry
If the average gap for a term is G, want to use ~log2 G bits/gap entry

γ codes for gap encoding

Represent a gap G as the pair <length, offset>
– length is in unary and uses ⌊log2 G⌋ + 1 bits to specify the length
  of the binary encoding of
– offset = G - 2^⌊log2 G⌋
e.g., 9 is represented as 1110 001
Encoding G takes 2⌊log2 G⌋ + 1 bits
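
A sketch of γ-encoding a gap as just described, assuming gaps are positive 32-bit integers and using a string of '0'/'1' characters to stand in for a real bit stream (function name is illustrative):

#include <string>
#include <cstdint>

// Gamma-encode a gap G >= 1: unary length (floor(log2 G) + 1 bits),
// then the offset G - 2^floor(log2 G) in binary, using floor(log2 G) bits.
std::string gammaEncode(uint32_t g) {
    int len = 0;                          // floor(log2 g)
    while ((g >> (len + 1)) != 0) ++len;

    std::string out;
    out.append(len, '1');                 // unary part: len ones ...
    out.push_back('0');                   // ... terminated by a zero
    for (int i = len - 1; i >= 0; --i)    // offset: low len bits of g
        out.push_back(((g >> i) & 1) ? '1' : '0');
    return out;                           // e.g. gammaEncode(9) == "1110001"
}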
What we’ve just done
Encoded each gap as tightly as
possible, to within a factor of 2
 For better tuning (and a simple
analysis) - need some handle on the
distribution of gap values

Zipf’s law
The kth most frequent term has
frequency proportional to 1/k
 Use this for a crude analysis of the
space used by our postings file
pointers

Zipf’s law log-log plot
Rough analysis based on Zipf

Most frequent term occurs in n docs
– n gaps of 1 each
Second most frequent term in n/2 docs
– n/2 gaps of 2 each …
kth most frequent term in n/k docs
– n/k gaps of k each - use 2⌊log2 k⌋ + 1 bits for each gap
– net of ~(2n/k) log2 k bits for the kth most frequent term
Sum over k from 1 to 500K

Do this by breaking values of k into groups:
– group i consists of 2^(i-1) ≤ k < 2^i
Group i has 2^(i-1) components in the sum, each contributing at most
(2ni)/2^(i-1)
Summing over i from 1 to 19, we get a net estimate of 340 Mbits,
~45 MB for our index
(Work out the calculation for yourself.)
Caveats

This is not the entire space for our index:
– does not account for dictionary storage
– as we get further, we’ll store even more stuff in
the index
Assumes Zipf’s law applies to occurrence
of terms in docs
 All gaps for a term taken to be the same
 Does not talk about query processing

Issues with index we just built
How do we process a query?
 What terms in a doc do we index?

– All words or only “important” ones?

Stopword list: terms that are so
common that they’re ignored for
indexing
– e.g., the, a, an, of, to …
– language-specific
Exercise: Repeat postings size calculation if 100 most
frequent terms are not indexed.
Issues in what to index
Cooper’s concordance of Wordsworth was published in
1911. The applications of full-text retrieval are legion:
they include résumé scanning, litigation support and
searching published journals on-line.
Cooper’s vs. Cooper vs. Coopers
 Full-text vs. full text vs. {full, text} vs.
fulltext
 Accents: résumé vs. resume

Punctuation
Ne’er: use language-specific,
handcrafted “locale” to normalize
 State-of-the-art: break up hyphenated
sequence
 U.S.A. vs. USA - use locale
 a.out

Numbers
3/12/91
 Mar. 12, 1991
 55 B.C.
 B-52
 100.2.86.144

– Generally, don’t index as text
– Creation dates for docs
Case folding

Reduce all letters to lower case
– exception: upper case in mid-sentence
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
Thesauri and soundex

Handle synonyms and homonyms
– Hand-constructed equivalence classes
  • e.g., car ↔ automobile
  • your ↔ you're
Index such equivalences, or expand the query?
– More later ...
Spell correction

Look for all words within (say) edit distance 3
(Insert/Delete/Replace) at query time
– e.g. Alanis Morisette
Spell correction is expensive and slows the query (up to a factor of 100)
– Invoke only when the index returns zero matches
– What if docs contain mis-spellings?
Lemmatization

Reduce inflectional/variant forms to base form
E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Stemming

Reduce terms to their "roots" before indexing
– language dependent
– e.g. automate(s), automatic, automation all reduced to automat

Before stemming: for example compressed and compression are both
accepted as equivalent to compress.
After stemming:  for exampl compres and compres are both accept as
equival to compres.
Porter’s algorithm
Commonest algorithm for stemming
English
 Conventions + 5 phases of
reductions

– phases applied sequentially
– each phase consists of a set of
commands
– sample convention: Of the rules in a
compound command, select the one
that applies to the longest suffix
Typical rules in Porter

sses → ss
ies → i
ational → ate
tional → tion
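
A toy sketch of one compound command built from the four rules above, using the longest-matching-suffix convention; this is only a fragment, not the full Porter stemmer:

#include <string>
#include <vector>
#include <utility>

// Apply the rule that matches the longest suffix (rules are listed
// longest-first, and only one rule fires).
std::string applyRules(std::string word) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"ational", "ate"},   // relational  -> relate
        {"tional",  "tion"},  // conditional -> condition
        {"sses",    "ss"},    // caresses    -> caress
        {"ies",     "i"},     // ponies      -> poni
    };
    for (const auto& r : rules) {
        const std::string& suf = r.first;
        if (word.size() >= suf.size() &&
            word.compare(word.size() - suf.size(), suf.size(), suf) == 0) {
            word.replace(word.size() - suf.size(), suf.size(), r.second);
            break;
        }
    }
    return word;
}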

Other stemmers

Other stemmers exist, e.g., the Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
– Single-pass, longest suffix removal (about 250 rules)
– Motivated by Linguistics as well as IR
Full morphological analysis - modest benefits for retrieval
Beyond term search
What about phrases?
 Proximity: Find Gates NEAR
Microsoft

– Need index to capture position
information in docs

Zones in documents: Find
documents with (author = Ullman)
AND (text contains automata)
Dictionary and postings files:
a fast, compact inverted index

Dictionary (term, N docs, Tot Freq) - usually in memory
Postings (doc #, freq) - gap-encoded, on disk

(Same dictionary and postings tables as shown earlier; each dictionary
entry points to that term's postings list.)
Inverted index storage

Dictionary storage
– Dictionary in main memory, postings on
disk
• This is common, especially for something
like a search engine where high throughput
is essential, but can also store most of it on
disk with small, in-memory index

Tradeoffs between compression and
query processing speed
– Cascaded family of techniques
How big is the lexicon V?

Grows (but more slowly) with corpus size
Empirically okay model (Heaps' law):
  V = kN^b
where b ≈ 0.5, k ≈ 30–100; N = # tokens
(Exercise: can one derive this from Zipf's Law?)
For instance the TREC collection (2 GB; 750,000 newswire articles):
~ 500,000 terms
Number is decreased by case-folding, stemming
Indexing all numbers could make it extremely large (so some search
engines don't)
Spelling errors contribute a fair bit of size
Dictionary storage - first cut

Array of fixed-width entries
– 500,000 terms; 28 bytes/term = 14 MB
– allows fast binary search into the dictionary

Terms (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes each)
a                  999,712
aardvark           71
….                 ….
zzzz               99
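
A sketch of such a fixed-width entry, assuming 20 bytes for the term and 4-byte frequency and postings-pointer fields (field names are illustrative):

#include <cstdint>

// One fixed-width dictionary entry: 20 + 4 + 4 = 28 bytes.
// 500,000 of these = 14 MB, searchable by binary search on `term`.
struct DictEntry {
    char     term[20];      // padded / truncated to 20 bytes
    uint32_t freq;          // collection frequency
    uint32_t postingsPtr;   // offset of the term's postings list
};

static_assert(sizeof(DictEntry) == 28, "expected 28-byte entries");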
Exercises
Is binary search really a good idea?
 What are the alternatives?

Fixed-width terms are wasteful

Most of the bytes in the Terms column are wasted
– we allot 20 bytes for 1-letter terms
– and still can't handle supercalifragilisticexpialidocious
Written English averages ~4.5 characters
– Exercise: Why is/isn't this the number to use for estimating the
  dictionary size?
– Short words dominate token counts
Average word type in English: ~8 characters
Store the dictionary as a single string of characters:
– the pointer to the next word marks the end of the current one
– Hope to save up to 60% of dictionary space
Compressing the term list

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Each dictionary entry now holds a Freq. (e.g. 33, 29, 44, 126, …), a
Postings ptr. and a Term ptr. into the string above.

Total string length = 500K terms x 8 bytes = 4 MB
Term pointers resolve 4M positions: log2 4M = 22 bits = 3 bytes
Binary search is done on these term pointers
Total space for compressed list

4 bytes per term for Freq
4 bytes per term for pointer to Postings
3 bytes per term pointer
Avg. 8 bytes per term in the term string
500 K terms → 9.5 MB
(Now avg. 11 bytes/term in the dictionary, not 20.)
Blocking

Store a term pointer only for every kth term on the string
Need to store term lengths (1 extra byte per term)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

With k = 4: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.
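
A sketch of the blocked layout, assuming k = 4, length-prefixed terms in one long string, and one pointer per block (names are illustrative): binary search on the block pointers, then a linear scan inside the block.

#include <string>
#include <vector>
#include <cstdint>

struct BlockedDictionary {
    std::string termString;            // e.g. "\x07systile\x09syzygetic..."
    std::vector<uint32_t> blockPtr;    // offset of the 1st term of each block
    static const int k = 4;            // terms per block

    // True if `word` is in the dictionary (postings lookup omitted).
    bool contains(const std::string& word) const {
        // Binary search over blocks: last block whose first term <= word.
        size_t lo = 0, hi = blockPtr.size();
        while (hi - lo > 1) {
            size_t mid = (lo + hi) / 2;
            if (firstTerm(mid) <= word) lo = mid; else hi = mid;
        }
        // Linear scan through the k length-prefixed terms of block `lo`.
        uint32_t pos = blockPtr[lo];
        for (int i = 0; i < k && pos < termString.size(); ++i) {
            uint8_t len = static_cast<uint8_t>(termString[pos]);
            if (termString.compare(pos + 1, len, word) == 0) return true;
            pos += 1 + len;
        }
        return false;
    }

    std::string firstTerm(size_t block) const {
        uint32_t pos = blockPtr[block];
        uint8_t len = static_cast<uint8_t>(termString[pos]);
        return termString.substr(pos + 1, len);
    }
};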
Exercise

Estimate the space usage (and
savings compared to 9.5 MB) with
blocking, for block sizes of k = 4, 8
and 16
IXE: IndeXing Engine
Design Goals

Specialized tool (indexing and search)
 C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance

High performance
 Scalability
 Simple to maintain
– Hard to deal with autoconf, autoheader,
automake, configure, libtool, …
– Developed my own Make templates
Keep it as simple as possible but not
simpler.
Albert Einstein
Lexicon

[Figure: a bigram index (ab: 24930, ac: 24931, ad: 24932, ae: 24933, …)
points into the word index, where terms are stored contiguously
(e.g. ate0cent0cute0rial0), which in turn points into the Postings.]
Extreme compression (see MG)

Front-coding:
– Sorted words commonly have long common
prefix – store differences only (for 3 in 4)

Using perfect hashing to store terms
“within” their pointers
– not good for vocabularies that change

Partition dictionary into pages
– use B-tree on first terms of pages
– pay a disk seek to grab each page
– if we’re paying 1 disk seek anyway to get the
postings, “only” another seek/query term
Is it worth it?

Average lexicon search time:
– IXE: 8 msec
– Front coding: 6 msec

Average query time: 300 msec
Number Encoding
Whole chapter in Managing
Gigabytes
 Best solution: local Bernoulli using
Golomb coding

– Roughly: quotient (unary) + remainder
(binary)
– Compression: ~1 bit per posting

Quick and clean solution: eptacode
Eptacode

Use 7 bits in a byte, the sign bit is the continuation flag.

Bytes   Eptacode        Golomb
1       127             127
2       16129           16384
3       2048383         2097152
4       260144641       268435456
5       33038369407     34359738368
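
A sketch of eptacode-style encoding as described above (7 data bits per byte, high bit used as the continuation flag); the exact bit layout used by IXE may differ:

#include <cstdint>
#include <vector>

// Encode n using 7 data bits per byte; the high bit is set on every
// byte except the last one of the number, to signal "continue".
std::vector<uint8_t> eptaEncode(uint64_t n) {
    std::vector<uint8_t> bytes;
    do {
        bytes.push_back(n & 0x7F);    // least-significant 7-bit group first
        n >>= 7;
    } while (n != 0);
    for (size_t i = 0; i + 1 < bytes.size(); ++i)
        bytes[i] |= 0x80;             // mark all but the last byte
    return bytes;
}

uint64_t eptaDecode(const std::vector<uint8_t>& bytes) {
    uint64_t n = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        n |= static_cast<uint64_t>(b & 0x7F) << shift;
        shift += 7;
        if ((b & 0x80) == 0) break;   // last byte of this number
    }
    return n;
}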
Golomb drawbacks

Need to store the base for the quotient
Postings are non-consecutive
Result: 30% increase in size of index

[Chart: index size and query time compared for no compression,
Golomb, and eptacode.]
Fundamental Ideas

Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk is the same as the one used by the algorithms
Rely on good data structures and algorithms
– STL
Specialize data structures
– For indexing
– For search
Indexing

Posting Lists are created in memory
– Provide as much memory as possible to
indexing machines
When size of lists reaches a
threshold, dump partial index to disk
 Perform final merging of partial
indexes
 Merging operation used also for:

– Incremental indexing
– Distributed indexing
Search

Search mmaps the index:
– the lexicon completely
– postings on demand (too big)
Can't be done while indexing, and vice versa
However one can:
– dynamically add a collection with new documents to search
– mark documents as deleted
Index Structure

Full-text index file
Postings file

Full-text Index File Structure
FileHeader
Column_0 (Lexicon)
…
Column_n (Lexicon)
StopWords
Colors
Colors

Generalization of Google hits properties (anchor, size, capitalization)
Similar to Fulcrum zones
Used for ranking
– E.g. title words contribute more to the rank of a document
and for selective queries:
  text matches author = attardi
Query processing exercises
If the query is friends AND romans
AND (NOT countrymen), how could
we use the freq of countrymen?
 How can we perform the AND of two
postings entries without explicitly
building the 0/1 term-doc incidence
vector?

Boolean queries: Exact match

An algebra of queries using AND, OR and
NOT together with query words
– Uses “set of words” document representation
– Precise: document matches condition or not

Primary commercial retrieval tool for 3
decades
– Researchers had long argued superiority of
ranked IR systems, but not much used in practice
until spread of web search engines
– Professional searchers still like boolean queries:
you know exactly what you’re getting
• Cf. Google’s boolean AND criterion
Query optimization
Consider a query that is an AND of t
terms
 The idea: for each of the t terms, get
its term-doc incidence from the
postings, then AND together
 Process in order of increasing freq:

– start with smallest set, then keep
cutting further
This is why
we kept freq
in dictionary
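
A sketch of ANDing two postings lists by merging (one answer to the earlier exercise), assuming each list is a sorted vector of doc ids:

#include <vector>
#include <cstdint>

// Intersect two postings lists, each sorted by increasing doc id.
std::vector<uint32_t> intersect(const std::vector<uint32_t>& a,
                                const std::vector<uint32_t>& b) {
    std::vector<uint32_t> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j])     { out.push_back(a[i]); ++i; ++j; }
        else if (a[i] < b[j]) ++i;
        else                  ++j;
    }
    return out;
}
// For a t-term AND query, start from the rarest term's list and
// intersect the remaining lists in order of increasing frequency.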
Small Adaptive Set Intersection

Query compiler
– One cursor on posting lists for each
node
– CursorWord, CursorAnd, CursorOr,
CursorPhrase

QueryCursor.next(Result& min)
– Returns first result r >= min

Single operator for all kind of
queries: e.g. proximity
SASI example

world:  3  9  12  20  40  47
wide:   1  8  10  25  40  41
web:    2  4   6  21  30  35  40
Speeding up postings merges
Insert skip pointers
 Say our current list of candidate
docs for an AND query is 8,13,21

– (having done a bunch of ANDs)
We want to AND with the following
postings entry:
2,4,6,8,10,12,14,16,18,20,22
 Linear scan is slow

Skip pointers or skip lists

At indexing time: augment postings with skip pointers
2,4,6,8,10,12,14,16,18,20,22,24, ...
At query time: as we walk the current candidate list, concurrently
walk the inverted file entry - we can skip ahead
– (e.g., 8,21)
Skip size: recommend about √(list length)
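
A sketch of using skip pointers during intersection, assuming each list stores, for some positions, the index further ahead that it may jump to (the representation is illustrative):

#include <vector>
#include <cstdint>

struct SkipList {
    std::vector<uint32_t> docs;   // sorted doc ids
    std::vector<int>      skip;   // skip[i] = index further ahead, or -1
};

// When a.docs[i] is too small and its skip pointer still lands on a
// value <= b.docs[j], jump ahead instead of stepping one at a time.
std::vector<uint32_t> intersectWithSkips(const SkipList& a, const SkipList& b) {
    std::vector<uint32_t> out;
    size_t i = 0, j = 0;
    while (i < a.docs.size() && j < b.docs.size()) {
        if (a.docs[i] == b.docs[j]) { out.push_back(a.docs[i]); ++i; ++j; }
        else if (a.docs[i] < b.docs[j]) {
            if (a.skip[i] >= 0 && a.docs[a.skip[i]] <= b.docs[j]) i = a.skip[i];
            else ++i;
        } else {
            if (b.skip[j] >= 0 && b.docs[b.skip[j]] <= a.docs[i]) j = b.skip[j];
            else ++j;
        }
    }
    return out;
}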
General query optimization

e.g. (madding OR crowd) AND
(ignoble OR strife)
– Can put any boolean query into CNF
Get freq’s for all terms
 Estimate the size of each OR by the
sum of its freq’s (conservative)
 Process in increasing order of OR
sizes

Exercise

Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812
IXE Architecture

[Diagram: a Crawler fills a local cache of documents; the Indexer reads
the cache and builds the Lexicon and the Postings, both accessed via
mmap; a Table<DocInfo> (the Document Store, also mmapped) holds
per-document metadata such as name, time, size, title, summary, type.]
Storing Objects in Relational Tables

SQL:

create table video (
    name    varchar(256),
    caption varchar(2048),
    format  INT,
    PRIMARY KEY(name)
)

Template metaprogramming (the META attribute declares which fields are
indexed):

class Video : public DocInfo {
    char* name;
    char* caption;
    int   format;

    META(Video, (SUPERCLASS(DocInfo),
                 VARKEY(name, 256),
                 VARFIELD(caption, 2048),
                 FIELD(format)));
};
Programming Applications (C++)

Collection<Video> videos("CNN");
videos.insert(video1);

Query q("caption MATCHES Jordan and format = wav");
Cursor<Video> cursor(videos, q);
while (cursor.MoveNext())
    cout << cursor.Current();
Single cursor operator

struct QueryResult {
    CollectionID cid;
    DocID        did;
    Position     pos;
};

QueryResult qmin;
cursor.next(qmin);

Returns the next result qr (a document, or a word position within a
document) such that qr >= qmin.

Normal search: pos = 0
Proximity search: pos = i
Multiple collections search (increment cid or select cid)
'where' clauses (e.g. date > 1/1/2002)
Boolean combinations
Performance

An independent benchmark

[Chart: indexing throughput (doc/sec) and retrieval throughput
(query/sec) on Intel hardware, comparing AltaVista and IXE.]
Independent evaluations
Major portal, Germany
 Major portal, France
 Major portal, Italy

– Stress test with 300 concurrent queries
– Verity crashed in several cases

Microsoft Redmond
TREC Terabyte 2004

GOV2 collection:
– ~ 25 million documents from the .gov domain
– ~ 500 GB of documents

IXE index split into 23 shards

Data Structure                     Size
Lexicon                            4.2 GB
Posting Lists (including offsets)  62.0 GB
Metadata                           26.0 GB
Document cache (optional)          84.0 GB
Total                              176.2 GB
TREC Terabyte 2004

[Chart: average query time (sec), on a scale from 0 to 2, for
Monash U., Pisa, MSR Asia, U. Amst., CMU, Sabir, Dublin C.U., RMIT.]
Distributed Search Architecture

[Diagram: queries arrive at a multithreaded HTTP server and are passed
to Broker Modules; each broker dispatches the query via async I/O to
several Query Servers, each of which searches its own shard index.]
Other Features
Snippets
 Document cache
 Colors
 Multiple collections

– Sorted by page rank
– Authoritativeness
– Popularity

Filter/Group by similarity
Index Compression

Impact on search

Binary search down to a 4-term block, then linear search through the
terms in the block.

8 terms, plain binary tree: avg. = 2.6 compares
  = (1 + 2·2 + 4·3 + 4)/8
Blocks of 4 (binary tree over blocks): avg. = 3 compares
  = (1 + 2·2 + 2·3 + 2·4 + 5)/8
Compression: Two alternatives

Lossless compression: all information is
preserved, but we try to encode it
compactly
– What IR people mostly do

Lossy compression: discard some
information
– Using a stoplist can be thought of in this way
– Techniques such as Latent Semantic Indexing
can be viewed as lossy compression
– One could prune from postings entries unlikely
to turn up in the top k list for query on word
• Especially applicable to web search with huge
numbers of documents but short queries
• e.g., Carmel et al. SIGIR 2002
Caching
If 25% of your users are searching
for
Britney Spears
then you probably do need spelling
correction, but you don’t need to
keep on intersecting those two
postings lists
 Web query distribution is extremely
skewed, and you can usefully cache
results for common queries

Query vs. index expansion

Recall:
– thesauri for term equivalents
– soundex for homonyms
How do we use these?
– Can "expand" the query to include equivalences
  • Query car tyres → car tyres automobile tires
– Can expand the index
  • Index docs containing car under automobile, as well
Query expansion

Usually do query expansion
– No index blowup
– Query processing slowed down
  • Docs frequently contain equivalences
– May retrieve more junk
  • puma → jaguar
– Carefully controlled wordnets
Wild-card queries: *

mon*: find all docs containing any word beginning "mon"
Easy with a binary tree (or B-Tree) lexicon: retrieve all words in the
range mon ≤ w < moo
*mon: find words ending in "mon": harder
Permuterm index: for the word hello, index it under:
– hello$, ello$h, llo$he, lo$hel, o$hell
Queries:
– X    lookup on X$        X*    lookup on X*$
– *X   lookup on X$*       *X*   lookup on X*
– X*Y  lookup on Y$X*      X*Y*Z ??? Exercise!
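
A sketch of generating the permuterm rotations for a term, following the hello$ example above (whether the bare "$hello" rotation is also indexed varies by convention):

#include <string>
#include <vector>

// All rotations of word + "$": for "hello" this yields
// hello$, ello$h, llo$he, lo$hel, o$hell, $hello.
std::vector<std::string> permutermKeys(const std::string& word) {
    std::string t = word + "$";
    std::vector<std::string> keys;
    for (size_t i = 0; i < t.size(); ++i)
        keys.push_back(t.substr(i) + t.substr(0, i));
    return keys;
}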
Wild-card queries

Permuterm problem: ≈ quadruples lexicon size
Another way: index all k-grams occurring in any word (any sequence of
k chars)
e.g., from the text "April is the cruelest month" we get the 2-grams
(bigrams)
  $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,
  $m,mo,on,nt,h$
– $ is a special word boundary symbol
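
A sketch of extracting the k-grams of a single word with the boundary symbol, assuming k = 2 as in the example above:

#include <string>
#include <set>

// 2-grams of a word, including the $ boundary symbol:
// kgrams("month") = { "$m", "mo", "on", "nt", "th", "h$" }.
std::set<std::string> kgrams(const std::string& word, size_t k = 2) {
    std::set<std::string> grams;
    std::string t = "$" + word + "$";
    for (size_t i = 0; i + k <= t.size(); ++i)
        grams.insert(t.substr(i, k));
    return grams;
}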
Processing n-gram wild-cards

Query mon* can now be run as
– $m AND mo AND on
Fast, space efficient
But we'd get a match on moon
– Must post-filter these results against the query
Further wild-card refinements
– Cut down on pointers by using blocks
– Wild-card queries tend to have few bigrams
  • keep postings on disk
– Exercise: given a trigram index, how do you process an arbitrary
  wild-card query?
Phrase search

Search for "to be or not to be"
No longer suffices to store only <term: docs> entries
But could just do this anyway, and then post-filter [i.e., grep] for
phrase matches
– Viable if phrase matches are uncommon
Alternatively, store, for each term, entries
– <number of docs containing term;
–  doc1: position1, position2 … ;
–  doc2: position1, position2 … ;
–  etc.>
Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of these docs could contain "to be or not to be"?

Can compress position values/offsets as we did with docs in the last
lecture
Nevertheless, this expands the postings list substantially in size
Processing a phrase query

Extract inverted index entries for each distinct term: to, be, or, not
Merge their doc:position lists to enumerate all positions where
"to be or not to be" begins.

to: 2:1,17,74,222,551; 4:8,27,101,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...

Same general method for proximity searches
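
A sketch of the positional merge for a two-term phrase ("to" immediately followed by "be"), assuming each term's postings map doc ids to sorted position lists; a full phrase query chains this pairwise:

#include <map>
#include <vector>
#include <algorithm>
#include <cstdint>

using Postings = std::map<uint32_t, std::vector<uint32_t>>;  // doc -> positions

// Docs (with positions) where `first` occurs at position p and
// `second` occurs at position p + 1.
Postings phrasePair(const Postings& first, const Postings& second) {
    Postings out;
    for (const auto& entry : first) {
        auto it = second.find(entry.first);           // same document?
        if (it == second.end()) continue;
        const std::vector<uint32_t>& pos2 = it->second;
        for (uint32_t p : entry.second)
            if (std::binary_search(pos2.begin(), pos2.end(), p + 1))
                out[entry.first].push_back(p + 1);    // position of `second`,
                                                      // so longer phrases chain
    }
    return out;
}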
Example: WestLaw
http://www.westlaw.com/

Largest commercial (paying subscribers)
legal search service (started 1975; ranking
added 1992)
 About 7 terabytes of data; 700,000 users
 Majority of users still use boolean queries
 Example query:
– What is the statute of limitations in cases
involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT
/3 CLAIM

Long, precise queries; proximity operators;
incrementally developed; not like web
search
Index size

Stemming/case folding cut
– number of terms by ~40%
– number of pointers by 10-20%
– total space by ~30%

Stop words
– Rule of 30: ~30 words account for ~30%
of all term occurrences in written text
– Eliminating 150 commonest terms from
indexing will cut almost 25% of space
Positional index size

Need an entry for each occurrence, not just once per document
Index size depends on average document size (Why?)
– Average web page has <1000 terms
– SEC filings, books, even some epic poems … easily 100,000 terms

Consider a term with frequency 0.1%:

Document size   Postings   Positional postings
1,000           1          1
100,000         1          100
Rules of thumb

Positional index size is a factor of 2-4 over a non-positional index
Positional index size is 35-50% of the volume of the original text
Caveat: all of this holds for "English-like" languages
Index construction
Thus far, considered index space
 What about index construction time?
 What strategies can we use with
limited main memory?

Somewhat bigger corpus

Number of docs = n = 40 M
Number of terms = m = 1 M
Use Zipf to estimate the number of postings entries:
  n + n/2 + n/3 + …. + n/m ~ n ln m = 560 M entries
  (check for yourself)
No positional info yet
Recall index construction

Documents are parsed to extract words and these are saved with the
Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus
killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar
was ambitious

(Term / Doc # table in parse order, as shown earlier.)

Key step

After all documents have been parsed, the inverted file is sorted by
terms (the same sorted Term / Doc # table as shown earlier).
Index construction

As we build up the index, cannot
exploit compression tricks
– parse docs one at a time, final postings
entry for any term incomplete until the
end
– (actually you can exploit compression,
but this becomes a lot more complex)

At 10-12 bytes per postings entry,
demands several temporary
gigabytes
System parameters for design
Disk seek ~ 1 millisecond
 Block transfer from disk ~ 1
microsecond per byte (following a
seek)
 All other ops ~ 10 microseconds

Bottleneck
Parse and build postings entries one
doc at a time
 To now turn this into a term-wise
view, must sort postings entries by
term (then by doc within each term)
 Doing this with random disk seeks
would be too slow

If every comparison took 1 disk seek, and n items could be
sorted with nlog2n comparisons, how long would this take?
Sorting with fewer disk seeks

12-byte (4+4+4) records (term, doc, freq)
 These are generated as we parse docs
 Must now sort 560M such 12-byte records
by term
 Define a Block = 10M such records
– can “easily” fit a couple into memory

Will sort within blocks first, then merge
the blocks into one long sorted order
Sorting 56 blocks of 10M records

First, read each block and sort within:
– Quicksort takes about 2 x (10M ln 10M) steps
Exercise: estimate the total time to read each block from disk and
quicksort it
56 times this estimate - gives us 56 sorted runs of 10M records each
Need 2 copies of the data on disk, throughout
Merging 56 sorted runs

Merge tree of log2 56 ≈ 6 layers
During each layer, read runs into memory in blocks of 10M, merge,
write back

[Diagram: pairs of sorted runs read from disk, merged, and written
back, layer by layer.]
Merging 56 runs

Time estimate for disk transfer:
  6 layers x 56 runs x (120 MB per block x 10^-6 sec/byte block
  transfer time) x 2 (read + write) ~ 22 hours

At each stage the run size doubles but the number of runs halves.
Exercise - fill in this table

Step                                              Time
1  56 initial quicksorts of 10M records each
2  read 2 sorted blocks for merging, write back
3  merge 2 sorted blocks
4  add (2) + (3) = time to read/merge/write
5  56 times (4) = total merge time
Large memory indexing

Suppose instead that we had 16GB of
memory for the above indexing task.
 Exercise: how much time to index?
 Repeat with a couple of values of n, m.
 In practice, spidering interlaced with
indexing.
– Spidering bottlenecked by WAN speed and
many other factors - more on this later
Improving on merge tree

Compressed temporary files
– compress terms in temporary dictionary runs
Merge more than 2 runs at a time
– maintain a heap of candidates from each run

[Diagram: a heap drawing the current smallest candidate from each of
several sorted runs.]
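
A sketch of merging more than two runs at once with a heap of candidates, assuming each run is already sorted; types are illustrative, and a real indexer would stream postings records from disk rather than ints from memory:

#include <queue>
#include <vector>
#include <utility>
#include <cstdint>

// Merge many sorted runs in one pass using a min-heap of (value, run) pairs.
std::vector<uint32_t> mergeRuns(const std::vector<std::vector<uint32_t>>& runs) {
    using Item = std::pair<uint32_t, size_t>;                 // (value, run index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    std::vector<size_t> next(runs.size(), 0);                 // cursor per run
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<uint32_t> out;
    while (!heap.empty()) {
        auto [value, r] = heap.top();
        heap.pop();
        out.push_back(value);                                 // emit smallest candidate
        if (++next[r] < runs[r].size())
            heap.push({runs[r][next[r]], r});                 // refill from that run
    }
    return out;
}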
Indexing speed in practice

From TREC TeraByte 2004:
24-38 GB/hour on a 1GHz Pentium PC
(depending on HTML parser)
Dynamic indexing

Docs come in over time
– postings updates for terms already in
dictionary
– new terms added to dictionary

Docs get deleted
Simplest approach
Maintain “big” main index
 New docs go into “small” auxiliary index
 Search across both, merge results
 Deletions

– Invalidation bit-vector for deleted docs
– Filter docs output on a search result by this
invalidation bit-vector

Periodically, re-index into one main index
More complex approach
Fully dynamic updates
 Only one index at all times

– No big and small indices

Active management of a pool of
space
Fully dynamic updates

Inserting a (variable-length) record
– e.g., a typical postings entry
Maintain a pool of (say) 64KB chunks
The chunk header maintains metadata on the records in the chunk and on
its free space

[Chunk layout: Header | Record | Record | Record | Record | Free space]
Global tracking
In memory, maintain a global record
address table that says, for each
record, the chunk it’s in.
 Define one chunk to be current.
 Insertion

– if current chunk has enough free space
• extend record and update metadata.
– else look in other chunks for enough
space
– else open new chunk
Changes to dictionary

New terms appear over time
– cannot use a static perfect hash for
dictionary

OK to use term character string
w/pointers from postings as in
lecture 2
Index on disk vs. memory

Most retrieval systems keep the dictionary
in memory and the postings on disk
 Web search engines frequently keep both
in memory
– massive memory requirement
– feasible for large web service installations,
less so for standard usage where
• query loads are lighter
• users willing to wait 2 seconds for a response

More on this when discussing deployment
models
Distributed indexing

Suppose we had several machines available to do the indexing
– how do we exploit the parallelism?
Two basic approaches
– stripe by dictionary terms as the index is built up
– stripe by documents
Indexing in the real world

Typically, don’t have all documents sitting
on a local filesystem
– Documents need to be spidered
– Could be dispersed over a WAN with varying
connectivity
– Must schedule distributed spiders/indexers
– Could be (secure content) in
• Databases
• Content management applications
• Email applications

http often not the most efficient way of
fetching these documents - native API
fetching
Indexing in the real world

Documents in a variety of formats
– word processing formats (e.g., MS Word)
– spreadsheets
– presentations
– publishing formats (e.g., pdf)
Generally handled using format-specific "filters"
– convert format into text + meta-data
Documents in a variety of languages
– automatically detect language(s) in a document
– tokenization, stemming are language-dependent