INF 2914
Information Retrieval and Web Search
Lecture 3: Parsing/Tokenization/Storage
These slides are adapted from Stanford’s
class CS276 / LING 286
Information Retrieval and Web Mining
(Offline) Search Engine Data Flow
Crawler
  Output: web page
Parse & Tokenize
  - Parse
  - Tokenize
  - Per page analysis
  Output: tokenized web pages
Global Analysis
  - Dup detection
  - Static rank comp
  - Anchor text
  - …
  Output: dup table, rank table, anchor text
Index Build
  - Scan tokenized web pages, anchor text, etc
  - Generate text index
  Output: inverted text index
(The diagram also carries the labels 1 to 4 and "in background".)
Inverted index
Dictionary -> Postings lists (each docID in a list is a posting)
Brutus -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 13 21 34
Caesar -> 13 16
Sorted by docID (more later on why).
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
-> Tokenizer
Token stream: Friends Romans Countrymen
-> Linguistic modules
Modified tokens: friend roman countryman
-> Indexer
Inverted index:
friend -> 2 4
roman -> 1 2
countryman -> 13 16
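The tokenizer, linguistic modules and indexer can be sketched in a few lines of Python. This is only an illustration: the function names, the crude plural-stripping rule standing in for the linguistic modules, and the sample documents are all assumptions, not part of the lecture.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Split on anything that is not a letter or digit (a deliberately crude policy).
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def normalize(token):
    # Stand-in for the linguistic modules: lower-case and strip a trailing "s".
    # A real system would plug in a stemmer or lemmatizer here.
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(docs):
    # docs maps docID -> text; each postings list is kept sorted by docID.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical documents chosen so the postings match the slide.
docs = {1: "Romans march", 2: "Friends, Romans, countrymen.", 4: "A friend in need"}
index = build_index(docs)
print(index["friend"])  # [2, 4]
print(index["roman"])   # [1, 2]
```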
Plan for this lecture
The Dictionary
Storage
Parsing
Tokenization
What terms do we put in the index?
Log structured file systems
XML Introduction
Parsing a document
What format is it in?
pdf/word/excel/html?
What language is it in?
What character set is in use?
Each of these is a classification problem.
But these tasks are often done heuristically …
Complications: Format/language
Documents being indexed can include docs from
many different languages
Sometimes a document or its components can
contain multiple languages/formats
A single index may have to contain terms of
several languages.
French email with a German pdf attachment.
What is a unit document?
A file?
An email? (Perhaps one of many in an mbox.)
An email with 5 attachments?
A group of files (PPT or LaTeX in HTML)
Tokenization
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
Friends
Romans
Countrymen
Each such token is now a candidate for an index
entry, after further processing
Described below
But what are valid tokens to emit?
Tokenization
Issues in tokenization:
Finland’s capital
Finland? Finlands? Finland’s?
Hewlett-Packard
Hewlett and
Packard as two tokens?
State-of-the-art: break up hyphenated sequence.
co-education ?
the hold-him-back-and-drag-him-away-maneuver ?
It’s effective to get the user to put in possible hyphens
San Francisco: one token or two? How do you
decide it is one token?
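To make the choices concrete, here is a small sketch of two tokenization policies applied to the examples on this slide. Both regular expressions are assumptions made for illustration; neither is presented as the right answer.

```python
import re

def tokenize_split_all(text):
    # Policy A: every non-alphanumeric character is a separator.
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_keep_joiners(text):
    # Policy B: keep apostrophes and hyphens that sit inside a token.
    return re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text)

text = "Finland’s capital, Hewlett-Packard, co-education"
print(tokenize_split_all(text))
# ['Finland', 's', 'capital', 'Hewlett', 'Packard', 'co', 'education']
print(tokenize_keep_joiners(text))
# ['Finland’s', 'capital', 'Hewlett-Packard', 'co-education']
```

Neither policy handles "San Francisco": deciding that two space-separated words form one token needs extra knowledge, such as a phrase list.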
Numbers
3/12/91
Mar. 12, 1991
55 B.C.
B-52
My PGP key is 324a3df234cb23e
100.2.86.144
Often, don’t index as text.
But often very useful: think about things like looking up
error codes/stacktraces on the web
(One answer is using n-grams, lectures 6 and 7)
Will often index “meta-data” separately
Creation date, format, etc.
Tokenization: Language issues
L'ensemble one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
German noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
Tokenization: language issues
Chinese and Japanese have no spaces between
words:
莎拉波娃现在居住在美国东南部的佛罗里达。
Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple
alphabets intermingled
Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana
Hiragana
Kanji
Romaji
End-user can express query entirely in hiragana!
Tokenization: language issues
Arabic (or Hebrew) is basically written right to
left, but with certain items like numbers written
left to right
Words are separated, but letter forms within a
word form complex ligatures
استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
Reading order: ← → ← → ← start
‘Algeria achieved its independence in 1962 after 132
years of French occupation.’
With Unicode, the surface presentation is complex,
but the stored form is straightforward
Normalization
Need to “normalize” terms in indexed text as well
as query terms into the same form
We want to match U.S.A. and USA
We most commonly implicitly define equivalence
classes of terms
e.g., by deleting periods in a term
Alternative is to do asymmetric expansion:
Enter: window -> Search: window, windows
Enter: windows -> Search: windows
Potentially more powerful, but less efficient
Execute queries in parallel or do a second pass over the
index
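The two approaches can be contrasted in a short sketch. The `normalize` rule and the `EXPANSIONS` table are hypothetical and only mirror the examples on this slide.

```python
def normalize(term):
    # Equivalence classing: the same rule is applied to indexed text and to
    # query terms, so both sides end up in the same form.
    return term.replace(".", "")

# Asymmetric expansion: terms are indexed as written and the query is
# expanded instead. This table is a made-up example mirroring the slide.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["windows"],
}

def expand_query(term):
    # One query term may turn into several index lookups.
    return EXPANSIONS.get(term, [term])

print(normalize("U.S.A.") == normalize("USA"))  # True
print(expand_query("window"))                   # ['window', 'windows']
print(expand_query("windows"))                  # ['windows']
```

The expansion table can differ per term and per direction, which is where the extra power comes from. The cost is several lookups per query term instead of one.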
Normalization: other languages
Accents: résumé vs. resume.
Most important criterion:
How are your users likely to write their queries for
these words?
Even in languages that have accents, users often
may not type them
German: Tuebingen vs. Tübingen
Should be equivalent
Normalization: other languages
Need to “normalize” indexed text as well as query
terms into the same form
7月30日 vs. 7/30
Character-level alphabet detection and
conversion
Tokenization not separable from this.
Sometimes ambiguous:
Morgen will ich in MIT …
Is this
German “mit”?
Case folding
Reduce all letters to lower case
exception: upper case (in mid-sentence?)
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Often best to lower case everything, since users
will use lowercase regardless of ‘correct’
capitalization…
Stop words
With a stop list, you exclude the commonest words from the dictionary
entirely. Intuition:
They have little semantic content: the, a, and, to, be
They take a lot of space: ~30% of postings for the top 30 words
But the trend is away from doing this:
Good compression techniques mean the space for including
stopwords in a system is very small
Good query optimization techniques mean you pay little at query time
for including stop words.
You need them for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to London”
Thesauri and soundex
Handle synonyms and homonyms
Hand-constructed equivalence classes
e.g., car = automobile
color = colour
Rewrite to form equivalence classes
Index such equivalences
When the document contains automobile, index it
under car as well (usually, also vice-versa)
Or expand query?
When the query contains automobile, look under
car as well
Soundex
Traditional class of heuristics to expand a query
into phonetic equivalents
Language specific – mainly for names
E.g., chebyshev -> tchebycheff
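As one concrete member of this class of heuristics, here is a sketch of the classic American Soundex code (first letter plus three digits). The slide does not say which variant is meant, so treat the letter groups and the h/w handling below as an assumption.

```python
# Consonant groups of the classic Soundex scheme.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    digits = []
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Keep the first letter, pad with zeros, truncate to four characters.
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

A query term and an indexed name are treated as phonetic equivalents when their codes are equal.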
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be different color
Lemmatization implies doing “proper” reduction
to dictionary headword form
Stemming
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping
language dependent
e.g., automate(s), automatic, automation all
reduced to automat.
Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm
Commonest algorithm for stemming English
Results suggest at least as good as other
stemming options
Conventions + 5 phases of reductions
phases applied sequentially
each phase consists of a set of commands
sample convention: Of the rules in a compound
command, select the one that applies to the
longest suffix.
Typical rules in Porter
sses -> ss
ies -> i
ational -> ate
tional -> tion
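The longest-suffix convention can be illustrated with these four rules. This sketch is a simplification and not Porter's algorithm: it treats the rules as one compound command and ignores the conditions the real algorithm places on the remaining stem.

```python
# The four sample rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word, rules=RULES):
    # Collect every rule whose suffix matches, then apply the one with the
    # longest suffix (the convention quoted on the previous slide).
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda m: len(m[0]))
    return word[: len(word) - len(suffix)] + repl

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_rules(w))
# caresses -> caress
# ponies -> poni
# relational -> relate
# conditional -> condition
```

"relational" matches both "ational" and "tional". The longest-suffix rule picks "ational", giving "relate" rather than "relation".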
Other stemmers
Other stemmers exist, e.g., Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single-pass, longest suffix removal (about 250
rules)
Motivated by linguistics as well as IR
Full morphological analysis – at most modest
benefits for retrieval
Do stemming and other normalizations help?
Often very mixed results: really help recall for
some queries but harm precision on others
Language-specificity
Many of the above features embody
transformations that are
Language-specific and
Often, application-specific
These are “plug-in” addenda to the indexing
process
Both open source and commercial plug-ins
available for handling these
Index Build Flow - Overview
Crawled documents -> Per document analysis -> Global analysis -> Indexing -> Search indexes
Per document analysis
Multi-format parsing
Handles different file types (HTML, PDF,
PowerPoint, etc)
Multi-language tokenization, stemming,
synonyms, user-defined annotations, etc.
Per document analysis is typically the bottleneck
of the index build process
50 times slower than I/O
Indexing can be done at I/O speed
Incorporating per document analysis
Per document analysis is much slower than
indexing
Store tokenized documents in a scalable
document store
Crawled documents -> Per document analysis -> Document store
Storage
Document store
Log-structured file system
Only the most recent version of each document is accessible
No in place updates
Documents are grouped into bundles to optimize I/O
Typically built over the file system
3 basic operation modes
Document insertion (during per-document analysis)
Sequential access for index build
Random access during query processing
Store design (1/5)
Bundle disk layout
Fixed bundle size (for instance 8MB)
All fields are 64-bit aligned
All fields are binary
Store does not know how to interpret fields
Compression
Fields are tokens, anchor text, URL, shingle, statistics, etc.
Layout (diagram):
Header: # docs, then for each document: hash(URL), timestamp, attributes, offset
Doc 1: # fields, then for each field: field ID, length, offset; then the field data
Doc 2: # fields, then for each field: field ID, length, offset; then the field data
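A per-document record of this shape can be sketched with Python's `struct` module. The slide gives only the field names, so everything concrete here is an assumption: 64-bit little-endian integers, offsets measured from the start of the document record, and zero padding to keep each field 64-bit aligned.

```python
import struct

def pad8(payload: bytes) -> bytes:
    # Pad with zero bytes so the next field starts on a 64-bit boundary.
    return payload + b"\x00" * (-len(payload) % 8)

def pack_doc(fields: dict) -> bytes:
    # Layout: [# fields] [(field ID, length, offset) per field] [field data ...]
    n = len(fields)
    table_size = 8 + 24 * n          # one 8-byte count plus three 8-byte ints per field
    table = struct.pack("<Q", n)
    data = b""
    for field_id, payload in fields.items():
        table += struct.pack("<QQQ", field_id, len(payload), table_size + len(data))
        data += pad8(payload)
    return table + data

def unpack_doc(buf: bytes) -> dict:
    # The store only slices bytes; it never interprets field contents.
    (n,) = struct.unpack_from("<Q", buf, 0)
    fields = {}
    for i in range(n):
        field_id, length, offset = struct.unpack_from("<QQQ", buf, 8 + 24 * i)
        fields[field_id] = buf[offset:offset + length]
    return fields

# Hypothetical field IDs: 1 = tokens, 2 = URL.
doc = pack_doc({1: b"friend roman countryman", 2: b"http://example.com/"})
print(len(doc) % 8)          # 0
print(unpack_doc(doc)[2])    # b'http://example.com/'
```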
Store design (2/5)
Document insertion uses a double buffering
algorithm and asynchronous I/O
Try to fit as many documents as it can in a bundle
Schedule write for bundle
Start writing the next bundle
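The insertion loop can be sketched as follows. The class and its `sink` callback are hypothetical, and a single worker thread stands in for asynchronous I/O, so this shows the control flow only.

```python
from concurrent.futures import ThreadPoolExecutor

class BundleWriter:
    def __init__(self, bundle_size, sink):
        self.bundle_size = bundle_size
        self.sink = sink              # callable that persists one bundle (a list of docs)
        self.current = []             # buffer being filled
        self.used = 0
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = None           # write of the other buffer, if still in flight

    def insert(self, doc: bytes):
        # Fit as many documents as possible in a bundle before flushing.
        if self.current and self.used + len(doc) > self.bundle_size:
            self._flush()
        self.current.append(doc)
        self.used += len(doc)

    def _flush(self):
        if self.pending is not None:
            self.pending.result()     # two buffers: at most one write in flight
        full, self.current, self.used = self.current, [], 0
        self.pending = self.pool.submit(self.sink, full)  # schedule the write

    def close(self):
        if self.current:
            self._flush()
        if self.pending is not None:
            self.pending.result()
        self.pool.shutdown()

written = []
writer = BundleWriter(bundle_size=10, sink=written.append)
for doc in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
    writer.insert(doc)
writer.close()
print(written)  # [[b'aaaa', b'bbbb'], [b'cccc', b'dd']]
```

While one bundle is being written, `insert` keeps filling the other buffer, which is the point of double buffering.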
Store design (3/5)
Store is sequentially scanned during index build
and global analysis
A double buffering algorithm with asynchronous
I/O is also used here
Return only the newest version of each
document
Store is accessed in reverse order
LFS semantics
currentBuffer: Bundle# 1053
nextBuffer: Bundle# 1052
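Reading the store in reverse makes "newest version only" simple: the first time a document key is seen, it is the newest copy. A minimal sketch follows, assuming bundles are held as in-memory lists of hypothetical (url_hash, payload) pairs and leaving out the buffering.

```python
def scan_newest(bundles):
    # bundles: list in write order; inside a bundle, later entries are newer.
    seen = set()
    for bundle in reversed(bundles):
        for url_hash, payload in reversed(bundle):
            if url_hash not in seen:      # older versions are skipped
                seen.add(url_hash)
                yield url_hash, payload

bundles = [
    [("u1", "D1"), ("u2", "D2")],
    [("u3", "D3"), ("u1", "D1'")],
]
print(list(scan_newest(bundles)))
# [('u1', "D1'"), ('u3', 'D3'), ('u2', 'D2')]
```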
Store design (4/5)
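Store cleanup algorithm
“Smarter” algorithms can be used if we are not I/O bound
Avoid seeks
[Figure: store cleanup from Store i to Store i+1. New documents (D1’, D5’, D6) are written to a bundle of Store i+1 and set in a Bloom filter. Each document in the bundles of Store i (D1, D2, D3, D4, D5) is probed against the filter; D2, D3 and D4 are copied to Store i+1, while D1 and D5, superseded by D1’ and D5’, are garbage collected (*).]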
Bloom Filters (1/2)
Compact data structures for a probabilistic
representation of a set
Appropriate to answer membership queries
False positives!
Bloom Filters (2/2)
Bit vector v of m bits
Element a: set the bits at positions H1(a) = P1, H2(a) = P2, H3(a) = P3, H4(a) = P4 to 1
Query for b: check the bits at positions H1(b), H2(b), ..., H4(b).
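A minimal Bloom filter sketch follows. Deriving the k hash functions from SHA-256 with a counter prefix is an assumption made for illustration; any family of independent hash functions would do.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k                 # m bits, k hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: bytes):
        # H_i(item): hash the item with a different counter prefix per function.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        # All k bits set: "probably present". Any bit clear: definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(m=1024, k=4)
for key in [b"D1", b"D5", b"D6"]:
    bf.add(key)
print(b"D1" in bf)  # True: an added element is always found
```

A lookup for a key that was never added usually returns False, but it can return True when other keys happen to have set all k of its bits. That is the false positive the previous slide warns about.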
Store design (5/5)
During runtime the summarizer uses the store to
fetch the tokens (random access)
Store provides an API call for retrieving a set of
documents (e.g. 20) given their bundle numbers and
offsets in the file
Internally the store uses a buffer pool for
documents
Asynchronous I/O is used for exploiting
parallelism from the storage subsystem
Summarizer releases the documents after it is
done
Storage Issues
Performance
Fault tolerance
Distribution
Redundancy
Field compression
Google File System tries to address these issues
XML Introduction
Preliminaries: XML
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
Tree (diagram), nodes numbered in document order:
root x0
  conference x1
    name x2: PODS
    speaker x3
      name x4: Josifovski
      paper_cnt x5: 1
    speaker x6
      name x7: Fagin
      paper_cnt x8: 3
Preliminaries: XPath 1.0
/conference[name = PODS]/speaker[paper_cnt > 1]/name
Query (tree pattern): root -> conference [name = PODS] -> speaker [paper_cnt > 1] -> name
Document: root x0 -> conference x1, with name x2 = PODS, speaker x3 (name x4 = Josifovski, paper_cnt x5 = 1) and speaker x6 (name x7 = Fagin, paper_cnt x8 = 3)
Result: { x7 }
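The query can be checked by hand against the example document. The sketch below uses Python's `xml.etree.ElementTree`, whose path language is only a subset of XPath 1.0 with no numeric comparisons, so the two predicates are evaluated in plain Python. Stripping the surrounding whitespace from text values is also an assumption.

```python
import xml.etree.ElementTree as ET

DOC = """
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
"""

root = ET.fromstring(DOC)
result = []
# /conference[name = PODS]
if root.tag == "conference" and root.findtext("name", "").strip() == "PODS":
    # /speaker[paper_cnt > 1]
    for speaker in root.findall("speaker"):
        if int(speaker.findtext("paper_cnt", "0")) > 1:
            # /name
            result.extend(n.text.strip() for n in speaker.findall("name"))

print(result)  # ['Fagin'], the text of node x7
```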
XML Indexing
//article//section[
//title contains(‘Query Processing’) AND
//figure//caption contains(‘XML’)]
Query tree (diagram): article -> section; section -> title “Query Processing”; section -> figure -> caption “XML”
In an index-based method, 8 tags and text elements need to
be verified to process this query (lessons 6 and 7)
Position Encoding
Scheme #1: Begin/End/Level
Begin: preorder position of tag/text
End: preorder position of last descendent
Level: depth
R (0,7,0)
  A1 (1,5,1)
    B1 (2,2,2)
    B2 (3,5,2)
      C1 (4,4,3)
      D1 (5,5,3)
  B3 (6,7,1)
    C2 (7,7,2)
Containment: X contains Y iff
X.begin < Y.begin <= X.end (assuming well-formed)
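The encoding and the containment test can be reproduced with a preorder traversal. The nested-tuple tree representation below is an assumption chosen for brevity.

```python
# The example tree as (name, children) pairs.
TREE = ("R", [("A1", [("B1", []), ("B2", [("C1", []), ("D1", [])])]),
              ("B3", [("C2", [])])])

def encode(tree):
    positions = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        name, children = node
        begin = counter                  # preorder position of this node
        counter += 1
        for child in children:
            visit(child, level + 1)
        # End is the preorder position of the last descendant (or of the node itself).
        positions[name] = (begin, counter - 1, level)

    visit(tree, 0)
    return positions

def contains(x, y):
    # X contains Y iff X.begin < Y.begin <= X.end
    return x[0] < y[0] <= x[1]

pos = encode(TREE)
print(pos["B2"])                        # (3, 5, 2)
print(contains(pos["A1"], pos["D1"]))   # True
print(contains(pos["A1"], pos["C2"]))   # False
```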
Position Encoding
Scheme #2: Dewey
Position of element E = {position of parent}.n, where E is the
nth child of its parent
R (1)
  A1 (1.1)
    B1 (1.1.1)
    B2 (1.1.2)
      C1 (1.1.2.1)
      D1 (1.1.2.2)
  B3 (1.2)
    C2 (1.2.1)
Containment: X contains Y iff X is a prefix of Y
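A Dewey encoder for the same tree is sketched below. Positions are kept as tuples of integers rather than dotted strings, an implementation choice made here so that the prefix test cannot confuse 1.1 with 1.10.

```python
# The example tree as (name, children) pairs.
TREE = ("R", [("A1", [("B1", []), ("B2", [("C1", []), ("D1", [])])]),
              ("B3", [("C2", [])])])

def dewey(node, position=(1,), out=None):
    out = {} if out is None else out
    name, children = node
    out[name] = position
    for n, child in enumerate(children, start=1):
        dewey(child, position + (n,), out)   # {position of parent}.n
    return out

def contains(x, y):
    # X contains Y iff X is a proper prefix of Y.
    return len(x) < len(y) and y[:len(x)] == x

pos = dewey(TREE)
print(".".join(map(str, pos["D1"])))     # 1.1.2.2
print(contains(pos["A1"], pos["D1"]))    # True
print(contains(pos["B3"], pos["D1"]))    # False
```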
Position Encoding
Begin/End/Level
Typically more compact
Fewer implementation issues
Dewey
Encodes positions of all ancestors
Path Index
Path -> ID
/R -> 1
/R/A -> 2
/R/A/B -> 3
/R/A/B/C -> 4
/R/A/B/D -> 5
/R/B -> 6
/R/B/C -> 7
Path Pattern -> Set of matching path IDs
/R/B -> {6}
//R//C -> {4, 7}
Tree (diagram):
R
  A1
    B1
    B2
      C1
      D1
  B3
    C2
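The lookup from a path pattern to its set of path IDs can be sketched by translating the pattern into a regular expression. This translation is an illustrative assumption, not necessarily how a real path index is implemented.

```python
import re

# The path index from the slide: distinct root-to-node paths and their IDs.
PATHS = {"/R": 1, "/R/A": 2, "/R/A/B": 3, "/R/A/B/C": 4,
         "/R/A/B/D": 5, "/R/B": 6, "/R/B/C": 7}

def matching_ids(pattern):
    regex = ""
    for sep, tag in re.findall(r"(//|/)([^/]+)", pattern):
        # "/" is exactly one step; "//" allows any number of steps in between.
        regex += ("(?:/[^/]+)*/" if sep == "//" else "/") + re.escape(tag)
    return {pid for path, pid in PATHS.items() if re.fullmatch(regex, path)}

print(matching_ids("/R/B"))            # {6}
print(sorted(matching_ids("//R//C")))  # [4, 7]
```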
Basic Access Path
Inverted posting lists
Posting: <Token, Location>
Token = <Term/Tag>
Location = <DocumentID, Position>
Exercise: Create the posting list representation for the following XML document
R
  A1
    B1
    B2
      C1
      D1
  B3
    C2
Inverted index
Dictionary -> Postings lists (each docID in a list is a posting)
Brutus -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 13 21 34
Caesar -> 13 16
Sorted by docID (why: lectures 6 and 7).
Joins in XML
Structural (Containment) Joins
A || B
B || C
B || D
Twig Joins
A || B, with B branching to C and D
A || B || C
Resources for today’s lecture
IIR 2
Porter’s stemmer:
http://www.tartarus.org/~martin/PorterStemmer/
Rosenblum, Mendel and Ousterhout, John K. (February 1992).
"The Design and Implementation of a Log-Structured
Filesystem." ACM Transactions on Computer Systems. 10(1).
26-52
XML Introduction (IIR 10)
Assignment 4 - Proposal
Google File System
http://labs.google.com/papers/gfs.html
Map Reduce
http://labs.google.com/papers/mapreduce.html
Assignment 5 - Proposal
XML Parsing, Tokenization, and Indexing
JuruXML - an XML retrieval system at INEX'02
Optimizing cursor movement in holistic twig joins.
CIKM 2005: 784-791