Information Retrieval
Document Parsing
Basic indexing pipeline
Documents to be indexed (“Friends, Romans, countrymen.”)
→ Tokenizer → token stream (Friends, Romans, Countrymen)
→ Linguistic modules → modified tokens (friend, roman, countryman)
→ Indexer → inverted index
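The pipeline above can be sketched in a few lines of Python (a minimal illustration: the linguistic-modules step is a crude stub that only case-folds and strips a trailing “s”; a real module would also map an irregular plural like countrymen to countryman):

```python
from collections import defaultdict

def tokenize(text):
    # Tokenizer: split on whitespace and strip surrounding punctuation.
    return [tok.strip(".,;:!?") for tok in text.split()]

def normalize(token):
    # Linguistic modules (stub): case-fold and drop a plural 's'.
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def index(documents):
    # Indexer: map each normalized token to the ids of the documents
    # that contain it (a set-valued inverted index).
    inverted = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for tok in tokenize(text):
            inverted[normalize(tok)].add(doc_id)
    return inverted

inv = index(["Friends, Romans, countrymen."])
print(sorted(inv))  # ['countrymen', 'friend', 'roman']
```

Note that the stub leaves “countrymen” untouched: handling irregular forms is exactly what the real linguistic modules are for.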
Parsing a document

What format is it in? (pdf/word/excel/html?)

What language is it in?

What character set is in use? (plain ASCII, UTF-8, UTF-16, …)

Each of these is a classification problem, with many complications…
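Character-set identification is one such classification problem; a toy sketch using only BOM and byte-range cues (real detectors use statistical models over byte n-grams, so treat this as an illustration only):

```python
def guess_charset(raw: bytes) -> str:
    # UTF-16 streams typically start with a byte-order mark (BOM).
    if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
        return "UTF-16"
    # Pure 7-bit content is plain ASCII.
    if all(b < 128 for b in raw):
        return "ASCII"
    # Otherwise, see whether the bytes form valid UTF-8.
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "unknown"

print(guess_charset("caffè".encode("utf-8")))  # UTF-8
```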
Tokenization: Issues

Chinese/Japanese have no spaces between words, so a unique tokenization is not always guaranteed.

Dates/amounts come in multiple formats, e.g.:

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

(a single string mixing Katakana, Hiragana, Kanji, “Romaji”, and numerals)

What about DNA sequences? ACCCGGTACGCAC...

The definition of tokens determines what you can search!!
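Amounts and dates can be kept as single tokens with pattern rules; a sketch (the patterns below are illustrative, far from exhaustive):

```python
import re

# Ordered alternatives: amounts and dates are matched as single tokens
# before falling back to plain words.
TOKEN_RE = re.compile(r"""
    \$\d+(?:,\d{3})*(?:\.\d+)?[KMB]?   # amounts like $500K or $6,000.50
  | \d{1,2}/\d{1,2}/\d{2,4}            # dates like 3/12/1991
  | \w+                                # plain words
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Revenue was $500K on 3/12/1991"))
# ['Revenue', 'was', '$500K', 'on', '3/12/1991']
```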
Case folding

Reduce all letters to lower case.

Exception: keep upper case (in mid-sentence?)

e.g., General Motors
USA vs. usa
“Morgen will ich in MIT …” (“Tomorrow I want [to go] to MIT …”): is this MIT, or the German preposition “mit” (“with”)?
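One common sketch of the mid-sentence exception is to preserve all-caps tokens as likely acronyms (a heuristic assumption only; real systems also use sentence position and dictionaries):

```python
def fold(tokens):
    # Lower-case everything except all-caps tokens of length >= 2,
    # which are kept as likely acronyms (USA, MIT, ...).
    out = []
    for tok in tokens:
        if len(tok) >= 2 and tok.isupper():
            out.append(tok)
        else:
            out.append(tok.lower())
    return out

print(fold(["Morgen", "will", "ich", "in", "MIT"]))
# ['morgen', 'will', 'ich', 'in', 'MIT']
```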
Stemming

Reduce terms to their “roots”; the process is language dependent.

e.g., automate(s), automatic, automation are all reduced to automat.

e.g., (Italian) casa, casalinga, casata, casamatta, casolare, casamento, casale, rincasare, case are all reduced to cas.
Porter’s algorithm

The commonest algorithm for stemming English: conventions + 5 phases of reductions.

Phases are applied sequentially; each phase consists of a set of commands.

Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

sses → ss, ies → i, ational → ate, tional → tion

Full morphological analysis yields only modest benefit!!
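The longest-suffix convention can be sketched as follows (using only the four sample rules above; this is not the full Porter algorithm):

```python
# Sample rules from one phase; the convention is to apply the rule
# whose suffix is the longest among those that match the word.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_phase(word):
    best = None
    for suffix, repl in RULES:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, repl)
    if best:
        return word[: -len(best[0])] + best[1]
    return word

print(apply_phase("relational"))   # 'relate'    (ational -> ate wins over tional)
print(apply_phase("conditional"))  # 'condition' (only tional matches)
```

Note that “relational” ends in both “ational” and “tional”; the longest-suffix convention is what makes the outcome deterministic.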
Thesauri

Handle synonyms and homonyms.

Hand-constructed equivalence classes:

e.g., car = automobile
e.g., (Italian) macchina = automobile = spider

A thesaurus lists the words important for a given domain; for each word it specifies a list of correlated words (usually synonyms, polysemous words, or phrases for complex concepts).

Co-occurrence pattern: BT (broader term), NT (narrower term)

Vehicle (BT) → Car → Fiat 500 (NT)

How to use it in a search engine?
Dmoz Directory
Yahoo! Directory
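A minimal sketch of query expansion with such a thesaurus (the tables are hand-built toy entries, not a real resource):

```python
# Hand-constructed equivalence classes plus BT/NT links; all entries
# here are illustrative toy data.
SYNONYMS = {"car": {"car", "automobile"}, "automobile": {"car", "automobile"}}
NARROWER = {"vehicle": {"car"}, "car": {"fiat 500"}}

def expand(query_terms, use_narrower=False):
    # Replace each query term by its equivalence class, optionally
    # adding narrower terms (NT) from the taxonomy.
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term, {term})
        if use_narrower:
            expanded |= NARROWER.get(term, set())
    return expanded

print(sorted(expand(["car"])))                         # ['automobile', 'car']
print(sorted(expand(["vehicle"], use_narrower=True)))  # ['car', 'vehicle']
```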
Information Retrieval
Statistical Properties of Documents
Statistical properties of texts

Tokens are not distributed uniformly: they follow the so-called “Zipf Law”:

Few tokens are very frequent.

A middle-sized set has medium frequency.

Many are rare.

The 100 most frequent tokens account for about 50% of the text; many of them are stopwords.
The Zipf Law, in detail

The k-th most frequent term has frequency approximately proportional to 1/k; equivalently, the product of a token’s frequency (f) and its rank (r) is almost constant:

r · f = c · |T|
f = c · |T| / r

General law: f = c · |T| / r^a, with a ≈ 1.5–2.0

Under the general law, the sum of the frequencies after the k-th element is ≤ f_k · k / (a − 1), while over the initial top elements it is roughly constant.
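A quick numeric check of the tail bound (a sketch assuming the general law f_r = C / r^a with a = 2; C, k, and N are arbitrary illustrative values):

```python
# Tail bound for a power-law frequency f_r = C / r**a with a > 1:
# the sum of frequencies after rank k is at most f_k * k / (a - 1).
C, a, k, N = 1000.0, 2.0, 10, 100000

def freq(r):
    return C / r ** a

tail = sum(freq(r) for r in range(k + 1, N + 1))
bound = freq(k) * k / (a - 1)
print(tail <= bound)  # True
```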
[Figure: an example of “Zipf curve”: Zipf’s law on a log-log plot]
Consequences of Zipf Law

There exist a few very frequent tokens that do not discriminate: the so-called “stop words”.

English: to, from, on, and, the, ...

Italian: a, per, il, in, un, …

There exist many tokens that occur only once in a text and thus discriminate poorly (they may even be errors).

English: Calpurnia

Italian: Precipitevolissimevolmente (or, paklo)

Words with medium frequency are the ones that discriminate.
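This three-way partition can be sketched with simple frequency counts (the thresholds are arbitrary illustrative choices):

```python
from collections import Counter

def partition(tokens, stop_k=1):
    # Top stop_k most frequent tokens -> "stop words";
    # tokens occurring once -> poor discriminators (possibly errors);
    # everything else has medium frequency and discriminates best.
    counts = Counter(tokens)
    stop = {w for w, _ in counts.most_common(stop_k)}
    rare = {w for w, c in counts.items() if c == 1}
    medium = set(counts) - stop - rare
    return stop, medium, rare

text = "the cat and the dog and the cat ran to the zoo".split()
stop, medium, rare = partition(text)
print(stop, medium, rare)
```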
Other statistical properties of texts

The number of distinct tokens grows as Θ(|T|^b) with b < 1: the so-called “Heaps Law”.

Hence the token length is Θ(log |T|).

The interesting words are the ones with medium frequency (Luhn).

[Figure: frequency vs. term significance (Luhn)]
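A toy illustration of vocabulary growth (on so short a text only the sublinear, non-decreasing growth is visible, not the |T|^b power law itself; the sample sentence is arbitrary):

```python
# Track how many distinct tokens have been seen after each token of the text.
def vocab_growth(tokens):
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

text = ("to be or not to be that is the question " * 3).split()
growth = vocab_growth(text)
print(growth[-1], "distinct tokens out of", len(text))
```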