Transcript Document

Text-retrieval
Systems
NDBI010 Lecture Slides
KSI MFF UK
http://www.ms.mff.cuni.cz/~kopecky/teaching/ndbi010/
Version 10.05.12.13.30.en
Literature (textbooks)
• Introduction to Information Retrieval
– Christopher D. Manning, Prabhakar Raghavan
and Hinrich Schütze
• Cambridge University Press, 2008
• http://informationretrieval.org/
• Dokumentografické informační systémy
– Pokorný J., Snášel V., Kopecký M.:
• Nakladatelství Karolinum, UK Praha, 2005
– Pokorný J., Snášel V., Húsek D.:
• Nakladatelství Karolinum, UK Praha, 1998
• Textové informační systémy
– Melichar B.:
• Vydavatelství ČVUT, Praha, 1997
NDBI010 - DIS - MFF UK
Further links (books)
• Computer Algorithms - String Pattern
Matching Strategies,
– Jun Ichi Aoe,
• IEEE Computer Society Press 1994
• Concept Decomposition
for Large Sparse Text Data using Clustering
– Inderjit S. Dhillon, Dharmendra S. Modha
• IBM Almaden Research Center, 1999
NDBI010 - DIS - MFF UK
Further links (articles)
• The IGrid Index: Reversing the Dimensionality Curse For
Similarity Indexing in High Dimensional Space
– Charu C. Aggarwal, Philip S. Yu
• IBM T. J. Watson Research Center
• The Pyramid Technique: Towards Breaking the Curse of
Dimensionality
– S. Berchtold, C. Böhm, H.-P. Kriegel:
• ACM SIGMOD Conference Proceedings, 1998
NDBI010 - DIS - MFF UK
Further links (articles)
• Affinity Rank: A New Scheme for Efficient Web Search
– Yi Liu, Benyu Zhang, Zheng Chen, Michael R. Lyu, Wei-Ying Ma
• 2004
• Improving Web Search Results Using Affinity Graph
– Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan,
Zheng Chen, Wei-Ying Ma
• Efficient computation of pagerank
– T.H. Haveliwala
• Technical report, Stanford University, 1999
NDBI010 - DIS - MFF UK
Further links (older)
• Introduction to Modern Information
Retrieval
– Salton G., McGill M. J.:
• McGRAW-Hill 1981
• Výběr informací v textových bázích dat
– Pokorný J.:
• OVC ČVUT Praha 1989
NDBI010 - DIS - MFF UK
Lecture No. 1
Introduction
Overview of the problem
informativeness measurement
Retrieval system origin
• 1950s
• Gradual automation
of the procedures used in libraries
• Now a separate subsection of IS’s
– Factual IS
• Processing of information having a defined internal
structure (usually in the form of tables)
– Bibliographic IS
• Processing of information in the form of text written
in natural language without a strict internal structure.
NDBI010 - DIS - MFF UK
Interaction with TRS
1. Query formulation
2. Comparison
3. Hit-list obtaining
4. Query tuning/reformulation
5. Document request
6. Document obtaining
[Diagram: interaction loop between the user and the DIS, steps 1–6]
NDBI010 - DIS - MFF UK
TRS Structure
I) Document disclosure system
• Returns secondary information
– Author
– Title
– ...
II) Document delivery system
• Need not be supported by the SW
[Diagram: steps 1–4 are handled by the disclosure system (I), steps 5–6 by the delivery system (II)]
NDBI010 - DIS - MFF UK
Query Evaluation
• Direct comparison of the query
with every document is
time-consuming
[Diagram: the query compared directly against documents Doc1 … Docn]
NDBI010 - DIS - MFF UK
Query Evaluation
• A document model is
used for the comparison
• Lossy process,
usually based on the
presence of words in
documents
• Produces structured
data suitable for
efficient comparison
[Diagram: indexation transforms document Doc1 into its model X1]
NDBI010 - DIS - MFF UK
Query Evaluation
• The query is processed to
obtain the needed form
• The processed query
is compared
against the index
[Diagram: the processed query compared against the document models X1 …]
NDBI010 - DIS - MFF UK
Text preprocessing
• Searching is more efficient using a created
(structured) model of documents, but it can use
only information stored in the model, not in the
documents.
• The goal is to create a model preserving as much
information from the original documents as
possible.
• Problem: a lot of ambiguity in text.
• Many unresolved tasks concerning document
understanding still exist.
NDBI010 - DIS - MFF UK
Text understanding
• Writer:
– Text = sequence of words in natural language.
– Each word stands for some idea/imagination of the writer.
– Ideas represent real subjects, activities, etc.
• Reader: follows (not necessarily exactly the same) mappings
from left to right
...
NDBI010 - DIS - MFF UK
Text understanding
• Synonymy of words
– More words can have the same meaning for the writer
• car = automobile
• sick = ill
...
NDBI010 - DIS - MFF UK
Text understanding
• Homonymy of words
– One word can have more than one meaning
• fluke: fish, anchor, …
• crown: currency, treetop, jewel, …
• class: year of studies, category in set theory, …
...
NDBI010 - DIS - MFF UK
Text understanding
• Word meanings need not be exactly the same.
– Hierarchical overlapping
• animal > horse > Stallion
– Associativity among meanings
• calculator ~ computer ~ processor
...
NDBI010 - DIS - MFF UK
Text understanding
• The mapping between subjects, ideas and words can depend on
individual persons – readers and writers.
– Two people can assign partly or completely different meanings to
a given term.
– Two people can imagine different things for the same word.
• mother, room, ...
• As a result, by reading the same text two different readers can
obtain different information
– different from each other
– different from the author’s intention
NDBI010 - DIS - MFF UK
Text understanding
• Homonymy and ambiguity grow with the transition
from words/terms to sentences and bigger parts of
the text.
• Example of an English sentence with more than one grammatically
correct meaning (in this case a human reader probably
eliminates the nonsense meaning)
– See Podivné fungování gramatiky (The strange workings of grammar),
http://www.scienceworld.cz/sw.nsf/lingvistika
– In the sentence „Time flies like an arrow“ either flies (fly) or like
can be chosen as the predicate, which produces two significantly
different meanings.
NDBI010 - DIS - MFF UK
Text preprocessing
• Inclusion of linguistic analysis into the text
processing can partially solve the problem
– Disambiguation
• Selection of correct meaning of the term in the
sentence
– According to grammar (Verb versus Noun etc.)
– According to context (more complicated, can distinguish
between two Verbs, two Nouns, etc).
NDBI010 - DIS - MFF UK
Text preprocessing
• Inclusion of linguistic analysis into the text
processing can partially solve the problem
– Lemmatization
• For each term/word in the text – after its proper
meaning is found – assigns
– Type of word, plural vs. singular, present time vs.
preterite, etc.
– Base form (singular for Nouns, infinitive for Verbs, …)
– Information obtained by sentence analysis (subject,
predicate, object, ...)
NDBI010 - DIS - MFF UK
Text preprocessing
• Other tasks that can be more or less
successfully solved are
– Identification of collocations
• World War Two, ...
– Resolving pronouns used in the text
to the nouns they refer to (very complex and hard to
solve, sometimes even for a human reader)
NDBI010 - DIS - MFF UK
Precision and Recall
• As a result of the ambiguities, no optimal
text retrieval system exists
• After the answer to a query is obtained, the
following values can be evaluated
– Number of returned documents in the list: Nv
• The system supposed them to be relevant – useful – according
to their match with the query
– Number of returned relevant documents: Nvr
• The questioner finds them to be really relevant as they fulfill
his/her information needs
– Number of all relevant documents in the system: Nr
• Very hard to guess for large and unknown collections
NDBI010 - DIS - MFF UK
Precision and Recall
• Two TRS’s can (and
do) return two
different results for the
same query, and these results can
be partly or
completely unique.
How to compare the
quality of those
systems?
[Diagram: documents in the database; the set of relevant documents overlaps
differently with the results returned by TRS1 and TRS2]
NDBI010 - DIS - MFF UK
Precision and Recall
• Two questioners can
consider different
documents to be
relevant to the same,
identically formulated
query
How to meet the
subjective
expectations of both
questioners?
[Diagram: documents in the database; the returned documents overlap
differently with what each questioner considers relevant]
NDBI010 - DIS - MFF UK
Precision and Recall
• The quality of the result set of documents is usually
evaluated using the numbers Nv, Nr, Nvr
– Precision
• P = Nvr / Nv
• Probability that a returned document is relevant to the user
– Recall
• R = Nvr / Nr
• Probability that a relevant document is returned to the user
NDBI010 - DIS - MFF UK
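A small illustration of the two formulas (a sketch in Python; the sets of document ids are made up for the example):

def precision_recall(returned, relevant):
    """P = Nvr / Nv, R = Nvr / Nr for sets of document ids."""
    nvr = len(returned & relevant)          # returned relevant documents
    precision = nvr / len(returned) if returned else 0.0
    recall = nvr / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 7}))   # -> (0.5, 0.666...)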
Precision and Recall
• Both coefficients depend on the feeling of
the questioner
• The same document can fulfill the information
needs of the first questioner while at the same
time failing to meet them for the second one.
– Each user determines different values of the Nr and
Nvr coefficients
– Both measures P and R depend on them
NDBI010 - DIS - MFF UK
Precision and Recall
• In the optimal case
– P = R = 1
– There are all and only
relevant documents in
the response of the
system
• Usually
– The answer in the first
iteration is neither
precise nor complete
[Plot: P–R space with the optimal answer at (1,1) and a typical initial
answer close to the origin]
NDBI010 - DIS - MFF UK
Precision and Recall
• Query tuning
– Iterative modification
of the query targeted at
increasing the quality of
the response
• Theoretically it is
possible to reach the
optimum sooner or
later …
[Plot: trajectory in the P–R space approaching the optimum at P = R = 1]
NDBI010 - DIS - MFF UK
Precision and Recall
• … due to (not only)
ambiguities, both measures
indirectly depend on
each other,
i.e. P*R ≈ const. < 1
– In order to increase P, the
absolute number of relevant
documents in the response
is decreased.
– In order to increase R, the
number of irrelevant
documents grows rapidly.
• The probability of reaching a
quality above this limit is
low.
[Plot: P–R space; reachable answers stay below the curve P*R ≈ const., far
from the optimum at P = R = 1]
NDBI010 - DIS - MFF UK
Prediction Criterion
• At the time of query formulation the questioner
has to guess the terms (words) the author
used to express the given idea
– Problems are caused e.g. by
• Synonyms (the author could have used a different synonym, not
remembered by the user)
• Overlapping meanings of terms
• Colorful poetical hyperboles
• …
NDBI010 - DIS - MFF UK
Prediction Criterion
• The problem can be partly suppressed by
inclusion of thesaurus, containing
– Hierarchies of terms and their meanings
– Sets of synonyms
– Definitions of associations between terms
• Questioner can use it during query formulation
• System can use it during query evaluation
NDBI010 - DIS - MFF UK
Prediction Criterion
• The user often tends to tune its own query in
conservative way
– He/she tends to fix terms used in the first iteration (they
must be the best because I remembered them
immediately) and vary only additional terms at the end
of the query
• It is useful to support the user to
(semi)automatically eliminate wrong terms and
replace them with useful ones, that describe really
relevant documents
NDBI010 - DIS - MFF UK
Maximal Criterion
• The questioner is usually not able or willing to go
through an exhaustive number of hits in the response
to find the relevant ones
• Usually at most 20–50 documents, depending on their
length
⇒ Need not only to sort out documents not matching
the query, but also to order the answer list by
supposed relevance in descending order – the
supposedly best documents at the beginning
NDBI010 - DIS - MFF UK
Maximal Criterion
• Due to the maximal criterion, the user usually tries to
increase the Precision of the answer
– A small number of resulting
documents in the answer,
containing as high a ratio
of relevant documents
as possible
[Diagram: a „better“ answer (most returned documents relevant) versus a
„worse“ one]
• Some problematic domains require both high
precision and recall
– Lawyers, especially in territories having case law based
on precedents (need to find as many similar cases as
possible)
NDBI010 - DIS - MFF UK
Exact pattern matching
Why to Search for Patterns in Text
• In order to index documents or queries
– To involve only a given set of terms (lemmas)
– To omit a given set of meaningless terms
(lemmas) such as conjunctions, numerals, pronouns,
…
• To highlight given terms in documents presented
to users
• …
NDBI010 - DIS - MFF UK
Algorithms classification
by preprocessing
Categories by preprocessing

                 | Pattern: NO | Pattern: YES
Text: NO         |     I.      |     II.
Text: YES        |    III.     |     IV.

I  - Brute-force algorithm
II - Others (suitable for TRS)
Further divided according to
• Number of simultaneously
matched patterns
– 1, N, ∞
• Direction of comparison
– Left to right
– Right to left
NDBI010 - DIS - MFF UK
Class II Algorithms
Subcategories of class II

Number of patterns | Left to right | Right to left
1                  |      KMP      |      BM
N                  |      AC       |      CW
∞                  |      KA       |      2WJFA

NDBI010 - DIS - MFF UK
Exact Pattern Matching
Searching of One Pattern
Within Text
Brute-force Algorithm
• Let m denote the length of the text t,
let n denote the length of the pattern p.
• If the i-th position in the text doesn’t match the j-th position in the pattern
– Shift the pattern one position to the right,
restart the comparison at the first (leftmost) position of the pattern
• Average time complexity: o(m*n),
e.g. in the search for „a^(n-1)b“ in „a^(m-1)b“
• For natural language text/pattern: m*const operations, i.e. o(m)
(const is a small number (<10), dependent on the language)
Text:       a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift: a b c c b a b c b b b
Aft. shift:   a b c c b a b c b b b
NDBI010 - DIS - MFF UK
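A runnable sketch of the brute-force search (Python, 0-based positions; the text and pattern are taken from the example above):

def brute_force(t, p):
    """On a mismatch, shift the pattern one position to the right and restart
    at its first character."""
    m, n = len(t), len(p)
    hits = []
    for i in range(m - n + 1):
        j = 0
        while j < n and t[i + j] == p[j]:
            j += 1
        if j == n:
            hits.append(i)                  # occurrence starting at position i
    return hits

print(brute_force("abccbabcabbcaabccbabcbbbabcc", "abccbabcbbb"))   # -> [13]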
Lecture no. 2
Knuth-Morris-Pratt Algorithm
• Left to right searching for one pattern
• Compared with the brute-force algorithm,
KMP eliminates repeated comparison of
already successfully compared characters of
the text
• The pattern is shifted as little as possible, aligning
a proper prefix of the already examined part of the pattern
below an equal fragment of the text
NDBI010 - DIS - MFF UK
KMP Algorithm
Brute-force algorithm
Text:       a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift: a b c c b a b c b b b
Aft. shift:   a b c c b a b c b b b
KMP
Text:       a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift: a b c c b a b c b b b
Aft. shift:           a b c c b a b c b b b
NDBI010 - DIS - MFF UK
KMP Algorithm
• In front of the mismatch position a proper prefix of the
already examined part of the pattern is kept
• It has to be equal to a suffix of the already
examined part of the pattern
• The longest such prefix determines the smallest
shift
Text:       a b c c b a b c a b b c a a b c c b a b c b b b a b c c
Bef. shift: a b c c b a b c b b b
Aft. shift:           a b c c b a b c b b b
NDBI010 - DIS - MFF UK
KMP Algorithm
Text: a b c c b a b c a b b c a a b c c b a b c b b b a b c c
[Example: successive alignments of the pattern „abccbabcbbb“ after each
mismatch, until the occurrence near the end of the text is found]
NDBI010 - DIS - MFF UK
KMP Algorithm
• If
– the j-th position of pattern p
doesn’t match the i-th position of text t, and
– the longest proper prefix of the already examined part of the pattern
that is equal to a suffix of the already examined part of the pattern has
length k,
• then
– after the shift, k characters remain before the mismatch
position
– the comparison restarts from the (k+1)-st position of the pattern
• Restart positions are pre-computed and stored in an
auxiliary array A
• In this case A[j] = k+1
NDBI010 - DIS - MFF UK
KMP Algorithm
begin {KMP}
  m := length(t); n := length(p);
  i := 1; j := 1;
  while (i <= m) and (j <= n) do begin
    while (j > 0) and (p[j] <> t[i]) do
      j := A[j];
    inc(i); inc(j);
  end; {while}
  if (j > n)
    then {pattern found at position i-j+1}
    else {not found}
end; {KMP}
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
• A[1] = 0
• If all values are known for positions
1 .. j-1, it is easy to compute the correct
value for the j-th position
– Let A[j-1] contain the correction for the (j-1)-st
position.
I.e., A[j-1]-1 characters at the beginning of the
pattern are the same as the equal number
of characters before the (j-1)-st position
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
Position: 1 2 3 4 5 6 7 8 9 10 11
Pattern:  a b c c b a b c b b b
A:        0 1 1 1 1 1 2 ?
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
• If the (j-1)-st position of the pattern
matches the A[j-1]-th position,
the prefix can be prolonged,
and so the correct value of A[j] is one
higher than the previous value.
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
Position: 1 2 3 4 5 6 7 8 9 10 11
Pattern:  a b c c b a b c b b b
A:        0 1 1 1 1 1 2 3
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
• If the (j-1)-st and the A[j-1]-th positions in the pattern
don’t match, the correction A[j-1]+1
would cause a mismatch at the previous
position in the text
• The correction for such a mismatch is
already known
(the numbers A[1] .. A[j-1] have already been
computed)
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
• It is necessary to follow the corrections,
starting at the (j-1)-st position, until the (j-1)-st
position of the pattern matches the position
found by the correction, or the
correction reaches 0 (falls out of the pattern)
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
Position: 1 2 3 4 5 6 7 8 9 10 11
Pattern:  a b c c b a b c b b b
A:        0 1 1 1 1 1 2 3 4 ?
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search
Position: 1 2 3 4 5 6 7 8 9 10 11
Pattern:  a b c c b a b c b b b
[Example: the candidate shift A[j] = A[j-1]+1 fails, so the already known
correction for the 4th position is followed, and then the correction for the
1st position, until a matching position or 0 is reached]
A:        0 1 1 1 1 1 2 3 4 ?
NDBI010 - DIS - MFF UK
Obtaining of array A
for KMP search - algorithm
begin
  A[1] := 0;
  n := length(p); j := 2;
  while (j <= n) do begin
    k := j-1; l := k;
    repeat
      l := A[l];
    until (l = 0) or (p[l] = p[k]);
    A[j] := l+1;
    inc(j);
  end;
end;
NDBI010 - DIS - MFF UK
KMP algorithm
• Time complexity of KMP is o(m+n).
• Already successfully compared positions in
text are never checked again
• After each shift of pattern the given mismatch
position can be checked again, but there are
at most o(m) shifts of pattern.
• Similarly time complexity of preprocessing is
o(n).
NDBI010 - DIS - MFF UK
KMP Optimization
• It is possible to further optimize the auxiliary array A
• If the character p[j] equals p[A[j]],
the same character as the one that
caused the mismatch would be aligned to the mismatch position.
• In this case the optimization can be computed in
advance in another auxiliary array A’,
where A’[j] =def A’[A[j]]
• Else A’[j] =def A[j]
• The array A’ can be used during the search phase instead of A
p  a b c c b a b c b b b
A  0 1 1 1 1 1 2 3 4 1 1
A’ 0 1 1 1 1 1 1 1 4 1 1
NDBI010 - DIS - MFF UK
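A sketch (Python, values kept 1-based as on the slides) of computing the array A with the algorithm above and then deriving the optimised array A’ by the stated rule:

def kmp_tables(p):
    n = len(p)
    A = [0] * (n + 1)                      # A[0] unused, A[1] = 0
    for j in range(2, n + 1):
        l = j - 1
        while True:
            l = A[l]
            if l == 0 or p[l - 1] == p[j - 2]:
                break
        A[j] = l + 1
    Aopt = A[:]                            # A'[j] = A'[A[j]] if p[j] = p[A[j]]
    for j in range(1, n + 1):
        if A[j] > 0 and p[j - 1] == p[A[j] - 1]:
            Aopt[j] = Aopt[A[j]]
    return A, Aopt

A, Aopt = kmp_tables("abccbabcbbb")
print(A[1:])       # 0 1 1 1 1 1 2 3 4 1 1 – the array A from the slides
print(Aopt[1:])    # the optimised restart positions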
Boyer-Moore Algorithm
• Right to left search of one pattern using
pattern preprocessing
– Pattern is shifted left to right
– Characters of pattern are compared from
right to left
NDBI010 - DIS - MFF UK
Boyer-Moore Algorithm
• If a mismatch of the (n-j)-th position of the pattern
against the (i-j)-th position of the text, where T[i-j] = x,
occurs, where:
– n denotes the length of the pattern,
– i denotes the position of the end of the pattern in the text,
– j = 0..n-1,
– x ∈ X, X is the alphabet,
• the pattern is moved by SHIFT[n-j,x] characters
to the right
• The comparison restarts at the end of the pattern,
i.e. for j=0
NDBI010 - DIS - MFF UK
Boyer-Moore Algorithm
• There exist several different definitions
of SHIFT[n-j,x]
• Variant 1: The auxiliary array SHIFT[0..n-1,X] is,
for each position in the pattern and for each
character of the alphabet X, defined as
follows:
– The smallest possible shift aligning the character
x in the text at the mismatch position with the
same character in the pattern.
– If there exists no such character x in the pattern
left of the mismatch position, shift the pattern so that it
starts immediately after the mismatch position.
NDBI010 - DIS - MFF UK
Boyer-Moore Algorithm (1)
• The average time complexity is o(m*n),
e.g. for searching „ba^(n-1)“ in „a^(m-n)ba^(n-1)“
• For huge alphabets
and patterns with a small number of different
characters (especially for words searched in
natural language texts)
the average time complexity is o(m/n)
– i.e. the longer the pattern,
the more efficient the search
NDBI010 - DIS - MFF UK
Boyer-Moore Algorithm (1)
• Example: searching for „ANANAS“ in the text
„TROPICKÝM OVOCEM JE ANANAS.“
[Example: right-to-left comparisons; after each mismatch the pattern ANANAS
is shifted according to the SHIFT table until the occurrence at the end of
the text is found]
NDBI010 - DIS - MFF UK
Boyer-Moore Algorithm (1)
• Representation of the SHIFT array for the pattern
’ANANAS’
– Full arrows depict a successful comparison of one character
– Other arrows stand for the shift of the target character to the position of
the starting character
– Missing arrows mean a shift past the mismatch position
[Diagram: the pattern ANANAS with arrows describing the shifts]
NDBI010 - DIS - MFF UK
Boyer-Moore Algorithm (1)
• Another representation. To save space,
x ∈ {‘A’,’N’,’S’,’X’}
– ‘X‘ stands for any character not appearing in the pattern
• Values beginning with „+“ represent the length of the shift
• Values without „+“ represent the new value of j

j | ‘A‘ | ‘N‘ | ‘S‘ | ‘X‘
0 | +1  | +2  |  1  | +6
1 |  2  | +1  | +5  | +5
2 | +1  |  3  | +4  | +4
3 |  4  | +1  | +3  | +3
4 | +1  |  5  | +2  | +2
5 |  6  | +1  | +1  | +1
6 | the pattern was successfully found
NDBI010 - DIS - MFF UK
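A sketch (Python) of the Variant 1 SHIFT values described above; here positions are 0-based and counted from the left, and ‘X’ again stands for any character missing from the pattern:

def bm_shift_table(p, alphabet):
    """shift[(pos, x)] = smallest shift aligning text character x at the
    mismatch position pos with an equal character further left in the pattern;
    if there is none, the pattern is shifted just behind the mismatch."""
    n = len(p)
    shift = {}
    for pos in range(n):
        for x in alphabet:
            s = pos + 1                     # default: past the mismatch position
            for k in range(pos - 1, -1, -1):
                if p[k] == x:
                    s = pos - k             # nearest equal character to the left
                    break
            shift[(pos, x)] = s
    return shift

table = bm_shift_table("ANANAS", "ANSX")
print(table[(5, 'A')], table[(5, 'N')], table[(5, 'X')])   # -> 1 2 6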
Benchmark on Artificial Text
Text
('a'rnd(200)'b ')1000
Size
100 KB
#patterns
1 000
#unique patterns
200
NDBI010 - DIS - MFF UK
Benchmark on Artificial Text
Text
('a'rnd(200)'b ')1000
Size
100 KB
#patterns
1 000
#unique patterns
200
#compar. - Br.-f.
24 128 586
NDBI010 - DIS - MFF UK
Benchmark on Artificial Text
Text
('a'rnd(200)'b ')1000
Size
100 KB
#patterns
1 000
#unique patterns
200
#compar. - Br.-f.
24 128 586
#compar. - KMP
885 747 3,7%
NDBI010 - DIS - MFF UK
Benchmark on English Text
Text
Size
#patterns
#unique patterns
Words
English
130 KB
18 075
1 570
Note:
Unique pattern
 found at its original position
NDBI010 - DIS - MFF UK
Benchmark on English Text
Text
Size
#patterns
#unique patterns
#compar. - Br.-f.
Words
English
130 KB
18 075
1 570
256 799 832
NDBI010 - DIS - MFF UK
Benchmark on English Text
Text
Size
#patterns
#unique patterns
#compar. - Br.-f.
#compar. - KMP
Words
English
130 KB
18 075
1 570
256 799 832
255 942 030 99,7%
NDBI010 - DIS - MFF UK
Benchmark on English Text
Text
Size
#patterns
#unique patterns
#compar. - Br.-f.
#compar. - KMP
#compar. - BM
Words
English
130 KB
18 075
1 570
256 799 832
255 942 030 99,7%
50 114 658 19,5%
NDBI010 - DIS - MFF UK
Benchmark on English Text

                   Words (English)       Bi-words (English)
Size               130 KB                130 KB
#patterns          18 075                9 038
#unique patterns   1 570                 4 395
#compar. - Br.-f.  256 799 832           433 721 058
#compar. - KMP     255 942 030 (99,7%)   430 220 025 (99,2%)
#compar. - BM      50 114 658 (19,5%)    52 046 084 (12,0%)

NDBI010 - DIS - MFF UK
Review of Algorithms

                           No preprocessing   Pattern preprocessing
                                              Left to right   Right to left
time_max                   o(m*n)             o(m+n)          o(m*n)
time_avg                   o(m*n)             o(m+n)          o(m*n)
time_avg (nat. language)   o(m)               o(m+n)          o(m/n)

NDBI010 - DIS - MFF UK
Exact pattern matching
Searching for finite set of patterns
Aho-Corasick Algorithm
• Left to right searching for more patterns
simultaneously
• Extension of the KMP algorithm
– Preprocessing of the patterns
– Linear reading of the text
• Average time complexity o(m + Σi ni), where
m denotes the length of the text,
ni denotes the length of the i-th pattern
NDBI010 - DIS - MFF UK
A-C Algorithm
• Text T
• Set of patterns
P = {P1, P2, …, Pk}
• Search engine
S = (Q, X, q0, g, f, F)
– Q finite set of states
– X alphabet
– q0 ∈ Q initial state
– g: Q x X → Q (go)
forward function
– f: Q → Q (fail)
backward function
– F ⊆ Q set of final states
NDBI010 - DIS - MFF UK
A-C Algorithm
• States in the set Q
correspond to all
prefixes of all patterns
• State q0 represents
the empty prefix ε
• g(q,x) = qx,
iff qx ∈ Q
• Else g(q0,x) = q0
• Else g(q,x) is undefined
• f(q) for q<>q0 is equal
to the longest proper suffix of
q present in the set Q
⇒ |f(q)| < |q|
• Final states correspond
to all complete
patterns, i.e. F = P
NDBI010 - DIS - MFF UK
A-C Algorithm
• Search based on a total (fully defined)
transition function
δ(q,x): Q x X → Q
– δ(q,x) = g(q,x), iff g(q,x) is defined
– δ(q,x) = δ(f(q),x) otherwise
• The definition is correct, because
|f(q)| – the distance of f(q) from the initial state – is
less than |q|, and g(q0,x) is completely
defined.
NDBI010 - DIS - MFF UK
A-C Algorithm
• f is constructed in order of
increasing |q|, i.e.
according to the distance of the
state from the beginning
• It is not necessary to
define f(q0)
• If |q| = 1, the longest proper
suffix is empty, i.e.
f(q) = q0
• f(qx) = f(g(q,x))
= δ(f(q),x)
• To determine the value of the fail
function for the state qx,
accessible from state q
using character x, it is
necessary to start in q,
follow the fail function to f(q)
and then go forward using
the character x
NDBI010 - DIS - MFF UK
A-C Algorithm
• Example: P = {”he”, ”her”, ”she”}, function g
[Diagram: trie over the prefixes "", "h", "he", "her", "s", "sh", "she";
the initial state "" loops on X\{h,s}]
NDBI010 - DIS - MFF UK
A-C Algorithm
• Example: P = {”he”, ”her”, ”she”}, function f
[Diagram: the same trie; the fail function leads from "sh" to "h" and from
"she" to "he"; the remaining states fall back to ""]
NDBI010 - DIS - MFF UK
A-C Algorithm
• Detection of all occurrences of patterns,
even of patterns hidden inside another ones:
– Either collect all patterns detected in given state
by going through all states accessible from it
using fail function,
i.e. final states in {f i(q), i>=0}
– Or - after transition to state q – go through all
states linked together by fail function and report
all final states
NDBI010 - DIS - MFF UK
A-C Algorithm – delta function
function delta(q: states; x: alphabet): states;
begin {delta}
  while g[q,x] = fail do q := f[q];
  delta := g[q,x];
end; {delta}

begin {A-C}
  q := 0;
  for i := 1 to length(t) do begin
    q := delta(q, t[i]); report(q);
    {report all found patterns, ending by t[i]}
  end; {for}
end; {A-C}
NDBI010 - DIS - MFF UK
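A compact sketch (Python) of the same construction – the goto function g as a trie, the fail function f computed in order of increasing |q|, and reporting of every pattern ending at each text position:

from collections import deque

def build_ac(patterns):
    g = [{}]                    # g[state][char] -> state; state 0 = empty prefix
    out = [set()]               # patterns recognised in each state
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in g[q]:
                g.append({}); out.append(set())
                g[q][ch] = len(g) - 1
            q = g[q][ch]
        out[q].add(p)
    f = [0] * len(g)            # fail function, filled by BFS over the trie
    queue = deque(g[0].values())
    while queue:
        q = queue.popleft()
        for ch, r in g[q].items():
            queue.append(r)
            s = f[q]
            while s and ch not in g[s]:
                s = f[s]
            f[r] = g[s][ch] if ch in g[s] and g[s][ch] != r else 0
            out[r] |= out[f[r]]             # patterns hidden inside other ones
    return g, f, out

def ac_search(text, g, f, out):
    q = 0
    for i, ch in enumerate(text):
        while q and ch not in g[q]:
            q = f[q]
        q = g[q].get(ch, 0)
        for p in out[q]:
            yield i, p                      # pattern p ends at position i

g, f, out = build_ac(["he", "her", "she"])
print(sorted(ac_search("ushers", g, f, out)))   # "she" and "he" end at 3, "her" at 4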
KMP vs. A-C for 1 pattern
• Equal algorithms, different formulations
– j (~ compared position)   ↔   q_{j-1} (~ number of compared positions)
– A[1] = 0                  ↔   g(q0,*) = q0
– A[j] = k                  ↔   f(q_{j-1}) = q_{k-1}
P  a b c c b a b c b b b
A  0 1 1 1 1 1 2 3 4 1 1
NDBI010 - DIS - MFF UK
Commentz-Walter Algorithm
• Right to left search for more patterns
simultaneously
• Combination of the B-M and A-C algorithms
• Average time complexity
(for natural languages)
o(m/min(ni)), where
m denotes the length of the text,
ni denotes the length of the i-th pattern
NDBI010 - DIS - MFF UK
C-W Algorithm
• Text T
• Set of patterns
P = {P1, P2, …, Pk}
• Search engine
S = (Q, X, q0, g, f, F)
– Q finite set of states
– X alphabet
– q0 ∈ Q initial state
– g: Q x X → Q (go)
forward function
– f: Q → Q (fail)
backward function
– F ⊆ Q set of final states
NDBI010 - DIS - MFF UK
C-W Algorithm
• States in the set Q
represent all suffixes
of all patterns
• State q0 represents
the empty suffix ε
• g(q,x) = xq,
iff xq ∈ Q
• f(q), where q<>q0, is
equal to the longest proper
prefix of q present in the set Q
⇒ |f(q)| < |q|
• Final states correspond
to all complete
patterns, i.e. F = P
NDBI010 - DIS - MFF UK
C-W Algorithm
• Forward function
[Diagram: from the initial state –e→ “e“ –h→ “he“ –s→ “she“ and
–r→ “r“ –e→ “er“ –h→ “her“]
NDBI010 - DIS - MFF UK
C-W Algorithm
• Backward function (arrows going to q0 are not shown)
[Diagram: the same states; the fail function leads from “er“ to “e“ and from
“her“ to “he“]
NDBI010 - DIS - MFF UK
C-W Algorithm
• LMIN = min(ni)
length of the shortest pattern
• h(q) = |q|
distance of state q from the
initial state
• char(x)
minimal distance of a state
reachable via character x
• pred(q)
predecessor of state q,
i.e. the state representing the
one-character-shorter suffix
• If g(q,x) is not defined, the patterns
(the search engine) are shifted by
shift(q,x) positions to the right
and the search restarts from
state q0 again
• shift(q,x) = min(
    max(
      shift1(q,x),
      shift2(q)
    ),
    shift3(q)
  )
NDBI010 - DIS - MFF UK
C-W Algorithm
• shift1(q,x) = char(x)-h(q)-1, if > 0
• shift2(q)
= min({LMIN} ∪ {h(q’)-h(q), f(q’)=q})
• shift3(q0) = LMIN
• shift3(q)
= min({shift3(pred(q))} ∪
      {h(q’)-h(q), ∃k: f^k(q’)=q ∧ q’∈F})
NDBI010 - DIS - MFF UK
C-W Algorithm
• shift1(q,x) – aligning of the “collision” character
char(’y’) - h(’kolo’) - 1 = 8 - 4 - 1 = 3
[Example: the text „... mykolog ...“; after matching the suffix ’kolo’ the
collision character ’y’ is read; the patterns are shifted by +3 so that ’y’
is aligned with an equal character in a pattern]
NDBI010 - DIS - MFF UK
C-W Algorithm
• shift2(q) – aligning of the checked part of the text;
states where the fail function goes to q
must be taken into account
[Example: the text „... mykolog ...“; the patterns are shifted by +1 so that
the already checked part of the text is aligned with another occurrence of it
inside a pattern]
NDBI010 - DIS - MFF UK
C-W Algorithm
• shift3(q) – aligning of (any) suffix of the
checked text; the collision character need not be used
again to find a match
[Example: the text „... mykolog ...“; the patterns are shifted by +2 so that
a suffix of the checked part of the text is aligned with a prefix of a
pattern]
NDBI010 - DIS - MFF UK
Lecture No. 3
Exact Pattern Matching
Searching for (Regular) Infinite Set of
Patterns in Text
Regular expressions and languages
• Regular expression R
• Atomic expressions and their values h(R):
– ∅        … the empty language
– ε        … {ε}, the empty word only
– a, a ∈ X … {a}
• Operations:
– U.V – concatenation: h(U.V) = {u.v | u ∈ h(U) ∧ v ∈ h(V)}
– U+V – union:         h(U+V) = h(U) ∪ h(V)
– V^k = V.V…V
– V* = V^0+V^1+V^2+…
– V+ = V^1+V^2+V^3+…
NDBI010 - DIS - MFF UK
Regular Expression Properties
1) U+(V+W) = (U+V)+W
2) U.(V.W) = (U.V).W
3) U+V = V+U
4) (U+V).W = (U.W)+(V.W)
5) U.(V+W) = (U.V)+(U.W)
6) U+U = U
7) ε.U = U
8) ∅.U = ∅
9) U+∅ = U
10) U* = ε+U*.U = (ε+U)*
NDBI010 - DIS - MFF UK
(Deterministic) Finite Automaton
• K = ( Q, X, q0, δ, F )
– Q is a finite set of states
– X is an alphabet
– q0 ∈ Q is an initial state
– δ: Q x X → Q is a totally defined transition
function
– F ⊆ Q is a set of final states
NDBI010 - DIS - MFF UK
(Deterministic) Finite Automaton
• Configuration of the FA
– (q,w) ∈ Q x X*
• Transition of the FA
– relation ⊢ ⊆ (Q x X*) x (Q x X*)
– (q,aw) ⊢ (q’,w) ⇔ δ(q,a) = q’
• The automaton accepts a word w iff
(q0, w) ⊢* (q,ε), q ∈ F
NDBI010 - DIS - MFF UK
Non-deterministic
Finite Automaton
• a) default def.                 b) extended def.
K = ( Q, X, q0, δ, F )            K = ( Q, X, S, δ, F )
– Q is a finite set of internal states
– X is an alphabet
– q0 ∈ Q is an initial state
(S ⊆ Q is, alternatively, a set of initial states)
– δ: Q x X → P(Q) is a transition function
– F ⊆ Q is a set of final states
NDBI010 - DIS - MFF UK
Non-deterministic
Finite Automaton
• NFA for P = {”he”, ”her”, ”she”}
a) with a set of initial states
– S = {1,4,8}, F = {3,7,11}
1 –h→ 2 –e→ 3             (self-loop * on 1)
4 –h→ 5 –e→ 6 –r→ 7       (self-loop * on 4)
8 –s→ 9 –h→ 10 –e→ 11     (self-loop * on 8)
b) with a single initial state
– S = {1}, F = {3,4,7}
1 –h→ 2 –e→ 3 –r→ 4       (self-loop * on 1)
1 –s→ 5 –h→ 6 –e→ 7
NDBI010 - DIS - MFF UK
NDFADFA Conversion
• K=(Q, X, S, , F)
• K’=(Q’, X, q’0, ‘, F‘)
•
•
•
•
Q’ = P(Q)
X
q’0 = S
‘( q’, x)
= (q, x), qq’
• F‘ = {q’Q’q’F}
NDBI010 - DIS - MFF UK
NDFADFA Conversion
Set of Initial States Allowed
• By table, only reachable states
1 h 2 e 3
*
transitions
to state 1
4 h 5 e 6 r 7
*
not shown
*
state
{1,4,8}
{1,2,4,5,8}
{1,4,8,9}
{1,3,4,6,8}
{1,2,4,5,8,10}
{1,4,7,8}
{1,3,4,6,8,11}
8 s 9 h 10 e 11
lbl.
1
2
3
4
5
6
7
e
h
{1,4,8}
{1,2,4,5,8}
{1,3,4,6,8}
{1,2,4,5,8}
{1,4,8}
{1,2,4,5,8,10}
{1,4,8}
{1,2,4,5,8}
{1,3,4,6,8,11} {1,2,4,5,8}
{1,4,8}
{1,2,4,5,8}
{1,4,8}
{1,2,4,5,8}
r
{1,4,8}
{1,4,8}
{1,4,8}
{1,4,7,8}
{1,4,8}
{1,4,8}
{1,4,7,8}
NDBI010 - DIS - MFF UK
h
h 2
1
e
h
s
s
s
s
h
6
4
rs
r
hh
h
e
s 5 s 7
3
s
{1,4,8,9}
{1,4,8,9}
{1,4,8,9}
{1,4,8,9}
{1,4,8,9}
{1,4,8,9}
{1,4,8,9}
x
{1,4,8}
{1,4,8}
{1,4,8}
{1,4,8}
{1,4,8}
{1,4,8}
{1,4,8}
NDFADFA Conversion
Only One Initial State Allowed
• By table, only reachable states
1 h 2 e 3 r 4
transitions
*
to state 1
s 5 h 6 e 7
not shown
state
{1}
{1,2}
{1,5}
{1,3}
{1,2,6}
{1,4}
{1,3,7}
lbl.
1
2
3
4
5
6
7
e
{1}
{1,3}
{1}
{1}
{1,3,7}
{1}
{1}
h
{1,2}
{1,2}
{1,2,6}
{1,2}
{1,2}
{1,2}
{1,2}
1
e
h
NDBI010 - DIS - MFF UK
s
h
6
4
rs
s
s
s
r
hh
h
e
s 5 s 7
3
r
{1}
{1}
{1}
{1,4}
{1}
{1}
{1,4}
h
h 2
s
{1,5}
{1,5}
{1,5}
{1,5}
{1,5}
{1,5}
{1,5}
x
{1}
{1}
{1}
{1}
{1}
{1}
{1}
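A sketch (Python) of the subset construction used in both tables, applied to the NFA with the set of initial states {1,4,8}; the dictionary encoding of the NFA is only one possible choice:

def nfa_to_dfa(delta, start, finals, alphabet):
    """DFA states are frozensets of NFA states; only reachable states are built."""
    q0 = frozenset(start)
    states, todo, trans = {q0}, [q0], {}
    while todo:
        q = todo.pop()
        for a in alphabet:
            r = frozenset(s for st in q for s in delta.get((st, a), ()))
            trans[(q, a)] = r
            if r not in states:
                states.add(r)
                todo.append(r)
    dfa_finals = {q for q in states if q & set(finals)}
    return q0, states, trans, dfa_finals

alphabet = "ehrsx"                                           # 'x' = any other character
delta = {(q, a): {q} for q in (1, 4, 8) for a in alphabet}   # self-loops on initial states
for (src, ch), dst in {(1, 'h'): 2, (2, 'e'): 3,
                       (4, 'h'): 5, (5, 'e'): 6, (6, 'r'): 7,
                       (8, 's'): 9, (9, 'h'): 10, (10, 'e'): 11}.items():
    delta.setdefault((src, ch), set()).add(dst)

q0, states, trans, finals = nfa_to_dfa(delta, {1, 4, 8}, {3, 7, 11}, alphabet)
print(len(states), len(finals))                 # -> 7 3, as in the table above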
Derivation of regular expression
• The derivative of a regular expression V by a word x is the regular
expression dV/dx whose value is
h(dV/dx) = { v | xv ∈ h(V) }
• ε ∈ h(dV/dx) iff x ∈ h(V)
• Example: for V = shell + stop + plot
– dV/ds = hell + top
– dV/da = ∅ for any character a other than ‘s’ and ‘p’
NDBI010 - DIS - MFF UK
Derivation of regular expression
• dε/da = ∅,  a ∈ X
• d∅/da = ∅,  a ∈ X
• da/da = ε,  a ∈ X
• db/da = ∅,  b ≠ a
• d(U+V)/da = dU/da + dV/da
• d(U.V)/da = (dU/da).V,           if ε ∉ h(U)
• d(U.V)/da = (dU/da).V + dV/da,   if ε ∈ h(U)
• d(V*)/da = (dV/da).V*
• dV/dx = d( … d(dV/da1)/da2 … )/dan,  where x = a1a2…an
NDBI010 - DIS - MFF UK
Construction of DFA
Using Derivations of RE
• The derivation of regular expressions allows one to directly and
algorithmically build a DFA for any regular expression
• Let V be a given regular expression over the alphabet X
• Each state of the DFA defines the set of words
that move the DFA from this state to any of the final
states.
So, every state can be associated with the regular
expression defining this set of words
– q0 = V
– δ(q,x) = dq/dx
– F = {q ∈ Q | ε ∈ h(q)}
NDBI010 - DIS - MFF UK
Construction of DFA
Using Derivations of RE
• V = (0+1)*.01 in the alphabet X = {0,1}
• q0 = (0+1)*.01
• d((0+1)*.01)/d0 = (d(0+1)*/d0).01 + d(01)/d0
                  = (0+1)*.01 + 1
• d((0+1)*.01)/d1 = (d(0+1)*/d1).01 + d(01)/d1
                  = (0+1)*.01 + ∅ = (0+1)*.01
NDBI010 - DIS - MFF UK
Construction of DFA
Using Derivations of RE
• V = (0+1)*.01 in the alphabet X = {0,1}
• q0 = (0+1)*.01 = A
• F = { (0+1)*.01+ε } = {C}

lbl. | state        | 0            | 1
A    | (0+1)*.01    | (0+1)*.01+1  | (0+1)*.01
B    | (0+1)*.01+1  | (0+1)*.01+1  | (0+1)*.01+ε
C    | (0+1)*.01+ε  | (0+1)*.01+1  | (0+1)*.01

[Diagram: the DFA with states A, B, C; A –0→ B, A –1→ A, B –0→ B, B –1→ C,
C –0→ B, C –1→ A]
NDBI010 - DIS - MFF UK
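A sketch (Python) of this construction; the tuple encoding of regular expressions and the few simplification identities used to make equal states structurally equal are assumptions of the sketch, not part of the lecture:

def nullable(r):
    """True iff ε belongs to h(r)."""
    tag = r[0]
    if tag in ('eps', 'star'): return True
    if tag in ('empty', 'sym'): return False
    if tag == 'cat': return nullable(r[1]) and nullable(r[2])
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])

def simplify(r):
    """Identities 6–9: U+U = U, ε.U = U, ∅.U = ∅, U+∅ = U."""
    tag = r[0]
    if tag == 'cat':
        if ('empty',) in (r[1], r[2]): return ('empty',)
        if r[1] == ('eps',): return r[2]
        if r[2] == ('eps',): return r[1]
    if tag == 'alt':
        if r[1] == ('empty',): return r[2]
        if r[2] == ('empty',): return r[1]
        if r[1] == r[2]: return r[1]
    return r

def deriv(r, a):
    """The derivative rules from the previous slide."""
    tag = r[0]
    if tag in ('empty', 'eps'): return ('empty',)
    if tag == 'sym': return ('eps',) if r[1] == a else ('empty',)
    if tag == 'alt': return simplify(('alt', deriv(r[1], a), deriv(r[2], a)))
    if tag == 'star': return simplify(('cat', deriv(r[1], a), r))
    if tag == 'cat':
        left = simplify(('cat', deriv(r[1], a), r[2]))
        if nullable(r[1]):
            return simplify(('alt', left, deriv(r[2], a)))
        return left

def build_dfa(regex, alphabet):
    """States are regular expressions: q0 = V, δ(q,x) = dq/dx, F = nullable states."""
    states, todo, trans = {regex}, [regex], {}
    while todo:
        q = todo.pop()
        for a in alphabet:
            r = deriv(q, a)
            trans[(q, a)] = r
            if r not in states:
                states.add(r)
                todo.append(r)
    return regex, states, trans, {q for q in states if nullable(q)}

# V = (0+1)*.01
v = ('cat', ('star', ('alt', ('sym', '0'), ('sym', '1'))),
            ('cat', ('sym', '0'), ('sym', '1')))
q0, states, trans, finals = build_dfa(v, ['0', '1'])
print(len(states), len(finals))        # -> 3 1, as in the table above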
Document Models
• Different variants of models
– Takes (non)existence of terms in documents
into account or not
– Takes frequencies of terms in documents into
account or not
– Takes positions of terms in documents into
account or not
–…
NDBI010 - DIS - MFF UK
Document Models in TRS’s
Boolean Model
Boolean Model of TRS
• Mid-20th century
• Adoption of the procedures
used in librarianship
and their gradual implementation
NDBI010 - DIS - MFF UK
Boolean Model of TRS
• Database (collection) D containing n documents
– D = {d1, d2, … dn}
• Documents described using m terms
– T = {t1, t2, … tm}
– term tj = descriptor, usually a word or collocation
• Each document is represented as a subset of the
available terms
– contained in the document
– best describing the content of the document
– di ⊆ T
NDBI010 - DIS - MFF UK
Boolean Model of TRS
• Assigning of a set of terms to document can be achieved by
different approaches
– Subdivision according to author
• Manual
– Done by a human indexer, that understands the content of document
– Non-consistent. More indexers need not produce the same set of terms.
One indexer might later produce different set of terms as before.
• Automatic
– Done algorithmically
– Consistent, but without text understanding
– Subdivision according to free will in selecting descriptors
• Controlled
– Set of terms is defined in advance and indexer cannot change it. It only can
select those describing given document as best as possible.
• Non-controlled
– The set of terms can be extended whenever new document is inserted into
collection.
NDBI010 - DIS - MFF UK
Indexation
• Thesaurus
– Internally structured set of terms
• Synonyms with a defined preferred term
• Hierarchies of semantically narrower/broader terms
• Similar terms
• ...
• Stop-list
– Set of non-significant terms that are meaningless for
indexation
• Pronouns, interjections, …
NDBI010 - DIS - MFF UK
Indexation
• Common words are
not suitable for
document identification
• Neither are too specific words.
A lot of different terms appear
in a very small number of docs
• Their elimination
decreases the size of the index
significantly, and its quality
only slightly
[Plot: fraction of documents containing a term (0 … 1); suitable terms lie
between the too frequent and the too rare ones]
NDBI010 - DIS - MFF UK
Boolean Model of TRS
• A query is represented by a Boolean expression
– ta AND tb   the document has to contain/be
              described by both terms
– ta OR tb    the document has to contain/be
              described by at least one of the terms
– NOT t       the document must not contain/be
              described by the given term
NDBI010 - DIS - MFF UK
Boolean Model of TRS
• Query examples:
– ‘searching’ AND ‘information’
– ‘encoding’ OR ‘decoding’
– ‘processing’ AND (‘document’ OR ‘text’)
– ‘computer’ AND NOT ‘personal’
NDBI010 - DIS - MFF UK
Boolean Model of TRS
– Extensions
• Collocations in queries
– ‘searching for information’
– ‘data encoding’ OR ‘data decoding’
– ‘text processing’
– ‘computer’ AND NOT ‘personal computer’
NDBI010 - DIS - MFF UK
Boolean Model of TRS
– Extensions
• Using of factual meta-data
(attribute values)
– ‘database’
AND (author = ‘Salton’)
– ‘text processing’
AND (year_of_publishing >= 1990)
NDBI010 - DIS - MFF UK
Boolean Model of TRS
– Extensions
• Wildcards in terms
– ‘datab*’ AND ‘system*’
• stands for terms
‘database’, ‘databases’,
‘system’, ‘systems’, etc.
– ‘portabl?’ AND ‘computer*’
• stands for terms
‘portable’,
‘computer’, ‘computers’, ‘computerized’ etc.
NDBI010 - DIS - MFF UK
Boolean Index Structure
• Inverted file
– It holds a list of identified documents for each
term
(instead of a set of terms for each document)
• t1 = d1,1, d1,2, ..., d1,k1
• t2 = d2,1, d2,2, ..., d2,k2
• …
• tm = dm,1, dm,2, ..., dm,km
NDBI010 - DIS - MFF UK
Boolean Index Structure
• One-by-one processing of inserted documents
produces a sequence of couples
<doc_id,term_id>
sorted by first component, i.e. by doc_id
• Next the sequence is reordered lexicographically
by term_id, doc_id and duplicates are removed
• The result can be further optimized by adding
directory pointing to sections, corresponding to
individual terms, and removing term_id’s from the
sequence
NDBI010 - DIS - MFF UK
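A sketch (Python) of building the inverted file from <doc_id, term_id> couples and of evaluating a simple conjunctive query over it; the tiny collection is made up:

from collections import defaultdict

def build_inverted_file(docs):
    """docs maps doc_id -> list of (already lemmatised, stop-listed) terms."""
    pairs = {(term, doc_id) for doc_id, terms in docs.items() for term in terms}
    inverted = defaultdict(list)
    for term, doc_id in sorted(pairs):      # reorder by term_id, doc_id; no duplicates
        inverted[term].append(doc_id)
    return inverted

def eval_and(inverted, t1, t2):
    """t1 AND t2 – intersection of the two postings lists."""
    return sorted(set(inverted.get(t1, [])) & set(inverted.get(t2, [])))

docs = {1: ["text", "retrieval"], 2: ["text", "database"], 3: ["retrieval"]}
idx = build_inverted_file(docs)
print(eval_and(idx, "text", "retrieval"))   # -> [1]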
Lemmatization and Disambiguation
of Czech Language (ÚFAL)
• Odpovědným zástupcem nemůže být každý.
• Zákon by měl zajistit individualizaci odpovědnosti a zajištění odbornosti. …
(Not everybody can be a responsible representative. The law should ensure
individualisation of responsibility and assurance of expertise. …)
• The annotation stores: paragraph Nr., sentence Nr., word in the document,
lemma including its meaning, type of word (part of speech), …
<p n=1>
<s id="docID:001-p1s1">
<f cap>Odpovědným
<MDl>odpovědný_^(kdo_za_něco_odpovídá)
<MDt>AAIS7----1A---
<f>zástupcem<MDl>zástupce<MDt>NNMS7-----A---
<f>nemůže<MDl>moci_^(mít_možnost_[něco_dělat])<MDt>VB-S---3P-NA--
<f>být<MDl>být<MDt>Vf-------A---
<f>každý<MDl>každý<MDt>AAIS1----1A---
<p n=2>
…
NDBI010 - DIS - MFF UK
Proximity Constraints
• t1 (m,n) t2
– most general form
– term t2 can appear at most m words after t1,
or term t1 can appear at most n words after t2.
• t1 sentence t2
– terms have to appear in the same sentence
• t1 paragraph t2
– terms have to appear in the same paragraph
NDBI010 - DIS - MFF UK
Proximity Constraints – Evaluation
• Using the same index structure
– Operators replaced by conjunctions
– Query evaluation to find candidates
– Check for co-occurrences in primary texts
• Small index
• Longer time needed for evaluation
• Necessity of storing primary documents
• Extension of index by positions of term
occurrences in documents
– Large index
NDBI010 - DIS - MFF UK
Extended Index Structure
• During indexation a sequence
of 5-tuples
<doc_id, term_id, para_nr, sent_nr, word_nr>
is built, ordered by doc_id, para_nr, sent_nr, word_nr
• The sequence is reordered by
<term_id, doc_id, para_nr, sent_nr, word_nr>
• No duplicates are removed
NDBI010 - DIS - MFF UK
Thesaurus Utilization
• BT(x) - Broader Term to term x
• NT(x) - Narrower Terms
• PT(x) - Preferred Term
• SYN(x) - SYNonyms to term x
• RT(x) - Related Terms
• TT(x) - Top Term
NDBI010 - DIS - MFF UK
Disadvantages of Boolean Model
• Salton:
– Query formulation is more an art than a science.
– Hits cannot be rated by their quality.
– All terms in the query are taken as equally important.
– Output size cannot be controlled. The system frequently
produces empty or very large answers.
– Some results don’t correspond to intuitive
understanding.
• Documents in the answer to a disjunctive query can contain
only one of the mentioned terms as well as all of them.
• Documents eliminated from the answer to a conjunctive query
can miss only one of the mentioned terms as well as all of them.
NDBI010 - DIS - MFF UK
Partial Answer Ordering
Q = (t1 OR t2) AND (t2 OR t3) AND t4
– conversion to an equivalent DNF
Q’ =     (t1 AND t2 AND t3 AND t4)
   OR    (t1 AND t2 AND NOT t3 AND t4)
   OR    (t1 AND NOT t2 AND t3 AND t4)
   OR    (NOT t1 AND t2 AND t3 AND t4)
   OR    (NOT t1 AND t2 AND NOT t3 AND t4)
NDBI010 - DIS - MFF UK
Partial Answer Ordering
• Each elementary conjunction (further EC) contain
all terms used in original query and is rated by
number of terms used in positive way (without
NOT)
• All EC’s differs each from another in at least one
term (one contains tj, second contains NOT tj)
Every document correspond to at most one EC
Document is then rated by number, assigned to
given EC.
NDBI010 - DIS - MFF UK
Partial Answer Ordering
• There exist 2^k ECs
in the case of a query using k terms
• There exist only k different ratings
• More ECs can have the same rating
• (ta OR tb) =
=  (ta AND tb)          … rating 2
OR (ta AND NOT tb)      … rating 1
OR (NOT ta AND tb)      … rating 1
NDBI010 - DIS - MFF UK
Lecture No. 4
Vector Space Model of TRS
• 1970s
– about 20 years younger than the Boolean model of
TRS
• Tries to minimize and/or eliminate the
disadvantages of the Boolean model
NDBI010 - DIS - MFF UK
Vector Space Model of TRS
• Database D containing n documents
– D = {d1, d2, … dn}
• Documents are described by m terms
– T = {t1, t2, … tm}
– term tj = word or collocation
• Document representation using an
m-dimensional vector of term weights
– d_i = (w_i,1, w_i,2, ..., w_i,m)
NDBI010 - DIS - MFF UK
Vector Space Model of TRS
• Document model
– d_i = (w_i,1, w_i,2, ..., w_i,m) ∈ <0,1>^m
– w_i,j … level of importance of the j-th term
for identifying/describing the i-th document
• Query
– q = (q1, q2, ..., qm) ∈ <0,1>^m
– qj … level of importance of the j-th term for the
user
NDBI010 - DIS - MFF UK
Vector Space Model Index
        | w_1,1  w_1,2  …  w_1,m |
D =     | w_2,1  w_2,2  …  w_2,m |  ∈ <0,1>^(n x m)
        |   …      …    …    …   |
        | w_n,1  w_n,2  …  w_n,m |
(the i-th row of D is the vector d_i)
NDBI010 - DIS - MFF UK
Vector Space Model of TRS
• The similarity between the
vectors representing a
document and a query is
in general defined by a
similarity function
Sim(q, d_i) ∈ R
[Plot: document and query vectors in the unit square of the term space]
NDBI010 - DIS - MFF UK
Similarity Functions
• Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j = |q|·|d_i|·cos φ
• The factor q_j·w_i,j is proportional both to the
level of importance in the document and for the
user
• Orthogonal vectors have zero similarity
– Base vectors of the vector space (individual
terms) are orthogonal to each other and so have
zero similarity
NDBI010 - DIS - MFF UK
Vector Space Model of TRS
• Sim(q, d_i) = |q|·|d_i|·cos φ
• Not only the angle, but also the sizes of the vectors
influence the similarity
• Longer vectors, which tend to be assigned to longer
texts, have an advantage over shorter ones
• It is desirable to normalize all vectors to
unit size
NDBI010 - DIS - MFF UK
Vector normalization
• Vector length influence elimination
NDBI010 - DIS - MFF UK
Vector normalization
• In time of indexation
– No overhead in time of searching
– Sometimes it is necessary to re-compute all
vectors – in case that vectors reflects also
aspects dependent on complete collection
• In time of search
– Part of similarity function definition
– Slows down the response of the system
NDBI010 - DIS - MFF UK
Output Size Control
• Documents in the output list are ordered by
descending similarity to the given query
– Most similar documents at the beginning of the list
– The list size can be easily restricted with respect to
maximal criterion
• The maximal number of documents in the list can
be restricted
• Only documents reaching threshold similarity can
be shown in the result
NDBI010 - DIS - MFF UK
Negation in Vector Space Model
• Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j
• It is possible to extend the query space:
q = (q1, q2, ..., qm) ∈ <-1,1>^m
• Then the contribution q_j·w_i,j of the j-th dimension
can be negative
• Documents that contain the j-th term are suppressed in
comparison with the others
NDBI010 - DIS - MFF UK
Scalar product
Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j
NDBI010 - DIS - MFF UK
Cosine Measure
(Salton)
Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j / ( sqrt(Σ_{j=1}^{m} q_j²) · sqrt(Σ_{j=1}^{m} w_i,j²) )
NDBI010 - DIS - MFF UK
Jaccard Measure
Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j / ( Σ_{j=1}^{m} q_j² + Σ_{j=1}^{m} w_i,j² − Σ_{j=1}^{m} q_j·w_i,j )
NDBI010 - DIS - MFF UK
Dice Measure
Sim(q, d_i) = 2·Σ_{j=1}^{m} q_j·w_i,j / ( Σ_{j=1}^{m} q_j² + Σ_{j=1}^{m} w_i,j² )
NDBI010 - DIS - MFF UK
Overlap Measure
Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j / Σ_{j=1}^{m} min(q_j², w_i,j²)
NDBI010 - DIS - MFF UK
Asymmetric Measure
Sim(q, d_i) = Σ_{j=1}^{m} min(q_j, w_i,j) / Σ_{j=1}^{m} w_i,j
NDBI010 - DIS - MFF UK
Pseudo-Cosine Measure
Sim(q, d_i) = Σ_{j=1}^{m} q_j·w_i,j / ( (Σ_{j=1}^{m} q_j) · (Σ_{j=1}^{m} w_i,j) )
NDBI010 - DIS - MFF UK
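A sketch (Python) of the measures listed above, written directly from the formulas; the two vectors are made up only to show the calls:

import math

def dot(q, d): return sum(qj * dj for qj, dj in zip(q, d))

def cosine(q, d):
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

def dice(q, d):
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def overlap(q, d):
    return dot(q, d) / sum(min(qj, dj) ** 2 for qj, dj in zip(q, d))

def asymmetric(q, d):
    return sum(min(qj, dj) for qj, dj in zip(q, d)) / sum(d)

def pseudo_cosine(q, d):
    return dot(q, d) / (sum(q) * sum(d))

q = [0.0, 0.25, 0.75, 0.0]
d = [0.5, 0.0, 0.5, 0.5]
print(round(cosine(q, d), 3), round(dice(q, d), 3))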
Vector Space Model Indexation
• Based on number of occurrences of given
term in given document
– The more given word occurs in given
document, the more important for its
identification
• Term Frequency
TFi,j = #term_occurs / #all_occurs
NDBI010 - DIS - MFF UK
Vector Space Model Indexation
• Without a stop-list the
result contains almost
only meaningless
words at the beginning

      term   |   # |   TF
 1    the    | 239 | 0,0582
 2    of     |  96 | 0,0234
 3    a      |  84 | 0,0205
 4    to     |  78 | 0,0190
 5    and    |  70 | 0,0171
 6    is     |  65 | 0,0158
 7    for    |  60 | 0,0146
 8    be     |  53 | 0,0129
 9    if     |  52 | 0,0127
10    in     |  49 | 0,0119
11    use    |  49 | 0,0119
12    are    |  44 | 0,0107
13    it     |  44 | 0,0107
14    should |  38 | 0,0093
15    class  |  33 | 0,0080
NDBI010 - DIS - MFF UK
Vector Space Model Indexation
• Term frequencies are
very small even for the
most frequent terms
• Normalized term
frequency:
NTF_i,j = 0, iff TF_i,j ≤ ε
else
NTF_i,j = 1/2 + 1/2 · TF_i,j / max_k TF_i,k

      term      |  # |   TF   |  NTF
 1    use       | 49 | 0,0119 | 1,0000
 2    class     | 33 | 0,0080 | 0,8367
 3    owl       | 31 | 0,0076 | 0,8163
 4    c         | 26 | 0,0063 | 0,7653
 5    line      | 26 | 0,0063 | 0,7653
 6    example   | 25 | 0,0061 | 0,7551
 7    comments  | 23 | 0,0056 | 0,7347
 8    file      | 23 | 0,0056 | 0,7347
 9    bi        | 22 | 0,0054 | 0,7245
10    functions | 20 | 0,0049 | 0,7041
11    files     | 18 | 0,0044 | 0,6837
12    code      | 17 | 0,0041 | 0,6735
13    int       | 17 | 0,0041 | 0,6735
14    data      | 16 | 0,0039 | 0,6633
15    public    | 15 | 0,0037 | 0,6531
NDBI010 - DIS - MFF UK
Vector Space Model Indexation
[Plot: histogram of (normalized) term frequency – terms ordered by increasing
frequency; the NTF curve differentiates important terms from non-important
ones much better than the raw TF curve]
NDBI010 - DIS - MFF UK
Vector Space Model Indexation
• w_i,j =def TF_i,j
• w_i,j =def NTF_i,j
• w_i,j =def NTF_i,j · IDF_j
• IDF (Inverted Document Frequency) reflects the importance of a given term
in the index for the complete collection
IDF_j =def −log( #docs containing the term / #all docs )
(entropy of the probability that the term occurs in a randomly chosen document)
[Plot: IDF_j as a function of the fraction of documents containing the term]
NDBI010 - DIS - MFF UK
Vector Space Model Indexation
• (Optional) document vector
normalization to unit size
v_i,j =def NTF_i,j · IDF_j
w_i,j =def v_i,j / sqrt( Σ_k v_i,k² )
NDBI010 - DIS - MFF UK
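A sketch (Python) of the whole weighting pipeline – TF, NTF with a threshold ε, IDF and the optional normalisation to unit vectors; the toy collection exists only for the example:

import math
from collections import Counter

def index_collection(docs, eps=0.0):
    n = len(docs)
    tf = []
    for terms in docs:
        counts = Counter(terms)
        total = sum(counts.values())
        tf.append({t: c / total for t, c in counts.items()})     # TF_i,j
    df = Counter(t for doc in tf for t in doc)                   # document frequency
    idf = {t: -math.log2(df[t] / n) for t in df}                 # IDF_j
    vectors = []
    for doc in tf:
        max_tf = max(doc.values())
        ntf = {t: 0.0 if v <= eps else 0.5 + 0.5 * v / max_tf
               for t, v in doc.items()}                          # NTF_i,j
        w = {t: ntf[t] * idf[t] for t in ntf}                    # NTF * IDF
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors.append({t: x / norm for t, x in w.items()} if norm else w)
    return vectors

docs = [["text", "retrieval", "text"], ["database", "text"], ["retrieval", "model"]]
print(index_collection(docs))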
Querying in Vector Space Model
• Equal representation of documents and queries
brings many advantages over Boolean Model
• Query can be defined
– Directly by its hand-made definition
– By reference to known indexed document q  d i
– By reference to non-indexed document
– indexer creates ad-hoc vector from its primary text
– By text fragment (using copy-paste etc.)
– By combination of some above mentioned ways
NDBI010 - DIS - MFF UK
Feedback
• Query building/tuning based on user
feedback to previous answers
– Adding terms identifying relevant documents
– Elimination of terms unimportant for relevant
document identification and important for
irrelevant ones
• Prediction criterion improvement
NDBI010 - DIS - MFF UK
Feedback
• Answer to previous
query q
is classified by the
user, who can mark
relevant and/or
irrelevant documents
NDBI010 - DIS - MFF UK
Positive Feedback
• Relevant document
“attract” the query
towards them
NDBI010 - DIS - MFF UK
Negative Feedback
• Irrelevant documents
push query away from
them
– Less effective than
positive feedback
– Less used
NDBI010 - DIS - MFF UK
Feedback
• The query iteratively
migrates towards the
center of relevant
documents
NDBI010 - DIS - MFF UK
Feedback
• General form: q’ = β_0·q + Σ_{j=1}^{k} β_j·d_i_j
• One frequently used special form –
the centroid (centre of gravity):
q’ = β·q + (1−β) · ( Σ_{j=1}^{k} d_i_j ) / k
NDBI010 - DIS - MFF UK
Feedback
• General form: q’ = β_0·q + Σ_{j=1}^{k} β_j·d_i_j
• Another used (weighted) form –
the weighted centroid (centre of gravity):
q’ = β·q + (1−β) · ( Σ_{j=1}^{k} v_j·d_i_j ) / ( Σ_{j=1}^{k} v_j )
NDBI010 - DIS - MFF UK
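A sketch (Python) of the centroid form of the feedback formula; the weighted form differs only in replacing the plain average by a weighted one:

def feedback(query, relevant, beta=0.5):
    """q' = beta*q + (1-beta) * centroid of the relevant documents."""
    m = len(query)
    centroid = [sum(d[j] for d in relevant) / len(relevant) for j in range(m)]
    return [beta * query[j] + (1 - beta) * centroid[j] for j in range(m)]

q = [0.2, 0.0, 0.8]
relevant = [[0.1, 0.5, 0.4], [0.3, 0.7, 0.0]]
print(feedback(q, relevant))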
Term Equivalence in VS Model
• Individual terms (dimensions of the space)
are supposedly, but not really, mutually
independent
T = (t1, t2, t3, t4, t5, t6),  where t1 ≡ t2, t3 ≡ t4
d = (1/2, 0, 0, 1/2, 1/2, 1/2)
q = (0, 1/4, 3/4, 0, 0, 0)
Sim(q, d) = 0
⇒ Problem with prediction
– inappropriately chosen synonyms
NDBI010 - DIS - MFF UK
Term Equivalence in VS Model
• Equivalency matrix E
d = (1/2, 0, 0, 1/2, 1/2, 1/2)
q = (0, 1/4, 3/4, 0, 0, 0)

    | 1 1 0 0 0 0 |
    | 1 1 0 0 0 0 |
E = | 0 0 1 1 0 0 |
    | 0 0 1 1 0 0 |
    | 0 0 0 0 1 0 |
    | 0 0 0 0 0 1 |

q·E = (1/4, 1/4, 3/4, 3/4, 0, 0)
Sim(q·E, d) = 4/8 = 1/2
NDBI010 - DIS - MFF UK
Term Similarity in VS Model
1.0 0.8 0 0 0.2 0 
• Generalised equivalence
0.8 1.0 0 0 0 0 
• Similarity matrix
 0 0 1.0 1.0 0 0 

S  0 0 1.0 1.0 0 0 
•
  38 19 0.2 0 0 0 1.0 0 


Sim q  S , d   
 0 0 0 0 0 1.0
80 40
• All computation used in VS model can be evaluated also
on transposed index.
Here mutual similarity of term can be evaluated
  (vector dimension n, not m)
Sim t j1 , t j 2


+ Really similar terms co-occurs often together
– Common terms co-occurs often together as well
NDBI010 - DIS - MFF UK
Term Hierarchies in VS Model
• Similarly to the
Boolean Model
[Diagram: term hierarchy – Publication above Print; Papers, Book and Magazine
below]
NDBI010 - DIS - MFF UK
Term Hierarchies in VS Model
• Similarly to the
Boolean Model
• Edges can have
assigned weights
• User-supplied weights
can then
be easily propagated
along the hierarchy
[Diagram: the same hierarchy with edge weights (e.g. 0.3, 0.4, 0.6, 0.7, 0.8)
and a user weight on Papers propagated to the remaining terms (0.48, 0.32,
0.224, 0.096)]
NDBI010 - DIS - MFF UK
Citations and VS Model
• Scientific publications cites their sources
• Assumption:
– Cited documents are semantically similar
– Citing documents are semantically similar
NDBI010 - DIS - MFF UK
Citations and VS Model
• Direct reference between documents “A” and “B”
– Document “A” cites document “B”
– Denoted A→B
• Indirect reference between “A” and “B”
– ∃ C1, …, Ck such that A→C1→…→Ck→B
• Link between documents “A” and “B”
– A→B or B→A
NDBI010 - DIS - MFF UK
Citations and VS Model
• A and B are
bibliographically
paired,
if and only if they cite
the same source C
AC  BC
• A and B are co-cited,
if and only if they are
both cited in some
document C
CA  CB
NDBI010 - DIS - MFF UK
Citations and VS Model
• Acyclic oriented
citation graph
• Adjacency (flowchart) matrix of
the citation graph
• C = [c_ij] ∈ {0,1}^(n x n)
c_ij = 1, iff i→j
c_ij = 0 otherwise
NDBI010 - DIS - MFF UK
Citations and VS Model
• BP matrix of bibliographic pairing
• bp_ij = number of documents cited by both
documents i and j.
– It follows that bp_ii = number of documents cited by i
• bp_ij = c_i · c_j = Σ_{k=1}^{n} c_ik·c_jk
NDBI010 - DIS - MFF UK
Citations and VS Model
• CP matrix of co-citation pairing
• cp_ij = number of documents
citing both i and j
– It follows that cp_ii = number of documents citing i
• cp_ij = c_•i · c_•j = Σ_{k=1}^{n} c_ki·c_kj
NDBI010 - DIS - MFF UK
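A sketch (Python/NumPy) of both matrices computed from a small, made-up citation matrix C (c_ij = 1 iff document i cites document j):

import numpy as np

C = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

BP = C @ C.T                         # bp_ij = number of documents cited by both i and j
CP = C.T @ C                         # cp_ij = number of documents citing both i and j
DL = ((C + C.T) > 0).astype(int)     # dl_ij = 1 iff i cites j or j cites i

print(BP)
print(CP)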
Citations and VS Model
• DL matrix of document links
• dl_ij = 1 ⇔ (c_ij = 1 ∨ c_ji = 1)
• It is possible to modify the resulting similarities
between documents and a given query using some
of the matrices BP, CP, DL
• Modification of the index matrix D
– D’ = BP.D, resp. D’ = CP.D, resp. D’ = DL.D
– D’ = BP.CP.DL.D
NDBI010 - DIS - MFF UK
Using mutual document
similarities in VS Model
• DS matrix of mutual document similarities
• ds_ij = Sim(d_i, d_j)
• The same idea as in the case of BP, CP, DL
• Modification of the index matrix D
– D’ = DS.D
NDBI010 - DIS - MFF UK
Lecture No. 5
Term Discrimination Values
• Discrimination value defines the importance of
the term in the vector space to distinguish
individual documents stored in the collection
• By removal of the term from index, i.e. by
reduction of index dimensionality it can happen:
– Overall distance between documents decreases
(average similarity of document pairs increases)
– Overall distance between documents increases
(average similarity of document pairs decreases)
• In this case the presence of the dimension in the space
is not needed (is contra-productive)
NDBI010 - DIS - MFF UK
[Diagram: removing a dimension from the space changes the angles between
document vectors (e.g. from 45,0° to 35,3° or 0,0°)]
NDBI010 - DIS - MFF UK
Term Discrimination Values
• Computation based on the
average document similarity:
Q = ( Σ_{i,j=1}^{n} Sim(d_i, d_j) ) / n²
• A more efficient variant
uses a “central document” (centroid):
c = ( Σ_{i=1}^{n} d_i ) / n
Q = ( Σ_{i=1}^{n} Sim(d_i, c) ) / n
NDBI010 - DIS - MFF UK
Term Discrimination Values
• The same value is computed for the space
reduced by the k-th dimension
x^(k) = (x_1, …, x_{k-1}, x_{k+1}, …, x_m)
c^(k) = ( Σ_{i=1}^{n} d_i^(k) ) / n
Q^(k) = ( Σ_{i=1}^{n} Sim(d_i^(k), c^(k)) ) / n
NDBI010 - DIS - MFF UK
Term Discrimination Values
• The discrimination value
is defined as the
difference of the two
average values
DV_k = Q^(k) − Q
• Can be used
instead of IDF_k
DV_k > 0 … important term,
discriminating documents;
DV_k defines the measure of importance
DV_k < 0 … unimportant term
NDBI010 - DIS - MFF UK
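A sketch (Python/NumPy) of the centroid-based computation of DV_k, using the scalar product as the similarity; the small index matrix is made up:

import numpy as np

def discrimination_values(D):
    """DV_k = Q^(k) - Q, where Q is the average similarity of documents to the
    centroid and Q^(k) is the same value with the k-th dimension removed."""
    def avg_sim(M):
        c = M.mean(axis=0)               # centroid
        return (M @ c).mean()            # average Sim(d_i, c)
    Q = avg_sim(D)
    return np.array([avg_sim(np.delete(D, k, axis=1)) - Q
                     for k in range(D.shape[1])])

D = np.array([[0.9, 0.1, 0.4],
              [0.8, 0.2, 0.0],
              [0.7, 0.0, 0.6]])
print(discrimination_values(D))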
Term Discrimination Values
[Plot: average DV_k of terms depending on the number of documents in which
the term is present; the collection contains 7777 documents]
• Results for a collection of 7777 articles published in the
newspaper „Lidové noviny“ in 1994, described by 13495 lemmas
• Positive DV_k in the case of 12324 lemmas having 478849 occurrences.
• Negative DV_k in the case of 1170 lemmas having 466992 occurrences.
NDBI010 - DIS - MFF UK
Document clustering
Kohonen maps
C3M algorithm
K-means algorithm
Document Clustering
• Response time of VS based TRS
is directly proportional to number of
documents in the collection,
that must be compared with the query
• Clustering allows to skip major part of
index during the search and compare only
closest documents
NDBI010 - DIS - MFF UK
Document Clustering
• Without clusters, it is
necessary to compare
all documents, even if
the minimal needed
similarity is defined
NDBI010 - DIS - MFF UK
Document Clustering
• Each cluster represent
m-dimensional sphere,
defined by its center
and radius
• If not, it is possible to
approximate it this
way during
computations
NDBI010 - DIS - MFF UK
Document Clustering
• Having clusters, the
query evaluation need
not to compare
documents in clusters
outside the area of
user interest
NDBI010 - DIS - MFF UK
Cluster types
• Clusters having
the same volume
+ Easy to create
– Some clusters can
be almost empty,
while others can
contain huge
amount of
documents
NDBI010 - DIS - MFF UK
Cluster types
• Clusters having
(approximately)
the same number
of documents
– Hard to create
+ More effective in
case of nonuniformly
distributed docs.
NDBI010 - DIS - MFF UK
Cluster types
• Non-disjunctive
clusters
• One document can
belong to more than
one cluster
• Sometimes weighted
belonging in fuzzy
clusters.
NDBI010 - DIS - MFF UK
Cluster types
• Disjunctive
clusters
• Document can
belong to exactly
one cluster
NDBI010 - DIS - MFF UK
Cluster types
• It is not possible to
completely and
disjointly cover
space using spheres
• It is possible to use
convex polyhedra,
where each
document belongs
to closest center
NDBI010 - DIS - MFF UK
Cluster types
• Then clusters can
be approximated by
non-disjoint set of
spheres, defined by
the center and the
most distant
belonging
document
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (I)
• Let a query q and a minimal required
similarity s be given
– Note: similarity computed by the scalar product, vectors are normalized
• The index is split into k clusters
(c1,r1), …, (ck,rk)
– Note: the radii are angular
• Query radius
r = φ = arccos(s)
s = cos(φ)
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (I)
• Whether the intersection of a cluster with the
query area is empty is found out from the value
arccos(Sim(q,ci)) - r - ri
• If this value ≤ 0, the
documents in the
cluster are compared
• If this value > 0, the
documents cannot be
in the result
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (II)
• Let are given query q and maximal number
of required documents x.
• Again, index is split to k clusters
(c1,r1), …, (ck,rk)
• No radius of the query is available
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (II)
• Clusters are sorted in ascending
order by increasing distance of
their center from the query, i.e.
according to arccos(Sim(q,ci))
• Better: sorted by increasing
distance of the cluster boundary
from the query, i.e.
according to arccos(Sim(q,ci)) - ri
[Diagram: two clusters whose evaluation order differs depending on whether
centers or boundaries are used]
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (II)
• Clusters are sorted in
ascending order by
arccos(Sim(q,ci)) - ri,
i.e. by increasing
distance of the cluster
boundary from the
query q
[Diagram: x=7 required documents; clusters numbered 1.–5. in the order of
evaluation]
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (II)
• The closest cluster is
evaluated first
[Diagram: x=7]
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (II)
• While there are not
enough documents in
the result, the next
closest cluster is
evaluated
• Once there are enough
documents, the x-th
best document defines
the working radius
[Diagram: x=7]
NDBI010 - DIS - MFF UK
Query Evaluation With Clusters (II)
• Once there are enough
documents, the next
cluster is evaluated
only if it intersects the
sphere given by the
query and the x-th best hit
• If some documents
were replaced by
better ones from the new
cluster, the working radius
is reduced
[Diagram: x=7]
NDBI010 - DIS - MFF UK
Multi-level Clustering
• If there are still a lot of
clusters, it is
possible to cluster
them further to
obtain second-level
clusters, etc.
NDBI010 - DIS - MFF UK
Clustering Methods
• Kohonen self-organizing maps
– Used for classification of multi-dimensional input patterns
– Unsupervised artificial neural network
– Self-organizing structure
– Conforms to the density of patterns (documents) in a given area
– Tends to create clusters having approximately the same number of members
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
• A regular k-dimensional net of m-dimensional points (centers)
– Usually k << m
• Each center has an assigned position in the m-dimensional space and up to 2k predefined neighbors (two in each of the k dimensions; centers at the boundary have fewer neighbors)
– Ex.: 2- and 1-dimensional maps in a 2-dimensional space
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
• At the start, the centers have random positions
• When inserting a document d:
– The closest center cx is found and moved closer to the document:
cx ← α·d + (1−α)·cx
– Its defined neighbors in the map are moved as well:
c ← β·d + (1−β)·c
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
• The parameters 0 ≤ β ≤ α ≤ 1 denote the measure of system flexibility
• It is advisable to decrease these coefficients towards zero over time
– Later, the map center positions represent more useful information that should not be suddenly forgotten
NDBI010 - DIS - MFF UK
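A single adaptation step can be sketched in Python as follows (a minimal illustration; the neighbours dictionary and the default parameter values are assumptions, not prescribed by the slides):

import numpy as np

def som_insert(doc, centers, neighbors, alpha=0.1, beta=0.05):
    """One adaptation step of a Kohonen map.
    centers: (#centers, m) array; neighbors: dict center index -> list of map neighbors.
    alpha, beta: flexibility coefficients, 0 <= beta <= alpha <= 1."""
    x = int(np.argmin(np.linalg.norm(centers - doc, axis=1)))   # closest center
    centers[x] = alpha * doc + (1.0 - alpha) * centers[x]        # move the winner
    for nb in neighbors[x]:                                      # move its map neighbors
        centers[nb] = beta * doc + (1.0 - beta) * centers[nb]
    return x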
Kohonen self-organizing maps
• The map before adaptation to new document
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
• The map after adaptation to new document
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
• Clusters are defined by map centers
• Each cluster contains documents, which are
closer to given center
than to any other one
• Proximate points
in the map are
proximate in the
original space
(but not vice versa)
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
• It is possible to use them to cluster terms/lemmas
– The index matrix is transposed
– The n-dimensional space is mapped instead of the original m-dimensional one
• Example:
– The map has 15×15 centers
– 7777 documents (Lidové noviny, 1994)
– 13495 lemmas
– 100000 learning iterations using a randomly chosen lemma vector
• Created lemma clusters (translated from Czech):
– C2,4: stock-exchange, stocking, coupon, stock, investor, volume, investment, fund, value, business
– C2,5: wave, privatization, national
– C3,6: literary, writer, literature, publisher, origination, reader, history, text, book, write
– C3,13: Havel, Václav, president
– C4,13: Klaus, prime, minister
– C6,14: stage, comedy, film, script, audience, festival, shot, story, film, role
NDBI010 - DIS - MFF UK
Kohonen maps
2D map of lemmas (unfolded 2D map from the previous example)
(Figure: the 15×15 map groups Czech lemmas into topical regions – economy (ekonomika: banka, trh, koruna, cena, burza, akcie, …), literature (literatura: literární, spisovatel, kniha, čtenář, …), sport (turnaj, finále, medaile, olympijský, mistrovství, …) and politics (politika: Havel, Václav, prezident, Klaus, premiér, ministr, …); neighbouring map cells contain semantically related lemmas.)
NDBI010 - DIS - MFF UK
Kohonen self-organizing maps
Obtained Cluster Sizes
• The cluster sizes here are far from equal …
(Flattened 15×15 table of cluster sizes omitted: number of lemmas assigned to each map cell. The sizes are highly skewed – most cells hold only a few lemmas, while a few cells hold hundreds to thousands; the grand total (Celkový součet) is 13495 lemmas.)
Kohonen self-organizing maps
Cluster C15,1
• C15,1 (Football player names and terminology)
37, 41, 45, 46, 53, 54, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, Adamec, Babka, Baček, Balcárek, Baník, Baránek, Barát, Barbarič, Barbořík,
Barcuch, Bečka, Bejbl, Beránek, Berger, Bielik, Bílek, Blažek, boční, Bohuněk, Borovec, Brabec, brána, Branca,
brankový, Breda, brejk, Brückner, břevno, Březík, Budka, centr, Cieslar, Culek, Cupák, Čaloun, čára, Časka, čermák,
Červenka, Čihák, čížek, Diepold, Dobš, dohrávat, Dostál, Dostálek, drnovický, Drulák, Duhan, Džubara, faul,
fauloval, Frýdek, Fujdiar, Gabriel, Galásek, gólman, gólový, Gunda, Guzik, Harazim, Hašek, Havlíček, Heřman,
hlavička, Hodúl, Hoftych, Hogen, Holec, Holeňák, Hollý, holomek, Holota, holub, Horňák, Horváth, hostující,
hradecký, Hrbek, Hromádko, Hrotek, Hruška, Hřebík, hřídel, Hýravý, Hyský, chebský, chovanec, inkasovat,
jablonecký, Janáček, Jančula, Janeček, Jánoš, Janota, Janoušek, Jarabinský, Jarolím, Jihočech, Jindra, Jindráček,
jinoch, Jirka, Jirousek, Jukl, Kafka, Kamas, Kerbr, Kirschbaum, Klejch, Klhůfek, Klimeš, klokan, Klusáček,
Knoflíček, Kobylka, Kocman, kočí, Koller, kolouch, koncovka, kop, kopačka, Kopřiva, Kordule, kostelník, Kotrba,
Kotůlek, Kouba, Koubek, kovář, kozel, Kožlej, Kr(krypton), krejčí, Krejčík, Krištofík, Krondl, Křivánek, Kubánek,
kuchař, Kukleta, Lasota, Lerch, Lexa, Lička, Litoš, Lokvenc, Ložek, Macháček, Machala, Maier, Majoroš, Maléř,
Marek, Maroš, Mašek, Mašlej, Maurer, mela, míč, Mičega, Mičinec, Michálek, Mika, mina, mířit, Mojžíš, Mucha,
nápor, nastřelit, Navrátil, Nečas, Nedvěd, Nesvačil, Nešický, Neumann, Novák, Novotný, Obajdin, olomoucký,
Onderka, Ondráček, Ondrůšek, Palinek, Pařízek, Pavelka, Pavlík, Pěnička, Petrouš, Petržela, Petřík, pilný, plzeňský,
Poborský, pokutový, poločas, poslat, Poštulka, Povišer, prázdný, Pražan, proměněný, protiútok, Průcha, předehrávka,
přesný, převaha, Přibyl, přidat, přihrávka, ptáček, Puček, půle, Purkart, rada, Radolský, Rehák, roh, rohový, Rusnák,
Řepka, samec, Sedláček, Schindler, Siegl, Sigma, síť, Skála, skórovat, slabý, Slezák, Slončík, Sokol, sólo, srazit,
standardní, Stejskal, střídající, střílet, Studeník, Suchopárek, Svědík, Svoboda, šatna, Šebesta, šedivý, šestnáctka,
šilhavý, Šimurka, šindelář, Šlachta, Šmarda, Šmejkal, Špak, Švach, Tejml, tesařík, Tibenský, tlak, Tobiáš, trefit, Trval,
Tuma, tyč, Tymich, Uličný, Ulich, Ulrich, uniknout, Urban, Urbánek, útočný, úvod, Vacek, Vaďura, Vágner, Vácha,
Valachovič, valnoha, Váňa, Vaněček, Vaniak, vápno, Vávra, vejprava, veselý, Vidumský, Víger, Viktoria, vlček, volej,
Vonášek, Vosyka, Votava, vrabec, vyloučený, vyložený, Výravský, vyrazit, vyrovnání, Vyskočil, vystrašit, Wagner,
Weber, Wohlgemuth, zachránit, zákostelský, zákrok, Západočech, zlikvidovat, zlínský, Zúbek, žižkovský, ŽK (žlutá
karta)
NDBI010 - DIS - MFF UK
Lecture No. 6
C3M Clustering
Cover Coefficient-based
Clustering Methodology
C³M Clustering
• First, the inverses of the row totals and of the column totals of the index matrix are computed
• D = (wi,j) is the n×m index matrix (rows = documents, columns = terms)
NDBI010 - DIS - MFF UK
C³M Clustering
• Inverse row total: αi = 1 / Σ_{k=1..m} wi,k
• Inverse column total: βj = 1 / Σ_{k=1..n} wk,j
• Assuming that:
– Each document is indexed by at least one term
– Each term describes at least one document
NDBI010 - DIS - MFF UK
C³M Clustering
• Product wi, j * i expresses occurrence of j-th
term in i-th document
– If one points randomly inside i-th document,
what is the probability he/she find j-th term?
• Product wi, j *  j expresses occurrence of i-th
document for j-th term
– If one points randomly to some occurrence of j-th term
in the collection, what is the probability
he/she find it in i-th document?
NDBI010 - DIS - MFF UK
C³M Clustering
• Second, the matrix C of cover coefficients is computed:
ci,j = αi · Σ_{k=1..m} ( wi,k · βk · wj,k )
• If I pick a random term occurrence from the i-th document and then select a random occurrence of the same term in the collection, what is the probability that I pick an occurrence from the j-th document?
NDBI010 - DIS - MFF UK
C³M Clustering
• Second, the matrix C of cover coefficients is computed:
ci,j = αi · Σ_{k=1..m} ( wi,k · βk · wj,k )
• If the i-th document contains an exclusive set of terms:
– ci,i = 1
– ci,j = 0 for i ≠ j
NDBI010 - DIS - MFF UK
C³M Clustering
1) Σ_{j=1..n} ci,j = 1
Σ_{j=1..n} ci,j = Σ_{j=1..n} αi · Σ_{k=1..m} βk wi,k wj,k
= αi · Σ_{k=1..m} wi,k · ( βk · Σ_{j=1..n} wj,k )
= αi · Σ_{k=1..m} wi,k · 1 = 1
(the inner factor βk · Σj wj,k equals 1 by the definition of βk, and αi · Σk wi,k equals 1 by the definition of αi)
NDBI010 - DIS - MFF UK
C³M Clustering
2) ci,j ≥ 0 – obvious
3) ci,j ≤ 1 – follows from 1) and 2)
4) ci,j > 0 ⟺ Σk βk wi,k wj,k > 0 ⟺ cj,i > 0
5) ci,j = 0 ⟺ cj,i = 0 – follows from 2) and 4)
6) ci,i = ci,j = cj,j = cj,i ⟺ di = dj
NDBI010 - DIS - MFF UK
C³M Clustering
• The cover coefficient ci,j says how much the terms occurring in one document cover the terms in other documents
• If a given document covers the other documents poorly (a lot of exclusive terms), the value ci,i is close to 1
• If a given document covers the other documents well (a lot of very common terms), the value ci,i is close to 0
NDBI010 - DIS - MFF UK
C³M Clustering
• δi = ci,i … decoupling coefficient
• ψi = 1 − ci,i … coupling coefficient
• Number of needed clusters: nc = Σ_{i=1..n} δi
• pi = δi · ψi · Σ_{j=1..m} wi,j … the "power" of a given document to become the centre of a cluster
NDBI010 - DIS - MFF UK
C³M Clustering
• Normalized computation of the seed power:
pi = δi · ψi · Σ_{j=1..m} wi,j · δ'j · ψ'j, where
c'i,j = βi · Σ_{k=1..n} αk wk,i wk,j,  δ'i = c'i,i,  ψ'i = 1 − c'i,i
(the primed coefficients are the analogous cover coefficients computed for terms)
NDBI010 - DIS - MFF UK
C³M Clustering
• The first nc documents with the biggest pi values become cluster centers, with exceptions:
– Too dissimilar documents are put into a special "trash" cluster, which is compared against every query; nc can be decreased accordingly
– Only one representative is taken from a group of mutually similar documents (with similar values of ci,i, ci,j, cj,j and cj,i); the others (with similar pi) are skipped
• The other documents are assigned to the closest centre
NDBI010 - DIS - MFF UK
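The whole seed-selection procedure can be summarised by a short NumPy sketch (illustrative only; it follows the formulas above but omits the trash cluster and the skipping of near-duplicate seeds):

import numpy as np

def c3m_clusters(D):
    """C3M sketch for an (n x m) weight matrix D with non-negative weights."""
    alpha = 1.0 / D.sum(axis=1)                        # inverse row totals
    beta = 1.0 / D.sum(axis=0)                         # inverse column totals
    C = (alpha[:, None] * D) @ (beta[:, None] * D.T)   # c_ij = alpha_i * sum_k beta_k w_ik w_jk
    delta = np.diag(C)                                 # decoupling coefficients
    psi = 1.0 - delta                                  # coupling coefficients
    nc = int(round(delta.sum()))                       # number of clusters
    p = delta * psi * D.sum(axis=1)                    # seed power of each document
    seeds = np.argsort(-p)[:nc]                        # largest p values become centres
    members = np.argmax(C[:, seeds], axis=1)           # assign each document to the closest seed
    return seeds, members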
Example: 10 documents in a 3-term space (x, y, z); αi are the inverse row totals, βj the inverse column totals
       x       y       z       αi
d01  0,9800  0,1950  0,0397  0,8233
d02  0,9400  0,3410  0,0109  0,7740
d03  0,9600  0,2800  0,0000  0,8065
d04  0,9700  0,2430  0,0071  0,8196
d05  0,9500  0,3120  0,0125  0,7846
d06  0,1900  0,9800  0,0592  0,8136
d07  0,3200  0,9400  0,1183  0,7255
d08  0,2200  0,9600  0,1732  0,7390
d09  0,2300  0,9700  0,0787  0,7820
d10  0,0100  0,0200  0,9997  0,9711
βj   0,1733  0,1908  0,6669
NDBI010 - DIS - MFF UK
C³M Clustering
• Cover coefficients
c     1      2      3      4      5      6      7      8      9      10
1  0,1439 0,1421 0,1428 0,1432 0,1427 0,0579 0,0761 0,0639 0,0636 0,0238
2  0,1336 0,1358 0,1352 0,1346 0,1356 0,0736 0,0884 0,0771 0,0783 0,0079
3  0,1399 0,1408 0,1409 0,1406 0,1409 0,0677 0,0834 0,0709 0,0727 0,0022
4  0,1426 0,1425 0,1429 0,1429 0,1428 0,0636 0,0803 0,0675 0,0689 0,0060
5  0,1360 0,1374 0,1371 0,1367 0,1374 0,0707 0,0860 0,0744 0,0755 0,0088
6  0,0572 0,0774 0,0683 0,0632 0,0733 0,1561 0,1554 0,1575 0,1563 0,0354
7  0,0671 0,0828 0,0751 0,0711 0,0795 0,1386 0,1420 0,1437 0,1400 0,0602
8  0,0574 0,0736 0,0650 0,0608 0,0701 0,1431 0,1464 0,1509 0,1445 0,0883
9  0,0604 0,0791 0,0705 0,0657 0,0753 0,1502 0,1509 0,1529 0,1508 0,0443
10 0,0281 0,0099 0,0027 0,0072 0,0108 0,0423 0,0806 0,1161 0,0550 0,6474
NDBI010 - DIS - MFF UK
C³M Clustering
• Coefficients δ, ψ and p
      1      2      3      4      5      6      7      8      9      10
δ  0,1439 0,1358 0,1409 0,1429 0,1374 0,1561 0,1420 0,1509 0,1508 0,6474
ψ  0,8561 0,8642 0,8591 0,8571 0,8626 0,8439 0,8580 0,8491 0,8492 0,3526
p  0,1496 0,1516 0,1501 0,1494 0,1510 0,1619 0,1679 0,1734 0,1638 0,2351
• nc = Σ δi ≈ 2
NDBI010 - DIS - MFF UK
C³M Clustering
• With nc = 2, the two documents with the largest p become the cluster seeds: d10 (p = 0,2351) and d08 (p = 0,1734)
• After the 8th document is chosen as a centre, the 6th, 7th and 9th documents should not be taken as further seeds – their cover coefficients and p values are very similar to those of d08
(the cover-coefficient matrix from the previous slide is shown again with these rows highlighted)
NDBI010 - DIS - MFF UK
(The document table from the beginning of the example is shown again; documents d08 and d10, the chosen cluster seeds, are highlighted.)
NDBI010 - DIS - MFF UK
C²ICM Clustering
• Cover Coefficient-based Incremental Clustering Methodology
– INSERT: the document is assigned to the closest cluster, or to the trash cluster
– DELETE: if the centre of a cluster is deleted, the cluster is marked as invalid
– REORGANIZE:
• Centres of clusters are chosen from scratch
• Current clusters whose centre was not chosen are marked as invalid
• Documents from invalid clusters are reassigned to the new clusters
– The fact that some documents from valid clusters should also be reassigned, because some new centre may be closer, is not taken into account
NDBI010 - DIS - MFF UK
Spherical K-means Clustering
nm
• Vector index D  0,1
split to k
disjoint sets of documents
Dhillon, I., S.; Modha, D., S.
D  
• For each set j are defined

k den.
j j 1  S k

1

– Centroid
m j   d i
nj

m

j
– Centroid having unite size
cj 
mj

d i j
NDBI010 - DIS - MFF UK
Spherical K-means Clustering
• The value Σ_{di ∈ πj} di · cj can be considered a cluster quality measure (the higher the value, the better)
• The value Λ(S_k) = Σ_{j=1..k} Σ_{di ∈ πj} di · cj represents the quality of the whole classification/clustering
• Note: in contrast, Euclidean K-means minimizes Σ_{j=1..k} Σ_{di ∈ πj} ‖di − cj‖²
NDBI010 - DIS - MFF UK
Spherical K-means Clustering
• The goal is to find the classification with the maximal value of the assessing function
• Obviously Λ(S_k) = Σ_{j=1..k} Σ_{di ∈ πj} di · cj ≤ n, since di · cj ≤ 1 (vectors are normalized)
• In general an NP-complete problem
– An iterative algorithm that converges to a (local) maximum is used
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
• Initialization (0th iteration)
– Documents are assigned randomly to k clusters S_k^(0) (here k = 3)
– The positions of the centroids (gravity centers) cj^(0) are computed
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
• Iteration step tt+1
– Documents are
assigned to closest
centroid from previous
iteration

(
t 1)
j 
  | x   (t )    (t ) 
 d c j d c x 
d i
i
 i


– New centroid positions
are computed  (t 1)
cj
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
• The cycle iterates until the growth of the assessing function falls below a predefined threshold ε
• I.e. while Λ(S_k^(t+1)) − Λ(S_k^(t)) > ε
NDBI010 - DIS - MFF UK
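A compact NumPy sketch of the whole loop (illustrative; the random initialization and the stopping threshold eps are assumptions):

import numpy as np

def spherical_kmeans(D, k, eps=1e-6, rng=np.random.default_rng(0)):
    """D: (n x m) matrix of unit document vectors; returns labels and concept vectors."""
    n = D.shape[0]
    labels = rng.integers(0, k, size=n)            # 0th iteration: random assignment
    quality = -np.inf
    while True:
        C = np.zeros((k, D.shape[1]))
        for j in range(k):                         # centroids normalized to unit size
            m_j = D[labels == j].sum(axis=0)
            C[j] = m_j / (np.linalg.norm(m_j) or 1.0)
        sims = D @ C.T                             # similarities to all concept vectors
        labels = sims.argmax(axis=1)               # reassign to the closest centroid
        new_quality = sims.max(axis=1).sum()       # assessing function Lambda(S_k)
        if new_quality - quality <= eps:           # stop when the growth is below eps
            return labels, C
        quality = new_quality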
Spherical K-means Algorithm
• Result for k=5
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
• The assessing function is nondecreasing: Λ(S_k^(t+1)) ≥ Λ(S_k^(t))
• Cauchy–Schwarz inequality: for every unit vector x (‖x‖ = 1)
Σ_{di ∈ πj} di · x ≤ Σ_{di ∈ πj} di · cj
i.e. the centroid of a set has the highest average similarity to all items in the set
NDBI010 - DIS - MFF UK
• We want to show that Λ(S_k^(t)) ≤ Λ(S_k^(t+1)), i.e. that the iterations converge (a nondecreasing function bounded from above)
Λ(S_k^(t)) = Σ_{j=1..k} Σ_{di ∈ πj^(t)} di · cj^(t)
= Σ_{j=1..k} Σ_{l=1..k} Σ_{di ∈ πj^(t) ∩ πl^(t+1)} di · cj^(t)
• The similarities are summed first over the intersections of the clusters from the previous and the current iteration; they are still taken with respect to the original centroids, so the sum doesn't change
≤ Σ_{j=1..k} Σ_{l=1..k} Σ_{di ∈ πj^(t) ∩ πl^(t+1)} di · cl^(t) = Σ_{l=1..k} Σ_{di ∈ πl^(t+1)} di · cl^(t)
• Similarities to the closest centroid are taken instead of the original one; because the new centre is not farther than the old one, the similarity doesn't decrease
≤ Σ_{l=1..k} Σ_{di ∈ πl^(t+1)} di · cl^(t+1) = Λ(S_k^(t+1))
• Cauchy–Schwarz inequality: the sum of similarities of a group of vectors with their centroid is not smaller than the sum of similarities with any other unit vector
NDBI010 - DIS - MFF UK
Proof: inequality sequence
Λ(S_k^(t)) = Σ_{j=1..k} Σ_{di ∈ πj^(t)} di · cj^(t)
= Σ_{j=1..k} Σ_{l=1..k} Σ_{di ∈ πj^(t) ∩ πl^(t+1)} di · cj^(t)
(document assignment in the (t+1)-st iteration: πl^(t+1) = { di | ∀x: di · cl^(t) ≥ di · cx^(t) })
≤ Σ_{l=1..k} Σ_{j=1..k} Σ_{di ∈ πj^(t) ∩ πl^(t+1)} di · cl^(t)
= Σ_{l=1..k} Σ_{di ∈ πl^(t+1)} di · cl^(t)
≤ Σ_{l=1..k} Σ_{di ∈ πl^(t+1)} di · cl^(t+1)   (Cauchy–Schwarz inequality)
= Λ(S_k^(t+1))
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
– Example
• Combination of documents from three collections:
– MEDLINE: 1033 abstracts, medical journals
– CISI: 1460 abstracts, information retrieval
– CRANFIELD: 1400 abstracts, aviation
• Confusion matrix for the three computed clusters:
            cluster 1  cluster 2  cluster 3
MEDLINE        1004        11        18
CISI              5      1440        15
CRANFIELD         4        16      1380
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
– Example
• Distr. of similar. of docs – same cluster
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
– Example
• Distr. of similar. of docs – different clusters
• The clusters are mutually (almost) orthogonal
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
– Cluster Labeling
• Mutual orthogonality of clusters obtained using
spherical K-means algorithm shows,
that terms most important for given centroid
characterization are (almost) unimportant for
characterization of other centroids
• Individual centroids can be considered prototypes
of content inside cluster – a concept
• Important terms can “label” the content of given
cluster
NDBI010 - DIS - MFF UK
Spherical K-means Algorithm
– Cluster Labeling
• For each classification of documents S_k into k clusters we can define k clusters of terms W_k
• The i-th term cluster contains the terms having the highest weight in the i-th document cluster centroid
• Terms can be ordered primarily by the term cluster number and secondarily by their weight in the i-th centroid
• Each cluster is labeled by the most important terms in the cluster (i.e. by the terms having the highest weights)
NDBI010 - DIS - MFF UK
Hierarchical Clustering
• Either repeating of flat clustering,
or incremental building of clusters until stop
condition is met – typically reaching of wanted
number of clusters
• Agglomerative methods
– Gradual joining of most similar documents and/or
smaller clusters
• Divisive methods
– Gradual splitting of largest clusters
NDBI010 - DIS - MFF UK
Hierarchical Clustering
• Different definitions of cluster similarity produce
different results
– Single linkage clustering
• Similarity of two clusters =
= similarity of closest couple of documents
– Complete linkage clustering
• Similarity of two clusters =
= similarity of farthest couple of documents
– Average group linkage
• Similarity of two clusters =
= average similarity of all couples
NDBI010 - DIS - MFF UK
Term–document matrix (weights of the terms in documents d1–d10):
term          d1     d2     d3     d4     d5     d6     d7     d8     d9     d10
arithmetic    0.0    0.0    0.541  0.0    0.55   0.0    0.0    0.0    0.0    0.0
basketball    0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.563  0.0    0.0
C             0.0    0.0    0.541  0.0    0.55   0.0    0.0    0.0    0.0    0.0
error         0.0    0.0    0.0    0.556  0.0    0.0    0.0    0.0    0.517  0.0
cycle         0.0    0.0    0.0    0.556  0.55   0.55   0.0    0.0    0.0    0.0
inheritance   0.0    0.0    0.541  0.556  0.55   0.0    0.0    0.0    0.0    0.0
hardware      0.563  0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
player        0.0    0.583  0.0    0.0    0.0    0.0    0.0    0.531  0.517  0.556
java          0.0    0.0    0.541  0.556  0.0    0.0    0.0    0.0    0.0    0.0
language      0.0    0.0    0.541  0.556  0.55   0.0    0.0    0.0    0.0    0.0
trash         0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.531  0.0    0.0
ball          0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.531  0.333  0.0
pivot         0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.531  0.0    0.0
platform      0.0    0.0    0.0    0.556  0.55   0.0    0.0    0.0    0.0    0.0
computer      0.563  0.0    0.541  0.556  0.55   0.55   0.0    0.0    0.0    0.0
procedure     0.0    0.583  0.0    0.0    0.55   0.55   0.0    0.0    0.0    0.0
speed         0.563  0.0    0.0    0.0    0.0    0.55   0.0    0.0    0.0    0.556
server        0.563  0.0    0.0    0.0    0.0    0.55   0.0    0.0    0.0    0.0
software      0.563  0.0    0.541  0.0    0.0    0.0    0.0    0.0    0.0    0.0
sport         0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.563  0.55   0.556
net           0.563  0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.517  0.0
try           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.531  0.517  0.0
performance   0.0    0.0    0.0    0.556  0.0    0.55   0.0    0.0    0.0    0.0
NDBI010 - DIS - MFF UK
Hierarchical Clustering
– dendrogram
• Result obtained by
agglomerative
hierarchical clustering
using average group
linkage
NDBI010 - DIS - MFF UK
Hierarchical Clustering
• The obtained hierarchy is binary (in general k-ary)
• A hierarchy with open arity, which better reflects the similarities between clusters, is more natural and more suitable
• The optimal number of descendants of the root cluster is found, then the process is applied recursively to them
– The quality of the clustering is measured for different numbers of descendants
• The cut is made at the point of the highest growth of the error
• The cut is made at the point of the highest ratio of error differences
• The cut is made at the point of the highest second derivative
NDBI010 - DIS - MFF UK
Hierarchical Clustering
• Cutting at the point of the highest error growth produces the tree:
{d1, d2, d3, d4, d5, d6, d7, d8, d9, d10}
├─ {d1, d2, d3, d4, d5, d6}
│    ├─ d1, d2, d3, d4, d5
│    └─ d6
└─ {d7, d8, d9, d10}
     ├─ {d7, d8}
     │    ├─ d7
     │    └─ d8
     ├─ d9
     └─ d10
NDBI010 - DIS - MFF UK
General Cluster Labeling
• Simplifies user navigation through clusters
• Mark clusters by:
– Term set
– Collocation set
• Terms used as labels should be
– Descriptive (describe content of clusters well)
– Discriminative (Discriminate content from other clusters)
NDBI010 - DIS - MFF UK
General Cluster Labeling
• Modified information gain of term t in cluster X:
IGm(t, X) = P(t,X) · log( P(t,X) / (P(t)·P(X)) ) + P(¬t,X) · log( P(¬t,X) / (P(¬t)·P(X)) ), where
P(t)   = # docs containing term t / # all docs in the collection
P(¬t)  = # docs NOT containing term t / # all docs in the collection
P(X)   = # docs inside cluster X / # all docs in the collection
P(¬X)  = # docs OUTSIDE cluster X / # all docs in the collection
P(t,X) = # docs in cluster X containing t / # all docs in the collection
NDBI010 - DIS - MFF UK
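A small Python helper for the same quantity (illustrative; the arguments are the document counts named in the formula, and the P(¬t,X) part is derived as the docs of X without t):

import math

def ig_m(n_t, n_X, n_tX, n):
    """Modified information gain of term t in cluster X.
    n_t: docs containing t, n_X: docs in X, n_tX: docs in X containing t, n: all docs."""
    def part(p_tx, p_t, p_x):
        return 0.0 if p_tx == 0 else p_tx * math.log(p_tx / (p_t * p_x))
    p_t, p_x = n_t / n, n_X / n
    p_tx = n_tX / n                      # P(t, X)
    p_ntx = (n_X - n_tX) / n             # P(not t, X)
    return part(p_tx, p_t, p_x) + part(p_ntx, 1.0 - p_t, p_x)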
General Cluster Labeling
• Select terms having highest IGm
• Join clusters on the same level having the
same labeling
NDBI010 - DIS - MFF UK
TRS models, based on the VS Model
Inductive Model
Semantic Net
Inductive TRS
• Modification of VS model
• Similar to two-layered neural network
– Bottom (input) layer contains m nodes
representing terms t1, …, tm
– Upper (output) layer contains n nodes
representing documents d1, …, dn
– Terms tj are interconnected with documents
di by oriented edges, rated using weights wi,j
NDBI010 - DIS - MFF UK
Inductive TRS
• Up to this point equal to the VS model, using different terminology
(figure: term nodes t1, …, tm in the bottom layer connected to document nodes d1, …, dn in the top layer by edges weighted wi,j)
NDBI010 - DIS - MFF UK
Inductive TRS
• Plus reverted edges rated by weights xi,j
• Usually xi,j = wi,j; they can differ in general
(figure: the same two-layer net with additional edges xi,j leading from the documents back to the terms)
NDBI010 - DIS - MFF UK
Inductive TRS
• The query q defines the initial values of the input nodes
• Initialization: tj ← qj
• Forward step: di = Σ_{j=1..m} wi,j · tj
• Backward step: tj = Σ_{i=1..n} xi,j · di
NDBI010 - DIS - MFF UK
Inductive TRS
• Weights in query initialize bottom layer
of the net – term nodes
NDBI010 - DIS - MFF UK
Inductive TRS
• Forward step computes similarities
of documents with given query
NDBI010 - DIS - MFF UK
Inductive TRS
• Backward step activates further terms,
not mentioned in original query, but are
important for documents, similar to the query
NDBI010 - DIS - MFF UK
Inductive TRS
• Forward step activates more documents ...
NDBI010 - DIS - MFF UK
Inductive TRS
• The global sum of values (the energy) in
the layer grows with iterations
• Forward step:
– The column sums of the index matrix D are greater than 1 if there are enough documents in the collection
• Each value in the bottom layer then contributes more than its own value to the top-layer energy
NDBI010 - DIS - MFF UK
Inductive TRS
• The global sum of values (the energy) in the layer
grows with iterations
• Backward step:
– The row sums of the index matrix D are always greater than 1 if the document vectors are normalized:
( Σ_{j=1..m} wi,j² = 1 ) ⟹ ( Σ_{j=1..m} wi,j ≥ 1 )
• Each value in the top layer contributes more than its own value to the bottom-layer energy
NDBI010 - DIS - MFF UK
Inductive TRS
• Solved by so-called lateral inhibition
• Documents are interconnected with each other using horizontal edges
• Each edge is weighted by a value li,j, determining how much the j-th document inhibits the value of the i-th document
• Lateral inhibition is executed before the backward step:
di ← di − Σ_{j≠i} li,j · dj
NDBI010 - DIS - MFF UK
Inductive TRS
• Either n² independent coefficients (space consuming)
• Or one coefficient for each document
• Or (usually) one coefficient for the whole net
(figure: the document layer with mutual inhibition edges added)
NDBI010 - DIS - MFF UK
Inductive TRS
• Forward step:
di = Σ_{j=1..m} wi,j · tj = wi · t = Sim(wi, t)
• Backward step (no lateral inhibition, x = w):
t^(k+1) = Σ_{i=1..n} Sim(di, t^(k)) · di
• Corresponds to (automatic) relevance feedback with coefficients λi = Sim(di, t^(k))
NDBI010 - DIS - MFF UK
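One forward–backward iteration, including the usual single-coefficient lateral inhibition, might look like this in NumPy (a sketch under the assumption x = w; the coefficient value is arbitrary):

import numpy as np

def inductive_step(D, t, lateral=0.05):
    """One forward + backward pass of the inductive model.
    D: (n x m) index matrix, t: current term-node values (initially the query)."""
    d = D @ t                        # forward step: document activations = similarities
    d = d - lateral * (d.sum() - d)  # lateral inhibition: each document inhibited by the others
    t = D.T @ d                      # backward step (x = w): new term activations
    return d, t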
Semantic Net and Spreading
• Semantic net
– Thesaurus generalization
• By associations between documents
• By associations between documents and terms
• General oriented graph with weighted edges
– Nodes correspond to terms and documents
– Weighted oriented edges correspond to
associations
NDBI010 - DIS - MFF UK
Semantic Net and Spreading
• TermTerm associations
–
–
–
–
Synonym
Broader-narrower terms
Related terms
...
• TermDocument associations
– Importance of terms to identify documents
– ...
• DocumentDocument associations
– Citations
– ...
NDBI010 - DIS - MFF UK
Semantic Net
(figure: example semantic net – thesaurus terms hardware, information system (informační systém), computer (počítač), home computer (domácí počítač), personal computer (osobní počítač), informatics (informatika), information retrieval (výběr informace), bibliographic informatics (bibliografická informatika) and data, connected by ISA, synonym and related-term edges; term-to-document associations link the terms to documents D1–D4, and citation/similarity edges connect the documents themselves)
NDBI010 - DIS - MFF UK
Spreading
• Query q = (q1, q2, …, qm) ∈ ⟨0,1⟩^m
• Initialization: tj ← qj / Σ_{j=1..m} qj
• Increment of the value of node uj caused by node ui during an iteration: ui · wi,j / Σk wi,k
• Overall: uj ← uj + Σi ( ui · wi,j / Σk wi,k )
NDBI010 - DIS - MFF UK
Models Based on
Boolean Model
Fuzzy Model
MMM Model
Paice Model
P-norm Model
Boolean Model Extensions
• In contrast to the classical Boolean Model they
– Allow weighted queries
• Information(0.7) AND System(0.3)
• Breeding(0.9) AND (Dogs(0.6) OR Cats(0.4))
– Allow the use of a vector space index internally
• Allow ordering the output according to presumed relevance
NDBI010 - DIS - MFF UK
Fuzzy Logic

• Document d i  wi,1, wi,2, ..., wi,n
• Query
t a qa  AND tbqb
Similarity
min qawi,a, qbwi,b
ta qa OR tbqb
max qawi,a , qbwi,b
NOT t a qa 
1  qa  wi,a
NDBI010 - DIS - MFF UK
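A direct Python transcription of these rules (illustrative; the values passed in are the products of the query weights with the document weights):

def fuzzy_and(*vals):            # similarity of a weighted conjunction
    return min(vals)

def fuzzy_or(*vals):             # similarity of a weighted disjunction
    return max(vals)

def fuzzy_not(val):              # similarity of a negated weighted term
    return 1.0 - val

# document d1 = (1/2, 1/4, 1/8), unweighted query terms a, b, c
d1 = (0.5, 0.25, 0.125)
print(fuzzy_and(*d1))            # a AND b AND c -> 0.125
print(fuzzy_or(*d1))             # a OR b OR c   -> 0.5
print(fuzzy_not(d1[1]))          # NOT b         -> 0.75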
Fuzzy Logic
• Documents with the same similarity to an (unweighted) conjunction are denoted by blue lines
• Documents with the same similarity to an (unweighted) disjunction are denoted by green lines
(figure: iso-similarity lines in the unit square of term weights)
NDBI010 - DIS - MFF UK
Fuzzy Logic
• Example:
d1 = (1/2, 1/4, 1/8), d2 = (1/2, 1/6, 1/8)
a AND b AND c:  min(1/2, 1/4, 1/8) = 1/8    min(1/2, 1/6, 1/8) = 1/8
a OR b OR c:    max(1/2, 1/4, 1/8) = 1/2    max(1/2, 1/6, 1/8) = 1/2
NOT b:          1 − 1/4 = 3/4               1 − 1/6 = 5/6
NDBI010 - DIS - MFF UK
MMM (Min-Max Model)
• Linear combination of the minimal and maximal values
– ta(qa) AND tb(qb): Sim = kMinAnd·min(qa·wi,a, qb·wi,b) + kMaxAnd·max(qa·wi,a, qb·wi,b)
– ta(qa) OR tb(qb): Sim = kMinOr·min(qa·wi,a, qb·wi,b) + kMaxOr·max(qa·wi,a, qb·wi,b)
• kMinAnd > kMaxAnd, kMinOr < kMaxOr
• Usually 2 coefficients: kMinAnd + kMaxAnd = kMinOr + kMaxOr = 1,
or 1 coefficient: k = kMinAnd = 1 − kMaxAnd = kMaxOr = 1 − kMinOr
NDBI010 - DIS - MFF UK
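The same rules evaluated with the single-coefficient variant, as a small Python sketch (k = kMinAnd = kMaxOr is assumed; the example values match the worked example below):

def mmm_and(vals, k=0.75):
    """MMM conjunction: k = kMinAnd = 1 - kMaxAnd."""
    return k * min(vals) + (1.0 - k) * max(vals)

def mmm_or(vals, k=0.75):
    """MMM disjunction: kMinOr = 1 - k, kMaxOr = k."""
    return (1.0 - k) * min(vals) + k * max(vals)

d1 = (0.5, 0.25, 0.125)
print(mmm_and(d1))   # 0.75*0.125 + 0.25*0.5 = 7/32
print(mmm_or(d1))    # 0.25*0.125 + 0.75*0.5 = 13/32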
MMM (Min-Max Model)
• Documents with the same similarity to an (unweighted) conjunction are denoted by blue lines
• Documents with the same similarity to an (unweighted) disjunction are denoted by green lines
(figure: iso-similarity lines for k = 0.75)
NDBI010 - DIS - MFF UK
MMM (Min-Max Model)
• Ex.: k = 3/4
d1 = (1/2, 1/4, 1/8), d2 = (1/2, 1/6, 1/8)
a AND b AND c:  3/4·1/8 + 1/4·1/2 = 7/32     3/4·1/8 + 1/4·1/2 = 7/32
a OR b OR c:    1/4·1/8 + 3/4·1/2 = 13/32    1/4·1/8 + 3/4·1/2 = 13/32
NDBI010 - DIS - MFF UK
Paice Model
• All values are taken into account; their importance decreases geometrically
Sim = Σk r^k · q_jk·wi,jk, where 0 < r ≤ 1
• In the case of a conjunction the values q_jk·wi,jk are ordered in ascending order
• In the case of a disjunction in descending order
NDBI010 - DIS - MFF UK
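A short Python sketch of the Paice evaluation, consistent with the worked example on the next slide (the geometric weights r^k and the sort direction follow the formula above):

def paice(vals, conjunction=True, r=0.5):
    """Paice similarity: values sorted ascending for AND, descending for OR,
    the k-th value weighted by r**k."""
    ordered = sorted(vals, reverse=not conjunction)
    return sum(r ** (k + 1) * v for k, v in enumerate(ordered))

d1 = (0.5, 0.25, 0.125)
print(paice(d1, conjunction=True))    # 36/192 = 0.1875
print(paice(d1, conjunction=False))   # 63/192 = 0.328125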
Paice Model
• Ex.: r = 1/2
d1 = (1/2, 1/4, 1/8), d2 = (1/2, 1/6, 1/8)
a AND b AND c:  1/2·1/8 + 1/4·1/4 + 1/8·1/2 = 36/192    1/2·1/8 + 1/4·1/6 + 1/8·1/2 = 32/192
a OR b OR c:    1/2·1/2 + 1/4·1/4 + 1/8·1/8 = 63/192    1/2·1/2 + 1/4·1/6 + 1/8·1/8 = 59/192
NDBI010 - DIS - MFF UK
Extended Boolean Logic
(P-norm Model)
• Similarity is derived from the distance of the document (measured by the p-norm) from the zero (false) document dF = <0, 0, …, 0> for disjunctions, resp. from the unitary (true) document dT = <1, 1, …, 1> for conjunctions
NDBI010 - DIS - MFF UK
Extended Boolean Logic
(P-norm Model)
• Non-weighted query variant, 2 terms:
Sim(a OR b) = ( (wi,a^p + wi,b^p) / 2 )^(1/p) = ‖dF − di‖p / 2^(1/p)
Sim(a AND b) = 1 − ( ((1−wi,a)^p + (1−wi,b)^p) / 2 )^(1/p) = 1 − ‖dT − di‖p / 2^(1/p)
NDBI010 - DIS - MFF UK
Extended Boolean Logic
(P-norm Model)
• Non-weighted query, k terms:
Sim(OR) = ( (1/k) · Σ_{j=1..k} wi,j^p )^(1/p)
Sim(AND) = 1 − ( (1/k) · Σ_{j=1..k} (1 − wi,j)^p )^(1/p)
NDBI010 - DIS - MFF UK
Extended Boolean Logic
(P-norm Model)
• Weighted queries, k terms:
Sim(OR) = ( Σ_{j=1..k} qj^p · wi,j^p / Σ_{j=1..k} qj^p )^(1/p)
Sim(AND) = 1 − ( Σ_{j=1..k} qj^p · (1 − wi,j)^p / Σ_{j=1..k} qj^p )^(1/p)
NDBI010 - DIS - MFF UK
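The weighted P-norm operators can be written directly from these formulas; the following Python sketch is illustrative (unweighted queries are obtained by leaving q as None):

def pnorm_or(weights, q=None, p=2.0):
    """Weighted P-norm disjunction; q defaults to all ones (unweighted query)."""
    q = q or [1.0] * len(weights)
    num = sum(qi ** p * wi ** p for qi, wi in zip(q, weights))
    den = sum(qi ** p for qi in q)
    return (num / den) ** (1.0 / p)

def pnorm_and(weights, q=None, p=2.0):
    """Weighted P-norm conjunction."""
    q = q or [1.0] * len(weights)
    num = sum(qi ** p * (1.0 - wi) ** p for qi, wi in zip(q, weights))
    den = sum(qi ** p for qi in q)
    return 1.0 - (num / den) ** (1.0 / p)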
Extended Boolean Logic
(P-norm Model)
• If p, model turns toward classical
Boolean model
• If p=1, disjunctions correspond to vector
space model
• If p=2, reported results are better than in
case of vector space model
NDBI010 - DIS - MFF UK
Extended Boolean Logic
• Documents with the same similarity to an (unweighted) conjunction are denoted by blue arcs
• Documents with the same similarity to an (unweighted) disjunction are denoted by green arcs
(figure: iso-similarity arcs for p = 2)
NDBI010 - DIS - MFF UK
Handling Index-term
Dependencies (Concepts)
Concept Net for Boolean Model
Concept Net for Boolean Model
• Requires Boolean TRS
with thesaurus available
• Instead of individual terms (that can be
mutually dependent) it works with so called
concepts, that are mutually semantically
independent
NDBI010 - DIS - MFF UK
Concept Net for Boolean Model
• Synonyms
– All equivalent terms (a set of synonyms) form one semantic concept (theme)
• E.g.: "Home computer" ≡ "Personal computer"
– Documents using any of those terms are supposed to tell something about the same concept
– In the above case only 2^1 = 2 different document classes exist instead of four in the classic Boolean model
NDBI010 - DIS - MFF UK
Concept Net for Boolean Model
• Related terms (in some semantic association)
– A couple of related terms defines three semantically independent concepts
• E.g. "Information system" ~ "Informatics"
– Documents can independently tell
• about "Information system" (but not about "Informatics")
• about "Informatics" (but not about "Information system")
• about the theme defined by the intersection of their semantics
– There exist 2^3 = 8 different document classes instead of four in the classic Boolean model
NDBI010 - DIS - MFF UK
Concept Net for Boolean Model
• Broader term – narrower term
– A couple of such terms defines two semantically independent concepts
• E.g. "Computer" > "Personal computer"
– Documents can independently tell
• about "Computer" (but not about "Personal computer"), let us say about a mainframe
• about "Personal computer" (and thus about "Computer" as well)
– There exist 2^2 = 4 different document classes, not equivalent to those in the classic Boolean model
NDBI010 - DIS - MFF UK
Concept Net
Thesaurus: synonyms / related terms / ISA hierarchies
(figure: thesaurus over the terms hardware, information system (informační systém), computer (počítač), home computer (domácí počítač), personal computer (osobní počítač), informatics (informatika), information retrieval (výběr informace), bibliographic informatics (bibliografická informatika) and data; home computer and personal computer are synonyms, ISA edges lead from the narrower to the broader terms, and related-term associations connect e.g. information system with informatics)
NDBI010 - DIS - MFF UK
Concept Net
• Corresponding concept net of atomic concepts
(figure: Venn-like diagram over the terms informatics, bibliographical informatics, information retrieval, hardware, computer, home computer, personal computer, information system and data; X marks the common concept of "bibliographical informatics" and "information retrieval"; a complementary concept represents "anything else")
NDBI010 - DIS - MFF UK
Concept Net
• There are 9 terms, i.e. 2^9 = 512 different document classes
• In fact there are 12 different, semantically independent atomic concepts (the last one represents any other topic, different from those described by terms in the thesaurus), i.e. 2^12 = 4096 different document classes (2^11 = 2048 without the twelfth)
• Each atomic concept is represented by the conjunction of all terms, in positive or negative form according to the position of the concept inside/outside the corresponding set
– For example, X represents:
"informatics" and "bibliographical informatics" and "information retrieval" and not "hardware" and not "computer" and not "home computer" and not "information system" and not "data" and not "anything else"
NDBI010 - DIS - MFF UK
Concept Net Construction
• One concept is assigned to each set of synonyms
(figure: binary term–concept matrix over the terms home computer, hardware, information system, informatics, personal computer, computer, information retrieval, data and bibliographical informatics; concepts 1–8, one per synonym set, each column carrying 1-bits for its terms)
NDBI010 - DIS - MFF UK
Concept Net Construction
• Concepts corresponding to couples of related terms are added
(figure: the term–concept matrix extended with concepts 9–11, e.g. concept X9 for the common meaning of "bibliographical informatics" and "information retrieval"; each new column has 1-bits for both related terms)
NDBI010 - DIS - MFF UK
Concept Net Construction
• Going through the thesaurus in the bottom-to-top direction, the 1-bits of the narrower terms are copied to the broader-term columns
(figure: the term–concept matrix after propagation along the ISA hierarchy)
NDBI010 - DIS - MFF UK
Concept Net Construction
• Last, the complementary concept (column 12) is optionally added
(figure: the final binary term–concept matrix with 12 concept columns)
NDBI010 - DIS - MFF UK
Concept Net
• Query
– Non-weighted disjunction of terms
‘informatics’
OR ‘information retrieval’
OR ‘data’
– Weighted disjunction of terms
(‘informatics’ ; 0.5)
OR (‘information retrieval’ ; 1.0)
OR (‘data’ ; 0.4)
NDBI010 - DIS - MFF UK
NDBI010 - DIS - MFF UK
Concept Net
• An unweighted disjunction is translated to the disjunction of the corresponding columns of the term–concept matrix – to a column vector of concepts
• Documents can be translated in the same way
• 'informatics' OR 'information retrieval' OR 'data'
→ (1,1,0,0,0,1,0,1,1,1,1,0)
(figure: the columns for informatika, výběr informace and data are OR-ed together)
Concept Net
• Both the vector corresponding to the query and the vector corresponding to the document are compared using dot-product similarity
• Query "informatics" OR "information retrieval" OR "data": (1,1,0,0,0,1,0,1,1,1,1,0)
• Document "Information system": (0,1,0,0,1,0,0,1,1,1,1,0)
• Similarity = 5, while the term-based similarity is zero
NDBI010 - DIS - MFF UK
Concept Net
• A weighted term disjunction q = (q1, q2, …, qm) can be translated to a column vector of concept weights using fuzzy logic
• Concept weight of the j-th concept:
max over i = 1, …, m of ( qi · ti,j / Σk ti,k ),
i.e. each term distributes its query weight uniformly among the concepts it is mapped to, and the concept takes the maximum of these contributions
NDBI010 - DIS - MFF UK
Concept Net
• Query: ('informatics'; 0.5) OR ('information retrieval'; 1.0) OR ('data'; 0.4)
• 'informatics' is mapped to 4 concepts (each receives 1/4 of its weight), 'information retrieval' to 2 concepts (1/2 each), 'data' to 2 concepts (1/2 each)
• Resulting vector of concept weights:
( 0.5·1/4, 0.4·1/2, 0, 0, 0, 0.5·1/4, 0, 1.0·1/2, 1.0·1/2, 0.4·1/2, 0.5·1/4, 0 )
NDBI010 - DIS - MFF UK
Handling Index-term
Dependencies (Concepts)
Singular Value Decomposition – SVD
for Vector Space Model
Latent Semantic Indexing - LSI
• Similarly to the concept net in the Boolean model, LSI tries to find mutually independent concepts – themes – that can be used for indexing instead of terms, which usually are mutually dependent.
• It doesn't use a thesaurus.
• It derives the so-called latent semantic dependencies directly from the vector space index.
NDBI010 - DIS - MFF UK
Latent Semantic Indexing - LSI
• 3 documents in a 3-dimensional space:
D = ( √2/2  √2/2   0
      √3/3  √3/3  √3/3
       0     0     1  )
• Matrix rank = 2 < dimension
NDBI010 - DIS - MFF UK
Singular Value Decomposition
• Each matrix A with m×n values and rank r (e.g. the matrix A = Dᵀ, i.e. rows ≈ terms) can be decomposed into the product Dᵀ = U·S·Vᵀ, where
– U ∈ R^(m×r) has r orthonormal columns
• They form a base in the m-dimensional term space.
• Its dimension corresponds to the rank of the original matrix.
– S ∈ R^(r×r) is a diagonal regular matrix
– V ∈ R^(n×r) has r orthonormal columns
• They form a base in the n-dimensional document space.
• Its dimension corresponds to the rank of the original matrix.
NDBI010 - DIS - MFF UK
Singular Value Decomposition
• Left singular vectors u1, u2, ..., ur
– Eigenvectors of the matrix A·Aᵀ = Dᵀ·D
• Singular values σ1 ≥ σ2 ≥ ... ≥ σr > 0
– Square roots of the absolute values of the eigenvalues of A·Aᵀ, resp. Aᵀ·A
• Right singular vectors v1, v2, ..., vr
– Eigenvectors of the matrix Aᵀ·A = D·Dᵀ
(figure: the decomposition U·S·Vᵀ with columns u1…ur, diagonal σ1…σr and rows v1…vr)
NDBI010 - DIS - MFF UK
Singular Value Decomposition
• Geometrical meaning
– The matrix projects the unitary m-dimensional sphere onto an r-dimensional ellipsoid whose axis directions are stored in the columns of matrix U
– The half-axis lengths correspond to the values σ1, σ2, …, σr
– The right singular vectors are projected onto vectors parallel to the space axes
(figure: Dᵀ = U·S·Vᵀ maps V*,1, V*,2, V*,3 onto σ1·U*,1, σ2·U*,2, σ3·U*,3)
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• LSI takes into account mutual dependencies of
terms using SVD of index matrix
+ Co-occurring (equivalent) terms are projected to
common dimension
+ Allows further reduction of matrix dimensionality
+ The space required by the index can be smaller
+ Documents, containing similar terms can be
considered as similar, even if they contain distinct
terms
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Represents documents in the space with
dimensionality equivalent to rank of
original indexation matrix
• Dimensions correspond to left singular
vectors of SVD decomposition
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• It is possible to approximate the index
matrix D by a matrix with defined lower
rank k<r
– To achieve rank k<r, matrix USVT
is approximated by UkSkVkT, where
• Uk corresponds to first k columns of matrix U
• Sk correspond to upper left corner of matrix S
with size k x k
• Vk corresponds to first k columns of matrix V
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• When k is decreased by one, the ellipsoid is flattened along its shortest axis
• Uk·Sk·Vkᵀ represents the best approximation of the matrix U·S·Vᵀ that has rank k (according to the Frobenius norm of the difference of both matrices)
• I.e.: ‖U·S·Vᵀ − Uk·Sk·Vkᵀ‖F ≤ ‖U·S·Vᵀ − X‖F for all X with rank k,
where the Frobenius norm ‖M‖F = √( Σ_{i=1..m} Σ_{j=1..n} xi,j² )
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Example
– 6 documents containing 5 different terms
            d1  d2  d3  d4  d5  d6
Cosmonaut    1   0   1   0   0   0
Astronaut    0   1   0   0   0   0
Moon         1   1   0   0   0   0
Vehicle      1   0   0   1   1   0
Car          0   0   0   1   0   1
NDBI010 - DIS - MFF UK
U =
             u1       u2       u3       u4       u5
Cosmonaut  -0,4403   0,2962  -0,5695  -0,5774   0,2464
Astronaut  -0,1293   0,3315   0,5870   0,0000   0,7272
Moon       -0,4755   0,5111   0,3677   0,0000  -0,6144
Vehicle    -0,7030  -0,3506  -0,1549   0,5774   0,1598
Car        -0,2627  -0,6467   0,4146  -0,5774  -0,0866

S = diag( 2,1625  1,5944  1,2753  1,0000  0,3939 )

V =
        v1       v2       v3       v4       v5
D1   -0,7486   0,2865  -0,2797   0,0000  -0,5285
D2   -0,2797   0,5285   0,7486   0,0000   0,2865
D3   -0,2036   0,1858  -0,4466  -0,5774   0,6255
D4   -0,4466  -0,6255   0,2036   0,0000   0,1858
D5   -0,3251  -0,2199  -0,1215   0,5774   0,4056
D6   -0,1215  -0,4056   0,3251  -0,5774  -0,2199
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 5 = r (reproduces the original matrix, up to rounding):
            d1     d2     d3     d4     d5     d6
Cosmonaut  1.000  0.000  1.000  0.000  0.000  0.000
Astronaut  0.000  1.000  0.000  0.000  0.000  0.000
Moon       1.000  1.000  0.000  0.000  0.000  0.000
Vehicle    1.000  0.000  0.000  1.000  1.000  0.000
Car        0.000  0.000  0.000  1.000  0.000  1.000
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 4:
            d1      d2      d3      d4      d5      d6
Cosmonaut  1.051  -0.028   0.939  -0.018  -0.039   0.021
Astronaut  0.151   0.918  -0.179  -0.053  -0.116   0.063
Moon       0.872   1.069   0.151   0.045   0.098  -0.053
Vehicle    1.033  -0.018  -0.040   0.988   0.975   0.014
Car       -0.018   0.010   0.021   1.006   0.014   0.993
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 3:
            d1      d2      d3      d4      d5      d6
Cosmonaut  1.051  -0.028   0.606  -0.018   0.294  -0.312
Astronaut  0.151   0.918  -0.179  -0.053  -0.116   0.063
Moon       0.872   1.069   0.151   0.045   0.098  -0.053
Vehicle    1.033  -0.018   0.294   0.988   0.641   0.347
Car       -0.018   0.010  -0.312   1.006   0.347   0.659
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 2:
            d1      d2      d3      d4      d5      d6
Cosmonaut  0.848   0.516   0.282   0.130   0.206  -0.076
Astronaut  0.361   0.358   0.155  -0.206  -0.025  -0.180
Moon       1.003   0.718   0.361  -0.050   0.155  -0.206
Vehicle    0.978   0.130   0.206   1.029   0.617   0.411
Car        0.130  -0.386  -0.076   0.899   0.411   0.487
NDBI010 - DIS - MFF UK
DIS based on LSI
• Instead of matrix DT
we can use its SVD decomposition USVT,
resp. its approximation UkSkVkT.
• Similarity of document couples: D·Dᵀ
= (Uk·Sk·Vkᵀ)·(Vk·Skᵀ·Ukᵀ)
= (Uk·Sk)·(Skᵀ·Ukᵀ)   { Vk is orthonormal, i.e. Vkᵀ·Vk = I }
= Uk·Sk²·Ukᵀ          { Sk is diagonal, i.e. Sk = Skᵀ }
= (Uk·Sk)·(Uk·Sk)ᵀ
NDBI010 - DIS - MFF UK
DIS based on LSI
• Instead of matrix DT
we can use its SVD decomposition USVT,
resp. its approximation UkSkVkT.
• Similarity of term couples: Dᵀ·D
= (Vk·Skᵀ·Ukᵀ)·(Uk·Sk·Vkᵀ)
= (Vk·Skᵀ)·(Sk·Vkᵀ)   { Uk is orthonormal, i.e. Ukᵀ·Uk = I }
= Vk·Sk²·Vkᵀ          { Sk is diagonal, i.e. Skᵀ = Sk }
= (Vk·Sk)·(Vk·Sk)ᵀ
NDBI010 - DIS - MFF UK
DIS based on LSI
• From Dᵀ = Uk·Sk·Vkᵀ
{ multiply by Ukᵀ on the left }
follows Ukᵀ·Dᵀ = Sk·Vkᵀ   { Ukᵀ·Uk = I }
and further
{ multiply by Sk⁻¹ on the left }
follows Sk⁻¹·Ukᵀ·Dᵀ = Vkᵀ   { Sk⁻¹·Sk = I }
• By transposition we get: Vk = D·Uk·Sk⁻¹
• I.e.:
– We obtain the new k-dimensional vector from the original one by multiplying the vector by the matrix Uk·Sk⁻¹
– The query can be transformed the same way, by multiplying the query vector by the matrix Uk·Sk⁻¹
NDBI010 - DIS - MFF UK
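A NumPy sketch of this transformation (illustrative; D is assumed to hold documents in rows and terms in columns, so that Dᵀ = U·S·Vᵀ as above):

import numpy as np

def lsi_index(D, k):
    """D: (n x m) document-term matrix. Returns the projection matrix U_k S_k^-1
    and the k-dimensional document representations (rows of V_k)."""
    U, s, Vt = np.linalg.svd(D.T, full_matrices=False)   # D^T = U S V^T
    proj = U[:, :k] @ np.diag(1.0 / s[:k])               # U_k S_k^-1
    docs_k = D @ proj                                     # V_k = D U_k S_k^-1
    return proj, docs_k

def lsi_similarities(q, D, k):
    proj, docs_k = lsi_index(D, k)
    q_k = q @ proj                                        # query transformed the same way
    return docs_k @ q_k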
DIS based on LSI
• Similarity between a document and the query:
SimLSI(q, Di) = Sim( q·Uk·Sk⁻¹, Di·Uk·Sk⁻¹ )
• Disadvantages of the method
– Static method
• The decomposition is computed from a given specific set of vectors (documents)
– Further documents can be added using the Uk·Sk⁻¹ transformation, but this approach doesn't reflect the latent semantic features of their terms
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Evaluation of query [moon,vehicle],
i.e. <0, 0, 1, 1, 0>
• Without LSI we obtain similarities
<2.000; 1.000; 0.000; 1.000; 1.000; 0.000>
            d1  d2  d3  d4  d5  d6
Cosmonaut    1   0   1   0   0   0
Astronaut    0   1   0   0   0   0
Moon         1   1   0   0   0   0
Vehicle      1   0   0   1   1   0
Car          0   0   0   1   0   1
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Using LSI both documents and the query can be
projected to a 2-dimensional space using matrix
U2S2-1 (k=2), where the space keeps two most
important latent semantic concepts
• After the transformation the query is evaluated as
usual
• We obtain the similarities
<0.983; 0.621; 0.849; 0.424; 0.713; 0.108>
• Compare with <2.000; 1.000; 0.000; 1.000; 1.000; 0.000> obtained without LSI
NDBI010 - DIS - MFF UK
Latent Semantic Indexing (LSI)
• Existing benchmark results:
– On small collections having 103 documents
up to 30% increase of precision in comparison
with VS model
– On collections having 104 documents still
better than VS model
– On collections having 105 documents
result are behind VS model
NDBI010 - DIS - MFF UK
Signatures
Signatures
• Suitable for conjunctive queries over
Boolean IS
• Excludes large amount of irrelevant
documents
• Requires low time and space complexity
NDBI010 - DIS - MFF UK
Signatures
• Signature = k-bit
string
– k is a predefined
constant
• Document signature si
is assigned to each
document di
• Query signature s is
assigned to
conjunctive query q
(figure: the query signature is compared with the document signatures, e.g. 00101, 01001, 10101, 00111, 10100)
NDBI010 - DIS - MFF UK
Signature Query Evaluation
• A document signature si matches the query signature s if and only if si ≥ s bit by bit (si has a 1-bit everywhere s has), i.e. iff
s AND NOT si = 0 (bitwise)
• If the document signature doesn't match, the document cannot contain all the queried terms
• If the document signature matches, the document can, but need not, contain all the required terms
NDBI010 - DIS - MFF UK
Signature Query Evaluation
• Effectively computable evaluation using native machine-code instructions of the CPU
• Non-exact: an irrelevant document can still match in the signature comparison (false hit)
• It is necessary to follow up with a further – more exact – comparison
(figure: query signature → signature comparison over the documents → exact comparison of the remaining candidates)
NDBI010 - DIS - MFF UK
Signature Assigning
• Word signature
– Hash function h: Σ* → 0..k−1
– The signature sig(w) of word w has a 1-bit at position h(w); all other positions contain 0-bits
– This is more suitable than a hash function Σ* → 0..2^k−1 that directly generates signatures with many (k/2 on average) 1-bits, which would match a large number of queries
NDBI010 - DIS - MFF UK
Signature Assigning
• Document signature
– Layering of signatures assigned to individual words
using binary disjunction.
– Collection of documents with fixed internal structure
(author, title, abstract, body, publisher) can use
signature built by concatenation of (layered) signatures
corresponding to individual document sections.
Each of concatenated signatures can have its own
predefined length.
NDBI010 - DIS - MFF UK
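A minimal Python sketch of word/block signatures and the matching test (illustrative; Python's built-in hash stands in for a proper stable hash function h, and k is an arbitrary signature length):

def word_signature(word, k=64):
    """1-bit word signature: a single bit set at position h(word)."""
    return 1 << (hash(word) % k)        # note: use a stable hash in practice

def block_signature(words, k=64):
    """Document/block signature: bitwise OR (layering) of the word signatures."""
    sig = 0
    for w in words:
        sig |= word_signature(w, k)
    return sig

def matches(doc_sig, query_sig):
    """Signature test: every query bit must be set in the document signature.
    A match may still be a false hit and must be verified against the text."""
    return query_sig & ~doc_sig == 0

doc = block_signature("text retrieval systems use signatures".split())
q = block_signature(["retrieval", "signatures"])
print(matches(doc, q))   # True (possibly a false hit)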
Concatenated Signatures
• Concatenated signatures allow independent
querying over document section(s)
– Books written by Alan Poe:
• sig(„Alan“)=00100000, sig(„Poe“)=00001000
– Query q = 00101000 | 00000000 | 0000000000000000
• sig("Saul") = 00010000, sig("Bellow") = 10000000
– Document 10010000 | xxxxxxxx | xxxxxxxxxxxxxxxx does not match the query signature
• sig("Toni") = 00001000, sig("Morrison") = 00100000
– Document 00101000 | xxxxxxxx | xxxxxxxxxxxxxxxx matches the query signature, but does not match the query
NDBI010 - DIS - MFF UK
Layered Signatures
• By layering word signatures the document/query signature
gains more and more 1-bits and so the ability to
discriminate irrelevant documents vanishes,
because document signature containing all bits set to one
matches to any query.
• Solution:
– More signatures are assigned to the document, each of them for
different block of text.
– By suitable sectioning, for example on chapter or paragraph
boundaries, the information loss is not problematic.
Unrelated co-occurrence of words is not much important.
NDBI010 - DIS - MFF UK
Layered Signatures
• Block can be created by two methods
– FSB
(fixed size block)
Each block contains approximately the same number of
words
– FWB
(fixed weight block)
Each block produces a signature with approximately the
same number of 1-bits
Optimally k/2
NDBI010 - DIS - MFF UK
Monotonous Signatures
• A signature is monotonous if for every two words, resp. their fragments u, v, it holds that
sig(u.v) ≥ sig(u) (bitwise)
i.e. the signature of a word extended on the right is not less than the signature of the original word
• Monotonous signatures allow querying word stems using right-side wildcards "*"
• For example q = "datab*" AND "system*"
NDBI010 - DIS - MFF UK
Monotonous Signatures
• Monotonous signatures can be created using
following approaches (all of them use in general
more than one 1-bit in the signature)
– Layering of signatures corresponding to all
word prefixes
• sig(„system“)=sig’(„s“)+sig’(„sy“)+...+sig’(„system“)
where sig’(w) is arbitrary signature
– Layering of signatures corresponding to
all n-grams within given word
n-gram = sequence of n adjacent characters
• sig(„system“)=sig’(„sys“)+sig’(„yst“)+...+sig’(„tem“)
for trigrams
NDBI010 - DIS - MFF UK
Monotonous Signatures
• Signatures built upon n-grams
– Allow wildcards at the beginning or inside of
words
– Allow uniform utilization of all k positions in
the signature
• There exists a fixed number of n-grams (~26^n in English)
• It is possible to estimate the probabilities of individual n-grams from the language model and assign bits to them uniformly
NDBI010 - DIS - MFF UK
Signature Storing
• Inverted file
• Non-inverted file
• Signature tree
– By ordering, signatures with the same prefix of length k1 < k appear in one continuous block
– The prefix can be stored only once
– It is possible to use more levels for k1, k2, …
(figure: signatures such as 01001, 10101, 00111, 10100 grouped into a tree by the prefixes 00, 01, 10, 11)
NDBI010 - DIS - MFF UK
Distributed TRS
Distributed TRS
• Data as well as functionality is spread among
more computers
– Transparency
• User should not be aware of the distribution
– Scalability
• Performance can be boost using more computers
– Robustness
• Failure of one computer doesn’t affect functionality of other
computers
• Overall functionality of the system can remain the same
(redundancy) or only slightly decreased (unavailable is only
small part of data)
NDBI010 - DIS - MFF UK
Distributed TRS
• Data stored in TRS
– Primary data (original documents)
– Secondary data (author, title, year of publication, ...)
– Index
• Computer nodes can be distinguished according to the services they provide and the data they store and maintain
• More processes can run on one computer node
NDBI010 - DIS - MFF UK
Distributed TRS
• Processes of TRS (involved in query answering)
– Clients (C)
• User interface
– Document server (D)
• Document delivery system containing primary data
• For example independent WEB server
– Index server (S)
• Document disclosure system containing the index and
secondary data
– Integration node (I)
• Specific process ensuring coordination of cooperating nodes
and processes
NDBI010 - DIS - MFF UK
Distributed TRS
• Integration node
– Takes queries from client processes (users)
– Defines strategy of query evaluation according
to its knowledge of system topology
– Distributes partial queries to individual index
servers
– Collects final result from answers to partial
queries
NDBI010 - DIS - MFF UK
Distributed TRS
• DIS
• DDIS
(figure: a centralized DIS – clients C connected through one integration node I to a single index server S and document servers D – versus a distributed DDIS with several index servers S, document servers D and integration nodes I)
NDBI010 - DIS - MFF UK
Distributed TRS
• The availability of more clients and integration nodes increases the throughput of the system and its robustness
• It is necessary to replicate the metadata about the system topology on all integration nodes
(figure: several C–I–S chains sharing the index servers)
NDBI010 - DIS - MFF UK
Distributed Boolean TRS
• The index matrix is split to more (usually distinct,
redundancy is achieved due to mirroring) parts
• Description here is based on relational algebra
notation and semantics
–
–
–
–
–
Relation R(A1, A2, …, An)
Boolean condition q
Projection R[Ai1, Ai2, …, Aik]
Selection R(q)
Natural join of relations R*S
NDBI010 - DIS - MFF UK
Distributed Boolean TRS
• The index is represented as an (m+1)-ary relation D(d, t1, t2, …, tm), where tj ∈ {0,1} and d ∈ N (document identification)
• A document index instance is the matrix
D = ( 1: w1,1  w1,2  …  w1,m
      2: w2,1  w2,2  …  w2,m
      ⋮
      n: wn,1  wn,2  …  wn,m ), where wi,j ∈ {0,1}
NDBI010 - DIS - MFF UK
Distributed Boolean TRS
• Answer to query q is a list of identifiers of
matching documents, i.e. relation
D(q)[d]
NDBI010 - DIS - MFF UK
Horizontal Fragmentation
• Splitting of the index into k fragments D1, D2, …, Dk based on a k-tuple of queries q1, q2, …, qk, where Dx = D(qx)
• q1 ∨ q2 ∨ … ∨ qk = true, i.e. D1 ∪ D2 ∪ … ∪ Dk = D
• qx ∧ qy = false if x ≠ y, i.e. Dx ∩ Dy = ∅
NDBI010 - DIS - MFF UK
Horizontal Fragmentation
• D(q)[d]
= (D1 ∪ … ∪ Dk)(q)[d]
= (D(q1) ∪ … ∪ D(qk))(q)[d]
= D(q1 ∧ q)[d] ∪ … ∪ D(qk ∧ q)[d]
• If qx ∧ q = false, then Dx(q)[d] = ∅ and the x-th index server need not take part in the query evaluation
NDBI010 - DIS - MFF UK
Horizontal Fragmentation
• How to choose the queries qi
– To obtain fragments of approximately the same size
– To obtain fragments where typical queries can be evaluated on as small a number of nodes as possible
• This requires statistics about the queries evaluated in the past
NDBI010 - DIS - MFF UK
Vertical Fragmentation
• Splitting of the index into k fragments D1, D2, …, Dk based on a k-tuple of sets {d} ⊆ T1, T2, …, Tk ⊆ {d, t1, t2, …, tm}, where Dx = D[Tx]
• T1 ∪ T2 ∪ … ∪ Tk = {d, t1, t2, …, tm}, i.e. D1 * D2 * … * Dk = D
• Tx ∩ Ty = {d} if x ≠ y
NDBI010 - DIS - MFF UK
Vertical Fragmentation
• D(q)[d] = (D1 * D2 * … * Dk)(q)[d]
• Let Tq be the set of terms used in the query and T̄ = {d} ∪ Tq; then
D(q)[d] = (D[T1 ∩ T̄] * … * D[Tk ∩ T̄])(q)[d]
• Smaller relations are joined
• Fragments where Tx ∩ T̄ = {d} can be omitted
NDBI010 - DIS - MFF UK
Vertical Fragmentation
• Based on the following rules
D(q1 ∧ q2)[d] = D(q1)[d] ∩ D(q2)[d]
D(q1 ∨ q2)[d] = D(q1)[d] ∪ D(q2)[d]
queries can be rewritten into intersections and unions of partial results that can be evaluated on the available index servers
NDBI010 - DIS - MFF UK
Vertical Fragmentation
• How to choose the sets Ti
– To obtain fragments of approximately the same size
• Sets of the same size (and with similar probabilities of the terms in the text)
– To obtain fragments where queries can be evaluated on as small a number of fragments as possible
• Terms co-occurring in typical queries are kept together; it is necessary to store the history of queries
NDBI010 - DIS - MFF UK
Combined Fragmentation
• Regular (grid)
D11 = D(q1)[T1]   D12 = D(q1)[T2]   D13 = D(q1)[T3]
D21 = D(q2)[T1]   D22 = D(q2)[T2]   D23 = D(q2)[T3]
D31 = D(q3)[T1]   D32 = D(q3)[T2]   D33 = D(q3)[T3]
• Irregular
D11 = D(q1 ∨ q2)[T1]   D12 = D(q1)[T2]   D13 = D(q1)[T3]
D2 = D(q2)[T2 ∪ T3]
D3 = D(q3)
NDBI010 - DIS - MFF UK
Example
• Irregular combined fragmentation
D1 = D(q1)
D21 = D(q2)[T1]   D22 = D(q2)[T2]
D3 = D(q3)
• Where
– T1 = {d, t1, t2, t3}, T2 = {d, t4, t5, t6}
– q1 = (t1 ∧ t4), q2 = (t1 ∧ ¬t4) ∨ (¬t1 ∧ t4), q3 = (¬t1 ∧ ¬t4)
NDBI010 - DIS - MFF UK
Example
• D=D1(D21*D22)D3
q=t1(t2t4t5)
• D(q)[d]=(D1(D21*D22)D3)(q)[d]
=D1(q)[d](D21*D22)(q)[d]D3(q)[d]
=D1(q)[d](D21*D22)(q)[d]
qq3=false
• (D21*D22)(q)[d]
=(D21*D22)(t1(t2t4t5))[d]
=(D21*D22)(t1)[d](D21*D22)(t2t4t5)[d]
NDBI010 - DIS - MFF UK
Example
• (D21 * D22)(t1)[d] = D21(t1)[d]
• (D21 * D22)(t2 ∨ t4 ∨ t5)[d]
= (D21 * D22)(t2)[d] ∪ (D21 * D22)(t4 ∨ t5)[d]
= D21(t2)[d] ∪ D22(t4 ∨ t5)[d]
NDBI010 - DIS - MFF UK
Example
(figure: evaluation tree – server S1 computes D1(q)[d], S21 computes D21(t1)[d] and D21(t2)[d], S22 computes D22(t4 ∨ t5)[d]; server S3 holding D3 is not needed, and the partial results are combined by intersection and union)
NDBI010 - DIS - MFF UK
Distributed Vector Space TRS
• Uses clustering
– Analogous to the horizontal fragmentation in a Boolean TRS
• Integration servers need information about the cluster topology, i.e. about the centers and radii of the clusters
NDBI010 - DIS - MFF UK
Integrated TRS
Integrated TRS
• Integration of more independent TRS into
one meta-system
• Problems
– Different methods of indexing
• One document can have more different
representations
– Different sets of terms
– Different similarity computations
NDBI010 - DIS - MFF UK
Optimal searching
• One of the methods of TRS integration, which in addition
– Decreases problems with the prediction criterion
– Suitable for systems, containing multiple modules for
• Document indexation (more independent
algorithms)
– High space complexity
• Query indexation algorithms
• More similarity computations
– System combines results together and (based on the
user interaction) chooses optimal combination of
available methods
NDBI010 - DIS - MFF UK
Optimal searching
• Optimal query answering method
– k different methods
– ith method returned
ri relevant documents
in set of ni returned documents
• How to determine the best of available TRS’s?
– By ri2 / ni criterion
– It is not necessary to find out the overall number of all
relevant documents in collections
NDBI010 - DIS - MFF UK
Optimal searching
• Let us suppose knowledge of all relevant
documents and of their count r
• Let X = (x1, x2, ..., xn),
where xj is equal to 1 if and only if the jth
document is relevant for the user, else xj=0
• Let Yi = (yi,1, yi,2, ..., yi,n),
where yi,j is equal to 1 if and only if the ith
system returned the jth document, else yi,j=0
NDBI010 - DIS - MFF UK
Optimal searching
Number of relevant documents
returned by the ith system:
Yi · X = Σj yi,j·xj = ri
• Optimally Yi = X
• The quality of the ith system
Sim(Yi, X) = (Yi · X) / (|Yi|·|X|) = cos α,
where |Yi| = √ni and |X| = √r
• cos α = ri / √(ni·r),
so cos²α = ri²/(ni·r) = (ri/ni)·(ri/r) = Pi · Ri
NDBI010 - DIS - MFF UK
Optimal searching
• The quality measure cos²α corresponds to Pi·Ri
• Due to Pi·Ri = (ri/ni)·(ri/r) = ri²/(ni·r),
where r is constant,
and the square root is ascending on <0;1>,
to order those expressions
the ordering by ri²/ni is sufficient
NDBI010 - DIS - MFF UK
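As a hedged illustration of the criterion derived above, the following minimal Python sketch (the data and names are illustrative, not from the lecture) ranks several retrieval methods by ri²/ni:

```python
def quality(r_i, n_i):
    """r_i^2 / n_i -- proportional to Pi*Ri for a fixed (unknown) r."""
    return 0.0 if n_i == 0 else (r_i ** 2) / n_i

# hypothetical methods: name -> (returned documents n_i, relevant among them r_i)
methods = {"M1": (50, 10), "M2": (20, 8), "M3": (100, 12)}

ranking = sorted(methods.items(),
                 key=lambda kv: quality(kv[1][1], kv[1][0]),
                 reverse=True)
for name, (n_i, r_i) in ranking:
    print(name, quality(r_i, n_i))
```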
Optimal searching
• Query evaluation algorithm
– The query is evaluated by all available methods
and results are merged together to one ordered
list (see below)
– User marks relevant documents in the answer
– Individual methods are rated by ri2 / ni criterion
– The best method is preferred in future answers
NDBI010 - DIS - MFF UK
Optimal searching
• Merging of output document lists
– Different methods can return documents rated
by numbers from different intervals
– It is necessary to normalize all values from
local interval <l1,l2> of ith method
to global interval <g1,g2>
linearly: y = (x - l1)*((g2 - g1)/(l2 - l1)) + g1
– <g1,g2> is usually <0,1>, thus
y = (x - l1)/(l2 - l1)
[Figure: linear mapping of the local interval <l1,l2> onto the global interval <g1,g2>]
NDBI010 - DIS - MFF UK
Optimal searching
• Merging of output document lists
– If a given document is found and rated by more
than one method, the overall rate of the document has
to be computed
– Individual document rates are considered to be
estimations of the probability of document relevancy
given by the method(s)
– If one document is returned multiple times, the
probability of its relevancy grows
– If si ∈ <0;1>, then s = 1 - ∏i(1 - si)
• Computed over all methods that returned the given document
NDBI010 - DIS - MFF UK
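A minimal sketch of this merging step, assuming each method returns scores in its own interval <l1,l2> (helper names are illustrative): raw scores are normalized linearly to <0,1> and, when several methods return the same document, combined as s = 1 - ∏(1 - si):

```python
def normalize(x, l1, l2, g1=0.0, g2=1.0):
    """Linear mapping of x from <l1,l2> to <g1,g2>."""
    return (x - l1) * (g2 - g1) / (l2 - l1) + g1

def combine(scores):
    """Treat normalized scores as independent relevance probabilities."""
    p = 1.0
    for s in scores:
        p *= (1.0 - s)
    return 1.0 - p

# hypothetical partial results: (doc -> raw score, score interval of the method)
method_a = ({"d1": 7.0, "d2": 3.0}, (0.0, 10.0))
method_b = ({"d1": 0.4}, (0.0, 1.0))

merged = {}
for results, (l1, l2) in (method_a, method_b):
    for doc, score in results.items():
        merged.setdefault(doc, []).append(normalize(score, l1, l2))

overall = {doc: combine(ss) for doc, ss in merged.items()}
print(sorted(overall.items(), key=lambda kv: kv[1], reverse=True))
```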
Optimal searching
• If there exists some method (or methods) that
should be preferred over the other methods, the rates
provided by the method are normalized to
individual intervals
<g1, g1+λi·(g2-g1)>, where λi ∈ <0,1> denotes the
credibility (quality) of the method
• For example:
– λi=1 for the best method, λi<1 for all others
– λi = (ri²/ni) / (rmax²/nmax)
NDBI010 - DIS - MFF UK
HTML Searching
• Web can be considered as the special case of TRS
– Unknown and huge number of stored documents
• Surface web – anonymously accessible documents
• Hidden (deep) web
– Dynamic web pages
– Unlinked pages
– Documents accessible after authorization
– Volume of deep web is (est.) hundred times larger
– Quality of deep web is thousand times higher
NDBI010 - DIS - MFF UK
HTML Searching
• Web can be considered as the special case
of TRS
– Redundancy
• Estimations says the web redundancy is approx.
30%
– Volatility
• ¼ of pages changes every day
• Estimated half-life of pages is approx. 10 days
• I.e. information in the index ages considerably
NDBI010 - DIS - MFF UK
HTML Searching
• Web can be considered as the special case of TRS
– Number of documents in Google
• 4.285.199.774 documents (July 2004)
• 8.058.044.651 documents (May 2005)
• At least 25 270 000 000 documents,
more probably over 35 070 000 000 docs (April 2006)
• Query „the“ used in Google returns
– 3 200 000 000 hits (May 2005)
– 24 210 000 000 hits (April 2006)
• Query „-the“used in Google returns
– 14 800 000 000 hits (April 2006)
NDBI010 - DIS - MFF UK
HTML Searching
• Two methods of information search
– Query engines
• www.google.com, www.yahoo.com,
www.altavista.com, morfeo.centrum.cz, …
– Catalog browsing
• seznam.cz, centrum.cz, …
• Usually manually managed pages
NDBI010 - DIS - MFF UK
Querying the Web
• Typically different implementations of some
extended Boolean search engines
– Binary logical operators
– Support for explicit or implicit proximity operators
– Usually without query weighting
• Different additional techniques
– The location of terms in document is important
• Titles and headings more important than plain text,
…
– Mutual references between pages are taken into account
NDBI010 - DIS - MFF UK
Querying the Web
• Catalogs
– Thematically organized lists of references
– Navigation through hierarchies of search terms
– Suitable in situations, where the user exactly
knows what he/she searches for, as well as
when he/she cannot express the query using
keywords
NDBI010 - DIS - MFF UK
Web Macrostructure
• Taken from:
Graph structure
in the web
Andrei Broder,
Ravi Kumar et al.
http://www9.org/
w9cdrom/160/160.html
NDBI010 - DIS - MFF UK
Hypertext References Utilization
• Web can be considered as oriented graph G(N,E)
– N is set of nodes (pages)
– E is set of edges, where (p,q)∈E means,
that page q is referred from page p
• Output page degree o(p)
– Number of references in the page p
• Input page degree i(p)
– Number of pages referencing to page p
NDBI010 - DIS - MFF UK
Hypertext References Utilization
• Edges inside one domain are denoted as
inner edges
• Edges crossing domain boundaries are
denoted as traversal edges
[Figure: example graph with pages p11, p12 in domain dom1 and p21, p22 in domain dom2]
NDBI010 - DIS - MFF UK
i(p11) = 0, o(p11) = 2
i(p12) = 1, o(p12) = 0
i(p21) = 2, o(p21) = 0
i(p22) = 1, o(p22) = 2
Web Search Engine Structure
[Figure: URL store → Robot → harvested HTML pages → Indexer → Index → query processing]
• Robot
(crawler, spider)
– Uses internal URL
store and visits pages
with given frequency
and in given order
– Stores data to list of
harvested HTML pages
NDBI010 - DIS - MFF UK
Web Search Engine Structure
• Indexer
– Indexes harvested
HTML pages
– Generates index data
• Textual
• Structural
– Adds newly found
references to URL
store
NDBI010 - DIS - MFF UK
Web Search Engine Structure
• Query processing
– Uses index data to
evaluate document
similarities according
to queries
– If the query is
formulated using URL
to unknown page, it
can be loaded and adhoc indexed at the time
of the query evaluation
NDBI010 - DIS - MFF UK
Page Harvesting
• Usually breadth-first search traversal from
starting set of pages, stored in the URL store
– Not all stored URL has to be the starting pages
• Priorities can be defined according to
– Page theme
• For example based on page similarity to some predefined
vector and/or query
– Page popularity
– Page location
• According to the domain
– …
NDBI010 - DIS - MFF UK
Page Harvesting
• Robots are not able to download all freely
available pages
– The web doesn’t form a connected graph
• Only the components reachable from the starting pages
can be indexed
– The web volume grows and the content on the web
changes more rapidly than the robots are able to
harvest it
NDBI010 - DIS - MFF UK
Indexation
• Indexer decides, what pages harvested by robots
will be really indexed
– Tendency to omit duplicates
• Obtained index data is stored (in case of extended
Boolean model) using inverted files, usually
including positions of term occurrences in pages
• In addition, another metadata, needed for query
evaluation is stored
– Graph of references
– Page sizes
– …
NDBI010 - DIS - MFF UK
Document to Query Comparison
• Documents are rated from more points of view,
with respect to
– Given query
• Similarity of document content and the query
– Given document itself
• Page popularity, derived for example from the number of
(traversal) references
– Given user
• Newest part of rating, that tries to create and hold user profile,
obtained from previous interaction with him/her
• Prefer user’s subjective feeling about document quality
NDBI010 - DIS - MFF UK
Web Page Popularity
• Is derived from the references graph
analysis and from the similarity between
source and target pages
– PageRank
– HITS algorithm
NDBI010 - DIS - MFF UK
PageRank
• Supposes that the reference to foreign page
represents a recommendation of the target
page given by the author of the source page
• Problems
– Algorithm can be confused by generation of
large amount of pages referring to target page
– The popularity of source page and/or the
content of the source page can be taken into
account
NDBI010 - DIS - MFF UK
PageRank
• Rating – rank – r(q) of page q depends on
ranks of referring pages and on the number
of those pages
– Simple PageRank:
r(q) = Σ(p,q)∈E ((1/o(p))·r(p))
• Multiple references are counted only once
– Matrix notation:
r = Xr, where xp,q = 1/o(p) if (p,q)∈E, else 0
NDBI010 - DIS - MFF UK
PageRank
• Iterative PageRank evaluation
– Problems
• Group of pages can link each other, but not outside of the
group (rank sink)
– Rating accumulation
– No contribution to other pages
[Example iterations for the rank-sink pages p, p1, p2:
p:  0.1  0.1  0.1  0.1  0.1  0.1
p1: 0.0  0.1  0.1  0.2  0.2  0.3
p2: 0.0  0.0  0.1  0.1  0.2  0.2]
NDBI010 - DIS - MFF UK
PageRank
• Rank sink problem can be according to authors
(Lawrence Page, Sergey Brin) reduced using
Random Surfer Model
– PageRank :
r(q) = (1-d) + d·Σ(p,q)∈E ((1/o(p))·r(p)), d ∈ <0,1>
• d represents the damping factor, usually d = 0.85
– Matrix notation:
r = (1-d)e + dXr,
where e is a vector containing ones
NDBI010 - DIS - MFF UK
PageRank
• Random Surfer Model
r(q) = (1-d) + d·Σ(p,q)∈E ((1/o(p))·r(p)), d ∈ <0,1>
– User browses the web randomly
– The probability of visiting given page is defined by the
PageRank
– User clicks to some hyperlink in the page with the
probability d
• The selection of any of o(p) links
is random with uniform distribution
– User doesn’t follow any link on the page and writes new
URL into address field or choose some of favorites or …
with probability (1-d)
NDBI010 - DIS - MFF UK
PageRank
• Other – more exact – variant of PageRank (Lawrence
Page, Sergey Brin)
– PageRank :
r(q) = (1-d)/|V| + d·Σ(p,q)∈E ((1/o(p))·r(p))
– If o(p)=0, i.e. the page refers to nothing,
it is considered to refer to all pages of the Web,
i.e.
• o(p)=|V|,
• (p,q)∈E for every page q
– The result corresponds better to the probabilities of visiting web pages
NDBI010 - DIS - MFF UK
PageRank Example
• Example graph: x → y, x → z, y → z, z → x (with d = 0.5)
r(x) = 0.5 + 0.5·r(z)
r(y) = 0.5 + 0.5·r(x)/2
r(z) = 0.5 + 0.5·(r(x)/2 + r(y))
• Exact solution of the equations:
r(x) = 14/13 = 1.07692308
r(y) = 10/13 = 0.76923077
r(z) = 15/13 = 1.15384615
• Iterative computation
step  r(x)        r(y)        r(z)
0     1.0         1.0         1.0
1     1.0         0.75        1.125
2     1.0625      0.765625    1.1484375
3     1.07421875  0.76855469  1.15283203
4     1.07641602  0.76910400  1.15365601
…     …           …           …
10    1.07692305  0.76923076  1.15384615
11    1.07692307  0.76923077  1.15384615
12    1.07692308  0.76923078  1.15384615
NDBI010 - DIS - MFF UK
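A hedged Python sketch of the iterative computation for the example above, using the slide's formula r(q) = (1-d) + d·Σ(p,q)∈E r(p)/o(p) with d = 0.5 and in-place updates in the order x, y, z (which reproduces the values in the table):

```python
# edges of the example graph: x -> y, x -> z, y -> z, z -> x
links = {"x": ["y", "z"], "y": ["z"], "z": ["x"]}
d = 0.5

rank = {p: 1.0 for p in links}                 # starting values
incoming = {q: [p for p in links if q in links[p]] for q in links}

for step in range(12):
    for q in links:                            # update in place
        rank[q] = (1 - d) + d * sum(rank[p] / len(links[p]) for p in incoming[q])

print(rank)   # converges to r(x)=14/13, r(y)=10/13, r(z)=15/13
```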
Kleinberg’s HITS Algorithm
• Hypertext-Induced Topic Search
– Rates documents,
returned by the given query
– Supposes, that the set contains similar
documents retrieved using the user query
• Documents are often mutually linked by references
NDBI010 - DIS - MFF UK
Kleinberg’s HITS Algorithm
• Two classes of pages are distinguished
– Authorities
• Pages having high input degree
• i.e. referenced by many pages
included in the query answer
– Hubs
• Pages having high output degree
• i.e. referencing many pages
included in the query answer
[Figure: hub pages h pointing to authority pages a within the query answer]
NDBI010 - DIS - MFF UK
HITS
• Algorithm
– The input set of pages for HITS is chosen
• Small enough collection
• Containing documents similar to given query q
• Containing large number of authorities
– Rating of selected pages
NDBI010 - DIS - MFF UK
HITS
• Selection of page set Sq according to query q
– In(p) … set of pages referring to page p
– Out(p) … set of pages referenced by page p
– d … chosen small integer number
1. Rq := first 200 pages from the answer to query q
2. Sq := Rq;
3. for each p in Rq do begin
   Sq := Sq ∪ Out(p);
   if i(p) ≤ d then Sq := Sq ∪ In(p)
   else Sq := Sq ∪ S; {S ⊆ In(p), |S| = d, S chosen randomly}
   end;
4. Remove inner links from the graph induced by Sq
NDBI010 - DIS - MFF UK
HITS
• Page rating
– ak(p) … authority rating of the page p in the kth iteration
– hk(p) … hub rating of the page p in the kth iteration
1. for each p in Sq do begin a0(p) := 1; h0(p) := 1; end;
2. for k := 1 to n do for each p do begin
   ak(p) := Σ(q,p)∈E hk-1(q);
   hk(p) := Σ(p,q)∈E ak-1(q);
   normalize ratings so that Σp∈Sq(hk(p))² = Σp∈Sq(ak(p))² = 1
   end;
NDBI010 - DIS - MFF UK
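A hedged Python sketch of the rating step on the induced graph (variable names and the tiny example graph are illustrative); it follows the iteration above, including the normalization to unit sum of squares:

```python
import math

def hits(edges, nodes, iterations=20):
    """edges: set of (p, q) pairs meaning p -> q within the selected set Sq."""
    a = {p: 1.0 for p in nodes}
    h = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        new_a = {p: sum(h[q] for (q, r) in edges if r == p) for p in nodes}
        new_h = {p: sum(a[q] for (r, q) in edges if r == p) for p in nodes}
        norm_a = math.sqrt(sum(v * v for v in new_a.values())) or 1.0
        norm_h = math.sqrt(sum(v * v for v in new_h.values())) or 1.0
        a = {p: v / norm_a for p, v in new_a.items()}
        h = {p: v / norm_h for p, v in new_h.items()}
    return a, h

nodes = {"p1", "p2", "p3"}
edges = {("p1", "p3"), ("p2", "p3"), ("p3", "p1")}
authorities, hubs = hits(edges, nodes)
print(authorities, hubs)   # p3 gets the highest authority, p1/p2 act as hubs
```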
PageRank Computation Speed-Up
• One possibility is to approximate the PageRank
computation as
r(q) ≈ rs(q) * rp(q)
– rs(q) … rank of the site (domain)
– rp(q) … rank of the page within the site (domain)
• The number of domains is much smaller than
number of pages
– The eigen-vector of the matrix is much simpler
• Page number within the site is also much smaller
– Computation for different sites can be easily done in
parallel
NDBI010 - DIS - MFF UK
Extending of PageRank
• PageRank itself is independent of page content
– Empty page with many referring pages will obtain high
PageRank
– Can be easily fooled
• Page has different importance depending on the
theme the user is searching for
• The PageRank computation can be changed to
reflect particular themes
NDBI010 - DIS - MFF UK
Extending of PageRank
• Original proposition (Haveliwala) computes
independent PageRank values for top-terms
taken from the ODP (Open Directory
Project) thesaurus
– Dependent on the language
– Pages written in different languages should be
rated individually
NDBI010 - DIS - MFF UK
Themes Based
Extending of PageRank
• The basic equation for PageRank computation
r(q) = d·Σ(p,q)∈E (r(p)/o(p)) + (1-d)/n
is modified so that during the random walk a new
page is chosen with probability (1-d) only if the
page theme matches the searched theme
• For a given theme t the set of equations is the
following
– If the page q matches theme t
rt(q) = d·Σ(p,q)∈E (r(p)/o(p)) + (1-d)/nt
– If the page q doesn't match theme t
rt(q) = d·Σ(p,q)∈E (r(p)/o(p)) + 0
NDBI010 - DIS - MFF UK
Themes Based
Extending of PageRank
• For the query q, coefficients c(q,t) of matching
the query to individual themes are computed
• The PageRank of the page p is then evaluated as
the linear combination of the PageRank values for
all themes
– r(p,q) = Σt rt(p)·c(q,t)
NDBI010 - DIS - MFF UK
Personalized PageRank
• Also modifies the random walk algorithm
• It remembers set of favorite pages for known
users
• During random traversal using address field of
the browser favorite pages are preferred
NDBI010 - DIS - MFF UK
Further TRS Quality Metrics
• To evaluate the TRS quality, other metrics
among standard P (precision) and R (recall)
can be used
• Additional metrics try to take into account
the maximal criterion
• Having large collections, the quality of the
system should be measured for the
beginning part of the answer
NDBI010 - DIS - MFF UK
Further TRS Quality Metrics
• Simplest of those metrics is precision within
the first k returned documents, denoted as Pk
– The system with values P10=0.9; P=0.3
can be considered as better,
than other system where P10=P=0.6
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• If the query is ambiguous,
more independent groups of documents can match the
query
• For example: „Binding“
– Foot binding
– Ski binding, a device for connecting a foot to a ski
– Snowboard binding, a device for connecting a foot to a snowboard
– Book binding, the protective cover of a book
– Binding (computer science), a tie to certain names in programming languages
– Binding (molecular), a chemical interaction between molecules
– Neural binding, synchronous activity of neurons
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• In case of an ambiguous query it is desirable to have
documents representing all available topics as well
as possible within the first page of the answer or at
the beginning of the answer
• Which topic the user requested can be told from
further interaction with him/her, so the
system can provide a restricted set of documents in
the next iteration
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• To express the number of individual topics
appearing in the answer, new metrics have to be
defined, different from both precision and recall
– Diversity … number of groups (clusters) of mutually
similar documents in the answer
• Grows with number of topics presented in the answer or at the
beginning of the answer
– Information Richness … quality of documents with
regards to their respective topics
• Grows with the quality of chosen documents for individual
topics
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• Computation similar to PageRank
evaluation, with exceptions
– It is computed for the answer (similarly to
HITS algorithm)
– Doesn’t use the graph based on mutual
references, but on their mutual similarities
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• Let a collection of documents
D={di, i = 1, …, n} be given
• Diversity (of the set)
Div(D)
– Number of topics,
covered by documents within the set D
• Information Richness (of the document)
IRD(di)  <0;1>
– How much the document di represents its own topic.
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• If Div(D)=k,
each of documents within the set is assigned
to one of k topics
– The number of documents assigned to topic l is denoted nl
– ith document assigned to the topic l is denoted as dil
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• The similarities Sim(di, dj) of document couples are
computed for all di, dj ∈ D,
where Sim(di, dj) = (di·dj)/(|di|·|dj|)
• The rated graph G=(D,E) is built
– Graph nodes correspond to documents in D
• The rating of edge eij=(di, dj) is defined by the
similarity, i.e. h(eij) = Sim(di, dj)
• To spare space and time, edges corresponding to
dissimilar documents are not used, i.e. the edge eij ∈ E
if and only if
Sim(di, dj) ≥ St, where St is a chosen threshold
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• The adjacency matrix M of graph G is built
• The Information richness measure is derived
from two aspects
– The more similar documents (neighbors) the
graph node has, the higher the IR is
– The more similar the neighbor document is, the
higher the IR is
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• The dominant eigenvector of the matrix
c·M'ᵀ + (1-c)·U
is computed, where
– M' is matrix M with rows normalized to unit size
(Manhattan metric, the sum of all values equal to one)
– U contains values 1/n
– c = 0.85 (similar as in the case of PageRank)
• This eigenvector contains the required values of IRD(di)
NDBI010 - DIS - MFF UK
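A hedged numpy sketch of the Information Richness computation using power iteration on c·M'ᵀ + (1-c)·U (names and the tiny matrix are illustrative; M holds the pairwise similarities with below-threshold entries zeroed out):

```python
import numpy as np

def information_richness(M, c=0.85, iterations=100):
    """M: n x n similarity (adjacency) matrix of the answer graph."""
    n = M.shape[0]
    row_sums = M.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero
    M_norm = M / row_sums                  # rows normalized to sum 1 (Manhattan)
    A = c * M_norm.T + (1.0 - c) * np.full((n, n), 1.0 / n)
    ir = np.full(n, 1.0 / n)
    for _ in range(iterations):            # power iteration
        ir = A @ ir
        ir = ir / ir.sum()                 # keep the vector normalized
    return ir

M = np.array([[0.0, 0.9, 0.0],
              [0.9, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(information_richness(M))
```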
Diversity, Information Richness
• The average value of Information Richness
in the collection can be defined as
IRD = (1/Div(D)) · Σl=1..Div(D) (1/nl) · Σi=1..nl IRD(dil)
• Values of IR allow choosing the best
representatives, but may choose several very
similar documents that have similar IR but
represent the same topic
NDBI010 - DIS - MFF UK
Diversity, Information Richness
• Greedy algorithm
– A := ∅; B := D;
sort B in descending order by IRD value;
while B <> ∅ do begin
move the best available document di from B to A;
decrease values of the remaining documents by Mij*IRD(di);
re-order the remaining documents in B
end;
NDBI010 - DIS - MFF UK
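A minimal Python sketch of this greedy re-ranking (following the pseudocode above; M is the similarity matrix, ir the Information Richness values, both illustrative):

```python
def greedy_diverse_order(M, ir):
    """Order documents so that each pick is the best remaining value after
    subtracting M[best][j] * IR(best) for every already selected document."""
    remaining = {j: ir[j] for j in range(len(ir))}
    order = []
    while remaining:
        best = max(remaining, key=remaining.get)
        order.append(best)
        del remaining[best]
        for j in remaining:
            remaining[j] -= M[best][j] * ir[best]
    return order

M = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
ir = [0.5, 0.45, 0.4]
print(greedy_diverse_order(M, ir))   # [0, 2, 1]: the near-duplicate of the first pick drops back
```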
Artificial Neural Networks
in TRS’s
COSIMIR
Neural Networks in TRS’s
• Neural networks have been increasingly used in
TRS's since the nineties of the 20th century
– Usually targeting some of the following aspects
• Clustering (SOM)
• Reduction of index dimensionality
• Concepts finding
• Document relevancy estimation
NDBI010 - DIS - MFF UK
Neural Networks in TRS’s
• Advantages of using neural networks
– Generalization of relevancy rules
based on learning from specific examples
– Finding abstract dependencies, even if the user is
not able to formulate them exactly
– Robustness, fault tolerance
– Pattern classification (documents and/or terms)
using self-organizing maps
NDBI010 - DIS - MFF UK
Neural Networks in TRS’s
• Disadvantages of using neural networks
– Unsure results
• If sample data are not chosen correctly,
or the amount of them is too small/large,
derived generalizations need not reflect the reality
– Difficult interpretation of learned rules
NDBI010 - DIS - MFF UK
Neural Networks in TRS’s
• Neuron (perceptron)
– Basic unit of the neural network
– Models biological neuron
• Inaccurate, reflects only our idea of its
functionality
• Perceptron structure
– n inputs, corresponding to dendrites of
biological neuron
– 1 output, corresponding to the axon
NDBI010 - DIS - MFF UK
Neural Networks in TRS’s
• Perceptron structure
– n inputs denote as xi,
each of them has assigned
weight wi
– 1 output y
– Threshold t
•
if the neuron excitation is sufficient,
i.e. exceeding the threshold,
the neuron triggers its output
NDBI010 - DIS - MFF UK
[Figure: a perceptron with inputs x1 x2 x3 … xn, threshold t and output y]
Neural Networks in TRS's
• Perceptron functionality
1. Enumeration of its inner activation
(of the weighted sum)
a = Σi (wi·xi) - t
2. Calculation of the output value using
the transition function g
y = g(a)
– Usually g(a) = 1/(1+e^(-λa)),
the so-called sigmoid function
with steepness λ > 0
– Sometimes other transition functions
– Linear g(a) = a
– Signum g(a) = sgn(a)
– …
NDBI010 - DIS - MFF UK
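A minimal sketch of the perceptron computation just described, with the sigmoid transition function (the weights, threshold and steepness are illustrative):

```python
import math

def perceptron(x, w, t, lam=1.0):
    """Inner activation a = sum(w_i * x_i) - t, output y = 1/(1+e^(-lambda*a))."""
    a = sum(wi * xi for wi, xi in zip(w, x)) - t
    return 1.0 / (1.0 + math.exp(-lam * a))

print(perceptron(x=[1.0, 0.0, 1.0], w=[0.4, -0.2, 0.7], t=0.5))  # ~0.65
```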
Neural Networks in TRS's
• Geometrical interpretation
– The equation Σi (wi·xi) - t = 0 determines a separating
hyperplane in the input n-dimensional space
– Perceptron separates
input patterns
Σi (wi·xi) - t > 0
Σi (wi·xi) - t < 0
belonging to
individual halfspaces
NDBI010 - DIS - MFF UK
Neural Networks in TRS's
• Nets with more neurons
– Are built by connection of neuron output to inputs of
other neurons
• Recurrent networks
allows cyclic connections
– Usually unsupervised networks
– Allow pattern classification to groups
• Acyclic layered neural networks connect
outputs of one group (layer) of neurons
to inputs of all neurons in following layer
– Usually supervised, learned by backpropagation,
based on couples [pattern, required response]
NDBI010 - DIS - MFF UK
Layered Neural Network Learning
• Backpropagation algorithm
– Based on a set of vector couples
[pattern, required response]
– Minimizes error E of network
over the learning set
– E = Σj (yj-oj)², where
yj is the output of jth neuron in the output layer
oj is its response required by the supervisor
NDBI010 - DIS - MFF UK
Layered Neural Network Learning
• Backpropagation algorithm
– Numerical iterative calculation
• E is a function of the network weights,
derivable everywhere with respect to all weights
(holds for the sigmoid transition function; need not
hold in general)
• The derivative dE/dw is calculated for each weight
• The weights are changed slightly in the direction of
decreasing error
NDBI010 - DIS - MFF UK
Layered Neural Network Learning
• Backpropagation algorithm
– Calculates weight changes layer-by-layer from top
to bottom
• Output – vth – layer
– δiv = λ yi (1 – yi) (oi – yi)
– wij ← wij + η δjv yi
• Any other – kth – layer
– δik = λ yi (1 – yi) Σj (δjk+1 wij)
– wij ← wij + η δjk yi
NDBI010 - DIS - MFF UK
Document clustering
• Kohonen self-organizing maps
– High-dimensional space projection
into low-dimensional (often two-dimensional) space
– Partial topology preservation
– Clustering
NDBI010 - DIS - MFF UK
COSIMIR
• COSIMIR Model - Thomas Mandl, 1999:
COgnitive SIMilarity learning in Information Retrieval
• Cognitive calculation of document to query
similarity, based on the layered artificial
neural network and the backpropagation
algorithm
NDBI010 - DIS - MFF UK
COSIMIR
• Input layer has 2m neurons,
where m is number of terms
• Hidden layer contains
k neurons
(“symbolic concepts”)
• Output layer
has 1 neuron
representing similarity
[Figure: the input layer receives the document index vector (wi1, wi2, wi3, ..., wim) of document D and the query vector (q1, q2, q3, ..., qm) of query q; the single output neuron gives the similarity]
NDBI010 - DIS - MFF UK
Dimensionality curse
Pyramid technique
IGrid indexes
What is Dimensionality Curse?
• Most methods invented for nearest neighbor
search to given point in m-dimensional space
as R-trees, m-trees and others, work well
in low-dimensional spaces, but quickly lose their
effectiveness with increasing m.
• With m ~ 500 and more it is more efficient to go
through the whole space (cluster) sequentially.
• TRS spaces consist of m ~ 10000 and (much)
more dimensions
NDBI010 - DIS - MFF UK
Pyramid Technique
• Reduces the problem of neighborhood
searching of given predefined size
from m-dimensional to 1-dimensional,
where B-trees and similar structures can be
easily used
• Searching through m-dimensional block in
neighborhood of point x is converted to
scanning certain sections of the line
NDBI010 - DIS - MFF UK
Pyramid Technique
• The m-dimensional cube <0;1>^m is split into 2m
pyramids
– Each pyramid has as its base one
of the 2m (m-1)-dimensional cube walls
– The apex is in the cube center
[Figure: m=2 gives 4 pyramids (numbered 0–3); m=3 gives 6 pyramids]
NDBI010 - DIS - MFF UK
Pyramid Technique
• The m-dimensional cube <0;1>^m is split into 2m
pyramids
– Each m-dimensional point within the cube is
projected onto the point
0.5*PyramidNr + DistanceFromBase
[Figure (m=2): a point with height v1 in pyramid 1 maps to 0.5+v1, a point with height v2 in pyramid 3 maps to 1.5+v2; all values lie on the line from 0.0 to 2.0]
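A hedged sketch of this mapping, following the slide's formula 0.5*PyramidNr + DistanceFromBase; the convention for numbering the pyramids (dimension with the largest deviation from the centre, "low" half before "high" half) is an assumption — the actual technique fixes one such convention:

```python
def pyramid_value(point):
    """Map a point of the unit cube <0,1>^m onto a single number."""
    m = len(point)
    # the pyramid is determined by the coordinate deviating most from the centre
    j = max(range(m), key=lambda i: abs(point[i] - 0.5))
    if point[j] < 0.5:
        pyramid_nr, dist_from_base = j, point[j]              # base at x_j = 0
    else:
        pyramid_nr, dist_from_base = m + j, 1.0 - point[j]    # base at x_j = 1
    return 0.5 * pyramid_nr + dist_from_base

print(pyramid_value([0.8, 0.3]))   # one of the 4 pyramids of the unit square
```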
Pyramid Technique
• During the search within the neighborhood of a given
point, only the required parts (layers) of the pyramids,
parallel with their bases, are searched.
There exist at most 2m of them
[Figure (m=2): the query rectangle intersects only some of the 4 pyramids; only the corresponding intervals of the line 0.0–2.0 are scanned]
NDBI010 - DIS - MFF UK
Pyramid Technique
• Advantages
– Easy search for all points
within a block in the
neighborhood of a given
centre with a given size
– Responses to TRS queries
with a predefined maximal
distance (minimal
similarity) of documents
with respect to the given query
• Disadvantages
– Not optimal for searching
for the k closest (most similar)
points to a given point x
• It is not easy to estimate
the needed size of the
block containing at least k
points
– Not optimal if some
dimensions of the block are
unbounded
• The resulting block has a large
intersection with many
pyramids
NDBI010 - DIS - MFF UK
IGrid Index
• Solves problem of increasing dimensionality by
different distance (similarity) definition
in m-dimensional space
– Less “intuitive” definition
– Inadequate for low-dimensional spaces
– Increasing effectiveness with increasing space
dimensionality
• The more the space dimensionality,
the less percentage of the space must be searched
to find the closest point(s)
NDBI010 - DIS - MFF UK
IGrid Index
• The idea of the method is splitting of
m-dimensional space containing n points to
discrete m-dimensional sub-intervals
– The ith dimension is split into km sections in such a way
that in each section defined by an interval <li;ui> there are
approx. n/km points, i.e. all sections contain approx. the
same number of points
• If the distribution of points is uneven in given
dimension, densely occupied areas are split to
smaller sections
NDBI010 - DIS - MFF UK
IGrid Index
• Typically km=m
NDBI010 - DIS - MFF UK
IGrid Index
• The similarity is defined as follows
• If X=[x1,x2,…xm], Y=[y1,y2,…ym] then
– If xi and yi belong to the same section <li;ui>,
the points are considered similar in the ith dimension
and the similarity increase is
1-[(|xi-yi|)/(ui-li)]^p, else the increase is equal to 0.
• The similarity is in fact defined only over dimensions
where both points are similar enough
Sim(X,Y) = [ Σi (1-[(|xi-yi|)/(ui-li)]^p) ]^(1/p)
Sim(X,Y) ∈ <0; m^(1/p)>
NDBI010 - DIS - MFF UK
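A hedged sketch of the similarity as written on this slide (the per-dimension increase is counted only when both coordinates fall into the same section; the section boundaries and p are illustrative):

```python
def igrid_similarity(x, y, sections, p=2):
    """sections[i] is the list of (l, u) interval boundaries of dimension i."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        for (l, u) in sections[i]:
            if l <= xi < u and l <= yi < u:      # same section in dimension i
                total += 1.0 - (abs(xi - yi) / (u - l)) ** p
                break
    return total ** (1.0 / p)                    # value in <0; m**(1/p)>

sections = [[(0.0, 0.5), (0.5, 1.0)], [(0.0, 0.5), (0.5, 1.0)]]
print(igrid_similarity([0.1, 0.7], [0.2, 0.9], sections))
```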
IGrid Index
• Why 1-[(|xi-yi|)/(ui-li)]^p ?
– Derived from the p-norm
– The closer the distance in a given dimension is
to the size of the section, the closer the fraction is
to 1 and thus the similarity increase is closer to 0
[Figure: points X and Y inside a section of width |ui-li|, at distance |xi-yi|]
NDBI010 - DIS - MFF UK
IGrid Index
• The probability of the
similarity of two
points in the ith dimension
is 1/km
• Two points are on
average similar in
m/km dimensions
[Figure (m=2): zero similarity — the points are not similar in any dimension; maximal reachable similarity 1 — similar in one dimension only; maximal reachable similarity 2^(1/p) — similar in both dimensions]
NDBI010 - DIS - MFF UK
IGrid Index
• Index structure
• IGrid Index
– For each of m dimensions
and each of km sections
exists list (with length
n/km) of items, having the
value of given dimension
within given section
– Each item contains value
of corresponding
dimension and reference
to original complete
vector
NDBI010 - DIS - MFF UK
IGrid Index
• Index structure
• IGrid Index
– Index size
m*km*(n/km)=m*n
– Each vector is
referenced from m
lists, one for each
dimension
NDBI010 - DIS - MFF UK
IGrid Index
• It is shown that the number of
sections km in each dimension
should be linearly dependent on the
number of dimensions m, for
example km=m
• Reduction of discretization impact
– Each dimension is split into l*km
sections (l is an odd number, for
example l=3)
– So there exist l times more, l times
shorter lists within the index
– In each dimension l lists are searched:
the corresponding one plus (l-1)/2 adjacent lists in each direction
NDBI010 - DIS - MFF UK
IGrid Index
• Direct search in vector
index
– Index volume
m*n
– During query evaluation
necessary to read
m*n
• Search using the
IGrid index
– Index volume
(m*km)*(n/km)=m*n
– During query evaluation
necessary to read
m*(n/km)=n , if km=m
• The volume of data doesn’t
depend on the space
dimensionality m i.e. the
effectiveness increases
NDBI010 - DIS - MFF UK
Approximate search
• Error (typo) detection in text
• Typo correction
• Words (with max. length n)
in the alphabet X
correspond to points in the space (X{})n
where  pads short words to uniform length
• Not each point in the space correspond to
some word
NDBI010 - DIS - MFF UK
Approximate search
• Metrics in the space (X{})n
– Hamming metrics H(u,v)
• Minimal number of REPLACE (of one character)
operations needed to convert one word to another
• Omission, or addition, of a character
usually produces a large distance between words
• H(’success’,’syccess’)=1
• H(’success’,’sucess’)=3
NDBI010 - DIS - MFF UK
Approximate search
• Metrics in the space (X{})n
– Levenshtein metrics L(u,v)
• Minimal number of REPLACE, INSERT, DELETE
(of one character) operations needed to convert one
word to another
• L(’success’,’syccess’)=1
• L(’success’,’sucess’)=1
NDBI010 - DIS - MFF UK
Hamming Metrics
• Detection using non-deterministic FA
[Figure: non-deterministic automaton detecting the word 'success' with up to 2 errors — three rows of states for 0, 1 and 2 errors; horizontal transitions accept the expected character, diagonal transitions (labelled -x) accept any other character and move one error level down; the initial state loops on any character (*)]
NDBI010 - DIS - MFF UK
Hamming Metrics
• Q = {qi,j | 0≤i≤k, i≤j≤n} is a finite set of states
• X is the given alphabet
• S = {q0,0} ⊆ Q is the set of initial states
• F = {qi,n} ⊆ Q is the set of final states,
where state qi,n detects word w with i errors.
a) qi,j ∈ δ(qi,j-1, xj) represents an acceptance of
character xj without error.
b) qi+1,j ∈ δ(qi,j-1, x), for x ∈ X\{xj}, represents an
acceptance of character xj with an error (REPLACE
operation).
NDBI010 - DIS - MFF UK
Levenshtein Metrics
• Detection using non-deterministic FA
[Figure: non-deterministic automaton detecting the word 'success' with up to 2 errors under the Levenshtein metric — besides the REPLACE transitions of the Hamming automaton it also contains INSERT and DELETE transitions between the error levels]
NDBI010 - DIS - MFF UK
Levenshtein Metrics
• Q = {qi,j | 0≤i≤k, 0≤j≤n} is a finite set of states
• X is the given alphabet
• S = {q0,0} ⊆ Q is the set of initial states
• F = {qi,n} ⊆ Q is the set of final states,
where state qi,n detects word w with i errors.
a) qi,j ∈ δ(qi,j-1, xj) represents an acceptance of
character xj without error.
b) qi+1,j ∈ δ(qi,j-1, x), for x ∈ X\{xj} … REPLACE.
c) qi+1,j-1 ∈ δ(qi,j-1, x), for x ∈ X\{xj} … INSERT.
d) qi+1,j+1 ∈ δ(qi,j-1, xj+1) … DELETE.
NDBI010 - DIS - MFF UK
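Besides the automaton view, the Levenshtein distance itself can be computed by the usual dynamic-programming recurrence; a minimal sketch:

```python
def levenshtein(u, v):
    """Minimal number of REPLACE, INSERT and DELETE operations turning u into v."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        curr = [i]
        for j, cv in enumerate(v, start=1):
            cost = 0 if cu == cv else 1
            curr.append(min(prev[j] + 1,          # delete cu
                            curr[j - 1] + 1,      # insert cv
                            prev[j - 1] + cost))  # replace (or keep) cu
        prev = curr
    return prev[-1]

print(levenshtein("success", "syccess"))   # 1
print(levenshtein("success", "sucess"))    # 1
```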
Vocabulary Construction
• Frequency dictionary
– List of words ordered
by number of
occurrences in
descending order
• At the end
– Rarely used words
– Typos
• At the beginning
– Often used words
– Stop words
[Figure: each document (Doc. 1, Doc. 2, …) yields a frequency dictionary, from which the stoplist, the list of terms and the typos are separated]
NDBI010 - DIS - MFF UK
Non-interactive Spell Checking
• Simultaneous comparison of two
alphabetically ordered dictionaries
– Alphabetical list of terms in document
– Alphabetical list of correct terms
• Requires one pass through both dictionaries
NDBI010 - DIS - MFF UK
Interactive Spell Checking
• Each term has to be checked immediately against
the dictionary
• Saves memory and time using a hierarchical
dictionary
• Uses so called
Empirical Zipf’s Law:
order of the term in the frequency dictionary
multiplied by its frequency
is approximately constant
NDBI010 - DIS - MFF UK
Empirical Zipf’s Law
• First 10 words of the English frequency dictionary
(containing approx. 1,000,000 words)
order  word  frequency  order*frequency
1      the   0.069971   0.069971
2      of    0.036411   0.072822
3      and   0.028852   0.086556
4      to    0.026149   0.104596
5      a     0.023237   0.116185
6      in    0.021341   0.128046
7      that  0.010595   0.074165
8      is    0.010099   0.080792
9      was   0.009816   0.088344
10     he    0.009543   0.095430
NDBI010 - DIS - MFF UK
Empirical Zipf’s Law
• Cumulative term frequency of the first k terms
CTFk = Σi=1..k frequencyi
• The first 10 words ≈ 25% of word occurrences in the text
order  word  frequency  order*frequency  CTF
1      the   0.069971   0.069971         0.069971
2      of    0.036411   0.072822         0.106382
3      and   0.028852   0.086556         0.135234
4      to    0.026149   0.104596         0.161383
5      a     0.023237   0.116185         0.184620
6      in    0.021341   0.128046         0.205961
7      that  0.010595   0.074165         0.216556
8      is    0.010099   0.080792         0.226655
9      was   0.009816   0.088344         0.236471
10     he    0.009543   0.095430         0.246014
NDBI010 - DIS - MFF UK
Empirical Zipf’s Law
• Cumulative term frequency graph
[Figure: cumulative term frequency graph — approx. 20% of the different words in a text cover approx. 70% of all word occurrences]
NDBI010 - DIS - MFF UK
Hierarchical dictionaries
• Complete dictionary in external memory
≈ 10,000 and more different words
• Dictionary of words found in the document
≈ 2,000 different words
• Dictionary of most frequent words in memory
≈ 200 different words
≈ 50% of word occurrences in the document
NDBI010 - DIS - MFF UK
Compression
Term lists (terms, stoplist)
Index
Primary documents
Compression
• Compression in TRS’s
– Term lists (Terms, Stopterms)
– Index
– Primary documents
NDBI010 - DIS - MFF UK
Compression of Term Lists
• POM - Prefix
Omitting Method
– Each term is represented
as a couple
• Length of the prefix
common with the previous
term in the list
• Rest of the term
(postfix)
a                  0:a
abeceda            1:beceda
absence            2:sence
absolutní          3:olutní
absolvent          5:vent
abstinent          3:tinent
abstraktní         4:raktní
aby                2:y
ačkoli             1:čkoli
administrace       1:dministrace
administrativní    10:tivní
NDBI010 - DIS - MFF UK
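A minimal sketch of the Prefix Omitting Method for an alphabetically ordered term list (encoder and decoder):

```python
def pom_encode(terms):
    """terms must be sorted; each term becomes (common prefix length, postfix)."""
    out, prev = [], ""
    for term in terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        out.append((k, term[k:]))
        prev = term
    return out

def pom_decode(pairs):
    terms, prev = [], ""
    for k, postfix in pairs:
        prev = prev[:k] + postfix
        terms.append(prev)
    return terms

words = ["a", "abeceda", "absence", "absolutní", "absolvent"]
encoded = pom_encode(words)
print(encoded)                      # [(0, 'a'), (1, 'beceda'), (2, 'sence'), ...]
assert pom_decode(encoded) == words
```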
Index Vector Representation
• Individual vectors are sparse (approx. 90%
of zeroes)
• Storage of all weights d = [w1, w2, w3, …, wm]
including zeroes is ineffective
• More effective is storage of only couples
– Non-zero element index
– Value of the element
d = [j1:wj1, j2:wj2, j3:wj3, …, jk:wjk]
NDBI010 - DIS - MFF UK
Encoding and Compression
• Code K=(A,C,f), where
– A={a1, a2, ..., an} is source alphabet, |A|=n
– C={c1, c2, ..., cm} is target (code) alphabet,
|C|=m, usually C={0,1}
– f: A → C+ is an injective mapping of characters
of alphabet A onto words in alphabet C.
• Extension for word encoding from A*
f(ai1ai2...aik) = f(ai1)f(ai2)...f(aik)
NDBI010 - DIS - MFF UK
Encoding and Compression
• Code K is uniquely decodable,
if and only if for each string Y ∈ C+ there exists at
most one string X ∈ A+ such that f(X)=Y.
• Code K=({0,1,2,3},{0,1},f),
where f(0)=0, f(1)=1, f(2)=10, f(3)=11
is not uniquely decodable
f(21)=101=f(101)
NDBI010 - DIS - MFF UK
Encoding and Compression
• Code K is a prefix code, if and only if no
code word f(ai) is a prefix
of another code word f(aj).
• Code K=({0,1,2,3},{0,1},f),
where f(0)=0, f(1)=10, f(2)=110, f(3)=111
is a prefix code.
• Code K is a block code (of length k), if and
only if all code words have the length k.
NDBI010 - DIS - MFF UK
Encoding and Compression
• Each block code is also a prefix code.
• Each prefix code is uniquely decodable.
• Each code K is left to right (char by char)
decodable, if it is possible to determine the
end of the code word f(ai) and the
corresponding character ai just after the last
bit of the code word is read.
NDBI010 - DIS - MFF UK
Encoding and Compression
• Code K=({a,b,c,d},{0,1},f),
where f(a)=0, f(b)=01, f(c)= 011, f(d)=111
is not left to right decodable,
but is still uniquely decodable.
• Example: f(X)=011111111...
NDBI010 - DIS - MFF UK
Entropy and Redundancy
• Let A={a1, a2, ..., an} be the source alphabet.
• Let the occurrence probability of character
ai in the text be equal to pi.
• P(A) = (p1, p2, ..., pn) is denoted as
probability distribution of A.
NDBI010 - DIS - MFF UK
Entropy and Redundancy
• Entropy (measure of amount of
information) of character ai in the text is
equal to value E(ai) = -log(pi) bits.
• Average entropy of one character
AE(A) = - Σi=1..n pi·log(pi)
• Entropy of a text
E(ai1ai2...aik) = - Σj=1..k log(pij)
Entropy and Redundancy
• Let the code K(A,C,f) assign code words
with lengths |f(ai)| = di to characters ai ∈ A.
• The length of the encoded message is equal to
|f(ai1ai2...aik)| = l(ai1ai2...aik) = Σj=1..k dij
• It holds that l(ai1ai2...aik) ≥ E(ai1ai2...aik)
• Redundancy R = l(ai1ai2...aik) - E(ai1ai2...aik)
NDBI010 - DIS - MFF UK
Number Encoding
• Binary encoding for
(potentially) infinite set
• Left to right decodable
• As most effective as possible
NDBI010 - DIS - MFF UK
Fibonacci Encoding
• Instead of powers of 2 it uses Fibonacci numbers
for individual orders
• So instead of n = Σi bi·2^i, where bi ∈ {0,1},
it uses the notation n = Σi bi·Fi,
where bi ∈ {0,1}, F0=1, F1=2, Fk+1=Fk-1+Fk
• Highest orders are on the right side
• Problem: ambiguity
17(10) = 1+3+5+8 = F0+F2+F3+F4 = 10111Fib
       = 1+3+13  = F0+F2+F5    = 101001Fib
NDBI010 - DIS - MFF UK
Fibonacci Encoding
• There exists only one notation that doesn’t use two
consecutive members of the Fibonacci sequence
• If there exist two consecutive members, the
highest of all such occurrences can be replaced by their
sum (by the following member)
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
• 100(10)
= 111101101Fib
= 111100011Fib
= 1111000001Fib
= 1100100001Fib
= 0010100001Fib
NDBI010 - DIS - MFF UK
Fibonacci Encoding
• In the normalized Fibonacci encoding
– There don’t exist two consecutive one-bits
– The last (rightmost) used position is a one-bit
• At the end of the notation an extra one-bit is added
that allows determining the end
• 100(10) = 00101000011F
• The Fibonacci sequence grows exponentially
⇒ the notation has logarithmic length
NDBI010 - DIS - MFF UK
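A hedged sketch of the normalized Fibonacci encoding described above (greedy Zeckendorf decomposition with F0=1, F1=2, …, lowest order on the left, plus the terminating extra one-bit):

```python
def fibonacci_encode(n):
    """Return the normalized Fibonacci code of n >= 1 as a bit string."""
    fibs = [1, 2]
    while fibs[-1] <= n:                   # F(k+1) = F(k-1) + F(k)
        fibs.append(fibs[-2] + fibs[-1])
    bits = [0] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1): # greedy from the highest order
        if fibs[i] <= n:
            bits[i] = 1
            n -= fibs[i]
    while bits and bits[-1] == 0:          # drop unused high orders
        bits.pop()
    return "".join(map(str, bits)) + "1"   # terminating extra one-bit

print(fibonacci_encode(100))   # 00101000011
print(fibonacci_encode(17))    # 1010011
```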
Elias Codes
• A set of number
encodings with
different features
• Alpha code (unary)
α(n) = 0^(n-1).1
+ Decodable
– Long codes
   n  code
   1  1
   2  01
   3  001
   4  0001
   5  00001
   6  000001
   7  0000001
   8  00000001
   9  000000001
|α(2^30-1)| = 2^30-1
NDBI010 - DIS - MFF UK
Elias Codes
• Beta code (binary)
• Standard encoding
β(1) = 1
β(2n) = β(n).0
β(2n+1) = β(n).1
+ Short code words
– Undecodable
   n  code
   1  1
   2  10
   3  11
   4  100
   5  101
   6  110
   7  111
   8  1000
   9  1001
|β(2^30-1)| = 30
NDBI010 - DIS - MFF UK
Elias Codes
• Modified
beta code
• Without the leading one-bit
β'(1) = ε (the empty string)
β'(2n) = β'(n).0
β'(2n+1) = β'(n).1
+ Short code words
– Undecodable
   n  code
   1  (empty)
   2  0
   3  1
   4  00
   5  01
   6  10
   7  11
   8  000
   9  001
|β'(2^30-1)| = 30-1 = 29
NDBI010 - DIS - MFF UK
Elias Codes
• Theta code
θ(n) = β(n).#
+ Short code words
+ Decodable
– Ternary encoding
   n  code
   1  1#
   2  10#
   3  11#
   4  100#
   5  101#
   6  110#
   7  111#
   8  1000#
   9  1001#
|θ(2^30-1)| = 30+1 = 31
NDBI010 - DIS - MFF UK
Elias Codes
• Gamma code
• Combination of two codes
– The modified beta code β' encodes the number
– The alpha code ensures decodability
– The bits of β'(n) and of α(|β(n)|) are interleaved
• γ(7) = 01011
+ Short codes
+ Decodable
   n  code
   1  1
   2  001
   3  011
   4  00001
   5  00011
   6  01001
   7  01011
   8  0000001
   9  0000011
|γ(2^30-1)| = 30+29 = 59
NDBI010 - DIS - MFF UK
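A hedged sketch of the interleaved gamma code shown in the table (each bit of the modified beta code β'(n) is preceded by 0, and a final 1 terminates the code word):

```python
def beta_prime(n):
    """Binary notation of n without the leading one-bit."""
    return bin(n)[3:]             # bin(1)='0b1' -> '', bin(6)='0b110' -> '10'

def gamma(n):
    return "".join("0" + b for b in beta_prime(n)) + "1"

def gamma_decode(bits):
    """Decode one gamma code word from the front of a bit string."""
    value, i = "1", 0
    while bits[i] == "0":
        value += bits[i + 1]
        i += 2
    return int(value, 2), i + 1   # (number, bits consumed)

print([gamma(n) for n in range(1, 8)])   # ['1', '001', '011', '00001', ...]
print(gamma_decode("01011"))             # (7, 5)
```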
Elias Codes
• Modified
gamma code
γ'(n) = α(|β(n)|).β'(n)
• More human readable
• Non-regular
   n  code
   1  1
   2  010
   3  011
   4  00100
   5  00101
   6  00110
   7  00111
   8  0001000
   9  0001001
|γ'(2^30-1)| = 30+29 = 59
NDBI010 - DIS - MFF UK
Elias Codes
• Delta code
• Uses the more efficient
gamma code for
encoding the length
of the binary code
δ(n) = γ(|β(n)|).β'(n)
   n  code
   1  1
   2  0010
   3  0011
   4  01100
   5  01101
   6  01110
   7  01111
   8  00001000
   9  00001001
|δ(2^30-1)| = 6+5+29 = 40
NDBI010 - DIS - MFF UK
Elias Codes
• Omega code
• Used for encoding
very large numbers
ω(n) = Bk Bk-1 … B1 B0 0, where
B0 = β(n),
Bi+1 = β(|Bi| - 1),
and the recursion stops when the length |Bi| reaches 2
(ω(1) = 0)
   n  code
   1  0
   2  100
   3  110
   4  101000
   5  101010
   6  101100
   7  101110
   8  1110000
   9  1110010
|ω(2^30-1)| = 2+3+5+30+1 = 41
NDBI010 - DIS - MFF UK
Vector Index Structure
• The weights can be stored using integers
instead of floats
• The precision of a few positions is usually
sufficient
• Smaller numbers are stored in a smaller number
of bits, so it is possible to store differences of
indexes instead of their values
d = [j1:wj1, j2-j1:wj2, …, jk-jk-1:wjk]
NDBI010 - DIS - MFF UK
Text Compression
• Huffman encoding.
• Prefix code for alphabet A with minimal
reachable redundancy
NDBI010 - DIS - MFF UK
Text Compression
• Huffman code construction
– The alphabet A={a1, a2, ..., an} with distribution
P(A)={p1, p2, ..., pn}, suppose p1 ≤ p2 ≤ ... ≤ pn
– If n=2, then f(a1)=0, f(a2)=1
– Else the modified (reduced) alphabet is built
A’={a1a2, a3, ..., an}, P(A’)={p1+p2, p3, ..., pn}
and the modified code f’ is constructed recursively
– f(a1)=f’(a1a2).0, f(a2)=f’(a1a2).1, f(ai)=f’(ai) for i>2
NDBI010 - DIS - MFF UK
Huffman Encoding Example
– A={u, v, w, x, y, z} with distribution
P(A) = (1/32, 2/32, 3/32, 4/32, 5/32, 17/32)
• f(u) = 0000
• f(v) = 0001
• f(w) = 001
• f(x) = 010
• f(y) = 011
• f(z) = 1
[Figure: the Huffman tree — the root 32/32 splits into uvwxy (15/32, prefix 0) and the leaf z (17/32, code 1); uvwxy splits into uvw (6/32, prefix 00) and xy (9/32, prefix 01); uvw splits into uv (3/32, prefix 000) and w (3/32); uv splits into u (1/32) and v (2/32); xy splits into x (4/32) and y (5/32)]
NDBI010 - DIS - MFF UK
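A hedged sketch of the recursive construction just described (heapq-based; ties are broken arbitrarily, so the concrete codes may differ from the example while the code lengths stay the same):

```python
import heapq
from itertools import count

def huffman(distribution):
    """distribution: dict character -> probability; returns dict character -> code."""
    tiebreak = count()
    heap = [(p, next(tiebreak), [c]) for c, p in distribution.items()]
    heapq.heapify(heap)
    codes = {c: "" for c in distribution}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # two least probable groups
        p2, _, group2 = heapq.heappop(heap)
        for c in group1:
            codes[c] = "0" + codes[c]
        for c in group2:
            codes[c] = "1" + codes[c]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return codes

dist = {"u": 1/32, "v": 2/32, "w": 3/32, "x": 4/32, "y": 5/32, "z": 17/32}
print(huffman(dist))   # z -> 1 bit, w/x/y -> 3 bits, u/v -> 4 bits
```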
Text Compression
• Data model
– Both compression and decompression are
controlled by set of data, that parameterizes
given method
• For Huffman encoding the probability distribution
– The equality of both models must be ensured
[Figure: input text → compression (controlled by the model) → compressed data → decompression (controlled by the same model) → output text]
NDBI010 - DIS - MFF UK
Text Compression
• Static compression
– Static model for all documents in the collection
• Can be computed and stored only once
• Compression is not optimal
• Semi-adaptive compression
– Each document has its own model
• Model must be stored together with compressed data
• Dynamic (adaptive) compression
– Both algorithm forms model dynamically according to
already processed data
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• FGK (Faller, Gallager a Knuth) algorithm
• It uses so called sibling property
– Nodes in tree can
be ordered so, that
• Sibling are consecutive
in the ordering
• Weights (probabilities,
frequencies)
don’t decrease
[Figure: the Huffman tree from the previous example; its nodes can be ordered u (1/32), v (2/32), uv (3/32), w (3/32), x (4/32), y (5/32), uvw (6/32), xy (9/32), uvwxy (15/32), z (17/32), root (32/32) — siblings are consecutive and the weights do not decrease]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• There exist encoding tree for each character
of the text
• Following character is encoded/decoded
according to existing tree
• The tree is modified to reflect the increased
frequency of the last encoded/decoded
character
• Both algorithms have to start form the same
tree
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• Huffman tree modification
Node := Processed_Node;
while Node <> Root do begin
Swap the Node including its
subtree with the last node with
the same frequency;
Increase the Node frequency by
one
{the ordering is not corrupt}
Node := Predecessor(Node)
end;
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• The following tree
is given
• The character z
is encoded as 1011
• The node z is the only
one with frequency 3
• The frequency can be
increased to 4
[Figure: the current encoding tree with total frequency 32; the leaf z has frequency 3 and code 1011]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• Node with frequency 5
is not the last
in the ordering
• Must be swapped
with the last one
• Then the frequency
can be increased
[Figure: the tree during the update — z's frequency has been increased to 4; the parent node with frequency 5 must be swapped with the last node of frequency 5 before its own frequency is increased]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• Node with frequency 11
is not the last
in the ordering
• Must be swapped
with the last one
• Then the frequency
can be increased
[Figure: the tree during the update — the node with frequency 11 is swapped with the last node of frequency 11 before its frequency is increased]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• Node with frequency 32
is not the last
in the ordering
• Need not
to be swapped
• The frequency
can be increased
[Figure: the root with frequency 32 is the last in the ordering, so it is not swapped and its frequency is simply increased]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• According to
the modified tree
the next character z
would be encoded as
001 instead of 1011
[Figure: the modified tree with total frequency 33; the code of z is now 001]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• The starting tree can
be either the one that
supposes frequencies
equal to one
[Figure: a balanced starting tree over the characters a, b, c, d, each with frequency 1]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• It is more effective to start with only a
one-node tree that represents all
unknown characters with a
frequency equal to one
• A new, still unknown character is
encoded using this special node and
the representation of the character
is then stored in the compressed
data as well
• The node is split into a new node
representing unknown characters
and one representing the last
character – both with frequencies
equal to one
• The first character of the message
abacbda is encoded
(the tree contains only the ? node with frequency 1)
• The empty string is sent,
followed by the definition of a;
remaining input: bacbda
[Figure: the tree is now a root with frequency 2 and two leaves, ? and a, each with frequency 1]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• (The same splitting procedure as on the previous slide continues)
• The next character of a|bacbda is encoded
[Figure: the current tree — root (2) with leaves ? (1) and a (1)]
• The code 0 is sent,
followed by the definition of b;
output so far: a0b, remaining input: acbda
[Figure: the new tree — root (3) with leaf a (1) and an inner node (2) holding ? (1) and b (1)]
NDBI010 - DIS - MFF UK
Adaptive
Huffman Compression
• (The same splitting procedure as on the previous slides continues)
• The third character of a0b|acbda is encoded
• The string 0 is sent;
output so far: a0b0, remaining input: cbda
[Figure: the tree after the update — total frequency 4, the leaf a now has frequency 2]
NDBI010 - DIS - MFF UK
HuffWord Algorithm
• Encoding of the text word by word using
Huffman algorithm
• Semi-adaptive method
• More effective than char by char encoding
– More different symbols
with much different frequencies
• Simple representation of the tree
• More time consuming compression
NDBI010 - DIS - MFF UK
HuffWord Algorithm
• Uses canonical form of code words
• Each code word is in form p.c
– p is a prefix of all zeroes
– c is code number
• Codes are ordered by the length, for the same
length by their values
• The prefix length of the longer code words is
equel to the complete length of shorter code words
NDBI010 - DIS - MFF UK
HuffWord Algorithm Example
• Words A..H with code lengths 4,4,5,5,2,2,4,2
• Code words with length 4 have prefix 00 of length 2
⇒ Codes with the length 2 are 01, 10, 11
• Codes with the length 5 have prefix 0000 with length 4
⇒ Codes with the length 4 are 00.01, 00.10, 00.11
⇒ Codes with the length 5 are 0000.0 and 0000.1
NDBI010 - DIS - MFF UK
HuffWord Algorithm Example
• Assigning of code
words to words
word  length  code
A     4       0001
B     4       0010
C     5       00000
D     5       00001
E     2       01
F     2       10
G     4       0011
H     2       11
• Codes in proper order
word  length  code
E     2       01
F     2       10
H     2       11
A     4       0001
B     4       0010
G     4       0011
C     5       00000
D     5       00001
NDBI010 - DIS - MFF UK
HuffWord Algorithm
• Length of codes finding
• Given words wi,
their frequencies ni and probabilities pi
• for each i
b=round(-log(pi))
if (b==0) then b++
x[b]+=ni
• Index b for the highest accumulated value x[b]
determines the length of shortest code words
• Index belonging to second highest value
defines the length increase of second shortest code words
• ...
NDBI010 - DIS - MFF UK
HuffWord Algorithm Example
• Code word lengths
• Words A..E
word  ni  pi    -log(pi)  round(-log(pi))
A     3   0.03  5.06      5
B     7   0.07  3.84      4
C     20  0.20  2.32      2
D     30  0.30  1.74      2
E     40  0.40  1.32      1
• Array x
i  x[i]
1  40
2  50
3  0
4  7
5  3
– 2
– 2+1
– 2+1+4
– 2+3+4+5
• Codes
– E=01
– D=10
– C=11
– B=000
– A=001
NDBI010 - DIS - MFF UK
HuffWord Algorithm Example
• Decompression
• For each length b in bits the following information should be available
– The value of lowest valid code word
having the given length (first[b])
– Index of first valid code word in the table of code
words (base[b])
• For E=01, D=10, C=11, B=000, A=001
– first[0] = +∞
– first[1] = +∞
– first[2] = 1
– first[3] = 0
– base[2] = 1
– base[3] = 4
NDBI010 - DIS - MFF UK
HuffWord Algorithm Example
• For E=01, D=10, C=11, B=000, A=001
• first[0] = +∞,
first[1] = +∞,
first[2] = 1,
first[3] = 0
• base[2] = 1,
base[3] = 4
• Input 10|11|01|01|000|10|001
• c = 0; d = 0
while (c < first[d])
c = 2*c + next_bit()
d++
word_index = base[d] + c - first[d]
NDBI010 - DIS - MFF UK
Markov Automata
• The probability distribution of characters can be
very different according to context (previous
characters)
– p(”o”) = 0.058
probability of occurrence of character ”o”
in the text
– p(”o”|”e”) = 0.004
probability of occurrence of character ”o”
in the text supposing that
the previous character was ”e”
– p(”o”|”c”) = 0.237
Markov Automata
• It is possible to build a finite automaton whose
states correspond to strings of given length
• Q = X^n, where n is the automaton order
• δ(x1x2…xn, x) = x2…xn x
• For each state an individual compression model
can be constructed that corresponds to the conditional
probabilities
• Better probability estimation results in better –
more effective – compression
NDBI010 - DIS - MFF UK
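A hedged sketch of estimating the conditional character distributions that such an automaton's states would use; the counts come from a sample text, and all names are illustrative:

```python
from collections import Counter, defaultdict

def conditional_model(text, order=1):
    """Estimate p(next char | previous `order` chars) from a sample text."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        context = text[i - order:i]
        counts[context][text[i]] += 1
    return {ctx: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for ctx, cnt in counts.items()}

model = conditional_model("the score of the method is the cosine", order=1)
print(model.get("c"))   # distribution of characters following 'c'
print(model.get("e"))   # distribution of characters following 'e'
```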