The Mathematics of Information Retrieval


The Mathematics of Information Retrieval
11/21/2005
Presented by Jeremy Chapman, Grant Gelven, and Ben Lakin
Acknowledgments
This presentation is based on the following paper:
"Matrices, Vector Spaces, and Information Retrieval" by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.
Indexing of Scientific Works
Indexing is primarily done using the title, author list, abstract, key word list, and subject classification
These fields are created in large part so that the document can be found in a search of the scientific literature
The use of automated information retrieval (IR) has improved both consistency and speed
Vector Space Model for IR
The basic mechanism for this model is the
encoding of a document as a vector
All documents’ vectors are stored in a
single matrix
Latent Semantic Indexing (LSI) replaces the original matrix with a matrix of smaller rank that retains similar information, by means of rank reduction
Creating the Database Matrix
Each document is defined in a column of
the matrix (d is the number of documents)
Each term is defined as a row (t is the
number of terms)
This gives us a t × d matrix
The document vectors (the columns) span the content of the collection
Simple Example
Let the six terms be as follows:
T1: bak(e, ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y, ies)
T6: pie
The following are the d=5 documents
D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes
Thus the document matrix becomes:
A =
      D1  D2  D3  D4  D5
T1  [  1   0   0   1   0 ]
T2  [  1   0   1   1   1 ]
T3  [  1   0   0   1   0 ]
T4  [  0   0   0   1   0 ]
T5  [  0   1   0   1   1 ]
T6  [  0   0   0   1   0 ]
The matrix A after Normalization
Thus after the normalization of the columns of A we get the following:

A =
[ .5774    0     0   .4082    0
  .5774    0     1   .4082  .7071
  .5774    0     0   .4082    0
    0      0     0   .4082    0
    0      1     0   .4082  .7071
    0      0     0   .4082    0 ]
Making a Query
Next we will use the document matrix to
ease our search for related documents.
Referring to our example we will make the
following query: Baking Bread
We will now form a query vector using the term definitions given before:
q = (1 0 1 0 0 0)^T
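As a minimal sketch (assuming NumPy; the counts are taken from the example above), the raw term-document matrix can be built, its columns normalized, and the query vector formed as follows:

import numpy as np

# Raw term-document matrix for the baking example:
# rows = terms T1..T6, columns = documents D1..D5,
# entry (i, j) = 1 if term i appears in document j.
A_raw = np.array([
    [1, 0, 0, 1, 0],   # T1: bak(e, ing)
    [1, 0, 1, 1, 1],   # T2: recipes
    [1, 0, 0, 1, 0],   # T3: bread
    [0, 0, 0, 1, 0],   # T4: cake
    [0, 1, 0, 1, 1],   # T5: pastr(y, ies)
    [0, 0, 0, 1, 0],   # T6: pie
], dtype=float)

# Normalize each column (document vector) to unit Euclidean length.
A = A_raw / np.linalg.norm(A_raw, axis=0)
print(np.round(A, 4))   # reproduces the normalized matrix above

# Query vector for "baking bread": 1 for T1 (bake) and T3 (bread), 0 elsewhere.
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)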
Matching the Document to the Query
Matching the documents to a given query is
typically done by using the cosine of the angle
between the query and document vectors
The cosine is given as follows:
cos(θ_j) = (a_j^T q) / (||a_j||_2 ||q||_2)
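Continuing the NumPy sketch, the cosine scores for every document can be computed at once; the 0.5 cutoff matches the one used on the next slide:

# cos(theta_j) = a_j^T q / (||a_j||_2 ||q||_2) for each document column a_j.
cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
print(np.round(cosines, 4))              # approx. [0.8165 0. 0. 0.5774 0.]
retrieved = np.where(cosines >= 0.5)[0]  # documents D1 and D4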
A Query
By using the cosine formula we would get:
cos(θ1) = .8165, cos(θ2) = 0, cos(θ3) = 0, cos(θ4) = .5774, and cos(θ5) = 0
We will set our lower limit on the cosine at .5.
Thus by conducting the query "baking bread" we retrieve the following two articles:
D1: How to Bake Bread Without Recipes
D4: Breads, Pastries, Pies, and Cakes: Quantity
Baking Recipes
Singular Value Decomposition
The Singular Value Decomposition (SVD) is used to
reduce the rank of the matrix, while also giving a good
approximation of the information stored in it
The decomposition is written in the following manner:
A = UΣV^T
where U spans the column space of A, Σ is the matrix with the singular values of A along its main diagonal, and V spans the row space of A. U and V are also orthogonal.
SVD continued
• Unlike the QR factorization, the SVD provides us with a lower-rank representation of both the column space and the row space
• We know A_k is the best rank-k approximation to A by the Eckart-Young theorem, which states:
||A − A_k|| = min over rank(X) ≤ k of ||A − X||
• Thus the rank-k approximation of A is given as follows:
A_k = U_k Σ_k V_k^T
• where U_k = the first k columns of U,
Σ_k = the k × k diagonal matrix whose diagonal holds the decreasing singular values σ_1, ..., σ_k, and
V_k^T = the k × d matrix whose rows are the first k rows of V^T
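A short sketch of the truncated SVD and the rank-k approximation, continuing the NumPy example (the helper name rank_k_approx is ours):

# Thin SVD: A = U diag(s) Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 4))   # approx. [1.6950 1.1158 0.8403 0.4195 0.]

def rank_k_approx(U, s, Vt, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young)."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A3 = rank_k_approx(U, s, Vt, 3)   # rank-3 approximation of A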
SVD Factorization
For the example matrix A the singular values along the diagonal of Σ are

Σ = diag(1.6950, 1.1158, 0.8403, 0.4195, 0)

with the corresponding left singular vectors in the columns of U and the right singular vectors in the columns of V.
Interpretation
From the matrix Σ on the previous slide we notice that A has only four non-zero singular values, so A has rank four
These four non-zero singular values also tell us that the first four columns of U give us a basis for the column space of A
Analysis of the Rank-k Approximations
Using the following formula we can calculate the relative error between the original matrix and its rank-k approximation:
||A − A_k||_F / ||A||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2) / sqrt(σ_1^2 + ... + σ_r^2)
Thus only a 19% relative error is incurred in moving from the rank-4 matrix to a rank-3 approximation, while a 42% relative error is incurred in moving from the rank-4 to a rank-2 approximation
• As expected, these errors are smaller than those of the corresponding rank-k approximations from the QR factorization
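These relative errors can be reproduced from the singular values, since the Frobenius norm of A is the square root of the sum of all squared singular values (a sketch continuing the NumPy example):

def relative_error(s, k):
    """||A - A_k||_F / ||A||_F = sqrt(sum of s[k:]^2) / sqrt(sum of s^2)."""
    return np.sqrt(np.sum(s[k:] ** 2)) / np.sqrt(np.sum(s ** 2))

print(round(relative_error(s, 3), 2))   # ~0.19 for the rank-3 approximation
print(round(relative_error(s, 2), 2))   # ~0.42 for the rank-2 approximation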
Using the SVD for Query Matching
• Using the following formula we can calculate the cosine of the angles between the query and the columns of our rank-k approximation of A:
cos(θ_j) = [s_j^T (U_k^T q)] / (||s_j||_2 ||U_k^T q||_2), where s_j is the j-th column of Σ_k V_k^T
• Using the rank-3 approximation and the cutoff of .5 we again return the first and fourth books
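A sketch of query matching in the reduced space, continuing the NumPy example (s_j here is the j-th column of Σ_k V_k^T, as in the formula above; the helper name is ours):

def svd_query_cosines(U, s, Vt, q, k):
    """Cosines between the query q and the columns of the rank-k approximation."""
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    S = Sk @ Vtk                  # column j is s_j = Sigma_k V_k^T e_j
    qk = Uk.T @ q                 # query projected into the k-dimensional space
    return (S.T @ qk) / (np.linalg.norm(S, axis=0) * np.linalg.norm(qk))

print(np.round(svd_query_cosines(U, s, Vt, q, k=3), 4))
# D1 and D4 again exceed the 0.5 cutoff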
Term-Term Comparison
It is possible to modify the vector space model for
comparing queries with documents in order to
compare terms with terms.
When this is added to a search engine it can act as a tool to refine the results
First we run our search as before and retrieve a certain number of documents; in the following example five documents are retrieved
We then create another term-document matrix from the retrieved documents, call it G
Another Example
Terms:
T1: Run(ning)
T2: Bike
T3: Endurance
T4: Training
T5: Band
T6: Music
T7: Fishes
Documents:
D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running, Training and Competition
D4: Music Law: How to Run Your Band's Business
D5: Running: Learning, Training, Competing

The normalized matrix for the retrieved set is:

G =
[ .5000  .7071  .7071  .5774  .7071
  .5000    0      0      0      0
  .5000    0      0      0      0
  .5000    0    .7071    0    .7071
    0      0      0    .5774    0
    0      0      0    .5774    0
    0    .7071    0      0      0 ]
Analysis of the Term-Term Comparison
For this we use the following formula:
cos(θ_ij) = [(e_i^T G)(G^T e_j)] / (||G^T e_i||_2 ||G^T e_j||_2)
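A sketch of the term-term comparison with NumPy, using the matrix G from the running example (all term-pair cosines are computed at once):

# Normalized term-document matrix G (rows = terms T1..T7, columns = D1..D5).
G = np.array([
    [.5000, .7071, .7071, .5774, .7071],   # T1: run(ning)
    [.5000, 0,     0,     0,     0    ],   # T2: bike
    [.5000, 0,     0,     0,     0    ],   # T3: endurance
    [.5000, 0,     .7071, 0,     .7071],   # T4: training
    [0,     0,     0,     .5774, 0    ],   # T5: band
    [0,     0,     0,     .5774, 0    ],   # T6: music
    [0,     .7071, 0,     0,     0    ],   # T7: fishes
])

# cos(theta_ij): cosine between term rows i and j of G.
row_norms = np.linalg.norm(G, axis=1)
term_term = (G @ G.T) / np.outer(row_norms, row_norms)
print(np.round(term_term, 2))   # large off-diagonal entries mark related terms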
Clustering
• Clustering is the process by which related terms are grouped together, such as bike, endurance, and training
• First the terms are split into groups of related terms
• Within each group the term vectors are nearly parallel
Clusters
In this example the first cluster is running
The second cluster is bike, endurance and
training
The third is band and music
And the fourth is fishes
Analyzing the Term-Term Comparison
We will again use the SVD rank-k approximation
Thus the cosine of the angles becomes:
cos(θ_ij) = [(e_i^T U_k Σ_k)(Σ_k U_k^T e_j)] / (||Σ_k U_k^T e_i||_2 ||Σ_k U_k^T e_j||_2)
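The rank-k version of the same comparison, sketched under the assumption that U_k and Σ_k come from the SVD of G (the helper name is ours):

def term_term_cosines_rank_k(G, k):
    """Term-term cosines from the rank-k SVD of G:
    the columns of Sigma_k U_k^T play the role of the term vectors."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    T = np.diag(s[:k]) @ U[:, :k].T        # column i is Sigma_k U_k^T e_i
    norms = np.linalg.norm(T, axis=0)
    return (T.T @ T) / np.outer(norms, norms)

print(np.round(term_term_cosines_rank_k(G, k=4), 2))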
Conclusion
Through the use of this model many libraries and smaller collections can index their documents
However, as the next presentation will show, a different approach is used for large collections such as the Internet