
CS 430 / INFO 430
Information Retrieval
Lecture 9
Latent Semantic Indexing
Course Administration
Latent Semantic Indexing
Objective
Replace indexes that use sets of index terms by indexes
that use concepts.
Approach
Map the term vector space into a lower dimensional
space, using singular value decomposition.
Each dimension in the new space corresponds to a latent
concept in the original data.
Deficiencies with Conventional
Automatic Indexing
Synonymy: Various words and phrases refer to the same concept
(lowers recall).
Polysemy: Individual words have more than one meaning
(lowers precision)
Independence: No significance is given to two terms that
frequently appear together
Latent semantic indexing addresses the first of these (synonymy),
and the third (dependence)
Example
Query: "IDF in computer-based information look-up"
Index terms for a document: access, document, retrieval,
indexing
How can we recognize that information look-up is related to
retrieval and indexing?
Conversely, if information has many different contexts in the
set of documents, how can we discover that it is an
unhelpful term for retrieval?
Technical Memo Example: Titles
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Technical Memo Example: Terms and Documents
                       Documents
Terms       c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
Technical Memo Example: Query
Query:
Find documents relevant to "human computer interaction"
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
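The simple term matching result can be reproduced with a short sketch. The tokenizer (lowercase, split on whitespace) and the function name term_match are assumptions made for illustration; they are not part of the lecture.

```python
# Titles of the nine technical memos (c1-c5, m1-m4) from the example.
titles = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

def term_match(query, docs):
    """Return ids of documents that share at least one word with the query."""
    query_words = set(query.lower().split())
    return sorted(doc_id for doc_id, title in docs.items()
                  if query_words & set(title.lower().split()))

matches = term_match("human computer interaction", titles)
print(matches)
```

Memos c3 and c5 are on the same topic but share no word with the query, which is the gap that latent semantic indexing aims to close.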
Models of Semantic Similarity
Proximity models: Put similar items together in some space or
structure
• Clustering (hierarchical, partition, overlapping). Documents
are considered close to the extent that they contain the same
terms. Most then arrange the documents into a hierarchy based
on distances between documents. [Covered later in course.]
• Factor analysis based on matrix of similarities between
documents (single mode).
• Two-mode proximity methods. Start with rectangular matrix
and construct explicit representations of both row and column
objects.
Selection of Two-mode Factor Analysis
Additional criterion:
Computationally efficient: $O(N^2 k^3)$
N is the number of terms plus documents
k is the number of dimensions
The term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: documents d1 and d2 drawn as vectors against term axes t1, t2, and t3.]
Figure 1: Latent concept vector space, showing terms, documents, and the query; the dashed line marks cosine > 0.9.
Mathematical concepts
Define X as the term-document matrix, with t rows (number of
index terms) and d columns (number of documents).
Singular Value Decomposition
For any matrix X, with t rows and d columns, there exist matrices
T0, S0 and D0', such that:
X = T0S0D0'
T0 and D0 are the matrices of left and right singular vectors
T0 and D0 have orthonormal columns
S0 is the diagonal matrix of singular values
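The decomposition can be checked numerically on the technical memo matrix. This is a minimal sketch that assumes NumPy is available; the variable names (T0, s0, D0t) are chosen here to mirror the slide notation, and the compact row strings are just a way to type in the term-document table.

```python
import numpy as np

# Term-document matrix from the technical memo example:
# 12 terms (rows) by 9 documents (columns c1..c5, m1..m4).
rows = ["100100000", "101000000", "110000000", "011010000", "011200000",
        "010010000", "010010000", "001100000", "010000001", "000001110",
        "000000111", "000000011"]
X = np.array([[int(ch) for ch in row] for row in rows], dtype=float)

# Thin SVD: T0 holds left singular vectors as columns, s0 the singular
# values in decreasing order, and D0t is D0' (right singular vectors as rows).
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

# X = T0 S0 D0', and both T0 and D0 have orthonormal columns.
reconstructs = np.allclose(T0 @ S0 @ D0t, X)
t0_orthonormal = np.allclose(T0.T @ T0, np.eye(T0.shape[1]))
d0_orthonormal = np.allclose(D0t @ D0t.T, np.eye(D0t.shape[0]))
print(reconstructs, t0_orthonormal, d0_orthonormal)
```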
Dimensions of matrices
$X = T_0 S_0 D_0'$
X is t x d, T0 is t x m, S0 is m x m, D0' is m x d
m is the rank of X, so m ≤ min(t, d)
Reduced Rank
S0 can be chosen so that the diagonal elements are positive and
decreasing in magnitude. Keep the first k and set the others to
zero.
Delete the zero rows and columns of S0 and the corresponding
columns of T0 and D0. This gives:
$X \approx \hat{X} = T S D'$
Interpretation
If the value of k is selected well, the expectation is that $\hat{X}$ retains the
semantic information from X, but eliminates noise from synonymy
and recognizes dependence.
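The rank reduction step can be sketched as follows, assuming NumPy. The helper name reduce_rank and the choice k = 2 are illustrative assumptions, not values prescribed by the lecture.

```python
import numpy as np

# Term-document matrix from the technical memo example (12 terms x 9 documents).
rows = ["100100000", "101000000", "110000000", "011010000", "011200000",
        "010010000", "010010000", "001100000", "010000001", "000001110",
        "000000111", "000000011"]
X = np.array([[int(ch) for ch in row] for row in rows], dtype=float)

def reduce_rank(X, k):
    """Keep the k largest singular values and the matching singular vectors."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    T = T0[:, :k]          # t x k: first k columns of T0
    S = np.diag(s0[:k])    # k x k: first k singular values
    D = D0t[:k, :].T       # d x k: first k columns of D0
    return T, S, D

T, S, D = reduce_rank(X, k=2)
X_hat = T @ S @ D.T        # t x d matrix of rank k that approximates X
print(X_hat.shape, np.linalg.matrix_rank(X_hat))
```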
Selection of singular values
$\hat{X} = T S D'$
$\hat{X}$ is t x d, T is t x k, S is k x k, D' is k x d
k is the number of singular values chosen to
represent the concepts in the set of documents.
Usually, k ≪ m.
Comparing a Term and a Document
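An individual cell of $\hat{X}$ approximates the number of occurrences of
term i in document j.
$\hat{X} = T S D' = T S^{1/2} (D S^{1/2})'$
where $S^{1/2}$ is a diagonal matrix whose values are the square
roots of the corresponding elements of S. Cell i, j is therefore the dot
product of row i of $T S^{1/2}$ and row j of $D S^{1/2}$.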
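A small numerical check of the term-document comparison, assuming NumPy. The 4 x 3 matrix and k = 2 are hypothetical values chosen only to keep the example short.

```python
import numpy as np

# Hypothetical small term-document matrix (4 terms x 3 documents).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
k = 2

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Square root of the diagonal matrix S (off-diagonal zeros stay zero).
S_half = np.sqrt(S)
term_coords = T @ S_half   # one row per term
doc_coords = D @ S_half    # one row per document

# Cell (i, j) of X_hat is the dot product of term row i and document row j.
i, j = 0, 2
cell = term_coords[i] @ doc_coords[j]
print(np.isclose(cell, X_hat[i, j]))
```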
Calculating Similarities in the Concept Space
Objective:
Calculate similarities between terms, documents, and
queries, using the matrices T, S, and D.
Mathematical Revision
A is a p x q matrix
B is an r x q matrix
$a_i$ is the vector represented by row i of A
$b_j$ is the vector represented by row j of B
The inner product $a_i \cdot b_j$ is element i, j of AB'
[Figure: A (p x q) multiplied by B' (q x r); the ith row of A and the jth row of B combine to give element i, j of the product.]
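The identity is easy to confirm on small matrices. This sketch assumes NumPy; the entries of A and B are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # p = 2, q = 3
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [2.0, 2.0, 2.0],
              [1.0, 1.0, 1.0]])          # r = 4, q = 3

P = A @ B.T                              # p x r matrix AB'

# Element (i, j) of AB' is the inner product of row i of A and row j of B.
i, j = 1, 2
inner = A[i] @ B[j]
print(P.shape, P[i, j] == inner)
```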
Comparing Two Terms
The dot product of two rows of $\hat{X}$ reflects the
extent to which two terms have a similar pattern
of occurrences.
$\hat{X}\hat{X}' = T S D' (T S D')'$
$= T S D' D S' T'$
$= T S S' T'$    (since D is orthonormal, D'D = I)
$= T S (T S)'$
To calculate the i, j cell, take the dot product between
rows i and j of TS.
Since S is diagonal, TS differs from T only by
stretching the coordinate system.
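The term-term comparison can be tried on the technical memo matrix. This is a sketch assuming NumPy and k = 2 (an illustrative choice); the three terms picked for comparison are also just examples.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
rows = ["100100000", "101000000", "110000000", "011010000", "011200000",
        "010010000", "010010000", "001100000", "010000001", "000001110",
        "000000111", "000000011"]
X = np.array([[int(ch) for ch in row] for row in rows], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Rows of TS are the term coordinates; (TS)(TS)' equals X_hat X_hat'.
TS = T @ S
term_sims = TS @ TS.T

h, u, m = terms.index("human"), terms.index("user"), terms.index("minors")
raw_human_user = X[h] @ X[u]         # the two terms never share a document
lsi_human_user = term_sims[h, u]     # dot product in the concept space
lsi_human_minors = term_sims[h, m]
print(raw_human_user, lsi_human_user > lsi_human_minors)
```

In the original matrix "human" and "user" never appear in the same document, yet their rows of TS have a positive dot product because both terms co-occur with the same third terms.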
Comparing Two Documents
The dot product of two columns of $\hat{X}$ reflects the
extent to which two documents have a similar
pattern of occurrences.
$\hat{X}'\hat{X} = (T S D')' T S D'$
$= D S (D S)'$
To calculate the i, j cell, take the dot product
between rows i and j of DS.
Since S is diagonal, DS differs from D only by
stretching the coordinate system.
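The same check for documents, again as a sketch assuming NumPy and k = 2. The cosine helper and the choice of documents to compare are illustrative assumptions.

```python
import numpy as np

docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
rows = ["100100000", "101000000", "110000000", "011010000", "011200000",
        "010010000", "010010000", "001100000", "010000001", "000001110",
        "000000111", "000000011"]
X = np.array([[int(ch) for ch in row] for row in rows], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Rows of DS are the document coordinates; (DS)(DS)' equals X_hat' X_hat.
DS = D @ S
doc_sims = DS @ DS.T

def cosine(i, j):
    """Cosine between documents i and j in the concept space."""
    return doc_sims[i, j] / np.sqrt(doc_sims[i, i] * doc_sims[j, j])

c1, c5, m1 = docs.index("c1"), docs.index("c5"), docs.index("m1")
raw_c1_c5 = X[:, c1] @ X[:, c5]      # c1 and c5 share no index term
print(raw_c1_c5, cosine(c1, c5) > cosine(c1, m1))
```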
Comparing a Query and a Document
A query can be expressed as a vector $x_q$ in the term-document vector space.
$x_{qi} = 1$ if term i is in the query and 0 otherwise.
(Ignore query terms that are not in the term vector
space.)
Let $p_{qj}$ be the inner product of the query $x_q$ with
document $d_j$ in the term-document vector space.
$p_{qj}$ is the jth element in the product $x_q' \hat{X}$.
Comparing a Query and a Document
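$[p_{q1} \ldots p_{qj} \ldots p_{qd}] = [x_{q1} \; x_{q2} \ldots x_{qt}] \, \hat{X}$
The left-hand row vector holds the inner product of query q with each
document $d_j$; the right-hand row vector is the query; document $d_j$
is column j of $\hat{X}$.
$p_q' = x_q' \hat{X}$
$= x_q' T S D'$
$= x_q' T (D S)'$
$\text{similarity}(q, d_j) = \dfrac{p_{qj}}{|x_q| \, |d_j|}$
The cosine of the angle is the inner product divided by the lengths
of the vectors.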
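The query-document similarity can be sketched on the technical memo example with the earlier query "human computer interaction". This assumes NumPy and k = 2 (an illustrative choice); "interaction" is ignored because it is not in the term vector space.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
rows = ["100100000", "101000000", "110000000", "011010000", "011200000",
        "010010000", "010010000", "001100000", "010000001", "000001110",
        "000000111", "000000011"]
X = np.array([[int(ch) for ch in row] for row in rows], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Query vector x_q: 1 for each query term that is in the term list.
query_terms = {"human", "computer"}
x_q = np.array([1.0 if t in query_terms else 0.0 for t in terms])

# p_q' = x_q' X_hat = x_q' T (DS)'
p_q = x_q @ T @ (D @ S).T

# Cosine: inner product divided by |x_q| and the length of each column of X_hat.
doc_lengths = np.linalg.norm(X_hat, axis=0)
similarity = p_q / (np.linalg.norm(x_q) * doc_lengths)

for name, score in zip(docs, similarity):
    print(name, round(float(score), 3))
```

Under these assumptions all five c documents, including c3 and c5 that simple term matching missed, score above every m document.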
Comparing a Query and a Document
In the reading, the authors treat the query as a pseudo-document $d_q$ in the concept space:
$d_q = x_q' T S^{-1}$
[Note that $S^{-1}$ stretches the vector]
To compare a query against document j, they extend the
method used to compare document i with document j.
Take the jth element of the product of:
$d_q S$ and $(D S)'$
This is the jth element of the product:
$x_q' T (D S)'$, which is the same expression as before.
Note that with their notation $d_q$ is a row vector.
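The equivalence of the two formulations can be verified numerically. This is a sketch assuming NumPy, k = 2, and an illustrative query containing the terms "human" and "computer".

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
rows = ["100100000", "101000000", "110000000", "011010000", "011200000",
        "010010000", "010010000", "001100000", "010000001", "000001110",
        "000000111", "000000011"]
X = np.array([[int(ch) for ch in row] for row in rows], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

x_q = np.array([1.0 if t in {"human", "computer"} else 0.0 for t in terms])

# Pseudo-document in the concept space: d_q = x_q' T S^{-1} (k elements).
d_q = x_q @ T @ np.linalg.inv(S)

# Compare d_q with every document the way two documents are compared:
# (d_q S) against the rows of DS.
via_pseudo_doc = (d_q @ S) @ (D @ S).T

# Direct expression from the previous slide: x_q' T (DS)'.
direct = x_q @ T @ (D @ S).T
print(np.allclose(via_pseudo_doc, direct))
```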
Technical Memo Example: Query
Terms       xq
human        1
interface    0
computer     0
user         0
system       1
response     0
time         0
EPS          0
survey       0
trees        1
graph        0
minors       0
Query:
"human system interactions on trees"
In term-document space, a query is
represented by xq, a column vector with
t elements.
In concept space, a query is
represented by dq, a row vector with k
elements.
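Building the query vector from the query text can be sketched as follows. The function name query_vector and the whitespace tokenizer are assumptions for illustration; words outside the term list ("interactions", "on") are simply ignored.

```python
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

def query_vector(query, terms):
    """Return x_q: 1 where the term occurs in the query, 0 otherwise."""
    words = set(query.lower().split())
    return [1 if term.lower() in words else 0 for term in terms]

x_q = query_vector("human system interactions on trees", terms)
print(x_q)
```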
Experimental Results
Deerwester et al. tried latent semantic indexing on two test
collections, MED and CISI, where queries and relevance
judgments were available.
Documents were full text of title and abstract.
Stop list of 439 words (SMART); no stemming, etc.
Comparison with:
(a) simple term matching, (b) SMART, (c) Voorhees method.
Experimental Results: 100 Factors
Experimental Results: Number of Factors