Supporting the Emergence of Ideas in Spatial Hypertext
IR Models:
Latent Semantic Analysis
IR Model Taxonomy
[Taxonomy diagram, organized by user task]
– User task: retrieval (ad hoc, filtering) and browsing
– Classic models: Boolean, vector, probabilistic
– Set theoretic: fuzzy, extended Boolean
– Algebraic: generalized vector, latent semantic indexing, neural networks
– Probabilistic: inference network, belief network
– Structured models: non-overlapping lists, proximal nodes
– Browsing: flat, structure guided, hypertext
Vocabulary Problem
The “vocabulary problem” can cause classic IR models to retrieve poorly:
– Polysemy - the same term means many things, so unrelated documents might be included in the answer set
• Leads to poor precision
– Synonymy - different terms mean the same thing, so relevant documents that do not contain any of the query's index terms are not retrieved (illustrated in the sketch below)
• Leads to poor recall
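The synonymy case is easy to see with a toy term-vector comparison. A minimal sketch (the two documents and the vocabulary are made up for illustration):

    # Synonymy and the vocabulary problem: two documents about the same topic
    # share no index terms, so their term-vector similarity is zero and a
    # query matching one will never retrieve the other.
    import numpy as np

    vocabulary = ["car", "repair", "automobile", "maintenance"]
    doc_a = np.array([1, 1, 0, 0])   # "car repair"
    doc_b = np.array([0, 0, 1, 1])   # "automobile maintenance"

    cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
    print(cosine)  # 0.0 -- term matching misses the relevant document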
Latent Semantic Indexing
Retrieval based on index terms is vague and noisy
The user information need is more related to concepts and ideas than to index terms
A document that shares concepts with another document known to be relevant might be of interest
Latent Semantic Indexing
The key idea
– Map documents and queries into a lower dimensional space
– The lower dimensional space represents higher level concepts, which are fewer in number than the index terms
Retrieval in this reduced concept space might be superior to retrieval in the space of index terms
Latent Semantic Indexing
Definitions
– Let t be the total number of index terms
– Let N be the number of documents
– Let (Mij) be a term-document matrix with t rows and N columns
– Each element of this matrix is assigned a weight wij associated with the pair [ki,dj]
– The weight wij can be based on a tf-idf weighting scheme
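As a concrete illustration, here is a minimal sketch of building (Mij) with tf-idf weights in Python; the three toy documents are hypothetical and the weighting is the plain tf × idf variant:

    # Build the t x N term-document matrix (Mij): rows are index terms ki,
    # columns are documents dj, and wij = tf_ij * idf_i.
    import numpy as np

    docs = ["gold silver truck", "shipment of gold", "delivery of silver truck"]
    terms = sorted({w for d in docs for w in d.split()})       # index terms k1..kt

    tf = np.array([[d.split().count(k) for d in docs] for k in terms], dtype=float)
    N = len(docs)
    df = (tf > 0).sum(axis=1)        # number of documents containing each term
    idf = np.log(N / df)             # inverse document frequency
    M = tf * idf[:, None]            # the weighted matrix (Mij), shape (t, N)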
Singular Value Decomposition
The matrix (Mij) can be decomposed into 3 matrices:
– (Mij) = (K) (S) (D)t
– (K) is the matrix of eigenvectors derived from (M)(M)t
– (D)t is the matrix of eigenvectors derived from (M)t(M)
– (S) is an r x r diagonal matrix of singular values, where
• r = min(t,N), that is, the rank of (Mij)
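A minimal sketch of this decomposition with NumPy (the toy matrix is random and stands in for a real term-document matrix):

    # (M) = (K)(S)(D)t: K and Dt hold the eigenvector matrices described
    # above, and s_vals holds the r = min(t, N) singular values.
    import numpy as np

    M = np.random.rand(6, 3)                   # toy t x N term-document matrix
    K, s_vals, Dt = np.linalg.svd(M, full_matrices=False)
    S = np.diag(s_vals)                        # r x r diagonal matrix of singular values
    assert np.allclose(M, K @ S @ Dt)          # the product recovers M exactly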
Latent Semantic Indexing
In the matrix (S), select only the s largest singular values
Keep the corresponding columns of (K) and rows of (D)t
The resultant matrix is called (M)s and is given by
– (M)s = (K)s (S)s (D)s^t
– where s, s < r, is the dimensionality of the concept space
The parameter s should be
– large enough to allow fitting the characteristics of the data
– small enough to filter out the non-relevant details
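Continuing the NumPy sketch above, truncation is just slicing, since np.linalg.svd returns the singular values in descending order (s = 2 is an arbitrary choice here):

    # Keep only the s largest singular values and the corresponding columns
    # of K / rows of Dt to obtain the rank-s approximation (M)s.
    import numpy as np

    M = np.random.rand(6, 3)
    K, s_vals, Dt = np.linalg.svd(M, full_matrices=False)

    s = 2                                               # concept-space dimensionality, s < r
    M_s = K[:, :s] @ np.diag(s_vals[:s]) @ Dt[:s, :]    # (M)s = (K)s (S)s (D)s^t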
LSI Ranking
The user query can be modelled as a pseudo-document in the original (M) matrix
Assume the query is modelled as the document numbered 0 in the (M) matrix
The matrix
(M)s^t (M)s
quantifies the relationship between any two documents in the reduced concept space
The first row of this matrix provides the rank of all the documents with regard to the user query (represented as the document numbered 0)
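A minimal end-to-end sketch of this ranking, again using a random stand-in matrix whose column 0 plays the role of the query pseudo-document:

    # LSI ranking: fold the query in as document 0, truncate the SVD, and read
    # the query-vs-document scores from the first row of (M)s^t (M)s.
    import numpy as np

    M = np.random.rand(6, 4)                   # column 0 is the query pseudo-document
    K, s_vals, Dt = np.linalg.svd(M, full_matrices=False)

    s = 2
    M_s = K[:, :s] @ np.diag(s_vals[:s]) @ Dt[:s, :]
    scores = (M_s.T @ M_s)[0, 1:]              # similarity of each document to the query
    ranking = np.argsort(-scores) + 1          # document columns ordered best-first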
Latent Semantic Analysis as a Model of Human Language Learning
Psycho-linguistic model:
– LSA acts like a child who acquires word meanings not through explicit definitions but by observing how words are used.
– LSA is a pale reflection of how humans learn language, but it is a reflection.
– LSA offers an explanation of how people can agree enough to share meaning.
LSA Applications
In addition to typical query systems, LSA has been used for:
– Cross-language search
– Reviewer assignment at conferences
– Finding experts in an organization
– Identifying the reading level of documents
Concept-based IR Beyond LSA
LSA/LSI uses principal component analysis
Principal components are not necessarily good for discrimination in classification.
Linear Discriminant Analysis (LDA) identifies linear transformations
– maximizing between-class variance while
– minimizing within-class variance
LDA requires training data (sketched below)
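A minimal sketch of LDA as supervised dimensionality reduction, assuming scikit-learn is available; the two-class data here is synthetic:

    # LDA projects labelled data onto directions that maximize between-class
    # variance and minimize within-class variance; class labels are required.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(2, 1, (20, 5))])  # two classes
    y = np.array([0] * 20 + [1] * 20)                                      # class labels

    lda = LinearDiscriminantAnalysis(n_components=1)   # at most (n_classes - 1) components
    X_proj = lda.fit_transform(X, y)                   # discriminating 1-D projection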
Linear Discriminant Analysis
Projecting a 2D space to 1 PC
[Figure: points from two classes, A and B, in a 2D space projected onto a single principal-component direction w (PCA). From slides by Shaoqun Wu.]
Linear Discriminant Analysis
LDA: discovers a discriminating projection
[Figure: the same two-class data (A and B) projected onto the direction w found by LDA, shown alongside the PCA projection above.]
LDA results
LDA reduces the number of dimensions (concepts) required for classification tasks
Conclusions
Latent semantic indexing provides an intermediate, concept-level representation to aid IR, mitigating the vocabulary problem.
It generates a representation of the document collection which might be explored by the user.
Alternative methods for identifying clusters (e.g. LDA) may improve results.