Transcript Slide 1

Latent Semantic
Analysis:
A Model of Inductive
Knowledge Acquisition
Paul Fillmore & Stefanie Wong
Overview

The question of interest
  The Problem
  The Proposed Solution: LSA
Latent Semantic Analysis
  What is it?
  What can it do?
  How does it do it?
Evaluation of the model
Additional Considerations
Demonstrations of LSA
The Problem of Induction

Plato's problem: the poverty of the stimulus
  How do people acquire as much knowledge as they do based on the little information they get?
Example: Language Acquisition
  Chomsky (1991) – Observing adult language is insufficient for children's development of grammar or a typical lexicon
  Pinker (1994) – Language learning must be innate – a "language instinct"
The problem of induction in cognitive terms...

Problem of categorization
  What is the mechanism by which concepts (cheetahs, tigers) come to be treated as the same for some purpose (predators that will eat me)?
Problem of similarity
  How does experience combine disparate things into a single feature identity ("wing" is different for a bird, an insect, and a bat)?
Latent Semantic Analysis: What is it?

"Latent Semantic Analysis (LSA) is a mathematical/statistical technique for extracting and representing the similarity of meaning of words and passages by analysis of large bodies of text."
More simply, it is a computer model of human associative learning through experience
It does not embody human knowledge beyond its general learning mechanism
What can LSA do?

Performance on standard vocabulary and subject-matter tests comparable to humans
Demonstrates a similar mechanism for word sorting and category judgments
Simulates word-word and passage-word lexical priming data
It can accurately estimate:
  Passage coherence
  Learnability of passages by individual students
  The quality and quantity of knowledge contained in essays
Can perform humanlike generalizations based on learning that isn't dependent upon primitive perceptual relations/representations
How does LSA work?

Definitions
  Semantic space
  Singular value decomposition (SVD)
  Dimensionality
Procedure
  1) Matrix Input
  2) Cell Transformation
  3) Singular Value Decomposition
  4) Dimension Reduction
Semantic Space



A semantic space is a mathematical representation
of a large body of text (e.g. Encyclopedias,
Psychology Texts)
Each term or combination of terms has its own highdimensional vector representation within the
semantic space
Similarity between vectors for words and context is
measured by cosine of their combined angle


Note: Terms can only be compared within a semantic space, not
directly between semantic spaces
If vectors were projected onto a sphere surrounding the semantic
space, points close together would have closer semantic relations
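The cosine measure above is straightforward to compute; the following is a minimal sketch (not from the slides), using NumPy and made-up three-dimensional vectors standing in for the hundreds of dimensions of a real LSA space.

import numpy as np

def cosine(u, v):
    # cos(angle) = (u . v) / (|u| |v|); ranges from -1 to 1, higher = more similar
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional term vectors (real LSA vectors have ~100-500 dimensions)
cheetah = np.array([0.8, 0.1, 0.3])
tiger   = np.array([0.7, 0.2, 0.4])
sofa    = np.array([0.1, 0.9, 0.0])

print(cosine(cheetah, tiger))  # relatively high: the terms lie close together in the space
print(cosine(cheetah, sofa))   # lower: the terms lie far apart in the space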
Example of similarities within Semantic Space

Submitting a term or short text and receiving a list of the terms that are nearest to it in semantic space
Matrix comparison of multiple terms
Singular Value Decomposition

A mathematical matrix decomposition technique (a general case of factor analysis) that condenses a large matrix of word-by-context data into a smaller matrix
The smaller matrix typically has a 100-500 dimensional representation
The right number of dimensions is critical for optimal simulation
Dimensionality

Knowing the appropriate dimensionality improves estimates
Example:
  Three separate houses, A, B, and C, are arranged as follows: A is 5 units from both B and C, and B and C are separated by 8 units
  Oh, also, they are all on the same straight, flat road
  On a straight road those three measured distances cannot all be right, so knowing the true dimensionality (one dimension, not two) lets the estimates be corrected
[Diagram: houses A, B, C plotted in two dimensions versus arranged along the one-dimensional road]
Procedure: Matrix Input

Rows = individual word types
Columns = meaning-bearing passages (i.e., sentences or paragraphs)
Cells = frequency with which a word occurs in a passage
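As a concrete illustration of the matrix-input step (a sketch, not the authors' code), the counts can be assembled from tokenized passages like this; the toy passages are hypothetical:

import numpy as np
from collections import Counter

# Toy passages standing in for meaning-bearing contexts
passages = [
    "the cheetah chased the gazelle",
    "the tiger is a large predator",
    "the gazelle fled from the predator",
]

tokens = [p.split() for p in passages]
vocab = sorted(set(w for toks in tokens for w in toks))   # rows = word types
X = np.zeros((len(vocab), len(passages)))                 # columns = passages
for j, toks in enumerate(tokens):
    counts = Counter(toks)
    for i, w in enumerate(vocab):
        X[i, j] = counts[w]                               # cell = frequency of word i in passage j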
Procedure: Cell Transformation

Transformation 1: approximates standard empirical growth functions of simple learning
  Takes the log of a word's appearance frequency (f) in the cell
Transformation 2: makes the primary association better represent the informative relation between the entities rather than mere co-occurrence
  Divides by the word's entropy, -Σ p log p, over its contexts

Raw cell value: f
After Transformation 1: log(f)
After Transformation 2: log(f) / (-Σ p log p)
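A minimal sketch of applying both transformations to the count matrix X built in the previous step; log(f + 1) is used instead of log(f) so that empty cells stay zero, which is a common implementation choice rather than something stated on the slide.

import numpy as np

def log_entropy_transform(X):
    X = np.asarray(X, dtype=float)
    # Transformation 1: log of the raw cell frequency (f + 1 keeps empty cells at zero)
    logged = np.log(X + 1.0)
    # Transformation 2: divide each row by the word's entropy, -sum(p log p),
    # where p is the proportion of the word's occurrences falling in each context
    row_totals = X.sum(axis=1, keepdims=True)
    p = np.divide(X, row_totals, out=np.zeros_like(X), where=row_totals > 0)
    safe_p = np.where(p > 0, p, 1.0)          # log(1) = 0, so empty cells contribute nothing
    entropy = -np.sum(p * np.log(safe_p), axis=1, keepdims=True)
    entropy[entropy == 0] = 1.0               # words seen in only one context: avoid division by zero
    return logged / entropy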
Procedure: SVD & Dimension Reduction

SVD:
  X[i x j] = W[i x k] S[k x k] C[j x k]'
  in which W and C have orthonormal columns, S is a diagonal matrix of singular values, and k <= min(i, j)
Dimension reduction: all but the d largest singular values are set to zero, where d = number of dimensions to be used
[Diagram: the word (w) x context (c) matrix X factored into two orthonormal matrices W and C' and a diagonal matrix S of singular values; the m columns of W and the m rows of C' are linearly independent]
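A sketch of the SVD and dimension-reduction step using NumPy (the slides don't prescribe an implementation); all but the d largest singular values are zeroed, exactly as described above.

import numpy as np

def reduce_dimensions(X, d):
    # X = W S C': W and C have orthonormal columns, s holds the singular values
    W, s, Ct = np.linalg.svd(X, full_matrices=False)
    s_reduced = np.zeros_like(s)
    s_reduced[:d] = s[:d]                 # keep only the d largest singular values
    return W @ np.diag(s_reduced) @ Ct    # rank-d approximation of X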
LSA Example

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi ordering
m4: Graph minors: A survey

r(human, user) = 0.94
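These nine passages are the worked example from Deerwester et al. (1990) that Landauer & Dumais reuse. The sketch below (an illustration, not the original code) hard-codes the standard term-by-passage counts for the content words occurring in at least two passages, keeps two dimensions, and checks the correlation between the "human" and "user" rows, which the slide reports as 0.94 even though the two words never co-occur.

import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
#             c1 c2 c3 c4 c5 m1 m2 m3 m4
X = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
              [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
              [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
              [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
              [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
              [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
              [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
              [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
              [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
              [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
              [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
              [0, 0, 0, 0, 0, 0, 0, 1, 1]],  # minors
             dtype=float)

# SVD, then keep only the two largest singular values (d = 2)
W, s, Ct = np.linalg.svd(X, full_matrices=False)
X2 = W[:, :2] @ np.diag(s[:2]) @ Ct[:2, :]

# "human" and "user" share no passage, yet their reconstructed rows correlate strongly
h, u = terms.index("human"), terms.index("user")
print(np.corrcoef(X2[h], X2[u])[0, 1])   # approximately .94, per the slide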
Evaluating the Model

Four questions to keep in mind:
1. Can a simple linear model acquire knowledge of humanlike word meaning similarities given sufficient input?
2. If successful, is it dependent upon the dimensionality of the representation?
3. Is the rate of acquisition comparable to a human's?
4. What degree of this knowledge is from indirect inferences from combinations of information across samples?
Is It Acquiring Knowledge?

The model's knowledge was tested with a standard multiple-choice synonym test
  After training on approx. 2,000 pages of English text, LSA scored as well as average test-takers on the synonym portion of the TOEFL
Acquired knowledge is attributed to indirect inference as opposed to direct co-occurrence relations
Two explanations...

1) A substantial portion of the information needed to answer common vocabulary questions could be inferred from the contextual statistics of usage alone
2) The model employs a means of induction – dimension matching – that amplifies its learning ability, resulting in correct inference of similarity relations that are only implicit in the temporal correlations of experience
Is dimensionality a factor?

Varied the number of dimensions retained
  Note: what happens when there is no dimensionality reduction at all
Choosing the optimal dimensionality approximately triples the number of words learned
Comparable rate?

Learning comparable to the rate at which school-aged children improve their performance on similar tests as a result of reading
Rate of acquisition for the late elementary and high school years estimated at 3,000-5,400 words per year (10-15 per day)
Calculating Comparable Rate: Direct & Indirect Effects

LSA simulations consider
  The average number of contexts in which the test word appeared (one manipulated parameter)
  And the total number of other contexts, those that contained no words from the synonym test items
Varied by randomly replacing test words with nonsense words and choosing random subsamples of the total text
Joint effects of direct and indirect textual experience
LSA simulation of total vocabulary gain

Came up with a model to fit the data: z = a (log_b T)(log_c S)
  r = .89
For every word, estimates were made of
  T: total number of text samples analyzed
  S: number of text samples containing the stem word
  The probability that a word of its frequency appears in the next sample
  The number of times an individual would have encountered the word previously
  The expected increase in z with the addition of a passage containing the word
  The expected increase in z with the addition of a passage that doesn't contain it
Converted z to probability correct x the corresponding frequencies
Cumulated the gains in number correct across all individual words in the language to get the total vocabulary gain from reading a single text sample
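For concreteness, here is a minimal sketch of evaluating the fitted growth model on this slide. The constants a, b, and c are arbitrary placeholders (the slide reports only the overall fit, r = .89), and the conversion of the deviate z to probability correct is assumed here to go through the standard normal CDF.

import math

def prob_correct(T, S, a=0.1, b=10.0, c=10.0):
    # z = a * (log_b T) * (log_c S); a, b, c are placeholder constants, not the fitted values
    z = a * math.log(T, b) * math.log(S, c)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # assumed: standard normal CDF

# Hypothetical illustration: more total text (T) and more samples containing the
# word (S) both push the probability of a correct synonym-test answer upward
print(prob_correct(T=1_000, S=2))
print(prob_correct(T=10_000, S=20))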
Conclusions from Vocabulary Simulations

LSA learns meaning similarities of words from text, in an amount equivalent to the test scores of moderately competent English readers
Three-fourths of LSA's knowledge is a product of indirect induction (exposure to text not containing the word)
An expression of the hypothesis that word meanings grow continuously and that correct performance is a stochastic event governed by individual differences in experience
  i.e., word meanings are constantly in flux
Other Considerations

Neurocognitive & psychological plausibility
  Neural net models
  Similarity to biological models
  Parallels with memory
Meaning – independent of word order?
Contextual disambiguation – in LSA, words have only one vector representation, thus only one meaning
Mathematical Machine

Analogy: a three-layered neural net
  Layer 1: word types
  Layer 2: conceptual representations
  Layer 3: text windows
Neural Net Analogy

The network is symmetrical – it can run in either direction
Different computations are made to assess similarity between two episodes, two event types, or an episode and an event type
Similarity to Biological Models

Interneuronal communication
  Vector multiplication between axons, dendrites, and cell bodies
  Excitation is proportional to the dot product of a neuron's output and the sensitivities of surrounding neurons
Single-cell recordings
  Population effects described as vector averages of individual direction representations
Word-versus-context difference: Analogy to Episodic & Semantic Memories

Word representations are semantic: meanings abstracted and averaged from many experiences
Context representations are episodic: unique combinations that occurred only once ever
Both words and episodes are represented by the same defining dimensions, and their relation to one another is still retained
Word-versus-context difference: Analogy to Explicit & Implicit Memories

Retrieving a context vector brings a past happening to mind – explicit memory
Retrieving a word vector instantiates an abstraction of many happenings brought together – implicit memory
Meaning: independent of word order?

Text segments are treated as "bags of words"
  LSA makes no use of word order, syntax, or grammar
  Despite assertions that "scrambled sentences would be worthless context for vocabulary instruction" (Durkin, 1983), LSA acquires 100% of its knowledge via "scrambled sentences" and still performs relatively well at deciphering meaning
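Since LSA's input is just the word-by-passage count matrix, scrambling a passage changes nothing; a quick sketch of the "bag of words" point (example sentences are hypothetical):

from collections import Counter

original  = "the user interface of the EPS system"
scrambled = "system the of interface EPS user the"

# Identical bags of words -> identical column in the LSA input matrix,
# regardless of word order, syntax, or grammar
print(Counter(original.split()) == Counter(scrambled.split()))   # True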
Expertise

The LSA account of knowledge brings a new perspective to expertise
  A simulated expert learns four times more about an item per exposure than a simulated novice
LSA suggests that great masses of knowledge contribute to superior performance through
  Direct application of stored knowledge to a problem
  Greater ability to add new knowledge to long-term memory
  The ability to infer indirect relations among bits of knowledge and to generalize from instances and experience
Contextual Disambiguation

A word's vector is a frequency-weighted average of its predicted usages
  Acceptable for words that have only one or a few closely related meanings (the majority of words)
  Balanced homographs such as "bear" result in an LSA vector that doesn't resemble any of their major meanings
While LSA's single-vector representation can't account for multiple word-meaning phenomena at this stage, it is not a fatal flaw (local context will aid in disambiguation)
Text Comprehension: An LSA Interpretation of Construction-Integration Theory

Research in which individual word senses aren't represented, but the overall meaning of phrases/sentences/paragraphs is constructed from a linear combination of their words
The vector average reflects the overall topic or meaning of a passage
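A sketch of that linear-combination idea with made-up low-dimensional word vectors (hypothetical numbers, not real LSA output): a passage is represented by the average of its words' vectors and compared to another passage by cosine.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up word vectors standing in for LSA term vectors
vectors = {
    "doctor":  np.array([0.9, 0.1, 0.2]),
    "treats":  np.array([0.7, 0.3, 0.1]),
    "patient": np.array([0.8, 0.2, 0.3]),
    "nurse":   np.array([0.8, 0.1, 0.4]),
    "helps":   np.array([0.6, 0.4, 0.2]),
    "graph":   np.array([0.1, 0.9, 0.7]),
    "minors":  np.array([0.0, 0.8, 0.8]),
}

def passage_vector(words):
    # Overall passage meaning as a linear combination (here, the average) of word vectors
    return np.mean([vectors[w] for w in words], axis=0)

p1 = passage_vector(["doctor", "treats", "patient"])
p2 = passage_vector(["nurse", "helps", "patient"])
p3 = passage_vector(["graph", "minors"])
print(cosine(p1, p2))   # high: similar overall topic
print(cosine(p1, p3))   # low: unrelated topic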
Criticisms / Further Issues

Remember: SVD is just one possible, simple case of such a model
Assumption: all necessary semantic information can be gleaned from a word's contexts (e.g., "love")
Linguistic structures (i.e., syntax) that show obvious importance for the derivation of meaning should be incorporated
Educational Applications of LSA

Performance on college exams
Scoring the content of an essay
Selecting the most appropriate text for learners with different levels of background knowledge
Assisting students to summarize material
Performance on College Exams
Essay Grading
Demonstrations: Write to Learn

Promotes writing skills and reading comprehension
Demonstrations: Intelligent Essay Assessor (IEA)

Assesses and critiques electronically submitted essays
Provides assessment and feedback
Demonstration: Summary Street

A web-based reading comprehension and writing instruction tool
Compares student summaries to each section of a text and provides feedback
Demonstration: Super Manual

A program that allows one to identify, develop, and test better ways to organize and present information, customized to individual maintainers' level of expertise
Educational Text Selection

Predicts how much readers will learn from texts based on their estimated conceptual knowledge of the topic and the information present in the text they read
Demonstration: State the Essence!

LSA provides evaluations of student summaries of text
Guides students toward content that experts have identified as most significant
  A way to measure reading comprehension
  Summary writing requires construction of mental representations that join elements of the text information with each other and with elements of prior knowledge
Summary

People appear to know significantly more than they could have learned from temporally local experiences
The proposed induction method depends on reconstruction of a system of multiple similarity relations in a high-dimensional space
Dimensionality-optimizing induction was implemented through SVD matrix decomposition
The model scored as well as the mean scores of foreign students on TOEFL synonym exams
The model learned at a rate similar to that of school children, largely through induction from data about other words
Because LSA didn't have access to word-similarity information based on spoken language, morphology, syntax, logic, or perceptual world knowledge, the authors concluded that the induction method is sufficient to account for Plato's paradox, at least in the domain of knowledge measured by synonym tests