Inductive inference in perception and cognition


A Bayesian view of language
evolution by iterated learning
Tom Griffiths (Brown University)
Mike Kalish (University of Louisiana)
Linguistic universals
• Human languages possess universal properties
– e.g. compositionality
(Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
Possible answers
• Traditional answer:
– linguistic universals reflect innate constraints
specific to a system for acquiring language
(e.g., Chomsky, 1965)
• Alternative answer:
– linguistic universals emerge as the result of the fact
that language is learned anew by each generation
(using general-purpose learning mechanisms)
(e.g., Briscoe, 1998; Kirby, 2001)
The iterated learning model
(Kirby, 2001)
• Each learner sees data, forms a hypothesis,
produces the data given to the next learner
• cf. the playground game “telephone”
The “information bottleneck”
(Kirby, 2001)
[Figure: languages passing through the transmission bottleneck; symbol size indicates compressibility: “survival of the most compressible”]
Analyzing iterated learning
What are the consequences of iterated learning?

[Figure: prior analyses arranged along two axes, from simple to complex learning algorithms and from simulations to analytic results: Kirby (2001) and Smith, Kirby, & Brighton (2003) (simulations); Brighton (2002); Komarova, Niyogi, & Nowak (2002) (analytic results, simple algorithms); a “?” marks analytic results for complex algorithms]
Bayesian inference
• Rational procedure for updating beliefs
• Foundation of many learning algorithms (e.g., MacKay, 2003)
• Widely used for language learning (e.g., Charniak, 1993)

[Portrait: Reverend Thomas Bayes]
Bayes’ theorem
$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$$

h: hypothesis; d: data. The numerator is the likelihood times the prior probability, the denominator sums over the space of hypotheses H, and the left-hand side is the posterior probability.
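As a concrete illustration of the update (not from the talk), here is a minimal sketch in Python; the two hypotheses and their likelihood values are placeholders invented for the example.

```python
# A minimal sketch of Bayes' theorem over a discrete hypothesis space.
# The hypotheses and likelihood values here are invented for illustration.

def posterior(prior, likelihood, d):
    """p(h|d) = p(d|h) p(h) / sum over h' of p(d|h') p(h')."""
    unnormalized = {h: likelihood(d, h) * p for h, p in prior.items()}
    z = sum(unnormalized.values())  # the sum over the space of hypotheses
    return {h: w / z for h, w in unnormalized.items()}

# Example: two hypothetical languages and one observed utterance.
prior = {"compositional": 0.5, "holistic": 0.5}
likelihood = lambda d, h: {"compositional": 0.8, "holistic": 0.2}[h]
print(posterior(prior, likelihood, d="utterance"))
# {'compositional': 0.8, 'holistic': 0.2}
```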
Iterated Bayesian learning

Learners are Bayesian agents: each infers a hypothesis from the data it sees via p(h|d), then produces the data for the next learner via p(d|h).

d0 --p(h|d)--> h1 --p(d|h)--> d1 --p(h|d)--> h2 --p(d|h)--> d2 --p(h|d)--> h3 --> ...
Markov chains

x0 --> x1 --> x2 --> x3 --> ... with transition matrix T, $T_{ij} = p(x_n = i \mid x_{n-1} = j)$

• Variable $x_n$ is independent of the history given $x_{n-1}$
• Converges to a stationary distribution under easily checked conditions (ergodicity)
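To make the convergence claim concrete, a small sketch (with a made-up 3-state transition matrix) can iterate T until the state distribution stabilizes. Note the convention above stores $p(x_n = i \mid x_{n-1} = j)$ in $T_{ij}$, so columns sum to one and each step is a matrix-vector product.

```python
import numpy as np

# Hypothetical 3-state chain, using the convention T[i, j] = p(x_n = i | x_{n-1} = j),
# so each *column* of T sums to one.
T = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

p = np.array([1.0, 0.0, 0.0])  # start deterministically in state 0
for _ in range(1000):          # p_n = T p_{n-1}
    p = T @ p
print(p)  # approximate stationary distribution, independent of the start state
```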
Analyzing iterated learning

d0 --p(h|d)--> h1 --p(d|h)--> d1 --p(h|d)--> h2 --p(d|h)--> d2 --p(h|d)--> h3 --> ...
A Markov chain on hypotheses

h1 --> h2 --> h3 --> ... with transition probability $p(h_{n+1} \mid h_n) = \sum_d p(h_{n+1} \mid d)\, p(d \mid h_n)$
A Markov chain on data

d0 --> d1 --> d2 --> ... with transition probability $p(d_{n+1} \mid d_n) = \sum_h p(d_{n+1} \mid h)\, p(h \mid d_n)$
A Markov chain on hypothesis-data pairs

(h1, d1) --> (h2, d2) --> (h3, d3) --> ... with $h_{n+1} \sim p(h \mid d_n)$ and $d_{n+1} \sim p(d \mid h_{n+1})$
Stationary distributions

• Markov chain on h converges to the prior, $p(h)$
• Markov chain on d converges to the “prior predictive distribution”: $p(d) = \sum_h p(d \mid h)\, p(h)$
• Markov chain on (h, d) is a Gibbs sampler for $p(d, h) = p(d \mid h)\, p(h)$
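A simulation sketch of the first claim, under assumptions chosen purely for brevity (two hypotheses, a single binary datum per generation, made-up probabilities): each learner samples a hypothesis from p(h|d) and produces a datum from p(d|h), and the long-run frequencies of hypotheses match the prior.

```python
import random

# Hypothetical two-hypothesis, binary-data model (all numbers illustrative).
prior = {"A": 0.7, "B": 0.3}
lik = {"A": 0.9, "B": 0.2}  # p(d = 1 | h)

def sample_posterior(d):
    """Sample h from p(h|d), proportional to p(d|h) p(h)."""
    w = {h: (lik[h] if d == 1 else 1 - lik[h]) * prior[h] for h in prior}
    return random.choices(list(w), weights=list(w.values()))[0]

d = 1
counts = {"A": 0, "B": 0}
for _ in range(100_000):                       # a long chain of learners
    h = sample_posterior(d)                    # learning: h ~ p(h | d)
    d = 1 if random.random() < lik[h] else 0   # production: d ~ p(d | h)
    counts[h] += 1
print({h: c / 100_000 for h, c in counts.items()})  # close to the prior: ~0.7 / ~0.3
```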
Implications
• The probability that the nth learner entertains the hypothesis h approaches p(h) as $n \to \infty$
• Convergence to the prior occurs regardless of:
– the amount or structure of the data transmitted
– the properties of the hypotheses themselves
• The consequences of iterated learning are
determined entirely by the biases of the learners
A simple language model

[Figure: a language maps events (pairs of “agents” and “actions”) to utterances (“nouns” and “verbs”); in a compositional language, each part of the utterance depends on one part of the event]
A simple language model

[Figure: example compositional and holistic languages, shown as mappings from events to utterances, with prior p(h): each compositional language has prior probability α/4, each holistic language (1 − α)/256]

• Data: m event-utterance pairs
• Hypotheses: languages, with error probability ε
Analysis technique
1. Compute the transition matrix on languages:
   $T_{ij} = p(h_n = i \mid h_{n-1} = j) = \sum_d p(h_n = i \mid d)\, p(d \mid h_{n-1} = j)$
2. Sample Markov chains
3. Compare language frequencies with the prior
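A sketch of step 1 for a generic discrete model (the sizes and distributions are placeholders, not the talk's exact language model): build the posterior matrix by Bayes' rule, compose it with production to get T, and check that the leading eigenvector of T recovers the prior.

```python
import numpy as np

# Placeholder model: n_h languages, n_d possible data sets of m pairs.
n_h, n_d = 6, 10
rng = np.random.default_rng(0)
prior = rng.dirichlet(np.ones(n_h))            # p(h)
like = rng.dirichlet(np.ones(n_d), size=n_h)   # like[h, d] = p(d | h)

post = like * prior[:, None]                   # p(d | h) p(h)
post /= post.sum(axis=0, keepdims=True)        # post[h, d] = p(h | d)

# T[i, j] = sum over d of p(h_n = i | d) p(d | h_{n-1} = j)
T = post @ like.T

# The stationary distribution (eigenvector for eigenvalue 1) is the prior.
vals, vecs = np.linalg.eig(T)
stat = np.real(vecs[:, np.argmax(np.real(vals))])
print(np.allclose(stat / stat.sum(), prior))   # True
```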
Convergence to priors

[Figure: language frequencies across iterations and chains, compared with the prior, for α = 0.50, ε = 0.05, m = 3 and for α = 0.01, ε = 0.05, m = 3]

Compositionality emerges only when favored by the prior.
The information bottleneck

[Figure: language frequencies across iterations and chains, compared with the prior, for α = 0.50, ε = 0.05 with m = 1, 3, and 10]

No effect of the bottleneck on the stationary distribution.
The information bottleneck

$$\text{Stability ratio} = \frac{\sum_{i \in C} p(h_n = i \mid h_{n-1} = i)}{\sum_{i \in H} p(h_n = i \mid h_{n-1} = i)}$$

(C: compositional languages; H: holistic languages)

The bottleneck affects the relative stability of the languages favored by the prior.
Explaining linguistic universals
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
• Our analysis gives different answers:
– existence explained through iterated learning
– universal properties depend on the prior
• Focuses inquiry on the priors of the learners
– languages reflect the biases of human learners
Extensions and future directions
• Results extend to:
– unbounded populations
– continuous time population dynamics
• Iterated learning applies to other knowledge
– religious concepts, social norms, legends…
• Provides a method for evaluating priors
– experiments in iterated learning with humans
Iterated function learning

[Figure: data are sets of (x, y) pairs; hypotheses are functions relating y to x]

• Each learner sees a set of (x, y) pairs
• Makes predictions of y for new x values
• The predictions are the data for the next learner
Function learning in the lab

[Figure: a training trial; the learner sees a stimulus, gives a response with a slider, and receives feedback]
Examine iterated learning with different initial data

[Figure: functions produced at iterations 1 through 9 of iterated learning with human learners, one row per initial data set (Kalish, 2004)]
Markov chain Monte Carlo
• A strategy for sampling from complex
probability distributions
• Key idea: construct a Markov chain which
converges to a particular distribution
– e.g. Metropolis algorithm
– e.g. Gibbs sampling
Gibbs sampling

For variables $x = x_1, x_2, \ldots, x_n$, draw $x_i^{(t+1)}$ from $P(x_i \mid x_{-i})$, where

$$x_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}$$

Converges to $P(x_1, x_2, \ldots, x_n)$
(Geman & Geman, 1984)
(a.k.a. the heat bath algorithm in statistical physics)
Gibbs sampling

[Figure: illustration of Gibbs sampling alternating draws along the conditional distributions (MacKay, 2003)]
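A minimal Gibbs sampler sketch for a standard textbook target, a bivariate Gaussian with correlation ρ (this example is not from the slides): each full conditional $P(x_i \mid x_{-i})$ is itself Gaussian, so the draws are one-liners.

```python
import random, math

# Gibbs sampling for a standard bivariate Gaussian with correlation rho:
# each conditional is Gaussian(rho * other, 1 - rho^2).
rho = 0.8
x1 = x2 = 0.0
samples = []
for t in range(10_000):
    x1 = random.gauss(rho * x2, math.sqrt(1 - rho ** 2))  # x1 ~ P(x1 | x2)
    x2 = random.gauss(rho * x1, math.sqrt(1 - rho ** 2))  # x2 ~ P(x2 | x1)
    samples.append((x1, x2))

# The empirical correlation of the samples should approach rho.
n = len(samples)
m1 = sum(a for a, _ in samples) / n
m2 = sum(b for _, b in samples) / n
corr = sum((a - m1) * (b - m2) for a, b in samples) / n  # variances are ~1
print(corr)  # roughly 0.8
```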
An example: Gaussians

• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means of a Gaussian, μ
– the prior, p(μ), is Gaussian(μ0, σ0²)
• …then $p(x_{n+1} \mid x_n)$ is Gaussian$(\mu_n,\ \sigma_x^2 + \sigma_n^2)$, with

$$\mu_n = \frac{x_n/\sigma_x^2 + \mu_0/\sigma_0^2}{1/\sigma_x^2 + 1/\sigma_0^2}, \qquad \sigma_n^2 = \frac{1}{1/\sigma_x^2 + 1/\sigma_0^2}$$
An example: Gaussians

• Under the same assumptions, $p(x_n \mid x_0)$ is Gaussian$\left(c^n x_0 + (1 - c^n)\mu_0,\ (\sigma_x^2 + \sigma_0^2)(1 - c^{2n})\right)$, with

$$c = \frac{1}{1 + \sigma_x^2/\sigma_0^2} < 1$$

i.e. geometric convergence to the prior predictive distribution.

[Figure: sampled chains with μ0 = 0, σ0² = 1, x0 = 20; iterated learning results in rapid convergence to the prior]
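A simulation sketch of this chain, using the slide's μ0 = 0, σ0² = 1, x0 = 20 and an assumed data variance σx² = 1 (the slide does not state it): each learner computes the posterior mean from the previous datum and samples the next datum from the posterior predictive.

```python
import random, math

# Iterated Gaussian learning: mu0 = 0, var0 = 1, x0 = 20 (from the slide);
# the data variance var_x = 1 is an assumption for this sketch.
mu0, var0, var_x = 0.0, 1.0, 1.0
x = 20.0
for n in range(10):
    var_n = 1 / (1 / var_x + 1 / var0)                # posterior variance
    mu_n = (x / var_x + mu0 / var0) * var_n           # posterior mean
    x = random.gauss(mu_n, math.sqrt(var_x + var_n))  # x ~ posterior predictive
    print(n + 1, round(x, 2))  # contracts toward mu0 at rate c = 1/(1 + var_x/var0)
```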
An example: Linear regression

• Assume
– data, d, are pairs of real numbers (x, y)
– hypotheses, h, are functions
• An example: linear regression
– hypotheses have slope θ and pass through the origin
– the prior, p(θ), is Gaussian(θ0, σ0²)

[Figure: hypotheses are lines through the origin, with the brace marking the predicted y at x = 1; sampled chains with θ0 = 1, σ0² = 0.1, y0 = −1]
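The same recipe applies to the regression example; here is a sketch using the slide's θ0 = 1, σ0² = 0.1, y0 = −1, with an assumed observation noise variance. Since each learner sees the function's value at x = 1, inferring the slope reduces to the Gaussian mean problem above.

```python
import random, math

# Iterated learning of a slope through the origin, observed at x = 1.
# theta0, var0, y0 come from the slide; the noise variance var_y is assumed.
theta0, var0, var_y = 1.0, 0.1, 0.1
y = -1.0
for n in range(10):
    var_n = 1 / (1 / var_y + 1 / var0)                   # posterior variance over theta
    theta_n = (y / var_y + theta0 / var0) * var_n        # posterior mean slope
    y = random.gauss(theta_n, math.sqrt(var_y + var_n))  # next learner's y at x = 1
    print(n + 1, round(y, 2))  # drifts from y0 = -1 toward the prior mean theta0 = 1
```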