Inductive inference in perception and cognition

Analyzing iterated learning
Tom Griffiths (Brown University)
Mike Kalish (University of Louisiana)
Cultural transmission
• Most knowledge is based on secondhand data
• Some things can only be learned from others
– cultural objects transmitted across generations
• Studying the cognitive aspects of cultural
transmission provides unique insights…
Iterated learning
(Kirby, 2001)
• Each learner sees data, forms a hypothesis,
produces the data given to the next learner
• c.f. the playground game “telephone”
Objects of iterated learning
• It’s not just about languages…
• In the wild:
– religious concepts
– social norms
– myths and legends
– causal theories
• In the lab:
– functions and categories
Outline
1. Analyzing iterated learning
2. Iterated Bayesian learning
3. Examples
4. Iterated learning with humans
5. Conclusions and open questions
Discrete generations of single learners
[diagram: a chain of learners, alternately applying PL(h|d) and PP(d|h)]
PL(h|d): probability of inferring hypothesis h from data d
PP(d|h): probability of generating data d from hypothesis h
Markov chains
[diagram: a sequence of states x(1) → x(2) → … → x(t)]
Transition matrix
T = P(x(t+1)|x(t))
• Variables x(t+1) independent of history given x(t)
• Converges to a stationary distribution under
easily checked conditions for ergodicity
Stationary distributions
• Stationary distribution:
πi = Σj P(x(t+1) = i | x(t) = j) πj = Σj Tij πj
• In matrix form: π = Tπ
• π is the first eigenvector of the matrix T
• Second eigenvalue sets rate of convergence
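Both quantities can be read off numerically from T. A minimal sketch, assuming NumPy and a made-up two-state column-stochastic transition matrix:

```python
import numpy as np

# Toy transition matrix: T[i, j] = P(x(t+1) = i | x(t) = j),
# so each column sums to 1, matching pi = T pi above.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# The stationary distribution is the eigenvector with eigenvalue 1.
vals, vecs = np.linalg.eig(T)
k = np.argmin(np.abs(vals - 1.0))
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                  # normalize to a probability distribution

# The second-largest eigenvalue magnitude sets the rate of convergence.
second = sorted(np.abs(vals))[-2]
```

For this T the chain settles on π = (2/3, 1/3), and deviations shrink geometrically at rate 0.7 per step.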
Analyzing iterated learning
d0 → h1 → d1 → h2 → d2 → h3 → …
(alternating inference, PL(h|d), and production, PP(d|h))
A Markov chain on hypotheses
h1 → h2 → h3 → …, with transition probability Σd PP(d|h) PL(h|d)
A Markov chain on data
d0 → d1 → d2 → …, with transition probability Σh PL(h|d) PP(d|h)
A Markov chain on hypothesis-data pairs
(h1, d1) → (h2, d2) → (h3, d3) → …, applying PL(h|d) then PP(d|h)
A Markov chain on hypotheses
• Transition probabilities sum out data
Qij = P(hn+1 = i | hn = j) = Σd P(hn+1 = i | d) P(d | hn = j)
• Stationary distribution and convergence rate
from eigenvectors and eigenvalues of Q
– can be computed numerically for matrices of
reasonable size, and analytically in some cases
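Summing out data is just a matrix product. A sketch with invented toy numbers, assuming NumPy and a Bayesian learner (so PL is derived from PP and a prior):

```python
import numpy as np

# Toy spaces: 2 hypotheses, 3 possible data sets.
# PP[d, h] = probability that hypothesis h generates data d (columns sum to 1).
PP = np.array([[0.7, 0.1],
               [0.2, 0.3],
               [0.1, 0.6]])

# A Bayesian learner: PL[h, d] proportional to PP[d, h] * prior[h].
prior = np.array([0.4, 0.6])
PL = (PP * prior).T
PL = PL / PL.sum(axis=0)            # normalize over hypotheses for each d

# Q[i, j] = sum_d PL[i, d] PP[d, j]: the transition matrix on hypotheses.
Q = PL @ PP

# Stationary distribution by power iteration.
pi = np.ones(2) / 2
for _ in range(1000):
    pi = Q @ pi
```

With a Bayesian learner the stationary distribution comes out equal to the prior, which is the result developed in the iterated Bayesian learning section of the talk.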
Infinite populations in continuous time
• “Language dynamical equation” (Nowak, Komarova, & Niyogi, 2001):
dxi/dt = Σj Qij fj(x) xj − φ(x) xi
• “Neutral model” (fj(x) constant) (Komarova & Nowak, 2003):
dxi/dt = Σj Qij xj − xi, i.e. dx/dt = (Q − I)x
• Stable equilibrium at first eigenvector of Q
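The neutral model can be checked by direct integration. A minimal Euler sketch, assuming NumPy and a made-up 2×2 column-stochastic Q:

```python
import numpy as np

# Toy column-stochastic Q; its first eigenvector is (2/3, 1/3).
Q = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Integrate dx/dt = (Q - I) x by Euler steps from a skewed start.
# Note the dynamics preserve sum(x) = 1, since Q's columns sum to 1.
x = np.array([0.01, 0.99])
dt = 0.01
for _ in range(10000):
    x = x + dt * (Q @ x - x)
```

The frequencies flow to the first eigenvector of Q, matching the stated equilibrium.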
Outline
1. Analyzing iterated learning
2. Iterated Bayesian learning
3. Examples
4. Iterated learning with humans
5. Conclusions and open questions
Bayesian inference
• Rational procedure
for updating beliefs
• Foundation of many
learning algorithms
(e.g., MacKay, 2003)
• Widely used for
language learning
(e.g., Charniak, 1993)
[portrait: Reverend Thomas Bayes]
Bayes’ theorem
P(h | d) = P(d | h) P(h) / Σh′∈H P(d | h′) P(h′)

posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses H

h: hypothesis
d: data
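As a concrete instance of the theorem, a short sketch with two invented hypotheses and a single observed data set:

```python
# Two hypotheses with equal prior probability; the observed data d is
# four times as likely under h1 as under h2.
prior = {"h1": 0.5, "h2": 0.5}
likelihood = {"h1": 0.8, "h2": 0.2}      # P(d | h)

# Denominator: sum over the space of hypotheses.
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
```

Here the posterior on h1 rises from 0.5 to 0.8: belief shifts toward the hypothesis that better predicts the data, in proportion to prior times likelihood.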
Iterated Bayesian learning
Learners are Bayesian agents
PL(h | d) = PP(d | h) P(h) / Σh′∈H PP(d | h′) P(h′)
Markov chains on h and d
• Markov chain on h has stationary distribution
πi = P(h = i)
i.e. the prior
• Markov chain on d has stationary distribution
πi = Σh PP(d = i | h) P(h)
i.e. the prior predictive distribution
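These stationary distributions can be checked by simulation. A sketch with invented two-hypothesis, two-data-set numbers, using only the standard library:

```python
import random

random.seed(0)

prior = [0.3, 0.7]
# PP[d][h]: probability that a learner holding hypothesis h produces data d.
PP = [[0.6, 0.2],
      [0.4, 0.8]]

def sample(weights):
    # Draw an index in proportion to (possibly unnormalized) weights.
    return random.choices(range(len(weights)), weights=weights)[0]

h = 0
counts = [0, 0]
for _ in range(200000):
    d = sample([PP[0][h], PP[1][h]])                  # production: PP(d | h)
    post = [PP[d][0] * prior[0], PP[d][1] * prior[1]]
    h = sample(post)                                  # inference: PL(h | d)
    counts[h] += 1

freq = [c / 200000 for c in counts]   # empirical distribution over hypotheses
```

The empirical hypothesis frequencies approach the prior (0.3, 0.7), regardless of the starting hypothesis.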
Markov chain Monte Carlo
• A strategy for sampling from complex
probability distributions
• Key idea: construct a Markov chain which
converges to a particular distribution
– e.g. Metropolis algorithm
– e.g. Gibbs sampling
Gibbs sampling
For variables x = x1, x2, …, xn
Draw xi(t+1) from P(xi | x−i), where x−i = x1(t+1), …, xi−1(t+1), xi+1(t), …, xn(t)
Converges to P(x1, x2, …, xn)
(Geman & Geman, 1984)
(a.k.a. the heat bath algorithm in statistical physics)
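A minimal Gibbs sampler sketch for a made-up target, a bivariate Gaussian with unit variances and correlation ρ, chosen because both conditionals are known in closed form:

```python
import math
import random

random.seed(1)

rho = 0.8                      # target correlation between x1 and x2
sd = math.sqrt(1 - rho ** 2)   # conditional: x1 | x2 ~ N(rho * x2, 1 - rho^2)

x1, x2 = 0.0, 0.0
sum1 = sum12 = 0.0
n = 100000
for _ in range(n):
    x1 = random.gauss(rho * x2, sd)    # draw x1 from P(x1 | x2)
    x2 = random.gauss(rho * x1, sd)    # draw x2 from P(x2 | x1)
    sum1 += x1
    sum12 += x1 * x2

mean1 = sum1 / n      # should approach 0
corr = sum12 / n      # E[x1 x2] should approach rho
```

Although each step only resamples one coordinate, the chain's long-run samples reproduce the joint distribution, here recovering the target mean and correlation.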
Gibbs sampling
[figure: Gibbs sampling illustration (MacKay, 2003)]
Iterated learning is a Gibbs sampler
• Iterated Bayesian learning is a Gibbs sampler for
p(d, h) = p(d | h) p(h)
• Implies:
– (h, d) converges to this distribution
– convergence rates are known
(Liu, Wong, & Kong, 1995)
Outline
1. Analyzing iterated learning
2. Iterated Bayesian learning
3. Examples
4. Iterated learning with humans
5. Conclusions and open questions
An example: Gaussians
• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means of a Gaussian, μ
– prior, p(μ), is Gaussian(μ0, σ0²)
• …then p(xn+1 | xn) is Gaussian(μn, σx² + σn²), with
μn = (xn/σx² + μ0/σ0²) / (1/σx² + 1/σ0²)
σn² = 1 / (1/σx² + 1/σ0²)
An example: Gaussians
• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means of a Gaussian, μ
– prior, p(μ), is Gaussian(μ0, σ0²)
• …then p(xn+1 | xn) is Gaussian(μn, σx² + σn²)
• p(xn | x0) is Gaussian(c^n x0 + (1 − c^n)μ0, (σx² + σ0²)(1 − c^2n)), with c = 1/(1 + σx²/σ0²)
i.e. geometric convergence to the prior
An example: Gaussians
• p(xn | x0) is Gaussian(c^n x0 + (1 − c^n)μ0, (σx² + σ0²)(1 − c^2n))
[plots: μ0 = 0, σ0² = 1, x0 = 20]
Iterated learning results in rapid convergence to prior
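The Gaussian chain is easy to simulate directly. A sketch using the slide's parameters (μ0 = 0, σ0² = 1, x0 = 20) plus an assumed production-noise variance σx² = 1:

```python
import math
import random

random.seed(2)

mu0, s0sq = 0.0, 1.0    # prior over the mean: Gaussian(mu0, s0sq)
sxsq = 1.0              # production noise: x ~ Gaussian(mu, sxsq) (assumed value)

x = 20.0                # x0 starts far from the prior mean
n = 200000
total = total_sq = 0.0
for _ in range(n):
    # Posterior over mu given x, then sample the next learner's datum.
    mu_n = (x / sxsq + mu0 / s0sq) / (1 / sxsq + 1 / s0sq)
    s_nsq = 1 / (1 / sxsq + 1 / s0sq)
    x = random.gauss(mu_n, math.sqrt(sxsq + s_nsq))
    total += x
    total_sq += x * x

mean = total / n
var = total_sq / n - mean ** 2
# Stationary distribution is the prior predictive Gaussian(mu0, sxsq + s0sq).
```

Despite the extreme x0 = 20 start, the chain forgets the initial data within a few generations and its long-run mean and variance match the prior predictive distribution.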
An example: Linear regression
• Assume
– data, d, are pairs of real numbers (x, y)
– hypotheses, h, are functions
• An example: linear regression
– hypotheses have slope θ and pass through the origin
– p(θ) is Gaussian(θ0, σ0²)
[plots: y at x = 1 under iterated learning; θ0 = 1, σ0² = 0.1, y0 = −1]
An example: compositionality
[diagram: a language is a function from events x to utterances y; each event pairs an “agent” bit with an “action” bit (coded 0 or 1), each utterance a “noun” bit with a “verb” bit; a compositional language maps agents to nouns and actions to verbs]
An example: compositionality
[diagram: compositional languages map event components to utterance components; holistic languages map whole events to whole utterances]
P(h) = α/4 for each compositional language, (1 − α)/256 for each holistic language
• Data: m event-utterance pairs
• Hypotheses: languages, with error rate ε
Analysis technique
1. Compute transition matrix on languages
P(hn = i | hn−1 = j) = Σd P(hn = i | d) P(d | hn−1 = j)
2. Sample Markov chains
3. Compare language frequencies with prior
(can also compute eigenvalues etc.)
Convergence to priors
Effect of prior:
[plots: α = 0.50, ε = 0.05, m = 3 vs α = 0.01, ε = 0.05, m = 3; language frequencies across iterations and chains match the prior]
The information bottleneck
No effect of bottleneck:
[plots: α = 0.50, ε = 0.05, m = 1; α = 0.01, ε = 0.05, m = 3; α = 0.50, ε = 0.05, m = 10; language frequencies across iterations and chains still match the prior]
The information bottleneck
stability ratio = [Σi∈C P(hn = i | hn−1 = i)] / [Σi∈H P(hn = i | hn−1 = i)]
where C is the set of compositional languages and H the set of holistic languages
Bottleneck affects relative stability of languages favored by the prior
Outline
1. Analyzing iterated learning
2. Iterated Bayesian learning
3. Examples
4. Iterated learning with humans
5. Conclusions and open questions
A method for discovering priors
Iterated learning converges to the prior…
…so we can estimate learners’ priors by running iterated learning with people
Iterated function learning
[diagram: data are (x, y) pairs; hypotheses are functions]
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
Function learning in the lab
[screenshot: function-learning task with stimulus, response slider, and feedback]
Examine iterated learning with different initial data
[plots: functions produced at iterations 1–9, for different initial data]
(Kalish, 2004)
Outline
1. Analyzing iterated learning
2. Iterated Bayesian learning
3. Examples
4. Iterated learning with humans
5. Conclusions and open questions
Conclusions and open questions
• Iterated Bayesian learning converges to the prior
– properties of languages are properties of learners
– the information bottleneck doesn’t affect the equilibrium
• What about other learning algorithms?
• What determines rates of convergence?
– amount and structure of input data
• What happens with people?