Inductive inference in perception and cognition


Priors and predictions in
everyday cognition
Tom Griffiths
Cognitive and Linguistic Sciences
[Diagram: data → brain → behavior]
What computational problem is the brain solving?
Does human behavior correspond to an
optimal solution to that problem?
Inductive problems
• Inferring structure from data
• Perception
– e.g. structure of 3D world from 2D visual data
[Example: 2D image data; hypotheses include a cube and a shaded hexagon]
Inductive problems
• Inferring structure from data
• Perception
– e.g. structure of 3D world from 2D data
• Cognition
– e.g. relationship between variables from samples
[Example: sample data; hypotheses are possible relationships between variables]
Reverend Thomas Bayes
Bayes’ theorem
Posterior probability: p(h | d)
Likelihood: p(d | h)
Prior probability: p(h)

p(h | d) = p(d | h) p(h) / Σh′∈H p(d | h′) p(h′)

h: hypothesis
d: data
(the denominator sums over the space of hypotheses H)
Bayes’ theorem
p(h | d) ∝ p(d | h) p(h)
h: hypothesis
d: data
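To make the update concrete, here is a minimal sketch of my own (not from the talk) of Bayes' rule over a two-hypothesis space; the hypotheses echo the cube / shaded-hexagon example above, and the prior and likelihood values are made up.

```python
# Minimal Bayes' rule sketch over a discrete hypothesis space.
# Hypotheses echo the earlier cube / shaded-hexagon example;
# the prior and likelihood numbers are made up for illustration.

hypotheses = ["cube", "shaded hexagon"]
prior = {"cube": 0.7, "shaded hexagon": 0.3}        # p(h)
likelihood = {"cube": 0.8, "shaded hexagon": 0.2}   # p(d | h) for the observed image d

# p(h | d) = p(d | h) p(h) / sum over h' of p(d | h') p(h')
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}
print(posterior)
```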
Perception is optimal
Körding & Wolpert (2004)
Cognition is not
Do people use priors?
p(h | d) ∝ p(d | h) p(h)
Standard answer: no (Tversky & Kahneman, 1974)
This talk: yes
What are people’s priors?
Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
• What knowledge guides human inferences?
Prior knowledge matters when…
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
Outline
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
– joint work with Mike Kalish (Louisiana)
• Conclusions
Predicting the future
How often is Google News updated?
t = time since last update
ttotal = time between updates
What should we guess for ttotal given t?
Making predictions
• You encounter a phenomenon that has
existed for t units of time. How long will it
continue into the future? (i.e. what’s ttotal?)
• We could replace “time” with any other
variable that ranges from 0 to some
unknown upper limit
Everyday prediction problems
• You read about a movie that has made $60 million
to date. How much money will it make in total?
• You see that something has been baking in the
oven for 34 minutes. How long until it’s ready?
• You meet someone who is 78 years old. How long
will they live?
• Your friend quotes to you from line 17 of his
favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of
the train station. How many cabs in this city?
Bayesian inference
p(ttotal|t) ∝ p(t|ttotal) p(ttotal)
(posterior probability ∝ likelihood × prior)
Bayesian inference
p(ttotal|t) ∝ p(t|ttotal) p(ttotal)
(posterior probability ∝ likelihood × prior)

p(ttotal|t) ∝ (1/ttotal) p(ttotal)
(likelihood assumes a random sample: 0 < t < ttotal)
Bayesian inference
p(ttotal|t) ∝ p(t|ttotal) p(ttotal)
(posterior probability ∝ likelihood × prior)

p(ttotal|t) ∝ (1/ttotal) × (1/ttotal)
(random-sample likelihood, 0 < t < ttotal; “uninformative” prior p(ttotal) ∝ 1/ttotal)
Bayesian inference
p(ttotal|t) ∝ (1/ttotal) × (1/ttotal)
(random-sampling likelihood; “uninformative” prior)

What is the best guess for ttotal?
How about the maximal value of p(ttotal|t)?
[Plot: the posterior p(ttotal|t) is maximized at ttotal = t]
Bayesian inference
p(ttotal|t) ∝ (1/ttotal) × (1/ttotal)
(random-sampling likelihood; “uninformative” prior)

What is the best guess for ttotal?
Instead, compute t* such that p(ttotal > t*|t) = 0.5:
[Plot: the posterior p(ttotal|t) with its median t* marked]
Bayesian inference
p(ttotal|t) ∝ (1/ttotal) × (1/ttotal)
(random-sampling likelihood; “uninformative” prior)
What is the best guess for ttotal?
Instead, compute t* such that p(ttotal > t*|t) = 0.5.
Yields Gott’s Rule: P(ttotal > t*|t) = 0.5 when t* = 2t
i.e., best guess for ttotal = 2t
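As a numerical sanity check on this derivation, here is a sketch of my own (the observed t and the grid bounds are arbitrary choices) that evaluates the posterior p(ttotal|t) ∝ 1/ttotal² on a grid and locates its median.

```python
import numpy as np

# Check Gott's rule numerically: with likelihood 1/t_total (random sampling)
# and prior 1/t_total, the posterior median t* should be roughly 2t.
# The observed t and the grid upper bound are arbitrary illustrative choices.
t = 10.0
t_total = np.linspace(t, 1000 * t, 200_000)   # posterior support: t_total > t
posterior = 1.0 / t_total**2                  # unnormalized 1/t_total x 1/t_total

cdf = np.cumsum(posterior)
cdf /= cdf[-1]                                # normalize on the grid
t_star = t_total[np.searchsorted(cdf, 0.5)]
print(t_star)                                 # approximately 2 * t
```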
Applying Gott’s rule
t ≈ 4000 years, t* ≈ 8000 years
Applying Gott’s rule
t ≈ 130,000 years, t* ≈ 260,000 years
Predicting everyday events
• You meet someone who is 35 years old. How long
will they live?
– “70 years” seems reasonable
• Not so simple:
– You meet someone who is 78 years old. How long will
they live?
– You meet someone who is 6 years old. How long will they
live?
The effects of priors
• Different kinds of priors p(ttotal) are
appropriate in different domains.
Uninformative: p(ttotal) ∝ 1/ttotal
The effects of priors
• Different kinds of priors p(ttotal) are
appropriate in different domains.
– e.g. wealth (power-law), height (Gaussian)
The effects of priors
Evaluating human predictions
• Different domains with different priors:
– a movie has made $60 million [power-law]
– your friend quotes from line 17 of a poem [power-law]
– you meet a 78 year old man [Gaussian]
– a movie has been running for 55 minutes [Gaussian]
– a U.S. congressman has served for 11 years [Erlang]
• Prior distributions derived from actual data
• Use 5 values of t for each
• People predict ttotal
[Results: people's median predictions compared with Gott's rule and with predictions from the empirical and parametric priors]
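To see how the form of the prior changes the optimal prediction, here is an illustrative sketch of my own (the parameter values are assumptions, not the ones estimated from real data in the talk) that computes the posterior median t* under a power-law prior and a Gaussian prior.

```python
import numpy as np

def median_prediction(t, prior_fn, grid_max=10_000.0, n=200_000):
    """Posterior median t* given observed t, assuming the random-sampling
    likelihood 1/t_total and the supplied prior (illustrative only)."""
    t_total = np.linspace(t, grid_max, n)
    post = prior_fn(t_total) / t_total        # prior x likelihood
    cdf = np.cumsum(post)
    cdf /= cdf[-1]
    return t_total[np.searchsorted(cdf, 0.5)]

# Assumed parameter values, not those fit to real data:
power_law = lambda x: x ** -1.5                           # power-law prior (e.g. movie grosses)
gaussian = lambda x: np.exp(-0.5 * ((x - 75) / 15) ** 2)  # Gaussian prior (e.g. life spans)

for t in [10, 40, 70]:
    print(t, round(median_prediction(t, power_law), 1),
             round(median_prediction(t, gaussian), 1))
# Power-law priors give roughly multiplicative predictions (t* ~ constant x t);
# Gaussian priors pull predictions toward the prior mean.
```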
Nonparametric priors
You arrive at a friend’s
house, and see that a cake
has been in the oven for 34
minutes. How long will it
be in the oven?
No direct experience
You learn that in ancient
Egypt, there was a great
flood in the 11th year of a
pharaoh’s reign. How
long did he reign?
How long did the typical
pharaoh reign in ancient
Egypt?
…using a single datapoint
• People produce accurate predictions for the
duration and extent of everyday events
• Strong prior knowledge
– form of the prior (power-law or exponential)
– distribution given that form (parameters)
– non-parametric distribution when necessary
• Reveals a surprising correspondence between
probabilities in the mind and in the world
Outline
• …using a single datapoint
– predicting the future
– joint work with Josh Tenenbaum (MIT)
• …using secondhand data
– effects of priors on cultural transmission
– joint work with Mike Kalish (Louisiana)
• Conclusions
Cultural transmission
• Most knowledge is based on secondhand data
• Some things can only be learned from others
– cultural objects transmitted across generations
• Cultural transmission provides an opportunity
for priors to influence cultural objects
Iterated learning
(Briscoe, 1998; Kirby, 2001)
• Each learner sees data, forms a hypothesis,
produces the data given to the next learner
• c.f. the playground game “telephone”
Objects of iterated learning
• Languages
• Religious concepts
• Social norms
• Myths and legends
• Causal theories
Explaining linguistic universals
• Human languages are a subset of all logically
possible communication schemes
– universal properties common to all languages
(Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
Explaining linguistic universals
• Traditional answer:
– linguistic universals reflect innate constraints
specific to a system for acquiring language
• Alternative answer:
– iterated learning imposes “information bottleneck”
– universal properties survive this bottleneck
(Briscoe, 1998; Kirby, 2001)
Analyzing iterated learning
What are the consequences of iterated learning?
[Diagram: prior work arranged by simulations vs. analytic results and simple vs. complex learning algorithms — Kirby (2001); Smith, Kirby, & Brighton (2003); Brighton (2002); Komarova, Niyogi, & Nowak (2002) — with analytic results for complex algorithms marked "?"]
Iterated Bayesian learning
[Diagram: each learner samples a hypothesis from p(h|d) given the data it observes, then generates data from p(d|h) for the next learner]
Learners are rational Bayesian agents
(covers a wide range of learning algorithms)
Markov chains
x(1) → x(2) → x(3) → …
Transition matrix: P(x(t+1)|x(t))
• Variables x(t+1) independent of history given x(t)
• Converges to a stationary distribution under
easily checked conditions
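As a small illustration of convergence to a stationary distribution (my own example; the transition matrix is made up), the sketch below repeatedly applies a 3-state transition matrix to an initial state distribution.

```python
import numpy as np

# A made-up 3-state transition matrix; row i gives P(x(t+1) | x(t) = i).
P = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])

dist = np.array([1.0, 0.0, 0.0])   # start in state 0 with certainty
for _ in range(200):
    dist = dist @ P                # one step of the chain

print(dist)        # approximate stationary distribution
print(dist @ P)    # applying P again barely changes it
```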
Markov chain Monte Carlo
• A strategy for sampling from complex
probability distributions
• Key idea: construct a Markov chain which
converges to target distribution
– e.g. Metropolis algorithm
– e.g. Gibbs sampling
Gibbs sampling
For variables x = x1, x2, …, xn
Draw xi(t+1) from P(xi|x-i)
x-i = x1(t+1), x2(t+1),…, xi-1(t+1), xi+1(t), …, xn(t)
(Geman & Geman, 1984)
(a.k.a. the heat bath algorithm in statistical physics)
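Here is a minimal Gibbs-sampler sketch of my own for a bivariate Gaussian with correlation rho, using the standard conditional x1 | x2 ~ Normal(rho·x2, 1 − rho²); the target distribution and settings are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                 # assumed correlation of an illustrative bivariate Gaussian target
x1, x2 = 0.0, 0.0         # arbitrary starting state
samples = []

for _ in range(20_000):
    # Draw each variable from its conditional given the current value of the other.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples[2_000:])     # discard burn-in
print(np.corrcoef(samples.T)[0, 1])     # close to rho
```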
Gibbs sampling
[Figure: illustration of Gibbs sampling (MacKay, 2002)]
Iterated Bayesian learning
• Defines a Markov chain on (h,d)
Iterated Bayesian learning
• Defines a Markov chain on (h,d)
• This Markov chain is a Gibbs sampler for
p(d,h) ∝ p(d | h) p(h)
Iterated Bayesian learning
• Defines a Markov chain on (h,d)
• This Markov chain is a Gibbs sampler for
p(d,h) ∝ p(d | h) p(h)
• Rate of convergence is geometric
– Gibbs sampler converges geometrically (Liu, Wong, & Kong, 1995)
Analytic results
• Iterated Bayesian learning converges to
p(d,h) ∝ p(d | h) p(h)
• Corollaries:
– distribution over hypotheses converges to p(h)
– distribution over data converges to p(d)
– the proportion of a population of iterated
learners with hypothesis h converges to p(h)
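A quick simulation sketch of my own (the hypothesis space, prior, and data size are made up) of the corollary that the distribution over hypotheses converges to p(h): each learner samples a hypothesis from its posterior given the previous learner's data, then generates data for the next learner.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up setting: hypotheses are three coin biases with a non-uniform prior p(h);
# each learner passes on the outcome of n_flips coin flips.
biases = np.array([0.2, 0.5, 0.8])
prior = np.array([0.5, 0.3, 0.2])
n_flips = 5

def posterior(heads):
    like = biases**heads * (1 - biases)**(n_flips - heads)   # p(d | h)
    post = like * prior                                      # p(d | h) p(h)
    return post / post.sum()

counts = np.zeros(3)
h = 0                                            # arbitrary initial hypothesis
for _ in range(50_000):
    heads = rng.binomial(n_flips, biases[h])     # current learner produces data
    h = rng.choice(3, p=posterior(heads))        # next learner samples h ~ p(h | d)
    counts[h] += 1

print(counts / counts.sum())   # approaches the prior [0.5, 0.3, 0.2]
```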
Implications for linguistic universals
• Two questions:
– why do linguistic universals exist?
– why are particular properties universal?
• Different answers:
– existence explained through iterated learning
– universal properties depend on the prior
• Focuses inquiry on the priors of the learners
– cultural objects reflect the human mind
A method for discovering priors
Iterated learning converges to the prior…
…so priors can be evaluated by producing iterated learning in the lab
Iterated function learning
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
Function learning in the lab
[Task display: stimulus and feedback bars with a response slider]
Examine iterated learning with different initial data
[Results: functions produced at iterations 1–9 for different initial data]
…using secondhand data
• Iterated Bayesian learning converges to the prior
• Constrains explanations of linguistic universals
• Open questions in Bayesian language evolution
– variation in priors
– other selective pressures
• Provides a method for evaluating priors
– concepts, causal relationships, languages, …
Outline
• …using a single datapoint
– predicting the future
• …using secondhand data
– effects of priors on cultural transmission
• Conclusions
Bayes’ theorem
p(h | d) ∝ p(d | h) p(h)
A unifying principle for explaining
inductive inferences
Bayes’ theorem
behavior = f(data,knowledge)
[Diagram: data → brain → behavior]
Bayes’ theorem
behavior = f(data,knowledge)
[Diagram: data + knowledge → brain → behavior]
Explaining inductive leaps
• How do people
– infer causal relationships
– identify the work of chance
– predict the future
– assess similarity and make generalizations
– learn functions, languages, and concepts
. . . from such limited data?
• What knowledge guides human inferences?
HHTHT
HHHHT
What’s the computational problem?
p(HHTHT|random)
p(random|HHTHT)
An inference about the structure of the world
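To make the contrast concrete, here is an illustrative sketch of my own (the alternative hypothesis and the 50/50 prior are assumptions, not the model from the talk): it contrasts the likelihood p(sequence | random) with the posterior p(random | sequence), where the alternative is a coin strongly biased toward heads.

```python
# Illustrative only: "random" = fair coin; the alternative hypothesis
# (P(H) = 0.9) and the 50/50 prior are assumptions, not the talk's model.

def p_sequence(seq, p_heads):
    """Likelihood of a flip sequence under a coin with the given P(heads)."""
    p = 1.0
    for flip in seq:
        p *= p_heads if flip == "H" else 1 - p_heads
    return p

def p_random_given(seq, prior_random=0.5, p_heads_alt=0.9):
    """Posterior probability that the sequence came from the fair coin."""
    num = p_sequence(seq, 0.5) * prior_random
    alt = p_sequence(seq, p_heads_alt) * (1 - prior_random)
    return num / (num + alt)

for seq in ["HHTHT", "HHHHT"]:
    print(seq, p_sequence(seq, 0.5), round(p_random_given(seq), 3))
# Both sequences have the same likelihood under "random" (0.5 ** 5),
# but HHHHT gets a lower posterior because the heads-biased alternative
# explains it better -- the inference is about the structure of the world.
```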
An example: Gaussians
• If we assume…
– data, d, is a single real number, x
– hypotheses, h, are means of a Gaussian, μ (with fixed data variance σx²)
– prior, p(μ), is Gaussian(μ0, σ0²)
• …then p(xn+1|xn) is Gaussian(μn, σx² + σn²), where

μn = (xn/σx² + μ0/σ0²) / (1/σx² + 1/σ0²)
σn² = 1 / (1/σx² + 1/σ0²)

μ0 = 0, σ0² = 1, x0 = 20
Iterated learning results in rapid convergence to prior
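A simulation sketch of this Gaussian case using the update equations above; μ0 = 0, σ0² = 1, and x0 = 20 come from the slide, while the data variance σx² = 1 is my assumption since the slide does not state it.

```python
import numpy as np

rng = np.random.default_rng(2)

mu0, var0 = 0.0, 1.0    # prior mean and variance (from the slide)
var_x = 1.0             # data variance: assumed here, not given on the slide
x = 20.0                # initial data point x0 = 20 (from the slide)

mus = []
for _ in range(20):
    # Posterior over the mean mu after seeing the single data point x:
    var_n = 1.0 / (1.0 / var_x + 1.0 / var0)
    mu_n = (x / var_x + mu0 / var0) * var_n
    mu = rng.normal(mu_n, np.sqrt(var_n))   # learner samples a hypothesis
    x = rng.normal(mu, np.sqrt(var_x))      # and produces data for the next learner
    mus.append(round(mu, 2))

print(mus)   # pulled toward x0 at first, then settles into draws from the prior N(0, 1)
```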
An example: Linear regression
• Assume
– data, d, are pairs of real numbers (x, y)
– hypotheses, h, are functions
• An example: linear regression
– hypotheses have slope θ and pass through the origin
– p(θ) is Gaussian(θ0, σ0²)
[Figure: y as a function of x, with the slope θ shown as the value of y at x = 1]
θ0 = 1, σ0² = 0.1, y0 = -1