Transcript: Inductive inference in perception and cognition
Markov chain Monte Carlo with people
Tom Griffiths, Department of Psychology, Cognitive Science Program, UC Berkeley, with Mike Kalish, Stephan Lewandowsky, and Adam Sanborn
Inductive problems
Learning languages from utterances: blicket toma, dax wug, blicket wug → S → X Y, X → {blicket, dax}, Y → {toma, wug}
Learning categories from instances of their members
Learning functions from (x, y) pairs
Computational cognitive science
• Identify the underlying computational problem
• Find the optimal solution to that problem
• Compare human cognition to that solution
For inductive problems, solutions come from statistics
Statistics and inductive problems
Cognitive science → Statistics:
Categorization → Density estimation
Causal learning → Graphical models
Function learning → Regression
Language → Probabilistic grammars
…
Statistics and human cognition • How can we use statistics to understand cognition?
• How can cognition inspire new statistical models?
– applications of Dirichlet process and Pitman-Yor process models to natural language – exchangeable distributions on infinite binary matrices via the Indian buffet process (priors on causal structure) – nonparametric Bayesian models for relational data
Are people Bayesian?
Reverend Thomas Bayes
Bayes’ theorem

  P(h|d) = P(d|h) P(h) / Σ_{h'∈H} P(d|h') P(h')

Posterior probability ∝ Likelihood × Prior probability
h: hypothesis; d: data; the sum in the denominator runs over the space of hypotheses H
People are stupid
Predicting the future
How often is Google News updated?
t = time since last update
t_total = time between updates
What should we guess for t_total given t?
The effects of priors
Evaluating human predictions
• Different domains with different priors:
– a movie has made $60 million [power-law]
– your friend quotes from line 17 of a poem [power-law]
– you meet a 78 year old man [Gaussian]
– a movie has been running for 55 minutes [Gaussian]
– a U.S. congressman has served for 11 years [Erlang]
• Prior distributions derived from actual data
• Use 5 values of t for each
• People predict t_total
[figure legend: people, Gott’s rule, empirical prior, parametric prior]
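The optimal predictions plotted on this slide come from a simple Bayesian rule: given a prior over t_total and the assumption that the observed t is uniform on [0, t_total], report the posterior median. A minimal sketch, assuming a power-law prior whose exponent and grid are illustrative placeholders rather than the empirical priors used in the experiment:

```python
import numpy as np

def predict_t_total(t, prior, grid):
    """Posterior-median prediction of t_total given current value t.

    Assumes t is uniform on [0, t_total], so P(t | t_total) = 1/t_total
    whenever t <= t_total, and 0 otherwise.
    """
    likelihood = np.where(grid >= t, 1.0 / grid, 0.0)
    posterior = likelihood * prior
    posterior /= posterior.sum()
    cdf = np.cumsum(posterior)
    return grid[np.searchsorted(cdf, 0.5)]  # posterior median

# Hypothetical power-law prior over total durations (illustrative only)
grid = np.linspace(1, 1000, 10000)
prior = grid ** -1.5
print(predict_t_total(30.0, prior, grid))
```

The prediction always exceeds the observed t, and the shape of the prior determines by how much, which is what makes the five domains above diagnostic of people's priors.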
A different approach…
Instead of asking whether people are rational, use the assumption of rationality to investigate cognition.
If we can predict people’s responses, we can design experiments that measure psychological variables.
Two deep questions
• What are the biases that guide human learning?
– prior probability distribution P(h)
• What do mental representations look like?
– category distribution P(x|c)
  lim_{t→∞} P(x^(t) | x^(0))
Two deep questions
• What are the biases that guide human learning?
– prior probability distribution on hypotheses, P(h)
• What do mental representations look like?
– distribution over objects x in category c, P(x|c)
Develop ways to sample from these distributions
Outline Markov chain Monte Carlo Sampling from the prior Sampling from category distributions
Markov chains

x^(0) → x^(1) → x^(2) → … → x^(t) → …

Transition matrix T = P(x^(t+1) | x^(t))
• Variables: x^(t+1) independent of history given x^(t)
• Converges to a stationary distribution under easily checked conditions (i.e., if it is ergodic)
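A small numeric illustration of this convergence; the transition matrix below is an arbitrary ergodic example, not one from the talk:

```python
import numpy as np

# An ergodic transition matrix: T[i, j] = P(x(t+1) = j | x(t) = i)
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# Iterating the chain from any starting distribution converges
# to the stationary distribution pi, which satisfies pi = pi T
p = np.array([1.0, 0.0, 0.0])
for _ in range(500):
    p = p @ T

print(p)  # stationary distribution, approximately [0.4, 0.4, 0.2]
```

The self-loops make the chain aperiodic and every state is reachable from every other, so the easily checked ergodicity condition holds here.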
Markov chain Monte Carlo
• Sample from a target distribution P(x) by constructing a Markov chain for which P(x) is the stationary distribution
• Two main schemes:
– Gibbs sampling
– Metropolis-Hastings algorithm
Gibbs sampling
For variables x = x_1, x_2, …, x_n and target P(x):
Draw x_i^(t+1) from P(x_i | x_{-i}), where

  x_{-i} = x_1^(t+1), x_2^(t+1), …, x_{i-1}^(t+1), x_{i+1}^(t), …, x_n^(t)
Gibbs sampling (MacKay, 2002)
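A minimal sketch of the scheme above for a case where the conditionals are known in closed form, a standard bivariate normal with correlation rho (this example is ours, not MacKay's):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, rng=None):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each full conditional P(x_i | x_{-i}) is N(rho * x_{-i}, 1 - rho**2).
    """
    rng = np.random.default_rng(rng)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))  # draw x1 | x2
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))  # draw x2 | x1
        samples[t] = x1, x2
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000, rng=0)
print(np.corrcoef(samples.T)[0, 1])  # close to the target correlation 0.8
```

Each sweep updates one variable at a time from its conditional, exactly the x_{-i} construction given above.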
Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970)
Step 1: propose a state (we assume symmetrically)

  Q(x^(t+1) | x^(t)) = Q(x^(t) | x^(t+1))

Step 2: decide whether to accept, with probability

  Metropolis acceptance function: A(x^(t), x^(t+1)) = min(1, p(x^(t+1)) / p(x^(t)))
  Barker acceptance function: A(x^(t), x^(t+1)) = p(x^(t+1)) / (p(x^(t+1)) + p(x^(t)))
Metropolis-Hastings algorithm
[animation: a random-walk chain sampling from p(x); one proposal is accepted with probability A(x^(t), x^(t+1)) = 0.5, another with A(x^(t), x^(t+1)) = 1]
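The two-step algorithm can be sketched as follows, with a symmetric Gaussian random-walk proposal; both acceptance rules leave the target distribution stationary:

```python
import numpy as np

def mh_sample(log_p, n_samples, step=1.0, barker=False, rng=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal Q.

    barker=False: Metropolis acceptance  min(1, p(x')/p(x))
    barker=True:  Barker acceptance      p(x') / (p(x') + p(x))
    """
    rng = np.random.default_rng(rng)
    x = 0.0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = x + rng.normal(0.0, step)          # Step 1: propose
        ratio = np.exp(log_p(proposal) - log_p(x))
        accept = ratio / (1.0 + ratio) if barker else min(1.0, ratio)
        if rng.random() < accept:                     # Step 2: accept/reject
            x = proposal
        samples[t] = x
    return samples

# Target: standard normal; samples should have mean near 0, sd near 1
log_p = lambda x: -0.5 * x ** 2
s = mh_sample(log_p, 50000, barker=True, rng=0)
print(s.mean(), s.std())
```

Note that only the ratio p(x^(t+1))/p(x^(t)) is needed, so the target never has to be normalized, which is what makes the method usable when p(x) is known only up to a constant.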
Outline Markov chain Monte Carlo Sampling from the prior Sampling from category distributions
Iterated learning (Kirby, 2001)
What are the consequences of learners learning from other learners?
Analyzing iterated learning
[diagram: a chain of learners, each inferring a hypothesis from data via P_L(h|d) and generating data for the next learner via P_P(d|h)]
P_L(h|d): probability of inferring hypothesis h from data d
P_P(d|h): probability of generating data d from hypothesis h
Iterated Bayesian learning
[diagram: learners alternate P_L(h|d) and P_P(d|h)]
Assume learners sample from their posterior distribution:

  P_L(h|d) = P_P(d|h) P(h) / Σ_{h'∈H} P_P(d|h') P(h')
Analyzing iterated learning

d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → …
(each h_i drawn from P_L(h|d_{i-1}); each d_i drawn from P_P(d|h_i))

A Markov chain on hypotheses:
h_1 → h_2 → h_3 → … (draw d from P_P(d|h), then h from P_L(h|d))

A Markov chain on data:
d_0 → d_1 → d_2 → … (draw h from P_L(h|d), then d from P_P(d|h))
Stationary distributions
• Markov chain on h converges to the prior, P(h)
• Markov chain on d converges to the “prior predictive distribution”

  P(d) = Σ_{h∈H} P(d|h) P(h)

(Griffiths & Kalish, 2005)
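This convergence result can be checked with a small simulation of iterated Bayesian learning; the hypothesis space, likelihoods, and prior below are toy values invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hypothesis space: 3 hypotheses, 4 possible data points (illustrative)
prior = np.array([0.5, 0.3, 0.2])                  # P(h)
lik = np.array([[0.7, 0.1, 0.1, 0.1],              # rows: P(d | h)
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])

h = 0
counts = np.zeros(3)
for _ in range(100000):
    d = rng.choice(4, p=lik[h])                    # teacher generates data, P_P(d|h)
    post = lik[:, d] * prior                       # learner's posterior over h
    post /= post.sum()
    h = rng.choice(3, p=post)                      # learner samples h, P_L(h|d)
    counts[h] += 1

print(counts / counts.sum())  # approaches the prior (0.5, 0.3, 0.2)
```

However informative the likelihoods are, the chain of hypotheses visits each h with frequency approaching P(h), independent of the data generated along the way.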
Explaining convergence to the prior
[diagram: learners alternate P_L(h|d) and P_P(d|h)]
• Intuitively: data acts once, prior many times
• Formally: iterated learning with Bayesian agents is a Gibbs sampler on P(d, h)
(Griffiths & Kalish, in press)
Revealing inductive biases • Many problems in cognitive science can be formulated as problems of induction – learning languages, concepts, and causal relations • Such problems are not solvable without bias (e.g., Goodman, 1955; Kearns & Vazirani, 1994; Vapnik, 1995) • What biases guide human inductive inferences?
If iterated learning converges to the prior, then it may provide a method for investigating biases
Serial reproduction (Bartlett, 1932) • Participants see stimuli, then reproduce them from memory • Reproductions of one participant are stimuli for the next • Stimuli were interesting, rather than controlled – e.g., “War of the Ghosts”
General strategy • Use well-studied and simple stimuli for which people’s inductive biases are known – function learning – concept learning – color words • Examine dynamics of iterated learning – convergence to state reflecting biases – predictable path to convergence
Iterated function learning
[data → hypotheses]
• Each learner sees a set of (x, y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
(Kalish, Griffiths, & Lewandowsky, in press)
Function learning experiments
[figure: stimulus, response slider, feedback]
Examine iterated learning with different initial data
[figure: chains from different initial data, iterations 1–9]
Identifying inductive biases • Formal analysis suggests that iterated learning provides a way to determine inductive biases • Experiments with human learners support this idea – when stimuli for which biases are well understood are used, those biases are revealed by iterated learning • What do inductive biases look like in other cases?
– continuous categories – causal structure – word learning – language learning
Statistics and cultural evolution • Iterated learning for MAP learners reduces to a form of the stochastic EM algorithm – Monte Carlo EM with a single sample • Provides connections between cultural evolution and classic models used in population genetics – MAP learning of multinomials = Wright-Fisher • More generally, an account of how products of cultural evolution relate to the biases of learners
Outline Markov chain Monte Carlo Sampling from the prior Sampling from category distributions
Categories are central to cognition
Sampling from categories
Frog distribution P(x|c)
A task Ask subjects which of two alternatives comes from a target category Which animal is a frog?
A Bayesian analysis of the task Assume:
Response probabilities
If people probability match to the posterior, the response probability is equivalent to the Barker acceptance function for target distribution p(x|c)
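A minimal simulation of this idea: a simulated subject who probability-matches with the Barker rule produces a chain whose samples come from p(x|c). The Gaussian category density and the proposal scheme below are illustrative assumptions, not the experiment's stimuli:

```python
import numpy as np

rng = np.random.default_rng(0)

def category_density(x, mu=0.0, sigma=1.0):
    # Assumed Gaussian category distribution p(x | c); normalization is
    # irrelevant because only ratios enter the choice rule
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

x = 3.0                                            # current stimulus
chain = []
for trial in range(50000):
    proposal = x + rng.normal(0.0, 1.0)            # alternative stimulus
    p_new, p_old = category_density(proposal), category_density(x)
    # Probability matching: choose the alternative with probability
    # p(x'|c) / (p(x'|c) + p(x|c)) -- the Barker acceptance function
    if rng.random() < p_new / (p_new + p_old):
        x = proposal
    chain.append(x)

chain = np.array(chain[1000:])                     # discard burn-in
print(chain.mean(), chain.std())  # near the category mean 0 and sd 1
```

The subject never reports the density directly; the sequence of forced choices alone recovers it.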
Collecting the samples
Trial 1: Which is the frog? Trial 2: Which is the frog? Trial 3: Which is the frog? …
Verifying the method
Training Subjects were shown schematic fish of different sizes and trained on whether they came from the ocean (uniform) or a fish farm (Gaussian)
Between-subject conditions
Choice task Subjects judged which of the two fish came from the fish farm (Gaussian) distribution
Examples of subject MCMC chains
Estimates from all subjects • Estimated means and standard deviations are significantly different across groups • Estimated means are accurate, but standard deviation estimates are high – result could be due to perceptual noise or response gain
Sampling from natural categories Examined distributions for four natural categories: giraffes, horses, cats, and dogs Presented stimuli with nine-parameter stick figures (Olman & Kersten, 2004)
Choice task
Samples from Subject 3 (projected onto plane from LDA)
giraffe horse cat dog Mean animals by subject S1 S2 S3 S4 S5 S6 S7 S8
Marginal densities (aggregated across subjects) Giraffes are distinguished by neck length, body height and body tilt Horses are like giraffes, but with shorter bodies and nearly uniform necks Cats have longer tails than dogs
Relative volume of categories
(convex hull content divided by minimum enclosing hypercube content)
Giraffe: 0.00004
Horse: 0.00006
Cat: 0.00003
Dog: 0.00002
Discrimination method (Olman & Kersten, 2004)
Parameter space for discrimination Restricted so that most random draws were animal-like
MCMC and discrimination means
Conclusion • Markov chain Monte Carlo provides a way to sample from subjective probability distributions • Many interesting questions can be framed in terms of subjective probability distributions – inductive biases (priors) – mental representations (category distributions) • Other MCMC methods may provide further empirical methods… – Gibbs for categories, adaptive MCMC, …
A different approach…
Instead of asking whether people are rational, use the assumption of rationality to investigate cognition.
If we can predict people’s responses, we can design experiments that measure psychological variables.
[randomized algorithms ↔ psychological experiments]
From sampling to maximizing

  P_L(h|d) = [P_P(d|h) P(h)]^r / Σ_{h'∈H} [P_P(d|h') P(h')]^r

r = 1 (sampling from the posterior), r = 2, …, r → ∞ (maximizing)
From sampling to maximizing
• General analytic results are hard to obtain
– (r → ∞ is Monte Carlo EM with a single sample)
• For certain classes of languages, it is possible to show that the stationary distribution gives each hypothesis h probability proportional to P(h)^r
– the ordering identified by the prior is preserved, but not the corresponding probabilities
(Kirby, Dowman, & Griffiths, in press)
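To see what a stationary distribution proportional to P(h)^r does to the prior, it is enough to exponentiate and renormalize a toy prior (the numbers here are illustrative):

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])  # an illustrative prior P(h)

def exponentiate(p, r):
    """Renormalized distribution proportional to p**r."""
    q = p ** r
    return q / q.sum()

for r in [1, 2, 10]:
    print(r, exponentiate(prior, r))
```

At r = 1 the prior is recovered exactly; by r = 10 nearly all probability sits on the most favored hypothesis, while the ordering of hypotheses never changes. This is the sense in which weak biases can produce strong universals.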
Implications for linguistic universals
• When learners sample from P(h|d), the distribution over languages converges to the prior
– identifies a one-to-one correspondence between inductive biases and linguistic universals
• As learners move towards maximizing, the influence of the prior is exaggerated
– weak biases can produce strong universals
– cultural evolution is a viable alternative to traditional explanations for linguistic universals
Iterated concept learning
[data → hypotheses]
• Each learner sees examples from a species
• Identifies the species of four amoebae
• Iterated learning is run within-subjects
(Griffiths, Christian, & Kalish, in press)
Two positive examples
[figure: data (d) and hypotheses (h)]
Bayesian model (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001)

  P(h|d) = P(d|h) P(h) / Σ_{h'∈H} P(d|h') P(h')

  P(d|h) = 1/|h|^m if d ⊆ h, 0 otherwise

  P(h|d) = P(h) / Σ_{h': d ⊆ h'} P(h') if d ⊆ h, 0 otherwise

d: 2 amoebae; h: set of 4 amoebae
m: # of amoebae in the set d (= 2)
|h|: # of amoebae in the set h (= 4)
Posterior is renormalized prior
What is the prior?
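The model above, with its size-principle likelihood 1/|h|^m for consistent hypotheses, can be sketched in a few lines; the hypothesis sets and prior values here are illustrative placeholders, not the amoeba stimuli from the experiment:

```python
import numpy as np

def posterior(d, hypotheses, prior):
    """P(h|d) with P(d|h) = 1/|h|**m if d is a subset of h, else 0."""
    m = len(d)
    lik = np.array([1.0 / len(h) ** m if d <= h else 0.0 for h in hypotheses])
    post = lik * prior
    return post / post.sum()

# Illustrative hypotheses: sets of 4 objects, with an assumed prior
hypotheses = [frozenset({1, 2, 3, 4}),
              frozenset({1, 2, 5, 6}),
              frozenset({3, 4, 5, 6})]
prior = np.array([0.6, 0.3, 0.1])
d = {1, 2}
print(posterior(d, hypotheses, prior))  # [2/3, 1/3, 0]
```

Because every hypothesis has the same size, the likelihood is constant across all consistent hypotheses, so the posterior is just the prior renormalized over the hypotheses containing d, which is why iterated learning with this model can expose the prior directly.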
Classes of concepts (Shepard, Hovland, & Jenkins, 1961)
Class 1, Class 2, Class 3, Class 4, Class 5, Class 6
[stimulus dimensions: color, size, shape]
Experiment design (for each subject)
6 iterated learning chains (one per class, Class 1–6)
6 independent learning “chains” (one per class, Class 1–6)
Estimating the prior
[data (d) for each of Class 1 … Class 6]
Estimated prior (human subjects vs. Bayesian model):
Class 1: 0.861
Class 2: 0.087
Class 3: 0.009
Class 4: 0.002
Class 5: 0.013
Class 6: 0.028
r = 0.952
Two positive examples (n = 20)
[figures: human learners vs. Bayesian model, across iterations]
Three positive examples
[figure: data (d) and hypotheses (h)]
Three positive examples (n = 20)
[figures: human learners vs. Bayesian model, across iterations]
Classification objects
Parameter space for discrimination Restricted so that most random draws were animal-like
MCMC and discrimination means
Problems with classification objects
[figure: Category 1 vs. Category 2 structures]
Convex hull content divided by minimum enclosing hypercube content:
Giraffe: 0.00004
Horse: 0.00006
Cat: 0.00003
Dog: 0.00002
Allowing a Wider Range of Behavior
An exponentiated choice rule results in a Markov chain with stationary distribution corresponding to an exponentiated version of the category distribution, proportional to a power of p(x|c)
Category drift • For fragile categories, the MCMC procedure could influence the category representation • Interleaved training and test blocks in the training experiments