Revealing inductive biases through iterated learning

Revealing inductive biases
with Bayesian models
Tom Griffiths
UC Berkeley
with Mike Kalish, Brian Christian, and Steve Lewandowsky
Inductive problems
Learning languages from utterances
blicket toma
dax wug
blicket wug
SXY
X  {blicket,dax}
Y  {toma, wug}
Learning categories from instances of their members
Learning functions from (x,y) pairs
Generalization requires induction
Generalization: predicting the properties of
an entity from observed properties of others
[figure: observed data points on x and y axes]
What makes a good inductive learner?
• Hypothesis 1: more representational power
– more hypotheses, more complexity
– spirit of many accounts of learning and development
Some hypothesis spaces
Linear functions:
g(x) = p_1 x + p_0
Quadratic functions:
g(x) = p_2 x^2 + p_1 x + p_0
8th-degree polynomials:
g(x) = sum_{j=0}^{8} p_j x^j
Minimizing squared error
[figures: functions from each hypothesis space fit to the same data by minimizing squared error]
Measuring prediction error
[figure: prediction error of the fitted functions on new data]
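To make these two steps concrete, here is a minimal sketch in Python/NumPy; the true function, noise level, and sample sizes are illustrative assumptions, not values from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: 1.5 * x**2 - x + 0.5             # assumed "true" function
    x = rng.uniform(0, 1, 10)                      # small training set, n = 10
    y = f(x) + rng.normal(0, 0.1, 10)
    x_new = rng.uniform(0, 1, 1000)                # held-out (x, y) pairs
    y_new = f(x_new) + rng.normal(0, 0.1, 1000)

    for degree in (1, 2, 8):                       # the three hypothesis spaces
        g = np.polynomial.Polynomial.fit(x, y, degree)   # least-squares fit
        print(f"degree {degree}: "
              f"training error = {np.mean((y - g(x))**2):.4f}, "
              f"prediction error = {np.mean((y_new - g(x_new))**2):.4f}")

The 8th-degree fit drives training error toward zero while its prediction error balloons, which is exactly the contrast these slides illustrate.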
What makes a good inductive learner?
• Hypothesis 1: more representational power
– more hypotheses, more complexity
– spirit of many accounts of learning and development
• Hypothesis 2: good inductive biases
– constraints on hypotheses that match the environment
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
A simple schema for induction
• Data D are n pairs (x, y) generated from a function f
• Hypothesis space of functions, y = g(x)
• Error is E = (y - g(x))^2
• Pick the function g that minimizes error on D
• Measure prediction error, averaging over x and y
[figure: data points and a candidate function on x and y axes]
Bias and variance
• A good learner makes (f(x) - g(x))^2 small
• g is chosen on the basis of the data D
• Evaluate learners by the average of (f(x) - g(x))^2 over data D generated from f

E_{p(D)}[(f(x) - g(x))^2] = (f(x) - E_{p(D)}[g(x)])^2 + E_{p(D)}[(g(x) - E_{p(D)}[g(x)])^2]
                          = bias^2 + variance
(Geman, Bienenstock, & Doursat, 1992)
Making things more intuitive…
• The next few slides were generated by:
– choosing a true function f(x)
– generating a number of datasets D from p(x, y), defined by uniform p(x) and p(y|x) = f(x) plus noise
– finding the function g(x) in the hypothesis space that minimized the error on D (see the sketch below)
• Comparing the average of g(x) to f(x) reveals the bias
• The spread of g(x) around that average is the variance
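A minimal sketch of that procedure, assuming a quadratic f(x) and Gaussian noise (illustrative choices, not the talk's actual settings):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: 1.5 * x**2 - x + 0.5         # assumed true function
    x_test = np.linspace(0, 1, 50)

    for degree in (1, 2, 8):                   # linear, quadratic, 8th-degree
        preds = []
        for _ in range(1000):                  # many datasets D, each n = 10
            x = rng.uniform(0, 1, 10)          # uniform p(x)
            y = f(x) + rng.normal(0, 0.1, 10)  # p(y|x) = f(x) plus noise
            g = np.polynomial.Polynomial.fit(x, y, degree)
            preds.append(g(x_test))
        preds = np.array(preds)
        bias2 = np.mean((f(x_test) - preds.mean(axis=0))**2)   # squared bias
        variance = np.mean(preds.var(axis=0))                   # variance
        print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")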
Linear functions (n = 10)
pink is g(x) for each dataset, red is the average g(x), black is f(x)
[figure: the gap between the red and black curves is the bias; the spread of the pink curves around red is the variance]
Quadratic functions (n = 10)
pink is g(x) for each dataset, red is the average g(x), black is f(x)
[figure]
8th-degree polynomials (n = 10)
pink is g(x) for each dataset, red is the average g(x), black is f(x)
[figure]
Bias and variance
(for our quadratic f(x), with n = 10)
Linear functions: high bias, medium variance
Quadratic functions: low bias, low variance
8th-degree polynomials: low bias, super-high variance
In general…
• Larger hypothesis spaces result in higher variance, but lower bias, across a range of f(x)
• The bias-variance tradeoff:
– if we want a learner that has low bias on a range of
problems, we pay a price in variance
• This is mainly an issue when n is small
– the regime of much of human learning
Quadratic functions (n = 100)
pink is g(x) for each dataset, red is the average g(x), black is f(x)
[figure]
8th-degree polynomials (n = 100)
pink is g(x) for each dataset, red is the average g(x), black is f(x)
[figure]
The moral
• General-purpose learning mechanisms do not work
well with small amounts of data
– more representational power isn’t always better
• To make good predictions from small amounts of
data, you need a bias that matches the problem
– these biases are the key to successful induction, and
characterize the nature of an inductive learner
• So… how can we identify human inductive biases?
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
Bayesian inference
• Rational procedure for
updating beliefs
• Foundation of many
learning algorithms
• Lets us make the
inductive biases of
learners precise
[portrait: Reverend Thomas Bayes]
Bayes’ theorem

P(h | d) = P(d | h) P(h) / Σ_{h' ∈ H} P(d | h') P(h')

h: hypothesis
d: data
P(h | d): posterior probability
P(d | h): likelihood
P(h): prior probability
The sum in the denominator ranges over the space of hypotheses H.
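As a minimal numerical sketch of this computation (the three hypotheses, the prior, and the likelihood values below are made up for illustration):

    import numpy as np

    prior = np.array([0.5, 0.3, 0.2])        # P(h) for three hypotheses
    likelihood = np.array([0.1, 0.4, 0.8])   # P(d|h) for one observed datum d

    posterior = likelihood * prior           # numerator of Bayes' theorem
    posterior /= posterior.sum()             # divide by the sum over h in H
    print(posterior)                         # P(h|d) ≈ [0.152, 0.364, 0.485]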
Priors and biases
• Priors indicate the kind of world a learner
expects to encounter, guiding their conclusions
• In our function learning example…
– the likelihood gives the data a probability that decreases with the sum of squared errors (i.e., Gaussian noise)
– priors are uniform over all functions in hypothesis
spaces of different kinds of polynomials
– having more functions corresponds to a belief in a
more complex world…
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
Two ways of using Bayesian models
• Specify models that make different
assumptions about priors, and compare
their fit to human data
(Anderson & Schooler, 1991;
Oaksford & Chater, 1994;
Griffiths & Tenenbaum, 2006)
• Design experiments explicitly intended
to reveal the priors of Bayesian learners
Iterated learning
(Kirby, 2001)
[figure: a chain of learners, each learning from data generated by the previous learner]
What are the consequences of learners
learning from other learners?
Objects of iterated learning
• Knowledge communicated across generations
through provision of data by learners
• Examples:
– religious concepts
– social norms
– myths and legends
– causal theories
– language
Analyzing iterated learning
[diagram: each learner infers a hypothesis from data via PL(h|d), then generates data for the next learner via PP(d|h)]
PL(h|d): probability of inferring hypothesis h from data d
PP(d|h): probability of generating data d from hypothesis h
Markov chains

x(1) → x(2) → x(3) → x(4) → …

Transition matrix: T = P(x(t+1) | x(t))
• Variable x(t+1) is independent of history given x(t)
• Converges to a stationary distribution under easily checked conditions (i.e., if it is ergodic)
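A minimal sketch of computing such a stationary distribution (the 3-state transition matrix below is made up):

    import numpy as np

    # Hypothetical 3-state chain: T[i, j] = P(x(t+1) = j | x(t) = i)
    T = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.3, 0.6]])

    # Stationary distribution = left eigenvector of T with eigenvalue 1
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    print(pi)          # pi @ T equals pi, up to numerical error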
Analyzing iterated learning

d0 → h1 → d1 → h2 → d2 → h3 → …
(each hypothesis is sampled via PL(h|d), each dataset via PP(d|h))

A Markov chain on hypotheses:
h1 → h2 → h3 → …  with transition probability Σ_d PP(d|h) PL(h'|d)

A Markov chain on data:
d0 → d1 → d2 → …  with transition probability Σ_h PL(h|d) PP(d'|h)
Iterated Bayesian learning

[diagram: the same chain of learners, alternating PL(h|d) and PP(d|h)]

Assume learners sample from their posterior distribution:

PL(h | d) = PP(d | h) P(h) / Σ_{h' ∈ H} PP(d | h') P(h')
Stationary distributions
• Markov chain on h converges to the prior, P(h)
• Markov chain on d converges to the “prior predictive distribution”, P(d) = Σ_h P(d | h) P(h)
(Griffiths & Kalish, 2005)
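A minimal simulation sketch of this result, with a made-up discrete hypothesis space and likelihood: the chain of hypotheses ends up visiting each h in proportion to the prior.

    import numpy as np

    rng = np.random.default_rng(0)
    prior = np.array([0.6, 0.3, 0.1])                 # P(h), made up
    P_d_given_h = np.array([[0.7, 0.1, 0.1, 0.1],     # PP(d|h), rows sum to 1
                            [0.1, 0.7, 0.1, 0.1],
                            [0.1, 0.1, 0.4, 0.4]])

    h, counts = 0, np.zeros(3)
    for _ in range(100000):
        d = rng.choice(4, p=P_d_given_h[h])           # generate data from h
        post = P_d_given_h[:, d] * prior              # PP(d|h) P(h)
        h = rng.choice(3, p=post / post.sum())        # sample from posterior
        counts[h] += 1

    print(counts / counts.sum())                      # ≈ prior [0.6, 0.3, 0.1]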
Explaining convergence to the prior
[diagram: the iterated learning chain again, alternating PL(h|d) and PP(d|h)]
• Intuitively: data acts once, prior many times
• Formally: iterated learning with Bayesian
agents is a Gibbs sampler on P(d,h)
(Griffiths & Kalish, in press)
Revealing inductive biases
• If iterated learning converges to the prior, it
might provide a tool for determining the
inductive biases of human learners
• We can test this by reproducing iterated
learning in the lab, with stimuli for which
human biases are well understood
Iterated function learning
[diagram: data are sets of (x, y) pairs; hypotheses are functions]
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
(Kalish, Griffiths, & Lewandowsky, in press)
Function learning experiments
[screenshots: stimulus, slider response, and feedback displays]
Examine iterated learning with different initial data
[figure: chains over iterations 1–9, one per initial dataset]
Identifying inductive biases
• Formal analysis suggests that iterated learning
provides a way to determine inductive biases
• Experiments with human learners support this idea
– when stimuli for which biases are well understood are
used, those biases are revealed by iterated learning
• What do inductive biases look like in other cases?
– continuous categories
– causal structure
– word learning
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
Conclusions
• Solving inductive problems and forming good
generalizations requires good inductive biases
• Bayesian inference provides a way to make
assumptions about the biases of learners explicit
• Two ways to identify human inductive biases:
– compare Bayesian models assuming different priors
– design tasks to extract biases from Bayesian learners
• Iterated learning provides a lens for magnifying
the inductive biases of learners
– small effects for individuals are big effects for groups
Iterated concept learning
[diagram: data are example amoebae; hypotheses are species (sets of amoebae)]
• Each learner sees examples from a species
• Identifies the species: a set of four amoebae
• Iterated learning is run within-subjects
(Griffiths, Christian, & Kalish, in press)
Two positive examples
[figure: two example amoebae (d) and the candidate species (h) consistent with them]
Bayesian model
(Tenenbaum, 1999; Tenenbaum & Griffiths, 2001)

P(h | d) = P(d | h) P(h) / Σ_{h' ∈ H} P(d | h') P(h')

P(d | h) = 1/|h|^m if d ⊆ h, 0 otherwise

d: 2 amoebae
h: set of 4 amoebae
m: # of amoebae in the set d (= 2)
|h|: # of amoebae in the set h (= 4)

The posterior is the renormalized prior:

P(h | d) = P(h) / Σ_{h': d ⊆ h'} P(h')

What is the prior?
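A minimal sketch of this model (the amoeba IDs, hypothesis space, and prior below are made up; the size-principle likelihood and renormalization follow the slide):

    import numpy as np

    # Hypothetical species: each hypothesis is a set of 4 amoeba IDs
    hypotheses = [frozenset({1, 2, 3, 4}),
                  frozenset({1, 2, 5, 6}),
                  frozenset({3, 4, 5, 6})]
    prior = np.array([0.5, 0.3, 0.2])                  # made-up P(h)

    d = {1, 2}                                         # two positive examples
    m = len(d)

    # Size principle: P(d|h) = 1/|h|^m if d is a subset of h, else 0
    likelihood = np.array([1 / len(h)**m if d <= h else 0.0
                           for h in hypotheses])

    posterior = likelihood * prior
    posterior /= posterior.sum()
    print(posterior)   # the prior renormalized over consistent h: [0.625, 0.375, 0]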
Classes of concepts
(Shepard, Hovland, & Jenkins, 1961)
[figure: the six classes of concepts defined over three binary dimensions (color, size, shape)]
Experiment design (for each subject)
• 6 iterated learning chains, one for each class (Class 1–Class 6)
• 6 independent learning “chains”, one for each class (Class 1–Class 6)
Estimating the prior

[diagram: hypotheses (h) and data (d)]
[figure: human subjects vs. Bayesian model]

Estimated prior:
Class 1: 0.861
Class 2: 0.087
Class 3: 0.009
Class 4: 0.002
Class 5: 0.013
Class 6: 0.028
r = 0.952
Two positive examples (n = 20)
[figures: probability of each hypothesis across iterations for human learners and the Bayesian model, and the two compared]
Three positive examples
[figure: three example amoebae (d) and the candidate species (h) consistent with them]
Three positive examples (n = 20)
[figures: probability of each hypothesis across iterations for human learners and the Bayesian model, and the two compared]
Serial reproduction
(Bartlett, 1932)
• Participants see stimuli,
then reproduce them
from memory
• Reproductions of one
participant are stimuli
for the next
• Stimuli were interesting,
rather than controlled
– e.g., “War of the Ghosts”
Discovering the biases of models
Generic neural network:
[figure: iterated learning run with a generic neural network as the learner]
Discovering the biases of models
EXAM (DeLosh, Busemeyer, & McDaniel, 1997):
[figure: iterated learning run with EXAM as the learner]
Discovering the biases of models
POLE (Kalish, Lewandowsky, & Kruschke, 2004):
[figure: iterated learning run with POLE as the learner]