
Bayesian models of human
inductive learning
Josh Tenenbaum
MIT
Everyday inductive leaps
How can people learn so much about the
world from such limited evidence?
– Kinds of objects and their properties
– The meanings of words, phrases, and sentences
– Cause-effect relations
– The beliefs, goals and plans of other people
– Social structures, conventions, and rules
[Figure: novel objects labeled “tufa” – learning a word from a few examples]
Modeling Goals
• Explain how and why human learning and reasoning work, in terms of
(approximations to) optimal statistical inference in natural environments.
• Computational-level theories that provide insights into
algorithmic- or processing-level questions.
• Principled quantitative models of human behavior, with
broad coverage and a minimum of free parameters and ad
hoc assumptions.
• A framework for studying people’s implicit knowledge
about the world: how it is structured, used, and acquired.
• A two-way bridge to state-of-the-art AI, machine learning.
Explaining inductive learning
1. How does background knowledge guide learning
from sparsely observed data? Bayesian inference, with
priors based on background knowledge.
2. What form does background knowledge take, across
different domains and tasks? Probabilities defined over
structured representations: graphs, grammars, rules, logic,
relational schemas, theories.
3. How is background knowledge itself learned?
Hierarchical Bayesian models, with inference at multiple levels of
abstraction.
4. How can background knowledge constrain learning
yet maintain flexibility, balancing assimilation and
accommodation? Nonparametric models, growing in
complexity as the data require.
Two case studies
• The number game
• Property induction
The number game
• Program input: number between 1 and 100
• Program output: “yes” or “no”
The number game
• Learning task:
– Observe one or more positive (“yes”) examples.
– Judge whether other numbers are “yes” or “no”.
The number game
[Figure: human generalization judgments (N = 20) for different sets of
“yes” examples]
– 60 → Diffuse similarity
– 60 80 10 30 → Rule: “multiples of 10”
– 60 52 57 55 → Focused similarity: numbers near 50-60
The number game
[Figure: human generalization judgments (N = 20)]
– 16 → Diffuse similarity
– 16 8 2 64 → Rule: “powers of 2”
– 16 23 19 20 → Focused similarity: numbers near 20
The number game
Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or
rule-based (all-or-none).
– Learning from just a few positive examples.
Divisions into “rule” and
“similarity” subsystems?
• Category learning
– Nosofsky, Palmeri et al.: RULEX
– Erickson & Kruschke: ATRIUM
• Language processing
– Pinker, Marcus et al.: Past tense morphology
• Reasoning
– Sloman
– Rips
– Nisbett, Smith et al.
Bayesian model
• H: Hypothesis space of possible concepts:
– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
...
Representational interpretations for H:
– Candidate rules
– Features for similarity
– “Consequential subsets” (Shepard, 1987)
Three hypothesis subspaces for
number concepts
• Mathematical properties (24 hypotheses):
– Odd, even, square, cube, prime numbers
– Multiples of small integers
– Powers of small integers
• Raw magnitude (5050 hypotheses):
– All intervals of integers with endpoints between
1 and 100.
• Approximate magnitude (10 hypotheses):
– Decades (1-10, 10-20, 20-30, …)
Bayesian model
• H: Hypothesis space of possible concepts:
– Mathematical properties: even, odd, square, prime, . . . .
– Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
– Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:
p(h \mid X) = \frac{p(X \mid h)\, p(h)}{\sum_{h' \in H} p(X \mid h')\, p(h')}
– p(h) [prior]: domain knowledge, pre-existing biases
– p(X|h) [likelihood]: statistical information in examples.
– p(h|X) [posterior]: degree of belief that h is the true extension of C.
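To make the computation concrete, here is a minimal Python sketch of this posterior for the number game. It is illustrative rather than the original model: the prior is uniform over a handful of hypotheses, and the likelihood uses the size principle introduced a few slides below.

```python
# Minimal sketch of the number-game posterior (illustrative, not the
# original model): uniform prior over a few hypotheses, size-principle
# likelihood.

def posterior(X, hypotheses, prior):
    """p(h|X) ∝ p(X|h) p(h), where p(X|h) = (1/|h|)^n if all x in h, else 0."""
    n = len(X)
    unnorm = {name: ((1.0 / len(h)) ** n * prior[name]
                     if all(x in h for x in X) else 0.0)
              for name, h in hypotheses.items()}
    Z = sum(unnorm.values())
    return {name: p / Z for name, p in unnorm.items()}

hypotheses = {
    "even numbers":    set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
    "powers of 2":     {2, 4, 8, 16, 32, 64},
    "50 to 60":        set(range(50, 61)),
}
prior = {name: 1 / len(hypotheses) for name in hypotheses}

print(posterior([60], hypotheses, prior))              # belief stays spread out
print(posterior([60, 80, 10, 30], hypotheses, prior))  # concentrates on "multiples of 10"
```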
Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X), the probability that C
applies to some new stimulus y?
[Figure: background knowledge generates a hypothesis h; h generates the
examples X = {x1, x2, x3, x4}; the query is whether y ∈ C]
Generalizing to new objects
Hypothesis averaging:
Compute the probability that C applies to some new object y by
averaging the predictions of all hypotheses h, weighted by p(h|X):
p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X)
                  = \sum_{h \supseteq \{y\} \cup X} p(h \mid X)
where p(y \in C \mid h) = 1 if y \in h, and 0 if y \notin h.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater likelihood, and
exponentially more so as n increases.
p(X \mid h) = \left( \frac{1}{\mathrm{size}(h)} \right)^n \text{ if } x_1, \ldots, x_n \in h; \quad 0 \text{ if any } x_i \notin h
• Follows from the assumption of randomly sampled examples + the law of
“conservation of belief”:
\sum_{d \in D} p(D = d \mid M) = 1
• Captures the intuition of a “representative” sample.
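The exponential sharpening is easy to check numerically with two hypotheses from the toy space above:

```python
# Likelihood ratio of "multiples of 10" (size 10) over "even numbers"
# (size 50) when all n examples fit both: each example multiplies it by 5.
for n in [1, 2, 3, 4]:
    print(n, (1 / 10) ** n / (1 / 50) ** n)  # 5, 25, 125, 625
```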
Illustrating the size principle
[Figure: number grid 2–100 showing h1 = even numbers (size 50) and
h2 = multiples of 10 (size 10)]
Illustrating the size principle
[Figure: the same grid, with observed examples marked]
Data slightly more of a coincidence under h1
Illustrating the size principle
[Figure: the same grid, with more observed examples marked]
Data much more of a coincidence under h1
Prior: p(h)
• Choice of hypothesis space embodies a strong prior: effectively,
p(h) ~ 0 for many logically possible but conceptually unnatural
hypotheses.
• Prevents overfitting by highly specific but unnatural hypotheses,
e.g. “multiples of 10 except 50 and 70”.
e.g., X = {60 80 10 30}:
p(X \mid \text{multiples of 10}) = \left( \tfrac{1}{10} \right)^4 = 0.0001
p(X \mid \text{multiples of 10 except 50, 70}) = \left( \tfrac{1}{8} \right)^4 \approx 0.00024
The “ugly duckling” theorem
[Figure: 4 objects and the 16 candidate hypotheses – all subsets of the
objects]
How would we generalize without any inductive bias – without
constraints on the hypothesis space, informative priors or likelihoods?
The “ugly duckling” theorem
[Figure: the same 16 hypotheses]
p(X = {3} | h) ∝ [1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0] over the 16 hypotheses
p(4 ∈ C | X = {3}) = (1+1+1+1)/8 = 4/8
p(2 ∈ C | X = {3}) = (1+1+1+1)/8 = 4/8
p(1 ∈ C | X = {3}) = (1+1+1+1)/8 = 4/8
The “ugly duckling” theorem
[Figure: the same 16 hypotheses]
p(X = {3, 1} | h) ∝ [1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0]
p(4 ∈ C | X = {3, 1}) = (1+1)/4 = 2/4
p(2 ∈ C | X = {3, 1}) = (1+1)/4 = 2/4
Without any inductive bias – constraints on hypotheses, informative
priors or likelihoods – no meaningful generalization!
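The claim can be verified by brute force; a small sketch with a uniform prior over all 2^4 subsets and a likelihood that only checks consistency:

```python
# With no inductive bias (all subsets as hypotheses, uniform prior,
# consistency-only likelihood), every object generalizes at exactly 0.5.
from itertools import combinations

objects = [1, 2, 3, 4]
hyps = [set(c) for r in range(len(objects) + 1)
        for c in combinations(objects, r)]

def p_gen(y, X):
    consistent = [h for h in hyps if set(X) <= h]
    return sum(1 for h in consistent if y in h) / len(consistent)

print(p_gen(4, [3]))     # 0.5
print(p_gen(2, [3, 1]))  # 0.5
```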
Posterior:
p(h \mid X) = \frac{p(X \mid h)\, p(h)}{\sum_{h' \in H} p(X \mid h')\, p(h')}
• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even
numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
10 except 50 and 20”? p(h).
• Why does a good generalization need both high
prior and high likelihood? p(h|X) ~ p(X|h) p(h)
Prior: p(h)
• Choice of hypothesis space embodies a strong prior: effectively,
p(h) ~ 0 for many logically possible but conceptually unnatural
hypotheses.
• Prevents overfitting by highly specific but unnatural hypotheses,
e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative weights of alternative theories:
H: Total hypothesis space
– H1: Math properties (24 hypotheses: even numbers, powers of two,
multiples of three, …), p(H1) = 1/5; within H1, p(h) = p(H1) / 24
– H2: Raw magnitude (5050 hypotheses: 10-15, 20-32, 37-54, …),
p(H2) = 3/5; within H2, p(h) = p(H2) / 5050
– H3: Approx. magnitude (10 hypotheses: 10-20, 20-30, 30-40, …),
p(H3) = 1/5; within H3, p(h) = p(H3) / 10
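A sketch of this structured prior in Python, using the subspace weights and counts from the slide (the helper name is my own):

```python
# Prior over hypotheses: each subspace's weight is split evenly among
# that subspace's hypotheses.
subspaces = {
    "math properties":  {"weight": 1 / 5, "count": 24},
    "raw magnitude":    {"weight": 3 / 5, "count": 5050},
    "approx magnitude": {"weight": 1 / 5, "count": 10},
}

def p_h(subspace):
    s = subspaces[subspace]
    return s["weight"] / s["count"]

print(p_h("math properties"))   # ≈ 0.0083,   e.g. "even numbers"
print(p_h("raw magnitude"))     # ≈ 0.00012,  e.g. the interval 37-54
print(p_h("approx magnitude"))  # 0.02,       e.g. the decade 20-30
```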
[Figure: human generalization vs. Bayesian model predictions for the
example sets 60; 60 80 10 30; 60 52 57 55; 16; 16 8 2 64; 16 23 19 20]
Summary of the Bayesian model
• How do the statistics of the examples interact with prior knowledge
to guide generalization?
posterior ∝ likelihood × prior
• Why does generalization appear rule-based or similarity-based?
hypothesis averaging + size principle
broad p(h|X) – many h of similar size, or very few examples (i.e. 1):
similarity gradient
narrow p(h|X) – one h much smaller: all-or-none rule
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
Alternative models
• Neural networks
[Figure: network mapping inputs 60, 80, 10, 30 to hypothesis units
“even”, “multiple of 10”, “multiple of 3”, “power of 2”]
Alternative models
• Neural networks
• Hypothesis ranking and elimination
[Figure: hypothesis ranking – given examples 60, 80, 10, 30, ranks
1, 2, 3, 4, … over “even”, “multiple of 10”, “multiple of 3”,
“power of 2”, …]
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
– Average similarity:
p(y \in C \mid X) = \frac{1}{|X|} \sum_{x_j \in X} \mathrm{sim}(y, x_j)
[Figure: data vs. model (r = 0.80) for the example sets 60;
60 80 10 30; 60 52 57 55]
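A sketch of this baseline, with a hypothetical exponential-decay similarity in magnitude (the similarity measure actually used in the comparison is not specified here):

```python
import math

def sim(y, x, scale=10.0):   # hypothetical similarity function
    return math.exp(-abs(y - x) / scale)

def p_avg_sim(y, X):         # average similarity to the examples
    return sum(sim(y, x) for x in X) / len(X)

X = [60, 52, 57, 55]
print(p_avg_sim(55, X))  # high: near every example
print(p_avg_sim(90, X))  # low: far from every example
```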
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
– Flexible similarity? Bayes.
The universal law of generalization
[Figure: probability of generalization decreases with distance in
psychological space]
Explaining the universal law
(Tenenbaum & Griffiths, 2001)
Bayesian generalization when the hypotheses correspond to convex
regions in a low-dimensional metric space (e.g., intervals in one
dimension), with an isotropic prior.
[Figure: the gradient p(y ∈ C | x) around a single example x, evaluated
at query points y1, y2, y3]
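One way to see where the exponential shape comes from, in a sketch following Shepard (1987): a single example x, hypotheses that are intervals containing x, and an Erlang prior on interval size s.

```latex
% Averaging the prediction "y is in the interval" over the posterior on
% intervals containing x, with the Erlang prior p(s) \propto s\,e^{-s/\sigma},
% yields an exactly exponential gradient:
p(y \in C \mid x) \;=\; \int p(y \in C \mid h)\, p(h \mid x)\, dh \;=\; e^{-|x - y| / \sigma}
```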
Asymmetric generalization
[Figure: typicality gradient along a dimension from “camel” to “horse”]
• Assume a gradient of typicality.
• Examples sampled in proportion to their typicality.
• Size of hypotheses now …
Asymmetric generalization
Symmetry may depend heavily on the context:
– Healthy levels of hormone (left) versus healthy levels
of toxin (right)
– Predicting the durations or magnitudes of events.
Modeling word learning
Bayesian inference over tree-structured hypothesis space
(Xu & Tenenbaum; Schmidt & Tenenbaum)
[Figure: objects labeled “tufa” at different levels of a taxonomic
hierarchy]
Taking stock
• A model of high-level, knowledge-driven
inductive reasoning that makes strong quantitative
predictions with minimal free parameters.
(r² > 0.9 for mean judgments on 180 generalization stimuli, with 3 free
numerical parameters)
• Explains qualitatively different patterns of
generalization (rules, similarity) as the output of a
single general-purpose rational inference engine.
• Differently structured hypothesis spaces account
for different kinds of generalization behavior seen
in different domains and contexts.
What’s missing:
How do we choose a good prior?
• Can we describe formally how these priors are generated by
abstract knowledge or theories?
• Can we move from ‘weak rational analysis’ to ‘strong
rational analysis’ in inductive learning?
– “Weak”: behavior consistent with some reasonable prior.
– “Strong”: behavior consistent with the “correct” prior given the
structure of the world (c.f., ideal observer analyses in vision).
• Can we explain how people learn these rich priors?
• Can we work with more flexible priors, not just restricted to
a small subset of all logically possible concepts?
– Would like to be able to learn any concept, even complex and
unnatural ones, given enough data (a non-dogmatic prior).
Property induction
• How likely is the conclusion, given the premises?

Gorillas have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
----------------------------
Flies have T9 hormones.

Gorillas have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
----------------------------
Horses have T9 hormones.

“Similarity”, “Typicality”, “Diversity”

Gorillas have T9 hormones.
Chimps have T9 hormones.
Monkeys have T9 hormones.
Baboons have T9 hormones.
----------------------------
Horses have T9 hormones.
The computational problem
(“Transfer Learning”, “Semi-Supervised Learning”)
[Figure: matrix of 10 species (Horse, Cow, Chimp, Gorilla, Mouse,
Squirrel, Dolphin, Seal, Rhino, Elephant) × observed features, plus a
new-property column filled with “?”]
85 features for 50 animals (Osherson et al.): e.g., for Elephant:
‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’,
‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’,
‘fourlegs’,…
X: Horses have T9 hormones.
   Rhinos have T9 hormones.
   Cows have T9 hormones.
P(Y \mid X) = \frac{\sum_{h\ \text{consistent with}\ X, Y} P(h)}{\sum_{h\ \text{consistent with}\ X} P(h)}
[Figure: candidate property extensions h over the 10 species, each with
prior P(h)]
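A minimal sketch of this computation with a made-up toy hypothesis space (the extensions and priors below are invented for illustration; the real model derives P(h) from structured representations like the tree on the next slide):

```python
# P(Y|X): prior mass of property extensions consistent with the premises
# X that also cover the conclusion Y, normalized by the mass of all
# extensions consistent with X.

def p_conclusion(Y, X, hypotheses):
    num = sum(p for h, p in hypotheses if set(X) <= h and set(Y) <= h)
    den = sum(p for h, p in hypotheses if set(X) <= h)
    return num / den

# Toy extensions with toy priors P(h):
hypotheses = [
    ({"horse", "cow", "rhino", "elephant"},          0.3),
    ({"horse", "cow", "rhino", "elephant", "seal"},  0.1),
    ({"chimp", "gorilla"},                           0.3),
    ({"horse", "cow"},                               0.3),
]

X = ["horse", "rhino", "cow"]
print(p_conclusion(["elephant"], X, hypotheses))  # 1.0 in this toy space
print(p_conclusion(["seal"], X, hypotheses))      # 0.25
```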
Hierarchical Bayesian Framework
(Kemp & Tenenbaum)
P(form) → F: form
P(structure | form) → S: structure, e.g. a tree with species at the
leaf nodes (mouse, squirrel, chimp, gorilla)
P(data | structure) → D: data, e.g. observed features F1–F4 plus the
new property “Has T9 hormones”, with “?” for unobserved species
P(D|S): How the structure constrains the
data of experience
• Define a stochastic process over structure S that
generates candidate property extensions h.
– Intuition: properties should vary smoothly over structure.
Smooth: P(h) high
Not smooth: P(h) low
P(D|S): How the structure constrains the data of experience
[Figure: structure S → continuous values y via a Gaussian process
(~ random walk, diffusion) [Zhu, Lafferty & Ghahramani 2003] →
threshold → binary property extension h]
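A sketch of this generative process in Python/numpy, with a 4-node chain standing in for the learned structure (the graph, regularization constant, and seed are my choices, not the paper's):

```python
import numpy as np

# Chain graph: mouse - squirrel - chimp - gorilla
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A            # graph Laplacian of S
K = np.linalg.inv(L + 0.1 * np.eye(4))    # covariance favoring smooth y

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(4), K)  # continuous property values
h = y > 0                                    # threshold -> binary extension
print(y.round(2), h)  # neighbors in the graph tend to agree in h
```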
Structure S → Data D
[Figure: tree-structured S over Species 1–10 generating the species ×
features matrix D, plus a new-property column of “?” to be inferred]
85 features for 50 animals (Osherson et al.): e.g., for Elephant:
‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’,
‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’,
‘fourlegs’,…
[c.f., Lawrence, 2004; Smola & Kondor 2003]
Probability of generalization:
[Figure: tree-based vs. 2D-space model predictions against human
judgments, for premises such as “Cows have property P. Elephants have
property P. Horses have property P.” and conclusions about gorillas,
mice, seals, and all mammals]
Testing different priors
[Figure: model fits under four inductive biases – correct bias, wrong
bias, too weak bias, too strong bias]
Learning about spatial properties
[Figure: tree vs. 2D model predictions]
Geographic inference task: “Given that a certain kind of native
American artifact has been found in sites near city X, how likely is
the same artifact to be found near city Y?”
Summary so far
• A framework for modeling human inductive
reasoning as rational statistical inference over
structured knowledge representations
– Qualitatively different priors are appropriate for different
domains of property induction.
– In each domain, a prior that matches the world’s structure
fits people’s judgments well, and better than alternative
priors.
– A language for representing different theories: graph
structure defined over objects + probabilistic model for the
distribution of properties over that graph.
• Remaining question: How can we learn
appropriate structures for different domains?
Hierarchical Bayesian Framework
F: form – Linear, Tree, Clusters. P(form) favors simplicity.
S: structure – arrangements of mouse, squirrel, chimp, gorilla under
each form. P(structure | form) favors smoothness [Zhu et al., 2003].
D: data – the species × features matrix (F1–F4).
[Figure: alternative structural forms organizing the same species]
Hypothesis space of structural forms
Partition, Hierarchy, Tree, Order, Chain, Ring, Grid, Cylinder
Development of structural forms
as more data are observed
The “blessing of abstraction”
• Often quicker to learn at higher levels of abstraction.
– Quicker to learn that you have a biased coin than to learn its
precise bias, or to learn that you have a second-order polynomial than
to learn the precise coefficients (see the sketch after this list).
– Quicker to learn that shape matters most for labeling object categories
than to learn the labels for most categories.
– Quicker to learn that a domain is tree-structured than to learn the
precise tree that best characterizes it.
• Explanation in hierarchical Bayesian models:
– At higher levels, hypothesis spaces get smaller and simpler, and draw
support (albeit indirectly) from a broader sample of data.
– The total hypothesis space gets bigger when we add levels of
abstraction, but the effective number of degrees of freedom only
decreases, because higher levels specify constraints on lower levels.
– Hence the overall learning problem becomes easier.
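A toy numerical illustration of the biased-coin point (my construction, not from the talk): pool a few flips from each of several coins minted with a shared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
true_biases = rng.beta(8, 2, size=10)             # the mint leans heads
flips = [rng.random(5) < b for b in true_biases]  # only 5 flips per coin

print(np.concatenate(flips).mean())         # pooled rate ~0.8: the
                                            # mint-level bias is clear
print([round(f.mean(), 1) for f in flips])  # per-coin estimates noisy
```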
Beyond “Nativism” versus “Empiricism”
• “Nativism”: Explicit knowledge of structural
forms for core domains is innate.
– Atran (1998): The tendency to group living kinds into
hierarchies reflects an “innately determined cognitive
structure”.
– Chomsky (1980): “The belief that various systems of mind are
organized along quite different principles leads to the natural
conclusion that these systems are intrinsically determined, not
simply the result of common mechanisms of learning or
growth.”
• “Empiricism”: General-purpose learning systems
without explicit knowledge of structural form.
– Connectionist networks (e.g., Rogers and McClelland, 2004).
– Traditional structure learning in probabilistic graphical models.
Conclusions
• Computational tools for studying core questions of human learning
(and building more human-like machine learning):
– What is the structure of knowledge, at multiple levels of abstraction?
– How does abstract domain knowledge guide new learning?
– How can abstract domain knowledge itself be learned?
– How can inductive biases provide strong constraints yet be flexible?
• A different way to think about the development of cognition.
– Powerful abstractions can be learned “from the top down”, together
with or prior to learning more concrete knowledge.
• Go beyond the traditional “either-or” dichotomies:
– How can probabilistic inference over symbolic hypotheses span the
range of “rule-based” to “similarity-based” generalization?
– How can domain-general learning mechanisms acquire domain-specific
representations?
– How can structured symbolic representations be acquired by
statistical learning?