Graphical Models for
Psychological Categorization
David Danks
Carnegie Mellon University; and
Institute for Human & Machine Cognition
A Puzzle
Concepts & causation are intertwined
Concepts and categorization depend (in part) on
causal beliefs and inferences
Causal learning and reasoning depend (in part) on
the particular concepts we have
But the most prevalent theories in the two
fields use quite different formalisms
Q: Can categorization and causal inference
be represented in a common “language”?
Central Theoretical Claim
Many psychological theories of categorization
are equivalent to
(special cases of) Bayesian categorization of
probabilistic graphical models
(and so the answer to the previous question is “Yes”
– they can share the language of graphical models)
Overview
Bayesian Categorization of
Probabilistic Graphical Models (PGMs)
Psychological Theories of Categorization
Theoretical & Experimental Implications
Bayesian Categorization
Set of exclusive, exhaustive models M
For each model m, a prior probability and a
P(X | m) distribution (perhaps a generative model)
Given X, update the model probabilities:
P(m | X) = P(X | m) P(m) / Σ_{a ∈ M} P(X | a) P(a)
(and use the updated probabilities for choices)
Probabilistic Graphical Models
PGMs were developed to provide compact
representations of probability distributions
All PGMs are defined for a set of variables V,
and composed of:
Graph over (nodes corresponding to) V
Probability distribution/density over V
Probabilistic Graphical Models
Markov assumption: Graph entails certain
(conditional and unconditional) independence
constraints on the probability distribution
Markov assumptions imply a decomposition of the
probability distribution into a product of simpler
terms (i.e., fewer parameters)
Different PGM-types have different graph-types and/or Markov assumptions
Probabilistic Graphical Models
Also assume Faithfulness/Stability:
The only probabilistic independencies are those
implied by Markov
If we do not assume this, then every probability
distribution can be represented by every PGM-type
Faithfulness is assumed explicitly or implicitly by all
PGM learning algorithms
Def’n: A graph is a perfect map iff it is Markov
& Faithful to the probability distribution
Probabilistic Graphical Models
For a particular PGM-type, the set of
probability distributions with a perfect map in
that PGM-type form a natural group
This set will almost always be non-exhaustive
Shorthand: “Probability distribution for PGM” will
mean “Probability distribution for which there is a
perfect map in the PGM-type”
Bayesian Networks
Directed acyclic graph
Markov: each variable in V is independent of its non-descendants conditional on its parents
Example: graph with edges F1 → F4, F2 → F4, F4 → F3
P(F1, F2, F3, F4) =
P(F1) P(F2) P(F4 | F1, F2) P(F3 | F4)
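As a sketch (not from the talk), the factorization above can be checked numerically for binary features; every CPT value below is invented purely for illustration:

```python
# Sketch: joint probability of binary features via the Bayes net
# factorization P(F1,F2,F3,F4) = P(F1) P(F2) P(F4|F1,F2) P(F3|F4).
# All CPT numbers below are made up for illustration.

p_f1 = {1: 0.3, 0: 0.7}                      # P(F1)
p_f2 = {1: 0.5, 0: 0.5}                      # P(F2)
p_f4 = {(1, 1): 0.9, (1, 0): 0.6,            # P(F4=1 | F1, F2)
        (0, 1): 0.4, (0, 0): 0.1}
p_f3 = {1: 0.8, 0: 0.2}                      # P(F3=1 | F4)

def joint(f1, f2, f3, f4):
    """P(F1=f1, F2=f2, F3=f3, F4=f4) as a product of the local terms."""
    pf1 = p_f1[f1]
    pf2 = p_f2[f2]
    pf4 = p_f4[(f1, f2)] if f4 == 1 else 1 - p_f4[(f1, f2)]
    pf3 = p_f3[f4] if f3 == 1 else 1 - p_f3[f4]
    return pf1 * pf2 * pf4 * pf3

# The joint over all 16 states sums to 1, using only 8 parameters
# instead of the 15 a full joint table over four binary features needs.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
```

This is the "fewer parameters" point: the Markov decomposition lets a handful of local terms define the full joint distribution.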
Markov Random Fields
Undirected graph (i.e., no arrowheads)
Markov: each variable in V is independent of its non-neighbors conditional on its neighbors
Example: graph with edges F1 – F4, F2 – F4, F3 – F4
P(F1, F2, F3, F4) =
P(F1, F4) P(F2, F4) P(F3, F4)
Bayesian Categorization of PGMs
Use the standard updating equation:
P(m | X) = P(X | m) P(m) / Σ_{a ∈ M} P(X | a) P(a)
And require the P(X | a) distributions to be
distributions for that PGM-type
I.e., the PGM-type supplies the generative model
Simple Example
Suppose we have two equiprobable models
Left model (F1, F2 unconnected):
P(F1 = 1) = 0.1
P(F2 = 1) = 0.2
Right model (F1 → F2):
P(F1 = 1) = 0.8
P(F2 = 1 | F1 = 1) = 0.8
P(F2 = 1 | F1 = 0) = 0.6
Observe 11, and conclude Right:
P(Left | 11) = 0.03 << P(Right | 11) = 0.97
Observe 00, and conclude Left:
P(Left | 00) = 0.90 >> P(Right | 00) = 0.10
and so on…
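The posterior values in this example can be reproduced with a short computation; the Left/Right models and all probabilities are the ones from the slides above:

```python
# Posterior over the two example models after observing F1 F2.
# Left: F1, F2 independent; Right: F1 -> F2, with the slide's CPTs.

def p_left(f1, f2):
    p1 = 0.1 if f1 == 1 else 0.9
    p2 = 0.2 if f2 == 1 else 0.8
    return p1 * p2

def p_right(f1, f2):
    p1 = 0.8 if f1 == 1 else 0.2
    p2_given = 0.8 if f1 == 1 else 0.6          # P(F2=1 | F1)
    p2 = p2_given if f2 == 1 else 1 - p2_given
    return p1 * p2

def posterior(f1, f2, prior_left=0.5):
    """Bayesian update; the two models are equiprobable by default."""
    left = p_left(f1, f2) * prior_left
    right = p_right(f1, f2) * (1 - prior_left)
    return left / (left + right), right / (left + right)

print(posterior(1, 1))   # ~ (0.03, 0.97): conclude Right
print(posterior(0, 0))   # ~ (0.90, 0.10): conclude Left
```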
Overview
Bayesian Categorization of
Probabilistic Graphical Models (PGMs)
Psychological Theories of Categorization
Theoretical & Experimental Implications
Psychological Theories
All assume a fixed set of input features
Usually binary-, sometimes continuous-valued
For the purposes of this talk, I will focus on
static theories of categorization
That is, focus on the categories that are learned,
as opposed to the learning process itself
The learning processes can also be captured/
explained in this framework
Shared Theoretical Structure
For many psychological theories,
categorization of a novel instance involves:
For each category under consideration, determine
the similarity (according to a specific metric)
between the category and the novel instance
Then use the category similarities to generate a
response probability for each category
Alternately, use a deterministic choice rule but assume
noise in the perceptual system (e.g., Ashby)
Shared Theoretical Structure
In this high-level picture,
We get different categorization theories by having
(i) different classes of similarity metrics, and/or
(ii) different response rules
Within a particular theory, different particular
categories result from different actual similarity
metrics (i.e., different parameter values)
Unconsidered Theories
Not every categorization theory has this
particular high-level structure
For practical reasons, I will focus on models
with analytically defined similarity metrics
In particular, arbitrary neural network models don’t
Excludes models such as RULEX & SUSTAIN
that can only be investigated with simulations
Finally, I won’t explore obvious connections
with Anderson’s rational analysis model
Returning to the High-Level Picture…
Step 2: “Use the category similarities to
generate a response probability”
Most common second stage rule is the
Weighted Luce-Shepard rule:
P(respond "m" | X) = b_m Sim(m, X) / Σ_{a ∈ M} b_a Sim(a, X)
Luce-Shepard & Bayesian Updating
P(respond "m" | X) = b_m Sim(m, X) / Σ_{a ∈ M} b_a Sim(a, X)
L-S is equivalent to Bayesian updating if, for
each a, Sim(a, X) is a probability distribution
Sim(a, X) represents P(X | m)
(normalized) ba weights represent base rates
Note: Unweighted L-S ⇔ equal base rates for the categories
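A minimal sketch of the weighted Luce-Shepard rule; the similarity and weight values below are illustrative, not from the talk:

```python
# Weighted Luce-Shepard rule:
#   P(respond "m" | X) = b_m * Sim(m, X) / sum_a b_a * Sim(a, X).
# If each Sim(a, .) is (re)normalized into a distribution P(X | a) and the
# normalized b_a play the role of base rates, this is exactly Bayes' rule.

def luce_shepard(sims, weights):
    """Response probabilities from similarities Sim(a, X) and biases b_a."""
    scores = [b * s for b, s in zip(weights, sims)]
    z = sum(scores)
    return [sc / z for sc in scores]

sims = [2.0, 6.0]          # Sim(category, X); arbitrary illustrative units
weights = [0.5, 0.5]       # unweighted L-S: equal base rates

probs = luce_shepard(sims, weights)
# Rescaling every similarity by the same constant leaves the response
# probabilities unchanged -- only the normalized values matter, which is
# why similarities can always be read as probabilities.
```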
Similarities as Probabilities
When do similarities represent probabilities?
The answer turns out to be “Always”
Similarity metrics are defined for arbitrary
combinations of category features
So from the point-of-view of response
probabilities, we can renormalize any similarity
metric to produce a probability distribution
(see also Myung, 1994; Ashby & Alfonso-Reese, 1995; and
Rosseel, 2002)
Categorization as Bayesian Updating
All psychological theories of categorization
with this high-level structure are special
cases of Bayesian updating
“Special cases” because they restrict the possible
similarities (and so probability distributions)
Note: I focused on weighted L-S, but similar
conclusions can be drawn for other response
probability rules
Common thread: treat similarities as probabilities
(perhaps because of noise in the perceptual system)
Psychological Categorization & PGMs
Claim: For each psychological theory,
[Class of similarity metrics] is equivalent to
[Probability distributions for (sub-classes of) a
PGM-type]
Three examples:
Causal Model Theory
Exemplar-based models (specifically, GCM)
Prototype-based models (first- and second-order)
Causal Model Theory
Causal Model Theory:
Categories are defined by causal structures,
represented as arbitrary causal Bayes nets
Similarity of an instance to a category is explicitly:
Sim(m, X) = P(X | m)
(where m is a Bayesian network)
Causal Model Theory
CMT categorization (with weighted L-S) is
equivalent to Bayesian updating with arbitrary
Bayes nets as the generating PGMs
Varying weights in the L-S rule correspond to
different category base rates
Exemplar-Based Models
Generalized Context Model
Categories defined by a set of exemplars Ej
Exemplars are actually observed category instances
Similarity is the (weighted) average (exponential
of) distance between the instance and exemplars
Multiple distance metrics used (e.g., weighted city-block)
Sim(m, X) = Σ_{j ∈ Exemp} W_j exp(−c Σ_{i ∈ Feat} |X_i − E_ji|)
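A minimal sketch of this similarity metric under the city-block distance; the exemplars, weights, and the value of c are invented for illustration:

```python
import math

# GCM similarity: Sim(m, X) = sum_j W_j * exp(-c * sum_i |X_i - E_ji|),
# a weighted sum over exemplars of exponentially decaying city-block
# distance. All numbers below are illustrative.

def gcm_similarity(x, exemplars, exemplar_weights, c=1.0):
    """Weighted average of exp(-c * distance) over the stored exemplars."""
    total = 0.0
    for w, e in zip(exemplar_weights, exemplars):
        dist = sum(abs(xi - ei) for xi, ei in zip(x, e))  # city-block
        total += w * math.exp(-c * dist)
    return total

exemplars = [(1, 1, 0), (0, 1, 1)]       # observed category instances
weights = [0.5, 0.5]                     # exemplar weights W_j

sim = gcm_similarity((1, 1, 1), exemplars, weights, c=1.0)
```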
Exemplar-Based Models
There is an equivalence between:
GCM-similarity functions; and
Probability distributions for Bayes nets with graph:
E
[unobserved]
F1
F2
…
Fn
and a regularity constraint on the distribution terms
Exemplar-Based Models
GCM categorization (with weighted L-S) is
equivalent to Bayesian updating with fixed-structure
Bayes nets (+constraint) as the
generating PGMs
Prototype-Based Models
First-order Multiplicative Prototype Model:
Categories defined by a prototypical instance Q
Prototype need not be actually observed
Similarity is the (weighted exponential of the)
distance between the instance and the prototype
Again, different distance metrics can be used
Sim(m, X) = exp(−c Σ_{i ∈ Feat} |X_i − Q_i|)
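A minimal sketch of the first-order prototype similarity; the prototype, instance, and c value are illustrative, not from the talk:

```python
import math

# FOMPM similarity: Sim(m, X) = exp(-c * sum_i |X_i - Q_i|),
# distance to a single prototype Q rather than to stored exemplars.
# Prototype and instance values are illustrative.

def prototype_similarity(x, q, c=1.0):
    return math.exp(-c * sum(abs(xi - qi) for xi, qi in zip(x, q)))

q = (1, 1, 0, 0)                 # category prototype Q
sim = prototype_similarity((1, 0, 0, 0), q, c=0.5)
# Because the sum sits inside a single exponential, the similarity
# factors into independent per-feature terms -- matching the
# empty-graph (no inter-feature dependence) structure.
```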
Prototype-Based Models
There is an equivalence between:
FOMPM-similarity functions; and
Probability distributions for empty-graph Markov
random fields (and a regularity constraint on the
distribution terms)
Note: The “no-edge Markov random field” probability
distributions are identical with the “no-edge Bayes net”
probability distributions
Prototype-Based Models
First-order models fail to capture the intuition
of “prototype as summary of observations”
Inter-feature correlations cannot be captured
Second-order models with interaction terms
Define features Fij whose value depends on the
state of Fi and Fj
Assume the similarity function is still factorizable
into feature-based terms
Non-trivial assumption, but not particularly restrictive
Prototype-Based Models
There is an equivalence between:
SOMPM-similarity functions; and
Probability distributions for arbitrary-graph Markov
random fields (and a regularity constraint on the
distribution terms)
Constraint details highly dependent on the exact
second-order feature definition and the similarity metric
Prototype-Based Models
First-order prototype-based categorization
(with weighted L-S) is equivalent to Bayesian
updating with no-edge Markov random fields
(+constraint) as the generating PGMs
And second-order prototypes are equivalent to
Bayesian updating with arbitrary-graph Markov
random fields
Summary of Theoretical Results
Many psychological theories of categorization
are equivalent to Bayesian updating,
assuming a particular generative model-type
Significant instances:
CMT ↔ Arbitrary-graph Bayes nets
GCM ↔ Fixed-graph Bayes net (+constraint)
Prototype ↔ Empty- or arbitrary-graph Markov random field (+constraint)
Overview
Bayesian Categorization of
Probabilistic Graphical Models (PGMs)
Psychological Theories of Categorization
Theoretical & Experimental Implications
Common Representational Language
Common representational language for:
Many psychological theories of concepts and
categorization; and
Psychological theories of causal inference and
belief based on Bayes nets
This shared language arguably facilitates the
development of a unified theory of the
psychological domains
Unfortunately, just a promissory note right now
Multiple Categorization Systems
Several recent papers have argued (roughly):
Each psychological theory is empirically superior
for some problems in some domains
There must be multiple categorization systems
(corresponding to the different theories)
Multiple Categorization Systems
Bayes nets and Markov random fields are
special cases of chain graphs – PGMs with
directed and undirected edges
We can model each categorization theory
as a special case of Bayesian updating on a
chain graph
Multiple Categorization Systems
If all categorization is Bayesian updating on
chain graphs, then we have one cognitive
system with many different possible
“parameters” (i.e., generative models)
Note: This possibility does not show that the
“multiple systems” view is wrong, but does blunt
the inference from multiple confirmed theories
Concepts as Chain Graphs
How can we test “concepts as chain graphs”?
Concepts as Chain Graphs
How can we test “concepts as chain graphs”?
Use a probability distribution for chain graphs with
no Bayes net or Markov random field perfect map
Example: a chain graph over F1, F2, F3, F4 with both directed and undirected edges
Experimental question: How accurately can
people learn categories based on this graph?
Expanded Equivalence Results
These results extend known equivalencies to
include (i) Causal model theory; and
(ii) Second-order prototype models
These various theoretical equivalencies can
guide experimental design
Use them to determine whether a particular
category structure can be equally well-modeled by
multiple psychological theories
Expanded Equivalence Results
Bayes nets and Markov random fields
represent overlapping sets of distributions
Specifically, Bayes nets with no colliders are
equivalent to Markov random fields with no cycles
Example graph over F1, F2, F3, F4 with no collider:
equal CMT & SOMPM model fits for this concept
Example graph over F1, F2, F3, F4 containing a collider:
different CMT & SOMPM model fits for this concept
Novel Suggested Theories
Recall that the PGMs for both the GCM and
SOMPM have additional constraints
These constraints have a relatively natural
computational motivation
Idea: Investigate generalized versions of the
psychological theories
E.g., do we get significantly better model fits? how
accurately do people learn concepts that violate
the regularity constraints? and so on…
Conclusion
Many psychological theories of categorization
are equivalent to
(special cases of) Bayesian categorization of
probabilistic graphical models
(and those equivalencies have implications for
both (a) theory development & testing, and
(b) experimental design & practice)
Appendix: GCM & Bayes Nets
Example of the regularity constraint:
City-block distance metric, continuous features:
For each Fi, each P(Fi | E = j) is a Laplace (double
exponential) distribution with the same scale
parameter, and possibly distinct means
E (in the Bayes net) has as many values as
there are exemplars (in the category)
P(E = j) is the exemplar weight
In the limit of infinite exemplars, we can represent
arbitrary probability distributions
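A sketch of this appendix construction under the stated assumptions; the exemplar means, shared scale parameter, and exemplar weights below are invented for illustration:

```python
import math

# Sketch: GCM with city-block distance and continuous features as a
# mixture, over an unobserved exemplar node E, of Laplace distributions
# sharing one scale parameter b. All numbers are illustrative.

def laplace_pdf(x, mu, b):
    """Laplace (double exponential) density with mean mu, scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

def category_density(x, exemplar_means, exemplar_probs, b=1.0):
    """P(X | category) = sum_j P(E=j) * prod_i Laplace(x_i; mu_ji, b)."""
    total = 0.0
    for p_j, means in zip(exemplar_probs, exemplar_means):
        prod = 1.0
        for xi, mu in zip(x, means):
            prod *= laplace_pdf(xi, mu, b)   # same scale for every feature
        total += p_j * prod                  # P(E = j) is the exemplar weight
    return total

means = [(0.0, 1.0), (2.0, 0.5)]     # one mean vector per exemplar
probs = [0.7, 0.3]                   # exemplar weights P(E = j)
density = category_density((0.5, 0.5), means, probs, b=1.0)
```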