cog2005 5352


Graphical Models for
Psychological Categorization
David Danks
Carnegie Mellon University; and
Institute for Human & Machine Cognition
A Puzzle

Concepts & causation are intertwined:

Concepts and categorization depend (in part) on causal beliefs and inferences.
Causal learning and reasoning depend (in part) on the particular concepts we have.
But the most prevalent theories in the two fields use quite different formalisms.

Q: Can categorization and causal inference be represented in a common “language”?
Central Theoretical Claim
Many psychological theories of categorization
are equivalent to
(special cases of) Bayesian categorization of
probabilistic graphical models
(and so the answer to the previous question is “Yes”
– they can share the language of graphical models)
Overview

Bayesian Categorization of Probabilistic Graphical Models (PGMs)
Psychological Theories of Categorization
Theoretical & Experimental Implications
Bayesian Categorization

Set of exclusive, exhaustive models M.

For each model m, a prior probability P(m) and a P(X | m) distribution (perhaps a generative model).

Given X, update the model probabilities:

P(m | X) = P(X | m) · P(m) / Σ_{a ∈ M} P(X | a) · P(a)

(and use the updated probabilities for choices)
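As a minimal sketch, the updating equation can be computed directly; the model names and likelihood values below are invented for illustration:

```python
# Sketch of P(m | X) = P(X | m) P(m) / sum over a in M of P(X | a) P(a).
# Model names and numbers are hypothetical, not from the talk.

def bayes_update(priors, likelihoods):
    """priors: dict model -> P(m); likelihoods: dict model -> P(X | m)."""
    joint = {m: likelihoods[m] * priors[m] for m in priors}
    total = sum(joint.values())  # the denominator: sum over all models a in M
    return {m: v / total for m, v in joint.items()}

# Two equiprobable models; X is three times likelier under model "b".
posterior = bayes_update({"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.6})
# posterior["b"] = 0.3 / 0.4 = 0.75
```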
Probabilistic Graphical Models

PGMs were developed to provide compact representations of probability distributions.

All PGMs are defined for a set of variables V, and composed of:

A graph over (nodes corresponding to) V
A probability distribution/density over V
Probabilistic Graphical Models

Markov assumption: The graph entails certain (conditional and unconditional) independence constraints on the probability distribution.

Markov assumptions imply a decomposition of the probability distribution into a product of simpler terms (i.e., fewer parameters).
Different PGM-types have different graph-types and/or Markov assumptions.
Probabilistic Graphical Models

Also assume Faithfulness/Stability: the only probabilistic independencies are those implied by Markov.

If we do not assume this, then every probability distribution can be represented by every PGM-type.
Faithfulness is assumed, explicitly or implicitly, by all PGM learning algorithms.

Def’n: A graph is a perfect map iff it is Markov & Faithful to the probability distribution.
Probabilistic Graphical Models

For a particular PGM-type, the set of probability distributions with a perfect map in that PGM-type forms a natural group.

This set will almost always be non-exhaustive.
Shorthand: “Probability distribution for PGM” will mean “probability distribution for which there is a perfect map in the PGM-type.”
Bayesian Networks

Directed acyclic graph.
Markov: Each V is independent of its non-descendants conditional on its parents.

Example: [Graph: F1 → F4 ← F2, F4 → F3]

P(F1, F2, F3, F4) = P(F1) · P(F2) · P(F4 | F1, F2) · P(F3 | F4)
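The factorization above can be checked numerically. The conditional probability tables below are invented, but any proper CPTs make the product sum to 1:

```python
# Sketch: computing the joint from the factorization
# P(F1, F2, F3, F4) = P(F1) * P(F2) * P(F4 | F1, F2) * P(F3 | F4).
# All CPT numbers below are made up for illustration.
from itertools import product

p_f1 = {1: 0.3, 0: 0.7}
p_f2 = {1: 0.6, 0: 0.4}
p_f4_given = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.4, (0, 0): 0.1}  # P(F4=1 | f1, f2)
p_f3_given = {1: 0.8, 0: 0.2}  # P(F3=1 | F4 = f4)

def joint(f1, f2, f3, f4):
    p4 = p_f4_given[(f1, f2)] if f4 else 1 - p_f4_given[(f1, f2)]
    p3 = p_f3_given[f4] if f3 else 1 - p_f3_given[f4]
    return p_f1[f1] * p_f2[f2] * p4 * p3

# Because each factor is a proper conditional distribution, the joint sums to 1.
total = sum(joint(*vals) for vals in product([0, 1], repeat=4))
```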
Markov Random Fields

Undirected graph (i.e., no arrowheads).
Markov: Each V is independent of its non-neighbors conditional on its neighbors.

Example: [Graph: F1 — F4, F2 — F4, F3 — F4]

P(F1, F2, F3, F4) = P(F1, F4) · P(F2, F4) · P(F3, F4)
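In general Markov random field notation, such a decomposition is written as a normalized product of pairwise (clique) potentials, one per edge. A sketch with invented potential values:

```python
# Sketch: an undirected model over F1..F4 with edges F1-F4, F2-F4, F3-F4,
# written as a normalized product of pairwise potentials (values invented).
from itertools import product

phi = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 3.0}

def unnormalized(f1, f2, f3, f4):
    # One potential term per edge of the graph.
    return phi[(f1, f4)] * phi[(f2, f4)] * phi[(f3, f4)]

# Partition function: sum over all joint assignments.
Z = sum(unnormalized(*vals) for vals in product([0, 1], repeat=4))

def p(f1, f2, f3, f4):
    return unnormalized(f1, f2, f3, f4) / Z
```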
Bayesian Categorization of PGMs

Use the standard updating equation:

P(m | X) = P(X | m) · P(m) / Σ_{a ∈ M} P(X | a) · P(a)

And require the P(X | a) distributions to be distributions for that PGM-type.

I.e., the PGM-type supplies the generative model.
Simple Example

Suppose we have two equiprobable models:

Left: [Graph: F1, F2, no edge]
P(F1 = 1) = 0.1
P(F2 = 1) = 0.2

Right: [Graph: F1 → F2]
P(F1 = 1) = 0.8
P(F2 = 1 | F1 = 1) = 0.8
P(F2 = 1 | F1 = 0) = 0.6
Simple Example

Observe 11, and conclude Right:
P(Left | 11) = 0.03 << P(Right | 11) = 0.97

Observe 00, and conclude Left:
P(Left | 00) = 0.90 >> P(Right | 00) = 0.10

and so on…
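The posteriors in this example can be checked directly; this sketch just encodes the two models, with the Left/Right labels following the slides:

```python
# The two-model example: Left has independent features, Right has F1 -> F2.

def p_left(f1, f2):
    # P(F1 = 1) = 0.1 and P(F2 = 1) = 0.2, independent.
    return (0.1 if f1 else 0.9) * (0.2 if f2 else 0.8)

def p_right(f1, f2):
    # P(F1 = 1) = 0.8; P(F2 = 1 | F1 = 1) = 0.8; P(F2 = 1 | F1 = 0) = 0.6.
    p1 = 0.8 if f1 else 0.2
    p2 = (0.8 if f1 else 0.6) if f2 else (0.2 if f1 else 0.4)
    return p1 * p2

def posterior_left(f1, f2):
    # Equiprobable priors cancel in the updating equation.
    l, r = p_left(f1, f2), p_right(f1, f2)
    return l / (l + r)

# posterior_left(1, 1) = 0.02 / 0.66 ≈ 0.03, so conclude Right;
# posterior_left(0, 0) = 0.72 / 0.80 = 0.90, so conclude Left.
```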
Overview

Bayesian Categorization of Probabilistic Graphical Models (PGMs)
Psychological Theories of Categorization
Theoretical & Experimental Implications
Psychological Theories

All assume a fixed set of input features:

Usually binary-, sometimes continuous-valued.

For the purposes of this talk, I will focus on static theories of categorization:

That is, focus on the categories that are learned, as opposed to the learning process itself.
The learning processes can also be captured/explained in this framework.
Shared Theoretical Structure

For many psychological theories, categorization of a novel instance involves:

For each category under consideration, determine the similarity (according to a specific metric) between the category and the novel instance.
Then use the category similarities to generate a response probability for each category.

Alternately, use a deterministic choice rule but assume noise in the perceptual system (e.g., Ashby).
Shared Theoretical Structure

In this high-level picture:

We get different categorization theories by having (i) different classes of similarity metrics, and/or (ii) different response rules.
Within a particular theory, different particular categories result from different actual similarity metrics (i.e., different parameter values).
Unconsidered Theories

Not every categorization theory has this particular high-level structure:

In particular, arbitrary neural network models don’t.

For practical reasons, I will focus on models with analytically defined similarity metrics:

Excludes models such as RULEX & SUSTAIN that can only be investigated with simulations.

Finally, I won’t explore obvious connections with Anderson’s rational analysis model.
Returning to the High-Level Picture…

Step 2: “Use the category similarities to generate a response probability.”

The most common second-stage rule is the weighted Luce-Shepard rule:

P(respond “m” | X) = b_m · Sim(m, X) / Σ_{a ∈ M} b_a · Sim(a, X)
Luce-Shepard & Bayesian Updating

P(respond “m” | X) = b_m · Sim(m, X) / Σ_{a ∈ M} b_a · Sim(a, X)

L-S is equivalent to Bayesian updating if, for each a, Sim(a, X) is a probability distribution:

Sim(a, X) represents P(X | a).
(Normalized) b_a weights represent base rates.

Note: Unweighted L-S ⇔ equal base rates for the categories.
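The weighted Luce-Shepard rule is invariant under rescaling all similarities by a constant, which is what licenses reading similarities as (unnormalized) likelihoods. A sketch with invented category names and values:

```python
# Weighted Luce-Shepard: P(respond "m" | X) = b_m Sim(m, X) / sum_a b_a Sim(a, X).
# Category names, biases, and similarity values are hypothetical.

def luce_shepard(bias, sim):
    """bias: dict category -> b_a; sim: dict category -> Sim(a, X)."""
    scores = {a: bias[a] * sim[a] for a in sim}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

sims = {"cat1": 1.7, "cat2": 0.4}
bias = {"cat1": 0.5, "cat2": 0.5}

p1 = luce_shepard(bias, sims)
p2 = luce_shepard(bias, {a: 10.0 * s for a, s in sims.items()})  # rescaled similarities
# p1 and p2 agree: only the *relative* similarities matter, so any similarity
# metric can be renormalized into a probability distribution without changing responses.
```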
Similarities as Probabilities

When do similarities represent probabilities?
The answer turns out to be “Always”:

Similarity metrics are defined for arbitrary combinations of category features.
So from the point-of-view of response probabilities, we can renormalize any similarity metric to produce a probability distribution.

(See also Myung, 1994; Ashby & Alfonso-Reese, 1995; and Rosseel, 2002.)
Categorization as Bayesian Updating

All psychological theories of categorization with this high-level structure are special cases of Bayesian updating:

“Special cases” because they restrict the possible similarities (and so probability distributions).

Note: I focused on weighted L-S, but similar conclusions can be drawn for other response probability rules.

Common thread: treat similarities as probabilities (perhaps because of noise in the perceptual system).
Psychological Categorization & PGMs

Claim: For each psychological theory, [class of similarity metrics] is equivalent to [probability distributions for (sub-classes of) a PGM-type].

Three examples:

Causal Model Theory
Exemplar-based models (specifically, GCM)
Prototype-based models (first- and second-order)
Causal Model Theory

Causal Model Theory:

Categories are defined by causal structures, represented as arbitrary causal Bayes nets.
Similarity of an instance to a category is explicitly Sim(m, X) = P(X | m), where m is a Bayesian network.
Causal Model Theory

CMT categorization (with weighted L-S) is
equivalent to Bayesian updating with arbitrary
Bayes nets as the generating PGMs

Varying weights in the L-S rule correspond to
different category base rates
Exemplar-Based Models

Generalized Context Model:

Categories defined by a set of exemplars E_j.
Exemplars are actually observed category instances.
Similarity is the (weighted) average (exponential of) distance between the instance and exemplars.

Multiple distance metrics used (e.g., weighted city-block):

Sim(m, X) = Σ_{j ∈ Exemplars} W_j · exp(−c · Σ_{i ∈ Features} α_i · |X_i − E_ji|)
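The GCM similarity can be sketched directly; the parameter names (attention weights α_i, exemplar weights W_j, sensitivity c) follow the formula, and the test values are invented:

```python
import math

# Sketch of GCM similarity with the weighted city-block distance metric:
# Sim(m, X) = sum_j W_j * exp(-c * sum_i a_i * |X_i - E_ji|).

def gcm_similarity(x, exemplars, exemplar_weights, attention, c):
    total = 0.0
    for w, e in zip(exemplar_weights, exemplars):
        # Weighted city-block distance between the instance and this exemplar.
        dist = sum(a * abs(xi - ei) for a, xi, ei in zip(attention, x, e))
        total += w * math.exp(-c * dist)
    return total

# An instance identical to a lone exemplar gets similarity 1 (distance 0).
s = gcm_similarity([1, 0, 1], [[1, 0, 1]], [1.0], [1.0, 1.0, 1.0], c=2.0)
```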
Exemplar-Based Models

There is an equivalence between:

GCM-similarity functions; and
Probability distributions for Bayes nets with graph:

[Graph: E (unobserved) → F1, F2, …, Fn]

and a regularity constraint on the distribution terms.
Exemplar-Based Models

GCM categorization (with weighted L-S) is equivalent to Bayesian updating with fixed-structure Bayes nets (+constraint) as the generating PGMs.
Prototype-Based Models

First-order Multiplicative Prototype Model:

Categories defined by a prototypical instance Q.
Prototype need not be actually observed.
Similarity is the (weighted exponential of the) distance between the instance and the prototype.

Again, different distance metrics can be used:

Sim(m, X) = exp(−c · Σ_{i ∈ Features} α_i · |X_i − Q_i|)
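One way to see why FOMPM similarity pairs with a no-edge graphical model: the exponential of a sum factorizes into a product of independent per-feature terms. A sketch with invented parameter values:

```python
import math

# FOMPM similarity: Sim(m, X) = exp(-c * sum_i a_i * |X_i - Q_i|),
# and its factored form: prod_i exp(-c * a_i * |X_i - Q_i|).

def fompm_similarity(x, q, attention, c):
    dist = sum(a * abs(xi - qi) for a, xi, qi in zip(attention, x, q))
    return math.exp(-c * dist)

def fompm_factored(x, q, attention, c):
    # Same quantity, written as a product of per-feature terms.
    prod = 1.0
    for a, xi, qi in zip(attention, x, q):
        prod *= math.exp(-c * a * abs(xi - qi))
    return prod

# The two forms agree for any instance/prototype pair, which is the structural
# reason FOMPM matches a graphical model with no inter-feature edges.
```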
Prototype-Based Models

There is an equivalence between:

FOMPM-similarity functions; and
Probability distributions for empty-graph Markov random fields (and a regularity constraint on the distribution terms).

Note: The “no-edge Markov random field” probability distributions are identical with the “no-edge Bayes net” probability distributions.
Prototype-Based Models

First-order models fail to capture the intuition of “prototype as summary of observations”:

Inter-feature correlations cannot be captured.

Second-order models add interaction terms:

Define features F_ij whose value depends on the states of F_i and F_j.
Assume the similarity function is still factorizable into feature-based terms.

Non-trivial assumption, but not particularly restrictive.
Prototype-Based Models

There is an equivalence between:

SOMPM-similarity functions; and
Probability distributions for arbitrary-graph Markov random fields (and a regularity constraint on the distribution terms).

Note: Constraint details are highly dependent on the exact second-order feature definition and the similarity metric.
Prototype-Based Models

First-order prototype-based categorization
(with weighted L-S) is equivalent to Bayesian
updating with no-edge Markov random fields
(+constraint) as the generating PGMs

And second-order prototypes are equivalent to
Bayesian updating with arbitrary-graph Markov
random fields
Summary of Theoretical Results

Many psychological theories of categorization are equivalent to Bayesian updating, assuming a particular generative model-type.

Significant instances:

CMT → Arbitrary-graph Bayes nets
GCM → Fixed-graph Bayes nets (+constraint)
Prototype → Empty- or arbitrary-graph Markov random fields (+constraint)
Overview

Bayesian Categorization of Probabilistic Graphical Models (PGMs)
Psychological Theories of Categorization
Theoretical & Experimental Implications
Common Representational Language

A common representational language for:

Many psychological theories of concepts and categorization; and
Psychological theories of causal inference and belief based on Bayes nets.

This shared language arguably facilitates the development of a unified theory of the psychological domains.

Unfortunately, just a promissory note right now.
Multiple Categorization Systems

Several recent papers have argued (roughly):

Each psychological theory is empirically superior for some problems in some domains.
⇒ There must be multiple categorization systems (corresponding to the different theories).
Multiple Categorization Systems

Bayes nets and Markov random fields are special cases of chain graphs – PGMs with both directed and undirected edges.
⇒ We can model each categorization theory as a special case of Bayesian updating on a chain graph.
Multiple Categorization Systems

If all categorization is Bayesian updating on
chain graphs, then we have one cognitive
system with many different possible
“parameters” (i.e., generative models)

Note: This possibility does not show that the
“multiple systems” view is wrong, but does blunt
the inference from multiple confirmed theories
Concepts as Chain Graphs

How can we test “concepts as chain graphs”?

Use a probability distribution for chain graphs with no Bayes net or Markov random field perfect map.

Example: [Graph: a chain graph over F1, F2, F3, F4]

Experimental question: How accurately can people learn categories based on this graph?
Expanded Equivalence Results


These results extend known equivalencies to
include (i) Causal model theory; and
(ii) Second-order prototype models
These various theoretical equivalencies can
guide experimental design

Use them to determine whether a particular
category structure can be equally well-modeled by
multiple psychological theories
Expanded Equivalence Results

Bayes nets and Markov random fields represent overlapping sets of distributions:

Specifically, Bayes nets with no colliders are equivalent to Markov random fields with no cycles.

[Graph over F1, F2, F3, F4]: Equal CMT & SOMPM model fits for this concept.
[Second graph over F1, F2, F3, F4]: Different CMT & SOMPM model fits for this concept.
Novel Suggested Theories

Recall that the PGMs for both the GCM and SOMPM have additional constraints:

These constraints have a relatively natural computational motivation.

Idea: Investigate generalized versions of the psychological theories:

E.g., do we get significantly better model fits? How accurately do people learn concepts that violate the regularity constraints? And so on…
Conclusion
Many psychological theories of categorization
are equivalent to
(special cases of) Bayesian categorization of
probabilistic graphical models
(and those equivalencies have implications for
both (a) theory development & testing, and
(b) experimental design & practice)
Appendix: GCM & Bayes Nets

Example of the regularity constraint:

City-block distance metric, continuous features: for each F_i, each P(F_i | E = j) is a Laplace (double exponential) distribution with the same scale parameter, and possibly distinct means.

E (in the Bayes net) has as many values as there are exemplars (in the category):

P(E = j) is the exemplar weight.
In the limit of infinite exemplars, we can represent arbitrary probability distributions.
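The appendix's construction can be checked for a single continuous feature: a mixture of Laplace densities with shared scale b and mixture weights W_j equals the GCM sum of exponentiated distances (with c = 1/b), up to the constant 1/(2b). A sketch with invented means, weights, and scale:

```python
import math

# One continuous feature: P(F | E = j) is Laplace with mean mu_j and shared scale b,
# and P(E = j) = W_j. The marginal P(F = x) then matches the GCM similarity sum
# (with c = 1/b) up to the normalizing constant 1/(2b). Numbers are illustrative.

def laplace_pdf(x, mu, b):
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def mixture(x, means, weights, b):
    # Marginal density: sum_j P(E = j) * P(F = x | E = j).
    return sum(w * laplace_pdf(x, mu, b) for w, mu in zip(weights, means))

def gcm_sum(x, means, weights, c):
    # One-feature GCM similarity: sum_j W_j * exp(-c * |x - mu_j|).
    return sum(w * math.exp(-c * abs(x - mu)) for w, mu in zip(weights, means))

# With b = 0.25 (so c = 1/b = 4), mixture(x) = gcm_sum(x) / (2 * b) = 2 * gcm_sum(x).
```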