Revealing inductive biases through iterated learning
The dynamics of iterated learning
Tom Griffiths
UC Berkeley with Mike Kalish, Steve Lewandowsky, Simon Kirby, and Mike Dowman
Learning

[Figure: a learner inferring a hypothesis from data]
Iterated learning
(Kirby, 2001)

[Figure: a chain of learners, each learning from data produced by the previous learner]
What are the consequences of learners learning from other learners?
Objects of iterated learning
• Knowledge communicated across generations through provision of data by learners
• Examples:
  – religious concepts
  – social norms
  – myths and legends
  – causal theories
  – language
Language
• The languages spoken by humans are typically viewed as the result of two factors:
  – individual learning
  – innate constraints (biological evolution)
• This limits the possible explanations for different kinds of linguistic phenomena
Linguistic universals
• Human languages possess universal properties
  – e.g. compositionality
  (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
• Traditional explanation:
  – linguistic universals reflect innate constraints specific to a system for acquiring language
  (e.g., Chomsky, 1965)
Cultural evolution
• Languages are also subject to change via cultural evolution (through iterated learning)
• Alternative explanation:
  – linguistic universals emerge as the result of the fact that language is learned anew by each generation (using general-purpose learning mechanisms)
  (e.g., Briscoe, 1998; Kirby, 2001)
Analyzing iterated learning
[Figure: learners in sequence; each infers a hypothesis using P_L(h|d) and generates data for the next using P_P(d|h)]

P_L(h|d): probability of inferring hypothesis h from data d
P_P(d|h): probability of generating data d from hypothesis h
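To make the loop concrete, here is a minimal Python sketch of one chain of iterated learning (not from the talk: the two-hypothesis, two-outcome model and all probabilities are invented for illustration).

```python
import random

# Hypothetical toy model: 2 hypotheses, 2 possible data points.
# Both conditional distributions are invented for illustration.
P_L = [[0.9, 0.1],   # P_L(h | d=0)
       [0.4, 0.6]]   # P_L(h | d=1)
P_P = [[0.8, 0.2],   # P_P(d | h=0)
       [0.3, 0.7]]   # P_P(d | h=1)

def iterate(n_generations, d=0):
    """Run one chain of iterated learning; return the hypothesis sequence."""
    hs = []
    for _ in range(n_generations):
        h = random.choices([0, 1], weights=P_L[d])[0]  # learn:   h ~ P_L(h|d)
        d = random.choices([0, 1], weights=P_P[h])[0]  # produce: d ~ P_P(d|h)
        hs.append(h)
    return hs

print(iterate(10))  # e.g. [0, 0, 1, 0, ...]
```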
Markov chains
[Figure: a Markov chain x^(1) → x^(2) → x^(3) → …]

• Transition matrix: T = P(x^(t+1) | x^(t))
• Variables: x^(t+1) is independent of history given x^(t)
• Converges to a stationary distribution under easily checked conditions (i.e., if it is ergodic)
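A quick numerical illustration of convergence (the 3-state matrix is invented): repeatedly applying T to any starting distribution drives it to the same stationary distribution.

```python
import numpy as np

# Hypothetical 3-state chain; row i holds P(x(t+1) | x(t) = i).
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

p = np.array([1.0, 0.0, 0.0])   # arbitrary initial distribution
for _ in range(100):
    p = p @ T                   # one step of the chain
print(p)                        # the stationary distribution
```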
Stationary distributions
• Stationary distribution:

  π_j = Σ_i P(x^(t+1) = j | x^(t) = i) π_i = Σ_i T_ij π_i

• In matrix form, π = πT
  – π is the first eigenvector of the matrix T
  – can sometimes be found analytically
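The same distribution can be read off as the eigenvector of T with eigenvalue 1; a sketch, reusing the hypothetical matrix above:

```python
import numpy as np

T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

# Left eigenvectors of T are right eigenvectors of T transposed.
vals, vecs = np.linalg.eig(T.T)
idx = np.argmin(np.abs(vals - 1.0))   # eigenvalue 1 for an ergodic chain
pi = np.real(vecs[:, idx])
pi /= pi.sum()                        # normalize into a distribution
print(pi)
```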
Analyzing iterated learning
The full chain alternates between data and hypotheses:

  d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → …

with each hypothesis drawn from P_L(h|d) and each data set drawn from P_P(d|h).

A Markov chain on hypotheses:

  h_1 → h_2 → h_3 → …

where each step generates data d from P_P(d|h), then infers h' from P_L(h'|d).

A Markov chain on data:

  d_0 → d_1 → d_2 → …

where each step infers a hypothesis h from P_L(h|d), then generates d' from P_P(d'|h).
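Each collapsed chain has an explicit transition matrix: on hypotheses, T(h → h') = Σ_d P_L(h'|d) P_P(d|h); on data, T(d → d') = Σ_h P_P(d'|h) P_L(h|d). In matrix form these are just products; a sketch with the same invented toy distributions as before:

```python
import numpy as np

P_L = np.array([[0.9, 0.1],    # P_L(h | d=0)
                [0.4, 0.6]])   # P_L(h | d=1)
P_P = np.array([[0.8, 0.2],    # P_P(d | h=0)
                [0.3, 0.7]])   # P_P(d | h=1)

Q_h = P_P @ P_L   # Q_h[h, h'] = sum_d P_P(d|h) P_L(h'|d)
Q_d = P_L @ P_P   # Q_d[d, d'] = sum_h P_L(h|d) P_P(d'|h)
print(Q_h)
print(Q_d)
```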
Bayesian inference
• Rational procedure for updating beliefs
• Foundation of many learning algorithms
• Lets us make the inductive biases of learners precise

[Image: Reverend Thomas Bayes]
Bayes’ theorem
  P(h|d) = P(d|h) P(h) / Σ_{h'∈H} P(d|h') P(h')

  – P(h|d): posterior probability
  – P(d|h): likelihood
  – P(h): prior probability
  – denominator: sum over the space of hypotheses

h: hypothesis
d: data
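A quick numerical check of the theorem (priors and likelihoods invented for illustration):

```python
# Hypothetical discrete model: priors and likelihoods are invented.
prior      = {"h1": 0.6, "h2": 0.4}   # P(h)
likelihood = {"h1": 0.2, "h2": 0.7}   # P(d|h) for one observed d

evidence  = sum(likelihood[h] * prior[h] for h in prior)   # sum over H
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # {'h1': 0.3, 'h2': 0.7}
```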
Iterated Bayesian learning
[Figure: the iterated learning chain, alternating P_L(h|d) and P_P(d|h)]

Assume learners sample from their posterior distribution:

  P_L(h|d) = P_P(d|h) P(h) / Σ_{h'∈H} P_P(d|h') P(h')
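The earlier sketch can be specialized to this assumption by deriving P_L from a prior via Bayes' rule (all numbers still invented):

```python
import random

P_P   = [[0.8, 0.2],   # P_P(d | h=0)
         [0.3, 0.7]]   # P_P(d | h=1)
prior = [0.6, 0.4]     # P(h): the learner's inductive bias

def posterior(d):
    """P_L(h|d) computed from Bayes' rule."""
    joint = [P_P[h][d] * prior[h] for h in range(2)]
    z = sum(joint)
    return [p / z for p in joint]

def iterate(n_generations, d=0):
    hs = []
    for _ in range(n_generations):
        h = random.choices([0, 1], weights=posterior(d))[0]  # h ~ P_L(h|d)
        d = random.choices([0, 1], weights=P_P[h])[0]        # d ~ P_P(d|h)
        hs.append(h)
    return hs

hs = iterate(200_000)
print("long-run P(h=0):", hs.count(0) / len(hs))
```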
Stationary distributions
• Markov chain on h converges to the prior, P(h)
• Markov chain on d converges to the “prior predictive distribution”

  P(d) = Σ_h P(d|h) P(h)

(Griffiths & Kalish, 2005)
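Both claims are easy to verify numerically in the toy model: build the collapsed transition matrices from the Bayesian P_L and find their stationary distributions (a sketch; the numbers are still invented):

```python
import numpy as np

P_P   = np.array([[0.8, 0.2],
                  [0.3, 0.7]])   # P_P(d|h)
prior = np.array([0.6, 0.4])     # P(h)

# Bayesian learner: P_L[d, h] proportional to P_P(d|h) P(h)
joint = P_P.T * prior                           # joint[d, h] = P_P(d|h) P(h)
P_L = joint / joint.sum(axis=1, keepdims=True)

p = np.array([1.0, 0.0])
for _ in range(100):
    p = p @ (P_P @ P_L)          # chain on hypotheses
print(p, "= prior", prior)

q = np.array([1.0, 0.0])
for _ in range(100):
    q = q @ (P_L @ P_P)          # chain on data
print(q, "= prior predictive", prior @ P_P)
```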
Explaining convergence to the prior
[Figure: the iterated learning chain, alternating P_L(h|d) and P_P(d|h)]

• Intuitively: the data acts once, the prior acts many times
• Formally: iterated learning with Bayesian agents is a Gibbs sampler for P(d, h)

(Griffiths & Kalish, submitted)
Gibbs sampling
For variables x = x_1, x_2, …, x_n:

Draw x_i^(t+1) from P(x_i | x_{-i}), where

  x_{-i} = x_1^(t+1), x_2^(t+1), …, x_{i-1}^(t+1), x_{i+1}^(t), …, x_n^(t)

Converges to P(x_1, x_2, …, x_n)

(Geman & Geman, 1984)
(a.k.a. the heat bath algorithm in statistical physics)
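A minimal Gibbs sampler sketch for a joint distribution over two binary variables (the joint table is invented; each step resamples one variable from its conditional given the other):

```python
import random
from collections import Counter

# Hypothetical joint distribution P(x1, x2) over binary variables.
P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

def conditional(i, other):
    """P(x_i | x_-i): renormalize the joint along coordinate i."""
    if i == 0:
        w = [P[(0, other)], P[(1, other)]]
    else:
        w = [P[(other, 0)], P[(other, 1)]]
    z = sum(w)
    return [p / z for p in w]

x = [0, 0]
counts = Counter()
for t in range(200_000):
    for i in (0, 1):   # sweep the variables in turn
        x[i] = random.choices([0, 1], weights=conditional(i, x[1 - i]))[0]
    counts[tuple(x)] += 1

for k in sorted(P):
    print(k, counts[k] / 200_000, "vs", P[k])  # samples match the joint
```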
Gibbs sampling
(MacKay, 2003)

[Figure: illustration of Gibbs sampling, from MacKay (2003)]
Explaining convergence to the prior
[Figure: the iterated learning chain, alternating P_L(h|d) and P_P(d|h)]

When the target distribution is P(d, h), the conditional distributions are P_L(h|d) and P_P(d|h)
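Concretely, the iterated-learning loop is a two-variable Gibbs sampler whose coordinates are h and d, so its samples match the joint P(d, h) = P_P(d|h) P(h). A check in the invented toy model:

```python
import random
from collections import Counter

P_P   = [[0.8, 0.2], [0.3, 0.7]]   # P_P(d|h), invented
prior = [0.6, 0.4]                 # P(h), invented

def posterior(d):                  # P_L(h|d) via Bayes' rule
    joint = [P_P[h][d] * prior[h] for h in range(2)]
    z = sum(joint)
    return [p / z for p in joint]

h, d = 0, 0
counts = Counter()
for _ in range(200_000):
    h = random.choices([0, 1], weights=posterior(d))[0]  # Gibbs step on h
    d = random.choices([0, 1], weights=P_P[h])[0]        # Gibbs step on d
    counts[(h, d)] += 1

for (h_, d_), c in sorted(counts.items()):
    print((h_, d_), c / 200_000, "vs", prior[h_] * P_P[h_][d_])
```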
Implications for linguistic universals
• When learners sample from P(h|d), the distribution over languages converges to the prior
  – identifies a one-to-one correspondence between inductive biases and linguistic universals
Iterated Bayesian learning
[Figure: the iterated learning chain, alternating P_L(h|d) and P_P(d|h)]

Assume learners sample from their posterior distribution:

  P_L(h|d) = P_P(d|h) P(h) / Σ_{h'∈H} P_P(d|h') P(h')
From sampling to maximizing
Assume instead that learners exponentiate their posterior:

  P_L(h|d) = [ P_P(d|h) P(h) ]^r / Σ_{h'∈H} [ P_P(d|h') P(h') ]^r

[Figure: the resulting P_L(h|d) for r = 1 (sampling from the posterior), r = 2, and r = ∞ (maximizing, i.e. choosing the most probable hypothesis)]
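A sketch of this family of learners in the same invented two-hypothesis model (the exponent r is the only change from the earlier sampler):

```python
import random

P_P   = [[0.8, 0.2], [0.3, 0.7]]   # P_P(d|h), invented
prior = [0.6, 0.4]                 # P(h), invented

def learner(d, r):
    """P_L(h|d) proportional to (P_P(d|h) P(h))^r."""
    w = [(P_P[h][d] * prior[h]) ** r for h in range(2)]
    z = sum(w)
    return [p / z for p in w]

def iterate(n, r, d=0):
    hs = []
    for _ in range(n):
        h = random.choices([0, 1], weights=learner(d, r))[0]
        d = random.choices([0, 1], weights=P_P[h])[0]
        hs.append(h)
    return hs

for r in (1, 2):
    hs = iterate(200_000, r)
    print(f"r = {r}: long-run P(h=0) ≈ {hs.count(0) / len(hs):.3f}")
```

In this toy model, r = 1 reproduces the prior (≈ 0.6), while r = 2 departs from it, amplifying the preference the prior already expresses.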
From sampling to maximizing
• General analytic results are hard to obtain
  – (r = ∞ is a stochastic version of the EM algorithm)
• For certain classes of languages, it is possible to show that the stationary distribution gives each hypothesis h probability proportional to P(h)^r
  – the ordering identified by the prior is preserved, but not the corresponding probabilities

(Kirby, Dowman, & Griffiths, submitted)
Implications for linguistic universals
• When learners sample from P(h|d), the distribution over languages converges to the prior
  – identifies a one-to-one correspondence between inductive biases and linguistic universals
• As learners move towards maximizing, the influence of the prior is exaggerated
  – weak biases can produce strong universals
  – cultural evolution is a viable alternative to traditional explanations for linguistic universals
Analyzing iterated learning
• The outcome of iterated learning is determined by the inductive biases of the learners
  – hypotheses with high prior probability ultimately appear with high probability in the population
• Clarifies the connection between constraints on language learning and linguistic universals…
• …and provides formal justification for the idea that culture reflects the structure of the mind
Conclusions
• Iterated learning provides a lens for magnifying the inductive biases of learners
  – small effects for individuals are big effects for groups
• When cognition affects culture, studying groups can give us better insight into individuals