LING 696B: Categorical
perception, perceptual
magnets and neural nets
1
From last time: quick overview
of phonological acquisition

Prenatal to newborn


4-day-old French babies prefer listening to
French over Russian (Mehler et al, 1988)
Prepared to learn any phonetic contrast
in human languages (Eimas, 71)
2
A quick overview of
phonological acquisition


6 months: Effects of experience begin to
surface (Kuhl, 92)
6 - 12 months: from a universal
perceiver to a language-specific one
(Werker & Tees, 84)
3
A quick overview of
phonological acquisition

6 - 9 months: knows the sequential
patterns of their language

English vs. Dutch (Jusczyk, 93)
“avoid” vs. “waardig”

Weird English vs. better English (Jusczyk, 94)
“mim” vs. “thob”

Impossible Dutch vs. Dutch (Friederici, 93)
“bref” vs. “febr”
4
A quick overview of
phonological acquisition

6 months: segment words out of the
speech stream based on transitional
probabilities (Saffran et al, 96); see the toy calculation after this slide


pabikutibudogolatupabikudaropi…
8 months: remembering words heard in
stories read to them by graduate
students (Jusczyk and Hohne, 97)
5
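To make the transitional-probability idea concrete, here is a toy calculation (not Saffran et al's actual procedure), reading the slide's stream as built from the words pabiku, tibudo, golatu and daropi; the extra syllables added at the end are ours, just to make the counts less degenerate.

```python
# Estimate syllable-to-syllable transitional probabilities from a continuous stream:
# TP(x -> y) = count(xy) / count(x)
from collections import Counter

def transitional_probs(syllables):
    """Return TP(x -> y) for each adjacent syllable pair in the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

# hypothetical stream built from the "words" pabiku, tibudo, golatu, daropi
stream = "pa bi ku ti bu do go la tu pa bi ku da ro pi go la tu ti bu do".split()
tps = transitional_probs(stream)
# Within-word transitions (e.g. pa->bi) come out at 1.0; transitions that straddle
# word boundaries (e.g. ku->ti) come out lower, and the gap widens with longer streams.
for pair, tp in sorted(tps.items(), key=lambda kv: -kv[1]):
    print(pair, round(tp, 2))
```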
A quick overview of
phonological acquisition

From 12-18 months onward:
vocabulary growth from 50 words to a
“burst”


Lexical representation?
Toddlers: learning morpho-phonology

Example: U-shaped curve
went --> goed --> wented/went
6
Next few weeks: learning
phonological categories

The innovation of phonological structure
requires us to think of the system of
vocalizations as combinatorial, in that the
concatenation of inherently meaningless
phonological units leads to an intrinsically
unlimited class of words ... It is a way of
systematizing existing vocabulary items and
being able to create new ones.
-- Ray Jackendoff (also citing
C. Hockett in “Foundations of Language”)
7
Categorical perception

Categorization

Discrimination
8
Eimas et al (1971)


Enduring result: tiny (1-4-month-old)
babies can hear almost anything
Stimuli: 3 pairs along a synthetic /p/-/b/
continuum, each pair differing by 20 ms in voice onset time
9
Eimas et al (1971)

Babies get



Very excited when a change crosses the
category boundary
Somewhat excited when the change stays
within the boundary
Bored when there is no change
10
Eimas et al (1971)



Their conclusion: [voice] feature
detection is innate
But similar things were shown for
chinchillas (Kuhl and Miller, 78) and
monkeys (Ramus et al)
So maybe this is a mammalian auditory
system thing…
11
Sensitivity to distributions


Perceptual magnet effects (Kuhl, 91):
tokens near prototypes are harder to
discriminate
Perhaps a fancier term is “perceptual
space warping”
12
Categorical perception vs.
perceptual magnets

Gradient effects: Kuhl (91-95) tried hard
to demonstrate these are not
experimental errors, but systematic
[Figure: Kuhl's results plotted by reference stimulus 1-4]
13
Categorical perception vs.
perceptual magnets

PM is based on a notion of perceptual
space


Higher dimensions
Effects not tied to category boundaries
14
Kuhl et al (1992)



American English has /i/, but no /y/
Swedish has /y/, but no /i/
Test whether infants can detect differences
between prototype and neighbors
[Figure: percent of variants treated as the same as the prototype]
15
Kuhl et al (92)’s interpretation



Linguistic experience changes
perception of native and non-native
sounds (also see Werker & Tees, 84)
Influence of input starts before any
phonemic contrast is learned
Information stored as prototypes
16
Modeling attempt: Guenther
and Gjaja (96)


Claim: CP/PM is a matter of auditory
cortex, and does not have to involve
higher-level cognition
Methodology: build a connectionist
model inspired by neurons
17
Self-organizing map in
Guenther and Gjaja



Input coding: each (F1, F2) pair is coded by
4 input nodes (a sketch follows this slide)
This seems to be an artifact of the
mechanism (with regard to scaling), but it is
standard practice in the field
18
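One plausible reading of the 4-node coding, as a minimal sketch; the complementary (agonist/antagonist) pairing and the F_MAX scale are assumptions for illustration, not necessarily G&G's exact scheme.

```python
import numpy as np

# Hypothetical ceiling of the formant range; the value is a made-up assumption.
F_MAX = 4000.0  # Hz

def encode(f1, f2):
    """Map an (F1, F2) pair onto 4 input-node activations in [0, 1]:
    one node rises with the formant, its partner falls, so overall
    input magnitude stays roughly constant across the range."""
    return np.array([f1 / F_MAX, 1 - f1 / F_MAX,
                     f2 / F_MAX, 1 - f2 / F_MAX])

print(encode(300.0, 2300.0))  # e.g. an /i/-like vowel
```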
Self-organizing map in
Guenther and Gjaja
[Diagram: input nodes feeding map cells through connection weights; m_j is the activation of map cell j]
19
Self-organizing map: learning
stage
m_j : activation of map cell j
Connection weights w_ij : initialized to be random
Learning rule: Δw_ij = α (x_i − w_ij) m_j
(update = learning rate × mismatch × activation)
20
Self-organizing map: learning
stage

Crucially, only update connections to
the most active N cells



The “inhibitory connection”
N goes down to 1 over time
Intuition: each cell selectively encodes a
certain kind of input; otherwise the
weights are pulled in different directions
(a sketch follows this slide)
21
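A minimal numpy sketch of this learning stage, under stated assumptions: dot-product activations, a guessed learning rate, a neighborhood that simply halves each epoch, and a made-up training distribution. None of these are claimed to be G&G's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CELLS = 100   # map cells
N_INPUT = 4     # 4 input nodes per (F1, F2) pair, as on the previous slides
ALPHA = 0.05    # learning rate (a guess; G&G's schedule differs)

# connection weights, initialized to be random
W = rng.uniform(0.0, 1.0, size=(N_CELLS, N_INPUT))

def activations(x, W):
    """m_j: weighted sum of the input at each map cell."""
    return W @ x

def train(inputs, W, n_epochs=20):
    for epoch in range(n_epochs):
        # the winner neighborhood shrinks from many cells down to 1 over training
        n_active = max(1, N_CELLS // (2 ** epoch))
        for x in inputs:
            m = activations(x, W)
            winners = np.argsort(m)[-n_active:]   # only the N most active cells learn
            # learning rule: delta_w = alpha * (x - w) * m_j, applied to winners only
            W[winners] += ALPHA * (x - W[winners]) * m[winners, None]
    return W

# hypothetical training tokens: noisy samples around two "prototype" input vectors
protos = np.array([[0.2, 0.8, 0.6, 0.4], [0.7, 0.3, 0.3, 0.7]])
inputs = np.clip(protos[rng.integers(0, 2, 200)]
                 + rng.normal(0, 0.05, (200, N_INPUT)), 0, 1)
W = train(inputs, W)
```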
Self-organizing map: testing
stage



Population vector: take a weighted sum of
individual votes (a generic sketch follows this slide)
Inspired by neural firing patterns in monkey
reaching
How to compute Fvote?


X+, X- must point in the same direction as Z+, Z-
Algebra exercise: derive a formula to calculate
Fvote (homework Problem 1)
22
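To make the population-vector idea concrete, here is a generic sketch: each cell votes with its preferred formant values, and the votes are averaged, weighted by activation. The decoding in f_vote assumes the complementary input coding sketched earlier; none of this is claimed to be the exact formula homework Problem 1 asks for.

```python
import numpy as np

F_MAX = 4000.0  # hypothetical scale, matching the input-coding sketch above

def f_vote(w):
    """A cell's preferred (F1, F2), decoded from its weights under the assumed coding."""
    f1 = F_MAX * w[0] / (w[0] + w[1])
    f2 = F_MAX * w[2] / (w[2] + w[3])
    return np.array([f1, f2])

def population_vector(x, W):
    """Perceived (F1, F2): the activation-weighted average of every cell's vote."""
    m = W @ x                                 # cell activations for this input
    votes = np.array([f_vote(w) for w in W])  # each cell's preferred formant pair
    return (m[:, None] * votes).sum(axis=0) / m.sum()

# usage with stand-in weights (in practice W comes out of the learning stage)
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, size=(100, 4))
x = np.array([0.3, 0.7, 0.6, 0.4])            # an encoded (F1, F2) input
print(population_vector(x, W))
```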
The distribution of Fvote after
learning
[Figures: training stimuli and the resulting distribution of Fvote]
Recall: activation
23
Results shown in the paper
[Figure: Kuhl's results alongside their model's]
24
Results shown in the paper
With some choice of
parameters to fit the
“percent generalization”
25
Local summary

Captures a type of pre-linguistic / prelexical learning from distributions


G&G: No need to have categories in
order to get PM effects


(next time) Distribution learning can
happen in a short time (Maye&Gerken, 00)
Also see Lacerda (95)’s exemplar model
Levels of view: cognition or cortex

Maybe Guenther is right …
26
Categorization / discrimination
training (Guenther et al, 99)




1. Space the stimuli
according to JND (just-noticeable difference)
2. Test whether they
can hear the
distinctions in control
and training regions
3. TRAIN in some way
(45 minutes!)
4. Test again as in 2.
Band-passed white noise
27
Guenther et al, 99

Categorization training:



Tell people that sounds from the training
region belong to a “prototype”
Present sounds from the training region and
the band edges; ask whether each one is the prototype
Discrimination training:


People say “same” or “different”, and get
feedback
Made harder by tighter spacing over time
28
Guenther et al, 99

Result of distribution learning depends
on what people do with it
[Figure: post-test results in the control and training regions]
29
Guenther’s neural story for
this effect



Propose to modify G&G’s architecture to
simulate effects of different training
procedure
Did some brain scan
to verify the hypothesis
(Guenther et al, 04)
New model forthcoming?
30
Guenther’s neural story for
this effect



Propose to modify G&G’s architecture to
simulate effects of different training
procedure
Did some brain scan
to verify the hypothesis
(Guenther et al, 04)
New model forthcoming?
(A+ guaranteed if you
turn one in)
31
The untold story of Guenther
and Gjaja

Size of model matters a lot
[Figures: maps trained with 100, 500, and 1500 cells]
Larger models give more flexibility, but no
free lunch:



Take longer to learn
Must tune the learning rate parameter
Must tune the activation neighborhood
32
Randomness in the outcome

[Figures: one run that is perhaps the nicest, and some other runs]
33
A technical view of G&G’s
model

“Self-organization”: a type of
unsupervised learning



Receives input with no information about which
categories the items belong to
Also referred to as “clustering” (next time)
“Emergent behavior”: spatial input
patterns become coded in connection weights
that realize the non-linear mapping
34
A technical view of G&G’s
model

A non-parametric approximation to the
input-percept non-linear mapping



Lots of parameters, but they do not
individually correspond to categories
Will look at a parametric approach to
clustering next time
Difficulties: tuning of parameters, rates
of convergence, formal analysis (more
later)
35
Other questions for G&G




If G&G is a model of auditory cortex,
then where do F1 and F2 come from?
(see Adam’s presentation)
What if we expose G&G to noisy data?
How does it deal with
speech segmentation?
More next time.
36
Homework problems related
to perceptual magnets

Problem 2: warping in one dimension
predicted by perceptual magnets
Before

Training data
After?
Problem 3: program a simplified G&G
type of model or any other SOM to
simulate the effect shown above
37
Commercial:
CogSci colloquium this year

Many big names in connectionism this
semester
38
A digression to the history of
neural nets




40’s -- study of neurons (Hebb,
Hodgkin & Huxley)
50’s -- self-programming computers,
biological inspirations
ADALINE from Stanford (Widrow &
Hoff)
Perceptron (Rosenblatt, 60)
39
Perceptron





Input: x1, …, xn
Output: {0, 1}
Weights: w1, …, wn
Bias: a
How it works: output 1 if w1 x1 + … + wn xn + a > 0,
and 0 otherwise (sketched after this slide)
40
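Read concretely, the decision rule above is a thresholded weighted sum; here is a minimal sketch (the AND example is ours, not from the demo):

```python
def perceptron_output(x, w, a):
    """Return 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + a
    return 1 if s > 0 else 0

# a perceptron computing logical AND of two binary inputs
print(perceptron_output([1, 1], w=[1.0, 1.0], a=-1.5))  # -> 1
print(perceptron_output([1, 0], w=[1.0, 1.0], a=-1.5))  # -> 0
```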
What can perceptrons do?

A linear machine: finding a plane that
separates the 0 examples from the 1
examples
41
Perceptron learning rule


Start with random weights
Take an input and check the output; when
there is an error (see the sketch after this slide):





weight = weight + a*input*error
bias = bias + a*error
Repeat until no more error
Recall G&G learning rule
See demo
42
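A minimal sketch of this training loop (it is not the course demo; the AND data, learning rate, and epoch cap are illustrative choices):

```python
import random

def train_perceptron(data, lr=0.1, max_epochs=1000):
    """data: list of (inputs, target) pairs with targets in {0, 1}."""
    n = len(data[0][0])
    w = [random.uniform(-1, 1) for _ in range(n)]  # start with random weights
    a = random.uniform(-1, 1)                      # bias
    for _ in range(max_epochs):
        errors = 0
        for x, target in data:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + a > 0 else 0
            err = target - out
            if err != 0:
                errors += 1
                w = [wi + lr * xi * err for wi, xi in zip(w, x)]  # weight += lr*input*error
                a += lr * err                                      # bias += lr*error
        if errors == 0:   # repeat until a whole pass produces no errors
            break
    return w, a

# AND is linearly separable, so this converges; XOR would not (see the next slides).
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(and_data))
```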
Criticism from traditional AI
(Minsky & Papert, 69)



A simple perceptron is not capable of
learning some simple Boolean functions
More sophisticated machines are hard to
understand
Example: XOR (see demo)
43
Revitalization of neural nets:
connectionism




Minsky and Papert’s book stopped the
work on neural nets for 10 years …
But as computers got faster and
cheaper:
CMU parallel distributed processing
(PDP) group
UC San Diego institute of cognitive
science
44
Networks get complicated:
more layers

A 3-layer perceptron can solve the XOR
problem (a hand-wired sketch follows this slide)
45
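One way to see why a hidden layer suffices (a hand-wired sketch with fixed weights, not learned ones): let the hidden units compute OR and AND, and let the output fire for "OR but not AND", which is exactly XOR.

```python
def step(s):
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    """Two hidden units (OR and AND) plus an output unit computing OR-but-not-AND."""
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)        # fires if at least one input is 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)       # fires only if both inputs are 1
    return step(1.0 * h_or - 2.0 * h_and - 0.5)   # OR but not AND

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))  # -> 0, 1, 1, 0
```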
Networks get complicated:
combining non-linearities

Decision surface can be non-linear
46
Networks get complicated:
more connections



Bi-directional
Inhibitory
Feedback loops
47
Key to the revitalization of
connectionism


Back-propagation algorithm that can be used
to train a variety of networks (a toy sketch follows this slide)
Variants developed for other architectures
[Diagram: the error signal is propagated backwards through the network]
48
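Since the slide only names the algorithm, here is a toy back-propagation sketch for a single hidden layer, applied to the XOR problem from the earlier slides; the layer sizes, squared-error loss, learning rate and iteration count are illustrative choices, not anything taken from the readings.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # input -> hidden
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: push the output error back through the two layers
    d_out = (out - y) * out * (1 - out)   # error at the output units
    d_h = (d_out @ W2.T) * h * (1 - h)    # error assigned to the hidden units
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2))  # should come out close to [[0], [1], [1], [0]]
```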
Two kinds of learning with
neural nets

Supervised




Classification: patterns --> {1,…,N}
Adaptive control: control variables --> observed
measurements
Time series prediction: past events --> future
events
Unsupervised


Clustering
Compression / feature extraction
49
Statistical insights from the
neural nets






Overtraining / early-stopping
Local maxima
Model complexity -- training error
tradeoff
Limits of learning: generalization
bounds
Dimension reduction, feature selection
Many of these problems become
standard topics in machine learning
50
Big question: neural nets and
symbolic computing


Grammar: harmony theory (Smolensky,
and later Optimality Theory)
Representation: Can neural nets
discover phonemes?
51
Next week

Reading will be posted on the course
webpage
52