COMPUTATIONAL MODELS OF COGNITIVE PHENOMENA
Università di Milano-Bicocca
Laurea Magistrale in Informatica
Corso di
APPRENDIMENTO E APPROSSIMAZIONE
Prof. Giancarlo Mauri
Lecture 4 - Computational Learning Theory
1
Computational models of cognitive phenomena
Computing capabilities: Computability theory
Reasoning/deduction: Formal logic
Learning/induction: ?
2
A theory of the learnable (Valiant ‘84)
[…] The problem is to discover good models that are
interesting to study for their own sake and that promise to
be relevant both to explaining human experience and to
building devices that can learn […] Learning machines must
have all 3 of the following properties:
the machines can provably learn whole classes of concepts,
these classes can be characterized
the classes of concepts are appropriate and nontrivial for
general-purpose knowledge
the computational process by which the machine builds the
desired programs requires a “feasible” (i.e. polynomial) number
of steps
3
A theory of the learnable
We seek general laws that constrain inductive
learning, relating:
Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented
4
Probably approximately correct learning
A formal computational model that aims to shed light on
the limits of what can be learned by a machine,
by analysing the computational cost of learning
algorithms
5
What we want to learn
CONCEPT =
recognizing algorithm
LEARNING = computational description of recognizing
algorithms starting from:
- examples
- incomplete specifications
That is:
to determine uniformly good approximations of an unknown
function from its values at some sample points
interpolation
pattern matching
concept learning
6
What’s new in p.a.c. learning?
Accuracy of results
and
running time for learning algorithms
are explicitly quantified and related
A general problem:
use of resources (time, space…) by computations
COMPLEXITY THEORY
Example
Sorting:
n·logn time (polynomial, feasible)
Bool. satisfiability:
2ⁿ time (exponential, intractable)
7
Learning from examples
[Diagram: a LEARNER receives EXAMPLES of a Concept drawn from a DOMAIN]
CONCEPT: a subset of the domain
EXAMPLES: elements of the concept (positive examples)
REPRESENTATION: a mapping from concepts to expressions
GOOD LEARNER?
EFFICIENT LEARNER?
8
The P.A.C. model
A domain X (e.g. {0,1}ⁿ, Rⁿ)
A concept: a subset of X, f ⊆ X (equivalently f: X→{0,1})
A class of concepts F ⊆ 2^X
A probability distribution P on X
Example 1
X ≡ a square
F ≡ triangles in the square
9
The P.A.C. model
Example 2
X ≡ {0,1}ⁿ
F ≡ a family of boolean functions
f_r(x1,…,xn) = 1 if there are at least r ones in (x1,…,xn), 0 otherwise
P a probability distribution on X: uniform or non-uniform
10
The P.A.C. model
The learning process
Labeled sample: ((x1, f(x1)), (x2, f(x2)), …, (xt, f(xt)))
Hypothesis: a function h consistent with the sample (i.e., h(xi) = f(xi) ∀i)
Error probability: Perr = P{x | h(x) ≠ f(x)}, x ∈ X
11
The P.A.C. model
[Diagram of the learning protocol]
X and F are known to the learner; the TEACHER knows the target concept f ∈ F
An examples generator draws t examples from X with probability distribution P
The TEACHER labels them: ((x1,f(x1)), … , (xt,f(xt)))
The LEARNER runs the inference procedure A and outputs a hypothesis h
(an implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is “ALMOST ALWAYS”
“CLOSE TO” the target concept f
12
The P.A.C. model
“CLOSE TO”
METRIC: given P, d_P(f,h) = Perr = P{x | f(x)≠h(x)}
Given an approximation parameter ε (0 < ε ≤ 1), h is an ε-approximation
of f if d_P(f,h) ≤ ε
“ALMOST ALWAYS”
Confidence parameter δ (0 < δ ≤ 1)
The “measure” of sequences of examples,
randomly chosen according to P, such that h
is an ε-approximation of f is at least 1-δ
13
Learning algorithm
[Diagram: generator of examples → Learner → h]
F: concept class
S: set of labeled samples from a concept in F
A: S → F such that:
∀ε,δ (0 < ε,δ < 1) ∀f∈F ∃m∈N ∀S s.t. |S| ≥ m:
I) A(S) is consistent with S
II) P(Perr < ε) > 1-δ
14
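The definitions above can be exercised on the threshold functions f_r of Example 2. A minimal Python sketch, assuming the uniform distribution on {0,1}ⁿ; the names `target` and `learn_r` are illustrative, not from the course material:

```python
import random

def target(x, r=3):
    # f_r from Example 2: 1 iff x contains at least r ones
    return int(sum(x) >= r)

def learn_r(sample, n):
    # Consistent learner: return the first f_r' that agrees with every
    # labeled example (hypothesis class {f_r' : 0 <= r' <= n+1})
    for r in range(n + 2):
        h = lambda x, r=r: int(sum(x) >= r)
        if all(h(x) == y for x, y in sample):
            return h

random.seed(0)
n, t = 10, 200
draw = lambda: tuple(random.randint(0, 1) for _ in range(n))
sample = [(x, target(x)) for x in (draw() for _ in range(t))]
h = learn_r(sample, n)
# estimate d_P(f, h) on fresh points drawn from the same distribution P
test = [draw() for _ in range(2000)]
err = sum(h(x) != target(x) for x in test) / len(test)
print(f"empirical error: {err:.4f}")
```

With 200 examples the learned threshold is almost always the target one, so the empirical error is small: the hypothesis is an ε-approximation of f for a small ε, as the model requires.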
The efficiency issue
Look for algorithms which use “reasonable” amount of
computational resources
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning)
COMPUTATION TIME (Polynomial PAC learning)
DEF 1: a concept class F = ∪n≥1 Fn is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded
by some polynomial function in n, 1/ε, 1/δ
15
The efficiency issue
DEF 2: a concept class F = ∪n≥1 Fn is polynomially PAC learnable
if there is a learning algorithm with running time bounded
by some polynomial function in n, 1/ε, 1/δ
POLYNOMIAL PAC ⊆ STATISTICAL PAC
16
Learning boolean functions
Bn = {f: {0,1}ⁿ → {0,1}}
The set of boolean functions in n variables
Fn ⊆ Bn
A class of concepts
Example 1:
Fn = clauses with literals in {x1,…,xn, x̄1,…,x̄n}
e.g. …; x3 ∨ x̄k ∨ x2; …; x1 ∨ x̄2 ∨ … ∨ xn
Example 2:
Fn = linearly separable functions in n variables
e.g. …; HS(Σk wk·xk − θ); …
REPRESENTATION
- TRUTH TABLE (EXPLICIT)
- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEAN CIRCUITS → BOOLEAN FUNCTIONS
17
Boolean functions and circuits
• BASIC OPERATIONS: ∧, ∨, ¬
• COMPOSITION: [f(g1, … , gm)](x) = f(g1(x), … , gm(x)),
with f in m variables and the gi in n variables
CIRCUIT: a finite acyclic directed graph with input nodes,
internal nodes labelled by basic operations, and an output node
Given an assignment {x1 … xn} → {0, 1}
to the input variables, the output node
computes the corresponding value
[Figure: a circuit with input nodes x1, x2, x3]
18
Boolean functions and circuits
Fn ⊆ Bn
Cn: the class of circuits which compute all and only the functions in Fn
F = ∪n≥1 Fn,  C = ∪n≥1 Cn
Algorithm A to learn F by C:
• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ)
(t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes C = An(S)
• OUTPUT: C (a representation of the hypothesis)
Note that the inference procedure An receives as input the integer
n and a t-sample over {0,1}ⁿ and outputs An(S) = A(n, S)
19
Boolean functions and circuits
An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a
concept class F = ∪n≥1 Fn, using the class of representations C = ∪m≥1 Cm,
if for all n≥1, for all f∈Fn, for all 0<ε,δ<1 and for every probability
distribution p over {0,1}ⁿ the following holds:
if the inference procedure An receives as input a t-sample, it outputs
a representation c∈Cn of a function g that is probably approximately
correct, i.e. with probability at least 1-δ a t-sample is chosen such
that the inferred function g satisfies
P{x | f(x)≠g(x)} ≤ ε
g is ε-good: g is an ε-approximation of f
g is ε-bad: g is not an ε-approximation of f
NOTE: distribution free
20
Statistical P.A.C. learning
PROBLEM: estimate upper and lower bounds on the sample size
t = t(n, 1/ε, 1/δ)
Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
DEF: An inference procedure An for the class Fn is consistent if,
given the target function f∈Fn, for every t-sample
S = (<x1,b1>, … , <xt,bt>), An(S) is a representation of a function
g “consistent” with S, i.e. g(x1) = b1, … , g(xt) = bt
DEF: A learning algorithm A is consistent if its inference procedure
is consistent
21
A simple upper bound
THEOREM: t(n, 1/ε, 1/δ) ≤ ε⁻¹(ln(#Fn) + ln(1/δ))
PROOF:
Prob{(x1, … , xt) | ∃g: g(x1)=f(x1), … , g(xt)=f(xt), g ε-bad} ≤
≤ Σ_{g ε-bad} Prob{g(x1) = f(x1), … , g(xt) = f(xt)}    [since P(A∪B) ≤ P(A)+P(B)]
= Σ_{g ε-bad} Π_{i=1,…,t} Prob{g(xi) = f(xi)}    [independent events]
≤ Σ_{g ε-bad} (1-ε)^t ≤ #Fn·(1-ε)^t ≤ #Fn·e^(-εt)
Impose #Fn·e^(-εt) ≤ δ
NOTE
- #Fn must be finite
22
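The bound just derived is easy to evaluate numerically. A small sketch (the function name `sample_size` is mine); the example class, monomials over n variables with #Fn ≤ 3ⁿ (each variable appears positive, negated, or not at all), is an assumption for illustration:

```python
import math

def sample_size(card_f, eps, delta):
    # Simple upper bound: t >= (1/eps) * (ln(#F) + ln(1/delta)) examples
    # suffice for any consistent learner over a finite class F
    return math.ceil((math.log(card_f) + math.log(1 / delta)) / eps)

# monomials over n = 10 variables: #F <= 3^10
t = sample_size(3 ** 10, eps=0.1, delta=0.01)
print(t)  # 156
```

Note how the bound is logarithmic in #Fn, so the exponential size of the class translates into a sample size linear in n.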
Vapnik-Chervonenkis approach (1971)
Problem: uniform convergence of relative frequencies to their probabilities
X: domain
F ⊆ 2^X: class of concepts
S = (x1, … , xt): t-sample
f ≡S g iff f(xi) = g(xi) ∀xi ∈ S    (f and g are indistinguishable by S)
ΠF(S) = #(F/≡S): the index of F w.r.t. S
mF(t) = max{ΠF(S) | S is a t-sample}: the growth function
23
A general upper bound
THEOREM
Prob{(x1, … , xt) | ∃g: g ε-bad, g(x1) = f(x1), … , g(xt) = f(xt)} ≤ 2·mF(2t)·2^(-εt/2)
FACT
mF(t) ≤ 2^t
mF(t) ≤ #F (this condition immediately gives the simple upper bound)
mF(t) = 2^t implies mF(j) = 2^j for all j < t
24
Graph of the growth function
[Figure: graph of mF(t), which equals 2^t up to t = d and is bounded by #F afterwards]
DEFINITION
d = VCdim(F) = max{t | mF(t) = 2^t}
FUNDAMENTAL PROPERTY
mF(t) = 2^t for t ≤ d
mF(t) ≤ Σ_{k=0}^{d} C(t,k) ≤ t^d + 1 ≤ (et/d)^d for t > d
BOUNDED BY A POLYNOMIAL IN t!
25
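The fundamental property (Sauer's bound) can be checked numerically; the helper name `growth_bound` is mine:

```python
import math

def growth_bound(t, d):
    # Sauer's bound: m_F(t) <= sum_{k=0}^{d} C(t, k) when VCdim(F) = d
    return sum(math.comb(t, k) for k in range(d + 1))

t, d = 100, 3
poly = growth_bound(t, d)               # polynomial in t
print(poly)                             # 166751
assert poly <= (math.e * t / d) ** d    # below the (et/d)^d bound (t >= d)
assert poly < 2 ** t                    # far below 2^t
```

For t = 100 and d = 3 the bound is about 1.7·10⁵, while 2^t is astronomically larger: this gap is exactly what makes the general upper bound useful.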
Upper and lower bounds
THEOREM
If dn = VCdim(Fn)
then t(n, 1/ε, 1/δ) ≤ max((4/ε)·log2(2/δ), (8dn/ε)·log2(13/ε))
PROOF
Impose 2·mFn(2t)·2^(-εt/2) ≤ δ
A lower bound on t(n, 1/ε, 1/δ):
the number of examples which are necessary for arbitrary algorithms
THEOREM
For 0 < ε ≤ 1/8 and δ ≤ 1/100:
t(n, 1/ε, 1/δ) ≥ max(((1-ε)/ε)·ln(1/δ), (dn-1)/(32ε))
26
An equivalent definition of VCdim
ΠF(S) = #{f⁻¹(1) ∩ {x1, … , xt} | f ∈ F}
i.e. the cardinality of the set of subsets of S that can be obtained
by intersecting S with concepts in F
If ΠF(S) = 2^|S| we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the
largest finite set of points S ⊆ X that is shattered by F
27
Example 1
Learn the family F of circles contained in the square
VCdim(F) = 3
For ε = 0.01, δ = 0.001:
t(n, 1/ε, 1/δ) ≤ max(400·log2 2000, 2400·log2 1300) ≈ 24,000    Sufficient!
t(n, 1/ε, 1/δ) ≥ max(100·ln 1000, (3-1)/(32·0.01)) ≈ 690    Necessary!
28
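The two numbers on this slide can be reproduced with a short computation; the helper names `pac_upper` and `pac_lower` are mine:

```python
import math

def pac_upper(eps, delta, d):
    # Upper bound: max((4/eps)·log2(2/delta), (8d/eps)·log2(13/eps))
    return max(4 / eps * math.log2(2 / delta),
               8 * d / eps * math.log2(13 / eps))

def pac_lower(eps, delta, d):
    # Lower bound: max(((1-eps)/eps)·ln(1/delta), (d-1)/(32·eps))
    return max((1 - eps) / eps * math.log(1 / delta),
               (d - 1) / (32 * eps))

eps, delta, d = 0.01, 0.001, 3   # circles in the square: VCdim = 3
up = pac_upper(eps, delta, d)    # roughly 24,800 examples suffice
lo = pac_lower(eps, delta, d)    # roughly 684 examples are necessary
print(round(up), round(lo))
```

The two orders of magnitude between the bounds show how loose the gap between sufficiency and necessity can be for a fixed small class.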
Example 2
Learn the family Ln of linearly separable boolean functions in n variables
f ∈ Ln iff there exist (w1,…,wn, θ) such that
f(x1,…,xn) = HS(Σ_{k=1,…,n} wk·xk − θ)
where HS(x) = 1 if x ≥ 0, 0 otherwise
VCdim(Ln) = n+1
#Ln ≤ 2^(n²)
SIMPLE UPPER BOUND
t(n, 1/ε, 1/δ) ≤ ε⁻¹(n²·ln 2 + ln(1/δ))
UPPER BOUND USING VCdim(Ln)
t(n, 1/ε, 1/δ) ≤ max((4/ε)·log2(2/δ), (8(n+1)/ε)·log2(13/ε))
GROWS LINEARLY WITH n!
29
Example 2
Consider the class L2 of linearly separable functions in two variables
VCdim(Ln) = n+1, hence VCdim(L2) = 3
VCdim(L2) ≥ 3: [figure: three points in general position are shattered by half-planes]
VCdim(L2) < 4: [figures]
The green point cannot be separated from the other three
No straight line can separate the green from the red points
30
Classes of boolean formulas
Monomials: x1 ∧ x2 ∧ … ∧ xk
DNF: m1 ∨ m2 ∨ … ∨ mj (mi monomials)
Clauses: x1 ∨ x2 ∨ … ∨ xk
CNF: c1 ∧ c2 ∧ … ∧ cj (ci clauses)
k-DNF: ≤ k literals in each monomial
k-term-DNF: ≤ k monomials
k-CNF: ≤ k literals in each clause
k-clause-CNF: ≤ k clauses
Monotone formulas: contain no negated literals
μ-formulas: each variable appears at most once
31
The results
Th. (Valiant)
Monomials are learnable from positive examples with 2ε⁻¹(n + log δ⁻¹)
examples (ε = tolerated error), putting in g:
- xi, if xi = 1 in all examples
- x̄i, if xi = 0 in all examples

H := x1x̄1x2x̄2…xnx̄n;
for i := 1 to B do
begin
  ex := generate();
  for j := 1 to n do
    if ex(j) = 0 then delete xj from H
    else delete x̄j from H
end

N.B. Learnability is non-monotone:
A ⊆ B and B learnable does not imply A learnable
Th.
Monomials are not learnable from negative examples
32
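Valiant's deletion algorithm above translates directly into Python; the encoding of a monomial as a per-variable set of surviving literals is an implementation choice of mine:

```python
def learn_monomial(n, positive_examples):
    # Start from the monomial containing every literal x_j and x̄_j, then
    # delete the literals contradicted by each positive example
    h = [{"pos", "neg"} for _ in range(n)]
    for ex in positive_examples:          # ex is a 0/1 tuple of length n
        for j, bit in enumerate(ex):
            if bit == 1:
                h[j].discard("neg")       # x̄_j cannot be in the target
            else:
                h[j].discard("pos")       # x_j cannot be in the target
    return h

def predict(h, x):
    # The learned monomial accepts x iff no surviving literal is violated
    return int(all(
        not ("pos" in lits and bit == 0) and not ("neg" in lits and bit == 1)
        for lits, bit in zip(h, x)))

# hypothetical target monomial: x1 ∧ x̄3 (over n = 3 variables)
pos = [(1, 0, 0), (1, 1, 0)]
h = learn_monomial(3, pos)
assert predict(h, (1, 0, 0)) == 1
assert predict(h, (0, 1, 0)) == 0   # violates x1
```

Since only literals contradicted by positive examples are ever deleted, the hypothesis never accepts a point outside the target: errors are one-sided, which is what makes positive-only learning work here.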
Positive results
1) k-CNF are learnable from positive examples only
1b) k-DNF are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from
positive and negative examples
3) the class of k-decision lists is learnable
k-DL = ((m1, b1), …, (mj, bj)) with mi a monomial, |mi| ≤ k, bi ∈ {0,1}
For C ∈ k-DL and v a boolean vector:
i := min{ h | mh(v) = 1 }
then C(v) = bi (0 if no such i exists)
Th.
Every k-DNF (or k-CNF) formula can be represented by a small k-DL
33
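The evaluation rule for a k-DL can be sketched as follows; the encoding of monomials as (index, required value) pairs is a hypothetical choice of mine:

```python
def eval_decision_list(dl, v):
    # Evaluate a k-DL ((m1,b1),...,(mj,bj)) on boolean vector v:
    # output the bit of the first monomial satisfied by v, or 0 if none is
    for monomial, b in dl:
        # a monomial is a list of (index, required_value) pairs
        if all(v[i] == val for i, val in monomial):
            return b
    return 0

# hypothetical 2-DL: if x1 ∧ x̄2 then 1; else if x3 then 0; else 0
dl = [([(0, 1), (1, 0)], 1), ([(2, 1)], 0)]
assert eval_decision_list(dl, (1, 0, 1)) == 1
assert eval_decision_list(dl, (0, 0, 1)) == 0
assert eval_decision_list(dl, (0, 0, 0)) == 0
```

The "first satisfied monomial wins" rule is what lets a small k-DL simulate any k-DNF: list the terms of the DNF with output 1, and fall through to 0.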
Negative results
(if RP ≠ NP)
(in the distribution-free sense)
1) μ-formulas are not learnable
2) Threshold boolean functions are not learnable
3) For k ≥ 2, k-term-DNF formulas are not learnable
34
Mistake bound model
So far: how many examples needed to learn ?
What about: how many mistakes before convergence ?
Let’s consider similar setting to PAC learning:
Instances drawn at random from X according to distribution D
Learner must classify each instance before receiving correct
classification from teacher
Can we bound the number of mistakes learner makes before
converging ?
35
Mistake bound model
Learner:
Receives a sequence of training examples x
Predicts the target value f(x)
Receives the correct target value from the trainer
Is evaluated by the total number of mistakes it makes before
converging to the correct hypothesis
I.e.:
Learning takes place during the use of the system, not off-line
Ex.: prediction of fraudulent use of credit cards
36
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean
literals
FIND-S:
Initialize h to the most specific hypothesis in H:
x1 ∧ x̄1 ∧ x2 ∧ x̄2 ∧ … ∧ xn ∧ x̄n
For each positive training instance x
Remove from h any literal not satisfied by x
Output h
37
Mistake bound for Find-S
If c ∈ H and the training data are noise free, Find-S
converges to an exact hypothesis
How many errors to learn c ∈ H (only positive
examples can be misclassified)?
The first positive example will be misclassified, and n
literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal
#mistakes ≤ n+1 (worst case, for the “total” concept ∀x
c(x)=1)
38
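The n+1 mistake bound can be observed by running Find-S online. A sketch with my own encoding of conjunctions (per-variable sets of surviving literals) and a hypothetical example stream:

```python
def find_s(n, stream):
    # Find-S over conjunctions of boolean literals, counting mistakes.
    # stream yields (x, label) with x a 0/1 tuple; only positive examples
    # (label 1) ever cause an update.
    h = [{"pos", "neg"} for _ in range(n)]   # x1 x̄1 ... xn x̄n: most specific
    mistakes = 0
    for x, label in stream:
        pred = int(all(
            not ("pos" in lits and b == 0) and not ("neg" in lits and b == 1)
            for lits, b in zip(h, x)))
        if pred != label:
            mistakes += 1
        if label == 1:                        # drop literals x does not satisfy
            for j, b in enumerate(x):
                h[j].discard("neg" if b == 1 else "pos")
    return h, mistakes

# hypothetical target concept c(x) = x1, over n = 2 variables
stream = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((1, 1), 1)]
h, m = find_s(2, stream)
print(h, m)   # 2 mistakes, within the n+1 = 3 bound
```

The first positive example removes n literals at once; every later mistake removes at least one more of the remaining n, giving the n+1 worst-case bound.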
Mistake bound for Halving
A version space is maintained and refined (e.g., by Candidate-Elimination)
Prediction is based on a majority vote among the hypotheses in
the current version space
“Wrong” hypotheses are removed (even if x is correctly
classified)
How many errors to exactly learn c ∈ H (H finite)?
A mistake occurs when the majority of the hypotheses misclassifies x
These hypotheses are removed
For each mistake, the version space is at least halved
At most log2(|H|) mistakes before exact learning (e.g., a single
hypothesis remaining)
Note: learning without mistakes is possible!
39
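The Halving scheme above can be sketched directly; the tiny hypothesis class of "dictator" functions and the example stream are illustrative assumptions of mine:

```python
def halving(hypotheses, stream):
    # Halving: keep the version space, predict by majority vote, then
    # remove every hypothesis that disagrees with the revealed label
    vs = list(hypotheses)                 # each hypothesis: a function x -> 0/1
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        pred = int(votes * 2 >= len(vs))  # majority (ties broken toward 1)
        if pred != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]
    return vs, mistakes

hyps = [lambda x: x[0], lambda x: x[1], lambda x: x[2]]   # H = dictators
stream = [((1, 0, 0), 0), ((0, 1, 0), 1), ((0, 0, 1), 0)]  # target: x2
vs, m = halving(hyps, stream)
print(len(vs), m)
```

On this stream the majority vote is right every time, yet the version space still shrinks to a single hypothesis: the run exactly illustrates the slide's note that exact learning without mistakes is possible.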
Optimal mistake bound
Question: what is the optimal mistake bound (i.e., the lowest
worst-case bound over all possible learning algorithms A) for
an arbitrary non-empty concept class C, assuming H=C?
Formally, for any learning algorithm A and any target concept
c:
MA(c) = max #mistakes made by A to exactly learn c over all
possible training sequences
MA(C) = maxcC MA(c)
Note: Mfind-S(C) = n+1
MHalving(C) ≤ log2(|C|)
Opt(C) = minA MA(C)
i.e., # of mistakes made for the hardest target concept in C,
using the hardest training sequence, by the best algorithm
40
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
e.g., the power set 2X of X, for which it holds:
VC(2X) = |X| = log2(|2X|)
There exist concept classes for which
VC(C) < Opt(C) < MHalving(C)
41
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a
pool of prediction algorithms
Learns by altering the weight associated with each
prediction algorithm
It does not eliminate hypotheses (i.e., algorithms)
inconsistent with some training examples, but just
reduces their weights, so it is able to accommodate
inconsistent training data
42
Weighted majority algorithm
∀i: wi := 1
For each training example (x, c(x)):
  q0 := q1 := 0
  For each prediction algorithm ai:
    if ai(x)=0 then q0 := q0 + wi
    if ai(x)=1 then q1 := q1 + wi
  if q1 > q0 then predict c(x)=1
  if q1 < q0 then predict c(x)=0
  if q1 = q0 then predict c(x)=0 or 1 at random
  For each prediction algorithm ai do
    if ai(x) ≠ c(x) then wi := β·wi    (0 ≤ β < 1)
43
Weighted majority algorithm (WM)
Coincides with Halving for β=0
Theorem - Let D be any sequence of training examples, A any
set of n prediction algorithms, k the minimum number of mistakes
made by any aj∈A on D, and β=1/2. Then WM makes at
most
2.4(k+log2n)
mistakes over D
44
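The theorem can be checked on a toy pool. A sketch of WM with deterministic tie-breaking (the original predicts at random on ties); the three predictors and the parity stream are my own illustrative choices:

```python
import math

def weighted_majority(algos, stream, beta=0.5):
    # Weighted Majority: weighted vote over a pool of prediction algorithms;
    # multiply the weight of each wrong algorithm by beta
    w = [1.0] * len(algos)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for wi, a in zip(w, algos):
            q[a(x)] += wi
        pred = 1 if q[1] > q[0] else 0   # ties broken toward 0 (deterministic)
        if pred != label:
            mistakes += 1
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, algos)]
    return mistakes

# pool of three predictors; the parity expert is perfect, so k = 0
algos = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(i, i % 2) for i in range(1, 9)]
m = weighted_majority(algos, stream)
print(m)
assert m <= 2.4 * (0 + math.log2(len(algos)))   # theorem: M <= 2.4(k + log2 n)
```

With a perfect expert in the pool (k=0), the bound allows at most 2.4·log2(3) ≈ 3.8 mistakes; on this stream WM in fact makes none, since the constant experts lose weight immediately.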
Weighted majority algorithm (WM)
Proof
Since aj (the best in A) makes k mistakes, its final weight wj will
be (1/2)^k
The sum W of the weights of all n algorithms in A is initially n;
each mistake made by WM reduces W to at most (3/4)W, because
the “wrong” algorithms hold at least 1/2 of the total weight, and
that weight is reduced by a factor of 1/2.
So the final total weight W is at most n(3/4)^M, where M is the
total number of mistakes made by WM over D.
45
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total
weight W, hence:
(1/2)^k ≤ n(3/4)^M
from which
M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.4(k + log2 n)
I.e., the number of mistakes made by WM will never be greater
than a constant factor times the number of mistakes made by
the best member of the pool, plus a term that grows only
logarithmically in the size of the pool
46