COMPUTATIONAL MODELS OF COGNITIVE PHENOMENA
Università di Milano-Bicocca
Laurea Magistrale in Informatica
Corso di
APPRENDIMENTO E APPROSSIMAZIONE
Prof. Giancarlo Mauri
Lecture 4 - Computational Learning Theory
1
Computational models of cognitive phenomena
Computing capabilities: Computability theory
Reasoning/deduction: Formal logic
Learning/induction: ?
2
A theory of the learnable (Valiant ‘84)
[…] The problem is to discover good models that are
interesting to study for their own sake and that promise to
be relevant both to explaining human experience and to
building devices that can learn […] Learning machines must
have all 3 of the following properties:
the machines can provably learn whole classes of concepts,
these classes can be characterized
the classes of concepts are appropriate and nontrivial for
general-purpose knowledge
the computational process by which the machine builds the
desired programs requires a “feasible” (i.e. polynomial) number
of steps
3
A theory of the learnable
We seek general laws that constrain inductive
learning, relating:
Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented
4
Probably approximately correct learning
A formal computational model that aims to shed light on
the limits of what can be learned by a machine,
by analysing the computational cost of learning
algorithms
5
What we want to learn
CONCEPT =
recognizing algorithm
LEARNING = computational description of recognizing
algorithms starting from:
- examples
- incomplete specifications
That is:
to determine uniformly good approximations of an unknown
function from its values at some sample points
interpolation
pattern matching
concept learning
6
What’s new in p.a.c. learning?
Accuracy of results
and
running time for learning algorithms
are explicitly quantified and related
A general problem:
use of resources (time, space…) by computations
COMPLEXITY THEORY
Example
Sorting:
n·logn time (polynomial, feasible)
Bool. satisfiability:
2ⁿ time (exponential, intractable)
7
Learning from examples
[Diagram: a LEARNER receives EXAMPLES of a Concept drawn from a DOMAIN]
CONCEPT: a subset of the domain
EXAMPLES: elements of the concept (positive examples)
REPRESENTATION: a mapping from concepts to expressions
GOOD LEARNER?
EFFICIENT LEARNER?
8
The P.A.C. model
A domain X (e.g. {0,1}ⁿ, Rⁿ)
A concept: a subset of X, f ⊆ X (equivalently f: X→{0,1})
A class of concepts F ⊆ 2^X
A probability distribution P on X
Example 1
X ≡ a square
F ≡ triangles in the square
9
The P.A.C. model
Example 2
X ≡ {0,1}ⁿ
F ≡ a family of boolean functions
f_r(x1,…,xn) = 1 if there are at least r ones in (x1,…,xn), 0 otherwise
P a probability distribution on X: uniform or non-uniform
10
The P.A.C. model
The learning process
Labeled sample: ((x1, f(x1)), (x2, f(x2)), …, (xt, f(xt)))
Hypothesis: a function h consistent with the sample (i.e., h(xi) = f(xi) ∀i)
Error probability: Perr = P{x | h(x) ≠ f(x)}, x ∈ X
11
The P.A.C. model
[Diagram of the learning protocol]
X and F are known to the learner; the TEACHER knows the target concept f ∈ F
An examples generator draws t examples from X with probability distribution P
The TEACHER labels them: ((x1,f(x1)), … , (xt,f(xt)))
The LEARNER runs the inference procedure A and outputs a hypothesis h
(an implicit representation of a concept)
The learning algorithm A is good if the hypothesis h is “ALMOST ALWAYS”
“CLOSE TO” the target concept f
12
The P.A.C. model
“CLOSE TO”
METRIC: given P, d_P(f,h) = Perr = P{x | f(x)≠h(x)}
Given an approximation parameter ε (0 < ε ≤ 1), h is an ε-approximation
of f if d_P(f,h) ≤ ε
“ALMOST ALWAYS”
Confidence parameter δ (0 < δ ≤ 1)
The “measure” of sequences of examples,
randomly chosen according to P, such that h
is an ε-approximation of f is at least 1-δ
13
Learning algorithm
[Diagram: generator of examples → Learner → h]
F: concept class
S: set of labeled samples from a concept in F
A: S → F such that:
∀ε,δ (0 < ε,δ < 1) ∀f∈F ∃m∈N ∀S s.t. |S| ≥ m:
I) A(S) is consistent with S
II) P(Perr < ε) > 1-δ
14
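The definitions above can be exercised on the threshold functions f_r of Example 2. A minimal Python sketch, assuming the uniform distribution on {0,1}ⁿ; the names `target` and `learn_r` are illustrative, not from the course material:

```python
import random

def target(x, r=3):
    # f_r from Example 2: 1 iff x contains at least r ones
    return int(sum(x) >= r)

def learn_r(sample, n):
    # Consistent learner: return the first f_r' that agrees with every
    # labeled example (hypothesis class {f_r' : 0 <= r' <= n+1})
    for r in range(n + 2):
        h = lambda x, r=r: int(sum(x) >= r)
        if all(h(x) == y for x, y in sample):
            return h

random.seed(0)
n, t = 10, 200
draw = lambda: tuple(random.randint(0, 1) for _ in range(n))
sample = [(x, target(x)) for x in (draw() for _ in range(t))]
h = learn_r(sample, n)
# estimate d_P(f, h) on fresh points drawn from the same distribution P
test = [draw() for _ in range(2000)]
err = sum(h(x) != target(x) for x in test) / len(test)
print(f"empirical error: {err:.4f}")
```

With 200 examples the learned threshold is almost always the target one, so the empirical error is small: the hypothesis is an ε-approximation of f for a small ε, as the model requires.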
The efficiency issue
Look for algorithms which use “reasonable” amount of
computational resources
COMPUTATIONAL RESOURCES
SAMPLE SIZE (Statistical PAC learning)
COMPUTATION TIME (Polynomial PAC learning)
DEF 1: a concept class F = ∪n≥1 Fn is statistically PAC learnable if there
is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded
by some polynomial function in n, 1/ε, 1/δ
15
The efficiency issue
DEF 2: a concept class F = ∪n≥1 Fn is polynomially PAC learnable
if there is a learning algorithm with running time bounded
by some polynomial function in n, 1/ε, 1/δ
POLYNOMIAL PAC ⊆ STATISTICAL PAC
16
Learning boolean functions
Bn = {f: {0,1}ⁿ → {0,1}}
The set of boolean functions in n variables
Fn ⊆ Bn
A class of concepts
Example 1:
Fn = clauses with literals in {x1,…,xn, x̄1,…,x̄n}
e.g. …; x3 ∨ x̄k ∨ x2; …; x1 ∨ x̄2 ∨ … ∨ xn
Example 2:
Fn = linearly separable functions in n variables
e.g. …; HS(Σk wk·xk − θ); …
REPRESENTATION
- TRUTH TABLE (EXPLICIT)
- BOOLEAN CIRCUITS (IMPLICIT)
BOOLEAN CIRCUITS → BOOLEAN FUNCTIONS
17
Boolean functions and circuits
• BASIC OPERATIONS: ∧, ∨, ¬
• COMPOSITION: [f(g1, … , gm)](x) = f(g1(x), … , gm(x)),
with f in m variables and the gi in n variables
CIRCUIT: a finite acyclic directed graph with input nodes,
internal nodes labelled by basic operations, and an output node
Given an assignment {x1 … xn} → {0, 1}
to the input variables, the output node
computes the corresponding value
[Figure: a circuit with input nodes x1, x2, x3]
18
Boolean functions and circuits
Fn ⊆ Bn
Cn: the class of circuits which compute all and only the functions in Fn
F = ∪n≥1 Fn,  C = ∪n≥1 Cn
Algorithm A to learn F by C:
• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ)
(t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes C = An(S)
• OUTPUT: C (a representation of the hypothesis)
Note that the inference procedure An receives as input the integer
n and a t-sample over {0,1}ⁿ and outputs An(S) = A(n, S)
19
Boolean functions and circuits
An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a
concept class F = ∪n≥1 Fn, using the class of representations C = ∪m≥1 Cm,
if for all n≥1, for all f∈Fn, for all 0<ε,δ<1 and for every probability
distribution p over {0,1}ⁿ the following holds:
if the inference procedure An receives as input a t-sample, it outputs
a representation c∈Cn of a function g that is probably approximately
correct, i.e. with probability at least 1-δ a t-sample is chosen such
that the inferred function g satisfies
P{x | f(x)≠g(x)} ≤ ε
g is ε-good: g is an ε-approximation of f
g is ε-bad: g is not an ε-approximation of f
NOTE: distribution free
20
Statistical P.A.C. learning
PROBLEM: estimate upper and lower bounds on the sample size
t = t(n, 1/ε, 1/δ)
Upper bounds will be given for consistent algorithms
Lower bounds will be given for arbitrary algorithms
DEF: An inference procedure An for the class Fn is consistent if,
given the target function f∈Fn, for every t-sample
S = (<x1,b1>, … , <xt,bt>), An(S) is a representation of a function
g “consistent” with S, i.e. g(x1) = b1, … , g(xt) = bt
DEF: A learning algorithm A is consistent if its inference procedure
is consistent
21
A simple upper bound
THEOREM: t(n, 1/ε, 1/δ) ≤ ε⁻¹(ln(#Fn) + ln(1/δ))
PROOF:
Prob{(x1, … , xt) | ∃g: g(x1)=f(x1), … , g(xt)=f(xt), g ε-bad} ≤
≤ Σ_{g ε-bad} Prob{g(x1) = f(x1), … , g(xt) = f(xt)}    [since P(A∪B) ≤ P(A)+P(B)]
= Σ_{g ε-bad} Π_{i=1,…,t} Prob{g(xi) = f(xi)}    [independent events]
≤ Σ_{g ε-bad} (1-ε)^t ≤ #Fn·(1-ε)^t ≤ #Fn·e^(-εt)
Impose #Fn·e^(-εt) ≤ δ
NOTE
- #Fn must be finite
22
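The bound just derived is easy to evaluate numerically. A small sketch (the function name `sample_size` is mine); the example class, monomials over n variables with #Fn ≤ 3ⁿ (each variable appears positive, negated, or not at all), is an assumption for illustration:

```python
import math

def sample_size(card_f, eps, delta):
    # Simple upper bound: t >= (1/eps) * (ln(#F) + ln(1/delta)) examples
    # suffice for any consistent learner over a finite class F
    return math.ceil((math.log(card_f) + math.log(1 / delta)) / eps)

# monomials over n = 10 variables: #F <= 3^10
t = sample_size(3 ** 10, eps=0.1, delta=0.01)
print(t)  # 156
```

Note how the bound is logarithmic in #Fn, so the exponential size of the class translates into a sample size linear in n.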
Vapnik-Chervonenkis approach (1971)
Problem: uniform convergence of relative frequencies to their probabilities
X: domain
F ⊆ 2^X: class of concepts
S = (x1, … , xt): t-sample
f ≡S g iff f(xi) = g(xi) ∀xi ∈ S    (f and g are indistinguishable by S)
ΠF(S) = #(F/≡S): the index of F w.r.t. S
mF(t) = max{ΠF(S) | S is a t-sample}: the growth function
23
A general upper bound
THEOREM
Prob{(x1, … , xt) | ∃g: g ε-bad, g(x1) = f(x1), … , g(xt) = f(xt)} ≤ 2·mF(2t)·2^(-εt/2)
FACT
mF(t) ≤ 2^t
mF(t) ≤ #F (this condition immediately gives the simple upper bound)
mF(t) = 2^t implies mF(j) = 2^j for all j < t
24
Graph of the growth function
[Figure: graph of mF(t), which equals 2^t up to t = d and is bounded by #F afterwards]
DEFINITION
d = VCdim(F) = max{t | mF(t) = 2^t}
FUNDAMENTAL PROPERTY
mF(t) = 2^t for t ≤ d
mF(t) ≤ Σ_{k=0}^{d} C(t,k) ≤ t^d + 1 ≤ (et/d)^d for t > d
BOUNDED BY A POLYNOMIAL IN t!
25
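The fundamental property (Sauer's bound) can be checked numerically; the helper name `growth_bound` is mine:

```python
import math

def growth_bound(t, d):
    # Sauer's bound: m_F(t) <= sum_{k=0}^{d} C(t, k) when VCdim(F) = d
    return sum(math.comb(t, k) for k in range(d + 1))

t, d = 100, 3
poly = growth_bound(t, d)               # polynomial in t
print(poly)                             # 166751
assert poly <= (math.e * t / d) ** d    # below the (et/d)^d bound (t >= d)
assert poly < 2 ** t                    # far below 2^t
```

For t = 100 and d = 3 the bound is about 1.7·10⁵, while 2^t is astronomically larger: this gap is exactly what makes the general upper bound useful.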
Upper and lower bounds
THEOREM
If dn = VCdim(Fn)
then t(n, 1/ε, 1/δ) ≤ max((4/ε)·log2(2/δ), (8dn/ε)·log2(13/ε))
PROOF
Impose 2·mFn(2t)·2^(-εt/2) ≤ δ
A lower bound on t(n, 1/ε, 1/δ):
the number of examples which are necessary for arbitrary algorithms
THEOREM
For 0 < ε ≤ 1/8 and δ ≤ 1/100:
t(n, 1/ε, 1/δ) ≥ max(((1-ε)/ε)·ln(1/δ), (dn-1)/(32ε))
26
An equivalent definition of VCdim
ΠF(S) = #{f⁻¹(1) ∩ {x1, … , xt} | f ∈ F}
i.e. the cardinality of the set of subsets of S that can be obtained
by intersecting S with concepts in F
If ΠF(S) = 2^|S| we say that S is shattered by F
The Vapnik-Chervonenkis dimension of F is the cardinality of the
largest finite set of points S ⊆ X that is shattered by F
27
Example 1
Learn the family F of circles contained in the square
VCdim(F) = 3
For ε = 0.01, δ = 0.001:
t(n, 1/ε, 1/δ) ≤ max(400·log2 2000, 2400·log2 1300) ≈ 24,000    Sufficient!
t(n, 1/ε, 1/δ) ≥ max(100·ln 1000, (3-1)/(32·0.01)) ≈ 690    Necessary!
28
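The two numbers on this slide can be reproduced with a short computation; the helper names `pac_upper` and `pac_lower` are mine:

```python
import math

def pac_upper(eps, delta, d):
    # Upper bound: max((4/eps)·log2(2/delta), (8d/eps)·log2(13/eps))
    return max(4 / eps * math.log2(2 / delta),
               8 * d / eps * math.log2(13 / eps))

def pac_lower(eps, delta, d):
    # Lower bound: max(((1-eps)/eps)·ln(1/delta), (d-1)/(32·eps))
    return max((1 - eps) / eps * math.log(1 / delta),
               (d - 1) / (32 * eps))

eps, delta, d = 0.01, 0.001, 3   # circles in the square: VCdim = 3
up = pac_upper(eps, delta, d)    # roughly 24,800 examples suffice
lo = pac_lower(eps, delta, d)    # roughly 684 examples are necessary
print(round(up), round(lo))
```

The two orders of magnitude between the bounds show how loose the gap between sufficiency and necessity can be for a fixed small class.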
Example 2
Learn the family Ln of linearly separable boolean functions in n variables
f ∈ Ln iff there exist (w1,…,wn, θ) such that
f(x1,…,xn) = HS(Σ_{k=1,…,n} wk·xk − θ)
where HS(x) = 1 if x ≥ 0, 0 otherwise
VCdim(Ln) = n+1
#Ln ≤ 2^(n²)
SIMPLE UPPER BOUND
t(n, 1/ε, 1/δ) ≤ ε⁻¹(n²·ln 2 + ln(1/δ))
UPPER BOUND USING VCdim(Ln)
t(n, 1/ε, 1/δ) ≤ max((4/ε)·log2(2/δ), (8(n+1)/ε)·log2(13/ε))
GROWS LINEARLY WITH n!
29
Example 2
Consider the class L2 of linearly separable functions in two variables
VCdim(Ln) = n+1, hence VCdim(L2) = 3
VCdim(L2) ≥ 3: [figure: three points in general position are shattered by half-planes]
VCdim(L2) < 4: [figures]
The green point cannot be separated from the other three
No straight line can separate the green from the red points
30
Classes of boolean formulas
Monomials: x1 ∧ x2 ∧ … ∧ xk
DNF: m1 ∨ m2 ∨ … ∨ mj (mi monomials)
Clauses: x1 ∨ x2 ∨ … ∨ xk
CNF: c1 ∧ c2 ∧ … ∧ cj (ci clauses)
k-DNF: ≤ k literals in each monomial
k-term-DNF: ≤ k monomials
k-CNF: ≤ k literals in each clause
k-clause-CNF: ≤ k clauses
Monotone formulas: contain no negated literals
μ-formulas: each variable appears at most once
31
The results
Th. (Valiant)
Monomials are learnable from positive examples with 2ε⁻¹(n + log δ⁻¹)
examples (ε = tolerated error), putting in g:
- xi, if xi = 1 in all examples
- x̄i, if xi = 0 in all examples

H := x1x̄1x2x̄2…xnx̄n;
for i := 1 to B do
begin
  ex := generate();
  for j := 1 to n do
    if ex(j) = 0 then delete xj from H
    else delete x̄j from H
end

N.B. Learnability is non-monotone:
A ⊆ B and B learnable does not imply A learnable
Th.
Monomials are not learnable from negative examples
32
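Valiant's deletion algorithm above translates directly into Python; the encoding of a monomial as a per-variable set of surviving literals is an implementation choice of mine:

```python
def learn_monomial(n, positive_examples):
    # Start from the monomial containing every literal x_j and x̄_j, then
    # delete the literals contradicted by each positive example
    h = [{"pos", "neg"} for _ in range(n)]
    for ex in positive_examples:          # ex is a 0/1 tuple of length n
        for j, bit in enumerate(ex):
            if bit == 1:
                h[j].discard("neg")       # x̄_j cannot be in the target
            else:
                h[j].discard("pos")       # x_j cannot be in the target
    return h

def predict(h, x):
    # The learned monomial accepts x iff no surviving literal is violated
    return int(all(
        not ("pos" in lits and bit == 0) and not ("neg" in lits and bit == 1)
        for lits, bit in zip(h, x)))

# hypothetical target monomial: x1 ∧ x̄3 (over n = 3 variables)
pos = [(1, 0, 0), (1, 1, 0)]
h = learn_monomial(3, pos)
assert predict(h, (1, 0, 0)) == 1
assert predict(h, (0, 1, 0)) == 0   # violates x1
```

Since only literals contradicted by positive examples are ever deleted, the hypothesis never accepts a point outside the target: errors are one-sided, which is what makes positive-only learning work here.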
Positive results
1) k-CNF are learnable from positive examples only
1b) k-DNF are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from
positive and negative examples
3) the class of k-decision lists is learnable
k-DL = ((m1, b1), …, (mj, bj)) with mi a monomial, |mi| ≤ k, bi ∈ {0,1}
For C ∈ k-DL and v a boolean vector:
i := min{ h | mh(v) = 1 }
then C(v) = bi (0 if no such i exists)
Th.
Every k-DNF (or k-CNF) formula can be represented by a small k-DL
33
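The evaluation rule for a k-DL can be sketched as follows; the encoding of monomials as (index, required value) pairs is a hypothetical choice of mine:

```python
def eval_decision_list(dl, v):
    # Evaluate a k-DL ((m1,b1),...,(mj,bj)) on boolean vector v:
    # output the bit of the first monomial satisfied by v, or 0 if none is
    for monomial, b in dl:
        # a monomial is a list of (index, required_value) pairs
        if all(v[i] == val for i, val in monomial):
            return b
    return 0

# hypothetical 2-DL: if x1 ∧ x̄2 then 1; else if x3 then 0; else 0
dl = [([(0, 1), (1, 0)], 1), ([(2, 1)], 0)]
assert eval_decision_list(dl, (1, 0, 1)) == 1
assert eval_decision_list(dl, (0, 0, 1)) == 0
assert eval_decision_list(dl, (0, 0, 0)) == 0
```

The "first satisfied monomial wins" rule is what lets a small k-DL simulate any k-DNF: list the terms of the DNF with output 1, and fall through to 0.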
Negative results
(if RP ≠ NP)
(in the distribution-free sense)
1) μ-formulas are not learnable
2) Threshold boolean functions are not learnable
3) For k ≥ 2, k-term-DNF formulas are not learnable
34
Mistake bound model
So far: how many examples needed to learn ?
What about: how many mistakes before convergence ?
Let’s consider similar setting to PAC learning:
Instances drawn at random from X according to distribution D
Learner must classify each instance before receiving correct
classification from teacher
Can we bound the number of mistakes learner makes before
converging ?
35
Mistake bound model
Learner:
Receives a sequence of training examples x
Predicts the target value f(x)
Receives the correct target value from the trainer
Is evaluated by the total number of mistakes it makes before
converging to the correct hypothesis
I.e.:
Learning takes place during the use of the system, not off-line
Ex.: prediction of fraudulent use of credit cards
36
Mistake bound for Find-S
Consider Find-S when H = conjunction of boolean
literals
FIND-S:
Initialize h to the most specific hypothesis in H:
x1 ∧ x̄1 ∧ x2 ∧ x̄2 ∧ … ∧ xn ∧ x̄n
For each positive training instance x
Remove from h any literal not satisfied by x
Output h
37
Mistake bound for Find-S
If c ∈ H and the training data are noise free, Find-S
converges to an exact hypothesis
How many errors to learn c ∈ H (only positive
examples can be misclassified)?
The first positive example will be misclassified, and n
literals in the initial hypothesis will be eliminated
Each subsequent error eliminates at least one literal
#mistakes ≤ n+1 (worst case, for the “total” concept ∀x
c(x)=1)
38
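The n+1 mistake bound can be observed by running Find-S online. A sketch with my own encoding of conjunctions (per-variable sets of surviving literals) and a hypothetical example stream:

```python
def find_s(n, stream):
    # Find-S over conjunctions of boolean literals, counting mistakes.
    # stream yields (x, label) with x a 0/1 tuple; only positive examples
    # (label 1) ever cause an update.
    h = [{"pos", "neg"} for _ in range(n)]   # x1 x̄1 ... xn x̄n: most specific
    mistakes = 0
    for x, label in stream:
        pred = int(all(
            not ("pos" in lits and b == 0) and not ("neg" in lits and b == 1)
            for lits, b in zip(h, x)))
        if pred != label:
            mistakes += 1
        if label == 1:                        # drop literals x does not satisfy
            for j, b in enumerate(x):
                h[j].discard("neg" if b == 1 else "pos")
    return h, mistakes

# hypothetical target concept c(x) = x1, over n = 2 variables
stream = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((1, 1), 1)]
h, m = find_s(2, stream)
print(h, m)   # 2 mistakes, within the n+1 = 3 bound
```

The first positive example removes n literals at once; every later mistake removes at least one more of the remaining n, giving the n+1 worst-case bound.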
Mistake bound for Halving
A version space is maintained and refined (e.g., by Candidate-Elimination)
Prediction is based on a majority vote among the hypotheses in
the current version space
“Wrong” hypotheses are removed (even if x is correctly
classified)
How many errors to exactly learn c ∈ H (H finite)?
A mistake occurs when the majority of the hypotheses misclassifies x
These hypotheses are removed
For each mistake, the version space is at least halved
At most log2(|H|) mistakes before exact learning (e.g., a single
hypothesis remaining)
Note: learning without mistakes is possible!
39
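The Halving scheme above can be sketched directly; the tiny hypothesis class of "dictator" functions and the example stream are illustrative assumptions of mine:

```python
def halving(hypotheses, stream):
    # Halving: keep the version space, predict by majority vote, then
    # remove every hypothesis that disagrees with the revealed label
    vs = list(hypotheses)                 # each hypothesis: a function x -> 0/1
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        pred = int(votes * 2 >= len(vs))  # majority (ties broken toward 1)
        if pred != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]
    return vs, mistakes

hyps = [lambda x: x[0], lambda x: x[1], lambda x: x[2]]   # H = dictators
stream = [((1, 0, 0), 0), ((0, 1, 0), 1), ((0, 0, 1), 0)]  # target: x2
vs, m = halving(hyps, stream)
print(len(vs), m)
```

On this stream the majority vote is right every time, yet the version space still shrinks to a single hypothesis: the run exactly illustrates the slide's note that exact learning without mistakes is possible.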
Optimal mistake bound
Question: what is the optimal mistake bound (i.e., the lowest
worst-case bound over all possible learning algorithms A) for
an arbitrary non-empty concept class C, assuming H=C?
Formally, for any learning algorithm A and any target concept
c:
MA(c) = max #mistakes made by A to exactly learn c over all
possible training sequences
MA(C) = maxcC MA(c)
Note: Mfind-S(C) = n+1
MHalving(C) ≤ log2(|C|)
Opt(C) = minA MA(C)
i.e., # of mistakes made for the hardest target concept in C,
using the hardest training sequence, by the best algorithm
40
Optimal mistake bound
Theorem (Littlestone 1987)
VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)
There exist concept classes for which
VC(C) = Opt(C) = MHalving(C) = log2(|C|)
e.g., the power set 2X of X, for which it holds:
VC(2X) = |X| = log2(|2X|)
There exist concept classes for which
VC(C) < Opt(C) < MHalving(C)
41
Weighted majority algorithm
Generalizes Halving
Makes predictions by taking a weighted vote among a
pool of prediction algorithms
Learns by altering the weight associated with each
prediction algorithm
It does not eliminate hypotheses (i.e., algorithms)
inconsistent with some training examples, but just
reduces their weights, so it is able to accommodate
inconsistent training data
42
Weighted majority algorithm
∀i: wi := 1
For each training example (x, c(x)):
  q0 := q1 := 0
  For each prediction algorithm ai:
    if ai(x)=0 then q0 := q0 + wi
    if ai(x)=1 then q1 := q1 + wi
  if q1 > q0 then predict c(x)=1
  if q1 < q0 then predict c(x)=0
  if q1 = q0 then predict c(x)=0 or 1 at random
  For each prediction algorithm ai do
    if ai(x) ≠ c(x) then wi := β·wi    (0 ≤ β < 1)
43
Weighted majority algorithm (WM)
Coincides with Halving for β=0
Theorem - Let D be any sequence of training examples, A any
set of n prediction algorithms, k the minimum number of mistakes
made by any aj∈A on D, and β=1/2. Then WM makes at
most
2.4(k+log2n)
mistakes over D
44
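The theorem can be checked on a toy pool. A sketch of WM with deterministic tie-breaking (the original predicts at random on ties); the three predictors and the parity stream are my own illustrative choices:

```python
import math

def weighted_majority(algos, stream, beta=0.5):
    # Weighted Majority: weighted vote over a pool of prediction algorithms;
    # multiply the weight of each wrong algorithm by beta
    w = [1.0] * len(algos)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for wi, a in zip(w, algos):
            q[a(x)] += wi
        pred = 1 if q[1] > q[0] else 0   # ties broken toward 0 (deterministic)
        if pred != label:
            mistakes += 1
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, algos)]
    return mistakes

# pool of three predictors; the parity expert is perfect, so k = 0
algos = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(i, i % 2) for i in range(1, 9)]
m = weighted_majority(algos, stream)
print(m)
assert m <= 2.4 * (0 + math.log2(len(algos)))   # theorem: M <= 2.4(k + log2 n)
```

With a perfect expert in the pool (k=0), the bound allows at most 2.4·log2(3) ≈ 3.8 mistakes; on this stream WM in fact makes none, since the constant experts lose weight immediately.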
Weighted majority algorithm (WM)
Proof
Since aj (the best in A) makes k mistakes, its final weight wj will
be (1/2)^k
The sum W of the weights of all n algorithms in A is initially n;
each mistake made by WM reduces W to at most (3/4)W, because
the “wrong” algorithms hold at least 1/2 of the total weight, and
that weight is reduced by a factor of 1/2.
So the final total weight W is at most n(3/4)^M, where M is the
total number of mistakes made by WM over D.
45
Weighted majority algorithm (WM)
But the final weight wj cannot be greater than the final total
weight W, hence:
(1/2)^k ≤ n(3/4)^M
from which
M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.4(k + log2 n)
I.e., the number of mistakes made by WM will never be greater
than a constant factor times the number of mistakes made by
the best member of the pool, plus a term that grows only
logarithmically in the size of the pool
46