Bayes Classifiers
Edgar Acuna
Bayes Classifiers
• A formidable and sworn enemy of decision trees
[Figure: both a decision tree (DT) and a Bayes classifier (BC) take input attributes and produce a prediction of a categorical output.]
How to build a Bayes Classifier
• Assume you want to predict output Y, which has arity $n_Y$ and values $v_1, v_2, \ldots, v_{n_Y}$.
• Assume there are m input attributes called $X_1, X_2, \ldots, X_m$.
• Break the dataset into $n_Y$ smaller datasets called $DS_1, DS_2, \ldots, DS_{n_Y}$.
• Define $DS_i$ = records in which $Y = v_i$.
• For each $DS_i$, learn a density estimator $M_i$ to model the input distribution among the $Y = v_i$ records.
• $M_i$ estimates $P(X_1, X_2, \ldots, X_m \mid Y = v_i)$.
• Idea: when a new set of input values $(X_1 = u_1, X_2 = u_2, \ldots, X_m = u_m)$ comes along to be evaluated, predict the value of Y that makes $P(X_1, X_2, \ldots, X_m \mid Y = v_i)$ most likely:
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)$$
Is this a good idea?
This is a Maximum Likelihood classifier. It can get silly if some values of Y are very unlikely.
Much Better Idea
When a new set of input values $(X_1 = u_1, X_2 = u_2, \ldots, X_m = u_m)$ comes along to be evaluated, predict the value of Y that makes $P(Y = v_i \mid X_1, X_2, \ldots, X_m)$ most likely:
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(Y = v \mid X_1 = u_1 \wedge \cdots \wedge X_m = u_m)$$
Is this a good idea?
Terminology
• MLE (Maximum Likelihood Estimator):
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)$$
• MAP (Maximum A-Posteriori Estimator):
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(Y = v \mid X_1 = u_1 \wedge \cdots \wedge X_m = u_m)$$
Getting what we need
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(Y = v \mid X_1 = u_1 \wedge \cdots \wedge X_m = u_m)$$
Getting a posterior probability
$$P(Y = v \mid X_1 = u_1 \wedge \cdots \wedge X_m = u_m)
= \frac{P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)\, P(Y = v)}{P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m)}
= \frac{P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)\, P(Y = v)}{\sum_{j=1}^{n_Y} P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v_j)\, P(Y = v_j)}$$
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y. (We can use our favorite density estimator here; right now we have two options, a Joint Density Estimator or a Naïve Density Estimator.)
2. This gives $P(X_1, X_2, \ldots, X_m \mid Y = v_i)$.
3. Estimate $P(Y = v_i)$ as the fraction of records with $Y = v_i$.
4. For a new prediction:
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(Y = v \mid X_1 = u_1 \wedge \cdots \wedge X_m = u_m) = \operatorname*{argmax}_v P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)\, P(Y = v)$$
Joint Density Bayes Classifier
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)\, P(Y = v)$$
In the case of the joint Bayes classifier this degenerates to a very simple rule:
$Y^{\text{predict}}$ = the class containing the most records in which $X_1 = u_1, X_2 = u_2, \ldots, X_m = u_m$.
Note that if no records have the exact set of inputs $X_1 = u_1, X_2 = u_2, \ldots, X_m = u_m$, then $P(X_1, X_2, \ldots, X_m \mid Y = v_i) = 0$ for all values of Y. In that case we just have to guess Y's value.
Example

X1  X2  X3  Y
 0   0   1  0
 0   1   0  0
 1   1   0  0
 0   0   1  1
 0   0   1  1
 1   1   1  1
 1   1   0  1
Example: Continuation
$$P(Y = 0) = 3/7, \qquad P(Y = 1) = 4/7$$
$$P(X_1 = 0, X_2 = 0, X_3 = 1 \mid Y = 0) = 1/3$$
$$P(X_1 = 0, X_2 = 0, X_3 = 1 \mid Y = 1) = 1/2$$
The record $X_1 = 0, X_2 = 0, X_3 = 1$ will be assigned to class 1. Note also that in this class the record (0, 0, 1) appears more times than in class 0.
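To make the counting rule concrete, here is a minimal base-R sketch that reproduces these numbers from the table above (the data frame and variable names are illustrative):

# Joint-density Bayes classifier on the toy table: count exact matches per class.
toy <- data.frame(
  X1 = c(0, 0, 1, 0, 0, 1, 1),
  X2 = c(0, 1, 1, 0, 0, 1, 1),
  X3 = c(1, 0, 0, 1, 1, 1, 0),
  Y  = c(0, 0, 0, 1, 1, 1, 1)
)
match <- toy$X1 == 0 & toy$X2 == 0 & toy$X3 == 1  # records equal to the query
cond  <- tapply(match, toy$Y, mean)  # P(query | Y): 1/3 for Y=0, 1/2 for Y=1
prior <- table(toy$Y) / nrow(toy)    # P(Y): 3/7 and 4/7
names(which.max(cond * prior))       # "1": the query is assigned to class 1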
Naïve Bayes Classifier
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(X_1 = u_1 \wedge \cdots \wedge X_m = u_m \mid Y = v)\, P(Y = v)$$
In the case of the naïve Bayes classifier this can be simplified:
$$Y^{\text{predict}} = \operatorname*{argmax}_v P(Y = v) \prod_{j=1}^{m} P(X_j = u_j \mid Y = v)$$
Technical Hint:
If you have 10,000 input attributes, that product will underflow in floating point math. You should use logs:
$$Y^{\text{predict}} = \operatorname*{argmax}_v \left( \log P(Y = v) + \sum_{j=1}^{m} \log P(X_j = u_j \mid Y = v) \right)$$
Example: Continuation
$$P(Y = 0) = 3/7, \qquad P(Y = 1) = 4/7$$
$$P(X_1 = 0, X_2 = 0, X_3 = 1 \mid Y = 0) = P(X_1 = 0 \mid Y = 0)\, P(X_2 = 0 \mid Y = 0)\, P(X_3 = 1 \mid Y = 0) = (2/3)(1/3)(1/3) = 2/27$$
$$P(X_1 = 0, X_2 = 0, X_3 = 1 \mid Y = 1) = P(X_1 = 0 \mid Y = 1)\, P(X_2 = 0 \mid Y = 1)\, P(X_3 = 1 \mid Y = 1) = (2/4)(2/4)(3/4) = 3/16$$
The record $X_1 = 0, X_2 = 0, X_3 = 1$ will be assigned to class 1, since $(4/7)(3/16) > (3/7)(2/27)$.
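A minimal sketch of the log-space version from the Technical Hint, hard-coding the conditional probabilities just computed (a sanity check rather than a general implementation):

# Log-space naive Bayes score: log P(Y=v) + sum_j log P(X_j = u_j | Y = v).
log_score <- function(prior, cond_probs) log(prior) + sum(log(cond_probs))
scores <- c(`0` = log_score(3/7, c(2/3, 1/3, 1/3)),   # log((3/7) * 2/27)
            `1` = log_score(4/7, c(2/4, 2/4, 3/4)))   # log((4/7) * 3/16)
names(which.max(scores))                              # "1": same assignment as above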
BC Results: "XOR"
The "XOR" dataset consists of 40,000 records and 2 Boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.
[Figures: the classifier learned by the Joint BC, and the classifier learned by the Naive BC.]
BC Results: "MPG"
The "MPG" dataset consists of 392 records.
[Figure: the classifier learned by the Naive BC.]
More Facts About Bayes Classifiers
• Many other density estimators can be slotted in*.
• Density estimation can be performed with real-valued inputs*.
• Bayes classifiers can be built with real-valued inputs*.
• Rather technical complaint: Bayes classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on*.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help*; a sketch of a common version follows below.
• Naïve Bayes is wonderfully cheap, and survives 10,000 attributes cheerfully!
*See future Andrew lectures.
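One common version of that hack is Laplace (add-one) smoothing, a special case of a Dirichlet prior with all pseudo-counts equal to one. A minimal sketch (the function name and interface are illustrative, not from any library):

# Smooth P(X_j = u | Y = v): add one pseudo-count per possible attribute value,
# so a value never seen in class v gets a small positive probability, not zero.
smoothed_prob <- function(counts, n_values) (counts + 1) / (sum(counts) + n_values)
smoothed_prob(c(`0` = 0, `1` = 3), n_values = 2)   # 0.2 and 0.8 instead of 0 and 1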
Naïve Bayes classifier
The naïve Bayes classifier can be applied when there are continuous predictors, but a discretization method must be applied first, such as: equal-width intervals, equal-frequency intervals, ChiMerge, 1R, or entropy-based discretization with various stopping criteria. All of these are available in the dprep library (see disc.mentr, disc.ew, disc.ef, etc.).
The e1071 library in R contains a naiveBayes function that computes the naïve Bayes classifier. If a variable is continuous, it assumes that it follows a Gaussian distribution.
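A quick usage sketch with the built-in iris data (all predictors continuous, so each one is modeled with a per-class Gaussian):

library(e1071)
model <- naiveBayes(Species ~ ., data = iris)   # one Gaussian per class and attribute
pred  <- predict(model, iris[, -5])
table(pred, iris$Species)                       # confusion matrix on the training sample
mean(pred != iris$Species)                      # apparent (resubstitution) error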
The misclassification error rate
The misclassification error rate R(d) is the probability that the classifier d incorrectly classifies an instance coming from a sample (the test sample) obtained at a later stage than the training sample. It is also called the true error or the actual error.
It is an unknown value that needs to be estimated.
Methods for estimation of the misclassification error rate
i) Resubstitution or apparent error (Smith, 1947). This is merely the proportion of instances in the training sample that are incorrectly classified by the classification rule. In general it is too optimistic an estimator, and it can lead to wrong conclusions if the number of instances is not large compared with the number of features. This estimator has a large bias.
ii) "Leave one out" estimation (Lachenbruch, 1965). In this case an instance is omitted from the training sample. The classifier is then built, and the prediction for the omitted instance is obtained. One must record whether the instance was correctly or incorrectly classified. The process is repeated for all the instances in the training sample, and the estimate of the misclassification error is given by the proportion of instances incorrectly classified. This estimator has low bias, but its variance tends to be large.
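A sketch of the leave-one-out procedure, using e1071::naiveBayes as the classifier and iris as the data purely for illustration (any classifier could be slotted in):

library(e1071)
loo_error <- function(data, class_col) {
  wrong <- 0
  for (i in seq_len(nrow(data))) {   # omit instance i, refit, predict instance i
    fit  <- naiveBayes(data[-i, -class_col], data[-i, class_col])
    pred <- predict(fit, data[i, -class_col, drop = FALSE])
    wrong <- wrong + (pred != data[i, class_col])
  }
  wrong / nrow(data)                 # proportion incorrectly classified
}
loo_error(iris, class_col = 5)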
Methods for estimation of the misclassification error rate
iii) Cross-validation (Stone, 1974). In this case the training sample is randomly divided into v parts (v = 10 is the most used). The classifier is then built using all the parts but one. The omitted part is considered as the test sample, and the predictions for each instance in it are found. The CV misclassification error rate is found by adding up the misclassifications in each part and dividing by the total number of instances. The CV estimate has low bias but high variance. In order to reduce the variability, we usually repeat the estimation several times.
The estimation of the variance is a hard problem (Bengio and Grandvalet, 2004).
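A sketch of v-fold cross-validation under the same illustrative assumptions; note that v = n reproduces leave-one-out, and the last line repeats the estimation to reduce its variability:

library(e1071)
cv_error <- function(data, class_col, v = 10) {
  fold  <- sample(rep(1:v, length.out = nrow(data)))  # random partition into v parts
  wrong <- 0
  for (k in 1:v) {
    test <- fold == k
    fit  <- naiveBayes(data[!test, -class_col], data[!test, class_col])
    wrong <- wrong + sum(predict(fit, data[test, -class_col]) != data[test, class_col])
  }
  wrong / nrow(data)   # total misclassifications over the total number of instances
}
mean(replicate(10, cv_error(iris, class_col = 5)))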
Methods for estimation of the misclassification error rate
iv) The holdout method. A percentage (say 70%) of the dataset is considered as the training sample and the remainder as the test sample. The classifier is evaluated on the test sample. The experiment is repeated several times and then the average is taken.
v) Bootstrapping (Efron, 1983). In this method we generate several training samples by sampling with replacement from the original training sample. The idea is to reduce the bias of the resubstitution error. It is almost unbiased, but it has a large variance. Its computational cost is high. There exist several variants of this method.
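A sketch of the holdout method under the same illustrative assumptions (a 70% training split, repeated and averaged):

library(e1071)
holdout_error <- function(data, class_col, p = 0.7) {
  idx <- sample(nrow(data), floor(p * nrow(data)))   # 70% goes to the training sample
  fit <- naiveBayes(data[idx, -class_col], data[idx, class_col])
  mean(predict(fit, data[-idx, -class_col]) != data[-idx, class_col])
}
mean(replicate(20, holdout_error(iris, class_col = 5)))   # average over repeated splits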
Naive Bayes for Bupa
Without discretization:
> a=naiveBayes(V7~.,data=bupa)
> pred=predict(a,bupa[,-7],type="raw")
> pred1=max.col(pred)
> table(pred1,bupa[,7])
pred1   1   2
    1 112 119
    2  33  81
> 152/345
[1] 0.4405797
Discretizing with the entropy method:
> dbupa=disc.mentr(bupa,1:7)
> b=naiveBayes(V7~.,data=dbupa)
> pred=predict(b,dbupa[,-7])
> table(pred,dbupa[,7])
pred   1   2
   1  79  61
   2  66 139
> 127/345
[1] 0.3681159
Naïve Bayes for Diabetes
Without discretization:
> a=naiveBayes(V9~.,data=diabetes)
> pred=predict(a,diabetes[,-9],type="raw")
> pred1=max.col(pred)
> table(pred1,diabetes[,9])
pred1   1   2
    1 421 104
    2  79 164
> (79+104)/768
[1] 0.2382813
Discretizing:
> ddiabetes=disc.mentr(diabetes,1:9)
> b=naiveBayes(V9~.,data=ddiabetes)
> pred=predict(b,ddiabetes[,-9])
> table(pred,ddiabetes[,9])
pred   1   2
   1 418  84
   2  82 184
> 166/768
[1] 0.2161458
Naïve Bayes using ChiMerge discretization
> chibupa=chiMerge(bupa,1:6)
> b=naiveBayes(V7~.,data=chibupa)
> pred=predict(b,chibupa[,-7])
> table(pred,chibupa[,7])
pred   1   2
   1 117  21
   2  28 179
> 49/345
[1] 0.1420290
> chidiab=chiMerge(diabetes,1:8)
> b=naiveBayes(V9~.,data=chidiab)
> pred=predict(b,chidiab[,-9])
> table(pred,chidiab[,9])
pred   1   2
   1 457  33
   2  43 235
> 76/768
[1] 0.09895833
Other Bayesian classifiers
Linear Discriminant Analysis (LDA). Here the class-conditional density $P(X_1, \ldots, X_m \mid Y = v_j)$ is assumed to be multivariate normal for each $v_j$. It is further assumed that the covariance matrix $\Sigma$ is the same for every class. The decision rule for assigning the object x reduces to
$$Y^{\text{predict}} = \operatorname*{argmax}_v \left( \log P(Y = v) + \mu_v' \Sigma^{-1} x - \tfrac{1}{2}\, \mu_v' \Sigma^{-1} \mu_v \right)$$
where $\mu_v$ is the mean vector of class v. Note that the decision rule is linear in the vector of predictors x. Strictly speaking, it should be applied only when the predictors are continuous.
LDA examples: Bupa and Diabetes
> bupalda=lda(V7~.,data=bupa)
> pred=predict(bupalda,bupa[,-7])$class
> table(pred,bupa[,7])
pred   1   2
   1  78  35
   2  67 165
> 102/345
[1] 0.2956522
> diabeteslda=lda(V9~.,data=diabetes)
> pred=predict(diabeteslda,diabetes[,-9])$class
> table(pred,diabetes[,9])
pred   1   2
   1 446 112
   2  54 156
> 166/768
[1] 0.2161458
Other Bayesian classifiers
The k nearest neighbors (k-nn) classifier. Here the class-conditional density $P(X_1, \ldots, X_m \mid Y = v_j)$ is estimated by the method of the k nearest neighbors.
Classifiers based on kernel density estimation.
Classifiers based on estimating the conditional density using Gaussian mixtures.
The k-nn classifier
• In the multivariate case, the density estimate has the form
$$\hat{f}(x) = \frac{k}{n\, v_k(x)}$$
where $v_k(x)$ is the volume of an ellipsoid centered at x with radius $r_k(x)$, which in turn is the distance from x to the k-th closest point.
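A minimal sketch of this estimate, taking $v_k(x)$ to be the volume of a Euclidean ball of radius $r_k(x)$ in d dimensions (the function name and the use of iris are illustrative):

# f_hat(x) = k / (n * v_k(x)), where r_k(x) is the distance to the k-th closest point.
knn_density <- function(train_x, x, k = 5) {
  X <- as.matrix(train_x)
  d <- ncol(X)
  r <- sort(sqrt(colSums((t(X) - unlist(x))^2)))[k]   # r_k(x)
  v <- pi^(d / 2) * r^d / gamma(d / 2 + 1)            # volume of the d-dimensional ball
  k / (nrow(X) * v)
}
knn_density(iris[, 1:4], iris[1, 1:4], k = 5)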
The k-nn classifier
From the point of view of supervised classification, the k-nn method is quite simple to apply.
Indeed, suppose the class-conditional densities $f(x \mid C_i)$ of class $C_i$ that appear in the equation
$$P(C_i \mid x) = \frac{f(x \mid C_i)\, \pi_i}{f(x)}$$
are estimated by k-nn. Then, to classify an object with measurements given by the vector x into class $C_i$, it must hold that
$$\frac{k_i\, \pi_i}{n_i\, v_k(x)} > \frac{k_j\, \pi_j}{n_j\, v_k(x)}$$
for $j \neq i$, where $k_i$ and $k_j$ are the numbers of the k neighbors of x that fall in classes $C_i$ and $C_j$, respectively.
The k-nn classifier
Assuming priors proportional to the class sizes ($n_i/n$ and $n_j/n$, respectively), the above is equivalent to:
$$k_i > k_j \quad \text{for } j \neq i$$
Then the classification procedure would be as follows:
1) Find the k objects that are at the closest distance to the object x; k is usually an odd number, 1 or 3.
2) If the majority of those k objects belong to class $C_i$, then the object x is assigned to that class. In case of a tie, classify at random. (A from-scratch sketch of this procedure appears below.)
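Here is a from-scratch sketch of the voting procedure just described, with Euclidean distance, a majority vote, and random tie-breaking (all names are illustrative):

knn_predict <- function(train_x, train_y, x, k = 3) {
  X <- as.matrix(train_x)
  d <- sqrt(colSums((t(X) - unlist(x))^2))   # distances from x to every object
  votes <- table(train_y[order(d)[1:k]])     # classes of the k nearest objects
  winners <- names(votes)[votes == max(votes)]
  winners[sample(length(winners), 1)]        # majority vote; ties broken at random
}
knn_predict(iris[, 1:4], iris$Species, iris[1, 1:4], k = 3)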
The k-nn classifier
There are two problems in the k-nn method: the choice of the distance or metric, and the choice of k.
• The most elementary metric one can choose is the Euclidean one, $d(x, y) = (x - y)'(x - y)$. This metric, however, can cause problems if the predictor variables have been measured in very different units. Some prefer to rescale the data before applying the method. Another widely used distance is the Manhattan distance, defined by $d(x, y) = |x - y|$. There are special metrics for datasets that contain different types of variables.
• Enas and Choi (1996) carried out a simulation study to determine the optimal k when only two classes are present, and found that if the sample sizes of the two classes are comparable, then $k = n^{3/8}$ when there was little difference between the covariance matrices of the groups, and $k = n^{2/8}$ when there was considerable difference between the covariance matrices.
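The examples below run knn on the raw inputs. Here is a sketch of the rescaling option mentioned above (standardizing each predictor before computing distances), assuming the bupa data frame from the earlier examples is loaded:

library(class)
x  <- scale(bupa[, -7])                # zero mean and unit variance per column
pr <- knn(x, x, as.factor(bupa[, 7]), k = 3)
table(pr, bupa[, 7])                   # compare with the unscaled run below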
knn example: Bupa
> bupak1=knn(bupa[,-7],bupa[,-7],as.factor(bupa[,7]),k=1)
> table(bupak1,bupa[,7])
bupak1   1   2
     1 145   0
     2   0 200
error = 0% (with k = 1 and the training sample reused as the test sample, each point is its own nearest neighbor, so the apparent error is trivially zero)
> bupak3=knn(bupa[,-7],bupa[,-7],as.factor(bupa[,7]),k=3)
> table(bupak3,bupa[,7])
bupak3   1   2
     1 106  29
     2  39 171
error = 19.71%
> bupak5=knn(bupa[,-7],bupa[,-7],as.factor(bupa[,7]),k=5)
> table(bupak5,bupa[,7])
bupak5   1   2
     1  94  23
     2  51 177
error = 21.44%
knn example: Diabetes
> diabk3=knn(diabetes[,-9],diabetes[,-9],as.factor(diabetes[,9]),k=3)
> table(diabk3,diabetes[,9])
diabk3   1   2
     1 459  67
     2  41 201
error = 14.06%
> diabk5=knn(diabetes[,-9],diabetes[,-9],as.factor(diabetes[,9]),k=5)
> table(diabk5,diabetes[,9])
diabk5   1   2
     1 442  93
     2  58 175
error = 19.66%
What you should know
• Bayes Classifiers
– How to build one
– How to predict with a BC
– How to estimate the misclassification error.