Classification of Affective States - GP Semi-Supervised Learning, SVM and kNN


MAS 622J Course Project
Classification of Affective States
- GP Semi-Supervised Learning, SVM and kNN
Hyungil Ahn ([email protected])
Objective & Dataset
• Recognize the affective states of a child solving a puzzle
• Affective Dataset
- 1024 features from Face, Posture, Game
- 3 affective states, labels annotated by teachers:
  High interest (61), Low interest (59), Refreshing (16)
Task & Approaches
• Binary Classification
  High interest (61 samples) vs. Low interest or Refreshing (75 samples)
• Approaches
- Semi-Supervised Learning: Gaussian Process (GP)
- Support Vector Machine (SVM)
- k-Nearest Neighbor (kNN, k = 1)
GP Semi-Supervised Learning
• Given labeled data (X_L, y_L) and unlabeled points X_U, predict the labels of the unlabeled points
• Assume the data and the data-generation process:
  X : inputs, y : vector of labels, t : vector of hidden soft labels
  Each label y_i ∈ {+1, −1} (binary classification)
  Final classifier: y_i = sign[t_i] (in practice, the sign of the posterior mean of t_i)
• Define a similarity (kernel) function with width σ:
  k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
• Infer p(t | X, y_L) given the labeled data
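For concreteness, a minimal NumPy sketch of the similarity matrix above (the array shapes are illustrative, not taken from the report):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """Similarity matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances between all pairs of rows of X.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

# Example: 136 samples (61 + 75) with 1024 features each.
X = np.random.randn(136, 1024)
K = rbf_kernel_matrix(X, sigma=2.0)   # K is 136 x 136
```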
GP Semi-Supervised Learning
• Infer p(t | X, y_L) given the labeled data y_L
• Bayesian model: p(t | X, y_L) ∝ p(t | X) · p(y_L | t)
- p(t | X) : prior of the classifier
- p(y_L | t) : likelihood of the classifier given the labeled data
GP Semi-Supervised Learning
• How to model the prior & the likelihood?
- The prior: using a GP over all (labeled and unlabeled) points, p(t | X) = N(t; 0, K), where K is the similarity matrix
  (soft labels vary smoothly across the data manifold!)
- The likelihood: a flipping-noise model with labeling error rate ε,
  p(y_i | t_i) = ε + (1 − 2ε) Θ(y_i t_i), where Θ is the step function
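A minimal sketch of these two ingredients, assuming the zero-mean GP prior and the flipping-noise likelihood written above:

```python
import numpy as np

def sample_soft_labels(K, n_samples=1, jitter=1e-6):
    """Draw soft-label vectors t ~ N(0, K) from the GP prior."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))  # jitter for numerical stability
    return L @ np.random.randn(n, n_samples)

def flip_noise_likelihood(y, t, eps):
    """p(y_i | t_i) = eps + (1 - 2*eps) * step(y_i * t_i), with y_i in {-1, +1}."""
    agree = (y * t > 0).astype(float)  # step function Theta(y_i * t_i)
    return eps + (1.0 - 2.0 * eps) * agree
```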
GP Semi-Supervised Learning
• EP (Expectation Propagation) → approximate the posterior p(t | X, y_L) as a Gaussian
• Select the hyperparameters { kernel width σ, labeling error rate ε } that maximize the evidence!
• Advantage of using EP → we get the evidence p(y_L | X) as a side product
• EP estimates the leave-one-out predictive performance without performing any expensive cross-validation.
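The selection step then reduces to a grid search over (σ, ε). In the sketch below, `ep_log_evidence` is a hypothetical stand-in for the EP routine (not shown here) that returns the approximated log evidence; `rbf_kernel_matrix` is the helper from the earlier sketch:

```python
import numpy as np

def select_hyperparameters(X, y_labeled, labeled_idx, sigmas, epsilons, ep_log_evidence):
    """Grid-search (sigma, eps) by maximizing the EP evidence log p(y_L | X)."""
    best_sigma, best_eps, best_logZ = None, None, -np.inf
    for sigma in sigmas:
        K = rbf_kernel_matrix(X, sigma)  # similarity over labeled + unlabeled points
        for eps in epsilons:
            logZ = ep_log_evidence(K, y_labeled, labeled_idx, eps)  # hypothetical EP call
            if logZ > best_logZ:
                best_sigma, best_eps, best_logZ = sigma, eps, logZ
    return best_sigma, best_eps
```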
Support Vector Machine
• OSU SVM toolbox
• RBF kernel: k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
• Hyperparameter {C, σ} selection → use leave-one-out validation!
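The report used the OSU SVM toolbox (MATLAB); a rough equivalent sketch with scikit-learn, where the data and grid values are stand-ins rather than the report's actual settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

# Stand-in data: 40 train / 96 test points with 1024 features (real data not included).
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((40, 1024)), rng.choice([-1, 1], 40)
X_test, y_test = rng.standard_normal((96, 1024)), rng.choice([-1, 1], 96)

# RBF kernel exp(-gamma * ||x - x'||^2), i.e. gamma = 1 / (2 * sigma^2).
param_grid = {"C": 10.0 ** np.arange(-2, 5), "gamma": 10.0 ** np.arange(-5, 0)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=LeaveOneOut())
search.fit(X_train, y_train)            # leave-one-out validation on the train data only
print(search.best_params_, search.score(X_test, y_test))
```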
kNN (k = 1)
• The label of a test point follows that of its nearest train point.
• This algorithm is simple to implement, and its accuracy can serve as a baseline.
• However, sometimes this algorithm gives a good result!
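A 1-NN baseline is only a few lines; a NumPy sketch:

```python
import numpy as np

def knn1_predict(X_train, y_train, X_test):
    """Each test point takes the label of its nearest (Euclidean) train point."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = (np.sum(X_test ** 2, axis=1)[:, None]
          + np.sum(X_train ** 2, axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    return np.asarray(y_train)[np.argmin(d2, axis=1)]
```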
Split of the dataset & Experiment
• GP Semi-supervised learning
- Randomly select labeled data (p % of the overall data), use the remaining data as unlabeled data, and predict the labels of the unlabeled data (in this setting, unlabeled data == test data)
- 50 tries for each p (p = 10, 20, 30, 40, 50)
- Each time, select the hyperparameters that maximize the evidence from EP
• SVM and kNN
- Randomly select train data (p % of the overall data), use the remaining data as test data, and predict the labels of the test data
- 50 tries for each p (p = 10, 20, 30, 40, 50)
- For the SVM, leave-one-out validation for hyperparameter selection used only the train data
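As a sketch of this protocol (`fit_and_score` stands in for whichever of the three classifiers is being evaluated; the per-class stratification used in the report is replaced by a plain random split for brevity):

```python
import numpy as np

def run_trials(X, y, p_percent, fit_and_score, n_tries=50, seed=0):
    """Average accuracy over n_tries random splits with p% of the data labeled."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_labeled = max(1, round(n * p_percent / 100))
    accs = []
    for _ in range(n_tries):
        idx = rng.permutation(n)
        train, test = idx[:n_labeled], idx[n_labeled:]   # unlabeled == test data
        accs.append(fit_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(accs))
```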

GP – evidence & accuracy
[Figure: recognition accuracy on unlabeled points and log evidence vs. sigma (hyperparameter), for percentage of train points per class = 50 % (average over 10 tries). Note: an offset was added to the log evidence to plot all curves in the same figure.]
Max of Rec Accuracy ≈ Max of Log Evidence
→ Find the optimal hyperparameter by using the evidence from EP
SVM – hyperparameter selection
[Figure: evidence from leave-one-out validation as a function of log(C) and log(1/σ).]
Select the hyperparameter {C, σ} that maximizes the evidence from leave-one-out validation!
Classification Accuracy
[Figure: percentage of recognition on unlabeled (or test) points vs. percent of labeled (or train) points per class (10-50 %), for GP, kNN (k = 1), and SVM.]
As expected, kNN is bad with a small # of train pts and better with a large # of train pts.
SVM has good accuracy even when the # of train pts is small. Why?
GP has bad accuracy when the # of train pts is small. Why?
Analysis-SVM
Why does SVM give good test accuracy
even when the number of train points is small?
[Figure, left panel: number of support vectors and number of train points vs. percent of train points per class. Right panel: CV accuracy rate, test accuracy rate, and # SVs / # train pts vs. percent of train points per class.]
The best things I can tell…
1. {# Support Vectors} / {# of Train Points} is high in this task, in particular when the percentage of train points is low. The support vectors decide the decision boundary, but it is not guaranteed that the SV ratio is strongly related to the test accuracy. Actually, it is known that {Leave-one-out CV error} ≤ {# Support Vectors} / {# of Train Points}.
2. The CV accuracy rate is high even when the # of train pts is small, and the CV accuracy rate is strongly correlated with the test accuracy rate.
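For reference, the classical leave-one-out bound cited in point 1 can be written as:

```latex
% Classical LOO bound for SVMs: deleting a non-support vector leaves the
% decision boundary unchanged, so only support vectors can be misclassified
% when held out.
\[
  \mathrm{Err}_{\mathrm{LOO}} \;\le\; \frac{\#\,\mathrm{support\ vectors}}{\#\,\mathrm{train\ points}}
\]
```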
Analysis-GP
Why does GP give a bad test accuracy
when the number of train points is small?
[Figure, left panel: recognition accuracy (unlabeled) and log evidence vs. sigma (hyperparameter) at percentage of train points per class = 50 %. Right panel: the same curves at percentage of train points per class = 10 %.]
At 50 % train points: Max of Rec Accuracy ≈ Max of Log Evidence.
At 10 % train points: the log evidence curve is flat → fail to find the optimal sigma!
Conclusion
• GP
  Small number of train points → bad accuracy
  Large number of train points → good accuracy
• SVM
  Good accuracy regardless of the number of train points
• kNN (k = 1)
  Small number of train points → bad accuracy
  Large number of train points → good accuracy