Online Active Learning with Imbalanced Classes Zahra Ferdowsi October 15th 2013 DePaul University Accenture Technology Labs.

Online Active Learning with
Imbalanced Classes
Zahra Ferdowsi
October 15th 2013
DePaul University
Accenture Technology Labs
1
Do we always have
enough labeled data to
train the classifier?
2
Active Learning Scenario
• Large number of unlabeled examples
• The interactive nature (experts in the process)
• Limited labeling resources
• High labeling costs
3
Healthcare example: motivation of this study
• Inefficiencies in the healthcare insurance
process result in large monetary losses
affecting corporations and consumers
– $91 billion overspent in the US every year on Health
Administration and Insurance (McKinsey study,
Nov 2008)
– 131 percent increase in insurance premiums over
the past 10 years
4
Health Insurance Claim Process
5
Healthcare example
• Claim payment errors drive a significant
portion of these inefficiencies
– Increased administrative costs and service issues
of health plans
– Overpayment of Claims - direct loss
– Underpayment of Claims – loss in interest
payment for insurer, loss in revenue for provider
6
Early Rework Detection: How it was done before
1. Random Audits for Quality Control
[Diagram: Claims Database, Random Samples, Manual Audits, Auditors]
Extremely low hit rates
Long audit times due to fully manual audits
7
Early Rework Detection: How it was done before
2. Hypothesis and Rule Based Audits
[Diagram: Claims Database, Database Queries, Generate expert hypotheses, Hypothesis-based audits, Auditors]
Better hit rates, but still a lot of manual effort in discovering, building, updating,
executing, and maintaining the hypotheses
8
Data
• Duration: 2 years
• Number of claims: 3.5 million
• Labeled claims: 121k (49k rework)
• Number of features: 16k
9
Features
• Member information
• Provider information
• Claim header
– Contract information, total amount billed,
diagnosis code, date of service
• Claim line details
– amount billed per service, procedure code,
counter for the procedure (quantity)
10
Predictive Modeling
• Domain characteristics
– High dimensional data
– Sparse data
– Fast training, updating and scoring required
– Ability to generate explanation for domain experts
• Classifier: Linear SVMs
– Distance from margin is used as the ranking score
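To illustrate how a linear model's decision value can serve as a ranking score, here is a minimal sketch in plain Python. It is not the deck's LibSVM setup: the weights, bias, and claim names are made up for illustration, and the sparse dictionary representation is an assumption. The raw value w·x + b is used; dividing by the norm of w would give the geometric distance but would not change the ranking.

```python
def margin_score(weights, bias, features):
    """Signed decision value w.x + b of a linear model for one sparse example.

    `features` maps feature index -> value, so only non-zero entries are visited,
    which suits high-dimensional sparse data.
    """
    return sum(weights.get(i, 0.0) * v for i, v in features.items()) + bias


def rank_by_score(weights, bias, examples):
    """Return example ids sorted from the highest to the lowest decision value."""
    scores = {eid: margin_score(weights, bias, feats) for eid, feats in examples.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical model and claims, for illustration only.
weights = {0: 1.5, 3: -2.0, 7: 0.5}
bias = -0.25
claims = {
    "claim_a": {0: 1.0, 7: 2.0},  # 1.5 + 1.0 - 0.25 = 2.25
    "claim_b": {3: 1.0},          # -2.0 - 0.25 = -2.25
    "claim_c": {0: 0.5},          # 0.75 - 0.25 = 0.5
}
ranking = rank_by_score(weights, bias, claims)
print(ranking)
```

Because each feature contributes an additive term, the per-feature products can also be listed for a domain expert as a simple explanation of why a claim ranks high.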
11
Well-known Instance Selection Strategies (ISS)
• Uncertainty
– Distance to the hyperplane (Shen et al., 2004)
– Entropy (Settles, 2008)
• Clustering
– Density (cosine similarity)
• Average similarity to all other cases (Shen et al., 2004)
– Hierarchical (Dasgupta, 2008)
– k-means using cosine similarity (Zhu et al., 2001)
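The uncertainty and density strategies listed above can be sketched as follows. This is a simplified illustration, not the cited authors' implementations: the function names are hypothetical, scores are assumed to be signed decision values, and examples are assumed to be dense lists of numbers.

```python
import math


def uncertainty_by_margin(scores, n):
    """Pick the n pool ids whose decision value is closest to zero (the hyperplane)."""
    return sorted(scores, key=lambda i: abs(scores[i]))[:n]


def binary_entropy(p):
    """Entropy (in bits) of a predicted positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def density(pool):
    """Average cosine similarity of each example to all other pool examples.

    Assumes the pool holds at least two examples.
    """
    ids = list(pool)
    return {
        i: sum(cosine(pool[i], pool[j]) for j in ids if j != i) / (len(ids) - 1)
        for i in ids
    }


# Toy values for illustration only.
scores = {"a": 2.0, "b": -0.1, "c": 0.4, "d": -1.5}
picked = uncertainty_by_margin(scores, 2)
dens = density({"a": [1, 0], "b": [1, 0], "c": [0, 1]})
```

Note that the all-pairs density computation is quadratic in the pool size, so on a pool of millions of claims it would need sampling or clustering to be practical.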
12
Well-known ISS (cont.)
• Hybrid approach: Density × Uncertainty (Zhu et
al., 2008; Settles et al., 2008)
• Query-by-Committee
– measuring the level of disagreement of a few
classifiers (Melville and Mooney, 2004)
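As a rough sketch of these two ideas: the hybrid score multiplies a density value by an uncertainty value per instance, and committee disagreement can be measured with vote entropy. Vote entropy is one common disagreement measure and is an assumption here, as the slide does not say which measure was used. All names and numbers below are illustrative.

```python
import math
from collections import Counter


def hybrid_score(density, uncertainty):
    """Density x Uncertainty for every instance id present in `density`."""
    return {i: density[i] * uncertainty[i] for i in density}


def vote_entropy(votes):
    """Disagreement (in bits) among the labels predicted by committee members."""
    total = len(votes)
    entropy = 0.0
    for count in Counter(votes).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Toy values for illustration only.
hybrid = hybrid_score({"a": 0.5, "b": 0.2}, {"a": 0.1, "b": 0.9})
best = max(hybrid, key=hybrid.get)   # instance with the largest product
split = vote_entropy([1, 0, 1, 0])   # committee evenly split
agree = vote_entropy([1, 1, 1])      # committee unanimous
```

An instance is a good query candidate under Query-by-Committee when its vote entropy is high, and under the hybrid approach when it is both uncertain and located in a dense region.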
13
Experimental Setup
• 5-fold cross-validation
• Evaluation metric: precision at top 1%, 2%, and 5%
• Number of instances labeled in each iteration = 100
• SVM as the base classifier using LibSVM
Procedure (flowchart):
1. Select n instances randomly from the pool set
2. Remove selected instances from the pool set
3. Add these instances with label to the training set
4. Train the classifier on the training set
5. Use the classifier to measure precision @ k% on testing set
6. Is the pool set exhausted? Yes: End. No: select n instances from the pool set using an instance selection strategy, then go to step 2.
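The pool-based loop and the precision @ k% metric from this slide can be sketched as below. This is a minimal illustration under stated assumptions, not the experimental code: `train`, `score`, `select`, and `oracle` are hypothetical callables supplied by the caller, and the one-dimensional threshold "classifier" stands in for the SVM only so that the example runs without external libraries.

```python
import random


def precision_at_k(scores, labels, k_percent):
    """Fraction of positives among the top k% highest-scored test examples."""
    n_top = max(1, int(len(scores) * k_percent / 100))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_top]
    return sum(labels[i] for i in top) / n_top


def active_learning_loop(pool, oracle, test_x, test_y, train, score, select,
                         n, k_percent, seed=0):
    """Run pool-based active learning until the pool is exhausted.

    Returns the precision @ k% measured on the test set after each iteration.
    """
    rng = random.Random(seed)
    pool = list(pool)
    training, history = [], []
    batch = rng.sample(pool, min(n, len(pool)))    # first batch is random
    while True:
        for x in batch:
            pool.remove(x)                         # remove from the pool set
            training.append((x, oracle(x)))        # add with its label
        model = train(training)
        test_scores = [score(model, x) for x in test_x]
        history.append(precision_at_k(test_scores, test_y, k_percent))
        if not pool:                               # pool exhausted: stop
            return history
        batch = select(model, pool, min(n, len(pool)))


# Toy stand-ins (assumptions): 1-D inputs, positives are the rare class x > 0.7.
def toy_oracle(x):
    return 1 if x > 0.7 else 0


def toy_train(training):
    pos = [x for x, y in training if y == 1]
    neg = [x for x, y in training if y == 0]
    # Threshold halfway between the classes; fall back to 0.5 if one is missing.
    return (min(pos) + max(neg)) / 2 if pos and neg else 0.5


def toy_score(model, x):
    return x - model


def toy_uncertainty(model, pool, n):
    return sorted(pool, key=lambda x: abs(toy_score(model, x)))[:n]


grid = [i / 10 for i in range(1, 11)]              # 0.1, 0.2, ..., 1.0
history = active_learning_loop(
    pool=[x + 0.05 for x in grid], oracle=toy_oracle,
    test_x=grid, test_y=[toy_oracle(x) for x in grid],
    train=toy_train, score=toy_score, select=toy_uncertainty,
    n=3, k_percent=20,
)
```

With a pool of 10 and n = 3, the loop runs four iterations (3, 3, 3, and 1 instances), so `history` holds four precision values.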
14
How do existing ISS perform?
Claims data set
15
How do existing ISS perform?
Claims data set
16
Experiments on more datasets
• KDD Cup 1999 dataset for network intrusion detection. I use
the "probing" intrusion as the label.
• HIVA is a chemoinformatics dataset used to predict which
compounds are active against the AIDS HIV infection.
• ZEBRA is an embryology dataset that provides a feature
representation of cells of zebrafish embryos to determine
whether they are in division (meiosis) or not.
17
How do existing ISS perform?
ZEBRA data set
18
Do existing ISS work?
• No ISS is consistently the best in all domains and
at all precision levels
• Creating a validation set is challenging since
labeled data are scarce and expensive to obtain.
Proposal: an unsupervised score that can
predict the performance of an ISS without
using any additional labeled examples.
19
Proposed Unsupervised Scores
• MS on Unlabeled set (MSU) : mean score of
the top k% instances in the unlabeled set
• MS on Labeled set (MSL) : mean score of the
top k% instances in the labeled set from the
previous iteration
• MS on All (MSA) : mean score of the top k%
instances in the combined (unlabeled set and
the labeled set from the last iteration) set.
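The three scores defined above can be sketched as a mean over the top k% of classifier scores. The sketch assumes that "score" means the classifier's ranking score (for example, the distance from the margin) and that the labeled set is scored with the current classifier; both are interpretations, and the function names and numbers are illustrative.

```python
def mean_top_k(scores, k_percent):
    """Mean of the top k% highest scores (at least one score is always used)."""
    n_top = max(1, int(len(scores) * k_percent / 100))
    top = sorted(scores, reverse=True)[:n_top]
    return sum(top) / n_top


def unsupervised_scores(unlabeled_scores, labeled_scores, k_percent):
    """Compute MSU, MSL, and MSA from classifier scores; no labels are needed."""
    return {
        "MSU": mean_top_k(unlabeled_scores, k_percent),
        "MSL": mean_top_k(labeled_scores, k_percent),
        "MSA": mean_top_k(unlabeled_scores + labeled_scores, k_percent),
    }


# Toy classifier scores for illustration only.
unlabeled = [2.0, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0, 0.0, 0.2, 0.1]
labeled_last_iteration = [3.0, -1.0, 0.4, 0.6, -0.2]
ms = unsupervised_scores(unlabeled, labeled_last_iteration, k_percent=20)
```

Here the top 20% is two of the ten unlabeled scores, one of the five labeled scores, and three of the fifteen combined scores.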
20
Do the new unsupervised scores work?
• The graphs show high correlation between the score
and precision.
Certainty on Claims data set
21
Do they work?
• The correlation values are promising
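One way to quantify how closely an unsupervised score tracks precision across iterations is the Pearson correlation coefficient. The slides do not state which correlation measure was used, so this choice is an assumption, and the two series below are synthetic placeholders rather than results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Synthetic per-iteration values, for illustration only.
msu_per_iteration = [1.0, 2.0, 3.0, 4.0]
precision_per_iteration = [0.2, 0.4, 0.6, 0.8]
r = pearson(msu_per_iteration, precision_per_iteration)
```

A value of r near 1 would mean the score rises and falls together with precision, which is the property needed to use it in place of a validation set.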
22
Can we use the unsupervised score to
predict the best ISS in each iteration?
• The online algorithm has two components:
– The unsupervised score (MSU), which can track the
performance of individual ISS without using any
validation set.
– A simple online algorithm that uses MSU to switch
between different strategies.
• The existing unsupervised approach:
– CEM (Classification Entropy Maximization) as the score
– Algorithm for switching between ISS (multi-armed bandit)
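The slides do not spell out the switching rule, so the following is only a hypothetical illustration of how a change in MSU could drive the choice of strategy: keep the current ISS while MSU keeps rising, and otherwise rotate to the next candidate. The rule, the function name, and the MSU values are all assumptions and should not be read as the presented algorithm.

```python
def choose_strategy(strategies, current, msu_history):
    """Hypothetical rule: keep `current` while MSU rises, otherwise rotate.

    `msu_history` holds the MSU value observed after each completed iteration.
    """
    if len(msu_history) < 2 or msu_history[-1] > msu_history[-2]:
        return current
    next_index = (strategies.index(current) + 1) % len(strategies)
    return strategies[next_index]


# Illustrative candidate strategies and made-up MSU values.
strategies = ["uncertainty", "density", "random"]
observed_msu = [0.5, 0.7, 0.6, 0.9, 0.9]

current = strategies[0]
msu_history, chosen = [], []
for msu in observed_msu:
    msu_history.append(msu)
    current = choose_strategy(strategies, current, msu_history)
    chosen.append(current)
print(chosen)
```

Unlike a multi-armed bandit, this rule keeps no per-strategy reward estimates; it reacts only to the most recent change in MSU.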
23
Online Active Learning
24
How does the online algorithm work?
HIVA data set
25
Conclusion
• Proposed an online algorithm for active learning
that switches between different candidate ISS for
classification in imbalanced data sets.
• This online algorithm has two components:
– A score, MSU, that can track the performance of individual
ISS without using any validation set.
– A simple online algorithm that uses change in MSU to switch
between different strategies.
• The online approach works better than (or at least
comparably to) the best individual ISS and achieves 80% to 100% of the highest possible precision.
26
Questions
27
References
[1] Active learning challenge.
[2] KDD Cup 1999.
[3] J. Attenberg and F. Provost. Inactive learning?: difficulties employing active
learning in practice. SIGKDD Explorations Newsletter, 12, March 2011.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic
multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
[5] Y. Baram, R. El-Yaniv, K. Luz, and M. Warmuth. Online choice of active learning
algorithms. Journal of Machine Learning Research, 2004.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[7] P. Donmez and J. G. Carbonell. Active sampling for rank learning via optimizing
the area under the ROC curve. In Proceedings of the 31st European Conference on IR
Research on Advances in Information Retrieval, ECIR ’09, pages 78–89, Berlin,
Heidelberg, 2009. Springer.
[8] P. Donmez, J. G. Carbonell, and P. N. Bennett. Dual strategy active learning. In
ECML, 2007.
[9] J. He and J. Carbonell. Nearest-neighbor-based active learning for rare category
detection. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural
Information Processing Systems 20, MIT Press, Cambridge, MA, 2008.
[10] M. Kumar, R. Ghani, and Z.-S. Mei. Data mining to predict and prevent errors in
health insurance claims processing. In KDD 2010, KDD ’10, New York, USA, 2010.
[11] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text
classification. In Proceedings of the International Conference on Machine
Learning (ICML), pages 359–367. Morgan Kaufmann, 1998.
[12] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. ICML, 2004.
[13] B. Settles. Active learning literature survey. Computer Sciences Technical Report
1648, University of Wisconsin–Madison, 2009.
[14] B. Settles and M. Craven. An analysis of active learning strategies for sequence
labeling tasks. In EMNLP, 2008.
[15] S. Tong and D. Koller. Support vector machine active learning with
applications to text classification. In Proceedings of the International Conference
on Machine Learning (ICML), pages 999–1006. Morgan Kaufmann, 2000.