Learning Ensembles of First-Order Clauses for Recall-Precision Curves:
A Case Study in Biomedical Information Extraction
Mark Goadrich, Louis Oliphant and Jude Shavlik
Department of Computer Sciences, University of Wisconsin – Madison, USA
6 Sept 2004
Talk Outline
- Link Learning and ILP
- Our Gleaner Approach
- Aleph Ensembles
- Biomedical Information Extraction
- Evaluation and Results
- Future Work
ILP Domains
- Object Learning
  - Trains, Carcinogenesis
- Link Learning
  - Binary predicates
Link Learning
- Large skew toward negatives
  - 500 relational objects
  - 5,000 positive links means 245,000 negative links
- Difficult to measure success
  - An "always negative" classifier is 98% accurate
  - ROC curves look overly optimistic
- Enormous quantity of data
  - 4,285,199,774 web pages indexed by Google
  - PubMed includes over 15 million citations
Our Approach
- Develop fast ensemble algorithms focused on recall and precision evaluation
Key Ideas of Gleaner
- Keep a wide range of clauses
- Create separate theories for different recall ranges
- Evaluation
  - Area Under the Recall-Precision Curve (AURPC)
  - Time = number of clauses considered
Gleaner - Background
- Focus evaluation on positive examples
  - Recall = TP / (TP + FN)
  - Precision = TP / (TP + FP)
- Rapid Random Restart (Zelezny et al., ILP 2002)
  - Stochastic selection of a starting clause
  - Time-limited local heuristic search
  - We store a variety of clauses (based on recall)
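A minimal sketch of these two metrics in Python (the function names and the comment's example are illustrative, not part of the talk):

```python
def recall(tp, fn):
    # Fraction of the positive examples that are recovered.
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def precision(tp, fp):
    # Fraction of the predicted positives that are truly positive.
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# The "always negative" classifier from the link-learning slide is 98%
# accurate, yet it has tp = 0, so both recall and precision are 0.
```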
Gleaner - Learning
- Create B recall bins
- Generate clauses
- Record the best clause found in each bin
- Repeat for K seeds
[Figure: clauses plotted in recall-precision space, grouped into the B recall bins]
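A rough Python sketch of this learning loop, assuming a placeholder rapid_random_restart() that yields candidate clauses for a seed and a placeholder evaluate() that returns a clause's recall and precision on the tuning set (neither is an actual Gleaner or Aleph call):

```python
def gleaner_learn(seeds, num_bins, clause_budget):
    # best[b] maps each seed to the highest-scoring clause whose
    # tuning-set recall falls into recall bin b.
    best = [dict() for _ in range(num_bins)]
    for seed in seeds:
        for clause in rapid_random_restart(seed, clause_budget):
            rec, prec = evaluate(clause)
            b = min(int(rec * num_bins), num_bins - 1)  # index of the recall bin
            score = prec * rec                          # "precision x recall" from the talk
            if seed not in best[b] or score > best[b][seed][0]:
                best[b][seed] = (score, clause)
    return best
```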
Gleaner - Combining
- Combine the K clauses in each bin
  - If at least L of the K clauses match an example, call it positive
- How to choose L?
  - L = 1: high recall, low precision
  - L = K: low recall, high precision
- Our method
  - Choose L such that the ensemble's recall matches bin b's recall
  - Bin b's precision should then be higher than that of any clause in it
- We should now have a set of high-precision rule sets spanning the space of recall levels (see the sketch below)
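A sketch of the "at least L of K" combination for one bin; matches() is a placeholder for testing whether a clause covers an example, and the tuning-set positives are used to pick L so the ensemble's recall lands on the bin's target recall:

```python
def choose_L(bin_clauses, tune_positives, target_recall):
    # Number of the bin's clauses that fire on an example.
    def votes(example):
        return sum(matches(clause, example) for clause in bin_clauses)

    best_L, best_gap = 1, float("inf")
    for L in range(1, len(bin_clauses) + 1):
        covered = sum(votes(ex) >= L for ex in tune_positives)
        gap = abs(covered / len(tune_positives) - target_recall)
        if gap < best_gap:
            best_L, best_gap = L, gap
    return best_L

def classify(example, bin_clauses, L):
    # "If at least L of K clauses match, call example positive."
    return sum(matches(c, example) for c in bin_clauses) >= L
```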
How to use Gleaner
- Generate the recall-precision curve
- User selects a recall bin (e.g., recall = 0.50)
- Return classifications with precision confidence (e.g., precision = 0.70)
Aleph Ensembles
- We compare to ensembles of theories
- Algorithm (Dutra et al., ILP 2002)
  - Use K different initial seeds
  - Learn K theories, each containing C clauses
  - Rank examples by the number of theories that cover them (sketched below)
- Need to balance C for high performance
  - Small C leads to low recall
  - Large C leads to converging theories
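A minimal sketch of that ranking step, with a placeholder theory_covers() standing in for running a learned theory against an example:

```python
def rank_by_theory_votes(theories, examples):
    # Score each example by how many of the K theories cover it,
    # then sort from most votes to fewest.
    scored = [(sum(theory_covers(t, ex) for t in theories), ex) for ex in examples]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Sweeping a vote threshold down this ranking traces out one recall-precision curve per ensemble.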
Aleph Ensembles (100 theories)
[Figure: Testset AURPC (0.0 to 0.6) vs. number of clauses used per theory (0 to 300)]
Biomedical Information Extraction
- Given: medical journal abstracts tagged with protein-localization relations
- Do: construct a system to extract protein-localization phrases from unseen text
- Example: "NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism."
Biomedical Information Extraction
- Hand-labeled dataset (Ray & Craven '01)
  - 7,245 sentences from 871 abstracts
  - Examples are phrase-phrase combinations
    - 1,810 positive & 279,154 negative
- 1.6 GB of background knowledge
  - Structural, statistical, lexical and ontological
  - In total, 200+ distinct background predicates
Evaluation Metrics
Two dimensions:
- Area Under the Recall-Precision Curve (AURPC)
  - All curves standardized to cover the full recall range
  - AURPC averaged over 5 folds
- Number of clauses considered
  - Rough estimate of time
  - Both are "stop anytime" parallel algorithms
[Figure: recall-precision plot (both axes 0 to 1.0) illustrating the area under the curve]
AURPC Interpolation
- Convex interpolation in RP space?
  - Precision interpolation is counterintuitive
- Example: 1,000 positive & 9,000 negative examples

    TP     FP     TP Rate  FP Rate  Recall  Prec
    500    500     0.50     0.06     0.50   0.50
    750    4750    0.75     0.53     0.75   0.14
    1000   9000    1.00     1.00     1.00   0.10

[Figure panels: the example counts, the corresponding ROC curves, and the corresponding RP curves]
AURPC Interpolation
[Figure: recall-precision plot contrasting the correct (curved) interpolation with the incorrect (straight-line) interpolation between two points]
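A sketch of one way to interpolate in TP/FP-count space, consistent with the "correct interpolation" curve above: false positives are assumed to grow linearly as true positives are added between the two end points (the function names are illustrative):

```python
def interpolate_pr(point_a, point_b, total_pos):
    # point_a and point_b are (tp, fp) pairs with tp_a < tp_b.
    (tp_a, fp_a), (tp_b, fp_b) = point_a, point_b
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs per extra TP
    curve = []
    for tp in range(tp_a, tp_b + 1):
        fp = fp_a + slope * (tp - tp_a)
        curve.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return curve

def area_under_pr(curve):
    # Trapezoidal area over the (recall, precision) points.
    area = 0.0
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Applied to the end points of the example table (500 TP / 500 FP and 1,000 TP / 9,000 FP, with 1,000 positives), this reproduces the middle row, recall 0.75 at precision 0.14, whereas a straight line in RP space would report a much more optimistic precision at that recall.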
Experimental Methodology
- Performed five-fold cross-validation
- Variation of parameters
  - Gleaner (20 recall bins)
    - # seeds = {25, 50, 75, 100}
    - # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
  - Ensembles (0.75 minacc, 35,000 nodes)
    - # theories = {10, 25, 50, 75, 100}
    - # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
Results: Testfold 5 at 1,000,000 clauses
[Figure: recall-precision curves (both axes 0 to 1.0) for Gleaner and Aleph Ensembles on testfold 5 after 1,000,000 clauses]
Results: Gleaner vs Aleph Ensembles
[Figure: Testset AURPC (0.0 to 0.5) vs. number of clauses generated (logarithmic scale, 10,000 to 100,000,000) for Gleaner and Aleph Ensembles]
Conclusions
- Gleaner
  - Focuses on recall and precision
  - Keeps a wide spectrum of clauses
  - Good results in few CPU cycles
- Aleph ensembles
  - 'Early stopping' helpful
  - Require more CPU cycles
- AURPC
  - Useful metric for comparison
  - Interpolation unintuitive
Future Work
- Improve Gleaner performance over time
- Explore alternate clause combinations
- Better understanding of AURPC
- Search for clauses that optimize AURPC
- Examine more ILP link-learning datasets
- Use Gleaner with other ML algorithms
Take-Home Message
- Definition of "gleaner"
  - One who gathers grain left behind by reapers
- Gleaner and ILP
  - Many clauses are constructed and evaluated during ILP hypothesis search
  - We need to make better use of those that are not the highest-scoring ones
Thanks, Questions?
Acknowledgements
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- Condor Group
- David Page
- Vitor Santos Costa, Ines Dutra
- Soumya Ray, Marios Skounakis, Mark Craven
- Dataset available at (URL in proceedings):
  ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Deleted Scenes
- Aleph Learning
- Clause Weighting
- Sample Gleaner Recall-Precision Curve
- Sample Extraction Clause
- Gleaner Algorithm
Director Commentary: on / off
Aleph - Learning
- Aleph learns theories of clauses (Srinivasan, v4, 2003)
  - Pick a positive seed example and saturate it
  - Use heuristic search to find the best clause
  - Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered
- A theory produces one recall-precision point
  - Learning complete theories is time-consuming
  - Can produce a ranking with theory ensembles
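A schematic Python sketch of that covering loop; saturate(), search_best_clause(), and covers() are placeholders for Aleph's actual saturation, search, and coverage tests:

```python
def learn_theory(positives, negatives, coverage_threshold):
    # Repeat until the requested fraction of positives is covered.
    theory, uncovered = [], set(positives)
    while len(uncovered) > (1 - coverage_threshold) * len(positives):
        seed = next(iter(uncovered))             # pick a positive seed
        bottom = saturate(seed)                  # most-specific clause for the seed
        clause = search_best_clause(bottom, positives, negatives)
        theory.append(clause)
        uncovered = {p for p in uncovered if not covers(clause, p)}
    return theory
```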
Clause Weighting
- Single-theory ensemble
  - Rank examples by how many clauses cover them
- Weight clauses using tuneset statistics (see the sketch below)
  - CN2 (average precision of the matching clauses)
  - Lowest false-positive-rate score
  - Cumulative
    - F1 score
    - Precision
    - Recall
    - Diversity
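A sketch of the CN2-style scheme from the list above: an example's score is the average tuning-set precision of the clauses that match it (matches() and the precision table are placeholders):

```python
def cn2_score(example, clauses, tune_precision):
    # tune_precision maps a clause to its precision on the tuning set.
    matched = [tune_precision[c] for c in clauses if matches(c, example)]
    return sum(matched) / len(matched) if matched else 0.0
```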
Clause Weighting
[Figure: AURPC (0.0 to 0.45) for the weighting schemes Precision, Equal, Ranked List, and CN2]
Further Results
[Figure: Testset AURPC (0.0 to 0.5) vs. number of clauses generated (logarithmic scale, 10,000 to 100,000,000) for Gleaner, Aleph Ensembles, and Ensembles 1K]
Biomedical Information Extraction
[Figure: parse of "NPL3 encodes a nuclear protein with ..." into sentence, noun-phrase, verb-phrase, and prepositional-phrase constituents, with word-level tags (article, adjective, noun, preposition) and annotations such as "alphanumeric" and "marked location"]
Sample Extraction Clause
[Figure: a sample clause shown graphically, relating sentence S, protein phrase P, and location phrase L through literals such as "contains alphanumeric", "noun", "article", "contains marked location", and "between half contains no verb"]
- P = Protein, L = Location, S = Sentence
- 29% recall, 34% precision on testset 1
Gleaner Algorithm
- Create B equal-sized recall bins
- For K different seeds
  - Generate rules using Rapid Random Restart
  - Record the best rule (precision x recall) found for each bin
- For each recall bin b
  - Find the threshold L of K clauses such that the recall of "at least L of K clauses match the example" equals the recall for this bin
- Find recall and precision on the testset using each bin's "at least L of K" decision process