
The Fudan-UIUC Participation in the BioASQ Challenge Task 2a: The Antinomyra System

Ke Liu1, Junqiu Wu2, Shengwen Peng1, Chengxiang Zhai3, Shanfeng Zhu1
[email protected]
1 Fudan University
2 Central South University
3 University of Illinois at Urbana-Champaign
Outline
• Introduction
• Related Work
• Our Methods
• Experimental Results
• Conclusion
Introduction: MeSH Terms

Each year, around 0.8 million biomedical documents are added to MEDLINE.

MeSH is important for:
• Indexing all documents in MEDLINE
• Indexing many books and collections in NLM
• Improving retrieval performance through query expansion using MeSH
• Improving clustering performance by integrating MeSH information [Zhu et al. 2009 IP&M] [Zhe et al. 2009 Bioinformatics] [Gu et al. 2013 IEEE TSMCB]
• Improving biomedical text mining performance
Automatic MeSH Annotation Is a Challenging Problem
• More than 26,000 MeSH headings, organized in a hierarchical structure
• Quickly approaching 1,000,000 articles indexed per year
• ~$9.40 to index an article
• The number of distinct MeSH headings is large (almost 27,000)
• Large variation in MeSH frequencies across MEDLINE
• Large variation in the number of MeSH terms per document
BioASQ (Large-Scale Biomedical Semantic Indexing Competition)

Batch 3 test sets:
• Week 1: 4342 docs
• Week 2: 8840 docs
• Week 3: 3702 docs
• Week 4: 4726 docs
• Week 5: 4533 docs

Label-based Micro F1-measure (MiF): L represents the label set and |L| the number of labels. Because counts are pooled over all labels, frequent labels are weighted more heavily in the evaluation.
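The MiF definition appears as an image in the slides; the standard micro-averaged formulation it refers to pools true positives (TP), false positives (FP), and false negatives (FN) over all labels (a reconstruction, not copied from the slide):

```latex
\mathrm{MiP} = \frac{\sum_{\ell \in L} TP_\ell}{\sum_{\ell \in L} (TP_\ell + FP_\ell)}, \qquad
\mathrm{MiR} = \frac{\sum_{\ell \in L} TP_\ell}{\sum_{\ell \in L} (TP_\ell + FN_\ell)}, \qquad
\mathrm{MiF} = \frac{2\,\mathrm{MiP}\cdot\mathrm{MiR}}{\mathrm{MiP} + \mathrm{MiR}}
```

Since the sums run over label occurrences rather than over labels, a frequent label contributes many terms, which is exactly why MiF weighs frequent labels more.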

[Figure: MiF on Batch 3, week 5 (4533 docs), comparing the Fudan University system with the current NLM solution.]

We achieved around 10% improvement over the current NLM MTI solution (results of June 2014).
Outline
• Introduction
• Related Work
• Our Methods
• Experimental Results
• Conclusion
NLM Approach: MTI

Two sources:
• MetaMap Indexing: maps text to UMLS concepts, restricted to MeSH
• PubMed Related Citations

Reference: http://ii.nlm.nih.gov/MTI/history.shtml

Advanced machine learning algorithms are not utilized.
MetaLabeler (Tsoumakas et al. 2013)
• First, for each MeSH heading, a binary classification model is trained using a linear SVM.
• Second, a regression model is trained to predict the number of MeSH headings for each citation.
• Finally, given a target citation, the MeSH headings are ranked by the prediction scores of their SVM classifiers, and the top K headings are returned as suggestions, where K is the number predicted by the regression model. (A minimal sketch follows below.)

Problems:
• It uses only global information.
• The scores from different classifiers are not comparable.
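A minimal MetaLabeler-style sketch, assuming a precomputed feature matrix and a 0/1 label matrix; Ridge stands in for whichever regressor was actually used, and with ~25,000 labels this is illustrative only:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import Ridge

def train_metalabeler(X, Y):
    """X: (n_docs, n_features) feature matrix; Y: (n_docs, n_labels) 0/1 matrix."""
    classifiers = [LinearSVC().fit(X, Y[:, j]) for j in range(Y.shape[1])]
    counter = Ridge().fit(X, Y.sum(axis=1))  # predicts how many headings a doc gets
    return classifiers, counter

def predict_metalabeler(x, classifiers, counter):
    """Rank all headings by their SVM scores and keep the predicted top K for one doc x."""
    scores = np.array([clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers])
    k = max(1, int(round(counter.predict(x.reshape(1, -1))[0])))
    return np.argsort(-scores)[:k]  # indices of the suggested MeSH headings
```

Because each heading's score comes from a different classifier, the ranking mixes incomparable scales, which is the comparability problem noted above.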
NCBI's Learning to Rank (LTR) (Huang et al., 2011; Mao et al., 2013)
• Each citation is treated as a query and each MeSH heading as a document.
• An LTR method ranks the candidate MeSH headings with respect to the target citation.
• The candidate MeSH headings come from similar citations (nearest neighbors).

Problems:
• It uses only local information.
• Similar citations might be rare.
Outline
• Introduction
• Related Work
• Our Methods
• Experimental Results
• Conclusion
Our Solution: Learning to Rank (LTR) Framework

[Pipeline diagram: given a target document, (1) retrieve similar documents (PRA); (2) obtain an initial list of main headings (MH-0 ... MH-n) via logistic regression; (3) generate features for each candidate main heading; (4) rank the headings with a LambdaMart ranking model to produce the final ranked list, which is then evaluated.]
Main idea: various kinds of information (features) are integrated in the Learning to Rank (LTR) framework. Given a target document, for each candidate MeSH heading we obtain prediction scores from several sources:
(1) Logistic regression (global information)
(2) KNN (local information)
(3) Pattern matching
(4) MTI result (KNN + pattern + rule)
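A hedged sketch of the ranking step: the talk names LambdaMart but not an implementation, so lightgbm's LambdaMART ranker is my stand-in, and the synthetic data merely mimics one feature per score source:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# One row per (citation, candidate MeSH) pair; columns mimic the four sources:
# [normalized LR score, KNN score, pattern-match flag, MTI flag]
X_train = rng.normal(size=(100, 4))
y_train = rng.integers(0, 2, size=100)  # 1 = heading assigned by human indexers
train_group = [10] * 10                 # 10 citations, 10 candidate headings each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X_train, y_train, group=train_group)

scores = ranker.predict(X_train[:10])   # candidates of one citation
ranked = np.argsort(-scores)            # final ranked list of headings
```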
Logistic Regression

We train a binary logistic regression model for each label, giving more than 25,000 binary models.

Question: the prediction scores come from different classifiers. How can we make these scores comparable?

Key idea: we have a huge validation set, the whole of MEDLINE. Use the precision at prediction score K as the normalized score. [Liu et al., In preparation]
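A minimal sketch of this calibration idea (my reconstruction, not the authors' code): for each classifier, map a raw score s to the empirical precision of all validation predictions scoring at least s, so scores from different classifiers land on a common precision scale:

```python
import numpy as np

def precision_calibrator(val_scores, val_labels):
    """val_scores: raw classifier scores on the validation set;
    val_labels: 1 if the label was truly assigned, else 0.
    Returns a function mapping a raw score to precision-at-that-score."""
    order = np.argsort(-val_scores)                 # descending by score
    sorted_scores = val_scores[order]
    precision = np.cumsum(val_labels[order]) / np.arange(1, len(order) + 1)

    def normalize(s):
        k = np.searchsorted(-sorted_scores, -s, side="right")  # count of scores >= s
        return precision[k - 1] if k > 0 else 1.0   # above max score: assume top precision
    return normalize
```

After calibration, a normalized score of 0.8 means the same thing for every classifier: on MEDLINE, predictions at that raw-score level were right 80% of the time.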
Performance comparison for LR between the default prediction scores and our normalized scores [Liu et al., In preparation]:

Method            | MiP    | MiR    | MiF
Default scores    | 0.5576 | 0.5614 | 0.5595
Normalized scores | 0.5734 | 0.5774 | 0.5754
KNN

Given a target citation, we use NCBI efetch to find its similar (neighbor) citations. For a candidate MeSH heading, we compute a confidence score from these neighbors. Specifically, over the top 25 documents most similar to the target citation, we use the formula below, where Si is the similarity score of a document in the top 25 and Sk is the score of a document that is both in the top 25 and annotated with the candidate MeSH heading.
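The formula itself was an image in the slides; from the definitions of Si and Sk, the natural reconstruction (an assumption on my part) is the similarity-weighted fraction of the top-25 neighbors that carry the candidate heading m:

```latex
\mathrm{score}(m) \;=\; \frac{\sum_{k \in N_{25},\; m \in \mathrm{MeSH}(k)} S_k}{\sum_{i \in N_{25}} S_i}
```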
Pattern Matching

We use direct string matching to find a MeSH term, its synonyms, and its entry terms in the text.
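A toy sketch of this feature (the dict-of-synonyms layout is my assumption, not the actual MeSH vocabulary format):

```python
def pattern_match(text, mesh_vocab):
    """Return the MeSH headings whose heading, synonym, or entry term
    appears verbatim (case-insensitively) in the text."""
    lowered = text.lower()
    return {
        heading
        for heading, terms in mesh_vocab.items()
        if any(term.lower() in lowered for term in [heading, *terms])
    }

# Hypothetical two-entry vocabulary:
vocab = {"Neoplasms": ["tumor", "tumour"], "Hypertension": ["high blood pressure"]}
print(pattern_match("A patient with high blood pressure and a tumor ...", vocab))
# -> {'Neoplasms', 'Hypertension'} (set order may vary)
```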
MTI

Whether the candidate MeSH heading appears in the default results of MTI.
The Number of MeSH Labels

A support vector regression model predicts the number of labels per citation, using features such as:
• Journal information
• The number of labels in the nearest neighbors
• The number of labels predicted by MTI
• The number of labels predicted by MetaLabeler
A sketch of this regressor follows below.
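A hedged sketch with sklearn's SVR as a stand-in; the exact feature encoding is an assumption based on the list above:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Columns: [journal feature, mean #labels in neighbors, #labels by MTI, #labels by MetaLabeler]
X = rng.normal(size=(200, 4))
y = rng.integers(5, 20, size=200)  # true number of MeSH headings per citation

count_model = SVR().fit(X, y)
k = max(1, int(round(count_model.predict(X[:1])[0])))  # headings to keep for one citation
```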
Outline
• Introduction
• Related Work
• Our Methods
• Experimental Results
• Conclusion
Evaluation & Experiments

Server: 4 × Intel Xeon E5-4650 2.7 GHz CPUs, 128 GB RAM.
• Training the LR classifiers took 5 days.
• All other training tasks took 1 day.
• Annotating 10,000 citations takes 2 hours.
Outline
• Introduction
• Related Work
• Our Methods
• Experimental Results
• Conclusion
Conclusion & Future Work
• The superior performance of our method comes from integrating many kinds of information in the LTR framework: MTI, KNN, LR, as well as direct matching.
• The big data of MEDLINE makes prediction score normalization possible and improves performance significantly.
• More information could be used, such as full text and indexing rules.
• How can we minimize the gap between a good competition system and real applications?
Acknowledgement
• Dr. Hongning Wang (UIUC)
• Mr. Mingjie Qian (UIUC)
• Mr. Jieyao Deng (Fudan)
• Mr. Tianyi Peng (Tsinghua)