transporters - The Zhao Bioinformatics Laboratory


Functional characterization of membrane transporters from protein sequences
Haiquan Li
The Samuel Roberts Noble Foundation

Membrane transport proteins (transporters)
• Functions
  - Uptake of nutrients (nitrogen)
  - Pump out toxic metabolites
  - Mediate signal transduction
  - Maintain ionic and osmotic homeostasis
• Classes based on driving energy
  - Channels (passive diffusion)
  - Carrier-type facilitators (electrochemical potential-driven, e.g. sodium potential)
  - Primary active transporters

Characterization of transporters
• Small-scale experimental methods
  - Patch-clamp techniques for channels
  - Isotope-labeled substrates
  - Heterologous expression
  - Mutant complementation
• The demand for genome-scale computational methods (transportomics)
  - Comparative studies: comparative study of transporter families from multiple organisms, such as lignin-making and non-lignin-making organisms
  - Integrative study with transporter gene expression: exchange of metabolites (e.g. nitrogen) between legumes and rhizobia

An example of transportomics
Udvardi & Day, 1997
Day et al., 2001
Transporter resources and classification systems
• Manually curated resources
  - TCDB by Saier et al.
  - TransportDB by Ren et al.

Computational characterization of transporters
• Computational characterization methods
  - Machine learning
  - Homology search (domain, BLAST)
  - Empirical rules
• False positives caused by gene duplication (paralogs), domain shuffling, or non-transporter domains
  - Example: the Plant Plasmodesmata (PPD) family (1.A.26) transports hormones or growth factors.
  - Single member: Connexin 32, a gap junction protein

Motivation of our work
• Objectives
  - List all candidate transporters, since low confidence may imply novelty and significance
  - Reduce curation efforts significantly
• Methodologies
  - Use distinct machine learning methods and empirical rules to enhance annotation confidence
  - Efficiently and automatically integrate multiple lines of evidence from TCDB, Pfam, GO, SWISS-PROT and transmembrane segments (TMS)

Saport: a semi-automatic transporter annotation system
• Input sequences
• Machine Learning Module (TransportTP)
  - Initial classifier from TCDB: BLAST search and HMM search
  - Score integration and initial prediction
  - Refining classifier, with features from TMS, KNN in TCDB, Pfam domains, GO terms, and SwissProt homologs
  - Classification by ensemble of SVMs
• Empirical Rule Module
  - Collect transporter-related evidence
  - Summarize family-based empirical rules
  - Interpret rules and generate putative transporters
• Score integration and ranking

TransportTP: two-phase classification
• Initial classifier from TCDB
  - Protein p is scored against TC families F1, …, Fi, …, Fm through its nearest-neighbor (NN) transporter
  - Score: $\mathrm{blast}(p_{ij}) \times \mathrm{HMM}(F_i)$
  - Output: transporters and non-transporters
• Refining classifier, applied to the predicted transporters
  - True positives (correctly categorized transporters)
  - False positives (incorrectly predicted transporters)
• Predicted non-transporters
  - False negatives (missed transporters)
  - True negatives (non-transporters)
• References
  - Haiquan Li, Xinbin Dai & Xuechun Zhao. Bioinformatics, 24, 1129-1136, 2008.
  - Haiquan Li, Vagner A. Benedito, Michael K. Udvardi and Xuechun Zhao. BMC Bioinformatics, under revision.

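The nearest-neighbor scoring of the initial classifier can be illustrated with a minimal sketch. It assumes that each candidate TC family already has a normalized BLAST score (best hit in that family) and a normalized HMM score, both in [0, 1]; the function name `assign_family`, the normalization, and the `min_score` cutoff are illustrative assumptions, not the published TransportTP implementation.

```python
from typing import Dict, Optional, Tuple


def assign_family(
    blast_scores: Dict[str, float],
    hmm_scores: Dict[str, float],
    min_score: float = 0.0,
) -> Tuple[Optional[str], float]:
    """Pick the TC family with the largest blast * HMM product.

    blast_scores: family -> normalized score of the best BLAST hit in that family
    hmm_scores:   family -> normalized score of the family HMM against the query
    Returns (family, score), or (None, 0.0) when no family exceeds min_score.
    """
    best_family, best_score = None, 0.0
    for family, blast in blast_scores.items():
        # A family missing from the HMM results contributes a zero product.
        score = blast * hmm_scores.get(family, 0.0)
        if score > best_score:
            best_family, best_score = family, score
    if best_score <= min_score:
        return None, 0.0
    return best_family, best_score
```

A query with no family above the cutoff is returned as `(None, 0.0)`, which in this sketch corresponds to an initial non-transporter call.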
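Refining features: TMS & KNN
• TMS feature
$$z_{tms}(p, F) = \frac{P_{tms}(p) - \mu_{tms}(F)}{\sigma_{tms}(F) + \epsilon}$$
  where $P_{tms}(p)$ is the predicted TMS number of protein p, and $\mu_{tms}(F)$ and $\sigma_{tms}(F)$ are the mean and standard deviation of the TMS number in family F
• KNN
[Figure: TMS distribution for the 1.A.1 family (72 channels). x-axis: TMS number (0 to 22); y-axis: number of transporters (0 to 18).]
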
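The TMS feature can be sketched as a z-score of the query's predicted TMS count against the TMS counts observed in a family. This is a minimal illustration under assumptions: the function name `tms_zscore` is hypothetical, the population standard deviation is used, and `eps` is an assumed small constant that guards against families whose members all share one TMS count.

```python
from statistics import mean, pstdev
from typing import Sequence


def tms_zscore(query_tms: int, family_tms: Sequence[int], eps: float = 1e-6) -> float:
    """Z-score of a query's TMS count relative to a family's TMS distribution."""
    mu = mean(family_tms)        # family mean TMS count
    sigma = pstdev(family_tms)   # family (population) standard deviation
    return (query_tms - mu) / (sigma + eps)
```

A value near zero means the query has a TMS count typical for the family; a large absolute value flags a candidate whose topology does not fit.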
Refining features: Pfam, GO & Swissprot
[Diagram labels: protein p; Pfam families; TC families; TCDB; Swissprot; + cross-links]

Refining classifier: ensemble of SVMs
• Classification label of training samples
  - Positives are benchmarked against the manual annotation in TransportDB
  - Others are negatives
[Diagram labels: major class (major samples); minor class; SVM1, SVM2, …, SVMk; unknown proteins; pos_weight > neg_weight?]

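One common way to build such an ensemble on imbalanced data is to split the major class into k chunks, train one base classifier per chunk against all minor-class samples, and combine the calls by majority vote. The sketch below illustrates only that partition-and-vote scheme; whether Saport partitions the data this way is an assumption. To stay dependency-free it uses a nearest-centroid learner as a stand-in for an SVM, and it assumes the minor class is the positive (transporter) class. All function names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector], int]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def train_nearest_centroid(pos: List[Vector], neg: List[Vector]) -> Classifier:
    """Stand-in base learner (NOT an SVM): label by the closer class centroid."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def predict(x: Vector) -> int:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return 1 if d_pos <= d_neg else 0

    return predict


def train_ensemble(minor: List[Vector], major: List[Vector], k: int) -> List[Classifier]:
    """Split the major class into k chunks; train one model per chunk vs. all minor samples."""
    chunks = [major[i::k] for i in range(k)]
    return [train_nearest_centroid(minor, chunk) for chunk in chunks if chunk]


def vote(models: List[Classifier], x: Vector) -> int:
    """Majority vote over the base classifiers (ties count as positive)."""
    positives = sum(m(x) for m in models)
    return 1 if 2 * positives >= len(models) else 0
```

Each base model sees a roughly balanced training set, which is one way to address the pos_weight versus neg_weight question without tuning class weights directly.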
Generation of empirical rules
• Manual curation of transporters
  - Collect transporter-related evidence
  - Categorize the evidence manually
• Summarize the rules on transporter families during the curation of plant organisms
  - medicago, lotus, sorghum, poplar, grape, soybean, moss, green algae
• Universal table of raw evidence
  - Columns: seqid, protein size, hmmtop tms, Tmpred tms, Tcdb hits, Pfam domains, GO terms, Swissprot homologs, NR hits, localizations

Representation of rules
• Categories of curation
  - Level 1: all expected features are present
  - Level 2: a minor feature is missing
  - Level 3: a major feature is missing, or multiple features conflict
• Representation and customization of complicated empirical rules
  - Rule table columns: family, min len, max len, len std dev, hmmtop tms, Tmpred tms, Tcdb hits, Pfam domains, GO terms, Swissprot homologs, definition
  - Rows are TC families: 1.A.1, …, 3.A.1.1
  - Example rule: isnull($tcdb_top_evalue); lt($len,$-2/2):=3; lt($len,$-2-$0):+1 up to 3; gt($len,$-1+$0):+1 up to 3

Interpretation of rules
• A simple script language
  - Flow control: serial ‘&’, otherwise ‘;’
  - Variable definition: database field variables and rule column variables
  - Assignment and arithmetic operations: ‘:’ ‘+’ ‘-’ ‘*’ ‘/’
  - Comparison operations: lt, gt, eq, le, ge
  - String operations: isnull, matched, items, match_items, compatible
  - Boundary functions: up to, down to
  - Advanced functions: key, index, gradient, etc.
  - Nested functions
• The interpretation program can stay fixed while the rules are tuned and customized for other kingdoms of organisms
• The script language is interpreted using programming techniques

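The semantics of a few of these operations can be sketched in Python. Everything below is an assumed reading, not the Saport interpreter: the comparison names come from the slide, but the meaning given to `isnull`, `up to`, and `down to`, and the hypothetical `length_level` rule (which treats the curation level as starting at 1 and reads relative column references as min len, max len, and len std dev) are illustrative guesses.

```python
import operator
from typing import Optional

# Comparison operations named on the slide, mapped to Python operators.
COMPARE = {"lt": operator.lt, "gt": operator.gt, "eq": operator.eq,
           "le": operator.le, "ge": operator.ge}


def isnull(value: Optional[object]) -> bool:
    """Assumed meaning: the evidence field is missing or empty."""
    return value is None or value == ""


def up_to(level: int, step: int, bound: int) -> int:
    """Assumed meaning of '+step up to bound': raise level, capped at bound."""
    return min(level + step, bound)


def down_to(level: int, step: int, bound: int) -> int:
    """Assumed meaning of '-step down to bound': lower level, floored at bound."""
    return max(level - step, bound)


def length_level(length: int, min_len: int, max_len: int, std_dev: float) -> int:
    """Hypothetical length rule: start at level 1 and penalize out-of-range lengths."""
    level = 1
    if COMPARE["lt"](length, min_len / 2):
        return 3                      # far too short: lowest confidence
    if COMPARE["lt"](length, min_len - std_dev):
        level = up_to(level, 1, 3)    # shorter than expected
    if COMPARE["gt"](length, max_len + std_dev):
        level = up_to(level, 1, 3)    # longer than expected
    return level
```

Keeping the operations as small named functions mirrors the design point on the slide: the interpreter stays fixed while the per-family rule text changes.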
Final issues on Saport
• Final integration
  - Final scores are integrated from machine learning scores and empirical categorization
  - Sequences annotated by either method are accepted; all others are filtered out
  - Confidence is gained from the mutual support of both methods; further review is needed for conflicting or singly annotated sequences
• Tools: filtering, visualization and online curation

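The acceptance logic described above can be sketched as a small decision function. The function name, the argument encoding, and the returned labels are assumptions made for illustration; the actual score integration in Saport is not specified on the slide.

```python
from typing import Optional


def integrate(ml_predicted: bool, rule_level: Optional[int]) -> str:
    """Combine the two modules' calls for one sequence.

    ml_predicted: True when the machine learning module calls a transporter.
    rule_level:   empirical curation level (1, 2, or 3), or None when the
                  rule module produced no annotation.
    """
    rule_predicted = rule_level is not None
    if ml_predicted and rule_predicted:
        return "accepted: supported by both methods"
    if ml_predicted or rule_predicted:
        return "accepted: single method, needs review"
    return "filtered out"
```

For example, `integrate(True, None)` keeps the sequence but marks it for review, since only the machine learning module supports it.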
Saport (http://bioinfo3.noble.org/saport)

Evaluation of TransportTP module: cross-validation results
Organism | Num of proteins | Predictions by TransportTP | Annotations in TransportDB | Matches | Text mining validated | Recall (%) | Precision (%) | Balanced accuracy (%)
E. coli | 5411 | 589 | 577 | 456 | 61 | 79.03 | 77.42 | 78.22
A. thaliana | 26960 | 1073 | 1278 | 996 | 38 | 77.93 | 92.82 | 84.73
O. sativa | 56278 | 1230 | 1283 | 1061 | 88 | 82.70 | 86.26 | 84.44
C. elegans | 20051 | 906 | 667 | 601 | 87 | 90.10 | 66.34 | 76.42
D. melanogaster | 13890 | 663 | 646 | 535 | 26 | 82.82 | 80.69 | 81.74
H. sapiens | 37742 | 1272 | 1466 | 1140 | 79 | 77.76 | 89.62 | 83.27
Average on model proteomes | | | | | | 81.72 | 82.19 | 81.96
P. torridus | 1535 | 165 | 171 | 137 | 15 | 80.12 | 83.03 | 81.55
P. profundum | 5489 | 550 | 580 | 445 | 35 | 76.72 | 80.91 | 78.76
D. psychrophila | 3234 | 316 | 305 | 242 | 38 | 79.34 | 76.58 | 77.94
A. fumigatus | 9923 | 671 | 619 | 563 | 50 | 90.95 | 83.90 | 87.28
Average on non-model proteomes | | | | | | 81.78 | 81.11 | 81.44
Average on all testing proteomes | | | | | 7.57% | 81.75 | 81.76 | 81.75

$$\mathrm{Balanced\_accuracy} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$

Yeast was used for training, and the E-value threshold of the initial classifier was set to 0.1.

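The three metrics in the table can be recomputed from the raw counts. The sketch below assumes recall is matches over TransportDB annotations and precision is matches over TransportTP predictions, which reproduces the tabulated E. coli values; the function name `evaluate` is illustrative.

```python
from typing import Tuple


def evaluate(annotated: int, predicted: int, matches: int) -> Tuple[float, float, float]:
    """Return (recall, precision, balanced accuracy) in percent."""
    recall = 100.0 * matches / annotated
    precision = 100.0 * matches / predicted
    if recall + precision == 0:
        return recall, precision, 0.0
    # Balanced accuracy as defined on the slide: the harmonic mean of recall and precision.
    balanced = 2 * recall * precision / (recall + precision)
    return recall, precision, balanced


# E. coli row: 577 annotations, 589 predictions, 456 matches.
recall, precision, balanced = evaluate(577, 589, 456)
print(f"{recall:.2f} {precision:.2f} {balanced:.2f}")
```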
Full results of TransportTP in leave-one-in cross-validation
[Figure: recall/sensitivity (average = 80.2%) and precision (average = 81.9%)]
The E-value threshold was set to 0.1 in the initial classifier.

General model versus genome-specific model on the balanced accuracy of TransportTP
[Figure: balanced accuracy versus E-value thresholds of the initial classifier]
$$\mathrm{Balanced\_accuracy} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$

Benefit of integrating machine learning with homology search
[Figure, two panels comparing TransportTP, BLAST plus HMM, and BLAST. Panel 1: balanced accuracy (%) versus E-value thresholds. Panel 2: precision (%) versus recall (%).]
Yeast was used for training, and E-value thresholds from 10 to 1e-50 were tested.

The predictive performance of TransportTP on plant organisms
Organism | Manually curated | Predictions | Matches | Recall (%) | Precision (%) | Potential transporter rate (%)
M. truncatula | 1621 | 1991 | 1251 | 77.17 | 62.83 | 29.83
G. max | 3509 | 4178 | 3054 | 87.03 | 73.10 | 18.26
L. japonicus | 1740 | 2381 | 1299 | 74.66 | 54.56 | 25.66
S. bicolor | 1918 | 1960 | 1485 | 77.42 | 75.77 | 7.70
P. trichocarpa | 2512 | 2889 | 1936 | 77.07 | 67.01 | 14.36
V. vinifera | 2188 | 2002 | 1540 | 70.38 | 76.92 | 5.49
P. patens | 1388 | 1380 | 1019 | 73.41 | 73.84 | 6.81
Average | | | | 76.74 | 69.28 | 15.45
C. reinhardtii | 979 | 770 | 554 | 56.59 | 71.95 | 7.66

Manually curated: curation with confidence levels 1 and 2.
Potential transporter rate: proportion of predictions that match curation level 3.
Arabidopsis was used for training, and 10 was used as the E-value threshold.

Preliminary results of automatic annotation by empirical rules
Organism | Manually curated | Automatically annotated | Matches | Recall (%) | Precision (%)
M. truncatula | 1621 | 1665 | 1386 | 85.50 | 83.24
G. max | 3509 | 3876 | 2867 | 81.70 | 73.97
L. japonicus | 1740 | 1580 | 1136 | 65.29 | 71.90
S. bicolor | 1918 | 1836 | 1534 | 79.98 | 83.55
P. trichocarpa | 2512 | 2575 | 2011 | 80.06 | 78.10
V. vinifera | 2188 | 1674 | 1429 | 65.31 | 85.36
P. patens | 1388 | 1384 | 1101 | 79.32 | 79.55
Average | | | | 76.74 | 79.38
C. reinhardtii | 979 | 653 | 583 | 59.55 | 89.28

Consistency between the two modules
Organism | Curation | TransportTP | Rules | Overlaps | Matches | Recall (%) | Precision (%)
M. truncatula | 1621 | 1991 | 1665 | 1235 | 1110 | 68.48 | 89.88
G. max | 3509 | 4178 | 3876 | 2838 | 2638 | 75.18 | 93.28
L. japonicus | 1740 | 2381 | 1580 | 1193 | 971 | 55.80 | 81.39
S. bicolor | 1918 | 1960 | 1836 | 1374 | 1308 | 68.20 | 95.20
P. trichocarpa | 2512 | 2889 | 2575 | 1915 | 1693 | 67.40 | 88.41
V. vinifera | 2188 | 2002 | 1674 | 1294 | 1222 | 55.85 | 94.44
P. patens | 1388 | 1380 | 1384 | 952 | 891 | 64.19 | 93.59
Average | | | | | | 65.01 | 90.88
C. reinhardtii | 979 | 770 | 653 | 452 | 439 | 44.84 | 97.12

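The overlap columns can be derived from three sets of sequence identifiers. This is a minimal sketch, assuming recall is computed against the curated set and precision against the overlap of the two modules, which is consistent with the M. truncatula row (1110/1621 and 1110/1235); the function name and the toy identifiers are hypothetical.

```python
from typing import Set, Tuple


def overlap_metrics(
    curated: Set[str], ml: Set[str], rules: Set[str]
) -> Tuple[int, int, float, float]:
    """Return (overlaps, matches, recall %, precision %) for the two modules."""
    overlap = ml & rules            # predicted by both modules
    matches = overlap & curated     # overlap confirmed by manual curation
    recall = 100.0 * len(matches) / len(curated) if curated else 0.0
    precision = 100.0 * len(matches) / len(overlap) if overlap else 0.0
    return len(overlap), len(matches), recall, precision


# Toy identifiers, for illustration only.
print(overlap_metrics({"a", "b", "c", "d"}, {"a", "b", "e", "f"}, {"a", "b", "c", "e"}))
```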
Consistency between the two methods (cont'd)
[Diagram: Saport results compared with human curation]
Results | Recall (%) | Precision (%)
Machine learning (TransportTP) | 76.74 | 69.28
Empirical rules | 76.74 | 79.38
Overlap of TransportTP and empirical rules | 65.01 | 90.88

Comparative study of monolignol transporters
[Diagram labels: plant cell; higher plants; moss ?; algae; fungi]
  - Comparative study
  - Strengthening predictions versus all potential predictions
  - Candidate monolignol transporters
  - 2.A.85 Aromatic Acid Transporters (ArAE)

Results on nodule transporters
TC family | Num of transporter genes | Substrates (specific) | Expr folds | Characterized orthologs | Reference
1.A.8.12 | 2 LIMP | ammonia NH3+ | over | LIMP1/2 in lotus | Guenther & Roberts, 2000
2.A.17 | 1 POT/PTR | dicarboxylate (malate) | >200 | AgDCAT1 in A. glutinosa | Jeong et al. 2004
2.A.53 | 13 (2) | sulfate | >2 | LjSST1 in lotus | Krusell et al. 2005
2.A.1.8, 2.A.1 | 2 NPP, 32 | nitrate/nitrite | over | LjN70 in lotus, GmN70 in soybean | Vincill et al. 2005
2.A.7 | 7 (1) DMT | iron | >50 | GmDMT1 in soybean | Kaiser et al. 2003
2.A.5 | 6 ZIP | zinc | >2 | GmZIP1 | Moreau et al. 2002
2.A.72 | 1 | potassium (K+) | over | LjKUP1 in lotus | Desbrosses et al. 2004
3.A.3.2 | 1 | Ca2+-ATPase | >150 | unknown | Andreev et al. 1998, 1999

Total: 195 transporter genes are expressed at least fivefold, and 50 transporter genes are nodule specific.
Benedito, Li et al. Plant Physiology, under review.

Discussion
• Comparison of the two methods
  - The machine learning method is general, but as a black box it is difficult for biologists to check
  - Empirical rules are family-based and easy for biologists to check, but may be biased toward the organisms they were summarized from
• Pitfalls of the system
  - Difficult to distinguish transporters from sensors
  - Sensitive to partial sequences such as ESTs
  - Weak at handling transporter complexes
• Further work
  - Integrate gene expression and sub-cellular localization analysis
  - Integrate phylogenetic analysis: 1) characterize subfamilies or substrates based on SIFTER or TransportDB, and 2) compare annotated transporter families across multiple organisms

Summary
• We present a transporter annotation system that effectively integrates homology-based methods, machine learning methods, and empirical rules
• The system is promising for characterizing eukaryotic transporters with significantly reduced curation effort
• It provides a general framework for integrative decisions, including integration of multiple resources and prior biological knowledge

Acknowledgements
• Patrick Xuechun Zhao
• Vagner Benedito
• Ranamalie Amarasinghe
• Jian Zhao
• Xinbin Dai
• Michael Udvardi
• Carolyn Young
• Rick Dixon