EFI-GNT - Enzyme Function Initiative

EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context
Enzyme Function Initiative (EFI)
Gordon Research Conference on Enzymes, Coenzymes, and Metabolic Pathways
July 15, 2014
What is a Genome Neighborhood Network?
High sequence homology → enzyme function
Low/medium sequence homology + genome context → enzyme function
What is a Genome Neighborhood Network?
Genes << Operon << Regulon
gene products forming a biological pathway
[Diagram: neighboring genes R, A, B, C]
Genome neighborhood information facilitates enzyme function discovery via contextual evidence
What is a Genome Neighborhood Network?
The GNN organizes genome neighborhood information for thousands of query genes in a rapid, high-throughput fashion.
The resulting network allows a user to quickly identify the protein families encoded by the genes in close proximity to the query genes of the SSN dataset.
GNN Generation
SSN Cluster Inventory
• SSN network file parsing
• Singletons excluded
• Clusters assigned number and unique color
Neighbor Annotation Gathering
• European Nucleotide Archive (ENA) is queried with each SSN sequence
• Protein-encoding genes are compared to Pfam
Network Generation
• Network xgmml file written
• Query sequences and neighbor sequences = nodes
• Genome proximity = edge
• Additional annotation information is gathered
The entire process is fast and computationally inexpensive
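The "SSN Cluster Inventory" step above can be sketched as follows. This is a minimal illustration rather than the EFI-GNT implementation: it assumes the SSN is an XGMML file with `node` elements carrying an `id` and `edge` elements carrying `source` and `target`, it numbers clusters from largest to smallest (an assumption), and `PALETTE` is a hypothetical color list.

```python
import xml.etree.ElementTree as ET

# Hypothetical palette; the colors EFI-GNT actually assigns are not stated here.
PALETTE = ["#FF0000", "#0000FF", "#00AA00", "#FF00FF", "#00FFFF"]


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]


def inventory_clusters(xgmml_text):
    """Return numbered, colored SSN clusters; singletons are excluded."""
    root = ET.fromstring(xgmml_text)
    adjacency = {}
    for element in root.iter():
        kind = local_name(element.tag)
        if kind == "node":
            adjacency.setdefault(element.get("id"), set())
        elif kind == "edge":
            source, target = element.get("source"), element.get("target")
            adjacency.setdefault(source, set()).add(target)
            adjacency.setdefault(target, set()).add(source)

    # Connected components by depth-first search.
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            current = stack.pop()
            members.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(members) > 1:  # singletons excluded
            components.append(sorted(members))

    # Assumption: cluster 1 is the largest cluster.
    components.sort(key=lambda members: (-len(members), members))
    return [
        {"number": i, "color": PALETTE[(i - 1) % len(PALETTE)], "members": members}
        for i, members in enumerate(components, start=1)
    ]


EXAMPLE = """<graph label="demo" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="A" label="A"/><node id="B" label="B"/><node id="C" label="C"/>
  <node id="D" label="D"/><node id="E" label="E"/><node id="F" label="F"/>
  <edge source="A" target="B"/><edge source="B" target="C"/>
  <edge source="D" target="E"/>
</graph>"""

clusters = inventory_clusters(EXAMPLE)
```

In the toy network, node F has no edges, so it is dropped as a singleton before numbering.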
GNNs: query families
GNNs: bacterial proteins in gene clusters
GNNs: collect neighbors
GNNs: cluster neighbors
[Figure labels: Query families; Genome neighbors; network for neighbors]
GNNs: deduce function
• Shared context: same pathway, same function
• Unique context: unique pathway, unique function
Example: proline racemase superfamily
< 10^-120
> 60% ID
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275
GNN: “BLAST” network
GNN: Pfam network
Full GNN
Pfam GNN
GNN: pathway “parts”
[Figure labels: DAO; ALDH; DHDPS; LDH/MDH; OCD]
From GNN: complete pathways
[Figure labels: DAO; DHDPS; OCD; ALDH; LDH/MDH]
GNN Format
The GNN visually organizes genome neighborhood
information into multiple hub-and-spoke clusters.
Hub Nodes
Hub node = Pfam family in neighborhood
Node Attribute, Neighbor_Accessions = list of all Pfam members found in the genome context of the SSN, with the following additional information:
• EC number
• PDB code
• PDB-hit
• Swiss-Prot status (reviewed/unreviewed)
Additional Node Attributes:
• Num_neighbors = the number of neighbor sequences belonging to this Pfam family
• pfam = Pfam number, e.g., PF13365
• Pfam description = a short description of the family, e.g., Trypsin-like peptidase domain
PDB-Hit
PDB-hit - a sequence that shares significant homology (e-value < 1e-15) with a protein that has an X-ray crystal structure in the RCSB Protein Data Bank.
The format of this information is “PDB code:e-value”
[Diagram labels: PDB (284k); UniProt (48M); BLASTp; PDB-Hit Database (22M)]
Related structure → homology model for docking
For users who are new to homology modeling, see resources by the Sali lab at the University of California, San Francisco.
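A PDB-hit value in the “PDB code:e-value” format described above could be split and checked against the 1e-15 cutoff roughly as follows. This is a sketch only: the sample value `1ABC:2e-40` is made up, and the helper names are hypothetical, not part of EFI-GNT.

```python
PDB_HIT_CUTOFF = 1e-15  # significance cutoff quoted on the slide


def parse_pdb_hit(value):
    """Split a 'PDB code:e-value' string into (code, e-value as float)."""
    code, separator, evalue = value.partition(":")
    if not separator:
        raise ValueError(f"expected 'PDB code:e-value', got {value!r}")
    return code.strip(), float(evalue)


def is_significant(evalue, cutoff=PDB_HIT_CUTOFF):
    """True when the e-value is below the cutoff."""
    return evalue < cutoff


code, evalue = parse_pdb_hit("1ABC:2e-40")  # hypothetical attribute value
```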
Spoke Nodes
Spoke nodes = single cluster from SSN with ≥1 neighbor in hub
The Node Attributes:
• Cluster Number = number assigned to the SSN cluster
• Query_Accessions = a list of UniProt accessions for the query sequences
• Distance = a list of distances between each query and its neighbors. This is formatted “UniprotID-query:UniprotID-neighbor: ()N”, where query = 0, next gene = 1, etc., and a negative N value indicates an upstream position.
• SSN Cluster Size = the size of the SSN cluster
• Num_neighbors = number of neighbor sequences retrieved by the spoke node
• Num_queries = number of query sequences in the spoke node
• Num_ratio = % co-occurrence as a ratio
• ClusterFraction = % co-occurrence as a fraction, 0-3
Spoke Nodes
Spoke node size is dependent on the % co-occurrence of that Pfam in the neighborhood of that SSN cluster.
% co-occurrence = (# neighbors retrieved / SSN cluster size) × 100
% Co-occurrence | Indicative situation
< 100% | The neighbor gene is not well-conserved and potentially unimportant to the physiological pathway of the query gene.
< 100% | This particular SSN cluster is not isofunctional, containing multiple neighborhood contexts.
≈ 100% | The neighbor gene is a well-conserved member of the genome neighborhood.
> 100% | Two or more instances of neighbors from this particular Pfam family exist in the genome neighborhood.
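The % co-occurrence formula and the table above can be sketched in a few lines. The formula is the one on the slide. The `tolerance` used to decide what counts as "≈ 100%" is an assumption, and the neighbor counts in the example are back-calculated guesses for a 15-member cluster, not values reported by the tool.

```python
def percent_cooccurrence(num_neighbors, ssn_cluster_size):
    """% co-occurrence = (# neighbors retrieved / SSN cluster size) x 100."""
    if ssn_cluster_size <= 0:
        raise ValueError("SSN cluster size must be positive")
    return num_neighbors / ssn_cluster_size * 100


def interpret(percent, tolerance=10.0):
    """Map a % co-occurrence onto the situations in the table.

    `tolerance` (how close to 100% counts as 'about 100%') is an assumption.
    """
    if percent > 100 + tolerance:
        return "two or more neighbors from this Pfam family per neighborhood"
    if percent >= 100 - tolerance:
        return "well-conserved member of the genome neighborhood"
    return "not well-conserved, or SSN cluster not isofunctional"


# Hypothetical neighbor counts for a 15-member SSN cluster.
examples = {count: round(percent_cooccurrence(count, 15)) for count in (1, 10, 14, 16, 18)}
```

With a 15-member cluster, 14 retrieved neighbors gives 93% and 18 gives 120%, which is how a value above 100% arises.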
Pfam and the GNN
[Figure labels: More universal; Highly represented in SSN cluster; Lowly represented in SSN cluster; Unique]
www.pfam.xfam.org
Pfam and the GNN
Identify the general classes of enzymes present in the genome context of an SSN cluster.
E.g., the presence of a kinase Pfam family and an isomerase Pfam family may indicate that the proteins of this particular SSN cluster carry out an aldolase-type reaction for a catabolic pathway.
[Figure labels: Kinase Pfam; Isomerase Pfam]
Neighborhood Size
EFI-GNT default neighborhood size = +10 and -10 genes
Users may lower this to +/- 3 to 9 genes
Zheng et al. 2002, Genome Research 12, 1221
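The neighborhood window can be pictured as a slice around the query in an ordered gene list. The sketch below is an illustration under stated assumptions: `genes` is a hypothetical list ordered along one ENA record, offsets are simply index differences (negative = upstream), and the 3-10 bounds follow the sizes quoted for the tool.

```python
def neighborhood(genes, query_index, size=10):
    """Return (offset, gene) pairs within +/- `size` genes of the query."""
    if not 3 <= size <= 10:
        raise ValueError("neighborhood size must be between 3 and 10")
    start = max(0, query_index - size)                 # clipped at record start
    stop = min(len(genes), query_index + size + 1)     # clipped at record end
    return [
        (index - query_index, genes[index])
        for index in range(start, stop)
        if index != query_index
    ]


genes = [f"gene{i}" for i in range(30)]  # hypothetical record with 30 genes

middle = neighborhood(genes, 15)         # full window on both sides
near_start = neighborhood(genes, 2)      # window clipped at the record start
```

A query near the start or end of a record returns fewer than 2 × size neighbors because the window is clipped.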
GNN Signal-to-Noise: Added Noise
The utility of the GNN is limited primarily by its signal-to-noise ratio
Signal = proximal and functionally related genes
Noise = proximal and irrelevant genes
Source of noise | Remedy
Distant genes | Decrease neighborhood size
Uncommonly co-occurring genes | Increase co-occurrence threshold
SSN over-fractionation | Return SSN to less stringent e-value
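Raising the co-occurrence threshold amounts to dropping Pfam families whose % co-occurrence falls below a cutoff. The sketch below assumes the cutoff is inclusive and uses hypothetical neighbor counts for a 15-member cluster; neither detail is confirmed by the slides.

```python
def filter_by_cooccurrence(neighbor_counts, cluster_size, cutoff):
    """Keep Pfam families whose % co-occurrence is at least `cutoff` (1-100)."""
    if not 1 <= cutoff <= 100:
        raise ValueError("cutoff must be between 1 and 100")
    kept = {}
    for pfam, count in neighbor_counts.items():
        percent = count / cluster_size * 100
        if percent >= cutoff:  # assumption: the cutoff is inclusive
            kept[pfam] = round(percent)
    return kept


# Hypothetical counts of neighbors per Pfam family for one SSN cluster.
counts = {"PF00480": 14, "PF05448": 1, "PF04074": 10}
kept = filter_by_cooccurrence(counts, cluster_size=15, cutoff=20)
```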
GNN Signal-to-Noise: Lost Signal
Why did my query sequence return fewer than 20 neighbors?
• Query sequence does not match the ENA sub-databases
• Non-coding RNA
• Query sequence is located near the beginning or end of the ENA file
• The neighbor entry does not have an associated EMBL accession number
• The neighbor entry has not been incorporated into a current Pfam family
EFI-GNT Web tool
www.enzymefunction.org
EFI-GNT Input
www.efi.igb.illinois.edu/efi-gnt
1. Upload xgmml network, full or rep-node
2. Pick neighborhood size: +/- 3-10 genes
3. Enter co-occurrence cutoff (1-100)
4. Enter email address
5. Hit “go”
Upload status bar
EFI-GNT Output
The EFI-GNT output is a pair of .xgmml files:
• Genome neighborhood network (GNN)
• Colored version of the original SSN
EFI-GNT Output
A download link will be sent to the e-mail address provided.
Data stored on server for 7 days.
EFI-GNT Output
NOTE: depending on your browser, the files may download with an additional file extension, such as .xgmml.txt or .xgmml.xml.
You must delete the .txt or .xml extension in order to open these files in Cytoscape!
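If many files are affected, the extra extension can be stripped with a short script instead of by hand. This is a convenience sketch, not part of EFI-GNT; the function names are hypothetical, and on most systems `Path.rename` will overwrite an existing file with the target name, so check the folder first.

```python
from pathlib import Path

EXTRA_SUFFIXES = {".txt", ".xml"}  # extensions some browsers append


def cytoscape_name(filename):
    """Return the file name with a trailing .txt/.xml removed after .xgmml."""
    path = Path(filename)
    if path.suffix.lower() in EXTRA_SUFFIXES and path.stem.lower().endswith(".xgmml"):
        return path.stem
    return path.name


def fix_download(path):
    """Rename a downloaded file on disk so that it ends in .xgmml."""
    path = Path(path)
    target = path.with_name(cytoscape_name(path.name))
    if target != path:
        path.rename(target)
    return target
```

For example, `cytoscape_name("gnn.xgmml.txt")` gives `gnn.xgmml`, while files that do not match the pattern are left unchanged.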
Cytoscape opens .xgmml
Network Visualization
Version 3.1.0
GNN files must be viewed in Cytoscape 3.0 (or more recent)
Best layouts: Organic or Prefuse Force Directed
Opening both the GNN and colored SSN in a single instance of Cytoscape
allows fast comparison between the two networks (see above).
www.cytoscape.org
Network Visualization
NOTE: in Cytoscape, the automatic rendering and coloring of the colorized SSN is size-dependent. Cytoscape settings include a “Threshold View” that needs to be adjusted in the following manner in order to automatically view your colored SSN:
• In any version 3.X, go to Edit -> Preferences -> Properties
• With “cytoscape 3” selected in the pull-down menu at the top, scroll to the bottom of the Property list and select “viewThreshold”
• Click “Modify” and append five zeros to the end of the displayed number
• Click “OK”
• Restart Cytoscape (this should only need to be done once per version of Cytoscape installed on your machine)
Network Manipulation
Generally, the full +/-10 neighbor GNN presents an overwhelming amount of information.
Filter GNN networks by SSN Cluster Number in order to assign enzyme function to subgroups of homologous sequences.
Network Manipulation
Only hubs connected to the designated SSN cluster are shown (e.g., the cyan cluster 5).
Analyze the genome neighborhood Pfams specific to this SSN-cluster.
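Outside Cytoscape, the same filtering idea can be expressed on a simple edge list. The sketch below is illustrative only: the GNN is represented as hypothetical (hub Pfam, spoke attributes) pairs, and only the attribute name “Cluster Number” is taken from the spoke-node description.

```python
def hubs_for_cluster(edges, cluster_number):
    """Return the sorted Pfam hubs connected to spokes of one SSN cluster."""
    return sorted(
        {hub for hub, spoke in edges if spoke["Cluster Number"] == cluster_number}
    )


# Hypothetical hub-and-spoke edges; each spoke belongs to one SSN cluster.
edges = [
    ("PF00480", {"Cluster Number": 5}),
    ("PF00701", {"Cluster Number": 5}),
    ("PF00701", {"Cluster Number": 7}),
    ("PF04131", {"Cluster Number": 9}),
]

cluster_5_hubs = hubs_for_cluster(edges, 5)
```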
Network Manipulation
Spoke length is arbitrary.
Click, drag, and drop overlapping spoke nodes until all are visible.
Tutorial Pages
Tutorial pages containing content similar to this presentation
Test Case:
Predicted Novelties of the Sialic Acid
Degradation Pathway
Protein SSN
Bacterial extracellular solute-binding protein family 1 (SBP_bac_1, PF01547)
100% rep node net
BLAST E-value 10^-80
40% identical
21833 sequences
11073 nodes
Cluster 164
15 members
EFI ID 510644
ThermoFluor hit on N-acetylneuraminate
J. Bouvier, UIUC
Genome Neighborhood Network for Cluster 164
[Figure labels: Permease; ABC transporter; Regulator; Epimerase; Kinase; DHDPS; DUF]
EFI ID 510644 gene neighborhood
Streptococcus uberis Diernhofer (strain 0140J, ATCC BAA-854)
[Gene map labels: +6, +5, +4, +3, +2, +1, query, -1, -2, -3, -4]
Position | Pfam Family ID | Pfam Description | Predicted Role | % Occurrence
+6 | Unassigned | None | none | unavailable
+5 | PF01380 PF01418 | SIS HTH_6 | transcription regulator | 93
+4 | PF05448 | Acetyl xylan esterase | deacetylase | 7
+3 | PF00480 | ROK | kinase | 93
+2 | PF00701 | DHDPS | lyase | 93
+1 | PF04074 | DUF386 | isomerase/deaminase | 67
query | PF01547 | SBP_bac_1 | solute-binding | 120
-1 | PF00528 | BPD_transp_1 | permease | 120
-2 | PF00528 | BPD_transp_1 | permease | 120
-3 | PF04131 | NanE | epimerase | 107
-4 | PF00468 | Ribosomal_L34 | ribosome subunit | 67
N-acetylneuraminate degradation pathway
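[Pathway diagram, enzymes labeled by Pfam family ID]
1. PF00701: N-acetylneuraminate → N-acetyl-D-mannosamine + pyruvate
2. PF00480: N-acetyl-D-mannosamine + ATP → N-acetyl-D-mannosamine 6-phosphate + ADP + H+
3. PF04131: N-acetyl-D-mannosamine 6-phosphate → N-acetyl-D-glucosamine 6-phosphate
4. PF01979: N-acetyl-D-glucosamine 6-phosphate + H2O → D-glucosamine 6-phosphate + acetate
5. PF01182: D-glucosamine 6-phosphate + H2O → β-D-fructofuranose 6-phosphate + NH4+ (enters glycolysis)
Legend: Found in GNN; Found alternative Pfam; Orphan EC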
Three sources of unknown enzymes
1. Orphan enzyme activity (EC number with no enzyme) - in vivo evidence suggests that an enzyme from PF04131 converts N-acetyl-D-mannosamine 6-phosphate to N-acetyl-D-glucosamine 6-phosphate in the third step of the pathway, but no biochemical work has been done on this putative epimerase.
2. Non-orthologous gene replacement - The deacetylase from PF01979, known to convert N-acetyl-D-glucosamine 6-phosphate to D-glucosamine 6-phosphate in the fourth step of this pathway, is located elsewhere in the genome (locus tag Sub1443). However, Sub1651, which is located four genes downstream, is a member of PF05448, and other members of PF05448 have known deacetylase activity. Is this a non-orthologous gene replacement, and does its low occurrence (7%) in the neighborhoods of the queries suggest that it is a relic?
3. Domain of unknown function - The deaminase/isomerase from PF01182, known to convert α-D-glucosamine 6-phosphate to β-D-fructofuranose 6-phosphate in the fifth step of the pathway, is located elsewhere in the genome (locus tag Sub1239). However, Sub1654, which is located one gene downstream, has been suggested to be a sugar isomerase. Sub1654 is a member of PF04074 (DUF386) and is a good candidate for docking.
Hands-on Portion of Workshop
Feel free now to download Cytoscape 3.1, run EFI-EST, and run EFI-GNT for your
protein (family) of interest.
Please see posters by Katie Whalen (#55) and Daniel Wichelecki (#56) for further
examples of EFI-EST/EFI-GNT use.
Tutorials for using Cytoscape: http://enzymefunction.org/resources/tutorials/efiand-cytoscape3
Feel free to contact us throughout the conference with questions/comments.
Acknowledgements
GNN Development: Suwen Zhao (UCSF), Alan Barber (Pythoscape, UCSF), Shoshana Brown (Pythoscape, UCSF), Eyal Akiva (Pythoscape, UCSF), Jason Bouvier (UIUC)
Website Build: Daniel Davidson (UIUC), David Slater (UIUC)
Principal Investigators: Matthew Jacobson (UCSF), P Babbitt (Pythoscape, UCSF), John Gerlt (UIUC)
Documentation: Katie Whalen (UIUC)