Prokaryotic Annotation at TIGR

Michelle Gwinn Giglio
June, 2005
Prokaryotic Annotation at TIGR
• we work in a high-throughput environment
• our team of 7 annotators finishes 1500-2000
genes per month
• there is a constant backlog of genomes
waiting for manual annotation
• most of our genomes have few, or no,
experimentally characterized proteins
• we rely heavily on sequence similarity
methods to determine the functions of
proteins in our genomes
• nearly all of our projects receive complete
manual annotation prior to publication (only a
very few have been released with automatic annotation)
GO Annotation at TIGR
• our manual annotation process is the same
whether we add GO terms to our proteins or
not
• using GO to categorize our proteins allows us
to capture information that we have
discovered in the manual annotation process
that would otherwise be lost
• GO offers a system for the unambiguous
communication of annotation information in a
format amenable to computer searching and
easy exchange.
Some History
• TIGR always recognized the importance of grouping genes according to the functions and processes in which they were involved.
• With the first prokaryotic genome published, H. influenzae, we adapted Monica Riley’s E. coli role categories.
• We have continued to modify that role scheme and still assign TIGR roles today.
• In 1998 we recognized that it would be really useful to have a set of role categories that could be used by all species, and we started a project in that direction.
• Also in 1998, I met Michael Ashburner and Suzi Lewis and learned of their efforts with GO; we decided to stop our project and wait to use GO.
• During 2000-2001, TIGR’s genome V. cholerae was annotated to GO.
• In 2002 TIGR joined the GO consortium.
• Currently, TIGR has 11 prokaryotic genomes deposited with the GO repository (and many more with manual GO annotation, waiting internally at TIGR for publication).
Adding GO Annotation to our system….
• required us all to learn the GO system, its
rules, data formats, etc.
• required significant changes to our tools and
databases for the visualization and storage of
GO data
• took time; however, there are vastly more
resources available today than there were 5
years ago when we were making the shift,
when GO was still quite young
The Goal of the Annotation Process
• determine the function of the protein if
possible
• assign annotation to the protein: common
name, gene symbol, EC number, TIGR role,
GO terms, comments as needed
• store evidence for the annotation (something
we always did)
• annotation should only be as specific as the
evidence supports; err on the side of
undercalling rather than overcalling
How do we determine the functions of the proteins?
• The best thing is to do an experiment on the protein - not really
possible for us to do
• shared sequence implies shared function
– we are well aware of cases where one amino acid change results in change
of function
– all of our functional assignments must be considered putative until
experimentally confirmed
• collect and evaluate information from many sequence based
search and prediction tools
– BER (BLAST-extend-repraze)
– HMM (Hidden Markov Model)
– TMHMM (Transmembrane HMM)
– SignalP (Signal peptide)
– PROSITE
– InterPro
– Paralogous families
– Genome Properties
• The ISS annotations in TIGR data are sequence evaluations
performed by us, not from authors in the literature
• We use our annotation tool Manatee to view information and
make annotations; you will see screen shots in the following slides
The Manatee Gene Curation Page
BER searches
• TIGR’s pairwise alignment tool
• initial BLAST to collect proteins with any similarity to the search
protein
• modified Smith-Waterman alignment generated between search
protein and each BLAST result
• result is a file containing one pairwise alignment for each match
protein from the BLAST
• view alignments in our Manatee annotation tool
• we do the 2-step process because BLAST is fast and Smith-Waterman is slow, so it saves CPU time to do the Smith-Waterman alignments only on things that have any hope of matching
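The two-step idea above can be sketched in a few lines. This is only an illustration, not TIGR's BER code: a shared k-mer screen stands in for BLAST, the alignment is a plain Smith-Waterman score (not the modified version the slides mention), and the function names, scoring values and thresholds are all assumptions made for the example.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query, database, k=3, min_shared=2):
    """Fast screen (stand-in for BLAST): keep database entries that
    share at least min_shared k-mers with the query."""
    query_kmers = kmers(query, k)
    return {name: seq for name, seq in database.items()
            if len(query_kmers & kmers(seq, k)) >= min_shared}


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty.
    Runs in O(len(a) * len(b)) time, which is why it is the slow step."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def two_step_search(query, database):
    """Run the slow alignment only on entries that pass the fast screen."""
    candidates = prefilter(query, database)
    return {name: smith_waterman_score(query, seq)
            for name, seq in candidates.items()}
```

The expensive alignment only ever sees the candidates that survive the screen, which is where the CPU saving comes from.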
BER in pictures
[Diagram: genome.pep + niaa (non-identical amino acid) -> BLAST -> significant hits put into mini-dbs for each protein (mini-db for protein #1, #2, #3, ..., #3000) -> modified Smith-Waterman alignment -> file of pairwise alignments]
BER alignment from Manatee
Are all matches with equal alignment
quality of equal value to annotation?
• NO!
• we want to see matches of our genome proteins to
proteins from other species which have been
experimentally characterized in that other species
• only such “characterized matches” can be used as
evidence for functional annotation
• to help in our annotation process we have created a
database storing accessions of proteins known to be
experimentally characterized (does not contain all
such proteins, but we add to it constantly)
• our tools highlight experimentally characterized
proteins to help annotators see them
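A minimal sketch of the highlighting step, assuming the characterized-protein database can be read as a set of accession strings. The function name and the accessions used with it are hypothetical.

```python
def split_by_characterization(hit_accessions, characterized_accessions):
    """Partition BER hit accessions into (characterized, other),
    keeping the original hit order inside each list."""
    characterized = [acc for acc in hit_accessions
                     if acc in characterized_accessions]
    other = [acc for acc in hit_accessions
             if acc not in characterized_accessions]
    return characterized, other
```

Only the first list would count as evidence for functional annotation; the second is still shown to the annotator but cannot support a function call on its own.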
BER skim from Manatee
HMMs
• Hidden Markov Model
• statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and functional similarity
• at TIGR, each HMM is assigned to a category (called “isology type”) which describes the type of relationship the proteins in the model have to each other
– equivalog
– superfamily
– subfamily
– domain
• one can search proteins against HMMs; they receive a score indicating how well they match the model
• by comparing this score to the cutoff scores assigned to each model, one can determine whether or not the search protein is a member of the group defined by the HMM
– “trusted cutoff” - proteins scoring above this score are considered a member of the group defined by the HMM
– “noise cutoff” - proteins scoring below this score are considered NOT to be a member of the group defined by the HMM
– for proteins scoring between trusted and noise, the HMM evidence is not sufficient to determine whether the protein is a member of the functional group or not
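The cutoff logic can be written as a three-way decision. This is a sketch only: the function name is made up, and treating a score exactly equal to a cutoff as undetermined is an assumption, since the slide only says "above" and "below".

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Compare an HMM score with the model's two cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff must not exceed trusted cutoff")
    if score > trusted_cutoff:
        return "member"
    if score < noise_cutoff:
        return "not a member"
    # Between the cutoffs (ties included, by assumption): HMM evidence
    # alone is not sufficient to decide.
    return "undetermined"
```

With hypothetical cutoffs of 100 (trusted) and 50 (noise), a score of 120 is a member, 30 is not, and 75 is left for other evidence to settle.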
Annotation is attached to HMMs
• TIGR00433
– isology: equivalog
– name: biotin synthase
– EC: 2.8.1.6
– gene symbol: bioB
– TIGR role: 77 (Biotin biosynthesis)
– GO terms: GO:0004076 (biotin synthase activity), GO:0009102 (biotin biosynthesis)
• PF04055
– isology: domain
– name: radical SAM domain protein
– EC: not applicable
– gene symbol: not applicable
– TIGR role: 703 (enzymes of unknown specificity)
– GO terms: GO:0003824 (catalytic activity), GO:0008152 (metabolism)
HMM section from Manatee
Things to ask yourself when using HMMs
• Does my protein score above the trusted cutoff?
• What isology type is the HMM?
• What annotation on the HMM can I use for my
protein?
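The three questions can be strung together as a lookup. The sketch below stores the two HMM records shown earlier as dictionaries and hands back only the fields that apply, and only when the score is above the trusted cutoff. The record layout, the function name and the cutoff values passed in are assumptions for illustration, not Manatee's schema.

```python
# Annotation attached to two HMMs, copied from the slide above.
# None marks a field the slide lists as "not applicable".
HMM_ANNOTATION = {
    "TIGR00433": {
        "isology": "equivalog",
        "name": "biotin synthase",
        "ec": "2.8.1.6",
        "gene_symbol": "bioB",
        "tigr_role": 77,
        "go_terms": ["GO:0004076", "GO:0009102"],
    },
    "PF04055": {
        "isology": "domain",
        "name": "radical SAM domain protein",
        "ec": None,
        "gene_symbol": None,
        "tigr_role": 703,
        "go_terms": ["GO:0003824", "GO:0008152"],
    },
}


def usable_annotation(hmm_id, score, trusted_cutoff):
    """Return the annotation fields the HMM offers for this protein,
    or None if the score is not above the trusted cutoff."""
    if score <= trusted_cutoff:
        return None
    record = HMM_ANNOTATION[hmm_id]
    # Drop fields that are not applicable for this HMM.
    return {field: value for field, value in record.items()
            if value is not None}
```

An equivalog hit carries a specific name, gene symbol and EC number, while the domain hit only supports the general name and high-level GO terms, which matches the rule of annotating no more specifically than the evidence allows.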
Genome Properties
• Used to get “the big picture” of an organism. Specifically to record and/or predict the presence/absence of:
– metabolic pathways
  • biotin biosynthesis
– cellular structures
  • outer membrane
– traits
  • anaerobic vs. aerobic
  • optimal growth temperature
• A particular property has a given “state” in each organism, for example:
– YES - the property is definitely present
– NO - the property is definitely not present
– Some evidence - the property may be present and more investigation is required to make a determination
• The state of some properties can be determined computationally
– metabolic pathway
  • the property is defined by several reaction steps which are modeled by HMMs
  • HMM matches to steps in pathway indicate that the organism has the property
• The states of other properties must be entered manually (growth temp, anaerobic/aerobic, etc.)
• Data for a particular genome viewable in Manatee
– links from HMM section on the Gene Curation Page
– links from gene list for role category
– entire list of properties and states can be viewed
• Searchable across genomes on the Comprehensive Microbial Resource (CMR) site
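For a pathway property, the computational call might look like the sketch below. The decision rule is an assumption made for illustration (all steps matched gives YES, some steps gives Some evidence, none gives NO); the slides do not state the exact rule, and a real NO would likely need more than a lack of HMM hits. Step names and any accession other than TIGR00433 are hypothetical.

```python
def property_state(step_to_hmm, hmm_hits):
    """Derive a property state from HMM matches to pathway steps.

    step_to_hmm: dict mapping each reaction step to the HMM modeling it
    hmm_hits:    set of HMM accessions matched above trusted cutoff
                 somewhere in the genome
    """
    if not step_to_hmm:
        raise ValueError("a property needs at least one step")
    found = [step for step, hmm in step_to_hmm.items() if hmm in hmm_hits]
    if len(found) == len(step_to_hmm):
        return "YES"
    if found:
        return "Some evidence"
    return "NO"  # simplifying assumption, see the note above
```

For a hypothetical two-step property whose steps map to TIGR00433 and a second made-up accession, a genome matching only TIGR00433 comes out as "Some evidence".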
Genome Property Report page from Manatee
Goals
• assign annotation to each protein
– name, gene symbol, EC number, TIGR
role, GO terms
• confirm coordinates of gene
• avoid transitive annotation
AutoAnnotate
• computationally gives preliminary annotation
to each protein
• adds GO terms with IEA
– from HMM match
– from BER match
• AutoAnnotate designed for a system in which
all annotations are manually reviewed
• If automatic annotation was the endpoint for
our projects, we would have to change
AutoAnnotate to be more strict and
conservative in its decisions
Knowledge about function reflected
in specificity of protein names
• high confidence – “adenylosuccinate lyase”, purB, 4.3.2.2
• general function, lacks specificity
– “carbohydrate kinase, FGGY family”
– no gene symbol, partial EC number
• family designation
– “Cbby family protein”
• homolog designation
– “recA homolog”
• hypotheticals
– “hypothetical protein”
– “conserved hypothetical protein”
• “putative recA”
– used sparingly
Knowledge about function reflected in specificity of GO terms
available evidence for 3 genes
#1
- HMM for ribokinase
- match to an experimentally characterized ribokinase
#2
- HMM for kinase
- match to experimentally characterized glucokinase and fructokinase
#3
- HMM for kinase
Sample GO trees
Function
  catalytic activity
    kinase activity
      carbohydrate kinase activity
        ribokinase activity
        glucokinase activity
        fructokinase activity
Process
  metabolism
    carbohydrate metabolism
      monosaccharide metabolism
        hexose metabolism
          glucose metabolism
          fructose metabolism
        pentose metabolism
          ribose metabolism
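One plausible reading of this slide is that each gene gets the most specific term that all of its evidence agrees on. The sketch below encodes the sample function tree as child-to-parent links (the nesting is assumed from the term names), discards evidence terms that are already implied by a more specific one, and then takes the deepest shared ancestor of what remains. It is an illustration of that reading, not TIGR's procedure.

```python
# Child -> parent links for the sample function tree (nesting assumed).
PARENT = {
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}


def ancestors(term):
    """Return the term followed by its ancestors, up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def most_specific_supported(evidence_terms):
    """Deepest term consistent with every piece of evidence.
    Returns None if the evidence terms share no ancestor."""
    if not evidence_terms:
        raise ValueError("need at least one evidence term")
    paths = {term: ancestors(term) for term in evidence_terms}
    # Drop general terms that are strict ancestors of other evidence.
    specific = [t for t in paths
                if not any(t in paths[other][1:] for other in paths)]
    common = set(paths[specific[0]]).intersection(
        *(paths[t] for t in specific[1:]))
    # Walk upward from one leaf; the first shared term is the deepest.
    for term in paths[specific[0]]:
        if term in common:
            return term
    return None
```

Under these assumptions gene #1 resolves to ribokinase activity, gene #2 to carbohydrate kinase activity, and gene #3 to kinase activity.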
translation disruptions
• authentic frameshift
• authentic point mutation
• degenerate
• truncation
• deletion
• insertion
• interruption
• fusion
• fragment
[Slide labels, placement not recoverable: Get GO terms; No GO terms; TIGR role 270 “disrupted reading frame”]
Assigning GO terms
• Once we have found out all that we can about
a protein, we assign GO terms to describe the
protein
• things that facilitate finding a term
– fast/easy ontology search tools
– tools that make term suggestions
– tools that format the evidence for you
– tools that reduce copy/paste/typing as much as possible
Tools that suggest terms
• Mapping files
– ec2go
– tigrfams2go
– interpro2go
• Manatee suggestions
– Matches to V. cholerae, B. anthracis
– Genome Properties
– HMMs
• Automated assignments
– From HMMs and good pairwise matches
– Viewed as suggestions, not final annotation
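A mapping file such as ec2go can be turned into a suggestion table with a small parser. The sketch assumes each data line has the layout `EC:<number> > GO:<term name> ; GO:<id>` and that lines starting with `!` are comments; check the actual file before relying on this. The sample text was written for this example from the biotin synthase record shown earlier.

```python
import re
from collections import defaultdict

# Assumed line layout: "EC:2.8.1.6 > GO:biotin synthase activity ; GO:0004076"
MAPPING_LINE = re.compile(
    r"^EC:(?P<ec>\S+)\s+>\s+GO:(?P<name>.+?)\s+;\s+(?P<go_id>GO:\d{7})\s*$")


def parse_ec2go(text):
    """Return {EC number: [(GO id, GO term name), ...]}."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment lines
        match = MAPPING_LINE.match(line)
        if match:
            mapping[match.group("ec")].append(
                (match.group("go_id"), match.group("name")))
    return dict(mapping)


SAMPLE = """\
! sample written for this sketch, not copied from a released file
EC:2.8.1.6 > GO:biotin synthase activity ; GO:0004076
"""
```

Looking up an annotator's EC number in the parsed table yields candidate GO terms, which are treated as suggestions rather than final annotation.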
Our Manatee Tool
• Prevents assignment of GO terms that are nonexistent or obsolete
• Knows the correct format for the evidence fields
– Allows addition of terms and evidence with one click
– Uses correct abbreviations
• Rarely a need to copy and paste
• In many cases the term you need is suggested on the
page somewhere already
Clicking on the various GO
suggestions around the Manatee
Gene Curation page puts the correct
info into the correct fields in the
correct format without the need to
copy and paste.
Searching for terms in Manatee
• Searches of ontologies
– go_id search (returns tree, term info)
– GO term keyword search (searches synonyms too)
• Searches of annotations
– Protein name keyword search
– go_id search (returns lists of proteins assigned that term)
– Correlations (input a go_id and receive a list of terms
assigned in conjunction with input term and the percent of
occurrence of each correlation)
• EC number search (input EC #, return go_id)
• GO BLAST page (searches all proteins annotated to
GO)
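The correlation search can be sketched as a co-occurrence count. Here the "percent of occurrence" is assumed to mean the share of proteins carrying the input term that also carry the other term; the function name and data layout are hypothetical.

```python
from collections import Counter


def go_correlations(annotations, go_id):
    """annotations: dict mapping protein id -> set of GO ids.
    Returns {other GO id: percent of proteins annotated with go_id
    that are also annotated with the other id}."""
    with_term = [terms for terms in annotations.values() if go_id in terms]
    if not with_term:
        return {}
    counts = Counter(term for terms in with_term
                     for term in terms if term != go_id)
    return {term: 100.0 * n / len(with_term) for term, n in counts.items()}
```

Terms that co-occur at a high percentage are natural suggestions once the first term has been assigned.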
GO search tools in Manatee
Keeping up with GO content
• TIGR downloads the newest version of the
ontologies nightly into our db for use by our
tools
• Periodically we check our annotations for the
presence of obsolete or secondary ids and
we send updates
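The periodic check can be sketched as a scan of stored annotations against the current ontology. The dictionary layout (an `obsolete` flag and a list of `alt_ids` per primary id) is an assumption loosely modeled on what ontology files record, and the function name is hypothetical.

```python
def audit_annotations(annotations, ontology):
    """Find annotations that use obsolete, secondary or unknown GO ids.

    annotations: list of (protein id, GO id) pairs
    ontology:    dict mapping primary GO id ->
                 {"obsolete": bool, "alt_ids": [secondary ids]}
    Returns a list of (protein id, GO id, problem, suggested id).
    """
    secondary_to_primary = {
        alt: primary
        for primary, record in ontology.items()
        for alt in record.get("alt_ids", [])
    }
    problems = []
    for protein, go_id in annotations:
        if go_id in secondary_to_primary:
            problems.append(
                (protein, go_id, "secondary", secondary_to_primary[go_id]))
        elif go_id not in ontology:
            problems.append((protein, go_id, "unknown", None))
        elif ontology[go_id].get("obsolete", False):
            problems.append((protein, go_id, "obsolete", None))
    return problems
```

Secondary ids can be swapped for their primary id automatically, while obsolete terms need an annotator to choose a replacement.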
Changing GO content
• TIGR has been contributing requests for ontology
content changes continuously since we joined GO
(close to 200 submissions)
• The SourceForge submission system works very well
• Most requests are handled within a few days; some
more complicated ones may take a few weeks, and the
rare really complicated one may take a few months
(again, that’s very rare, see PAMGO example)
• Initially there were some aspects of the ontologies
that were incorrect for proks (e.g., ATP synthase); these
have been fixed as they were discovered.
PAMGO effort
Future directions
• Develop prok GO slim
– Use it where we now use TIGR roles
– Cease use of TIGR roles
• Add more functionality to CMR GO tools
– More refined searches
– Search across all TIGR GO data, not just prokaryotic
• Use accumulated prok GO data more effectively to
predict annotations for new proteins
More about Manatee
• TIGR’s main manual annotation tool
• web based
• displays all known information about a protein
• interface for entry of annotation information into the database
• open source, freely available on SourceForge for downloading and local use (manatee.sourceforge.net)
• TIGR offers a hands-on 3-day annotation course, 4 times per year, which details our annotation process, the use of Manatee, installation of Manatee, and the use of the CMR
• Taught by Michelle Gwinn Giglio, Tanja Davidson, and Todd Creasy
• Next classes: June 28-30, Aug. 23-25
The Manatee Gene Curation Page
Annotation Engine
• clients send us a DNA sequence
• we run our entire pipeline up to the point where manual
curation starts
• we return a MySQL database and associated files with all
of the data so the client can do manual annotation of the
genome
• the client can install Manatee locally and run it using the
MySQL database
• the data is kept completely confidential if that is the desire
of the client
• this service allows researchers access to TIGR’s
infrastructure and tools, saving the need to expend the
time and expense (which they might not have) to create
infrastructure of their own
It’s all a team effort
• Owen White
• Prokaryotic annotation team
– Bill Nelson, Bob Dodson, Scott Durkin, Sean
Daugherty, Ramana Madupu, Lauren Brinkac,
Steven Sullivan, Sagar Kothari
• Todd Creasy, Tanja Davidson
• Eukaryotic annotation team
• All of our tool developers
• GO group (last but definitely not least)