13. Finding the genes in microbial genomes

Advancing Science with DNA Sequence

Finding the genes in microbial genomes

Natalia Ivanova

MGM Workshop January 31, 2012

Outline

1. Introduction
2. Tools out there
3. Basic principles behind tools and known problems
4. Metagenomes

Finding the genes in microbial genomes

Sequence features in prokaryotic genomes (well-annotated bacterial genome in Artemis genome viewer):

• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)

Outline

1. Introduction
2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
3. Known problems
4. Metagenomes

Publicly available genome annotation services

• IMG-ER http://img.jgi.doe.gov/
• RAST http://rast.nmpdr.org/
• JCVI Annotation Service http://www.jcvi.org/cms/research/projects/annotation service/
• RefSeq http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

What they provide and how they do it - RNAs

• Large structural RNAs (23S and 16S rRNAs)
  BLASTn
  RNAmmer http://www.cbs.dtu.dk/services/RNAmmer/
• Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
  Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
  tRNAScan-SE http://lowelab.ucsc.edu/tRNAscan-SE/

What they provide and how they do it - protein-coding genes (CDSs, not ORFs!)

Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)

Open reading frame (ORF): reading frame between a start and stop codon
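The two definitions above can be made concrete with a small sketch. It is illustrative only: it splits each of the six frames into codons rather than translating them, it uses the bacterial start and stop codons listed later in this talk, and the function names (`reverse_complement`, `reading_frames`, `find_orfs`) are made up for this example, not taken from any gene finder.

```python
# Sketch: enumerate the six reading frames and the ORFs in each of them.
STARTS = {"ATG", "GTG", "TTG"}  # start codons listed in this talk
STOPS = {"TAG", "TAA", "TGA"}   # stop codons listed in this talk


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def reading_frames(seq):
    """Yield (strand, offset, codons) for offsets 0, 1, 2 on both strands."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons


def find_orfs(seq, min_codons=2):
    """Return (strand, offset, orf) tuples; each ORF runs from the first
    start codon after the previous stop up to and including the next stop."""
    orfs = []
    for strand, offset, codons in reading_frames(seq):
        start = None
        for i, codon in enumerate(codons):
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if i - start + 1 >= min_codons:
                    orfs.append((strand, offset, "".join(codons[start:i + 1])))
                start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

Every hit here is only an ORF. Deciding which ORFs are actual CDSs is the job of the gene finders described next.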

Gene finders: ab initio tools; evidence-based refinement

Ab initio tools used by the pipelines:

• Glimmer family (Glimmer2, Glimmer3, RBS finder) -> NCBI, RAST, JCVI
  http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS) -> NCBI
  http://exon.gatech.edu/GeneMark/
• PRODIGAL -> IMG-ER, NCBI
  http://compbio.ornl.gov/prodigal/

Evidence-based refinement: mostly undocumented in-house developed tools.

Types of corrections: missed genes (RAST, JCVI, NCBI), frameshifts (JCVI, NCBI), start sites (RAST)

Outline

1. Introduction
2. Tools out there
3. Basic principles behind tools
4. Metagenomes

What is an ab initio gene finder?

Two major approaches to prediction of protein-coding genes:

ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
Advantages: finds “unique” genes; high sensitivity; very fast!
Limitations: often misses “unusual” genes; high rate of false positives

“evidence-based” (ORFs with translations homologous to the known proteins are CDSs)
Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!

How ab initio tools work - very briefly

[Figure: prokaryotic gene model. Ribosome binding site, followed by an open reading frame from a start codon (ATG, GTG, TTG) to a stop codon (TAG, TAA, TGA)]

• Prokaryotic gene model used by all ab initio gene finders: ribosome-binding site within a certain distance of the start codon; one of 3 start codons; one of 3 stop codons; no frame interruptions
• Statistical model of coding and non-coding regions (codon or dicodon frequencies, hidden Markov models of different lengths)
• Statistical model architecture
• Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.)
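To give a feel for what a "statistical model of coding and non-coding regions" means, here is a deliberately simplified sketch based on codon frequencies. It is not how Glimmer, GeneMark or Prodigal are implemented. It assumes a small training set of known coding sequences and a set of background (non-coding) sequences, and the names `codon_frequencies` and `coding_score` as well as the pseudocount are illustrative choices.

```python
# Sketch: log-odds score of an ORF under a codon-frequency model.
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate in-frame codon frequencies from a list of DNA strings.
    The pseudocount keeps unseen codons from getting zero probability."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(orf, coding_freq, background_freq):
    """Sum of log(P(codon | coding) / P(codon | background)) over the ORF.
    Positive scores mean the ORF looks more like the coding training set."""
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in coding_freq:  # skip codons with ambiguous bases
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score


if __name__ == "__main__":
    coding = codon_frequencies(["ATGGCTGCTGCTTAA"])      # toy "known genes"
    background = codon_frequencies(["ATATATATATATATA"])  # toy "non-coding"
    print(coding_score("GCTGCT", coding, background))
    print(coding_score("ATATAT", coding, background))
```

Real tools replace the single-codon table with dicodon statistics or Markov models of varying order, but the idea of scoring an ORF by how "coding-like" its composition is stays the same.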

Known problems of all annotation pipelines

RNAs

Incomplete rRNAs

Trans-spliced tRNA in archaeal genomes

Small structural RNAs not predicted at all

Genome                              Sequencing center   16S rRNA, nt
Synechococcus sp. CC9311            UCSD, TIGR          1477
Synechococcus sp. CC9605            JGI                 1440
Synechococcus elongatus PCC 7942    JGI                 1490
Synechococcus sp. JA-2-3BA(2-13)    TIGR                1323
Synechococcus sp. JA-3-3Ab          TIGR                1324
Synechococcus sp. RCC307            Genoscope           1498
Synechococcus sp. WH7803            Genoscope           1497, 1464
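One simple way to spot incomplete rRNA predictions like these is to compare each predicted length with those from related genomes. The sketch below uses the 16S lengths from the table. The rule it applies (flag anything shorter than 95% of the median) is an assumption made for illustration, not a threshold used by any of the pipelines, and `flag_short` is a hypothetical helper.

```python
# Sketch: flag 16S rRNA predictions that are short relative to close relatives.
from statistics import median

# (genome, predicted 16S rRNA length in nt), copied from the table above
SIXTEEN_S = [
    ("Synechococcus sp. CC9311", 1477),
    ("Synechococcus sp. CC9605", 1440),
    ("Synechococcus elongatus PCC 7942", 1490),
    ("Synechococcus sp. JA-2-3BA(2-13)", 1323),
    ("Synechococcus sp. JA-3-3Ab", 1324),
    ("Synechococcus sp. RCC307", 1498),
    ("Synechococcus sp. WH7803", 1497),
    ("Synechococcus sp. WH7803", 1464),
]


def flag_short(records, min_fraction=0.95):
    """Return records shorter than min_fraction * median length.
    min_fraction is an arbitrary illustrative cutoff."""
    cutoff = min_fraction * median(n for _, n in records)
    return [(name, n) for name, n in records if n < cutoff]


if __name__ == "__main__":
    for name, n in flag_short(SIXTEEN_S):
        print(f"possibly incomplete: {name} ({n} nt)")
```

With this cutoff, only the two genes of about 1320 nt are flagged.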

Protein-coding genes that don’t fit into the prokaryotic gene model used by ab initio gene finders

[Figure: prokaryotic gene model. Ribosome binding site, start codon (ATG, GTG, TTG), open reading frame, stop codon (TAG, TAA, TGA)]

• no RBS (leaderless transcripts)
• non-canonical start
• interrupted translation frame: sequencing errors or translational exceptions

Symptoms of gene finding problems

• Some type of mandatory feature (rRNAs, tRNAs, CDSs) is missing
• “Truncated” genes (shorter than homologs) => funky translation initiation features (non-canonical start codons, leaderless transcripts)
• Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts)
• Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families

Supplemental tools

• TIS (translation initiation site) prediction/correction
  TICO http://tico.gobics.de/
  TriTISA http://mech.ctb.pku.edu.cn/protisa/TriTISA
  The two tools often disagree about the best TIS, especially in high-GC genomes
• Operon prediction
  JPOP http://csbl.bmb.uga.edu/downloads/#jpop
  http://www.cse.wustl.edu/~jbuhler/research/operons/
  http://www.sph.umich.edu/~qin/hmm/
• Proteins with unusual translational features: selenocysteine-containing genes
  bSECISearch http://genomics.unl.edu/bSECISearch/

Metagenomes sequenced with new technologies: low-coverage problems

• Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x)
• High sequence coverage cannot be achieved for metagenome data

[Figure: genome sequence vs. metagenome sequence]

How does this affect metagenome annotation?

• ~70% of 454 Titanium reads have at least 1 sequencing artifact (basecalls in homopolymeric runs); there is no clear pattern of error distribution
• >100 bp Illumina reads have ~3% error rate; the error rate is higher towards the end of the read; the majority of errors are substitutions
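A quick back-of-the-envelope calculation shows why a ~3% error rate matters for gene calling on single reads. The sketch assumes errors are independent and uniform along the read, which is a simplification (as noted above, real Illumina errors cluster towards the read end). The function names are illustrative.

```python
# Sketch: how many errors to expect in a single uncorrected read.

def expected_errors(read_length, per_base_error):
    """Expected number of erroneous bases in one read."""
    return read_length * per_base_error


def prob_at_least_one_error(read_length, per_base_error):
    """P(read contains >= 1 error), assuming independent per-base errors."""
    return 1.0 - (1.0 - per_base_error) ** read_length


if __name__ == "__main__":
    n, e = 100, 0.03  # 100 bp read, ~3% error rate from the slide
    print(f"expected errors per read: {expected_errors(n, e):.1f}")
    print(f"P(at least one error):    {prob_at_least_one_error(n, e):.3f}")
```

Under these assumptions a 100 bp read carries about 3 errors on average, and roughly 95% of such reads contain at least one. Without enough coverage to correct them, those errors go straight into the gene caller.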

Just one example…

• 4-read contig, 1476 nt, no misassembly
• 3 frameshifts
• Contig has 27 homopolymers (3 nt and more), 3 of them have errors
• No correlation with homopolymer type or error type
• Reads were quality trimmed prior to assembly
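Counting homopolymers of 3 nt and more, as done for this contig, takes only a few lines. This is a minimal sketch: the `homopolymer_runs` helper is hypothetical and the input below is a toy sequence, not the contig from the slide.

```python
# Sketch: list homopolymer runs (same base repeated min_len or more times).
import re


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of >= min_len identical bases."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    for start, base, length in homopolymer_runs("ACGTTTAAAAC"):
        print(f"{base} x {length} at position {start}")
```

Each reported run is a position where a 454 basecall may have added or dropped a base, so these are natural places to look when a predicted gene appears frameshifted.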

Metagenome annotation tools (more details will be given)

• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs) http://exon.gatech.edu/GeneMark/
• MetaGene http://metagene.cb.k.u-tokyo.ac.jp/metagene/
• FragGeneScan http://omics.informatics.indiana.edu/FragGeneScan/

Full-service annotation pipelines

• IMG/M-ER (“metagenome gene calling” + other options) http://img.jgi.doe.gov/submit
• MG-RAST http://metagenomics.nmpdr.org/
• CAMERA annotation pipeline http://camera.calit2.net/