PowerPoint Presentation - The Medicago truncatula genome


The Medicago truncatula genome: a progress report

Dr. Bruce A. Roe, Advanced Center for Genome Technology, Department of Chemistry and Biochemistry, University of Oklahoma

[email protected]

www.genome.ou.edu

Plant and Animal Genome, San Diego, January 11, 2004

Photos by Steve Hughes, Genetic Resource Centre (PIRSA-SARDI), Adelaide, Australia.

http://www.fao.org/ag/AGP/AGPC/doc/gallery/pictures/meditrunc/meditrunc.htm


Why sequence the Medicago genome?

An important forage crop

A genetically tractable model legume

A relatively small (~500 Mbp) diploid genome

Active legume research community

Medicago Research Consortium

Large collection of ESTs

Excellent BAC library

Integrated physical and genetic map

Large number of BAC-end sequences

Sequence Pipeline at the University of Oklahoma Genome Center, OU-ACGT

[Figure: sequencing pipeline flowchart. Boxes shown: DNA; DNA shearing (Hydroshear™); colony picking (QPixII™); growing subclones (HiGro™); subclone isolation I (Mini-Staccato™); subclone isolation II (VPrep™); thermocycling (ABI 9700); sequencing (ABI 3700); miscellaneous liquid handling; data assembly and analysis; closure; primer synthesis; GenBank]

Subclone Isolation (Mini-Staccato™)

• This Zymark robot has a 384-cannula array, four built-in shakers, three attached storage racks, built-in barcoding and a Twister II robotic arm.

• This automation has allowed us to perform the DNA isolation completely unattended on as many as eighty 384-well plates of bacterial cells per day.

Subclone Isolation (Mini-Staccato™)

• Once all three solutions have been added, the plates are transferred from the SciClone workspace deck to a storage rack by the Twister II robotic arm.

Subclone Isolation and Sequencing Reaction Pipetting (Velocity 11 VPrep)


• Liquid handling station with 384-channel pipettor head

• Four movable shelves on either side of the pipettor head

• Used for subclone isolation, sequencing reaction set-up and clean-up.

Data assembly and Analysis

Sun V880 server: 32 GB RAM, running Solaris 8 OS, with 3 TB of data stored on RAID-5 arrays and autoloader tape backup

Also: 12 workstations, each with 1 GB RAM

Software shown: Phred/Phrap/Consed, Exgap

Initial WGS Skimming for ~500 Mb Medicago truncatula genome

Collected ~25,000 end-sequences from ~12,500 plasmid-based WGS clones.

Of these ~25,000 sequences, ~1,000 have homology with Medicago truncatula ESTs.

URL: http://www.genome.ou.edu/medicago.html


Phrap assembly of our Medicago truncatula whole genome shotgun survey sequencing data at 0.005-fold genomic sequence coverage
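The fold-coverage figure follows from simple arithmetic: total bases read divided by genome size. The sketch below uses the ~25,000 end-sequences and ~500 Mb genome size quoted on the skimming slide. The slides do not state a read length, so `ASSUMED_READ_LEN_BP` is a hypothetical value, chosen only because it reproduces the quoted 0.005-fold; it is not a figure from the talk.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Fold coverage = total bases sequenced / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


def implied_read_length(coverage, n_reads, genome_size_bp):
    """Mean read length (bp) that would give the stated coverage."""
    return coverage * genome_size_bp / n_reads


N_READS = 25_000            # ~25,000 end-sequences (from the slide)
GENOME_BP = 500_000_000     # ~500 Mb genome (from the slide)
ASSUMED_READ_LEN_BP = 100   # hypothetical: not stated in the slides

coverage = fold_coverage(N_READS, ASSUMED_READ_LEN_BP, GENOME_BP)
print(f"fold coverage: {coverage:.3f}")
print(f"read length implied by 0.005-fold: "
      f"{implied_read_length(0.005, N_READS, GENOME_BP):.0f} bp")
```

Passing a different mean read length to `fold_coverage` shows how sensitive the estimate is to that unstated parameter.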

DotPlot of a Phrap assembled whole genome shotgun contig showing multiple repeated regions

[Figure: dot plot; axis in bases, 0 to 700]

DotPlot of a Phrap assembled whole genome shotgun contig showing 4 repeated blocks of ~600 bases

Yet another genomic contig showing extensive repeated regions: Contig 1931

>Contig1931 TTTACGTCCCCGTAGTGAACTATTTCCTAAGTTGACTAGTCAATTAGGTG ATAGTTCGTCCGGATGACGTACCGCCGTGAACCCGATATGAGAATTTCAT GTGG TGCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCGGATTGA ACGTGGCTGGTGT CGTTCACGATAGAGGCACGTTTAGGTCCCTACGGTGA ACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATAGTTTGTCCGGATGAC GTACCTCCGTGAACCCGATCTGAGAAATTCAAGTTTC TGCATCCTTCTAT GTTTGATAAGGTCATTTTGAACGGTCGGATTGAAGGTGGCTGGTGT TCTT CACATTCTAGGCACGTTTAGGTTCCCGCGGTGAACTAGTTCCTAAGTTGA CTAGTCAATTAGGTGATAGTTCGTCCGGATGACCTACCTCCGTGAACCCG ATATTAGAAATTCAAGTTTC TGCATCCTTCTATGTTTGATAAGGTCATTT TGAACGGTCAGATTGAACGTGGCTGGTGT CGTTCACGATCTAGGCACGTT TAGGTCCCCGCAGTGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGAT AGTTTGTCCGGATGACGTGACTCCGTAAAGCCAGTATGAGAACTTCTAGT TTC TGCATCCTTTTATGTTTGATAAGGTCATTTTGAACGGTGGGATTGAA CGTTGTTGGTGT CGTTCACGATCTAGGCACGTTTAGGTCCCCGCAGTGAA CTAGTTCCTTAGTTGACTAGTCAATTAGGTGATAGTTCGTCCGGATGACG TATCTCCGTCAGCCCGATCTGAGAAATTCAAATTTC TGCATCCTTCTATG TTTGATAAGGTCATTTTGAACGGTCGGATTGAACGTGGCTGGTGT CGTGC ACGATCAAGGCACGTTTAGGTCCCCGCAGCGAACTAGTTCCTAAGTTGAC TAGTCAATTAGGTGATACCTTGTCCGGATGACGTACCTCCGTGAACCCGA TCTGAGAAATTCAAGTTTC TGCATCCTTCTATGTTTGATAAGGTCATTTT GAACGGTTGGATTGAACATGGCTGGTGT CGTTCACGATCTAGGCACGTTT AGGTCCCCGCAGTGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATA GTTCGTCTGGATGACGTACCTCCTTGAACCCAATATGAGAAATTCAATTT TCT TCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCGGATTGAAC GTGCCTGGTG TCGTTCACGATCGAGGCACGTTTAGGTCCCCGCAGTGAAC . . .
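A self dot plot marks every pair of positions in a sequence that share the same short word, so tandem repeats appear as diagonals offset from the main diagonal by the repeat period. The following is a minimal sketch of that idea, not the DotPlot program used for the slides. The word size `k` and the 20-base `unit` are hypothetical toy values, not data from Contig 1931.

```python
from collections import defaultdict


def self_dotplot(seq, k=8):
    """Return sorted (i, j) pairs, i < j, whose k-mers are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                dots.append((starts[a], starts[b]))
    return sorted(dots)


def offset_histogram(dots):
    """Count dots per diagonal offset (j - i); peaks give repeat periods."""
    counts = defaultdict(int)
    for i, j in dots:
        counts[j - i] += 1
    return dict(counts)


unit = "ACGTTGCAAGGCTTAACCGT"   # hypothetical 20-base repeat unit
toy = unit * 4                   # four tandem copies
dots = self_dotplot(toy, k=8)
hist = offset_histogram(dots)
for offset in sorted(hist):
    print(offset, hist[offset])
```

On the toy sequence every dot falls on a diagonal whose offset is a multiple of the unit length, which is the pattern a tandemly repeated contig would show.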

Summary of our Medicago truncatula WGS Sequencing Assembly with only 0.005-fold Genomic Sequence Coverage

The largest contig (21,157 bp) contained the 26S rRNA genes

19 smaller contigs (105,455 bp total) were from the chloroplast genome

The remaining ~500 contigs, ranging in size from 2,000 to 12,000 bp, contain highly repetitive DNA that was unique to Medicago, as it had no significant homology to sequences in the GenBank database

We concluded that a more directed strategy was needed


Mapped BAC approach in collaboration with Doug Cook and DJ Kim at U.C. Davis with funding from the Noble Foundation, Ardmore, OK

The first ~1000 Medicago truncatula BACs

Initially concentrated on BACs with known biological markers and in regions of biological interest that were supplied to us by the UC Davis group.

Requests for sequencing specific BACs were directed to Doug Cook and DJ Kim at UC Davis, who supplied us with the BACs once they had been characterized.

Once the BACs were received, we created the shotgun libraries, isolated the sequencing templates and obtained the working draft sequence followed by closure and finishing.

All data were made publicly available in GenBank within 24 hours of sequence assembly.

UC Davis ------- Oklahoma University

[Figure: Medicago BAC Sequencing. Horizontal axis: Date. Vertical axis: 0 to 100,000,000. Series: Phase 1, Phase 2, Phase 3, Total.]

The next ~750 Medicago truncatula BACs

With recent NSF funding, we will be sequencing BACs from chromosomes 1, 4, 6, and 8 with the goal of completing the sequence of the euchromatic regions of these chromosomes over the next 3 years.

Chromosomes 2 and 7 will be sequenced at TIGR, chromosome 3 at The Sanger Institute and chromosome 5 at Genoscope.

All data will be released immediately as before.


www.genome.ou.edu/medicago.html


www.genome.ou.edu/medicago_totals.html

[Figure: annotated BAC] Medicago-specific gene with ESTs but no known homology. Gene density of this BAC is ~1 gene per 10 kb.

[Figure: annotated BAC] myosin-like protein. Gene density ~1 gene per 10 kb.

Gene Size Distribution (All Sequence Data) (FgeneSH vs. Genscan): 13,396 FgeneSH predicted genes, 11,488 Genscan predicted genes. [Figure: histogram; horizontal axis Gene Size Range, vertical axis Number of Genes, 0 to 4,500]

Exon Size Distribution (All Sequence Data) (FgeneSH vs. Genscan): 59,808 FgeneSH predicted exons, 55,792 Genscan predicted exons. [Figure: histogram; horizontal axis Exon Size Range, vertical axis Number of Exons, 0 to 20,000]

Intron Size Distribution (All Sequence Data) (FgeneSH vs. Genscan): 46,412 FgeneSH predicted introns, 44,305 Genscan predicted introns. [Figure: histogram; horizontal axis Intron Size Range, vertical axis Number of Introns, 0 to 12,000]

Gene Density of the ~450 Mb Medicago truncatula genome

                                           FgeneSH          Genscan
Total number of genes                       13,397           11,488
Total length of genes                   30,793,326       51,687,528
Total exon length                       15,794,243       14,400,445
Total number of exons                       59,808           55,792
Total intron length                     14,999,083       37,287,083
Total number of introns                     46,412           44,305
Base Pairs Sequenced                    87,423,457       87,423,457
Gene Space (Gene Length/BP Sequenced)          35%              59%
Gene Density (Genes/200Mb)                  30,649           26,281
                                     1 gene/6.5 kb    1 gene/7.6 kb

Arabidopsis: 25,498 protein coding genes
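The derived rows of the gene density table (gene space, kb per gene, genes per 200 Mb) can be recomputed from the raw totals. The sketch below does that with the numbers from the slide. The function names are illustrative, and the formulas are the ones implied by the row labels, not code from the original analysis.

```python
BP_SEQUENCED = 87_423_457  # base pairs sequenced (from the slide)

# Totals from the slide, per gene predictor.
predictions = {
    "FgeneSH": {"genes": 13_397, "gene_length_bp": 30_793_326},
    "Genscan": {"genes": 11_488, "gene_length_bp": 51_687_528},
}


def gene_space_percent(gene_length_bp, bp_sequenced):
    """Gene space = total gene length / base pairs sequenced, in percent."""
    return 100.0 * gene_length_bp / bp_sequenced


def kb_per_gene(genes, bp_sequenced):
    """Average spacing: kilobases of sequence per predicted gene."""
    return bp_sequenced / genes / 1000.0


def genes_per_span(genes, bp_sequenced, span_bp=200_000_000):
    """Extrapolate the observed gene density to a span of span_bp."""
    return genes * span_bp / bp_sequenced


summary = {}
for name, p in predictions.items():
    summary[name] = {
        "gene_space_percent": gene_space_percent(p["gene_length_bp"], BP_SEQUENCED),
        "kb_per_gene": kb_per_gene(p["genes"], BP_SEQUENCED),
        "genes_per_200mb": genes_per_span(p["genes"], BP_SEQUENCED),
    }
    s = summary[name]
    print(f"{name}: {s['gene_space_percent']:.0f}% gene space, "
          f"1 gene/{s['kb_per_gene']:.1f} kb, "
          f"{s['genes_per_200mb']:.0f} genes per 200 Mb")
```

Rounded, these reproduce the table values, which suggests the Genes/200Mb row is a straight-line extrapolation of the density observed in the sequenced BACs.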

Medicago GC Content for ~90 Mb of Genomic BAC Clones Sequenced (mainly from gene rich regions)

Metabolic Overview of Medicago: 13,396 FgeneSH predicted genes using the COG Database

• Multiple COG Hits 8%
• No Hits 5%
• DNA Metabolism 23%
• Poorly Characterized 17%
• Cellular Processes 23%
• Metabolism 24%

Metabolic Overview (detailed view) of Medicago: 13,396 FgeneSH predicted genes using the COG Database

• No Hits 5%
• Multiple COG Hits 8%
• Translation, ribosomal structure & biogenesis 7%
• Transcription 5%
• DNA replication, recombination & repair 11%
• Poorly Characterized 17%
• Secondary metabolites biosynthesis, transport & catabolism 3%
• Lipid metabolism 2%
• Coenzyme metabolism 2%
• Cell division & chromosome partitioning 2%
• Posttranslational modification, protein turnover, chaperones 5%
• Cell envelope biogenesis, outer membrane 4%
• Nucleotide transport & metabolism 2%
• Amino acid transport & metabolism 5%
• Carbohydrate transport & metabolism 4%
• Cell motility & secretion 3%
• Inorganic ion transport & metabolism 3%
• Signal transduction mechanisms 5%
• Energy production & conversion 5%

Gene Duplication: Three copies of the phosphoglycerate kinase gene in one BAC

AC138448.fg.10 MATKRSVGTLKEAELKGKRVFVRVDLNVPLDDNLNITDDTRIRAAVPTIKYLTGYGAKVILSSHL----
AC138448.fg.11 MA-KKSVGDLSGAELKGKKVFVRADLNVPLDDNQNITDDTRIRAAIPTIKYLIQNGAKVILSSHL----
AC138448.fg.8  MATKRSVGTLKEGELKGKRVFVRVDLNVPLDDNLNITDDTRIRAAVPTIKYLTGYGAKVILSSHLEIYKT

AC138448.fg.10 ------------------------------------------GRPKGVTPKYSLKPLVPRLSELLGTQVK
AC138448.fg.11 ------------------------------------------GRPKGVTPKYSLAPLVPRLSELIGIEVI
AC138448.fg.8  EVSVSEYNLAVSEYKLAISDTYRYRIRVRHDSSPFLEYRGSQGRPKGVTPKYSLKPLVPRLSELLETQVK

AC138448.fg.10 IADDSIGEEVEKLVAQIPEGGVLLLENVRFHKEEEKNDPEFAKKLASLADLYVNDAFGTAHRAHASTEGV
AC138448.fg.11 KAEDSIGPEVEKLVASLPDGGVLLLENVRFYKEEEKNDPEHAKKLAALADLYVNDAFGTAHRAHASTEGV
AC138448.fg.8  ISDDCIGEEVEKLVAQIPEGGVLLLENVRFHKEEEKNEPEFAKKLASLADLYVNDAFGTAHRAHASTEGV

AC138448.fg.10 AKYLKPSVAGFLMQKELDYLVGAVSNPKKPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIFTFYKA
AC138448.fg.11 TKYLKPSVAGFLLQKELDYLVGAVSSPKRPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIFTFYKA
AC138448.fg.8  AKYLKPSVAGFLMQKELDYLVGAVSNPKKPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIYTFYKA

AC138448.fg.10 QGYAVGSSLVEEDKLDLATTLIEKAKAKGVSLLLPTDVVIADKFAADANDKIVPASSIPDGWMGLDIGPD
AC138448.fg.11 QGLAVGSSLVEEDKLELATTLIAKAKAKGVSLLLPSDVVIADKFAPDANSQIVPASAIPDGWMGLDIGPD
AC138448.fg.8  QGYSIGSSLVEEDKLDLATSLMEKAKAKGVSLLLPTDVVIADKFSADANDKIVPASSIPDGWMGLDIGPD

AC138448.fg.10 SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM
AC138448.fg.11 SIKTFNEALDTTQTIIWNGPMGVFEFDKFAVGTESIAKKLADLSGKGVTTIIGGGDSVAAVEKVGVADVM
AC138448.fg.8  SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM

AC138448.fg.10 SHISTGGGASLELLEGKPLPGVLALDDA* 401 amino acids
AC138448.fg.11 SHISTGGGASLELLEGKELPGVLALDEATPVAV* 405 amino acids, differs at 42 positions
AC138448.fg.8  SHISTGGGASLELLEGKPLPGVLALDDA* 448 amino acids, differs at 6 positions
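Difference counts like those quoted for the three phosphoglycerate kinase copies come from comparing aligned sequences column by column. The sketch below shows one way to do that on the first 20 alignment columns of each copy. How the slide counted gaps and conservative substitutions is not stated, so reporting substitutions and gap columns separately is an assumption made here.

```python
def count_differences(ref, other):
    """Compare two equal-length aligned sequences column by column.

    Returns (substitutions, gap_columns), where a gap column is one in
    which exactly one of the two sequences has '-'.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have the same length")
    substitutions = 0
    gap_columns = 0
    for a, b in zip(ref, other):
        if a == b:
            continue
        if a == "-" or b == "-":
            gap_columns += 1
        else:
            substitutions += 1
    return substitutions, gap_columns


# First 20 alignment columns of each predicted protein, from the slide.
fg10 = "MATKRSVGTLKEAELKGKRV"
fg11 = "MA-KKSVGDLSGAELKGKKV"
fg8 = "MATKRSVGTLKEGELKGKRV"

print("fg.11 vs fg.10:", count_differences(fg10, fg11))
print("fg.8  vs fg.10:", count_differences(fg10, fg8))
```

Even in this short window fg.11 is the more diverged copy, in line with the 42 versus 6 differences reported for the full-length proteins.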

Printrepeat Analysis of M. truncatula BAC AC121240 vs. A. thaliana Chr.2

Expansion, Duplication, Repeat Elements

~25 kb region ~5 kb region

PIP of M. truncatula BAC AC121240 vs. A. thaliana Chr.2

Medicago truncatula: Summary and Conclusions

Average predicted gene density of 1 gene per 6.5 kb (FgeneSH) to 1 gene per 7.6 kb (Genscan).

Genome characteristics such as %GC, intron/exon size and conserved unique 5’ splice sites reveal Medicago characteristics

The sequence of the Medicago truncatula genome shows homology to the sequenced Arabidopsis thaliana genome but expansion, rearrangements and duplications are evident.

Data Release and Preliminary Annotation

All our sequence data is available through links on our web site to GenBank and on our ftp site at URL: ftp.genome.ou.edu/medicago

Keyword and BLAST searches can be done on our web site at URL: http://www.genome.ou.edu/medicago.html

Additional annotation via a Genome Browser database is available on our web site at URL: http://www.genome.ou.edu/medicago_table.html

E-mail suggestions for additional annotation to Bruce Roe at: [email protected]


Three Year Plan

Obtain the contiguous sequence of the gene-rich regions of four of the 8 Medicago truncatula chromosomes at OU, with the remaining four being completed by our international partners at TIGR, Sanger, and Genoscope.

This information will serve as a solid foundation for anticipated comparative and functional legume genomics.

Laboratory Organization

Bruce Roe, PI

Teams: Informatics, Production, Research Teams, Support Teams (DNA Synthesis, Reagents & Equip. Maint., Administration)

Laboratory members, in the order listed on the slide: Jim White, Steve Kenton, Hongshing Lai, Sean Qian, Rose Morales-Diaz*, Mounir Elharam*, Yonas Tesfai, Steve Shaull**, Phoebe Loh*, Sulan Qi, Bart Ford*, Doug White, Work-study Undergraduates**, Mounir Elharam*, Doug White, Kay Lynn Hale, Dixie Wishnuck, Tami Womack, Mary Catherine Williams, Doris Kupfer, Julia Kim*, Sun So, Graham Wiley**, Lauren Ritterhouse**, Limei Yang, Angie Prescott*, Audra Wendt**, Mandi Aycock**, Fu Ying, Liping Zhou, Ruihua Shi, Junjie Wu, Stephan Deschamps, Shelly Oommen, Christopher Lau, Yanhong Li, Ziyun Yao, Steve Shaull*, Youngju Yoon, Trang Do, Anh Do, Lily Fu, Yang Ye, James Yu, Tessa Manning**, ShaoPing Lin, Honggui Jia, Hongming Wu, Baifang Qin, Peng Zhang, Lin Song, Ying Ni, Huarong Jiang, Fares Najar, Chunmei Qu, Keqin Wang, Carson Qu, Shuling Li, Phoebe Loh*, Sulan Qi, Bart Ford*, Axin Hua, Weihong Xu, Jami Milam, Sara Downard**

Funding from the Noble Foundation, DOE, and NSF

Collaborators at Univ. Minnesota, UC Davis, TIGR, Sanger, Genoscope, and the Noble Foundation

* Previous undergraduate research student

** Present undergraduate research student

The ACGT Team

Conserved Intron/Exon Boundary Features by a FELINES** Analysis of 181,444 Medicago truncatula ESTs in GenBank vs Genomic Sequence

                 Exons            Introns
Size Range       6 - 5,789 nt     20 - 3,921 nt
Mean Length      268 nt           429 nt

Intron Conserved Splice Site Sequence Elements     Percent
Introns w/ 5’ GU                                   99.21%
Introns w/ 5’ GC                                   0.36%*
Introns w/ 5’ AU                                   0.31%
Introns w/ U12 branch sites instead of A12         0.13%

* Compared to 0.5 - 2.5% in fungi, and 0.5% in mammals with an EST minimum identity of 90%

** S. Drabenstot, D. Kupfer, J. White, D. Dyer, B. Roe, K. Buchanan and J. Murphy. FELINES: A Utility for Extracting and Examining EST-Defined Introns and Exons. Nucleic Acids Research 31(22), E141 (2003).

Consensus Logogram of the 5’ GU vs. the 5’ AU Class of Introns in Medicago truncatula, determined by FELINES

[Figure panels: GU intron consensus; AU intron consensus]
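The 5’ splice-site classes reported above (GU, GC, AU) are defined by the first two bases of each intron. The sketch below shows that tally on a handful of made-up intron sequences. It is not FELINES and does not reproduce its EST-to-genome alignment step; `toy_introns` is hypothetical data chosen only to exercise each class.

```python
from collections import Counter


def five_prime_class(intron):
    """Classify an intron by its first two bases, written as RNA (T -> U)."""
    dinucleotide = intron[:2].upper().replace("T", "U")
    return dinucleotide if dinucleotide in {"GU", "GC", "AU"} else "other"


def class_percentages(introns):
    """Percentage of introns falling in each 5' splice-site class."""
    counts = Counter(five_prime_class(s) for s in introns)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Hypothetical intron sequences (DNA), not taken from the Medicago data.
toy_introns = [
    "GTAAGTTTCAG",
    "GTATGTACAG",
    "GCAAGTTTAG",
    "ATATCCTTAC",
]

percentages = class_percentages(toy_introns)
for cls in sorted(percentages):
    print(f"Introns w/ 5' {cls}: {percentages[cls]:.2f}%")
```

Applied to a full set of EST-defined introns, the same tally would yield a percentage column like the one in the table.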