Assembling and Annotating the Draft Human Genome

Tools for understanding the sequence, evolution,
and function of the human genome.
Jim Kent and the Genome Bioinformatics Group
University of California Santa Cruz
The Goal
Make the human genome
understandable by humans.
Step 1
Sequence the human genome
Idealized Hierarchical Shotgun Sequencing
Mapping
300,000 BAC Clones Were Digested and Run on Agarose Gels
Cari Soderlund’s FPC and Wash U Pathfinders Made Fingerprint Map Contigs
Bob Waterston
escaping management
Genetic and radiation hybrid maps placed contigs on chromosomes
Sequence and Assembly
• BAC Clones shotgun sequenced at high
throughput to 4x ‘draft’.
• Assembled with Phil Green’s Phrap
GigAssembler
Jim Kent
David Haussler
(meanwhile Celera working on whole genome shotgun version)
The Truth
+ light
- darkness
[Figure: fragments marked +, - or ?]
Keeping strands straight is the hard part
“Finishing” Sequence
• Using primers at the ends of contigs to close gaps.
• Checking automatic assembly especially near
tandem repeats.
• Checking that the in-silico restriction digest of each BAC
matches the actual digest.
• Time consuming - 1 year to ‘draft’ genome, 2
years to ‘finish’.
• Human finished. Mouse will be finished
(currently half finished). Other genomes may stay
at draft stage, though draft stage can be very good
these days.
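The in-silico digest check above can be sketched in Python. This is a minimal sketch, not the finishing pipeline itself. It assumes a single enzyme (HindIII, cutting A^AGCTT, chosen only for illustration), a linear sequence, and a hypothetical 5% size tolerance when comparing against gel fragment sizes. The function names are made up for this example.

```python
def in_silico_digest(seq, site="AAGCTT", cut_offset=1):
    """Return fragment lengths from cutting a linear sequence at every site.

    site and cut_offset are illustrative assumptions (HindIII, A^AGCTT).
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)      # cut position inside the site
        pos = seq.find(site, pos + 1)      # look for the next occurrence
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def digest_matches(predicted, observed, tolerance=0.05):
    """Hypothetical check: same fragment count, sizes agree within tolerance."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= tolerance * o
               for p, o in zip(sorted(predicted), sorted(observed)))


fragments = in_silico_digest("CCAAGCTTGGGGAAGCTTCC")
print(fragments)
print(digest_matches(fragments, [7, 3, 10]))
```

A mismatch in fragment count or size between the predicted and the actual digest would flag the assembly of that clone for a closer look.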
Now What?
GGCTTTTGAAGGGAGTTCTGTTTATATATACGTCAACATCCAGTTGGAGGTGAAAAGGTTAGCACTTGACCCAGGAAGTATCCATGT
AAATCTGCTTCATAAATTTCTTCATCAGTCTTTTTTTCCATTATGAGCTTTGATTATAATAAAGGAGCTGTTATTAACTTTTATTCA
CTCTTTGAAAATATTTACCACCCTTCTCCCTTTCCCCTCATGAAATGTGCCAACTTCATAGGAATTAACAAATTGTAGCCCAGCCAA
TAAGCATACCTGAAACTTGAGTATATTTATTTATTACAGACATCCTAAGACCCGTAAACTCTGCTCTGGATCATATCACTCCAGGAT
ATGATTGTACAGGAAATGGGGAATATCATAGGCTCACAAAGGATAACTGATAGAACTCAGTGTGGTACTTTGGGGACATCAAACATT
AAGACTATTCACGAATAACACAAAATATACATTCATTGTGCCATCCATCACATTAACAATTGAGCTGAAAATACATTATATCCAGCT
AAGGAAGAAATTGGTTTGAATAATACTTTTAGGTTCTGAATAACCCAGCACAAATTTTAAACAGAGGGTGGCCCGAGAAGAAAGGGG
AGACTTAGCACAGGAAGCCGGGTTTCTGAAGTTTGTGCTCTGCAGGGCTTCTTAACTGTAAGAACAAATCAAGGCTACCCTCTGAGG
TTAAATGAGGGAATTTTTTCTTTCACCTATAAAATTGTACCAGTTTAGAGAGTTTGCCCACCCTGTTTTAGTAACCTAAACATTTCT
AAAGATAAATCTCTTAGGACAAAGTATTTACAACCAGCAAACTCACACACATGAAAATGACTTAAATTAAGGGATGAATTAATTGTG
CATCTCTTCTTCCTGAGCTCCTGGACTCGCCTTTCGCTATATCCTACTTTCAAGGACAAGGGAGGGGAGAGCTGTACATATAGTTAG
AGATTCCTTCTGGCATGTTTCTGTTGGCAAAGGGAACTATTTTCCAAAAGGTCATCTGAAAGGAACAGTAGGTTCTGTGAATTCTCC
GATGTTAAGGCCCACCAGAAAATGTATGCTGGCACCCAATCTGGATGAAGGTGTTAACCCCGCACCAAGTCTCTGGTCCAGAATTAT
ATCCTGGCCAGGAGCTCCCCAGATAGGATTAGAAAGGAAGAAAGAGACTGTAAATGGAAAGAAAGATAAGCTAAGCATGTGCTTTGG
GCCCAAGGAGATGCCTGGGCTGTTGTCTGGGGCTGGAGCCGCCTCAGTGGGAGGTAGTCAGAGTGTCTGAGGTAGAAGACCCCGGGG
CGAAGAGCTGGACTTCTCTGAGGATTCCTCGGCCTTCTCGTCGTTTCCTGGCGGGGTGGCCGGAGAGATGGGCAAGAGACCCTCCTT
TGCTTCATTCGGCGGTTCTGGAACCAGATCTTCACTTGGGTCTCGTTGAGCTGCAGGGATGCAGCGATCTCCACCCTGCGGGCGCGC
TGAAGTGGAACTCCTTCTCCAGTTCCGTGAGCTGCTTGGTAGTGAAGTTGGTGCGCACCGCGTTGGGTTGACCCAGGTAGCCGTACT
TGGGGCAAAGTGGGAAGCCATGAGACGGAAATGTAAAAATTTTTAAATCGACTTGAGATTCCCCACACGCTTCATGGCAACACTCAG
CAAGAACTCAGCACAAATCGGGCTGTGGAGGGTGAGTGATGAGGTGTAAAGTGTTAACCTGATGTAAACCATTAGCATGGTCAGACC
GCCTCAAGATATTAACAGAACACTACCGTCACAATAACCACCCCCACATACTTCCTATTTCCCAAATGTATAAAATCCTTGAAAACA
GACTTCTTTGCCCCAACACCTCTGGGCACCCTCTCCATGCACTACAACACTAGTCTGATACAAAAGCCTTTTAAAAAAAAGATCATT
GAAATTAAGCATACCAGCTCCTTCCAGAATAATCAAGGAGCATCCACCAACCAGCAGGACTGACCTGTTTTGGGAGGGTTTCTTTTG
CAAAAGTCTGCGCTGGAGAAGATGTCTCCGATGCGGGGGAGCGACAGGCTTCTTGGTGGCTGGCGTGGAGAGGGGACAAGGAGTTAT
GGCCAGGCTCTGGTGCTCCTGTCCATATGAGTGGTGAATGTATTGAGGCGAGCCCACCGCGCCCCCAGCATAACCCTGGTGGTGGTG
Finding the Genes
Dr. Blat helping a gene find itself.
SIGLEC7 - a gene with some
transcriptional complexity.
Sialic Acid Binding/Ig-like Lectin 7
displayed in UCSC Genome Browser
Genes: Lines of Evidence
• Full length human mRNA (the best!)
• Protein homology with other species.
• EST evidence - 1st step for much mRNA.
• Evidence from genome/genome alignments
• HMM based gene finders
Transferrin Receptor in UCSC
Genome Browser
Transferrin
Clicking on a “known gene” brings up a large page of
information on the gene.
Current state of human genome
• ~99% of human genome sequenced. Last
1% will still be a challenge.
• ~85% of human genes located. Substantial
resources are being devoted to last 15%.
• ~20% of human genes with any depth of
functional annotation. Curation and
integrated database are key to progress.
• <1% of human regulatory regions located.
Transferrin Receptor
Note peaks of conservation in 3’ UTR. These include iron
response elements which regulate translation of this gene.
Comparative Genomics
Webb Miller
Comparative Genomics at BMP10
Conservation of Gene Features
[Plot: percent aligning and percent identity, y-axis 50% to 100%]
Conservation pattern across 3165 mappings of human
RefSeq mRNAs to the genome. A program sampled 200
evenly spaced bases across 500 bases upstream of
transcription, the 5’ UTR, the first coding exon, introns,
middle coding exons, introns, the 3’ UTR and 500 bases
after polyadenylation. There are peaks of conservation at
the transition from one region to another.
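The sampling step in this caption can be sketched as follows. This is a minimal sketch, assuming each region is given as a list of per-base conservation values. The function names `sample_evenly` and `mean_profile` are hypothetical, and the original program may have chosen its sample points differently.

```python
def sample_evenly(values, n_samples=200):
    """Pick n_samples evenly spaced entries from a per-base list of values."""
    if not values:
        raise ValueError("region is empty")
    length = len(values)
    # Take the midpoint of each of n_samples equal-width bins.
    # Regions shorter than n_samples simply repeat bases.
    return [values[int((i + 0.5) * length / n_samples)]
            for i in range(n_samples)]


def mean_profile(regions, n_samples=200):
    """Average the sampled profiles of many regions, position by position."""
    profiles = [sample_evenly(r, n_samples) for r in regions]
    return [sum(column) / len(profiles) for column in zip(*profiles)]
```

Sampling a fixed number of points is what lets introns or UTRs of very different lengths be averaged into one curve per feature type.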
Chaining Alignments
• Chaining bridges the gulf between syntenic blocks
and base-by-base alignments.
• Local alignments tend to break at transposon
insertions, inversions, duplications, etc.
• Global alignments tend to force non-homologous
bases to align.
• Chaining is a rigorous way of joining together
local alignments into larger structures.
Chains join together related local alignments
Protease Regulatory Subunit 3
Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human
alignment correlated with sizes of transposon relics. Affine
gap scores model red/blue plots as straight lines.
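To see why affine penalties are harsh on long gaps, compare an affine cost with one that flattens out for long gaps. This is a minimal sketch. The parameter values and the logarithmic form are illustrative assumptions, not the gap costs actually used for chaining.

```python
import math


def affine_gap_cost(length, gap_open=400, gap_extend=30):
    """Affine cost: a fixed opening charge plus a constant charge per base."""
    return gap_open + gap_extend * length


def log_gap_cost(length, gap_open=400, gap_extend=30, linear_until=100):
    """Illustrative alternative: affine for short gaps, logarithmic beyond."""
    if length <= linear_until:
        return affine_gap_cost(length, gap_open, gap_extend)
    base = affine_gap_cost(linear_until, gap_open, gap_extend)
    # The slope matches gap_extend at linear_until, then decays as 1/length.
    return base + gap_extend * linear_until * math.log(length / linear_until)


for size in (10, 100, 1000, 10000):
    print(size, affine_gap_cost(size), round(log_gap_cost(size)))
```

Under the affine cost a 10,000 base gap, such as a transposon insertion, costs roughly a hundred times what a 100 base gap does, which is enough to break an otherwise good alignment.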
Before and After Chaining
Chaining Algorithm
• Input - blocks of gapless alignments from blastz
• Dynamic program based on the recurrence
relationship:
score(B_i) = max_{j<i} [ score(B_j) + match(B_i) - gap(B_i, B_j) ]
• Uses Miller’s KD-tree algorithm to minimize
which parts of dynamic programming graph to
traverse. Timing is O(N logN), where N is number
of blocks (which is in hundreds of thousands)
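The recurrence can be illustrated with a plain quadratic dynamic program. This is a minimal sketch and deliberately not the KD-tree version: it checks every earlier block, so it runs in O(N^2). The `Block` fields, the linear `gap_cost`, and the restriction to same-strand blocks are assumptions made for the example.

```python
from collections import namedtuple

# A gapless alignment block: start in target (human), start in query (mouse),
# length, and match score. The field names are hypothetical.
Block = namedtuple("Block", "t_start q_start size score")


def gap_cost(prev, cur):
    """Hypothetical gap cost: total unaligned bases between two blocks."""
    dt = cur.t_start - (prev.t_start + prev.size)
    dq = cur.q_start - (prev.q_start + prev.size)
    return dt + dq


def best_chain(blocks):
    """Return (score, blocks) of the highest-scoring chain."""
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not blocks:
        return 0, []
    score = [b.score for b in blocks]      # base case: a block on its own
    back = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            follows = (prev.t_start + prev.size <= cur.t_start and
                       prev.q_start + prev.size <= cur.q_start)
            if not follows:
                continue                   # prev must end before cur on both axes
            candidate = score[j] + cur.score - gap_cost(prev, cur)
            if candidate > score[i]:
                score[i], back[i] = candidate, j
    end = max(range(len(blocks)), key=lambda k: score[k])
    best = score[end]
    chain = []
    while end is not None:                 # walk the back-pointers
        chain.append(blocks[end])
        end = back[end]
    return best, chain[::-1]
```

With hundreds of thousands of blocks the inner loop over all `j < i` is the bottleneck, which is the part the KD-tree search prunes.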
Netting Alignments
• Commonly multiple mouse alignments can
be found for a particular human region,
particularly for coding regions.
• Net finds the best mouse match for each
human region.
• Highest scoring chains are used first.
• Lower scoring chains fill in gaps within
higher scoring chains, inducing a natural hierarchy.
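The "highest scoring chains first" idea can be sketched as a greedy pass over the human axis. This is a simplified sketch under stated assumptions: each chain is reduced to a score plus a list of half-open human intervals, and the nesting levels of a real net are not tracked. The name `net_chains` and the input layout are hypothetical.

```python
def net_chains(chains):
    """Give each human base to the best-scoring chain that aligns there.

    chains maps a chain name to (score, [(start, end), ...]) with half-open
    human coordinates. Returns a sorted list of (start, end, name).
    """
    claimed = []
    by_score = sorted(chains.items(), key=lambda item: -item[1][0])
    for name, (score, blocks) in by_score:
        for start, end in blocks:
            pieces = [(start, end)]
            # Trim away whatever better chains have already claimed.
            for c_start, c_end, _ in claimed:
                remaining = []
                for s, e in pieces:
                    if c_end <= s or e <= c_start:
                        remaining.append((s, e))       # no overlap
                        continue
                    if s < c_start:
                        remaining.append((s, c_start))  # left overhang
                    if c_end < e:
                        remaining.append((c_end, e))    # right overhang
                pieces = remaining
            claimed.extend((s, e, name) for s, e in pieces)
    return sorted(claimed)


example = {
    "A": (1000, [(0, 100), (200, 300)]),
    "B": (500, [(50, 250)]),
    "C": (100, [(120, 130), (290, 310)]),
}
print(net_chains(example))
```

In the example, chain B ends up only in the gap between the two blocks of chain A, which is the gap-filling behavior the bullets describe.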
Net Focuses on Ortholog
Net highlights rearrangements
A large gap in the top level of the net is filled by an
inversion containing two genes. Numerous smaller
gaps are filled in by local duplications and processed
pseudo-genes.
Useful in finding pseudogenes
Ensembl and Fgenesh++ automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein must
be interesting!
Mouse/Human
Rearrangement Statistics
Number of rearrangements of given type per megabase
excluding known transposons.
A Rearrangement Hot Spot
Rearrangements are not evenly distributed. Roughly 5%
of the genome is in hot spots of rearrangements such as
this one. This 350,000 base region is between two very
long chains on chromosome 7.
Reconstructed ancestral
(boreutherian) genome for one
chromosome
Finding Function
• We’ve located 85% of the genes, on
track for 95% in a year or two.
• We have SOME idea of what 30% of
the genes do.
• We have virtually NO idea of what the
rest do.
How to Find Function
• Homology - guilt by association. Orthologs very
valuable.
• Genetics/knockouts - what happens when a gene
gets broken?
– RNAi is speeding this up amazingly in worms
and other model organisms.
• Expression - when and where is gene used?
– Microarrays, in situs, GFP fusions.
• Interactions - what molecules are touching?
– Yeast 2 hybrid, Immunoprecipitations
• Literature - finding out what we already know.
Data Mining
Gene Sorter - info on sets of genes
Sorted by homology
Sorted by genome distance
Coping with Bioinformatics
Tower of Babel
Up in Testes, Down in Brain
Encode Project
• ENCyclopedia Of DNA Elements
• Pilot phase: detailed experimental analysis of 1%
of genome in ~40 different regions.
• Many types of experiments
– CHIP/CHIP
– DNAse hypersensitivity
– Tiling microarrays
– Deep comparative genomics
• Data available at genome.ucsc.edu via ENCODE
link.
ENCODE Dnase I Hypersensitivity, CHIP/CHIP, transcription data
Close up of region
VisiGene
• Image browser for in-situ and other gene-oriented pictures
• Hopefully in the long run will have a
million images covering almost all
vertebrate genes.
• Currently has 6000 images covering 1000
mouse transcription factors courtesy of Paul
Gray et al.
Gene Browser Staff
• Programming: Hiram Clawson, Mark
Diekhans, Rachel Harte, Angie Hinrichs,
Fan Hsu, Andy Pohl, Kate Rosenbloom,
Chuck Sugnet
• Docs, quality, support: Gill Barber, Ron
Chao, Jennifer Jackson, Donna Karolchik,
Bob Kuhn, Crystal Lynch, Ali Sultan-Qurraie, Heather Trumbower
• Computer systems: Jorge Garcia, Patrick
Gavin, Paul Tatarsky
Comparative Genomics
• UCSC - Robert Baertsch, Gill Bejerano,
Yontoa Lu, Jacob Pedersen, Katie Pollard,
Adam Siepel, Daryl Thomas, David
Haussler
• PSU - Laura Elnitski, Belinda Giardine,
Ross Hardison, Minmei Hou, Scott
Schwartz, Webb Miller
Data Contributors
• Human Genome Project
• Genbank/DDBJ/EMBL contributors
• Novartis GNF foundation
• Affymetrix, Perlegen, SNP Consortium
• SwissProt, Ensembl, EBI and NCBI
• Jackson Labs, RGD, Wormbase, Flybase
• Many contributors of gene prediction and other tracks.
Funding
• National Human Genome Research Institute
• Howard Hughes Medical Institute
• Taxpayers in the USA and California
THE END
Confounded Pseudogenes!
• Pseudogenes confound HMM and homology based gene
prediction.
• Processed pseudogenes can be identified by:
– Lack of introns (but ~20% of real genes lack introns)
– Not being the best place in genome an mRNA aligns
(be careful not to filter out real paralogs)
– Being inserted from another chromosome since
dog/human common ancestor (breaking synteny).
– High rate of mutation (Ka/Ks ratio).
• Robert Baertsch at UCSC has produced a processed
pseudogene track.
• Yontoa Lu working on a non-processed pseudogene track.
Close up of two processed pseudogenes
Detail Near Translation Start
[Plot: percent conservation (60% to 100%) at positions -15 to +15 around translation start]
Note the relatively conserved base 3 before translation
start (constrained to be a G or an A by the Kozak
consensus sequence), and the first three translated bases
(ATG).
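The two constraints in this note are easy to test on a sequence. This is a minimal sketch, assuming the mRNA is a DNA-alphabet string and `start` is the 0-based index of the first translated base. The function name `kozak_context` is hypothetical, and only the -3 purine and the ATG are checked, not the rest of the Kozak consensus.

```python
def kozak_context(mrna, start):
    """Return (has_atg, has_purine_at_minus_3) for a candidate start site."""
    seq = mrna.upper()
    has_atg = seq[start:start + 3] == "ATG"
    # Position -3 is three bases upstream of the A in ATG.
    has_purine = start >= 3 and seq[start - 3] in "AG"
    return has_atg, has_purine


print(kozak_context("GCCACCATGG", 6))
print(kozak_context("GCCTCCATGG", 6))
```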
Normalized eScores
Table browser - text-oriented browsing and data
analysis of genome browser database.