Today’s Lecture Topics

Whole genome sequencing

Shotgun sequencing method

Sequencing the human genome

Functional/comparative genomics

Transcriptome & RNA-Seq

Proteomics
Shotgun DNA sequencing:

Sequence the entire genome rapidly.

No requirement for a high resolution linkage or physical map.

Just break the genome up into small pieces, sequence it, and find the
gene of interest/do the bioinformatic analysis later.

Reverses the way genetic studies proceed.

It used to be that we had to find the gene first in order to study
the cause of a disease.

Now we can study the effects of genes we didn’t even know existed.
Fig. 8.13, Shotgun sequencing a genome
Shotgun DNA sequencing, dideoxy method:
1. Begin with genomic DNA and/or a 200-300 kb BAC clone library.
2. Mechanically shear the DNA into overlapping ~2 kb fragments.
3. Isolate on agarose, purify, and clone into standard plasmid vectors.
4. Sequence ~500 bp from each end of each 2 kb insert.
5. Sequence from the middle 1,000 bp of each insert is obtained from
overlapping clones.
6. Repeat the process so that 4-5x the total length of the genome is
sequenced (dideoxy sequencing is 99.99% accurate).
7. Results in a contig library with ~97% genome coverage (the
missing 3% is composed mostly of repeated DNA sequence).
8. Assemble hundreds of thousands of overlapping ~500 bp sequences
with fast computers operating in parallel (supercomputer).
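The assembly step can be illustrated on a toy scale. The following is a minimal sketch of greedy overlap assembly, assuming error-free reads from a single strand and no repeated sequence; the names `overlap`, `greedy_assemble`, and `min_len` are illustrative and do not come from the lecture or from any real assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three overlapping "reads" drawn from a made-up 15 bp sequence.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
contigs = greedy_assemble(reads)
```

This sketch does not handle reads fully contained in other reads, sequencing errors, or reverse-complement reads, all of which a production assembler must address.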
How to deal with the repeated DNA - 2 kb clones present a problem,
solved with 10 kb clones:
1. Many repeated sequences in the genome are in regions spanning ~5
kb in size.
2. As a result, many 2 kb clones consist entirely of repeated DNA.
3. Results in a dead stop in the assembly, because there is ambiguity
about where each clone goes. Repeated sequences occur all over the
genome.
4. On average, 10 kb clones contain less repeated DNA sequence.
5. Solution is to create and sequence a 10 kb clone library derived from
the same genomic DNA or BAC library.
6. Complete genome coverage requires combining the sequences from
the 2 kb & 10 kb libraries.
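The placement ambiguity described above can be shown with a toy example. This is a sketch under stated assumptions: the 26 bp "genome" and the 7 bp repeat are made up for illustration, and `placements` is a hypothetical helper, not part of any assembler. A short fragment lying entirely inside the repeat matches two locations, while a longer fragment that reaches into unique flanking sequence matches only one, which is the reason larger clones help.

```python
def placements(genome: str, fragment: str):
    """Return every start position where `fragment` matches `genome` exactly."""
    positions = []
    start = genome.find(fragment)
    while start != -1:
        positions.append(start)
        start = genome.find(fragment, start + 1)  # allow overlapping matches
    return positions


# Made-up sequence: the repeat GATTACA occurs twice, separated by unique DNA.
repeat = "GATTACA"
genome = "TTCG" + repeat + "CCAT" + repeat + "GGTA"

short_fragment = "GATTACA"    # lies entirely within the repeat
long_fragment = "GATTACACC"   # repeat plus 2 bp of unique flank

ambiguous = placements(genome, short_fragment)
unique = placements(genome, long_fragment)
```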
Genome                               Date  Size                       Institute                        Method
Homo sapiens mtDNA                   1981  16,569 bp (1 circular)                                      -
Haemophilus influenzae (bacteria)    1995  1,830,137 bp (1 circular)  TIGR                             Shotgun
Mycoplasma genitalium (bacteria)     1995  580,070 bp (1 circular)    TIGR                             Shotgun
Escherichia coli (bacteria)          1997  4,639,221 bp (1 circular)  University of Wisconsin-Madison  Shotgun
Methanococcus jannaschii (Archaeon)  1996  1,739,933 bp (3 circular)  DOE                              Shotgun
Saccharomyces cerevisiae (yeast)     1996  12,067,280 bp (16 linear)  100+ labs                        Mapping
Caenorhabditis elegans (nematode)    1998  97,000,000 bp (6 linear)   Consortium                       Mapping
Genome                               Date  Size                       Institute                               Method
Drosophila melanogaster (fruit fly)  2000  180,000,000 bp             UC Berkeley, Celera Genomics            Shotgun w/BAC map
Arabidopsis thaliana (angiosperm)    2000  125,000,000 bp (5 linear)  Consortium
Homo sapiens (human)                 2000  3,400,000,000 bp           Human Genome Project & Celera Genomics  Mapping & Shotgun
Sequencing the human genome:
Two major players:
Human Genome Project (HGP):

Publicly funded international consortium (NIH, DOE, etc.)

Francis Collins, National Human Genome Res. Inst. (NHGRI)

Began in U.S. in 1990 with a goal of 15 years

Genetic and physical mapping approach + dideoxy sequencing
Celera Genomics Corporation (CRA):

Spin-off of Applied Biosystems (ABI)

J. Craig Venter, CEO

Created in 1998 with a goal of 3 years

Direct shotgun approach + dideoxy sequencing (+ HGP’s maps
for validation)
Both groups collected blood and sperm samples from anonymous
male and female donors of different ethnic backgrounds.
Photos: J. Craig Venter (Celera Genomics); Francis Collins (Human Genome Project)
Milestone: 26 June 2000 White House press conference with Bill Clinton:
HGP:
Started 1990
~22.1 billion nucleotides of sequence data
7-fold coverage
Unfinished (24% completely finished, 50% near-finished)
Celera:
Started 1998
~14.5 billion nucleotides of sequence data
4.6-fold coverage
Complete assembled genome with >99% coverage
First assembled drafts of the human genome were published almost
simultaneously in Nature & Science on 15 & 16 February 2001 (Nature
published 1 day earlier).
How did Celera et al. assemble the sequences using shotgun methods?
Method A:
1. Assembly of 26.4 million 550 bp sequences (4.6-fold coverage),
without reference to a physical map of any kind.
2. Covered >99% of the genome.
3. 500 million trillion base-to-base comparisons.
4. 20,000 CPU hours (833 CPU days) on a year 2000 supercomputer.
Method B:
1. Used a BAC clone scaffold (combining lots of smaller maps) to validate
the whole-genome direct shotgun assembly approach.
2. Also helped resolve ambiguities resulting from the assembly of
short repeated DNA fragments.
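The read counts and coverage figures quoted for Method A can be checked with simple arithmetic. The sketch below assumes the usual definition of fold coverage (total sequenced bases divided by genome length). It also applies an idealized Poisson model, in which reads land uniformly at random and the fraction of bases hit at least once is 1 - e^(-coverage). That model is an added assumption, not something stated in the lecture, and it ignores repeats and cloning bias.

```python
import math


def fold_coverage(n_reads: int, read_len: int, genome_len: float) -> float:
    """Fold coverage = total sequenced bases / genome length."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(coverage: float) -> float:
    """Idealized Poisson estimate of the fraction of bases sequenced at least once."""
    return 1.0 - math.exp(-coverage)


# Figures quoted for the Celera assembly.
n_reads = 26_400_000
read_len = 550
total_bases = n_reads * read_len             # about 14.5 billion nucleotides

# Genome length implied by the quoted 4.6-fold coverage.
implied_genome_len = total_bases / 4.6       # about 3.2 billion bp

covered = expected_fraction_covered(4.6)     # roughly 0.99 under the model
```

Under these assumptions the quoted numbers are mutually consistent: 26.4 million reads of 550 bp give about 14.5 billion nucleotides, and 4.6-fold random coverage would be expected to touch roughly 99% of the bases.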
Features of the human genome:

32,000 genes estimated (50,000-100,000 were predicted).

Not many more genes than Drosophila, and only 50% more genes
than Caenorhabditis elegans (nematode worm).

Only 1-1.5% of the genome codes for protein.

50% of the sequence is repeated DNA.

Humans share 223 genes with bacteria that are not found in yeast,
nematodes, or fruit flies.
Next-generation shotgun genome sequencing:

The shotgun method is fundamentally the same, but uses shorter
read lengths (~100 bp paired-ends on Illumina).

300-500 bp fragments + mate-pairs of 2-12 kb to aid assembly

The throughput has increased and the cost has decreased.

Not uncommon to assemble trillions of sequence reads.
Some things to consider:
If error rates are high (454, Illumina), 30-50x genome
sequencing coverage is required to get a good genome.
If error rates are low (SOLiD, Ion Torrent), 4-5x coverage is
sufficient.

Costs are falling from $10K to $1K.
Sequencing is no longer the primary need; data storage/retrieval and
computational needs are outpacing everything else.
How much data storage does 1 human genome require?
About 1.5 GB (2 CDs) if you stored only one copy of each letter.
For the raw format containing image files and base quality data, 2-30
TB are required.
30-50x coverage requires more data storage capacity.
Sequence + quality scores are stored in a text format called FASTQ.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
FASTQ quality scores: '!' represents the lowest quality while '~' is the highest.
Left-to-right increasing order of quality (ASCII):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
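The quality line can be decoded into numeric scores. A minimal sketch, assuming the Sanger/Phred+33 convention, which is the one consistent with '!' being the lowest character (older Illumina pipelines used an offset of 64 instead). The helper names `phred_scores` and `error_probability` are illustrative. Each Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10).

```python
def phred_scores(quality: str, offset: int = 33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# The four-line FASTQ record shown above.
record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
header, seq, _separator, qual = record

scores = phred_scores(qual)
mean_quality = sum(scores) / len(scores)
```

Every base has exactly one quality character, so `len(seq)` and `len(qual)` must match in a valid record.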
Illumina Sequence Identifiers
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
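The identifier above packs several colon-separated fields into one line. The sketch below splits it according to the commonly documented Illumina layout (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control bits, index sequence). The class and function names are hypothetical, and other identifier variants exist that this sketch does not handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IlluminaId:
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int          # 1 or 2 for paired-end reads
    filtered: bool     # True if flagged "Y" (read did not pass the filter)
    control: int       # 0 when no control bits are set
    index_seq: str     # sample index (barcode) sequence


def parse_illumina_id(line: str) -> IlluminaId:
    """Parse an identifier of the form shown on the slide."""
    left, right = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, index_seq = right.split(":")
    return IlluminaId(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index_seq,
    )


rec = parse_illumina_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
```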
Sequence assembly & genotyping
• Trimming and filtering sequences based on base quality scores ->
Aligning reads to a reference genome -> Genotyping to determine
homozygous & heterozygous SNPs
http://gatkforums.broadinstitute.org/
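The first step of this pipeline, quality trimming, can be sketched in a few lines. This is a simplified illustration rather than the algorithm of any particular tool: it assumes Phred+33 encoding and removes bases from the 3' end until one meets a threshold. The function name `trim_3prime` and the default cutoff of 20 are assumptions chosen for the example.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Trim low-quality bases from the 3' end of a read.

    Bases are removed from the end until one has a Phred score >= min_q.
    Returns the trimmed (sequence, quality) pair.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40; '#' encodes Q2; '!' encodes Q0.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII#!")
```

Real trimming tools typically use sliding windows or running-sum criteria and also filter reads that become too short; alignment and genotyping are separate downstream steps not shown here.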
Post-genome sequencing era is very different:

Classical genetics studies started with a phenotype and set out
to identify the gene.

But we now have the ability to start with a complete genome
and set out to identify the phenotype.

Large data sets require many computational and
mathematical tools, which in turn require strong bioinformatics
skill sets.
Lots of applications:
1. Identify genes within genomic DNA sequences.
2. Align and match homologous gene sequences in databases and
seek to determine function.
3. Predict structure of gene products.
4. Describe interactions between genes and gene products.
5. Study gene expression.
1. Identifying genes in DNA sequences:

First step is annotation = identification and description of putative
genes and other important sequences.

Open reading frames (ORFs)
ORF = potential protein coding sequence that begins with a start
codon and ends with a stop codon.

ORFs come in all sizes.

Not all ORFs encode proteins (6-7% do not in yeast).

ORFs with introns can require sophisticated computer
algorithms to detect.
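The ORF definition above (start codon through stop codon) translates directly into a simple scan. The following is a minimal sketch that searches only the three forward reading frames of an intron-free sequence; the function name `find_orfs` and the `min_codons` cutoff are illustrative. Real gene finders also scan the reverse strand and model introns, as noted above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Find ORFs (ATG ... stop) in the three forward reading frames.

    Returns a list of (start, end, sequence) tuples, where `end` is the
    position just past the stop codon (0-based, half-open).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next ORF in this frame
    return orfs


# Made-up sequence containing one short ORF: ATG GCT TAA.
orfs = find_orfs("CCATGGCTTAAGG")
```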
2. Homology searches to assign gene function:

Homology search = identify gene function by searching a database for similar sequences.

Similarities reflect evolutionary relationships and shared function.

Homology searches are performed for nucleotides and amino acids
using BLAST = Basic Local Alignment Search Tool.

GenBank’s BLAST site: http://www.ncbi.nlm.nih.gov/BLAST/

Example, human mtDNA control region sequence:

TTCTCTGTTCTTCATGGGGAAGCAGATTTGGGTACCACCCAAGTATTGACTCACCC
ACAACAACCGCTATGTATTTCGTACATTACTGCCAGCCACCATGAATATTGCACGG
TACCATAAATACTTGACCACCTGTAGTACATAAAAACCCAATCCACATCAAAA
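To make "local alignment" concrete, the sketch below scores the best local alignment between two sequences with the Smith-Waterman dynamic-programming method. This is not BLAST: BLAST is a faster seed-and-extend heuristic that approximates this kind of exhaustive search. The scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)       # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
embedded = smith_waterman_score("GGACGTGG", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

A higher score means a longer or better-matching shared region; a database search ranks candidate sequences by such scores and reports their statistical significance.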
Fig. 9.2, Summary of genes in
the yeast genome.
3. Gene function can be identified and studied in other ways:

Gene knockout approach = systematically delete different genes
and observe the phenotypes (PCR + cloning is one method).

Synthesize recombinant proteins with modified amino acid
sequences and express them in E. coli.

Test effects of mutations that don’t exist in nature.

Study the transcriptome = complete set of mRNAs in a cell

mRNAs are not stable, and their types and levels change with
different experimental conditions.
1. Sample mRNA at experimental intervals and convert to cDNA
using reverse transcriptase.
2. Probe unknown cDNAs with a DNA microarray of PCR-generated
ORF sequences (requires known sequence for each probe).
3. Or better yet, sequence the entire transcriptome using
RNA-Seq = Whole Transcriptome Shotgun Sequencing of all
expressed RNAs.
http://www.nature.com/nbt/journal/v28/n5/images_article/nbt0510-421-F1.gif
Fig. 9.7b, Microarray study
of gene expression
“Proteomics”:
Proteome = complete set of expressed proteins in a cell
Major goals of proteomics:
• Identify every protein, isolate and purify.
• Determine the sequence and structure of each protein (and its
function).
• Create a database with the sequence of each protein.
• Analyze protein levels and interactions in different cell types, at
different times, and at different stages of development.
Rationale:

Genes are two steps removed from disease (DNA -> mRNA ->
protein).

Most gene products involved in disease are composed of protein.

Understanding protein means understanding disease.
http://biol.lf1.cuni.cz/ucebnice/en/proteomics.htm
“Systems Biology”
Computational and mathematical modeling of complex biological systems (Wikipedia).
Requires integration of genomic, proteomic, and metabolic data.