Sequencing genomes


The lecture on 13 March is cancelled.
Last lecture summary
• recombinant DNA technology
• DNA polymerase (copy DNA), restriction endonucleases (cut DNA),
ligases (join DNA)
• DNA cloning – vector (plasmid, BAC), PCR
• genome mapping
• genetic map: relative locations of genes are established by
following inheritance patterns
• cytogenetic map: visual appearance of a chromosome when
stained and examined under a microscope
• sequence map: the order and spacing of the genes, measured
in base pairs
• genetic markers
• polymorphic (alternative alleles)
• restriction fragment length polymorphisms (RFLPs)
• some restriction sites exist as two alleles
• simple sequence length polymorphisms (SSLPs)
• repeat sequences, minisatellites (repeat unit up to 25 bp),
microsatellites (repeat unit of 2-4 bp)
• single nucleotide polymorphisms (SNPs, pron.: “snips”)
• Positions in a genome where some individuals have one nucleotide and
others have a different nucleotide
[Figures: RFLP and SSLP]
New stuff
DNA sequencing
• Sanger method (chain-termination method),
published 1977, Nobel Prize in Chemistry 1980
• The key principle: use of dideoxynucleotide triphosphates
(ddNTPs) as DNA chain terminators.
[Figure: structures of a dNTP and a ddNTP]
source: http://openwetware.org/wiki/BE.109:Bio-material_engineering/Sequence_analysis
source: wikipedia
Shotgun sequencing
• Current technology can only reliably sequence a short stretch;
a ‘read’ is typically ~1000 bp.
• However, genomes are large. The sequence of a long DNA
molecule has to be constructed from a series of shorter
sequences.
• This is done by breaking (cleaving by restriction endonuclease)
the molecule into fragments, determining their sequences, and
using a computer to search for overlaps and build up the
master sequence.
• Shotgun sequencing is the standard approach for
sequencing small prokaryotic genomes.
• It is much more difficult with larger eukaryotic genomes, because
repeats can lead to assembly errors.
• The human genome is repeat-rich: >50% repeats (50-500 kbp duplicated
regions with >98% identity)
[Figure: shotgun sequencing workflow: target, copies, shotgun, sequence each short piece, sequence assembly, consensus, finalizing (directed read)]
source: slides by Martin Farach-Colton
source: Brown T. A. , Genomes. 2nd ed. http://www.ncbi.nlm.nih.gov/books/NBK21129/
Human genome project (HGP)
• Determine the sequence of haploid human
genome
• Governmentally funded (DOE, NIH)
• Began in 1990, working draft published
in 2001, complete in 2003, last chromosome
finished in 2006
• Cost: $3 billion
• Whose genome was sequenced?
• The “reference genome” is a composite from several people who
donated blood samples.
Celera - competition begins
• In 1998, a similar, privately funded quest was launched
by the American researcher Craig Venter and his company
Celera Genomics.
• The $300,000,000 Celera effort was intended to proceed at
a faster pace and at a fraction of the cost.
• Celera wanted to patent identified genes.
• Celera promised to release data annually (the HGP released
data daily). However, unlike the HGP, Celera would not permit free
redistribution or scientific use of the data.
• For this reason, the HGP was compelled to release the first draft of the
human genome (7 July 2000) before Celera.
How did it end?
• March 2000: President Clinton announced that the
genome sequence could not be patented and should be
made freely available to all researchers.
• The statement sent Celera's stock plummeting and
dragged down the biotechnology-heavy Nasdaq. The
biotechnology sector lost about $50 billion in two days.
• Celera and the HGP jointly announced the draft sequence in
2000.
• The drafts covered about 83% of the genome.
• Improved drafts were announced in 2003 and 2005, filling
in the sequence to its current ≈92%.
Human genome
• 3 billion bp, ~20 000 – 25 000 genes
• Only 1.1 – 1.4 % of the genome sequence codes for proteins.
• State of completion:
• best estimate – 92.3% is complete
• problematic unfinished regions: centromeres, telomeres (both contain
highly repetitive sequences), some unclosed gaps
• It is likely that the centromeres and telomeres will remain unsequenced
until new technology is developed
Databases
• Genome is stored in databases
• Primary database (NCBI) – GenBank
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
• Additional data and annotation, tools for visualizing and searching
• UCSC (http://genome.ucsc.edu) … University of California – Santa Cruz
• Ensembl (http://www.ensembl.org) … EBI+Sanger
• Chromosome
• largest #1 = 250 Mbp, smallest #21 = 48 Mbp
• http://www.ensembl.org/Homo_sapiens/Location/Genome
Hierarchical genome shotgun – HGS
• Hierarchical genome shotgun, hierarchical shotgun
sequencing, clone-by-clone sequencing, map-based
shotgun sequencing, clone contig sequencing
• Adopted by HGP
• Strategy “map first, sequence second”
• Create physical map
• Divide chromosomes into smaller fragments.
• Order (map) them to correspond to their respective
locations on the chromosomes.
• Determine the base sequence of each of the mapped
fragments.
Hierarchical genome shotgun – HGS
1. Map genome
• As genetic markers (landmarks), sequence-tagged sites (STS) were used
(200 to 500 base pair DNA sequence that has a single occurrence in
the genome)
2. Copy target DNA
3. Make BAC library
• cleave (partial cleavage by restriction endonuclease) all target
DNA copies randomly, insert these sub-clones into BACs
4. Physically map all BACs
5. Find a subset of BACs that cover target DNA
• minimal tiling path
6. Shotgun sequence only BACs on the minimal tiling path
• Divide BACs into fragments (ultrasound or pressure), do plasmid
cloning, reconstruct BAC sequence
7. Fill in gaps between BACs
8. Merge into consensus sequence
The sequenced sub-clones are linked up to produce the DNA contig, which is the
de-coded version of the original source DNA. As this method progresses, larger
and larger contigs will be produced, until a single ordered contig of the
genome is achieved.
http://www.nature.com/scitable/content/idealized-representation-of-the-hierarchical-shotgun-sequencing-48221
Minimal tiling path
A collection of overlapping bacterial artificial chromosome (BAC) clones. The clones
outlined in red, which provide a minimal tiling path across the corresponding
genomic region, are selected for sequencing.
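Choosing a minimal tiling path can be viewed as an interval-cover problem. The sketch below is only an illustration, not the procedure used by the HGP: it assumes each BAC clone is already placed on the physical map as a hypothetical (name, start, end) tuple, and it greedily picks, at each step, the clone that reaches furthest to the right. Real projects also require a minimum overlap between neighbouring clones, which is left out here.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover over mapped clones.

    clones: list of (name, start, end) tuples (hypothetical map coordinates).
    Returns the selected clones in left-to-right order.
    """
    ordered = sorted(clones, key=lambda c: c[1])  # sort by start coordinate
    path = []
    position = region_start  # leftmost base not yet covered
    i = 0
    while position < region_end:
        best = None
        # Among clones starting at or before the current position,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= position:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= position:
            raise ValueError(f"gap in clone coverage at position {position}")
        path.append(best)
        position = best[2]
    return path


# Made-up example: six overlapping clones across a 500-unit region.
example_clones = [
    ("A", 0, 150), ("B", 50, 200), ("C", 100, 300),
    ("D", 180, 320), ("E", 290, 450), ("F", 310, 500),
]
tiling = minimal_tiling_path(example_clones, 0, 500)
print([name for name, _, _ in tiling])  # ['A', 'C', 'E', 'F']
```

Clones B and D are skipped because the clones chosen around them already cover the same stretch, so only four of the six clones would need to be sequenced.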
Coverage
• As shown above, individual nucleotides are represented
by more than one read.
• Coverage is the average number of reads representing a
given nucleotide in the reconstructed sequence.
• Let’s say that for a source strand of length G = 100 kbp
we sequence R = 1 500 reads of average length L = 500 bp.
• Thus, we collect N = RL = 750 kbp of data.
• So we have sequenced every bp in the
source N/G = 7.5 times on average.
• The coverage is 7.5X.
• Coverage in HGS adopted by HGP was 8X.
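The coverage arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the second function simply inverts N/G to estimate how many reads a target coverage would need, assuming reads of uniform average length.

```python
import math


def coverage(genome_length_bp, num_reads, mean_read_length_bp):
    """Average coverage N / G, where N = R * L is the total sequenced data."""
    return num_reads * mean_read_length_bp / genome_length_bp


def reads_needed(target_coverage, genome_length_bp, mean_read_length_bp):
    """Number of reads R such that R * L / G reaches the target coverage."""
    return math.ceil(target_coverage * genome_length_bp / mean_read_length_bp)


# The example from the slide: G = 100 kbp, R = 1 500 reads, L = 500 bp.
print(f"{coverage(100_000, 1_500, 500):.1f}X")  # 7.5X

# Reads of 500 bp needed for 8X coverage of a 3 billion bp genome.
print(reads_needed(8, 3_000_000_000, 500))  # 48000000
```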
Whole genome shotgun – WGS
• Adopted by Celera.
• De facto application of shotgun to a large genome. Never
done before on such a large scale.
• Expensive and time-consuming mapping is not performed.
• The DNA is cut into smaller fragments. Each
fragment is sequenced first, and then overlapping
sequences are joined together to create contigs.
• To achieve the required accuracy, higher coverage (20X) was
used.
• The development of new algorithms was crucial for the
assembly.
Genome assembly
• Aligning and merging short fragments of DNA sequence in
order to reconstruct the original (longer) sequence.
• reads – typically 500-1000 bp, merge them into contigs,
arrange/merge contigs into scaffolds
• scaffold – a series of contigs that are in the right order but are not
necessarily connected in one continuous stretch of sequence
source: Xiong, Essential Bioinformatics
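The idea of merging reads into contigs can be sketched with a toy greedy assembler. It is an illustration only, assuming error-free reads that all come from the same strand: it repeatedly merges the pair of sequences with the longest exact suffix-prefix overlap. Production assemblers also have to handle sequencing errors, reverse-complement reads, and repeats, none of which is covered here. The function names and the min_len parameter are made up for this example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by largest overlap; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads taken from the made-up sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

If the reads do not all overlap, the function returns several contigs. Ordering those into scaffolds needs extra information that this sketch does not model.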
Genome assembly
• Can be very computationally intensive when performed at the
whole-genome level.
• Major challenges:
• sequence errors – can be corrected by drawing consensus
sequence from an alignment of multiple overlapped sequences
• contamination by bacterial vectors – can be removed using filtering
programs prior to assembly
• repeats – RepeatMasker (http://www.repeatmasker.org/) can be
used to detect and mask repeats
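The correction of sequence errors by consensus can be illustrated with a simple majority vote. The sketch assumes the reads are already aligned to common coordinates and padded with "-" where a read does not cover a position; both the padding convention and the function name are assumptions made for this example. Ties go to the base seen first, and uncovered columns are reported as "N".

```python
from collections import Counter


def consensus(aligned_reads, pad="-"):
    """Column-wise majority vote over reads aligned to the same coordinates."""
    length = max(len(read) for read in aligned_reads)
    result = []
    for col in range(length):
        # Collect the bases of all reads that actually cover this column.
        column = [read[col] for read in aligned_reads
                  if col < len(read) and read[col] != pad]
        if not column:
            result.append("N")  # no read covers this position
        else:
            result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# Made-up example: the second read carries one error (C instead of A).
aligned = [
    "ACGTTAGC--",
    "--GTTCGCAT",
    "ACGTTAGCAT",
]
print(consensus(aligned))  # ACGTTAGCAT
```

The erroneous C is outvoted because two other reads cover the same position, which is one reason why coverage well above 1X is needed.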
Sequence assemblers
• base calling – convert raw/processed data from a
sequencing instrument into sequences and scores
• individual bases have scores, reflect the likelihood the base is
correct/incorrect
• in capillary sequencing, identify the sequence from
chromatogram
source: Lee SH, Vigliotti VS, Pappu S., J Clin Pathol. 2010, 63(3) 235-9 PMID: 19858529
PHRED
• Base caller
• Reads DNA sequence chromatogram files and analyzes
the peaks to call bases.
• base quality score – PHRED examines peaks around
each base call to assign a score q to each base call that
is logarithmically linked to the error probability P:
$q = -10 \log_{10}(P)$; a typical threshold is q = 20, corresponding to 1%
error probability (99% correct base calling)
• Phred is a two-step process:
• Training: Given a set of reads, labels as to which bases are correct,
and a set of quality statistics for each base, produce a model that
can predict error rates for unseen bases
• Application: Given new reads and quality statistics, predict the
quality for each of the bases.
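The relation between the quality score q and the error probability P is easy to check numerically. The sketch below implements only the formula q = -10 log10(P) and its inverse; it does not reproduce how PHRED derives P from the chromatogram peaks. The masking helper and its threshold parameter are hypothetical, added to show how a cutoff such as q = 20 might be applied.

```python
import math


def phred_quality(p):
    """Quality score q = -10 * log10(P) for an error probability P."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_probability(q):
    """Inverse relation: P = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


def mask_low_quality(bases, qualities, min_q=20.0):
    """Replace base calls with a score below min_q by 'N' (hypothetical helper)."""
    return "".join(b if q >= min_q else "N" for b, q in zip(bases, qualities))


# q = 20 corresponds to a 1% error probability, q = 30 to 0.1%, and so on.
for q in (10, 20, 30, 40):
    print(f"q = {q}: P = {error_probability(q):.4%}")

print(mask_low_quality("ACGT", [30, 12, 25, 20]))  # ANGT
```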
PHRAP
• Sequence assembly
• Takes PHRED base-call files with quality scores as input.
• Aligns individual fragments in a pairwise fashion. The
base quality information is taken into account during the
pairwise alignment.
Personal human genomes
• Personal genomes had not been sequenced in the
Human Genome Project to protect the identity of
volunteers who provided DNA samples.
• The following personal genomes were available by July 2011:
• Japanese male (2010, PMID: 20972442)
• Korean male (2009, PMID: 19470904)
• Chinese male (2008, PMID: 18987735)
• Nigerian male (2008, PMID: 18987734)
• J. D. Watson (2008, PMID: 18421352)
• J. C. Venter (2007, PMID: 17803354)
• The HGP sequence is haploid; the sequence maps
of Venter and Watson, however, are diploid.
Next-generation sequencing (NGS)
• The completion of the human genome was just the start of
the modern DNA sequencing era: “high-throughput next-
generation sequencing” (NGS).
• New approaches reduce time and cost.
• Holy Grail of sequencing: a complete human genome
for under $1000.
• Archon X Prize
• http://genomics.xprize.org/
• A $10 million prize is to be awarded to the private company that is
able to sequence 100 human genomes within 10 days at a cost of no
more than $10 000 per genome
1st and 2nd generation of sequencers
• 1st generation – ABI Prism 3700 (Sanger, fluorescence, 96
capillaries), used in HGP and in Celera
• The Sanger method still outperforms NGS in read length (~600 bp)
• 2nd generation: birth of HT-NGS in 2005. 454 Life
Sciences developed the GS 20 sequencer, which combines PCR
with pyrosequencing.
• Pyrosequencing – sequencing-by-synthesis
• Relies on detection of pyrophosphate release on nucleotide
incorporation rather than chain termination with ddNTPs.
• The release of pyrophosphate is detected by a flash of light
(chemiluminescence).
• Average read length: 400 bp
• Roche GS-FLX 454 (successor of GS 20) used for J.
Watson’s genome sequencing.