Lecture 14 - University of New England

Bioinformatics
Lecture 14
• Genome sequencing projects
• Hierarchical and Shotgun approaches
• Genome assembly
• TIGR Assembler
• Ensembl
Genome size
Mammalian genome ~ 3 gigabases = 3 x 10^9 base pairs
How many books are needed to print the entire
mammalian genome?
1,500 letters per page x 1,000 pages per book x 2,000 books = 3 x 10^9 letters
Assuming 5 cm per book, this shelf is ~100 meters long!
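The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the constants are the figures quoted on the slide, while the function and variable names are illustrative.

```python
# Figures quoted on the slide; names are illustrative only.
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOKS = 2_000
BOOK_THICKNESS_CM = 5  # spine width assumed on the slide


def printed_genome(letters_per_page, pages_per_book, books, thickness_cm):
    """Return (total letters printed, shelf length in metres)."""
    total_letters = letters_per_page * pages_per_book * books
    shelf_m = books * thickness_cm / 100  # centimetres to metres
    return total_letters, shelf_m


total, shelf = printed_genome(
    LETTERS_PER_PAGE, PAGES_PER_BOOK, BOOKS, BOOK_THICKNESS_CM
)
print(f"{total:,} letters on a shelf {shelf:.0f} m long")
```

The total matches the 3 x 10^9 base pairs of a mammalian genome, one letter per base.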
Genome sequencing: the problem
Sequencing read lengths vary with several parameters, but 600 to 800
nucleotides is a good estimate. To sequence much larger fragments, or
even a whole genome, essentially two strategies have been devised.
a) The hierarchical approach. Depending on the vector used for cloning,
BAC, YAC, cosmid or other libraries of cloned inserts are usually
created. The insert size may vary from tens to hundreds of thousands of
base pairs. Sub-fragments obtained by restriction digestion are mapped
to build a single contig, from which a minimal set of sub-fragments can
be selected and sequenced, thus limiting sequence redundancy.
b) The shotgun approach. This can be applied to a DNA sequence of any
size, including a whole genome. DNA is randomly fragmented by sonication
or shearing. Following fragmentation and enzymatic end repair, the DNA
fragments are ligated into a plasmid vector and a bacterial host is
transformed to produce a library. Clones taken at random from the
library are then sequenced from both ends using two universal primers.
At this stage a shotgun is characterised by its depth, i.e. the
cumulative length of sequence determined divided by the length of the
fragment or genome to be sequenced. For example, for an estimated size
of 4 Mb, a 10X shotgun corresponds to the assembly of about 60,000 reads
with a mean size of 650 nt. The resulting sequences are assembled into a
single contig representing the whole fragment by sequence comparison,
using appropriate bioinformatic programs. The final stage, or "polishing
stage", corresponds to the elimination of gaps and other remaining
problems.
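The depth definition above is a simple ratio, and the 4 Mb example can be reproduced directly. A minimal sketch, with illustrative function names:

```python
import math


def shotgun_depth(n_reads, mean_read_len, target_len):
    """Depth = cumulative sequenced length / length of the target."""
    return n_reads * mean_read_len / target_len


def reads_needed(depth, mean_read_len, target_len):
    """Smallest number of reads that reaches the requested depth."""
    return math.ceil(depth * target_len / mean_read_len)


# Example from the text: 4 Mb target, 650 nt mean read length.
depth = shotgun_depth(60_000, 650, 4_000_000)
needed = reads_needed(10, 650, 4_000_000)
print(f"60,000 reads give {depth:.2f}X; 10X needs {needed:,} reads")
```

With these figures, 60,000 reads give a depth just under 10X, which is why the text says "about" 60,000 reads.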
Shotgun approach
Genome assembly
Assembly of contiguous DNA sequences
• Sequencing projects have rapidly moved to using the two approaches
sequentially.
• For example, the construction of a BAC map covering an entire genome or
chromosome is followed by a shotgun strategy to sequence a minimal set of
BACs.
• The change introduced by J. Craig Venter was the size of the DNA
fragment or genome that was shotgunned directly. Increasing the size of
shotgun projects depended on the development of robots adapted to
high-throughput work and of bioinformatic programs that solve two major
problems.
• The first is quantitative: the capacity to store, compare and retrieve
millions of reads corresponding to billions of nucleotides (the database
problem).
• The second is the presence of numerous repeat sequences that are often
longer than the mean read length, which complicates correct assembly (the
assembly problem).
Fragment assembly problem
The Shortest Superstring Problem (find the shortest string that contains
every read as a substring) is a challenge in itself, but it is a simplified
abstraction of fragment assembly, which must also take three other
difficulties into account.
1. Sequence data are not perfect, and erroneous reads are possible.
2. Repeats are numerous. There are about a million copies of the ~300 bp
Alu repeat, and many other repeats. Fortunately, copies of a repeat may
differ slightly because of mutation.
3. DNA is double-stranded, so the orientation of each substring is unknown:
it is not known which strand should be used in the reconstruction.
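To make the abstraction concrete, here is a minimal sketch of a greedy heuristic for the Shortest Superstring Problem. It assumes error-free reads that all come from the same strand and contain no repeats, which is exactly what the three difficulties above rule out in practice. All names are illustrative. The `reverse_complement` helper only shows how a read from the opposite strand would be converted; the sketch does not try both orientations.

```python
from itertools import permutations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no exact overlap of at least min_len


def greedy_superstring(reads, min_len=3):
    """Repeatedly merge the pair of strings with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing left to merge: several contigs remain
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])
    return reads


contigs = greedy_superstring(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

The greedy choice is a heuristic: it is not guaranteed to find the shortest superstring, and with repeats it can merge reads that do not belong together.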
Most fragment assembly algorithms include the following three steps:
• Overlap. The problem is to find the best match between the suffix of one
sequence and the prefix of another. The difficulties above call for
variations of the dynamic programming algorithm combined with filtration
methods.
• Layout. This is the hardest step in DNA assembly, and it becomes even more
computationally demanding as the number of fragments increases. The most
difficult part is deciding whether two fragments with a good overlap really
overlap or instead represent a repeat or something else.
• Consensus. This step finds the most frequent character in each column of
the layout constructed in the layout step. More sophisticated algorithms
align substrings in small windows along the layout, or use a mosaic of the
best segments (those with high probabilistic scores) from the layout.
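The overlap and consensus steps can be sketched in a few lines. This is a deliberately simplified illustration, not the method of any particular assembler. The overlap function tolerates substitutions by counting mismatches (a Hamming-distance stand-in for the dynamic programming mentioned above, so it does not handle insertions or deletions). The consensus is a plain per-column majority vote. The layout is assumed to be given as (offset, read) pairs, and all names are hypothetical.

```python
from collections import Counter


def approx_overlap(a, b, min_len=4, max_mismatches=1):
    """Longest suffix of a matching a prefix of b with few substitutions."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def consensus(layout):
    """Majority vote per column; layout is a list of (offset, read) pairs."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Columns covered by no read are reported as N.
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# Three overlapping reads; the second carries one substitution error.
layout = [(0, "ATGGCGTAC"), (2, "GGCCTACCT"), (4, "CGTACCTGA")]
k = approx_overlap("ATGGCGTAC", "GGCCTACCT")
result = consensus(layout)
print(k, result)
```

In the example the erroneous base is outvoted two to one by the other reads covering that column, which is why sufficient depth matters for accuracy.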
Genome assembly from smaller sequence fragments
TIGR Assembler
• TIGR Assembler is open-source software.
• It is a sequence fragment assembly program that builds contigs from
small sequence reads.
• It is versatile, offering a wide variety of options for tuning the
assembly process and analyzing sequence data. The current assembly
engine uses a greedy algorithm and heuristics to build contigs, find
repeat regions, and target alignment regions.
• Sequence overlaps are detected and scored using a 32-mer hash.
• Sequence alignment and merging are done using a Smith-Waterman
dynamic programming algorithm.
• Gap penalties and score values corresponding to the bases and their
quality values are predefined and hard-coded into the program.
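The two techniques named above, a k-mer hash to find candidate overlaps and Smith-Waterman to score them, can be illustrated as follows. This is a generic sketch and not TIGR Assembler's code. The toy reads use k = 5 rather than 32, the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example, and base quality values are ignored.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    return index


def candidate_pairs(reads, k):
    """Pairs of reads sharing at least one k-mer (the filtration step)."""
    pairs = set()
    for rids in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, linear gap penalty, two-row table."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


reads = ["ATGGCGTACC", "GCGTACCTGA", "TTTTTTTTTT"]
pairs = candidate_pairs(reads, k=5)
scores = {p: smith_waterman_score(reads[p[0]], reads[p[1]]) for p in pairs}
print(scores)
```

Only pairs that share a k-mer are passed to the quadratic-time alignment, which is the point of hashing first: most read pairs in a large project never need to be aligned at all.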
Genome assembly – contigs and supercontigs alignment
• It is very difficult to produce a finished, continuous sequence for
genomes with the level of redundancy typical of many higher eukaryotes.
• Instead, a draft sequence of about 150,000 contigs will be generated,
which can be combined into a few thousand supercontigs.
• The parallel production of a dense radiation hybrid (RH) map will not
only facilitate the assembly of the contigs into supercontigs, but will
also make it possible to order the supercontigs, a necessary step for
understanding genome rearrangements and synteny.
Figure: comparative map of dog chromosome 5 (CFA5) aligning the RH map,
the meiotic map and the cytogenetic (FISH) map with human (HSA)
chromosome regions, including 11q22-q23, 16q24 and 1p32. Scale labels on
the figure: 650.2 cR5000, 85 cM, 99 Mb.
Mouse Genome: sequencing and assembly
• The mouse genome is about 14% smaller than the human genome (2.5 Gb
compared with 2.9 Gb), probably owing to a higher rate of deletions.
• Over 90% of the mouse and human genomes can be partitioned into
corresponding regions of conserved synteny.
• The sequencing strategy included four approaches: 1) construction of a
BAC-based physical map by fingerprinting and sequencing the clone ends,
2) whole-genome shotgun (WGS) sequencing to ~7-fold coverage and assembly
to generate an initial draft, 3) hierarchical shotgun sequencing of BAC
clones combined with WGS to create a hybrid WGS-BAC assembly,
4) production of finished sequence by using the BAC clones as templates
for direct finishing.
• About 41 million reads were generated by the project participants, of
which 33.6 million passed quality checks and 29.7 million were paired
(from opposite ends of the same clone). Clone inserts provide ~47-fold
physical coverage of the genome.
• Genome assembly was achieved using two newly developed programs,
Arachne and Phusion.
• The assembly contains 224,713 contigs, connected into 7,418 supercontigs.
The 200 largest supercontigs span more than 98% of the assembled sequence,
of which 3% is within sequence gaps.
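Some of the figures above can be cross-checked against each other. The sketch below uses only numbers quoted on this slide. The implied mean read length is a rough derived value that assumes every read passing quality checks contributes equally to the ~7-fold coverage; it is not a figure reported by the project.

```python
# Figures quoted on the slide; variable names are illustrative.
HUMAN_GB = 2.9
MOUSE_GB = 2.5
CONTIGS = 224_713
SUPERCONTIGS = 7_418
READS_PASSED = 33.6e6
SEQUENCE_COVERAGE = 7  # approximate fold coverage

# Fraction by which the mouse genome is smaller than the human genome.
size_reduction = 1 - MOUSE_GB / HUMAN_GB

# Average number of contigs linked into one supercontig.
contigs_per_supercontig = CONTIGS / SUPERCONTIGS

# Rough mean read length implied by coverage = reads * length / genome size.
implied_read_len = SEQUENCE_COVERAGE * MOUSE_GB * 1e9 / READS_PASSED

print(f"mouse genome is {size_reduction:.1%} smaller")
print(f"{contigs_per_supercontig:.1f} contigs per supercontig on average")
print(f"implied mean read length ~{implied_read_len:.0f} nt")
```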
Ensembl: An Open-Source Tool
Ensembl consists of two main parts:
1) The analysis pipeline, which regularly adds new data and analyses to the
core database. The database contains DNA sequences, predicted features on
the sequences and a complete body of evidence supporting these predictions.
Ensembl known genes are therefore those predicted genes that have high
similarity to genes confirmed by experimental evidence.
2) The API (application programming interface), which gives structured access
to the data. The ease of retrieving information in a meaningful form makes
the API an extremely powerful tool. The initial implementation of the API is
in Perl, built upon a layer of BioPerl objects. Implementations in other
languages, such as Java and Python, are also in use.
Ensembl is based around two ideas: a golden path (the path through the data
containing nonredundant sequence) and the virtual contig (a contig
determined by the user, i.e. an arbitrary region of a chromosome).
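The two ideas can be illustrated with a toy data structure. This is not the Ensembl API: the class, its fields and the function below are all hypothetical. The sketch also assumes the golden path tiles the chromosome with non-overlapping, gap-free, forward-oriented segments. A virtual contig is then the sequence of any requested chromosome region, stitched together from whichever segments it crosses.

```python
from dataclasses import dataclass


@dataclass
class GoldenPathSegment:
    """Hypothetical record: one contig's placement on a chromosome."""
    contig_id: str
    chrom_start: int  # 1-based, inclusive
    chrom_end: int    # 1-based, inclusive
    sequence: str


def virtual_contig(golden_path, start, end):
    """Sequence of chromosome region [start, end], 1-based inclusive."""
    pieces = []
    for seg in sorted(golden_path, key=lambda s: s.chrom_start):
        lo = max(start, seg.chrom_start)
        hi = min(end, seg.chrom_end)
        if lo <= hi:  # the segment intersects the requested region
            pieces.append(
                seg.sequence[lo - seg.chrom_start:hi - seg.chrom_start + 1]
            )
    return "".join(pieces)


path = [
    GoldenPathSegment("c1", 1, 5, "ACGTA"),
    GoldenPathSegment("c2", 6, 10, "GGCCT"),
    GoldenPathSegment("c3", 11, 14, "TTAA"),
]
region = virtual_contig(path, 4, 12)
print(region)
```

The user asks for chromosome coordinates and never needs to know where the underlying contig boundaries fall, which is what makes the virtual contig a convenient abstraction.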
The NCBI and UCSC web sites contain systems similar to Ensembl.