Phred/Phrap/Consed


Phred/Phrap/Consed Analysis
A User’s View
International Training Course on Bioinformatics
Applied to Genomic Studies
Rio de Janeiro 2001
Arthur Gruber
Faculty of Veterinary Medicine and Zootechny
University of São Paulo
BRAZIL
What is Phred/Phrap/Consed ?
Phred/Phrap/Consed is a package, distributed worldwide, for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
Why assemble?
• Current DNA sequencing methods generate reads of 500-700 bp – resolution limit of electrophoresis
• Whole genomes or large clones need to be fragmented - clone library
• Short fragments are randomly sequenced (shotgun approach) – reads are assembled to form the final consensus sequence
Whole genome or BAC/cosmid clone
DNA fragmentation: sonic disruption, nebulization
Small fragments: 1.0 - 2.0 kb
Clone library: pUC18
DNA sequencing: random clones
Partial assembly: contigs
Finishing: quality, coverage of both strands, gap filling
Whole genome or BAC/cosmid clone: final consensus sequence
How to deal with the enormous amount
of reads generated by the high
throughput DNA sequencers?
Sanger Centre
Phred
Genome Research 8: 175-185, 1998
Phred
Genome Research 8: 186-194, 1998
Phred
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file
formats: SCF (standard chromatogram format), ABI
(373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – assigns a base to each
identified peak, with a lower error rate than the
standard base calling programs.
c. Assigns quality values to the bases – a “Phred
value” based on an error rate estimation
calculated for each individual base.
d. Creates output files – base calls and quality
values are written to output files.
Trace File
High quality region – no ambiguities (Ns)
Trace File
Medium quality region – some ambiguities (Ns)
Trace File
Poor quality region – low confidence
Phred value formula
q = -10 × log10(p)
where
q - quality value
p - estimated error probability for a base call
Examples:
q = 20 means p = 10^-2 (1 error in 100 bases)
q = 40 means p = 10^-4 (1 error in 10,000 bases)
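The formula above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic; the function names are chosen for illustration and are not part of Phred.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Return q = -10 * log10(p) for an error probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Invert the formula: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


# The two examples from the slide.
q20 = error_prob_to_phred(1e-2)   # about 20
p40 = phred_to_error_prob(40)     # about 1e-4
```

Because the scale is logarithmic, every 10 units of q correspond to a tenfold drop in the estimated error probability.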
The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS:99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
g 9 8371
g 13 8386
a 14 8397
g 7 8417
g 9 8427
g 4 8445
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
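Each line between BEGIN_DNA and END_DNA carries a base call, its quality value and its position in the trace. A minimal Python sketch for reading that block is shown here; the names BaseCall and parse_phd_dna are hypothetical, and the sketch ignores the comment block entirely.

```python
from typing import List, NamedTuple


class BaseCall(NamedTuple):
    base: str       # called base (a, c, g, t or n)
    quality: int    # Phred quality value
    position: int   # peak position in the trace array


def parse_phd_dna(text: str) -> List[BaseCall]:
    """Collect the base calls found between BEGIN_DNA and END_DNA."""
    calls: List[BaseCall] = []
    in_dna = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            break
        elif in_dna and line:
            fields = line.split()
            if len(fields) < 3:
                raise ValueError(f"unexpected DNA line: {line!r}")
            calls.append(BaseCall(fields[0], int(fields[1]), int(fields[2])))
    return calls


# First four base calls of the example file above.
example = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""
calls = parse_phd_dna(example)
sequence = "".join(c.base for c in calls)
```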
Phrap
Phragment Assembly Program
or… Phil’s Revised Assembly Program
Phrap is a program for assembling shotgun DNA
sequence data
Key Features:
a. Uses the entire read content – no need for trimming.
b. User supplied (e.g. Repbase) + internally computed data –
better accuracy of assembly in the presence of repeats.
c. Contig sequence is constituted by a mosaic of the highest
quality parts of the reads – it’s not a consensus!
d. Provides extensive information about assembly – contained in
phrap.out, *.ace and *.screen.contigs.qual files
e. Handles very large datasets – hundreds of thousands of reads
are easily manipulated.
f. Generates output files – these contain important data and enable
visualization by other programs
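The difference between a consensus and a quality-driven choice can be illustrated with a toy example. The sketch below is not Phrap's algorithm (Phrap selects stretches of reads, not single columns); it only contrasts a majority vote with picking the highest-quality base at each aligned position. All names and data are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

# One aligned position: (base, Phred quality) for every read covering it.
Column = List[Tuple[str, int]]


def majority_base(column: Column) -> str:
    """Classic consensus: the most frequent base wins, quality is ignored."""
    return Counter(base for base, _ in column).most_common(1)[0][0]


def best_quality_base(column: Column) -> str:
    """Quality-driven choice: the base with the highest quality wins."""
    return max(column, key=lambda call: call[1])[0]


columns: List[Column] = [
    [("a", 40), ("a", 35), ("a", 12)],
    [("c", 8), ("c", 6), ("g", 44)],   # two poor reads against one good read
    [("t", 30), ("t", 33), ("t", 31)],
]

by_majority = "".join(majority_base(col) for col in columns)
by_quality = "".join(best_quality_base(col) for col in columns)
```

At the second position the two approaches disagree: the vote follows the two low-quality reads, while the quality-driven choice follows the single high-quality read.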
Phrap output files
• *.contigs – fasta file containing the contigs
  - Contigs with more than one read
  - Singletons (single reads with a match to some other contig but that couldn’t be merged consistently into it)
• *.singlets – fasta file of the singlet reads
  - Reads with no match to any other read
• *.ace – allows for viewing the assembly using Consed
• *.view – required for viewing the assembly using Phrapview
Consed
Genome Research 8: 195-202, 1998
Consed
Consed is a program for viewing and editing
assemblies produced by Phrap
Key Features:
a. Assembly viewer - allows for visualization of contigs, assembly
(aligned reads), quality values of reads and final sequence.
b. Trace file viewer – single and multiple trace files can be
visualized allowing for comparison of a given sequence in several
reads.
c. Navigation – identifies and lists regions which are below a given
quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish – automatic set of functions for: gap closure,
improvement of sequence quality, determination of relative
orientation of contigs, identification of regions covered by a
single read or by reads of a single strand. The program
automatically performs primer picking and chooses the
templates.
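The idea behind navigating to regions below a quality threshold can be sketched as follows. This is an illustrative Python function, not Consed code; the function name, the example qualities and the threshold of 20 are all assumptions.

```python
from typing import List, Tuple


def low_quality_regions(quals: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of
    consecutive positions whose quality is below the threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i            # a new low-quality run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:            # run extends to the end of the sequence
        regions.append((start, len(quals)))
    return regions


quals = [44, 39, 12, 9, 4, 33, 36, 15, 19]
regions = low_quality_regions(quals, threshold=20)
```

Each returned pair is a candidate region to inspect in the trace viewer or to target with additional reads.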
Phred/Phrap/Consed Pipeline
Input - chromatogram files
Quality (confidence) value assignment - Phred
  Output: phd files - *.phd
Conversion, phd to fasta - phd2fasta.pl
  Output: nucleotide sequences - seqs_fasta; quality values - seqs_fasta.screen.qual
Vector screening and masking - Cross_Match (local alignment program) x vector.seq
  Output: screened/masked file - seqs_fasta.screen
Assembly - Phrap
  Output: assembled contigs - seqs_fasta.screen.contigs; assembly file - seqs_fasta.screen.ace#
Assembly viewing/editing - Consed
Directories: Chromat_dir, Phd_dir, Edit_dir
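The vector screening and masking step of the pipeline can be pictured with a toy example. Cross_Match finds vector matches by local alignment; the sketch below swaps in plain exact substring matching, which is much cruder, just to show what a masked read looks like. The function name, the min_len parameter and both sequences are assumptions made for illustration.

```python
def mask_vector(read: str, vector: str, min_len: int = 12) -> str:
    """Replace with 'X' every stretch of the read that exactly matches
    a stretch of the vector at least min_len bases long.

    Exact matching is a stand-in for the local alignment used by
    Cross_Match; real vector matches may contain mismatches and gaps.
    """
    read_l, vector_l = read.lower(), vector.lower()
    masked = list(read)
    for i in range(len(read_l) - min_len + 1):
        if read_l[i:i + min_len] in vector_l:
            masked[i:i + min_len] = "X" * min_len
    return "".join(masked)


# Illustrative sequences only.
vector = "gaattcgagctcggtacccggggatcc"
read = "gagctcggtacccgg" + "atgcatgcatgcaacgt"   # vector-derived start + insert
masked_read = mask_vector(read, vector)
```

Masking instead of trimming keeps the read length and base positions unchanged, so quality values still line up with the masked sequence.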
Finishing Problems
Finishing can be a tedious and difficult task due to:
DNA sequencing problems
a. High GC content – genomes with a high GC content are
more prone to generate artifacts such as compressions, sudden signal drops
and bad quality regions. Try to use Dye Primer instead of Dye Terminator, change
chemistry, add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP, etc.
b. Palindromic regions – lead to strong secondary structures
causing sudden signal drops. Try to use deaza-dGTP instead of dGTP, or amplify the
problematic region by PCR and sequence the product.
c. Homopolymeric regions – can reduce DNA synthesis efficiency
for some chemistries. Try to use Dye Primer instead of Dye Terminator, or change
chemistry (dRhodamine instead of BigDye).
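Two of the sequence features listed above, base composition and homopolymer runs, are easy to measure once some sequence is available. The Python sketch below is only illustrative; the function names and the example sequence are made up, and no particular cut-off for "high GC" or "long run" is implied.

```python
from itertools import groupby
from typing import Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def longest_homopolymer(seq: str) -> Tuple[str, int]:
    """Return (base, length) of the longest run of a single base."""
    best = ("", 0)
    for base, run in groupby(seq.upper()):
        length = len(list(run))
        if length > best[1]:
            best = (base, length)
    return best


seq = "GGGCCCGCGC" + "AAAAAAA" + "T"   # illustrative sequence only
gc = gc_fraction(seq)                  # 10 of 18 bases are G or C
run = longest_homopolymer(seq)
```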
DNA assembly problems
a. High content of repeats – highly repeated elements reduce
the accuracy of DNA assembly. Identify the repeat unit, screen it with Cross_Match or
RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end.
Map the repetitive region using restriction enzymes to estimate its size and number of repeat
units.
b. High AT content – some highly biased genomes (e.g.
Plasmodium falciparum; plastid genomes) can pose a problem
for assembly programs. Very difficult to solve. Try to determine a restriction map and
combine the mapping with the DNA sequencing data.