Bioinformatic Techniques & Tools for SNP Analysis

Download Report

Transcript Bioinformatic Techniques & Tools for SNP Analysis

Bioinformatic Techniques & Tools
for SNP Analysis
Jill L Wegrzyn
Department of Plant Sciences
University of California at Davis, Davis, CA 95616 USA
E-mail: [email protected]
SNP Analysis Agenda
Sequence-Based SNP Identification
Common Bioinformatic Solutions
Phred, Phrap, Consed, Polyphred, and Polybayes
Re-Sequencing Project (ADEPT2)
High-Throughput SNP Identification Solution (PineSAP)
Finding SNPs: Sequence-based SNP Mining
Genomic
BAC
Library
RRS
Library
BAC
Overlap
Shotgun
Overlap
mRNA
Random
Shotgun
cDNA
Library
Align to
Reference
EST
Overlap
DNA
SEQUENCING
Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Comprehensive SNP Discovery: Resequencing
• Overlapping PCR Amplicons across entire gene
• Make no assumptions about sequence function
• Sequence diversity and genetic structure for each gene is different
• Proper association studies can only be designed in this context
• Complete resequencing facilitates population genetics methods
Sequence-based SNP Identification
Amplify DNA
5’
Sequence
Phrap
Phred
Sequence each end
Base-calling
of the fragment.
Quality determination
3’
Contig assembly
Final
quality determination
PolyPhred/Polybayes
Polymorphism detection
ATAGACG
ATAGACG
ATACACG
ATACACG
ATAGACG
ATACACG
Consed
Sequence viewing
Polymorphism tagging
Analysis
Homozygotes Heterozygote
Polymorphism reporting
Individual genotyping
Phylogenetic analysis
What is Phred/Phrap/Consed ?
Phred/Phrap/Consed is a DNA sequence analysis package from
University of Washington:

Trace file (chromatograms) reading

Quality (confidence) assignment to each individual base

Vector and repeat sequences identification and masking
Sequence assembly

Assembly visualization and editing

Automatic finishing
Phred, Phrap, Consed, Polyphred, Polybayes

phred: Base calling and quality assignments

phrap: Contig formation and new quality assignments

consed: Visual X-Windows graphic interface, to view and edit
alignments and contigs, and to view the original traces

polyphred: find polymorphisms in phrap contigs, quality calls,
add data to phrap files to permit consed finding and
visualization of polymorphisms.

polybayes: Fully probabilistic SNP detection algorithm that
calculates the probability (SNP score) that discrepancies at a
given location of a multiple alignment represent true sequence
variations as opposed to sequencing errors.
Files and Program Execution

Input files: ABI chromat or SCF files

Order of Program execution:
 phred,
then phrap, then consed
 or: phred, then phrap, then polyphred,
polybayes, then consed
 or: phred, then phrap, then consed, then
polyphred, polybayes, then consed
Trace File
High quality region – no ambiguities
Trace File
Medium quality region – some ambiguities
Trace File
Poor quality region – low confidence
What is Phred?
• Phred observes the base trace, makes base calls, and
assigns quality values (qv) of bases in the sequence.
• Writes base calls and qv to output files for Phrap
assembly.
• Quality values (phred scores) range from 0 to 60. 20 and
above is considered a confident base call (1 in 100 chance
that is has been called wrong ~99% accuracy)
• Useful for consensus sequence construction
ATGCATTC
string1
CGTTCATGC
string2
ATGC-TTCATGC superstring
• Here we have a mismatch ‘A’ and ‘G’, the qv will determine the
dash in the superstring. The base with higher qv will replace the
dash.
Phred value formula
q = - 10 x log10 (p)
where
q - quality value
p - estimated probability error for a base call
Examples:
q = 20 means p = 10-2 (1 error in 100 bases)
q = 40 means p = 10-4 (1 error in 10,000 bases)
Why Phred?
• Output sequence might contain errors.
• Vector contamination
• Dye-terminator reaction might not occur.
• Weak or variable signal strength of peak
corresponding to a base.
The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS:99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
t
a
a
a
g
c
c
t
g
g
g
g
t
g
c
c
t
a
a
t
g
24
24
22
27
25
19
12
19
12
15
19
23
33
36
44
44
39
39
34
35
34
2221
2232
2245
2261
2272
2286
2302
2314
2324
2331
2346
2363
2378
2390
2404
2419
2433
2446
2460
2470
2482
t
g
t
c
g
n
c
t
t
c
t
c
c
c
t
c
g
g
a
g
g
16 8191
19 8200
13 8211
13 8229
4 8241
4 8253
4 8263
10 8276
9 8286
12 8301
16 8313
12 8329
12 8336
15 8343
19 8356
9 8371
13 8386
14 8397
7 8417
9 8427
4 8445
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
Phrap
• Phrap constructs the contig sequence as a mosaic
of the highest quality parts of the reads rather than
a statistically computed “consensus”.
• Avoids the complex algorithm issues associated
with multiple alignment methods
• Speed and accuracy
• Phrap is an assembler NOT an aligner
• The sequence produced by Phrap is quite accurate
• Less than 1 error per 10 kb in typical datasets.
• Phrap considers sequence quality at a given
position (determined by Phred)
Running phredPhrap
Construct directory structure
 directory
for the run, containing subdirectories
chromat_dir, edit_dir, phd_dir, poly_dir
Move tracefiles to directory chromat_dir
Run phredPhrap from directory edit_dir
 Determine phredPhrap parameters
The greater the forcelevel, the more liberal the assembly
(generally produces fewer contigs).
however results in more manual editing reviewing polymorphisms later
Default
Adjusting Parameters
Phrap output files
• *.contigs – fasta file containing the contigs
-
Contigs with more than one read
-
Singletons (single reads with a match to some other contig but
could not be merged consistently)
• *.singlets – fasta file of the singlet reads
-
Reads with no match to other read
• *.ace – for viewing the assembly w/ Consed
• *.view – for viewing the assembly w/ Phrapview
Consed
Consed is a program for viewing and editing assemblies
produced by Phrap
a. Assembly viewer - allows for visualization of contigs,
assembly (aligned reads), quality values of reads and
final sequence.
b. Trace file viewer – single and multiple trace files can
be visualized allowing for comparison of a given
sequence in several reads.
c. Navigation – identify and list regions which are below
a given quality threshold, contain high quality
discrepancies, single-strand coverage, etc.
Phred/Phrap/Consed Pipeline
Input
chromatogram files
Quality (confidence) values assignment
Phred
phd files - *.phd
Directories:
Conversion - phd to fasta
phd2fasta.pl
nucleotide sequences - seqs_fasta
quality values - seqs_fasta.screen.qual
Vector screening and masking
Cross_Match (local alignment program) x vector.seq
screened/masked file - seqs_fasta.screen
Assembly
Phrap
assembled contigs - seqs_fasta.screen.contigs
assembly file - seqs_fasta.screen.ace#
Assembly viewing/editing
Consed
Chromat_dir
Phd_dir
Edit_dir
Using PolyPhred to Visualize SNPs
•Compares sequences across traces obtained from different individuals to identify
sites for SNPs.
•Will occasionally miscall genotypes - frequency of such mistakes depends on the
sequencing chemistry used to generate the trace.
•To reduce the number of miscalled sites, ignores regions of poor quality & ends
Polyphred

polyphred -ace gene_file.ace.1 -tag p -snp hom -indel -f 50 >
gene_file.polyphred.out







where gene_file.ace.1 is the .ace file present in the edit_dir directory.
Note: this command will give polymorphisms of quality 'ranks' 1
through 3: 1 is the highest quality, 6 is the lowest quality
The qualifier -f 50 is used to list 50bp of flanking sequence on each
side of the detected polymorphism
The qualifier -indel is used to identify insertions or deletions
The qualifier -snp hom is used to identify homozygous SNPs only
The qualifier -tag p is used to list the tagged polymorphisms in the
polyphred output file *.polyphred.out.
It first reads the ACE file to obtain the consensus sequence
and the names of the trace (chromat) files used in the
assembly. It then reads the PHD and POLY files associated
with each trace.
Polyphred

Reads the ACE file to obtain the consensus sequence and the
names of the trace (chromat) files used in the assembly.

Reads the PHD and POLY files associated with each trace.

During the SNP search phase, PolyPhred combines information
from all of the sequence traces to derive a genotype and a score for
each sequence

The score indicates how well the trace at the site matches the
expected pattern for a SNP.

Updates the ACE and PHD files by adding tags that mark the
positions of the sites. The tagged sites can then be examined using
Consed.
Polybayes
Bayesian statistical model takes into account:
- depth of coverge
- base quality values of the sequences
-a priori expected rate of polymorphic sites in region
Polybayes calculations are aided with information on
major/minor allele frequencies as well as polymorphism
rates within the species under investigation
**Can also integrate into the poly files for viewing with
Consed
Allele Discovery of Economic Pine Traits

Goals
 Re-sequence
~8,000 amplicons in
Loblolly Pine
 SNP Discovery
 Genotyping of 3,000 trees using Illumina
technology
 Phenotyping of same 3,000 trees
 Association studies
Resequencing Project Pipeline
EST Databases
(UGA, UMN)
Assemble
Contigs
Diversity Panel
DNA Extraction
Megagametophyte
Primer
Design
Illumina Genotyping
Primer
Validation
Agencourt Sequencing
18 individuals & 5 other
SNP Identication
Phenotyping
Drought, Dieases, & Wood
Population DNA Extraction
(Needles)
Association
Studies
Primer Selection and Validation
Primer Selection
Using OSP primer picking
Report to TreeGenes - several primers per
amplicon will be stored in TreeGenes
Successes - select 1 primer set per amplicon
Primer synthesis at Illumina
Sequencing on validation DNA
Report to TreeGenes
Successes
BLASTing of validation DNA sequence
back to EST database
Report to TreeGenes
Successes
Sequencing of diversity panel - 1 primer set per
amplicon which has been successful at each step
Report to TreeGenes
Species
Num.
Successful
Amplicons
Pinus taeda
7,424
Pinus radiata
6,249 (84%)
Pinus
lamertiana
2,234 (30%)
Picea abies
1,024 (13%)
Pseudotsuga
menziesii
750 (10%)
**312 successful in all 5 species
Alignment and SNP Calling Pipeline
Challenges in High-Throughput SNP Identification

Alignment
Critical in the automation of base calls
 Commonly used Phrap (from PhredPhrap) is an assembler and is
NOT ideal for alignments
Many commonly used aligners work best with protein
sequences or with a reference sequence
 Preservation of quality scores for input into SNP
identification programs
 Speed for high-throughput programs
Automated SNP Calls
- Reference Sequence Required
- Traditional approaches without reference sequence include
“eSNPs” (human, maize, and pine)


-Very little redundancy outside of abundant genes
-Overall high number of false positives (single pass reads)
-
Not specific to frequencies observed in different organisms
High number of false positives in currently accepted
methods (Polybayes & Polyphred)
PineSAP
Re-Sequencing data
from Agencourt:
Initial Processing
Base Calling
Sequence Alignment
SNP Identification
Machine Learning
Data Storage &
Release
Base Calling and Sequence Alignment
Modified PhredPhrap
allows for trimming of bases from
start and end of sequence based
on trace quality
Ace2FASTA
Converts native PhredPhrap output (ace file) into
an unaligned FASTA file
ProbconsRNA
Optimal DNA sequence alignment program
AlignedContig2ReadFASTA
Provides single multifasta file with all reads aligned to
the contig from PhredPhrap AND the contigs alignment
to the other contigs from probconsRNA
FASTA2Ace
Converts resulting FASTA file back into ace file for
SNP Identification
Phrap assembly
Number of contigs
1
2
3
4
5
6
7
8
Percentage
23.67%
27.00%
19.67%
14.00%
6.00%
4.33%
2.67%
2.33%
Alignment Pipeline Success
Problems (out of 356 validation set)
Sequencing problems
3
0.84%
Problems in end or middle
4
1.12%
F&R don't overlap
2
0.56%
Alignment problems due to low quality
1
0.28%
Alignment problem due to wrong direction in original phrap
alignment
2
0.56%
Summary (out of 356 validation set)
Correct alignments
Alignments we are working on fixing
344 96.63%
+5 98.03%
SNP Identification and Machine Learning
Algorithms applied Independently for SNP
Identification:
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Polyphred and Polybayes both utilize
alignment and base quality information to
identify potential polymorphisms.
Both programs yield a high # of false positives
Machine Learning
Utilizes sequence based information and
compares actual base calls to proposed calls
to develop SNP identification filters
specific to pine
Parameters under Consideration: Local and global quality scores,
frequencies of major and minor alleles, local and global SNP
frequences, relative positions in sequence to existing variation, and
polyphred and polybayes scores
SNP Identification Overview

Examine features to improve the accuracy of SNP location
prediction

Utilize machine learning to apply the features

Refine the accuracy of the learning algorithm through
adjustments to feature representation (iterative process)

Utilize the final classifier against the large re-sequenced set
to improve accuracy of SNP calls originating from Polybayes
and Polyphred
Features Selected
Description
Representation
Sequence Depth
Continuous
Variation Type
Categorical
Polybayes Score
Continuous
Polyphred Score
Continuous
Freq of major/minor alleles
Continuous
Max quality of major/minor alleles
Continuous
Local average quality
Continuous
Overall average quality
Continuous
Alignment Quality
Continuous
Feature Representation

Sequence depth

Count of number of sequences in the alignment at the position of
variation.
 All sequences in the alignment may not overlap at the position of
variation;


Variation type


Variation type can be SNP or INDEL
PolyBayes score


number is different from the total number of the sequences in the alignment
PolyBayes program assigns a Bayesian posterior probability value for
each called SNP using the frequency priors given for observing a
variation at that position
Polyphred score

Polyphred assigns a score calculated primarily from sequence depth and
quality score.
Feature Representation

Base frequencies

The number of occurrences of different bases at the position of
variation is important in determining a polymorphic position.


Frequencies of the first (major allele) and the second (minor allele)
are represented as ratio to sequence depth.
Relative distance

Sequence quality at the ends of the alignment tends to be poor
due to inherent limitations of current sequencing technology.

SNP position was represented as the ratio of the distance in the
consensus sequence from the closest end, or the relative distance
Feature Representation

Sequence quality

Variation is observed because of a poor quality base.
 Based on the base frequencies calculated:



maximum qualities of the major and minor alleles
average qualities of major and minor alleles
Alignment quality

Misalignment of bases caused by sequence alignment programs
sometimes result in an erroneous SNP call.
 In the neighborhood of the SNP (+/- 10 bases) all the mismatches
with the consensus sequence are given a penalty and the penalty
is more if the mismatch is continuous
Classification
Goal: Prediction
Software: Decision tree C4.5 program is open-source C code
(WEKA)

At each point in the construction of the decision tree, C4.5
selects the feature to test based on maximum information
gain.

The goal is to generate a minimum size tree that correctly
classifies all the elements in the training set.

The size of the tree is the number of nodes (decision
nodes) and the number of leaves are the classes
(categories) that they are distributed to.
Alignment and SNP Identification
SNP Identification Datasets

Training set for loblolly pine was composed of a total of 300
validated sequences.

Divided to represent the relative percentages of sequence source
 Testing set is composed of 120 validated sequence sets

Training set for poplar was composed of 42 validated
sequences selected at random


Testing set is composed of a total of 30 validated sequence sets.
Validation = manually observed FP, FN, TP, and TN SNP calls
through observation of tracefiles in Consed.
Alignment and SNP Identification
Evaluation Criteria

Accuracy = (TP + TN)/total

Sensitivity = TP/(TP + FN)
Evaluation J48
Polyphred
Polybayes

Specificity = TN/(FP + TN)
Accuracy
93.6
76.25
78.02
Sensitivity
88.21
83.22
86.54
Specificity
98.73
N/A
N/A
Evaluation J48
Polyphred
Polybayes
Accuracy
94.6
79.35
80.24
Sensitivity
90.54
85.01
88.14
Specificity
97.23
N/A
N/A
Resulting Identifications






SNPs have been called using PineSAP on
7424 amplicons representing 6924
Unigenes.
Average amplicon length is 445 bp
Average of 5.5 SNPs/amplicons
Pine is highly polymorphic!
Total SNPS ~41,500
Distribution of SNPs/amplicon



1404 - 0 SNPs
1133 - 1 SNP
4887 - 2 or more SNPs
Alignment and SNP Identification
SNP Formatting
Custom lists can be exported into Illumina style input
**option for adding IUPAC codes for SNPs in flanking
sequence
Alignment and SNP Identification
Illumina Design
Data Storage and Release
SNP Identifcations
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
GENOTYPING
SNP Discovery: dbSNP database
dbSNP
NCBI SNP database
SNP data submitted to dbSNP
dbSNP processing of SNPs
SNPs submitted
by research communtiy
(submitted SNPs = ss#)
Unique mapping
to a genome location
(reference SNP = rs#)
(by 2hit-2allele)
Summary

PineSAP improves
 Inaccuracies introduced by using Phrap to align
sequences
 Time which would be required by using a aligner such as
ProbconsRNA or ClustalW on its own

PineSAP has a 98% success rate when used to align loblolly
resequencing data.

PineSAP identified a success list of features to enhance
polymorphism predictions and generated a prediction
accuracy of 93%

PineSAP provided a full alignment and polymorphism
detection system that can be adapted to specific
genomes
More Information…
Phred/Phrap/Consed: www.phrap.org/
Polyphred: droog.mbt.washington.edu/PolyPhred.html
Polybayes:
genomeold.wustl.edu/groups/informatics/software/polybayes/