reseq-wrkshp

Transcript reseq-wrkshp

Resequencing Genome
Timothee Cezard
EBI NGS workshop
16/10/2012
NGS Course – Data Flow
Overview
Karim Gharbi
DNA
Sequencing
Resequencing &
assembly
Gene
regulation
Timothee Cezard
ChIP-seq
analysis
Elizabeth
Murchison
Remco Loos/
Myrto Kostadima
Sequence
archives
ENA/SRA
submission and
retrieval
Data compression
Guy Cochrane
Genome
variation
&
disease
Jon Teague
/Adam
Butler/
Simon Forbes
Laura Clarke
Rajesh Radhakrishnan
Rasko Leinonen
Arnaud Oisel
Marc Rossello
Vadim Zalunin
RNA
Sequencing
Gene
annotation
Gene
expression
RNA-Seq
Ensembl gene
build
RNA-Seq
Transcriptome
analysis
Ensembl/John Collins
Myrto Kostadima/
Remco Loos
NGS Course – Data Flow
Overview
Karim Gharbi
DNA
Sequencing
Resequencing &
assembly
Gene
regulation
Timothee Cezard
ChIP-seq
analysis
Elizabeth
Murchison
Slides and
Sequence
archives
ENA/SRA
submission and
retrieval
Data compression
Guy Cochrane
Genome
variation
&
disease
Jon Teague
tutorials
are available
at:
Remco Loos/
/Adam
Rajesh Radhakrishnan
Rasko Leinonen
Arnaud Oisel
Marc Rossello
Vadim Zalunin
RNA
Sequencing
Gene
annotation
Gene
expression
RNA-Seq
Ensembl gene
build
RNA-Seq
Transcriptome
analysis
https://www.wiki.ed.ac.uk/display/GenePoolExternal/NGS+workshop+16.10.2012+at+EBI
Myrto Kostadima
Butler/
Ensembl/John Collins
Myrto Kostadima/
Simon Forbes
Laura Clarke
Remco Loos
NGS Course – Data Flow
Overview
Karim Gharbi
DNA
Sequencing
Resequencing &
assembly
Gene
regulation
Timothee Cezard
ChIP-seq
analysis
Elizabeth
Murchison
Remco Loos/
Myrto Kostadima
Sequence
archives
ENA/SRA
submission and
retrieval
Data compression
Guy Cochrane
Genome
variation
&
disease
Jon Teague
/Adam
Butler/
Simon Forbes
Laura Clarke
Rajesh Radhakrishnan
Rasko Leinonen
Arnaud Oisel
Marc Rossello
Vadim Zalunin
RNA
Sequencing
Gene
annotation
Gene
expression
RNA-Seq
Ensembl gene
build
RNA-Seq
Transcriptome
analysis
Ensembl/John Collins
Myrto Kostadima/
Remco Loos
Overview
•
•
•
DNA (Re)sequencing
• Sequencing technologies
• Sequencing output
• Quality control
Mapping
• Mapping programs
• Sam/Bam format
• Mapping improvements
Variant calling
• Types of variants
• SNPs/indels
• VCF format
Overview
•
•
•
DNA (Re)sequencing
• Sequencing technologies
• Sequencing output
• Quality control
Mapping
• Mapping programs
• Sam/Bam format
• Mapping improvements
Variant calling
• Types of variants
• SNPs/indels
• VCF format
Resequencing genomes
Library
prep
DNA
Extraction
Library
prep
Library
prep
Sequencing data
GATGGGAAGA
GCGGTTCAGC
AGGAATGCCG
AGACCGATAT
CGTATGCCGT
Sequence data
•
•
•
Precise
Fairly unbiased
Easy to QC
Coverage depth data
•
•
Can be biased
Hard to know what’s true
Sequencer specific errors
 Homopolymer run create false indels
 Specific sequence patterns can create phasing issues
Sequencer specific errors
 Specific sequence patterns can create phasing issues
Sequencing output
(Fastq format)
Example fastq record:
@ILLUMINA06_0016:6:1:5388:12733#0
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%%%%%
Sequencing output
(Fastq format)
Example fastq record:
@ILLUMINA06_0016:6:1:5388:12733#0
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%%%%%
Sequencing output
(Fastq format)
Example fastq record:
@ILLUMINA06_0016:6:1:5388:12733#0
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%%%%%
Quality control
Questions you should ask (yourself or your sequencing provider):
•
•
•
Sequencing QC:
• How much sequencing?
• What’s the sequencing quality?
Library QC:
• What’s the base profile across the reads?
• Is there an unexpected GC bias?
• Are there any library preparation contaminants?
Post mapping QC:
• What is the fragment length distribution? (for paired end)
• Is there an unexpected Duplicate rate?
Example with FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Example with FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Overview
•
•
•
DNA (Re)sequencing
• Sequencing technologies
• Sequencing output
• Quality control
Mapping
• Mapping programs
• Sam/Bam format
• Mapping improvements
Variant calling
• Types of variants
• SNPs/indels
• VCF format
Mapping Reads to a reference genome
Problems:
• How to find the best match of short sequence onto a large
genome (high sensitivity)
• How to not find a match when
• for 100,000,000,000 reads in reasonable amount of time.
Solution:
• Hashing based algorithms:
• BLAST, Eland, MAQ, Shrimps, GSNAP, Stampy
• More sensitive when SNPs/Indels
• Suffix trie + Burrows Wheeler Transform algorithms:
• Bowtie, SOAP BWA
• Faster
Different software for different
applications
Transcriptome to
genome
GSNAP
Tophat
Mapping to
distant reference
Stampy
Shrimp
Very fast mapping
bowtie
BWA
Different software for different
applications
Transcriptome to
genome
Mapping to
distant reference
GSNAP
Tophat
Genomatics
Bwasw
Splitseek
Very fast mapping
Stampy
Shrimp
Bowtie
CLC bio
Mr fast
Bwa
Smalt
Mrs fast
Ssaha2
Partek
Different software for different
applications
Transcriptome to
genome
GSNAP
Fastq
Very fast mapping
Genomatics
Bwasw
Splitseek
Tophat
Mapping to
distant reference
Stampy
Mapper
Shrimp
Bowtie
Sam/Bam
CLC bio
Mr fast
Bwa
Smalt
Mrs fast
Ssaha2
Partek
SAM/BAM format
SAM: Sequence Alignment/Map format v1.4
The SAM Format Specification Working Group (Sept 2011)
http://samtools.sourceforge.net/SAM1.pdf
•
•
•
•
Standardized format for alignment
Bam: binary equivalent of SAM
Bam can be indexed for fast record retrieval
Manipulate Sam/Bam file using samtools and others
2 parts:
• Header: contains metadata about the sample
• Alignment:
SAM/BAM format
1
2
3
R00
1
83
ref 37
COLUMNS:
1
QNAME
2
FLAG
3
RNAME
4
POS
5
MAPQ
6
CIGAR
7
RNEXT
8
PNEXT
9
TLEN
10 SEQ
11 QUAL
4
5
6
7
8
9
10
11
12
30
9M
=
7
-39
CAGCGCAT
CAGCGCAT
TAG
String
Int
String
Int
Int
String
String
Int
Int
String
String
Query template NAME
bitwise FLAG
Reference sequence NAME
1-based leftmost mapping POSition
MAPping Quality
CIGAR string
Ref. name of the mate/next fragment
Position of the mate/next fragment
observed Template LENgth
fragment SEQuence
ASCII of Phred-scaled base QUALity+33≈
Bitwise flag
83 = 1010011 in binary format
Bit
integer
Description
0x1
1
template having multiple segments in sequencing
0x2
2
each segment properly aligned according to the aligner
0x4
4
segment unmapped
0x8
8
next segment in the template unmapped
0x10
16
SEQ being reverse complemented
0x20
32
SEQ of the next segment in the template being reversed
0x40
64
the first segment in the template
0x80
128
the last segment in the template
0x100
256
secondary alignment
0x200
512
not passing quality controls
0x400
1024
PCR or optical duplicate
Bitwise flag
83 = 1010011 in binary format
http://picard.sourceforge.net/explain-flags.html
CIGAR alignment
M
I
D
N
S
H
P
=
X
alignment match (can be a sequence match or mismatch)
insertion to the reference
deletion from the reference
skipped region from the reference
soft clipping (clipped sequences present in SEQ)
hard clipping (clipped sequences NOT present in SEQ)
padding (silent deletion from padded reference)
sequence match
sequence mismatch
Ref:
AGGTCCATGGACCTG
|| ||||X|||||||
Query: AG-TCCACGGACCTG
CTTATGTGATC
|||||||||||
Query: CTTATGTGATCCCTG
2M1D12M
or
2=1D4=1X7=
Ref:
10M4S
Mapping enhancement
Each read is mapped independently:
Can borrow knowledge from neighbor to improve mapping
Picard Marking Duplicates: A duplicated read pair is when both two or
more read pairs have the same coordinates.
Samtools BAQ: Hidden markov model that downweight mismatching
based if they are close to indel
GATK Indel realignment: take every reads around potential indel and
perform a more sensitive alignment
GATK Base recalibration: look at several contextual information, such
as position in the read or dinucleotide composition to identify
covariate of sequencing errors
Indel realignment
AACAATATCTATGGA/TTTCG/TTTTG
Indel realignment
Indel realignment
Overview
•
•
•
DNA (Re)sequencing
• Sequencing technologies
• Sequencing output
• Quality control
Mapping
• Mapping programs
• Sam/Bam format
• Mapping improvements
Variant calling
• Types of variants
• SNPs/indels
• VCF format
The whole pipeline
Raw data
Alignment
Realignment
Mark
duplicates
Base
recalibration ?
Final
bam file(s)
The whole pipeline
Raw data
Alignment
Realignment
Mark
duplicates
Base
recalibration ?
Final
bam file(s)
Final
bam file(s)
SNPs/Indels
Calling
CNV
Calling
Structural
Variant Calling
Pool analysis
The whole pipeline
Raw data
Alignment
Realignment
Mark
duplicates
Base
recalibration ?
Final
bam file(s)
Final
bam file(s)
SNPs/Indels
Calling
CNV
Calling
Structural
Variant Calling
Pool analysis
SNPs and indels calling
Samtools mpileup +
bcftools
GATK
UnifiedGenotyper
Bayesian based
Bayesian based
yes
yes
Input:
bam file(s)
bam file(s)
output
vcf file
vcf file
Rather fast
Slow but
multithreaded
Up to 2alleles
3 by default
Algorithm
multiple samples calling
Runtime
Multi-allelic
VCF format
Variant format designed for 1000 genome project
- SNPs
- Insertions
- Deletions
- Duplications
- Inversions
- Copy number variation
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
VCF format
Header: define the optional fields
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Variants:
• 8 mandatory columns describing the variant
• 1 column defining the genotype format
• 1 column per sample describing the
genotype for that SNP for that sample
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger
(,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2
samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline
tumor
chr4 27668 .
T
C
8.65 .
DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:38,3,0:1:0:3
chr4 27669 .
G
T
4.77 .
DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:33,3,0:1:0:4
chr4 27712 .
T
C
44 .
DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8
1/1:37,3,0:1:0:8
chr4 27774 .
G
A
5.47 .
DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1 GT:PL:DP:SP:GQ
0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3
chr4 36523 .
A
T
10.4 .
DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:40,3,0:1:0:4
HEADER
DATA
VCF format
SNPs
#CHROM
chr4
chr4
chr4
chr4
chr4
POS
27668
27669
27712
27774
36523
ID
.
.
.
.
.
REF
T
G
T
G
A
ALT
C
T
C
A
T
QUAL
8.65
4.77
44
5.47
10.4
FILTER
.
.
.
.
.
INFO
DP=2;AF1=1;AC1=4;…
DP=2;AF1=1;AC1=4;…
DP=2;AF1=1;AC1=4;…
DP=2;AF1=0.5011; AC1=2; …
DP=1;AF1=1;AC1=4;…
FORMAT
GT:PL:DP:SP:GQ
GT:PL:DP:SP:GQ
GT:PL:DP:SP:GQ
GT:PL:DP:SP:GQ
GT:PL:DP:SP:GQ
germline
0/1:0,0,0:0:0:3
0/1:0,0,0:0:0:3
1/1:40,3,0:1:0:8
0/1:34,0,23:2:0:28
0/1:0,0,0:0:0:3
Genotype format
SNPs quality
SNP Identifier
SNPs information
Position
Reference base
Filtering reasons
Alternate base(s)
Chromosome
name
Genotype
information
Variant Filtering
• Depth of Coverage:
• confident het call= 10X-20X
• SNPs quality depends on the caller: 30-50
• Genotype quality: 20
• Strand bias
• Biological interpretation