Transcript Slide 1

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 4
Mapping and Genome Rearrangement
Jared Simpson, Ph.D.
[Figure: paired-end reads (ATCAA, CTAAG) at the two ends of a DNA fragment]
Learning Objectives of Module
• Understand mapping sequence reads to a reference genome
• Understand file formats like FASTA, FASTQ and SAM/BAM
• Learn common terminology used to describe alignments
• Learn how paired-end reads can be used to find genome rearrangements
• Run a mapper and rearrangement caller
Module 4
bioinformatics.ca
Sequencing platforms
[Figure: sequencing platforms arranged along two axes, Increasing Run Time and Increasing Data Per Run. Labelled throughputs: 14TB/run, 600Gb/10d, 100Gb/15d, 120Gb/1d, 90Gb/10d, 2Gb/27h, 700Mb/23h, 150Mb/3h, 100Mb/1h. Proton? and GridION? are marked as open questions.]
Cross-platform data integration needed.
Basecalling
• How do we translate the machine data to base calls?
• How do we estimate and represent sequencing errors?
Sources of error
Illumina: Pre-phasing & Phasing
What is a base quality?
Phred quality scores:
- Estimate of probability the base call is incorrect
Base Quality    Perror(obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
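The table of base qualities follows from the standard Phred relationship, P_error = 10^(-Q/10). A minimal sketch of the conversion in both directions is shown here; the function names are illustrative and not part of the workshop material.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error probability for each quality value listed on the slide.
for q in (3, 5, 10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_prob(q):.2%}")
```

Rounded to two significant figures, the printed percentages agree with the values on the slide.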
Error Profiles
• Illumina
– Low error rate (~0.5%), mainly substitutions
• 454/Ion Torrent
– Mainly insertions/deletions in homopolymer runs
• PacBio
– Higher error rate, mixture of insertions, deletions, substitutions
Mismatch by cycle
Fasta files
[Example files: ASF-1.fa, ASF-2.fa]
• Reads are often stored in fasta files
• Separate files for the forward and reverse reads of each pair
• header line: identifier
• sequence lines: nucleotides
Fastq files
[Example files: ASF-1.fastq, ASF-2.fastq]
• Most reads are stored in fastq
• 4 lines per read
– header line: @SEQUENCE_ID
– sequence line
– line beginning with +
– encoded quality value line
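The four-line FASTQ layout can be read with a few lines of code. The sketch below assumes the common Phred+33 quality encoding (each quality is the character's ASCII code minus 33), which the slide does not specify; the function name and the example record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding of the quality line.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical single-read example.
example = io.StringIO("@read1\nATCAA\n+\nIIII#\n")
records = list(read_fastq(example))
print(records)
```

For real data, a maintained parser is usually a safer choice than a hand-written one; this sketch only illustrates the record structure.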
Reference-based Alignment
• Goal:
– find the position in the reference genome from which the read was sampled
• Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs, indels and structural variation
Choosing an Aligner
• High accuracy needed
– Misaligned reads are a source of false positive variant calls
• High sensitivity needed
– The aligner must allow for differences between the individual and reference to find the correct mapping position
• High speed needed
– With large data sets, the informatics cost is significant
• We will use the popular aligner bwa in the tutorial
Reference alignments
[Figure: a sequence read to be placed on the reference genome]
[Figure: the sequence read aligned to the reference genome, with three positions marked x]
Alignment Quality
• Most aligners will estimate how reliable the alignment is with a Mapping Quality
– Phred-scaled estimate of the probability that the chosen mapping is wrong
– On average, 1 in 1000 reads with a “Q30” alignment will be placed incorrectly
What are Paired Reads?
[Figure: paired-end reads (ATCAA, CTAAG) at the two ends of a DNA fragment, with the insert size (IS) marked]
Slides by M. Brudno
Paired Reads
[Figure: a sequence read pair to be placed on the reference genome]
Read pair alignment
[Figure: the sequence read pair aligned to the reference genome, with several positions marked x]
Working with alignments
• SAM/BAM is a standardized format for working with read alignments
• SAM is a tab-delimited text representation
• BAM is a compressed binary representation
SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77
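Because SAM is tab-delimited, the example record can be split into its eleven mandatory fields. The sketch below uses the field names from the SAM specification; the SamRecord type and the parse_sam_line function are illustrative names, and optional tag fields are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # mate reference name ("=" means same as rname)
    pnext: int   # mate position
    tlen: int    # template length (insert size)
    seq: str     # read sequence
    qual: str    # encoded base qualities


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


# The record from the slide, rebuilt with tab separators.
line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77",
])
rec = parse_sam_line(line)
print(rec.qname, rec.rname, rec.pos, rec.mapq, rec.cigar)
```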
SAM Description
Read name: SRR013667.1
Flag: 99
➞ Flag indicates the reference strand, pairing information
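The flag is a sum of bit values, each recording one property of the alignment. As a sketch, the flag 99 from the example record can be decoded as follows; the bit meanings are those defined in the SAM specification, while the wording of the descriptions and the function name are illustrative.

```python
# Bit values defined by the SAM specification (descriptions paraphrased).
FLAG_BITS = {
    0x1: "read is paired",
    0x2: "read mapped in a proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first read in pair",
    0x80: "second read in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
for description in decode_flag(99):
    print(description)
```

For flag 99 this reports a paired read in a proper pair, on the forward strand (the reverse-strand bit is not set), whose mate is on the reverse strand, and which is the first read of the pair.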
SAM Description
Chromosome: 19
Coordinate: 8882171
SAM Description
Mapping Quality: 60
SAM Description
CIGAR: 76M
REF  ACGATACATAC
READ ACGA-ACATAC
CIGAR: 4M1D6M
REF  GACA-AACC
READ GTCATAACC
CIGAR: 4M1I4M
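A CIGAR string is a run-length encoding of alignment operations: M (match or mismatch), I (insertion in the read), D (deletion from the read), and others. The sketch below parses a CIGAR string and computes how many reference bases and read bases it covers, using the operation letters from the SAM specification; the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


for cigar in ("76M", "4M1D6M", "4M1I4M"):
    print(cigar, reference_span(cigar), read_length(cigar))
```

For 4M1D6M the read has 10 bases but spans 11 reference bases; for 4M1I4M the read has 9 bases but spans only 8 reference bases, matching the two aligned examples on the slide.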
SAM Description
Mate chromosome, position: = (same chromosome), 8882214
Insert size: 119
[Figure: paired-end reads (ATCAA, CTAAG) with the insert size (IS) marked]
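The insert size in the example record can be checked by hand: it is the distance from the leftmost base of the forward read to the rightmost base of its mate. The sketch below assumes that the mate also aligns as 76M, which the slide does not show, and that the read itself is the leftmost of the pair; the function name is illustrative.

```python
def insert_size(pos, mate_pos, mate_ref_span):
    """Insert size from the leftmost read start to the mate's last aligned base.

    Coordinates are 1-based and inclusive, as in SAM. The read at `pos` is
    assumed to be the leftmost of the pair.
    """
    mate_end = mate_pos + mate_ref_span - 1
    return mate_end - pos + 1


# Read starts at 8882171, mate starts at 8882214; mate assumed to span 76 bases.
print(insert_size(8882171, 8882214, 76))
```

Under these assumptions the result is 119, the insert size reported in the record.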
Resources
• samtools: toolkit for working with SAM/BAM files
– Convert between SAM/BAM
– Sort alignments
– Extract alignments for a given genomic location
• SAM/BAM specification:
http://samtools.sourceforge.net/SAM1.pdf
• Questions/Help
– https://lists.sourceforge.net/lists/listinfo/samtools-help
– http://www.biostars.org/
– http://seqanswers.com/
We are now going to start an exercise
in read mapping
We are on a Coffee Break &
Networking Session
What kinds of variation are there?
• Single Nucleotide Polymorphisms (SNPs)
• Short indels (< read length)
• Structural variations
– Large insertions and deletions
– Inversions
– Translocations
– Copy number variation
Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA]
• Mate-Pairs (1 - 20kb)
– Fragmentation & circularization to an internal adaptor
– Shear
– Isolate internal adaptors and fragment ends
– Add amplification and sequencing adaptors
– Sequence
• Paired-Ends (200 – 500bp)
– Fragmentation
– Add amplification and sequencing adaptors
– Sequence
Read pair orientation
[Figure: a sequence read pair aligned to the reference genome]
• The expected orientation is one read on the forward strand and one read on the reverse strand for paired-end reads
Read pair alignment
[Figure: distribution of fragment number by fragment size]
• Fragment/insert size is determined by library preparation
• Pairs that match the expected orientation and distance are called concordant
• Discordant read pairs give evidence of structural variation
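One way to decide whether a pair's mapped distance matches the library is to estimate the fragment size distribution from the data and flag outliers. The sketch below uses the median and the scaled median absolute deviation as robust estimates, with a cutoff of three spreads; the estimator, the cutoff, the function names and the example values are all assumptions for illustration, and orientation is not checked here.

```python
from statistics import median


def estimate_insert_distribution(insert_sizes):
    """Robust centre and spread of the observed insert sizes.

    Uses the median and the median absolute deviation (MAD), scaled by
    1.4826 so that it is comparable to a standard deviation.
    """
    sizes = list(insert_sizes)
    center = median(sizes)
    mad = median([abs(size - center) for size in sizes])
    return center, 1.4826 * mad


def is_concordant(insert_size, center, spread, n_spreads=3.0):
    """True if the mapped distance is within n_spreads of the library centre."""
    return abs(insert_size - center) <= n_spreads * spread


# Hypothetical insert sizes; the last pair maps much further apart than the rest.
observed = [290, 300, 310, 295, 305, 300, 1200]
center, spread = estimate_insert_distribution(observed)
discordant = [s for s in observed if not is_concordant(s, center, spread)]
print(center, discordant)
```

In this made-up example only the 1200 bp pair is flagged as discordant.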
SV Signatures: Deletion
[Figure: read pair from the donor (don) mapped onto the reference (ref)]
Deletion signature: mapped insert size larger than expected
Slides by M. Brudno
SV Signatures: Insertion
[Figure: read pair from the donor (don) mapped onto the reference (ref)]
Insertion signature: mapped insert size smaller than expected
Slides by M. Brudno
SV Signatures: Tandem Duplication
[Figure: read pair from the donor (don) mapped onto the reference (ref)]
Tandem duplication signature: wrong orientation
SV Signatures: Inversion
[Figure: read pair from the donor (don) mapped onto the reference (ref)]
Inversion signature: wrong orientation of pairs
SV summary
Type                 Mapped Distance         Orientation
Insertion            too small               correct
Deletion             too big                 correct
Inversion            *                       wrong
Tandem duplication   *                       wrong
Interchromosomal     different chromosomes   N/A
Slides by M. Brudno
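The signatures summarized above can be expressed as a small decision function. The sketch below assumes a paired-end library in which concordant pairs face each other ("FR"), inverted pairs map to the same strand ("FF" or "RR"), and tandem duplications give outward-facing pairs ("RF"); these orientation codes, the function name and the size limits are illustrative assumptions, not the method of any particular caller.

```python
def sv_signature(chrom1, chrom2, orientation, mapped_distance,
                 min_expected, max_expected):
    """Classify one read pair by the SV signature it supports.

    orientation: "FR" (expected), "FF"/"RR" (same strand) or "RF"
    (outward facing), with the leftmost read listed first.
    """
    if chrom1 != chrom2:
        return "interchromosomal"
    if orientation in ("FF", "RR"):
        return "inversion"
    if orientation == "RF":
        return "tandem duplication"
    if mapped_distance > max_expected:
        return "deletion"
    if mapped_distance < min_expected:
        return "insertion"
    return "concordant"


# Hypothetical library with expected insert sizes between 200 and 400 bp.
print(sv_signature("19", "19", "FR", 2500, 200, 400))
print(sv_signature("19", "19", "FR", 120, 200, 400))
print(sv_signature("19", "7", "FR", 0, 200, 400))
```

A real caller would require several independent pairs supporting the same event before reporting it.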
Where can we go wrong: missed insertion
[Figure: donor (don) and reference (ref) with the insert size (IS) marked]
Insertions larger than the insert size cannot be detected this way
Structural Variants and Split Reads
[Figure: paired short reads are aligned to the reference genome]
Most of these pairs can be aligned to the reference genome
For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno
Deletion: split read signature
[Figure: a read from the donor (don) aligned to the reference (ref)]
Signature: read aligns in two pieces, one on either side of the breakpoint
Somatic vs. Germline
• tumor vs. normal sequencing
• approach 1:
– find SVs separately in the two samples
– filter out tumor SVs that overlap SVs found in the normal (germline) sample; the rest are putatively somatic
• approach 2:
– find candidate somatic SVs in the tumor
– for each candidate, look for any type of evidence in the normal (germline) sample
– filter out anything with germline evidence
Slides by M. Brudno
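Approach 1 amounts to an interval-overlap filter between two call sets. The sketch below treats two calls as the same event if they are of the same type, lie on the same chromosome and overlap by at least one base; that matching rule, the SV type and the function names are illustrative assumptions, and practical pipelines often use stricter criteria such as reciprocal overlap.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # e.g. "deletion", "inversion"


def overlaps(a, b):
    """True if two calls of the same type share at least one base."""
    return (a.chrom == b.chrom and a.kind == b.kind
            and a.start <= b.end and b.start <= a.end)


def somatic_candidates(tumor_calls, normal_calls):
    """Keep tumor calls that overlap no call made in the normal sample."""
    return [t for t in tumor_calls
            if not any(overlaps(t, n) for n in normal_calls)]


# Hypothetical call sets.
tumor = [SV("19", 1000, 5000, "deletion"), SV("7", 200, 900, "inversion")]
normal = [SV("19", 1200, 4800, "deletion")]
print(somatic_candidates(tumor, normal))
```

In this made-up example the chromosome 19 deletion is also seen in the normal sample and is removed, leaving the chromosome 7 inversion as the only somatic candidate.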
Gene fusions
• If a linking signature connects two genes, this might indicate a gene fusion
[Figure: ChrA and ChrB carrying Gene X and Gene Y, joined to produce a Gene XY Protein]
SV Software and Exercise
• We will use HYDRA-SV in the tutorial
– https://code.google.com/p/hydra-sv/
– Quinlan et al., Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
• Many others exist:
– BreakDancer, GASV, Pindel
– It is worth spending time learning multiple packages and their strengths and weaknesses
– There is rarely one program that fits all needs!
We are now going to start an exercise
in structural variant detection
Any questions?
[email protected]