Canadian Bioinformatics Workshops
www.bioinformatics.ca

Module 4
Mapping and Genome Rearrangement
Jared Simpson, Ph.D.

Learning Objectives of Module
• Understand mapping sequence reads to a reference genome
• Understand file formats like FASTA, FASTQ and SAM/BAM
• Learn common terminology used to describe alignments
• Learn how paired-end reads can be used to find genome rearrangements
• Run a mapper and rearrangement caller
Module 4 bioinformatics.ca

Sequencing platforms
[Figure: sequencing platforms placed along two axes, "Increasing Data Per Run" and "Increasing Run Time". Throughput labels: 100Mb/1h, 150Mb/3h, 700Mb/23h, 2Gb/27h, 120Gb/1d, 90Gb/10d, 100Gb/15d, 600Gb/10d, 14TB/run; Proton? GridION?]
Cross-platform data integration needed.

Basecalling
• How do we translate the machine data to base calls?
• How do we estimate and represent sequencing errors?

Sources of error
Illumina: Pre-phasing & Phasing

What is a base quality?
Phred quality scores:
– Estimate of probability the base call is incorrect

Base Quality    Perror(obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %

Error Profiles
• Illumina
  – Low error rate (~0.5%), mainly substitutions
• 454/Ion Torrent
  – Mainly insertions/deletions in homopolymer runs
• PacBio
  – Higher error rate, mixture of insertions, deletions, substitutions

Mismatch by cycle
[Figure: mismatch rate by sequencing cycle]

FASTA files
[Files shown: ASF-1.fa, ASF-2.fa]
• Reads are often stored in FASTA files
• Separate files for the forward and reverse reads of each pair
• header line: identifier
• sequence lines: nucleotides

FASTQ files
[Files shown: ASF-1.fastq, ASF-2.fastq]
• Most reads are stored in FASTQ
• 4 lines per read
  – header line: @SEQUENCE_ID
  – sequence line
  – line beginning with +
  – encoded quality value line

Reference-based Alignment
• Goal:
  – find position in reference genome from which read was sampled
• Issues:
  – the human genome is large and repetitive
  – NGS instruments produce huge amounts of data
  – the sequenced genome will differ from the reference due to SNPs, indels and structural variation

Choosing an Aligner
• High accuracy needed
  – Misaligned reads are a source of false positive variant calls
• High sensitivity needed
  – The aligner must allow for differences between the individual and reference to find the correct mapping position
• High speed needed
  – With large data the informatics cost is significant
• We will use the popular aligner bwa in the tutorial

Reference alignments
[Figure: where on the reference genome does the sequence read belong?]

Reference alignments
[Figure: the sequence read aligned to the reference genome, with mismatches marked x]

Alignment Quality
• Most aligners will estimate how reliable the alignment is with a Mapping Quality
  – Phred-scaled estimate of the probability that the chosen mapping is wrong
  – 1 in 1000 reads with “Q30” alignment will be placed incorrectly

What are Paired Reads?
[Figure: paired-end reads ATCAA and CTAAG at the two ends of a DNA fragment, with the insert size (IS) marked]
Slides by M. Brudno

Paired Reads
[Figure: where on the reference genome does the sequence read pair belong?]

Read pair alignment
[Figure: the sequence read pair aligned to the reference genome, with mismatches marked x]

Working with alignments
• SAM/BAM is a standardized format for working with read alignments
• SAM is a tab-delimited text representation
• BAM is a compressed binary representation
Example SAM record (a single line, wrapped on the slide):
SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77

SAM Description
• Read name: SRR013667.1
• Flag: 99
➞ Flag indicates the reference strand, pairing information

SAM Description
• Chromosome: 19
• Coordinate: 8882171

SAM Description
• Mapping Quality: 60

SAM Description
• CIGAR: 76M

REF   ACGATACATAC        REF   GACA-AACC
READ  ACGA-ACATAC        READ  GTCATAACC
CIGAR: 4M1D6M            CIGAR: 4M1I4M

SAM Description
• Mate chromosome, position: = 8882214 (= means the same chromosome as the read)
• Insert size: 119
[Figure: read pair ATCAA / CTAAG with the insert size (IS) marked]

Resources
• samtools: toolkit for working with SAM/BAM files
  – Convert between SAM/BAM
  – Sort alignments
  – Extract alignments for a given genomic location
• SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
• Questions/Help
  – https://lists.sourceforge.net/lists/listinfo/samtools-help
  – http://www.biostars.org/
  – http://seqanswers.com/

We are now going to start an exercise in read mapping

We are on a Coffee Break & Networking Session

What kinds of variation are there?
• Single Nucleotide Polymorphisms (SNPs)
• Short indels (< read length)
• Structural variations
  – Large insertions and deletions
  – Inversions
  – Translocations
  – Copy number variation

Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA]
• Mate-Pairs (1 - 20kb): Fragmentation & circularization to an internal adaptor; Shear; Isolate internal adaptors and fragment ends; Add amplification and sequencing adaptors; Sequence
• Paired-Ends (200 – 500bp): Fragmentation; Add amplification and sequencing adaptors; Sequence

Read pair orientation
[Figure: a sequence read pair aligned to the reference genome]
• The expected orientation is one read on the forward strand and one read on the reverse strand for paired-end reads

Read pair alignment
[Figure: histogram of fragment number against fragment size]
• Fragment/insert size is determined by library preparation
• Pairs that match the expected orientation and distance are called concordant
• Discordant read pairs give evidence of structural variation

SV Signatures: Deletion
[Figure: donor (don) and reference (ref) genomes]
Deletion signature: mapped insert size larger than expected
Slides by M. Brudno

SV Signatures: Insertion
[Figure: donor (don) and reference (ref) genomes]
Insertion signature: mapped insert size smaller than expected
Slides by M. Brudno

SV Signatures: Tandem Duplication
[Figure: donor (don) and reference (ref) genomes]
Tandem duplication signature: wrong orientation

SV Signatures: Inversion
[Figure: donor (don) and reference (ref) genomes]
Inversion signature: wrong orientation of pairs

SV summary
Type                 Mapped Distance         Orientation
Insertion            too small               correct
Deletion             too big                 correct
Inversion            *                       wrong
Tandem duplication   *                       wrong
Interchromosomal     different chromosomes   N/A
Slides by M. Brudno

Where can we go wrong: missed insertion
[Figure: donor (don) and reference (ref) genomes, with the insert size (IS) marked]
Insertions larger than insert size cannot be detected this way

Structural Variants and Split Reads
[Figure: paired short reads are aligned to the reference genome]
• Most of these pairs can be aligned to the reference genome
• For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno

Deletion: split read signature
[Figure: donor (don) and reference (ref) genomes]
Signature: read aligns in two pieces, one on either side of the breakpoint

Somatic vs. Germline
• tumor vs. normal sequencing
• approach 1:
  – find SVs separately in the two samples (tumor and normal)
  – filter out candidate somatic SVs that overlap germline SVs
• approach 2:
  – find candidate somatic SVs
  – for each candidate, look for any type of evidence in the germline (normal) sample
  – filter out anything with germline evidence
Slides by M. Brudno

Gene fusions
• if a linking signature connects two genes, this might indicate a gene fusion
[Figure: Gene X (ChrA) and Gene Y (ChrB) joined into Gene XY, producing a protein]

SV Software and Exercise
• We will use HYDRA-SV in the tutorial
  – https://code.google.com/p/hydra-sv/
  – Quinlan et al, Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
• Many others exist:
  – BreakDancer, GASV, Pindel
  – It is worth spending time learning multiple packages and their strengths and weaknesses
  – There is rarely one program that fits all needs!

We are now going to start an exercise in structural variant detection

We are on a Coffee Break & Networking Session

Any questions?
[email protected]
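
Worked example: Phred scores and FASTQ records
The base quality table and the FASTQ slide earlier in this module can be tied together with a short sketch. It uses the Phred relation P = 10^(-Q/10) and reads 4-line FASTQ records. The function names and the toy record are made up for illustration, and the quality line is assumed to use the Phred+33 ASCII encoding (other offsets exist), so treat this as one possible reading rather than the tutorial's own code.

```python
import math


def phred_to_error_prob(q):
    """Error probability for Phred quality q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality for error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_qualities(quality_line, offset=33):
    """Turn an encoded quality line into integer Phred scores.

    Assumption: Phred+33 ASCII encoding; pass another offset if needed.
    """
    return [ord(ch) - offset for ch in quality_line]


def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from 4-line FASTQ records.

    Reads all lines into memory first, which is fine for a small example
    but not for a full sequencing run.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, decode_qualities(qual)


if __name__ == "__main__":
    # Hypothetical record; the quality characters decode to Q40, Q20, Q10, Q3.
    toy = ["@read1\n", "ACGT\n", "+\n", "I5+$\n"]
    for read_id, seq, quals in parse_fastq(toy):
        for base, q in zip(seq, quals):
            print(read_id, base, q, f"{phred_to_error_prob(q):.4f}")
```

The same conversion applies to mapping quality: a Q30 alignment corresponds to an error probability of 0.001, which is the "1 in 1000" figure on the Alignment Quality slide.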
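
Worked example: reading a SAM record
The SAM Description slides walk through the fields of one record. The sketch below splits that record into its eleven mandatory fields, decodes flag 99 into its bits, and parses the two CIGAR examples from the slides. The field names and flag bits follow the SAM specification; the function names and the dictionary layout are illustrative choices, and in practice a library or samtools would normally do this work.

```python
import re

# Flag bits from the SAM specification, keyed by bit value.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the names of the bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unrecognised CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def parse_sam_line(line):
    """Return the 11 mandatory SAM fields of one alignment line as a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(FIELD_NAMES, fields[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record


# The record shown on the slides, rebuilt with tab separators.
SLIDE_RECORD = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77",
])

if __name__ == "__main__":
    rec = parse_sam_line(SLIDE_RECORD)
    print(rec["QNAME"], rec["RNAME"], rec["POS"], decode_flag(rec["FLAG"]))
    for cigar in ("76M", "4M1D6M", "4M1I4M"):
        print(cigar, parse_cigar(cigar), reference_span(cigar), query_length(cigar))
```

Flag 99 is 1 + 2 + 32 + 64, so the read is paired, in a proper pair, first in its pair, and its mate maps to the reverse strand. For the CIGAR examples, 4M1D6M covers 11 reference bases with a 10-base read, and 4M1I4M covers 8 reference bases with a 9-base read.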
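
Worked example: classifying a read pair
The SV summary table can be read as a small decision procedure over one aligned read pair. The sketch below is a simplification under stated assumptions, not how HYDRA-SV or any other caller works: the function name, the 100 to 250 bp concordant range, and the fixed read length are hypothetical. The slides only say "wrong orientation" for inversions and tandem duplications; the code assumes the common convention that same-strand pairs suggest an inversion and outward-facing pairs suggest a tandem duplication.

```python
def classify_pair(chrom1, pos1, reverse1, chrom2, pos2, reverse2,
                  read_length, min_insert, max_insert):
    """Classify one paired-end alignment by mapped distance and orientation.

    pos1/pos2 are leftmost mapped coordinates; reverse1/reverse2 are True
    when the read maps to the reverse strand. min_insert and max_insert
    bound the insert sizes expected from library preparation.
    """
    if chrom1 != chrom2:
        return "interchromosomal"

    # Order the two reads by their leftmost coordinate on the reference.
    (left_pos, left_rev), (right_pos, right_rev) = sorted(
        [(pos1, reverse1), (pos2, reverse2)]
    )

    # Assumed convention: both reads on the same strand suggests an inversion.
    if left_rev == right_rev:
        return "inversion"
    # Assumed convention: reads facing outward suggests a tandem duplication.
    if left_rev and not right_rev:
        return "tandem_duplication"

    # Expected orientation: left read forward, right read reverse.
    mapped_distance = right_pos + read_length - left_pos
    if mapped_distance > max_insert:
        return "deletion"      # mapped insert size larger than expected
    if mapped_distance < min_insert:
        return "insertion"     # mapped insert size smaller than expected
    return "concordant"


if __name__ == "__main__":
    # Hypothetical library parameters for illustration only.
    params = dict(read_length=76, min_insert=100, max_insert=250)
    # Coordinates of the pair in the SAM record from the slides.
    print(classify_pair("19", 8882171, False, "19", 8882214, True, **params))
    # A pair whose reads map much further apart than the library allows.
    print(classify_pair("19", 1000, False, "19", 6000, True, **params))
```

A real caller would estimate the concordant range from the observed fragment size distribution and would require several discordant pairs supporting the same event before reporting it.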