Transcript Slide 1
Read Processing and Mapping: From Raw to Analysis-ready Reads
B E N P A S S A R E L L I Q U A K E L A B N G S W O R K S H O P M A Y 3 0 , 2 0 1 4
2
From Raw to Analysis-ready Reads
Raw reads
Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration
Session Topics
•
Brief
overview of high-throughput sequencing platforms • Understand read data formats and quality scores • Identify and fix some common read data problems • Find a genomic reference for mapping • Map reads to a reference genome • Understand alignment output • Sort, merge, index alignment for further analysis • Mark/eliminate duplicate reads • Locally realign at indels • Recalibrate base quality scores • How to get started
Analysis-ready reads
Sequencing Platforms at a Glance
Illumina Sequencing Platforms
MiSeq
Features # Flowcells # Sample Mixes # Clusters / Run Max Read Length Gb / Run Hours / Run Reagent Cost / Gb
NextSeq 500
MiSeq 1 1 25M 2x300 15 55 hours $79
HiSeq 2500
NextSeq 500 1 1 400M 2x150 120 30 hours $32 HiSeq 2500 2 16 3200M 2x100 640 12 days $36
Single Cell Analysis Toolset • Built on R Statistics Package • Differential gene expression analysis and visualization • PCA • Unsupervised clustering • ANOVA (statistical hypothesis testing)
Sample to Raw Reads
Sample Preparation QC and Quantification Library Construction Sequencing Raw Reads 6 C1 Single Cell Capture Imaging / Lysis Amp of DNA / cDNA AATI Fragment Analyzer Evaluate and Quantitate Harvested C1 DNA products NextSeq 500 300M or 800M Reads In ~24 hours
Solid Phase Amplification
Sequencing Steps •Clusters are linearized •Sequencing primer annealed •All labeled dNTPs added at each cycle •Intensity of different tags base call •Error Profile: substitutions 7 Library DNA binds to Oligos Immobilized on Glass Flowcell Surface
Instrument Output
Illumina LifeTech Pacific Biosciences Oxford Nanopore MiSeq NextSeq HiSeq Base call file (.bcl) PGM Proton Standard flowgram file (.sff) RS Trace (.trc.h5) Pulse (.pls.h5) Base (.bas.h5) MinION Squiggle (.h5)
Sequence Data
(FASTQ Format)
8
FASTQ Format (Illumina Example)
Read Record Header Separator (with optional repeated header)
Flow Cell ID Lane Tile Tile Coordinates Barcode @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT + BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
Read Quality
+
Scores
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2 @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC + CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
Base Call Quality: Phred Quality Scores
Phred* quality score
Q
with base-calling error probability P
Q
= -10 log
10
P
* Name of first program to assign accurate base quality scores. From the Human Genome Project.
Q score
10 20 30 40
Probability of base error
0.1
0.01
0.001
0.0001
Base confidence
90% 99% 99.9% 99.99%
Sanger encoded (Q Score + 33) ASCII character
“+” “5” “?” “I” SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()* + ,-./01234 5 6789:;<=> ?
@ABCDEFGH I JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 S - Sanger Phred+33 range: 0 to 40 I - Illumina 1.3+ Phred+64 range: 0 to 40 L - Illumina 1.8+ Phred+33 range: 0 to 41
Initial Read Assessment and Processing
Raw reads
Common problems that can affect analysis:
Read assessment and prep
Mapping Duplicate Marking Local realignment Base quality recalibration Low confidence base calls • typically toward ends of reads • criteria vary by application Presence of adapter sequence in reads • poor fragment size selection • protocol execution or artifacts Over-abundant sequence duplicates Library contamination
Analysis-ready reads
Quick Read Assessment: FastQC
Free Download Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y Samples reads (200K default): fast, low resource use
Read Assessment Example (Cont’d) Trim for base quality or adapters (run or library issue) Trim leading bases (library artifact)
Read Assessment Example (Cont’d)
TruSeq Adapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
Comprehensive Read Assessment: Prinseq
15 http://prinseq.sourceforge.net/
Selected Tools to Process Reads
Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/ (partial list) FASTQ Information: Chart Quality Statistics and Nucleotide Distribution FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise).
FASTQ Clipper: Removing sequencing adapters FASTQ Quality Filter: Filters sequences based on quality FASTQ Quality Trimmer: Trims (cuts) sequences based on quality FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality *defaults to old Illumina fastq (ASCII offset 64). Use –Q33 option.
SepPrep https://github.com/jstjohn/SeqPrep Adapter trimming Merge overlapping paired-end read Biopython http://biopython.org
, http://biopython.org/DIST/docs/tutorial/Tutorial.html
(for python programmers) Especially useful for implementing custom/complex sequence analysis/manipulation Galaxy http://galaxy.psu.edu
Great for beginners: upload data, point and click Just about everything you’ll see in today’s presentations SolexaQA2 http://solexaqa.sourceforge.net
Dynamic trimming Length sorting (resembles read grouping of Prinseq)
Many Analysis Pipelines Start with Read Mapping
Genotyping/Haplotyping Gene Expression https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq Tumor/Normal Comparison 17 http://www.appistry.com/sites/all/themes/appistry/files/pdfs/CGAS_download.pdf
https://www.broadinstitute.org/gatk/guide/best-practices
Read Mapping
Raw reads
Read assessment and prep
Mapping
Duplicate Marking Local realignment Base quality recalibration
Analysis-ready reads
http://www.broadinstitute.org/igv/
Sequence References and Annotations
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome Comprehensive reference information http://hgdownload.cse.ucsc.edu/downloads.html
Comprehensive reference, annotation, and translation information ftp://[email protected]/bundle References and SNP information data by GATK Human only http://cufflinks.cbcb.umd.edu/igenomes.html
Pre-indexed references and gene annotations for Tuxedo suite Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast http://www.repeatmasker.org
Fasta Sequence Format • • • • • One or more sequences per file “>” denotes beginning of sequence or contig Subsequent lines up to the next “>” define sequence Lowercase base denotes repeat masked base Contig ID may have comments delimited by “|”
>chr1 … TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa taaccatatttgggatatactggattttaaaaatatatcactaatttcat … >chr2 … >chr3 …
Read Mapping
License Mismatch allowed Novoalign (3.0)
Commercial up to 8
SOAP3 (0.01 beta)
GPL v3 up to 3 random/all/none random/all/none
BWA (0.7.8) Bowtie2 (2.2.2)
GPL v3 Artistic user specified. max is function of read length and error rate user specified user selected user selected
Tophat2 (2.0.11)
Artistic uses Bowtie2
Alignments reported per read Gapped alignment
up to 7bp
Pair-end reads Best alignment Trim bases Comments
yes highest alignment score 3’ end At one time, best performance and alignment quality uses Bowtie2 1-3bp gap yes minimal number of mismatches 3’ end Can use nVIDIA CUDA (GPU) yes yes minimal number of mismatches yes yes highest alignment score yes splice junctions introns yes uses Bowtie2 3’ and 5’ end 3’ and 5’ end Element of B road’s “best practices” genotyping workflow Smith-Waterman quality alignments, currently fastest uses Bowtie2 Currently most popular RNA-seq aligner
STAR (2.3.0e)
GPL v3 user specified user selected yes splice junctions introns yes highest alignment score 3’ and 5’ end Very fast; uses memory to achieve performance
Read Mapping: BWA
BWA Features • Uses Burrows Wheeler Transform — fast — modest memory footprint (<4GB) • Accurate • Tolerates base mismatches — increased sensitivity — reduces allele bias • Gapped alignment for both single- and paired-ended reads • Automatically adjusts parameters based on read lengths and error rates • Native BAM/SAM output (the
de facto
standard) • Large installed base, well-supported • Open-source (no charge)
Read Mapping: Bowtie2
Bowtie2 • Uses dynamic programming (edit distance scoring) o Eliminates need for realignment around indels o o Can be tuned for different sequencing technologies • Multi-seed search - adjustable sensitivity • Input read length limited only by available memory • Fasta or Fastq input • Caveats Longer input reads require much more memory o Trade-off parallelism with memory requirement Dynamic Programming Illustration 23 http://bowtie-bio.sourceforge.net/bowtie2 Langmead B, Salzberg S.
Fast gapped-read alignment with Bowtie 2
, Nature Methods. 2012, 9:357-359
SAM (BAM) Format
Sequence Alignment/Map format Universal standard Human-readable (SAM) and compact (BAM) forms Superset of FASTQ Structure Header version, sort order, reference sequences, read groups, program/processing history Alignment records
SAM/BAM Format: Header
[benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
@HD VN:1.0
GO:none SO:coordinate @SQ SN:chrM LN:16571 sort order @SQ @SQ @SQ … @SQ @SQ @SQ @SQ @SQ @SQ … @RG @RG @RG @RG @RG … @RG @PG @PG SN:chr1 SN:chr2 SN:chr3 SN:chr19 SN:chr20 SN:chr21 SN:chr22 SN:chrX SN:chrY ID:86-191 ID:BsK010 ID:Bsk136 ID:MAK001 ID:NG87 LN:249250621 LN:243199373 LN:198022430 LN:59128983 LN:63025520 LN:48129895 LN:51304566 LN:155270560 LN:59373566 PL:ILLUMINA PL:ILLUMINA PL:ILLUMINA PL:ILLUMINA PL:ILLUMINA LB:IL500 LB:IL501 LB:IL502 LB:IL503 LB:IL504 reference sequence names SM:86-191-1 SM:BsK010-1 SM:Bsk136-1 SM:MAK001-1 SM:NG87-1 ID:SDH023 PL:ILLUMINA ID:GATK IndelRealigner ID:bwa PN:bwa LB:IL508 VN:0.6.2-r126 SM:SDH023 VN:2.0-39-gd091f72 with lengths samtools to view bam header read groups with platform, library and sample information program (analysis) history CL:knownAlleles=[] targetIntervals=tmp.intervals.list
SAM/BAM Format: Alignment Records
[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
1 10 11 2 3 4 5 6 8 9 HW-ST605:127:B0568ABXX:2:1201:10933:3739 RG:Z:86-191 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ HW-ST605:127:B0568ABXX:3:1104:21059:173553 RG:Z:SDH023 83 chr1 27682 60 101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::=BDHFHGFFDCGDAACCABHCCBDFBEA??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 * Many fields after column 12 deleted (e.g., recalibrated base scores) have been deleted for improved readability http://samtools.sourceforge.net/SAM1.pdf
Compression is Big Win for HTS Data
33.8M 100bp Illumina reads 6x Compression Ratio 5x Improvement 4x 3x
Preparing for Next Steps
Raw reads
Read assessment and prep Subsequent steps require sorted and indexed bams Sort orders: karyotypic, lexicographical Indexing improves analysis performance Mapping Duplicate Marking Picard tools: fast, portable, free http://picard.sourceforge.net/command-line-overview.shtml
Sort: Merge: Index: SortSam.jar
MergeSamFiles.jar
BuildBamIndex.jar
Local realignment Order: sort, merge (optional), index Base quality recalibration 28
Analysis-ready reads
Duplicate Marking
Raw reads
Read assessment and prep Mapping
Duplicate Marking
Local realignment Base quality recalibration
Analysis-ready reads
$java -Xmx4g -jar
MarkDuplicates.jar
INPUT=aligned.sorted.bam \ OUTPUT=aligned.sorted.dedup.bam \ VALIDATION_STRINGENCY=LENIENT \ METRICS_FILE=aligned.dedup.metrics.txt \ REMOVE_DUPLICATES=false \ ASSUME_SORTED=true \ http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
SAM/BAM Format: Alignment Records
[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
HW-ST605:127:B0568ABXX:2:1201:10933:3739 RG:Z:86-191 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ http://picard.sourceforge.net/explain-flags.html
http://samtools.sourceforge.net/SAM1.pdf
Local Realignment
Raw reads
BWT BASED ALIGNMENT IS FAST FOR MATCHING READS TO REFERENCE I NDIVIDUAL BASE ALIGNMENTS OFTEN SUB OPTIMAL AT INDELS Read assessment and prep Mapping A PPROACH Fast read mapping with BWT-based aligner Realign reads at indel sites using gold standard (but much slower) Smith-Waterman algorithm Duplicate Marking
Local realignment
Base quality recalibration
Analysis-ready reads
B ENEFITS Refines location of indels Reduces erroneous SNP calls Very high alignment accuracy in significantly less time, with fewer resources 1 Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195 –197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
Local Realignment Raw BWA alignment Post re-alignment at indels
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Base Quality Recalibration
Raw reads
Read assessment and prep Mapping Duplicate Marking
STEP 1: Find covariates at non-dbSNP sites using:
Reported quality score The position within the read The preceding and current nucleotide (sequencer properties) java -Xmx4g -jar GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -I alignment.bam \ -R hg19/ucsc.hg19.fasta \ -knownSites hg19/dbsnp_135.hg19.vcf \
-o alignment.recal_data.grp
Local realignment
Base quality recalibration Analysis-ready reads STEP 2: Generate BAM with recalibrated base scores:
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R hg19/ucsc.hg19.fasta \ -I alignment.bam \
-BQSR alignment.recal_data.grp
-o alignment.recalibrated.bam
\
Base Quality Recalibration (Cont’d)
Raw reads
Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration
Analysis-ready reads
35
Is there an easier way to get started?!
http://galaxyproject.org/ Click on “Use Galaxy”
Getting Started
Raw reads
Read assessment and prep Mapping Duplicate Marking Local realignment Base quality recalibration
Analysis-ready reads
38