Transcript NGS QC
Sequencing Data Quality
Saulo Aflitos

Assembly - Concepts
• Read (≈100bp)
• Contig (≈2Kbp)
• Paired-End, Mate-Pair
• Scaffold (≈2Mbp)
• Pseudo Molecule (Super Scaffold)
• Low Complexity Region

Assembly
Scaffolding
Repeats?!
Reality

Depth of Coverage
[Figure: Consensus, Contig, Reads; depth labels 1x, 3x, 2x, 3x, 1x] (Goldberg SMD et al. 2006)

Heterozygosity
[Figure: aligned base calls (A, C, G, T, N, A/C); labels 95% ±5 and 50% ±10]

Consequences of Data Cleaning
Non-unique 31-mers volume (bp) / 950 Mbp
[Figure: Raw vs. Filtered for Heinz, All Round, Pennellii, F5, Pimpinellifolium; plotted values 265.89, 36.83, 48.65, 27.04, 50.37, 23.66, 41.61, 29.72, 57.60, 1.33]

Sequencing
• Shotgun
• RNAseq

Sequencing
• Paired End
• Mate Pair
• Genome

Sample Preparation
• Shred: Ultrasound, Physical, RE
• Size Selection: Gel, Beads
• Adapter: Illumina, 454, PacBio
– Sequencing ID
– Binding to Surface
• Circularization, Shredding, Size Selection, Sequencing

Sequencing: Illumina PE
• Insert Size: 150bp-2Kbp
• Read Length: 100bp

Sequencing: 454 MP
• Insert Size: 2K-20Kbp
• Read Length: 500bp
[Figure labels: 150bp, 150bp, 150bp]

Data: FastQ
• Machine Name
• Read ID (unique)
• Encoded Quality 0-40
• Chance of being wrong

FastQ Format

FastQ Statistics
• Quality 13: 0.05 (5%) chance of being wrong

Cleaning

FastQC Quality Checking Tool
• Contamination screen: FastQ Screen
• Per base sequence quality
• Per base sequence content
• Sequence duplication
• Per sequence quality
• Sequence length distribution
• Per base N-content
• Per base GC content
• Per sequence GC content

SolexaQA Cleaning Tool

Exercise
• Create “cleaning” folder
– mkdir cleaning; cd cleaning
• Inside it, run: wget -O saulo.bash http://goo.gl/Tx8g6
• Run it with: bash saulo.bash
• This will download FastQC and SolexaQA
– FASTQC HELP: http://goo.gl/EE8M7
– FASTQC TUTORIAL: http://goo.gl/rihyA
– FASTQC MANUAL: http://goo.gl/9yihC
– SolexaQA Help: http://solexaqa.sourceforge.net/
• Run FastQC: ./FastQC/fastqc &
• File > open [Files of Type = FastQ files]

Exercise
• Verify the two .fq files (you can use less):
– bad_MiSeq_dataset.fq
– good_MiSeq_dataset.fq
• Clean the bad dataset with SolexaQA’s DynamicTrim.pl script:
– perl SolexaQA_v.2.1/DynamicTrim.pl bad_MiSeq_dataset.fq -h 25
• Verify the improvement (or not) by opening
– bad_MiSeq_dataset.fq.trimmed
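
The slides relate an encoded quality (0-40) to the chance of a base being wrong (quality 13 = 0.05 = 5%), and the exercise trims with a Phred cutoff of 25 (-h 25). A minimal Python sketch of that relationship follows. It assumes the Phred+33 ASCII encoding, and the trimming function only illustrates the general idea of keeping the longest run of bases at or above a cutoff. It is not SolexaQA's code, and the tool's exact rule may differ. The read and all function names below are hypothetical.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_to_error_prob(q):
    """Chance of a base call being wrong for Phred quality q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)


def decode_qualities(quality_line, offset=PHRED_OFFSET):
    """Turn the fourth line of a FastQ record into integer qualities."""
    return [ord(ch) - offset for ch in quality_line]


def longest_segment_at_or_above(quals, cutoff):
    """Return (start, end) of the longest contiguous run with q >= cutoff.

    end is exclusive; (0, 0) means no base passed the cutoff.
    """
    best = (0, 0)
    start = None
    # A sentinel below the cutoff closes a run that reaches the read end.
    for i, q in enumerate(list(quals) + [cutoff - 1]):
        if q >= cutoff:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def trim_record(seq, quality_line, cutoff):
    """Trim a read and its quality string to the longest passing segment."""
    start, end = longest_segment_at_or_above(
        decode_qualities(quality_line), cutoff
    )
    return seq[start:end], quality_line[start:end]


# Hypothetical read, not taken from the exercise datasets.
seq = "ACGTACGTAC"
qual = "IIIII5II#."

print(round(phred_to_error_prob(13), 2))   # 0.05, the 5% on the slide
print(decode_qualities(qual))
print(trim_record(seq, qual, cutoff=25))   # ('ACGTA', 'IIIII')
```

Comparing the decoded qualities before and after trimming is a quick way to check by hand what a cutoff of 25 removes from a read.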