Transcript Bioinformatics Tips (NGS data processing)
Bioinformatics Tips NGS data processing and pipeline writing Na Cai 3 rd year DPhil in Clinical Medicine Supervisor: Jonathan Flint
Example projects
CONVERGE - 1.7x whole genome sequencing in 12,000 Han Chinese Women - 6000 Cases of MD, 6000 controls - Detailed questionnaire - 45T of sequencing data Commercial Outbred Mice - 0.1x whole genome sequencing in 2,000 mice - Known breeding history - Extensive phenotyping - 2T of sequencing data
NGS data processing
Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
One step at a time processing
• • • • Make new directories as you go along Make flag files to indicate successful completion of previous command Parallelize using make This is good for step by step troubleshooting
Pipeline writing – Ruffus
http://www.ruffus.org.uk
Setting up Ruffus
Once Ruffus is set up - Help
Once Ruffus is set up – just print
NGS data processing
Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
Processing a raw BAM file
• Things to consider – How many samples one is processing – Coverage per sample – Ploidy of subjects – Size of genome – Source of DNA and possible contamination – Server/cluster usage: How the jobs can be parallelized
Processing a raw BAM file
• Some manipulations of bam files – Converting between bams and fastqs – Indexing – Coordinate sorting – Splitting or merging – Filter out reads – Mask entire regions
Tools of the Trade
• Picardtools
Tools of the Trade - Picardtools
• • Commonly used Picard tools: – – ValidateSamFile SamToFastq – – MergeSamFiles ReplaceSamHeader Cool Picard options: – SORT_ORDER
NGS data processing
Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
Indel Realignment
Image from: http://www.broadinstitute.org/gatk/guide/best-practices
Why Realign Around Indels?
An*example*of*a*strand&discordant*locus*
Several) consecuDve) “SNPs”)only)found) on)reads)ending) on)the)right)of)the) homopolymer) Several) consecuDve) “SNPs”)only)found) on)reads)ending) on)the)leH)of)the) homopolymer) 7bp)“T”) homopolymer)run)
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2 Realignment.pdf
Why Realign Around Indels?
Local*realignment*uncovers*the*hidden*indel*in*these* reads*and*eliminates*all*the*poten7al*FP*SNPs* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2 Realignment.pdf
How does it work?
Identified intervals: Local*realignment*iden7fies*most*parsimonious* • Indels discovered in original alignments (in CIGAR strings of alignment*along*all*reads*at*a*problema7c*locus* reference,*best*fits*the*reads*in*a*pile*(maximum*of*1*indel)*** Ref:* AAGCGTCG !
Three* adjacent* SNPs* AAGCGTCG !
Realigning* determines* which*is* beVer* AAG---CG !
Read*pile*consistent* with*the*reference* sequence* Read*pile*consistent* with*a*3bp*inser7on* 2.*The*score*for*an*alternate*consensus*is*the*total*sum*of*the*quality* Realignment.pdf
3.*If*the*score*of*the*best*alternate*consensus*is*sufficiently*beVer*than*the* original*alignments*(using*a*LOD*score),*then*we*accept*the*proposed* realignment*of*the*reads* 12/4/12* 9*
Original*BAM*file*
RealignerTargetCreator)
Intervals*(.intervals)*
IndelRealigner)
Realigned*BAM*file* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2 Realignment.pdf
Implementing the Indel Realignment sample1 sample2 sample3 sample4 sample5 sample6 sample7
Site1
reads reads reads …
Site2 Site3 Site4 Site5 Site6 Site7 Site8
The RealignerTargetCreater needs as many reads from all the samples at a particular site to determine if reads tend to get misaligned there need to parse in data for all samples at the same time
Implementing the Indel Realignment sample1 sample2 sample3 sample4 sample5 sample6 sample7
Site1
reads reads reads …
Site2 Site3 Site4 Site5 Site6 Site7 Site8
Once the Intervals are identified, reads from any single sample can be realigned individually based on the sample’s own insertion/deletion lengths only need to parse in one sample’s data at a time
Base Quality Score Recalibration (BQSR) Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
The%quality%scores%issued%by%sequencers% are%inaccurate%and%biased% • • Quality%scores%are%cri2cal%for%all%downstream%analysis%%
Why BQSR?
Example:%Bias%in%the%quali2es%reported%depending%on%nucleo2de%context%%%
RMSE = 4.188
RMSE = 0.281
original% recalibrated% AA AG CA CG GA Dinuc GG TA TG AA AG CA CG GA Dinuc GG TA TG Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3 Base_recalibration.pdf
The BQSR workflow
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3 Base_recalibration.pdf
Implementing the BQSR
sample1 sample2 sample3 sample4 sample5 sample6 sample7
Site1
reads reads reads …
Site2 Site3 Site4 Site5 Site6 Site7 Site8
The BaseRecalibrator needs all reads from each samples at all unmasked sites to come up with the recalibration table for the dataset need to parse in all of the data of each sample
Variant Calling
Image from: http://www.broadinstitute.org/gatk/guide/best-practices
Variant Calling
Multi-sample calling integrates per sample likelihoods to jointly estimate allele frequency of variation!
Sample-associated reads!
Genotype likelihoods!
Individual 1!
Allele frequency!
Individual 2!
Joint estimate across samples!
SNPs and Indels!
Individual N!
Genotype frequencies!
• Simultaneous'es6ma6on'of:' From: http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5 – Allele'frequency'(AF)'spectrum'Pr{AF'='i'|'D}' – The'probability'that'a'variant'exists'Pr{AF'>'0'|'D}'' – Assignment'of'genotypes'to'each'sample' 12!
Excellent mathematical treatment at http://lh3lh3.users.sourceforge.net/download/samtools.pdf
Implementing Variant Calling
sample1 sample2 sample3 sample4 sample5 sample6 sample7
Site1
reads reads reads …
Site2 Site3 Site4 Site5 Site6 Site7 Site8
The UnifiedGenotyper (and many other callers) needs as many reads from all the samples at a particular site to determine if there is a variant at the site tend need to parse in data for all samples at a particular site at the same time
Variant Calling Softwares
• • • • • Samtools GATK UnifiedGenotyper GATK HaplotypeCaller Platypus Cortex (denovo assembly + variant caller)
Comparison between callers
Acknowledgements
Jonathan Flint Richard Mott Winni Kretzschmar Robbie Davies Leo Goodstadt (Ruffus) Kiran Garimella (GATK) Gerton Lunter (Stampy) Andy Rimmer (Platypus) Zam Iqbal (Cortex) John Broxholme