Bioinformatics Tips (NGS data processing)

Download Report

Transcript Bioinformatics Tips (NGS data processing)

Bioinformatics Tips NGS data processing and pipeline writing Na Cai 3 rd year DPhil in Clinical Medicine Supervisor: Jonathan Flint

Example projects

CONVERGE - 1.7x whole genome sequencing in 12,000 Han Chinese Women - 6000 Cases of MD, 6000 controls - Detailed questionnaire - 45T of sequencing data Commercial Outbred Mice - 0.1x whole genome sequencing in 2,000 mice - Known breeding history - Extensive phenotyping - 2T of sequencing data

NGS data processing

Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

One step at a time processing

• • • • Make new directories as you go along Make flag files to indicate successful completion of previous command Parallelize using make This is good for step by step troubleshooting

Pipeline writing – Ruffus

http://www.ruffus.org.uk

Setting up Ruffus

Once Ruffus is set up - Help

Once Ruffus is set up – just print

NGS data processing

Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

Processing a raw BAM file

• Things to consider – How many samples one is processing – Coverage per sample – Ploidy of subjects – Size of genome – Source of DNA and possible contamination – Server/cluster usage: How the jobs can be parallelized

Processing a raw BAM file

• Some manipulations of bam files – Converting between bams and fastqs – Indexing – Coordinate sorting – Splitting or merging – Filter out reads – Mask entire regions

Tools of the Trade

• Picardtools

Tools of the Trade - Picardtools

• • Commonly used Picard tools: – – ValidateSamFile SamToFastq – – MergeSamFiles ReplaceSamHeader Cool Picard options: – SORT_ORDER – – CREATE_INDEX CREATE_MD5_FILE – VALIDATION_STRINGENCY

NGS data processing

Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

Indel Realignment

Image from: http://www.broadinstitute.org/gatk/guide/best-practices

Why Realign Around Indels?

An*example*of*a*strand&discordant*locus*

Several) consecuDve) “SNPs”)only)found) on)reads)ending) on)the)right)of)the) homopolymer) Several) consecuDve) “SNPs”)only)found) on)reads)ending) on)the)leH)of)the) homopolymer) 7bp)“T”) homopolymer)run)

Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2 Realignment.pdf

Why Realign Around Indels?

Local*realignment*uncovers*the*hidden*indel*in*these* reads*and*eliminates*all*the*poten7al*FP*SNPs* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2 Realignment.pdf

How does it work?

Identified intervals: Local*realignment*iden7fies*most*parsimonious* • Indels discovered in original alignments (in CIGAR strings of alignment*along*all*reads*at*a*problema7c*locus* reference,*best*fits*the*reads*in*a*pile*(maximum*of*1*indel)*** Ref:* AAGCGTCG !

Three* adjacent* SNPs* AAGCGTCG !

Realigning* determines* which*is* beVer* AAG---CG !

Read*pile*consistent* with*the*reference* sequence* Read*pile*consistent* with*a*3bp*inser7on* 2.*The*score*for*an*alternate*consensus*is*the*total*sum*of*the*quality* Realignment.pdf

3.*If*the*score*of*the*best*alternate*consensus*is*sufficiently*beVer*than*the* original*alignments*(using*a*LOD*score),*then*we*accept*the*proposed* realignment*of*the*reads* 12/4/12* 9*

Original*BAM*file*

RealignerTargetCreator)

Intervals*(.intervals)*

IndelRealigner)

Realigned*BAM*file* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2 Realignment.pdf

Implementing the Indel Realignment sample1 sample2 sample3 sample4 sample5 sample6 sample7

Site1

reads reads reads …

Site2 Site3 Site4 Site5 Site6 Site7 Site8

The RealignerTargetCreater needs as many reads from all the samples at a particular site to determine if reads tend to get misaligned there  need to parse in data for all samples at the same time

Implementing the Indel Realignment sample1 sample2 sample3 sample4 sample5 sample6 sample7

Site1

reads reads reads …

Site2 Site3 Site4 Site5 Site6 Site7 Site8

Once the Intervals are identified, reads from any single sample can be realigned individually based on the sample’s own insertion/deletion lengths  only need to parse in one sample’s data at a time

Base Quality Score Recalibration (BQSR) Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

The%quality%scores%issued%by%sequencers% are%inaccurate%and%biased% • • Quality%scores%are%cri2cal%for%all%downstream%analysis%%

Why BQSR?

Example:%Bias%in%the%quali2es%reported%depending%on%nucleo2de%context%%%

RMSE = 4.188

RMSE = 0.281

original% recalibrated% AA AG CA CG GA Dinuc GG TA TG AA AG CA CG GA Dinuc GG TA TG Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3 Base_recalibration.pdf

The BQSR workflow

Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3 Base_recalibration.pdf

Implementing the BQSR

sample1 sample2 sample3 sample4 sample5 sample6 sample7

Site1

reads reads reads …

Site2 Site3 Site4 Site5 Site6 Site7 Site8

The BaseRecalibrator needs all reads from each samples at all unmasked sites to come up with the recalibration table for the dataset  need to parse in all of the data of each sample

Variant Calling

Image from: http://www.broadinstitute.org/gatk/guide/best-practices

Variant Calling

Multi-sample calling integrates per sample likelihoods to jointly estimate allele frequency of variation!

Sample-associated reads!

Genotype likelihoods!

Individual 1!

Allele frequency!

Individual 2!

Joint estimate across samples!

SNPs and Indels!

Individual N!

Genotype frequencies!

• Simultaneous'es6ma6on'of:' From: http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5 – Allele'frequency'(AF)'spectrum'Pr{AF'='i'|'D}' – The'probability'that'a'variant'exists'Pr{AF'>'0'|'D}'' – Assignment'of'genotypes'to'each'sample' 12!

Excellent mathematical treatment at http://lh3lh3.users.sourceforge.net/download/samtools.pdf

Implementing Variant Calling

sample1 sample2 sample3 sample4 sample5 sample6 sample7

Site1

reads reads reads …

Site2 Site3 Site4 Site5 Site6 Site7 Site8

The UnifiedGenotyper (and many other callers) needs as many reads from all the samples at a particular site to determine if there is a variant at the site tend  need to parse in data for all samples at a particular site at the same time

Variant Calling Softwares

• • • • • Samtools GATK UnifiedGenotyper GATK HaplotypeCaller Platypus Cortex (denovo assembly + variant caller)

Comparison between callers

Acknowledgements

Jonathan Flint Richard Mott Winni Kretzschmar Robbie Davies Leo Goodstadt (Ruffus) Kiran Garimella (GATK) Gerton Lunter (Stampy) Andy Rimmer (Platypus) Zam Iqbal (Cortex) John Broxholme