Class slides for the BioInformatics part

Introduction To Next Generation Sequencing (NGS) Data Analysis

Jenny Wu

UCI Genomics High Throughput Facility

Outline

• Goals: Practical guide to NGS data processing
• Bioinformatics in NGS data analysis
– Basics: terminology, data formats, general workflow etc.
– Data Analysis Pipeline: sequence QC and preprocessing; downloading reference sequences (query UCSC databases); sequence mapping; downstream analysis workflow and software
• RNA-Seq data analysis
– Concepts: spliced alignment, normalization, coverage, differential expression.
– Tuxedo suite: Tophat, Cufflinks and cummeRbund
– Data visualization with Genome Browsers.
– RNA-Seq pipeline software: Galaxy vs. shell scripting
• ChIP-Seq data analysis workflow and software
• NGS bioinformatics resources
• Summary

Why Next Generation Sequencing

One can generate hundreds of millions of short sequences (35bp-150bp) in a single run in a short period of time with low per base cost.

• Illumina/Solexa GA II / HiSeq 2000, 2500, X
• Roche/454 FLX, Titanium
• Life Technologies/Applied Biosystems SOLiD

Reviews: Michael Metzker (2010) Nature Reviews Genetics 11:31; Quail et al. (2012) BMC Genomics Jul 24;13:341.

Why Bioinformatics

Informatics

(wall.hms.harvard.edu)

Bioinformatics Challenges in NGS Data Analysis

• VERY large text files (thousands of millions of lines long)
– Can’t do ‘business as usual’ with familiar tools
– Impossible memory usage and execution time
– Manage, analyze, store, transfer and archive huge files
• Need for powerful computers and expertise
– Informatics groups must manage compute clusters
– New algorithms and software are required, and they are often open source and Unix/Linux based.
– Collaboration of IT experts, bioinformaticians and biologists

Basic NGS Workflow

Olson et al.

NGS Data Analysis Overview

Olson et al.

Terminology

Experimental Design:

• Coverage (sequencing depth): The number of nucleotides from reads that are mapped to a given position.
• Paired-End Sequencing: Both ends of the DNA fragment are sequenced, allowing highly precise alignment.
• Multiplex Sequencing: "Barcode" sequences are added to each sample so that the samples can be distinguished, which allows a large number of samples to be sequenced on one lane.

Data analysis:

• Quality Score: Each called base comes with a quality score which measures the probability of base call error.
• Mapping: Align reads to reference to identify their origin.
• Assembly: Merging of fragments of DNA in order to reconstruct the original sequence.
• Duplicate reads: Reads that are identical.
• Multi-reads: Reads that can be mapped to multiple locations equally well.
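The quality score defined above is conventionally reported on the Phred scale, Q = -10 * log10(p), where p is the estimated probability that the base call is wrong. The following is a minimal Python sketch of that conversion. It assumes the Phred+33 ASCII encoding (the function names and the `offset` parameter are illustrative and do not come from the slides); older Illumina pipelines used a different offset, so check the encoding of your own data first.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into a base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality(qual_string, offset=33):
    """Decode an ASCII quality string into a list of Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in qual_string]


if __name__ == "__main__":
    # Q20 corresponds to a 1-in-100 error rate, Q30 to 1-in-1000.
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(decode_quality("BCCFFF"))
```

Read this way, a base with Q30 has an estimated 1-in-1000 chance of being wrong, which is why many QC reports use Q20 or Q30 as informal thresholds.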

What does the data look like?

Common NGS Data Formats

For a full list, go to http://genome.ucsc.edu/FAQ/FAQformat.html

File Formats

• Reference sequences, reads:
– FASTQ
– FASTA
• Alignments:
– SAM
– BAM
• Features, annotation, scores:
– GFF/GTF
– BED/BigBed
– WIG/BigWig

FASTA Format (Reference Seq)

FASTQ Format (reads)

FASTQ Format (Illumina Example)

Each read record has four lines: Read Record Header, Read Bases, Separator (with optional repeated header), Read Quality Scores.

Header fields labeled on the slide: Flow Cell ID, Lane, Tile, Tile Coordinates, Barcode.

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
(quality line missing in the transcript)
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.

(Passarelli, 2012)
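To make the four-line record layout concrete, here is a small Python sketch that reads FASTQ text and splits an Illumina-style header into the fields labeled on the slide. It assumes unwrapped four-line records and the colon-separated header layout of the example above. The names `FastqRecord`, `parse_fastq` and `split_illumina_header` are hypothetical helpers written for this illustration, not part of any library.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    header: str   # header line without the leading "@"
    bases: str    # read bases
    quality: str  # ASCII-encoded quality string


def parse_fastq(lines):
    """Yield FastqRecord objects from an iterable of FASTQ text lines.

    Assumes the simple four-line layout; a trailing partial record is ignored.
    """
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, bases, sep, quality = stripped[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(bases) != len(quality):
            raise ValueError(f"bases/quality length mismatch at line {i + 1}")
        yield FastqRecord(header[1:], bases, quality)


def split_illumina_header(header):
    """Split an Illumina-style header into named fields (all kept as strings)."""
    read_id, _, tags = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = read_id.split(":")
    read_no, filtered, control, barcode = tags.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
        "read": read_no, "filtered": filtered,
        "control": control, "barcode": barcode,
    }


if __name__ == "__main__":
    # First record from the slide, truncated to 10 bases for brevity.
    text = [
        "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n",
        "CAGGAGTCTT\n",
        "+\n",
        "BCCFFFDFHH\n",
    ]
    for rec in parse_fastq(text):
        print(split_illumina_header(rec.header), rec.bases, rec.quality)
```

For real data sets, an established parser (for example the one shipped with Biopython) is usually a better choice than a hand-written one; the sketch only shows how the pieces of a record relate.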

General Data Pipeline

Why QC?

Consequences of not assessing the data:

• Sequencing runs cost money
– Sequencing a poor library on multiple runs: throwing money away!
• Data analysis costs money and time
– Cost of analyzing data, CPU time $$
– Cost of storing raw sequence data $$$
– Hours of analysis could be wasted $$$$
– Downstream analysis can be incorrect.

How to QC?

$ module load fastqc

$ fastqc s_1_1.fastq

FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)

Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y

FastQC: Example

The UCSC Genome Browser Homepage

General information

Get genome annotation here!

Get reference sequences here!

Specific information — new features, current status, etc.

Downloading Reference Sequences

Downloading Reference Annotation

Sequence Mapping Challenges

• Alignment (mapping) is the first step once analysis-ready reads are obtained.
• The task: to align sequencing reads against a known reference.
• Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.

Short Read Alignment

Olson et al.

Short Read Alignment Software

Short Reads Mapping Software

How to choose an aligner?

• There are many short read aligners (59), and they vary a lot in performance (accuracy, memory usage, speed, flexibility, etc.).
• Factors to consider: application, platform, read length, downstream analysis, etc.
• Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie).
• Guaranteed high accuracy will take longer.

NGS Applications and Analysis Strategy

Each entry lists: Name, Nucleic acid population, Brief analysis strategy.

• RNA-Seq
– Nucleic acid population: RNA (may be poly-A mRNA or total RNA)
– Brief analysis strategy: Alignment of reads to “genes”; variations for detecting splice junctions and quantifying abundance
• Small RNA sequencing
– Nucleic acid population: Small RNA (often miRNA)
– Brief analysis strategy: Alignment of reads to small RNA references (e.g. miRbase), then to the genome; quantify abundance
• ChIP-Seq
– Nucleic acid population: DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation)
– Brief analysis strategy: Align reads to reference genome, identify peaks & motifs
• RIP-Seq
– Nucleic acid population: RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation)
– Brief analysis strategy: Align reads to reference genome and/or “genes”, identify peaks and motifs
• Methylation Analysis
– Nucleic acid population: Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms
– Brief analysis strategy: Align reads to reference and either identify peaks or regions of methylation
• SNP calling/discovery
– Nucleic acid population: All or some genomic DNA or RNA
– Brief analysis strategy: Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
• Structural Variation Analysis
– Nucleic acid population: Genomic DNA, with two reads (mate-pair reads) per DNA template
– Brief analysis strategy: Align mate-pairs to reference sequence and interpret structural variants
• de novo Sequencing
– Nucleic acid population: Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.)
– Brief analysis strategy: Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
• Metagenomics
– Nucleic acid population: Entire RNA or DNA from a (usually microbial) community
– Brief analysis strategy: Phylogenetic analysis of sequences

(Hunicke-Smith et al, 2010)

Application Specific Software

RNA-Seq Pipeline

(Wilhelm, B.T., et al, 2009)

RNA-Seq: Spliced Alignment

• Some reads will span two different exons
• Need long enough reads to be able to reliably map both sides

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

RNA-Seq: Coverage

• Coverage in RNA-Seq is highly non-uniform.
• Within a single exon, there are regions with high coverage and regions with zero coverage.
• These coverage patterns change when the library preparation protocol is changed.
• The binding preferences of random hexamer primers explain them only partially.

We simply hope that this averages out over the whole transcript!
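One way to see this non-uniformity in your own data is to compute per-base depth across an exon from the aligned read intervals. The sketch below is a hedged illustration rather than a replacement for tools that work on BAM files directly. It assumes alignments are given as 0-based, end-exclusive (start, end) pairs, and `per_base_coverage` is a hypothetical helper name.

```python
def per_base_coverage(reads, region_start, region_end):
    """Return the number of reads covering each position of a region.

    reads: iterable of (start, end) alignments, 0-based and end-exclusive.
    The region [region_start, region_end) uses the same convention.
    """
    length = region_end - region_start
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        s = max(start, region_start) - region_start
        e = min(end, region_end) - region_start
        if s < e:  # skip reads that do not overlap the region
            diff[s] += 1
            diff[e] -= 1
    coverage = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


if __name__ == "__main__":
    reads = [(0, 5), (3, 8), (20, 25)]
    print(per_base_coverage(reads, 0, 10))
```

The difference-array approach touches each read once and each position once, so it stays practical even when an exon is covered by many thousands of reads.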

RNA-Seq: Normalization

Gene-length bias

• Differential expression of longer genes is more significant because long genes yield more reads
• Ratio-based filtering yields more false positives for short genes

RNA-Seq normalization methods:

• Scaling factor based: total count, upper quartile, median, DESeq, TMM in edgeR.
• Quantile, RPKM.
• Normalize by gene length and by number of reads mapped, e.g. RPKM.

Definition of Expression levels

RPKM: Reads Per Kilobase per Million mapped reads

FPKM: Fragments Per Kilobase per Million mapped reads (for paired-end reads)

(Mortazavi et al., 2008)
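Written out, the definition is RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the gene, N is the total number of mapped reads in the sample, and L is the gene length in base pairs. FPKM uses the same expression with fragments counted instead of reads. A minimal Python sketch follows; the function name and the example numbers are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical example: two genes in a sample with 20 million mapped reads.
    total = 20_000_000
    print(rpkm(500, 2_000, total))    # shorter gene, fewer reads
    print(rpkm(2_500, 10_000, total))  # gene 5x longer with 5x the reads
```

In this example both genes get the same RPKM even though the longer one collected five times as many reads, which is the gene-length effect the normalization is meant to remove.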

RNA-Seq: Differential Expression

Discrete vs. continuous data:

• Microarray fluorescence intensity data: continuous; modeled using a normal distribution
• RNA-Seq read count data: discrete; modeled using a negative binomial distribution

Microarray software canNOT be directly used to analyze RNA-Seq data!
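One reason the negative binomial is used for counts is that it lets the variance exceed the mean, which a Poisson model cannot do. In the common mean-dispersion parameterization the variance is mean + dispersion * mean^2. The sketch below compares the two. The dispersion value is an arbitrary illustrative number, not an estimate from any data set, and the function names are hypothetical.

```python
def poisson_variance(mean):
    """Variance of a Poisson-distributed count: equal to its mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Variance of a negative binomial count in the mean-dispersion form."""
    return mean + dispersion * mean ** 2


if __name__ == "__main__":
    dispersion = 0.1  # illustrative value only
    for mean in (10, 100, 1000):
        print(mean,
              poisson_variance(mean),
              negative_binomial_variance(mean, dispersion))
```

With a dispersion of zero the two variances coincide. For any positive dispersion the extra variance grows with the square of the mean, so the gap is largest for highly expressed genes.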

RNA-Seq data analysis software

http://www.ncbi.nlm.nih.gov/pubmed/21623353

Classic RNA-Seq (Tuxedo Protocol)

SAM/BAM GTF/GFF

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Classic vs. advanced RNA-seq workflow

1. Spliced Alignment: Tophat

Tophat : a spliced short read aligner for RNA-seq.

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
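The four TopHat commands above differ only in the sample name, so they can be generated from a list of samples instead of being typed by hand. The Python sketch below only builds the command strings and does not run TopHat. It assumes the naming scheme used on the slide (condition C1/C2, replicate R1/R2, paired files ending in _1.fq and _2.fq); `tophat_command` is a hypothetical helper.

```python
def tophat_command(sample, threads=8, annotation="genes.gtf", index="genome"):
    """Build the TopHat command line for one paired-end sample.

    Uses only the options shown on the slide: -p (threads), -G (annotation),
    -o (output directory), followed by the index name and the two read files.
    """
    return (f"tophat -p {threads} -G {annotation} -o {sample}_thout "
            f"{index} {sample}_1.fq {sample}_2.fq")


# Two conditions with two replicates each, named as on the slide.
samples = [f"C{cond}_R{rep}" for cond in (1, 2) for rep in (1, 2)]
commands = [tophat_command(s) for s in samples]

if __name__ == "__main__":
    for cmd in commands:
        print(cmd)
```

The printed lines can be pasted into a shell script or a cluster job file. Generating them this way keeps output directory names consistent with the later Cufflinks and Cuffdiff steps.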

The TopHat2 Pipeline

Tophat Parameters

http://tophat.cbcb.umd.edu/manual.html

2. Transcript assembly and abundance quantification: Cufflinks

Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam

$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam

$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam

$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Cufflinks Parameters

http://cufflinks.cbcb.umd.edu/manual.html

Cufflinks and related resources

• Pachter L. Models for transcript quantification from RNA-Seq. arXiv preprint arXiv:1104.3889 (2011).

• Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.1621

• Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. doi:10.1186/gb-2011-12-3-r22

• Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btr355

3. Final Transcriptome assembly: Cuffmerge

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ more assemblies.txt

./C1_R1_clout/transcripts.gtf

./C1_R2_clout/transcripts.gtf

./C2_R1_clout/transcripts.gtf

./C2_R2_clout/transcripts.gtf

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

4. Differential Expression: Cuffdiff

CuffDiff: a program that compares transcript abundance between samples.

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

CummeRbund: Expression Plot

http://www.nature.com/nprot/journal/v7/n3/pdf/nprot.2012.016.pdf

Integrative Genomics Viewer (IGV)

http://www.broadinstitute.org/igv Available on HPC. Use ‘module load igv’ and ‘igv’

Visualizing RNA-Seq mapping with IGV

http://www.broadinstitute.org/igv/UserGuide

Thorvaldsdóttir H et al. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013

Galaxy: Web based platform for analysis of large datasets

Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 2005. 15: 1451-1455

http://hpc-galaxy.oit.uci.edu/root

https://main.g2.bx.psu.edu/

What is ChIP-Seq?

• Chromatin-Immunoprecipitation (ChIP) Sequencing
– ChIP: A technique of precipitating a protein antigen out of solution using an antibody that specifically binds to the protein.
– Sequencing: A technique to determine the order of nucleotide bases in a molecule of DNA.
• Used in combination to study the interactions between protein and DNA.

ChIP-Seq Applications

Enables the accurate profiling of:

• Transcription factor binding sites
• Polymerases
• Histone modification sites
• DNA methylation

A View of ChIP-Seq Data

• Typically reads (35-55bp) are quite sparsely distributed over the genome.
• Controls (i.e. no pull-down by antibody) often show smaller peaks at the same locations.

Rozowsky et al., Nature Biotech, 2009

ChIP-Seq Analysis Pipeline

Flowchart stages, in the order shown on the slide:

• Sequencing
• Base Calling
• Read QC
• Short read Sequences
• Short read Alignment
• Peak Calling
• Enriched Regions
• Visualization with genome browser
• Motif Discovery
• Combine with gene expression
• Differential peaks

ChIP-Seq: Identification of Peaks

• Several methods exist to identify peaks, but they mainly fall into 2 categories:
– Tag Density
– Directional scoring
• In the tag density method, the program searches for large clusters of overlapping sequence tags within a fixed-width sliding window across the genome.
• In directional scoring methods, the bimodal pattern in the strand-specific tag densities is used to identify protein binding sites.
• Determining the exact binding sites from short reads generated from ChIP-Seq experiments:
– SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008)
– MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)
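The tag density idea can be sketched in a few lines of Python: slide a fixed-width window along one chromosome and report windows that contain at least a threshold number of tag start positions. This is a deliberately simplified illustration and is not how SISSRs or MACS are implemented; real peak callers also model background, controls and strand information. The function name, window size, step and threshold are all arbitrary assumptions.

```python
from bisect import bisect_left


def tag_density_windows(tag_positions, window=200, step=50, min_tags=5):
    """Return (start, end, count) for windows holding at least min_tags tags.

    tag_positions: tag start coordinates on a single chromosome.
    Windows are half-open intervals [start, start + window) on a grid of
    multiples of step.
    """
    positions = sorted(tag_positions)
    if not positions:
        return []
    # First grid-aligned window that can contain the left-most tag.
    start = max(0, ((positions[0] - window) // step + 1) * step)
    last = positions[-1]
    hits = []
    while start <= last:
        end = start + window
        # Number of tags with start <= position < end.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count >= min_tags:
            hits.append((start, end, count))
        start += step
    return hits


if __name__ == "__main__":
    tags = [1000, 1010, 1020, 1030, 1040, 5000]
    for hit in tag_density_windows(tags):
        print(hit)
```

Overlapping windows that pass the threshold would normally be merged into one enriched region, and the threshold would be chosen from a background model rather than fixed by hand.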

ChIP-Seq: Output

A list of enriched locations

• Can be used:
– In combination with RNA-Seq, to determine the biological function of transcription factors
– To identify genes co-regulated by a common transcription factor
– To identify common transcription factor binding motifs

ChIP-Seq with MACS in Galaxy

http://iona.wi.mit.edu/bio/education/hot_topics

Resources in NGS data analysis

• Stackoverflow.com

Summary

• NGS technologies are transforming molecular biology.
• Bioinformatics analysis is a crucial part of NGS applications
– Data formats, terminology, general workflow
– Analysis pipeline
– Software for various NGS applications
• RNA-Seq and ChIP-Seq data analysis
• Data visualization
• Bioinformatics resources

Thank you!
