Microarray data analysis with Chipster
22.9.2008
Jarno Tuimala
Program – an analysis workflow
 Basic functionality of Chipster
 Data import
 Quality control
 Normalization
 Describing the experiment
 Filtering and missing value considerations
 Statistical testing
 Clustering and visualization
 Annotation
Introduction to Chipster
Chipster
 Goal: Easy access to leading analysis tools such as those developed in the
R/Bioconductor project
 Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses
 http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf
 http://chipster.csc.fi
How does it work?
[Figure: architecture diagram. Labels: CSC; internet; front end; security; desktop client; SSL; SOAP; Java Web Start installs and updates client automatically; analyser; Corona/Murska; analysis; international Web Services; visualisation.]
Phenodata – describing your experiment
 Phenodata file is created during normalization
 Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work
 If you bring in previously created normalized data and phenodata:
• Choose “import directly” in the import tool
• Right click on normalized data, choose “Link to” phenodata and link type “Annotation”
 If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose “Link to” phenodata and link type “Annotation”
• Fill in the group column
Visualizing the data
 Data visualization panel
• Maximize and redraw for better viewing
 Two types of visualizations
1. Interactive visualizations produced by the client program
• Select the visualization method from the pulldown menu of the data
visualization panel
• Save by right clicking on the image
2. Static images produced by R/Bioconductor, Weeder, etc
• Select from Analysis tools/ Visualisation
• View by double clicking on the image file
• Save by right clicking on the file name and choosing ”Export”
Interactive visualizations by the client
 Spreadsheet
 Histogram
 Scatterplot
 3D scatterplot
 Expression profiles
 Clustered profiles
 Hierarchical clustering
 SOM clustering
 Array pseudo-image
 Venn diagram
Available actions:
 Change titles, colors etc
 Zoom in/out
Static images produced by R/Bioconductor
 Volcano plot
 Box plot
 Histogram
 Heatmap
 Venn diagram
 Idiogram
 Chromosomal position
 Correlogram
 Dendrogram
 QC stats plot
 RNA degradation plot
 K-means clustering
 SOM clustering
Automatic tracking of analysis history
Running many analyses simultaneously
 You can have max 5 analysis jobs running at the same time
 Use Task manager to
• view parameters, status,…
• cancel jobs
Workspace – continue later/elsewhere
 Saving your workspace allows you to continue later
• File/ Save workspace
• File/ Load workspace
 Currently it is possible to have only one workspace saved at a time
 If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location
• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot
Importing files
 Affymetrix CEL-files are imported to
Chipster automatically
 Other files are imported using the
Import tool
Import tool, step 1
 Define
• Header
• Footer
• Title row
• Delimiter
Import tool, step 2
 Define columns
 Modify flags
Importing Agilent files (required fields)
 Sample (rMeanSignal)
 Sample background (rBGMedianSignal)
 Control (gMeanSignal)
 Control background (gBGMedianSignal)
 Identifier (ProbeName)
 Annotation (ControlType)
 Flag (IsManualFlag)
 https://extras.csc.fi/biosciences/chipster-manual/data-formats.html
Quality control
Quality control tools
 Quality control -tools
• Affymetrix basic: RNA degradation + Affy QC
• Agilent: MA-plot + density plot + boxplot
 Visualization – dendrogram
 Statistics - NMDS
Affymetrix I
 Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS on normalized data
Agilent
General QC – dendrogram and NMDS
Scatterplots
Heatmaps (this took an hour to calculate)
QC-tools in Chipster
 Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color
 Visualization
• Dendrogram
• Heatmap
• Correlogram
 Statistics
• NMDS
Normalization
What is normalization?
 Normalization is the process of removing systematic
variation from the data.
 Typically you would normalize your data so that all the
chips become comparable.
Methods
 Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM probes, fits a model to them, and outputs expression values after quantile normalization and median polishing
 Agilent
• Background correction + averaging duplicate spots + normalization
 After normalization the expression values are always expressed on a log2 scale
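The quantile normalization step mentioned for RMA can be illustrated with a minimal sketch. This is not RMA itself and not how Chipster computes it; the function name `quantile_normalize` and the toy values are assumptions made for illustration, and ties are not handled specially.

```python
def quantile_normalize(chips):
    """Give every chip the same value distribution.

    chips: list of equal-length lists, one list of expression values per chip.
    """
    n_probes = len(chips[0])
    # Sort each chip, then average across chips at every rank.
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [
        sum(chip[rank] for chip in sorted_chips) / len(chips)
        for rank in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from the smallest to the largest value.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data (hypothetical): two chips, three probes.
toy_chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
toy_normalized = quantile_normalize(toy_chips)
```

After this step every chip contains exactly the same set of values, only in a different probe order, which is what makes the chips comparable.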
Affymetrix
 Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes GC content into account
• Li-Wong is the method implemented in dChip
 Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier, since all the others output log2-transformed data by default (and are thus already corrected for the same phenomenon)
 Custom chiptype
• If you want to use reannotated probes (probes reassigned to the genes where they really belong), select one from this menu.
Agilent I
 Background correction
• Background treatment: None, Subtract, Edwards, Normexp
• Background offset: 0 or 50
 Normalize chips
• None, median, loess
 Normalize genes (not typically used)
• None, scale (to median), quantile
 Chiptype
• Must always be set!
Agilent II
 Background treatment typically generates many negative values, which are coded as missing values after log2 transformation.
• The usual subtract option does this
• Using normexp + offset 50 generates no negative values and gives rather good estimates (best method reported)
 Loess removes curvature from the data (suggested)
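Why plain background subtraction produces missing values can be shown with a small sketch. It does not implement normexp (which is a model-based correction); it only illustrates the subtract option on hypothetical intensities, and the helper name `log2_or_missing` is an assumption.

```python
import math


def log2_or_missing(value):
    """Return log2(value), or None (missing) when the value is not positive."""
    return math.log2(value) if value > 0 else None


# Hypothetical foreground and background intensities for three spots.
foreground = [120.0, 80.0, 300.0]
background = [100.0, 95.0, 100.0]

# Subtract option: the second spot ends up below zero.
subtracted = [fg - bg for fg, bg in zip(foreground, background)]
log_subtracted = [log2_or_missing(v) for v in subtracted]

# Adding a positive offset (here 50) before the log keeps this spot defined.
offset = 50.0
log_with_offset = [log2_or_missing(v + offset) for v in subtracted]
```

In this toy case the offset rescues the second spot, but with plain subtraction an offset alone gives no guarantee; the slide's recommendation is normexp combined with the offset.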
Checking normalization
Filtering
Gene filtering
 Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)
 Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…
 Non-specific filtering
• Expression, flags, SD, …
 Specific filtering
• Statistical testing
Non-specific filtering
 Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal
 Typically, non-changing genes are also removed
 These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression
Specific filtering
 Selecting genes that are associated with some
phenotype
 Typically involves statistical testing
 Biologists typically concentrate on fold change
(magnitude of effect), statisticians on p-value.
• Both tell a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
  • Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)
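Combining the two filters can be sketched as follows. The function name `combine_filters`, the cut-offs (absolute log2 fold change of at least 1, p-value of at most 0.05) and the gene values are all hypothetical; Chipster's own tools are not shown here.

```python
def combine_filters(genes, min_abs_log2_fc=1.0, max_p=0.05):
    """Keep genes that pass both the effect-size and the significance filter.

    genes: dict mapping gene name -> (log2 fold change, p-value).
    """
    return sorted(
        name
        for name, (log2_fc, p_value) in genes.items()
        if abs(log2_fc) >= min_abs_log2_fc and p_value <= max_p
    )


# Hypothetical results for four genes.
toy_results = {
    "geneA": (2.1, 0.001),   # large effect, significant
    "geneB": (0.2, 0.0005),  # significant, but the effect is tiny
    "geneC": (-1.8, 0.30),   # large effect, but too variable
    "geneD": (0.1, 0.80),    # neither
}
selected = combine_filters(toy_results)
```

Only geneA passes both filters: geneB fails on fold change and geneC on the p-value.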
Unspecific filtering in Chipster
 Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips on which this rule has to be fulfilled
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to filter out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)
 Utilities
1. Calculate descriptive statistics
2. Filter using a column
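The SD and IQR filters can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and it is an assumption that the SD filter removes the genes with the lowest standard deviation (the non-changing ones).

```python
import statistics


def filter_by_sd(expression, percent_to_remove):
    """Drop the given percentage of genes with the lowest standard deviation.

    expression: dict mapping gene name -> list of values, one per chip.
    """
    sds = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sds, key=sds.get)  # least variable genes first
    n_remove = int(len(ranked) * percent_to_remove / 100)
    keep = set(ranked[n_remove:])
    return {gene: values for gene, values in expression.items() if gene in keep}


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical log2 expression values for four genes on four chips.
toy_expression = {
    "g1": [5.0, 5.0, 5.0, 5.0],
    "g2": [1.0, 9.0, 1.0, 9.0],
    "g3": [4.0, 5.0, 4.0, 5.0],
    "g4": [2.0, 8.0, 3.0, 7.0],
}
kept = filter_by_sd(toy_expression, 50)
```

With 50% filtered out, the two least variable genes (g1 and g3) are removed.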
Venn diagram
 Select three datasets in Chipster
 Run the Venn diagram tool from Visualization tool
category
[Figure: Venn diagram with sets labelled SD, CV and IQR]
Statistics
Some terminology
 Usually tests for comparing means of two or more groups are
used
• Variance might be of interest too, but in practice this is never done.
 Parametric tests (assume the data are normally distributed)
• Typically used for microarray data
 Non-parametric tests (do not assume normality)
 P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate (FDR) is a p-value corrected for multiple tests (more on this later)
Mean and variance, an example for 1 gene
[Figure: two density plots, titled density.default(x = x1) and density.default(x = y1); y-axis: Density; N = 100000 in both; one x-axis spans -6 to 6, the other -10 to 10.]
Statistical testing
 Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.
 Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)
t-test
 Compares means of two groups
• If the p-value is small (conventionally < 0.05), the data give evidence of a difference between the groups.
• If the p-value is large (> 0.05), there is no evidence of a difference between the groups (this does not prove that the groups are the same).
• The p-value is the risk of saying that there is a difference when there actually isn’t.
 A test for every gene is run separately -> thousands of tests and p-values
$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$
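The t statistic (difference of the group means divided by its standard error) can be computed for one gene as in the sketch below. The slide does not say which standard error is used; the unpooled (Welch) form is an assumption here, and the function name and values are hypothetical.

```python
import math
import statistics


def welch_t(group1, group2):
    """t = (mean1 - mean2) / SE, with SE = sqrt(var1/n1 + var2/n2).

    The unpooled (Welch) standard error is an assumption of this sketch.
    """
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return mean_diff / se


# Hypothetical log2 expression values of one gene in two groups of three chips.
controls = [1.0, 2.0, 3.0]
cancers = [4.0, 5.0, 6.0]
t_value = welch_t(controls, cancers)
```

In a real analysis the same calculation is repeated for every gene, and each t value is converted to a p-value using the t distribution.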
ANOVA
 A generalization of the t-test.
 Compares means of several groups.
 Tells whether the means are different, but not which means differ from each other.
• For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)
 A test for every gene is run separately -> thousands of tests and p-values
Multiple testing correction I
 After getting the results for all the genes, p-values are
adjusted for the number of tests conducted.
 When making several comparisons using the same test, some
of the results will be chance findings.
• Example: if the p threshold is 0.05, about one in 20 tests of genes that do not really differ will come out significant by chance alone. If 10000 genes were tested, up to 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (about 500) would be false positives, and only a minority (about 50) true positives.
 This can be corrected for (to some extent) by using a multiple testing correction.
• Benjamini and Hochberg FDR: If the FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.
• Thus, an FDR threshold can be much higher than a typical p-value threshold, and the results can still be meaningful and worth investigating.
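The Benjamini and Hochberg adjustment itself is short enough to sketch. The function name and the p-values are hypothetical; this is an illustration of the standard procedure, not code taken from Chipster.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Gene indices ordered from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values never increase
    # as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
```

Each raw p-value is multiplied by the number of tests and divided by its rank, so the smallest p-values are penalized the most, yet the order of the genes stays the same.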
Multiple testing correction II
 The ranking of the genes does not change after multiple
testing correction!
• If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction.
• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.
Gene set test (”global test”)
 A typical result of a microarray experiment is a list of differentially expressed genes.
 Biologically, grouping these genes in pathways or functional categories would be more interesting.
 Are pathways associated with our endpoints of interest?
• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?
 Works on the expression value data.
Gene enrichment analysis
 A typical result of a microarray experiment is a list of differentially expressed genes.
 Biologically, grouping these genes in pathways or
functional categories would be more interesting.
 Takes a list of differentially expressed genes, and tests
whether they are enriched in any functional categories.
 Works on the gene list.
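One common way to test a gene list for enrichment is the hypergeometric test, sketched below. The slides do not name the test used in Chipster, so the choice of test, the function name and the numbers are all assumptions.

```python
from math import comb


def enrichment_p_value(n_all, n_in_category, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_all: number of genes on the chip
    n_in_category: genes on the chip that belong to the functional category
    n_selected: size of the differentially expressed gene list
    n_overlap: genes in the list that belong to the category
    """
    total = comb(n_all, n_selected)
    upper = min(n_in_category, n_selected)
    tail = sum(
        comb(n_in_category, k) * comb(n_all - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 genes on the chip, 3 in the category,
# 3 genes selected, and all 3 of them fall in the category.
p_enriched = enrichment_p_value(10, 3, 3, 3)
```

A small p-value means that the category is over-represented in the gene list compared with a random draw of the same size.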
Statistical tests in Chipster
 Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling
 Visualization
• Volcano plot
Clustering
Clustering methods
 Hierarchical clustering
 Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps
 Classification / class prediction
• K-nearest neighbor (KNN)
Hierarchical clustering
 Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage
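The two phases can be made concrete with a small sketch of the distance measures and of average linkage. The function names and the toy profiles are hypothetical, and turning the Pearson correlation into a distance as 1 - r is an assumption of this sketch.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for profiles with exactly the same shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)


def average_linkage(cluster_a, cluster_b, distance):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    return sum(distance(p, q) for p, q in pairs) / len(pairs)


# Hypothetical profiles: one gene in the first cluster, two in the second.
first_cluster = [[0.0, 0.0]]
second_cluster = [[3.0, 4.0], [6.0, 8.0]]
linkage = average_linkage(first_cluster, second_cluster, euclidean)
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage distance; the order of the merges is what the dendrogram draws.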
Average linkage example
Hierarchical clustering - heatmap
Annotation
Annotation
 Annotation = Descriptive text used for labeling features. For
genes, extra information about their location in chromosomes,
biological functions, etc.
 Retrieved from multiple biological databases and stored as a single database in Chipster (generated by the Bioconductor project).
 Required by certain analysis tools (annotation, GO enrichment, promoter analysis, chromosomal plots)
• These tools don’t work for chiptypes that don’t have Bioconductor annotation packages
Alternative CDF environments for Affy
 CDF is a file that links individual probes to their location in genes
(probesets)
 Affymetrix default annotations use old CDF files that map a sizable number of probes to wrong genes
 Alternative CDFs fix this problem
 In Chipster
• Selecting “custom chiptype” in Affymetrix normalization puts the alternative CDFs to use
• Note: if you have normalized using a custom chiptype, certain tools requiring annotation won’t work (GO term enrichment, promoter analysis, annotation)
 Dai et al, (2005) Nuc Acids Res, 33(20):e175
 http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp