Transcript Document
Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module 6
Metabolomic Data Analysis Using
MetaboAnalyst
David Wishart
A Typical Metabolomics
Experiment
2 Routes to Metabolomics
Quantitative (Targeted) Methods
[Figure: NMR spectrum (ppm axis) annotated with identified compounds: TMAO, hippurate, allantoin, creatinine, taurine, citrate, urea, 2-oxoglutarate, succinate, fumarate, water]
Chemometric (Profiling) Methods
[Figure: NMR spectrum (ppm axis) and PCA scores plot (PC1 vs. PC2) with the groups Control, ANIT and PAP]
Metabolomics Data
Workflow
Chemometric Methods
• Data Integrity Check
• Spectral alignment or binning
• Data normalization
• Data QC/outlier removal
• Data reduction & analysis
• Compound ID
Targeted Methods
• Data Integrity Check
• Compound ID and quantification
• Data normalization
• Data QC/outlier removal
• Data reduction & analysis
Data Integrity/Quality
• LC-MS and GC-MS have
high number of false
positive peaks
• Problems with adducts
(LC), extra derivatization
products (GC), isotopes,
breakdown products
(ionization issues), etc.
• Not usually a problem
with NMR
• Check using replicates
and adduct calculators
MZedDB http://maltese.dbs.aber.ac.uk:8888/hrmet/index.html
HMDB http://www.hmdb.ca/search/spectra?type=ms_search
Data/Spectral Alignment
• Important for LC-MS
and GC-MS studies
• Not so important for
NMR (pH variation)
• Many programs
available (XCMS,
ChromA, MZmine)
• Most based on time
warping algorithms
http://mzmine.sourceforge.net/
http://bibiserv.techfak.uni-bielefeld.de/chroma
http://metlin.scripps.edu/download/
Binning (3000 pts to 14 bins)
[Figure: a spectrum of points (xi, yi) divided into consecutive bins (bin1, bin2, bin3, ... bin8, ...); each bin is reduced to one value, e.g. x = 232.1 (AOC), y = 10 (bin #)]
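The binning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equal-width bins, intensities summed within each bin, and a made-up toy spectrum. The function name `bin_spectrum` is hypothetical and is not part of MetaboAnalyst.

```python
def bin_spectrum(intensities, n_bins):
    """Collapse a list of intensities into n_bins equal-width bins by summing."""
    n_points = len(intensities)
    if n_bins <= 0 or n_bins > n_points:
        raise ValueError("n_bins must be between 1 and the number of points")
    bins = [0.0] * n_bins
    for i, value in enumerate(intensities):
        # Integer arithmetic maps point index i to a bin index in [0, n_bins - 1].
        bins[i * n_bins // n_points] += value
    return bins


# Toy spectrum (invented data): 3000 points of flat baseline plus one narrow peak.
spectrum = [1.0] * 3000
for i in range(1000, 1010):
    spectrum[i] += 50.0

binned = bin_spectrum(spectrum, 14)
print(len(binned), binned.index(max(binned)))
```

Summing keeps the total signal unchanged, so the binned values can still be normalized to the integrated area afterwards.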
Data Normalization/Scaling
• Can scale to sample or
scale to feature
• Scaling to whole sample
controls for dilution
• Normalize to integrated
area, probabilistic
quotient method,
internal standard,
sample specific (weight
or volume of sample)
• Choice depends on
sample & circumstances
Same or different?
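Two of the sample-wise options above, normalization to integrated area and the probabilistic quotient method, can be sketched as follows. This is a rough illustration, not the MetaboAnalyst implementation: the function names are hypothetical, the reference profile is assumed to be the per-feature median over all samples, and the data are invented.

```python
from statistics import median


def normalize_to_sum(sample, target=1.0):
    """Scale one sample so that its values add up to `target` (integrated area)."""
    total = sum(sample)
    return [target * value / total for value in sample]


def probabilistic_quotient(samples):
    """Probabilistic quotient normalization of a list of samples (rows)."""
    # Step 1: integral normalization of every sample.
    scaled = [normalize_to_sum(sample) for sample in samples]
    n_features = len(scaled[0])
    # Step 2: reference profile, assumed here to be the per-feature median.
    reference = [median(row[j] for row in scaled) for j in range(n_features)]
    normalized = []
    for row in scaled:
        # Step 3: quotient of each feature against the reference.
        quotients = [row[j] / reference[j]
                     for j in range(n_features) if reference[j] > 0]
        # Step 4: the median quotient is taken as the most probable dilution factor.
        factor = median(quotients)
        normalized.append([value / factor for value in row])
    return normalized


# Invented example: the second sample is a 2x dilution of the first.
samples = [
    [10.0, 20.0, 30.0, 40.0],
    [5.0, 10.0, 15.0, 20.0],
    [12.0, 18.0, 33.0, 37.0],
]
normalized = probabilistic_quotient(samples)
```

After normalization the first two rows should agree, which is the sense in which scaling to the whole sample controls for dilution.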
Data Normalization/Scaling
• Can scale to sample or
scale to feature
• Scaling to feature(s)
helps manage outliers
• Several feature scaling
options available: log
transformation, autoscaling, Pareto scaling,
probabilistic quotient,
and range scaling
MetaboAnalyst http://www.metaboanalyst.ca
Dieterle F et al. Anal Chem. 2006 Jul 1;78(13):4281-90.
Data QC, Outlier Removal &
Data Reduction
• Data filtering (remove solvent peaks,
noise filtering, false positives, outlier
removal -- needs justification)
• Dimensional reduction or feature
selection to reduce number of
features or factors to consider (PCA
or PLS-DA)
• Clustering to find similarity
MetaboAnalyst
http://www.metaboanalyst.ca
• Web server designed to
handle large sets of LC-MS,
GC-MS or NMR-based
metabolomic data
• Supports both univariate
and multivariate data
processing, including t-tests, ANOVA, PCA, PLS-DA
• Identifies significantly
altered metabolites,
produces colorful plots,
provides detailed
explanations & summaries
• Links sig. metabolites to
pathways via SMPDB
MetaboAnalyst Workflow
Stages: Data preprocessing, Data normalization, Data analysis, Data annotation
Data input
• GC/LC-MS raw spectra
• Peak lists
• Spectral bins
• Concentration table
Data processing
• Spectra processing
• Peak processing
• Noise filtering
• Missing value estimation
Data integrity check
Data normalization
• Row-wise normalization
• Column-wise normalization
• Combined approach
Statistical Exploration: Two/multi-group analysis
• Univariate analysis
• Correlation analysis
• Chemometric analysis
• Feature selection
• Cluster analysis
• Classification
Statistical Exploration: Time-series analysis
• Data overview
• Two-way ANOVA
• ANOVA-SCA
• Time-course analysis
Functional Interpretation: Enrichment analysis
• Over representation analysis
• Single sample profiling
• Quantitative enrichment analysis
Functional Interpretation: Pathway analysis
• Enrichment analysis
• Topology analysis
• Interactive visualization
Quality checking
• Methods comparison
• Temporal drift
• Batch effect
• Biological checking
Other utilities
• Peak searching
• Pathway mapping
• Name/ID conversion
• Lipidomics
Outputs
• Processed data
• Result tables
• Analysis report
• Images
Image Center
• Resolution: 150/300/600 dpi
• Format: png, tiff, pdf, svg, ps
MetaboAnalyst Overview
• Raw data processing
– Using MetaboAnalyst
• Data Reduction & Statistical analysis
– Using MetaboAnalyst
• Functional enrichment analysis
– Using MSEA in MetaboAnalyst
• Metabolic pathway analysis
– Using MetPA in MetaboAnalyst
Example Datasets
Metabolomic Data Processing
Common Tasks
• Purpose: to convert various raw data
forms into data matrices suitable for
statistical analysis
• Supported data formats
– Concentration tables (Targeted Analysis)
– Peak lists (Untargeted)
– Spectral bins (Untargeted)
– Raw spectra (Untargeted)
Data Upload
Alternatively …
Data Set Selected
• Here we will be selecting a data set
from dairy cattle fed different
proportions of cereal grains (0%,
15%, 30%, 45%)
• Rumen fluid was analyzed by NMR
spectroscopy using quantitative
metabolomic techniques
• High grain diets are thought to be
stressful on cows
Data Integrity Check
Data Normalization
• At this point, the data has been
transformed to a matrix with the samples
in rows and the variables
(compounds/peaks/bins) in columns
• MetaboAnalyst offers three types of
normalization: row-wise normalization,
column-wise normalization and combined
normalization
• Row-wise normalization aims to make
the samples (rows) comparable to each
other (e.g. urine samples with different
dilution effects)
Data Normalization (cont.)
• Column-wise normalization aims to make
each variable (column) comparable to
each other
• This procedure is useful when variables
are of very different orders of magnitude
• Four methods have been implemented for
this purpose – log transformation,
autoscaling, Pareto scaling and range
scaling
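The four column-wise methods can be written out as a short sketch. The definitions below are the common textbook ones (mean-centering followed by division by the standard deviation, its square root, or the range) and a plain natural log is assumed for the log transformation; the exact formulas used inside MetaboAnalyst may differ, and all function names here are illustrative only.

```python
import math
from statistics import mean, stdev


def log_transform(column):
    """Natural log of each value; assumes all values are positive."""
    return [math.log(value) for value in column]


def autoscale(column):
    """Mean-center, then divide by the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(value - m) / s for value in column]


def pareto_scale(column):
    """Mean-center, then divide by the square root of the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(value - m) / math.sqrt(s) for value in column]


def range_scale(column):
    """Mean-center, then divide by the range (max - min)."""
    m = mean(column)
    spread = max(column) - min(column)
    return [(value - m) / spread for value in column]


def scale_columns(matrix, scaler):
    """Apply `scaler` to every column of a samples-in-rows matrix."""
    scaled_columns = [scaler(list(column)) for column in zip(*matrix)]
    return [list(row) for row in zip(*scaled_columns)]


# Invented data: two variables that differ by orders of magnitude.
matrix = [[2.0, 100.0], [4.0, 200.0], [6.0, 300.0]]
print(scale_columns(matrix, autoscale))
```

After autoscaling both columns are on the same footing, which is the point of column-wise normalization when variables differ greatly in magnitude.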
Normalization Result
Quality Control
• Dealing with outliers
– Detected mainly by visual inspection
– May be corrected by normalization
– May be excluded
• Noise reduction
– More of a concern for spectral bins/
peak lists
– Usually improves downstream results
Visual Inspection
• What does an outlier look like?
Finding outliers via PCA
Finding outliers via Heatmap
Outlier Removal
Noise Reduction
Noise Reduction (cont.)
• Characteristics of noise &
uninformative features
– Low intensities
– Low variances (default)
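A variance-based filter of this kind might look like the sketch below. It assumes a samples-in-rows matrix, ranks features by sample variance, and keeps a fixed fraction of them; the function name, the `keep_fraction` parameter and the data are hypothetical and do not reflect MetaboAnalyst's actual settings.

```python
from statistics import variance


def filter_low_variance(matrix, names, keep_fraction=0.75):
    """Keep the `keep_fraction` of features (columns) with the highest variance."""
    columns = list(zip(*matrix))
    # Rank column indices from most to least variable.
    ranked = sorted(range(len(names)),
                    key=lambda j: variance(columns[j]),
                    reverse=True)
    n_keep = max(1, int(keep_fraction * len(names)))
    kept = sorted(ranked[:n_keep])  # restore the original column order
    kept_names = [names[j] for j in kept]
    kept_matrix = [[row[j] for j in kept] for row in matrix]
    return kept_names, kept_matrix


# Invented data: feature "b" is constant across samples and carries no information.
names = ["a", "b", "c", "d"]
matrix = [
    [1.0, 5.0, 0.1, 10.0],
    [2.0, 5.0, 0.2, 30.0],
    [3.0, 5.0, 0.3, 20.0],
]
kept_names, kept_matrix = filter_low_variance(matrix, names)
print(kept_names)
```

Raw variance favors high-intensity features, so a filter like this is usually applied with the scale of the data in mind.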
Data Reduction and
Statistical Analysis
Common tasks
• To identify important features
• To detect interesting patterns
• To assess differences between the
phenotypes
• To facilitate classification /
prediction
ANOVA
View Individual Compounds
Questions
• Q: Which compounds show
significant differences between all
neighboring groups (0-15, 15-30, and
30-45)?
• Q: For Uracil, are groups 15, 30, 45
significantly different from each
other?
Overall correlation pattern
High resolution image
Specify format
Specify resolution
Specify size
Question
• Q: In untargeted metabolomics using
NMR, researchers often look for the
region(s) of the spectra showing the
biggest change in their correlation
patterns under different conditions.
Can you do that in MetaboAnalyst?
• Hint: check the available parameters
of Correlation analysis
Template Matching
• Looking for compounds showing interesting
patterns of change
• Essentially a method to look for linear trends or
periodic trends in the data
• Best for data that has 3 or more groups
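One way to read template matching is as a correlation between each compound's profile across the groups and a user-defined pattern. The sketch below assumes Pearson correlation, group means as the profile, and a "1-2-3-4" template for a linear increase with grain %; the compound names and values are invented, and MetaboAnalyst may compute this differently.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def match_template(profiles, template):
    """Rank compounds by their correlation with the template, highest first."""
    scores = [(name, pearson(values, template))
              for name, values in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical group means for the 0%, 15%, 30% and 45% grain groups.
profiles = {
    "compound_up": [1.0, 2.0, 3.0, 4.0],
    "compound_down": [8.0, 6.0, 4.0, 2.0],
    "compound_flat": [3.0, 5.0, 3.0, 5.0],
}
template = [1, 2, 3, 4]  # linear increase across the four groups
ranked = match_template(profiles, template)
for name, r in ranked:
    print(name, round(r, 3))
```

Other templates follow the same idea: a pattern such as 3-2-1-4 would pick out compounds that fall over the first three groups and rise in the last.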
Template Matching (cont.)
Strong positive linear correlation with grain %
Strong negative linear correlation with grain %
Question
• Q: Identify compounds that decrease
in the first three groups but increase
in the last group
PCA Scores Plot
PCA Loading Plot
Compounds
most responsible
for separation
3D-PCA
Question
Q: Identify compounds that contribute
most to the separation between group
15 and 45
PLS-DA Score Plot
Evaluation of PLS-DA Model
• PLS-DA model evaluated by cross
validation, using Q2 and R2
• Adding more components to the
model improves the quality of fit,
but try to keep the number of
components to a minimum
• A 3-component model
seems to be a good
compromise here
• Good R2/Q2 (>0.7)
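For reference, the usual textbook definitions of the two statistics are given below; they are stated as an assumption about what is being reported, not as the exact computation inside MetaboAnalyst. Here y_i is the response (class label) of sample i, \bar{y} is its mean, \hat{y}_i is the value fitted by a model built on all samples, and \hat{y}_{(i)} is the value predicted for sample i when it was held out during cross validation.

```latex
R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2},
\qquad
Q^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_{(i)}\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
```

R2 can only go up as components are added, whereas Q2 typically levels off or drops once the model starts to overfit, which is why the two are read together.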
Important Compounds
Model Validation
Questions
• Q: What does p < 0.01 mean?
• Q: How many permutations need to be
performed if you want to claim p value <
0.0001?
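The mechanics of a permutation test can be sketched as below. This is a simplified stand-in: it uses the absolute difference in group means as the test statistic rather than any PLS-DA performance measure, and the function name, data and seed are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Empirical p-value for the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, i.e. reassign the class labels at random.
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if permuted >= observed:
            n_extreme += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return (n_extreme + 1) / (n_permutations + 1)


# Invented data: two clearly separated groups of five samples each.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [11.0, 12.0, 13.0, 14.0, 15.0]
p = permutation_p_value(group_a, group_b, n_permutations=1000)
print(p)
```

With this estimator the smallest p-value that can be reported is 1 / (n_permutations + 1), so the resolution of the test is set by how many permutations are run.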
Heatmap Visualization
Note that the Heatmap is not being clustered on Rows (i.e. the % grain in
diet)
Heatmap Visualization (cont.)
Question
Q: Identify compounds with a low
concentration in groups 0 and 15 that
increase in groups 30 and 45
Q: Which compound is the only one
significantly increased in group 45?
Download Results
Analysis Report
Metabolite Set Enrichment
Analysis
Metabolite Set Enrichment
Analysis (MSEA)
http://www.msea.ca
• Web tool designed to handle
lists of metabolites (with or
without concentration data)
• Modeled after Gene Set
Enrichment Analysis (GSEA)
• Supports over
representation analysis
(ORA), single sample
profiling (SSP) and
quantitative enrichment
analysis (QEA)
• Contains a library of 6300
pre-defined metabolite sets
including 85 pathway sets &
850 disease sets
Enrichment Analysis
• Purpose: To test if there are some
biologically meaningful groups of
metabolites that are significantly enriched
in your data
• Biologically meaningful
– Pathways
– Disease
– Localization
• Currently, only supports human
metabolomic data
MSEA
• Accepts 3 kinds of input files
• 1) list of metabolite names only (ORA)
• 2) list of metabolite names +
concentration data from a single
sample (SSP)
• 3) a concentration table with a list of
metabolite names + concentrations
for multiple samples/patients (QEA)
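For the first input type, over representation analysis is commonly framed as a hypergeometric test: given a list of compounds of interest, how surprising is the overlap with a metabolite set? The sketch below assumes that framing; the function name and the numbers are illustrative only and are not taken from the MSEA libraries.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_query, n_hits):
    """P(X >= n_hits) when n_query compounds are drawn from n_universe,
    of which n_in_set belong to the metabolite set (hypergeometric tail)."""
    total = comb(n_universe, n_query)
    upper = min(n_in_set, n_query)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_query - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 1000 reference metabolites, a pathway with 20 members,
# a query list of 30 compounds, 5 of which fall in the pathway.
p_enriched = ora_p_value(n_universe=1000, n_in_set=20, n_query=30, n_hits=5)
print(p_enriched)
```

The result depends on the choice of reference universe, which is the caveat behind calling ORA a comparatively weak analysis.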
The MSEA approach
Over Representation Analysis
• Compound concentrations
• Compound selection (t-tests, clustering)
• Important compound lists (ORA input for MSEA)
Single Sample Profiling
• Compound concentrations
• Compare to normal references
• Abnormal compounds
Quantitative Enrichment Analysis
• Compound concentrations
• Assess metabolite sets directly
All three routes
• Find enriched biological themes (using metabolite set libraries)
• Biological interpretation
Data Set Selected
• Here we are using a collection of
metabolites identified by NMR
(compound list + concentrations)
in the urine of 77 lung and
colon cancer patients, some of whom
were suffering from cachexia
(muscle wasting)
Start with a Compound List
Upload Compound List
Normally GSEA would require a list of all known genes for the given platform.
Here we just use the list of metabolites found in KEGG.
ORA is a “weak” analysis in MSEA.
Compound Name
Standardization
Name Standardization (cont.)
Select a Metabolite Set
Library
Result
Result (cont.)
The Matched Metabolite Set
Single Sample Profiling
(Basically used by a physician to
analyze a patient)
Single Sample Profiling
(cont.)
Concentration Comparison
Concentration Comparison
(cont.)
Quantitative Enrichment
Analysis
Result
The Matched Metabolite Set
Question
• Q: Are these metabolites increased
or decreased in the cachexia group?
Metabolic Pathway Analysis
with MetPA
Pathway Analysis
• Purpose: to extend and enhance
metabolite set enrichment analysis
for pathways by
– Considering the pathway structures
– Supporting pathway visualization
• Currently supports 15 organisms
Data Upload
Data Set Selected
• Here we are using a collection of
metabolites identified by NMR
(compound list + concentrations)
in the urine of 77 lung and
colon cancer patients, some of whom
were suffering from cachexia
(muscle wasting)
Normalization
Pathway Libraries
Network Topology Analysis
Position Matters
Which positions are important?
• Hubs: nodes that are highly connected (red ones)
• Bottlenecks: nodes on many shortest paths between other nodes (blue ones)
Graph theory measures
• Degree centrality
• Betweenness centrality
Junker et al. BMC Bioinformatics 2006
Which Node is More
Important?
High degree centrality
High betweenness centrality
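The two measures can be compared on a toy network. The sketch below assumes an unweighted, undirected graph and uses Brandes' algorithm for betweenness; pathway graphs in practice may be directed and the scores normalized, so this is an illustration of the idea rather than the MetPA computation. The graph and node names are invented.

```python
from collections import deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree_centrality(graph):
    """Number of neighbors of each node (unnormalized)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted graph)."""
    betweenness = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in graph}
        n_paths = {v: 0 for v in graph}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, visiting the farthest nodes first.
        dependency = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                betweenness[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2 for v, score in betweenness.items()}


# Toy network: two fully connected groups of four nodes joined through "x".
left, right = ["a", "b", "c", "d"], ["e", "f", "g", "h"]
edges = [(u, v) for group in (left, right)
         for i, u in enumerate(group) for v in group[i + 1:]]
edges += [("d", "x"), ("x", "e")]
graph = build_graph(edges)
degree = degree_centrality(graph)
between = betweenness_centrality(graph)
print(degree["d"], degree["x"], between["d"], between["x"])
```

In this toy network node "x" has the lowest degree but the highest betweenness, while "d" has the highest degree, which illustrates why the two measures can rank nodes differently.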
Pathway Visualization
Pathway Visualization
(cont.)
Question
• Q: Which pathway do you think is
likely to be affected the most? Why?
Result
Not Everything Was
Covered
• Clustering (K-means, SOM)
• Classification (SVM, randomForests)
• Time-series data analysis
• Two factor data analysis
• Data quality checks
• Peak searching
• ….
Time Series Analysis in
MetaboAnalyst
Quality Checking Module