MALDI MS Data Preprocessing and Analysis


The Biostatistical & Bioinformatics Challenges in the High Dimensional Data Derived from High Throughput Assays: Today and Tomorrow
Yu Shyr (石 瑜), Ph.D.
May 14, 2008
China Medical University
[email protected]
Vanderbilt University (泛德堡大學)
US News & World Report (America's best colleges, 2007)
1. Princeton University (NJ)
2. Harvard University (MA)
3. Yale University (CT)
4. California Institute of Technology (CA)
4. Stanford University (CA)
4. Massachusetts Institute of Technology (MA)
7. University of Pennsylvania (PA)
8. Duke University (NC)
9. Dartmouth College (NH)
9. Columbia University (NY)
9. University of Chicago (IL)
12. Cornell University (NY)
12. Washington University in St. Louis (MO)
US News & World Report (America's best colleges, 2007)
14. Northwestern University (IL)
15. Brown University (RI)
16. Johns Hopkins University (MD)
17. Rice University (TX)
18. Vanderbilt University (TN)
18. Emory University (GA)
20. University of Notre Dame (IN)
21. Carnegie Mellon University (PA)
21. University of California – Berkeley (CA)
23. Georgetown University (DC)
24. University of Virginia (VA)
24. University of Michigan – Ann Arbor (MI)
Tennessee, the “Volunteer State”
Nashville, TN – “Music City, USA!”
Vanderbilt University
 A private, nonsectarian, coeducational research university in Nashville, TN.
 Established in 1873 by shipping and rail magnate Cornelius Vanderbilt.
 Enrolls 11,000 students in ten schools annually.
 Ranks 18th in the nation among national research universities.
 Also has several research facilities and a world-renowned medical center.
 Famous alumni include former vice president Al Gore.
Vanderbilt University Medical Center (VUMC)
 Collection of several hospitals and clinics associated with Vanderbilt University in Nashville, Tennessee.
 In 2003, was placed on the Honor Roll of the nation's best hospitals.
 The medical school was ranked 17th in the nation among research-oriented medical schools and in the ISI top 5 for research impact in clinical medicine and pharmacology.
Vanderbilt-Ingram Cancer Center
 Only NCI-designated Comprehensive Cancer Center in Tennessee and one of only 39 in the United States
 Nearly 300 investigators in seven research programs
 More than $190 million in annual research funding
 Among the top 10 in competitively awarded NCI grant support
Vanderbilt-Ingram Cancer Center
 Ranks 20th in the nation and is consistently ranked among the best places for cancer care by U.S. News and World Report.
 One of a select few centers to hold agreements with the NCI to conduct Phase I and Phase II clinical trials, where innovative therapies are first evaluated in patients.
Department of Biostatistics
 Created by the School of Medicine at Vanderbilt University in September 2003.
 The Dean and other senior medical school faculty are committed to providing outstanding collaborative support in biostatistics to clinical and basic scientists, and to developing a graduate program in biostatistics that will train outstanding collaborative scientists and focus on the methods of modern applied statistics.
High Dimensional Data
 The major challenge in high throughput experiments (e.g., microarray, MALDI-TOF, SELDI-TOF, or shotgun proteomic data) is that the data are often high dimensional.
 When the number of dimensions reaches thousands or more, the computational time for pattern recognition algorithms can become unreasonable. This is a problem especially when some of the features are not discriminatory.
High Dimensional Data
 Irrelevant features may reduce the accuracy of some algorithms. For example (Witten 1999), experiments with a decision tree classifier have shown that adding a random binary feature to standard datasets can degrade classification performance by 5-10%.
 Furthermore, in many pattern recognition tasks, the number of features defines the dimension of the search space: the larger the number of features, the greater the dimension of the search space, and the harder the problem.
Outcome Measurement: MALDI-TOF
[Figure: Reflex MALDI-TOF mass spectrometer schematic, showing the laser optics, nitrogen laser (337 nm), MALDI target, ion grid, ion mirror, TOF analyzer, and microchannel detector.]
Time-of-Flight Mass Spectrometry (TOF-MS)
[Figure: Linear TOF schematic. Ions M1, M2, M3 are accelerated from the ionizing probe (start) by a potential ±U and reach the ion detector (MCP) at times t1, t2, t3; flight time follows t = a·√M + b, so the recorded ion signals map time onto M.]
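As a concrete illustration of the t = a·√M + b relation above, here is a minimal Python sketch (not from the talk; the calibrant masses and flight times are made-up numbers) that fits a and b by least squares from known calibrants and inverts the relation to convert flight time to m/z.

import numpy as np

# Hypothetical calibrants: known m/z values and observed flight times (µs).
known_masses = np.array([1296.7, 1570.7, 2093.1])
flight_times = np.array([22.1, 24.3, 28.0])

# Least-squares fit of t = a*sqrt(M) + b.
A = np.column_stack([np.sqrt(known_masses), np.ones_like(known_masses)])
(a, b), *_ = np.linalg.lstsq(A, flight_times, rcond=None)

def time_to_mass(t):
    # Invert t = a*sqrt(M) + b to convert a flight time into m/z.
    return ((t - b) / a) ** 2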
Issues in the Analysis of High-Throughput Experiments
 Experiment Design
 Measurement
 Preprocessing
♦ Baseline Correction, Normalization
♦ Profile Alignment, Feature Selection, Denoising
 QCA (Quality Control Assessment)
 Feature Selection
 Classification
Issues in the Analysis of High-Throughput Experiments
 Computational Validation
♦ Estimate the classification error rate
♦ Bootstrapping, k-fold cross-validation, leave-one-out validation (see the sketch below)
 Significance Testing of the Achieved Classification Error
 Validation – blind test cohort
 Validation – laboratory technology, e.g., RT-PCR, pathway analysis
 Reporting the results – graphics & tables
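To make the error-rate estimation step concrete, here is a minimal Python sketch, assuming scikit-learn is available; the feature matrix X, labels y, and the linear-SVM classifier are synthetic placeholders, not the talk's pipeline.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))    # 60 spectra x 100 peak features (synthetic placeholder)
y = rng.integers(0, 2, size=60)   # binary class labels (synthetic placeholder)

# 10-fold cross-validated accuracy; classification error rate = 1 - mean accuracy.
accuracy = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print("estimated error rate:", 1 - accuracy.mean())

Note that any feature selection must be performed inside each fold; selecting features on the full dataset first yields an optimistically biased error rate, the over-fitting issue warned about later in this talk.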
Preprocessing
 Mass spectrometry (MS) can generate high throughput protein profiles for biomedical applications. A consistent, sensitive and robust MS data preprocessing method is highly desirable because subsequent analyses are determined by the preprocessing output.
 The preprocessing goal is to extract and quantify the common features across the spectra.
 We propose a new comprehensive MALDI-TOF MS data preprocessing method using feedback concepts together with several new algorithms.
 This new package successfully resolves many conventional difficulties, such as removing m/z measurement error, objectively setting denoising parameters, and defining common features across spectra.
Math Model for MS Data Preprocessing
 From a mathematical point of view, an MS dataset is a signal function defined on a time or m/z domain. An observed MS signal is often modeled as the superposition of three components:
   f(x) = B(x) + N·S(x) + e(x),
where f(x) is the observed signal, B(x) is a slowly varying "baseline" artifact, S(x) is the "true" signal (peaks) to be extracted, N is the normalization factor, and e(x) represents noise. (A simulation sketch of this model follows below.)
 Basic descriptions of the data preprocessing:
♦ Registration
♦ Denoising
♦ Baseline correction
♦ Normalization
♦ Peak selection
♦ Peak alignment or binning
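A minimal Python sketch (illustrative only, not the authors' code) of simulating the model above: a slowly varying baseline B(x), a few Gaussian peaks as the true signal S(x), a normalization factor N, and additive noise e(x).

import numpy as np

x = np.linspace(1000, 10000, 5000)            # m/z (or time) axis
B = 200 * np.exp(-x / 3000)                   # slowly varying baseline artifact B(x)
peaks = [(2000, 50, 30), (4500, 80, 40), (7000, 40, 60)]   # (center, height, width)
S = sum(h * np.exp(-0.5 * ((x - c) / w) ** 2) for c, h, w in peaks)  # true signal S(x)
N = 1.3                                       # normalization factor
e = np.random.default_rng(1).normal(scale=5, size=x.size)  # noise e(x)
f = B + N * S + e                             # observed spectrum f(x)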
Math Model for MS Data Preprocessing
 The preprocessing goal is to identify, quantify, and match peaks across spectra.
 Several modern algorithms, such as wavelets, splines, and the nonparametric local maximum likelihood estimate (NLMLE), are successfully applied across the whole processing system.
 The feedback optimizes the calibration and peak-picking procedures automatically.
General steps
(1) Calibration: calibrate on multiple identified peaks (linear shifts on the time domain) using the shape of the peak (convolution); in the process, all spectra become aligned.
(2) Quantification: baseline correction (splines) => normalization (TIC) => area-based peak quantification.
(3) Feature Extraction: denoising (wavelets) => peak selection (local maxima) => common peak finding across spectra (NLMLE).
(4) Feedback: optimally choosing calibration peaks and setting feature extraction parameters.
Flowchart of the Preprocessing Procedure
Raw data → Calibration → Alignment → De-noising → Baseline Correction → Normalization → Peak Detection → Peak Distribution → Common Feature Detection → Results
Convolution Based Calibration Algorithm
1. Simulate known peaks (by feedback, choose peaks with high prevalence across spectra, e.g. 80%, and a clear pattern).
2. Convolve each spectrum with the known-peak simulation (Gaussian or Beta). The maximum occurs when the two peak shapes match best.
3. The linear shift that makes multiple peaks match best is the optimal shift.
Note: all processing is done on the time domain. (A sketch follows the calibration figures below.)
Pre- vs. Post-Calibration
[Figures: spectra before and after calibration.]
1. Accurate m/z peak positions (as theoretical)
2. Less variation in peak positions
3. Easy to handle large datasets in batch mode
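In the spirit of the convolution-based calibration above, here is a minimal Python sketch; the Gaussian template, mean subtraction, and toy data are assumptions, not the talk's implementation. Cross-correlating a spectrum with a simulated known-peak template and taking the best-matching lag gives the linear shift on the time axis.

import numpy as np

def best_shift(intensity, template):
    # Cross-correlate signal and template (means removed) and return the lag,
    # in samples, at which the template matches the signal best.
    xc = np.correlate(intensity - intensity.mean(),
                      template - template.mean(), mode="full")
    return xc.argmax() - (len(template) - 1)

# Toy usage: a Gaussian peak template, and a spectrum containing it shifted.
t = np.arange(-50, 51)
template = np.exp(-0.5 * (t / 8.0) ** 2)
spectrum = np.roll(np.pad(template, 200), 7)   # pad both sides, then shift right by 7
print(best_shift(spectrum, template))          # 207: 200 (padding) + 7 (shift)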
Baseline Correction & Normalization
 Baseline is generally considered an artificial bias of the signal.
 We propose that the baseline might be caused by delayed charge releasing.
 We apply quadratic splines to the local minima, found by sliding windows, to get a continuous baseline curve.
 Trimmed total ion current (TIC) normalization. (A sketch follows below.)
[Figures: baseline data before correction; baseline-corrected data.]
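A minimal Python sketch of the idea above, with assumptions: local minima are taken in fixed non-overlapping windows, a quadratic smoothing spline is fit through them as the baseline, and plain (untrimmed) TIC normalization stands in for the trimmed version; the window size and spline settings are illustrative.

import numpy as np
from scipy.interpolate import UnivariateSpline

def baseline_correct_and_normalize(x, f, window=250):
    # Local minimum of f in each non-overlapping sliding window.
    starts = np.arange(0, len(f) - window, window)
    min_x = np.array([x[i:i + window][np.argmin(f[i:i + window])] for i in starts])
    min_y = np.array([f[i:i + window].min() for i in starts])
    # Quadratic smoothing spline through the minima gives a continuous baseline.
    baseline = UnivariateSpline(min_x, min_y, k=2)(x)
    corrected = np.clip(f - baseline, 0.0, None)
    # Plain TIC normalization (the talk uses a trimmed TIC).
    return corrected / corrected.sum()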
Wavelets Denoising
 Wavelets: the FBI's image coding standard for digitized fingerprints; successful at reproducing the true signal by removing noise at specific energy levels.
 The wavelet method has been used to denoise signals in a wide variety of contexts.
 The wavelet method analyzes the data in both the time and frequency domains to extract more useful information:
   c(j,k) = ∫ f(t) ψ_{j,k}(t) dt,    f(t) = Σ_{j∈Z} Σ_{k∈Z} c(j,k) ψ_{j,k}(t)
 An adaptive stationary discrete wavelet denoising method, which is shift-invariant and efficient in denoising, is applied in our research.
Denoising strategy
 The stationary discrete wavelet denoising method is shift-invariant and offers both good performance and reconstruction smoothness.
 The adaptive denoising method is based on the noise distribution: we set different threshold values at different mass intervals and frequency levels.
 Parameters (decomposition and thresholds) are determined by the feedback information. (A sketch follows below.)
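A minimal Python sketch of stationary-wavelet denoising, assuming the PyWavelets (pywt) package; the db4 wavelet, 4 levels, and the MAD-based universal threshold per level are illustrative stand-ins for the feedback-determined, mass-interval-adaptive parameters described above.

import numpy as np
import pywt

def swt_denoise(f, wavelet="db4", level=4):
    n = 2 ** level * (len(f) // 2 ** level)   # SWT needs length divisible by 2**level
    coeffs = pywt.swt(f[:n], wavelet, level=level)
    thresholded = []
    for cA, cD in coeffs:
        sigma = np.median(np.abs(cD)) / 0.6745      # MAD noise estimate per level
        thr = sigma * np.sqrt(2.0 * np.log(n))      # universal threshold
        thresholded.append((cA, pywt.threshold(cD, thr, mode="soft")))
    return pywt.iswt(thresholded, wavelet)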
[Figures: DWT decomposition; denoised data; peak lists across spectra; kernel density estimation of the peak distribution without vs. with high-quality preprocessing.]
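As a simple stand-in for the common-peak-finding step illustrated by the kernel density figures above (an assumption, not the talk's NLMLE estimator), a minimal Python sketch: pool the detected peak m/z values from all spectra, estimate their density with a Gaussian KDE, and take the density modes as the common feature positions.

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def common_peaks(all_peak_mz, grid_points=2000):
    # Pool peak m/z values from all spectra and estimate their density.
    grid = np.linspace(min(all_peak_mz), max(all_peak_mz), grid_points)
    density = gaussian_kde(all_peak_mz)(grid)
    # Modes of the pooled density serve as the common feature positions.
    idx, _ = find_peaks(density)
    return grid[idx]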
Peak Selection
Preprocessing on one spectrum after calibration
1. Read in the spectrum as two columns: m/z values and corresponding intensities.
2. Apply the adaptive stationary discrete wavelet transform for denoising.
3. Estimate the baseline with sliding-window splines and subtract it; apply Total Ion Current normalization through the whole spectrum.
4. Local maxima contribute to the peak list across spectra (see the sketch below).
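A minimal Python sketch of step 4, assuming SciPy; the prominence cutoff is an illustrative stand-in for the feedback-chosen peak-picking settings, and mz and intensity are assumed to be NumPy arrays from steps 1-3.

import numpy as np
from scipy.signal import find_peaks

def peak_list(mz, intensity, min_prominence=1e-4):
    # Local maxima of the denoised, baseline-corrected, normalized spectrum.
    idx, _ = find_peaks(intensity, prominence=min_prominence)
    return mz[idx], intensity[idx]    # peak positions and heights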
Expression Profiles
[Figure: expression profiles, day 1 through day 4.]
The Results from the Cluster Analysis
[Figure: clusters track laser power and day. Why?]
Quality Control Assessment - Reproducibility
 Coefficient of Variation (CV): SD / Mean
 Intra-class Correlation Coefficient (ICC): Intra / (Intra + Inter)
 Variance Component Analysis: mixed/random effect model; model terms include investigator, day, spot, machine, lab, etc.
 Goal – make sure the data are reproducible! (A CV/ICC sketch follows below.)
 SOP is a necessary component.
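A minimal Python sketch of the two reproducibility summaries; the ICC here is the standard one-way random-effects estimator ICC(1,1) = (MSB - MSW) / (MSB + (k-1)·MSW), one concrete reading of the Intra/(Intra + Inter) ratio on the slide. The replicate array is hypothetical.

import numpy as np

def cv(values):
    # Coefficient of variation: SD / mean.
    return np.std(values, ddof=1) / np.mean(values)

def icc_oneway(reps):
    # One-way random-effects ICC(1,1) from a (cases x replicates) array.
    n, k = reps.shape
    grand = reps.mean()
    msb = k * ((reps.mean(axis=1) - grand) ** 2).sum() / (n - 1)                  # between cases
    msw = ((reps - reps.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))  # within cases
    return (msb - msw) / (msb + (k - 1) * msw)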
Source of Variability for MALDI-TOF Data
 Biological heterogeneity in the population
 Specimen collection/handling effects
   - Tumor: surgery-related effects
   - Cell line: culture conditions
 Biological heterogeneity in the specimen
 Laser power variation
Table IV. Power = 80%, Type I error = 5%

                          Intra-Case Variance
Subsample        0.2               0.5               1.0
Number       Inter-Case Var.   Inter-Case Var.   Inter-Case Var.
(m)          0.2   0.5   1.0   0.2   0.5   1.0   0.2   0.5   1.0
1             6    11    19    11    16    24    19    24    32
5             4     9    17     5    10    17     7    11    19
20            4     8    16     4     9    16     4     9    17
[Figures: CV across different days; ICC; variance component analysis (tumor).]
Things NOT to Do
 Fold-change for feature selection
 Cluster analysis for class comparison or class prediction
 Ignoring the over-fitting issues
 An extremely small sample size for the independent test cohort
 Reporting only the good news
Multidimensional Scaling (MDS)
[Figures: MDS plots. Agulnik M, et al. J Clin Oncol 2007;25:2184-2190.]
Acknowledgement
Preprocessing:
 Dr. Dean Billheimer
 Dr. Ming Li
 Dr. Dong Hong
 Shuo Chen
 Huiming Li
Analysis:
 Jeremy Roberts
 Will Gray
 Nimish Gautam
 Joan Zhang
 Haojie Wu
Additional Acknowledgements
 Bashar Shakhtour
 Dr. William Wu
 Dr. Bonnie LeFure
 Dr. Heidi Chen
 Dr. Jonathan Xu
 Dr. Tatsuki Koyama
END