Temporal Aspects of Visual Extinction
Statistics – Modelling Your Data
Chris Rorden
1. Modelling data:
Signal, Error and Covariates
Parametric Statistics
2. Thresholding Results:
Statistical power and statistical errors
The multiple comparison problem
Familywise error and Bonferroni Thresholding
Permutation Thresholding
False Discovery Rate Thresholding
Implications: null results uninterpretable
1
The fMRI signal
Last lecture: we predict areas that are involved with a
task will become brighter (after a delay)
Therefore, we expect that if someone repeatedly
does a task for 12 seconds, and rests for 12 seconds,
our signal should look like this:
[Figure: Model time course. Y-axis: Signal Brightness; X-axis: Time (volumes, each 2sec)]
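The predicted signal for this design can be sketched in a few lines of Python. This is only an illustration: it assumes 2-second volumes, 12 seconds of task alternating with 12 seconds of rest, and a fixed delay of two volumes as a crude stand-in for the haemodynamic lag. The function name `boxcar_model` and the delay value are assumptions made for this sketch, not part of the lecture.

```python
def boxcar_model(n_volumes, tr_sec=2.0, block_sec=12.0, delay_volumes=2):
    """Predicted brightness per volume: 1 during (delayed) task blocks, 0 at rest."""
    vols_per_block = int(block_sec / tr_sec)  # 6 volumes per 12 s block
    model = []
    for v in range(n_volumes):
        shifted = v - delay_volumes  # crude stand-in for the haemodynamic delay
        if shifted < 0:
            model.append(0.0)
            continue
        # Even-numbered blocks are task, odd-numbered blocks are rest.
        in_task = (shifted // vols_per_block) % 2 == 0
        model.append(1.0 if in_task else 0.0)
    return model


model = boxcar_model(24)
print(model)
```

A real analysis would normally replace the fixed shift with a smoother response shape; the shift is used here only to keep the sketch short.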
2
Calculating statistics
Does this brain area change brightness when we do the task?
[Figure: two panels plotting Observed, Model and Error. Y-axis: Signal Brightness; X-axis: Time]
– Top panel: very good predictor (very little error)
– Lower panel: somewhat less good predictor
3
General Linear Model
The observed data is composed of a signal that is
predicted by our model and unexplained noise
(Boynton et al., 1996).
$Y = M\beta + \epsilon$
– $Y$: Measured Data
– $M$: Design Model
– $\beta$: Amplitude (solve for)
– $\epsilon$: Noise
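Solving the model for the amplitude can be sketched with ordinary least squares. The example below is a minimal illustration using NumPy on made-up data: the task regressor, the constant column, the true amplitude of 3 and the noise level are all assumptions chosen for the demonstration, and `fit_glm` is a hypothetical helper name.

```python
import numpy as np


def fit_glm(y, design):
    """Least-squares estimate of the amplitudes and the leftover noise."""
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return beta, residuals


# Hypothetical data: 24 volumes, 6 on / 6 off task regressor plus a constant column.
task = np.tile(np.r_[np.ones(6), np.zeros(6)], 2)
design = np.column_stack([task, np.ones(24)])
rng = np.random.default_rng(0)
y = 3.0 * task + 100.0 + rng.normal(0.0, 0.5, size=24)  # true amplitude = 3

beta, residuals = fit_glm(y, design)
print("estimated amplitude:", beta[0])
print("estimated baseline:", beta[1])
```

The residuals are the part of the measured data that the design model does not explain, which is the noise term in the equation.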
4
What is your model?
Model is predicted effect.
Consider Block design experiment:
– Three conditions, each for 11.2sec
1. Press left index finger when you see [image]
2. Press right index finger when you see [image]
3. Do nothing when you see [image]
[Figure: predicted Intensity over Time for each condition]
5
FSL/SPM display of model
Analysis programs display model as grid.
Each column is regressor
– e.g. left / right arrows.
Each row is a volume of data
– for within-subject fMRI = time
Brightness of row is model’s predicted intensity.
[Figure: design matrix grid beside the model’s Intensity over Time plot]
6
Statistical Contrasts
fMRI inference based on contrast.
Consider study with left arrow and
right arrow as regressors
1. [1 0] identifies activation correlated
with left arrows: we could expect
visual and motor effects.
2. [1 –1] identifies regions that show
more response to left arrows than right
arrows. Visual effects should be
similar, so should select contralateral
motoric.
Choice of contrasts crucial to
inference.
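A contrast turns the fitted amplitudes into a single t score. The sketch below is one common way to compute it, on simulated data for a hypothetical voxel that responds more to left arrows than to right arrows. The constant column is an added assumption, so the contrasts carry a third entry of 0 ([1 0 0] and [1 -1 0] rather than [1 0] and [1 -1]). The amplitudes and noise level are made up.

```python
import numpy as np


def contrast_t(y, design, contrast):
    """t score for a contrast c of the fitted amplitudes b: c.b / sqrt(s2 * c (X'X)^-1 c)."""
    c = np.asarray(contrast, dtype=float)
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    residuals = y - design @ beta
    dof = len(y) - design.shape[1]
    sigma2 = (residuals @ residuals) / dof
    return float(c @ beta / np.sqrt(sigma2 * (c @ xtx_inv @ c)))


# Hypothetical voxel: 6 volumes of left arrows, 6 of right arrows, 6 of rest,
# repeated twice. It responds more to left than to right.
left = np.tile(np.r_[np.ones(6), np.zeros(12)], 2)
right = np.tile(np.r_[np.zeros(6), np.ones(6), np.zeros(6)], 2)
design = np.column_stack([left, right, np.ones(36)])  # constant column is an assumption
rng = np.random.default_rng(0)
y = 4.0 * left + 1.0 * right + 100.0 + rng.normal(0.0, 1.0, size=36)

t_left = contrast_t(y, design, [1, 0, 0])            # left arrows vs baseline
t_left_gt_right = contrast_t(y, design, [1, -1, 0])  # left > right
t_right_gt_left = contrast_t(y, design, [-1, 1, 0])  # right > left
print(t_left, t_left_gt_right, t_right_gt_left)
```

Reversing the sign of the contrast reverses the sign of the t score, which is why the direction of a contrast matters for a one-tailed test.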
7
Statistical Contrasts
t-Test is one tailed, F-test is
two-tailed.
– T-test: [1 –1] mutually exclusive
of [-1 1]: left>right vs right>left.
– F-test: [1 –1] = [-1 1]: difference
between left and right.
Choice of test crucial to
inference.
8
How many regressors?
We collected data during a block design, where the participant completed 3 tasks
– Left hand movement
– Right hand movement
– Rest
We are only interested in the brain areas involved with Left hand movement.
Should we include uninteresting right hand movement as a regressor in our statistical model?
– I.E. Is a [1] analysis the same as a [1 0]?
– Is a [1 0] analysis identical, better, worse or different from a [1] analysis?
9
Meaningful regressors decrease noise
Meaningful regressors can
explain some of the variability.
Adding a meaningful regressor
can reduce the unexplained
noise from our contrast.
10
Correlated regressors decrease signal
If a regressor is strongly
correlated with our effect, it can
reduce the residual signal
– Our signal is excluded as the
regressor explains this variability.
– Example: responses highly
correlated to visual stimuli
11
Single factor…
Consider a test to see how well height predicts weight.
[Figure: plots of Weight against Height]
t = Explained Variance / Unexplained Variance
– Small t-score: height only weakly predicts weight
– High t-score: height strongly predicts weight
12
Adding a second factor…
How does an additional
factor influence our test?
E.G. We can add waist
diameter as a regressor.
Does this regressor
influence the t-test
regarding how well height
predicts weight?
Consider ratio of cyan to green.
[Figure: overlapping variance in Weight, Height and Waist]
– Increased t: Waist explains portion of weight not predicted by height.
– Decreased t: Waist explains portion of weight predicted by height.
13
Regressors and statistics
Our analysis identifies three classes of variability:
1. Signal: Predicted effect of interest
2. Noise (aka Error): Unexplained variance
3. Covariates: Predicted effects that are not relevant.
Statistical significance is the ratio:
t = Signal / Noise
Covariates will
– Improve sensitivity if they reduce error (explain otherwise unexplained variance).
– Reduce sensitivity if they reduce signal (explain variance that is also predicted by our effect of interest).
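Both effects of a covariate can be seen in a small simulation. The sketch below is an illustration on made-up data, not an analysis from the lecture: `covariate` is uncorrelated with the effect of interest and explains real variance, while `copycat` is almost identical to the effect of interest. The helper name, amplitudes and noise levels are all assumptions.

```python
import numpy as np


def t_for_first_regressor(y, regressors):
    """t score for the first regressor, with a constant term added to the model."""
    design = np.column_stack(list(regressors) + [np.ones(len(y))])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    residuals = y - design @ beta
    sigma2 = (residuals @ residuals) / (len(y) - design.shape[1])
    return float(beta[0] / np.sqrt(sigma2 * xtx_inv[0, 0]))


rng = np.random.default_rng(1)
n = 120
signal = np.tile(np.r_[np.ones(6), np.zeros(6)], 10)     # effect of interest
covariate = np.tile(np.r_[np.ones(3), np.zeros(3)], 20)  # uncorrelated with signal
copycat = signal + rng.normal(0.0, 0.05, size=n)         # almost the same as signal
y = 4.0 * signal + 3.0 * covariate + rng.normal(0.0, 1.0, size=n)

t_alone = t_for_first_regressor(y, [signal])
t_with_covariate = t_for_first_regressor(y, [signal, covariate])
t_with_copycat = t_for_first_regressor(y, [signal, copycat])
print(t_alone, t_with_covariate, t_with_copycat)
```

In this setup the meaningful covariate raises the t score by soaking up otherwise unexplained variance, and the correlated regressor lowers it by competing for the same variance as the effect of interest.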
14
Summary
Regressors should be orthogonal
– Each regressor describes independent variance.
– Variance should not be explained by more than
one regressor.
E.G. we will see that including temporal derivatives as regressors tends to help event related designs (temporal processing lecture).
15
Group Analysis
We typically want to make inferences about the
general population
Conduct time course analysis on many people.
Identify which patterns are consistent across group.
16
Parametric Statistics
SPM and FSL conduct parametric statistics.
– T-test, F-test, Correlation
These make assumptions about data.
We will not check to see if these assumptions
are valid.
17
Parametric Statistics
Parameters = Assumptions
Parametric Statistics assume that data can
be accurately defined by two values:
1. Mean = measure of central tendency
2. Variance = measure of noise
[Figure: probability density curves f(x) in two panels: Means differ; Variabilities Differ]
18
Parametric Statistics
Parametric Statistics are popular
– Simple (complex data described by two numbers:
mean and variability)
– Flexible (can look at how multiple factors interact)
– Powerful: very sensitive at detecting real effects
– Robust: usually work even if assumptions violated
Tend to fail gracefully: by becoming more conservative
19
Normal Distribution
Parametric Statistics Assume Bell-Shaped data:
Often, this is wrong. Mean may not be a good measure:
– Positive Skew: response times, hard exam
– Negative Skew: easy exam
– Bimodal: some students got it
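A quick way to see why the mean can mislead for skewed data is to compare it with the median. The response times below are invented for illustration; one slow trial gives the distribution a positive skew.

```python
from statistics import mean, median

# Hypothetical response times in milliseconds with one slow trial (positive skew).
response_times_ms = [300, 320, 340, 350, 360, 400, 1200]

print("mean:", round(mean(response_times_ms), 1))
print("median:", median(response_times_ms))
```

The single slow trial pulls the mean well above the typical response, while the median stays with the bulk of the data.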
20
Rank-Order Statistics
Rank-order statistics make fewer assumptions.
Have less power (if data is normal)
– Require more measurements
– May fail to detect real results
Computationally slow
Classic examples:
– Wilcoxon Mann-Whitney
– Fligner and Policello’s robust rank order test
21
Problem with rank order statistics
While rank-order statistics are often referred to
as non-parametric, most make assumptions:
– WMW: assume both distributions have same
shape.
– FP: assume both distributions are symmetrical.
Both these tests become liberal if their
assumptions are not met.
– They fail catastrophically.
22
What to do?
In general, use parametric tests.
– In face of violations, you will simply lose power
One alternative is to use permutation testing, e.g.
SnPM.
– Permutation testing is only as powerful as the test statistic it uses: SnPM uses the t-test, which is sensitive to changes in mean (so it can be blind to changes in median).
Recent alternative is truly non-parametric test of
Brunner and Munzel.
– Can offer slightly better power than t-test if data is skewed.
Rorden et al. 2007.
23
Statistical Thresholding
– Type I/II Errors
– Power
– Multiple Comparison Problem
Bonferroni Correction
Permutation Thresholding
False Discovery Rate
ROI Analysis
24
Statistics
E.G. erythropoietin (EPO) doping in athletes
– In endurance athletes, EPO improves performance ~ 10%
– Races often won by less than 1%
– Without testing, athletes forced to dope to be competitive
– Dangers: Carcinogenic and can cause heart-attacks
– Therefore: Measure haematocrit level to identify drug users…
If there was no noise in our measure, it would be easy to identify EPO doping:
[Figure: haematocrit distributions, axis from 30% to 50%]
25
The problem of noise
Science tests hypotheses based on observations
– We need statistics because our data is noisy
In the real world, haematocrit levels vary
– This unrelated noise in our measure is called ‘error’
How do we identify dopers?
In the real world, hematocrit varies between people
[Figure: hematocrit distributions, axis from 30% to 50%]
26
Statistical Threshold
If we set the threshold too high, we will fail to detect dopers (high rate of misses).
[Figure: high threshold on hematocrit axis, 30% to 50%]
If we set the threshold too low, we will accuse innocent people (high rate of false alarms).
[Figure: low threshold on hematocrit axis, 30% to 50%]
27
Possible outcomes of drug test
Decision          | Reality (unknown): nonDoper                  | Reality (unknown): EPO Doper
Accuse and expel  | Innocent accused (false alarm): Type I error | Doper expelled (hit)
Allow to compete  | Innocent competes (correct rejection)        | Doper sneaks through (miss): Type II error
28
Errors
With noisy data, we will make mistakes.
Statistics allows us to
– Estimate our confidence
– Bias the type of mistake we make (e.g. we can decide whether
we will tend to make false alarms or misses)
We can be liberal: avoiding misses
We can be conservative: avoiding false alarms.
We want liberal tests for airport weapons detection (X-ray
often leads to innocent cases being opened).
Our society wants conservative tests for criminal
conviction: avoid sending innocent people to jail.
29
Liberal vs Conservative Thresholds
LIBERAL
A low threshold, we
will accuse innocent
people (high rate of
false alarms, Type I).
CONSERVATIVE
A high threshold, we
will fail to detect
dopers (high rate of
misses, Type II).
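The trade-off between false alarms and misses can be made concrete by sliding a threshold across two overlapping distributions. The sketch below assumes, purely for illustration, that haematocrit is normally distributed with a mean of 42% in clean athletes and 48% in dopers, both with a standard deviation of 3%. These numbers are hypothetical and not taken from the lecture.

```python
from statistics import NormalDist

# Hypothetical haematocrit distributions (percent); the means and spreads are
# illustrative assumptions, not measured values.
CLEAN = NormalDist(mu=42.0, sigma=3.0)
DOPED = NormalDist(mu=48.0, sigma=3.0)


def error_rates(threshold):
    """Return (false alarm rate, miss rate) when accusing anyone above threshold."""
    false_alarm = 1.0 - CLEAN.cdf(threshold)  # innocent athletes above threshold
    miss = DOPED.cdf(threshold)               # dopers below threshold
    return false_alarm, miss


for threshold in (43.0, 45.0, 47.0, 49.0):
    fa, miss = error_rates(threshold)
    print(f"threshold {threshold:.0f}%: false alarms {fa:.3f}, misses {miss:.3f}")
```

Raising the threshold lowers the false alarm rate and raises the miss rate; no threshold removes both kinds of error while the distributions overlap.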
30
Statistical Power
Statistical Power is our probability of making a Hit.
It reflects our ability to detect real effects.
To make new discoveries, we need to optimize power.
There are 4 ways to increase power…
Decision   | Reality: Ho true  | Reality: Ho false
Reject Ho  | Type I error      | Hit
Accept Ho  | Correct rejection | Type II error
31
1.) Alpha and Power
By making alpha less strict, we can increase power.
(e.g. p < 0.05 instead of 0.01)
However, we increase the chance of a Type I error!
32
2.) Effect Size and Power
Power will increase if the effect size increases. (e.g.
higher dose of drug, 7T MRI instead of 1.5T).
Unfortunately, effect sizes are often small and fixed.
33
3.) Variability and Power
Reducing variability increases the relative effect size.
Most measures of brain activity noisy.
34
4.) Sample Size
A final way to increase our power is to collect
more data.
We can sample a person’s brain activity on
many similar trials.
We can test more people.
The disadvantage is time and money.
Increasing the sample size is often our only
option for increasing statistical power.
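The four ways of increasing power can be compared with a simple power calculation. The sketch below uses a normal approximation for a one-tailed test of a mean shift; the function name, the baseline values (effect 0.5, standard deviation 1, 16 measurements, alpha 0.01) and the alternatives are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def power_one_tailed_z(effect, sd, n, alpha=0.05):
    """Approximate power of a one-tailed z-test for a mean shift of `effect`."""
    critical_z = STANDARD_NORMAL.inv_cdf(1.0 - alpha)
    shift = effect / (sd / sqrt(n))  # expected z score if the effect is real
    return 1.0 - STANDARD_NORMAL.cdf(critical_z - shift)


baseline = power_one_tailed_z(effect=0.5, sd=1.0, n=16, alpha=0.01)
looser_alpha = power_one_tailed_z(effect=0.5, sd=1.0, n=16, alpha=0.05)
bigger_effect = power_one_tailed_z(effect=1.0, sd=1.0, n=16, alpha=0.01)
less_noise = power_one_tailed_z(effect=0.5, sd=0.5, n=16, alpha=0.01)
more_data = power_one_tailed_z(effect=0.5, sd=1.0, n=64, alpha=0.01)

print("baseline:", round(baseline, 3))
print("1. less strict alpha:", round(looser_alpha, 3))
print("2. larger effect:", round(bigger_effect, 3))
print("3. less variability:", round(less_noise, 3))
print("4. larger sample:", round(more_data, 3))
```

Doubling the effect and halving the variability give the same power here, because only their ratio enters the calculation.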
35
Reflection
Statistically, relative ‘effect size’ and ‘variability’
are equivalent.
Our confidence is the ratio of effect size versus
variability (signal versus noise).
In the graphs below, the same α is used.
[Figure: two graphs shown as equivalent]
36
Alpha level
Statistics allow us to estimate our confidence.
α is our statistical threshold: it measures our chance of Type I error.
An alpha level of 5% means only 1/20 chance of false alarm (we will
only accept p < 0.05).
An alpha level of 1% means only 1/100 chance of false alarm (p<
0.01).
Therefore, a 1% alpha is more conservative than a 5% alpha.
37
Multiple Comparison Problem
Assume a 1% alpha for drug testing.
An innocent athlete only has 1% chance of
being accused.
Problem: 10,500 athletes in the Olympics.
If all innocent, and α = 1%, we will wrongly
accuse 105 athletes (0.01*10500)!
This is the multiple comparison problem.
38
Multiple Comparisons
The gray matter volume ~900cc (900,000mm³)
Typical fMRI voxel is 3x3x3mm (27mm³)
Therefore, we will conduct >30,000 tests
With 5% alpha, we will make >1500 false
alarms!
39
Multiple Comparison Problem
If we conduct 20 tests, with α = 5%, we will on average make one false alarm (20 x 0.05).
If we make twenty comparisons, it is possible that we may be making 0, 1, 2 or in rare cases even more errors.
The chance we will make at least one error (assuming independent tests) is given by the formula $1 - (1 - \alpha)^C$, where $C$ is the number of comparisons: if we make twenty comparisons at p < .05, we have a $1 - 0.95^{20} \approx 64\%$ chance that we are reporting at least one erroneous finding. This is our familywise error (FWE) rate.
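The arithmetic on this slide is easy to check. The sketch below computes the expected number of false alarms and the familywise error rate for independent tests; the function names are hypothetical.

```python
def expected_false_alarms(alpha, n_tests):
    """Average number of false alarms when every null hypothesis is true."""
    return alpha * n_tests


def familywise_error(alpha, n_tests):
    """Chance of at least one false alarm across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


print("expected false alarms, 20 tests at 5%:", expected_false_alarms(0.05, 20))
print("familywise error, 20 tests at 5%:", round(familywise_error(0.05, 20), 2))
```

The same functions apply to the earlier examples, such as 10,500 athletes tested at a 1% alpha.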
40
Bonferroni Correction
Bonferroni Correction: controls FWE.
For example: if we conduct 10 tests, and want
a 5% chance of any errors, we will adjust our
threshold to be p < 0.005 (0.05/10).
Benefits: Controls for FWE.
Problem: Very conservative = very little chance
of detecting real effects = low power.
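The Bonferroni adjustment is a single division. The sketch below computes the per-test threshold and then checks the familywise error rate it produces, assuming independent tests; the function names are hypothetical.

```python
def bonferroni_threshold(familywise_alpha, n_tests):
    """Per-test p threshold that keeps the familywise error at familywise_alpha."""
    return familywise_alpha / n_tests


def familywise_error(per_test_alpha, n_tests):
    """Chance of at least one false alarm across independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


threshold = bonferroni_threshold(0.05, 10)
print("per-test threshold:", threshold)
print("resulting familywise error:", round(familywise_error(threshold, 10), 4))
```

The resulting familywise error sits just under the requested 5%, which is the sense in which the correction is conservative.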
41
Random Field Theory
We spatially smooth our data – peaks due to noise should be attenuated by neighbors.
– Worsley et al, HBM 4:58-73, 1995.
RFT uses resolution elements (resels) instead of voxels.
– If we smooth our data with 8mm FWHM, then resel size is 8mm.
SPM uses RFT for FWE correction: only requires statistical map, smoothness and cluster size threshold.
– Euler characteristic: unsmoothed noise will have high peaks but few clusters, smoothed data will have lower peaks but show clustering.
RFT has many unchecked assumptions (Nichols)
Works best for heavily smoothed data (x3 voxel size)
[Figure: noise images smoothed at 5mm, 10mm and 15mm. Image from Nichols]
42
Permutation Thresholding
Group 1
Group 2
Prediction: Label ‘Group 1’
and ‘Group 2’ mean
something.
Null Hypothesis (Ho):
Labels are meaningless.
If Ho true, we should get
similar t-scores if we
randomly scramble order.
43
Permutation Thresholding
Group 1
Group 2
Observed, max T = 4.1
1. Permutation 1, max T = 3.2
2. Permutation 2, max T = 2.9
3. Permutation 3, max T = 3.3
4. Permutation 4, max T = 2.8
5. Permutation 5, max T = 3.5
…
1000. Permutation 1000, max T = 3.1
44
Permutation Thresholding
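Compute maximum T-score for 1000 permutations.
Find the max T exceeded by only 5% of permutations (here T = 3.9).
Any voxel in our observed dataset that exceeds this threshold is significant at a 5% familywise level: if the labels were meaningless, only 5% of permutations would produce any voxel this extreme.
[Figure: Max T (0 to 5) plotted against Percentile (0 to 100), with the 5% cutoff at T = 3.9]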
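The max T procedure can be sketched in plain Python. This is a toy illustration on simulated data, not the code used by SnPM, randomise or NPM: 20 images (10 per group) with 20 voxels each, and a group difference planted at voxel 0. The function names, group sizes and effect size are all assumptions.

```python
import random
from math import sqrt


def t_score(group1, group2):
    """Pooled-variance two-sample t score."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


def max_t(images, labels):
    """Largest t score over all voxels for a given labelling of the images."""
    scores = []
    for v in range(len(images[0])):
        g1 = [img[v] for img, lab in zip(images, labels) if lab == 1]
        g2 = [img[v] for img, lab in zip(images, labels) if lab == 2]
        scores.append(t_score(g1, g2))
    return max(scores)


def permutation_threshold(images, labels, n_perm=1000, alpha=0.05, seed=0):
    """Max T value exceeded by only a fraction `alpha` of random relabellings."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # scramble the group labels
        null_max.append(max_t(images, shuffled))
    null_max.sort(reverse=True)
    return null_max[round(alpha * n_perm)]


# Simulated data: pure noise everywhere, plus a planted effect at voxel 0.
rng = random.Random(42)
n_per_group, n_voxels = 10, 20
labels = [1] * n_per_group + [2] * n_per_group
images = [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)] for _ in labels]
for img, lab in zip(images, labels):
    if lab == 1:
        img[0] += 5.0  # planted group difference at voxel 0

observed = max_t(images, labels)
threshold = permutation_threshold(images, labels)
print("observed max T:", round(observed, 2))
print("5% permutation threshold:", round(threshold, 2))
```

Because the threshold comes from the maximum over all voxels, a single threshold protects the whole image, which is why the inference matches a familywise correction.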
45
Permutation Thresholding
Permutation Thresholding offers the same protection
against false alarms as Bonferroni.
Typically, much more powerful than Bonferroni.
Implementations include SnPM, FSL’s randomise, and
my own NPM.
Disadvantage: computing 1000 permutations means it takes about 1000 times longer than typical analysis!
Simulation data from Nichols et al.:
Permutation always optimal. Bonferroni
typically conservative. Random Fields only
accurate with high DF and heavily smoothed.
46
False Discovery Rate
Traditional statistics attempts to control the
False Alarm rate.
‘False Discovery Rate’ controls the proportion of false alarms among all detections (false alarms plus hits).
It often provides much more power than
Bonferroni correction.
47
FDR
Assume Olympics where no athletes took EPO:
Assume Olympics where some cheat:
– When we conduct many tests, we can estimate the
amount of real signal
48
FDR vs FWE
Bonferroni FWE applies same threshold to
each data set
5% Bonferroni: only
a 5% chance an
innocent athlete will
be accused.
FDR is dynamic: threshold based on signal
detected.
5% FDR: only 5% of
expelled athletes
are innocent.
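The slides do not say which FDR procedure is meant; a widely used one is the Benjamini-Hochberg step-up procedure, sketched below as an assumption. The ten p-values are invented for illustration, and the Bonferroni threshold is computed alongside for comparison.

```python
def fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg: largest sorted p(k) with p(k) <= (k / m) * q, else None."""
    m = len(p_values)
    threshold = None
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= (k / m) * q:
            threshold = p
    return threshold


def bonferroni_threshold(p_values, alpha=0.05):
    """Fixed threshold that depends only on the number of tests."""
    return alpha / len(p_values)


# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

fdr_cut = fdr_threshold(p_values)
bonf_cut = bonferroni_threshold(p_values)
n_fdr = sum(p <= fdr_cut for p in p_values)
n_bonf = sum(p <= bonf_cut for p in p_values)
print("FDR detections:", n_fdr)
print("Bonferroni detections:", n_bonf)
```

The FDR threshold moves with the data, while the Bonferroni threshold is fixed by the number of tests. Note that `fdr_threshold` returns None when nothing survives, so a caller should check for that before comparing.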
49
Controlling for multiple comparisons
Bonferroni correction
– We will often fail to find real results.
RFT correction
– Typically less conservative than Bonferroni.
– Requires large DF and broad smoothing.
Permutation Thresholding
– Offers same inference as Bonferroni correction.
– Typically much less conservative than Bonferroni.
– Computationally very slow
FDR correction
– At FDR of .05, about 5% of ‘activated’ voxels will be false alarms.
– If signal is only tiny proportion of data, FDR will be similar to Bonferroni.
50
Alternatives to voxelwise analysis
Conventional fMRI statistics compute one statistical comparison
per voxel.
– Advantage: can discover effects anywhere in brain.
– Disadvantage: low statistical power due to multiple comparisons.
Small Volume Comparison: Only test a small proportion of
voxels. (Still have to adjust for RFT).
Region of Interest: Pool data across anatomical region for single
statistical test.
Example: how many
comparisons on this slice?
– SPM: 1600
– SVC: 57
– ROI: 1
[Figure: the same slice shown for SPM, SVC and ROI]
51
ROI analysis
In voxelwise analysis, we conduct an independent test for every voxel
– Each voxel is noisy
– Huge number of tests, so severe penalty for multiple comparisons
Alternative: pool data from region of interest.
– Averaging across meaningful region should reduce noise.
– One test per region, so FWE adjustment less severe.
Region must be selected independently of statistical contrast!
– Anatomically predefined
– Defined based on previous localizer session
– Selected based on combination of conditions you will contrast.
[Figure: regions of interest. M1: movement; S1: sensation]
52
Inference from fMRI statistics
fMRI studies have very low power.
– Correction for multiple comparisons
– Poor signal to noise
– Variability in functional anatomy between people.
Null results impossible to interpret. (Hard to
say an area is not involved with task).
53
Between and Within Subject Variance
Consider experiment to see if music influences typing speed.
Possible effect will be small.
Large variability between people: some people much better typists than others.
Solution: repeated measure design to separate between and within subject variability.
[Figure: typing speed (words per minute, 0 to 70) for Alice, Bob, Donna, Nick and Sam under Bach, Rock and Silent conditions]
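The benefit of a repeated measure design shows up when the same data are analysed two ways. The typing speeds below are invented for illustration (only the names come from the slide): people differ a lot from each other, but each person types about 2 words per minute faster with Bach than in silence.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical typing speeds (words per minute); the values are made up.
silent = {"Alice": 20, "Bob": 35, "Donna": 45, "Nick": 55, "Sam": 65}
bach = {"Alice": 22, "Bob": 38, "Donna": 46, "Nick": 57, "Sam": 67}


def independent_t(a, b):
    """Between-subject test: treats the two conditions as unrelated groups."""
    dof = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / dof
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def paired_t(a, b):
    """Repeated-measures test: works on each person's own difference."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


names = list(silent)
bach_scores = [bach[name] for name in names]
silent_scores = [silent[name] for name in names]

t_between = independent_t(bach_scores, silent_scores)
t_within = paired_t(bach_scores, silent_scores)
print("independent t:", round(t_between, 2))
print("paired t:", round(t_within, 2))
```

The paired test removes the large differences between people from the noise term, so the same small effect yields a much larger t score.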
54
Multiple Subject Analysis: Mixed Model
Model all of the data at
once
Between and within
subject variation is
accounted for
Can’t apply mixed model
directly to fMRI data
because there is so
much data!
[Figure: data from Sub 1, Sub 2, Sub 3 and Sub 4 feed a single Group model that outputs Z stats]
55
Multiple Subject Analysis: SPM2
First estimate each subject’s contrast effect sizes (copes)
Run a t-test on the copes
Holmes and Friston assume within subject variation is same for all subjects, this allows them to ignore it at the group level
– Not equivalent to a mixed model
[Figure: copes from Sub 1, Sub 2, Sub 3 and Sub 4 enter a Group T-test, producing Results]
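The group step of this approach, running a t-test on the copes, can be sketched as a one-sample t-test at a single voxel. The cope values below are hypothetical, and the helper name is an assumption; the sketch uses only each subject's effect size and ignores within subject variation, as the slide describes.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values):
    """t score testing whether the mean of `values` differs from zero."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


# Hypothetical contrast effect sizes (copes) at one voxel for Sub 1 to Sub 4.
copes = [1.2, 0.8, 1.5, 0.9]

t_group = one_sample_t(copes)
print("group t:", round(t_group, 2), "with", len(copes) - 1, "degrees of freedom")
```

With only four subjects the test has three degrees of freedom, so the critical t value is high; this is one reason group studies need more people.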
56
Multiple Subject Analysis: FSL
First estimate each subject’s
copes and cope variability
(varcopes)
Then enter the copes and
varcopes into group model
– varcopes supply within subject
variation
– Between subject variation and
group level means are then
estimated
Equivalent to mixed model
Much slower than SPM
[Figure: copes and varcopes from Sub 1, Sub 2, Sub 3 and Sub 4 feed the Group model, which outputs Z stats]
57