Temporal Aspects of Visual Extinction


Statistics – Modelling Your Data
 Chris Rorden
1. Modelling data:
– Signal, Error and Covariates
– Parametric Statistics
2. Thresholding Results:
– Statistical power and statistical errors
– The multiple comparison problem
– Familywise error and Bonferroni thresholding
– Permutation thresholding
– False Discovery Rate thresholding
– Implications: null results uninterpretable
1
The fMRI signal
 Last lecture: we predict areas that are involved with a
task will become brighter (after a delay)
 Therefore, we expect that if someone repeatedly
does a task for 12 seconds, and rests for 12 seconds,
our signal should look like this:
[Figure: boxcar model time course: Signal Brightness vs Time (volumes, each 2 sec)]
2
Calculating statistics
Does this brain area change brightness when we do the task?
– Top panel: very good predictor (very little error)
– Lower panel: somewhat less good predictor
[Figure: two panels plotting Observed, Model and Error, Signal Brightness vs Time]
3
General Linear Model
 The observed data is composed of a signal that is
predicted by our model and unexplained noise
(Boynton et al., 1996).
Y = βM + ε
– Y: measured data
– M: design model
– β: amplitude (solve for)
– ε: noise
4
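As an illustration of solving Y = βM + ε, here is a minimal sketch using ordinary least squares in NumPy; the block timing and noise level are made up for the example:

```python
import numpy as np

# Simulated experiment: 60 volumes, 12 s task / 12 s rest blocks, TR = 2 s
n_vols = 60
model = np.tile([1.0] * 6 + [0.0] * 6, 5)    # boxcar predictor (M)

rng = np.random.default_rng(0)
true_beta = 2.0
observed = true_beta * model + rng.normal(0.0, 0.5, n_vols)  # Y = beta*M + noise

# Design matrix: task regressor plus a constant baseline column
X = np.column_stack([model, np.ones(n_vols)])
beta, *_ = np.linalg.lstsq(X, observed, rcond=None)

residual = observed - X @ beta               # the unexplained noise (error)
print(beta[0])                               # estimated amplitude, close to 2.0
```

The estimated beta[0] recovers the simulated amplitude; analysis packages fit this same model at every voxel.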
What is your model?
 Model is the predicted effect.
 Consider a block design experiment:
– Three conditions, each lasting 11.2 sec:
1. Press left index finger when you see a left arrow
2. Press right index finger when you see a right arrow
3. Do nothing when you see the third cue
[Figure: boxcar model, Intensity vs Time]
5
FSL/SPM display of model
 Analysis programs display the model as a grid.
 Each column is a regressor
– e.g. left / right arrows.
 Each row is a volume of data
– for within-subject fMRI, rows represent time.
 Brightness of each cell is the model’s predicted intensity.
[Figure: design-matrix grid, Intensity vs Time]
6
Statistical Contrasts
 fMRI inference is based on contrasts.
 Consider a study with left arrow and right arrow as regressors:
1. [1 0] identifies activation correlated with left arrows: we could expect visual and motor effects.
2. [1 –1] identifies regions that show more response to left arrows than right arrows. Visual effects should be similar, so this contrast should select contralateral motor activity.
 Choice of contrasts is crucial to inference.
7
Statistical Contrasts
 The t-test is one-tailed; the F-test is two-tailed.
– t-test: [1 –1] is mutually exclusive of [–1 1]: left>right vs right>left.
– F-test: [1 –1] = [–1 1]: any difference between left and right.
 Choice of test is crucial to inference.
8
How many regressors?

We collected data during a block design, where the participant completed 3 tasks:
– Left hand movement
– Right hand movement
– Rest
We are only interested in the brain areas involved with left hand movement. Should we include the uninteresting right hand movement as a regressor in our statistical model?
– i.e. is a [1] analysis the same as a [1 0]?
– Is a [1 0] analysis identical, better, worse or different from a [1] analysis?
9
Meaningful regressors decrease noise
 Meaningful regressors can
explain some of the variability.
 Adding a meaningful regressor
can reduce the unexplained
noise from our contrast.
10
Correlated regressors decrease signal
If a regressor is strongly
correlated with our effect, it can
reduce the residual signal
– Our signal is excluded as the
regressor explains this variability.
– Example: responses highly
correlated to visual stimuli
11
Single factor…
 Consider a test to see how well height predicts weight.
t = Explained Variance / Unexplained Variance
– Small t-score: height only weakly predicts weight.
– High t-score: height strongly predicts weight.
[Figure: scatter plots of Weight vs Height for the two cases]
12
Adding a second factor…
 How does an additional
factor influence our test?
 E.G. We can add waist
diameter as a regressor.
 Does this regressor
influence the t-test
regarding how well height
predicts weight?
 Consider ratio of cyan to
green.
– Increased t: waist explains a portion of weight not predicted by height.
– Decreased t: waist explains a portion of weight that is also predicted by height.
[Figure: variance-overlap diagrams for Weight, Height and Waist in the two cases]
13
Regressors and statistics
 Our analysis identifies three classes of variability:
1. Signal: Predicted effect of interest
2. Noise (aka Error): Unexplained variance
3. Covariates: Predicted effects that are not relevant.
 Statistical significance is the ratio: t = Signal / Noise
 Covariates will
– Improve sensitivity if they reduce error (explain otherwise unexplained variance).
– Reduce sensitivity if they reduce signal (explain variance that is also predicted by our effect of interest).
14
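The effect of a covariate on the signal-to-noise ratio can be simulated; `contrast_t` is a hypothetical helper written for this sketch, not an SPM or FSL function:

```python
import numpy as np

def contrast_t(X, y, c):
    """t-score for contrast c in the GLM y = X @ beta + error."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                  # residual (noise) variance
    se = np.sqrt(sigma2 * c @ np.linalg.pinv(X.T @ X) @ c)
    return (c @ beta) / se

rng = np.random.default_rng(2)
n = 200
effect = rng.normal(size=n)                       # regressor of interest
nuisance = rng.normal(size=n)                     # independent nuisance source
y = 1.0 * effect + 2.0 * nuisance + rng.normal(size=n)

ones = np.ones(n)
t_alone = contrast_t(np.column_stack([effect, ones]), y, np.array([1.0, 0.0]))
t_with = contrast_t(np.column_stack([effect, nuisance, ones]), y,
                    np.array([1.0, 0.0, 0.0]))
print(t_alone, t_with)
```

Here the nuisance regressor is independent of the effect, so adding it reduces the error term and t rises; had it been correlated with the effect, it would absorb signal and t would fall.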
Summary
Regressors should be orthogonal
– Each regressor describes independent variance.
– Variance should not be explained by more than
one regressor.
E.g. we will see that including temporal derivatives as regressors tends to help event-related designs (temporal processing lecture).
15
Group Analysis
We typically want to make inferences about the
general population
Conduct time course analysis on many people.
Identify which patterns are consistent across group.
16
Parametric Statistics
SPM and FSL conduct parametric statistics.
– T-test, F-test, Correlation
These make assumptions about data.
We will not check to see if these assumptions
are valid.
17
Parametric Statistics
 Parameters = Assumptions
 Parametric Statistics assume that data can
be accurately defined by two values:
1. Mean = measure of central tendency
2. Variance = measure of noise
2 = 1
2 = 2
f(x)
Means differ
2 = 3
2 = 4
Variabilities Differ
18
Parametric Statistics
Parametric Statistics are popular
– Simple (complex data described by two numbers:
mean and variability)
– Flexible (can look at how multiple factors interact)
– Powerful: very sensitive at detecting real effects
– Robust: usually work even if assumptions violated
Tend to fail gracefully: by becoming more conservative
19
Normal Distribution
Parametric Statistics Assume Bell-Shaped data:
Often, this is wrong. Mean may not be a good
measure:
Positive Skew: response times,
hard exam
Negative Skew: easy exam
Bimodal: some students got it
20
Rank-Order Statistics
Rank-order statistics make fewer assumptions.
Have less power (if data is normal)
– Require more measurements
– May fail to detect real results
Computationally slow
Classic examples:
– Wilcoxon Mann-Whitney
– Fligner and Policello’s robust rank order test
21
Problem with rank order statistics
While rank-order statistics are often referred to
as non-parametric, most make assumptions:
– WMW: assume both distributions have same
shape.
– FP: assume both distributions are symmetrical.
Both these tests become liberal if their
assumptions are not met.
– They fail catastrophically.
22
What to do?
 In general, use parametric tests.
– In the face of violations, you will simply lose power.
 One alternative is to use permutation testing, e.g. SnPM.
– Permutation testing is only as powerful as the test statistic it uses: SnPM uses the t-test, which is sensitive to changes in mean (so it can be blind to changes in median).
 A recent alternative is the truly non-parametric test of Brunner and Munzel.
– Can offer slightly better power than the t-test if data is skewed (Rorden et al., 2007).
23
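All three tests mentioned here are available in SciPy (`scipy.stats.ttest_ind`, `mannwhitneyu`, `brunnermunzel`); the skewed samples below are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Positively skewed data (e.g. response times): lognormal samples
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=40)
group_b = rng.lognormal(mean=0.4, sigma=0.5, size=40)

t_res = stats.ttest_ind(group_a, group_b)        # parametric t-test
w_res = stats.mannwhitneyu(group_a, group_b)     # rank-order (WMW)
bm_res = stats.brunnermunzel(group_a, group_b)   # Brunner-Munzel

print(t_res.pvalue, w_res.pvalue, bm_res.pvalue)
```

With skewed data like this, the rank-based and Brunner-Munzel p-values are often smaller than the t-test's, reflecting the power argument above.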
Statistical Thresholding
– Type I/II Errors
– Power
– Multiple Comparison Problem: Bonferroni Correction, Permutation Thresholding, False Discovery Rate
– ROI Analysis
24
Statistics
 E.G. erythropoietin (EPO) doping in athletes
– In endurance athletes, EPO improves performance ~10%
– Races are often won by less than 1%
– Without testing, athletes are forced to dope to be competitive
– Dangers: carcinogenic and can cause heart attacks
– Therefore: measure haematocrit level to identify drug users…
If there were no noise in our measure, it would be easy to identify EPO doping:
[Figure: cleanly separated haematocrit distributions (30% to 50%)]
25
The problem of noise
 Science tests hypotheses based on observations
– We need statistics because our data is noisy
 In the real world, haematocrit levels vary
– This unrelated noise in our measure is called ‘error’
 How do we identify dopers?
[Figure: overlapping haematocrit distributions (30% to 50%): in the real world, haematocrit varies between people]
26
Statistical Threshold
If we set the threshold too high, we will fail to detect dopers (high rate of misses). If we set the threshold too low, we will accuse innocent people (high rate of false alarms).
[Figure: haematocrit distributions (30% to 50%) with high and low thresholds marked]
27
Possible outcomes of drug test
Decision \ Reality     nonDoper                            EPO Doper
Accuse                 Innocent expelled (false alarm):    Doper expelled (hit)
                       Type I error
Allow to compete       Innocent competes                   Doper sneaks through (miss):
                       (correct rejection)                 Type II error
28
Errors
 With noisy data, we will make mistakes.
 Statistics allows us to
– Estimate our confidence
– Bias the type of mistake we make (e.g. we can decide whether
we will tend to make false alarms or misses)
 We can be liberal: avoiding misses
 We can be conservative: avoiding false alarms.
 We want liberal tests for airport weapons detection (X-ray
often leads to innocent cases being opened).
 Our society wants conservative tests for criminal
conviction: avoid sending innocent people to jail.
29
Liberal vs Conservative Thresholds
LIBERAL: with a low threshold, we will accuse innocent people (high rate of false alarms, Type I).
CONSERVATIVE: with a high threshold, we will fail to detect dopers (high rate of misses, Type II).
30
Statistical Power
Statistical Power is our probability of making a Hit.
It reflects our ability to detect real effects.
To make new discoveries, we need to optimize power. There are 4 ways to increase power…

Decision \ Reality     Ho true              Ho false
Reject Ho              Type I error         Hit
Accept Ho              Correct rejection    Type II error
31
1.) Alpha and Power
 By making alpha less strict, we can increase power.
(e.g. p < 0.05 instead of 0.01)
 However, we increase the chance of a Type I error!
32
2.) Effect Size and Power
 Power will increase if the effect size increases. (e.g.
higher dose of drug, 7T MRI instead of 1.5T).
 Unfortunately, effect sizes are often small and fixed.
33
3.) Variability and Power
 Reducing variability increases the relative effect size.
 Most measures of brain activity are noisy.
34
4.) Sample Size
A final way to increase our power is to collect
more data.
We can sample a person’s brain activity on
many similar trials.
We can test more people.
The disadvantage is time and money.
Increasing the sample size is often our only
option for increasing statistical power.
35
Reflection
Statistically, relative ‘effect size’ and ‘variability’
are equivalent.
Our confidence is the ratio of effect size versus
variability (signal versus noise).
In the graphs below, the same ratio is used.
[Figure: pairs of distributions with equal effect-size to variability ratios]
36
Alpha level (α)
 Statistics allow us to estimate our confidence.
 α is our statistical threshold: it measures our chance of a Type I error.
 An alpha level of 5% means only a 1/20 chance of a false alarm (we will only accept p < 0.05).
 An alpha level of 1% means only a 1/100 chance of a false alarm (p < 0.01).
 Therefore, a 1% alpha is more conservative than a 5% alpha.
37
Multiple Comparison Problem
Assume a 1% alpha for drug testing.
An innocent athlete only has 1% chance of
being accused.
Problem: 10,500 athletes in the Olympics.
If all are innocent, and α = 1%, we will wrongly accuse 105 athletes (0.01 × 10,500)!
This is the multiple comparison problem.
38
Multiple Comparisons
Gray matter volume is ~900 cc (900,000 mm³).
A typical fMRI voxel is 3×3×3 mm (27 mm³).
Therefore, we will conduct >30,000 tests.
With a 5% alpha, we will make >1,500 false alarms!
39
Multiple Comparison Problem
 If we conduct 20 tests with α = 5%, we will on average make one false alarm (20 × 0.05).
 If we make twenty comparisons, we may be making 0, 1, 2 or in rare cases even more errors.
 The chance that we make at least one error is given by the formula 1 - (1 - α)^C: if we make twenty comparisons at p < .05, we have a 1 - (0.95)^20 = 64% chance of reporting at least one erroneous finding. This is our familywise error (FWE) rate.
40
Bonferroni Correction
Bonferroni Correction: controls FWE.
For example: if we conduct 10 tests, and want
a 5% chance of any errors, we will adjust our
threshold to be p < 0.005 (0.05/10).
Benefits: Controls for FWE.
Problem: Very conservative = very little chance
of detecting real effects = low power.
41
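The familywise error formula and the Bonferroni adjustment above can be verified in a few lines:

```python
# Familywise error for C independent tests at threshold alpha:
# P(at least one false alarm) = 1 - (1 - alpha)**C
alpha, comparisons = 0.05, 20
fwe = 1 - (1 - alpha) ** comparisons
print(round(fwe, 2))                  # 0.64: a 64% chance of at least one false alarm

# Bonferroni: divide alpha by the number of tests to control FWE
bonferroni_alpha = alpha / comparisons
print(bonferroni_alpha)               # stricter per-test threshold
```

With 30,000+ voxels the Bonferroni threshold becomes extremely strict, which is why it costs so much power.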
Random Field Theory
 We spatially smooth our data: peaks due to noise should be attenuated by their neighbors (Worsley et al., HBM 4:58-73, 1995).
 RFT uses resolution elements (resels) instead of voxels.
– If we smooth our data with an 8 mm FWHM kernel, the resel size is 8 mm.
 SPM uses RFT for FWE correction: it only requires the statistical map, its smoothness and a cluster size threshold.
– Euler characteristic: unsmoothed noise will have high peaks but few clusters; smoothed data will have lower peaks but show clustering.
 RFT has many unchecked assumptions (Nichols).
 Works best for heavily smoothed data (3× voxel size).
[Figure: noise smoothed at 5, 10 and 15 mm; image from Nichols]
42
Permutation Thresholding
[Figure: example datasets labeled Group 1 and Group 2]
 Prediction: Label ‘Group 1’
and ‘Group 2’ mean
something.
 Null Hypothesis (Ho):
Labels are meaningless.
 If Ho true, we should get
similar t-scores if we
randomly scramble order.
43
Permutation Thresholding
[Figure: Group 1 and Group 2 data columns]
Observed, max T = 4.1
1. Permutation 1, max T = 3.2
2. Permutation 2, max T = 2.9
3. Permutation 3, max T = 3.3
4. Permutation 4, max T = 2.8
5. Permutation 5, max T = 3.5
…
1000. Permutation 1000, max T = 3.1
44
Permutation Thresholding
 Compute the maximum t-score for each of 1000 permutations.
 Find the top 5% cutoff of the max-T distribution (here T = 3.9).
 Any voxel in our observed dataset that exceeds this threshold has only a 5% probability of being noise.
[Figure: sorted max T (0 to 5) vs percentile (0 to 100), with the 5% cutoff at T = 3.9]
45
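A minimal sketch of the max-T procedure on simulated two-group data (this is not the actual SnPM or randomise implementation; group sizes and the effect are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n_sub, n_vox = 12, 50
group1 = rng.normal(0.0, 1.0, (n_sub, n_vox))
group2 = rng.normal(0.0, 1.0, (n_sub, n_vox))
group2[:, 0] += 3.0                          # one truly active voxel

def max_t(a, b):
    """Largest absolute two-sample t-score across all voxels."""
    diff = b.mean(0) - a.mean(0)
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return np.max(np.abs(diff / se))

observed = max_t(group1, group2)

pooled = np.vstack([group1, group2])
null_max = []
for _ in range(1000):
    perm = rng.permutation(2 * n_sub)        # scramble the group labels
    null_max.append(max_t(pooled[perm[:n_sub]], pooled[perm[n_sub:]]))

threshold = np.percentile(null_max, 95)      # corrected 5% cutoff
print(observed, threshold)
```

Voxels whose observed t exceeds the permutation threshold survive the familywise correction.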
Permutation Thresholding
 Permutation Thresholding offers the same protection
against false alarms as Bonferroni.
 Typically, much more powerful than Bonferroni.
 Implementations include SnPM, FSL’s randomise, and
my own NPM.
 Disadvantage: computing 1000 permutations takes ~1000 times longer than a typical analysis!
Simulation data from Nichols et al.:
Permutation always optimal. Bonferroni
typically conservative. Random Fields only
accurate with high DF and heavily smoothed.
46
False Discovery Rate
Traditional statistics attempts to control the
False Alarm rate.
‘False Discovery Rate’ controls the proportion of reported discoveries that are false alarms.
It often provides much more power than
Bonferroni correction.
47
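FDR is typically controlled with the Benjamini-Hochberg step-up procedure; the lecture does not name the algorithm, so this sketch is an assumption, with simulated p-values:

```python
import numpy as np

def fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg: largest p(k) with p(k) <= q*k/m (0 if none pass)."""
    p = np.sort(np.asarray(p_values))
    m = len(p)
    below = p <= q * np.arange(1, m + 1) / m
    return p[below].max() if below.any() else 0.0

# Mix of 'active' voxels (tiny p-values) and pure noise (uniform p-values)
rng = np.random.default_rng(5)
p_vals = np.concatenate([rng.uniform(0, 0.0001, 10),   # real signal
                         rng.uniform(0, 1, 990)])      # noise

thresh = fdr_threshold(p_vals)
print(thresh, np.sum(p_vals <= thresh))
```

Because the cutoff adapts to how much signal is present, the threshold is far less strict than the Bonferroni level of 0.05/1000.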
FDR
Assume Olympics where no athletes took EPO:
Assume Olympics where some cheat:
– When we conduct many tests, we can estimate the
amount of real signal
48
FDR vs FWE
Bonferroni FWE applies same threshold to
each data set
5% Bonferroni: only
a 5% chance an
innocent athlete will
be accused.
FDR is dynamic: threshold based on signal
detected.
5% FDR: only 5% of
expelled athletes
are innocent.
49
Controlling for multiple comparisons
 Bonferroni correction
– We will often fail to find real results.
 RFT correction
– Typically less conservative than Bonferroni.
– Requires large DF and broad smoothing.
 Permutation Thresholding
– Offers same inference as Bonferroni correction.
– Typically much less conservative than Bonferroni.
– Computationally very slow
 FDR correction
– At FDR of .05, about 5% of ‘activated’ voxels will be false alarms.
– If signal is only a tiny proportion of the data, FDR will be similar to Bonferroni.
50
Alternatives to voxelwise analysis
 Conventional fMRI statistics compute one statistical comparison
per voxel.
– Advantage: can discover effects anywhere in brain.
– Disadvantage: low statistical power due to multiple comparisons.
 Small Volume Correction (SVC): only test a small proportion of voxels (still adjusted via RFT).
 Region of Interest (ROI): pool data across an anatomical region for a single statistical test.
[Figure: example slice; number of comparisons: SPM: 1600, SVC: 57, ROI: 1]
51
ROI analysis
 In voxelwise analysis, we conduct an independent test for every voxel.
– Each voxel is noisy.
– Huge number of tests, so a severe penalty for multiple comparisons.
 Alternative: pool data from a region of interest.
– Averaging across a meaningful region should reduce noise.
– One test per region, so the FWE adjustment is less severe.
 The region must be selected independently of the statistical contrast!
– Anatomically predefined
– Defined based on a previous localizer session
– Selected based on a combination of the conditions you will contrast.
[Figure: M1: movement; S1: sensation]
52
Inference from fMRI statistics
fMRI studies have very low power.
– Correction for multiple comparisons
– Poor signal to noise
– Variability in functional anatomy between people.
Null results impossible to interpret. (Hard to
say an area is not involved with task).
53
Between and Within Subject Variance
 Consider an experiment to see if music influences typing speed.
 The possible effect will be small.
 Large variability between people: some people are much better typists than others.
 Solution: a repeated-measures design to separate between- and within-subject variability.
[Figure: typing speed (words per minute, 0 to 70) for Alice, Bob, Donna, Nick and Sam under Bach, Rock and Silent conditions]
54
Multiple Subject Analysis: Mixed Model
Model all of the data at
once
Between and within
subject variation is
accounted for
Can’t apply mixed model
directly to fMRI data
because there is so
much data!
[Figure: subject-level models (Sub 1 to 4) feed a single group model producing Z stats]
55
Multiple Subject Analysis: SPM2
 First estimate each
subject’s contrast effect
sizes (copes)
 Run a t-test on the copes
 Holmes and Friston
assume within subject
variation is same for all
subjects, this allows them
to ignore it at the group
level
– Not equivalent to a mixed model
[Figure: per-subject copes (Sub 1 to 4) feed a group-level t-test producing results]
56
Multiple Subject Analysis: FSL
 First estimate each subject’s
copes and cope variability
(varcopes)
 Then enter the copes and
varcopes into group model
– varcopes supply within subject
variation
– Between subject variation and
group level means are then
estimated
 Equivalent to mixed model
 Much slower than SPM
[Figure: per-subject copes and varcopes (Sub 1 to 4) feed the group model producing Z stats]
57
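The difference between ignoring and using within-subject variance can be sketched as precision weighting; this is a simplification of FSL's actual estimation (FLAME), and the numbers are invented:

```python
import numpy as np

# Hypothetical per-subject effect sizes (copes) and their variances (varcopes)
copes = np.array([2.1, 1.8, 2.5, 0.9, 2.2])
varcopes = np.array([0.2, 0.3, 0.2, 1.5, 0.25])   # subject 4 is very noisy

# Summary-statistics approach: ignore within-subject variance, plain mean
plain_mean = copes.mean()

# Precision-weighting sketch: trust precise subjects more (weight = 1/varcope)
weights = 1.0 / varcopes
weighted_mean = np.sum(weights * copes) / np.sum(weights)

print(plain_mean, weighted_mean)
```

The noisy subject drags the plain mean down; precision weighting limits that subject's influence on the group estimate.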