Nonparametric statistics - Center for Astrostatistics

Tom Hettmansperger
Department of Statistics, Penn State University
References:
1. Higgins (2004), Introduction to Modern Nonparametric Statistics
2. Hollander and Wolfe (1999), Nonparametric Statistical Methods
3. Arnold Notes
4. Johnson, Morrell, and Schick (1992), Two-Sample Nonparametric Estimation and Confidence Intervals Under Truncation, Biometrics, 48, 1043-1056.
5. Beers, Flynn, and Gebhardt (1990), Measures of Location and Scale for Velocities in Clusters of Galaxies: A Robust Approach. Astron J, 100, 32-46.
6. Website: http://www.stat.wmich.edu/slab/RGLM/
Robustness and a little philosophy
Robustness: Insensitivity to assumptions
a. Structural Robustness of Statistical Procedures
b. Statistical Robustness of Statistical Procedures
Structural Robustness
(Developed in 1970s)
Influence: How does a statistical procedure respond to a single
outlying observation as it moves farther from the center of
the data?
Want: Methods with bounded influence.
Breakdown Point: What proportion of the data must be
contaminated in order to move the procedure beyond any
bound?
Want: Methods with positive breakdown point.
Beers et al. is an excellent reference.
Statistical Robustness
(Developed in 1960s)
Hypothesis Tests:
• Level Robustness in which the significance level is not sensitive
to model assumptions.
• Power Robustness in which the statistical power of a test to
detect important alternative hypotheses is not sensitive to
model assumptions.
Estimators:
• Variance Robustness in which the variance (precision) of an
estimator is not sensitive to model assumptions.
Not sensitive to model assumptions means that the property
remains good throughout a neighborhood of the assumed
model.
Examples
1. The sample mean is not structurally robust and is not variance robust.
2. The sample median is structurally robust and is variance robust.
3. The t-test is level robust (asymptotically) but is not structurally robust nor
power robust.
4. The sign test is structurally and statistically robust. Caveat: it is not very
powerful at the normal model.
5. Trimmed means are structurally and variance robust.
6. The sample variance is robust neither structurally nor in variance.
7. The interquartile range is structurally robust.
Recall that the sample mean minimizes
$$\sum_i (x_i - \theta)^2$$
Replace the quadratic by $\rho(x)$, which does not
increase like a quadratic. Then minimize:
$$\sum_i \rho(x_i - \theta)$$
The result is an M-estimator, which is structurally
robust and variance robust.
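As a quick numerical sketch (not from the slides): taking $\rho$ to be Huber's function, a common bounded-influence choice, the minimization can be carried out directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_rho(u, k=1.345):
    """Huber's rho: quadratic near zero, linear in the tails,
    so a single wild observation has bounded influence."""
    return np.where(np.abs(u) <= k, 0.5 * u**2, k * np.abs(u) - 0.5 * k**2)

def m_estimate_location(x, k=1.345):
    """M-estimate of location: minimize sum rho(x_i - theta) over theta."""
    x = np.asarray(x, dtype=float)
    return minimize_scalar(lambda t: huber_rho(x - t, k).sum()).x

x = [1.2, 0.8, 1.1, 0.9, 1.0, 50.0]   # one gross outlier
print(m_estimate_location(x))          # stays near 1; the mean is ~9.2
```

The tuning constant k = 1.345 is the usual choice giving about 95% efficiency at the normal model.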
See Beers et al.
The nonparametric tests described here are often
called distribution free because their significance
levels do not depend on the underlying model
assumption. Hence, they are level robust.
They are also power robust and structurally robust.
The estimators that are associated with the tests
are structurally and variance robust.
Opinion: The importance of nonparametric methods
resides in their robustness, not in the distribution-free
property (level robustness) of nonparametric tests.
Single Sample Methods
• Robust Data Summaries
• Graphical Displays
• Inference: Confidence Intervals and
Hypothesis Tests
Location, Spread, Shape
CI-Boxplots (notched boxplots)
Histograms, dotplots, kernel density estimates.
Absolute Magnitude
Planetary Nebulae, Milky Way
Abs Mag (n = 81), first values shown:
-5.140 -6.700 -6.970 -7.190 -7.273 -7.365 -7.509 -7.633 -7.741
-8.000 -8.079 -8.359 -8.478 -8.558 -8.662 -8.730 -8.759 -8.825
…
Dotplot of Abs Mag
[Dotplot on a scale from about -14.4 to -6.0]
Summary for Abs Mag
Anderson-Darling Normality Test: A-Squared = 0.30, P-Value = 0.567
Mean = -10.324, StDev = 1.804, Variance = 3.253
Skewness = 0.305015, Kurtosis = -0.048362, N = 81
Minimum = -14.205, 1st Quartile = -11.564, Median = -10.557, 3rd Quartile = -9.144, Maximum = -5.140
85% Confidence Interval for Mean: (-10.615, -10.032)
85% Confidence Interval for Median: (-10.699, -10.208)
85% Confidence Interval for StDev: (1.622, 2.039)
Probability Plot of Abs Mag (Normal - 95% CI)
Mean = -10.32, StDev = 1.804, N = 81, AD = 0.303, P-Value = 0.567
But don’t be too quick to “accept” normality:
Probability Plot of Abs Mag (3-Parameter Weibull - 95% CI)
Shape = 2.680, Scale = 5.027, Threshold = -14.79, N = 81, AD = 0.224, P-Value > 0.500
Histogram of Abs Mag with 3-Parameter Weibull fit
Shape = 2.680, Scale = 5.027, Threshold = -14.79, N = 81
Weibull Distribution:
$$f(x) = \frac{c\,(x-t)^{c-1}}{b^{c}}\,\exp\!\left\{-\left(\frac{x-t}{b}\right)^{c}\right\} \quad \text{for } x > t, \text{ and } 0 \text{ otherwise}$$
t = threshold, b = scale, c = shape
Null Hypothesis: The population distribution F(x) is normal.
The Kolmogorov-Smirnov Statistic:
$$D = \max_x |F_n(x) - F(x)|$$
The Anderson-Darling Statistic:
$$AD = n \int (F_n(x) - F(x))^2 \,[F(x)(1 - F(x))]^{-1}\, dF(x)$$
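Both statistics are available in scipy; a sketch on simulated data standing in for the 81 Abs Mag values (not the actual dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=-10.3, scale=1.8, size=81)   # stand-in for the Abs Mag sample

# Kolmogorov-Smirnov: D = max |Fn(x) - F(x)|.  Note the p-value is
# optimistic when mean and sd are estimated from the same data.
d_stat, ks_p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

# Anderson-Darling: weights tail discrepancies via [F(1-F)]^{-1};
# scipy returns the statistic with critical values rather than a p-value.
ad = stats.anderson(x, dist='norm')
print(d_stat, ad.statistic)
```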
Boxplot of Abs Mag (with 95% CI)
[Boxplot on a scale from -14 to -5, annotated: outlier, whisker, 3rd quartile, median, 1st quartile, and the 95% confidence interval for the median (in red)]
Anatomy of a 95% CI-Boxplot
• Box formed by quartiles and median
• IQR (interquartile range) = Q3 - Q1
• Whiskers extend from the end of the box to the farthest point within 1.5 x IQR.
For a normal benchmark distribution, IQR = 1.348 x StDev and 1.5 x IQR = 2 x StDev.
Outliers beyond the whiskers are more than 2.7 stdevs from the median. For a normal distribution this should happen about .7% of the time.
Pseudo StDev = .75 x IQR
Estimation of scale or variation.
Recall the sample standard deviation is not robust.
From the boxplot we can find the interquartile range.
Define a pseudo standard deviation by .75IQR.
This is a consistent estimate of the population standard
deviation in a normal population. It is robust.
The MAD is even more robust. Define the MAD by
Med|x – med(x)|. Further define another pseudo
standard deviation by 1.5 x MAD. Again this is calibrated
to a normal population.
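A sketch of both robust scale estimates, using the slide's calibration constants (.75 and 1.5; the exact normal-consistency constants are 1/1.349 and 1.4826):

```python
import numpy as np

def pseudo_sd_iqr(x):
    """.75 x IQR: approximately consistent for sigma at the normal
    model (where IQR = 1.348 sigma); 25% breakdown point."""
    q1, q3 = np.percentile(x, [25, 75])
    return 0.75 * (q3 - q1)

def pseudo_sd_mad(x):
    """1.5 x MAD with MAD = median |x - median(x)|; 50% breakdown point."""
    x = np.asarray(x, dtype=float)
    return 1.5 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(1)
x = rng.normal(0, 2, 500)
print(pseudo_sd_iqr(x), pseudo_sd_mad(x))   # both close to the true sigma = 2
```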
Suppose we enter 5.14 rather than -5.14 in the data set.
The table shows the effect on the non-robust StDev
and the robust IQR and MAD.

            No Outlier   Outlier   Breakdown
StDev          1.80        2.43     1 obs
.75 x IQR      1.82        1.82     25%
1.5 x MAD      1.78        1.78     50%
The confidence interval and hypothesis test

A population is located at $d_0$ if the population median is $d_0$.
Sample $X_1, \ldots, X_n$ from the population.
Say $X_1, \ldots, X_n$ is located at $d$ if $X_1 - d, \ldots, X_n - d$ is located at 0.
$S(d) = S(X_1 - d, \ldots, X_n - d)$ is a statistic useful for location analysis if
$E_{d_0}(S(d_0)) = 0$ when the population is located at $d_0$.

Sign Statistic:
$$S(d) = \sum \mathrm{sgn}(X_i - d) = \#\{X_i > d\} - \#\{X_i < d\} = S^+(d) - S^-(d), \qquad S(d) = 2S^+(d) - n$$
Estimate $d_0$ from the data; note $E_{d_0} S(d_0) = 0$:
Find $\hat d$ with $S(\hat d) \approx 0$ [or $S^+(\hat d) \approx n/2$].
Solution: $\hat d = \mathrm{median}(X_i)$.
HYPOTHESIS TEST of $H_0: d = d_0$ vs. $H_A: d \ne d_0$
Rule: reject $H_0$ if $|S(d_0)| = |2S^+(d_0) - n| \ge c$,
where $P_{d_0}(|2S^+(d_0) - n| \ge c) = \alpha$. Equivalently,
$$S^+(d_0) \le \frac{n-c}{2} = k \quad \text{or} \quad S^+(d_0) \ge \frac{n+c}{2} = n - k$$
Under $H_0: d = d_0$, $S^+(d_0)$ is distributed Binomial$(n, \tfrac12)$.
Distribution Free

CONFIDENCE INTERVAL
$d$ is the population location.
$$P_d(k < S^+(d) < n - k) = 1 - \alpha$$
Find the smallest $d$ with $\#\{X_i > d\} \le n - k$:
$d = X_{(k)}$: $\#\{X_i > X_{(k)}\} = n - k$
$d_{\min} = X_{(k+1)}$: $\#\{X_i > X_{(k+1)}\} = n - k - 1$
Likewise $d_{\max} = X_{(n-k)}$.
Then $[X_{(k+1)}, X_{(n-k)}]$ is a $(1-\alpha)100\%$ confidence interval.
Distribution Free
SUMMARY:
$X_1, \ldots, X_n$ a sample from a population located at $d_0$.
SIGN STATISTIC: $S(d) = S^+(d) - S^-(d) = \#\{X_i > d\} - \#\{X_i < d\}$
ESTIMATE: $\hat d$ with $S(\hat d) \approx 0$, i.e. $\hat d = \mathrm{median}(X_i)$
TEST of $H_0: d = d_0$ vs. $H_A: d \ne d_0$, i.e. $H_0: P(X > d_0) = \tfrac12$ vs. $H_A: P(X > d_0) \ne \tfrac12$:
reject $H_0$ if $S^+(d_0) \le k$ or $\ge n - k$,
where $P_{d_0}(S^+(d_0) \le k) = \alpha/2$ and $S^+(d_0)$ is Binomial$(n, 1/2)$.
CONFIDENCE INTERVAL: if $P_d(S^+(d) \le k) = \alpha/2$ then
$[X_{(k+1)}, X_{(n-k)}]$ has confidence coefficient $(1-\alpha)100\%$.
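The summary above translates directly into code; a sketch (a hypothetical helper, not the slide's software) of the sign-test point estimate and distribution-free CI for the median:

```python
import numpy as np
from scipy.stats import binom

def sign_test_ci(x, conf=0.95):
    """Median estimate and the order-statistic interval [X_(k+1), X_(n-k)],
    with k chosen so that P(S+ <= k) <= alpha/2 under Binomial(n, 1/2)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    alpha = 1 - conf
    k = int(binom.ppf(alpha / 2, n, 0.5))
    if binom.cdf(k, n, 0.5) > alpha / 2:   # ppf may land one step too high
        k -= 1
    # 0-based indexing: X_(k+1) is x[k], X_(n-k) is x[n-k-1]
    return np.median(x), (x[k], x[n - k - 1])

est, (lo, hi) = sign_test_ci(np.arange(1.0, 26.0))   # data 1, 2, ..., 25
print(est, lo, hi)
```

Because S⁺ is Binomial(n, 1/2), the achieved confidence is at least the nominal level but slightly conservative at most n.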
Boxplot of Abs Mag (with 95% CI)
[Boxplot on a scale from -14 to -5]
Q1 = -11.5, Median = -10.7, SE Med = .18, Q3 = -9.14, IQR = 2.42
Additional Remarks:
The median is a robust measure of location. It is not affected by outliers.
It is efficient when the population has heavier tails than a normal
population.
The sign test is also robust and insensitive to outliers. It is efficient when
the tails are heavier than those of a normal population.
Similarly for the confidence interval.
In addition, the test and the confidence interval are distribution free and
do not depend on the shape of the underlying population to determine
critical values or confidence coefficients.
They are only 64% efficient relative to the mean and t-test when the
population is normal.
If the population is symmetric then the Wilcoxon Signed Rank statistic
can be used, and it is robust against outliers and 95% efficient relative to
the t-test.
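The Wilcoxon signed rank test mentioned above is available in scipy; a minimal sketch on simulated symmetric data (made-up values, not the slides'):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(2)
x = rng.normal(loc=0.5, scale=1.0, size=40)   # symmetric about 0.5

# H0: the population is symmetric about 0
stat, p = wilcoxon(x)
print(stat, p)
```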
Two-Sample Methods
Two-Sample Comparisons
85% CI-Boxplots
Mann-Whitney-Wilcoxon Rank Sum Statistic
•Estimate of difference in locations
•Test of difference in locations
•Confidence Interval for difference in locations
Levene’s Rank Statistic for differences in scale
or variance.
85% CI-Boxplots
[Side-by-side 85% CI-boxplots of MW and M-31, on a scale from about -15 to 20]
Boxplot of App Mag, M-31
[Boxplot on a scale from about 10 to 19]
Dotplot of App Mag, M-31
[Dotplot on a scale from about 11 to 18]
Summary for App Mag, M-31
Anderson-Darling Normality Test: A-Squared = 1.79, P-Value < 0.005
Mean = 14.458, StDev = 1.195, Variance = 1.427
Skewness = -0.396822, Kurtosis = 0.366104, N = 360
Minimum = 10.749, 1st Quartile = 13.849, Median = 14.540, 3rd Quartile = 15.338, Maximum = 18.052
85% Confidence Interval for Mean: (14.367, 14.549)
85% Confidence Interval for Median: (14.453, 14.610)
85% Confidence Interval for StDev: (1.134, 1.263)
Summary for App Mag (low outliers removed)
Anderson-Darling Normality Test: A-Squared = 1.01, P-Value = 0.012
Mean = 14.522, StDev = 1.115, Variance = 1.243
Skewness = -0.172496, Kurtosis = 0.057368, N = 353
Minimum = 11.685, 1st Quartile = 13.887, Median = 14.550, 3rd Quartile = 15.356, Maximum = 18.052
85% Confidence Interval for Mean: (14.436, 14.607)
85% Confidence Interval for Median: (14.483, 14.639)
85% Confidence Interval for StDev: (1.058, 1.179)
Probability Plot of App Mag (Normal - 95% CI)
Mean = 14.46, StDev = 1.195, N = 360, AD = 1.794, P-Value < 0.005
Why 85% Confidence Intervals?
We have the following test of
$$H_0: d = d_1 - d_2 = 0 \quad \text{vs.} \quad H_A: d = d_1 - d_2 \ne 0$$
Rule: reject the null hypothesis if the 85% confidence
intervals do not overlap.
The significance level is close to 5% provided
the ratio of sample sizes is less than 3.
Mann-Whitney-Wilcoxon Statistic: the sign statistic on the
pairwise differences.
$X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ with $X$ from population $F$
and $Y$ from population $G$, with $d = d_Y - d_X$.
$$U(d) = \sum_i \sum_j \mathrm{sgn}(Y_j - d - X_i) = U^+(d) - U^-(d) = \#\{Y_j - X_i > d\} - \#\{Y_j - X_i < d\}$$
Unlike the sign test (64% efficiency for a normal population), the MWW test
has 95.5% efficiency for a normal population. And it is robust against
outliers in either sample.
SUMMARY:
MWW STATISTIC: $U(d) = U^+(d) - U^-(d) = \#\{Y_j - X_i > d\} - \#\{Y_j - X_i < d\}$
ESTIMATE: $\hat d$ with $U(\hat d) \approx 0$, i.e. $\hat d = \mathrm{median}_{i,j}(Y_j - X_i)$
TEST of $H_0: d = 0$ vs. $H_A: d \ne 0$, i.e. $H_0: P(Y > X) = \tfrac12$ vs. $H_A: P(Y > X) \ne \tfrac12$:
reject $H_0$ if $U^+(0) \le k$ or $\ge mn - k$,
where $P_0(U^+(0) \le k) = \alpha/2$ and $U^+(0)$ has a tabled distribution.
CONFIDENCE INTERVAL: if $P_d(U^+(d) \le k) = \alpha/2$ then
$[D_{(k+1)}, D_{(mn-k)}]$ has confidence coefficient $(1-\alpha)100\%$,
where $D_{(1)} \le \cdots \le D_{(mn)}$ are the ordered pairwise differences.
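The MWW test and the associated shift estimate (the median of the pairwise differences, as in the summary above) can be sketched as follows, using made-up samples:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def hodges_lehmann_shift(x, y):
    """Estimate d = d_Y - d_X by the median of all mn pairwise
    differences Y_j - X_i."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.median((y[:, None] - x[None, :]).ravel()))

x = [1.1, 2.0, 3.2, 0.4, 2.5]
y = [4.0, 5.1, 3.9, 6.2, 5.5]
u, p = mannwhitneyu(y, x, alternative='two-sided')   # u counts pairs with Y > X
print(hodges_lehmann_shift(x, y), u, p)
```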
Mann-Whitney Test and CI: App Mag, Abs Mag

                  N    Median
App Mag (M-31)  360    14.540
Abs Mag (MW)     81   -10.557

Point estimate for d is 24.900
95.0 Percent CI for d is (24.530, 25.256)
W = 94140.0
Test of d = 0 vs d not equal 0 is significant at 0.0000
What is W?
$$U^+ = \#\{Y_j > X_i\}, \qquad W = U^+ + \frac{n(n+1)}{2} = \sum_{j=1}^{n} R_j$$
$R_1, \ldots, R_n$ are the ranks of $Y_1, \ldots, Y_n$ in the combined data.
$$\bar R_Y - \bar R_X = \left(\frac{1}{n} + \frac{1}{m}\right)\left(U^+ - \frac{nm}{2}\right)$$
Hence MWW can be written as the difference in
average ranks, rather than $\bar Y - \bar X$ as in the t-test.
What about spread or scale differences between the two populations?
Below we shift the MW observations to the right by 24.9 to line up with
M-31.
Dotplot of MW and M-31
[Dotplots on a common scale from about 11.2 to 19.6; each symbol represents up to 2 observations]

Variable   StDev   IQR     PseudoStdev
MW         1.804   2.420   1.815
M-31       1.195   1.489   1.117
Levene’s Rank Test
Compute |Y – Med(Y)| and |X – Med(X)|, called absolute deviations.
Apply MWW to the absolute deviations. (Rank the absolute deviations)
The test rejects equal spreads in the two populations when difference
in average ranks of the absolute deviations is too large.
Idea: After we have centered the data, then if the null hypothesis
of no difference in spreads is true, all permutations of the combined data
are roughly equally likely. (Permutation Principle)
So randomly select a large set of the permutations, say B permutations.
Assign the first n to the Y sample and the remaining m to the X sample
and compute MWW on the absolute deviations.
The approximate p-value is #{permutation MWW > original MWW} divided by B.
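A sketch of that permutation scheme, applying the rank comparison to the absolute deviations (function and variable names are illustrative, not the slide's code):

```python
import numpy as np
from scipy.stats import rankdata

def levene_rank_perm(x, y, B=2000, seed=0):
    """Permutation p-value for a spread difference: rank the absolute
    deviations from each sample's median, then compare the observed
    difference in mean ranks against B random reassignments."""
    rng = np.random.default_rng(seed)
    ax = np.abs(np.asarray(x, dtype=float) - np.median(x))
    ay = np.abs(np.asarray(y, dtype=float) - np.median(y))
    combined = np.concatenate([ay, ax])
    n = len(ay)

    def diff_mean_ranks(v):
        r = rankdata(v)
        return r[:n].mean() - r[n:].mean()

    observed = diff_mean_ranks(combined)
    hits = sum(abs(diff_mean_ranks(rng.permutation(combined))) >= abs(observed)
               for _ in range(B))
    return observed, hits / B

x = np.random.default_rng(1).normal(0, 1, 60)   # sd 1
y = np.random.default_rng(2).normal(0, 3, 60)   # sd 3: clearly more spread
obs, p = levene_rank_perm(x, y)
print(obs, p)
```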
Difference of rank means of absolute deviations: 51.9793
[Histogram of levenerk with normal fit: Mean = 0.1644, StDev = 16.22, N = 1000; the observed value 52 lies far in the right tail]
So we easily reject the null hypothesis of no difference in spreads and
conclude that the two populations have significantly different spreads.
One Sample Methods
Two Sample Methods
k-Sample Methods
Variable     Mean    StDev   Median   .75IQR   Skew    Kurtosis
Messier 31   22.685  0.969   23.028   1.069    -0.67   -0.67
Messier 81   24.298  0.274   24.371   0.336    -0.49   -0.68
NGC 3379     26.139  0.267   26.230   0.317    -0.64   -0.48
NGC 4494     26.654  0.225   26.659   0.252    -0.36   -0.55
NGC 4382     26.905  0.201   26.974   0.208    -1.06   1.08
All one-sample and two-sample methods can be applied one at a time
or two at a time. Plots, summaries, inferences.
We begin k-sample methods by asking if the location differences between
the NGC nebulae are statistically significant.
We will briefly discuss issues of truncation.
85% CI-Boxplot of Planetary Nebula Luminosities
[Boxplots for M-31, M-81, NGC-3379, NGC-4494, NGC-4382 on a scale from about 20 to 28]
Extending MWW to several samples
Given $N$ = total sample size, rank the combined data and let
$\bar R_1, \bar R_2, \bar R_3$ be the average ranks of the samples. Construct:
$$KW = \frac{12}{N(N+1)}\left\{\frac{n_1 n_2}{N}(\bar R_1 - \bar R_2)^2 + \frac{n_1 n_3}{N}(\bar R_1 - \bar R_3)^2 + \frac{n_2 n_3}{N}(\bar R_2 - \bar R_3)^2\right\}$$
$$= \frac{12}{N(N+1)}\left\{n_1\left(\bar R_1 - \frac{N+1}{2}\right)^2 + n_2\left(\bar R_2 - \frac{N+1}{2}\right)^2 + n_3\left(\bar R_3 - \frac{N+1}{2}\right)^2\right\}$$
Generally use a chi-square with $k - 1 = 2$ degrees of freedom as the
approximate sampling distribution for KW.
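scipy provides the test directly; the sketch below (made-up group data) also recomputes KW from the average-rank formula above to show the two agree up to scipy's tie correction:

```python
import numpy as np
from scipy.stats import kruskal, rankdata

g1 = [26.1, 26.3, 26.2, 26.4, 26.0]
g2 = [26.6, 26.7, 26.5, 26.8, 26.6]
g3 = [27.0, 26.9, 27.1, 27.0, 26.8]

H, p = kruskal(g1, g2, g3)             # built-in, with tie correction

# KW = 12/(N(N+1)) * sum_i n_i (Rbar_i - (N+1)/2)^2
combined = np.concatenate([g1, g2, g3])
ranks = rankdata(combined)
N = len(combined)
kw, start = 0.0, 0
for n_i in (len(g1), len(g2), len(g3)):
    rbar = ranks[start:start + n_i].mean()
    kw += n_i * (rbar - (N + 1) / 2) ** 2
    start += n_i
kw *= 12 / (N * (N + 1))
print(H, kw, p)
```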
Kruskal-Wallis Test on NGC

sub       N    Median   Ave Rank     Z
1         45   26.23     29.6      -9.39
2        101   26.66    104.5       0.36
3         59   26.97    156.4       8.19
Overall  205             103.0

KW = 116.70   DF = 2   P = 0.000
This test can be followed by multiple comparisons.
For example, if we assign a family error rate
of .09, then we would conduct 3 MWW tests, each
at a level of .03. (Bonferroni)
85% CI-Boxplot
[85% CI-boxplots for NGC3379, NGC4494, NGC4382 on a scale from about 25.50 to 27.25]
What to do about truncation.
1. See a statistician.
2. Read the Johnson, Morrell, and Schick reference, and then
see a statistician.
Here is the problem: Suppose we want to estimate the difference in locations
between two populations: $F(x)$ and $G(y) = F(y - d)$.
But (with right truncation at $a$) the observations come from
$$F_a(x) = \frac{F(x)}{F(a)} \ \text{for } x \le a \ \text{and } 1 \text{ for } x > a$$
$$G_a(y) = \frac{F(y - d)}{F(a - d)} \ \text{for } y \le a \ \text{and } 1 \text{ for } y > a$$
Suppose d > 0 and so we want to shift the X-sample to the right toward the
truncation point. As we shift the Xs, some will pass the truncation point and
will be eliminated from the data set. This changes the sample sizes and
requires adjustment when computing the corresponding MWW to see if
it is equal to its expectation. See the reference for details.
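A sketch of that bookkeeping with made-up samples and an assumed truncation point; here E(W) is the usual rank-sum expectation n(m'+n+1)/2 with m' the post-truncation X count:

```python
import numpy as np
from scipy.stats import rankdata

def truncated_w(x, y, d, a):
    """Shift the X-sample right by d, drop shifted Xs beyond the
    truncation point a, and return (W, E(W), m') where W is the
    rank-sum of Y in the combined data and m' the surviving X count."""
    xs = np.asarray(x, dtype=float) + d
    xs = xs[xs <= a]                        # Xs shifted past a are lost
    m, n = len(xs), len(y)
    ranks = rankdata(np.concatenate([y, xs]))
    w = ranks[:n].sum()
    return w, n * (m + n + 1) / 2, m

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100)
y = rng.normal(2, 1, 60)
for d in [1.0, 1.5, 2.0, 2.5]:            # scan d; pick where W is near E(W)
    w, ew, m = truncated_w(x, y, d, a=3.5)
    print(d, m, w, ew)
```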
Comparison of NGC4382 and NGC 4494
Data multiplied by 100 and 2600 subtracted. Truncation point taken as 120.
Point estimate for d is 25.30
W = 6595.5
m = 101 and n = 59

Computation of shift estimate with truncation

  d      m    n      Z        W       E(W)
 25.3    88   59    5.10    4750.5   4366.0
 28.3    84   59    3.60    4533.5   4248.0
 30.3    83   59    2.10    4372.0   4218.5
 32.3    81   59    0.80    4224.5   4159.5
 33.3    81   59   -0.20    4144.5   4159.5
 33.1    81   59   -0.00    4161.5   4159.5

The last row, where W matches E(W), gives the truncation-adjusted estimate d̂ = 33.1.
Robust regression fitting and correlation (association)
Dataset (http://astrostatistics.psu.edu/datasets/HIP_star.html)
We have extracted a sample of 50 from the subset of 2719 Hipparcos stars
Vmag = Visual band magnitude. This is an inverted logarithmic
measure of brightness.
Plx = Parallactic angle (mas = milliarcseconds).
1000/Plx gives the distance in parsecs (pc).
B-V = Color of star (mag)
The HR diagram plots logL vs. B-V, where (roughly) the log-luminosity
in units of solar luminosity is
logL = (15 - Vmag - 5 logPlx)/2.5.
All logs are base 10.
Row
LogL BV
1 0.69233 0.593
2 1.75525 0.935
3 -0.30744 0.830
4 -0.17328 0.685
5 0.57038 0.529
6 -1.04471 1.297
7 0.51396 0.510
8 0.52149 0.607
9 -1.06306 1.288
10 0.41990 0.677
11 -0.76152 0.950
12 -1.10608 1.260
13 0.42593 0.651
14 -0.44066 0.909
15 -0.90039 1.569
16 -0.74118 1.065
17 -0.66820 1.049
18 -0.26810 0.884
19 0.56722 0.480
20 -0.93809 0.490
21 -0.38095 1.160
22 -0.19267 0.810
23 0.54619 0.498
24 0.20161 0.614
25 0.37348 0.538
26 -0.38556 0.879
27 -0.22978 0.723
28 0.57671 0.455
29 -1.00092 1.110
30 -0.00215 0.637
31 -0.95768 1.616
32 0.10378 0.606
33 -1.43872 1.365
34 1.23674 0.395
35 0.10866 0.630
36 -1.60621
*
37 0.06468 0.599
38 -0.18214 0.709
39 0.37988 0.561
40 1.23793 0.257
41 -0.16896 0.864
42 -0.59331 0.955
43 1.78028 1.010
44 -0.63099 1.100
45 0.61900 0.664
46 -0.28520 0.706
47 -0.71404 0.898
48 0.35061 0.616
49 0.55002 0.466
50 0.37922 0.548
Fitted Line Plot
[Scatterplot of LogL vs. BV with two fitted lines]
Resistant line (black): logL = 1.513 - 2.067 BV
Least squares line (blue): LogL = 1.253 - 1.605 BV
The resistant line is robust and not affected by the outliers. It follows
the bulk of the data much better than the non robust least squares
regression line.
There are various ways to find a robust or resistant line. The most typical
is to use the ideas of M-estimation and minimize:
$$\sum_i \rho(x_i - a - b c_i)$$
where $\rho(x)$ does not increase as fast as a quadratic.
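A sketch of such a fit using scipy's built-in Huber loss (simulated star data, not the Hipparcos sample):

```python
import numpy as np
from scipy.optimize import least_squares

def robust_line(bv, logl):
    """Fit logL = a + b*BV by M-estimation with a Huber-type loss,
    starting from the least squares solution."""
    bv, logl = np.asarray(bv, dtype=float), np.asarray(logl, dtype=float)
    slope0, intercept0 = np.polyfit(bv, logl, 1)
    fit = least_squares(lambda p: logl - (p[0] + p[1] * bv),
                        x0=[intercept0, slope0], loss='huber', f_scale=0.5)
    return fit.x                      # (intercept, slope)

rng = np.random.default_rng(4)
bv = rng.uniform(0.3, 1.6, 50)
logl = 1.5 - 2.0 * bv + rng.normal(0, 0.1, 50)
logl[:3] += 3.0                       # a few outlying stars (e.g. giants)
a, b = robust_line(bv, logl)
print(a, b)                           # close to the true (1.5, -2.0)
```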
The strength of the relationship between variables is generally
measured by correlation.
Next we look briefly at non robust and robust measures of
correlation or association.
Pearson product moment correlation is not robust.
Spearman’s rank correlation coefficient is simply the
Pearson coefficient with the data replaced by their ranks.
Spearman’s coefficient measures association or the
tendency of the two measurements to increase or
decrease together. Pearson’s measures the degree
to which the measurements cluster along a straight line.
For the example:
Pearson r = -.673
Spearman rs= -.743
Significance tests:
Pearson r: refer z to a standard normal distribution, where
$$z = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = -6.28$$
Spearman $r_s$: refer z to a standard normal distribution, where
$$z = \sqrt{n-1}\, r_s = -5.20$$
Kendall's tau coefficient is defined (with no ties) as
$$\tau = \frac{P - \left(\frac{n(n-1)}{2} - P\right)}{n(n-1)/2} = \frac{4P}{n(n-1)} - 1$$
where P is the number of concordant pairs out of
n(n-1)/2 total pairs.
For example (1, 3) and (2, 7) are concordant since
2>1 and 7>3. Note that Kendall’s tau estimates the
probability of concordance minus the probability of
discordance in the bivariate population.
For the example:
Kendall's tau = -0.63095
Significance test: refer z to a standard normal distribution, where
$$z = \frac{3\sqrt{n(n-1)}\,\tau}{\sqrt{2(2n+5)}} = -6.47$$
What more can we do?
1. Multiple regression
2. Analysis of designed experiments (AOV)
3. Analysis of covariance
4. Multivariate analysis
These analyses can be carried out using the website:
http://www.stat.wmich.edu/slab/RGLM/
Professor Lundquist, in a seminar on compulsive thinkers, illustrates his brain
stapling technique.
The End