Statistics in Ecology - Nicholls State University

From Webster's Ninth New Collegiate Dictionary:
Statistics \stə-'tis-tiks\ n pl but sing or pl in constr [G Statistik study of political facts and figures, fr. NL statisticus of politics, fr. L status state] (1787) 1 : a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data 2 : a collection of quantitative data
Politics – Poly + Ticks
Poly – many
Ticks – blood-sucking parasites
Statistics – not sadistics!
Mais, I ain’t scared me!
-Cajun Neaux Fear
-Statistics are an acquired taste
-Basic knowledge is a must
-No substitution for experience
-Similar to learning a new language
[Word-cloud slide of statistical terms: least-squares, eigenvalue, R², orthogonal contrast, variance, split-plot, data, F-ratio, regression, ANCOVA, sub-sample, P-value, canonical correlation, t-test, standard deviation, population, covariates]
Testing of a Null Hypothesis
• Hypothesis Testing
– Null versus Alternative Hypothesis
• Simple example:
– Null: the two means are not different
– Alternative: the two means are different
• A test statistic, compared against a predetermined
probability level (usually 0.05), is used to reject or
fail to reject the null hypothesis
Population: a data set representing the
entire entity of interest
Sample: a data set representing a portion
of a population
[Diagram: a sample drawn from a population]
Population mean – the true mean for that population
-a single number
Sample mean – the estimated population mean
-a range of values (estimate ± 95% confidence
interval)
Population Distribution
[Figure: a normal frequency distribution of the number of individuals, and a frequency distribution of biomass (g) with the most common and least common values labeled.]
Sampling
• Usually impossible to measure entire
population
• We assume that our samples are
representative of the population we are
studying
• Therefore, we can make conclusions
about the population of interest based on
our sample results
Sample Distribution Approaches a Normal Distribution With Sample Size
[Figure, repeated over three slides: the sample frequency distribution drawn alongside the population distribution, converging on it as more units are sampled.]
Central Limit Theorem
-as stated in stats books:
• In random samples of units from almost any population, the probability distribution of the sample mean can be approximated by a normal distribution, and this approximation becomes better as the number of units sampled increases.
WHAT???!!!
In terms easier understood:
As our sample size increases, our
sample distribution approaches a
normal distribution.
Think about it…..
As our sample size increases, we sample
more and more of the population. Eventually,
we will have sampled the entire population
and our sample distribution will be the
population distribution
[Figure: sample distributions shown for increasing sample size.]
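A minimal numpy sketch (not part of the original slides) of the textbook statement above: sample means drawn from a deliberately non-normal population settle around the population mean, and their spread shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# A deliberately non-normal "population" (exponential, e.g. biomass-like values)
population = rng.exponential(scale=3.0, size=100_000)

for n in (2, 10, 50):
    # Draw 2,000 samples of size n and record each sample mean
    sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(2000)])
    print(f"n = {n:3d}: mean of sample means = {sample_means.mean():.2f}, "
          f"SD of sample means = {sample_means.std():.2f}")
```

Plotting a histogram of sample_means for each n shows the distribution becoming increasingly bell-shaped.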
Mean ± Confidence Interval
When a population is sampled, a mean value is
determined and serves as the point-estimate for that
population.
However, we cannot expect our estimate to be the
exact mean value for the population.
Instead of relying on a single point-estimate, we
estimate a range of values, centered around the
point-estimate, that probably includes the true
population mean.
That range of values is called the confidence interval.
Error Measurement
-a measure of sample variation

[Figure: individual weights plotted around the sample mean; each error is the deviation x - x̄.]

Mean = x̄ = (Σ xᵢ) / N
Average error = Σ(xᵢ - x̄) / N = zero, always
Variance = Σ(xᵢ - x̄)² / (N - 1)
Standard Deviation = √Variance
Worked example: sum of squares, variance, and standard deviation

Individual   Weight    Mean        W - M        (W - M)²
1            25        25.83333    -0.83333      0.694444
2            23        25.83333    -2.83333      8.027778
3            26        25.83333     0.16667      0.027778
4            26        25.83333     0.16667      0.027778
5            27        25.83333     1.16667      1.361111
6            28        25.83333     2.16667      4.694444
                               Sum ≈ 0      Sum of Squares = 14.83333

Variance = Σ(x - x̄)² / (N - 1) = 14.83333 / 5 = 2.966667
Standard Deviation = √2.966667 = 1.722401
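The same calculation in Python/numpy (a sketch added here, not from the original slides), using the six weights above:

```python
import numpy as np

weights = np.array([25, 23, 26, 26, 27, 28])    # the six individual weights above

mean = weights.mean()                           # 25.833
deviations = weights - mean                     # x - mean; these always sum to ~0
sum_of_squares = np.sum(deviations ** 2)        # 14.833
variance = sum_of_squares / (len(weights) - 1)  # sample variance (N - 1 denominator): 2.967
std_dev = np.sqrt(variance)                     # standard deviation: 1.722

print(mean, sum_of_squares, variance, std_dev)
```

np.var(weights, ddof=1) and np.std(weights, ddof=1) give the same variance and standard deviation directly.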
Confidence Interval

Confidence Interval: two numbers (a high and a low limit) computed from a sample that identify the range for an interval estimate of a parameter.

ȳ ± t(0.05) × (s / √n),  where s = standard deviation

Example: 132 ± 13.8, i.e. the interval 118.2 to 145.8

[Figure: the true population distribution and individual sample confidence intervals at P = 0.05. In the long run, 95% of all samples give an interval that covers the true mean μ. What happens if you change the P-value?]
Confidence Interval
• We assume the ‘true’ population mean is within
the confidence interval
• Size of the interval depends on the alpha level,
variation, and sample size
Data: 98, 94, 85, 92, 92, 84, 76, 75, 100
Mean = 88.44, SD = 9, N = 9

Smaller alpha → larger CI

alpha     CI (±)    Lower     Upper
0.1        4.93     83.51     93.37
0.05       5.88     82.56     94.32
0.01       7.73     80.71     96.17
0.001      9.87     78.57     98.31
Confidence Interval
• We assume the ‘true’ population mean is within
the confidence interval
• Size of the interval depends on the alpha level,
variation, and sample size
Mean = 88.44, N = 9, alpha = 0.05

Larger variation → larger CI

SD     CI (±)    Lower     Upper
2       1.10     87.34     89.54
9       4.93     83.51     93.37
20     10.97     77.47     99.41
40     21.93     66.51    110.37
Confidence Interval
• We assume the ‘true’ population mean is within
the confidence interval
• Size of the interval depends on the alpha level,
variation, and sample size
Mean = 88.44, SD = 9, alpha = 0.05

Larger sample size → smaller CI

N      CI (±)    Lower     Upper
2      10.47     77.97     98.91
9       4.93     83.51     93.37
20      3.31     85.13     91.75
40      2.34     86.10     90.78
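A short Python sketch (not from the slides) that reproduces the alpha table above; it assumes the slides used normal (z) critical values, which is what the tabled half-widths match. With small samples, a t critical value (stats.t.ppf(1 - alpha/2, df=n-1)) is generally preferred.

```python
import numpy as np
from scipy import stats

def half_width(sd, n, alpha):
    """Half-width of a normal-approximation confidence interval."""
    z = stats.norm.ppf(1 - alpha / 2)        # two-sided critical value
    return z * sd / np.sqrt(n)

mean, sd, n = 88.44, 9, 9
for alpha in (0.1, 0.05, 0.01, 0.001):
    hw = half_width(sd, n, alpha)
    print(f"alpha = {alpha}: {mean:.2f} ± {hw:.2f}  ({mean - hw:.2f}, {mean + hw:.2f})")
```

Changing sd or n in the call shows the other two patterns: larger variation widens the interval, larger sample size narrows it.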
Common to All Studies
• Design
– Treatment, Control, Replication,
Randomization
• Sampling
– Control of Variation, Replication,
Randomization
• Analysis
– Use of Proper Test
Need for a Control
• What exactly are you testing?
– Effect of reduced diet on growth
• Can you control everything outside
of that variable?
– Temperature, pH, photoperiod
• Controls give us a benchmark to
analyze treatment effects within the
framework of our design.
Need for Replication
• Replication
– Necessary for treatment comparisons
• (need a SD)
• Replication
– Increases the Power of your analysis
• Replication
– Unfortunately, replication is usually limited by money and/or logistics
Five Ponds Are Better Than Two
[Figure: a mean based on five pond values versus a mean based on only two.]
Power increases with repetition.
Power: Probability of correctly
rejecting the null hypothesis when
it is false.
When Repetition is Not Optimal
• Do what you can – but report your
specifics so that the study can be
taken for what it is.
– Junk in = junk out
Randomization
• Protects against the systematic
influence of unrecognized sources of
variation
• Removes bias (conscious or
otherwise) from the study
– Bias → a unidirectional shift in error
Single Population Size
• Boudreaux tells everyone that his
bass pond has bass that average 8
pounds
• His neighbor, Alphonse, doesn’t
believe him.
• Who is right?
Single Sample t-test
• Used to compare the mean of a sample to
a known number
• Assumes that subjects are randomly
drawn from a population and the
distribution of the mean being tested is
normal
• Basically, does the confidence interval
include the number of interest?
Simple as Creating a Confidence Interval

ID #    Weight
1        3.3
2        5.8
3        7.2
4        8.9
5        6.2
6        6.4
7        5.9
8        5.6
9        3.8
10       4.8

N = 10
Range = 3.3 – 8.9
Mean = 5.79
Variance = 2.599 (SD = 1.61)
t(0.05) = 1.82

ȳ ± t(0.05) × (s / √n)
5.79 ± (1.82)(1.61)/(3.16) = 5.79 ± 0.93
4.86 ≤ μ ≤ 6.72

8 is not included in the range → Boudreaux is wrong!
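The same question in Python/scipy (a sketch, not part of the original slides). The interval comes out slightly wider than the slide's because scipy uses the two-sided t critical value (about 2.26 for 9 df) rather than 1.82, but the conclusion is the same: 8 pounds falls outside the interval.

```python
import numpy as np
from scipy import stats

weights = np.array([3.3, 5.8, 7.2, 8.9, 6.2, 6.4, 5.9, 5.6, 3.8, 4.8])

# One-sample t-test of the mean against Boudreaux's claimed 8-pound average
t_stat, p_value = stats.ttest_1samp(weights, popmean=8.0)

# Two-sided 95% confidence interval for the mean
mean = weights.mean()
sem = stats.sem(weights)                                   # s / sqrt(n)
lo, hi = stats.t.interval(0.95, df=len(weights) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")   # p < 0.05: reject the claim
print(f"95% CI: ({lo:.2f}, {hi:.2f})")                             # 8 is not inside the interval
```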
Are Two Populations The Same?
• Boudreaux: ‘My pond is better than
yours, cher’!
• Alphonse: ‘Mais non! I’ve got much
bigger fish in my pond’!
• How can the truth be determined?
Two Sample t-test
• Simple comparison of a specific
attribute between two populations
• If the attributes between the two
populations are equal, then the
difference between the two should be
zero
• This is the underlying principle of a t-test
Two sample t-test versus
paired t-test
• A two sample t-test is used to compare
two separate entities
– For example, two different populations
• A paired t-test is used if successive
measurements are taken on the same
individual unit
– For example, a pond, tree, fish, batch, etc.
– The second sample is the same as the first
after some treatment has been applied
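A minimal scipy sketch of the two tests (not from the slides). Boudreaux's weights are the ten fish from the earlier slide; Alphonse's weights and the before/after pond measurements are made-up numbers for illustration only.

```python
import numpy as np
from scipy import stats

# Two separate ponds (independent samples) -> two-sample t-test
boudreaux = np.array([3.3, 5.8, 7.2, 8.9, 6.2, 6.4, 5.9, 5.6, 3.8, 4.8])
alphonse  = np.array([4.1, 6.0, 5.2, 7.1, 6.8, 5.5, 6.3, 7.4, 4.9, 5.7])   # hypothetical
t_two, p_two = stats.ttest_ind(boudreaux, alphonse)

# The same ponds measured before and after a treatment -> paired t-test
before = np.array([5.1, 6.2, 4.8, 7.0, 5.5])      # hypothetical
after  = np.array([5.9, 6.8, 5.3, 7.6, 6.1])      # hypothetical
t_pair, p_pair = stats.ttest_rel(before, after)   # tests whether the mean difference is zero

print(f"two-sample: t = {t_two:.2f}, p = {p_two:.4f}")
print(f"paired:     t = {t_pair:.2f}, p = {p_pair:.4f}")
```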
Analysis of Variance
• Compares means to determine whether the population distributions differ
• Uses means and confidence intervals
much like a t-test
• Test statistic used is called an F
statistic (F-test)
The SAS System

Obs    treat    size
 1       1       25
 2       1       22
 3       1       30
 4       1       26
 5       1       29
 6       2       15
 7       2       12
 8       2       22
 9       2       19
10       2       18
11       3       32
12       3       31
13       3       27
14       3       26
15       3       29

Data set: 3 treatments with 5 replicates per treatment
The SAS System

The ANOVA Procedure
Class Level Information

Class     Levels    Values
treat        3      1 2 3

Number of observations    15

(treat is the class variable name; 1, 2, and 3 are its level labels)
[Figure: two plots of the 15 observations. Total error is the deviation of each observation from the overall average; treatment error is the deviation of each observation from its treatment average.]
Total error: each observation compared with the overall average (24.2)

Treat    Value    Diff     Squared Diff
1         25       0.8        0.64
1         22      -2.2        4.84
1         30       5.8       33.64
1         26       1.8        3.24
1         29       4.8       23.04
2         15      -9.2       84.64
2         12     -12.2      148.84
2         22      -2.2        4.84
2         19      -5.2       27.04
2         18      -6.2       38.44
3         32       7.8       60.84
3         31       6.8       46.24
3         27       2.8        7.84
3         26       1.8        3.24
3         29       4.8       23.04
                   Total Sum of Squares = 510.4

Treatment error: each observation compared with its treatment average (26.4, 17.2, 29.0)

Treat    Value    Diff     Squared Diff
1         25      -1.4        1.96
1         22      -4.4       19.36
1         30       3.6       12.96
1         26      -0.4        0.16
1         29       2.6        6.76
2         15      -2.2        4.84
2         12      -5.2       27.04
2         22       4.8       23.04
2         19       1.8        3.24
2         18       0.8        0.64
3         32       3          9
3         31       2          4
3         27      -2          4
3         26      -3          9
3         29       0          0
                   Error Sum of Squares = 126
The SAS System

The ANOVA Procedure
Dependent Variable: size

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2     384.4000000      192.2000000      18.30    0.0002
Error              12     126.0000000       10.5000000
Corrected Total    14     510.4000000

R-Square    Coeff Var    Root MSE    size Mean
0.753135    13.38996     3.240370    24.20000

Source    DF    Anova SS       Mean Square    F Value    Pr > F
treat      2    384.4000000    192.2000000      18.30    0.0002

The F value is the model mean square divided by the error mean square (192.2 / 10.5 = 18.30), and Pr > F = 0.0002 is the P-value.
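The same one-way ANOVA in Python/scipy (a sketch, not part of the slides), which reproduces the SAS F value and P-value:

```python
from scipy import stats

# The three treatments from the data set above (5 replicates each)
treat1 = [25, 22, 30, 26, 29]
treat2 = [15, 12, 22, 19, 18]
treat3 = [32, 31, 27, 26, 29]

f_value, p_value = stats.f_oneway(treat1, treat2, treat3)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")   # F = 18.30, p = 0.0002
```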
The SAS System

The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for size

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

Alpha                                    0.05
Error Degrees of Freedom                   12
Error Mean Square                        10.5
Critical Value of Studentized Range   3.77278
Minimum Significant Difference         5.4673

Means with the same letter are not significantly different.

Tukey Grouping      Mean    N    treat
      A           29.000    5      3
      A           26.400    5      1
      B           17.200    5      2

Treatments 1 and 3 are not different from each other; both are different from treatment 2.
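A comparable post hoc test is available in Python (a sketch, not the slides' method; it requires SciPy 1.8 or newer, and statsmodels' pairwise_tukeyhsd is an alternative):

```python
from scipy import stats

treat1 = [25, 22, 30, 26, 29]
treat2 = [15, 12, 22, 19, 18]
treat3 = [32, 31, 27, 26, 29]

# Tukey's HSD pairwise comparisons among the three treatment means
result = stats.tukey_hsd(treat1, treat2, treat3)
print(result)   # treatments 1 and 3 are not different; both differ from treatment 2
```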
Showing Results
[Figure: the three treatment means plotted with their Tukey letters; treatments 1 and 3 share the letter A, treatment 2 is labeled B.]
Analysis of Variance – post hoc
Regression
• Measures the relationship between two variables and the strength of their correlation
• Regression models can be used for predictions (e.g., fertilizer – growth)
• Both significance and R² (the strength of the correlation) are considered
Regression
• For the purposes of this class:
– Does Y depend on X?
– Can Y be predicted from X?
• Y= mX + b
[Figure: empty plot with the dependent variable on the Y-axis and the independent variable on the X-axis.]
When analyzing a data set, the first step is to plot the
data:
X     Y
35    114
45    124
55    143
65    158
75    166

[Figure: scatterplot of Y (dependent variable) against X (independent variable).]

Clearly the trend is upward and linear.
The straight line is the sample regression of Y on X, and its position is fixed by two results:

[Figure: the fitted line Y = 1.38X + 65.1, passing through the point (X̄, Ȳ) = (55, 141), with slope = rise/run.]

• The regression line passes through the point (X̄, Ȳ).
• Its slope is "b" units of Y per unit of X, where b = the sample regression coefficient (slope).

[Figure: for each point, observed – predicted = residual error.]
For each X, we can predict Y:  ŷ = 1.38(X) + 65.1

x     observed y    predicted ŷ    deviation (y - ŷ)    squared deviation
35       114           113.4              0.6                 0.36
45       124           127.2             -3.2                10.24
55       143           141.0              2.0                 4.00
65       158           154.8              3.2                10.24
75       166           168.6             -2.6                 6.76
                                   sum =  0.0          sum = 31.60
The value 31.60 is termed the sum of squared deviations and is the basis for an estimate of the error in fitting the line. Regression analysis finds the line with the smallest sum of squares (the least-squares line).
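A scipy sketch (not from the slides) that reproduces the fitted line, the correlation, and the 31.60 sum of squared deviations:

```python
import numpy as np
from scipy import stats

x = np.array([35, 45, 55, 65, 75])
y = np.array([114, 124, 143, 158, 166])

fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.1f}")        # 1.38 and 65.1
print(f"r = {fit.rvalue:.3f}, r^2 = {fit.rvalue**2:.3f}, p = {fit.pvalue:.4f}")

predicted = fit.slope * x + fit.intercept
residual_ss = np.sum((y - predicted) ** 2)    # about 31.6, the sum of squared deviations
print(f"residual sum of squares = {residual_ss:.2f}")
```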
[Figure: model error, the distance between the regression line (predicted values) and the mean of Y.]

[Figure: total error, the distance between each observed point and the mean of Y.]
Correlation (r):
-another measure of the mutual linear
relationship between two variables.
• ‘r’ is a pure number without units or dimensions
• ‘r’ is always between –1 and 1
• Positive values indicate that y increases when x does
and negative values indicate that y decreases when x
increases.
– What does r = 0 mean?
• ‘r’ is a measure of intensity of association observed
between x and y.
– ‘r’ does not predict – only describes associations
between variables
[Figure: three scatterplots illustrating r > 0 (y increases with x), r < 0 (y decreases as x increases), and r = 0 (no linear association).]

r is also called Pearson's correlation coefficient.

If we square r (r²), we get the proportion of the variation in y that is explained by x (and get rid of the negative sign).
r² = Model Error / Total Error
• Model error = the summed squared distance between the regression line (predicted values) and the mean value
• Total error = the summed squared distance between each point (observed value) and the mean value
[Figure: model error, predicted value minus the mean, plotted at each X.]

[Figure: total error, observed value minus the mean, plotted at each X.]
R² – How well do the data fit?

[Figure: total error compared with model error for the fitted regression.]

R² = Model Error / Total Error
If the two are similar, R² approaches 1 (a good fit).
Not all significant models are good fits!
Length – Weight Data
[Figure: scatterplot of weight against length.]
The REG Procedure
Model: MODEL1
Dependent Variable: weight

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1       5.75495          5.75495       660.13    <.0001
Error              43       0.37487          0.00872
Corrected Total    44       6.12982

Root MSE          0.09337    R-Square    0.9388
Dependent Mean    1.26902    Adj R-Sq    0.9374
Coeff Var         7.35764

Weight = 0.07112(length) + (-2.28087)

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1        -2.28087              0.13887        -16.43     <.0001
length        1         0.07112              0.00277         25.69     <.0001
Regression – slope comparison
[Figure: total length (mm) versus otolith radius (µm), with separate regression lines for low-temperature and high-temperature fish; the slopes of the two lines are compared.]
Common Garden Experiment
• A common garden experiment can be used to separate the phenotypic (environmental) from the genotypic (genetic) components of variation
• Plants of the same species but growing in a diversity of
habitats are grown in the same environment.
– Any differences in phenotype can then be attributed to
genotype differences
• Common garden experiments are also used for animal
studies.
– Temperature tolerance, salinity tolerance, reproductive differences, etc.
Plantago maritima
• Marsh normal height: 30 – 40 cm
• Cliff normal height: 5 – 10 cm
• Each grown in a common garden:
Plantago maritima source    Mean height (cm) in the garden
Marsh population                      31.5
Cliff population                      20.7
Measuring Biodiversity
• The simplest measure of biodiversity is the
number of species – called species richness.
– Usually only count resident species, and not accidental
or temporary immigrants
• Another concept of species diversity is
heterogeneity:
                Species A    Species B
Community 1         99            1
Community 2         50           50
Heterogeneity is higher in a community where
there are more species and when the species are
more equally abundant.
Diversity Indices
• A mathematical measure of species
diversity in a community.
• Reveals important information
regarding rarity and commonness of
species in a community.
Shannon-Wiener Diversity Index (H)

H = -Σ pᵢ ln(pᵢ)     (H ranges from 0 to Hmax = ln(S))

• Variables associated with the Shannon-Wiener Diversity Index:
– S – total number of species in the community (richness)
– pᵢ – proportion of all individuals in the community that belong to the ith species
– EH – equitability (evenness; between 0 and 1) = H / Hmax
– Hmax = ln(S)

Species      #         pᵢ           ln(pᵢ)      (pᵢ)(ln pᵢ)
1            12      0.020583     -3.88328     -0.07993
2           562      0.963979     -0.03669     -0.03536
3             8      0.013722     -4.28875     -0.05885
4             1      0.001715     -6.36819     -0.01092
Total       583                            Sum = -0.18507

H = 0.18507
Hmax = ln(4) = 1.386294
E = 0.18507 / 1.386294 = 0.1335
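A small numpy sketch (not part of the slides) that reproduces the worked example:

```python
import numpy as np

counts = np.array([12, 562, 8, 1])       # individuals per species, from the table above

p = counts / counts.sum()                # p_i: proportion of all individuals in species i
H = -np.sum(p * np.log(p))               # Shannon-Wiener index, about 0.185
H_max = np.log(len(counts))              # ln(S) = ln(4), about 1.386
E = H / H_max                            # evenness, about 0.134

print(f"H = {H:.5f}, Hmax = {H_max:.5f}, E = {E:.4f}")
```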
Simpson's Index

D = Σ nᵢ(nᵢ - 1) / [N(N - 1)]     (ranges from 0 to 1)

• N = total number of individuals
• nᵢ = number of individuals of each species
• Simpson's Index of Diversity = 1 – D
– Ranges from 0 to 1
• Simpson's Reciprocal Index = 1 / D

Species      #      nᵢ(nᵢ - 1)
1           12         132
2           12         132
3           12         132
4           12         132
Total (N)   48         528

N(N - 1) = 2256

Simpson's Index                 D = 528 / 2256 = 0.23404
Simpson's Index of Diversity    1 - D = 0.76596
Simpson's Reciprocal Index      1 / D = 4.27273
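The same index in numpy (a sketch, not from the slides):

```python
import numpy as np

counts = np.array([12, 12, 12, 12])                 # individuals per species, from the table above

N = counts.sum()
D = np.sum(counts * (counts - 1)) / (N * (N - 1))   # Simpson's index, 528 / 2256

print(f"D = {D:.4f}")                   # 0.2340
print(f"1 - D = {1 - D:.4f}")           # Index of Diversity, 0.7660
print(f"1 / D = {1 / D:.4f}")           # Reciprocal Index, 4.2727
```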
Stats Used in Published
Studies
Percent Frequency
Paired t-test
[Figure: number of ants eaten by large yellowfin shiners at a first feeding versus a second feeding, shown for each of 20 replicates.]
Analysis of Variance
Analysis of Variance
ANOVA / Regression
Regression was also used on these data to show that the plasma nitrite level should reach 0 by day 4.
P-N = 137.2 – 33.96(D.O.E)
*R² for 15°C = 0.742 and for 5°C = 0.574
Simple Observation

Logistic Regression
[Figure: probability of larval fish presence (presence/absence, 0–1) plotted against dissolved oxygen (0–12); a significant relationship between larval fish presence and dissolved oxygen.]
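A hedged sketch of how such a logistic regression could be fit in Python. The presence/absence and dissolved-oxygen values below are entirely hypothetical; the published study's data are not given on the slide. statsmodels' Logit would additionally report a P-value for the relationship.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 1 = larval fish present, 0 = absent
dissolved_oxygen = np.array([0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float).reshape(-1, 1)
presence = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(dissolved_oxygen, presence)

# Predicted probability of presence across the oxygen range
new_do = np.linspace(0, 12, 7).reshape(-1, 1)
print(model.predict_proba(new_do)[:, 1])   # probability of presence rises with dissolved oxygen
```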
Analysis of Variance

Principal Components Analysis
Design an experiment to test the following:
• Cypress trees in South Carolina live longer than
cypress trees in Louisiana
• Cypress trees grow faster in warmer weather
• Fast growing cypress trees do not live as long as
slow growing cypress trees