What is Error? - University of California, Irvine


Introduction to Statistics
Statistic [stuh-tis-tik], noun: a numerical fact or datum, especially one computed from a sample.
How long does the ball take to fall?
• Measured values: see board
• How do we decide which of these measured values is correct?
• How do we discuss the variation in our measurements?
Mean
• Also known as the "average"
• Add all results, and divide by the number of measurements.
• Equation form:
$$\mu = \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Propagation of Uncertainty
Accuracy
• Sources of inaccuracy:
  • Broken measurement device
  • Parallax
  • Random error
  • ?
[Figure: low bias, high variability]
Precision
• Sources of imprecision:
  • Multiple measurement methods
  • Systematic error?
[Figure: high bias, low variability]
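Variance and Standard Deviation
• Squared deviation: how much variation is there from the mean?
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
• Variance, $s^2$: the mean squared deviation of the observations from the mean.
• Standard deviation: the square root of the variance, which measures the typical distance of observations from the mean.
$$\sigma \cong s = \sqrt{s^2}$$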
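As a minimal sketch of the definitions above, the following Python computes the mean, variance, and standard deviation with the divide-by-n convention used on this slide. The fall times are made-up values for illustration, not the measurements from the board, and the function names are placeholders.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def variance(values):
    """Mean squared deviation from the mean (divides by n, as on the slide)."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

# Hypothetical fall times in seconds (illustrative only).
fall_times = [0.61, 0.64, 0.59, 0.66, 0.60]
print(mean(fall_times), variance(fall_times), std_dev(fall_times))
```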
Error
• Error is the difference between the measured and expected value.
• Error is how we make sense of differences between two measurements that should be the same.
• Error is NOT mistakes! If you made a mistake, do the measurement again.
Types of Error Descriptions
For a population with true mean µ and standard deviation σ, the sample mean has an uncertainty equal to the standard deviation divided by the square root of the number of samples, N. This standard error gives a measure of the reliability of the mean.
$$\Delta \bar{x} = \frac{\sigma}{\sqrt{N}}$$
The sample standard error tells you how close your sample mean should be to the true mean.
$$\Delta \bar{x} \cong \frac{s}{\sqrt{N}}$$
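The sample standard error can be sketched in a few lines of Python. This is an illustration only: the function name is a placeholder, and it uses the divide-by-n standard deviation to match the variance formula given earlier in these slides.

```python
import math
import statistics

def standard_error(values):
    """Estimate the uncertainty of the mean as s / sqrt(N)."""
    s = statistics.pstdev(values)  # divide-by-n standard deviation, matching the slides
    return s / math.sqrt(len(values))

samples = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(standard_error(samples))
```

For these eight values s = 2, so the standard error is 2/√8, roughly 0.71.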
Using the Standard Error
This is the simplest way of using data to confirm or refute a hypothesis.

• Expected value inside $\bar{x} \pm \Delta\bar{x}$: confirmed
• Expected value outside $\bar{x} \pm \Delta\bar{x}$: not confirmed
This is also what is used to create the error bars.
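A rough sketch of this check in Python, assuming the expected value is a single number and the interval is the sample mean plus or minus one standard error. The function name and the expected values are hypothetical.

```python
import math
import statistics

def is_consistent(values, expected):
    """Return True when `expected` lies inside mean +/- standard error."""
    x_bar = statistics.fmean(values)
    dx = statistics.pstdev(values) / math.sqrt(len(values))
    return abs(x_bar - expected) <= dx

measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(is_consistent(measurements, expected=5.5))  # hypothetical expected value
print(is_consistent(measurements, expected=7.0))  # hypothetical expected value
```

Here the mean is 5 and the standard error is about 0.71, so 5.5 falls inside the interval and 7.0 falls outside it.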
Density Curve
Low values of the standard deviation indicate a small spread (values cluster close to the mean).
High values indicate a large spread (values lie far from the mean).
Normal Distribution
• Particularly important class of density curve
• Symmetric, unimodal, bell-shaped
• Mean, μ, is at the center of the curve
• Probabilities are the area under the curve
• Total area = 1
The Empirical Rule
In a normal distribution with mean μ and standard deviation σ:
• 68% of observations fall within 1σ of the mean
• 95% of observations fall within 2σ of the mean
• 99.7% of observations fall within 3σ of the mean
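These percentages can be checked numerically. The sketch below relies on the standard identity that, for a normal distribution, the fraction within k standard deviations of the mean is erf(k/√2); the function name is a placeholder.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```

The printed fractions are 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.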
Example with data
• Set of values: 2, 4, 4, 4, 5, 5, 7, 9
• Mean:
$$\bar{x} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = 5$$
• Standard deviation:
$$s = \sqrt{\frac{\sum (\text{mean} - \text{measurement})^2}{\#\ \text{samples}}}$$
$$= \sqrt{\frac{(2-5)^2 + (4-5)^2 + (4-5)^2 + (4-5)^2 + (5-5)^2 + (5-5)^2 + (7-5)^2 + (9-5)^2}{8}}$$
$$= \sqrt{\frac{3^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 2^2 + 4^2}{8}} = 2$$
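Data Distribution
[Figure: distribution of the example data; horizontal axis marked at 5 − 6, 5 − 4, 5 − 2, 5, 5 + 2, 5 + 4, 5 + 6]
Confidence Interval
[Figure: the same distribution; horizontal axis marked at 5 − 6, 5 − 4, 5 − 2, 5, 5 + 2, 5 + 4, 5 + 6]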
Central Limit Theorem
• If X follows a normal distribution with mean μ and standard deviation σ, then x̄ is also normally distributed with mean μ and standard deviation σ/√N.
• What if X is not normally distributed?
• When sampling from any population with mean μ and standard deviation σ, when N is large, the sampling distribution of x̄ is approximately normal:
$$P(\bar{x}) \propto e^{-\frac{(\bar{x} - \mu)^2 N}{2\sigma^2}}$$
• As the number of measurements increases, the distribution of their mean approaches a normal (Gaussian) distribution. With $\sigma_{\bar{x}} = \sigma / \sqrt{N}$:
$$P(\bar{x}) \propto e^{-\frac{(\bar{x} - \mu)^2}{2\sigma_{\bar{x}}^2}}$$
Visit this webpage to play with the numbers:
http://www.intuitor.com/statistics/CLAppClasses/CentLimApplet.htm
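The theorem can also be explored with a small simulation. The sketch below is an illustration, not part of the original slides: it averages rolls of a fair die (a flat, non-normal distribution) and compares the spread of those averages with σ/√N. The function name, seed, and sample sizes are arbitrary choices.

```python
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Average `n_per_sample` fair die rolls, repeated `n_samples` times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.randint(1, 6) for _ in range(n_per_sample)])
        for _ in range(n_samples)
    ]

means = sample_means(n_per_sample=30, n_samples=5000)
die_sd = statistics.pstdev([1, 2, 3, 4, 5, 6])  # spread of a single fair die roll

print("mean of sample means:", statistics.fmean(means))
print("spread of sample means:", statistics.pstdev(means))
print("sigma / sqrt(N):", die_sd / 30 ** 0.5)
```

The mean of the sample means should land near 3.5, and their spread should be close to σ/√N, even though a single die roll is far from normally distributed.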
Applications
 Simulated examples:
dice rolling, coin flipping, etc.
Exit polling
Non-normal Distributions
Central Limit Theorem Summary
• For a large number N of samples, the distribution of the mean values will be:
$$P(x) \propto e^{-x^2}$$
which is a normal distribution.
• The normal distribution from the CLT is independent of the type of distribution of the data.
Where else would this become
problematic?
Where can it still be used, but issues
should be considered?
Questions?
Effective Statistics
You might have strong association, but how do you
prove causation? (that x causes y?)
Good evidence for causation: a well designed
experiment where all other variables that cause
changes in the response variable are controlled
The Scientific/Statistic Process
1. Formulate a scientific question
2. Decide on the population you are interested in
3. Select a sample
4. Observational study or experiment?
5. Collect data
6. Analyze data
7. State your conclusion
Ways to collect information from a sample
• Anecdotal evidence
• Available data
• Observational study
• Experiment
Sampling and Inference
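[Diagram: sampling draws a sample (statistics x̄, s) from the population (parameters μ, σ); inference runs from the sample back to the population]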
Some Cautions
• Statistics cannot account for poor experimental design.
• There is no sharp border between “significant” and “non-significant” correlation, only increasing and decreasing evidence.
• Lack of significance may be due to a poorly designed experiment.
Fit Tests
t-test, z-test, and χ²-test
z-Test
• All normal distributions are the same if we standardize our data:
  • Units of size σ
  • Mean μ as center
• If x is an observation from a normal distribution, the standardized value of x is called the z-score.
• Z-scores tell how many standard deviations away from the mean an observation is.
z-test procedure
• To use: find the mean, standard deviation, and standard error.
• Use these statistics along with the observed value to find the z value.
• Consult the z-score table to find P(Z) for the determined z.
Equation for hypothesis testing:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Example
 Jacob scores 16 on the ACT. Emily scores 670 on the
SAT. Assuming that both tests measure scholastic
aptitude, who has the higher score? The SAT
scores for 1.4 million students in a recent
graduating class were roughly normal with a mean
of 1026 and standard deviation of 209. The ACT
scores for more than 1 million students in the same
class were roughly normal with mean of 20.8 and
standard deviation of 4.8.
Example Continued
Jacob (ACT): Score: 16; Mean: 20.8; Standard Dev.: 4.8
Emily (SAT): Score: 670; Mean: 1026; Standard Dev.: 209
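One way to work this comparison is to standardize each score. The sketch below assumes the single-observation form z = (x − μ)/σ, since each student contributes one score rather than a sample mean. The function name is a placeholder.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

jacob_z = z_score(16, mean=20.8, sd=4.8)    # ACT
emily_z = z_score(670, mean=1026, sd=209)   # SAT

print(f"Jacob: {jacob_z:.2f}, Emily: {emily_z:.2f}")
print("Higher standing:", "Jacob" if jacob_z > emily_z else "Emily")
```

Jacob sits about 1.0 standard deviations below the ACT mean and Emily about 1.7 below the SAT mean, so Jacob has the higher relative score.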
Interpreting Results
“Backwards” z-test
• What if we are given a probability, P(Z), and we are interested in finding the observed value corresponding to that probability?
• Find the z-score.
• Set up the probability (could be two-sided): $P(-z_0 < Z < z_0)$ = the given probability
• Convert the z-score to x by
$$x = z \times \left(\frac{\sigma}{\sqrt{n}}\right) + \mu$$
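Instead of reading a printed z-table, the reverse lookup can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper names and the values of μ, σ, and n below are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def z_for_two_sided(probability):
    """z0 such that P(-z0 < Z < z0) equals `probability` for a standard normal Z."""
    return NormalDist().inv_cdf((1 + probability) / 2)

def observed_value(z, mu, sigma, n):
    """Convert a z-score back to x using x = z * (sigma / sqrt(n)) + mu."""
    return z * (sigma / math.sqrt(n)) + mu

z0 = z_for_two_sided(0.95)
# mu, sigma, and n are hypothetical numbers chosen for illustration.
print(z0, observed_value(z0, mu=20.8, sigma=4.8, n=1))
```

For a two-sided probability of 0.95, z0 is close to the familiar 1.96.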
t Tests
Necessary assumptions for the t-test
1. Population is normally distributed.
2. Sample is randomly selected from the unknown population.
3. Standard deviation of the unknown population is the same as that of the known population.
So, we can take the sample standard deviation, s, as an estimate of the population standard deviation.
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
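A minimal sketch of the one-sample t statistic follows. It makes one assumption beyond the slides: s is computed with n − 1 in the denominator (`statistics.stdev`), the usual convention for t-tests, whereas the variance formula earlier in the slides divides by n. The data and the comparison value are made up.

```python
import math
import statistics

def t_statistic(values, mu):
    """One-sample t statistic: (sample mean - mu) / (s / sqrt(n))."""
    n = len(values)
    x_bar = statistics.fmean(values)
    s = statistics.stdev(values)  # n - 1 in the denominator (an assumption, see lead-in)
    return (x_bar - mu) / (s / math.sqrt(n))

lengths = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not real measurements
print(t_statistic(lengths, mu=4))   # mu=4 is a hypothetical comparison value
```

The resulting t value is then looked up in a t-table with n − 1 degrees of freedom.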
Probability that the fish are the same in both lakes
[Plot: "T Test Accumulating Data (N) Progressively". Vertical axis: probability that fish populations are the same average length in each lake (0 to 1). Horizontal axis: # of samples included in analysis from each lake (1 to 121).]
This is typical of the kind of data many of you may generate. Let’s take a quick look at how this t-test is calculated from the data, using Excel.
z versus t procedures
• Use z procedures if you know the population standard deviation.
• Use t procedures if you don’t know the population standard deviation.
• Usually we don’t know the population standard deviation, unless told otherwise.
• Central Limit Theorem
χ²-test (pronounced "kai")
χ²-test (Goodness-of-fit) User's Guide
• The χ²-test tells us whether distributions of categorical variables differ from one another.
• Can be used to determine whether your data conforms to a functional fit.
• Compares multiple means to multiple expected values.
• Can only be used when you have multiple data sets that cannot be combined into one mean.
• Use when comparing means to expected values.
χ²-test
• $X_i$ is each individual mean
• $\mu_i$ is each expected value
• $\Delta X_i$ = uncertainty in $X_i$
• d = # of mean values
• The χ²/d table gives the probability that the data matches the expected values.
• In χ²/d, d is the count of independent measurements.
$$\chi^2 = \sum_{i=1}^{d} \frac{(X_i - \mu_i)^2}{\Delta X_i^2}$$
χ² (Goodness-of-fit) Test Procedure
• Find averages and the uncertainty for each average.
• Calculate χ² using averages, uncertainties, and expected values.
• Count the number of independent variables.
• Use the table to find the probability of fit accuracy based on χ²/d and the number of independent variables (d).
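The calculation step of this procedure can be sketched as follows. The averages, expected values, and uncertainties are invented for illustration, and the function name is a placeholder. The final probability lookup is left to the χ²/d table, as the procedure describes.

```python
def chi_squared(means, expected, uncertainties):
    """Sum of squared differences between means and expected values, each scaled by its uncertainty."""
    return sum(
        ((x - mu) / dx) ** 2
        for x, mu, dx in zip(means, expected, uncertainties)
    )

# Hypothetical averages, expected values, and uncertainties (illustration only).
means = [1.0, 2.0, 3.0]
expected = [1.5, 2.0, 2.5]
uncertainties = [0.5, 0.5, 0.5]

chi2 = chi_squared(means, expected, uncertainties)
d = len(means)  # number of mean values being compared
print(chi2, chi2 / d)
```

A χ²/d near 1 means the averages differ from the expected values by about one uncertainty each, on average.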
Example
• Launch a bottle rocket with several different volumes of water.
• Measure height of flight multiple times for each volume.
• You decide you have a fit of:
$$y = 0.204\,V\ (\text{m/ml}) - 10^{-4} \times V^2\ (\text{m/ml}^2)$$
• Plot of fit with data on left.
Example
• 7 degrees of freedom
• Probability of fit ≈ 50%
• 50% of the time, chance alone could produce a larger χ² value.
• No reason to reject fit.
• This does not mean that other fits might not match the data better, so try other fits and see which one is closest.
Interpreting Results
• Probability is how similar the data is to the expected value.
• Large P means the data is similar to the expected value.
• Small P means the data is different from the expected value.
Summary
• Propagation of uncertainty
• Mean
• Accuracy vs. precision
• Error
• Standard deviation
• Central Limit Theorem
• Fit tests
• z-test
• t-test
• χ²-test