Data and Error Treatment
and some Statistics
Data
• In science, we observe and measure things
• We call these observations/measurements
data (e.g. individual facts, statistics, or
items of information)
• Two different types of data
– Quantitative: data that is measured
• Quantities = numbers
– Qualitative: data that is observed
• Qualities = words/descriptions of characteristics
Quantitative vs. Qualitative Data
Accuracy vs. Precision
• Accuracy – How close a value
is to its expected or theoretical
value
• Precision – How
reproducible a result is
Error
• The deviation from the expected result is
called the error
• This is a common obstacle in science that
we seek to minimize
Introduction of Error
• Intrinsic - inherent variability
in the system being measured
(can be high for biological systems)
• Systematic - variability due
to the measurement process
• We have statistical methods to determine
how much error has occurred
Deviation
• The deviation tells us whether we are above or
below our target value or theoretical value
• Deviation = Experimental value – Theoretical value
where experimental value is the value you obtain
from your experiment and theoretical value is the
expected result from your experiment
Percent Error
• To determine the accuracy of a series of measurements,
scientists calculate the percent error (% Error)
• This will show how close your measurements came to
the expected or theoretical value
• % Error = (Experimental value – Theoretical value) / Theoretical value x 100%
• The lower the percent error (ignoring its sign), the more accurate
the measurements in the experiment
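The deviation and percent error formulas can be sketched in a few lines of Python. This is only an illustration: the function names and the measurement values (a measured 9.6 against an expected 9.8) are made up for the example and are not from the lab.

```python
def deviation(experimental, theoretical):
    """Signed deviation: negative means we came in below the expected value."""
    return experimental - theoretical


def percent_error(experimental, theoretical):
    """Signed percent error relative to the theoretical value."""
    return (experimental - theoretical) / theoretical * 100


# Hypothetical measurement: we measured 9.6 but expected 9.8
measured, expected = 9.6, 9.8
print(f"Deviation: {deviation(measured, expected):.2f}")
print(f"% Error:   {percent_error(measured, expected):.2f}%")
```

A negative result simply means the experimental value was below the theoretical value.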
What about when we have lots of
data like in our first lab?
Never fear...
STATISTICS will
save the day!
Statistics
• Statistics are SUMMARIES of data.
• Many studies generate large numbers of
data points. To make sense of all that data,
researchers use statistics that summarize
the data, providing a better understanding
of overall tendencies within the
distributions of scores.
Reasons for using statistics
• aid in summarization
• aid in understanding the trends in the data
• aid in extracting “information” from the
data
• aid in communication
Summary Statistics
• Measures of Central Tendency
– Arithmetic mean
– Median
– Mode
• Measures of Variation or Spread of
Distributions
– Range
– Standard deviation
Measures of Central Tendency
• Mode - the most frequently
occurring score
• Median - the value that
lies in the middle after
ranking all the scores
• Mean - the
average score
Measures of Central Tendency
Mean … the most frequently used but is
sensitive to extreme scores
e.g. 1 2 3 4 5 6 7 8 9 10
Mean = 5.5 (median = 5.5)
e.g. 1 2 3 4 5 6 7 8 9 20
Mean = 6.5 (median = 5.5)
e.g. 1 2 3 4 5 6 7 8 9 100
Mean = 14.5 (median = 5.5)
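The three examples above can be checked with Python's built-in statistics module. This is a small sketch; the variable names are only for illustration.

```python
from statistics import mean, median

# The three data sets from the slide: only the last score changes
datasets = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
]

for scores in datasets:
    # The mean is pulled toward the extreme score; the median stays put
    print(f"last score = {scores[-1]:>3}  mean = {mean(scores)}  median = {median(scores)}")
```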
Measures of Central Tendency
Median
… is not sensitive to extreme
scores
… use it when you are unable
to use the mean because
of extreme scores
Measures of Central Tendency
Mode
… does not involve
any calculation or
ordering of data
… use it when you
have categories
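As a quick illustration of using the mode with categories, here is a sketch with Python's statistics module. The eye color and blood type data are invented for the example.

```python
from statistics import mode, multimode

# Hypothetical categorical data: a mean or median makes no sense here
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]
print(mode(eye_colors))  # the most frequently occurring category

# If two categories tie, multimode (Python 3.8+) returns all of them
blood_types = ["A", "A", "B", "B", "O"]
print(multimode(blood_types))
```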
Another tool - Histogram
• A Histogram is a variation of a bar chart in
which data values are grouped together
and put into different classes.
• This grouping allows you to see how
frequently data in each class occur in the
data set.
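The grouping step behind a histogram can be sketched as follows. The scores and the class width of 10 are made-up values, and the histogram function is a hypothetical helper written for this example, not a library routine.

```python
from collections import Counter


def histogram(values, class_width):
    """Count how many values fall in each class [start, start + class_width)."""
    counts = Counter((v // class_width) * class_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical exam scores, grouped into classes 40-49, 50-59, ...
scores = [52, 67, 71, 58, 63, 75, 88, 63, 49, 91, 66, 70]
for start, count in histogram(scores, 10).items():
    print(f"{start}-{start + 9}: {'#' * count}")
```

Each row of # marks is one bar of the histogram, turned on its side.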
A Distribution Curve
Mean: 54
Median: 56
Mode: 63
The Normal Distribution Curve
"Bell" Curve
In everyday life many variables such as
height, weight, shoe size and exam
marks all tend to be normally
distributed, that is, they all tend to
look like the following curve.
The Normal Distribution Curve
[Figure: normal distribution curve with the mean, median, and mode marked at the peak; x-axis from 0 to 100, y-axis from 0 to 0.025]
• It is bell-shaped and symmetrical about the mean
• The mean, median and mode are equal
• It is a function of the mean and the standard deviation
Variation or Spread of Distributions
Measures that indicate the spread of
scores:
• Range
• Standard Deviation
Variation or Spread of Distributions
Range
• Compares the minimum score with the
maximum score
• Max score – Min score = Range
• Gives only a rough indication of the spread of the
scores because it does not tell us much
about the shape of the distribution or how
much the scores vary from the mean
Variation or Spread of Distributions
Standard Deviation
• Tells us what is happening between the
minimum and maximum scores
• Tells us how much the scores in the
data set vary around the mean
• Useful when we need to compare
groups using the same scale
Standard Deviations in a Normal Distribution
So how do you calculate
a Standard Deviation?
Short answer: Let Excel
do it for you
FYI Long answer: Calculating Standard
Deviation (from scratch, not using Excel)
1. Compute the mean, X̄, for the data
set.
2. Compute the deviation by
subtracting the mean, X̄, from
each value, Xi.
3. Square each individual
deviation, (Xi – X̄)².
4. Add up the squared deviations.
5. Divide by one less than the
sample size.
6. Take the square root.
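The six steps above can be followed line by line in Python. This is a minimal sketch: the function name and the example scores are invented, and the result is compared against the standard library's stdev, which also divides by one less than the sample size.

```python
import math
from statistics import stdev


def sample_std_dev(values):
    n = len(values)
    mean = sum(values) / n                    # 1. compute the mean
    deviations = [x - mean for x in values]   # 2. subtract the mean from each value
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. add up the squared deviations
    variance = total / (n - 1)                # 5. divide by one less than the sample size
    return math.sqrt(variance)                # 6. take the square root


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(round(sample_std_dev(scores), 3))
print(round(stdev(scores), 3))  # library result for comparison
```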
Let's see how all this relates to a
real example
A flighty example
• Suppose a biologist wants to
determine the home range of a
particular species of bird.
– From all members of the population
of this species, the biologist collects
a sample of 100 birds and attaches
a radio transmitter to the leg of
each bird.
– After monitoring the position of the
birds for 1 month, the biologist finds
that the mean home range of the
sample is 23.1 km.
• Is this the exact value of the
home range of the population?
Point Estimation
• The question we can ask is, “Well how close is
23.1 km to the mean home range for the
population?”
– This value (23.1) is called the point estimate, it is the
value we will use to estimate the population value
– This is the “Best Guess” of the value of the population
parameter, given your sample information
• The way we will answer this question is to
generate a range of values for which we can be
reasonably confident that the population value
falls
– This process is called interval estimation
– The resulting range of values is called a
confidence interval.
Returning to our flighty example…
• The biologist has a point estimate of 23.1 km, and can
guess that the true population mean is probably
between 20 and 26.
• While the biologist may be confident that the true mean
is between 20 and 26, she may be even more
confident with a range of 15 to 30.
• Thus, the wider the interval estimate, the greater the
confidence that we have that the interval contains the
population mean.
Confidence Intervals
• Now, it is possible and preferable to be more quantitative
about how we go about determining the interval (rather
than just guessing).
• So, a confidence interval provides a range of numbers
along with the percentage confidence that the parameter
lies within that range
– A 95% confidence interval means that 95% of similarly
constructed intervals will contain the population parameter
– Note also that although there are many different CIs that we
could construct, in practice the 90%, 95%, and 99% CI are used
most often.
Computing Confidence
Intervals: Sampling Error
• Realize that whenever we use
a sample to estimate a
population characteristic, we
are going to have some
amount of sampling error.
– Sometimes, we will
overestimate the true value
– Sometimes, we will
underestimate the true value
True Value
Standard Error
• But, we also know (from the central limit theorem) that if our
sample size is sufficiently large (n > 30), the sampling
distribution of the sample mean will be approximately
normal.
– Thus, about 95% of the sample means will fall within ± 2
standard errors (the standard deviation of the sample means)
of the population mean
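This 95% claim can be checked with a small simulation. The sketch below assumes a normally distributed population with a mean of 23.1 and a standard deviation of 4.7 (values borrowed from the bird example purely for illustration), takes the standard error to be the population standard deviation divided by the square root of the sample size, and uses a hypothetical helper, fraction_within.

```python
import math
import random


def fraction_within(pop_mean, pop_sd, n, trials, z=1.96, seed=1):
    """Fraction of sample means that land within z standard errors of pop_mean."""
    rng = random.Random(seed)     # fixed seed so the run is repeatable
    se = pop_sd / math.sqrt(n)    # standard error of the sample mean
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(pop_mean, pop_sd) for _ in range(n)) / n
        if abs(sample_mean - pop_mean) <= z * se:
            hits += 1
    return hits / trials


coverage = fraction_within(23.1, 4.7, n=100, trials=2000)
print(f"{coverage:.1%} of sample means fell within 1.96 standard errors")
```

The printed fraction should come out close to 95%, though it will wobble a little from one seed to the next.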
What does this mean?
• When we collect a sample of data, we can
be reasonably certain that the true
population value falls within 2 standard
errors (plus or minus) of our sample
mean!
– Bottom line: If you take the sample mean, X̄, and add and
subtract about 2 standard errors,
you get the 95% confidence interval
Elements of a Confidence
Interval
[Figure: a confidence interval running from the lower confidence limit to the upper confidence limit, with the sample statistic (point estimate) in between]
Common Levels of Confidence
• Commonly used confidence levels are
90%, 95%, and 99%
Confidence Depends on Interval (z)
• Standard Error (SE): σx̄ = σ / √n
• Confidence interval: X̄ ± z × σx̄
– 90% CI: X̄ ± 1.65 σx̄
– 95% CI: X̄ ± 1.96 σx̄
– 99% CI: X̄ ± 2.575 σx̄
• Lower 95% confidence limit = Mean – 1.96 x SE
• Upper 95% confidence limit = Mean + 1.96 x SE
Let’s return to our flighty example
• Our biologist found a point estimate of
23.1 km for the mean range size
• Suppose the biologist also knows that the
standard deviation for the range of this bird
population is 4.7 km
• Recall that the sample size was 100 birds
• To find the 95% CI,
– 23.1 ± 1.96 (4.7/√100)
– 23.1 ± 1.96 (0.47)
– 23.1 ± 0.9212
– 95% CI is 22.18 to 24.02 km
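The same calculation can be sketched in Python, using the z values quoted earlier for the 90%, 95%, and 99% levels. The function name confidence_interval is just a label chosen for this example.

```python
import math

# z values for the common confidence levels, as given on the earlier slide
Z = {90: 1.65, 95: 1.96, 99: 2.575}


def confidence_interval(mean, sd, n, level=95):
    """Return (lower limit, upper limit) as mean -/+ z x SE."""
    se = sd / math.sqrt(n)   # standard error = SD / sqrt(sample size)
    margin = Z[level] * se
    return mean - margin, mean + margin


# The bird example: mean 23.1 km, SD 4.7 km, 100 birds
for level in (90, 95, 99):
    low, high = confidence_interval(23.1, 4.7, 100, level)
    print(f"{level}% CI: {low:.2f} to {high:.2f} km")
```

Notice that the interval gets wider as the confidence level goes up, which matches the earlier point about wider intervals giving greater confidence.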
Now for a little on Correlations,
a way to show relationships
between experimental variables
Types of Correlation
Positive correlation
Negative correlation
No correlation
Simple linear regression describes the
linear relationship between a predictor
variable, plotted on the x-axis, and a
response variable, plotted on the y-axis
Regression line is
calculated to give a
minimum for the sum of the
squared vertical differences between
the points and the line
Dependent Variable (Y)
Values measured in
response to changes
in the independent
variable
Independent Variable (X)
Values chosen by the
experimenter
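A regression line can be fitted by hand with the usual least-squares formulas, which minimize the sum of the squared vertical distances between the points and the line. The sketch below is illustrative only: fit_line is a hypothetical helper, and the dose and response numbers are invented.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: x is chosen by the experimenter, y is measured in response
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(dose, response)
print(f"y = {intercept:.2f} + {slope:.2f} x")
```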
How well does your regression equation
truly represent your set of data?
• One of the ways to determine the answer to
this question is to examine the correlation
coefficient, r
• Simple Linear Correlation (Pearson r)
– Shows how closely two variables follow a linear (proportional) relationship
– -1 ≤ r ≤ 1; the closer r is to 1 or -1, the stronger the correlation
Several sets of (x, y) points, with the correlation
coefficient, r, of x and y for each set.
Here, r reflects the noisiness and direction of a linear relationship
Here, r does NOT reflect the slope of the linear relationship
Here, r does NOT reflect nonlinear relationships
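These three points can be tried out with a hand-written Pearson r. This is a sketch with invented data sets; pearson_r is a hypothetical helper written for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))       # perfect positive line
print(pearson_r(x, [10, 8, 6, 4, 2]))       # perfect negative line
print(pearson_r(x, [20, 40, 60, 80, 100]))  # much steeper line, same r as the first
print(pearson_r(x, [4, 1, 0, 1, 4]))        # U-shaped curve: clear pattern, but not linear
```

The first and third sets give the same r even though the slopes differ by a factor of ten, and the U-shaped set gives an r of zero despite an obvious relationship.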
Outliers
How well does your regression equation
truly represent your set of data?
• Another way is to examine the coefficient of
determination, r², i.e. “strength” of relationship
• Coefficient of Determination (r²)
– tells how much of the variability of one factor can be
explained by its relationship to another factor (this alone does not show causation)
– 0 ≤ r² ≤ 1
the higher the value, the better the fit
– For example, an r² of 0.70 implies that 70% of the
variation in y is accounted for by its relationship to x.
– As a rough rule of thumb, an r² of 0.7 or higher is often taken to show
a reasonable model for a relationship between variables.
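One way to compute r² is to compare the scatter left over around the fitted line with the total scatter around the mean of y. The sketch below assumes a least-squares line and uses invented dose and response values; r_squared is a hypothetical helper.

```python
def r_squared(xs, ys):
    """Share of the variation in y accounted for by a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Hypothetical data
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"r squared = {r_squared(dose, response):.4f}")
```

For these nearly linear made-up points almost all of the variation in y is accounted for by x, so the value is close to 1.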
Outliers
Usually exclude data that is more
than 2 SD from the trend line
Analysis without
the outliers
Outlier along the
trend line
Outlier
perpendicular to
the trend line