Topic 1- Statistical Analysis

Why?
O The scientific method involves making
observations and collecting measurable
data.
O When measuring data from a sample, the
sample must be representative of the entire
population from which it is drawn.
O Statistics allows us to sample a small part
of a population and draw conclusions about
the larger population.
Why?
O It allows us to measure differences and
relationships between sets of data.
O All conclusions drawn from an experiment
have a certain level of confidence, but
nothing in science is 100% certain.
What is a representative
sample?
O A small group whose characteristics accurately
reflect those of the larger population from which
it is drawn.
O A representative sample is needed in order to
make more accurate generalizations of the
larger population
O Example: If approximately 15% of the United
States’ population is of Hispanic descent, a
sample of 100 Americans also ought to include
around 15 Hispanic people to be representative.
How do we get a representative
sample?
O Avoiding selection bias- sampling is not representative
when it relies on convenience sampling (using just
mpsj students), undercoverage (leaving a group of the
population out of the sample), judgement sampling (targeting
individuals you assume in advance fit a criterion) or
non-response (people choose not to complete the experiment)
O Larger sample sizes- make the sample more likely to
resemble the original population
O Random sampling- selecting individuals at random, e.g. from
random areas, at random times or with different methods
O This results in better data collection quality and
reduces experimenter bias and the placebo effect
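A minimal sketch of random sampling, assuming a hypothetical population of 500 numbered students (the population, the sample size of 50 and the helper name `draw_random_sample` are illustrative and not part of the slides):

```python
import random

# Hypothetical population: ID numbers for 500 students (illustrative only).
population = list(range(1, 501))

def draw_random_sample(items, sample_size, seed=None):
    """Return sample_size distinct items chosen uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(items, sample_size)

sample = draw_random_sample(population, 50, seed=1)
print(len(sample), "students selected")
```

Because every individual has the same chance of being picked, the experimenter's own preferences cannot steer who ends up in the sample.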
Reliable and valid data
O Reliability describes the overall consistency
of a measure. A measure has high
reliability if it produces similar results under
consistent conditions. For example, measurements of
people’s height and weight are often extremely
reliable.
O Validity is the extent to which a concept, conclusion
or measurement is well-founded and corresponds
accurately to the real world: “You are measuring what
you’re supposed to measure”
Range
O Measures the spread of data
O The difference between the largest and
smallest observed values
O If one data point is unusually large/small, it
has a great effect on the range and is called
an outlier (Outliers can often indicate an
error in the experiment and are often
eliminated).
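As a rough illustration of how strongly a single outlier affects the range, here is a short sketch. The plant heights are made-up values and `data_range` is a hypothetical helper name:

```python
def data_range(values):
    """Range = largest observed value minus smallest observed value."""
    return max(values) - min(values)

# Hypothetical plant heights in cm; the second list adds one unusually large value.
heights = [6.8, 7.1, 6.9, 7.3, 7.0]
heights_with_outlier = heights + [12.5]

print(round(data_range(heights), 1))
print(round(data_range(heights_with_outlier), 1))
```

One extra reading is enough to change the range by more than a factor of ten, which is why outliers are checked before the range is reported.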
Averages
O Averages are measures of the central tendency of the
data. There are three types:
O Mean- the sum of all the results divided by the
number of results
O Median- the middle value when the results are
arranged in order
O Mode- the value that appears the greatest
number of times
Example
O Find the mean, median and mode of the following data
set
O 1, 2, 2, 5, 6, 7, 11, 11, 11, 12
O Mean- $\frac{1+2+2+5+6+7+11+11+11+12}{10} = 6.8$
O Median- $\frac{6+7}{2} = 6.5$
O Mode- 11
O When no numbers repeat, there is no mode
O If the mean, median and mode are all approximately
the same, the data are roughly symmetric, which is
consistent with a normal distribution
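The worked example can be checked with Python's standard `statistics` module. This is only a sketch for verifying the arithmetic; the variable names are illustrative:

```python
import statistics

data = [1, 2, 2, 5, 6, 7, 11, 11, 11, 12]

mean_value = statistics.mean(data)      # sum of the results / number of results
median_value = statistics.median(data)  # average of the two middle values (6 and 7)
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)
```

Here the mean (6.8) and median (6.5) are close, but the mode (11) is not, so this data set would not be treated as normally distributed.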
Averages
O Averages do not tell us everything about a
sample.
O May not be representative of the entire
population
O Two samples of a population could be
different from one another. There is bound to be
natural variation
Sample 1: 5 round, 1 oval
Sample 2: 2 round, 3 oval
Standard Deviation
O Samples can be very uniform- bunched
around the mean or spread out a long way
from the mean
O The statistic that measures this spread is
called the standard deviation
Standard deviation
O A measure of how the individual data points
are distributed around the mean
O Allows us to compare the means and spread of
data between two or more samples
O Tells us how tightly the data points are
clustered around the mean and therefore
how many outliers there are in the data.
O When the data points are clustered, the SD
is very small and when spread apart the SD
is large
Standard deviation and error
bars
O A graphical representation of
variability
O Can be used to show range of
data or SD
O In design labs, students often
use their SD to represent
their error bars on their
graphs
O A large SD can indicate large
error or non-valid results
Example
O Calculate the SD of a sample- Four children
are aged 5; 6; 8 and 9.
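O Step 1: Find the mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
O $x_1 = 5$, $x_2 = 6$, $x_3 = 8$, $x_4 = 9$ and $N = 4$ (the population size)
O $\bar{x} = \frac{1}{4}(x_1 + x_2 + x_3 + x_4)$
O $\bar{x} = 7$
O Step 2: Find the SD $\sigma$:
O $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
O $\sigma = \sqrt{\frac{1}{4}\sum_{i=1}^{4}(x_i - 7)^2}$
O $\sigma = \sqrt{\frac{1}{4}\left[(5-7)^2 + (6-7)^2 + (8-7)^2 + (9-7)^2\right]}$
O $\sigma = 1.58$
O Therefore the average age of the children is
7 ± 1.58 years.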
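The same calculation can be sketched in Python. The example divides by N (a population SD). The sample SD, which divides by N - 1, is included only for comparison and is an addition not made on the slides:

```python
import math
import statistics

ages = [5, 6, 8, 9]

mean_age = statistics.mean(ages)

# Population SD: divide the summed squared deviations by N, as in the worked example.
squared_deviations = [(x - mean_age) ** 2 for x in ages]
sd_population = math.sqrt(sum(squared_deviations) / len(ages))

# Sample SD (divides by N - 1), shown only for comparison.
sd_sample = statistics.stdev(ages)

print(mean_age)
print(round(sd_population, 2))
print(round(sd_sample, 2))
```

The two versions give different answers for small data sets, so it is worth stating which one a calculator or spreadsheet function is using.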
Distribution
O Consider a population of bean plants with a mean
height of 7 cm
O Normal distribution- a spread of data that is
symmetrically distributed on either side of the mean
O A flat bell curve- data is widely spread
O A tall and narrow curve- data is very close to the
mean
O In a normal distribution, about 68% of all values lie within
+/- 1 SD of the mean and about 95% of all values lie
within +/- 2 SD of the mean
O As the shape of the bell curve changes, the SD
value changes so that +/- 1 SD still covers about 68% and
+/- 2 SD about 95% of the data set.
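The 68% and 95% figures can be checked numerically with `statistics.NormalDist` (Python 3.8 or later). This is a sketch for a standard normal curve; `fraction_within` is a hypothetical helper name:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within +/- k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(round(fraction_within(1) * 100))  # percent within +/- 1 SD
print(round(fraction_within(2) * 100))  # percent within +/- 2 SD
```

The same percentages hold for any mean and SD, because the calculation depends only on how many SDs a value lies from the mean.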
The t-test
O To assess whether the means of two groups
are statistically different from each other
O Used when you want to compare the means
of two groups
O Ex. Is there a statistical difference in the
mean height between a group of boys and
girls at the age of 12?
The t-test
• Notice that all three examples below have the same
difference between means
• Yet they all tell different stories. They all have different
variability.
• The two groups with low variability from their mean are
visibly most different from each other and the groups with
high variability are most similar to each other
T-test
O We can judge the difference between means
relative to their spread or variability using
the t-test
O The formula is a ratio:
$t\text{-value} = \frac{\text{difference between group means}}{\text{variability of groups}}$
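The formula
O $t = \frac{M_X - M_Y}{\sqrt{\frac{SD_X^2}{n_X} + \frac{SD_Y^2}{n_Y}}}$, where $M$ is a group mean, $SD^2$ its variance and $n$ its sample size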
Example
O Problem: Sam Sleepresearcher hypothesizes that
people who are allowed to sleep for only four hours
will score significantly lower than people who are
allowed to sleep for eight hours on a cognitive skills
test. He brings 16 participants into his sleep lab and
randomly assigns them to one of two groups (8 per group). In one
group he has participants sleep for eight hours and in
the other group he has them sleep for four. The
morning after he administers the SCAT (Sam's
Cognitive Ability Test) to the participants. (Scores on
the SCAT range from 1-9 with high scores
representing better performance).
SCAT scores
8 hours sleep group (X): 5 7 5 3 5 3 3 9
4 hours sleep group (Y): 8 1 4 6 6 4 1 2
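Step 1- calculate degrees of freedom
Df (paired t-test) = sample size - 1
Df (unpaired t-test) = n1 + n2 - 2
Step 2- calculate the t-value
a) Find the means for both groups and subtract
b) Calculate the variance (SD²) of each group
c) Divide each variance by its sample size and add the results
d) Take the square root of this sum (the denominator) and divide the difference in means by it
Mean X (8 hours) = 5
Mean Y (4 hours) = 4
Variance SD² (8 hours) = 4.571
Variance SD² (4 hours) = 6.571
n (8 hours) = 8 and n (4 hours) = 8
$t = \frac{5 - 4}{\sqrt{\frac{4.571}{8} + \frac{6.571}{8}}} = 0.847$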
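The steps above can be sketched in Python using only the standard library. The function name `unpaired_t` is illustrative, and the code assumes the two groups are independent (unpaired), as the random assignment in the problem suggests:

```python
import math
import statistics

scores_8h = [5, 7, 5, 3, 5, 3, 3, 9]  # group X
scores_4h = [8, 1, 4, 6, 6, 4, 1, 2]  # group Y

def unpaired_t(group_a, group_b):
    """t = difference between means / sqrt(var_a/n_a + var_b/n_b)."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    # statistics.variance is the sample variance (divides by n - 1)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / standard_error

t_value = unpaired_t(scores_8h, scores_4h)
degrees_of_freedom = len(scores_8h) + len(scores_4h) - 2  # unpaired: n1 + n2 - 2

print(round(t_value, 3))
print(degrees_of_freedom)
```

The group variances come out as 4.571 and 6.571, matching the values listed above, and the t-value rounds to 0.847.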
Step 3- use t-table
O Once the t-value is calculated you look it up in a
table of significance to see whether the ratio (t-value)
is large enough to say that the difference
between the groups is not likely due to chance
O Statisticians like to be 95% confident that their
conclusions are significant, so we use a risk
level (p-value) of p<0.05: a difference this large would arise
by chance less than 5% of the time, vs. p=0.1, where it would
arise by chance 10% of the time
O If p>0.05, the means are not
statistically different
O According to the t significance/probability table with
df = n1+n2-2 = 14 (unpaired groups), t must be at least 1.761 to be
significant at p=0.05 (one-tailed)
O Since our t=0.847, p>0.05 (it
falls at a lower confidence level,
between p=.25 and p=.1), so this difference is not
statistically significant
Correlation vs. Causation
O “Correlation does not imply causation”- a
correlation alone cannot be used to infer a causal
relationship; the causes
underlying the correlation may be indirect or
unknown
O Cause: a carefully designed experiment and its
evidence can determine that A causes B
O Correlation: observations, without a controlled
experiment, can only show that A and B are
related
Fallacy Examples
O Ice cream sales correlate with the number of
people who drown at sea. Therefore ice cream
causes people to drown.
O Children who sleep with a light on are more likely
to develop myopia (nearsightedness)
O Does light cause myopia?
O Atmospheric CO2 has been climbing in
conjunction with increased crime
O Does CO2 cause crime?
O A mathematical correlation test produces a
value r, which signifies the strength and direction of the
correlation between two variables
O r = +1: perfect positive correlation (as X increases so
does Y)
O r = 0: no correlation
O r = -1: perfect negative correlation (as X increases Y
decreases)
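The slides do not say which correlation test is meant. As an assumption, the sketch below uses the Pearson correlation coefficient, the most common choice, with made-up data that are perfectly positively and perfectly negatively related:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]                    # hypothetical X values
print(pearson_r(x, [2, 4, 6, 8, 10]))  # Y rises with X
print(pearson_r(x, [10, 8, 6, 4, 2]))  # Y falls as X rises
```

Real data rarely give exactly +1 or -1. Values closer to 0 indicate a weaker relationship, and even a strong r says nothing about which variable, if either, is the cause.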
Accuracy & Precision
O Accuracy: how close a measured value is to
the true value
O Precision: how close the measured values
are to each other
Errors and Uncertainties
O Examples:
O Human errors- can occur when tools or
instruments are used or read incorrectly. (E.g.
a thermometer reading must be taken after
stirring, with the bulb still in the liquid but not
touching the bottom)
O Systematic- the experimenter does not know how to
use the equipment or something is wrong with the
equipment.
O Random- unknown or unpredictable changes
Systematic
O Note that systematic and random errors refer to
problems associated with making measurements.
Mistakes made in the calculations or in reading the
instrument are not considered in error analysis. It is
assumed that the experimenters are careful and
competent! (Not acceptable in your design lab)
O Can be reduced if equipment is regularly checked or
calibrated to ensure proper function
O Procedural systematic errors are acceptable, i.e.
identifying a problem with your procedure/controls.
Random
O Random errors are statistical fluctuations (in
either direction) in the measured data due to the
precision limitations of the measurement device.
Random errors usually result from the
experimenter's inability to take the same
measurement in exactly the same way to get
exactly the same number.
O In biology this can be a result of changes in the
materials used or changes in conditions
O Controlled by carefully selecting material,
carefully controlling variables and repeating trials
Uncertainties & Significant
Figures
O Uncertainties – used in biology since they
are the best choice for quantitative lab work
O Sig Figs- are useful when doing calculations
from a textbook and you do not know the
accuracy of the measuring device.
O They are mutually exclusive systems…you
use one or the other!
Things to Remember
O When adding or subtracting, add the uncertainties
O When multiplying or dividing, convert to percent uncertainty,
then add the percent uncertainties
O If units are, for example, g/ml, convert back to
absolute uncertainty
O If units are percent change, convert back,
then multiply by 100 to get back to % units
O When taking an average, divide your summed uncertainty
by N
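A short sketch of the addition and division rules, using hypothetical readings (a mass of 10.0 +/- 0.1 g and a volume of 5.0 +/- 0.5 ml). The function names are illustrative only:

```python
def add_or_subtract(uncertainty_a, uncertainty_b):
    """Adding or subtracting values: absolute uncertainties add."""
    return uncertainty_a + uncertainty_b

def percent_uncertainty(value, uncertainty):
    """Absolute uncertainty expressed as a percentage of the value."""
    return uncertainty / value * 100

def divide(value_a, unc_a, value_b, unc_b):
    """Dividing values: percent uncertainties add, then convert back to absolute."""
    quotient = value_a / value_b
    total_percent = percent_uncertainty(value_a, unc_a) + percent_uncertainty(value_b, unc_b)
    return quotient, quotient * total_percent / 100

# Hypothetical readings: mass 10.0 +/- 0.1 g, volume 5.0 +/- 0.5 ml
density, density_unc = divide(10.0, 0.1, 5.0, 0.5)
print(round(density, 2), "+/-", round(density_unc, 2), "g/ml")
```

Here the mass contributes 1% and the volume 10%, so the density carries an 11% uncertainty, which is converted back to g/ml for reporting.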
The act of measuring
O When a measurement is taken, this can
affect the environment of the experiment.
O Ex. When a cold thermometer is used to
measure warm water. The thermometer may
cool the water
O Ex. The presence of the experimenter
influences the behaviour of the animal being
observed
Replicates and Samples
O Biological systems, because of their
complexity and variability, require replicate
observations and multiple samples of
material.
O In IB you can choose to do a 5X5 or a 2X10
O 5 changes to the independent variable
measured 5 times
O 2 changes to the independent variable
measured 10 times
Degrees of precision
O If it is digital then use the value of the least
known digit (e.g. the mass on the scale says
1.01g, then your uncertainty is +/- 0.01g)
O If it is analog, as in the case of a
thermometer, then use the least known digit
divided by 2
O Always include your degrees of precision for
every measuring device in your lab
(especially in your tables)
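The two rules above can be written as one small helper. This is a sketch: the function name `instrument_uncertainty` and the thermometer's 1 degree markings are assumptions made for illustration:

```python
def instrument_uncertainty(smallest_division, is_digital):
    """Digital: the least known digit. Analog: the least known digit divided by 2."""
    return smallest_division if is_digital else smallest_division / 2

balance_unc = instrument_uncertainty(0.01, is_digital=True)      # scale reading 1.01 g
thermometer_unc = instrument_uncertainty(1.0, is_digital=False)  # hypothetical 1 degree markings

print("balance: +/-", balance_unc, "g")
print("thermometer: +/-", thermometer_unc, "degrees C")
```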