Transcript summary - vscht.cz

SUMMARY
π‘₯ βˆ’ πœ‡0
𝑑= 𝑠
𝑛
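A minimal R sketch of this statistic; the sample vector x and the hypothesized mean mu0 below are made up for illustration:

x <- c(4.8, 5.2, 5.9, 4.6, 5.5, 5.1, 5.7, 4.9)         # hypothetical sample
mu0 <- 5                                               # hypothesized population mean
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))  # t = (x̄ − μ0) / (s/√n)
t.test(x, mu = mu0)                                    # built-in test: same statistic plus a p-value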
Two-sample t-test

$$t \approx \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$$

• The numerator is the difference between the means, i.e. the variability between samples.
• The denominator is the variability within samples.
• This is not an exact formula! It just demonstrates the main ingredients.
Two-sample t-test

$$t \approx \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$$

• The numerator indicates how much the means differ.
• This is an explained variation because it most likely results from the differences due to the treatment, or just due to the differences in the populations (recall beer prices: different brands are differently expensive).
• The denominator is a measure of error. It measures the individual differences of subjects.
• This is considered an error variation because we don't know why individual subjects in the same group are different.
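As a sketch, the same comparison in R on two invented groups; t.test() defaults to the Welch two-sample test, which estimates the denominator for us:

x1 <- c(31, 28, 35, 30, 33)  # hypothetical group 1
x2 <- c(25, 27, 24, 29, 26)  # hypothetical group 2
t.test(x1, x2)               # numerator: mean(x1) − mean(x2); denominator: within-sample variability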
[Figure: three samples shown twice, illustrating explained (between-group) variation vs. error (within-group) variation]
ANOVA
• Compare as many means as you want with just one test.
• Recall how the sample variance is built from a sum of squares and degrees of freedom:

$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}} \;\Longrightarrow\; s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} = \frac{SS}{df}$$

• The two-sample statistic $t \approx \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$ generalizes to the F statistic, the ratio of between-group to within-group variability:

$$F_{df_B,\,df_W} = \frac{MS_B}{MS_W}$$

$$MS_W = \frac{SS_W}{df_W} = \frac{\sum_k \sum_i (x_i - \bar{x}_k)^2}{N-k}, \qquad MS_B = \frac{SS_B}{df_B} = \frac{\sum_k n_k (\bar{x}_k - \bar{x}_G)^2}{k-1}$$

$$df_B = k - 1, \qquad df_W = N - k$$

• Here $\bar{x}_k$ is the mean of group $k$, $\bar{x}_G$ the grand mean, $k$ the number of groups, and $N$ the total number of observations.
Total variability
• What is the total number of degrees of freedom?
• $df_B + df_W = (k - 1) + (N - k) = N - 1 = df_{Total}$
• Likewise, we have a total variation:

$$SS_T = SS_B + SS_W = \sum_i (x_i - \bar{x}_G)^2$$

• Note that sums of squares and degrees of freedom add up, but mean squares do not: in general $MS_T \neq MS_B + MS_W$.
Hypothesis
• Let's compare three samples with ANOVA. Just try to guess what the hypotheses will be:
• $H_0: \mu_1 = \mu_2 = \mu_3$, $H_1: \mu_1 \neq \mu_2 \neq \mu_3$?
• $H_0: \mu_1 \neq \mu_2 \neq \mu_3$, $H_1: \mu_1 = \mu_2 \neq \mu_3$?
• $H_0: \mu_1 = \mu_2 = \mu_3$, $H_1$: at least one pair of samples is significantly different (the correct formulation)
• Follow-up multiple comparison steps – see which means are different from each other.
Multiple comparisons problem
• And there is another, more serious, problem with many t-tests. It is called the multiple comparisons problem.
http://www.graphpad.com/guides/prism/6/statistics/index.htm?beware_of_multiple_comparisons.htm
NEW STUFF
Post hoc tests
• The F-test in ANOVA is a so-called omnibus test. It tests the means globally; it says nothing about which particular means are different.
• To find out which, we use post hoc tests (multiple comparison tests).
• Tukey Honestly Significant Differences:

TukeyHSD(fit) # where fit comes from aov()
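For example, a minimal sketch with an invented data frame; the name beer_brands mirrors the beer-price example used elsewhere in these slides, but the numbers are made up:

beer_brands <- data.frame(
  Price = c(25, 27, 26, 30, 32, 31, 40, 41, 43),  # hypothetical prices
  Brand = factor(rep(c("A", "B", "C"), each = 3))
)
fit <- aov(Price ~ Brand, data = beer_brands)
summary(fit)   # omnibus F-test: do the means differ globally?
TukeyHSD(fit)  # post hoc: which pairs of brands differ?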
ANOVA assumptions
• normality – all populations the samples come from are normal
• homogeneity of variance – the variances are equal
• independence of observations – the results found in one sample won't affect the others
• The most influential is the independence assumption. Otherwise, ANOVA is relatively robust.
• We can sometimes violate:
• normality – with a large sample size
• variance homogeneity – with equal sample sizes, provided the ratio of any two variances does not exceed four
ANOVA kinds
• one-way ANOVA (single-factor analysis of variance)

aov(beer_brands$Price ~ beer_brands$Brand) # Price: dependent variable, Brand: independent variable

• two-way ANOVA (two-factor analysis of variance)
• Example: engagement ratio; measure two educational methods (with and without a song) for men and women independently.
• aov(engagement ~ method + sex)
• interactions between factors (see the sketch below)
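A minimal sketch of the two-way case; all data are invented, and the * operator adds the method:sex interaction on top of the main effects:

engagement <- c(6.1, 5.8, 7.2, 7.0, 4.9, 5.2, 6.0, 6.3)  # hypothetical engagement ratios
method <- factor(rep(c("song", "no_song"), each = 4))
sex    <- factor(rep(c("M", "F"), times = 4))
summary(aov(engagement ~ method + sex))  # main effects only
summary(aov(engagement ~ method * sex))  # main effects + interaction method:sex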
CORRELATION
Introduction
• Up to this point we've been working with only one variable.
• Now we are going to focus on two variables.
• Two variables that are probably related. Can you think of some examples?
• weight and height
• time spent studying and your grade
• temperature outside and ankle injuries
Car data

Miles on a car    Value of the car
60 000            $12 000
80 000            $10 000
90 000            $9 000
100 000           $7 500
120 000           $6 000
• x – predictor, explanatory, independent variable
• What do you think y is called? Think of the opposites of the names for x:
• outcome
• determiner
• response
• stand-alone
• dependent
Car data (the same table as above)
• How may we show that these variables have a relationship?
• Tell me some of your ideas.
• scatterplot
Scatterplot

[Figure: scatterplot of the car data]

Stronger relationship?

[Figure: two scatterplots for comparison; which shows the stronger relationship?]
Correlation
• Relation between two variables = correlation
• strong relationship = strong correlation, high correlation
Match these
strong positive
strong negative
weak positive
weak negative
Correlation coefficient
• r (Pearson's r) – a number that quantifies the relationship:

$$r = r_{xy} = \frac{cov(X,Y)}{s_X \cdot s_Y}$$

• $cov(X,Y)$ … covariance of X and Y. A statistic for how much X and Y co-vary, in other words, how much they vary together.
• $s_X$, $s_Y$ … standard deviations of X and Y. They describe how the variables vary apart from each other, rather than with each other.
• $r$ measures the strength of the relationship by looking at how closely the data fall along a straight line.
Covariance
π‘π‘œπ‘£ 𝑋, π‘Œ =
π‘₯𝑖 βˆ’ π‘₯ 𝑦𝑖 βˆ’ 𝑦
π‘›βˆ’1
divide by n-1 for sample but by n for population
1
π‘Ÿ=
π‘›βˆ’1
π‘₯𝑖 βˆ’ π‘₯ 𝑦𝑖 βˆ’ 𝑦
𝑠𝑋 π‘ π‘Œ
β€’ Watch explanation video.
http://www.youtube.com/watch?v=35NWFr53cgA
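To see these formulas at work, a small R sketch on invented pairs; the built-in cov() and cor() use the sample (n − 1) versions:

x <- c(1, 2, 3, 4, 5)                                  # hypothetical data
y <- c(2, 4, 5, 4, 5)
n <- length(x)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1) # sample covariance by hand
r <- cov_xy / (sd(x) * sd(y))                          # Pearson's r by hand
c(cov_xy, cov(x, y))                                   # manual vs. built-in: identical
c(r, cor(x, y))                                        # identical as well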
Coefficient of determination
• Coefficient of determination – $r^2$ is the percentage of variation in Y explained by the variation in X.
• Percentage of variance in one variable that is accounted for by the variance in the other variable.

[Figure: scatterplots with r² = 0, r² = 0.25, and r² = 0.81, from http://www.sagepub.com/upm-data/11894_Chapter_5.pdf]

[Exercise: match scatterplots with r = +1, −1, +0.14, +0.93, −0.73]
• If X is age in years and Y is age in months, what will the correlation coefficient be?
• +1.0
• X is hours you're awake a day, Y is hours you're asleep a day.
• −1.0
Crickets
• Find a cricket, count the number of its chirps in 15 seconds, add 37, and you have just approximated the outside temperature in degrees Fahrenheit.
• National Weather Service Forecast Office:
http://www.srh.noaa.gov/epz/?n=wxcalc_cricketconvert
chirps in 15 sec    temperature (°F)
18                  57
20                  60
21                  64
23                  65
27                  68
30                  71
34                  74
39                  77
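Using the table above, a quick R check of how strong this relationship is:

chirps <- c(18, 20, 21, 23, 27, 30, 34, 39)  # chirps in 15 sec
temp   <- c(57, 60, 64, 65, 68, 71, 74, 77)  # temperature in °F
cor(chirps, temp)                            # about 0.98, very close to +1
plot(chirps, temp)                           # scatterplot: points fall nearly on a line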
Hypothesis testing
• Even when two variables describing a sample of data seem to have a relationship, this could be just due to chance. The situation in the population may be different.
• $r$ … sample correlation coefficient, $\rho$ … population correlation coefficient
• What will the hypotheses look like?
A: $H_0: r = 0$; $H_A: r < 0$, $r > 0$, or $r \neq 0$
B: $H_0: \rho = 0$; $H_A: \rho < 0$, $\rho > 0$, or $\rho \neq 0$
C: $H_0: r < 0$, $r > 0$, or $r \neq 0$; $H_A: r = 0$
D: $H_0: \rho < 0$, $\rho > 0$, or $\rho \neq 0$; $H_A: \rho = 0$

• The correct choice is B: hypotheses are statements about the population, so they use $\rho$, and the null hypothesis is $\rho = 0$.
Hypothesis testing
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \qquad \text{with } df = n - 2$$

• The test statistic has a t-distribution.
• Example: we are measuring the relationship between two variables, we have 25 participants, and we get the t-statistic = 2.71. Is there a significant relationship between X and Y?
• $\alpha = 0.05$, non-directional test, $t_{crit} = 2.069$. Since 2.71 > 2.069, we reject the null hypothesis: the relationship is significant.
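Checking the example numerically in R; n, the t-statistic, and α come from the slide, and qt() gives the two-tailed critical value:

n <- 25; t_stat <- 2.71
df <- n - 2                             # df = n − 2 = 23
qt(0.975, df)                           # t_crit ≈ 2.069 for α = 0.05, two-tailed
2 * pt(t_stat, df, lower.tail = FALSE)  # two-tailed p-value ≈ 0.013 < 0.05
# In practice, cor.test(x, y) computes t, df, and p directly from the raw data.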
Confidence intervals
• 95% CI = (−0.3995, 0.0914)
• 95% CI = (0.1369, 0.5733)
• Try to guess for each: reject the null, or fail to reject the null?
• The first interval contains 0, so we fail to reject the null; the second does not contain 0, so we reject it.
Statistics course from https://www.udacity.com
Hypothesis testing
• A statistically correct way to decide about the relationship between two variables is, of course, hypothesis testing.
• In these two particular cases:
• $t = -1.29$, $df = 59$, $p = 0.2066$ (fail to reject the null)
• $t = 3.11$, $df = 59$, $p = 0.0028$ (reject the null)
Correlation vs. causation
• causation – one variable causes another to happen
• e.g. the fact that it is raining causes people to take their umbrellas to work
• correlation – just means there is a relationship
• e.g. do happy people have more friends? Are they just happy because they have more friends? Or do they act a certain way which causes them to have more friends?
Correlation vs. causation
• There is a strong relationship between ice cream consumption and the crime rate.
• How could this be true?
• The two variables must have something in common with one another. It must be something that relates to both the level of ice cream consumption and the level of the crime rate. Can you guess what that is?
• Outside temperature.
from causeweb.org
Correlation vs. causation
• If you stop selling ice cream, does the crime rate drop? What do you think?
• That's because of the simple principle that correlations express the association that exists between two or more variables; they have nothing to do with causality.
• In other words, just because the level of ice cream consumption and the crime rate increase or decrease together does not mean that a change in one necessarily results in a change in the other.
• You can't interpret associations as being causal.
Correlation vs. causation
• In the ice cream example, there exists a variable (outside temperature) that we did not think to control for.
• Such a variable is called a third variable, confounding variable, or lurking variable.
• The methodologies of scientific studies therefore need to control for these factors to avoid a 'false positive' conclusion that the dependent variables are in a causal relationship with the independent variable.
• Let's have a look at the dependence of the murder rate on temperature.
[Figure: assault rate vs. temperature, showing a high assault period and a low assault period]
from http://www-personal.umich.edu/~bbushman/BWA05a.pdf
Journal of Personality and Social Psychology, 2005, Vol. 89, No. 1, 62–66
http://xkcd.com/552/
Correlation and regression analysis
• Correlation analysis investigates the relationships between variables using graphs or correlation coefficients.
• Regression analysis answers questions like: what relationship exists between variables X and Y (linear, quadratic, …), is it possible to predict Y using X, and with what error?
Simple linear regression
• also called single linear regression
• one y (dependent variable), one x (independent variable)
• $\hat{y} = a + bx$
• $a$ – y-intercept (constant), $b$ – slope
• $\hat{y}$ is the estimated value; to distinguish it from the actual value $y$ corresponding to a given $x$, statisticians use the hat notation $\hat{y}$
Data set
• Students in higher grades carry more textbooks.
• Weight of the textbooks depends on the weight of the student.

[Figure: scatterplot with a strong positive correlation, r = 0.926, and one outlier]
from Intermediate Statistics for Dummies
Build a model
• Find a straight line $\hat{y} = a + bx$.
Interpretation
• y-intercept (3.69 in our case)
• it may or may not have a practical meaning
• Does it fall within actual values in the data set? If yes, it is a clue it may have a practical meaning.
• Does it fall within negative territory where negative y-values are not possible? (e.g. weights can't be negative)
• Does the value x = 0 have a practical meaning (a student weighing 0)?
• However, even if it has no meaning, it may be necessary (i.e. significantly different from zero)!
• slope
• the change in y due to a one-unit increase in x (i.e. if a student's weight increases by 1 pound, the textbook weight increases by 0.113 pounds)
• Now you can use the regression line to estimate the y value for a new x (see the sketch below).
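A hedged end-to-end sketch in R; the weights below are invented to roughly mimic the textbook example:

student  <- c(100, 110, 120, 130, 140, 150)        # hypothetical student weights (lbs)
textbook <- c(15.1, 16.3, 17.2, 18.5, 19.6, 20.7)  # hypothetical textbook weights (lbs)
model <- lm(textbook ~ student)
coef(model)                                # a (intercept) and b (slope)
predict(model, data.frame(student = 125))  # estimated ŷ for a new x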
Regression model conditions
• After building a regression model you need to check whether the required conditions are met.
• What are these conditions?
• The y's have to have a normal distribution for each value of x.
• The y's have to have a constant spread (standard deviation) for each value of x.
Normal y’s for every x
• For any value of x, the population of possible y-values must have a normal distribution.
from Intermediate Statistics for Dummies
Homoscedasticity condition
As you move from left to right on the x-axis, the spread of the y-values remains the same.
source: wikipedia.org
Confidence and prediction limits

95% confidence limits (confidence band) – this interval includes the true regression line with 95% probability.
95% prediction limits (prediction band) – this interval covers the values of the dependent variable with 95% probability, i.e. 95% of data points lie within these lines.
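Continuing the hypothetical lm() sketch above, R can compute both bands with predict():

new_x <- data.frame(student = c(105, 125, 145))
predict(model, new_x, interval = "confidence")  # confidence band for the mean response
predict(model, new_x, interval = "prediction")  # prediction band for individual values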
Residuals
• To check the normality of the y-values you need to measure how far off your predictions were from the actual data, and to explore these errors.
• residual:

$$e = y - \hat{y}$$

[Figure: actual value y, predicted value ŷ, and the residual e as the vertical distance between them]
from Intermediate Statistics for Dummies
Residuals
• The residuals are data just like any other, so you can find their mean (which is zero!!) and their standard deviation.
• Residuals can be standardized, i.e. converted to Z-scores, so you can see where each residual falls on the standard normal distribution.
• Plotting residuals on a graph gives residual plots (see the sketch below).
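Continuing the same hypothetical lm() sketch, the usual residual diagnostics in R:

res <- resid(model)       # e = y − ŷ for each observation
mean(res)                 # essentially zero
rstandard(model)          # standardized residuals (Z-scores)
plot(fitted(model), res)  # residual plot: look for constant spread around 0
abline(h = 0, lty = 2)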
Using r² to measure model fit

• r² measures what percentage of the variability in y is explained by the model.
• The y-values of the data you collect have a great deal of variability in and of themselves.
• You look for another variable (x) that helps you explain that variability in the y-values.
• After you put that x variable into the model and find it's highly correlated with y, you want to find out how well the model does at explaining why the values of y are different.
Interpreting r²

• high r² (80–90% is extremely high, 70% is fairly high)
• A high percentage of variability means that the line fits well, because there is not much left to explain about the value of y other than using x and its relationship to y.
• small r² (0–30%)
• The model containing x doesn't help much in explaining the difference in the y-values.
• The model does not fit well. You need another variable to explain y, other than the one you already tried.
• middle r² (30–70%)
• x does help somewhat in explaining y, but it doesn't do the job well enough on its own.
• Add one or more variables to the model to help explain y more fully as a group.
• Textbook example: r = 0.93, r² = 0.8649. Approximately 86% of the variability you find in textbook weights is explained by the student weight. A fairly good model.