Correlation and regression
http://sst.tees.ac.uk/external/U0000504
Introduction
- Scientific rules and principles are often expressed mathematically.
- There are two main approaches to finding a mathematical relationship between variables:
  - Analytical: based on theory
  - Empirical: based on observation and experience
The straight line (1)
- Most graphs based on numerical data are curves.
- The straight line is a special case.
- Data are often manipulated to yield straight-line graphs, as the straight line is relatively easy to analyse.
The straight line (2)
- Straight line equation: y = mx + c
- Slope: m = Δy/Δx
- Intercept: c
[Figure: straight-line graph illustrating the slope m = Δy/Δx and the intercept c on the y-axis]
Correlation & Regression
- These are statistical processes which:
  - suggest the existence of a relationship
  - determine the best equation to fit the data
- Correlation is a measure of the strength of a relationship between two variables.
- Regression is the process of determining that relationship.
Correlation and Regression
The next few slides illustrate correlation and regression.
No correlation
[Scatter plot: no correlation]

Positive correlation
[Scatter plot: positive correlation]

Negative correlation
[Scatter plot: negative correlation]

Curvilinear correlation
[Scatter plot: curvilinear correlation]
Correlation coefficient
- A statistical measure of the strength of a relationship between two variables:
  - Pearson's product-moment correlation coefficient, r
  - Spearman's rank correlation coefficient, ρ
- Both take a value in the range -1.0 to +1.0:
  - r or ρ = +1.0 represents a perfect positive correlation
  - r or ρ = -1.0 represents a perfect negative correlation
  - r or ρ = 0.0 represents no correlation
- Values of r or ρ are associated with a probability of there being a relationship.
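Pearson's r, as defined above, can be sketched in a few lines of pure Python (the slides use Excel or Minitab for this; the data sets here are made up purely to illustrate the ±1.0 extremes):

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear data gives r = +1.0 (positive slope) or -1.0 (negative slope).
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2 * x + 1 for x in xs]))    # 1.0
print(pearson_r(xs, [-3 * x + 10 for x in xs]))  # -1.0
```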
Linear regression
- Is the process of trying to fit the best straight line to a set of data.
- The usual method is based on minimising the squares of the errors between the data and the predicted line.
- For this reason, it is called "the method of least squares".
Linear regression - assumptions
- The error in the independent (x) variable is negligible relative to the error in the dependent (y) variable.
- The errors are normally, independently and identically distributed with mean 0 and constant variance: NIID(0, σ²).
Linear regression model
- For a set of data (x, y), there is an equation that best fits the data, of the form
  y = a + bx + e
  - x is the independent variable, or predictor
  - y is the measured dependent (predicted) variable
  - Y is the calculated dependent (predicted) variable, Y = a + bx
  - e is the error term, and accounts for that part of y not "explained" by x
- For any individual data point, i, the difference between the observed and predicted values of y is called the residual, ri:
  ri = yi - Yi = yi - (a + bxi)
- The residuals provide a measure of the error term.
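The least-squares estimates of a and b have a closed form, so the model above can be sketched directly in Python (the x/y values below are made up for illustration, scattered around y = 3 + 2x):

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept a and slope b in y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx        # slope
    a = my - b * mx      # intercept
    return a, b

# Made-up noisy data scattered around y = 3 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # ri = yi - Yi
print(a, b)
print(sum(residuals))  # least squares forces the residuals to sum to ~0
```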
Regression analysis (1)
- Check the correlation coefficient.
- Null hypothesis:
  - H0: there is no correlation between x & y
  - H1: there is a correlation between x & y
- Decision rule: reject H0 if |r| ≥ the critical value at α = 0.05.
- If you cannot reject H0 then proceed no further; otherwise carry out a full regression.
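The decision rule is a simple comparison against a tabulated critical value. A minimal sketch, assuming a sample of n = 10 (the critical value 0.632 for 8 degrees of freedom at α = 0.05, two-tailed, is taken from a standard table of critical values of r; check your own table for other sample sizes):

```python
def correlation_significant(r, critical_value):
    """Reject H0 (no correlation) when |r| >= the tabulated critical value."""
    return abs(r) >= critical_value

# Assumed from standard tables: n = 10 (df = 8), alpha = 0.05, two-tailed.
CRITICAL_R = 0.632

print(correlation_significant(0.85, CRITICAL_R))   # True  -> carry out a full regression
print(correlation_significant(-0.40, CRITICAL_R))  # False -> proceed no further
```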
Regression analysis (2)
- Regression analysis can be carried out using either Excel or Minitab. Excel will need the Analysis ToolPak add-in installed.
- The output from both Minitab and Excel will give the following information:
  - The regression equation (in the form y = a + bx)
  - Probabilities that a ≠ 0 and b ≠ 0
  - The coefficient of determination, R²
  - Analysis of variance
- In addition, you will need to produce at least one of:
  - Residuals vs. fitted values
  - Residuals vs. x-values
  - Residuals vs. y-values
Interpreting output
- Regression equation: this is the equation that best fits the data and provides the predicted values of y.
- Analysis of variance: determines what proportion of the variation in y can be accounted for by the regression equation, and what proportion is accounted for by the error term. The p-value arising from this tells us how well the regression equation fits the data.
  - The proportion of the variation in the data accounted for by the regression equation is called the coefficient of determination, R², and is equal to the square of the correlation coefficient.
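The identity R² = r² for a straight-line fit can be checked numerically. A self-contained sketch (the helper functions are repeated here so the block stands alone, and the data are the same made-up values used earlier):

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_squared(xs, ys):
    """Coefficient of determination: 1 - SS_res / SS_tot for the fitted line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data scattered around y = 3 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
print(r_squared(xs, ys))
print(pearson_r(xs, ys) ** 2)  # the same value: R^2 = r^2
```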
Output plots
- The output plots are used to check the assumptions about the errors.
- The normal probability plot should show the residuals lying on a straight line.
- The residual plots should have no obvious pattern, and should not show the residuals increasing or decreasing with increase in the fitted or measured values.
Non-linear relationships
- Many functions can be manipulated mathematically to yield a straight-line equation.
- Some examples are given in the next few slides.
Linearisation (2)

Function          | Plot        | Slope | Intercept
y = a/x + b       | y vs. 1/x   | a     | b
y = ax^n + b      | y vs. x^n   | a     | b
Linearisation (3)

Function: y = kx / (a + x)
Linear form: 1/y = 1/k + (a/k)(1/x)
Plot: 1/y vs. 1/x
Slope: a/k
Intercept: 1/k
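The reciprocal plot above can be demonstrated end to end: generate exact data from assumed constants k = 4.0 and a = 2.0, fit the straight line 1/y vs. 1/x by least squares, then recover k and a from the intercept and slope (all values here are made up for illustration):

```python
def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Exact data from y = kx / (a + x) with assumed k = 4.0, a = 2.0.
k_true, a_true = 4.0, 2.0
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
ys = [k_true * x / (a_true + x) for x in xs]

# Fit 1/y vs. 1/x: intercept = 1/k, slope = a/k.
intercept, slope = fit_line([1 / x for x in xs], [1 / y for y in ys])
k = 1 / intercept
a = slope * k
print(k, a)  # recovers k = 4.0 and a = 2.0 (up to floating-point error)
```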
Functions involving logs (1)
- Some functions can be linearised by taking logs.
- These are:
  - y = Ax^n
  - y = Ae^(kx)
Functions involving logs (2)
- For y = Ax^n, taking logs gives
  log y = log A + n log x
- A graph of log y vs. log x gives a straight line, slope n and intercept log A.
- To find A you must take antilogs (= 10^x).
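The log-log recipe can be sketched numerically: generate exact data from assumed constants A = 2.5 and n = 1.8, fit log y vs. log x, and recover n from the slope and A from the antilog of the intercept (constants are made up for illustration):

```python
import math

def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Exact data from y = A * x^n with assumed A = 2.5, n = 1.8.
A_true, n_true = 2.5, 1.8
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [A_true * x ** n_true for x in xs]

# Fit log y vs. log x: slope = n, intercept = log A.
intercept, slope = fit_line([math.log10(x) for x in xs],
                            [math.log10(y) for y in ys])
print(slope)            # n = 1.8
print(10 ** intercept)  # A = 2.5, recovered by taking antilogs
```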
Functions involving logs (3)
- For y = Ae^(kx), we must use natural logs:
  ln y = ln A + kx
- This gives a straight line, slope k and intercept ln A.
- To find A we must take antilogs (= e^x).
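The same idea with natural logs: generate exact data from assumed constants A = 3.0 and k = -0.5, fit ln y vs. x, and recover k from the slope and A from e raised to the intercept (constants are made up for illustration):

```python
import math

def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Exact data from y = A * e^(kx) with assumed A = 3.0, k = -0.5.
A_true, k_true = 3.0, -0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [A_true * math.exp(k_true * x) for x in xs]

# Fit ln y vs. x: slope = k, intercept = ln A.
intercept, slope = fit_line(xs, [math.log(y) for y in ys])
print(slope)                # k = -0.5
print(math.exp(intercept))  # A = 3.0, recovered by taking antilogs (= e^x)
```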
Polynomials
- These are functions of the general formula
  y = a + bx + cx² + dx³ + …
- They cannot be linearised.
- Techniques for fitting polynomials exist.
  - Both Excel and Minitab provide for fitting polynomials to data.
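Fitting a polynomial by least squares amounts to solving a small system of "normal equations". A minimal pure-Python sketch for a quadratic y = a + bx + cx², using exact data generated from assumed coefficients (in practice you would use Excel's trendline, Minitab, or numpy.polyfit rather than hand-rolling this):

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the 3x3 normal equations."""
    # Normal-equation matrix M[i][j] = sum(x^(i+j)) and vector v[i] = sum(y * x^i).
    S = [sum(x ** p for x in xs) for p in range(5)]
    M = [[S[i + j] for j in range(3)] for i in range(3)]
    v = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 3):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        coeffs[r] = (v[r] - sum(M[r][c] * coeffs[c] for c in range(r + 1, 3))) / M[r][r]
    return coeffs  # [a, b, c]

# Exact data from y = 1 + 2x - 0.5x^2 (assumed coefficients, for illustration).
xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x - 0.5 * x ** 2 for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(a, b, c)  # recovers 1.0, 2.0, -0.5 (up to floating-point error)
```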