Fitting Equations to Data


• Suppose that we have a single dependent variable Y (continuous numerical) and one or several independent variables X1, X2, X3, ... (also continuous numerical, although there are techniques that allow you to handle categorical independent variables).

The objective will be to "fit" an equation to the data collected on these measurements that explains the dependence of Y on X1, X2, X3, ...

Example

Data collected on n = 110 countries. Some of the variables:

Y = infant mortality
X1 = population size
X2 = population density
X3 = % urban
X4 = GDP
etc.

Our interest is in determining how Y is related to X1, X2, X3, X4, etc.

What is the value of these equations?

Equations give very precise and concise descriptions (models) of data and how dependent variables are related to independent variables.

Examples

1. Linear models: Y = aX + b + ε

Example: Y = blood pressure, X = age.

[Figure: scatter plot of blood pressure against age with a fitted straight line]

2. Exponential growth or decay models: Y = a e^(kX) + ε

Example: Y = average of the 5 best times for the 100m during an Olympic year, X = the Olympic year.

[Figure: fitted exponential decay curve]

3. Logistic growth models: Y = a / (1 + b e^(-kX)) + ε

[Figure: S-shaped logistic growth curve]

4. Gompertz growth models: Y = a e^(-b e^(-kX)) + ε

[Figure: Gompertz growth curve]

Note the presence of the random error term ε (random noise). This is an important term in any statistical model. Without this term the model is deterministic and does not require statistical analysis.
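To make these model forms concrete, here is a minimal sketch of fitting one of them by non-linear least squares. It assumes Python with NumPy and SciPy; the x and y arrays are purely hypothetical illustrations, not data from the course.

```python
import numpy as np
from scipy.optimize import curve_fit

# Mean functions for the four model forms above; the random error term
# is what least squares minimizes, so it does not appear here.
def linear(x, a, b):
    return a * x + b

def exponential(x, a, k):
    return a * np.exp(k * x)

def logistic(x, a, b, k):
    return a / (1.0 + b * np.exp(-k * x))

def gompertz(x, a, b, k):
    return a * np.exp(-b * np.exp(-k * x))

# Hypothetical data, generated only to illustrate the fitting call.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = gompertz(x, 8.0, 5.0, 0.6) + rng.normal(0, 0.2, x.size)

params, _ = curve_fit(gompertz, x, y, p0=[8.0, 5.0, 0.5])
print("fitted a, b, k:", params)
```

curve_fit minimizes the residual sum of squares between the observed y values and the chosen mean function, which is the same least squares idea developed later in these notes.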

What is the value of these equations?

1. Equations give very precise and concise descriptions (models) of data and of how dependent variables are related to independent variables.

2. The parameters of the equations usually have very useful interpretations relative to the phenomenon being studied.

3. The equations can be used to calculate and estimate very useful quantities related to the phenomenon, e.g. relative extrema and future or out-of-range values.

4. Equations can provide the framework for comparison.

The Multiple Linear Regression Model

Again we assume that we have a single dependent variable Y and p (say) independent variables X1, X2, X3, ..., Xp. The equation (model) that generally describes the relationship between Y and the independent variables is of the form:

Y = f(X1, X2, ..., Xp | θ1, θ2, ..., θq) + ε

where θ1, θ2, ..., θq are unknown parameters of the function f, and ε is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation σ).

In Multiple Linear Regression we assume the following model:

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

This model is called the Multiple Linear Regression Model. Again, β0, β1, β2, ..., βp are unknown parameters of the model, and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.

The importance of the Linear model

1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is often the first model to be fitted and is abandoned only if it turns out to be inadequate.

2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any one of the independent variables is increased while holding the other independent variables constant.

3. Many non-linear models can be put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables. This important fact (i.e., that many non-linear models are linearizable) ensures the wide utility of the linear model.
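As a small illustration of point 3, the exponential model Y = a e^(kX) becomes linear after a log transform: log(Y) = log(a) + kX. A sketch, assuming NumPy and purely hypothetical data:

```python
import numpy as np

# The exponential model Y = a * exp(k X) linearizes to
# log(Y) = log(a) + k * X, a simple linear model in X.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 2.0, 1.4, 0.9, 0.6, 0.4])   # hypothetical positive responses

k, log_a = np.polyfit(x, np.log(y), 1)          # slope = k, intercept = log(a)
print("a =", np.exp(log_a), ", k =", k)
```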

An Example

The following data come from an experiment investigating the source from which corn plants in various soils obtain their phosphorus. The concentration of inorganic phosphorus (X1) and the concentration of organic phosphorus (X2) were measured in the soil of n = 18 test plots. In addition, the phosphorus content (Y) of corn grown in the soil was also measured. The data are displayed below:

Inorganic Phosphorus X1 | Organic Phosphorus X2 | Plant Available Phosphorus Y
0.4  | 53 | 64
0.4  | 23 | 60
3.1  | 19 | 71
0.6  | 34 | 61
4.7  | 24 | 54
1.7  | 65 | 77
9.4  | 44 | 81
10.1 | 31 | 93
11.6 | 29 | 93
12.6 | 58 | 51
10.9 | 37 | 76
23.1 | 46 | 96
23.1 | 50 | 77
21.6 | 44 | 93
23.1 | 56 | 95
1.9  | 36 | 54
26.8 | 58 | 168
29.9 | 51 | 99

Coefficients:

Intercept: 56.2510241 (β̂0)
X1: 1.78977412 (β̂1)
X2: 0.08664925 (β̂2)

Equation: Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
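These coefficients can be reproduced directly from the table above by ordinary least squares. A minimal sketch, assuming Python with NumPy (not the package used for the original analysis):

```python
import numpy as np

# Phosphorus data from the table above (n = 18 plots).
x1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
               12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9])
x2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29,
               58, 37, 46, 50, 44, 56, 36, 58, 51], dtype=float)
y = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93,
              51, 76, 96, 77, 93, 95, 54, 168, 99], dtype=float)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # approximately [56.251, 1.790, 0.087]
```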

Least Squares for Multiple Regression

Assume we have taken n observations on Y: y1, y2, ..., yn, for n sets of values of X1, X2, ..., Xp:

(x11, x12, ..., x1p)
(x21, x22, ..., x2p)
...
(xn1, xn2, ..., xnp)

For any choice of the parameters β0, β1, β2, ..., βp the residual sum of squares is defined to be:

$$R(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)^2$$

The Least Squares estimators of β0, β1, β2, ..., βp are the values chosen to minimize the residual sum of squares:

$$R(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)^2$$

To achieve this we solve the following system of equations:

$$\frac{\partial R(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_0} = 0, \quad \frac{\partial R(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_1} = 0, \quad \ldots, \quad \frac{\partial R(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_p} = 0$$

Now:

$$\frac{\partial R(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_0} = \sum_{i=1}^{n} 2\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)(-1) = 0$$

or

$$n\beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_p \sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$$

Also:

$$\frac{\partial R(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_1} = \sum_{i=1}^{n} 2\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)(-x_{i1}) = 0$$

or

$$\beta_0 \sum_{i=1}^{n} x_{i1} + \beta_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \beta_p \sum_{i=1}^{n} x_{i1} x_{ip} = \sum_{i=1}^{n} x_{i1} y_i$$

In general, setting

$$\frac{\partial R(\beta_0, \beta_1, \ldots, \beta_k, \ldots, \beta_p)}{\partial \beta_k} = 0$$

yields the equation

$$\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik}^2 + \cdots + \beta_p \sum_{i=1}^{n} x_{ik} x_{ip} = \sum_{i=1}^{n} x_{ik} y_i$$

The system of equations for β0, β1, ..., βp is therefore:

$$n\beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_p \sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_{i1} + \beta_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \beta_p \sum_{i=1}^{n} x_{i1} x_{ip} = \sum_{i=1}^{n} x_{i1} y_i$$

$$\vdots$$

$$\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik}^2 + \cdots + \beta_p \sum_{i=1}^{n} x_{ik} x_{ip} = \sum_{i=1}^{n} x_{ik} y_i$$

$$\vdots$$

$$\beta_0 \sum_{i=1}^{n} x_{ip} + \beta_1 \sum_{i=1}^{n} x_{ip} x_{i1} + \cdots + \beta_p \sum_{i=1}^{n} x_{ip}^2 = \sum_{i=1}^{n} x_{ip} y_i$$

This is a system of (p + 1) linear equations in the (p + 1) unknowns β0, β1, ..., βp. These equations are called the Normal equations. Their solutions β̂0, β̂1, ..., β̂p are the least squares estimates.
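In matrix form the Normal equations are (X'X)β = X'y, where X is the n × (p + 1) design matrix whose first column is all ones. A small sketch of solving them, assuming NumPy (the function name normal_equation_fit is just an illustrative choice):

```python
import numpy as np

def normal_equation_fit(x_vars, y):
    """Solve the Normal equations (X'X) b = X'y for b = (b0, b1, ..., bp).

    x_vars : (n, p) array, one column per independent variable
    y      : (n,)  array of observations on Y
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x_vars])   # prepend the intercept column
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Applied to the phosphorus data, this returns the same β̂0, β̂1, β̂2 as the computer output shown earlier.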

The Example

Recall the phosphorus data collected on the n = 18 test plots, given in the table above (p = 2 independent variables).

The Normal equations

For this data n = 18 and p = 2, so the Normal equations are:

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum X_1 + \hat{\beta}_2 \sum X_2 = \sum Y$$

$$\hat{\beta}_0 \sum X_1 + \hat{\beta}_1 \sum X_1^2 + \hat{\beta}_2 \sum X_1 X_2 = \sum X_1 Y$$

$$\hat{\beta}_0 \sum X_2 + \hat{\beta}_1 \sum X_1 X_2 + \hat{\beta}_2 \sum X_2^2 = \sum X_2 Y$$

where ΣX1 = 215, ΣX1² = 4321.02, ΣX1X2 = 10139.5, ΣX2 = 758, ΣX2² = 35076, ΣY = 1463, ΣX1Y = 20706.2, ΣX2Y = 63825.

The Normal equations are therefore:

18 β̂0 + 215 β̂1 + 758 β̂2 = 1463
215 β̂0 + 4321.02 β̂1 + 10139.5 β̂2 = 20706.2
758 β̂0 + 10139.5 β̂1 + 35076 β̂2 = 63825

and they have solution:

β̂0 = 56.2510241, β̂1 = 1.78977412, β̂2 = 0.08664925

Coefficients:

Intercept: 56.2510241 (β̂0)
X1: 1.78977412 (β̂1)
X2: 0.08664925 (β̂2)

Equation: Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
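As a numerical check, the 3 × 3 system above can be solved directly. A sketch assuming NumPy:

```python
import numpy as np

# Coefficient matrix and right-hand side of the Normal equations,
# assembled from the sums listed above (n = 18, p = 2).
A = np.array([[18.0,  215.0,    758.0],
              [215.0, 4321.02,  10139.5],
              [758.0, 10139.5,  35076.0]])
rhs = np.array([1463.0, 20706.2, 63825.0])

print(np.linalg.solve(A, rhs))   # approximately [56.2510, 1.7898, 0.0866]
```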

Summary of the Statistics used in Multiple Regression

The Least Squares Estimates: the values β̂0, β̂1, β̂2, ..., β̂p that minimize

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip}\right)^2$$

The Analysis of Variance Table Entries

a) Adjusted Total Sum of Squares (SS_Total):

$$SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (d.f. = n - 1)$$

b) Residual Sum of Squares (SS_Error):

$$RSS = SS_{Error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (d.f. = n - p - 1)$$

c) Regression Sum of Squares (SS_Reg):

$$SS_{Reg} = SS(\beta_1, \beta_2, \ldots, \beta_p) = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad (d.f. = p)$$

Note:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

i.e. SS_Total = SS_Reg + SS_Error

The Analysis of Variance Table

Source     | Sum of Squares | d.f.      | Mean Square                        | F
Regression | SS_Reg         | p         | MS_Reg = SS_Reg/p                  | MS_Reg/s²
Error      | SS_Error       | n - p - 1 | MS_Error = SS_Error/(n-p-1) = s²   |
Total      | SS_Total       | n - 1     |                                    |

Uses:

1. To estimate σ² (the error variance): use s² = MS_Error to estimate σ².

2. To test the hypothesis H0: β1 = β2 = ... = βp = 0. Use the test statistic

$$F = \frac{MS_{Reg}}{MS_{Error}} = \frac{SS_{Reg}/p}{SS_{Error}/(n-p-1)} = \frac{MS_{Reg}}{s^2}$$

Reject H0 if F > Fα(p, n-p-1).

3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X 1 , X 2 , ... ,X p (the independent variables).

a) R² = the coefficient of determination

$$R^2 = \frac{SS_{Reg}}{SS_{Total}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

= the proportion of variance in Y explained by X1, X2, ..., Xp.

1 - R² = SS_Error/SS_Total = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp.

b) R_a² = "R² adjusted" for degrees of freedom

= 1 - [the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.]

$$R_a^2 = 1 - \frac{MS_{Error}}{MS_{Total}} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)} = 1 - \frac{(n-1)}{(n-p-1)}\,\frac{SS_{Error}}{SS_{Total}} = 1 - \frac{(n-1)}{(n-p-1)}\left(1 - R^2\right)$$

c) R = √R² = the multiple correlation coefficient of Y with X1, X2, ..., Xp

$$R = \sqrt{\frac{SS_{Reg}}{SS_{Total}}}$$

= the maximum correlation between Y and a linear combination of X1, X2, ..., Xp.

Comment:

The statistics F, R², R_a², and R are equivalent statistics.
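For the phosphorus example, all of the quantities in this summary (the ANOVA sums of squares, s², F, R², R_a², and R) can be computed from the fitted values and residuals. A sketch assuming Python with NumPy:

```python
import numpy as np

# Phosphorus data (as in the earlier sketch).
x1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
               12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9])
x2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29,
               58, 37, 46, 50, 44, 56, 36, 58, 51], dtype=float)
y = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93,
              51, 76, 96, 77, 93, 95, 54, 168, 99], dtype=float)

n, p = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_total = np.sum((y - y.mean()) ** 2)       # d.f. = n - 1
ss_error = np.sum((y - y_hat) ** 2)          # d.f. = n - p - 1
ss_reg = np.sum((y_hat - y.mean()) ** 2)     # d.f. = p

ms_reg = ss_reg / p
s2 = ss_error / (n - p - 1)                  # MS_Error, the estimate of sigma^2
F = ms_reg / s2

r2 = ss_reg / ss_total
r2_adj = 1 - s2 / (ss_total / (n - 1))
R = np.sqrt(r2)

print(F, r2, r2_adj, R)
```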

Using Statistical Packages

To perform Multiple Regression

Using SPSS

Note:

The use of another statistical package such as Minitab is similar to using SPSS

After starting the SPSS program the following dialogue box appears:

If you select Opening an existing file and press OK, a dialogue box appears in which you choose the file to open. The next dialogue box asks how to read the file: if the variable names are in the file, ask it to read the names, and if you do not specify the Range, the program will identify the Range. Once you click OK, two windows will appear.

One that will contain the output:

The other containing the data:

To perform any statistical analysis, select the Analyze menu. Then select Regression and Linear. The following Regression dialogue box appears:

Select the dependent variable Y. Select the independent variables X1, X2, etc.

If you select the Method Enter, all variables will be put into the equation. There are also several other methods that can be used:

1. Forward selection
2. Backward elimination
3. Stepwise regression

Forward selection

1. This method starts with no variables in the equation.
2. It carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. It adds the most significant.
4. It continues until all variables not in the equation have no significant effect on the dependent variable.

(A code sketch of this idea appears below, after the stepwise description.)

Backward Elimination

1. This method starts with all variables in the equation.
2. It carries out statistical tests on variables in the equation to see which have no significant effect on the dependent variable.
3. It deletes the least significant.
4. It continues until all variables in the equation have a significant effect on the dependent variable.

Stepwise Regression

(uses both forward and backward techniques)

1. This method starts with no variables in the equation.
2. It carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. It then adds the most significant.
4. After a variable is added it checks to see if any variables added earlier can now be deleted.
5. It continues until all variables not in the equation have no significant effect on the dependent variable.
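The following sketch illustrates the forward-selection idea using partial F-tests. It assumes Python with NumPy and SciPy, and is an illustration of the general procedure, not the exact algorithm or entry thresholds SPSS uses.

```python
import numpy as np
from scipy import stats

def forward_selection(x_all, y, alpha=0.05):
    """Illustrative forward selection: repeatedly add the candidate variable
    whose partial F-test is most significant, stopping when no remaining
    variable is significant at level alpha."""
    n = len(y)
    selected, remaining = [], list(range(x_all.shape[1]))

    def rss(cols):
        X = np.column_stack([np.ones(n)] + [x_all[:, c] for c in cols])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)

    while remaining:
        rss_now = rss(selected)
        best_p, best_var = 1.0, None
        for c in remaining:
            rss_new = rss(selected + [c])
            df_error = n - (len(selected) + 1) - 1
            F = (rss_now - rss_new) / (rss_new / df_error)
            p_value = stats.f.sf(F, 1, df_error)        # upper-tail probability
            if p_value < best_p:
                best_p, best_var = p_value, c
        if best_var is None or best_p > alpha:
            break                                        # nothing left is significant
        selected.append(best_var)
        remaining.remove(best_var)
    return selected
```

Backward elimination and stepwise regression modify this loop by starting from the full set of variables, or by re-testing previously added variables after each step.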

All of these methods are procedures for attempting to find the best equation. The best equation is the equation that is the simplest (not containing variables that are not important) yet adequate (containing the variables that are important).

Once the dependent variable, the independent variables and the Method have been selected, pressing OK will perform the analysis.

The output will contain the following table

Model Summary

Model | R    | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .822 | .676     | .673              | 4.46

Predictors: (Constant), WEIGHT, HORSE, ENGINE

R² and adjusted R² measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (here 67.6% and 67.3%). R is the multiple correlation coefficient (the maximum correlation between Y and a linear combination of X1, X2, X3, etc.).

The next table is the Analysis of Variance Table

ANOVA

Model 1    | Sum of Squares | df  | Mean Square | F       | Sig.
Regression | 16098.158      | 3   | 5366.053    | 269.664 | .000
Residual   | 7720.836       | 388 | 19.899      |         |
Total      | 23818.993      | 391 |             |         |

Predictors: (Constant), WEIGHT, HORSE, ENGINE. Dependent Variable: MPG.

The F test is testing whether the regression coefficients of the predictor variables are all zero, i.e. whether none of the independent variables X1, X2, X3, etc. have any effect on Y.

The final table in the output

Coefficients

Model 1    | Unstandardized B | Std. Error | Standardized Beta | t      | Sig.
(Constant) | 44.015           | 1.272      |                   | 34.597 | .000
ENGINE     | -5.53E-03        | .007       | -.074             | -.786  | .432
HORSE      | -5.56E-02        | .013       | -.273             | -4.153 | .000
WEIGHT     | -4.62E-03        | .001       | -.504             | -6.186 | .000

Dependent Variable: MPG.

This table gives the estimates of the regression coefficients, their standard errors, and the t test for testing whether each is zero.

Note: Engine size has no significant effect on Mileage.

The estimated equation from the table above is:

Mileage = 44.0 - (5.53/1000) Engine - (5.56/100) Horse - (4.62/1000) Weight + Error

Note that from the equation

Mileage = 44.0 - (5.53/1000) Engine - (5.56/100) Horse - (4.62/1000) Weight + Error

Mileage decreases:

1. with increases in Engine Size (not significant, p = 0.432)
2. with increases in Horsepower (significant, p = 0.000)
3. with increases in Weight (significant, p = 0.000)
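Using the estimated equation, a predicted (mean) mileage can be computed for any combination of engine size, horsepower, and weight. A sketch with purely hypothetical input values, assuming the same units as the original data:

```python
# Predicted (mean) mileage from the estimated equation; the input values
# below are hypothetical and the units are assumed to match the data.
def predicted_mpg(engine, horse, weight):
    return 44.0 - (5.53 / 1000) * engine - (5.56 / 100) * horse - (4.62 / 1000) * weight

print(predicted_mpg(engine=200, horse=100, weight=3000))   # about 23.5
```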