Document 7315644


Chapter 12 and 13 Simple Linear and Multiple Regression



Simple Linear Regression Analysis



Multiple Regression Analysis and Model Building

BUS304 – Chapter 12 Simple Regression Analysis

Simple Regression Analysis

• Also called “Bivariate Regression”.
• It analyzes the relationship between two variables.
• It is regarded as a higher level of analysis than correlation analysis.
• It specifies one dependent variable (the response) and one independent variable (the predictor, the cause).
• It assumes a linear relationship between the dependent and independent variable.
• The output of the analysis is a linear regression model, which is generally used to predict the dependent variable.

[Figure: scatter plot; x axis ticks 30 to 46, y axis ticks 0.3 to 35.3]

The regression model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

• The model assumes a linear relationship.
• Two variables:
  – x: independent variable (the reason)
  – y: dependent variable (the result)
  – For example, x can represent the number of customers dining in a restaurant, and y the amount of tips collected by the waiter.
• Parameters:
  – $\beta_0$: the intercept; represents the expected value of y when x = 0.
  – $\beta_1$: the slope (also called the coefficient of x); represents the expected increment of y when x increases by 1.
  – $\varepsilon_i$: the error term; the uncontrolled part.

Graphical explanation of the parameters

Assume this is a scatter plot of the population.

[Figure: population scatter plot with the model parameters marked on it, including the slope $\beta_1$; x axis ticks 30 to 46, y axis ticks 0.3 to 35.3]

Building the model

• The regression model is used to
  – predict the value of y
  – explain the impact of x on y
• Scenarios:
  – x is easily observable, but y is not; or
  – x is easily controllable, but y is not; or
  – x will affect y, but y cannot affect x.
• The causality should be carefully justified before building the model.
  – When assigning x and y, make sure which is the reason and which is the result; otherwise, the model is wrong!
  – Example from Information Systems research: “Ease of use” vs. “Usefulness”.
  – There may always be a second thought on the causality.

Example

• Build up the regression models:
  1. At State University, a study was done to establish whether a relationship existed between a student’s GPA when graduating and SAT score when entering the university.
  2. The Skeleton Manufacturing Company recently did a study of its customers. A random sample of 50 customer accounts was pulled from the computer records. Two variables were observed:
     a) The total dollar volume of business this year
     b) Miles away the customer is from corporate headquarters

Estimate the coefficient

• Regression model:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

• Given $\beta_0 = 2$ and $\beta_1 = 3$, if we know x = 4, we can expect y = 2 + 3 × 4 = 14.
• How do we know that $\beta_0 = 2$ and $\beta_1 = 3$?
  – To know $\beta_0$ and $\beta_1$, we need the population data for all x and y.
  – Normally, we only have a sample. The trend line determined by a sample is an estimate of the population trend line.
• The fitted model:

$$\hat{y} = b_0 + b_1 x$$

  – The hat indicates a predicted value.
  – $b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$; they are sample statistics.
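
The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `expected_y` is hypothetical, and the numbers are the ones quoted on the slide ($\beta_0 = 2$, $\beta_1 = 3$, x = 4).

```python
def expected_y(beta0: float, beta1: float, x: float) -> float:
    """Expected value of y on the population line y = beta0 + beta1 * x.

    The error term has mean 0, so it drops out of the expectation.
    """
    return beta0 + beta1 * x


# Values taken from the slide: beta0 = 2, beta1 = 3, x = 4.
print(expected_y(2, 3, 4))  # 14
```

With a sample instead of the population, the same function would be called with the estimates $b_0$ and $b_1$ in place of $\beta_0$ and $\beta_1$.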

Estimate the coefficients

• Based on the sample collected, run “simple regression analysis” to find the “best fitted line”.
• The intercept of the line: $b_0$
• The slope of the line: $b_1$
• They are estimates of $\beta_0$ and $\beta_1$.
• We can use $b_0$ and $b_1$ to predict y when we know x. The prediction model:

$$\hat{y} = b_0 + b_1 x$$

[Figure: sample scatter plot with the best fitted line; x axis ticks 30 to 46, y axis ticks 0.3 to 35.3]

How to determine the trend line?

• The trend line is also called the “best fitted line”.
• How to define the “best fitted line”?
  – There could be a lot of criteria.
  – The most commonly used one is “Ordinary Least Squares” (OLS) regression: find the line with the least aggregate squared residual.
  – Residual: for each sample data point i, the observed value $y_i$ is not likely to be exactly the predicted value $\hat{y}_i$. The difference is the residual:

$$e_i = y_i - \hat{y}_i$$

Solution for OLS regression

• The objective function:

$$\min_{b_0, b_1} \sum_{i=1}^{n} e_i^2 = \min_{b_0, b_1} \left( e_1^2 + e_2^2 + \dots + e_n^2 \right)$$

• Find the best $b_0$ and $b_1$, which minimize the sum of squared residuals.
• Solution:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

• Use Excel:
  – Add a trend line
  – Run a regression analysis (Data Analysis tool kit)
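
As a minimal sketch of the OLS solution outside Excel, the slope can be computed as the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept as the mean of y minus the slope times the mean of x. The function name `fit_ols` and the five data points are made up for illustration; they are not from Midwest.xls.

```python
def fit_ols(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


# Illustrative data only (assumed, not from the course files).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_ols(xs, ys)
print(round(b0, 4), round(b1, 4))  # 2.2 0.6
```

Excel's trend line and its regression tool should report the same two numbers for the same data, since both use ordinary least squares.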

Exercise

• Open “Midwest.xls”.
• Create a scatter plot.
• Add a trend line.
• Provide your estimate of y when
  – x = 10
  – x = 0
  – x = 4
• Residual: $e_i$, for each sample data point.
• In regression analysis, we assume that the residuals are normally distributed, with mean 0.
• The smaller the variance of the residuals, the stronger the linear relationship.

Add a trend line

• Step 1: In your scatter plot, right-click one data point and choose the option “Add trend line”.
• Step 2: Choose the “Options” tab, check “Display equation on chart”, then click “OK”.
• Result: y = 175.8 + 49.91*x

The “Fitness”

• Sometimes, it is just not a good idea to use a line to represent the relationship.

[Figure: scatter plots of Y against X, labeled “Not good!”, “kinda good”, and “better”]

• Just see how well the sample data form a line, that is, how well the model predicts.

The measurement for the fitness

• The Sum of Squared Errors (SSE):

$$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$

  – The smaller the SSE, the better the fit.
  – In the extreme case, if every point lies on the line, there is no residual at all and SSE = 0 (every prediction is accurate).
  – SSE also increases when the sample size gets larger (more terms to sum up); however, this does not indicate a worse fit.
• Other associated terms:
  – SST, total sum of squares: the total variation of y.

$$SST = \sum (y_i - \bar{y})^2$$

  – SSR, sum of squares regression: the total variation of y explained by the model.

$$SSR = \sum (\hat{y}_i - \bar{y})^2$$

• It can be shown that SST, SSR, and SSE have the following relationship:

$$SST = SSR + SSE$$

$R^2$

• A standardized measure of fitness:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• Interpretation:
  – The proportion of the total variation in the dependent variable (y) that is explained by the regression model.
  – In other words, the proportion that is not left unexplained in the residuals.
• The larger the $R^2$, the better the fitness.
• In the Simple Linear Regression Model, $R^2 = r^2$, where r is the correlation between x and y.
• Compute the correlation and verify.
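
The "compute and verify" step can be sketched as follows. This is an illustration under stated assumptions: the helper name `fit_summary` and the five data points are made up, and the code simply checks numerically that SST equals SSR plus SSE and that $R^2$ equals the squared correlation.

```python
import math


def fit_summary(xs, ys):
    """Fit y = b0 + b1*x by least squares and return the fit measures."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]

    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    r = s_xy / math.sqrt(s_xx * sst)                      # sample correlation
    return {"SSE": sse, "SSR": ssr, "SST": sst, "R2": ssr / sst, "r": r}


# Illustrative data only (assumed, not from the course files).
fit = fit_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit["SST"], fit["SSR"] + fit["SSE"])  # the two totals should agree
print(fit["R2"], fit["r"] ** 2)             # R squared should equal r squared
```

The equalities hold only up to floating-point rounding, so a comparison in code should use a small tolerance rather than `==`.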

Read the regression report

• Step 1: check the fitness, that is, whether the model is correct.

Regression Statistics
Multiple R           0.832534056
R Square             0.693112955
Adjusted R Square    0.662424251
Standard Error       92.10553441
Observations         12

  – R Square: better greater than 0.3; the greater the better.

• Step 2: what are the coefficients, and is the slope of x too small?

                     Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept            175.8288191    54.98988674      3.197476   0.009532   53.30372    298.3539
Years with Midwest   49.91007584    10.50208428      4.752397   0.000777   26.50997    73.31018

  – Fitted model: y = 175.8 + 49.91*x
  – P-value column: the p-values for testing $\beta_0 = 0$ and $\beta_1 = 0$.
  – Interval estimation of $\beta_0$ and $\beta_1$ (confidence level: 95%): $\beta_0$: 53.3~298.35; $\beta_1$: 26.5~73.31
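
To make the report easier to read, the sketch below recomputes two things from its numbers: point predictions from the fitted line, and the t Stat column, which is each coefficient divided by its standard error. The constant and function names are hypothetical; the coefficient and standard-error values are copied from the report, and the x values 0, 4, and 10 are the ones asked for in the Midwest exercise.

```python
# Coefficients and standard errors copied from the regression report.
B0, SE_B0 = 175.8288191, 54.98988674   # Intercept
B1, SE_B1 = 49.91007584, 10.50208428   # Years with Midwest


def predict(x):
    """Point prediction from the fitted line y = b0 + b1 * x."""
    return B0 + B1 * x


def t_stat(coef, se):
    """t statistic for the null hypothesis that the coefficient is 0."""
    return coef / se


for x in (0, 4, 10):
    print(x, round(predict(x), 2))

# These should reproduce the t Stat column of the report.
print(round(t_stat(B0, SE_B0), 4), round(t_stat(B1, SE_B1), 4))
```

Predictions for x values far outside the range observed in the sample should be treated with caution, since the line was only fitted on that range.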

Confidence Interval Estimation

• Input the required confidence level.

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 90.0%   Upper 90.0%
Intercept      32.64209       2.60924          12.51019   1.56E-06   26.62517    38.659      27.7900       37.494
X Variable 1   -0.64049       0.1265           -5.061     0.000975   -0.9323     -0.3486     -0.8758       -0.4051
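
A confidence interval for a coefficient is the estimate plus or minus a critical t value times the standard error. The sketch below applies this to the Midwest report (12 observations, so 10 degrees of freedom). The critical value 2.228139 is hard-coded from a standard t table for 95% confidence and 10 degrees of freedom, which is an assumption of this example; the function name is hypothetical.

```python
# Two-sided 95% critical value of the t distribution with 10 degrees of
# freedom (n - 2 = 12 - 2), taken from a standard t table (assumed here).
T_CRIT_95_DF10 = 2.228139


def confidence_interval(coef, se, t_crit):
    """Return (lower, upper) = coef -/+ t_crit * se."""
    margin = t_crit * se
    return coef - margin, coef + margin


# Coefficients and standard errors from the Midwest regression report.
intercept_ci = confidence_interval(175.8288191, 54.98988674, T_CRIT_95_DF10)
slope_ci = confidence_interval(49.91007584, 10.50208428, T_CRIT_95_DF10)

print([round(v, 2) for v in intercept_ci])
print([round(v, 2) for v in slope_ci])
```

A different confidence level only changes the critical value: a 90% interval uses a smaller t value and is therefore narrower, as the Lower 90.0% and Upper 90.0% columns show.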

Hypothesis Test

• People are normally interested in whether $\beta_1$ is 0 or not; in other words, whether x has an impact on y.
• Based on the report from Excel, it is very convenient to conduct such a test: simply compare whether the p-value of the coefficient is smaller than $\alpha$ or not.
• Hypotheses:
  – $H_0: \beta_1 = 0$
  – $H_A: \beta_1 \neq 0$
• Decision rules:
  – If $p < \alpha$, reject the null hypothesis.
  – If $p \geq \alpha$, do not reject the null hypothesis.
• Compare p and $\alpha$, and make the decision.
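
The decision rule is a single comparison, sketched below. The function name `decide` and the default significance level of 0.05 are assumptions for illustration; the p-value 0.000777 is the one reported for the slope in the Midwest report.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p < alpha."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"


# p-value of the slope (Years with Midwest) from the regression report.
print(decide(0.000777))  # reject H0
```

Rejecting the null hypothesis here means the sample gives evidence that the slope is not 0, that is, that x has an impact on y.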

When you don’t have a good fit

• If the fitness is not good, the correlation between x and y is not strong enough.
• It is always a good idea to check the scatter plot first.
• Cases
  – Case 1: Maybe there are outliers (explain the outlier).

[Figure: two scatter plots; y axis ticks 15 to 24, x axis ticks 19 to 24 in one plot and 19 to 31 in the other]

Not a good fit?

• Case 2: Check the variation of x.
  – In order to have a good prediction model, the independent variable should cover a certain range.
  – Collect more data while guaranteeing the variation of x.
• Case 3: Inherently non-linear relationship.
  – Non-linear regression (not required)
  – Segment regression: separate your data into groups and run regression separately.

[Figure: two scatter plots of Y against X showing non-linear relationships]

Exercise

• Problem 12.14 (Page 498)
• Problem 12.15
• Problem 12.19

Multiple regression analysis

• An extension of the simple regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

  – k: number of total factors being studied
  – $x_1, x_2, \dots, x_k$: the values of the independent variables
  – y: the dependent variable
  – $\beta_0, \beta_1, \beta_2, \dots, \beta_k$: the coefficients
• Example:

$$\text{Annual Income} = \beta_0 + \beta_1 \cdot \text{Years in school} + \beta_2 \cdot \text{Years of working} + \varepsilon$$

  – Dependent variable (y): annual income
  – Independent variables (k = 2):
    x1: years in school
    x2: years of working

Why Multiple Regression?

• Limitations of Bivariate Regression:
  – Often too simplistic
  – Biased estimates if relevant predictors are omitted
  – Lack of fit does not show that X is unrelated to Y if the true model is multivariate
• Use Bivariate Regression only when
  – There is a compelling need for a simplified model
  – Other predictors have only modest effects and a single logical predictor “stands out”.

Data Format

• Regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

• Prediction model:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

• Data format:

Dependent variable   Independent variables
y                    x_1     x_2     ...   x_k
y_1                  x_11    x_21    ...   x_k1
y_2                  x_12    x_22    ...   x_k2
...                  ...     ...     ...   ...
y_n                  x_1n    x_2n    ...   x_kn
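
With data laid out one observation per row, the estimates $b_0, b_1, \dots, b_k$ can be obtained by least squares. The sketch below solves the normal equations in plain Python; it is an illustration, not what Excel does internally. The function name `fit_multiple` and the income data are made up, and the data were constructed so that income = 10 + 2 × (years in school) + 3 × (years of working) exactly.

```python
def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations.

    rows: one list [x1, ..., xk] per observation
    ys:   observed y values, one per observation
    Assumes the independent variables are not perfectly collinear.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 = intercept
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(X, ys))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data: [years in school, years of working] -> income (thousands).
rows = [[12, 1], [12, 5], [16, 2], [16, 10], [18, 4]]
incomes = [37, 49, 48, 72, 58]
print([round(b, 4) for b in fit_multiple(rows, incomes)])
```

Because the example data contain no noise, the recovered coefficients should match 10, 2, and 3 up to rounding; real data would leave non-zero residuals.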

Building up the model

• Decide the question you want to ask: this gives the dependent variable (y).
• List the potential independent variables ($x_1, \dots, x_k$): all the factors you think might affect the dependent variable.
  – Different people may come up with different sets of independent variables.
  – The list reflects your knowledge of the subject.
• In practice, you might not get data for all the factors.
  – Try to get as many as you can.
  – Pre-select the most relevant factors to limit the number of your survey questions.

Pre-selection

• Use the correlation matrix to pre-study.
• Example: “First-city.xls”
• It is always good to run a correlation analysis to check whether all the factors are relevant.
  – “Data Analysis” → “Correlation”

            Price      Sq. Feet   Age        Bedrooms   Bathrooms   Garage #
Price       1
Sq. Feet    0.747712   1
Age         -0.48522   -0.07288   1
Bedrooms    0.540088   0.70586    -0.2024    1
Bathrooms   0.665504   0.62929    -0.3871    0.59964    1
Garage #    0.693538   0.416261   -0.43738   0.312034   0.464602    1

1. Use a hypothesis test to see whether all the factors are correlated with the dependent variable ($H_0: \rho = 0$).
2. Look at the correlation between each pair of factors and try to explain it, to predetermine whether the data make sense.
3. Delete the factors whose correlation with the dependent variable is too small.
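
The same kind of correlation matrix can be computed outside Excel. The sketch below uses plain Python; the function names and the tiny data set are hypothetical (they are not the First-city.xls data) and serve only to show the calculation.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Correlation of every pair of columns, as a nested dict."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical housing data (assumed values, for illustration only).
data = {
    "Price": [200, 250, 300, 350],
    "Sq. Feet": [1000, 1500, 1800, 2400],
    "Age": [30, 20, 15, 5],
}
matrix = correlation_matrix(data)
for name in data:
    print(name, [round(matrix[name][other], 3) for other in data])
```

The matrix is symmetric with 1 on the diagonal, which is why Excel prints only the lower triangle.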

Run regression

• “Data Analysis” → “Regression”
• Read the report:

1. Fitness: determine whether the model is good or not.

Regression Statistics
Multiple R           0.903371816
R Square             0.816080638
Adjusted R Square    0.813142629
Standard Error       27350.25168
Observations         319

  – Good fit.

2. Significance of the model:

             df    SS         MS         F          Significance F
Regression   5     1.04E+12   2.08E+11   277.7666   9E-113
Residual     313   2.34E+11   7.48E+08
Total        318   1.27E+12

Run Regression (cont’d)

3. Coefficients:

            Coefficients    Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   31127.60228     9539.669         3.262965   0.001224   12357.6     49897.6     12357.6       49897.6
Sq. Feet    63.0656426      4.017033         15.69956   2.26E-41   55.16183    70.96945    55.16183      70.96945
Age         -1144.436731    112.78           -10.1475   4.19E-21   -1366.34    -922.534    -1366.34      -922.534
Bedrooms    -8410.378875    3002.511         -2.80111   0.00541    -14318      -2502.72    -14318        -2502.72
Bathrooms   3521.954016     1580.997         2.227679   0.026612   411.2265    6632.682    411.2265      6632.682
Garage #    28203.54189     2858.692         9.865889   3.62E-20   22578.85    33828.23    22578.85      33828.23

• The prediction model, with $x_1$ = Sq. Feet, $x_2$ = Age, $x_3$ = Bedrooms, $x_4$ = Bathrooms, $x_5$ = Garage #:

$$\hat{y} = 31127.60 + 63.07 x_1 - 1144.44 x_2 - 8410.38 x_3 + 3521.95 x_4 + 28203.54 x_5$$

• None of the coefficients is likely to be 0 (every p-value is small).
• The meaning of each coefficient: the increment of y when that factor increases by one unit while the other factors are kept the same.
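
As a sketch of how the fitted model is used, the code below plugs one house into the prediction model built from the Coefficients column of the report. The dictionary and function names are hypothetical, and the example house (2000 square feet, 10 years old, 3 bedrooms, 2 bathrooms, 2 garage spaces) is made up.

```python
# Coefficients copied from the "Coefficients" column of the report.
INTERCEPT = 31127.60228
COEFFICIENTS = {
    "Sq. Feet": 63.0656426,
    "Age": -1144.436731,
    "Bedrooms": -8410.378875,
    "Bathrooms": 3521.954016,
    "Garage #": 28203.54189,
}


def predict_price(house):
    """Predicted price: intercept plus coefficient times value for each factor."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS)


# Hypothetical house, for illustration only.
house = {"Sq. Feet": 2000, "Age": 10, "Bedrooms": 3, "Bathrooms": 2, "Garage #": 2}
print(round(predict_price(house), 2))

# Meaning of a coefficient: one more year of age, everything else the same,
# changes the predicted price by the Age coefficient.
older = dict(house, Age=11)
print(round(predict_price(older) - predict_price(house), 2))
```

The second print illustrates the interpretation on the slide: holding the other factors fixed, the difference between the two predictions equals the Age coefficient.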

If you didn’t get a good fit

• Check the correlation matrix:
  – Is there some factor (independent variable) that does not have a strong correlation with the dependent variable?
  – Is there some too strong correlation between two independent variables?
  – If so, adjust the model and run the regression again to see whether the “fitness” improves.
• If it still does not improve, check the p-value for each coefficient of the independent variables ($\beta$s): delete the independent variables whose coefficients are not significant and run the regression again.
• Still cannot get a good fit? Maybe it is because of non-linearity. Check the scatter plots.

How many factors?

• There is always a question about how many independent variables should be incorporated into the model.
  – As many as possible?
  – As few as possible?
• Principle of Occam’s Razor: when two explanations are otherwise equivalent, we prefer the simpler, more parsimonious one.
• Intuition rule: whether the model makes sense!
• Example: use the data Newhomes.xls

Notes on regression coefficients

• A regression coefficient shows the increment of the dependent variable when the independent variable increases by one unit and the other independent variables keep the same values.
• Regression coefficients ($\beta$s) are not the correlation!
• Coefficients ($\beta$s) vary when the regression model changes!

Exercise



Problem 13.4 (Page 538)


Binary Predictors

• What is a binary predictor?
  – A binary predictor has two values, denoting the presence or absence of a condition (usually coded 0 and 1).
• Examples:
  – Male/Female? Male = 1 / Female = 0
  – Employed? Employed = 1 / Unemployed = 0
  – West Coast?
• Regression model with binary predictors:

MPG = 39.5 – 0.00463 Weight + 1.51 Manual

  – Weight = vehicle curb weight as tested (pounds)
  – Manual = 1 if manual transmission, 0 if automatic

Binary Predictor Model

• Actually two models:
  – If Manual = 0 (auto): MPG = 39.5 – 0.00463 Weight
  – If Manual = 1: MPG = 39.5 – 0.00463 Weight + 1.51 = 41.01 – 0.00463 Weight

[Figure: the two fitted lines, labeled Manual = 1 and Manual = 0]

• Meaning of the coefficient for the binary variable: the expected difference in the dependent variable between the case where the binary variable is 1 (true) and the case where it is 0 (false), with the other predictors kept the same.
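
The two-models reading can be checked with a short sketch. The function name `predicted_mpg` and the 3000-pound example weight are assumptions for illustration; the coefficients are the ones in the MPG model above.

```python
def predicted_mpg(weight_lb, manual):
    """MPG = 39.5 - 0.00463 * Weight + 1.51 * Manual, with Manual coded 0 or 1."""
    return 39.5 - 0.00463 * weight_lb + 1.51 * manual


# Same hypothetical 3000-pound vehicle, automatic vs. manual transmission.
auto = predicted_mpg(3000, manual=0)
stick = predicted_mpg(3000, manual=1)
print(round(auto, 2), round(stick, 2), round(stick - auto, 2))
```

Whatever weight is chosen, the difference between the two predictions equals the coefficient of Manual, because the binary predictor only shifts the intercept.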

Underlying Assumption

• The model assumes that the slope is the same for both cases.
• You should test this by running the regression on both groups and seeing whether there is a significant difference.
• Another way to test for possibly different slopes is to use a refined model:

$$\text{MPG} = \beta_0 - \beta_1 \, \text{Weight} + \beta_2 \, \text{Manual} + \beta_3 \, (\text{Manual} \times \text{Weight})$$

  – If $\beta_3$ is not significant (large p-value, likely to be 0), then you are fine with the original model.
• You should do this test if you have binary variables in your model.

More than one binary variable

• Example: Where are you from?
  – Midwest
  – Northeast
  – Southeast
  – West
• How many independent variables?
  – Midwest + Northeast + Southeast + West = 1
  – That is, if Midwest = Northeast = Southeast = 0, then West = 1.
  – At least one is not “independent”.

Bush%   Age65%   Urban%   ColGrad%   Union%   MidWest   Neast   West   Seast
56.5    13.0     69.9     20.4       9.6      0         0       0      1
58.6    5.7      41.5     28.1       21.9     0         0       1      0
51.0    13.0     88.2     24.6       6.4      0         0       1      0
51.3    14.0     49.9     18.4       5.8      0         0       0      1
41.7    10.6     96.7     27.5       16.0     0         0       1      0
50.8    9.7      83.9     34.6       9.0      0         0       1      0
38.4    13.8     95.6     31.6       16.3     0         1       0      0
41.9    13.0     80.0     24.0       13.3     0         1       0      0

• One region variable is not included in the model.
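
Coding a four-category variable with only three 0/1 columns can be sketched as below. The function name `dummy_code` is hypothetical, and leaving out Midwest as the reference category is an assumption made for this example; any one region could be the one left out. The region sequence follows the eight rows of the table.

```python
def dummy_code(values, categories, reference):
    """Code a categorical variable as 0/1 columns, leaving out `reference`.

    An observation in the reference category gets 0 in every kept column.
    """
    kept = [c for c in categories if c != reference]
    return [{c: int(v == c) for c in kept} for v in values]


REGIONS = ["Midwest", "Northeast", "Southeast", "West"]
# Regions of the eight rows shown in the table.
states = ["Southeast", "West", "West", "Southeast",
          "West", "West", "Northeast", "Northeast"]

for row in dummy_code(states, REGIONS, reference="Midwest"):
    print(row)
```

With k categories, only k - 1 binary variables enter the regression; the coefficient of each kept column is then read as the expected difference from the reference category.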

Review of Multivariate Regression

• Start with a question
  – Interest of study (to predict, to dissect)
  – Dependent variable
• Specify independent variables
  – Try to incorporate as many important independent variables as possible.
  – Try to find independent variables that are not too highly correlated with each other.
• Data collection
• Data processing
  – Run a correlation analysis first to get a general idea, then re-select the factors to be incorporated in the model.
  – Run regression and try different combinations to find the best model.
  – The best model should both have a good fit and make good intuitive sense.
  – Pay attention to binary variables.
