III. Model Building

Model building: writing a model that will provide a good fit to a set of data & that will give good estimates of the mean value of y and good predictions of y for given values of the explanatory variables.

• Why is model building important, both in statistical analysis & in analysis in general?

• Theory & empirical research

• “A social science theory is a reasoned and precise speculation about the answer to a research question, including a statement about why the proposed answer is correct.”

• “Theories usually imply several more specific descriptive or causal hypotheses” (King et al., page 19).

• A model is “a simplification of, and approximation to, some aspect of the world.”

• “Models are never literally ‘true’ or ‘false,’ although good models abstract only the ‘right’ features of the reality they represent” (King et al., page 49).

• Remember: social construction of reality (including alleged causal relations); skepticism; rival hypotheses; & contradictory evidence.

• What kinds of evidence (or perspectives) would force me to revise or jettison the model?

Three approaches to model building

• Begin with a linear model for its simplicity & as a rough approximation of the y/x relationships.

• Begin with a curvilinear model to capture the complexities of the y/x relationships.

• Begin with a model that incorporates linearity &/or curvilinearity in y/x relationships according to theory & observation.

• The predominant approach used to be to start with a simple model & to test it against progressively more complex models.

• This approach suffers, however, from the problem associated with omitted variables in the simpler models.

• Increasingly common, then, is the approach of starting with more complex models & testing them against simpler models (Greene, Econometric Analysis, pages 151-52).

• The point of departure for model-building is trying to grasp how the outcome variable y varies as the levels of an explanatory variable change.

• We have to know how to write a mathematical equation to model this relationship.

• In what follows, let’s pretend that we’ve already done careful univariate & bivariate exploratory data analysis via graphs & numerical summaries (although, on the other hand, the exercise requires here & there that we didn’t do such careful groundwork…).

• Suppose we want to model a person’s performance on an exam, y, as a function of a single explanatory variable, x, the person’s amount of study time.

• It may be that the person’s score, y, increases in a straight line as the amount of study time increases from 1 to 6 hours.

• If this were the entire range of x values used to fit the equation, a linear model would be appropriate:

$y = \beta_0 + \beta_1 x + e$

• What, though, if the range of sample hours increased to 10 hours or more: would a straight-line model continue to be satisfactory?

• Quite possibly, the increase in exam score for a unit increase in study time would decrease, causing some amount of curvature in the y/x relationship.

• What kind of model would be appropriate? A second-order polynomial, called a quadratic:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + e$

• $\beta_0$: the y-intercept, i.e. the value of y when x equals 0; shifts the parabola up or down

• $\beta_1$: the slope of y on x when x = 0 (which we don’t really care about & won’t interpret)

• $\beta_2$: negative, degree of downward parabola; positive, degree of upward parabola

• Recall that the model is valid only for the range of x values used to estimate the model.

• What does this imply about predictions for values that exceed this range?

Testing a second-order equation

$H_0: \beta_2 = 0$

$H_a: \beta_2 \neq 0$

If $\beta_2$ tests significant, do not interpret $\beta_1$ (which represents y’s slope on x when x = 0).
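Why the first-order coefficient is the slope only at x = 0 follows from differentiating the quadratic. The short derivation below is a sketch in the notation of the quadratic model; the turning-point symbol $x^{*}$ is introduced here for illustration & does not appear in the original slides.

```latex
\frac{\partial E(y \mid x)}{\partial x} = \beta_1 + 2\beta_2 x,
\qquad
\left.\frac{\partial E(y \mid x)}{\partial x}\right|_{x=0} = \beta_1,
\qquad
x^{*} = -\frac{\beta_1}{2\beta_2}
\quad \text{(turning point, where the slope equals zero)}
```

Because the slope changes with every value of x, $\beta_1$ alone describes the relationship only at x = 0, which is often outside the range of interest.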

• Let’s continue with this model-building strategy but change the substantive topic.

• We’ll focus on the relationship of average hourly wage to a series of explanatory variables (e.g., education & job tenure with the same employer).

• Let’s explore the relationship.

Example:

. use WAGE1, clear

. scatter wage educ || qfit wage educ

[Figure: scatterplot of average hourly earnings against years of education, with quadratic fitted values]

. sparl wage educ

[Figure: wage = -0.905 + 0.541 educ; r² = 0.165, RMSE = 3.378, n = 526]

. sparl wage educ, quad

[Figure: wage = 5.408 - 0.607 educ + 0.049 educ^2; r² = 0.201, RMSE = 3.307, n = 526]

. sparl wage educ, logx

[Figure: wage = -7.460 + 5.330 log educ; r² = 0.128, RMSE = 3.455, n = 524]

. gen educ2=educ^2

. su educ educ2

. reg wage educ educ2

      Source |         SS    df          MS        Number of obs =    526
       Model | 1439.40602     2   719.70301        F(2, 523)     =  65.79
    Residual | 5721.00827   523  10.9388303        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.2010
                                                   Adj R-squared = 0.1980
                                                   Root MSE      = 3.3074

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |  -.6074999   .2414904  -2.52   0.012    -1.08191  -.1330897
       educ2 |   .0490724   .0100718   4.87   0.000    .0292862   .0688587
       _cons |   5.407688   1.458863   3.71   0.000    2.541737   8.273639

• If the second-order term tests significant, we don’t interpret the first-order term.

• Why not?

• Let’s figure out what the second-order term means in this model.

• What do the following graphs say about the relationship of wage to years of education?

. twoway qfitci wage educ

[Figure: quadratic fit of wage on years of education, with 95% CI]

. scatter wage educ || mband wage educ, ba(8)

[Figure: scatterplot of average hourly earnings against years of education, with median bands]

. lowess wage educ, bwidth(.2)

[Figure: lowess smoother of wage on years of education, bandwidth = .2]

Median Band Regression & Lowess Smoothing

• Median band regression (scatter y x || mband y x) & lowess smoothing (lowess y x) are two very helpful tools for detecting (1) how a model fits or doesn’t fit particular segments of the x-values (e.g., poorer to richer persons) & thus (2) non-linearity.

• Hence they’re really useful at all stages of exploratory data analysis.

• Another option: locpoly y x1

• What did the graphs say about the relationship of wage to years of education?

• Let’s answer this question more precisely by predicting the direction & magnitude of the wage/education relationship at specific levels of education, identified via ‘su x, detail’ and/or our knowledge of the issue:

. su educ, d

. lincom educ*9 + educ2*81

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |  -1.492631   1.388045  -1.08   0.283   -4.219459   1.234197

. lincom educ*12 + educ2*144

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |  -.2235668   1.514447  -0.15   0.883   -3.198713    2.75158

. lincom educ*16 + educ2*256

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |   2.842547   1.456761   1.95   0.052   -.0192752    5.70437
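The lincom point estimates can be checked by hand from the coefficients of ‘reg wage educ educ2’. The sketch below is plain Python rather than Stata, & its function names are illustrative only; it reproduces the three point estimates & also reports the marginal slope (b1 + 2·b2·educ) at each level, which is what changes sign along the parabola. Standard errors are not reproduced, since they need the coefficient covariance matrix, which the slides do not show.

```python
# Coefficients copied from the 'reg wage educ educ2' output.
B_EDUC = -0.6074999
B_EDUC2 = 0.0490724


def lincom_estimate(educ):
    """Point estimate of educ*x + educ2*x^2 (the combination passed to lincom)."""
    return B_EDUC * educ + B_EDUC2 * educ ** 2


def marginal_slope(educ):
    """Slope of predicted wage on educ at a given level: b1 + 2*b2*x."""
    return B_EDUC + 2 * B_EDUC2 * educ


def turning_point():
    """Level of educ at which the fitted parabola's slope is zero."""
    return -B_EDUC / (2 * B_EDUC2)


for years in (9, 12, 16):
    print(years, round(lincom_estimate(years), 4), round(marginal_slope(years), 4))
print("turning point (years of education):", round(turning_point(), 2))
```

The turning point falls at roughly six years of education, so over most of the observed range the fitted slope is positive & rising, even though the first-order coefficient is negative.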

How do the predictions relate to the graph?

[Figure: quadratic fit of wage on years of education, with 95% CI]

• Don’t get hung up on every segment of the curve.

• The curve is only an approximation. Thus it may not fit the data well within any particular range (especially where there are few observations).

• Remember, moreover, that Adj R² was just .198 for this model.

• Obviously there are other relevant explanatory variables. Not only do we need to identify them, but we also need to ask: are they independent & linear? independent & curvilinear? or are they interactional?

Interaction Effects

• Interaction: the effect of a 1-unit change in one explanatory variable depends on the level of another explanatory variable.

• With interaction, both the y-intercept & the regression slope change; i.e. the regression lines are not parallel.

• E.g., how do education & job tenure interact with regard to predicted wage?

. gen educXtenure=educ*tenure

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$

. reg wage educ tenure educXtenure

      Source |         SS    df          MS        Number of obs =    526
       Model | 2296.41715     3  765.472384        F(3, 522)     =  82.15
    Residual | 4863.99714   522  9.31800218        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.3207
                                                   Adj R-squared = 0.3168
                                                   Root MSE      = 3.0525

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |   .4265947   .0610327   6.99   0.000    .3066948   .5464946
      tenure |  -.0822392   .0737709  -1.11   0.265   -.2271635   .0626851
 educXtenure |   .0225057   .0059134   3.81   0.000    .0108887   .0341228
       _cons |  -.4612881   .7832198  -0.59   0.556   -1.999938   1.077362

Hypothesis test

$H_0: \beta_3 = 0$

$H_a: \beta_3 \neq 0$

(where $\beta_3$ is the coefficient on $x_1 x_2$)

• Let’s interpret the model.

• If the interaction term tests significant, we don’t interpret its base variables.

• Why not? Each base variable represents its y/x slope when the other x = 0. We don’t care about this.

• To interpret the interaction term, we use ‘su x1, d’ & our knowledge of the subject to identify key levels of educ & tenure (or use one SD above mean, mean, & one SD below mean).

• Then we predict the slope-effect of educXtenure on wage at the specified levels, as follows:

• How the interaction of mean education with varying levels of tenure relates to average hourly wage:

. lincom educ + (educXtenure*2)

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |  -.0372278    .062391  -0.60   0.551   -.1597961   .0853406

. lincom educ + (educXtenure*10)

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |    .142818   .0221835   6.44   0.000    .0992382   .1863978

. lincom educ + (educXtenure*18)

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |   .3678751   .0503567   7.31   0.000    .2689484   .4668019

• How the interaction of mean tenure with varying levels of education relates to average hourly wage:

. lincom tenure + (8*educXtenure)

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |   .0978065   .0303746   3.22   0.001    .0381351    .157478

. lincom tenure + (12*educXtenure)

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |   .1878294   .0184755  10.17   0.000    .1515338    .224125

. lincom tenure + (20*educXtenure)

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
         (1) |   .3678751   .0503567   7.31   0.000    .2689484   .4668019
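The conditional slopes that lincom reports can be recomputed from the coefficient table of ‘reg wage educ tenure educXtenure’. The sketch below is plain Python with illustrative function names (they are not Stata commands). It computes the slope of wage on tenure at a given level of education, b_tenure + b_educXtenure × educ, & the slope of wage on educ at a given level of tenure, b_educ + b_educXtenure × tenure. One thing worth checking when reading the printed output: the point estimates -.0372, .1428 & .3679 shown under the ‘lincom educ + …’ commands coincide with the tenure-based combination evaluated at 2, 10 & 20, which suggests that those outputs were generated with tenure as the base term.

```python
# Coefficients copied from the 'reg wage educ tenure educXtenure' output.
B_EDUC = 0.4265947
B_TENURE = -0.0822392
B_EDUCXTENURE = 0.0225057


def slope_of_tenure(educ):
    """Slope of wage on tenure when education is held at `educ` years."""
    return B_TENURE + B_EDUCXTENURE * educ


def slope_of_educ(tenure):
    """Slope of wage on educ when tenure is held at `tenure` years."""
    return B_EDUC + B_EDUCXTENURE * tenure


print("slope of tenure at educ = 8, 12, 20:")
for educ in (8, 12, 20):
    print(educ, round(slope_of_tenure(educ), 4))

print("slope of educ at tenure = 2, 10, 18:")
for tenure in (2, 10, 18):
    print(tenure, round(slope_of_educ(tenure), 4))
```

Either way the substantive reading is the same: each slope grows as the level of the other variable rises, because the interaction coefficient is positive.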

• With significant interaction, to repeat, both the regression coefficient & the y-intercept change as the levels of the second interacting variable change.

• That is, the regression slopes are unequal. What does this mean in the model for average hourly wage?

Our interaction model yielded an Adj R² of .317.

Given the non-linearity we’ve uncovered, could we increase the explanatory power by combining quadratic & interaction terms?

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + e$

. reg wage educ tenure educXtenure educ2 tenure2

      Source |         SS    df          MS        Number of obs =    526
       Model | 2619.24058     5  523.848116        F(5, 520)     =  59.98
    Residual | 4541.17371   520  8.73302637        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.3658
                                                   Adj R-squared = 0.3597
                                                   Root MSE      = 2.9552

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |  -.7069382   .2269283  -3.12   0.002   -1.152747  -.2611293
      tenure |  -.0072781   .0848573  -0.09   0.932   -.1739834   .1594272
 educXtenure |   .0263957    .005787   4.56   0.000    .0150269   .0377645
       educ2 |   .0470478   .0091265   5.16   0.000    .0291184   .0649771
     tenure2 |  -.0050847   .0016688  -3.05   0.002   -.0083632  -.0018062
       _cons |   5.763382   1.464726   3.93   0.000    2.885874   8.640889

• Let’s assess the model’s fit.

• Let’s conduct a test of nested models, comparing this new, ‘full’ model to each of the previous, ‘reduced’ models.

• Did adding educXtenure, educ2 & tenure2 boost the model’s variance-explaining power by a statistically significant margin?

. test educXtenure educ2 tenure2

 ( 1) educXtenure = 0
 ( 2) educ2 = 0
 ( 3) tenure2 = 0

       F(3, 520) = 17.47
       Prob > F  = 0.0000

• Did adding educ2 & tenure2 boost the model’s variance-explaining power by a statistically significant margin over the interaction model?

. test educ2 tenure2

 ( 1) educ2 = 0
 ( 2) tenure2 = 0

       F(2, 520) = 18.48
       Prob > F  = 0.0000
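The F statistic that ‘test educ2 tenure2’ reports can be recovered from the residual sums of squares of the two nested regressions. The sketch below is plain Python (the function name is illustrative); it applies the usual nested-model formula, F = [(RSS_reduced − RSS_full)/q] / [RSS_full/df_full], to the interaction model (reduced) & the quadratic-plus-interaction model (full).

```python
def nested_f(rss_reduced, rss_full, num_restrictions, df_resid_full):
    """F statistic for testing that `num_restrictions` coefficients are jointly zero."""
    numerator = (rss_reduced - rss_full) / num_restrictions
    denominator = rss_full / df_resid_full
    return numerator / denominator


# Residual SS copied from the two regression tables:
#   reduced: reg wage educ tenure educXtenure               -> 4863.99714
#   full:    reg wage educ tenure educXtenure educ2 tenure2 -> 4541.17371 (520 df)
f_stat = nested_f(4863.99714, 4541.17371, num_restrictions=2, df_resid_full=520)
print(round(f_stat, 2))  # 18.48, matching F(2, 520) in the 'test' output
```

The calculation only works because both models use the same 526 observations & the same outcome variable, which is the point of the validity conditions for nested tests.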

Valid testing of nested models

To conduct a valid test of nested models:

• the number of observations for both the complete & reduced models must be equal;

• the functional form of y must be the same (e.g., we can’t compare outcome variable ‘wage’ to outcome variable ‘log-wage’).

Comparing non-nested models

• How do we compare non-nested models (e.g., models with the same number of explanatory variables), or nested models that don’t meet the criteria for comparative testing?

• Use either the AIC or BIC statistics: the smaller the score, the better the model fits.

• Download the ‘fitstat’ command (see Long/Freese, Regression Models for Categorical Dependent Variables).

. reg science read write math female

. fitstat, saving(model1) bic

. reg science read write

. fitstat, using(model1) bic

• The output tells whether or not the ‘current’ model is supported &, if it is supported, to what degree.

• And we can display AIC &/or BIC in ‘estimates table.’

. reg science read write math

. estimates store model1

. estimates table model1, stats(N df_m adj_r2 aic bic)

• ‘ereturn list’ provides the codes.

• For BIC or AIC, “The upshot is that ex post, neither model is discarded; we have merely revised our assessment of the comparative likelihood of the two in the face of the sample data” (Greene, Econometric Analysis, page 153).

• That is, the Bayesian approach compares “the two hypotheses rather than testing for the validity of one over the other” (Greene, Econometric Analysis, page 153).
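To make “the smaller the score, the better” concrete, the sketch below computes AIC & BIC for two of the wage models from their residual sums of squares. It is plain Python & rests on stated assumptions: a Gaussian log-likelihood with the error variance estimated as RSS/N, AIC = −2·lnL + 2k, BIC = −2·lnL + k·ln(N), & k counted as the number of estimated coefficients including the constant. Other software or add-on commands may scale these statistics differently, so the absolute values need not match any particular Stata output; only the comparison between models is the point here.

```python
import math


def gaussian_loglik(rss, n):
    """Log-likelihood of a linear regression with sigma^2 estimated as rss / n."""
    return -0.5 * n * (1 + math.log(2 * math.pi) + math.log(rss / n))


def aic(rss, n, k):
    """Akaike information criterion; k = number of estimated coefficients."""
    return -2 * gaussian_loglik(rss, n) + 2 * k


def bic(rss, n, k):
    """Bayesian (Schwarz) information criterion."""
    return -2 * gaussian_loglik(rss, n) + k * math.log(n)


N = 526
# Residual SS from the slides: quadratic model (educ, educ2, _cons) and
# interaction model (educ, tenure, educXtenure, _cons).
aic_quad, bic_quad = aic(5721.00827, N, 3), bic(5721.00827, N, 3)
aic_inter, bic_inter = aic(4863.99714, N, 4), bic(4863.99714, N, 4)

print("quadratic:   AIC", round(aic_quad, 1), "BIC", round(bic_quad, 1))
print("interaction: AIC", round(aic_inter, 1), "BIC", round(bic_inter, 1))
```

Neither of these two models is nested in the other, so an F test does not apply; both criteria nonetheless favor the interaction model, because its much smaller residual sum of squares outweighs the penalty for one extra coefficient.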

Graphing the model

• Graphing multiple variables from regression models requires 3-D graphing capabilities (see, e.g., Systat, SAS).

• Here’s Stata’s crude version:

. gr3 wage educ educ2

[Figure: 3-D plot of wage against educ & educ2]

. gr3 wage tenure tenure2

[Figure: 3-D plot of wage against tenure & tenure2]

. twoway qfitci wage educ, bc(yellow)

[Figure: quadratic fit of wage on years of education, with 95% CI]

. twoway qfitci wage tenure, bc(red)

[Figure: quadratic fit of wage on years with current employer, with 95% CI]

• What do the graphs tell us about the relationship of wage to years of education & to years of job tenure?

• Using lincom to predict the slope for wage at specific values of the interaction variables is important, too.

• Adj R² was .360. Can we improve the model by adding dummy variables?

• Let’s explore the possibility for females versus males; nonwhites versus whites; & urban (smsa) versus rural.

• But first, a quick detour. We find evidence that 52% of the population of interest is female & 16% is nonwhite: are the sample percentages significantly different from these population benchmarks?

• How do we statistically assess these possibilities?

• We’ll first check the confidence intervals.

. ci female nonwhite, binomial

                                              -- Binomial Exact --
    Variable |   Obs       Mean   Std. Err.   [95% Conf. Interval]
      female |   526   .4790875    .021782    .4356663   .5227448
    nonwhite |   526   .1026616   .0132339    .0780672   .1318238

• Next we’ll try prtest to test the proportions.

. prtest female=.52

One-sample test of proportion              female: Number of obs = 526

    Variable |       Mean   Std. Err.   [95% Conf. Interval]
      female |   .4790875    .021782    .4363956   .5217793

Ho: proportion(female) = .52

 Ha: female < .52       Ha: female != .52      Ha: female > .52
   z = -1.878             z = -1.878             z = -1.878
 P < z = 0.0302         P > |z| = 0.0604       P > z = 0.9698

. prtest nonwhite=.16

One-sample test of proportion              nonwhite: Number of obs = 526

    Variable |       Mean   Std. Err.   [95% Conf. Interval]
    nonwhite |   .1026616   .0132339    .0767235   .1285996

Ho: proportion(nonwhite) = .16

 Ha: nonwhite < .16     Ha: nonwhite != .16    Ha: nonwhite > .16
   z = -3.587             z = -3.587             z = -3.587
 P < z = 0.0002         P > |z| = 0.0003       P > z = 0.9998
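The z statistics above come from the standard one-sample test of a proportion, in which the standard error is computed under the null value rather than from the sample proportion. The sketch below is plain Python with an illustrative function name; it takes the sample proportions printed by prtest & recovers z & the two-sided p-value using the normal distribution (via math.erfc).

```python
import math


def one_sample_proportion_test(p_hat, n, p_null):
    """Return (z, two-sided p-value) for Ho: proportion = p_null."""
    se_null = math.sqrt(p_null * (1 - p_null) / n)  # SE evaluated under Ho
    z = (p_hat - p_null) / se_null
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided


z_female, p_female = one_sample_proportion_test(0.4790875, 526, 0.52)
z_nonwhite, p_nonwhite = one_sample_proportion_test(0.1026616, 526, 0.16)

print("female:   z =", round(z_female, 3), " p =", round(p_female, 4))
print("nonwhite: z =", round(z_nonwhite, 3), " p =", round(p_nonwhite, 4))
```

This also explains why the prtest interval (built from the sample-based standard error) & the test (built from the null-based standard error) can give slightly different impressions for a borderline case such as female.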

• Let’s get back to our regression model, first by exploring the proposed new variables.

. grmeanby female, su(wage)

[Figure: means of wage (average hourly earnings) by female]

. grmeanby nonwhite, su(wage)

[Figure: means of wage (average hourly earnings) by nonwhite]

. grmeanby smsa, su(wage)

[Figure: means of wage (average hourly earnings) by smsa]

. tab female, su(wage)

      =1 if |  Summary of average hourly earnings
     female |     Mean   Std. Dev.   Freq.
------------+-----------------------------------
          0 |      7.1        4.2     274
          1 |      4.6        2.5     252
------------+-----------------------------------
      Total |      5.9        3.7     526

• The sample’s average wage disparities are pronounced for females versus males, & notable but less pronounced for nonwhites versus whites & for urban (smsa) versus rural.

• We should examine the wage distribution for each of these categorical, binary variables. Let’s illustrate this for females versus males:

. bys female: su wage

-> female = 0

    Variable |   Obs       Mean   Std. Dev.    Min     Max
-------------+-------------------------------------------------------
        wage |   274   7.099489   4.160858     1.5   24.98

-> female = 1

    Variable |   Obs       Mean   Std. Dev.    Min     Max
-------------+-------------------------------------------------------
        wage |   252   4.587659   2.529363     .53   21.63

. gr box wage, over(female, total) marker(1, mlabel(id))

[Figure: box plots of wage for female = 0, female = 1, & total]

. table female, contents(mean wage med wage sd wage min wage max wage)

=1 if female | mean(wage)  med(wage)   sd(wage)  min(wage)  max(wage)
           0 |        7.1          6   4.160858        1.5         25
           1 |        4.6        3.8   2.529363        .53         22

. ttest wage, by(female) unequal

Two-sample t test with unequal variances

       Group |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
           0 |   274   7.099489   .2513666    4.160858    6.604626   7.594352
           1 |   252   4.587659   .1593349    2.529363    4.273855   4.901462
    combined |   526   5.896103   .1610262    3.693086    5.579768   6.212437
        diff |          2.51183   .2976118                1.926971    3.09669

Satterthwaite's degrees of freedom: 456.327

Ho: mean(0) - mean(1) = diff = 0

 Ha: diff < 0           Ha: diff != 0          Ha: diff > 0
   t = 8.4400             t = 8.4400             t = 8.4400
 P < t = 1.0000         P > |t| = 0.0000       P > t = 0.0000
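The unequal-variance t statistic & Satterthwaite’s degrees of freedom can be rebuilt from the group summaries alone. The sketch below is plain Python (the function name is illustrative) & uses the means, standard deviations & group sizes printed in the ttest output.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, SE of the difference, Satterthwaite df) for two independent groups."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1's mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2's mean
    se_diff = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / se_diff
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se_diff, df


t_stat, se_diff, df = welch_t(7.099489, 4.160858, 274, 4.587659, 2.529363, 252)
print("t =", round(t_stat, 2), " SE =", round(se_diff, 4), " df =", round(df, 1))
```

The ‘unequal’ option matters here because the male wage standard deviation (4.16) is well above the female one (2.53), so pooling the two variances would not be appropriate.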

. lowess wage educ, bwidth(.2) ml(female)

[Figure: lowess smoother of wage on years of education, bandwidth = .2, points labeled by female (0/1)]

. scatter wage educ || qfit wage educ, by(female)

[Figure: average hourly earnings against years of education with quadratic fitted values, graphs by female]

. scatter wage educ || mband wage educ, by(female)

[Figure: average hourly earnings against years of education with median bands, graphs by female]

. reg wage educ tenure educXtenure educ2 tenure2 female nonwhite smsa

      Source |         SS    df          MS        Number of obs =    526
       Model |   3044.645     8  380.580624        F(8, 517)     =  47.81
    Residual |  4115.7693   517  7.96086904        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.4252
                                                   Adj R-squared = 0.4163
                                                   Root MSE      = 2.8215

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |  -.6354087   .2180251  -2.91   0.004   -1.063733  -.2070846
      tenure |  -.0437508   .0815164  -0.54   0.592   -.2038949   .1163932
 educXtenure |   .0274386   .0055418   4.95   0.000    .0165515   .0383258
       educ2 |   .0410706   .0087769   4.68   0.000    .0238277   .0583134
     tenure2 |  -.0050909   .0015958  -3.19   0.002   -.0082259  -.0019559
      female |  -1.727369    .254498  -6.79   0.000   -2.227347  -1.227392
    nonwhite |  -.1551834   .4090206  -0.38   0.705   -.9587302   .6483634
        smsa |    .834631   .2822579   2.96   0.003    .2801175   1.389145
       _cons |   6.215331   1.405937   4.42   0.000     3.45328   8.977382

Let’s re-estimate the model, without nonwhite.

. reg wage educ tenure educXtenure educ2 tenure2 female smsa

      Source |         SS    df          MS        Number of obs =    526
       Model | 3043.49906     7   434.78558        F(7, 518)     =  54.71
    Residual | 4116.91523   518   7.9477128        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.4250
                                                   Adj R-squared = 0.4173
                                                   Root MSE      = 2.8192

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |  -.6313552   .2175832  -2.90   0.004   -1.058809  -.2039013
      tenure |  -.0463398   .0811631  -0.57   0.568   -.2057891   .1131095
 educXtenure |   .0275878   .0055232   4.99   0.000    .0167372   .0384385
       educ2 |   .0409219   .0087609   4.67   0.000    .0237105   .0581332
     tenure2 |  -.0050603   .0015924  -3.18   0.002   -.0081887  -.0019319
      female |  -1.726572    .254279  -6.79   0.000   -2.226116  -1.227027
        smsa |   .8338924   .2820179   2.96   0.003     .279853   1.387932
       _cons |    6.17466   1.400685   4.41   0.000    3.422939   8.926381

• There are no notable changes in the coefficients. Let’s conduct a nested model test:
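The slides do not show the output of this nested test, but its F statistic can be worked out from the two regression tables. The sketch below is plain Python (not Stata output, & the variable names are illustrative); it compares the model with nonwhite (full) to the model without it (reduced). Because only one coefficient is restricted, the F statistic should equal the square of nonwhite’s t statistic in the full model, which gives a built-in check.

```python
# Residual SS and residual df copied from the two regression tables.
RSS_FULL = 4115.7693      # model including nonwhite, 517 residual df
RSS_REDUCED = 4116.91523  # model without nonwhite
DF_RESID_FULL = 517
NUM_RESTRICTIONS = 1

f_stat = ((RSS_REDUCED - RSS_FULL) / NUM_RESTRICTIONS) / (RSS_FULL / DF_RESID_FULL)

# t statistic for nonwhite in the full model: coefficient / standard error.
t_nonwhite = -0.1551834 / 0.4090206

print("F(1, 517) =", round(f_stat, 3))
print("t^2       =", round(t_nonwhite ** 2, 3))
```

An F this far below 1 is consistent with the p-value of 0.705 reported for nonwhite, so dropping the variable does not significantly worsen the fit.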

• Let’s say we find evidence that the regression slope for wage on female versus male may not be the same in urban versus rural areas.

• That is, there may be a statistically significant femaleXsmsa interaction.

• Let’s find out:

• Normally, we create the interaction variable femaleXsmsa & write the model as follows:

. reg wage educ tenure educXtenure educ2 tenure2 female smsa femaleXsmsa

• Stata, however, lets us do the work on the fly:

. xi:reg wage educ tenure educXtenure educ2 tenure2 i.female*i.smsa

i.female          _Ifemale_0-1      (naturally coded; _Ifemale_0 omitted)
i.smsa            _Ismsa_0-1        (naturally coded; _Ismsa_0 omitted)
i.fem~e*i.smsa    _IfemXsms_#_#     (coded as above)

      Source |         SS    df          MS        Number of obs =    526
       Model | 3058.50881     8  382.313602        F(8, 517)     =  48.19
    Residual | 4101.90548   517  7.93405315        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.4271
                                                   Adj R-squared = 0.4183
                                                   Root MSE      = 2.8167

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |  -.6125454   .2178258  -2.81   0.005   -1.040478  -.1846129
      tenure |   -.049629   .0811286  -0.61   0.541   -.2090112   .1097533
 educXtenure |   .0278069   .0055208   5.04   0.000     .016961   .0386528
       educ2 |   .0399927   .0087794   4.56   0.000     .022745   .0572405
     tenure2 |  -.0050664   .0015911  -3.18   0.002   -.0081922  -.0019407
  _Ifemale_1 |   -1.18429   .4690306  -2.52   0.012    -2.10573    -.26285
    _Ismsa_1 |   1.193317   .3842972   3.11   0.002    .4383412   1.948293
_IfemXsms_~1 |  -.7594022   .5521189  -1.38   0.170   -1.844075   .3252702
       _cons |   5.841761   1.420256   4.11   0.000    3.051579   8.631943

• We fail to reject the null hypothesis for femaleXsmsa.

• Next we hypothesize that average hourly wage varies by economic sector.

• So let’s add to the model a series of dummy variables for the economic sectors, the comparison sector being manufacturing.

• We hypothesize that the regression slope is the same for each sector but the y-intercept varies:

. reg wage educ tenure educXtenure educ2 tenure2 female smsa construc ndurman trcommpu trade services profserv

      Source |         SS    df          MS        Number of obs =    526
       Model | 3284.61293    13  252.662533        F(13, 512)    =  33.38
    Residual | 3875.80136   512  7.56992453        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.4587
                                                   Adj R-squared = 0.4450
                                                   Root MSE      = 2.7513

• Economic sectors are compared to manufacturing.

        wage |      Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
        educ |  -.6367266   .2137267  -2.98   0.003   -1.056616  -.2168374
      tenure |  -.0637539   .0796214  -0.80   0.424   -.2201786   .0926709
 educXtenure |   .0266911   .0054182   4.93   0.000    .0160463   .0373358
       educ2 |   .0413504   .0086545   4.78   0.000    .0243477   .0583532
     tenure2 |  -.0044319   .0015651  -2.83   0.005   -.0075067   -.001357
      female |  -1.682962   .2608887  -6.45   0.000   -2.195505  -1.170418
        smsa |     .84172   .2768325   3.04   0.002    .2978526   1.385587
    construc |  -.8289172    .643652  -1.29   0.198   -2.093441   .4356068
     ndurman |   -1.13003   .4770815  -2.37   0.018   -2.067308  -.1927519
    trcommpu |   -1.60342   .6639094  -2.42   0.016   -2.907742  -.2990986
       trade |  -1.910428   .3891094  -4.91   0.000   -2.674875   -1.14598
    services |  -1.928182    .503255  -3.83   0.000   -2.916881  -.9394828
    profserv |  -.7975599   .4156892  -1.92   0.056   -1.614226   .0191064
       _cons |   7.426713   1.399124   5.31   0.000    4.677982   10.17544

• Testing the nested model.

• We test not the individual significance but the joint significance of the dummy variable series:

. testparm construc-profserv

 ( 1) construc = 0
 ( 2) ndurman = 0
 ( 3) trcommpu = 0
 ( 4) trade = 0
 ( 5) services = 0
 ( 6) profserv = 0

       F(6, 512) = 5.31
       Prob > F  = 0.0000

• ‘testparm’ (test parameters) allows us to enter the first dummy variable in the series, a dash, & the last dummy variable in the series.

• ‘test’ requires that each dummy variable in the series be entered.

• The model, then, has greatly improved: Adj R² has reached .445, & the other, more important fit indicators look fine.

• But is the slope coefficient for wage really the same for females & males?

• Let’s test the assumption of equal slopes.

• We have to estimate a new model, this time interacting the dummy variable female with each of the other explanatory variables.

• Why choose ‘female’ as the variable to interact with the others?

• We could create an interaction variable corresponding to female’s interaction with each explanatory variable: femaleXeduc, femaleXtenure, femaleXprofserv, etc.

• Better: create an interaction variable only for those main effect variables that for good reason you think should be expected to vary by gender.

• Either way, Stata again allows us to do the work on the fly, but be sure you know how formally to write such a model.

For the sake of it, we’ll do a ‘full interaction’ model.

xi:reg wage i.female*educ i.female*tenure i.female*educXtenure i.female*educ2 i.female*tenure2 i.female*smsa i.female*construc i.female*ndurman i.female*trcommpu i.female*trade i.female*services i.female*profserv

Note: we could have used this approach to create educXtenure, but for pedagogical reasons we created this variable the formal way.

Let’s conduct a nested model test: from ‘female’ to the last ‘femX…’ interaction.

. testparm _Ifemale_1- _IfemXprofs_1

 ( 1) _Ifemale_1 = 0
 ( 2) _IfemXeduc_1 = 0
 ( 3) _IfemXtenur_1 = 0
 ( 4) _IfemXeducX_1 = 0
 ( 5) _IfemXeduc2_1 = 0
 ( 6) _IfemXtenura1 = 0
 ( 7) _IfemXsmsa_1 = 0
 ( 8) _IfemXconst_1 = 0
 ( 9) _IfemXndurm_1 = 0
 (10) _IfemXtrcom_1 = 0
 (11) _IfemXtrade_1 = 0
 (12) _IfemXservi_1 = 0
 (13) _IfemXprofs_1 = 0
       Constraint 1 dropped

       F(12, 500) = 2.22
       Prob > F   = 0.0099

• Our conclusion?

• We reject the null hypothesis: there is indeed statistically significant evidence that the wage slopes differ for females vs. males.

• Substantive meaning?

Don’t take linearity & uniform slopes for granted

• Key lesson: we should test the baseline notions of linearity & uniform slopes, & when necessary revise the model accordingly.

• But we don’t have to do a ‘full interaction’ model.

Note: The econometric literature discusses the detection of significantly different slopes in terms of the Chow Test. Stata’s joint ‘test’ procedure is equivalent to the Chow Test (type ‘findit Chow Test’, which will lead you to Stata FAQs on the subject).

• See Wooldridge, Introductory Econometrics, pp. 237-240; & Stata’s online FAQs.

• A colleague of yours inspects your statistical work & says “Nice try, but you goofed with regard to the outcome variable.”

• Where did we go wrong? Let’s take a look & see.

. histogram wage, norm plotr(c(navy))

[Figure: histogram of average hourly earnings with normal curve overlaid]

• Average hourly wage is highly right skewed.

• How should we address this problem?

• Let’s begin by using some helpful Stata tools: qladder & ladder.

• For qladder & ladder, the null hypothesis is that each displayed, transformed distribution is normal.

• The alternative hypothesis is that it isn’t normal.

• So, in ladder & qladder, we want to fail to reject the null hypothesis to obtain an effective normalizing transformation.

. qladder wage

[Figure: quantile-normal plots of average hourly earnings by transformation; panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic]

. ladder wage

Transformation        formula          chi2(2)    P(chi2)
cubic                 wage^3                 .      0.000
square                wage^2                 .      0.000
raw                   wage                   .      0.000
square-root           sqrt(wage)             .      0.000
log                   log(wage)          13.99      0.001
reciprocal root       1/sqrt(wage)           .      0.000
reciprocal            1/wage                 .      0.000
reciprocal square     1/(wage^2)             .      0.000
reciprocal cubic      1/(wage^3)             .      0.000

• qladder’s ‘log’ looked helpful, but nothing looks helpful (i.e. insignificant) in ladder. Let’s explore:

. sparl wage educ

[Figure: wage = -0.905 + 0.541 educ; r² = 0.165, RMSE = 3.378, n = 526]

. sparl wage educ, logy

[Figure: log wage = 0.584 + 0.083 educ; r² = 0.186, RMSE = 0.480, n = 526]

• We’ll opt for a log transformation, which is the most basic way of pulling in a highly right-skewed distribution:

. gen lwage=ln(wage)

. su wage lwage

• Note: in Stata, log(wage) & ln(wage) are equivalent.

. histogram wage, norm plotr(c(navy))

[Figure: histogram of average hourly earnings with normal curve overlaid]

. histogram lwage, norm plotr(c(navy))

[Figure: histogram of log(wage) with normal curve overlaid]

• Recall that log transformations require quantitative, ratio variables with positive values, plus not ‘too many’ zero values (see Wooldridge) & ideally a ratio between highest & lowest values of at least 10.

• Are the results of the log transformation satisfactory? Why, or why not?

• How do the models fit (again recalling our discussion of comparing non-nested models)? Interpretation?

• Quantitative explanatory variables: every per-unit change in x multiplies average hourly wage by …, on average, holding the other variables constant.

• Categorical explanatory variables: e.g., having a job in services multiplies average hourly wage by …, on average, holding the other variables constant.

• Now we need to use lincom to predict wage, or the direction & magnitude of its slope, at specific levels of key explanatory variables (or one SD above mean, mean, & one SD below mean).

• We’ll leave that for you to do.
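The multiplicative reading of a log-outcome model comes from exponentiating the coefficient: with log(wage) on the left-hand side, a one-unit change in x multiplies predicted wage by exp(b). The sketch below is plain Python with an illustrative function name; it uses the slope from the bivariate ‘sparl wage educ, logy’ fit (0.083), not a coefficient from the full multivariate model, which the slides leave as an exercise.

```python
import math


def wage_multiplier(coef, change=1.0):
    """Factor by which predicted wage is multiplied when x rises by `change` units."""
    return math.exp(coef * change)


B_EDUC_LOGY = 0.083  # slope from 'sparl wage educ, logy': log wage = 0.584 + 0.083 educ

one_year = wage_multiplier(B_EDUC_LOGY)
four_years = wage_multiplier(B_EDUC_LOGY, change=4)

print("one more year of education multiplies wage by", round(one_year, 4))
print("i.e. about", round(100 * (one_year - 1), 1), "percent")
print("four more years multiply wage by", round(four_years, 4))
```

For small coefficients the percentage change is close to 100 × b, but the exact multiplier exp(b) is the safer figure to report, especially for dummy variables with larger coefficients.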

Some summary points

Remember that:

(1) log transformations require quantitative, ratio variables with positive values, plus not ‘too many’ zero values & ideally a ratio of at least 10 between the highest & lowest values;

(2) quadratic (& similar) transformations require quantitative, interval or ratio variables, & not necessarily positive values; &

(3) in any instance, matters of theory, interpretability, & common sense may lead us not to transform a variable, even though doing so may make sense on purely statistical grounds.

• Furthermore, the assumptions of linearity & uniform slopes must be tested.

• Compare nested models, using AIC or BIC when the two models’ numbers of observations are unequal or when the number of explanatory variables is equal.

• Don’t make predictions beyond the range of the model’s x-values.

• Don’t overfit a model to a data sample: most samples have their quirks, & overfitting a model to such quirks comes at the expense of the model’s generality.

• Thus, don’t go overboard with transforming variables & with trying to boost R².

• And don’t forget median band regression (scatter y x || mband y x, bands(#)) & lowess smoothing (lowess y x, bandwidth(.#)) (see also locpoly): these are helpful tools at all stages of y/x data analysis.

• We’ll be doing more transformations as part of regression diagnostics (i.e. assessing & correcting violations of regression’s statistical assumptions & dealing with outliers that distort the results).

• One final question: can we validly compare the magnitude of slope coefficients within a regression model?

• Usually not, because their metrics are typically different (e.g., years of education, score on a mental health scale, & quantitative versus categorical variables in regard to average hourly wage).

Standardized regression coefficients

• We can, however, validly compare the magnitude of the slope coefficients if we standardize them, which expresses them in standard deviation units.

• This is easy to do in Stata:

. reg y x1 x2 x3, beta

      Source |         SS    df          MS        Number of obs =    526
       Model |  2194.1116     3  731.370532        F(3, 522)     =  76.87
    Residual | 4966.30269   522  9.51398984        Prob > F      = 0.0000
       Total | 7160.41429   525  13.6388844        R-squared     = 0.3064
                                                   Adj R-squared = .30

        wage |      Coef.  Std. Err.      t   P>|t|        Beta
        educ |   .5989651   .0512835  11.68   0.000    .4490953
       exper |   .0223395   .0120568   1.85   0.064    .0820981
      tenure |   .1692687   .0216446   7.82   0.000    .3311255
       _cons |  -2.872735   .7289643  -3.94   0.000           .

• For every standard deviation increase in education, wage increases by .45 standard deviations on average, holding the other variables constant.

• For every standard deviation increase in experience, wage increases by .08 standard deviations on average, holding the other variables constant.

• For every standard deviation increase in tenure, wage increases by .33 standard deviations on average, holding the other variables constant.
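How the Beta column relates to the ordinary coefficients can be written in one line. The formula below is the standard textbook relation; the symbols $s_{x_j}$ & $s_y$ (sample standard deviations of the explanatory variable & of the outcome) are notation introduced here, not taken from the slides.

```latex
\hat{\beta}^{\,*}_j \;=\; \hat{\beta}_j \,\frac{s_{x_j}}{s_y}
```

As a rough check under that relation: with the combined standard deviation of wage reported earlier by ttest (3.693), the educ coefficient of .599 & Beta of .449 imply a standard deviation of education of about 2.77 years. That value is implied by the formula rather than shown in the output above.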

• Standardizing regression coefficients can be quite useful & is commonly done, but it does have serious limitations:

• The standardized values depend on the particular sample: comparisons can’t be made across samples.

• The standardized values depend on which other variables are included in the equation: change one or more of the variables & the standardized values change.

• Comparisons of standardized coefficients, then, can’t be made across regression equations.

• Standardization makes no sense for interpreting categorical explanatory variables: there’s no standard deviation change in, e.g., gender, ethnicity, or religion, so don’t bother trying to interpret categorical variables when they’re included in a standardized regression model (but, rather, use standardization to gauge the relative effect of categorical as well as quantitative explanatory variables on the outcome variable).

• And the interpretation of standardized interaction terms can be deceptive.

• Here’s a convenient (downloadable) command to obtain the standardized coefficients. After estimating the model:

. listcoef, std

• See Long/Freese for details.

Combining graph curves in Stata

• Sometimes it may be helpful to combine straight & curved fits on twoway scatterplots.

• Here are a couple of examples.

. scatter write math || lfit science math || fpfit science math

[Figure: writing score against math score, with linear fitted values & fractional-polynomial predicted science]

. scatter write math || lfit science math || mband science math

[Figure: writing score against math score, with linear fitted values & median bands]

Summary

• What’s a theory? What does it involve?

• What’s a model?

• Interplay of theory & empirical research?

• Approaches to model building? Fundamental principles of model building?