2. Linear dependent variables
2.1 The basic idea underlying linear regression
2.2 Single variable OLS
2.3 Correctly interpreting the coefficients
2.4 Examining the residuals
2.5 Multiple regression
2.6 Heteroscedasticity
2.7 Correlated errors
2.8 Multicollinearity
2.9 Outlying observations
2.10 Median regression
2.11 “Looping”
2.1 The basic idea underlying linear regression

 A simple linear regression aims to characterize the relation between a dependent variable and one independent variable using a straight line
 You have already seen how to fit a line between two variables using the scatter command
 Linear regression does the same thing, but it can be extended to include multiple independent variables
2.1 The basic idea

 For example, you predict that larger companies usually pay higher fees
 You can formalize the effect of company size on predicted fees using a simple equation:

    Expected Fees = a0 + a1 Size

 The parameter a0 represents what fees are expected to be in the case that Size = 0.
 The parameter a1 captures the impact of an increase in Size on expected fees.
2.1 The basic idea

 The parameters a0 and a1 are assumed to be the same for all observations; they are called “regression coefficients”
 You may argue that company size is not the only variable that affects audit fees. For example, the complexity of the audit engagement or the size of the audit firm may also matter.
 If you do not know all the factors that influence fees, the predicted fee that you calculate from the above equation will differ from the actual fee.
2.1 The basic idea

 The deviation between the predicted fee and the actual fee is called the “residual”. In general, you might represent the relation between actual fees and predicted fees in the following way:

    Actual Fees = Expected Fees + e

where e represents the residual term (i.e., the difference between actual and predicted fees)
2.1 The basic idea
 Putting the two together, we can express actual fees using the following equation:

    Fees = a0 + a1 Size + e

 The goal of regression analysis is to estimate the parameters a0 and a1
2.1 The basic idea
 One of the simplest techniques to estimate the coefficients is known as ordinary least squares (OLS).
 The objective of OLS is to make the difference between the predicted and actual values as small as possible
 In other words, the goal is to minimize the magnitude of the residuals
2.1 The basic idea

 Go to MySite
 Download “ols.dta” to your hard drive and open it in STATA:
    use "J:\phd\ols.dta", clear
 Examine the graphical relation between the two variables:
    twoway (scatter y x) (lfit y x)
2.1 The basic idea

 This line is fitted by minimizing the sum of the squared differences between the observed and predicted values of y (known as the residual sum of squares, RSS)
 The main assumptions required to obtain these coefficients are that:
• the relation between y and x is linear
• the x variable is uncorrelated with the residuals (i.e., x is exogenous)
• the residuals have a mean value of zero
2.2 Single variable OLS (regress)
 Instead of using the lfit command with the graph, we can use the regress command:
    regress y x
 The first variable (y) is the dependent variable while the second (x) is the independent variable
2.2 Single variable OLS (regress)
 This gives the following output:
2.2 Single variable OLS (regress)
• The coefficient estimates are 3.000 for the a0 parameter and 0.500 for the a1 parameter
• We can use these to predict the value of Y for any given value of X. For example, when X = 5 we predict that Y will be:
    display 3.000091+0.5000909*5
2.2 Single variable OLS (_b[])
 Alternatively, we do not need to type the coefficient estimates because STATA will remember them for us. They are stored by STATA using the name _b[varname], where varname is replaced with the name of the independent variable or the constant (_cons):
    display _b[_cons]+_b[x]*5
2.2 Single variable OLS
 Note that the predicted value of y when x equals 5 differs from the actual value:
    list y if x==5
 The actual value is 5.68 compared to the predicted value of 5.50. The difference for this observation is the “residual” error that arises because x is not a perfect predictor of y.
2.2 Single variable OLS
 If we want to compute the predicted value of y for each value of x in our dataset, we can use the saved coefficients:
    gen y_hat=_b[_cons]+_b[x]*x
 The estimated residuals are the difference between the observed y values and the predicted y values:
    gen y_res=y-y_hat
    list x y_hat y y_res
2.2 Single variable OLS (predict)

 A quicker way to do this would be to use the predict command after regress:
    predict yhat
    predict yres, resid
 Checking that this gives the same answer:
    list yhat y_hat yres y_res
 You should also note that the values of x, yhat and yres correspond with those found on the scatter graph:
    sort x
    list x y y_hat y_res
2.2 Single variable OLS

 Note that, by construction, there is zero correlation between the x variable and the residuals:
    twoway (scatter y_res x) (lfit y_res x)
2.2 Single variable OLS
 Standard errors
• Typically our data comprise a sample that is taken from a larger population
• The coefficients are only estimates of the true a0 and a1 values that describe the entire population
• If we obtained a second random sample from the same population, we would obtain different coefficient estimates for a0 and a1
2.2 Single variable OLS

 We therefore need a way to describe the variability that would obtain if we were to apply OLS to many different samples
 Equivalently, we want a measure of how “precisely” our coefficients are estimated
 The solution is to calculate “standard errors”, which are simply the sample standard deviations associated with the estimated coefficients
 Standard errors (SEs) allow us to perform statistical tests, e.g., is our estimate of a1 significantly greater than zero?
2.2 Single variable OLS
 The techniques for estimating standard errors are based on additional OLS assumptions:
• Homoscedasticity (i.e., the residuals have a constant variance)
• Non-correlation (i.e., the residuals are not correlated with each other)
• Normality (i.e., the residuals are normally distributed)
2.2 Single variable OLS

 The t-statistic is obtained by dividing the coefficient estimate by the standard error
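 After regress, STATA stores the standard errors as well as the coefficients, so the reported t-statistic can be reproduced by hand (a quick sketch; _se[varname] is the stored standard error):
    regress y x
    display _b[x]/_se[x]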
2.2 Single variable OLS

 The p-values are from the t-distribution, and they tell you how likely it is that you would have observed the estimated coefficient under the assumption that the “true” coefficient in the population is zero.
 The p-value of 0.002 tells you that it is very unlikely (prob = 0.2%) that the true coefficient on x is zero.
 The confidence intervals mean you can be 95% confident that the true coefficient of x lies between 0.233 and 0.767.
2.2 Single variable OLS

 To explain this we need some notation:
• the total sum of squares, TSS = Σ(yi - ȳ)², captures the variation in y around its mean
• the residual sum of squares, RSS = Σ(yi - ŷi)², captures the variation that is not explained by x
• the explained sum of squares, ESS = Σ(ŷi - ȳ)², captures the variation that is explained by x
2.2 Single variable OLS

 The total sum of squares (TSS) = 41.27
 The explained sum of squares (ESS) = 27.51
 The residual sum of squares (RSS) = 13.76
 Note that TSS = ESS + RSS.
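 The decomposition can be verified from STATA’s stored results (a sketch; after regress, e(mss) holds the explained sum of squares and e(rss) the residual sum of squares):
    quietly regress y x
    display e(mss)
    display e(rss)
    display e(mss) + e(rss)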
2.2 Single variable OLS

 The column labeled df contains the number of “degrees of freedom”
• For the ESS, df = k - 1 where k = number of regression coefficients (df = 2 - 1)
• For the RSS, df = n - k where n = number of observations (= 11 - 2)
• For the TSS, df = n - 1 (= 11 - 1)
 The last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom
2.2 Single variable OLS

 The first number simply tells us how many observations are used to estimate the model
 The other statistics here tell you how “well” the model explains the variation in Y
2.2 Single variable OLS

 The R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
• So x explains 66.6% of the variation in y.
• Unfortunately, many researchers in accounting (and other fields) evaluate the quality of a model by looking only at the R-squared.
• This is not only invalid, it is also very dangerous (I will explain why later)
2.2 Single variable OLS

 One problem with the R-squared is that it will always increase even when an independent variable is added that has very little explanatory power.
• Adding another variable is not always a good idea, as you lose one degree of freedom for each additional coefficient that needs to be estimated. Adding insignificant variables can be especially inefficient if you are working with a small sample size.
 The adjusted R-squared corrects for this by accounting for the number of model parameters, k, that need to be estimated:
    Adj R-squared = 1 - (1 - R2)(n - 1)/(n - k) = 1 - (1 - 0.666)(10)/9 = 0.629
• In fact, the adjusted R-squared can even take on negative values. For example, suppose that y and x are uncorrelated, in which case the unadjusted R-squared is zero:
    Adj R-squared = 1 - (n - 1)/(n - 2) = (n - 2 - n + 1)/(n - 2) = -1/(n - 2)
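 STATA stores everything needed to reproduce these numbers (a sketch; e(df_m) is the model degrees of freedom, so k = e(df_m) + 1):
    quietly regress y x
    display e(r2)
    display 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
    display e(r2_a)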
2.2 Single variable OLS

 You might think that another way to measure the fit of the model is to add up the residuals. However, by definition the residuals will sum to zero.
 An alternative is to square the residuals, add them up (giving the RSS), and then take the square root:
    Root MSE = square root of RSS/(n - k) = [13.76 / (11 - 2)]^0.5 = 1.236
 One way to interpret the root MSE is that it shows how far away, on average, the model is from explaining y
 The F-statistic = (ESS/(k - 1))/(RSS/(n - k)) = (27.51 / 1)/(13.76/9) = 17.99
• the F-statistic is used to test whether the R-squared is significantly greater than zero (i.e., are the independent variables jointly significant?)
• Prob > F gives the probability of observing the R-squared we calculated if the true R-squared in the population is actually equal to zero
• this F test is used to test the overall statistical significance of the regression model
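 Both statistics can also be recovered from the results that regress leaves behind (a sketch; e(df_r) is the residual degrees of freedom):
    quietly regress y x
    display sqrt(e(rss)/e(df_r))
    display (e(mss)/e(df_m))/(e(rss)/e(df_r))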
Class exercise 2a
 Open your Fees.dta file and run the following two regressions:
• audit fees on total assets
• the log of audit fees on the log of total assets
 What does the output of your regression mean?
 Which model appears to have the better “fit”?
2.3 Correctly interpreting the coefficients
 So far we have considered the case where the independent variable is continuous.
 Interpretation of the results is even more straightforward when the independent variable is a dummy:
    reg auditfees big6
    ttest auditfees, by(big6)
2.3 Correctly interpreting the coefficients
 Suppose we wish to test whether the Big 6 fee premium is significantly different between listed and non-listed companies
2.3 Correctly interpreting the coefficients
    gen listed=0
    replace listed=1 if companytype==2 | companytype==3 | companytype==5
    reg auditfees big6 if listed==0
    ttest auditfees if listed==0, by(big6)
    reg auditfees big6 if listed==1
    ttest auditfees if listed==1, by(big6)
    gen listed_big6=listed*big6
    reg auditfees big6 listed listed_big6
2.3 Correctly interpreting the coefficients
 Some studies report the “economic” significance of the estimated coefficients as well as the statistical significance
 Economic significance refers to the magnitude of the impact of x on y
 There is no single way to evaluate “economic significance”, but many studies describe the change in the predicted value of y as x increases from the 25th percentile to the 75th (or as x changes by one standard deviation around its mean)
2.3 Correctly interpreting the coefficients
 For example, we can calculate the expected change in audit fees as company size increases from the 25th to the 75th percentile:
    reg auditfees totalassets
    sum totalassets if auditfees<., detail
    gen fees_low=_b[_cons]+_b[totalassets]*r(p25)
    gen fees_high=_b[_cons]+_b[totalassets]*r(p75)
    sum fees_low fees_high
Class exercise 2b

 Estimate the audit fee model in logs rather than in absolute values
 Calculate the expected change in audit fees as company size increases from the 25th to the 75th percentile
 Compare your results for economic significance to those we obtained when the fee model was estimated using the absolute values of fees and assets.
 Hint: you will need to take the exponential of the predicted log of fees in order to make this comparison.
2.3 Correctly interpreting the coefficients
 When evaluating the economic significance of a dummy variable coefficient, we usually do so using the values zero and one rather than percentiles
 For example:
    reg lnaf big6
    gen fees_nb6=exp(_b[_cons])
    gen fees_b6=exp(_b[_cons]+_b[big6])
    sum fees_nb6 fees_b6
2.3 Correctly interpreting the coefficients
 Suppose we believe that the impact of a Big 6 audit on fees depends upon the size of the company
 Usually, we would quantify this impact using a range of values for lnta (e.g., as lnta increases from the 25th to the 75th percentile)
2.3 Correctly interpreting the coefficients
 For example:
    gen big6_lnta=big6*lnta
    reg lnaf big6 lnta big6_lnta
    sum lnta if lnaf<. & big6<., detail
    gen big6_low=_b[big6]+_b[big6_lnta]*r(p25)
    gen big6_high=_b[big6]+_b[big6_lnta]*r(p75)
    gen big6_mean=_b[big6]+_b[big6_lnta]*r(mean)
    sum big6_low big6_high big6_mean

 It is amazing how many studies give a misleading interpretation of the coefficients when using interaction terms. For example, Blackwell et al.:

 Class questions:
• Theoretically, how should auditing affect the interest rate that the company has to pay?
• Empirically, how do we measure the impact of auditing on the interest rate using eq. (1)?

 Class question: At what values of total assets ($000) is the effect of the Audit Dummy on the interest rate negative, zero, or positive?

 Class questions:
• What is the mean value of total assets within their sample?
• How does auditing affect the interest rate for the average company in their sample?
 Verify that the above claim is “true”.
 Suppose Blackwell et al. had reported the impact for a firm with $11m in assets and another firm with $15m in assets.
• How would this have changed the conclusions drawn?
• Do you think the paper would have been published if the authors had made this comparison?
2.4 Examining the residuals

 Go to MySite
 Download “anscombe.dta” to your hard drive:
    use "J:\phd\anscombe.dta", clear
 Run the following regressions:
    reg y1 x1
    reg y2 x2
    reg y3 x3
    reg y4 x4
 Note that the output from these regressions is virtually identical:
• intercept = 3.0 (t-stat = 2.67)
• x coefficient = 0.5 (t-stat = 4.24)
• R-squared = 66%
Class exercise 2c

 If you did not know about regression assumptions or regression diagnostics, you would probably stop your analysis at this point, concluding that you have a good fit for all four models.
 In fact, only one of these four models is well specified.
 Draw scatter graphs for each of these four associations (e.g., twoway (scatter y1 x1) (lfit y1 x1)).
 Of the four models, which do you think is the well-specified one?
 Draw scatter graphs of the residuals against the x variable for each of the four regressions – is there a pattern?
 Which of the OLS assumptions are violated in these four regressions?
2.4 Examining the residuals
 Unfortunately, accounting researchers often judge whether a model is “well-specified” solely in terms of its explanatory power (i.e., the R-squared).
 Many researchers fail to report other types of diagnostic tests:
• is there significant heteroscedasticity?
• is there any pattern to the residuals?
• are there any problems of outliers?
2.4 Examining the residuals
 For example, many audit fee studies claim that their models are well-specified because they have high R2
 Carson et al. (2003):
2.4 Examining the residuals

 Gu (2007) points out that:
• econometricians consider R2 values to be relatively unimportant (accounting researchers put far too much emphasis on the magnitude of the R2)
• regression R2s should not be compared across different samples
• in contrast, there is a large accounting literature that uses R2s to determine whether the value relevance of accounting information has changed over time

 It is easy to show that the same “economic” model can yield very different R2 depending on how the variables are transformed:

    (1) lnaf = a0 + a1 lnta + a2 lnaf_lag + e
    (2) lnaf - lnaf_lag = a0 + a1 lnta + (a2 - 1) lnaf_lag + e

 Using either eq. (1) or (2), we will obtain exactly the same coefficient estimates because the economic model is the same
 If eq. (1) is well-specified, so also is eq. (2)
 If eq. (1) is mis-specified, so also is eq. (2)
 However, the R2 of eq. (1) will be very different from the R2 of eq. (2)

 Example:
    use "J:\phd\Fees.dta", clear
    gen lnaf=ln(auditfees)
    gen lnta=ln(totalassets)
    sort companyid yearend
    by companyid: gen lnaf_lag=lnaf[_n-1]
    egen miss=rmiss(lnaf lnta lnaf_lag)
    gen chlnaf=lnaf-lnaf_lag
    reg lnaf lnta lnaf_lag if miss==0
    reg chlnaf lnta lnaf_lag if miss==0
 The lnta coefficients are exactly the same in the two models.
 The lnaf_lag coefficient in eq. (2) equals the lnaf_lag coefficient in eq. (1) minus one.
 The R2 is much higher in eq. (1) than in eq. (2).
 The high R2 in eq. (1) does not imply that the model is well-specified.
 The low R2 in eq. (2) does not imply that the model is mis-specified.
 Either both equations are well-specified or they are both mis-specified.
 The R2 tells us nothing about whether our hypothesis about the determinants of Y is correct.
2.4 Examining the residuals

 Instead of relying only on the R2, an examination of the residuals can help us identify whether the model is well specified. For example, compare the audit fee model which is not logged:
    reg auditfees totalassets
    predict res1, resid
    twoway (scatter res1 totalassets, msize(tiny)) (lfit res1 totalassets)
 with the logged audit fee model:
    reg lnaf lnta
    predict res2, resid
    twoway (scatter res2 lnta, msize(tiny)) (lfit res2 lnta)
 Notice that the residuals are more “spherical”, displaying less of an obvious pattern, in the logged model.
2.4 Examining the residuals

 In order to obtain unbiased standard errors we have to assume that the residuals are normally distributed
 We can test this using a histogram of the residuals:
    hist res1
• this does not give us what we need because there are severe outliers
    sum res1, detail
    hist res1 if res1>-22 & res1<208, normal xlabel(-25(25)210)
    hist res2
    sum res2, detail
    hist res2 if res2>-2 & res2<1.8, normal xlabel(-2(0.5)2)
 The residuals are much closer to the assumed normal distribution when the variables are measured in logs
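 A formal normality test can back up the visual impression (a sketch; sktest is STATA’s skewness/kurtosis test for normality):
    sktest res1 res2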
Class exercise 2d
 Following Pong and Whittington (1994), estimate the raw value of audit fees as a function of raw assets and assets squared
 Examine the residuals
 Do you think this model is better specified than the one in logs?
2.5 Multiple regression
 Researchers use “multiple regression” when they believe that Y is affected by multiple independent variables:
    Y = a0 + a1 X1 + a2 X2 + e
 Why is it important to control for multiple factors that influence Y?
2.5 Multiple regression

 Suppose the “true” model is:
    Y = a0 + a1 X1 + a2 X2 + e
• where X1 and X2 are uncorrelated with the error, e
 Suppose the OLS model that we estimate is:
    Y = a0 + a1 X1 + u
• where u = a2 X2 + e
 OLS imposes the assumption that X1 is uncorrelated with the residual term, u.
 Since X1 is uncorrelated with e, the assumption that X1 is uncorrelated with u is equivalent to assuming that X1 is uncorrelated with X2.
2.5 Multiple regression

 If X1 is correlated with X2, the OLS estimate of a1 is biased.
 The magnitude of the bias depends upon the strength of the correlation between X1 and X2.
 Of course, we often do not know whether the model we estimate is the “true” model
 In other words, we are unsure whether there is an omitted variable (X2) that affects Y and that is correlated with our variable of interest (X1)
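 A quick simulation illustrates the bias (a sketch with made-up parameters; here the omitted-variable bias in the x1 coefficient equals a2 times the slope from regressing x2 on x1, i.e., 3*0.4 = 1.2):
    clear
    set obs 1000
    set seed 12345
    gen x2=rnormal()
    gen x1=0.5*x2 + rnormal()
    gen y=1 + 2*x1 + 3*x2 + rnormal()
    reg y x1
    reg y x1 x2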
2.5 Multiple regression
 We can judge whether or not there is likely to be a correlated omitted variable problem using:
• theory
• prior empirical studies
2.5 Multiple regression

 Previously, when we were using simple regression with one independent variable, we checked whether there was a pattern between the residuals and the independent variable:
    lnaf = a0 + a1 lnta + res1
    twoway (scatter res1 lnta) (lfit res1 lnta)
 When we are using multiple regression, we want to test whether there is a pattern between the residuals and the right-hand side of the equation as a whole
 The right-hand side of the equation “as a whole” is the same thing as the predicted value of the dependent variable
2.5 Multiple regression

 So we should examine whether there is a pattern between the residuals and the predicted values of the dependent variable
 For example, let’s estimate a model where audit fees depend on company size, audit firm size, and whether the company is listed on a stock market:
    gen listed=0
    replace listed=1 if companytype==2 | companytype==3 | companytype==5
    reg lnaf lnta big6 listed
    predict lnaf_hat
    predict lnaf_res, resid
    twoway (scatter lnaf_res lnaf_hat) (lfit lnaf_res lnaf_hat)
2.5 Multiple regression (rvfplot)
 In fact, those nice guys at STATA have given us a command which enables us to short-cut having to use the predict command for calculating the residuals and the fitted values:
    reg lnaf lnta big6 listed
    rvfplot
• rvf stands for residuals versus fitted
2.6 Heteroscedasticity (hettest)

 The OLS techniques for estimating standard errors are based on an assumption that the variance of the errors is the same for all values of the independent variables (homoscedasticity)
 In many cases, the homoscedasticity assumption is clearly violated. For example:
    reg auditfees nonauditfees totalassets big6 listed
    rvfplot
 The homoscedasticity assumption can be tested using the hettest command after we run the regression:
    reg auditfees nonauditfees totalassets big6 listed
    hettest
 Heteroscedasticity does not bias the coefficient estimates, but it does bias the standard errors of the coefficients
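 In more recent versions of STATA, the same Breusch-Pagan/Cook-Weisberg test is run as a post-estimation command (both forms should produce the same statistic):
    estat hettest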
2.6 Heteroscedasticity (robust)

 Heteroscedasticity is often caused by using a dependent variable that is not symmetric
• for example, the auditfees variable is highly skewed due to the fact that it has a lower bound of zero
• much of the heteroscedasticity can often be removed by transforming the dependent variable (e.g., use the log of audit fees instead of the raw values)
 When you find that there is heteroscedasticity, you need to adjust the standard errors using the Huber/White/sandwich estimator
 In STATA it is easy to do this adjustment using the robust option:
    reg auditfees nonauditfees totalassets big6 listed, robust
 Compare the adjusted and unadjusted results:
    reg auditfees nonauditfees totalassets big6 listed
 What is different? What is the same?
Class exercise 2e

 Estimate the audit fee model in logs rather than in absolute values
 Using rvfplot, assess whether the residual variance appears to be non-constant
 Using hettest, provide a formal test for heteroscedasticity
 Compare the coefficients and t-statistics when you estimate the standard errors with and without adjusting for heteroscedasticity.
2.7 Correlated errors

 The OLS techniques for estimating standard errors are based on an assumption that the errors are not correlated
 This assumption is typically violated when we use repeated annual observations on the same companies
 The residuals of a given firm are correlated across years (“time-series dependence”)
Time-series dependence

 Time-series dependence is nearly always a problem when researchers use “panel data”
 Panel data = data that are pooled for the same companies across time
 In panel data, there are likely to be unobserved company-specific characteristics that are relatively constant over time

    Company   Year
    A         1996
    A         1997
    B         1996
    B         1997

 Let’s start with a simple regression model where the errors are assumed to be uncorrelated:

    y_it = a0 + a1 x_it + e_it

 We now relax the assumption of independent errors by assuming that the error term has an unobserved company-specific component that does not vary over time and an idiosyncratic component that is unique to each company-year observation:

    e_it = c_i + u_it

 Similarly, we can assume that the X variable has a company-specific component that does not vary over time and an idiosyncratic component:

    x_it = m_i + v_it
Time-series dependence

 In this case, the OLS standard errors tend to be biased downwards, and the magnitude of this bias is increasing in the number of years within the panel.
 To understand the intuition, consider the extreme case where the residuals and independent variables are perfectly correlated across time.
 In this case, each additional year provides no additional information and will have no effect on the true standard error
 However, under the incorrect assumption of time-series independence, it is assumed that each additional year provides additional observations, and the estimated standard errors will shrink accordingly and incorrectly
 This problem can be avoided by adjusting the standard errors for the clustering of yearly observations within a given company
Time-series dependence

 To understand all this, it is helpful to review the following example
 First, I estimate the model using just one observation for each company (in the year 1998):
    gen fye=date(yearend, "MDY")
    gen year=year(fye)
    drop if year!=1998
    sort companyid
    drop if companyid==companyid[_n-1]
    reg lnaf lnta big6 listed, robust
Time-series dependence

 Now I create a dataset in which each observation is duplicated
 Each duplicated observation provides no additional information and will have no effect on the true standard error, but it will reduce the estimated standard error (i.e., the estimated standard error will be biased downwards):
    save "J:\phd\Fees98.dta", replace
    append using "J:\phd\Fees98.dta"
    reg lnaf lnta big6 listed, robust
 What’s happened to the coefficients and t-statistics?
Time-series dependence (robust cluster())
 We can obtain correct standard errors in the duplicated dataset using the robust cluster() option, which adjusts the standard errors for the clustering of observations (here they are duplicates) within each company:
    reg lnaf lnta big6 listed, robust cluster(companyid)
 What’s happened to the coefficients and t-statistics?
Time-series dependence

 In reality, the observations of a given company are not exactly the same from one year to the next (i.e., they are not exact duplicates).
 However, the observations of a given company often do not change much from one year to the next.
 For example, a company’s size and the fees that it pays may not change much over time (i.e., there is a strong unobserved company-specific component to the variables).
 Failing to account for this in panel data tends to overstate the magnitude of the t-statistics.
Time-series dependence

 It is easy to demonstrate that the residuals of a given company tend to be very highly correlated over time
 First, start again with the original data:
    use "J:\phd\Fees.dta", clear
    gen fye=date(yearend, "MDY")
    gen year=year(fye)
    gen lnaf=ln(auditfees)
    gen lnta=ln(totalassets)
    save "J:\phd\Fees1.dta", replace
 Estimate the fee model and obtain the residuals for each company-year observation:
    reg lnaf lnta
    predict res, resid
Time-series dependence

 Reshape the data so that we have each company as a row and there are separate variables for each yearly set of residuals:
    keep companyid year res
    sort companyid year
    drop if companyid==companyid[_n-1] & year==year[_n-1]
    reshape wide res, i(companyid) j(year)
    browse
 Examine the correlations between the residuals of a given company:
    pwcorr res1998-res2002
Time-series dependence

 We can easily control for this problem of time-series dependence using the robust cluster() option:
    use "J:\phd\Fees1.dta", clear
    reg lnaf lnta, robust cluster(companyid)
 Note that if we do not control for time-series dependence, the t-statistic is biased upwards even though we have controlled for the heteroscedasticity:
    reg lnaf lnta, robust
 If we do not control for heteroscedasticity, the upward bias would be even worse:
    reg lnaf lnta
 TOP TIP: Whenever you use panel data you should get into the habit of using the robust cluster() option, otherwise your “significant” results from pooled regressions may be spurious.
2.8 Multicollinearity

 Perfect collinearity occurs if there is a perfect linear relation between multiple variables of the regression model.
 For example, our dataset covers a sample period of five years (1998-2002). Suppose we create a dummy for each year and include all five year dummies in the fee regression:
    tabulate year, gen(year_)
    reg lnaf year_1 year_2 year_3 year_4 year_5
 STATA excludes one of the year dummies when estimating the model – why is that?
2.8 Multicollinearity

 The reason is that a linear combination of the year dummies equals the constant in the model:
    year_1 + year_2 + year_3 + year_4 + year_5 = 1
• where 1 is a constant
 The model can only be estimated if one of the year dummies or the constant is excluded:
    reg lnaf year_1 year_2 year_3 year_4 year_5, nocons
 STATA automatically throws away one of the year dummies so that the model can be estimated
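 In more recent versions of STATA, factor-variable notation achieves the same thing by omitting a base category automatically (assuming year is a numeric variable):
    reg lnaf i.year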
Class exercise 2f

 Go to MySite
 Download “international.dta” to your hard drive and open it in STATA
 You are interested in testing whether legal enforcement affects the importance of equity markets to the economy
 Create dummy variables for each country in your dataset
 Run a regression where importanceofequitymarket is the dependent variable and legalenforcement is the independent variable
 How many country dummies can be included in your regression? Explain.
 Are your results for the legalenforcement coefficient sensitive to your choice of which country dummies to exclude? Explain.
Golden rule:
You must not include dummies as control variables if their inclusion would “dummy out” all of the variation in your treatment variable.
2.8 Multicollinearity

 We have seen that when there is perfect collinearity between independent variables, STATA will have to exclude one of them.
• For example, a linear combination of all year dummies equals the constant in the model:
    year_1 + year_2 + year_3 + year_4 + year_5 = constant
• so we cannot estimate a model that includes all the year dummies and the constant term
 Even if the independent variables are not perfectly collinear, there can still be a problem if they are highly correlated
2.8 Multicollinearity

 Multicollinearity can cause:
• the standard errors of the coefficients to be large (i.e., the coefficients are not estimated precisely)
• the coefficient estimates to be highly unstable
 Example:
    use "J:\phd\Fees.dta", clear
    gen lnaf=ln(auditfees)
    gen lnta=ln(totalassets)
    gen lnta1=lnta
    reg lnaf lnta lnta1
 Obviously, you must exclude one of these variables because lnta and lnta1 are perfectly correlated
2.8 Multicollinearity

 Let’s see what happens if we change the value of lnta1 for just one observation:
    list lnta if _n==1
    replace lnta1=8 if _n==1
    reg lnaf lnta
    reg lnaf lnta1
    reg lnaf lnta lnta1
 Notice that the lnta and lnta1 coefficients are highly significant when included separately, but they are insignificant when included together
 The reason, of course, is that, by construction, lnta and lnta1 are very highly correlated:
    pwcorr lnta lnta1, sig
2.8 Multicollinearity
 As another example, we can see that the coefficients can “flip” signs as a result of high collinearity:
    sort lnaf lnta
    replace lnta1=10 if _n<=100
    reg lnaf lnta
    reg lnaf lnta1
    reg lnaf lnta lnta1
    pwcorr lnta lnta1, sig
2.8 Multicollinearity (vif)

 Variance-inflation factors (VIFs) can be used to assess whether multicollinearity is a problem for a particular independent variable
 The VIF takes account of the variable’s correlations with all other independent variables on the right-hand side
 The VIF shows the increase in the variance of the coefficient estimate that is attributable to the variable’s correlations with other independent variables in the model:
    reg lnaf lnta big6 lnta1
    vif
    reg lnaf lnta big6
    vif
 Multicollinearity is generally regarded as high (very high) if the VIF is greater than 10 (20)
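 The VIF can also be computed by hand, which makes the definition concrete (a sketch: VIFj = 1/(1 - R2j), where R2j is the R-squared from regressing variable j on the other regressors):
    quietly reg lnta big6 lnta1
    display 1/(1 - e(r2))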
2.9 Outlying observations

 We have already seen that outlying observations heavily influence the results of OLS models
2.9 Outlying observations

 In simple regression (with just one independent variable), it is easy to spot outliers from a scatterplot of Y on X
• For example, a company is an outlier if it is very small in terms of size and it pays an audit fee that is very high
 In multiple regression (where there are multiple X variables), some observations may be “outliers” even though they do not show up on the scatterplot
 Moreover, observations that show up as outliers on the scatterplot might actually be normal once we control for other factors in the multiple regression
• For example, the small company may pay a high audit fee because other characteristics of that company make it a complex audit.
2.9 Outlying observations (cooksd)

 We can calculate the influence of each observation on the estimated coefficients using Cook’s D
 Values of Cook’s D that are higher than 4/N are considered large, where N is the number of observations used in the regression:
    reg lnaf lnta big6
    predict cook, cooksd
    sum cook, detail
    gen max=4/e(N)
• e(N) is the number of observations in the most recent regression model (the estimation sample size is stored by STATA as an internal result)
    count if cook>max & cook<.
2.9 Outlying observations (cooksd)

 We can discard the observations whose Cook’s D values are larger than this cut-off and re-estimate the model:
    reg lnaf lnta big6 if cook<=max
 For example, Ke and Petroni (2004, p.906) explain that they use Cook’s D to exclude outliers and that the standard errors are adjusted for heteroscedasticity and time-series dependence (they are using a panel dataset):
2.9 Outlying observations (winsor)

 Rather than drop outlying observations, some researchers choose to “winsorize” the data
 Winsorizing replaces the extreme values of a variable with the values at certain percentiles (e.g., the top and bottom 1%)
 You can winsorize variables in STATA using the winsor command (a user-written command that can be installed with ssc install winsor):
    winsor lnaf, gen(wlnaf) p(0.01)
    winsor lnta, gen(wlnta) p(0.01)
    sum lnaf wlnaf lnta wlnta, detail
    reg wlnaf wlnta big6
 A disadvantage of “winsorizing” is that the researcher is assuming that outliers lie only at the extremes of the variable’s distribution.
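 If the winsor command is not available, much the same effect can be achieved by hand (a sketch using the percentiles returned by summarize; wlnaf2 is an illustrative name):
    quietly sum lnaf, detail
    gen wlnaf2=max(min(lnaf, r(p99)), r(p1)) if lnaf<.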
2.10 Median regression
 Median regression is quite similar to OLS, but it can be more reliable, especially when we have problems of outlying observations
 Recall that in OLS, the coefficient estimates are chosen to minimize the sum of the squared residuals
2.10 Median regression

 In median regression, the coefficient estimates are chosen to minimize the sum of the absolute residuals
 Squaring the residuals in OLS means that large residuals are more heavily weighted than small residuals.
 Because the residuals are not squared in median regression, the coefficient estimates are less sensitive to outliers
2.10 Median regression

 Median regression takes its name from its predicted values, which are estimates of the conditional median of the dependent variable.
 In OLS, the predicted values are estimates of the conditional mean of the dependent variable.
 The predicted values of both regression techniques therefore measure the central tendency (i.e., mean or median) of the dependent variable.
2.10 Median regression
 STATA treats median regression as a special case of quantile regression.
 In quantile regression, the coefficients are estimated so that the sum of the weighted absolute residuals is minimized, where the weights are wi
2.10 Median regression (qreg)

 Weights can be different for positive and negative residuals. If positive and negative residuals are weighted equally, you get a median regression. If positive residuals are weighted by the factor 1.5 and negative residuals are weighted by the factor 0.5, you get a “3rd quartile regression”, etc.
 In STATA you perform quantile regressions using the qreg command:
    qreg lnaf lnta big6
    reg lnaf lnta big6
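 Other quantiles are requested through qreg’s quantile() option; for example, a 3rd quartile regression of the fee model:
    qreg lnaf lnta big6, quantile(0.75)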
Class exercise 2g
 Open the “anscombe.dta” file
 Do a scatterplot of y3 and x3
 Do an OLS regression of y3 on x3 for the full sample.
 Calculate Cook’s D to test for the presence of outliers.
 Do an OLS regression of y3 on x3 after dropping any outliers.
 Do a median regression of y3 on x3 for the full sample.
2.10 Median regression
 Basu and Markov (2004) compare the results of OLS and median regressions to determine whether analysts who issue earnings forecasts attempt to minimize:
• the sum of squared forecast errors (OLS), or
• the sum of absolute forecast errors (median)

 The LAD (least absolute deviations) estimator is simply the median regression command that we saw earlier (qreg)

 Basu and Markov (2004) conclude that analysts’ forecasts may accurately reflect their rational expectations
 Their study is a good example of how we can make an important contribution to the literature if we use an estimation technique that is not widely used by accounting researchers
2.11 “Looping”

 Looping can be very useful when we want to carry out the same operations many times
 Looping significantly reduces the length of our do files because it means we do not have to state the same commands many times
 When software designers use the word “programming” they mean they are creating a new command
 Usually we do not need new commands because what we need has already been written for us in STATA
 However, programming is necessary if we want to use “looping”
2.11 “Looping” (program, forvalues)
 Example:
    program ten
    forvalues i = 1(1)10 {
        display `i'
    }
    end
 To run this program we simply type ten
2.11 “Looping” (program, forvalues)

 What’s happening?
• program ten: we are telling STATA that the name of our program is “ten” and that we are starting to write a program
• end: we are telling STATA that we have finished writing the program
• { }: everything inside these brackets is part of a loop
• forvalues i =: the program will perform the commands inside the brackets for each value of i (i is called a “local macro”)
• 1(1)10: i goes from one to ten, increasing by the value one every time
• display `i': this is the command inside the brackets, and STATA will execute it for each value of i from one to ten. Note that ` is at the top left of your keyboard whereas ' is next to the Enter key
2.11 “Looping” (program, forvalues)

 The macro i has single quotes around it. These quotes tell STATA to replace the macro with the value that it holds before executing the command. So the first time through the loop, i holds the value of 1. STATA first replaces `i' with 1, and then it executes the command:
    display 1
 The next time through, i holds the value of 2. STATA first replaces `i' with the value 2, and then it executes the command:
    display 2
 This process continues through the values 3, 4, ..., 10.
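 forvalues loops over numbers; its sibling foreach loops over lists, such as variable names (a sketch assuming the lnaf and lnta variables are in memory):
    foreach v of varlist lnaf lnta {
        sum `v', detail
    }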
2.11 “Looping” (capture)

 Suppose we make a mistake in the program or we want to modify the program in some way
 We first need to drop this program from STATA’s memory:
    program drop ten
• we can then go on to write a new program called ten
 It is good practice to drop any program that might exist with the same name before writing a new program:
    capture program drop ten
2.11 “Looping”
 Our program is now:
    capture program drop ten
    program ten
    forvalues i = 1(1)10 {
        display `i'
    }
    end
 To run this program we simply type ten
Another example

 Earnings management studies often estimate “abnormal accruals” using the Jones model:
    ACCRUALSit = α0k (1/ASSETit-1) + α1k (ΔSALESit / ASSETit-1) + α2k (PPEit / ASSETit-1) + uit
• ACCRUALSit = change in non-cash current assets minus change in non-debt current liabilities, scaled by lagged assets.
 The k subscripts indicate that the model is estimated separately for each industry.
 Industries are identified using Standard Industrial Classification (SIC) codes
Another example

 The number of industries =
• 10 using one-digit codes,
• 100 using two-digit codes,
• 1,000 using three-digit codes, etc.
 Your do file could be very long if you had separate lines for each industry:
• Estimate abnormal accruals for SIC = 1
• Estimate abnormal accruals for SIC = 2
• Estimate abnormal accruals for SIC = 3
• …
• Estimate abnormal accruals for SIC = 10, etc.
Another example

 Your do file will be much shorter if you use the looping technique:
    capture program drop ab_acc
    program ab_acc
    forvalues i = 1(1)10 {
        [insert the commands that you want to execute for each industry SIC code]
    }
    end
Another example

 Go to MySite. Open “accruals.dta” and generate the variables
• the regressions will be estimated at the one-digit level
    use "J:\phd\accruals.dta", clear
    gen one_sic=int(sic/1000)
    gen ncca=current_assets-cash
    gen ndcl=current_liabilities-debt_in_current_liabilities
    sort cik year
    gen ch_ncca=ncca-ncca[_n-1] if cik==cik[_n-1]
    gen ch_ndcl=ndcl-ndcl[_n-1] if cik==cik[_n-1]
    gen accruals=(ch_ncca-ch_ndcl)/assets[_n-1] if cik==cik[_n-1]
    gen lag_assets=assets[_n-1] if cik==cik[_n-1]
    gen ppe_scaled=ppe/assets[_n-1] if cik==cik[_n-1]
    gen chsales_scaled=(sales-sales[_n-1])/assets[_n-1] if cik==cik[_n-1]
Another example

    gen ab_acc=.
    capture program drop ab_acc
    program ab_acc
    forvalues i = 0(1)9 {
        capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
        capture predict ab_acc`i' if one_sic==`i', resid
        replace ab_acc=ab_acc`i' if one_sic==`i'
        capture drop ab_acc`i'
    }
    end
    ab_acc
Explaining this program

 forvalues i = 0(1)9 {
• the one_sic variable takes values from 0 to 9
 capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
• the regressions are run at the one-digit level because some industries have insufficient observations at the two-digit level
 capture predict ab_acc`i' if one_sic==`i', resid
• For each industry, I create a separate abnormal accrual variable (ab_acc1 for industry #1, ab_acc2 for industry #2, etc.).
• If this line had been capture predict ab_acc if one_sic==`i', resid we would not have been able to go beyond the first industry, as ab_acc would already have been defined
 replace ab_acc=ab_acc`i' if one_sic==`i'
• The overall abnormal accrual variable (ab_acc) equals ab_acc1 for industry #1, ab_acc2 for industry #2, etc.
• before starting the program I had to gen ab_acc=. in order for this replace command to work
 capture drop ab_acc`i'
• I drop ab_acc1, ab_acc2, etc. because I only need the ab_acc variable.
Conclusion
 You should now have a good understanding of:
• how OLS models work,
• how to interpret the results of OLS models,
• how to find out whether the assumptions of OLS are violated,
• how to correct the standard errors for heteroscedasticity and time-series dependence,
• how to handle problems of outliers
Conclusion
 So far, we have been discussing the case where our dependent variable is continuous (e.g., lnaf)
 When the dependent variable is not continuous, we cannot use OLS (or quantile) regression.
 The next topic considers how to estimate models where our dependent variable is not continuous.