Dummy Variables

Outline



Objective
Why forming dummy variables, so that nominal
variables can be used as independent variables in
regressions, is important.
How to use and interpret dummy variables.
Rules of use.
Recommended best practices.
Interpretation example
Objective


Learn how to use nominal variables as
independent variables in regression models.
These include variables like:




Continent/region (Africa, Western Europe, etc).
U.S. Party Vote (Democrats, Republicans, Other).
Marital status (Married, Single, Widowed, etc).
Religion (Catholic, Protestant, Muslim, etc).
Independent Variables in Regressions


When you run an ordinary least squares (OLS)
regression analysis, each B coefficient can be
interpreted as the predicted change in Y (the
dependent variable) as a result of increasing the
independent variable (X) by one unit.
For example: To explain differences in countries’
life expectancy rates, a regression was run using
literacy rates as an independent variable.

Both variables are interval.
Interpreting Coefficients


The B (unstandardized) coefficient for literacy
rates was 0.28.
We interpret this coefficient as:

For each one-point increase in the literacy rate, the
model predicts that life expectancy will increase by 0.28
points, holding all other variables in the model constant.
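
As a quick check on this reading, the arithmetic can be sketched in Python. Only the 0.28 coefficient comes from the example above; the helper name and the ten-point scenario are illustrative assumptions.

```python
# B coefficient reported in the literacy example; everything else is illustrative.
B_LITERACY = 0.28


def predicted_change(b, delta_x):
    """Predicted change in Y when X rises by delta_x, other variables held constant."""
    return b * delta_x


one_point = predicted_change(B_LITERACY, 1)    # a one-point rise in literacy
ten_points = predicted_change(B_LITERACY, 10)  # a ten-point rise in literacy
print(round(one_point, 2), round(ten_points, 2))
```

Because the model is linear, a ten-point rise in literacy simply scales the predicted change by ten.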
What if?


What if another reader suggested that life
expectancy in Africa differs from life expectancy
everywhere else in the world, even after
accounting for literacy?
Fortunately, there is a variable in the dataset
for continent/region: North America, South
America, Western Europe, Eastern Europe,
Africa, the Middle East, Central & South Asia,
East Asia and Oceania.
A problem with nominal variables


The continent/region variable is nominal.
This poses a problem when used in
regression analyses, because without an
order to the values, we cannot interpret the
coefficient.

It would be silly to say that “for every one point
increase in region…” or “for every one point
increase from North America…”
Solution for nominal variables

Transform the nominal variable into a set of
dichotomous variables, called “dummies.”



Dichotomous variables have only two value categories
or options, like “yes” and “no”.
So, recode the region variable so that all African
countries are coded as 1 and all other countries are 0.
With only two options, coefficients can be
interpreted as the difference from one value
category to the other value category.

The coefficient for a dichotomous variable for African
countries would be interpreted as the difference
between African and non-African countries.
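
The recoding step might look like the following minimal sketch. The country list and the helper name `make_dummy` are hypothetical and stand in for whatever the real dataset contains.

```python
# Hypothetical country-to-region lookup, for illustration only.
regions = {
    "Kenya": "Africa",
    "Nigeria": "Africa",
    "France": "Western Europe",
    "Japan": "East Asia",
    "Brazil": "South America",
}


def make_dummy(value, category):
    """Return 1 if value equals category, otherwise 0."""
    return 1 if value == category else 0


# One dichotomous variable: African countries are 1, all others are 0.
africa = {country: make_dummy(region, "Africa") for country, region in regions.items()}
print(africa)
```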
Dummy interpretation
IV = Africa (Africa = 1, all others = 0)
DV = Life expectancy
Unstandardized B coefficient = ##
The model predicts that, compared to all other
countries, countries in Africa have a life expectancy
that is ## lower/higher when controlling
for all other variables.
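
To see why a dummy coefficient reads as a difference between two groups, here is a small sketch under simplifying assumptions: the life expectancy figures are invented, and the model contains only the Africa dummy with no other controls. In that special case the OLS slope equals the difference between the two group means, and the intercept equals the mean of the group coded 0.

```python
from statistics import mean

# Hypothetical data: 1 = African country, 0 = any other country.
africa = [1, 1, 1, 0, 0, 0, 0]
life_exp = [58.0, 61.0, 55.0, 78.0, 74.0, 81.0, 71.0]


def ols_slope_intercept(x, y):
    """Simple one-predictor OLS: slope = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


b, intercept = ols_slope_intercept(africa, life_exp)

# Group means for comparison with the regression output.
mean_africa = mean(y for d, y in zip(africa, life_exp) if d == 1)
mean_other = mean(y for d, y in zip(africa, life_exp) if d == 0)

print(round(b, 6), mean_africa - mean_other)  # slope vs. difference in means
print(round(intercept, 6), mean_other)        # intercept vs. baseline mean
```

Once other independent variables are added, the coefficient becomes the difference between the groups after controlling for those variables, so it will generally no longer match the raw difference in means.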
Rules for Dummies

All dummy variables must be dichotomous with
only two options or categories.

Continents:
Africa = 1, all other regions = 0, AND/OR a separate
variable that is:
Western Europe = 1, all other regions = 0.

Party voted for, when there is a Green Party, Tea Party,
or other third-party candidate:
Democrat = 1, all other parties = 0, AND/OR a separate
variable that is:
Republican = 1, all other parties = 0.
More rules for dummies

You can use more than one dummy variable
as independent variables in a regression
equation.

Region/continents example:




Africa=1, all other regions = 0
Western Europe =1, all other regions= 0.
East Asia=1, all other regions= 0.
When you add new dummies, the number of
observations covered by the omitted category
(zero) decreases.
Note on adding additional dummies




Each time you create a new dummy variable out of a
nominal variable, that category is no longer included in
the omitted category (zero).
For example, if you have only one dummy variable,
Africa=1, then all other regions = 0.
If you add a dummy for Western Europe =1, then all
other regions is really “all other regions except Africa and
Western Europe.”
If you add a dummy for East Asia=1 too, then for each of the
three dummies, 0= “all other regions except Africa,
Western Europe and East Asia.”
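
The shrinking omitted category can be traced with a short sketch. The region names are the nine from the original variable; the helper name `omitted_categories` is an assumption made for illustration.

```python
ALL_REGIONS = [
    "North America", "South America", "Western Europe", "Eastern Europe",
    "Africa", "Middle East", "Central & South Asia", "East Asia", "Oceania",
]


def omitted_categories(all_categories, dummied):
    """Categories that are still coded 0 on every dummy in the model."""
    return [c for c in all_categories if c not in dummied]


# Add one dummy at a time and watch the omitted category shrink.
steps = (
    ["Africa"],
    ["Africa", "Western Europe"],
    ["Africa", "Western Europe", "East Asia"],
)
for dummied in steps:
    remaining = omitted_categories(ALL_REGIONS, dummied)
    print(len(remaining), "regions remain in the omitted category:", remaining)
```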
Maximum number of dummies


The number of dummy variables used must be NO
MORE than one less than the total number of value
categories in the original nominal variable.
For example, the original continent/region variable had
NINE value categories:



North America, South America, Western Europe, Eastern Europe,
Africa, the Middle East, Central & South Asia, East Asia and
Oceania.
Therefore, one can use up to EIGHT different dummy
variables.
There must always be at least one region as a baseline,
remaining as zero.
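
One way to enforce this rule in code is to build a dummy for every category except a chosen baseline. This is a minimal sketch: the function name `dummy_code`, the sample observations, and the choice of Western Europe as baseline are all assumptions.

```python
CATEGORIES = [
    "North America", "South America", "Western Europe", "Eastern Europe",
    "Africa", "Middle East", "Central & South Asia", "East Asia", "Oceania",
]


def dummy_code(values, categories, baseline):
    """Build one 0/1 column per category, skipping the baseline category."""
    if baseline not in categories:
        raise ValueError("baseline must be one of the categories")
    kept = [c for c in categories if c != baseline]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical observations, one region per country.
sample = ["Africa", "Oceania", "Western Europe", "Africa"]
columns = dummy_code(sample, CATEGORIES, baseline="Western Europe")

print(len(CATEGORIES), "categories ->", len(columns), "dummies")
print(columns["Africa"])
```

An observation in the baseline category is coded 0 on every dummy, which is how the model identifies it.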
What dummy do you exclude?



It does not matter to your overall model which
category you exclude if you include the maximum
number of variables.
However, there are best practices that one ought to
follow when choosing the excluded category. The
excluded category is like a baseline, so certain
categories make results easier to understand and
interpret.
It is best if the excluded category is:

The mode, or most common category.
Relatively homogenous, meaning the observations in that category are similar to one another.
Recommended: exclude the mode

Exclude the mode, the most common or best-known
category.

For example, if your original variable was U.S.
vote choice, with three categories (Democrats,
Republicans, or “Other”), exclude the well-known
Democrats or Republicans. It will be easier for you
and your readers to interpret the coefficient for each
dummy variable relative to a well-known group
of voters.
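
Finding the mode is easy to automate. The sketch below uses made-up vote-choice responses, and the helper name `modal_category` is an assumption.

```python
from collections import Counter

# Hypothetical vote-choice responses.
votes = ["Democrat", "Republican", "Democrat", "Other", "Republican", "Democrat"]


def modal_category(values):
    """Return the most common category in values."""
    return Counter(values).most_common(1)[0][0]


baseline = modal_category(votes)
dummies_needed = sorted(set(votes) - {baseline})
print("Exclude:", baseline)
print("Create dummies for:", dummies_needed)
```

The mode is only a starting point. If the most common category is an unfamiliar or mixed group such as “Other”, a better-known category may still be the easier baseline to interpret.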
Recommended: homogenous baseline

Since the excluded category provides a
baseline, interpretations are easier when the
excluded category is relatively homogenous.

In the region example, it may make sense to
exclude Western Europe since almost all of the
countries in that region share certain attributes
like high levels of literacy relative to countries in
other regions.

This category may often appear “extreme”.
Dummy interpretation: all others?
IVs = Canadian Party Vote.
Canada has five parties represented in Parliament: the
Conservatives, the Liberals, the NDP, the Bloc Quebecois,
and the Greens. Other parties also run.
Conservatives = 1, All others = 0
NDP = 1, All others = 0
Bloc Quebecois = 1, All others = 0
Greens = 1, All others = 0
Other small parties = 1, All others = 0
What party or parties are included in “all others” at
this point?
Example when the maximum number of
dummies is used.
Canadian vote example from previous slide
Independent Variables = Vote (All others = last remaining
party = Liberals)
Conservatives = 1, Liberals = 0
NDP = 1, Liberals = 0
Bloc Quebecois = 1, Liberals = 0
Greens = 1, Liberals = 0
Other small parties = 1, Liberals = 0
Dependent Variable = Feeling towards Prime Minister
Harper (Conservative)
Interpretation of example



The independent variable is party vote, the dependent
variable is feelings towards Prime Minister Harper, and the
unstandardized B coefficient is ##.
The model predicts that, compared to Liberals, NDP
voters’ opinions are ## lower [or higher] when controlling
for all other variables.
The model predicts that, compared to Liberals,
Conservative voters’ opinions are ## higher [or lower]
when controlling for all other variables.
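
The sketch below fills the ## placeholders with invented numbers to show how each sentence is built. It assumes a model that contains only the five party dummies, in which case each OLS coefficient equals that party's mean minus the Liberal mean. The scores are hypothetical, and a real model with other controls would give different values.

```python
from statistics import mean

# Hypothetical feeling scores (0-100) toward the Prime Minister, by party voted for.
scores = {
    "Liberals": [40.0, 50.0],
    "Conservatives": [80.0, 90.0],
    "NDP": [30.0, 40.0],
    "Bloc Quebecois": [35.0, 45.0],
    "Greens": [40.0, 44.0],
    "Other small parties": [45.0, 55.0],
}
BASELINE = "Liberals"  # the excluded category

baseline_mean = mean(scores[BASELINE])

# With only these dummies in the model, each coefficient is a difference in means.
coefficients = {
    party: mean(values) - baseline_mean
    for party, values in scores.items()
    if party != BASELINE
}

for party, b in coefficients.items():
    direction = "higher" if b > 0 else "lower"
    print(f"Compared to {BASELINE}, {party} voters' opinions are "
          f"{abs(b)} points {direction}.")
```

Every sentence compares one party with the Liberals, never with another dummy category, because the Liberals are the only group coded 0 on all five dummies.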
