#### Transcript Dummy Variables

The term “dummy variable” refers to the technique of using a dichotomous variable (coded 0 or 1) to represent the separate categories of a nominal-level measure. The term “dummy” appears to refer to the fact that the presence of the trait indicated by the code of 1 represents a factor, or collection of factors, that cannot be measured by any better means within the context of the analysis.

##### Coding of Dummy Variables

Take, for instance, the race of the respondent in a study of voter preferences, with race coded White (0) or Black (1). A whole set of factors is possibly, or even likely, different between voters of different races: income, socialization, experience of racial discrimination, attitudes toward a variety of social issues, feelings of political efficacy, and so on. Since we cannot measure all of those differences within the confines of the study, we use a dummy variable to capture their effects.

##### Multiple Categories

Now picture race coded White (0), Black (1), Hispanic (2), Asian (3), and Native American (4). If we put this variable into a regression equation, the results will be nonsense, because regression implicitly treats the codes as at least ordinal-level data with approximately equal differences between adjacent categories. Regression using a nominal variable with three or more categories yields uninterpretable and meaningless results.

##### Creating Dummy Variables

The simple two-category version of race is already coded correctly as the dummy variable Black: coded 0 for White and 1 for Black. Note that the coding can be reversed; doing so changes only the sign of the coefficient and the direction of interpretation, with the intercept then referring to the other category.
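The point about multi-category codes can be illustrated with a small numerical sketch. The outcome values below are invented purely for illustration, and `ols_slope` is a hypothetical helper, not something from the slides. The sketch fits a regression on the integer race codes, then renumbers the same categories in a different but equally arbitrary order and fits again.

```python
import numpy as np

# Hypothetical data: two respondents per category, with made-up outcome values.
# Codes follow the slide: White(0), Black(1), Hispanic(2), Asian(3), Native American(4).
codes = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y = np.array([50., 52., 30., 32., 45., 47., 20., 22., 40., 42.])

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Renumber the same categories: old code -> new code. Nothing about the
# respondents changes, only the arbitrary labels attached to each group.
relabel = np.array([4, 0, 3, 1, 2])
codes_alt = relabel[codes]

slope_original = ols_slope(codes, y)
slope_relabeled = ols_slope(codes_alt, y)
print(slope_original, slope_relabeled)
```

With these invented numbers the two slopes differ in both size and sign, even though the data are identical. The estimate depends on the numbering, which is why a nominal variable with several categories needs to be split into dummies.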
The complex nominal version turns into five variables:

- White: coded 1 for Whites and 0 for non-Whites
- Black: coded 1 for Blacks and 0 for non-Blacks
- Hispanic: coded 1 for Hispanics and 0 for non-Hispanics
- Asian: coded 1 for Asians and 0 for non-Asians
- AmInd: coded 1 for Native Americans and 0 for non-Native Americans

##### Regression with Dummy Variables

The dummy variable is then added to the regression model:

$$Y_i = a + B_1 X_i + B_2 \text{Black}_i + e_i$$

Interpretation of the dummy variable is usually quite straightforward. The intercept term $a$ represents the intercept for the omitted category (here, Whites). The slope coefficient for the dummy variable, $B_2$, represents the change in the intercept for the category coded 1 (Blacks).

##### Regression with Only a Dummy

When we regress a variable on only the dummy variable, we obtain estimates of the group means of the dependent variable:

$$Y_i = a + B_1 \text{Black}_i + e_i$$

Here $a$ is the mean of $Y$ for Whites and $a + B_1$ is the mean of $Y$ for Blacks.

##### Omitting a Category

When we have a single dummy variable, we have information for both categories in the model. Also note that $\text{White} = 1 - \text{Black}$, so having both a dummy for White and one for Black is redundant. As a result, we always omit one category, whose intercept is the model’s intercept. This omitted category is called the reference category. In the dichotomous case, the reference category is simply the category coded 0. When we have a series of dummies, the reference category is likewise the category whose dummy variable is omitted.

##### Suggestions for Selecting the Reference Category

- Make it a well-defined group; “other” or an obscure category (low n) is usually a poor choice.
- If there is some underlying ordinality in the categories (e.g., blue-collar, white-collar, professional), select the highest or lowest category as the reference.
- It should have an ample number of cases. The modal category is also often a good choice.
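Two of the claims above can be checked numerically: that a regression on only a dummy recovers the group means, and that including both a White and a Black dummy alongside the intercept is redundant. The following is a minimal sketch using NumPy's least-squares solver; the six outcome values are made up for illustration.

```python
import numpy as np

# Hypothetical outcomes for 3 White (Black = 0) and 3 Black (Black = 1) respondents.
black = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([48., 50., 52., 28., 30., 32.])

# Design matrix with an intercept column and the Black dummy.
X = np.column_stack([np.ones_like(black), black])
(a, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

mean_white = y[black == 0].mean()
mean_black = y[black == 1].mean()
print(a, mean_white)        # intercept vs. mean of Y for Whites
print(a + b1, mean_black)   # intercept + dummy coefficient vs. mean of Y for Blacks

# Adding a White dummy as well makes the columns linearly dependent,
# because the intercept column equals White + Black for every respondent.
white = 1.0 - black
X_both = np.column_stack([np.ones_like(black), white, black])
rank_both = np.linalg.matrix_rank(X_both)
print(rank_both, X_both.shape[1])   # the rank is smaller than the number of columns
```

Because the three-column design matrix has rank 2, its coefficients are not uniquely determined. That is the practical reason one category is always left out as the reference.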
##### Multiple Dummy Variables

The model for the full dummy variable scheme for race is:

$$Y_i = a + B_1 X_i + B_2 \text{Black}_i + B_3 \text{Hispanic}_i + B_4 \text{Asian}_i + B_5 \text{AmInd}_i + e_i$$

Note that the dummy for White has been omitted, and the intercept $a$ is the intercept for Whites.

##### Tests of Significance

With dummy variables, the t test on a coefficient tests whether that category differs from the reference category, not whether the category itself differs from 0. Thus, if $a = 50$ and the coefficient on Black is $-45$, the intercept for Blacks is $50 - 45 = 5$, which might not be significantly different from 0, while the intercept for Whites is significantly different from 0.

##### Interaction Terms

When the research hypotheses state that different categories may respond differently to other independent variables, we need to use interaction terms. For example, race and income may interact so that the relationship between income and ideology is different (stronger or weaker) for Whites than for Blacks.

##### Creating Interaction Terms

Creating an interaction term is easy: multiply the category dummy by the independent variable. The full model is thus:

$$Y_i = a + B_1 \text{Race}_i + B_2 \text{Income}_i + B_3 (\text{Race}_i \times \text{Income}_i) + e_i$$

With Race coded 0 for Whites and 1 for Blacks:

- $a$ is the intercept for Whites;
- $(a + B_1)$ is the intercept for Blacks;
- $B_2$ is the slope for Whites;
- $(B_2 + B_3)$ is the slope for Blacks.

The t tests for $B_1$ and $B_3$ test whether the intercept and slope for Blacks differ from $a$ and $B_2$, the intercept and slope for Whites.

##### Separating Effects

The literature is unclear on how to fully interpret interaction effects. There is multicollinearity between a dummy and its interaction terms, and also with the regular independent variable. It is suggested that you do not use a model with interaction terms and no intercept.
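The reading of the interaction model's coefficients can be sketched with a small example. The data are invented and deliberately noise-free so that the group intercepts and slopes can be read off exactly; the variable names are illustrative assumptions, with Race coded 0 for Whites and 1 for Blacks.

```python
import numpy as np

# Hypothetical, noise-free data: 4 White (race = 0) and 4 Black (race = 1) respondents.
race = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
income = np.array([1., 2., 3., 4., 1., 2., 3., 4.])
# Made-up relationships: Whites Y = 10 + 2 * income; Blacks Y = 4 + 5 * income.
y = np.where(race == 0, 10 + 2 * income, 4 + 5 * income)

# Columns: intercept, Race, Income, and the interaction Race * Income.
X = np.column_stack([np.ones_like(race), race, income, race * income])
(a, b1, b2, b3), *_ = np.linalg.lstsq(X, y, rcond=None)

intercept_white, slope_white = a, b2
intercept_black, slope_black = a + b1, b2 + b3
print(intercept_white, slope_white)
print(intercept_black, slope_black)

# The same lines come out of fitting each group separately.
sep_slope_black, sep_intercept_black = np.polyfit(income[race == 1], y[race == 1], 1)
print(sep_intercept_black, sep_slope_black)

# The dummy and its interaction term are strongly correlated in this sample,
# which illustrates the multicollinearity mentioned above.
corr_dummy_interaction = np.corrcoef(race, race * income)[0, 1]
print(corr_dummy_interaction)
```

In this sketch the pooled model with an interaction term reproduces the two separate group regressions. The pooled form has the advantage that $B_1$ and $B_3$ directly estimate the differences between the groups.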