SW388R7 Data Analysis & Computers II Slide 1 Multinomial Logistic Regression Basic Relationships Multinomial Logistic Regression Describing Relationships Classification Accuracy Sample Problems.
Slide 2: Multinomial logistic regression

Multinomial logistic regression is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables.

Multinomial logistic regression compares multiple groups through a combination of binary logistic regressions. The group comparisons are equivalent to the comparisons for a dummy-coded dependent variable, with the group with the highest numeric score used as the reference group.

For example, if we wanted to study differences in BSW, MSW, and PhD students using multinomial logistic regression, the analysis would compare BSW students to PhD students and MSW students to PhD students. For each independent variable, there would be two comparisons.

Slide 3: What multinomial logistic regression predicts

Multinomial logistic regression provides a set of coefficients for each of the two comparisons. The coefficients for the reference group are all zeros, similar to the coefficients for the reference group for a dummy-coded variable.

Thus, there are three equations, one for each of the groups defined by the dependent variable. The three equations can be used to compute the probability that a subject is a member of each of the three groups. A case is predicted to belong to the group associated with the highest probability.

Predicted group membership can be compared to actual group membership to obtain a measure of classification accuracy.

Slide 4: Level of measurement requirements

Multinomial logistic regression analysis requires that the dependent variable be non-metric. Dichotomous, nominal, and ordinal variables satisfy the level of measurement requirement.
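The three equations described on Slide 3 can be sketched in a few lines of Python. This is only an illustration, not SPSS output: the coefficients are taken from the Parameter Estimates table on Slide 17, while the respondent's values and the helper names (`linear_predictor`, `group_probabilities`) are hypothetical and introduced here for the example.

```python
import math

# Coefficients for groups 1 and 2 (from the Parameter Estimates table on Slide 17).
# Group 3 is the reference group, so all of its coefficients are zero.
coefficients = {
    1: {"intercept": 3.240, "age": 0.019, "educ": 0.071, "conlegis": -1.373},
    2: {"intercept": 3.639, "age": 0.003, "educ": 0.172, "conlegis": -1.657},
}

def linear_predictor(coef, case):
    """Intercept plus the sum of coefficient * value for each independent variable."""
    return coef["intercept"] + sum(coef[name] * value for name, value in case.items())

def group_probabilities(predictors):
    """Turn the linear predictors of the non-reference groups into probabilities
    for all groups. The reference group (listed last) has a predictor of 0."""
    scores = list(predictors) + [0.0]
    largest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical respondent (values invented for illustration only).
respondent = {"age": 40, "educ": 12, "conlegis": 2}

predictors = [linear_predictor(coefficients[g], respondent) for g in (1, 2)]
probs = group_probabilities(predictors)

# The case is predicted to belong to the group with the highest probability.
predicted_group = max(range(len(probs)), key=lambda i: probs[i]) + 1
print(probs, predicted_group)
```

Fixing the reference group's predictor at zero is what makes its coefficients "all zeros"; the other two equations are then read as contrasts against that group.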
Multinomial logistic regression analysis requires that the independent variables be metric or dichotomous. Because SPSS automatically dummy-codes nominal-level variables, they can also be included; they are dichotomized in the analysis.

In SPSS, non-metric independent variables are included as "factors." SPSS will dummy-code non-metric IVs.

In SPSS, metric independent variables are included as "covariates." If an independent variable is ordinal, we will attach the usual caution.

Slide 5: Assumptions and outliers

Multinomial logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables.

Because it does not impose these requirements, it is preferred to discriminant analysis when the data do not satisfy these assumptions.

SPSS does not compute any diagnostic statistics for outliers. To evaluate outliers, the advice is to run multiple binary logistic regressions and use those results to test the exclusion of outliers or influential cases.

Slide 6: Sample size requirements

The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for logistic regression.

For preferred case-to-variable ratios, we will use 20 to 1.

Slide 7: Methods for including variables

The only method for selecting independent variables in SPSS is simultaneous or direct entry.

Slide 8: Overall test of relationship - 1

The overall test of relationship among the independent variables and the groups defined by the dependent variable is based on the reduction in the -2 log likelihood value from a model that does not contain any independent variables to the model that contains the independent variables.
This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square.

The significance test for the final model chi-square (after the independent variables have been added) is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables.

Slide 9: Overall test of relationship - 2

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   284.429
Final            265.972             18.457       6    .005

The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information".

In this analysis, the probability of the model chi-square (18.457) was 0.005, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.

Slide 10: Strength of multinomial logistic regression relationship

While multinomial logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model.

A more useful measure to assess the utility of a multinomial logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.
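The model chi-square test reported on Slide 9 can be checked by hand. The sketch below uses only the Python standard library and relies on the closed-form upper-tail probability of the chi-square distribution for an even number of degrees of freedom; the function name `chi2_sf_even_df` is a hypothetical helper written for this example and is not part of SPSS.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with even df.

    For df = 2k: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

# Values from the Model Fitting Information table (Slide 9).
neg2ll_intercept_only = 284.429
neg2ll_final = 265.972
df = 6

model_chi_square = neg2ll_intercept_only - neg2ll_final
p_value = chi2_sf_even_df(model_chi_square, df)

print(round(model_chi_square, 3), round(p_value, 3))
```

The difference in -2 log likelihoods reproduces the tabled chi-square, and the resulting probability rounds to the tabled significance, which is below the 0.05 level used in these slides.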
Slide 11: Evaluating usefulness for logistic models

The benchmark that we will use to characterize a multinomial logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.

Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.

The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by summing the squared proportion of cases in each group. The only difference between by chance accuracy for binary logistic models and by chance accuracy for multinomial logistic models is the number of groups defined by the dependent variable.

Slide 12: Computing by chance accuracy

The percentage of cases in each group defined by the dependent variable is found in the 'Case Processing Summary' table.

Case Processing Summary

                           N       Marginal Percentage
HIGHWAYS AND BRIDGES   1   62      37.1%
                       2   93      55.7%
                       3   12      7.2%
Valid                      167     100.0%
Missing                    103
Total                      270
Subpopulation              153(a)

a. The dependent variable has only one value observed in 146 (95.4%) subpopulations.

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.371² + 0.557² + 0.072² = 0.453).

The proportional by chance accuracy criterion is 56.6% (1.25 x 45.3% = 56.6%).

Slide 13: Comparing accuracy rates

To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy.
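The by chance accuracy criterion and the comparison described above can be sketched as follows. The group counts come from the 'Case Processing Summary' table on Slide 12 and the cell counts from the Classification table on Slide 13; the variable names are assumptions made for this example. The sketch uses unrounded proportions, so intermediate values may differ slightly from the slide's hand calculation before rounding.

```python
# Number of valid cases in each group of the dependent variable (Slide 12).
group_counts = {1: 62, 2: 93, 3: 12}
n_valid = sum(group_counts.values())

# Proportional by chance accuracy: sum of squared group proportions.
proportions = {g: n / n_valid for g, n in group_counts.items()}
by_chance_accuracy = sum(p ** 2 for p in proportions.values())

# Benchmark: a 25% improvement over by chance accuracy.
criterion = 1.25 * by_chance_accuracy

# Classification table (Slide 13): rows are observed groups 1-3,
# columns are predicted groups 1-3.
classification = [
    [15, 47, 0],
    [7, 86, 0],
    [5, 7, 0],
]
correct = sum(classification[i][i] for i in range(3))
accuracy = correct / n_valid

is_useful = accuracy >= criterion
print(round(by_chance_accuracy, 3), round(criterion, 3), round(accuracy, 3), is_useful)
```

Note that the model never predicts group 3, so all of its accuracy comes from groups 1 and 2; the overall rate alone does not reveal this.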
(Note: SPSS does not compute a cross-validated accuracy rate for multinomial logistic regression.)

Classification

                       Predicted
Observed               1        2        3       Percent Correct
1                      15       47       0       24.2%
2                      7        86       0       92.5%
3                      5        7        0       .0%
Overall Percentage     16.2%    83.8%    .0%     60.5%

The classification accuracy rate was 60.5%, which was greater than or equal to the proportional by chance accuracy criterion of 56.6% (1.25 x 45.3% = 56.6%).

The criterion for classification accuracy is satisfied in this example.

Slide 14: Numerical problems

The maximum likelihood method used to calculate multinomial logistic regression is an iterative fitting process that attempts to cycle through repetitions to find an answer.

Sometimes, the method will break down and not be able to converge or find an answer.

Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions. These implausible results can be produced by multicollinearity, categories of predictors having no cases or zero cells, and complete separation whereby the two groups are perfectly separated by the scores on one or more independent variables.

The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables.

Slide 15: Relationship of individual independent variables and the dependent variable

There are two types of tests for individual independent variables:

• The likelihood ratio test evaluates the overall relationship between an independent variable and the dependent variable.
• The Wald test evaluates whether or not the independent variable is statistically significant in differentiating between the two groups in each of the embedded binary logistic comparisons.
If an independent variable has an overall relationship to the dependent variable, it might or might not be statistically significant in differentiating between pairs of groups defined by the dependent variable.

Slide 16: Relationship of individual independent variables and the dependent variable

The interpretation for an independent variable focuses on its ability to distinguish between pairs of groups and the contribution which it makes to changing the odds of being in one dependent variable group rather than the other.

We should not interpret the significance of an independent variable's role in distinguishing between pairs of groups unless the independent variable also has an overall relationship to the dependent variable in the likelihood ratio test.

The interpretation of an independent variable's role in differentiating dependent variable groups is the same as we used in binary logistic regression. The difference in multinomial logistic regression is that we can have multiple interpretations for an independent variable in relation to different pairs of groups.

Slide 17: Relationship of individual independent variables and the dependent variable

Parameter Estimates

HIGHWAYS AND BRIDGES(a)               B        Std. Error   Wald    df   Sig.   Exp(B)   Lower Bound
1 (TOO LITTLE)     Intercept      3.240    2.478        1.709   1    .191
                   AGE            .019     .020         .906    1    .341   1.019    .980
                   EDUC           .071     .108         .427    1    .514   1.073    .868
                   CONLEGIS       -1.373   .620         4.913   1    .027   .253     .075
2 (ABOUT RIGHT)    Intercept      3.639    2.456        2.195   1    .138
                   AGE            .003     .020         .017    1    .897   1.003    .963
                   EDUC           .172     .110         2.463   1    .117   1.188    .958
                   CONLEGIS       -1.657   .613         7.298   1    .007   .191     .057

a. The reference category is: 3 (TOO MUCH).

Lower Bound is the lower bound of the 95% confidence interval for Exp(B); the upper bounds are cut off in the slide.

SPSS identifies the comparisons it makes for groups defined by the dependent variable in the table of 'Parameter Estimates,' using either the value codes or the value labels, depending on the options settings for pivot table labeling. The reference category is identified in the footnote to the table.

In this analysis, two comparisons will be made:

• the TOO LITTLE group (coded 1, shaded blue) will be compared to the TOO MUCH group (coded 3, shaded purple)
• the ABOUT RIGHT group (coded 2, shaded orange) will be compared to the TOO MUCH group (coded 3, shaded purple).

The reference category plays the same role in multinomial logistic regression that it plays in the dummy-coding of a nominal variable: it is the category that would be coded with zeros for all of the dummy-coded variables and that all other categories are interpreted against.

Slide 18: Relationship of individual independent variables and the dependent variable

Likelihood Ratio Tests

Effect       -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept    268.323                              2.350        2    .309
AGE          268.625                              2.652        2    .265
EDUC         270.395                              4.423        2    .110
CONLEGIS     275.194                              9.221        2    .010

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

In this example, there is a statistically significant relationship between the independent variable CONLEGIS and the dependent variable (0.010 < 0.05).

As well, in the Parameter Estimates table (shown on Slide 17), the independent variable CONLEGIS is significant in distinguishing category 1 of the dependent variable from category 3 of the dependent variable (0.027 < 0.05).

And the independent variable CONLEGIS is significant in distinguishing category 2 of the dependent variable from category 3 of the dependent variable (0.007 < 0.05).

Slide 19: Interpreting relationship of individual independent variables to the dependent variable

(The Likelihood Ratio Tests and Parameter Estimates tables from Slides 17 and 18 are repeated on this slide.)

Survey respondents who had less confidence in Congress (higher values correspond to lower confidence) were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges (DV category 1), rather than the group of survey respondents who thought we spend too much money on highways and bridges (DV category 3).

For each unit increase in the confidence in Congress score (higher scores mean less confidence), the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7% (0.253 - 1.0 = -0.747).

Slide 20: Interpreting relationship of individual independent variables to the dependent variable

(The Likelihood Ratio Tests and Parameter Estimates tables from Slides 17 and 18 are repeated on this slide.)

Survey respondents who had less confidence in Congress (higher values correspond to lower confidence) were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges (DV category 2), rather than the group of survey respondents who thought we spend too much money on highways and bridges (DV category 3).

For each unit increase in the confidence in Congress score (higher scores mean less confidence), the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9% (0.191 - 1.0 = -0.809).
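The interpretation of CONLEGIS on Slides 16 to 20 can be retraced with a short sketch: check the likelihood ratio test first, then compute the Wald statistic, its probability, and the percent change in odds for each comparison. The B and Std. Error values are copied from the Parameter Estimates table; the function names are hypothetical helpers for this example. Because the tabled B and Std. Error are rounded to three decimals, the recomputed Wald statistics differ slightly from the tabled 4.913 and 7.298.

```python
import math

ALPHA = 0.05

def wald_test(b, std_error):
    """Wald chi-square (1 df) and its probability for one coefficient."""
    wald = (b / std_error) ** 2
    # For 1 df, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(wald / 2.0))
    return wald, p

def percent_change_in_odds(b):
    """Percent change in the odds for a one-unit increase: (Exp(B) - 1) * 100."""
    return (math.exp(b) - 1.0) * 100.0

# Overall relationship for CONLEGIS from the Likelihood Ratio Tests table.
lr_sig_conlegis = 0.010

# (B, Std. Error) for CONLEGIS in each comparison against reference group 3.
conlegis = {"1 vs 3": (-1.373, 0.620), "2 vs 3": (-1.657, 0.613)}

results = {}
if lr_sig_conlegis <= ALPHA:          # only interpret pairs if the overall test is significant
    for comparison, (b, se) in conlegis.items():
        wald, p = wald_test(b, se)
        results[comparison] = {
            "wald": wald,
            "sig": p,
            "significant": p <= ALPHA,
            "pct_change_in_odds": percent_change_in_odds(b),
        }

for comparison, r in results.items():
    print(comparison, round(r["wald"], 3), round(r["sig"], 3), round(r["pct_change_in_odds"], 1))
```

Gating the pairwise interpretation on the likelihood ratio test mirrors the rule stated on Slide 16; the percent change in odds is the same Exp(B) - 1.0 subtraction the slides perform by hand.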
Slide 21: Relationship of individual independent variables and the dependent variable

Likelihood Ratio Tests

Effect       -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept    327.463(a)                           .000         0    .
AGE          333.440                              5.976        2    .050
EDUC         329.606                              2.143        2    .343
POLVIEWS     334.636                              7.173        2    .028
SEX          338.985                              11.521       2    .003

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Parameter Estimates

NATCHLD(a)                     B        Std. Error   Wald     df   Sig.   Exp(B)   Lower Bound
TOO LITTLE      Intercept      8.434    2.233        14.261   1    .000
                AGE            -.023    .017         1.756    1    .185   .977     .944
                EDUC           -.066    .102         .414     1    .520   .936     .766
                POLVIEWS       -.575    .251         5.234    1    .022   .563     .344
                [SEX=1]        -2.167   .805         7.242    1    .007   .115     .024
                [SEX=2]        0(b)     .            .        0    .      .        .
ABOUT RIGHT     Intercept      4.485    2.255        3.955    1    .047
                AGE            -.001    .018         .003     1    .955   .999     .965
                EDUC           .011     .104         .011     1    .916   1.011    .824
                POLVIEWS       -.397    .257         2.375    1    .123   .673     .406
                [SEX=1]        -1.606   .824         3.800    1    .051   .201     .040
                [SEX=2]        0(b)     .            .        0    .      .        .

a. The reference category is: TOO MUCH.

Lower Bound is the lower bound of the 95% confidence interval for Exp(B); the upper bounds are cut off in the slide.

In this example, there is a statistically significant relationship between SEX and the dependent variable, spending on childcare assistance.

As well, SEX plays a statistically significant role in differentiating the TOO LITTLE group from the TOO MUCH (reference) group (0.007 < 0.05).

However, SEX does not differentiate the ABOUT RIGHT group from the TOO MUCH (reference) group (0.051 > 0.05).

Slide 22: Interpreting relationship of individual independent variables and the dependent variable

(The Likelihood Ratio Tests and Parameter Estimates tables from Slide 21 are repeated on this slide.)

Survey respondents who were male (code 1 for sex) were less likely to be in the group of survey respondents who thought we spend too little money on childcare assistance (DV category 1), rather than the group of survey respondents who thought we spend too much money on childcare assistance (DV category 3).

For survey respondents who were male, the odds of being in the group of survey respondents who thought we spend too little money on childcare assistance were 88.5% lower (0.115 - 1.0 = -0.885).

Slide 23: Interpreting relationships for independent variable in problems

In the multinomial logistic regression problems, the problem statement will ask about only one of the independent variables. The answer will be true or false based on only the relationship between the specified independent variable and the dependent variable.
The individual relationships between the other independent variables and the dependent variable are not used in determining whether or not the answer is true or false.

Slide 24: Problem 1

11. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships.

The variables "age" [age], "highest year of school completed" [educ] and "confidence in Congress" [conlegis] were useful predictors for distinguishing between groups based on responses to "opinion about spending on highways and bridges" [natroad]. These predictors differentiate survey respondents who thought we spend too little money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges and survey respondents who thought we spend about the right amount of money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges.

Among this set of predictors, confidence in Congress was helpful in distinguishing among the groups defined by responses to opinion about spending on highways and bridges. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 25: Dissecting problem 1 - 1

(The problem statement from Slide 24 is repeated on this slide, with the following annotations.)

For these problems, we will assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results.

In this problem, we are told to use 0.05 as alpha for the multinomial logistic regression.

Slide 26: Dissecting problem 1 - 2

(The problem statement from Slide 24 is repeated on this slide, with the following annotations.)

The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "highest year of school completed" [educ] and "confidence in Congress" [conlegis].

The variable used to define groups is the dependent variable (DV): "opinion about spending on highways and bridges" [natroad].

SPSS only supports direct or simultaneous entry of independent variables in multinomial logistic regression, so we have no choice of method for entering variables.
Slide 27: Dissecting problem 1 - 3

(The problem statement from Slide 24 is repeated on this slide, with the following annotations.)

SPSS multinomial logistic regression models the relationship by comparing each of the groups defined by the dependent variable to the group with the highest code value.

The responses to opinion about spending on highways and bridges were: 1 = Too little, 2 = About right, and 3 = Too much.

The analysis will result in two comparisons:

• survey respondents who thought we spend too little money versus survey respondents who thought we spend too much money on highways and bridges
• survey respondents who thought we spend about the right amount of money versus survey respondents who thought we spend too much money on highways and bridges.

Slide 28: Dissecting problem 1 - 4

(The problem statement from Slide 24 is repeated on this slide, with the following annotations.)

Each problem includes a statement about the relationship between one independent variable and the dependent variable. The answer to the problem is based on the stated relationship, ignoring the relationships between the other independent variables and the dependent variable.

This problem identifies a difference for both of the comparisons among groups modeled by the multinomial logistic regression.

Slide 29: Dissecting problem 1 - 5

11. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships.
The variables "age" [age], "highest year of school completed" [educ] and "confidence in Congress" [conlegis] were useful predictors for distinguishing between groups based on responses to "opinion about spending on highways and bridges" [natroad]. These predictors differentiate survey respondents who thought we spend too little money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges and survey respondents who thought we spend about the right amount of money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges. Among this set of predictors, confidence in Congress was helpful in distinguishing among the groups defined by responses to opinion about spending on highways and bridges. Survey respondents who had less confidence in congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. order for the the multinomial logistic For each unit increase in confidence inInCongress, odds of being in theregression group of survey question to be true, the overall must respondents who thought we spend too little money on highways andrelationship bridges decreased by beconfidence statistically in significant, mustlikely be noto be in the 74.7%. 
Survey respondents who had less congress there were less evidence of numerical problems, the classification group of survey respondents who thought we spend about the right amount of money on must be substantially better than highways and bridges, rather than the accuracy group of rate survey respondents who thought we spend too much money on highways and bridges.could For each unit increase in confidence Congress, the be obtained by chance alone, andinthe odds of being in the group of survey respondents who thought we spend about the right amount stated individual relationship must be statistically of money on highways and bridges decreased by 80.9%. significant and interpreted correctly. SW388R7 Data Analysis & Computers II Request multinomial logistic regression Slide 30 Select the Regression | Multinomial Logistic… command from the Analyze menu. SW388R7 Data Analysis & Computers II Selecting the dependent variable Slide 31 First, highlight the dependent variable natroad in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box. SW388R7 Data Analysis & Computers II Selecting metric independent variables Slide 32 Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal. Move the metric independent variables, age, educ and conlegis to the Covariate(s) list box. In this analysis, there are no nonmetric independent variables. Nonmetric independent variables would be moved to the Factor(s) list box. SW388R7 Data Analysis & Computers II Specifying statistics to include in the output Slide 33 While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table. Click on the Statistics… button to make a request. 
SW388R7 Data Analysis & Computers II Requesting the classification table Slide 34

First, keep the SPSS defaults for Summary statistics, Likelihood ratio test, and Parameter estimates. Second, mark the checkbox for the Classification table. Third, click on the Continue button to complete the request.

SW388R7 Data Analysis & Computers II Completing the multinomial logistic regression request Slide 35

Click on the OK button to request the output for the multinomial logistic regression. The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.

SW388R7 Data Analysis & Computers II LEVEL OF MEASUREMENT - 1 Slide 36

Multinomial logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.

"Opinion about spending on highways and bridges" [natroad] is ordinal, satisfying the non-metric level of measurement requirement for the dependent variable. It contains three categories: survey respondents who thought we spend too little money, about the right amount of money, and too much money on highways and bridges.

SW388R7 Data Analysis & Computers II LEVEL OF MEASUREMENT - 2 Slide 37

"Age" [age] and "highest year of school completed" [educ] are interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.

"Confidence in Congress" [conlegis] is ordinal, satisfying the metric or dichotomous level of measurement requirement for independent variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for the analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

SW388R7 Data Analysis & Computers II Sample size – ratio of cases to variables Slide 38

Case Processing Summary

                                N      Marginal Percentage
HIGHWAYS AND BRIDGES    1       62     37.1%
                        2       93     55.7%
                        3       12     7.2%
Valid                           167    100.0%
Missing                         103
Total                           270
Subpopulation                   153(a)

a. The dependent variable has only one value observed in 146 (95.4%) subpopulations.

Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (167) to number of independent variables (3) was 55.7 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.

The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 55.7 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied.

SW388R7 Data Analysis & Computers II OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES Slide 39

Model Fitting Information

Model             -2 Log Likelihood    Chi-Square    df    Sig.
Intercept Only    284.429
Final             265.972              18.457        6     .005

The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information". In this analysis, the probability of the model chi-square (18.457) was 0.005, less than or equal to the level of significance of 0.05.
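As a rough check on the Model Fitting Information table, the model chi-square is the drop in -2 log likelihood from the intercept-only model to the final model, and its probability comes from a chi-square distribution with 6 degrees of freedom. The following is a minimal Python sketch, not SPSS output; the helper name chi2_sf_even_df is our own, and it relies on a closed form that holds only when the degrees of freedom are even.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square statistic (even df only).

    For df = 2m the tail area is exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # the k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k      # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Values copied from the Model Fitting Information table.
neg2ll_intercept_only = 284.429
neg2ll_final = 265.972
df = 6  # 3 independent variables x 2 group comparisons

model_chi_square = neg2ll_intercept_only - neg2ll_final
p_value = chi2_sf_even_df(model_chi_square, df)

print(f"chi-square = {model_chi_square:.3f}, p = {p_value:.3f}")
print("significant at 0.05:", p_value <= 0.05)
```

The same helper with df = 2 applies to a likelihood ratio test for a single independent variable in this three-group problem, since each variable contributes one coefficient per comparison.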
The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.

SW388R7 Data Analysis & Computers II NUMERICAL PROBLEMS Slide 40

Parameter Estimates

HIGHWAYS AND BRIDGES(a)    B         Std. Error    Wald     df    Sig.    Exp(B)    95% Confidence Interval for Exp(B), Lower Bound
1    Intercept             3.240     2.478         1.709    1     .191
     AGE                   .019      .020          .906     1     .341    1.019     .980
     EDUC                  .071      .108          .427     1     .514    1.073     .868
     CONLEGIS              -1.373    .620          4.913    1     .027    .253      .075
2    Intercept             3.639     2.456         2.195    1     .138
     AGE                   .003      .020          .017     1     .897    1.003     .963
     EDUC                  .172      .110          2.463    1     .117    1.188     .958
     CONLEGIS              -1.657    .613          7.298    1     .007    .191      .057

a. The reference category is: 3.

Multicollinearity in the multinomial logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

None of the independent variables in this analysis had a standard error larger than 2.0. (We are not interested in the standard errors associated with the intercept.)

SW388R7 Data Analysis & Computers II RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1 Slide 41

Likelihood Ratio Tests

Effect       -2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept    268.323                               2.350         2     .309
AGE          268.625                               2.652         2     .265
EDUC         270.395                               4.423         2     .110
CONLEGIS     275.194                               9.221         2     .010

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model.
The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

The statistical significance of the relationship between confidence in Congress and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests". For this relationship, the probability of the chi-square statistic (9.221) was 0.010, less than or equal to the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with confidence in Congress were equal to zero was rejected. The existence of a relationship between confidence in Congress and opinion about spending on highways and bridges was supported.

SW388R7 Data Analysis & Computers II RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2 Slide 42

Parameter Estimates

HIGHWAYS AND BRIDGES(a)    B         Std. Error    Wald     df    Sig.
1    Intercept             3.240     2.478         1.709    1     .191
     AGE                   .019      .020          .906     1     .341
     EDUC                  .071      .108          .427     1     .514
     CONLEGIS              -1.373    .620          4.913    1     .027
2    Intercept             3.639     2.456         2.195    1     .138
     AGE                   .003      .020          .017     1     .897
     EDUC                  .172      .110          2.463    1     .117
     CONLEGIS              -1.657    .613          7.298    1     .007

a. The reference category is: 3.

In the comparison of survey respondents who thought we spend too little money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (4.913) for the variable confidence in Congress [conlegis] was 0.027. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.
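The Wald statistic and its probability can be approximated from the B and Std. Error columns of the table. This is a minimal Python sketch, assuming the usual one-degree-of-freedom form, Wald = (B / Std. Error)²; the function name wald_test is our own. Because it starts from the rounded values printed in the table, it lands close to, but not exactly on, the Wald values SPSS reports (4.913 and 7.298).

```python
import math


def wald_test(b, std_error):
    """Return the Wald chi-square (1 df) and its upper-tail probability."""
    wald = (b / std_error) ** 2
    # For 1 df, the chi-square tail area equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


# CONLEGIS rows of the Parameter Estimates table (reference category: 3).
comparisons = {
    "too little vs. too much": (-1.373, 0.620),
    "about right vs. too much": (-1.657, 0.613),
}

for label, (b, se) in comparisons.items():
    wald, p = wald_test(b, se)
    print(f"{label}: Wald = {wald:.2f}, p = {p:.3f}, reject H0 at 0.05: {p <= 0.05}")
```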
SW388R7 Data Analysis & Computers II RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3 Slide 43

Parameter Estimates

HIGHWAYS AND BRIDGES(a)    B         Std. Error    Wald     df    Sig.    Exp(B)    95% Confidence Interval for Exp(B), Lower Bound
1    Intercept             3.240     2.478         1.709    1     .191
     AGE                   .019      .020          .906     1     .341    1.019     .980
     EDUC                  .071      .108          .427     1     .514    1.073     .868
     CONLEGIS              -1.373    .620          4.913    1     .027    .253      .075
2    Intercept             3.639     2.456         2.195    1     .138
     AGE                   .003      .020          .017     1     .897    1.003     .963
     EDUC                  .172      .110          2.463    1     .117    1.188     .958
     CONLEGIS              -1.657    .613          7.298    1     .007    .191      .057

a. The reference category is: 3.

The value of Exp(B) was 0.253, which implies that for each unit increase in confidence in Congress the odds decreased by 74.7% (0.253 - 1.0 = -0.747).

The relationship stated in the problem is supported. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%.

SW388R7 Data Analysis & Computers II RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4 Slide 44

[Parameter Estimates table repeated from Slide 43. The reference category is: 3.]

In the comparison of survey respondents who thought we spend about the right amount of money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (7.298) for the variable confidence in Congress [conlegis] was 0.007. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.

SW388R7 Data Analysis & Computers II RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5 Slide 45

[Parameter Estimates table repeated from Slide 43. The reference category is: 3.]

The value of Exp(B) was 0.191, which implies that for each unit increase in confidence in Congress the odds decreased by 80.9% (0.191 - 1.0 = -0.809).

The relationship stated in the problem is supported. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%.
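The step from a b coefficient to a percentage change in the odds, used in both interpretations above, can be written out directly. This is a minimal Python sketch; the function name percent_change_in_odds is our own, and the coefficients are the CONLEGIS values from the Parameter Estimates table.

```python
import math


def percent_change_in_odds(b):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (math.exp(b) - 1.0) * 100.0


# CONLEGIS coefficients for the two comparisons against reference category 3.
coefficients = [
    ("too little vs. too much", -1.373),
    ("about right vs. too much", -1.657),
]

for label, b in coefficients:
    odds_ratio = math.exp(b)             # this is the Exp(B) column
    change = percent_change_in_odds(b)   # negative values are decreases
    print(f"{label}: Exp(B) = {odds_ratio:.3f}, change in odds = {change:.1f}%")
```

A negative b gives an Exp(B) below 1 and therefore a decrease in the odds; a positive b gives an Exp(B) above 1 and an increase.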
SW388R7 Data Analysis & Computers II CLASSIFICATION USING THE MULTINOMIAL LOGISTIC REGRESSION MODEL: BY CHANCE ACCURACY RATE Slide 46

The independent variables could be characterized as useful predictors distinguishing survey respondents who thought we spend too little money on highways and bridges, survey respondents who thought we spend about the right amount of money on highways and bridges, and survey respondents who thought we spend too much money on highways and bridges if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.

Case Processing Summary

                                N      Marginal Percentage
HIGHWAYS AND BRIDGES    1       62     37.1%
                        2       93     55.7%
                        3       12     7.2%
Valid                           167    100.0%
Missing                         103
Total                           270
Subpopulation                   153(a)

a. The dependent variable has only one value observed in 146 (95.4%) subpopulations.

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.371² + 0.557² + 0.072² = 0.453).

SW388R7 Data Analysis & Computers II CLASSIFICATION USING THE MULTINOMIAL LOGISTIC REGRESSION MODEL: CLASSIFICATION ACCURACY Slide 47

Classification

                      Predicted
Observed              1        2        3       Percent Correct
1                     15       47       0       24.2%
2                     7        86       0       92.5%
3                     5        7        0       .0%
Overall Percentage    16.2%    83.8%    .0%     60.5%

The classification accuracy rate was 60.5%, which was greater than or equal to the proportional by chance accuracy criterion of 56.6% (1.25 x 45.3% = 56.6%). The criterion for classification accuracy is satisfied.

SW388R7 Data Analysis & Computers II Answering the question in problem 1 - 1 Slide 48

We found a statistically significant overall relationship between the combination of independent variables and the dependent variable.

There was no evidence of numerical problems in the solution.

Moreover, the classification accuracy surpassed the proportional by chance accuracy criterion, supporting the utility of the model.

SW388R7 Data Analysis & Computers II Answering the question in problem 1 - 2 Slide 49

We verified that each statement about the relationship between an independent variable and the dependent variable was correct in both the direction of the relationship and the change in likelihood associated with a one-unit change of the independent variable, for both of the comparisons between groups stated in the problem.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.

SW388R7 Data Analysis & Computers II Problem 2 Slide 50

1. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships.

The variables "highest year of school completed" [educ], "sex" [sex] and "total family income" [income98] were useful predictors for distinguishing between groups based on responses to "opinion about spending on space exploration" [natspac]. These predictors differentiate survey respondents who thought we spend too little money on space exploration from survey respondents who thought we spend too much money on space exploration and survey respondents who thought we spend about the right amount of money on space exploration from survey respondents who thought we spend too much money on space exploration.
Among this set of predictors, total family income was helpful in distinguishing among the groups defined by responses to opinion about spending on space exploration. Survey respondents who had higher total family incomes were more likely to be in the group of survey respondents who thought we spend about the right amount of money on space exploration, rather than the group of survey respondents who thought we spend too much money on space exploration. For each unit increase in total family income, the odds of being in the group of survey respondents who thought we spend about the right amount of money on space exploration increased by 6.0%. 1. 2. 3. 4. True True with caution False Inappropriate application of a statistic SW388R7 Data Analysis & Computers II Dissecting problem 2 - 1 Slide 51 1. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships. The variables "highest year of school completed" [educ], "sex" [sex] and "total family income" [income98] were useful predictors for distinguishing between groups based on responses to "opinion about spending on space exploration" For these [natspac]. problems, These we willpredictors differentiate survey respondents who thought we spend too little money on space exploration from survey assume that there is no problem respondents who thought we spend too much money on space exploration and survey with missing data, outliers, or respondents who thought we spend about the right amount of money on space exploration from influential cases, and that the survey respondents who thought we spend too much money on space exploration. 
validation analysis will confirm the generalizability of the Among this set of predictors, total family income was helpful in distinguishing among the groups results defined by responses to opinion about spending on space exploration. Survey respondents who had higher total family incomes were In more to be thetold group this likely problem, weinare to of survey respondents who thought we spend about the right amount of money on space exploration, rather than the group use 0.05 as alpha for the of survey respondents who thought wemultinomial spend too logistic much money on space exploration. For each regression. unit increase in total family income, the odds of being in the group of survey respondents who thought we spend about the right amount of money on space exploration increased by 6.0%. 1. 2. 3. 4. True True with caution False Inappropriate application of a statistic SW388R7 Data Analysis & Computers II Dissecting problem 2 - 2 Slide 52 The variables listed first in the problem statement are the independent variables 1. In (IVs): the dataset GSS2000, is the following statement true, false, or an incorrect application of "highest year of school completed" a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and [educ], "sex" [sex] and "total family that the validation analysis will confirm the generalizability of the results. Use a level of income" [income98]. significance of 0.05 for evaluating the statistical relationships. The variables "highest year of school completed" [educ], "sex" [sex] and "total family income" [income98] were useful predictors for distinguishing between groups based on responses to "opinion about spending on space exploration" [natspac]. 
These predictors differentiate survey respondents who thought we spend too little money on space exploration from survey respondents who thought we spend too much money on space exploration and survey respondents who thought we spend about the right amount of money on space exploration from survey respondents who thought we spend too much money on space exploration.

Among this set of predictors, total family income was helpful in distinguishing among the groups defined by responses to opinion about spending on space exploration. Survey respondents who had higher total family incomes were more likely to be in the group of survey respondents who thought we spend about the right amount of money on space exploration, rather than the group of survey respondents who thought we spend too much money on space exploration. For each unit increase in total family income, the odds of being in the group of survey respondents who thought we spend about the right amount of money on space exploration increased by 6.0%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The variable used to define groups is the dependent variable (DV): "opinion about spending on space exploration" [natspac].

SPSS only supports direct or simultaneous entry of independent variables in multinomial logistic regression, so we have no choice of method for entering variables.

SW388R7 Data Analysis & Computers II Dissecting problem 2 - 3 Slide 53

SPSS multinomial logistic regression models the relationship by comparing each of the groups defined by the dependent variable to the group with the highest code value.

The responses to opinion about spending on the space program were: 1 = Too little, 2 = About right, and 3 = Too much.

The analysis will result in two comparisons:
• survey respondents who thought we spend too little money versus survey respondents who thought we spend too much money on space exploration
• survey respondents who thought we spend about the right amount of money versus survey respondents who thought we spend too much money on space exploration

SW388R7 Data Analysis & Computers II Dissecting problem 2 - 4 Slide 54

Each problem includes a statement about the relationship between one independent variable and the dependent variable. The answer to the problem is based on the stated relationship, ignoring the relationships between the other independent variables and the dependent variable.

This problem identifies a difference for only one of the two comparisons based on the three values of the dependent variable. Other problems will specify both of the possible comparisons.
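As a rough illustration of the reference-group rule described on Slide 53, the following Python sketch lists the comparisons that result when every group is compared to the group with the highest code value. The function name `reference_comparisons` and the dictionary `natspac_labels` are hypothetical names introduced for this sketch; only the codes and labels come from the slide.

```python
# Value labels for natspac as listed on Slide 53.
natspac_labels = {1: "Too little", 2: "About right", 3: "Too much"}


def reference_comparisons(labels):
    """Pair each category with the reference category (the highest code)."""
    reference = max(labels)  # highest code value serves as the reference group
    return [
        (labels[code], labels[reference])
        for code in sorted(labels)
        if code != reference
    ]


comparisons = reference_comparisons(natspac_labels)
for group, reference_group in comparisons:
    print(f"{group} versus {reference_group}")
```

With three categories this yields two comparisons, which is why each independent variable has two sets of coefficients in the output.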
SW388R7 Data Analysis & Computers II Dissecting problem 2 - 5 Slide 55

In order for the multinomial logistic regression question to be true, the overall relationship must be statistically significant, there must be no evidence of numerical problems, the classification accuracy rate must be substantially better than could be obtained by chance alone, and the stated individual relationship must be statistically significant and interpreted correctly.

SW388R7 Data Analysis & Computers II LEVEL OF MEASUREMENT - 1 Slide 56

Multinomial logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.

"Opinion about spending on space exploration" [natspac] is ordinal, satisfying the non-metric level of measurement requirement for the dependent variable.

It contains three categories: survey respondents who thought we spend too little money, about the right amount of money, and too much money on space exploration.

SW388R7 Data Analysis & Computers II LEVEL OF MEASUREMENT - 2 Slide 57

"Highest year of school completed" [educ] is interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.

"Sex" [sex] is dichotomous, satisfying the metric or dichotomous level of measurement requirement for independent variables.

"Total family income" [income98] is ordinal. If we follow the convention of treating ordinal level variables as metric variables, it satisfies the metric or dichotomous level of measurement requirement for independent variables, and the level of measurement requirement for the analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

SW388R7 Data Analysis & Computers II Request multinomial logistic regression Slide 58

Select the Regression | Multinomial Logistic… command from the Analyze menu.

SW388R7 Data Analysis & Computers II Selecting the dependent variable Slide 59

First, highlight the dependent variable natspac in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Dependent text box.

SW388R7 Data Analysis & Computers II Selecting non-metric independent variables Slide 60

Non-metric independent variables are specified as factors in multinomial logistic regression. Non-metric variables can be dichotomous, nominal, or ordinal. These variables will be dummy coded as needed and each value will be listed separately in the output.

Select the dichotomous variable sex. Move the non-metric independent variables listed in the problem to the Factor(s) list box.

SW388R7 Data Analysis & Computers II Selecting metric independent variables Slide 61

Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal.

Move the metric independent variables, educ and income98, to the Covariate(s) list box.

SW388R7 Data Analysis & Computers II Specifying statistics to include in the output Slide 62

While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table.
Click on the Statistics… button to make a request.

SW388R7 Data Analysis & Computers II Requesting the classification table Slide 63

First, keep the SPSS defaults for Summary statistics, Likelihood ratio test, and Parameter estimates.

Second, mark the checkbox for the Classification table.

Third, click on the Continue button to complete the request.

SW388R7 Data Analysis & Computers II Slide 64 Completing the multinomial logistic regression request

Click on the OK button to request the output for the multinomial logistic regression.

The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.

SW388R7 Data Analysis & Computers II Sample size – ratio of cases to variables Slide 65

Case Processing Summary

                                    N    Marginal Percentage
SPACE EXPLORATION PROGRAM   1      33    15.9%
                            2      90    43.3%
                            3      85    40.9%
RESPONDENTS SEX             1      94    45.2%
                            2     114    54.8%
Valid                             208    100.0%
Missing                            62
Total                             270
Subpopulation                     138a

a. The dependent variable has only one value observed in 112 (81.2%) subpopulations.

Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (208) to number of independent variables (3) was 69.3 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.

The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 69.3 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied.

SW388R7 Data Analysis & Computers II Slide 66 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

Model Fitting Information

Model            -2 Log Likelihood    Chi-Square    df    Sig.
Intercept Only   354.268
Final            334.967              19.301        6     .004

The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information".

In this analysis, the probability of the model chi-square (19.301) was 0.004, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.

SW388R7 Data Analysis & Computers II NUMERICAL PROBLEMS Slide 67

Parameter Estimates

SPACE EXPLORATION                                                                95% Confidence Interval for Exp(B)
PROGRAM (a)         B        Std. Error   Wald     df   Sig.   Exp(B)   Lower Bound
1   Intercept       -4.136   1.157        12.779   1    .000
    EDUC            .101     .089         1.276    1    .259   1.106    .929
    INCOME98        .097     .050         3.701    1    .054   1.102    .998
    [SEX=1]         .672     .426         2.488    1    .115   1.959    .850
    [SEX=2]         0 (b)    .            .        0    .      .        .
2   Intercept       -2.487   .840         8.774    1    .003
    EDUC            .108     .068         2.521    1    .112   1.114    .975
    INCOME98        .058     .034         2.932    1    .087   1.060    .992
    [SEX=1]         .501     .317         2.492    1    .114   1.650    .886
    [SEX=2]         0 (b)    .            .        0    .      .        .

a. The reference category is: 3.
b. This parameter is set to zero because it is redundant.

Multicollinearity in the multinomial logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

None of the independent variables in this analysis had a standard error larger than 2.0.
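The checks on Slides 65 to 67 can be reproduced by hand from the reported SPSS figures. The sketch below is one possible way to do so in Python using only the standard library; it is not SPSS's own computation. The helper `chi2_sf_even_df` is a hypothetical name, and it relies on the closed-form upper-tail probability of the chi-square distribution that holds only when the degrees of freedom are even (as they are here, df = 6).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


# Slide 65: ratio of valid cases to independent variables.
valid_cases, n_independent = 208, 3
ratio = valid_cases / n_independent
meets_minimum = ratio >= 10      # minimum ratio of 10 to 1
meets_preferred = ratio >= 20    # preferred ratio of 20 to 1

# Slide 66: model chi-square is the drop in -2 log likelihood.
model_chi_square = 354.268 - 334.967
p_model = chi2_sf_even_df(model_chi_square, df=6)

# Slide 67: standard errors from the Parameter Estimates table.
std_errors = [1.157, 0.089, 0.050, 0.426, 0.840, 0.068, 0.034, 0.317]
numerical_problem = any(se > 2.0 for se in std_errors)

# Slide 67: B for INCOME98 in the "about right" versus "too much" comparison.
odds_change_pct = (math.exp(0.058) - 1) * 100

print(f"ratio = {ratio:.1f} to 1")
print(f"model chi-square = {model_chi_square:.3f}, p = {p_model:.3f}")
print(f"numerical problem: {numerical_problem}")
print(f"change in odds per unit of income98 = {odds_change_pct:.1f}%")
```

The last line shows where the 6.0% in the problem statement comes from: it is Exp(B) for INCOME98 in the second comparison (1.060), expressed as a percentage change in the odds.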
SW388R7 Data Analysis & Computers II Slide 68 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept   334.967a                              .000          0     .
EDUC        337.788                               2.821         2     .244
INCOME98    340.154                               5.187         2     .075
SEX         338.511                               3.544         2     .170

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

The statistical significance of the relationship between total family income and opinion about spending on space exploration is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests".

For this relationship, the probability of the chi-square statistic (5.187) was 0.075, greater than the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with total family income were equal to zero was not rejected. The existence of a relationship between total family income and opinion about spending on space exploration was not supported.

SW388R7 Data Analysis & Computers II Answering the question in problem 2 Slide 69

We found a statistically significant overall relationship between the combination of independent variables and the dependent variable.

There was no evidence of numerical problems in the solution.

However, the individual relationship between total family income and spending on space was not statistically significant.

The answer to the question is false.

SW388R7 Data Analysis & Computers II Slide 70 Steps in multinomial logistic regression: level of measurement and initial sample size

The following is a guide to the decision process for answering problems about the basic relationships in multinomial logistic regression:

Dependent non-metric? Independent variables metric or dichotomous?
• No: Inappropriate application of a statistic
• Yes: Ratio of cases to independent variables at least 10 to 1?
   • No: Inappropriate application of a statistic
   • Yes: Run multinomial logistic regression

SW388R7 Data Analysis & Computers II Slide 71 Steps in multinomial logistic regression: overall relationship and numerical problems

Overall relationship statistically significant? (model chi-square test)
• No: False
• Yes: Standard errors of coefficients indicate no numerical problems (s.e. <= 2.0)?
   • No: False
   • Yes: continue with the next step

SW388R7 Data Analysis & Computers II Slide 72 Steps in multinomial logistic regression: relationships between IV's and DV

Overall relationship between specific IV and DV is statistically significant? (likelihood ratio test)
• No: False
• Yes: Role of specific IV and DV groups statistically significant and interpreted correctly? (Wald test and Exp(B))
   • No: False
   • Yes: continue with the next step

SW388R7 Data Analysis & Computers II Slide 73 Steps in multinomial logistic regression: classification accuracy and adding cautions

Overall accuracy rate is 25% greater than the proportional by chance accuracy rate?
• No: False
• Yes: Satisfies preferred ratio of cases to IV's of 20 to 1?
   • No: True with caution
   • Yes: One or more IV's are ordinal level treated as metric?
      • No: True
      • Yes: True with caution
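The decision process on Slides 70 to 73 can be written out as a single function, which may make the order of the checks easier to follow. The Python sketch below is only one reading of the flowcharts: the function name `answer_problem` and all of its parameter names are hypothetical, and the inputs for problem 2 are the values SPSS reported on Slides 65 to 68. The two inputs for the later steps (interpretation of the individual relationship and classification accuracy) are placeholders, because the process stops before they are consulted.

```python
def answer_problem(dv_nonmetric, ivs_metric_or_dichotomous, valid_cases, n_ivs,
                   p_model, max_std_error, p_iv, iv_role_supported,
                   accuracy_criterion_met, ordinal_iv_treated_as_metric,
                   alpha=0.05):
    """Walk through the decision steps of Slides 70 to 73 in order."""
    inappropriate = "Inappropriate application of a statistic"
    ratio = valid_cases / n_ivs

    # Slide 70: level of measurement and minimum sample size.
    if not (dv_nonmetric and ivs_metric_or_dichotomous):
        return inappropriate
    if ratio < 10:
        return inappropriate

    # Slide 71: overall relationship and numerical problems.
    if p_model > alpha:
        return "False"
    if max_std_error > 2.0:
        return "False"

    # Slide 72: relationship between the specific IV and the DV.
    if p_iv > alpha:
        return "False"
    if not iv_role_supported:
        return "False"

    # Slide 73: classification accuracy and cautions.
    if not accuracy_criterion_met:
        return "False"
    if ratio < 20 or ordinal_iv_treated_as_metric:
        return "True with caution"
    return "True"


# Problem 2, using the significance values reported in the SPSS tables.
answer = answer_problem(
    dv_nonmetric=True,
    ivs_metric_or_dichotomous=True,    # income98 is ordinal, treated as metric by convention
    valid_cases=208,
    n_ivs=3,
    p_model=0.004,                     # Model Fitting Information
    max_std_error=1.157,               # largest Std. Error in Parameter Estimates
    p_iv=0.075,                        # Likelihood Ratio Tests, INCOME98
    iv_role_supported=False,           # placeholder: not reached for problem 2
    accuracy_criterion_met=False,      # placeholder: not reached for problem 2
    ordinal_iv_treated_as_metric=True,
)
print(answer)
```

For problem 2 the function stops at the likelihood ratio test for total family income, which agrees with the answer given on Slide 69.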