SW388R7 Data Analysis & Computers II Logistic Regression – Basic Relationships Slide 1 Logistic Regression Describing Relationships Classification Accuracy Sample Problems.
Slide 2: Logistic regression

Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or dichotomous independent variables. (SPSS now supports Multinomial Logistic Regression, which can be used with more than two groups, but our focus here is on binary logistic regression for two groups.)

Logistic regression combines the independent variables to estimate the probability that a particular event will occur, i.e. that a subject will be a member of one of the groups defined by the dichotomous dependent variable.

In SPSS, the model is always constructed to predict the group with the higher numeric code. If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event. This will create some awkward wording in our problems; our only option for changing it is to recode the variable.

Slide 3: What logistic regression predicts

The variate or value produced by logistic regression is a probability value between 0.0 and 1.0. If the probability of membership in the modeled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modeled group. If the probability is below the cut point, the subject is predicted to be a member of the other group.

For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category.
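The cut point rule described above can be sketched in a few lines of Python. This is only an illustration, not SPSS output: the intercept, coefficients, and case values are hypothetical, and the function names are made up for this sketch. The slides do not say how a probability exactly equal to the cut point is handled, so the strict comparison below is an assumption.

```python
import math

def modeled_probability(intercept, coefficients, values):
    """Probability of the modeled event, computed with the logistic function."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-logit))

def predict_modeled_group(probability, cut_point=0.50):
    """True when the case is predicted to belong to the modeled group.

    Assumption: a probability exactly at the cut point is assigned to the
    other group; the slides only describe "above" and "below".
    """
    return probability > cut_point

# Hypothetical intercept, coefficients, and case values (illustration only).
p = modeled_probability(-1.0, [0.5, 0.25], [2.0, 4.0])
in_modeled_group = predict_modeled_group(p)
print(round(p, 3), in_modeled_group)
```

Changing the cut_point argument shows how the same probability can lead to a different predicted group.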
Slide 4: Level of measurement requirements

Logistic regression analysis requires that the dependent variable be dichotomous and that the independent variables be metric or dichotomous. If an independent variable is nominal level and not dichotomous, the logistic regression procedure in SPSS has an option to dummy code the variable for you. If an independent variable is ordinal, we will attach the usual caution.

Slide 5: Assumptions

Logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. Because it does not impose these requirements, it is preferred to discriminant analysis when the data do not satisfy these assumptions.

Slide 6: Sample size requirements

The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for logistic regression. For preferred case-to-variable ratios, we will use 20 to 1 for simultaneous and hierarchical logistic regression and 50 to 1 for stepwise logistic regression.

Slide 7: Methods for including variables

There are three methods available for including variables in the regression equation: the simultaneous method, in which all independent variables are included at the same time; the hierarchical method, in which control variables are entered in the analysis before the predictors whose effects we are primarily concerned with; and the stepwise method (forward conditional in SPSS), in which variables are selected in the order in which they maximize the statistically significant contribution to the model.

For all methods, the contribution to the model is measured by the model chi-square, a statistical measure of the fit between the dependent and independent variables, like R².
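As a small illustration of the sample size guidelines from Slide 6, the ratio check might be sketched as follows. The function name and its return format are assumptions made for this sketch; the thresholds (10, 20, and 50 cases per independent variable) are the ones stated on the slide, and the example counts are the 177 valid cases and 3 independent variables used in Problem 1 of this deck.

```python
def case_ratio_check(valid_cases, n_independent, stepwise=False):
    """Compare the cases-per-variable ratio with the guidelines on Slide 6."""
    ratio = valid_cases / n_independent
    # Preferred ratio: 50 to 1 for stepwise, 20 to 1 for simultaneous
    # and hierarchical logistic regression.
    preferred = 50 if stepwise else 20
    return {
        "ratio": ratio,
        "meets_minimum": ratio >= 10,
        "meets_preferred": ratio >= preferred,
    }

check = case_ratio_check(valid_cases=177, n_independent=3)
print(check)
```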
Slide 8: Computational method

Multiple regression uses the least-squares method to find the coefficients for the independent variables in the regression equation, i.e. it computes coefficients that minimize the sum of squared residuals across all cases.

Logistic regression uses maximum-likelihood estimation to compute the coefficients for the logistic regression equation. This method attempts to find coefficients that match the breakdown of cases on the dependent variable.

The overall measure of how well the model fits is given by the likelihood value (reported by SPSS as the -2 log likelihood), which is similar to the residual or error sum of squares value for multiple regression. A model that fits the data well will have a small -2 log likelihood value. A perfect model would have a -2 log likelihood value of zero.

Maximum-likelihood estimation is an iterative procedure that successively works to get closer and closer to the correct answer. When SPSS reports the "iterations," it is telling us how many cycles it took to get the answer.

Slide 9: Overall test of relationship

The overall test of the relationship between the independent variables and the groups defined by the dependent variable is based on the reduction in the likelihood value from a model which does not contain any independent variables to the model that contains the independent variables.

This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square. The significance test for the model chi-square is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables.

Slide 10: Beginning logistic regression model

The SPSS output for logistic regression begins with output for a model that contains no independent variables.
SPSS labels this output "Block 0: Beginning Block" and (if we request the optional iteration history) reports the initial -2 log likelihood, which we can think of as a measure of the error associated with trying to predict the dependent variable without using any information from the independent variables. In this example, the initial -2 log likelihood is 213.891. We will not routinely request the iteration history because it does not usually give us additional useful information.

Slide 11: Ending logistic regression model

After the independent variables are entered in Block 1, the -2 log likelihood is again measured (180.267 in this problem). The difference between the beginning and ending -2 log likelihood is the model chi-square that is used in the test of overall statistical significance. In this problem, the model chi-square is 33.625 (213.891 - 180.267, allowing for rounding in the displayed values), which is statistically significant at p < 0.001.

Slide 12: Relationship of individual independent variables and dependent variable

There is a test of significance for the relationship between an individual independent variable and the dependent variable: a significance test of the Wald statistic.

The individual coefficients represent the change in the log odds of being a member of the modeled category. Because they are expressed in log units, they are not directly interpretable. However, if the b coefficient is used as the power to which the base of the natural logarithm (2.71828) is raised, the result represents the change in the odds of the modeled event associated with a one-unit change in the independent variable.

If a coefficient is positive, its transformed log value will be greater than one, meaning that the modeled event is more likely to occur. If a coefficient is negative, its transformed log value will be less than one, and the odds of the event occurring decrease.
A coefficient of zero (0) has a transformed log value of 1.0, meaning that this coefficient does not change the odds of the event one way or the other.

Slide 13: Numerical problems

The maximum likelihood method used to calculate logistic regression is an iterative fitting process that cycles through repetitions to find an answer. Sometimes the method will break down and not be able to converge on an answer. Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions.

These implausible results can be produced by multicollinearity, by categories of predictors having no cases (zero cells), and by complete separation, whereby the two groups are perfectly separated by the scores on one or more independent variables.

The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables.

Slide 14: Strength of logistic regression relationship

While logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model.

A more useful measure to assess the utility of a logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.

Slide 15: Evaluating usefulness for logistic models

The benchmark that we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.
Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.

The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by summing the squared proportion of cases in each group.

Slide 16: Comparing accuracy rates

To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for logistic regression.)

Classification Table (a), Step 1
Observed: EXPECT U.S. IN         Predicted   Predicted   Percentage
WORLD WAR IN 10 YEARS               YES          NO        Correct
YES                                  20          34          37.0
NO                                   10          72          87.8
Overall Percentage                                           67.6
a. The cut value is .500

SPSS reports the overall accuracy rate in the Overall Percentage row of the "Classification Table." The overall accuracy rate computed by SPSS was 67.6%.

Slide 17: Computing by chance accuracy

The number of cases in each group is found in the Classification Table at Step 0 (before any independent variables are included). The proportion of cases in the largest group is equal to the overall percentage (60.3%).

Classification Table (a, b), Step 0
Observed: EXPECT U.S. IN         Predicted   Predicted   Percentage
WORLD WAR IN 10 YEARS               YES          NO        Correct
YES                                   0          54            .0
NO                                    0          82         100.0
Overall Percentage                                           60.3
a. Constant is included in the model.
b. The cut value is .500

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (0.397² + 0.603² = 0.521).
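The by chance accuracy arithmetic can be sketched in Python as a check on the hand calculation. The function names are made up for this sketch; the group counts (54 and 82) and the model accuracy (67.6%) come from the classification tables above. Note that working from unrounded proportions gives a criterion of about 0.651, slightly different from the 65.2% obtained on the slides from proportions rounded to three decimals; the conclusion is the same either way.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of the squared proportion of cases in each group."""
    total = sum(group_counts)
    return sum((count / total) ** 2 for count in group_counts)

def accuracy_criterion(group_counts, improvement=1.25):
    """By chance accuracy increased by the 25% improvement benchmark."""
    return improvement * proportional_by_chance_accuracy(group_counts)

# Group sizes from the Step 0 classification table (YES = 54, NO = 82).
chance = proportional_by_chance_accuracy([54, 82])
criterion = accuracy_criterion([54, 82])

# Overall accuracy rate reported by SPSS at Step 1.
model_accuracy = 0.676
is_useful = model_accuracy >= criterion
print(round(chance, 3), round(criterion, 3), is_useful)
```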
The proportional by chance accuracy criterion is 65.2% (1.25 x 52.1% = 65.2%).

Slide 18: Problem 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationship.

The variables "age" [age], "sex" [sex], and "liberal or conservative political views" [polviews] were useful predictors for distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who have not seen an x-rated movie from survey respondents who have seen an x-rated movie.

Survey respondents who were older were more likely to have not seen an x-rated movie. A one unit increase in age increased the odds that survey respondents have not seen an x-rated movie by 3.9%. Survey respondents who were female were approximately six and three quarters times more likely to have not seen an x-rated movie. Survey respondents who were more conservative were more likely to have not seen an x-rated movie. A one unit increase in liberal or conservative political views increased the odds that survey respondents have not seen an x-rated movie by approximately one and a quarter times.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 19: Dissecting problem 1 - 1

For these problems, we will assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results.

In this problem, we are told to use 0.05 as alpha for the logistic regression.

Slide 20: Dissecting problem 1 - 2

The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "sex" [sex], and "liberal or conservative political views" [polviews].

The variable used to define groups is the dependent variable (DV): "seen x-rated movie in last year" [xmovie].

When a problem states that a list of independent variables can distinguish among groups and does not identify control variables or an order of importance for the variables, we do a logistic regression entering all of the variables simultaneously.

Slide 21: Dissecting problem 1 - 3

SPSS logistic regression models the relationship by computing the changes in the likelihood of falling in the category of the dependent variable which had the highest numerical code. The responses to seeing an x-rated movie were coded: 1 = Yes and 2 = No.

The SPSS output will model the changes in the likelihood of not seeing an x-rated movie because the code for No is 2.

The statements of the specific relationships between the independent variables and the dependent variable are all phrased in terms of the impact on not seeing an x-rated movie.

Slide 22: Dissecting problem 1 - 4

The specific relationships for the independent variables listed in the problem indicate the direction of the relationship, increasing or decreasing the likelihood of falling in the modeled group, and the amount of change in the odds associated with a one-unit change in the independent variable.

In order for the logistic regression question to be true, the overall relationship must be statistically significant, there must be no evidence of a flawed numerical analysis, the classification accuracy rate must be substantially better than could be obtained by chance alone, and each significant relationship must be interpreted correctly.

Slide 23: Level of measurement - 1

Logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous. "Seen x-rated movie in last year" [xmovie] is a dichotomous variable, which satisfies the level of measurement requirement.

It contains two categories: survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year.

Slide 24: Level of measurement - 2

"Age" [age] is an interval level variable, which satisfies the level of measurement requirements for logistic regression analysis.

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in logistic regression.

"Liberal or conservative political views" [polviews] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for logistic regression analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

Slide 25: Request simultaneous logistic regression

Select the Regression | Binary Logistic… command from the Analyze menu.
Slide 26: Selecting the dependent variable

First, highlight the dependent variable xmovie in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Slide 27: Selecting the independent variables

Move the independent variables listed in the problem to the Covariates list box.

Slide 28: Specifying the method for including variables

SPSS provides us with two methods for including variables: entering all of the independent variables at one time, and a stepwise method that selects variables using a statistical test to determine the order in which they are included. SPSS also supports the specification of "Blocks" of variables for testing hierarchical models.

Since the problem states that there is a relationship without requesting the best predictors, we specify Enter as the method for including variables.

Slide 29: Completing the logistic regression request

Click on the OK button to request the output for the logistic regression.

The logistic procedure supports the selection of subsets of cases, automatic recoding of nominal variables, saving diagnostic statistics like standardized residuals and Cook's distance, and options for additional statistics. However, none of these are needed for this analysis.

Slide 30: Sample size – ratio of cases to variables

Case Processing Summary
Unweighted Cases (a)                            N     Percent
Selected Cases     Included in Analysis        177      65.6
                   Missing Cases                93      34.4
                   Total                       270     100.0
Unselected Cases                                 0        .0
Total                                          270     100.0
a. If weight is in effect, see classification table for the total number of cases.

The minimum ratio of valid cases to independent variables for logistic regression is 10 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 177 valid cases and 3 independent variables. The ratio of cases to independent variables is 59.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 59.0 to 1 satisfies the preferred ratio of 20 to 1.

Slide 31: Overall relationship between independent and dependent variables

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step        39.668       3    .000
         Block       39.668       3    .000
         Model       39.668       3    .000

The presence of a relationship between the dependent variable and the combination of independent variables is based on the statistical significance of the model chi-square at step 1, after the independent variables have been added to the analysis.

In this analysis, the probability of the model chi-square (39.668) was <0.001, less than or equal to the level of significance of 0.05. The null hypothesis that there is no difference between the model with only a constant and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.

Slide 32: Numerical problems

Variables in the Equation
                          B      S.E.     Wald    df    Sig.   Exp(B)
Step 1 (a)  AGE         .038     .014    7.629     1    .006    1.039
            SEX        1.901     .410   21.452     1    .000    6.689
            POLVIEWS    .306     .135    5.110     1    .024    1.358
            Constant  -4.590    1.045   19.302     1    .000     .010
a. Variable(s) entered on step 1: AGE, SEX, POLVIEWS.

Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and "complete separation," whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.
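The standard error screen, together with the Exp(B) transformation shown in the same table, can be sketched in Python. The function names are made up for this sketch; the B and S.E. values are copied from the Variables in the Equation table for Problem 1, and the Constant is left out of the dictionary on purpose. Because the table displays B to three decimals, the recomputed Exp(B) for SEX (about 6.693) differs slightly from the 6.689 that SPSS reports.

```python
import math

# (B, S.E.) for each independent variable, from the SPSS table.
# The Constant is deliberately excluded from the standard error check.
coefficients = {
    "AGE": (0.038, 0.014),
    "SEX": (1.901, 0.410),
    "POLVIEWS": (0.306, 0.135),
}

def flag_numerical_problems(table, threshold=2.0):
    """Names of predictors whose standard error exceeds the threshold."""
    return [name for name, (_, se) in table.items() if se > threshold]

def odds_ratios(table):
    """exp(B) for each predictor: the multiplier applied to the odds."""
    return {name: math.exp(b) for name, (b, _) in table.items()}

problems = flag_numerical_problems(coefficients)
ratios = odds_ratios(coefficients)
print(problems)
print({name: round(value, 3) for name, value in ratios.items()})
```

An empty list from flag_numerical_problems means no predictor exceeds the 2.0 threshold used in this course.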
None of the independent variables in this analysis had a standard error larger than 2.0. (The check for standard errors larger than 2.0 does not include the standard error for the Constant.)

Slide 33: Relationship of individual independent variables to dependent variable - 1

The values cited on Slides 33 to 35 come from the Variables in the Equation table shown on Slide 32.

The probability of the Wald statistic for the variable age was 0.006, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for age was equal to zero was rejected. This supports the relationship that "survey respondents who were older were more likely to have not seen an x-rated movie."

The value of Exp(B) was 1.039, which implies that a one unit increase in age increased the odds that survey respondents have not seen an x-rated movie by 3.9%. This confirms the statement of the amount of change in the likelihood of belonging to the modeled group of the dependent variable associated with a one unit change in the independent variable, age.

Slide 34: Relationship of individual independent variables to dependent variable - 2

The probability of the Wald statistic for the variable sex was <0.001, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for sex was equal to zero was rejected. This supports the relationship that "survey respondents who were female were approximately six and three quarters times more likely to have not seen an x-rated movie."

The value of Exp(B) was 6.689, which implies that a one unit increase in sex (that is, being female rather than male) increased the odds that survey respondents have not seen an x-rated movie by approximately six and three quarters times.

Slide 35: Relationship of individual independent variables to dependent variable - 3

The probability of the Wald statistic for the variable liberal or conservative political views was 0.024, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for liberal or conservative political views was equal to zero was rejected. This supports the relationship that "survey respondents who were more conservative were more likely to have not seen an x-rated movie."

Liberal or conservative political views is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were more conservative.

The value of Exp(B) was 1.358 (an increase of 35.8% in the odds), which implies that a one unit increase in liberal or conservative political views increased the odds that survey respondents have not seen an x-rated movie by approximately one and a quarter times.

Slide 36: Classification using the logistic regression model: by chance accuracy rate

The independent variables could be characterized as useful predictors distinguishing survey respondents who have not seen an x-rated movie from survey respondents who have seen an x-rated movie if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone.
Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate, that is, at least 1.25 times that rate.

Classification Table(a,b), Step 0
                                        Predicted: SEEN X-RATED MOVIE IN LAST YEAR
Observed                                YES     NO      Percentage Correct
SEEN X-RATED MOVIE IN LAST YEAR   YES   0       45      .0
                                  NO    0       132     100.0
Overall Percentage                                      74.6
a. Constant is included in the model.
b. The cut value is .500

The proportional by chance accuracy rate was computed by first calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0. The proportion in the "YES" group is 45/177 = 0.254. The proportion in the "NO" group is 132/177 = 0.746. Then, we square and sum the proportion of cases in each group (0.254² + 0.746² = 0.621). 0.621 is the proportional by chance accuracy rate.

CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL: criteria for classification accuracy Slide 37

Classification Table(a), Step 1
                                        Predicted: SEEN X-RATED MOVIE IN LAST YEAR
Observed                                YES     NO      Percentage Correct
SEEN X-RATED MOVIE IN LAST YEAR   YES   19      26      42.2
                                  NO    9       123     93.2
Overall Percentage                                      80.2
a. The cut value is .500

The accuracy rate computed by SPSS was 80.2%, which was greater than or equal to the proportional by chance accuracy criterion of 77.6% (1.25 x 62.1% = 77.6%). The criterion for classification accuracy is satisfied.

Answering the question in problem 1 - 1 Slide 38

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationship.
The variables "age" [age], "sex" [sex], and "liberal or conservative political views" [polviews] were useful predictors for distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who have not seen an x-rated movie from survey respondents who have seen an x-rated movie.

Survey respondents who were older were more likely to have not seen an x-rated movie. A one unit increase in age increased the odds that survey respondents have not seen an x-rated movie by 3.9%. Survey respondents who were female were approximately six and three quarters times more likely to have not seen an x-rated movie. Survey respondents who were more conservative were more likely to have not seen an x-rated movie. A one unit increase in liberal or conservative political views increased the odds that survey respondents have not seen an x-rated movie by approximately one and a quarter times.

We found a statistically significant overall relationship between the combination of independent variables and the dependent variable.

There was no evidence of numerical problems in the solution.

Moreover, the classification accuracy surpassed the proportional by chance accuracy criteria, supporting the utility of the model.

Answering the question in problem 1 - 2 Slide 39

We verified that each statement about the relationship between an independent variable and the dependent variable was correct in both the direction of the relationship and the change in likelihood associated with a one-unit change of the independent variable.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.

Problem 2 Slide 40

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal" [grass], the variables "general happiness" [happy] and "confidence in the executive branch of the federal government" [confed] were useful predictors for distinguishing between groups based on responses to "should marijuana be made legal" [grass]. These predictors differentiate survey respondents who have been less supportive that the use of marijuana should be made legal from survey respondents who have been more supportive that the use of marijuana should be made legal.

Survey respondents who were less happy overall were less likely to have been less supportive that the use of marijuana should be made legal. A one unit increase in general happiness decreased the odds that survey respondents have been less supportive that the use of marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the executive branch of the federal government were less likely to have been less supportive that the use of marijuana should be made legal. A one unit increase in confidence in the executive branch of the federal government decreased the odds that survey respondents have been less supportive that the use of marijuana should be made legal by 42.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Dissecting problem 2 - 1 Slide 41

For these problems, we will assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results.

In this problem, we are told to use 0.05 as alpha for the logistic regression.

Dissecting problem 2 - 2 Slide 42

The variables listed first in the problem statement are the independent variables (IVs): "sex" [sex], "general happiness" [happy], and "confidence in the executive branch of the federal government" [confed]. Sex is a control variable, and general happiness and confidence in the executive branch are predictors.

The variable used to define groups is the dependent variable (DV): "should marijuana be made legal" [grass].

When a problem identifies control variables, we do a hierarchical logistic regression, entering the variables in SPSS in blocks.
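The block structure described on this slide can be sketched in a few lines of Python. This is only an illustration of how the control variable and the predictors form two nested models; the names `DEPENDENT`, `BLOCKS`, and `nested_models` are hypothetical and do not correspond to anything in SPSS.

```python
# Variable roles taken from the problem statement; the constant names and
# the helper below are hypothetical, chosen only for this illustration.
DEPENDENT = "grass"
BLOCKS = [
    ["sex"],              # block 1: control variable
    ["happy", "confed"],  # block 2: predictors
]


def nested_models(blocks):
    """Return the independent variables in the model after each block is entered."""
    models, entered = [], []
    for block in blocks:
        entered = entered + block
        models.append(list(entered))
    return models


for step, (block, model) in enumerate(zip(BLOCKS, nested_models(BLOCKS)), start=1):
    # Each block is tested with as many degrees of freedom as variables it adds.
    print(f"Block {step}: adds {block} (df={len(block)}), "
          f"model is now {DEPENDENT} ~ {' + '.join(model)}")
```

The second block is the one whose chi-square is evaluated for the predictors, since the control variable is already in the model when it is entered.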
Dissecting problem 2 - 3 Slide 43

SPSS logistic regression models the relationship by computing the changes in the likelihood of falling in the category of the dependent variable which had the highest numerical code. The responses to "should marijuana be made legal" [grass] were coded: 1 = Legal and 2 = Not Legal.

The SPSS output will model the changes in the likelihood of being less supportive of legalizing marijuana because 2 corresponds to not legalizing marijuana.

The statements of the specific relationships between independent variables and the dependent variable are all phrased in terms of impact on being less supportive of legalizing marijuana.

Dissecting problem 2 - 4 Slide 44

The specific relationships for the independent variables listed in the problem indicate the direction of the relationship, increasing or decreasing the likelihood of falling in the modeled group, and the amount of change in the odds associated with a one-unit change in the independent variable.

In order for the logistic regression question to be true, the relationship between the predictors and the dependent variable must be statistically significant after entering the control variables in a previous stage, there must be no evidence of a flawed numerical analysis, the classification accuracy rate must be substantially better than could be obtained by chance alone, and each significant relationship must be interpreted correctly.

LEVEL OF MEASUREMENT - 1 Slide 45

Logistic regression analysis requires that the dependent variable be dichotomous and the independent variables be metric or dichotomous. "Should marijuana be made legal" [grass] is a dichotomous variable, which satisfies the level of measurement requirement for the dependent variable. It contains two categories:
• survey respondents who have been less supportive that the use of marijuana should be made legal
• survey respondents who have been more supportive that the use of marijuana should be made legal

LEVEL OF MEASUREMENT - 2 Slide 46

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in logistic regression analysis.

"General happiness" [happy] and "confidence in the executive branch of the federal government" [confed] are ordinal level variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for logistic regression analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

Request hierarchical logistic regression Slide 47

Select the Regression | Binary Logistic… command from the Analyze menu.

Selecting the dependent variable Slide 48

First, highlight the dependent variable grass in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Selecting the control independent variables Slide 49

First, move the control independent variable, sex, listed in the problem to the Covariates list box. Second, click on the Next button to add the new block that will contain the predictors.

Adding the predictor independent variables Slide 50

First, move the predictors to the Covariates list box.

Specifying the method for including variables Slide 51

In our hierarchical regression, we will specify that all of the variables in each block be entered simultaneously when the block is entered.

Completing the logistic regression request Slide 52

Click on the OK button to request the output for the logistic regression. The logistic procedure supports the selection of subsets of cases, automatic recoding of nominal variables, saving diagnostic statistics like standardized residuals and Cook's distance, and options for additional statistics. However, none of these are needed for this analysis.
Sample size – ratio of cases to variables Slide 53

Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases     Included in Analysis     163    60.4
                   Missing Cases            107    39.6
                   Total                    270    100.0
Unselected Cases                            0      .0
Total                                       270    100.0
a. If weight is in effect, see classification table for the total number of cases.

The minimum ratio of valid cases to independent variables for logistic regression is 10 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 163 valid cases and 3 independent variables. The ratio of cases to independent variables is 54.33 to 1, which satisfies the minimum requirement. In addition, the ratio of 54.33 to 1 satisfies the preferred ratio of 20 to 1.

Slide 54 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

In a hierarchical logistic regression, the presence of a relationship between the dependent variable and the combination of independent variables entered after the control variables have been included is based on the statistical significance of the block chi-square for the second block of variables, in which the predictor independent variables are included.

In this analysis, the probability of the block chi-square (17.467) was <0.001, less than or equal to the level of significance of 0.05. The null hypothesis that there is no difference between the model with only a constant and the control variables versus the model with the predictor independent variables was rejected. The contribution of the relationship between the predictor independent variables and the dependent variable was supported.

NUMERICAL PROBLEMS Slide 55

Variables in the Equation
                      B        S.E.     Wald      df   Sig.   Exp(B)
Step 1(a)  SEX        .154     .351     .194      1    .660   1.167
           HAPPY      -1.104   .354     9.739     1    .002   .331
           CONFED     -.559    .270     4.290     1    .038   .572
           Constant   3.721    1.066    12.195    1    .000   41.308
a. Variable(s) entered on step 1: HAPPY, CONFED.
Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

None of the independent variables in this analysis had a standard error larger than 2.0. (The check for standard errors larger than 2.0 does not include the standard error for the Constant.)

Slide 56 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

The probability of the Wald statistic for the variable general happiness was 0.002, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for general happiness was equal to zero was rejected. This supports the relationship that "survey respondents who were less happy overall were less likely to have been less supportive that the use of marijuana should be made legal." General happiness is an ordinal variable that is coded so that lower numeric values are associated with survey respondents who were happier overall.

The value of Exp(B) was 0.331, which implies that a one unit increase in general happiness decreased the odds that survey respondents have been less supportive that the use of marijuana should be made legal by 66.9%.
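The step from Exp(B) to a percent change in the odds is simple arithmetic: subtract 1 and multiply by 100. A minimal Python sketch follows, using the B and Exp(B) values printed in the Variables in the Equation table for this problem. The dictionary layout and the function name `percent_change_in_odds` are illustrative assumptions, not SPSS output or an SPSS function.

```python
import math

# Values copied from the "Variables in the Equation" table for Problem 2.
# The dict layout and function name are illustrative, not part of SPSS.
COEFFICIENTS = {
    "SEX":    {"b": 0.154,  "exp_b": 1.167},
    "HAPPY":  {"b": -1.104, "exp_b": 0.331},
    "CONFED": {"b": -0.559, "exp_b": 0.572},
}


def percent_change_in_odds(exp_b):
    """Percent change in the odds of the modeled event for a one unit increase."""
    return (exp_b - 1.0) * 100.0


for name, row in COEFFICIENTS.items():
    # Exp(B) is e raised to the power B. Recomputing it from the rounded B
    # can differ from the printed Exp(B) in the third decimal place.
    recomputed = math.exp(row["b"])
    change = percent_change_in_odds(row["exp_b"])
    direction = "increase" if change > 0 else "decrease"
    print(f"{name}: Exp(B) = {row['exp_b']:.3f} (exp of rounded B = {recomputed:.3f}), "
          f"{abs(change):.1f}% {direction} in the odds")
```

Values of Exp(B) below 1.0 give a negative percent change, which is why the relationships for general happiness and confidence in the executive branch are worded as decreases in the odds.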
Slide 57 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2

The probability of the Wald statistic for the variable confidence in the executive branch of the federal government was 0.038, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for confidence in the executive branch of the federal government was equal to zero was rejected. This supports the relationship that "survey respondents who had less confidence in the executive branch of the federal government were less likely to have been less supportive that the use of marijuana should be made legal." Confidence in the executive branch of the federal government is an ordinal variable that is coded so that lower numeric values are associated with survey respondents who had more confidence in the executive branch of the federal government.

The value of Exp(B) was 0.572, which implies that a one unit increase in confidence in the executive branch of the federal government decreased the odds that survey respondents have been less supportive that the use of marijuana should be made legal by 42.8%.

CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL: by chance accuracy rate Slide 58

The independent variables could be characterized as useful predictors distinguishing survey respondents who have been less supportive that the use of marijuana should be made legal from survey respondents who have been more supportive that the use of marijuana should be made legal if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone.
Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate, that is, at least 1.25 times that rate.

Classification Table(a,b), Step 0
                                        Predicted: SHOULD MARIJUANA BE MADE LEGAL
Observed                                      LEGAL   NOT LEGAL   Percentage Correct
SHOULD MARIJUANA BE MADE LEGAL   LEGAL        0       57          .0
                                 NOT LEGAL    0       106         100.0
Overall Percentage                                                65.0
a. Constant is included in the model.
b. The cut value is .500

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0 (57/163 = 0.350 and 106/163 = 0.650), and then squaring and summing the proportion of cases in each group (0.350² + 0.650² = 0.545).

CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL: criteria for classification accuracy Slide 59

Classification Table(a), Step 1
                                        Predicted: SHOULD MARIJUANA BE MADE LEGAL
Observed                                      LEGAL   NOT LEGAL   Percentage Correct
SHOULD MARIJUANA BE MADE LEGAL   LEGAL        18      39          31.6
                                 NOT LEGAL    13      93          87.7
Overall Percentage                                                68.1
a. The cut value is .500

The accuracy rate computed by SPSS was 68.1%, which was greater than or equal to the proportional by chance accuracy criterion of 68.1% (1.25 x 54.5% = 68.1%); the two values are equal at the one-decimal precision reported. The criterion for classification accuracy is satisfied.

Answering the question in problem 2 - 1 Slide 60

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal" [grass], the variables "general happiness" [happy] and "confidence in the executive branch of the federal government" [confed] were useful predictors for distinguishing between groups based on responses to "should marijuana be made legal" [grass]. These predictors differentiate survey respondents who have been less supportive that the use of marijuana should be made legal from survey respondents who have been more supportive that the use of marijuana should be made legal.

Survey respondents who were less happy overall were less likely to have been less supportive that the use of marijuana should be made legal. A one unit increase in general happiness decreased the odds that survey respondents have been less supportive that the use of marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the executive branch of the federal government were less likely to have been less supportive that the use of marijuana should be made legal. A one unit increase in confidence in the executive branch of the federal government decreased the odds that survey respondents have been less supportive that the use of marijuana should be made legal by 42.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

We found a statistically significant overall relationship between the predictor independent variables and the dependent variable.

There was no evidence of numerical problems in the solution.

Moreover, the classification accuracy surpassed the proportional by chance accuracy criteria, supporting the utility of the model.

Answering the question in problem 2 - 2 Slide 61

We verified that each statement about the relationship between an independent variable and the dependent variable was correct in both the direction of the relationship and the change in likelihood associated with a one-unit change of the independent variable.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.
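The by chance accuracy check used in Problem 2 can be reproduced with a short Python sketch. The counts come from the Step 0 and Step 1 classification tables; the function name `by_chance_criterion` is hypothetical, and comparing the two rates at one decimal place of a percent is an assumption made here to mirror how the values are reported on the slides. At full precision the accuracy rate (about 0.6810) sits roughly 0.0005 below the criterion (about 0.6815), so this case is a tie only after rounding.

```python
def by_chance_criterion(group_counts, margin=1.25):
    """Return (proportional by chance accuracy rate, margin times that rate)."""
    total = sum(group_counts)
    # Square each group's proportion of cases and sum the squares.
    chance = sum((n / total) ** 2 for n in group_counts)
    return chance, margin * chance


# Problem 2 counts: observed group sizes at Step 0 (LEGAL, NOT LEGAL)
# and correctly classified cases at Step 1 (18 LEGAL, 93 NOT LEGAL).
observed = [57, 106]
correct = [18, 93]

chance, criterion = by_chance_criterion(observed)
accuracy = sum(correct) / sum(observed)

# Assumption: compare at the one-decimal percent precision shown on the slides.
meets = round(accuracy * 100, 1) >= round(criterion * 100, 1)

print(f"by chance rate = {chance:.3f}")
print(f"criterion      = {criterion * 100:.1f}%")
print(f"accuracy       = {accuracy * 100:.1f}%")
print(f"criterion met at reported precision: {meets}")
```

Passing the Problem 1 counts instead (`[45, 132]` observed, `[19, 123]` correct) gives the 62.1% chance rate, 77.6% criterion, and 80.2% accuracy reported for that problem, where the margin is not close.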
SW388R7 Data Analysis & Computers II Problem 3 Slide 62

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationship.

From the list of variables "highest academic degree" [degree], "total family income" [income98], and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was "total family income" [income98]. These predictors differentiate survey respondents who have been less positive that the United States would fight in another world war within the next ten years from survey respondents who have been more positive that the United States would fight in another world war within the next ten years.

The most important predictor for identifying survey respondents who have been less positive that the United States would fight in another world war within the next ten years was total family income. Survey respondents who had higher total family incomes were more likely to have been less positive that the United States would fight in another world war within the next ten years. A one unit increase in total family income increased the odds that survey respondents have been less positive that the United States would fight in another world war within the next ten years by 10.0%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7 Data Analysis & Computers II Dissecting Problem 3 - 1 Slide 63

For these problems, we will assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results.

In this problem, we are told to use 0.05 as alpha for the logistic regression.

SW388R7 Data Analysis & Computers II Dissecting Problem 3 - 2 Slide 64

The variables listed first in the problem statement are the independent variables (IVs): "highest academic degree" [degree], "total family income" [income98], and "satisfaction with financial situation" [satfin].

The variable used to define groups is the dependent variable (DV): "expect u.s. in world war in 10 years" [uswary].

Since the problem identifies the most useful, or most important, predictor, we do a stepwise logistic regression.

SW388R7 Data Analysis & Computers II Dissecting Problem 3 - 3 Slide 65

SPSS logistic regression models the relationship by computing the changes in the likelihood of falling in the category of the dependent variable which had the highest numerical code.

The responses to "expect u.s. in world war in 10 years" were coded: 1 = Yes and 2 = No.

The SPSS output will model the changes in the likelihood of being less positive that the United States would fight in another world war within the next ten years.

SW388R7 Data Analysis & Computers II Dissecting Problem 3 - 4 Slide 66

The statements of the specific relationships between independent variables and the dependent variable are all phrased in terms of impact on being less positive that the United States would fight in another world war within the next ten years.

SW388R7 Data Analysis & Computers II Dissecting Problem 3 - 5 Slide 67

The specific relationships for the independent variables listed in the problem indicate the direction of the relationship, increasing or decreasing the likelihood of falling in the modeled group, and the amount of change associated with a one-unit change in the independent variable.

In order for the logistic regression question to be true, the relationship between the predictors selected for inclusion and the dependent variable must be statistically significant, there must be no evidence of a flawed numerical analysis, the classification accuracy rate must be substantially better than could be obtained by chance alone, and the order of entry and each significant relationship must be interpreted correctly.

SW388R7 Data Analysis & Computers II LEVEL OF MEASUREMENT - 1 Slide 68

Logistic regression analysis requires that the dependent variable be dichotomous and the independent variables be metric or dichotomous.

"Expect u.s. in world war in 10 years" [uswary] is a dichotomous variable, which satisfies the level of measurement requirement for the dependent variable. It contains two categories: survey respondents who have been less positive that the United States would fight in another world war within the next ten years, and survey respondents who have been more positive that the United States would fight in another world war within the next ten years.

SW388R7 Data Analysis & Computers II LEVEL OF MEASUREMENT - 2 Slide 69

"Highest academic degree" [degree], "total family income" [income98], and "satisfaction with financial situation" [satfin] are ordinal level variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for logistic regression analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

SW388R7 Data Analysis & Computers II Request stepwise logistic regression Slide 70

Select the Regression | Binary Logistic… command from the Analyze menu.

SW388R7 Data Analysis & Computers II Selecting the dependent variable Slide 71

First, highlight the dependent variable uswary in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.

SW388R7 Data Analysis & Computers II Adding the independent variables Slide 72

First, move the predictors to the Covariates list box.

SW388R7 Data Analysis & Computers II Specifying the method for including variables Slide 73

In our stepwise logistic regression, we specify the Forward Conditional method for adding variables.
SW388R7 Data Analysis & Computers II Adding options to the output Slide 74

To add a summary of steps at the end of the analysis and specifications for stepwise method, click on the Options… button.

SW388R7 Data Analysis & Computers II Including a summary of steps Slide 75

To obtain a summary of the steps on which variables were added or removed from the analysis, mark the option button At last step in the Display panel.

SW388R7 Data Analysis & Computers II Specifications for stepwise method Slide 76

We can change the criteria for adding and removing variables from the analysis by changing the probability for entry and removal. We will use the default level of significance of 0.05 for entry and 0.10 for removal.

Click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Completing the logistic regression request Slide 77

Click on the OK button to request the output for the logistic regression.

SW388R7 Data Analysis & Computers II Sample size – ratio of cases to variables Slide 78

Case Processing Summary
Unweighted Cases (a)                           N
Selected Cases    Included in Analysis       136
                  Missing Cases              134
                  Total                      270
Unselected Cases                               0
Total                                        270
a. If weight is in effect, see classification table for the total number of cases.

The minimum ratio of valid cases to independent variables for stepwise logistic regression is 10 to 1, with a preferred ratio of 50 to 1. In this analysis, there are 136 valid cases and 3 independent variables. The ratio of cases to independent variables is 45.33 to 1, which satisfies the minimum requirement. However, the ratio of 45.33 to 1 does not satisfy the preferred ratio of 50 to 1. A caution should be added to the interpretation of the analysis and a split sample validation should be conducted.
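The sample size check on Slide 78 is simple arithmetic, and a short sketch may make it easier to repeat for other problems. The function name `case_ratio_check` and its defaults are illustrative assumptions, not part of SPSS; the thresholds (10 to 1 minimum, 50 to 1 preferred for stepwise) are the ones stated on the slide.

```python
def case_ratio_check(valid_cases, n_predictors, minimum=10.0, preferred=50.0):
    """Return the cases-to-variables ratio and whether each threshold is met.

    Hypothetical helper: thresholds default to the stepwise values from Slide 78.
    """
    ratio = valid_cases / n_predictors
    return ratio, ratio >= minimum, ratio >= preferred


# Problem 3: 136 valid cases, 3 independent variables
ratio, meets_minimum, meets_preferred = case_ratio_check(136, 3)
print(f"{ratio:.2f} to 1; minimum met: {meets_minimum}; preferred met: {meets_preferred}")
```

When the minimum is met but the preferred ratio is not, as here, the slide's rule is to add a caution rather than reject the analysis.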
Case Processing Summary, Percent column (same row order as the N column): 50.4, 49.6, 100.0, .0, 100.0

SW388R7 Data Analysis & Computers II Slide 79 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the model chi-square.

In this analysis, the probability of the model chi-square (9.001) was 0.003, less than or equal to the level of significance of 0.05. The null hypothesis that there is no difference between the model with only a constant and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.

SW388R7 Data Analysis & Computers II NUMERICAL PROBLEMS Slide 80

Variables in the Equation
Step 1 (a)        B    S.E.    Wald   df   Sig.   Exp(B)
INCOME98       .095    .033   8.436    1   .004    1.100
Constant     -1.033    .527   3.847    1   .050     .356
a. Variable(s) entered on step 1: INCOME98.

Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

None of the independent variables in this analysis had a standard error larger than 2.0. (The check for standard errors larger than 2.0 does not include the standard error for the Constant.)

SW388R7 Data Analysis & Computers II Slide 81 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE

The probability of the Wald statistic for the variable total family income was 0.004, less than or equal to the level of significance of 0.05.
The null hypothesis that the b coefficient for total family income was equal to zero was rejected. This supports the relationship that "survey respondents who had higher total family incomes were more likely to have been less positive that the United States would fight in another world war within the next ten years."

Total family income is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher total family incomes.

Variables in the Equation
Step 1 (a)        B    S.E.    Wald   df   Sig.   Exp(B)
INCOME98       .095    .033   8.436    1   .004    1.100
Constant     -1.033    .527   3.847    1   .050     .356
a. Variable(s) entered on step 1: INCOME98.

The value of Exp(B) was 1.100, which implies that a one unit increase in total family income increased the odds that survey respondents have been less positive that the United States would fight in another world war within the next ten years by 10.0%.

SW388R7 Data Analysis & Computers II Slide 82 IMPORTANCE OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE

The order of importance is based on the entry order of the variables included in the stepwise logistic regression. The entry order is summarized in the Step Summary table, in which we see which variable was added or removed at each step.

Step Summary (a, b)
        Improvement                  Model                        Correct
Step    Chi-square   df   Sig.      Chi-square   df   Sig.       Class %
1            9.001    1   .003           9.001    1   .003         67.6%
a. No more variables can be deleted from or added to the current model.
b. End block: 1

The most important predictor for identifying survey respondents who have been less positive that the United States would fight in another world war within the next ten years was total family income [INCOME98]. The importance of the predictors stated in the problem is correct.
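The interpretation of the Variables in the Equation table on Slides 80 and 81 can be retraced by hand. The sketch below uses only the rounded values printed in the table (B = .095, S.E. = .033, Wald = 8.436); the helper name `chi2_sf_1df` is an illustrative assumption. Because B and S.E. are rounded to three decimals, the Wald statistic recomputed from them will differ slightly from the 8.436 that SPSS reports, presumably from unrounded estimates.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


b, se = 0.095, 0.033          # INCOME98 row, as printed (rounded)
reported_wald = 8.436         # Wald statistic as printed by SPSS

wald_from_rounded = (b / se) ** 2        # Wald = (B / S.E.)^2, approximate here
p_wald = chi2_sf_1df(reported_wald)      # compare with Sig. = .004
odds_ratio = math.exp(b)                 # compare with Exp(B) = 1.100
pct_change_in_odds = (odds_ratio - 1.0) * 100.0
numerical_problem = se > 2.0             # Slide 80 rule of thumb

print(f"Wald from rounded B and S.E.: {wald_from_rounded:.3f}")
print(f"p for reported Wald: {p_wald:.3f}")
print(f"Exp(B): {odds_ratio:.3f}, change in odds: {pct_change_in_odds:.1f}%")
print(f"S.E. larger than 2.0: {numerical_problem}")
```

An Exp(B) above 1 means the odds of the modeled event rise with each one unit increase in the predictor; a value below 1 would be read as a decrease of (1 - Exp(B)) x 100 percent.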
Step Summary, Variable column (Step 1): IN: INCOME98

SW388R7 Data Analysis & Computers II Slide 83 CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL: by chance accuracy rate

The independent variables could be characterized as useful predictors distinguishing survey respondents who have been less positive that the United States would fight in another world war within the next ten years from survey respondents who have been more positive that the United States would fight in another world war within the next ten years if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

Classification Table (a, b), Step 0
Observed: EXPECT U.S. IN          Predicted           Percentage
WORLD WAR IN 10 YEARS             YES      NO         Correct
YES                                 0      54              .0
NO                                  0      82           100.0
Overall Percentage                                       60.3
a. Constant is included in the model.
b. The cut value is .500

The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (0.397² + 0.603² = 0.521).

SW388R7 Data Analysis & Computers II Slide 84 CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL: criteria for classification accuracy

Classification Table (a), Step 1
Observed: EXPECT U.S. IN          Predicted
WORLD WAR IN 10 YEARS             YES      NO
YES                                20      34
NO                                 10      72
a. The cut value is .500

The accuracy rate computed by SPSS was 67.6%, which was greater than or equal to the proportional by chance accuracy criteria of 65.2% (1.25 x 52.1% = 65.2%). The criteria for classification accuracy are satisfied.
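The by chance accuracy comparison on Slides 83 and 84 can be reproduced from the cell counts in the two classification tables. This is a minimal sketch; the function name `proportional_by_chance` is an illustrative assumption, and the 1.25 multiplier is the operational criterion stated on Slide 83.

```python
def proportional_by_chance(group_counts):
    """Sum of squared group proportions, from the observed group sizes."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)


# Step 0 table: 54 observed YES cases, 82 observed NO cases
chance_rate = proportional_by_chance([54, 82])
criterion = 1.25 * chance_rate

# Step 1 table: correctly classified cases sit on the diagonal (20 YES, 72 NO)
accuracy = (20 + 72) / (20 + 34 + 10 + 72)

print(f"by chance rate: {chance_rate:.3f}")
print(f"criterion (1.25 x chance): {criterion:.3f}")
print(f"model accuracy: {accuracy:.3f}; criterion satisfied: {accuracy >= criterion}")
```

The same function works for any number of groups, since it only needs the observed count in each group.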
Classification Table (Step 1), Percentage Correct column: YES 37.0, NO 87.8, Overall Percentage 67.6

SW388R7 Data Analysis & Computers II Answering the question in problem 3 - 1 Slide 85

We found a statistically significant overall relationship between the predictor independent variables and the dependent variable.

There was no evidence of numerical problems in the solution.

Moreover, the classification accuracy surpassed the proportional by chance accuracy criteria, supporting the utility of the model.

SW388R7 Data Analysis & Computers II Answering the question in problem 3 - 2 Slide 86

We verified that each statement about the relationship between an independent variable and the dependent variable was correct in both the direction of the relationship and the change in likelihood associated with a one-unit change of the independent variable.

We also verified the order of importance for the independent variables included in the stepwise analysis.

The answer to the question is true with caution.

A caution is added to the findings because of the inclusion of ordinal level independent variables. A caution is added to the findings because the preferred sample size is not met.

SW388R7 Data Analysis & Computers II Slide 87 Steps in binary logistic regression: level of measurement and initial sample size

The following is a guide to the decision process for answering problems about the basic relationships in logistic regression:

Dependent dichotomous? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: continue
Ratio of cases to independent variables at least 10 to 1?
  No: Inappropriate application of a statistic
  Yes: Run logistic regression, using method for including variables identified in the research question.

SW388R7 Data Analysis & Computers II Slide 88 Steps in logistic regression: overall relationship and numerical problems

Hierarchical method of entry used to include independent variables?
  No: Presence of relationship confirmed by test of model chi-square?
    No: False
    Yes: continue
  Yes: Presence of relationship confirmed by test of block chi-square?
    No: False
    Yes: continue
Standard errors of coefficients indicate presence of numerical problems (s.e. > 2.0)?
  Yes: False
  No: continue

SW388R7 Data Analysis & Computers II Slide 89 Steps in logistic regression: relationships between IV's and DV

Stepwise method of entry used to include independent variables?
  Yes: Entry order of variables interpreted correctly?
    No: False
    Yes: continue
  No: continue
Relationships between individual IVs and DV groups interpreted correctly?
  No: False
  Yes: continue

SW388R7 Data Analysis & Computers II Slide 90 Steps in logistic regression: classification accuracy and adding cautions

Overall accuracy rate is 25% > than proportional by chance accuracy rate?
  No: False
  Yes: continue
Satisfies preferred ratio of cases to IV's of 20 to 1 (50 to 1 for stepwise)?
  No: True with caution
  Yes: continue
One or more IV's are ordinal level variables?
  Yes: True with caution
  No: True
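The decision process on Slides 87 through 90 can be sketched as a single function, which may help when checking an answer against the flow chart. This is one reading of the charts, not an SPSS feature: the `Findings` class, its field names, and the `answer` function are all illustrative assumptions, and each yes/no finding is supplied by the analyst after reading the SPSS output.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Analyst's findings for one problem (hypothetical structure)."""
    dv_dichotomous: bool
    ivs_metric_or_dichotomous: bool   # ordinal IVs treated as metric by convention
    valid_cases: int
    n_ivs: int
    method: str                       # "simultaneous", "hierarchical", or "stepwise"
    chi_square_significant: bool      # block chi-square if hierarchical, else model
    max_iv_standard_error: float      # excludes the Constant
    entry_order_correct: bool         # only consulted for stepwise
    relationships_correct: bool
    accuracy_rate: float
    by_chance_rate: float
    any_ordinal_iv: bool


def answer(f):
    """Walk the Slide 87-90 flow charts and return the answer category."""
    inappropriate = "Inappropriate application of a statistic"
    # Slide 87: level of measurement and minimum sample size
    if not (f.dv_dichotomous and f.ivs_metric_or_dichotomous):
        return inappropriate
    ratio = f.valid_cases / f.n_ivs
    if ratio < 10:
        return inappropriate
    # Slide 88: overall relationship and numerical problems
    if not f.chi_square_significant or f.max_iv_standard_error > 2.0:
        return "False"
    # Slide 89: entry order (stepwise only) and individual relationships
    if f.method == "stepwise" and not f.entry_order_correct:
        return "False"
    if not f.relationships_correct:
        return "False"
    # Slide 90: classification accuracy, then cautions
    if f.accuracy_rate < 1.25 * f.by_chance_rate:
        return "False"
    preferred = 50 if f.method == "stepwise" else 20
    if ratio < preferred or f.any_ordinal_iv:
        return "True with caution"
    return "True"


problem_3 = Findings(
    dv_dichotomous=True, ivs_metric_or_dichotomous=True,
    valid_cases=136, n_ivs=3, method="stepwise",
    chi_square_significant=True, max_iv_standard_error=0.033,
    entry_order_correct=True, relationships_correct=True,
    accuracy_rate=0.676, by_chance_rate=0.521, any_ordinal_iv=True,
)
print(answer(problem_3))
```

For Problem 3 both cautions apply (ordinal predictors and a ratio below 50 to 1), so either one alone would already produce "True with caution".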