Transcript V10_CAT
Categorical Data

Department of ISM, University of Alabama, 1992-2003
M28 - Categorical Analysis

Lesson Objective
Understand basic rules of probability.
Calculate marginal and conditional probabilities.
Determine if two categorical variables are independent.

Recall Rule of Thumb:
Quantitative variables: averages or differences have meaning.
Ex: weight, height, income, age

Recall Rule of Thumb:
Categorical variables: classify people or things.
Ex: gender, race, occupation, political affiliation, country of origin

Note: Sometimes quantitative variables are expressed as categorical.
Income (Family Economic Income):
Class   Definition
1.      Less than $30,000
2.      $30,000 but less than $100,000
3.      $100,000 or more

Relationship between two quantitative variables?
Is relationship linear (scatterplot)?
Yes: Use Correlation & Least Squares Regression.
No: Data transformations.

Recall: Boxplots
Best graphical tool for examining the relationship between a quantitative variable and a categorical variable (i.e., comparing distributions).
Example: Weight vs. Country of Origin
Boxplot can be used to answer: “Do the distributions of weights vary for different countries of origin?”
[Figure: side-by-side boxplots of weight (vertical axis) by origin (horizontal axis: US, Far East, Europe).]

Relationship between two categorical variables?
Use two-way frequency tables: Look at marginal probabilities and conditional probabilities.

STATISTICS is the science of transforming data into information to make decisions in the face of uncertainty.

How do we measure "uncertainty"?
Probability: A numerical measure of the likelihood that an outcome or an event occurs.
P(A) = probability of event A

Three Methods for Assessing Probability
Classical
Relative Frequency
Subjective

Probability requirements for discrete variables:
1. 0 ≤ P(A) ≤ 1
   P(A) = 0: impossible event
   P(A) = 1: certain event
2. Sum of the probabilities of all possible outcomes must equal 1. (Binomial, Poisson)

Conditional probability: The chance one event happens, given that another event will occur.
P(A | B) = P(A and B) / P(B)
         = (all outcomes belonging to BOTH A AND B) / (those outcomes in the restricted group, B)

Problem: Credit Card Manager
New credit test to determine credit worthiness.
Credit test checked against 500 previous customers.

                  Credit Test A
Credit History    Passed (P)   Failed (F)   Total
Good (G)             350           50        400
Default (D)           20           80        100
Total                370          130        500

What is the probability of a customer defaulting?
P(Defaults) = P(D) = 100/500 = .20

What is the probability of a customer defaulting given that he fails test A?
P(Defaults given failed test A) = P(D|F) = 80/130 = .615

General Rules:
P(A and B) = P(A) P(B|A) = P(B) P(A|B)
P(A or B) = P(A) + P(B) - P(A and B)

P(Fails AND Defaults) = P(F) P(D|F) = (130/500)(80/130) = 80/500 = .16

P(Fails OR Defaults) = P(F) + P(D) - P(D AND F) = 130/500 + 100/500 - 80/500 = .30
Note: The “overlap” group would be counted twice if no subtraction.

Does knowledge of “test A result” help you make a better decision?
P(D|F) = .615, but P(D) = .20; the two are not equal.
Do you want to know the test A results before you give the loan?
“Credit test A results” and “defaulting” are dependent on each other.

A different sample of 500 credit records

                  Credit Test B
Credit History    Passed (P)   Failed (F)   Total
Good (G)             340           60        400
Default (D)           85           15        100
Total                425           75        500

What is the probability of a customer defaulting?
P(Defaults) = P(D) = 100/500 = .20

What is the probability of a customer defaulting given that he fails test B?
P(Defaults given failed test B) = P(D|F) = 15/75 = .20

Does knowledge of “test B result” help you make a better decision?
P(D|F) = .20 = P(D)
Test B tells me nothing about the chance of default.
“Credit test B results” and “defaulting” are independent of each other.
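
The credit-test calculations can be checked with a short script. This is a minimal sketch in Python, which the slides themselves do not use: the counts are copied from the two tables, while the dictionary layout and the helper name `summarize` are assumptions made for illustration only.

```python
# Counts copied from the two credit-test tables in the slides.
# Keys are (credit history, test result): "G" = good, "D" = default,
# "P" = passed, "F" = failed.
TEST_A = {("G", "P"): 350, ("G", "F"): 50, ("D", "P"): 20, ("D", "F"): 80}
TEST_B = {("G", "P"): 340, ("G", "F"): 60, ("D", "P"): 85, ("D", "F"): 15}


def summarize(counts):
    """Return the probabilities asked for in the slides (hypothetical helper)."""
    n = sum(counts.values())                            # sample size
    n_default = counts[("D", "P")] + counts[("D", "F")]  # row total for D
    n_failed = counts[("G", "F")] + counts[("D", "F")]   # column total for F
    n_both = counts[("D", "F")]                          # failed AND defaulted

    p_d = n_default / n                # marginal probability P(D)
    p_f = n_failed / n                 # marginal probability P(F)
    p_d_given_f = n_both / n_failed    # conditional probability P(D|F)
    p_f_and_d = p_f * p_d_given_f      # P(F and D) = P(F) P(D|F)
    p_f_or_d = p_f + p_d - p_f_and_d   # P(F or D) = P(F) + P(D) - P(F and D)
    return {
        "P(D)": p_d,
        "P(F)": p_f,
        "P(D|F)": p_d_given_f,
        "P(F and D)": p_f_and_d,
        "P(F or D)": p_f_or_d,
    }


summary_a = summarize(TEST_A)
summary_b = summarize(TEST_B)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(f"Test {name}: " + ", ".join(f"{k} = {v:.3f}" for k, v in s.items()))
```

For test A, P(D|F) is about .615 against P(D) = .20, so the test result changes the chance of default. For test B, both values are .20, so the condition does not change the probability.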

Two events are independent if the occurrence, or non-occurrence, of one does not affect the chances of the other occurring, or not occurring.
Otherwise, we say the events are dependent.

If A and B are independent, then
P(A and B) = P(A) P(B)
P(A or B) = P(A) + P(B) - P(A) P(B)
P(A|B) = P(A)
P(B|A) = P(B)
Note: The condition does NOT change the probability.

Survey of randomly selected people (voters) in Jan. 2001:
Q1: Did you vote in the 2000 election?
Q2: Do you favor an amendment to require a balanced budget?
Q3: To which political party do you belong?

Do you favor amendment for a balanced budget?
Political Party    Yes    No    Total
Republican          90    82     172
Democrat            44   104     148
Other               48    32      80
Total              182   218     400

Bottom row: marginal totals for opinion.
Right column: marginal totals for party.
Bottom-right cell (400): sample size.

What proportion favor the amend. and are Other?
What proportion favor the amend.?
What proportion claim to be Rep?

Of those that claim to be Democrat, what proportion favor the amend.?
What proportion favor the amend., given those that claim to be Rep?
Considering only those opposed, what proportion are not Republican?

Conditional Distribution:
Restrict subjects to only those that meet a condition.
Within this restricted group, what is the distribution of some other var.?

Distribution of “opinion” given those that claim to be Republican:
P( Yes | Rep. ) = 90/172 = .523
P( No | Rep. ) = 82/172 = .477
(“|” is read “given that”.)

Is there a relationship between the party and the opinion on the amendment?
What would you expect to happen if no relationship existed?

Three Conditional Distributions:
P( Yes | Rep.) = .523, P( No | Rep.) = .477
P( Yes | Demo) = .297, P( No | Demo) = .703
P( Yes | Other) = .600, P( No | Other) = .400
Marginal Distribution:
P( Yes ) = .455, P( No ) = .545
Is there a relationship? Why? or Why not?

If there is NO relationship (i.e., independence) between the party and the opinion, then “the three conditional probabilities should be close to each other and close to the marginal probability.”

Three Conditional Probabilities:
P( Yes | Rep.) = .523
P( Yes | Demo) = .297
P( Yes | Other) = .600
Marginal Probability:
P( Yes ) = .455
Are these close to each other AND close to the “marginal”?
Not close; therefore, “party” and the “opinion” are dependent.

Create with “Pivot Tables” in Excel.

Barchart - Clustered
[Figure: clustered bar chart of opinion by party (Rep., Demo., Other); vertical axis: Frequency.]

Barchart - Stacked
[Figure: stacked bar chart of opinion by party (Rep., Demo., Other); vertical axis: Frequency.]

Barchart - Percents
[Figure: percent bar chart of opinion by party (Rep., Demo., Other); vertical axis: Percent.]
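
The percentages behind a percent bar chart are the conditional distributions of opinion within each party. The sketch below recomputes them, along with the marginal distribution, from the survey counts. It is written in Python rather than the Excel pivot tables the slides mention, and the names `counts`, `conditional_distribution`, and `marginal_distribution` are illustrative assumptions, not part of the original lesson.

```python
# Survey counts from the two-way table: party -> opinion -> frequency.
counts = {
    "Republican": {"Yes": 90, "No": 82},
    "Democrat": {"Yes": 44, "No": 104},
    "Other": {"Yes": 48, "No": 32},
}


def conditional_distribution(row):
    """Distribution of opinion within one party (restrict to that row)."""
    total = sum(row.values())
    return {opinion: n / total for opinion, n in row.items()}


def marginal_distribution(table):
    """Distribution of opinion over the whole sample (column totals / n)."""
    grand_total = sum(sum(row.values()) for row in table.values())
    opinions = list(next(iter(table.values())))
    return {
        opinion: sum(row[opinion] for row in table.values()) / grand_total
        for opinion in opinions
    }


conditionals = {party: conditional_distribution(row) for party, row in counts.items()}
marginal = marginal_distribution(counts)

for party, dist in conditionals.items():
    print(f"P(Yes | {party}) = {dist['Yes']:.3f}, P(No | {party}) = {dist['No']:.3f}")
print(f"P(Yes) = {marginal['Yes']:.3f}, P(No) = {marginal['No']:.3f}")
```

If party and opinion were independent, every conditional row would sit close to the marginal row. Here P(Yes | party) ranges from about .297 to .600 against a marginal of .455, which is the gap the slides point to.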

Summary
For two categorical variables:
Must use conditional probabilities to determine if a relationship exists.
Cannot use correlation.
Visual display: Stacked percentage bar charts

Associations between TWO Variables
Variables           Numerical                                             Graphical
Quant. vs. Quant.   LS regression line, r, r-sq, std error                Scatterplot, residual plots
Quant. vs. Cat.     X-bar and s for each category                         Side-by-side box plots
Cat. vs. Cat.       Two-way table, conditional & marginal distributions   Bar chart: stacked, percent

The End