Chapter 3 ~ Descriptive Analysis & Presentation of Bivariate Data

[Figure: regression plot of Weight (vertical axis) against Height (horizontal axis); fitted line Y = 2.31464 + 1.28722X, r = 0.559]

Chapter Goals
• To be able to present bivariate data in tabular and graphic form
• To become familiar with the ideas of descriptive presentation
• To gain an understanding of the distinction between the basic purposes of correlation analysis and regression analysis

3.1 ~ Bivariate Data
Bivariate Data: Consists of the values of two different response variables that are obtained from the same population of interest

Three combinations of variable types:
1. Both variables are qualitative (attribute)
2. One variable is qualitative (attribute) and the other is quantitative (numerical)
3. Both variables are quantitative (both numerical)

Two Qualitative Variables
• When bivariate data results from two qualitative (attribute or categorical) variables, the data is often arranged on a crosstabulation or contingency table

Example: A survey was conducted to investigate the relationship between preferences for television, radio, or newspaper for national news, and gender. The results are given in the table below:

            TV    Radio    NP
Male       280     175    305
Female     115     275    170

Marginal Totals
• This table may be extended to display the marginal totals (or marginals). The total of the marginal totals is the grand total:

               TV    Radio    NP    Row Totals
Male          280     175    305       760
Female        115     275    170       560
Col. Totals   395     450    475      1320

Note: Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.

Percentages Based on the Grand Total (Entire Sample)
• The previous contingency table may be converted to percentages of the grand total by dividing each frequency by the grand total and multiplying by 100
– For example, 175 becomes 13.3%: $\frac{175}{1320} \times 100 = 13.3$

               TV    Radio    NP    Row Totals
Male          21.2    13.3   23.1      57.6
Female         8.7    20.8   12.9      42.4
Col. Totals   29.9    34.1   36.0     100.0

Illustration
• These same statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph:

[Figure: side-by-side bar graph, "Percentages Based on Grand Total"; Percent (vertical axis, 0 to 25) by Media (TV, Radio, NP), with separate bars for Male and Female]

Percentages Based on Row (Column) Totals
• The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row’s (column’s) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals:

                TV    Radio     NP    Row Totals
Male           70.9    38.9    64.2      57.6
Female         29.1    61.1    35.8      42.4
Col. Totals   100.0   100.0   100.0     100.0

Note: These statistics may also be displayed in a side-by-side bar graph

One Qualitative & One Quantitative Variable
1. When bivariate data results from one qualitative and one quantitative variable, the quantitative values are viewed as separate samples
2. Each set is identified by levels of the qualitative variable
3. Each sample is described using summary statistics, and the results are displayed for side-by-side comparison
4. Statistics for comparison: measures of central tendency, measures of variation, 5-number summary
5. Graphs for comparison: dotplot, boxplot

Example
Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded.
The data is given in the table below:

Northeast   Midwest    West
  23.75      34.38     54.54
  40.50      34.35     65.60
  33.65      39.15     59.78
  31.25      37.12     45.12
  42.55      36.71     60.35
  50.60      34.39     61.53
  37.70      35.12     52.79
  31.55      35.80     47.37
  38.85      37.24     59.64
  21.25      40.01     37.40

• The part of the country is a qualitative variable with three levels of response. The electric bill is a quantitative variable. The electric bills may be compared with numerical and graphical techniques.

Comparison Using Dotplots

[Figure: three dotplots of the electric bills (Northeast, Midwest, West) drawn on a common axis running from about 24.0 to 64.0]

• The electric bills in the Northeast tend to be more spread out than those in the Midwest. The bills in the West tend to be higher than both those in the Northeast and Midwest.

Comparison Using Box-and-Whisker Plots

[Figure: side-by-side box-and-whisker plots, "The Monthly Electric Bill"; Electric Bill (vertical axis, 20 to 70) for Northeast, Midwest, and West]

Two Quantitative Variables
1. Expressed as ordered pairs: (x, y)
2. x: input variable, independent variable
   y: output variable, dependent variable

Scatter Diagram: A plot of all the ordered pairs of bivariate data on a coordinate axis system. The input variable x is plotted on the horizontal axis, and the output variable y is plotted on the vertical axis.

Note: Use scales so that the range of the y-values is equal to or slightly less than the range of the x-values. This creates a window that is approximately square.
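The side-by-side numerical comparison described for the electric-bill example can be sketched in a few lines of Python. This is only an illustration: the names `bills`, `describe`, and `summary` are made up here, and the quartiles come from the standard library's "inclusive" method, which is an assumption and may differ slightly from the quartile rule used in the course.

```python
from statistics import mean, median, quantiles, stdev

# June electric bills by region, copied from the table in the example.
bills = {
    "Northeast": [23.75, 40.50, 33.65, 31.25, 42.55,
                  50.60, 37.70, 31.55, 38.85, 21.25],
    "Midwest": [34.38, 34.35, 39.15, 37.12, 36.71,
                34.39, 35.12, 35.80, 37.24, 40.01],
    "West": [54.54, 65.60, 59.78, 45.12, 60.35,
             61.53, 52.79, 47.37, 59.64, 37.40],
}


def describe(values):
    """Return a 5-number summary plus mean, standard deviation, and range."""
    # Assumption: "inclusive" quartiles; other conventions give other Q1/Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
        "range": max(values) - min(values),
    }


# One summary per level of the qualitative variable (the region).
summary = {region: describe(values) for region, values in bills.items()}

for region, stats in summary.items():
    print(region, {name: round(value, 2) for name, value in stats.items()})
```

The printed ranges and medians give a numerical version of what the dotplots and boxplots show: the Northeast bills are far more spread out than the Midwest bills, and the West bills sit higher than both.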
Example
Example: In a study involving children’s fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below:

Age (x)     8   9   9  10  11   9   8   9   8  11
CMFS (y)   31  25  40  27  35  29  25  34  44  19

Age (x)     7   6   6   8   9  12  15  13  10  10
CMFS (y)   28  47  42  37  35  16  12  23  26  36

Construct a scatter diagram for this data

Solution
• age = input variable, CMFS = output variable

[Figure: scatter diagram, "Child Medical Fear Scale"; CMFS (vertical axis, 10 to 50) against Age (horizontal axis, 6 to 15)]

3.2 ~ Linear Correlation
• Measures the strength of a linear relationship between two variables
– As x increases, no definite shift in y: no correlation
– As x increases, a definite shift in y: correlation
– Positive correlation: x increases, y increases
– Negative correlation: x increases, y decreases
– If the ordered pairs follow a straight-line path: linear correlation

Example: No Correlation
• As x increases, there is no definite shift in y:

[Figure: scatter diagram of Output against Input showing no pattern]

Example: Positive Correlation
• As x increases, y also increases:

[Figure: scatter diagram of Output against Input with an upward trend]

Example: Negative Correlation
• As x increases, y decreases:

[Figure: scatter diagram of Output against Input with a downward trend]

Please Note
• Perfect positive correlation: all the points lie along a line with positive slope
• Perfect negative correlation: all the points lie along a line with negative slope
• If the points lie along a horizontal or vertical line: no correlation
• If the points exhibit some other nonlinear pattern: no linear relationship, no correlation
• Need some way to measure correlation

Coefficient of Linear Correlation
Coefficient of Linear Correlation: r, measures the strength of the linear relationship between two variables

Pearson’s Product Moment Formula:

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$

Notes:
• $-1 \le r \le +1$
• r = +1: perfect positive correlation
• r = -1: perfect negative correlation

Alternate Formula for r

$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$

SS(x) = “sum of squares for x” $= \sum x^2 - \frac{(\sum x)^2}{n}$
SS(y) = “sum of squares for y” $= \sum y^2 - \frac{(\sum y)^2}{n}$
SS(xy) = “sum of squares for xy” $= \sum xy - \frac{\sum x \sum y}{n}$

Example
Example: The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient:

         x      y      x^2      y^2       xy
        2.5     40     6.25     1600    100.0
        3.0     43     9.00     1849    129.0
        4.0     30    16.00      900    120.0
        3.5     35    12.25     1225    122.5
        2.7     42     7.29     1764    113.4
        4.5     19    20.25      361     85.5
        3.8     32    14.44     1024    121.6
        2.9     39     8.41     1521    113.1
        5.0     15    25.00      225     75.0
        2.2     14     4.84      196     30.8
Sum    34.1    309   123.73    10665   1010.9

Completing the Calculation for r

$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$

$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$

$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$

$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$

Please Note
• r is usually rounded to the nearest hundredth
• r close to 0: little or no linear correlation
• As the magnitude of r increases, towards -1 or +1, there is an increasingly stronger linear correlation between the two variables
• There is a method of estimating r based on the scatter diagram. The window should be approximately square. This estimate is useful for checking calculations.
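The sum-of-squares calculation for the automobile example can be checked with a short Python sketch. The function names `sum_of_products` and `correlation` are illustrative choices, not part of the slides. Note that SS(xy) is negative for this data, so r is negative: heavier cars tend to get lower mileage.

```python
from math import sqrt

# Weight (thousands of pounds) and mileage (mpg) for the ten automobiles.
weights = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]
mileages = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]


def sum_of_products(xs, ys):
    """SS(xy) = sum(xy) - sum(x) * sum(y) / n; with xs == ys this is SS(x)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def correlation(xs, ys):
    """Linear correlation coefficient r = SS(xy) / sqrt(SS(x) * SS(y))."""
    return sum_of_products(xs, ys) / sqrt(
        sum_of_products(xs, xs) * sum_of_products(ys, ys)
    )


ss_x = sum_of_products(weights, weights)     # about 7.449
ss_y = sum_of_products(mileages, mileages)   # about 1116.9
ss_xy = sum_of_products(weights, mileages)   # about -42.79
r = correlation(weights, mileages)

print(round(r, 2))  # -0.47
```

Using one helper for all three sums of squares mirrors the alternate formula: SS(x) and SS(y) are just SS(xy) with the same variable in both positions.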
3.3 ~ Linear Regression
• Regression analysis finds the equation of the line that best describes the relationship between two variables
• One use of this equation: to make predictions

Models or Prediction Equations
• Some examples of various possible relationships:

Linear: $\hat{y} = b_0 + b_1 x$
Quadratic: $\hat{y} = a + bx + cx^2$
Exponential: $\hat{y} = a(b^x)$
Logarithmic: $\hat{y} = a \log_b x$

Note: What would a scatter diagram look like to suggest each relationship?

Method of Least Squares
• Equation of the best-fitting line: $\hat{y} = b_0 + b_1 x$
• Predicted value: $\hat{y}$
• Least squares criterion:
– Find the constants $b_0$ and $b_1$ such that the sum

$\sum (y - \hat{y})^2 = \sum (y - (b_0 + b_1 x))^2$

is as small as possible

Illustration
• Observed and predicted values of y:

[Figure: the line $\hat{y} = b_0 + b_1 x$ with an observed point $(x, y)$, the predicted point $(x, \hat{y})$ on the line, and the vertical distance $y - \hat{y}$ between them]

The Line of Best Fit Equation
• The equation is determined by:
  $b_0$: y-intercept
  $b_1$: slope
• Values that satisfy the least squares criterion:

$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS(xy)}{SS(x)}$

$b_0 = \frac{\sum y - (b_1 \cdot \sum x)}{n} = \bar{y} - (b_1 \cdot \bar{x})$

Example
Example: A recent article measured the job satisfaction of subjects with a 14-question survey.
The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals:

x   31  33  22  24  35  29  23  37
y   17  20  13  15  18  17  12  21

1) Draw a scatter diagram for this data
2) Find the equation of the line of best fit

Finding b1 & b0
• Preliminary calculations needed to find $b_1$ and $b_0$:

         x      y     x^2      xy
        23     12     529     276
        31     17     961     527
        33     20    1089     660
        22     13     484     286
        24     15     576     360
        35     18    1225     630
        29     17     841     493
        37     21    1369     777
Sum    234    133    7074    4009

Line of Best Fit

$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 7074 - \frac{(234)^2}{8} = 229.5$

$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 4009 - \frac{(234)(133)}{8} = 118.75$

$b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174$

$b_0 = \frac{\sum y - (b_1 \cdot \sum x)}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$

Solution 1) Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$

Scatter Diagram
Solution 2)

[Figure: scatter diagram, "Job Satisfaction Survey"; Job Satisfaction (vertical axis, 12 to 22) against Salary (horizontal axis, 21 to 37)]

Please Note
• Keep at least three extra decimal places while doing the calculations to ensure an accurate answer
• When rounding off the calculated values of $b_0$ and $b_1$, always keep at least two significant digits in the final answer
• The slope $b_1$ represents the predicted change in y per unit increase in x
• The y-intercept is the value of y where the line of best fit intersects the y-axis
• The line of best fit will always pass through the point $(\bar{x}, \bar{y})$

Making Predictions
1. One of the main purposes for obtaining a regression equation is for making predictions
2. For a given value of x, we can predict a value of $\hat{y}$
3. The regression equation should be used to make predictions only about the population from which the sample was drawn
4. The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval.
5. Use current data. A sample taken in 1987 should not be used to make predictions in 1999.
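The line-of-best-fit calculation and a prediction inside the sample domain can be sketched in Python as follows. The names `least_squares_line` and `predict` are illustrative, and the input value 30 is a hypothetical salary chosen only because it lies within the sampled range of 22 to 37.

```python
# Salaries (x) and job satisfaction scores (y) from the example.
salaries = [23, 31, 33, 22, 24, 35, 29, 37]
satisfaction = [12, 17, 20, 13, 15, 18, 17, 21]


def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b1 = ss_xy / ss_x                      # slope = SS(xy) / SS(x)
    b0 = (sum(ys) - b1 * sum(xs)) / n      # intercept = y-bar - b1 * x-bar
    return b0, b1


b0, b1 = least_squares_line(salaries, satisfaction)


def predict(x):
    """Predicted job satisfaction score for salary x."""
    # Assumption: only meaningful for x inside (or close to) the sample domain.
    if not min(salaries) <= x <= max(salaries):
        print(f"warning: x = {x} is outside the sample domain; use caution")
    return b0 + b1 * x


print(round(b0, 2), round(b1, 3))  # 1.49 0.517
print(round(predict(30), 1))       # hypothetical salary value inside the domain
```

The domain check in `predict` is one simple way to encode the caution about estimating outside the sampled x-values; it warns rather than refuses, since values close to the interval may still be reasonable.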