Chapter 3 ~ Descriptive Analysis &
Presentation of Bivariate Data
[Figure: Regression Plot of Weight (vertical axis) versus Height (horizontal axis), with fitted line Y = 2.31464 + 1.28722X and r = 0.559]
Chapter Goals
• To be able to present bivariate data in tabular and
graphic form
• To become familiar with the ideas of descriptive
presentation
• To gain an understanding of the distinction
between the basic purposes of correlation
analysis and regression analysis
3.1 ~ Bivariate Data
Bivariate Data: Consists of the values of two different
response variables that are obtained from the same
population of interest
Three combinations of variable types:
1. Both variables are qualitative (attribute)
2. One variable is qualitative (attribute) and the other
is quantitative (numerical)
3. Both variables are quantitative (both numerical)
Two Qualitative Variables
• When bivariate data results from two qualitative (attribute or
categorical) variables, the data is often arranged in a cross-tabulation or contingency table

Example: A survey was conducted to investigate the
relationship between preferences for television, radio, or
newspaper for national news, and gender. The results are
given in the table below:
          TV   Radio    NP
Male     280     175   305
Female   115     275   170
Marginal Totals
• This table may be extended to display the marginal totals (or
marginals). The total of the marginal totals is the grand total:
               TV   Radio    NP   Row Totals
Male          280     175   305          760
Female        115     275   170          560
Col. Totals   395     450   475         1320
Note: Contingency tables often show percentages (relative
frequencies). These percentages are based on the entire
sample or on the subsample (row or column) classifications.
Percentages Based on the Grand Total
(Entire Sample)
• The previous contingency table may be converted to
percentages of the grand total by dividing each frequency by
the grand total and multiplying by 100
– For example, 175 becomes 13.3%:
$$\left(\frac{175}{1320}\right) \times 100 \approx 13.3$$
                TV   Radio     NP   Row Totals
Male          21.2    13.3   23.1         57.6
Female         8.7    20.8   12.9         42.4
Col. Totals   29.9    34.1   36.0        100.0
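The marginal totals and the grand-total percentages can be checked with a short script. This is a minimal sketch in plain Python; the variable names (counts, row_totals, col_totals, pct_of_grand) are illustrative choices, not part of the original slides.

```python
# Counts from the survey table: rows are gender, columns are news source.
counts = {
    "Male":   {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}
sources = ["TV", "Radio", "NP"]

# Marginal totals: sum across each row and down each column.
row_totals = {gender: sum(row.values()) for gender, row in counts.items()}
col_totals = {s: sum(counts[gender][s] for gender in counts) for s in sources}

# The grand total is the total of either set of marginal totals.
grand_total = sum(row_totals.values())

# Percentages based on the grand total, rounded to one decimal place.
pct_of_grand = {
    gender: {s: round(100 * counts[gender][s] / grand_total, 1) for s in sources}
    for gender in counts
}

print("Row totals:", row_totals)
print("Column totals:", col_totals)
print("Grand total:", grand_total)
print("Male/Radio percent of grand total:", pct_of_grand["Male"]["Radio"])
```

Summing the column totals instead of the row totals gives the same grand total, which is a quick consistency check on any contingency table.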
Illustration
• These same statistics (numerical values describing sample
results) can be shown in a (side-by-side) bar graph:
[Figure: side-by-side bar graph titled "Percentages Based on Grand Total"; vertical axis: Percent (0 to 25); horizontal axis: Media (TV, Radio, NP); separate bars for Male and Female]
Percentages Based on Row (Column) Totals
• The entries in a contingency table may also be expressed as
percentages of the row (column) totals by dividing each row
(column) entry by that row’s (column’s) total and multiplying by
100. The entries in the contingency table below are expressed as
percentages of the column totals:
                 TV   Radio      NP   Row Totals
Male           70.9    38.9    64.2         57.6
Female         29.1    61.1    35.8         42.4
Col. Totals   100.0   100.0   100.0        100.0
Note: These statistics may also be displayed in a side-by-side bar
graph
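The column-based percentages follow the same pattern, with each entry divided by its own column total instead of the grand total. The sketch below uses plain Python, and its names (counts, col_totals, pct_of_column) are illustrative assumptions rather than anything defined in the slides.

```python
# Counts from the survey table: rows are gender, columns are news source.
counts = {
    "Male":   {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}
sources = ["TV", "Radio", "NP"]

# Column totals are the denominators for column-based percentages.
col_totals = {s: sum(counts[gender][s] for gender in counts) for s in sources}

# Each entry as a percentage of its column total, to one decimal place.
pct_of_column = {
    gender: {s: round(100 * counts[gender][s] / col_totals[s], 1) for s in sources}
    for gender in counts
}

for gender, row in pct_of_column.items():
    print(gender, row)
```

Within each column the percentages add to 100 (up to rounding), which distinguishes this table from the grand-total version, where only the whole table adds to 100.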
One Qualitative & One Quantitative Variable
1. When bivariate data results from one qualitative and one
quantitative variable, the quantitative values are viewed as
separate samples
2. Each set is identified by levels of the qualitative variable
3. Each sample is described using summary statistics, and the
results are displayed for side-by-side comparison
4. Statistics for comparison: measures of central tendency,
measures of variation, 5-number summary
5. Graphs for comparison: dotplot, boxplot
Example
Example: A random sample of households from three different
parts of the country was obtained and their electric
bill for June was recorded. The data is given in the
table below:
Northeast   Midwest    West
    23.75     34.38   54.54
    40.50     34.35   65.60
    33.65     39.15   59.78
    31.25     37.12   45.12
    42.55     36.71   60.35
    50.60     34.39   61.53
    37.70     35.12   52.79
    31.55     35.80   47.37
    38.85     37.24   59.64
    21.25     40.01   37.40
• The part of the country is a qualitative variable with three levels of
response. The electric bill is a quantitative variable. The electric
bills may be compared with numerical and graphical techniques.
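A numerical comparison of the three samples might look like the sketch below. It uses Python's standard statistics module; the helper name summarize and the choice of statistics shown are assumptions made for illustration, not a prescribed method from the slides.

```python
import statistics

# June electric bills by region, as listed in the example table.
bills = {
    "Northeast": [23.75, 40.50, 33.65, 31.25, 42.55, 50.60, 37.70, 31.55, 38.85, 21.25],
    "Midwest":   [34.38, 34.35, 39.15, 37.12, 36.71, 34.39, 35.12, 35.80, 37.24, 40.01],
    "West":      [54.54, 65.60, 59.78, 45.12, 60.35, 61.53, 52.79, 47.37, 59.64, 37.40],
}

def summarize(values):
    """Return measures of central tendency and variation for one sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1 divisor)
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

# Treat each level of the qualitative variable as a separate sample.
summary = {region: summarize(values) for region, values in bills.items()}

for region, stats in summary.items():
    print(region, {name: round(value, 2) for name, value in stats.items()})
```

Printing the three summaries side by side makes the differences in center and spread between regions easy to read off.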
Comparison Using Dotplots
[Figure: dotplots of the June electric bills for the Northeast, Midwest, and West on a common scale (ticks at 24.0, 32.0, 40.0, 48.0, 56.0, 64.0)]
• The electric bills in the Northeast tend to be more spread out than
those in the Midwest. The bills in the West tend to be higher than
both those in the Northeast and Midwest.
Comparison Using Box-and-Whisker Plots
[Figure: side-by-side box-and-whisker plots titled "The Monthly Electric Bill"; vertical axis: Electric Bill (20 to 70); one plot each for Northeast, Midwest, and West]
Two Quantitative Variables
1. Expressed as ordered pairs: (x, y)
2. x: input variable, independent variable
y: output variable, dependent variable
Scatter Diagram: A plot of all the ordered pairs of bivariate
data on a coordinate axis system. The input variable x is
plotted on the horizontal axis, and the output variable y is
plotted on the vertical axis.
Note: Use scales so that the range of the y-values is equal to or
slightly less than the range of the x-values. This creates a
window that is approximately square.
Example
Example: In a study involving children’s fear related to being
hospitalized, the age and the score each child made on
the Child Medical Fear Scale (CMFS) are given in the
table below:
Age (x)     8   9   9  10  11   9   8   9   8  11
CMFS (y)   31  25  40  27  35  29  25  34  44  19
Age (x)     7   6   6   8   9  12  15  13  10  10
CMFS (y)   28  47  42  37  35  16  12  23  26  36
Construct a scatter diagram for this data.
Solution
• age = input variable, CMFS = output variable
[Figure: scatter diagram titled "Child Medical Fear Scale"; vertical axis: CMFS (10 to 50); horizontal axis: Age (6 to 15)]
3.2 ~ Linear Correlation
• Measures the strength of a linear relationship
between two variables
– As x increases, no definite shift in y: no correlation
– As x increases, a definite shift in y: correlation
– Positive correlation: x increases, y increases
– Negative correlation: x increases, y decreases
– If the ordered pairs follow a straight-line path: linear
correlation
Example: No Correlation
• As x increases, there is no definite shift in y:
[Figure: scatter diagram of Output (vertical axis) versus Input (horizontal axis) with no definite pattern]
Example: Positive Correlation
• As x increases, y also increases:
[Figure: scatter diagram of Output (vertical axis) versus Input (horizontal axis) with an upward trend]
Example: Negative Correlation
• As x increases, y decreases:
[Figure: scatter diagram of Output (vertical axis) versus Input (horizontal axis) with a downward trend]
Please Note
• Perfect positive correlation: all the points lie along a line
with positive slope
• Perfect negative correlation: all the points lie along a line
with negative slope
• If the points lie along a horizontal or vertical line: no
correlation
• If the points exhibit some other nonlinear pattern: no linear
relationship, no correlation
• A numerical measure of correlation is needed
3.2 ~ Linear Correlation
Coefficient of Linear Correlation: r, measures the strength of
the linear relationship between two variables
Pearson’s Product Moment Formula:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$
Notes:
• $-1 \le r \le +1$
• r = +1: perfect positive correlation
• r = -1: perfect negative correlation
Alternate Formula for r
$$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$$
SS(x) = “sum of squares for x” = $\sum x^2 - \frac{(\sum x)^2}{n}$
SS(y) = “sum of squares for y” = $\sum y^2 - \frac{(\sum y)^2}{n}$
SS(xy) = “sum of squares for xy” = $\sum xy - \frac{\sum x \sum y}{n}$
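The alternate formula gives the same value as Pearson's product moment formula. The short derivation below sketches why, assuming that s_x and s_y denote the sample standard deviations computed with the n - 1 divisor (an assumption, but one consistent with the (n - 1) factor in Pearson's formula):

```latex
% Sums of squares are the sums of squared (or cross) deviations from the means.
SS(x) = \sum (x - \bar{x})^2, \qquad
SS(y) = \sum (y - \bar{y})^2, \qquad
SS(xy) = \sum (x - \bar{x})(y - \bar{y})

% Sample standard deviations written in terms of the sums of squares.
s_x = \sqrt{\frac{SS(x)}{n - 1}}, \qquad
s_y = \sqrt{\frac{SS(y)}{n - 1}}

% The (n - 1) factors cancel in the denominator.
(n - 1)\, s_x s_y = (n - 1) \sqrt{\frac{SS(x)\, SS(y)}{(n - 1)^2}} = \sqrt{SS(x)\, SS(y)}

% Hence the two formulas for r agree.
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}
  = \frac{SS(xy)}{\sqrt{SS(x)\, SS(y)}}
```

The alternate form is convenient for hand calculation because it needs only the five sums of x, y, x², y², and xy.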
Example
Example: The table below presents the weight (in thousands of
pounds) x and the gasoline mileage (miles per gallon)
y for ten different automobiles. Find the linear
correlation coefficient:
        x      y      x^2     y^2       xy
      2.5     40     6.25    1600    100.0
      3.0     43     9.00    1849    129.0
      4.0     30    16.00     900    120.0
      3.5     35    12.25    1225    122.5
      2.7     42     7.29    1764    113.4
      4.5     19    20.25     361     85.5
      3.8     32    14.44    1024    121.6
      2.9     39     8.41    1521    113.1
      5.0     15    25.00     225     75.0
      2.2     14     4.84     196     30.8
Sum  34.1    309   123.73   10665   1010.9
Completing the Calculation for r
$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$$
$$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$$
$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$$
$$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$$
Please Note
• r is usually rounded to the nearest hundredth
• r close to 0: little or no linear correlation
• As the magnitude of r increases, towards -1 or +1, there is
an increasingly stronger linear correlation between the
two variables
• r can be estimated from the scatter diagram, provided the
window is approximately square. This is useful for
checking calculations.
3.3 ~ Linear Regression
• Regression analysis finds the equation of the line
that best describes the relationship between two
variables
• One use of this equation: to make predictions
Models or Prediction Equations
• Some examples of various possible relationships:
Linear: $\hat{y} = b_0 + b_1 x$
Quadratic: $\hat{y} = a + bx + cx^2$
Exponential: $\hat{y} = a(b^x)$
Logarithmic: $\hat{y} = a \log_b x$
Note: What would a scatter diagram look like to suggest each
relationship?
Method of Least Squares
• Equation of the best-fitting line: $\hat{y} = b_0 + b_1 x$
• Predicted value: $\hat{y}$
• Least squares criterion:
– Find the constants $b_0$ and $b_1$ such that the sum
$$\sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
is as small as possible
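Illustration
• Observed and predicted values of y:
[Figure: the line $\hat{y} = b_0 + b_1 x$ with an observed point (x, y), the corresponding point (x, $\hat{y}$) on the line, and the vertical distance $y - \hat{y}$ between them]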
The Line of Best Fit Equation
• The equation is determined by:
b0: y-intercept
b1: slope
• Values that satisfy the least squares criterion:
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS(xy)}{SS(x)}$$
$$b_0 = \frac{\sum y - (b_1 \cdot \sum x)}{n} = \bar{y} - (b_1 \cdot \bar{x})$$
Example
Example: A recent article measured the job satisfaction of
subjects with a 14-question survey. The data below
represents the job satisfaction scores, y, and the
salaries, x, for a sample of similar individuals:
x   31  33  22  24  35  29  23  37
y   17  20  13  15  18  17  12  21
1) Draw a scatter diagram for this data
2) Find the equation of the line of best fit
Finding b1 & b0
• Preliminary calculations needed to find b1 and b0:
       x     y    x^2     xy
      23    12    529    276
      31    17    961    527
      33    20   1089    660
      22    13    484    286
      24    15    576    360
      35    18   1225    630
      29    17    841    493
      37    21   1369    777
Sum  234   133   7074   4009
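Line of Best Fit
$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 7074 - \frac{(234)^2}{8} = 229.5$$
$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 4009 - \frac{(234)(133)}{8} = 118.75$$
$$b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174$$
$$b_0 = \frac{\sum y - (b_1 \cdot \sum x)}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$$
(The value 1.4902 results from carrying the unrounded value of $b_1$ through the calculation.)
Solution 1) Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$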
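The slope and intercept can be verified numerically. The following is a minimal sketch in plain Python; the variable names (salary, satisfaction, ss_x, ss_xy, b0, b1) are illustrative, and the unrounded slope is carried through to the intercept.

```python
# Salaries (x) and job satisfaction scores (y) from the example.
salary = [31, 33, 22, 24, 35, 29, 23, 37]
satisfaction = [17, 20, 13, 15, 18, 17, 12, 21]
n = len(salary)

# Sums of squares via the shortcut formulas.
ss_x = sum(x * x for x in salary) - sum(salary) ** 2 / n
ss_xy = (sum(x * y for x, y in zip(salary, satisfaction))
         - sum(salary) * sum(satisfaction) / n)

# Least squares slope and y-intercept.
b1 = ss_xy / ss_x
b0 = (sum(satisfaction) - b1 * sum(salary)) / n

print("SS(x) =", ss_x)
print("SS(xy) =", ss_xy)
print("b1 =", round(b1, 4))
print("b0 =", round(b0, 4))

# The fitted line passes through the point of means (x-bar, y-bar).
x_bar = sum(salary) / n
y_bar = sum(satisfaction) / n
print("Line at x-bar:", round(b0 + b1 * x_bar, 4), "y-bar:", y_bar)
```

Rounding b1 to 0.5174 before computing b0 gives about 1.4911 instead of 1.4902, which illustrates why extra decimal places should be kept during the calculation.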
Scatter Diagram
Solution 2)
[Figure: scatter diagram titled "Job Satisfaction Survey"; vertical axis: Job Satisfaction (12 to 22); horizontal axis: Salary (21 to 37)]
Please Note
• Keep at least three extra decimal places while doing the
calculations to ensure an accurate answer
• When rounding off the calculated values of b0 and b1,
always keep at least two significant digits in the final
answer
• The slope b1 represents the predicted change in y per unit
increase in x
• The y-intercept is the value of y where the line of best fit
intersects the y-axis
• The line of best fit will always pass through the point $(\bar{x}, \bar{y})$
Making Predictions
1. One of the main purposes for obtaining a regression equation
is for making predictions
2. For a given value of x, we can predict a value of $\hat{y}$
3. The regression equation should be used to make predictions
only about the population from which the sample was drawn
4. The regression equation should be used only to cover the
sample domain on the input variable. You can estimate
values outside the domain interval, but use caution and
use values close to the domain interval.
5. Use current data. A sample taken in 1987 should not be
used to make predictions in 1999.
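One way to build the domain caution into a prediction is sketched below, using the job satisfaction line from the earlier example. The function name predict_satisfaction, its default coefficients (the rounded values 1.4902 and 0.5174), and the choice to issue a warning rather than refuse are all assumptions made for illustration; the domain limits 22 and 37 are the smallest and largest salaries in that sample.

```python
import warnings

def predict_satisfaction(x, b0=1.4902, b1=0.5174, x_min=22, x_max=37):
    """Predict y-hat = b0 + b1 * x, warning when x lies outside the sample domain."""
    if not x_min <= x <= x_max:
        # Extrapolation is allowed but flagged, since the fitted line
        # is only supported by data inside [x_min, x_max].
        warnings.warn(
            f"x = {x} is outside the sample domain [{x_min}, {x_max}]; "
            "use the prediction with caution."
        )
    return b0 + b1 * x

# A prediction inside the sample domain (about 17.0).
print(round(predict_satisfaction(30), 2))
```

A warning was chosen over an error because the slides permit estimates slightly outside the domain, provided they are treated with caution.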