Numerical Analysis
EE, NCKU
Tien-Hao Chang (Darby Chang)
1
Correlation coefficient
What we need is a single summary number
that answers the following questions:
– does a relationship exist?
– if so, is it a positive or a negative relationship?
and
– is it a strong or a weak relationship?
Correlation coefficient: A single summary
number that gives you a good idea about
how closely one variable is related to
another variable
2
Correlation coefficient
Two-way scatter plot
Suppose that we are interested in a pair of continuous
random variables
– For example, the relationship between the percentage of children
who have been immunized against the infectious diseases diphtheria,
pertussis, and tetanus (DPT) and the mortality rate
Data for a random sample of 20 countries are shown in the
next slide
– X: the percentage of children immunized by age one year
– Y: the under-five mortality rate
Before we do any analysis, we should create a two-way
scatter plot of the data
– does a relationship exist between x and y?
The mortality rate tends to decrease as the percentage of
children immunized increases
3
4
Pearson’s CC
In the underlying population from which the
sample of points (xi, yi) is selected, the
population correlation between the variables X
and Y is denoted ρ
ρ quantifies the strength of the linear
relationship between the outcomes x and y
The estimator of ρ, denoted r, is known as Pearson’s
coefficient of correlation or simply the correlation
coefficient
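$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$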
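The estimator r divides the sum of cross-products of deviations from the means by the square root of the product of the two sums of squared deviations. A minimal Python sketch of that computation follows; the function name `pearson_r` and the sample data are illustrative assumptions, not the 20-country data from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y falls on an exact decreasing line, so r should be -1.
x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]
print(pearson_r(x, y))
```

The guard against a constant variable is a design choice of this sketch: the denominator would be zero there, so raising an error seems safer than returning an arbitrary value.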
5
The correlation coefficient is a dimensionless number; it
has no units of measurement.
– -1 ≤ r ≤ 1
– the values r = 1 and r = -1 occur when there is an exact
linear relationship between x and y
– if y tends to increase as x increases, r is
greater than 0; x and y are said to be positively
correlated
– if y tends to decrease as x increases, r is less than 0 and the
two variables are negatively correlated
– if r = 0, there is no linear relationship between x and y
and the variables are uncorrelated
http://cclearn.npue.edu.tw/tuition/ccchen-web/教育統計學/7.pdf
6
http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
7
CC is not a percent
In addition to telling you
– whether two variables are related to one another,
– whether the relationship is positive or negative and
– how large the relationship is,
The correlation coefficient tells you one more important bit
of information—it tells you exactly how much variation in
one variable is related to changes in the other variable
A correlation coefficient is a “ratio” not a percent
– many students tend to think when r = .90 it means that 90%
of the changes in one variable are accounted for or related to
the other variable
– even worse, some think that this means that any predictions
you make will be 90% accurate
– neither is correct!
8
Correlation Coefficient
Coefficient of determination
However, it is very easy to translate the correlation
coefficient into a percentage
All you have to do is “square the correlation
coefficient”, which means that you multiply it by itself
So, if the symbol for a correlation coefficient is “r”,
then the symbol for this new statistic is simply “r²”,
which can be called “r squared”
r², also called the “Coefficient of Determination”, tells
you how much variation in one variable is directly
related to (or accounted for by) the variation in the
other variable
9
The correlation coefficient is r = 0.80. Squaring r
gives r² = 0.64, so fully 64% of the variation in scores on
Variable B is directly related to the scores on
Variable A.
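One way to see why the squared coefficient can be read as a percentage is to fit a least-squares line and measure how much of the variation it accounts for. The sketch below does both computations on hypothetical scores chosen so that r comes out at 0.80; the function name and the data are assumptions made for illustration.

```python
import math

def r_squared_two_ways(xs, ys):
    """Return (r squared, share of variation in ys explained by a least-squares line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    # Least-squares line y = intercept + slope * x
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line (sum of squared residuals)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    explained_share = 1 - sse / syy
    return r * r, explained_share

# Hypothetical scores on Variable A and Variable B (not from the slides).
variable_a = [1, 2, 3, 4, 5]
variable_b = [2, 1, 4, 3, 5]
r2, explained = r_squared_two_ways(variable_a, variable_b)
print(f"r^2 = {r2:.2f}, explained share = {explained:.2f}")
```

Both numbers agree for a simple linear fit, which is the sense in which r² reports the share of variation accounted for.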
10
Statistical test
11
Correlation coefficient
Statistical inference
To test whether the correlation between two
variables is significant
– H0: ρ = 0
– H1: ρ ≠ 0
The statistic (under H0):
$$t = r \sqrt{\frac{n-2}{1-r^2}}$$
– with n-2 degrees of freedom
http://zoro.ee.ncku.edu.tw/mlb2009/res/14ch5.pdf (pp. 9-14)
12
Test the significance of the correlation coefficient for the
age and blood pressure data
– suppose that n=6, r=0.897 and α=0.05
Step 1: State the hypotheses
– H0: ρ = 0
– H1: ρ ≠ 0
Step 2: Find the critical values
– since α=0.05 and there are 6–2=4 degrees of freedom, the
critical values are t = +2.776 and t = –2.776.
Step 3: Compute the test value
– t = 4.059
Step 4: Make the decision
– reject the null hypothesis, since the test value falls in the
critical region (4.059 > 2.776)
Step 5: Summarize the results
– there is a significant relationship between the variables of age
and blood pressure
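The five steps above can be retraced in a few lines of Python. This is a sketch using only the standard library: the critical value 2.776 is copied from Step 2 rather than computed, and the function name `correlation_t_statistic` is an illustrative assumption.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

n, r = 6, 0.897          # values given in the example
t_critical = 2.776       # two-sided critical value from Step 2 (alpha = 0.05, 4 df)
t = correlation_t_statistic(r, n)
reject_h0 = abs(t) > t_critical
print(f"t = {t:.3f}, reject H0: {reject_h0}")
# prints: t = 4.059, reject H0: True
```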
13
Correlation coefficient
Limitations
It quantifies only the strength of the linear
relationship between two variables
Care must be taken when the data contain
any outliers, or pairs of observations that lie
considerably outside the range of the other
data points
A high correlation between two variables
does not imply a cause-and-effect
relationship
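The first two limitations can be illustrated numerically. The sketch below uses made-up data (an assumption, not taken from the slides): a perfect but nonlinear relationship that yields r = 0, and a small data set whose correlation changes sign once a single outlying pair is added.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1) Exact quadratic relationship: y is fully determined by x, yet r is 0
#    because the relationship is not linear.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# 2) One outlying pair is enough to change the picture.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 1, 4, 3, 5]
r_clean = pearson_r(x_clean, y_clean)                   # positive
r_outlier = pearson_r(x_clean + [20], y_clean + [0])    # negative after adding (20, 0)

print(f"quadratic: {r_curve:.2f}, clean: {r_clean:.2f}, with outlier: {r_outlier:.2f}")
```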
14
http://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/2000px-Anscombe%27s_quartet_3.svg.png
Four sets of data with the same correlation of 0.816
15
Spearman’s Rank CC
Pearson’s correlation coefficient is very sensitive to
outlying values
We may be interested in calculating a measure of
association that is more robust
One approach is to rank the two sets of outcomes x
and y separately and compute the correlation of the ranks;
this is known as Spearman’s rank correlation coefficient
$$r_{xy} = \frac{\sum_i (x_{ri} - \bar{x}_r)(y_{ri} - \bar{y}_r)}{\sqrt{\sum_i (x_{ri} - \bar{x}_r)^2 \, \sum_i (y_{ri} - \bar{y}_r)^2}}$$
– where xri and yri are the ranks associated with the ith subject
rather than the actual observations
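A rank correlation can be sketched by ranking each variable separately and then applying the Pearson formula to the ranks. In the Python sketch below, the function names and the data are illustrative, and giving tied values the average of their ranks is an assumption; the slides do not say how ties are handled.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks from 1 to n; tied values share the mean of the ranks they span (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1   # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y = x**3 is monotone but not linear in x.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(f"Pearson: {pearson_r(x, y):.3f}, Spearman: {spearman_r(x, y):.3f}")
```

Because only the ordering matters, the rank-based value reaches 1 for any strictly increasing relationship, while the Pearson value on the raw data stays below 1.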
16
Any Questions?
About Correlation Coefficient
17
Statistical inference
Basic tests
– tests about proportions
– tests about one mean
– tests of the equality of two means
– tests for variances
– references
• http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 27-33)
• http://www.math.isu.edu.tw/finance/course/sta/ch8.ppt
• http://www.tnb.org.tw/Image/ttest.ppt
• http://www.mis.ncyu.edu.tw/course/download/cftai/Chapter%206.%20Continuous%20Probability%20Distribution.PPT
More advanced tests
– ANOVA (analysis of variance)
– goodness of fit (Wilcoxon test, Kolmogorov-Smirnov test, …)
18
Multivariate analysis
Statistics
– ANOVA
– Multiple linear regression
• http://www.sjsu.edu/faculty/gerstman/biostat-text/Gerstman_PP15.ppt
• http://www.stat.nuk.edu.tw/Ray-Bing/regression/regression/Chapter3.ppt
– PCA (principal component analysis)
– ICA (independent component analysis)
– LDA (linear discriminant analysis)
So far, all techniques belong to statistics. You can find them in most
statistical software, such as MATLAB, R (http://www.r-project.org/), SPSS…
Machine learning
– Naïve Bayes (http://zoro.ee.ncku.edu.tw/mlb2009/res/11-ch4.pdf pp. 13-27)
– LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
– RVKDE (http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde)
19