Data analysis - National Cheng Kung University


Numerical Analysis
EE, NCKU
Tien-Hao Chang (Darby Chang)
1
Correlation coefficient
Two continuous variables
2
Correlation coefficient (CC)

What we need is a single summary number that
answers the following questions:
– does a relationship exist?
– if so, is it a positive or a negative relationship? and
– is it a strong or a weak relationship?

The correlation coefficient is a single summary number
that gives you a good idea of how closely one
variable is related to another variable
3
Correlation coefficient
Two-way scatter plot

Before we do any analysis, we should create a
two-way scatter plot of the data

For example, the relationship between the percentage
of children who have been immunized against the
infectious diseases DPT (diphtheria, pertussis, and tetanus)
and the mortality rate
– 𝑥: the percentage of children immunized by age one
year
– 𝑦: the under-five mortality rate
4
The mortality rate tends to decrease as the percentage of children immunized
increases
5
Correlation Coefficient
Pearson’s correlation coefficient



In the underlying population from which the sample of
points (𝑥𝑖 , 𝑦𝑖) is selected, 𝜌 denotes the population correlation
between the variables 𝑋 and 𝑌
It quantifies the strength of the linear relationship
between the outcomes 𝑥𝑖 and 𝑦𝑖
The estimator of 𝜌, denoted 𝑟, is known as Pearson’s coefficient of
correlation or correlation coefficient
– $\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$
– $r_{xy} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$
6

The correlation coefficient is a dimensionless number; it has
no units of measurement.
– −1 ≤ 𝑟 ≤ 1
– the values 𝑟 = 1 and 𝑟 = −1 occur when there is an exact linear
relationship between 𝑥𝑖 and 𝑦𝑖
– if 𝑦𝑖 tends to increase in magnitude as 𝑥𝑖 increases, 𝑟 is greater
than 0; 𝑥𝑖 and 𝑦𝑖 are said to be positively correlated
– if 𝑦𝑖 decreases as 𝑥𝑖 increases, 𝑟 is less than 0 and the two
variables are negatively correlated
– if 𝑟 = 0, there is no linear relationship between 𝑥𝑖 and 𝑦𝑖 and the
variables are uncorrelated
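
The properties listed above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name pearson_r and the small data sets are made up for illustration, and the function simply evaluates the sample formula for 𝑟 term by term.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of cross-products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations of each variable.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to illustrate the listed properties.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # exact increasing line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # exact decreasing line
# A symmetric parabola: y depends on x, yet there is no linear trend.
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_up, r_down, r_curve)
```

The parabola case is worth noting: 𝑟 = 0 rules out a linear relationship only, not a relationship of some other shape.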

http://cclearn.npue.edu.tw/tuition/ccchen-web/教育統計學/7.pdf
7
8
http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
Correlation Coefficient
Correlation coefficient is not a percent

In addition to telling you
– whether two variables are related to one another,
– whether the relationship is positive or negative and
– how large the relationship is,
the correlation coefficient can tell you one more important bit of
information: how much variation in one variable is
related to changes in the other variable
A correlation coefficient is a “ratio” rather than a percentage
– many students tend to think that 𝑟 = .90 means that 90% of the changes in
one variable are accounted for by or related to the other variable
– even worse, some think that this means that any predictions you make will be
90% accurate
– neither is correct!
9
Correlation Coefficient
Coefficient of determination




However, it is very easy to translate the correlation
coefficient into a percentage
All you have to do is “square the correlation coefficient”,
which means that you multiply it by itself
So, if the symbol for a correlation coefficient is 𝑟, then the
symbol for this new statistic is simply 𝑟², which can be
called “r squared”
𝑟², also called the “Coefficient of Determination”, tells you
how much variation in one variable is directly related to (or
accounted for by) the variation in the other variable
10
The correlation coefficient is 𝑟 = .80. By squaring 𝑟 to get 𝑟² = .64,
you find that fully 64% of the variation in scores on Variable B is directly
related to the scores on Variable A.
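
One way to see why the squared correlation reads as a percentage: for a least-squares line fitted to the data, the fraction of the variation in 𝑦 that the line accounts for equals the square of 𝑟. The sketch below checks this numerically. It is an illustration only; the scores for Variable A and Variable B are invented, and the helper names are assumptions rather than anything from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def explained_fraction(xs, ys):
    """1 - SSE/SST for the least-squares line y = a + b*x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # least-squares slope
    a = y_bar - b * x_bar    # least-squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total
    return 1 - sse / sst

# Hypothetical scores on Variable A and Variable B (made up for this sketch).
variable_a = [52, 61, 65, 70, 74, 80, 88, 93]
variable_b = [58, 55, 70, 66, 79, 72, 90, 85]

r = pearson_r(variable_a, variable_b)
r_squared = r ** 2
fraction = explained_fraction(variable_a, variable_b)
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, explained fraction = {fraction:.3f}")
```

The last two printed numbers agree, which is the sense in which 𝑟², not 𝑟, is the quantity to quote as a percentage.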
11
Statistical test
12
Correlation coefficient
Statistical inference

To test for a significant correlation between two variables
– 𝐻0 : 𝜌 = 0
– 𝐻1 : 𝜌 ≠ 0

The statistic (under 𝐻0 ):
– $t = r \sqrt{\dfrac{n-2}{1-r^2}}$

with 𝑛 − 2 degrees of freedom
http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 9-14)
13

Test the significance of the correlation coefficient for the age and blood
pressure data
– suppose that 𝑛 = 6, 𝑟 = 0.897 and 𝛼 = 0.05

Step 1: State the hypotheses
– 𝐻0 : 𝜌 = 0 𝐻1 : 𝜌 ≠ 0

Step 2: Find the critical values
– since 𝛼 = 0.05 and there are 6 − 2 = 4 degrees of freedom, the critical values
are 𝑡 = +2.776 and 𝑡 = −2.776.

Step 3: Compute the test value
– $t = 0.897 \sqrt{\dfrac{6-2}{1-0.897^2}} = 4.059$

Step 4: Make the decision
– reject the null hypothesis, since the test value falls in the critical region
(4.059 > 2.776)

Step 5: Summarize the results
– there is a significant relationship between the variables of age and blood
pressure
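
The five steps can be sketched in a few lines of code. This is a hedged illustration rather than the lecture's own script: the function name correlation_t is made up, the critical value 2.776 is taken from the slide instead of being computed, and 𝑟 = 0.897 is assumed because it is the value that reproduces the test value 𝑡 = 4.059 with 𝑛 = 6.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 pairs for n - 2 > 0 degrees of freedom")
    if not -1.0 < r < 1.0:
        raise ValueError("the statistic is undefined for |r| >= 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

n = 6               # number of (age, blood pressure) pairs
r = 0.897           # assumed sample correlation (see the note above)
t_critical = 2.776  # two-sided critical value for alpha = 0.05, 4 d.f. (from the slide)

t = correlation_t(r, n)
reject_h0 = abs(t) > t_critical  # two-sided test: compare |t| with the critical value
print(f"t = {t:.3f}, degrees of freedom = {n - 2}, reject H0: {reject_h0}")
```

With a statistics package available, the critical value could be looked up from the 𝑡 distribution instead of being hard-coded.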
14
Correlation coefficient
Limitations

It quantifies only the strength of the linear
relationship between two variables

Care must be taken when the data contain any
outliers, or pairs of observations that lie
considerably outside the range of the other
data points

A high correlation between two variables does
not imply a cause-and-effect relationship
15
http://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/2000px-Anscombe%27s_quartet_3.svg.png
Four sets of data with the same correlation of 0.816
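
The figure above shows Anscombe's quartet. As a rough check of the caption, the sketch below computes 𝑟 for each of the four sets. The data values are typed in from the commonly published version of the quartet, not taken from the slides, so treat them as an assumption; pearson_r is an illustrative helper.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Sets I to III share the same x values; set IV has a single distinct x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"set {name}: r = {r:.3f}")
```

The four scatter plots look nothing alike (a noisy line, a curve, a line with one outlier, and a vertical cluster with one outlier), which is why a two-way scatter plot should come before any summary number.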
16
Correlation coefficient
Spearman’s rank CC


Pearson’s correlation coefficient is very sensitive
to outlying values, and thus not robust
A more robust approach is to rank the two sets of outcomes
𝑥𝑖 and 𝑦𝑖 separately and compute the correlation of the ranks,
known as Spearman’s rank correlation coefficient
$r_{xy} = \dfrac{\sum_i (x_{ri} - \bar{x}_r)(y_{ri} - \bar{y}_r)}{\sqrt{\sum_i (x_{ri} - \bar{x}_r)^2 \, \sum_i (y_{ri} - \bar{y}_r)^2}}$
– where 𝑥𝑟𝑖 and 𝑦𝑟𝑖 are the ranks associated with the 𝑖-th
subject rather than the actual observations
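
A minimal sketch of the rank-based approach follows, assuming that tied values share the average of their ranks (a common convention, not one stated on the slide). The helper names and the small data set with one extreme value are invented for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's formula applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y increases with x, but the last point is an extreme outlier.
xs = [1, 2, 3, 4, 5, 6]
ys_outlier = [2, 4, 6, 8, 10, 1000]

r_pearson = pearson_r(xs, ys_outlier)
r_spearman = spearman_r(xs, ys_outlier)
print(f"Pearson r = {r_pearson:.3f}, Spearman r = {r_spearman:.3f}")
```

Because ranking keeps only the ordering, the outlier changes Pearson's 𝑟 noticeably while the rank correlation is unaffected.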
17
Any Questions?
About correlation coefficient
18
Statistical inference

Basic tests
– tests about proportions
– tests about one mean
– tests of the equality of two means
– tests for variances
– references
• http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 27-33)
• http://www.math.isu.edu.tw/finance/course/sta/ch8.ppt
• http://www.tnb.org.tw/Image/ttest.ppt
• http://www.mis.ncyu.edu.tw/course/download/cftai/Chapter%206.%20Continuous%20Probability%20Distribution.PPT

More advanced tests
– ANOVA (analysis of variance)
– goodness of fit (Wilcoxon test, Kolmogorov-Smirnov test, …)
19
Multivariate analysis

Statistics
– ANOVA
– Multiple linear regression
• http://www.sjsu.edu/faculty/gerstman/biostat-text/Gerstman_PP15.ppt
• http://www.stat.nuk.edu.tw/Ray-Bing/regression/regression/Chapter3.ppt
– PCA (principal component analysis)
– ICA (independent component analysis)
– LDA (linear discriminant analysis)

So far, all techniques belong to statistics. You can find them in most statistical
software, such as MATLAB, R (http://www.r-project.org/), SPSS…

Machine learning
– Naïve Bayes (http://zoro.ee.ncku.edu.tw/mlb2009/res/11-ch4.pdf pp. 13-27)
– LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
– RVKDE (http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde)
20
Let’s see an Excel tutorial
21
Let’s see the data
22
Points to a good final project
23
Points to a good final project

Raise some interesting issues
– from observations
– you have at least two trap issues (next slide)

Design good analyses
– make sure that your analyses fit your issues
– do the results concur with your speculations?
– design further analyses
24
Predict masked disease codes
acode

class:name
There are some ‘masked’ disease codes
– for example, disease #14 has no acode, class and name

First, predict the masked disease names
Second, some masked diseases have names that are not in the file
(namely, novel diseases). Try to identify them and, if possible,
figure out what diseases they are
25
The final project includes

Presentation
– slides (.ppt) and how you present them
– convincing for me and your classmates
– reasonably evaluate other groups’ work (voting on others’ work if we have time)

Project
– scripts (executable)
– results (.txt, .xls, …)
– a step-by-step README of how you get the results from cd.dat (.txt, .doc, …)

Report
– a more detailed document of your slides (.doc)
– the duty of each group member
– anything worthy of extra credit
26
Final grade





Email all the materials to [email protected]
before 2011/6/20 23:59
The raw grade will be available as soon as the final
project of your group is received
Ask me ([email protected]) about your grade
with your NCKU email account
The final (adjusted) grades must wait for all groups
(2011/6/21, I hope)
You have about one week to double-check the grade,
and the final grades will be submitted around 2011/6/27
27