Descriptive Statistics(1)


Probability and Statistics for Computer Engineers
What is a model?
Types of Models
Purpose of the Class
Course Overview
Model
• Model
– Virtual system to explain phenomena or behavior
– Example
• Stock price and weather forecasting rules, Ohm’s law
• Types of Models
– Deterministic vs. Statistical (Stochastic)
– Chaotic vs. Non-chaotic
– Deterministic Model
• Differential equations, functions, transforms
Model
– Statistical Model
• Not data but statistics (mean, variance, probability density function)
• Uncertainty
– Ambiguity due to lack of evidence
– Vagueness inherent in language
• Relative Frequency
• Probability
– Mathematical model of relative frequency
Why do we need to study this?
• Purpose of Study
– Tool for analyzing and understanding statistical models
• Related Courses in Computer Engineering
Statistical Pattern Recognition and Machine Learning
Data Mining
Data Communication
Artificial Intelligence
Simulation Engineering
Statistical Communication Theory
Digital Signal Processing
Image Processing
Lecture Plan
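• Text
– 공학인증을 위한 확률과 통계 (Probability and Statistics for Engineering Accreditation), 이재원 외 (Lee Jae-won et al.), 카오스북 (Chaos Book)
• Topics to be covered
– Descriptive Statistics
– Probability and Random Variables
– Sampling Distribution
– Statistical Estimation
– Hypothesis Testing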
Lecture Plan
• Grading Policy
– Exam I 25%, Exam II 25%, Exam III 25%
– Homework with programming 15%
– Attendance 10%
Descriptive Statistics
• Graph for Data Analysis
• Sample Mean, Variance and Standard Deviation
• Histogram and Cumulative Histogram
• Measures of Central Tendency
• Bivariate Data and Scatter Diagram (Plot)
• Covariance and Correlation Coefficient
• Uniform Random Number for Simulation
Graph for Data Analysis
• Data Table, Graph
– Data:
• Summarized for some purpose and
– Graph
• Histogram of frequency
• Dispersion plot
• Cumulative histogram of frequency
Graph for Data Analysis
• Example: Sample weights of male students
• {65,67,64,66,63,….62}
• Ascending OS = {53, 58, 60, 61, … 72}
• Frequency Distribution

Class Range   Class Center (X)   Frequency (FR)
50.5-53.5     52                 1
53.5-56.5     55                 2
56.5-59.5     58                 6
59.5-62.5     61                 11
62.5-65.5     64                 16
65.5-68.5     67                 9
68.5-71.5     70                 4
71.5-74.5     73                 1
Graph for Data Analysis
– How to make a frequency table
• Number of classes: 6-20
• Class interval = [Range (Max. data - Min. data) / Number of classes + 1]
– Types of Graphs for Univariate Data
• Histogram of frequency
• Relative frequency = frequency / total number of data
• Frequency polygon
• Cumulative relative frequency polygon
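The class-interval rule and the relative frequency can be sketched in Python. This is a minimal sketch, assuming the square brackets mean "integer part" (floor); the function name `class_width` is hypothetical. The minimum (53), maximum (72), the 8 classes, and the frequencies are taken from the weight example.

```python
import math

def class_width(min_value, max_value, num_classes):
    """Class interval = [range / number of classes + 1].

    Assumption: [.] is read as the integer part (floor).
    """
    return math.floor((max_value - min_value) / num_classes) + 1

# Frequencies of the 8 classes in the weight table (centers 52, 55, ..., 73).
frequencies = [1, 2, 6, 11, 16, 9, 4, 1]
total = sum(frequencies)

# Relative frequency = frequency / total number of data.
relative = [f / total for f in frequencies]

width = class_width(53, 72, 8)
print("class width:", width)
print("relative frequencies:", relative)
```

With these inputs the rule gives a class width of 3, which matches the 3-unit classes of the table.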
Graph for Data Analysis
How to calculate the sample mean?
What does the sample mean stand for?
Is anything else needed for a more precise description of the data?
Sample Mean, Variance and Standard Deviation
• Example:
– Height data of all the students in this class (not a sample, but a population)
– Weights of sampled male students at CBNU (sample)
• (Sample) Mean of the data
For $x_i$, $i = 1, 2, 3, \ldots, n$:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Residual:
$d_i = x_i - \bar{x}$
$\sum_i d_i = \sum_i (x_i - \bar{x}) = \sum_i x_i - n\bar{x} = 0$
– A representative of the data
– Simple, but not a sufficient description
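A small Python check of the sample mean and of the fact that the residuals sum to zero. The six values are only the visible entries of the weight list (the full data set is elided in the slides), so they are used here purely as illustrative input.

```python
# Illustrative data: the visible entries of the weight example (assumption:
# the elided values are simply left out).
values = [65, 67, 64, 66, 63, 62]

n = len(values)
mean = sum(values) / n                   # x_bar = (1/n) * sum(x_i)
residuals = [x - mean for x in values]   # d_i = x_i - x_bar

print("mean:", mean)                     # 64.5
print("residuals:", residuals)
print("sum of residuals:", sum(residuals))  # 0.0
```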
Sample Mean, Variance and Standard Deviation
• Note:
• Optimal in the sense of the sum of squared residuals
$E = \sum_i (x_i - C)^2$
$\frac{\partial E}{\partial C} = -2 \sum_i (x_i - C) = 0$, or $nC = \sum_i x_i$, or $C = \frac{1}{n} \sum_i x_i$
• Sometimes it is a poor representative: outlier data
Example: 98 96 97 68 97
Mean = 91.2. Is it reasonable?
• Kinds of Representatives
Median of the data, trimmed mean of the data
Representatives other than the mean are needed.
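The two points of this slide can be checked numerically in Python. This is a rough sketch: the grid search over candidate values of C (step 0.1 between 60 and 100) is an arbitrary choice made only for illustration, and the helper name `sum_sq` is hypothetical.

```python
from statistics import mean, median

scores = [98, 96, 97, 68, 97]  # example data from the slide

def sum_sq(c, data):
    """E(C) = sum of squared residuals around the candidate value C."""
    return sum((x - c) ** 2 for x in data)

m = mean(scores)

# Coarse grid search for the C that minimizes E(C); it should land on the mean.
candidates = [60 + 0.1 * k for k in range(401)]
best = min(candidates, key=lambda c: sum_sq(c, scores))

print("mean:", m)
print("grid minimizer of E(C):", best)
print("median:", median(scores))
```

The mean is pulled down toward the single outlier 68, while the median stays at 97, close to the bulk of the data.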
Sample Mean, Variance and Standard Deviation
• Sample Variance and Standard Deviation
$s_x^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2 = \frac{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}{n(n-1)}$ (Derive it!)
$s_x = \sqrt{s_x^2}$
$s_x^2 \approx \frac{1}{n} \sum_i (x_i - \bar{x})^2 = \frac{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}{n^2}$ when $n$ is very large.
– Unit of standard deviation = unit of data
– A measure of dispersion of the data
– Variance together with the mean is still not enough to describe the data.
– Then how can the data be described completely?
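As a sanity check on the "Derive it!" identity, the definition and the shortcut formula can be compared in Python. A minimal sketch; the function names are hypothetical and the data are the five scores from the outlier example.

```python
import math

def variance_definition(data):
    """s^2 = (1/(n-1)) * sum((x_i - x_bar)^2)."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def variance_shortcut(data):
    """s^2 = (n * sum(x_i^2) - (sum(x_i))^2) / (n * (n - 1))."""
    n = len(data)
    return (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

data = [98, 96, 97, 68, 97]
s2_def = variance_definition(data)
s2_short = variance_shortcut(data)
s = math.sqrt(s2_def)  # standard deviation, same unit as the data

print(s2_def, s2_short, s)
```

Both functions return the same value up to rounding error (168.7 for this data). The shortcut needs only the running sums of x and x squared, so it works in a single pass over the data.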
Histogram and Cumulative Histogram
• Frequency/Cumulative Frequency

Score   # of students   Cum. no.       Relative Freq.   Cum. Relative Freq.
        (Frequency)     (Cum. Freq.)
0-9     2               2              0.02             0.02
10-19   3               5              0.03             0.05
20-29   5               10             0.05             0.10
30-39   7               17             0.07             0.17
40-49   8               25             0.08             0.25
50-59   16              41             0.16             0.41
60-69   25              66             0.25             0.66
70-79   17              83             0.17             0.83
80-89   12              95             0.12             0.95
90-99   5               100            0.05             1.00
Total   100                            1.00
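The cumulative and relative columns of this table follow mechanically from the frequency column. A short Python sketch, using the frequencies listed above (variable names are illustrative):

```python
from itertools import accumulate

# Frequencies for the score classes 0-9, 10-19, ..., 90-99.
freq = [2, 3, 5, 7, 8, 16, 25, 17, 12, 5]

cum = list(accumulate(freq))          # cumulative frequency
total = cum[-1]                       # total number of students
rel = [f / total for f in freq]       # relative frequency
cum_rel = [c / total for c in cum]    # cumulative relative frequency

for row in zip(freq, cum, rel, cum_rel):
    print(row)
```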
Histogram and Cumulative Histogram
[Figure: histogram and cumulative histogram]
The area of the frequency histogram = 100 (the total frequency, taking each class as unit width)
The area of the relative frequency histogram = 1.00
Non-decreasing property of the cumulative histogram
Probability is a mathematical model of relative frequency.
The most precise description of data: density or distribution
Population and Sample
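• Population
– The set of all possible observations or measurements of interest
• Finite population (e.g. voters), infinite population (e.g. the set of natural numbers)
• Sample
– A subset of the population selected according to some criterion
• Example: defect inspection at a smartphone factory
– Population: all smartphones produced
– Sample: a certain number of smartphones selected at random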
Population and Sample
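• Parameter
– A characteristic or summary value of the data obtained from the population
– Examples: population mean ($\mu$), population variance ($\sigma^2$), population standard deviation ($\sigma$)
• Statistic
– A numerical value describing a characteristic or property of the sample
– Examples: sample mean ($\bar{X}$), sample variance ($s^2$), sample standard deviation ($s$), mode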
Population and Sample
• Summary

                     Population   Sample      Remark
Size                 $N$          $n$
Mean                 $\mu$        $\bar{X}$   $E(\bar{X}) = \mu$
Variance             $\sigma^2$   $s^2$       $E(s^2) = \sigma^2$
Standard deviation   $\sigma$     $s$
Measures of Central Tendency
• Arithmetic mean
– Geometric mean
– Harmonic mean
• Median
• Mode
• Weighted average
• Winsorized mean
Arithmetic Mean
• Mean in frequency distribution
– Freq. in population: $f_1, f_2, \ldots, f_L$
– Sample freq.: $f_1, f_2, \ldots, f_l$
– Class centers of population: $x_1, x_2, \ldots, x_L$
– Class centers of sample: $x_1, x_2, \ldots, x_l$
– Number of classes: $L$ (population), $l$ (sample)
$\mu_w = \sum_{i=1}^{L} f_i x_i \Big/ \sum_{i=1}^{L} f_i$
$\bar{x}_w = \sum_{i=1}^{l} f_i x_i \Big/ \sum_{i=1}^{l} f_i$
Remember these equations for understanding the expected value.
Arithmetic Mean
• Example: Number of family members a worker is responsible for (dependents)

Number   Class center   Freq.
0-2      1              3
3-5      4              26
6-8      7              23
9-11     10             1

$\mu_w = \frac{3 \cdot 1 + 26 \cdot 4 + 23 \cdot 7 + 1 \cdot 10}{3 + 26 + 23 + 1} \approx 5.25$
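The grouped mean of this example can be reproduced in a few lines of Python. A minimal sketch; `grouped_mean` is a hypothetical helper name, and the class centers and frequencies are the ones in the table.

```python
def grouped_mean(centers, freqs):
    """Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    return sum(f * x for x, f in zip(centers, freqs)) / sum(freqs)

centers = [1, 4, 7, 10]   # class centers
freqs = [3, 26, 23, 1]    # class frequencies

mu_w = grouped_mean(centers, freqs)
print(round(mu_w, 2))     # 5.25
```

Each class is represented by its center, so the result approximates the mean of the raw data rather than reproducing it exactly.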
Arithmetic Mean
• Features of arithmetic mean
– The simplest representative
– Good estimate of central tendency
– Optimal with respect to mean squared error
$\min_C \left[ \frac{1}{N} \sum_{i=1}^{N} (x_i - C)^2 \right] = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
– Center of the range in a symmetric distribution
– Sensitive to outliers
Median
• Median, $M_e$
The center value after sorting the data by magnitude
$P = \{X_1, X_2, \ldots, X_{N-1}, X_N\}$ (indexed in sorted order)
If $N$ is odd, $M_e = X_{(N+1)/2}$
If $N$ is even, $M_e = (X_{N/2} + X_{(N/2)+1}) / 2$
Example
Med{3, 4, 10, 9} = (4+9)/2 = 6.5
P = {50,75,60,55,70,200,55,55}
Arithmetic mean = 77.5, Median = (55+60)/2 = 57.5
Which one is better for central tendency? Outlier = 200
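The odd/even rule can be written directly in Python. A minimal sketch with a hand-written `median` function (Python's standard `statistics.median` follows the same rule); the 1-based indices of the slide become 0-based list indices.

```python
def median(data):
    """Median by the odd/even rule, after sorting the data."""
    s = sorted(data)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]           # X_{(N+1)/2}, shifted to 0-based
    return (s[n // 2 - 1] + s[n // 2]) / 2   # (X_{N/2} + X_{N/2+1}) / 2

P = [50, 75, 60, 55, 70, 200, 55, 55]
print(median([3, 4, 10, 9]))   # 6.5
print(sum(P) / len(P))         # 77.5
print(median(P))               # 57.5
```

The outlier 200 moves the arithmetic mean to 77.5, above all but one of the values, while the median stays at 57.5.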
Mode
• Mode, $M_o$
The value that has the maximum frequency
The position where the frequencies are concentrated
In a symmetric distribution, $M = M_e = M_o$ ($M$: arithmetic mean)
In a single-mode asymmetric distribution,
$M - M_o \approx 3(M - M_e)$
Example:
Mode(2,3,2,1,4) = 2,
Mode(5,6,7,8) = None
Mode(9,5,4,8,9,8) = 8 and 9
Mode
• Example
$M = 70$, $M_e = 72$, $M_o = 75$
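A Python sketch of the mode that reproduces the three list examples, followed by a quick look at the relation between mean, median and mode for the values above. The helper `modes` is hypothetical. Returning an empty list when no value repeats is an assumption chosen to match the "None" convention of the example.

```python
from collections import Counter

def modes(data):
    """Return all values with the maximum frequency.

    Assumption: if every value occurs only once, report no mode
    (empty list), as in the Mode(5,6,7,8) example.
    """
    counts = Counter(data)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 2, 1, 4]))      # [2]
print(modes([5, 6, 7, 8]))         # []
print(modes([9, 5, 4, 8, 9, 8]))   # [8, 9]

# Empirical relation for the example M = 70, Me = 72, Mo = 75.
M, Me, Mo = 70, 72, 75
print(M - Mo, 3 * (M - Me))        # -5 -6
```

The two sides of the relation come out as -5 and -6, so it holds only approximately here.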
Weighted Mean
• Data and weights
$\{(X_1, W_1), (X_2, W_2), \ldots, (X_n, W_n)\}$
• Weighted Mean
$\text{Weighted Mean} = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$
• Example:
English (4 credits, C (2 points)), Statistics (3 credits, A (4 points)), Physical Education (1 credit, A (4 points))
Weighted Mean = (4x2 + 3x4 + 1x4)/(4+3+1) = 3 (B)
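The grade example can be checked with a short Python sketch, where the credits act as weights. The function name `weighted_mean` and the tuple layout are illustrative choices.

```python
def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs: sum(W_i * X_i) / sum(W_i)."""
    total_weight = sum(w for _, w in pairs)
    return sum(x * w for x, w in pairs) / total_weight

# (grade points, credits) for English, Statistics, Physical Education.
courses = [(2, 4), (4, 3), (4, 1)]

gpa = weighted_mean(courses)
print(gpa)  # 3.0
```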
Winsorized Mean
• Winsorized Mean
– Sort the data in order, substitute the ¼-th ordered value for
the data below it and the ¾-th ordered value for the data
above it, and take the average
– Example: S = {5,6,7,8,9,11,13}
Winsorized data = {6,6,7,8,9,11,11}
Winsorized Mean =
Sum of Winsorized data / n = 58/7
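A Python sketch of this procedure. The slide does not say how the ¼-th and ¾-th positions are rounded, so the code assumes the 1-based positions ceil(n/4) and ceil(3n/4). That choice is an assumption, made because it reproduces the example (2nd and 6th of 7 values). The function name is hypothetical.

```python
import math

def winsorized_mean(data):
    """Clamp the data to the 1/4-th and 3/4-th ordered values, then average.

    Assumption: the quartile positions are ceil(n/4) and ceil(3n/4), 1-based.
    """
    s = sorted(data)
    n = len(s)
    low = s[math.ceil(n / 4) - 1]
    high = s[math.ceil(3 * n / 4) - 1]
    clamped = [min(max(x, low), high) for x in s]
    return clamped, sum(clamped) / n

S = [5, 6, 7, 8, 9, 11, 13]
winsorized, w_mean = winsorized_mean(S)
print(winsorized)   # [6, 6, 7, 8, 9, 11, 11]
print(w_mean)       # 58/7, about 8.29
```

Unlike a trimmed mean, the extreme values are replaced rather than dropped, so the divisor stays n.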
Bivariate Data and Scatter Diagram
• Scatter Diagram (Plot) for Multivariate Data
– Things to consider
• Density: No. of data in a unit volume
• Relation between variables:
– Regression Analysis
– Correlations between variables
Covariance and Correlation Coefficient
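• Covariance and Correlation Coefficient
$c_{xy} = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n(n-1)}$
$r_{xy} = \frac{c_{xy}}{s_x s_y}$: normalized by standard deviations
– Properties
$0 \le |r_{xy}| \le 1$ (that is, $-1 \le r_{xy} \le 1$)
$r_{xy} > 0$: positive correlation
$r_{xy} < 0$: negative correlation
$r_{xy} = 0$: uncorrelated
[Figure: scatter plot of y against x]
• Factor Analysis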
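The covariance and correlation formulas translate almost directly into Python. A minimal sketch with made-up data (a perfectly linear pair, chosen so the expected correlation is obvious); the function names are hypothetical.

```python
import math

def covariance(xs, ys):
    """c_xy = (n*sum(x*y) - sum(x)*sum(y)) / (n*(n-1))."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum(xs) * sum(ys)) / (n * (n - 1))

def correlation(xs, ys):
    """r_xy = c_xy / (s_x * s_y); note that s_x^2 = c_xx."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: y = 2x exactly, then the same values in reverse order.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = list(reversed(y_up))

print(covariance(x, y_up), correlation(x, y_up))
print(covariance(x, y_down), correlation(x, y_down))
```

An exactly increasing linear relation gives a correlation of +1, and the reversed data give -1. Real data fall somewhere in between.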
Covariance and Correlation Coefficient
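• Something to think about
– 2-D or higher-dimensional (cumulative) histograms
• Linear Regression
Find the linear equation
$y = \alpha x + \beta$
that minimizes $\varepsilon = \frac{1}{n-1} \sum_i (y_i - \alpha x_i - \beta)^2$.
Solution:
$\frac{\partial \varepsilon}{\partial \alpha} = \frac{-2}{n-1} \sum_i (y_i - \alpha x_i - \beta) x_i = 0$ gives $\alpha \sum_i x_i^2 + \beta \sum_i x_i = \sum_i x_i y_i$
$\frac{\partial \varepsilon}{\partial \beta} = \frac{-2}{n-1} \sum_i (y_i - \alpha x_i - \beta) = 0$ gives $\alpha \sum_i x_i + n\beta = \sum_i y_i$.
$\alpha = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n(n-1) s_x^2}$
$\beta = \bar{y} - \alpha \bar{x} = \frac{1}{n} \sum_i y_i - \alpha \frac{1}{n} \sum_i x_i$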
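The closed-form solution of the two normal equations can be sketched in Python as follows. The function name `fit_line` and both data sets are hypothetical. The first set lies exactly on y = 2x + 1, so the fit can be checked by eye.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope*x + intercept from the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n   # y_bar - slope * x_bar
    return slope, intercept

# Points exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)   # 2.0 1.0

# Slightly noisy, made-up measurements around a similar line.
print(fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]))
```

The denominator equals n(n-1) times the sample variance of x, so the slope is undefined when all x values are equal.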
Uniform Random Number
• Examples
– Histogram of a fair die or coin
[Figure: histogram of fair-die outcomes 1 to 6; vertical axis: frequency, 0 to 12000]
– Note:
• Cumulative histogram of the fair die
• Law of Large Numbers
• Random numbers with any distribution can be generated from
uniform random numbers.
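A fair die can be simulated from uniform random numbers along these lines. This is a sketch in Python. The seed, the number of rolls (60000) and the helper name `roll_die` are arbitrary illustrative choices, and plotting is left out.

```python
import random
from collections import Counter
from itertools import accumulate

def roll_die(rng):
    """Map a uniform random number u in [0, 1) to a face 1..6."""
    return int(rng.random() * 6) + 1

rng = random.Random(0)   # fixed seed so the run is repeatable (arbitrary choice)
n_rolls = 60000

counts = Counter(roll_die(rng) for _ in range(n_rolls))
freq = [counts[face] for face in range(1, 7)]   # histogram heights
rel = [f / n_rolls for f in freq]               # relative frequencies
cum_rel = list(accumulate(rel))                 # cumulative relative frequencies

print(freq)
print(rel)
print(cum_rel)
```

As the number of rolls grows, each relative frequency settles near 1/6, which is the Law of Large Numbers at work. Splitting [0, 1) into six equal intervals is the simplest case of turning a uniform number into another distribution.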
Homework #1
• MATLAB Installation
• Calculation of
– Sample Mean, Variance and Standard Deviation
– Linear Regression
– Covariance and Correlation Coefficients
• Program
– Generate uniform random numbers
– Make a fair die
– Run the experiment and count the frequencies
– Draw the histogram and cumulative histogram