Basic Statistics PowerPoint, Dae Joong Kim

Download Report

Transcript Basic Statistics PowerPoint, Dae Joong Kim

2011 Boot Camp
Basic Statistics:
Conceptual understanding and application
Dae Joong Kim
John Glenn School of Public Affairs
Ohio State University
[email protected]
Introduction
Quantitative Analysis Courses
Boot Camp: Basic statistics
Online: http://glennschool.osu.edu/bootcamp/index.html
820: Data Analysis for Public Policy and Management
(Autumn quarter)
822: Multivariate Data Analysis for Public Policy and
Management
(Spring quarter)
2
6
Statistics = probabilistic, random, or stochastic analysis → errors in equations or models; a useful tool e.g., Y=b0+b1X1+b2X2+e
Mathematics = deterministic analysis → no errors in equations or models; mathematical language e.g., Y=b0+b1X1+b2X2
Statistical significance: whether relevant estimates are included in statistical confidence interval (C.I) (99%, 95% or 91%)
Substantial significance: whether relevant estimates have expected sign (+ or -)and magnitude
2011 Boot Camp
weight height
145
170
170
190
155
172
122
180
167
187
160
174
143
174
142
166
139
164
165
182
student1
student2
student3
student4
student5
student6
student7
student8
studnet9
Student10
 Analysis:
Mean of weight:
145 + 170 + 155 + ⋯ + 165
= 150.8
10
(145+155)
Median of weight:
2 = 150
122 139 142 143 145 155 160 165 167 170
Correlation between weight and
height:
170
w/o outlier
160
 Descriptive statistics analysis
(Data distribution analysis)
 Mean (average; expected utility)
 Variance
KEY CONDITIONS
 Standard Deviation
 Normal distribution
 Frequency and percentage, etc.
 Central limit theorem
 Inferential statistics analysis
Correlation (r): just relationship between variables (no direction;
(Relationship analysis between variables:
symmetric relationship)
Explorative analysis or hypothesis testing )
corr(weight, height),
-1 ≤ r ≤ 1
 Correlation (positive, negative, or nothing)
Regression: independent relationship of more than one variables with a
variable (direction: asymmetric relationship)
 Mean difference analysis
Weight = height + error, 0 ≤ r2 ≤ 1
 t-test for two groups
Causation: a correlation or regression is not same as causation if it does
 Analysis of variance (ANOVA) for multi-groups
not satisfy 1) time order between variables (cause-effect), 2) no other
 Regression (independent variable and dependent variable) variables between them, and 3) direction change at the same time.
 Single regression model
 Linear regression model (OLS)
BASIC ASSUMPTIONS
 Multiple regression model  Non-linear regression model
 Linearity; Normality:
Homoscedasticity;
Interpretation of Data Output
Independency;
 Interpret data outputs based on your background knowledge and
Statistical Hypothesis Testing
experience
Null hypothesis (H0):
(Different researchers can interpret the same data in different ways)
hypothesis that researchers try to disprove
 Support your argument, or your theoretical model based on your
Alternative or research) hypothesis (Ha) :
hypothesis that researchers expect to support their models
interpretation
variables
150
Mean: arithmetic average of a set of number
Median: the middle obs in a group of data when
the data are ranked in order of magnitude
Mode: the most common value in any distribution
 Sampling:
Population: JGS master students
Sample: 10 students
observations
Analysis of Data
 Sample
Estimate; infer
 Population
 Research question:
Relation between weight and height
weight
 Surveys
- questionnaire
- interview
 Experiments
Sampling
Example
w/ outlier
140
Collection of Data
Observations (individuals or cases)
Discrete variables
(Nominal (e.g., sex (male or female),
Variables = observations’
Ordinal scale(e.g., economic status
attributes
(low, Middle, high))
Continuous variables
(Interval scale (e.g., income),
Ratio scale (e.g., height; weight))
KEY CONDITIONS
 Randomness
Level of measurement
 Representativeness
(scales of measure)
130
 The study of the collection, analysis,
and interpretation of data related to
your research questions or models
Data
120
BASIC STATISTICS
165
170
175
180
185
190
height
Regression
outlier
Statistics?
Definition
The study of the collection,
analysis, and
interpretation of DATA related to
your research (questions or models)
Research question:
e.g., Is there any relationship of weight with height?
questions or models
4
6
Statistics?
Definition
Statistics = probabilistic, random, or stochastic analysis
→ errors in equations or models; a useful tool
e.g., Y=b0+b1X1+b2X2 + e
Mathematics = deterministic analysis → no errors in
equations or models; mathematical language
e.g., Y=b0+b1X1+b2X2
5
6
DATA?
Definition
Data refers to qualitative (e.g., female/male) or
quantitative attributes of a variable or set of
variables.
the results of measurements and can be the
basis of graphs, images, or observations of a
set of variables.
Raw Data(=unprocessed data) refers to a collection of
numbers and characters.
6
6
DATA?
Example: Raw Data
7
6
DATA?
Purpose
To get necessary information and knowledge
Data
interpretation
Information
Discussion; agreement
Knowledge
“Data” is not “information” unless it is interpreted
8
6
DATA?
Structure
Observations (=individuals or cases)
Data
Variables = observations’ attributes
e.g., Raw Data
Discrete variables
1. Nominal
e.g., sex (male or female)
2. Ordinal scale
e.g., economic status (low, Middle, high)
Continuous variables
3. Interval scale
e.g., income ($100,000)
4. Ratio scale
e.g., height; weight
Level of measurement
(scales of measure)
9
6
Collection of Data
Definition
The selection of a SAMPLE (a subset of individuals)
from within a POPULATION to yield some
Information/knowledge about the whole population,
especially for the purposes of making predictions based
on statistical inference
*Population: all people or items with the characteristic
that one wishes to understand
10
6
Collection of Data
Structure
 Randomness
 Representativeness
 Surveys (Observation)
- Questionnaire
Paper
Web, etc
- Interview
Face-to-face
Phone,etc etc
Sampling
Population
Sample
 Experiments
Control group vs.
experimental group
Inference
(estimation; prediction)
11
6
Collection of Data
Survey or Interview Questionnaires
1. Nominal (=categorical or dummy) question
e.g., your gender? Male___
Female ______
2. Ordinal-scale question
e.g., How much are you satisfied with your annual salary?
a. very high b. high c. neutral d. low
e. very low
3. Interval-scale question
e.g., How is your annual salary?
a. below 20,000
b. 20,000 – 50,000 c. 50,000 – 70,000 d. above 70,000
4. Ratio-scale question
e.g., What is your height? ________
12
6
Collection of Data
Randomness and Representativeness
The most important conditions to secure reliable sampling,
or to eliminate bias
Randomness:
equal chance of selection
(e.g., National lottery)
Representativeness:
the selection of individuals which are representative of
a larger population
13
6
Collection of Data
Randomness and Representativeness
Low bias and high precision
Quality of Data
14
6
Collection of Data
Normal Distribution
Symmetric distribution of values around the mean of a variable
(Bell-shape distribution)
s.d (s or σ) = 40
s.d (s or σ) = 24
s.d (s or σ) = 19
Mean (𝑋 or μ)=30
Mean (𝑋 or μ)=70)
Mean (𝑋 or μ)=10
15
Collection of Data
Normal Distribution (why important?)
1. Distributions of most variables tend to be normal, or
they are usually quite close to normal distribution
2. It is easy for mathematical statisticians to work with.
This means that many kinds of statistical tests can be
derived for normal distributions.
3. If the mean and standard deviation of a normal
distribution are known, it is easy to convert back and forth
from raw scores to percentiles.
16
Collection of Data
Standard Normal Distribution
N ~ (0, σ2)
Standard normal
distribution is called “Z
distribution”
<probabilistic distribution>
Z-distribution (n≥30)
cf. t-distribution (n<30)
17
Collection of Data
Z distribution table
s.d
t distribution table
t=
s.e: how likely the mean yo
estimating is true mean
e.g., Z =1.13=(1.1 + 0.03)
e.g., t =1.26 (df=9)
87%
87% take more than 111.3 min when mean time on a review is 100 mins, and s.d is 10 mins.
18
Collection of Data
Central Limit Theorem
A foundational concept in statistical inference which states that
if a sampling distribution is made up of samples containing
more than 30 cases (each), the sample means will be
normally distributed
19
6
Collection of Data
Normal distribution: Mean, Median, Mode
Mean: arithmetic average of a set of number
Median: the middle observation in a group of data when the data are ranked in order of
magnitude
Mode: the most common value in any distribution
20
6
Collection of Data
Skewedness
Left-tail is longer
Right-tail is longer
Means are distorted by extreme values, or outliers
1.
Using median instead of mean
2.
If necessary, transform to normality, especially in regression analysis
21
6
Analysis of Data
Purpose
A step to find “a pattern of data” to get necessary
information and knowledge
22
6
Analysis of Data
Type
Descriptive (statistical) analysis
Numerical information (such as mean, median,
standard deviation) that summarize and interpret
some of the properties of a set of data (sample) but
do not infer the properties of the population from which
the sample was drawn.
Inferential (statistical) analysis
Deducing (inferring) the properties of a population
from the analysis of the properties of a data sample
drawn from it
23
6
Analysis of Data
Descriptive Analysis
Data distribution analysis:
It tells us what values the variable takes and how often each value occur
 Mean (𝑿 (sample); μ (population))
- Arithmetic average or expected value of a variable (χ)
𝑛
𝑖=1 𝑥𝑖
𝑛
(n = number of observation)
 Variance (s2 (sample); σ2 (population))
- The average of the squared differences from the mean
𝑛
2
𝑖=1(𝑥𝑖 −𝑥)
𝑛−1
𝑛
2
𝑖=1(𝑥𝑖 −𝑥)
𝑛−1
 Standard Deviation (s (sample); σ (population))
- A measure of dispersion, or variation, the square root of variance
24
6
Analysis of Data
Descriptive Analysis
 Range: difference between maximum value and minimum value
 Min: the lowest, or minimum value in variable
 Max: the highest, or maximum value in variable
 Q1: the first (or 25th) quartile
 Q2: the third (or 75th) quartile
Min
1
2
3
4
50th
Mean or Mode
25th
5
6
7
8
9
Max
10 11 12 13
25
6
Analysis of Data
Descriptive Analysis
 Frequency distribution
- A table that shows a body of your data grouped according
to numerical values
Example:
26
6
Analysis of Data
Descriptive Analysis
Mean: arithmetic average of a set of number
Median: the middle observation in a group of data when the
data are ranked in order of magnitude
Mode: the most common value in any distribution
 Height
Mean:
170+190+172+180+187+174+174+166+164+182
10
Median:
174+174
2
= 𝟏𝟕𝟓.9
=174
164 166 170 172 174 174 180182 187 190
Mode: 174
Variance:
(170−175.9)2 +(190−175.9)2 + ∙ ∙ ∙ +(164−175.9)2 +(182−175.9)2
(10−1)
=74.77
Standard deviation: 74.77 = 8.65
27
Analysis of Data
Descriptive Analysis: Using “Stata”
28
Analysis of Data
Inferential Analysis
Relationship analysis between variables:
(Explorative analysis or hypothesis testing )
Main analysis: Mean difference analysis (t-test; ANOVA) and
Relationship analysis (correlation; regression), etc.
 Mean difference analysis
 t-test for two groups
 Analysis of variance (ANOVA) for multi-groups
29
6
Analysis of Data
Inferential Analysis: Mean Diff
Example. t-test
height difference between male and female
male = 0
female =1
30
6
Analysis of Data
Inferential Analysis
 Relationship analysis
 Correlation
- Correlation means linear association between two variables
- Three types of correlation
X2
X2
X2
X1
positive
X1
zero
X1
negative
31
6
Analysis of Data
Inferential Analysis
 Regression (independent and dependent relationships among variables)
1. Number of independent variable
 Single regression model:
Association of one independent variable with one dependent variable
e.g., Y = β0+β1X1+e
where Y is dependent var, X is independent var, e is error, β0 is intercept, and β1 is slope of X1.
 Multiple regression model:
Association of more than two independent variables with one dependent variable
e.g., Y = β0+β1X1+β2X2+e
2. Shape of regression line
 Linear regression model (OLS)
 Non-linear regression model (MLE or GLS)
Y
Y
X
6
X
32
Analysis of Data
Correlation (r): just relationship between variables (no direction;
symmetric relationship)
corr(weight, height),
-1 ≤ r ≤ 1
Regression: independent relationship of more than one variables with
a variable (direction: asymmetric relationship)
Weight(Y) = β0+β1*height(X1) + error, 0 ≤ r2 ≤ 1
r2 is the fraction of the sample variance of weight (Y) explained by (or
predicted by) height (X1).
Causation: a correlation or regression is not same as causation if it
does not satisfy 1) time order between variables (causeeffect), 2) no other variables between them, and 3) direction
change at the same time.
33
6
Analysis of Data
150
140
without outlier
130
weight
160
170
Correlation ( r) btwn weight and height:
with outlier
120
(outlier)
Regression (r2) btwn height and weight:
165
170
175
180
185
190
height
r2 tells us that 22.78% of variance in weight is
explained by height
D.V
I.V
Slope (β1)
6
Intercept (β0)
34
Analysis of Data
Hypothesis Testing
Null hypothesis (H0):
hypothesis that researchers try to disprove
Alternative or research hypothesis (Ha) :
hypothesis that researchers expect to support their models
35
6
Analysis of Data
Hypothesis Testing: Example
H0: Male is not taller than female
Ha: Male is taller than female
Pr(T>t)=0.0076
<
0.05
We can accept the hypothesis that male’s height is less than female’s height because the
difference of height between female and male is statistically significant at 5% signficnace
level.
6
36
Interpretation of Data Outputs
 Interpretation of data outputs based on your background
knowledge and experience is the last step of statistics in social
science.
- Different researchers can interpret the same data in different ways
 Support your argument, or your theoretical model based on
your interpretation
37
6
Interpretation of Data Outputs
 Interpretation of data outputs based on your background
knowledge and experience is the last step of statistics in social
science.
- Different researchers can interpret the same data in different ways
 Support your argument, or your theoretical model based on
your interpretation
38
6
Interpretation of Data Outputs
Statistical significance:
whether relevant estimates are included in a statistical
confidence interval (C.I) (99%, 95% or 90%), or at a significant
level (α=0.01(t value=2.58), 0.05 (t value=1.96) or 0.1
(t value=1.64)
Substantial significance:
whether relevant estimates have expected
sign (+ or -)and magnitude
39
6
Interpretation of Data Outputs
• Not statistically significant at 5% significance level, but
significant at 1% level; and its sign is positive
• If we assume this one is statistically significant, we can interpret
that for one center meter increase of height, weight increases by
.98 pounds
40
6
Practice
41
6
Practice: Questions
Q1. n (number of observations)
Q2.
𝑛
𝑖=1 𝑋𝑖
(sum of Xs)
Q3. 𝑋 (mean)
Q4. Median
Mode
Q5. Five number summery:
Min (lowest value)
Q1 (25th quartile value)
M (median)
Q3 (75th quartile value)
Max (highest value)
Q6. s2 (variance)
Q7. s (standard deviation)
42
6
Practice: Answers
43
6
Practice: Answers
44
6
Practice: Extra Qs
45
6