Statistical Forecasting Methods - Department of Mathematics, National Taiwan University

Multivariate Analysis
Instructor: 陳宏
Department of Mathematics, National Taiwan University
Thursdays 9:10 to 12:00, Room A211
[email protected]
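Course Content
•Basic probability, the language of statistics, and its tools (about 2 weeks of lectures)
– Important probability distributions
– Simulating random variables
– Point estimation, confidence intervals, hypothesis testing
•Linear models (about 7 weeks of lectures)
– Linear regression, logistic regression
– Analysis of variance
– Contingency table analysis
•Multivariate analysis (about 8 weeks of lectures)
– Principal Component Analysis
– Factor Analysis
– Discriminant Analysis
– Cluster Analysis
– Canonical Correlation Analysis
• Reference books:
– To be determined
• Programming language:
– R (available online)
– R has a home page at http://www.r-project.org/
– Download
•Grading:
– Midterm exam (30%), projects (70%)
Outline
•Overview
– Exploratory Data Analysis: Decision Making
– Data Mining
– Data Collection: Sampling and Questionnaires
•Statistical software
– R Software
•Basic probability, the language of statistics, and its tools
– Probability and Random Variables
– Variance
•Linear models
– Association
– IntroRegression
– MultipleRegression
– DAonREgression
Outline (continued)
•Multivariate analysis
– Principal Component Analysis
– Factor Analysis
– Discriminant Analysis
– Cluster Analysis
– Canonical Correlation Analysis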
Statistics for Decision Making
•Describing Sets of Data
– Objective: Introduce numerical methods and graphical displays to
summarize data sets.
– Graphical and numerical tools
• for examining the distribution of a single variable,
• for comparing several distributions, and
• for investigating changes over time.
•Sampling and Statistical Inference
– Objective: Provide methods to infer about a population based on a
sample of observations drawn from that population
•Forecasting with Distinguishable Data
– Objective: Introduce the basic concepts of forecasting to motivate a
regression model.
– Method for studying relationships among several variables.
•Regression Coefficients and Forecasts
– Objective: Understand regression coefficients and how to use them for
forecasting
Statistics for Decision Making
•Measures of Goodness of Fit and Residual Analysis
– Objective: Introduce a few statistics that measure how well a
regression model fits the data and show how to use residual analysis
to detect inadequacies of a regression model
•Developing a Regression Model
– Objective: Demonstrate how to develop a useful regression model
through
•Selection of the Dependent Variable
•Selection of the Independent Variables
•Determining the Nature of Relationships
Sampling and Statistical Inference
•Objective: Provide methods to infer about a population based
on a sample of observations drawn from that population.
•Inference from a Sample
•Statistical Estimation
•From Margin of Error to Confidence Interval
•Test of Significance
Inference from a Sample
•The sample provides useful information, but the information
is imperfect.
– Samples are taken when it is impossible, impractical, or too expensive
to obtain complete data on the relevant population.
•EX. Suppose you asked 100 potential customers how
much they will spend on a proposed new product next year.
– From the 100 responses you obtained a sample average of $250. You
could make the following inferences:
• My best estimate of average sales per potential customer is $250.
• Average sales per potential customer will be between $210 and $290 with
95% confidence.
• Average sales per potential customer will be greater than the break-even
amount of $210 at a 2.5% level of significance.
•Law of Large Numbers:
– Draw independent observations at random from any population with finite
mean μ.
– As the number of observations drawn increases, the mean of the
observed values eventually approaches the mean μ of the population
as closely as you specify and then stays that close.
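The law of large numbers can be watched in a small simulation. The sketch below is only illustrative: the population (an exponential distribution with mean 250) and the seed are assumptions chosen for the demonstration, not part of the course material.

```python
import random

def running_means(draws):
    """Return the mean of the first k observations for k = 1, 2, ..."""
    means, total = [], 0.0
    for k, x in enumerate(draws, start=1):
        total += x
        means.append(total / k)
    return means

rng = random.Random(0)   # fixed seed so the run is reproducible
mu = 250.0               # hypothetical population mean (assumption)
draws = [rng.expovariate(1 / mu) for _ in range(100_000)]
means = running_means(draws)

# The running mean settles near mu as the number of observations grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(means[n - 1], 2))
```

Early running means can be far from 250; the later ones stay close to it.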
Sampling variability
•Parameter: p=the proportion of the adult population in the
US (~190 million) that find clothes shopping frustrating.
•Statistic: 66% or 1650 out of 2500 adults.
•Sampling variability: The value of a statistic varies in repeated
random sampling.
•Answer to “What would happen if we took many samples?”
– Take a large number of samples from the same
population.
– Calculate the sample proportion p̂ for each sample.
– Make a histogram of the values of p̂.
– Examine the distribution displayed in the histogram.
•We can imitate the chance behavior of many samples by using
random digits or a computer (simulation).
Sampling variability
•The sampling distribution of a statistic is the distribution of
values taken by the statistic in all possible samples of the
same size from the same population.
•Can be either
– approximated by simulation or
– obtained exactly by probability theory in statistics.
[Figure: histogram of p̂ from 1000 SRSs of size 100 when p=0.6]
[Figure: histograms of p̂ from 1000 SRSs of size 100 and of size 2500 when p=0.6]
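The simulation behind such histograms can be sketched as follows. As a simplifying assumption, each sample is drawn as independent yes/no responses with success probability p, which approximates an SRS when the population is much larger than the sample; the function name and seed are illustrative.

```python
import random
import statistics

def sample_proportions(p, n, reps, rng):
    """Simulate `reps` samples of size n and return each sample proportion."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(1)
spread = {}
for n in (100, 2500):
    phats = sample_proportions(0.6, n, 1000, rng)
    spread[n] = statistics.stdev(phats)
    # Center and spread of the simulated sampling distribution.
    print(n, round(statistics.mean(phats), 3), round(spread[n], 4))
```

Both simulated distributions center near 0.6, and the one for n=2500 is much narrower.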
Bias and variance
•A statistic is unbiased if the mean of its sampling distribution
is equal to the true value of the parameter being estimated (no favoritism).
•The variability of a statistic is described by the spread of its
sampling distribution.
– 95% of the sample proportions will lie in the range 0.6 ± 0.1 (n=100) or
0.6 ± 0.02 (n=2500).
– Larger samples have smaller spreads.
•As long as the population is much larger than the sample, the
spread of the sampling distribution for a sample of fixed size
n is approximately the same for any population size.
– An SRS of size 2500 from 270 million US residents gives
results as precise as an SRS of size 2500 from the 740,000
inhabitants of San Francisco!
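The ranges quoted above follow from the standard deviation of a sample proportion, sqrt(p(1-p)/n). The sketch below also applies the usual finite population correction for sampling without replacement, sqrt((N-n)/(N-1)), to show how little the population size N matters; the function name is an assumption made for illustration.

```python
import math

def sd_phat(p, n, N=None):
    """SD of a sample proportion; applies the finite population
    correction sqrt((N - n) / (N - 1)) when a population size N is given."""
    sd = math.sqrt(p * (1 - p) / n)
    if N is not None:
        sd *= math.sqrt((N - n) / (N - 1))
    return sd

# Two-SD half-widths for the two sample sizes (about 0.1 and 0.02).
print(round(2 * sd_phat(0.6, 100), 3), round(2 * sd_phat(0.6, 2500), 3))

# Same sample size, very different population sizes.
for N in (270_000_000, 740_000):
    print(N, round(2 * sd_phat(0.6, 2500, N), 5))
```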
Why randomize?
• The act of randomizing guarantees that the results of analyzing our
data are subject to the laws of probability.
– Randomization removes bias.
– Replication (bigger sample) reduces variance.
– Better answer “What would happen if the sample or the experiment
were repeated many times?”
•Caution: the sampling distribution does not reflect bias due
to under-coverage, non-response, lack of realism, etc.
Presidential Election and Poll
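Background: the 1936 U.S. presidential election
•President Franklin Roosevelt was seeking re-election; Kansas Governor Landon was the
Republican presidential candidate.
•The U.S. economy was gradually recovering from the Great Depression.
– Nine million people were unemployed, and real income had fallen by one third between 1929 and 1933.
– Governor Landon campaigned on “small government.” His slogan was “The spender must go.”
– President Roosevelt campaigned on “expanding domestic demand” (deficit financing). His slogan was
“Balance the budget of the American people first.”
•Claim 1: Most observers expected President Roosevelt to win by a wide margin.
•Claim 2: The Literary Digest magazine predicted that Landon would win the election 57% to 43%.
– This figure was based on a poll of 2.4 million people.
– Since 1916, the magazine's forecasting method had called every election correctly.
•Election result: Roosevelt won 62% to 38%. Why?
•The work of an emerging competitor, Gallup:
– Drawing 3,000 people from the Literary Digest's sample of 2.4 million, Gallup
predicted that the Digest would show Landon winning 56% to 44%.
– From his own sample of 50,000 people, Gallup predicted that Roosevelt would win
56% to 44%.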
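Where did the Digest go wrong?
Sampling method: 10 million questionnaires were mailed and 2.4 million were returned,
but the recipients were chosen from telephone books and club membership lists.
– At the time there were only 11 million residential telephones, while 9 million people were unemployed.
Possible sources of the problem:
•Selection bias: the Digest's sample contained too many wealthy people, and that year
the voting preferences of rich and poor differed sharply.
•Non-response bias: the response rate was low.
– In Chicago, for example, questionnaires were sent to one third of the registered voters and
about 20% were returned. More than half of the respondents said they would vote for Landon,
yet Roosevelt received two thirds of the vote.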
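Why is simple random sampling a reasonable sampling method?
•Consider drawing 16 hospitals to estimate the average number of discharged patients across 393 hospitals.
– There are about 10^28 different possible samples (393 choose 16).
– By the central limit theorem, the distribution of the resulting sample averages looks like a bell-shaped
curve centered at the average over all hospitals, and most 16-hospital sample averages lie
not far from that center (law of large numbers).
For a more trustworthy sampling method, the sample should be selected by a chance
mechanism.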
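The number of distinct samples of 16 hospitals out of 393 can be computed directly, and the behavior of the sample averages can be imitated by simulation. In the sketch below the discharge counts are made-up numbers generated only for illustration; they are not the hospital data the slide refers to.

```python
import math
import random
import statistics

# Number of distinct samples of 16 hospitals chosen from 393.
print(f"{math.comb(393, 16):.3e}")

rng = random.Random(2)
# Hypothetical discharge counts for 393 hospitals (not real data).
population = [rng.randint(50, 2000) for _ in range(393)]
pop_mean = statistics.mean(population)

# Averages from 5000 simple random samples of 16 hospitals each.
means = [statistics.mean(rng.sample(population, 16)) for _ in range(5000)]
print(round(pop_mean, 1), round(statistics.mean(means), 1),
      round(statistics.stdev(means), 1))
```

The sample averages center on the population average, and their spread is much smaller than the spread of the individual hospitals.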
Statistical Estimation
•A parameter is a number that describes the population.
– Its value is fixed but unknown.
•A statistic is a number that describes a sample.
–Its value is known for a sample, but it can change from sample to sample.
–We use a statistic to estimate an unknown parameter.
•Error of estimation is the difference between an estimate and the
estimated parameter.
–In case of estimating the population mean using the sample mean,
Error of Estimation = sample mean – population mean
•The distribution of Error of Estimation: Central Limit Theorem
–If the sample size is large, the error of estimation is approximately
normally distributed with mean zero and a standard deviation which can
be estimated by
Standard Error = sample standard deviation / (sample size)^{1/2}
•The Normal Distribution
–If X has the N(μ, σ²) distribution, then Z = (X − μ)/σ has the N(0, 1) distribution.
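A minimal sketch of the standard error and of standardization, using a sample standard deviation of $200 and a sample size of 100 from the customer survey example in these slides. The values mu = 250, sigma = 20, and x = 290 are illustrative choices.

```python
import math
from statistics import NormalDist

def standard_error(sample_sd, n):
    """Estimated SD of the error of estimation for a sample mean."""
    return sample_sd / math.sqrt(n)

se = standard_error(200, 100)
print(se)

# Standardizing: if X ~ N(mu, sigma^2), then Z = (X - mu) / sigma ~ N(0, 1),
# so both calculations below give the same probability.
mu, sigma, x = 250, 20, 290
z = (x - mu) / sigma
print(z, NormalDist(mu, sigma).cdf(x), NormalDist().cdf(z))
```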
The normal density
• The height of the normal density curve for the normal distribution
with mean μ and SD σ is given by:
\[ \phi(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \]
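The density formula, height = exp(-((x - μ)/σ)² / 2) / (σ √(2π)), translates directly into code. The sketch below checks a hand-written version against `statistics.NormalDist.pdf` from the Python standard library; the function name `normal_pdf` is an assumption.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the N(mu, sigma^2) density curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Peak of the standard normal curve, computed two ways.
print(normal_pdf(0, 0, 1), NormalDist(0, 1).pdf(0))
```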
•Why are normal distributions important?
• Good description for some distributions of real data. (e.g. test scores,
repeated measurements, characteristics of biological populations, etc.)
• Good approximations to the results of many kinds of chance outcomes.
(e.g. coin tossing).
• Many statistical inference procedures based on normal distributions
work well for other roughly symmetric distributions.
From Margin of Error to Confidence Interval
•What is the probability that the error of estimation exceeds
two standard errors?
– If we use two standard errors as the margin of error around our estimate,
what can we say about the resulting interval estimate?
•Confidence and Probability
– When reporting that a confidence interval for a population mean
extends from $210 to $290, it is tempting to slip into the language of
probability, and say there is only a 5% chance that the true mean of the
population is outside this interval.
– Such probabilistic interpretation is much more natural and appealing
than the rather convoluted interpretation above. But is it legitimate?
– Example:
• Suppose from a sample of 100 potential customers one market researcher
obtained a 95% confidence interval of ($190,$210) for the average amount
a potential customer will spend on a product next year.
• Another market researcher from a different sample of size 400 obtained a
95% confidence interval of ($215,$225).
• How do you reconcile these two results?
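The repeated-sampling meaning of "95% confidence" can be illustrated by simulation. The sketch below first computes the chance that a normally distributed error exceeds two standard errors, then counts how often an interval of the form mean ± 2 standard errors covers a known true mean. The normal population with mean 250 and SD 200 is an assumption made for the demonstration.

```python
import random
import statistics
from statistics import NormalDist

# Chance that a normally distributed error exceeds two standard errors.
tail = 2 * (1 - NormalDist().cdf(2))
print(round(tail, 4))

def interval(sample):
    """Sample mean plus or minus two standard errors."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 2 * se, m + 2 * se

rng = random.Random(3)
true_mean, true_sd, n, reps = 250, 200, 100, 2000  # hypothetical population
hits = 0
for _ in range(reps):
    lo, hi = interval([rng.gauss(true_mean, true_sd) for _ in range(n)])
    hits += lo <= true_mean <= hi
coverage = hits / reps
print(coverage)
```

The 95% describes the procedure: about 95 of every 100 intervals built this way cover the true mean, while any single interval either covers it or does not.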
Test of Significance
•Example 1: A market researcher asked a sample of 100
potential customers how much they plan to spend on a
product next year.
– The mean of the sample turned out to be $250 and the standard deviation was
$200.
– Is it likely that average sales per capita exceed a break-even level of $208?
• Example 2: Suppose a manager is trying to decide which of the two
new products, A or B, to introduce. Break-even sales per capita are
$208 for both A and B.
– Sample results are given in the following.
– Product A: sample size = 10,000, sample mean=211, sample SD= 100
– Product B: sample size = 100, sample mean=250, sample SD= 300
• Example 3: In a Business Week/Harris executive poll, senior
executives were asked: “Compared with the last 12 months, do you
think the rate of growth of the gross domestic product will go up,
go down, or stay the same for the next 12 months?”
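Examples 1 and 2 above can be examined with a one-sided z test against the break-even level of $208. This is a sketch under stated assumptions: it uses the large-sample normal approximation, and it takes the Example 1 sample mean to be $250, matching the sample average quoted earlier in the slides. The function name is illustrative.

```python
import math
from statistics import NormalDist

def one_sided_z(sample_mean, sample_sd, n, break_even):
    """z statistic and one-sided P-value for H0: mean <= break_even."""
    z = (sample_mean - break_even) / (sample_sd / math.sqrt(n))
    return z, 1 - NormalDist().cdf(z)

cases = {
    "Example 1": (250, 200, 100),     # assumes the sample mean is $250
    "Product A": (211, 100, 10_000),
    "Product B": (250, 300, 100),
}
results = {}
for name, (mean, sd, n) in cases.items():
    results[name] = one_sided_z(mean, sd, n, break_even=208)
    z, p = results[name]
    print(f"{name}: z = {z:.2f}, one-sided P = {p:.4f}")
```

Product A has the smaller sample mean but the stronger evidence of exceeding break-even, because its standard error is far smaller.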
Test for Independence
•Application on Business outlook
•Results of this poll are summarized below (Business Week,
1/09/95).
Outlook            Date of Survey
                   12/94    6/94   12/93   Total
Go Up                152     177     101     430
Go Down              104      72      36     212
Stay the Same        144     152     261     557
Not Sure               0       0       4       4
Total                400     401     402    1203
•Have the executives changed their outlook over time?
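One common way to approach this question is Pearson's chi-square test of independence. The sketch below computes the statistic by hand. As an assumption made here, the "Not Sure" row is left out because its expected counts are too small for the approximation to be trusted.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: Go Up, Go Down, Stay the Same; columns: 12/94, 6/94, 12/93.
outlook = [[152, 177, 101], [104, 72, 36], [144, 152, 261]]
stat, df = chi_square(outlook)
# Compare with 9.488, the 5% critical value of chi-square with 4 df.
print(round(stat, 1), df)
```

A statistic far above the critical value indicates that the distribution of outlooks differs across the three survey dates.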
Relations in categorical data
•Relationship between two or more categorical variables.
•Use counts (frequencies) or percent (relative frequencies) of
individuals that fall into various categories.
Two-way table
•A two-way table describes two categorical variables.
•Each horizontal row in the table describes individuals with one
level of the row variable.
•Each vertical column describes individuals with one level of
the column variable.
•EX: Years of school completed, by age (thousands of persons)
                                           Age Group
Education                       25 to 34   35 to 54   55 and over     Total
did not complete high school       5,325      9,152        16,035    30,512
completed high school             14,061     24,070        18,320    56,451
college, 1 to 3 years             11,659     19,926         9,662    41,247
college, 4 or more years          10,342     19,878         8,005    38,225
Total                             41,387     73,026        52,022   166,435
Marginal distributions
•Look at the distribution of each variable separately.
•The “Total” column lists the row totals; the “Total” row lists the
column totals.
•Row and column totals specify the marginal distributions of
each of the two categorical variables.
[Figure: the distribution of years of schooling completed among people age 25
years and over]
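The marginal distributions can be computed from the cell counts of the education-by-age table. This is a minimal sketch; the variable names are illustrative.

```python
# Counts in thousands; columns are ages 25 to 34, 35 to 54, 55 and over.
counts = {
    "did not complete high school": [5325, 9152, 16035],
    "completed high school": [14061, 24070, 18320],
    "college, 1 to 3 years": [11659, 19926, 9662],
    "college, 4 or more years": [10342, 19878, 8005],
}
grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of education: row totals over the grand total.
education_marginal = {k: sum(v) / grand_total for k, v in counts.items()}
# Marginal distribution of age: column totals over the grand total.
age_totals = [sum(col) for col in zip(*counts.values())]
age_marginal = [t / grand_total for t in age_totals]

for level, share in education_marginal.items():
    print(f"{level}: {share:.1%}")
print([f"{s:.1%}" for s in age_marginal])
```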
Describing relationships
•What percent of people aged 25 to 34 have completed 4 years
of college?
•What percent of people aged 35 to 54 have completed 4 years
of college?
•What percent of people aged 55 and over have completed 4
years of college?
•Conclusion?
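The three questions above ask for conditional percentages: within each age group, the share in the "college, 4 or more years" row. A short sketch using the counts (in thousands) from the two-way table:

```python
college_4plus = {"25 to 34": 10342, "35 to 54": 19878, "55 and over": 8005}
age_total = {"25 to 34": 41387, "35 to 54": 73026, "55 and over": 52022}

# Percent of each age group with 4 or more years of college.
percent = {age: 100 * college_4plus[age] / age_total[age] for age in age_total}
for age, pct in percent.items():
    print(f"{age}: {pct:.1f}%")
```

The percentages are similar for the two younger groups and clearly lower for people aged 55 and over.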
Conditional distribution of age group on
the education level
Three way table
• The table of outcome by hospital by patient
condition is a three-way table that reports
the frequencies of each combination of
levels of three categorical variables.
• We can aggregate a three-way table into a
two-way table.
• A variable being aggregated can become a
lurking variable.
NSF study on the salaries of new
women engineers
• The median salary of newly graduated
female engineers and scientists was 73% of
that for males.
• Field is a lurking variable. (life and social
sciences against physical and engineering)
Establishing causation
• The best (and only?) method of establishing
causation is to conduct a carefully designed
experiment in which the effects of possible
lurking variables are controlled.
• What other criteria can we use when we can’t do an
experiment?
“Smoking causes lung cancer”
• The association is strong.
• The association is consistent.
• Higher doses are associated with stronger
responses.
• The alleged cause precedes the effect in
time.
• The alleged cause is plausible.
Forecasting with Distinguishable Data
• Objective: Introduce the basic concepts of forecasting to motivate a
regression model.
• Forecasting with Indistinguishable Data:
– If the future value of the variable you would like to forecast is
indistinguishable from the sample values you collected, then you forecast
with indistinguishable data.
– Example 1: To help forecast the selling price of your house, you obtained
a sample of ten selling prices ($109,360, $137,980, $131,230, $130,230, $125,410,
$124,370, $139,030, $140,160, $144,220, $154,190).
• Forecasting when the Data are Distinguishable:
– When your sample contains additional information so that the sample values
are no longer indistinguishable from the future value you would like to
forecast, you forecast with distinguishable data.
– Example 2: Our sample also contains information on the square footage of
the ten houses: ($109,360, 1404), ($137,980, 1477), ($131,230, 1503),
($130,230, 1552), ($125,410, 1608), ($124,370, 1633), ($139,030, 1717),
($140,160, 1775), ($144,220, 1838), ($154,190, 1934).
Forecasting with Distinguishable Data
• Assume that your house has 1682 square feet of living area.
– Analysis 1: sample average of all ten houses = $133,618 (SD = $12,406)
– Analysis 2: Stratify the sample according to house size (square feet of living area).
Size Range   Sample Average   SD        Number of Observations
1400-1599    $127,200         $12,381   4
1600-1799    $132,243         $8,513    4
1800-1999    $149,205         $7,050    2
Then use $132,243 (instead of $133,618) to forecast the selling value.
– Does the cell standard deviation properly measure the forecast uncertainty?
– Is it possible to have a measure of overall efficacy of our partitioning the
sample into cells?
• Use the data more efficiently: The stratification method that we
used is unsatisfactory for two reasons. First, we have ignored data
on houses that are “less like” yours and kept only those “most like” yours.
Second, we have stratified the data somewhat arbitrarily.
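The overall and per-cell summaries of the ten houses can be reproduced with a few lines. This is a sketch; the 200-square-foot cells follow the size ranges in the slides, and the helper name `cell_summary` is an assumption.

```python
import statistics

# (selling price in dollars, living area in square feet) from Example 2
houses = [(109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
          (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
          (144220, 1838), (154190, 1934)]

def cell_summary(lo, hi):
    """Mean, sample SD, and count of prices for houses with lo <= size < hi."""
    prices = [p for p, s in houses if lo <= s < hi]
    return statistics.mean(prices), statistics.stdev(prices), len(prices)

all_prices = [p for p, _ in houses]
print(f"all: mean {statistics.mean(all_prices):,.1f}, "
      f"SD {statistics.stdev(all_prices):,.0f}")
for lo in (1400, 1600, 1800):
    mean, sd, n = cell_summary(lo, lo + 200)
    print(f"{lo}-{lo + 199}: mean {mean:,.1f}, SD {sd:,.0f}, n = {n}")
```

A house of 1682 square feet falls in the 1600-1799 cell, so only four of the ten observations feed the forecast.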
The question of causation
•Mother’s adult height vs daughter’s adult height.
•Amount of saccharin in a rat’s diet vs count of tumors in the
rat’s bladder.
•A student’s SAT score and the student’s first year GPA.
•Monthly flow of money into stock mutual funds vs monthly
rate of return for the stock market.
•The anesthetic used in surgery vs whether the patient
survives the surgery.
•The number of years of education a worker has vs the
worker’s income.
Explaining association
•Causation.
•Common response. (a lurking variable).
•Confounding: two variables are confounded when their
effects on a response variable are mixed together.
Data on the survival of patients after
surgery in hospital A and B
            Hospital A   Hospital B
Died                63           16
Survived          2037          784
Total             2100          800
•Hospital A loses 3% of patients while Hospital B
loses 2%.
Lurking variable...
Good condition
            Hospital A   Hospital B
Died                 6            8
Survived           594          592
Total              600          600
Bad condition
            Hospital A   Hospital B
Died                57            8
Survived          1443          192
Total             1500          200
• Death rates of 1% (Hospital A) vs 1.3% (Hospital B) for patients in good condition
• Death rates of 3.8% (Hospital A) vs 4% (Hospital B) for patients in bad condition
Simpson’s paradox
• How can A do better in each group, yet do
worse overall??
• An association or comparison that holds for
all of several groups can reverse direction
when the data are combined to form a single
group.
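The reversal can be checked directly from the hospital counts. The sketch below computes death rates within each patient condition and then after aggregating over condition; the data layout is an illustrative choice.

```python
# (died, total) for each hospital, split by patient condition
data = {
    "good": {"A": (6, 600), "B": (8, 600)},
    "bad": {"A": (57, 1500), "B": (8, 200)},
}

def death_rate(died, total):
    """Percent of patients who died."""
    return 100 * died / total

for condition, hospitals in data.items():
    rates = {h: round(death_rate(*dt), 1) for h, dt in hospitals.items()}
    print(condition, rates)

# Aggregate over condition: the comparison reverses.
overall = {}
for h in ("A", "B"):
    died = sum(data[c][h][0] for c in data)
    total = sum(data[c][h][1] for c in data)
    overall[h] = death_rate(died, total)
print("overall", overall)
```

Hospital A treats far more patients in bad condition (1500 of 2100) than Hospital B (200 of 800), which is why its overall death rate is higher even though it does better within each group.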
Regression Model
•Try to create a model that specifies the relationship between
selling price (dependent variable) and other variables
(independent or explanatory variables) that help you forecast
the selling price.
–It is reasonable to assume that as size goes up, selling price will go up on
average.
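One simple model of this kind is a straight line fitted by least squares. The sketch below fits price on living area for the ten houses and forecasts a 1682-square-foot house. The choice of a single-predictor linear model is an assumption for illustration, not necessarily the model developed later in the course.

```python
# (selling price in dollars, living area in square feet) from Example 2
houses = [(109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
          (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
          (144220, 1838), (154190, 1934)]
prices = [p for p, _ in houses]
sizes = [s for _, s in houses]

def least_squares(xs, ys):
    """Intercept and slope of the least-squares line y = a + b * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = least_squares(sizes, prices)
forecast = intercept + slope * 1682
print(round(intercept), round(slope, 2), round(forecast))
```

Unlike stratification, the fitted line uses all ten houses and does not depend on arbitrary cell boundaries.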
Regression Coefficients and Forecasts
• Objective: Understand regression coefficients and how to use
them for forecasting.
Measures of Goodness of Fit and Residual
Analysis
• Objective: Introduce a few statistics that measure how well a
regression model fits the data and show how to use residual
analysis to detect inadequacies of a regression model
Developing a Regression Model
•Objective: Demonstrate how to develop a useful regression
model through
– Selection of the Dependent Variable
– Selection of the Independent Variables
– Determining the Nature of Relationships