Multivariate Statistics with Data Analysis for Academic


Multivariate Statistical Data Analysis with Its Applications

Hua-Kai Chiou

Ph.D., Assistant Professor, Department of Statistics, NDMC

September 2005

Agenda

1. Introduction
2. Examining Your Data
3. Sampling & Estimation
4. Hypothesis & Testing
5. Multiple Regression Analysis
6. Logistic Regression
7. Multivariate Analysis of Variance
8. Principal Components Analysis
9. Factor Analysis
10. Cluster Analysis
11. Discriminant Analysis
12. Multidimensional Scaling
13. Canonical Correlation Analysis
14. Conjoint Analysis
15. Structural Equation Modeling

Introduction


Some Basic Concepts of MVA

• What is Multivariate Analysis (MVA)?
• Impact of the Computer Revolution
• Multivariate Analysis Defined
• Measurement Scales
• Types of Multivariate Techniques

Dependence technique – the objective is prediction of the dependent variable(s) by the independent variable(s), e.g., regression analysis.

Dependent variable – presumed effect of, or response to, a change in the independent variable(s).

Dummy variable – nonmetrically measured variable transformed into a metric variable by assigning 1 or 0 to a subject, depending on whether it possesses a particular characteristic.

Effect size – estimate of the degree to which the phenomenon being studied (e.g., correlation or difference in means) exists in the population.
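A dummy variable can be illustrated with a minimal sketch. The industry labels and the helper name `dummy_code` are hypothetical and chosen only for illustration.

```python
# Hypothetical nonmetric variable: industry type recorded for five firms.
industry = ["A", "B", "A", "A", "B"]


def dummy_code(values, category):
    """Return 1 where the subject possesses the characteristic, else 0."""
    return [1 if value == category else 0 for value in values]


is_industry_a = dummy_code(industry, "A")
print(is_industry_a)  # [1, 0, 1, 1, 0]
```

The resulting 0/1 column can then enter a technique that expects metric input.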

Indicator – single variable used in conjunction with one or more other variables to form a composite measure.

Interdependence technique – classification of statistical techniques in which the variables are not divided into dependent and independent sets (e.g., factor analysis).

Metric data – also called quantitative data, interval data, or ratio data, these measurements identify or describe subjects (or objects) not only on the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For example, a person's age and weight are metric data.

Multicollinearity – extent to which a variable can be explained by the other variables in the analysis. As multicollinearity increases, it complicates the interpretation of the variate, because it is more difficult to ascertain the effect of any single variable, owing to their interrelationships.

Nonmetric data – also called qualitative data.

Power – probability of correctly rejecting the null hypothesis when it is false, that is, correctly finding a hypothesized relationship when it exists. Determined as a function of (1) the statistical significance level (α) set by the researcher for a Type I error, (2) the sample size used in the analysis, and (3) the effect size being examined.
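The dependence of power on α, sample size, and effect size can be sketched for one simple case. The sketch assumes a one-sided, one-sample z-test with a standardized effect size (mean shift divided by a known standard deviation); this test and the function name `z_test_power` are illustrative choices, not something the slides specify.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test (assumed setting).

    effect_size is the true mean shift in standard-deviation units.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection cutoff under H0
    # Under the alternative, the test statistic is centered at effect_size * sqrt(n).
    return 1 - std_normal.cdf(critical - effect_size * n ** 0.5)


for n in (20, 50, 100):
    print(n, round(z_test_power(0.3, n), 3))
```

In this setting power rises with a larger sample, a larger effect size, or a less strict α.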


Practical significance – means of assessing multivariate analysis results based on their substantive findings rather than their statistical significance. Whereas statistical significance determines whether the result is attributable to chance, practical significance assesses whether the result is useful.

Reliability – extent to which a variable or set of variables is consistent in what it is intended to measure. Reliability relates to the consistency of the measure(s).

Validity – extent to which a measure or set of measures correctly represents the concept of study. Validity is concerned with how well the concept is defined by the measure(s).

Type I error – probability of incorrectly rejecting the null hypothesis.

Type II error – probability of incorrectly failing to reject the null hypothesis, that is, the chance of not finding a correlation or mean difference when it does exist.

Variate – linear combination of variables formed in the multivariate technique by deriving empirical weights applied to a set of variables specified by the researcher.
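Written out, a variate is a weighted sum. In the sketch below the symbols are assumptions for illustration: X_1 to X_n are the variables specified by the researcher and w_1 to w_n are the weights determined by the multivariate technique.

```latex
\text{Variate value} = w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n
```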


• The Relationship between Multivariate Dependence Methods

Analysis of Variance (ANOVA)
    Y1 = X1 + X2 + X3 + ... + Xn
    (metric) (nonmetric)

Multivariate Analysis of Variance (MANOVA)
    Y1 + Y2 + Y3 + ... + Yn = X1 + X2 + X3 + ... + Xn
    (metric) (nonmetric)

Canonical Correlation
    Y1 + Y2 + Y3 + ... + Yn = X1 + X2 + X3 + ... + Xn
    (metric, nonmetric) (metric, nonmetric)

Discriminant Analysis
    Y1 = X1 + X2 + X3 + ... + Xn
    (nonmetric) (metric)

Multiple Regression Analysis
    Y1 = X1 + X2 + X3 + ... + Xn
    (metric) (metric, nonmetric)

Conjoint Analysis
    Y1 = X1 + X2 + X3 + ... + Xn
    (metric, nonmetric) (nonmetric)

Structural Equation Modeling
    Y1 = X11 + X12 + X13 + ... + X1n
    Y2 = X21 + X22 + X23 + ... + X2n
    ...
    Ym = Xm1 + Xm2 + Xm3 + ... + Xmn
    (metric) (metric, nonmetric)

In each pair of labels, the first gives the measurement scale of the dependent variable(s) and the second that of the independent variable(s).

What type of relationship is being examined?

• Dependence: How many variables are being predicted?
  – Multiple relationships of dependent and independent variables: Structural equation modeling
  – Several dependent variables in a single relationship: What is the measurement scale of the dependent variable?
      Metric: What is the measurement scale of the predictor variable?
          Metric: Canonical correlation analysis
          Nonmetric: Multivariate analysis of variance (MANOVA)
      Nonmetric: Canonical correlation analysis with dummy variables
  – One dependent variable in a single relationship: What is the measurement scale of the dependent variable?
      Metric: Multiple regression; conjoint analysis
      Nonmetric: Multiple discriminant analysis; linear probability models
• Interdependence: Is the structure of relationships among:
  – Variables: Factor analysis
  – Cases/respondents: Cluster analysis
  – Objects: How are the attributes measured?
      Metric: Multidimensional scaling
      Nonmetric: Correspondence analysis

A Structured Approach to Multivariate Model Building

Stage 1: Define the research problem, objectives, and multivariate technique to be used
Stage 2: Develop the analysis plan
Stage 3: Evaluate the assumptions underlying the multivariate technique
Stage 4: Estimate the multivariate model and assess overall model fit
Stage 5: Interpret the variate(s)
Stage 6: Validate the multivariate model

Examining Your Data


HATCO Case

• Primary Database
  – This example investigates a business-to-business case from existing customers of HATCO.
  – The primary database consists of 100 observations on 14 separate variables.
• Three types of information were collected:
  – The perceptions of HATCO, 7 attributes (X1 – X7);
  – The actual purchase outcomes, 2 specific measures (X9, X10);
  – The characteristics of the purchasing companies, 5 characteristics (X8, X11 – X14).

Table 2.1 Description of Database Variables (Hair et al., 1998)

Variable  Description                 Variable Type  Rating Scale

Perceptions of HATCO
X1        Delivery Speed              Metric         0 – 10
X2        Price Level                 Metric         0 – 10
X3        Price Flexibility           Metric         0 – 10
X4        Manufacturer's Image        Metric         0 – 10
X5        Overall Service             Metric         0 – 10
X6        Salesforce Image            Metric         0 – 10
X7        Product Quality             Metric         0 – 10

Purchase Outcomes
X9        Usage Level                 Metric         100-point percentage
X10       Satisfaction Level          Metric         0 – 10

Purchaser Characteristics
X8        Size of Firm                Nonmetric      {0,1}
X11       Specification Buying        Nonmetric      {0,1}
X12       Structure of Procurement    Nonmetric      {0,1}
X13       Type of Industry            Nonmetric      {0,1}
X14       Type of Buying Situation    Nonmetric      {1,2,3}

Fig 2.1 Scatter Plot Matrix of Metric Variables (Hair et al., 1998)

Fig 2.2 Examples of Multivariate Graphical Displays (Hair et al., 1998)

Missing Data

• A missing data process is any systematic event external to the respondent (e.g. data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing values.

• The impact of missing data is detrimental not only through its potential "hidden" biases of the results but also in its practical impact on the sample size available for analysis.

• Understanding the missing data
  – Ignorable missing data
  – Remediable missing data
• Examining the pattern of missing data

Table 2.2 Summary Statistics of Pretest Data (Hair et al., 1998)

Table 2.3 Assessing the Randomness of Missing Data through Group Comparisons of Observations with Missing versus Valid Data (Hair et al., 1998)

Table 2.4 Assessing the Randomness of Missing Data through Dichotomized Variable Correlations and the Multivariate Test for Missing Completely at Random (MCAR) (Hair et al., 1998)

Table 2.5 Comparison of Correlations Obtained with All-Available (Pairwise), Complete Case (Listwise), and Mean Substitution Approaches (Hair et al., 1998)

Table 2.6 Results of the Regression and EM Imputation Methods (Hair et al., 1998)
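The three approaches compared in Table 2.5 can be sketched on a toy data set. The values below are invented for illustration (they are not the HATCO pretest data), `None` marks a missing value, and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical ratings on three variables; None marks a missing value.
data = {
    "x1": [4.0, None, 6.0, 5.0, 8.0, 7.0],
    "x2": [2.0, 3.0, None, 4.0, 6.0, 4.0],
    "x3": [1.0, 2.0, 2.0, None, 4.0, 3.0],
}


def pearson(a, b):
    """Pearson correlation of two equally long, complete lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    ss_a = sum((p - ma) ** 2 for p in a)
    ss_b = sum((q - mb) ** 2 for q in b)
    return cov / (ss_a * ss_b) ** 0.5


def corr_pairwise(data, a, b):
    """All-available approach: use every case observed on both a and b."""
    pairs = [(p, q) for p, q in zip(data[a], data[b])
             if p is not None and q is not None]
    return pearson([p for p, _ in pairs], [q for _, q in pairs])


def corr_listwise(data, a, b):
    """Complete-case approach: drop any case missing on any variable."""
    names = list(data)
    rows = [row for row in zip(*data.values()) if None not in row]
    ia, ib = names.index(a), names.index(b)
    return pearson([r[ia] for r in rows], [r[ib] for r in rows])


def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def corr_mean_substitution(data, a, b):
    return pearson(mean_substitute(data[a]), mean_substitute(data[b]))


for method in (corr_pairwise, corr_listwise, corr_mean_substitution):
    print(method.__name__, round(method(data, "x1", "x2"), 3))
```

The three correlations generally differ because each approach retains a different set of cases or values.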

Outliers

• Four classes of outliers:
  – Procedural errors
  – Extraordinary events that can be explained
  – Extraordinary observations that have no explanation
  – Observations that fall within the ordinary range of values on each variable but are unique in their combination of values across the variables
• Detecting outliers
  – Univariate detection
  – Bivariate detection
  – Multivariate detection

Outlier detection

• Univariate detection threshold:
  – For small samples, standardized variable values within ±2.5
  – For larger samples, standardized variable values within ±3 or ±4
• Bivariate detection threshold:
  – A confidence ellipse representing between 50 and 90 percent of the normal distribution
• Multivariate detection:
  – The Mahalanobis distance D²
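The univariate and multivariate checks above can be sketched in a few lines. The data are invented, the function names are hypothetical, and the D² computation is restricted to two variables so that the covariance matrix can be inverted in closed form; a general implementation would use a matrix library.

```python
from statistics import mean, stdev


def univariate_outliers(values, threshold=2.5):
    """Indices of cases whose standardized value lies beyond +/- threshold."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]


def mahalanobis_d2(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    result = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d' S^-1 d, using the closed-form inverse of the 2x2 matrix
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result


# Hypothetical ratings: the last case is extreme on the first variable.
x = [4.1, 4.8, 5.0, 5.2, 5.5, 4.9, 5.1, 4.7, 5.3, 5.0, 4.6, 9.5]
y = [3.9, 4.4, 5.2, 5.0, 5.8, 4.6, 5.3, 4.5, 5.1, 4.8, 4.9, 5.0]

print(univariate_outliers(x))  # [11]
d2 = mahalanobis_d2(x, y)
print(max(range(len(d2)), key=d2.__getitem__))  # index of the largest D²
```

Cases with the largest D² are the candidates to examine as multivariate outliers.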

Table 2.7 Identification of Univariate and Bivariate Outliers (Hair et al., 1998) 30

Fig 2.3 Graphical Identification of Bivariate Outliers (Hair et al., 1998) 31

Table 2.8 Identification of Multivariate Outliers (Hair et al., 1998) 32

Testing the Assumptions of Multivariate Analysis

• Graphical analyses of normality
  – Kurtosis refers to the peakedness or flatness of the distribution compared with the normal distribution.
  – Skewness appears in a normal probability plot as an arc, either above or below the diagonal.
• Statistical tests of normality

$$z_{\text{skewness}} = \frac{\text{skewness}}{\sqrt{6/N}}; \qquad z_{\text{kurtosis}} = \frac{\text{kurtosis}}{\sqrt{24/N}}$$

where N is the sample size.
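The two z statistics can be computed as follows. This is a minimal sketch under stated assumptions: skewness and kurtosis are the simple moment-based estimates (statistical packages often apply small-sample corrections), kurtosis is taken as excess kurtosis so that a normal distribution scores 0, and the scores and function names are invented for illustration.

```python
from statistics import mean


def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (assumed estimators)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def normality_z(values):
    """z statistics: skewness / sqrt(6/N) and kurtosis / sqrt(24/N)."""
    n = len(values)
    skew, kurt = skewness_and_kurtosis(values)
    return skew / (6 / n) ** 0.5, kurt / (24 / n) ** 0.5


# Hypothetical right-skewed scores.
scores = [1.2, 1.5, 1.7, 1.9, 2.0, 2.1, 2.4, 2.6, 3.0, 3.8, 5.5, 8.9]
z_skew, z_kurt = normality_z(scores)
print(round(z_skew, 2), round(z_kurt, 2))
```

Each z value is then compared with a critical value from the standard normal distribution, for example ±1.96 at the .05 level.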

Fig 2.4 Normal Probability Plots and Corresponding Univariate Distribution (Hair et al., 1998) 34

Homoscedasticity vs. Heteroscedasticity

• Homoscedasticity is an assumption related primarily to dependence relationships between variables.

• Although the dependent variables must be metric, this concept of an equal spread of variance across independent variables can be applied whether the independent variables are metric or nonmetric.
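For a nonmetric independent variable, equal spread means similar variance of the metric variable in each group. One common check is Levene's test; the slides do not name the test they use, so its use here is an assumption. The sketch computes only the test statistic W (the p-value would come from an F distribution with k − 1 and N − k degrees of freedom), and the group data are invented.

```python
from statistics import mean


def levene_statistic(groups):
    """Levene's W, based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean.
    z = [[abs(v - mean(g)) for v in g] for g in groups]
    z_means = [mean(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n_total
    between = sum(len(zg) * (zm - z_grand) ** 2 for zg, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zg, zm in zip(z, z_means) for v in zg)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical satisfaction scores split by a two-level nonmetric variable.
small_firms = [4.2, 4.8, 5.1, 4.5, 5.0, 4.7]
large_firms = [2.1, 6.9, 3.5, 7.8, 4.4, 6.0]
print(round(levene_statistic([small_firms, large_firms]), 2))
```

A larger W indicates a larger difference in spread between the groups.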


Fig 2.5 Scatter Plots of Homoscedastic and Heteroscedastic Relationships (Hair et al., 1998)

Fig 2.6 Normal Probability Plots of Metric Variables (Hair et al., 1998)

Table 2.9 Distributional Characteristics, Testing for Normality, and Possible Remedies (Hair et al., 1998)

Fig 2.7 Transformation of X2 (Price Level) to Achieve Normality (Hair et al., 1998)

Table 2.10 Testing for Homoscedasticity (Hair et al., 1998)

Sampling Distribution


Understanding sampling distributions

• A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval.
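The step from scores to a frequency table to a histogram can be sketched as follows. The scores, the interval settings, and the function name `frequency_table` are hypothetical.

```python
def frequency_table(scores, start, width, n_intervals):
    """Count scores in consecutive intervals [lower, upper) of equal width."""
    counts = [0] * n_intervals
    for score in scores:
        index = int((score - start) // width)
        if 0 <= index < n_intervals:  # scores outside the range are ignored
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, count)
            for i, count in enumerate(counts)]


# Hypothetical test scores.
scores = [52, 55, 61, 63, 64, 68, 70, 71, 71, 75, 78, 79, 83, 86, 94]
for lower, upper, count in frequency_table(scores, start=50, width=10, n_intervals=5):
    # Each '#' stands for one score; the row length plays the role of bar height.
    print(f"{lower:>3}-{upper:<3} {'#' * count}")
```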

• A bar graph is much like a histogram, differing in that the columns are separated from each other by a small distance. Bar graphs are commonly used for qualitative variables.

What is a normal distribution?

• Normal distributions are a family of distributions that have the same general shape. They are symmetric, with scores more concentrated in the middle than in the tails. Normal distributions are sometimes described as bell shaped. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ).
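The height of the curve at a score x is given by the normal density, exp(−(x − μ)² / (2σ²)) / (σ√(2π)). The sketch below evaluates that formula directly; the function name `normal_height` and the parameter values are illustrative.

```python
import math


def normal_height(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical parameters: mean 100, standard deviation 15.
for x in (70, 85, 100, 115, 130):
    print(x, round(normal_height(x, mu=100, sigma=15), 5))
```

The printed heights are symmetric about the mean and largest at the mean, which is the bell shape described above.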
