Quantitative Data Analysis - Global Information Systems

Statistical Data Analysis
Moses Mugolo Kasolo
Training Co-ordinator
Statistician/Data Analyst (GIS)
0772408941/0702196786
Training Objectives
• By the end of the training participants shall have acquired
knowledge and basic skills in statistical data analysis.
• Participants should therefore be able to:
• prepare data for analysis,
• apply different data transformation techniques,
• identify and learn when to apply the various
statistical tests in analyzing quantitative data.
• Create a data entry template and enter data (Practical)
• Perform some basic analysis (Practical)
What is Statistical Data Analysis?
After collecting data, the researcher becomes concerned with five things:
• Checking the questionnaires/schedules (Data Cleaning*)
• Sorting out and reducing the information collected (Reliability/Factor
Analysis*)
• Summarizing the data into tabular forms (Descriptive Statistics)
• Analyzing findings to bring out salient features (Inferential Statistics)
• Interpreting the results (narrating the story behind the numbers)
Overall, the technique of converting raw data into
meaningful statements (including data editing,
tabulation, disaggregation, graphing, interpretation and
presentation) is what is commonly referred to as Data
Analysis.
Organizing, Entering & Cleaning Data
• After the data collection exercise, data must be organized properly
to facilitate data entry and analysis. There are three
stages that must be followed to organize your data:
• Data Editing
• Data Coding
• Data Screening/Cleaning
• Data editing is a process whereby errors in completed interview
schedules, questionnaires, etc. are identified and eliminated
whenever possible.
• Editing is carried out in three stages:
• in the field by the interviewer,
• in the office before data is coded,
• using computer programs, which can edit the data before it is
analyzed.
Data Editing (continued)
• Editing is done before analysis to check the following:
• Completeness
• Accuracy
• Consistency
• Inclusiveness (key variables)
• Data can thereafter be entered into the computer using the
appropriate computer package.
• Some basic analysis can be done to help detect
anomalies in the data entered.
• Questionnaires should always be numbered (for reference) to
facilitate this exercise.
Data Coding
• This is the process whereby verbal information is
converted into variables and categories of variables
using numbers, so that the data can be easily
entered into computers for purposes of analysis.
Variables:    Gender             Age Groups                   Did you vote?
Categories:   Male     Female    18-25    26-33    34-41      No       Yes
Codes:        Code 0   Code 1    Code 1   Code 2   Code 3     Code 1   Code 0
Types of Questions & Answers
• The Dichotomous Question
The dichotomous question is generally a "yes/no"
question. It may also be any question with only two
possible responses.
• Examples:
(1) Have you ever purchased a product or service from
our website? Yes/No
• (2) What is your gender? Male/Female
• Coding for purposes of data analysis: Use 1 and 0
(NOT 1 and 2 as is the common practice)
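One reason for preferring 1 and 0 can be seen in a short sketch. The responses below are hypothetical and the codebook (1 = Yes, 0 = No) is an assumption for illustration. With this coding, the mean of the variable is simply the proportion of "Yes" answers.

```python
# Hypothetical answers to "Have you ever purchased from our website?"
responses = ["Yes", "No", "Yes", "Yes", "No"]

# Assumed codebook: 1 = Yes, 0 = No
codebook = {"Yes": 1, "No": 0}
coded = [codebook[r] for r in responses]

# With 0/1 coding, the mean equals the proportion answering "Yes"
proportion_yes = sum(coded) / len(coded)
print(coded, proportion_yes)
```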
Types of Questions & Answers
• The Single Choice Question
A single choice question offers many possible
responses, but you are required to choose only one.
• What is your highest level of education?
• Primary
• Secondary
• Ordinary Diploma
• Bachelor’s
• Master’s
• PhD
• Others, specify
• Coding for purposes of data analysis: Start with 1, 2, 3 …
Types of Questions & Answers
The Multiple Choice Questions
The multiple-choice question consists of many possible
responses. They normally ask for multiple answers. These
questions can vary depending on how you state them.
Examples: How did you first learn about our web site?
•Television
•Radio
•Newspaper
•Magazine
•Word-of-mouth
•Internet
•Other: Please Specify __________
Coding for purposes of data analysis: Start with 1, 2, 3…
Types of Questions & Answers
• Name three different sources where you learnt about our web
site.
• Television
• Radio
• Newspaper
• Magazine
• Word-of-mouth
• Internet
• Other: Please Specify _______________
• Coding for purposes of data analysis: Start with 1, 2, 3 …
Create three variables in SPSS, named Source1, Source2
and Source3
Types of Questions & Answers
Tick those sources where you have learnt about our web site.
• Television
• Radio
• Newspaper
• Magazine
• Word-of-mouth
• Internet
• Other: Please Specify _______________
Coding for purposes of data analysis: Create variables with
the above names. Enter 1 for sources and 0 for non-sources
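The "tick all that apply" coding can be sketched as follows. The ticked answers are hypothetical, and the dictionary layout is only one possible way to hold the data outside SPSS: one 0/1 variable per source for every respondent.

```python
sources = ["Television", "Radio", "Newspaper", "Magazine",
           "Word-of-mouth", "Internet"]

# Hypothetical ticked answers, one set per respondent
ticked = [
    {"Radio", "Internet"},
    {"Television"},
    {"Newspaper", "Radio", "Word-of-mouth"},
]

# One variable per source: 1 if ticked, 0 if not
coded = [{s: int(s in answer) for s in sources} for answer in ticked]

# Summing a 0/1 variable gives the number of respondents naming that source
totals = {s: sum(row[s] for row in coded) for s in sources}
print(totals)
```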
Types of Questions & Answers
• Rank Order Scaling
Rank order scaling questions allow a certain set of brands or
products to be ranked based upon a specific attribute or
characteristic.
Example:
Please rank the following brands according to their reliability.
Place a "1" next to the brand that is most reliable, a "2" next
to the brand that is next most reliable, and so on.
• Honda
• Toyota
• Mazda
• Ford
• What are you interested in? And how do you code them?
Types of Questions & Answers
• The Rating Scale
A rating scale question requires a person to rate a product or brand
along a well-defined, evenly spaced continuum. Rating scales are
often used to measure the direction and intensity of attitudes. The
following is an example of a comparative rating scale question:
Example:
How do you best describe your last experience purchasing a product or
service on our website?
• Very pleasant
• Somewhat pleasant
• Neither pleasant nor unpleasant
• Somewhat unpleasant
• Very unpleasant
How do you code the above?
Types of Questions & Answers
The Semantic Differential Scale
The semantic differential scale asks a person to rate a product,
brand, or company based upon a seven-point rating scale that
has two bi-polar adjectives at each end. The following is an
example of a semantic differential scale question.
Example:
Would you say our web site is:
•(7) Very Attractive
•(6)
•(5)
•(4)
•(3)
•(2)
•(1) Very Unattractive
Notice that, unlike the rating scale, only the two end points of the
semantic differential scale carry labels. A person must place the
rating somewhere between the two adjectives.
Types of Questions & Answers
• The Stapel Scale
The Stapel scale asks a person to rate a brand, product, or
service according to a certain characteristic on a scale from +5
to -5, indicating how well the characteristic describes the
product or service. The following is an example of a Stapel
scale question:
• When thinking about Global Information Systems Ltd (GIS), do
you believe that the word "innovative" appropriately
describes or poorly describes the company? On a scale of +5
to -5, with +5 being "very good description of GIS" and -5
being "poor description of GIS," how do you rank GIS
according to the word "innovative"?
Types of Questions & Answers
•(+5) Describes very well
•(+4)
•(+3)
•(+2)
•(+1)
•Innovative
•(-1)
•(-2)
•(-3)
•(-4)
•(-5) Poorly Describes
Types of Questions & Answers
• The Constant Sum Question
A constant sum question permits collection of "ratio" data,
meaning that the data is able to express the relative value or
importance of the options (option A is twice as important as
option B).
• Example:
The following question asks you to divide 100 points between
a set of options to show the value or importance you place
on each option. Distribute the 100 points giving the more
important reasons a greater number of points.
Types of Questions & Answers
When thinking about the reasons you purchased our data
mining software, please rate the following reasons
according to their relative importance.
•Seamless integration with other software __________
•User friendliness of software __________
•Ability to manipulate algorithms __________
•Level of pre- and post-purchase service __________
•Level of value for the price __________
•Convenience of purchase/quick delivery __________
•Others, specify __________
Total 100 points
Types of Questions & Answers
• The Open-Ended Question
The open-ended question seeks to explore the qualitative, in-depth
aspects of a particular topic or issue. It gives a person
the chance to respond in detail. Although open-ended
questions are important, they are time-consuming and should
not be over-used if you intend to perform a quantitative
analysis.
• Example: What products or services were you looking for that
were not found on our website?
• Note: If you want to add an "Other" answer to a multiple
choice question, you would use branching instructions to
lead to an open-ended question that finds out what "other"
really is.
Types of Questions & Answers
• The Demographic Questions
Demographic questions are an integral part of any
questionnaire. They are used to identify characteristics
such as age, gender, income, number of children, and so
forth.
• Examples:
• How old are you?
• What is your yearly income?
• How many children are you responsible for?
• What type of data is this? And how do you code it?
Data Screening or Cleaning
• Used to identify miscoded data (e.g. possible responses are either
Yes – 1 or No – 0, so there can't be another code)
• Used to identify missing data (key variables should not
have missing values)
• Used to identify messy or inconsistent data (e.g. Smoker = No,
No. of cigars per day = 40)
• It helps to find possible outliers, non-normal distributions and
other anomalies in the data.
SPSS uses Validation rules for Data Cleaning
There are three types of rules for validating a data set:
• Single-variable rules
• Cross-variable rules, and
• Multi-case rules.
Single-Variable Rules
• Validation rules that check internal inconsistencies, such as
invalid values and cases within a variable, are known as
Single-Variable Rules.
• These rules consist of a set of checks that can be applied to a
variable. Normally, checks for out-of-range or invalid values
and missing values are included in this category. For example,
a value of 5 was entered for 'highest education level',
whose valid codes are only 0, 1, 2 and 3.
• Similarly, single-variable rules can be used to check whether
values other than 0 and 1 (or 'Male' and 'Female') are entered
in the variable 'sex of respondent'.
Single-Variable Rules
• There are four stages in the single-variable rule.
• Stage I: Obtain a list of valid values or ranges from the
codebook.
• Stage II: Construct a frequency table for the variable under
test. If there are no invalid values displayed, the variable
under observation is ‘valid’ with the single-variable rule.
• Stage III: (If there are invalid values). Extract all cases with
invalid values for the variable in the data set.
• Stage IV: Identify the questionnaires where those erroneous
cases come from.
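The four stages can be sketched outside SPSS as below. The records and valid codes are hypothetical, and the frequency table is built with Python's Counter rather than the SPSS Frequencies procedure.

```python
from collections import Counter

# Hypothetical records: (questionnaire number, coded education level)
records = [(1, 0), (2, 3), (3, 5), (4, 1), (5, 2), (6, 9)]

# Stage I: valid codes taken from the (assumed) codebook
valid_codes = {0, 1, 2, 3}

# Stage II: frequency table for the variable under test
frequency = Counter(code for _, code in records)
invalid_values = sorted(set(frequency) - valid_codes)

# Stages III and IV: extract the invalid cases and trace them
# back to their questionnaire numbers
flagged = [number for number, code in records if code not in valid_codes]
print(invalid_values, flagged)
```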
Cross-Variable Rules
• Rules for checking inconsistencies in a variable through the
values of other variables in the same case are called
Cross-Variable Rules.
• Users have to use cross-tabulations to identify whether invalid
cases exist or not, and to apply slightly different rules for
conditional selection of invalid cases.
• For example, when you cross tabulate Age and Highest
Education Level, you may discover "suspicious" cases where
respondents aged 12 have University education. Another example:
"Ever fallen sick in the last week?" – No, versus
"Action taken?" – Visited a health centre!
• Note: You can cross tabulate more than 2 variables at a time
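A cross-variable check along the lines of the age and education example might look like this sketch. The cases, the age bands and the rule itself (under 18 with University education is suspicious) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical cases: (questionnaire number, age, highest education level)
cases = [
    (1, 34, "University"),
    (2, 12, "Primary"),
    (3, 12, "University"),
    (4, 27, "Secondary"),
]

def age_band(age):
    """Assumed banding used only to keep the cross-tabulation small."""
    return "under 18" if age < 18 else "18 and over"

# Cross-tabulation: count of cases in each (age band, education) cell
crosstab = Counter((age_band(age), education) for _, age, education in cases)

# Conditional selection of the suspicious cell
suspicious = [number for number, age, education in cases
              if age < 18 and education == "University"]
print(crosstab, suspicious)
```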
Multi-Case Rule
• A user-defined rule that can be applied to a single
variable or a combination of variables in a group of cases
is a Multi-Case Rule.
• Multi-case rules are defined by a procedure (sequence of
logical expressions) that flags invalid cases.
• The most common and useful application of multi-case
rules is checking whether there are duplicates in the
data set, such as cases that have been entered more
than once for a single respondent or household, or a
household that has two heads, or two respondents who
have the same opinion, attitude and perception about
something under study.
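The duplicate check, the most common multi-case rule, can be sketched with a simple count of identifiers. The respondent IDs are hypothetical.

```python
from collections import Counter

# Hypothetical respondent IDs as entered, in data entry order
respondent_ids = [101, 102, 103, 102, 104, 101]

# A multi-case rule looks across cases: any ID seen more than once is flagged
counts = Counter(respondent_ids)
duplicates = sorted(rid for rid, n in counts.items() if n > 1)
print(duplicates)
```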
Data Entering
• Raw data is NOT very useful for purposes of analysis. This
data must be entered into appropriate computer
software before actual analysis starts.
• Whether you have collected quantitative or qualitative
data, it is important that you enter the data in a logical
format that can be easily understood and analyzed.
• For quantitative data, you can use Microsoft Excel, Epi
Info, Epi Data, Stata or SPSS (most popular data analysis
software).
Measurement Levels
• There is a need to identify the level of measurement associated
with the quantitative data. The level of measurement has a lot of
influence on the type of analysis you can use. There are four
levels of measurement:
• Nominal data: the number assigned to the data simply
represents a category of object. There is no measured difference
between the objects being measured, and the numbers assigned are not in
any logical order. Common examples are assigning a number to
Gender (male – 1, female – 0) or Marital status (married – 1, single – 2,
divorced – 3, etc.). You are just assigning a number to something for
purposes of analysis.
• Ordinal data: the larger the number assigned to the
object, the larger the object truly is in some amount, value,
importance or hierarchy. The numbers assigned have a logical
order, but the differences between values are neither constant nor
meaningful. Examples: T-shirt size (small – 1, medium – 2, large – 3),
Level of education (Primary – 1, Secondary – 2, University – 3)
Measurement Levels
• Interval data: data is continuous and has a logical
order; data has standardized differences between values,
but no natural zero. A natural zero would mean
non-existence of what is being measured. Example of interval
data: temperature in degrees Fahrenheit. Zero degrees does not mean
non-existence of temperature!
• Ratio data: means data is continuous, ordered, has
standardized differences between values, and a natural
zero. Examples: height, weight, age, length, etc. Zero
height, weight or age means non-existence of the object
being measured
Transformation of Data
• Constructs are measured in very arbitrary ways. For
example, height may be measured in feet, inches,
centimeters or millimeters, while weight may be
measured in kilograms, grams or pounds.
• These measurements can be converted from one to the
other by a rule or formula.
• The measurement scale we use depends on a number of
factors.
• Converting data from one scale into another is what is
called Data Transformation.
Transformation of Data
• In statistical practice there are a number of
transformations that are commonly used.
• These include:
• Dichotomization
• Standardization
• Normalization
• Computation
• Aggregation
Transformation of Data
• Dichotomization
• A variable that takes on only two values is a dichotomous variable.
• Examples: Male/female, yes/no, agree/disagree, true/false,
present/absent, less than/more than, lowest half/highest half,
experimental group/control group, are all examples of dichotomous
variables
• We can convert continuous measurements to a smaller number of
categories by recoding the variable into two values (Dichotomization).
• Example 1: Convert height into below average or above average (call
them 0 and 1, or 1 and 2).
• Example 2: Convert a Likert scale (Strongly Agree, Agree, Undecided,
Disagree and Strongly Disagree) to Agree and Disagree, ignoring the
Undecided responses (if their number is negligible)
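Both examples can be sketched in a few lines. The heights and responses are hypothetical; coding a height exactly equal to the average as 0 is an arbitrary choice made here.

```python
from statistics import mean

# Example 1: height recoded as 1 (above average) or 0 (otherwise)
heights_cm = [150, 160, 170, 180, 190]
average = mean(heights_cm)
above_average = [1 if h > average else 0 for h in heights_cm]

# Example 2: five-point Likert item collapsed to Agree (1) / Disagree (0);
# "Undecided" is left out of the recode map and therefore dropped
answers = ["Strongly Agree", "Agree", "Undecided",
           "Disagree", "Strongly Disagree"]
recode = {"Strongly Agree": 1, "Agree": 1,
          "Disagree": 0, "Strongly Disagree": 0}
agree = [recode[a] for a in answers if a in recode]
print(above_average, agree)
```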
Transformation of Data
• Standardization
• Another useful transformation in statistics is standardization,
sometimes called "converting to Z-scores" or "taking Z-scores".
• It has the effect of transforming the original distribution to one in
which the mean becomes zero and the standard deviation
becomes one.
• This helps you to compare 2 or more sets of data using a standard
scale
• A Z-score quantifies the original score in terms of the number of
standard deviations that the score is from the mean of the
distribution. The formula for converting from an original or "raw"
score to a Z-score is:
Z = (X − mean) / standard deviation
where X is the raw score and the mean and standard deviation are
those of the distribution the score comes from.
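A minimal sketch of standardization on hypothetical scores follows. It divides by the population standard deviation (pstdev); using the sample standard deviation (stdev) instead is equally common and changes the values slightly.

```python
from statistics import mean, pstdev

# Hypothetical raw scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(scores)       # mean of the raw scores
sigma = pstdev(scores)  # population standard deviation (an assumption)

# Z-score: number of standard deviations each score lies from the mean
z_scores = [(x - mu) / sigma for x in scores]
print(z_scores)
```

After the transformation the Z-scores have a mean of zero and a standard deviation of one, which is what makes different variables comparable.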
Transformation of Data
• Data Normalization
• A common requirement for parametric tests is that the
population of scores from which the sample observations came
should be normally distributed.
• Data which does not meet this requirement may therefore be
normalized before subjecting it to any parametric tests.
• The most common normalization techniques include logarithmic,
reciprocal, and square root transformations.
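The three transformations can be sketched on hypothetical, strictly positive, right-skewed values. Which one brings a given variable closest to normality has to be checked on the data itself.

```python
import math

# Hypothetical right-skewed values (must be positive for log and reciprocal)
incomes = [1, 10, 100, 1000]

log_transformed = [math.log10(x) for x in incomes]
sqrt_transformed = [math.sqrt(x) for x in incomes]
reciprocal_transformed = [1 / x for x in incomes]
print(log_transformed)
```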
Transformation of Data
• Computation
• When data is collected, there are some variables that may be
derived from the already collected data.
• Example 1: Age may be derived from Date of birth
• Example 2: Suppose 20 respondents are asked a question, where
the possible responses are: (1) Strongly agree, (2) Agree, (3) Not
Sure, (4) Disagree, (5) Strongly Disagree.
• You may compute a new variable showing the average response
to the question and make generalizations about the respondents.
• You can also derive very complex variables depending on the kind
of research you are doing. (y = mx+c)
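Both computation examples can be sketched as below. The function name age_in_years, the dates and the response codes are hypothetical.

```python
from datetime import date

def age_in_years(date_of_birth, reference_date):
    """Completed years between date of birth and a reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        date_of_birth.month, date_of_birth.day)
    return reference_date.year - date_of_birth.year - (0 if had_birthday else 1)

# Example 1: age derived from date of birth
age = age_in_years(date(1990, 6, 15), date(2024, 6, 14))

# Example 2: average response to one item
# (1 = Strongly agree ... 5 = Strongly disagree)
codes = [1, 2, 2, 3, 5]
average_response = sum(codes) / len(codes)
print(age, average_response)
```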
Transformation of Data
• Aggregation
• In many cases data is collected variable by variable. Actually
even variables that require multiple responses, are still
constructed as “single” variables
• After collecting data that way, you may need to combine or
aggregate some variables and create new ones
• For example after collecting data on household incomes,
family sizes and ages, you may aggregate the data and create a
new dataset showing total income, average income, number
of people per LC1 or LC3.
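Aggregation of household records to village level can be sketched as follows. The village labels, incomes and family sizes are hypothetical.

```python
# Hypothetical household records
households = [
    {"village": "LC1-A", "income": 200, "size": 4},
    {"village": "LC1-A", "income": 300, "size": 6},
    {"village": "LC1-B", "income": 500, "size": 5},
]

# Aggregate household rows into one summary row per village
summary = {}
for h in households:
    row = summary.setdefault(
        h["village"], {"total_income": 0, "households": 0, "people": 0})
    row["total_income"] += h["income"]
    row["households"] += 1
    row["people"] += h["size"]

for row in summary.values():
    row["average_income"] = row["total_income"] / row["households"]
print(summary)
```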
Analyzing Quantitative Data
• Once you have identified your levels of measurement, you can
begin using some of the quantitative data analysis procedures.
There are several procedures you can use to determine what
narrative your data is telling. Below are some of the common
analyses:
• Data tabulation (frequency distributions & percent
distributions)
• Descriptive statistics
• Data disaggregation
• Choosing Statistical tests
Analyzing Quantitative Data
• Data tabulation
• The first thing you should do with your data is tabulate your
results for the different variables in your data set. This process
will give you a comprehensive picture of what your data looks
like and assist you in identifying patterns. The best ways to do
this are by constructing frequency and percent distributions
• A frequency distribution is an organized tabulation of the
number of individuals or scores located in each category
• See the tables below showing frequency distribution for the
regions and education level
Analyzing Quantitative Data
REGION
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Eastern        31        25.8        25.8              25.8
       Northern       29        24.2        24.2              50.0
       Central        33        27.5        27.5              77.5
       Western        27        22.5        22.5             100.0
       Total         120       100.0       100.0
Observations:
• What is the difference between Percent and Valid Percent?
• What is the use of Cumulative Percent?
• What do we report? Frequency? Percent? Or Valid Percent?
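The columns of the REGION table can be recomputed from the frequencies alone, as in this sketch. The missing count of zero is an assumption; with no missing cases Percent and Valid Percent coincide, and they differ only when some cases are missing.

```python
# Frequencies from the REGION table
counts = {"Eastern": 31, "Northern": 29, "Central": 33, "Western": 27}
missing = 0  # assumed: no missing cases

valid_total = sum(counts.values())
total = valid_total + missing

table = []
cumulative = 0.0
for region, n in counts.items():
    percent = 100 * n / total              # share of all cases
    valid_percent = 100 * n / valid_total  # share of non-missing cases
    cumulative += valid_percent            # running total of valid percent
    table.append((region, n, round(percent, 1),
                  round(valid_percent, 1), round(cumulative, 1)))

for row in table:
    print(row)
```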
Analyzing Quantitative Data
Observations:
• Is there a clear difference between Percent and Valid Percent?
Analyzing Quantitative Data
• Descriptive statistics
• Descriptive statistics are calculations that are used to "describe" the
data set. The most common descriptives used are:
• Mean – the numerical average of scores for a particular variable
• Minimum and maximum values – the lowest and highest values
for a particular variable
• Median – the numerical middle point or score that cuts the
distribution in half for a particular variable
• Mode – the most common score or value for a particular
variable
• Standard deviation – a measure of the average spread or
variation from the mean.
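These descriptives are available in Python's statistics module. The scores below are hypothetical Likert codes, and stdev is the sample standard deviation.

```python
from statistics import mean, median, mode, stdev

# Hypothetical responses coded 1 to 5
scores = [5, 5, 4, 5, 3, 2, 5, 4]

descriptives = {
    "mean": mean(scores),
    "minimum": min(scores),
    "maximum": max(scores),
    "median": median(scores),
    "mode": mode(scores),
    "std_dev": stdev(scores),  # sample standard deviation (n - 1)
}
print(descriptives)
```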
Analyzing Quantitative Data
Statements measuring needs assessment             SA      A       N      D      SD     Mean   S.D.
Farmers attend coffee nursery planning meetings   62.5%   21.4%   3.6%   7.1%   5.4%   4.29   1.17
Likert Scale: 5 - Strongly agree, 4 - Agree, 3 - Not Sure,
2 - Disagree, 1 - Strongly Disagree
Teaser: Suppose the mean was the same for 2 different
statements but the S.D. was different. What would be the
interpretation? Mean: 4.29, S.D.1 = 1.17, S.D.2 = 2.62
Analyzing Quantitative Data
• Descriptive statistics…
• We can also generate more descriptives, such as:
• Range - the difference between the highest and lowest values
• Quartiles - the values that divide a list of numbers into
quarters.
• Skewness - a measure of symmetry, or more precisely, the lack of
symmetry. The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero. Negative values
for the skewness indicate data that are skewed left and positive
values for the skewness indicate data that are skewed right.
• Kurtosis - a measure of whether the data are peaked or flat
relative to a normal distribution. The kurtosis for a
normal distribution is three. Many packages, SPSS included,
subtract three (excess kurtosis) so that a normal distribution
scores zero. On that scale, positive kurtosis indicates a
"peaked" distribution and negative kurtosis indicates a "flat"
distribution.
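Skewness and excess kurtosis can be sketched from their moment definitions. This is the simple population version; statistical packages usually apply a small-sample correction, so their output will differ slightly. The function name and data are hypothetical.

```python
from statistics import mean

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # variance
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # zero for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```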
Analyzing Quantitative Data
• Disaggregation of data
• After tabulating the data, you can continue to explore
the data by disaggregating it across different variables.
The 2-way or 3-way Crosstabs allows you to disaggregate
the data across multiple variables.
• Using data from our example, let’s explore the
participant demographics (gender and education level)
for each region.
• By looking at the table below, you can clearly see the
demographic makeup of the respondents.
EDUCATION LEVEL * GENDER Crosstabulation
                                                   GENDER
EDUCATION LEVEL                               Female    Male     Total
Primary     Count                                4        4        8
            % within EDUCATION LEVEL           50.0%    50.0%   100.0%
            % within GENDER                     8.7%     5.4%     6.7%
Secondary   Count                               24       10       34
            % within EDUCATION LEVEL           70.6%    29.4%   100.0%
            % within GENDER                    52.2%    13.5%    28.3%
Diploma     Count                               17       37       54
            % within EDUCATION LEVEL           31.5%    68.5%   100.0%
            % within GENDER                    37.0%    50.0%    45.0%
Degree      Count                                1       23       24
            % within EDUCATION LEVEL            4.2%    95.8%   100.0%
            % within GENDER                     2.2%    31.1%    20.0%
Total       Count                               46       74      120
            % within EDUCATION LEVEL           38.3%    61.7%   100.0%
            % within GENDER                   100.0%   100.0%   100.0%
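The two kinds of percentages in the crosstabulation can be recomputed from the counts, as sketched here. "% within EDUCATION LEVEL" divides by the row total and "% within GENDER" divides by the column total.

```python
# Counts taken from the crosstabulation
counts = {
    "Primary":   {"Female": 4,  "Male": 4},
    "Secondary": {"Female": 24, "Male": 10},
    "Diploma":   {"Female": 17, "Male": 37},
    "Degree":    {"Female": 1,  "Male": 23},
}

row_totals = {level: sum(cells.values()) for level, cells in counts.items()}
column_totals = {g: sum(cells[g] for cells in counts.values())
                 for g in ("Female", "Male")}

# Row percentages: share of each gender within an education level
within_education = {
    level: {g: round(100 * n / row_totals[level], 1) for g, n in cells.items()}
    for level, cells in counts.items()
}
# Column percentages: share of each education level within a gender
within_gender = {
    level: {g: round(100 * n / column_totals[g], 1) for g, n in cells.items()}
    for level, cells in counts.items()
}
print(within_education["Secondary"], within_gender["Secondary"])
```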
Analyzing Quantitative Data
Choosing Statistical Tests
• Statistical tests mainly focus on three aspects:
• Associations (relationships)
• Predictions (forecasting)
• Differences between groups
Choosing Statistical Tests
• Associations:
• Pearson's correlation
• The Pearson product-moment correlation is a measure of the
strength and direction of association that exists between two
variables measured on at least an interval scale.
• Examples;
• Is there an association between exam performance and time
spent revising;
• Is there an association between family size and family savings.
• Pearson’s correlation coefficient ranges between +1 and -1. +1
shows a perfect positive association, -1 perfect negative
association and 0 (zero) no association
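The coefficient can be sketched directly from its definition: the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations. The function name and the revision data are hypothetical.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours spent revising and exam marks
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]
r = pearson_r(hours, marks)
print(round(r, 3))
```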
Choosing Statistical Tests
• Associations:
• Assumptions for Pearson’s correlation
• Assumption #1: Your two variables should be measured at the
interval or ratio level (i.e., they are continuous).
• Examples;
• Revision time (measured in hours), age(measured in years), exam
performance (measured in marks from 0 to 100), weight
(measured in kg), etc.
• Assumption #2: There needs to be a linear relationship between
the two variables.
• You can plot the dependent variable against your independent
variable on a scatterplot and then visualize it to check for
linearity.
Choosing Statistical Tests
Examples of Scatterplots
Choosing Statistical Tests
• Assumption #3: There should be no significant outliers. Outliers
are simply single data points within your data that do not follow
the usual pattern.
• Pearson's r is sensitive to outliers, which can have a very large effect
on the line of best fit and the Pearson correlation coefficient, leading
to misleading conclusions regarding your data.
Choosing Statistical Tests
• Assumption #4: Your variables should be approximately
normally distributed. In order to assess the statistical
significance of the Pearson correlation, you need to have
bivariate normality, but this assumption is difficult to assess,
so a simpler method is more commonly used: testing each
variable separately with the Shapiro-Wilk test of normality,
which is easily run in SPSS.
Choosing Statistical Tests
• Associations continue…
• Spearman's Rank-Order Correlation
• The Spearman's rank-order correlation is the nonparametric
version of the Pearson product-moment correlation.
• It measures the strength of association between two ranked
variables.
• Assumptions
• The two variables must be measured on at least an ordinal
scale, since nominal categories cannot be ranked
• They may also be interval or ratio data (where the assumptions
for Pearson have been violated)
Choosing Statistical Tests
• Associations continue…
• Chi-Square Test for Association
• The chi-square test for independence, also called Pearson's
chi-square test or the chi-square test of association, is used to discover
if there is a relationship between two categorical variables.
• Examples: Is there an association between Region and Political
affiliation? Is there an association between Gender and Type of
learning (On-line, Books or Face-to-Face)?
• Assumption #1: Your two variables should be measured at an
ordinal or nominal level (i.e., categorical data).
• Assumption #2: Your two variables should consist of two or more
categorical, independent groups. Example: Gender (2 groups:
Males and Females).
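The chi-square statistic compares observed counts with the counts expected if the two variables were unrelated. The 2 x 2 table below is hypothetical and no continuity correction is applied. The p-value would then be read from a chi-square table or a statistics package using the degrees of freedom.

```python
# Hypothetical observed counts: rows = gender, columns = learning type
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if the variables were independent
expected = [[r * c / n for c in column_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_square, 2), degrees_of_freedom)
```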
Choosing Statistical Tests
• Predicting scores
• Linear Regression Analysis
• Linear regression is the next step up after correlation.
• It is used when we want to predict the value of a variable based
on the value of another variable. The variable we want to predict
is called the dependent variable (or sometimes, the outcome
variable).
• The variable we are using to predict the other variable's value is
called the independent variable(predictor or explanatory variable)
• Examples; Exam performance can be predicted based on revision
time; A family’s Savings can be predicted based on the family size.
• Simple linear regression is when there is only one IV and one DV.
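Simple linear regression can be sketched with the ordinary least squares formulas for the slope and intercept. The function name and data are hypothetical and chosen to lie exactly on a line so the result is easy to check by hand.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: revision time (IV) and exam performance (DV)
hours = [1, 2, 3, 4, 5]
marks = [50, 55, 60, 65, 70]

slope, intercept = fit_line(hours, marks)
predicted_for_6_hours = slope * 6 + intercept
print(slope, intercept, predicted_for_6_hours)
```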
Choosing Statistical Tests
• Assumptions
• Assumption #1: Your two variables should be measured at the
interval or ratio level (i.e., they are continuous).
• Assumption #2: There needs to be a linear relationship between
the two variables.
• Assumption #3: There should be no significant outliers. Outliers
are simply single data points within your data that do not follow
the usual pattern.
• Assumption #4: You should have independence of observations,
which you can easily check using the Durbin-Watson statistic,
which is a simple test to run using SPSS.
• Assumption #5: Your data needs to show homoscedasticity,
which is where the variances along the line of best fit remain
similar as you move along the line.
Choosing Statistical Tests
• Homoscedasticity Versus Heteroscedasticity
Choosing Statistical Tests
• Predicting scores continues…
• Multiple Regression Analysis
• Multiple regression is an extension of simple linear regression.
• It is used when we want to predict the value of a variable based on the
value of two or more other variables.
• Example; Use multiple regression to understand whether exam
performance can be predicted based on revision time, exam anxiety,
lecture attendance, and gender.
• Multiple regression also allows you to determine the overall fit (variance
explained) of the model and the relative contribution of each of the
predictors to the total variance explained. For example, you might want
to know how much of the variation in exam performance can be
explained by revision time, exam anxiety, lecture attendance and gender
"as a whole", but also the "relative contribution" of each independent
variable in explaining the variance.
Choosing Statistical Tests
• Assumptions
• Assumption #1: Your dependent variable should be measured on
a continuous scale .
• Assumption #2: You have two or more independent variables,
which can be either continuous (i.e., an interval or ratio variable)
or categorical (i.e. ordinal or nominal variable).
• Assumption #3: You should have independence of observations.
• Assumption #4: There needs to be a linear relationship between
(a) the dependent variable and each of your independent
variables, and (b) the dependent variable and the independent
variables collectively.
• Assumption #5: Your data needs to show homoscedasticity,
which is where the variances along the line of best fit remain
similar as you move along the line.
Choosing Statistical Tests
• Assumption #6: Your data must not show multi-collinearity,
which occurs when you have two or more independent
variables that are highly correlated with each other.
• This leads to problems with understanding which independent
variable contributes to the variance explained in the
dependent variable, as well as technical issues in calculating a
multiple regression model.
Choosing Statistical Tests
• Ordinal logistic regression
• Ordinal logistic regression (often just called 'ordinal regression') is
used to predict an ordinal dependent variable given one or more
independent variables.
• Examples;
• Ordinal regression can be used to predict the belief that "tax is too
high" (your ordinal dependent variable, measured on a 4-point
Likert item from "Strongly Disagree" to "Strongly Agree"), based
on two independent variables: "age" and "income".
• Ordinal regression can also be used to determine whether a
number of independent variables, such as "age", "gender", "level
of physical activity" (amongst others), predicts the ordinal
dependent variable, "obesity", where obesity is measured using
three ordered categories: "normal", "overweight" and "obese".
Choosing Statistical Tests
• Assumptions for ordinal regression
• Assumption #1: Your dependent variable should be measured
at the ordinal level.
• Assumption #2: One or more independent variables is/are
continuous, ordinal or categorical (including dichotomous
variables).
• Assumption #3: There is no multi-collinearity
Choosing Statistical Tests
• Differences between groups
Independent-samples t-test
• The independent-samples t-test compares the means between two
unrelated groups on the same continuous, dependent variable.
• Examples;
• Use an independent t-test to understand whether fresh graduate
salaries differed based on gender (i.e., your dependent variable would
be "fresh graduate salaries" and your independent variable would be
"gender", which has two groups: “Male" and “Female").
• Alternatively, use an independent t-test to understand whether there is a
difference in food production based on type of fertilizers (i.e., your
dependent variable would be “food production" and your independent
variable would be “type of fertilizers", which has two groups: “organic"
and “inorganic").
Choosing Statistical Tests
• Assumptions
• Assumption #1: Your dependent variable should be measured at
the interval or ratio level (i.e., they are continuous).
• Assumption #2: Your independent variable should consist of two
categorical, independent groups.
• Assumption #3: You should have independence of observations.
• Assumption #4: There should be no significant outliers.
• Assumption #5: Your dependent variable should be
approximately normally distributed for each category of the
independent variable.
• Assumption #6: There needs to be homogeneity of variances.
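Under these assumptions, in particular homogeneity of variances, the t statistic uses a pooled variance. The sketch below computes the statistic and its degrees of freedom only; the function name and the yields are hypothetical, and the p-value would come from a t table or a statistics package.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a)
              + (nb - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / standard_error, df

# Hypothetical food production under two types of fertilizer
organic = [10, 12, 14]
inorganic = [13, 15, 17]

t_statistic, df = independent_t(organic, inorganic)
print(round(t_statistic, 3), df)
```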
Choosing Statistical Tests
• One-way ANOVA
• The one-way analysis of variance (ANOVA) is used to determine
whether there are any significant differences between the means
of three or more independent (unrelated) groups.
• Example: Use a one-way ANOVA to understand whether exam
performance differed based on exam anxiety levels amongst
students, dividing students' anxiety into three independent groups
(e.g., low, medium and high).
• It is important to realize that the one-way ANOVA cannot tell you
which specific groups were significantly different from each other;
it only tells you that at least two groups were different.
• To find out which groups differ, follow the ANOVA with a post-hoc test.
Choosing Statistical Tests
• Assumptions
• Assumption #1: Your dependent variable should be measured at
the interval or ratio level.
• Assumption #2: Your independent variable should consist of
more than two categorical, independent groups.
• Assumption #3: You should have independence of observations.
• Assumption #4: There should be no significant outliers.
• Assumption #5: Your dependent variable should be
approximately normally distributed for each category of the
independent variable.
• Assumption #6: There needs to be homogeneity of variances.
Choosing Statistical Tests
• Two-way ANOVA
• The two-way ANOVA compares the mean differences between
groups that have been split on two independent variables (called
factors).
• The primary purpose of a two-way ANOVA is to understand if there
is an interaction between the two independent variables on the
dependent variable.
• Example;
• Use a two-way ANOVA to understand whether there is an interaction
between gender and educational level on exam anxiety amongst
university students, where gender (males/females) and education
level (undergraduate/postgraduate) are your independent
variables, and exam anxiety is your dependent variable.
Choosing Statistical Tests
• The two-way ANOVA cannot tell you which specific groups were
significantly different from each other (e.g., it cannot tell you
whether postgraduate males had greater exam anxiety levels
than postgraduate females); it only tells you that at least two
groups were different.
• Since you may have three, four, five or more groups in your study
design, as well as two independent variables, determining which
of these groups differ from each other is important. You can do
this using a post-hoc test.
• Therefore, where statistically significant interactions are found,
you need to determine whether there are any "simple main
effects", and if there are, what these effects are.
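The sums of squares behind a two-way ANOVA can be sketched by hand for a balanced design (equal numbers in every cell), assuming NumPy and SciPy. The anxiety scores are hypothetical, and the formulas below hold only for balanced data; unbalanced designs need a dedicated statistics package.

```python
import numpy as np
from scipy import stats

# Hypothetical exam anxiety scores: anxiety[i, j, k] is student k of
# gender i (male, female) at education level j (undergrad, postgrad).
anxiety = np.array([
    [[52, 55, 50, 54], [60, 63, 59, 62]],
    [[58, 61, 57, 60], [56, 54, 57, 55]],
], dtype=float)

a, b, n = anxiety.shape
grand = anxiety.mean()
mean_a = anxiety.mean(axis=(1, 2))  # gender means
mean_b = anxiety.mean(axis=(0, 2))  # education level means
cell = anxiety.mean(axis=2)         # mean of each gender x education cell

# Sums of squares for the two main effects, the interaction and the error.
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((anxiety - cell[:, :, None]) ** 2)

df_error = a * b * (n - 1)
ms_error = ss_error / df_error

def f_test(ss, df):
    """F ratio and p-value of one effect against the error mean square."""
    f_value = (ss / df) / ms_error
    return f_value, stats.f.sf(f_value, df, df_error)

results = {
    "gender": f_test(ss_a, a - 1),
    "education": f_test(ss_b, b - 1),
    "gender x education": f_test(ss_ab, (a - 1) * (b - 1)),
}
for effect, (f_value, p_value) in results.items():
    print(f"{effect}: F = {f_value:.2f}, p = {p_value:.4f}")
```

The "gender x education" row is the interaction; when it is significant, the cell means are the starting point for examining simple main effects.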
Choosing Statistical Tests
• Assumption #1: Your dependent variable should be measured at
the interval or ratio level (i.e., it is continuous).
• Assumption #2: Your two independent variables should each
consist of two or more categorical, independent groups.
• Assumption #3: You should have independence of observations,
which means that there is no relationship between the
observations in each group or between the groups themselves.
• Assumption #4: There should be no significant outliers.
• Assumption #5: Your dependent variable should be
approximately normally distributed for each combination of the
categories of the two independent variables.
• Assumption #6: There needs to be homogeneity of variances for
each combination of the categories of the two independent
variables.
Choosing Statistical Tests
• One-way MANOVA
• The one-way multivariate analysis of variance (one-way
MANOVA) is used to determine whether there are any
differences between independent groups on more than one
continuous dependent variable.
• Example;
• Use a one-way MANOVA to understand whether there were
differences in students' short-term and long-term recall of
facts based on four different lengths of lecture (i.e., the two
dependent variables are "short-term memory recall" and
"long-term memory recall", while the independent variable is
"lecture duration", which has four independent groups: "30
minutes", "60 minutes", "90 minutes" and "120 minutes").
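One common MANOVA statistic is Wilks' lambda, the ratio of within-group variation to total variation across all dependent variables at once. The sketch below computes it by hand, assuming NumPy and SciPy. The recall scores are hypothetical, and the F conversion used is the exact one for the special case of two dependent variables only.

```python
import numpy as np
from scipy import stats

# Hypothetical scores; each row is (short-term recall, long-term recall).
groups = {
    "30 min": np.array([[8, 6], [7, 5], [9, 7], [8, 6], [7, 6]], dtype=float),
    "60 min": np.array([[7, 5], [6, 5], [8, 6], [7, 4], [6, 4]], dtype=float),
    "90 min": np.array([[5, 4], [6, 3], [5, 3], [4, 3], [6, 4]], dtype=float),
    "120 min": np.array([[4, 2], [3, 2], [5, 3], [4, 1], [3, 2]], dtype=float),
}

all_rows = np.vstack(list(groups.values()))
grand_mean = all_rows.mean(axis=0)
p = all_rows.shape[1]

E = np.zeros((p, p))  # within-group (error) sums of squares and cross-products
H = np.zeros((p, p))  # between-group (hypothesis) sums of squares and cross-products
for data in groups.values():
    centered = data - data.mean(axis=0)
    E += centered.T @ centered
    shift = (data.mean(axis=0) - grand_mean).reshape(-1, 1)
    H += len(data) * (shift @ shift.T)

# Wilks' lambda: values near 0 suggest group differences, near 1 suggest none.
wilks = np.linalg.det(E) / np.linalg.det(E + H)

# Exact F transformation, valid only when there are two dependent variables.
N, k = len(all_rows), len(groups)
root = np.sqrt(wilks)
f_value = ((1 - root) / root) * ((N - k - 1) / (k - 1))
p_value = stats.f.sf(f_value, 2 * (k - 1), 2 * (N - k - 1))

print(f"Wilks' lambda = {wilks:.4f}, F = {f_value:.2f}, p = {p_value:.4f}")
```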
Choosing Statistical Tests
• Mann-Whitney U test
• The Mann-Whitney U test is used to compare differences
between two independent groups when the dependent
variable is either ordinal or continuous, but not normally
distributed.
• Example;
• Use the Mann-Whitney U test to understand whether
attitudes towards pay discrimination, where attitudes are
measured on an ordinal scale, differ based on gender
(i.e., your dependent variable would be "attitudes
towards pay discrimination" and your independent
variable would be "gender", which has two groups:
"male" and "female").
Choosing Statistical Tests
• Kruskal-Wallis H test
• The Kruskal-Wallis test is the nonparametric test equivalent to
the one-way ANOVA, and an extension of the Mann-Whitney
U test to allow the comparison of more than two independent
groups.
• It is used when we wish to compare three or more sets of
scores that come from different groups.
• Example;
• Use a Kruskal-Wallis test to understand whether exam
performance differed based on exam anxiety levels amongst
students, dividing students into three independent groups
(e.g., low, medium and high).
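Because the Kruskal-Wallis test extends the Mann-Whitney U test, the two rank-based tests can be sketched together, assuming SciPy. The attitude ratings (1 to 5 scale) and exam scores are hypothetical values used only to show the calls.

```python
from scipy import stats

# Mann-Whitney U: two independent groups, ordinal dependent variable.
# Hypothetical attitude ratings towards pay discrimination (1 = low, 5 = high).
male_ratings = [2, 3, 3, 4, 2, 3, 1, 2, 3, 2]
female_ratings = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]

mwu = stats.mannwhitneyu(male_ratings, female_ratings, alternative="two-sided")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.4f}")

# Kruskal-Wallis H: three or more independent groups.
# Hypothetical exam scores by anxiety level.
low = [78, 82, 75, 80, 85, 79]
medium = [70, 74, 68, 72, 75, 71]
high = [60, 65, 58, 62, 66, 61]

kw = stats.kruskal(low, medium, high)
print(f"Kruskal-Wallis H = {kw.statistic:.2f}, p = {kw.pvalue:.4f}")
```

Like the one-way ANOVA, a significant Kruskal-Wallis result says only that at least one group differs; it does not say which one.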
Choosing Statistical Tests
• Dependent T-Test (Paired-samples t-test)
• The dependent t-test (also called the paired-samples t-test) compares
the means between two related groups on the same continuous,
dependent variable.
• Example;
• Use a dependent t-test to understand whether there was a
difference in smokers' daily cigarette consumption before and
after a 6-week anti-smoking programme (i.e., your dependent
variable would be "daily cigarette consumption", and your two
related groups would be the cigarette consumption values
"before" and "after" the anti-smoking programme).
Choosing Statistical Tests
• Assumptions
• Assumption #1: Your dependent variable should be measured
at the interval or ratio level.
• Assumption #2: Your independent variable should consist of
two categorical, "related groups" or "matched pairs".
• Assumption #3: There should be no significant outliers in the
differences between the two related groups.
• Assumption #4: The distribution of the differences in the
dependent variable between the two related groups should
be approximately normally distributed.
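For the paired test, the assumptions apply to the differences between the two related measurements rather than to each group separately. A minimal sketch, assuming NumPy and SciPy, with hypothetical cigarette counts for eight smokers:

```python
import numpy as np
from scipy import stats

# Hypothetical daily cigarette consumption for the same eight smokers.
before = np.array([20, 15, 25, 18, 22, 30, 12, 17])
after = np.array([15, 14, 20, 16, 18, 24, 12, 13])

# The paired test works on the per-person differences.
differences = before - after

# Assumption #4: the differences should be approximately normal.
normality = stats.shapiro(differences)
print(f"Shapiro-Wilk on differences: p = {normality.pvalue:.3f}")

# Dependent (paired-samples) t-test on the two related measurements.
result = stats.ttest_rel(before, after)
print(f"mean difference = {differences.mean():.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The order of the two arrays matters only for the sign of t; each position must refer to the same person in both arrays.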
Choosing Statistical Tests
• ANCOVA
• The ANCOVA (analysis of covariance) can be thought of as
an extension of the one-way ANOVA to incorporate a
"covariate".
• Like the one-way ANOVA, the ANCOVA is used to determine
whether there are any significant differences between the
means of two or more independent (unrelated) groups.
• However, the ANCOVA has the additional benefit of allowing
you to "statistically control" for a third variable (sometimes
known as a "confounding variable"), which may be
negatively affecting your results.
• This third variable that could be confounding your results is the
"covariate" that you include in an ANCOVA.
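One way to see what "statistically control" means is to compare two regression models: one that predicts the outcome from the covariate alone, and one that adds the group variable. The sketch below, assuming NumPy and SciPy, tests whether adding the groups reduces the unexplained variation. The scenario (three hypothetical teaching methods, a prior score as covariate) and all numbers are invented for illustration, and the model assumes the covariate has the same slope in every group.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of five students each.
group = np.array([0] * 5 + [1] * 5 + [2] * 5)
prior_score = np.array([50, 55, 60, 65, 70, 52, 57, 61, 66, 72,
                        49, 54, 59, 64, 71], dtype=float)  # the covariate
exam_score = np.array([55, 61, 64, 71, 75, 62, 66, 72, 76, 81,
                       64, 70, 73, 80, 86], dtype=float)   # dependent variable

def residual_ss(design, y):
    """Residual sum of squares of a least-squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

n = len(exam_score)
intercept = np.ones(n)
# One 0/1 indicator column per group, leaving the first group as baseline.
dummies = np.column_stack(
    [(group == g).astype(float) for g in np.unique(group)[1:]]
)

reduced = np.column_stack([intercept, prior_score])        # covariate only
full = np.column_stack([intercept, prior_score, dummies])  # covariate + groups

rss_reduced = residual_ss(reduced, exam_score)
rss_full = residual_ss(full, exam_score)

# F-test for the group effect after adjusting for the covariate.
df_groups = dummies.shape[1]
df_error = n - full.shape[1]
f_value = ((rss_reduced - rss_full) / df_groups) / (rss_full / df_error)
p_value = stats.f.sf(f_value, df_groups, df_error)

print(f"Group effect adjusted for prior score: F = {f_value:.2f}, p = {p_value:.4f}")
```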
Statistical Data Analysis
Moses Mugolo Kasolo
Global Information Systems Ltd
• Software Development
• Data Analysis
• Specialized ICT Training