CUE Forum 2007 Basic SLA Statistics for the University Educator Peter Neff Harumi Kimura Philip McNally Matthew Apple © 2007 JALT CUE SIG and individual presenters Purpose Not how.
Slide 1
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
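As a minimal sketch of these two descriptive statistics, the snippet below computes the mean and the sample standard deviation for one group. The scores are made up for illustration and are not taken from any study discussed here.

```python
import statistics

# Hypothetical vocabulary test scores for one class (made-up data).
scores = [12, 15, 15, 16, 18, 20, 23]

n = len(scores)
mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample SD: divides by n - 1

print(f"N = {n}, M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting N alongside M and SD lets readers judge how much weight the summary can bear.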
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
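One common way to express the position of an individual within a group is a z-score: the distance from the group mean in SD units. The sketch below assumes the scores are roughly normally distributed; the mean, SD, and score are hypothetical values.

```python
from statistics import NormalDist

# Hypothetical group summary and one learner's score (made-up values).
group_mean = 17.0
group_sd = 3.65
score = 23

# z-score: how many SDs the score lies above (+) or below (-) the mean.
z = (score - group_mean) / group_sd

# Under a normal curve, the share of the group expected to score lower.
share_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, about {share_below:.0%} of the group scores lower")
```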
How normal?
Not symmetrical
Flat
or
Peaked
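"Not symmetrical" is usually quantified as skewness, and "flat or peaked" as kurtosis. The sketch below uses simple moment-based formulas (the population versions, which is an assumption here; statistical packages such as SPSS apply small-sample corrections and will give slightly different values). The data are made up.

```python
import statistics

def skewness(xs):
    """Moment-based skewness: 0 for a symmetrical distribution."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3: negative = flat, positive = peaked."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetrical = [1, 2, 3, 4, 5]       # hypothetical scores
right_skewed = [1, 1, 1, 2, 10]     # one very high score pulls the tail right

print(skewness(symmetrical), excess_kurtosis(symmetrical))
print(skewness(right_skewed))
```

Values near zero on both measures suggest the distribution is close to normal; large values are a warning sign for procedures that assume normality.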
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and
sometimes forgotten by researchers.” (Brown, 1988)
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
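To make the two designs concrete, the sketch below computes the t statistic both ways using textbook formulas: the paired version works on each learner's difference score, while the independent version pools the variance of two separate groups (an equal-variance assumption). All scores are hypothetical, and the function names are illustrative only.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples t: mean difference / standard error of the differences."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1          # (t, df)

def independent_t(group_a, group_b):
    """Independent samples t with pooled variance (assumes equal variances)."""
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_b) - statistics.mean(group_a)) / se
    return t, n1 + n2 - 2                              # (t, df)

# Made-up vocabulary test scores.
first = [10, 12, 9, 14, 11]
second = [13, 14, 10, 17, 13]

t_paired, df_paired = paired_t(first, second)     # same learners, two times
t_ind, df_ind = independent_t(first, second)      # treated as two groups

print(f"paired: t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_ind:.2f}, df = {df_ind}")
```

The same numbers give a much larger t in the paired design, because each learner serves as their own control. This is one reason a report should always state which type of t-test was run.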
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area: t-test result
Reading: t = 4.91, df = 114, p = .001*
Speaking: t = 2.28, df = 114, p = .024
Writing: t = 2.30, df = 114, p = .023
Listening: t = 4.12, df = 114, p = .001*
Spelling: t = 3.74, df = 114, p = .001*
General learning: t = 3.61, df = 114, p = .001*
Homework: t = 2.92, df = 114, p = .004*
Textbook: t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
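Alpha of .05 = accepting a 5% risk of a Type I error on each comparison when no real difference exists
20 comparisons = about 1 "significant" result expected by chance alone
100 comparisons = about 5 expected by chance alone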
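The arithmetic behind these figures can be sketched as follows. The sketch assumes the comparisons are independent and that no real differences exist; the function names are illustrative only.

```python
def expected_false_positives(comparisons, alpha=0.05):
    """Average number of chance 'significant' results across the comparisons."""
    return comparisons * alpha

def familywise_error_rate(comparisons, alpha=0.05):
    """Probability of at least one Type I error across the comparisons."""
    return 1 - (1 - alpha) ** comparisons

for m in (1, 5, 20, 100):
    print(m, expected_false_positives(m), round(familywise_error_rate(m), 3))
```

With 20 comparisons at an unadjusted alpha of .05, the chance of at least one false positive is roughly 64%, which is why some adjustment is needed.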
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
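As a worked illustration, the sketch below applies a Bonferroni adjustment to the eight p values listed on the Macaro & Erler (2007) slide. Pairing each p value with an area in the order listed is an assumption made here for illustration.

```python
# p values as listed on the slide; the area-to-value pairing follows
# the listed order and is an assumption.
reported_p = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

alpha = 0.05
adjusted_alpha = alpha / len(reported_p)   # 0.05 / 8 comparisons

significant = [area for area, p in reported_p.items() if p < adjusted_alpha]

print(f"adjusted alpha = {adjusted_alpha}")
print("significant after adjustment:", significant)
```

The adjusted alpha of .00625 is consistent with the "*p < .006" note on the slide: Speaking and Writing fall below .05 but not below the adjusted level.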
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
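For readers who want to see where the F-statistic comes from, here is a minimal sketch of the standard one-way ANOVA calculation: variance between the group means divided by variance within the groups. The three groups echo the vocabulary example, but the scores are made up and deliberately tiny.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)

    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores (made-up data).
word_cards = [4, 5, 6]
word_lists = [5, 6, 7]
pc_software = [8, 9, 10]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The p value is then found by comparing F against the F distribution with these two degrees of freedom, which statistical software reports automatically.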
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the share of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
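Eta-squared is commonly computed as the between-groups sum of squares divided by the total sum of squares. The sketch below uses that formula with made-up scores and labels the result using the Cohen (1988) benchmarks; treating the benchmarks as lower bounds of each band is an assumption made here.

```python
import statistics

def eta_squared(groups):
    """Between-groups sum of squares as a share of the total sum of squares."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta2):
    """Rough verbal label, treating Cohen's benchmarks as lower bounds."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"

# Hypothetical scores for three groups (made-up data).
groups = [[4, 5, 6], [5, 6, 7], [8, 9, 10]]
effect = eta_squared(groups)

print(f"eta-squared = {effect:.2f} ({cohen_label(effect)})")
print(cohen_label(0.02))   # the value quoted in the example write-up
```

A result can be statistically significant and still carry a small label, as in the quoted write-up where eta-squared was .02.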
Common ANOVA
Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; its square is the
amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
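Cronbach's alpha can be sketched with its standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses below are made up, with one row per respondent and one column per item.

```python
import statistics

def cronbach_alpha(responses):
    """responses: list of rows, one per respondent, one column per item."""
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # regroup the scores by item
    item_variances = sum(statistics.variance(item) for item in items)
    totals = [sum(row) for row in responses]    # each respondent's scale total
    return (k / (k - 1)) * (1 - item_variances / statistics.variance(totals))

# Hypothetical 5-point Likert answers from five respondents to three items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because k appears in the formula, adding more items tends to raise alpha even when the items are only modestly correlated.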
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
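The participants-per-item rule of thumb is simple multiplication, sketched below. The function name and its defaults are illustrative only.

```python
def recommended_n(items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by the 5-10 per item rule of thumb."""
    return items * per_item_low, items * per_item_high

low, high = recommended_n(30)   # the 30-item questionnaire example
print(f"{low}-{high} participants")
```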
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
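The shared-variance figures quoted for Item 43 and Item 38 come from squaring the factor loading. A minimal sketch of that arithmetic, using the .4 threshold cited from Stevens (1992); the function name is illustrative only:

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2

for item, loading in (("Item 43", 0.63), ("Item 38", 0.33)):
    verdict = "acceptable" if loading > 0.4 else "weak"
    print(f"{item}: loading {loading:.2f} -> "
          f"{shared_variance(loading):.0%} shared variance ({verdict})")
```

A loading of .33 means roughly nine tenths of that item's variance has nothing to do with the factor, which is why low cut-off points deserve scrutiny.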
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 3
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
however, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error
than repeated t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
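As a rough sketch of what the ANOVA does with such test results, the code below computes the F-statistic by hand. The scores are invented for illustration and the function name is hypothetical; in practice a statistics package would also supply the p-value.

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a list of groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical vocabulary test scores
word_cards = [14, 16, 15, 17, 13]
word_lists = [12, 13, 11, 14, 10]
pc_software = [15, 17, 16, 18, 14]

result = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({result['df_between']}, {result['df_within']}) = {result['F']:.2f}")
```

A larger F means the group means differ more than the spread of scores inside each group would lead us to expect.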
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Table from a published study, with callouts marking the descriptive statistics and the ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
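A small sketch of how eta-squared is obtained and read against Cohen's (1988) guidelines. The function names and the "negligible" label for values under .01 are assumptions made for this example, and the sums of squares are invented.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch only


# Hypothetical sums of squares from an ANOVA table
example = eta_squared(ss_between=12.0, ss_total=600.0)
print(round(example, 2), cohen_label(example))
```

An eta-squared of .02, as in the quoted report, falls in the small range even though the F-statistic reached significance.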
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or
sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors:
Integrative (items 1 and 4)
Instrumental (items 2 and 5)
Self-competence (items 3 and 6)
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the
amount of variance they share
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more
items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
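The participants-per-item rule of thumb is simple multiplication, sketched here with a hypothetical helper function:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a 5-10 per item guideline."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```

A 30-item questionnaire therefore calls for roughly 150 to 300 respondents under this guideline.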
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
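The shared-variance figures on these slides come from squaring the loading. A short sketch, using the two loadings quoted in the talk and the .4 cut-off attributed to Stevens (1992); the function names are illustrative only:

```python
def shared_variance_percent(loading):
    """Square the loading to get the percentage of variance shared with the factor."""
    return loading ** 2 * 100


def meets_cutoff(loading, cutoff=0.4):
    """Check a loading against a chosen cut-off (.4 here, per the slide)."""
    return abs(loading) >= cutoff


for item, loading in (("Item 43", 0.63), ("Item 38", 0.33)):
    print(item,
          f"{shared_variance_percent(loading):.0f}% shared variance,",
          "acceptable" if meets_cutoff(loading) else "below cut-off")
```

Item 38 shares only about a tenth of its variance with the factor, so keeping it says more about the cut-off chosen than about the item.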
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 5
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, or 40% of shared variance with the factor
(A loading above .4 is acceptable according to Stevens, 1992)
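The arithmetic on this slide is simply the loading squared. A small sketch, with the helper name `shared_variance` being an assumption for illustration, applies it to the Item 43 loading and to the .4 and .7 benchmarks mentioned earlier:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


for label, loading in [
    ("Item 43 on F1", 0.63),
    ("'desirable' benchmark", 0.40),
    ("'excellent' benchmark", 0.70),
]:
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```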
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
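The participants-per-item rule of thumb is easy to check for any questionnaire length. The sketch below assumes the 5-10 ratio quoted above; the function name `participants_needed` is hypothetical.

```python
def participants_needed(num_items, low_ratio=5, high_ratio=10):
    """Return the (minimum, preferred) N under a participants-per-item rule of thumb."""
    return num_items * low_ratio, num_items * high_ratio


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```

This covers only the per-item ratio; the “over 300” guideline is a separate, absolute threshold.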
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, or 11% of shared variance with the factor
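One way to catch items like this before writing up is to flag any item whose strongest loading falls below the chosen cut-off. The sketch below is only an illustration: the function name `flag_weak_items` and the .4 cut-off are assumptions, and the dictionary holds only the loadings quoted on these slides, not a full loading matrix.

```python
def flag_weak_items(loadings, cutoff=0.4):
    """Return items whose strongest absolute loading is below the cutoff.

    loadings: dict mapping item name -> dict of factor name -> loading.
    """
    weak = {}
    for item, by_factor in loadings.items():
        # Find the factor on which this item loads most strongly
        factor, value = max(by_factor.items(), key=lambda pair: abs(pair[1]))
        if abs(value) < cutoff:
            weak[item] = (factor, value)
    return weak


# Only the loadings quoted on the slides; other loadings are unknown
loadings = {
    "Item 43": {"F1": 0.63},
    "Item 38": {"F1": 0.29, "F3": 0.33},
}
for item, (factor, value) in flag_weak_items(loadings).items():
    print(f"{item}: best loading {value:.2f} on {factor}, {value ** 2:.0%} shared variance")
```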
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 8
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Reading: t = 4.91, df = 114, p = .001*
Speaking: t = 2.28, df = 114, p = .024
Writing: t = 2.30, df = 114, p = .023
Listening: t = 4.12, df = 114, p = .001*
Spelling: t = 3.74, df = 114, p = .001*
General learning: t = 3.61, df = 114, p = .001*
Homework: t = 2.92, df = 114, p = .004*
Textbook: t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% chance of a “significant” result when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
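The arithmetic behind these figures can be sketched as below. The function name is illustrative, and the familywise calculation assumes the comparisons are independent, which real questionnaire subscales often are not.

```python
def familywise_error(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

alpha = 0.05
for m in (1, 5, 8, 20, 100):
    expected = alpha * m                   # expected number of chance "hits"
    at_least_one = familywise_error(alpha, m)
    print(f"{m:>3} comparisons: {expected:.2f} expected by chance, "
          f"P(at least one) = {at_least_one:.2f}")
```

With 20 comparisons the chance of at least one spurious result is already well over half, which is why an unadjusted alpha is risky.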
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
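As a hedged illustration, the same adjustment can be applied to the eight attitude results listed for Macaro & Erler (2007) earlier in this section. Pairing each area with a p-value in the order listed is an assumption made here, because the original table was flattened in the transcript.

```python
alpha = 0.05

# The slide's own example: five comparisons.
print(f"0.05 / 5 = {alpha / 5:.2f}")

# p-values as listed in the slides; the area-to-value pairing is assumed.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023,
    "Listening": 0.001, "Spelling": 0.001, "General learning": 0.001,
    "Homework": 0.004, "Textbook": 0.005,
}

adjusted_alpha = alpha / len(p_values)   # Bonferroni: alpha / no. of comparisons
significant = [area for area, p in p_values.items() if p < adjusted_alpha]

print(f"Adjusted alpha for {len(p_values)} comparisons: {adjusted_alpha:.5f}")
print("Still significant:", ", ".join(significant))
```

Under this pairing, the two results with p-values of .024 and .023 would pass an unadjusted .05 threshold but not the adjusted one, which matches the asterisks shown on the slide.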
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control Group, Treatment 1 Group, and Treatment 2 Group]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control Group, Treatment 1 Group, and Treatment 2 Group]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while introducing less error than a series of t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
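The vocabulary example can be sketched as a hand-computed one-way ANOVA. All scores are invented, and the variable names are illustrative; the calculation follows the standard between-groups and within-groups sums of squares.

```python
import statistics

# Hypothetical vocab test scores for the three study methods (invented data).
groups = {
    "word cards":  [24, 27, 25, 29, 26],
    "word lists":  [20, 22, 19, 23, 21],
    "PC software": [25, 28, 27, 30, 26],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_scores)

# Between-groups: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-groups: how far each score sits from its own group mean.
ss_within = sum((x - statistics.mean(g)) ** 2
                for g in groups.values() for x in g)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would suggest. Getting the p-value needs an F-distribution table or a package; SciPy's `scipy.stats.f_oneway` is one option if available.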
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: tables labeled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
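A small sketch of how eta-squared can be computed and labeled with the benchmarks above. The ratings are invented, the function names are illustrative, and the "negligible" label for values under .01 is an assumption added here rather than part of the slide.

```python
import statistics

def eta_squared(groups):
    """Eta-squared = between-groups sum of squares / total sum of squares."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    return ss_between / ss_total

def cohen_label(eta2):
    """Label an eta-squared value with the .01 / .06 / .14 benchmarks."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"   # label assumed here, not from the slide

# Hypothetical 1-5 satisfaction ratings for three peer review types.
ratings = [[3, 4, 4, 5, 3], [3, 3, 4, 4, 2], [4, 4, 5, 3, 4]]
eta2 = eta_squared(ratings)
print(f"eta-squared = {eta2:.2f} ({cohen_label(eta2)} effect)")
```

Because eta-squared is a proportion of variance, it can be small even when the F-statistic is significant, especially with large samples.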
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
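A rough sketch of how Cronbach's alpha is calculated, using the standard formula based on item variances and the variance of respondents' total scores. The Likert responses and the function name are invented for illustration.

```python
import statistics

# Rows = respondents, columns = items (hypothetical 1-5 Likert responses).
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
]

def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # one tuple of answers per item
    sum_item_vars = sum(statistics.variance(col) for col in items)
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum_item_vars / total_var)

alpha_value = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha_value:.2f}")
```

A high alpha only shows that the items vary together; as noted above, it says nothing about whether they measure what the researcher intended.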
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
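The participants-per-item rule of thumb is simple arithmetic and can be sketched as below. The function name and its default arguments are illustrative only.

```python
def recommended_n(items, per_item_low=5, per_item_high=10):
    """Return the (low, high) participant range for a questionnaire."""
    return items * per_item_low, items * per_item_high

low, high = recommended_n(30)
print(f"30-item questionnaire: {low}-{high} participants")
```

The same function shows how quickly the requirement grows: a 50-item questionnaire would call for 250-500 participants under this rule.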
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
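The squared-loading arithmetic for the loadings quoted in these slides can be checked with a short sketch. The function name is illustrative.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2

# Loadings quoted in the slides for Item 43 and Item 38.
examples = [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33), ("Item 38 on F1", 0.29)]

for label, loading in examples:
    percent = shared_variance(loading) * 100
    print(f"{label}: loading {loading:.2f} -> {percent:.0f}% shared variance")
```

Squaring makes the gap between loadings look larger than the raw numbers suggest: a loading of .33 shares only about a quarter as much variance with its factor as a loading of .63.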
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 10
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
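For readers who want to see what goes into Cronbach’s alpha, the sketch below uses the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the Likert responses are hypothetical, invented for illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Return Cronbach's alpha.

    `responses` is a list of rows, one row per participant,
    each row holding that participant's score on every item.
    """
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variances = [variance(scores) for scores in items]
    total_scores = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(total_scores))


# Hypothetical answers from five participants to three Likert items.
likert = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
print(round(cronbach_alpha(likert), 3))
```

A high value here only says that the items vary together. As noted above, it says nothing about whether the items are valid measures of the intended construct.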
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
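The arithmetic on this slide can be checked with a short sketch: squaring a loading gives the proportion of variance the item shares with the factor. The function name is hypothetical.

```python
def shared_variance(loading):
    """Return the proportion of variance an item shares with a factor."""
    return loading ** 2


# Item 43 loads .63 on F1, so it shares about 40% of its variance with F1.
print(f"{shared_variance(0.63):.0%}")

# A loading right at the .4 threshold corresponds to about 16% shared variance.
print(f"{shared_variance(0.40):.0%}")
```

Seen this way, a loading that looks moderate can correspond to a fairly small share of variance.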
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
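The participants-per-item guideline can be turned into a quick check when planning a questionnaire. This is a minimal sketch of the 5-10 participants per item rule of thumb as stated on the slide; the function name and default values are assumptions for illustration.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by the
    5-10 participants per item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```

For the 30-item example this reproduces the 150-300 range, which is still at or below the “over 300” guideline quoted alongside it.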
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 13
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
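The paired design above can be sketched in a few lines. The function name and the scores are hypothetical; the calculation is the standard paired samples t (mean of the difference scores divided by its standard error).

```python
import math
import statistics

def paired_t(time1, time2):
    """Return (t, df) for a paired samples t-test on two score lists."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1

# Hypothetical vocab test scores for the same five learners.
time1 = [10, 12, 9, 14, 11]   # before extensive reading
time2 = [13, 14, 10, 18, 13]  # after extensive reading

t, df = paired_t(time1, time2)
print(f"t = {t:.2f}, df = {df}")
```

The t value still has to be compared against the t distribution with the given df to obtain a p value, which a statistics package normally does for you.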
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
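For the two-group design, a comparable sketch uses the classic pooled-variance (Student's) formula. This assumes roughly equal variances in the two groups; the function name and scores are hypothetical.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test with pooled variance."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))  # standard error of the mean difference
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, na + nb - 2

# Hypothetical grammar test scores.
group_a = [14, 15, 17, 18, 16]  # implicit grammar
group_b = [11, 13, 12, 15, 14]  # explicit grammar

t, df = independent_t(group_a, group_b)
print(f"t = {t:.2f}, df = {df}")
```

The degrees of freedom are the two group sizes added together minus 2, so groups of 62 and 54 give df = 114.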
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 (95% level) = a 5% chance of a significant result when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
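The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the second formula assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of comparisons that reach significance by chance alone."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 5, 20, 100):
    print(k, expected_false_positives(0.05, k), round(familywise_error_rate(0.05, k), 3))
```

With 20 unadjusted comparisons, the chance of at least one spurious "significant" result is already well over half.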
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
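As a worked illustration, the sketch below applies the same adjustment to the eight p values quoted from Macaro & Erler (2007). The function name is hypothetical, and pairing each p value with an area by slide order is an assumption.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level for a set of multiple comparisons."""
    return alpha / n_comparisons

# p values as quoted on the slide, matched to areas by their order.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print(significant)
```

With eight comparisons the adjusted alpha is .05 / 8 = .00625, which is consistent with the *p < .006 cut-off shown for that study. Speaking and Writing fall below .05 but not below the adjusted level.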
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means M1, M2, and M3]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means M1, M2, and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
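To show where the F-statistic comes from, here is a minimal sketch that computes it by hand as the ratio of between-group to within-group variance. The function name and the three sets of scores are hypothetical; in practice a package such as SPSS, or SciPy's `scipy.stats.f_oneway`, would also supply the p value.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores for three learning groups.
word_cards = [14, 15, 17, 18, 16]
word_lists = [11, 13, 12, 15, 14]
pc_software = [15, 16, 18, 19, 17]

f, df_b, df_w = one_way_anova_f([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A larger F means the group means differ more than the spread of scores inside each group would lead us to expect.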
ANOVA in the Literature
[Figure: published results tables, labeled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
  .01 – small effect
  .06 – medium effect
  .14 – large effect
  *According to Cohen (1988)
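Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The function names, the "negligible" label for values under .01, and the scores are hypothetical; the three cut-offs are the Cohen (1988) guidelines quoted above.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the cut-offs quoted from Cohen (1988)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch, not from the slides

# Hypothetical scores for three groups.
groups = [[14, 15, 17, 18, 16], [11, 13, 12, 15, 14], [15, 16, 18, 19, 17]]
value = eta_squared(groups)
print(f"eta-squared = {value:.2f} ({effect_label(value)})")
```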
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the
amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more
items, the greater the alpha tends to be)
– Does not “prove” cause-effect or validity of items
themselves
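For readers who want to see how Cronbach's alpha is built, the sketch below uses the standard formula based on item variances and the variance of the total score. The function name and the Likert responses are hypothetical.

```python
import statistics

def cronbach_alpha(responses):
    """responses: one row per participant, one column per questionnaire item."""
    k = len(responses[0])                # number of items
    items = list(zip(*responses))        # regroup scores item by item
    item_variances = sum(statistics.variance(item) for item in items)
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-point Likert answers: five participants, three items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high value here only shows that the items vary together; it says nothing about whether they measure what the researcher intended.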
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
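The squaring step can be checked with a short sketch. The function names are hypothetical; the loadings are the values quoted in these slides, and the .4 cut-off is the one attributed to Stevens (1992).

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2

def meets_cutoff(loading, cutoff=0.4):
    """True if the loading reaches the chosen cut-off (default .4)."""
    return abs(loading) >= cutoff

for loading in (0.63, 0.33, 0.29):
    pct = shared_variance(loading) * 100
    print(f"loading {loading:.2f}: {pct:.0f}% shared variance, acceptable: {meets_cutoff(loading)}")
```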
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
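The participants-per-item rule of thumb is simple multiplication, sketched here with a hypothetical helper function. The default ratios of 5 and 10 are the ones cited above from Field (2005).

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) suggested sample size for a questionnaire."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```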
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 15
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
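As a rough illustration of the effect size described above, eta-squared can be computed as the between-groups sum of squares divided by the total sum of squares, then compared with the .01 / .06 / .14 benchmarks. The sketch below assumes invented scores, and the helper names `eta_squared` and `cohen_label` are hypothetical. The label for values under .01 is this sketch's own choice, not part of the benchmarks.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Map eta-squared onto the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "below small"  # assumption: the benchmarks do not name this range


# Invented scores for three groups (illustration only).
scores = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
value = eta_squared(scores)
print(f"eta-squared = {value:.2f} ({cohen_label(value)} effect)")
```

With real classroom data the value is usually far lower than in this toy example, which is why a significant F can still go with a small effect.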
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs
within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading gives the
amount of shared variance between the item and the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things
being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
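The point that alpha rises with the number of items can be seen from the standardized form of Cronbach's alpha, which depends only on the item count k and the mean inter-item correlation. The sketch below is an illustration under assumptions: the function name `standardized_alpha` is hypothetical, the correlation of .30 is invented, and the standardized formula is not necessarily the variant a given statistics package reports by default.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha: k * r / (1 + (k - 1) * r)."""
    k, r = n_items, mean_inter_item_r
    return (k * r) / (1 + (k - 1) * r)


# Hold the mean inter-item correlation fixed (invented value of .30)
# and vary only the number of items.
for k in (3, 6, 8):
    print(f"{k} items: alpha = {standardized_alpha(k, 0.30):.2f}")
```

Because alpha climbs as items are added even when the items are no more strongly related to each other, a high alpha on a long scale is weaker evidence than the same alpha on a short one.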
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation
to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. about 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
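The arithmetic behind the shared-variance figure is just the loading squared. A minimal sketch, with a hypothetical helper name, applies it to the Item 43 loading and to the .4 and .7 rule-of-thumb values mentioned under Terminology:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor: the squared loading."""
    return loading ** 2


# .63 is the Item 43 loading; .40 and .70 are the rule-of-thumb values.
for loading in (0.63, 0.40, 0.70):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Seen this way, a loading of .4 corresponds to only about 16% shared variance, which helps explain why loadings in the low .3 range are treated with suspicion.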
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
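The participants-per-item guideline is simple multiplication, sketched here so it can be reused for questionnaires of other lengths. The function name and default arguments are assumptions made for this illustration; the 5 and 10 come from the guideline quoted above.

```python
def participants_range(n_items, per_item_min=5, per_item_max=10):
    """Suggested N range under a participants-per-item rule of thumb."""
    return n_items * per_item_min, n_items * per_item_max


low, high = participants_range(30)
print(f"30 items: {low}-{high} participants")  # prints: 30 items: 150-300 participants
```

Note that the lower end of this range falls well short of the separate "over 300" recommendation, so the two guidelines can disagree for shorter questionnaires.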
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 18
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean - central tendency
Standard Deviation - variability from the mean
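As a rough illustration of these two measures, the sketch below computes a mean and a standard deviation with Python's standard library. The scores are hypothetical and not taken from any study on the slides.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustration only).
scores = [12, 15, 15, 16, 18, 20, 23]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample standard deviation (divides by n - 1)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

A small SD means most scores sit close to the mean; a large SD means the group is more spread out.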
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
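One way to put numbers on "not symmetrical" and "flat or peaked" is to compute skewness and kurtosis. The sketch below uses the simple moment-based formulas; statistics packages may apply small-sample corrections, so their values can differ slightly. The function names and data are hypothetical.

```python
import statistics

def skewness(xs):
    """Moment-based skewness: about 0 when the distribution is symmetrical."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Moment-based excess kurtosis: below 0 is flatter, above 0 more peaked, than a normal curve."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical scores
right_skewed = [1, 1, 1, 2, 10]    # one very high score pulls the tail to the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```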
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
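A minimal sketch of the arithmetic behind a paired samples t-test, assuming hypothetical scores for the same six learners at Time 1 and Time 2. It only computes t and df; the p-value would normally come from a t table or from software such as SPSS.

```python
import math
import statistics

# Hypothetical vocab test scores for the same 6 learners (illustration only).
time1 = [10, 12, 9, 14, 11, 13]    # before extensive reading
time2 = [13, 14, 10, 17, 12, 15]   # after extensive reading

# A paired test works on each learner's difference score.
diffs = [after - before for before, after in zip(time1, time2)]
n = len(diffs)

# t = mean difference / standard error of the mean difference
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
df = n - 1

print(f"t = {t:.2f}, df = {df}")
```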
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
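For two separate groups the calculation changes slightly. The sketch below assumes hypothetical test scores and the classic pooled-variance (Student's) version of the independent samples t-test; it is an illustration, not the method of any particular study.

```python
import math
import statistics

# Hypothetical test scores (illustration only).
group_a = [14, 15, 13, 16, 12]   # implicit grammar
group_b = [17, 18, 16, 19, 15]   # explicit grammar

na, nb = len(group_a), len(group_b)
va, vb = statistics.variance(group_a), statistics.variance(group_b)

# Pool the two sample variances, weighting each by its degrees of freedom.
pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
std_error = math.sqrt(pooled_var * (1 / na + 1 / nb))

t = (statistics.mean(group_a) - statistics.mean(group_b)) / std_error
df = na + nb - 2

print(f"t = {t:.2f}, df = {df}")
```

The sign of t only shows which group had the higher mean; its size relative to df decides significance.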
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
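The asterisks above can be checked with a few lines of code. This sketch assumes the t-test results are listed in the same order as the eight areas, and that the .006 cut-off is a Bonferroni-adjusted alpha (.05 divided by 8 comparisons); both are readings of the slide rather than statements from the original study.

```python
# p-values as shown on the slide, paired with the areas in slide order
# (the pairing is an assumption based on that order).
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

n_treatment, n_control = 62, 54
df = n_treatment + n_control - 2     # independent samples: N1 + N2 - 2

alpha = 0.05 / len(p_values)         # assumed Bonferroni adjustment: .05 / 8

significant = [area for area, p in p_values.items() if p < alpha]

print(f"df = {df}, adjusted alpha = {alpha:.5f}")
print("Significant after adjustment:", significant)
```

Under these assumptions, Speaking and Writing would pass an unadjusted .05 level but not the adjusted one.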
Types of error
Type I error
You think you’ve got significance, but you haven’t (a false positive).
This is more likely if you made multiple comparisons without adjusting your alpha value.
Type II error
You think the difference between the means was by chance, but it wasn’t (a false negative).
This can happen when, after you adjust for multiple comparisons, the data fail to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% risk of calling a difference significant when it is only chance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
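The arithmetic behind these figures can be sketched as follows, assuming every comparison is independent and that no real differences exist. The function names are hypothetical.

```python
ALPHA = 0.05

def expected_false_positives(n_comparisons, alpha=ALPHA):
    """Average number of comparisons that reach significance by chance alone."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=ALPHA):
    """Chance of at least one false positive across all comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 5, 20, 100):
    print(m, expected_false_positives(m), round(familywise_error_rate(m), 3))
```

With 20 unadjusted comparisons the chance of at least one false positive is already well over half, which is why the alpha level needs adjusting.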
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups, plotted close together]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means for the Control, Treatment 1, and Treatment 2 groups, with M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while carrying less risk of Type I error than running several t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
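To show what the analysis actually computes, here is a minimal sketch of a one-way ANOVA F-statistic for three groups like those above. The scores are hypothetical; the p-value for F would normally be read from an F table or from software.

```python
import statistics

# Hypothetical vocab test scores (illustration only).
groups = {
    "word cards": [20, 22, 19, 24, 25],
    "word lists": [18, 17, 21, 16, 18],
    "PC software": [23, 25, 22, 26, 24],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = statistics.fmean(all_scores)

# Between-groups sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(scores) * (statistics.fmean(scores) - grand_mean) ** 2
    for scores in groups.values()
)
# Within-groups sum of squares: how far each score is from its own group mean.
ss_within = sum(
    (x - statistics.fmean(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread of scores inside each group would lead us to expect.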
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Tables from a published study: descriptive statistics and ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01: small effect
– .06: medium effect
– .14: large effect
*According to Cohen (1988)
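Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it for hypothetical scores and attaches the Cohen (1988) labels listed above; the function names and the "negligible" label for values under .01 are assumptions for illustration.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for scores in groups for x in scores]
    grand_mean = statistics.fmean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(scores) * (statistics.fmean(scores) - grand_mean) ** 2
        for scores in groups
    )
    return ss_between / ss_total

def effect_label(eta_sq):
    """Benchmarks from the slide: .01 small, .06 medium, .14 large."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"

# Hypothetical scores for three groups (illustration only).
scores = [[20, 22, 19, 24, 25], [18, 17, 21, 16, 18], [23, 25, 22, 26, 24]]
eta = eta_squared(scores)
print(f"eta-squared = {eta:.2f} ({effect_label(eta)})")
```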
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
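Cronbach's alpha can be computed from the item variances and the variance of the participants' total scores. The sketch below uses a small set of hypothetical Likert responses (rows are participants, columns are items); the function name is an assumption for illustration.

```python
import statistics

def cronbach_alpha(responses):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)."""
    k = len(responses[0])                                  # number of items
    items = list(zip(*responses))                          # one tuple of answers per item
    item_variances = [statistics.variance(item) for item in items]
    total_scores = [sum(row) for row in responses]
    return k / (k - 1) * (1 - sum(item_variances) / statistics.variance(total_scores))

# Hypothetical 1-5 Likert answers from 5 participants on 3 items (illustration only).
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Here the items rise and fall together across participants, so alpha is high; it says nothing about whether the items measure what they claim to measure.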
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
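The 5-10 participants per item guideline turns into a simple calculation. The function name below is hypothetical, and the result is a rule-of-thumb range, not a guarantee of an adequate sample.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rule-of-thumb N range: 5 to 10 participants for each questionnaire item."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)   # the 30-item questionnaire from the example
print(f"30 items: {low}-{high} participants")
```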
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
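The shared-variance figures on these slides come from squaring the loading. A short sketch using the loadings quoted above (.63 for Item 43; .33 and .29 for Item 38):

```python
def shared_variance_percent(loading):
    """Squared factor loading, expressed as a percentage of shared variance."""
    return loading ** 2 * 100

for label, loading in [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33), ("Item 38 on F1", 0.29)]:
    print(f"{label}: loading {loading:.2f} -> {shared_variance_percent(loading):.0f}% shared variance")
```

A loading of .33 leaves roughly nine-tenths of the item's variance unrelated to the factor, which is why low cut-off points are a problem.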
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 20
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with an ANOVA to see if any groups scored significantly higher/lower
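For readers curious about what the ANOVA does with the three sets of test scores, here is a minimal sketch of the F-statistic calculation. The scores are invented toy values, not data from any study, and the function name is hypothetical. It uses only the standard library and does not compute a p-value, which requires the F distribution from a statistics package.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, ss_between, ss_within) for lists of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-groups variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups variation: how far each score sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within


if __name__ == "__main__":
    # Invented toy vocabulary scores for the three hypothetical learning groups.
    word_cards = [11, 12, 13]
    word_lists = [12, 13, 14]
    pc_software = [13, 14, 15]
    print(one_way_anova([word_cards, word_lists, pc_software]))
```

A larger F means the group means are spread further apart relative to the spread of scores inside each group.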
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with an ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published results table, with the descriptive statistics and ANOVA statistics marked]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
somewhere among the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance attributable to the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
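Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. In the illustration below the function names are hypothetical, the benchmarks are treated as lower bounds for each label, and the "negligible" label for values under .01 is an assumption added here, not part of Cohen's list.

```python
def eta_squared(ss_between: float, ss_total: float) -> float:
    """Proportion of total variance attributable to group membership."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    return ss_between / ss_total


def cohen_label(eta_sq: float) -> str:
    """Map eta-squared onto the benchmarks listed above (treated as lower bounds)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below the smallest benchmark


if __name__ == "__main__":
    # Toy sums of squares: 6 between groups out of 12 in total.
    value = eta_squared(6, 12)
    print(value, cohen_label(value))
    # An eta-squared of .02 falls in the "small" band.
    print(cohen_label(0.02))
```

A result can therefore be statistically significant and still carry a small effect size, which is why both are reported.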
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results
– Which groups were significantly different from each other?
No mention of effect size
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
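The point that alpha rises with the number of items can be illustrated with the standardized form of Cronbach's alpha, which depends only on the item count and the mean inter-item correlation. This is a simplified stand-in for the raw-score formula that packages such as SPSS report; the function name and the .3 correlation are assumptions for the example.

```python
def standardized_alpha(n_items: int, mean_inter_item_r: float) -> float:
    """Standardized Cronbach's alpha from item count and mean inter-item correlation."""
    if n_items < 2:
        raise ValueError("alpha needs at least 2 items")
    return (n_items * mean_inter_item_r) / (1 + (n_items - 1) * mean_inter_item_r)


if __name__ == "__main__":
    # Same assumed mean inter-item correlation (.3), increasing numbers of items.
    for k in (3, 6, 10):
        print(k, round(standardized_alpha(k, 0.3), 2))
```

With the correlation held at .3, alpha climbs from about .56 with 3 items to about .81 with 10, so a high alpha alone does not show that the items are good.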
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
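The shared-variance figure is the loading multiplied by itself. A minimal sketch, with a hypothetical function name, applied to the .63 loading above and to a weak .33 loading for contrast:

```python
def shared_variance(loading: float) -> float:
    """Proportion of an item's variance shared with the factor (squared loading)."""
    return loading ** 2


if __name__ == "__main__":
    for loading in (0.63, 0.33):
        percent = round(shared_variance(loading) * 100)
        print(f"loading {loading:.2f} -> about {percent}% shared variance")
```

Squaring shows how quickly weak loadings lose meaning: a loading about half the size shares only about a quarter as much variance.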
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
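The participants-per-item rule of thumb is easy to apply to any questionnaire length. The helper below is a hypothetical illustration of the 5-10 range cited above, not a substitute for a proper sample size analysis.

```python
def participants_needed(n_items: int, per_item_low: int = 5, per_item_high: int = 10):
    """Return the (minimum, maximum) suggested N under a participants-per-item rule."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return n_items * per_item_low, n_items * per_item_high


if __name__ == "__main__":
    # The slide's example: a 30-item questionnaire.
    print(participants_needed(30))
```

For a 30-item questionnaire this reproduces the 150-300 range on the slide.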
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 23
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
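A minimal sketch of these two measures in Python, using only the standard library. The scores are made up for illustration and do not come from any study cited in the slides.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustrative only)
scores = [12, 15, 15, 18, 20]

mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting both values lets a reader picture where the group sits and how spread out the individuals are.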
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
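The two designs above can be sketched as follows. This is an illustrative calculation of the t statistic and degrees of freedom only. The function names and the scores are hypothetical, and the independent samples version assumes equal variances in the two groups. A p-value would then be read from the t distribution with the returned df, typically through a statistics package.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic and df for a paired samples t-test (one group, two times)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def independent_t(group_a, group_b):
    """t statistic and df for an independent samples t-test (pooled variance)."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * statistics.variance(group_a)
                  + (n2 - 1) * statistics.variance(group_b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n1 + n2 - 2

# Hypothetical vocab scores for the same learners at Time 1 and Time 2
time1 = [10, 12, 9, 14, 11]
time2 = [13, 14, 10, 17, 12]
t_paired, df_paired = paired_t(time1, time2)

# Hypothetical test scores for two separate groups
explicit = [2, 4, 6]
implicit = [1, 2, 3]
t_indep, df_indep = independent_t(explicit, implicit)

print(t_paired, df_paired)
print(t_indep, df_indep)
```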
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 (95% level) = a 5% risk of a false positive on each comparison
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
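As a rough sketch of the adjustment, the function below simply divides the alpha level by the number of comparisons. The name `bonferroni_alpha` is made up for illustration. The p-values are those listed for the Macaro & Erler (2007) attitudes results, and pairing each value with an area assumes they appear in the same order as the area names.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the overall alpha level by the number of comparisons."""
    return alpha / n_comparisons

# The slide's own example: 0.05 / 5
print(bonferroni_alpha(0.05, 5))

# p-values as listed in the attitudes results; the area-to-value pairing
# is an assumption based on listing order.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004,
    "Textbook": 0.005,
}
adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]
print(adjusted, significant)
```

With eight comparisons the adjusted alpha is .00625, which is consistent with the *p < .006 threshold reported for that table.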
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
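To show where the F-statistic comes from, here is a small sketch that builds the pieces of an ANOVA table (sums of squares and degrees of freedom) by hand. The function name and the three sets of scores are hypothetical. Judging significance would still require comparing F against the F distribution, which a statistics package normally does.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within, ss_between, ss_within)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within, ss_between, ss_within

# Hypothetical vocab test scores for three learning groups
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

f, df_b, df_w, ss_b, ss_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

F grows when the group means are far apart relative to the spread inside each group, which is why a higher value favors the model.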
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the proportion of total variance
attributable to the
treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen
(1988)
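A short sketch of the effect size calculation, assuming eta-squared is computed as the between-groups sum of squares divided by the total sum of squares. The function names are hypothetical, and the "negligible" label for values below .01 is an added convenience, not one of Cohen's categories listed above.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance attributable to group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: no label is given below .01

# Hypothetical sums of squares taken from an ANOVA table
example = eta_squared(ss_between=24, ss_total=30)
print(example, cohen_label(example))

# An eta-squared of .02 falls in the small range
print(cohen_label(0.02))
```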
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA
Problems and Issues
Starting out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or
sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the
squared loading gives the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
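For readers who want to see how alpha is built, the sketch below uses the standard variance-based formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores). The function name and the Likert responses are hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(item) for item in items)
    # Total score per respondent across all items
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_vars / statistics.variance(totals))

# Hypothetical responses from four participants to three items
item1 = [1, 2, 3, 4]
item2 = [2, 3, 4, 5]
item3 = [1, 3, 3, 5]

alpha = cronbach_alpha([item1, item2, item3])
print(f"alpha = {alpha:.2f}")
```

Because k appears in the formula, adding items tends to raise alpha even when the items are only moderately related, which is one reason a high alpha does not establish validity.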
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
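The arithmetic on this slide can be checked with a one-line function that squares the loading. The function name is made up for illustration. The second value uses the .33 loading that appears in the factor loading issues example.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor."""
    return loading ** 2

for loading in (0.63, 0.33):
    percent = shared_variance(loading) * 100
    print(f"loading {loading:.2f} -> {percent:.0f}% shared variance")
```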
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
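The participants-per-item rule of thumb is simple multiplication, sketched here with a hypothetical helper. The default range of 5 to 10 per item follows the guideline cited above.

```python
def recommended_n(n_items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```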
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 25
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
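The adjustment can be sketched in a few lines of Python. The function name `bonferroni_alpha` is hypothetical; the p values are the eight reported in the Macaro & Erler attitude results, which is consistent with that table's *p < .006 threshold (.05 / 8 = .00625).

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the overall alpha level by the number of comparisons."""
    return alpha / n_comparisons


# The slide's example: alpha of .05 spread over 5 comparisons
print(bonferroni_alpha(0.05, 5))

# p values reported for the eight attitude areas
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]
print(adjusted, significant)
```

Speaking and Writing would count as significant at the unadjusted .05 level, but not once the alpha level is divided across eight comparisons.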
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups
However, t-tests work best when limited to 2 groups
ANOVAs can compare 3 or more groups in one test, which avoids the extra Type I error that repeated t-tests would introduce
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
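As a rough sketch of what the ANOVA step computes, the F-statistic compares the variance between group means with the variance within groups. The scores below are invented for illustration, and `one_way_anova` is a hypothetical helper rather than anything from the presentation. Only the standard library is used, so no p value is produced; a statistics package would supply that from the F distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, ss_between, ss_within)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of scores around their own group mean
    for g in groups:
        group_mean = mean(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within


# Hypothetical vocab test scores for the three study methods
word_cards = [14, 15, 16]
word_lists = [15, 16, 17]
pc_software = [16, 17, 18]

print(one_way_anova([word_cards, word_lists, pc_software]))
```

The returned degrees of freedom and sums of squares are the same quantities that appear in a published ANOVA table.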
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published example with a descriptive statistics table and an ANOVA statistics table; callout: F-statistic is significant, i.e. our model seems to work]
F-statistic cont.
Reaching significance indicates that some of the group means differ significantly from each other
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
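Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The example below is a minimal illustration with invented scores; `eta_squared` and `describe_effect` are hypothetical names, the cut-offs are the Cohen (1988) values listed above, and the "negligible" label for values under .01 is an assumption added here.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total


def describe_effect(eta_sq):
    """Label an eta-squared value using the cut-offs listed on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label, not from the slide


# Hypothetical scores for three groups
groups = [[14, 15, 16], [15, 16, 17], [16, 17, 18]]
value = eta_squared(groups)
print(value, describe_effect(value))

# An eta-squared of .02 falls in the small range
print(describe_effect(0.02))
```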
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the [...] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results
– Which groups were significantly different from each other?
No mention of effect size
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
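For readers who want to see how Cronbach’s alpha is put together, here is a minimal sketch using the standard formula alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are invented and `cronbach_alpha` is a hypothetical helper; statistical packages report the same figure directly.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of responses per questionnaire item,
    with respondents in the same order in every list."""
    k = len(items)
    # Each respondent's total score across all items
    totals = [sum(responses) for responses in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-5 Likert responses from five participants on three items
item_1 = [4, 5, 3, 4, 2]
item_2 = [4, 4, 3, 5, 2]
item_3 = [5, 5, 2, 4, 1]

print(round(cronbach_alpha([item_1, item_2, item_3]), 2))
```

A high alpha shows only that the items vary together; as the slide notes, it says nothing about whether the items are valid measures of the intended construct.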
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
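The arithmetic behind the shared-variance figure is simply the loading squared. A short sketch, with `shared_variance` as a hypothetical helper name and the loadings .63 and .33 taken from the examples in this presentation:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor."""
    return loading ** 2


for loading in (0.63, 0.40, 0.33):
    share = shared_variance(loading)
    print(f"loading {loading:.2f} -> {share:.0%} shared variance")
```

Seen this way, a loading that looks respectable can still mean that most of the item’s variance has nothing to do with the factor.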
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
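The participants-per-item rule of thumb is easy to check for any questionnaire length. A minimal sketch, in which `participants_needed` is a hypothetical helper and the 5-10 range is the guideline cited above:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by
    a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```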
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 28
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
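The two measures named above can be computed in a few lines. This is a minimal sketch using Python's standard library; the scores are hypothetical and do not come from any study cited in the slides.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustration only).
scores = [12, 15, 15, 18, 20, 22, 24]

mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting N, M, and SD together is what lets a reader picture how the group performed.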
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
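"Not symmetrical" and "flat or peaked" correspond to what statistics texts usually call skewness and kurtosis. The sketch below computes simple moment-based versions of both; the function names and the sample data are assumptions for illustration, and statistical packages often apply small-sample corrections that give slightly different values.

```python
import statistics

def skewness(data):
    """Moment-based skewness: 0 for a symmetrical distribution."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)  # population SD
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Moment-based excess kurtosis: negative = flat, positive = peaked."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3

# Hypothetical score sets (illustration only).
symmetrical = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetrical), excess_kurtosis(symmetrical))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```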
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
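The difference between the two designs shows up in how the t statistic is built. The following is a rough sketch with hypothetical scores: the paired version works on each learner's difference score, while the independent version pools the variance of two separate groups (assuming roughly equal variances). The function names are illustrative, not from any package.

```python
import math
import statistics

def paired_t(time1, time2):
    """t and df for one group measured twice (paired samples)."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def independent_t(group_a, group_b):
    """t and df for two separate groups (pooled-variance version)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, na + nb - 2

# Hypothetical vocab scores before and after extensive reading.
print(paired_t([10, 12, 9, 14, 11], [13, 14, 10, 17, 12]))
# Hypothetical test scores for implicit vs. explicit grammar groups.
print(independent_t([10, 12, 14], [13, 15, 17]))
```

Note that the degrees of freedom differ: N - 1 for the paired design, and the two group sizes minus 2 for the independent design.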
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
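The starred results can be checked against an adjusted alpha. This sketch assumes the *p < .006 threshold comes from dividing .05 by the eight comparisons (.00625), which the slides imply but do not state, and that the areas and p-values pair up in the order listed.

```python
# p-values as listed on the slide for Macaro & Erler (2007).
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # assumed Bonferroni adjustment

significant = [area for area, p in p_values.items() if p < adjusted_alpha]
not_significant = [area for area, p in p_values.items() if p >= adjusted_alpha]

print(f"Adjusted alpha: {adjusted_alpha}")
print("Significant:", significant)
print("Not significant after adjustment:", not_significant)
```

Speaking and Writing would count as significant at .05 but not at the adjusted level, which matches the missing asterisks on the slide.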
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha level of .05 = a 5% chance of a false positive on each comparison when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
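The arithmetic behind the adjustment can be sketched as follows. The function names are illustrative. The familywise error rate formula assumes the comparisons are independent, which is a simplification.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of comparisons significant by chance alone."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_comparisons

def bonferroni_alpha(n_comparisons, alpha=0.05):
    """Bonferroni-adjusted alpha: divide alpha by the number of comparisons."""
    return alpha / n_comparisons

for m in (5, 20, 100):
    print(m,
          round(expected_false_positives(m), 2),
          round(familywise_error_rate(m), 3),
          round(bonferroni_alpha(m), 4))
```

With 5 comparisons, the unadjusted chance of at least one false positive is already well above 5%, which is why the stricter per-test alpha is used.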
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for three groups: Control, Treatment 1, Treatment 2]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for three groups: Control, Treatment 1, Treatment 2]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error than repeated t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
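To see where the F-statistic comes from, here is a minimal sketch that builds it from the between-group and within-group sums of squares. The three score lists are hypothetical stand-ins for the word card, word list, and PC software groups. The p-value lookup is left out because it needs the F distribution, which statistical software supplies.

```python
import statistics

def one_way_anova(groups):
    """Return F, df_between, df_within for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores (illustration only).
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

f, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A larger F means the group means differ more relative to the spread of scores inside each group.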
ANOVA in the Literature
[Figure: published tables of descriptive statistics and ANOVA statistics]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
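Eta-squared is commonly computed as the between-groups sum of squares divided by the total sum of squares. The sketch below applies the Cohen (1988) benchmarks listed above; the "negligible" label for values under .01 is an assumption added for completeness, and the sums of squares are hypothetical.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance associated with group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the benchmarks on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed; the slide gives no name below .01

# Hypothetical sums of squares from an ANOVA table.
effect = eta_squared(ss_between=24, ss_total=30)
print(effect, cohen_label(effect))

# A value like the .02 in a typical write-up falls in the small range.
print(cohen_label(0.02))
```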
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared
loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
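Cronbach's alpha can be computed from the item variances and the variance of the total scores. This is a minimal sketch of the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of totals), with hypothetical Likert responses; the function name and data are for illustration only.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per questionnaire item,
    with respondents in the same order in every list."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(item) for item in items)
    totals = [sum(responses) for responses in zip(*items)]
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical 1-5 Likert responses from five learners to three items
# written to tap the same factor.
item_a = [4, 5, 3, 4, 2]
item_b = [4, 4, 3, 5, 2]
item_c = [5, 4, 2, 4, 3]

alpha = cronbach_alpha([item_a, item_b, item_c])
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because k appears in the formula, adding more items tends to raise alpha even when the items are only moderately correlated.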
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
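The participants-per-item guideline is simple multiplication, sketched here with an illustrative function name. The 5 and 10 multipliers are the ones attributed to Field (2005) on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, preferred) sample size for a questionnaire."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```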
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
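Squaring a loading gives the share of variance the item has in common with the factor. The sketch below does this for the three loadings quoted in the slides; the dictionary keys are just labels for illustration.

```python
# Loadings quoted in the slides.
loadings = {
    "Item 43 on F1": 0.63,
    "Item 38 on F3": 0.33,
    "Item 38 on F1": 0.29,
}

def shared_variance_pct(loading):
    """Percentage of variance shared between an item and a factor."""
    return loading ** 2 * 100

for label, loading in loadings.items():
    print(f"{label}: loading {loading:.2f}, "
          f"shared variance {shared_variance_pct(loading):.1f}%")
```

An item loading at .33 shares only about a tenth of its variance with the factor, which is why low cut-off points are treated as a problem.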
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 30
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
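The paired design above can be sketched as follows. The function name `paired_t` and the scores are hypothetical; the calculation is the standard paired samples formula (mean of the differences divided by its standard error).

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired samples t-test on two equal-length lists."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se, n - 1

# Hypothetical vocab scores for the same five learners (illustrative only).
time1 = [10, 12, 9, 14, 11]   # before extensive reading
time2 = [12, 15, 10, 15, 14]  # after extensive reading

t, df = paired_t(time1, time2)
print(f"t = {t:.2f}, df = {df}")  # t = 4.47, df = 4
```

Because each learner acts as their own control, only the within-person differences enter the calculation.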
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
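A minimal sketch of the two-group comparison, assuming equal variances in both groups (the classic pooled-variance form). The function name `independent_t` and the scores are invented for illustration.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n_a + n_b - 2

# Hypothetical grammar test scores (illustrative only).
implicit = [60, 62, 64, 66, 68]
explicit = [66, 68, 70, 72, 74]

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")  # t = -3.00, df = 8
```

The sign of t only reflects which group was entered first; the degrees of freedom are the two group sizes added together minus 2.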
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              t     df   p
Reading           4.91  114  .001*
Speaking          2.28  114  .024
Writing           2.30  114  .023
Listening         4.12  114  .001*
Spelling          3.74  114  .001*
General learning  3.61  114  .001*
Homework          2.92  114  .004*
Textbook          3.01  114  .005*
*p < .006
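The asterisked threshold is consistent with dividing an alpha of .05 across the eight comparisons, although the slide does not state this; treat that as an inference. The sketch below pairs the eight results with the eight areas in the order listed (an assumption about the original table layout) and checks which ones clear the adjusted threshold.

```python
# p-values as reported on the slide, paired with areas in listed order (assumed layout).
reported_p = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

alpha = 0.05
adjusted_alpha = alpha / len(reported_p)  # 0.05 / 8 = 0.00625, shown as .006 on the slide

significant = [area for area, p in reported_p.items() if p < adjusted_alpha]
not_significant = [area for area in reported_p if area not in significant]

# Degrees of freedom for two independent groups: N1 + N2 - 2.
df = 62 + 54 - 2

print(significant)
print(not_significant)  # ['Speaking', 'Writing']
print(df)               # 114
```

Speaking and Writing would count as significant at the unadjusted .05 level, which is exactly the situation the adjustment is meant to guard against.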
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance (alpha = .05) = a 5% chance of a false positive on each comparison when there is no real difference
20 comparisons = 1 expected by chance
100 comparisons = 5 expected by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
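The arithmetic behind the adjustment can be sketched in a few lines. The function names are hypothetical, and the familywise calculation assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(alpha, comparisons):
    """Number of 'significant' results expected by chance alone."""
    return alpha * comparisons

def familywise_error(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(alpha, comparisons):
    """Bonferroni-adjusted alpha level for each individual comparison."""
    return alpha / comparisons

print(expected_false_positives(0.05, 20))   # about 1
print(expected_false_positives(0.05, 100))  # about 5
print(bonferroni_alpha(0.05, 5))            # 0.01

# With 5 comparisons, the unadjusted familywise risk is well above .05;
# using the adjusted alpha brings it back under .05.
print(familywise_error(0.05, 5))
print(familywise_error(bonferroni_alpha(0.05, 5), 5))
```

The trade-off is the one described under Type II error: a stricter alpha makes real differences harder to detect.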
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups; however, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while introducing less risk of Type I error than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are ANOVA analyzed to see if any groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
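To show where the F-statistic comes from, here is a minimal sketch of a one-way ANOVA computed by hand. The function name `one_way_anova` and the scores are hypothetical; the groups loosely echo the word cards / word lists / PC software design, but the numbers are invented.

```python
import statistics

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way between-groups ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variability of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores (illustrative only).
word_cards = [4, 5, 6]
word_lists = [5, 6, 7]
pc_software = [9, 10, 11]

f, df_between, df_within = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_between}, {df_within}) = {f:.2f}")  # F(2, 6) = 21.00
```

F grows as the group means move apart relative to the spread inside each group, which is why a larger value favors the model. Turning F into a p-value requires the F distribution, which a statistics package would supply.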
ANOVA in the Literature
[Figure: published table showing descriptive statistics and ANOVA statistics]
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
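Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The function names and scores below are hypothetical; the cut-offs are the benchmarks quoted above, and the "negligible" label for values under .01 is an added assumption.

```python
import statistics

def eta_squared(*groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def label_effect(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= .14:
        return "large"
    if eta_sq >= .06:
        return "medium"
    if eta_sq >= .01:
        return "small"
    return "negligible"  # label assumed for illustration

# Hypothetical scores for three groups (illustrative only).
eta = eta_squared([4, 5, 6], [5, 6, 7], [9, 10, 11])
print(f"{eta:.3f}", label_effect(eta))  # 0.875 large
print(label_effect(.02))                # small
```

A result can be statistically significant and still carry a small label here, which is why effect size is reported alongside the F-statistic.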
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
– Starting out with non-equivalent groups
– Not reporting the type of ANOVA performed
– Not reporting specific post-hoc results
– Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading gives the amount of shared variance
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
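Cronbach's alpha can be sketched from item variances and the variance of participants' total scores. The function name `cronbach_alpha` and the responses are hypothetical; the formula is the standard one, alpha = k/(k-1) × (1 − sum of item variances / variance of totals).

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per item, each ordered by participant."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each participant's total score across all items.
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical Likert responses: 3 items, 4 participants (illustrative only).
item1 = [1, 2, 3, 4]
item2 = [2, 3, 4, 5]
item3 = [1, 3, 3, 5]

alpha = cronbach_alpha([item1, item2, item3])
print(f"alpha = {alpha:.2f}")  # alpha = 0.98
```

A high value only says the items move together; as noted above, it says nothing about whether they measure what they were intended to measure.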
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
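The squaring step above can be sketched directly. The function names are hypothetical, the .4 default reflects the cut-off cited from Stevens, and the .33 value stands in for a weaker loading.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2

def meets_cutoff(loading, cutoff=.4):
    """True if the absolute loading is above the chosen cut-off."""
    return abs(loading) > cutoff

print(round(shared_variance(.63) * 100), meets_cutoff(.63))  # 40 True
print(round(shared_variance(.33) * 100), meets_cutoff(.33))  # 11 False
```

Seen this way, a loading that looks respectable as a correlation can correspond to a small share of variance.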
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
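The participants-per-item guideline reduces to simple multiplication, sketched here with a hypothetical helper. The 5 and 10 defaults come from the range quoted above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by a per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

print(participants_needed(30))  # (150, 300)
```

Under this rule, every item added to a questionnaire raises the number of participants required.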
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 32
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
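A minimal sketch of how Cronbach's alpha can be computed from raw questionnaire responses, using the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The response data are hypothetical Likert scores and the function name is made up for illustration.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row per participant, each row a list of item scores."""
    k = len(responses[0])                      # number of items in the scale
    items = list(zip(*responses))              # transpose: one tuple per item
    sum_item_vars = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical responses: 5 participants x 3 Likert items (illustration only)
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because k appears in the formula, adding items that correlate with the rest tends to raise alpha, which is why a high alpha on a long scale is not in itself evidence that the items are valid.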
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. about 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
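The arithmetic behind the 30-item example can be sketched as follows, assuming the 5-10 participants-per-item rule of thumb quoted above. The function name is hypothetical.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by a per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```

Note that this rule and the “over 300” guideline can point to different numbers for the same questionnaire, so it helps to state which one a study followed.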
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
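The loading arithmetic on these slides can be checked with a short sketch: squaring a loading gives the proportion of variance the item shares with the factor. The loadings are the ones quoted in the slides; the function name is hypothetical.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor (squared loading)."""
    return loading ** 2

examples = [
    ("Item 43 on F1", 0.63),
    ("Item 38 on F3", 0.33),
    ("Item 38 on F1", 0.29),
]
for label, loading in examples:
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Seen this way, a loading of .33 means that almost nine tenths of the item's variance is unrelated to the factor, which is why low cut-off points are hard to defend.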
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 33
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
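As a small illustration of these two descriptive statistics, the sketch below computes the mean and the sample standard deviation with Python's standard library. The test scores are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class (illustration only)
scores = [62, 70, 75, 75, 80, 88]

m = mean(scores)    # central tendency
sd = stdev(scores)  # sample standard deviation (divides by n - 1)
print(f"M = {m:.2f}, SD = {sd:.2f}, N = {len(scores)}")
```

`stdev` uses the n − 1 denominator, the usual choice when the group is treated as a sample from a larger population; `pstdev` would treat the group as the whole population.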
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
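One hedged way to put numbers on "how normal" is to compute skewness (asymmetry) and excess kurtosis (flatness or peakedness). The sketch below uses the simple population-moment formulas; statistical packages often apply small-sample corrections, so their output may differ slightly. The data and function names are hypothetical.

```python
from statistics import mean, pstdev

def skewness(scores):
    """About 0 for symmetrical data; positive when the long tail is on the right."""
    m, s = mean(scores), pstdev(scores)
    return sum(((x - m) / s) ** 3 for x in scores) / len(scores)

def excess_kurtosis(scores):
    """About 0 for a normal curve; negative when flat, positive when peaked."""
    m, s = mean(scores), pstdev(scores)
    return sum(((x - m) / s) ** 4 for x in scores) / len(scores) - 3

symmetrical = [1, 2, 3, 4, 5]   # hypothetical, evenly spread scores
lopsided = [1, 1, 1, 2, 10]     # hypothetical, one extreme high score

print(f"symmetrical: skew = {skewness(symmetrical):.2f}, kurtosis = {excess_kurtosis(symmetrical):.2f}")
print(f"lopsided:    skew = {skewness(lopsided):.2f}, kurtosis = {excess_kurtosis(lopsided):.2f}")
```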
Issues
Are the data appropriate for further statistical analyses?
– Mean and SD
– Participants: N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
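The two designs above can be sketched in code. This is a minimal illustration with hypothetical scores and made-up function names; the independent-samples version assumes equal variances (the pooled form), which is only one of the variants a statistics package may offer.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(time1, time2):
    """Paired samples t-test: one group measured twice. Returns (t, df)."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

def independent_t(group_a, group_b):
    """Independent samples t-test with pooled variance. Returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2

# Hypothetical vocab scores before and after extensive reading (same learners)
before = [10, 12, 9, 14, 11]
after = [13, 14, 10, 17, 13]
print("paired: t = %.2f, df = %d" % paired_t(before, after))

# Hypothetical test scores for two separate grammar groups
implicit = [60, 65, 70, 75]
explicit = [70, 75, 80, 85]
print("independent: t = %.2f, df = %d" % independent_t(implicit, explicit))
```

The degrees of freedom differ by design: n − 1 for the paired test, and n1 + n2 − 2 for the independent test.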
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t-test result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha level of .05 = a 5% risk of a significant result arising by chance in each comparison
20 comparisons = about 1 by chance
100 comparisons = about 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
$\dfrac{\text{Alpha level}}{\text{No. of comparisons}} = \dfrac{0.05}{5} = 0.01$
…and use this new figure as your alpha level.
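The adjustment can be sketched in a few lines. The p-values below are the eight reported in the Macaro & Erler attitudes results, pairing each area with the result in the same position in the list; the function name is hypothetical. With eight comparisons the adjusted alpha is .05 / 8 = .00625, which appears consistent with the *p < .006 marker on that slide.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the alpha level by the number of comparisons made."""
    return alpha / n_comparisons

# p-values as reported on the attitudes-to-French slide
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]
print(f"adjusted alpha = {adjusted}")
print("significant after adjustment:", ", ".join(significant))
```

Speaking and Writing would count as significant at the unadjusted .05 level but not after the adjustment, which is the trade-off between Type I and Type II error described above.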
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error than repeated t-tests.
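A rough sketch of why repeated t-tests introduce more error: with each extra comparison at alpha = .05, the chance of at least one false positive grows. The calculation assumes the comparisons are independent, which is a simplification, and the function names are hypothetical.

```python
from math import comb

def pairwise_comparisons(n_groups):
    """Number of separate t-tests needed to compare every pair of groups."""
    return comb(n_groups, 2)

def familywise_error(alpha, n_comparisons):
    """Chance of at least one Type I error, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for n_groups in (2, 3, 4, 5):
    k = pairwise_comparisons(n_groups)
    print(f"{n_groups} groups: {k} t-tests, familywise error = {familywise_error(0.05, k):.3f}")
```

With three groups the three pairwise t-tests already carry roughly a 14% chance of a spurious result, whereas a single one-way ANOVA tests all the means at once at the chosen alpha level.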
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are ANOVA analyzed to see if any groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Slide 35
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
– Mean: central tendency
– Standard deviation: variability from the mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group, or comparison of a group with other groups
How normal?
– Not symmetrical
– Flat or peaked
Issues
Are the data appropriate for further statistical analyses?
– Mean and SD
– Participants: N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and
sometimes forgotten by researchers.”
Brown, 1988
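To make the mean, the SD, and the "How normal?" question concrete, here is a minimal sketch using only the Python standard library. The scores are hypothetical, and the skewness formula is one simple population-style estimate among several in use; values near 0 suggest a symmetrical distribution.

```python
import statistics


def describe(scores):
    """Return N, mean, sample SD, and a simple skewness estimate."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (divides by n - 1)
    # Population-style skewness: 0 = symmetrical, > 0 = long tail to the right
    skew = sum((x - mean) ** 3 for x in scores) / n / statistics.pstdev(scores) ** 3
    return {"n": n, "mean": mean, "sd": sd, "skew": skew}


# Hypothetical test scores with one unusually high scorer
scores = [12, 15, 15, 16, 17, 18, 18, 19, 20, 30]
summary = describe(scores)
print(summary)
```

A positive skew here reflects the single high score pulling the tail to the right, which is the kind of asymmetry to check before running further analyses.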
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
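As a rough illustration of the two designs above, the sketch below computes the t statistic and degrees of freedom by hand for a paired and an independent samples t-test. All scores are hypothetical, the independent version assumes equal variances (Student's t), and no p-value is computed; a statistics package would normally report that.

```python
import math
import statistics


def paired_t(time1, time2):
    """t and df for one group measured twice (paired samples)."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


def independent_t(group_a, group_b):
    """Student's t and df for two independent groups (equal variances assumed)."""
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled = ((n1 - 1) * statistics.variance(group_a)
              + (n2 - 1) * statistics.variance(group_b)) / df
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    t = diff / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical vocab scores before and after extensive reading (same learners)
t_pair, df_pair = paired_t([10, 12, 11, 14, 13], [12, 15, 12, 17, 14])

# Hypothetical test scores for two different groups
t_ind, df_ind = independent_t([10, 12, 14, 16, 18], [8, 9, 10, 11, 12])

print(f"paired: t = {t_pair:.2f}, df = {df_pair}")
print(f"independent: t = {t_ind:.2f}, df = {df_ind}")
```

The same df rule for independent groups (n1 + n2 - 2) gives 62 + 54 - 2 = 114 for groups of 62 and 54.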
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
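95% level of significance (alpha = .05) = a 5% risk of a false positive on each comparison when no real difference exists
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 expected by chance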
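The arithmetic behind these figures can be sketched as follows. The function names are illustrative, and the familywise calculation assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of chance 'significant' results across the comparisons."""
    return alpha * comparisons


def familywise_error(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


for k in (1, 5, 20, 100):
    print(
        f"{k:>3} comparisons: "
        f"expected by chance = {expected_false_positives(0.05, k):.2f}, "
        f"P(at least one) = {familywise_error(0.05, k):.2f}"
    )
```

With 20 comparisons the chance of at least one spurious result is already well over half, which is why an uncorrected alpha is risky.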
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
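A minimal sketch of the adjustment, applied to the eight p-values reported for Macaro & Erler (2007). Pairing each p-value with an area in the order listed on the slide is an assumption, as is the function name.

```python
def bonferroni_alpha(alpha, comparisons):
    """Divide the alpha level by the number of comparisons made."""
    return alpha / comparisons


# p-values as reported on the slide; the area-to-value pairing is assumed
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted:.5f}")
print("significant after adjustment:", significant)
```

Eight comparisons give an adjusted alpha of .00625, which matches the *p < .006 threshold on the slide; Speaking and Writing would pass at .05 but not at the adjusted level.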
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can compare 3 or more groups in a single test,
introducing less Type I error than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any group scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Annotated table image: descriptive statistics; ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
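The F-statistic and eta-squared can both be derived from the same sums of squares. The sketch below works through a hypothetical three-group vocabulary example by hand; the scores, the function names, and the "negligible" label for values under .01 are assumptions, and the cut-offs are treated as lower bounds.

```python
import statistics


def one_way_anova(*groups):
    """Return F, its degrees of freedom, and eta-squared for a one-way ANOVA."""
    scores = [x for group in groups for x in group]
    grand_mean = statistics.mean(scores)
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of individual scores around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return {"F": f, "df": (df_between, df_within), "eta_squared": eta_squared}


def effect_label(eta_squared):
    """Label eta-squared using the cut-offs listed above (Cohen, 1988)."""
    if eta_squared >= 0.14:
        return "large"
    if eta_squared >= 0.06:
        return "medium"
    if eta_squared >= 0.01:
        return "small"
    return "negligible"  # label below .01 is an assumption, not from the slides


# Hypothetical vocab test scores: word cards, word lists, PC software
result = one_way_anova([4, 5, 6], [6, 7, 8], [8, 9, 10])
print(result, effect_label(result["eta_squared"]))
```

A significance test would still need the F distribution with these degrees of freedom, and post-hoc comparisons would be a separate step.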
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
– Starting out with non-equivalent groups
– Not reporting the type of ANOVA performed
– Not reporting specific post-hoc results
– Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the
amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more
items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
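For readers who want to see what Cronbach's alpha is computed from, here is a minimal sketch using the standard variance-based formula. The Likert responses and the function name are hypothetical.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of participants and columns of items."""
    k = len(responses[0])  # number of items on the scale
    item_variances = [statistics.variance(column) for column in zip(*responses)]
    total_variance = statistics.variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers from 5 participants to 3 Likert items on one factor
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Because the items here rise and fall together across participants, alpha comes out high; it says nothing about whether the items measure what they are intended to measure.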
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
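The loading-to-variance arithmetic is simply a square, as this short sketch shows. The function name is illustrative; the .63 and .33 loadings come from the slides and .40 is the cut-off mentioned above.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


for loading in (0.63, 0.40, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Seen this way, a .4 cut-off means an item shares about 16% of its variance with the factor.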
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
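The per-item rule of thumb can be checked with a one-line calculation. The function name and defaults below are illustrative, taking the 5-10 participants-per-item range cited above.

```python
def recommended_n(items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by a participants-per-item rule."""
    return items * per_item_low, items * per_item_high


low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```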
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 37
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published results table, with its descriptive statistics and ANOVA statistics labelled]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
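A minimal sketch of how eta-squared can be computed and labelled with the Cohen (1988) benchmarks listed on this slide. It assumes the common definition eta-squared = between-groups sum of squares divided by total sum of squares; the function names and data are hypothetical.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [score for group in groups for score in group]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta2):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "below small"

# Hypothetical scores for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
eta2 = eta_squared(groups)
print(f"eta-squared = {eta2:.2f} ({cohen_label(eta2)})")
```

An eta-squared of .02 would be labelled "small" on these benchmarks, however significant the F-statistic is.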
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
– Starting out with non-equivalent groups
– Not reporting the type of ANOVA performed
– Not reporting specific post-hoc results
– Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factor labels: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared
loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
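The sketch below shows one common way to compute Cronbach's alpha from raw item scores, plus the standardized form that makes the "more items, greater alpha" point visible. The function names and the data are hypothetical, and the standardized formula assumes every pair of items has the same average correlation.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's score."""
    k = len(item_scores)
    item_variances = [statistics.variance(item) for item in item_scores]
    # Each respondent's total score across all items
    totals = [sum(responses) for responses in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def standardized_alpha(k, mean_r):
    """Alpha implied by k items with an average inter-item correlation mean_r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Hypothetical Likert responses: 3 items answered by 5 respondents
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 3, 3],
]
print(f"alpha = {cronbach_alpha(items):.2f}")

# Same average correlation, more items: alpha rises
print(f"{standardized_alpha(3, 0.3):.2f} vs {standardized_alpha(10, 0.3):.2f}")
```

The second print line illustrates why a high alpha on a long questionnaire is weaker evidence than the same alpha on a short one.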
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
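The arithmetic on this slide is simply the loading squared. A small sketch (the helper name `shared_variance` is hypothetical) applies it to the .63 loading above and, for contrast, to a weak .33 loading:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2

for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring is why loadings that look moderately sized, such as .3, account for very little of an item's variance.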
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
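The 150-300 figure follows directly from the 5-10 participants-per-item rule of thumb. A minimal sketch, with a hypothetical helper name:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by the per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```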
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean – central tendency
Standard Deviation – variability from the mean
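A brief sketch of how these two figures are obtained, using Python's standard `statistics` module. The scores are invented, and `describe` is a hypothetical helper name; `statistics.stdev` gives the sample standard deviation (dividing by N - 1).

```python
import statistics

def describe(scores):
    """Return (N, mean, sample standard deviation) for a list of scores."""
    return len(scores), statistics.mean(scores), statistics.stdev(scores)

# Hypothetical test scores for one class
scores = [62, 70, 75, 75, 81, 88, 94]
n, mean, sd = describe(scores)
print(f"N = {n}, M = {mean:.2f}, SD = {sd:.2f}")
```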
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
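"Not symmetrical" and "flat or peaked" correspond to skewness and kurtosis. The sketch below uses the simple moment-based (population) formulas; statistics packages often apply small-sample corrections, so their values may differ slightly. Function names and data are hypothetical.

```python
import statistics

def skewness(scores):
    """0 for a symmetrical distribution; positive when the tail runs to the right."""
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)
    return sum(((x - m) / s) ** 3 for x in scores) / len(scores)

def excess_kurtosis(scores):
    """About 0 for a normal curve; negative when flat, positive when peaked."""
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)
    return sum(((x - m) / s) ** 4 for x in scores) / len(scores) - 3

# Hypothetical scores with a long right tail
scores = [1, 1, 1, 2, 10]
print(f"skewness = {skewness(scores):.2f}, excess kurtosis = {excess_kurtosis(scores):.2f}")
```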
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
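A minimal sketch of the paired samples calculation for a design like this one: the t value is the mean of the Time 2 minus Time 1 differences divided by its standard error. The scores and the function name `paired_t` are hypothetical, and the p-value would still come from a t table or statistics package.

```python
import math
import statistics

def paired_t(time1, time2):
    """Return (t, df) for two sets of scores from the same participants."""
    if len(time1) != len(time2):
        raise ValueError("paired data need the same participants at both times")
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical vocab scores before and after extensive reading
before = [10, 12, 9, 11, 13]
after = [12, 15, 10, 14, 14]
t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```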
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
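For two separate groups the calculation changes: the difference between the group means is divided by a standard error built from both groups' variances. The sketch below uses the pooled-variance (equal variances assumed) form; the scores and the function name `independent_t` are hypothetical.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for two independent groups, assuming equal variances."""
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled_variance = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, n1 + n2 - 2

# Hypothetical grammar test scores
implicit = [1, 2, 3, 4, 5]
explicit = [3, 4, 5, 6, 7]
t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

Degrees of freedom are the two group sizes added together minus 2, so groups of 62 and 54 give df = 114.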
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t      df     p
Reading            4.91   114    .001*
Speaking           2.28   114    .024
Writing            2.30   114    .023
Listening          4.12   114    .001*
Spelling           3.74   114    .001*
General learning   3.61   114    .001*
Homework           2.92   114    .004*
Textbook           3.01   114    .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
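The "1 in 20" figures are expected counts: alpha multiplied by the number of comparisons. A related quantity is the chance of at least one false positive across all comparisons. The sketch below computes both; it assumes the comparisons are independent, and the function names are hypothetical.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of results that reach significance by chance alone."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 5, 20, 100):
    print(
        f"{m:>3} comparisons: expect {expected_false_positives(0.05, m):.2f} by chance, "
        f"P(at least one) = {familywise_error_rate(0.05, m):.2f}"
    )
```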
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
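A short sketch of the Bonferroni adjustment, applied to the eight p-values from the Macaro & Erler (2007) attitudes slide. Pairing each area with a p-value in slide order is an assumption made here, and the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# p-values as listed on the slide, paired with areas in the order shown (assumed)
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]
print(f"adjusted alpha = {adjusted:.5f}")
print("still significant:", ", ".join(significant))
```

With eight comparisons the adjusted alpha is .00625, so p = .024 and p = .023 no longer count as significant even though both are below .05.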
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
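As a minimal sketch of these two descriptive statistics, the snippet below computes a mean and a standard deviation with Python's standard library. The scores are hypothetical and are not taken from the presentation.

```python
import statistics

# Hypothetical vocabulary test scores for one class (invented for illustration).
scores = [10, 12, 9, 14, 11]

mean_score = statistics.mean(scores)   # central tendency
sd_score = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)

print(f"N = {len(scores)}, M = {mean_score:.2f}, SD = {sd_score:.2f}")
```

Reporting N alongside M and SD lets readers judge how much weight the two summary numbers can carry.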
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
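One way to put numbers on "not symmetrical" and "flat or peaked" is skewness and kurtosis. The sketch below uses the simple moment-based formulas; statistical packages such as SPSS apply small-sample corrections, so their output will differ slightly. The function names and data are illustrative assumptions.

```python
import statistics

def skewness(values):
    """Moment-based skewness: 0 for a symmetrical distribution, positive for a long right tail."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: negative for flat, positive for peaked distributions."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

# Hypothetical score sets, invented for illustration.
symmetrical = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetrical), excess_kurtosis(symmetrical))
print(skewness(right_skewed))
```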
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
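The two designs above can be sketched in a few lines of Python. This is an illustration only: the scores are hypothetical, the function names are invented here, and the independent version assumes equal variances (pooled variance). It returns t and df but not a p-value; in practice a statistics package (for example SciPy's `ttest_rel` and `ttest_ind`) would supply that.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples t-test: one group measured twice. Returns (t, df)."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1

def independent_t(group_a, group_b):
    """Independent samples t-test, equal variances assumed. Returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se, df

# Hypothetical vocabulary scores, invented for illustration only.
before_reading = [10, 12, 9, 14, 11]   # Time 1, same five learners
after_reading = [13, 14, 10, 17, 13]   # Time 2
implicit_group = [10, 12, 9, 14, 11]   # Group A
explicit_group = [13, 14, 10, 17, 13]  # Group B, different learners

t_paired, df_paired = paired_t(before_reading, after_reading)
t_indep, df_indep = independent_t(implicit_group, explicit_group)
print(f"paired: t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_indep:.2f}, df = {df_indep}")
```

The same numbers give a much larger t in the paired design, because each learner serves as their own control. This is why the design has to be reported along with the result.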
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              t     df   p
Reading           4.91  114  .001*
Speaking          2.28  114  .024
Writing           2.30  114  .023
Listening         4.12  114  .001*
Spelling          3.74  114  .001*
General learning  3.61  114  .001*
Homework          2.92  114  .004*
Textbook          3.01  114  .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha level of .05 = if there is no real difference, 5% of comparisons will still appear significant by chance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
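The arithmetic behind these figures can be sketched as follows. The snippet assumes the comparisons are independent and that no real differences exist. It shows both the expected number of chance "hits" and the probability of at least one, often called the familywise error rate. The function names are illustrative.

```python
def expected_false_positives(alpha, comparisons):
    """Expected number of comparisons that reach significance by chance alone."""
    return alpha * comparisons

def familywise_error(alpha, comparisons):
    """Probability of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 20, 100):
    print(k,
          round(expected_false_positives(0.05, k), 2),
          round(familywise_error(0.05, k), 3))
```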
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
New alpha level = Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
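A short sketch of the adjustment, applied to the eight attitude comparisons from Macaro & Erler (2007) reported earlier. Pairing each area with its p-value assumes the two lists on the slide run in the same order. The function name is illustrative.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / comparisons

# p-values as reported on the slide (area order assumed to match the value order).
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print("significant after adjustment:", significant)
```

With eight comparisons the adjusted alpha is .00625, which matches the "*p < .006" note. Speaking and Writing would pass at .05 but not after the adjustment.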
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
Similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while introducing less error (one overall test instead of multiple t-tests, each of which adds to the Type I error risk).
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Tables: descriptive statistics and ANOVA statistics]
F-statistic is significant, i.e., our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
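To make the F-statistic and eta-squared concrete, here is a minimal sketch of a one-way ANOVA computed by hand. The three groups and their scores are hypothetical (loosely modeled on the vocabulary example), and the function names and the "negligible" label are assumptions for illustration. The cut-offs are the Cohen (1988) values listed above.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variability of group means around the grand mean (treatment effect).
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of scores around their own group mean (error).
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared

def effect_label(eta_squared):
    """Label an eta-squared value using the .01 / .06 / .14 cut-offs."""
    if eta_squared >= 0.14:
        return "large"
    if eta_squared >= 0.06:
        return "medium"
    if eta_squared >= 0.01:
        return "small"
    return "negligible"

# Hypothetical vocab test scores, invented for illustration only.
word_cards = [11, 12, 13]
word_lists = [12, 13, 14]
pc_software = [15, 16, 17]

f_stat, df_b, df_w, eta_sq = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta-squared = {eta_sq:.2f} ({effect_label(eta_sq)})")
```

The F-statistic only says that some difference exists among the groups. Post-hoc paired comparisons are still needed to say where it lies.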
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
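As a rough sketch of how Cronbach's alpha is obtained, the function below uses the standard formula based on item variances and the variance of respondents' total scores. The responses are hypothetical Likert (1-5) data, and the function name is an assumption for illustration.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per questionnaire item, respondents in the same order."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each respondent's total score across all items.
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical responses from five learners to three items meant to tap one factor.
item_1 = [4, 3, 5, 2, 4]
item_2 = [5, 3, 4, 2, 4]
item_3 = [4, 2, 5, 3, 3]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"Cronbach's alpha = {alpha:.3f}")
```

A high alpha shows that the items vary together. It does not show that they measure the intended construct.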
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
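The shared-variance figure is simply the loading squared. A tiny sketch, using the .63 loading from this example and the .33 loading discussed for Item 38 (the function name is illustrative):

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor: the squared loading."""
    return loading ** 2

for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes weak loadings look even weaker: a loading of .33 leaves roughly 89% of the item's variance unrelated to the factor.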
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
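The participants-per-item rule of thumb reduces to simple multiplication. A minimal sketch, with an illustrative function name and the 5-10 range taken from the guideline above:

```python
def participants_needed(n_items, per_item_min=5, per_item_max=10):
    """Rule-of-thumb sample size range: 5 to 10 participants for each questionnaire item."""
    return n_items * per_item_min, n_items * per_item_max

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```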
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 42
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means M1, M2, and M3 marked]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups.
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error
than running multiple t-tests.
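A rough sketch of why several t-tests introduce more error than one ANOVA. It counts the pairwise t-tests needed for a given number of groups and estimates the chance of at least one false positive. Treating the tests as independent is a simplifying assumption (pairwise tests on the same groups are not fully independent), and the function name is only illustrative.

```python
from math import comb


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_groups in (2, 3, 4, 5):
    n_pairs = comb(n_groups, 2)  # number of pairwise t-tests needed
    rate = familywise_error_rate(0.05, n_pairs)
    print(f"{n_groups} groups -> {n_pairs} t-tests -> error rate {rate:.3f}")
```

A single one-way ANOVA tests all the groups at once, so the overall alpha stays at the chosen level.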
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with a one-way ANOVA to
see if any groups scored significantly
higher/lower
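As an illustration of what the analysis does with such test results, here is a minimal sketch that computes the F-statistic by hand. The scores are hypothetical (not data from any study) and the function name is only illustrative. The p-value is not computed here; it would normally come from a statistics package.

```python
from statistics import mean

# Hypothetical vocab test scores for the three groups; NOT real data.
groups = {
    "Word cards": [14, 16, 15, 17, 18],
    "Word lists": [12, 13, 15, 14, 11],
    "PC software": [16, 18, 17, 19, 20],
}


def one_way_anova(samples):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for s in samples for x in s]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_scores) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_b, df_w = one_way_anova(list(groups.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The larger the differences between group means relative to the spread inside each group, the larger F becomes.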
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using 5-point Likert scales (1-5)
Results are analyzed with a one-way ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: results table from a published study, with callouts marking the descriptive statistics and the ANOVA statistics]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Pairwise comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
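A minimal sketch of how eta-squared can be calculated and labeled, assuming the common definition eta-squared = SS between groups / SS total. The ratings are hypothetical, the function names are only illustrative, and the "negligible" label for values under .01 is an addition not found on the slide.

```python
from statistics import mean


def eta_squared(samples):
    """Eta-squared = SS_between / SS_total for a list of score lists."""
    all_scores = [x for s in samples for x in s]
    grand_mean = mean(all_scores)
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: label for values below the benchmarks


# Hypothetical satisfaction ratings (1-5) for three review types.
ratings = [[3, 4, 4, 5, 4], [3, 3, 4, 4, 3], [4, 4, 5, 3, 4]]
value = eta_squared(ratings)
print(f"eta-squared = {value:.2f} ({cohen_label(value)} effect)")
```

A result can reach statistical significance and still have a small effect size, so both figures should be reported.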
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the
squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (other things being equal, the more items, the
greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
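For readers who want to see where the alpha figure comes from, here is a minimal sketch using the standard formula alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert answers are hypothetical and the function name is only illustrative.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's answer."""
    k = len(item_scores)
    # Total scale score for each respondent (sum across items).
    totals = [sum(answers) for answers in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-5 Likert answers from six respondents to three items.
items = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [5, 5, 2, 4, 3, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high alpha only shows that respondents answered the items consistently; it does not show that the items measure the intended construct.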
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
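The participants-per-item rule of thumb is simple arithmetic, sketched here with a hypothetical helper function. The 5 and 10 defaults come from the guideline above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (low, high) participant range for a questionnaire."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```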
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
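The loading-to-variance arithmetic from the two examples (Item 43 and Item 38) can be checked with a short sketch. The function name is only illustrative.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


for item, loading in [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33)]:
    share = shared_variance(loading)
    print(f"{item}: loading {loading} -> {share:.0%} shared variance")
```

Because the loading is squared, a loading that looks only moderately lower shares far less variance with the factor.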
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 44
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading gives the amount of shared variance
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., with similar correlations, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
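As a rough illustration of how alpha is built from item and total-score variances, here is a minimal sketch using the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the Likert answers are hypothetical and are not taken from any study in this talk.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row of item scores per participant (hypothetical layout)."""
    k = len(responses[0])                       # number of items on the scale
    items = list(zip(*responses))               # regroup scores item by item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Made-up 5-point Likert answers: 5 participants x 3 items
likert = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
alpha = cronbach_alpha(likert)
print(f"Cronbach's alpha = {alpha:.2f}")
```

With so few participants the value only illustrates the arithmetic; it says nothing about whether the items are valid.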
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
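The arithmetic on this slide (loading × loading = shared variance) can be sketched in a few lines. The helper name `shared_variance` is hypothetical; the loadings are the .63 from the slide plus the .4 and .7 guideline values mentioned earlier.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (squared loading)."""
    return loading ** 2

for loading in (0.63, 0.40, 0.70):
    pct = shared_variance(loading)
    print(f"loading {loading:.2f} -> {pct:.0%} of variance shared with the factor")
```

Seen this way, a loading of .4 corresponds to only about 16% shared variance, which may help explain why lower cut-offs are treated with caution.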
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
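The participants-per-item guideline can be turned into a quick check. This is only a sketch of the 5-10 per item rule of thumb quoted above; the function name `participants_needed` and its parameters are assumptions made for illustration.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N range from the 5-10 participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)   # the 30-item questionnaire example
print(f"30 items -> roughly {low}-{high} participants")
```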
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from mean
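To make the two ideas concrete, here is a small sketch that computes a mean and a sample standard deviation with Python's standard `statistics` module, and then locates one individual within the group as a z-score. The test scores are invented for illustration only.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class
scores = [62, 70, 75, 75, 81, 88, 94]

m = mean(scores)      # central tendency
sd = stdev(scores)    # sample standard deviation: spread around the mean

# Position of one individual within the group, in SD units (z-score)
z = (94 - m) / sd

print(f"M = {m:.2f}, SD = {sd:.2f}, z for a score of 94 = {z:.2f}")
```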
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
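The two designs above lead to two different t formulas. The sketch below computes only the t statistic and degrees of freedom by hand, assuming equal variances for the independent case; the function names and all scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(time1, time2):
    """One group measured twice: t is based on each person's difference score."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1                      # t statistic, degrees of freedom

def independent_t(group_a, group_b):
    """Two separate groups: t uses a pooled variance (equal variances assumed)."""
    n1, n2 = len(group_a), len(group_b)
    pooled = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Made-up vocab scores before and after extensive reading (same six learners)
t_paired, df_paired = paired_t([12, 15, 11, 14, 13, 16], [15, 17, 12, 18, 15, 19])

# Made-up test scores for an implicit and an explicit grammar group
t_ind, df_ind = independent_t([70, 65, 80, 75], [85, 78, 90, 83])

print(f"paired: t({df_paired}) = {t_paired:.2f}")
print(f"independent: t({df_ind}) = {t_ind:.2f}")
```

The p-value still has to be looked up from the t distribution; in practice a statistics package (for example SciPy's `ttest_rel` and `ttest_ind`) would report it directly.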
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t-test result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha = .05: a 5% chance of finding “significance” when no real difference exists
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
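The counts on this slide are expected numbers of chance results. A related figure is the probability of at least one chance result across all comparisons. The sketch below computes both, under the simplifying assumptions that the comparisons are independent and that no real differences exist; the function name `familywise_error` is hypothetical.

```python
def familywise_error(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 5, 20, 100):
    expected = k * 0.05                   # expected number of chance "hits"
    fwe = familywise_error(k)
    print(f"{k:>3} comparisons: about {expected:.2f} by chance, "
          f"P(at least one) = {fwe:.2f}")
```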
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
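As a sketch of the adjustment in action, the code below applies it to the eight p-values reported for Macaro & Erler (2007) in this talk. Dividing .05 by eight comparisons gives .00625, which would match the starred threshold of .006 if eight comparisons were adjusted for (an assumption; the slide does not state how the threshold was derived).

```python
# p-values as reported on the Macaro & Erler (2007) slide
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

# Bonferroni adjustment: divide the alpha level by the number of comparisons
adjusted_alpha = 0.05 / len(p_values)

significant = [area for area, p in p_values.items() if p < adjusted_alpha]

print(f"adjusted alpha = {adjusted_alpha:.5f}")
print("still significant:", ", ".join(significant))
```

Speaking and Writing would pass an unadjusted .05 cut-off but fall short of the adjusted one.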
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means M1, M2, M3 at similar levels]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group mean M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups without the inflated Type I error
of running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
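For readers curious about what the analysis actually computes, here is a minimal sketch of a one-way ANOVA for this three-group design. It compares variation between the group means with variation inside the groups. The function name `one_way_anova` and all scores are made up for illustration; they are not data from any study.

```python
from statistics import mean

def one_way_anova(groups):
    """Return F, df between, df within, and the two sums of squares."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within

# Hypothetical vocab test scores
word_cards = [18, 20, 22, 19, 21]
word_lists = [15, 17, 16, 18, 14]
pc_software = [20, 23, 22, 24, 21]

f_stat, df_b, df_w, ss_b, ss_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Whether the F value is significant still depends on the F distribution for those degrees of freedom, which a statistics package would report as a p-value.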
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Tables: descriptive statistics and ANOVA statistics]
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance explained by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
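Eta-squared is commonly computed as the between-groups sum of squares divided by the total sum of squares. The sketch below does that and attaches the Cohen (1988) labels listed above. The function names, the sums of squares, and the "negligible" label for values under .01 are assumptions added for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance associated with group membership."""
    return ss_between / ss_total

def describe_effect(eta_sq):
    """Label an eta-squared value using the Cohen (1988) guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"   # label below .01 is an assumption, not from the slide

# Hypothetical sums of squares from an ANOVA table
effect = eta_squared(ss_between=12.0, ss_total=600.0)
print(f"eta-squared = {effect:.2f} ({describe_effect(effect)} effect)")
```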
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA
Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Slide 47
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t       df     p
Reading            4.91    114    .001*
Speaking           2.28    114    .024
Writing            2.30    114    .023
Listening          4.12    114    .001*
Spelling           3.74    114    .001*
General learning   3.61    114    .001*
Homework           2.92    114    .004*
Textbook           3.01    114    .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
.05 level of significance = a 5% chance of finding a “significant” difference when none exists
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
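The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the second function assumes the comparisons are independent of each other, which is rarely exactly true of real data.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of comparisons that reach significance by chance alone."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one Type I error, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 5, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error_rate(k), 3))
```

With 20 unadjusted comparisons, the chance of at least one false "significant" result is roughly 64% under these assumptions, which is why some adjustment is needed.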
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
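A short sketch of the adjustment, applied to the p values shown earlier for Macaro & Erler (2007). The pairing of each area with its p value is read off the order of the slide table and should be treated as an assumption; the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# p values as listed on the slide (area-to-value pairing assumed from slide order)
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 comparisons
significant = [area for area, p in p_values.items() if p < adjusted]

print(adjusted)
print(significant)
```

With eight comparisons the adjusted alpha is .00625, which appears consistent with the "*p < .006" note on the slide: Speaking and Writing would pass an unadjusted .05 test but not the adjusted one.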
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups.
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error
than repeated t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
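To make the F-statistic less of a black box, here is a minimal sketch of how it is built from the sums of squares that appear in an ANOVA table. The three groups and their scores are hypothetical (loosely modeled on the vocabulary example), and the function name is an assumption. The sketch stops at F and its degrees of freedom; the p value comes from the F distribution, which packages such as SPSS or SciPy's `scipy.stats.f_oneway` look up for you.

```python
import statistics

def one_way_anova(*groups):
    """Return sums of squares, degrees of freedom, and F for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    group_means = [statistics.mean(g) for g in groups]
    # Between-groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups: how far each score sits from its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f}

# Hypothetical vocab test scores for three learning groups
word_cards = [14, 15, 16, 17, 18]
word_lists = [11, 12, 13, 14, 15]
pc_software = [15, 16, 17, 18, 19]

result = one_way_anova(word_cards, word_lists, pc_software)
print(result)
```

A large F means the differences between the group means are large relative to the spread of scores inside each group.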
ANOVA in the Literature
[Figure: a published results table with callouts marking the descriptive statistics and the ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
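Eta-squared is simple to compute from an ANOVA table: the between-groups sum of squares divided by the total sum of squares. The sketch below uses hypothetical sums of squares chosen to give .02, and labels the result with the Cohen (1988) benchmarks quoted earlier. The function names and the "negligible" label for values under .01 are assumptions for illustration only.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Rough label using Cohen's (1988) benchmarks: .01 small, .06 medium, .14 large."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # hypothetical label for values below the smallest benchmark

# Hypothetical sums of squares from an ANOVA table
effect = eta_squared(ss_between=20.0, ss_total=1000.0)
print(effect, cohen_label(effect))
```

An eta-squared of .02 counts as a small effect, which is why a statistically significant result can still reflect only a minor difference between group means.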
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; squaring it
gives the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
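For readers curious about where the alpha figure comes from, here is a minimal sketch of the standard Cronbach's alpha formula: it compares the sum of the individual item variances with the variance of respondents' total scores. The Likert responses are hypothetical and the function name is an assumption.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: one list of scores per questionnaire item, with respondents
    in the same order in every list.
    """
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each respondent's total score across all items
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1 - sum_item_variances / total_variance)

# Hypothetical 5-point Likert responses: 3 items, 5 respondents
item_1 = [4, 5, 3, 2, 4]
item_2 = [4, 4, 3, 2, 5]
item_3 = [5, 5, 2, 3, 4]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(round(alpha, 3))
```

Because the formula includes the number of items (k), adding more items of similar quality tends to push alpha up, which is one reason a high alpha on its own says little about the validity of the items.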
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
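The calculation above is just the loading multiplied by itself. A tiny sketch, with a hypothetical function name, makes it easy to check any loading reported in a paper:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2

for loading in (0.63, 0.4):
    percent = round(shared_variance(loading) * 100)
    print(f"loading {loading}: about {percent}% shared variance")
```

Seen this way, a loading of .4 means the item shares only about 16% of its variance with the factor, which helps explain why lower cut-off points are hard to defend.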
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
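The participants-per-item rule of thumb can be checked with a couple of lines of code. The function name and its default arguments are hypothetical; the 5-10 range is the one attributed to Field (2005) above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N-size range from the 5-10 participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```

This is only a planning aid: the "over 300" guideline may still apply even when the per-item range suggests a smaller sample.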
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 49
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
Often used for testing treatment effects or comparing survey results
Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect
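To make the first step of that two-step process concrete, here is a minimal sketch of how an F-statistic for a one-way ANOVA can be calculated by hand: the between-groups mean square divided by the within-groups mean square. The function name and the score lists are hypothetical. The sketch stops at F and its degrees of freedom; obtaining the p-value and running post-hoc comparisons would normally be left to a statistics package.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)                       # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    # Between-groups sum of squares: group means vs. grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: each score vs. its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocabulary test scores for three groups (illustration only)
f_stat, df_b, df_w = one_way_f([[4, 5, 6], [5, 6, 7], [8, 9, 10]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```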
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha tends to be)
– Does not “prove” cause-effect or validity of items themselves
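For readers who want to see what sits behind a reported alpha, the sketch below uses the standard formula: alpha = k/(k-1) × (1 − sum of item variances / variance of respondents' total scores), where k is the number of items. The function name and the Likert responses are hypothetical and serve only as an illustration.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per respondent
    and one column per questionnaire item."""
    k = len(responses[0])                         # number of items
    items = list(zip(*responses))                 # one tuple of answers per item
    item_variances = [variance(item) for item in items]
    total_scores = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(total_scores))

# Hypothetical 5-point Likert answers: 5 respondents x 3 items
likert = [
    [3, 4, 3],
    [4, 5, 5],
    [2, 2, 3],
    [5, 5, 4],
    [1, 2, 2],
]
print(f"alpha = {cronbach_alpha(likert):.2f}")
```

A high alpha here only says the three items move together; as the slide notes, it says nothing about whether the items are valid.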
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
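The arithmetic on this slide can be checked in a couple of lines: squaring a loading gives the proportion of variance the item shares with the factor. The helper names below are hypothetical, and the .4 cut-off is simply the threshold quoted on the slide, passed in as a default.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2

def meets_cutoff(loading, cutoff=0.4):
    """True if the absolute loading reaches the chosen cut-off (.4 on the slide)."""
    return abs(loading) >= cutoff

loading = 0.63  # Item 43 on F1, as reported on the slide
print(f"shared variance = {shared_variance(loading) * 100:.0f}%")
print("acceptable loading:", meets_cutoff(loading))
```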
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
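The example above follows directly from the participants-per-item rule of thumb. A minimal sketch, assuming the 5-10 range quoted from Field (2005); the function name and its defaults are hypothetical.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by a
    participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)  # the 30-item questionnaire from the slide
print(f"{low}-{high} participants")
```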
Problems and issues with FA
“Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
Helps researchers draw conclusions from a large number of items through data reduction
Requires a large N-size and several reference books
Often written up with no regard to APA guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 52
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
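A paired samples design like this one can be sketched as follows. The vocabulary scores are invented for illustration, and only the t value and degrees of freedom are computed; the p value would then be looked up in a t distribution.

```python
import math
import statistics

# Hypothetical vocab scores for the same six learners at two times
time1 = [12, 15, 11, 18, 14, 16]  # before extensive reading
time2 = [15, 17, 12, 21, 16, 19]  # after extensive reading

# A paired t-test works on the per-learner differences
diffs = [after - before for before, after in zip(time1, time2)]
n = len(diffs)

standard_error = statistics.stdev(diffs) / math.sqrt(n)
t = statistics.mean(diffs) / standard_error
df = n - 1

print(f"t = {t:.2f}, df = {df}")
```

Because each learner serves as their own comparison, df is the number of learners minus one.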
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
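For two separate groups, a minimal sketch of the classic pooled-variance t statistic looks like this. The function name and the scores are assumptions made for illustration; it assumes roughly equal variances in the two groups.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and df for two independent groups."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df

# Hypothetical grammar test scores
implicit = [10, 12, 14, 16, 18]   # Group A
explicit = [14, 16, 18, 20, 22]   # Group B

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

Here df is the total number of participants minus two, which is why a study with groups of 62 and 54 reports df = 114.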
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
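95% level of significance (alpha = .05) = a 5% risk, per comparison, of finding a difference that is only due to chance
20 comparisons = about 1 expected to reach significance by chance
100 comparisons = about 5 expected to reach significance by chance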
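The arithmetic behind these figures can be sketched as below. The function names are illustrative, and the second function assumes the comparisons are independent of each other, which real data rarely guarantee.

```python
def expected_false_positives(k, alpha=0.05):
    """Expected number of chance 'significant' results in k comparisons."""
    return alpha * k

def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive in k independent comparisons."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error_rate(k), 3))
```

With 20 comparisons at alpha = .05, one chance result is expected, and the probability of at least one is roughly .64.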
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
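A short sketch of the adjustment, applied to the eight p values in the Macaro & Erler (2007) attitudes results shown earlier. Matching each p value to an area assumes the rows appear in the order the areas are listed; the function name is illustrative.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level for multiple comparisons."""
    return alpha / n_comparisons

# The slide's own example: 0.05 / 5
print(bonferroni_alpha(0.05, 5))

# p values as reported on the slide, paired with areas by listed order
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004,
    "Textbook": .005,
}
adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8
significant = [area for area, p in p_values.items() if p < adjusted]
print(adjusted, significant)
```

Dividing .05 by eight comparisons gives .00625, which is consistent with the *p < .006 threshold on the slide; Speaking and Writing fall short of it.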
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: group means M1, M2 and M3 for the Control, Treatment 1 and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2 and M3 for the Control, Treatment 1 and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error
than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published example showing a descriptive statistics table and an ANOVA statistics table]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
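To make the F-statistic less of a black box, here is a minimal from-scratch sketch of a one-way ANOVA. The function name and the three sets of scores are invented for illustration; a real analysis would normally use a statistics package, which also supplies the p value.

```python
import statistics

def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and F for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f}

# Hypothetical vocab test scores for three learning groups
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

result = one_way_anova([word_cards, word_lists, pc_software])
print(result)
```

F grows when the group means are far apart relative to the spread inside each group, but as noted above it does not say which groups differ.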
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
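Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it and attaches a label from the Cohen (1988) benchmarks listed above. The function names, the "negligible" label for values under .01, and the scores are assumptions made for illustration.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label chosen here, not part of the benchmarks

# Hypothetical scores for three groups
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
value = eta_squared(groups)
print(round(value, 2), effect_label(value))
print(effect_label(0.02))
```

An eta-squared of .02 falls in the small band, so a result can be statistically significant while the treatment explains very little of the variance.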
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared
loading gives the amount of shared variance between the
item and the factor
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(other things being equal, the more items, the greater
the alpha)
– Does not “prove” cause-effect or validity of items
themselves
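As a rough illustration of how Cronbach's alpha is obtained, the sketch below uses the standard formula based on item variances and the variance of the total score. The function name and the Likert responses are invented; each inner list holds one item's answers from the same five respondents.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha from a list of items (one list of scores per item)."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Total score per respondent across all items
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical 5-point Likert answers: 3 items x 5 respondents
item1 = [4, 3, 5, 2, 4]
item2 = [5, 3, 4, 2, 4]
item3 = [4, 2, 5, 3, 5]

alpha = cronbach_alpha([item1, item2, item3])
print(round(alpha, 3))
```

The value rises when respondents answer the items consistently, which says the items hang together but not that they measure what the researcher intended.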
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
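The participants-per-item rule of thumb is simple multiplication, sketched here with an illustrative helper function (the name and defaults are assumptions, using the 5-10 range cited above).

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```

For a 30-item questionnaire this gives the 150-300 range shown in the example.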
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
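The shared-variance figures in these examples come from squaring the loading. A small sketch of that arithmetic follows; the function names are illustrative, and the .4 cut-off is the one attributed to Stevens (1992) above.

```python
def shared_variance_percent(loading):
    """Percentage of variance an item shares with a factor."""
    return loading ** 2 * 100

def is_acceptable(loading, cutoff=0.4):
    """True if the loading is above the cut-off used on the slides."""
    return abs(loading) > cutoff

# Loadings quoted on the slides
for label, loading in [("Item 43 on F1", 0.63),
                       ("Item 38 on F3", 0.33),
                       ("Item 38 on F1", 0.29)]:
    print(label, round(shared_variance_percent(loading)), is_acceptable(loading))
```

A loading of .63 shares about 40% of its variance with the factor, while .33 shares only about 11%, which is why low cut-off points are a concern.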
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 54
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Pairwise comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
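A minimal sketch of how eta-squared could be computed and labelled, assuming the usual definition (between-groups sum of squares divided by total sum of squares) and the Cohen (1988) cut-offs listed on the slide. The function names and the "negligible" label for values under .01 are assumptions added for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the cut-offs quoted on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch, not from the slide

# Hypothetical sums of squares taken from an ANOVA table
example = eta_squared(ss_between=20.0, ss_total=100.0)
print(f"eta-squared = {example:.2f} ({cohen_label(example)})")
```

An eta-squared of .02, for instance, would fall in the "small" band even if the F-statistic itself reached significance.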
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample
population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the
amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more
items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
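For readers who want to see where alpha comes from, here is a small sketch using the standard formula: alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are invented and the function name `cronbach_alpha` is an assumption for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, respondents in the same order."""
    k = len(items)
    # Each respondent's total score across all items
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Hypothetical responses from 5 students to 3 items meant to tap the same factor
item_1 = [4, 5, 3, 2, 4]
item_2 = [4, 4, 3, 2, 5]
item_3 = [5, 5, 2, 3, 4]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"alpha = {alpha:.2f}")  # alpha = 0.89
```

A high alpha only shows that the items vary together; as the slide notes, it says nothing about whether they measure what the researcher intended.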
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to
learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
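The arithmetic behind this reading is simply the loading squared. A short sketch follows; the function name is an assumption, and the loadings are the .63 from this example plus the .4 and .7 benchmarks and a weak .33 for contrast.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor (loading squared)."""
    return loading ** 2

for loading in (0.63, 0.40, 0.70, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
# loading 0.63 -> 40% shared variance
# loading 0.40 -> 16% shared variance
# loading 0.70 -> 49% shared variance
# loading 0.33 -> 11% shared variance
```

Seen this way, even a loading at the .4 threshold means the item shares only about a sixth of its variance with the factor.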
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
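The participants-per-item rule of thumb above is easy to turn into a quick check. This is only a sketch of that rule; the function name and default values are assumptions taken from the 5-10 range quoted on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")  # 30 items -> 150-300 participants
```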
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
– Central tendency
Standard Deviation
– Variability from mean
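To make the two ideas concrete, the sketch below compares two invented classes that have the same mean but very different spread. The scores are hypothetical; `mean` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, stdev

# Hypothetical test scores for two classes (invented for this sketch)
class_a = [78, 80, 82]
class_b = [60, 80, 100]

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    print(f"{name}: M = {mean(scores):.1f}, SD = {stdev(scores):.1f}")
# Class A: M = 80.0, SD = 2.0
# Class B: M = 80.0, SD = 20.0
```

Reporting the mean alone would make the two classes look identical, which is why both figures are needed.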
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical
Flat or peaked
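"Not symmetrical" and "flat or peaked" correspond to what statistics packages report as skewness and kurtosis. A minimal sketch, assuming the simple population (moment-based) formulas; packages such as SPSS apply small-sample corrections, so their values will differ slightly. The data and function names are invented for illustration.

```python
def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    m = sum(data) / len(data)
    return sum((x - m) ** order for x in data) / len(data)

def skewness(data):
    """0 for a symmetrical distribution; positive when the tail is on the high side."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """0 for a normal curve; negative when flatter, positive when more peaked."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
lopsided = [1, 1, 1, 2, 10]
print(skewness(symmetric), skewness(lopsided) > 0)
```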
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants
N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and
sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
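The two designs above lead to two different calculations. The sketch below shows both under stated assumptions: the paired version works on each learner's difference score, and the independent version uses the classic pooled-variance (Student) formula, which assumes roughly equal variances. All scores and function names are invented for illustration, and the p-value lookup is left to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(time_1, time_2):
    """t and df for one group measured twice (same learners, same order)."""
    diffs = [after - before for before, after in zip(time_1, time_2)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

def independent_t(group_a, group_b):
    """t and df for two separate groups, using the pooled variance."""
    n1, n2 = len(group_a), len(group_b)
    pooled = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical vocab scores before and after extensive reading
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]
t_paired, df_paired = paired_t(before, after)

# The same numbers treated as two unrelated groups
t_indep, df_indep = independent_t(before, after)

print(f"paired: t({df_paired}) = {t_paired:.2f}")
print(f"independent: t({df_indep}) = {t_indep:.2f}")
```

Note how the degrees of freedom differ: n - 1 for the paired design and n1 + n2 - 2 for the independent design.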
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area              t      df    p
Reading           4.91   114   .001*
Speaking          2.28   114   .024
Writing           2.30   114   .023
Listening         4.12   114   .001*
Spelling          3.74   114   .001*
General learning  3.61   114   .001*
Homework          2.92   114   .004*
Textbook          3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
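95% level of significance (alpha = .05) = a 5% chance of a significant result when the difference is only due to chance
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 expected by chance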
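The counts above follow from multiplying alpha by the number of comparisons. A short sketch of that arithmetic, plus the chance of getting at least one false positive, assuming every comparison is independent and no real differences exist (both simplifying assumptions; the function names are made up for illustration).

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of 'significant' results produced by chance alone."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 20, 100):
    print(n, expected_false_positives(0.05, n), round(familywise_error_rate(0.05, n), 2))
```

With 20 comparisons at alpha = .05, the chance of at least one spurious result is roughly 64%, which is why an adjustment is needed.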
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
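As a sketch of the adjustment in practice, the code below applies it to the eight attitude comparisons reported for Macaro & Erler (2007). The p-values are those shown on the earlier slide; pairing each one with an area name assumes the areas and results were listed in the same order, and the function name is an assumption for illustration.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the overall alpha by the number of comparisons made."""
    return alpha / n_comparisons

# p-values as reported on the slide; the area pairing is assumed from list order
attitude_p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(attitude_p_values))  # 0.05 / 8 = 0.00625
significant = [area for area, p in attitude_p_values.items() if p < adjusted]
print(f"adjusted alpha = {adjusted}")
print("significant:", ", ".join(significant))
```

Eight comparisons give an adjusted alpha of .00625, which matches the "*p < .006" note on that slide; Speaking and Writing fall short of it even though both are below .05.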
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: mean scores M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: mean scores M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups.
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups without the inflated Type I error
that comes from running multiple t-tests.
Slide 57
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but rather…
– To help everyone better understand and interpret common statistical methods encountered in SLA studies
– To cover common errors and issues related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw back in terror when confronted with large doses of numbers, tables, and statistics?
It is irresponsible to ignore such research just because you do not have the relatively simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics provide?
Snapshot description of the situation observed
Numerical representations of how each group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean – central tendency
Standard deviation – variability from the mean
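As a small illustration of these two measures, the sketch below computes a mean and a standard deviation with Python's standard `statistics` module. The scores are invented for illustration and are not taken from any study mentioned in this presentation.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustrative data only).
scores = [12, 15, 15, 16, 18, 20, 23]

mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample standard deviation (divides by n - 1)

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting both values together lets a reader picture where the group sits and how widely individual scores spread around that point.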
Normal Distribution
A normal curve: just as in the natural world…
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
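One way to put numbers on "how normal" is to compute skewness (asymmetry) and excess kurtosis (flatness or peakedness). The sketch below uses the simple population formulas; the function names and the data are hypothetical, and statistical packages may apply slightly different small-sample corrections.

```python
import statistics

def skewness(data):
    """Population skewness: 0 for a perfectly symmetrical distribution."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Population excess kurtosis: 0 for a normal curve,
    negative for a flatter curve, positive for a more peaked one."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3.0

symmetric = [1, 2, 3, 4, 5]         # evenly spread, no skew
right_tailed = [1, 1, 2, 2, 3, 10]  # one high outlier pulls the tail right

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

Values far from zero suggest the data may not be appropriate for procedures that assume a normal distribution.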
Issues
Are the data appropriate for further statistical analyses?
– Mean and SD
– Participants: N size and sampling (random or convenience)
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (within-subject or repeated measures design), also called:
– Paired samples t-test
– Matched pairs t-test
– Dependent means t-test
Two groups (between-group or between-subjects design), also called:
– Independent samples t-test
– Independent measures t-test
– Independent means t-test
Uses of t-tests
“(T)he simplest form of experiment that can be done: only one independent variable is manipulated in only two ways and only one dependent variable is measured” (Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
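The two designs above lead to two different calculations. The sketch below computes the t statistic and degrees of freedom by hand for each case. The function names and all scores are hypothetical, and the independent version assumes equal variances (pooled). In practice a library such as SciPy (`scipy.stats.ttest_rel`, `scipy.stats.ttest_ind`) would also return the p-value.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic and df for a paired samples t-test (one group, two times)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def independent_t(group_a, group_b):
    """t statistic and df for an independent samples t-test (pooled variance)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, na + nb - 2

# Hypothetical vocab scores before and after extensive reading (same learners).
time1 = [10, 12, 9, 14, 11]
time2 = [13, 14, 10, 17, 12]
t_paired, df_paired = paired_t(time1, time2)

# Hypothetical test scores for two separate groups.
implicit = [10, 12, 14]
explicit = [13, 15, 17]
t_ind, df_ind = independent_t(implicit, explicit)

print(t_paired, df_paired, t_ind, df_ind)
```

Note that the degrees of freedom differ by design: n - 1 for paired data, and n1 + n2 - 2 for two independent groups.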
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              Result
Reading           t = 4.91, df = 114, p = .001*
Speaking          t = 2.28, df = 114, p = .024
Writing           t = 2.30, df = 114, p = .023
Listening         t = 4.12, df = 114, p = .001*
Spelling          t = 3.74, df = 114, p = .001*
General learning  t = 3.61, df = 114, p = .001*
Homework          t = 2.92, df = 114, p = .004*
Textbook          t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons the data failed to reach significance.
Controlling for Type I error
.05 level of significance = a 5% chance of a “significant” result when there is really no difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
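The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the second function assumes the comparisons are independent of each other, which real data may not satisfy.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of 'significant' results expected by chance alone
    when no real differences exist."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

print(expected_false_positives(0.05, 20))   # about 1
print(expected_false_positives(0.05, 100))  # about 5
print(familywise_error_rate(0.05, 20))      # roughly 0.64
```

Under these assumptions, running 20 unadjusted comparisons gives a better-than-even chance of at least one spurious "significant" result.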
Controlling for Type I error (cont.)
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Adjusted alpha = alpha level / no. of comparisons
e.g. 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
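A minimal sketch of the Bonferroni adjustment follows, applied to the p-values listed for Macaro & Erler (2007) earlier in this section. It assumes that the threshold reported there (p < .006) came from dividing .05 by the eight comparisons. That is an inference from the numbers, not something the slides state. The function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# p-values as listed in the Macaro & Erler (2007) attitudes results.
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

# Assumption: 8 comparisons at an overall alpha of .05 gives .00625.
adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(adjusted)
print(significant)
```

With this threshold, Speaking and Writing fall short of significance even though both would pass an unadjusted .05 level, which matches the asterisks in the reported results.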
Issues
• Do the data meet normality assumptions?
• Is the sample size large enough?
• Are the data continuous?
• Is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means marked M1, M2, M3]
Similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means marked M1, M2, M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups
However, t-tests work best when limited to 2 groups
ANOVAs can work with 3 or more groups while introducing less (Type I) error than repeated t-tests
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
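To make the vocabulary example concrete, the sketch below computes a one-way ANOVA F-statistic by hand for three groups. The scores are invented and far smaller than any real study would use, and the function name is hypothetical. A package such as SPSS or SciPy (`scipy.stats.f_oneway`) would also report the p-value.

```python
import statistics

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocab test scores (illustrative only).
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

f_stat, df_b, df_w = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the differences between group means are large relative to the spread of scores inside each group.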
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1–5) scales
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what researchers look for
ANOVA in the Literature
[Figure: published tables labeled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Pairwise comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
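Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares, then read against the Cohen (1988) benchmarks listed above. The function names are hypothetical, and the "negligible" label for values below .01 is an assumption added for completeness.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Rough verbal label using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: not part of the listed benchmarks

# Hypothetical sums of squares: 24 between groups out of 30 in total.
print(cohen_label(eta_squared(24, 30)))
# An eta-squared of .02 counts as a small effect.
print(cohen_label(0.02))
```

This is why a result can reach statistical significance and still describe a small effect: significance and magnitude are separate questions.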
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
[Figure: panels labeled “Post-hoc results” and “ANOVA table”]
“[This table] shows the result from running through an ANOVA by using SPSS. It can be seen that the difference among treatments is significant (p < 0.05). The scores for the Vocabulary condition were much higher than the other conditions. The Main Character condition was slightly higher than the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results
– Which groups were significantly different from each other?
No mention of effect size
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
Often used for testing treatment effects or comparing survey results
Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA (principal components analysis), FA, EFA (exploratory factor analysis), CFA (confirmatory factor analysis)…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance between the item and the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
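The sketch below shows one common way to compute Cronbach's alpha from raw item responses, plus the standardized formula that makes the "more items, greater alpha" point visible. The function names and the responses are hypothetical, and the four-participant data set is far too small for real use.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding one response per participant."""
    k = len(item_scores)
    sum_item_variances = sum(statistics.variance(item) for item in item_scores)
    # Each participant's total score across all items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum_item_variances / statistics.variance(totals))

def standardized_alpha(k, mean_inter_item_r):
    """Alpha implied by k items with a given average inter-item correlation."""
    return k * mean_inter_item_r / (1 + (k - 1) * mean_inter_item_r)

# Hypothetical Likert responses: 3 items, 4 participants.
items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))

# Same average correlation (.3), different numbers of items.
print(standardized_alpha(5, 0.3), standardized_alpha(10, 0.3))
```

Holding the average inter-item correlation fixed, doubling the number of items raises alpha, so a high alpha alone does not show that the items are good.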
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
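The participants-per-item rule of thumb is simple multiplication, sketched below. The function name and its default arguments are hypothetical; the 5 and 10 come from the guideline cited above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of sample sizes implied by a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"{low}-{high} participants")
```

For a 30-item questionnaire this reproduces the 150-300 range given in the example.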
Problems and issues with FA
“Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, i.e. 11% of shared variance with the factor
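The two loading examples in this section (.63 and .33) can be checked with a few lines. The function names are hypothetical, and the .4 cut-off is the guideline cited from Stevens (1992) in the earlier example.

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance an item shares with the factor."""
    return loading ** 2

def meets_cutoff(loading, cutoff=0.4):
    """True if the absolute loading reaches the chosen cut-off (.4 here)."""
    return abs(loading) >= cutoff

for loading in (0.63, 0.33):
    pct = shared_variance(loading) * 100
    print(f"loading {loading:.2f}: about {pct:.0f}% shared variance, "
          f"meets cut-off: {meets_cutoff(loading)}")
```

Squaring makes the gap clearer: a .33 loading shares only about 11% of its variance with the factor, roughly a quarter of what a .63 loading shares.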
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
Helps researchers draw conclusions from a large number of items through data reduction
Requires a large N-size and several reference books
Often written up with no regard to APA guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 59
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
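As a rough sketch of what the analysis does with the three groups' test results, the F-statistic can be computed by hand in Python. The scores below are invented for illustration and are not from any study; the function name is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical vocabulary test scores for the three learning groups.
word_cards = [14, 16, 15, 17, 18]
word_lists = [12, 13, 14, 12, 14]
pc_software = [15, 14, 16, 15, 15]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A statistics package would go one step further and convert F and its degrees of freedom into a p-value; that step is left out here.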
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: descriptive statistics table and ANOVA table from a published study]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance explained by the treatments
– .01: small effect
– .06: medium effect
– .14: large effect
*According to Cohen (1988)
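Eta-squared can be sketched in Python as the between-groups sum of squares divided by the total sum of squares. The scores are invented for illustration, the function names are hypothetical, and treating Cohen's figures as lower bounds (with a "negligible" label below .01) is an assumption made here for convenience.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total


def describe_effect(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed for values below .01


# Hypothetical scores for three groups.
scores = [[14, 16, 15, 17, 18], [12, 13, 14, 12, 14], [15, 14, 16, 15, 15]]
value = eta_squared(scores)
print(f"eta-squared = {value:.2f} ({describe_effect(value)})")
```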
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the [post-hoc test] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA
Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results
– Which groups were significantly different from each other?
No mention of effect size
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or
sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared
loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(i.e., the more items, the greater the alpha tends to be)
– Does not “prove” cause-effect or validity of items
themselves
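For readers who want to see where Cronbach's alpha comes from, here is a minimal Python sketch using the standard formula: k / (k - 1) multiplied by (1 minus the sum of the item variances divided by the variance of the total scores). The Likert responses are invented for illustration and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's answer."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    # Each respondent's total score across all items.
    totals = [sum(answers) for answers in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses from 5 participants to 3 items on one factor.
items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
print(f"alpha = {cronbach_alpha(items):.2f}")
```

The formula makes the caveat above visible: alpha rises when items move together, whether or not they measure what the researcher intended.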
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
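The squaring step on this slide can be checked with a short Python sketch. The function name is hypothetical; the .40 and .70 values are the "desirable" and "excellent" guidelines mentioned earlier in the talk.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2


# The .63 loading from the slide, plus the .40 and .70 guideline values.
for loading in (0.63, 0.40, 0.70):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```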
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
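The participants-per-item rule of thumb in the example above is simple multiplication, sketched here in Python. The function name and default arguments are hypothetical.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```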
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
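As a rough illustration of these two aspects, the sketch below computes a mean and a sample standard deviation with Python's standard library. The scores are hypothetical and do not come from the presentation.

```python
import statistics

# Hypothetical vocabulary test scores for one class (invented for illustration).
scores = [12, 15, 15, 16, 18, 20, 23]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample SD: variability around the mean (divides by n - 1)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting N alongside M and SD lets readers judge how much weight the two summary numbers can bear.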
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
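"Not symmetrical" and "flat or peaked" are usually quantified as skewness and kurtosis. The following is a minimal sketch using population moments. The function name and both data sets are assumptions made for illustration, and statistics packages may apply slightly different small-sample corrections.

```python
def shape_stats(values):
    """Return (skewness, excess kurtosis) based on population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m3 = sum((x - mean) ** 3 for x in values) / n   # asymmetry
    m4 = sum((x - mean) ** 4 for x in values) / n   # tail weight
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3              # 0 for a normal curve
    return skewness, excess_kurtosis

# Hypothetical score sets: one symmetrical, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

for label, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    skew, kurt = shape_stats(data)
    print(f"{label}: skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

Skewness near zero suggests symmetry. Negative excess kurtosis suggests a flatter curve than the normal distribution, and positive excess kurtosis a more peaked one.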
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
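The paired design above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the scores are hypothetical, and the code computes the t value and degrees of freedom by hand rather than reproducing any particular package's output.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples t-test: returns (t, df) for the same people measured twice."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    t = statistics.mean(diffs) / standard_error
    return t, n - 1

# Hypothetical vocab test scores for five learners (invented for illustration).
before_reading = [10, 12, 9, 14, 11]
after_reading = [12, 15, 10, 15, 14]

t, df = paired_t(before_reading, after_reading)
print(f"t = {t:.2f}, df = {df}")
```

The test works on each learner's difference score, which is why the two lists must be in the same learner order.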
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
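For two separate groups, the same idea can be sketched with a pooled-variance (Student's) t-test. The function name and scores are hypothetical, and equal variances in the two groups are assumed.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Independent samples t-test with pooled variance: returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df

# Hypothetical grammar test scores (invented for illustration).
implicit_group = [10, 12, 14]
explicit_group = [13, 15, 17]

t, df = independent_t(implicit_group, explicit_group)
print(f"t = {t:.2f}, df = {df}")
```

The sign of t only reflects which group is listed first. The degrees of freedom are the two group sizes added together, minus 2.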
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area: result
Reading: t = 4.91, df = 114, p = .001*
Speaking: t = 2.28, df = 114, p = .024
Writing: t = 2.30, df = 114, p = .023
Listening: t = 4.12, df = 114, p = .001*
Spelling: t = 3.74, df = 114, p = .001*
General learning: t = 3.61, df = 114, p = .001*
Homework: t = 2.92, df = 114, p = .004*
Textbook: t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = if there is no real difference, a “significant” result still turns up by chance about 5% of the time
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
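The arithmetic behind this slide can be sketched as follows. The function names are hypothetical, and the second calculation assumes the comparisons are independent of one another, which real studies may not satisfy.

```python
ALPHA = 0.05

def expected_false_positives(comparisons, alpha=ALPHA):
    """Average number of chance 'significant' results when no real effects exist."""
    return comparisons * alpha

def familywise_error_rate(comparisons, alpha=ALPHA):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for m in (1, 5, 20, 100):
    print(
        f"{m:>3} comparisons: expect {expected_false_positives(m):.2f} by chance, "
        f"P(at least one) = {familywise_error_rate(m):.2f}"
    )
```

Under that assumption, the chance of at least one spurious result rises quickly as comparisons are added, which is the motivation for adjusting alpha.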
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
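A short sketch of the adjustment is given below. It first reproduces the 0.05 / 5 example, then applies the same rule to the eight p values reported on the Macaro & Erler (2007) slide. The function name is hypothetical.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni adjustment: divide alpha by the number of comparisons."""
    return alpha / comparisons

print(bonferroni_alpha(0.05, 5))  # the example on the slide

# p values as reported on the Macaro & Erler (2007) attitudes slide.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]
print(f"adjusted alpha = {adjusted:.5f}")
print("still significant:", ", ".join(significant))
```

With eight comparisons the adjusted alpha is .00625, in line with the `*p < .006` note on that slide, so Speaking and Writing no longer count as significant.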
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while introducing less error (repeated t-tests across several groups inflate the chance of a Type I error).
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are ANOVA analyzed to see if any groups scored significantly higher/lower
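To make the F-statistic for a design like this concrete, here is a minimal hand-computed sketch. The function name and the scores for the three groups are hypothetical. The code only produces F and its degrees of freedom, so the p value would still need to come from an F table or statistics software.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [score for group in groups for score in group]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (score - statistics.mean(group)) ** 2 for group in groups for score in group
    )
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocab test scores (invented for illustration).
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

F grows when the group means are far apart relative to the spread of scores inside each group.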
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Tables: descriptive statistics; ANOVA statistics]
F-statistic is significant… i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
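Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares, with the Cohen (1988) benchmarks listed above used for the label. The function names and the scores are hypothetical.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [score for group in groups for score in group]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the benchmarks quoted on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"

# Hypothetical scores for three groups (invented for illustration).
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
value = eta_squared(groups)
print(f"eta-squared = {value:.2f} ({effect_label(value)})")
```

The benchmarks are conventions, not strict cut-offs, so a reported value is best read alongside the group means themselves.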
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
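Cronbach's alpha can be sketched from its usual variance-based formula: k / (k - 1) multiplied by (1 - sum of item variances / variance of total scores). The function name and the Likert responses are hypothetical, and five respondents is far too few for real use.

```python
import statistics

def cronbach_alpha(responses):
    """responses: one row per participant, one column per questionnaire item."""
    k = len(responses[0])                        # number of items
    items = list(zip(*responses))                # transpose: one tuple per item
    item_variances = sum(statistics.variance(item) for item in items)
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-point Likert answers from five participants to three items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha is high here because participants who score high on one item also score high on the others. That shows consistency among the items, not validity.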
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
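The participants-per-item rule of thumb is simple multiplication, sketched here for the 30-item example. The function name is hypothetical, and the 5-10 range is the guideline quoted above, not a guarantee of adequate sampling.

```python
def participants_needed(items, per_item_low=5, per_item_high=10):
    """Rule-of-thumb sample size range: 5-10 participants per questionnaire item."""
    return items * per_item_low, items * per_item_high

low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```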
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
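The shared-variance arithmetic on these slides is just the loading squared. The sketch below applies it to the three loadings quoted above (.63, .33, .29). The function name is hypothetical.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2

# Loadings quoted on the slides: Item 43 on F1, Item 38 on F3 and on F1.
loadings = {"Item 43 on F1": 0.63, "Item 38 on F3": 0.33, "Item 38 on F1": 0.29}

for label, loading in loadings.items():
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.1%} shared variance")
```

Squaring makes the gap plain: a loading roughly twice as large corresponds to nearly four times as much shared variance.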
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 2
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 3
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
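95% level of significance (alpha = .05) = a 5% risk of a significant result arising by chance alone when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance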
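The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the calculation assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of comparisons that reach significance by chance alone."""
    return alpha * comparisons


def familywise_error_rate(alpha, comparisons):
    """Probability of at least one chance 'significant' result."""
    return 1 - (1 - alpha) ** comparisons


for k in (1, 5, 20, 100):
    print(
        k,
        round(expected_false_positives(0.05, k), 2),
        round(familywise_error_rate(0.05, k), 3),
    )
```

The second function shows why multiple comparisons matter: the chance of at least one false positive grows quickly as comparisons are added.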
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
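As a small sketch of the adjustment (the function name `bonferroni_alpha` is hypothetical):

```python
def bonferroni_alpha(alpha, comparisons):
    """Divide the alpha level by the number of comparisons made."""
    if comparisons < 1:
        raise ValueError("comparisons must be at least 1")
    return alpha / comparisons


adjusted = bonferroni_alpha(0.05, 5)
print(round(adjusted, 4))

# A result counts as significant only if its p-value is below the adjusted alpha.
print(0.004 < adjusted, 0.024 < adjusted)
```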
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while keeping Type I error lower
than running multiple t-tests.
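One way to see the problem with t-tests for 3 or more groups is to count the pairwise comparisons they would require. This sketch uses hypothetical group labels and a hypothetical helper name.

```python
from itertools import combinations


def pairwise_comparisons(groups):
    """List every pair of groups that would need its own t-test."""
    return list(combinations(groups, 2))


three = pairwise_comparisons(["Control", "Treatment 1", "Treatment 2"])
five = pairwise_comparisons(["A", "B", "C", "D", "E"])

print(len(three), three)
print(len(five))
```

Each extra t-test is another chance of a Type I error, whereas a one-way ANOVA tests all the group means in a single step.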
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with a one-way ANOVA to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with a one-way ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
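For readers curious where the F-statistic comes from, here is a minimal sketch of the one-way ANOVA calculation. The scores are invented for three small groups, and the function name and dictionary keys are hypothetical. F is the ratio of between-group variance to within-group variance; judging significance would still require an F table or a statistics package.

```python
import statistics

# Hypothetical vocab test scores for three learning groups (illustrative only).
word_cards = [20, 22, 24]
word_lists = [18, 20, 22]
pc_software = [25, 27, 29]


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a one-way ANOVA."""
    all_scores = [score for group in groups for score in group]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (score - statistics.mean(group)) ** 2 for group in groups for score in group
    )
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f,
    }


result = one_way_anova([word_cards, word_lists, pc_software])
print(result)
```

The larger the differences between group means relative to the spread inside each group, the larger F becomes.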
ANOVA in the Literature
[Example tables: descriptive statistics; ANOVA statistics]
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
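Eta-squared is commonly calculated as the between-groups sum of squares divided by the total sum of squares. The sketch below assumes that formula, with hypothetical sums of squares and function names; the labels apply the Cohen (1988) guidelines listed above.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def effect_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"


# Hypothetical values taken from an ANOVA table (illustrative only).
value = eta_squared(ss_between=12.0, ss_total=600.0)
print(round(value, 2), effect_label(value))
```

A result can be statistically significant and still have a small effect size, which is why both should be reported.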
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
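Cronbach's alpha can be sketched with the standard formula: alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses and the function name below are hypothetical.

```python
import statistics

# Hypothetical Likert (1-5) responses: one list per item, one score per respondent.
item1 = [4, 3, 5, 2, 4]
item2 = [5, 3, 4, 2, 4]
item3 = [4, 2, 5, 3, 5]


def cronbach_alpha(items):
    """Return Cronbach's alpha for a list of item score lists."""
    k = len(items)
    item_variances = [statistics.variance(scores) for scores in items]
    # Each respondent's total score across all items.
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


alpha = cronbach_alpha([item1, item2, item3])
print(round(alpha, 3))
```

Alpha rises when respondents answer the items consistently, but a high value says nothing about whether the items measure what they are meant to measure.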
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
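The shared-variance arithmetic is simply the loading squared. A tiny sketch, using a hypothetical function name and the two loadings quoted in this section (.63 and .33):

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2


for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes weak loadings look weaker still: a loading that is roughly half as large shares only about a quarter as much variance.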
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
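The participants-per-item guideline can be turned into a quick check of a planned questionnaire. This sketch uses a hypothetical function name and takes the 5-10 ratio as its default.

```python
def participants_needed(n_items, per_item_min=5, per_item_max=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return n_items * per_item_min, n_items * per_item_max


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```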
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 5
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard deviation: variability from the mean
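As a minimal sketch of these two measures, the snippet below uses Python's standard `statistics` module on two invented sets of test scores (the class names and scores are hypothetical, not data from any study). Both classes share the same mean, but the SD shows how differently the scores spread around it.

```python
from statistics import mean, stdev

# Hypothetical test scores for two classes (invented for illustration).
class_a = [60, 65, 70, 75, 80]
class_b = [40, 55, 70, 85, 100]

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    # mean() describes central tendency; stdev() is the sample SD (n - 1 denominator).
    print(f"{name}: M = {mean(scores):.2f}, SD = {stdev(scores):.2f}")
```

The same mean with a much larger SD is exactly the kind of difference a reader misses when a study reports means alone.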
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
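The two designs above can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure of any particular package: the function names and all scores are hypothetical, the independent version assumes equal variances (pooled), and only the t statistic and degrees of freedom are computed, since the p value requires the t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(before, after):
    """t statistic and df for one group measured at two times."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Mean difference divided by the standard error of the differences.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and df for two separate groups."""
    n1, n2 = len(group_a), len(group_b)
    pooled = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical vocab scores before and after extensive reading (one group).
before = [10, 12, 9, 14, 11]
after = [13, 14, 10, 17, 13]
t_paired, df_paired = paired_t(before, after)

# Hypothetical test scores for implicit vs. explicit grammar groups.
implicit = [12, 15, 11, 14, 13]
explicit = [16, 18, 15, 17, 19]
t_indep, df_indep = independent_t(implicit, explicit)

print(f"paired: t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_indep:.2f}, df = {df_indep}")
```

Note how the degrees of freedom differ: n - 1 for the paired design, n1 + n2 - 2 for the independent design. With groups of 62 and 54, the latter gives 62 + 54 - 2 = 114.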
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              t     df   p
Reading           4.91  114  .001*
Speaking          2.28  114  .024
Writing           2.30  114  .023
Listening         4.12  114  .001*
Spelling          3.74  114  .001*
General learning  3.61  114  .001*
Homework          2.92  114  .004*
Textbook          3.01  114  .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
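Alpha = .05: when there is no real difference, about 5% of tests still reach significance by chance
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 significant results expected by chance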
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
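The adjustment is a single division, sketched below. The function name is hypothetical. The second part applies it to the eight p values listed in the Macaro & Erler (2007) attitude results; pairing each p value with an area by list order is an assumption made for illustration.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level: original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# The slide's own example: 0.05 / 5.
slide_alpha = bonferroni_alpha(0.05, 5)

# p values from the attitude results, paired with areas in listed order (assumed mapping).
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}
adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"slide example: {slide_alpha:.3f}")
print(f"8 comparisons: {adjusted:.5f}")
print("still significant:", significant)
```

With eight comparisons the adjusted alpha is .00625, which is consistent with the *p < .006 threshold in those results: Speaking and Writing fall below .05 but do not survive the adjustment.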
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups without the inflated Type I error that repeated t-tests introduce.
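To see why repeated t-tests introduce more error, the sketch below counts the pairwise comparisons needed for a given number of groups and estimates the chance of at least one false positive across them. The function names are hypothetical, and the formula assumes the tests are independent, which pairwise comparisons on the same groups are not exactly, so treat the figures as rough.

```python
from math import comb

def pairwise_tests(n_groups):
    """Number of t-tests needed to compare every pair of groups."""
    return comb(n_groups, 2)

def familywise_error(alpha, n_tests):
    """Chance of at least one Type I error across n_tests (assumes independent tests)."""
    return 1 - (1 - alpha) ** n_tests

for groups in (2, 3, 4, 5):
    tests = pairwise_tests(groups)
    print(f"{groups} groups: {tests} t-tests, "
          f"familywise error = {familywise_error(0.05, tests):.3f}")
```

With 3 groups, the three pairwise t-tests already push the overall error rate from .05 to roughly .14, which is the problem a single ANOVA avoids.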
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
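A minimal sketch of the calculation behind this example follows, in plain Python. The scores are invented and the function name is hypothetical. It computes only the F-statistic and its degrees of freedom; obtaining the p value requires the F distribution from a statistics package and is not attempted here.

```python
from statistics import mean

def one_way_anova(groups):
    """Return F, df_between, df_within for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: how far each score is from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores (invented for illustration).
word_cards = [14, 16, 15, 17, 18]
word_lists = [12, 13, 14, 15, 16]
pc_software = [15, 17, 16, 18, 19]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A larger F means the group means differ more than the spread of scores within each group would lead one to expect.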
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1–5) scales
Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
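Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares, then read against the Cohen (1988) benchmarks above. The function names, the example sums of squares, and the "negligible" label for values under .01 are assumptions added for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed here; not one of the listed benchmarks

# Hypothetical sums of squares chosen to give an effect size of .02.
effect = eta_squared(ss_between=20.0, ss_total=1000.0)
print(f"eta-squared = {effect:.2f} ({cohen_label(effect)} effect)")
```

An effect size of .02 falls in the small range, which is why a result can reach statistical significance and still reflect only a small actual difference between groups.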
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
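The point that more items yield a greater alpha can be illustrated with the standardized form of Cronbach's alpha, which depends only on the number of items and the mean inter-item correlation. This is a sketch: the function name is hypothetical, the correlation of .30 is an invented value, and the standardized formula is used here for simplicity rather than the raw-score version a statistics package would normally report.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha from item count and mean inter-item correlation."""
    k, r = n_items, mean_inter_item_r
    return (k * r) / (1 + (k - 1) * r)

# Same (hypothetical) mean inter-item correlation, increasing number of items.
for k in (3, 6, 12):
    print(f"{k} items, mean r = .30: alpha = {standardized_alpha(k, 0.30):.2f}")
```

Holding the correlation fixed at .30, alpha rises from about .56 with 3 items to about .84 with 12, so a high alpha on a long scale says little by itself about how strongly the items hang together.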
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
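The participants-per-item guideline is simple arithmetic, sketched below with a hypothetical helper function. The 5 and 10 multipliers come from the guideline cited above; everything else is illustrative.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough minimum and maximum N under a participants-per-item guideline."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```

A quick check like this before collecting data shows whether a planned class-sized sample is anywhere near large enough for the questionnaire in hand.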
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
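The squared-loading arithmetic used for Item 43 and Item 38 can be sketched as follows. The function name is hypothetical; the loadings are the ones quoted in the two examples.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor: the loading squared."""
    return loading ** 2

for item, loading in (("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33), ("Item 38 on F1", 0.29)):
    print(f"{item}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes the gap plain: a .63 loading shares about 40% of its variance with the factor, while .33 shares about 11% and .29 only about 8%.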
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 7
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 8
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
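The paired design above can be sketched as follows. This is an illustrative calculation of the paired-samples t statistic from the difference scores; the vocabulary scores and the function name `paired_t` are hypothetical, and a statistics package would also supply the p value.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired-samples t: mean of the difference scores divided by its standard error."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # t statistic and degrees of freedom

# Hypothetical vocab test scores for the same five learners (invented).
time1 = [40, 45, 50, 55, 60]   # before extensive reading
time2 = [44, 47, 55, 58, 66]   # after extensive reading

t, df = paired_t(time1, time2)
print(f"t = {t:.2f}, df = {df}")
```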
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
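For the two-group design, a minimal sketch of the independent-samples t statistic is shown below. It assumes equal variances (the classic pooled-variance form); the scores and the function name `independent_t` are invented for illustration.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Independent-samples t with pooled variance (assumes equal variances)."""
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n1 + n2 - 2  # t statistic and degrees of freedom

# Hypothetical test scores (invented for illustration).
implicit = [60, 65, 70, 75, 80]   # Group A
explicit = [70, 75, 80, 85, 90]   # Group B

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

Note that the degrees of freedom are the two group sizes added together minus 2.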
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 (95% level): if there is no real difference, a test still reaches significance about 5% of the time
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
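The arithmetic behind these figures can be sketched as below. The sketch assumes every comparison is independent and that no real differences exist; the function names are hypothetical.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of 'significant' results expected by chance alone."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error_rate(k), 3))
```

With 20 comparisons at alpha = .05, the chance of at least one spurious "significant" result is roughly 64%, which is why an adjustment is needed.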
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
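As a hedged illustration, the adjustment can be applied to the eight attitude results reported above for Macaro & Erler (2007). The p values are taken from the slide; pairing them with the areas in the order listed is an assumption, and the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# p values from the slide, paired with areas in listed order (an assumption).
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted:.5f}")
print("still significant:", significant)
```

Eight comparisons give an adjusted alpha of .00625, which matches the *p < .006 threshold on the slide. Speaking and Writing would pass at .05 but not at the adjusted level.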
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: mean scores M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: mean scores M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups without the inflated Type I error that comes from running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
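The analysis step of this example can be sketched as follows. The code computes the one-way ANOVA F-statistic from between-group and within-group sums of squares; the scores and the function name `one_way_anova` are invented for illustration, and the p value would normally come from statistical software.

```python
import statistics

def one_way_anova(groups):
    """Return F, df_between, df_within, SS_between, SS_within for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within, ss_between, ss_within

# Hypothetical vocab test scores (invented for illustration).
word_cards = [14, 15, 16]    # Group I
word_lists = [11, 12, 13]    # Group II
pc_software = [17, 18, 19]   # Group III

f, df_b, df_w, ss_b, ss_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```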
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Pairwise comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance explained by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
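Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. In the snippet below the function names are hypothetical, the benchmarks are treated as lower bounds for each label, and the "negligible" label for values under .01 is an added assumption rather than part of Cohen's guidelines as quoted here.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below .01

# Hypothetical sums of squares (invented for illustration).
value = eta_squared(ss_between=12.0, ss_total=200.0)
print(f"eta-squared = {value:.2f} ({cohen_label(value)})")
```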
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
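For readers who want to see how the reliability figure is produced, here is a minimal sketch of Cronbach's alpha using the standard formula: k / (k - 1) multiplied by (1 minus the sum of the item variances divided by the variance of the total scores). The Likert responses and the function name are invented for illustration.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per questionnaire item,
    with respondents in the same order in every list."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(item) for item in items)
    totals = [sum(responses) for responses in zip(*items)]  # each respondent's total
    return (k / (k - 1)) * (1 - sum_item_vars / statistics.variance(totals))

# Hypothetical 1-5 Likert responses from five respondents to three items (invented).
item1 = [4, 5, 3, 4, 2]
item2 = [4, 4, 3, 5, 2]
item3 = [5, 5, 2, 4, 1]

alpha = cronbach_alpha([item1, item2, item3])
print(f"Cronbach's alpha = {alpha:.2f}")
```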
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
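The arithmetic on this slide is simply the loading squared. A small sketch, with a hypothetical function name, makes it easy to check other loadings the same way:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (the squared loading)."""
    return loading ** 2

for loading in (0.63, 0.40, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

A loading right at the .4 cut-off corresponds to only 16% shared variance, which helps explain why lower cut-offs are treated with suspicion.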
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
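The participants-per-item rule of thumb can be sketched as a quick calculation. The function name and default values below are assumptions that simply encode the 5-10 range given above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough sample-size range from a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```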
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 10
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent)
constructs within a sample population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the
squared loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
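For readers who want to see how Cronbach's alpha is put together, the following is a minimal sketch using the standard formula: the number of items k, the sum of the item variances, and the variance of each respondent's total score. The responses are hypothetical and the function name cronbach_alpha is an assumption for this example.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one list per item, holding every respondent's answer in the same order."""
    k = len(item_scores)
    # Total score per respondent across all items
    totals = [sum(answers) for answers in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Hypothetical Likert responses: 3 items answered by 5 respondents
items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 3],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The k / (k - 1) term and the summed variances show why alpha tends to rise as items are added, which is the caution raised on the slide.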
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
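The arithmetic above can be checked with a short sketch that squares a loading to get the shared variance. The loadings are the ones quoted in this presentation for Item 43 and Item 38; the helper name shared_variance is an assumption for illustration.

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance an item shares with the factor."""
    return loading ** 2

# Loadings quoted in the presentation
examples = [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33), ("Item 38 on F1", 0.29)]

for label, loading in examples:
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes the gap between loadings wider than it first looks: a loading about half the size of another shares only around a quarter as much variance with the factor.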
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
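The participants-per-item guideline can be turned into a quick check, sketched below. The function name and its default arguments are assumptions that simply encode the 5-10 range cited on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by the 5-10 per item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```

The same call with a different item count gives the range for a shorter or longer questionnaire, though the separate "over 300" guideline may still apply.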
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 13
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
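As a rough illustration of these two measures, the sketch below computes a mean and a standard deviation with Python's standard library. The scores are hypothetical and are not taken from any study mentioned in this talk.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustrative only)
scores = [12, 15, 15, 17, 18, 20, 22]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample SD (n - 1 in the denominator)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

The sample SD (dividing by n - 1) is used here on the assumption that the class is treated as a sample from a larger population; `statistics.pstdev` would give the population version.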
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
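One common way to put numbers on "how normal" is to compute skewness (asymmetry) and excess kurtosis (flat versus peaked). The sketch below uses the simple moment-based formulas; statistics packages often apply small-sample corrections, so their output may differ slightly. The data are hypothetical.

```python
import statistics

def skewness(xs):
    """Moment-based skewness: 0 for a symmetrical distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Moment-based excess kurtosis: negative = flat, positive = peaked."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]   # hypothetical scores
skewed = [1, 1, 1, 2, 10]     # hypothetical scores with one high outlier

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(skewed), excess_kurtosis(skewed))
```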
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: random or convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
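A minimal sketch of the arithmetic behind a paired samples t-test, assuming the same learners take the vocabulary test at both times. The scores and the function name `paired_t` are hypothetical; the sketch returns only t and the degrees of freedom, since looking up the p-value needs a t distribution table or a statistics package.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired samples t-test."""
    if len(before) != len(after):
        raise ValueError("paired scores must have the same length")
    # Work on each learner's difference score (Time 2 minus Time 1)
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical vocab test scores for five learners
time1 = [10, 12, 9, 14, 11]   # before extensive reading
time2 = [13, 14, 10, 17, 13]  # after extensive reading

t, df = paired_t(time1, time2)
print(f"t({df}) = {t:.2f}")
```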
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
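The two-group case can be sketched in the same way. This version assumes equal variances in the two groups (the classic pooled-variance Student's t-test); the scores and the name `independent_t` are hypothetical. Degrees of freedom are n1 + n2 - 2.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test with pooled variance."""
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * statistics.variance(group_a)
                  + (n2 - 1) * statistics.variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df

# Hypothetical grammar test scores
implicit = [70, 72, 68, 75, 71]   # Group A
explicit = [78, 74, 80, 77, 76]   # Group B

t, df = independent_t(implicit, explicit)
print(f"t({df}) = {t:.2f}")
```

A negative t here simply means the first group's mean is lower than the second's.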
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = accepting a 5% risk of a chance result when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
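To see why multiple comparisons matter, the sketch below works out two quantities at an alpha of .05: the expected number of chance "significant" results, and the probability of getting at least one. The second calculation assumes the comparisons are independent, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of chance 'significant' results when no real effects exist."""
    return alpha * n_comparisons

def familywise_error(alpha, n_comparisons):
    """Probability of at least one chance result, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 5, 20, 100):
    print(m,
          round(expected_false_positives(0.05, m), 2),
          round(familywise_error(0.05, m), 3))
```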
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
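The adjustment can be sketched in a few lines. The p-values below are the eight reported in the Macaro & Erler (2007) attitude results above; dividing .05 by 8 gives .00625, which appears to be where the "*p < .006" threshold on that slide comes from. The function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# The eight p-values from the attitudes-to-French results
p_values = [0.001, 0.024, 0.023, 0.001, 0.001, 0.001, 0.004, 0.005]

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [p for p in p_values if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print(f"{len(significant)} of {len(p_values)} comparisons remain significant")
```

Under the adjusted alpha, the two results with p = .024 and p = .023 no longer count as significant, although both would pass an unadjusted .05 level.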
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
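The F-statistic for a design like this can be sketched as the ratio of between-group variance to within-group variance. The scores below are hypothetical, as is the function name `one_way_anova`; the sketch stops at F and the degrees of freedom, since the p-value needs an F distribution table or a statistics package.

```python
import statistics

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Between-groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    # Within-groups: how far each score sits from its own group mean
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores
word_cards = [18, 20, 22, 19, 21]   # Group I
word_lists = [15, 17, 16, 18, 14]   # Group II
pc_software = [20, 23, 21, 22, 19]  # Group III

f, df_b, df_w = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_b}, {df_w}) = {f:.2f}")
```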
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using 5-point Likert scales (1–5)
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Tables from a published study: descriptive statistics and ANOVA statistics]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Pairwise comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
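Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The group scores are hypothetical, and so are the function names; the labels use the .01, .06, and .14 benchmarks listed above, with "negligible" added here as an assumed label for anything below .01.

```python
import statistics

def eta_squared(*groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the benchmarks attributed to Cohen (1988)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label, not from the slides

# Hypothetical test scores for three groups
group1 = [18, 20, 22, 19, 21]
group2 = [15, 17, 16, 18, 14]
group3 = [20, 23, 21, 22, 19]

eta = eta_squared(group1, group2, group3)
print(f"eta-squared = {eta:.2f} ({effect_label(eta)})")
```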
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
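Cronbach's alpha can be sketched from the item variances and the variance of each participant's total score. The Likert responses below are hypothetical (five participants, three items), and `cronbach_alpha` is an illustrative name rather than a function from any particular package.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of participants and columns of items."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variances = sum(statistics.variance(item) for item in items)
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-point Likert responses: 5 participants x 3 items
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The k / (k - 1) term and the summed variances are why alpha tends to rise as items are added, which is the caution raised above.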
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
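The arithmetic here is simply the loading multiplied by itself. A small sketch, using the .63 loading for Item 43 and the .33 loading reported for Item 38 later in this section (the function name is hypothetical):

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor: the squared loading."""
    return loading ** 2

for label, loading in [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33)]:
    print(f"{label}: loading {loading:.2f} -> "
          f"{shared_variance(loading) * 100:.0f}% shared variance")
```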
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
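The participants-per-item rule of thumb can be written out as a quick check before collecting data. This is only a sketch of the 5-10 per item guideline cited above; the function name and defaults are illustrative.

```python
def recommended_n(n_items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = recommended_n(30)
print(f"30-item questionnaire: {low}-{high} participants")
```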
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 15
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
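To make the ANOVA table less abstract, here is a minimal sketch of how its entries (sum of squares, degrees of freedom, F-statistic) are derived. The scores are invented for three groups of three learners and do not come from any study; the function name is hypothetical. The sketch uses only the Python standard library and stops at the F-statistic, since the p-value needs the F distribution, which a statistics package would normally supply.

```python
from statistics import mean


def one_way_anova_table(groups):
    """Return the main entries of a one-way ANOVA table for lists of scores."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score sits from its own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)

    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical vocabulary test scores (word cards, word lists, PC software).
table = one_way_anova_table([[11, 12, 13], [12, 13, 14], [16, 17, 18]])
for name, value in table.items():
    print(name, value)
```

A large F means the spread between group means is large relative to the spread inside each group, which is why "the higher the better" applies to the model.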
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: results table from a study, with the descriptive statistics and the ANOVA statistics labeled]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates that some of the group means differ by more than would be expected by chance
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
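As a rough illustration of how eta-squared is obtained and read against the Cohen (1988) benchmarks listed above, the sketch below divides the between-groups sum of squares by the total sum of squares. The sums of squares are invented numbers, the function names are hypothetical, and the "negligible" label for values under .01 is an assumption added here, not part of the benchmarks.

```python
def eta_squared(ss_between: float, ss_within: float) -> float:
    """Proportion of total variance accounted for by group membership."""
    return ss_between / (ss_between + ss_within)


def cohen_label(eta_sq: float) -> str:
    """Map eta-squared onto the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed for illustration


# Hypothetical sums of squares taken from an ANOVA table.
effect = eta_squared(ss_between=42.0, ss_within=6.0)
print(round(effect, 3), cohen_label(effect))

# An eta-squared of .02 falls in the "small" band despite a significant F.
print(cohen_label(0.02))
```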
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
[Figure: example ANOVA table with its accompanying post-hoc results text]
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA.
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
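For readers who want to see where the alpha figure comes from, the following is a minimal sketch of the standard Cronbach's alpha formula: k / (k - 1) multiplied by (1 minus the sum of the item variances divided by the variance of respondents' total scores). The Likert responses are invented (three items, four respondents) and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores holds one list per item, each with one score per respondent."""
    k = len(item_scores)
    # Total score of each respondent across all items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses: rows are items, columns are respondents.
items = [
    [2, 3, 4, 5],
    [1, 3, 4, 4],
    [2, 2, 5, 5],
]
print(round(cronbach_alpha(items), 3))
```

Because the formula compares item variances with the variance of the summed scale, it speaks only to internal consistency; it says nothing about whether the items measure the intended construct.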
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = .40, i.e. 40% of shared variance with the factor
(A loading above .4 is acceptable according to Stevens, 1992)
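The arithmetic behind the shared-variance figure is simply the loading squared. The short sketch below shows it for the .63 loading and, for contrast, for a weak loading of .33; the function name is hypothetical.

```python
def shared_variance(loading: float) -> float:
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2


for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes the drop-off steep: a loading about half the size shares only about a quarter as much variance with the factor.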
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
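The 150-300 range follows directly from the 5-10 participants-per-item guideline. As a quick check for other questionnaire lengths, a sketch along these lines could be used; the function name and default values are illustrative assumptions, not a formal power calculation.

```python
def participants_needed(items: int, per_item_low: int = 5, per_item_high: int = 10):
    """Rule-of-thumb sample size range from the participants-per-item guideline."""
    return items * per_item_low, items * per_item_high


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```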
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 = .11, i.e. 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 18
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
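As a rough illustration of these two measures, the sketch below computes a mean and a sample standard deviation using only the Python standard library. The scores are hypothetical and are not taken from the talk.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class (illustrative only)
scores = [62, 70, 75, 75, 80, 88]

m = mean(scores)     # central tendency
sd = stdev(scores)   # sample standard deviation (n - 1 in the denominator)

print(f"M = {m:.2f}, SD = {sd:.2f}")
```

Two classes can share the same mean and still differ widely in SD, which is why the two figures are usually reported together.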
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
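The "not symmetrical" and "flat or peaked" descriptions correspond to skewness and kurtosis. The sketch below uses simple population-moment formulas, which is an assumption made here for brevity; packages such as SPSS apply sample corrections, so their values can differ slightly. The data are hypothetical.

```python
from statistics import mean, pstdev

def skewness(xs):
    """Roughly 0 for a symmetrical distribution; positive when the long tail is on the right."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Roughly 0 for a normal curve; negative = flatter, positive = more peaked."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly spread scores
right_skewed = [1, 1, 2, 2, 10]    # hypothetical, one very high score

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```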
Issues
Are the data appropriate
for further statistical analyses?
– Mean and SD
– Participants: N size and sampling
– Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
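The two designs above can be sketched with the standard library alone. This is a minimal illustration, not the presenters' procedure: the scores are hypothetical, the independent test assumes equal variances (pooled variance), and only t and df are computed, since the p-value would normally come from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(time1, time2):
    """Paired samples t-test: one group measured at two times."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1                      # t statistic, degrees of freedom

def independent_t(group_a, group_b):
    """Independent samples t-test with pooled variance (assumes equal variances)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical vocab scores before and after extensive reading (same learners)
t_paired, df_paired = paired_t([10, 12, 9, 14, 11], [12, 15, 10, 15, 14])

# Hypothetical test scores for implicit vs. explicit grammar groups
t_ind, df_ind = independent_t([70, 72, 68, 74, 66], [75, 78, 74, 80, 73])

print(f"paired: t({df_paired}) = {t_paired:.2f}")
print(f"independent: t({df_ind}) = {t_ind:.2f}")
```

Note how df differs by design: n - 1 for the paired test, and n1 + n2 - 2 for the independent test.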
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = accepting a 5% risk of calling a chance difference significant
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
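The arithmetic behind these figures can be sketched as follows. The function names are made up for illustration, and the calculation assumes independent comparisons in which no real difference exists.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of comparisons that reach significance by chance alone."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Probability of at least one chance 'significant' result across all comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 5, 20, 100):
    print(m, expected_false_positives(0.05, m), round(familywise_error_rate(0.05, m), 3))
```

With 20 comparisons at an alpha of .05, the chance of at least one false positive is already well above one half, which is the motivation for adjusting alpha.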
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
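A minimal sketch of the adjustment, applied to the p-values listed on the Macaro & Erler (2007) slide. Pairing each area with a p-value assumes the two lists on that slide run in the same order; the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni adjustment: divide alpha by the number of comparisons."""
    return alpha / n_comparisons

# p-values as listed on the slide (area order assumed to match)
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted:.5f}")
print(significant)
```

With eight comparisons the adjusted alpha is .00625, which appears to be the source of the "*p < .006" note on that slide: Speaking and Writing fall below .05 but not below the adjusted level.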
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: group means M1, M2, M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can compare 3 or more
groups in one test, introducing less
Type I error than repeated t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
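For a design like this, the F-statistic can be sketched as the ratio of between-group to within-group variance. The code below is an illustrative calculation with hypothetical scores, not output from any study mentioned in the talk; the p-value would normally be read from a statistics package.

```python
from statistics import mean

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way between-groups ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores for the three learning groups
word_cards = [14, 15, 16, 15, 15]
word_lists = [12, 13, 14, 13, 13]
pc_software = [17, 18, 19, 18, 18]

f, df_b, df_w = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_b}, {df_w}) = {f:.2f}")
```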
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published table showing descriptive statistics and ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
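Eta-squared is commonly computed as the between-groups sum of squares divided by the total sum of squares. The sketch below uses hypothetical scores, and the labelling function simply applies the Cohen (1988) cut-offs listed above; the "negligible" label for values under .01 is an assumption added here.

```python
from statistics import mean

def eta_squared(*groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta2):
    """Label an eta-squared value using the cut-offs on the slide."""
    if eta2 >= .14:
        return "large"
    if eta2 >= .06:
        return "medium"
    if eta2 >= .01:
        return "small"
    return "negligible"   # label assumed for values below .01

# Hypothetical scores for three groups
eta2 = eta_squared([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f"eta-squared = {eta2:.2f} ({effect_label(eta2)})")
```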
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor;
the squared loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
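Cronbach's alpha is usually computed from the item variances and the variance of the summed scale. The sketch below follows that standard formula with hypothetical Likert responses; the function name and data layout are assumptions made for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, respondents in the same order."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]   # scale score per respondent
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))

# Hypothetical 1-5 Likert responses from five respondents to three items
item1 = [4, 3, 5, 2, 4]
item2 = [5, 3, 4, 2, 4]
item3 = [4, 2, 5, 3, 3]

alpha = cronbach_alpha([item1, item2, item3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high alpha here only indicates that the items vary together; as noted above, it says nothing about whether the items are valid measures of the intended factor.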
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
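The shared-variance figure is just the loading squared. A tiny sketch using the two loadings quoted in this talk (.63 for Item 43 and .33 for Item 38); the function name is made up for illustration.

```python
def percent_shared_variance(loading):
    """Squared factor loading, expressed as a percentage."""
    return loading ** 2 * 100

for item, loading in (("Item 43", .63), ("Item 38", .33)):
    print(f"{item}: loading {loading:.2f} -> {percent_shared_variance(loading):.0f}% shared variance")
```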
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
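The example follows directly from the 5-10 participants-per-item rule of thumb. A small sketch, with a hypothetical helper name:

```python
def participants_needed(n_items, low=5, high=10):
    """Range of participants under the 5-10 per item rule of thumb."""
    return n_items * low, n_items * high

minimum, maximum = participants_needed(30)
print(f"30 items: {minimum}-{maximum} participants")
```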
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 19
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 20
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while introducing less Type I error than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
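To make the F-statistic less abstract, here is a minimal sketch of how it is built from the sums of squares that appear in an ANOVA table. The three score lists are invented to mimic the vocabulary example (word cards, word lists, PC software), and the function name is hypothetical. The p value is left out because it requires the F distribution, which statistical software supplies.

```python
from statistics import mean


def one_way_anova(*groups):
    """Return the main entries of a one-way ANOVA table for the given groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df": (df_between, df_within),
        "F": f_stat,
    }


# Hypothetical vocab test scores (invented for illustration only)
word_cards = [18, 20, 22, 19, 21]
word_lists = [15, 17, 16, 18, 14]
pc_software = [20, 22, 24, 21, 23]

result = one_way_anova(word_cards, word_lists, pc_software)
print(result)
```

F grows when the group means are far apart relative to the spread of scores inside each group, which is why a larger F is "better for the model".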
ANOVA in the Literature
[Figure: published results table with the descriptive statistics and ANOVA statistics labeled]
F-statistic is significant…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
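Eta-squared is the between-groups sum of squares divided by the total sum of squares, both of which come from the ANOVA table. The sketch below applies the Cohen (1988) cut-offs given in the slides; the sums of squares are invented, the function names are hypothetical, and the "negligible" label for values under .01 is an assumption of this example rather than something from the slides.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def effect_label(eta_sq):
    """Label an eta-squared value using the Cohen (1988) cut-offs from the slides."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch, not from the slides


# Hypothetical sums of squares (invented for illustration only)
value = eta_squared(ss_between=2.0, ss_total=100.0)
print(value, effect_label(value))
```

A value of .02, as in the quoted report, falls in the small range, which is why a statistically significant result can still describe a difference of little practical importance.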
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
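For readers who want to see where an alpha value comes from, the sketch below uses the standard variance-based formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The response data are invented and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per participant)."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup scores by item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical Likert responses: 5 participants x 3 items (invented)
survey = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

print(round(cronbach_alpha(survey), 3))
```

A high alpha here only says the three items move together across participants; as the slide notes, it says nothing about whether the items are valid measures of the intended construct.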
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
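The loading-to-shared-variance step is just a squaring operation, sketched here with the loadings quoted in the slides (.63 for Item 43, and .33 and .29 for Item 38). The function name is hypothetical.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor (squared loading)."""
    return loading ** 2


# Loadings quoted in the slides
loadings = {
    "Item 43 on F1": 0.63,
    "Item 38 on F3": 0.33,
    "Item 38 on F1": 0.29,
}

for label, loading in loadings.items():
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes weak loadings look as weak as they are: a loading in the low .3 range leaves roughly nine tenths of the item's variance unrelated to the factor.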
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
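The participants-per-item guideline reduces to simple multiplication, sketched below. The function name and default arguments are assumptions made for this example; only the 5-10 range and the 30-item case come from the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N-size range from the 5-10 participants-per-item guideline."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```

This gives a quick check when reading a study: divide the reported N by the number of questionnaire items and see whether the ratio falls anywhere near the 5-10 range.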
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 23
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard deviation: variability from the mean
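As a minimal sketch of these two descriptive statistics, the snippet below computes a mean and a sample standard deviation with Python's standard library. The scores are hypothetical and were not part of the presentation.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustrative data only).
scores = [12, 15, 15, 18, 20]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

Published studies usually report the sample SD (the n - 1 version), which is what `statistics.stdev` returns.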
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
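One way to put numbers on "how normal" is to compute skewness (asymmetry) and excess kurtosis (flat versus peaked). The sketch below uses the simple population-moment formulas; the function names and data are assumptions for illustration, and statistical packages may apply slightly different small-sample corrections.

```python
import statistics

def skewness(data):
    """Population skewness: 0 for a symmetrical distribution."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Population excess kurtosis: negative = flat, positive = peaked."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical scores
right_skewed = [1, 1, 1, 2, 10]  # hypothetical scores with one high outlier

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```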
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: random or convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and
sometimes forgotten by researchers.” (Brown, 1988)
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
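A paired samples design like this one can be sketched by hand: the t value is the mean of the Time 2 minus Time 1 differences divided by its standard error. The scores and the function name `paired_t` are hypothetical; a statistics package would also supply the p value.

```python
import math
import statistics

def paired_t(time1, time2):
    """Return (t, df) for a paired samples t-test on two score lists."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical vocab test scores for the same five learners.
before_reading = [10, 12, 9, 14, 11]
after_reading = [12, 15, 10, 15, 14]

t, df = paired_t(before_reading, after_reading)
print(f"t({df}) = {t:.2f}")
```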
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
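For two separate groups, a rough sketch of the independent samples t-test looks like this. It assumes equal variances (the pooled-variance form), and the scores and the name `independent_t` are hypothetical.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for a pooled-variance independent samples t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df

# Hypothetical grammar test scores.
implicit_group = [10, 12, 14]
explicit_group = [13, 15, 17]

t, df = independent_t(implicit_group, explicit_group)
print(f"t({df}) = {t:.2f}")
```

The degrees of freedom are the two group sizes added together minus 2, which is why a study with groups of 62 and 54 reports df = 114.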
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% risk of finding a “significant” difference by chance when no real difference exists
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
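The arithmetic behind these figures can be sketched as follows. The function names are illustrative, and the familywise calculation assumes the comparisons are independent, which real data may not satisfy.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of chance 'significant' results when no real effects exist."""
    return n_comparisons * alpha

def familywise_error(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 5, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error(k), 3))
```

With 20 comparisons the chance of at least one spurious result is already well over half, which is the motivation for adjusting the alpha level.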
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
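The adjustment is a single division. As an illustration, the sketch below also applies it to the eight p values shown for Macaro & Erler (2007). Pairing each area with a p value by the order in which they appear is an assumption, and the names used here are hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted alpha level for a family of comparisons."""
    return alpha / n_comparisons

print(bonferroni_alpha(0.05, 5))

# p values as listed on the Macaro & Erler slide (area order assumed).
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}
threshold = bonferroni_alpha(0.05, len(p_values))  # eight comparisons
significant = [area for area, p in p_values.items() if p < threshold]
print(threshold, significant)
```

With eight comparisons the adjusted level is .00625, which is in line with the *p < .006 cut-off noted on that slide.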
Issues
• Do the data meet normality assumptions?
• Is the sample size large enough?
• Are the data continuous?
• Is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance: a search for mean differences between data sets
– One-way ANOVA: looking for significant differences in the mean scores of 2 or more groups
– Why “one-way”? Looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
Similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can compare 3 or more groups in a single test,
avoiding the inflated Type I error of running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods
with 3 different groups and then test them to see if the
results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with ANOVA to see if any
groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1–5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
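To make the F-statistic less abstract, here is a minimal sketch that computes it by hand for three groups, loosely following the vocabulary testing example. The scores and the name `one_way_anova` are hypothetical. F is the between-group variance divided by the within-group variance, so it grows as group means move apart relative to the spread inside each group.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocab test scores for the three learning groups.
word_cards = [2, 3, 4]
word_lists = [4, 5, 6]
pc_software = [6, 7, 8]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Whether a given F reaches significance depends on both degrees of freedom, which is why they are reported alongside it.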
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
A significant F-statistic indicates that at least
some of the group means differ significantly
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
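Eta-squared can be sketched as the between-group sum of squares divided by the total sum of squares. The data and function names below are hypothetical; the labels follow the Cohen (1988) benchmarks listed above, and the "negligible" label for values under .01 is an added assumption.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the benchmarks listed on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label; not from the slide

groups = [[2, 3, 4], [4, 5, 6], [6, 7, 8]]  # hypothetical scores
print(eta_squared(groups), effect_label(eta_squared(groups)))
print(effect_label(0.02))
```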
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance between the item and the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
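As a rough sketch of how Cronbach's alpha is obtained, the snippet below uses the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses and the function name are hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, same respondent order."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each respondent's total score across all items.
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical 5-point Likert responses: 3 items, 4 respondents.
item_scores = [
    [4, 3, 5, 2],
    [5, 3, 4, 2],
    [4, 2, 5, 3],
]
alpha = cronbach_alpha(item_scores)
print(round(alpha, 2))
```

A high alpha only shows that the items vary together; as noted above, it says nothing about whether they measure the intended construct.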
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
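The step from a loading to shared variance is just squaring. The short sketch below uses the .63 loading above, the .4 guideline, and the .33 loading discussed under typical factor loading issues; the function name is illustrative.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (squared loading)."""
    return loading ** 2

for loading in (0.63, 0.40, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Seen this way, a loading of .4 corresponds to only about 16% shared variance, which helps explain why loadings in the low .3 range are treated with caution.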
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
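The participants-per-item rule of thumb can be sketched as a small helper. The function name and defaults are assumptions based on the 5-10 per item guideline cited above.

```python
def recommended_n(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by a per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```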
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 25
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha = .05: when there is no real difference, each comparison still has a 5% chance of reaching significance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
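The arithmetic behind these figures can be sketched as follows. The second function assumes the comparisons are independent of each other, which is a simplification; both function names are invented for this example.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of comparisons that reach significance by chance alone."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

print(expected_false_positives(0.05, 20))    # about 1
print(expected_false_positives(0.05, 100))   # about 5
print(round(familywise_error_rate(0.05, 20), 3))
```

With 20 unadjusted comparisons, the chance of at least one spurious "significant" result is well over half, which is why the alpha level needs adjusting.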
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
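As a sketch of the adjustment in practice, the code below applies it to the eight p-values shown for the Macaro & Erler (2007) attitude results. Pairing each area with a p-value in the order listed is an assumption about the slide's layout, and `bonferroni_alpha` is a hypothetical helper name.

```python
def bonferroni_alpha(alpha, comparisons):
    """Divide the alpha level by the number of comparisons made."""
    return alpha / comparisons

# p-values as listed on the attitudes slide (area order assumed).
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print(significant)
```

Speaking and Writing would count as significant at .05 but fall short of the adjusted level, which matches the starred results on the slide.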
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
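To show where the F-statistic and the other ANOVA table entries come from, here is a minimal sketch of a one-way ANOVA computed by hand. The three groups echo the vocabulary example, but the scores are invented and `one_way_anova` is a hypothetical helper, not a library call. It stops at F and the degrees of freedom; the p-value would come from an F table or a statistics package.

```python
import statistics

def one_way_anova(*groups):
    """Return (F, df_between, df_within, ss_between, ss_within)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within

# Hypothetical vocab test scores for three study methods.
word_cards = [14, 15, 16, 15, 15]
word_lists = [12, 13, 14, 13, 13]
pc_software = [16, 17, 18, 17, 17]

f_stat, df_b, df_w, ss_b, ss_w = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

F grows when the group means are far apart relative to the spread of scores inside each group, which is why a higher F favors the model.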
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between at least some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
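Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it and attaches the benchmark labels given above; the "negligible" label for values under .01 and both function names are assumptions added for the example.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed for values below the smallest benchmark

# Hypothetical sums of squares: 40 between groups, 46 in total.
value = eta_squared(40, 46)
print(f"eta-squared = {value:.2f} ({cohen_label(value)})")
print(cohen_label(0.02))
```

A result can be statistically significant and still have a small effect size, so both need to be reported.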
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading gives the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
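Cronbach's alpha can be computed from the item variances and the variance of respondents' total scores. This is a minimal sketch with invented Likert responses; `cronbach_alpha` is a hypothetical helper written for the example.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per questionnaire item (same respondents, same order)."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Total score per respondent across all items.
    totals = [sum(responses) for responses in zip(*items)]
    return (k / (k - 1)) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical 1-5 Likert responses from five respondents to three items.
item1 = [1, 2, 3, 4, 5]
item2 = [2, 2, 3, 4, 5]
item3 = [1, 3, 3, 5, 4]

alpha = cronbach_alpha([item1, item2, item3])
print(f"alpha = {alpha:.2f}")
```

When respondents answer the items consistently, the total-score variance is large relative to the summed item variances and alpha approaches 1.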
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
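The step from a loading to shared variance is simply squaring. A small sketch, with a hypothetical helper name:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (the squared loading)."""
    return loading ** 2

print(f"{shared_variance(0.63):.0%}")   # loading of .63
print(f"{shared_variance(0.40):.0%}")   # loading at the .4 benchmark
```

This is why a loading that looks respectable can still mean the item and the factor share well under half of their variance.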
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
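The participants-per-item rule of thumb amounts to a simple multiplication. A sketch with a hypothetical helper, using the 5-10 range given above as defaults:

```python
def recommended_n(items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return items * per_item_low, items * per_item_high

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```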
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 27
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; squaring the loading gives the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
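To make the point about Cronbach's alpha concrete, here is a minimal sketch using the usual variance-based formula, plus the standardized form that shows alpha rising as items are added at a fixed average inter-item correlation. The function names and the response data are hypothetical, chosen only for illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Alpha from raw responses: one row per participant, one column per item."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def standardized_alpha(n_items, mean_r):
    """Alpha implied by the number of items and the mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Hypothetical Likert responses (4 participants x 3 items)
data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2]]
print(round(cronbach_alpha(data), 3))

# Same average correlation, more items: alpha goes up
for k in (3, 6, 10):
    print(k, round(standardized_alpha(k, 0.3), 3))
```

The second loop is the reason a long questionnaire can show a high alpha even when its items are only modestly related to each other.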
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
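The participants-per-item rule of thumb in the example above is simple arithmetic, sketched here. The function name and its default arguments are assumptions for illustration; the 5-10 ratio is the one the slide attributes to Field (2005).

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```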
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
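The squared-loading arithmetic used on these slides can be sketched as follows. The loadings (.63, .33, .29) are the ones quoted above; the function name and the use of .4 as the comparison point are illustrative choices, with .4 taken from the Stevens (1992) guideline the slides cite.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


quoted_loadings = {
    "Item 43 on F1": 0.63,
    "Item 38 on F3": 0.33,
    "Item 38 on F1": 0.29,
}

for label, loading in quoted_loadings.items():
    pct = shared_variance(loading) * 100
    status = "meets" if loading >= 0.4 else "falls below"
    print(f"{label}: loading {loading:.2f}, {pct:.0f}% shared variance, {status} .4")
```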
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
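As a small illustration of these two measures, the sketch below computes a mean and a standard deviation with Python's standard `statistics` module. The test scores are hypothetical; `stdev` is the sample standard deviation (dividing by N - 1) and `pstdev` the population version (dividing by N).

```python
from statistics import mean, pstdev, stdev

# Hypothetical vocabulary test scores for one class
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print("Mean:", mean(scores))                    # central tendency
print("Sample SD:", round(stdev(scores), 2))    # variability from the mean
print("Population SD:", round(pstdev(scores), 2))
```

Studies do not always say which of the two SD formulas was used; with small groups the difference can be noticeable.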
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
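One common way to put numbers on "not symmetrical" and "flat or peaked" is skewness and excess kurtosis. The sketch below uses the simple moment-based (population) formulas; the function name and data are hypothetical, and statistical packages may apply small-sample corrections that give slightly different values.

```python
def shape_statistics(scores):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5          # 0 for a symmetrical distribution
    excess_kurtosis = m4 / m2 ** 2 - 3  # negative = flatter, positive = more peaked
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(shape_statistics(symmetric))
print(shape_statistics(right_skewed))
```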
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
– Sampling: Random or Convenience
– N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
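The two designs above lead to different calculations, sketched here with the standard library only. The function names and scores are hypothetical, and the independent-samples version assumes equal variances (the pooled-variance form). Turning t and df into a p-value needs a t-distribution table or a statistics package, which this sketch leaves out.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(time1, time2):
    """Paired samples: one group measured twice. Returns (t, df)."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def independent_t(group_a, group_b):
    """Independent samples with pooled variance. Returns (t, df)."""
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical vocab scores before and after extensive reading (same learners)
print(paired_t([10, 12, 14, 16], [11, 14, 15, 18]))

# Hypothetical test scores for implicit vs. explicit grammar groups
print(independent_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
```

The degrees of freedom differ by design: N - 1 for paired samples, and N1 + N2 - 2 for independent samples.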
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t-test result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha level of .05 = accepting a 5% risk that a "significant" difference is only due to chance
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 significant results expected by chance
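The counts above follow from multiplying the alpha level by the number of comparisons. The sketch below adds the chance of at least one false positive across all comparisons, which assumes the comparisons are independent and that no real differences exist; the function names are hypothetical.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Expected number of chance 'significant' results when no real effect exists."""
    return alpha * n_comparisons


def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


for m in (1, 5, 20, 100):
    print(
        m,
        round(expected_false_positives(m), 2),
        round(familywise_error_rate(m), 3),
    )
```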
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
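A minimal sketch of the Bonferroni adjustment, applied to the p-values listed on the Macaro & Erler (2007) slide. Pairing each area with a result by its position in the list is an assumption about how the slide's table was laid out, and the function name is hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the overall alpha level by the number of comparisons."""
    return alpha / n_comparisons


# p-values as listed on the slide; the area-to-row pairing is assumed
reported_p = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(reported_p))  # 0.05 / 8 comparisons
significant = [area for area, p in reported_p.items() if p < adjusted]

print(f"Adjusted alpha: {adjusted:.5f}")
print("Still significant:", significant)
```

With eight comparisons the adjusted alpha is .00625, which is consistent with the *p < .006 threshold shown on that slide. Speaking and Writing would pass at .05 but not after the adjustment.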
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2 and M3 plotted for the Control Group, Treatment 1 Group and Treatment 2 Group]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2 and M3 plotted for the Control Group, Treatment 1 Group and Treatment 2 Group]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
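To show where the F-statistic comes from, here is a minimal sketch of a one-way ANOVA calculation: variance between the group means divided by variance within the groups. The function name, the dictionary keys, and the scores are hypothetical; judging significance still requires comparing F against an F-distribution with the two degrees of freedom, which this sketch does not do.

```python
def one_way_anova(groups):
    """Return the pieces of a one-way ANOVA table for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)

    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)

    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical vocab test scores: word cards, word lists, PC software
print(one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

A larger F means the group means differ more than the spread inside each group would suggest, which is why higher values favor the model.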
ANOVA in the Literature
Slide 30
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
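As a minimal sketch of these two descriptives, the following uses Python's standard library on a small set of made-up test scores. The scores are hypothetical and not taken from the presentation.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustration only).
scores = [12, 15, 15, 16, 18, 20, 21, 23]

mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample standard deviation (divides by N - 1)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting N together with M and SD lets readers judge both the typical score and how widely scores spread around it.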
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
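One rough way to put numbers on "not symmetrical" and "flat or peaked" is to compute skewness and excess kurtosis. The sketch below uses the simple population-moment formulas; statistical packages often apply small-sample corrections, so their values may differ slightly. The data sets are hypothetical.

```python
import statistics

def skewness(xs):
    """Population skewness: about 0 for symmetrical data, positive for a long right tail."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Population excess kurtosis: negative = flatter than normal, positive = more peaked."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]   # hypothetical, evenly spread scores
skewed = [1, 1, 1, 2, 10]     # hypothetical, one very high score

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(skewed), excess_kurtosis(skewed))
```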
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants
N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
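The two designs above lead to two different t calculations. The sketch below computes the t statistic and degrees of freedom by hand for each. The scores are hypothetical, the independent version assumes equal variances (a pooled estimate), and a p-value would still need a t table or a statistics package.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples t: one group measured twice; works on the difference scores."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # df = number of pairs - 1

def independent_t(group_a, group_b):
    """Independent samples t with a pooled variance (assumes equal variances)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, na + nb - 2  # df = N1 + N2 - 2

# Hypothetical vocab scores before and after extensive reading (same learners).
t_paired, df_paired = paired_t([10, 12, 9, 14, 11], [13, 14, 10, 17, 12])

# Hypothetical test scores for two separate groups.
t_indep, df_indep = independent_t([14, 15, 16, 17, 18], [10, 11, 12, 13, 14])

print(f"paired: t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_indep:.2f}, df = {df_indep}")
```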
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% chance of a “significant” result on each comparison even when no real difference exists
20 comparisons = about 1 expected by chance
100 comparisons = about 5 expected by chance
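The arithmetic behind these figures can be sketched as follows. The function names are illustrative. The calculation assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of chance 'significant' results across all comparisons."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Probability of at least one chance 'significant' result."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 20, 100):
    print(f"{k:>3} comparisons: "
          f"expected by chance = {expected_false_positives(0.05, k):.2f}, "
          f"P(at least one) = {familywise_error_rate(0.05, k):.2f}")
```

With 20 comparisons at an alpha of .05, the chance of at least one false positive is roughly 64%, which is why uncorrected multiple t-tests are risky.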
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
$$\text{adjusted alpha} = \frac{\text{alpha level}}{\text{no. of comparisons}} = \frac{0.05}{5} = 0.01$$
…and use this new figure as your alpha level.
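A short sketch of the adjustment in practice: the function below divides alpha by the number of comparisons, and the eight p-values are the ones reported for the attitude results in the Macaro & Erler table. The variable and function names are illustrative.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / comparisons

# The slide's own example: 0.05 / 5 comparisons.
print(bonferroni_alpha(0.05, 5))

# p-values from the eight attitude comparisons reported in the table.
p_values = [0.001, 0.024, 0.023, 0.001, 0.001, 0.001, 0.004, 0.005]
adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 = 0.00625

significant = [p for p in p_values if p < adjusted]
print(f"adjusted alpha = {adjusted:.5f}; "
      f"{len(significant)} of {len(p_values)} comparisons remain significant")
```

The adjusted alpha of about .006 matches the table's footnote, and the two results with p = .024 and p = .023 no longer count as significant even though they fall below .05.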
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the three groups (Control, Treatment 1, Treatment 2) at similar heights]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the three groups (Control, Treatment 1, Treatment 2), with M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can compare 3 or more groups in a
single test, avoiding the inflated Type I error
that comes from running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with an ANOVA to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with an ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
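To show where the F-statistic comes from, here is a minimal hand calculation for three groups. F is the between-groups mean square divided by the within-groups mean square. The scores are hypothetical, loosely modeled on the vocabulary example, and the p-value would still come from an F table or a statistics package.

```python
import statistics

def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, and F for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score sits from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f}

# Hypothetical vocab test scores for three learning groups.
word_cards = [14, 15, 16, 17, 18]
word_lists = [10, 11, 12, 13, 14]
pc_software = [12, 13, 14, 15, 16]

result = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({result['df_between']}, {result['df_within']}) = {result['F']:.2f}")
```

A larger F means the group means differ more relative to the spread of scores inside each group.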
ANOVA in the Literature
[Figure: tables from a published study, labeled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Pairwise comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
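Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The example below uses hypothetical scores and applies the Cohen (1988) benchmarks listed above. The "negligible" label for values under .01 is an assumption added for illustration.

```python
import statistics

def eta_squared(groups):
    """Eta-squared = SS between groups / SS total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta2):
    """Benchmarks from the slide: .01 small, .06 medium, .14 large."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"  # label assumed here; the slide gives no name for this range

# Hypothetical scores for three groups.
groups = [[14, 15, 16, 17, 18], [10, 11, 12, 13, 14], [12, 13, 14, 15, 16]]
eta2 = eta_squared(groups)
print(f"eta-squared = {eta2:.2f} ({cohen_label(eta2)})")
```

An eta-squared of .02 would be labeled a small effect under these benchmarks, even when the F-statistic reaches significance.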
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
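As a rough illustration of how Cronbach's alpha is calculated, the sketch below uses the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per questionnaire item, respondents in the same order."""
    k = len(items)
    item_variances = [statistics.variance(responses) for responses in items]
    # Each respondent's total score across all items.
    totals = [sum(answers) for answers in zip(*items)]
    return k / (k - 1) * (1 - sum(item_variances) / statistics.variance(totals))

# Hypothetical 1-5 Likert responses: 3 items answered by 5 respondents.
item1 = [1, 2, 3, 4, 5]
item2 = [2, 2, 3, 4, 5]
item3 = [1, 3, 3, 5, 5]

alpha = cronbach_alpha([item1, item2, item3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

The value is high here only because the made-up items rise and fall together. A high alpha shows that items pattern consistently, not that they measure what the researcher intended.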
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
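The loading arithmetic on these slides can be checked with a few lines: squaring a loading gives the proportion of variance an item shares with the factor. The function name is illustrative, and the loadings are the ones quoted for Items 43 and 38.

```python
def shared_variance_pct(loading):
    """Percentage of variance an item shares with a factor (squared loading x 100)."""
    return loading ** 2 * 100

quoted_loadings = {
    "Item 43 on F1": 0.63,
    "Item 38 on F3": 0.33,
    "Item 38 on F1": 0.29,
}

for label, loading in quoted_loadings.items():
    print(f"{label}: loading {loading:.2f} -> {shared_variance_pct(loading):.0f}% shared variance")
```

A loading of .33 leaves roughly 89% of the item's variance unrelated to the factor, which is why low cut-off points are treated as a problem.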
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean – central tendency
Standard Deviation – variability from the mean
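A minimal sketch of the two descriptive measures above, using Python's standard library. The scores are hypothetical and only illustrate how a mean and a sample standard deviation are obtained from raw data.

```python
import statistics

# Hypothetical test scores for one class (illustrative data, not from the talk).
scores = [62, 70, 75, 75, 80, 88]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample SD: spread of the scores around the mean

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting both values lets a reader picture where the group sits and how tightly the scores cluster.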
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
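As a rough illustration of the paired design above, the t statistic can be computed from the per-learner differences between Time 1 and Time 2. The function name `paired_t` and the scores are hypothetical; a statistics package would also supply the p-value, which this sketch does not.

```python
import math
import statistics

def paired_t(time1, time2):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se, n - 1

# Hypothetical vocab scores for the same five learners at two times
before = [10, 12, 9, 14, 11]
after = [13, 14, 10, 17, 13]
t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```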
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
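A comparable sketch for the two-group design, assuming roughly equal variances so that a pooled variance can be used (that assumption, the name `independent_t`, and the scores are all illustrative, not from the talk).

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for an independent-samples t-test with pooled variance."""
    n1, n2 = len(group_a), len(group_b)
    var1 = statistics.variance(group_a)
    var2 = statistics.variance(group_b)
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, df

# Hypothetical grammar test scores
implicit = [70, 72, 68, 74]
explicit = [75, 78, 74, 77]
t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

The degrees of freedom are n1 + n2 - 2, so groups of 62 and 54 give df = 114, which is the df reported in the Macaro & Erler (2007) results.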
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              Result
Reading           t = 4.91, df = 114, p = .001*
Speaking          t = 2.28, df = 114, p = .024
Writing           t = 2.30, df = 114, p = .023
Listening         t = 4.12, df = 114, p = .001*
Spelling          t = 3.74, df = 114, p = .001*
General learning  t = 3.61, df = 114, p = .001*
Homework          t = 2.92, df = 114, p = .004*
Textbook          t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
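The arithmetic on this slide can be sketched as follows. The sketch assumes that the comparisons are independent and that no real differences exist; the function names are hypothetical.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of comparisons expected to reach significance by chance."""
    return alpha * n_comparisons

def familywise_error(alpha, n_comparisons):
    """Probability of at least one chance 'significant' result."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 20, 100):
    print(m, expected_false_positives(0.05, m), round(familywise_error(0.05, m), 3))
```

With 20 comparisons at alpha = .05, about one chance result is expected, and the probability of at least one is roughly .64.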
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
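A short sketch of the adjustment, applied to the eight p-values listed on the Macaro & Erler (2007) slide. Reading that slide's "*p < .006" threshold as 0.05 / 8 is an inference, not something the slide states; `bonferroni_alpha` is a hypothetical helper name.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level for a set of multiple comparisons."""
    return alpha / n_comparisons

# p-values in the order they appear on the Macaro & Erler slide
p_values = [0.001, 0.024, 0.023, 0.001, 0.001, 0.001, 0.004, 0.005]

adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8
significant = [p < adjusted for p in p_values]

print(f"adjusted alpha = {adjusted}")
print(f"{sum(significant)} of {len(p_values)} comparisons remain significant")
```

Under this reading, the two comparisons with p = .024 and p = .023 no longer count as significant, which agrees with the asterisks on the slide.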
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: group means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant mean differences between groups.
However, t-tests work best when limited to 2 groups.
ANOVAs can work with 3 or more groups while introducing less error than running a separate t-test for each pair of groups.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
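To make the vocabulary example concrete, here is a minimal sketch of how the F-statistic is built from between-group and within-group variation. The scores and the name `one_way_anova` are hypothetical; the p-value would come from an F table or a statistics package and is not computed here.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within, ss_between, ss_within)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within, ss_between, ss_within

# Hypothetical vocab test scores for the three learning groups
word_cards = [4, 5, 6]
word_lists = [5, 6, 7]
pc_software = [9, 10, 11]

f, df_b, df_w, ss_b, ss_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A large F means the group means differ by much more than the scores vary inside each group.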
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using 5-point Likert scales (1-5)
Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Table excerpt from a published study, with two parts labeled: descriptive statistics and ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there are statistically significant differences between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance attributable to the treatments
– .01 – small effect*
– .06 – medium effect*
– .14 – large effect*
*According to Cohen (1988)
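Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The helper names below are hypothetical. Treating Cohen's benchmarks as lower bounds and calling anything below .01 "negligible" are assumptions made for illustration only.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance attributable to group membership."""
    return ss_between / ss_total

def cohen_label(eta2):
    """Rough verbal label using the Cohen (1988) benchmarks as lower bounds."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"  # label assumed here, not from the slide

# Hypothetical sums of squares: 42 between groups, 6 within groups
eta2 = eta_squared(42, 42 + 6)
print(f"eta-squared = {eta2:.3f} ({cohen_label(eta2)})")
```

An eta-squared of .02, for example, falls in the small range on these benchmarks even when the F-statistic is significant.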
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results
– Which groups were significantly different from each other?
No mention of effect size
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Indicates how strongly an item relates to the factor; the squared loading gives the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (all else being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
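For readers who want to see what lies behind the alpha value, here is a minimal sketch using the standard variance-based formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The responses and the name `cronbach_alpha` are hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per item, with respondents in the same order."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each respondent's total score across all items
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical Likert responses from four respondents on three items
item1 = [1, 2, 3, 4]
item2 = [2, 3, 4, 5]
item3 = [1, 3, 3, 5]

alpha = cronbach_alpha([item1, item2, item3])
print(f"alpha = {alpha:.3f}")
```

A high alpha only shows that the items move together; as noted above, it says nothing about whether they measure the intended construct.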
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
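The squaring step above is simple enough to check directly. In this small sketch `shared_variance` is a hypothetical helper name. The .63 loading is the one quoted for Item 43, and .33 is the weaker loading quoted later for Item 38.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2

for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.1%} shared variance")
```

A loading that looks respectable at .33 accounts for only about a tenth of the item's variance, which is why low cut-off points deserve caution.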
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 34
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat or peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and
sometimes forgotten by researchers.”
(Brown, 1988)
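The descriptive measures in this part of the talk can be sketched in a few lines of Python. This is only an illustration: the scores are hypothetical (they are not from the presentation), and `describe` is a made-up helper name. It reports N, the mean, the sample SD, and two rough indicators of "how normal" the data look: skewness (symmetry) and excess kurtosis (flat or peaked).

```python
import statistics

# Hypothetical vocabulary test scores for one class (not from the presentation).
scores = [12, 15, 15, 16, 17, 18, 18, 19, 21, 24]

def describe(data):
    """Return N, mean, sample SD, skewness, and excess kurtosis for a list of scores."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)        # sample SD (divides by n - 1)
    pop_sd = statistics.pstdev(data)   # population SD, used for the shape measures
    # Skewness: 0 for a perfectly symmetrical distribution.
    skew = sum(((x - mean) / pop_sd) ** 3 for x in data) / n
    # Excess kurtosis: negative suggests a flatter curve, positive a more peaked one.
    kurt = sum(((x - mean) / pop_sd) ** 4 for x in data) / n - 3
    return {"n": n, "mean": mean, "sd": sd, "skewness": skew, "kurtosis": kurt}

summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting N, the mean, and the SD together is what lets a reader judge whether the data are suitable for the further analyses discussed in the following sections.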
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
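As a rough sketch of what the two designs above compute, the following Python code calculates the t statistic and degrees of freedom for a paired samples and an independent samples t-test. The data are hypothetical, the function names are invented for illustration, and the independent version assumes equal variances in the two groups. Looking up the p-value for a given t and df is left to a statistics package.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples t-test: one group measured twice (same participant order)."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1     # (t, df)

def independent_t(group_a, group_b):
    """Independent samples t-test: two separate groups, equal variances assumed."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se, df

# Hypothetical vocab scores before and after extensive reading (one group).
before = [10, 12, 9, 14, 11, 13]
after = [12, 15, 10, 15, 14, 13]
t_paired, df_paired = paired_t(before, after)
print(f"paired: t = {t_paired:.2f}, df = {df_paired}")

# Hypothetical test scores for an implicit and an explicit grammar group.
implicit = [55, 60, 62, 58, 65, 59]
explicit = [63, 66, 70, 61, 68, 72]
t_ind, df_ind = independent_t(implicit, explicit)
print(f"independent: t = {t_ind:.2f}, df = {df_ind}")
```

Note how the degrees of freedom differ: N - 1 for the paired design, but N1 + N2 - 2 for the independent design.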
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
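The adjustment can be checked with a short Python sketch. The function names are illustrative, and the familywise calculation assumes the comparisons are independent. The p-values are the eight reported on the Macaro & Erler (2007) slide; dividing .05 by eight comparisons gives .00625, which appears consistent with the *p < .006 threshold shown there.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni adjustment: divide the alpha level by the number of comparisons."""
    return alpha / comparisons

def familywise_error(alpha, comparisons):
    """Chance of at least one Type I error, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

# The worked example from the slide: 0.05 / 5 comparisons.
print(f"adjusted alpha for 5 comparisons: {bonferroni_alpha(0.05, 5):.4f}")

# Why the adjustment matters: the risk of a false positive grows with each test.
for k in (1, 5, 20):
    print(f"{k} comparisons at .05: familywise error = {familywise_error(0.05, k):.2f}")

# The eight p-values reported for attitudes to French (Macaro & Erler, 2007).
p_values = [0.001, 0.024, 0.023, 0.001, 0.001, 0.001, 0.004, 0.005]
adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [p for p in p_values if p < adjusted]
print(f"adjusted alpha = {adjusted:.5f}; {len(significant)} of {len(p_values)} remain significant")
```

The two results that drop out (p = .024 and p = .023) would have counted as significant at the unadjusted .05 level, which is exactly the Type I versus Type II trade-off described above.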
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way”? - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
Similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3
different groups and then test them to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with an ANOVA to see if any groups
scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given
using Likert (1-5) scales
Results are analyzed with an ANOVA for significant differences in
satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: example results table, with callouts for the descriptive statistics and the ANOVA statistics]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the share of the total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
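To make the F-statistic and eta-squared more concrete, here is a minimal Python sketch of a one-way ANOVA computed by hand. The three sets of scores are hypothetical stand-ins for the word card, word list, and PC software groups, and `one_way_anova` is an illustrative helper, not a library function. Post-hoc comparisons and the p-value lookup are not included.

```python
import statistics

def one_way_anova(*groups):
    """Return the F-statistic, its degrees of freedom, and eta-squared."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: how far each score is from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Eta-squared: share of the total variance accounted for by group membership.
    eta_squared = ss_between / (ss_between + ss_within)
    return {"F": f_stat, "df": (df_between, df_within), "eta_squared": eta_squared}

# Hypothetical vocab test scores for three learning groups.
word_cards = [14, 16, 15, 18, 17]
word_lists = [12, 13, 15, 14, 11]
pc_software = [15, 17, 19, 16, 18]

result = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({result['df'][0]}, {result['df'][1]}) = {result['F']:.2f}, "
      f"eta-squared = {result['eta_squared']:.2f}")
```

The resulting eta-squared can be read against the Cohen (1988) guidelines above (.01 small, .06 medium, .14 large).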
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample
population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Amount of shared variance between items and the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
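For readers who want to see how Cronbach's alpha is put together, the sketch below uses the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are hypothetical and `cronbach_alpha` is an illustrative helper name.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: one list of responses per item, all in the same participant order.
    """
    k = len(items)
    sum_item_variances = sum(statistics.variance(column) for column in items)
    # Each participant's total score across all items on the scale.
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical 1-5 Likert responses from six participants to three items
# intended to measure the same factor.
item_1 = [4, 5, 3, 2, 4, 5]
item_2 = [4, 4, 3, 2, 5, 5]
item_3 = [5, 4, 2, 3, 4, 4]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because alpha rests only on how the items vary together, a high value shows internal consistency, not that the items measure what they are meant to measure.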
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
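The arithmetic behind these percentages is simply the loading squared. A small Python sketch, using the three loadings quoted on the slides (Item 43 on F1, and Item 38 on F3 and F1), shows why the .33 and .29 loadings are weak; `shared_variance` is an illustrative helper name.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor: the squared loading."""
    return loading ** 2

# Loadings quoted on the slides.
loadings = [
    ("Item 43", "F1", 0.63),
    ("Item 38", "F3", 0.33),
    ("Item 38", "F1", 0.29),
]

for item, factor, loading in loadings:
    print(f"{item} on {factor}: loading {loading:.2f} -> "
          f"{shared_variance(loading):.0%} shared variance")
```

Item 38 shares little variance with either factor, and its two loadings are close to each other, which is the kind of loading problem the slide title refers to.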
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 36
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 37
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
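A minimal sketch of what the ANOVA computes for a design like this is shown below. The scores are invented purely for illustration, and the function is a hand-rolled calculation rather than the output of any particular statistics package; obtaining the p-value for the F-statistic would additionally require the F distribution, which a statistics package normally supplies.

```python
from statistics import mean


def one_way_anova(groups):
    """Return the core entries of a one-way ANOVA table for a list of score lists."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Between-groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score sits from its own group mean
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical vocab test scores for the three learning groups
word_cards = [18, 20, 22, 24, 26]
word_lists = [14, 16, 18, 20, 22]
pc_software = [20, 22, 24, 26, 28]

result = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({result['df_between']}, {result['df_within']}) = {result['F']:.2f}")
```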
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: excerpt from a published study, with its descriptive statistics and ANOVA statistics tables labeled]
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
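Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it and attaches the Cohen (1988) labels listed above; the sums of squares in the example are hypothetical, and the "below small" label for values under .01 is an assumption, since the slide gives no label for that range.

```python
def eta_squared(ss_between: float, ss_total: float) -> float:
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def cohen_label(eta_sq: float) -> str:
    """Label an eta-squared value using the .01 / .06 / .14 guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "below small"  # assumption: no label is given for values under .01


# Hypothetical sums of squares from an ANOVA table
value = eta_squared(ss_between=10.0, ss_total=500.0)
print(f"eta-squared = {value:.2f} ({cohen_label(value)} effect)")
```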
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (all else equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
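For readers who want to see where Cronbach's alpha comes from, here is a minimal sketch using the standard formula: alpha = k / (k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are invented for illustration only.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup scores by item
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Hypothetical 1-5 Likert responses: 5 respondents x 3 items
responses = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```

A high value here only says the items vary together; as noted above, it does not establish that the items are valid measures of the construct.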
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
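The calculation on this slide is simply the loading squared. A small sketch, using the .4 and .7 guidelines given earlier in the terminology; the function names and the wording of the labels are hypothetical.

```python
def shared_variance(loading: float) -> float:
    """Proportion of variance an item shares with its factor (loading squared)."""
    return loading ** 2


def loading_quality(loading: float) -> str:
    """Rough label based on the .4 (desirable) and .7 (excellent) guidelines."""
    size = abs(loading)
    if size > 0.7:
        return "excellent"
    if size > 0.4:
        return "desirable"
    return "below the .4 guideline"


# Item 43 from the slide loads on F1 at .630
loading = 0.63
print(f"{loading} x {loading} = {shared_variance(loading):.0%} shared variance "
      f"({loading_quality(loading)})")
```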
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
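The participants-per-item guideline is easy to apply to any questionnaire length. A minimal sketch, assuming the 5-10 participants per item range attributed to Field (2005) above; the function name is hypothetical.

```python
def participants_needed(n_items: int, per_item_low: int = 5, per_item_high: int = 10):
    """Return the (minimum, maximum) suggested N for a questionnaire of n_items."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```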
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
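To make the two measures concrete, here is a minimal sketch using Python's standard library. The scores for the two classes are invented for illustration and are not from any study cited in the slides. Both classes have the same mean, but the spread of scores (SD) differs considerably.

```python
import statistics

# Hypothetical vocabulary test scores for two classes (illustrative only).
class_a = [60, 62, 65, 65, 68, 70]
class_b = [40, 55, 65, 65, 75, 90]

def describe(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)

mean_a, sd_a = describe(class_a)
mean_b, sd_b = describe(class_b)

# Same central tendency, very different variability.
print(f"Class A: M = {mean_a:.2f}, SD = {sd_a:.2f}")
print(f"Class B: M = {mean_b:.2f}, SD = {sd_b:.2f}")
```

Reporting only the mean would hide the fact that Class B is far more mixed than Class A, which is why a study should report both.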
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
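One rough way to check "how normal" a set of scores is: compute skewness (asymmetry) and excess kurtosis (flat vs. peaked). The sketch below uses the simple population-moment formulas. Statistics packages often apply small-sample corrections, so their values may differ slightly. The data are hypothetical.

```python
def central_moment(scores, k):
    """k-th central moment: the average of (x - mean)^k."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** k for x in scores) / len(scores)

def skewness(scores):
    """0 for a symmetrical distribution; positive when the long tail is on the right."""
    return central_moment(scores, 3) / central_moment(scores, 2) ** 1.5

def excess_kurtosis(scores):
    """About 0 for a normal curve; negative when flat, positive when peaked."""
    return central_moment(scores, 4) / central_moment(scores, 2) ** 2 - 3

# Hypothetical score sets (illustrative only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(f"symmetric:    skew = {skewness(symmetric):.2f}, "
      f"excess kurtosis = {excess_kurtosis(symmetric):.2f}")
print(f"right-skewed: skew = {skewness(right_skewed):.2f}, "
      f"excess kurtosis = {excess_kurtosis(right_skewed):.2f}")
```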
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: random or convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
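The two designs above lead to two different calculations. The following is a minimal sketch of both t statistics using only the standard library. All scores are invented, and the independent-samples version assumes equal variances (the pooled-variance form). Converting t and df into a p value needs a t table or a statistics package, so that step is left out.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples: t is computed on each participant's difference score."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # df = number of pairs - 1

def independent_t(group_a, group_b):
    """Independent samples: t compares two group means (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n_a + n_b - 2  # df = N1 + N2 - 2

# Hypothetical vocab scores: one group tested before and after extensive reading.
time1 = [10, 12, 9, 14, 11]
time2 = [13, 14, 10, 17, 13]
t_paired, df_paired = paired_t(time1, time2)

# Hypothetical test scores: two separate groups (implicit vs. explicit grammar).
implicit = [10, 12, 9, 14, 11]
explicit = [13, 14, 10, 17, 13]
t_ind, df_ind = independent_t(implicit, explicit)

print(f"paired:      t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_ind:.2f}, df = {df_ind}")
```

The same ten numbers give a much larger t in the paired design, because each learner serves as their own control. This is one reason a write-up should state which type of t-test was run.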
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              t      df    p
Reading           4.91   114   .001*
Speaking          2.28   114   .024
Writing           2.30   114   .023
Listening         4.12   114   .001*
Spelling          3.74   114   .001*
General learning  3.61   114   .001*
Homework          2.92   114   .004*
Textbook          3.01   114   .005*
*p < .006
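A reader can check two things in this table with simple arithmetic. The sketch below uses the group sizes and p values shown on the slides. Pairing each area with a result row by order of appearance is an assumption. It is also an assumption, though the numbers are consistent with it, that the .006 threshold comes from dividing .05 by the 8 comparisons.

```python
# Group sizes reported on the slide.
n_treatment, n_control = 62, 54

# For an independent samples t-test, df = N1 + N2 - 2.
df = n_treatment + n_control - 2

# p values from the slide, paired with areas by order of appearance (assumption).
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

# Threshold marked on the slide; .05 / 8 rounds to the same figure.
alpha_adjusted = 0.006
bonferroni_check = round(0.05 / len(p_values), 3)

significant = [area for area, p in p_values.items() if p < alpha_adjusted]
not_significant = [area for area, p in p_values.items() if p >= alpha_adjusted]

print(f"df = {df}")
print(f".05 / {len(p_values)} comparisons = {bonferroni_check}")
print("Significant at adjusted alpha:", ", ".join(significant))
print("Not significant at adjusted alpha:", ", ".join(not_significant))
```

Speaking and Writing would count as significant at the usual .05 level but not at the adjusted level, which is what the missing asterisks on those two rows indicate.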
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% risk of a false positive on each comparison when there is no real difference
20 comparisons = about 1 significant result by chance
100 comparisons = about 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
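The adjustment can be sketched in a few lines. The family-wise error function below rests on a simplifying assumption not stated on the slides: that the comparisons are independent and that no real differences exist.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

alpha = 0.05
n_comparisons = 5

adjusted = bonferroni_alpha(alpha, n_comparisons)
risk_unadjusted = familywise_error(alpha, n_comparisons)
risk_adjusted = familywise_error(adjusted, n_comparisons)

print(f"Adjusted alpha: {adjusted:.3f}")
print(f"Risk of at least one Type I error, unadjusted: {risk_unadjusted:.3f}")
print(f"Risk of at least one Type I error, adjusted:   {risk_adjusted:.3f}")
```

With five unadjusted comparisons the overall risk of a false positive is well above 5%. After the adjustment it falls back to just under 5%, at the cost of making each individual comparison harder to pass (the Type II risk described earlier).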
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less (Type I) error
than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
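The vocabulary example can be sketched as a hand calculation of the F-statistic. The scores below are invented for illustration. Only the three-group design comes from the slides. Turning F into a p value needs an F table or a statistics package, so that step is omitted.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_scores)

    # Between-groups: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score lies from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocab test scores (illustrative only).
word_cards = [14, 15, 16, 17, 18]    # Group I
word_lists = [10, 11, 12, 13, 14]    # Group II
pc_software = [12, 13, 14, 15, 16]   # Group III

f_stat, df_between, df_within = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

F is the ratio of variance between the groups to variance within them, so it grows as the group means move apart relative to the spread inside each group.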
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Example tables from a published study: descriptive statistics and ANOVA statistics]
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance explained by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
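Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it for invented scores and attaches the Cohen (1988) labels quoted on the slides. The "negligible" label for values under .01 is an addition for illustration, not something from the slides.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance explained by group membership."""
    all_scores = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the thresholds attributed to Cohen (1988)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed here; not from the slides

# Hypothetical scores for three groups (illustrative only).
groups = [[14, 15, 16, 17, 18], [10, 11, 12, 13, 14], [12, 13, 14, 15, 16]]
effect = eta_squared(groups)
print(f"eta-squared = {effect:.2f} ({cohen_label(effect)} effect)")

# The value quoted in the example write-up above.
print(f"eta-squared = .02 ({cohen_label(0.02)} effect)")
```

This is why the quoted write-up can be both "statistically significant" and "quite small": significance depends heavily on sample size, while eta-squared describes how much of the variation the treatment accounts for.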
Common ANOVA
Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or
sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
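For readers who want to see where alpha comes from, here is a minimal sketch of the standard formula: alpha = k / (k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses below are invented for illustration.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per item, with participants in the same order."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each participant's total score across all items.
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical 1-5 Likert responses from five participants on three items.
item_1 = [4, 5, 3, 2, 4]
item_2 = [4, 4, 3, 2, 5]
item_3 = [5, 5, 2, 3, 4]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha rises when participants answer the items consistently (high totals go with high scores on each item). It says nothing about whether the items measure what the researcher intended.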
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
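The arithmetic on this slide is simply the loading multiplied by itself. A short sketch follows, using the .63 loading above and the weaker .33 and .29 loadings for Item 38 that appear among the typical factor loading issues later in the talk.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor: the loading squared."""
    return loading ** 2

# Loadings quoted in the slides.
examples = [
    ("Item 43 on F1", 0.63),
    ("Item 38 on F3", 0.33),
    ("Item 38 on F1", 0.29),
]

for label, loading in examples:
    print(f"{label}: loading {loading:.2f} -> "
          f"{shared_variance(loading):.0%} shared variance")
```

Squaring makes weak loadings look weaker still: a .33 loading leaves nearly 90% of the item's variance unrelated to the factor.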
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire with 3-4 factors: 150-300 participants
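The participants-per-item guideline is easy to apply when reading a study. The sketch below is only that rule of thumb turned into a helper. The function name and defaults are illustrative, not from any package.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N-size range from the 5-10 participants per item guideline."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")

# Checking a hypothetical study: 40 items answered by 120 students.
low_40, high_40 = participants_needed(40)
print(f"40-item questionnaire: {low_40}-{high_40} participants "
      f"(120 would fall short of the lower bound)")
```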
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 42
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% risk, in each comparison, of calling a chance difference significant
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
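The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the familywise calculation assumes the comparisons are independent and that no real differences exist; it is a rough illustration rather than a full treatment.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of comparisons that reach significance by chance alone."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 20, 100):
    print(k, expected_false_positives(0.05, k), round(familywise_error_rate(0.05, k), 3))
```

With 20 comparisons at alpha = .05, the chance of at least one Type I error is roughly 64%, which is why an adjustment is needed.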
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
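As a small sketch of the adjustment, the code below divides alpha by the number of comparisons and applies it to the eight p values from the Macaro & Erler (2007) attitude results. The function name `bonferroni_alpha` is hypothetical, and pairing each area with a p value by the order in which they appear on the slide is an assumption.

```python
def bonferroni_alpha(alpha, comparisons):
    """Adjusted alpha level for a given number of comparisons."""
    return alpha / comparisons

# p values as listed on the slide; the area-to-value pairing is assumed from their order
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]

print(bonferroni_alpha(0.05, 5))  # the slide's worked example
print(adjusted, significant)
```

The adjusted level for eight comparisons, .00625, appears consistent with the *p < .006 threshold reported for those results: Speaking and Writing fall below .05 but not below the adjusted level.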
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
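To show where the F-statistic comes from, here is a minimal sketch that builds the pieces of an ANOVA table by hand for three groups, as in the vocabulary example. The scores are invented and the function name `one_way_anova` is hypothetical. The p value is left out, since looking it up requires the F distribution, which statistical software handles.

```python
from statistics import mean

def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a list of groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f}

# Hypothetical vocab test scores
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

result = one_way_anova([word_cards, word_lists, pc_software])
print(result)
```

F grows when the group means are far apart relative to the spread of scores inside each group, which is why a higher value favors the model.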
ANOVA in the Literature
[Figure: tables from a published study, labeled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
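Eta-squared is commonly calculated as the between-groups sum of squares divided by the total sum of squares. The sketch below applies that formula and the Cohen (1988) guidelines listed above. The function names and the "negligible" label for values under .01 are assumptions added for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Rough size label using the .01 / .06 / .14 guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label not from the slides; an assumption

print(cohen_label(0.02))                   # an effect size of .02 counts as small
print(cohen_label(eta_squared(24, 30)))    # hypothetical sums of squares
```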
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or
sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared
loading gives the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
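For readers curious how Cronbach's alpha is obtained, the sketch below uses the standard variance-based formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The Likert responses are invented and the function name `cronbach_alpha` is hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, in the same respondent order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total score
    sum_item_variances = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))

# Hypothetical responses from four participants to three items
item_1 = [2, 3, 4, 5]
item_2 = [3, 3, 5, 5]
item_3 = [2, 4, 4, 5]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(round(alpha, 3))
```

A high alpha here only says the three items rise and fall together across respondents; it says nothing about whether they measure the intended construct.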
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
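The step from a loading to shared variance is simply squaring. A brief sketch, with a hypothetical function name, using the .63 loading above and, for contrast, a weak loading of .33:

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2

for loading in (0.63, 0.33):
    percent = round(shared_variance(loading) * 100)
    print(f"loading {loading:.2f} -> about {percent}% shared variance")
```

Squaring shows why low cut-offs are risky: a loading of .33 leaves nearly 90% of the item's variance unrelated to the factor.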
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
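The participants-per-item guideline can be turned into a quick check when planning a questionnaire. This sketch assumes the 5-10 per item rule cited above; the function name is hypothetical.

```python
def sample_size_range(n_items, per_item_low=5, per_item_high=10):
    """Minimum and maximum suggested N under a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

print(sample_size_range(30))  # the 30-item questionnaire example
```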
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 45
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
– Central tendency
Standard Deviation
– Variability from mean
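As a minimal sketch of these two measures, the Python below computes the mean and the sample standard deviation for a small set of scores. The scores are invented for illustration and are not taken from any study.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustration only).
scores = [12, 15, 15, 18, 20]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample SD: variability around the mean

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting both values together lets a reader picture where the group sits and how spread out it is.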
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical
Flat or peaked
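One common way to put numbers on "how normal?" is to compute skewness (asymmetry) and excess kurtosis (flat versus peaked). The sketch below uses the simple population formulas and invented data; the function names are hypothetical, and statistical packages may apply slightly different small-sample corrections.

```python
import statistics

def skewness(data):
    """Population skewness: 0 for a perfectly symmetrical distribution."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(((x - m) / sd) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Population excess kurtosis: negative = flatter, positive = more peaked."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(((x - m) / sd) ** 4 for x in data) / len(data) - 3

# Hypothetical data sets (illustration only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(symmetric))
```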
Issues
Are the data appropriate
for further statistical analyses?
Mean and SD
Participants
N size and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
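A rough sketch of the arithmetic behind a paired samples t-test follows. The scores are invented, and the p-value step is left out because it needs the t distribution from a statistics package.

```python
import math
import statistics

# Hypothetical vocab scores for the same five learners (illustration only).
time1 = [10, 12, 9, 14, 11]    # before extensive reading
time2 = [13, 14, 10, 17, 12]   # after extensive reading

# A paired test works on each learner's own gain score.
diffs = [after - before for before, after in zip(time1, time2)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)

t = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1   # degrees of freedom for a paired test

print(f"t = {t:.2f}, df = {df}")
```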
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
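The two-group case can be sketched the same way. This version assumes roughly equal variances in the two groups (the pooled-variance form of the test) and uses invented scores.

```python
import math
import statistics

# Hypothetical test scores for two separate groups (illustration only).
group_a = [10, 12, 14, 16, 18]   # implicit grammar instruction
group_b = [14, 16, 18, 20, 22]   # explicit grammar instruction

n_a, n_b = len(group_a), len(group_b)
var_a = statistics.variance(group_a)
var_b = statistics.variance(group_b)

# Pooled variance: a weighted average of the two group variances.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t = (statistics.mean(group_a) - statistics.mean(group_b)) / std_err
df = n_a + n_b - 2

print(f"t = {t:.2f}, df = {df}")
```

The same df rule explains a figure such as df = 114 for groups of 62 and 54 participants (62 + 54 - 2).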
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area: result
Reading: t = 4.91, df = 114, p = .001*
Speaking: t = 2.28, df = 114, p = .024
Writing: t = 2.30, df = 114, p = .023
Listening: t = 4.12, df = 114, p = .001*
Spelling: t = 3.74, df = 114, p = .001*
General learning: t = 3.61, df = 114, p = .001*
Homework: t = 2.92, df = 114, p = .004*
Textbook: t = 3.01, df = 114, p = .005*
*p < .006
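The starred threshold of p < .006 is consistent with dividing an alpha of .05 by the eight comparisons, although that derivation is an assumption here rather than something the slide states. The sketch below applies that cut-off to the eight reported p-values, pairing them with the areas in the order listed (also an assumption).

```python
# p-values as listed on the slide; pairing with areas assumes listed order.
reported = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

alpha = 0.05
adjusted_alpha = alpha / len(reported)   # 0.05 / 8 = 0.00625

significant = [area for area, p in reported.items() if p < adjusted_alpha]
not_significant = [area for area, p in reported.items() if p >= adjusted_alpha]

print("Adjusted alpha:", adjusted_alpha)
print("Significant:", significant)
print("Not significant:", not_significant)
```

Speaking and Writing would count as significant at .05 but not at the adjusted level.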
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
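Alpha of .05 = a 5% risk of a false positive on each comparison when there is no real difference
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 significant results expected by chance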
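The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the second function assumes the comparisons are independent of each other, which is a simplification.

```python
ALPHA = 0.05

def expected_false_positives(n_comparisons, alpha=ALPHA):
    """Average number of chance 'significant' results if no real differences exist."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=ALPHA):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 20, 100):
    print(m, expected_false_positives(m), round(familywise_error_rate(m), 3))
```

With 20 comparisons the chance of at least one false positive is already well above one half, which is why the alpha level needs adjusting.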
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means (M1, M2, M3) plotted for three groups: Control, Treatment 1, Treatment 2]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means (M1, M2, M3) plotted for three groups: Control, Treatment 1, Treatment 2]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with an ANOVA to
see if any groups scored significantly
higher/lower
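To make the vocabulary example concrete, here is a minimal sketch of how the F-statistic is built from between-group and within-group variation. The scores are invented and far too few for a real study. The p-value lookup is omitted because it needs the F distribution from a statistics package.

```python
import statistics

# Hypothetical vocab test scores for the three learning groups (illustration only).
groups = {
    "word cards": [4, 5, 6],
    "word lists": [6, 7, 8],
    "PC software": [8, 9, 10],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = statistics.mean(all_scores)

# Between-groups sum of squares: distance of each group mean from the grand mean.
ss_between = sum(
    len(scores) * (statistics.mean(scores) - grand_mean) ** 2
    for scores in groups.values()
)
# Within-groups sum of squares: spread of scores around their own group mean.
ss_within = sum(
    (x - statistics.mean(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would lead one to expect.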
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with an ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is significant, i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
These are done if the F-statistic is significant
Paired comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance attributable to the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
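Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it and attaches the benchmark labels listed above. The sums of squares are invented, and the "negligible" label for values under .01 is an addition for illustration, not part of the benchmarks.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance associated with group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"   # hypothetical label for values below .01

# Hypothetical sums of squares from an ANOVA table (illustration only).
value = eta_squared(ss_between=24.0, ss_total=1200.0)
print(round(value, 2), cohen_label(value))
```

A result can be statistically significant and still carry a small label here, which is why effect size is reported alongside the F-statistic.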
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - how much answers to each
item (variable) differ across participants
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha tends to be)
– Does not “prove” cause-effect or validity of items themselves
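A minimal sketch of the standard Cronbach's alpha formula follows: it compares the sum of the individual item variances with the variance of the respondents' total scores. The responses are invented and the function name is hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per item, same respondents in the same order."""
    k = len(items)
    item_variances = sum(statistics.variance(item) for item in items)
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical Likert (1-5) responses: 3 items x 5 respondents (illustration only).
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}")
```

A high alpha only shows that the items vary together; it says nothing about whether they measure the intended construct.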
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
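The calculation on this slide is simply the loading squared. As a small sketch, the same step is shown for the .63 loading here and for a weaker .33 loading of the kind discussed under typical factor loading issues. The function name is hypothetical.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with a factor (squared loading)."""
    return loading ** 2

for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```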
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
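The participants-per-item rule of thumb can be sketched as a one-line calculation. The function name and default arguments are hypothetical and simply encode the 5-10 range quoted above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of participants implied by a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```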
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 46
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 47
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
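A paired samples t-test works on the difference between each learner's two scores. The sketch below computes the t statistic and degrees of freedom by hand; the scores and the function name `paired_t` are hypothetical, and the p value (which needs the t distribution) is left to a statistics package.

```python
import math
import statistics

def paired_t(time1, time2):
    """Return (t, df) for one group measured twice, in the same learner order."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical vocab test scores for six learners.
before_reading = [10, 12, 9, 14, 11, 13]   # Time 1: no extensive reading
after_reading = [12, 15, 10, 15, 14, 14]   # Time 2: after extensive reading

t, df = paired_t(before_reading, after_reading)
print(f"t = {t:.2f}, df = {df}")
```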
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
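For two separate groups, the t statistic compares the difference in means with a standard error pooled from both groups. This is a sketch of the classic equal-variance version, with hypothetical scores and a hypothetical function name; other variants (such as Welch's test) exist and are not shown.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for two independent groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df

# Hypothetical grammar test scores.
implicit = [14, 16, 15, 13, 17]   # Group A
explicit = [18, 17, 19, 16, 20]   # Group B

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

The degrees of freedom are the two group sizes minus 2, so groups of 62 and 54 give df = 114.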
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Reading: t = 4.91, df = 114, p = .001*
Speaking: t = 2.28, df = 114, p = .024
Writing: t = 2.30, df = 114, p = .023
Listening: t = 4.12, df = 114, p = .001*
Spelling: t = 3.74, df = 114, p = .001*
General learning: t = 3.61, df = 114, p = .001*
Homework: t = 2.92, df = 114, p = .004*
Textbook: t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha level of .05 = when there is no real difference, a comparison still reaches significance by chance about 5% of the time
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
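The arithmetic behind these figures can be sketched as follows. Both functions assume that no real differences exist and that the comparisons are independent; the function names are illustrative.

```python
def expected_false_positives(alpha, n_comparisons):
    """Expected number of comparisons that reach significance by chance alone."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Probability of at least one chance 'significant' result across all comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

print(expected_false_positives(0.05, 20))    # about 1
print(expected_false_positives(0.05, 100))   # about 5
print(round(familywise_error_rate(0.05, 20), 2))
```

With 20 comparisons, the chance of at least one false positive is well over half, which is why the alpha level needs adjusting.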
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
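As a sketch, the same adjustment can be applied to the eight attitude comparisons reported from Macaro & Erler (2007). The pairing of areas with p values assumes the results are listed in the same order as the areas; the function name is illustrative.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / n_comparisons

# p values as reported on the slide (area order is an assumption).
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print(significant)
```

Speaking and Writing would count as significant at .05 but not at the adjusted level, which matches the slide's note of *p < .006.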
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with an ANOVA to see if any groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are analyzed with an ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
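To show where the F-statistic comes from, the sketch below computes it by hand as the variance between groups divided by the variance within groups. The scores are hypothetical (loosely following the word cards / word lists / PC software example), and the function name is an assumption; the p value would come from a statistics package.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Sum of squares between groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Sum of squares within groups: how far each score is from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

word_cards = [14, 16, 15, 13, 17]
word_lists = [12, 13, 11, 14, 10]
pc_software = [18, 17, 19, 16, 20]

f, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A larger F means the group means differ more relative to the spread of scores inside each group.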
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
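Eta-squared is the between-groups sum of squares divided by the total sum of squares, both of which appear in an ANOVA table. The sketch below computes it and attaches the Cohen (1988) labels given above. The function names and the "negligible" label for values under .01 are assumptions added for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of the total variance accounted for by the treatment."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the thresholds .01, .06, and .14."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label; not part of the slide

# Hypothetical sums of squares from an ANOVA table.
effect = eta_squared(ss_between=90, ss_total=120)
print(effect, cohen_label(effect))
print(cohen_label(0.02))
```

An eta-squared of .02 falls in the small range even when the F-statistic is significant.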
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance between the item and the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., with similar correlations, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
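Cronbach's alpha can be sketched from its usual formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The responses below are hypothetical Likert answers, and the function name is illustrative.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per item, with respondents in the same order."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Each respondent's total score across all items.
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical answers from five respondents to three items on one factor.
item_1 = [1, 2, 3, 4, 5]
item_2 = [2, 2, 3, 4, 4]
item_3 = [1, 3, 3, 5, 5]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"alpha = {alpha:.2f}")
```

A high alpha only says the items move together; it does not show that they measure the intended construct.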
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
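The participants-per-item rule of thumb is simple multiplication, sketched below. The function name and its default values are illustrative assumptions based on the 5-10 range quoted above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```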
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
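The shared-variance figures for Item 43 and Item 38 come from squaring the factor loading. A minimal sketch, with an illustrative function name:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (the squared loading)."""
    return loading ** 2

for item, loading in [("Item 43", 0.63), ("Item 38", 0.33)]:
    print(f"{item}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

A loading of .33 leaves almost 90% of the item's variance unrelated to the factor, which is why low cut-off points are a concern.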
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 49
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
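As a rough illustration of where an eta-squared value comes from, the sketch below divides the between-groups sum of squares by the total sum of squares and then applies the Cohen (1988) bands listed earlier. The data are invented, the function names are hypothetical, and the "negligible" label for values under .01 is an assumption added here, not part of the slides.

```python
def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Map eta-squared onto the small/medium/large bands (.01, .06, .14)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below the smallest band

# Hypothetical scores for three groups
example = eta_squared([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(round(example, 2), cohen_label(example))
print(cohen_label(0.02))
```

An effect size of .02, as in the quoted report, falls in the small band even though the F-statistic reached significance.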
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., for the same average correlation, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
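The dependence of alpha on both inter-item correlation and item count can be sketched in a few lines of Python. This is an illustrative sketch only: the Likert responses are invented, the function names are hypothetical, and standardized_alpha uses the standardized form k·r / (1 + (k − 1)·r), which assumes a single average inter-item correlation r.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def standardized_alpha(k, mean_r):
    """Alpha from the number of items k and the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical 5-point Likert responses from five students on three items
q1 = [4, 3, 5, 2, 4]
q2 = [5, 3, 4, 2, 4]
q3 = [4, 2, 5, 3, 3]
alpha = cronbach_alpha([q1, q2, q3])
print(f"alpha = {alpha:.2f}")

# Same average correlation, more items: alpha rises
print(standardized_alpha(3, 0.3), standardized_alpha(6, 0.3))
```

With the average correlation held at .3, doubling the scale from three to six items raises the standardized alpha without the items themselves becoming any better, which is one reason a high alpha alone says little about validity.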
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
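The arithmetic on this slide can be checked with a short Python sketch: squaring a loading gives the proportion of variance the item shares with the factor. The function names are hypothetical, and treating the .4 guideline as "at or above .4" is an assumption made for the example.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor."""
    return loading ** 2

def meets_cutoff(loading, cutoff=0.4):
    """True if the absolute loading reaches the chosen cut-off (assumed >=)."""
    return abs(loading) >= cutoff

pct = round(shared_variance(0.63) * 100)
print(f"{pct}% shared variance, acceptable: {meets_cutoff(0.63)}")
```

The absolute value is used because a strongly negative loading is just as informative as a positive one; only its direction differs.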
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
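The participants-per-item guideline is simple multiplication, sketched here in Python. The function name and its parameters are hypothetical; the 5 and 10 defaults are the per-item range attributed to Field (2005) above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) sample size suggested by the rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```

For a 30-item questionnaire this reproduces the 150-300 range in the example, and it makes it easy to see how quickly the required sample grows as items are added.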
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 52
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
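As a rough illustration of these two measures, here is a minimal sketch using Python's standard library. The scores are hypothetical and not taken from the talk; `stdev` uses the sample (n - 1) formula, which is an assumption about how a given study computed its SD.

```python
import statistics

# Hypothetical vocabulary test scores for one class (not from the talk).
scores = [12, 15, 15, 18, 20, 22, 24]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting N, M, and SD together is what lets a reader picture where the group sits and how spread out it is.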
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
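One way to put numbers on "how normal?" is to compute skewness and excess kurtosis. The sketch below uses simple population formulas and hypothetical scores; statistical packages often apply small-sample corrections, so their output may differ slightly.

```python
import statistics

def skewness(xs):
    """Population skewness: 0 for a symmetrical distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Population kurtosis minus 3: negative is conventionally read as
    flatter than a normal curve, positive as more peaked."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetrical = [1, 2, 3, 4, 5]   # hypothetical scores
skewed = [1, 1, 2, 2, 3, 10]    # hypothetical scores with one high outlier

print(skewness(symmetrical), excess_kurtosis(symmetrical))
print(skewness(skewed), excess_kurtosis(skewed))
```

Values near zero on both measures are consistent with a roughly normal shape; large values suggest checking the data before running further tests.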
Issues
Are the data appropriate for further statistical analyses?
– Mean and SD
– Participants: N size and sampling (random or convenience)
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
(Brown, 1988)
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
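The paired design above can be sketched in a few lines. The scores are hypothetical, and the function name `paired_t` is only illustrative; it computes the t value and degrees of freedom, leaving the p-value to a t table or statistics package.

```python
import math
import statistics

def paired_t(before, after):
    """Paired samples t: mean difference divided by its standard error."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # df = number of pairs - 1

# Hypothetical vocab test scores for the same five learners.
time1 = [10, 12, 9, 14, 11]   # before extensive reading
time2 = [12, 15, 10, 15, 14]  # after extensive reading

t, df = paired_t(time1, time2)
print(f"t = {t:.2f}, df = {df}")
```

Because each learner is compared with himself or herself, only the differences between the two times enter the calculation.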
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
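For the two-group design, a minimal sketch of the classic pooled-variance (Student's) t is shown below. The scores and the name `independent_t` are hypothetical, and the pooled formula assumes the two groups have similar variances.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Independent samples t with pooled variance (assumes equal variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n_a + n_b - 2  # df = total N - 2

# Hypothetical grammar test scores.
implicit = [60, 65, 70, 75, 80]  # Group A
explicit = [70, 75, 80, 85, 90]  # Group B

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

The same df rule explains published figures: two groups of 62 and 54 participants give df = 62 + 54 - 2 = 114.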
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              t      df    p
Reading           4.91   114   .001*
Speaking          2.28   114   .024
Writing           2.30   114   .023
Listening         4.12   114   .001*
Spelling          3.74   114   .001*
General learning  3.61   114   .001*
Homework          2.92   114   .004*
Textbook          3.01   114   .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% chance of a "significant" result on each comparison when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
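The arithmetic behind these figures can be sketched as follows. The function names are illustrative, and the calculation assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of chance 'significant' results."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=0.05):
    """Probability of at least one chance 'significant' result."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error_rate(k), 3))
```

With 20 unadjusted comparisons, the chance of at least one Type I error is already well over one half, which is why an adjustment is needed.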
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Adjusted alpha = alpha level / no. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
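As a sketch, the same adjustment can be applied to the eight p-values reported for Macaro & Erler (2007) above. Pairing each area label with a p-value assumes the slide listed them in the same order; the helper name `bonferroni_alpha` is illustrative.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted alpha: divide alpha by the number of comparisons."""
    return alpha / n_comparisons

# p-values as shown on the slide; area-to-value pairing assumes slide order.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print("significant after adjustment:", significant)
```

Under this reading, Speaking and Writing would pass an unadjusted .05 cut-off but not the adjusted one, which matches the asterisks on the slide.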
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
[Figure: means M1, M2, M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups.
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while carrying less risk of Type I
error than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with an ANOVA to
see if any groups scored significantly
higher/lower
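To make the vocabulary example concrete, here is a minimal sketch of how the F-statistic is built from between-group and within-group variance. The scores are hypothetical and the function name `one_way_anova` is illustrative; the p-value would still come from an F table or a statistics package.

```python
import statistics

def one_way_anova(*groups):
    """Return F, df between groups, and df within groups."""
    scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores for the three learning groups.
word_cards = [4, 5, 6]
word_lists = [6, 7, 8]
pc_software = [8, 9, 10]

f, df_b, df_w = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A large F means the group means differ by more than the spread inside each group would lead us to expect.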
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
Results are analyzed with an ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published tables labeled "Descriptive statistics" and "ANOVA statistics"]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– .01: small effect
– .06: medium effect
– .14: large effect
*According to Cohen (1988)
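Eta-squared is simply the between-groups sum of squares divided by the total sum of squares. The sketch below pairs that formula with the Cohen (1988) benchmarks listed above; the function names and the "negligible" label for values under .01 are illustrative assumptions.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Rough label using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below .01

# Hypothetical sums of squares from an ANOVA table.
effect = eta_squared(ss_between=24, ss_total=30)
print(effect, cohen_label(effect))
print(cohen_label(0.02))
```

An eta-squared of .02 falls in the small range even when the F-statistic is significant, so both figures need to be reported.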
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared
loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
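For readers who want to see where alpha comes from, here is a minimal sketch of the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores). The responses are hypothetical Likert ratings and `cronbach_alpha` is an illustrative name.

```python
import statistics

def cronbach_alpha(responses):
    """responses: one row per participant, one column per item."""
    k = len(responses[0])
    item_vars = [statistics.variance(col) for col in zip(*responses)]
    total_var = statistics.variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical answers from five participants to three items on one factor.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Alpha rises when items move together across participants, which is why it speaks to internal consistency and not to validity.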
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
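The squaring step above is easy to check. This small sketch reproduces it for the .63 loading and for a weaker loading of .33; the function name is illustrative.

```python
def shared_variance_percent(loading):
    """Squared factor loading, expressed as a percentage."""
    return loading ** 2 * 100

for loading in (0.63, 0.33):
    print(f"loading {loading}: {shared_variance_percent(loading):.0f}% shared variance")
```

A loading just above a .3 cut-off therefore shares only about a tenth of its variance with the factor.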
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
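The participants-per-item rule of thumb can be written as a one-line calculation. This is only a sketch of the 5-10 range quoted above, with an illustrative function name.

```python
def recommended_n(n_items, low_ratio=5, high_ratio=10):
    """Range of participants suggested by a per-item rule of thumb."""
    return n_items * low_ratio, n_items * high_ratio

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```

The result for a 30-item questionnaire matches the 150-300 range in the example.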
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 54
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using 5-point Likert scales (1-5)
Results are analyzed with an ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Figure: published example showing a descriptive statistics table and an ANOVA statistics table]
F-statistic is significant
…i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results:
These are done if the F-statistic is significant
Pairwise comparisons of the group means
Tell us where the significant differences lie
Often reported in the text (though sometimes in table form)
Effect size:
Often referred to as ‘eta-squared’ or ‘strength of association’
Indicates the magnitude of the difference between means
Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
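A short sketch of how eta-squared could be computed and labelled follows. It assumes the usual definition (between-groups sum of squares divided by total sum of squares) and uses the .01 / .06 / .14 guidelines quoted above; the sums of squares are invented and the function names are hypothetical.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def describe_effect(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 guidelines on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed here; the slide gives no name below .01


# Hypothetical sums of squares from an ANOVA table.
effect = eta_squared(ss_between=15.0, ss_total=200.0)
print(round(effect, 3), describe_effect(effect))
```

Both inputs can be read straight off a standard ANOVA table, which is one reason reviewers can ask for effect size even when the authors did not report it.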
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (all else equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
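For readers who want to see what goes into Cronbach's alpha, the sketch below uses the standard formula (number of items, the variance of each item, and the variance of participants' total scores). The responses are invented and the function name is hypothetical; statistics packages report this figure directly.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per participant, one column per questionnaire item."""
    n_items = len(responses[0])
    item_columns = list(zip(*responses))
    item_variances = [variance(column) for column in item_columns]
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 5-point Likert answers: 5 participants x 3 items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
print(round(cronbach_alpha(responses), 3))
```

Because the number of items appears in the formula, a long questionnaire can produce a high alpha even when individual items are only weakly related, which is worth remembering when reading reliability figures.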
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
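The participants-per-item guideline is simple multiplication, sketched here with a hypothetical helper so it can be applied to a questionnaire of any length.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of participants implied by the 5-10 participants-per-item guideline."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)  # the slide's 30-item example
print(low, high)
```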
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
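The shared-variance figures quoted for Item 38 here and for Item 43 earlier come from squaring the loading. A small sketch, with a hypothetical function name, reproduces both and checks each loading against the .4 guideline mentioned in the slides.

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance an item shares with the factor."""
    return loading ** 2


for item, loading in (("Item 43", 0.63), ("Item 38", 0.33)):
    share = shared_variance(loading)
    meets_cutoff = loading > 0.4  # the .4 guideline attributed to Stevens (1992) in the slides
    print(f"{item}: loading {loading:.2f}, shared variance {share:.0%}, above .4: {meets_cutoff}")
```

Seen this way, a loading of .33 means that roughly nine-tenths of the item's variance has nothing to do with the factor it is assigned to.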
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Two aspects of group behavior
Mean - central tendency
Standard Deviation - variability from the mean
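As a quick illustration of these two figures, the snippet below computes them for a small set of invented scores using Python's standard library. It uses the sample standard deviation (dividing by N - 1), which is an assumption about what a given study reports.

```python
from statistics import mean, stdev

# Hypothetical test scores for one group of five learners.
scores = [62, 70, 75, 81, 87]

group_mean = mean(scores)   # central tendency
group_sd = stdev(scores)    # sample standard deviation: spread around the mean

print(f"M = {group_mean:.2f}, SD = {group_sd:.2f}")
```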
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
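These two departures from normality are usually reported as skewness and kurtosis. The sketch below computes simple population-moment versions for a set of invented scores; the function name is hypothetical, and packages such as SPSS apply small-sample corrections, so their output will differ slightly.

```python
from statistics import mean


def skewness_and_kurtosis(scores):
    """Return (skewness, excess kurtosis) using population moments."""
    m = mean(scores)
    n = len(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n
    m3 = sum((x - m) ** 3 for x in scores) / n
    m4 = sum((x - m) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5      # 0 for a symmetrical distribution
    kurtosis = m4 / m2 ** 2 - 3    # 0 for a normal curve; negative = flat, positive = peaked
    return skewness, kurtosis


# Hypothetical scores, perfectly symmetrical around 3.
skew, kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
print(round(skew, 2), round(kurt, 2))
```

Evenly spread scores like these are symmetrical (skewness of zero) but flatter than a normal curve, so the kurtosis value comes out negative.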
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: Random or Convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p. 206)
a) for two different groups, or
b) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
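A minimal sketch of the calculation behind a paired samples t-test follows: the test works on each participant's difference score. The before/after scores are invented and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(time1, time2):
    """Return (t, df) for the same participants measured twice."""
    differences = [after - before for before, after in zip(time1, time2)]
    n = len(differences)
    t_stat = mean(differences) / (stdev(differences) / sqrt(n))
    return t_stat, n - 1


# Hypothetical vocab scores before and after extensive reading.
before = [10, 12, 9, 14, 11]
after = [13, 14, 10, 17, 12]
t_stat, df = paired_t(before, after)
print(f"t({df}) = {t_stat:.2f}")
```

Degrees of freedom here are the number of participants minus one, which is a quick way to check that a published paired design has been reported consistently.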
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
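For two separate groups, the calculation compares the difference in means with a pooled estimate of variability. The sketch below assumes equal variances in the two groups (the classic pooled-variance form); the scores are invented and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Return (t, df) for two separate groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical grammar test scores.
implicit = [12, 14, 11, 15, 13]
explicit = [16, 18, 15, 19, 17]
t_stat, df = independent_t(implicit, explicit)
print(f"t({df}) = {t_stat:.2f}")
```

The sign of t only reflects which group was entered first; the size of the value is what matters for significance.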
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area: t-test result
Reading: t = 4.91, df = 114, p = .001*
Speaking: t = 2.28, df = 114, p = .024
Writing: t = 2.30, df = 114, p = .023
Listening: t = 4.12, df = 114, p = .001*
Spelling: t = 3.74, df = 114, p = .001*
General learning: t = 3.61, df = 114, p = .001*
Homework: t = 2.92, df = 114, p = .004*
Textbook: t = 3.01, df = 114, p = .005*
*p < .006
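Two details of these results can be checked with simple arithmetic. The degrees of freedom follow from the group sizes given for the study (62 and 54), and the starred threshold appears consistent with a Bonferroni adjustment for eight comparisons (.05 / 8 = .00625). The slides do not state that this is how the threshold was derived, so treat that link as an assumption.

```python
# Values copied from the slides; the Bonferroni link is an assumption (see above).
n_treatment, n_control = 62, 54
df = n_treatment + n_control - 2          # independent samples: N1 + N2 - 2

n_comparisons = 8                         # eight attitude areas
adjusted_alpha = 0.05 / n_comparisons     # 0.00625, shown on the slide as p < .006

p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}
significant = [area for area, p in p_values.items() if p < adjusted_alpha]

print(df, adjusted_alpha)
print(significant)
```

Speaking and Writing would count as significant at an unadjusted .05 level, which is exactly the kind of result the adjustment is meant to screen out.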
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Slide 57
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
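As a small illustration of these two measures, the sketch below computes a mean and a sample standard deviation with Python's standard library. The scores are invented for illustration and do not come from any study cited in this talk.

```python
import statistics

# Hypothetical test scores for one class (invented for illustration).
scores = [62, 70, 74, 75, 78, 80, 81, 85, 90, 95]

mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample SD (n - 1 in the denominator)

print(f"N = {len(scores)}, M = {mean:.2f}, SD = {sd:.2f}")
```

Reporting N, M, and SD together is what lets a reader draw the "mental picture" of the group.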
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual within a group
or
Comparison of a group with other groups
How normal?
Not symmetrical (skewed)
Flat or peaked (kurtosis)
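One rough way to put numbers on "not symmetrical" and "flat or peaked" is to compute skewness and excess kurtosis. The sketch below uses the simple population (moment) formulas; statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and data are illustrative assumptions.

```python
import statistics

def skewness(data):
    """Moment-based skewness: 0 for a symmetrical distribution."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Moment-based excess kurtosis: negative = flatter, positive = more peaked than normal."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3

# Hypothetical scores with a long right tail (invented for illustration).
right_skewed = [1, 1, 1, 2, 10]
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

Values far from zero suggest the data may not suit procedures that assume a normal distribution.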
Issues
Are the data appropriate for further statistical analyses?
Mean and SD
Participants: N size and sampling
Sampling: random or convenience
To conclude
Mean and Standard Deviation
Normal Distribution
These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
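The two designs above can be sketched in code. This is a minimal illustration that computes only the t statistic and degrees of freedom by hand; it assumes equal variances for the independent test, and the function names and scores are invented. In practice a statistics package would also supply the p value.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples t-test: one group measured twice."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # t statistic, degrees of freedom

def independent_t(group_a, group_b):
    """Independent samples t-test with a pooled variance (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, na + nb - 2

# Hypothetical vocab scores before and after extensive reading.
before = [10, 12, 9, 14, 11]
after = [12, 15, 10, 17, 13]
print(paired_t(before, after))

# Hypothetical test scores for implicit vs. explicit grammar groups.
implicit = [1, 2, 3, 4, 5]
explicit = [3, 4, 5, 6, 7]
print(independent_t(implicit, explicit))
```

Note how the degrees of freedom differ: N - 1 for the paired design, and the two group sizes minus 2 for the independent design.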
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler (2007)
Results of attitudes to French
Area              Result
Reading           t = 4.91, df = 114, p = .001*
Speaking          t = 2.28, df = 114, p = .024
Writing           t = 2.30, df = 114, p = .023
Listening         t = 4.12, df = 114, p = .001*
Spelling          t = 3.74, df = 114, p = .001*
General learning  t = 3.61, df = 114, p = .001*
Homework          t = 2.92, df = 114, p = .004*
Textbook          t = 3.01, df = 114, p = .005*
*p < .006
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
Alpha of .05 = a 5% risk of a false positive on each comparison when there is no real difference
20 comparisons = 1 significant result expected by chance
100 comparisons = 5 significant results expected by chance
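The arithmetic behind these figures can be sketched as follows. The expected count is simply alpha times the number of comparisons; the familywise rate (the chance of at least one false positive) additionally assumes the comparisons are independent, which is a simplification. Function names are illustrative.

```python
def expected_false_positives(alpha, comparisons):
    """Expected number of chance 'significant' results when no real differences exist."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for m in (1, 5, 20, 100):
    print(m, expected_false_positives(0.05, m), round(familywise_error_rate(0.05, m), 3))
```

With 20 comparisons at an alpha of .05, the chance of at least one spurious result is already well over half.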
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
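As a sketch of the adjustment in use, the code below applies it to the eight p values reported on the Macaro & Erler (2007) slide, assuming the eight areas and eight results pair up in the order listed. With eight comparisons the adjusted alpha is .05 ÷ 8 = .00625, which matches the slide's footnote of p < .006.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni-adjusted alpha level."""
    return alpha / comparisons

# p values as listed on the slide; the area-to-result pairing is assumed from the order.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004, "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"Adjusted alpha: {adjusted}")
print("Significant after adjustment:", significant)
```

Speaking and Writing would count as significant at an unadjusted .05 but not after the adjustment.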
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is given
Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
[Tables from a published study: descriptive statistics and ANOVA statistics]
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Paired comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
  .01: small effect
  .06: medium effect
  .14: large effect
*According to Cohen (1988)
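Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The code below is an illustration with invented data; the function names are assumptions, and the labels apply the Cohen (1988) benchmarks quoted above.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_size_label(eta_sq):
    """Label an eta-squared value using the benchmarks on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"

# Hypothetical scores for three groups (invented for illustration).
groups = [[18, 22, 20, 25, 21], [17, 19, 21, 18, 20], [24, 26, 23, 27, 25]]
value = eta_squared(groups)
print(round(value, 3), effect_size_label(value))
```

A result can reach statistical significance and still carry only a small effect size, which is why both are reported.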
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological variables and Likert-scale questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Factors: Integrative, Instrumental, Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
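Cronbach's alpha can be sketched from the item variances and the variance of the participants' total scores. The code below uses the standard formula, alpha = k/(k-1) × (1 − sum of item variances ÷ variance of totals); the function name and the Likert responses are invented for illustration.

```python
import statistics

def cronbach_alpha(responses):
    """responses: one row per participant, one column per item on the same scale."""
    k = len(responses[0])                     # number of items
    items = list(zip(*responses))             # transpose to one tuple per item
    item_variances = [statistics.variance(col) for col in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses from five participants to three Likert items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 4],
    [2, 3, 3],
    [4, 4, 5],
]
print(round(cronbach_alpha(responses), 3))
```

A high alpha shows only that the items move together; it says nothing about whether they measure the intended construct.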
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
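The participants-per-item rule of thumb is simple arithmetic, sketched below. The function name is an assumption; the 5 and 10 multipliers are the ones quoted from Field (2005) on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N-size range from the 5-10 participants per item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```

The same check can be run against the N reported in a published study to see whether its sample was plausibly large enough.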
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
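The shared-variance figures on these slides come from squaring the loading. A minimal sketch follows, using the loadings quoted above (.63, .33, .29); the function name is an assumption.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor (squared loading)."""
    return loading ** 2

for label, loading in [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33), ("Item 38 on F1", 0.29)]:
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Seen this way, a loading of .33 leaves nearly nine tenths of the item's variance unexplained by the factor.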
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 58
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Selfcompetence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
–
–
Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent
Cronbach’s Alpha
–
–
–
Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves
Determining factors, then items
Researchers should determine the factors
before adding items to the questionnaire
–
Previous research results
–
Carefully constructed model
Items should be designed to relate to a
particular concept (factor)
–
“Borrow” items or develop them in a pilot
–
6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
–
–
“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
–
T-tests
–
Dependent, independent, paired
One-way ANOVA
–
Mean and SD
F, effect size, post-hoc
Factor Analysis
–
Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 59
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants
ANOVA Example 1
[Figure: means M1, M2, and M3 plotted across the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)
ANOVA Example 2
[Figure: means M1, M2, and M3 plotted across the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less Type I error
than running multiple t-tests.
ANOVAs in Language Research
Often used to compare:
– Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
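For readers curious about what the software does with such results, here is a minimal sketch of the one-way ANOVA calculation. The test scores are invented for illustration and the function name is hypothetical; only the F-statistic and degrees of freedom are computed, not the p value.

```python
from statistics import mean


def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, and F for a one-way ANOVA."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    # Between-groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score sits from its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f,
    }


# Hypothetical vocab test scores for the three groups
word_cards = [11, 12, 13]
word_lists = [12, 13, 14]
pc_software = [16, 17, 18]

result = one_way_anova([word_cards, word_lists, pc_software])
print(result)
```

The values returned here are the ones that fill an ANOVA table: sums of squares, degrees of freedom, and F.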
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is significant …i.e. our model seems to work
F-statistic cont.
Reaching significance indicates there
are statistically significant differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and Effect Size
Post-hoc results
– These are done if the F-statistic is significant
– Pairwise comparisons of the group means
– Tell us where the significant differences lie
– Often reported in the text (though sometimes in table form)
Effect size
– Often referred to as ‘eta-squared’ or ‘strength of association’
– Indicates the magnitude of the difference between means
– Reflects the proportion of total variance accounted for by the treatments
– Benchmarks according to Cohen (1988): .01 small effect, .06 medium effect, .14 large effect
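Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below uses invented scores and hypothetical function names; the "negligible" label for values under .01 is an added assumption, since the benchmarks start at .01.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: the benchmarks do not name this range


# Hypothetical scores for three treatment groups
groups = [[11, 12, 13], [12, 13, 14], [16, 17, 18]]
effect = eta_squared(groups)
print(round(effect, 3), cohen_label(effect))
```

A significant F with an eta-squared of .02 would be labelled small on these benchmarks, which is why effect size is worth reporting alongside significance.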
Reporting Post-hoc Results and Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using
eta-squared, was .02.”
Common ANOVA Problems and Issues
Starting out with non-equivalent groups
Not reporting the type of ANOVA performed
Not reporting specific post-hoc results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
– Which groups were significantly different from each other?
No mention of effect size.
– What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures only one group or sample population
A “family” of FA
– PCA, FA, EFA, CFA…
FA: What it does
Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
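Cronbach's alpha can be computed from the item variances and the variance of the summed scale. The sketch below uses the standard formula with invented Likert responses; the function name and data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, respondents in the same order.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]
    sum_item_variances = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Hypothetical 1-5 Likert responses from five students on three items
item_1 = [4, 3, 5, 2, 4]
item_2 = [5, 3, 4, 2, 4]
item_3 = [4, 2, 5, 3, 5]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(round(alpha, 2))
```

Because k appears in the formula, adding items tends to raise alpha even when the items are only modestly correlated, so a high alpha on a long scale says less than it seems.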
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
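The squaring step can be written out as a one-line calculation. The function name is hypothetical; the .63 loading is the one quoted above.

```python
def shared_variance_percent(loading):
    """Square a factor loading to get the percentage of variance shared with the factor."""
    return loading ** 2 * 100


print(round(shared_variance_percent(0.63)))  # loading for Item 43
print(round(shared_variance_percent(0.40)))  # a loading right at the .4 cut-off
```

Squaring makes weak loadings shrink quickly: an item at the .4 cut-off shares only about 16% of its variance with the factor.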
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
3-4 factors
150-300 participants
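The participants-per-item rule of thumb is simple multiplication, sketched here with a hypothetical helper so it can be checked against any questionnaire length.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of participants implied by the 5-10 participants per item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(low, high)
```

For a 30-item questionnaire this gives the 150-300 range shown above; a 60-item instrument would already call for 300 to 600 participants.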
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
Slide 61
CUE Forum 2007
Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters
Purpose
Not how to use statistics in a study, but
rather…
To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
To cover common errors and issues
related to each procedure
Overview
Introduction
Descriptive Statistics – Harumi
T-tests – Philip
One-way ANOVA – Peter
Factor Analysis – Matthew
Q&A
Outline
Each presenter will introduce:
–
–
–
–
–
The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for
Descriptive Statistics
Harumi Kimura
Nanzan University
Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?
It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)
Q 2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS
Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed
Numerical representations of how each
group performed on the measures
Readers can draw a mental picture
Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures
Two aspects of group behavior
Mean
Central tendency
Standard Deviation
Variability from
mean
Normal Distribution
a normal curve
Just as in the natural world …
Position of an individual
Within a group
or
Comparison of a group
with other groups
How normal?
Not symmetrical
Flat
or
Peaked
Issues
Are the data appropriate
for further statistical analyses?
Mean
and SD
Participants
N
size
and sampling
Mean and SD
Sampling: Random or Convenience
N size
To conclude
Mean
and Standard Distribution
Normal Distribution
These
concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988
t-tests
Philip McNally
Osaka International University
Function: Comparing two means
A t-test will…
…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.
Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test
Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).
Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:
N = 62
N = 54
Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
Interpreting the data: Macaro & Erler
(2007)
Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006
t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*
Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.
Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.
Issues
• does the data meet normality assumptions?
• is the sample size large enough?
• is the data continuous?
• is Type I error controlled for?
One-way ANOVA
Peter Neff
Doshisha University
What it is
Function
–
ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants
ANOVA Example 1
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)
ANOVA Example 2
Control
Group
Treatment 1
Group
Treatment 2
Group
M2●
M3 ●
M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other
ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
However, t-tests work best when
limited to 2 groups.
ANOVAs can work with 3 or more
groups while introducing less error.
ANOVAs in Language Research
Often used to compare:
–
Assessment scores
– Survey responses
A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences
Vocabulary Testing Example
3 learning groups with equivalent starting
vocabulary range
–
–
–
Group I learns with word cards
Group II learns with word lists
Group III learns with PC software
After several weeks of study, a vocab test is
given
Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower
Peer Review Survey Example
3 learning groups in EFL writing courses
Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types
Reporting One-way ANOVA
Results
Three basic components:
–
1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size
The F-statistic
The higher the better (for the model)
A significant F-statistic (p < .05) is what
researchers look for
ANOVA in the Literature
Descriptive statistics
ANOVA statistics
F-statistic is
significant
…i.e. our model
seems to work
F-statistic cont.
Reaching significance indicates there
are statistically important differences
between some of the group means
But…the F-statistic doesn’t tell us
where the differences are
For that we turn to…
Post-hoc Results and
Effect Size
Post-hoc results
Effect size
These are done if the
F-statistic is significant
Paired comparisons of
the group means
Tell us where the
significant differences
lie
Often reported in the
text (though sometimes
in table form)
Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–
.01 – small effect
.06 – medium effect
.14 – large effect
*According to Cohen
(1988)
Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
“Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”
Common ANOVA
Problems and Issues
Starting
out with non-equivalent
groups
Not reporting the type of ANOVA
performed
Not reporting specific post-hoc
results
Not reporting effect size
Post-hoc results
ANOVA table
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”
Problems
No mention of the type of ANOVA
No mention of post-hoc results.
–
Which groups were significantly different from each other?
No mention of effect size.
–
What was the magnitude of the treatment effect?
Conclusion
One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
Often used for testing treatment effects or
comparing survey results
Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect
Factor Analysis
Matthew Apple
Doshisha University
FA: What it is
Measures
only one group or
sample population
A “family”
–
of FA
PCA, FA, EFA, CFA…
FA: What it does
Tests
the existence of underlying
(latent) constructs within a sample
population
–
Identifies patterns within large numbers
of participants
–
“Reduces” several items into a few
measurable factors
Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
Often a preliminary step before more
complicated statistical analyses
–
Correlational Analysis
– Multiple Regression
– Structural Equation Modeling
Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.
Integrative
Instrumental
Self-competence
Terminology
Factor - the latent construct
Variance - different answers to each
item (variable)
More Terminology!
Factor loading
– Correlation between an item and the factor; the squared loading is the proportion of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
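As a rough illustration of how Cronbach's alpha is computed, the sketch below applies the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), to a small set of hypothetical Likert responses. The data and the function name `cronbach_alpha` are assumptions for illustration, not taken from any study discussed here.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row per participant, one column per questionnaire item."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose rows into item columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical 5-point Likert responses: 5 participants x 3 items
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high alpha shows only that the items vary together; it says nothing about whether they measure the intended construct.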
Determining factors, then items
Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor
FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e., 40% of shared variance with the factor
(A loading above .4 is acceptable according to Stevens, 1992)
Assumptions of FA
Normal distribution
Items are correlated above .3
Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire
– 3-4 factors
– 150-300 participants
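The participants-per-item rule of thumb is simple arithmetic, sketched below. The function name `recommended_sample_range` is hypothetical; the 5-10 multiplier is the guideline the slide attributes to Field (2005).

```python
def recommended_sample_range(n_items, per_item_low=5, per_item_high=10):
    """Apply the participants-per-item rule of thumb; returns (minimum, maximum)."""
    return n_items * per_item_low, n_items * per_item_high

# The 30-item questionnaire from the example above
low, high = recommended_sample_range(30)
print(f"Recommended N for 30 items: {low}-{high}")
```

By this rule, a 30-item questionnaire calls for 150 to 300 participants, matching the example above.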
Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
N-size far too small
Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, i.e., 11% of shared variance with the factor
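One way to check reported loadings against a cutoff is sketched below. The table holds only the loadings quoted on these slides (Item 43 and Item 38); the dictionary layout, the function name `screen_items`, and the choice of .4 as the cutoff (the guideline attributed to Stevens, 1992) are assumptions for illustration.

```python
CUTOFF = 0.4  # assumed cutoff, following the .4 guideline cited on the slides

# Loadings quoted on the slides: item -> {factor: loading}
loadings = {
    "Item 43": {"F1": 0.63},
    "Item 38": {"F1": 0.29, "F3": 0.33},
}

def screen_items(loadings, cutoff=CUTOFF):
    """For each item, report its strongest loading and whether it meets the cutoff."""
    report = {}
    for item, by_factor in loadings.items():
        # Pick the factor on which the item loads most strongly (by absolute value)
        factor, loading = max(by_factor.items(), key=lambda kv: abs(kv[1]))
        report[item] = {
            "factor": factor,
            "loading": loading,
            "shared_variance": round(loading ** 2, 2),  # squared loading
            "meets_cutoff": abs(loading) >= cutoff,
        }
    return report

report = screen_items(loadings)
for item, row in report.items():
    print(item, row)
```

Item 38 falls below the cutoff on both factors, so it shares little variance with either one and is a candidate for removal rather than interpretation.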
Conclusions regarding FA
Horribly, horribly complicated
Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
Helps researchers draw conclusions from a
large number of items through data reduction
Requires a large N-size and several reference
books
Often written up with no regard to APA
guidelines or previous research results
To sum up…
Descriptive statistics
– Mean and SD
T-tests
– Dependent, independent, paired
One-way ANOVA
– F, effect size, post-hoc
Factor Analysis
– Factor, variance, factor loading
Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007