CUE Forum 2007 Basic SLA Statistics for the University Educator Peter Neff Harumi Kimura Philip McNally Matthew Apple © 2007 JALT CUE SIG and individual presenters


Slide 1

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…
• To help everyone better understand and interpret common statistical methods encountered in SLA studies
• To cover common errors and issues


Overview
• Introduction
• Descriptive Statistics – Harumi
• T-tests – Philip
• One-way ANOVA – Peter
• Factor Analysis – Matthew
• Q&A


Outline


Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
• Mean: central tendency
• Standard Deviation: variability from the mean
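The two measures can be sketched in a few lines of Python. This is only an illustration: the scores are hypothetical and do not come from the presentation, and the sample standard deviation (dividing by n - 1) is assumed.

```python
import statistics

# Hypothetical vocabulary test scores for one class (not from the slides).
scores = [12, 15, 15, 18, 20]

mean = statistics.mean(scores)  # central tendency
sd = statistics.stdev(scores)   # sample SD: variability around the mean (n - 1)

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

A small SD means most scores sit close to the mean; a large SD means they are spread widely around it.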

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
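As a rough sketch of what the two designs compute, the t statistic and degrees of freedom can be worked out with the Python standard library. All scores below are hypothetical, the function names are illustrative, and the independent-samples version assumes equal variances (pooled variance); looking up the p-value for t is left out.

```python
import math
import statistics

def paired_t(time1, time2):
    """t statistic and df for one group measured at two times."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and df for two separate groups."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * statistics.variance(group_a)
                  + (n2 - 1) * statistics.variance(group_b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n1 + n2 - 2

# Hypothetical vocab scores before and after extensive reading (one group).
t_paired, df_paired = paired_t([10, 12, 9, 14, 11], [13, 14, 10, 17, 13])

# Hypothetical test scores for implicit vs. explicit grammar groups.
t_ind, df_ind = independent_t([10, 12, 14, 16, 18], [14, 16, 18, 20, 22])

print(f"paired: t({df_paired}) = {t_paired:.2f}")
print(f"independent: t({df_ind}) = {t_ind:.2f}")
```

Note that the independent-samples df is n1 + n2 - 2, which is consistent with the df = 114 reported for the Macaro & Erler groups of 62 and 54.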

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance
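The arithmetic behind these figures can be sketched as follows. The function names are illustrative, and the familywise calculation assumes the comparisons are independent, which real studies may not satisfy.

```python
ALPHA = 0.05  # the conventional significance level used on the slide

def expected_false_positives(n_comparisons, alpha=ALPHA):
    """Average number of 'significant' results expected by chance alone."""
    return n_comparisons * alpha

def familywise_error_rate(n_comparisons, alpha=ALPHA):
    """Chance of at least one Type I error, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for m in (1, 20, 100):
    print(m, expected_false_positives(m), round(familywise_error_rate(m), 3))
```

With 20 comparisons, one false positive is expected on average, and the chance of getting at least one is roughly 64% under the independence assumption.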

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
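A minimal sketch of the adjustment, applied to the p-values shown on the Macaro & Erler slide. The pairing of each area with each p-value assumes the two lists on that slide run in the same order; the function name is illustrative.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the alpha level by the number of comparisons."""
    return alpha / n_comparisons

# The example on the slide: 0.05 / 5
example_alpha = bonferroni_alpha(0.05, 5)

# p-values from the Macaro & Erler (2007) slide (area order assumed).
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004,
    "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 comparisons
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print("still significant:", significant)
```

Eight comparisons give an adjusted alpha of .00625, in line with the *p < .006 note on the slide; Speaking and Writing would pass at .05 but not at the adjusted level.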

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: group means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups while introducing less error.


ANOVAs in Language Research


• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1~5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA
Results


Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic


• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for
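For readers curious where the F-statistic comes from, here is a minimal sketch of the one-way ANOVA calculation. The three groups of scores are hypothetical, the function name is illustrative, and the p-value lookup (normally done by software such as SPSS) is omitted.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores: word cards, word lists, PC software.
f, df_b, df_w = one_way_anova_f([[4, 5, 6], [5, 6, 7], [9, 10, 11]])
print(f"F({df_b}, {df_w}) = {f:.2f}")  # F(2, 6) = 21.00
```

F grows when the group means are far apart relative to the spread of scores inside each group, which is why a higher F favors the model.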

ANOVA in the Literature
[Figure: example from the literature with its descriptive statistics and ANOVA statistics labeled; callout: F-statistic is significant, i.e. our model seems to work]

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance attributable to the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
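Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The scores below are hypothetical, the function names are illustrative, and the "negligible" label for values under .01 is an assumption added here; the other cut-offs are the Cohen (1988) values listed above.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance attributable to group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the cut-offs from the slide."""
    if eta_sq >= .14:
        return "large"
    if eta_sq >= .06:
        return "medium"
    if eta_sq >= .01:
        return "small"
    return "negligible"  # assumed label; the slide gives no name below .01

# Hypothetical scores for three treatment groups.
eta = eta_squared([[4, 5, 6], [5, 6, 7], [9, 10, 11]])
print(f"eta-squared = {eta:.3f} ({effect_label(eta)})")
```

Under these cut-offs, an eta-squared of .02 counts as a small effect even when the F-statistic is significant.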

Reporting Post-hoc Results and
Effect Size
• “Post-hoc comparisons using the indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA
Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
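Cronbach's alpha can be sketched with the standard formula: k / (k - 1) multiplied by (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses below are hypothetical and the function name is illustrative.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's answer."""
    k = len(item_scores)
    sum_item_vars = sum(statistics.variance(item) for item in item_scores)
    # Total score per respondent across all items.
    totals = [sum(answers) for answers in zip(*item_scores)]
    total_var = statistics.variance(totals)
    return k / (k - 1) * (1 - sum_item_vars / total_var)

# Hypothetical answers from 4 respondents to 3 items meant to tap one factor.
items = [
    [1, 2, 3, 4],
    [2, 2, 4, 4],
    [1, 3, 3, 5],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high alpha only shows that the items vary together; as the slide notes, it says nothing about whether the items are valid measures of the intended construct.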

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
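The calculation on this slide is simply the loading squared. A short sketch, using the two loadings quoted in the presentation (Item 43 here and Item 38 on the later factor loading slide); the function name is illustrative.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2

item_43 = shared_variance(0.63)  # Item 43 on F1
item_38 = shared_variance(0.33)  # Item 38 on F3

print(f"Item 43: {item_43:.0%} shared variance")
print(f"Item 38: {item_38:.0%} shared variance")
```

Squaring makes the gap between loadings clearer: a .63 loading shares about 40% of its variance with the factor, while a .33 loading shares only about 11%.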

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
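The participants-per-item rule of thumb behind this example can be written as a one-line calculation. The function name and parameter names are illustrative; the 5-10 range is the figure attributed to Field (2005) above.

```python
def recommended_n(n_items, per_item_min=5, per_item_max=10):
    """Participant range implied by the 5-10 participants-per-item rule."""
    return n_items * per_item_min, n_items * per_item_max

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")  # 30 items: 150-300 participants
```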

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…


• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 2

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
 Tests

the existence of underlying
(latent) constructs within a sample
population
–

Identifies patterns within large numbers
of participants

–

“Reduces” several items into a few
measurable factors

Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
 Often a preliminary step before more
complicated statistical analyses


–

Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Selfcompetence

Terminology
Factor - the latent construct
 Variance - different answers to each
item (variable)


More Terminology!


Factor loading
–
–



Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
–
–
–

Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves

Determining factors, then items




Researchers should determine the factors
before adding items to the questionnaire
–

Previous research results

–

Carefully constructed model

Items should be designed to relate to a
particular concept (factor)
–

“Borrow” items or develop them in a pilot

–

6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
Normal distribution
 Items are correlated above .3
 Large N-size


–
–

“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
 Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
 Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
 N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



Horribly, horribly complicated

Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
 Helps researchers draw conclusions from a
large number of items through data reduction
 Requires a large N-size and several reference
books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 3

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…

Alpha level / No. of comparisons = 0.05 / 5 = 0.01

…and use this new figure as your alpha level.
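
A short sketch of the adjustment, applied to the eight p-values reported for Macaro & Erler (2007) in the attitudes table earlier in this talk. The function name bonferroni_alpha is hypothetical. With eight comparisons the adjusted alpha is .05 / 8 = .00625, which corresponds to the p < .006 note under that table.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the alpha level by the number of comparisons made."""
    return alpha / n_comparisons


# p-values as listed in the attitudes-to-French results
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

slide_example = bonferroni_alpha(0.05, 5)            # the 0.05 / 5 example
adjusted = bonferroni_alpha(0.05, len(p_values))     # eight comparisons
significant = [area for area, p in p_values.items() if p < adjusted]
```

Speaking and Writing would pass an unadjusted .05 level but fall short of the adjusted one, matching the rows left without an asterisk.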

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
• Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with means M1, M2, and M3 marked at similar levels]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with means M1, M2, and M3 marked; M1 sits apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups with less risk of Type I error than running multiple t-tests.


ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
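
A minimal sketch of the calculation behind a one-way ANOVA for a design like this one. The vocab scores are invented, and the function name one_way_anova is hypothetical. Only the standard library is used, so the sketch produces the sums of squares, degrees of freedom, and F, but not the p-value.

```python
import statistics


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a list of groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Between-groups: how far each group mean is from the grand mean
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Within-groups: how far each score is from its own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Invented vocab test scores for the three study methods
word_cards = [14, 16, 15, 17, 18]
word_lists = [12, 13, 15, 14, 11]
pc_software = [15, 14, 16, 13, 17]

result = one_way_anova([word_cards, word_lists, pc_software])
```

For these invented scores F(2, 12) comes out at about 4.67. The tabled critical value for F(2, 12) at alpha = .05 is about 3.89, so this made-up result would count as significant.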

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1~5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: published example showing a descriptive statistics table and an ANOVA statistics table]
F-statistic is significant …i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect

*According to Cohen (1988)
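
A small sketch of how eta-squared is obtained and read against the thresholds above. It uses the usual definition, the between-groups sum of squares divided by the total sum of squares. The function names and the sums of squares are invented for illustration, and the "negligible" label for values under .01 is an assumption added here, not part of the guidelines on the slide.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 thresholds."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below the smallest threshold


# Invented sums of squares from a hypothetical ANOVA table
example = eta_squared(ss_between=20.0, ss_total=100.0)   # 0.2
example_label = cohen_label(example)
```

A result can be statistically significant and still earn only a "small" label here, which is why effect size is reported alongside the F-statistic.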

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

[Figure: published example; labels: Post-hoc results, ANOVA table]

“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
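
A minimal sketch of the standard Cronbach's alpha formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are invented, and the function name cronbach_alpha is hypothetical.

```python
import statistics


def cronbach_alpha(items):
    """items: one list of scores per questionnaire item,
    with respondents in the same order in every list."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(item) for item in items)
    # Each respondent's total score across all items
    totals = [sum(scores) for scores in zip(*items)]
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)


# Invented 1-5 Likert responses from five respondents on three items
item1 = [1, 2, 3, 4, 5]
item2 = [2, 2, 3, 4, 5]
item3 = [1, 3, 3, 5, 4]
alpha = cronbach_alpha([item1, item2, item3])

# Two items answered identically by everyone give the maximum value
identical = cronbach_alpha([[1, 2, 3], [1, 2, 3]])
```

The value is high when respondents who score high on one item also score high on the others. That says the items hang together as a scale, not that they measure what the researcher intended.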

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
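
The arithmetic in this example can be sketched as follows. The function names are hypothetical, and the .4 default cut-off simply follows the figure attributed to Stevens (1992) on the slide.

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance an item shares with the factor."""
    return loading ** 2


def meets_cutoff(loading, cutoff=0.4):
    """True if the absolute loading reaches the chosen cut-off."""
    return abs(loading) >= cutoff


item_43 = shared_variance(0.63)            # 0.3969
item_43_percent = round(item_43 * 100)     # rounds to 40
item_43_ok = meets_cutoff(0.63)
```

Note that a loading sitting exactly at the .4 cut-off corresponds to only 16% shared variance, so the cut-off is a fairly lenient bar.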

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
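
The participants-per-item rule of thumb in this example works out as below. The function name and its defaults are hypothetical and only encode the 5-10 range quoted on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N-size range from a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)   # the 30-item questionnaire example
```

Even the upper end of this range only just meets the "over 300" guideline, so a longer questionnaire quickly pushes the required N-size beyond a single class or course.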

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 5

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - the spread of responses to each item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading gives the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
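
To make the Cronbach's alpha bullets more concrete, here is a minimal sketch using the standard textbook formula: alpha = k/(k-1) × (1 - sum of item variances / variance of participants' total scores), where k is the number of items. The responses are invented and the function name `cronbach_alpha` is hypothetical. Statistical packages such as SPSS report this value directly, so this is only meant to show what goes into the number.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of participants, each a list of item scores."""
    k = len(responses[0])                       # number of items on the scale
    items = list(zip(*responses))               # regroup the scores item by item
    item_vars = [variance(col) for col in items]            # sample variance per item
    total_var = variance([sum(row) for row in responses])   # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented 5-point Likert responses: 5 participants x 3 items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high alpha here only says the three invented items vary together. As the slide notes, it says nothing about whether the items are valid measures of the intended construct.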

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, or about 40% of variance shared with the factor
(A loading above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
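
The arithmetic behind the 30-item example can be sketched as follows. The function name `participants_needed` and its parameters are hypothetical; the 5-10 participants-per-item range is the rule of thumb cited above.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (low, high) sample-size range implied by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```

When reading a study, the same check can be run in reverse: divide the reported N by the number of questionnaire items to see whether the sample falls within the suggested range.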

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (e.g., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, or about 11% of variance shared with the factor
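
The squaring step used in these examples is simple enough to check for any loading reported in a paper. The sketch below applies it to the loadings quoted for Item 43 and Item 38. The function name `shared_variance` is hypothetical.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (the loading squared)."""
    return loading ** 2

# Loadings quoted in the two literature examples.
examples = [
    ("Item 43 on F1", 0.63),
    ("Item 38 on F3", 0.33),
    ("Item 38 on F1", 0.29),
]

for label, loading in examples:
    print(f"{label}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Seen this way, an item kept at a cut-off of .3 shares less than a tenth of its variance with the factor, which is why low, arbitrary cut-offs are flagged as a problem.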

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…


• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007



Slide 8

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
PATTERNS

Q3: What do descriptive statistics provide?
Snapshot description of the situation observed

Numerical representations of how each group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
• Mean: central tendency
• Standard deviation: variability from the mean
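As a rough illustration of these two measures, the sketch below computes a mean and a standard deviation with Python's standard library, then locates one score within the group in SD units. The scores are invented for illustration, and `z_score` is a hypothetical helper name, not something from the presentation.

```python
import statistics

# Hypothetical vocabulary test scores for one class (invented data).
scores = [12, 15, 15, 16, 18, 20, 22]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample standard deviation (divides by n - 1)

def z_score(x, mean, sd):
    """Position of one score relative to the group mean, in SD units."""
    return (x - mean) / sd

print(f"M = {mean:.2f}, SD = {sd:.2f}")
print(f"z for a score of 22: {z_score(22, mean, sd):.2f}")
```

A z-score is one common way to express the position of an individual within a group: it says how many standard deviations a score lies above or below the mean.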

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
• Not symmetrical (skewed)
• Flat or peaked (kurtosis)

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
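The two designs above can be sketched in code. This is a minimal illustration under stated assumptions, not the presenters' procedure: the scores are invented, the function names `paired_t` and `independent_t` are hypothetical, and the independent version assumes equal variances (the classic Student's t). It returns only the t statistic and degrees of freedom; a p-value would additionally need the t distribution, which statistics packages such as SPSS supply.

```python
import math
import statistics

def paired_t(before, after):
    """Paired samples t: one group measured at two times."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # t statistic, degrees of freedom

def independent_t(group_a, group_b):
    """Independent samples t (equal variances assumed): two separate groups."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, df

# Invented scores, for illustration only.
time1 = [10, 12, 9, 14, 11, 13]      # vocab test before extensive reading
time2 = [13, 14, 10, 17, 12, 16]     # same learners after extensive reading
implicit = [55, 60, 52, 58, 61]      # Group A test scores
explicit = [63, 66, 59, 70, 64]      # Group B test scores

print(paired_t(time1, time2))
print(independent_t(implicit, explicit))
```

The degrees of freedom differ by design: n - 1 for the paired case (one set of difference scores) and n1 + n2 - 2 for two independent groups.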

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French

Area              t     df    p
Reading           4.91  114   .001*
Speaking          2.28  114   .024
Writing           2.30  114   .023
Listening         4.12  114   .001*
Spelling          3.74  114   .001*
General learning  3.61  114   .001*
Homework          2.92  114   .004*
Textbook          3.01  114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
Alpha of .05 (95% level) = a 5% chance per comparison of a “significant” difference appearing by chance alone
20 comparisons = about 1 expected by chance
100 comparisons = about 5 expected by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
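The adjustment can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the familywise error formula assumes the comparisons are independent. The p-values are the eight reported for the Macaro & Erler (2007) attitude measures, where eight comparisons give an adjusted alpha of .05 / 8 = .00625, consistent with the *p < .006 threshold shown with those results.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted per-comparison alpha level."""
    return alpha / n_comparisons

def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across all comparisons,
    assuming the comparisons are independent."""
    return 1 - (1 - alpha) ** n_comparisons

# The worked example: 5 comparisons at alpha = .05.
print(bonferroni_alpha(0.05, 5))

# p-values reported for the eight attitude areas.
p_values = [0.001, 0.024, 0.023, 0.001, 0.001, 0.001, 0.004, 0.005]
adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [p < adjusted for p in p_values]
print(adjusted, significant)

# Without any adjustment, 20 comparisons carry a high risk of a Type I error.
print(round(familywise_error(0.05, 20), 2))
```

The trade-off is the one described under Types of error: a stricter alpha lowers the risk of a Type I error but raises the risk of a Type II error.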

Issues
• Do the data meet normality assumptions?
• Is the sample size large enough?
• Are the data continuous?
• Is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
• Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: three groups (Control, Treatment 1, Treatment 2) with means M1, M2, and M3 at similar levels]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: three groups (Control, Treatment 1, Treatment 2) with mean M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups
• However, t-tests work best when limited to 2 groups
• ANOVAs can work with 3 or more groups while introducing less Type I error than repeated t-tests


ANOVAs in Language Research


• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with an ANOVA to see if any groups scored significantly higher/lower
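To make the vocabulary example concrete, here is a minimal sketch of the F-statistic for three independent groups, written with the standard library only. The scores are invented, and `one_way_anova` is a hypothetical helper name; a statistics package would also report the p-value, which this sketch leaves out.

```python
import statistics

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for independent groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented vocab test scores for the three learning groups.
word_cards = [18, 20, 22, 19, 21]
word_lists = [15, 17, 16, 18, 14]
pc_software = [19, 21, 20, 23, 22]

f_stat, df_b, df_w = one_way_anova(word_cards, word_lists, pc_software)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

F is the ratio of between-group variance to within-group variance, so it grows as the group means spread apart relative to the spread of scores inside each group.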

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1–5) scales
• Results are analyzed with an ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: published tables of descriptive statistics and ANOVA statistics]
F-statistic is significant…i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
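Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below shows that calculation and applies the Cohen (1988) benchmarks listed above. The sums of squares are invented, the function names are hypothetical, and the "negligible" label for values under .01 is an assumption added for completeness rather than part of the benchmarks.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values under .01

# Invented sums of squares, for illustration only.
value = eta_squared(ss_between=70.0, ss_total=100.0)
print(value, cohen_label(value))

# An effect size of .02 counts as small even when the F-statistic is significant.
print(cohen_label(0.02))
```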

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

[Figure labels: Post-hoc results; ANOVA table]

“[This table] shows the result from running through an ANOVA by using SPSS. It can be seen that the difference among treatments is significant (p < 0.05). The scores for the Vocabulary condition were much higher than the other conditions. The Main Character condition was slightly higher than the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative (items 1 and 4)
Instrumental (items 2 and 5)
Self-competence (items 3 and 6)

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


• Factor loading
– Correlation between an item and the factor; its square is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
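For readers who want to see what sits behind a reported alpha, the sketch below computes Cronbach's alpha from raw item responses using the usual variance-based formula: alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The Likert responses are invented, and `cronbach_alpha` is a hypothetical helper name.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of responses per item, respondents in the same order."""
    k = len(items)
    # Total scale score for each respondent.
    totals = [sum(responses) for responses in zip(*items)]
    item_var_sum = sum(statistics.variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / statistics.variance(totals))

# Invented 5-point Likert responses from five participants on three items.
item1 = [4, 5, 3, 2, 4]
item2 = [4, 4, 3, 2, 5]
item3 = [5, 5, 2, 3, 4]

alpha = cronbach_alpha([item1, item2, item3])
print(f"alpha = {alpha:.2f}")
```

Because k appears in the formula, adding more items of similar quality tends to raise alpha, which is why a high alpha on a long scale says little about the validity of the items themselves.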

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 = 11% of shared variance with the factor
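The arithmetic behind these loading examples is simply the loading squared. The short sketch below reproduces it for the loadings quoted for Items 43 and 38 and checks each against the .4 benchmark attributed to Stevens (1992). `shared_variance` is a hypothetical helper name, and treating a loading of exactly .4 as meeting the benchmark is an assumption.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2

# Loadings quoted in the two literature examples.
loadings = {
    "Item 43 on F1": 0.63,
    "Item 38 on F3": 0.33,
    "Item 38 on F1": 0.29,
}
cutoff = 0.4  # benchmark attributed to Stevens (1992)

for label, loading in loadings.items():
    pct = shared_variance(loading) * 100
    verdict = "meets" if abs(loading) >= cutoff else "falls below"
    print(f"{label}: loading {loading:.2f}, {pct:.0f}% shared variance, "
          f"{verdict} the .4 benchmark")
```

Seen this way, Item 38 shares only about a tenth of its variance with its strongest factor and loads almost as highly on a second one, which is why it illustrates a typical loading problem.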

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 10

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using 5-point Likert scales (1-5)
• Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for
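
To make the F-statistic less of a black box, the sketch below computes it by hand as the ratio of between-groups variance to within-groups variance. It is a minimal illustration, not the presenters' procedure: the scores are invented, the function name is hypothetical, and the p-value (which needs the F distribution and is normally supplied by software such as SPSS) is deliberately left out.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, ss_between, ss_within) for a list of score lists."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Between-groups sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: spread of scores around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within


# Hypothetical vocabulary test scores (invented for illustration).
word_cards = [14, 16, 15, 17, 18]
word_lists = [12, 13, 15, 14, 11]
pc_software = [18, 19, 17, 20, 21]

f_stat, df_b, df_w, ss_b, ss_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With these invented scores the group means are 16, 13, and 19, which gives F(2, 12) = 18.00. The further apart the group means are relative to the spread inside each group, the larger F becomes.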

ANOVA in the Literature
[Figure: results tables from a published study, labeled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant, i.e., our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Pairwise comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
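
Eta-squared is the between-groups sum of squares divided by the total sum of squares, both of which appear in a standard ANOVA table. The sketch below is illustrative: the sums of squares are invented, the function names are hypothetical, and the "negligible" label for values under .01 is our own addition rather than part of Cohen's benchmarks.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance associated with group membership."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the benchmarks on the slide (Cohen, 1988)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: the slide gives no label below .01


# Hypothetical sums of squares (invented for illustration).
value = eta_squared(ss_between=12.0, ss_total=600.0)
print(round(value, 2), cohen_label(value))
```

A result can be statistically significant and still fall in the "small" band, which is why effect size should be reported alongside the F-statistic.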

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA.
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Factors: Integrative, Instrumental, Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha tends to be)
– Does not “prove” cause-effect or validity of items themselves
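
For readers who want to see what sits behind an alpha value, the sketch below computes it from raw responses with the standard variance-based formula, then uses the standardized form to show how alpha rises with the number of items. The responses are invented and the function names are hypothetical; this is an illustration, not the presenters' procedure.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, each a list with one score per participant."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    # Each participant's total score across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


def standardized_alpha(n_items, mean_inter_item_r):
    """Alpha implied by a given number of items and mean inter-item correlation."""
    return (n_items * mean_inter_item_r) / (1 + (n_items - 1) * mean_inter_item_r)


# Hypothetical 5-point Likert responses: 3 items x 6 participants (invented).
items = [
    [4, 5, 3, 2, 4, 5],
    [4, 4, 3, 2, 5, 5],
    [5, 5, 2, 1, 4, 4],
]
print(round(cronbach_alpha(items), 2))

# Same mean inter-item correlation, more items: alpha goes up.
for k in (4, 8, 16):
    print(k, round(standardized_alpha(k, 0.3), 2))
```

The second loop illustrates the caution on the slide: a high alpha can reflect a long scale as much as strongly related items.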

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
• Item 43 (“The more I study English, the more enjoyable I find it”)
• F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
• .630 loading
• .63 × .63 ≈ .40, or 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
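
The arithmetic on this slide generalizes to any loading: square it to get the proportion of shared variance. The sketch below is illustrative; the function names are hypothetical, and treating a loading of exactly .4 as acceptable is our assumption, since the slide only says "above .4".

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (the squared loading)."""
    return loading ** 2


def meets_cutoff(loading, cutoff=0.4):
    """Check a loading against a cut-off; .4 follows the Stevens (1992) figure on the slide."""
    return abs(loading) >= cutoff  # assumption: exactly .4 counts as acceptable


for loading in (0.63, 0.40, 0.33):
    print(loading, round(shared_variance(loading), 2), meets_cutoff(loading))
```

Squaring makes weak loadings look as weak as they are: a loading of .33 corresponds to only about 11% shared variance.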

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
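
The two rules of thumb can be checked against a planned sample with a few lines of code. This is a rough sketch: the function names and the sample of 180 participants are invented, and the thresholds are simply the figures quoted on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (low, high) participant range under a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high


def meets_guidelines(n_participants, n_items, absolute_minimum=300):
    """Report which of the two rules of thumb on the slide a sample satisfies."""
    low, _ = participants_needed(n_items)
    return {
        "per_item_rule": n_participants >= low,
        "over_300_rule": n_participants > absolute_minimum,
    }


# The slide's example: a 30-item questionnaire.
print(participants_needed(30))

# Hypothetical study with 180 participants answering 30 items.
print(meets_guidelines(n_participants=180, n_items=30))
```

A sample can satisfy the per-item rule and still fall short of the stricter "over 300" guideline, so it helps to state which rule a study is following.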

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues
• Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
• F3, .33 loading; F1, .29 loading
• .33 × .33 ≈ .11, or 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 13

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose
Not how to use statistics in a study, but rather…
• To help everyone better understand and interpret common statistical methods encountered in SLA studies
• To cover common errors and issues related to each procedure


Overview
• Introduction
• Descriptive Statistics – Harumi
• T-tests – Philip
• One-way ANOVA – Peter
• Factor Analysis – Matthew
• Q&A


Outline
• Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics provide?
Snapshot description of the situation observed

Numerical representations of how each group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
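As a minimal sketch of these two summary figures, the snippet below computes a mean and a standard deviation with Python's standard library. The scores are hypothetical and not taken from the presentation; `stdev` is the sample standard deviation (n - 1 in the denominator), which is an assumption about which formula a given study used.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class (illustrative only).
scores = [12, 15, 15, 18, 20]

m = mean(scores)    # central tendency
sd = stdev(scores)  # sample standard deviation (divides by n - 1)

print(f"N = {len(scores)}, M = {m:.2f}, SD = {sd:.2f}")
```

A study's descriptive table would normally report these three values (N, M, SD) for every group.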

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked
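"Not symmetrical" and "flat or peaked" are usually quantified as skewness and kurtosis. The sketch below uses simple moment-based formulas on hypothetical data; statistical packages often apply small-sample corrections, so their values may differ slightly from these.

```python
from statistics import fmean

def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    m = fmean(data)
    return fmean([(x - m) ** order for x in data])

def skewness(data):
    """0 for a symmetrical distribution; > 0 means a longer right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """0 for a normal curve; < 0 is flatter, > 0 is more peaked."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

# Hypothetical score sets (illustrative only).
symmetrical = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(f"symmetrical:  skew = {skewness(symmetrical):.2f}, "
      f"kurtosis = {excess_kurtosis(symmetrical):.2f}")
print(f"right-skewed: skew = {skewness(right_skewed):.2f}")
```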

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
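A rough sketch of how the paired design turns into a t value: each learner's Time 2 score is compared with their own Time 1 score. The scores and the helper name `paired_t` are hypothetical. The p value would then be read from a t table or a statistics package, which is not shown here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(time1, time2):
    """Return (t, df) for a paired samples t-test."""
    if len(time1) != len(time2):
        raise ValueError("each participant needs a score at both times")
    diffs = [b - a for a, b in zip(time1, time2)]  # per-learner gain
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical vocab test scores for the same five learners.
before = [10, 12, 9, 14, 11]   # Time 1: no extensive reading
after = [13, 14, 10, 17, 13]   # Time 2: after extensive reading

t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```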

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
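For the two-group design, a sketch of the classic (pooled-variance) independent samples t value follows. It assumes the two groups have roughly equal variances, and the scores and the helper name `independent_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test (equal variances assumed)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))  # standard error of the difference
    return (mean(group_a) - mean(group_b)) / se, df

# Hypothetical grammar test scores.
implicit = [14, 15, 13, 16, 12]   # Group A
explicit = [17, 18, 16, 19, 15]   # Group B

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

The sign of t only reflects which group was entered first; the size of t and the degrees of freedom are what get looked up for significance.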

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
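To read the asterisks in these results, each p value is compared with the stricter cut-off of .006 rather than the usual .05. The cut-off is presumably .05 divided by the eight comparisons (.00625), though the slide does not say so. The degrees of freedom also match the group sizes (62 + 54 - 2 = 114). A small sketch, assuming each area pairs with a result row in the order listed:

```python
# p values as reported on the slide; pairing each area with a row in order
# is an assumption made when reading the table.
reported_p = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023,
    "Listening": 0.001, "Spelling": 0.001, "General learning": 0.001,
    "Homework": 0.004, "Textbook": 0.005,
}
threshold = 0.006  # the slide's cut-off (*p < .006)

significant = [area for area, p in reported_p.items() if p < threshold]
not_significant = [area for area in reported_p if area not in significant]

print("Significant:", ", ".join(significant))
print("Not significant at .006:", ", ".join(not_significant))
```

Speaking and Writing would count as significant at .05 but not once the stricter threshold is applied.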

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
Alpha of .05 = a 5% risk that a comparison comes out significant by chance alone
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
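The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the second formula assumes the comparisons are independent of each other, which real questionnaire data rarely satisfy exactly.

```python
def expected_false_positives(alpha, n_comparisons):
    """Average number of chance 'significant' results when no real effects exist."""
    return alpha * n_comparisons

def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 5, 20, 100):
    print(f"{n:3d} comparisons: "
          f"expected false positives = {expected_false_positives(0.05, n):.2f}, "
          f"chance of at least one = {familywise_error_rate(0.05, n):.2f}")
```

With 20 comparisons at .05, the chance of at least one spurious result is already well over half.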

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
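The adjustment itself is a single division. A minimal sketch, with a hypothetical helper name:

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level for a given number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / n_comparisons

# The slide's example: 5 comparisons at an overall alpha of .05.
print(f"{bonferroni_alpha(0.05, 5):.3f}")

# How quickly the threshold tightens as comparisons are added.
for n in (2, 5, 8, 10):
    print(f"{n:2d} comparisons: alpha = {bonferroni_alpha(0.05, n):.4f}")
```

The stricter threshold lowers the risk of Type I error at the cost of a higher risk of Type II error.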

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
• Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: group means (M1, M2, M3) plotted for the Control Group, Treatment 1 Group, and Treatment 2 Group]
Similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: group means (M1, M2, M3) plotted for the Control Group, Treatment 1 Group, and Treatment 2 Group]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups while introducing less error.


ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
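To make the vocabulary example concrete, here is a sketch of how the F-statistic is built from between-group and within-group variability. The scores are hypothetical and deliberately tiny, and `one_way_anova` is an illustrative helper rather than any package's API. Judging significance would still require an F table or a statistics package.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores for the three learning groups.
word_cards = [18, 20, 22]    # Group I
word_lists = [14, 16, 18]    # Group II
pc_software = [16, 18, 20]   # Group III

f, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A larger F means the group means differ more than the spread inside the groups would lead one to expect.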

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is significant, i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically important differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of the total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
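Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it for hypothetical scores and attaches the Cohen (1988) labels listed above. The "negligible" label for values under .01 and both helper names are additions for illustration.

```python
from statistics import mean

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Rough label using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label not from the slide

# Hypothetical scores for three treatment groups.
groups = [[18, 20, 22], [14, 16, 18], [16, 18, 20]]
eta = eta_squared(groups)
print(f"eta-squared = {eta:.2f} ({effect_label(eta)})")
```

A result can reach statistical significance and still carry a small effect size, which is why both are reported.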

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!
• Factor loading
– Indicates the amount of shared variance between an item and the factor (the loading squared gives the proportion)
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
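For readers who want to see where alpha comes from, the sketch below applies the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), to hypothetical Likert responses. The data and the helper name `cronbach_alpha` are illustrative only.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row per participant, one column per item."""
    k = len(responses[0])                 # number of items in the scale
    items = list(zip(*responses))         # regroup scores item by item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical 5-point Likert answers: 5 participants x 3 items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 2, 3],
    [4, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A high alpha shows only that the items vary together. It says nothing about whether they measure the intended construct.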

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
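The calculation above is simply the loading multiplied by itself. A one-function sketch, with a hypothetical helper name:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor."""
    return loading ** 2

loading = 0.63  # Item 43 on F1, as reported above
print(f"{loading:.2f} squared = {shared_variance(loading):.4f} "
      f"(about {shared_variance(loading):.0%})")
```

Because the loading is squared, shared variance falls off quickly as loadings get smaller.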

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
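The 150-300 range follows directly from the 5-10 participants-per-item rule of thumb. A small sketch, in which the function name and its default arguments are illustrative:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rule-of-thumb sample size range for a questionnaire with n_items items."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```

This is only a rule of thumb. The "over 300" guideline quoted above is stricter for shorter questionnaires.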

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 15

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
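
A minimal sketch of the arithmetic behind a paired samples t-test, assuming each participant has a score at both times. The scores and the function name paired_t are hypothetical. The code returns only t and the degrees of freedom; the p-value would normally come from statistical software or a t table.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a paired samples t-test on matched score lists."""
    if len(before) != len(after):
        raise ValueError("each participant needs a score at both times")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference.
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se, n - 1


# Hypothetical vocab test scores for the same five learners.
time1 = [10, 12, 9, 14, 11]   # before extensive reading
time2 = [12, 15, 10, 17, 12]  # after extensive reading

t, df = paired_t(time1, time2)
print(f"t = {t:.2f}, df = {df}")
```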

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
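
For two separate groups, the calculation can be sketched as below. This version assumes equal variances (the pooled-variance form); the scores and the function name independent_t are hypothetical. Note that df = n1 + n2 - 2, so groups of 62 and 54 would give df = 114.

```python
import math
import statistics


def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test,
    using the pooled variance (equal variances assumed)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, df


# Hypothetical test scores for two different classes.
implicit = [14, 15, 16, 17, 18]
explicit = [10, 11, 12, 13, 14]

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```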

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance (alpha = .05) = a 5% risk per comparison of a “significant” difference arising by chance
20 comparisons = about 1 by chance
100 comparisons = about 5 by chance
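
The figures above follow from simple arithmetic, sketched below. The second function goes one step further and gives the chance of at least one false positive across a set of comparisons; it assumes the comparisons are independent, which real data may not satisfy. Both function names are illustrative.

```python
def expected_false_positives(alpha, comparisons):
    """Expected number of 'significant' results when no real difference exists."""
    return alpha * comparisons


def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


for k in (1, 5, 20, 100):
    print(k,
          round(expected_false_positives(0.05, k), 2),
          round(familywise_error_rate(0.05, k), 3))
```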

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
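
The adjustment can be sketched in a few lines. The p-values below are those reported on the Macaro & Erler (2007) slide earlier; pairing each p-value with an area by order of appearance is an assumption, since the slide's table layout did not survive. With 8 comparisons the adjusted alpha is .05 / 8 = .00625, which is in line with the slide's *p < .006.

```python
def bonferroni_alpha(alpha, comparisons):
    """Return the Bonferroni-adjusted alpha level."""
    return alpha / comparisons


# p-values from the attitudes table (area-to-value pairing assumed from order).
p_values = {
    "Reading": .001,
    "Speaking": .024,
    "Writing": .023,
    "Listening": .001,
    "Spelling": .001,
    "General learning": .001,
    "Homework": .004,
    "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print("significant after adjustment:", significant)
```

Speaking and Writing would pass an unadjusted .05 threshold but not the adjusted one, which is the trade-off between Type I and Type II error described above.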

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with group means M1, M2, and M3 plotted at similar levels]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with M1 plotted apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant
mean differences between groups.
• However, t-tests work best when
limited to 2 groups.
• ANOVAs can work with 3 or more
groups while introducing less risk of Type I error
than running several t-tests.


ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is
given
• Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
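
To show what the analysis actually computes, here is a minimal sketch of a one-way ANOVA for three groups. The scores are hypothetical and the function name one_way_anova is illustrative. The code returns F and its degrees of freedom only; the p-value would come from statistical software or an F table.

```python
import statistics


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical vocab test scores (not from the presentation).
word_cards = [20, 22, 24]
word_lists = [14, 16, 18]
pc_software = [17, 19, 21]

f, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A large F means the group means differ by more than the spread inside each group would lead one to expect.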

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
• Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results
Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
[Table excerpt from a published study, labeled: Descriptive statistics; ANOVA statistics]
F-statistic is significant
…i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there
are statistically significant differences
between some of the group means
• But…the F-statistic doesn’t tell us
where the differences are
• For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results
• These are done if the
F-statistic is significant
• Paired comparisons of
the group means
• Tell us where the
significant differences
lie
• Often reported in the
text (though sometimes
in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of
association’
• Indicates the magnitude
of the difference between
means
• Reflects the proportion of total variance
accounted for by the
treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen
(1988)
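
Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below computes it for hypothetical scores and attaches the labels listed above. Treating each figure as a lower bound, and the "negligible" label for values under .01, are assumptions made for this illustration.

```python
import statistics


def eta_squared(groups):
    """Return SS_between / SS_total for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 guideline."""
    if eta_sq >= .14:
        return "large"
    if eta_sq >= .06:
        return "medium"
    if eta_sq >= .01:
        return "small"
    return "negligible"  # label assumed for this sketch


# Hypothetical scores for three groups (not from the presentation).
groups = [[20, 22, 24], [14, 16, 18], [17, 19, 21]]
value = eta_squared(groups)
print(f"eta-squared = {value:.2f} ({cohen_label(value)})")
```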

Reporting Post-hoc Results and
Effect Size
• “Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA
Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA
performed
• Not reporting specific post-hoc
results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
• Often used for testing treatment effects or
comparing survey results
• Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
• Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or
sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors

Uses of FA within SLA
• Typically used with psychological
variables and Likert-scale
questionnaires
• Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each
item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared
loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7
are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
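
For readers who want to see where alpha comes from, here is a minimal sketch using the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses below are hypothetical and the function name cronbach_alpha is illustrative.

```python
import statistics


def cronbach_alpha(item_scores):
    """Compute Cronbach's alpha.

    item_scores: one list per item, each holding every participant's
    response in the same participant order.
    """
    k = len(item_scores)
    sum_item_variances = sum(statistics.variance(item) for item in item_scores)
    # Each participant's total score across all items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical responses from four participants to three Likert items.
items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]

alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}")
```

A high alpha here only says the three items move together; it says nothing about whether they measure what the researcher intended.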

Determining factors, then items
• Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
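
The shared-variance figure is just the loading multiplied by itself. The short sketch below repeats that arithmetic for the .63 loading above and for the .33 loading of Item 38 discussed on a later slide; the function name shared_variance is illustrative.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


# Loadings as quoted on the slides.
for label, loading in [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33)]:
    percent = round(shared_variance(loading) * 100)
    print(f"{label}: loading {loading} -> about {percent}% shared variance")
```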

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
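
The example follows directly from the 5-10 participants per item guideline. A small sketch of that calculation, with an illustrative function name and default values taken from the guideline:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a per-item guideline."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```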

Problems and issues with FA
• “Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
• Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
• Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
• Helps researchers draw conclusions from a
large number of items through data reduction
• Requires a large N-size and several reference
books
• Often written up with no regard to APA
guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 17

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of the total variance accounted for by the treatments
  – .01 – small effect
  – .06 – medium effect
  – .14 – large effect
  *According to Cohen (1988)
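
Eta-squared can be computed directly from the sums of squares in the ANOVA table as the between-groups sum of squares divided by the total. The sketch below applies the Cohen (1988) thresholds listed above. The function names are illustrative, and the "negligible" label for values under .01 is this sketch's own addition, not a category from the slides.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the thresholds on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed for this sketch only


# Toy sums of squares, invented for illustration only.
effect = eta_squared(ss_between=6.0, ss_total=12.0)
print(f"eta-squared = {effect:.2f} ({cohen_label(effect)} effect)")
```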

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the [post-hoc test] indicated that the mean score for Group 1 (M = 21.36, SD = 4.55) was significantly different from Group 3 (M = 22.96, SD = 4.49). Group 2 (M = 22.10, SD = 4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
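
The quoted report compares every pair of groups. The sketch below lists those pairs and applies a Bonferroni adjustment, the correction introduced in the t-test part of this talk. This is an assumption made for illustration: the quote does not say which post-hoc test was used, and `posthoc_plan` is a hypothetical helper.

```python
from itertools import combinations


def posthoc_plan(group_names, alpha=0.05):
    """Return every pair of groups and a Bonferroni-adjusted alpha."""
    pairs = list(combinations(group_names, 2))
    return pairs, alpha / len(pairs)


pairs, adjusted_alpha = posthoc_plan(["Group 1", "Group 2", "Group 3"])
for first, second in pairs:
    print(f"Compare {first} with {second} at alpha = {adjusted_alpha:.4f}")
```

Dedicated post-hoc procedures in statistical packages control the error rate in their own ways, so the adjusted alpha here is only one simple option.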


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table

“[This table] shows the result from running through an ANOVA by using SPSS. It can be seen that the difference among treatments is significant (p < 0.05). The scores for the Vocabulary condition were much higher than the other conditions. The Main Character condition was slightly higher than the Combined condition.”

Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
  – Which groups were significantly different from each other?
• No mention of effect size
  – What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
  – PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
  – Identifies patterns within large numbers of participants
  – “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
  – Correlational Analysis
  – Multiple Regression
  – Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


• Factor loading
  – Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
  – Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
  – Measurement of item-scale reliability
  – Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
  – Does not “prove” cause-effect or validity of items themselves
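
Cronbach's alpha can be computed from the item variances and the variance of each respondent's total score. The following is a minimal sketch using the standard formula. The responses are invented for illustration, and `cronbach_alpha` is a hypothetical helper rather than anything shown in the talk.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    item_scores holds one list per item; each list gives every
    respondent's score on that item, in the same respondent order.
    """
    k = len(item_scores)
    # Total scale score for each respondent.
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-5 Likert responses from five students to three items.
responses = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Because the item count k appears in the formula, adding items that correlate with the rest tends to raise alpha, which is why a high alpha alone says little about the validity of the items.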

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
  – Previous research results
  – Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
  – “Borrow” items or develop them in a pilot
  – 6-8 items for a robust factor

FA in the literature
• Item 43 (“The more I study English, the more enjoyable I find it”)
• F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
• .630 loading
• .63 × .63 ≈ .40, i.e. 40% of shared variance with the factor
• (Above .4 is acceptable according to Stevens, 1992)
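
The arithmetic behind the shared-variance figure is simply the loading squared. A short sketch, using the .63 loading above and the .33 loading of Item 38 quoted elsewhere in this talk (`shared_variance` is an illustrative name):

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


for item, loading in [("Item 43 on F1", 0.63), ("Item 38 on F3", 0.33)]:
    share = shared_variance(loading)
    print(f"{item}: loading {loading:.2f} gives {share:.0%} shared variance")
```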

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
  – “Over 300” (Tabachnick & Fidell, 2007)
  – 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
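
The participant range in the example follows from the 5-10 participants-per-item guideline. A small sketch of that calculation (`participants_needed` is a hypothetical helper, and the default ratios are simply the two ends of the quoted guideline):

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Lower and upper N suggested by a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items: roughly {low}-{high} participants")
```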

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (e.g., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

• Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
• F3, .33 loading; F1, .29 loading
• .33 × .33 ≈ .11, i.e. 11% of shared variance with the factor

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…


• Descriptive statistics
  – Mean and SD
• T-tests
  – Dependent, independent, paired
• One-way ANOVA
  – F, effect size, post-hoc
• Factor Analysis
  – Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007



CUE Forum 2007

Basic SLA Statistics for the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but rather…
• To help everyone better understand and interpret common statistical methods encountered in SLA studies
• To cover common errors and issues related to each procedure


Overview
• Introduction
• Descriptive Statistics – Harumi
• T-tests – Philip
• One-way ANOVA – Peter
• Factor Analysis – Matthew
• Q&A


Outline


Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
PATTERNS

Q3: What do descriptive statistics provide?
Snapshot description of the situation observed

Numerical representations of how each group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean: Central tendency
Standard Deviation: Variability from the mean
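
As a minimal sketch of these two descriptive statistics, the snippet below computes the mean and the sample standard deviation with Python's standard library. The scores are invented for illustration and do not come from any study cited in this talk.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class (invented for illustration).
scores = [62, 70, 75, 75, 78, 80, 84, 90]

m = mean(scores)     # central tendency
sd = stdev(scores)   # sample standard deviation (divides by n - 1)

print(f"N = {len(scores)}, M = {m:.2f}, SD = {sd:.2f}")
```

Reporting N, M, and SD together is what lets a reader form the "mental picture" of the group described earlier.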

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked
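
One common way to put numbers on "not symmetrical" and "flat or peaked" is skewness and excess kurtosis. The slides do not name these measures, so treating them as the intended ones is an assumption. The sketch uses the simple population (biased) formulas and invented data.

```python
from statistics import mean, pstdev

def skewness(xs):
    """Population skewness: 0 for a symmetrical set, > 0 for a long right tail."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Population excess kurtosis: < 0 is flatter than normal, > 0 is more peaked."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical, evenly spread scores
right_tailed = [1, 1, 2, 2, 3, 10]   # hypothetical, one very high score

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

Statistical packages often apply small-sample corrections, so their values may differ slightly from these.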

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
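
The paired design above can be sketched as follows. The scores are hypothetical, and the calculation is the textbook paired-samples formula (mean of the differences divided by its standard error), not output from any study mentioned here.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical vocab scores for the same 8 learners (invented).
time1 = [12, 15, 11, 18, 14, 16, 13, 17]   # before extensive reading
time2 = [14, 18, 12, 21, 15, 19, 15, 18]   # after extensive reading

# A paired test works on each learner's own change in score.
diffs = [after - before for before, after in zip(time1, time2)]
n = len(diffs)

t = mean(diffs) / (stdev(diffs) / sqrt(n))   # mean difference / standard error
df = n - 1

print(f"t = {t:.2f}, df = {df}")
```

The t value is then checked against a t table or statistical software to obtain the p value.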

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
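
A two-group comparison like this one could be sketched as below. The scores are hypothetical, and the formula is the classic pooled-variance (equal variances assumed) version, which is one of several variants a package may offer.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical test scores (invented for illustration).
group_a = [70, 65, 80, 75, 68, 72]   # implicit grammar
group_b = [78, 74, 85, 80, 79, 84]   # explicit grammar

n1, n2 = len(group_a), len(group_b)

# Pool the two sample variances, weighting each by its degrees of freedom.
pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)

t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(f"t = {t:.2f}, df = {df}")
```

With groups of 62 and 54, the same rule gives df = 62 + 54 - 2 = 114, the value reported in the Macaro & Erler (2007) results in this talk.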

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance (alpha = .05) = a 5% chance of a false positive on each comparison when no real difference exists
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
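
The same adjustment can be applied to the eight attitude comparisons reported for Macaro & Erler (2007) in this talk. This is a minimal sketch: the p values are the ones listed on that slide, pairing each with an area in the order listed is an assumption, and the familywise figure assumes the eight tests are independent.

```python
# p values as listed on the Macaro & Erler (2007) slide, paired with areas in order.
results = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

alpha = 0.05
adjusted_alpha = alpha / len(results)   # Bonferroni: 0.05 / 8 = 0.00625

# Keep only the areas whose p value beats the adjusted alpha.
significant = [area for area, p in results.items() if p < adjusted_alpha]

# Chance of at least one false positive across 8 unadjusted, independent tests.
familywise = 1 - (1 - alpha) ** len(results)

print(f"adjusted alpha = {adjusted_alpha}")
print("significant:", significant)
print(f"unadjusted familywise error = {familywise:.2f}")
```

Speaking and Writing would pass at the unadjusted .05 level but not at the adjusted one, which matches the asterisks on the slide.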

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups, plotted close together]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means for the Control, Treatment 1, and Treatment 2 groups, with M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups while introducing less error.


ANOVAs in Language Research


• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
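
The vocabulary example above could be analyzed roughly as follows. The scores are invented, and the code is a sketch of the standard one-way ANOVA arithmetic (between-group and within-group sums of squares), not output from SPSS or from any cited study.

```python
from statistics import mean

# Hypothetical vocab test scores for the three study methods (invented).
groups = {
    "word cards": [18, 20, 17, 22, 19],
    "word lists": [15, 14, 17, 16, 13],
    "PC software": [21, 23, 20, 24, 22],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = mean(all_scores)

# Variation of the group means around the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Variation of individual scores around their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would suggest. It still needs a p value from an F table or software, and it does not say which groups differ.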

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
[Tables from a published study: descriptive statistics and ANOVA statistics]
F-statistic is significant…i.e. our model seems to work

F-statistic cont.


Reaching significance indicates there are statistically significant differences between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
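
Eta-squared can be sketched as the between-groups sum of squares divided by the total sum of squares. The data below are hypothetical, and the labelling helper simply applies the Cohen (1988) benchmarks listed above. Treating each benchmark as a lower bound is an assumption made for illustration.

```python
from statistics import mean

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"

# Hypothetical scores for three treatment groups (invented).
scores = [
    [18, 20, 17, 22, 19],
    [15, 14, 17, 16, 13],
    [21, 23, 20, 24, 22],
]

eta = eta_squared(scores)
print(f"eta-squared = {eta:.3f} ({cohen_label(eta)})")
print(cohen_label(0.02))   # an effect size of .02 counts as small
```

A result can reach statistical significance and still have a small effect size, which is why both are reported.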

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


Factor loading
– Correlation between an item and the factor; squaring it gives the amount of shared variance between the item and the factor
– Factor loadings above .4 are desirable, above .7 are excellent

Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
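
To make the alpha points concrete, here is a minimal sketch using the usual variance-based formula, plus the standardized form that shows why adding items raises alpha. The Likert responses and the .3 average inter-item correlation are invented for illustration.

```python
from statistics import variance

# Hypothetical 1-5 Likert responses: rows are respondents, columns are items.
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
]

def cronbach_alpha(rows):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def standardized_alpha(k, mean_r):
    """Alpha implied by k items with an average inter-item correlation of mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")

# Same average correlation, more items: alpha goes up.
print(f"{standardized_alpha(4, 0.3):.2f} vs {standardized_alpha(8, 0.3):.2f}")
```

The second comparison is the reason a high alpha on a long scale is weak evidence by itself: it can reflect the number of items as much as their coherence.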

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
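
The arithmetic on this slide is simply the loading squared. As a small sketch, the helper below (a hypothetical name, not from any package) converts a loading to a rounded percentage of shared variance, using loadings mentioned in this talk.

```python
def shared_variance_pct(loading):
    """Percentage of variance an item shares with its factor (loading squared)."""
    return round(loading ** 2 * 100)

# Loadings taken from the slides: Item 43, the .4 and .7 guidelines, and Item 38.
for loading in (0.63, 0.4, 0.7, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance_pct(loading)}% shared variance")
```

Seen this way, a loading of .4 means the item shares only about 16% of its variance with the factor, which helps explain why lower cut-offs are questioned later in the talk.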

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
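
The 5-10 participants per item guideline can be turned into a quick planning check. This is only a sketch of that rule of thumb; the function name and defaults are illustrative, not part of any library.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, preferred) N implied by the participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```

For the 30-item example this reproduces the 150-300 range on the slide, and the upper end lines up with the "over 300" recommendation.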

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 21

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
 Tests

the existence of underlying
(latent) constructs within a sample
population
–

Identifies patterns within large numbers
of participants

–

“Reduces” several items into a few
measurable factors

Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
 Often a preliminary step before more
complicated statistical analyses


–

Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Selfcompetence

Terminology
Factor - the latent construct
 Variance - different answers to each
item (variable)


More Terminology!


Factor loading
–
–



Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
–
–
–

Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves

Determining factors, then items




Researchers should determine the factors
before adding items to the questionnaire
–

Previous research results

–

Carefully constructed model

Items should be designed to relate to a
particular concept (factor)
–

“Borrow” items or develop them in a pilot

–

6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
Normal distribution
 Items are correlated above .3
 Large N-size


–
–

“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
 Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
 Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
 N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



Horribly, horribly complicated

Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
 Helps researchers draw conclusions from a
large number of items through data reduction
 Requires a large N-size and several reference
books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 22

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               t      df     p
Reading            4.91   114    .001*
Speaking           2.28   114    .024
Writing            2.30   114    .023
Listening          4.12   114    .001*
Spelling           3.74   114    .001*
General learning   3.61   114    .001*
Homework           2.92   114    .004*
Textbook           3.01   114    .005*
*p < .006

Types of error
Type I error (false positive)
You think you’ve got a significant difference, but it arose by chance.
The risk grows when you make multiple comparisons without adjusting your alpha value.
Type II error (false negative)
You think the difference between the means was by chance, but it was real.
One common cause: you adjusted for multiple comparisons, and the stricter threshold kept the data from reaching significance.

Controlling for Type I error
Alpha of .05 = when there is no real difference, 5% of tests still come out significant by chance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
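
The arithmetic behind these figures can be sketched as follows. The function names are our own, and the familywise rate assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of chance 'significant' results when no real effects exist."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 20, 100):
    print(
        f"{k:>3} comparisons: "
        f"expected false positives = {expected_false_positives(0.05, k):.2f}, "
        f"chance of at least one = {familywise_error_rate(0.05, k):.2f}"
    )
```

With 20 comparisons the chance of at least one false positive is already well over half, which is why an adjustment is needed.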

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
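
A short sketch of the adjustment, applied to the eight p-values listed in the Macaro & Erler attitudes results. Each area is paired with the p-value in the same position on that slide, and the function name `bonferroni_alpha` is our own.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of tests."""
    return alpha / comparisons

# p-values as listed on the attitudes slide, paired with areas by position.
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"Adjusted alpha = {adjusted:.5f}")
print("Significant after adjustment:", ", ".join(significant))
```

Eight comparisons give an adjusted alpha of .00625, which matches the *p < .006 threshold on the slide. Speaking and Writing would pass at .05 but not after adjustment.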

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
• Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups (in practice, usually 3 or more)
– Why “one-way”? - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means (M1, M2, M3) of three groups (Control, Treatment 1, Treatment 2), all at a similar level]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means (M1, M2, M3) of three groups (Control, Treatment 1, Treatment 2), with M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups
• However, t-tests work best when limited to 2 groups
• ANOVAs can work with 3 or more groups in a single test, avoiding the extra Type I error that repeated t-tests would introduce


ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with a one-way ANOVA to see if any groups scored significantly higher/lower
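
To make the vocabulary example concrete, here is a sketch of how the F-statistic is built: the variance between the group means divided by the variance within the groups. The scores are hypothetical, the function name `one_way_anova` is our own, and the p-value lookup is left to a statistics package.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: how far each score is from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores (illustrative only).
cards = [14, 15, 16, 15, 15]     # Group I: word cards
lists = [11, 12, 13, 12, 12]     # Group II: word lists
software = [10, 12, 14, 12, 12]  # Group III: PC software

f, df_between, df_within = one_way_anova([cards, lists, software])
print(f"F({df_between}, {df_within}) = {f:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would lead us to expect.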

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using 5-point Likert scales (1-5)
• Results are analyzed with a one-way ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The ratio of between-group variance to within-group variance
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: published results tables, labelled “Descriptive statistics” and “ANOVA statistics”]
F-statistic is significant, i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size
Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)
Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of the total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
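
Eta-squared is the between-group sum of squares divided by the total sum of squares. The sketch below computes it for hypothetical scores and labels the result with the Cohen (1988) thresholds from the slide. The function names and the "negligible" label for values below .01 are our own assumptions.

```python
def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

def describe_effect(eta_sq):
    """Label an eta-squared value using the thresholds quoted on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label below .01 is our own, not from the slide

# Hypothetical scores for three groups (illustrative only).
groups = [[14, 15, 16, 15, 15], [11, 12, 13, 12, 12], [10, 12, 14, 12, 12]]

value = eta_squared(groups)
print(f"eta-squared = {value:.2f} ({describe_effect(value)} effect)")
```

By the same thresholds, an eta-squared of .02 counts as a small effect even when the F-statistic is significant.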

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

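FA: What it is
• Measures only one group or sample population
• A “family” of related procedures
– PCA (principal components analysis), FA (factor analysis), EFA (exploratory factor analysis), CFA (confirmatory factor analysis)…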

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative (items 1 and 4)
Instrumental (items 2 and 5)
Self-competence (items 3 and 6)

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
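
As a rough sketch of how Cronbach's alpha is obtained, the standard formula compares the sum of the individual item variances with the variance of respondents' total scores. The three items and five respondents below are hypothetical, and the function name `cronbach_alpha` is our own.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's answer."""
    k = len(item_scores)
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total score per respondent across all items.
    totals = [sum(answers) for answers in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum(item_variances) / statistics.variance(totals))

# Hypothetical 5-point Likert answers from five respondents (illustrative only).
item1 = [4, 5, 3, 4, 2]
item2 = [4, 4, 3, 5, 2]
item3 = [5, 4, 2, 4, 3]

alpha = cronbach_alpha([item1, item2, item3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

The k / (k - 1) term and the summed variances show why the number of items matters as well as how strongly the items correlate.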

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (e.g., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
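
The percentages on these slides come from squaring the loading. A minimal sketch using the two loadings quoted in the presentation (the function name `shared_variance` is our own):

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor: the loading squared."""
    return loading ** 2

# Loadings quoted on the slides for Item 43 (on F1) and Item 38 (on F3).
for item, loading in (("Item 43", 0.63), ("Item 38", 0.33)):
    print(f"{item}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes the gap between the two items plain: a .33 loading shares roughly a quarter as much variance with its factor as a .63 loading does.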

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 24

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!

Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent

Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
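
One common way to compute Cronbach's alpha uses the item variances and the variance of the total scores. The sketch below assumes that formula; the Likert responses and the function name `cronbach_alpha` are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per participant, one column per questionnaire item."""
    k = len(responses[0])  # number of items
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical answers from five participants to three Likert items
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

When every participant gives the same answer to all items, the items are perfectly consistent and alpha reaches 1.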

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
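
The arithmetic behind the example is simply the 5-10 participants-per-item rule of thumb applied to the number of items. A tiny sketch, with a hypothetical helper name:

```python
def recommended_n(num_items, per_item_low=5, per_item_high=10):
    """Return the (low, high) participant range for a questionnaire."""
    return num_items * per_item_low, num_items * per_item_high


low, high = recommended_n(30)
print(f"30 items -> {low}-{high} participants")
```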

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
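
The shared-variance figures quoted for Item 43 and Item 38 come from squaring the loading. A short sketch of that arithmetic (the function name is hypothetical):

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance shared with the factor."""
    return loading ** 2


for item, loading in [("Item 43", 0.63), ("Item 38", 0.33)]:
    share = shared_variance(loading)
    print(f"{item}: loading {loading:.2f} -> {share:.0%} shared variance")
```

Seen this way, a loading of .33 leaves roughly nine-tenths of the item's variance unrelated to the factor.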

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007


Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
• PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean
– Central tendency
Standard Deviation
– Variability from mean
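
For readers who want to see the two measures side by side, here is a small sketch using Python's standard `statistics` module. The test scores are hypothetical. Note that `stdev` is the sample standard deviation (it divides by n - 1); `pstdev` would give the population version.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class (illustration only)
scores = [62, 70, 75, 75, 80, 88]

m = mean(scores)    # central tendency
sd = stdev(scores)  # variability around the mean (sample SD, n - 1)
print(f"M = {m:.2f}, SD = {sd:.2f}")
```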

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical (skewness)

Flat
or
Peaked (kurtosis)
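
Departures from symmetry and from a normal peak are usually summarized as skewness and kurtosis. The sketch below uses the simple moment-based definitions, which is an assumption: statistics packages such as SPSS apply small-sample corrections and will report slightly different values. The data are hypothetical.

```python
from statistics import mean


def skewness(scores):
    """Moment-based skewness: 0 for symmetrical data, > 0 for a long right tail."""
    m = mean(scores)
    n = len(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n
    m3 = sum((x - m) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(scores):
    """Moment-based excess kurtosis: < 0 is flatter than normal, > 0 more peaked."""
    m = mean(scores)
    n = len(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n
    m4 = sum((x - m) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5]         # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, one very high score
print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```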

Issues
Are the data appropriate
for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
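
The two designs above can be sketched as follows. The scores are hypothetical, the function names are illustrative, and the independent-samples version assumes equal variances (the pooled, Student's form). Only the t value and degrees of freedom are computed; a statistics package would add the p-value.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """One group measured twice: t is based on each person's difference score."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1  # df = number of pairs - 1


def independent_t(group_a, group_b):
    """Two separate groups: t is based on the difference between group means."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2  # df = n_a + n_b - 2


# Hypothetical vocab scores before and after extensive reading
t1, df1 = paired_t([10, 12, 9, 14, 11], [13, 14, 10, 17, 13])
# Hypothetical grammar test scores for two groups
t2, df2 = independent_t([4, 5, 6], [8, 9, 10])
print(f"paired: t({df1}) = {t1:.2f}; independent: t({df2}) = {t2:.2f}")
```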

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area               Result
Reading            t = 4.91, df = 114, p = .001*
Speaking           t = 2.28, df = 114, p = .024
Writing            t = 2.30, df = 114, p = .023
Listening          t = 4.12, df = 114, p = .001*
Spelling           t = 3.74, df = 114, p = .001*
General learning   t = 3.61, df = 114, p = .001*
Homework           t = 2.92, df = 114, p = .004*
Textbook           t = 3.01, df = 114, p = .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
Alpha of .05 = a 5% risk of a false positive on each comparison when there is no real difference
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
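
A minimal sketch of the adjustment, applied to the p-values from the attitudes table. Dividing .05 by the eight areas gives .00625, which is consistent with the *p < .006 note on that slide, although the slides do not state how that threshold was chosen, so treat the link as an assumption. The function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level for each individual comparison."""
    return alpha / n_comparisons


def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


# p-values as reported on the attitudes-to-French slide
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004,
    "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]
print(f"adjusted alpha = {adjusted:.5f}")
print("still significant:", ", ".join(significant))
print(f"unadjusted risk over 20 comparisons = {familywise_error(0.05, 20):.2f}")
```

Speaking and Writing fall below .05 but not below the adjusted level, which matches the rows left without an asterisk.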

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups while introducing less risk of Type I error than running repeated t-tests.


ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with an ANOVA to see if any groups scored significantly higher/lower

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are analyzed with an ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA
Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size




t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
 Tests

the existence of underlying
(latent) constructs within a sample
population
–

Identifies patterns within large numbers
of participants

–

“Reduces” several items into a few
measurable factors

Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
 Often a preliminary step before more
complicated statistical analyses


–

Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Selfcompetence

Terminology
Factor - the latent construct
 Variance - different answers to each
item (variable)


More Terminology!


Factor loading
–
–



Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
–
–
–

Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves

Determining factors, then items




Researchers should determine the factors
before adding items to the questionnaire
–

Previous research results

–

Carefully constructed model

Items should be designed to relate to a
particular concept (factor)
–

“Borrow” items or develop them in a pilot

–

6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
Normal distribution
 Items are correlated above .3
 Large N-size


–
–

“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
 Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
 Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
 N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



Horribly, horribly complicated

Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
 Helps researchers draw conclusions from a
large number of items through data reduction
 Requires a large N-size and several reference
books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 29

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups while introducing less error than running a separate t-test for each pair of groups.


ANOVAs in Language Research
• Often used to compare:
  – Assessment scores
  – Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
  – Group I learns with word cards
  – Group II learns with word lists
  – Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are ANOVA analyzed to see if any groups scored significantly higher/lower

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
  – Group I – Written peer review
  – Group II – Oral peer review
  – Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1~5) scales
• Results are ANOVA analyzed for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
  – 1) Table of descriptive statistics (mean, standard deviation, etc.)
  – 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
  – 3) A report of the post-hoc results with effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for
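
For readers who want to see where the F-statistic comes from, here is a minimal sketch in plain Python. The three score lists are invented for illustration (they are not data from any study mentioned here), and the function name is hypothetical. The sketch stops at F and the degrees of freedom; the p-value would still have to come from an F table or a statistics package.

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a one-way ANOVA."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score sits from its own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical vocab test scores for three study methods.
word_cards = [12, 15, 14, 16, 13]
word_lists = [10, 11, 13, 12, 9]
pc_software = [14, 13, 15, 12, 16]

result = one_way_anova([word_cards, word_lists, pc_software])
print(result)
```

A larger F means the group means are spread out more than the scores inside each group would lead us to expect.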

ANOVA in the Literature
[Example tables from a published study: descriptive statistics and ANOVA statistics]
F-statistic is significant…i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Pairwise comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
  – .01 – small effect
  – .06 – medium effect
  – .14 – large effect
  *According to Cohen (1988)
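
A small sketch of how eta-squared is usually obtained and read against the benchmarks above. The sums of squares are invented for illustration, the function names are hypothetical, and treating each benchmark as the lower edge of a band (with "negligible" below .01) is an assumption of this example rather than something stated on the slide.

```python
def eta_squared(ss_between: float, ss_total: float) -> float:
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def cohen_label(eta_sq: float) -> str:
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed for values under the smallest benchmark


# Hypothetical sums of squares: 30 between groups out of 60 in total.
example = eta_squared(ss_between=30.0, ss_total=60.0)
print(f"eta-squared = {example:.2f} ({cohen_label(example)})")

# An eta-squared of .02 falls in the "small" band.
print(cohen_label(0.02))
```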

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results
[ANOVA table]

“[This table] shows the result from running through an ANOVA by using SPSS. It can be seen that the difference among treatments is significant (p < 0.05). The scores for the Vocabulary condition were much higher than the other conditions. The Main Character condition was slightly higher than the Combined condition.”

Problems
• No mention of the type of ANOVA.
• No mention of post-hoc results.
  – Which groups were significantly different from each other?
• No mention of effect size.
  – What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
  – PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
  – Identifies patterns within large numbers of participants
  – “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
  – Correlational Analysis
  – Multiple Regression
  – Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Factors: Integrative, Instrumental, Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!
• Factor loading
  – Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
  – Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
  – Measurement of item-scale reliability
  – Based on inter-item correlation and the number of items (all else being equal, the more items, the greater the alpha)
  – Does not “prove” cause-effect or validity of items themselves
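
To make the point about Cronbach's alpha concrete, here is a minimal sketch in plain Python. The responses are invented Likert-style answers from five imaginary participants, and both function names are hypothetical. The first function uses the usual variance-based formula; the second uses the standardized form to show how alpha climbs as items are added while the average inter-item correlation stays the same.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, with participants in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each participant's scale total
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


def standardized_alpha(n_items: int, mean_inter_item_r: float) -> float:
    """Alpha implied by the number of items and their average correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)


# Hypothetical responses to three items from five participants.
item_1 = [4, 5, 3, 4, 2]
item_2 = [4, 4, 3, 5, 2]
item_3 = [5, 5, 2, 4, 1]
print(f"alpha = {cronbach_alpha([item_1, item_2, item_3]):.3f}")

# Same average correlation (.30), more items: alpha rises.
for k in (3, 6, 8):
    print(k, "items:", round(standardized_alpha(k, 0.30), 3))
```

This is why a high alpha on a long questionnaire says little by itself about the quality of the individual items.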

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
  – Previous research results
  – Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
  – “Borrow” items or develop them in a pilot
  – 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
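
The squared-loading arithmetic is easy to check in code. This is only a sketch; the function name is made up for illustration, and the .4 cut-off is the one quoted above.

```python
def shared_variance(loading: float) -> float:
    """Proportion of an item's variance shared with the factor (loading squared)."""
    return loading ** 2


item_43 = 0.63
print(f"Item 43: {shared_variance(item_43):.0%} shared variance")

# A loading right at the .4 cut-off shares only about 16% of its variance.
print(f"Cut-off: {shared_variance(0.4):.0%} shared variance")
```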

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
  – “Over 300” (Tabachnick & Fidell, 2007)
  – 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
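
The participants-per-item rule of thumb can be written as a one-line calculation. A minimal sketch, with a hypothetical function name, using the 5-10 range quoted above:

```python
def participants_needed(n_items: int, per_item_low: int = 5, per_item_high: int = 10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```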

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
  – Mean and SD
• T-tests
  – Dependent, independent, paired
• One-way ANOVA
  – F, effect size, post-hoc
• Factor Analysis
  – Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007


Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



Horribly, horribly complicated

Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
 Helps researchers draw conclusions from a
large number of items through data reduction
 Requires a large N-size and several reference
books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 32

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but rather…
• To help everyone better understand and interpret common statistical methods encountered in SLA studies
• To cover common errors and issues related to each procedure


Overview
• Introduction
• Descriptive Statistics – Harumi
• T-tests – Philip
• One-way ANOVA – Peter
• Factor Analysis – Matthew
• Q&A


Outline


Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
PATTERNS

Q3: What do descriptive statistics provide?
Snapshot description of the situation observed

Numerical representations of how each group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
• Mean: central tendency
• Standard Deviation: variability from the mean
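
A minimal sketch of how the two values are obtained, using Python's standard library. The scores are hypothetical (invented for illustration), and the sample formula (dividing by n - 1) is assumed for the SD.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustrative data only).
scores = [12, 15, 15, 18, 20, 22, 24]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample standard deviation (divides by n - 1)

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

A small SD means most scores sit close to the mean; a large SD means they are widely spread.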

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
• Not symmetrical
• Flat or peaked
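
Lack of symmetry and flatness/peakedness are usually quantified as skewness and kurtosis. Those two terms are not on the slide, so treat the sketch below as one common way to put numbers on "how normal", using population moment formulas and hypothetical data.

```python
import statistics

def skewness(data):
    """Third standardized moment: 0 for a symmetrical distribution."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(((x - m) / sd) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Fourth standardized moment minus 3: negative = flat, positive = peaked."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(((x - m) / sd) ** 4 for x in data) / len(data) - 3

# Hypothetical score sets (illustrative only).
symmetrical = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetrical), excess_kurtosis(symmetrical))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

Values near zero on both measures suggest the data are close to a normal curve.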

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
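
The two designs above lead to two different t formulas. The sketch below computes both by hand with hypothetical scores; the function names are illustrative, and the independent-samples version assumes equal variances (pooled). Turning t and df into a p-value needs the t distribution (a table or a statistics package), which is left out here.

```python
import math
import statistics

def paired_t(time1, time2):
    """Paired samples: one group measured twice; works on the differences."""
    diffs = [b - a for a, b in zip(time1, time2)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # t statistic, degrees of freedom

def independent_t(group_a, group_b):
    """Independent samples: two separate groups, pooled variance assumed."""
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    t = diff / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical vocab scores before/after extensive reading (same learners).
before = [10, 12, 9, 14, 11]
after = [12, 15, 10, 17, 12]
t_paired, df_paired = paired_t(before, after)

# Hypothetical test scores for implicit vs. explicit grammar groups.
implicit = [10, 12, 14]
explicit = [13, 15, 17]
t_ind, df_ind = independent_t(implicit, explicit)

print(f"paired: t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_ind:.2f}, df = {df_ind}")
```

Note the different degrees of freedom: n - 1 for the paired design, n1 + n2 - 2 for the independent design.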

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
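95% level of significance (alpha = .05) = a 5% chance of a false positive on each comparison when no real difference exists
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 significant results expected by chance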
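
The arithmetic behind these figures can be sketched as follows. The function names are illustrative, and the second calculation (the chance of at least one false positive, often called the familywise error rate) is an addition that assumes the comparisons are independent and that no real differences exist.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of 'significant' results produced by chance alone."""
    return n_comparisons * alpha

def familywise_error(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error(k), 3))
```

With 20 comparisons the chance of at least one spurious result is well over half, which is why the alpha level has to be adjusted.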

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
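
As a sketch, the same adjustment can be applied to the eight p-values reported for Macaro & Erler (2007) earlier in this section. Pairing each area with a p-value by order of appearance is an assumption, and so is the idea that the slide's *p < .006 cut-off comes from 0.05 ÷ 8; the function name is illustrative.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the alpha level by the number of comparisons made."""
    return alpha / n_comparisons

# p-values as listed on the Macaro & Erler slide (area pairing assumed).
p_values = {
    "Reading": 0.001,
    "Speaking": 0.024,
    "Writing": 0.023,
    "Listening": 0.001,
    "Spelling": 0.001,
    "General learning": 0.001,
    "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print("still significant:", significant)
```

Speaking and Writing would pass an unadjusted .05 level but not the adjusted one, which matches the rows left without an asterisk.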

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with means M1, M2, and M3 close together]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: Control Group, Treatment 1 Group, and Treatment 2 Group, with mean M1 set apart from M2 and M3]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups without the inflated Type I error that comes from running multiple t-tests.


ANOVAs in Language Research


• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with an ANOVA to see if any groups scored significantly higher/lower
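
A minimal sketch of the calculation behind this example. The test scores are hypothetical and the function name is illustrative. The F-statistic is the between-groups mean square divided by the within-groups mean square; judging significance means comparing F against the F distribution for the two degrees of freedom, which is not done here.

```python
from statistics import mean

def one_way_anova(groups):
    """Return F, df_between, df_within for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores (illustrative only).
word_cards = [14, 16, 18]
word_lists = [10, 12, 14]
pc_software = [15, 17, 19]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The two degrees of freedom printed here are the ones a published ANOVA table should report alongside F.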

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1~5) scales
• Results are analyzed with an ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results

Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic


• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: descriptive statistics table and ANOVA statistics table from a published study]
F-statistic is significant …i.e. our model seems to work

F-statistic cont.


• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
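
Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below uses hypothetical scores and illustrative function names; the cut-offs are the Cohen (1988) values listed above, and the "negligible" label for values under .01 is an assumption added for completeness.

```python
from statistics import mean

def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the cut-offs from the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label; not on the slide

# Hypothetical scores for three treatment groups (illustrative only).
groups = [[14, 16, 18], [10, 12, 14], [15, 17, 19]]
eta_sq = eta_squared(groups)
print(f"eta-squared = {eta_sq:.2f} ({effect_label(eta_sq)})")
```

A result can be statistically significant and still carry a small label here, which is why effect size is reported separately from the F-statistic.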

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent

Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
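
A sketch of how Cronbach's alpha is commonly computed from raw questionnaire responses, using the standard formula alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The Likert responses are hypothetical and the function name is illustrative.

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of item responses per participant."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # regroup responses by item
    item_variances = sum(variance(col) for col in items)
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical responses: 5 participants x 3 Likert items (illustrative only).
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 2, 3],
    [4, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because k appears in the formula, adding items tends to raise alpha even when the items are no more closely related, which is the caution raised above.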

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
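
The participants-per-item guideline is simple multiplication, sketched below. The function name and its default range are illustrative; the 5-10 range is the one attributed to Field (2005) above.

```python
def recommended_n(n_items, per_item=(5, 10)):
    """Return the (minimum, maximum) suggested sample size for a questionnaire."""
    low, high = per_item
    return n_items * low, n_items * high

low_n, high_n = recommended_n(30)
print(f"30 items: {low_n}-{high_n} participants")
```

For a 30-item questionnaire this reproduces the 150-300 range given in the example.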

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 = 11% of shared variance with the factor
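
The shared-variance figures for Item 43 and Item 38 come from squaring the loading. A minimal sketch, with an illustrative function name:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor."""
    return loading ** 2

# Loadings quoted on the slides for the two example items.
for item, loading in (("Item 43", 0.63), ("Item 38", 0.33)):
    pct = shared_variance(loading) * 100
    print(f"{item}: loading {loading:.2f} -> {pct:.0f}% shared variance")
```

A loading that looks respectable at .33 accounts for only about a tenth of the item's variance, which is why low cut-off points are a concern.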

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…


• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 34

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
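
A minimal sketch of the arithmetic behind a paired samples t-test for a design like this one. The scores are hypothetical, and the p-value lookup that software such as SPSS performs is left out.

```python
import math
import statistics

# Hypothetical vocabulary test scores for the same six learners
# (made-up numbers; not data from any study cited in these slides)
time1 = [12, 15, 11, 18, 14, 16]   # before extensive reading
time2 = [14, 18, 12, 21, 15, 19]   # after extensive reading

# A paired samples t-test works on each learner's difference score
diffs = [after - before for before, after in zip(time1, time2)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)            # sample SD of the differences

t = mean_diff / (sd_diff / math.sqrt(n))     # t statistic
df = n - 1                                   # degrees of freedom

print(f"t = {t:.2f}, df = {df}")
```

Because each learner serves as their own comparison, only the spread of the difference scores enters the calculation.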

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
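
A minimal sketch of an independent samples t statistic for two separate groups, assuming roughly equal variances (the pooled-variance form). The scores are hypothetical.

```python
import math
import statistics

# Hypothetical grammar test scores (made-up numbers for illustration only)
group_a = [68, 72, 75, 70, 66, 74]   # implicit grammar instruction
group_b = [78, 74, 80, 77, 73, 82]   # explicit grammar instruction

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Pooled variance assumes the two groups have roughly equal variances
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t = (mean_a - mean_b) / standard_error
df = n_a + n_b - 2                   # degrees of freedom

print(f"t = {t:.2f}, df = {df}")
```

The sign of t only reflects which group was entered first; the size of |t| relative to df is what determines significance.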

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
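alpha = .05 (95% confidence level): when there is no real difference, a test will still come out significant about 5% of the time by chance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance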
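
The counts above can be checked with a short sketch. The function names are illustrative, and the second function assumes the comparisons are independent of one another, which real data may not satisfy.

```python
alpha = 0.05  # conventional significance level

def expected_false_positives(comparisons, alpha=alpha):
    """Average number of 'significant' results when no real differences exist."""
    return comparisons * alpha

def familywise_error_rate(comparisons, alpha=alpha):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 20, 100):
    print(k, expected_false_positives(k), round(familywise_error_rate(k), 3))
```

The second column grows quickly: even a handful of unadjusted comparisons makes at least one Type I error fairly likely.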

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
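
A minimal sketch of the Bonferroni adjustment, applied to the eight p-values shown on the Macaro & Erler (2007) attitudes slide. The function name is an assumption; with eight comparisons the adjusted alpha is .05 / 8 = .00625, which appears consistent with the *p < .006 note on that slide.

```python
def bonferroni_alpha(alpha, comparisons):
    """Bonferroni-adjusted alpha: the original alpha divided by the number of comparisons."""
    return alpha / comparisons

# p-values as reported on the Macaro & Erler (2007) attitudes slide
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))

for area, p in p_values.items():
    verdict = "significant" if p < adjusted else "not significant"
    print(f"{area}: p = {p} -> {verdict} at adjusted alpha {adjusted}")
```

Speaking and Writing would pass an unadjusted .05 threshold but not the adjusted one, which is the trade-off between Type I and Type II error described above.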

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means M1, M2 and M3 plotted for the Control, Treatment 1 and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other

ANOVAs and T-tests
• Both procedures look for significant mean differences between groups
• However, t-tests work best when limited to 2 groups
• ANOVAs can work with 3 or more groups without the added Type I error of running several t-tests


ANOVAs in Language Research


• Often used to compare:
  – Assessment scores
  – Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting vocabulary range
  – Group I learns with word cards
  – Group II learns with word lists
  – Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
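
A minimal sketch of how the F-statistic for a three-group design like this could be computed by hand. The scores are made-up numbers, and looking up the p-value (normally done by software) is left out.

```python
import statistics

# Hypothetical vocabulary test scores (made-up numbers for illustration only)
groups = {
    "word cards":  [14, 16, 15, 17, 13],
    "word lists":  [12, 13, 11, 14, 10],
    "PC software": [15, 18, 16, 19, 17],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = statistics.fmean(all_scores)

# Between-groups sum of squares: how far each group mean is from the grand mean
ss_between = sum(
    len(scores) * (statistics.fmean(scores) - grand_mean) ** 2
    for scores in groups.values()
)
# Within-groups sum of squares: spread of scores around their own group mean
ss_within = sum(
    (x - statistics.fmean(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F is the ratio of between-groups to within-groups mean squares
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

A large F means the group means differ by more than the spread inside each group would lead us to expect.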

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
  – Group I – Written peer review
  – Group II – Oral peer review
  – Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA
Results


Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic


• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is significant …i.e. our model seems to work

F-statistic cont.


• A significant F-statistic indicates that at least two of the group means differ significantly
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
  – .01 – small effect
  – .06 – medium effect
  – .14 – large effect
*According to Cohen (1988)
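
A minimal sketch of eta-squared and the benchmarks listed above. The function names, the "negligible" label for values under .01, and the choice to treat each benchmark as the lower edge of a band are assumptions made for illustration.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance associated with group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= .14:
        return "large"
    if eta_sq >= .06:
        return "medium"
    if eta_sq >= .01:
        return "small"
    return "negligible"

# Hypothetical sums of squares (made-up numbers for illustration only)
value = eta_squared(ss_between=20, ss_total=1000)
print(value, cohen_label(value))
```

A result can reach statistical significance and still fall in the "small" band, which is why effect size is reported alongside the F-statistic.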

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the … indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
  – Which groups were significantly different from each other?
• No mention of effect size
  – What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
  – PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
  – Identifies patterns within large numbers of participants
  – “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
  – Correlational Analysis
  – Multiple Regression
  – Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent

Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (all else being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
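
A minimal sketch of how Cronbach's alpha can be computed from raw questionnaire answers, using the standard variance-based formula. The responses are made-up numbers, and the function name is an assumption.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                                  # number of items
    items = list(zip(*responses))                          # one tuple of scores per item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical answers from five respondents to three Likert items (1-5)
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 3],
    [1, 2, 2],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Alpha rises when respondents answer the items consistently with one another; it says nothing about whether the items measure what the researcher intended.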

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
  – Previous research results
  – Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
  – “Borrow” items or develop them in a pilot
  – 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
  – “Over 300” (Tabachnick & Fidell, 2007)
  – 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
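
The participants-per-item rule of thumb is simple multiplication; a short sketch, with a hypothetical function name, makes the 30-item example explicit.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Range of participants under a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```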

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
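
The shared-variance figures on these slides come from squaring the factor loading. A minimal sketch, using the loadings quoted for Item 43 and Item 38 (the function name is an assumption):

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance an item shares with the factor."""
    return loading ** 2

# Loadings quoted on the slides
for item, loading in [("Item 43 on F1", .63), ("Item 38 on F3", .33), ("Item 38 on F1", .29)]:
    print(f"{item}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

Squaring makes weak loadings look weaker still: a loading of .33 leaves nearly nine-tenths of the item's variance unrelated to the factor.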

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
  – Mean and SD
• T-tests
  – Dependent, independent, paired
• One-way ANOVA
  – F, effect size, post-hoc
• Factor Analysis
  – Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 36

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for
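
As a reminder of why a higher F is better for the model, the F-statistic is the ratio of between-groups variance to within-groups variance. The notation here is an assumption for illustration (the slides do not define symbols): k is the number of groups, N is the total number of participants, SS is a sum of squares, and MS is a mean square.

```latex
\[
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\]
```

When the group means are far apart relative to the spread of scores inside each group, the numerator grows and F rises.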

ANOVA in the Literature
[Figure: published tables showing descriptive statistics and ANOVA statistics]
F-statistic is significant
…i.e. our model seems to work

F-statistic cont.


Reaching significance indicates there
are statistically significant differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results
• These are done if the
F-statistic is significant
• Paired comparisons of
the group means
• Tell us where the
significant differences
lie
• Often reported in the
text (though sometimes
in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of
association’
• Indicates the magnitude
of the difference between
means
• Reflects the proportion of total variance
accounted for by the
treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen
(1988)
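
Eta-squared can be computed directly from an ANOVA table as the between-groups sum of squares divided by the total sum of squares. The sketch below does that and attaches the Cohen (1988) labels listed above. The function names, the "negligible" label for values under .01, and the example sums of squares are assumptions made for illustration.

```python
def eta_squared(ss_between: float, ss_total: float) -> float:
    """Proportion of total variance accounted for by the treatment."""
    return ss_between / ss_total


def cohen_label(eta_sq: float) -> str:
    """Label an eta-squared value using the thresholds on the slide."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label below .01 is an assumption, not from Cohen


# Hypothetical ANOVA table values (invented for illustration)
ss_between, ss_within = 43.33, 30.00
eta_sq = eta_squared(ss_between, ss_between + ss_within)
print(f"eta-squared = {eta_sq:.2f} ({cohen_label(eta_sq)})")
# prints: eta-squared = 0.59 (large)

# An eta-squared of .02 falls in the small range
print(cohen_label(0.02))  # small
```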

Reporting Post-hoc Results and
Effect Size
• “Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA
Problems and Issues
• Starting out with non-equivalent
groups
• Not reporting the type of ANOVA
performed
• Not reporting specific post-hoc
results
• Not reporting effect size

[Figure callouts: Post-hoc results, ANOVA table]

“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
• Often used for testing treatment effects or
comparing survey results
• Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
• Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or
sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors

Uses of FA within SLA
• Typically used with psychological
variables and Likert-scale
questionnaires
• Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each
item (variable)


More Terminology!


• Factor loading
– Correlation between an item and
the factor; the squared loading is the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items
(other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
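
A short sketch may help show where Cronbach's alpha comes from and why it rises with the number of items. The first function uses the usual variance-based formula; the second uses the standardized form based on the mean inter-item correlation. The function names and the Likert responses are assumptions invented for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Alpha from raw scores: one list per item, same respondents in the same order."""
    k = len(items)
    # Each respondent's total score across all items
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def standardized_alpha(k, mean_r):
    """Alpha from the number of items k and the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical 1-5 Likert responses from five students on three items
item_1 = [4, 5, 3, 4, 2]
item_2 = [4, 4, 3, 5, 2]
item_3 = [5, 4, 2, 4, 3]
print(round(cronbach_alpha([item_1, item_2, item_3]), 3))  # 0.864

# Same mean inter-item correlation (.30), more items, higher alpha
for k in (3, 6, 12):
    print(k, round(standardized_alpha(k, 0.30), 3))
# prints: 3 0.562, then 6 0.72, then 12 0.837
```

The loop illustrates the caution on the slide: a high alpha can reflect a long scale as much as strongly related items.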

Determining factors, then items




• Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
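
The participant range in this example follows from the 5-10 participants per item rule of thumb. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def recommended_n(n_items, per_item=(5, 10)):
    """Lower and upper participant counts under a participants-per-item rule."""
    low, high = per_item
    return n_items * low, n_items * high


print(recommended_n(30))  # (150, 300)
```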

Problems and issues with FA
• “Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
• Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
• Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 = 11% of shared variance with the factor
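
The shared-variance figures quoted for Item 43 and Item 38 come from squaring the loading. The sketch below reproduces them and flags loadings under the .4 guideline cited from Stevens (1992). The function name, the labels, and the choice to apply .4 as a hard cutoff are assumptions made for illustration.

```python
def shared_variance(loading: float) -> float:
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2


# Loadings quoted on the slides
loadings = [
    ("Item 43 on F1", 0.63),
    ("Item 38 on F3", 0.33),
    ("Item 38 on F1", 0.29),
]

CUTOFF = 0.4  # guideline attributed to Stevens (1992) on the slides

for label, loading in loadings:
    verdict = "acceptable" if loading >= CUTOFF else "weak"
    print(f"{label}: {shared_variance(loading):.0%} shared variance ({verdict})")
# prints:
# Item 43 on F1: 40% shared variance (acceptable)
# Item 38 on F3: 11% shared variance (weak)
# Item 38 on F1: 8% shared variance (weak)
```

Item 38 is doubly problematic: both loadings are weak, and they are close enough to each other that the item does not clearly belong to either factor.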

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
• Helps researchers draw conclusions from a
large number of items through data reduction
• Requires a large N-size and several reference
books
• Often written up with no regard to APA
guidelines or previous research results

To sum up…


• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 38

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
 Tests

the existence of underlying
(latent) constructs within a sample
population
–

Identifies patterns within large numbers
of participants

–

“Reduces” several items into a few
measurable factors

Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
 Often a preliminary step before more
complicated statistical analyses


–

Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Selfcompetence

Terminology
Factor - the latent construct
 Variance - different answers to each
item (variable)


More Terminology!


Factor loading
–
–



Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
–
–
–

Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves

Determining factors, then items




Researchers should determine the factors
before adding items to the questionnaire
–

Previous research results

–

Carefully constructed model

Items should be designed to relate to a
particular concept (factor)
–

“Borrow” items or develop them in a pilot

–

6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
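
The 5-10 participants-per-item guideline can be turned into a quick calculation. This is only a sketch of the rule of thumb attributed to Field (2005) above; the function name `participants_needed` and its parameters are hypothetical.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```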

Problems and issues with FA
• “Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
• Not understanding the nature of factors
(e.g., using 2 or 3 items as a “factor” or
keeping too many factors)
• Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, i.e., about 11% of variance shared with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
• Helps researchers draw conclusions from a
large number of items through data reduction
• Requires a large N-size and several reference
books
• Often written up with no regard to APA
guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 41

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
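95% confidence level (alpha = .05) = a 5% chance of a false positive on each comparison when no real difference exists
20 comparisons = about 1 significant result expected by chance
100 comparisons = about 5 expected by chance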
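
The arithmetic behind these figures can be sketched as follows. The second function, the chance of at least one false positive across all comparisons, is an addition for illustration and assumes the comparisons are independent.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of chance 'significant' results when no real effect exists."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for m in (1, 20, 100):
    print(m, expected_false_positives(0.05, m),
          round(familywise_error_rate(0.05, m), 3))
```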

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
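
As a sketch of the adjustment in use, the code below applies it to the eight p values reported for Macaro & Erler (2007). Pairing each area with a p value in listed order is an assumption made for illustration.

```python
def bonferroni_alpha(alpha, comparisons):
    """Adjusted alpha level: the original alpha divided by the number of comparisons."""
    return alpha / comparisons

# p values as reported on the slide, paired with the areas in listed order (assumed)
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 8 = 0.00625
not_significant = [area for area, p in p_values.items() if p >= adjusted]
print("Adjusted alpha:", adjusted)
print("Not significant after adjustment:", not_significant)
```

With eight comparisons the adjusted alpha is .00625, which matches the *p < .006 threshold shown with the results.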

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
• Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
[Figure: group means M1, M2, M3 plotted for the Control Group, Treatment 1 Group, and Treatment 2 Group]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: group means M1, M2, M3 plotted for the Control Group, Treatment 1 Group, and Treatment 2 Group]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other

ANOVAs and T-tests
• Both procedures look for significant
mean differences between groups;
• However, t-tests work best when
limited to 2 groups.
• ANOVAs can work with 3 or more
groups while introducing less (Type I) error
than repeated t-tests.

ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is
given
• Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
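
A minimal sketch of how the F-statistic for such a design is computed. The scores are hypothetical, and the p value is left out because it requires the F distribution (a statistics package or an F table would supply it).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocab test scores
word_cards = [14, 15, 16]
word_lists = [10, 11, 12]
pc_software = [12, 13, 14]
f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means differ by much more than the scores vary within each group.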

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
• Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
[Figure: published results table, annotated with the callouts below]
• Descriptive statistics
• ANOVA statistics
• F-statistic is significant, i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there
are statistically significant differences
between some of the group means
• But…the F-statistic doesn’t tell us
where the differences are
• For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results
• These are done if the
F-statistic is significant
• Paired comparisons of
the group means
• Tell us where the
significant differences
lie
• Often reported in the
text (though sometimes
in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of
association’
• Indicates the magnitude
of the difference between
means
• Reflects the proportion of total variance
accounted for by the
treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen
(1988)
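
Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below uses hypothetical scores; treating the Cohen (1988) values listed above as lower cut-off points, and the label "negligible" for anything below .01, are assumptions made for illustration.

```python
def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 guidelines."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # hypothetical label for values under .01

# Hypothetical scores for three groups
groups = [[14, 15, 16], [10, 11, 12], [12, 13, 14]]
effect = eta_squared(groups)
print(round(effect, 2), cohen_label(effect))
print(cohen_label(0.02))
```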

Reporting Post-hoc Results and
Effect Size
• “Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”

Common ANOVA
Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA
performed
• Not reporting specific post-hoc
results
• Not reporting effect size

Post-hoc results

ANOVA table

“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”

Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
• Often used for testing treatment effects or
comparing survey results
• Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
• Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or
sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors

Uses of FA within SLA
• Typically used with psychological
variables and Likert-scale
questionnaires
• Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each
item (variable)

More Terminology!
• Factor loading
– Correlation between an item and the factor;
the squared loading gives the amount of
shared variance
– Factor loadings above .4 are desirable, above .7
are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number
of items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
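
Cronbach’s alpha can be computed from the item variances and the variance of the total scores. This is a minimal sketch with hypothetical Likert responses (rows are participants, columns are items belonging to one factor).

```python
import statistics

def cronbach_alpha(responses):
    """Alpha for a list of participants' item scores on a single scale."""
    k = len(responses[0])  # number of items
    item_variances = [statistics.variance(item) for item in zip(*responses)]
    total_variance = statistics.variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical 1-5 Likert responses: 5 participants x 3 items
responses = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Because k / (k - 1) and the total variance both depend on the number of items, adding items tends to raise alpha even when the items are no more closely related.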

Determining factors, then items
• Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)
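
The shared-variance figure is simply the loading multiplied by itself. A small sketch, using loadings quoted in this presentation:

```python
def shared_variance(loading):
    """Proportion of variance an item shares with its factor (squared loading)."""
    return loading ** 2

for loading in (0.63, 0.40, 0.33):
    percent = round(shared_variance(loading) * 100)
    print(f"loading {loading:.2f} -> about {percent}% shared variance")
```

A loading of .4 therefore corresponds to only about 16% shared variance, which helps explain why lower cut-off points are hard to defend.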

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
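
The participants-per-item guideline is easy to check before collecting data. A minimal sketch, with the 5-10 range taken from the guideline above and the function name chosen for illustration:

```python
def recommended_n(items, per_item_low=5, per_item_high=10):
    """Range of participants suggested by the 5-10 per item guideline."""
    return items * per_item_low, items * per_item_high

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```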

Problems and issues with FA
• “Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
• Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
• Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
• N-size far too small

Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
• Helps researchers draw conclusions from a
large number of items through data reduction
• Requires a large N-size and several reference
books
• Often written up with no regard to APA
guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 43

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion

• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!

• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of shared variance between the item and the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
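
The point that alpha grows with the number of items can be seen in a small sketch. It uses the standardized form of Cronbach's alpha, which depends only on the number of items k and the mean inter-item correlation; that choice is an assumption for illustration (statistics packages usually report a covariance-based alpha as well), and the function name and input values are hypothetical.

```python
# Sketch only: standardized Cronbach's alpha from the number of items (k)
# and the mean inter-item correlation (mean_r). Names and values are hypothetical.

def standardized_alpha(k, mean_r):
    """alpha = k * mean_r / (1 + (k - 1) * mean_r)"""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Same mean inter-item correlation, different scale lengths.
alpha_3_items = standardized_alpha(3, 0.3)
alpha_6_items = standardized_alpha(6, 0.3)
alpha_8_items = standardized_alpha(8, 0.3)

for k, alpha in [(3, alpha_3_items), (6, alpha_6_items), (8, alpha_8_items)]:
    print(f"{k} items, mean r = .30 -> alpha = {alpha:.3f}")
```

With the correlation held constant, adding items alone raises alpha, which is one reason a high alpha by itself says little about the validity of the items.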

Determining factors, then items

• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = .40, i.e. 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
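
The arithmetic behind the loading example can be sketched in a few lines: squaring a loading gives the proportion of variance the item shares with the factor. The function names are hypothetical, and the .4 default simply mirrors the cut-off cited on the slide.

```python
# Sketch only: shared variance from a factor loading.
# Function names are hypothetical; the .4 default follows the cut-off cited above.

def shared_variance(loading):
    """Squared loading = proportion of variance shared by item and factor."""
    return loading ** 2


def meets_cutoff(loading, cutoff=0.4):
    """True if the absolute loading reaches the chosen cut-off."""
    return abs(loading) >= cutoff


strong = shared_variance(0.63)  # the Item 43 loading from the example
weak = shared_variance(0.33)    # a weak loading for comparison

print(f".63 loading -> {strong:.0%} shared variance, acceptable: {meets_cutoff(0.63)}")
print(f".33 loading -> {weak:.0%} shared variance, acceptable: {meets_cutoff(0.33)}")
```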

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
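
The participants-per-item rule of thumb is simple multiplication, sketched below for the 30-item example. The function name and parameter names are hypothetical; the 5 and 10 defaults come from the 5-10 participants per item guideline above.

```python
# Sketch only: N-size range implied by a participants-per-item rule of thumb.
# Function and parameter names are hypothetical.

def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) recommended number of participants."""
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30-item questionnaire: {low}-{high} participants")
```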

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA

• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…

• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 46

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but rather…
• To help everyone better understand and interpret common statistical
methods encountered in SLA studies
• To cover common errors and issues related to each procedure


Overview
• Introduction
• Descriptive Statistics – Harumi
• T-tests – Philip
• One-way ANOVA – Peter
• Factor Analysis – Matthew
• Q&A


Outline


Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
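
As a minimal sketch of these two measures, the following Python snippet computes the mean and the sample standard deviation for one group. The scores are invented for illustration and do not come from the presentation.

```python
from statistics import mean, stdev

# Hypothetical test scores for one class (illustrative only).
scores = [62, 70, 75, 75, 80, 88]

central_tendency = mean(scores)   # average score of the group
variability = stdev(scores)       # sample standard deviation (n - 1 denominator)

print(f"M = {central_tendency:.2f}, SD = {variability:.2f}")
```

Reporting both values together lets a reader picture where the group sits and how widely its members are spread around that point.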

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.
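
To make the two-group comparison concrete, here is a rough sketch of an independent samples t-test computed by hand in Python. It assumes equal variances in the two groups (the pooled-variance form), and the scores and the function name `independent_t` are hypothetical. The last line shows that group sizes of 62 and 54 give 114 degrees of freedom, the df value reported for this study.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances (equal-variance assumption).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df

# Hypothetical test scores (illustrative only).
explicit = [10, 12, 14, 16, 18]
implicit = [8, 10, 12, 14, 16]
t_value, df = independent_t(explicit, implicit)
print(f"t = {t_value:.2f}, df = {df}")

# Degrees of freedom for a treatment group of 62 and a control group of 54.
print(62 + 54 - 2)
```

The t value on its own is not a verdict; it still has to be compared against the chosen alpha level for the given degrees of freedom.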

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
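
The adjustment can be sketched in a few lines of Python. The function name `bonferroni_alpha` is hypothetical. The second part applies the same division to the eight attitude areas from the Macaro & Erler (2007) results, assuming the p values pair with the areas in the order listed on that slide.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Divide the overall alpha level by the number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons

# The slide's example: 0.05 shared across 5 comparisons.
print(round(bonferroni_alpha(0.05, 5), 4))

# Eight comparisons (assumed area-to-p pairing, taken from the slide order).
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}
adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8 = 0.00625
significant = [area for area, p in p_values.items() if p < adjusted]
print(significant)
```

With eight comparisons the adjusted level is 0.00625, which is consistent with the "*p < .006" note on that table: Speaking and Writing fall short of it while the other six areas pass.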

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
Function
– ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant
mean differences between groups.
• However, t-tests work best when
limited to 2 groups.
• ANOVAs can work with 3 or more
groups while introducing less error
than running multiple t-tests.
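
One way to see where the extra error comes from is to count the pairwise t-tests that three or more groups would require. The sketch below is a simplification that assumes the comparisons are independent, which pairwise tests on shared groups are not strictly; the function name `familywise_error` is hypothetical.

```python
from math import comb

def familywise_error(alpha, n_groups):
    """Chance of at least one false positive across all pairwise tests."""
    n_pairs = comb(n_groups, 2)           # number of two-group comparisons
    return 1 - (1 - alpha) ** n_pairs

for k in (2, 3, 4, 5):
    print(k, "groups:", comb(k, 2), "tests, error rate", round(familywise_error(0.05, k), 3))
```

Under these assumptions, three groups need three t-tests and the overall Type I error rate rises from .05 to roughly .14, whereas a single ANOVA tests all the means at once at the chosen alpha level.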


ANOVAs in Language Research


• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try
different treatments/methods with 3
different groups and then test them
to see if the results show any
significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting
vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is
given
• Results from the test are analyzed with ANOVA to
see if any groups scored significantly
higher/lower
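
The design above could be analyzed along the following lines. This is only a sketch: the vocabulary scores are invented, the function name `one_way_anova` is hypothetical, and the F-statistic is computed by hand from sums of squares rather than with a statistics package.

```python
from statistics import mean

def one_way_anova(groups):
    """Return the F-statistic and its degrees of freedom for a list of groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocabulary test scores for the three learning groups.
word_cards = [14, 16, 15, 17, 18]
word_lists = [12, 13, 15, 14, 11]
pc_software = [15, 14, 16, 13, 17]

f_stat, df_b, df_w = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The F value is then checked against the F distribution for those degrees of freedom to obtain the p value; statistics software such as SPSS reports this automatically.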

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction
surveys are given using Likert (1-5) scales
• Results are analyzed with ANOVA for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
– 1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results
• These are done if the
F-statistic is significant
• Paired comparisons of
the group means
• Tell us where the
significant differences
lie
• Often reported in the
text (though sometimes
in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of
association’
• Indicates the magnitude
of the difference between
means
• Reflects the share of total variance
accounted for by the
treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen
(1988)
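
As an illustration of how eta-squared is obtained and read against the benchmarks above, the following sketch divides the between-groups sum of squares by the total sum of squares. The function names, the sums of squares, and the "negligible" label for values under .01 are assumptions added for the example.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance associated with group membership."""
    return ss_between / ss_total

def effect_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"   # label assumed here; not part of the benchmarks

# Hypothetical sums of squares taken from an ANOVA table.
value = eta_squared(ss_between=20.0, ss_total=100.0)
print(value, effect_label(value))

# An eta-squared of .02 counts as a small effect.
print(effect_label(0.02))
```

A result can therefore reach statistical significance while still being labeled a small effect, which is why both figures are worth reporting.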

Reporting Post-hoc Results and
Effect Size
• “Post-hoc comparisons using the […] indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA
Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA
performed
• Not reporting specific post-hoc
results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
• Often used for testing treatment effects or
comparing survey results
• Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
• Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or
sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors

Uses of FA within SLA
• Typically used with psychological
variables and Likert-scale
questionnaires
• Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each
item (variable)


More Terminology!


Factor loading
– Correlation between an item and the factor; squared,
it gives the amount of shared variance
– Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
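
For readers who want to see what goes into Cronbach's alpha, here is a small sketch using the standard variance-based formula: the number of items, the variance of each item, and the variance of participants' total scores. The Likert responses and the function name `cronbach_alpha` are hypothetical.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one list per participant, holding one score per item."""
    n_items = len(responses[0])
    items = list(zip(*responses))                      # regroup scores by item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - item_variances / total_variance)

# Hypothetical answers from 5 participants to 3 Likert items (1-5).
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Alpha rises when participants answer the items in a consistent way, but a high value only indicates that the items vary together; it does not show that they measure the intended construct.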

Determining factors, then items




• Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
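
The 5-10 participants per item guideline amounts to simple multiplication, sketched below with a hypothetical helper named `participant_range`.

```python
def participant_range(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a per-item guideline."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participant_range(30)   # the 30-item questionnaire example
print(f"{low}-{high} participants")
```

For the 30-item questionnaire this reproduces the 150-300 range given in the example.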

Problems and issues with FA
• “Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
• Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
• Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
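
The arithmetic behind both loading examples (Item 43 at .63 and Item 38 at .33) is the same: squaring the loading gives the proportion of variance the item shares with the factor. A brief sketch, with a hypothetical function name:

```python
def shared_variance(loading):
    """Square a factor loading to get the proportion of shared variance."""
    return loading ** 2

for item, loading in (("Item 43", 0.63), ("Item 38", 0.33)):
    percent = shared_variance(loading) * 100
    print(f"{item}: loading {loading:.2f} -> {percent:.0f}% shared variance")
```

Seen this way, a loading of .33 leaves close to 90% of the item's variance unrelated to the factor, which is why low cut-off points are treated with caution.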

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
• Helps researchers draw conclusions from a
large number of items through data reduction
• Requires a large N-size and several reference
books
• Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
– Mean and SD

T-tests
– Dependent, independent, paired

One-way ANOVA
– F, effect size, post-hoc

Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 47

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
 Tests

the existence of underlying
(latent) constructs within a sample
population
–

Identifies patterns within large numbers
of participants

–

“Reduces” several items into a few
measurable factors

Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
 Often a preliminary step before more
complicated statistical analyses


–

Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Selfcompetence

Terminology
Factor - the latent construct
 Variance - different answers to each
item (variable)


More Terminology!


Factor loading
–
–



Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
–
–
–

Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves

Determining factors, then items




Researchers should determine the factors
before adding items to the questionnaire
–

Previous research results

–

Carefully constructed model

Items should be designed to relate to a
particular concept (factor)
–

“Borrow” items or develop them in a pilot

–

6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
Normal distribution
 Items are correlated above .3
 Large N-size


–
–

“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
 Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
 Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
 N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



Horribly, horribly complicated

Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
 Helps researchers draw conclusions from a
large number of items through data reduction
 Requires a large N-size and several reference
books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 48

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
• Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way”? - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
Similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other

ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups while introducing less (Type I) error than repeated t-tests.

ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are ANOVA analyzed to see if any groups scored significantly
higher/lower
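A minimal sketch of what "ANOVA analyzed" means for this example: the F-statistic is the between-group mean square divided by the within-group mean square. The scores below are invented, and `one_way_anova` is a hypothetical helper. It stops at F because the p-value needs the F distribution, which a statistics package (for example SciPy's `scipy.stats.f_oneway`) would normally supply.

```python
from statistics import mean


def one_way_anova(groups):
    """Return SS between, SS within, df between, df within, and F for a list of score lists."""
    scores = [x for group in groups for x in group]
    grand_mean = mean(scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


# Hypothetical vocab test scores, invented for illustration only.
word_cards = [4, 5, 6]     # Group I
word_lists = [5, 6, 7]     # Group II
pc_software = [9, 10, 11]  # Group III

print(one_way_anova([word_cards, word_lists, pc_software]))
```

A large F here reflects group means that are far apart relative to the spread of scores inside each group.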

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are ANOVA analyzed for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The ratio of between-group variance to within-group variance: the higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: published tables labeled "Descriptive statistics" and "ANOVA statistics"]
F-statistic is significant, i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size
Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)
Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments*
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen
(1988)
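Eta-squared is the between-groups sum of squares divided by the total sum of squares, both of which appear in an ANOVA table. The sketch below applies the Cohen (1988) benchmarks listed above. The function names and the "negligible" label for values under .01 are assumptions added for illustration, and the sums of squares in the first example are invented.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total


def cohen_label(eta2):
    """Label an eta-squared value using the benchmarks quoted from Cohen (1988)."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"  # label not from the slides; an assumption


print(eta_squared(42, 48), cohen_label(eta_squared(42, 48)))  # hypothetical sums of squares
print(cohen_label(0.02))  # an effect size of .02
```

A result can be statistically significant and still carry a small effect size, which is why both are reported.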

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”

Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

[Figure labels: Post-hoc results; ANOVA table]

“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”

Problems
• No mention of the type of ANOVA.
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Factors: Integrative, Instrumental, Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)

More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading gives the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
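For readers who want to see what goes into Cronbach's alpha, here is a rough sketch using the standard formula: alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses are invented and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, with respondents in the same order in every list."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical Likert (1-5) responses from five participants to three items.
item_1 = [1, 2, 3, 4, 5]
item_2 = [2, 2, 3, 4, 5]
item_3 = [1, 3, 3, 5, 4]

print(round(cronbach_alpha([item_1, item_2, item_3]), 3))
```

Because k appears in the formula, adding more items tends to raise alpha even when the items are no more closely related, which is one reason a high alpha alone does not establish validity.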

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = .40, i.e. 40% of variance shared with the factor
(Above .4 is acceptable
according to Stevens, 1992)
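The squaring step can be checked quickly. The sketch below uses the .63 loading above and, for contrast, the .33 loading quoted for Item 38 elsewhere in this talk. `shared_variance` is a hypothetical helper name.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (the squared loading)."""
    return loading ** 2


for item, loading in [("Item 43", 0.63), ("Item 38", 0.33)]:
    print(f"{item}: loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

A loading that looks moderate on paper, such as .33, corresponds to only about a tenth of the item's variance being shared with the factor.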

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small

Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 = .11, i.e. 11% of variance shared with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 50

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying
(latent) constructs within a sample
population
– Identifies patterns within large numbers
of participants
– “Reduces” several items into a few
measurable factors

Uses of FA within SLA
• Typically used with psychological
variables and Likert-scale
questionnaires
• Often a preliminary step before more
complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - how much the answers to each
item (variable) differ across participants


More Terminology!
• Factor loading
– Correlation between an item and the factor; the
squared loading is the amount of variance the item
shares with the factor
– Factor loadings above .4 are desirable, above .7
are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of
items (other things being equal, the more
items, the greater the alpha)
– Does not “prove” cause-effect or validity of items
themselves
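
To make the alpha coefficient less of a black box, here is a minimal sketch of the usual formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The Likert responses and the function name are made up for illustration; a real analysis would normally rely on a statistics package.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    item_scores: one list per item, each holding one score per respondent
    (all lists in the same respondent order).
    """
    k = len(item_scores)
    # Total scale score for each respondent.
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical 5-point Likert answers: 3 items, 5 respondents.
items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}")
```

Because k appears in the formula, adding more items with similar inter-item correlations pushes alpha up, which is one reason a high alpha alone says nothing about whether the items are valid.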

Determining factors, then items
• Researchers should determine the factors
before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a
particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation
to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. about 40% of shared
variance with the factor
(A loading above .4 is acceptable
according to Stevens, 1992)
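
The arithmetic on this slide generalizes: squaring a loading gives the proportion of variance the item shares with the factor. The short sketch below applies that to the .4 and .7 benchmarks mentioned under terminology and to the .63 loading of Item 43. The function name is hypothetical.

```python
def shared_variance_pct(loading):
    """Percentage of variance an item shares with a factor (squared loading)."""
    return 100 * loading ** 2

# .4 and .7 are the benchmark loadings; .63 is the loading of Item 43 on F1.
for loading in (0.4, 0.63, 0.7):
    pct = shared_variance_pct(loading)
    print(f"loading {loading:.2f} -> {pct:.0f}% shared variance")
```

Seen this way, a "desirable" loading of .4 still means the item shares only about 16% of its variance with the factor.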

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
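
The participants-per-item rule of thumb is simple multiplication, sketched below for the 30-item example. The function and parameter names are invented here, and the 5 and 10 defaults simply restate the range attributed to Field (2005) on the slide.

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rough N-size range from a participants-per-item rule of thumb."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```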

Problems and issues with FA
• “Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
• Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
• Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, i.e. about 11% of shared variance with the factor
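
One way to spot items like this when reading a factor table is to find each item's strongest loading and compare it with a stated cut-off. The sketch below does that for Item 38 and, for contrast, Item 43. The function name and data layout are hypothetical, and the .4 default follows the Stevens (1992) guideline cited earlier rather than any universal rule.

```python
def primary_loading(loadings, cutoff=0.4):
    """Return (factor, loading, meets_cutoff) for an item's strongest loading."""
    factor, loading = max(loadings.items(), key=lambda pair: abs(pair[1]))
    return factor, loading, abs(loading) >= cutoff

# Loadings as reported on the slides; the dictionary layout is our own.
items = {
    "Item 43": {"F1": 0.63},
    "Item 38": {"F1": 0.29, "F3": 0.33},
}
for name, loadings in items.items():
    factor, loading, ok = primary_loading(loadings)
    status = "retain" if ok else "below cut-off"
    print(f"{name}: {factor} = {loading:.2f} ({status})")
```

Item 38 fails the check on its best factor and also loads almost as strongly on a second one, which is the pattern the slide warns about.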

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
• Helps researchers draw conclusions from a
large number of items through data reduction
• Requires a large N-size and several reference
books
• Often written up with no regard to APA
guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 53

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
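
As a rough illustration of these two measures, the sketch below computes the mean and the sample standard deviation for two small sets of test scores. The class names and scores are invented for illustration and do not come from the presentation; the point is only that two groups can share a mean while differing widely in spread.

```python
import math

def mean(scores):
    """Central tendency: the arithmetic average of the scores."""
    return sum(scores) / len(scores)

def sample_sd(scores):
    """Variability: sample standard deviation (divides by n - 1)."""
    m = mean(scores)
    sum_of_squares = sum((x - m) ** 2 for x in scores)
    return math.sqrt(sum_of_squares / (len(scores) - 1))

# Hypothetical test scores for two classes (illustrative only)
class_a = [70, 72, 68, 71, 69]
class_b = [50, 90, 60, 80, 70]

print(mean(class_a), round(sample_sd(class_a), 2))
print(mean(class_b), round(sample_sd(class_b), 2))
```

Both hypothetical classes have the same mean, but the second has a much larger SD, which is why the two figures are normally reported together.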

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical (skewed)

Flat or peaked (kurtosis)
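
One way to put numbers on "not symmetrical" and "flat or peaked" is to compute skewness and kurtosis. The sketch below uses the simple population-moment formulas, which is an assumption on our part; statistical packages often report bias-adjusted versions that differ slightly for small samples. The scores are invented for illustration.

```python
def shape_statistics(scores):
    """Return (skewness, excess kurtosis) using population moments."""
    n = len(scores)
    m = sum(scores) / n
    m2 = sum((x - m) ** 2 for x in scores) / n  # variance (population)
    m3 = sum((x - m) ** 3 for x in scores) / n  # third central moment
    m4 = sum((x - m) ** 4 for x in scores) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5                   # 0 for a symmetrical set
    excess_kurtosis = m4 / m2 ** 2 - 3          # 0 for a normal curve
    return skewness, excess_kurtosis

symmetric_scores = [1, 2, 3, 4, 5]   # hypothetical, symmetrical
skewed_scores = [1, 1, 1, 1, 6]      # hypothetical, long right tail

print(shape_statistics(symmetric_scores))
print(shape_statistics(skewed_scores))
```

A skewness near zero suggests a symmetrical distribution; negative excess kurtosis suggests a flatter curve than normal and positive excess kurtosis a more peaked one.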

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.” (Brown, 1988)

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (within-subject or repeated measures design)
– Paired samples t-test
– Matched pairs t-test
– Dependent means t-test
Two groups (between-group or between-subjects design)
– Independent samples t-test
– Independent measures t-test
– Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
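
The two designs above can be sketched in a few lines of code. This is a minimal illustration, not the procedure used in any study cited here: the function names and scores are hypothetical, the independent-samples version assumes equal variances (pooled), and the p-value would still have to be looked up from a t table or obtained from statistical software.

```python
import math
from statistics import mean, variance  # variance() is the sample variance

def paired_t(time1, time2):
    """Paired samples t: one group measured at two times."""
    diffs = [after - before for before, after in zip(time1, time2)]
    n = len(diffs)
    standard_error = math.sqrt(variance(diffs) / n)
    return mean(diffs) / standard_error, n - 1  # (t, df)

def independent_t(group_a, group_b):
    """Independent samples t with pooled variance: two separate groups."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / standard_error, na + nb - 2

# Hypothetical vocab scores before and after extensive reading
print(paired_t([10, 12, 9, 11], [12, 14, 10, 14]))
# Hypothetical test scores for explicit vs. implicit grammar groups
print(independent_t([4, 5, 6], [1, 2, 3]))
```

Note that the degrees of freedom for the independent design are the two group sizes added together minus 2, so groups of 62 and 54 give df = 114.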

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French

Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
Alpha of .05 = a 5% risk per comparison of finding a difference that is only due to chance
20 comparisons = about 1 expected by chance
100 comparisons = about 5 expected by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level ÷ No. of comparisons = 0.05 ÷ 5 = 0.01
…and use this new figure as your alpha level.
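
The adjustment can be sketched as follows. The p-values are the eight reported on the Macaro & Erler slide; the function names are our own, and the familywise error calculation assumes the comparisons are independent, which is a simplification.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Adjusted alpha level: the original alpha divided by the number of tests."""
    return alpha / n_comparisons

def familywise_error(alpha, n_comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

# p-values from the attitudes-to-French results (8 comparisons)
p_values = [.001, .024, .023, .001, .001, .001, .004, .005]

adjusted = bonferroni_alpha(0.05, len(p_values))   # 0.05 / 8 = 0.00625
significant = [p < adjusted for p in p_values]

print(adjusted)
print(significant)
print(round(familywise_error(0.05, 20), 2))  # risk with 20 unadjusted tests
```

With eight comparisons the adjusted alpha is .00625, which is consistent with the *p < .006 threshold on that slide: the Speaking and Writing results fall below .05 but not below the adjusted level.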

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is

Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means M1, M2, and M3 for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups.
• However, t-tests work best when limited to 2 groups.
• ANOVAs can work with 3 or more groups without the inflated Type I error that comes from running multiple t-tests.


ANOVAs in Language Research

• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example

• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
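
To make the vocabulary example concrete, here is a minimal sketch of how the F-statistic of a one-way ANOVA is computed. The scores are invented for illustration and the function name is our own; a real analysis would normally be run in a statistics package, which also supplies the p-value.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical vocab test scores (illustrative only)
word_cards = [11, 12, 13]
word_lists = [12, 13, 14]
pc_software = [15, 16, 17]

print(one_way_anova([word_cards, word_lists, pc_software]))
```

The larger the differences between the group means relative to the spread inside each group, the larger F becomes.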

Peer Review Survey Example

• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1–5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results

Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
[Figure: published tables of descriptive statistics and ANOVA statistics]
F-statistic is significant…i.e. our model seems to work

F-statistic cont.

• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
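
Eta-squared can be computed as the between-groups sum of squares divided by the total sum of squares. The sketch below shows that calculation and labels the result using the benchmarks listed above; the scores are invented, the function names are our own, and the "negligible" label for values under .01 is our addition rather than part of the benchmarks.

```python
def eta_squared(groups):
    """Proportion of total variance accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def effect_label(eta2):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"  # our own label for values below .01

# Hypothetical scores for three groups (illustrative only)
scores = [[11, 12, 13], [12, 13, 14], [15, 16, 17]]
value = eta_squared(scores)
print(round(value, 4), effect_label(value))
print(effect_label(0.02))
```

A result can reach statistical significance and still have a small eta-squared, which is why the two are reported side by side.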

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA.
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion

• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!

Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent

Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
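
For readers who want to see what Cronbach's alpha is made of, the sketch below applies the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The responses are invented for illustration and the function name is our own.

```python
from statistics import variance  # sample variance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of participants and columns of items."""
    k = len(responses[0])  # number of items in the scale
    item_variances = [variance([row[i] for row in responses])
                      for i in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical Likert responses: 4 participants x 2 items
two_items = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(cronbach_alpha(two_items))

# Hypothetical responses where every item gets the same answer
identical_items = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
print(cronbach_alpha(identical_items))
```

Alpha rises when items correlate more strongly with each other and, for a given level of correlation, when more items are added to the scale.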

Determining factors, then items

• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor
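
The arithmetic behind these loading examples is simply squaring the loading. The short sketch below repeats it for the loadings quoted for Item 43 and Item 38; the function name is our own.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    return loading ** 2

# Loadings quoted in the two literature examples
examples = [
    ("Item 43 on F1", 0.63),
    ("Item 38 on F3", 0.33),
    ("Item 38 on F1", 0.29),
]

for label, loading in examples:
    print(f"{label}: loading {loading:.2f}, "
          f"shared variance {shared_variance(loading):.0%}")
```

Seen this way, a loading of .33 means the item shares only about a tenth of its variance with the factor, which is why low cut-off points are a concern.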

Conclusions regarding FA

• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…

• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 55

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I: Written peer review
– Group II: Oral peer review
– Group III: PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1–5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

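The F-statistic
• The higher the better (for the model): F is the ratio of between-group variance to within-group variance, so a larger F means the group means differ more relative to the spread within groups
• A significant F-statistic (p < .05) is what researchers look for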

ANOVA in the Literature
[Figure: published example showing a descriptive statistics table and an ANOVA statistics table]
F-statistic is significant… i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But… the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size
Post-hoc results
• These are done if the F-statistic is significant
• Pairwise comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)
Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01: small effect
– .06: medium effect
– .14: large effect
*According to Cohen (1988)
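
A minimal sketch of how eta-squared is obtained from an ANOVA table and read against the Cohen (1988) benchmarks listed on the slide. Treating the benchmarks as hard cut-offs and the "negligible" label are assumptions made here for illustration, and the sums of squares in the example are invented.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    return ss_between / ss_total


def cohen_label(eta_sq):
    """Label an eta-squared value using the benchmarks from the slide
    (.01 small, .06 medium, .14 large), treated here as cut-offs."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch, not from the slide


# Hypothetical sums of squares, invented for illustration.
effect = eta_squared(ss_between=10.0, ss_total=500.0)
print(effect, cohen_label(effect))
```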

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the […] indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”
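
To see which comparisons a post-hoc procedure has to make, the sketch below lists every pair of groups and the raw difference between the means quoted above. It only enumerates the pairs; it does not test significance, which is the job of the post-hoc test itself. The function name is an assumption made for this example.

```python
from itertools import combinations

# Group means taken from the quoted report above.
group_means = {"Group 1": 21.36, "Group 2": 22.10, "Group 3": 22.96}


def pairwise_mean_differences(means):
    """Return the difference in means for every pair of groups."""
    return {
        (a, b): round(means[b] - means[a], 2)
        for a, b in combinations(means, 2)
    }


differences = pairwise_mean_differences(group_means)
for pair, diff in differences.items():
    print(pair, diff)
```

With 3 groups there are 3 pairs to compare; the number of pairs grows quickly as groups are added, which is why the alpha level has to be protected.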


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

[Figure: published example, labeled “Post-hoc results” and “ANOVA table”]
“[This table] shows the result from running through an ANOVA by using SPSS. It can be seen that the difference among treatments is significant (p < 0.05). The scores for the Vocabulary condition were much higher than the other conditions. The Main Character condition was slightly higher than the Combined condition.”

Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Factors: Integrative, Instrumental, Self-competence

Terminology
• Factor: the latent construct
• Variance: different answers to each item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
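
For readers who want to see where Cronbach's alpha comes from, here is a minimal sketch using the standard formula: alpha = k / (k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The questionnaire responses are invented for illustration, and the function name is an assumption rather than anything from the presentation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, one row of item scores per participant."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least 2 items")

    # Variance of each item (column) across participants.
    item_columns = list(zip(*responses))
    sum_item_variances = sum(variance(column) for column in item_columns)

    # Variance of each participant's total score.
    total_variance = variance([sum(row) for row in responses])

    return (n_items / (n_items - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical Likert responses: 5 participants x 3 items (invented data).
likert_responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(likert_responses)
print(round(alpha, 2))
```

A high alpha shows only that the items vary together; as the slide notes, it says nothing about whether the items are valid.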

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. about 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
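
The shared-variance figure is simply the loading squared. A small sketch of that step, using the two loadings quoted in the slides (Item 43 on F1 and Item 38 on F3); the function and variable names are assumptions made for illustration.

```python
def shared_variance(loading):
    """Proportion of variance an item shares with a factor (loading squared)."""
    if not -1.0 <= loading <= 1.0:
        raise ValueError("a loading must lie between -1 and 1")
    return loading ** 2


# Loadings quoted in the slides.
loadings = {"Item 43 on F1": 0.63, "Item 38 on F3": 0.33}

for item, loading in loadings.items():
    print(f"{item}: loading {loading:.2f}, shared variance {shared_variance(loading):.0%}")
```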

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)
Ex: 30-item questionnaire, 3-4 factors, 150-300 participants
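
The participants-per-item rule of thumb is easy to apply to any questionnaire. A sketch, with a function name invented for this example:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by the 5-10 participants per item rule."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return n_items * per_item_low, n_items * per_item_high


low, high = participants_needed(30)
print(f"30 items: {low}-{high} participants")
```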

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, i.e. about 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Tokyo, Japan, November 25, 2007

Slide 58

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but rather…
• To help everyone better understand and interpret common statistical methods encountered in SLA studies
• To cover common errors and issues related to each procedure


Overview
• Introduction
• Descriptive Statistics – Harumi
• T-tests – Philip
• One-way ANOVA – Peter
• Factor Analysis – Matthew
• Q&A


Outline


Each presenter will introduce:
– The function of the procedure
– Important underlying concepts
– Its use in SLA research
– An example of the procedure in action
– Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q2: Values of statistical studies?
Individual behavior & Group phenomena
• Quantifiable data
• Structured with definite procedures
• Follow logical steps
• Replicable
• Reductive
• PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean: central tendency
Standard Deviation: variability from the mean
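
As a rough illustration of these two measures, the sketch below computes them with Python's standard `statistics` module. The scores are invented for the example and do not come from the talk.

```python
import statistics

# Hypothetical vocabulary test scores for one class (illustrative only).
scores = [12, 15, 15, 18, 20]

mean = statistics.mean(scores)   # central tendency
sd = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)

print(f"M = {mean:.2f}, SD = {sd:.2f}")
```

A small SD means the scores cluster near the mean; a large SD means they are spread widely around it.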

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical (skewed)

Flat or peaked (kurtosis)
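
One way to put numbers on "how normal" is to compute skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The following is a minimal sketch using population moments; the function names and data sets are hypothetical, and statistical packages often apply small-sample corrections that this version omits.

```python
import statistics

def skewness(xs):
    """Population skewness: 0 for a symmetrical distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Population excess kurtosis: negative = flatter, positive = more peaked."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]    # invented data, evenly spread
skewed = [1, 1, 1, 2, 10]      # invented data with one very high score

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(skewed), excess_kurtosis(skewed))
```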

Issues
Are the data appropriate for further statistical analyses?
• Mean and SD
• Participants: N size and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
• Mean and Standard Deviation
• Normal Distribution
• These concepts “are central to all statistical research and sometimes forgotten by researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206)
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
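
The paired design above can be sketched in code. This is a minimal illustration, not the presenters' procedure: `paired_t` is a hypothetical helper and the scores are invented. It computes only the t statistic and degrees of freedom; a p-value would still need a t table or a statistics package.

```python
import math
import statistics

def paired_t(time1, time2):
    """Return (t, df) for a paired samples t-test on two score lists."""
    diffs = [b - a for a, b in zip(time1, time2)]   # each learner's change
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical vocab test scores for the same five learners.
before_reading = [10, 12, 9, 14, 11]
after_reading = [13, 14, 10, 17, 13]

t, df = paired_t(before_reading, after_reading)
print(f"t = {t:.2f}, df = {df}")
```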

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
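
For the two-group design, a comparable sketch follows. It assumes equal variances in the two groups (the pooled-variance form of the test); `independent_t` is a hypothetical helper and the scores are invented.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for an independent samples t-test with pooled variance."""
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n1 + n2 - 2

# Hypothetical test scores for two separate classes.
implicit = [60, 65, 70, 75, 80]
explicit = [70, 75, 80, 85, 90]

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

Note that the degrees of freedom are the two group sizes added together, minus 2.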

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French
Area              t      df    p
Reading           4.91   114   .001*
Speaking          2.28   114   .024
Writing           2.30   114   .023
Listening         4.12   114   .001*
Spelling          3.74   114   .001*
General learning  3.61   114   .001*
Homework          2.92   114   .004*
Textbook          3.01   114   .005*
*p < .006

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
Alpha level of .05 = a 5% risk, on each comparison, of finding a “significant” difference that is really due to chance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
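
The arithmetic behind these figures can be sketched as follows. The sketch assumes the comparisons are independent and that no real differences exist; `familywise_error` is a hypothetical helper name.

```python
def familywise_error(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

alpha = 0.05
for k in (1, 5, 20, 100):
    expected = alpha * k   # expected number of chance "significant" results
    print(f"{k:>3} comparisons: expect {expected:.2f} by chance, "
          f"P(at least one) = {familywise_error(alpha, k):.2f}")
```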

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.
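
The adjustment can be checked in a few lines. The sketch below is an illustration rather than part of the talk: `bonferroni_alpha` is a hypothetical helper, and the p-values are the eight reported in the Macaro & Erler attitudes results. Reading that slide's starred threshold of .006 as .05 / 8 is an assumption.

```python
def bonferroni_alpha(alpha, comparisons):
    """Divide the alpha level by the number of comparisons."""
    return alpha / comparisons

# p-values as reported for the eight attitude areas.
p_values = {
    "Reading": 0.001, "Speaking": 0.024, "Writing": 0.023, "Listening": 0.001,
    "Spelling": 0.001, "General learning": 0.001, "Homework": 0.004,
    "Textbook": 0.005,
}

adjusted = bonferroni_alpha(0.05, len(p_values))
significant = [area for area, p in p_values.items() if p < adjusted]

print(f"adjusted alpha = {adjusted}")
print("still significant:", significant)
```

Under this reading, Speaking and Writing fall below .05 but do not survive the adjustment.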

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


• Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
Similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
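• Both procedures look for significant mean differences between groups
• However, t-tests work best when limited to 2 groups
• ANOVAs can work with 3 or more groups while introducing less error (one overall test rather than several pairwise t-tests, each of which adds to the risk of Type I error)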


ANOVAs in Language Research


• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example


• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are analyzed with ANOVA to see if any groups scored significantly higher/lower
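
To make the procedure concrete, here is a minimal sketch of how the F-statistic for such a design could be computed by hand. The function name and the scores are invented for illustration; a real study would use far larger groups and a statistics package that also reports the p-value.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical vocab test scores for the three learning groups.
word_cards = [4, 5, 6]
word_lists = [5, 6, 7]
pc_software = [8, 9, 10]

f, df_between, df_within = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({df_between}, {df_within}) = {f:.2f}")
```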

Peer Review Survey Example


• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are analyzed with ANOVA for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
• Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic


• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: published example showing a descriptive statistics table and an ANOVA statistics table]
F-statistic is significant, i.e. our model seems to work

F-statistic cont.


• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of the total variance explained by the treatments
– .01: small effect
– .06: medium effect
– .14: large effect
*According to Cohen (1988)
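
Eta-squared is the between-groups sum of squares divided by the total sum of squares. The sketch below shows that calculation and applies the thresholds listed above. The helper names and scores are invented, and labelling values under .01 as "negligible" is an assumption added for the example.

```python
import statistics

def eta_squared(groups):
    """Proportion of total variance explained by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def cohen_label(eta2):
    """Label an eta-squared value using the .01 / .06 / .14 thresholds."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"   # assumed label for values under .01

# Hypothetical scores for three groups.
groups = [[4, 5, 6], [5, 6, 7], [8, 9, 10]]

eta2 = eta_squared(groups)
print(f"eta-squared = {eta2:.2f} ({cohen_label(eta2)} effect)")
```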

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the … indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results
[Figure: ANOVA table from a published study]
“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA.
• No mention of post-hoc results.
– Which groups were significantly different from each other?
• No mention of effect size.
– What was the magnitude of the treatment effect?

Conclusion


• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!


• Factor loading
– The correlation between an item and the factor; the squared loading is the proportion of variance the item shares with the factor
– Factor loadings above .4 are desirable, above .7 are excellent

• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (i.e., other things being equal, the more items, the greater the alpha)
– Does not “prove” cause-effect or validity of items themselves
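
For readers who want to see what goes into Cronbach’s alpha, the sketch below computes it from the usual formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the Likert responses are invented for illustration.

```python
import statistics

def cronbach_alpha(responses):
    """responses: one list per participant, one score per item."""
    k = len(responses[0])                     # number of items
    items = list(zip(*responses))             # regroup the scores by item
    item_variances = sum(statistics.variance(item) for item in items)
    total_variance = statistics.variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical 1-5 Likert responses: 5 participants x 3 items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```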

Determining factors, then items




• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 × .63 ≈ .40, i.e. about 40% of variance shared with the factor
(Above .4 is acceptable according to Stevens, 1992)
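
The squaring step can be written out as a one-line calculation. `shared_variance` is a hypothetical helper name, and the second loading of .33 is included only as an example of a weak loading.

```python
def shared_variance(loading):
    """Proportion of an item's variance shared with the factor."""
    return loading ** 2

for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```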

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
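
The participant range in this example follows directly from the 5-10 participants per item rule of thumb. A minimal sketch, with a hypothetical function name:

```python
def participants_needed(n_items, per_item_low=5, per_item_high=10):
    """Rule-of-thumb N range: 5 to 10 participants for each questionnaire item."""
    return n_items * per_item_low, n_items * per_item_high

low, high = participants_needed(30)
print(f"30 items -> {low}-{high} participants")
```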

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 × .33 ≈ .11, i.e. about 11% of variance shared with the factor

Conclusions regarding FA



• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…


• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 59

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).

Example: Independent samples ttest
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group:
Control group:

N = 62
N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler
(2007)

Results of attitudes to French
Area
Reading
Speaking
Writing
Listening
Spelling
General learning
Homework
Textbook
*p < .006

t = 4.91, df = 114, p = .001*
t = 2.28, df = 114, p = .024
t = 2.30, df = 114, p = .023
t = 4.12, df = 114, p = .001*
t = 3.74, df = 114, p = .001*
t = 3.61, df = 114, p = .001*
t = 2.92, df = 114, p = .004*
t = 3.01, df = 114, p = .005*

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
95% level of significance = 95% sure difference is not by chance
20 comparisons = 1 by chance
100 comparisons = 5 by chance

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple
comparisons…
Alpha level
No. of comparisons
0.05
5
= 0.01
…and use this new figure as your alpha level.

Issues
• does the data meet normality assumptions?
• is the sample size large enough?

• is the data continuous?
• is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is


Function
–

ANalysis Of VAriance - a search for mean
differences between data sets
– One-way ANOVA - looking for significant
differences in the mean scores of 2 or
more groups
– Why “one-way?” - looking at the effect that
changing one variable has on the study’s
participants

ANOVA Example 1
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●
M1●
M3 ●
similar means (M) = non-significant (p > .05)

ANOVA Example 2
Control
Group

Treatment 1
Group

Treatment 2
Group

M2●

M3 ●

M1●
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
Both procedures look for significant
mean differences between groups;
 However, t-tests work best when
limited to 2 groups.
 ANOVAs can work with 3 or more
groups while introducing less error.


ANOVAs in Language Research


Often used to compare:
–

Assessment scores
– Survey responses


A typical situation may be to try
different treatments/methods with 3
different groups and then testing them
to see if the results show any
significant differences

Vocabulary Testing Example


3 learning groups with equivalent starting
vocabulary range
–
–
–



Group I learns with word cards
Group II learns with word lists
Group III learns with PC software

After several weeks of study, a vocab test is
given
 Results from the test are ANOVA analyzed to
see if any groups scored significantly
higher/lower

Peer Review Survey Example


3 learning groups in EFL writing courses
 Students peer review each other’s written work in
one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
 After the review sessions, peer review satisfaction
surveys are given using Likert (1~5) scales
 Results are ANOVA analyzed for significant
differences in satisfaction level among the review
types

Reporting One-way ANOVA
Results


Three basic components:
–

1) Table of descriptive statistics (mean,
standard deviation, etc)
– 2) An ANOVA table (degrees of freedom,
sum of squares, F-statistic)
– 3) A report of the post-hoc results with
effect size

The F-statistic


The higher the better (for the model)



A significant F-statistic (p < .05) is what
researchers look for

ANOVA in the Literature
Descriptive statistics

ANOVA statistics

F-statistic is
significant
…i.e. our model
seems to work

F-statistic cont.


Reaching significance indicates there
are statistically important differences
between some of the group means



But…the F-statistic doesn’t tell us
where the differences are



For that we turn to…

Post-hoc Results and
Effect Size
Post-hoc results

Effect size





These are done if the
F-statistic is significant
 Paired comparisons of
the group means
 Tell us where the
significant differences
lie
 Often reported in the
text (though sometimes
in table form)





Often referred to as ‘etasquared’ or ‘strength of
association’
Indicates the magnitude
of the difference between
means
Reflects the total variance
effected by the
treatments
–
–
–

.01 – small effect
.06 – medium effect
.14 – large effect


*According to Cohen
(1988)

Reporting Post-hoc Results and
Effect Size
“Post-hoc comparisons using the indicated
that the mean score for Group 1 (M=21.36,
SD=4.55) was significantly different from
Group 3 (M=22.96, SD=4.49). Group 2
(M=22.10, SD=4.15) did not differ
significantly from either Group 1 or 3.”
 “Despite reaching statistical significance, the
actual difference in group means was quite
small. The effect size, calculated using etasquared, was .02.”


Common ANOVA
Problems and Issues
 Starting

out with non-equivalent

groups
 Not reporting the type of ANOVA
performed
 Not reporting specific post-hoc
results
 Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
No mention of the type of ANOVA
No mention of post-hoc results.




–



Which groups were significantly different from each other?

No mention of effect size.
–

What was the magnitude of the treatment effect?

Conclusion


One-way ANOVAs are useful for looking at
the effect of changing one variable on 3 or
more equivalent groups
 Often used for testing treatment effects or
comparing survey results
 Involves a two-step process of analyzing the
model (through the F-statistic) and
performing post-hoc procedures
 Effect size (eta-squared) is an important
component indicating the magnitude of the
treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
 Measures

only one group or
sample population

 A “family”
–

of FA

PCA, FA, EFA, CFA…

FA: What it does
 Tests

the existence of underlying
(latent) constructs within a sample
population
–

Identifies patterns within large numbers
of participants

–

“Reduces” several items into a few
measurable factors

Uses of FA within SLA
Typically used with psychological
variables and Likert-scale
questionnaires
 Often a preliminary step before more
complicated statistical analyses


–

Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Selfcompetence

Terminology
Factor - the latent construct
 Variance - different answers to each
item (variable)


More Terminology!


Factor loading
–
–



Amount of shared variance between items and
the factor
Factor loadings above .4 are desirable, above .7
are excellent

Cronbach’s Alpha
–
–
–

Measurement of item-scale reliability
Based on inter-item correlation (i.e., the more
items, the greater the alpha)
Does not “prove” cause-effect or validity of items
themselves

Determining factors, then items




Researchers should determine the factors
before adding items to the questionnaire
–

Previous research results

–

Carefully constructed model

Items should be designed to relate to a
particular concept (factor)
–

“Borrow” items or develop them in a pilot

–

6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study
English, the more enjoyable
I find it”)
F1 (“Beliefs about a
contemporary
(communicative) orientation
to learning English”)
.630 loading
.63 X .63 = 40% of shared
variance with the factor
(Above .4 is acceptable
according to Stevens, 1992)

Assumptions of FA
Normal distribution
 Items are correlated above .3
 Large N-size


–
–

“Over 300” (Tabachnick & Fidell, 2007)
5-10 participants for each item
(Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants

Problems and issues with FA
“Fishing” for data (i.e., not reading the
literature, then simply allowing SPSS to
tell you what it finds)
 Not understanding the nature of factors
(i.e., using 2 or 3 items as a “factor” or
keeping too many factors)
 Using an arbitrary cut-off point for
factor loadings (typically .3, .32, .35)
 N-size far too small


Typical factor loading issues

Item 38 (“I am very aware that teachers/lecturers know
a lot more than I do and so I agree with what they say is
important rather than rely on my own judgment”)
F3, .33 loading ; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA



Horribly, horribly complicated

Typically used with questionnaires to reduce
individual items to factors for purposes of
correlation or prediction
 Helps researchers draw conclusions from a
large number of items through data reduction
 Requires a large N-size and several reference
books
 Often written up with no regard to APA
guidelines or previous research results

To sum up…


Descriptive statistics
–



T-tests
–



Dependent, independent, paired

One-way ANOVA
–



Mean and SD

F, effect size, post-hoc

Factor Analysis
–

Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007

Slide 60

CUE Forum 2007

Basic SLA Statistics for
the University Educator
Peter Neff
Harumi Kimura
Philip McNally
Matthew Apple
© 2007 JALT CUE SIG and individual presenters

Purpose


Not how to use statistics in a study, but
rather…

To help everyone better understand
and interpret common statistical
methods encountered in SLA studies
 To cover common errors and issues
related to each procedure


Overview
Introduction
 Descriptive Statistics – Harumi
 T-tests – Philip
 One-way ANOVA – Peter
 Factor Analysis – Matthew
 Q&A


Outline


Each presenter will introduce:
–

–
–
–
–

The function of the procedure
Important underlying concepts
Its use in SLA research
An example of the procedure in action
Errors and issues to look out for

Descriptive Statistics
Harumi Kimura
Nanzan University

Q1: Unreasonable fear?
Why do so many language teachers draw
back in terror when confronted with large
doses of numbers, tables, and statistics?

It is irresponsible to ignore such research
just because you do not have the relatively
simple tools for understanding it.
J.D. Brown (1988)

Q 2: Values of statistical studies?
Individual behavior & Group phenomena







Quantifiable data
Structured with definite procedures
Follow logical steps
Replicable
Reductive
PATTERNS

Q3: What do descriptive statistics
provide?
Snapshot description of the situation
observed

Numerical representations of how each
group performed on the measures

Readers can draw a mental picture

Q4: How do we manage the data?
Organize and present the data
for further analysis
We describe them in/as
Graphs
Figures

Two aspects of group behavior
Mean

Central tendency
Standard Deviation
Variability from
mean

Normal Distribution
a normal curve
Just as in the natural world …

Position of an individual
Within a group
or
Comparison of a group
with other groups

How normal?
Not symmetrical

Flat
or
Peaked

Issues
Are the data appropriate
for further statistical analyses?
 Mean

and SD

 Participants

N

size

and sampling

Mean and SD

Sampling: Random or Convenience

N size

To conclude
 Mean

and Standard Distribution
 Normal Distribution
 These

concepts “are central to all
statistical research and
sometimes forgotten by
researchers.”
Brown, 1988

t-tests
Philip McNally
Osaka International University

Function: Comparing two means
A t-test will…

…tell you whether there is a statistically significant
difference in the mean scores (Pallant, 2006, p.206).
a.) for two different groups, or
b.) for one group at two different times.

Types of t-test
One group (Within-subject or repeated measures design)
Paired samples t-test
Matched pairs t-test
Dependent means t-test
Two groups (Between-group or Between-subjects design)
Independent samples t-test
Independent measures t-test
Independent means t-test

Uses of t-tests
(T)he simplest form of experiment that can be done:
only one independent variable is manipulated in only
two ways and only one dependent variable is measured
(Field, 2003, p.207).

Example: Paired samples t-test
One group
Time 1: no extensive reading (IV); vocab test (DV).
Time 2: after extensive reading (IV); vocab test (DV).
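The slide does not show a formula, so the following is only a sketch of the standard paired-samples calculation: take each learner's Time 2 minus Time 1 difference, then divide the mean difference by its standard error. The scores and the `paired_t` helper are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired samples t-test on matched scores."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se_diff = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    return mean_diff / se_diff, n - 1

# Hypothetical vocab test scores for the same six learners
time1 = [12, 15, 11, 18, 14, 16]   # before extensive reading
time2 = [14, 18, 12, 21, 15, 19]   # after extensive reading

t, df = paired_t(time1, time2)
print(f"t = {t:.2f}, df = {df}")
```

The p-value is then looked up from a t table (or a statistics package) using the t value and df.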

Example: Independent samples t-test
Two groups
Group A - implicit grammar (IV); test (DV).
Group B - explicit grammar (IV); test (DV).
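For two separate groups, a comparable sketch is the classic pooled-variance (Student's) version of the independent samples t-test. This assumes roughly equal variances in the two groups, which the slide does not discuss; the scores and the `independent_t` helper are hypothetical.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df) for a pooled-variance independent samples t-test."""
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = statistics.mean(group_a), statistics.mean(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))   # standard error of the difference
    return (m1 - m2) / se, n1 + n2 - 2

# Hypothetical grammar test scores
implicit = [68, 72, 75, 70, 74]   # Group A
explicit = [78, 80, 76, 83, 79]   # Group B

t, df = independent_t(implicit, explicit)
print(f"t = {t:.2f}, df = {df}")
```

Note that the degrees of freedom are the two group sizes added together, minus 2.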

Example: Macaro & Erler (2007)
A longitudinal study of 11-12 year old British learners of French.
The effect of reading strategy instruction.
Treatment group: N = 62
Control group: N = 54

Measures taken of reading comprehension, reading strategy use, and
attitudes to French before and after the intervention.

Interpreting the data: Macaro & Erler (2007)

Results of attitudes to French

Area               t      df    p
Reading            4.91   114   .001*
Speaking           2.28   114   .024
Writing            2.30   114   .023
Listening          4.12   114   .001*
Spelling           3.74   114   .001*
General learning   3.61   114   .001*
Homework           2.92   114   .004*
Textbook           3.01   114   .005*
*p < .006
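The starred results can be checked with a short sketch. The p-values are those listed on the slide, paired with the eight areas in the order given. That the .006 cut-off comes from dividing .05 by the eight comparisons is an inference here, not something the slide states.

```python
# p-values as listed on the slide for the eight attitude areas
p_values = {
    "Reading": .001, "Speaking": .024, "Writing": .023, "Listening": .001,
    "Spelling": .001, "General learning": .001, "Homework": .004, "Textbook": .005,
}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)   # assumed adjustment: 0.05 / 8 = 0.00625

# Areas whose p-value falls below the adjusted alpha
significant = [area for area, p in p_values.items() if p < adjusted_alpha]
print(significant)
```

Speaking and Writing would count as significant at .05, but not once the alpha level is adjusted for eight comparisons.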

Types of error
Type I error
You think you’ve got significance, but you haven’t.
You should have adjusted your alpha value if you made multiple
comparisons.
Type II error
You think the difference between the means was by chance.
It wasn’t, but because you adjusted for multiple comparisons
the data failed to reach significance.

Controlling for Type I error
Alpha = .05 (95% level): each comparison carries a 5% risk of a “significant” result that is really due to chance
20 comparisons = about 1 significant by chance
100 comparisons = about 5 significant by chance
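The arithmetic behind these figures can be sketched as follows. The second function adds a related figure not shown on the slide: the chance of at least one Type I error across all comparisons, assuming the comparisons are independent. Both function names are hypothetical.

```python
def expected_false_positives(alpha, comparisons):
    """Average number of chance 'significant' results when no real differences exist."""
    return alpha * comparisons

def familywise_error_rate(alpha, comparisons):
    """Probability of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 20, 100):
    print(k, expected_false_positives(0.05, k), round(familywise_error_rate(0.05, k), 3))
```

With 20 comparisons at alpha = .05, the chance of at least one false positive is roughly 64%, which is why an adjustment is needed.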

Controlling for Type I error
So, we have to make a Bonferroni adjustment if we make multiple comparisons…
Alpha level / No. of comparisons = 0.05 / 5 = 0.01
…and use this new figure as your alpha level.

Issues
• Does the data meet normality assumptions?
• Is the sample size large enough?
• Is the data continuous?
• Is Type I error controlled for?

One-way ANOVA
Peter Neff
Doshisha University

What it is
Function
– ANalysis Of VAriance - a search for mean differences between data sets
– One-way ANOVA - looking for significant differences in the mean scores of 2 or more groups
– Why “one-way?” - looking at the effect that changing one variable has on the study’s participants

ANOVA Example 1
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
Similar means (M) = non-significant (p > .05)

ANOVA Example 2
[Figure: group means M1, M2, and M3 plotted for the Control, Treatment 1, and Treatment 2 groups]
M1 significantly different from M2 & M3, but…
M2 & M3 not significantly different from each other




ANOVAs and T-tests
• Both procedures look for significant mean differences between groups
• However, t-tests work best when limited to 2 groups
• ANOVAs can work with 3 or more groups while introducing less (Type I) error than repeated t-tests


ANOVAs in Language Research
• Often used to compare:
– Assessment scores
– Survey responses
• A typical situation may be to try different treatments/methods with 3 different groups and then test them to see if the results show any significant differences

Vocabulary Testing Example
• 3 learning groups with equivalent starting vocabulary range
– Group I learns with word cards
– Group II learns with word lists
– Group III learns with PC software
• After several weeks of study, a vocab test is given
• Results from the test are ANOVA analyzed to see if any groups scored significantly higher/lower
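To show what the ANOVA step involves, here is a minimal sketch of the one-way ANOVA arithmetic: between-group and within-group sums of squares, their degrees of freedom, and the F-statistic. The vocab scores and the `one_way_anova` helper are hypothetical, and equal-sized groups are used only for simplicity.

```python
import statistics

def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and F for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f_stat}

# Hypothetical vocab test scores
word_cards = [14, 16, 15, 18, 17]    # Group I
word_lists = [12, 13, 15, 11, 14]    # Group II
pc_software = [16, 19, 18, 20, 17]   # Group III

result = one_way_anova([word_cards, word_lists, pc_software])
print(f"F({result['df_between']}, {result['df_within']}) = {result['F']:.2f}")
```

The sketch stops at F; the p-value comes from an F table or a statistics package such as SPSS.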

Peer Review Survey Example
• 3 learning groups in EFL writing courses
• Students peer review each other’s written work in one of 3 ways
– Group I – Written peer review
– Group II – Oral peer review
– Group III – PC-based peer review
• After the review sessions, peer review satisfaction surveys are given using Likert (1-5) scales
• Results are ANOVA analyzed for significant differences in satisfaction level among the review types

Reporting One-way ANOVA Results
Three basic components:
– 1) Table of descriptive statistics (mean, standard deviation, etc.)
– 2) An ANOVA table (degrees of freedom, sum of squares, F-statistic)
– 3) A report of the post-hoc results with effect size

The F-statistic
• The higher the better (for the model)
• A significant F-statistic (p < .05) is what researchers look for

ANOVA in the Literature
[Figure: published tables of descriptive statistics and ANOVA statistics]
F-statistic is significant…i.e. our model seems to work

F-statistic cont.
• Reaching significance indicates there are statistically significant differences between some of the group means
• But…the F-statistic doesn’t tell us where the differences are
• For that we turn to…

Post-hoc Results and Effect Size

Post-hoc results
• These are done if the F-statistic is significant
• Paired comparisons of the group means
• Tell us where the significant differences lie
• Often reported in the text (though sometimes in table form)

Effect size
• Often referred to as ‘eta-squared’ or ‘strength of association’
• Indicates the magnitude of the difference between means
• Reflects the proportion of total variance accounted for by the treatments
– .01 – small effect
– .06 – medium effect
– .14 – large effect
*According to Cohen (1988)
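Eta-squared for a one-way ANOVA is usually computed as the between-groups sum of squares divided by the total sum of squares. The sketch below assumes that formula and applies the Cohen (1988) benchmarks listed above; the function names, the "negligible" label for values under .01, and the sums of squares are all hypothetical.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variance accounted for by group membership."""
    return ss_between / ss_total

def cohen_label(eta_sq):
    """Label an eta-squared value using the .01 / .06 / .14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"   # label assumed here; not part of the benchmarks above

# Hypothetical values read from an ANOVA table
effect = eta_squared(ss_between=20, ss_total=200)
print(effect, cohen_label(effect))
```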

Reporting Post-hoc Results and Effect Size
• “Post-hoc comparisons using the indicated that the mean score for Group 1 (M=21.36, SD=4.55) was significantly different from Group 3 (M=22.96, SD=4.49). Group 2 (M=22.10, SD=4.15) did not differ significantly from either Group 1 or 3.”
• “Despite reaching statistical significance, the actual difference in group means was quite small. The effect size, calculated using eta-squared, was .02.”


Common ANOVA Problems and Issues
• Starting out with non-equivalent groups
• Not reporting the type of ANOVA performed
• Not reporting specific post-hoc results
• Not reporting effect size

Post-hoc results


ANOVA table



“[This table] shows the result from running through an ANOVA by using
SPSS. It can be seen that the difference among treatments is significant
(p < 0.05). The scores for the Vocabulary condition were much higher than
the other conditions. The Main Character condition was slightly higher than
the Combined condition.”



Problems
• No mention of the type of ANOVA
• No mention of post-hoc results
– Which groups were significantly different from each other?
• No mention of effect size
– What was the magnitude of the treatment effect?

Conclusion
• One-way ANOVAs are useful for looking at the effect of changing one variable on 3 or more equivalent groups
• Often used for testing treatment effects or comparing survey results
• Involves a two-step process of analyzing the model (through the F-statistic) and performing post-hoc procedures
• Effect size (eta-squared) is an important component indicating the magnitude of the treatment effect

Factor Analysis
Matthew Apple
Doshisha University

FA: What it is
• Measures only one group or sample population
• A “family” of FA
– PCA, FA, EFA, CFA…

FA: What it does
• Tests the existence of underlying (latent) constructs within a sample population
– Identifies patterns within large numbers of participants
– “Reduces” several items into a few measurable factors

Uses of FA within SLA
• Typically used with psychological variables and Likert-scale questionnaires
• Often a preliminary step before more complicated statistical analyses
– Correlational Analysis
– Multiple Regression
– Structural Equation Modeling

Example questionnaire factors
1. 英語で外国人と話しがしたい。
I would like to communicate with foreigners in English.
2. 英語習得は自分の教養を高めるのに必要だ。
English is essential for personal development.
3. 日本語でも自分がうまく表現できない。
I am not good at expressing myself even in Japanese.
4. 外国の音楽と文化に興味がある。
I am interested in foreign music and culture.
5. 英語は社会で活躍するのに必要だ。
English is essential to be active in society.
6. 難しいトピックに関しても、自分の意見が言える。
I can express my own opinions even about difficult topics.

Integrative
Instrumental
Self-competence

Terminology
• Factor - the latent construct
• Variance - different answers to each item (variable)


More Terminology!
• Factor loading
– Correlation between an item and the factor; the squared loading is the amount of variance they share
– Factor loadings above .4 are desirable, above .7 are excellent
• Cronbach’s Alpha
– Measurement of item-scale reliability
– Based on inter-item correlation and the number of items (the more items, the greater the alpha tends to be)
– Does not “prove” cause-effect or validity of items themselves
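As a sketch of how Cronbach's alpha is obtained, the code below uses the standard variance-based formula: alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The Likert responses and the `cronbach_alpha` helper are hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per item, each with one score per respondent."""
    k = len(items)
    item_variances = [statistics.variance(scores) for scores in items]
    # Each respondent's total score across all items
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical 5-point Likert responses from five respondents to three items
item1 = [4, 5, 3, 2, 4]
item2 = [4, 4, 3, 2, 5]
item3 = [5, 5, 2, 3, 4]

print(round(cronbach_alpha([item1, item2, item3]), 2))
```

Items that rise and fall together across respondents push alpha toward 1; unrelated items push it toward 0.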

Determining factors, then items
• Researchers should determine the factors before adding items to the questionnaire
– Previous research results
– Carefully constructed model
• Items should be designed to relate to a particular concept (factor)
– “Borrow” items or develop them in a pilot
– 6-8 items for a robust factor

FA in the literature
Item 43 (“The more I study English, the more enjoyable I find it”)
F1 (“Beliefs about a contemporary (communicative) orientation to learning English”)
.630 loading
.63 X .63 = 40% of shared variance with the factor
(Above .4 is acceptable according to Stevens, 1992)
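The squaring step can be sketched in a couple of lines. The `shared_variance` helper is hypothetical; the .63 loading is the one shown above, and .33 is included as an example of a low loading.

```python
def shared_variance(loading):
    """Squared factor loading: proportion of variance an item shares with the factor."""
    return loading ** 2

for loading in (0.63, 0.33):
    print(f"loading {loading:.2f} -> {shared_variance(loading):.0%} shared variance")
```

A loading that looks moderate can therefore correspond to a small share of variance once it is squared.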

Assumptions of FA
• Normal distribution
• Items are correlated above .3
• Large N-size
– “Over 300” (Tabachnick & Fidell, 2007)
– 5-10 participants for each item (Field, 2005)

Ex: 30-item questionnaire
3-4 factors
150-300 participants
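The participant range in this example follows from the 5-10 participants per item guideline. A minimal sketch, with a hypothetical `recommended_n` helper:

```python
def recommended_n(num_items, per_item_low=5, per_item_high=10):
    """Return the (minimum, maximum) N suggested by a participants-per-item rule."""
    return num_items * per_item_low, num_items * per_item_high

low, high = recommended_n(30)
print(f"30 items: {low}-{high} participants")
```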

Problems and issues with FA
• “Fishing” for data (i.e., not reading the literature, then simply allowing SPSS to tell you what it finds)
• Not understanding the nature of factors (i.e., using 2 or 3 items as a “factor” or keeping too many factors)
• Using an arbitrary cut-off point for factor loadings (typically .3, .32, .35)
• N-size far too small


Typical factor loading issues
Item 38 (“I am very aware that teachers/lecturers know a lot more than I do and so I agree with what they say is important rather than rely on my own judgment”)
F3, .33 loading; F1, .29 loading
.33 X .33 = 11% of shared variance with the factor

Conclusions regarding FA
• Horribly, horribly complicated
• Typically used with questionnaires to reduce individual items to factors for purposes of correlation or prediction
• Helps researchers draw conclusions from a large number of items through data reduction
• Requires a large N-size and several reference books
• Often written up with no regard to APA guidelines or previous research results

To sum up…
• Descriptive statistics
– Mean and SD
• T-tests
– Dependent, independent, paired
• One-way ANOVA
– F, effect size, post-hoc
• Factor Analysis
– Factor, variance, factor loading

Thank you for attending!
CUE SIG Forum 2007
JALT 2007 International Conference
Yoyogi Olympic Memorial Youth Center
Tokyo, Japan, November 25, 2007
