Funded through the ESRC's Researcher Development Initiative
Session 3.3: Inter-rater reliability
Prof. Herb Marsh, Ms. Alison O'Mara, Dr. Lars-Erik Malmberg
Department of Education, University of Oxford
[Flow chart: stages of the meta-analysis process - establish research question; define relevant studies; locate and collate studies; develop code materials; pilot coding; coding; data entry and effect size calculation; main analyses; supplementary analyses]
Interrater reliability
Aim of the co-judge procedure: to discern
consistency within coder
consistency between coders
Take care when making inferences based on little information.
Phenomena that are impossible to code become missing values.
Interrater reliability
Percent agreement: common but not recommended
Cohen's kappa coefficient:
Kappa is the proportion of the optimum improvement over chance attained by the coders: 1 = perfect agreement, 0 = agreement no better than that expected by chance, -1 = perfect disagreement.
Kappas over .40 are considered a moderate level of agreement (but there is no clear basis for this "guideline").
Correlation between different raters
Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r)
Interrater reliability of categorical IV (1)

Percent exact agreement = (number of observations agreed on) / (total number of observations)

Study   Rater 1   Rater 2
  1        0         0
  2        1         1
  3        2         1
  4        1         1
  5        1         1
  6        2         2
  7        1         1
  8        1         1
  9        0         0
 10        2         1
 11        1         0
 12        1         1

Categorical IV with 3 discrete scale-steps (0, 1, 2). Nine of the 12 ratings are the same, so % exact agreement = 9/12 = .75.
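A minimal sketch of this calculation in Python (added for illustration; it uses only the twelve pairs of codes from the table above):

# Percent exact agreement for the 12 studies coded by two raters (data from the table above).
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

agreed = sum(r1 == r2 for r1, r2 in zip(rater1, rater2))
percent_agreement = agreed / len(rater1)
print(agreed, percent_agreement)  # 9 of 12 ratings agree -> 0.75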
Interrater reliability of categorical IV (2): unweighted Kappa

                     Rater 2
                0     1     2    Sum
Rater 1   0     2     0     0      2
          1     1     6     0      7
          2     0     2     1      3
Sum             3     8     1     12

Kappa:  K = (PO - PE) / (1 - PE)
PO = (2 + 6 + 1) / 12 = .75
PE = [(2)(3) + (7)(8) + (3)(1)] / 12^2 = .451
K = (.750 - .451) / (1 - .451) = .544

If the agreement matrix is irregular, Kappa will not be calculated, or will be misleading.
Positive values indicate how much the raters agree over and above chance alone; negative values indicate disagreement.
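A minimal sketch of the same calculation in Python (added for illustration; PO and PE are computed directly from the two raters' codes listed earlier):

from collections import Counter

# Ratings for the 12 studies (same data as the crosstab above).
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]
n = len(rater1)

# Observed proportion of agreement (diagonal of the crosstab).
p_o = sum(r1 == r2 for r1, r2 in zip(rater1, rater2)) / n

# Chance-expected agreement from each rater's marginal distribution.
m1, m2 = Counter(rater1), Counter(rater2)
p_e = sum(m1[c] * m2[c] for c in set(rater1) | set(rater2)) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(round(p_o, 3), round(p_e, 3), round(kappa, 3))  # 0.75, 0.451, 0.544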
Interrater reliability of categorical IV (3): unweighted Kappa in SPSS

CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT= AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS= COUNT
  /COUNT ROUND CELL .

Symmetric Measures
                               Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa     .544   .220                   2.719          .007
N of Valid Cases                  12
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
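The same value can be reproduced outside SPSS; for example, scikit-learn's cohen_kappa_score (a sketch added for illustration, assuming scikit-learn is available) gives the unweighted kappa from the raw codes:

from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Unweighted Cohen's kappa, matching the SPSS CROSSTABS output above (~ .544).
print(cohen_kappa_score(rater1, rater2))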
Interrater reliability of categorical IV (4): Kappas in irregular matrices

If rater 2 is systematically "above" rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to "fill up" the table with zeros.

                     Rater 2
                2     3     4    Sum
Rater 1   1     4     1     0      5
          2     3     6     1     10
          3     0     3     7     10
Sum             7    10     8     25

K = .51

                     Rater 2
                1     2     3     4    Sum
Rater 1   1     0     0     0     0      0
          2     4     1     0     0      5
          3     3     6     1     0     10
          4     0     3     7     0     10
Sum             7    10     8     0     25

K = -.16
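A sketch of how the two tables above lead to the two kappa values (the helper kappa_from_table below is a hypothetical function written for this illustration; it simply applies K = (PO - PE) / (1 - PE) to a matrix of cell counts, and is not the SPSS algorithm):

import numpy as np

def kappa_from_table(table):
    """Unweighted kappa from a square matrix of agreement counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n
    p_e = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2
    return (p_o - p_e) / (1 - p_e)

# Misaligned 3x3 table (rater 1 uses 1-3, rater 2 uses 2-4): kappa looks moderate.
misaligned = [[4, 1, 0],
              [3, 6, 1],
              [0, 3, 7]]
print(round(kappa_from_table(misaligned), 2))  # ~ 0.51

# The counts "filled up" with zeros to a 4x4 table, as on the slide: kappa turns negative.
filled = [[0, 0, 0, 0],
          [4, 1, 0, 0],
          [3, 6, 1, 0],
          [0, 3, 7, 0]]
print(round(kappa_from_table(filled), 2))  # ~ -0.16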
Interrater reliability of categorical IV (5): Kappas in irregular matrices

If there are no observations in some row or column, Kappa will not be calculated; it is possible to "fill up" the table with zeros.

                     Rater 2
                1     3     4    Sum
Rater 1   1     4     0     0      4
          2     2     1     0      3
          3     1     3     2      6
          4     0     1     4      5
Sum             7     5     6     18

K not possible to estimate

                     Rater 2
                1     2     3     4    Sum
Rater 1   1     4     0     0     0      4
          2     2     0     1     0      3
          3     1     0     3     2      6
          4     0     0     1     4      5
Sum             7     0     5     6     18

K = .47
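A sketch of the zero-filling idea in code (the paired ratings below are reconstructed from the cell counts above purely for illustration): declaring the full category set yields the zero-filled square table directly, so kappa becomes estimable.

from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Paired ratings reconstructed from the cell counts above;
# rater 1 uses categories 1-4, rater 2 never uses category 2.
pairs = ([(1, 1)] * 4 + [(2, 1)] * 2 + [(2, 3)] * 1 +
         [(3, 1)] * 1 + [(3, 3)] * 3 + [(3, 4)] * 2 +
         [(4, 3)] * 1 + [(4, 4)] * 4)
rater1, rater2 = zip(*pairs)

# Declaring all four categories yields the zero-filled 4x4 table from the slide ...
print(confusion_matrix(rater1, rater2, labels=[1, 2, 3, 4]))
# ... and kappa can be estimated, at about .47.
print(round(cohen_kappa_score(rater1, rater2, labels=[1, 2, 3, 4]), 2))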
Interrater reliability of categorical IV (6): weighted Kappa using SAS macro

PROC FREQ DATA = int.interrater1 ;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA; RUN;

K_w = 1 - (Σ_i w_i p_oi) / (Σ_i w_i p_ei)

Papers and macros are available for estimating Kappa when there are unequal or misaligned rows and columns, or multiple raters: <http://www.stataxis.com/about_me.htm>
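A rough stand-in outside SAS, sketched here for illustration (this is not the SAS macro itself): scikit-learn's cohen_kappa_score also computes linearly or quadratically weighted kappa for ordinal codes, shown on the earlier 12-study ratings.

from sklearn.metrics import cohen_kappa_score

# Ordinal codes from the earlier 12-study example.
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Weighted kappa penalises near-misses less than distant disagreements.
print(cohen_kappa_score(rater1, rater2, weights="linear"))
print(cohen_kappa_score(rater1, rater2, weights="quadratic"))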
Interrater reliability of continuous IV (1)

Study   Rater 1   Rater 2   Rater 3
  1        5         6         5
  2        2         1         2
  3        3         4         4
  4        4         4         4
  5        5         5         5
  6        3         3         4
  7        4         4         4
  8        4         3         3
  9        3         3         2
 10        2         2         1
 11        1         2         1
 12        3         3         3
Correlations
                                 rater1    rater2    rater3
rater1   Pearson Correlation      1         .873**    .879**
         Sig. (2-tailed)                    .000      .000
         N                        12        12        12
rater2   Pearson Correlation      .873**    1         .866**
         Sig. (2-tailed)          .000                .000
         N                        12        12        12
rater3   Pearson Correlation      .879**    .866**    1
         Sig. (2-tailed)          .000      .000
         N                        12        12        12
**. Correlation is significant at the 0.01 level (2-tailed).

Average correlation r = (.873 + .879 + .866) / 3 = .873
Coders code in the same direction!
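A minimal sketch reproducing the correlation matrix and the average inter-rater correlation (added for illustration; it uses the 12 x 3 ratings from the table above):

import numpy as np

# Ratings of the 12 studies by three coders (columns: rater 1, rater 2, rater 3).
ratings = np.array([
    [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4],
    [5, 5, 5], [3, 3, 4], [4, 4, 4], [4, 3, 3],
    [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
])

# Pearson correlations between raters (variables are in columns, hence rowvar=False).
r = np.corrcoef(ratings, rowvar=False)
r12, r13, r23 = r[0, 1], r[0, 2], r[1, 2]
print(round(r12, 3), round(r13, 3), round(r23, 3))  # ~ .873, .879, .866
print(round((r12 + r13 + r23) / 3, 3))              # average r ~ .873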
Interrater reliability of continuous IV (2)

Estimates of Covariance Parameters(a)
Parameter                                  Estimate
Residual                                   .222222
Intercept [subject = study]   Variance     1.544613
a. Dependent Variable: rating.

ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874
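A sketch of the same ICC using ANOVA-based variance components (added for illustration; for this balanced example these come out essentially the same as the REML estimates from SPSS MIXED shown above, and the data are the 12 x 3 ratings from the earlier table):

import numpy as np

ratings = np.array([
    [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4],
    [5, 5, 5], [3, 3, 4], [4, 4, 4], [4, 3, 3],
    [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
], dtype=float)
n_studies, k_raters = ratings.shape

# One-way ANOVA mean squares: within studies (residual) and between studies.
study_means = ratings.mean(axis=1)
ms_within = ((ratings - study_means[:, None]) ** 2).sum() / (n_studies * (k_raters - 1))
ms_between = k_raters * ((study_means - ratings.mean()) ** 2).sum() / (n_studies - 1)

# Variance components and the intraclass correlation.
var_between = (ms_between - ms_within) / k_raters
icc = var_between / (var_between + ms_within)
print(round(ms_within, 3), round(var_between, 3), round(icc, 3))  # ~ 0.222, 1.545, 0.874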
Interrater reliability of continuous IV (3)
Design 1: one-way random effects model, when each study is rated by a different pair of coders
Design 2: two-way random effects model, when a random pair of coders rates all studies
Design 3: two-way mixed effects model, when ONE pair of coders rates all studies
Comparison of methods (from Orwin, p. 153; in Cooper & Hedges, 1994):
Low Kappa but good agreement rate (AR) when there is little variability across items and the coders agree.
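A toy illustration of that point (hypothetical codes invented for this sketch, not data from the slides): when nearly everything falls in one category, percent agreement can be high while kappa is near zero or even negative.

from sklearn.metrics import cohen_kappa_score

# 20 hypothetical items: both coders put almost everything into category 1.
coder1 = [1] * 18 + [1, 2]
coder2 = [1] * 18 + [2, 1]

agreement_rate = sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)
print(agreement_rate)                     # 0.90 -> high percent agreement
print(cohen_kappa_score(coder1, coder2))  # ~ -0.05 -> kappa near zero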
Interrater reliability in meta-analysis and primary studies
Interrater reliability in meta-analysis vs. in other contexts
Meta-analysis: coding of independent variables
How many co-judges?
How many objects to co-judge? (A sub-sample of studies, versus a sub-sample of codings)
Use of a "gold standard" (i.e., one "master coder")
Coder drift (cf. observer drift): are coders consistent over time?
Your qualitative analysis is only as good as the quality of your categorisation of qualitative data.