
Funded through the ESRC’s Researcher Development Initiative
Session 3.3: Inter-rater reliability
Prof. Herb Marsh
Ms. Alison O’Mara
Dr. Lars-Erik Malmberg
Department of Education, University of Oxford
[Flowchart of the meta-analysis process: Establish research question → Define relevant studies → Develop code materials → Locate and collate studies → Pilot coding; coding → Data entry and effect size calculation → Main analyses → Supplementary analyses]
Interrater reliability
Aim of co-judge procedure, to discern:
• Consistency within coder
• Consistency between coders
Take care when making inferences based on little information
Phenomena impossible to code become missing values
Interrater reliability
Percent agreement: common but not recommended
Cohen’s kappa coefficient
• Kappa is the proportion of the optimum improvement over chance attained by the coders: 1 = perfect agreement, 0 = agreement no better than that expected by chance, -1 = perfect disagreement
• Kappas over .40 are considered a moderate level of agreement (but there is no clear basis for this “guideline”)
Correlation between different raters
Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r)
Interrater reliability of categorical IV (1)

Percent exact agreement = number of observations agreed on / total number of observations

Study   Rater 1   Rater 2
1       0         0
2       1         1
3       2         1
4       1         1
5       1         1
6       2         2
7       1         1
8       1         1
9       0         0
10      2         1
11      1         0
12      1         1

Categorical IV with 3 discrete scale-steps
9 ratings the same
% exact agreement = 9/12 = .75
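
A minimal Python sketch (illustrative, not part of the original slides) that reproduces the percent exact agreement from the table above:

# Ratings of the 12 studies by the two coders, as listed in the table above.
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Percent exact agreement = observations agreed on / total observations.
agreed = sum(r1 == r2 for r1, r2 in zip(rater1, rater2))
print(agreed, agreed / len(rater1))  # 9 0.75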
Interrater reliability of categorical IV (2)
unweighted Kappa

K = (PO - PE) / (1 - PE)

                Rater 1
Rater 2     0     1     2     Sum
0           2     1     0     3
1           0     6     2     8
2           0     0     1     1
Sum         2     7     3     12

Kappa:
PO = (2 + 6 + 1) / 12 = .750
PE = [(2)(3) + (7)(8) + (3)(1)] / 12² = .451
K = (.750 - .451) / (1 - .451) = .544

Positive values indicate how much the raters agree over and above chance alone; negative values indicate disagreement.
If the agreement matrix is irregular, Kappa will not be calculated, or will be misleading.
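
The same ratings reproduce the unweighted Kappa by hand; a minimal plain-Python sketch following the PO and PE definitions above:

from collections import Counter

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]
n = len(rater1)

# Observed agreement: proportion of studies coded identically.
po = sum(r1 == r2 for r1, r2 in zip(rater1, rater2)) / n

# Chance-expected agreement from each rater's marginal totals.
m1, m2 = Counter(rater1), Counter(rater2)
pe = sum(m1[c] * m2[c] for c in set(rater1) | set(rater2)) / n**2

kappa = (po - pe) / (1 - pe)
print(round(po, 3), round(pe, 3), round(kappa, 3))  # 0.75 0.451 0.544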
Interrater reliability of categorical IV (3)
unweighted Kappa in SPSS

CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT=AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS=COUNT
  /COUNT ROUND CELL.

Symmetric Measures
                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa   .544    .220                   2.719          .007
N of Valid Cases              12

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
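
As a cross-check, if scikit-learn happens to be available (an assumption; the slides only use SPSS here), its cohen_kappa_score reproduces the same value as the output above:

from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

print(round(cohen_kappa_score(rater1, rater2), 3))  # 0.544, matching the SPSS Kappa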
Interrater reliability of categorical IV (4)
Kappas in irregular matrices

If rater 2 is systematically “above” rater 1 when coding an ordinal scale, Kappa will be misleading → possible to “fill up” with zeros.

                Rater 1
Rater 2     1     2     3     Sum
2           4     1     0     5
3           3     6     1     10
4           0     3     7     10
Sum         7     10    8     25

K = .51

                Rater 1
Rater 2     1     2     3     4     Sum
1           0     0     0     0     0
2           4     1     0     0     5
3           3     6     1     0     10
4           0     3     7     0     10
Sum         7     10    8     0     25

K = -.16
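
A short plain-Python sketch of the point above (illustrative, not from the original materials): treating the diagonal of the printed, misaligned table as “agreement” reproduces the misleading .51, while padding both raters out to the full category set 1-4 yields the negative Kappa that reflects the systematic disagreement:

def kappa_from_table(table):
    # Unweighted Kappa from a square contingency table (rows = rater 2, columns = rater 1).
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    po = sum(table[i][i] for i in range(len(table))) / n
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (po - pe) / (1 - pe)

# Misaligned table: rater 1 used categories 1-3, rater 2 used categories 2-4.
misaligned = [[4, 1, 0],
              [3, 6, 1],
              [0, 3, 7]]

# Same counts padded to the full 1-4 grid for both raters.
padded = [[0, 0, 0, 0],
          [4, 1, 0, 0],
          [3, 6, 1, 0],
          [0, 3, 7, 0]]

print(round(kappa_from_table(misaligned), 2))  # 0.51 (misleading)
print(round(kappa_from_table(padded), 2))      # -0.16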
Interrater reliability of categorical IV (5)
Kappas in irregular matrices

If there are no observations in some row or column, Kappa will not be calculated → possible to “fill up” with zeros.

                Rater 1
Rater 2     1     3     4     Sum
1           4     0     0     4
2           2     1     0     3
3           1     3     2     6
4           0     1     4     5
Sum         7     5     6     18

K not possible to estimate

                Rater 1
Rater 2     1     2     3     4     Sum
1           4     0     0     0     4
2           2     0     1     0     3
3           1     0     3     2     6
4           0     0     1     4     5
Sum         7     0     5     6     18

K = .47
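
The same helper checks this example: inserting a zero column for rater 1's unused category 2 squares the table and gives the Kappa of .47 shown above (the helper is repeated so the sketch stands alone):

def kappa_from_table(table):
    # Unweighted Kappa from a square contingency table (rows = rater 2, columns = rater 1).
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    po = sum(table[i][i] for i in range(len(table))) / n
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (po - pe) / (1 - pe)

# Zero column inserted for rater 1's unused category 2.
padded = [[4, 0, 0, 0],
          [2, 0, 1, 0],
          [1, 0, 3, 2],
          [0, 0, 1, 4]]
print(round(kappa_from_table(padded), 2))  # 0.47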
Interrater reliability of categorical IV (6)
weighted Kappa using SAS macro

PROC FREQ DATA=int.interrater1;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA;
RUN;

K_W = 1 - (Σ w_i p_oi) / (Σ w_i p_ei)

where the sums run over the cells of the agreement matrix, with w_i the disagreement weight, p_oi the observed proportion, and p_ei the chance-expected proportion for cell i.

Papers and macros are available for estimating Kappa with unequal or misaligned rows and columns, or multiple raters:
<http://www.stataxis.com/about_me.htm>
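
The SAS macro itself is not reproduced here, but the formula above can be sketched in plain Python. The weight matrix used below (linear disagreement weights |i - j|) is an illustrative assumption; with 0/1 off-diagonal weights the same function reduces to the unweighted Kappa (.544 for the earlier crosstab):

def weighted_kappa(table, weights):
    # K_w = 1 - (sum of w_ij * p_o,ij) / (sum of w_ij * p_e,ij), with disagreement weights w_ij.
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(weights[i][j] * table[i][j] / n for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_totals[i] * col_totals[j] / n**2 for i in range(k) for j in range(k))
    return 1 - observed / expected

# Crosstab of the 12 studies from the earlier example (rows = rater 2, columns = rater 1).
table = [[2, 1, 0],
         [0, 6, 2],
         [0, 0, 1]]

# Linear disagreement weights |i - j| (an illustrative choice, not taken from the macro).
linear = [[abs(i - j) for j in range(3)] for i in range(3)]
print(round(weighted_kappa(table, linear), 3))  # 0.6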
Interrater reliability of continuous IV (1)

Study   Rater 1   Rater 2   Rater 3
1       5         6         5
2       2         1         2
3       3         4         4
4       4         4         4
5       5         5         5
6       3         3         4
7       4         4         4
8       4         3         3
9       3         3         2
10      2         2         1
11      1         2         1
12      3         3         3

Correlations (Pearson, N = 12; ** correlation is significant at the 0.01 level, 2-tailed):
rater1 with rater2: r = .873**
rater1 with rater3: r = .879**
rater2 with rater3: r = .866**

• Average correlation r = (.873 + .879 + .866) / 3 = .873
• Coders code in the same direction!
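
A small numpy sketch (illustrative) reproducing the pairwise correlations and their average from the ratings above:

import numpy as np

rater1 = [5, 2, 3, 4, 5, 3, 4, 4, 3, 2, 1, 3]
rater2 = [6, 1, 4, 4, 5, 3, 4, 3, 3, 2, 2, 3]
rater3 = [5, 2, 4, 4, 5, 4, 4, 3, 2, 1, 1, 3]

r = np.corrcoef([rater1, rater2, rater3])        # 3 x 3 Pearson correlation matrix
pairwise = [r[0, 1], r[0, 2], r[1, 2]]
print([round(float(x), 3) for x in pairwise])    # [0.873, 0.879, 0.866]
print(round(float(np.mean(pairwise)), 3))        # 0.873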
Interrater reliability of continuous IV (2)

Estimates of Covariance Parameters (dependent variable: rating)
Parameter                               Estimate
Residual                                .222222
Intercept [subject = study] Variance    1.544613

ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874
Interrater reliability of continuous IV (3)
Design 1: one-way random effects model, when each study is rated by a different pair of coders
Design 2: two-way random effects model, when a random pair of coders rates all studies
Design 3: two-way mixed effects model, when ONE pair of coders rates all studies
Comparison of methods (from Orwin, p. 153; in Cooper & Hedges, 1994):
Low Kappa but good AR when there is little variability across items and the coders agree
Interrater reliability in meta-analysis and primary study
Interrater reliability in meta-analysis vs. in other contexts
Meta-analysis: coding of independent variables
• How many co-judges?
• How many objects to co-judge? (sub-sample of studies versus sub-sample of codings)
• Use of a “gold standard” (i.e., one “master coder”)
Coder drift (cf. observer drift): are coders consistent over time?
• Your qualitative analysis is only as good as the quality of your categorisation of qualitative data