ground truth - International Educational Data Mining Society

Week 3 Video 1
Ground Truth for Behavior Detection
Welcome to Week 3


Over the last two weeks, we’ve discussed prediction models
This week, we focus on a type of prediction model called behavior detectors
Behavior Detectors


Automated models that can infer from log files whether a student is behaving in a certain way
We discussed examples of this: off-task behavior and gaming detectors, in the San Pedro et al. case study in Week 1
Behaviors people have detected
Disengaged Behaviors
Gaming the System (Baker et al., 2004, 2008, 2010; Cheng & Vassileva, 2005; Walonoski & Heffernan, 2006; Beal, Qu, & Lee, 2007)
Off-Task Behavior (Baker, 2007; Cetintas et al., 2010)
Carelessness (San Pedro et al., 2011; Hershkovitz et al., 2011)
WTF Behavior (Rowe et al., 2009; Wixon et al., UMAP2012)
Meta-Cognitive Behaviors
Help Avoidance (Aleven et al., 2004, 2006)
Unscaffolded Self-Explanation (Shih et al., 2008; Baker, Gowda, & Corbett, 2011)
Exploration Behaviors (Amershi & Conati, 2009)
If you’re not interested in behavior detectors…

Feel free to take the week off and rejoin us in a week
Although there may still be stuff that’s interesting to you this week
Ground Truth


The first big issue with developing behavior detectors is…
Where do you get the prediction labels?
Behavior Labels are Noisy


No perfect way to get indicators of student behavior
It’s not truth
 It’s ground truth
 (Truthiness?)
Behavior Labels are Noisy

Another way to think of it

In some areas, there are “gold-standard” measures
 As good as gold
With behavior detection, we have to work with “bronze-standard” measures
 Gold’s less expensive cousin
Is this a problem?




Not really
It does limit how good we can realistically expect our models to be
If your training labels have inter-rater agreement of Kappa = 0.62, you probably should not expect (or want) your detector to have Kappa = 0.75
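
A quick reference for the statistic, assuming "Kappa" here means Cohen's Kappa (the slides do not say which variant). The symbols p_o and p_e are notation introduced for this sketch, not taken from the slides:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
% p_o: observed proportion of cases where the two raters (or detector and label) agree
% p_e: proportion of agreement expected by chance, from each rater's label frequencies
```

Kappa = 0 means agreement no better than chance, and Kappa = 1 means perfect agreement.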
Ultimately

“Anything worth doing is worth doing badly.” – Herb Simon
Sources of Ground Truth for Behavior




Self-report
Field observations
Text replays
Video coding
Self-report


Fairly common for constructs like affect and self-efficacy
Not as common for labeling behavior
Are you gaming the system right now?
Yes
No
(This question was recommended to me by an IRB compliance officer in 2003)
Field Observations


One or more observers watch students and take systematic notes on student behavior
Takes some training to do right (Ocumpaugh et al., 2012)
 http://www.columbia.edu/~rsb2162/bromp.html

Free Android App developed by my group (Baker et al., 2011)
 http://www.columbia.edu/~rsb2162/HART8.6.zip
Text replays

Pretty-prints of student interaction behavior from the logs
Examples
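
As a rough sketch of what producing a text replay might involve, the Python below pretty-prints a short clip of log rows with the time elapsed between actions. The log fields (timestamp, step, action, outcome), the sample rows, and the `text_replay` function are all hypothetical, invented for illustration; real tutor logs and the replay format used in the cited studies will differ.

```python
from datetime import datetime

# Hypothetical log rows: (timestamp, problem step, student action, outcome)
log_rows = [
    ("2010-03-01 10:15:02", "step-1", "entered 12", "WRONG"),
    ("2010-03-01 10:15:03", "step-1", "entered 13", "WRONG"),
    ("2010-03-01 10:15:04", "step-1", "asked for hint", "HINT"),
    ("2010-03-01 10:15:20", "step-1", "entered 15", "RIGHT"),
]

def text_replay(rows):
    """Render log rows as readable lines, each prefixed with the
    seconds elapsed since the previous action in the clip."""
    lines = []
    previous = None
    for stamp, step, action, outcome in rows:
        now = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        elapsed = 0 if previous is None else int((now - previous).total_seconds())
        lines.append(f"+{elapsed:>3}s  {step:<8} {action:<16} {outcome}")
        previous = now
    return "\n".join(lines)

replay = text_replay(log_rows)
print(replay)
```

A human coder would read a clip like this and assign a label (for example, gaming or not gaming) to the whole clip.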
Major Advantages

Blazing fast to conduct
 8 to 40 seconds per observation
Notes

Decent inter-rater reliability is possible

Agree with other measures of constructs

Can be used to train behavior detectors
Major Limitations






Limited range of constructs you can code
Gaming the System – yes
Collaboration in online chat – yes (Prata et al., 2008)
Frustration, Boredom – sometimes
Off-Task Behavior outside of software – no
Collaborative Behavior outside of software – no
Major Limitations

Lower precision (because lower bandwidth of observation)
Video Coding




Can be videos of live behavior in classrooms or screen replay videos
Slowest method
Replicable and precise
Some challenges in camera positioning
 If you don’t get this right, it can be difficult to code your data
Inter-Rater Reliability


Good to check for inter-rater reliability when using expert coding
Researchers typically try to shoot for Kappa = 0.6 or higher for the expert coding
 But ignore magic numbers…
 Any real signal is something you can try to re-capture
 I’d rather have 1,000 data points with Kappa = 0.5 than 100 data points with Kappa = 0.7
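
A minimal sketch of checking inter-rater reliability between two coders, assuming the Kappa in question is Cohen's Kappa. The function name `cohens_kappa` and the two coders' labels are made up for illustration and are not data from any study cited here.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: fraction of observations where both coders agree.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: computed from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from two observers for the same 10 clips
coder_1 = ["gaming", "not", "not", "gaming", "not",
           "not", "not", "gaming", "not", "not"]
coder_2 = ["gaming", "not", "not", "not", "not",
           "not", "gaming", "gaming", "not", "not"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"Kappa = {kappa:.2f}")
```

Here the coders agree on 8 of 10 clips, but because "not" is the majority label for both, much of that agreement is expected by chance, so Kappa comes out well below 0.8.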
Once you have ground truth…


You can build your detector!
Once you’ve synchronized your ground truth to the log data…
Next Lecture

Data Synchronization and Grain-Sizes