ground truth - International Educational Data Mining Society
Week 3 Video 1
Ground Truth for Behavior Detection
Welcome to Week 3
Over the last two weeks, we’ve discussed prediction models
This week, we focus on a type of prediction model called behavior detectors
Behavior Detectors
Automated models that can infer from log files whether a student is behaving in a certain way
We discussed examples of this: off-task behavior and gaming detectors, in the San Pedro et al. case study in week 1
Behaviors people have detected
Disengaged Behaviors
Gaming the System (Baker et al., 2004, 2008, 2010; Cheng & Vassileva, 2005; Walonoski & Heffernan, 2006; Beal, Qu, & Lee, 2007)
Off-Task Behavior (Baker, 2007; Cetintas et al., 2010)
Carelessness (San Pedro et al., 2011; Hershkovitz et al., 2011)
WTF Behavior (Rowe et al., 2009; Wixon et al., UMAP 2012)
Meta-Cognitive Behaviors
Help Avoidance (Aleven et al., 2004, 2006)
Unscaffolded Self-Explanation (Shih et al., 2008; Baker, Gowda, & Corbett, 2011)
Exploration Behaviors (Amershi & Conati, 2009)
If you’re not interested in behavior detectors…
Feel free to take the week off and rejoin us in a week
Although there may still be stuff that’s interesting to you this week
Ground Truth
The first big issue with developing behavior detectors is…
Where do you get the prediction labels?
Behavior Labels are Noisy
No perfect way to get indicators of student behavior
It’s not truth
It’s ground truth
(Truthiness?)
Behavior Labels are Noisy
Another way to think of it
In some areas, there are “gold-standard” measures
As good as gold
With behavior detection, we have to work with “bronze-standard” measures
Gold’s less expensive cousin
Is this a problem?
Not really
It does limit how good we can realistically expect our models to be
If your training labels have inter-rater agreement of Kappa = 0.62, you probably should not expect (or want) your detector to have Kappa = 0.75
Ultimately
“Anything worth doing is worth doing badly.” – Herb Simon
Picture taken from CMU tribute page
Sources of Ground Truth for Behavior
Self-report
Field observations
Text replays
Video coding
Self-report
Fairly common for constructs like affect and self-efficacy
Not as common for labeling behavior
Are you gaming the system right now?
Yes
No
Was recommended to me by an IRB compliance officer in 2003
Field Observations
One or more observers watch students and take systematic notes on student behavior
Takes some training to do right (Ocumpaugh et al., 2012)
http://www.columbia.edu/~rsb2162/bromp.html
Free Android App developed by my group (Baker et al., 2011)
http://www.columbia.edu/~rsb2162/HART8.6.zip
Text replays
Pretty-prints of student interaction behavior from the logs
Examples
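The original slide examples were images and are not reproduced in this transcript. As a stand-in, here is a minimal sketch of what "pretty-printing" a clip of log data might look like. The log schema (timestamp, student, step, action, input, outcome), the sample rows, and the `text_replay` function are all hypothetical and chosen only for illustration; real tutor logs and real text-replay tools will differ.

```python
from datetime import datetime

# Hypothetical log rows: (timestamp, student, step, action, input, outcome).
# These values are invented for illustration, not taken from any real data set.
LOG = [
    ("2012-03-01 10:00:02", "s01", "step-1", "attempt", "12", "wrong"),
    ("2012-03-01 10:00:03", "s01", "step-1", "attempt", "13", "wrong"),
    ("2012-03-01 10:00:04", "s01", "step-1", "help", "", "hint"),
    ("2012-03-01 10:00:21", "s01", "step-1", "attempt", "15", "right"),
]

def text_replay(rows):
    """Return a human-readable replay of one clip of log rows.

    Each output line shows the seconds elapsed since the previous action,
    so a coder can see rapid guessing or long pauses at a glance.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    lines = []
    previous = None
    for ts, student, step, action, value, outcome in rows:
        now = datetime.strptime(ts, fmt)
        gap = 0 if previous is None else int((now - previous).total_seconds())
        shown = f'"{value}"' if value else "-"
        lines.append(
            f"+{gap:>3}s  {student}  {step}  {action:<8} {shown:<6} -> {outcome}"
        )
        previous = now
    return "\n".join(lines)

replay = text_replay(LOG)
print(replay)
```

A human coder would read a clip like this and assign a label (for example, gaming or not gaming), which is why timing between actions is made prominent in the sketch.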
Major Advantages
Blazing fast to conduct
8 to 40 seconds per observation
Notes
Decent inter-rater reliability is possible
Agree with other measures of constructs
Can be used to train behavior detectors
Major Limitations
Limited range of constructs you can code
Gaming the System – yes
Collaboration in online chat – yes (Prata et al., 2008)
Frustration, Boredom – sometimes
Off-Task Behavior outside of software – no
Collaborative Behavior outside of software – no
Major Limitations
Lower precision (because lower bandwidth of observation)
Video Coding
Can be videos of live behavior in classrooms or screen replay videos
Slowest method
Replicable and precise
Some challenges in camera positioning
If you don’t get this right, it can be difficult to code your data
Inter-Rater Reliability
Good to check for inter-rater reliability when using expert coding
Researchers typically try to shoot for Kappa = 0.6 or higher for the expert coding
But ignore magic numbers…
Any real signal is something you can try to re-capture
I’d rather have 1,000 data points with Kappa = 0.5 than 100 data points with Kappa = 0.7
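For readers who want to run the inter-rater check themselves, here is a minimal sketch. It assumes the Kappa referred to above is Cohen's kappa for two coders, which the slides do not state explicitly; the function name and the sample labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two coders who labeled the same observations.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)

# Invented labels: G = gaming, N = not gaming.
coder_1 = ["G", "G", "N", "N", "N", "N", "N", "N", "G", "N"]
coder_2 = ["G", "N", "N", "N", "N", "N", "N", "N", "G", "N"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

In this invented example the coders agree on 9 of 10 items, but because "N" is so common, chance agreement is already 0.62, so kappa is noticeably lower than raw agreement.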
Once you have ground truth…
You can build your detector!
Once you’ve synchronized your ground truth to the log data…
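As a rough illustration of what synchronizing might involve, the sketch below attaches to each human-coded observation the log actions by the same student in a time window ending at the observation. Everything here is an assumption made for illustration: the record fields, the sample data, the 20-second window, and the premise that the observer's clock and the log server's clock already agree.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def actions_for_observation(obs, log_rows, window_seconds=20):
    """Return log rows by the observed student in the window ending at the observation.

    The window length is a hypothetical default, not a recommended value.
    """
    end = datetime.strptime(obs["time"], FMT)
    start = end - timedelta(seconds=window_seconds)
    return [
        row for row in log_rows
        if row["student"] == obs["student"]
        and start < datetime.strptime(row["time"], FMT) <= end
    ]

# Invented data: one field observation and a handful of log actions.
observation = {"time": "2012-03-01 10:00:30", "student": "s01", "label": "gaming"}
log_rows = [
    {"time": "2012-03-01 10:00:05", "student": "s01", "action": "attempt"},  # too early
    {"time": "2012-03-01 10:00:12", "student": "s01", "action": "attempt"},
    {"time": "2012-03-01 10:00:20", "student": "s02", "action": "attempt"},  # other student
    {"time": "2012-03-01 10:00:28", "student": "s01", "action": "help"},
    {"time": "2012-03-01 10:00:35", "student": "s01", "action": "attempt"},  # after observation
]

matched = actions_for_observation(observation, log_rows)
print(observation["label"], [row["action"] for row in matched])
```

The matched actions, paired with the observation's label, would form one training example for a detector.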
Next Lecture
Data Synchronization and Grain-Sizes