Transcript Slide 1

TagHelper Tools
Supporting the Analysis of
Conversational Data
Carolyn P. Rosé
Language Technologies Institute
and Human-Computer Interaction Institute
Carnegie Mellon University
Outline
 What is TagHelper tools?
 What can TagHelper Tools do for YOU?
 How EASY is it to use TagHelper tools?
 What are some TagHelper success
stories?
 What problems are we working on?
What is TagHelper tools?
What is TagHelper tools?
 A PSLC Enabling Technology project
 Machine learning technology for
processing conversational data
 Chat data
 Newsgroup style conversational data
 Short answers and explanations
 Goal: automate the categorization of
spans of text
What is TagHelper tools?
 An add-on to Microsoft Excel
 Research Focus: identify and solve text
classification problems specific to
learning sciences
 Types of categories, nature and size of data
sets
What can TagHelper tools
do for YOU?
Main Uses for TagHelper tools
 Supporting data analysis involving
conversational data
 Triggering interventions
 Supporting on-line assessment
Example: Data Analysis
Example: Triggering an Intervention
 ST1: well what values
do u have for the
reheat cycle ?
 ST2: for some reason
I said temperature at
turbine to be like 400 C
 Tutor: Let's think about
the motivation for
Reheat. What process
does the steam
undergo in the
Turbines ?
…
Example: Supporting on-line
assessment
* Using instructor-assigned ratings as gold standard
* Best performance without TagHelper tools: .16 correlation coefficient
* Best performance with TagHelper tools: .63 correlation coefficient
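To make the comparison above concrete, here is a minimal Python sketch that computes a Pearson correlation coefficient between instructor ratings and automatically assigned scores. The numbers and variable names are made up for illustration; they are not the study data.

    import numpy as np

    # Hypothetical ratings: instructor-assigned gold standard vs. automatic scores
    instructor = np.array([3, 5, 2, 4, 4, 1])
    automatic = np.array([3, 4, 2, 5, 4, 2])

    # Pearson correlation coefficient, the statistic reported on this slide
    r = np.corrcoef(instructor, automatic)[0, 1]
    print(round(r, 2))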
How EASY is it to use
TagHelper tools?
Setting Up Your Data
Iterative Process for Using
TagHelper tools
 Obtain data in natural language form
 Iterative process
  Decide on a unit of analysis
   Single contributions, topic segments, whole messages, etc.
  Decide on a set of categories or a rating system
  Set up data in Excel (see the sketch after this list)
  Assign categories to part of your data
  Use TagHelper to assign categories to the remaining portion of your data
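As a rough illustration of the "Set up data in Excel" and "Assign categories to part of your data" steps, the sketch below builds a spreadsheet-style table in Python in which some rows carry hand-assigned codes and the rest are left blank for TagHelper to fill in. The column names, category labels, and file name are assumptions made for the example, not TagHelper's required format.

    import pandas as pd

    # One row per unit of analysis; uncoded rows have an empty "code" cell (toy data)
    data = pd.DataFrame({
        "code": ["high-level help", "no help", None, None],
        "text": [
            "Try thinking about what the slope of that line tells you.",
            "ok",
            "what values do u have for the reheat cycle?",
            "I said temperature at turbine to be like 400 C",
        ],
    })
    data.to_csv("coded_and_uncoded.csv", index=False)  # hypothetical file name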
Training and Testing
 Start TagHelper tools by
double clicking on the
portal.bat icon
 You will then see the
following tool palette
 Train a prediction model
on your coded data and
then apply that model to
uncoded data
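The train-then-apply step described above can be pictured with the following scikit-learn sketch: fit a simple text classifier on the hand-coded rows, then let it label the uncoded rows. This illustrates the general idea on the toy file from the previous sketch; it is not TagHelper's internal pipeline.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    data = pd.read_csv("coded_and_uncoded.csv")   # hypothetical file from the earlier sketch
    coded = data[data["code"].notna()]            # rows a human already labeled
    uncoded = data[data["code"].isna()]           # rows the model should label

    # Bag-of-words features plus a simple classifier stand in for the trained prediction model
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(coded["text"], coded["code"])
    data.loc[uncoded.index, "code"] = model.predict(uncoded["text"])
    print(data)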
Loading a File
First click on Add a File
Then select a file
Simplest Usage
 Once your file is loaded,
you have two options
 The first option is to
code your data using the
default settings
 To do this, simply click on
“GO!”
 The second option is to
modify the default
settings and then code
 We will start with the first
option
 Note that the performance
will not be optimal
Results
Performance on coded data
Results on uncoded data
A slightly more complex
case…
Example: Data Analysis
Setting Up Your Data
What are some TagHelper
success stories?
Success Story 1: Supporting Data
Analysis
 Peer tutoring in Algebra
LearnLab
 Data coded for high-level help, low-level help, and no help
 Important predictor of learning
(e.g., Webb et al., 2003)
* Contributed by Erin Walker
 TagHelper achieves
agreement of .82 Kappa
 Can be used for follow-up
studies in same domain
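Cohen's Kappa, the agreement statistic quoted in this and the next success story, can be computed as in the sketch below. The label sequences are made up; in practice they would be the human codes and TagHelper's codes for the same segments.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical codes for eight segments (high-level help / low-level help / no help)
    human     = ["high", "low", "none", "high", "low", "none", "high", "none"]
    taghelper = ["high", "low", "none", "high", "none", "none", "high", "none"]

    # 1.0 means perfect agreement; 0 means agreement no better than chance
    print(cohen_kappa_score(human, taghelper))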
Success Story 2: Triggering
Interventions
 Collaborative idea
generation in the Earth
Sciences domain
 Chinese TagHelper
learns hand-coded topic
analysis
 Human agreement .84
Kappa
 TagHelper performance
.7 Kappa
 Trained models used in
follow-up study to trigger
interventions and
facilitate data analysis
Example Dialogue
Student 1: People stole sand and stones to use for construction.
VIBRANT: Yes, stealing sand and stones may destroy the balance and thus make mountain areas unstable. Thinking about development of mountain areas, can you think of a kind of development that may cause a problem?
Student 2: Development of mountain areas often causes problems.
Student 1: It is okay to develop, but there must be some constraints.
* Feedback during idea generation increases both idea generation
and learning (Wang et al., 2007)
Process Analysis
[Figures: number of unique ideas by condition (Nom+N, Nom+F, Real+N, Real+F) and unique ideas accumulated over time for Individuals+Feedback, Individuals+NoFeedback, Pairs+Feedback, and Pairs+NoFeedback.]
Process loss, Pairs vs. Individuals: F(1,24)=12.22, p<.005, 1 sigma
Negative effect of Feedback: F(1,24)=7.23, p<.05, -1.03 sigma
Process loss, Pairs vs. Individuals: F(1,24)=4.61, p<.05, .61 sigma
Positive effect of Feedback: F(1,24)=16.43, p<.0005, 1.37 sigma
What problems are we
working on?
Interesting Problems
 Highly skewed data sets
 Very infrequent classes are often the most interesting and
important
 Careful feature space design helps more than powerful
algorithms
 Huge problem with non-independence of data points
from same student
 Off-the-shelf machine learning algorithms are not set up for this
 New sampling techniques offer promise
 “Medium”-sized data sets
 Contemporary machine learning approaches designed for
huge data sets
 Supplementing with alternative data sources may help
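One simple way to respect the non-independence noted above, offered only as an illustration and not as the specific sampling technique the project is developing, is grouped cross-validation: all segments from the same student stay on the same side of each train/test split.

    from sklearn.model_selection import GroupKFold

    # Toy data: six segments, each produced by one of three students
    texts = ["seg a", "seg b", "seg c", "seg d", "seg e", "seg f"]
    labels = ["help", "no help", "help", "no help", "help", "no help"]
    students = ["s1", "s1", "s2", "s2", "s3", "s3"]

    # No student's segments appear in both the training and the test portion of a fold
    for train_idx, test_idx in GroupKFold(n_splits=3).split(texts, labels, groups=students):
        print(sorted({students[i] for i in train_idx}), "vs", sorted({students[i] for i in test_idx}))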
Example Lesson Learned
Problem: Context-oriented coding
Finding: Careful feature space design goes farther than powerful algorithms
Back to Argumentation Data
Sequential Learning
 Notes sequential dependencies
  Perhaps claims are stated before their warrants
  Perhaps counter-arguments are given before new arguments
  Perhaps people first build on their partner's ideas and then offer a new idea
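One common way to let a classifier notice such sequential dependencies, sketched below purely as an illustration rather than as the sequential learner used in this work, is to add the code of the preceding segment as an extra feature.

    def add_previous_code_feature(segments):
        """segments: list of dicts with 'text' and, for training data, 'code'."""
        augmented, prev_code = [], "START"
        for seg in segments:
            augmented.append({"text": seg["text"], "prev_code": prev_code})
            prev_code = seg.get("code", "UNKNOWN")
        return augmented

    # Toy argumentation data: a claim followed by its warrant
    example = [{"text": "I claim X.", "code": "claim"},
               {"text": "Because Y.", "code": "warrant"}]
    print(add_previous_code_feature(example))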
Thread Structure Features
[Diagrams: message segments Seg1, Seg2, Seg3 linked by reply structure and semantic similarity.]
 Thread Depth
 Best Parent Semantic Similarity
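The two thread-structure features named above might be computed roughly as follows, assuming each segment records which earlier segment it replies to; the word-overlap similarity is only a stand-in for whatever semantic similarity measure is actually used.

    def thread_depth(seg_id, parent_of):
        # Number of reply links between this segment and the thread root
        depth = 0
        while parent_of.get(seg_id) is not None:
            seg_id = parent_of[seg_id]
            depth += 1
        return depth

    def best_parent(seg_text, earlier_texts, similarity):
        # The earlier segment most semantically similar to this one
        return max(earlier_texts, key=lambda t: similarity(seg_text, t)) if earlier_texts else None

    parent_of = {"Seg1": None, "Seg2": "Seg1", "Seg3": "Seg2"}      # hypothetical thread
    overlap = lambda a, b: len(set(a.split()) & set(b.split()))     # toy similarity measure
    print(thread_depth("Seg3", parent_of))                          # -> 2
    print(best_parent("sand and stones", ["people stole sand", "develop the area"], overlap))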
Sequence Oriented Features
 Notes whether
text is within a
certain proximity
to quoted
material
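A sequence-oriented feature of this kind could be approximated as below; the ">" quote marker and the two-line window are assumptions made for the example.

    def near_quote(lines, window=2, marker=">"):
        # True for lines within `window` lines of quoted material
        quoted = [i for i, line in enumerate(lines) if line.lstrip().startswith(marker)]
        return [any(abs(i - q) <= window for q in quoted) for i in range(len(lines))]

    message = ["> earlier post text", "I agree with this point.", "", "", "Unrelated new idea."]
    print(near_quote(message))   # -> [True, True, True, False, False]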
Context-Based Feature Approach
[Bar chart: Kappa from 10-fold cross-validation by dimension (Social, Macro, Micro) for four feature sets: Base, Base+Thread, Base+Seq, and Base+AllContext. Kappa values range from roughly .52 to .75.]
Sequential Learning
[Bar chart: Kappa from 10-fold cross-validation by dimension (Social, Macro, Micro) for four configurations: Base / 0, Base / 1, Base+AllContext / 0, and Base+AllContext / 1. Kappa values range from roughly .43 to .65.]
What did we learn?
 Intuition confirmed
  Different dimensions responded differently to context-based enhancements
 Feature-based approach was more effective
  Thread structure features were especially informative for the Social Modes dimension
 Thread structure information is more difficult to extract from chat data
  Best results of a similar approach on chat data only achieved a Kappa of .45
William Cohen
Pinar Donmez
Jaime Arguello
Gahgene Gweon
Rohit Kumar
Yue Cui
Mahesh Joshi
Yi-Chia Wang
Hao-Chuan Wang
Emil Albright
Cammie Williams