Transcript Slide 1
TagHelper Tools: Supporting the Analysis of Conversational Data
Carolyn P. Rosé
Language Technologies Institute and Human-Computer Interaction Institute, Carnegie Mellon University

Outline
* What is TagHelper tools?
* What can TagHelper tools do for YOU?
* How EASY is it to use TagHelper tools?
* What are some TagHelper success stories?
* What problems are we working on?

What is TagHelper tools?
* A PSLC Enabling Technology project
* Machine learning technology for processing conversational data: chat data, newsgroup-style conversational data, and short answers and explanations
* Goal: automate the categorization of spans of text
* An add-on to Microsoft Excel
* Research focus: identify and solve text classification problems specific to the learning sciences (types of categories, nature and size of data sets)

What can TagHelper tools do for YOU?
Main uses for TagHelper tools:
* Supporting data analysis involving conversational data
* Triggering interventions
* Supporting on-line assessment

Example: Data Analysis

Example: Triggering an Intervention
ST1: well what values do u have for the reheat cycle ?
ST2: for some reason I said temperature at turbine to be like 400 C
Tutor: Let's think about the motivation for Reheat. What process does the steam undergo in the Turbines? …

Example: Supporting On-line Assessment
* Using instructor-assigned ratings as the gold standard
* Best performance without TagHelper tools: .16 correlation coefficient
* Best performance with TagHelper tools: .63 correlation coefficient

How EASY is it to use TagHelper tools?

Setting Up Your Data

Iterative Process for Using TagHelper tools
* Obtain data in natural language form
* Decide on a unit of analysis (single contributions, topic segments, whole messages, etc.)
* Decide on a set of categories or a rating system
* Set up your data in Excel
* Assign categories to part of your data
* Use TagHelper to assign categories to the remaining portion of your data

Training and Testing
* Start TagHelper tools by double-clicking the portal.bat icon
* You will then see the tool palette
* Train a prediction model on your coded data, then apply that model to uncoded data

Loading a File
* First click on Add a File, then select a file

Simplest Usage
Once your file is loaded, you have two options:
* The first option is to code your data using the default settings: simply click "GO!"
* The second option is to modify the default settings and then code
* We will start with the first option; note that the performance will not be optimal

Results
* Performance on coded data
* Results on uncoded data

A slightly more complex case…
Example: Data Analysis
Setting Up Your Data

What are some TagHelper success stories?

Success Story 1: Supporting Data Analysis
* Peer tutoring in the Algebra LearnLab
* Data coded for high-level help, low-level help, and no help, an important predictor of learning (e.g., Webb et al., 2003)
* Data contributed by Erin Walker
* TagHelper achieves agreement of .82 Kappa
* Can be used for follow-up studies in the same domain

Success Story 2: Triggering Interventions
* Collaborative idea generation in the Earth Sciences domain
* Chinese-language TagHelper learns a hand-coded topic analysis
* Human agreement: .84 Kappa; TagHelper performance: .7 Kappa
* Trained models used in a follow-up study to trigger interventions and facilitate data analysis

Example Dialogue
Student 1: People stole sand and stones to use for construction.
VIBRANT: Yes, stealing sand and stones may destroy the balance and thus make mountain areas unstable. Thinking about development of mountain areas, can you think of a kind of development that may cause a problem?
Student 2: Development of mountain areas often causes problems.
Student 1: It is okay to develop, but there must be some constraints.
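The train-then-apply workflow described above (code part of your data by hand, fit a model, let it label the rest) can be sketched in a few lines. This is a hypothetical illustration using scikit-learn, not TagHelper's actual implementation (which runs as an Excel add-on); the category names and example sentences are invented.

```python
# Sketch of the core TagHelper-style loop: fit a text classifier on the
# hand-coded portion of a transcript, then label the remaining uncoded rows.
# Categories and sentences below are made-up illustrations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-coded rows: (text span, analyst-assigned category)
coded = [
    ("what values do you have for the reheat cycle", "help-request"),
    ("can you explain how the turbine works", "help-request"),
    ("i think the temperature at the turbine is 400 C", "contribution"),
    ("the steam expands and cools in the turbine", "contribution"),
]
texts, labels = zip(*coded)

# Bag-of-words features + Naive Bayes, a common baseline for short text spans
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Apply the trained model to the uncoded portion of the data
uncoded = [
    "how does the reheat cycle change efficiency",
    "the pressure drops across the condenser",
]
predictions = model.predict(uncoded)
for span, label in zip(uncoded, predictions):
    print(f"{label}\t{span}")
```

With only four training rows the predictions are not meaningful; in practice the hand-coded portion would be hundreds of rows, and (as the slides note) default settings and features would then be tuned.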
* Feedback during idea generation increases both idea generation and learning (Wang et al., 2007)

Process Analysis
[Figure: # Unique Ideas vs. Time Stamp for the conditions Individuals+Feedback, Individuals+NoFeedback, Pairs+Feedback, and Pairs+NoFeedback (Nom+N, Nom+F, Real+N, Real+F); both panels show process loss]
* Pairs vs. Individuals: F(1,24)=12.22, p<.005, 1 sigma; negative effect of feedback: F(1,24)=7.23, p<.05, -1.03 sigma
* Pairs vs. Individuals: F(1,24)=4.61, p<.05, .61 sigma; positive effect of feedback: F(1,24)=16.43, p<.0005, 1.37 sigma

What problems are we working on?

Interesting Problems
* Highly skewed data sets: very infrequent classes are often the most interesting and important, and careful feature space design helps more than powerful algorithms
* Non-independence of data points from the same student is a huge problem: off-the-shelf machine learning algorithms are not set up for this, but new sampling techniques offer promise
* "Medium"-sized data sets: contemporary machine learning approaches are designed for huge data sets, so supplementing with alternative data sources may help

Example Lesson Learned
* Problem: context-oriented coding
* Finding: careful feature space design goes farther than powerful algorithms

Back to Argumentation Data

Sequential Learning
Notes sequential dependencies:
* Perhaps claims are stated before their warrants
* Perhaps counter-arguments are given before new arguments
* Perhaps people first build on their partner's ideas and then offer a new idea

Thread Structure Features
[Diagram over segments Seg1, Seg2, Seg3 illustrating the two features]
* Thread depth
* Best parent semantic similarity

Sequence-Oriented Features
* Notes whether text is within a certain proximity to quoted material

Context-Based Feature Approach
[Bar chart: Kappa from 10-fold CV on the Social, Macro, and Micro dimensions for four feature sets (Base, Base+Thread, Base+Seq, Base+AllContext); bar values range roughly from .52 to .73]

Sequential Learning
[Bar chart: Kappa from 10-fold CV on the Social, Macro, and Micro dimensions for the four settings Base / 0, Base / 1, Base+AllContext / 0, and Base+AllContext / 1]
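The slides summarize both inter-rater agreement and classifier quality as Cohen's Kappa, i.e. agreement corrected for chance. A minimal self-contained sketch of the statistic; the two label sequences are invented for illustration, loosely echoing the help-coding scheme mentioned earlier.

```python
# Cohen's Kappa: observed agreement between two coders, corrected for the
# agreement expected by chance given each coder's label frequencies.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement: probability both coders pick the same label at random
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented example: a human coder vs. a trained model on six text spans
human = ["help", "help", "no-help", "help", "no-help", "no-help"]
model = ["help", "help", "no-help", "no-help", "no-help", "no-help"]
print(round(cohens_kappa(human, model), 2))  # → 0.67
```

Here observed agreement is 5/6 and chance agreement is 1/2, giving Kappa = 2/3; raw percent agreement alone (83%) would overstate reliability for skewed label distributions, which is why the slides report Kappa throughout.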
What did we learn?
* Intuition confirmed: different dimensions responded differently to context-based enhancements
* The feature-based approach was more effective
* Thread structure features were especially informative for the Social Modes dimension
* Thread structure information is more difficult to extract from chat data: the best results of a similar approach on chat data only achieved a Kappa of .45

Acknowledgments: William Cohen, Pinar Donmez, Jaime Arguello, Gahgene Gweon, Rohit Kumar, Yue Cui, Mahesh Joshi, Yi-Chia Wang, Hao-Chuan Wang, Emil Albright, Cammie Williams