A Framework for Automated Corpus Generation for Semantic Sentiment Analysis

Amna Asmi and Tanko Ishaya, Member, IAENG

Proceedings of the World Congress on Engineering 2012 Vol I WCE 2012, July 4 - 6, 2012, London, U.K.

Introduction

• A variety of corpora exist (WordNet, SentiWordNet and Multi-Perspective Question Answering (MPQA))
• Some corpora are not large enough
• Generation and annotation are time consuming and inconsistent
• This paper presents a framework for automated generation of a corpus for semantic sentiment analysis of user-generated web content

Existing corpora

• MPQA
• Movie Review (Pang and others, 2002)
• Varbrul (Sankoff and Cedergren, program based on multivariate analysis)
• Fidditch (automated parser for English)
• Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
• International Corpus of English (ICE)

Existing Techniques for Sentiment Analysis

• Direction-based text including opinions, sentiments, affects and biases
• Opinion mining using ML techniques (supervised/unsupervised) at document, sentence or clause level
• Polarity, degree of polarity, features, subjectivity, relationships, identification, affect types, mood classification and ordinal scale

Annotation Process

Methodology

• Grabbing URL, author, subject, text and comments
• Text is broken into sentences
• Each sentence is processed with the Stanford Dependencies parser and Penn Treebank tagging, then broken down into clauses
• Subject-Verb-Object triplet is extracted
• Rules based on POS, negation, punctuation and conjunctions are specified using SentiWordNet and WordNet
• Rules are used to extract sentiment and define polarity and intensity
• Subjectivity is calculated from the subject and object of each sentence and the topic/title of the post
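The rule-based steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon values, the negation and intensifier word lists, and the function names `split_sentences` and `sentence_polarity` are all assumptions, and a naive regular expression stands in for the Stanford parser and Penn Treebank tagging.

```python
import re

# Toy scores standing in for SentiWordNet entries (hypothetical values).
LEXICON = {"good": 0.6, "helpful": 0.5, "bad": -0.6, "painful": -0.7}
# Hypothetical rule vocabularies for negation and intensity.
NEGATIONS = {"not", "no", "never"}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_polarity(sentence):
    """Sum lexicon scores, applying simple negation and intensifier rules."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score, negate, scale = 0.0, False, 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True              # flip the next sentiment word
        elif tok in INTENSIFIERS:
            scale = INTENSIFIERS[tok]  # scale the next sentiment word
        elif tok in LEXICON:
            value = LEXICON[tok] * scale
            score += -value if negate else value
            negate, scale = False, 1.0  # rules apply to one word only
    return score
```

Calling `sentence_polarity` on each element returned by `split_sentences` gives one signed score per sentence, where the sign is the polarity and the magnitude is a rough intensity.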

Tools used

• WordNet
• SentiWordNet
• Stanford Parser
• Penn Treebank
• UMLS (Unified Medical Language System)

Framework

Repository:

• WordNet, SentiWordNet dictionaries, UMLS Metathesaurus
• Rules for sentence, polarity, subjectivity and sentiment identification and analysis

Data Pre-processor:

• Input: unstructured data from a medical forum (http://www.medhelp.org/forums/list)
• Input is cleaned and filtered
• Captures the thread structure and comments of the forum, and arranges other information such as author, topic and date
• Spell checks the text
• Splits the data into a set of posts and sends them to the post pre-processor
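A minimal sketch of the cleaning and post-splitting step, assuming the raw thread has already been fetched into a dictionary. The `Post` class, the `thread` layout with `topic` and `entries` keys, and both function names are hypothetical; spell checking is omitted.

```python
import html
import re
from dataclasses import dataclass


@dataclass
class Post:
    author: str
    topic: str
    date: str
    text: str


def clean_text(raw):
    """Strip HTML tags, decode entities and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def to_posts(thread):
    """Turn a raw thread (hypothetical layout) into a list of cleaned posts.

    Entries whose body is empty after cleaning are filtered out.
    """
    posts = []
    for entry in thread["entries"]:
        text = clean_text(entry["body"])
        if text:
            posts.append(Post(entry["author"], thread["topic"], entry["date"], text))
    return posts
```

Each resulting `Post` keeps the author, topic and date alongside the cleaned text, so later stages can refer back to the thread structure.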

Framework

Post Pre-Processor:

• Splits text into sentences using the Penn Treebank tagger
• Passes sentences to the syntactic parser iteratively
• Keeps track of the start and end of each post

Syntactic Parser (SP):

• Collects sentences iteratively and invokes the POS tagger
• Identifies named entities and idioms
• Identifies dependencies/relationships
• Classifies each sentence as a question, assertion, comparison, confirmation seeking or confirmation providing
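The sentence classification step could look roughly like the sketch below. The slides do not say how the five classes are decided, so the surface cues here (cue words, a trailing question mark) are purely an assumption, as is the name `classify_sentence`; the actual framework relies on parser output.

```python
import re

# Hypothetical surface cues; the real system uses POS tags and dependencies.
COMPARATIVE = re.compile(r"\b(than|better|worse|more|less)\b")
CONFIRM_SEEK = re.compile(r"\b(right|isn't it|is that correct|am i right)\?$")
CONFIRM_GIVE = re.compile(r"^(yes|no|exactly|correct|that's right)\b")


def classify_sentence(sentence):
    """Assign one of the five sentence types using simple cue matching."""
    s = sentence.strip().lower()
    if CONFIRM_SEEK.search(s):
        return "confirmation seeking"
    if s.endswith("?"):
        return "question"
    if CONFIRM_GIVE.search(s):
        return "confirmation providing"
    if COMPARATIVE.search(s):
        return "comparison"
    return "assertion"
```

The order of the checks matters: confirmation seeking is tested before the generic question rule because both end in a question mark.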

Framework

Sentiment Analyser (SA):

• Extracts sentiment-oriented words from each sentence using relationship information (the dependencies within it)
• Polarity Calculator (PC) identifies positive (+) and negative (-) words
• Synonyms are used if a word is not found
• Collects synonyms from SentiWordNet
• Uses the UMLS Metathesaurus if no synonym is found
• Rules for polarity identification are applied
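The fallback order described for the Polarity Calculator can be sketched as a chained lookup. The three dictionaries below are toy stand-ins for SentiWordNet scores, its synonym sets and the UMLS Metathesaurus, and `lookup_polarity` is a hypothetical name; none of the values come from the real resources.

```python
# Toy stand-ins for the real resources (all entries are hypothetical).
SENTI = {"pain": -0.6, "relief": 0.7}
SYNONYMS = {"ache": ["pain"], "comfort": ["relief"]}
UMLS_SYNONYMS = {"cephalalgia": ["headache", "pain"]}


def lookup_polarity(word):
    """Return a polarity score, trying the word, then general synonyms, then UMLS.

    Returns None when no source yields a scored word.
    """
    word = word.lower()
    if word in SENTI:
        return SENTI[word]
    for source in (SYNONYMS, UMLS_SYNONYMS):
        for candidate in source.get(word, []):
            if candidate in SENTI:
                return SENTI[candidate]
    return None
```

Returning `None` rather than `0.0` for unknown words lets the caller distinguish "neutral" from "not found".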

Framework

Subjectivity Calculator (SC):

• Considers POS and relationships
• Identifies all sentences related to the topic
• Takes nouns and associated information (synonyms, homonyms, meronyms, holonyms and hyponyms)

Sentiment Analyser:

• Takes the polarities of sentences marked by the SC for post polarity calculation
• Takes the aggregate of all polarities of sentences related to the post
• Generates sentiment frame information for each sentence
• Each frame contains type, subject, object/feature, sentiment-oriented word(s), sentiment type (absolute/relative), strength (very weak, weak, average, strong, very strong), polarity of sentence, post index and sentence index
• Forwards calculated values and information to the Sentiment Frame Manager
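A sentiment frame with the fields listed above, plus a post-level aggregate, might be modelled as in this sketch. The field names, the equal-width bins in `strength_label`, and the choice of the arithmetic mean as the aggregate are assumptions; the slides do not specify how strength or the aggregate is computed.

```python
from dataclasses import dataclass
from statistics import mean

STRENGTHS = ["very weak", "weak", "average", "strong", "very strong"]


@dataclass
class SentimentFrame:
    sentence_type: str      # e.g. "question", "assertion"
    subject: str
    obj: str                # object/feature
    sentiment_words: list
    sentiment_type: str     # "absolute" or "relative"
    strength: str
    polarity: float
    post_index: int
    sentence_index: int


def strength_label(polarity):
    """Map |polarity| in [0, 1] to one of five labels (hypothetical equal bins)."""
    magnitude = min(abs(polarity), 1.0)
    return STRENGTHS[min(int(magnitude * 5), 4)]


def post_polarity(frames, post_index):
    """Aggregate sentence polarities of one post (mean is an assumption)."""
    values = [f.polarity for f in frames if f.post_index == post_index]
    return mean(values) if values else 0.0
```

Only frames whose `post_index` matches contribute to a post's score, which mirrors the idea that sentences unrelated to the post are excluded before aggregation.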

Framework

Sentiment Frame Manager:

• Stores all information to a physical location
• Loads all frames into a tree structure in runtime memory on program load
• Keeps track of changes and appends them
• Frames are stored in an XML file
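Storing frames as XML and loading them back into a tree can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`frames`, `frame`, `post`, `sentence`) and the reduced set of fields are assumptions; the slides do not describe the actual XML schema.

```python
import xml.etree.ElementTree as ET

FIELDS = ("subject", "object", "polarity", "strength")


def frames_to_xml(frames):
    """Serialise a list of frame dictionaries to an XML string."""
    root = ET.Element("frames")
    for frame in frames:
        node = ET.SubElement(
            root, "frame",
            post=str(frame["post_index"]),
            sentence=str(frame["sentence_index"]),
        )
        for key in FIELDS:
            ET.SubElement(node, key).text = str(frame[key])
    return ET.tostring(root, encoding="unicode")


def xml_to_frames(xml_text):
    """Parse the XML produced by frames_to_xml back into dictionaries."""
    frames = []
    for node in ET.fromstring(xml_text).iter("frame"):
        frames.append({
            "post_index": int(node.get("post")),
            "sentence_index": int(node.get("sentence")),
            "subject": node.findtext("subject"),
            "object": node.findtext("object"),
            "polarity": float(node.findtext("polarity")),
            "strength": node.findtext("strength"),
        })
    return frames
```

Writing the string returned by `frames_to_xml` to a file and reading it back with `xml_to_frames` gives a simple round trip; appending changes would mean adding new `frame` elements under the same root.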

Future Work

• The framework is currently being evaluated using medical forums
• There are plans to make it general purpose

Thank You
