Nick's presentation (transcript)

Automated Question Answering
Motivation: support for students
• Demand is for 365 x 24 support
– Students set aside time to complete task
– If problem encountered immediate help required
• Majority of responses direct students to teaching
materials; so not a case of “not there”
• Poor forum search
– Search per forum - not course
– Free-text search options fixed by RDBMS
• No explicit operators (AND, OR, NEAR)
Research questions
• Given the current level of development of
natural language processing (NLP) tools, is it
possible to:
– Classify messages as question/non-question
– Identify the topic of the question
– Direct users to specific course resources
Natural Language Processing tools
• Tokenisation (words, numbers, punctuation, whitespace)
• Sentence detection
• Part of speech tagging (verbs, nouns, pronouns, etc.)
• Named entity recognition (names, locations, events, organisations)
• Chunking/Parsing (noun/verb phrases and relationships)
• Statistical modelling tools
• Dictionaries, word-lists, WordNet, VerbNet
• Corpora tools (Lucene, Lemur)
Question answering solutions
• Open domain
– No restrictions on question topic
– Typically answers from web resources
– Extensive literature
• Closed domain
– Restricted question topics
– Typically answers from small corpus
• Company documents
• Structured data
Open domain QA research
• Well established over two decades
• TREC (Text REtrieval Conference)
– funded by NIST/DARPA since 1992
– QA track 1999 – 2007, directed at ‘Factoids’
• CLEF (Cross Language Evaluation Forum)
– 2001- current
– Information Retrieval, language resources
• NTCIR (NII Test Collection for IR Systems)
– 1997 – current
– IR, question answering, summarization, extraction
TREC Factoids
• Given a fact-based question:
– How many calories in a Big Mac?
– Who was the 16th President of the United States?
– Where is the Taj Mahal?
• Return an exact answer in 50/250 bytes
– 540 calories
– Abraham Lincoln
– Agra, India
Minimal factoid process
• Question analysis
– Normalisation (verbs, auxiliaries, modifiers)
– Identify entities (people, locations, events)
– Pattern detection (who was X?, how high is Y?)
• Query creation, expansion, and execution
– Ordered terms, combined terms, weighted terms
• Answer analysis
– Match answer type to question type
OpenEphyra: open source QA
Source: http://www.cs.cmu.edu/~nico/ephyra/doc/images/overall_architecture.jpg
OpenEphyra: question analysis
Question
‘who was the fourth president of the USA’
Normalization
‘who be fourth president of USA’
Answer type
NEproperName->NEperson
Interpretation
property: NAME
target: fourth president
context: USA
OpenEphyra: query expansion
1. "fourth president USA"
2. (fourth OR 4th OR quaternary) president (USA OR US
OR U.S.A. OR U.S. OR "United States" OR "United
States of America" OR "the States" OR America)
3. "fourth president" "USA" fourth president USA
4. "was fourth president of USA"
5. "fourth president of USA was"
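The OR-expansion shown in query 2 can be sketched as follows. This is a minimal illustration only: the `VARIANTS` table and the `expand_query` function are hypothetical names, and how OpenEphyra actually derives its variants is not shown on the slides.

```python
# Hypothetical hand-written variant table (an assumption for illustration).
VARIANTS = {
    "fourth": ["4th", "quaternary"],
    "USA": ["US", "U.S.A.", "U.S.", "United States"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(keywords, variants=VARIANTS):
    """Return a query string in which each keyword is OR-ed with its variants."""
    parts = []
    for word in keywords:
        alternatives = [word] + variants.get(word, [])
        if len(alternatives) == 1:
            parts.append(quote(word))
        else:
            parts.append("(" + " OR ".join(quote(a) for a in alternatives) + ")")
    return " ".join(parts)


expanded = expand_query(["fourth", "president", "USA"])
print(expanded)
```

Keywords without an entry in the table pass through unchanged, so the query never becomes narrower than the original keyword list.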
OpenEphyra: result
answer: James Madison
score: 0.7561732
docid: http://www.squidoo.com/james-madison-presidentusa
Document content:
<meta property="og:title" content="James Madison - 4th President of USA"/>
<h1>James Madison - 4th President of USA</h1>
<div class="module_intro>James Madison (March 16, 1751 - June 28, 1836)
was fourth President of the United States (1809-1817), and one of the
Founding Fathers of the United States...
Shallow answer selection
• Answer based on reformulation of question
– Who was the fourth president of the
<location>United States</location>?
– <person>James Madison</person> was the
fourth president of the <location>United
States</location>
Students don’t ask questions and we don’t provide answers!
Importance of named entities
[Diagram: the question is processed for NEs; search engine results are tagged with NEs; answer matching uses the extracted NEs to link question and answer]
PREPARATORY TASKS
Task list: the real work
• Create database of forum messages
• Adapt open source NLP tools
– Tokenisation, sentence detection, Parts Of Speech, parsing
• Establish question patterns
• Create language analysis tools
– Word frequency
– Named-entities: define, build, and train models
• Prepare corpus
– Format and tag documents (doc, html, pdf)
– Build Indri catalogue and search interface
Iterative process: build, test, refine
NLP tools
• Predominantly Java
– Stanford, OpenNLP, Lingpipe
– GATE: complete analysis + processing system
– IKVM permits use with .NET framework
• Some C++, C#
– WordNet, Lemur/Indri, Nooj, SharpNLP
• Python NLTK
– Complete NLP toolset and corpus
• Lisp, Prolog
Message database
• MySQL database for FirstClass messages
• Extract:
– Forum, Subject, Date, Author
– Body
• Use subject to classify as Original or Reply
No clean-up or filtering of message content undertaken at this stage
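The extraction step could look roughly like the sketch below, which parses one exported FirstClass message with Python's standard library. Two points are assumptions rather than anything stated on the slides: that the `objname` attribute holds the author name, and that replies can be recognised by a "Re:" subject prefix. The date field is left out because the samples do not make clear where it is stored.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Sample 1 from the slides.
SAMPLE = """<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Daniel Hughes 5">
<subject index="0" >Help Please!!!? Urgent</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I am trying to open an existing project but can't do it.
</body>
</firstclass>"""


def parse_message(xml_text):
    """Extract forum, subject, author and body from one exported message."""
    root = ET.fromstring(xml_text)
    header = root.find("FCFORMSHEADER/fcobject")
    subject = (header.findtext("subject") or "").strip()
    return {
        "forum": (header.findtext("tonames") or "").strip(),
        "subject": subject,
        # Assumption: objname holds the author's display name.
        "author": header.get("objname", ""),
        "body": (root.findtext("body") or "").strip(),
        # Assumption: replies carry a "Re:" subject prefix.
        "kind": "Reply" if subject.lower().startswith("re:") else "Original",
    }


message = parse_message(SAMPLE)
print(message["kind"], "|", message["subject"])
```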
Raw forum message (Sample 1)
<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Daniel Hughes 5">
<field id="3" index="0" type="number">-959014497</field>
<subject index="0" >Help Please!!!? Urgent</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I am trying to open an existing project but can't do it. It's driving me mad. I know the project folders
are located in the workspaceblock4 folder. I have deleted all the open projects in the project
explorer window (without deleting content). BUT how on earth do I know proceed to reload some
of the projects without starting from scratch? When I select open file ... it doesn't let me open any
projects files - only the individual files in the project folder. In other words I cannot get any project
files to appear in the project explorer window. Please can anyone help me as I have booked a lot of
time off work to concentrate on the project, but I am a dead end.
</body>
</firstclass>
Raw forum message (Sample 2)
<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Simon Shadbolt">
<field id="3" index="0" type="number">-962619805</field>
<subject index="0" >Block 4 Practical booklet 6 activity 4- Unable to get a fault!</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I have followed the set up and altered the fault to "none" and simulation to normal, but I do not get any faults at all or a listing
that resembles the list on page 12, particularly line 12. I have attached my bpel file and my screenshot, any help appreciated.
Simon

Process bpelEcho3pScope: Instance 1 created.
Process bpelEcho3pScope: Executing [/process]
Process Suspended [/process]
Receive ClientRequestMessage: Executing [/process/flow/receive[@name='ClientRequestMessage']]
.
Scope : Completed normally [/process/flow/scope]
Reply ClientResponseMessage: Executing [/process/flow/reply[@name='ClientResponseMessage']]
Reply ClientResponseMessage: Completed normally [/process/flow/reply[@name='ClientResponseMessage']]
Process bpelEcho3pScope: Completed normally [/process]
</body>
</firstclass>
[Callout on Sample 2: Eclipse console listing or XML]
T320 09B database properties
• Total messages: 4246
• Non-replies: 1051
• Manually tagged questions: 777
• Average length (lines): 7.9
• Containing XML: 17
• Containing Eclipse content: 37
Creating question patterns
• Extract text from forum messages (non-replies)
• Create n-grams (‘n’ adjacent words)
• Perform frequency analysis of n-grams
• Manually review n-grams to create question
patterns
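The n-gram steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool used in the study: it assumes plain whitespace tokenisation (punctuation stays attached to words), and the two sample messages are invented.

```python
from collections import Counter


def ngrams(text, n):
    """Return all runs of n adjacent whitespace-separated words in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(messages, n):
    """Count how often each n-gram occurs across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(ngrams(message, n))
    return counts


# Invented example messages (not taken from the forum database).
messages = [
    "Can anyone point me in the right direction please",
    "Please point me in the right direction",
]
counts = ngram_frequencies(messages, 5)
for gram, freq in counts.most_common(2):
    print(freq, gram)
```

The frequency list still needs the manual review step: the counts only rank candidates, they do not decide which n-grams indicate a question.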
N-gram results
Number of words    Unique patterns
6                  96900
5                  96780
4                  94975
3                  86338
5-word frequency analysis
Top 20 results

Frequency    N-gram
17           An unexpected error has occurred.
16           point me in the right
14           I get the following error
13           me in the right direction
12           unexpected error has occurred. UDDIException
9            does not seem to be
8            get the following error message
8            I get an error message
8            system cannot find the path
7            Any help would be appreciated.
7            I am not sure if
7            I can not seem to
7            I do not know what
6            A problem occured while running
6            but I get the following
6            cannot find the path specified
6            error has occurred. UDDIException java.
6            has occurred. UDDIException java. net.
6            I am not sure how
6            I do not seem to
Sliding window across message
Frequency    N-gram 1                               N-gram 2
1            am not that knowledgable Help          I am not that knowledgable
1            am not the early adopter               I am not the early
1            am not thinking straight today         I am not thinking straight
1            am not too far off                     I am not too far
1            am not too sure if                     I am not too sure
1            am not using the fault                 I am not using the
1            am noticing in the console             I am noticing in the
1            am now a while later                   I am now a while
1            am now adding my exception             I am now adding my
1            am now getting the following           I am now getting the
1            am now held up again                   I am now held up
1            am now not sure if                     I am now not sure
1            am now stuck on activity               I am now stuck on
1            am now trying not to                   I am now trying not
1            am now trying to start                 I am now trying to
1            am now willing to submit               I am now willing to
1            am obviously missing something here
Candidate question patterns
Class name     Pattern
#question      (a|my) question (about|on|for|is)
#appreciate    appreciate (.*) (advice|comment|guidance|help|direction)
#can/could     (can|could|will|would) (any|some)\s?(body|one) (.*) (explain|tell me)
#does          does (any|some)\s?(body|one) (have|know)
#having        (have|having) (.*) (problem|nightmare)s?
#how           how (best|can|does|do i|do you|do we)
#i am          i am not (really )?sure (if|how|what|when|whether|why)
#i cannot      i (can not|cannot|could not) find (.*) answer (.*) question
#just          just wonder(ed|ing)? (if|what)
#point me      point (me|one) (.*) right direction
Generalisation of patterns using POS
Question part              POS tag
any|some                   DT
advice|comment|guidance    NN
appreciated|welcomed       VB(N|D)
.                          ./.
Can/MD anyone/NN offer/VB some/DT help/NN ?/.
Can/MD someone/NN offer/VB some/DT help/NN ?/.
Can/MD anybody/RB give/VB some/DT guidance/NN ?/.
Could/MD somebody/RB give/VB some/DT direction/NN ?/.
POS pattern matching failed due to errors in assigning tags
Final question patterns: RegExs
Pattern ID    Weighting    Regular Expression
1             0            (?<a>(a|my)\squestion\s)(?<b>about|on|for|is)
66            0            (?<a>(i\sam|i'm|im)?\shav(e|ing)\s(difficult(y|ie)|issue|problem)(s)?)
67            0            (?<a>i\s(am|have|was))\b(?<b>.*)\b(?<c>wonder(ed|ing)?\s(if|what|whether)?)
69            0            (?<a>i\sam\s(confused|assuming|unable\sto\scontinue))
70            0            (?<a>i\sam\s(still|getting))\b(?<b>.*)\b(?<c>confused)
71            0            (?<a>i\sam\snot\s(really\s)?sure)\s(?<b>if|how|what|when|whether|why)
72            0            (?<a>i\sam\snot\s(really\s)?sure)\s(?<b>what(\sit\swants\sfrom\sme|\sthey\sare\safter))
73            0            (?<a>(i|i\sam)\s(not\sat\sall\ssure))
88            0            (?<a>i\shave\s(encountered|found|got))\b(?<b>.*)\b(?<c>issue|problem)
139           0            (?<a>what\s(have\si|i\shave))\b(?<b>.*)\b(?<c>wrong)
164*          100          (?<a>problem\s)(?<b>.*)\b(?<c>WSDL\sconformance\scheck)
* Pattern derived from Eclipse error message
169 patterns using ‘explicit capture’
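The `(?<a>...)` named-group syntax and the phrase "explicit capture" suggest .NET-style regular expressions, although the slides do not say so. As a sketch under that assumption, the patterns can be tried in Python by rewriting the group syntax to `(?P<a>...)`. The helper names `to_python` and `match_patterns` are illustrative, and the sentence is invented.

```python
import re

# Two patterns copied from the table above: ID -> (weighting, expression).
PATTERNS = {
    1: (0, r"(?<a>(a|my)\squestion\s)(?<b>about|on|for|is)"),
    71: (0, r"(?<a>i\sam\snot\s(really\s)?sure)\s(?<b>if|how|what|when|whether|why)"),
}


def to_python(pattern):
    """Rewrite (?<name>...) groups as Python's (?P<name>...) form."""
    return re.sub(r"\(\?<([A-Za-z_]\w*)>", r"(?P<\1>", pattern)


COMPILED = {
    pid: re.compile(to_python(expr), re.IGNORECASE)
    for pid, (_weight, expr) in PATTERNS.items()
}


def match_patterns(sentence):
    """Return (pattern ID, named groups) for every pattern found in sentence."""
    hits = []
    for pid in sorted(COMPILED):
        found = COMPILED[pid].search(sentence)
        if found:
            hits.append((pid, found.groupdict()))
    return hits


hits = match_patterns("I am not really sure how to deploy the service")
print(hits)
```

`groupdict()` returns only the named groups, which is roughly the effect that explicit capture gives: the unnamed helper groups such as `(really\s)?` are ignored in the result.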
CHALLENGES PROCESSING MESSAGES
Poor message style
Incorrect POS tagging due to spelling errors
when/WRB I/PRP tried/VBD to/TO generate/VB the/DT sample/NN
,/, it/PRP said/VBD the/DT data/NNS is/VBZ available/JJ ./.
XML within messages
Detected as single sentence
Eclipse console listing within message
Line breaks not recognised as end of sentence
Open-source NLP problems
• Sentence detection failures:
– Bad style (capitalisation, punctuation)
– Ellipsis (i tried... it failed... error message...)
– XML, BPEL segments concatenated to single sentence
• Tokenisation failures:
– Multiple punctuation ???, !!! (student emphasis)
– Abbreviations (im, cant, doesnt, etc.)
• POS errors
– Spelling, grammar
Purpose built tools
• Tokeniser
– Re-coded for typical forum content/style
• Multiple punctuation
• Abbreviations
• Common contractions
• Sentence detector
– New detector based on token sequences
• Pre-filter messages
– Remove XML, console listing, error messages
Message pre-filters
• Short-forms
– i’m, im, i m → i am
– can’t, cant, can t → can not
• Line numbers
• Repeated punctuation (!!!, ???, ...)
• Smilies
• Salutations (Hi all, Hiya, etc.)
• Names, signature, course codes
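A few of these pre-filters can be sketched with regular expressions. This is a minimal sketch, not the purpose-built tool: the expressions, the choice to collapse repeated punctuation to a single mark, and the sample message are all assumptions. Line numbers, names, signatures and course codes are left out because the slides do not describe how they were recognised.

```python
import re

# Short forms and their expansions, as listed on the slide.
SHORT_FORMS = [
    (re.compile(r"\b(i['’]m|im|i m)\b", re.IGNORECASE), "i am"),
    (re.compile(r"\b(can['’]t|cant|can t)\b", re.IGNORECASE), "can not"),
]
SALUTATION = re.compile(r"^\s*(hi all|hiya|hi|hello)\b[\s,!.]*", re.IGNORECASE)
SMILEY = re.compile(r"[:;]-?[()DP]")
REPEATED_PUNCTUATION = re.compile(r"([!?.])\1+")


def prefilter(text):
    """Apply the sketched filters in a fixed order and tidy whitespace."""
    for pattern, expansion in SHORT_FORMS:
        text = pattern.sub(expansion, text)
    text = SALUTATION.sub("", text)
    text = SMILEY.sub("", text)
    # Assumption: runs such as "!!!" are collapsed to one mark, not removed.
    text = REPEATED_PUNCTUATION.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


filtered = prefilter("Hi all, im stuck!!! I cant open the project :( Any ideas???")
print(filtered)
```

Expanding short forms before the other filters keeps the later steps simple, since tokenisation and POS tagging then see ordinary words.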
Filtered message
[Figure: raw message containing an Eclipse console listing, shown alongside the filtered message ready to process]
PRELIMINARY RESULTS:
question classification
Message-set properties
• Number of messages: 1051 (100%)
• Number of questions (M): 777 (73.9%) (100%)
• Number of questions (A): 756 (97.3%)
• False Positives (A not M): 58 (7.4%)
• False Negatives (M not A): 79 (10.2%)
Approx 90% success rate
M = manually annotated question, A = automatically annotated question
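The figures above are internally consistent, and standard precision and recall can be derived from them. The sketch below is a reading of the slide rather than the author's own calculation: it assumes the manual annotation (M) is the ground truth, and the "approx 90%" figure appears to correspond to recall.

```python
# Counts taken from the slide.
manual_questions = 777      # M: manually annotated questions
automatic_questions = 756   # A: automatically annotated questions
false_positives = 58        # A not M
false_negatives = 79        # M not A

# Questions found by both annotations.
true_positives = manual_questions - false_negatives
# Consistency check: the same value must follow from the automatic count.
consistent = true_positives == automatic_questions - false_positives

precision = true_positives / automatic_questions
recall = true_positives / manual_questions
f1 = 2 * precision * recall / (precision + recall)

print(f"true positives: {true_positives}, consistent: {consistent}")
print(f"precision: {precision:.3f}, recall: {recall:.3f}, F1: {f1:.3f}")
```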
Message-set properties – cont.
• Average # pattern matches: 2.7606
• Min # pattern matches: 1
• Max # pattern matches: 12
• Average # of lines (ASCII linefeed): 7.9
• Min # Lines in a message: 1
• Max # Lines in a message: 68
• Average # of sentences: 5.0
• Min # Sentences in a message: 1
• Max # Sentences in a message: 89
• Messages containing XML: 17
• Messages containing BPEL: 37
Distribution of pattern match count
[Bar chart: number of messages (y axis, 0 to 350) against number of pattern matches (x axis, 0 to 12)]
Challenges: false positives
Challenges: false negatives
Challenges: detecting the question
Messages matching question pattern
[Bar chart: number of messages (y axis, 0 to 250) matched by each pattern ID (x axis, 1 to 169); labelled pattern IDs: 10, 50, 68, 31]
Common question patterns (10)
• any
– (advice|clarification|clue|comment|further thought|guidance|help|hint|idea|opinion|pointer|reason|suggestion|taker)(s)?
• .*
• appreciated|welcome|welcomed
216 matches
Terms added over time to improve detection of questions
Sample question match (10)
Common question patterns (50)
• get|getting|gives|got|receive
• .*
• error(s)?
102 matches
Sample question match (50)
Discrimination vs Classification
[Bar chart: number of messages (y axis, 0 to 250) per pattern ID (x axis, 1 to 169), split into single-match and multi-match messages]
Low discrimination >>> Increases successful classification at the risk of false-positives
High discrimination >>> Reduces successful classification and risk of false-positives
Does process transfer?
• Tested against TT380 forums 04J – 07J
– Preliminary results look promising
– Need to manually tag >4000 messages
– Review message pre-filters
• Need access to Humanities course material
PRELIMINARY RESULTS:
question topic identification
Basic method
• Identify named entities
– NEs are block-specific
– Majority of questions linked to assignments
• Parse sentence for dependencies
– Nouns (that are NEs)
– Verbs
Named entities: inconsistent usage
Message body
Message subject
Error handling  Exception handling
Deep parsing: dependencies
advmod(delete-5, How-1)
aux(delete-5, can-2)
nsubj(delete-5, I-3)
advmod(delete-5, properly-4)
dobj(delete-5, PLTs-6)
conj_and(PLTs-6, PLs-8)
conj_and(PLTs-6, roles-10)
det(project-13, the-12)
prep_from(delete-5, project-13)
prep_in(delete-5, order-15)
aux(have-17, to-16)
xcomp(delete-5, have-17)
det(sheet-20, a-18)
amod(sheet-20, clean-19)
dobj(have-17, sheet-20)
advmod(have-17, again-21)
How can I properly delete
PLTs and PLs and roles from
the project in order to have
a clean sheet again.
Sentences per message
[Bar chart: number of messages (y axis, 0 to 200) against sentence count (x axis, 0 to 89)]
Sentence counts are under-estimated due to spelling/grammar errors.
Of the 120 questions detected as a single sentence, more than 80% actually contain multiple sentences.
Guess the topic
Excuse me for directing this question at you, but when I try to contact
my tutor through my homepage i still go to the details for John
Stephenson but I am sure that he is ill at the moment.
My question refers to the entities described in ECA part2 page 2, it
states that the term identifier must be unique within the UK business
domain.
I thought Buyers ID and Sellers ID could be their email address,
however, I am stuck on the Order ID which might refer to a depatch
note as I do not know what standard these identifiers have to conform
to in UK business.
I would appreciate being directed as to where I can find this
information.
Current status
• Unable to establish the question topic for 95%
of detected questions
• Current NLP techniques (anaphora and
co-reference resolution) for multi-sentence
questions are not well established.
Pattern matching in console listing
Practical work: exact patterns
• Process|Assign|Invoke|Scope|Reply
• .*
• Completed with fault:
• invalidVariables|uninitializedVariable|joinFailure
Provide direct link to FAQ or teaching materials
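The four pattern components above can be combined into one expression and applied line by line to a console listing. The following is a sketch under assumptions: the fault line in the sample listing is invented to show a match (the slides contain no faulting listing), and the `FAQ_LINKS` entries are placeholders, not real resources.

```python
import re

# Activity type, any text, the fixed phrase, then one of the known fault names.
FAULT_PATTERN = re.compile(
    r"^(?P<activity>Process|Assign|Invoke|Scope|Reply)\b"
    r".*"
    r"Completed with fault:\s*"
    r"(?P<fault>invalidVariables|uninitializedVariable|joinFailure)",
    re.MULTILINE,
)

# Placeholder targets (hypothetical); real FAQ locations are not given.
FAQ_LINKS = {
    "invalidVariables": "FAQ entry for invalidVariables (placeholder)",
    "uninitializedVariable": "FAQ entry for uninitializedVariable (placeholder)",
    "joinFailure": "FAQ entry for joinFailure (placeholder)",
}


def find_faults(listing):
    """Return (activity, fault) pairs for each faulting line in a listing."""
    return [(m.group("activity"), m.group("fault"))
            for m in FAULT_PATTERN.finditer(listing)]


# The first two lines come from Sample 2; the third is an invented fault line.
listing = (
    "Process bpelEcho3pScope: Instance 1 created.\n"
    "Scope : Completed normally [/process/flow/scope]\n"
    "Invoke InvokeEcho: Completed with fault: joinFailure [/process/flow/invoke]\n"
)
faults = find_faults(listing)
for activity, fault in faults:
    print(activity, fault, "->", FAQ_LINKS[fault])
```

Because console output is machine-generated, exact patterns like this avoid the spelling and sentence-detection problems that affect the free-text part of a message.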
Future work
• Further work on sentence detection
– Everything else depends on this
• Create patterns to identify content
– “how do i (.*)”
– “are you now saying (.*)”
– “(.*) word count”
• Establish relationships between initial message and
replies
• Build tool to process Eclipse console listings
– Could address 5% of all ECA related questions