Data mining and decision support

ACAI 05
ADVANCED COURSE ON KNOWLEDGE DISCOVERY
Language Technologies (2)
Valentin Tablan
University of Sheffield, UK
Overview
• Examples of HLT for the Semantic Web in use
• Work in context of EU SEKT and PrestoSpace projects
• Mixed Initiative Information Extraction
• RichNews (automated annotation of news programs)
Mixed Initiative IE
- Using Machine Learning for Information Extraction
- Both the human annotator (HA) and the system can take the initiative
- The HA provides some bootstrap examples
- The MI Engine learns and starts suggesting annotations
- The HA corrects these annotations
- And so on…
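The loop above can be sketched in a few lines of Python. This is only an illustration, not the MI API: `MajorityTokenLearner` and `mixed_initiative_session` are hypothetical names, the learner is a deliberately trivial stand-in for a real ML engine, and the gold labels stand in for the human annotator.

```python
from collections import Counter, defaultdict


class MajorityTokenLearner:
    """Toy learner: remembers the most frequent label seen for each token."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tokens, labels):
        for token, label in zip(tokens, labels):
            self.counts[token.lower()][label] += 1

    def suggest(self, tokens):
        # Unknown tokens get the "O" (outside any entity) label.
        return [
            self.counts[t.lower()].most_common(1)[0][0] if t.lower() in self.counts else "O"
            for t in tokens
        ]


def mixed_initiative_session(documents, gold_labels, n_bootstrap=1):
    """Simulate the loop; gold_labels plays the role of the human annotator."""
    learner = MajorityTokenLearner()
    corrections_per_doc = []
    for i, (tokens, gold) in enumerate(zip(documents, gold_labels)):
        if i < n_bootstrap:
            # The human annotator labels the bootstrap documents from scratch.
            corrections = len(tokens)
        else:
            # The system takes the initiative; the human only fixes wrong suggestions.
            suggested = learner.suggest(tokens)
            corrections = sum(s != g for s, g in zip(suggested, gold))
        corrections_per_doc.append(corrections)
        learner.train(tokens, gold)  # learn from the corrected annotations
    return corrections_per_doc
```

In this simulation the number of corrections per document is a rough proxy for annotator effort, which should drop as the learner sees more examples.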
What is Mixed Initiative IE?
• Also known as adaptive IE
• Not active learning! In active learning, the system selects the next document to be annotated by the user, which improves performance.
• Active learning is not part of the MI API, but will make use of it
Requirements
An MI engine must:
• Work as a background task
• Suggest annotations only when a given performance level is reached
• Be easily usable by a non-expert user
• Offer fine-grained parameters for experts
OBIE example
• Find instances in a document of entities and relations from an ontology
• Usable by a non-expert end user
• No learning corpus available
• Quick adaptation to a new ontology
Specifics of the MI API
• Train a statistical model
• Use several ML algorithms: SVM, Decision Trees, Neural Nets, etc.
• Compare the ML models and use the one which performs best at time t
• Combine the ML models
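A minimal sketch of the "compare and use the best at time t" and "combine" ideas, assuming each engine is just a Python callable and that accuracy on held-out examples is the comparison metric (the slides do not say which metric is used). All function names here are hypothetical.

```python
from collections import Counter


def accuracy(predict, examples):
    """Fraction of (features, label) examples that `predict` labels correctly."""
    correct = sum(predict(x) == y for x, y in examples)
    return correct / len(examples)


def best_engine(engines, held_out, min_level):
    """Return (name, score) of the best engine, or None if it is below min_level."""
    scores = {name: accuracy(fn, held_out) for name, fn in engines.items()}
    name = max(scores, key=scores.get)
    return (name, scores[name]) if scores[name] >= min_level else None


def majority_vote(engines, x):
    """Combine the engines by a simple majority vote on one example."""
    votes = Counter(fn(x) for fn in engines.values())
    return votes.most_common(1)[0][0]
```

Returning `None` below `min_level` mirrors the requirement that suggestions are only made once a given performance level is reached. Majority voting is just one possible way of combining models.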
Expected behaviour
[Figure: performance of Engine 1, Engine 2, Engine 3 and the MI Engine plotted against time (size of learning corpus), with the minimal performance level tolerated marked.]
Limitations of the ML API
• Configure a file per engine
> not suitable for a non-expert
• Set the class definition in the file
> problem for OBIE: ontology is not NE
> dynamic settings
• Engine characteristics: binary, numeric, nominal
> uniform declaration, automatic conversion
• Operate on tokens
> cannot annotate spans
• One class per engine
> how to set several possible values for an entity with a binary engine?
Meta Engine
• Combines several instances of ‘simple’ engines (they have to be of the same engine type, e.g. MaxEnt)
• Accepts rich descriptions of class & attributes
• Converts them into a suitable format for the ‘simple’ engine
• Merges results of embedded engines
• Behaves like a simple engine
• Hides the dirty job
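One way to picture a Meta Engine is as a one-vs-rest wrapper: one binary engine per class, with the multi-valued labels converted to a yes/no view for each embedded engine and the results merged afterwards. The sketch below assumes that design; `ThresholdBinaryEngine` and `MetaEngine` are hypothetical names, and the binary engine is a toy rather than a real learner such as MaxEnt.

```python
class ThresholdBinaryEngine:
    """Toy binary engine: scores an example by counting known cue features."""

    def __init__(self):
        self.cues = set()

    def train(self, examples):
        # examples: list of (set_of_features, bool)
        positive = set().union(*[f for f, y in examples if y])
        negative = set().union(*[f for f, y in examples if not y])
        self.cues = positive - negative

    def score(self, features):
        return len(self.cues & features)


class MetaEngine:
    """Wraps one binary engine per class and behaves like a multi-class engine."""

    def __init__(self, engine_factory):
        self.engine_factory = engine_factory
        self.engines = {}

    def train(self, examples):
        # examples: list of (set_of_features, class_name)
        for cls in sorted({y for _, y in examples}):
            engine = self.engine_factory()
            # Convert the multi-valued labels to this engine's binary view.
            engine.train([(f, y == cls) for f, y in examples])
            self.engines[cls] = engine

    def classify(self, features):
        # Merge the embedded engines' results: the highest score wins.
        return max(self.engines, key=lambda cls: self.engines[cls].score(features))
```

Client code only sees `train` and `classify`, so the wrapper can be used wherever a single engine is expected.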
MI API Architecture
[Figure: architecture diagram. Components shown: GUI / client code, Mixed Initiative API, Mixed Initiative Engine, DataSet, Orchestrator, Evaluation Module, and three Meta Engines.]
Data Set
• Information stored as examples
• No documents
• Used by Meta Engines
• Possibly converted to a native Data Set format (e.g. SVMlight)
• Possibly reuse an existing implementation (WEKA, Yale, …)
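As an illustration of converting examples to a native format, the sketch below renders feature dictionaries as SVMlight-style lines (`<target> <id>:<value> ...`, with feature ids starting at 1 and listed in ascending order). The helper names and the feature names in the usage are hypothetical, and this is not the Data Set implementation itself.

```python
def build_feature_index(examples):
    """Map every feature name to a positive integer id (SVMlight ids start at 1)."""
    names = sorted({name for features, _ in examples for name in features})
    return {name: i for i, name in enumerate(names, start=1)}


def to_svmlight_line(features, is_positive, index):
    """Render one example as '<target> <id>:<value> ...' with ascending ids."""
    target = "+1" if is_positive else "-1"
    pairs = sorted((index[name], value) for name, value in features.items())
    return " ".join([target] + [f"{i}:{v:g}" for i, v in pairs])


examples = [
    ({"is_capitalised": 1.0, "length": 5.0}, True),
    ({"length": 3.0}, False),
]
index = build_feature_index(examples)
lines = [to_svmlight_line(f, y, index) for f, y in examples]
```

Features that are absent from an example are simply omitted from its line, which is how the sparse format represents zeros.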
Evaluation module
• Operates on the Data Set
• Choice of corpus splitting methods (including K-fold cross validation)
• Different evaluation metrics
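A small sketch of what such a module computes, assuming K-fold splitting over example indices and precision, recall and F1 as the metrics (the slides do not list which metrics are offered). Function names are hypothetical.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th example goes to this fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def precision_recall_f1(predicted, gold):
    """Compute the three metrics over sets of predicted and gold annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Each example lands in exactly one test fold, so every example is used for testing once and for training k - 1 times.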
Orchestrator
• Core of an MI Engine
• Manages the Meta Engines
• Uses the Data Set and Evaluation Module
• Returns information about the Meta Engines (confusion matrix, performance, etc.)
• Combines the ML models
• Converts from / to annotations
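Converting from / to annotations matters because the underlying engines label individual tokens, while annotations cover spans. One common encoding for this is BIO tagging (B- begins a span, I- continues it, O is outside). The sketch below assumes that encoding; the slides do not say which scheme the Orchestrator uses, and the function names are hypothetical.

```python
def spans_to_token_labels(n_tokens, spans):
    """spans: list of (start_token, end_token_exclusive, cls) -> BIO labels."""
    labels = ["O"] * n_tokens
    for start, end, cls in spans:
        labels[start] = f"B-{cls}"
        for i in range(start + 1, end):
            labels[i] = f"I-{cls}"
    return labels


def token_labels_to_spans(labels):
    """Inverse conversion: group B-/I- labels back into (start, end, cls) spans."""
    spans, start, current = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a final span
        continues = label.startswith("I-") and label[2:] == current
        if current is not None and not continues:
            spans.append((start, i, current))
            start, current = None, None
        if label.startswith("B-") or (label.startswith("I-") and current is None):
            start, current = i, label[2:]
    return spans
```

A stray I- label with no preceding B- is treated as the start of a new span, which is a lenient choice for noisy classifier output.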
Allegory: MI Engine = Orchestra
[Figure: two panels, each showing four Meta Engines (ME).
Music School (Teacher): 1- Learn one/some/all instruments (entities)? 2- Exams for all at the same time? 3- Good enough and better than existing orchestra?
Orchestra (Conductor): 1- Combine their skills. 2- Play for an audience.]
Summary of MI IE
• Required component for Ontology-based Information Extraction
• State-of-the-art functionalities
• Reaches a high performance level by combining classification algorithms
RichNews
• RichNews aims to automate the annotation of news programs.
• Starts from recordings of broadcasts.
• Produces annotations that can be included in a semantic repository (i.e. KIM/Sesame).
• Works for English but most processing resources can be adapted for other languages.
Key Problems
• Speech recognition produces poor quality transcripts with many mistakes.
• A news broadcast contains several stories. How do we work out where one starts and another one stops?
• How can we make a summary or headline for each story from a poor quality transcript? How can we work out what kind of news it reports?
Augmented Television News
News broadcasts are often augmented with textual content.
– Usually only limited content is available.
– The TV company controls content production.
Rich News finds content automatically.
– Developed on BBC news.
– Recordings of broadcasts go in one end.
– Relevant news web pages are associated with the stories in the broadcasts fully automatically.
Semantic Indexing of News
Systems already exist that can index news broadcasts in terms of the ‘named entities’ that they refer to.
– e.g. Mark Maybury’s Broadcast News Navigator.
– Entities such as cities, people, organizations are marked as such.
Rich News can improve annotation:
– Annotation is in terms of an ontology.
– Uses Automatic Speech Recognition, so can be applied when no subtitles are available.
– Web pages are used to help find named entities.
Using ASR Transcripts
ASR is performed by the THISL system.
• Based on the ABBOT connectionist speech recognizer.
• Optimized specifically for use on BBC news broadcasts.
• Average word error rate of 29%.
• Error rate of up to 90% for out of studio recordings.
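For reference, word error rate is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch (the function name is hypothetical, and the example sentences are invented rather than taken from the BBC data):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)
```

With a five-word reference, one substituted word and one dropped word give a WER of 2 / 5 = 40%.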
SA General Architecture
[Figure: architecture diagram. Components shown: Media Object, Source Detection, Story Segmentation, Multi-source IE (several Information Source / Source Extractor / IE chains and a Merger, marked "(?)"), and Semantic Index.]
RichNews Architecture
[Figure: architecture diagram. Components shown: Media Object, Source Detection, THISL ASR, Story Segmenter (Story 1, Story 2, ... Story N), Multi-source IE (ASR, Web Miner, ...), ASR Transcript, Related Web Pages, ASR IE System, KIM Ontological IE System, GATE/ELAN Manual Annotation (optional), Entity 2 Instance, and Semantic Index.]
Topical Segmentation
Uses the C99 segmenter:
• Removes common words from the ASR transcripts.
• Stems the other words to get their roots.
• Then looks to see in which parts of the transcripts the same words tend to occur. These parts will probably report the same story.
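The steps above can be illustrated with a much simplified lexical-cohesion sketch. It is not the C99 algorithm itself: it only finds the single split point where the text before and after share the least vocabulary, the stop-word list is tiny, and `crude_stem` is a stand-in for a real stemmer. All names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are", "was", "has", "have"}


def crude_stem(word):
    """Stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Lower-case, drop stop words, stem, and count the remaining terms."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return Counter(crude_stem(w) for w in words if w and w not in STOP_WORDS)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_likely_boundary(sentences):
    """Index i (needs >= 2 sentences) where text before i and from i on overlap least."""
    vectors = [preprocess(s) for s in sentences]

    def similarity_at(i):
        left = sum(vectors[:i], Counter())
        right = sum(vectors[i:], Counter())
        return cosine(left, right)

    return min(range(1, len(sentences)), key=similarity_at)
```

A real segmenter has to find an unknown number of boundaries, and noisy ASR output makes the word overlap far less clean than in a toy example.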
Key Phrase Extraction
Term frequency inverse document frequency (TF.IDF):
• Chooses sequences of words that tend to occur more frequently in the story than they do in the language as a whole.
• Any sequence of up to three words can be a phrase.
• Up to four phrases extracted per story.
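A minimal sketch of this kind of key-phrase extraction, assuming a small background collection of documents stands in for "the language as a whole" and that add-one smoothing is used for the IDF term. The slides do not give the exact weighting formula, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(words, max_len=3):
    """All word sequences of length 1..max_len."""
    return [
        " ".join(words[i : i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def key_phrases(story_words, background_docs, max_phrases=4):
    """Rank the story's n-grams by TF.IDF against a background collection."""
    tf = Counter(ngrams(story_words))
    background = [set(ngrams(doc)) for doc in background_docs]

    def tf_idf(phrase):
        doc_freq = sum(phrase in doc for doc in background)
        # Add-one smoothing keeps the ratio finite for unseen phrases.
        return tf[phrase] * math.log((1 + len(background)) / (1 + doc_freq))

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda p: (-tf_idf(p), p))
    return ranked[:max_phrases]
```

Words that appear in every background document, such as "the", score zero and so never make it into the top phrases.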
Web Search
The key-phrases are used to search the BBC website and the Times, Guardian and Telegraph newspaper websites for web pages reporting each story in the broadcast.
• Searches are restricted to the day of broadcast, or the day after.
• Searches are repeated using different combinations of the extracted key-phrases.
The text of the returned web pages is compared to the text of the transcript to find matching stories.
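The query generation and page matching steps might look roughly like the sketch below. The actual search is not performed here, Jaccard word overlap is an assumed similarity measure (the slides do not say how the texts are compared), and the `threshold` value and function names are hypothetical.

```python
from itertools import combinations


def query_combinations(key_phrases):
    """Yield queries from the largest phrase combination down to single phrases."""
    for size in range(len(key_phrases), 0, -1):
        for combo in combinations(key_phrases, size):
            yield " ".join(f'"{phrase}"' for phrase in combo)


def word_overlap(transcript, page_text):
    """Jaccard overlap between the word sets of the two texts."""
    a, b = set(transcript.lower().split()), set(page_text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def best_matching_page(transcript, pages, threshold=0.2):
    """pages: {url: text}. Return the best url, or None if nothing is close enough."""
    scored = {url: word_overlap(transcript, text) for url, text in pages.items()}
    best = max(scored, key=scored.get, default=None)
    return best if best is not None and scored[best] >= threshold else None
```

Trying the most specific query first and then backing off to fewer phrases is one plausible ordering. Rejecting pages below the threshold trades recall for precision.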
Evaluation
Success in finding matching web pages was investigated.
• Evaluation based on 66 news stories from 9 half-hour news broadcasts.
• Web pages were found for 40% of stories.
• 7% of pages reported a closely related story, instead of that in the broadcast.
Results are based on an earlier version of the system, which only used BBC web pages.
Using the Web Pages
Web pages can be made available to the viewer as additional content.
The web pages contain:
• A headline, summary and section for each story.
• High quality text that is readable, and contains correctly spelt proper names.
• More in-depth coverage of the stories.
Web pages could be included in the broadcast by the TV company, or discovered by a device in viewers’ homes.
Semantic Annotation
• KIM can semantically annotate the text derived from the web pages.
• KIM will identify people, organizations, locations etc.
• KIM performs well on the web page text, but very poorly when run on the transcripts directly.
• This allows for semantic, ontology-aided searches for stories about particular people, locations, etc.
• So we could search for people called Sydney, which would be difficult with a text-based search.
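The "people called Sydney" example can be illustrated with a toy index keyed on entity type as well as name. This is not KIM's API: the `SemanticIndex` class, the type names and the story ids are all hypothetical.

```python
from collections import defaultdict


class SemanticIndex:
    """Toy index from (entity type, name) to the stories that mention it."""

    def __init__(self):
        self.by_entity = defaultdict(set)

    def add(self, story_id, entity_type, name):
        self.by_entity[(entity_type, name.lower())].add(story_id)

    def search(self, entity_type, name):
        return sorted(self.by_entity.get((entity_type, name.lower()), set()))


index = SemanticIndex()
index.add("story-1", "Person", "Sydney")
index.add("story-2", "Location", "Sydney")
people_called_sydney = index.search("Person", "Sydney")
```

A plain text search for "Sydney" would return both stories, while the typed lookup returns only the one about a person.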
Search for Entities
Story Retrieval
Summary of RichNews
• Rich News can automatically segment, describe and classify news broadcasts.
• Requires an on-line textual source that closely parallels the broadcasts.
• High precision, moderate recall (so far).
• Easy to adapt to other languages.