Text mining and the Semantic Web

Dr Diana Maynard
NLP Group
Department of Computer Science
University of Sheffield
http://nlp.shef.ac.uk
Structure of this lecture
• Text Mining and the Semantic Web
• Text Mining Components / Methods
• Information Extraction
• Evaluation
• Visualisation
• Summary
University of Manchester – 15 March 2005
Introduction to Text Mining and the Semantic Web
What is Text Mining?
• Text mining is about knowledge discovery from
large collections of unstructured text.
• It’s not the same as data mining, which is more
about discovering patterns in structured data
stored in databases.
• Similar techniques are sometimes used, but text
mining has many additional constraints caused by
the unstructured nature of the text and the use of
natural language.
• Information extraction (IE) is a major component
of text mining.
• IE is about extracting facts and structured
information from unstructured text.
Challenge of the Semantic Web
• The Semantic Web requires machine-processable,
repurposable data to complement hypertext.
• Such metadata can be divided into two types of
information: explicit and implicit. IE is mainly
concerned with implicit (semantic) metadata.
• More on this later…
Text mining components and methods
Text mining stages
• Document selection and filtering (IR
techniques)
• Document pre-processing (NLP
techniques)
• Document processing (NLP / ML /
statistical techniques)
Stages of document processing
• Document selection involves identification and
retrieval of potentially relevant documents from a
large set (e.g. the web) in order to reduce the
search space. Standard or semantically enhanced
IR techniques can be used for this.
• Document pre-processing involves cleaning and
preparing the documents, e.g. removal of
extraneous information, error correction, spelling
normalisation, tokenisation, POS tagging, etc.
• Document processing consists mainly of
information extraction
• For the Semantic Web, this is realised in terms
of metadata extraction
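The pre-processing stage described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method used in the lecture: the function name `preprocess` and the regular expressions are hypothetical, and POS tagging, error correction and spelling normalisation are left out because they need a trained model or a lexicon.

```python
import re

def preprocess(raw_text):
    """Clean a raw document and return a list of sentences, each a list of tokens."""
    # Remove markup remnants (a crude stand-in for "removal of extraneous information").
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence splitting: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Naive tokenisation: words (allowing internal hyphens or apostrophes)
    # or single punctuation characters.
    return [re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", s) for s in sentences if s]

if __name__ == "__main__":
    for sentence in preprocess("<p>Dr Head visited   Sheffield. He left!</p>"):
        print(sentence)
```

A splitter this simple would wrongly break after abbreviations such as "Dr.", which is one reason real pipelines use dedicated sentence-splitting components.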
Metadata extraction
• Metadata extraction consists of two types:
• Explicit metadata extraction involves
information describing the document, such as
that contained in the header information of
HTML documents (titles, abstracts, authors,
creation date, etc.)
• Implicit metadata extraction involves semantic
information deduced from the material itself, i.e.
endogenous information such as names of
entities and relations contained in the text. This
essentially involves Information Extraction
techniques, often with the help of an ontology.
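As an illustration of the explicit case, the sketch below reads the title and `<meta>` entries from an HTML header using Python's standard `html.parser` module. The class and function names (`HeaderMetadataParser`, `extract_explicit_metadata`) and the sample page are assumptions made for this example; implicit metadata would need the IE techniques covered later and is not attempted here.

```python
from html.parser import HTMLParser

class HeaderMetadataParser(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than overwrite.
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data

def extract_explicit_metadata(html):
    parser = HeaderMetadataParser()
    parser.feed(html)
    return parser.metadata

if __name__ == "__main__":
    # Hypothetical page, for illustration only.
    page = (
        "<html><head><title>Text mining</title>"
        '<meta name="author" content="Diana Maynard">'
        '<meta name="date" content="2005-03-15">'
        "</head><body>...</body></html>"
    )
    print(extract_explicit_metadata(page))
```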
Information Extraction (IE)
IE is not IR
IR pulls documents
from large text
collections (usually the
Web) in response to
specific keywords or
queries. You analyse
the documents.
IE pulls facts and
structured information
from the content of large
text collections. You
analyse the facts.
IE for Document Access
• With traditional query engines, getting the
facts can be hard and slow
• Where has the Queen visited in the last
year?
• Which places on the East Coast of the
US have had cases of West Nile Virus?
• Which search terms would you use to get this
kind of information?
• How can you specify you want someone’s
home page?
• IE returns information in a structured way
• IR returns documents containing the relevant
information somewhere (if you’re lucky)
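The contrast between the two can be made concrete with a toy example. The sketch below is only an illustration: the three documents are invented, and the names `retrieve`, `extract_visits` and `VISIT` are hypothetical. The IR-style function returns whole documents that contain the keywords, while the IE-style function returns structured facts, each linked back to its source document.

```python
import re

# Invented mini-collection, for illustration only.
DOCS = {
    "doc1": "In May the Queen visited Canada. The trip lasted a week.",
    "doc2": "The Queen visited Norway in the autumn.",
    "doc3": "West Nile Virus cases were reported in several states.",
}

def retrieve(keywords):
    """IR-style: return ids of documents containing every keyword."""
    return [
        doc_id
        for doc_id, text in DOCS.items()
        if all(k.lower() in text.lower() for k in keywords)
    ]

# A single hand-written pattern: "the Queen visited <Capitalised word>".
VISIT = re.compile(r"[Tt]he (Queen) visited ([A-Z][a-z]+)")

def extract_visits():
    """IE-style: return structured facts, each pointing back to its source."""
    facts = []
    for doc_id, text in DOCS.items():
        for match in VISIT.finditer(text):
            facts.append(
                {"visitor": match.group(1), "place": match.group(2), "source": doc_id}
            )
    return facts

if __name__ == "__main__":
    print(retrieve(["Queen", "visited"]))  # documents the user still has to read
    print(extract_visits())                # facts the user can analyse directly
```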
IE as an alternative to IR
• IE returns knowledge at a much deeper
level than traditional IR
• Constructing a database through IE and
linking it back to the documents can
provide a valuable alternative search tool.
• Even if results are not always accurate,
they can be valuable if linked back to the
original text
Some example applications
• HaSIE
• KIM
• Threat Trackers
HaSIE
• Application developed by the University of
Sheffield, which aims to find out how
companies report health and safety
information
• Answers questions such as:
“How many members of staff died or had accidents
in the last year?”
“Is there anyone responsible for health and
safety?”
“What measures have been put in place to
improve health and safety in the workplace?”
HASIE
• Identification of such information is too
time-consuming and arduous to be done
manually
• IR systems can’t cope with this because
they return whole documents, which could
be hundreds of pages
• System identifies relevant sections of each
document, pulls out sentences about
health and safety issues, and populates a
database with relevant information
KIM
• KIM is a software platform developed by
Ontotext for semantic annotation of text.
• KIM performs automatic ontology
population and semantic annotation for
Semantic Web and KM applications
• Indexing and retrieval (an IE-enhanced
search technology)
• Query and exploration of formal
knowledge
Ontotext’s KIM query and results
Threat tracker
• Application developed by Alias-I which finds and
relates information in documents
• Intended for use by Information Analysts who
use unstructured news feeds and standing
collections as sources
• Used by DARPA for tracking possible
information about terrorists etc.
• Identification of entities, aliases, relations etc.
enables you to build up chains of related people
and things
What is Named Entity Recognition?
• Identification of proper names in texts, and
their classification into a set of predefined
categories of interest
• Persons
• Organisations (companies, government
organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
• Various other types as appropriate
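A minimal sketch of the idea, assuming a tiny hand-made gazetteer and a single date pattern. The entries in `GAZETTEER`, the pattern `DATE_PATTERN` and the function `find_entities` are all hypothetical; real systems use far larger lists together with context-sensitive grammars.

```python
import re

# Toy gazetteer: hypothetical entries for illustration only.
GAZETTEER = {
    "Diana Maynard": "Person",
    "University of Sheffield": "Organisation",
    "Ontotext": "Organisation",
    "Manchester": "Location",
}

# Very small date grammar covering expressions such as "15 March 2005".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)

def find_entities(text):
    """Return (start, end, text, type) tuples sorted by position."""
    found = []
    # Try longer names first so that a long entry wins over a name nested inside it.
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            # Skip a match that overlaps an entity we have already accepted.
            if not any(m.start() < end and start < m.end() for start, end, _, _ in found):
                found.append((m.start(), m.end(), m.group(), GAZETTEER[name]))
    for m in DATE_PATTERN.finditer(text):
        found.append((m.start(), m.end(), m.group(), "Date"))
    return sorted(found)

if __name__ == "__main__":
    for entity in find_entities("Diana Maynard spoke in Manchester on 15 March 2005."):
        print(entity)
```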
Why is NE important?
• NE provides a foundation from which to
build more complex IE systems
• Relations between NEs can provide
tracking, ontological information and
scenario building
• Tracking (co-reference) “Dr Head, John,
he”
• Ontologies “Manchester, CT”
• Scenario “Dr Head became the new
director of Shiny Rockets Corp”
Two kinds of approaches
Knowledge Engineering
• rule based
• developed by experienced
language engineers
• make use of human
intuition
• require only small amount
of training data
• development can be very
time consuming
• some changes may be
hard to accommodate
Learning Systems
• use statistics or other
machine learning
• developers do not need
LE expertise
• require large amounts of
annotated training data
• some changes may
require re-annotation of
the entire training corpus
Typical NE pipeline
• Pre-processing (tokenisation, sentence
splitting, morphological analysis, POS
tagging)
• Entity finding (gazetteer lookup, NE
grammars)
• Coreference (alias finding, orthographic
coreference etc.)
• Export to database / XML
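The last two stages of this pipeline can be sketched as below, taking already-recognised mentions as input. The coreference rule is a deliberately simple assumption (a mention joins an earlier chain of the same type if all its tokens occur in that chain's first mention), and the function names `link_aliases` and `to_xml` and the XML layout are hypothetical, not the format of any particular tool. Pronouns such as "he" are not handled.

```python
import xml.etree.ElementTree as ET

def link_aliases(mentions):
    """Give each (text, type) mention a chain id using a token-subset rule."""
    chains = []  # one (type, token set of first mention) per chain
    linked = []
    for text, etype in mentions:
        tokens = set(text.split())
        for chain_id, (chain_type, chain_tokens) in enumerate(chains):
            if chain_type == etype and tokens <= chain_tokens:
                break  # alias of an existing chain
        else:
            chain_id = len(chains)  # start a new chain
            chains.append((etype, tokens))
        linked.append((text, etype, chain_id))
    return linked

def to_xml(linked):
    """Serialise linked mentions as a small XML document (hypothetical layout)."""
    root = ET.Element("entities")
    for text, etype, chain_id in linked:
        element = ET.SubElement(root, "entity", type=etype, chain=str(chain_id))
        element.text = text
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    mentions = [
        ("Dr John Head", "Person"),
        ("Shiny Rockets Corp", "Organisation"),
        ("John", "Person"),
        ("Dr Head", "Person"),
    ]
    print(to_xml(link_aliases(mentions)))
```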
GATE and ANNIE
• GATE (Generalised Architecture for Text
Engineering) is a framework for language
processing
• ANNIE (A Nearly New Information Extraction
system) is a suite of language processing tools,
which provides NE recognition
GATE also includes:
• plugins for language processing, e.g. parsers,
machine learning tools, stemmers, IR tools, IE
components for various languages etc.
• tools for visualising and manipulating ontologies
• ontology-based information extraction tools
• evaluation and
benchmarking tools
Information Extraction for the Semantic Web
• Traditional IE is based on a flat structure, e.g.
recognising Person, Location, Organisation,
Date, Time etc.
• For the Semantic Web, we need information in a
hierarchical structure
• Idea is that we attach semantic metadata to the
documents, pointing to concepts in an ontology
• Information can be exported as an ontology
annotated with instances, or as text annotated
with links to the ontology
Richer NE Tagging
• Attachment of
instances in the text to
concepts in the
domain ontology
• Disambiguation of
instances, e.g.
Cambridge, MA vs
Cambridge, UK
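Both points can be illustrated with a toy ontology. Everything in the sketch below is an assumption made for the example: the concept hierarchy in `ONTOLOGY`, the `ex:` instance identifiers, the cue words, and the overlap-counting rule used to pick a candidate. Real systems draw on much richer context and on the ontology itself.

```python
# Hypothetical mini-ontology: concept -> parent concept (None marks the root).
ONTOLOGY = {"Entity": None, "Location": "Entity", "City": "Location"}

# Hypothetical candidate instances for an ambiguous name, with context cue words.
CANDIDATES = {
    "Cambridge": [
        {"uri": "ex:Cambridge_UK", "concept": "City",
         "cues": {"uk", "england", "cam"}},
        {"uri": "ex:Cambridge_MA", "concept": "City",
         "cues": {"ma", "massachusetts", "mit", "harvard"}},
    ]
}

def disambiguate(mention, context):
    """Pick the candidate whose cue words overlap most with the context."""
    words = {w.strip(".,;:()").lower() for w in context.split()}
    candidates = CANDIDATES.get(mention, [])
    if not candidates:
        return None
    best = max(candidates, key=lambda c: len(c["cues"] & words))
    # With no supporting evidence at all, decline to choose.
    return best if best["cues"] & words else None

def concept_path(concept):
    """Return the path from the ontology root down to the given concept."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = ONTOLOGY[concept]
    return path[::-1]

if __name__ == "__main__":
    choice = disambiguate("Cambridge", "She studied at MIT in Cambridge, Massachusetts.")
    print(choice["uri"], concept_path(choice["concept"]))
```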
Magpie
• Developed by the Open University
• Plugin for standard web browser
• Automatically associates an ontology-based
semantic layer to web resources, allowing
relevant services to be linked
• Provides means for a structured and informed
exploration of the web resources
• e.g. looking at a list of publications, we can find
information about an author such as projects
they work on, other people they work with, etc.
MAGPIE in action
Evaluation
Evaluation metrics and tools
• Evaluation metrics mathematically define how to
measure the system’s performance against a
human-annotated gold standard
• Scoring program implements the metric and
provides performance measures
– for each document and over the entire corpus
– for each type of NE
– may also evaluate changes over time
• A gold standard reference set also needs to be
provided – this may be time-consuming to
produce
• Visualisation tools show the results graphically
and enable easy comparison
Methods of evaluation
• Traditional IE is evaluated in terms of Precision
and Recall
• Precision - how accurate were the answers the
system produced?
correct answers/answers produced
• Recall - how good was the system at finding
everything it should have found?
correct answers/total possible correct answers
• There is usually a tradeoff between precision
and recall, so a weighted harmonic mean of the two
(F-measure) is generally also used.
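The three measures can be computed as in the sketch below, which treats an answer as correct only if it matches a gold-standard item exactly. The function name `precision_recall_f`, the `beta` weighting parameter and the sample annotations are assumptions for this example; scoring tools often also give partial credit for overlapping spans, which is not modelled here.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Return (precision, recall, F) for two collections of annotations.

    beta > 1 weights recall more heavily, beta < 1 weights precision more
    heavily, and beta = 1 gives the balanced F1 score.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure

if __name__ == "__main__":
    # Invented annotations: (text, type) pairs.
    gold = {
        ("Diana Maynard", "Person"),
        ("Sheffield", "Location"),
        ("Ontotext", "Organisation"),
        ("15 March 2005", "Date"),
    }
    predicted = {
        ("Diana Maynard", "Person"),
        ("Sheffield", "Organisation"),  # right span, wrong type
        ("Ontotext", "Organisation"),
    }
    print(precision_recall_f(gold, predicted))
```

Here two of the three answers produced are correct (precision 2/3) and two of the four gold items are found (recall 1/2), so the balanced F-measure is 4/7.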
GATE AnnotationDiff Tool
Metrics for Richer IE
• Precision and Recall are not sufficient for
ontology-based IE, because the distinction
between right and wrong is less obvious
• Recognising a Person as a Location is clearly
wrong, but recognising a Research Assistant as
a Lecturer is not so wrong
• Similarity metrics also need to be integrated, so
that a wrong answer scores higher the closer it is
to the correct concept in the hierarchy
• Also possible is a cost-based approach, where
different weights can be given to each concept in
the hierarchy, and to different types of error, and
combined to form a single score
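One way to give partial credit along these lines is a depth-based similarity over the concept hierarchy, similar in spirit to the Wu-Palmer measure. The slides do not name a specific metric, so the formula, the toy hierarchy in `PARENT` and the function names below are all assumptions made for illustration.

```python
# Hypothetical concept hierarchy: concept -> parent (None marks the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "Research Assistant": "Academic",
}

def ancestors(concept):
    """Path from the concept up to the root, including both ends."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def depth(concept):
    """Number of edges between the concept and the root (root has depth 0)."""
    return len(ancestors(concept)) - 1

def similarity(gold, predicted):
    """1.0 for an exact match, otherwise 2 * depth(common) / (depth(a) + depth(b))."""
    if gold == predicted:
        return 1.0
    predicted_path = set(ancestors(predicted))
    # Deepest concept shared by both paths to the root.
    common = next(c for c in ancestors(gold) if c in predicted_path)
    return 2 * depth(common) / (depth(gold) + depth(predicted))

def hierarchical_score(pairs):
    """Average similarity over (gold, predicted) concept pairs."""
    return sum(similarity(g, p) for g, p in pairs) / len(pairs)

if __name__ == "__main__":
    print(similarity("Research Assistant", "Lecturer"))  # near miss
    print(similarity("Person", "Location"))              # clearly wrong
```

With this hierarchy, confusing a Research Assistant with a Lecturer earns partial credit, while confusing a Person with a Location earns none, which matches the intuition described above.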
Visualisation of Results
• Cluster Map example
• Traditionally used to show documents classified
according to topic
• Here it shows instances classified according to
concept
• Enables analysis, comparison and querying of
results
• Examples here created by Marta Sabou (Free
University of Amsterdam) using Aduna software
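The data behind such a display can be thought of as overlapping sets: each instance belongs to one or more concepts, and instances sharing exactly the same concepts fall into the same region. The sketch below computes those regions; it is an assumption-based illustration with invented data and a hypothetical function name, and it says nothing about how the Aduna software works internally.

```python
from collections import defaultdict

def cluster_regions(instance_concepts):
    """Group instances by the exact set of concepts they are attached to."""
    regions = defaultdict(set)
    for instance, concepts in instance_concepts.items():
        regions[frozenset(concepts)].add(instance)
    return dict(regions)

if __name__ == "__main__":
    # Invented job adverts classified by region concept.
    jobs = {
        "job1": {"London"},
        "job2": {"London", "South East"},
        "job3": {"South East"},
        "job4": {"London", "South East"},
    }
    for concepts, instances in cluster_regions(jobs).items():
        print(sorted(concepts), sorted(instances))
```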
The principle – Venn Diagrams
Documents classified according to topic
Jobs by region
Instances classified by concept
Concept distribution
Shows the relative importance of different concepts
Correct and incorrect instances attached to concepts
Summary
• Introduction to text mining and the
semantic web
• How traditional information extraction
techniques, including visualisation and
evaluation, can be extended to deal with
the complexity of the Semantic Web
• How text mining can help the progression
of the Semantic Web
Research questions
• Automatic annotation tools are currently
mainly domain- and ontology-dependent,
and work best on a small scale
• Tools designed for large scale applications
lose out on accuracy
• Ontology population works best when the
ontology already exists, but how do we
ensure accurate ontology generation?
• Need large scale evaluation programs
Some useful links
• NaCTeM (National Centre for Text Mining)
http://www.nactem.ac.uk
• GATE
http://gate.ac.uk
• KIM
http://www.ontotext.com/kim/
• h-TechSight
http://www.h-techsight.org
• Magpie
http://www.kmi.open.ac.uk/projects/magpie