Building a Thesaurus by means of Data Extraction


Semantic Information Extraction from Wikipedia
Data Extraction for Data Integration
Patrick Arnold, DBS-Oberseminar in December 2013
Motivation
• Semantic Enrichment of Mappings
– Given two matching concepts c1, c2
– Two questions:
• Do they really match? (verification)
• What is their semantic relationship? (enrichment)
• One possibility: Generic Strategies
– Example: Morphological Analysis
• e.g., lexicographic similarity between matching concepts
Motivation
• Arbitrariness of Language
– No correlation between meaning and representation
of a real-world object
– Most matching tools based on lexicographic analysis
• Why do they work?
• Schema and Ontology Matching
– Concatenative Word Formation (e.g. Compounding)
– Large overlap between ontologies
Motivation
• Lexicographic strategies fail if…
– Domains are different
– Granularity is different
– Languages are different
• Example: Wikipedia-Ebay Furniture Benchmark
– COMA only obtains 31 % recall
– Only 8 % in default mode (without enrichment)
Motivation
• Remedy: Background Knowledge
– Very precise
– Very effective
• Problem: Limited content
– Example WordNet: Only 10 % recall in the Ebay-Wikipedia benchmark
• Problem: Limited number of sources
– Comprehensive and generic?
– Free of charge?
– English (or German) language?
– Semantic relations?
Ambitions
• Solution: Build a thesaurus (background knowledge source) by extracting knowledge from the web
• Wikipedia: 4.4 million entries
– Practically every common noun of the English language
– Good reliability
– First sentence is typically a definition
• Expresses “is-a” relations in about 80 to 95 % of all cases
• Also expresses synonyms and part-of relations
Ambitions
• Provide interface similar to WordNet approach
– Combine it with WordNet approach
Wikipedia Article Distribution
• Wikipedia contains instance pages and concept pages
– Instances: Persons, places, buildings, companies,
bands, movies, diseases, software etc.
• Only 5.6 % of Wikipedia are concept pages
– Still some 246,000 articles
• Some instance articles quite valuable
– Diseases, species, vehicles (e.g., BMW is a car)
Wikipedia Article Distribution
[Bar chart "Proportions" (x-axis in %): Persons; Locations; Miscellaneous; Companies, Organizations; Biology & Anatomy; Concepts ("Non-Instances"); Lists & Disambiguations; Movies; Events; Bands, Sports teams etc.; Novels, Journals etc.; Vehicles; Songs/Albums]
Approach

Article Extraction
• Download Wikipedia Dump
• Extract each article
• Extract and clean the abstract of each article
• Store article objects in DB

Information Extraction
• Take first sentence of abstract
• Preprocess sentence
• Find Hearst Pattern
• Split sentence in HP fragments
• Extract relevant information in each fragment
• Source terms, hypernyms, meronyms, fields

Information Integration
• Store extracted information in DB
• Use interface to access information
• Combine extracted information with other sources
• Example: WordNet
Step 1: Article Extraction
• Large amount of data to process
– Wikipedia dump: 9.5 GB (zip) and 44 GB unpacked
• Takes about 6 hours to download it
– Using a web crawler: Takes about 17 days
• Can crawl about 3 pages per second
• Using SAX to parse content
– Extract abstract or first X characters of each page
– Save text to Database
• MongoDB for fast access
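A minimal sketch of this step in Python, assuming the standard MediaWiki XML dump layout (<page>, <title>, <text> elements), a local MongoDB instance, and a fixed abstract length; the dump file name and the database/collection names are placeholders, not details from the talk:

    import xml.sax
    from pymongo import MongoClient

    class PageHandler(xml.sax.ContentHandler):
        """Streams a MediaWiki XML dump and stores the first part of each article."""
        def __init__(self, collection, abstract_len=1500):
            super().__init__()
            self.collection = collection
            self.abstract_len = abstract_len
            self.tag = ""
            self.title = ""
            self.text_parts = []

        def startElement(self, name, attrs):
            self.tag = name
            if name == "page":
                self.title, self.text_parts = "", []

        def characters(self, content):
            if self.tag == "title":
                self.title += content
            elif self.tag == "text":
                self.text_parts.append(content)

        def endElement(self, name):
            if name == "page":
                text = "".join(self.text_parts)
                # keep only the abstract, i.e. the first X characters of the page
                self.collection.insert_one({"title": self.title.strip(),
                                            "abstract": text[:self.abstract_len]})
            self.tag = ""

    if __name__ == "__main__":
        client = MongoClient()                      # local MongoDB, default port
        articles = client["wikipedia"]["articles"]  # placeholder database/collection names
        xml.sax.parse("enwiki-latest-pages-articles.xml", PageHandler(articles))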
Step 2: Information Extraction
• Approach: For each article, extract the semantic
relations.
• Exploit systematic structure of (Wikipedia)
definitions.
– Classic Definition: Hypernym + Properties
• Definition must be related to another concept.
• A washing machine is a machine to wash laundry.
– Equivalence relations (using synonyms)
• Rare, as there are hardly any full synonyms
• Example: U.K. stands for United Kingdom
Step 2: Information Extraction
The Structure of Wikipedia Definitions
• Structure of typical Wikipedia Definitions:
– Given a Term T, T is commonly defined by Hypernym(T)
• Someone who does not know T may know H(T)
• A cassone is a chest.
– Often in combination with synonyms
• An automobile, autocar, motor car or car is a wheeled motor v…
– Sometimes expressing meronyms
• Wipers are part of vehicles to remove rain from windshields.
– Sometimes expressing holonyms
• A computer consists of a CPU, main memory, BUS and some
peripherals.
Step 2: Information Extraction
The Structure of Wikipedia Definitions
• Additional Information
– Field Reference
• In computing, a mouse is an input device that functions by…
• Column or pillar in architecture and structural engineering is
a structural element that…
• Grammatical, phonological or etymological
information
– A bus (/ˈbʌs/; plural "buses", /ˈbʌsɨz/, archaically also
omnibus, multibus, or autobus) is a road vehicle…
– A wardrobe, also known as an armoire from the
French, is a standing closet.
Step 2: Information Extraction
The Structure of Wikipedia Definitions
• Hearst Pattern
– Indicates the relationship between two terms
– Similar to a relational operator in algebra
– Examples:
• is a
• consists of
• describes a
Step 2: Information Extraction
The Structure of Wikipedia Definitions
• From this information we can form a standard Wikipedia definition pattern:
In computing, a mouse is an input device that…
Step 2: Information Extraction
Procedure
• Find the Hearst Pattern
– Split sentence at the HPs
– Result: 2 or 3 fragments
• From each fragment, extract the source terms,
hypernyms and meronyms
– Also extract the field references
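To illustrate the splitting step, here is a small Python sketch (not the actual implementation, which uses the FSM described on the following slides) that looks for one of the hypernym patterns listed on the next slides and cuts the first sentence into a source and a target fragment:

    import re

    # A few hypernym patterns from the slides; the real system uses a longer list
    # and an FSM rather than plain regular expressions (simplified illustration).
    HEARST_PATTERNS = [
        r"is defined as an?", r"is commonly a class of", r"is typically an?",
        r"is a general term for", r"is used as a term for", r"is any form of",
        r"describes an?", r"denotes an?", r"consists of", r"is an?",
    ]

    def split_at_hearst(sentence):
        """Split a definition sentence into (source fragment, pattern, target fragment)."""
        for pattern in HEARST_PATTERNS:
            match = re.search(r"\b" + pattern + r"\b", sentence)
            if match:
                return (sentence[:match.start()].strip(),
                        match.group(0),
                        sentence[match.end():].strip())
        return None  # no known pattern: the sentence cannot be processed

    print(split_at_hearst("In computing, a mouse is an input device that..."))
    # -> ('In computing, a mouse', 'is an', 'input device that...')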
Step 2: Information Extraction
Procedure
• Step 1: Find Hearst Pattern
– Hypernym Pattern: Not restricted to “is a”
• is a
• is typically a
• is defined as a
• is commonly a class of
• is any form of
• is one of the many
• is a general term for
• is used as a term for
• describes/denotes a
Step 2: Information Extraction
Procedure
• Step 1: Find Hearst Pattern
– Typical has-a pattern
• consisting of a
• with a
• having a
– Typical part-of pattern
• within
• (is used) in
• (as part) of
Step 2: Information Extraction
Procedure
– Some fuzzy patterns
• refers to
• applies to
• is similar to
• is related to
Step 2: Information Extraction
Procedure
• FSM for is-a Pattern
Step 2: Information Extraction
Procedure
• Examples
Step 2: Information Extraction
Procedure
• Approach:
– Use an FSM to parse through the fragments
– Word-by-word processing
– Under due regard of word classes (POS)
– Very restrictive
• If an unexpected condition is entered: revoke the fragment
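A hedged sketch of this restrictive word-by-word idea for the target fragment: POS tags are assumed to be supplied by an external tagger, the states and tags are heavily simplified, and the real FSM covers many more cases. Determiners and adjectives are skipped, consecutive nouns are collected as the (compound) hypernym, and any unexpected token revokes the fragment:

    def extract_hypernym(tagged_fragment):
        state = "START"
        nouns = []
        for word, tag in tagged_fragment:
            if state == "START" and tag == "DT":            # leading article ("a", "an", "the")
                state = "NP"
            elif state in ("START", "NP") and tag == "JJ":   # modifiers before the head noun
                state = "NP"
            elif state in ("START", "NP", "NN") and tag == "NN":
                nouns.append(word)                           # build the compound, e.g. "rail vehicle"
                state = "NN"
            elif state == "NN":                              # head noun(s) complete, stop reading
                break
            else:
                return None                                  # unexpected condition: revoke fragment
        return " ".join(nouns) if nouns else None

    print(extract_hypernym([("a", "DT"), ("light", "JJ"), ("auxiliary", "JJ"),
                            ("rail", "NN"), ("vehicle", "NN")]))   # -> rail vehicle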
Step 2: Information Extraction
Procedure
• Preprocessing
– Remove braces if necessary
• Braces may contain valuable information
• Try different configurations
– Replace expressions for simplification
• is applied to → applies to
• is any of a variety of → is any
• and or → and
• means of → {} (i.e., removed)
(Auto rickshaws are means of public transportation)
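A minimal sketch of the replacement-based preprocessing: rare or awkward formulations are rewritten into expressions the later stages already know. The mapping below contains only the examples from this slide; the real list is longer.

    # Replacements from the slide; "means of" is simply dropped, so
    # "are means of public transportation" becomes "are public transportation".
    REPLACEMENTS = {
        "is applied to": "applies to",
        "is any of a variety of": "is any",
        "and or": "and",
        "means of ": "",
    }

    def preprocess(sentence):
        for old, new in REPLACEMENTS.items():
            sentence = sentence.replace(old, new)
        return sentence

    print(preprocess("Auto rickshaws are means of public transportation."))
    # -> Auto rickshaws are public transportation.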
Step 2: Information Extraction
Procedure
• Extract information from source/target segment
Step 2: Information Extraction
Procedure
• Post-Processing
– Extracted information must be post-processed
• Remove braces
• Remove quotes etc.
• Stemming (tbd)
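A small sketch of this post-processing step: bracketed additions and quotes are stripped from an extracted term; stemming is marked as "to be done" on the slide and is therefore omitted here.

    import re

    def postprocess(term):
        term = re.sub(r"\([^)]*\)", "", term)    # remove parenthesised remarks
        term = re.sub(r"\[[^\]]*\]", "", term)   # remove reference markers such as [1]
        term = term.replace('"', "")             # remove quotes
        return " ".join(term.split())            # normalise whitespace

    print(postprocess('"wardrobe" (also known as an armoire)[1]'))
    # -> wardrobe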
Step 3: Integration
• Store extracted information
– Can be reduced to simple triples (term, relation, term)
– RDBS or Main Memory (Hash Map)
• Work in progress
– Develop Interface to handle queries
– Combine it with WordNet
– Recursive approach
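A hedged sketch of the main-memory variant: the triples (term, relation, term) are kept in a hash map indexed by the source term, and a query returns all relations of a term. The relation name "is-a" and the recursive hypernym lookup are illustrative only.

    from collections import defaultdict

    class Thesaurus:
        def __init__(self):
            self.triples = defaultdict(list)          # term -> [(relation, target), ...]

        def add(self, source, relation, target):
            self.triples[source].append((relation, target))

        def query(self, term):
            return self.triples.get(term, [])

        def hypernyms(self, term, seen=None):
            """Recursively collect all (transitive) hypernyms of a term."""
            seen = seen if seen is not None else set()
            for relation, target in self.query(term):
                if relation == "is-a" and target not in seen:
                    seen.add(target)
                    self.hypernyms(target, seen)
            return seen

    t = Thesaurus()
    t.add("cassone", "is-a", "chest")
    t.add("chest", "is-a", "furniture")
    print(t.hypernyms("cassone"))                     # -> {'chest', 'furniture'}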
Step 3: Integration
Resolution of Queries (Example)
Evaluation
• 4 Scenarios from Wikipedia:
– Furniture (186 concepts)
– Infectious Diseases (107 concepts)
– Optimization Algorithms (122 concepts)
– Vehicles (94 concepts)
• Questions:
– How many articles could be parsed?
– How many relevant relations could be found (recall)?
– How many extracted relations were correct (precision)?
Evaluation
Article Segmentation
• We can process 63 to 92 % of all Wikipedia articles
– Highly depends on the domain
– Not all articles are "parsable"
Scenario                | #Articles | #Processed Articles | Effectiveness
Furniture               | 186       | 137                 | 73.7 %
Infectious Diseases     | 107       | 67                  | 63.2 %
Optimization Algorithms | 122       | 91                  | 74.5 %
Vehicles                | 94        | 87                  | 92.5 %
Evaluation
Article Segmentation
• Some WP articles simply have no “classic” definition (hypernym/meronym missing)
– Examples:
• Anaerobic infections are caused by anaerobic bacteria.
• A blood-borne disease is one that can be spread through contamination…
• Cholera Hospital was established on June 24, 1854, at… (an instance)
• Hutchinson's triad is named after Sir Jonathan Hutchinson (1828–1913).
• A pathogen in the oldest and broadest sense is anything that can produce disease.
Evaluation
Article Segmentation
• Results for parsable articles:
Scenario                | #Articles | #Parsable Articles | #Processed Articles | Effectiveness
Furniture               | 186       | 169                | 137                 | 81.1 %
Infectious Diseases     | 107       | 91                 | 67                  | 73.6 %
Optimization Algorithms | 122       | 113                | 91                  | 80.5 %
Vehicles                | 94        | 91                 | 87                  | 95.6 %
Evaluation
Concept Extraction
• Recall (strict):
          | Source Terms              | Hypernym                  | Meronym/Holonym
          | # in B | # in R | Recall  | # in B | # in R | Recall  | # in B | # in R | Recall
Vehicles  | 194    | 149    | 76.8 %  | 94     | 80     | 85.1 %  | 20     | 17.5   | 87.5 %
Diseases  | 178    | 125    | 70.2 %  | 86     | 57.5   | 66.9 %  | 36     | 19.5   | 54.1 %
• Recall (effective):
          | Source Terms              | Hypernym                  | Meronym/Holonym
          | # in B | # in R | Recall  | # in B | # in R | Recall  | # in B | # in R | Recall
Vehicles  | 168    | 138    | 82.1 %  | 90     | 78     | 86.6 %  | 20     | 17.5   | 87.5 %
Diseases  | 125    | 104    | 83.2 %  | 67     | 55.5   | 82.8 %  | 32     | 19.5   | 61.0 %
Evaluation
Concept Extraction
• Precision (strict):
          | Source Terms                 | Hypernym                     | Meronym/Holonym
          | # in R | # in B | Precision  | # in R | # in B | Precision  | # in R | # in B | Precision
Vehicles  | 149    | 161    | 92.5 %     | 80     | 91     | 87.9 %     | 17.5   | 27.5   | 63.6 %
Diseases  | 125    | 131    | 95.4 %     | 57.5   | 68     | 83.3 %     | 19.5   | 29.5   | 66.1 %
• Precision (effective):
          | Source Terms                 | Hypernym                     | Meronym/Holonym
          | # in R | # in B | Precision  | # in R | # in B | Precision  | # in R | # in B | Precision
Vehicles  | 138    | 148    | 93.2 %     | 78     | 89     | 87.6 %     | 17.5   | 27.5   | 63.6 %
Diseases  | 104    | 106    | 98.1 %     | 55.5   | 65     | 85.4 %     | 19.5   | 28.5   | 68.4 %
Evaluation
Common Precision Problems
• Complex words
– A minibus is a passenger carrying motor vehicle
• NN + VBG + NN + NN
• *Minibus is a passenger
Evaluation
Common Precision Problems
• Braces – Curse and Blessing
– A pedalo (British English) or paddle boat (US,
Canadian, and Australian English) is a watercraft…
• *British English is a watercraft
• *US is a watercraft
– A powered parachute (motorized parachute, PPC,
paraplane) is a parachute…
• Entirely correct extraction (4 source terms)
Evaluation
Common Precision Problems
• Compound determination
– A draisine is a light auxiliary rail vehicle
• ADJ + ADJ + NN + NN
– Expected: Rail vehicle
– Imprecise, but can be handled
• Gradual Modifier Removal / Compound Transitivity
– Ignoring adjectives and verbs (part.) not possible
• High school
• bathing suit
• Pulled rickshaw (rare)
Evaluation
Common Precision Problems
• Misleading nouns
– Jet pack, rocket belt, rocket pack and similar names
are used for various types of devices
– Conclusion: *Similar names are devices
Evaluation
Common Recall/Precision Problems
• Misleading nouns in target phrases
– Some phrases
• Is the act of...
• Is a method for...
• Is a noun describing...
• Is the heart of...
– Examples:
• Land sailing is the act of moving across land.
• Leipzig is the heart of the Central German Metropolitan
Region. ("*Leipzig is a heart")
Evaluation
Common Recall Problems
• Too much auxiliary information in the definition
– A passenger car (known as a coach or carriage in the
UK, and also known as a bogie in India[1]) is a…
Evaluation
Common Recall Problems
• Erroneous POS-Tagging
– Example: A dog sled is a sled used for...
– DET + NN + VBD + is a + VBD + ...
– Solution: Check whether dubious word is in page
name
• Page name was Dog sled
• "sled" (VBD) is part of page name
• Ignore POS tag (handle it as noun)
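A minimal sketch of this correction heuristic: if a token that the tagger marked as a verb form also occurs in the page name, it is re-tagged as a noun. Tag names follow the usual Penn Treebank abbreviations used on the slide.

    def correct_tags(page_name, tagged_tokens):
        title_words = {w.lower() for w in page_name.split()}
        return [(word, "NN") if tag.startswith("VB") and word.lower() in title_words
                else (word, tag)
                for word, tag in tagged_tokens]

    print(correct_tags("Dog sled",
                       [("A", "DT"), ("dog", "NN"), ("sled", "VBD"), ("is", "VBZ"), ("a", "DT")]))
    # "sled" is part of the page name "Dog sled", so its VBD tag is replaced by NN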
Evaluation
Improvement
• Pareto principle (80:20)
– Improvement by handling special cases
• 2 Ways for improvement
– Extend FSM
• Consider more specific cases of formulation
• Makes FSM complex and hard to manage
– More preprocessing
• Replace rare expressions by common expressions
• May lead to large lists of specific expressions
Conclusions
• Relation extraction quite successful
– Some intricate articles not processable
– Some irrelevant or nonsensical relations
• To do:
– Provide interface for Semantic Enrichment Module
– Combine with WordNet
– Possibly extend it with further sources (like
Wiktionary)
Thank You!