ACLCLP-IR2010 Workshop
Web-Scale Knowledge Discovery and Population from Unstructured Data
Heng Ji Computer Science Department Queens College and the Graduate Center City University of New York [email protected]
December 3, 2010 1/55
Outline
Motivation of Knowledge Base Population (KBP)
KBP2010 Task Overview
Data Annotation and Analysis
Evaluation Metrics
A Glance at Evaluation Results
CUNY-BLENDER Team @ KBP2010
Discussions and Lessons
Preview of KBP2011: Cross-lingual (Chinese-English) KBP; Temporal KBP
2/55
Limitations of Traditional IE/QA Tracks
Traditional Information Extraction (IE) Evaluations (e.g., the Message Understanding Conference / Automatic Content Extraction programs):
Most IE systems operate one document at a time; MUC-style event extraction hit the 60% 'performance ceiling'.
Look back at the initial goal of IE: create a database of relations and events from the entire corpus.
Within-document/within-sentence IE was an artificial constraint to simplify the task and evaluation.
Traditional Question Answering (QA) Evaluations:
Limited effort on disambiguating entities in queries.
Limited use of relation/event extraction in answer search.
3/55
The Goal of KBP
Hosted by the U.S. NIST; started in 2009; supported by DOD; coordinated by Heng Ji and Ralph Grishman in 2010; 55 teams registered and 23 teams participated.
Our Goal: bridge the IE and QA communities; promote research in discovering facts about entities and expanding a knowledge source.
What's New & Valuable:
- Extraction at large scale (>1 million documents)
- Using a representative collection (not selected for relevance)
- Cross-document entity resolution (extending the limited effort in ACE)
- Linking the facts in text to a knowledge base
- Distant (and noisy) supervision through infoboxes
- Rapid adaptation to new relations
- Support for multi-lingual information fusion (KBP2011)
- Capturing temporal information (KBP2011)
All of these raise interesting and important research issues. 4/55
Knowledge Base Population (KBP2010) Task Overview
5/55
KBP Setup
Knowledge Base (KB): attributes (a.k.a., "slots") derived from Wikipedia infoboxes are used to create the reference KB.
Source Collection: a large corpus of newswire and web documents (>1.3 million docs) is provided for systems to discover information to expand and populate the KB. 6/55
Entity Linking: Create Wiki Entry?
NIL Query = “James Parsons”
7/55
Entity Linking Task Definition
Involves three entity types: person, geo-political entity, and organization.
Regular entity linking: names must be aligned to entities in the KB; systems can use Wikipedia texts.
Optional entity linking: without using Wikipedia texts; systems can use infobox values.
Query example:
Slot Filling: Create Wiki Infoboxes?
School Attended: University of Houston
9/55
Regular Slot Filling
Person
per:alternate_names per:date_of_birth per:age per:country_of_birth per:stateorprovince_of_birth per:city_of_birth per:origin per:date_of_death per:country_of_death per:stateorprovince_of_death per:city_of_death per:cause_of_death per:countries_of_residence per:stateorprovinces_of_residence per:cities_of_residence per:schools_attended per:title per:member_of per:employee_of per:religion per:spouse per:children per:parents per:siblings per:other_family per:charges
Organization
org:alternate_names org:political/religious_affiliation org:top_members/employees org:number_of_employees/members org:members org:member_of org:subsidiaries org:parents org:founded_by org:founded org:dissolved org:country_of_headquarters org:stateorprovince_of_headquarters org:city_of_headquarters org:shareholders org:website 10/55
Data Annotation and Analysis
11/55
Data Annotation Overview
Source collection: about 1.3 million newswire docs, 500K web docs, and a few speech-transcribed docs.

Entity Linking Corpus (entity mentions):
Entity type | Training (2009) | Training (2010 web) | Evaluation (newswire) | Evaluation (web)
Person | 627 | 500 | 500 | 250
Organization | 2710 | 500 | 500 | 250
GPE | 567 | 500 | 500 | 250

Slot Filling Corpus (entities):
Entity type | Training: regular task (2009 eval / 2010 participants / 2010 LDC) | Training: surprise task (2010 LDC) | Evaluation: regular task (LDC) | Evaluation: surprise task (LDC)
Person | 17 / 25 / 25 | 16 | 50 | 20
Organization | 31 / 25 / 25 | 16 | 50 | 20
12/55
Entity Linking Inter-Annotator Agreement
Agreement among three annotators:
Entity type | #Total queries | Agreement rate | #Disagreed (newswire) | #Disagreed (web)
Person | 59 | 91.53% | 4 | 1
Geo-political | 64 | 87.5% | 3 | 5
Organization | 57 | 92.98% | 3 | 1
13/55
Slot Filling Human Annotation Performance
Evaluation assessment of LDC hand annotation performance:
Slots | P(%) | R(%) | F(%)
All slots | 70.14 | 54.06 | 61.06
All except per:top-employee, per:member_of, per:title | 71.63 | 57.6 | 63.86
Why is the precision only 70%?
32 responses were judged as inexact and 200 as wrong answers.
A third annotator's assessment of 20 answers marked as wrong: 65% incorrect, 15% correct, 20% uncertain.
Some annotated answers are not explicitly stated in the document; some require a little world knowledge and reasoning.
Ambiguities and underspecification in the annotation guideline; confusion about acceptable answers.
These findings led to updates to the KBP2010 annotation guideline for assessment.
Slot Filling Annotation Bottleneck
The overlap rates between two independent annotators in the community are generally lower than 30%. Does adding more human annotators help? No.
Can Amazon Mechanical Turk Help?
Given a query, an answer, and the supporting context sentence, each Turker judges whether the answer is Y (correct), N (incorrect), or U (unsure). Result distribution for 1,690 instances, by five-Turker label pattern:
Useful annotations (41.8%): unanimous or near-unanimous agreement patterns (e.g. Y Y Y Y Y, N N N N N, Y Y Y Y U, N N N N U).
Useless annotations (58.2%): mixed or uncertain patterns (e.g. Y Y Y N N, Y Y N U U, U U U U U).
Why is Annotation so hard for Non-Experts?
Even for all-agreed cases, some annotations are incorrect…
Query: Citibank | Slot: org:top_members/employees | Answer: Tim Sullivan
Context: "He and Tim Sullivan, Citibank's Boston area manager, said they still plan to seek advice from activists going forward."

Query: International Monetary Fund | Slot: org:subsidiaries | Answer: World Bank
Context: "President George W. Bush said Saturday that a summit of world leaders agreed to make reforms to the World Bank and International Monetary Fund."

Requires quality control; training non-expert annotators is difficult.
Evaluation Metrics
Entity Linking Scoring Metric
Micro-averaged accuracy (official metric): mean accuracy across all queries.
Macro-averaged accuracy: mean accuracy across all KB entries.
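As an illustration of the two averages above (not from the original slides; the function name and data layout are invented for this sketch):

```python
from collections import defaultdict

def micro_macro_accuracy(judgments):
    """judgments: list of (kb_entry_id, correct) pairs, one per query."""
    # Micro-average: every query counts equally (the official KBP metric).
    micro = sum(ok for _, ok in judgments) / len(judgments)
    # Macro-average: every KB entry counts equally, however many queries hit it.
    per_entry = defaultdict(list)
    for entry, ok in judgments:
        per_entry[entry].append(ok)
    macro = sum(sum(v) / len(v) for v in per_entry.values()) / len(per_entry)
    return micro, macro

judgments = [("E1", 1), ("E1", 1), ("E1", 0), ("E2", 0)]
micro, macro = micro_macro_accuracy(judgments)  # micro = 0.5, macro = (2/3 + 0)/2
```

The two averages diverge exactly when popular KB entries are easier (or harder) than rare ones.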
Slot Filling Scoring Metric
Each response is rated as correct, inexact, redundant, or wrong (credit only given for correct responses).
Redundancy: (1) response vs. KB; (2) among responses: build equivalence classes; credit for only one member of each class.
Correct = #(non-NIL system output slots judged correct)
System = #(non-NIL system output slots)
Reference = #(single-valued slots with a correct non-NIL response) + #(equivalence classes for all list-valued slots)
Standard precision, recall, and F-measure: P = Correct / System, R = Correct / Reference.
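A minimal sketch of this scoring scheme (names are hypothetical; it assumes the per-response judgments and the reference equivalence-class count are already given by assessors):

```python
def slot_filling_prf(responses, reference_classes):
    """
    responses: list of (slot, judgment) for non-NIL system outputs, where
               judgment is "correct", "inexact", "redundant", or "wrong".
    reference_classes: #(single-valued slots with a correct non-NIL answer)
               + #(equivalence classes for all list-valued slots).
    Only 'correct' earns credit, per the KBP rules above; inexact and
    redundant responses still count against precision.
    """
    system = len(responses)
    correct = sum(1 for _, j in responses if j == "correct")
    p = correct / system if system else 0.0
    r = correct / reference_classes if reference_classes else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

example = [("per:title", "correct"), ("per:title", "redundant"),
           ("per:spouse", "wrong"), ("org:founded", "correct")]
p, r, f = slot_filling_prf(example, reference_classes=5)  # P=0.5, R=0.4
```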
Evaluation Results
Top-10 Regular Entity Linking Systems
The correlation between overall and non-NIL accuracy is below 0.8.
Human/system entity linking comparison (subset of 200 queries); the human score is the average over three annotators.
Top-10 Regular Slot Filling Systems
CUNY-BLENDER Team @ KBP2010
25/55
System overview:
Query → Query Expansion → three parallel pipelines: IE, Pattern Matching (with Answer Filtering), and QA (with Answer Validation) → Cross-System & Cross-Slot Reasoning → Statistical Answer Re-ranking / Priority-based Combination → Inexact & Redundant Answer Removal → Answers.
External KBs (Freebase, Wikipedia) support text mining and answer validation.
IE Pipeline
Apply ACE cross-document IE (Ji et al., 2009). Mapping ACE to KBP, examples:

KBP 2010 slots | ACE2005 relations/events
per:date_of_birth, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth | event: be-born
per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:religion | relation: citizen-resident-religion-ethnicity
per:schools_attended | relation: student-alum
per:member_of | relation: membership, relation: sports-affiliation
per:employee_of | relation: employment
per:spouse, per:children, per:parents, per:siblings, per:other_family | relation: family, event: marry, event: divorce
per:charges | event: charge-indict, event: convict
Pattern Learning Pipeline
Selection of query-answer pairs from Wikipedia infoboxes, split into two sets.
Pattern extraction: for each {q, a} pair, generalize patterns by entity tagging and regular expressions, e.g. "… died at the age of …".
Pattern assessment: evaluate and filter patterns based on matching rate.
Pattern matching: combine with coreference resolution.
Answer filtering: based on entity type checking, dictionary checking, and dependency-parsing constraint filtering.
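A toy sketch of the generalization step (the helper name and the simple character-class slots are invented; the actual system uses an entity tagger rather than these regex approximations):

```python
import re

def generalize(sentence, query, answer):
    """Turn a seed sentence containing a known (query, answer) pair into a
    reusable extraction pattern: the query becomes a capitalized-name slot
    and the numeric answer becomes a digit slot."""
    pat = re.escape(sentence)
    # Replace the (escaped) query and answer strings with typed capture groups.
    pat = pat.replace(re.escape(query), r"(?P<q>[A-Z][\w.\s]+?)")
    pat = pat.replace(re.escape(answer), r"(?P<a>\d+)")
    return pat

seed = "Smith died at the age of 84."
pattern = generalize(seed, "Smith", "84")
# The learned pattern now matches new (query, answer) pairs:
m = re.match(pattern, "Jones died at the age of 91.")
```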
QA Pipeline
Apply an open-domain QA system, OpenEphyra (Schlaefer et al., 2007).
Relevance metric related to PMI and CCP; answer pattern probability P(q, a) = P(q NEAR a), where NEAR means within the same sentence:

R(q, a) = freq(q NEAR a) × #sentences / (freq(q) × freq(a))

Limited by occurrence-based confidence and recall issues.
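The relevance formula above can be sketched as follows, treating each sentence as the NEAR window (function and variable names are invented for the sketch):

```python
def relevance(sentences, q, a):
    """R(q, a) = freq(q NEAR a) * #sentences / (freq(q) * freq(a)),
    where NEAR = co-occurrence within the same sentence."""
    n = len(sentences)
    fq = sum(q in s for s in sentences)          # sentences mentioning q
    fa = sum(a in s for s in sentences)          # sentences mentioning a
    near = sum(q in s and a in s for s in sentences)  # joint occurrences
    return near * n / (fq * fa) if fq and fa else 0.0

docs = [
    "Mozart was born in Salzburg.",
    "Mozart later moved to Vienna.",
    "Salzburg lies on the Salzach river.",
]
score = relevance(docs, "Mozart", "Salzburg")  # 1 * 3 / (2 * 2) = 0.75
```

This is the usual PMI-style trade-off: rare strings that always co-occur score high, which is exactly the occurrence-based confidence limitation noted above.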
More Queries and Fewer Answers
Query template expansion: generated 68 question templates for organizations and 68 for persons, e.g. "Who founded [ORG]?" / "Who established [ORG]?".
Query name expansion: Wikipedia redirect links.
Heuristic rules for answer filtering: format validation, gazetteer-based validation, regular-expression-based filtering, structured data identification and answer filtering.
Motivation of Statistical Re-Ranking
Union and voting are too sensitive to the performance of the baseline systems:
Union guarantees the highest recall but requires the systems to have comparable performance.
Voting assumes more frequent answers are more likely true (false).
Priority-based combination (voting with weights) assumes system performance does not vary by slot (false).
Per-slot performance of the three pipelines:
Slot | IE | QA | PL
org:country_of_headquarters | 75.0 | 15.8 | 16.7
org:founded | 100 | 46.2 | -
per:date_of_birth | - | 33.3 | 76.9
per:origin | - | 22.6 | 40
Statistical Re-Ranking
Maximum Entropy (MaxEnt) based supervised re-ranking model to re-rank candidate answers for the same slot.
Features:
- Baseline confidence
- Answer name type
- Slot type × system
- Number of tokens × slot type
- Gazetteer constraints
- Data format
- Context sentence annotation (dependency parsing, …)
- …
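A toy illustration of re-ranking with a slot-type × system feature (the weights here are hand-set for the sketch; the actual system learns them via MaxEnt training on the dev set):

```python
import math

# Illustrative feature weights; a trained MaxEnt model estimates these.
WEIGHTS = {
    "baseline_conf": 2.0,
    ("slot_x_sys", "per:date_of_birth", "PL"): 1.5,
    ("slot_x_sys", "org:founded", "IE"): 1.2,
    ("slot_x_sys", "org:founded", "QA"): -0.5,
}

def score(candidate):
    s = WEIGHTS["baseline_conf"] * candidate["conf"]
    s += WEIGHTS.get(("slot_x_sys", candidate["slot"], candidate["system"]), 0.0)
    return s

def rerank(candidates):
    # Softmax turns raw scores into normalized confidences.
    zs = [math.exp(score(c)) for c in candidates]
    total = sum(zs)
    return sorted(zip(candidates, (z / total for z in zs)),
                  key=lambda pair: -pair[1])

cands = [
    {"slot": "org:founded", "system": "QA", "conf": 0.6, "answer": "1976"},
    {"slot": "org:founded", "system": "IE", "conf": 0.4, "answer": "1974"},
]
best = rerank(cands)[0][0]["answer"]  # IE wins despite lower baseline confidence
```

The slot-type × system conjunction is what lets the model trust pattern learning for birth dates but IE for headquarters, which pure voting cannot express.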
MLN-based Cross-Slot Reasoning
Motivation: each slot is often dependent on other slots, so we can construct new 'revertible' queries to verify candidate answers:
- X is per:children of Y ⇔ Y is per:parents of X
- X was born on date Y → the age of X is approximately (the current year − Y)
Use Markov Logic Networks (MLN) to encode cross-slot reasoning rules:
- Heuristic inference is highly dependent on the order in which rules are applied.
- An MLN adds a weight to each inference rule and integrates soft rules and hard rules.
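A drastically simplified, MLN-flavored sketch of the two revertible-query rules above (a real MLN performs probabilistic inference over all groundings; here we only score one candidate world, and the current year is fixed to 2010 for the age rule):

```python
import math

# Each rule is (weight, predicate); weight = math.inf marks a hard rule.
RULES = [
    # Hard rule: children/parents symmetry must hold.
    (math.inf, lambda kb: all((c, "per:parents", p) in kb
                              for (p, s, c) in kb if s == "per:children")),
    # Soft rule: age ≈ current year - birth year, within one year.
    (1.5, lambda kb: all(abs((2010 - v[0]) - v[1]) <= 1
                         for (_, s, v) in kb if s == "birth_vs_age")),
]

def world_score(kb):
    """Sum of weights of satisfied soft rules; -inf if a hard rule fails."""
    total = 0.0
    for w, rule in RULES:
        if rule(kb):
            total += 0.0 if math.isinf(w) else w
        elif math.isinf(w):
            return -math.inf
    return total

kb_good = {("Lahoud", "per:children", "Emile"),
           ("Emile", "per:parents", "Lahoud")}
kb_bad = {("Lahoud", "per:children", "Emile")}   # symmetry violated
```

Because weights attach to rules rather than to an application order, the score is the same however the rules are enumerated, which is the point of moving from heuristic inference to an MLN.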
Error Analysis on Supervised Model
Name error examples: classification errors and spurious errors (e.g. "Ayub Masih … under the law in 1998."); missing errors.
Nominal missing error examples: supremo / shepherd / prophet / sheikh / Imam / overseer / oligarchs / Shiites …
Intuition for using lexical knowledge discovered from n-grams: each person has a gender (he, she, …) and is animate (who, …).
34/55
Motivations of Using Web-scale Ngrams
Data is Power Web is one of the largest text corpora: however, web search is slooooow (if you have a million queries).
N-gram data: compressed version of the web Already proven to be useful for language modeling Google N-gram: 1 trillion token corpus (Ji and Lin, 2009) 35/55
car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48,
Gender Discovery from Ngrams
Discovery patterns (Bergsma et al., 2005, 2008):
(tag=N.*|word=[A-Z].*) (tag=N.*|word=[A-Z].*) … tag=CC.* (word=his|her|its|their)
(tag=N.*|word=[A-Z].*) … tag=V.* (word=his|her|its|their)
If a mention indicates male or female with high confidence, it is likely to be a person mention.
Pattern counts for candidate mentions:
Mention | male | female | neutral | plural
John Joseph bought/… his/… | 32 | 0 | 0 | 0
Haifa and its/… | 21 | 19 | 92 | 15
screenwriter published/… his/… | 144 | 27 | 0 | 0
fish it/… is/… | 22 | 41 | 1741 | 1186
37/55
Animacy Discovery from Ngrams
Discovery patterns: count the relative pronoun following a noun:
not(tag=(IN|[NJ].*)) tag=[NJ].* (? (word=,)) (word=who|which|where|when)
If a mention indicates animacy with high confidence, it is likely to be a person mention.
Pattern counts for candidate mentions (who/when = animate; where/which = non-animate):
Mention | who | when | where | which
supremo | 24 | 0 | 0 | 0
shepherd | 807 | 24 | 0 | 56
prophet | 7372 | 1066 | 63 | 0
imam | 910 | 76 | 0 | 0
oligarchs | 299 | 13 | 1141 | 57
sheikh | 338 | 11 | 28 | 0
38/55
Unsupervised Mention Detection Using Gender and Animacy Statistics
Candidate mention detection:
- Name: capitalized sequence of ≤3 words; filter stop words, nationality words, dates, numbers, and title words.
- Nominal: un-capitalized sequence of ≤3 words without stop words.
Margin confidence estimation:

margin = (freq(best property) − freq(second-best property)) / freq(second-best property)

Full matching: John Joseph (M:32)
Composite matching: Ayub (M:87) Masih (M:117)
Relaxed matching: Mahmoud (M:159, F:13) Hamadan (N:19) Qawasmi (M:0, F:0) Salim (F:13, M:188)
39/55
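The margin test and composite matching can be sketched as below (all n-gram counts other than M:117 for "Masih" are invented for illustration; the threshold is likewise an assumption):

```python
def margin_confidence(counts):
    """counts: property -> frequency, e.g. {"male": 117, "female": 4, ...}.
    margin = (freq(best) - freq(second best)) / freq(second best)."""
    ranked = sorted(counts.values(), reverse=True)
    best, second = ranked[0], ranked[1]
    return (best - second) / second if second else float("inf")

def classify(token, ngram_counts, threshold=2.0):
    """Label a token with its dominant property if the margin is large
    enough; composite matching backs off from full strings to tokens."""
    counts = ngram_counts.get(token)
    if counts and margin_confidence(counts) >= threshold:
        return max(counts, key=counts.get)
    return None

NGRAM = {"Masih": {"male": 117, "female": 4, "neutral": 2, "plural": 0}}
label = classify("Masih", NGRAM)  # margin = (117 - 4) / 4 = 28.25, so "male"
```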
Mention Detection Performance
Task | Method | P(%) | R(%) | F(%)
Name mention detection | Supervised model | 88.24 | 81.08 | 84.51
Name mention detection | Unsupervised, using n-grams | 87.05 | 82.34 | 84.63
Nominal mention detection | Supervised model | 85.93 | 70.56 | 77.49
Nominal mention detection | Unsupervised, using n-grams | 71.20 | 85.18 | 77.57

• The parameters optimized on the dev set were applied directly to the blind test set.
• Blind test on 50 ACE05 newswire documents: 555 person name mentions and 900 person nominal mentions.
40/55
Impact of Statistical Re-Ranking
Pipeline | Precision | Recall | F-measure
Supervised IE (bottom-up) | 0.2416 | 0.1421 | 0.1789
Pattern Matching (bottom-up) | 0.2186 | 0.3769 | 0.2767
QA (top-down) | 0.2668 | 0.1730 | 0.2099
Priority-based Combination | 0.3048 | 0.2658 | 0.2840
Re-Ranking based Combination | 0.2797 | 0.4433 | 0.3430
5-fold cross-validation on the training set. Re-ranking mitigates the impact of errors produced by co-occurrence-based scoring (via the slot type × system feature): e.g., the query "Moro National Liberation Front" and the answer "1976" did not have high co-occurrence, but the answer was bumped up by the re-ranker based on the slot-type feature for org:founded.
Impact of Cross-Slot Reasoning
Operation | Total | Correct (%) | Incorrect (%)
Removal | 277 | 88% | 12%
Adding | 16 | 100% | 0%
Brian McFadden | per:title | singers | “She had two daughter with one of the MK’d Westlife singers, Brian McFadden, calling them Molly Marie and Lilly Sue”
Slot-Specific Analysis
A few slots account for a large fraction of the answers: per:title, per:employee_of, per:member_of, and org:top_members/employees account for 37% of correct responses.
For a few slots, delimiting the exact answer is difficult, so the result is an 'inexact' slot fill: per:charges, per:title ("rookie driver"; "record producer").
For a few slots, equivalent-answer detection is important to avoid redundant answers: per:title again accounts for the largest number of cases, e.g., "defense minister" and "defense chief" are equivalent.
How much Inference is Needed?
Why KBP is more difficult than ACE
Cross-sentence inference — non-identity coreference (per:children):
"Lahoud is married to an Armenian and the couple have three children. Eldest son Emile [Emile Lahoud] was a member of parliament between 2000 and 2005."
Cross-slot inference (per:children):
"People Magazine has confirmed that actress Julia Roberts has given birth to her third child, a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8½ lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus, who were born in November 2006."
Statistical Re-Ranking based Active Learning
Preview of KBP2011
47/55
Cross-lingual Entity Linking
Query = “吉姆·帕森斯” (Jim Parsons) 48/55
Cross-lingual Slot Filling
Query = “James Parsons” — Other family: Todd Spiewak
49/55
Cross-lingual Slot Filling
Two possible strategies:
1. Entity Translation (ET) + Chinese KBP
2. Machine Translation (MT) + English KBP
Stimulate research on information-aware machine translation, translation-aware information extraction, foreign-language KBP, and cross-lingual distant learning. 50/55
Error Example of SF on MT output
Query: Elizabeth II
Slot type: per:cities_of_residence
Answer: Gulf
XIN20030511.0130.0011 | "Xinhua News Agency, London, May 10 - according to British media ten, British Queen Elizabeth II did not favour in the Gulf region to return British unit to celebrate the victory in the war."
51/55
Query Name in Document Not Translated
Query: Celine Dion Answer: PER:Origin = Canada British singer , Clinton 's plan to Caesar Palace of the ( Central news of UNAMIR in Los Angeles , 15th (Ta Kung Pao) consider British singer , Clinton ( ELT on John ) today, according to the Canadian and the seats of the Matignon Accords , the second to Las Vegas in the international arena heavyweight.
52/55
Answer in Document Not Translated
Query: David Kelly Answer: per:schools_attended = Oxford University MT: The 59-year-old Kelly is the flea basket for trapping fish microbiology and internationally renowned biological and chemical weapons experts. He had participated in the United Nations Iraq weapons verification work, and the British Broadcasting Corporation ( BBC ) the British Government for the use of force on Iraq evidence the major sources of information. On , Kelly in the nearby slashed his wrist, and public opinion holds that he " cannot endure the enormous psychological pressure " to commit suicide.
53/55
Temporal KBP (Slot Filling)
54/55
Temporal KBP
Many attributes, such as a person's title, employer, and spouse, change over time.
Time-stamped data is more valuable.
Distinguish static attributes from dynamic attributes.
Address the multiple-answer problem.
What representation to require?
Temporal KBP: scoring
Score each element of the 4-tuple separately, then combine the scores.
Smoothed score to handle +∞ and −∞.
Need rules for granularity mismatches: year vs. month vs. day.
Possible formula (constraint-based validation), with key = ⟨t1, t2, t3, t4⟩ and answer = ⟨x1, x2, x3, x4⟩:

S(x_i) = 0 if x_i is judged as incorrect; otherwise S(x_i) = (1/4) · 1 / (1 + |t_i − x_i| / m)

56/55
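Under the formula above (reconstructed from the slide; m is a granularity constant, here one year, and equal infinite bounds are treated as an exact match to keep the score smooth):

```python
def temporal_score(key, answer, correct_flags, m=1.0):
    """Score a 4-tuple temporal answer against the gold key.
    S(x_i) = 0 if x_i is judged incorrect,
    else (1/4) * 1 / (1 + |t_i - x_i| / m)."""
    total = 0.0
    for t, x, ok in zip(key, answer, correct_flags):
        if not ok:
            continue                      # incorrect element earns nothing
        diff = 0.0 if t == x else abs(t - x)  # equality also covers +-inf bounds
        total += 0.25 / (1 + diff / m)
    return total

key = (1995, 1995, 2008, 2008)
ans = (1995, 1996, 2008, 2008)           # one bound off by a year
s = temporal_score(key, ans, [True, True, True, True])  # 0.25*3 + 0.125 = 0.875
```

A one-year miss on a single bound costs half of that element's credit, so near-misses degrade gracefully instead of scoring zero.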
Need Cross-document Aggregation
Query: Ali Larijani; Answer: Iran
Doc1: "Ali Larijani had held the post for over two years but resigned after reportedly falling out with the hardline Ahmadinejad over the handling of Iran's nuclear case."
Doc2: "The new speaker, Ali Larijani, who resigned as the country's nuclear negotiator in October over differences with Ahmadinejad, is a conservative and an ardent advocate of Iran's nuclear program, but is seen as more pragmatic in his approach and perhaps willing to engage in diplomacy with the West."
57/55
Same Relation Repeats Over Time
Query: Mark Buse; Answer: McCain
Doc1: NYT_ENG_20080220.0185.LDC2009T13.sgm — (seven years, P7Y); (2001, 2001)
"In his case, it was a round trip through the revolving door: Buse had directed McCain's committee staff for seven years before leaving in 2001 to lobby for telecommunications companies."
Doc2: LTW_ENG_20081003.0118.LDC2009T13.sgm — (this year, 2008)
"Buse returned to McCain's office this year as chief of staff."
58/55
Requires Paraphrase Discovery
Query: During what period was R. Nicholas Burns a member of the U.S. State Department?
Answer: 1995-2008
1995 — 0112.0477.LDC2007T07:
"R. Nicholas Burns, a career foreign service officer in charge of Russian affairs at the National Security Council, is due to be named the new spokesman at the U.S. State Department, a senior U.S. official said Thursday."
[APW_ENG_20070324.0924.LDC2009T13 and many other docs]:
"The United States is 'very pleased by the strength of this resolution' after two years of diplomacy, said R. Nicholas Burns, undersecretary for political affairs at the State Department."
2008 — 0118.0161.LDC2009T13:
"R. Nicholas Burns, the country's third-ranking diplomat and Secretary of State Condoleezza Rice's right-hand man, is retiring for personal reasons, the State Department said Friday."
2008 — 0302.0157.LDC2009T13:
"The chief U.S. negotiator, R. Nicholas Burns, who left his job on Friday, countered that the sanctions were all about Iran's refusal to stop enriching uranium, not about weapons. But that argument was a tough sell."
59/55
Related Work
Extracting slots for persons and organizations (Bikel et al., 2009; Li et al., 2009; Artiles et al., 2008)
Distant learning (Mintz et al., 2009)
Re-ranking techniques (e.g., Collins et al., 2002; Zhai et al., 2004; Ji et al., 2006)
Answer validation for QA (e.g., Magnini et al., 2002; Peñas et al., 2007; Ravichandran et al., 2003; Huang et al., 2009)
Inference for slot filling (Bikel et al., 2009; Castelli et al., 2010)
Conclusions
KBP proves a much more challenging task than traditional IE/QA.
It brings great opportunity to stimulate research and collaborations across communities: an adventure to promote IE to web-scale processing and higher quality, and to encourage research on cross-document, cross-lingual IE.
Big gains from statistical re-ranking combining three pipelines: information extraction, pattern learning, and question answering.
Further gains from MLN cross-slot reasoning.
Automatic profiles from slot filling dramatically improve entity linking.
Human-system combination provides efficient answer-key generation: faster, better, cheaper!
61/55