
Design Challenges and Misconceptions in Named Entity Recognition
Lev Ratinov and Dan Roth
The named entity recognition problem: identify people, locations, organizations and other entities in text. A typical solution models this as a structured prediction problem with an HMM or CRF.
Demo: http://L2R.cs.uiuc.edu/~cogcomp/demos.php
Download: http://L2R.cs.uiuc.edu/~cogcomp/asoftware.php
How to represent NEs?
There are at least five representation schemes that vary in the way they mark the beginning, the end, and single-token entities. The letters used are B(egin), E(nd), L(ast), I(nside), U(nit), and O(ther). The more states we use, the more expressive the model is, but it may require more training data to converge. The table below illustrates the schemes on a sample phrase.
Token     BIO1    BIO2    IOE1    IOE2    BILOU
retain    O       O       O       O       O
the       O       O       O       O       O
Golan     I-loc   B-loc   I-loc   I-loc   B-loc
Heights   I-loc   I-loc   E-loc   E-loc   L-loc
Israel    B-loc   B-loc   I-loc   E-loc   U-loc
captured  O       O       O       O       O
from      O       O       O       O       O
Syria     I-loc   B-loc   I-loc   E-loc   U-loc

Key result:
BILOU performs the best overall on 5 datasets, and better than BIO2, which is the most commonly used scheme for NER.
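To make the encodings concrete, here is a minimal Python sketch (illustrative only, not code from the poster; the function name and span format are assumptions of the sketch) that converts gold entity spans into BIO2 and BILOU tag sequences for the phrase in the table:

    # Convert entity spans (start, end_exclusive, type) into BIO2 or BILOU tags.
    def encode(tokens, spans, scheme="BILOU"):
        tags = ["O"] * len(tokens)
        for start, end, label in spans:
            if scheme == "BIO2":
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            elif scheme == "BILOU":
                if end - start == 1:
                    tags[start] = "U-" + label          # single-token entity
                else:
                    tags[start] = "B-" + label
                    for i in range(start + 1, end - 1):
                        tags[i] = "I-" + label
                    tags[end - 1] = "L-" + label        # last token of a multi-token entity
        return tags

    tokens = "retain the Golan Heights Israel captured from Syria".split()
    spans = [(2, 4, "loc"), (4, 5, "loc"), (7, 8, "loc")]   # Golan Heights, Israel, Syria
    print(encode(tokens, spans, "BIO2"))
    print(encode(tokens, spans, "BILOU"))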
Main Results: [per-dataset results table shown on the poster; not preserved in this transcript]
Choice of Inference algorithm
Given that NER is a sequential problem,
Viterbi is the most popular inference
algorithm in NER. It computes the most likely
label assignment in a first-order HMM/CRF
in time that is quadratic in the number of
states and linear in the input size. When non-local features are used, exact dynamic
programming with Viterbi becomes
intractable, and Beam search is usually used.
We show that greedy left-to-right decoding
(beam search of size one) performs
comparably to Viterbi and Beam search,
while being much faster. The reason is that in
NER, short NE-chunks are separated by many
O-labeled tokens, which “break” the
likelihood maximization problem to short
independent chunks, where classifiers with
local features perform well. The non-local
dependencies in NER have a very long range,
which the second-order state transition
features fail to model.
With the label set Per/Org/Loc/Misc/O and the BILOU encoding, there are 17 states (B-Per, I-Per, U-Per, etc.), and with second-order Viterbi we need to make 17^3 decisions at each step, in contrast to only 17 necessary in greedy decoding.
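To illustrate why greedy decoding is so much cheaper, here is a minimal Python sketch (illustrative only; the score function stands in for whatever locally trained classifier is used) of greedy left-to-right decoding, which makes a single decision over the 17 labels at each token:

    # Greedy left-to-right decoding: commit to the best label at each token,
    # conditioning only on the features and the previously assigned label.
    def greedy_decode(tokens, labels, score):
        # score(tokens, i, prev_label, label) -> float, higher is better
        assigned = []
        prev = "O"                                   # assumed start state
        for i in range(len(tokens)):
            best = max(labels, key=lambda y: score(tokens, i, prev, y))
            assigned.append(best)
            prev = best                              # |labels| evaluations per step,
                                                     # vs. |labels|**3 for second-order Viterbi
        return assigned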
[Figure: decoding lattice over the tokens Reggie, Blinker, Feyenoord, showing the candidate labels (…, B-PER, I-PER, B-ORG, …) considered at each position.]
How to model non-local information?
Common intuition: multiple occurrences of the same token should be assigned the same label.
We compared three approaches proposed in the literature and found that no single approach consistently outperformed the rest on a collection of 5 datasets. However, the combination of the approaches was the most stable and generally performed the best.
Context Aggregation: collect the contexts of all instances of a token, extract features from all of those contexts, and use them for every instance. For example, all the instances of FIFA in the sample text are given context-aggregation features extracted from both of its contexts:
… suspension lifted by [FIFA] on Friday and …
… two games after [FIFA] slapped a worldwide …
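A minimal sketch of context aggregation (illustrative; the window size and feature names are invented for this example): collect a small window around every occurrence of a token in the document and give the union of those context features to each of its instances.

    from collections import defaultdict

    # Context aggregation: every instance of a token receives features drawn
    # from the contexts of ALL of its occurrences in the document.
    def context_aggregation_features(tokens, window=2):
        contexts = defaultdict(set)
        for i, tok in enumerate(tokens):
            for w in tokens[max(0, i - window):i]:
                contexts[tok].add("AGG_LEFT=" + w)
            for w in tokens[i + 1:i + 1 + window]:
                contexts[tok].add("AGG_RIGHT=" + w)
        # identical feature set for every occurrence of the same token
        return [sorted(contexts[tok]) for tok in tokens]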
Two-stage Prediction Aggregation: the model tags the text in a first round of inference. Then, for each token, we record the label most frequently assigned to its occurrences, and whether the document contains named entities in which the token appears. We use these statistics as features in a second round of inference (see the sketch below).
Extended Prediction History: the previous two techniques treat all the tokens in the document identically. However, it is often the case that the easiest named entities to identify occur at the beginning of the document. When using beam search or greedy decoding, we therefore collect, for each token, statistics on how often it was previously assigned each of the labels, and use them as features.
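A rough sketch of two-stage prediction aggregation (illustrative; the tagger calls in the comments are hypothetical, not an actual API): run the tagger once, aggregate its predictions per token, and feed the statistics back as features for the second round.

    from collections import Counter, defaultdict

    # First-round predictions become features for a second round of inference.
    def aggregate_predictions(tokens, first_round_tags):
        votes = defaultdict(Counter)
        for tok, tag in zip(tokens, first_round_tags):
            votes[tok][tag] += 1
        features = []
        for tok in tokens:
            majority = votes[tok].most_common(1)[0][0]
            in_entity = any(t != "O" for t in votes[tok])
            features.append({"MAJORITY_LABEL": majority,
                             "TOKEN_IN_SOME_ENTITY": in_entity})
        return features

    # Hypothetical usage:
    # tags1 = tagger.predict(tokens)
    # tags2 = tagger.predict(tokens, extra_features=aggregate_predictions(tokens, tags1))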
How to inject knowledge?
We investigated two knowledge resources: word class models extracted from unlabeled text and gazetteers extracted from Wikipedia. The two approaches are independent and their contributions accumulate.
Word class models hierarchically cluster words based on a distributional-similarity-like measure: two clusters are merged if they tend to appear in similar contexts. For example, clustering Friday and Monday together allows us to abstract both as a day of the week and helps deal with the data sparsity problem. A sample binary clustering tree is shown on the poster.
Deciding which level of abstraction to use is challenging; however, similarly to Huffman coding, each word can be assigned the bit-string path from the root of the clustering tree to its leaf. For example, the encoding of IBM is 011. By taking prefixes of different lengths, we can consider different levels of abstraction.
Gazetteers: We use a collection of 14 high-precision, low-recall lists extracted from the web that cover common names, countries, monetary units, temporal expressions, etc. While these gazetteers have excellent accuracy, they do not provide sufficient coverage. To further improve coverage, we extracted 16 gazetteers from Wikipedia, which collectively contain over 1.5M entities.
Wikipedia is an open, collaborative encyclopedia with several attractive properties. (1) It is kept up to date manually by its contributors, so new entities are constantly added to it. (2) It contains redirection pages, mapping several spelling variations of the same name to one canonical entry; for example, Suker is redirected to the entry about Davor Šuker, the Croatian footballer. (3) The entries in Wikipedia are manually tagged with categories. For example, the entry Microsoft has the categories Companies listed on NASDAQ, Cloud computing vendors, etc. We used the Wikipedia category structure to group the titles and the corresponding redirects into higher-level concepts, such as locations, organizations and artworks.
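A small sketch of how the cluster paths can be used at several levels of abstraction (illustrative; the toy paths and prefix lengths below are made up and are not the values used in this work):

    # Word-class features: emit prefixes of a word's cluster path, so the model
    # can back off to coarser clusters for rare words.
    CLUSTER_PATH = {"IBM": "011", "Friday": "1101", "Monday": "1100"}   # toy bit-string paths

    def word_class_features(word, prefix_lengths=(1, 2, 3)):
        path = CLUSTER_PATH.get(word)
        if path is None:
            return []
        return ["PATH_PREFIX_%d=%s" % (n, path[:n]) for n in prefix_lengths if n <= len(path)]

    print(word_class_features("Friday"))   # ['PATH_PREFIX_1=1', 'PATH_PREFIX_2=11', 'PATH_PREFIX_3=110']
    print(word_class_features("Monday"))   # shares the same short prefixes: abstracted as 'day of the week'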
Conclusions
We have presented a conditional model for NER that uses an averaged perceptron (implementation: Learning Based Java (LBJ), http://L2R.cs.uiuc.edu/~cogcomp/asoftware.php) with expressive features to achieve new state-of-the-art performance on the named entity recognition task. We explored four fundamental design decisions: text chunk representation, inference algorithm, non-local features, and external knowledge.
Supported by NSF grant SoD-HCER-0613885, by MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC, and by the National Library of Congress project NDIIPP.