Natural Language Processing Lab.
National Taiwan University
Named Entity Extraction Task
at National Taiwan University
Hsin-Hsi Chen, Yung-Wei Ding, Shih-Chung Tsai, Guo-Wei Bian
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan, R.O.C.
Previous Named Entity Work
at NTU
• COLING96 (Chen & Lee)
– person names: (92.56%, 88.04%)
– transliterated person names: (71.93%, 50.62%)
– organization names: (54.50%, 61.79%)
• Applications
– sentence alignment (Chen & Wu, 1995)
– anaphora resolution (Chen & Lee, 1996)
– white page construction (Chen & Bian, 1997)
– information retrieval (Chen, Ding, & Tsai, 1998)
Flow of Named Entity Extraction in MET2
• Transform Chinese texts in GB codes into texts in Big-5 codes.
• Segment Chinese texts into a sequence of tokens.
• Identify named people.
• Identify named organizations.
• Identify named locations.
• Use an n-gram model to identify named organizations/locations.
• Identify the rest of the named expressions.
• Transform the results in Big-5 codes back into GB codes.
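As a reading aid, the pipeline above can be pictured as a simple composition of steps; in the sketch below every step is an identity placeholder standing in for the real module, so only the control flow is illustrated.

```python
from typing import Callable, List

Step = Callable[[str], str]

def make_pipeline(steps: List[Step]) -> Step:
    """Compose the processing steps in order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run

# Placeholder steps named after the eight stages listed above.
steps: List[Step] = [
    lambda t: t,  # 1. transform GB codes into Big-5 codes
    lambda t: t,  # 2. segment the text into a sequence of tokens
    lambda t: t,  # 3. identify named people
    lambda t: t,  # 4. identify named organizations
    lambda t: t,  # 5. identify named locations
    lambda t: t,  # 6. n-gram model for organizations/locations
    lambda t: t,  # 7. identify the remaining named expressions
    lambda t: t,  # 8. transform the results back into GB codes
]
process = make_pipeline(steps)
```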
Transformation
from GB codes to Big-5 codes
• The Big-5 traditional character set and the GB simplified character set are adopted in Taiwan and in China, respectively.
• Our system is developed on the basis of Big-5 codes, so the transformation is required.
• Characters used in both the simplified and the traditional character sets always result in mapping errors.
– 旅遊 vs. 旅游, 那麼 vs. 那么, 幾十年 vs. 几十年, 長時間裡 vs. 長時間里, 報導 vs. 報道, 準確 vs. 准確, 好像 vs. 好象, 最後 vs. 最后, 並不是 vs. 并不是, 由於 vs. 由于, and so on.
• More unknown words may be generated.
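A minimal sketch of the code-conversion step, assuming Python's built-in "gb2312" and "big5" codecs and a deliberately tiny simplified-to-traditional table; it only illustrates why the conversion is lossy and is not the original implementation.

```python
SIMPLIFIED_TO_TRADITIONAL = {
    "游": "遊",  # 旅游 -> 旅遊
    "么": "麼",  # 那么 -> 那麼
    "后": "後",  # 最后 -> 最後 (but 皇后 should keep 后: an ambiguity)
    "于": "於",  # 由于 -> 由於
}

def gb_to_big5(gb_bytes: bytes) -> bytes:
    """Decode GB-coded bytes, map simplified characters to traditional ones
    with a (necessarily incomplete) table, and re-encode as Big-5."""
    text = gb_bytes.decode("gb2312")
    converted = "".join(SIMPLIFIED_TO_TRADITIONAL.get(ch, ch) for ch in text)
    # Characters with no Big-5 counterpart are replaced, which is one source
    # of the unknown words mentioned above.
    return converted.encode("big5", errors="replace")

print(gb_to_big5("由于旅游".encode("gb2312")).decode("big5"))  # 由於旅遊
```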
Segmentation
• We list all the possible words by dictionary look-up, and then resolve
ambiguities by segmentation strategies.
• The test documents in MET-2 are selected from newspapers published in China.
• Our dictionary is trained from Taiwan corpora.
• Due to the different vocabulary sets, many more unknown words may be introduced, e.g., “人工智慧” vs. “人工智能”, “軟體” vs. “軟件”, “肯亞” vs. “肯尼亞”, “紐西蘭” vs. “新西蘭”, etc.
• The unknown words arising from the different code sets and vocabulary sets make named entity extraction more challenging.
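The slide only says "dictionary look-up plus segmentation strategies"; the sketch below shows one common strategy (greedy forward maximum matching) to make the unknown-word problem concrete. The disambiguation rule here is an assumption, not necessarily the lab's actual strategy.

```python
def segment(text: str, dictionary: set, max_len: int = 6) -> list:
    """Greedy forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character (an unknown word)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in dictionary:
                tokens.append(cand)
                i += length
                break
    return tokens

# With a Taiwan-trained dictionary, the mainland term "人工智能" falls apart
# into single characters because only "人工智慧" is listed:
print(segment("人工智能", {"人工智慧"}))  # ['人', '工', '智', '能']
```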
Results of MET-2 Formal Run of
NTUNLPL
• F-measures
– P&R: 79.61%
– 2P&R: 77.88%
– P&2R: 81.42%
• Recall and Precision
– name: (85%, 79%)
– number: (91%, 98%)
– time: (95%, 85%)
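For reference, the three figures above are weighted F-measures over precision P and recall R. The formula below is the standard weighted F-measure; reading "2P&R" as precision-weighted and "P&2R" as recall-weighted follows the usual MUC scoring convention and is an assumption here, not stated on the slide.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r."""
    return (beta * beta + 1.0) * p * r / (beta * beta * p + r)

# f_measure(p, r, beta=1.0)  # "P&R":  precision and recall weighted equally
# f_measure(p, r, beta=0.5)  # "2P&R": precision counts twice as much
# f_measure(p, r, beta=2.0)  # "P&2R": recall counts twice as much
```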
Named People Extraction
• Chinese person names
– Chinese person names are composed of a surname and a given name.
– Most Chinese surnames are a single character; some rare ones are two characters.
– Most given names are two characters; some rare ones are a single character.
– The length of Chinese person names ranges from 2 to 6
characters.
• Transliterated person names
– Transliterated person names denote foreigners.
– The length of transliterated person names is not restricted to 2 to 6
characters.
Named People Extraction:
Chinese Person Names
• Extraction Strategies
– baseline models: name-formulation statistics
» Propose possible candidates.
– context clues
» Add extra scores to the candidates.
» When a title appears before (after) a string, it is probably a person name.
» Person names usually appear at the head or the tail of a sentence.
» Persons may be accompanied by speech-act verbs like "發言", "說", "提出", etc.
– cache: occurrences of named people
» A candidate appearing more than once has a high tendency to be a person name.
Training Data
• The name-formulation statistics are trained from a corpus of one million person names from Taiwan.
• Each entry contains a surname, a given name, and sex.
• There are 489,305 male names and 509,110 female names.
• A total of 598 surnames are retrieved from this 1-M corpus.
• Surnames of very low frequency, like “是”, “那”, etc., are removed to avoid false alarms.
• Only 541 surnames are left; they are used to trigger the person-name identification system.
Training Data (Continued)
• The probability of a Chinese character being the first character (or the second character) of a given name is computed for males and females separately.
• We compute the probabilities using training tables for females and males, respectively.
• In some cases, either the male score or the female score must be greater than the thresholds.
• In some cases, the female score must be greater than the male score.
• Thresholds are defined so that 99% of the training data pass them.
Baseline Models:
utilize name-formulation statistics
• Model 1. Single-character surname, e.g., ‘趙’, ‘錢’, ‘孫’, and ‘李’
– P(C1)*P(C2)*P(C3) using the training table for male > Threshold1 and
P(C2)*P(C3) using training table for male > Threshold2, or
– P(C1)*P(C2)*P(C3) using the training table for female > Threshold3 and
P(C2)*P(C3) using the training table for female > Threshold4
• Model 2. Two characters, e.g., ‘歐陽’ and ‘上官’
– P(C2)*P(C3) using training table for male > Threshold2, or
– P(C2)*P(C3) using training table for female > Threshold4
• Model 3. Two surnames together like '蔣宋’
– P(C12)*P(C2)*P(C3) using the training table for female > Threshold3,
P(C2)*P(C3) using the training table for female > Threshold4 and
P(C12)*P(C2)*P(C3) using the training table for female >
P(C12)*P(C2)*P(C3) using training table for male
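A hedged sketch of the Model 1 test above: products of character probabilities from the male or female training tables are compared against the thresholds. The table layout and parameter names are placeholders; only the form of the test comes from the slide.

```python
def model1_passes(c1, c2, c3, table, th_full, th_name):
    """table maps a character to (P_as_surname, P_as_first_name_char,
    P_as_second_name_char) for one sex."""
    p1 = table.get(c1, (0.0, 0.0, 0.0))[0]
    p2 = table.get(c2, (0.0, 0.0, 0.0))[1]
    p3 = table.get(c3, (0.0, 0.0, 0.0))[2]
    return p1 * p2 * p3 > th_full and p2 * p3 > th_name

def model1_candidate(c1, c2, c3, male_table, female_table, th1, th2, th3, th4):
    """C1C2C3 is proposed as a person-name candidate if it passes the male
    test (Thresholds 1 and 2) or the female test (Thresholds 3 and 4)."""
    return (model1_passes(c1, c2, c3, male_table, th1, th2)
            or model1_passes(c1, c2, c3, female_table, th3, th4))
```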
Cache
• Use a cache to store the identified candidates, and reset the cache when the next document is considered.
• Four cases (the first two and the last two form contradictory pairs):
– C1C2C3 and C1C2C4 are in the cache, and C1C2 is correct.
– C1C2C3 and C1C2C4 are in the cache, and both are correct.
– C1C2C3 and C1C2 are in the cache, and C1C2C3 is correct.
– C1C2C3 and C1C2 are in the cache, and C1C2 is correct.
• The entry with the higher weight is selected.
• When both have low weights, the 2nd character of the name is critical.
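A small sketch of the per-document cache described above: candidates are stored with a weight, the cache is reset for each new document, and a conflict between two contradictory entries is resolved in favour of the higher-weighted one. The weighting scheme itself is not given on the slide, so it is left abstract here.

```python
class NameCache:
    """Per-document cache of person-name candidates."""

    def __init__(self):
        self.entries = {}  # candidate string -> weight

    def reset(self):
        """Called when the next document is considered."""
        self.entries.clear()

    def add(self, candidate: str, weight: float):
        # Keep the highest weight seen for a candidate.
        self.entries[candidate] = max(weight, self.entries.get(candidate, 0.0))

    def resolve(self, a: str, b: str) -> str:
        """Pick the higher-weighted of two contradictory entries
        (e.g., C1C2C3 vs. C1C2)."""
        return a if self.entries.get(a, 0.0) >= self.entries.get(b, 0.0) else b
```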
Named People Extraction:
Transliterated Person Names
• transliterated name set: a built-in set
• character condition
– The first character must belong to a 280-character set.
– The remaining characters must appear in a 411-character set.
– The character condition is a loose restriction. It should be employed with other
clues.
• clues
– titles: the same as Chinese person names
– name introducers: "叫", "叫作", "叫做", "名叫", and "尊稱"
– special verbs: the same as Chinese person names
• first name‧middle name‧last name: the separator dot between name parts is another clue.
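A hedged sketch of the character condition above; the two character sets here are tiny stand-ins for the real 280- and 411-character sets, and in practice the check is combined with the titles, name introducers, and special verbs listed above.

```python
FIRST_CHARS = set("克希巴尼")      # stand-in for the 280-character set
OTHER_CHARS = set("林頓拉蕊爾夫")  # stand-in for the 411-character set

def satisfies_character_condition(candidate: str) -> bool:
    """The first character must be in the first-character set and every
    remaining character in the remaining-character set."""
    return (len(candidate) > 0
            and candidate[0] in FIRST_CHARS
            and all(ch in OTHER_CHARS for ch in candidate[1:]))

print(satisfies_character_condition("克林頓"))  # True, with this toy set
```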
Discussion
• The recall rate and the precision rate are 91% and 74%, respectively.
• Major errors
– segmentation, e.g., 盛世良 -> 盛世 良
Part of a person name may be regarded as a word during segmentation.
– the surname set, character set, and title set are incomplete, e.g., 肖成林, 卡拉 捷 耶夫, 醫生 卡庫
– blanks, e.g., 羅 俏
We cannot tell whether the blanks exist in the original documents or are inserted by the segmentation system.
– boundary errors
– Japanese names, e.g., 田中真紀子
Named Organization Extraction
• An organization name can be divided into a name part and a keyword part.
• The rules
– OrganizationName → OrganizationName OrganizationNameKeyword, e.g., 聯合國 部隊
– OrganizationName → CountryName OrganizationNameKeyword, e.g., 美國 大使館
– OrganizationName → PersonName OrganizationNameKeyword, e.g., 羅慧夫 基金會
– OrganizationName → CountryName OrganizationName, e.g., 美國 國防部
– OrganizationName → LocationName OrganizationName, e.g., 伊利諾州 州府
Named Organization Extraction (Continued)
– OrganizationName → CountryName {D|DD} OrganizationNameKeyword, e.g., 中國 國際 廣播電台
– OrganizationName → PersonName {D|DD} OrganizationNameKeyword, e.g., 羅慧夫 文教 基金會
– OrganizationName → LocationName {D|DD} OrganizationNameKeyword, e.g., 台北 國際 廣播電台
• We collect 776 organization names and 1,059 keywords.
• Transliterated person names and location names must satisfy the character conditions mentioned for named people.
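A hedged sketch of how rules of the form above can be applied to a segmented sentence: a name part (country, person, or location name) followed, possibly with one or two intervening words, by an organization keyword is proposed as an organization name. The tiny name and keyword lists are illustrative only.

```python
ORG_KEYWORDS = {"部隊", "大使館", "基金會", "廣播電台"}
NAME_PARTS = {"聯合國", "美國", "羅慧夫", "台北"}

def find_organizations(tokens, max_gap=2):
    """Return (start, end) index spans whose first token is a name part and
    whose last token is an organization keyword, with at most max_gap
    intervening words (the {D|DD} part of the rules)."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok not in NAME_PARTS:
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] in ORG_KEYWORDS:
                spans.append((i, j))
                break
    return spans

print(find_organizations(["台北", "國際", "廣播電台"]))  # [(0, 2)]
```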
Named Organization Extraction (Continued)
• Cache
– problem: when should a pattern be put into the cache?
– The character set is incomplete.
• n-gram model
– It must consist of a name part and an organization name keyword.
– Its length must be greater than 2 words.
– It must not cross any punctuation marks.
– It must occur more often than a threshold.
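A hedged sketch of the n-gram filter above: word n-grams are counted, and a candidate is kept only if it is longer than two words, starts with a name part, ends with an organization keyword, contains no punctuation, and occurs often enough. The helper sets and the frequency threshold are illustrative.

```python
from collections import Counter

PUNCTUATION = set("，。、；：！？")

def ngram_org_candidates(tokens, keywords, name_parts, max_n=4, min_count=2):
    """Count word n-grams (n > 2) over the token sequence and keep those
    satisfying the four conditions listed above."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(3, max_n + 1)                  # longer than 2 words
        for i in range(len(tokens) - n + 1)
    )
    return [
        ("".join(gram), c) for gram, c in counts.items()
        if c >= min_count                             # occurs often enough
        and gram[0] in name_parts                     # starts with a name part
        and gram[-1] in keywords                      # ends with a keyword
        and not any(t in PUNCTUATION for t in gram)   # no punctuation inside
    ]
```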
Discussion
• The recall rate and the precision rate are 78% and 85%.
• Major errors
– more than two content words between name and keyword
e.g., 中國 衛星 發射 代理 公司
– absence of keywords
e.g., 巴解法塔賀武裝
– absence of the name part
The name part does not satisfy the character condition, e.g., 亞星公司
– n-gram errors
e.g., 安得拉邦東南部發射基地
Named Location Extraction
• A location name is composed of name and keyword parts.
• Rules
– LocationName → PersonName LocationNameKeyword
– LocationName → LocationName LocationNameKeyword
• Locative verbs like '來自', '前往', and so on, are introduced to handle location names without keywords.
• Cache and n-gram models are also employed to extract
location names.
Discussion
• character set
– The characters "鹿" and "島" in the string "鹿兒島縣" do not belong to our
transliterated character set.
• wrong keyword
– The character "部" is an organization keyword, so the string "菲律賓馬部" is mistakenly regarded as an organization name.
• common content words
– Words such as "太陽", "土星", etc., are common content words; we do not give them special tags.
• single-character locations
– The single-character locations such as "中", "日", and so on, are missed during
recognition.
• intervening words between the name part and the keyword
Other Entities: date expressions
• DATE → NUMBER YEAR (三 年)
• DATE → NUMBER MTHUNIT (十 月)
• DATE → NUMBER DUNIT (五 日)
• DATE → REGINC (元旦)
• DATE → FSTATE DATE (今年 三月)
• DATE → COMMON DATE (前 兩年)
• DATE → REGINE DATE (民國 七十八年)
• DATE → DATE DMONTH (今年 三月)
• DATE → DATE BSTATE (去年 初)
• DATE → FSTATEDATE DATE (這年 三月底)
• DATE → FSTATEDATE DMONTH (今年 元月)
• DATE → FSTATEDATE FSTATEDATE (明年 今天)
• DATE → DATE YXY DATE (去年三月 至 今年五月)
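A minimal sketch of how rules such as DATE → NUMBER MTHUNIT and DATE → FSTATE DATE can be applied bottom-up over segmented tokens; the tiny lexicon and the repeated left-to-right merging pass are illustrative assumptions, not the original implementation.

```python
LEXICON = {"三": "NUMBER", "五": "NUMBER", "月": "MTHUNIT", "日": "DUNIT",
           "今年": "FSTATE"}
RULES = [("DATE", ("NUMBER", "MTHUNIT")),
         ("DATE", ("NUMBER", "DUNIT")),
         ("DATE", ("FSTATE", "DATE"))]

def tag_dates(tokens):
    """Tag each token with its lexical category, then repeatedly merge
    adjacent spans that match a DATE rule until nothing changes."""
    spans = [(LEXICON.get(t, "OTHER"), t) for t in tokens]
    changed = True
    while changed:
        changed = False
        for i in range(len(spans) - 1):
            for lhs, rhs in RULES:
                if (spans[i][0], spans[i + 1][0]) == rhs:
                    spans[i:i + 2] = [(lhs, spans[i][1] + spans[i + 1][1])]
                    changed = True
                    break
            if changed:
                break
    return spans

print(tag_dates(["今年", "三", "月"]))  # [('DATE', '今年三月')]
```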
Other Entities: time expressions
• TIME → NUMBER HUNIT (五 時)
• TIME → NUMBER MUNIT (三十 分)
• TIME → NUMBER SUNIT (六 秒)
• TIME → FSTATETIME TIME
• TIME → FSTATE TIME
• TIME → TIME BSTATE
• TIME → MORN BSTATE
• TIME → TIME TIME
• TIME → TIME YXY TIME (今天 到 明天)
• TIME → NUMBER COLON NUMBER (03 : 45)
Other Entities: monetary expressions
• DMONEY → MOUNIT NUMBER MOUNIT (美金 五 元)
• DMONEY → NUMBER MOUNIT MOUNIT (五 元 美金)
• DMONEY → NUMBER MOUNIT (五 元)
• DMONEY → MOUNIT MOUNIT NUMBER (美金 $ 5)
• DMONEY → MOUNIT NUMBER ($ 5)
• DMONEY → NUMBER YXY DMONEY (三 至 五元)
• DMONEY → DMONEY YXY DMONEY (三元 至 五元)
• DMONEY → DMONEY YXY NUMBER ($200 - 500)
Other Entities: percentage expressions
• DPERCENT → PERCENT NUMBER (百分之 十)
• DPERCENT → NUMBER PERCENT (3 %)
• DPERCENT → DPERCENT YXY DPERCENT (5% 到 8%)
• DPERCENT → DPERCENT YXY NUMBER (百分之八 到 十)
• DPERCENT → NUMBER YXY DPERCENT (八 到 十百分點)
Discussion
• The recall and precision rates for date, time, monetary, and percentage expressions are (94%, 88%), (98%, 70%), (98%, 98%), and (83%, 98%), respectively.
• Major errors
– propagation errors
» from segmentation, which runs before entity extraction, e.g., “迄今”
» from named people extraction, which runs before date expression extraction
– missing date units
» the date unit does not appear, e.g., “一九九六” (no “年”)
» the date unit should appear but is missing, e.g., “九月十” (no “日”)
Discussion (Continued)
– Missing keywords
» Some keywords are not listed.
» E.g., “上午莫斯科時間8點58分” is divided into “上午” and “8點58分”.
– Rule coverage
» E.g., “今、明兩年”
– Ambiguity
» Some characters like “點” can be used in both time and monetary expressions. E.g., “十二點七七億美元” is divided into two parts: “十二點” and “七七億美元”.
» The strings "十分" and "一時" are words. In our pipelined model, "九點十分" and "下午一時" will be missed.
Concluding Remarks
• Features of the NTUNLPL named entity extraction system
– Propose a pipeline model
– Employ different types of information from different levels of text
– Achieve a recall rate of 83% and a precision rate of 77%
• Major types of errors
– propagation errors, keyword sets, character sets, rule coverage, …
• Future work
– how to integrate different modules in an interleaving way
– how to learn grammar rules, keyword sets, and character sets automatically