Unsupervised Overlapping Feature Selection for Conditional Random Fields


ROCLING 2011
Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu
Institute of Information Science, Academia Sinica
Department of Computer Science & Engineering, Yuan Ze University


• Term Contributed Boundary feature using Conditional Random Fields in 2010
• A unified view of several unsupervised feature selection methods based on frequent strings
Unlabeled corpus → automatically extracted patterns → Chinese word segmentation model ← labeled training data

• SRILM
• YASA


SRILM
• C++ libraries
• The toolkit supports N-gram statistics for language models
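The counts below are not produced with SRILM itself; this is only a minimal Python sketch of the kind of character N-gram statistics the toolkit provides, run over a made-up two-sentence corpus.

from collections import Counter

def char_ngram_counts(sentences, max_order=5):
    # Count character N-grams of length 1..max_order over the corpus.
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return counts

corpus = ["塑膠原料的生產", "塑膠原料的加工"]  # made-up toy corpus
print(char_ngram_counts(corpus, max_order=3).most_common(5))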

• Automatically extract frequent strings from the unlabeled corpus
Example pattern: 自然科學

Pattern | Frequency | Net Frequency
自然科學 | 4 | 4
自然科 | 4 | 0
自然 | 10 | 6
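A minimal Python sketch of one reading of this example, assuming that the net frequency of a pattern counts only its occurrences that are not covered by an occurrence of a longer extracted pattern. The toy text is constructed to reproduce the counts above; it is not the actual corpus.

import re

def occurrences(text, pattern):
    # Start offsets of (possibly overlapping) occurrences of pattern in text.
    return [m.start() for m in re.finditer(f"(?={re.escape(pattern)})", text)]

def net_frequencies(text, patterns):
    # Longer patterns claim their character positions first; an occurrence of a
    # shorter pattern on already-claimed positions does not count toward its
    # net frequency.
    covered = set()
    net = {}
    for pattern in sorted(patterns, key=len, reverse=True):
        occs = occurrences(text, pattern)
        free = [o for o in occs
                if not any(o + k in covered for k in range(len(pattern)))]
        net[pattern] = len(free)
        for o in free:
            covered.update(range(o, o + len(pattern)))
    return net

text = "自然科學" * 4 + "自然" * 6   # toy text matching the example counts
print(net_frequencies(text, ["自然科學", "自然科", "自然"]))
# {'自然科學': 4, '自然科': 0, '自然': 6}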
Unlabeled corpus → automatically extracted patterns → Chinese word segmentation model ← labeled training data
Character | Label
反 | B1
而 | E
會 | S
欲 | B1
速 | B2
則 | B3
不 | M
達 | E
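A minimal Python sketch of the 6-tag scheme shown above: S for a single-character word; otherwise B1, B2, B3 for the first three characters, M for any further middle characters, and E for the last character.

def six_tag(word):
    # Return the 6-tag label sequence for one word.
    n = len(word)
    if n == 1:
        return ["S"]
    tags = ["B1", "B2", "B3"][: n - 1] + ["M"] * max(0, n - 4)
    return tags + ["E"]

# 反而 / 會 / 欲速則不達  ->  B1 E / S / B1 B2 B3 M E  (as in the table above)
for w in ["反而", "會", "欲速則不達"]:
    print(w, six_tag(w))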

• Feature format: a score followed by a 6-tag label, i.e. [0-9]+(B1|B2|B3|M|E|S)
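A small Python sketch of parsing this score-plus-tag feature format:

import re

FEATURE = re.compile(r"^([0-9]+)(B1|B2|B3|M|E|S)$")

def parse_feature(s):
    # Split a feature such as "3B1" into its score and its 6-tag label.
    m = FEATURE.match(s)
    if m is None:
        raise ValueError(f"not a score+tag feature: {s}")
    return int(m.group(1)), m.group(2)

print(parse_feature("3B1"))   # (3, 'B1')
print(parse_feature("6S"))    # (6, 'S')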

• N-Gram score
• Frequent string score
• Accessor variety score


• Converted from term frequency and N-Gram frequency
• Logarithm ranking mechanism
Pattern | Frequency | Logarithm ranking mechanism (Score)
塑膠原料的 | 10 | log2(10) = 3
塑膠原料 | 5 | log2(5) = 2
原料的 | 3 | log2(3) = 1
的生產 | 2 | log2(2) = 1
塑膠 | 4 | log2(4) = 2
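The scores in the table are consistent with taking the integer part of log2 of the frequency; a minimal Python sketch under that assumption:

import math

def log_rank_score(frequency):
    # Floor of log2 of the frequency, e.g. log2(10) = 3.32 -> 3.
    return int(math.log2(frequency))

for pattern, freq in [("塑膠原料的", 10), ("塑膠原料", 5), ("原料的", 3),
                      ("的生產", 2), ("塑膠", 4)]:
    print(pattern, freq, log_rank_score(freq))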

• Consider the score of the outer pattern
• Equation of AV: AV(s) = min(Lav(s), Rav(s))
Lav(s): number of distinct predecessors of s
Rav(s): number of distinct successors of s
AV(開發與法制):
AV(的開發與法制), AV(是開發與法制), AV(有開發與法制),
AV(開發與法制的), AV(開發與法制是), AV(開發與法制為)
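A minimal Python sketch of accessor variety under the definitions above, with AV(s) = min(Lav(s), Rav(s)) and sentence-boundary accessors ignored; the three-sentence corpus is made up from the extensions listed on this slide.

import re

def accessor_variety(corpus, s):
    # Lav: distinct characters immediately preceding s; Rav: immediately following s.
    left, right = set(), set()
    for sent in corpus:
        for m in re.finditer(f"(?={re.escape(s)})", sent):
            i = m.start()
            if i > 0:
                left.add(sent[i - 1])
            j = i + len(s)
            if j < len(sent):
                right.add(sent[j])
    return min(len(left), len(right))

corpus = ["的開發與法制是", "有開發與法制的", "是開發與法制為"]
print(accessor_variety(corpus, "開發與法制"))  # 3 distinct on each side -> 3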
Pattern: 塑膠原料的
Logarithm ranking mechanism score: log2(10) = 3
6-Tag label: 塑 B1, 膠 B2, 原 B3, 料 M, 的 E
Label with score: 塑 3B1, 膠 3B2, 原 3B3, 料 3M, 的 3E
Scores are also used for filtering overlapping patterns
Character | TCB Feature
塑 | B1
膠 | B2
原 | B3
料 | M
的 | E
生 | -1
產 | -1

“塑膠原料的” (score 3) conflicts with “的生產” (score 1); “的生產” is labeled as unseen.
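A minimal Python sketch of this conflict handling, assuming higher-scored patterns claim their characters first and a lower-scored pattern that overlaps a kept one is discarded, its uncovered characters marked as unseen (-1):

def six_tag(word):
    # 6-tag labels for one pattern.
    if len(word) == 1:
        return ["S"]
    return (["B1", "B2", "B3"][: len(word) - 1]
            + ["M"] * max(0, len(word) - 4) + ["E"])

def tcb_labels(sentence, scored_patterns):
    # scored_patterns: {pattern: score}; returns one TCB label per character.
    labels = ["-1"] * len(sentence)
    taken = [False] * len(sentence)
    for pat, score in sorted(scored_patterns.items(), key=lambda kv: -kv[1]):
        start = sentence.find(pat)
        if start == -1:
            continue
        span = range(start, start + len(pat))
        if any(taken[i] for i in span):
            continue  # conflicts with an already-kept, higher-scored pattern
        for i, tag in zip(span, six_tag(pat)):
            labels[i] = tag
            taken[i] = True
    return labels

print(tcb_labels("塑膠原料的生產", {"塑膠原料的": 3, "的生產": 1}))
# ['B1', 'B2', 'B3', 'M', 'E', '-1', '-1']  -- the TCB column above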
Term | Label
反 | 5S3B14B1
而 | 6S3E4B2
會 | 6S4E
欲 | 4S
速 | 4S
則 | 6S3B1
不 | 7S3E
達 | 5S3E
Unsupervised Feature Selection
Input | 1 char | 2 char | 3 char | 4 char | 5 char
反 | 5S | 3B1 | 4B1 | 0B1 | 0B1
而 | 6S | 3E | 4B2 | 0B2 | 0B2
會 | 6S | 0E | 4E | 0B3 | 0B3
欲 | 4S | 0E | 0E | 0E | 0M
速 | 4S | 0E | 0E | 0E | 0E
則 | 6S | 3B1 | 0E | 0E | 0E
不 | 7S | 3E | 0E | 0E | 0E
達 | 5S | 3E | 0E | 0E | 0E
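A hedged Python sketch of how such a per-character feature matrix can be built: for every character and window size n, collect a score+tag feature from each n-character window that covers the character. The table above shows one feature per cell and the following builds keep extra overlapping features such as 3E0B1; this sketch simply keeps every covering window. The score dictionary below holds only values recoverable from the table; unseen windows get score 0.

def six_tag(word):
    if len(word) == 1:
        return ["S"]
    return (["B1", "B2", "B3"][: len(word) - 1]
            + ["M"] * max(0, len(word) - 4) + ["E"])

def feature_matrix(sentence, scores, max_n=5):
    rows = []
    for i, ch in enumerate(sentence):
        row = {"char": ch}
        for n in range(1, max_n + 1):
            feats = []
            # Every length-n window of the sentence that covers position i.
            for start in range(max(0, i - n + 1), min(i, len(sentence) - n) + 1):
                window = sentence[start:start + n]
                tag = six_tag(window)[i - start]
                feats.append(f"{scores.get(window, 0)}{tag}")
            row[f"{n} char"] = " ".join(feats)
        rows.append(row)
    return rows

toy_scores = {"反": 5, "而": 6, "反而": 3, "反而會": 4, "則不": 3, "不達": 3}
for row in feature_matrix("反而會欲速則不達", toy_scores, max_n=3):
    print(row)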
Keeping overlapping information, the 2-char entry for 而 becomes 3E0B1 (the table is otherwise unchanged).
Keeping overlapping information, the 2-char entry for 不 becomes 3E3B1 (the table is otherwise unchanged).


• Character-based N-grams extracted by SRILM
• Keeping overlapping information

Unsupervised Feature Selection
Input | 1 char | 2 char | 3 char | 4 char | 5 char
戲 | 3S | 1E | 2B1 | 1B1 | 1B1
劇 | 4S | 2B1 | 2B2 | 1B2 | 1B2
性 | 6S | 3B1 | 2E | 1B3 | 1B3
的 | 1S | 3E | 1E | 1E | 1M
結 | 2S | 2E | 1E | 1E | 1E
果 | 5S | 2B1 | 1E | 1E | 1E


• Using frequent strings from YASA
• Selected by the forward maximum matching algorithm

Character | TCB Feature
塑 | B1
膠 | B2
原 | B3
料 | M
的 | E
生 | -1
產 | -1
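A minimal Python sketch of the selection step, assuming forward maximum matching over a dictionary of frequent strings of length two or more, with characters covered by no frequent string marked -1; the three frequent strings below are taken from the earlier example.

def six_tag(word):
    if len(word) == 1:
        return ["S"]
    return (["B1", "B2", "B3"][: len(word) - 1]
            + ["M"] * max(0, len(word) - 4) + ["E"])

def fmm_tcb(sentence, frequent_strings, max_len=5):
    labels, i = [], 0
    while i < len(sentence):
        match = None
        for n in range(min(max_len, len(sentence) - i), 1, -1):  # longest first
            if sentence[i:i + n] in frequent_strings:
                match = sentence[i:i + n]
                break
        if match:
            labels.extend(six_tag(match))
            i += len(match)
        else:
            labels.append("-1")   # not covered by any frequent string
            i += 1
    return labels

print(fmm_tcb("塑膠原料的生產", {"塑膠原料的", "的生產", "塑膠"}))
# ['B1', 'B2', 'B3', 'M', 'E', '-1', '-1']  -- the TCB column above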



• Using frequent strings from YASA
• Keeping overlapping information
• Converting scores from frequent strings

Unsupervised Feature Selection
Input | 1 char | 2 char | 3 char | 4 char | 5 char
欲 | 4S | 0E | 0E | 0E | 0M
速 | 4S | 0E | 0E | 0E | 0E
則 | 6S | 3B1 | 0E | 0E | 0E
不 | 7S | 3E | 0E | 0E | 0E
達 | 5S | 3E | 0E | 0E | 0E



• Using SRILM to generate N-Grams
• Measure how likely a substring is a Chinese word
• Using the logarithm ranking mechanism

Unsupervised Feature Selection
Input | 1 char | 2 char | 3 char | 4 char | 5 char
塑 | -1 | 2B2 | -1 | 2B1 | 3B1
膠 | -1 | 2E | -1 | 2B2 | 3B2
原 | -1 | 2B2 | 1B1 | 2M | 3B3
料 | -1 | 2E | 1B2 | 2E | 3M
的 | 3S | -1 | 1E | -1 | 3E
生 | -1 | 0E | 1B2 | -1 | -1
產 | -1 | 0E | 1E | -1 | -1

• Compound AVS and TCB/TCF

Input | 1 char | 2 char | 3 char | 4 char | 5 char | TCB
塑 | -1 | 2B2 | -1 | 2B1 | 3B1 | B1
膠 | -1 | 2E | -1 | 2B2 | 3B2 | B2
原 | -1 | 2B2 | 1B1 | 2M | 3B3 | B3
料 | -1 | 2E | 1B2 | 2E | 3M | M
的 | 3S | -1 | 1E | -1 | 3E | E
生 | -1 | 0E | 1B2 | -1 | -1 | -1
產 | -1 | 0E | 1E | -1 | -1 | -1
Feature set | Overlapping 6-Tag | Labeled Score
None | - | None
AVS | V | AV score
CNG | V | N-Gram score
TCB | - | -
TCF | V | Frequent String score
AVS+TCB | V | AV score
AVS+TCF | V | Frequent String score, AV score
Unlabeled corpus → automatically extracted patterns → Chinese word segmentation model ← labeled training data


• Undirected graphical models trained to maximize the conditional probability of the label random variables Y given the observation random variables X
• Feature instances are generated from a template file
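For reference, the conditional probability of a linear-chain CRF in standard notation (this notation is not taken from the slides):

P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)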

Feature template
Feature | Function
C-1, C0, C1 | Previous, current, or next token
C-1C0 | Previous and current tokens
C0C1 | Current and next tokens
C-1C1 | Previous and next tokens

The same feature template applied to the example 欲速則不達.
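A minimal Python sketch of expanding this template at one position of 欲速則不達; the padding symbol "_" for out-of-range positions is an assumption.

def expand_templates(chars, i):
    # Instantiate the template features at position i of the character list.
    get = lambda j: chars[j] if 0 <= j < len(chars) else "_"
    return {
        "C-1":   get(i - 1),
        "C0":    get(i),
        "C1":    get(i + 1),
        "C-1C0": get(i - 1) + get(i),
        "C0C1":  get(i) + get(i + 1),
        "C-1C1": get(i - 1) + get(i + 1),
    }

print(expand_templates(list("欲速則不達"), 2))
# {'C-1': '速', 'C0': '則', 'C1': '不', 'C-1C0': '速則', 'C0C1': '則不', 'C-1C1': '速不'}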

• Data sets
◦ Academia Sinica (AS)
◦ City University of Hong Kong (CityU)
◦ Microsoft Research (MSR)
◦ Peking University (PKU)
Precision: P = \frac{\text{number of words that are correctly segmented}}{\text{number of words that are segmented}}

Recall: R = \frac{\text{number of words that are correctly segmented}}{\text{number of words in the gold standard}}

F-measure: F = \frac{2PR}{P + R}
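A minimal Python sketch of these word-level metrics, assuming segmentations are compared as character-offset spans:

def spans(words):
    # Convert a segmented sentence into a set of (start, end) character spans.
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out

def prf(gold_words, system_words):
    gold, system = spans(gold_words), spans(system_words)
    correct = len(gold & system)          # correctly segmented words
    p = correct / len(system)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example using the gold segmentation 反而 / 會 / 欲速則不達 from above.
print(prf(["反而", "會", "欲速則不達"], ["反而", "會", "欲速", "則不達"]))
# (0.5, 0.666..., 0.571...)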
R_{OOV} = \frac{\text{number of OOV words that are correctly segmented}}{\text{number of OOV words in the gold standard}}

\text{Rank Score} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}, where N is the number of data sets
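A minimal Python sketch of the rank score, i.e. the mean reciprocal rank of a system over the data sets; the example ranks are made up.

def rank_score(ranks):
    # ranks[i] is the system's rank on data set i.
    return sum(1.0 / r for r in ranks) / len(ranks)

print(rank_score([1, 2, 1, 3]))  # e.g. ranks on four data sets -> 0.708...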
[Chart: F1 on the AS, CityU, MSR, and PKU data sets (y-axis 0.92 to 0.98)]
[Chart: rank score (MRR) of F1]
[Chart: recall of out-of-vocabulary words on the AS, CityU, MSR, and PKU data sets (y-axis 0.66 to 0.8)]
[Chart: rank score (MRR) of ROOV]



• The feature collections that contain AVS obtain better F1
• TCB/TCF enhances the 6-tag approach on the recall of out-of-vocabulary words (ROOV)
• Only with high-quality features can overlapping labels keep useful information