Unsupervised Overlapping Feature Selection for Conditional
Rocling 2011
Ting-hao Yang, Tian-jian Jiang, Chan-hung Kuo, Richard Tzong-han Tsai, Wen-lian Hsu
Institute of Information Science, Academia Sinica
Department of Computer Science & Engineering, Yuan Ze University
Term Contributed Boundary (TCB) feature for Conditional Random Fields, introduced in 2010
A unified view of several unsupervised feature selection methods based on frequent strings
[Figure: system overview linking the unlabeled corpus, automatically extracted patterns, labeled training data, and the Chinese word segmentation model]
SRILM
YASA
SRILM: C++ libraries; the toolkit supports N-gram statistics for language models
YASA: automatically extracts frequent strings from an unlabeled corpus
Pattern: 自然科學
String      Frequency   Net frequency
自然科學    4           4
自然科      4           0
自然        10          6
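The difference between frequency and net frequency can be sketched in a few lines. This is only an illustration under an assumed definition: an occurrence counts toward the net frequency when it is not contained in an occurrence of a longer candidate pattern. The toy corpus and the function names are hypothetical, and this is not YASA's actual algorithm.

```python
def occurrences(text, pattern):
    """Return (start, end) spans of every occurrence of pattern in text."""
    spans, start = [], text.find(pattern)
    while start != -1:
        spans.append((start, start + len(pattern)))
        start = text.find(pattern, start + 1)
    return spans


def frequency_table(text, patterns):
    """Map each pattern to (frequency, net frequency)."""
    spans = {p: occurrences(text, p) for p in patterns}
    table = {}
    for p in patterns:
        # Spans of all strictly longer candidate patterns.
        longer = [s for q in patterns if len(q) > len(p) for s in spans[q]]
        # Assumed rule: skip occurrences covered by a longer pattern.
        net = sum(
            1 for (a, b) in spans[p]
            if not any(c <= a and b <= d for (c, d) in longer)
        )
        table[p] = (len(spans[p]), net)
    return table


# Toy corpus (an assumption) built to mirror the counts in the example.
corpus = "自然科學。" * 4 + "自然界。" * 6
table = frequency_table(corpus, ["自然科學", "自然科", "自然"])
```

With this toy corpus, 自然科 never occurs outside 自然科學, which is why its net frequency drops to 0.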
Character   Label
反          B1
而          E
會          S
欲          B1
速          B2
則          B3
不          M
達          E
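The 6-tag labels in the example can be derived from a segmented sentence as follows. This is a minimal sketch that assumes the usual convention: S for single-character words, B1, B2, B3 for the first three characters, M for any further middle characters, and E for the last character. The function names are hypothetical.

```python
def six_tags(length):
    """Return the 6-tag label sequence for a word of the given length."""
    if length == 1:
        return ["S"]
    head = ["B1", "B2", "B3"][: length - 1]   # up to three begin tags
    middle = ["M"] * max(0, length - 4)       # remaining inner characters
    return head + middle + ["E"]


def label_sentence(words):
    """Pair every character of a segmented sentence with its 6-tag label."""
    pairs = []
    for word in words:
        pairs.extend(zip(word, six_tags(len(word))))
    return pairs


pairs = label_sentence(["反而", "會", "欲速則不達"])
tags = [tag for _, tag in pairs]
```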
Feature format: [0-9]+(B1|B2|B3|M|E|S), a numeric score followed by a 6-tag label
N-Gram score
Frequent string score
Accessor variety score
Convert from term frequency and
N-Gram frequency
Logarithm ranking mechanism
Logarithm ranking mechanism: score = floor(log2(frequency))
Pattern       Frequency   Score
塑膠原料的    10          floor(log2(10)) = 3
塑膠原料      5           floor(log2(5)) = 2
原料的        3           floor(log2(3)) = 1
的生產        2           floor(log2(2)) = 1
塑膠          4           floor(log2(4)) = 2
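The scores in this table are consistent with taking the base-2 logarithm of the frequency and rounding down. The sketch below assumes that reading; the rounding rule is inferred from the listed values, and the function name is hypothetical.

```python
def log_rank(frequency):
    """Return floor(log2(frequency)) for a positive integer frequency."""
    if frequency < 1:
        raise ValueError("frequency must be a positive integer")
    # For positive integers, bit_length() - 1 equals floor(log2(n)) exactly,
    # which avoids floating-point rounding issues.
    return frequency.bit_length() - 1


frequencies = {"塑膠原料的": 10, "塑膠原料": 5, "原料的": 3, "的生產": 2, "塑膠": 4}
scores = {pattern: log_rank(freq) for pattern, freq in frequencies.items()}
```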
Consider the scores of outer patterns
Equation of AV
AV(S) = \min(L_{av}(S), R_{av}(S))
L_{av}(S): number of the distinct predecessors of S
R_{av}(S): number of the distinct successors of S
AV(開發與法制)
AV(的開發與法制),
AV(是開發與法制),
AV(有開發與法制),
AV(開發與法制的),
AV(開發與法制是),
AV(開發與法制為)
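The accessor variety of a string can be counted directly from a corpus. The sketch below assumes the common definition AV(S) = min(L_av(S), R_av(S)) and, as a simplification, ignores occurrences at sentence boundaries. The toy sentences and the function name are hypothetical.

```python
def accessor_variety(sentences, s):
    """Return (L_av, R_av, AV) of string s over a list of sentences."""
    left, right = set(), set()
    for sentence in sentences:
        start = sentence.find(s)
        while start != -1:
            end = start + len(s)
            if start > 0:
                left.add(sentence[start - 1])   # distinct predecessor
            if end < len(sentence):
                right.add(sentence[end])        # distinct successor
            start = sentence.find(s, start + 1)
    return len(left), len(right), min(len(left), len(right))


# Toy sentences (an assumption) reusing the outer patterns listed above.
sentences = ["的開發與法制的", "是開發與法制是", "有開發與法制為"]
l_av, r_av, av = accessor_variety(sentences, "開發與法制")
```

Here the predecessors are 的, 是, 有 and the successors are 的, 是, 為, so both counts and the resulting AV are 3.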
Pattern: 塑膠原料的, logarithm ranking mechanism score floor(log2(10)) = 3
Character   6-tag label   Label with score
塑          B1            3B1
膠          B2            3B2
原          B3            3B3
料          M             3M
的          E             3E
Scores are also used for filtering overlapping patterns
Character   TCB feature
塑          B1
膠          B2
原          B3
料          M
的          E
生          -1
產          -1
"塑膠原料的" (score 3) conflicts with "的生產" (score 1), so "的生產" is labeled as unseen (-1)
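The conflict example above can be reproduced with a score-based greedy filter: higher-scoring patterns claim their characters first, and any pattern that overlaps an already claimed character is dropped. This is a sketch inferred from the example, not necessarily the authors' selection procedure (the slides elsewhere mention forward maximum matching). The tie-breaking rule and the function names are assumptions.

```python
def six_tags(length):
    """6-tag labels for a pattern of the given length."""
    if length == 1:
        return ["S"]
    return ["B1", "B2", "B3"][: length - 1] + ["M"] * max(0, length - 4) + ["E"]


def tcb_labels(sentence, scores):
    """Label characters with non-overlapping patterns; unseen ones get -1."""
    matches = []
    for pattern, score in scores.items():
        start = sentence.find(pattern)
        while start != -1:
            matches.append((score, start, pattern))
            start = sentence.find(pattern, start + 1)
    # Assumed order: higher score first, then longer pattern, then leftmost.
    matches.sort(key=lambda m: (-m[0], -len(m[2]), m[1]))
    labels = ["-1"] * len(sentence)
    for score, start, pattern in matches:
        span = range(start, start + len(pattern))
        if all(labels[i] == "-1" for i in span):   # no conflict with earlier picks
            for i, tag in zip(span, six_tags(len(pattern))):
                labels[i] = tag
    return labels


scores = {"塑膠原料的": 3, "塑膠原料": 2, "原料的": 1, "的生產": 1, "塑膠": 2}
labels = tcb_labels("塑膠原料的生產", scores)
```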
Term   Label
反     5S 3B1 4B1
而     6S 3E 4B2
會     6S 4E
欲     4S
速     4S
則     6S 3B1
不     7S 3E
達     5S 3E
Unsupervised Feature Selection
Input   1 char   2 char   3 char   4 char   5 char
反      5S       3B1      4B1      0B1      0B1
而      6S       3E       4B2      0B2      0B2
會      6S       0E       4E       0B3      0B3
欲      4S       0E       0E       0E       0M
速      4S       0E       0E       0E       0E
則      6S       3B1      0E       0E       0E
不      7S       3E       0E       0E       0E
達      5S       3E       0E       0E       0E
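One way to build such per-length feature columns while keeping overlapping information is to let every scored n-gram that covers a character contribute a "score + tag" label to the column for its length. This is a sketch under assumptions: the score table is a toy example, unseen cells are written as -1, and the function names are hypothetical.

```python
def six_tags(length):
    """6-tag labels for a pattern of the given length."""
    if length == 1:
        return ["S"]
    return ["B1", "B2", "B3"][: length - 1] + ["M"] * max(0, length - 4) + ["E"]


def overlapping_features(sentence, scores, max_len=5):
    """Return {n: one feature string per character} for n = 1..max_len."""
    columns = {n: [[] for _ in sentence] for n in range(1, max_len + 1)}
    for n in range(1, max_len + 1):
        tags = six_tags(n)
        for start in range(len(sentence) - n + 1):
            gram = sentence[start:start + n]
            if gram in scores:
                # Every character of the n-gram receives score + tag.
                for offset, tag in enumerate(tags):
                    columns[n][start + offset].append(f"{scores[gram]}{tag}")
    # Overlapping labels are concatenated; empty cells become -1.
    return {
        n: ["".join(cell) if cell else "-1" for cell in cells]
        for n, cells in columns.items()
    }


# Toy scores (an assumption): two overlapping 2-character patterns.
features = overlapping_features("則不達", {"則不": 3, "不達": 3})
```

In this toy case 不 is both the end of 則不 and the beginning of 不達, so its 2-char cell holds the two labels together.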
Keeping overlapping labels, the 2-char cell of 而 holds 3E0B1; all other cells of the table are unchanged
Keeping overlapping labels, the 2-char cell of 不 holds 3E3B1; all other cells of the table are unchanged
Character-based N-gram extracted by SRILM
Keeping overlapping information
Unsupervised Feature Selection
Input   1 char   2 char   3 char   4 char   5 char
戲      3S       1E       2B1      1B1      1B1
劇      4S       2B1      2B2      1B2      1B2
性      6S       3B1      2E       1B3      1B3
的      1S       3E       1E       1E       1M
結      2S       2E       1E       1E       1E
果      5S       2B1      1E       1E       1E
Using Frequent String from YASA
Selected by forward maximum matching algorithm
Character   TCB feature
塑          B1
膠          B2
原          B3
料          M
的          E
生          -1
產          -1
Using Frequent String from YASA
Keep Overlapping information
Converting score from frequent string
Unsupervised Feature Selection
Input   1 char   2 char   3 char   4 char   5 char
欲      4S       0E       0E       0E       0M
速      4S       0E       0E       0E       0E
則      6S       3B1      0E       0E       0E
不      7S       3E       0E       0E       0E
達      5S       3E       0E       0E       0E
Using SRILM to generate N-Grams
Measure how likely a substring is a Chinese word
Using logarithm ranking mechanism
Unsupervised Feature Selection
Input   1 char   2 char   3 char   4 char   5 char
塑      3B1      2B1      -1       2B2      -1
膠      3B2      2B2      -1       2E       -1
原      3B3      2M       1B1      2B2      -1
料      3M       2E       1B2      2E       -1
的      3E       -1       1E       -1       3S
生      -1       -1       1B2      0E       -1
產      -1       -1       1E       0E       -1
Compound AVS and TCB/TCF
Unsupervised Feature Selection
Input   1 char   2 char   3 char   4 char   5 char   TCB
塑      3B1      2B1      -1       2B2      -1       B1
膠      3B2      2B2      -1       2E       -1       B2
原      3B3      2M       1B1      2B2      -1       B3
料      3M       2E       1B2      2E       -1       M
的      3E       -1       1E       -1       3S       E
生      -1       -1       1B2      0E       -1       -1
產      -1       -1       1E       0E       -1       -1
Feature    Overlapping   Labeled score
6-Tag      -             None
AVS        V             AV score
CNG        V             N-Gram score
TCB        -             None
TCF        V             Frequent String score
AVS+TCB    AVS           AV score
AVS+TCF    V             Frequent String score, AV score
Undirected graphical models trained to maximize the conditional probability of the label sequence Y given the observation sequence X
Feature instances are generated from a template file
Feature template
Feature        Function
C-1, C0, C1    Previous, current, or next token
C-1C0          Previous and current tokens
C0C1           Current and next tokens
C-1C1          Previous and next tokens
Example sentence: 欲速則不達
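The template rows can be read as instructions for building feature instances at each position of the example sentence. The sketch below only illustrates that expansion: the padding symbol `<B>` and the function name are hypothetical, and it does not reproduce the template syntax of any particular CRF toolkit.

```python
BOUNDARY = "<B>"  # hypothetical padding symbol for positions outside the sentence


def template_features(tokens, i):
    """Expand the six templates of the table at position i."""
    def c(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else BOUNDARY

    return {
        "C-1": c(-1),              # previous token
        "C0": c(0),                # current token
        "C1": c(1),                # next token
        "C-1C0": c(-1) + c(0),     # previous and current tokens
        "C0C1": c(0) + c(1),       # current and next tokens
        "C-1C1": c(-1) + c(1),     # previous and next tokens
    }


tokens = list("欲速則不達")
features = template_features(tokens, 2)   # position of 則
first = template_features(tokens, 0)      # position of 欲, needs padding
```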
Data set
◦ Academia Sinica (AS)
◦ City University of Hong Kong (CityU)
◦ Microsoft Research (MSR)
◦ Peking University (PKU)
P = \frac{\text{the number of words that are correctly segmented}}{\text{the number of words that are segmented}}
R = \frac{\text{the number of words that are correctly segmented}}{\text{the number of words in the gold standard}}
F = \frac{2 \times P \times R}{P + R}
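These measures can be computed by comparing word spans, so that a word counts as correct only when both of its boundaries match the gold standard. A minimal sketch; the example segmentation and the function names are hypothetical.

```python
def word_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(gold_words, predicted_words):
    """Return (P, R, F) of a predicted segmentation against the gold standard."""
    gold, predicted = word_spans(gold_words), word_spans(predicted_words)
    correct = len(gold & predicted)        # words with both boundaries right
    p = correct / len(predicted)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["反而", "會", "欲速則不達"]
predicted = ["反而", "會", "欲速", "則", "不達"]   # hypothetical system output
p, r, f = precision_recall_f(gold, predicted)
```

In this example two of the five predicted words are correct and two of the three gold words are found, giving P = 0.4, R = 2/3, and F = 0.5.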
R_{OOV} = \frac{\text{the number of OOV words that are correctly segmented}}{\text{the number of OOV words in the gold standard}}
\text{Rank Score} = \frac{1}{\text{Number of datasets}} \sum_{i=1}^{\text{Number of datasets}} \frac{1}{rank_i}
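Both measures are simple to compute once segmentations and per-dataset ranks are available. The sketch below assumes that a word is out-of-vocabulary when it is absent from a given training vocabulary and that the rank score is the mean of reciprocal ranks over the datasets. The vocabulary, the ranks, and the function names are made up for illustration.

```python
def word_spans(words):
    """Map (start, end) character spans to the words of a segmentation."""
    spans, start = {}, 0
    for word in words:
        spans[(start, start + len(word))] = word
        start += len(word)
    return spans


def oov_recall(gold_words, predicted_words, vocabulary):
    """Share of gold OOV words whose boundaries are both predicted correctly."""
    predicted = set(word_spans(predicted_words))
    oov = [span for span, word in word_spans(gold_words).items()
           if word not in vocabulary]
    if not oov:
        return 0.0
    return sum(1 for span in oov if span in predicted) / len(oov)


def rank_score(ranks):
    """Mean reciprocal rank over the datasets."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


gold = ["反而", "會", "欲速則不達"]
predicted = ["反而", "會", "欲速", "則", "不達"]    # hypothetical system output
r_oov = oov_recall(gold, predicted, vocabulary={"會"})
score = rank_score([1, 2, 1, 4])                   # hypothetical ranks on four datasets
```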
[Figure: F1 on the AS, CityU, MSR, and PKU data sets; vertical axis from 0.92 to 0.98]
[Figure: rank score (MRR) of F1; vertical axis from 0 to 1]
[Figure: recall of out-of-vocabulary words (ROOV) on the AS, CityU, MSR, and PKU data sets; vertical axis from 0.66 to 0.8]
[Figure: rank score (MRR) of ROOV; vertical axis from 0 to 0.8]
Feature collections that contain AVS obtain better F1
TCB/TCF enhances the 6-tag approach on the recall of out-of-vocabulary words
Only with high-quality features can overlapping labels keep useful information