Transcript ThuFit

The 1st China Big Data Technology Innovation and Entrepreneurship Competition
Keyword Industry Classification
Team ThuFit: 周昕宇, 吴育昕, 任杰, 王禺淇, 罗鸿胤
Advisors: 方展鹏, 唐杰
Future Internet Interest Group, Tsinghua University
Task

Given:
 Partially labeled keywords
 The first 10 search results for each keyword
 Keyword-buyer relationships

Goal:
 Predict the classes of the unlabeled keywords
Data summary

keyword_class.txt
 10,787,584 keywords
 1,143,928 labeled (10.6%)
 9,963,062 unique keywords
 33 classes
 Keyword distribution: 11% labeled, 89% unlabeled

keyword_users.txt
 23,942,643 entries
 Each entry is a keyword-buyer pair

keyword_titles.txt
 21,575,166 entries, but only 10,787,583 are non-empty
 Each entry comprises a keyword and its first 10 Baidu search results
Approach

Preprocessing:
 Keyword segmentation

Feature extraction:
 Keyword segments
 Keyword-buyer relation
 Keyword-segment relation
 Search result utilization

Model:
 liblinear
Keyword segmentation

 Keyword w
 Segment:
   A sub-string of a keyword
   A semantic unit
 Segmentation: break a keyword into a set of segments
 Two ways:
   Exact segmentation: 清华大学 => 清华/大学
     cut(w) = [s_0, s_1, s_2, ..., s_{n-1}]
   Full segmentation: 清华大学 => 清华/大学/华大/清华大学
     segment(w) = {s_0, s_1, s_2, ..., s_{n-1}}
 Jieba Chinese word segmentation: https://github.com/fxsjy/jieba
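As an illustration of the two modes, here is a minimal pure-Python sketch using a tiny hypothetical dictionary in place of jieba's lexicon (jieba itself resolves ambiguity with a probabilistic DAG, which this greedy sketch does not reproduce):

```python
# Toy hypothetical dictionary standing in for jieba's built-in lexicon.
DICT = {"清华", "大学", "华大", "清华大学"}

def full_segment(w):
    """segment(w): the SET of all dictionary words occurring anywhere in w."""
    n = len(w)
    return {w[i:j] for i in range(n) for j in range(i + 1, n + 1) if w[i:j] in DICT}

def cut(w):
    """cut(w): an ordered LIST, one greedy longest-match pass left to right.
    Note: with the whole word in the dictionary this keeps 清华大学 intact,
    whereas jieba's probabilistic cut yields 清华/大学."""
    out, i = [], 0
    while i < len(w):
        for j in range(len(w), i, -1):   # try the longest match starting at i
            if w[i:j] in DICT:
                out.append(w[i:j]); i = j; break
        else:                            # no dictionary match: emit one char
            out.append(w[i]); i += 1
    return out
```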

Feature Extraction - Segments

Sparse representation of segments:
 Smoothed TFIDF-based features
 N-grams
 "End-grams"
Feature Extraction - TFIDF

Just on this page: segment s = term t
 W = {w_0, w_1, w_2, ..., w_{m-1}}
 tf(t, w) = |{t' | t' ∈ document(w), t' = t}|
 df(t) = |{w ∈ W | t ∈ document(w)}|
 tf'(t, w) = 1 + log tf(t, w)
 idf(t) = log(|W| / (1 + df(t)))
 tfidf(t, w) = tf'(t, w) · idf(t)

The definition of document(w) will be given later.
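The smoothed scheme above can be sketched directly, treating each keyword's document as a plain list of terms:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: {keyword: list of terms}. Returns {keyword: {term: weight}}
    using the smoothed scheme from the slides:
      tf'(t, w) = 1 + log tf(t, w),   idf(t) = log(|W| / (1 + df(t)))."""
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))            # document frequency: one count per doc
    n = len(documents)
    vecs = {}
    for w, terms in documents.items():
        tf = Counter(terms)
        vecs[w] = {t: (1 + math.log(tf[t])) * math.log(n / (1 + df[t])) for t in tf}
    return vecs
```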
Feature Extraction - N-gram

N-gram:
 Captures some structural information

Recall: there are two ways of segmenting a keyword:
 segment(w) = {s_0, s_1, s_2, ..., s_{n-1}}, a set
 cut(w) = [s_0, s_1, s_2, ..., s_{m-1}], an ordered list <- adopt this one

2-gram:
 2gram(w) = {s_i . s_{i+1} | s_i ∈ cut(w), i ∈ {0, 1, ..., |cut(w)| - 2}}
 S = ∪_{w ∈ W} 2gram(w)
 W' = W ∪ S

Limitations:
 A large character set produces a very large term set
 Noise

Reduced 2-gram:
 S' = {s ∈ S | s appeared more than 5 times}
 W' = W ∪ S'
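A minimal sketch of 2-gram construction and the frequency cutoff (the `min_count=5` threshold mirrors the "more than 5 times" rule above):

```python
from collections import Counter

def two_grams(cut_w):
    """2gram(w): adjacent pairs s_i . s_{i+1} from the ordered cut list."""
    return [cut_w[i] + cut_w[i + 1] for i in range(len(cut_w) - 1)]

def reduced_vocab(cuts, min_count=5):
    """S': keep only 2-grams seen more than `min_count` times overall."""
    counts = Counter(g for c in cuts for g in two_grams(c))
    return {g for g, n in counts.items() if n > min_count}
```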
Feature Extraction - End-gram

End-gram:
 cut(w) = [s_0, s_1, s_2, ..., s_{m-1}]
 s_{m-1} is more likely to carry discriminative information
 Emphasize the last segment: append a character that does not appear in W, e.g. "漢"

Example:
 w = "rnu209e.tvp2轴承"
 cut(w) = ["rnu209e", "tvp2", "轴承"]
 endgram(w) = "轴承漢"
 W'' = W' ∪ {endgram(w)}

Similarly, we can define startgram(w).
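A sketch of the end-gram construction, plus the analogous start-gram; prepending the same marker for startgram is our assumption here, any character unused in W would do:

```python
# "漢" is assumed not to occur in any keyword, so the marker is unambiguous.
END_MARK = "漢"

def endgram(cut_w):
    """Tag the last segment so it becomes a feature distinct from the same
    segment occurring elsewhere in a keyword."""
    return cut_w[-1] + END_MARK if cut_w else ""

def startgram(cut_w):
    """Analogous marker on the first segment (the slides found it slightly
    hurts performance)."""
    return END_MARK + cut_w[0] if cut_w else ""
```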
Feature Extraction

 document(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)}

 Where is startgram(w)? Experiments showed that adding startgram(w) slightly degrades performance.
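Putting the pieces together, document(w) can be sketched as a term list; `seg_fn` and `cut_fn` are placeholders for whichever segmenter is in use (e.g. jieba's full and exact modes):

```python
def document(w, seg_fn, cut_fn):
    """document(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)}, as a term list.
    seg_fn(w) -> set of segments (full mode); cut_fn(w) -> ordered list."""
    cut_w = cut_fn(w)
    grams = [cut_w[i] + cut_w[i + 1] for i in range(len(cut_w) - 1)]
    end = [cut_w[-1] + "漢"] if cut_w else []
    return list(seg_fn(w)) + grams + end
```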
Keyword-buyer/segment relation

[Diagram: a graph linking buyers (B0-B3), keywords (K0-K3), segments (S0-S3), and classes (C0-C3)]
Keyword-buyer/segment relation

[Diagram: first propagation step — labeled classes reach their directly connected nodes: K0: C2 | S0: C2 | B0: C2 C3 | S1: C3 | K3: C3 | S3: C2 C3; all other nodes empty]
Keyword-buyer/segment relation

[Diagram: second propagation step — K0: C2 C0 | S0: C2 | B0: C2 C3 | S1: C3 C3 | B1: C0 C3 | K2: C3 | S2: C0 | K3: C3 | S3: C2 C3 C0 C3; all other nodes empty]
Keyword-buyer relation

 Assumption: a buyer tends to buy keywords of similar classes.
 On the labeled data, obtain the distribution over classes of the keywords each buyer buys.
 Each buyer thus has a 33-dimensional feature vector.
 Each keyword's feature vector is the average over the feature vectors of the buyers that buy it.
 Using only this feature, we reach an accuracy of 0.82.
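A sketch of this feature, assuming `pairs` iterates the keyword_users.txt entries and `labels` maps labeled keywords to class ids 0..32:

```python
from collections import defaultdict

N_CLASSES = 33

def keyword_features(pairs, labels):
    """pairs: iterable of (keyword, buyer); labels: {keyword: class id} for
    the labeled subset. Each buyer gets the class distribution of the labeled
    keywords they bought; each keyword then averages its buyers' vectors."""
    pairs = list(pairs)
    buyer_dist = defaultdict(lambda: [0.0] * N_CLASSES)
    buyer_cnt = defaultdict(int)
    for kw, b in pairs:
        if kw in labels:
            buyer_dist[b][labels[kw]] += 1
            buyer_cnt[b] += 1
    for b in buyer_dist:                      # normalize to a distribution
        buyer_dist[b] = [x / buyer_cnt[b] for x in buyer_dist[b]]
    feats = defaultdict(lambda: [0.0] * N_CLASSES)
    n_buyers = defaultdict(int)
    for kw, b in pairs:
        if b in buyer_dist:
            feats[kw] = [a + d for a, d in zip(feats[kw], buyer_dist[b])]
            n_buyers[kw] += 1
    return {kw: [x / n_buyers[kw] for x in v] for kw, v in feats.items()}
```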
Keyword-buyer relation

[Diagram: buyer class distributions propagated back to keywords through keyword-buyer links]
Keyword-buyer relation

 We also tried modeling buyers by the segments of the keywords they bought, and modeling keyword-keyword relationships through their common connections with segments.
 Buyer -> Keyword -> Segment => Buyer -> Segment
 We further introduced higher-order relational influence between buyers and keywords, but the improvements were subtle.
Keyword-segment relation

 Reverse the link between segments and keywords:
 Keyword -> Segment => Segment -> Keyword
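A sketch of the reversed direction: each segment collects the class distribution of the labeled keywords containing it, and a keyword averages the distributions of its known segments (`segment_fn` is a placeholder for whichever segmenter is in use):

```python
from collections import defaultdict

def segment_class_dist(labeled, segment_fn, n_classes=33):
    """Segment -> Keyword direction: each segment accumulates the class
    distribution of the labeled keywords it appears in."""
    dist = defaultdict(lambda: [0.0] * n_classes)
    cnt = defaultdict(int)
    for kw, c in labeled.items():
        for s in segment_fn(kw):
            dist[s][c] += 1
            cnt[s] += 1
    return {s: [x / cnt[s] for x in v] for s, v in dist.items()}

def keyword_feature(kw, seg_dist, segment_fn, n_classes=33):
    """Average the class distributions of the keyword's known segments."""
    vecs = [seg_dist[s] for s in segment_fn(kw) if s in seg_dist]
    if not vecs:
        return [0.0] * n_classes
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```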
Keyword-segment relation

[Diagram: segment class distributions propagated back to keywords through keyword-segment links]
Search Result Utilization

Some weird keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/:
 1-1828169-5: 1 1828169 5
 1-1838143-0: 1 1838143 0

Their search results (excerpt):
 1-1838143-0 1-1838143-0全国供货商【IC37旗下站】1-18381430价格|PDF ... IC芯片1-1838143-0品牌、价格、PDF参数 - 电子产品资料 - 买卖IC网 PIC16C57-XT/SP145的IC、二极管、三极管查询,采购PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0 2000pcs 1005+ 现货 泰科Tyco431829-1集成电路、连接器、接插件 AMP欧式背板连接器崧晔达_达价格_优质崧晔达批发/采购 - 阿里巴巴 供应聚氯乙烯_连接器_供应聚 崧晔达价格_优质崧晔达批发/采购 - 阿里巴巴 供应聚氯乙烯_连接器_供应聚氯乙烯批发_供应聚氯乙烯供应_阿里巴巴 上海金庆电子技术有限公司 限位开关12 福州福铭仪器
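Detecting such keywords with the pattern above is straightforward:

```python
import re

# Pattern from the slides: keywords made only of ASCII letters, digits, - and _.
WEIRD = re.compile(r"^[0-9a-zA-Z\-_]{1,}$")

def is_weird(kw):
    """True for serial-number-like keywords with no inherent semantics."""
    return WEIRD.match(kw) is not None
```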
Search Result Utilization

 For normal keywords, the keyword itself carries semantic meaning.
 Keywords with little semantic content are usually product serial numbers or domain-specific terminology, e.g. chemical element names.
 This supplementary information yields more accurate results on "weird" keywords.
 However, such keywords did not seem to be included in the online test.
Search Result Utilization

Recall:
 document(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)}

If we add one more term:
 document'(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)} ∪ segment(S(w))
 where S(w) is the search result of w

Performance decreased due to the noise introduced.
Example

 w = "hj系列双锥混合机市场调查报告"
 S(w) = "混合设备 HJ系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 ...混合机-供应HJ系列双锥混合机-混合机尽在阿里巴巴-常州欧朋干燥... HJ系列双锥混合机厂家_价格-食品机械行业网 HJ系列双锥混合机供应信息,常州市步群干燥设备有限公司 HJ系列双锥混合机_百度百科 HJ系列双锥混合机 - 常州普耐尔干燥设备有限公司 HJ系列双锥混合机价格(江苏 常州)-盖德化工网..."
Feature Statistics

 Dimensionality: 200,000
 Lower dimensionality yields better generalization.
Implementation
Life is short, you need Python
Model

 Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
 A Library for Large Linear Classification
 L2-regularized logistic regression
 33 one-vs-all classifiers, one per class
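Liblinear handles the actual training; as a self-contained stand-in, one-vs-all logistic regression can be sketched with plain SGD (dense toy features, no L2 penalty, unlike liblinear's solvers):

```python
import math

def train_ova(X, y, n_classes, epochs=20, lr=0.5):
    """One-vs-all logistic regression trained with plain SGD; an
    illustrative stand-in for liblinear's large-scale solvers."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for c in range(n_classes):               # one binary classifier per class
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                z = sum(w * x for w, x in zip(W[c], xi))
                p = 1.0 / (1.0 + math.exp(-z))       # sigmoid
                g = p - (1.0 if yi == c else 0.0)    # log-loss gradient factor
                W[c] = [w - lr * g * x for w, x in zip(W[c], xi)]
    return W

def predict(W, x):
    """Pick the class whose binary classifier scores highest."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)
```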
Experiments and Results

 We split the labeled data into a training set and a validation set.
 All following results are local (validation) results.
 Online test results are higher because more training data was used.
 Due to the complexity of migrating our code to the Hadoop platform (mainly because we used third-party non-Java libraries), not all of the features above were employed in our final submission.
Experiments and Results

Feature vector constituents                                  Accuracy
Keyword-buyer relation                                       0.8194
Keyword-segment relation                                     0.9019
Keyword-buyer + (segment(w) + TFIDF)                         0.9537
segment(w) + TFIDF                                           0.9656
segment(w) + 2gram(w) + TFIDF                                0.9635
segment(w) + 2gram(w) + endgram(w) + TFIDF                   0.9725
segment(w) + 2gram(w) + startgram(w) + endgram(w) + TFIDF    0.9713
Limitations

 Two types of features:
 Relation features:
   Utilize prior knowledge of class label information
   Low dimension
   May be biased toward the training data
 TFIDF features:
   No class label information utilized
   High dimension
   Robust, good generalization ability
 A simple combination of the two does not work well.
 Ensemble methods may work around this problem.
Thanks!