ThuFit
The 1st China Big Data Technology Innovation and Entrepreneurship Competition
Keyword Industry Classification
Team ThuFit: 周昕宇, 吴育昕, 任杰, 王禺淇, 罗鸿胤
Advisors: 方展鹏, 唐杰
Tsinghua University, Future Internet Interest Group
Task
Given:
Partially labeled keywords
First 10 search results for each keyword
Keyword-buyer relationship
Goal:
Predict unlabeled keywords
Data summary
keyword_class.txt
  10,787,584 keywords (9,963,062 unique)
  1,143,928 labeled, 10.6%
  33 classes
  [Pie chart: keyword distribution, roughly 11% labeled vs. 89% unlabeled]
keyword_users.txt
  23,942,643 entries
  Each entry is a keyword-buyer pair
keyword_titles.txt
  21,575,166 entries, but only 10,787,583 entries are non-empty
  Each entry comprises a keyword and its first 10 Baidu search results
Approach
Preprocessing:
Feature Extraction:
Keyword segmentation
Keyword segment
Keyword-buyer relation
Keyword-segment relation
Search result utilization
Model:
liblinear
Keyword segmentation
Keyword w
Segment
  A sub-string of a keyword
  A semantic unit
Segmentation
  Break a keyword into a set of segments
Two ways:
  Exact segmentation
    清华大学 => 清华/大学
    cut(w) = [s_0, s_1, s_2, ..., s_{n-1}]
  Full segmentation
    清华大学 => 清华/大学/华大/清华大学
    segment(w) = {s_0, s_1, s_2, ..., s_{n-1}}
Jieba Chinese word segmentation: https://github.com/fxsjy/jieba
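The two segmentation modes can be illustrated with a toy sketch (the dictionary below is hypothetical; jieba does this with its full dictionary and smarter matching):

```python
# Toy sketch of full segmentation (not jieba itself): with a small
# hypothetical dictionary, segment(w) is the set of all dictionary
# words occurring as substrings of w.
DICT = {"清华", "大学", "华大", "清华大学"}

def full_segment(w):
    """Return every dictionary word appearing as a substring of w."""
    return {w[i:j] for i in range(len(w))
            for j in range(i + 1, len(w) + 1) if w[i:j] in DICT}

print(sorted(full_segment("清华大学")))
```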
Feature Extraction - segment
Sparse representation of segments
Smoothed TF-IDF-based features
N-gram
“End-gram”
Feature Extraction - TFIDF
Just on this page: segment s = term t
W = {w_0, w_1, w_2, ..., w_{m-1}}
tf(t, w) = |{t' | t' ∈ document(w), t' = t}|
df(t) = |{w ∈ W | t ∈ document(w)}|
tf'(t, w) = 1 + log tf(t, w)
idf(t) = log(|W| / (1 + df(t)))
tfidf(t, w) = tf'(t, w) · idf(t)
Definition of document(w) will be given later
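A minimal sketch of this smoothed TF-IDF (the corpus and document contents are hypothetical toy data):

```python
import math
from collections import Counter

# Hypothetical toy corpus: keyword -> document(w), a list of terms.
docs = {
    "w0": ["清华", "大学", "清华"],
    "w1": ["大学", "排名"],
    "w2": ["轴承", "价格"],
}

# df(t): number of documents containing term t.
df = Counter(t for terms in docs.values() for t in set(terms))

def tfidf(t, w):
    tf = docs[w].count(t)                    # tf(t, w)
    if tf == 0:
        return 0.0
    tf_s = 1 + math.log(tf)                  # tf'(t, w) = 1 + log tf(t, w)
    idf = math.log(len(docs) / (1 + df[t]))  # idf(t) = log(|W| / (1 + df(t)))
    return tf_s * idf
```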
Feature Extraction - N-gram
N-gram
  To capture some structural information
Recall: there are two ways of segmenting a keyword
  segment(w) = {s_0, s_1, s_2, ..., s_{n-1}}, a set
  cut(w) = [s_0, s_1, s_2, ..., s_{m-1}], an ordered list <- adopt this one
2-gram
  2gram(w) = {s_i · s_{i+1} | s_i ∈ cut(w), i ∈ {0, 1, ..., |cut(w)| - 2}}
  S = ∪_{w ∈ W} 2gram(w)
  W' = W ∪ S
Limitation
  A large character set produces a large 2-gram set
  Noise
Reduced 2-gram
  S' = {s ∈ S | s appeared more than 5 times}
  W' = W ∪ S'
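A sketch of 2gram(w) and the reduced variant (the cuts are toy data, and the frequency threshold is lowered from the slides' 5 to 2 so the toy example filters something):

```python
from collections import Counter

def two_gram(cut_w):
    """Concatenate adjacent segments of the ordered list cut(w)."""
    return [cut_w[i] + cut_w[i + 1] for i in range(len(cut_w) - 1)]

# Hypothetical cuts; keep only 2-grams seen at least `threshold` times.
cuts = [["清华", "大学"], ["清华", "大学"], ["轴承", "价格"]]
counts = Counter(g for c in cuts for g in two_gram(c))
reduced = {g for g, n in counts.items() if n >= 2}
print(reduced)  # {'清华大学'}
```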
Feature Extraction - End-gram
End-gram
  cut(w) = [s_0, s_1, s_2, ..., s_{m-1}]
  s_{m-1} is more likely to carry discriminative information
  Emphasis on the last segment: append a character that did not appear in W, e.g. "漢"
Example
  w = rnu209e.tvp2轴承
  cut(w) = ["rnu209e", "tvp2", "轴承"]
  endgram(w) = "轴承漢"
  W'' = W' ∪ {endgram(w)}
Another example: w = "hj系列双锥混合机市场调查报告"
Similarly we can define startgram(w)
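The end-gram (and the analogous start-gram) can be sketched as follows, using the slide's sentinel character "漢":

```python
# Tag the boundary segments of cut(w) with a sentinel character that
# appears nowhere in the keyword set W, so they hash to distinct features.
SENTINEL = "漢"

def endgram(cut_w):
    return cut_w[-1] + SENTINEL

def startgram(cut_w):          # defined analogously, on the first segment
    return SENTINEL + cut_w[0]

print(endgram(["rnu209e", "tvp2", "轴承"]))  # 轴承漢
```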
Feature Extraction
document(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)}
Where is startgram(w)?
  Experiments showed that adding startgram(w) slightly degrades performance.
Keyword-buyer/segment relation
[Figure, three slides: a graph linking buyers B0-B3, keywords K0-K3, segments S0-S3, and classes C0-C3; successive slides show class labels (e.g. K0: C2, B0: C2 C3, S3: C2 C3) propagating from labeled keywords through shared buyers and segments to unlabeled keywords.]
Keyword-buyer relation
Assumption: a buyer tends to buy keywords of similar classes
Obtain the distribution over classes of the keywords a buyer buys, from the labeled data
Each buyer thus has a 33-dimensional feature vector
For each keyword, its feature vector is the average of the feature vectors of the buyers that buy it
Using only this feature we get an accuracy of 0.82
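A minimal sketch of this buyer feature (the pairs, labels, and 3-class setup are toy placeholders for the 33-class competition data):

```python
from collections import defaultdict

# Hypothetical toy data: (keyword, buyer) pairs and partial keyword labels.
pairs = [("k0", "b0"), ("k1", "b0"), ("k2", "b1"), ("k1", "b1")]
labels = {"k0": 2, "k2": 0}           # class ids for the labeled subset
N_CLASSES = 3                         # 33 in the competition data

# 1. Per-buyer class distribution over the labeled keywords they bought.
buyer_dist = defaultdict(lambda: [0.0] * N_CLASSES)
buyer_count = defaultdict(int)
for kw, buyer in pairs:
    if kw in labels:
        buyer_dist[buyer][labels[kw]] += 1
        buyer_count[buyer] += 1
for buyer in buyer_dist:
    buyer_dist[buyer] = [c / buyer_count[buyer] for c in buyer_dist[buyer]]

# 2. Keyword feature = average of its buyers' distributions.
def keyword_feature(kw):
    vecs = [buyer_dist[b] for k, b in pairs if k == kw and b in buyer_dist]
    if not vecs:
        return [0.0] * N_CLASSES
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(keyword_feature("k1"))  # [0.5, 0.0, 0.5]
```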
Keyword-buyer relation
We also tried modeling buyers by the segments of the keywords they bought, and modeling keyword-keyword relationships by exploiting their common connections with segments.
Buyer -> Keyword -> Segment => Buyer -> Segment
We further introduced higher-order relational influence between buyers and keywords, but the improvements were subtle.
Keyword-segment relation
Reverse the link between segments and keywords:
Keyword -> Segment => Segment -> Keyword
Search Result Utilization
Some "weird" keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/
1-1828169-5: 1 1828169 5
1-1838143-0: 1 1838143 0
Their search results:
1-1838143-0 1-1838143-0全国供货商【IC37旗下站】1-18381430价格|PDF ... IC芯片1-1838143-0品牌、价格、PDF参数 - 电子产
品资料 - 买卖IC网 PIC16C57-XT/SP145的IC、二极管、三极管查
询,采购PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0
2000pcs 1005+ 现货 泰科Tyco431829-1集成电路、连接器、接插件
AMP欧式背板连接器崧晔达_达价格_优质崧晔达批发/采购 - 阿里巴
巴
供应聚氯乙烯_连接器_供应聚 崧晔达价格_优质崧晔达批发/采
购 - 阿里巴巴
供应聚氯乙烯_连接器_供应聚氯乙烯批发_供应聚
氯乙烯供应_阿里巴巴 上海金庆电子技术有限公司
限位开关12
福州福铭仪器
Search Result Utilization
For normal keywords, the keyword itself carries semantic meaning.
Keywords with less semantic information are usually product serial numbers or domain-specific terminology, e.g. chemical element names.
This supplementary information yields more accurate results on "weird" keywords.
But these keywords did not seem to be included in the online test.
Search Result Utilization
Recall:
document(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)}
If we add one more term:
document'(w) = segment(w) ∪ 2gram(w) ∪ {endgram(w)} ∪ segment(S(w))
where S(w) is the search result of w
Performance decreased due to the noise introduced.
Example
𝑤 = “hj系列双锥混合机市场调查报告”
𝑆 𝑤 =“混合设备 HJ系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 ...混合机-供应HJ系列双锥混合机-混合机尽在阿里巴巴-常州欧朋干燥... HJ
系列双锥混合机厂家_价格-食品机械行业网
HJ系列双锥混合机供应信息,常州市步群干燥设备有限公司 HJ系列双锥
混合机_百度百科
HJ系列双锥混合机 - 常州普耐尔干燥设备有限公
司
HJ系列双锥混合机价格(江苏 常州)-盖德化工网...”
Feature Statistics
Dimensionality: 200,000
Lower dimensionality gives better generalization ability.
Implementation
Life is short, you need Python
Model
Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
A Library for Large Linear Classification
L2-loss logistic regression
33 one-vs-all classifiers, one per class.
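The model setup can be sketched with scikit-learn's liblinear-backed logistic regression (the team called liblinear directly; the data below is random placeholder, not the competition features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 50))           # sparse TF-IDF vectors in practice
y = rng.integers(0, 33, size=200)   # 33 keyword classes

# L2-regularized logistic regression; the liblinear solver trains
# one-vs-rest binary classifiers, one per class.
clf = LogisticRegression(solver="liblinear", penalty="l2")
clf.fit(X, y)
pred = clf.predict(X[:5])
```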
Experiments and Results
We split the labeled data into training and validation sets.
All of the following results are local results; online test results are higher because more training data is used.
Due to the complexity of migrating our code to the Hadoop platform (mainly because we used third-party non-Java libraries), not all of the features above were employed in our final submission.
Experiments and Results

Feature vector constituents                                  Accuracy
Keyword-buyer relation                                       0.8194
Keyword-segment relation                                     0.9019
Keyword-buyer + (segment(w) + TFIDF)                         0.9537
segment(w) + TFIDF                                           0.9656
segment(w) + 2gram(w) + TFIDF                                0.9635
segment(w) + 2gram(w) + endgram(w) + TFIDF                   0.9725
segment(w) + 2gram(w) + startgram(w) + endgram(w) + TFIDF    0.9713
Analysis
Limitations
Two types of features
Relation features:
  Utilize prior knowledge of class label information
  Low dimension
  May be biased toward the training data
TFIDF features:
  No class label information utilized
  High dimension
  Robust, good generalization ability
But a simple combination of the two does not work well
Ensemble methods may work around this problem
Thanks!