Re-organization of IR/CSC team Hongchao He Guihong Cao

Re-organization of IR/CSC team

Hongchao He
Conf. follow up TREC-10, NTCIR
 Paper follow up ICCLP, SIGIR paper


Guihong Cao


Yang Wen


MSKK-III – Clustering for technique transfer
MSKK-III – Distance word dependency
Min Zhang

MSKK/CSC – Entropy based pruning for applications of (Pinyin/Hiragana) input system
Chinese Spelling Checking
(or, the Big CSC)
Jianfeng Gao
NLC Group, MSRCN
Outline
Introduction
 Chinese spelling checking
 Our approach
 Key techniques and experiments
 Milestones

Introduction
Goal: Automatically correct Chinese spelling errors in text entered with the MS-Pinyin (MSPY) input system
 Chinese spelling errors using the MS-Pinyin input system
 Chinese spelling error patterns
 English spelling checking
 Why is CSC difficult?
Chinese spelling errors using MSPY
Text in the brain => Syllable => Key stroke (typing) => Converted text
 Text in the brain => Syllable: Pinyin (phonetic) errors
 Syllable => Key stroke: Typographic errors
 Key stroke => Converted text: System errors
Chinese spelling error patterns

Substitution errors
 Pinyin error
 System error (includes Pinyin errors in some systems)

Non-substitution errors => word segmentation errors

Typographic errors – insertion/deletion/transposition
English spelling checking

Non-word error detection (“the” => “hte”)
 N-gram (letter) analysis
 Dictionary lookup

Real-word error detection (“from” => “form”)
 NLP – parser driven
 Statistical approach – data/error driven
  Local – n-gram language model, depends on a pre-defined confusion set
  Global – Winnow, Bayesian, TBL, etc.
  Problem – lack of error detection
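
The "Local" statistical approach above can be sketched as follows. This is a minimal illustration, not the method used in any system named in these slides: the toy corpus, the confusion set, the add-one smoothing and the function names are all assumptions made for the example. Each word that belongs to a pre-defined confusion set is replaced by whichever member of the set an n-gram (here bigram) model scores highest in context.

```python
import math
from collections import Counter

# Toy training text and confusion set: illustrative assumptions only.
CORPUS = ("i got a letter from my friend . please fill in the form . "
          "a gift from home").split()
CONFUSION_SETS = [{"from", "form"}]

unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB_SIZE = len(unigrams)

def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB_SIZE))

def local_score(tokens, i, candidate):
    """Score `candidate` at position i using its left and right bigrams."""
    score = 0.0
    if i > 0:
        score += bigram_logprob(tokens[i - 1], candidate)
    if i + 1 < len(tokens):
        score += bigram_logprob(candidate, tokens[i + 1])
    return score

def correct(tokens):
    """Replace each confusable word by the set member the bigram model prefers."""
    out = list(tokens)
    for i, tok in enumerate(out):
        for cset in CONFUSION_SETS:
            if tok in cset:
                out[i] = max(sorted(cset), key=lambda c: local_score(out, i, c))
    return out
```

On this toy data, `correct("a letter form my friend".split())` restores "from". The sketch also shows the limitation noted on the slide: words outside the confusion set are never examined, so there is no real error detection step.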

Why is CSC difficult?

Word segmentation
 Ambiguous
 OOV – Proper noun detection (personal name, location, organization, etc.)
 Segmentation error propagation

Non-word errors (in the English sense) do not exist
MSPY makes good use of the word trigram language model
Chinese spelling checking

CSC – related work

Template matching – long distance, e.g. <之所以> <是因为>
Pattern matching – long words (n>=3), e.g. 一文不明 => 一文不名, 忠耿耿 => 忠心耿耿
N-gram models – substitution errors
CSC – challenges




Long distance, coverage issue of template/pattern set
High-frequency confusion sets, e.g. {像，象} {在，再}
OOV, especially proper nouns
N-gram information has already been fully used by MSPY
Chinese spelling error patterns in MSPY

Proper noun
 Personal name
 Location
 Organization

Non-word errors: context independent
 Insertion/deletion/transposition/substitution
 E.g. 一文不明 => 一文不名, 忠耿耿 => 忠心耿耿

Real-word errors: context sensitive
 E.g. 像 => 象, 在 => 再, 实施 => 事实
Flowchart of our approach
Text with errors
 => Proper noun detection / Word segmentation
 => Word fuzzy matching: non-word error correction (trigger: single-char string, low prob)
 => Context sensitive disambiguation: real-word error correction
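
The trigger named in the flowchart (single-character string, low probability) might be sketched as below. This is a guess at one reasonable reading, not the actual implementation: the function name, the minimum run length of two and the log-probability threshold are all hypothetical. After segmentation, runs of consecutive single-character words and words with a very low language-model score are flagged and handed on to fuzzy matching.

```python
def find_triggers(words, logprobs, min_run=2, threshold=-8.0):
    """Return sorted indices of segmented words that look suspicious.

    words    -- output of word segmentation
    logprobs -- per-word n-gram log-probabilities (assumed to be given)
    A word is flagged if it is part of a run of at least `min_run`
    single-character words, or if its log-probability is below `threshold`.
    Both parameter values are illustrative assumptions.
    """
    flagged = set()
    run = []
    for i, word in enumerate(words + [""]):  # empty sentinel flushes the last run
        if len(word) == 1:
            run.append(i)
        else:
            if len(run) >= min_run:
                flagged.update(run)
            run = []
    flagged.update(i for i, lp in enumerate(logprobs) if lp < threshold)
    return sorted(flagged)
```

For example, if 忠耿耿 is segmented into three single characters, all three positions are flagged as one candidate span for fuzzy matching.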
Word segmentation and proper noun detection

Language model based word segmentation
Class-based language model
 P(W) = P_outside(W) × P_inside(W | <PN>)^a, a = ?
Outside probability – PN tagged training data
 Using NLPWIN to tag the corpus
 Filtering, rule base
 EM?
Inside probability – PN list training data
Using cache (or, dynamic dictionary)
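
A minimal sketch of how the class-based score above could be computed in log space. It assumes that the "a" in the formula is an exponent weighting the inside probability, that the outside model is a bigram model over words with each proper noun replaced by the class token <PN>, and that the inside model is a character unigram model trained on a PN list. All probabilities and names below are made-up toy values, not figures from the project.

```python
import math

# Hypothetical toy parameters, for illustration only.
OUTSIDE_BIGRAM = {("<s>", "<PN>"): 0.2, ("<PN>", "说"): 0.3, ("说", "</s>"): 0.5}
INSIDE_CHAR = {"王": 0.05, "小": 0.02, "明": 0.03}

def outside_logprob(tokens):
    """Log-probability of the class-level sequence (proper nouns shown as <PN>)."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(OUTSIDE_BIGRAM[(p, w)]) for p, w in zip(padded, padded[1:]))

def inside_logprob(name):
    """Log-probability of a proper noun's characters given the <PN> class."""
    return sum(math.log(INSIDE_CHAR[ch]) for ch in name)

def sentence_logprob(tokens, names, a=1.0):
    """log P(W) = log P_outside(W) + a * log P_inside(W | <PN>).

    tokens -- sentence with each proper noun replaced by "<PN>"
    names  -- the surface strings of those proper nouns
    a      -- weight on the inside model (its value is left open on the slide)
    """
    return outside_logprob(tokens) + a * sum(inside_logprob(n) for n in names)
```

With a below 1 the inside model is down-weighted, which makes the segmenter more willing to propose proper nouns; with a above 1 it becomes more conservative.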
Experiments and Findings





Measure: precision/recall – definition
Training data – People's Daily
Tag tool – NLPWIN
Test data – spec.
Results and Findings
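
The slide leaves the precision/recall definition open. One common definition, given here as an assumption rather than as the one actually used, compares the set of items the system proposes (for example detected proper noun spans) with a gold-standard set.

```python
def precision_recall(proposed, gold):
    """Precision = correct / proposed, recall = correct / gold.

    `proposed` and `gold` are collections of hashable items, e.g.
    (start, end, type) spans. The span representation is hypothetical.
    """
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For instance, a system that proposes four spans of which one is right, against a gold set of two spans, has precision 0.25 and recall 0.5.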
Long word fuzzy matching

Definition of Distance(s1, s2)
 Long word, n>=3
 Sum of character deletions/insertions/substitutions
Fast fuzzy matching
 Viterbi
 Simplified version
Search – error detection/correction
 Global – Lei Zhang’s ACL
 Local – trigger (single char, or low n-gram probability)
 Long word + Local matching
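
Distance(s1, s2), described above as the sum of character deletions, insertions and substitutions, corresponds to the standard edit (Levenshtein) distance. The sketch below uses plain dynamic programming rather than the Viterbi-based fast matching mentioned on the slide; the three-entry lexicon, the `max_dist` parameter and the function names are assumptions for illustration.

```python
def distance(s1, s2):
    """Minimum number of single-character deletions, insertions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

# Hypothetical lexicon of long words (n >= 3).
LEXICON = ["一文不名", "忠心耿耿", "之所以"]

def fuzzy_match(s, lexicon=LEXICON, max_dist=1):
    """Return (distance, word) pairs for long lexicon words close to, but not equal to, s."""
    hits = [(distance(s, w), w) for w in lexicon if len(w) >= 3]
    return sorted((d, w) for d, w in hits if 0 < d <= max_dist)
```

With this lexicon, `fuzzy_match("忠耿耿")` proposes 忠心耿耿 (one insertion) and `fuzzy_match("一文不明")` proposes 一文不名 (one substitution). Scanning the whole lexicon is linear in its size, which is why a faster search would be needed for a large lexicon.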
Experiments and Findings







Contact: 100 people, 3000 to 5000 characters/person
Error analysis
Algorithm …
Measure: precision/recall
Large lexicon, acquisition.
Trigger/threshold ?
Results and Findings
Context sensitive disambiguation

Building confusion set – specific to MSPY
Feature selection – Context vector
 Collocation – contiguous POS or words/characters
 Context words – words/characters within a K-size window
 Triple ?
Weighting scheme and Classifier
 Context Vector, TFIDF
 Winnow, Bayesian, TBL, etc.
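
As an illustration of the "context words within a K-size window" feature combined with a Bayesian classifier, here is a minimal naive Bayes sketch. It is only one of the classifier options listed above and is not claimed to be the one the project chose; the class name, the add-one smoothing and the four training sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def context_features(tokens, i, k=2):
    """Words within a k-size window around position i (target word excluded)."""
    return tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]

class NaiveBayesDisambiguator:
    """Chooses among the members of one confusion set from context words."""

    def __init__(self, confusion_set, k=2):
        self.confusion_set = sorted(confusion_set)
        self.k = k
        self.prior = Counter()            # how often each member occurs
        self.feat = defaultdict(Counter)  # member -> context word counts
        self.vocab = set()

    def train(self, sentences):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.prior[tok] += 1
                    for f in context_features(tokens, i, self.k):
                        self.feat[tok][f] += 1
                        self.vocab.add(f)

    def predict(self, tokens, i):
        """Return the confusion-set member that best fits the context at position i."""
        feats = context_features(tokens, i, self.k)
        total = sum(self.prior.values())

        def score(c):
            denom = sum(self.feat[c].values()) + len(self.vocab)
            s = math.log((self.prior[c] + 1) / (total + len(self.confusion_set)))
            return s + sum(math.log((self.feat[c][f] + 1) / denom) for f in feats)

        return max(self.confusion_set, key=score)

# Toy, hand-made training sentences for the confusion set {在, 再}.
TRAIN = [["我", "在", "北京", "工作"], ["他", "在", "家", "休息"],
         ["请", "再", "说", "一遍"], ["明天", "再", "来", "一次"]]
model = NaiveBayesDisambiguator({"在", "再"})
model.train(TRAIN)
```

Calling `model.predict(["她", "再", "上海", "工作"], 1)` returns 在, because 工作 was seen near 在 in training. Feature pruning and a larger confusion set, as listed under scaling up, would fit into the `train` step.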
Scaling up



Enlarge confusion set
Feature pruning
Adaptation
Experiments and Findings
Measure: precision/recall
 Training data
 Test data (XXX confusion set)
 Results and Findings

Experiments and Findings

Current Work

Pseudo-training set based on MSPY IME
 Preliminary data processing (400M PD)
 Unigram error model (10,000 words useful)
  使是/69484 市/10289 诗/2394 ……
 Trigram error pattern (980,000 useful)
  共[度]难关=>渡 / 不够[英]，=>硬
Experiments based on basic approaches
 Pseudo-test set from 南方周末 (Southern Weekly)
 Continuous pair (Recall = 50%, Precision = 25%)
 Pattern Matching (??)
Future Work

Hybrid approaches


Pattern Clustering + Continuous pair
Functional words error detection
System evaluation – put it all together
Evaluation toolset
 Measure: precision/recall
 Training data
 Test data
 Results and Findings

Prototype
Demo …
 Online & offline CSC
 Right click

 Spelling error detection/correction
 Proper noun detection/correction
Assignment
Jianfeng Gao – overall, fuzzy matching
 Mu Li – context sensitive disambiguation
 Jian Sun – PN detection
 Yang Wen – system evaluation
 Yulin Kang – demo
 Lei Zhang – senior consultant

Milestones
Oct. 2001, Ming says “Yes” (TAB demo)
 Dec. 2001, Dong says “Yes” (Transfer)
 Aug. 2002, HJ says “Yes” (Party)

Information
Access at \\msrcn4p3\rootD\gaojf\spell
 Contact me if there are any problems
 Jianfeng Gao, Tel: 86-10-62617711-5778, Email: [email protected]