Re-organization of IR/CSC team Hongchao He Guihong Cao
Re-organization of IR/CSC team
Hongchao He
Conf. follow up TREC-10, NTCIR
Paper follow up ICCLP, SIGIR paper
Guihong Cao
Yang Wen
MSKK-III – Clustering for technique transfer
MSKK-III – Distance word dependency
Min Zhang
MSKK/CSC – Entropy based pruning for applications of (Pinyin/Hiragana) input system
Chinese Spelling Checking
(or, the Big CSC)
Jianfeng Gao
NLC Group, MSRCN
Outline
Introduction
Chinese spelling checking
Our approach
Key techniques and experiments
Milestone
Introduction
Goal: Automatically correct Chinese spelling errors made with the MS-Pinyin (MSPY) input system
Chinese spelling errors using MS-Pinyin input system
Chinese spelling error patterns
English spelling checking
Why is CSC difficult?
Chinese spelling errors using MSPY
Text in the brain → Syllable: Pinyin (phonetic) errors
Syllable → Key stroke (typing): Typographic errors
Key stroke → Converted text: System errors
Chinese spelling error patterns
Substitution errors
Pinyin error
System error (includes Pinyin error in some systems)
Non-substitution errors → word segmentation errors
Typographic errors – insertion/deletion/transposition
English spelling checking
Non-word error detection (“the” → “hte”)
N-gram (letter) analysis
Dictionary lookup
Real-word error detection (“from” → “form”)
NLP – parser driven
Statistical approach – data/error driven
Local – n-gram language model, depends on a pre-defined confusion set
Global – Winnow, Bayesian, TBL, etc.
Problem – lack of error detection
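To make the two English error classes concrete, here is a minimal sketch of dictionary lookup for non-word errors and of a pre-defined confusion set for real-word errors. The dictionary, the confusion set, and the function names are toy assumptions for illustration, not part of the original slides.

```python
def find_non_words(tokens, dictionary):
    """Return (index, token) pairs for tokens that are not in the dictionary."""
    return [(i, t) for i, t in enumerate(tokens) if t.lower() not in dictionary]


def confusion_candidates(token, confusion_sets):
    """Return the alternatives for a real word that belongs to a confusion set."""
    for group in confusion_sets:
        if token in group:
            return sorted(group - {token})
    return []


# Toy resources (assumed for this example only).
DICTIONARY = {"the", "letter", "came", "from", "form", "a", "friend"}
CONFUSION_SETS = [{"from", "form"}]

tokens = "hte letter came form a friend".split()
non_words = find_non_words(tokens, DICTIONARY)            # "hte" is flagged
alternatives = confusion_candidates("form", CONFUSION_SETS)  # "from" is proposed
```

Dictionary lookup flags "hte" but cannot see that "form" is wrong, which is why real-word errors need a confusion set plus some context model to choose among the alternatives.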
Why is CSC difficult?
Word segmentation
Ambiguous
OOV – Proper noun detection (personal name, location, organization, etc.)
Segmentation error propagation
Non-word errors (in the English sense) do not exist
MSPY makes good use of word trigram language model
Chinese spelling checking
CSC – related works
Template matching – long distance, e.g. <之所以> <是因为>
Pattern matching – long words (n>=3), e.g. 一文不明 → 一文不名, 忠耿耿 → 忠心耿耿
N-gram models – substitution errors
CSC – challenges
Long distance, coverage issue of template/pattern set
High-frequency confusion sets, e.g. {像,象} {在,再}
OOV, especially the proper nouns
N-gram: has already been fully used by MSPY
Chinese spelling error patterns in MSPY
Proper noun
Personal name
Location
Organization
Non-word errors: context independent
Insertion/deletion/transposition/substitution
E.g. 一文不明 → 一文不名, 忠耿耿 → 忠心耿耿
Real-word errors: context sensitive
E.g. 像 → 象, 在 → 再, 实施 → 事实
Flowchart of our approach
Text with errors
Proper noun detection
Word segmentation
Word fuzzy matching
Non-word error correction
Trigger: single-character string, low probability
Context-sensitive disambiguation
Real-word error correction
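The "single char string" trigger in the flowchart can be sketched as follows: after word segmentation, a run of consecutive single-character words is treated as a suspicious region to hand to fuzzy matching. The function name, the `min_run` threshold, and the example segmentation are assumptions; the low-probability trigger is not covered here.

```python
def single_char_runs(words, min_run=2):
    """Return (start, end) word-index spans (end exclusive) covering runs of
    at least min_run consecutive single-character words."""
    spans = []
    start = None
    for i, w in enumerate(words):
        if len(w) == 1:
            if start is None:
                start = i          # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                spans.append((start, i))
            start = None           # a multi-character word ends the run
    if start is not None and len(words) - start >= min_run:
        spans.append((start, len(words)))
    return spans


# Hypothetical segmenter output: 忠耿耿 is not in the lexicon, so it falls
# apart into single characters.
segmented = ["他", "忠", "耿", "耿", "地", "工作"]
suspicious = single_char_runs(segmented)
```

Only the flagged span needs the more expensive fuzzy search, which keeps the search local.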
Word segmentation and proper noun detection
Language model based word segmentation
Class-based language model
$P(W) = P_{\text{outside}}(W) \cdot P_{\text{inside}}(W \mid \langle\text{PN}\rangle)^{a}$, $a = ?$
Outside probability – PN tagged training data
Using NLPWIN to tag the corpus
Filtering, rule base
EM?
Inside probability – PN list training data
Using cache (or, dynamic dictionary)
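As a rough illustration of language model based segmentation with a proper-noun class, the sketch below runs a Viterbi search over a unigram model. It reads the slide's formula as P_outside times P_inside raised to the weight a, which is an assumption about the garbled exponent. All probabilities, the PN list entry, and the function name `segment` are invented toy values, and a unigram model stands in for whatever model the team actually used.

```python
import math


def segment(text, word_logp, pn_inside_logp, pn_outside_logp, a=1.0, max_len=4):
    """Viterbi search for the highest-scoring segmentation of text.

    Ordinary words score word_logp[w]. A string on the PN list scores
    pn_outside_logp + a * pn_inside_logp[w], i.e. the log of
    P_outside(<PN>) * P_inside(w | <PN>)^a.
    Returns a list of (piece, label) pairs, or None if no segmentation exists.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            candidates = []
            if piece in word_logp:
                candidates.append((word_logp[piece], piece))
            if piece in pn_inside_logp:
                score = pn_outside_logp + a * pn_inside_logp[piece]
                candidates.append((score, "<PN>"))
            for score, label in candidates:
                total = best[start] + score
                if total > best[end]:
                    best[end] = total
                    back[end] = (start, piece, label)
    if n > 0 and back[n] is None:
        return None
    result, pos = [], n
    while pos > 0:
        start, piece, label = back[pos]
        result.append((piece, label))
        pos = start
    return result[::-1]


# Toy model (assumed numbers, for illustration only).
WORD_LOGP = {w: math.log(p) for w, p in
             {"王": 0.001, "小": 0.005, "明": 0.001, "在": 0.02, "北京": 0.002}.items()}
PN_INSIDE_LOGP = {"王小明": math.log(0.01)}   # P_inside(w | <PN>) from a PN list
PN_OUTSIDE_LOGP = math.log(0.005)             # P_outside(<PN>) from tagged text

segmentation = segment("王小明在北京", WORD_LOGP, PN_INSIDE_LOGP, PN_OUTSIDE_LOGP)
```

With these numbers the name is kept as one `<PN>` unit instead of three single characters, which is the effect proper noun detection is meant to have on later stages.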
Experiments and Findings
Measure: precision/recall – definition
Training data – People's Daily
Tag tool – NLPWIN
Test data – spec.
Results and Findings
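The slides leave the precision/recall definition open. One common reading, shown here as an assumption rather than the team's actual definition, compares the set of items the system proposes with a gold-standard set. The span tuples in the example are made up.

```python
def precision_recall(proposed, gold):
    """Precision = correct / proposed, recall = correct / gold.

    Both arguments are collections of hashable items, for example
    (start, end) spans of detected proper nouns or corrected errors.
    """
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans: the system proposes 3, the gold standard has 4, 2 agree.
gold_spans = {(0, 3), (5, 7), (10, 12), (15, 18)}
proposed_spans = {(0, 3), (5, 7), (8, 9)}
p, r = precision_recall(proposed_spans, gold_spans)
```

The same function applies to the later experiments in the deck, as long as proposed and gold items are expressed in the same form.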
Long word fuzzy matching
Definition of Distance(s1, s2)
Fast fuzzy matching
Global – Lei Zhang’s ACL
Local – trigger (single char or low n-gram probability)
Search – error detection/correction
Long word, n>=3
Sum of delete/insert/substitute a character
Viterbi
Simplified version
Long word + Local matching
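The distance on this slide, counted as the number of character deletions, insertions, and substitutions, matches the standard Levenshtein distance. The sketch below pairs it with a brute-force lookup restricted to long words (n>=3). The lexicon, the `max_distance` threshold, and the function names are assumptions, and the linear scan is a stand-in for the fast matching the slide refers to.

```python
def edit_distance(s1, s2):
    """Number of single-character insertions, deletions and substitutions
    needed to turn s1 into s2 (Levenshtein distance)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(candidate, lexicon, max_distance=1, min_len=3):
    """Long lexicon words (length >= min_len) within max_distance of candidate."""
    return sorted(w for w in lexicon
                  if len(w) >= min_len
                  and edit_distance(candidate, w) <= max_distance)


# Toy lexicon (assumed); the two long entries come from the slide examples.
LEXICON = {"一文不名", "忠心耿耿", "北京"}

fix_substitution = fuzzy_lookup("一文不明", LEXICON)
fix_deletion = fuzzy_lookup("忠耿耿", LEXICON)
```

A real system would index the lexicon instead of scanning it, and would only run the lookup on regions flagged by the local trigger.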
Experiments and Findings
Contact: 100 people, 3000–5000 characters/person
Error analysis
Algorithm …
Measure: precision/recall
Large lexicon, acquisition.
Trigger/threshold ?
Results and Findings
Context sensitive disambiguation
Building confusion set – specific to MSPY
Feature selection – Context vector
Weighting scheme and classifier
Collocation – contiguous POS or words/characters
Context words – words/characters within a K-size window
Triple ?
Context Vector, TFIDF
Winnow, Bayesian, TBL, etc.
Scaling up
Enlarge confusion set
Feature pruning
Adaptation
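Of the classifiers listed, a Bayesian one is the simplest to sketch. The example below is a minimal naive Bayes disambiguator over one confusion set, using context words within a K-size window as features and add-one smoothing. The class name, the training sentences, and the smoothing choice are assumptions for illustration; the slides do not say which variant was used.

```python
import math
from collections import Counter, defaultdict


def context_features(words, index, k=2):
    """Words within a k-size window around position index (target excluded)."""
    return words[max(0, index - k):index] + words[index + 1:index + 1 + k]


class NaiveBayesDisambiguator:
    def __init__(self, confusion_set, k=2):
        self.confusion_set = list(confusion_set)
        self.k = k
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        """sentences: lists of words assumed to be correctly spelled."""
        for words in sentences:
            for i, w in enumerate(words):
                if w in self.confusion_set:
                    self.class_counts[w] += 1
                    for f in context_features(words, i, self.k):
                        self.feature_counts[w][f] += 1
                        self.vocab.add(f)

    def predict(self, words, index):
        """Pick the confusion-set member that best fits the context."""
        feats = context_features(words, index, self.k)
        total = sum(self.class_counts.values())
        v = len(self.vocab) + 1  # +1 reserves mass for unseen context words
        best, best_score = None, -math.inf
        for c in self.confusion_set:
            score = math.log((self.class_counts[c] + 1)
                             / (total + len(self.confusion_set)))
            denom = sum(self.feature_counts[c].values()) + v
            for f in feats:
                score += math.log((self.feature_counts[c][f] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy training data (assumed) for the confusion set {在, 再}.
model = NaiveBayesDisambiguator(["在", "再"], k=2)
model.train([["我", "在", "北京", "工作"], ["他", "在", "家", "休息"],
             ["请", "再", "说", "一遍"], ["明天", "再", "见"]])

corrected = model.predict(["我", "再", "北京", "上班"], 1)
```

Scaling up mostly means more confusion sets and more features per set, which is where the feature pruning listed on the slide would come in.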
Experiments and Findings
Measure: precision/recall
Training data
Test data (XXX confusion set)
Results and Findings
Experiments and Findings
Current Work
Pseudo-training set based on MSPY IME
Preliminary data processing (400M PD)
Unigram error model (10,000 Words useful)
Trigram error pattern (980,000 useful)
共[度]难关=>渡 / 不够[英],=>硬
Experiments based on basic approaches
使 是/69484 市/10289 诗/2394 ……
Pseudo-test set from 南方周末 (Southern Weekly)
Continuous pair (Recall = 50%, Precision = 25%)
Pattern Matching (??)
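One way such a unigram error model could be collected is sketched below. It assumes the pseudo-training set consists of aligned pairs of a reference sentence and the text the IME produced for it, with equal length so characters can be compared position by position. It also assumes that in an entry like "使 是/69484" the first character is the reference and the rest are observed outputs. Neither assumption is confirmed by the slides, and the pairs are toy data.

```python
from collections import Counter, defaultdict


def unigram_error_counts(pairs):
    """Count character substitutions in (reference, converted) string pairs.

    Pairs of unequal length are skipped, since position-by-position
    alignment would not be meaningful for them.
    """
    table = defaultdict(Counter)
    for reference, converted in pairs:
        if len(reference) != len(converted):
            continue
        for ref_char, out_char in zip(reference, converted):
            if ref_char != out_char:
                table[ref_char][out_char] += 1
    return table


def format_entry(char, table):
    """Render one entry as 'char out1/count1 out2/count2 ...'."""
    items = " ".join(f"{c}/{n}" for c, n in table[char].most_common())
    return f"{char} {items}"


# Toy aligned pairs (assumed), echoing the examples on the slide.
pairs = [("共渡难关", "共度难关"),
         ("使用", "是用"), ("使用", "是用"), ("使用", "市用")]
errors = unigram_error_counts(pairs)
entry = format_entry("使", errors)
```

Characters whose counters are large and concentrated on a few outputs are natural seeds for an MSPY-specific confusion set.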
Future Work
Hybrid approaches
Pattern Clustering + Continuous pair
Functional words error detection
System evaluation – put it all together
Evaluation toolset
Measure: precision/recall
Training data
Test data
Results and Findings
Prototype
Demo …
Online & offline CSC
Right click
Spelling
error detection/correction
Proper noun detection/correction
Assignment
Jianfeng Gao – overall, fuzzy matching
Mu Li – context sensitive disambiguation
Jian Sun – PN detection
Yang Wen – system evaluation
Yulin Kang – demo
Lei Zhang – senior consultant
Milestone
Oct. 2001, Ming says “Yes” (TAB demo)
Dec. 2001, Dong says “Yes” (Transfer)
Aug. 2002, HJ says “Yes” (Party)
Information
Access at \\msrcn4p3\rootD\gaojf\spell
Contact me if any problems
Jianfeng Gao, Tel: 86-10-62617711-5778,
Email: [email protected]