
Language Identification in
Web Pages
Bruno Martins, Mário J. Silva
Faculdade de Ciências da Universidade Lisboa
ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)
Motivation
● Goal: efficiently crawl web pages in a given language, Portuguese in our case.
● Necessity to accurately distinguish one language from others.
● We take an n-gram based approach to solve this problem, which has been reported to give excellent results.
Problems
● Web texts are considerably different:
– Multilingual documents.
– Spelling errors.
– Lack of coherent sentences.
– Often small amounts of textual data.
● These considerable differences motivate revisiting the problem.
Outline
● Introduction.
● Context and Related Work.
– Language identification.
– Text categorization with n-grams.
● Our Language Identification Algorithm.
● Experimental Results.
● Future Work.
● Conclusions.
Language Identification
● Sibun and Reynar provided a good survey.
● A variety of features has been tried:
– Characters, words, POS tags, n-grams, ...
● N-gram based methods seem to be the most promising:
– Dunning, Damashek, Cavnar & Trenkle, ...
N-grams in text categorization
● N-grams = n-character slices of a longer string.
● “tumba!” is composed of the following n-grams:
– Unigrams: _, t, u, m, b, a, !, _
– Bigrams: _t, tu, um, mb, ba, a!, !_
– Trigrams: _tu, tum, umb, mba, ba!, a!_, !__
– Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___
– Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____
● Advantages:
– Efficiently handle spelling and grammatical errors.
– No need for tokenization, stemming, ...
– Computationally and space efficient.
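The slicing above can be sketched in a few lines of Python. The padding rule, one leading underscore and max(1, n−1) trailing ones, is inferred from the “tumba!” example on this slide and is an assumption, not something the authors state:

```python
def ngrams(text, n):
    # Pad as in the "tumba!" example: one leading underscore and
    # max(1, n-1) trailing ones (padding rule inferred from the slide).
    padded = "_" + text + "_" * max(1, n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("tumba!", 3))  # ['_tu', 'tum', 'umb', 'mba', 'ba!', 'a!_', '!__']
```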
Outline
● Introduction.
● Context and Related Work.
● Our Language Identification Algorithm.
– N-gram categorization approach.
– Measuring similarity with n-gram profiles.
– Heuristics for Web documents.
● Experimental Results.
● Future Work.
● Conclusions.
N-gram categorization approach
● Measure similarity among documents through n-gram statistics.
● N-grams of multiple lengths simultaneously (1-5).
● N-gram similarity: Cavnar & Trenkle's rank-order statistic.
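As a reference point, Cavnar & Trenkle's rank-order (“out-of-place”) statistic can be sketched as follows. The profile size of 300 and the whitespace tokenization are illustrative choices, not the paper's exact parameters:

```python
from collections import Counter

def profile(text, max_rank=300):
    # Count n-grams of lengths 1-5 over underscore-padded tokens and
    # rank them by frequency, most frequent first.
    counts = Counter()
    for token in text.lower().split():
        for n in range(1, 6):
            padded = "_" + token + "_" * max(1, n - 1)
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(max_rank))}

def out_of_place(doc_profile, lang_profile):
    # Sum of rank displacements; n-grams absent from the language
    # profile receive the maximum penalty. Lower means more similar.
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())
```

A document is then assigned the language whose profile yields the smallest out-of-place distance.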
More efficient similarity measures
● Lin's information-theoretic similarity measure.
● Jiang and Conrath's distance formula.
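The formulas themselves did not survive this transcript. As a rough illustration only, here is one plausible instantiation of the two generic forms over sets of n-grams with estimated probabilities; the paper's actual adaptation to n-gram profiles may differ:

```python
import math

def info(grams, prob):
    # Information content of a set of n-grams: -sum(log P(g)).
    # (Assumed instantiation; not the paper's exact definition.)
    return -sum(math.log(prob[g]) for g in grams)

def lin_similarity(a, b, prob):
    # Lin's generic form: shared information over total information.
    common = set(a) & set(b)
    return 2 * info(common, prob) / (info(a, prob) + info(b, prob))

def jiang_conrath_distance(a, b, prob):
    # Jiang & Conrath's generic form: total information minus twice
    # the shared information (zero for identical descriptions).
    common = set(a) & set(b)
    return info(a, prob) + info(b, prob) - 2 * info(common, prob)
```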
Heuristics for the Web
● Use meta-data information, if available and valid.
– Matching strings on the language meta tag.
● Filter common or automatically generated strings.
– “optimized for Internet Explorer”
● Weight n-grams according to HTML markup.
– Title, bold typeface, subject and description meta tags.
● Handle insufficient data.
– Ignore pages with less than 40 characters.
● Handle multilingualism and hard-to-decide cases.
– Weight the largest sentences.
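The filtering and insufficient-data heuristics above can be sketched as a small preprocessing step. The pattern list is a hypothetical stand-in for whatever boilerplate catalogue the crawler actually uses; only the “optimized for Internet Explorer” example is attested by the slide:

```python
import re

# Hypothetical catalogue of common auto-generated strings; the slide's
# "optimized for Internet Explorer" is the only attested example.
BOILERPLATE = [r"optimi[sz]ed for internet explorer"]

def preprocess(text, min_chars=40):
    # Strip known boilerplate, collapse whitespace, and ignore pages
    # with less than 40 characters (None means: skip classification).
    for pattern in BOILERPLATE:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) >= min_chars else None
```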
Outline
● Introduction.
● Context and Related Work.
● Our Language Identification Algorithm.
● Experimental Results.
● Future Work.
● Conclusions.
Evaluation Experiments
● Language profiles for 23 different languages.
● Test collection: 500 documents for each of 12 different languages.
– HTML documents crawled from portals and online newspapers.
● Tested the classification algorithm in different settings.
● Lin's measure was the most accurate.
● Heuristics improve performance.
Evaluation Results
[Bar chart: percentage of correct guesses per language (Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Portuguese, Russian, Spanish, Swedish), on a 0-100% scale, comparing Lin's similarity measure, Jiang's distance measure, the original “rank order” statistic, and Lin's measure without heuristics.]
Application to the Portuguese Web
● About 3.5 million pages.
● Multiple file types.
● A significant portion of the Portuguese Web is written in foreign languages, especially English.
Limitations
● Unable to distinguish dialects of the same language?
– Portuguese from Portugal and from Brazil.
– British English and American English?
● Possible directions:
– Web linkage information.
– “Discriminative” n-grams instead of the most frequent ones.
Future Work
● Carefully choose better training data.
● Smoothing (Good-Turing).
● Use the n-gram approach for other classification tasks.
Conclusions
● N-grams are effective in language guessing.
● Text from the Web presents problems.
● Lin's similarity measure seems effective.
Thanks for your attention!
[email protected]
http://www.tumba.pt
http://tcatng.sourceforge.net