Users, Interaction and Information Seeking/Retrieval
Managing Morphologically Complex Languages in Information Retrieval
Kal Järvelin & Many Others
University of Tampere
1. Introduction
Morphologically complex languages
- unlike English, Chinese
- rich inflectional and derivational morphology
- rich compound formation
U. Tampere experiences 1998 - 2008
- monolingual IR
- cross-language IR
- focus: Finnish, Germanic languages, English
Methods for Morphology Variation Management
[Diagram]
- Reductive methods:
  - stemming (rule-based)
  - lemmatization (rules + dictionary)
- Generative methods:
  - inflectional stem generation (rules + dictionary) -> inflectional stems; enhanced inflectional stems
  - word form generation (rule-based) -> FCG; generating all forms
Agenda
1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion
2. Normalization
Reductive methods, conflation
- stemming
- lemmatization
+ conflation -> simpler searching
+ smaller index
+ provides query expansion
Stemming is available for many languages (e.g. the Porter stemmer)
Lemmatizers are less available and more demanding (dictionary requirement)
Alkula 2001
Boolean environment, inflected vs. lemmatized index, Finnish:
- manual truncation vs. automatic stemming
  - stemming improves P and hurts R
  - many derivatives are lost
- manual truncation vs. lemmatization
  - lemmatization improves P and hurts R
  - many derivatives are lost, others correctly avoided
Differences between the automatic methods are not great
Kettunen & al 2005
Ranked retrieval, Finnish. Three problems:
- how do lemmatization and inflectional stem generation compare in a best-match environment?
- is a stemmer realistic for handling Finnish morphology?
- is simulated truncation feasible in a best-match system?
Lemmatized vs. inflected-form vs. stemmed index.
Kettunen & al. 2005

Method        Index     MAP   Change %
FinTWOL       lemmas    35.0  --
Inf Stem Gen  inflform  34.2  -2.3
Porter        stemmed   27.7  -20.9
Raw           inflform  18.9  -46.0

But very long queries for inflectional stem generation & expansion (thousands of words); weaker generation yields shorter queries but progressively deteriorating results.
(InQuery/TUTK/graded-35/regular)
Kettunen & al. 2005: [figure not available in transcript]
MonoIR: Airio 2006

Language  Index type  Average P %  Diff to Base %
English   Inflected   43.4         --
English   Lemmas      45.6         2.2
English   Stemmed     46.3         2.9
Finnish   Inflected   31.0         --
Finnish   Lemmas      47.0         16.0
Finnish   Stemmed     48.5         17.5
Swedish   Inflected   30.2         --
Swedish   Lemmas      31.4         1.2
Swedish   Stemmed     33.5         3.3
German    Inflected   30.2         --
German    Lemmas      31.9         1.7
German    Stemmed     35.7         5.5

InQuery/CLEF/TD/TWOL&Porter&Raw
CLIR: Inflectional Morphology
NL queries contain inflected-form source keys
Dictionary headwords are in basic form (lemmas)
Problem significance varies by language
Stemming
- stem both the dictionary and the query words
- but this may yield far too many translations
Stemming in dictionary translation is best applied after translation.
Lemmatization in CLIR
Lemmatization
- easy access to dictionaries
- but tokens may be ambiguous
- dictionary translations are not always in basic form
- lemmatizer's dictionary coverage:
  - insufficient -> non-lemmatized source keys, OOVs
  - too broad -> too many senses provided
CLIR Findings: Airio 2006
English -> X

Target X  Index type  Average P %  Diff to Split %
Finnish   Lemmas      29.0         -6.5
Finnish   Stemmed     20.8         -14.7
Swedish   Lemmas      17.4         -9.7
Swedish   Stemmed     19.0         -8.1
German    Lemmas      26.4         -4.6
German    Stemmed     25.7         -5.3

InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter
3. Compounds
Compound word types
- determinative: Weinkeller, vinkällare, life-jacket
- copulative: schwarzweiss, svartvit, black-and-white
- compositional: Stadtverwaltung, stadsförvaltning
- non-compositional: Erdbeere, jordgubbe, strawberry
Note on spelling: compound word components are written together (if not -> phrases)
Compound Word Translation
Not all compounds are in the dictionary
- some languages are very productive
- small dictionaries: atomic words, old non-compositional compounds
- large dictionaries: many compositional compounds added
Compounds remove phrase identification problems, but cause translation and query formulation problems
Joining Morphemes
Joining morphemes complicate compound analysis & translation

Joining morpheme types in Swedish:
<omission>  flicknamn
-s          rättsfall
-e          flickebarn
-a          gästabud
-u          gatubelysning
-o          människokärlek

Joining morpheme types in German:
-s          Handelsvertrag
-n          Affenhaus
-e          Gästebett
-en         Fotographenausbildung
-er         Gespensterhaus
-es         Freundeskreis
-ens        Herzensbrecher
<omission>  Sprachwissenschaft

Suggestive finding: treating joining morphemes improves MAP by 2 % (Hedlund 2002, SWE->ENG, 11 Qs)
Compound Processing, 2
A Finnish natural language query: lääkkeet sydänvaivoihin (medicines for heart problems)
The output of morphological analysis: lääke; sydänvaiva, sydän, vaiva
Dictionary translation and the output of component tagging:
- lääke ---> medication, drug
- sydänvaiva ---> "not in dict"
- sydän ---> heart
- vaiva ---> ailment, complaint, discomfort, inconvenience, trouble, vexation
There are many ways to combine the components in a query
Compound Processing, 3
Sample English CLIR query:
#sum( #syn( medication drug ) heart #syn( ailment complaint discomfort inconvenience trouble vexation ))
i.e. translating as if source compounds were phrases
Source compound handling may vary here:
#sum( #syn( medication drug ) #syn( #uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart discomfort ) #uw3( heart inconvenience ) #uw3( heart trouble ) #uw3( heart vexation )))
#uw3 = proximity operator, unordered window of three words, free word order
i.e. forming all proximity combinations as a synonym set.
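The proximity-combination variant above can be generated mechanically: pair every translation of one compound component with every translation of the other inside an unordered-window operator, then wrap the pairs in a synonym set. This is a minimal sketch; the function name and signature are my own, not from the UTACLIR implementation.

```python
def compound_query(modifier_translations, head_translations, window=3):
    """Build an InQuery-style fragment for a split compound: every
    modifier translation is paired with every head translation inside
    an unordered proximity window (#uwN), and the pairs form one
    synonym set."""
    combos = [
        f"#uw{window}( {m} {h} )"
        for m in modifier_translations
        for h in head_translations
    ]
    return "#syn( " + " ".join(combos) + " )"

# 'sydänvaiva' split into sydän -> heart and vaiva -> ailment, ...
print(compound_query(["heart"], ["ailment", "complaint"]))
# #syn( #uw3( heart ailment ) #uw3( heart complaint ) )
```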
Compound Processing, 4
No clear benefit was seen from using proximity combinations.
Nor did we observe a large effect from changing the proximity operator (OD vs. UW).
Some monolingual results follow (Airio 2006)
Language  Index type      Average P %  Diff to Baseline
English   Inflected       43.4         --
English   Lemmas          45.6         2.2
English   Stemmed         46.3         2.9
Finnish   Inflected       31.0         --
Finnish   Lemma & decomp  50.5         19.5
Finnish   Lemmas          47.0         16.0
Finnish   Stemmed         48.5         17.5
Swedish   Inflected       30.2         --
Swedish   Lemma & decomp  38.8         8.6
Swedish   Lemmas          31.4         1.2
Swedish   Stemmed         33.5         3.3
German    Inflected       30.2         --
German    Lemma & decomp  36.2         6.0
German    Lemmas          31.9         1.7
German    Stemmed         35.7         5.5

InQuery/CLEF/Raw&TWOL&Porter
[Figures for English, Swedish and Finnish not available in transcript]
Hedlund 2002
Compound translation as compounds:
- 47 German CLEF 2001 topics, English document collection
- comprehensive dictionary (many compounds) vs. small dictionary (no compounds)
- mean AP 34.7% vs. 30.4%
- the dictionary matters ...
Alternative approach: if not translatable, split and translate the components
CLEF Ger -> Eng
1. best manually translated         0.4465
2. large dict, no comp splitting    0.3520
3. limited dict, no comp splitting  0.3057
4. large dict & comp splitting      0.3830
5. limited dict & comp splitting    0.3547
InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n
CLIR Findings: Airio 2006
English -> target

Target   Index type      Average P %  Diff to Baseline %
Finnish  Lemma & decomp  35.5         --
Finnish  Lemmas          29.0         -6.5
Finnish  Stemmed         20.8         -14.7
Swedish  Lemma & decomp  27.1         --
Swedish  Lemmas          17.4         -9.7
Swedish  Stemmed         19.0         -8.1
German   Lemma & decomp  31.0         --
German   Lemmas          26.4         -4.6
German   Stemmed         25.7         -5.3

InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter
[Figures for Eng->Fin, Eng->Ger and Eng->Swe not available in transcript]
4. Generative Methods
[Diagram: variation handling]
- Reductive methods:
  - stemming (rule-based)
  - lemmatization (rules + dictionary)
- Generative methods:
  - inflectional stem generation (rules + dictionary) -> inflectional stems; enhanced inflectional stems
  - word form generation (rule-based) -> FCG; generating all forms
Generative Methods: inflectional stems
Instead of normalization, generate inflectional stems for an inflected index.
- then use the stems to harvest full forms from the index
- long queries ...
... OR ...
Instead of normalization, generate full inflectional forms for an inflected index.
- Long queries? Sure!
- Sounds absolutely crazy ...
... BUT!
Are morphologically complex languages really that complex in IR in practice?
Instead of full form generation, generate only a sufficient set of forms -> FCG
- In Finnish, 9-12 forms cover 85% of all occurrences of nouns
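The FCG idea can be sketched as query-time expansion: each query lemma is replaced by a synonym set of its most frequent inflected forms, which then match an unnormalized index directly. The form table below is illustrative only (a few real forms of the Finnish noun talo, 'house'); in a real system it would come from an FCG generator, and the function names are my own.

```python
# Hypothetical form table; a real system would generate these forms
# for the language's most frequent cases with an FCG component.
FREQUENT_FORMS = {
    "talo": ["talo", "talon", "taloa", "talossa", "talosta",
             "taloon", "talolla", "talot", "talojen"],
}

def fcg_expand(lemma):
    """Expand a query lemma into a #syn set of its frequent inflected
    forms, for matching against an inflected (unnormalized) index."""
    forms = FREQUENT_FORMS.get(lemma, [lemma])
    return "#syn( " + " ".join(forms) + " )"

print(fcg_expand("talo"))
```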
Kettunen & al 2006: Finnish IR

          MAP for relevance level
Method    Liberal  Normal  Stringent
TWOL      37.8     35.0    24.1
FCG12     32.7     30.0    21.4
FCG6      30.9     28.0    21.0
Snowball  29.8     27.7    20.0
Raw       19.6     18.9    12.4

... monolingual ...
Kettunen & al 2007: Other Languages, IR

          MAP for Language
Method    Swe      Ger      Rus
TWOL      32.6     39.7     ..
FCG       30.6 /4  38.0 /4  32.7 /2
FCG       29.1 /2  36.8 /2  29.2 /6
Snowball  28.5     39.1     34.7
Raw       24.0     35.9     29.8

Results for long queries ... monolingual ...
CLIR Findings: Airio 2008

Language pair  Raw transl  Fi-FCG_9 /  Fi-FCG_12 /  Lemmatized
                           Sv-FCG_4    Sv-FCG_7
Fin -> Eng     11.2        32.4        32.5         39.6
Fin -> Swe     14.3        22.6        23.9         35.2
Eng -> Swe     18.1        25.1        27.3         34.1
Swe -> Fin     11.7        28.0        27.9         37.6
5. Query Structures
Translation ambiguity, such as ...
- Homonymy: homophony, homography
  - Examples: platform, bank, book
- Inflectional homography
  - Examples: train, trains, training; book, books, booking
- Polysemy
  - Examples: back, train
... is a problem in CLIR.
Ambiguity Resolution Methods
- Part-of-speech tagging (e.g. Ballesteros & Croft '98)
- Corpus-based methods (Ballesteros & Croft '96, '97; Chen & al. '99)
- Query expansion
- Collocations
- Query structuring - the Pirkola method (1998)
Query Structuring
[Diagram] From weak to strong query structures by recognition of concepts, expression weights, and phrases/compounds; the decision flow asks in turn: concepts? weighting? phrases? Queries may also be combined (query fusion). The weakest structure is a flat bag of words, e.g.
#sum( a b c d e )
while the strongest recognizes concepts, weights and phrases, e.g.
#wsum( 1 3 #syn( a #3( b c )) 1 #syn( d e ))
Structured Queries in CLIR
CLIR performance (Pirkola 1998, 1999)
- English baselines, manual Finnish translations
- Automatic dictionary translation FIN -> ENG
  - natural language queries (NL) vs. concept queries (BL)
  - structured vs. unstructured translations
  - single words (NL/S) vs. phrases marked (NL/WP)
  - general and/or special dictionary translation
- a 500,000-document TREC subcollection
- probabilistic retrieval (InQuery)
- 30 health-related requests
The Pirkola Method
- All translations of all senses provided by the dictionary are incorporated into the query
- All translations of each source language word are combined by the synonym operator; synonym groups are combined by #and or #sum
- this effectively provides disambiguation
An Example
Consider the Finnish natural language query: lääke sydänvaiva [= medicine heart_problem]
Sample English CLIR query:
#sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ) )
Each source word forms a synonym set
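Building such a query from a translation dictionary is mechanical: wrap each source word's translation list in #syn and join the groups with #sum (or #and). A minimal sketch follows; the function name and the toy dictionary entries are mine, assembled from the example above.

```python
def pirkola_query(source_words, dictionary):
    """Combine all translations of each source word with the synonym
    operator, and the synonym groups with #sum (the Pirkola method).
    Untranslatable words pass through unchanged."""
    groups = []
    for w in source_words:
        translations = dictionary.get(w, [w])
        if len(translations) == 1:
            groups.append(translations[0])
        else:
            groups.append("#syn( " + " ".join(translations) + " )")
    return "#sum( " + " ".join(groups) + " )"

# Toy dictionary based on the slide's example
dict_ = {
    "lääke": ["medication", "drug"],
    "sydän": ["heart"],
    "vaiva": ["ailment", "complaint", "discomfort"],
}
print(pirkola_query(["lääke", "sydän", "vaiva"], dict_))
# #sum( #syn( medication drug ) heart #syn( ailment complaint discomfort ) )
```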
Query Translation Test Set-up
[Diagram] An English request is translated into a Finnish request, which yields a Finnish NL query and Finnish BL (baseline) queries. These are translated into English queries using a general dictionary and/or a medical dictionary, and run with InQuery (Unix server) against TREC.
Unstructured NL/S Queries
#sum( tw11, tw12, ..., tw21, tw22, ..., twn1, ..., twnk )
[Precision-recall figure: runs GD, SD and GD, SD --> GD, and BL against the baseline]
Only 38% of the average baseline precision (sd & gd)
Structured Queries w/ Special Dictionary
#and( #syn( tw11, tw12, ... ), #syn( tw21, tw22, ... ), #syn( twn1, ..., twnk ))
[Precision-recall figure: runs GD, SD and GD, SD --> GD, and BL against the baseline]
77% of the average baseline precision (sd & gd)
Structure doubles precision in all cases
Query Structuring, More Results

CLEF                     Unstructured  Structured  Change %
                         (no s-grams)  (no s-grams)
Topic set 2000 (N = 33)
Finnish - Eng            0.1609        0.2150      33.6
German - Eng             0.2097        0.2639      25.8
Swedish - Eng            0.2015        0.2242      11.3
Topic set 2001 (N = 47)
Finnish - Eng            0.2407        0.3443      43.0
German - Eng             0.3242        0.3830      18.1
Swedish - Eng            0.3466        0.3465      0.0
Transitive CLIR - Query Structures

TRANSLATION TYPE          Average precision  Difference % units
ENG monolingual           45.2
SWE-ENG structured        34.6
SWE-ENG unstructured      28.8               -5.8
SWE-FIN-ENG structured    36.1               +1.5
SWE-FIN-ENG unstructured  18.3               -16.3
FIN-ENG                   36.4
FIN-SWE-ENG               33.5               -2.9

Average precision for the transitive, bilingual and monolingual runs of CLEF 2001 topics (N = 50)
Transitive CLIR Results, 2
Transitive CLIR effectiveness, regular relevance (N = 35)

             MAP   % monolingual  % direct
             performance
Swe-Eng-Fin  21.8  59**           68**
Eng-Swe-Fin  27.5  75             88
Ger-Eng-Fin  24.3  66**           83
Ger-Swe-Fin  29.3  79             100
Mean         25.7  70             85

Lehtokangas & al 2008
TransCLIR + pRF effectiveness, regular relevance (N = 35)

             MAP   % mono-  % mono-  % direct  % pRF exp
                   lingual  lingual            direct
                   + pRF
Swe-Eng-Fin  28.7  68**     78       90        78*
Eng-Swe-Fin  32.3  76*      88       103       87
Ger-Eng-Fin  33.6  79       91       115       104
Ger-Swe-Fin  34.5  81       94       118       107
Mean         32.3  76       88       107       94
6. OOV Words
Low coverage -> non-translated words
- Domain-specific terms missing from general dictionaries, e.g. dystrophy
  - covered in domain-specific dictionaries
- Compound words
- Proper names: persons, geographical names, ...
- Often central for query effectiveness
Are large dictionaries the solution? BUT:
- an excessive number of senses and words for each entry
- increases ambiguity problems
- and still many words remain OOVs
OOV Processing
Proper names are often spelled differently between languages
- transliteration variation
- Brussels, Bryssel; Chernobyl, Tshernobyl; Chechnya, Tsetsenia
Non-translated keys are often used as such in CLIR queries -- a simplistic approach, not optimal
In some languages, proper names may inflect
- in Finnish, "in Cairo" = Kairossa
Approximate String Matching
Means for
- correcting spelling errors
- matching variant word forms (e.g. proper names between languages)
Methods
- edit distance, n-grams, skip-grams
- soundex, phonix: based on phonetic similarity
- transliteration
Sample Digram Matching
Sample words: Bryssel, Brussels, Bruxelles
Bryssel --> N1 = {br, ry, ys, ss, se, el}
Brussels --> N2 = {br, ru, us, ss, se, el, ls}
Bruxelles --> N3 = {br, ru, ux, xe, el, ll, le, es}
sim(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2| = 4 / 9 = 0.444
sim(N1, N3) = 2 / 12 = 0.167
Skip-grams
Generalizing n-gram matching
- the strings to be compared are split into substrings of length n
- skipping characters is allowed
- substrings are produced using various skip lengths
- n-grams remain a special case: no skips
(Pirkola & Keskustalo & al 2002)
An Example
katalyyttinen - catalytic
skip-0:
{_k, ka, at, ta, al, ly, yy, yt, tt, ti, in, ne, en, n_}
{_c, ca, at, ta, al, ly, yt, ti, ic, c_}
skip-1:
{_a, kt, aa, tl, ay, ly, yt, yt, ti, tn, ie, nn, e_}
{_a, ct, aa, tl, ay, lt, yi, tc, i_}
skip-2:
{_t, ka, al, ty, ay, lt, yt, yi, tn, te, in, n_}
{_t, ca, al, ty, at, li, yc, t_}
Similarity is calculated over the different skip-gram sets
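Generating the s-gram sets above is straightforward: pad the word, then take every character pair whose members lie a fixed number of positions apart. The sketch below pools the skip classes into one set before taking Jaccard similarity; this is a simplification of my own, as Pirkola, Keskustalo & al. define similarity over the gram classes in a more refined way.

```python
def sgrams(word, skip):
    """All character pairs of the padded word whose members lie
    `skip` characters apart; skip=0 gives ordinary digrams."""
    w = f"_{word}_"
    return [w[i] + w[i + skip + 1] for i in range(len(w) - skip - 1)]

def sgram_sim(a, b, skips=(0, 1, 2)):
    """Jaccard similarity over the pooled s-gram sets (simplified)."""
    ga = {g for s in skips for g in sgrams(a.lower(), s)}
    gb = {g for s in skips for g in sgrams(b.lower(), s)}
    return len(ga & gb) / len(ga | gb)

print(sgrams("catalytic", 0))  # ['_c', 'ca', 'at', 'ta', 'al', 'ly', 'yt', 'ti', 'ic', 'c_']
```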
Skip-gram Effectiveness
Several n-gram methods tested
- n-grams & skip-grams
- with & without padding
The relative improvement for some s-grams vs. n-grams in X -> Finnish name matching:
- 18.2% (Eng medical)
- 49.7% (Eng geographic)
- 20.7% (Ger geographic)
- 17.1% (Swe geographic)
Statistically significant, p = 0.01-0.001
CLIR Findings: Järvelin & al 2008
Closely related languages
Norwegian and Swedish
Translation by string matching alone
no dictionary at all, no other vocabulary source
encouraging results
[Figure not available in transcript]
CLIR Findings: Airio 2008

Language pair  Raw transl *)  Fi-sgram_9 /    Fi-sgram_12 /   Lemmatized
                              Sv-sgram_4 *)   Sv-sgram_7 *)
Fin -> Eng     11.2           29.2            31.0            39.6
Eng -> Swe     18.1           25.7            25.3            34.1
Swe -> Fin     11.7           26.1            26.7            37.6
Fin -> Swe     14.3           26.7            22.6            35.2

*) target index not normalized
Rule-based Transliteration
Between languages, originally 'same' words have different spellings:
- proper names, technical terms
Transliteration rules are based on regular variations in the spelling of equivalent word forms between languages:
- construction - konstruktio: c -> k
- somatology - somatologia: y -> ia
- universiti - university: i -> y
Transliteration rules are mined from a bilingual word list
Their frequency and reliability are recorded
Sample Rules, German-to-English

Source string  Target string  Location of the rule  Confidence factor
ekt            ect            middle                89.2
m              ma             end                   21.1
akt            act            middle                86.7
ko             co             beginning             80.7
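Applying such rules can be sketched as follows: each rule whose source string occurs in the stated position produces a spelling candidate. The rule data is taken from the table above; the application logic is my own simplification (one rule application per candidate, no rule combinations, which the real TRT candidate generation does perform).

```python
# Two rules from the slide's German-to-English table:
# (source, target, location, confidence factor)
RULES = [
    ("ekt", "ect", "middle", 89.2),
    ("ko", "co", "beginning", 80.7),
]

def apply_rules(word, rules, min_cf=4.0):
    """Generate spelling candidates by applying each sufficiently
    confident matching rule once, at its stated position."""
    candidates = {word}  # the unchanged word stays a candidate
    for src, tgt, loc, cf in rules:
        if cf < min_cf:
            continue
        if loc == "beginning" and word.startswith(src):
            candidates.add(tgt + word[len(src):])
        elif loc == "end" and word.endswith(src):
            candidates.add(word[:-len(src)] + tgt)
        elif loc == "middle" and src in word[1:-1]:
            i = word.index(src, 1)
            candidates.add(word[:i] + tgt + word[i + len(src):])
    return candidates

print(sorted(apply_rules("kollektiv", RULES)))
# ['collektiv', 'kollectiv', 'kollektiv']
```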
TRT Translation - 2 Steps
[Diagram] Step 1: TRT rule production builds the TRT rule base. Step 2: an OOV source word is TRT-translated into translation candidates, which are matched by skip-grams against the target index to identify the target word(s). Evaluation is by recall and precision.
TRT - Rule Collections

Rule collection  # Rules  # Rules (CF > 4.0%, Fr. > 2)
Spanish-English  8800     1295
Spanish-German   5412     984
Spanish-French   9724     1430
German-English   8609     1219
French-English   9873     1170
TRT Effectiveness
Finnish-to-English translation (HCF)

Term type      Digrams  TRT + digrams  % chg
Bio terms      61.4     72.0           +17.3
Place names    30.0     35.9           +19.7
Economics      32.2     38.0           +18.0
Technology     31.6     53.7           +69.9
Miscellaneous  33.8     40.6           +20.1
A Problem with TRT
TRT gives many target word forms for a source word (even tens of thousands) but does not indicate the correct one
For example, in Spanish-English translation TRT gives the following forms for the Spanish word biosintesis:
biosintesis, biosintessis, biosinthesis, biosinthessis, biosyntesis, biosyntessis, biosynthesis, biosynthessis
To identify the correct equivalent, we use FITE: frequency-based identification of equivalents
Sample Frequency Pattern
The frequency pattern associated with the English target word candidates for the Spanish word biosintesis:

Target candidate  Doc Freq
biosynthesis      2 230 000
biosintesis       909
biosyntesis       634
biosinthesis      255
biosynthessis     3
biosintessis      0
...               ...
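The core of the selection step can be sketched as picking the candidate with the dominant document frequency. This only implements that one check; the full FITE procedure (see the diagram on the next slide) also verifies the frequency pattern, relative frequency, and length difference. Function name and data layout are my own.

```python
def fite_select(candidates, doc_freq, min_freq=1):
    """Pick the translation candidate with the highest document
    frequency; return None if no candidate reaches the threshold.
    Simplified: real FITE adds frequency-pattern, relative-frequency
    and length-difference checks."""
    best = max(candidates, key=lambda w: doc_freq.get(w, 0))
    return best if doc_freq.get(best, 0) >= min_freq else None

# Frequencies from the slide's sample pattern
freq = {"biosynthesis": 2_230_000, "biosintesis": 909,
        "biosyntesis": 634, "biosinthesis": 255,
        "biosynthessis": 3, "biosintessis": 0}
print(fite_select(list(freq), freq))  # biosynthesis
```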
FITE-TRT Translation
[Diagram] Step 1: TRT rule production builds the TRT rule base, and the OOV source word is TRT-translated into translation candidates. Step 2: FITE identification selects the identified target word from the candidates using frequency statistics and the native source word, checking that the frequency pattern, the relative frequency and the length difference are OK. Evaluation is by recall and precision.
FITE-TRT effectiveness
Spanish-English biological and medical spelling
variants (n = 89)
Finnish-English biological and medical spelling
variants (n = 89)
Translation toward English by TRT
English equivalent identification by FITE
FITE-TRT Effectiveness
Translation recall and precision for bio-terms

Source language  Statistics  Translation Recall %  Translation Prec %
Spanish          Web         91.0                  98.8
Spanish          Freq List   82.0                  98.8
Finnish          Web         71.9                  97.0
Finnish          Freq List   67.4                  97.3
FITE-TRT effectiveness
UTACLIR TREC Genomics Track 2004 topics
- Spanish-English actual OOV words (n = 93+5)
- Finnish-English actual OOV words (n = 48+5)
- Translation toward English by TRT
- English equivalent identification by FITE
FITE-TRT Effectiveness
Translation recall and precision for actual OOV words

Source language  Statistics  Translation Recall %  Translation Precision %
Spanish          Web         89.2                  97.6
Spanish          Freq List   87.1                  97.6
Finnish          Web         72.9                  97.2
Finnish          Freq List   79.2                  95.0
Genomics CLIR Experiment
German-English CLIR in genomics
- TREC Genomics Track data, 50 topics (20 training + 30 testing)
- MedLine, 4.6 M medical abstracts
- Baseline: raw German as such + dictionary translation
- Test queries: FITE-TRT
Genomics CLIR experiment: [figure not available in transcript]
7. Conclusions
Monolingual retrieval
- morphological complexity
- who owns the index?
- reductive and generative approaches
- skewed distributions: surprisingly little may be enough
- compound handling is perhaps not critical in monolingual retrieval
More Conclusions
Cross-language retrieval
- query structuring: simple and effective
- closely related languages / dialects
  - observe what happens between compounding vs. isolating languages
- compound splitting seems essential
- inflected-form indices
  - simple language-independent techniques may be enough
  - skip-grams or FCG after translation may be a solution
- specific domains: domain dictionaries or aligned corpora
Thanks!