Lecture 11: Statistical/Probabilistic Models for CLIR & Word Alignment
Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2011/05/30
Cross-Language Information Retrieval
• Query in a source language and retrieve relevant documents in target languages
– Flow: Source Query → Query Translation → Target Translation → Information Retrieval → Target Documents
– E.g., "Hussein" → 海珊 / 侯賽因 / 哈珊 / 胡笙 (TC), 侯赛因 / 海珊 / 哈珊 (SC)
References
• Philip Resnik and Noah A. Smith, "The Web as a Parallel Corpus," Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Christopher C. Yang and Kar Wing Li, "Automatic Construction of English/Chinese Parallel Corpora," Journal of the American Society for Information Science and Technology, 2003
• Marcello Federico and Nicola Bertoldi, "Statistical Cross-Language Information Retrieval using N-Best Query Translations," SIGIR 2002, ITC-irst Centro per la Ricerca Scientifica e Tecnologica
• Wessel Kraaij, Jian-Yun Nie, and Michel Simard, "Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval," Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Colin Cherry and Dekang Lin, "A Probability Model to Improve Word Alignment," ACL 2003, University of Alberta
The Web as Corpus
Outline
• The Web as a Parallel Corpus — Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Automatic Construction of English/Chinese Parallel Corpora — Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003
• Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval — Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003
The Web as a Parallel Corpus
Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003
Parallel Corpora
• Bitexts, bodies of text in parallel translation, play an important role in machine translation and multilingual natural language processing.
• Not readily available in the necessary quantities:
– Canadian parliamentary proceedings (Hansards) in English/French
– United Nations proceedings (Linguistic Data Consortium, http://www.ldc.upenn.edu/)
– Religious texts (Resnik, Olsen, and Diab)
– Localized versions of software manuals (Resnik and Melamed 1997; Menezes and Richardson)
STRAND
• An architecture for Structural Translation Recognition, Acquiring Natural Data (Resnik 1998, 1999)
• Identifies pairs of Web pages that are mutual translations.
• Web page authors disseminate information in multiple languages – When presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure
Finding Parallel Web Pages
• Finding parallel text on the Web consists of three main steps:
– Locating pages that might have parallel translations
– Generating candidate pairs that might be translations
– Structurally filtering out nontranslation candidate pairs
• Locating pages
– Two types: parents and siblings
– Ask AltaVista: (anchor: "english" OR anchor: "anglais") AND (anchor: "french" OR anchor: "francais")
Two types of Website Structure
STRAND
• Generating candidate pairs:
– Automatic language identification (Dunning 1994)
– URL matching: manually creating a list of substitution rules
  • E.g., http://mysite.com/english/home_en.html => http://mysite.com/big5/home_ch.html
– Document length: length(E) ≈ C · length(F)
• Structural filtering
– The heart of STRAND
– Markup analyzer: determines a set of pair-specific structural values for candidate translation pairs
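The URL-matching step above can be sketched as a small rule table; the substitution rules and URLs below are illustrative stand-ins, not STRAND's actual hand-built list:

```python
import itertools
import re

# Hypothetical substitution rules: each pattern marks an "English" URL
# component, paired with candidate replacements naming its counterpart.
RULES = [
    (re.compile(r"/english/"), ["/big5/", "/chinese/"]),
    (re.compile(r"_en(\.html)$"), [r"_ch\1", r"_zh\1"]),
]

def candidate_counterparts(url):
    """Apply every matching rule; each combination of replacements
    yields one candidate translation-page URL."""
    applicable = [(pat, reps) for pat, reps in RULES if pat.search(url)]
    candidates = set()
    for choices in itertools.product(*[reps for _, reps in applicable]):
        cand = url
        for (pat, _), rep in zip(applicable, choices):
            cand = pat.sub(rep, cand)
        if cand != url:
            candidates.add(cand)
    return sorted(candidates)
```

A URL that matches no rule simply yields no candidates, which keeps the generation step cheap compared with fetching and filtering.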
Automatic Construction of English/Chinese Parallel Corpora
Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003
Web Parallel Corpora
• Some web sites with bilingual text contain a completely separate monolingual sub-tree for each language.
• Title alignment and dynamic programming matching
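The title-alignment idea can be sketched as a standard dynamic-programming alignment over two title sequences; the token-overlap similarity below is a toy placeholder, not Yang and Li's actual matching score:

```python
def sim(a, b):
    """Toy similarity: Jaccard overlap of lowercase tokens (placeholder)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def align_titles(titles_a, titles_b, gap=-0.2):
    """Needleman-Wunsch-style DP: best global alignment of two title lists."""
    m, n = len(titles_a), len(titles_b)
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(titles_a[i - 1], titles_b[j - 1]),
                score[i - 1][j] + gap,  # skip a title on side A
                score[i][j - 1] + gap,  # skip a title on side B
            )
    # Backtrace to recover the matched pairs
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(titles_a[i - 1], titles_b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```

In the real setting the similarity function would compare a title with the translation of its counterpart (e.g., via a bilingual lexicon) rather than raw token overlap.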
Statistical Cross-Language Information Retrieval using N-Best Query Translations
Marcello Federico and Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica
Outline
• Statistical CLIR Approach • Query Document Model • Query Translation Model
Statistical CLIR Approach
• CLIR problem
– Given a query i in the source language (Italian), one would like to find relevant documents d in the target language (English) within a collection D:

  P(d | i) ∝ P(i, d)

– To bridge the language difference between query and documents, a hidden variable e is introduced, representing an English translation of i.
Statistical CLIR Approach
P(i, d) = Σ_e P(i, e, d) ≈ Σ_e P(i, e) · P(d | e) = Σ_e P(i, e) · P(e, d) / Σ_{d'} P(e, d')

– P(e, d) is computed by the query-document model
– P(i, e) is computed by the query-translation model
Query-Document Model
P(q, d) = P(q | d) · P(d)

P(q = q_1 … q_n | d) = Π_{k=1..n} P(q_k | d)

• Statistical LM & smoothing
– Term frequencies of a document are smoothed linearly, and the amount of probability assigned to never-observed terms is proportional to the size of the document vocabulary.

  local:  P(q | d) = N(q, d) / (N(d) + |V(d)|) + |V(d)| / (N(d) + |V(d)|) · P(q)

  global: P(q) = N(q) / (N + |V|) + |V| / (N + |V|) · 1 / |V|

  where N(q, d) is the frequency of q in d, N(d) the length of d, |V(d)| the vocabulary of d, and N, |V| the corresponding collection-level quantities.
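A minimal sketch of this linearly smoothed query-document model on a toy corpus (the smoothing weights follow the two formulas above):

```python
import math
from collections import Counter

def make_collection_model(docs):
    """Global model: P(q) = N(q)/(N+|V|) + |V|/(N+|V|) * 1/|V|."""
    counts = Counter(t for d in docs for t in d)
    N, V = sum(counts.values()), len(counts)
    return lambda q: counts[q] / (N + V) + (V / (N + V)) * (1 / V)

def p_term_given_doc(q, doc, p_global):
    """Local model: P(q|d) = N(q,d)/(N(d)+|V(d)|) + |V(d)|/(N(d)+|V(d)|) * P(q)."""
    counts = Counter(doc)
    Nd, Vd = len(doc), len(counts)
    return counts[q] / (Nd + Vd) + (Vd / (Nd + Vd)) * p_global(q)

def log_p_query_given_doc(query, doc, p_global):
    """log P(q_1 .. q_n | d) = sum_k log P(q_k | d)."""
    return sum(math.log(p_term_given_doc(q, doc, p_global)) for q in query)
```

Because the global model reserves mass for every vocabulary item, never-observed query terms get a small but nonzero probability in every document.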
Query-Translation Model
• According to the HMM:

  P(i = i_1 … i_n, e = e_1 … e_n) = P(e_1) · P(i_1 | e_1) · Π_{k=2..n} P(e_k | e_{k−1}) · P(i_k | e_k)

• Determining the N-best translations
– The most probable translation e* can be computed with the Viterbi search algorithm.
– Intermediate results of the Viterbi algorithm can be used by the A* search algorithm to efficiently compute the N most probable translations of i.
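The Viterbi step can be sketched directly from the HMM factorization above (a minimal 1-best version; the probability tables here are toy values, and a real system would add the A* pass for the N-best list):

```python
def viterbi_translation(query, p_i_given_e, p_bigram, p_init):
    """Viterbi search for e* = argmax P(e_1) P(i_1|e_1)
    prod_{k>=2} P(e_k|e_{k-1}) P(i_k|e_k).
    p_i_given_e[i]: dict mapping candidate translation e -> P(i|e);
    p_bigram[(e_prev, e)] = P(e|e_prev); p_init[e] = P(e_1)."""
    EPS = 1e-12  # floor for unseen events
    prev = {e: p_init.get(e, EPS) * p for e, p in p_i_given_e[query[0]].items()}
    backptrs = []
    for i in query[1:]:
        cur, bp = {}, {}
        for e, p_ie in p_i_given_e[i].items():
            best_e, best_s = max(
                ((e2, s * p_bigram.get((e2, e), EPS) * p_ie)
                 for e2, s in prev.items()),
                key=lambda t: t[1])
            cur[e], bp[e] = best_s, best_e
        prev = cur
        backptrs.append(bp)
    last = max(prev, key=prev.get)
    path = [last]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path)), prev[last]
```

Note how a strong bigram P(e_k | e_{k−1}) can override the dictionary's preferred per-term translation, which is exactly why the HMM beats word-by-word lookup.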
Query-Translation Model
• P(i | e) is estimated from a bilingual dictionary:

  P(i | e) = δ(i, e) / Σ_{i'} δ(i', e),  where δ(i, e) = 1 if (i, e) is a translation pair, 0 otherwise

• P(e | e') is estimated on the target document collection (order-free bigram LM):

  P(e | e') = P(e, e') / Σ_{e''} P(e', e'')

• Smoothing:

  P(e, e') = max(C(e, e') − β, 0) / N + β · P(e) · P(e'),  β = n_1 / (n_1 + 2 · n_2)

  where C(e, e') is the number of co-occurrences appearing in the corpus, P(e) is estimated as described above for P(q), and n_k represents the number of term pairs occurring k times in the corpus.
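The discounted estimate above can be sketched on a toy co-occurrence table:

```python
def discount_beta(pair_counts):
    """beta = n_1 / (n_1 + 2*n_2), where n_k counts pairs seen k times."""
    n1 = sum(1 for c in pair_counts.values() if c == 1)
    n2 = sum(1 for c in pair_counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)

def p_pair(e, e2, pair_counts, p_uni, N, beta):
    """P(e,e') = max(C(e,e') - beta, 0)/N + beta * P(e) * P(e')."""
    c = pair_counts.get((e, e2), 0)
    return max(c - beta, 0) / N + beta * p_uni(e) * p_uni(e2)
```

The discount moves a little mass from observed pairs to the unigram backoff, so unseen pairs still receive nonzero probability.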
CLIR Algorithm
P(i, d) = Σ_e P(i, e) · P(e, d) / Σ_{d'} P(e, d')

• Use two approximations to limit the set of possible translations and documents:

  Appr. 1:  P'(i, e) = K_1 · P(i, e) if e ∈ T_N(i), 0 otherwise
  Appr. 2:  P'(e, d) = K_2 · P(e, d) if d ∈ D(e), 0 otherwise

  where T_N(i) is the set of N-best translations of i and D(e) the set of documents matching e.
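Putting the pieces together, the restricted retrieval score can be sketched as follows (a minimal illustration of the sum over N-best translations, with the normalizing constants K_1, K_2 absorbed into the probability tables):

```python
def rank_documents(translations, p_ie, p_ed, docs):
    """translations: the N-best translations e of query i.
    p_ie[e] = P(i, e); p_ed[(e, d)] = P(e, d).
    Scores documents by P(i, d) = sum_e P(i,e) * P(e,d) / sum_d' P(e,d')."""
    scores = {d: 0.0 for d in docs}
    for e in translations:
        norm = sum(p_ed.get((e, d), 0.0) for d in docs)
        if norm == 0:
            continue  # translation matches no document
        for d in docs:
            scores[d] += p_ie[e] * p_ed.get((e, d), 0.0) / norm
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Restricting the outer sum to the N-best list is what makes the double sum tractable in practice.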
Complexity of CLIR Algorithm
• n: query length
• N: number of generated translations
• average number of translations of a term
• I: average number of documents spanned by each entry of the inverted file index
Text Preprocessing
Blind Relevance Feedback
• The R most relevant terms are selected from the top B ranked documents according to:

  w = log [ (r_w + 0.5) · (N − N_w − B + r_w + 0.5) / ((N_w − r_w + 0.5) · (B − r_w + 0.5)) ]

  r_w: the number of documents containing term w among the B top documents
  N_w: the number of documents containing term w in the collection of N documents
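The selection weight above is a one-liner; terms concentrated in the top-B documents relative to the whole collection get the highest weights:

```python
import math

def rsj_weight(r_w, N_w, B, N):
    """Relevance weight for expansion-term selection.
    r_w: docs containing w among the top B; N_w: docs containing w
    in the whole collection of N documents."""
    return math.log(((r_w + 0.5) * (N - N_w - B + r_w + 0.5)) /
                    ((N_w - r_w + 0.5) * (B - r_w + 0.5)))
```

The 0.5 offsets keep the weight finite when a term occurs in all, or none, of the top-B documents.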
Comparison with other CLIR Models
• Hiemstra (1999):

  P(i | d) = Π_{k=1..n} P(i_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k, e_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k | e_k) · P(e_k | d)

• Xu (2001):

  P(i | d) = Π_{k=1..n} [ α · P(i_k) + (1 − α) · Σ_{e_k} P(i_k | e_k) · P(e_k | d) ]

  where α is an interpolation weight.
Term Translation Model using Search Result Pages
• Apply page authority to search-result-based translation extraction:

  P(t | q) ≈ P(t | R_q) = Σ_{d_r} P(t | d_r) · P(d_r | q)
   = Σ_{d_r} P(t | d_r) · P(q | d_r) · P(d_r) / P(q)
   = (1 / P(q)) · Σ_{d_r} [ P(t | d_r) · Π_{i=1..k} P(q_i | d_r) ] · P(d_r)

  where P(d_r) = (# links of d_r) / (# total links)
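The scoring above can be sketched over a small set of search-result pages (the page statistics below are made-up; a real system would estimate P(t | d_r) from the retrieved snippets):

```python
def translation_score(t, query_terms, result_pages):
    """Unnormalized P(t|q): sum_dr P(t|dr) * prod_i P(q_i|dr) * P(dr),
    with P(dr) = links(dr) / total_links.
    result_pages: list of dicts with 'tf' (term -> relative frequency)
    and 'links' (in-link count, the page-authority signal)."""
    total_links = sum(p["links"] for p in result_pages)
    score = 0.0
    for p in result_pages:
        p_dr = p["links"] / total_links  # page authority as a prior
        p_t = p["tf"].get(t, 0.0)
        p_q = 1.0
        for q in query_terms:
            p_q *= p["tf"].get(q, 0.0)
        score += p_t * p_q * p_dr
    return score
```

Candidate target-language terms are then ranked by this score, so terms that co-occur with the query on authoritative pages win.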
Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003
Web Mining for CLIR
• The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models.
• The resulting translation models can be embedded in several ways in a retrieval model.
• Conventional approach: IR + MT (machine translation)
Problems in Query Translation
• Finding translations
– Lexical coverage: proper names and abbreviations
– Transliteration: phonemic representation of a named entity
  • Jeltsin, Eltsine, Yeltsin, and Jelzin (in Latin script)
• Pruning translation alternatives
• Weighting translation alternatives
Exploitation of Parallel Texts
• Using a pseudofeedback approach (Yang et al. 1998)
• Capturing global cross-language term associations (Yang et al. 1998; Lavrenko 2002)
• Transposing to a language-independent semantic space (Dumais et al. 1997; Yang et al. 1998)
• Training a statistical translation model (Nie et al. 1999; Franz et al. 2001; Hiemstra 2001; Xu et al. 2001)
Mining Process in PTMiner
Embedding Translation into IR Model
• Basic language model • Normalized log-likelihood ratio (NLLR)
Embedding Translation into IR Model
• Basic language model: log-likelihood ratio and normalized log-likelihood ratio (NLLR)
• Query model and document model
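The slide's formulas did not survive the transcript; as a hedged sketch, the NLLR as commonly defined in this line of work scores a document by how much its smoothed model improves on the collection (background) model for the query's terms. The λ value below is an illustrative choice, not the paper's setting:

```python
import math

def nllr(query_model, doc_model, bg_model, lam=0.5):
    """NLLR(Q, D) = sum_t P(t|Q) * log( (lam*P(t|D) + (1-lam)*P(t|C)) / P(t|C) ).
    query_model: P(t|Q); doc_model: P(t|D); bg_model: P(t|C)."""
    s = 0.0
    for t, p_q in query_model.items():
        p_c = bg_model[t]
        p_d = doc_model.get(t, 0.0)
        s += p_q * math.log((lam * p_d + (1 - lam) * p_c) / p_c)
    return s
```

A document whose model matches the query terms better than the background scores above zero; one that lacks them scores below zero.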
A Probability Model to Improve Word Alignment
Colin Cherry & Dekang Lin, University of Alberta
Outline
• Introduction • Probabilistic Word-Alignment Model • Word-Alignment Algorithm – Constraints – Features
Introduction
• Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation.
– E.g., translation lexicons, transfer rules
• Word-alignment problem
– Conventional approaches usually use co-occurrence models
  • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
– Indirect association problem (e.g., "CISCO System Inc." ↔ 思科 系統): Melamed (2000) proposed competitive linking along with an explicit noise model to solve it:

  score_B(u, v) = log [ B(links(u, v) | cooc(u, v), λ⁺) / B(links(u, v) | cooc(u, v), λ⁻) ]

  where B is a binomial likelihood, λ⁺ models true translation pairs, and λ⁻ models noise.

• Goal: propose a probabilistic word-alignment model that allows easy integration of context-specific features.
Probabilistic Word-Alignment Model
• Given E = e_1, e_2, …, e_m and F = f_1, f_2, …, f_n:
– If e_i and f_j are a translation pair, then the link l(e_i, f_j) exists
– If e_i has no corresponding translation, then the null link l(e_i, f_0) exists
– If f_j has no corresponding translation, then the null link l(e_0, f_j) exists
– An alignment A is a set of links such that every word in E and F participates in at least one link
• The alignment problem is to find the alignment A that maximizes P(A | E, F)
• IBM's translation models instead maximize P(A, F | E)
Probabilistic Word-Alignment Model (Cont.)
• Given A = {l_1, l_2, …, l_t}, where l_k = l(e_{ik}, f_{jk}), and writing l_i^j for the consecutive subset {l_i, l_{i+1}, …, l_j}:

  P(A | E, F) = P(l_1^t | E, F) = Π_{k=1..t} P(l_k | E, F, l_1^{k−1})

• Let C_k = {E, F, l_1^{k−1}} represent the context of l_k:

  P(l_k | C_k) = P(l_k, C_k) / P(C_k) = P(C_k | l_k) · P(l_k) / P(C_k)

  Since the context contains the word pair, P(C_k) = P(C_k | e_{ik}, f_{jk}) · P(e_{ik}, f_{jk}), and since the link entails its word pair, P(l_k) = P(l_k, e_{ik}, f_{jk}) = P(l_k | e_{ik}, f_{jk}) · P(e_{ik}, f_{jk}). Therefore:

  P(l_k | C_k) = P(l_k | e_{ik}, f_{jk}) · P(C_k | l_k) / P(C_k | e_{ik}, f_{jk})
Probabilistic Word-Alignment Model (Cont.)
• C_k = {E, F, l_1^{k−1}} is too complex to estimate.
• FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_{ik}, f_{jk}, FT_k).
• Let C_k' = {e_{ik}, f_{jk}} ∪ FT_k:

  P(l_k | C_k') = P(l_k | e_{ik}, f_{jk}) · P(C_k' | l_k) / P(C_k' | e_{ik}, f_{jk})
   = P(l_k | e_{ik}, f_{jk}) · P(FT_k | l_k) / P(FT_k | e_{ik}, f_{jk})

  Assuming the features in FT_k are conditionally independent:

  P(A | E, F) = Π_{k=1..t} [ P(l_k | e_{ik}, f_{jk}) · Π_{ft ∈ FT_k} P(ft | l_k) / P(ft | e_{ik}, f_{jk}) ]
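The per-link factor in the final product is straightforward to compute once the feature probabilities are estimated (the probability tables below are toy values; the resulting score is used comparatively, not as a normalized probability):

```python
def link_score(p_link, feature_probs, active_features):
    """Per-link factor: P(l_k|e,f) * prod_ft P(ft|l_k) / P(ft|e,f).
    p_link: P(l_k | e_ik, f_jk);
    feature_probs[ft] = (P(ft | l_k), P(ft | e_ik, f_jk))."""
    score = p_link
    for ft in active_features:
        p_given_link, p_given_pair = feature_probs[ft]
        score *= p_given_link / p_given_pair  # >1 boosts, <1 penalizes the link
    return score
```

A feature that is more likely in linked contexts than in general contexts multiplies the score up; one that is less likely multiplies it down.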
An Illustrative Example
Word-Alignment Algorithm
• Input: E, F, T_E
– T_E is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions
• Constraints
– One-to-one constraint: every word participates in exactly one link
– Cohesion constraint: use T_E to induce T_F with no crossing dependencies
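The one-to-one constraint can be illustrated with a simple greedy, competitive-linking-style loop; the paper itself uses a best-first search that also enforces cohesion, so this sketch covers only the one-to-one part:

```python
def greedy_one_to_one(scores):
    """scores: dict (i, j) -> link score.  Repeatedly take the
    best-scoring pair whose words are both still unlinked."""
    links, used_e, used_f = [], set(), set()
    for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i not in used_e and j not in used_f:
            links.append((i, j))
            used_e.add(i)
            used_f.add(j)
    return sorted(links)
```

Once a word is linked, all its other candidate pairs are blocked, which is what suppresses the indirect associations discussed earlier.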
Word-Alignment Algorithm (Cont.)
• Features
– Adjacency features ft_a: for any word pair (e_i, f_j), if a link l(e_{i'}, f_{j'}) exists where −2 ≤ i' − i ≤ 2 and −2 ≤ j' − j ≤ 2, then ft_a(i − i', j − j', e_{i'}) is active for this context.
– Dependency features ft_d: for any word pair (e_i, f_j), let e_{i'} be the governor of e_i, and let rel be the grammatical relationship between them. If a link l(e_{i'}, f_{j'}) exists, then ft_d(j − j', rel) is active for this context.
• E.g., pair(the_1, l') activates ft_d(1, det); pair(the_1, les) activates ft_a(1, 1, host) and ft_d(3, det).
Experimental Results
• Test bed: Hansard corpus – Training: 50K aligned pairs of sentences (Och & Ney 2000) – Testing: 500 pairs
Future Work
• The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignment will be pursued.
• The proposed model is capable of creating many-to-one alignments, as long as the null probabilities of the words added on the "many" side are taken into account.