Transcript (Slide 1)

Lecture 11: Statistical/Probabilistic Models for CLIR & Word Alignment

Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2011/05/30

Cross-Language Information Retrieval

• Query in the source language and retrieve relevant documents in the target language.

[Diagram: Source Query (e.g., "Hussein") → Query Translation → Target Query (海珊 / 侯賽因 / 哈珊 / 胡笙 (TC); 侯赛因 / 海珊 / 哈珊 (SC)) → Information Retrieval → Target Documents]

References

• Philip Resnik and Noah A. Smith, "The Web as a Parallel Corpus," Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Christopher C. Yang and Kar Wing Li, "Automatic Construction of English/Chinese Parallel Corpora," Journal of the American Society for Information Science and Technology, 2003
• Marcello Federico and Nicola Bertoldi (ITC-irst Centro per la Ricerca Scientifica e Tecnologica), "Statistical Cross-Language Information Retrieval using N-Best Query Translations," SIGIR 2002
• Wessel Kraaij, Jian-Yun Nie, and Michel Simard, "Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval," Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Colin Cherry and Dekang Lin (University of Alberta), "A Probability Model to Improve Word Alignment," ACL 2003

The Web as Corpus

Outline

• The Web as a Parallel Corpus (Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003)
• Automatic Construction of English/Chinese Parallel Corpora (Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003)
• Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval (Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003)

The Web as a Parallel Corpus

Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003

Parallel Corpora

• Bitexts, bodies of text in parallel translation, play an important role in machine translation and multilingual natural language processing.

• Not readily available in the necessary quantities
– Canadian parliamentary proceedings (Hansards) in English/French
– United Nations proceedings (Linguistic Data Consortium, http://www.ldc.upenn.edu/)
– Religious texts (Resnik, Olsen, and Diab)
– Localized versions of software manuals (Resnik and Melamed 1997; Menezes and Richardson)

STRAND

• An architecture for Structural Translation Recognition, Acquiring Natural Data (Resnik 1998, 1999)
• Identifies pairs of Web pages that are mutual translations.

• Web page authors disseminate information in multiple languages
– When presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure

Finding Parallel Web Pages

• Finding parallel text on the Web consists of three main steps:
– Locating pages that might have parallel translations
– Generating candidate pairs that might be translations
– Structurally filtering out non-translation candidate pairs
• Locating pages
– Two types: parents and siblings
– Ask AltaVista: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais")

Two types of Website Structure

STRAND

• Generating candidate pairs
– Automatic language identification (Dunning 1994)
– URL matching: manually creating a list of substitution rules
• E.g., http://mysite.com/english/home_en.html => http://mysite.com/big5/home_ch.html
– Document length: length(E) ≈ C · length(F)
• Structural filtering
– The heart of STRAND
– Markup analyzer: determines a set of pair-specific structural values for translation pairs
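The URL-matching and length-filtering steps above can be sketched in a few lines. The substitution rules and the tolerance threshold below are illustrative assumptions, not STRAND's actual rule set:

```python
# Hypothetical substitution rules; STRAND's real list is built manually per site.
SUBSTITUTION_RULES = [("/english/", "/big5/"), ("_en.", "_ch.")]

def candidate_url(url):
    """Guess the URL of the translated counterpart by applying the rules."""
    for src, tgt in SUBSTITUTION_RULES:
        url = url.replace(src, tgt)
    return url

def length_filter(len_e, len_f, c=1.0, tolerance=0.5):
    """Keep a candidate pair only if length(E) is roughly C * length(F)."""
    if len_f == 0:
        return False
    return abs(len_e / (c * len_f) - 1.0) <= tolerance

print(candidate_url("http://mysite.com/english/home_en.html"))
# http://mysite.com/big5/home_ch.html
print(length_filter(1200, 1000))  # True
```

A pair that survives both the URL and length checks would then go on to the structural (markup) filter.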

Automatic Construction of English/Chinese Parallel Corpora

Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003

Web Parallel Corpora

• Some web sites with bilingual text contain a completely separate monolingual sub-tree for each language.

• Title alignment and dynamic programming matching
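One way to realize the title-alignment-with-dynamic-programming idea is an LCS-style alignment over the two title sequences. The toy lexicon lookup below is a stand-in for whatever bilingual matching the paper actually uses:

```python
def align_titles(a, b, is_translation):
    """Dynamic-programming alignment of two title sequences.
    Scores +1 when is_translation(a[i], b[j]) holds; gaps cost nothing."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if is_translation(a[i - 1], b[j - 1]) else 0
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)
    # Trace back to recover the matched index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if is_translation(a[i - 1], b[j - 1]) and dp[i][j] == dp[i - 1][j - 1] + 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Toy lexicon standing in for a bilingual dictionary lookup.
lexicon = {("news", "新聞"), ("weather", "天氣")}
pairs = align_titles(["news", "sports", "weather"], ["新聞", "天氣"],
                     lambda x, y: (x, y) in lexicon)
print(pairs)  # [(0, 0), (2, 1)]
```

The aligned title pairs identify the parallel sub-trees from which page pairs are harvested.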

References

Statistical Cross-Language Information Retrieval using N-Best Query Translations

Marcello Federico and Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica

Outline

• Statistical CLIR Approach • Query Document Model • Query Translation Model

Statistical CLIR Approach

• CLIR problem
– Given a query i in the source language (Italian), one would like to find relevant documents d in the target language (English) within a collection D.

P(d | i) ∝ P(i, d)

– To bridge the language difference between query and documents, a hidden variable e is introduced, which represents an English translation of i.

Statistical CLIR Approach

P(i, d) = Σ_e P(i, e, d) ≈ Σ_e P(i, e) · P(d | e) = Σ_e P(i, e) · P(e, d) / Σ_d' P(e, d')

– P(e, d) is computed by the query-document model
– P(i, e) is computed by the query-translation model

Statistical CLIR Approach

Query-Document Model

P(q, d) = P(q | d) · P(d)

P(q = q_1 ... q_n | d) = Π_{k=1..n} P(q_k | d)

• Statistical LM & smoothing
– Term frequencies of a document are smoothed linearly, and the amount of probability assigned to never-observed terms is proportional to the size of the document vocabulary.

Local model:
P(q | d) = N(d, q) / (N(d) + |V(d)|) + |V(d)| / (N(d) + |V(d)|) · P(q)

Global model:
P(q) = N(q) / (N + |V|) + |V| / (N + |V|) · 1/|V|

where N(d, q) is the frequency of q in d, N(d) the length of d, V(d) the vocabulary of d, and N and V the size and vocabulary of the whole collection.
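A minimal sketch of the smoothed unigram models above: a local document model that backs off to the globally smoothed P(q). Both distributions sum to one over the collection vocabulary:

```python
from collections import Counter

def make_lm(docs):
    """Linear smoothing as on the slide: the unseen-term mass
    is proportional to the document vocabulary size |V(d)|."""
    corpus = [t for d in docs for t in d]
    N, V = len(corpus), len(set(corpus))
    global_counts = Counter(corpus)

    def p_global(q):
        # P(q) = N(q)/(N+|V|) + |V|/(N+|V|) * 1/|V|  =  (N(q)+1)/(N+|V|)
        return (global_counts[q] + 1.0) / (N + V)

    def p_local(q, d):
        counts, Nd, Vd = Counter(d), len(d), len(set(d))
        # P(q|d) = N(d,q)/(N(d)+|V(d)|) + |V(d)|/(N(d)+|V(d)|) * P(q)
        return counts[q] / (Nd + Vd) + Vd / (Nd + Vd) * p_global(q)

    return p_local, p_global

docs = [["a", "b", "a"], ["b", "c"]]
p_local, p_global = make_lm(docs)
vocab = {"a", "b", "c"}
print(sum(p_global(q) for q in vocab))          # ≈ 1.0
print(sum(p_local(q, docs[0]) for q in vocab))  # ≈ 1.0
```

The simplification in p_global follows because the uniform share 1/|V| of the unseen mass |V|/(N+|V|) adds exactly 1/(N+|V|) per term.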

Query-Translation Model

• According to the HMM:

P(i = i_1 ... i_n, e = e_1 ... e_n) = P(e_1) P(i_1 | e_1) Π_{k=2..n} P(e_k | e_{k-1}) P(i_k | e_k)

• Determine N-best translations
– The most probable translation e* can be computed through the Viterbi search algorithm.
– Intermediate results of the Viterbi algorithm can be used by the A* search algorithm to efficiently compute the N most probable translations of i.
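The Viterbi step for e* can be sketched as below. The toy lexicon, unigram, and bigram tables are invented for illustration; the real model estimates them as described on the next slides:

```python
import math

def viterbi_translate(query, lexicon, bigram, unigram):
    """Most probable translation e* under
    P(i, e) = P(e1) P(i1|e1) * prod_{k>=2} P(ek|e_{k-1}) P(ik|ek)."""
    # prev maps each candidate target word e to (best log score, best path).
    prev = {e: (math.log(unigram[e]) + math.log(p), [e])
            for e, p in lexicon[query[0]].items()}
    for term in query[1:]:
        cur = {}
        for e, p_ie in lexicon[term].items():
            # Best predecessor under the transition probability P(e | e').
            pred = max(prev, key=lambda e2: prev[e2][0] + math.log(bigram[(e2, e)]))
            score, path = prev[pred]
            cur[e] = (score + math.log(bigram[(pred, e)]) + math.log(p_ie),
                      path + [e])
        prev = cur
    return max(prev.values(), key=lambda sp: sp[0])[1]

# Invented Italian-to-English toy tables.
lexicon = {"banca": {"bank": 0.7, "bench": 0.3},
           "dati": {"data": 0.9, "dates": 0.1}}
unigram = {"bank": 0.3, "bench": 0.1, "data": 0.4, "dates": 0.2}
bigram = {("bank", "data"): 0.5, ("bank", "dates"): 0.1,
          ("bench", "data"): 0.05, ("bench", "dates"): 0.05}
print(viterbi_translate(["banca", "dati"], lexicon, bigram, unigram))
# ['bank', 'data']
```

Keeping the per-state intermediate scores (the prev tables) is what lets an A* pass extract the N-best translations afterwards.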

Query-Translation Model

• P(i | e) is estimated from a bilingual dictionary:

P(i | e) = δ(i, e) / Σ_i' δ(i', e),  where δ(i, e) = 1 if (i, e) is a translation pair, 0 otherwise

• P(e | e') is estimated on the target document collection (order-free bigram LM):

P(e | e') = P(e, e') / Σ_e'' P(e'', e')

• Smoothing:

P(e, e') = max( (C(e, e') − β) / N, 0 ) + β · P(e) · P(e'),  β = n_1 / (n_1 + 2 n_2)

where C(e, e') is the number of co-occurrences appearing in the corpus, P(e) is estimated as described above for P(q), and n_k represents the number of term pairs occurring k times in the corpus.
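The absolute-discounting smoothing for P(e, e') can be sketched directly from the formula. The pair counts and unigram probabilities below are invented toy values:

```python
from collections import Counter

def smoothed_joint(pair_counts, p_unigram):
    """P(e,e') = max((C(e,e') - beta)/N, 0) + beta * P(e) * P(e'),
    with beta = n1 / (n1 + 2*n2) from the pair-frequency counts."""
    N = sum(pair_counts.values())
    freq_of_freq = Counter(pair_counts.values())
    n1, n2 = freq_of_freq[1], freq_of_freq[2]
    beta = n1 / (n1 + 2 * n2)

    def p(e, e2):
        c = pair_counts.get((e, e2), 0)
        return max((c - beta) / N, 0.0) + beta * p_unigram(e) * p_unigram(e2)

    return p, beta

pair_counts = {("a", "b"): 2, ("b", "c"): 1, ("a", "c"): 1}
uni = {"a": 0.5, "b": 0.3, "c": 0.2}
p, beta = smoothed_joint(pair_counts, uni.get)
print(beta)        # 0.5  (n1 = 2, n2 = 1)
print(p("a", "b"))  # ≈ 0.45  = (2 - 0.5)/4 + 0.5 * 0.5 * 0.3
```

Note that unseen pairs still receive the backoff mass β · P(e) · P(e'), so the order-free bigram never assigns zero probability.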

CLIR Algorithm

P(i, d) = Σ_e P(i, e) · P(e, d) / Σ_d' P(e, d')

• Use two approximations to limit the set of possible translations and documents:

Appr. 1: P'(i, e) = K_1 · P(i, e) if e ∈ T_N(i), 0 otherwise
Appr. 2: P'(e, d) = K_2 · P(e, d) if d ∈ Δ(e), 0 otherwise

where T_N(i) is the set of N-best translations of i, Δ(e) the set of documents matching e, and K_1, K_2 normalization constants.

Complexity of CLIR Algorithm

• n : query length
• N : number of generated translations
• the average number of translations of a term
• I : the average number of documents spanned by each entry of the inverted file index

Text Preprocessing

Blind Relevance Feedback

• The R most relevant terms are selected from the top B ranked documents according to:

w = log [ (r_w + 0.5)(N − N_w − B + r_w + 0.5) / ((N_w − r_w + 0.5)(B − r_w + 0.5)) ]

r_w : the number of documents containing term w among the B top documents
N_w : the number of documents in the collection containing term w
N : the number of documents in the collection
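A sketch of the feedback term selection. The document counts below are invented, with N_w and N taken as collection document frequency and collection size:

```python
import math

def rsj_weight(r_w, n_w, B, N):
    """w = log[(r_w+0.5)(N - N_w - B + r_w + 0.5) /
               ((N_w - r_w + 0.5)(B - r_w + 0.5))]"""
    return math.log((r_w + 0.5) * (N - n_w - B + r_w + 0.5)
                    / ((n_w - r_w + 0.5) * (B - r_w + 0.5)))

def expand_query(term_stats, B, N, R):
    """term_stats: (term, r_w, N_w) triples; keep the R best-weighted terms."""
    ranked = sorted(term_stats, key=lambda t: rsj_weight(t[1], t[2], B, N),
                    reverse=True)
    return [t[0] for t in ranked[:R]]

# Invented statistics: "the" is frequent everywhere, so it scores low.
stats = [("nuclear", 8, 20), ("the", 10, 990), ("plant", 6, 50)]
print(expand_query(stats, B=10, N=1000, R=2))  # ['nuclear', 'plant']
```

Terms concentrated in the top-B documents but rare in the collection get the highest weights, which is the intended behavior of blind feedback.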

Comparison with other CLIR Models

• Hiemstra (1999):

P(i | d) = Π_{k=1..n} P(i_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k, e_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k | e_k) P(e_k | d)

• Xu (2001):

P(i | d) = Π_{k=1..n} [ α P(i_k) + (1 − α) Σ_{e_k} P(i_k | e_k) P(e_k | d) ]

Term Translation Model using Search Result Pages

• Apply page authority to search-result-based translation extraction:

P(t | q) ≈ P(t | R_q) = Σ_dr P(t | dr) P(dr | q) = Σ_dr P(t | dr) P(q | dr) P(dr) / P(q)
= 1/P(q) · Σ_dr [ P(t | dr) Π_{i=1..k} P(q_i | dr) ] P(dr)

where P(dr) = (# links of dr) / (# total links)
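The scoring above can be sketched as follows. The per-page term distributions and in-link counts are invented for illustration, and the constant 1/P(q) factor is dropped since it does not affect the ranking of candidate translations t:

```python
def translation_score(t, query_terms, results):
    """Score(t) ∝ sum over result pages dr of
    P(t|dr) * prod_i P(q_i|dr) * P(dr), with P(dr) = inlinks / total inlinks."""
    total_links = sum(links for _, links in results)
    score = 0.0
    for term_probs, links in results:
        p_q = 1.0
        for qi in query_terms:
            p_q *= term_probs.get(qi, 0.0)
        score += term_probs.get(t, 0.0) * p_q * (links / total_links)
    return score

# Two hypothetical search-result pages: (term distribution, in-link count).
results = [({"hussein": 0.10, "海珊": 0.05}, 3),
           ({"hussein": 0.20, "侯賽因": 0.02}, 1)]
print(translation_score("海珊", ["hussein"], results))  # ≈ 0.00375
```

Pages with more in-links (higher authority) contribute proportionally more to each candidate's score.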

Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval

Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003

Web Mining for CLIR

• The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models.

• The resulting translation models can be embedded in several ways in a retrieval model.

• Conventional approach: IR + MT (machine translation)

Problems in Query Translation

• Finding translations
– Lexical coverage: proper names and abbreviations
– Transliteration: phonemic representation of a named entity
• E.g., Jeltsin, Eltsine, Yeltsin, and Jelzin (in Latin script)
• Pruning translation alternatives
• Weighting translation alternatives

Exploitation of Parallel Texts

• Using a pseudofeedback approach (Yang et al. 1998)
• Capturing global cross-language term associations (Yang et al. 1998; Lavrenko 2002)
• Transposing to a language-independent semantic space (Dumais et al. 1997; Yang et al. 1998)
• Training a statistical translation model (Nie et al. 1999; Franz et al. 2001; Hiemstra 2001; Xu et al. 2001)

Mining Process in PTMiner

Embedding Translation into IR Model

• Basic language model • Normalized log-likelihood ratio (NLLR)

Embedding Translation into IR Model

* Basic language model (log-likelihood ratio; normalized log-likelihood ratio)
* Query model
* Document model

A Probability Model to Improve Word Alignment

Colin Cherry & Dekang Lin, University of Alberta

Outline

• Introduction • Probabilistic Word-Alignment Model • Word-Alignment Algorithm – Constraints – Features

Introduction

• Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation.

– E.g., translation lexicons, transfer rules
• Word-alignment problem
– Conventional approaches usually used co-occurrence models
• E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
– Indirect association problem: Melamed (2000) proposed competitive linking along with an explicit noise model to solve it

score_B(u, v) = log [ B(links(u, v) | cooc(u, v), λ+) / B(links(u, v) | cooc(u, v), λ−) ]

• E.g., "CISCO Systems Inc." ↔ 思科系統
• Goal: propose a probabilistic word-alignment model which allows easy integration of context-specific features.
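Melamed's link score contrasts two binomial hypotheses: the observed link count looks either like a true translation pair (rate λ+) or like noise (rate λ−). A sketch with illustrative λ values:

```python
from math import comb, log

def binom_pmf(k, n, p):
    """B(k | n, p): binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def score_b(links, cooc, lam_plus, lam_minus):
    """log B(links|cooc, λ+) / B(links|cooc, λ−): positive when the link
    counts look like a true translation pair, negative when they look like noise."""
    return log(binom_pmf(links, cooc, lam_plus) / binom_pmf(links, cooc, lam_minus))

# Illustrative rates: a true pair links most of its co-occurrences.
print(score_b(8, 10, 0.7, 0.05) > 0)  # True: 8 links out of 10 co-occurrences
print(score_b(1, 10, 0.7, 0.05) > 0)  # False: 1 link out of 10 looks like noise
```

Competitive linking then greedily accepts the highest-scoring links subject to the one-to-one restriction.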

Probabilistic Word-Alignment Model

• Given E = e_1, e_2, ..., e_m and F = f_1, f_2, ..., f_n
– If e_i and f_j are a translation pair, then the link l(e_i, f_j) exists
– If e_i has no corresponding translation, then the null link l(e_i, f_0) exists
– If f_j has no corresponding translation, then the null link l(e_0, f_j) exists
– An alignment A is a set of links such that every word in E and F participates in at least one link
• The alignment problem is to find the alignment A that maximizes P(A | E, F)
• IBM's translation model: maximize P(A, F | E)

Probabilistic Word-Alignment Model (Cont.)

• Given A = {l_1, l_2, ..., l_t}, where l_k = l(e_ik, f_jk), and let l_i^j denote the consecutive subset {l_i, l_{i+1}, ..., l_j} of A:

P(A | E, F) = P(l_1^t | E, F) = Π_{k=1..t} P(l_k | E, F, l_1^{k−1})

• Let C_k = {E, F, l_1^{k−1}} represent the context of l_k:

P(l_k | C_k) = P(l_k, C_k) / P(C_k) = P(C_k | l_k) P(l_k) / P(C_k)

• Noting that P(e_ik, f_jk | C_k) = 1 (the context contains E and F) and P(e_ik, f_jk | l_k) = 1 (a link entails its word pair):

P(l_k | C_k) = P(l_k | e_ik, f_jk) · P(C_k | l_k) / P(C_k | e_ik, f_jk)

Probabilistic Word-Alignment Model (Cont.)

• C_k = {E, F, l_1^{k−1}} is too complex to estimate
• FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_ik, f_jk, FT_k)
• Let C_k' = {e_ik, f_jk} ∪ FT_k:

P(l_k | C_k') = P(l_k | e_ik, f_jk) · P(C_k' | l_k) / P(C_k' | e_ik, f_jk)
= P(l_k | e_ik, f_jk) · P(FT_k | l_k) / P(FT_k | e_ik, f_jk)

• Assuming the features in FT_k are mutually independent:

P(A | E, F) ≈ Π_{k=1..t} [ P(l_k | e_ik, f_jk) · Π_{ft ∈ FT_k} P(ft | l_k) / P(ft | e_ik, f_jk) ]
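Under the independence assumption above, each link's probability is its pairwise probability rescaled by a likelihood ratio per active feature. A minimal sketch with invented numbers:

```python
def link_prob(p_link_given_pair, feature_probs, active_features):
    """P(l_k | C_k') ≈ P(l_k | e, f) * prod over active ft of
    P(ft | l_k) / P(ft | e, f)."""
    p = p_link_given_pair
    for ft in active_features:
        p_given_link, p_given_pair = feature_probs[ft]
        p *= p_given_link / p_given_pair
    return p

# Invented numbers: one dependency feature that is twice as likely
# given a true link as given the bare word pair.
feature_probs = {"ft_d(-1, det)": (0.6, 0.3)}
print(link_prob(0.2, feature_probs, ["ft_d(-1, det)"]))  # ≈ 0.4
```

Features that are more likely under a true link than under the bare word pair push the link probability up; uninformative features (ratio near 1) leave it unchanged.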

An Illustrative Example

Word-Alignment Algorithm

• Input
– E, F, and T_E
– T_E is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions
• Constraints
– One-to-one constraint: every word participates in exactly one link
– Cohesion constraint: use T_E to induce T_F with no crossing dependencies

Word-Alignment Algorithm (Cont.)

• Features
– Adjacency features ft_a: for any word pair (e_i, f_j), if a link l(e_i', f_j') exists where |i − i'| ≤ 2 and |j − j'| ≤ 2, then ft_a(i − i', j − j', e_i') is active for this context.
– Dependency features ft_d: for any word pair (e_i, f_j), let e_i' be the governor of e_i, and let rel be the grammatical relationship between them. If a link l(e_i', f_j') exists, then ft_d(j − j', rel) is active for this context.
– Example: the candidate pairs (the_1, l') and (the_1, les) activate features such as ft_a(−1, −1, host), ft_d(−1, det), and ft_d(−3, det).

Experimental Results

• Test bed: Hansard corpus – Training: 50K aligned pairs of sentences (Och & Ney 2000) – Testing: 500 pairs

Future Work

• The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignment will be pursued.
• The proposed model is capable of creating many-to-one alignments, using the null probabilities of the words added on the "many" side.