Transcript (Slide 1)

Lecture 11: Statistical/Probabilistic Models for CLIR & Word Alignment

Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2011/05/30

Cross-Language Information Retrieval

• Query in the source language and retrieve relevant documents in the target language.

[Diagram: Source Query (e.g., "Hussein") → Query Translation → Target Query (海珊 / 侯賽因 / 哈珊 / 胡笙 (TC); 侯赛因 / 海珊 / 哈珊 (SC)) → Information Retrieval → Target Documents]

References

• Philip Resnik and Noah A. Smith, "The Web as a Parallel Corpus," Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Christopher C. Yang and Kar Wing Li, "Automatic Construction of English/Chinese Parallel Corpora," Journal of the American Society for Information Science and Technology, 2003
• Marcello Federico and Nicola Bertoldi (ITC-irst Centro per la Ricerca Scientifica e Tecnologica), "Statistical Cross-Language Information Retrieval using N-Best Query Translations," SIGIR 2002
• Wessel Kraaij, Jian-Yun Nie, and Michel Simard, "Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval," Computational Linguistics, Special Issue on the Web as Corpus, 2003
• Colin Cherry and Dekang Lin (University of Alberta), "A Probability Model to Improve Word Alignment," ACL 2003

The Web as Corpus

Outline

• The Web as a Parallel Corpus (Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003)
• Automatic Construction of English/Chinese Parallel Corpora (Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003)
• Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval (Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003)

The Web as a Parallel Corpus

Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003

Parallel Corpora

• Bitexts, bodies of text in parallel translation, play an important role in machine translation and multilingual natural language processing.

• Not readily available in the necessary quantities
– Canadian parliamentary proceedings (Hansards) in English/French
– United Nations proceedings (Linguistic Data Consortium, http://www.ldc.upenn.edu/)
– Religious texts (Resnik, Olsen, and Diab)
– Localized versions of software manuals (Resnik and Melamed 1997; Menezes and Richardson)

STRAND

• An architecture for Structural Translation Recognition, Acquiring Natural Data (Resnik 1998, 1999)
• Identifies pairs of Web pages that are mutual translations.

• Web page authors disseminate information in multiple languages
– When presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure

Finding Parallel Web Pages

• Finding parallel text on the Web consists of three main steps:
– Locating pages that might have parallel translations
– Generating candidate pairs that might be translations
– Structurally filtering out non-translation candidate pairs
• Locating pages
– Two types: parents and siblings
– Ask AltaVista: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais")

Two types of Website Structure

STRAND

• Generating candidate pairs
– Automatic language identification (Dunning 1994)
– URL matching: manually creating a list of substitution rules
• E.g., http://mysite.com/english/home_en.html => http://mysite.com/big5/home_ch.html
– Document length: length(E) ≈ C · length(F)
• Structural filtering
– The heart of STRAND
– Markup analyzer: determines a set of pair-specific structural values for translation pairs
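The URL-matching and length-filtering steps above can be sketched in a few lines. The substitution rules and the tolerance threshold below are illustrative assumptions, not STRAND's actual rule set:

```python
# Hypothetical substitution rules; STRAND's real list is built manually per site.
SUBSTITUTION_RULES = [("/english/", "/big5/"), ("_en.", "_ch.")]

def candidate_url(url):
    """Guess the URL of the translated counterpart by applying the rules."""
    for src, tgt in SUBSTITUTION_RULES:
        url = url.replace(src, tgt)
    return url

def length_filter(len_e, len_f, c=1.0, tolerance=0.5):
    """Keep a candidate pair only if length(E) is roughly C * length(F)."""
    if len_f == 0:
        return False
    return abs(len_e / (c * len_f) - 1.0) <= tolerance

print(candidate_url("http://mysite.com/english/home_en.html"))
# http://mysite.com/big5/home_ch.html
print(length_filter(1200, 1000))  # True
```

A pair that survives both the URL and length checks would then go on to the structural (markup) filter.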

Automatic Construction of English/Chinese Parallel Corpora

Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003

Web Parallel Corpora

• Some web sites with bilingual text contain a completely separate monolingual sub-tree for each language.

• Title alignment and dynamic programming matching
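One way to realize the title-alignment-with-dynamic-programming idea is an LCS-style alignment over the two title sequences. The toy lexicon lookup below is a stand-in for whatever bilingual matching the paper actually uses:

```python
def align_titles(a, b, is_translation):
    """Dynamic-programming alignment of two title sequences.
    Scores +1 when is_translation(a[i], b[j]) holds; gaps cost nothing."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if is_translation(a[i - 1], b[j - 1]) else 0
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)
    # Trace back to recover the matched index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if is_translation(a[i - 1], b[j - 1]) and dp[i][j] == dp[i - 1][j - 1] + 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Toy lexicon standing in for a bilingual dictionary lookup.
lexicon = {("news", "新聞"), ("weather", "天氣")}
pairs = align_titles(["news", "sports", "weather"], ["新聞", "天氣"],
                     lambda x, y: (x, y) in lexicon)
print(pairs)  # [(0, 0), (2, 1)]
```

The aligned title pairs identify the parallel sub-trees from which page pairs are harvested.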

References

Statistical Cross-Language Information Retrieval using N-Best Query Translations

Marcello Federico and Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica

Outline

• Statistical CLIR Approach • Query Document Model • Query Translation Model

Statistical CLIR Approach

• CLIR problem
– Given a query i in the source language (Italian), one would like to find relevant documents d in the target language (English) within a collection D.

P(d | i) ∝ P(i, d)

– To bridge the language difference between query and documents, a hidden variable e is introduced, which represents an English translation of i.

Statistical CLIR Approach

P(i, d) = Σ_e P(i, e, d) ≈ Σ_e P(i, e) · P(d | e) = Σ_e P(i, e) · P(e, d) / Σ_d' P(e, d')

– P(e, d) is computed by the query-document model
– P(i, e) is computed by the query-translation model

Statistical CLIR Approach

Query-Document Model

P(q, d) = P(q | d) · P(d)

P(q = q_1 ... q_n | d) = Π_{k=1..n} P(q_k | d)

• Statistical LM & smoothing
– Term frequencies of a document are smoothed linearly, and the amount of probability assigned to never-observed terms is proportional to the size of the document vocabulary.

Local model:
P(q | d) = N(d, q) / (N(d) + |V(d)|) + |V(d)| / (N(d) + |V(d)|) · P(q)

Global model:
P(q) = N(q) / (N + |V|) + |V| / (N + |V|) · 1/|V|

where N(d, q) is the frequency of q in d, N(d) the length of d, V(d) the vocabulary of d, and N and V the size and vocabulary of the whole collection.
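A minimal sketch of the smoothed unigram models above: a local document model that backs off to the globally smoothed P(q). Both distributions sum to one over the collection vocabulary:

```python
from collections import Counter

def make_lm(docs):
    """Linear smoothing as on the slide: the unseen-term mass
    is proportional to the document vocabulary size |V(d)|."""
    corpus = [t for d in docs for t in d]
    N, V = len(corpus), len(set(corpus))
    global_counts = Counter(corpus)

    def p_global(q):
        # P(q) = N(q)/(N+|V|) + |V|/(N+|V|) * 1/|V|  =  (N(q)+1)/(N+|V|)
        return (global_counts[q] + 1.0) / (N + V)

    def p_local(q, d):
        counts, Nd, Vd = Counter(d), len(d), len(set(d))
        # P(q|d) = N(d,q)/(N(d)+|V(d)|) + |V(d)|/(N(d)+|V(d)|) * P(q)
        return counts[q] / (Nd + Vd) + Vd / (Nd + Vd) * p_global(q)

    return p_local, p_global

docs = [["a", "b", "a"], ["b", "c"]]
p_local, p_global = make_lm(docs)
vocab = {"a", "b", "c"}
print(sum(p_global(q) for q in vocab))          # ≈ 1.0
print(sum(p_local(q, docs[0]) for q in vocab))  # ≈ 1.0
```

The simplification in p_global follows because the uniform share 1/|V| of the unseen mass |V|/(N+|V|) adds exactly 1/(N+|V|) per term.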

Query-Translation Model

• According to the HMM:

P(i = i_1 ... i_n, e = e_1 ... e_n) = P(e_1) P(i_1 | e_1) Π_{k=2..n} P(e_k | e_{k-1}) P(i_k | e_k)

• Determine N-best translations
– The most probable translation e* can be computed through the Viterbi search algorithm.
– Intermediate results of the Viterbi algorithm can be used by the A* search algorithm to efficiently compute the N most probable translations of i.
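The Viterbi step for e* can be sketched as below. The toy lexicon, unigram, and bigram tables are invented for illustration; the real model estimates them as described on the next slides:

```python
import math

def viterbi_translate(query, lexicon, bigram, unigram):
    """Most probable translation e* under
    P(i, e) = P(e1) P(i1|e1) * prod_{k>=2} P(ek|e_{k-1}) P(ik|ek)."""
    # prev maps each candidate target word e to (best log score, best path).
    prev = {e: (math.log(unigram[e]) + math.log(p), [e])
            for e, p in lexicon[query[0]].items()}
    for term in query[1:]:
        cur = {}
        for e, p_ie in lexicon[term].items():
            # Best predecessor under the transition probability P(e | e').
            pred = max(prev, key=lambda e2: prev[e2][0] + math.log(bigram[(e2, e)]))
            score, path = prev[pred]
            cur[e] = (score + math.log(bigram[(pred, e)]) + math.log(p_ie),
                      path + [e])
        prev = cur
    return max(prev.values(), key=lambda sp: sp[0])[1]

# Invented Italian-to-English toy tables.
lexicon = {"banca": {"bank": 0.7, "bench": 0.3},
           "dati": {"data": 0.9, "dates": 0.1}}
unigram = {"bank": 0.3, "bench": 0.1, "data": 0.4, "dates": 0.2}
bigram = {("bank", "data"): 0.5, ("bank", "dates"): 0.1,
          ("bench", "data"): 0.05, ("bench", "dates"): 0.05}
print(viterbi_translate(["banca", "dati"], lexicon, bigram, unigram))
# ['bank', 'data']
```

Keeping the per-state intermediate scores (the prev tables) is what lets an A* pass extract the N-best translations afterwards.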

Query-Translation Model

• P(i | e) is estimated from a bilingual dictionary:

P(i | e) = δ(i, e) / Σ_i' δ(i', e),  where δ(i, e) = 1 if (i, e) is a translation pair, 0 otherwise

• P(e | e') is estimated on the target document collection (order-free bigram LM):

P(e | e') = P(e, e') / Σ_e'' P(e'', e')

• Smoothing:

P(e, e') = max( (C(e, e') − β) / N, 0 ) + β · P(e) · P(e'),  β = n_1 / (n_1 + 2 n_2)

where C(e, e') is the number of co-occurrences appearing in the corpus, P(e) is estimated as described above for P(q), and n_k represents the number of term pairs occurring k times in the corpus.
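The absolute-discounting smoothing for P(e, e') can be sketched directly from the formula. The pair counts and unigram probabilities below are invented toy values:

```python
from collections import Counter

def smoothed_joint(pair_counts, p_unigram):
    """P(e,e') = max((C(e,e') - beta)/N, 0) + beta * P(e) * P(e'),
    with beta = n1 / (n1 + 2*n2) from the pair-frequency counts."""
    N = sum(pair_counts.values())
    freq_of_freq = Counter(pair_counts.values())
    n1, n2 = freq_of_freq[1], freq_of_freq[2]
    beta = n1 / (n1 + 2 * n2)

    def p(e, e2):
        c = pair_counts.get((e, e2), 0)
        return max((c - beta) / N, 0.0) + beta * p_unigram(e) * p_unigram(e2)

    return p, beta

pair_counts = {("a", "b"): 2, ("b", "c"): 1, ("a", "c"): 1}
uni = {"a": 0.5, "b": 0.3, "c": 0.2}
p, beta = smoothed_joint(pair_counts, uni.get)
print(beta)        # 0.5  (n1 = 2, n2 = 1)
print(p("a", "b"))  # ≈ 0.45  = (2 - 0.5)/4 + 0.5 * 0.5 * 0.3
```

Note that unseen pairs still receive the backoff mass β · P(e) · P(e'), so the order-free bigram never assigns zero probability.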

CLIR Algorithm

P(i, d) = Σ_e P(i, e) · P(e, d) / Σ_d' P(e, d')

• Use two approximations to limit the set of possible translations and documents:

Appr. 1: P'(i, e) = K_1 · P(i, e) if e ∈ T_N(i), 0 otherwise
Appr. 2: P'(e, d) = K_2 · P(e, d) if d ∈ Δ(e), 0 otherwise

where T_N(i) is the set of N-best translations of i, Δ(e) the set of documents matching e, and K_1, K_2 normalization constants.

Complexity of CLIR Algorithm

• n : query length
• N : number of generated translations
• the average number of translations of a term
• I : the average number of documents spanned by each entry of the inverted file index

Text Preprocessing

Blind Relevance Feedback

• The R most relevant terms are selected from the top B ranked documents according to:

w = log [ (r_w + 0.5)(N − N_w − B + r_w + 0.5) / ((N_w − r_w + 0.5)(B − r_w + 0.5)) ]

r_w : the number of documents containing term w among the B top documents
N_w : the number of documents in the collection containing term w
N : the number of documents in the collection
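A sketch of the feedback term selection. The document counts below are invented, with N_w and N taken as collection document frequency and collection size:

```python
import math

def rsj_weight(r_w, n_w, B, N):
    """w = log[(r_w+0.5)(N - N_w - B + r_w + 0.5) /
               ((N_w - r_w + 0.5)(B - r_w + 0.5))]"""
    return math.log((r_w + 0.5) * (N - n_w - B + r_w + 0.5)
                    / ((n_w - r_w + 0.5) * (B - r_w + 0.5)))

def expand_query(term_stats, B, N, R):
    """term_stats: (term, r_w, N_w) triples; keep the R best-weighted terms."""
    ranked = sorted(term_stats, key=lambda t: rsj_weight(t[1], t[2], B, N),
                    reverse=True)
    return [t[0] for t in ranked[:R]]

# Invented statistics: "the" is frequent everywhere, so it scores low.
stats = [("nuclear", 8, 20), ("the", 10, 990), ("plant", 6, 50)]
print(expand_query(stats, B=10, N=1000, R=2))  # ['nuclear', 'plant']
```

Terms concentrated in the top-B documents but rare in the collection get the highest weights, which is the intended behavior of blind feedback.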

Comparison with other CLIR Models

• Hiemstra (1999):

P(i | d) = Π_{k=1..n} P(i_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k, e_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k | e_k) P(e_k | d)

• Xu (2001):

P(i | d) = Π_{k=1..n} [ α P(i_k) + (1 − α) Σ_{e_k} P(i_k | e_k) P(e_k | d) ]

Term Translation Model using Search Result Pages

• Apply page authority to search-result-based translation extraction:

P(t | q) ≈ P(t | R_q) = Σ_dr P(t | dr) P(dr | q) = Σ_dr P(t | dr) P(q | dr) P(dr) / P(q)
= 1/P(q) · Σ_dr [ P(t | dr) Π_{i=1..k} P(q_i | dr) ] P(dr)

where P(dr) = (# links of dr) / (# total links)
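The scoring above can be sketched as follows. The per-page term distributions and in-link counts are invented for illustration, and the constant 1/P(q) factor is dropped since it does not affect the ranking of candidate translations t:

```python
def translation_score(t, query_terms, results):
    """Score(t) ∝ sum over result pages dr of
    P(t|dr) * prod_i P(q_i|dr) * P(dr), with P(dr) = inlinks / total inlinks."""
    total_links = sum(links for _, links in results)
    score = 0.0
    for term_probs, links in results:
        p_q = 1.0
        for qi in query_terms:
            p_q *= term_probs.get(qi, 0.0)
        score += term_probs.get(t, 0.0) * p_q * (links / total_links)
    return score

# Two hypothetical search-result pages: (term distribution, in-link count).
results = [({"hussein": 0.10, "海珊": 0.05}, 3),
           ({"hussein": 0.20, "侯賽因": 0.02}, 1)]
print(translation_score("海珊", ["hussein"], results))  # ≈ 0.00375
```

Pages with more in-links (higher authority) contribute proportionally more to each candidate's score.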

Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval

Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003

Web Mining for CLIR

• The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models.

• The resulting translation models can be embedded in several ways in a retrieval model.

• Conventional approach: IR + MT (machine translation)

Problems in Query Translation

• Finding translations
– Lexical coverage: proper names and abbreviations
– Transliteration: phonemic representation of a named entity
• E.g., Jeltsin, Eltsine, Yeltsin, and Jelzin (in Latin script)
• Pruning translation alternatives
• Weighting translation alternatives

Exploitation of Parallel Texts

• Using a pseudofeedback approach (Yang et al. 1998)
• Capturing global cross-language term associations (Yang et al. 1998; Lavrenko 2002)
• Transposing to a language-independent semantic space (Dumais et al. 1997; Yang et al. 1998)
• Training a statistical translation model (Nie et al. 1999; Franz et al. 2001; Hiemstra 2001; Xu et al. 2001)

Mining Process in PTMiner

Embedding Translation into IR Model

• Basic language model • Normalized log-likelihood ratio (NLLR)

Embedding Translation into IR Model

* Basic language model (log-likelihood ratio; normalized log-likelihood ratio)
* Query model
* Document model

A Probability Model to Improve Word Alignment

Colin Cherry & Dekang Lin, University of Alberta

Outline

• Introduction • Probabilistic Word-Alignment Model • Word-Alignment Algorithm – Constraints – Features

Introduction

• Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation.

– E.g., translation lexicons, transfer rules
• Word-alignment problem
– Conventional approaches usually used co-occurrence models
• E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
– Indirect association problem: Melamed (2000) proposed competitive linking along with an explicit noise model to solve it

score_B(u, v) = log [ B(links(u, v) | cooc(u, v), λ+) / B(links(u, v) | cooc(u, v), λ−) ]

• E.g., "CISCO Systems Inc." ↔ 思科系統
• Goal: propose a probabilistic word-alignment model which allows easy integration of context-specific features.
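Melamed's link score contrasts two binomial hypotheses: the observed link count looks either like a true translation pair (rate λ+) or like noise (rate λ−). A sketch with illustrative λ values:

```python
from math import comb, log

def binom_pmf(k, n, p):
    """B(k | n, p): binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def score_b(links, cooc, lam_plus, lam_minus):
    """log B(links|cooc, λ+) / B(links|cooc, λ−): positive when the link
    counts look like a true translation pair, negative when they look like noise."""
    return log(binom_pmf(links, cooc, lam_plus) / binom_pmf(links, cooc, lam_minus))

# Illustrative rates: a true pair links most of its co-occurrences.
print(score_b(8, 10, 0.7, 0.05) > 0)  # True: 8 links out of 10 co-occurrences
print(score_b(1, 10, 0.7, 0.05) > 0)  # False: 1 link out of 10 looks like noise
```

Competitive linking then greedily accepts the highest-scoring links subject to the one-to-one restriction.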

Probabilistic Word-Alignment Model

• Given E = e_1, e_2, ..., e_m and F = f_1, f_2, ..., f_n
– If e_i and f_j are a translation pair, then the link l(e_i, f_j) exists
– If e_i has no corresponding translation, then the null link l(e_i, f_0) exists
– If f_j has no corresponding translation, then the null link l(e_0, f_j) exists
– An alignment A is a set of links such that every word in E and F participates in at least one link
• The alignment problem is to find the alignment A that maximizes P(A | E, F)
• IBM's translation model: maximize P(A, F | E)

Probabilistic Word-Alignment Model (Cont.)

• Given A = {l_1, l_2, ..., l_t}, where l_k = l(e_ik, f_jk), and let l_i^j denote the consecutive subset {l_i, l_{i+1}, ..., l_j} of A:

P(A | E, F) = P(l_1^t | E, F) = Π_{k=1..t} P(l_k | E, F, l_1^{k−1})

• Let C_k = {E, F, l_1^{k−1}} represent the context of l_k:

P(l_k | C_k) = P(l_k, C_k) / P(C_k) = P(C_k | l_k) P(l_k) / P(C_k)

• Noting that P(e_ik, f_jk | C_k) = 1 (the context contains E and F) and P(e_ik, f_jk | l_k) = 1 (a link entails its word pair):

P(l_k | C_k) = P(l_k | e_ik, f_jk) · P(C_k | l_k) / P(C_k | e_ik, f_jk)

Probabilistic Word-Alignment Model (Cont.)

• C_k = {E, F, l_1^{k−1}} is too complex to estimate
• FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_ik, f_jk, FT_k)
• Let C_k' = {e_ik, f_jk} ∪ FT_k:

P(l_k | C_k') = P(l_k | e_ik, f_jk) · P(C_k' | l_k) / P(C_k' | e_ik, f_jk)
= P(l_k | e_ik, f_jk) · P(FT_k | l_k) / P(FT_k | e_ik, f_jk)

• Assuming the features in FT_k are mutually independent:

P(A | E, F) ≈ Π_{k=1..t} [ P(l_k | e_ik, f_jk) · Π_{ft ∈ FT_k} P(ft | l_k) / P(ft | e_ik, f_jk) ]
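Under the independence assumption above, each link's probability is its pairwise probability rescaled by a likelihood ratio per active feature. A minimal sketch with invented numbers:

```python
def link_prob(p_link_given_pair, feature_probs, active_features):
    """P(l_k | C_k') ≈ P(l_k | e, f) * prod over active ft of
    P(ft | l_k) / P(ft | e, f)."""
    p = p_link_given_pair
    for ft in active_features:
        p_given_link, p_given_pair = feature_probs[ft]
        p *= p_given_link / p_given_pair
    return p

# Invented numbers: one dependency feature that is twice as likely
# given a true link as given the bare word pair.
feature_probs = {"ft_d(-1, det)": (0.6, 0.3)}
print(link_prob(0.2, feature_probs, ["ft_d(-1, det)"]))  # ≈ 0.4
```

Features that are more likely under a true link than under the bare word pair push the link probability up; uninformative features (ratio near 1) leave it unchanged.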

An Illustrative Example

Word-Alignment Algorithm

• Input
– E, F, and T_E
– T_E is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions
• Constraints
– One-to-one constraint: every word participates in exactly one link
– Cohesion constraint: use T_E to induce T_F with no crossing dependencies

Word-Alignment Algorithm (Cont.)

• Features
– Adjacency features ft_a: for any word pair (e_i, f_j), if a link l(e_i', f_j') exists where |i − i'| ≤ 2 and |j − j'| ≤ 2, then ft_a(i − i', j − j', e_i') is active for this context.
– Dependency features ft_d: for any word pair (e_i, f_j), let e_i' be the governor of e_i, and let rel be the grammatical relationship between them. If a link l(e_i', f_j') exists, then ft_d(j − j', rel) is active for this context.
– Example: the candidate pairs (the_1, l') and (the_1, les) activate features such as ft_a(−1, −1, host), ft_d(−1, det), and ft_d(−3, det).

Experimental Results

• Test bed: Hansard corpus – Training: 50K aligned pairs of sentences (Och & Ney 2000) – Testing: 500 pairs

Future Work

• The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignment will be pursued.
• The proposed model is capable of creating many-to-one alignments, using the null probabilities of the words added on the "many" side.