Chapter 10
Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and
Information Engineering
National Taiwan University
Outline
Multilingual Environments
What is Cross-Language Information
Retrieval?
Interdisciplinary relationship in CLIR
Major Problems in CLIR
Major Approaches in CLIR
Summary
Multilingual Collections
There are 6,703 languages listed in the Ethnologue
Digital libraries
– The OCLC Online Computer Library Center serves more
than 17,000 libraries in 52 countries and holds over
30 million bibliographic records, with over 500 million
ownership records attached, in more than 370 languages
World Wide Web
– Around 40% of Internet users do not speak English,
yet 80% of Web sites are still in English
Language Speakers in the Real World
(http://www.g11n.com/faq.htm)
[Figure: bar chart of speakers in millions (axis 0 to 800) for Chinese, English, Hindi/Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]
[Figure: chart comparing Dutch, Portuguese, Italian, Korean, Spanish, Swedish, Chinese, French, German, and Japanese]
(Statistics from Euro-Marketing Associates, 1998)
Chinese-speaking population share (6.1%) < French-speaking population share (8.8%) (1998)
(Statistics from Euro-Marketing Associates, 1999)
http://www.glreach.com/globstats/
Language Populations on the Internet
Internet Content
[Figure: bar chart of Internet hosts in thousands (log scale, 100 to 100,000) by language, estimated by domain, for English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, and Swedish; bar values as extracted: 33,878; 1,687; 1,684; 654; 546; 546; 473; 458; 432]
(Network Wizards Jan 99 Internet Domain Survey)
40% of Internet users do not understand English, but 80% of Internet content is in English
(Source: http://www.emarketer.com)
What is Cross-Language
Information Retrieval?
Definition: Select information in one
language based on queries in another.
Terminologies
– Cross-Language Information Retrieval
(ACM SIGIR 96 Workshop on Cross-Linguistic
Information Retrieval)
– Translingual Information Retrieval
(Defense Advanced Research Projects Agency, DARPA)
Generalization:
Multi- & Cross- Lingual Information Access
MLIR Applications
Multilingual information access in a multilingual
country, organization, enterprise, etc.
Cross-language information retrieval for users
who read a second language (large passive
vocabulary) but are not able to formulate good
queries (small active vocabulary).
Monolingual users may retrieve images by taking
advantage of multilingual captions.
Monolingual users may retrieve documents and
have them translated (automatically or manually)
into their language.
Why is Cross-Language Information
Retrieval Important?
More information workers with less time
require fast access to global resources
global B2B interactions (virtual enterprises)
global B2C interactions (online trading,
travelling)
time critical information (translation comes too
late)
History
1970 Salton runs retrieval experiments with a
small English/German dictionary
1972 Pevzner shows for English and Russian that
a controlled thesaurus can be used effectively for
query term translation
1978 ISO Standard 5964 for developing
multilingual thesauri (revised in 1985)
1990 Latent Semantic Indexing (LSI) applied to
CLIR
History (Continued)
1994 1st PhD thesis on CLIR by Khaled
Radwan
1996 Similarity thesaurus applied to CLIR
(ETH Zurich)
1996 Dictionary-based retrieval applied to
CLIR (UMass & Xerox Grenoble)
1997 Generalized Vector Space Model
(GVSM) applied to CLIR (CMU)
History (Continued)
1997 CLIR (Cross-Language Information
Retrieval) track starts within TREC
1998 NTCIR starts in Japan
1999 TIDES (Translingual Information
Detection, Extraction, and Summarization)
starts in the U.S.
2000 CLEF starts in Europe
An Architecture of Multilingual Information Access
[Figure: architecture diagram with the components Multiple Languages, Multilingual Resources, Language Identification (LI), Information Extraction, Information Filtering, Information Retrieval, Query Translation, Text Classification, Document Translation, Text Summarization, Text Processing, Language Translation, User Interface (UI), and Native Language(s)]
An Architecture of Cross-Language Information Retrieval
Building Blocks for CLIR
Information Retrieval
Information Science
Artificial Intelligence
Speech Recognition
Computational Linguistics
Information Science
User interface
Interactive search technique
Thesaurus construction
Evaluation
Computational Linguistics
Language identification
Morphological analysis
Stylistic analysis
Part-of-speech tagging
Identifying occurrences of phrases
Using parallel corpora
Using comparable corpora
Computational Linguistics (Continued)
Aligning documents
Identifying occurrences of geographic and
temporal concepts
Stochastic language models
Word disambiguation
Lexicons (morphology, part-of-speech)
Bilingual dictionaries (terms and possible
translation)
Information Retrieval (w/o CL)
Filtering
Relevance Feedback
Document representation
Latent semantic indexing
Generalized vector space model
Collection fusion
Passage retrieval
Information Retrieval (Continued)
Similarity thesaurus
Local context analysis
Automatic query expansion
Fuzzy term matching
Adapting retrieval methods to collection
Building cheap test collections
Evaluation
Artificial Intelligence
Machine translation
Machine learning
Template extraction and matching
Building large knowledge bases
Semantic network
Speech Recognition
Signal processing
Pattern matching
Phone lattice
Background noise elimination
Speech segmentation
Modeling speech prosody
Building test databases
Evaluation
Building Blocks Dealing with
Term Dependencies
IS: ISO-Thesaurus
CL: Word disambiguation, bilingual
dictionaries
AI: Semantic network
SR: Stochastic language models
IR: LSI, GVSM, similarity thesaurus, local
context analysis, (weighted) Boolean filters
Major Problems of CLIR
Queries and documents are in different
languages.
– translation
Words in a query may be ambiguous.
– disambiguation
Queries are usually short.
– expansion
Major Problems of CLIR (Continued)
Queries may have to be segmented.
– segmentation
A document may contain text in several
languages.
– language identification
Enhancing Traditional
Information Retrieval Systems
Which part(s) should be modified for CLIR?
[Figure: Documents → (1) → Document Representation → (2) → Comparison ← (4) ← Query Representation ← (3) ← Queries]
Enhancing Traditional
Information Retrieval Systems
(Continued)
(1): text translation
(2): vector translation
(3): query translation
(4): term vector translation
(1) and (2), (3) and (4): interlingual form
What are the Problems?
Ambiguous terms (e.g., performance)
Multiword phrases may correspond to single
words (e.g., South Africa => 南非, Südafrika)
Coverage of the vocabulary
There is not a one-to-one mapping between two
languages
Translating queries automatically (lack of syntax)
Translating documents automatically (performance, …)
Computing mixed result lists
Cross-Language Information Retrieval
– Query Translation
  • Controlled Vocabulary
  • Free Text
    Knowledge-based: Ontology-based, Dictionary-based, Thesaurus-based
    Corpus-based: Term-aligned, Sentence-aligned, Document-aligned (Parallel, Comparable), Unaligned
    Hybrid
– Document Translation
  • Text Translation
  • Vector Translation
– No Translation
Query Translation Based CLIR
[Figure: English Query → Translation Device → Chinese Query → Monolingual Chinese Retrieval System → Retrieved Chinese Documents]
Translating the 400 Million
non-English Pages of the WWW
... would take 100,000 days (about 300 years) on
one fast PC, or 1 month on 3,600 PCs.
Controlled Vocabulary
Sublanguage chosen by human indexers
National Library of Medicine
– Unified Medical Language System (UMLS)
– Integrating medical coverage of many thesauri
• English, French, German, Portuguese
Knowledge-Based
Examples
– Subject Thesaurus
• Hierarchical and associative relations.
• Unique term assigned to each node.
– Concept List
• Term space partitioned into concept spaces.
– Term List
• List of cross-language synonyms.
– Lexicon
• Machine readable syntax and/or semantics.
Ontology-Based Approaches
Exploit complex knowledge representations
e.g., EuroWordNet
A Proposal for Conceptual Indexing using
EuroWordNet
Ontology-Based Approaches
(Continued)
The Indexing Process
Dictionary-Based Approaches
Exploit machine-readable dictionaries.
Problems
– translation ambiguity + target polysemy
– coverage (unknown words, abbreviations, ...)
Dictionary-Based Approaches
(Continued)
Issue 1: selection strategy
– Select all.
– Select N randomly.
– Select best N.
Issue 2: which level
– word
– phrase
Selection Strategy: Select All
Hull and Grefenstette 1996
– Take the concatenation of all term translations.
E: politically motivated civil disturbances
F: troubles civils a caractere politique
trouble - turmoil, discord, trouble, unrest, disturbance, disorder
civil - civil, civilian, courteous
caractere - character, nature
politique - political, diplomatic, politician, policy
– Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original.
– errors: multi-word expressions and ambiguity
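The select-all strategy can be sketched as follows. This is a minimal illustration, not Hull and Grefenstette's implementation: the dictionary holds only the four entries shown on the slide, the function name `select_all` is hypothetical, and lemmatization and stop-word removal are assumed to have been done by hand.

```python
from typing import Dict, List

# Toy French-to-English transfer dictionary, copied from the slide's example.
TRANSFER_DICT: Dict[str, List[str]] = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def select_all(query_terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Concatenate every translation of every query term."""
    translated: List[str] = []
    for term in query_terms:
        # Assumption of this sketch: terms missing from the dictionary pass through unchanged.
        translated.extend(dictionary.get(term, [term]))
    return translated


# Content words of "troubles civils a caractere politique", lemmatized by hand.
english_query = select_all(["trouble", "civil", "caractere", "politique"], TRANSFER_DICT)
print(english_query)
```

The output contains 15 English terms for a four-word query, which shows how ambiguity inflates the translated query.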
Selection Strategy: Select All
(Continued)
Davis 1997 (TREC5)
– Replace each English query term with all of its
Spanish equivalent terms from the Collins
bilingual dictionary.
– Monolingual (0.2895) vs. All-equivalent
substitution (0.1422): 49.12%
Evaluation Method
Average Precision (5-, 9-, 11-points)
Model
[Figure: three runs against the TREC Spanish corpus.
Spanish Query → Mono IR Engine
English Query → Bilingual Dictionary → Spanish Equivalents → Mono IR Engine
English Query → POS + Bilingual Dictionary → Spanish Equivalents by POS → Mono IR Engine]
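The slide names 5-, 9-, and 11-point average precision without giving the computation. A minimal sketch of the common 11-point interpolated variant follows, assuming a single query, a ranked list of document ids, and a known set of relevant ids; the function name is hypothetical.

```python
from typing import List, Set


def eleven_point_average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    points = []  # (recall, precision) measured at each relevant document found
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in range(11):
        recall_level = level / 10
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= recall_level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11


score = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(round(score, 4))
```

Averaging this value over all queries gives the figure that run comparisons of this kind typically report.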
Selection Strategy: Select N
Simple word-by-word translation
– Each query term is replaced by the word or
group of words given for the first sense of the
term’s definition.
• 50-60% drop in performance (average precision)
Selection Strategy: Select N
(Continued)
word/phrase translation
– Take at most three translations of each word,
one from each of the first three senses. Take
phrase translation if appearing in dictionary.
• 30-50% worse than good translation
– Well-translated phrases can greatly improve
effectiveness, but poorly translated phrases may
negate the improvements.
• WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)
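The word/phrase strategy above can be sketched as below. The dictionary layout (a list of senses per headword, each sense a list of translations), the entries themselves, and the restriction to two-word phrases are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical English-to-Spanish entries: headword -> senses -> translations.
WORD_SENSES: Dict[str, List[List[str]]] = {
    "bank": [["banco"], ["orilla", "ribera"], ["banca"], ["hilera"]],
    "interest": [["interés"], ["rédito"]],
}
# Hypothetical phrase entries, keyed by two-word tuples.
PHRASES: Dict[Tuple[str, ...], List[str]] = {("interest", "rate"): ["tipo de interés"]}


def translate_select_n(terms, senses, phrases, max_senses=3):
    """Prefer a phrase translation; otherwise take one translation from each
    of the first `max_senses` senses of the word."""
    out: List[str] = []
    i = 0
    while i < len(terms):
        pair = tuple(terms[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            out.extend(phrases[pair])
            i += 2
            continue
        entry = senses.get(terms[i], [])
        if entry:
            out.extend(sense[0] for sense in entry[:max_senses] if sense)
        else:
            out.append(terms[i])  # untranslatable terms are kept as-is
        i += 1
    return out


print(translate_select_n(["bank", "interest", "rate"], WORD_SENSES, PHRASES))
```

Setting `max_senses=1` reduces this to the simple first-sense word-by-word translation described on the previous slide.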
Selection Strategy: Select Best N
Hayashi, Kikui and Susaki 1997
– search for a dictionary entry corresponding to the
longest sequence of words from left to right
– choose the most frequently used word (or phrases) in a
text corpus collected from WWW
– no report for this query translation approach
Davis 1997 (TREC5)
– POS disambiguation
– Monolingual (0.2895) vs. All-equivalent substitution
(0.1422) vs. POS disambiguation (0.1949): near 67.3%
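The longest-match lookup with frequency-based selection attributed to Hayashi, Kikui and Susaki can be sketched roughly as follows. The dictionary entries, the frequency counts, and the function name are invented for illustration, and the language pair here is not the one used in their work.

```python
from typing import Dict, List


def longest_match_translate(tokens: List[str],
                            dictionary: Dict[str, List[str]],
                            target_freq: Dict[str, int],
                            max_len: int = 3) -> List[str]:
    """Scan left to right, match the longest dictionary entry at each position,
    and keep the candidate translation that is most frequent in a target corpus."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n])
            if key in dictionary:
                best = max(dictionary[key], key=lambda t: target_freq.get(t, 0))
                out.append(best)
                i += n
                break
        else:
            out.append(tokens[i])  # no entry found: keep the source token
            i += 1
    return out


# Hypothetical entries and corpus counts.
DICTIONARY = {
    "information retrieval": ["recuperación de información"],
    "information": ["información", "informe"],
    "system": ["régimen", "sistema"],
}
FREQ = {"sistema": 120, "régimen": 15, "información": 300, "informe": 40}

print(longest_match_translate(["information", "retrieval", "system"], DICTIONARY, FREQ))
```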
Corpus-Based Approaches
Categorization
– Term-Aligned
– Sentence-Aligned
– Document-Aligned (Parallel, Comparable)
– Unaligned
Usage
– Setup Thesaurus
– Vector Mapping
Term-Aligned Corpora
Fine-grained alignment in parallel corpora
Oard 1996
– Term alignment is a challenging problem.
[Figure: a Parallel Bilingual Corpus yields Co-occurrence Statistics and Translation Tables for a Machine Translation System that turns an English Query into a Spanish Query]
Sentence-Aligned Corpora
Davis & Dunning 1996 (TREC4)
– High-frequency Terms
Sentence-Aligned Corpora
(Continued)
– Statistically Significant Terms
Sentence-Aligned Corpora
(Continued)
Precision-Recall Averages
Document-Aligned Corpora
Exploit parallel or comparable corpora
Parallel: linked translation equivalents
– LSI mate retrieval achieves 99% effectiveness
Comparable: separate authorship, same
topic
– Easier to find, harder to link the documents
Query Term Disambiguation
Comparable Document-Aligned Corpora
Sheridan & Ballerini 1996
– Create a comparable corpus.
Align news stories in German and Italian by
topic label and date, and merge them to create
pseudo-parallel documents.
– Generate co-occurrence thesaurus.
– Perform translations using thesaurus.
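The three steps above can be sketched with raw co-occurrence counts. This is a simplification under stated assumptions: the stories, dates, and topic codes are invented, and a real similarity thesaurus would weight terms rather than count them.

```python
from collections import Counter, defaultdict


def build_pseudo_parallel(german, italian):
    """Pair stories that share (date, topic label) and return their token lists."""
    by_key = defaultdict(list)
    for date, topic, text in italian:
        by_key[(date, topic)].append(text)
    pairs = []
    for date, topic, text in german:
        for it_text in by_key.get((date, topic), []):
            pairs.append((text.split(), it_text.split()))
    return pairs


def cooccurrence_thesaurus(pairs):
    """Count how often each German term appears in a pair with each Italian term."""
    counts = defaultdict(Counter)
    for de_tokens, it_tokens in pairs:
        for de in set(de_tokens):
            counts[de].update(set(it_tokens))
    return counts


def translate(term, thesaurus, k=1):
    """Return the k Italian terms that co-occur most often with the German term."""
    return [word for word, _ in thesaurus[term].most_common(k)]


# Invented toy stories: (date, topic code, text).
GERMAN = [("1994-05-01", "SPO", "fussball spiel"),
          ("1994-05-02", "POL", "wahl regierung"),
          ("1994-05-03", "SPO", "fussball tor")]
ITALIAN = [("1994-05-01", "SPO", "calcio partita"),
           ("1994-05-02", "POL", "elezioni governo"),
           ("1994-05-03", "SPO", "calcio gol")]

thesaurus = cooccurrence_thesaurus(build_pseudo_parallel(GERMAN, ITALIAN))
print(translate("fussball", thesaurus))
```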
Unaligned Corpora
No document links
Used in conjunction with dictionaries
– Pre-translation local feedback (Ballesteros &
Croft 1997)
Brief Summary
dictionary-based methods
– Specialized vocabulary not in the dictionaries will not
be translated.
– Ambiguities will add extraneous terms to the query.
parallel/comparable corpora-based methods
– Parallel corpora are not always available.
– Available corpora tend to be relatively small or to cover
only a small number of subjects.
– Performance is dependent on how well the corpora are
aligned.
Brief Summary (Continued)
Dictionaries are very useful.
– Achieve about 50% of monolingual effectiveness on their own
Parallel corpora have limitations.
– Domain shifts
– Term alignment accuracy
Dictionaries and corpora are complementary.
– Dictionaries provide broad and shallow coverage.
– Corpora provide narrow (domain-specific) but deep
(more terminology) coverage of the language.
Hybrid Methods
What knowledge can be employed?
– lexical knowledge
– corpus knowledge
– ...
Hybrid Methods (Continued)
Query Expansion
– Issue 1: context
• pseudo relevance feedback (local feedback):
A query is modified by the addition of terms found
in the top retrieved documents.
• local context analysis:
Queries are expanded by the addition of the top
ranked concepts from the top passages.
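Pseudo relevance feedback as described above can be sketched as follows. The retrieval function is a deliberately crude stand-in (it counts query-term occurrences), and the documents, parameter names, and cut-offs are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def local_feedback(query: List[str], docs: List[List[str]],
                   top_docs: int = 2, new_terms: int = 1) -> List[str]:
    """Expand a query with the most frequent non-query terms
    from the top-ranked documents of an initial retrieval run."""
    def score(i: int) -> int:
        # Toy ranking function: total occurrences of query terms in document i.
        return sum(docs[i].count(term) for term in query)

    ranked = sorted(range(len(docs)), key=lambda i: -score(i))
    counts: Counter = Counter()
    for i in ranked[:top_docs]:
        counts.update(t for t in docs[i] if t not in query)
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return list(query) + expansion


# Invented toy collection of tokenized documents.
DOCS = [["jaguar", "car", "engine", "car"],
        ["jaguar", "engine", "speed"],
        ["car", "engine", "repair"],
        ["stock", "market"]]

print(local_feedback(["jaguar", "car"], DOCS))
```

No relevance judgments are used: the top documents are simply assumed to be relevant, which is why the method is called "pseudo" feedback.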
Hybrid Methods (Continued)
– Issue 2: when
• before query translation
• after query translation
Pseudo-Relevance Feedback Illustrated
Query Expansion through
Local Context Analysis
local analysis
– Based on the set of documents retrieved for the
original query
– Based on term co-occurrence inside documents
– Terms closest to individual query terms are selected
global analysis
– Based on the whole document collection
– Based on term co-occurrence inside small contexts
and phrase structures
– Terms closest to the whole query are selected
Query Expansion through
Local Context Analysis (Continued)
candidates
– noun groups instead of simple keywords
– single noun, two adjacent nouns, or three
adjacent nouns
query expansion
– Concepts are selected from the top ranked
documents (as in local analysis)
– Passages are used for determining co-occurrence (as in global analysis)
Query Expansion through
Local Context Analysis (Continued)
algorithm
– Retrieve the top n ranked passages using the original
query
– For each concept in the top ranked passages, the
similarity sim(q,c) between the whole query q and the
concept c is computed using a variant of tf-idf ranking
– The top m ranked concepts are added to the original
query q
• Each concept is assigned a weight 1 − 0.9 × i/m (i: rank)
• Each term in the original query is assigned a weight 2 × original
weight
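The weighting step of the algorithm can be sketched as below. The ranking of concepts is assumed to be given already; the function name and the rule that an original query term keeps its own weight when it reappears as a concept are assumptions of this sketch.

```python
from typing import Dict, List


def lca_expand(query_weights: Dict[str, float],
               ranked_concepts: List[str], m: int) -> Dict[str, float]:
    """Add the top m concepts with weight 1 - 0.9 * i / m (i is the 1-based rank)
    and double the weight of each original query term."""
    expanded = {term: 2.0 * weight for term, weight in query_weights.items()}
    for i, concept in enumerate(ranked_concepts[:m], start=1):
        # Assumption: an original query term is not overwritten by a concept weight.
        expanded.setdefault(concept, 1.0 - 0.9 * i / m)
    return expanded


weights = lca_expand({"peace": 1.0}, ["treaty", "talks", "border"], m=2)
print(weights)
```

With m = 2 the first concept gets weight 0.55 and the second 0.1, so lower-ranked concepts contribute progressively less than the original terms.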
Hybrid Methods (Continued)
Ballesteros & Croft 1997
[Figure: experimental setup. Original Spanish TREC queries are translated by humans into English queries (BASE). English Queries → query expansion → automatic dictionary translation → Spanish Queries → query expansion → INQUERY]
Hybrid Methods (Continued)
– Performance Evaluation
• pre-translation
MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
• post-translation
MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
• combined pre- and post-translation
MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
• 32% below a monolingual baseline
Hybrid Methods (Continued)
Davis 1997 (TREC5)
[Figure: flow diagram with the components English Query, Bilingual Dictionary (POS), Spanish Equivalents, UN English / UN Spanish Parallel corpus, IR Engine, Compare Document Vectors, Reduced Equivalent Set, Mono IR Engine, and TREC Spanish Corpus]
Hybrid Methods (Continued)
– corpus-based disambiguation vs. POS-based
disambiguation
– MONO (0.2895) vs. ALL (0.1422, 49.12%) vs. CORP (0.1153, 39.83%) vs. POS (0.1949, 67.32%) vs. BOTH (0.2127, 73.47%)
Document Translation
Translate the documents, not the query
[Figure: Documents → MT → Document Representation → Comparison ← Query Representation ← Queries]
(1) Efficiency problem
(2) Retrieval effectiveness? (word order, stop words)
(3) Cross-language mate finding using MT-LSI (Dumais et al., 1997)
Vector Translation
Translate document vectors
[Figure: Documents → Document Representation → MT → Comparison ← Query Representation ← Queries]
No Translation
Latent Semantic Indexing (Dumais, et al. 1997)
No Translation (Continued)
Cross-Language Retrieval Using LSI
resource: document-aligned parallel corpus
No Translation (Continued)
Yellow Page Cross-Language Retrieval
         Top 1    Top 10
CL-LSI   63.8%    86.9%
MT       57.5%    74.8%
A Comparative Evaluation
Carbonell, Yang, Frederking, et al. (CMU,LTI)
– Corpus-driven Term Translation (TMT)
– Pseudo-Relevance Feedback (PRF)
– Generalized Vector Space Model (GVSM)
– Latent Semantic Indexing (LSI)
– GVSM slightly outperforms LSI, which in turn
outperforms PRF and TMT.
Research Directions
Comparable corpus techniques
– Automatic document linking
Dictionary-based approaches
– Word sense disambiguation
Evaluation
– Side-by-side tests
– Controllable domain shift
CLIR system using query
translation
Generating Mixed
Ranked Lists of Documents
Normalizing scales of relevance
– using aligned documents
– using ranks
– interleaving according to given ratios
Mapping documents into the same space
– LSI
– document translations
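Two of the normalization options above can be sketched briefly: min-max score normalization followed by a merge, and rank-based interleaving. Both are illustrative assumptions; the slide does not prescribe a formula, and the interleaving here uses a fixed 1:1 ratio.

```python
from typing import List, Tuple

Result = Tuple[str, float]  # (document id, retrieval score)


def normalize(results: List[Result]) -> List[Result]:
    """Rescale scores of one result list to the range [0, 1]."""
    if not results:
        return []
    scores = [s for _, s in results]
    low, span = min(scores), max(scores) - min(scores)
    return [(d, (s - low) / span if span else 1.0) for d, s in results]


def merge_by_normalized_score(*lists: List[Result]) -> List[str]:
    """Merge per-language lists after putting their scores on a common scale."""
    merged = [item for results in lists for item in normalize(results)]
    return [d for d, _ in sorted(merged, key=lambda item: -item[1])]


def interleave(*lists: List[Result]) -> List[str]:
    """Round-robin merge that uses ranks only and ignores scores."""
    out: List[str] = []
    for rank in range(max((len(r) for r in lists), default=0)):
        for results in lists:
            if rank < len(results):
                out.append(results[rank][0])
    return out


# Invented result lists from two monolingual engines with incomparable scores.
ENGLISH = [("e1", 12.0), ("e2", 6.0), ("e3", 0.0)]
GERMAN = [("d1", 0.8), ("d2", 0.4)]

print(merge_by_normalized_score(ENGLISH, GERMAN))
print(interleave(ENGLISH, GERMAN))
```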
Tools
Types of Tools
Mark-Up Tools
Language Identification
Stemming/Normalization
Entity Recognition
Part-of-Speech taggers
Indexing Tools
Text Alignment
Speech Recognition/ OCR
Visualization
Character Set/Font Handling
Word Segmentation
Phrase/Compound Handling
Terminology Extraction
Parsers/Linguistic Processors
Lexicon Acquisition
MT Systems
Summarization
Character Set/Font Handling
Input and Display Support
– Special input modules for e.g. Asian languages
– Out-of-the-box support much improved thanks
to modern web browsers
Character Set/File Format
– Unicode/UTF-8
– XML
Language Identification
Different levels of multilingual data
– In different sub-collections
– Within sub-collections
– Within items
Different approaches
– Tri-gram
– Stop words
– Linguistic analysis
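The trigram approach can be sketched as follows. The training sentences are tiny invented samples and the overlap measure is one simple choice among many; a usable identifier would need far larger profiles.

```python
from collections import Counter
from typing import Dict


def trigram_profile(text: str) -> Counter:
    """Count character trigrams, padding so word boundaries are captured."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def identify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the language whose profile shares the most trigram mass with the text."""
    sample = trigram_profile(text)

    def overlap(profile: Counter) -> int:
        return sum(min(count, profile[gram]) for gram, count in sample.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Invented one-sentence training samples per language.
PROFILES = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

print(identify("the dog and the fox", PROFILES))
print(identify("der hund und die katze", PROFILES))
```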
Stemming/Normalization
Reduction of words to their root form
Important for languages with rich
morphology
Rule-based or dictionary-based
Case normalization
Handling of diacritics (French, …)
Vowel (re-)substitution (e.g., Semitic
languages, …)
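Case normalization and diacritic handling can be sketched with Python's standard `unicodedata` module. Stripping all combining marks is an assumption that suits French-style accents; some languages call for other rules (German umlauts, for example, are often rewritten as "ue" rather than "u").

```python
import unicodedata


def normalize_term(term: str) -> str:
    """Case-fold a term and remove combining diacritical marks."""
    decomposed = unicodedata.normalize("NFD", term.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(normalize_term("Caractère"))
print(normalize_term("Zürcher"))
```

Applying the same function to both index terms and query terms lets "caractere" in a query match "caractère" in a document.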
Entity Recognition/
Terminology Extraction
Proper Names, Locations, ...
– Critical, since often missing from dictionaries
– Special problems in languages such as Chinese
Domain-specific vocabulary, technical
terms
– Critical for effectiveness and accuracy
Phrase/Compound Handling
Collocations (“Hong Kong”)
– Important for dictionary lookup
– Improves retrieval accuracy
Compounds (“Bankangestelltenlohn”: bank
employee salary)
– Big problem in German
– Infinite number of compounds – dictionary is
no viable solution
Lexicon Acquisition/
Text Alignment
Goal: automatic construction of data
structures such as dictionaries and thesauri
– Work on parallel and comparable corpora
– Terminology extraction
– Similarity thesauri
Prerequisite: training data, usually aligned
– Document, sentence, word level alignment
CLIR Evaluation at TREC
Too many factors in
CLIR system evaluation
translation
automatic relevance feedback
term expansion
disambiguation
result merging
test collection
Need to reduce the number of factors to see what happened
TREC-6 Cross-Language Track
In cooperation with the Swiss Federal Institute of
Technology (ETH)
Task Summary: retrieval of English, French, and
German documents, both in a monolingual and a
cross-lingual mode
Documents
– SDA (1988-1990): French (250 MB), German (330 MB)
– Neue Zürcher Zeitung (1994): German (200 MB)
– AP (1988-1990): English (759 MB)
13 participating groups
TREC-7 Cross-Language Track
Task Summary: retrieval of English, French,
German and Italian documents
Results to be returned as a single multilingual
ranked list
Addition of Italian SDA (1989-1990), 90 MB
Addition of a subtask of 31,000 structured German
social science documents (GIRT)
9 participating groups
TREC-8 Cross-Language Track
Tasks, documents and topic creation similar
to TREC-7
12 participating groups
CLIR in TREC-9
Documents
– Hong Kong Commercial Daily, Hong Kong
Daily News, Takungpao: all from 1999 and
about 260 MB total
25 new topics built in English; translations
made to Chinese
Cross-Language Evaluation Forum
A collaboration between the DELOS
Network of Excellence for Digital Libraries
and the US National Institute of Standards
and Technology (NIST)
Extension of the CLIR track at TREC (1997-1999)
Main Goals
Promote research in cross-language system
development for European languages by
providing an appropriate infrastructure for:
– CLIR system evaluation, testing and tuning
– Comparison and discussion of results
CLEF 2000 Task Description
Four evaluation tracks in CLEF 2000
– multilingual information retrieval
– bilingual information retrieval
– monolingual (non-English) information
retrieval
– domain-specific IR
CLEF 2000 Document Collection
Multilingual Comparable Corpus
– English: Los Angeles Times
– French: Le Monde
– German: Frankfurter Rundschau + Der Spiegel
– Italian: La Stampa
Similar for genre, content, time
Case Study: CLIR for NPDM
3M in Digital Libraries/Museums
Multi-media
– Selecting suitable media to represent contents
Multi-linguality
– Decreasing the language barriers
Multi-culture
– Integrating multiple cultures
NPDM Project
The National Palace Museum, Taipei, is one of the most
famous museums in the world
NSC supports a pilot study of a digital
museum project, NPDM, starting in 2000
– Enamels from the Ming and Ch’ing Dynasties
– Famous Album Leaves of the Sung Dynasty
– Illustrations in Buddhist Scriptures with
Relative Drawings
Design Issues
Standardization
– A standard metadata protocol is indispensable for the
interchange of resources with other museums.
Multimedia
– A suitable presentation scheme is required.
Internationalization
– to share the valuable resources of NPDM with users of
different languages
– to utilize knowledge presented in a foreign language
Translingual Issue
CLIR
– to allow users to issue queries in one language
to access documents in another language
– the query language is English and the document
language is Chinese
Two common approaches
– Query translation
– Document translation
Resources in NPDM pilot
an enamel, a calligraphy, a painting, or an
illustration
MICI-DC
– Metadata Interchange for Chinese Information
– Accessible fields to users
• Short descriptions vs. full texts
• Bilingual versions vs. Chinese only
– Fields for maintenance only
Search Modes
Free search
– users describe their information need using
natural languages (Chinese or English)
Specific topic search
– users fill in specific fields denoting authors,
titles, dates, and so on
Example
Information need
– Retrieve “Travelers Among Mountains and Streams,
Fan K’uan” (“范寬谿山行旅圖”)
Possible queries
– Author: Fan Kuan; Kuan, Fan
– Time: Sung Dynasty
– Title: Mountains and Streams; Travel among mountains;
Travel among streams; Mountain and stream painting
– Free search: landscape painting; travelers, huge
mountain, Nature; scenery; Shensi province
[Figure: NPDM English-Chinese IR (ECIR) flow diagram with the components English Query, Document Translation, Query Translation, Name Search, English Names, Specific Bilingual Dictionary, Machine Transliteration, Chinese Names, Title Search, English Titles, Chinese Titles, Query Disambiguation, Generic Bilingual Dictionary, Chinese Query, Chinese IR System, and NPDM Collection]
Results
Specific Topic Search
proper names are important query terms
– Creators such as “林逋” (Lin P’u), “李建中”
(Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
– Emperors such as “康熙” (K'ang-hsi), “乾隆”
(Ch'ien-lung), “徽宗” (Hui-tsung), etc.
– Dynasties such as “宋” (Sung), “明” (Ming), “清”
(Ch’ing), etc.
Name Transliteration
The writing systems of Chinese and English are totally
different
Wade-Giles (WG) and Pinyin are two famous systems
to romanize Chinese in libraries
backward transliteration
– Transliterate target language terms back to source language
ones
– Chen, Huang, and Tsai (COLING, 1998)
– Lin and Chen (ROCLING, 2000)
Name Mapping Table
Divide a name into a sequence of Chinese
characters, and transform each character
into phonemes
Look up phoneme-to-WG (Pinyin) mapping
table, and derive a canonical form for the
name
Example
– “林逋” → “ㄌㄧㄣ ㄆㄨ” → “Lin P’u”
(WG)
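The two table lookups above can be sketched as follows. The tables hold only the syllables from the slide's example, the function name is hypothetical, and the formatting rules (single-character surname first, hyphenated given name in Wade-Giles) are assumptions of this sketch.

```python
# Tiny excerpts of the mapping tables, covering only the slide's example.
CHAR_TO_ZHUYIN = {"林": "ㄌㄧㄣ", "逋": "ㄆㄨ"}
ZHUYIN_TO_WG = {"ㄌㄧㄣ": "lin", "ㄆㄨ": "p'u"}
ZHUYIN_TO_PINYIN = {"ㄌㄧㄣ": "lin", "ㄆㄨ": "pu"}


def canonical_name(name: str, scheme: str = "WG") -> str:
    """Map each character to its phonemes, then to a romanized syllable,
    and format the result as 'Surname Given'."""
    if scheme == "WG":
        table, joiner = ZHUYIN_TO_WG, "-"
    else:
        table, joiner = ZHUYIN_TO_PINYIN, ""
    syllables = [table[CHAR_TO_ZHUYIN[ch]] for ch in name]
    surname = syllables[0].capitalize()       # assumes a single-character surname
    given = joiner.join(syllables[1:]).capitalize()
    return f"{surname} {given}".strip()


print(canonical_name("林逋"))
print(canonical_name("林逋", scheme="Pinyin"))
```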
Name Similarity
Extract named entity from the query
Select the most similar named entity from name
mapping table
Naming sequence/scheme
– LastName FirstName1, e.g., Chu Hsi (朱熹)
– FirstName1 LastName, e.g., Hsi Chu (朱熹)
– LastName FirstName1-FirstName2, e.g., Hsu Tao-ning
(許道寧)
– FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu
(許道寧)
– Any order, e.g., Tao Ning Hsu (許道寧)
– Any transliteration, e.g., Ju Shi (朱熹)
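One way to tolerate the naming variants listed above is to compare order-insensitive keys with a fuzzy string measure. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity function; the slide does not say which measure the system uses, and the two-entry table is for illustration only.

```python
import difflib
from typing import Dict


def name_key(name: str) -> str:
    """Lowercase, split on spaces and hyphens, and sort the syllables
    so that name order and hyphenation do not matter."""
    tokens = name.lower().replace("-", " ").split()
    return " ".join(sorted(tokens))


def most_similar(query_name: str, table: Dict[str, str]) -> str:
    """Return the table entry whose key is closest to the query name's key."""
    query_key = name_key(query_name)
    return max(
        table,
        key=lambda cand: difflib.SequenceMatcher(None, query_key, name_key(cand)).ratio(),
    )


# Canonical Wade-Giles forms mapped to Chinese names (examples from the slide).
NAME_TABLE = {"Chu Hsi": "朱熹", "Hsu Tao-ning": "許道寧"}

for variant in ["Hsi Chu", "Tao Ning Hsu", "Ju Shi"]:
    match = most_similar(variant, NAME_TABLE)
    print(variant, "->", match, NAME_TABLE[match])
```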
Title
“谿山行旅圖” (“Travelers
among Mountains and Streams”)
"travelers", "mountains", and
"streams" are basic components
Users can express their information
need through the descriptions of a
desired artwork
System will measure the similarity of
art titles (descriptions) and a query
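The slide does not name the similarity measure. As one plausible choice, the sketch below scores a query against a title with the Dice coefficient over word sets; the function name and the lack of stop-word handling are assumptions.

```python
def dice_similarity(query: str, title: str) -> float:
    """Dice coefficient between the word sets of a query and an art title."""
    q = set(query.lower().split())
    t = set(title.lower().split())
    if not q or not t:
        return 0.0
    return 2 * len(q & t) / (len(q) + len(t))


TITLE = "Travelers among Mountains and Streams"
print(dice_similarity("mountains and streams", TITLE))
print(dice_similarity("bamboo", TITLE))
```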
Free Search
A query is composed of several concepts.
Concepts are either transliterated or translated.
Query translation works like a small-scale IR
system.
Resources
– Name-mapping table
– Title-mapping table
– Specific English-Chinese Dictionary
– Generic English-Chinese Dictionary
– …
Algorithm
(1) For each resource, the Chinese translations
whose scores are larger than a specific threshold
are selected.
(2) The Chinese translations identified from
different resources are merged, and are sorted by
their scores.
(3) Consider the Chinese translation with the
highest score in the sorting sequence.
– If the intersection of the corresponding English
description and query is not empty, then select the
translation and delete the common English terms
between query and English description from query.
– Otherwise, skip the Chinese translation.
Algorithm (Continued)
(4) Repeat step (3) until query is empty or
all the Chinese translations in the sorting
sequence are considered.
(5) If the query is not empty, the remaining
words are looked up in the general
dictionary. The Chinese query is composed of
all the translated results.
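Steps (1) to (5) can be sketched as below. The candidate format (Chinese translation, English description terms, score), the scores, the threshold, and the dictionary entries are all invented for illustration; the slides do not say how the scores are computed.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, List[str], float]  # (Chinese translation, English description, score)


def translate_query(query_terms: List[str], resources: List[List[Candidate]],
                    general_dict: Dict[str, List[str]], threshold: float = 0.5) -> List[str]:
    # Steps (1)-(2): keep candidates above the threshold, merge, sort by score.
    candidates = [c for resource in resources for c in resource if c[2] > threshold]
    candidates.sort(key=lambda c: c[2], reverse=True)

    remaining = list(query_terms)
    chinese_query: List[str] = []
    # Steps (3)-(4): walk the sorted candidates until the query is used up.
    for chinese, description, _score in candidates:
        if not remaining:
            break
        common = set(remaining) & set(description)
        if common:
            chinese_query.append(chinese)
            remaining = [t for t in remaining if t not in common]
    # Step (5): look up whatever is left in the general dictionary.
    for term in remaining:
        chinese_query.extend(general_dict.get(term, []))
    return chinese_query


# Hypothetical resources with made-up scores.
NAME_TABLE = [("范寬", ["fan", "kuan"], 1.0)]
TITLE_TABLE = [("谿山行旅圖", ["travelers", "among", "mountains", "and", "streams"], 0.6),
               ("早春圖", ["early", "spring"], 0.2)]
GENERAL = {"painting": ["畫"]}

query = ["fan", "kuan", "mountains", "streams", "painting"]
print(translate_query(query, [NAME_TABLE, TITLE_TABLE], GENERAL))
```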