Support for Multilingual Information Access


Support for
Multilingual Information Access
Douglas W. Oard
College of Information Studies and
Institute for Advanced Computer Studies
University of Maryland, College Park, MD, USA
August 21, 2002
Széchényi National Library
Multilingual Information Access
Help people find information
that is expressed in any language
Outline
• User needs
• System design
• User studies
• Next steps
Global Languages
[Bar chart of speakers (millions), scale 0 to 800, for the most widely spoken languages: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]
Source: http://www.g11n.com/faq.html
Global Internet User Population
[Pie charts comparing the 2000 online population with a 2005 projection, broken out by language: English, Chinese, Japanese, Spanish, German, Korean, French, Italian, Portuguese, Scandinavian, Dutch, and Other. English falls from roughly half of users in 2000 to about a third in the 2005 projection, while Chinese and other languages grow.]
Source: Global Reach
Global Internet Hosts
[Log-scale chart of Internet hosts (millions), 0.1 to 100.0, by language as estimated by domain: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
Source: Network Wizards Jan 99 Internet Domain Survey
European Web Size Projection
[Log-scale projection of Web size in billions of words, 0.1 to 10,000.0, plotted from Oct-96 through Oct-05 for English and Other European languages]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
Global Internet Audio
Over 2,500 Internet-accessible radio and television stations
[Pie chart: English, 1,062 stations; Other Languages, 1,438 stations]
Source: www.real.com, Mar 2001
Who needs Cross-Language Search?
• Searchers who can read several languages
– Eliminate multiple queries
– Query in most fluent language
• Monolingual searchers
– If translations can be provided
– If it suffices to know that a document exists
– If text captions are used to search for images
Outline
• User needs
• System design
• User studies
• Next steps
Multilingual Information Access
Information Science
Information Retrieval
Cross-Language Retrieval
Indexing Languages
Machine-Assisted Indexing
Digital Libraries
Artificial Intelligence
Natural Language Processing
Machine Translation
Information Extraction
Text Summarization
Other Fields
Human-Computer Interaction
Localization
Information Visualization
World-Wide Web
Ontological Engineering
Web Internationalization
Multilingual Metadata
Information Use
Multilingual Ontologies
Speech Processing
Knowledge Discovery
Topic Detection and Tracking
International Information Flow
Diffusion of Innovation
Automatic Abstracting
Textual Data Mining
Document Image Understanding
Machine Learning
Multilingual OCR
Multilingual Information Access
[Diagram: Multilingual Information Access spans Cross-Language Search, which supports Select via Query Translation, and Cross-Language Browsing, which supports Examine via Document Delivery]
The Search Process
[Diagram: An Author chooses document-language terms to express concepts in a Document. A Monolingual Searcher infers those concepts and chooses document-language terms for the Query; a Cross-Language Searcher chooses query-language terms, from which document-language terms must then be selected. Query and Document meet in Query-Document Matching.]
Interactive Search
[Flow diagram: Query Formulation produces a Query; Query Translation produces a Translated Query; Search returns a Ranked List; Selection yields a Document for Examination and Use. Feedback loops support Query Reformulation, aided by Synonym Selection and KeyWord In Context (KWIC).]
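The query-translation step in this loop is often approximated with a bilingual term list rather than full machine translation. A minimal sketch of such gloss translation, where the tiny English-to-French dictionary and the example query are hypothetical:

```python
# Minimal dictionary-based ("gloss") query translation: each query-language
# term is replaced by every known document-language translation.
# The English-to-French entries below are illustrative, not from a real lexicon.
GLOSS_DICT = {
    "treasure": ["trésor"],
    "hunting": ["chasse", "poursuite"],
    "strike": ["grève", "coup"],
}

def translate_query(terms):
    translated = []
    for term in terms:
        # Pass untranslatable terms through unchanged, as real systems often must
        translated.extend(GLOSS_DICT.get(term.lower(), [term]))
    return translated

print(translate_query(["Treasure", "hunting"]))  # ['trésor', 'chasse', 'poursuite']
```

Keeping every candidate translation broadens recall at the cost of precision; the synonym-selection feedback in the loop exists precisely to let searchers prune bad candidates.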
Outline
• User needs
• System design
• User studies
• Next steps
Cross-Language Evaluation Forum
• Annual European-language retrieval evaluation
– Documents: 8 languages
• Dutch, English, Finnish, French, German, Italian,
Spanish, Swedish
– Topics: 8 languages, plus Chinese and Japanese
– Batch retrieval since 2000
• Interactive track (iCLEF) started in 2001
– 2001 focus: document selection
– 2002 focus: query formulation
iCLEF 2001 Experiment Design
144 trials, in blocks of 16, at 3 sites

Participant   Task Order
1             Topic11, Topic17    Topic13, Topic29
2             Topic11, Topic17    Topic13, Topic29
3             Topic17, Topic11    Topic29, Topic13
4             Topic17, Topic11    Topic29, Topic13

Topic Key: Narrow: 11, 13; Broad: 17, 29
System Key: System A and System B, counterbalanced across participants
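A counterbalanced assignment like the one above can be generated mechanically. The sketch below is a plausible reconstruction of the rotation; the exact iCLEF pairing of systems to topic blocks is an assumption, not taken from the slide:

```python
# Illustrative counterbalancing across 4 participant roles: half reverse
# the topic order, half swap which system handles which topic pair.
# The system-to-block pairing here is assumed for illustration.
TOPIC_PAIRS = [("Topic11", "Topic17"), ("Topic13", "Topic29")]
SYSTEMS = ["A", "B"]

def assignment(participant):
    flip_topics = participant in (3, 4)   # participants 3 and 4 reverse topic order
    swap_systems = participant in (2, 4)  # participants 2 and 4 swap systems
    rows = []
    for i, pair in enumerate(TOPIC_PAIRS):
        topics = tuple(reversed(pair)) if flip_topics else pair
        system = SYSTEMS[(i + swap_systems) % 2]
        rows.append((system, topics))
    return rows

for p in (1, 2, 3, 4):
    print(p, assignment(p))
```

Across the four roles, every system sees every topic pair in every order, so order effects and system effects can be separated in the analysis.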
An Experiment Session
• Task and system familiarization
• 4 searches (20 minutes each)
– Read topic description
– Examine document translations
– Judge as many documents as possible
• Relevant, Somewhat relevant, Not relevant, Unsure, Not judged
• Instructed to seek high precision
• 8 questionnaires
– Initial, each topic (4), each system (2), final
Measure of Effectiveness
• Unbalanced F-measure:

  F = 1 / (α/P + (1 − α)/R)

  – P = precision
  – R = recall
  – α = 0.8
• Favors precision over recall
• This models an application in which:
  – Fluent translation is expensive
  – Missing some relevant documents would be okay
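The measure is easy to compute directly; the precision and recall values in the example calls below are hypothetical:

```python
# Unbalanced F-measure used in iCLEF 2001: F = 1 / (alpha/P + (1-alpha)/R).
# With alpha = 0.8, precision is weighted four times as heavily as recall.
def f_measure(precision, recall, alpha=0.8):
    if precision == 0.0 or recall == 0.0:
        return 0.0  # F is defined as 0 when either component is 0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# A precision-oriented result outscores a recall-oriented one at alpha = 0.8:
print(f_measure(0.9, 0.3))  # ≈ 0.643
print(f_measure(0.3, 0.9))  # ≈ 0.346
```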
French Results Overview
[Results chart, scored against official CLEF judgments and automatic (AUTO) judgments]
English Results Overview
[Results chart, scored against official CLEF judgments and automatic (AUTO) judgments]
Commercial vs. Gloss Translation
• Commercial Machine Translation (MT) is almost always better
– Significant by a one-tailed t-test (p<0.05) over 16 trials
• Gloss translation usually beats random selection
Retrieval Effectiveness
[Bar chart of F (α = 0.8), scale 0 to 1.2, for searchers umd01 through umd04, comparing MT against GLOSS translation on broad topics and on narrow topics]
iCLEF 2002 Experiment Design
[Flow diagram: a Topic Description feeds Query Formulation, producing a Query for Automatic Retrieval. The resulting standard Ranked List is scored by Mean Average Precision; Interactive Selection over that list is scored by F with α = 0.8.]
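Mean Average Precision over the standard ranked lists can be computed as follows; the relevance flags and relevant-document counts in the example are hypothetical:

```python
# Average precision for one topic: sum of precision-at-rank over the ranks
# where relevant documents appear, divided by the number of relevant
# documents in the collection for that topic.
def average_precision(rels, total_relevant):
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    # runs: one (relevance_flags, total_relevant) pair per topic
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

# Two hypothetical topics:
print(mean_average_precision([([1, 0, 1, 0], 2), ([0, 1, 1, 1], 3)]))  # ≈ 0.736
```

Dividing by the collection-wide relevant count (rather than the number retrieved) is what lets MAP penalize relevant documents the system never returned.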
Maryland Experiments
• 48 trials (12 participants)
– Half with automatic query translation
– Half with semi-automatic query translation
• 4 subjects searched Der Spiegel and SDA
– 20-60 relevant documents for 4 topics
• 8 subjects searched Der Spiegel
– 8-20 relevant documents for 3 topics
• 0 relevant documents for 1 topic!
Some Preliminary Results
• Average of 8 query iterations per search
• Relatively insensitive to topic
– Topic 4 (Hunger Strikes):
6 iterations
– Topic 2 (Treasure Hunting): 16 iterations
• Sometimes sensitive to system
– Topics 1 and 2: system effect was small
– Topics 3 and 4: fewer iterations with semi-automatic
• Topic 3: European Campaigns against Racism
Subjective Evaluation
• Semi-automatic system:
– Ability to select translations – good
• Automatic system:
– Simpler / less user-involvement needed - good
– Few functions / easier to learn and use – good
– No control over translations - bad
• Both systems:
– Highlighting keywords helps - good
– Untranslated/poorly-translated words - bad
– No Boolean or proximity operator – bad
Outline
• User needs
• System design
• User studies
• Next steps
Next Steps
• Quantitative analysis from 2002 (MAP, F)
– Iterative improvement of query quality
• Utility of MAP as a measure of query quality?
• Utility of semiautomatic translation
– Accuracy of relevance judgments
• Search strategies
– Dependence on system
– Dependence on topic
– Dependence on density of relevant documents
An Invitation
• Join CLEF
– A first step: Hungarian topics
– http://clef.iei.pi.cnr.it
• Join iCLEF
– Help us focus on true user needs!
– http://terral.lsi.uned.es/iCLEF