WP 11 Kick-Off meeting

Transcript WP 11 Kick-Off meeting

WP 10 Multilingual Access
Philipp Daumke, Stefan Schulz
Multilingual Access - Rationale
[Figure legend: English as a Foreign Language, English as Second Language, English as First Language, No English Language Skills]
• < 70 % of the world's scientists read in English
• 80 % of the world's electronically stored information is in English
• 90 % English articles in Medline (2000)
Sources: The British Council, 2005
Fung ICH: Open access for the non-English-speaking world: overcoming the language
barrier. Emerging Themes in Epidemiology, 2008
Non-native speakers
English as a Foreign Language
English as Second Language
• Broad range of command of English
• Reading skills > writing skills
• Reduced active vocabulary
Difficulty in formulating precise queries
Cross-language document retrieval example
German query: "Korrelation von Hypertonie und Läsion der Weißen Substanz…"
English equivalent: "Correlation of high blood pressure and lesion of the white substance"
BootStrep WP 10 - Multilingual access
• Objectives:
– To provide a multilingual search interface to the BootStrep
Biolexicon / Bioontology
– We do NOT propose to deliver a multilingual extension of the
BootStrep biolexicon
• Query Languages: French, German, English, (Italian)
• Output language: English
• Method: Subword-based semantic indexing
• Resources:
– MorphoSaurus multilingual subword lexicon & thesaurus
– MorphoSaurus Semantic Indexer
Technique: Morphosemantic Indexing
• Subword-based, multilingual semantic indexing for document
retrieval
• Subwords are atomic, conceptual or linguistic units:
– Stems: stomach, gastr, diaphys
– Prefixes: anti-, bi-, hyper-
– Suffixes: -ary, -ion, -itis
– Infixes: -o-, -s-
• Equivalence classes contain synonymous subwords and their
translations:
– #derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
– #inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, … }
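The equivalence classes above can be read as a lookup table from subword to class identifier. The following is a minimal sketch under that assumption, not the MorphoSaurus data format: the names EQUIVALENCE_CLASSES, build_lookup and same_class are hypothetical, and the entries are copied from the two examples above with affix hyphens dropped.

```python
# Hypothetical miniature thesaurus: class identifier -> member subwords.
# Entries are taken from the slide examples; hyphens on affixes are dropped.
EQUIVALENCE_CLASSES = {
    "#derma": ["derm", "cutis", "skin", "haut", "kutis", "pele", "piel"],
    "#inflamm": ["inflamm", "itic", "itis", "phlog", "entzuend",
                 "itisch", "inflam", "flog"],
}


def build_lookup(classes):
    """Invert class -> subwords into a subword -> class dictionary."""
    lookup = {}
    for class_id, subwords in classes.items():
        for subword in subwords:
            lookup[subword] = class_id
    return lookup


LOOKUP = build_lookup(EQUIVALENCE_CLASSES)


def same_class(a, b):
    """True if both subwords are known and share one equivalence class."""
    return a in LOOKUP and LOOKUP[a] == LOOKUP.get(b)


print(same_class("skin", "haut"))  # True: English and German, both #derma
print(same_class("skin", "itis"))  # False: #derma vs. #inflamm
```

With such a table, synonyms and translations collapse to one identifier, which is what makes the index language-independent.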
Subword Thesaurus Structure
• Thesaurus: ~21.000 equivalence classes (MIDs)
• Lexicon entries:
– English: ~23.000
– German: ~24.000
– Portuguese: ~15.000
– Spanish: ~11.000
– French: ~8.000
– Swedish: ~10.000
– Italian: ~4.000
• Subword → equivalence class (diagram):
– herz, heart, corazon, card → HEART
– muskel, muscle, muscul, myo → MUSCLE
– entzünd, inflamm, inflam, -itis → INFLAMM
Segmentation → Indexation:
– Myo|kard|itis → #muscle #heart #inflamm
– Herz|muskel|entzünd|ung → #heart #muscle #inflamm
– Inflamm|ation of the heart muscle → #inflamm #heart #muscle
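The segmentation and indexation steps shown above can be sketched as follows. This is an illustration only: greedy longest-match segmentation is an assumed strategy (the slides do not say how MorphoSaurus segments words), and LEXICON, STOPWORDS, segment and index are hypothetical names with toy contents.

```python
# Toy lexicon: subword -> equivalence class. None marks a subword that
# carries no class of its own (assumed here for derivational suffixes).
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "kard": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "entzünd": "#inflamm", "inflamm": "#inflamm",
    "ung": None, "ation": None,
}
STOPWORDS = {"of", "the"}  # assumed stop list for the English example


def segment(word):
    """Greedy longest-match segmentation; None if the word is not covered."""
    word = word.lower()
    parts, pos = [], 0
    while pos < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), pos, -1):
            if word[pos:end] in LEXICON:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no lexicon entry starts at this position
    return parts


def index(text):
    """Map a text to its sequence of equivalence-class identifiers."""
    class_ids = []
    for token in text.split():
        if token.lower() in STOPWORDS:
            continue
        for part in segment(token) or []:
            if LEXICON[part] is not None:
                class_ids.append(LEXICON[part])
    return class_ids


print(segment("Herzmuskelentzündung"))  # ['herz', 'muskel', 'entzünd', 'ung']
print(index("Myokarditis"))             # ['#muscle', '#heart', '#inflamm']
print(index("Inflammation of the heart muscle"))
```

The three expressions yield the same set of identifiers in different orders, so a bag-of-classes index treats them as equivalent.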
Indexing Pipeline
[Diagram: subword-based document transformation by the morphosemantic indexer]
Subword-Based Search
Subword-based query transformation:
Query: "Korrelation von Hypertonie und Läsion der Weißen Substanz…"
Index terms: #correl #hyper #tens #lesion #whit #matter
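Once queries and documents are both expressed as equivalence classes, matching no longer depends on the query language. A minimal sketch of that idea follows. The scoring rule (fraction of query classes found in a document) is an assumption made for illustration, not the ranking function of the search engine, and the document collection is invented.

```python
def overlap_score(query_ids, doc_ids):
    """Fraction of the query's classes that also occur in the document."""
    query, doc = set(query_ids), set(doc_ids)
    return len(query & doc) / len(query) if query else 0.0


def rank(query_ids, docs):
    """Return ids of matching documents, best score first (ties by id)."""
    scored = [(overlap_score(query_ids, ids), doc_id)
              for doc_id, ids in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Class identifiers of the German query, as given on the slide.
QUERY = ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter"]

# Hypothetical English documents, already transformed by the indexer.
DOCS = {
    "doc_a": ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter",
              "#patient"],
    "doc_b": ["#hyper", "#tens", "#therap"],
    "doc_c": ["#gene", "#regul"],
}

print(rank(QUERY, DOCS))  # ['doc_a', 'doc_b']
```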
Adapting Morphosemantic Indexing to BootStrep
• BootStrep terminology mostly disjoint from existing
clinical terminology
• Enhancement of data resources (e.g. for acronym
resolution, multi-term equivalences)
• BootStrep Terms for multilingual access
– Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
• Medline subcorpus (about E. coli gene regulation)
Ongoing/Completed Tasks
• Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
• Multilingual Terminology Browser
– 2268 GO terms + translations
– 6925 InterPro terms + translations
– 2082 IntAct terms + translations
– URL: http://www.medinf.uni-freiburg.de/demo/BootStrepBrowser/
• Multilingual Search Engine:
– Document collection: BootStrep-Medline subset
– Languages: English, German, French
– Query modes: Author, Title, Title + keywords, All
[Screenshot: Terminology Browser, with panels Search Results, Navigation, Further Information]
[Screenshot: Multilingual Search Engine]
To do: Tools and Resources
• BootStrep-Browser
– Integration of Species
– Integration of the Gene Regulation Ontology
• Multilingual Search Engine
– Multilingual treatment of acronyms
– Inclusion of species synonym list
– Dealing with mixed queries (German-English, English-French)
– Integration with the fact store
• Continue lexicon population
– Italian terms?
To do: Evaluation
• Creation of a gold standard
– Typical English queries
– Find all relevant documents in the E. coli subset
• CLIR experiments
– Translate queries to French and German
– Compare mean average precision
• Reuse of already existing routines on standard
benchmarks (OHSUMED, ImageCLEF)
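For the planned comparison of mean average precision, the standard definition can be sketched as below. The function names and the two example runs are hypothetical; the document ids and relevance judgements are invented and are not results from the BootStrep collection.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Invented example: the same two queries under one retrieval condition.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

Running this once per query language (English original, French and German translations) against the same gold standard gives the per-language figures to compare.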
ImageCLEFMed Benchmark
• Baseline: monolingual
– Stemmed English queries
– Stemmed English texts
• Query translation
– Google translator
– Multilingual dictionary compiled from UMLS
• Morphosaurus
– Interlingual representation of user queries and documents
• Morphosemantic Indexing
– incorporating disambiguation module
[Chart: Top 20 Average Precision. Y-axis: Percent of Baseline (0 to 100). X-axis: Language and Condition (DE German, PT Portuguese, EN English, SP Spanish, FR French, SV Swedish, AV Average). Legend: Query Translation, Morphosemantic Indexing, Morphosaurus+D.]