Internationalizing Digital Libraries: Towards A Standards-Based Strategy
Gregory M. Shreve
Kent State University
11/7/2004

Multilingual Modalities
Digital libraries may contain resources in many languages.
Accessible through the Internet, libraries may be consulted by
individuals in other cultural/linguistic "locales" seeking resources in
their own languages or searching across languages for resources
in languages other than their own.
[Figure: USERS and RESOURCES]
Internationalization
In order to enable the efficient and effective acquisition, storage
and retrieval of cross-cultural and cross-linguistic resources, a
digital library has to be designed from the outset to allow for
heterogeneous linguistic and cultural content. The design process
is called “internationalization.” The most effective
internationalization strategies are standards-based.
Internationalization (I18N): a design process intended to enable
subsequent linguistic and cultural adaptation.
Internationalization Strategy
An internationalization strategy for a Digital Library involves:
(1) determining the metadata elements, attributes, value spaces and
values that are culturally and linguistically dependent and are to be
rendered in multiple languages.
(2) creating a mechanism for internationalization that provides
administrative control, cross-language tools capability, authority for
keywords (terms), translations and translation equivalents.
(3) providing an internationalization scheme that offers reusability and
scalability and interfaces with relevant national and international
standards.
Other issues are also important, such as the different writing systems
and character sets of resources and different display preferences
(interface, resources), but we do not deal with these in this paper.
Internationalization & Localization
Localization is the preparation of locale-specific versions of a
digital library resource or collection and consists of the translation
of textual material into the language and textual conventions of the
target locale and the adaptation of non-textual materials and
delivery / display mechanisms to take into account the cultural
requirements of that locale.
[Figure: internationalization, followed by localization (translation, adaptation)]
Internationalization is an “upstream” engineering process that
should precede localization. Its aim is to make subsequent
localization/translation easier, more efficient, and less costly.
Internationalization & Localization
[Figure: Standards-Based Internationalization Strategies, spanning internationalization through localization]
Strategies: controlled language, terminology control, content / display separation, cultural stylesheeting, exchange standards, authority management, concept-orientation
Document processes: creation, storing, rendering, distribution, acquisition, retrieval
reusability, scalability, authority, control, quality, accessibility, acceptability, accuracy
Internationalization
Foci of Internationalization in a Digital Library:
1. resource content
2. metadata content
3. metadata elements
4. interface elements
5. keywords (terms)
6. vocabularies
[Figure labels: reusability (translations); scalability (I18N solution); authority / quality (equivalence); accessibility (cross-language); accuracy / acceptability (target culture(s)); control (target document)]
Internationalization
Loci of Internationalization in a Digital Library:
DL resource content
(new and existing translations, equivalents)
DL resource metadata & description
(element labels, content, vocabularies)
DL interface
(localized dialogs, help, messages, menus)
DL tools
(x-language: search, glossaries, taxonomies, thesauri)
Parallel Metadata: Inline Parallel
As discussed in my ASIST 2003 presentation, there are two I18N
approaches to support localizing a DL. The first approach is inline
parallel and involves providing multiple local versions of, for
instance, a title or keyword data element in a resource record. The
data elements are flagged as “local” versions via the lang attribute.
This is the most common localization method. Note that
“equivalence” is assumed via adjacency and no authority is
provided.
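A minimal sketch of what an inline parallel record might look like, assuming a hypothetical record layout (the `record`, `Title` and `Keyword` element names and the German title are illustrative, not taken from any real schema or collection). Each local version sits in the record itself and is distinguished only by its `lang` attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical resource record: local versions are stored inline and
# flagged only by a lang attribute (the German title is illustrative).
RECORD = """
<record>
  <Title lang="en-US">Thermal analysis of anisotropic bodies</Title>
  <Title lang="de-DE">Thermische Analyse anisotroper Körper</Title>
  <Keyword lang="en-US">boundary conditions</Keyword>
  <Keyword lang="fr-FR">conditions limites</Keyword>
</record>
"""

def local_versions(record_xml, element):
    """Return {lang: text} for every inline copy of one data element."""
    root = ET.fromstring(record_xml.strip())
    return {e.get("lang"): e.text for e in root.findall(element)}

titles = local_versions(RECORD, "Title")
keywords = local_versions(RECORD, "Keyword")
```

Nothing in such a record says who produced a local version or whether it was validated; equivalence is implied purely by the elements sitting next to each other.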
Inline Parallel: Flawed
Reusability: Because this method stores local equivalents of metadata
content inline with the original content in the resource record itself,
it does not provide for reusability.
Scalability: It is not easily scalable because multiple translations of
identical items will exist in different places, leading to redundancy
and difficulties in maintenance and quality control.
Authority / quality: Because there is no schema and system for
documenting and managing translations, the source, authority and
quality of equivalents and translations cannot be assured.
Accessibility, accuracy / acceptability: Because authority and quality
cannot be assured, accessibility, accuracy and acceptability cannot be
assured.
Control: The approach does not provide control.
Parallel Metadata: External Parallel
A more fruitful approach provides references to standards-based
external objects. The external objects can be translation memories
(for translations of titles, descriptions or other textual content) or
standard (e.g. ISO 12620) glossaries (for multilingual equivalents of
data element names and their possible restricted vocabulary values).
[Figure: Digital Library Resources linked to standards-based external objects: text segments to a Translation Memory, terms to an ISO 12620 Glossary]
Parallel Metadata: External Parallel
TMX-Compliant Translation Memory
<tu>
<tuv xml:lang="en-US" creationdate="20031012" creationid="Shreve">
<seg>Thermal analysis of anisotropic bodies</seg>
</tuv>
<tuv xml:lang="zh-CN" creationdate="20031012" creationid="Shreve">
<seg>[Chinese translation of the segment]</seg>
</tuv>
</tu>
<Title lang="en-US" hastranslation="true">Thermal analysis of anisotropic bodies</Title>
Optional LOM attribute: hastranslation
TMX = Translation Memory Exchange. A translation memory is a database of “aligned” text
segments that are translations of one another. It maintains linguistically “parallel” texts.
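The lookup that an external translation memory makes possible can be sketched as follows. This is only an illustration under stated assumptions: the `translate` helper is hypothetical, the TMX fragment is reduced to a bare `body`, and the German segment is an illustrative stand-in rather than data from the presentation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal TMX-style body; the German segment is an illustrative stand-in.
TMX = """
<body>
  <tu>
    <tuv xml:lang="en-US"><seg>Thermal analysis of anisotropic bodies</seg></tuv>
    <tuv xml:lang="de-DE"><seg>Thermische Analyse anisotroper Körper</seg></tuv>
  </tu>
</body>
"""

def translate(tmx_xml, text, source_lang, target_lang):
    """Return the aligned target segment for text, or None if absent."""
    for tu in ET.fromstring(tmx_xml.strip()).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if segs.get(source_lang) == text:
            return segs.get(target_lang)
    return None
```

Because the translation lives in one external object, every resource record whose title matches the source segment can reuse it instead of carrying its own copy.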
Parallel Metadata: External Parallel
Translation memories and glossaries are the most common external
localizing objects, but the growing use of statistically based corpus
linguistics to create language resources will also make it possible to
utilize other monolingual and multilingual resources in Digital
Libraries. Standards for representing and storing some of these new
language resources do not yet exist.
Parallel Metadata: External Parallel
[Figure: Ontologies, Thesauri and Taxonomies built on a Corpus]
For instance, multilingual ontologies, thesauri and taxonomies could be
constructed from term analysis of DL document corpora.
Internationalizing Metadata
Internationalizing a DL not only involves providing and controlling
translations of the content and metadata descriptive elements.
Internationalizing a metadata schema also involves determining the
elements and element attributes that could affect the scheme’s
ability to be used for classification, search, retrieval, and reuse of
learning objects in multicultural and multilingual contexts.
An internationalization strategy begins with specifying all metadata
elements that are culturally and linguistically dependent. Ideally,
internationalization is a goal during initial schema development.
Unfortunately, as with IEEE-LOM, internationalization may involve
existing data elements in a pre-existing schema. Additions and
modifications to the elements and element set may be necessary or
recommended.
Culturally Dependent Metadata
LOM element 2.3.3 Date is an example of a culturally dependent
metadata element. CEN (European Committee for Standardization)
suggests extensions to “internationalize” Date:
<DATETIME>2003-12-25</DATETIME>
<DATETIMELOCALE>
<LOCALE>US</LOCALE>
<SOURCE>http://standards.org/us/calendarSpecs.pdf</SOURCE>
<LOCALIZEDDATETIME>12/25/03</LOCALIZEDDATETIME>
</DATETIMELOCALE>
<DATETIMELOCALE>
<LOCALE>UK</LOCALE>
<LOCALIZEDDATETIME>25/12/03</LOCALIZEDDATETIME>
</DATETIMELOCALE>
<DATETIMELOCALE>
<LOCALE>AE</LOCALE>
<SOURCE>http://standards.org/ae/calendarNumSpecs.pdf</SOURCE>
<LOCALIZEDDATETIME>1/11/1424</LOCALIZEDDATETIME>
</DATETIMELOCALE>
<DATETIMELOCALE>
<LOCALE>AE</LOCALE>
<SOURCE>http://standards.org/ae/calendarTextSpecs.pdf</SOURCE>
<LOCALIZEDDATETIME>1 Dhu’l-Qa’dah 1424</LOCALIZEDDATETIME>
</DATETIMELOCALE>
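As a rough sketch of how the localized values above could be derived from the single canonical date, the following assumes a hypothetical pattern table keyed by the `LOCALE` values in the example. It covers only the two Gregorian renderings; the AE entries would need a Hijri calendar conversion, which is not attempted here.

```python
from datetime import date

# Hypothetical per-locale patterns covering only the two Gregorian examples
# above; the AE entries need a Hijri calendar conversion not sketched here.
DATE_PATTERNS = {"US": "{m}/{d}/{yy}", "UK": "{d}/{m}/{yy}"}

def localized_datetime(iso_value, locale):
    """Render an ISO 8601 date string in the short form used by a locale."""
    value = date.fromisoformat(iso_value)
    return DATE_PATTERNS[locale].format(
        d=value.day, m=value.month, yy=f"{value.year % 100:02d}"
    )
```

With the canonical value `2003-12-25`, the function yields `12/25/03` for `US` and `25/12/03` for `UK`, matching the `LOCALIZEDDATETIME` values in the example.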
Culturally dependent data types: Addresses, Calendar, Currency, Date, Numbers, Telephone, Time
Culturally Dependent Metadata Values
Some “universal” metadata elements have values that may be very
culturally dependent. For instance, LOM 5.6 Educational.Context has a
value space [school, higher education, training, other] that is not
only extremely limited, but derives from a single cultural context.
Different countries have different educational systems. The LOM values
are often not applicable or do not have a real correspondence [1].
Culturally Dependent Metadata Values
Although CEN has suggested simply “enlarging” the value space for
such elements, true internationalization of these “system” dependent
elements would involve providing a locale specification for the
element so that a specific vocabulary could be retrieved.
<education locale='en-US'>
<context>
value space: Kindergarten, Elementary School, Middle School, High School, …
</context>
</education>
<education locale='de-DE'>
<context>
value space: Kindergarten, Grundschule, Hauptschule, Realschule, Gesamtschule, Gymnasium, …
</context>
</education>
The ISO 639 language codes and the ISO 3166 country codes do not allow
for even more “local” localization. In Germany, for instance, the
Bavarian school system differs from the German “norm.”
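One way to picture retrieval of a locale-specific vocabulary is a lookup that tries the requested locale first and then falls back to coarser tags. This is a sketch under assumptions: the `context_value_space` function and the three-part tag `de-DE-BY` are hypothetical, and no Bavarian value space is supplied because the presentation does not give one.

```python
# Value spaces keyed by locale, taken from the slide; "…" entries omitted.
VALUE_SPACES = {
    "en-US": ["Kindergarten", "Elementary School", "Middle School", "High School"],
    "de-DE": ["Kindergarten", "Grundschule", "Hauptschule", "Realschule",
              "Gesamtschule", "Gymnasium"],
}

def context_value_space(locale):
    """Return the value space for locale, trying coarser tags on a miss."""
    parts = locale.split("-")
    while parts:
        tag = "-".join(parts)
        if tag in VALUE_SPACES:
            return VALUE_SPACES[tag]
        parts.pop()  # e.g. the hypothetical "de-DE-BY" falls back to "de-DE"
    raise KeyError(locale)
```

A request for the hypothetical `de-DE-BY` silently receives the national `de-DE` vocabulary, which illustrates the limitation: without a finer-grained locale and its own value space, regional differences are lost.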
Metadata: Translation?
Creating locale-specific value spaces for more “universal” data
elements is a complex task. Localized value spaces cannot be
achieved by simply translating the existing or default values.
en-US: Kindergarten, Elementary School, Middle School, High School, …
de-DE: Kindergarten, Grundschule, Hauptschule, Realschule, Gesamtschule, Gymnasium, …
Some values may have one-to-one equivalence. Others do not.
Middle school (junior high) may include one or more of Hauptschule /
Realschule / Gymnasium / Gesamtschule. The values imply
different age ranges, different educational objectives and values and
different social structures.
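The difference between translating a value and mapping it can be made concrete with a small equivalence table. This is a hedged sketch: the `EQUIVALENTS` table and `equivalence_kind` function are hypothetical, and only the Middle School row is taken from the text above; any other row would have to come from in-country domain experts.

```python
# Equivalence table from en-US to de-DE values. Only the Middle School row
# comes from the text above; other rows would need in-country experts.
EQUIVALENTS = {
    "Middle School": {"Hauptschule", "Realschule", "Gymnasium", "Gesamtschule"},
}

def equivalence_kind(value):
    """Classify how a source value maps into the target value space."""
    targets = EQUIVALENTS.get(value)
    if not targets:
        return "undocumented"
    return "one-to-one" if len(targets) == 1 else "one-to-many"
```

A record tagged "Middle School" therefore cannot be given a single German label by translation alone; the mapping itself has to be stored and managed.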
Restricted Vocabularies
Multilingual / multicultural restricted vocabularies must be developed
as standards by in-country domain experts. Equivalence should be
controlled, standardized and authoritative.
LOM 5.2 Learning Resource Type
element value space
exercise
simulation
questionnaire
diagram
figure
graph
index
slide
table
narrative text
exam
experiment
problem statement
self assessment
lecture
European Treasury Browser Controlled Vocabulary (authoritative equivalence, validated mapping):
Lecture, Vorlesung, Conferencia, Conferenza, Föreläsningar, Foredrag, διάλεξη
Restricted Vocabularies
Multilingual / multicultural restricted vocabularies should be
concept-based. For two vocabulary items to be equivalent they should
represent the same concept. The concepts should be documented in
authoritative multilingual glossaries such as those specified in ISO
12620. Such glossaries provide one of the bases for external
parallel metadata methods.
[Figure: ISO 12620 Glossary linking one concept to its labels: lecture, Vorlesung, Conferencia, Conferenza, Föreläsningar, διάλεξη, Foredrag]
Concept Object
Concept objects are the core of terminology glossaries. They organize
both monolingual and multilingual data. Organized into terminology
glossary databases for computer-assisted translation, they are
indispensable in today’s language industry.
KOS, Glossary and Concept
When concepts are documented in authoritative multilingual glossaries
they can also provide the basis for KOS (knowledge organization
systems) of use in concept-mediated monolingual and multilingual
browsing and searching in DLs.
ISO 12620 Terminology Glossary
• A terminology is concept-oriented.
• A terminology is documented in a glossary, not a dictionary.
• A terminology glossary is organized by concept, not by linguistic
label.
• A term is the word, lexical string, or linguistic label used to
designate a single concept in the language / culture / subculture
of a special subject field.
• A glossary documents the multiple words or lexical strings (in a
single language or in multiple languages) that designate a single
concept.
• A glossary thus organizes synonyms (monolingual) and
equivalents (multilingual) of a concept.
• The organization of a terminology system / glossary reflects the
knowledge organization system of the domain it describes. It is
also a Knowledge Organization System (KOS) document.
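The concept-oriented organization described in these points can be sketched as a small data structure. The `Concept` class and its method names are hypothetical and are not part of ISO 12620; the sample entry reuses the "boundary conditions" example that appears elsewhere in this presentation.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One glossary entry: a concept plus its labels, grouped by language."""
    concept_id: str
    definition: str
    terms: dict = field(default_factory=dict)  # language code -> list of terms

    def synonyms(self, lang):
        """Monolingual labels that designate this concept."""
        return self.terms.get(lang, [])

    def equivalents(self, source_lang):
        """Multilingual labels: every term outside the source language."""
        return {lang: terms for lang, terms in self.terms.items()
                if lang != source_lang}

# Terms and definition are taken from the "boundary conditions" example.
entry = Concept(
    concept_id="boundary conditions",
    definition="Those physical and/or mechanical conditions existing around "
               "the surfaces and limits of a structural body.",
    terms={"en-us": ["boundary conditions"], "fr-fr": ["conditions limites"]},
)
```

The entry is keyed by the concept rather than by any one label, so synonyms and equivalents are both just views over the same record.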
Glossary: ISO 12620 Data Categories
<!-- concept -->
<termEntry id="boundary conditions">
  <!-- concept description -->
  <descrip type="subjectField">Computational Materials Science</descrip>
  <descrip type="definition">Those physical and/or mechanical conditions existing
  around the surfaces and limits of a structural body.</descrip>
  <admin type="source">Composite Materials Dictionary:
  http://composite.about.com/library/glossary/blglossary-d.htm</admin>
  <!-- concept relations -->
  <descrip type="superordinateconcept" target="boundary">boundary</descrip>
  <!-- administration -->
  <admin type="originatingPerson">Adriana Luchian</admin>
  <!-- language set -->
  <langSet xml:lang="en-us">
    <!-- term (label) information -->
    <tig>
      <term>boundary conditions</term>
      <date>4/12/03</date>
      <descrip type="context">For solids with spatial discontinuities,
      such as bounded solids or those containing holes, crack,
      interfaces, etc., we need to satisfy some prescribed boundary
      conditions.</descrip>
      <admin type="source">Computational Materials Science Corpus,
      Kent State University, March, 2003</admin>
    </tig>
    <!-- equivalence -->
    <termNote type="transferComment">...</termNote>
  </langSet>
  <!-- language set -->
  <langSet xml:lang="fr-fr">
    <!-- term (label) information -->
    <tig>
      <term>conditions limites</term>
      <date>4/12/03</date>
      <descrip type="context">Elles ont été appliquées au cas d'un objet
      impénétrable " mou " (condition de Dirichlet sur son contour) par
      C. Rozier et objet " dur " (condition de Neumann sur son contour)
      par E. Bocly et moi-même immergé dans un guide d'onde dont les
      parois sont impénétrables (la condition limite à la surface est de
      Dirichlet et sur le fond de Neumann).</descrip>
      <admin type="source">Computational Materials Science Corpus,
      Kent State University, March, 2003</admin>
    </tig>
  </langSet>
</termEntry>
ISO TC 37 Glossaries
Thomas Baker, in his discussion of the Dublin Core in multiple
languages, laments the lack of “comprehensive dictionaries” for
metadata labels and vocabularies [2].
Many issues in multilingual, multicultural DL development revolve
around cultural variation in concept description and concept systems
(KOS) and establishing linguistic authority (access to authoritative
terms, documentation of authority and availability of authoritative
equivalents). What we really need to support DL metadata schemas
is not a “dictionary,” but standards-based external internationalization
strategies such as TMX translation memories and multilingual
terminology glossaries as defined by ISO TC 37’s ISO 12620 and
other standards.
ISO TC 37: Standardization of principles, methods and applications
relating to terminology and other language resources.
ISO TC 37 Glossaries and Searching
A concept-based multilingual glossary can be implemented to support
cross-language searching. A glossary can provide authority for keyword
selection, with the multilingual equivalents then included in
“parallel” in the resource record. Alternatively, a glossary-based DL
can make it unnecessary to include more than one local term in the
resource record.
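The second option, where each record carries a single local keyword and the glossary mediates the search, might be sketched like this. The `GLOSSARY` and `RECORDS` structures and the `concept_search` function are hypothetical; the terms are equivalents of "lecture" from the controlled vocabulary example, and the two-letter language codes are assumptions.

```python
# Hypothetical glossary: concept id -> {language: keyword}. The terms are
# equivalents of "lecture" listed earlier; the language codes are assumed.
GLOSSARY = {
    "lecture": {"en": "lecture", "de": "Vorlesung", "es": "Conferencia",
                "it": "Conferenza"},
}

# Hypothetical records, each carrying a single local keyword.
RECORDS = [
    {"id": "r1", "keyword": "Vorlesung"},
    {"id": "r2", "keyword": "Conferenza"},
    {"id": "r3", "keyword": "diagram"},
]

def concept_search(query):
    """Match records whose keyword shares a concept with the query term."""
    wanted = {query}
    for labels in GLOSSARY.values():
        if query in labels.values():
            wanted.update(labels.values())
    return [r["id"] for r in RECORDS if r["keyword"] in wanted]
```

A query for "lecture" reaches the German and Italian records through the shared concept, even though neither record stores an English keyword.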
[Figure: concept-mediated multilingual search: a query passes through the Glossary to Keyword L1, Keyword L2, Keyword L3 and Keyword L4]
Data Element Names
A glossary can also be implemented to provide localized labels for
data element names. In the event there are “local” versions of a
schema (a Dublin Core or IEEE-LOM not in English) that need to be
equated for software exchange, or data elements that need to be
explained (training, help files) or used in an interface (resource
submission form), a glossary can provide authoritative multi-language
labels for a canonical data element name and its attributes.
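A label glossary of this kind can be sketched as a table keyed by the canonical element identifier. Everything named here is an assumption made for illustration: the `dc:title` identifier, the `ELEMENT_LABELS` table, and both helper functions.

```python
# Hypothetical label glossary keyed by a canonical element identifier.
# The identifier and the non-English labels are illustrative only.
ELEMENT_LABELS = {
    "dc:title": {"en": "Title", "de": "Titel", "fr": "Titre"},
}

def element_label(name, lang, default_lang="en"):
    """Return the localized label, falling back to the default language."""
    labels = ELEMENT_LABELS[name]
    return labels.get(lang, labels[default_lang])

def canonical_name(label, lang):
    """Map a local element name back to its canonical identifier."""
    for name, labels in ELEMENT_LABELS.items():
        if labels.get(lang) == label:
            return name
    return None
```

The first function serves interfaces and help files; the second supports exchange between local versions of a schema by resolving each local name to the same canonical identifier.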
[Figure: Glossary mapping a token (canonical element name or identifier) to labels: L1 element name, L2 element name, L3 element name]
Conclusion & References
Adding multilingual and multicultural metadata to a DL involves:
1. Determining the metadata elements, attributes, value spaces and
values that are culturally dependent and, if the display and
interface are to be localized, those metadata elements that are to
be rendered in multiple languages;
2. Providing external parallel strategies for localization.
3. The external parallel system is a more robust localization
approach, providing control, administrative tools, authoritative
terminology, and authority for translations and equivalents.
4. The external parallel system offers reusability and scalability,
and leverages the strengths of international standards.
1. European Committee for Standardization. 2003. CEN Workshop Agreement 14643. Internationalisation of the IEEE Learning Object Metadata. ICS 03.180; 35.060; 35.240.99.
2. Baker, Thomas. 1997. Metadata Semantics Shared Across Languages: Dublin Core in languages other than English. http://dublincore.org/documents/multilingual-semantics/
3. European Schoolnet. Recommended data model format to be used as a standard by national systems to include national/local resources in the EU Treasury Browser. http://www.en.eun.org/etb/survey/d4.2.pdf