Customizing the IMDI metadata schema for endangered languages

Customizing the IMDI
metadata schema for
endangered languages
Heidi Johnson (AILLA)
Arienne Dwyer (DOBES)
Introduction
IMDI: the ISLE (International Standards
for Language Engineering) Metadata
Initiative
DOBES: Volkswagen Foundation’s
Documentation of Endangered
Languages initiative
AILLA: the Archive of the Indigenous
Languages of Latin America
Types of resources
Audio and video recordings in various
digital formats
Annotation text files, e.g. transcriptions
and translations
Standalone texts, e.g. dictionaries,
poetry
Wide range of genres: from verbal art to
scholarly analyses
Bundles of resources
Session (IMDI, 2001): the resources
resulting from a linguistic elicitation
session, i.e. recordings and annotations.
This models only one kind of resource
production: a recording session.
Collections will include a greater variety
of resources, in sets of related
materials.
Types of bundles
Canonical bundle: the original session. A
digitized recording, in different formats, and
some textual annotation files, also in different
formats.
Minimal bundle: a single file. Examples:
dictionary, poem, recording of uninterpretable
chants.
Meta-bundle: a bundle containing other
bundles. Example: a book about a set of
annotated recordings.
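The three bundle types can be told apart by what a bundle contains. The following is a minimal sketch only; the `Bundle` class, its field names, and the example file names are hypothetical and are not part of the IMDI schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    """Hypothetical container for a set of related resources."""
    name: str
    files: List[str] = field(default_factory=list)          # resource file names
    children: List["Bundle"] = field(default_factory=list)  # nested bundles

    def kind(self) -> str:
        # A bundle that contains other bundles is a meta-bundle.
        if self.children:
            return "meta"
        # A single file with nothing else is a minimal bundle.
        if len(self.files) == 1:
            return "minimal"
        # Anything else is treated here as a canonical, session-style bundle.
        return "canonical"


# Illustrative values only.
session = Bundle("story-session", ["story.wav", "story.mp3", "story.txt"])
poem = Bundle("poem", ["poem.pdf"])
book = Bundle("book", ["book.pdf"], children=[session])
```

Under these assumptions, `session.kind()` gives "canonical", `poem.kind()` gives "minimal", and `book.kind()` gives "meta".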
Bundle elements
Current:
– Name of bundle
– Date and place of production
Proposed:
– Resource relations
– Date archived
– Last modified
Major subschemas
Project
Collector
Content
Participants
Resources
References
The Content Subschema
Genre is the top-level category:
– Interaction: conversation, interview …
– Explanation: description, recipe …
– Performance: narrative, poem, oratory …
– Teaching: primer, textbook …
– Analysis: grammar, dictionary …
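One way to picture the genre hierarchy is as a lookup from top-level genre to subgenres. This is a sketch under assumptions: the `GENRES` table and `genre_of` helper are hypothetical names, and the table lists only the subgenres named above, which the slide itself marks as incomplete.

```python
# Top-level genres and the example subgenres named on the slide.
# The slide's lists end in an ellipsis, so this table is not exhaustive.
GENRES = {
    "Interaction": ["conversation", "interview"],
    "Explanation": ["description", "recipe"],
    "Performance": ["narrative", "poem", "oratory"],
    "Teaching": ["primer", "textbook"],
    "Analysis": ["grammar", "dictionary"],
}


def genre_of(subgenre):
    """Return the top-level genre for a subgenre, or None if it is not listed."""
    for genre, subgenres in GENRES.items():
        if subgenre in subgenres:
            return genre
    return None
```

For example, `genre_of("recipe")` returns "Explanation", while an unlisted subgenre returns `None`.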
Other Content categories
Modality: speech, writing, gesture
Communication context:
– Interactivity
– Planning
– Involvement
Languages
Task
Description
Keys
AILLA’s Content Keys
Register: a characterization of how the
discourse reflects the social context.
Example: honorific speech
Style: about poetic and stylistic effects.
Examples: parallelism, metered verse.
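Register and Style are archive-specific additions, so they would be stored as Key/Value pairs rather than as fixed schema elements. The sketch below assumes a plain dictionary for the Content record; the `add_key` helper and the record layout are hypothetical, not the IMDI serialization.

```python
def add_key(content, name, value):
    """Attach a locally defined Key/Value pair to a Content record."""
    content.setdefault("keys", []).append((name, value))
    return content


# Fixed elements come from the schema; Register and Style are local keys.
content = {"genre": "Performance", "modality": "speech"}
add_key(content, "Register", "honorific speech")
add_key(content, "Style", "parallelism")
```

Keeping local keys in their own list leaves the fixed elements untouched, so another archive can ignore keys it does not recognize.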
The Project subschema
Current elements:
– Name: a nickname or acronym
– Title: official title
– ID: a unique identifier
– Contact information
Proposed element:
– Funder: name of funding organization
The Collector subschema
AILLA renames this Depositor, since
this is the individual we have to keep
track of (e.g. for Level 3 access
permission). When the Depositor is not
also the Collector, Collector can be
listed under Participants.
The Participants subschema
Type: functional role, e.g. creator
Role: family relationship
Name/Full name
Language(s)
Ethnic group, age, sex
Education
Anonymous: True if participant’s Full name is
reserved; False otherwise
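The Anonymous flag could drive what an archive displays for a participant. This sketch assumes that Name is the public short form shown when Full name is reserved; that reading, the `display_name` helper, and the dictionary keys are all assumptions rather than documented behaviour.

```python
def display_name(participant):
    """Pick the name to show publicly for a participant record."""
    # Anonymous is True when the Full name is reserved: show only Name.
    if participant.get("anonymous", False):
        return participant["name"]
    # Otherwise prefer the Full name, falling back to Name if it is absent.
    return participant.get("full_name", participant["name"])
```

A record with `"anonymous": True` is shown by its Name only, even when a Full name is stored.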
AILLA additions to Participants
Origin: Place (country, region, etc) of
origin of the creator of the primary
resource in the bundle (e.g. the speaker
whose voice is recorded).
Occupation: Can be relevant in
assessing accuracy of some kinds of
data.
The Resources subschema
Resources contains information about
formats and provenance of files in a
bundle.
Media Files: audio, video, etc.
Annotation Files: text files.
Proposal: call them all Media Files, to
reduce redundancy in the database.
(All have URL, size, etc. elements.)
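The proposal amounts to one record type with a field that says what kind of file it is. A minimal sketch, assuming hypothetical field names (`url`, `size_bytes`, `kind`) and illustrative file values:

```python
from dataclasses import dataclass


@dataclass
class MediaFile:
    """Hypothetical single record type for every file in a bundle."""
    url: str
    size_bytes: int
    kind: str  # e.g. "audio", "video", "annotation"


def annotations(files):
    """Return only the text annotation files from a mixed list."""
    return [f for f in files if f.kind == "annotation"]


# Illustrative values only.
files = [
    MediaFile("file:///bundle/story.wav", 1000, "audio"),
    MediaFile("file:///bundle/story.txt", 10, "annotation"),
]
```

With a single type, the shared elements are defined once, and annotation files are recovered by filtering on `kind`.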
Text resources
Current elements:
– Type: type of annotation, e.g. phonetic
transcription.
– Content encoding: annotation encoding
scheme, e.g. EUROTYP.
– Character encoding: character set(s) used
in a text file.
Text resources 2
Proposed elements:
– Transcription type
– Translation (aka Glossing) type
– Software: used to produce transcriptions,
translations, other annotations (e.g.
Shoebox)
Describe Annotator in Participants
(along with Translator, etc.)
Proposed subschema
Place: composed of several elements:
– Continent
– Country
– Region
– Subregion (address)
Repeated at least twice, in Bundle and in
Participants (Origin).
Might also be useful in the Language
subschema.
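Defining Place once lets the same structure serve both the Bundle and a Participant's Origin. The sketch below is an assumption about how that reuse might look; the `Place` class, the `label` method, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """Hypothetical Place subschema with the four proposed elements."""
    continent: str
    country: str
    region: str = ""
    subregion: str = ""

    def label(self) -> str:
        # Most specific element first; empty elements are skipped.
        parts = [self.subregion, self.region, self.country, self.continent]
        return ", ".join(p for p in parts if p)


# Illustrative values only: the same Place is used in two subschemas.
recorded_at = Place("North America", "Mexico", "Chiapas")
bundle = {"name": "story-session", "place": recorded_at}
participant = {"name": "Speaker 1", "origin": recorded_at}
```

Here `recorded_at.label()` gives "Chiapas, Mexico, North America", and the bundle and participant share one definition instead of two parallel sets of fields.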
Conclusion
IMDI schema is a flexible tool.
Customization through Key/Value pairs
allows local modifications.
Most of the proposed changes are
terminological, moving from the DOBES
in-house terminology to more general
usage.