The Rosetta Project: ALL Language Archive and the Impact

The Rosetta Project
Digital Language Archive
Laura Buszard-Welcher
The Long Now Foundation /
University of California, Berkeley
The Rosetta Project Archive
• A public, Web-based, digital archive of language documentation
• Part of the National Science Digital Library (NSF program for
dissemination of educational STEM resources)
• Over 95,000 pages of resources on over 2,300 languages
• Over 3,000 wordlists (Swadesh lists, 500-1500 term lists)
• New! Audio files
Project Goals: Resources
• We are a digital language archive with comprehensive,
global scope: we can and do accept digital resources
on any language, dialect, family, or subgroup.
• We promote linguistic diversity by broadly disseminating
resources on languages with small numbers of
speakers, contributing to the effort to document and
disseminate resources on endangered languages.
• Comprehensive scope both requires and builds
communities: global networks of linguists, speakers,
educators
Project Goals: Interoperability
and Resource Discovery
• Supporting metadata standardization and
interoperability (OLAC participating archive and
individuals, E-MELD, GOLD, LSA Conversation
on Endangered Language Archiving)
• Promoting resource discovery through open archive
search: we serve oai_dc, nsdl_dc, olac_dc metadata
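As an illustration of what serving several metadata formats means for a harvester: assuming the formats are exposed through OAI-PMH (the harvesting protocol used by OLAC), a client selects a format with the `metadataPrefix` argument. The sketch below only builds request URLs. The base URL and the helper name `list_records_url` are placeholders for illustration, not the archive's actual endpoint or code.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; the real base URL is not given in the slides.
BASE_URL = "http://example.org/oai"

# Metadata formats named on the slide.
FORMATS = ("oai_dc", "nsdl_dc", "olac_dc")


def list_records_url(metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    if metadata_prefix not in FORMATS:
        raise ValueError(f"unsupported metadata format: {metadata_prefix}")
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Print one harvest URL per format.
    for prefix in FORMATS:
        print(list_records_url(prefix))
```

The same records can then be harvested in whichever format a given aggregator understands, which is the point of offering all three.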
Project Goals: Developing tools for
collaborative linguistic research
• Endangered Language Query Room
• DOCS (Digital Online Curation Services)
• LangGator
• Wordlist tool (collaboration with MPI-EVA)
• New Rosetta V2.0 Website
Site Infrastructure
• Plone 2.1 content management system, running in the
Zope Application Server
• Open source, leverages worldwide developer
communities
• Lots of “plug in” modules for functionality expansion
– CMF Bibliography AT, Plone Board, etc.
• Heavily modified infrastructure (language node
design) and user interface
Nodal Architecture
• Languages, language families, family subgroups,
dialects all represented by nodes.
• A node is a content aggregation page
• Nodes and parent-child relationships each have unique
IDs
• The system currently represents Ethnologue language
relationships, but it is flexible enough to be agnostic about
them and to represent relationships from various theoretical
perspectives
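One way to picture the nodal design is as a small data model in which nodes and parent-child links are separate records, each with its own ID, and each link is tagged with the classification that asserts it. The sketch below is a guess at the shape of such a model, not the site's actual schema. All class names, field names, IDs, and scheme labels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    name: str
    kind: str  # "family", "subgroup", "language", or "dialect"


@dataclass(frozen=True)
class Relationship:
    rel_id: int     # relationships carry their own unique ID
    parent_id: int
    child_id: int
    scheme: str     # which classification asserts this link


def children(node_id, relationships, scheme):
    """Return the child node IDs of node_id under one classification scheme."""
    return [r.child_id for r in relationships
            if r.parent_id == node_id and r.scheme == scheme]


# Illustrative data only.
nodes = {
    1: Node(1, "Family A", "family"),
    2: Node(2, "Subgroup A1", "subgroup"),
    3: Node(3, "Language X", "language"),
}
relationships = [
    Relationship(100, 1, 2, "ethnologue"),
    Relationship(101, 2, 3, "ethnologue"),
    Relationship(102, 1, 3, "alternative"),  # hypothetical competing analysis
]
```

Because the scheme lives on the relationship and not on the node, the same node page can sit in different trees depending on the classification being browsed.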
Node Pages
• Accessible from a variety of
browse and search pages
– Browse by language name, family,
country, data type
– Quick search, advanced search
• Node page organization
– Node metadata
– Descriptive Resources
– Navigation: classification tree
– Links to people functions, LINGUIST List people search
– External links: searches
Content
• In-house collection, vetting
– Primary focus of collection
– Rosetta descriptive categories
• Special collections
– Endangered Language Fund (ELF) Digital Archives
– Alan Lomax Audio Collection
– Future collections that come in through DOCS
• Future development
– Uploaded, peer-reviewed resources
– Collaborative content areas (bulletin boards, wiki)
Scanning
• Historically, the primary focus of in-house collection
• Rosetta serves over 95,000 images from a variety of
published resources
• Excerpts in data categories (see following slides)
• Public domain resources can be scanned in their entirety
Categories of Collection (1)
• Ethnologue metadata: General information from www.ethnologue.com about
language affiliation, where spoken, number of speakers,
dialects, alternate language names.
• General description: General description of the language. Origin and current
distribution of language, number of speakers, family,
typology, history, etc.
• Maps: Maps of the geographic distribution of a language and its
relationship to other languages in the region.
• Orthography: Writing system(s) of the language with any
accompanying guide to pronunciation, use, etc.
• Phonology: A description of the basic sound units in a language
(phonemes) and how they combine to form utterances.
Categories of Collection (2)
• Grammar: How a language combines the smallest units of meaning
(morphemes) to create words and words to create
sentences.
• Core Word Lists: A common word list of 100 or 200 terms typically
collected in linguistic fieldwork (“Swadesh Lists”), often
used for comparative purposes.
• Numbers: A description of the numbering system(s) in a language
with a list of basic terms.
• Parallel Texts: A common text with translation for each language.
Initially Genesis Chapters 1-3 (a commonly collected
text). Now also the UN Declaration of Human Rights.
• Glossed Texts: Transcribed indigenous texts with word glosses, free
translations and grammatical markup.
Resource Pages
• Accessed from node pages
• Bibliographic metadata
• Links to other resources
• Resource bundles
• Associated resource files
– Scanned images
– OCR’ed live text files
– Annotated text files
– Audio/video files
– User comments
Community Functions
• Goal: build a network of linguists, speakers, educators
• People:
– Member pages
– Regional and language curators
• Collaborative content:
– Discussions (nodes, resources)
– Resource upload
– Vetting by volunteer language/family experts
– In the future? Wiki documents (unvetted, but resources
produced may go through higher vetting levels)
Member Gallery
• Central access to member search and browse
• Central access to language forums
• Highlighted members
Member Profile Page
• User-defined content area
• List of recent uploads
• Lists of recent forum postings
Audio Digitization
• Alan Lomax language audio
collection (mostly reel-to-reel,
some cassette)
• Edirol external digitizer (96
kHz sample rate, 24-bit depth)
• Sound Forge 7.0,
uncompressed .wav
• Now accepting audio deposits
(on a limited basis)
• We archive and serve digital
resources, not physical media
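The digitization settings above have direct storage consequences, since uncompressed audio grows linearly with sample rate, bit depth, and channel count. The sketch below works through that arithmetic. The slide does not state the channel count, so stereo is an assumption here, and the function name `pcm_bytes` is made up for illustration.

```python
def pcm_bytes(seconds, sample_rate=96_000, bit_depth=24, channels=2):
    """Approximate size of uncompressed PCM audio, ignoring the WAV header.

    channels=2 (stereo) is an assumption; the slides do not specify it.
    """
    return seconds * sample_rate * (bit_depth // 8) * channels


one_hour = pcm_bytes(3600)
print(f"{one_hour / 1e9:.2f} GB per hour of stereo audio")  # prints 2.07 GB ...
```

At these settings a mono recording needs half as much space, roughly 1.04 GB per hour.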
Rosetta Depositor Consent Form
• Prompted by special collections (ELF, Alan Lomax Audio)
• Intended to work on paper, or in digital form
• Inspired by AILLA’s graded access system
• Encourages depositors to see archiving as a kind of publication:
assumes dissemination of some or all of resources
– “In general, we encourage all depositors to make their resources freely
available, and to consider archiving with us as a form of publication. If
you feel the need to place an extreme form of restriction on the resource,
then our project may not be the most suitable place to archive your
resource. We reserve the right to archive only those resources that we
deem appropriate to our project, with respect to both content and access.”
Level 1: Open access to recordings
Users have full access to recordings after agreeing to
our Terms and Conditions. For this level, we assume
that depositors have already gained permission for
public access from the speakers or authors of the
resource. Level 1 access may be applied to the entire
deposit, or to parts of the deposit. If portions of the
deposit are to be restricted, attach a detailed description
that clearly identifies them, and designate one of the
following access levels (2-5) for each restricted portion.
Level 2: Access limited by password
Users may access recordings only if they know a password that
you create. This type of access allows you to keep resources
private, or provide access to others by sharing the password with
them. Access limited by passwords must be renegotiated with
The Rosetta Project every five years, at which time depositors
may continue use of a previous password, choose a new
password, or select another access level (Rosetta will contact the
depositor at the appropriate time). If not renegotiated, access to
the resource changes to open access (Level 1).
Level 3: Access protected by a time limit
Users may not access the resources until after a specified date.
Although we encourage all depositors to make their resources
freely available, we understand that some depositors may want to
restrict access to resources for a few years (normally five or
fewer) while preparing a publication, such as a dissertation. After
the date you specify, access to the resource changes to open
access (Level 1).
Levels 4 and 5: Designated Controllers
Level 4. The depositor controls access to the resource. The Rosetta
Project will provide contact information; the user must contact the
depositor directly for permission, and the depositor will then
write to The Rosetta Project. If permission is granted, The
Rosetta Project will give the user access to the resource.
Level 5. The depositor designates another person or organization
to control the resource (please attach the controller’s contact
information). The Rosetta Project will contact the controller on
the user’s behalf. If permission is granted, The Rosetta
Project will give the user access to the resource.
Depositor/Controller Responsibilities
Note: for Levels 2, 3, 4, and 5, the depositor must
ensure that the appropriate contact information is
up to date. If contact information is not up to
date, or documented good faith attempts made by
the Rosetta archive or its users to obtain access
are not answered, then determination of
permission to access and use the resource reverts
to the curator of the archive.
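The time-based rules for Levels 2 and 3 can be read as a small state machine in which restricted deposits eventually fall back to open access. The sketch below models only that fallback, under stated assumptions: the class and field names are hypothetical, a Level 2 deposit is assumed to store the date of its next five-year renegotiation, and the controller rules for Levels 4 and 5 (including reversion to the curator) are not modeled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

OPEN = 1  # Level 1: open access


@dataclass
class Deposit:
    level: int                             # access level chosen by the depositor (1-5)
    renegotiate_by: Optional[date] = None  # Level 2: next renegotiation deadline
    embargo_until: Optional[date] = None   # Level 3: date after which access opens


def effective_level(deposit, today):
    """Return the access level in force on `today`."""
    if deposit.level == 2 and deposit.renegotiate_by and today > deposit.renegotiate_by:
        return OPEN  # password access lapsed without renegotiation
    if deposit.level == 3 and deposit.embargo_until and today > deposit.embargo_until:
        return OPEN  # the depositor's time limit has passed
    return deposit.level  # Levels 1, 4, and 5 do not change with time here
```

For example, `effective_level(Deposit(3, embargo_until=date(2010, 1, 1)), date(2011, 1, 1))` returns `1`, reflecting that every time-limited condition works toward open access.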
The Archivist in the Driver’s Seat
• Archiving and serving digital resources is a valuable
(and expensive) service
• Some archives also provide digitization services
• For these reasons, archives can be expected to set
conditions on what they will archive
• Rosetta’s consent forms are intended to ensure that:
– The majority of our resources are publicly accessible on the
Web (all are available for listening in person)
– The archivist is never at the mercy of extreme access restrictions
– All access conditions work toward open access (Level 1)
URLs
• Electronic Metastructure for Endangered Language Data (E-MELD)
http://www.emeld.org (School of Best Practice, FIELD Tool)
• Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/
• The Ethnologue http://www.ethnologue.com
• General Ontology for Linguistic Description (GOLD) http://www.linguisticsontology.org or http://emeld.org/school/workroom/terminology/
• LINGUIST List http://www.linguistlist.org
• National Science Digital Library (NSDL) http://nsdl.org
• ODIN www.csufresno.edu/odin
• Open Language Archives Community (OLAC) http://www.languagearchives.org
• The Rosetta Project http://www.rosettaproject.org/live (a preview of the new
Web site is available at http://preview.rosettaproject.org)