OLAC: The Open Language Archives Community
Gary F. Simons
SIL International and Graduate Institute of Applied Linguistics
DRIVER Summit, Goettingen, 16-17 Jan 2008
What is OLAC?
www.language-archives.org
► OLAC is an international partnership of institutions and individuals who are creating a world-wide virtual library of language resources by:
  - Developing consensus on best current practice for the digital archiving of language resources
  - Developing a network of interoperating repositories and services for housing and accessing such resources
► Founded in December 2000
  - Now has 34 participating archives
  - 12 European participants (bolded on next slide)
Who’s involved?
► Aboriginal Studies Electronic Data Archive
► Academia Sinica
► Alaska Native Language Center
► Archive of Indigenous Languages of Latin America
► ATILF Resources
► Berkeley Language Center
► Centre de Ressources pour la Description de l'Oral
► CHILDES Data Repository
► Comparative Corpus of Spoken Portuguese
► Cornell Language Acquisition Laboratory
► Dictionnaire Universel Boiste 1812
► DOBES catalogue (MPI, Nijmegen)
► Ethnologue: Languages of the World
► European Language Resources Association
► Laboratoire Parole et Langage
► Linguistic Data Consortium Corpus Catalog
► LINGUIST List Language Resources
► Natural Language Software Registry
► Online Database of Interlinear Text (ODIN)
► Oxford Text Archive
► PARADISEC
► Perseus Digital Library
► Research Papers in Computational Linguistics
► Rosetta Project 1000 Language Archive
► SIL Language and Culture Archives
► Surrey Morphology Group Databases
► Survey for California and Other Indian Languages
► TalkBank
► Tibetan and Himalayan Digital Library
► TRACTOR
► Typological Database Project
► University of Bielefeld Language Archive
► University of Queensland Flint Archive
► Virtual Kayardild Archive (Melbourne)
How does it work?
► Based on OAI Protocol for Metadata Harvesting
  - Adds a community-specific archive description to the Identify response
  - Defines a new olac metadata format
  - We operate a static repository gateway for participants with small collections (needs olac format only)
      http://www.language-archives.org/sr
  - We operate an aggregator that harvests all participants and crosswalks them to oai_dc format
      http://www.language-archives.org/cgi-bin/olaca3.pl
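As a rough illustration of what "based on OAI-PMH" means for a harvester, the sketch below builds the request URLs a client might send to the aggregator named above. The verbs (Identify, ListRecords) and the metadataPrefix argument come from the OAI-PMH protocol; the helper name oai_request is hypothetical, and nothing here is taken from OLAC's own software. No network request is made.

```python
from urllib.parse import urlencode

# Aggregator base URL as given on the slide.
AGGREGATOR = "http://www.language-archives.org/cgi-bin/olaca3.pl"


def oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL (hypothetical helper, for illustration)."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Ask the repository to describe itself; per the slide, OLAC adds an
# archive description to this response.
identify_url = oai_request(AGGREGATOR, "Identify")

# Harvest records in the community-specific olac format ...
olac_url = oai_request(AGGREGATOR, "ListRecords", metadataPrefix="olac")

# ... or in the oai_dc format that the aggregator crosswalks to.
dc_url = oai_request(AGGREGATOR, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(olac_url)
print(dc_url)
```

A generic OAI harvester would only use the oai_dc form; a language-aware service would ask for the olac form to keep the finer-grained vocabularies.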
OLAC metadata format
► Based on the Dublin Core metadata set
  - Record format follows the DC guidelines for implementing Qualified DC in XML
  - Adds community-specific controlled vocabularies:
      Linguistic Data Type to qualify Type
      Linguistic Field to qualify Subject
      Participant Role to qualify Creator and Contributor
      ISO 639-3 to qualify Language and Subject
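The pairing of vocabularies with Dublin Core elements listed above can be written down as a small lookup table. This is only a sketch of the relationships stated on the slide; the names OLAC_QUALIFIERS and vocabularies_for are hypothetical and do not reflect the actual OLAC schema or its XML syntax.

```python
# Which controlled vocabulary may qualify which Dublin Core element,
# exactly as listed on the slide (element names lowercased).
OLAC_QUALIFIERS = {
    "Linguistic Data Type": {"type"},
    "Linguistic Field": {"subject"},
    "Participant Role": {"creator", "contributor"},
    "ISO 639-3": {"language", "subject"},
}


def vocabularies_for(element: str) -> list[str]:
    """Return the vocabularies that can qualify a given DC element."""
    name = element.lower()
    return sorted(v for v, elements in OLAC_QUALIFIERS.items() if name in elements)


# Subject is the one element with two vocabularies: a resource can be
# about a linguistic field and about a particular language.
print(vocabularies_for("Subject"))
print(vocabularies_for("Creator"))
print(vocabularies_for("Title"))  # not qualified by any OLAC vocabulary
```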
Controlled vocabularies for language identification
► Situation: 6,912 living languages are used throughout the world
  - Source: Ethnologue, 15th edition, http://www.ethnologue.com
► Problem: The standard used in the library community (MARC language codes, or ISO 639-2)
  - Has codes for fewer than 400 languages
  - Uses 66 “collective” codes to handle the other 6,500, e.g.
      South American Indian (Other) [sai] covers 421 languages
      Bantu (Other) [bnt] covers 612 languages
ISO 639-3
► In 2002, ISO TC37 invited SIL to propose a comprehensive standard compatible with 639-2
► Result: ISO 639-3, Alpha-3 code for comprehensive coverage of languages (published 2007-02-05)
  - Codes for ~6,900 living languages
  - Codes for ~600 extinct, historical, ancient, and constructed languages
  - RA site: http://www.sil.org/639-3/
► OLAC uses this controlled vocabulary for identifying the languages a resource is in or about
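Because 639-3 is an alpha-3 code set compatible with 639-2, a three-letter code on its own does not say whether it names one language or a whole group. The sketch below makes that point with the two collective codes quoted earlier in the talk. The function name classify and the tiny sample table are assumptions for illustration; a real check would consult the full code tables from the Registration Authority.

```python
import re

# Alpha-3 shape: exactly three lowercase ASCII letters.
ALPHA3 = re.compile(r"[a-z]{3}")

# Sample of 639-2 collective codes mentioned in the talk, with the number
# of languages each one covers. This is NOT the full list of 66.
COLLECTIVE_SAMPLE = {"sai": 421, "bnt": 612}


def classify(code: str) -> str:
    """Classify a code as malformed, collective, or needing a table lookup."""
    if not ALPHA3.fullmatch(code):
        return "malformed"
    if code in COLLECTIVE_SAMPLE:
        return f"collective ({COLLECTIVE_SAMPLE[code]} languages)"
    return "well-formed; look up in the 639-3 code table"


for candidate in ["bnt", "sai", "eng", "EN", "engl"]:
    print(candidate, "->", classify(candidate))
```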
What is the current coverage of OLAC?

                                All archives   Excluding Ethnologue
Items in catalog                      30,591                 23,292
ISO 639-3 languages included           7,299                  3,134
Items with online open access         16,018                  8,719
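The two columns can be compared with a little arithmetic. The sketch below uses only the six figures from the table; the variable and function names are illustrative. It suggests that Ethnologue accounts for 7,299 catalog items, which equals the language count in the first column and is consistent with one openly accessible entry per language, though the slide does not state this directly.

```python
# Figures copied from the coverage table.
coverage = {
    "items": {"all": 30591, "excluding_ethnologue": 23292},
    "languages": {"all": 7299, "excluding_ethnologue": 3134},
    "open_access_items": {"all": 16018, "excluding_ethnologue": 8719},
}


def ethnologue_contribution(row: str) -> int:
    """Difference between the two columns for one row of the table."""
    return coverage[row]["all"] - coverage[row]["excluding_ethnologue"]


def open_access_rate(scope: str) -> float:
    """Share of catalog items that are online with open access."""
    return coverage["open_access_items"][scope] / coverage["items"][scope]


print("Items contributed by Ethnologue:", ethnologue_contribution("items"))
print("Open-access items contributed by Ethnologue:",
      ethnologue_contribution("open_access_items"))
print("Languages covered only via Ethnologue:",
      ethnologue_contribution("languages"))
print(f"Open-access rate, all archives: {open_access_rate('all'):.1%}")
print(f"Open-access rate, excluding Ethnologue: "
      f"{open_access_rate('excluding_ethnologue'):.1%}")
```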
Current developments
► In the first year of a 3-year NSF-sponsored grant to increase use and coverage by an order of magnitude
  1. Develop guidelines and services that encourage best common practices among language archives that will facilitate language resource discovery with precision through OLAC (and attract more archives to join).
  2. Develop services to bridge the resource catalogs of the repository, library, and web domains (e.g. OAI, MARC, Google) to facilitate language resource discovery with precision through OLAC.
     - E.g. a user searches the OLAC aggregator for a specific 639-3 code and finds hits in external aggregators
External interoperation
► Strategy 1: Use existing cataloging information to identify languages with precision
  - 639-2 codes for individual languages
  - Language names in LC subject headings, call numbers
► Strategy 2: Promote use of 639-3 in cataloging
  - ISO639-3 is now an encoding scheme in DC Terms
  - iso639-3 has been added to the MARC standard as a recognized identifier for a source in Field 041
  - E.g., these are valid 041 fields for a grammar in English of Lushootseed [lut] of the Salishan [sal] family
      041 1_$asal$aeng (using 639-2 by default)
      041 17$alut$aeng$2iso639-3 (using 639-3)
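To show how a bridging service might read the two example fields, here is a minimal sketch that parses the textual form used on the slide (tag, two indicators, then $-prefixed subfields). The function name parse_041 and the returned dictionary layout are assumptions for illustration; the label "iso639-2" for the default source is also this sketch's own choice, and real MARC records are normally read with a dedicated MARC library rather than by string splitting.

```python
def parse_041(field: str) -> dict:
    """Parse a slide-style MARC 041 string into its parts (illustrative)."""
    tag, rest = field.split(maxsplit=1)
    indicators, _, subfield_text = rest.partition("$")
    # Each chunk is a one-character subfield code followed by its value.
    subfields = [(chunk[0], chunk[1:]) for chunk in subfield_text.split("$") if chunk]
    languages = [value for code, value in subfields if code == "a"]
    sources = [value for code, value in subfields if code == "2"]
    return {
        "tag": tag,
        "indicators": indicators,
        "languages": languages,
        # With no $2 subfield the codes are read as 639-2 by default.
        "source": sources[0] if sources else "iso639-2",
    }


by_family = parse_041("041 1_$asal$aeng")
by_language = parse_041("041 17$alut$aeng$2iso639-3")

print(by_family)
print(by_language)
```

The first record can only say the grammar concerns some Salishan language; the second pins it to Lushootseed, which is the gain in precision the slide is arguing for.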
Conclusion
► OLAC would like to establish interoperation with the DRIVER infrastructure. We could:
  - Implement a driver set on OLAC aggregator
  - Harvest language resources from DRIVER aggregator
► OLAC is pleased that DRIVER already recommends ISO 639-3 as best practice with Language element
  - We are available to advise institutions that need help implementing this
  - We are looking for partners who will help advocate adoption of 639-3 in other guidelines and standards so as to broaden the base for language-related interoperation