Archiving
David Nathan
ELDP Training Workshop
March 2010
1
Archiving: what do you think of?
2
What is a language archive, then?
7
What is a digital language archive?
a forum / platform for data providers and
data users to negotiate and exchange
a trusted repository created and maintained
by an institution with a commitment to the
long-term preservation of archived material
has policies and processes for materials
acquisition, cataloguing, preservation,
dissemination, migration to new digital
formats
a collection of managed materials
9
OAIS model
OAIS archives define three types of
‘packages’
ingestion, archive, dissemination:
[Diagram labels: Producers, Ingestion, Archive, Dissemination, Designated communities]
10
What is archiving of language materials?
preparing materials in a structured, well-documented, and complete form
building long-term relationships
it is not just backup
it is not just dissemination/publication
it does not define good linguistic practice
11
What can a language archive offer?
12
Security - keep your electronic materials safe
Preservation - store your materials for the long
term
Discovery - help others to find out about your
materials, and you to find out about users
Protocols - respect and implement sensitivities,
restrictions
Sharing - share results of your work, if appropriate
Acknowledgement - create citable
acknowledgement
Mobilisation - create usable language materials for
communities
Quality and standards - advice for assuring your
materials are of the highest quality and robust
standards
Kinds of language archives
many cross-cutting classifications:
Indigenous and local, eg. Squamish Nation,
“language centres”
regional, eg. AILLA, Paradisec
international, eg. DoBeS, ELAR
associated with research institute, eg. AIATSIS,
ANLC
grant-driven deposits, eg. DoBeS, ELAR
digital vs physical vs mixed, eg. DoBeS vs
Vienna Sound Archive, ANLC
13
Potential users
depositors – deposit, access or update
materials
speakers and their descendants (“majority
of users of Berkeley Language Center
archive are community members”)
other researchers - comparative/historical
linguists, typologists, theoreticians,
anthropologists, historians, musicologists
etc etc
other “stakeholders”, eg educationalists
journalists and the wider public
14
Archives networks and bodies
foundation concepts and technologies from
library initiatives, eg. D-LIB http://www.dlib.org/
OAI (Open Archives Initiative)
OAIS Open Archival Information Systems
(NASA and space agencies incl JAXA)
Open Language Archives Community
(OLAC)
Digital Endangered Languages and
Archives Network (DELAMAN)
ELAR, DOBES, ANLC, Paradisec, EMELD,
LACITO, AIATSIS, AMPM (Maori)
15
Archives networks and bodies
DELAMAN’s interests and activities
include:
language archiving training coordination and
syllabus
citation of deposits (for academic recognition of
deposited corpora)
archive federations (for seamless access to
resources across archives)
16
Citation examples
Courtesy Heidi Johnson of AILLA
Collection:
Sherzer, Joel. "Kuna Collection." The Archive of the
Indigenous Languages of Latin America:
www.ailla.utexas.org. Media: audio, text, image. Access:
0% restricted.
File/resource:
Sherzer, Joel (Researcher). (1970). "Report of a curing
specialist." Kuna Collection. Archive of the Indigenous
Languages of Latin America: www.ailla.utexas.org. Type:
transcription&translation. Media: text. Access: public.
Resource ID: CUK001R001.
17
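The file/resource pattern above could be generated from structured fields rather than typed by hand. The following is a minimal sketch only: the class and field names are hypothetical (they are not an AILLA schema or tool), and the layout is simply copied from the example citation.

```python
from dataclasses import dataclass


@dataclass
class ResourceCitation:
    # Field names are illustrative assumptions, not an AILLA schema.
    creator: str
    role: str
    year: int
    title: str
    collection: str
    archive: str
    archive_url: str
    resource_type: str
    media: str
    access: str
    resource_id: str

    def render(self) -> str:
        """Join the fields in the order used by the file/resource example."""
        return (
            f'{self.creator} ({self.role}). ({self.year}). "{self.title}." '
            f"{self.collection}. {self.archive}: {self.archive_url}. "
            f"Type: {self.resource_type}. Media: {self.media}. "
            f"Access: {self.access}. Resource ID: {self.resource_id}."
        )


citation = ResourceCitation(
    creator="Sherzer, Joel",
    role="Researcher",
    year=1970,
    title="Report of a curing specialist",
    collection="Kuna Collection",
    archive="Archive of the Indigenous Languages of Latin America",
    archive_url="www.ailla.utexas.org",
    resource_type="transcription&translation",
    media="text",
    access="public",
    resource_id="CUK001R001",
)
print(citation.render())
```

Keeping the fields separate means the same record can feed both a catalogue entry and a citable reference.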
Why is language archiving different?
what is a language?
the data is not conventionalised (like $,
age, year of publication etc) – what and
how to code?
varying and competing expectations
18
And endangered languages archiving?
extremely diverse context – languages,
cultures, communities, individuals, projects
typical source is fieldworkers
no established genres
difficult for archive staff to manage
sensitivities and restrictions
extremely high priority
19
Endangered Languages ARchive (ELAR)
one of 3 semi-autonomous programs of the
Hans Rausing Endangered Languages
Project
staff of 3: archivist, software developer,
technician (and research assistants etc)
develop policies, preservation
infrastructure, cataloguing and
dissemination, facilities, training, advice,
materials development and publishing
20
ELAR’s holdings
ELAR currently holds about 50 deposits
with a total volume of approx 4 TB.
the average deposit is about 80 GB
sizes vary widely, with a small number of
huge deposits. The median size is around
15 GB
we expect volume to nearly double over
the next 18 months
see next slides for distribution of data types
21
ELAR holdings by data type
data types for a 25%
sample of holdings
(early 2008)
data type by volume
(MB) and number of
files, sorted by
volume
22
Data type   Volume (MB)   Files
audio           360,411   6,312
video           208,995     895
image            28,592   2,221
msword              223     404
pdf                 196     134
eaf                  33     176
text                 32     781
lex                   9      29
trs                   5     246
xls                   1      19
imdi                  1      26
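The figures in the table above can be totalled to see how strongly media files dominate the sample. A short sketch using only the numbers from the table (the variable names are illustrative): audio and video together come to roughly 95% of the sampled volume.

```python
# Volume (MB) and file counts copied from the holdings table above.
holdings = {
    "audio": (360_411, 6_312),
    "video": (208_995, 895),
    "image": (28_592, 2_221),
    "msword": (223, 404),
    "pdf": (196, 134),
    "eaf": (33, 176),
    "text": (32, 781),
    "lex": (9, 29),
    "trs": (5, 246),
    "xls": (1, 19),
    "imdi": (1, 26),
}

total_mb = sum(volume for volume, _ in holdings.values())
total_files = sum(files for _, files in holdings.values())

# Share of total volume for each data type.
for data_type, (volume, files) in holdings.items():
    share = 100 * volume / total_mb
    print(f"{data_type:<7} {share:6.2f}% of volume, {files:>5} files")

media_share = (holdings["audio"][0] + holdings["video"][0]) / total_mb
print(f"audio + video: {media_share:.1%} of sampled volume")
```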
The way we were ... ASEDA
Aboriginal Studies Electronic Data Archive,
AIATSIS Canberra, founded early 1990s
(modelled on Oxford Text Archive)
receive and catalogue electronic materials
that were at risk or not accessible
lexica
grammars
texts
23
How things have changed ..
types of data (modalities and genres)
now predominantly media / documentation
storage methods
now “professional”, mass data systems
standardisation and metadata
now various standards for data and metadata
dissemination
now web-based dissemination
expanded influence into practice and workflow
of linguists
24
Why digital?
preservation: digitisation is the only way that
media (audio and video) can be preserved for
the future
because digital data can be copied and transmitted with
zero loss
cataloguing, sharing, dissemination all
facilitated
25
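The "zero loss" property can be checked rather than assumed: a copy is identical to its source if both produce the same checksum. The sketch below is one common way to do this with Python's standard library; the function names and the throwaway demo file are assumptions for illustration, not a description of ELAR's own procedures.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, target: Path) -> bool:
    """Copy a file and confirm the copy is bit-for-bit identical."""
    shutil.copyfile(source, target)
    return sha256_of(source) == sha256_of(target)


# Demonstration with a throwaway file standing in for a recording.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "recording.wav"
    original.write_bytes(bytes(range(256)) * 64)
    copy_ok = copy_and_verify(original, Path(workdir) / "recording_copy.wav")
    print("copy verified:", copy_ok)
```

Storing the digest alongside the file also lets later checks detect silent corruption.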
Digital disadvantages
digital data is fragile and ephemeral
cost (human, equipment, maintenance)
requires strategy and luck to get infrastructure
right
preservation depends on file and data formats
26
depends on tools and software
depends on formats (prefer standard, open,
explicit, long-lasting)
materials may have to be converted and
migrated
some formats require particular software (can
we archive the software?)
These issues impact on archive policy
how to balance cost of handling and
preservation with value of materials?
how to provide long-term preservation
when our funding is time-limited?
27
The archiving process (depositors’ view)
28
Documenter and archive interactions
29
grant formulation and application
communications, questions, advice
training
archiving services (transfer, conversion etc)
ongoing management of materials
Documenter & archive interactions
30
Query/interaction topics
analysis of approx 150 queries from
documenters/linguists
31
ELAR Feedback template
ELAR Data Sample Evaluation
Prepared for:
By:
Date:
TEXT - xx file
Document type
Document format/layout/data structures
Character/language representation
Linking/references
Consistency
33
ELAR Feedback template
AUDIO
Document type/format
Resolution
Quality
Editing
Length
Annotation/transcription
Consistency
34
ELAR Feedback template
VIDEO
Document type/format
Resolution
Quality
Editing
Length
Annotation/transcription
Consistency
35
ELAR Feedback template
GENERAL
File naming
Data volume
Delivery
Consistency
36
Example detail (section: Document format)
Use of typography (size, underlining, bold, spaces
etc) to make headings and other structures is
weak - at least Styles should be used (with
complete consistency).
Using tables to represent interlinear data is reasonably
appropriate, although they would need to be converted
later.
Is it clear from this document, or somewhere else,
where to look up codes etc, such as the speaker
initials?
While the language is consistently labelled in the
interlinear section, it is identified only by the
alternation in font in the first section.
37
Example detail (section: Audio quality)
AD-MD03a 4Noe Song thami miya.wav - quality good.
AD-MD04b 33Boa Sr. LongNarrativeOnTsunami.wav - quality reasonable, but background hiss is too loud
in proportion to the signal. Was this part of your
original recording (on what equipment?) or was it
introduced by digitisation? If the latter, it would be
a good idea to try re-digitising.
AD-MD05b 34Peje Phonetic Variation.wav - quality
quite good. Stereo separation of voices is nice.
CIILQ Seasons Contd 699-703.wav - suffers a number
of faults, including severe clipping (overmodulation),
background noise, microphone physical handling,
and poor acoustic representation (probably due to
poor microphone and/or recorder?).
38
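Some of the faults noted above, such as clipping, can be screened for automatically before a human listens. The following is a crude sketch, assuming 16-bit PCM WAV input: it counts samples at or beyond full scale. The function names and the threshold are illustrative assumptions, and this is not how any particular evaluation tool works.

```python
import array
import math
import os
import sys
import tempfile
import wave


def clipped_fraction(path: str, threshold: int = 32767) -> float:
    """Fraction of samples at or beyond full scale in a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)


def write_sine(path: str, gain: float, rate: int = 8000, freq: float = 440.0) -> None:
    """Write one second of a sine tone; gain above 1.0 forces clipping."""
    samples = array.array("h")
    for n in range(rate):
        value = gain * 32767 * math.sin(2 * math.pi * freq * n / rate)
        samples.append(max(-32768, min(32767, int(value))))
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(samples.tobytes())


# Demonstration with synthetic tones standing in for field recordings.
with tempfile.TemporaryDirectory() as workdir:
    clean_path = os.path.join(workdir, "clean.wav")
    loud_path = os.path.join(workdir, "loud.wav")
    write_sine(clean_path, gain=0.5)
    write_sine(loud_path, gain=1.5)
    clean_fraction = clipped_fraction(clean_path)
    loud_fraction = clipped_fraction(loud_path)
    print(f"clean: {clean_fraction:.1%} clipped, loud: {loud_fraction:.1%} clipped")
```

A non-zero result is only a prompt for closer listening; hiss, handling noise and poor microphone response still need a trained ear.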
Audio evaluation using Dobbin
software from Cube-Tec who make
Quadriga
audio evaluation, conversion and reporting
39
Dobbin
40
What can you archive (at ELAR)?
media - sound, video
graphics - images, scans
text - fieldnotes, grammars, description,
analysis
structured data - aligned and annotated
transcriptions, databases, lexica
metadata - structured, standardised
contextual information about the materials
46
Archive objects
an “object” could be a file, a set of files, a
directory, a “session” or a set of files with
relationships between them
these are often called “bundles”
like all structures, these should be made
explicit
eg through metadata
our new catalogue system will provide a facility
to create and label bundles
47
Data “portability” (Bird & Simons 2003)
data should also be “portable” (Bird &
Simons “Seven Dimensions ...”)
48
complete
explicit
documented
preservable
transferable
accessible
adaptable
not technology-specific
(also appropriate, accurate, useful etc!!)
Archive material should be selected
example: Depositor’s question: How much
video can I archive?
answer: ...
however,
unlikely that linguist is in position to plan and
consistently create excellent video, so selection
is unavoidable
data has always been edited and selected!
49
(... selection)
in your linguistic work you also:
selected
labeled
transformed/processed/edited
added, corrected, expanded
made links
made or assumed relationships between
“whole” and processed units; invented labels,
IDs, scope etc
imposed formats
50
File organisation example 1
IPF10011-Disk3-Story-WulaTuki-LunarEclipse
  IMDI_3.0.xsd
  WulaTuki_LunarEclipse.eaf
  WulaTuki_LunarEclipse.imdi
  WulaTuki_LunarEclipse.imdi.backup
  WulaTuki_LunarEclipse.pfs
  WulaTuki_LunarEclipse.txt
  WulaTuki_LunarEclipse.wav
51
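In the listing above, most files share one base name, which is what lets them be treated as a single bundle. A minimal sketch of making that grouping explicit: the grouping rule (everything before the first dot) is an assumption for illustration, not ELAR's catalogue logic.

```python
from collections import defaultdict

# File names taken from "File organisation example 1".
files = [
    "IMDI_3.0.xsd",
    "WulaTuki_LunarEclipse.eaf",
    "WulaTuki_LunarEclipse.imdi",
    "WulaTuki_LunarEclipse.imdi.backup",
    "WulaTuki_LunarEclipse.pfs",
    "WulaTuki_LunarEclipse.txt",
    "WulaTuki_LunarEclipse.wav",
]


def group_into_bundles(names):
    """Group file names by the part before the first dot.

    This is a naive rule: a name such as IMDI_3.0.xsd is keyed as
    "IMDI_3", so real collections would need a more careful convention.
    """
    bundles = defaultdict(list)
    for name in names:
        stem = name.split(".", 1)[0]
        bundles[stem].append(name)
    return dict(bundles)


bundles = group_into_bundles(files)
for stem, members in bundles.items():
    print(stem, "->", members)
```

Consistent file naming is what makes this kind of automatic grouping possible; where names are inconsistent, the relationships have to be recorded in metadata instead.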
File organisation example 2
/
  labelling-system.doc
  AngryD-Bsi
    AngryD-Bsi.pdf
    AngryD-Bsi.wav
    AngryD-Bsi.doc
52
File organisation example 3
/
  archivist_notes.txt
  ELAN transcription key FTG0025.pdf
  Overview metadata FTG0025.xls [open]
  Kay07-aud
    Kay07-aud-jul03a.wav
    Kay07-aud-jul03b.wav
    Kay07-aud-jul03c.wav
53
Metadata
54
Metadata
the data about data that enables the
management, identification, retrieval and
understanding of that data
reflects the knowledge and practice of
data providers
defines and constrains audiences and
usages for data
documentation’s goals heighten the
importance of metadata
55
Metadata formats
common or standard:
IMDI (‘ISLE Metadata Initiative’, from DoBeS)
OLAC (Open Language Archives Community)
EAD, and others
ELAR: has created its own set, currently in
implementation
deposit-wide metadata in deposit form
file level metadata (will be) by web form
also, depositor’s own metadata
56
On metadata formats
57
each depositor can also have different
metadata!
types of metadata are relative to each
project, consultants, community ...
our goal: to maximise the amount and
quality of metadata
quality and extent is more important than
standards and comparability
many depositors are sending extensive
metadata in a variety of formats including
spreadsheets
Types of metadata
58
depositor's / delegates' details
descriptive metadata
administrative metadata
preservation metadata
access protocols
metadata for individual files
Depositors and delegates
59
name
address
contact details (telephone, fax, email, URL)
role
affiliation
date of birth
nationality
Descriptive metadata
60
title, description, subject, summary
keywords
subject language, community
location
time span
Administrative metadata
project details
funding and hosting institutions
details of external copies
modifications and status
details of accession agreement
cf. deposit form
access
access protocols (see elsewhere)
group membership identification
61
Preservation metadata
carrier media
formats, size
provenance (source/history)
62
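The metadata categories listed in the preceding slides could be held together in one record per deposit, which also makes it easy to see what is still missing. This is a hypothetical sketch: the class name, the use of plain dictionaries, and the placeholder values are all assumptions, not ELAR's metadata set.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DepositMetadata:
    # One dictionary per category named in the slides; keys are free-form here.
    depositor: dict = field(default_factory=dict)       # name, role, affiliation ...
    descriptive: dict = field(default_factory=dict)     # title, keywords, location ...
    administrative: dict = field(default_factory=dict)  # project, funding, access ...
    preservation: dict = field(default_factory=dict)    # carrier media, formats ...

    def missing_categories(self) -> list:
        """Names of categories for which nothing has been supplied yet."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values only; they do not describe a real deposit.
record = DepositMetadata(
    depositor={"name": "A. Depositor", "role": "researcher"},
    descriptive={"title": "Example deposit", "keywords": ["narrative"]},
)
print(json.dumps(asdict(record), indent=2))
print("still missing:", record.missing_categories())
```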
File-level metadata
media files
duration, file size
MIME type, content type
text files
font, character set, encoding
format, markup
access protocols
63
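Several of the file-level fields listed above (duration, file size, MIME type) can be read from the file itself rather than entered by hand. A minimal sketch for uncompressed WAV files using Python's standard library; the function name and dictionary keys are illustrative assumptions.

```python
import mimetypes
import os
import tempfile
import wave


def media_file_metadata(path: str) -> dict:
    """Collect duration, size and MIME type for an uncompressed WAV file."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the extension
    return {
        "file_name": os.path.basename(path),
        "file_size_bytes": os.path.getsize(path),
        "duration_seconds": duration,
        "mime_type": mime_type,
    }


# Demonstration: two seconds of silence as a stand-in recording.
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "sample.wav")
    with wave.open(sample_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(44100)
        wav.writeframes(b"\x00\x00" * 44100 * 2)
    metadata = media_file_metadata(sample_path)
    print(metadata)
```

Fields that describe content, such as content type or access protocol, still have to come from the depositor.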
Access protocol
sensitivities, restrictions: identification,
description and implementation
deposit, file or object-level protocol
depositor-oriented
change/manage protocol over time
delegate
other rights holders
sunset clause
64
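The elements above (protocols at deposit or file level, group-based restrictions, a sunset clause) can be sketched as a small data structure. Everything here is an assumption made for illustration: the class, the group names, the rule that a file-level protocol overrides the deposit-level one, and the reading of "sunset clause" as a date after which the restriction lapses. It does not describe ELAR's implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class AccessProtocol:
    allowed_groups: set = field(default_factory=set)
    sunset: Optional[date] = None  # assumed: restriction lapses on this date

    def permits(self, user_groups: set, today: date) -> bool:
        """Allow access once the sunset date has passed, or on group match."""
        if self.sunset is not None and today >= self.sunset:
            return True
        return bool(self.allowed_groups & user_groups)


def effective_protocol(deposit_protocol, file_protocol=None):
    """Assumed precedence: a file-level protocol overrides the deposit's."""
    return file_protocol if file_protocol is not None else deposit_protocol


# Hypothetical deposit with one more tightly restricted file.
deposit = AccessProtocol(allowed_groups={"depositor", "community"})
restricted_file = AccessProtocol(allowed_groups={"depositor"}, sunset=date(2030, 1, 1))

community_user = {"community"}
open_file_ok = effective_protocol(deposit).permits(community_user, date(2010, 3, 1))
restricted_now = effective_protocol(deposit, restricted_file).permits(
    community_user, date(2010, 3, 1)
)
restricted_later = effective_protocol(deposit, restricted_file).permits(
    community_user, date(2030, 1, 1)
)
print(open_file_ok, restricted_now, restricted_later)
```

Because sensitivities change over time, the protocol is kept as data that the depositor or a delegate can edit, not as fixed rules in code.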
Protocol grows naturally with documentation
focus on recorded data » more people, more
genres, less researcher knowledge
community participation » framework for speakers
to shape documentation process and products
mobilisation » selecting, juxtaposing; community
participation
focus on revitalisation » which language to teach?
who to host and teach? who can learn? etc
time » significance and sensitivities change over
time
access » increasing scope for dissemination,
control of IP
65
Other kinds of metadata
information to make resources accessible
to community members
genres eg songs
languages, eg community language
materials for language teaching and learning
types of metadata are relative to the
particular project, consultants, community
...
66
Archiving and data management
Most data-related issues are properly part
of linguistic data management
There are now few data-related issues that
are archive-specific
But teaching curricula, training, and
practices need time to catch up
Ultimate goal of documenting languages
well means that we must find the optimal
“division of labour” in each case
67
ELAR assists depositors
68
preserve your deposited materials
implement your access restrictions etc
provide advice, general and specific
assistance, eg data conversion
provide web-based deposit management
allow updates and additions
provide some equipment and services
on a case by case basis, develop
resources
What is required to make a deposit?
resource(s) for an endangered language
it could be just one file
inventory / metadata
deposit form
an online version will be available soon
deposits can be updated, supplemented,
metadata added/modified
69
How do depositors deliver data?
Hard disks
we return them
we send them out
some grant applicants factor them into grants
Email
good for samples for evaluation
OK for most text materials
Flash cards and USB sticks
A web upload facility will be
available later
Web download
70
What about CDs and DVDs?
we have found CDs, and
especially DVDs, to be
very unreliable
DVD fail rate about 10%
cause confusion as files
are allocated to fit on
disks, not according to
corpus structure
create a lot of work for
depositors and for ELAR
71
We ask depositors to
manage materials well
collect and provide protocol information
deliver materials, metadata
send trial samples etc
(funded grantees) not withhold materials
share/manage/delegate custodianship of
materials
maintain relationships with language
stakeholders and ELAR
72
ELAR online
We now have an ELAR online archive,
although data is only just starting to be
released to public view:
http://elar.soas.ac.uk/
The archive has been implemented using a
Content Management System, in this
case Drupal:
open-source web software
based on PHP, MySQL and JavaScript
implements user, role and group-based
access to materials
73