
Archiving
David Nathan
ELDP Training Workshop
March 2010
1
Archiving: what do you think of?
What is a language archive, then?
What is a digital language archive?
 a forum / platform for data providers and
data users to negotiate and exchange
 a trusted repository created and maintained
by an institution with a commitment to the
long-term preservation of archived material
 has policies and processes for materials
acquisition, cataloguing, preservation,
dissemination, migration to new digital
formats
 a collection of managed materials
9
OAIS model
 OAIS archives define three types of
‘packages’
ingestion, archive, dissemination:
[Diagram: Producers → Ingestion → Archive → Dissemination → Designated communities]
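The flow of the three package types can be sketched in code. OAIS calls them the Submission, Archival and Dissemination Information Packages; the class names, fields and functions below are illustrative assumptions, not part of any archive's actual software.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class SubmissionPackage:
    """What a producer hands over (hypothetical minimal shape)."""
    depositor: str
    files: dict[str, bytes]  # filename -> content


@dataclass
class ArchivalPackage:
    """What the archive stores: content plus fixity information."""
    depositor: str
    files: dict[str, bytes]
    checksums: dict[str, str] = field(default_factory=dict)


@dataclass
class DisseminationPackage:
    """What a member of a designated community receives."""
    files: dict[str, bytes]


def ingest(sip: SubmissionPackage) -> ArchivalPackage:
    # Record a SHA-256 checksum per file so later copies can be verified.
    sums = {name: hashlib.sha256(data).hexdigest()
            for name, data in sip.files.items()}
    return ArchivalPackage(sip.depositor, dict(sip.files), sums)


def disseminate(aip: ArchivalPackage, wanted: list[str]) -> DisseminationPackage:
    # Deliver only the requested files; archive-internal data stays behind.
    return DisseminationPackage({n: aip.files[n] for n in wanted if n in aip.files})


sip = SubmissionPackage("A. Depositor", {"story.wav": b"RIFF...", "story.txt": b"text"})
aip = ingest(sip)
dip = disseminate(aip, ["story.txt"])
```

The point of the sketch is only that each stage has its own package shape: what is submitted, what is stored and what is delivered need not be identical.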
What is archiving of language materials?
 preparing materials in a structured, well-documented, and complete form
 building long-term relationships
 it is not just backup
 it is not just dissemination/publication
 it does not define good linguistic practice
11
What can a language archive offer?
 Security - keep your electronic materials safe
 Preservation - store your materials for the long
term
 Discovery - help others to find out about your
materials, and you to find out about users
 Protocols - respect and implement sensitivities,
restrictions
 Sharing - share results of your work, if appropriate
 Acknowledgement - create citable
acknowledgement
 Mobilisation - create usable language materials for
communities
 Quality and standards - advice for assuring your
materials are of the highest quality and robust
standards
Kinds of language archives
 many cross-cutting classifications:
 Indigenous and local, eg. Squamish Nation,
“language centres”
 regional, eg. AILLA, Paradisec
 international, eg. DoBeS, ELAR
 associated with research institute, eg. AIATSIS,
ANLC
 grant-driven deposits, eg. DoBeS, ELAR
 digital vs physical vs mixed, eg. DoBeS vs
Vienna Sound Archive, ANLC
13
Potential users
 depositors – deposit, access or update
materials
 speakers and their descendants (“majority
of users of Berkeley Language Center
archive are community members”)
 other researchers - comparative/historical
linguists, typologists, theoreticians,
anthropologists, historians, musicologists
etc.
 other “stakeholders”, eg educationalists
 journalists and the wider public
14
Archives networks and bodies
 foundation concepts and technologies from
 library initiatives, eg. D-LIB http://www.dlib.org/
 OAI (Open Archives Initiative)
 OAIS (Open Archival Information System;
NASA and space agencies incl JAXA)
 Open Language Archives Community
(OLAC)
 Digital Endangered Languages and Musics
Archives Network (DELAMAN)
 ELAR, DOBES, ANLC, Paradisec, EMELD,
LACITO, AIATSIS, AMPM (Maori)
15
Archives networks and bodies
 DELAMAN’s interests and activities
include:
 language archiving training coordination and
syllabus
 citation of deposits (for academic recognition of
deposited corpora)
 archive federations (for seamless access to
resources across archives)
16
Citation examples
 Courtesy Heidi Johnson of AILLA
Collection:
Sherzer, Joel. "Kuna Collection." The Archive of the
Indigenous Languages of Latin America:
www.ailla.utexas.org. Media: audio, text, image. Access:
0% restricted.
File/resource:
Sherzer, Joel (Researcher). (1970). "Report of a curing
specialist." Kuna Collection. Archive of the Indigenous
Languages of Latin America: www.ailla.utexas.org. Type:
transcription&translation. Media: text. Access: public.
Resource ID: CUK001R001.
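A citation in this pattern can be assembled from metadata fields. The sketch below reproduces the file/resource example above; the function name and parameter names are assumptions made for illustration, not an AILLA or DELAMAN specification.

```python
def cite_resource(author, role, year, title, collection, archive, url,
                  resource_type, media, access, resource_id):
    """Assemble a file/resource citation in the pattern of the AILLA example."""
    return (
        f'{author} ({role}). ({year}). "{title}." {collection}. '
        f"{archive}: {url}. Type: {resource_type}. Media: {media}. "
        f"Access: {access}. Resource ID: {resource_id}."
    )


citation = cite_resource(
    author="Sherzer, Joel", role="Researcher", year=1970,
    title="Report of a curing specialist", collection="Kuna Collection",
    archive="Archive of the Indigenous Languages of Latin America",
    url="www.ailla.utexas.org", resource_type="transcription&translation",
    media="text", access="public", resource_id="CUK001R001",
)
print(citation)
```

Generating citations from the catalogue rather than typing them by hand keeps them consistent with the deposit's metadata.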
17
Why is language archiving different?
 what is a language?
 the data is not conventionalised (like $,
age, year of publication etc) – what and
how to code?
 varying and competing expectations
18
And endangered languages archiving?
 extremely diverse context – languages,
cultures, communities, individuals, projects
 typical source is fieldworkers
 no established genres
 difficult for archive staff to manage
 sensitivities and restrictions
 extremely high priority
19
Endangered Languages ARchive (ELAR)
 one of 3 semi-autonomous programs of the
Hans Rausing Endangered Languages
Project
 staff of 3: archivist, software developer,
technician (plus research assistants etc)
 develop policies, preservation
infrastructure, cataloguing and
dissemination, facilities, training, advice,
materials development and publishing
20
ELAR’s holdings
 ELAR currently holds about 50 deposits
with a total volume of approx 4 TB.
 the average deposit is about 80 GB
 sizes vary widely, with a small number of
huge deposits. The median size is around
15GB
 we expect volume to nearly double over
the next 18 months
 see next slides for distribution of data types
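The gap between the average (80 GB) and the median (15 GB) is what a small number of huge deposits produces. The sketch below checks the average from the quoted totals and then uses an invented list of deposit sizes, chosen only to show the effect; the individual sizes are not ELAR data.

```python
from statistics import mean, median

total_gb = 4 * 1000                 # approx 4 TB, taking 1 TB = 1000 GB
deposits = 50
average_gb = total_gb / deposits    # 80.0, matching the figure above

# Hypothetical sizes in GB: many small deposits plus a few huge ones.
sizes = [5] * 20 + [15] * 20 + [100] * 6 + [750] * 4

print(average_gb, mean(sizes), median(sizes))
```

With these invented sizes the four largest deposits hold three quarters of the total volume, while the typical (median) deposit is far smaller than the average.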
21
ELAR holdings by data type
 data types for a 25%
sample of holdings
(early 2008)
 data type by volume
(MB) and number of
files, sorted by
volume
Data type   Volume (MB)   Files
audio           360,411   6,312
video           208,995     895
image            28,592   2,221
msword              223     404
pdf                 196     134
eaf                  33     176
text                 32     781
lex                   9      29
trs                   5     246
xls                   1      19
imdi                  1      26
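The holdings figures are easier to compare as shares of total volume and as average file sizes. A minimal sketch using the numbers from the table (variable names are illustrative):

```python
# (data type, volume in MB, number of files), from the holdings figures above
holdings = [
    ("audio", 360_411, 6_312),
    ("video", 208_995, 895),
    ("image", 28_592, 2_221),
    ("msword", 223, 404),
    ("pdf", 196, 134),
    ("eaf", 33, 176),
    ("text", 32, 781),
    ("lex", 9, 29),
    ("trs", 5, 246),
    ("xls", 1, 19),
    ("imdi", 1, 26),
]

total_mb = sum(mb for _, mb, _ in holdings)
total_files = sum(files for _, _, files in holdings)

for name, mb, files in holdings:
    share = 100 * mb / total_mb
    print(f"{name:7} {share:5.1f}% of volume, {mb / files:7.1f} MB per file")
```

Audio and video together make up about 95% of the sampled volume, while text-like formats account for many files but very little storage.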
The way we were ... ASEDA
 Aboriginal Studies Electronic Data Archive,
AIATSIS Canberra, founded early 1990s
(modelled on Oxford Text Archive)
 receive and catalogue electronic materials
that were at risk or not accessible
 lexica
 grammars
 texts
23
How things have changed ..
 types of data (modalities and genres)
now predominantly media / documentation
 storage methods
now “professional”, mass data systems
 standardisation and metadata
now various standards for data and metadata
 dissemination
now web-based dissemination
expanded influence into practice and workflow
of linguists
24
Why digital?
 preservation: digitisation is the only way that
media (audio and video) can be preserved for
the future
 because it can be copied and transmitted with
zero loss
 cataloguing, sharing, dissemination all
facilitated
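"Zero loss" can be checked rather than assumed: a copy is identical to the original if both produce the same cryptographic digest. The sketch below uses SHA-256 from the Python standard library; the function names are illustrative, and this is one common way to verify copies, not a description of ELAR's procedure.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_identical_copy(original: Path, copy: Path) -> bool:
    """True if both files have the same digest, i.e. the copy lost nothing."""
    return sha256_of(original) == sha256_of(copy)
```

Reading in chunks keeps memory use constant, which matters for large audio and video files.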
25
Digital disadvantages
 digital data is fragile and ephemeral
 cost (human, equipment, maintenance)
 requires strategy and luck to get infrastructure
right
 preservation depends on file and data formats
 depends on tools and software
 depends on formats (prefer standard, open,
explicit, long-lasting)
 materials may have to be converted and
migrated
 some formats require particular software (can
we archive the software?)
These issues impact on archive policy
 how to balance cost of handling and
preservation with value of materials?
 how to provide long-term preservation
when our funding is time-limited?
27
The archiving process (depositors’ view)
28
Documenter and archive interactions
 grant formulation and application
 communications, questions, advice
 training
 archiving services (transfer, conversion etc)
 ongoing management of materials
Documenter & archive interactions
30
Query/interaction topics
 analysis of approx 150 queries from
documenters/linguists
31
ELAR Feedback template
ELAR Data Sample Evaluation
Prepared for:
By:
Date:
TEXT - xx file
Document type
Document format/layout/data structures
Character/language representation
Linking/references
Consistency
33
ELAR Feedback template
AUDIO
Document type/format
Resolution
Quality
Editing
Length
Annotation/transcription
Consistency
34
ELAR Feedback template
VIDEO
Document type/format
Resolution
Quality
Editing
Length
Annotation/transcription
Consistency
35
ELAR Feedback template
GENERAL
File naming
Data volume
Delivery
Consistency
36
Example detail (section: Document format)
Use of typography (size, underlining, bold, spaces
etc) to make headings and other structures is
weak - at least Styles should be used (with
complete consistency).
Using tables to represent interlinear data is reasonably
appropriate, although they would need to be converted
later.
Is it clear from this document, or somewhere else,
where to look up codes etc, such as the speaker
initials?
While the language is consistently labelled in the
interlinear section, it is identified only by the
alternation in font in the first section.
37
Example detail (section: Audio quality)
AD-MD03a 4Noe Song thami miya.wav - quality good.
AD-MD04b 33Boa Sr. LongNarrativeOnTsunami.wav - quality
reasonable, but background hiss is too loud
in proportion to the signal. Was this part of your
original recording (on what equipment?) or was it
introduced by digitisation, in which case it would be
a good idea to try re-digitising.
AD-MD05b 34Peje Phonetic Variation.wav - quality
quite good. Stereo separation of voices is nice.
CIILQ Seasons Contd 699-703.wav - suffers a number
of faults, including severe clipping (overmodulation),
background noise, microphone physical handling,
and poor acoustic representation (probably due to
poor microphone and/or recorder?).
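One of the faults named above, clipping, can be roughly screened for automatically by counting samples that sit at full scale. The sketch below handles 16-bit PCM WAV files using only the Python standard library; the function name and the full-scale threshold are assumptions, and it is a simple illustration, not how the evaluation above was carried out.

```python
import array
import sys
import wave


def clipped_fraction(path: str, threshold: int = 32767) -> float:
    """Fraction of 16-bit samples at or beyond full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if s >= threshold or s <= -threshold - 1)
    return clipped / len(samples)
```

A recording with more than a tiny fraction of full-scale samples is worth listening to closely; a count like this cannot judge hiss, handling noise or microphone quality.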
38
Audio evaluation using Dobbin
 software from Cube-Tec who make
Quadriga
 audio evaluation, conversion and reporting
39
Dobbin
What can you archive (at ELAR)?
 media - sound, video
 graphics - images, scans
 text - fieldnotes, grammars, description,
analysis
 structured data - aligned and annotated
transcriptions, databases, lexica
 metadata - structured, standardised
contextual information about the materials
46
Archive objects
 an “object” could be a file, a set of files, a
directory, a “session” or a set of files with
relationships between them
 these are often called “bundles”
 like all structures, these should be made
explicit
 eg through metadata
 our new catalogue system will provide a facility
to create and label bundles
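Making a bundle explicit means recording both its members and the relationships between them. A minimal sketch of such a structure follows; the class, its fields and the relation name "annotates" are hypothetical and do not describe ELAR's catalogue system.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """A labelled set of files with explicit relationships (illustrative)."""
    label: str
    files: list[str] = field(default_factory=list)
    # (source file, relation, target file), e.g. an annotation of a recording
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def relate(self, source: str, relation: str, target: str) -> None:
        # Only files that belong to the bundle may be related to each other.
        for name in (source, target):
            if name not in self.files:
                raise ValueError(f"{name} is not part of bundle {self.label}")
        self.relations.append((source, relation, target))


session = Bundle("WulaTuki_LunarEclipse",
                 ["WulaTuki_LunarEclipse.wav", "WulaTuki_LunarEclipse.eaf"])
session.relate("WulaTuki_LunarEclipse.eaf", "annotates", "WulaTuki_LunarEclipse.wav")
```

Storing the relation as data, rather than leaving it implied by file names, is what lets later users and software recover the structure.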
47
Data “portability” (Bird & Simons 2003)
 data should also be “portable” (Bird &
Simons “Seven Dimensions ...”)
 complete
 explicit
 documented
 preservable
 transferable
 accessible
 adaptable
 not technology-specific
 (also appropriate, accurate, useful etc!!)
Archive material should be selected
 example: Depositor’s question: How much
video can I archive?
 answer: ...
 however,
 unlikely that linguist is in position to plan and
consistently create excellent video, so selection
is unavoidable
 data has always been edited and selected!
49
(... selection)
 in your linguistic work you also:
 selected
 labeled
 transformed/processed/edited
 added, corrected, expanded
 made links
 made or assumed relationships between
“whole” and processed units; invented labels,
IDs, scope etc
 imposed formats
50
File organisation example 1
IPF10011-Disk3-Story-WulaTuki-LunarEclipse
    IMDI_3.0.xsd
    WulaTuki_LunarEclipse.eaf
    WulaTuki_LunarEclipse.imdi
    WulaTuki_LunarEclipse.imdi.backup
    WulaTuki_LunarEclipse.pfs
    WulaTuki_LunarEclipse.txt
    WulaTuki_LunarEclipse.wav
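In this example the relationship between files is carried by a shared base name. That convention can be turned into explicit groups with a few lines of code; the grouping rule (everything before the first dot) is an assumption made for this sketch.

```python
from collections import defaultdict

names = [
    "IMDI_3.0.xsd",
    "WulaTuki_LunarEclipse.eaf",
    "WulaTuki_LunarEclipse.imdi",
    "WulaTuki_LunarEclipse.imdi.backup",
    "WulaTuki_LunarEclipse.pfs",
    "WulaTuki_LunarEclipse.txt",
    "WulaTuki_LunarEclipse.wav",
]

groups = defaultdict(list)
for name in names:
    stem = name.split(".", 1)[0]   # text before the first dot
    groups[stem].append(name)

for stem, members in groups.items():
    print(stem, len(members))
```

The six session files fall into one group, while the schema file IMDI_3.0.xsd lands in a group of its own under the key "IMDI_3" because of the dot in its version number. That is a reminder that naming conventions only work when applied consistently.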
51
File organisation example 2
/
    labelling-system.doc
    AngryD-Bsi
        AngryD-Bsi.pdf
        AngryD-Bsi.wav
        AngryD-Bsi.doc
52
File organisation example 3
/
    archivist_notes.txt
    ELAN transcription key FTG0025.pdf
    Overview metadata FTG0025.xls [open]
    Kay07-aud
        Kay07-aud-jul03a.wav
        Kay07-aud-jul03b.wav
        Kay07-aud-jul03c.wav
53
Metadata
54
Metadata
 Metadata
 the data about data that enables the
management, identification, retrieval and
understanding of that data
 reflects the knowledge and practice of
data providers
 defines and constrains audiences and
usages for data
 documentation’s goals heighten the
importance of metadata
55
Metadata formats
 common or standard:
 IMDI (‘ISLE Metadata Initiative’, from DoBeS)
 OLAC (Open Language Archives Community)
 EAD, and others
 ELAR: has created its own set, currently in
implementation
 deposit-wide metadata in deposit form
 file level metadata (will be) by web form
 also, depositor’s own metadata
56
On metadata formats
 each depositor can also have different
metadata!
 types of metadata are relative to each
project, consultants, community ...
 our goal: to maximise the amount and
quality of metadata
 quality and extent is more important than
standards and comparability
 many depositors are sending extensive
metadata in a variety of formats including
spreadsheets
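Spreadsheet metadata is easy to work with once exported to CSV. The sketch below reads such an export and reports which fields are still empty for each file; the column names and the two sample rows are invented for illustration and are not an ELAR template.

```python
import csv
import io

# Hypothetical spreadsheet export; column names and rows are assumptions.
sheet = io.StringIO(
    "file,speaker,genre,date\n"
    "rec001.wav,AB,song,2009-07-03\n"
    "rec002.wav,,narrative,\n"
)


def missing_fields(rows, required):
    """Map each file to the required fields left empty in its row."""
    gaps = {}
    for row in rows:
        empty = [f for f in required if not (row.get(f) or "").strip()]
        if empty:
            gaps[row["file"]] = empty
    return gaps


rows = list(csv.DictReader(sheet))
gaps = missing_fields(rows, ["speaker", "genre", "date"])
print(gaps)
```

A report like this supports the goal of maximising the amount and quality of metadata without forcing every depositor into one format.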
Types of metadata
 depositor's / delegates' details
 descriptive metadata
 administrative metadata
 preservation metadata
 access protocols
 metadata for individual files
Depositors and delegates
 name
 address
 contact details (telephone, fax, email, URL)
 role
 affiliation
 date of birth
 nationality
Descriptive metadata
 title, description, subject, summary
 keywords
 subject language, community
 location
 time span
Administrative metadata
 project details
 funding and hosting institutions
 details of external copies
 modifications and status
 details of accession agreement
 cf. deposit form
 access
 access protocols (see elsewhere)
 group membership identification
61
Preservation metadata
 carrier media
 formats, size
 provenance (source/history)
62
File-level metadata
 media files
 duration, file size
 MIME type, content type
 text files
 font, character set, encoding
 format, markup
 access protocols
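Several of the media-file fields can be read from the file itself rather than typed in. A minimal sketch for PCM WAV files, using only the Python standard library; the function name and the dictionary keys are illustrative assumptions, not a metadata standard.

```python
import mimetypes
import os
import wave


def media_file_metadata(path: str) -> dict:
    """Collect duration, size and MIME type for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    mime, _ = mimetypes.guess_type(path)  # guessed from the file extension
    return {
        "file": os.path.basename(path),
        "duration_seconds": duration,
        "size_bytes": os.path.getsize(path),
        "mime_type": mime,
    }
```

Note that the MIME type here is inferred from the extension only, so a misnamed file would be mislabelled. Content type (for example song or narrative) still has to come from the depositor.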
63
Access protocol
 sensitivities, restrictions: identification,
description and implementation
 deposit, file or object-level protocol
 depositor-oriented
 change/manage protocol over time
 delegate
 other rights holders
 sunset clause
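An access protocol of this kind can be represented as data and checked at request time. The sketch below assumes group-based permissions and reads "sunset clause" as a date after which the restriction lapses; both the structure and that reading are assumptions for illustration, not ELAR's implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class AccessProtocol:
    """Illustrative access rule for a deposit, file or object."""
    allowed_groups: set[str] = field(default_factory=set)
    sunset: Optional[date] = None  # restrictions lapse on this date, if set

    def permits(self, user_groups: set[str], today: date) -> bool:
        if self.sunset is not None and today >= self.sunset:
            return True  # sunset reached: the restriction no longer applies
        return bool(self.allowed_groups & user_groups)


protocol = AccessProtocol({"depositor", "community"}, sunset=date(2040, 1, 1))
print(protocol.permits({"public"}, date(2010, 3, 1)))
print(protocol.permits({"community"}, date(2010, 3, 1)))
```

Because the rule is stored as data, a depositor or delegate can change it over time without touching the materials themselves.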
64
Protocol grows naturally with documentation
 focus on recorded data » more people, more
genres, less researcher knowledge
 community participation » framework for speakers
to shape documentation process and products
 mobilisation » selecting, juxtaposing; community
participation
 focus on revitalisation » which language to teach?
who to host and teach? who can learn? etc
 time » significance and sensitivities change over
time
 access » increasing scope for dissemination,
control of IP
65
Other kinds of metadata
 information to make resources accessible
to community members
 genres eg songs
 languages, eg community language
 materials for language teaching and learning
 types of metadata are relative to the
particular project, consultants, community
...
66
Archiving and data management
 Most data-related issues are properly part
of linguistic data management
 There are now few data-related issues that
are archive-specific
 But teaching curricula, training, and
practices need time to catch up
 Ultimate goal of documenting languages
well means that we must find the optimal
“division of labour” in each case
67
ELAR assists depositors
 preserve your deposited materials
 implement your access restrictions etc
 provide advice, general and specific
 assistance, eg data conversion
 provide web-based deposit management
 allow updates and additions
 provide some equipment and services
 on a case by case basis, develop
resources
What is required to make a deposit?
 resource(s) for an endangered language
 it could be just one file
 inventory / metadata
 deposit form
 an online version will be available soon
 deposits can be updated, supplemented,
metadata added/modified
69
How do depositors deliver data?
 Hard disks
 we return them
 we send them out
 some grant applicants factor them into grants
 Email
 good for samples for evaluation
 OK for most text materials
 Flash cards and USB sticks
 A web upload facility will be
available later
 Web download
70
What about CDs and DVDs?
 we have found CDs, and
especially DVDs, to be
very unreliable
 DVD fail rate about 10%
 cause confusion as files
are allocated to fit on
disks, not according to
corpus structure
 create a lot of work for
depositors and for ELAR
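A 10% per-disk fail rate compounds quickly when a deposit is spread over many disks. The sketch below estimates the chance that at least one disk in a deposit fails, assuming failures are independent, a single-layer DVD capacity of 4.7 GB, and the average deposit size of about 80 GB quoted earlier; the independence assumption is a simplification.

```python
import math

fail_rate = 0.10          # per-disk fail rate quoted above
deposit_gb = 80           # average deposit size quoted earlier
dvd_capacity_gb = 4.7     # single-layer DVD capacity

disks = math.ceil(deposit_gb / dvd_capacity_gb)
p_any_failure = 1 - (1 - fail_rate) ** disks

print(disks, round(p_any_failure, 2))
```

Under these assumptions an average deposit needs 18 DVDs, and the chance that at least one of them is unreadable is about 85%.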
71
We ask depositors to
 manage materials well
 collect and provide protocol information
 deliver materials, metadata
 send trial samples etc
 (funded grantees) not withhold materials
 share/manage/delegate custodianship of
materials
 maintain relationships with language
stakeholders and ELAR
72
ELAR online
 We now have ELAR online archive,
although data is only just starting to be
released to public view:
 http://elar.soas.ac.uk/
 The archive has been implemented using a
Content Management System, in this
case Drupal:
 open-source web software
 based on PHP, MySQL and JavaScript
 implements user, role and group-based
access to materials
73