E-MELD Working Group for Corpus Management and Metadata Preliminary report June 20, 2006 Participants  Heidi Johnson (chair)  Michael Appleby (liaison)  Gary Simons  Joseph Grimes 


E-MELD Working Group for
Corpus Management and
Metadata
Preliminary report
June 20, 2006
Participants
 Heidi Johnson (chair)
 Michael Appleby (liaison)
 Gary Simons
 Joseph Grimes
 Shauna Eggars
 Alison Alvarez
 Anthony Aristar
 Charles Warner
Topics visited today
 Transmittal of data to archives
 Metadata:
– Standards
– Simple tools
– How to get people to do it
Transmittal to archives
 This means "electronic transmittal of digital
data & metadata"
 Group consensus: it's the way of the future
and should be encouraged
 Hows & whens depend on the institution in
which the archive is embedded - not really
an EMELD issue
Transmittal to archives II
Gary S. notes institutional repositories using
D-SPACE are growing rapidly:
 Mainly for archiving/disseminating e.g.
preprints of articles
 May be a role OLAC can play in
developing standards & tools for supporting
multimedia language data in D-SPACE
repositories
Transmittal to archives III
Johnson wishes it noted that this should be
a low priority issue because:
 D-SPACE isn't handy yet for most linguists;
 Uploading/downloading 100+ Mb audio
files is too slow for most field linguists &
speakers;
 Mailing stuff on dvds & cds works great
and requires no development so people can
just go ahead and do it now with no excuses
Metadata: the big picture
 OAIster meta-catalog (?) now has over 5
million records;
 Only list things with URLs (not things in
archives that require logins);
 OAI is just basic Dublin Core: very
dumbed-down metadata;
 OLAC's role continues to be providing the
linguist's-eye view on this vast ocean of
data.
Metadata - Standards
 We have a standard for metadata: the OLAC
schema.
 This should be considered bottom-line,
basic, required catalog information for all
linguistic data.
 Deeper schemas like IMDI and specialized
subfield schemas are encouraged.
So,
how do we get people to do it?
Johnson's complaint:
Putting up a page of XML code and saying
"this is what metadata looks like" is
guaranteed to drive the average field
linguist away. It frustrates them and gives
them an excuse to do nothing.
 "XML is an interchange format, not an
authoring format." Gary S.
Metadata - Solutions
1. All tools for producing and manipulating
linguistic data should include sections for
creating and maintaining (at least) OLAC
metadata.
2. Tools that do this need to be more widely
disseminated (e.g. WordCorr.)
3. New tools being developed must pay
attention to standards so that they are
interoperable with other tools.
A very simple solution
 We need templates for metadata catalogs for all
the popular tools: Excel, Word, FileMaker Pro.
 These templates should be downloadable from the
School.
 Archives then need transformer scripts/tools to
convert these templates into XML.
 Archives can help maintain & distribute the
templates & transformers, but we are all
underfunded so we need help from all you Perl
wizards out there!
More about the simple solution
 Templates for popular tools and transformer
scripts don't require much development
effort, but they will result in a HUGE
improvement in the amount of metadata that
is collected and transferred with data to
archives.
 From nothing to something in seconds flat!
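A transformer of the kind described above could be quite small. The following is a minimal sketch only, written in Python rather than Perl, and it rests on several assumptions: the template is saved as CSV, its column headers are plain Dublin Core element names, and the `catalog`/`record` wrapper elements and the function name `rows_to_xml` are hypothetical. The output is not validated against the OLAC schema; an archive would adapt the wrapper to whatever its repository expects.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Dublin Core element namespace; the wrapper elements below are hypothetical.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen simple Dublin Core elements; other columns are ignored.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def rows_to_xml(csv_text):
    """Turn each row of a CSV template into one <record> of dc: elements."""
    root = ET.Element("catalog")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header row
            name = column.strip().lower()
            value = (value or "").strip()
            if name in DC_ELEMENTS and value:
                ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return root


# Made-up sample rows, as a linguist might export them from a spreadsheet.
SAMPLE = """title,creator,date,language
Elicitation session 1,A. Linguist,2006-06-20,und
Narrative text,,2006-06-21,und
"""

catalog = rows_to_xml(SAMPLE)
print(ET.tostring(catalog, encoding="unicode"))
```

Empty cells are simply skipped, so a partly filled template still yields a usable record.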
A once-and-future standard?
 Recording audio/video headers with the
basic metadata.
 All recordings from the 60's have them; few
recordings from the 90's do. What
happened?
 This must be part of all field methods
classes and taught on projects.
 We should put some good examples - both
audio & video - up at the School.
Metadata details for EL data
More demographic info about speakers:
Age, sex, family & social position & role,
occupation/economic status/education level
where relevant, native language, other
languages, place of origin; option to keep the
speaker's name anonymous or use a nickname
Note: the IMDI schema includes these fields.
How do we get people to keep
catalog information?
"Never underestimate people's ability to get
out of doing something." Alison
1. Make their grade or degree depend on it.
2. Funder pressure: require all data collected
w/grant to be archived.
3. Nag them relentlessly.
4. Suite of simple tools & templates that
makes it easy and possibly even fun.
Tony's Extremely Excellent Idea
 Departments have control over standards for
dissertations: require that dissertations
include in an appendix all the metadata for
all the data on which the diss is based.
 Make it an LSA resolution: who could
object?
 Gary S.: so now the tools should say
"Which appendix is this catalog going
into?"
Editorial pressure
 Linguistic publications could demand
(prefer) citation of properly archived data,
so linguists would archive & cite in order to
publish.
 Step 1: a standard format for citing
audio/video linguistic data as well as texts;
 Step 2: widely disseminating this standard;
 Step 3: increasingly requiring citation.
Summary
 We have a perfectly good metadata standard
(in fact, we have two.)
 What we need is to get people to use them!!
 We need:
– Plain language lists of required elements and
examples available at the School;
– Simple templates & transformers for use with
popular tools;
– An arsenal of "persuaders" (publication, diss
reqs, funder pressures, shame & guilt, etc.)
The publication value chain
 Gary S. notes that we are moving towards
institutional repositories.
 Which means that libraries are taking on
more of the publishers' role in selecting,
(re)formatting, & disseminating.
 So that it may be that depts will lean on
authors to publish via institutional
repositories, rather than or in addition to
traditional publications
The perfect corpus management
tool (PCMT)
The ideal tool helps people manage and
integrate their workflow:
 project initiation
 collection of primary data
 products of analysis
 publications based on the corpus
Features of the PCMT - I
1. designed from the start to support
internationalization
2. has remarks fields at every level &
especially wherever the primary input is
from a controlled vocabulary (e.g. genre)
3. is easy to use, for linguists & speakers
Features of the PCMT - II
4. generates identifiers that relate related
materials. You can use these identifiers
to label media and as file names.
5. keeps track of relationships among
materials
6. allows you to treat materials as unrelated
if you want, so you can e.g. archive all
your recordings and then later add texts as
they are done
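One way features 4 to 6 might fit together is sketched below. The naming scheme (project code, year, running number, then a suffix for the kind of material) and the function names are assumptions made up for illustration, not something the group specified. The idea is only that related materials share a base identifier, so a text added later can still be tied to a recording that was archived earlier.

```python
def base_id(project, year, sequence):
    """Base identifier shared by all materials from one session."""
    return f"{project}-{year}-{sequence:03d}"


def material_id(base, kind):
    """Identifier for one item, e.g. the audio or the transcript."""
    return f"{base}_{kind}"


def file_name(identifier, extension):
    """The same identifier doubles as a media label and a file name."""
    return f"{identifier}.{extension}"


def related(id_a, id_b):
    """Two items are related if they share the same base identifier."""
    return id_a.rsplit("_", 1)[0] == id_b.rsplit("_", 1)[0]


session = base_id("DEMO", 2006, 7)
audio = material_id(session, "audio")
# The transcript can be created much later and still relates to the audio.
text = material_id(session, "text")
other = material_id(base_id("DEMO", 2006, 8), "audio")

print(file_name(audio, "wav"))   # DEMO-2006-007_audio.wav
print(related(audio, text))      # True
print(related(audio, other))     # False
```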
Features of the PCMT - III
7. works offline, but lets you connect and upload
your metadata to an archive or other repository
8. lets you connect to the Ethnologue to look up
language codes
9. divides metadata into sub-packages (objects) so
that users don't have to re-enter any info (e.g. for
the language)
10. metadata objects are easy to share across tools
Features of the PCMT - IV
11. lets you copy info from one md object to
another so you don't have to type the same
info twice. Ex: all your consultants have
the same socio-economic situation; all
your recordings were elicited; this series
of questionnaires was collected by the
same person...
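As a rough illustration of feature 11, copying one metadata object and changing only what differs could look like the sketch below. The `Contributor` class and its fields are hypothetical, and all values are placeholders (`und` is the ISO 639 code for an undetermined language).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Contributor:
    """Hypothetical contributor object; field names are illustrative."""
    nickname: str
    native_language: str
    occupation: str
    place_of_origin: str


first = Contributor(
    nickname="Speaker A",
    native_language="und",
    occupation="farmer",
    place_of_origin="Village X",
)

# Copy everything from the first consultant, retyping only the nickname.
second = replace(first, nickname="Speaker B")

print(second)
```

The original object is left unchanged, so a correction made to one consultant does not silently alter another.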
Features of the PCMT - V
12. objects are modelled graphically so you
can e.g. drag this contributor who speaks
this language into this project
13. lets you customize so you can nickname
objects for quick reference
14. supports versioning of materials
15. cross-platform, unicode, outputs xml, etc
Features of the PCMT - VI
10. objects:
1. project info
2. language info
3. contributor info
4. equipment info
5. data info: objects for recordings, texts,
databases, spreadsheets, etc etc etc
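The object list above might be modelled roughly as follows. This is a sketch under assumptions: the class and field names are invented for illustration and are not taken from the OLAC or IMDI schemas. The point it tries to show is that a recording refers to language, contributor, and equipment objects instead of repeating their details, so each piece of information is entered once.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Language:
    code: str
    notes: str = ""


@dataclass
class Contributor:
    nickname: str


@dataclass
class Equipment:
    label: str
    specs: str = ""


@dataclass
class Recording:
    identifier: str
    date: str
    language: Language                      # reference, not a copy
    contributors: List[Contributor] = field(default_factory=list)
    equipment: List[Equipment] = field(default_factory=list)


@dataclass
class Project:
    title: str
    director: str
    data: List[Recording] = field(default_factory=list)


# Placeholder values throughout.
language = Language(code="und")
speaker = Contributor(nickname="Speaker A")
recorder = Equipment(label="recorder 1")

project = Project(title="Demo project", director="A. Linguist")
for number in (1, 2):
    project.data.append(
        Recording(f"DEMO-2006-{number:03d}", "2006-06-20",
                  language, [speaker], [recorder])
    )

# Editing the shared language object once updates every recording's view.
language.notes = "northern variety"
print([rec.language.notes for rec in project.data])
```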
Example
 At the start of your project, you enter:
– project data: your name & contact info,
director, funder, project period, description
– language data: code, notes about the variety,
speaker population, etc
– equipment: enter all the specs as you take
things out of the box
Example, cont.
 In the field:
 Enter contributor data at the start of a
session with each new consultant
 Enter recording data each time you record:
– tool gives you an identifier/label
– date, place, context, conditions
– reference contributors
– reference equipment
– reference language
Example, cont.
 Back at the university:
 Review/revise previously entered info
 Transmit recordings and metadata to archive
 Enter info for analytical products:
– reference recording objects
– reference contributors (new ones, like yourself)
– create format objects for software, platform,
fonts, etc