A Field Linguist’s Guide to Making
Long Lasting Texts and Databases
LSA Organized Session
January 4, 2007
Anaheim, California
Organized by:
Jeff Good and Heidi Johnson
Open Language Archives Community (OLAC)
Outreach Committee
Moderator:
Laura Welcher
Speakers:
Debbie Anderson,
Michael Appleby, Jessica Boynton,
Naomi Fox, Connie Dickinson
Presentations from this session
will be posted at
http://www.languagearchives.org/news.html#olac07
Best Practice in Your Back
Pocket: Getting the Most Out
of the Tools You Have
Laura Welcher
The Rosetta Project / Long Now Foundation
A great way to freak out a linguist
“To be in compliance with best practice
recommendations (ahem), your
interlinear glossed text needs to be in
XML format with morphosyntactic tags
that reference the GOLD ontology.”
Reality Check
• There’s a difference between ideal best practice (which is still
somewhat of a moving target) and a good, sufficient
approximation of it.
• Some common practices are far from ideal or sufficient (like
saving the dictionary you worked 5 years on as a Microsoft
Word document file).
• We can easily modify these practices to produce archivable
resources that will last.
• And this can be done using tools that you already have, and
knowledge that is easy to acquire.
• Hence the title: Best practice in your back pocket: getting the
most out of the tools that you have.
Best Practice
• E-MELD project (Electronic Metastructure for
Endangered Languages Data)
• Goals:
– Help preserve endangered languages data
– Develop infrastructure for electronic archives
• Defining best practice
– E-MELD summer workshops http://www.emeld.org
• Promoting best practice:
– “School of Best Practice” at
http://www.emeld.org/school/index.html
Good, Better, Best Practice
• The information presented here comes
from presentations of the E-MELD team,
particularly the following:
• Simons and Dry (2006) Good, Better, and
Best Practice: The Experience of the
E-MELD Project
http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf
The first consideration:
working, presentation and
archival formats
• The process of creating digital language
resources usually involves creating files
in different formats:
– Working format
– Presentation format
– Archival format
Working Format
• The saved format of whatever program you are
working in:
– .doc (MS Word)
– .xls (Excel)
– .fp7 (FileMaker Pro)
• This format is what you use for your own
convenience and productivity
– Typically this format is proprietary
– Less typically, people may work in programs whose native
format is not proprietary, automatically saving in .txt
(plain text), .xml or .html (types of formatted plain text)
• A proprietary working file format is not the only
format you should have!
Archival Format
• A very important format -- this format helps
ensure that your resource will last and be
usable well into the future
• An archival format has LOTS of good qualities
(Simons, 2004)
– Lossless
– Open Standard
– Transparent
– Supported by multiple vendors
Archival Format: Lossless
• Avoid compressed formats that lose content
• A good rule-of-thumb is to use uncompressed formats:
– Text: .txt, .html, .xml
– Images: .tiff, .bmp
– Audio: .wav (Windows), .aiff (Apple), .au (Sun, Java, Unix)
but make sure it is PCM (uncompressed)
– Video: .avi (some codecs), .rtv
• Most compressed formats lose content, but some are
lossless (.zip for text, black and white .gif for images,
Apple Lossless (ALAC) for audio, the JPEG 2000 codec in its
lossless mode for video) -- use with caution!
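The advice to make sure audio is PCM can be checked by machine. The following is a minimal sketch, assuming Python and its standard `wave` module; the helper names `is_uncompressed_pcm` and `make_silent_wav` are illustrative and not part of any archive's tooling.

```python
import io
import wave


def is_uncompressed_pcm(wav_file):
    """Return True if the WAV data can be read as uncompressed PCM."""
    try:
        with wave.open(wav_file, "rb") as wav:
            return wav.getcomptype() == "NONE"
    except (wave.Error, EOFError):
        # The standard-library reader rejects data it cannot read as PCM.
        return False


def make_silent_wav():
    """Build a one-second silent 16-bit mono PCM file in memory (demo data)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)       # mono
        out.setsampwidth(2)       # 16-bit samples
        out.setframerate(44100)   # 44.1 kHz
        out.writeframes(b"\x00\x00" * 44100)
    buffer.seek(0)
    return buffer


print(is_uncompressed_pcm(make_silent_wav()))
```

The same function accepts a file path, so it could be run over a folder of recordings before deposit.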
Archival Format: Open
• Avoid proprietary formats like .doc, .xls, .fp7
• The company that produces the software may
stop supporting the format, rendering your file
unreadable
• For your archival format, choose a file format
that is “open standard” like .xml, .html, .pdf or
.rtf
• “Open standard” means that the specification
of the format is publicly available, and
anyone can implement it.
Archival Format: Transparent
• Use a file format that is easy to interpret
• Example: text files (.txt)
– Have common characters like letters, numbers,
punctuation
– Virtually no formatting (tabs, returns)
– Because of the simplicity of this file type, many programs
can read it and make use of the data
• Other transparent formats: .wav, .aiff can be read by
any audio program
• Not transparent: .zip, .mp3 (require a special
algorithm for interpretation)
Archival Format: Supported
• Prefer formats that are widely supported
• If more vendors support it, it is less likely to
become obsolete
• This is another reason to prefer an open
standard format to a proprietary one
Presentation Format
• Presentation formats are those you choose for
the convenience and ease of accessibility and
display
• It is fine for presentation formats to be
compressed, so long as you make a lossless
archival copy as well
• Examples of presentation formats include .pdf
files, .mp3 files, .jpg images, MPEG-2 video
So far, so good?
• As a responsible linguist creating digital
language documentation that will last well
into the future you…
– Know the difference between a working,
presentation, and archival file format
– Know what makes a good archival format (LOTS)
– Maintain an archival format of your data
• Anything beyond this? Yes, a bit more…
Best Practice Digital Resources are…
• Preservable in formats that are not vulnerable
to decay or obsolescence (see LOTS)
• Intelligible so that content is easily
understood by future scholars
• Accessible so that resources are easily
discovered and accessed
• They are also interoperable, but this is mostly
a concern of archives and services
(Simons and Dry, 2006)
Create Preservable Resources
• Linguists are responsible for making
preservable resources
• That is, creating archival formats that
follow the principles of LOTS
Create Intelligible Resources
• In order to create resources that are intelligible
to others, you must document your practices!
• Documentation includes:
– Your markup practices
– The encoding you use
– Metadata about your resources
• This information should be kept in a file or files
in an archival format, and archived along with
your resources.
Presentational Markup
• Many people use presentational markup,
particularly in the working formats like Microsoft
Word.
• Presentational markup means that aspects of the
presentation (like bold, italics, indenting) are
themselves meaningful
• For example…
Example of Presentational Markup
AS_5.2.1978_audio: Alice Spear, Potawatomi,
“Crane Boy”, May 2, 1978, Mayetta, Kansas.
<bold>AS_5.2.1978_audio</bold>
<plain.text>Alice Spear</plain.text>
<italics>“Crane Boy”</italics>
<plain.text>May 2, 1978</plain.text>
<plain.text>Mayetta, Kansas</plain.text>
Presentational Markup
• Presentational markup is not recommended.
BUT if you do use it, describe all meaningful
aspects (e.g. “bold” means head word,
“italics” is used for the part of speech)
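Once the meaning of each presentational tag is written down, that description can also drive a conversion to descriptive markup. The following is a minimal sketch in Python, assuming a hypothetical legend in which bold marks the head word, italics marks the part of speech, and plain text marks the gloss; the part of speech "n" in the sample entry is illustrative only.

```python
import re

# Hypothetical legend documenting what each presentational tag means
# in one particular dictionary file.
LEGEND = {
    "bold": "headword",
    "italics": "part.of.speech",
    "plain.text": "gloss",
}


def to_descriptive(line, legend):
    """Rename presentational tags to the descriptive names in the legend."""
    def rename(match):
        slash, tag = match.group(1), match.group(2)
        # Tags missing from the legend are left unchanged.
        return "<{}{}>".format(slash, legend.get(tag, tag))
    return re.sub(r"<(/?)([\w.]+)>", rename, line)


entry = "<bold>mnomen</bold> <italics>n</italics> <plain.text>rice</plain.text>"
print(to_descriptive(entry, LEGEND))
```

The conversion is only as reliable as the legend, which is why documenting every meaningful aspect of the formatting matters.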
Descriptive Markup
• It is better practice to use descriptive markup,
like XML
• XML is basically text with “tags” that provide
information about what is between the tags
– <headword>mnomen</headword>
– <gloss>rice</gloss>
• Tags can also be used to group information, much
like you would group information in a database
record, and have a whole set of information in a
database
Example of Descriptive Markup
AS_5.2.1978_audio: Alice Spear, Potawatomi,
“Crane Boy”, May 2, 1978, Mayetta, Kansas.
<ID>AS_5.2.1978_audio</ID>
<speaker>Alice Spear</speaker>
<description>“Crane Boy”</description>
<recording.date>May 2, 1978</recording.date>
<location>Mayetta, Kansas</location>
Descriptive Markup: XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="archive.xsl"?>
<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <subject.language code="x-sil-POT"/>
    <language code="en"/>
    <format>Analog audio recording on Cassette tape</format>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <description>“Crane Boy” narrative told in Potawatomi and in English</description>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
    <relation>digital audio: AS_5.2.1978_audio.wav, interlinear text: AS_5.2.1978_audio.txt</relation>
    <type.linguistic code="primary_text"/>
    <rights>Some restrictions; contact field linguist</rights>
  </record>
</my.archive>
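Because the tags say what each piece of information is, a program can pull out fields by name. The following is a minimal sketch in Python using the standard `xml.etree.ElementTree` module on a shortened copy of the record; it is one possible way to read such a file, not a tool prescribed by the talk.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the record, with straight quotes around attribute values.
RECORD = """<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
  </record>
</my.archive>"""

root = ET.fromstring(RECORD)
record = root.find("record")

identifier = record.findtext("identifier")
# Map each contributor's role (the "refine" attribute) to the person's name.
contributors = {c.get("refine"): c.text for c in record.findall("contributor")}
recording_date = record.find("date").get("code")

print(identifier, contributors["speaker"], recording_date)
```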
Descriptive Markup: XML
• It is a good practice to use standard tags where they
are available.
– OLAC has a set of tags that you would use for metadata to
describe your resources
– GOLD has a set of tags used for morphosyntactic description
• Otherwise, be sure to document the meaning of the
tags that you use
• Although some people feel comfortable working in
XML, many don’t like to use it as a working format.
• Fortunately many common programs now allow you to
save your work as an XML file.
The Advantage of XML
• Besides creating an archival data file, XML
has other advantages
• By creating stylesheets, you can give the
same XML file different presentation forms
• For example…
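The idea of one data file with several presentations can be sketched in a few lines. This is a minimal sketch, assuming Python functions standing in for XSLT stylesheets (the Python standard library has no XSLT processor) and a hypothetical `<entry>` wrapper around the headword and gloss tags.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry; <entry> is an illustrative wrapper tag.
ENTRY = "<entry><headword>mnomen</headword><gloss>rice</gloss></entry>"


def as_plain_text(entry):
    """Render the entry as a single line of plain text."""
    return "{}: {}".format(entry.findtext("headword"), entry.findtext("gloss"))


def as_html(entry):
    """Render the same entry as an HTML paragraph."""
    return "<p><b>{}</b> <i>{}</i></p>".format(
        html.escape(entry.findtext("headword")),
        html.escape(entry.findtext("gloss")),
    )


entry = ET.fromstring(ENTRY)
print(as_plain_text(entry))
print(as_html(entry))
```

The XML file itself is never changed; only the rendering differs.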
Delimited Text
• Another kind of markup that you might find yourself using is
delimited text.
• Spreadsheet and database programs allow you to export your
data as text, delimited by a particular character
– Comma separated text (.csv)
– Tab separated text (.tab)
• To help with intelligibility, create an initial record where the
name of each field / cell is given inside the record itself. That
way, the names of your fields / cells will be exported and saved
along with the rest of your data.
• Text data exported this way is good practice, particularly if you
are careful about documenting your practices inside your fields /
cells (for more on this see following slides).
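As an illustration of keeping field names inside the exported data, here is a minimal sketch in Python using the standard `csv` module. The field names and the single row are hypothetical; a real export would come from your spreadsheet or database program.

```python
import csv
import io

# Hypothetical field names and data for a small lexicon.
FIELDS = ["headword", "part_of_speech", "gloss"]
rows = [{"headword": "mnomen", "part_of_speech": "n", "gloss": "rice"}]

# Write tab-delimited text whose first record names every field.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
exported = out.getvalue()

# Anyone reading the file later can recover the field names from the data itself.
reloaded = list(csv.DictReader(io.StringIO(exported), delimiter="\t"))
print(exported)
```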
Other aspects of markup
• Document any special conventions that you use
• What do your morpheme boundary markers mean (+ /
- / = …any others?)
• What glossing conventions do you use? Give the full
names of abbreviations (e.g. POS means ‘possessive’,
PV means ‘preverb’).
• Describe grammatical terms that you use (like ‘aorist’
or ‘preverb’) and what they mean for the language you
are describing. You don’t have to write a grammar -- a
sentence or two describing each term is sufficient.
• Also note if you are using standard terminology sets,
like Leipzig Glossing Rules, or GOLD terminology
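A documented abbreviation list can double as a check on your glosses. The following is a minimal sketch in Python, assuming glosses are written in upper case and morphemes are separated by the boundary markers listed above; the gloss line itself is a hypothetical example.

```python
import re

# Abbreviations documented so far (from the examples above).
ABBREVIATIONS = {"POS": "possessive", "PV": "preverb"}


def undocumented_glosses(gloss_line, abbreviations):
    """Return upper-case gloss labels that have no documented full name."""
    # Split on whitespace and on the morpheme boundary markers + - = and .
    tokens = re.split(r"[-+=.\s]+", gloss_line)
    labels = {token for token in tokens if token.isupper()}
    return sorted(labels - set(abbreviations))


gloss_line = "PV-go-3SG POS-house"  # hypothetical gloss line
print(undocumented_glosses(gloss_line, ABBREVIATIONS))
```

Any label the function reports is one that a future reader would have to guess at.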
Document the Encoding
• Identify the character set you are using
• Document any non-standard characters
• Best practice is to use Unicode
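One way to document the characters you use is to list every non-ASCII character in a file with its Unicode code point and name. This is a minimal sketch in Python using the standard `unicodedata` module; the sample string is hypothetical.

```python
import unicodedata


def character_inventory(text):
    """Map each non-ASCII character to its code point and Unicode name."""
    inventory = {}
    for char in sorted(set(text)):
        if ord(char) > 127:
            name = unicodedata.name(char, "UNNAMED")
            inventory[char] = "U+{:04X} {}".format(ord(char), name)
    return inventory


# Hypothetical sample: eng and open e, written as escapes for portability.
sample = "\u014b \u025b"
for char, label in character_inventory(sample).items():
    print(char, label)
```

Characters reported as UNNAMED have no name in the Unicode database Python ships with (private-use characters, for example) and are the ones that most need a written description.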
Create Metadata
• You will need to create some additional
information about your resources
• Metadata usually includes information about:
– The setting (time, date, participants, location)
– The language (ISO 639-3)
– Linguistic type (text, grammar, lexicon) and subject
– Access restrictions
• There are metadata standards for language
resources: OLAC and IMDI
OLAC Metadata Elements
Contributor (content)
Coverage (e.g. location)
Creator (content)
Date
Description
Format
Encoding Format (character set)
Markup Format (XML schema)
Identifier (file name, URL)
Language (audience)
Publisher
Relation (to another resource)
Rights (controlled vocab.)
Source (say, for re-elicited data)
Subject (controlled vocab.)
Subject Language (ISO 639-3 code)
Title
Linguistic Type
http://www.language-archives.org/OLAC/olacms.html
Create Metadata
• Keep a metadata record for each of your resources.
• The records should themselves be in an archival format. This
could be:
– A text file (good)
– Delimited text, exported from a simple database file (good)
– An XML file (better)
– An OLAC or IMDI formatted XML file (best)
• Your archivist may have a preference about metadata formats,
and prefer something relatively simple (like a paper form) if the
archive will be manually entering the metadata.
• Archive this file along with the rest of your resources.
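For the "XML file (better)" option, a metadata record can be generated from a simple set of field names and values. The following is a minimal sketch in Python, assuming illustrative tag names borrowed from the example record earlier in this talk; it does not produce OLAC or IMDI formatted XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata for one resource; the tag names are illustrative.
metadata = {
    "identifier": "AS_5.2.1978_audio",
    "contributor": "Alice Spear",
    "coverage": "Mayetta, Kansas",
    "rights": "Some restrictions; contact field linguist",
}


def metadata_to_xml(fields):
    """Build a <record> element with one child element per metadata field."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return ET.tostring(record, encoding="unicode")


xml_text = metadata_to_xml(metadata)
print(xml_text)
```

Special characters in the values, such as "&", are escaped automatically when the record is serialized.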
Make your resources accessible
• Archive, archive, archive! (Not just on your own or your
departmental server. Archives are committed to the long-term
preservation and availability of your resources.)
• Before you leave to do fieldwork, or when you are writing
your grant, establish contact with the archive where you
intend to deposit your resources
• Archivists will
– give you guidelines for creating archival files
– help you select the best metadata set
– give you information about setting access levels
• When you return, the first thing to do is send your files, along
with the metadata and markup descriptions, to the archive
• Most archives will then give you an ID number for your
resources that you can then cite in your publications
A Community Responsibility
• Best practice involves what individual field linguists
do, but also how we collectively use and care for these
resources
• This broader community involves
– Other researchers like yourself who create resources
– A growing set of interconnected digital language archives
that care for, protect, and disseminate your resources
– People who develop tools and services to make your
resources locatable, searchable, and reusable
– Others: linguistics organizations, organizations like OLAC
and DELAMAN, funding agencies who promote the work
of this community
Unicode
• Debbie Anderson “A field linguist’s guide to
Unicode”
• Michael Appleby “How to use Unicode on
your computer”
Field Case Studies:
Texts and Databases
• Jessica Boynton
– “Transcription, Time-Alignment and Annotation”
• Naomi Fox
– “Using Filemaker Pro to produce archivable
language documentation”
• Connie Dickinson
– “The Tsafiki Text Factory”
Panel Session
• Talks are 25 minutes, consecutive.
• Please remember or write down your
questions!
• We will field them in a panel session after
the talks.