Data management
LingDy
February 13, 2012
TUFS, Tokyo
David Nathan
Endangered Languages Archive
Hans Rausing Endangered Languages Project
SOAS, University of London
1
Two most valuable strategies
design and use a filename system
work out your basic units of documentation
and the relationships between them
- if you get these right, it will do the “heavy
lifting” of your data management strategy
- data and metadata are intertwined, points on
a spectrum rather than different things
2
Three most important qualities
consistency
machine readability
“computer programs can act on
data in terms of its proper structures
and categories”
bad example
documentation of conventions, structures,
methods
3
Data management
understand and model the data (units,
relationships)
use appropriate data structure methods – in
both file contents and organisation
use appropriate and conventional data
encoding methods (e.g. Unicode)
be explicit and consistent
plan for flow of data, working with others,
across different systems
document steps, decisions, conventions,
structures
think ahead to archiving
4
Managing data in your computer
design a well-organised system of folders so
that you can always find your stuff according
to what it is, not:
where the software decided to put it
what the software decided to call it
when/where you last used it
what someone else called it
5
File structures and names
6
design folder structure as a logical hierarchy
that suits your goals, content and work style
have documentary materials within one
overall directory (e.g. for backup)
make directories for relevant categories,
e.g. sessions, media types, dates
design it so that you will always be able to
find things
you may need to restructure at different
points in your project, e.g. move from
date-based to session-based structures
Designing a file/folder structure
it should relate to reality
locations should make sense, so you (and
others) will know where to look for things
(where do you keep your passport; favourite
cup?)
the best location is “the place that one would
naturally look to find it”
7
3 models for file system structures
tree of descriptive folder- and file-names
one folder with descriptive filenames
one folder with numerical filenames
… what else is needed?
8
On identifiers
real world objects are uniquely identified
because they are physically unique - an
unlabelled cassette is poorly identified
digital objects have no physical existence -
they depend on identifiers that we give them
three types of identifiers:
semantic
keys
relative
9
On identifiers
semantic, e.g.
Nelson Mandela
The Sound of Music
SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav
10
On identifiers
keys (disambiguators), e.g.
1137204 (a student number)
0803 211 6148 (a telephone number),
p12893fh23.pdf (some system's reference
number)
11
On identifiers
relative, e.g.
67 High Street
the secretary
index.html
metadata.xls
12
On identifiers
your collection may have a mix of these but it
is important to be aware of their differences
and limitations, for example:
semantic identifiers: invite name clashes
keys: a program or process might depend
on the identifier to work properly
relative identifiers: if you move them, you
probably change or destroy their meaning
13
Objects and identities
a digital object’s identity includes its location
a file’s full identity = path + filename
the path is a representation of the volume
and the directory (folder) hierarchy
if the full identity is unambiguous then
everything can be fine, compare:
c:\dogs\spaniels\rover.jpg
c:\cars\british\rover.jpg
or
lectures\syntax\2013-02-12\notes.doc
14
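As a small illustration of "full identity = path + filename", the sketch below compares the two rover.jpg paths from this slide using Python's pathlib. PureWindowsPath only manipulates the path text, so it runs on any system and the files do not need to exist.

```python
from pathlib import PureWindowsPath

# The two example paths from the slide: same filename, different locations.
a = PureWindowsPath(r"c:\dogs\spaniels\rover.jpg")
b = PureWindowsPath(r"c:\cars\british\rover.jpg")

print(a.name == b.name)  # True: the filenames alone clash
print(a == b)            # False: the full identities differ
print(a.parent)          # c:\dogs\spaniels (the path part of the identity)
```

The filename on its own is ambiguous here; it is the path that makes each object identifiable.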
Objects and identities
but semantic identifiers are potentially
dangerous, because just adding more chunks
to disambiguate them will not work:
my\rover.jpg
my\white_rover.jpg
so domains that do not offer semantic
uniqueness may need identifiers which are
either keys, or relative identifiers
15
Segue to file names
(having said all that)
filenames are only filenames, and do not
necessarily provide information
common mistaken assumptions:
that a filename “dp_verbs_39.wav”
means there is an entity “dp_verbs_39”
that files are logically linked just by sharing
some part of their filenames
- these are only true if your system ensures
it (and you state it explicitly)
16
File naming
filenames that are unsystematic or non-standard
will cause problems, eventually
unsystematic file naming might be OK if
you already have many files
you have a working method that already
does everything you need to do
your “system” will do everything you need
to do in the future
17
Manage file names from the start
a new file:
don’t just accept the default filename or
location suggested by the application when
you first save the file
put it where it belongs, immediately. If
necessary, create the place (directory/path)
where it belongs
name it according to your naming system!
if you have an inventory/index of files, add
an entry for the new file
18
Filename rules
all filenames should have correct extensions
each filename should have only one ".",
before the extension
use only ASCII characters (US keyboard)
use only letters, numbers, hyphens (-) and
underscores (_)
keep filenames short, just long enough to
contain the necessary identifier - don't fill
them up with lots of information about the
content (that is metadata!)
19
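Most of the rules above are mechanical enough to check by program. The following is a minimal Python sketch, not a standard tool: the helper name filename_problems and the limit MAX_LENGTH = 40 are assumptions made for illustration (the slide only says "short"). It also cannot tell whether an extension is the correct one for the file's content.

```python
import re

# ASCII letters, digits, hyphens and underscores only.
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9_-]+")

# Hypothetical project limit; the rule itself only says "keep filenames short".
MAX_LENGTH = 40


def filename_problems(name):
    """Return a list of rule violations for one filename (empty list = OK)."""
    problems = []
    parts = name.split(".")
    # Exactly one "." means exactly two non-empty parts: root and extension.
    if len(parts) != 2 or not parts[0] or not parts[1]:
        problems.append("expected exactly one '.', placed before the extension")
    # Every non-empty part must consist of allowed characters only.
    if not all(ALLOWED_CHARS.fullmatch(part) for part in parts if part):
        problems.append("use only ASCII letters, numbers, '-' and '_'")
    if len(name) > MAX_LENGTH:
        problems.append("longer than MAX_LENGTH characters")
    return problems


for candidate in ["gr_reefs.wav", "ready.audio.wav", "ice cream.doc"]:
    print(candidate, filename_problems(candidate))
```

Applied to the names on the next slide, this flags ready.audio.wav (two dots), éclair.jpg (non-ASCII), lexicon-master (no extension) and ice cream.doc (space). It passes french-cake.jaypeg, and it cannot see that OBAMA.TXT and Obama.txt clash on case-insensitive file systems, so a check like this supplements a documented naming system rather than replacing it.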
How about these file names?
1. ready.audio.wav
2. ReAlLyOdDtOReAd.txt
3. éclair.jpg
4. éclair_fr.jpg
5. e'clair.jpg
6. french-cake.jpeg
7. french-cake.jaypeg
8. lexicon-master
9. ɘɫIɲʰ.eaf
10. ice cream.doc
11. OBAMA.TXT
12. Obama.txt
20
Make filenames sortable
make filenames usefully sortable:
20100119lecture.doc
20100203lecture.doc
gr_transcription_1.txt
gr_transcription_2.txt
gr_transcription_9.txt
gr_transcription_53.txt
better:
gr_transcription_001.txt
gr_transcription_002.txt
gr_transcription_009.txt
gr_transcription_053.txt
21
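The reason for the leading zeros is that file listings usually sort names as text, character by character. A short Python sketch of the effect, with a hypothetical helper pad_number that zero-pads the number just before the extension:

```python
import re

names = [
    "gr_transcription_1.txt",
    "gr_transcription_2.txt",
    "gr_transcription_9.txt",
    "gr_transcription_53.txt",
]

# Text sorting compares character by character, so "53" lands before "9".
print(sorted(names))


def pad_number(name, width=3):
    """Zero-pad the run of digits directly before the extension."""
    return re.sub(
        r"(\d+)(\.[^.]+)$",
        lambda m: m.group(1).zfill(width) + m.group(2),
        name,
    )


# With padding, text order and numeric order agree.
padded = sorted(pad_number(name) for name in names)
print(padded)
```

Names with no number directly before the extension, such as 20100119lecture.doc, are left unchanged by this helper.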
Associating files
you can make resources sortable together by
giving them the same filename root (the part
before the extension), or part of the root:
gr_reefs.wav
gr_reefs.eaf
gr_reefs.txt
paaka_photo001.jpg
paaka_photo002.jpg
paaka_txt_conv203.wav
paaka_txt_conv203.eaf
paaka_txt_lex.doc
document your conventions and system if
you do this
22
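If the convention is documented and applied consistently, a program can use the shared root to collect the files that belong together. A minimal Python sketch, assuming a hypothetical helper group_by_root; it treats files as associated only because the naming convention says so, which is why that convention has to be stated explicitly:

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_root(filenames):
    """Map each filename root (the part before the extension) to its files."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[PurePath(name).stem].append(name)
    return dict(groups)


files = [
    "gr_reefs.wav",
    "paaka_txt_conv203.eaf",
    "gr_reefs.txt",
    "gr_reefs.eaf",
    "paaka_txt_conv203.wav",
]

for root, members in group_by_root(files).items():
    print(root, members)
```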
Avoid metadata in filenames
avoid putting metadata into filenames. A
filename is an identifier, not a data container
better to use a simple (semantic) filename or
a key (i.e. meaningless) filename, and then
create a metadata table to contain all the
relevant information
a table can properly express all the
information, contain links etc, and is
extensible for further metadata
23
Avoid metadata in filenames
e.g. Paaka_Reefs_Dan_BH_3Oct97.wav
better:
paaka_063.wav
plus
paaka_063.txt:
language    Paakantyi
topic       Reefs at Mutawintyi
speaker     Dan Herbert
location    Broken Hill
date        1997-10-03
24
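One way to hold this information in a machine-readable table is a CSV file with one row per recording, keyed by filename. The sketch below is an assumption about layout, not the format used on the slide: the column names follow the slide's categories, and the CSV text is embedded in the script only so the example is self-contained.

```python
import csv
import io

# In practice this text would live in its own file (e.g. a hypothetical metadata.csv).
METADATA_CSV = """filename,language,topic,speaker,location,date
paaka_063.wav,Paakantyi,Reefs at Mutawintyi,Dan Herbert,Broken Hill,1997-10-03
"""


def load_metadata(text):
    """Return a dict mapping each filename to its metadata record."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["filename"]: row for row in reader}


metadata = load_metadata(METADATA_CSV)
record = metadata["paaka_063.wav"]
print(record["speaker"], record["date"])
```

Adding a further category later means adding a column to the table; the filename paaka_063.wav never has to change.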
A filenaming system
carefully design a filename system for your
data and document the system so that
somebody else can understand it
one documenter’s new system:
aaa_bb_cc_yyyy-mm-dd_nnn.wav
25
A filenaming system
aaa_bb_cc_yyyy-mm-dd_nnn.wav
aaa = village code
bb = (main) speaker code
cc = genre/event code
yyyy-mm-dd = date (why this order?)
nnn = optional number (e.g. 001)
.wav = correct extension for file content type
26
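A documented pattern like this can be checked and unpacked by program. The Python sketch below assumes that the village, speaker and genre codes are lowercase ASCII letters of exactly the stated length, which the slide does not specify; the codes in the example call (kal, mt, st) are invented for illustration.

```python
import re
from datetime import date

# aaa_bb_cc_yyyy-mm-dd_nnn.wav, with the _nnn part optional.
PATTERN = re.compile(
    r"(?P<village>[a-z]{3})_"
    r"(?P<speaker>[a-z]{2})_"
    r"(?P<genre>[a-z]{2})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:_(?P<number>\d{3}))?"
    r"\.wav"
)


def parse_filename(name):
    """Return the fields of a conforming filename, or None if it does not conform."""
    match = PATTERN.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        # Rejects impossible dates such as 2010-13-45.
        fields["date"] = date.fromisoformat(fields["date"])
    except ValueError:
        return None
    return fields


# Hypothetical codes, for illustration only.
print(parse_filename("kal_mt_st_2010-04-29_001.wav"))
print(parse_filename("kal_mt_st_29-04-2010_001.wav"))  # None: wrong date order
```

On the question of date order: with year first, then month, then day, sorting the names as plain text also sorts them chronologically, which is not true of 29-04-2010.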
Also (for Part 2)
creating an inventory/index/metadata
metadocumentation
data/file versions
transferring data
sharing data
backup
27
Documenting the filename system
describe the system
- how would you describe it?
- where would you put the description?
document the codes – this is probably part of
your metadata
28
On changing file names
decide whether it’s possible, and weigh the
benefits and side effects (e.g. loss of links
in ELAN files)
design a system first
don’t change names in situ – copy the data set
and gradually migrate it to your new system
document file name changes
if possible, automate or copy and paste
filenames
if possible, use machine processes, e.g.
system filename listings, XLS formulas
29
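The copy-then-migrate advice can be automated along the following lines. This is a hedged Python sketch, not a finished tool: the function name migrate, its parameters and the two-column log layout are all assumptions. It copies each file to a new directory under its new name, never touches the originals, and writes a CSV log of every name change.

```python
import csv
import shutil
from pathlib import Path


def migrate(source_dir, target_dir, new_names, log_path):
    """Copy files into target_dir under new names and log each change.

    new_names maps old filename -> new filename.
    The files in source_dir are left exactly as they were.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        log = csv.writer(log_file)
        log.writerow(["old_name", "new_name"])
        for old, new in sorted(new_names.items()):
            destination = target_dir / new
            # Refuse to overwrite: two old files must never map to one new name.
            if destination.exists():
                raise FileExistsError(f"{destination} already exists")
            shutil.copy2(source_dir / old, destination)
            log.writerow([old, new])
```

A call might look like migrate("old_data", "new_data", {"Paaka_Reefs_Dan_BH_3Oct97.wav": "paaka_063.wav"}, "renames.csv"), where the directory and log names are placeholders. Note that this only copies and renames files; references stored inside other files, such as media links in ELAN files, would still point to the old names and need separate attention.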