Scientific Data Management
Presented by:
Craig A. Stewart
[email protected]
University Information Technology Services
Indiana University
Copyright 2002 Craig A. Stewart and the Trustees of Indiana University
License terms
• Please cite as: Stewart, C.A. 2002. Scientific Data Management. Tutorial Presentation. Presented at Laboratory Information Management Systems Conference, 2-3 May, Philadelphia, PA. http://hdl.handle.net/2022/14001
• Some figures shown here were taken from the web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source URL. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source URL or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share (to copy, distribute and transmit the work) and to remix (to adapt the work) under the following condition: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Why a tutorial on Scientific Data
Management at the LIMS Institute
Conference?
• Requested on last year’s conference surveys
• As scientific research becomes more oriented towards
high-volume lab work, there will be increasing presence of
LIMS in scientific labs.
• As labs that already employ LIMS produce larger amounts of data, the techniques already used and understood in scientific research can be applied to the management of industrial data.
• It is becoming increasingly important to assure long-term
preservation of data of all sorts; techniques developed and
understood in the scientific data management area can
help.
The key matter to be discussed today
Once the LIMS system has assured you that all of the measurements have been made and checked, and you know where all of the samples are stored, and all of the output data has been written into an output file,
– on what storage medium/system,
– and in what logical structure,
should data be stored to assure its long-term readability and utility?
The approach
• This tutorial casts a very wide net in terms of its
subject matter.
• A large part of the challenge in this topic is simply
managing the vocabulary.
Much of the day will be spent introducing concepts and terms.
• We will cover a large span of scale – ranging from
single spreadsheets to systems holding hundreds
of TBs of data.
Goals for today
• Explain the key problems of scientific data management
• Define and outline the concepts and nomenclature
surrounding the problem
• Identify some of the key concepts, a few of the directions
in which good answers might lie, and a few of the
directions that definitely head to wrong answers
• Provide enough information and references that you can
independently investigate those matters of interest to you.
• At the end of the tutorial, you might not be in a position to start building a scientific data management system, but you should be able to investigate the options on your own.
Sources & format
• No single text covers this material in the manner discussed in this tutorial. CAS is an expert in some of the areas to be discussed today, but not all. Expect extensive footnoting and acknowledgement of other sources.
• The level of detail is intentionally uneven. Greater detail is
generally associated with one of two factors:
– A topic is sufficiently straightforward that some details
will let the participant go off and do something on
her/his own.
– A topic is especially important and the participant may
want to refer to it later. (In this case we may skim over
some details during the actual presentation).
Outline
Topic (range of application)
• The problem
• Physical storage of data: tapes, CDs, disk
• Data management strategies (single researcher to enterprise)
• Data warehouses, data federations (enterprise to national/international communities)
• Distributed file systems, external data sources, and data grids
• Visualization and collection-time data reduction as critical strategies (single researcher to enterprise)
• Archival and backup software systems (lab group to enterprise)
• Future of storage media
• Closing thoughts
• References
Bits, Bytes, and the proof that
CDs have consciousness
• A bit is the basic unit of storage, and is always either a 1 or
a 0.
• 8 bits make a byte, the smallest usual unit of storage in a
computer.
• MegaByte (MB) – 1,048,576 bytes (a CD-ROM holds ~600 MB)
• GigaByte (GB) – ~ 1 billion bytes
• TeraByte (TB) - ~ 1 trillion bytes (a large library might
have ~1 TB of data in printed material)
• PetaByte (PB) – 1 thousand TBs
• ExaByte (EB) – 1 thousand PBs
The problem of scientific data
management
Explosion of data and need to retain it
• Science historically has struggled to acquire data;
computing was largely used to simulate systems without
much underlying data
• Lots of data:
– Lots of data available “out there”
– Dramatically accelerating ability to produce new data
• One of the key challenges, and one of the key uses of computing, is now to make sense of the data that is so easily produced
• Need to preserve availability of data for ???
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
Accelerating ability to produce new data
• Diffractometer – 1 TB/year
• Synchrotron – 60 GB/day bursts
• Gene expression chip readers – 360 GB/day
• Human Genome – 3 GB/person
• High-energy physics – 1 PB per year*
*http://atlasinfo.cern.ch/Atlas/Welcome.html
Some things to think about
• 25 years ago data was stored on punched tape or punched
cards
• How would you get data off an old Apple II+ diskette? How about one of those high-density 5¼" DOS diskettes?
• The backup tape in the sock drawer (especially if it’s a
VMS backup tape of an SPSS-VMS data file)
• The no-longer-easily-handled data file on a CD (e.g. 1990
Census data)
• Data is essentially irreproducible more than a short period
of time after the fact
Have you ever tried to read one of your old data files?
Exp_2_2_feb_14_1981
30 0
30 0
0.0 139.5 000.0 …
(The slide shows a screenful of unlabeled numeric columns – values such as .0053, .5760, .02123, -20.48, 98.4571, 408.03 – with no data dictionary in sight.)
Even a small file can be
undecipherable!
1 m 1  99 1 210
2 F 2 320 2 420
3 F 2 195 2 350
4 M 1 110 1 215
5 M 2 218 2 364
6 F 3 120 1 355
7 M 3 125 1 355
And something even older…
Hwæt! We Gardena in geardagum,
þeodcyninga, þrym gefrunon,
hu ða æþelingas ellen fremedon.
Oft Scyld Scefing sceaþena þreatum…
This is from Beowulf, written 1,000 years ago. Think about
the language problem relative to the half-life of
radioactive waste!
Physical storage of data: tapes,
CDs, disk
Durability of media
• Stone: 40,000 years
• Ceramics: 8,000 years
• Papyrus: 5,000 years
• Parchment: 3,000 years
• Paper: 2,000 years
• Magnetic tape: 10 years (under ideal conditions; 3-5 more conservative)
• CD-RW: 5-10 years (under ideal conditions; 1.5 years more conservative)
• Magnetic disk: 5 years
• Even if the media survives, will the technology to read it?
Data storage: media issues
• So what do you do with data on a paper tape?
• Long term data storage inevitably forces you to confront
two issues:
– the lifespan of the media
– the lifespan of the reading device
Data storage: removable magnetic media
• The right answer to any long-term (or even intermediate-term) data storage problem is almost never diskettes. It's always a race between the lifespan of the media and the lifespan of the readers. One or the other always wins, and usually more quickly than you'd expect.
• Esoteric removable magnetic media are never a good idea. Even Zip drives are probably not a good bet in the long run. What do you do with a critical data set when your only copy is on a Bernoulli drive?
Magnetic Tapes
• Tapes store data in tracks on a magnetic medium. The
actual material on the tape can become brittle and/or worn
and fall off.
• Tapes are best used in machine room environments with
controlled humidity.
• There are three situations in which tapes are the right
choice:
– Within production machine rooms
– As backup media
– For transfer between machine rooms under some
circumstances
Tape formats
• There are several formats with small user bases;
these should probably be avoided. [This is
admittedly a conservative stance, but…].
• DAT tapes don’t last well
• For system backups of office, lab, or departmental servers, Digital Linear Tape (DLT) is the best choice
Tape formats, II
• In machine rooms, Linear Tape Open (LTO) is the
best choice.
• LTO is a multi-vendor standard
• Two variants:
– Accelis: faster, lower capacity (planned up to 25 GB/tape; 50 GB with compression)
– Ultrium: slower, higher capacity (planned up to 100 GB/tape; 200 GB with compression)
Non-magnetic removable media
• Acronym soup:
– CD – Compact Disk
– CD-ROM – CD-Read Only Memory
– CD-RW – CD-Read/Write
– DVD – Digital Versatile Disk
– DVD-RW – DVD-Read/Write
CDs and DVDs con’t
• For routine, reliable, reasonably dense storage of data
around the lab, you can’t beat CDs or DVDs.
• CD writers are commonplace & reliable
• DVD writers are newer, more costly, and more prone to
format issues.
• Always be sure to have extensive and complete
information on the CD – including everything you need to
know to remember what it really is later. There should be
no data physically on the CD that is not contained in a file
burned on the CD.
• Watch out for longevity issues!!
CD & DVD Jukeboxes
• Jukeboxes are good for what they do
• Because the basic media are standard, if you had to ditch your investment in the jukebox itself, you could
Image: 240-CD jukebox, from http://www.kubikjukebox.com/index.htm
CD & DVD Jukeboxes, con’t
Image: system holding 16 jukeboxes, each of 240 CDs, from http://www.kubikjukebox.com/index.htm
Spinning disk storage
• JBOD (Just a Bunch Of Disks) – all right so long as it's all right to lose data now and again. High-speed access; takes advantage of the relatively low cost of disk drives. Good for temporary data parking while data awaits reduction.
• RAID (Redundant Array of Independent Disks) – what you need if you don't want to lose data.
• Lifecycle replacement is an issue in both cases.
Disk: Current State of the Art
Seagate Barracuda 180
• Largest-capacity disk at present: 181.6 GB
• Internal transfer rate: 282-508 Mbits/sec
• Average seek, read/write: 7.4/8.2 msec
• Average latency: 4.17 msec
• Spindle speed: 7,200 RPM
• Power consumption: 10 watts (idle)
Disk Trends
• Capacity: doubles each year
• Transfer rate: 40% per year
• MB per $: doubles each year
13 June 2002
31
RAID*
• Level 0: Provides data striping (spreading out blocks of
each file across multiple disks) but no redundancy. This
improves performance but does not deliver fault tolerance.
• Level 1: Provides disk mirroring.
• Level 3: Same as Level 0, but also reserves one dedicated
disk for error correction data. It provides good
performance and some level of fault tolerance.
• Level 5: Provides data striping at the byte level and also stripes error correction information. This results in excellent performance and good fault tolerance.
*webopedia.com
RAID 3
"This scheme consists of an array of HDDs for data and one unit for parity. … The scheme generates XOR (exclusive-or) parity derived from bit 0 through bit 7. If any of the HDDs fail, it restores the original data by an XOR between the redundant bits on the other HDDs and the parity HDD. With RAID 3, all HDDs operate constantly."
http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
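To make the XOR mechanism concrete, here is a minimal sketch in Python (illustrative only; real RAID controllers do this in hardware, block by block):

# Sketch of RAID 3-style parity: the parity block is the XOR of the
# corresponding bytes on each data disk.
def xor_blocks(blocks):
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data_disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data_disks)  # stored on the dedicated parity disk

# If disk 1 fails, XOR the surviving disks with the parity block.
recovered = xor_blocks([data_disks[0], data_disks[2], parity])
assert recovered == data_disks[1]  # the original data is restored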
RAID 5
“RAID5 implements striping and parity. In RAID5,
the parity is dispersed and stored in all HDDs. ….
RAID5 is most commonly used in the products on
market these days.”
*http://www.studio-stuff.com/ADTX/adtxwhatisraid.html
Storage Area Network (SAN)
• A Storage Area Network (SAN) is a high-speed subnetwork of shared storage devices. A storage device is a machine that contains nothing but a disk or disks for storing data. A SAN's architecture works in a way that makes all storage devices available to all servers on a LAN or WAN.
*Webopedia.com
Network Attached Storage (NAS)
• A network-attached storage (NAS) device is a server that is dedicated to file sharing through some protocol such as NFS. NAS does not provide any of the activities that a server in a server-centric system typically provides, such as e-mail, authentication or file management. …
*modified from Webopedia.com
Storage Bricks
• Group of hard disks inside a sealed box
• Includes spare disks
• Typically RAID 5
• When one disk fails, one of the spares is put to use
• When you're out of spares…
• Sun seems to have originated this idea
Backups
• A properly administered backup system and
schedule is a must.
• How often should you back up? More frequently than the amount of elapsed time it takes you to acquire an amount of data that you can't afford to lose.
• Backup schedules – full and incremental
• RAID disk enhances reliability of storage, but it’s
not a substitute for backups
• More about backup software and such later!
Disaster recovery
• If your data is too important to lose, then it’s too important
to have in just one copy, or have all of the copies in just
one location.
• Natural disasters, human factors (e.g. fire), and theft (a significant portion of laptop thefts have data theft as their purpose) can all lead to the loss of one copy of your data. If it's your only copy… or the only location where copies are kept…
• Offsite data storage is essential
– Vaulting services
– Remote locations of your business
Data management strategies
• Flat files
• Spreadsheets and statistical software
• Relational databases
• XML
• Specialized scientific data formats
Flat Files
Data Management Strategies:
Flat files
• Nothing beats an ASCII flat file for simplicity
• ASCII files are not typically used for data storage
by commercial software because proprietary
formats can be accessed more quickly
• If you want a way to store data that you will reliably be able to retrieve later (media issues notwithstanding), an ASCII flat file is a good choice.
Data Management Strategies:
Flat files, II
• If you use an ASCII flat file for simple long-term storage, be sure that:
– The file name is self-explanatory
– There is no information embedded in the file name that
is not also embedded in the file
– Each individual data file includes a complete data
dictionary, explanation of the instrument model and
experimental conditions, and explanation of the fields
– Lay the data out in accordance with First, Second, and
Third Normal Forms as much as is possible (more on
these terms later)
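As a sketch of what such a file can look like in practice (illustrative Python; the file name, instrument, and fields are hypothetical, echoing the sample data used elsewhere in this tutorial):

# Sketch: write an ASCII flat file that carries its own data dictionary.
header = """# File: glucose_study_1981-02-14.dat
# Instrument: hypothetical Model XYZ-100 glucose analyzer
# Conditions: room temperature 22 C; samples drawn after a 12 h fast
# Data dictionary (whitespace-delimited columns):
#   id       integer   subject identifier
#   gender   char      m = male, f = female
#   glucose  integer   blood glucose level, mg/dL
"""
rows = [(1, "m", 99), (2, "f", 320), (3, "f", 195)]
with open("glucose_study_1981-02-14.dat", "w") as f:
    f.write(header)
    for subject_id, gender, glucose in rows:
        f.write("%d %s %d\n" % (subject_id, gender, glucose))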
Data dictionary
• Definition from webopedia.com:
– In database management systems, a file that defines the
basic organization of a database. A data dictionary
contains a list of all files in the database, the number of
records in each file, and the names and types of each
field. …
• More generally:
– A data dictionary is what you (or someone else) will
need to make sense of the data more than a few days
after the experiment is run
Spreadsheets and statistical
packages
Spreadsheet Software as a data
management tool
• Microsoft’s Excel may suffice for many data management
needs
• If any given data set can be described in a 2D spreadsheet
with up to hundreds of rows and columns, and if there is
relatively little need to work across data sets, then Excel
might do the trick for you
• Do beware of version issues!
Spreadsheet software as a data
management tool, con’t
• Designed originally to be electronic accountant ledgers
• Feature creep in some ways has helped those who have
moderate amounts of data to manage
• There are several options, including Open Source products
such as Gnumeric and nearly open source products such as
StarOffice
• Since MS Excel is the most commonly used spreadsheet
package, this discussion will focus on MS Excel
The MS Excel Data menu
• Sort: Ascending or descending sorts on multiple columns
• Lists: Allow you to specify a list (use only one list per spreadsheet) and then perform filters, selecting only those rows that meet certain criteria (probably more useful for mailing lists than scientific data management)
• Validation: lets you check for typos, data translation errors, etc. by searching for out-of-bounds data
• Consolidate
• Group and outline
• PivotTable
• Get external data
MS Excel Statistics
• Mean, standard deviation, confidence
intervals, etc. up to t-test are available as
standard functions within MS Excel
• One-way ANOVA and more complex
statistical routines are available in the
Statistics Add-in Pack
MS Excel Graphics
• Does certain things quite easily
• If it doesn’t do what you want it to do
easily – it probably won’t do it at all
• Constraints on the way data are laid out in
the spreadsheet are often an issue
Statistical Software as a data
management tool
• SPSS and SAS are the two leading packages
• Both have ‘spreadsheet-like’ data entry or editing
interfaces
• Both have been around a long time, and are likely to
remain around for a good while
• Workstation and mainframe versions of both available
What’s wrong with this
program?
DATA LIST FILE=sample.dat
/id 1 v1 3 (A) v2 5 v3 7-9 v4 11 v5 13-15
LIST VARIABLES v1 v2 v3
ONEWAY v3 BY v2 (1,3)
REGRESSION
/DEPENDENT=v5
/METHOD=ENTER v3
FINISH
m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
Better….
DATA LIST FILE=sample.dat
/id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION
/DEPENDENT=reactime
/METHOD=ENTER glucose
FINISH
m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
4 m 1 110 1 215
5 m 2 218 2 364
6 f 3 120 1 355
7 m 3 125 1 335
Now you have a fighting chance
DATA LIST FILE=sample.dat
/id 1 gender 3 (A) weight 5 glucose 7-9 bp 11 reactime 13-15
VARIABLE LABELS ID 'Subject ID #' GENDER 'Subject Gender'
WEIGHT 'Subject Weight in pounds' GLUCOSE 'Blood glucose level'
BP 'Blood Pressure' REACTIME 'Reaction Time in Minutes'
VALUE LABELS GENDER 'm' 'Male' 'f' 'Female'
LIST VARIABLES gender weight glucose
ONEWAY glucose BY weight (1,3)
REGRESSION
/DEPENDENT=reactime
/METHOD=ENTER glucose
FINISH
1 m 1 99 1 210
2 f 2 320 2 420
3 f 2 195 2 350
.
An example SAS program
/* Computer Anxiety in Middle School Children */
/* The following procedure specifies value labels for variables */
PROC FORMAT;
VALUE $sex 'M'='Male'
'F'='Female';
VALUE exp 1='upto 1 year' 2='2-3 yrs' 3='3+ yrs';
VALUE school 1='rural' 2='city' 3='suburban';
DATA anxiety;
INFILE clas;
INPUT ID 1-2 SEX $ 3 (EXP SCHOOL) (1.) (C1-C10) (1.)
(M1-M10) (1.) MATHSCOR 26-27 COMPSCOR 28-29;
FORMAT SEX $SEX.; FORMAT EXP EXP.; FORMAT SCHOOL SCHOOL.;
/* conditional transformation */
IF MATHSCOR=99 THEN MATHSCOR=.;
IF COMPSCOR=99 THEN COMPSCOR=.;
/* Recoding variables. Several items are to be reversed while scoring. */
/* The Likert type questionnaire had a choice range of 1-5 */
C3=6-C3; C5=6-C5; C6=6-C6; C10=6-C10;
M3=6-M3; M7=6-M7; M8=6-M8; M9=6-M9;
COMPOPI = SUM (OF C1-C10) /*FIND SUM OF 10 ITEMS USING SUM FUNCTION */;
MATHATTI = M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /*ADDING ITEM BY ITEM */;
/* Labeling variables */
LABEL ID='STUDENT IDENTIFICATION' SEX='STUDENT GENDER'
EXP='YRS OF COMP EXPERIENCE' SCHOOL='SCHOOL REPRESENTING'
MATHSCOR='SCORE IN MATHEMATICS' COMPSCOR='SCORE IN COMPUTER SCIENCE'
COMPOPI='TOTAL FOR COMP SURVEY' MATHATTI='TOTAL FOR MATH ATTI SCALE';
SAS example, Part 2
/* Printing data set by choosing specific variables */
PROC PRINT;
VAR ID EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI;
TITLE 'LISTING OF THE VARIABLES';
/* Creating frequency tables */
PROC FREQ DATA=ANXIETY;
TABLES SEX EXP SCHOOL;
TABLES (EXP SCHOOL)*SEX;
TITLE 'FREQUENCY COUNT';
/* Getting means */
PROC MEANS DATA=ANXIETY;
VAR COMPOPI MATHATTI MATHSCOR COMPSCOR;
TITLE 'DESCRIPTIVE STATISTICS FOR CONTINUOUS VARIABLES';
RUN;
/* Please refer to the following URL for further information */
/* http://www.indiana.edu/~statmath/stat/sas/unix/index.html */
An example SPSS program
TITLE 'COMPUTER ANXIETY IN MIDDLE SCHOOL CHILDREN'
DATA LIST FILE=clas.dat
/ID 1-2 SEX 3 (A) EXP 4 SCHOOL 5 C1 TO C10 6-15 M1 TO M10 16-25
MATHSCOR 26-27 COMPSCOR 28-29
MISSING VALUES MATHSCOR COMPSCOR (99)
RECODE C3 C5 C6 C10 M3 M7 M8 M9 (1=5) (2=4) (3=3) (4=2) (5=1)
RECODE SEX ('M'=1) ('F'=2) INTO NSEX /* Changing char var into numeric
var
COMPUTE COMPOPI=SUM (C1 TO C10) /*Find sum of 10 items using SUM
function
COMPUTE MATHATTI=M1+M2+M3+M4+M5+M6+M7+M8+M9+M10 /* Adding each item
VARIABLE LABELS ID 'STUDENT IDENTIFICATION' SEX 'STUDENT GENDER'
EXP 'YRS OF COMP EXPERIENCE' SCHOOL 'SCHOOL REPRESENTING'
MATHSCOR 'SCORE IN MATHEMATICS' COMPSCOR 'SCORE IN COMPUTER SCIENCE'
COMPOPI 'TOTAL FOR COMP SURVEY' MATHATTI 'TOTAL FOR MATH ATTI SCALE'
SPSS Example, Part 2
/*Adding labels
VALUE LABELS SEX 'M' 'MALE' 'F' 'FEMALE'/
EXP 1 'UPTO 1 YR' 2 '2 YEARS' 3 '3 OR MORE'/
SCHOOL 1 'RURAL' 2 'CITY' 3 'SUBURBAN'/
C1 TO C10 1 'STRONGLY DISAGREE' 2 'DISAGREE'
3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
M1 TO M10 1 'STRONGLY DISAGREE' 2 'DISAGREE'
3 'UNDECIDED' 4 'AGREE' 5 'STRONGLY AGREE'/
NSEX 1 'MALE' 2 'FEMALE'/
PRINT FORMATS COMPOPI MATHATTI (F2.0) /*Specifying the print format
comment Listing variables.
* listing variables.
LIST VARIABLES=SEX EXP SCHOOL MATHSCOR COMPSCOR COMPOPI MATHATTI/
FORMAT=NUMBERED /CASES=10 /* Only the first 10 cases
FREQUENCIES VARIABLES=SEX,EXP,SCHOOL/ /* Creating frequency tables
STATISTICS=ALL
USE ALL.
ANOVA COMPSCOR by EXP(1,3).
FINISH
comment Please refer to the following URL for further information
http://www.indiana.edu/~statmath/stat/spss/unix/index.html.
Keys to using Statistical Software
as a data management tool
• Be sure to make your programs and files self-defining. Use variable labels and data labels exhaustively.
• Write out ASCII versions of your program files and data sets.
• Stat packages generally are able to produce
platform-independent ‘transport’ files. Good for
transport, but be wary of them as a long-term
archival format
Keys to using Statistical Software
as a data management tool, 2
• Statistical software is excellent when your data
can be described well without having to use
relational database techniques. If you can describe
the data items as a very long vector of numbers,
you’re set!
• Statistical software is especially useful when many
transformations or calculations are required
• But beware transforms, calculations, and creation of new
variables interactively!
Perl and C
• Practical Extraction and Report Language
• Pathologically Eclectic Rubbish Lister
• It's a bit of both
• Perl is a good way to manipulate small amounts of data in a prototype setting, but performance in a production setting will probably seem inadequate
• Use Perl to prototype, but rewrite the final application in C or C++ for production
Relational Databases
Database Definitions*
• Database management system: A collection of programs
that enables you to store, modify, and extract information
from a database.
• Types of DBMSs: relational, network, flat, and
hierarchical.
• If you need a DBMS, you need a relational DBMS
• Query: a request to extract data from a database, e.g.:
– SELECT * FROM persons WHERE name = 'Smith' AND age > 35
• SQL (structured query language) – the standard query
language
*modified from webopedia.com
Relational Databases*
• Relational Database theory developed at IBM by E.F.
Codd (1969)
• Codd's Twelve Rules – the key to relational databases but
also good guides to data management generally.
• Codd’s work is available in several venues, most
extensively as a book. The number of rules has now
expanded to over 300, but we will start with rules 1-12 and
the 0th rule.
• 0th rule: A relational database management system
(DBMS) must manage its stored data using only its
relational capabilities.
• *Based on Tore Bostrup. www.fifteenseconds.com
Codd’s 12 rules
1. Information Rule. All information in the database should be represented in one and only one way – as values in a table.
2. Guaranteed Access Rule. Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.
3. Systematic Treatment of Null Values. Null values (distinct from the empty character string or a string of blank characters, and distinct from zero or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way, independent of data type.
Codd’s 12 rules, con’t
4. Dynamic Online Catalog Based on the Relational Model. The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
Codd’s 12 rules, con’t
5. Comprehensive Data Sublanguage Rule. A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose support of all of the following is comprehensive:
a. data definition
b. view definition
c. data manipulation (interactive and by program)
d. integrity constraints
e. authorization
f. transaction boundaries (begin, commit, and rollback).
Codd’s 12 rules, con’t
6. View Updating Rule. All views that are theoretically
updateable are also updateable by the system.
7. High-Level Insert, Update, and Delete. The capability of
handling a base relation or a derived relation as a single
operand applies not only to the retrieval of data, but also to
the insertion, update, and deletion of data.
8. Physical Data Independence. Application programs and
terminal activities remain logically unimpaired whenever
any changes are made in either storage representation or
access methods.
Codd’s 12 rules, con’t
9. Logical Data Independence. Application programs
and terminal activities remain logically
unimpaired when information preserving changes
of any kind that theoretically permit unimpairment
are made to the base tables.
10. Integrity Independence. Integrity constraints
specific to a particular relational database must be
definable in the relational data sublanguage and
storable in the catalog, not in the application
programs.
Codd’s 12 rules, con’t
11. Distribution Independence. The data manipulation
sublanguage of a relational DBMS must enable application
programs and terminal activities to remain logically
unimpaired whether and whenever data are physically
centralized or distributed.
12. Nonsubversion Rule. If a relational system has or
supports a low-level (single-record-at-a-time) language,
that low-level language cannot be used to subvert or
bypass the integrity rules or constraints expressed in the
higher-level (multiple-records-at-a-time) relational
language.
The problem with (some) DBMS
computer science
• Database theory is wonderful stuff
• It is sometimes possible to get so caught up
in the theory of how you would do
something that the practical matters of
actually doing it go by the wayside
• This is particularly true of the concept of
“normal forms” – only three of which we
will cover
Some terminology
Formal Name | Common Name | Also known as
Relation    | Table       | Entity
Tuple       | Row         | Record
Attribute   | Column      | Field

A key is a field that *could* serve as a unique identifier of records. The primary key is the one field chosen to be the unique identifier of records.
First Normal Form
• Reduce entities to first normal form (1NF)
by removing repeating or multivalued
attributes to another, child entity.
Before (repeating attributes):
Specimen # | Measurement #1 | Measurement #2 | Measurement #3
14         | 35             | 43             | 38

After (1NF – measurements moved to a child entity):
Specimen # | Measurement # | Value
14         | 1             | 35
14         | 2             | 43
14         | 3             | 38

Specimens:
Specimen #
14
Second Normal Form
• Reduce first normal form entities to second normal form
(2NF) by removing attributes that are not dependent on the
whole primary key.
Before (Species depends on Specimen # alone, not the whole key):
Specimen # | Measurement # | Species       | Value
14         | 1             | M. musculus   | 35
14         | 2             | M. musculus   | 43
16         | 3             | R. norvegicus | 38

After (2NF):
Specimen # | Measurement # | Value
14         | 1             | 35
14         | 2             | 43
16         | 3             | 38

Specimens:
Specimen # | Species
14         | M. musculus
16         | R. norvegicus
Third Normal form
• Reduce second normal form entities to third normal form (3NF) by removing attributes that depend on other, non-key attributes (other than alternative keys).
• It may at times be beneficial to stop at 2NF for performance reasons!

Before (O2 consumption per gram depends on two non-key attributes):
Specimen # | Measurement # | O2 consumption | Mass | O2 consumption per gram
14         | 1             | 35             | 14   | 2.50
14         | 2             | 43             | 15   | 2.87
16         | 3             | 85             | 28   | 3.04

After (3NF – the derived value is computed rather than stored):
Specimen # | Measurement # | O2 consumption | Mass
14         | 1             | 35             | 14
14         | 2             | 43             | 15
16         | 3             | 85             | 28
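The same design can be written directly in SQL. Here is a minimal sketch (using Python's sqlite3 module purely for illustration; the tutorial does not prescribe a product, and the table and column names simply follow the specimen example above):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The child entity from the 1NF/2NF steps: one row per measurement.
cur.execute("CREATE TABLE specimens (specimen_id INTEGER PRIMARY KEY, "
            "species TEXT)")
cur.execute("CREATE TABLE measurements ("
            "specimen_id INTEGER REFERENCES specimens(specimen_id), "
            "measurement_no INTEGER, o2_consumption REAL, mass REAL, "
            "PRIMARY KEY (specimen_id, measurement_no))")

cur.execute("INSERT INTO specimens VALUES (14, 'M. musculus')")
cur.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)",
                [(14, 1, 35, 14), (14, 2, 43, 15)])

# 3NF: O2 consumption per gram is derived, so compute it at query time.
cur.execute("SELECT s.species, m.measurement_no, "
            "m.o2_consumption / m.mass "
            "FROM measurements m "
            "JOIN specimens s ON s.specimen_id = m.specimen_id")
print(cur.fetchall())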
On to database products
• Microsoft Access – Common, relatively
inexpensive, moderately scalable
• Oracle – Common, relatively more expensive,
extremely robust and scalable
• DB2 – Relatively common, IBM’s commercial
database application
• MySQL – Becoming more common, free, good
for prototyping and small-scale applications
MySQL
• Open source database software
• Available for several operating systems
• Downloadable from www.mysql.com
• Excellent for prototyping database applications, and in many cases plenty for production
Components of MySQL (exemplary of database products generally)
• mysql – executes SQL commands
• mysqlaccess – manages users
• mysqladmin – database administration
• mysqld – the MySQL server process
• mysqldump – dumps the definition and contents of a database into a file
• mysqlhotcopy – hot backup of a database
• mysqlimport – imports data from other formats
• mysqlshow – shows information about the server and its objects
• mysqld_safe – starts and manages mysqld on Unix
Database applications and the
web?
• An Open Source option
– MySQL - database
– PHP - web scripting application
– Apache - web server
• Oracle and its web modules
• Stat package and web modules
Specialized Data formats
• XML
• HDF
XML
• The Extensible Markup Language (XML) is
the universal format for structured
documents and data on the Web.
• http://www.w3.org/XML/
A few of "XML in 10 points"*
1. XML is for structuring data. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous.
2. XML looks a bit like HTML. Like HTML, XML makes use of tags (words bracketed by '<' and '>') and attributes (of the form name="value").
3. XML is text, but isn't meant to be read.
4. XML is verbose by design. (And it's *really* verbose.)
5. XML is a family of technologies. (This leads to the opportunity to create discipline-specific XML templates.)
*http://www.w3.org/XML/1999/XML-in-10-points
XML
• XML really is one of the most important data
presentation technologies to be developed in
recent years
• XML is a meta-markup language
• The development and use of DTDs (document type definitions) is time consuming, critical, and subject to the usual laws regarding standards
• XML is a way to present data, but not a good way
to organize lots of data
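As a small illustration of keeping data as data (a sketch using Python's xml.etree module, one XML toolkit among many; the element and attribute names are invented):

import xml.etree.ElementTree as ET

# Build a tiny, self-describing document (hypothetical tag names).
root = ET.Element("experiment", date="1981-02-14")
specimen = ET.SubElement(root, "specimen", id="14", species="M. musculus")
measurement = ET.SubElement(specimen, "measurement", number="1")
measurement.text = "35"
ET.ElementTree(root).write("experiment.xml")

# Because the structure is explicit, any XML parser can recover the data.
for m in ET.parse("experiment.xml").getroot().iter("measurement"):
    print(m.get("number"), m.text)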
Some XML examples
• Chemical Markup Language
http://www.xml-cml.org/
• Extensible Data Format http://xml.gsfc.nasa.gov/XDF/XDF_home.html
• BioXML – no longer active
XML issues
• Great technology
• Good commercial authoring systems
available or in development
• The problem with standards….
• Perhaps the biggest challenge in XML is the
fact that it is so easy to put together a web
site and propose a DTD as a standard
XML vs PDF
• PDF files are essentially universally readable.
PDF file formats give you a picture of what was
once data in a fashion that makes retrieval of the
data hard at best.
• XML requires a bit more in terms of software, but preserves the data as data that others can interact with.
• Utility of XML and PDF interacts with proprietary
concerns, institutional concerns, and community
concerns – which are not always in harmony!
Specialized data storage formats – HDF
• Hierarchical Data Format (HDF)
• HDF is an open-source effort
• http://hdf.ncsa.uiuc.edu/
• HDF5 is a general purpose library and file format for storing scientific data.
HDF, con’t
• HDF5 can store two primary objects: datasets and groups.
A dataset is essentially a multidimensional array of data
elements, and a group is a structure for organizing objects
in an HDF5 file.
• Using these two basic objects, one can create and store
almost any kind of scientific data structure.
• Designed to address the data management needs of
scientists and engineers working in high performance, data
intensive computing environments.
• HDF5 emphasizes storage and I/O efficiency.
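To give a feel for these two objects, here is a minimal sketch using the h5py Python bindings (one interface among several; HDF5 also has C, C++, Fortran, and Java APIs, and the file and group names below are invented):

import h5py
import numpy as np

with h5py.File("run42.h5", "w") as f:
    grp = f.create_group("detector_a")        # a group organizes objects
    grp.create_dataset("counts",              # a dataset is an array
                       data=np.arange(12).reshape(3, 4))
    grp.attrs["units"] = "photons/sec"        # metadata travels with the data

with h5py.File("run42.h5", "r") as f:
    print(f["detector_a/counts"][:, 0])       # read only the needed slice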
HDF, con’t
• HDF is nontrivial to implement
• If you need the full capabilities of HDF,
there’s nothing like it
• There is a bit of history of questions about
performance, but HDF5 is designed to
resolve these questions
Free Software Foundation
• Many of the software products mentioned in this
talk (XML, Perl, etc.) are Open Source Software
• The GNU general public license is the standard
license for such software
• Some of the best software for specific scientific
communities is open source (community software)
• There are certain expectations about such software
and how it is used
Data exchange among
heterogeneous formats
• I have data files in SAS, SPSS, Excel, and Access
formats. What do I do?
• Each of the more widely used stat packages contains significant utilities for exchanging data. Stata makes a package called Stat Transfer.
• DBMS/Copy (Conceptual Software) is probably the best software for exchange among heterogeneous formats.
Distributed Data
• Data warehouses
• Data federations
• Distributed file systems
• External data sources
• Data grids
Data warehouses
• In a large organization one might want to ask research questions of transactional data. And what will the MIS folks say about this?
• Transactions have to happen now; the analysis does not necessarily have to.
• Data warehousing is the coordinated, architected, and periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytic and informational processing. (Definition from "Data Warehousing for Dummies" by Alan R. Simon.)
Getting something out of the data
warehouse
• Querying and reporting: tell me what’s what
• OLAP (On-Line Analytical Processing): do some
analysis and tell me what’s up, and maybe test
some hypotheses
• Data mining: Atheoretic. Give me some obscure
information about the underlying structure of the
data
• EIS (Executive Information Systems): boil it
down real simple for me
More Buzzwords
• Data Mart: Like a data warehouse, but
perhaps more focused. [Term often used by
the team after the Data Warehouse fiasco]
• Operational Data Store: Like a data
warehouse, but the data are always current
(or almost). [Day traders]
Distributed File Systems
• DCE/DFS – DFS seems to have a
questionable future
• AFS – Andrew File System – Widely used
among physicists
AFS
• AFS is a distributed filesystem product, pioneered at Carnegie Mellon University and supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for file sharing, providing location independence, scalability and transparent migration capabilities for data.
*http://www.openafs.org/main.html
AFS Structure
• AFS operates on the basis of "cells"
• Each cell depends upon a cell server that creates the root level directory for that cell
• Other network-attached devices can attach themselves into the AFS cell directory structure
• Moving data from one place to another then becomes just like a file operation, except that it is mediated by the network
• Requires installation of client software (available for most Unix flavors and Windows)
Computing Grids
• What's a grid? Hottest current buzzword
• A way to link together disparate, geographically distributed computing resources to create a meta-computing facility
• The term 'computing grid' was coined in analogy to the electrical power grid
• Three types of grids:
– Compute
– Collaborative
– Data
Compute Grids
• Compute grids tie together disparate computing
facilities to create a metacomputer.
• Supercomputers: Globus is an experimental
system that historically focuses on tying together
supercomputers
• PCs:
– Entropia is a commercial product that aims to tie
together multiple PCs
– SETI@Home
Collaboration Grids
• http://www-fp.mcs.anl.gov/fl/accessgrid/
Data Grids
• Globus – beginning to integrate data grid
functionality
• Avaki – commercial data grid product
• Data Grids “virtualize” data locality
Layered Grid Architecture (by analogy to Internet architecture)
• Application
• Collective – "Coordinating multiple resources": ubiquitous infrastructure services, app-specific distributed services
• Resource – "Sharing single resources": negotiating access, controlling use
• Connectivity – "Talking to things": communication (Internet protocols) & security (maps to the Internet's Transport and Internet layers)
• Fabric – "Controlling things locally": access to, & control of, resources (maps to the Internet's Link layer)
http://www.globus.org/about/events/US_tutorial/slides/index.html
Example: Data Grid Architecture
• App – discipline-specific data grid application
• Collective (App) – coherency control, replica selection, task management, virtual data catalog, virtual data code catalog, …
• Collective (Generic) – replica catalog, replica management, co-allocation, certificate authorities, metadata catalogs, …
• Resource – access to data, access to computers, access to network performance data, …
• Connect – communication, service discovery (DNS), authentication, authorization, delegation
• Fabric – storage systems, clusters, networks, network caches, …
http://www.globus.org/about/events/US_tutorial/slides/index.html
Example Data Grids
• GriPhyN (Grid Physics Network) – the key problem: too much data (PB per year)
• Biomedical data
– Stanford Genome Gateway Browser mirrors
– Human Genome Database mirrors
– Other examples…
Federated databases
• A federation of databases is a group of databases
that are tied together in some reasonable way
permitting data retrieval (generally) and
sometimes (maybe in the future) data writing
• Benefits of federated approach:
– Local access control. Lets data owner control access
– Acknowledges multiple sources of data
– By focusing on the edges of contact, should be more
flexible over the long run
• Shortcomings: Right now, significant hand work
in constructing such systems
DiscoveryLink
Web-accessible databases
• Especially prominent in the biomedical sciences. E.g. NCBI:
• Entrez http://www.ncbi.nlm.nih.gov/entrez/
• PubMed http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed – provides access to over 11 million MEDLINE citations
• Nucleotide http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide – a collection of sequences from several sources, including GenBank, RefSeq, and PDB
• Protein http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
• Genome http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome – the whole genomes of over 800 organisms
Real-time data reduction as a
critical strategy
• Data: bits and bytes
• Information: that which reduces uncertainty (Claude
Shannon). Literally that which forms within, but more
adequately: the equivalent of or the capacity of something
to perform organizational work, the difference between
two forms of organization or between two states of
uncertainty before and after a message has been received,
but also the degree to which one variable of a system
depends on or is constrained by (see constraint) another. *
• In other words, if there is no realistic circumstance in which you would take an action based on or influenced by a certain number, then that number is data, not information
• We collect a lot more data than we do information
*http://pespmc1.vub.ac.be/ASC/INFORMATION.html
Real-time data reduction
• Given that we collect much more data than
information, what do we do?
• If we can identify something as reliably just data,
and definitely not possibly information, why keep
it?
• In some cases of instruments that produce data
continually, a PC dedicated to on-the-fly data
reduction can drastically reduce data storage
requirements
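A sketch of the idea in Python (the "instrument" here is simulated): rather than writing every raw reading to disk, keep only the running summaries that could actually inform an action:

import random

def instrument_stream(n):
    # Simulated instrument producing a continual stream of readings.
    for _ in range(n):
        yield random.gauss(98.45, 0.01)

# Running reduction: four numbers are stored instead of a million readings.
count, total = 0, 0.0
lo, hi = float("inf"), float("-inf")
for reading in instrument_stream(1000000):
    count += 1
    total += reading
    lo, hi = min(lo, reading), max(hi, reading)

print("n=%d mean=%.5f min=%.5f max=%.5f" % (count, total / count, lo, hi))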
Knowledge management,
searchers, and controlled
vocabularies
• A tremendous amount of effort has gone into natural language processing, AI, knowledge discovery, etc., with results ranging from mixed to disappointing.
• If you want to be able to search large volumes of
data on an ad-hoc basis, then controlled
vocabularies are essential. Results here are mixed
as well, but at least the problems are sociological,
not technological.
• Good example: Gene Ontology Consortium,
http://www.geneontology.org/
Data Visualization
Visualization
• The days when you could take a stack of greenbar
down to your favorite bar, page through the
output, and understand your data are gone.
• Data visualization is becoming the only means by
which we can have any hope of understanding the
data we are producing
• A single gene expression chip can produce more pixels of data than the human eye & mind together are capable of processing
Gene expression chips*
*http://www.microarrays.org/
http://www.research.ibm.com/dx/imageGallery/
http://www.research.ibm.com/dx/imageGallery/
http://www.research.ibm.com/dx/imageGallery/image212.html
Visualization Options
• 2D – commercial software and open source
• 2D Open source: IBM’s Data Explorer
http://www.research.ibm.com/dx/
• 3D –CAVE or Immersadesk
CAVE™
• Cave Automatic
Virtual Environment
• Anything *but*
automatic
• Best immersive 3D
technology available
Image created by Eric Wernert of
Indiana University
Immersadesk™
• Furniture-scale 3-D
environment
• Easier to program than
CAVE
• Immersive 3D feel not
as good as CAVE, but
one can install an
Immersadesk™ or
similar equipment
within a lab!
Image created by Eric Wernert of Indiana University
Hierarchical Storage Management Systems
• Differential cost of media:
– RAM: $60-$100/MB
– RAID: $4-$10/MB
– CD: ~$1 (readers included)
– Tape: $0.05-$1
• Differential read rates and access times:
– Disk: 1 GB/sec; 9-20 ms access time
– Tape: 200 MB/sec; <1 min (autoloader)
HSM
• The objective of an HSM is to optimize the
distribution of data between disk and tape
so as to store extremely large amounts of
data at reasonably economical costs while
keeping track of everything
HSM basic concepts
• Most data is read rarely. Tape is cheap. Keep rarely read data on tape.
• Keep data that is often used on disk.
• Stage data to disk on command for faster access when you know you're going to need it later.
• Stage output data to disk.
• Manage data on tape so as to handle security and reliability.
• A metadata system keeps track of what everything is and where it is!
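The disk/tape policy layer can be imagined as something like the following sketch (greatly simplified, with a hypothetical 90-day threshold; a real HSM also drives the tape hardware and maintains the metadata catalog):

import os
import time

MIGRATE_AFTER = 90 * 86400  # hypothetical: migrate after 90 days unused

def partition_by_age(directory):
    """Split files into disk-resident vs. candidates for migration to tape."""
    keep, migrate = [], []
    now = time.time()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        age = now - os.path.getatime(path)  # seconds since last access
        (migrate if age > MIGRATE_AFTER else keep).append(path)
    return keep, migrate

keep, migrate = partition_by_age("/data/archive")
print("%d files stay on disk, %d go to tape" % (len(keep), len(migrate)))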
HSM products
• EMASS Inc. - AMASS (Archival Management
and Storage System). http://www.emass.com
• Veritas – www.veritas.com
• LSF – Sun Microsystems, Inc.
• HPSS (High Performance Storage System) – a consortium-led product designed originally for weapons labs and now marketed by IBM
HPSS – High Performance
Storage System
• Controlled by a consortium, but produced and released as a
service from IBM (as opposed to a product)
• Designed to meet the needs of some of the most
demanding and security-conscious customers in the world
• Customers include:
– Lawrence Berkeley Laboratories
– Los Alamos National Laboratories
– Sandia National Laboratories
– San Diego Supercomputer Center
– Indiana University
Requirements for HPSS
• Absolute reliability of data in all forms
(reliably read whenever authorized person
wants, and reliably not available to anyone
unauthorized)
• High capacity
• Speed
• Fault detection/Correction
HPSS Components
• Name Server (NS) – translates standard file names and
paths into HPSS object identifier
• Bitfile Server (BFS) – provides logical bitfiles to clients
• Storage Server (SS) – manages relationship between
logical files and physical files
• Physical Volume Library (PVL) – maps logical volumes to
physical cartridges. Issues commands to PVR
• Physical Volume Repository – mounts and dismounts
cartridges
• Mover (MVR) – transfers data from a source to a sink
Backup
• Backup systems and HSMs are fundamentally
different!
• Backup systems are designed for operational
continuity of computing systems, not for archival
storage, and vice versa
• Efforts to mix the two technologies tend not to
work well (e.g. restoring onto bare metal from an
HSM)
Some Backup Systems
• Omnibak (HP)
• Legato (www.legato.com)
• BrightStor ARCserve (Computer Associates – www.ca.com)
• Tivoli (IBM)
Backup schedules
• Good backup schedules are essential!
• Example backup schedule:
– Full backup every 6 months
– Incremental since full every month
– Incremental since monthly every week
– Incremental since weekly every day
• Offsite copies of fulls are a good idea…
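The file-selection step of an incremental backup boils down to something like this sketch (illustrative only; real backup products add cataloging, scheduling, and media handling):

import os
import time

def changed_since(root, last_backup_time):
    """Walk a directory tree, collecting files modified since the last backup."""
    selected = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_time:
                selected.append(path)
    return selected

# Incremental-since-weekly: anything modified in the last 7 days.
week_ago = time.time() - 7 * 86400
for path in changed_since("/home/lab", week_ago):
    print(path)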
The future of storage
• "In-place" increases in density
• New technologies:
– WORM optical storage & holographics
– Millipedes
– Non-corrosive metal
Holographic storage
• Based on 3-D rather than
2-D data storage
• Constantly going to revolutionize storage RSN (Real Soon Now)
• Significant problems with
media stability
• WORM (Write Once Read
Many) technologies may
someday deliver
Image © IBM may not be reused
without permission
Millipede Storage
• Based on atomic force microscopy (AFM): tiny depressions melted by an AFM tip into a polymer medium represent stored data bits that can then be read by the same tip.
• Thermomechanical storage is capable of achieving data densities in the hundreds of Gb/in² range
• Current best – 20 to 100 Gb/in²
• Expected limits for magnetic recording: 60-70 Gb/in²
Image © IBM, may not be reused without permission
*http://www.zurich.ibm.com/st/storage/millipede.html
Millipede Storage, Part 2
• Read/Write rate of
individual probe is
limited
• The Read/Write head
consists of ~1,000
individual probes that
read in parallel
Image © IBM may not be reused
without permission
*http://www.zurich.ibm.com/st/storage/millipede.html
Storage of text on nonreactive
metal disks
• All of the commonly used storage media
depend upon arbitrary standards and are
fragile
• If you have data that you really want to
keep secure for a long time, why not write it
as text on non-corrosive metal disks?
Future of computing
• The PC market will continue to be driven
largely by home uses (esp games)
• In scientific data management, the utility of
computing systems will be less determined
by chip speeds and more by memory and
disk configurations, and internal and
external bandwidth
And the future is uncertain!
• If you can see what your storage requirements are 25 years into the future, and they are large scale and significant, then a tremendous investment based on what's available today may be reasonable.
• In any other case, it may be best to take shorter
views – 5 to perhaps 10 years, and build into your
thinking the constant need to refresh
The ongoing challenge
• One of the key problems in data storage is that you
can’t just store it. Data stored and left alone is
unlikely under most circumstances to be readable
– and less likely to be comprehensible and useable
– in 20 years. The problem, of course, is that there
is an ever increasing need for tremendous
longevity in the utility of data. Because of this it is
essential that data receive ongoing curation, and
migration from older media and devices to newer
media and devices. Only in this way can data
remain useful year after year.
References
• Simon, A.R. 1997. Data Warehousing for Dummies. IDG Books, Foster City, CA.