Data services in Canada: who, where, when, and why

A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach

Presentation to: International Symposium on XML for the Long Haul (Balisage 2010 pre-conference)
By Laine G.M. Ruus, Librarian emeritus, University of Toronto
2010-08-02
http://www.chass.utoronto.ca/~laine/misc/balisage2010.ppt

Overview:

• What are data (that is, quantitative social science data)?
• History of social science quantitative data and metadata
• Lessons learned

What are data?

Data are…

• Representations of selected characteristics of a population of entities, e.g. individuals, companies, periods of time, etc.
• Characteristics are grouped, and variations of a characteristic are (normally) assigned numeric values
• Assigning numeric values to variations of a characteristic allows their manipulation by mathematical/statistical procedures (for example, coding gender as 1 = male and 2 = female makes it possible to tabulate and test it)

[Diagram: the hierarchy from data, through information (statistics) and knowledge, to wisdom]

Data and statistics are not the same

• Statistics are of two kinds:

– Descriptive statistics: summaries of common characteristics of the raw data units (one-way tables, two-way tables … multi-way tables)

– Inferential statistics: measures of the strength and direction of relationships among characteristics of the raw data units

For example, the percentage of respondents in each province is a descriptive statistic; a chi-square test of the relationship between gender and province is an inferential statistic.

Of course, statistics (descriptive or inferential) can become data in their turn, and be used in other statistical procedures.

Data and statistics are not the same (cont’d)

• That is, data are:
– the raw materials from which statistics are generated
– ideally, available at the level at which the data were originally collected (= microdata)
– in need of manipulation with statistical software in order to be comprehensible

[Diagram: raw data and their metadata – the record layout, the variable description (aka data dictionary) for variables such as province and gender, and the syntax file for SPSS that encodes them]

Metadata are…

• Instructions to explain the content and coding of a data set (whether numeric, alphabetic, or other), and to aid in its correct interpretation
• Can be intended for human or computer consumption, but are ideally both

Raw data plus a syntax file, processed through a statistical software package, yield a system file, whose average shelf life is less than 10 years.
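As a minimal sketch of what these layers contain (the variable names, column positions, and codes below are invented for illustration, not taken from any actual study):

Raw data record:   35 2
Record layout:     columns 1-2 = PROVINCE; column 4 = GENDER
Data dictionary:   PROVINCE  Province of residence (10 = Newfoundland … 59 = British Columbia)
                   GENDER    Gender of respondent (1 = male, 2 = female, 9 = not stated)

A syntax file restates the same layout and coding in the command language of a particular package such as SPSS; running it against the raw data is what produces the proprietary system file.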

The beginnings

• Hollerith cards were first used to process the 1890 US census of population
• By the 1930s, public opinion polling was being used to, for example, predict electoral outcomes
– the 1936 Literary Digest poll famously (and wrongly) predicted the defeat of Roosevelt in the US presidential election
• Data-gathering make-work projects in the 1930s in the US, such as economic censuses and surveys on unemployment, crop production, etc.

By the 1940s

• Polling and survey-taking matured
• Beginnings of improved sampling methods, such as Gallup’s quota samples
• The 1948 polls chose Dewey over Truman in the US presidential election, leading to the formation of a committee to determine the source of the error
• The Roper Center, the first data archive, was created (1946)
• Data were stored on punched cards, and analyzed using card sorters and similar equipment
• And metadata usually looked like this…

The metadata for the May 1945 Canadian Gallup Poll…

The 1950s…

• UNIVAC I, the first alphanumeric computer
• UNIVAC I correctly predicted the Eisenhower sweep in the 1952 US presidential election
• MIT began working on keyboard entry
• Development of the COBOL compiler and Fortran
• Magnetic tapes, at 200 bpi, could store the contents of 70,000 punched cards, i.e. about 5.6 megabytes of data
• Lucci & Rokkan promoted the idea of data management by libraries

But the metadata for the August 1958 Canadian Gallup poll still looked like this…

1960s…

• Development of BASIC, the Unix operating system, and ASCII, which allowed interchange of data among different computers
• Statistical software packages: DATA-TEXT, SPSS, P-STAT, BIOMED, NUCROS, SAS
• Magnetic tapes moved from 556 to 800 bpi
• Most social scientists were still writing their own local software, or using card sorters and calculators to produce cross-tabulations and compute chi-squares

1970s: a watershed decade…

• Microprocessors, and 8” and later 5-1/4” diskettes
• The Wang word processor, Ataris, the Apple I and the Commodore PET
• dBASE, VisiCalc and WordStar
• ARPANET, and the expansion of time-sharing and online systems
• Online bibliographic services such as Dialog, BRS, and Orbit

1970s (cont’d)

• David Nasatir wrote the first manual on data management, under the aegis of UNESCO (1972)
• Mid-decade saw the creation of IASSIST, and the first training at ICPSR for data librarians
• The 1970 US census of population was partly disseminated on computer tapes instead of print, forcing libraries to consider this new medium

1970s (cont’d)

• The OSIRIS software, developed at the University of Michigan, included statistical capabilities as well as outstanding data and metadata management
• NSF funded the National Conference on Cataloging and Information Services for Machine-Readable Data Files at Airlie House in Virginia
• The US Department of Justice funded the project which resulted in Roistacher’s Style manual for machine-readable data files – bibliographic identity, methodology, and data dictionary

An OSIRIS codebook generally followed the Roistacher recommendations. The record layout and data dictionary portion looked like this:

1980s

• Supercomputers and NSFNET changed the face of large-scale computing, and PCs and Macs did the same for small-scale computing
• BITNET, followed by the Internet, provided e-mail, listservs and remote login
• Tape cartridges held the equivalent of 8 million cards, or four times that of a 6250 bpi tape; five-megabyte hard drives became available for microcomputers
• IBM brought microcomputing to the academic sector
• CD-ROMs, and the Cuadra directory of databases

1980s (cont’d)

• Sue Dodd’s Cataloging machine-readable data files: an interpretive manual, 1982
• Social Forces was one of the first journals to include guidelines on citing machine-readable data files
• Population Index was the first bibliographic journal to cite data files
• A draft revision of AACR2 chapter 9 (renamed: Computer Files) was published in 1987 – bibliographic control for data files

1990s

• Migration from IBM mainframes (EBCDIC) to Unix (ASCII)
• Demise of tapes for storage, in favour of widespread use of CD-ROM
• Statistics Canada made the electronic products from the census the primary product
• Gopher, developed in 1991, was replaced by the WWW and HTML; by 1996 there were about 100,000 web servers
• Beginning of the DDI (Data Documentation Initiative) project in 1995, which published its first DTD in 1996

Three major developments led up to DDI:

• OSIRIS’ metadata management capability
• Roistacher’s outline of machine-readable data file documentation (1980)
• Dodd’s cataloguing manual (1982)

OSIRIS metadata

• The OSIRIS dictionary provided structural information: location, size, missing data, a variable name, and a (brief) variable label
• The OSIRIS codebook provided a tagged format:
– Introduction (unstructured)
– Full question text
– Variable values and value labels
– Variable-level comments
• North American institutions standardized on the OSIRIS type-1 and type-4 codebooks, Europe on the type-3 format codebook

Roistacher’s style manual

• Provided an outline of the information that should be contained in the full metadata (aka codebook), including:
– Bibliographic identity
– Project history
– File processing summary
– Data dictionary contents
– Recommended appendices

Sue Dodd’s cataloguing manual

• Further refined the bibliographic identity component of the metadata
• Provided a cross-walk to the AACR cataloguing rules
• Provided the foundation for the development of a MARC record
• Dodd also defined the components of a bibliographic citation

Many kinds of metadata for many purposes

• Data collection
• Data interpretation
• Data preservation
• Data discovery
• Coding standardization

Based on the NISO metadata classification:

Descriptive
• MARC records
• RAD records
• Thesauri
• Concordances

Structural
• Syntax files for e.g. SAS or SPSS
• Programming syntax
• Record layouts
• Data dictionaries
• Missing data specifications
• Definitions of derived variables

Administrative
• Project conception, implementation and funding
• Methodology reports, sampling frames, etc.
• Questionnaires and data collection protocols
• Interviewer instructions
• Post-processing, weighting, etc.
• Access and dissemination restrictions
• Question banks

DDI provides a format …

• From which other subtypes of metadata (bibliographic records, syntax files, question banks, etc.) can be generated (see the sketch below)
• Describes not just microdata, but also provides an intelligent means of describing aggregate statistics as data
• Can incorporate all documentation, from original project conception to edition management and post-processing
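As a rough sketch of what this looks like in markup (the element names follow the DDI 2.x Codebook specification, but the variable and its codes are the invented example from earlier, not drawn from any real study), a single variable description in a DDI instance might read:

<codeBook>
  <dataDscr>
    <var ID="V2" name="GENDER">
      <!-- fixed-column location in the raw data record -->
      <location StartPos="4" EndPos="4"/>
      <labl>Gender of respondent</labl>
      <qstn><qstnLit>Are you male or female?</qstnLit></qstn>
      <!-- value codes and labels, including the missing-data code -->
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
      <catgry missing="Y"><catValu>9</catValu><labl>Not stated</labl></catgry>
    </var>
  </dataDscr>
</codeBook>

From this single description an application can derive, for instance, the record layout and value labels of an SPSS or SAS syntax file, a variable-level note in a bibliographic record, or an entry in a question bank.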

DDI provides a format … (cont’d)

• 3rd-generation data access tools (Nesstar, DDI, and Dataverse (VDC)) all support DDI 2.0 at present, and offer on-line, remote, distributed access to data discovery and to the data themselves
• Leads to a proliferation of new applications of metadata, and to the realization of initiatives from earlier decades

Lessons learned

• Three killers of data:
– Software dependence
– Lost metadata
– The physical medium on which data are stored
• No solution as yet combines data, full metadata and statistical capability in a non-software-dependent format