Transcript Slide 1

Workshop on Metadata Standards and Best Practices
November 19-20th, 2007
Session 3
Researcher Metadata in RDCs
Pascal Heus
Open Data Foundation
[email protected]
http://www.opendatafoundation.org
Outline
• RDC Needs
• Metadata in RDCs
• Potential solutions
• Examples
• Conclusions / Q&A
http://www.opendatafoundation.org
Open Data Foundation – IZA 2007/11
RDC Overview
• Provide an environment where researchers can
perform in-depth analysis of data in the
most efficient way
• Simple access to a data file and codebook is
insufficient
• Need high-quality metadata and a
collaborative environment to promote
dynamic research
• Should capture the research process
• Provide benefits to all stakeholders:
producers, librarians, researchers, general
public, etc.
Metadata and the survey life cycle
• A survey is not a static process
• It evolves dynamically over time and involves many players
• It extends to aggregate data to reach decision makers
• Metadata is crucial to capture knowledge
Importance of metadata
Imagine a world without metadata…
• Users would say:
– I can’t find the right data! How do I get access?
– Where is the report / questionnaire / methodology?
– I don’t understand this survey / file / variable
– I can’t merge the files
– How do I weight the data?
– My results don’t match the report, I can’t reproduce the
same results
– Are these things comparable?
– I didn’t know someone had done this research before!
• Sounds familiar?
– Metadata is an answer to a researcher’s frustrations
• Producers and archivists are making efforts to
improve metadata; likewise, metadata must also
be captured by researchers (Life Cycle!)
When to capture metadata?
• Metadata must be captured at the time the event occurs!
• Documenting after the fact leads to considerable loss of
information
• This is true for both producers and researchers
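One way to capture metadata at the time of the event is to have each analysis step write a small record as it runs. The following is a minimal Python sketch, not a prescribed tool: the function names, the record fields, and the file and author names in the usage example are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def capture_event(action, files, author, note, when=None):
    """Build one metadata record for an analysis step, stamped when it happens.

    All field names are hypothetical; adapt them to the metadata
    standard used by your RDC.
    """
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(timespec="seconds"),
        "author": author,
        "action": action,
        "files": list(files),
        "note": note,
    }


def to_json_line(record):
    """Serialize a record as one JSON line, ready to append to a log file."""
    return json.dumps(record, sort_keys=True)


# Hypothetical usage: document a merge at the moment it is performed.
record = capture_event(
    action="merge household and person files",
    files=["household_v200706.dta", "person_v200706.dta"],
    author="jdoe",
    note="dropped 12 households with no matching person records",
)
print(to_json_line(record))
```

Appending one such line per step to a shared log keeps the "why" next to the "what", which is the information most easily lost when documentation is written afterwards.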
Metadata and the Replication standard
• Replication standard
– Gary King, Harvard, 1995
http://gking.harvard.edu/projects/repl.shtml
– "The replication standard holds that sufficient information
exists with which to understand, evaluate, and build upon a
prior work if a third party can replicate the results without
any additional information from the author."
– The only way to understand and evaluate an empirical
analysis fully is to know the exact process by which the
data were generated
– A replication dataset includes all information necessary to
replicate empirical results
• Metadata is crucial to meeting the standard
– Composed of documentation and structured metadata
– Undocumented data is useless
RDC issues
• Without producer metadata
– Researchers can’t discover data or work
efficiently
• Without researcher metadata
– Producers don’t know about data usage and quality issues
– Other researchers are not aware of what has been done
• Without standards
– Information can’t be properly managed and exchanged
between agencies or with the public
• Without tools
– Knowledge can’t be captured, preserved, or shared
RDC Metadata Framework
1. Producers provide data & basic docs
2. Need to enhance existing metadata
3. Start capturing researcher metadata
4. Knowledge grows and gets reused
5. Provides usage and quality
feedback to producer / RDC
6. Repeat across surveys/topics
7. Metadata facilitates output
8. Public metadata facilitates data
discovery / fosters global knowledge
9. Metadata exchange between agencies
[Figure: RDC metadata framework diagram. Labels: Producers, Data, RDC, Producer/Archive Metadata, Researcher, Research Metadata, Research Output, Public Use Metadata, External users]
RDC Solutions
• Metadata management
– Adopt standards and provide researchers with
comprehensive metadata
– Use related tools to capture research process
• Collaborative environment
– Use web technologies to foster a dynamic research
environment
• Connected and remote enclaves
– Connect RDCs through secure networks
– Consider virtual data enclaves
• Data disclosure
– Protect respondents through sound data disclosure
techniques
• Train providers / researchers
Simple techniques
• Start with good practices
– File and variable naming conventions (embed
metadata)
– Code documentation
– Good statistical methods
• Web tools
– Take advantage of common web technologies
– Organize: calendar, events & news, task/to-do lists
– Knowledge capture/sharing: shared
document/script libraries, wiki, blogs, discussion
groups, citation bases, etc.
Coding and naming conventions (1)
• Give meaningful names to files
– Avoid spaces in names, don’t use upper case
– Version your files (capture progress)
– Use “middle” extensions
– Include metadata in the name
• Not too good:
– report.doc, notes.txt
– myfile.dta, table2.xls
– reg.do, test.do, results.
• Better:
– usda_arms_2005_final_report_v200607.doc
– usda_arms_results_v200706.dta, usda_farms_by_crop.xls
– income_regression_v200706.do
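The naming rules above can be applied mechanically. The sketch below is a hedged Python illustration, not part of the original slides: `build_filename` and `follows_convention` are hypothetical helpers, and the `_vYYYYMM` version suffix is inferred from the "Better" examples rather than from a stated rule.

```python
import re
from datetime import date


def build_filename(parts, extension, version_date, middle=None):
    """Assemble a lower-case, space-free, versioned file name.

    parts        -- descriptive pieces of metadata (producer, survey, year, ...)
    extension    -- final extension, e.g. "dta"
    version_date -- date used for the _vYYYYMM version suffix (assumed format)
    middle       -- optional "middle" extension, e.g. "out"
    """
    slugs = []
    for part in parts:
        # Lower-case each piece and replace anything else with underscores.
        slug = re.sub(r"[^a-z0-9]+", "_", str(part).lower()).strip("_")
        if slug:
            slugs.append(slug)
    if not slugs:
        raise ValueError("at least one descriptive part is required")
    name = "_".join(slugs) + f"_v{version_date:%Y%m}"
    if middle:
        name += "." + middle.lower()
    return f"{name}.{extension.lower().lstrip('.')}"


def follows_convention(filename):
    """Check for lower case, no spaces, and a _vYYYYMM version suffix."""
    return re.fullmatch(r"[a-z0-9_]+_v\d{6}(\.[a-z0-9]+)+", filename) is not None


print(build_filename(["USDA", "ARMS", 2005, "final report"], "doc", date(2006, 7, 1)))
print(follows_convention("report.doc"))
print(follows_convention("income_regression_v200706.do"))
```

A check like `follows_convention` could run before files are placed in a shared library, so that unversioned or ambiguous names are caught early.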
Coding and naming conventions (2)
• Give meaningful names to variables
– Not too good:
• tmp3, ag_exp2, v324
– Better:
• valid_enterprise, agricultural_expenditure, s1q3
• Avoid complex code
• Comments, comments, comments!!
– Make sure to include lots of comments in your source code
– This is the best time to capture knowledge!
– It also promotes replicability and will help you in a few
months when you try to remember what you did
• Share source code, use peer review
Not so good code example
local mypath = "c:\data\anonymization\"
global data_in = "`mypath'" + "\" + "Demohh1000.dta"
global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
global threshold = 0.8
cd "`mypath'"
set more off
use $data_in, clear
tempfile temp
gen fk=1
gen wi=weight
collapse (sum) fk wi, by (town province marstat sex age)
gen pk=fk/wi
gen qk=1-pk
gen rk= (pk/qk) * log(1/pk) if fk==1
replace rk= (pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
replace rk=(pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
#delimit ;
replace rk= (pk/fk) * (1+ (qk/(fk+1)) +
    ((2*qk^2) / ((fk+1)*(fk+2))) +
    ((6*qk^3) / ((fk+1)*(fk+2)*(fk+3))) +
    ((24*qk^4) / ((fk+1)*(fk+2)*(fk+3)*(fk+4))) +
    ((120*qk^5) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5))) +
    ((720*qk^6) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6))) +
    ((5040*qk^7) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)*(fk+7)))) if fk>3 ;
Better code example
/**
* Computes the disclosure risk at individual level
*
* @author John Anonymous ([email protected])
* @version 2007.06
* References:
* - micro-Argus 4.1 manual, p27-25
*/
// Configuration
local mypath = "C:\data\anonymization\"
global data_in = "`mypath'" + "\" + "Demohh1000.dta"
global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
global threshold = 0.8
// Initialize
cd "`mypath'"
set more off
// Load the data
use $data_in, clear
tempfile temp
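The same commenting discipline applies to the risk computation itself, which the "Not so good code example" leaves unexplained. The sketch below restates those expressions in Python (not the Stata of the slides) with the variables documented. It is an illustration only: the function name, the argument names, and the guard for `pk == 1` are assumptions added here, and the formulas are copied from the Stata code on the slide rather than checked against the manual it cites.

```python
import math


def individual_risk(fk, weight_sum, n_terms=7):
    """Approximate individual disclosure risk for one combination of keys.

    fk         -- number of sample records sharing the key values
    weight_sum -- sum of their sampling weights (estimated population count)
    n_terms    -- number of series terms used when fk > 3 (7 on the slide)
    """
    if fk < 1 or weight_sum < fk:
        raise ValueError("need fk >= 1 and weight_sum >= fk")
    pk = fk / weight_sum  # estimated sampling fraction for this key
    qk = 1.0 - pk
    if qk == 0.0:
        # Assumption of this sketch: when pk == 1 the closed forms below
        # divide by zero, so return their limiting value 1/fk instead.
        return 1.0 / fk
    if fk == 1:
        return (pk / qk) * math.log(1.0 / pk)
    if fk == 2:
        return (pk / qk**2) * (pk * math.log(pk) + qk)
    if fk == 3:
        return (pk / (2.0 * qk**3)) * (
            qk * (3.0 * qk - 2.0) - 2.0 * pk**2 * math.log(pk)
        )
    # fk > 3: truncated series 1 + q/(f+1) + 2q^2/((f+1)(f+2)) + ...
    total, term = 1.0, 1.0
    for j in range(1, n_terms + 1):
        term *= j * qk / (fk + j)  # j-th term: j! q^j / ((f+1)...(f+j))
        total += term
    return (pk / fk) * total


# A sample-unique record (fk = 1) carries more risk than a key shared by four.
print(individual_risk(1, 2), individual_risk(4, 8))
```

Writing the series as a loop also removes the long hand-expanded expression, which is the kind of complex code the earlier slide advises against.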
Canada RDC Project
• Consists of 14 Research Data Centres,
6 branch RDCs and the Federal
Research Data Centre in Ottawa
• Data provided by Statistics Canada
• RDCs are now connected through a
high-speed secure network
• Project to adopt a DDI 3.0-based metadata
framework for survey documentation and
research work, and to sponsor development of
tools
• ODaF providing technical assistance
• http://www.statcan.ca/english/rdc/index.htm
The Canada RDC Research Life Cycle
[Figure: "Stages in the life cycle" diagram. Stage labels: Project Application, Project Approval, Project Creation, Access to Data, Managing Data, Generate Analysis Files, Disclosure Analysis, Output, Research Communications]
[Chuck Humphrey, University of Alberta]
Metadata in Canada RDC
[Figure: numbered flow diagram inside the RDC security perimeter. Labels: Producer, Analyst, Researcher, Original Survey, Master Survey, Virtual Survey, Research Output, Publication, Conferences, Security; outside users: Other researchers, Policy Makers, General Public]
1. Producer makes survey available
2. Analyst packages for RDC
3. Researcher gets access and reshapes the data
4. Researcher performs complex analysis
5. Researcher publishes results
6. Information flowing in/out and activities are controlled
and monitored
7. Outside users get access to the research output
8. Analyst includes results, activity, feedback
and reports to the producer
The information flow relies on metadata and also generates
new information that must be captured!!
Metadata Framework in Canada RDC
[Figure: architecture diagram of the metadata framework. Labels recoverable from the slide:
– Survey stages: ORIGINAL (Original Survey), MASTER (Master Survey), VIRTUAL (Virtual Survey), OUTPUT (Research Output: Publication, Conferences, …)
– Formats: DDI 2.0, DDI 2.0 / 3.0, DDI 3.0; Legacy SPSS, SAS, Stata
– Functions: Editor, Repurpose, Question, Version, Quality, Concepts, Analysis, Disclosure, Tables, Other, Log, Metadata Mining, Report, Group, Compare, Resources, Training, Documentation, Project, Admin
– Metadata Management: Storage, Query, Registry, Virtual File System, Exchange
– Communication: Collaborative Intranet
– Audit Logs, Data Files
– Security: Authorization, Authentication
– i18n]
NORC Data Enclave
• National Opinion Research Center
• Provides a secure environment within which
authorized researchers can access sensitive
microdata remotely from their offices or
onsite
• Data from the National Institute of Standards
and Technology’s (NIST) Technology
Innovation Program (TIP), the Ewing Marion
Kauffman Foundation, and the Economic
Research Service at the US Department of
Agriculture
• Possibly the first virtual data enclave
• http://dataenclave.norc.org
NORC Virtual Enclave
Benefits (1)
• Data documentation
– Through good metadata practices,
comprehensive documentation is available to the
researchers
• Preservation, integration and sharing of
knowledge
– Research process is captured and preserved in
a harmonized format
– Research knowledge becomes an integral part of
the survey and is available to others
– Producer gets feedback from the data users
(usage, quality issues)
– Reduces duplication of effort and facilitates reuse
Benefits (2)
• Research outputs and dissemination
– Facilitates production of research outputs
– Facilitates dissemination and fosters broader
visibility of research outputs
• Exchange of information
– Metadata exchange between RDCs, producers,
and librarians
– Importance of public metadata for sensitive
datasets
– Facilitates data discovery (inside and outside
RDC)
Conclusions
• Metadata plays a crucial role in RDCs
• Benefits all stakeholders
– Better use of the data (return on investment)
– Improves research quality
– Fosters production of high-quality data (more
relevant and accurate) accompanied by
comprehensive metadata
• Adopting good practices may mean
changing the way you work
– This requires good change management
techniques and discipline
– But the benefits are worth the effort