Bank of Italy Remote access to micro Data



The new Bank of Italy Remote access to micro Data (BIRD)

G. Bruno, L. D’Aurizio, R. Tartaglia-Polcini
Q2008, Rome, July 10, 2008

Motivation

• Information release and data protection as competing goals
• The risk-utility tradeoff:
  • risk of data disclosure
  • utility of widespread availability of data for research

Motivation

GOALS (UTILITY):
• satisfy growing demand from external researchers for business data
• improve the accountability of the Central Bank as an economic research centre
• provide a service to the scientific community

CONSTRAINTS (RISK):
• Data confidentiality must be guaranteed:
  • as a prerequisite for respondents’ collaboration
  • to foster the quality of the data provided
  • because it is required by law
• Public Use File (PUF) with individual data judged unfeasible: anonymisation is very problematic with business data

Motivation

SYNTHETIC DATA LIMITATIONS:
• Identity disclosure impossible in principle, but, particularly with extreme values, it may be possible to re-identify a source record
• Attribute disclosure may happen
• Ample literature on data confounding and synthetic data (Duncan & Lambert 1989; Rubin 1993; Little 1993; Fuller 1993; Fienberg et al. 1996; Kennickell 1997; Abowd & Woodcock 2001; Reiter 2002; Raghunathan et al. 2003; etc.)

Choices

• Data confounding: create a PUF containing perturbed data to prevent identification of individual information. Downside: results (esp. regressions) may heavily depend on the confounding technique adopted; the literature is controversial.
• Data lab (à la Istat: ADELE): the researcher has to go to the lab in person.
• Remote processing, using the internet, without direct access to individual data (à la Luxembourg Income Study: LISSY)

Other remote processing systems

• Luxembourg Income Study (LISSY, 1987)
• Statistics Canada (2001)
• Statistics Denmark (2001)
• Statistics Netherlands (2002)
• Australian Bureau of Statistics (2003)
• Statistics Sweden (2003)
• US Federal Agencies: NCHS (1997), NCES (1998), Census Bureau (2003)

The solution adopted at the Bank of Italy: BIRD

• Modeled on LISSY
• Low setup cost
• Easily customisable
• Supports multiple packages
• Maximum accessibility for users
• Multi-level control (user/group, dataset, keyword)
• Automatic and manual checks & review

How BIRD works

USER ELIGIBILITY CRITERIA
• Researcher status (not necessarily academic) proved by a presentation letter
• Identification via valid personal id
• Detailed information via form to be filled in

How BIRD works

USER PROFILE CREATION
• The researcher indicates an e-mail address which will be recognised by the system
• The researcher indicates her own username and password
• User-chosen parameters are input in the user database
• Access profile is created

How BIRD works

SUBMISSION PROCEDURE
• Communication with the processing environment via e-mail
• Send a message containing user authentication info + statements to be submitted
• Input message is parsed and checks are performed
• If no error/security violation → submit statements
• Output is parsed (automatically / manually)
• If no security violation → forward to the user via e-mail
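The first steps of the submission procedure can be sketched in Python. This is a minimal sketch under stated assumptions, not BIRD's actual implementation: the `key=value` header layout, the field names (`user`, `password`, `package`, `dataset`), the `REGISTERED` registry and the `parse_message` helper are all hypothetical. Only two points come from the slides: registered mailboxes are whitelisted, and in the deck's example the authentication occupies the first four lines of the message.

```python
from dataclasses import dataclass

# Hypothetical registry: whitelisted mailbox -> (user, password).
# The real system keeps this in its user database; the layout here is assumed.
REGISTERED = {"researcher@example.org": ("jdoe", "example-password")}

# The deck's example places authentication in the first four lines.
AUTH_LINES = 4


@dataclass
class Submission:
    user: str
    package: str
    statements: str


def parse_message(sender: str, body: str) -> Submission:
    """Split an incoming message into authentication fields and statements."""
    if sender not in REGISTERED:
        raise PermissionError("mailbox is not whitelisted")

    lines = body.splitlines()
    header, statements = lines[:AUTH_LINES], lines[AUTH_LINES:]

    # Assumed header layout: one "key=value" pair per line.
    fields = {}
    for line in header:
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()

    user, password = REGISTERED[sender]
    if (fields.get("user"), fields.get("password")) != (user, password):
        raise PermissionError("authentication failed")

    return Submission(user, fields.get("package", ""), "\n".join(statements))
```

Only the statements returned by such a step would go on to the input checks; a message from an unknown mailbox or with wrong credentials never reaches the processing environment.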

Confidentiality safeguards

• User level
• Data level
• Processing level

Confidentiality safeguards

User level:
• Users are identified, qualified and registered
• Registered mailboxes are whitelisted; ordinarily only one mailbox per user
• Outputs are monitored and archived
• Deontological code, privacy law, specific penalties

Sanctions
• Forbidden submissions or outputs are deleted
• Grant of access for users trying to perform forbidden commands may be revoked
• Any other sanctions or penalties required by the law where applicable

Confidentiality safeguards

Data level:
• Extreme data are censored (Winsorized)
• Identifying variables (ids, names, addresses) are expunged from the datasets used for remote processing
• Stratification variables are collapsed (geographical areas and not regions; Ateco aggregations and not codes)
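As an illustration of the data-level safeguards, the following Python sketch Winsorizes a variable and collapses a regional code into a broader area. It is a sketch under stated assumptions: the slides do not give the quantile thresholds, the censoring rule, or the actual region-to-area mapping, so the cut-offs, the `REGION_TO_AREA` table and both function names are hypothetical.

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp values outside the chosen quantiles to the quantile values.

    The quantiles are illustrative defaults, not the thresholds used by BIRD.
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_q * (n - 1))]
    high = ordered[int(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in values]


# Illustrative mapping only: a few Italian regions and the broad area each belongs to.
REGION_TO_AREA = {
    "Lombardia": "North-West",
    "Veneto": "North-East",
    "Lazio": "Centre",
    "Sicilia": "South and Islands",
}


def collapse_region(region):
    """Replace a detailed stratification value with its broader area."""
    return REGION_TO_AREA[region]


workforce = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
censored = winsorize(workforce, lower_q=0.1, upper_q=0.9)
```

In this toy example the outlying firm with 1000 employees is pulled back to the upper cut-off, so it no longer stands out in any output computed from `censored`.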

Confidentiality safeguards

Processing level:
• Formally forbidden to display individual data
• Keyword parser implemented with ceiling, blacklist and graylist
• Particularly long and/or complex programmes are always reviewed manually
• In the learning stage, all submissions are reviewed manually

How the parser works

Each check has a type, a check performed, an action if it fails on INPUT and an action if it fails on OUTPUT.

• blacklist: parsing text for specific words and sequences
• length: checking the length of text; soft ceiling: manual review, hard ceiling: job cancelled
• graylist (*): parsing text for specific words and sequences

Depending on the check and on whether it fails on INPUT or OUTPUT, the action is job cancelled, manual review, or n/a.

(*) This feature will be available in the next release of the system.
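The three kinds of check (blacklist, graylist, length ceilings) can be sketched in Python as follows. This is a hedged illustration, not the parser used by BIRD: the word lists, the ceilings measured in lines, the `Verdict` enum and the `check_text` function are all assumptions, and the sketch does not model which checks apply to input and which to output.

```python
import re
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    MANUAL_REVIEW = "manual review"
    CANCELLED = "job cancelled"


# Hypothetical lists and ceilings; the real ones are not given in the slides.
BLACKLIST = ("list", "browse", "print")   # e.g. commands that display records
GRAYLIST = ("tabulate",)                  # suspicious but not forbidden
SOFT_CEILING = 200                        # lines: above this, manual review
HARD_CEILING = 1000                       # lines: above this, job cancelled


def check_text(text, blacklist=BLACKLIST, graylist=GRAYLIST,
               soft=SOFT_CEILING, hard=HARD_CEILING):
    """Return the strictest verdict triggered by any of the checks."""
    tokens = set(re.findall(r"[a-z_]+", text.lower()))
    n_lines = len(text.splitlines())

    # Blacklisted word or hard ceiling exceeded: cancel the job.
    if tokens & set(blacklist) or n_lines > hard:
        return Verdict.CANCELLED
    # Graylisted word or soft ceiling exceeded: hand over to a reviewer.
    if tokens & set(graylist) or n_lines > soft:
        return Verdict.MANUAL_REVIEW
    return Verdict.PASS
```

Evaluating the cancelling conditions first means a text that trips both a blacklist and a graylist rule is cancelled rather than queued for review, which is the more conservative choice.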

Datasets available

STANDARD DATASET: quantitative data for the biggest firms (in terms of workforce) are censored (Winsorised)
COMPLETE DATASET: no data censoring
Id variables are expunged from both datasets, obviously

Datasets available

Aggravated procedure for accessing the complete dataset:
• Access must be explicitly requested; a special profile is created
• Review is exclusively manual
• Wait times are longer than average, as the time allocated to manual review of the complete dataset is limited

Documentation on the website

• Application form
• Instruction manual
• Dataset description
• Examples of submissions in the supported packages (SAS, Stata)
• Methodological notes on the survey

Support

1. Documentation available on the Bank of Italy website (manuals, variables description, questionnaires): http://www.bancaditalia.it/statistiche/indcamp/indimpser/bird
2. Mailbox for queries and assistance: [email protected]

An example

Program submitted by the user in Stata. Authentication is in the first four lines.

An example

Output forwarded after review

Usage of the system in the first weeks

System started officially on Mar 13, 2008

Beta users from Feb 1, 2008

8 registered users

172 submissions in 21 weeks

[Figure: BIRD, number of weekly submissions from Feb 1, 2008; horizontal axis: weeks w1 to w21]

Future developments

• Web submission available alongside e-mail submission
• Other datasets will be made available in the future (e.g. data from the Business Outlook Survey)
• Open source packages processing (e.g. R)
• Merging with external datasets provided by the user, for special projects, on a discretionary basis, under an aggravated procedure and higher security levels
• Creation of closed groups with special authorisation levels for specific projects

Thank you for your attention