Protecting Confidentiality in a Virtual Data Centre

Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan (CSIRO); Tim Churches (Sax Institute)

28 October 2012

COMPUTATIONAL INFORMATICS


Overview

• Introduction to the problem
• Virtual Data Centres
• Proposed solution

Confidentiality in Virtual Data Centres | Christine O’Keefe

Population Health Research Network*

Provides access to linkable de-identified health data for research
• Improving outcomes
• Improving policy

Traditionally
• Supplies linkable de-identified health data directly to researchers

Loss of control over data heightens risk of:
• External attack on datasets
• Accidental or inadvertent actions by researcher
• Deliberate attack by trusted researcher

*www.phrn.org.au

Secure Unified Research Environment*

Secure remote access to virtual workstations and network in a data centre

*Sax Institute SURE User Guide v1.2

Confidentiality Protection for Health Data

Governance
• Comply with privacy legislation and regulation
• Honour assurances to data providers

Restrict access to approved researchers
• Information security measures

Restrict amount and detail of data available
• Apply statistical disclosure control methods before releasing data to researcher
  – No further confidentiality measures
• Enable access via secure on-line system
  – Manual checking for confidentiality issues in statistical analysis outputs
  – “…developing valid output checking processes that are automated is an open research question” (Duncan, Elliot, Salazar-González 2012)

Conceptual Model for online access

Remote Analysis (RA)
• Researcher cannot see data itself, only “Output for publication”

Virtual Data Centre (VDC)
• Researcher authorised to see data and “Output” as well as “Output for publication”

Virtual Data Centre

Assumptions
• Custodian prepares data to comply with legislation, regulation and assurances
• Researcher complies with applicable researcher agreements
• Researcher authorised to see data itself
  – Do not need to protect dataset records from researcher
  – Do not need to protect against malicious attacks by researcher
  – Data transformations and analyses are unrestricted
  – Confidentiality issues with respect to readers of academic literature
  – Confidentiality issues with respect to outputs of genuine queries

Main Disclosure Risks in Statistical Output

• Individual values
• Small cells/samples … threshold
• Dominance
• Differencing
• Linear or other algebraic relationships in data
• Precision

Confidentiality Protection in a Virtual Data Centre – two stage process

1. Dataset preparation – by Custodian
2. Confidentialisation of statistical analysis output for publication – by Researcher

Similarities to:
• ESSNet SDC Guidelines for checking output based on microdata research … Hundepool, Domingo-Ferrer, Franconi, Giessing, Nordholt, Spicer, de Wolf 2012
• Statistics New Zealand Data Lab Output Guide

Dataset preparation – by Custodian

Custodian
1. Removes obvious identifiers
2. Ensures dataset has sufficient records
3. Ensures published datasets differ by sufficiently many records
4. Ensures variables and combinations of variables have sufficiently many records
5. Reduces detail in data using aggregation (especially dates, locations)
6. Other measures as needed – statistical disclosure control
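Steps 4 and 5 of the Custodian's list lend themselves to a small illustration. The following is only a sketch under stated assumptions: the field names (`sex`, `postcode`, `admitted`), the helper names `small_combinations` and `coarsen_date`, and the minimum count of 2 are all hypothetical and do not come from the presentation.

```python
from collections import Counter


def small_combinations(records, keys, min_count):
    """Return value combinations of `keys` that occur in fewer than `min_count` records.

    Hypothetical helper: the presentation does not prescribe a specific count.
    """
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {combo: c for combo, c in counts.items() if c < min_count}


def coarsen_date(iso_date):
    """Reduce detail by keeping only the year of a YYYY-MM-DD date string."""
    return iso_date[:4]


# Illustrative records only; not real data.
records = [
    {"sex": "F", "postcode": "2000", "admitted": "2011-03-14"},
    {"sex": "F", "postcode": "2000", "admitted": "2011-07-02"},
    {"sex": "M", "postcode": "2000", "admitted": "2011-05-21"},
    {"sex": "M", "postcode": "2600", "admitted": "2012-01-09"},
]

# Combinations of sex and postcode held by fewer than 2 records.
risky = small_combinations(records, ["sex", "postcode"], min_count=2)

# Aggregate admission dates to the year.
years = [coarsen_date(r["admitted"]) for r in records]
```

Combinations flagged in `risky` would be candidates for recoding or further aggregation before the dataset is made available.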

Confidentialisation of statistical analysis output for publication – by Researcher

Researcher
1. Uses Checklist of tests to identify outputs that fail one or more tests
2. Considers context and interactions of outputs to identify potential disclosure risks
3. Applies treatments from Checklist to reduce potential disclosure risk

Checklist of Tests

• Individual value: An individual data value is directly revealed
• Threshold n: A cell or statistic is calculated on fewer than n data values
• Threshold p%: A cell contains more than p% of the values in a table margin
• Dominance (n,k): Amongst the records used to calculate a cell value or statistic, the n largest account for at least k% of the value
• Dominance p%: Amongst the records used to calculate a cell value or statistic, the total minus the two largest values is less than p% of the largest value
• Differencing: A statistic is calculated on populations that differ in fewer than n records
• Relationships: The statistic involves linear or other algebraic relationships
• Precision: The output involves a high level of precision in terms of significant figures and/or decimal places
• Degrees of Freedom: The model output has fewer than n degrees of freedom
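Several of the tests above are simple enough to express directly. The sketch below follows the wording of the Threshold n, Dominance (n,k), Dominance p% and Differencing tests. The function names and the parameter values in the example are hypothetical, values are assumed to be non-negative, and treating identical populations as passing the differencing test is an added assumption.

```python
def fails_threshold_n(values, n):
    """Threshold n: the statistic is calculated on fewer than n data values."""
    return len(values) < n


def fails_dominance_nk(values, n, k):
    """Dominance (n,k): the n largest values account for at least k% of the total."""
    total = sum(values)
    if total <= 0:
        return False  # assumption: nothing to dominate in an empty or zero total
    top_n = sum(sorted(values, reverse=True)[:n])
    return 100 * top_n >= k * total


def fails_dominance_p(values, p):
    """Dominance p%: total minus the two largest values is less than p% of the largest."""
    ordered = sorted(values, reverse=True)
    if not ordered or ordered[0] <= 0:
        return False
    remainder = sum(ordered[2:])
    return 100 * remainder < p * ordered[0]


def fails_differencing(ids_a, ids_b, n):
    """Differencing: two populations differ in fewer than n records.

    Assumption: identical populations (zero differing records) do not fail.
    """
    differing = set(ids_a) ^ set(ids_b)
    return 0 < len(differing) < n


# Illustrative values: one contributor dominates the cell total.
skewed = [900, 50, 30, 20]
balanced = [10, 10, 10, 10, 10]

skewed_results = (
    fails_threshold_n(skewed, 5),
    fails_dominance_nk(skewed, 1, 75),
    fails_dominance_p(skewed, 10),
)
balanced_results = (
    fails_threshold_n(balanced, 5),
    fails_dominance_nk(balanced, 1, 75),
    fails_dominance_p(balanced, 10),
)
```

An output that fails any of these tests is only flagged here. Choosing a treatment still requires the researcher to consider the context, as the two-stage process describes.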

Checklist - examples

Statistic: Number, e.g. sample size
Confidentiality Test: Threshold n
Treatment: Try to get more data; suppress value; recode variable
Notes: If this test is failed, the study is probably unreliable due to the small sample size.

Statistic: Mean
Confidentiality Test: Threshold n; Dominance (n,k); Dominance p%
Treatment: Round reported value; suppress denominator; suppress value
Confidentiality Test: Differencing
Treatment: Redefine one or both populations
Notes: The tests and treatments are only necessary if the denominator is known, so that the sum can be inferred. The mean has a strong algebraic relationship with the sum, so it is potentially disclosive.

Statistic: Ratios and percentages
Confidentiality Test: Individual values; Threshold n; Threshold p%; Dominance (n,k); Dominance p%; Differencing; Relationships; Precision
Treatment: Suppress individual values; recode variables; round reported values; suppress values; redefine one or both populations; round reported values; round reported values
Notes: For a ratio, the tests and treatments are only necessary if one of the terms is known, so that the other can be inferred (this is an example of the relationship test).
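The note on the mean can be made concrete with a small worked example. It shows how a known denominator turns a mean into a sum, and how two means on populations differing by one record reveal that record's value. The numbers and function names are invented for illustration, and the rounding base of 5 is an arbitrary assumption.

```python
def implied_sum(mean, n):
    """A mean with a known denominator n reveals the sum."""
    return mean * n


def value_revealed_by_differencing(mean_all, n_all, mean_subset, n_subset):
    """If the populations differ by exactly one record, the difference of the
    implied sums is that record's value."""
    return implied_sum(mean_all, n_all) - implied_sum(mean_subset, n_subset)


def round_to_base(x, base):
    """Round a reported value to the nearest multiple of `base` (a treatment)."""
    return base * round(x / base)


# Illustrative values only: the last record is much larger than the others.
values = [43, 52, 64, 250]
mean_all = sum(values) / len(values)   # mean over all four records
mean_subset = sum(values[:3]) / 3      # mean excluding the last record

# Exact means with known sample sizes reveal the excluded value.
revealed = value_revealed_by_differencing(mean_all, 4, mean_subset, 3)

# Rounding the reported means blurs the inference but does not remove it.
blurred = value_revealed_by_differencing(
    round_to_base(mean_all, 5), 4, round_to_base(mean_subset, 5), 3
)
```

Here `revealed` recovers the excluded value exactly, while `blurred` is only an approximation. This is why the checklist pairs rounding with other treatments such as redefining one or both populations.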

Checklist - examples

Statistic: Relative risk
Confidentiality Test: Precision; Threshold n; Precision
Treatment: Round reported values; recode variables; round reported value
Notes: In some cases data might be reconstructed from sample size and relative risk value alone. If so, the data would need to be checked for disclosure risk, and treatments applied if necessary.

Statistic: Confidence interval
Confidentiality Test: Degrees of freedom; Threshold n; Precision
Treatment: Change model or data groups to increase degrees of freedom; recode variables; round reported values
Notes: A confidence interval based on a normal distribution reveals a mean and standard error. These might be disclosive; see the tests and treatments under Summary Statistics. Note that in a regression context it is claimed they can be used to reconstruct the fitted values.

Statistic: p value of a test
Confidentiality Test: Precision
Treatment: Round reported value
Notes: A p value can reveal the value of a test statistic, which might be disclosive in combination with other reported information; see the first note on Confidence Intervals.

Statistic: Kaplan Meier plot; other cumulative distribution plots
Confidentiality Test: Individual value; Threshold n; Threshold p%; Dominance (n,k); Dominance p%
Treatment: Do not show individual values (this can be done by either smoothing the plot or recoding variables); only relevant if data already grouped in plot: recode variables
Notes: There exists software that can read data values from a pdf version of a plot.
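The notes on confidence intervals and p values describe a simple inversion, sketched below. It assumes a symmetric interval based on a normal distribution and a two-sided z test; other interval or test types would need different formulas. The function names and example numbers are hypothetical.

```python
from statistics import NormalDist


def mean_and_se_from_ci(lower, upper, level=0.95):
    """Recover the mean and standard error from a normal-based confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # critical value, about 1.96 for 95%
    mean = (lower + upper) / 2
    standard_error = (upper - lower) / (2 * z)
    return mean, standard_error


def abs_z_from_two_sided_p(p_value):
    """Recover the absolute value of a z statistic from a two-sided p value."""
    return NormalDist().inv_cdf(1 - p_value / 2)


# Illustrative interval only.
mean, standard_error = mean_and_se_from_ci(48.04, 51.96)
z_statistic = abs_z_from_two_sided_p(0.05)
```

Because the recovered mean and standard error can then be tested like any other summary statistic, rounding the reported interval limits and p values limits how precisely they can be inverted.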

Summary

Virtual Data Centres
• Becoming more popular
• Manual checking of outputs for confidentiality risk not sustainable
• Automated methods for confidentiality protection in statistical analysis outputs still under development

Interim Solution
1. Dataset preparation by Custodian
2. Researchers confidentialise their own outputs for publication
   – Training
   – Checklist of tests and confidentiality treatments

Thank you

Computational Informatics

Dr Christine O’Keefe Research Program Leader, Decision and User Science t +61 2 6216 7021 e [email protected]

w www.csiro.au
