WP 10 On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation Yosi Rinott Hebrew University Natalie Shlomo Hebrew University Southampton University.



Disclosure Risk Measures

Notation:

Sample (size $n$): $f = \{f_k : k = 1, \ldots, K\}$
Population (size $N$): $F = \{F_k : k = 1, \ldots, K\}$

Risk Measures:

$\tau_1 = \sum_k I(f_k = 1, F_k = 1)$
$\tau_2 = \sum_k I(f_k = 1) \frac{1}{F_k}$ = expected number of correct matches of sample uniques

Estimates:

$\hat\tau_1 = \sum_k I(f_k = 1) \, P(F_k = 1 \mid f_k = 1)$
$\hat\tau_2 = \sum_k I(f_k = 1) \, E[1/F_k \mid f_k = 1]$

On Definitions of Disclosure Risk

• In the statistics literature, we present examples of risk measures, such as $\tau_1$ and $\tau_2$, but we lack formal definitions of when a file is safe.
• In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5); Adam and Wortmann (1989), who write “it may be argued that elimination of disclosure is possible only by elimination of statistics”).
• In some of the CS literature, any data must be released with noise. The noise must be small enough that legitimate information on large subsets of the data remains useful, and large enough that information on small subsets or individuals is too noisy to be useful (regardless of whether it is obtained by direct queries, differencing, etc.).

On Definitions of Disclosure Risk

The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped) simplifies the definitions: there is no need to consider other, more realistic but more complicated scenarios.

But would Statistics Bureaus and statisticians agree to adding noise to any data?

Other approaches like query restriction or query auditing do not lead to formal definitions.


Definition of Disclosure Risk

$d = \{d_i : i = 1, \ldots, n\}$, with $d_i \in \{0, 1\}$.

A query is a sum over a subset of the $d_i$. The answer to each query is perturbed by noise.

It is proven that almost all $d_i$ can be reconstructed if the noise is $o(\sqrt{n})$, and that none of them can be reconstructed if the noise is of order $\sqrt{n}$.

Noise of order $\sqrt{n}$ hides information on individuals and small groups, but yields meaningful information about sums of $O(n)$ units, for which noise of order $\sqrt{n}$ is natural.

Work further expanded to lessen the magnitude of the noise by limiting the number of queries.
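The trade-off described on this slide can be sketched numerically: noise of order $\sqrt{n}$ swamps a query about one individual but is small relative to a sum over $O(n)$ units. The sketch below assumes bounded uniform noise purely for illustration, since the slide does not specify a noise distribution. The helper `noisy_subset_sum` and all parameter values are hypothetical.

```python
import numpy as np


def noisy_subset_sum(d, subset, noise_bound, rng):
    """Answer a subset-sum query with additive noise in [-noise_bound, noise_bound).

    Assumption: uniform bounded noise, chosen only to keep the sketch simple.
    Returns (perturbed answer, true answer).
    """
    true_answer = int(np.sum(d[subset]))
    noise = rng.uniform(-noise_bound, noise_bound)
    return true_answer + noise, true_answer


rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, size=n)      # binary database d_i in {0, 1}
bound = np.sqrt(n)                  # noise of order sqrt(n), here 100

# Query on a single individual: the true answer is 0 or 1, the noise can reach 100
single, single_true = noisy_subset_sum(d, np.array([0]), bound, rng)

# Query on half of the database: the same noise is small relative to the subset size
large_subset = np.arange(n // 2)
large, large_true = noisy_subset_sum(d, large_subset, bound, rng)
relative_error = abs(large - large_true) / len(large_subset)
```

With these values the error on the large query is at most 2% of the subset size, while the answer on the single individual carries essentially no information about that individual's value.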

Definition of Disclosure Risk

Collaboration between the CS and statistical communities, where:

1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk.
2. In the CS community, there is a need for statistical methods to preserve the utility of the data:
   - allow sufficient statistics to be released without perturbation
   - methods for adding correlated noise
   - sub-sampling and other methods for data masking

Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?

Probabilistic Models

Focus on sample microdata and not the whole population (sampling provides a priori protection against disclosure).

Standard (natural) assumptions:

$F_k \mid \gamma_k \sim \text{Poisson}(N\gamma_k)$ independently, with $\sum_k \gamma_k = 1$
$f_k \mid F_k \sim \text{Bin}(F_k, \pi_k)$ (Bernoulli or Poisson sampling)

$f_k \mid \gamma_k \sim \text{Poisson}(N\gamma_k\pi_k)$
$F_k \mid f_k \sim f_k + \text{Poisson}(\lambda_k)$, with $\lambda_k = N\gamma_k(1 - \pi_k)$

In particular, $F_k \mid f_k = 1 \sim 1 + \text{Poisson}(\lambda_k)$, the size-biased Poisson distribution.
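Under the Poisson assumptions on this slide, $F_k - 1$ given $f_k = 1$ is Poisson with mean $N\gamma_k(1 - \pi_k)$, so the per-cell terms of the two risk estimates have closed forms: $P(F_k = 1 \mid f_k = 1) = e^{-\mu}$ and $E[1/F_k \mid f_k = 1] = (1 - e^{-\mu})/\mu$, where $\mu$ denotes that mean. The sketch below evaluates both and checks the second against direct summation. The function names and the parameter values in the usage lines are hypothetical.

```python
import math


def poisson_unique_risk(N, gamma_k, pi_k):
    """Per-cell risk terms for a sample unique under the Poisson model.

    Returns (P(F_k = 1 | f_k = 1), E[1/F_k | f_k = 1]).
    """
    mu = N * gamma_k * (1.0 - pi_k)
    p_unique = math.exp(-mu)
    # E[1/(1 + X)] for X ~ Poisson(mu) is (1 - exp(-mu)) / mu, with limit 1 as mu -> 0
    inv_F = 1.0 if mu == 0 else -math.expm1(-mu) / mu
    return p_unique, inv_F


def series_check(mu, terms=200):
    """Direct summation of E[1/(1 + X)] for X ~ Poisson(mu), as a sanity check."""
    total, pmf = 0.0, math.exp(-mu)
    for j in range(terms):
        total += pmf / (j + 1)
        pmf *= mu / (j + 1)
    return total


# Illustrative values only: these give a Poisson mean of 1
p_unique, inv_F = poisson_unique_risk(N=1000, gamma_k=0.002, pi_k=0.5)
```

Summing these per-cell terms over the sample uniques, with fitted values plugged in for the unknown parameters, gives the two risk estimates.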

Probabilistic Models

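Add $\gamma_k \sim \text{Gamma}(\alpha, \beta)$:

$F_k \mid f_k \sim f_k + \text{NB}\left(\alpha + f_k, \frac{N\pi_k + 1/\beta}{N + 1/\beta}\right)$

As $\alpha \to 0$ and $\beta \to \infty$, this gives

$F_k \mid f_k \sim f_k + \text{NB}(f_k, \pi_k)$, the Mu-Argus model.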
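A short sketch of what the negative binomial form implies for a sample unique, assuming that NB(r, p) denotes the number of failures before the r-th success with success probability p (this parametrization is an assumption, chosen because it makes the slide's formula consistent with a Gamma prior of scale $\beta$). The probability of population uniqueness is then $p^{\alpha + 1}$, and in the limit with parameters $(f_k, \pi_k)$ it is $\pi_k$, with $E[1/F_k \mid f_k = 1] = -\pi_k \log \pi_k / (1 - \pi_k)$. Function names and numeric values are hypothetical.

```python
import math


def nb_success_prob(N, pi_k, beta):
    """Second NB parameter on the slide: (N * pi_k + 1/beta) / (N + 1/beta)."""
    return (N * pi_k + 1.0 / beta) / (N + 1.0 / beta)


def nb_p_unique(N, pi_k, alpha, beta):
    """P(F_k = 1 | f_k = 1) = P(NB(alpha + 1, p) = 0) = p ** (alpha + 1)."""
    return nb_success_prob(N, pi_k, beta) ** (alpha + 1.0)


def limit_inv_F(pi_k):
    """E[1/F_k | f_k = 1] when F_k - 1 ~ NB(1, pi_k), i.e. geometric."""
    if pi_k == 1.0:
        return 1.0
    return -pi_k * math.log(pi_k) / (1.0 - pi_k)


# With alpha near 0 and beta very large, the uniqueness probability approaches pi_k
near_limit = nb_p_unique(N=1e5, pi_k=0.02, alpha=1e-9, beta=1e9)
inv_F_half = limit_inv_F(0.5)
```

In the limiting case the risk of a sample unique depends only on the sampling fraction of its cell, not on information from other cells.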

Mu-Argus Model

(Benedetti, Capobianchi, Franconi (1998))

$w_i$ is the sampling weight of individual $i$, obtained from the design or from post-stratification.

$\hat\pi_k = f_k / \hat F_k$, where $\hat F_k = \sum_{i \in \text{sample cell } k} w_i$

$\sum_k \hat F_k = \sum_i w_i = N$

[...] estimates increase to the correct level in [...], but how to [...]
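The weight-based estimate on this slide (the cell's sample count divided by the sum of the sampling weights of the sample records in that cell) can be sketched as follows. The function name `argus_pi_hat`, the integer cell coding, and the toy weights are hypothetical.

```python
import numpy as np


def argus_pi_hat(cell_of_record, weights, n_cells):
    """Return (f, F_hat, pi_hat) per cell from record-level sampling weights.

    cell_of_record: integer cell index k of each sample record
    weights: sampling weight w_i of each sample record
    """
    cell_of_record = np.asarray(cell_of_record)
    weights = np.asarray(weights, dtype=float)
    f = np.bincount(cell_of_record, minlength=n_cells)
    # F_hat_k is the sum of the weights of the sample records falling in cell k
    F_hat = np.bincount(cell_of_record, weights=weights, minlength=n_cells)
    pi_hat = np.full(n_cells, np.nan)   # undefined for empty sample cells
    nonempty = f > 0
    pi_hat[nonempty] = f[nonempty] / F_hat[nonempty]
    return f, F_hat, pi_hat


# Toy sample of four records in a table with four cells
cells = [0, 0, 2, 3]
w = [50.0, 30.0, 40.0, 25.0]
f, F_hat, pi_hat = argus_pi_hat(cells, w, n_cells=4)
```

Each cell's estimate uses only the records in that cell, so nothing is learned from other cells.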

Poisson Log-linear Models

(Skinner, Holmes (1998), Elamir, Skinner (2005), Skinner, Shlomo (2005))

$N\gamma_k = \exp(x_k'\beta)$

Monotonicity in the size of the model (number of parameters): the risk is underestimated by models that are too large and overestimated by models that are too small.

Intermediate models with conditional independence involve smaller products of marginal proportions, and therefore we expect that there will be a model which gives a good risk estimate.

Neighborhood of a Log-linear Model

Log-linear models take into account a neighborhood of cells to infer on $\gamma_k$ for determining the risk.

For example, the independence neighborhood, $k = (i, j)$: $\hat\gamma_k$ is a product of marginal proportions obtained by fixing one attribute at a time. Thus, if one attribute is income group, then inference on the very rich involves information on the very poor, provided there is another attribute in common, such as marital status.
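To illustrate the independence neighborhood for a two-way table, the sketch below forms the fitted cell proportion as the product of the row and column marginal proportions of the sample. The function name and the toy table are hypothetical.

```python
import numpy as np


def independence_gamma_hat(table):
    """Fitted cell proportions under independence for a two-way table of sample counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row_prop = table.sum(axis=1) / n   # marginal proportions, fixing attribute i
    col_prop = table.sum(axis=0) / n   # marginal proportions, fixing attribute j
    return np.outer(row_prop, col_prop)


# Toy 3 x 3 table of sample counts, invented for illustration
counts = np.array([[4, 0, 0],
                   [2, 1, 0],
                   [0, 2, 1]])
gamma_hat = independence_gamma_hat(counts)
```

The fitted value for the empty corner cell (0, 2) is 0.4 × 0.1 = 0.04: it draws on every cell in its row and every cell in its column, including cells far from it on the ordinal scale.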

Discussion of Neighborhoods

How likely is a sample unique to be a population unique? If a sample unique has mostly small or empty neighboring cells, it is more likely to be a population unique.

• Argus is based on weights and involves no learning from other cells.
• The log-linear Poisson model takes neighborhoods into account, reduces the number of parameters, and also reduces their standard deviation and hence that of the risk measures (provided that the model is valid).

Are there other types of neighborhoods which may be more natural? We focus on ORDINAL variables.

Proposed Neighborhoods

• Local smoothers for large sparse (ordinal) tables, e.g. Bishop, Fienberg, Holland (1975), Simonoff (1998)
• Use local neighborhoods to fit a simple smooth function to $\gamma_k$, by varying the coordinates of ordinal attributes and fixing non-ordinal attributes

$N_c^k$: neighborhood of cell $k$ at distance $c$ from cell $k$

Proposed Neighborhoods

Regressors for cell $k$:

$x_c(k) = \sum_{l \in N_c^k} f_l$

$\gamma_k = \exp\left\{\beta_0 + \sum_{c=1}^{C} \beta_c x_c(k)\right\}$

$f_k \mid \gamma_k \sim \text{Poisson}(n\gamma_k)$
$F_k \mid f_k \sim f_k + \text{Poisson}(\lambda_k)$, with $\lambda_k = N\gamma_k(1 - \pi_k)$

A cell is defined as a structural zero if all the neighborhoods of the cell that are used in the regression contain only empty cells.
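The neighborhood regressors and the structural-zero rule can be sketched as below for a table over the ordinal attributes only (non-ordinal attributes are held fixed, so one such table would be processed per combination of their values). The slides do not spell out the distance, so the sketch assumes that the neighborhood at distance $c$ consists of the cells at city-block distance exactly $c$. That choice, the function names, and the toy counts are all hypothetical.

```python
from itertools import product

import numpy as np


def neighborhood_sums(table, max_c):
    """Return x with x[c - 1][k] = sum of f_l over cells l at distance exactly c from k.

    Assumption: distance is the city-block distance over the ordinal coordinates.
    """
    table = np.asarray(table)
    shape = table.shape
    x = np.zeros((max_c,) + shape, dtype=table.dtype)
    offsets = [o for o in product(range(-max_c, max_c + 1), repeat=table.ndim)
               if 1 <= sum(abs(v) for v in o) <= max_c]
    for k in np.ndindex(*shape):
        for o in offsets:
            l = tuple(ki + oi for ki, oi in zip(k, o))
            # Skip neighbors that fall outside the table
            if all(0 <= li < s for li, s in zip(l, shape)):
                c = sum(abs(v) for v in o)
                x[(c - 1,) + k] += table[l]
    return x


def structural_zeros(x):
    """Flag cells whose neighborhoods, at every distance used, contain only empty cells."""
    return x.sum(axis=0) == 0


# Toy table of sample counts over two ordinal attributes
counts = np.array([[1, 0, 0, 0],
                   [0, 2, 0, 0],
                   [0, 0, 0, 0]])
x = neighborhood_sums(counts, max_c=2)
zeros = structural_zeros(x)
```

The arrays `x[0]`, `x[1]`, ... would then serve as the regressors in a Poisson regression for the cell counts; the regression fit itself is not shown here.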

Example

Population from the 1995 Israeli Census File, Age > 15: N = 746,949, n = 14,939, and K = 337,920.

Key: Sex (2), Age groups (16), Groups of years of study (10), Number of years in Israel (11), Income groups (12), Number of persons in household (8).

Sex is not ordinal and is fixed.

Weights for Argus are obtained by post-stratification on weighting classes: sex, age and geographical location.
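As a quick arithmetic check of the figures on this slide, the number of cells K is the product of the category counts of the key variables, and the sampling fraction is n/N. The dictionary keys below are shorthand labels introduced only for this sketch.

```python
import math

# Category counts of the key variables, as listed on the slide
key_sizes = {
    "sex": 2,
    "age_group": 16,
    "years_of_study_group": 10,
    "years_in_israel": 11,
    "income_group": 12,
    "persons_in_household": 8,
}

K = math.prod(key_sizes.values())   # number of cells in the key
N, n = 746_949, 14_939
sampling_fraction = n / N           # roughly 2%
mean_sample_count = n / K           # average sample count per cell
```

The product reproduces K = 337,920. With fewer than 0.05 sample records per cell on average, the table is very sparse, which is the setting that motivates smoothing over neighboring cells.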

Example

Model                                            $\tau_1$    $\tau_2$
True Values                                        430.0     1,125.8
Argus                                              114.5       456.0
Log-linear model: Independence                     773.8     1,774.1
Log-linear model: 2-way Interactions               470.0     1,178.1
Neighborhood Method $M_a^k$                        786.8     2,146.9
Neighborhood Method $M_a^k$, structural zeros      385.4     1,674.1
Neighborhood Method $N_c^k$                        723.3     2,099.6
Neighborhood Method $N_c^k$, structural zeros      344.8     1,624.2

Results of Example

• The independence log-linear model and the neighborhood methods overestimate the two risk measures
• The Argus model underestimates them
• The all 2-way interaction log-linear Poisson model has the best estimates
• Taking into account the structural zeros in the neighborhoods yields more reasonable estimates

Conclusions

• Need to refine the neighborhood approach, define the model better and develop MLE theory
• We expect the new model to work well in multi-way tables when simple log-linear models are not valid
• Incorporate the approach into a more general regression model, the Negative Binomial Regression, which subsumes both the Poisson Risk Model and the Argus Model