Numerical Data Masking Techniques for Maintaining Sub-Domain Characteristics Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.

Numerical Data Masking
Techniques for Maintaining
Sub-Domain Characteristics
Krish Muralidhar
University of Kentucky
Rathindra Sarathy
Oklahoma State University
1
Ideal Data Utility for Masking
Numerical Data
• Ideally, the results of all analyses using the
masked data should be identical to those
using the original data.
• Impossible to achieve in practice.
2
Practical Data Utility
• Results of most analyses using the
masked data should be very similar to those
using the original data.
• Performance of the masking technique
should be predictable (theory-based
methods are preferable to ad hoc
methods)
3
Practical Assessment of Data Utility
• Univariate (Marginal) characteristics
• Maintain some sufficient statistics
– When sufficient statistics are maintained in the
masked data, analyses based on these statistics
are guaranteed to give exactly the same results with
the masked data as with the original data
• Relationships
– Linear
– Monotonic
– Non-monotonic
4
Sub-domain Characteristics
• An important component of data utility for
government agencies and users is that the
masked data maintain the characteristics of
the original data within sub-domains
• With a few exceptions, this aspect of data utility
has NOT been directly addressed when
evaluating techniques for masking numerical
data
5
Preferred Techniques
• In this study, we investigate the performance of
two techniques in maintaining sub-domain
characteristics when masking numerical data.
– Sufficiency Based perturbation approach (Burridge
2003; Muralidhar and Sarathy 2007)
– Data Shuffling (Muralidhar and Sarathy 2006)
• Why these two techniques?
– These two techniques can maintain certain
characteristics for sub-domains exactly
– They dominate the performance of other techniques
for masking numerical data
6
Sufficiency Based Linear Models
• X, S, and Y represent the confidential, non-confidential, and
masked data, respectively; ε represents the noise term.
• Σ represents the covariance matrix between variables.
• Specification of β2 dictates the extent of the relationship
between the original and masked data
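The model equation for this slide did not survive in the transcript, so the Python sketch below is not a reproduction of it. It illustrates one way, under stated assumptions, that a perturbation can keep the mean vector and covariance matrix exactly. The confidential values are split into the part explained by S and a residual, the residual is shrunk by a factor d, and noise that is exactly orthogonal to S and X in the sample restores the lost variance. A single scalar d stands in for the diagonal of β2, and the function name and the QR/Cholesky noise construction are assumptions made for this illustration.

```python
import numpy as np

def sufficiency_perturb(X, S, d, rng):
    """Illustrative perturbation that keeps sample means and covariances exactly.

    X: (n, k) confidential values, S: (n, m) non-confidential values,
    d: scalar in [0, 1] (assumed stand-in for the diagonal of beta2).
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), S])           # intercept + non-confidential data
    coef, *_ = np.linalg.lstsq(A, X, rcond=None)   # least-squares fit of X on S
    X_hat = A @ coef                               # part of X explained by S
    R = X - X_hat                                  # residual, orthogonal to [1, S]

    # Draw noise and remove every component lying in the span of [1, S, R].
    B = np.column_stack([A, R])
    Z = rng.standard_normal(X.shape)
    Z -= B @ np.linalg.lstsq(B, Z, rcond=None)[0]

    # Orthonormalise the noise, then recolour it so that E'E equals R'R exactly.
    Q, _ = np.linalg.qr(Z)
    L = np.linalg.cholesky(R.T @ R)
    E = Q @ L.T

    # d = 1 returns X unchanged; d = 0 keeps only the part explained by S plus noise.
    return X_hat + d * R + np.sqrt(1.0 - d ** 2) * E

# Small demonstration on made-up data (not the data used in the presentation).
rng = np.random.default_rng(0)
S_demo = rng.integers(0, 2, size=(300, 2)).astype(float)
X_demo = np.exp(rng.standard_normal((300, 3))) + S_demo @ rng.standard_normal((2, 3))
Y_demo = sufficiency_perturb(X_demo, S_demo, 0.5, rng)
print(np.abs(np.cov(Y_demo, rowvar=False) - np.cov(X_demo, rowvar=False)).max())
```

Under the same assumptions, calling the function separately on the rows of each sub-domain would give the same exact agreement within that sub-domain.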
7
Data Shuffling
(US Patent # 7200757)
8
Examples
• Simulated example
• Census Data
• In our presentation, we will focus on the
simulated data. The manuscript has a
complete discussion of the results for the
Census data.
9
Simulated Example
• Number of observations = 50000
• Three categorical, non-confidential variables
– Gender (Male or Female)
– Marital Status (Married or Other)
– Age Group (1 to 6)
• Total of 24 sub-groups
• Three numerical, confidential variables
– Home value (Positive, non-normal)
– Mortgage balance (Positive, non-normal)
– Net value of assets (normal)
10
Methods
• Data Shuffling
• Three Sufficiency based perturbations
– Given S
• Y is conditionally independent of X (d = 0.00)
• Y is moderately related to X (d = 0.50)
• Y is closely related to X (d = 0.90)
where d is the value of each diagonal
element of the diagonal matrix β2
11
Evaluation
• Compare performance of techniques in sub-domains
– Disclosure risk
• Identity (assessed using the procedure by Fuller (1993))
• Value (assessed by comparing proportion of variance
explained in confidential variables, before & after
masking)
– Data utility
• Marginal (or univariate) distribution
• Linear relationship between variables
• Non-linear relationship between variables
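The value-disclosure check above compares how well a confidential variable can be predicted before and after the masked data are released. A minimal sketch of that comparison follows, assuming ordinary least squares as the predictor. The function names are hypothetical and are not taken from the presentation.

```python
import numpy as np

def r_squared(target, predictors):
    """Proportion of variance in `target` explained by a linear fit on `predictors`."""
    target = np.asarray(target, dtype=float)
    n = target.shape[0]
    design = np.column_stack([np.ones(n), predictors])   # add an intercept column
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    centered = target - target.mean()
    return 1.0 - (residual @ residual) / (centered @ centered)

def value_disclosure_gain(x, S, Y):
    """Return (R^2 of x given S, R^2 of x given S and the masked data Y)."""
    before = r_squared(x, S)
    after = r_squared(x, np.column_stack([S, Y]))
    return before, after
```

If the two numbers are essentially equal, access to the masked values adds little predictive ability beyond what the non-confidential variables already provide.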
12
Risk of Identity Disclosure
13
Risk of Value Disclosure
• Perturbation with d = 0.50 or d = 0.90 results in
increased predictive accuracy.
• Does it matter?
14
Marginal (or Univariate) Distribution
(Mortgage Balance) (Entire Data Set)
15
Sub-group Marginal Distribution
(Home Value) (Gender = 0, Marital = 0, Age = 1)
16
Product Moment Correlation
17
Non-Linear Relationships
18
Rank Order Correlation
19
Comparison of the Methods
Data Shuffling
1. Disclosure risk
1. Identity disclosure risk is 1/n
2. Providing access to masked data does not improve
predictive ability [R²(X|S,Y) = R²(X|S)]
2. The mean, covariance and in fact the entire univariate
distributions of the masked data are exactly the same as those of the
original data for every sub-group and the entire data set
3. Maintains (asymptotically)
1. Covariance matrix
2. Product moment correlation matrix
3. Rank order correlation matrix
for every sub-domain and the overall data set
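The exact preservation of sub-group marginals listed above is what one would expect from a method that only reorders original values. The sketch below shows only that reordering step, assuming it is applied within each sub-group. How the perturbed values that drive the reordering are generated is not shown in these slides, so `perturbed` is treated as a given input, and the function names are hypothetical.

```python
import numpy as np

def shuffle_by_rank(original, perturbed):
    """Reorder `original` so its values follow the rank order of `perturbed`."""
    original = np.asarray(original)
    perturbed = np.asarray(perturbed)
    # Double argsort gives the rank (0 = smallest) of each perturbed value.
    ranks = np.argsort(np.argsort(perturbed, kind="stable"), kind="stable")
    # The record with the j-th smallest perturbed value gets the j-th smallest original value.
    return np.sort(original)[ranks]

def shuffle_within_groups(original, perturbed, groups):
    """Apply the rank-based reordering separately inside each sub-group."""
    original = np.asarray(original)
    perturbed = np.asarray(perturbed)
    groups = np.asarray(groups)
    out = np.empty_like(original)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = shuffle_by_rank(original[mask], perturbed[mask])
    return out
```

Because every output value is one of the original values from the same sub-group, the set of values in each sub-group, and therefore its univariate distribution, is unchanged.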
20
Comparison of the Methods
Sufficiency Based Method
1. Disclosure risk is minimized for the perturbed data set when d
= 0, but not in the other cases.
2. The univariate distribution of the masked data is very different
from that of the original data.
3. Maintains (exactly)
1. Mean Vector
2. Covariance matrix
3. Product moment correlation
for every sub-domain and the entire data set.
4. Does not maintain rank order correlation
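Product moment and rank order correlation measure different things, which is why a method can maintain one and not the other. The small sketch below computes both with NumPy, assuming no tied values. The helper names are hypothetical, and the example data are made up for illustration.

```python
import numpy as np

def product_moment_corr(a, b):
    """Pearson (product moment) correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def rank_order_corr(a, b):
    """Spearman (rank order) correlation: Pearson correlation of the ranks.

    Assumes no ties; tied values would need average ranks instead.
    """
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return product_moment_corr(ranks_a, ranks_b)

# A monotonic but non-linear relationship: the ranks agree perfectly,
# while the linear correlation falls short of 1.
x = np.arange(1.0, 11.0)
y = x ** 3
print(product_moment_corr(x, y), rank_order_corr(x, y))
```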
21
Conclusion
• If it is known that the data will be used
exclusively for traditional, parametric
analysis, sufficiency based methods offer
the best performance
• In all other cases, data shuffling offers the
best performance
22
Future Research
• We need to explore this topic further
– Our initial results suggest that both
techniques may even be capable of
maintaining all types of relationships between
the non-confidential variables and the masked
variables. Is this true for all cases?
– What if arbitrary sub-domains are created by
using numerical variables?
23
For more details on our work,
please visit:
gatton.uky.edu/faculty/muralidhar/maskingpapers
We have CDs with copies of our paper,
presentation, and the data sets.
We will be happy to share them with you.
([email protected] or [email protected])
24
We welcome your questions
or comments or suggestions.
Thank you.
25