An Ad Omnia Approach to Defining and Achieving Private

Download Report

Transcript An Ad Omnia Approach to Defining and Achieving Private

I’m in the Database, But Nobody Knows Cynthia Dwork, Microsoft Research

Many Threats to Privacy of Electronic Data

      Theft Phishing Viruses Cryptanalysis Changing Privacy Policies …

This Talk: Privacy-Preserving Data Analysis

C

?

   “First Tier” Motivating Examples Analysis of Census Data, Medical Outcomes Data, GWAS data, Epidemiology, Analysis of Vehicle Braking Records  “Second Tier” Examples Training an advertising classifier, Recommendation System, Netflix Challenge

“Pure” Privacy Problem

C

?

   Difficult Even if Curator is Angel Data are in Vault

Typical Suggestions

   “Large Set” Queries How many MSFT employees have Sickle Cell Trait (CST)?

How many MSFT employees who are not female Distinguished Scientists with very curly hair have the CST?

   Add Random Noise to True Answer Average of responses to repeated queries converges to true answer Can’t simply detect repetition (undecidable)   Detect When Answering is Unsafe Refusal can be disclosive

A Litany

AOL Search History Release (2006)

Name: Thelma Arnold Age: 62 Widow Residence: Lilburn, GA

William Weld’s Medical Record

[Sweeney02] HMO data ethnicity visit date diagnosis procedure medication total charge ZIP birth date sex name address date reg.

voter registration data party affiliation last voted

GWAS Membership

[Homer et al. ‘08]

T C

T T

SNP: Single Nucleotide (A,C,G,T) polymorphism

Reference Population

Major Allele (C): 94% Minor Allele (T): 6%

Genome-Wide Association Study

Allele frequencies for many thousands of SNPS … …

Anonymized Social Networks

[BackstromDK07]    Magic Step Isolate lightly linked-in subgraphs from rest of graph Special structure of subgraph permits finding A, B S A B J

Definitional Failures

   Failure to Cope with Auxiliary Information Existing and future databases, newspaper reports, Flikr, literature, etc.

Definitions are Syntactic and Ad Hoc  Dalenius’s Ad Omnia Guarantee (1977): Anything that can be learned about a respondent from the statistical database can be learned without access to the database  Unachievable

Parable: How Tall is Pamela Jones (Groklaw)?

An Admittedly Unreasonable Impossibility Proof    Database teaches average heights of population subgroups “PJ is 2 inches shorter than avg Swedish woman” PJ’s height learnable with the DB, not learnable without.

PJ loses privacy whether or not she is in the database   Suggests new notion of privacy: risk incurred by joining DB The outcome of any analysis is essentially equally likely, independent of whether any individual joins or refrains from joining the dataset. (The likelihood is over the choices made by the algorithm.)

Differential Privacy

[Dwork-McSherry-Nissim-Smith 2006] M gives  all C µ range( M ): Pr[ M (D1) 2 C] ≤ e Pr[ M (D2) 2 C ] Neutralizes all linkage attacks.

Composes unconditionally and automatically: Σ i  i ratio bounded Pr [response] Bad Responses: X X X http://research.microsoft.com/en-us/projects/DatabasePrivacy/

Differential Privacy

  Resilience to All Auxiliary Information Past, present, future data sources and algorithms   Low-error high-privacy DP techniques exist for many problems datamining tasks (association rules, decision trees, clustering, …), contingency tables, histograms, synthetic data sets for query logs, machine learning (boosting, statistical queries learning model, SVMs, logistic regression), various statistical estimators, network trace analysis, recommendation systems, …    Programming Platforms http://research.microsoft.com/en-us/projects/PINQ/ http://userweb.cs.utexas.edu/~shmat/shmat_nsdi10.pdf

Download today!

Can we store and share your information with health officials and researchers?

“…This information can be very helpful in monitoring regional health conditions, plan flu response, and conduct health research. By allowing the responses to the survey questions to be used for public health, education and research purposes, you can help your community.”

Snow 1854

Cholera cases Suspected pump

https://h1n1.cloudapp.net/Privacy.aspx

“Microsoft may also disclose information if required to do so by law or in the good faith belief that such action is necessary to (a) conform to the edicts of the law or comply with legal process served on Microsoft or the Site; (b) protect and defend the rights or property of Microsoft and our family of Web sites; or (c) act in urgent circumstances to protect the personal safety of users of Microsoft products or members of the public.”

Mission Creep “Think of the children!”

C

?

Never store the data!

Pan-Private Streaming Algorithms

[DNPRY10]    Private “inside and out”  Completely hide the pattern of appearances of any individual Presence, absence, frequency, etc.

Protect against mission creep, subpoena, intrusion

DiffeP: Limitations and Challenges

    Can’t study outliers Privacy erosion over multiple analyses is cumulative   Privacy erosion over multiple databases is cumulative Compare real world to one in which my info is everywhere deleted, looking at a lifetime of exposure against worst-case adversary/information/collection of databases Formally capture “reasonable” worlds?

 What are the right questions to ask about social networks?

Removing one person can affect data of many other people

Utility Implies Exposure to Harm

  Database teaches that smoking causes cancer.  Smoker S’s insurance premiums rise. But learning that smoking causes cancer is the whole point.

 Smoker S enrolls in a smoking cessation program.

  May be fine for “First-Tier” Uses, but what about “Second Tier”?

Who decides, and how?

Pause

     “Ad Omnia” definition of privacy Composes automatically and obliviously; independent of aux info Achievable, frequently with great accuracy Usable  Can program using a privacy-preserving interface  Many questions remain Is there a weaker “ad omnia” definition than differential privacy that also composes automatically?

Which Ad(s) Am I Charged For?

Å Å

More Subtle Attack

     A potentially embarrassing interest: In a Long Leg Cast Anna … models her cast on a stoop, wiggling/rubbing her casted toes. Length: 90 minutes ) User clicks on ad targeted to wealthy, whale loving user is wealthy, whale loving LLC enthusiast LLC enthusiasts User understands that interest in LLC is communicated to seller, but does not understand that wealth and love of whales are also communicated Should privacy of these attributes be protected? What about race?

Fairness: should poor children see different ads than rich children?

Wall Street Journal 4/4/2010

    User visits capitalone.com

Capital One uses tracking information via tracking network [x+1] to personalize offers

Danger:

Steering minorities into higher rates In principle, law seems to allow credit outcomes based on browsing history; may encode race   How can an ad network prevent steering?

What is fairness in classification, and how can we achieve it?

Work in Progress

D., Fiat, Hardt, Pitassi, Reingold, Zemel

Thank You!

-4R -3R -2R -R 0 R 2R 3R 4R 5R