Differential Privacy

LINDDUN

  • Linking - associating a data item to a user

  • Identifying - learning the identity of an individual

  • Non-repudiation - Being able to attribute a claim to an individual

  • Detecting - deducing the involvement of an individual through observation

  • Data Disclosure - Excessively collecting, storing, processing, or sharing data

  • Unawareness - Insufficiently informing, involving, or empowering individuals in the processing of personal data

  • Non-Compliance - Deviating from standards and best practices

Overview of Differential Privacy

Introduction

  • The subject matter is differential privacy, a technique used to enhance data security and privacy.

  • The presentation is heavily based on the work of Yuxiang Wang, particularly the examples provided.

Background and Context

Big Picture of Privacy Today
  • Various organizations like governments, companies, and research centers are involved in the collection and analysis of personal data.

  • Examples include:

    • Social networks like Facebook and LinkedIn.

    • E-commerce platforms like Amazon using viewing and buying records for recommendations.

    • Email services like Gmail using email data for targeted advertisements.

Challenges in Protecting Privacy
  • Conventional privacy measures include:

    • Control over access to information.

    • Regulation of information flow.

    • Specific purposes for data usage.

  • Typical privacy-preserving practices such as anonymization and sanitization do not offer robust guarantees for privacy as they merely limit exposure.

Anonymization Failures

Example of Anonymization Breakdown
  • An example highlighted is the failure of anonymization with the Netflix dataset:

    • The data exhibits a phenomenon referred to as "sparsity": typically, no two profiles are similar up to a given threshold $\tau$.

    • In the Netflix case, records were sufficiently unique that, when matched with external data like IMDB profiles, the identity of users could often be successfully deduced.

    • Referenced work: A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets”, 2008.
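The linkage idea behind this kind of breakdown can be sketched in a few lines. This is only a toy illustration under stated assumptions: the `released` table, the `auxiliary` ratings, and the scoring rule (counting exact rating matches) are all invented here and are much simpler than the method in the referenced paper.

```python
# Toy linkage attack on a sparse, "anonymized" ratings table.
# All data below is invented for illustration.

# Anonymized release: pseudonym -> {movie: rating}
released = {
    "user_001": {"A": 5, "B": 1, "C": 4},
    "user_002": {"A": 3, "D": 2, "E": 5},
    "user_003": {"B": 4, "F": 1, "G": 2},
}

# Auxiliary knowledge about one named person (e.g., from public reviews).
auxiliary = {"D": 2, "E": 5}


def match_score(record, known):
    """Count how many known (movie, rating) pairs appear in the record."""
    return sum(1 for movie, rating in known.items() if record.get(movie) == rating)


def best_match(table, known):
    """Return the pseudonym whose record agrees most with the known ratings."""
    return max(table, key=lambda pseudonym: match_score(table[pseudonym], known))


candidate = best_match(released, auxiliary)
print(candidate)
```

Because the records are sparse, even two known ratings single out one pseudonym in this toy table, which is the intuition behind the sparsity argument.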

Other Cases of De-anonymization
  • Instances of potential re-identification include:

    • Medical records of a Governor.

    • AOL query records linked to an individual.

    • Individual DNA profiles in genomic studies.

  • There are potentially thousands of other methods for gathering sufficient background information to identify individuals.

  • The difficulty is that while a single dataset can be anonymized, it is nearly impossible to anonymize all datasets collectively.

Intuitions around Data Privacy

  • Two fundamental intuitions regarding privacy:

    1. Dalenius principle: The release of statistics should not allow for improved accuracy in deducing individual private information beyond what was known before the release.

    2. Gavison principle: Privacy can be thought of as the ability to 'blend into a crowd', minimizing individual visibility.

Survey Example for Privacy Considerations

  • Sample survey questions related to preferences and demographic data:

    1. Do you like listening to Justin Bieber?

    2. How many Justin Bieber albums do you own?

    3. What is your gender?

    4. What is your age?

  • If music taste is sensitive information, individuals might feel more secure if they remain anonymous.

Privacy Needs from Surveys

  • Perspectives on privacy based on survey data:

    • Individuals desire assurance that their responses would not influence the published results:

    • $Q(D_{I - me}) = Q(D_I)$

    • The probability that an attacker deduces an individual's secret from the released results $R$ should equal the probability of deducing it when nothing is released at all:

    • $\Pr(\text{secret}(me) \mid R) = \Pr(\text{secret}(me))$

Limitations and Challenges of Privacy

Reasons for Current Privacy Limitations
  1. If individual inputs do not affect the outcomes, then the results lose their utility.

  2. Moreover, if population-level trends are published, an individual's situation can be inferred with increased probability even when that individual's response was never collected:

    • $\Pr(\text{secret}(me) \mid \text{secret}(Pop)) > \Pr(\text{secret}(me))$

  3. An attacker who also holds auxiliary facts relating an individual to population statistics (e.g., average age, most common gender) can narrow down information about that individual once those statistics are released.

    • For instance:

      • $\text{age}(me) = 2 \times \text{mean age}$

      • $\text{gender}(me) \neq \text{mode gender}$
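The age example can be worked through as arithmetic. The numbers below are invented; the point of the sketch is only that the attacker never needs the individual's own record.

```python
from statistics import mean

# Invented survey ages; "me" did not take part in the survey.
ages_without_me = [20, 22, 19, 25]

# Statistic that the survey publishes.
released_mean = mean(ages_without_me)

# Attacker's auxiliary knowledge: age(me) = 2 * mean age.
inferred_age = 2 * released_mean
print(released_mean, inferred_age)
```

Withholding one's own response does not help here, because the inference relies only on the published statistic and the attacker's side knowledge.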

Concept of Differential Privacy

Introduction and Formal Description
  • Introduced by Cynthia Dwork in 2006.

  • The aim of differential privacy is that the probability of a noisy released result taking any particular value is almost the same whether or not an individual's data was included.

  • Definition of $\epsilon$-differential privacy: a randomized mechanism $M$ satisfies it if

    • $\frac{\Pr(M(D) = C)}{\Pr(M(D_{\pm i}) = C)} \leq e^{\epsilon}$ for any $D_{\pm i}$ with $|D_{\pm i} - D| \leq 1$ and any $C \in \text{Range}(M)$.
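For a mechanism with finitely many outputs, the ratio condition can be checked directly. The sketch below assumes the output distributions on two adjacent datasets are known in full. The helper names and the example mechanism (report the true yes/no answer with probability 0.75, a randomized-response-style scheme that these notes do not otherwise cover) are hypothetical.

```python
import math


def max_privacy_loss(p, q):
    """Largest |ln(p[c] / q[c])| over all outputs c.

    p and q map each possible output to its probability under two
    adjacent datasets. Every output is assumed to have non-zero
    probability under both.
    """
    return max(abs(math.log(p[c] / q[c])) for c in p)


def satisfies_dp(p, q, epsilon):
    """Check Pr(M(D) = C) / Pr(M(D') = C) <= e^epsilon in both directions."""
    # Small tolerance guards against floating-point rounding.
    return max_privacy_loss(p, q) <= epsilon + 1e-12


# Hypothetical mechanism: report the true answer with probability 0.75,
# otherwise report the opposite answer.
dist_if_yes = {"yes": 0.75, "no": 0.25}
dist_if_no = {"yes": 0.25, "no": 0.75}

print(round(max_privacy_loss(dist_if_yes, dist_if_no), 4))
print(satisfies_dp(dist_if_yes, dist_if_no, math.log(3)))
print(satisfies_dp(dist_if_yes, dist_if_no, 0.5))
```

The worst-case ratio for this toy mechanism is 0.75 / 0.25 = 3, so it meets the condition for $\epsilon = \ln 3$ but not for a smaller budget such as 0.5.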

Common Misunderstandings

  • It is a prevalent misconception that differential privacy alone resolves all privacy-related issues:

    • Differential privacy does not eliminate every harm that a study's results may cause an individual.

    • It limits an attacker's ability to guess whether an individual contributed to a specific dataset, under certain assumptions regarding group structure.

  • An illustrative case:

    • Mary, a smoker, is impacted by studies that demonstrate links between smoking and cancer:

    • Her insurance rates may rise irrespective of her participation in the study, which shows that differential privacy cannot prevent harm that stems from the study's conclusions rather than from her own data.

Summary of Differential Privacy Goals

  • The primary intent of differential privacy is to separate the harm that can result from data analysis into its sources and to confine any harm to what the results themselves reveal:

    • It protects personally identifiable information to the highest degree possible, given that individuals provide data only about themselves.

Formal Framework of Differential Privacy

Sensitivity of Functions
  • Sensitivity is defined mathematically as:

    • $\triangle f = \max_{\text{adjacent}(x, x')} |f(x) - f(x')|$

    • Adjacent databases are those differing by a single row.

    • Sensitivity determines the extent to which a single person's data can influence the output of a given function.

Examples of Function Sensitivity
  1. Querying the number of female respondents has a sensitivity of 1, because adding or removing one individual changes the count by at most 1.

  2. For the number of albums owned by respondents, the sensitivity as of 2022 is 6, because a single respondent can own at most 6 different albums (Mary had 6).
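The two sensitivities can be illustrated by removing one row at a time from a small dataset. The survey rows and helper names below are invented for this sketch. The check also only examines neighbours of one concrete dataset, so it gives a lower bound on the worst-case sensitivity rather than the sensitivity itself.

```python
def count_female(rows):
    """Counting query: number of respondents who answered 'F'."""
    return sum(1 for row in rows if row["gender"] == "F")


def total_albums(rows):
    """Sum query: total number of albums owned across respondents."""
    return sum(row["albums"] for row in rows)


def removal_sensitivity(query, rows):
    """Largest change in query(rows) when any single row is removed."""
    full = query(rows)
    return max(abs(full - query(rows[:i] + rows[i + 1:])) for i in range(len(rows)))


# Invented survey rows; no respondent owns more than 6 albums.
survey = [
    {"name": "Mary", "gender": "F", "albums": 6},
    {"name": "Alex", "gender": "M", "albums": 0},
    {"name": "Sam", "gender": "F", "albums": 2},
]

print(removal_sensitivity(count_female, survey))
print(removal_sensitivity(total_albums, survey))
```

On this data the counting query changes by at most 1 and the album total by at most 6, matching the two examples.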

Implementation of Differential Privacy
  • The goal is to blur the differences between the target function's outputs on adjacent databases by adding noise.

  • Key challenges include:

    • Adding excessive noise diminishes the value of the function output.

    • Insufficient noise fails to provide adequate privacy protection.

Laplace Mechanism
  • To attain $\epsilon$-differential privacy, use:

    • A noise function represented as $Lap(b)$ where $b = \frac{\triangle f}{\epsilon}$.

    • The noise scale is proportional to the sensitivity $\triangle f$ and inversely proportional to the privacy budget $\epsilon$.

    • Laplace distribution: The probability density function is:

    • $f(x \mid \mu, b) = \frac{1}{2b} e^{-\frac{|x - \mu|}{b}}$.
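A minimal sketch of the mechanism using only the Python standard library. The function names are hypothetical. Sampling relies on the fact that the difference of two independent Exp(1) variables follows a standard Laplace distribution, and the fixed seed is there only to make the sketch repeatable.

```python
import math
import random


def sample_laplace(scale, rng):
    """Draw one sample from Lap(scale) centred at 0.

    The difference of two independent Exp(1) draws is standard Laplace;
    multiplying by `scale` gives Lap(scale).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_pdf(x, mu, b):
    """Density f(x | mu, b) = exp(-|x - mu| / b) / (2b)."""
    return math.exp(-abs(x - mu) / b) / (2 * b)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value + Lap(sensitivity / epsilon)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + sample_laplace(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
noisy = laplace_mechanism(true_value=42, sensitivity=1, epsilon=0.5, rng=rng)
print(noisy)
```

Passing the random generator explicitly keeps the sketch testable. A real deployment would need a cryptographically secure noise source and care with floating-point behaviour, both of which are outside the scope of this sketch.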

Mechanism Validity
  • That the mechanism satisfies $\epsilon$-differential privacy can be verified as follows:

    • Comparing $\Pr(f(x) + Lap(\frac{\triangle f}{\epsilon}) = y)$ with $\Pr(f(x') + Lap(\frac{\triangle f}{\epsilon}) = y)$ for adjacent $x$ and $x'$ yields a ratio of at most $e^{\epsilon}$, which confirms that the differential privacy condition holds.
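The comparison can be written out as a sketch of the standard argument. Since the output is continuous, densities are used in place of point probabilities: $p_x(y)$ denotes the density of $f(x) + Lap(\frac{\triangle f}{\epsilon})$ at $y$, a notation introduced here for convenience.

```latex
\begin{aligned}
\frac{p_{x}(y)}{p_{x'}(y)}
  &= \frac{\frac{\epsilon}{2\triangle f}\exp\!\left(-\frac{\epsilon\,|y - f(x)|}{\triangle f}\right)}
          {\frac{\epsilon}{2\triangle f}\exp\!\left(-\frac{\epsilon\,|y - f(x')|}{\triangle f}\right)} \\
  &= \exp\!\left(\frac{\epsilon\,\bigl(|y - f(x')| - |y - f(x)|\bigr)}{\triangle f}\right) \\
  &\leq \exp\!\left(\frac{\epsilon\,|f(x) - f(x')|}{\triangle f}\right)
   \leq e^{\epsilon}
\end{aligned}
```

The first inequality is the triangle inequality, and the second uses $|f(x) - f(x')| \leq \triangle f$ for adjacent $x$ and $x'$.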

Example: Counting Queries
  • To count the number of participants in a dataset, e.g., female respondents:

    • Sensitivity is 1, so noise is added as $Lap(\frac{1}{\epsilon})$.

    • A very small $\epsilon$ implies a flatter Laplace curve: the noise is larger, so privacy is stronger but the released count is less accurate.
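The effect of $\epsilon$ on a counting query can be tabulated without sampling. The sketch uses two closed-form properties of the Laplace distribution: the density peaks at $\frac{1}{2b}$, and $\Pr(|Lap(b)| > t) = e^{-t/b}$. The function names and the error threshold of 5 are arbitrary choices made for illustration.

```python
import math


def noise_scale(epsilon, sensitivity=1.0):
    """Scale b of the Laplace noise for a query with the given sensitivity."""
    return sensitivity / epsilon


def prob_error_exceeds(t, epsilon, sensitivity=1.0):
    """Pr(|Lap(b)| > t) = exp(-t / b) for t >= 0."""
    return math.exp(-t / noise_scale(epsilon, sensitivity))


for epsilon in (0.1, 1.0, 10.0):
    b = noise_scale(epsilon)
    peak = 1 / (2 * b)  # height of the Laplace density at its centre
    tail = prob_error_exceeds(5, epsilon)
    print(f"epsilon={epsilon:4}: b={b:5.1f}  peak={peak:5.2f}  Pr(|noise|>5)={tail:.3f}")
```

Smaller $\epsilon$ gives a larger scale, a lower peak, and a higher chance that the released count is off by more than 5.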

Types of Differential Privacy

  1. Local Differential Privacy: Noise is added at the individual client side before data is sent to the server.

  2. Central Differential Privacy: Noise is added on the server side using the aggregated data from multiple inputs.
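The difference between the two models can be sketched for a simple yes/no count. The notes do not specify a local randomizer, so per-client Laplace noise is assumed here purely for a like-for-like comparison. The data, seed, and function names are invented.

```python
import random


def sample_laplace(scale, rng):
    """Lap(scale) sample as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def central_count(bits, epsilon, rng):
    """Server sees the raw bits, sums them, and adds one Lap(1/epsilon) draw."""
    return sum(bits) + sample_laplace(1.0 / epsilon, rng)


def local_count(bits, epsilon, rng):
    """Each client perturbs its own bit with Lap(1/epsilon) before sending;
    the server only sums the already-noisy reports."""
    reports = [bit + sample_laplace(1.0 / epsilon, rng) for bit in bits]
    return sum(reports)


rng = random.Random(0)
bits = [1] * 300 + [0] * 700  # invented data: 300 of 1000 clients answer "yes"
epsilon = 1.0

print("true count:   ", sum(bits))
print("central model:", round(central_count(bits, epsilon, rng), 1))
print("local model:  ", round(local_count(bits, epsilon, rng), 1))
```

In the local model the server never sees raw data, but the noise from all clients accumulates in the sum, so the estimate is typically much less accurate than in the central model at the same $\epsilon$.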