Differential Privacy
LINDDUN
Linking - associating a data item to a user
Identifying - learning the identity of an individual
Non-repudiation - Being able to attribute a claim to an individual
Detecting - deducing the involvement of an individual through observation
Data Disclosure - Excessively collecting, storing, processing, or sharing data
Unawareness - Insufficiently informing, involving, or empowering individuals in the processing of personal data
Non-Compliance - Deviating from standards and best practices
Overview of Differential Privacy
Introduction
The subject matter is differential privacy, a technique used to enhance data security and privacy.
The presentation is heavily based on the work of Yuxiang Wang, particularly the examples provided.
Background and Context
Big Picture of Privacy Today
Various organizations like governments, companies, and research centers are involved in the collection and analysis of personal data.
Examples include:
Social networks like Facebook and LinkedIn.
E-commerce platforms like Amazon using viewing and buying records for recommendations.
Gmail using email data for targeted advertisements.
Challenges in Protecting Privacy
Conventional privacy measures include:
Control over access to information.
Regulation of information flow.
Specific purposes for data usage.
Typical privacy-preserving practices such as anonymization and sanitization do not offer robust guarantees for privacy as they merely limit exposure.
Anonymization Failures
Example of Anonymization Breakdown
An example highlighted is the failure of anonymization with the Netflix dataset:
The data exhibits a property referred to as "sparsity": typically, no two profiles are similar up to a given similarity threshold.
In the Netflix case, records were sufficiently unique that, when matched with external data like IMDB profiles, the identity of users could often be successfully deduced.
Referenced work: A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets”, 2008.
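The linkage idea behind this kind of de-anonymization can be sketched in a few lines. This is a deliberately simplified toy, not the scoring algorithm from the referenced paper: the data, the `similarity` and `link` functions, and the `min_gap` parameter are all hypothetical and exist only for illustration.

```python
def similarity(public_profile, anon_record):
    """Count the movies on which both profiles give the same rating."""
    return sum(
        1
        for movie, rating in public_profile.items()
        if anon_record.get(movie) == rating
    )


def link(public_profile, anonymized, min_gap=2):
    """Return the anonymized ID that best matches the public profile.

    A match is only reported when the best score beats the runner-up by
    at least `min_gap`, i.e. when the record is "unique enough".
    """
    scores = sorted(
        ((similarity(public_profile, record), record_id)
         for record_id, record in anonymized.items()),
        reverse=True,
    )
    best_score, best_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    return best_id if best_score - runner_up >= min_gap else None


# Hypothetical "anonymized" ratings keyed by an opaque ID.
anonymized = {
    "u1": {"A": 5, "B": 3, "C": 4, "D": 1},
    "u2": {"A": 2, "E": 5, "F": 4},
    "u3": {"B": 3, "G": 2, "H": 5},
}

# Hypothetical public profile (for example, reviews posted under a real name).
public = {"A": 5, "C": 4, "D": 1}

print(link(public, anonymized))      # a clear match is linked to one ID
print(link({"B": 3}, anonymized))    # an ambiguous profile yields None
```

Because the records are sparse, even three known ratings single out one record here, while a single common rating does not.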
Other Cases of De-anonymization
Instances of potential re-identification include:
Medical records of a Governor.
AOL query records linked to an individual.
Individual DNA profiles in genomic studies.
There are potentially thousands of other methods for gathering sufficient background information to identify individuals.
The difficulty is that while a single dataset can be anonymized, it is nearly impossible to anonymize all datasets collectively.
Intuitions around Data Privacy
Two fundamental intuitions regarding privacy:
Dalenius principle: The release of statistics should not allow for improved accuracy in deducing individual private information beyond what was known before the release.
Gavison principle: Privacy can be thought of as the ability to 'blend into a crowd', minimizing individual visibility.
Survey Example for Privacy Considerations
Sample survey questions related to preferences and demographic data:
Do you like listening to Justin Bieber?
How many Justin Bieber albums do you own?
What is your gender?
What is your age?
If music taste is sensitive information, individuals might feel more secure if they remain anonymous.
Privacy Needs from Surveys
Perspectives on privacy based on survey data:
Individuals want assurance that their responses do not influence the published results.
The probability of an attacker deducing an individual's private information from the released results should equal the probability when nothing is released at all: Pr(secret(me) | results) = Pr(secret(me))
Limitations and Challenges of Privacy
Reasons for Current Privacy Limitations
If individual inputs do not affect the outcomes, then the results lose their utility.
Conversely, even when an individual's response is excluded, published population-level trends can still reveal that individual's situation with significant probability:
Pr(secret(me)|secret(Pop)) > Pr(secret(me))
Even general facts known to an attacker (for instance, the average age or the gender distribution) help narrow down information about an individual.
Concept of Differential Privacy
Introduction and Formal Description
Introduced by Cynthia Dwork in 2006.
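The aim of differential privacy is that the probability of a noisy released result taking any particular value is almost the same whether or not an individual's data was included.
Definition of $\epsilon$-differential privacy: a randomized mechanism $M$ is $\epsilon$-differentially private if
$\Pr(M(D) = C) / \Pr(M(D_{\pm i}) = C) \le e^{\epsilon}$ for any database $D$, any $D_{\pm i}$ obtained from $D$ by adding or removing the data of one individual $i$, and any output $C$.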
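For a mechanism with finitely many outputs, the definition can be checked directly by comparing output probabilities with and without one individual. The following is a minimal sketch under that assumption; `privacy_loss` and the two example distributions are hypothetical names and numbers, not taken from the presentation. Since the definition must hold for every adjacent pair, the ratio is checked in both directions.

```python
import math


def privacy_loss(p_with, p_without):
    """Smallest eps such that, for every output C,
    Pr[C | with i] / Pr[C | without i] and its inverse are at most e^eps.

    Both arguments map each possible output to its probability.
    """
    worst = 0.0
    for output in set(p_with) | set(p_without):
        p = p_with.get(output, 0.0)
        q = p_without.get(output, 0.0)
        if p == 0.0 and q == 0.0:
            continue  # output never occurs under either database
        if p == 0.0 or q == 0.0:
            return math.inf  # one database rules this output out entirely
        worst = max(worst, abs(math.log(p / q)))
    return worst


# Hypothetical output distributions of some mechanism M on D and on D without i.
with_i = {"yes": 0.6, "no": 0.4}
without_i = {"yes": 0.5, "no": 0.5}

eps = privacy_loss(with_i, without_i)
print(f"this pair of databases is distinguishable up to eps = {eps:.4f}")
```

A mechanism is $\epsilon$-differentially private only if this value stays at or below $\epsilon$ for all adjacent pairs, not just the one pair examined here.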
Common Misunderstandings
It is a prevalent misconception that differential privacy alone resolves all privacy-related issues:
Differential privacy does not eliminate every harm that a study can cause to an individual, whether or not that individual participated.
What it does limit is an attacker's ability to infer whether an individual contributed to a specific dataset, under certain assumptions regarding group structure.
An illustrative case:
Mary, a smoker, is impacted by studies that demonstrate links between smoking and cancer:
Her insurance rates may rise irrespective of her participation in the study, thus indicating the limits of differential privacy in preemptively mitigating harm.
Summary of Differential Privacy Goals
The primary intent of differential privacy is to separate the harm that can arise from data analysis into its sources and to restrict it strictly to what the results themselves reveal, not to an individual's participation:
It protects personally identifiable information to the highest degree possible, given that each individual provides data only about themselves.
Formal Framework of Differential Privacy
Sensitivity of Functions
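Sensitivity is defined mathematically as: $\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$, where the maximum is taken over all pairs of adjacent databases $D$ and $D'$.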
Adjacent databases are those differing by a single row.
Sensitivity determines the extent to which a single person's data can influence the output of a given function.
Examples of Function Sensitivity
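Querying the number of female respondents yields a sensitivity of 1, because adding or removing one individual changes the count by at most 1.
For the number of albums owned by respondents, the sensitivity is 6 as of 2022: a single respondent can own at most 6 different albums (Mary had all 6), so one person changes the total by at most 6.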
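The two sensitivities above can be illustrated on a toy survey. This is only a sketch: the rows, the `cap` parameter, and the helper `observed_change` are hypothetical, and trying a handful of neighbours of one dataset only gives a lower bound on the true sensitivity, which is a maximum over all adjacent databases.

```python
def count_female(rows):
    """Counting query: number of respondents whose gender is 'F'."""
    return sum(1 for row in rows if row["gender"] == "F")


def total_albums(rows, cap=6):
    """Sum query: total albums owned, with each answer capped at `cap`."""
    return sum(min(row["albums"], cap) for row in rows)


def observed_change(query, rows, candidate_rows):
    """Largest change in `query` when one row is removed from `rows`
    or one row from `candidate_rows` is added to it."""
    base = query(rows)
    changes = [
        abs(base - query(rows[:i] + rows[i + 1:])) for i in range(len(rows))
    ]
    changes += [abs(query(rows + [extra]) - base) for extra in candidate_rows]
    return max(changes)


# Hypothetical survey responses.
survey = [
    {"gender": "F", "albums": 6},
    {"gender": "M", "albums": 0},
    {"gender": "F", "albums": 2},
]
# Hypothetical extreme respondents who might be added.
extremes = [{"gender": "F", "albums": 6}, {"gender": "M", "albums": 0}]

print(observed_change(count_female, survey, extremes))
print(observed_change(total_albums, survey, extremes))
```

Capping each answer in `total_albums` is what keeps the sensitivity bounded; without a known maximum, one respondent could change the sum by an arbitrary amount.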
Implementation of Differential Privacy
The goal is to blur the distinctions within the target function using noise addition.
Key challenges include:
Adding excessive noise diminishes the value of the function output.
Insufficient noise fails to provide adequate privacy protection.
Laplace Mechanism
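To attain $\epsilon$-differential privacy, use:
A noisy release $M(D) = f(D) + \eta$, where $\eta \sim \mathrm{Lap}(\Delta f / \epsilon)$.
The noise scale is proportional to the sensitivity $\Delta f$ and inversely proportional to the privacy budget $\epsilon$.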
Laplace distribution: The probability density function of $\mathrm{Lap}(b)$ with scale $b$ is:
$p(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$.
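A minimal sketch of the Laplace mechanism, assuming the standard form in which zero-mean Laplace noise with scale (sensitivity / epsilon) is added to the true answer. The function names are hypothetical. The sampler uses the fact that the difference of two independent exponential variables with the same rate is Laplace-distributed; it is meant for illustration and is not hardened against floating-point attacks.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-mean Laplace distribution.

    If X and Y are independent Exponential(rate = 1/scale) variables,
    then X - Y follows Laplace(0, scale).
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


# Example: a query with true answer 42, sensitivity 1, privacy budget 0.5.
rng = random.Random(0)
print(laplace_mechanism(42, sensitivity=1, epsilon=0.5, rng=rng))
```

Passing an explicit `random.Random` instance keeps the example reproducible; a real deployment would need a cryptographically sound noise source.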
Mechanism Validity
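That the Laplace mechanism satisfies $\epsilon$-differential privacy can be shown as follows:
For adjacent databases $D$ and $D'$, any output $c$, and scale $b = \Delta f / \epsilon$, the density of $M(D)$ at $c$ compared to that of $M(D')$ at $c$ gives
$\frac{p(M(D) = c)}{p(M(D') = c)} = \frac{\exp(-|c - f(D)|/b)}{\exp(-|c - f(D')|/b)} = \exp\left(\frac{|c - f(D')| - |c - f(D)|}{b}\right) \le \exp\left(\frac{|f(D) - f(D')|}{b}\right) \le \exp\left(\frac{\Delta f}{b}\right) = e^{\epsilon}$,
where the first inequality is the triangle inequality and the second is the definition of sensitivity. This confirms that the differential privacy condition holds.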
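The bound can also be checked numerically for a concrete case. The sketch below assumes a scalar query whose answers on two adjacent databases are 10 and 11 (hypothetical values), with sensitivity 1 and epsilon 0.5, and evaluates the ratio of the two Laplace densities on a grid of outputs.

```python
import math


def laplace_pdf(x, scale):
    """Density of the zero-mean Laplace distribution with the given scale."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)


def max_density_ratio(answer_d, answer_adjacent, sensitivity, epsilon, grid):
    """Largest ratio p(M(D) = c) / p(M(D') = c) over the outputs c in grid."""
    scale = sensitivity / epsilon
    return max(
        laplace_pdf(c - answer_d, scale) / laplace_pdf(c - answer_adjacent, scale)
        for c in grid
    )


epsilon = 0.5
grid = [i / 10.0 for i in range(-500, 501)]  # outputs from -50.0 to 50.0
ratio = max_density_ratio(10.0, 11.0, sensitivity=1.0, epsilon=epsilon, grid=grid)

print(f"largest ratio on the grid: {ratio:.6f}")
print(f"bound e^epsilon:           {math.exp(epsilon):.6f}")
```

The ratio reaches the bound for every output at or below the smaller answer, which is why the noise scale cannot be reduced without weakening the guarantee.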
Example: Counting Queries
To count the number of participants in a dataset, e.g., female respondents:
Sensitivity is 1, so the noise added is drawn from $\mathrm{Lap}(1/\epsilon)$.
A very small $\epsilon$ implies a flatter, wider Laplace curve: more noise is added, which strengthens privacy at the cost of accuracy.
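The trade-off can be made concrete without sampling, using the Laplace tail probability $\Pr(|\eta| > t) = \exp(-t/b)$. The sketch below computes, for a few illustrative values of epsilon, the noise scale and the error that the noisy count stays within 95% of the time; the function names and the chosen epsilon values are assumptions made for this example.

```python
import math


def laplace_scale(sensitivity, epsilon):
    """Scale b of the Laplace noise used for the given sensitivity and epsilon."""
    return sensitivity / epsilon


def error_bound(sensitivity, epsilon, confidence=0.95):
    """Value t such that |noise| <= t with probability `confidence`.

    Follows from Pr(|noise| > t) = exp(-t / b) for Laplace noise of scale b.
    """
    scale = laplace_scale(sensitivity, epsilon)
    return scale * math.log(1.0 / (1.0 - confidence))


# Counting query: sensitivity is 1.
for eps in (0.01, 0.1, 1.0):
    print(
        f"epsilon={eps:<5} scale={laplace_scale(1, eps):8.1f} "
        f"95% error bound={error_bound(1, eps):8.1f}"
    )
```

Dividing epsilon by ten multiplies both the scale and the error bound by ten, which is the accuracy cost of the stronger guarantee.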
Types of Differential Privacy
Local Differential Privacy: Noise is added at the individual client side before data is sent to the server.
Central Differential Privacy: Noise is added on the server side using the aggregated data from multiple inputs.
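The difference between the two settings can be sketched on a yes/no survey question. For the local setting, this example assumes randomized response as the client-side mechanism (a standard choice, but not one named in the presentation); for the central setting, it assumes Laplace noise with scale 1/epsilon added to the exact count. All function names and the data are hypothetical.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Local DP: each client reports the true bit with probability
    e^eps / (1 + e^eps) and the flipped bit otherwise."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def local_count(bits, epsilon, rng):
    """Server only sees noisy reports and debiases their sum."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    reports = [randomized_response(b, epsilon, rng) for b in bits]
    # E[sum(reports)] = p*k + (1-p)*(n-k); solve for the true count k.
    return (sum(reports) - (1.0 - p) * len(bits)) / (2.0 * p - 1.0)


def central_count(bits, epsilon, rng):
    """Server sees the raw bits and adds Laplace(1/epsilon) noise to the count."""
    # Difference of two Exponential(rate=epsilon) draws is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return sum(bits) + noise


rng = random.Random(1)
answers = [1] * 300 + [0] * 700  # hypothetical survey: 300 "yes" out of 1000

print("true count:   ", sum(answers))
print("local DP:     ", round(local_count(answers, 1.0, rng), 1))
print("central DP:   ", round(central_count(answers, 1.0, rng), 1))
```

For the same epsilon, the local estimate is typically much noisier, because every client adds noise independently; in exchange, the server never has to be trusted with the raw answers.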