Differential Privacy

LINDDUN

Linking - associating a data item to a user
Identifying - learning the identity of an individual
Non-repudiation - Being able to attribute a claim to an individual
Detecting - deducing the involvement of an individual through observation
Data Disclosure - Excessively collecting, storing, processing, or sharing data
Unawareness - Insufficiently informing, involving, or empowering individuals in the processing of personal data
Non-Compliance - Deviating from standards and best practices

The subject matter is differential privacy, a technique used to enhance data security and privacy.
The presentation is heavily based on the work of Yuxiang Wang, particularly the examples provided.

Various organizations like governments, companies, and research centers are involved in the collection and analysis of personal data.
Examples include:
- Social networks like Facebook and LinkedIn.
- E-commerce platforms like Amazon using viewing and buying records for recommendations.
- Email services like Gmail using email data for targeted advertisements.

Conventional privacy measures include:
- Control over access to information.
- Regulation of information flow.
- Specific purposes for data usage.
Typical privacy-preserving practices such as anonymization and sanitization do not offer robust guarantees for privacy as they merely limit exposure.

An example highlighted is the failure of anonymization with the Netflix dataset:
- A phenomenon referred to as "sparsity" of data exists: typically, no two profiles are similar up to a given threshold $\tau$.
- In the Netflix case, records were sufficiently unique that, when matched with external data like IMDB profiles, the identity of users could often be successfully deduced.
- Referenced work: A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets”, 2008.
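
The linkage idea above can be illustrated with a toy sketch. This is not the scoring algorithm from the referenced paper; the similarity measure, the record names, and the threshold value are all assumptions made for illustration. An anonymized record is linked to a public profile only when it is the single record that matches above the threshold.

```python
def similarity(aux, record):
    """Fraction of the auxiliary ratings that appear in `record` with the same value."""
    if not aux:
        return 0.0
    matches = sum(1 for movie, rating in aux.items() if record.get(movie) == rating)
    return matches / len(aux)


def link(aux, anonymized, tau=0.8):
    """Return the id of the unique record matching `aux` at or above `tau`, else None."""
    scored = sorted(
        ((similarity(aux, rec), rid) for rid, rec in anonymized.items()),
        reverse=True,
    )
    best_score, best_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Sparsity: a match counts only if no other record is also similar enough.
    if best_score >= tau and runner_up < tau:
        return best_id
    return None


# Hypothetical "anonymized" release: names removed, ratings kept.
anonymized = {
    "user_17": {"Heat": 5, "Alien": 4, "Amelie": 2, "Brazil": 5},
    "user_42": {"Heat": 3, "Titanic": 5, "Amelie": 4},
    "user_99": {"Alien": 1, "Brazil": 2, "Titanic": 4},
}
# Hypothetical public profile that carries a real name.
public_profile = {"Heat": 5, "Alien": 4, "Brazil": 5}

print(link(public_profile, anonymized))
```

Because the records are sparse, three public ratings are enough to single out one record in this toy data.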

Instances of potential re-identification include:
- Medical records of a Governor.
- AOL query records linked to an individual.
- Individual DNA profiles in genomic studies.
There are potentially thousands of other methods for gathering sufficient background information to identify individuals.
The difficulty is that while a single dataset can be anonymized, it is nearly impossible to anonymize all datasets collectively.

Two fundamental intuitions regarding privacy:
1. Dalenius principle: The release of statistics should not allow for improved accuracy in deducing individual private information beyond what was known before the release.
2. Gavison principle: Privacy can be thought of as the ability to 'blend into a crowd', minimizing individual visibility.

Sample survey questions related to preferences and demographic data:
1. Do you like listening to Justin Bieber?
2. How many Justin Bieber albums do you own?
3. What is your gender?
4. What is your age?
If music taste is sensitive information, individuals might feel more secure if they remain anonymous.

Perspectives on privacy based on survey data:
- Individuals desire assurance that their responses would not influence the published results:
- $Q(D_{I - me}) = Q(D_I)$
- The probability of an attacker deducing personalized information from the released results should align with that of not having their information released at all:
- $Pr(secret(me)|R) = Pr(secret(me))$

If individual inputs do not affect the outcomes, then the results lose their utility.
Moreover, even when an individual's response is not included, publicly available statistical trends can still allow that individual's situation to be inferred with significant probability:
- $Pr(secret(me)|secret(Pop)) > Pr(secret(me))$
An attacker who already knows general facts relating an individual to the population can combine them with released statistics (e.g., average age, most common gender) to narrow down information about that individual.
- For instance, the attacker may know that:
  - $age(me) = 2 \times \text{mean age}$
  - $gender(me) \neq \text{mode gender}$
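
As a small worked illustration of this inference, the sketch below combines released statistics with the two pieces of side knowledge. The function name, the released values, and the two-valued gender list are assumptions chosen only to keep the example short.

```python
def infer_from_release(mean_age, mode_gender, genders=("F", "M")):
    """Apply the attacker's side knowledge to the released statistics.

    Side knowledge (from the example): age(me) = 2 * mean age,
    and gender(me) differs from the mode gender.
    """
    age = 2 * mean_age
    candidate_genders = [g for g in genders if g != mode_gender]
    return age, candidate_genders


# Hypothetical release: mean age 21.0, mode gender "F".
print(infer_from_release(21.0, "F"))  # (42.0, ['M'])
```

The individual's data never entered the release, yet the release plus the side knowledge pins down both attributes.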

Differential privacy was introduced by Cynthia Dwork in 2006.
The aim of differential privacy is that the probability of a noisy released result taking any particular value is almost the same whether or not an individual's data was included.
Definition of $\epsilon$-differential privacy:
- $\frac{Pr(M(D) = C)}{Pr(M(D_{\pm i}) = C)} \leq e^{\epsilon}$ for any $|D_{\pm i} - D| \leq 1$ and any $C \in Range(M)$, where $D_{\pm i}$ is $D$ with individual $i$'s record added or removed.

It is a prevalent misconception that differential privacy alone resolves all privacy-related issues:
- Differential privacy does not eliminate harm that occurs regardless of whether an individual participates in a study.
- It limits how well an attacker can guess whether an individual contributed to a specific dataset, under certain assumptions regarding group structure.
An illustrative case:
- Mary, a smoker, is impacted by studies that demonstrate links between smoking and cancer:
- Her insurance rates may rise irrespective of her participation in the study, thus indicating the limits of differential privacy in preemptively mitigating harm.

The primary intent of differential privacy is to separate the harm that data analysis can cause from an individual's participation, restricting any harm to what the results themselves reveal:
- It protects personally identifiable information to the highest degree possible, provided that each individual contributes data only about themselves.

Sensitivity is defined mathematically as:
- $\triangle f = \max_{adjacent(x,x')} |f(x) - f(x')|$
- Adjacent databases are those differing by a single row.
- Sensitivity determines the extent to which a single person's data can influence the output of a given function.

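Querying the number of female respondents has a sensitivity of 1, because adding or removing one individual changes the count by at most 1.
For the number of albums owned by respondents, the sensitivity as of 2022 is 6 (Mary had 6 different albums), because a single respondent can change the total by up to 6.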
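
The two sensitivities can be checked on a toy survey. This is a minimal sketch with hypothetical rows. It measures only how much the query changes when one row is removed from this particular dataset, so it gives a lower bound on the sensitivity $\triangle f$, which is a maximum over all pairs of adjacent databases.

```python
def count_female(rows):
    """Counting query: number of female respondents."""
    return sum(1 for r in rows if r["gender"] == "F")


def total_albums(rows):
    """Sum query: total number of albums owned by all respondents."""
    return sum(r["albums"] for r in rows)


def empirical_sensitivity(f, rows):
    """Largest change in f when any single row is removed from `rows`."""
    full = f(rows)
    return max(abs(full - f(rows[:i] + rows[i + 1:])) for i in range(len(rows)))


# Hypothetical survey responses; only Mary's album count comes from the text.
survey = [
    {"name": "Mary", "gender": "F", "albums": 6},
    {"name": "Bob", "gender": "M", "albums": 0},
    {"name": "Ann", "gender": "F", "albums": 2},
]

print(empirical_sensitivity(count_female, survey))  # 1
print(empirical_sensitivity(total_albums, survey))  # 6
```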

The goal is to blur the distinctions within the target function using noise addition.
Key challenges include:
- Adding excessive noise diminishes the value of the function output.
- Insufficient noise fails to provide adequate privacy protection.

To attain $\epsilon$-differential privacy, use:
- Noise drawn from $Lap(b)$, where $b = \frac{\triangle f}{\epsilon}$.
- The noise scale is proportional to the sensitivity $\triangle f$ and inversely proportional to the privacy budget $\epsilon$.
- Laplace distribution: The probability density function is:
- $f(x \mid \mu, b) = \frac{1}{2b} e^{-\frac{|x - \mu|}{b}}$.
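
A minimal sketch of this mechanism, using only the Python standard library. The function names are assumptions for illustration, and the sampler relies on the fact that the difference of two independent exponential draws with mean $b$ follows $Lap(b)$. It is not hardened against floating-point attacks and is meant for illustration only.

```python
import random


def laplace_noise(b, rng=random):
    """Sample from Lap(b) as the difference of two exponential draws with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value plus Laplace noise of scale b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    b = sensitivity / epsilon
    return true_value + laplace_noise(b, rng)


# Usage: release a count of 120 with sensitivity 1 and epsilon 0.5 (scale b = 2).
rng = random.Random(0)  # seeded only to make the demo repeatable
print(laplace_mechanism(120, sensitivity=1, epsilon=0.5, rng=rng))
```

Passing a seeded `random.Random` instance makes the demo repeatable; a real release should not use a predictable seed.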

That this mechanism satisfies $\epsilon$-differential privacy can be verified as follows:
- Comparing $Pr(f(x) + Lap(\frac{\triangle f}{\epsilon}) = y)$ with $Pr(f(x') + Lap(\frac{\triangle f}{\epsilon}) = y)$ for adjacent $x$ and $x'$ shows that their ratio is at most $e^{\epsilon}$, which is exactly the differential privacy condition.
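
The comparison can be written out in full. In the sketch below, $p_x(y)$ is notation introduced here (not in the original notes) for the density of $f(x) + Lap(\frac{\triangle f}{\epsilon})$ at $y$. The first inequality is the triangle inequality, and the second uses $|f(x) - f(x')| \leq \triangle f$ for adjacent $x$ and $x'$.

```latex
\begin{align*}
\frac{p_x(y)}{p_{x'}(y)}
  &= \frac{\frac{\epsilon}{2\triangle f}
           \exp\!\left(-\frac{\epsilon\,|y - f(x)|}{\triangle f}\right)}
          {\frac{\epsilon}{2\triangle f}
           \exp\!\left(-\frac{\epsilon\,|y - f(x')|}{\triangle f}\right)} \\
  &= \exp\!\left(\frac{\epsilon\,\bigl(|y - f(x')| - |y - f(x)|\bigr)}{\triangle f}\right) \\
  &\leq \exp\!\left(\frac{\epsilon\,|f(x) - f(x')|}{\triangle f}\right)
   \leq e^{\epsilon}
\end{align*}
```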

To count the number of participants in a dataset, e.g., female respondents:
- Sensitivity is 1, so the noise added is drawn from $Lap(\frac{1}{\epsilon})$.
- A very small $\epsilon$ gives a large scale $b = \frac{1}{\epsilon}$ and therefore a flatter Laplace curve: more noise and stronger privacy, at the cost of accuracy.
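
To see how $\epsilon$ shapes the curve for a counting query, the sketch below evaluates the Laplace density at its center for a few values of $\epsilon$. The chosen values are arbitrary illustrations. With sensitivity 1 the peak height is $\frac{1}{2b} = \frac{\epsilon}{2}$, so a smaller $\epsilon$ means a lower, flatter curve.

```python
import math


def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


for epsilon in (0.1, 1.0, 10.0):
    b = 1.0 / epsilon                 # sensitivity 1 for a counting query
    peak = laplace_pdf(0.0, 0.0, b)   # height of the curve at its center
    spread = math.sqrt(2.0) * b       # standard deviation of Lap(b)
    print(f"epsilon={epsilon}: scale b={b:.2f}, peak={peak:.3f}, std={spread:.2f}")
```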

Local Differential Privacy: Noise is added at the individual client side before data is sent to the server.
Central Differential Privacy: Noise is added on the server side using the aggregated data from multiple inputs.
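
The difference between the two models can be sketched for a counting query over one bit per client. The function names and the data are assumptions for illustration, and Laplace noise is used on the client side only to stay consistent with the mechanism described in these notes. In the local model each client adds its own $Lap(\frac{1}{\epsilon})$ noise, so the noise accumulates across clients. In the central model a single draw is added to the exact sum.

```python
import random


def laplace_noise(b, rng):
    """Sample from Lap(b) as the difference of two exponential draws with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def central_count(bits, epsilon, rng):
    """Server sees the raw bits and adds one Lap(1/epsilon) draw to their sum."""
    return sum(bits) + laplace_noise(1.0 / epsilon, rng)


def local_count(bits, epsilon, rng):
    """Each client perturbs its own bit before sending; the server sums noisy reports."""
    reports = [bit + laplace_noise(1.0 / epsilon, rng) for bit in bits]
    return sum(reports)


# Hypothetical data: 600 of 1000 clients answer "yes" (bit = 1).
bits = [1] * 600 + [0] * 400
rng = random.Random(1)  # seeded only to make the demo repeatable
print("central:", central_count(bits, epsilon=1.0, rng=rng))
print("local:  ", local_count(bits, epsilon=1.0, rng=rng))
```

The local estimate is typically much noisier for the same $\epsilon$, which is the price of not having to trust the server with raw data.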