CSC 533 Privacy Final Exam Review
Privacy Attacks on Data
Attribute Inference
Property Inference
Kinds of Privacy Attacks on Models
Membership Inference
Model Extraction
Attribute Inference Attack/Model Inversion Attack
Given an output of a machine learning model, infer something about the input
Example: Given a patient's dosage of a drug, infer something about their genome
Property Inference Attack
The ability to extract dataset properties that were not explicitly encoded as features or correlated with the learning task
Information that the model learned unintentionally
Example: A classifier identifying gender can also be used to infer information about whether someone wears glasses
Meta Classifier
Predicts if the target model was trained on a dataset that has property P or not
Model Extraction Attack
Learning a close approximation of the model using as few queries as possible
Example: A logistic regression model has n+1 unknowns; it can be queried n+1 times and the resulting system of linear equations solved for the parameters
How many unknowns in a logistic regression model
n+1
W is a vector of size n
b is a scalar
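The equation-solving extraction can be sketched as follows. This is a minimal illustration: `query` is a hypothetical black-box oracle that returns the model's floating-point probability, and the secret parameters exist only so the sketch can check itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
w_true, b_true = rng.normal(size=n), 0.5  # secret model parameters

def query(x):
    """Hypothetical black-box logistic regression returning the raw probability."""
    return 1.0 / (1.0 + np.exp(-(w_true @ x + b_true)))

# n+1 queries give n+1 linear equations in the n+1 unknowns (w, b),
# because logit(p) = w.x + b is linear in the parameters.
X = rng.normal(size=(n + 1, n))
logits = np.array([np.log(query(x) / (1 - query(x))) for x in X])
A = np.hstack([X, np.ones((n + 1, 1))])  # last column multiplies b
params = np.linalg.solve(A, logits)
w_rec, b_rec = params[:n], params[n]
```

The attack works because the oracle leaks the exact confidence score, which is what the countermeasure below removes.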
Model Extraction Countermeasure
Only return the predicted class label, not the floating-point confidence score
Model Extraction Countermeasure Attack
Query until you find two points with different labels, then search along the line between them to locate points on the decision boundary
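The line-search step can be sketched with a toy label-only oracle (a hypothetical linear classifier; the weights exist only to make the sketch checkable):

```python
import numpy as np

def label(x, w=np.array([2.0, -1.0]), b=0.3):
    """Label-only oracle: the countermeasure hides confidence scores."""
    return int(w @ x + b > 0)

def boundary_point(x_pos, x_neg, oracle, iters=50):
    """Binary-search the segment between two differently-labelled
    points until we land (approximately) on the decision boundary."""
    assert oracle(x_pos) == 1 and oracle(x_neg) == 0
    for _ in range(iters):
        mid = (x_pos + x_neg) / 2
        if oracle(mid) == 1:
            x_pos = mid
        else:
            x_neg = mid
    return (x_pos + x_neg) / 2

p = boundary_point(np.array([5.0, 0.0]), np.array([-5.0, 0.0]), label)
```

Collecting enough boundary points lets the attacker recover the decision boundary even without confidence scores.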
Membership Inference
Determining whether a data point was part of the training set, which is possible because cloud-hosted models tend to overfit their training data.
Works with only black-box access to the target model
Shadow model which has same inputs and outputs as target model
Attack model which takes in the classification distribution from the shadow model as an input and returns binary true false
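The shadow/attack pipeline can be sketched with a toy overfit model. This is an illustration only: the 1-NN `OverfitModel` class and the threshold attack are stand-ins, not the actual attack architecture.

```python
import numpy as np

rng = np.random.default_rng(4)

class OverfitModel:
    """Toy 1-NN 'cloud model' that memorizes its training set, so it
    is far more confident on members than on non-members."""
    def fit(self, X):
        self.X = X
        return self
    def confidence(self, x):
        d = np.linalg.norm(self.X - x, axis=1).min()
        return 1.0 / (1.0 + d)  # exactly 1.0 on training points

shadow_in = rng.normal(size=(50, 2))   # data the shadow model trains on
shadow_out = rng.normal(size=(50, 2))  # data it never sees
shadow = OverfitModel().fit(shadow_in)

# Attack model: learn a confidence threshold separating member from
# non-member using the shadow model's outputs; in the full attack this
# threshold (or classifier) is then applied to the target model.
conf_in = [shadow.confidence(x) for x in shadow_in]
conf_out = [shadow.confidence(x) for x in shadow_out]
threshold = (min(conf_in) + max(conf_out)) / 2

def is_member(confidence):
    return confidence > threshold
```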
Federated Learning
Server has an untrained model
Sends a copy of that model to the local nodes
Nodes train on their own data
Each node sends the trained model back to the server
The server combines them by taking an average
The server now has a general model
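The steps above can be sketched as a single round of federated averaging; `local_update` is a hypothetical stand-in for a node's local training.

```python
import numpy as np

rng = np.random.default_rng(1)
global_w = np.zeros(4)  # untrained server model

def local_update(w, node_data):
    """Stand-in for local training: each node adjusts the weights
    using only its own data."""
    return w + node_data

# Server sends a copy of the model out; nodes train and send it back.
node_updates = [local_update(global_w, rng.normal(size=4)) for _ in range(5)]

# Server combines the returned models by averaging; it never sees raw data.
global_w = np.mean(node_updates, axis=0)
```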
Federated Learning Benefits
Every user has different data
Some users have more data than others
Distributed between many nodes
Limited Communication between nodes
SGD
Stochastic Gradient Descent
Updating the weights proportional to the error
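A single SGD step for a least-squares loss, where the update is proportional to the prediction error (a minimal sketch; the learning rate and data point are arbitrary):

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One stochastic gradient step for loss (w.x - y)^2 / 2:
    gradient = error * x, so the update is proportional to the error."""
    error = w @ x - y
    return w - lr * error * x

w = np.zeros(2)
w = sgd_step(w, np.array([1.0, 2.0]), 3.0)  # error = -3
```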
DSSGD
Distributed Selective Stochastic Gradient Descent
Each node locally trains the model and computes weights
Selectively uploads a fraction of its gradients; it doesn't have to upload all of them
Server averages uploaded weights and updates parameters for the next iteration
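The selective-upload step can be sketched as keeping only the largest-magnitude gradients (one common selection rule; the fraction is illustrative):

```python
import numpy as np

def select_gradients(grad, fraction=0.1):
    """DSSGD-style selection: upload only the largest-magnitude
    fraction of gradients; the rest stay local."""
    k = max(1, int(fraction * grad.size))
    idx = np.argsort(np.abs(grad))[-k:]  # indices of the top-k entries
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

g = np.array([0.01, -0.5, 0.03, 0.9, -0.02, 0.1, 0.0, -0.2, 0.04, 0.06])
upload = select_gradients(g, fraction=0.2)  # only 2 of 10 entries survive
```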
DSSGD Privacy Properties
Participants’ Data remains private
Full control over parameter selection
Known learning objective
Resulting model available to all parties
Secure Aggregation
Server aggregates users' updates but doesn't inspect the individual updates
Secure Aggregation Noise
Random noise of positive and negative pairs that cancel each other out and don’t influence the model
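The cancelling-pairs idea can be sketched with pairwise masks: each pair of users shares a random mask that one adds and the other subtracts, so the masks vanish in the sum (a simplified illustration; the real protocol derives masks cryptographically).

```python
import numpy as np

rng = np.random.default_rng(2)
updates = [rng.normal(size=3) for _ in range(3)]  # true local updates

# Pairwise masks: user i adds +m_ij, user j adds -m_ij, so each pair's
# masks cancel in the sum and the server never sees a raw update.
masks = {(i, j): rng.normal(size=3) for i in range(3) for j in range(i + 1, 3)}
masked = []
for u in range(3):
    m = updates[u].copy()
    for (i, j), mask in masks.items():
        if u == i:
            m += mask
        elif u == j:
            m -= mask
    masked.append(m)

aggregate = np.sum(masked, axis=0)  # equals the sum of the raw updates
```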
Differential Private Aggregation
Each node adds its own noise to its update before sending it to the server. Some utility is lost because the noise impacts the model
Fairness through blindness
Ignore all irrelevant/protected attributes
Issue: You don’t need to see an attribute to be able to predict it
Statistical Parity
S: Protected Subset
Sc: Rest of population
Want Pr(Outcome | S) = Pr(Outcome | Sc)
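The parity condition can be checked empirically as the gap between the two conditional outcome rates (a minimal sketch on made-up data):

```python
import numpy as np

def statistical_parity_gap(outcome, protected):
    """|Pr(outcome=1 | S) - Pr(outcome=1 | S^c)|, estimated from data."""
    s, sc = outcome[protected == 1], outcome[protected == 0]
    return abs(s.mean() - sc.mean())

outcome   = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # toy decisions
protected = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = in S, 0 = in S^c
gap = statistical_parity_gap(outcome, protected)  # 0 means exact parity
```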
Quantitative Input Influence (QII)
A technique for measuring the influence of a system's inputs on its outputs
Replaces a feature with random values drawn from the population and examines the resulting distribution over outcomes
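A QII-style influence measure can be sketched by permuting one feature across the population and counting how many predictions change (a toy model where only feature 0 matters):

```python
import numpy as np

rng = np.random.default_rng(3)

def model(X):
    """Toy classifier: only feature 0 influences the output."""
    return (X[:, 0] > 0).astype(int)

X_pop = rng.normal(size=(1000, 3))  # the population

def influence(feature):
    """Replace one feature with random draws from the population
    (via permutation) and measure the fraction of changed outcomes."""
    X_rand = X_pop.copy()
    X_rand[:, feature] = rng.permutation(X_pop[:, feature])
    return np.mean(model(X_rand) != model(X_pop))
```

The irrelevant features get influence 0, while the decisive feature gets a large score.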
k-Eidetic Memorization
If a string is extractable from the model and appears in at most k samples from the training data
Acceptable when k is large and the string is common text, e.g., ordinary words
Bad when k is small and the string is sensitive, e.g., an address or a name
How to Mitigate Privacy Leakage in LLMs
Train with differential privacy so that one entry or row doesn't result in a significantly different model; the model then doesn't memorize any single training sample
Curate training data from trusted sources
Deduplicate training data
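Exact deduplication can be sketched with content hashes, so no sample appears (and can be memorized) more than once. The corpus lines below are made up for illustration.

```python
import hashlib

def deduplicate(samples):
    """Drop training samples whose content hash has already been seen,
    preserving the order of first occurrence."""
    seen, unique = set(), []
    for s in samples:
        h = hashlib.sha256(s.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(s)
    return unique

corpus = ["the cat sat on the mat", "a b c", "the cat sat on the mat"]
clean = deduplicate(corpus)
```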
HIPAA
Health Insurance Portability & Accountability Act
Establishment of nationwide protection for patient confidentiality
Fines range from $100 to $50,000 per violation
FERPA
Family Educational Rights and Privacy Act
Gives rights to students enrolled at an educational institution to inspect, review, and amend their records and control disclosure.
COPPA
Children's Online Privacy Protection Act
Grants parents control over the information collected from children online and designed with consumer protection in mind
GDPR
General Data Protection Regulation (EU)
Providing uniform data protection regulations and is one of the highest standards of privacy and data protection in the world
CCPA
California Consumer Privacy Act
For-profit entities that have $25M+ in annual revenue, hold PII of 50K+ consumers, or mainly sell consumer data.
Provides consumers the right to request information about and deletion of their data
OECD
Organization for Economic Co-operation and Development
Framework that dictates how data should be collected, limited, safeguarded, and handled transparently
Most commonly used privacy framework
APEC
Asia-Pacific Economic Cooperation
Similar to OECD but mainly for Asia-Pacific Region
NIST
National Institute of Standards and Technology
Identify, Govern, Control, Communicate, and Protect
IAPP
International Association of Privacy Professionals
Proactive not Reactive and privacy as the default
Privacy all the way through
Privacy Nutrition Label
Emulate nutrition label with what data, what purpose, and who is it being shared with
Privacy Rating Labels
Rate website on scales inspired by energy labels which show a rating compared to specific alternatives
Privacy Notice Timing
At setup
Just in time
Context-dependent (checkup)
Periodic (do you want to continue to allow this)
Persistent (showing an icon for as long as the data practice is active)
On Demand (opting out through settings)
Privacy Notice Channel
Primary (Gives you a policy)
Secondary (delivered through another channel, e.g., an email)
Public (Sign or public notice)
Privacy Notice Modality
Visual
Auditory
Haptic
Machine Readable
Privacy Notice Control
Blocking (blocked by default)
Non blocking (allowed by default)
Decoupled (relies on a third party setting)
Platform for Privacy Preferences Project (P3P)
An easy way for websites to communicate about their privacy policies in a standard machine-readable format
Labelling Privacy Practices (Food Label)
Shows the types of data collected, general data collection practices and the sharing practices.
Each policy received an evaluation of YES, NO, or UNCLEAR
Terms of Service Didn’t Read
ToS;DR
Terms are divided into small points
Each point gets assigned one or several topics
Topics are then scored
Privee Privacy Extension
Policy Rating Extension
Uses NLP techniques to find the presence or absence of topics
Automated Policy Analysis
Extract websites data practices through natural language processing and machine learning
Privacy Policy Annotation Tool
Segment policy broken into paragraphs
Paragraphs then categorized
Goes through and asks questions on whether a paragraph does something or not
Westin’s Privacy Index Survey
Asks 3 questions which people either agree or disagree with
1) Consumers have lost all control over how personal information is collected and used
2) Most businesses handle the personal information they collect in a proper and confidential way
3) Existing laws and organizational practices provide a reasonable level of protection for consumer privacy
Westin’s Privacy Segmentation
Fundamentalist: Consumers lost control, most businesses don’t care about consumers, and existing laws are not enough
Unconcerned: Consumers haven’t lost control, businesses care, and existing laws are enough
Pragmatist: Anyone else
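The segmentation rule above can be written as a small function (a sketch; the argument names are illustrative and True means "agree" with the corresponding survey statement):

```python
def westin_segment(lost_control, businesses_proper, laws_adequate):
    """Classify a respondent from the three agree/disagree answers:
    Fundamentalists are privacy-concerned on all three statements,
    Unconcerned are the opposite on all three, everyone else is a
    Pragmatist."""
    if lost_control and not businesses_proper and not laws_adequate:
        return "Fundamentalist"
    if not lost_control and businesses_proper and laws_adequate:
        return "Unconcerned"
    return "Pragmatist"
```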
Factors that increase privacy concerns
Data aggregation
Data distortion
Data sharing
Data breaches
Factors that reduce privacy concerns
Privacy policy, License agreements
Privacy Laws
Anonymizing all data
Technical Details
Details on usage
Distributed Ledger
Book of all transactions, where the copy with the most pages (the longest history) is deemed authoritative
Blockchain
A linked list with hash pointers: literally a chain of blocks, where each block stores the hash of the previous one to prove the prior work
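The hash-pointer structure can be sketched in a few lines (a toy chain with no proof-of-work; the transactions are made up):

```python
import hashlib
import json

def make_block(data, prev_hash):
    """Each block stores the hash of the previous block, so altering
    any earlier block invalidates every hash pointer after it."""
    block = {"data": data, "prev_hash": prev_hash}
    digest = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block, digest

chain = []
h = "0" * 64  # genesis pointer
for tx in ["alice->bob: 5", "bob->carol: 2"]:
    block, h = make_block(tx, h)
    chain.append((block, h))

def valid(chain):
    """Recompute every hash and check it matches the stored pointers."""
    prev = "0" * 64
    for block, h in chain:
        if block["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        if recomputed != h:
            return False
        prev = h
    return True
```

Tampering with any block's data breaks verification for that block and everything after it.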
Deanonymization Attack
Transaction graph and some side channel information that can be used to link pseudonyms to real identities with blockchain transactions
Multiple-Input Transactions (CoinJoin)
Combining multiple senders and receivers in a single transaction so that an observer can't tell exactly which input paid which output
Zcash
Uses zero-knowledge proofs, but has issues with requiring a trusted setup
Remote Device Identification
Works by remotely measuring a machine's clock skew, the small but consistent drift between its clocks and true time, and using it as a fingerprint
Can identify machines even after they change location or ISP
Website Fingerprinting
A traffic-analysis attack that identifies which website a user is visiting over an encrypted or anonymized connection (e.g., Tor) from patterns such as packet sizes, ordering, and timing
kNN Fingerprinting
Uses k-nearest neighbors over packet-level features, tuning the feature weights in the distance calculation, to determine which website a user visits. Results in 90-95% accuracy
CUMUL Fingerprinting
A website fingerprinting attack that uses the cumulative packet size of a data flow to identify the content of encrypted web traffic
90-93% accurate
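The core CUMUL feature can be sketched as the cumulative sum of signed packet sizes, interpolated to a fixed-length vector so traces of any length are comparable (the toy trace below is made up; the full attack feeds these features to a classifier such as an SVM):

```python
import numpy as np

def cumul_features(packet_sizes, n_points=10):
    """Cumulative sum of signed packet sizes (sign encodes direction),
    sampled at n_points evenly spaced positions via interpolation."""
    cum = np.cumsum(packet_sizes)
    xs = np.linspace(0, len(cum) - 1, n_points)
    return np.interp(xs, np.arange(len(cum)), cum)

trace = [512, -100, 1460, 1460, -100, 1460]  # toy encrypted trace
feat = cumul_features(trace, n_points=4)
```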
K fingerprinting
Uses a random forest to extract a fingerprint of a user's traffic, even when it is encrypted.
Then performs kNN on that fingerprint against known fingerprints
90% accurate on onion services
Site level feature analysis
Sites have features that make them identifiable among themselves, such as the number of links, fonts, videos, etc.
Website Fingerprinting Countermeasures
Network Layer
Add padding
Add latency
Make packets look similar
Page Design
Small size
Dynamic Pages
Side Channel Attack
Any attack based on information gained from physical implementation of a system rather than a weakness in an algorithm
Unintentional leakage
Acoustic Side Channel Attack
Using sound frequencies or other acoustic emissions from a physical device, like a keyboard or microphone, to interact with a system or make unintended deductions.
Examples: inferring which key is pressed on a keyboard from its sound, or talking to a smart device at a frequency humans can't hear