Ethics
Shared principles guiding moral judgement
Morals
Individual’s beliefs or principles concerning what is right or wrong, often shaped by cultural, religious, or personal beliefs or values
3 V’s
Volume, velocity, variety
Data Science Project Process
Identify the Hypothesis
Design the research plan
Collect the Data
Analyze the data
Extract the results
Exploit the results
Data Lifecycle
Data collection
Processing
EDA
Analysis/hypothesis testing/ML
Insight and policy decision
Data Engineers
Specialize in data gathering and storage
Areas of concern in Data Ethics
Data collection, ownership, privacy, anonymity, validity, algorithm, statistical fairness
To protect individual’s privacy
Store data in a secure database
use security methods to protect privacy
transparency and informed consent
Data Anonymity:
Preventing the identification of individuals within a dataset when handling and sharing data
Confidentiality
Protecting data from unauthorized access or disclosures; involves implementing security measures
Algorithm Fairness
Ensuring the algorithms used in decision-making do not produce unfair or biased outcomes
Imbalance Data Issue
Dataset overrepresents a certain outcome
What ethical concepts are involved with designing a research plan
transparency, informed consent, and algorithmic fairness
What ethical approaches are related to collecting data
transparency, informed consent, data privacy, anonymity, and data ownership
What ethical approaches and concepts are related to data analysis
check business, data validity, transparency
What ethics are associated with extracting data results
Algorithmic fairness, data validity, and transparency
What ethical concepts are associated with publishing and exploiting results
data ownership, transparency, and algorithmic fairness
Data Privacy v Anonymity v Confidentiality
Privacy: Protection and appropriate use of personal information
Anonymity: Prevents the identification of individuals within a dataset
Confidentiality: Safeguarding data from an unauthorized source
Nominal v Ordinal
Nominal = classified into various groups with no ranks or orders
Ordinal = Grouped based on order or ranking
EU’s set of tough data protection laws
GDPR
California’s data protection law that gives consumers rights over the personal information businesses collect about them, including the right to know, delete, and opt out of its sale
California Consumer Privacy Act
PII
Personally Identifiable Information: Any data or information that can be used to identify, locate, or contact an individual
A/B testing
comparing two versions of something to see which performs better in achieving a specific goal
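A minimal sketch of an A/B comparison, assuming the goal is a conversion rate and using made-up visitor counts. The two-proportion z-test is one common way to judge whether the difference is more than noise; the function name and the numbers are illustrative only.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return the z statistic for the difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical data: version A converts 100 of 1000 visitors, version B 130 of 1000
z = two_proportion_z(100, 1000, 130, 1000)

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests version B's higher rate is unlikely to be chance alone, under the usual assumptions of random assignment and independent visitors.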
Selection Bias
Broad category of bias that occurs when the participants included in a study are not representative of the target population
Reasons for selection bias
Unrepresentative sample, non random selection, participant factors, etc.
Types of selection bias
sampling, non-response, self-selection/volunteer bias
Sampling Bias
Data is from a non-representative subset:
participants might be selected from a specific demographic group
convenience sampling
not represent the entire population
Non-response bias
Some groups don’t respond, skewing results
Self selection bias
people choose whether to participate
Measurement Bias
Method you use to measure something consistently gives wrong results
Response Bias
Someone responds falsely to a question
Common reasons for response bias
social desirability, acquiescence, low effort, lack of interest, survey context, recall bias, cultural bias, researcher/experimenter bias (researchers unintentionally influence data collection), data entry error
Data ownership
Having complete control over the data, property, an asset, object, or real estate
4 paradigms of Data Ownership
subject as owner - Individual Autonomy; A person has the right to be forgotten
creator as owner - Data belongs to person that created it; may clash with the subject’s rights; ex: researcher or an artist
custodian as owner - entity storing/managing data is the owner ; ex: cloud provider
funder as owner - data belongs to the funding entity
Types of data ownership
personal data ownership, organizational data ownership, shared data ownership, collaborative data ownership, open data and public domain, third party data ownership, data custodianship
Shared data ownership
Responsibility is shared based on who creates, uses, and manages the data. Responsibility is shared but ownership or decision power may differ. Divided responsibility
Collaborative Data ownership
Multiple parties jointly create the data and own the final output; equal or joint ownership of the results
Open data
freely used, reused, and redistributed. At most might need to be attributed; government data and scientific research
Public Domain
Data that is not protected by copyright and is available for public use without any restrictions
Data custodianship
ownership does not transfer; third party only temporarily holds and manages the data on behalf of the original owner
TRIPS agreement
Agreement on Trade-Related Aspects of Intellectual Property Rights. Notes that intellectual property rights can be sorted into copyright, industrial design, trademark, trade secret, patent, geographical indication, and integrated circuit layout design
Copyright
owners have exclusive rights to reproduce, distribute, display, and perform creations.
How long does US copyright last
Life of the author + 70 years
CC BY
Allows copying, sharing, adapting, and using the work, even commercially, with credit to the creator
CC BY-NC
Allows copying, sharing, adapting, and using the work for non-commercial purposes only, with credit
CC BY-ND
Allows copying and sharing only in original form, with credit
CC BY-SA
Allows adapting the work, but derivative works must use the same license, with credit
Types of AI-processed Data
Extracted Data, restructured data, augmented data (new variables added), inferred data, modeled data
Quasi-Identifier
Data attribute that isn’t identifying on its own but can potentially lead to re-identification when combined with other data
Pseudonymization
Replacing personally identifiable information with a pseudonym so the data isn’t directly identifiable but can be traced back to an individual if necessary
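A minimal sketch of this idea, assuming records are plain dictionaries and that the key table linking pseudonyms back to real identities is stored separately and securely. The function name `pseudonymize` and the sample records are hypothetical.

```python
import secrets

def pseudonymize(records, id_field):
    """Replace the identifying field with a random pseudonym.

    Returns the pseudonymized records and a key table (pseudonym -> real value)
    that allows authorized re-identification.
    """
    forward = {}    # real value -> pseudonym
    key_table = {}  # pseudonym -> real value (keep this separate and protected)
    output = []
    for rec in records:
        real = rec[id_field]
        if real not in forward:
            token = "P-" + secrets.token_hex(4)
            while token in key_table:  # regenerate on the rare collision
                token = "P-" + secrets.token_hex(4)
            forward[real] = token
            key_table[token] = real
        output.append({**rec, id_field: forward[real]})
    return output, key_table

records = [
    {"name": "Alice", "score": 90},
    {"name": "Bob", "score": 85},
    {"name": "Alice", "score": 70},
]
pseudo, key_table = pseudonymize(records, "name")
```

The same person always maps to the same pseudonym, so analysis across records still works, while anyone without the key table sees no names.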
Data minimization
Collecting and storing only the minimum amount of data needed
Access controls
Password protection, multi-factor authentication, etc.
CCPA
California Consumer Privacy Act; Right to Information, Data Transfer Restrictions, User Rights
Nevada Senate Bill 220
Data Sale Restriction (websites can’t sell customer data without consumer consent or knowledge); covered information includes PII, and consumers can opt out of the sale of covered information
Encryption
Encodes data to make it unreadable to anyone who doesn’t have the decryption key
Access Control
Specifies who can access specific resources or data. Enforces authentication, authorization, and permissions
Firewalls
Protect networks from unauthorized access by blocking incoming traffic that doesn’t meet specific criteria and restricting outgoing traffic
Anonymization
Removes identifying information from personal data
Tokenization
Replaces sensitive data with non-sensitive substitutes to ensure that stolen data is useless to thieves
Data Loss prevention
Set of technologies used to prevent data from being lost or stolen. DLP tools monitor traffic, block sensitive data from being sent outside the organization, and restrict unauthorized access
VPN
Used to protect data in transit and can help to prevent eavesdropping and interception of sensitive data
2FA
Users need to provide two forms of identification to access a system or application
Procedure for encoding a message
Cipher
How Encryption Works
Sender encrypts the original message with an encryption algorithm and a secret encryption key, producing a ciphertext
Sender shares the encrypted message with the recipient through a secure channel or medium
Sender protects the encryption key, keeping it secret and never sharing it in the same communication as the ciphertext
Recipient decrypts the message using the decryption key to recover the original message
Symmetric Encryption
One key is used for both encryption and decryption
Asymmetric/Public-Key Cryptography
Each party has a pair of keys: a public key and a private key. The public key is shared openly and is used to encrypt messages to its owner and to verify the owner’s signatures. The private key is kept secret and is used to decrypt those messages and to create signatures.
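A toy sketch of a public/private key pair using textbook RSA with tiny primes. This is an illustration only, assuming Python 3.8+ for the modular inverse; real systems use vetted libraries, padding, and much larger keys.

```python
# Toy key generation with small primes (never use sizes like this in practice)
p, q = 61, 53
n = p * q                  # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent: (n, e) is the public key
d = pow(e, -1, phi)        # private exponent: (n, d) is the private key

message = 65               # a message encoded as a number smaller than n

# Anyone can encrypt with the public key; only the private key decrypts
ciphertext = pow(message, e, n)
recovered = pow(ciphertext, d, n)

# The owner signs with the private key; anyone verifies with the public key
signature = pow(message, d, n)
verified = pow(signature, e, n) == message
```

The two directions show why one key pair supports both confidentiality (encrypt with public, decrypt with private) and authentication (sign with private, verify with public).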
What are the benefits of public key cryptography
Security due to two keys, non-repudiation (sender can’t deny sending the message), authentication, scalability, and key exchange
Steganography
Hides the existence of the message itself (for example, inside an image or audio file), whereas cryptography only makes the content unreadable
What is the Caesar cipher vulnerable to
Letter frequency, Common phrases, brute force attack
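A minimal sketch of why brute force works against the Caesar cipher: there are only 26 possible shifts, so an attacker can try them all and pick the one that reads as language. The function names are illustrative.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar(text, shift):
    """Shift every letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

def brute_force(ciphertext):
    """Return every (shift, candidate plaintext) pair."""
    return [(s, caesar(ciphertext, -s)) for s in range(26)]

ciphertext = caesar("MEET AT NOON", 3)   # "PHHW DW QRRQ"
candidates = brute_force(ciphertext)
for shift, guess in candidates[:5]:
    print(shift, guess)
```

Shift 3 in the candidate list reproduces the original message, and a human or a dictionary check can spot it immediately.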
Simple substitution Cipher
Jumble the entire alphabet in a secret permutation
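A minimal sketch of a simple substitution cipher, where the key is a secret permutation of the alphabet. The seeded `random.Random` is an assumption made for reproducibility and is not cryptographically secure.

```python
import random
import string

def make_key(seed=None):
    """Build a letter-to-letter mapping from a shuffled alphabet."""
    rng = random.Random(seed)
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def substitute(text, key):
    """Apply the mapping to each letter; leave other characters unchanged."""
    return "".join(key.get(ch, ch) for ch in text.upper())

key = make_key(seed=42)
inverse = {v: k for k, v in key.items()}   # decryption key

ciphertext = substitute("HELLO WORLD", key)
plaintext = substitute(ciphertext, inverse)
```

With 26! possible permutations, brute force is impractical, but each plaintext letter always maps to the same ciphertext letter, so letter-frequency analysis still applies.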
Common techniques for data anonymity
de-identification, aggregation, anonymization
Data Anonymity techniques
Suppressions, Generalization, noise addition
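A minimal sketch of the three techniques on a single record, assuming dictionary records, 10-year age bands, and uniform noise. The helper names, the band width, and the noise scale are all illustrative choices.

```python
import random

def suppress(record, fields):
    """Suppression: replace the given fields with a placeholder."""
    return {k: ("*" if k in fields else v) for k, v in record.items()}

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def add_noise(value, scale, rng):
    """Noise addition: perturb a numeric value by up to +/- scale."""
    return value + rng.uniform(-scale, scale)

rng = random.Random(0)
record = {"name": "Ann", "age": 27, "salary": 50000}

safe = suppress(record, {"name"})
safe["age"] = generalize_age(record["age"])
safe["salary"] = add_noise(record["salary"], 1000, rng)
```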
PPDP
Privacy preserving data publishing; a set of techniques used to share or publish datasets while protecting the privacy of individuals
K anonymity
Each record is indistinguishable from at least k - 1 other records with respect to the quasi-identifiers
Equivalence classes
set of records that have the same values for the quasi-identifiers
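A minimal sketch of measuring k for a small table: group the rows by their quasi-identifier values, then take the size of the smallest group. The column names and rows are made-up examples.

```python
from collections import Counter

def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)

def k_of(rows, quasi_identifiers):
    """The table is k-anonymous for k equal to its smallest group size."""
    return min(group_sizes(rows, quasi_identifiers).values())

rows = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "asthma"},
]

k = k_of(rows, ["age", "zip"])
print(k)  # smallest group has 2 rows
```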
Homogeneity Attack
If the records within an equivalence class share the same or very similar sensitive values, attackers can make educated guesses about the individuals in that group
How to counter homogeneity attack
Reducing the size of the dataset so knowing who’s included isn’t possible
L-diversity
There should be at least l distinct values for the sensitive attribute in each group of instances that have the same quasi-identifiers
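A minimal sketch of checking l-diversity: for each group of rows with the same quasi-identifiers, count the distinct sensitive values and report the smallest count. The rows and column names are made-up examples.

```python
from collections import defaultdict

def l_of(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())

rows = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "asthma"},
]

l = l_of(rows, ["age", "zip"], "disease")
print(l)  # the 20-29 group only contains "flu"
```

Here the first group is 2-anonymous yet has a single sensitive value, so anyone known to be in that group is known to have the flu.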
Sources of data validity errors
Nonrepresentative samples
wrong choice of attributes and measures - do the chosen metrics accurately capture the relevant aspects of the phenomena (irrelevant, omitted, poor operationalization like flawed survey questions)
errors in the data
errors in model design
error in data processing
errors in managing change
Types of data validation checks
Field level validation
record level validation
file level validation
business rule
Field level validation
checks individual data fields for correctness. Ex: Dates in right format, data type check, range check, etc.
Record Level validation
Cross field validation (checks for consistency; end data after start date, etc.) + mandatory field checks
File Level validation
row counts are as expected, there’s no duplicate entries, all id’s are unique, etc.
Business Rule Validation
data complies with business logic
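A minimal sketch of the four kinds of checks on dictionary records. The field names, the 0 to 120 age range, and the "members must be at least 18" rule are hypothetical examples, not rules from any particular system.

```python
from datetime import date

REQUIRED = ("id", "age", "start", "end")

def field_errors(rec):
    """Field level: type and range check on a single field."""
    age = rec.get("age")
    if not isinstance(age, int):
        return ["age: wrong type"]
    if not 0 <= age <= 120:
        return ["age: out of range"]
    return []

def record_errors(rec):
    """Record level: mandatory fields plus cross-field consistency."""
    errs = [f"{f}: missing" for f in REQUIRED if rec.get(f) is None]
    start, end = rec.get("start"), rec.get("end")
    if start is not None and end is not None and end < start:
        errs.append("end date before start date")
    return errs

def file_errors(records, expected_rows):
    """File level: row count and unique ids."""
    errs = []
    if len(records) != expected_rows:
        errs.append("unexpected row count")
    ids = [r.get("id") for r in records]
    if len(ids) != len(set(ids)):
        errs.append("duplicate ids")
    return errs

def business_rule_errors(rec):
    """Business rule (hypothetical): members must be at least 18."""
    age = rec.get("age")
    return ["member under 18"] if isinstance(age, int) and age < 18 else []

good = {"id": 1, "age": 34, "start": date(2024, 1, 1), "end": date(2024, 6, 1)}
bad = {"id": 1, "age": 150, "start": date(2024, 5, 1), "end": date(2024, 2, 1)}
```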
What assumptions do models often make about data
linearity, normality, independence
Where does algorithmic bias come from
Historical bias, unbalanced datasets, and proxy variables
True Positive Rate
True Positive / (True Positive + False Negative); proportion of actual incidents that are accurately detected as positive by the model
False Positive Rate
False Positive / (False Positive + True Negative); the proportion that is incorrectly classified as positive when it’s actually negative
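A minimal sketch computing both rates from true and predicted labels, assuming 1 means positive and 0 means negative. The label lists are made-up.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
print(tpr, fpr)  # 3 of 4 positives caught, 1 of 6 negatives flagged
```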
Adversarial Debiasing
Identify and mitigate bias in model predictions by countering discriminatory patterns
Regularization
Penalize correlations between sensitive attributes and model predictions
Re-weighted loss functions
different data points are assigned varying levels of importance with underrepresented or sensitive groups given more weight
Equalized odds post-processing
Adjust prediction probabilities to equalize true/false positive rates
Pre-processing techniques
Resampling, re-weighting, and data transformations to ensure that all groups are equally represented before training a hiring model
Fair relabelling
Adjusting class labels in a dataset to reduce bias by changing the label of some instances from the sensitive group from negative to positive, and from positive to negative for some instances from the non-sensitive group
Statistical Parity
Evaluate whether the probability of a favorable outcome is the same for both sensitive and non-sensitive groups; take the probability of a positive outcome for both groups and subtract them
Disparate Impact
calculate the ratio of favorable outcomes for the sensitive: non-sensitive group
If statistical parity is 40%, the sentence is
Men have a 40 percentage point higher probability of being hired than women, in absolute terms
If disparate impact is two, the sentence is:
Men are twice as likely to be hired as women, in relative terms
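A minimal sketch that reproduces the two example sentences with made-up hiring data (1 = hired). The direction used here, non-sensitive group relative to sensitive group, is chosen to match those sentences; it is a convention, and some toolkits report the inverse ratio.

```python
def positive_rate(outcomes):
    """Share of favorable outcomes in a group."""
    return sum(outcomes) / len(outcomes)

# Hypothetical data: 80 of 100 men hired, 40 of 100 women hired
men = [1] * 80 + [0] * 20
women = [1] * 40 + [0] * 60

p_men = positive_rate(men)
p_women = positive_rate(women)

statistical_parity = p_men - p_women   # absolute difference
disparate_impact = p_men / p_women     # relative ratio

print(f"difference: {statistical_parity:.0%}, ratio: {disparate_impact:.1f}")
```

A difference of 0 and a ratio of 1 would both indicate that the two groups receive favorable outcomes at the same rate.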
Massaging
Make changes to the dataset to reduce or eliminate bias by using data manipulations
Formula to calculate the number of instances to relabel to bring discrimination to 0
M = (disc(D) × S^T × ns^T) / (S^T + ns^T)
Multiply the original discrimination disc(D) by the number of people in the sensitive group (S^T) times the number of people in the non-sensitive group (ns^T), then divide by the total population (S^T + ns^T)
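A worked sketch of this calculation with made-up counts, assuming discrimination is measured as the non-sensitive group's positive rate minus the sensitive group's, and that M labels are flipped in each group (negative to positive in the sensitive group, positive to negative in the non-sensitive group). The function names are illustrative.

```python
def discrimination(pos_ns, n_ns, pos_s, n_s):
    """Positive rate of the non-sensitive group minus that of the sensitive group."""
    return pos_ns / n_ns - pos_s / n_s

def relabel_count(pos_ns, n_ns, pos_s, n_s):
    """M = disc * (sensitive size * non-sensitive size) / total population."""
    disc = discrimination(pos_ns, n_ns, pos_s, n_s)
    return round(disc * n_s * n_ns / (n_s + n_ns))

# Hypothetical data: 80 of 100 non-sensitive and 40 of 100 sensitive are positive
m = relabel_count(80, 100, 40, 100)

# Flip m labels in each group and measure discrimination again
after = discrimination(80 - m, 100, 40 + m, 100)
print(m, after)
```

With these counts the original discrimination is 0.4, so M = 0.4 × 100 × 100 / 200 = 20, and both groups end up with a 60% positive rate.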