Data Ethics Final Review

Last updated 12:57 PM on 5/18/26

New cards

Ethics

Shared principles guiding moral judgement

New cards

Morals

An individual’s beliefs or principles concerning what is right or wrong, often shaped by cultural, religious, or personal beliefs or values

New cards

3 V’s

Volume, velocity, variety

New cards

Data Science Project Process

Identify the hypothesis
Design the research plan
Collect the data
Analyze the data
Extract the results
Exploit the results

New cards

Data Lifecycle

Data collection
Processing
EDA
Analysis/hypothesis testing/ML
Insight and policy decision

New cards

Data Engineers

Specialize in data gathering and storage

New cards

Areas of concern in Data Ethics

Data collection, ownership, privacy, anonymity, validity, algorithm, statistical fairness

New cards

How to protect individuals’ privacy

Store data in a secure database
Use security methods to protect privacy
Practice transparency and obtain informed consent

New cards

Data Anonymity

Preventing the identification of individuals within a dataset when handling and sharing data

New cards

Confidentiality

Protecting data from unauthorized access or disclosures; involves implementing security measures

New cards

Algorithm Fairness

Ensuring the algorithms used in decision-making do not produce unfair or biased outcomes

New cards

Imbalanced Data Issue

Dataset overrepresents a certain outcome

New cards

What ethical concepts are involved with designing a research plan

transparency, informed consent, and algorithmic fairness

New cards

What ethical approaches are related to collecting data

transparency, informed consent, data privacy, anonymity, and data ownership

New cards

What ethical approaches and concepts are related to data analysis

check business, data validity, transparency

New cards

What ethics are associated with extracting data results

Algorithmic fairness, data validity, and transparency

New cards

What ethical concepts are associated with publishing and exploiting results

data ownership, transparency, and algorithmic fairness

New cards

Data Privacy v Anonymity v Confidentiality

Privacy: Protection and appropriate use of personal information

Anonymity: Prevents the identification of individuals within a dataset

Confidentiality: Safeguarding data from unauthorized access or disclosure

New cards

Nominal v Ordinal

Nominal = classified into various groups with no ranks or orders

Ordinal = Grouped based on order or ranking

New cards

EU’s set of tough data protection laws

GDPR

New cards

California’s data protection law that requires businesses to protect the privacy and security of consumers’ personal information

California Consumer Privacy Act

New cards

PII

Personally Identifiable Information: Any data or information that can be used to identify, locate, or contact an individual

New cards

A/B testing

comparing two versions of something to see which performs better in achieving a specific goal

New cards

Selection Bias

Broad category of bias that occurs when the participants included in a study are not representative of the target population

New cards

Reasons for selection bias

Unrepresentative sample, non-random selection, participant factors, etc.

New cards

Types of selection bias

sampling, non-response, self-selection/volunteer bias

New cards

Sampling Bias

Data comes from a non-representative subset:

Participants might be selected from a specific demographic group
Convenience sampling is used
The sample does not represent the entire population

New cards

Non-response bias

Some groups don’t respond, skewing results

New cards

Self-selection bias

people choose whether to participate

New cards

Measurement Bias

Method you use to measure something consistently gives wrong results

New cards

Response Bias

Respondents give inaccurate or untruthful answers to a question

New cards

Common reasons for response bias

social desirability, acquiescence, low effort, lack of interest, survey context, recall bias, cultural bias, researcher/experimenter bias (researchers unintentionally influence data collection), data entry error

New cards

Data ownership

Having complete control over something, whether it is data, property, an asset, an object, or real estate

New cards

4 paradigms of Data Ownership

Subject as owner - individual autonomy; a person has the right to be forgotten
Creator as owner - data belongs to the person who created it; may clash with the subject’s rights; ex: a researcher or an artist
Custodian as owner - the entity storing/managing the data is the owner; ex: a cloud provider
Funder as owner - data belongs to the funding entity

New cards

Types of data ownership

Personal data ownership, organizational data ownership, shared data ownership, collaborative data ownership, open data and public domain, third-party data ownership, data custodianship

New cards

Shared data ownership

Responsibility is shared based on who creates, uses, and manages the data. Responsibility is shared but ownership or decision power may differ. Divided responsibility

New cards

Collaborative Data ownership

Multiple parties jointly create the data and own the final output; equal or joint ownership of the results

New cards

Open data

freely used, reused, and redistributed. At most might need to be attributed; government data and scientific research

New cards

Public Domain

Data that is not protected by copyright and is available for public use without any restrictions

New cards

Data custodianship

ownership does not transfer; third party only temporarily holds and manages the data on behalf of the original owner

New cards

TRIPS agreement

Agreement on Trade-Related Aspects of Intellectual Property Rights. Notes that intellectual property rights can be sorted into copyright, industrial design, trademark, trade secret, patent, geographical indication, and integrated circuit layout design

New cards

Copyright

Owners have exclusive rights to reproduce, distribute, display, and perform their creations.

New cards

How long does US copyright last

The author’s lifetime plus 70 years

New cards

CC BY

allows for copying, sharing, adapting, and using the work even commercially with credit to the creator

New cards

CC BY-NC

Allows copying, sharing, adapting, and using the work for non-commercial purposes only, with credit to the creator

New cards

CC BY-ND

Allows copying and sharing only in original form with credit

New cards

CC BY-SA

Allows adapting the work, but derivative works must use the same license and give credit

New cards

Types of AI-processed Data

Extracted Data, restructured data, augmented data (new variables added), inferred data, modeled data

New cards

Quasi-Identifier

Data attribute that isn’t identifying on its own but can potentially lead to re-identification when combined with other data

New cards

Pseudonymization

Replacing personally identifiable information with a pseudonym so the data isn’t directly identifiable but can be traced back to an individual if necessary
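
A minimal sketch of the idea, assuming a keyed hash (HMAC) is an acceptable way to generate pseudonyms. The key, the `p_` prefix, the sample records, and the separate lookup table are all hypothetical and only illustrate how the link back to the individual can be kept apart from the shared data.

```python
import hashlib
import hmac

# Hypothetical secret key; it must be stored apart from the shared dataset.
SECRET_KEY = b"example-key-kept-separately"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for a direct identifier using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "p_" + digest[:12]


# Made-up records for illustration only.
records = [
    {"name": "Alice Smith", "diagnosis": "flu"},
    {"name": "Bob Jones", "diagnosis": "asthma"},
    {"name": "Alice Smith", "diagnosis": "migraine"},
]

lookup = {}          # pseudonym -> real name, kept by the data owner only
pseudonymized = []   # what would be shared
for rec in records:
    pid = pseudonymize(rec["name"])
    lookup[pid] = rec["name"]
    pseudonymized.append({"id": pid, "diagnosis": rec["diagnosis"]})
```

The same person always maps to the same pseudonym, so records can still be linked, and whoever holds `lookup` can trace a pseudonym back if necessary.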

New cards

Data minimization

Collecting and storing only the minimum amount of data needed

New cards

Access controls

Password protection, multi-factor authentication, etc.

New cards

CCPA

California Consumer Privacy Act; Right to Information, Data Transfer Restrictions, User Rights

New cards

Nevada Senate Bill 220

Data Sale Restriction (websites can’t sell customer data without consumer consent or knowledge); covered information includes PII, and consumers can opt out of the sale of covered information

New cards

Encryption

Encodes data to make it unreadable to anyone who doesn’t have the decryption key

New cards

Access Control

Specifies who can access specific resources or data. Enforces authentication, authorization, and permissions

New cards

Firewalls

Protect networks from unauthorized access by blocking incoming traffic that doesn’t meet specific criteria and restricting outgoing traffic

New cards

Anonymization

Removes identifying information from personal data

New cards

Tokenization

Replaces sensitive data with non-sensitive substitutes to ensure that stolen data is useless to thieves

New cards

Data Loss Prevention

Set of technologies used to prevent data from being lost or stolen. DLP tools monitor traffic, block sensitive data from being sent outside the organization, and restrict unauthorized access

New cards

VPN

Used to protect data in transit; can help to prevent eavesdropping and interception of sensitive data

New cards

2FA

Users need to provide two forms of identification to access a system or application

New cards

Procedure for encoding a message

Cipher

New cards

How Encryption Works

Sender encrypts the original message with an encryption algorithm and a secret encryption key, producing a ciphertext
Sender shares the ciphertext with the recipient through a secure channel or medium
Sender protects the encryption key, keeping it secret and never sharing it in the same communication as the ciphertext
Recipient decrypts the ciphertext using the decryption key to recover the original message

New cards

Symmetric Encryption

One key is used for both encryption and decryption

New cards

Asymmetric/Public-Key Cryptography

Each party has a pair of keys: a public key and a private key. The public key can be shared openly; others use it to encrypt messages to the key’s owner and to verify the owner’s digital signatures. The private key is kept secret; the owner uses it to decrypt those messages and to create signatures.

New cards

What are the benefits of public key cryptography

Security due to two keys, non-repudiation (sender can’t deny sending the message), authentication, scalability, and key exchange

New cards

Steganography

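Like cryptography, it protects a message, but instead of scrambling the content it hides the existence of the message itself (for example, inside an image)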

New cards

What is the Caesar cipher vulnerable to

Letter frequency, Common phrases, brute force attack
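
The brute-force weakness can be seen in a short sketch: a Caesar cipher over the 26-letter alphabet has only 26 possible shifts, so an attacker can simply try them all. The function name and the sample message below are illustrative assumptions.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, leaving other characters alone."""
    result = []
    for ch in text:
        if ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            result.append(ch)
    return "".join(result)


# Encrypt a sample message with shift 3.
ciphertext = caesar_shift("MEET ME AT NOON", 3)

# Brute force: undo every possible shift and inspect the 26 candidates.
candidates = {shift: caesar_shift(ciphertext, -shift) for shift in range(26)}
for shift, guess in candidates.items():
    print(shift, guess)
```

Only one of the 26 candidates reads as English, which is why the key space is far too small to offer real protection.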

New cards

Simple substitution Cipher

Jumble the entire alphabet in a secret permutation
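
A minimal sketch of a simple substitution cipher, assuming an uppercase English alphabet. The fixed random seed is only there to make the example repeatable; a real key would be a secret, unpredictable permutation.

```python
import random
import string

alphabet = string.ascii_uppercase

# The secret key is a permutation of the alphabet.
rng = random.Random(0)  # fixed seed for repeatability only (not secure)
shuffled = list(alphabet)
rng.shuffle(shuffled)
key = "".join(shuffled)

# Translation tables map each plain letter to its substitute and back.
encrypt_table = str.maketrans(alphabet, key)
decrypt_table = str.maketrans(key, alphabet)

message = "DATA ETHICS"
ciphertext = message.translate(encrypt_table)
recovered = ciphertext.translate(decrypt_table)
```

There are 26! possible keys, so brute force is impractical, but each plaintext letter always maps to the same ciphertext letter, so letter-frequency analysis still applies.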

New cards

Common techniques for data anonymity

de-identification, aggregation, anonymization

New cards

Data Anonymity techniques

Suppression, generalization, noise addition
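
A rough sketch of the three techniques applied to one made-up record. The field names, the 10-year age bands, and the uniform noise of up to 1000 are illustrative assumptions, not recommended settings.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Made-up record for illustration only.
record = {"name": "Alice Smith", "age": 34, "zip": "90210", "salary": 72000}


def suppress(rec: dict, field: str) -> dict:
    """Suppression: replace a value with a placeholder."""
    out = dict(rec)
    out[field] = "*"
    return out


def generalize_age(rec: dict, width: int = 10) -> dict:
    """Generalization: replace an exact age with an age band."""
    out = dict(rec)
    low = (rec["age"] // width) * width
    out["age"] = f"{low}-{low + width - 1}"
    return out


def add_noise(rec: dict, field: str, scale: float) -> dict:
    """Noise addition: perturb a numeric value by a random amount."""
    out = dict(rec)
    out[field] = rec[field] + rng.uniform(-scale, scale)
    return out


anonymized = add_noise(generalize_age(suppress(record, "name")), "salary", 1000)
```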

New cards

PPDP

Privacy preserving data publishing; a set of techniques used to share or publish datasets while protecting the privacy of individuals

New cards

k-anonymity

Each record is indistinguishable from at least k - 1 other records with respect to the quasi-identifiers

New cards

Equivalence classes

set of records that have the same values for the quasi-identifiers
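
Grouping records by their quasi-identifier values gives the equivalence classes, and the size of the smallest class is the k for which the table is k-anonymous. The sketch below assumes hypothetical column names and made-up rows.

```python
from collections import Counter

# Hypothetical quasi-identifier columns.
QUASI_IDENTIFIERS = ("age_range", "zip_prefix")

# Made-up, already generalized rows.
rows = [
    {"age_range": "30-39", "zip_prefix": "902**", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "902**", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "diabetes"},
]


def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)


def k_of(rows, quasi_identifiers):
    """The table is k-anonymous for k equal to the smallest class size."""
    return min(equivalence_class_sizes(rows, quasi_identifiers).values())


sizes = equivalence_class_sizes(rows, QUASI_IDENTIFIERS)
k = k_of(rows, QUASI_IDENTIFIERS)
print(sizes, k)
```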

New cards

Homogeneity Attack

If the records within an equivalence class share the same (or very similar) sensitive values, an attacker who knows someone is in that class can infer that person’s sensitive attribute without pinpointing their exact record

New cards

How to counter homogeneity attack

Reducing the size of the dataset so knowing who’s included isn’t possible

New cards

L-diversity

There should be at least l distinct values of the sensitive attribute in each group of instances that share the same quasi-identifier values
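
A small sketch of checking this property, assuming the simplest "distinct values" notion of l-diversity. Column names and rows are made up for illustration.

```python
from collections import defaultdict


def l_of(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())


QI = ("age_range", "zip_prefix")

# First class has only "flu": 2-anonymous, but open to a homogeneity attack.
homogeneous = [
    {"age_range": "30-39", "zip_prefix": "902**", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "902**", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "diabetes"},
]

# Same table with one record changed so every class has two distinct values.
diverse = [dict(r) for r in homogeneous]
diverse[1]["diagnosis"] = "asthma"

l_homogeneous = l_of(homogeneous, QI, "diagnosis")
l_diverse = l_of(diverse, QI, "diagnosis")
```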

New cards

Sources of data validity errors

Nonrepresentative samples
wrong choice of attributes and measures - do the chosen metrics accurately capture the relevant aspects of the phenomena (irrelevant, omitted, poor operationalization like flawed survey questions)
errors in the data
errors in model design
errors in data processing
errors in managing change

New cards

Types of data validation checks

Field level validation
record level validation
file level validation
business rule

New cards

Field level validation

checks individual data fields for correctness. Ex: Dates in right format, data type check, range check, etc.

New cards

Record Level validation

Cross-field validation (checks for consistency; end date after start date, etc.) + mandatory field checks
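
Field-level and record-level checks can be sketched in a few lines. The field names, the date format, and the 0 to 120 age range are assumptions chosen for the example.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"                                  # assumed date format
MANDATORY_FIELDS = ("id", "start_date", "end_date", "age")  # hypothetical schema


def parse_date(text):
    """Return a date if `text` matches DATE_FORMAT, otherwise None."""
    try:
        return datetime.strptime(text, DATE_FORMAT).date()
    except (TypeError, ValueError):
        return None


def validate(record: dict) -> list:
    errors = []
    # Record level: mandatory field check.
    for field in MANDATORY_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"missing {field}")
    # Field level: format, data type, and range checks.
    start = parse_date(record.get("start_date"))
    end = parse_date(record.get("end_date"))
    if record.get("start_date") and start is None:
        errors.append("start_date has wrong format")
    if record.get("end_date") and end is None:
        errors.append("end_date has wrong format")
    age = record.get("age")
    if age is not None and (not isinstance(age, int) or not 0 <= age <= 120):
        errors.append("age out of range or wrong type")
    # Record level: cross-field consistency.
    if start and end and end < start:
        errors.append("end_date before start_date")
    return errors


good = {"id": 1, "start_date": "2024-01-01", "end_date": "2024-02-01", "age": 30}
bad = {"id": 2, "start_date": "2024-03-01", "end_date": "2024-02-01", "age": 150}
print(validate(good), validate(bad))
```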

New cards

File Level validation

Row counts are as expected, there are no duplicate entries, all IDs are unique, etc.

New cards

Business Rule Validation

data complies with business logic

New cards

What assumptions do models often make about data

linearity, normality, independence

New cards

Where does algorithmic bias come from

Historical bias, unbalanced datasets, and proxy variables

New cards

True Positive Rate

True Positive / (True Positive + False Negative); proportion of actual incidents that are accurately detected as positive by the model

New cards

False Positive Rate

False Positive / (False Positive + True Negative); proportion of actual negatives that the model incorrectly classifies as positive
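
As a worked example, the true positive rate and false positive rate can be computed directly from the four confusion-matrix counts. The labels and predictions below are made up, with 1 meaning positive and 0 meaning negative.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
    return tpr, fpr


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
print(tpr, fpr)
```

Here TP = 3, FN = 1, FP = 1, and TN = 5, so the true positive rate is 3/4 and the false positive rate is 1/6.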

New cards

Adversarial Debiasing

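Identify and mitigate bias in model predictions by countering discriminatory patterns: the model is trained alongside an adversary that tries to predict the sensitive attribute from the model’s predictions, and the model is penalized when the adversary succeeds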

New cards

Regularization

Penalize correlations between sensitive attributes and model predictions

New cards

Re-weighted loss functions

different data points are assigned varying levels of importance with underrepresented or sensitive groups given more weight

New cards

Equalized odds post-processing

Adjust prediction probabilities to equalize true/false positive rates

New cards

Pre-processing techniques

Resampling, re-weighting, and data transformations applied before training to ensure that all groups are equally represented (for example, before training a hiring model)

New cards

Fair relabelling

Adjusting the dataset to reduce bias by changing the class label of some instances from the sensitive group from negative to positive, and from positive to negative for some instances from the non-sensitive group

New cards

Statistical Parity

Evaluates whether the probability of a favorable outcome is the same for both sensitive and non-sensitive groups; take the probability of a positive outcome for each group and subtract one from the other

New cards

Disparate Impact

Calculate the ratio of the favorable-outcome rates of the two groups (sensitive : non-sensitive)

New cards

If statistical parity is 40%, the sentence is:

Men have a 40 percentage point higher probability of being hired than women, in absolute terms

New cards

If disparate impact is two, the sentence is:

Men are twice as likely to be hired as women, in relative terms
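
A small worked sketch of both metrics on made-up hiring counts. Which group goes first in the difference and on top of the ratio is an assumption here, chosen so the numbers match the "40%" and "two" sentences used in these cards; other sources may use the opposite direction.

```python
def favorable_rate(favorable: int, total: int) -> float:
    """Probability of a favorable outcome within one group."""
    return favorable / total


# Made-up counts: 8 of 10 men hired, 4 of 10 women hired.
rate_men = favorable_rate(8, 10)
rate_women = favorable_rate(4, 10)

# Statistical parity difference: gap between the two rates (absolute terms).
statistical_parity = rate_men - rate_women

# Disparate impact: ratio of the two rates (relative terms).
disparate_impact = rate_men / rate_women

print(statistical_parity, disparate_impact)
```

With these counts the difference is 0.4 (40 percentage points) and the ratio is 2, so men are twice as likely to be hired in relative terms.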

New cards

Massaging

Make changes to the dataset to reduce or eliminate bias by using data manipulations

New cards

Formula to calculate the number of instances to relabel to bring discrimination to 0

M = (disc(D) × S^T × ns^T) / (S^T + ns^T)

where S^T is the number of people in the sensitive group and ns^T is the number of people in the non-sensitive group. Multiply the original discrimination by the number of people in the sensitive group times the number of people in the non-sensitive group, and then divide by the total population
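
A worked sketch of the calculation on made-up counts, reading the formula as discrimination times the two group sizes, divided by the total population. It assumes discrimination is the non-sensitive group's positive rate minus the sensitive group's, and that M labels are flipped in each group (negative to positive in the sensitive group, positive to negative in the non-sensitive group).

```python
def relabel_count(disc: float, n_sensitive: int, n_non_sensitive: int) -> float:
    """M = disc * |sensitive| * |non-sensitive| / total population."""
    return disc * n_sensitive * n_non_sensitive / (n_sensitive + n_non_sensitive)


# Made-up data: 4 of 10 sensitive and 8 of 10 non-sensitive instances are positive.
n_s, pos_s = 10, 4
n_ns, pos_ns = 10, 8

disc_before = pos_ns / n_ns - pos_s / n_s            # 0.8 - 0.4
m = round(relabel_count(disc_before, n_s, n_ns))     # instances to flip per group

# Promote m sensitive instances and demote m non-sensitive instances.
disc_after = (pos_ns - m) / n_ns - (pos_s + m) / n_s

print(m, disc_after)
```

Here M = 0.4 × 10 × 10 / 20 = 2, and after relabeling both groups have a positive rate of 6/10, so the discrimination is 0.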