Data Ethics Final Review


Last updated 12:57 PM on 5/18/26

125 Terms

1
New cards

Ethics

Shared principles guiding moral judgement

2
New cards

Morals

An individual’s beliefs or principles concerning what is right or wrong, often shaped by cultural, religious, or personal values

3
New cards

3 V’s

Volume, velocity, variety

4
New cards

Data Science Project Process

  1. Identify the hypothesis

  2. Design the research plan

  3. Collect the data

  4. Analyze the data

  5. Extract the results

  6. Exploit the results

5
New cards

Data Lifecycle

  1. Data collection

  2. Processing

  3. EDA

  4. Analysis/hypothesis testing/ML

  5. Insight and policy decision

6
New cards

Data Engineers

Specialize in data gathering and storage

7
New cards

Areas of concern in Data Ethics

Data collection, ownership, privacy, anonymity, validity, algorithm, statistical fairness

8
New cards

To protect an individual’s privacy

  1. Store data in a secure database

  2. Use security methods to protect privacy

  3. Practice transparency and obtain informed consent

9
New cards

Data Anonymity:

Preventing the identification of individuals within a dataset when handling and sharing data

10
New cards

Confidentiality

Protecting data from unauthorized access or disclosures; involves implementing security measures

11
New cards

Algorithm Fairness

Ensuring the algorithms used in decision-making do not produce unfair or biased outcomes

12
New cards

Imbalanced Data Issue

Dataset overrepresents a certain outcome

13
New cards

What ethical concepts are involved with designing a research plan

transparency, informed consent, and algorithmic fairness

14
New cards

What ethical approaches are related to collecting data

transparency, informed consent, data privacy, anonymity, and data ownership

15
New cards

What ethical approaches and concepts are related to data analysis

check business, data validity, transparency

16
New cards

What ethics are associated with extracting data results

Algorithmic fairness, data validity, and transparency

17
New cards

What ethical concepts are associated with publishing and exploiting results

data ownership, transparency, and algorithmic fairness

18
New cards

Data Privacy v Anonymity v Confidentiality

Privacy: Protection and appropriate use of personal information

Anonymity: Prevents the identification of individuals within a dataset

Confidentiality: Safeguarding data from unauthorized access or disclosure

19
New cards

Nominal v Ordinal

Nominal = classified into various groups with no ranks or orders

Ordinal = Grouped based on order or ranking

20
New cards

EU’s set of tough data protection laws

GDPR

21
New cards

California’s data protection law that requires the protection of the privacy and security of consumers’ personal information

California Consumer Privacy Act

22
New cards

PII

Personally Identifiable Information: Any data or information that can be used to identify, locate, or contact an individual

23
New cards

A/B testing

comparing two versions of something to see which performs better in achieving a specific goal
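
One way the comparison might be made is sketched below. It assumes each visitor either converts or does not, and that a two-proportion z-test is an acceptable way to compare the versions. The function name and the visitor counts are hypothetical, chosen only for illustration.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (rate_a, rate_b, z) for a two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the assumption that A and B perform the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return rate_a, rate_b, z

# Hypothetical counts: version A converts 100 of 1000, version B 130 of 1000
rate_a, rate_b, z = two_proportion_z(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(rate_a, rate_b, round(z, 2))
```

A z value beyond roughly 1.96 in absolute value is commonly read as a difference that is significant at the 5% level.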

24
New cards

Selection Bias

Broad category of bias that occurs when the participants included in a study are not representative of the target population

25
New cards

Reasons for selection bias

Unrepresentative sample, non-random selection, participant factors, etc.

26
New cards

Types of selection bias

sampling, non-response, self-selection/volunteer bias

27
New cards

Sampling Bias

Data comes from a non-representative subset:

  • participants might be selected from a specific demographic group

  • convenience sampling is used

  • the sample does not represent the entire population

28
New cards

Non-response bias

Some groups don’t respond, skewing results

29
New cards

Self-selection bias

People choose whether to participate, so those who opt in may differ from those who do not

30
New cards

Measurement Bias

Method you use to measure something consistently gives wrong results

31
New cards

Response Bias

Someone responds falsely to a question

32
New cards

Common reasons for response bias

social desirability, acquiescence, low effort, lack of interest, survey context, recall bias, cultural bias, researcher/experimenter bias (researchers unintentionally influence data collection), data entry error

33
New cards

Data ownership

Having complete control over data, in the same way one would own property, an asset, an object, or real estate

34
New cards

4 paradigms of Data Ownership

  1. subject as owner - individual autonomy; a person has the right to be forgotten

  2. creator as owner - data belongs to the person who created it; may clash with the subject’s rights; ex: a researcher or an artist

  3. custodian as owner - the entity storing/managing the data is the owner; ex: cloud provider

  4. funder as owner - data belongs to the funding entity

35
New cards

Types of data ownership

personal data ownership, organizational data ownership, shared data ownership, collaborative data ownership, open data and public domain, third party data ownership, data custodianship

36
New cards

Shared data ownership

Responsibility is divided based on who creates, uses, and manages the data, but ownership or decision power may differ between the parties

37
New cards

Collaborative Data ownership

Multiple parties jointly create the data and own the final output; equal or joint ownership of the results

38
New cards

Open data

Data that can be freely used, reused, and redistributed, at most requiring attribution; ex: government data and scientific research

39
New cards

Public Domain

Data that is not protected by copyright and is available for public use without any restrictions

40
New cards

Data custodianship

Ownership does not transfer; a third party only temporarily holds and manages the data on behalf of the original owner

41
New cards

TRIPS agreement

Agreement on Trade-Related Aspects of Intellectual Property Rights. Notes that intellectual property rights can be sorted into copyright, industrial design, trademark, trade secret, patent, geographical indication, and integrated circuit layout design

42
New cards

Copyright

owners have exclusive rights to reproduce, distribute, display, and perform creations.

43
New cards

How long does US copyright last

Life of the author + 70 years

44
New cards

CC BY

Allows copying, sharing, adapting, and using the work, even commercially, with credit to the creator

45
New cards

CC BY-NC

Allows copying, sharing, adapting, and using the work for non-commercial purposes only, with credit

46
New cards

CC BY-ND

Allows copying and sharing only in original form, with credit

47
New cards

CC BY-SA

Allows adapting the work, but derivative works must use the same license and give credit

48
New cards

Types of AI-processed Data

Extracted Data, restructured data, augmented data (new variables added), inferred data, modeled data

49
New cards

Quasi-Identifier

Data attribute that isn’t identifying on its own but can potentially lead to re-identification when combined with other data

50
New cards

Pseudonymization

Replacing personally identifiable information with a pseudonym so the data isn’t directly identifiable but can be traced back to an individual if necessary
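
A minimal sketch of the idea, assuming a keyed hash (HMAC-SHA256) is used to derive the pseudonym and that a separate lookup table is what allows tracing back when necessary. The key, field names, and records are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret-key"  # assumption: stored apart from the released data

def pseudonymize(value, key=SECRET_KEY):
    """Derive a stable pseudonym from an identifier using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

def pseudonymize_records(records, key=SECRET_KEY):
    """Return (released records without names, lookup table kept by the data holder)."""
    lookup = {}
    released = []
    for rec in records:
        pseudonym = pseudonymize(rec["name"], key)
        lookup[pseudonym] = rec["name"]  # enables re-identification if necessary
        released.append({"id": pseudonym, "diagnosis": rec["diagnosis"]})
    return released, lookup

records = [
    {"name": "Alice", "diagnosis": "flu"},
    {"name": "Bob", "diagnosis": "asthma"},
    {"name": "Alice", "diagnosis": "cold"},
]
released, lookup = pseudonymize_records(records)
print(released[0]["id"] == released[2]["id"])  # True: same person, same pseudonym
```

The released records no longer carry the name, but whoever holds the lookup table (or the key) can still link a pseudonym back to a person, which is what separates this from full anonymization.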

51
New cards

Data minimization

Collecting and storing only the minimum amount of data needed

52
New cards

Access controls

Password protection, multi-factor authentication, etc.

53
New cards

CCPA

California Consumer Privacy Act; Right to Information, Data Transfer Restrictions, User Rights

54
New cards

Nevada Senate Bill 220

Data Sale Restriction (websites can’t sell customer data without consumer consent or knowledge); covered information includes PII; consumers can opt out of the sale of covered information

55
New cards

Encryption

Encodes data to make it unreadable to anyone who doesn’t have the decryption key

56
New cards

Access Control

Specifies who can access specific resources or data. Enforces authentication, authorization, and permissions

57
New cards

Firewalls

Protect networks from unauthorized access by blocking incoming traffic that doesn’t meet specific criteria and restricting outgoing traffic

58
New cards

Anonymization

Removes identifying information from personal data

59
New cards

Tokenization

Replaces sensitive data with non-sensitive substitutes to ensure that stolen data is useless to thieves

60
New cards

Data Loss Prevention

Set of technologies used to prevent data from being lost or stolen. DLP tools monitor traffic, block sensitive data from being sent outside the organization, and restrict unauthorized access

61
New cards

VPN

Used to protect data in transit and can help to prevent eavesdropping and interception of sensitive data

62
New cards

2FA

Users need to provide two forms of identification to access a system or application

63
New cards

Procedure for encoding a message

Cipher

64
New cards

How Encryption Works

  1. Sender encrypts the original message using an encryption algorithm and a secret encryption key, producing a ciphertext

  2. Sender shares the encrypted message with the recipient through a secure channel or medium

  3. Sender protects the encryption key, keeping it secret and never sharing it in the same communication as the ciphertext

  4. Recipient decrypts the message using the decryption key to recover the original message
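
The four steps above can be sketched with a toy cipher. This is an illustration only, assuming a one-time XOR key as long as the message. It is not how production systems encrypt data, and the helper name `xor_bytes` is hypothetical.

```python
import secrets

def xor_bytes(data, key):
    """XOR each byte of data with the matching byte of key."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"meet at noon"
key = secrets.token_bytes(len(message))  # step 3: kept secret, never sent with the ciphertext
ciphertext = xor_bytes(message, key)     # step 1: sender encrypts
# step 2: only `ciphertext` is sent to the recipient
recovered = xor_bytes(ciphertext, key)   # step 4: recipient decrypts with the key
print(recovered == message)  # True
```

Because the same key both encrypts and decrypts, this is an instance of symmetric encryption.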

65
New cards

Symmetric Encryption

One key is used for both encryption and decryption

66
New cards

Asymmetric/Public-Key Cryptography

Each party has a pair of keys: a public key and a private key. The public key is shared openly; others use it to encrypt messages to the owner and to verify the authenticity of messages the owner has signed. The private key is kept secret; the owner uses it to decrypt those messages and to sign outgoing ones.

67
New cards

What are the benefits of public key cryptography

Security due to two keys, non-repudiation (sender can’t deny sending the message), authentication, scalability, and key exchange

68
New cards

Steganography

Hiding the existence of the information itself (ex: inside an image), whereas cryptography only makes the content unreadable

69
New cards

What is the Caesar cipher vulnerable to

Letter frequency analysis, common phrases, brute-force attack
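
The brute-force weakness is easy to see in a small sketch: there are only 26 possible shifts, so an attacker can simply try them all. The helper names below are hypothetical, and the sketch assumes lowercase English text.

```python
import string

ALPHABET = string.ascii_lowercase

def caesar_shift(text, shift):
    """Shift each lowercase letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

def brute_force(ciphertext):
    """Try every possible shift and return all 26 candidate plaintexts."""
    return {shift: caesar_shift(ciphertext, -shift) for shift in range(26)}

ciphertext = caesar_shift("attack at dawn", 3)
candidates = brute_force(ciphertext)
print(ciphertext)     # dwwdfn dw gdzq
print(candidates[3])  # attack at dawn
```

An attacker only needs to scan the 26 candidates for the one that reads as real language.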

70
New cards

Simple substitution Cipher

Jumble the entire alphabet in a secret permutation
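
A rough sketch of such a cipher, assuming lowercase English letters. The function names are hypothetical, and Python's `random` module is used only for illustration since it is not cryptographically secure.

```python
import random
import string

def make_key(seed=None):
    """Build a secret permutation of the alphabet as a letter-to-letter mapping."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def substitute(text, mapping):
    """Replace each letter using the mapping; leave other characters unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)

key = make_key(seed=42)                        # the secret permutation
inverse = {v: k for k, v in key.items()}       # used for decryption
ciphertext = substitute("hello world", key)
print(substitute(ciphertext, inverse))  # hello world
```

There are 26! possible keys, so brute force is impractical, but each plaintext letter always maps to the same ciphertext letter, which leaves the cipher open to letter frequency analysis.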

71
New cards

Common techniques for data anonymity

de-identification, aggregation, anonymization

72
New cards

Data Anonymity techniques

Suppression, generalization, noise addition
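
The three techniques might look like this on a single record. This is a sketch under assumptions: the field names, the ten-year age bands, and the uniform noise are illustrative choices, not prescribed by the card.

```python
import random

def suppress(record, fields):
    """Suppression: replace the listed fields with a placeholder."""
    return {k: ("*" if k in fields else v) for k, v in record.items()}

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as 30-39."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def add_noise(value, scale, rng):
    """Noise addition: perturb a numeric value by a random amount within +/- scale."""
    return value + rng.uniform(-scale, scale)

record = {"name": "Alice", "age": 34, "salary": 52000}
anonymized = suppress(record, {"name"})
anonymized["age"] = generalize_age(anonymized["age"])
anonymized["salary"] = add_noise(anonymized["salary"], scale=1000, rng=random.Random(0))
print(anonymized)
```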

73
New cards

PPDP

Privacy preserving data publishing; a set of techniques used to share or publish datasets while protecting the privacy of individuals

74
New cards

K anonymity

Each record is indistinguishable from at least k - 1 other records with respect to the quasi-identifiers
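
A small sketch of how k could be measured for a table: group the records by their quasi-identifier values and take the size of the smallest group. The function name, column names, and records are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "cold"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "asthma"},
]
print(k_anonymity(records, ["zip", "age"]))  # 2
```

Here the table is 2-anonymous: every record shares its zip and age band with at least one other record.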

75
New cards

Equivalence classes

set of records that have the same values for the quasi-identifiers

76
New cards

Homogeneity Attack

If the records within a group share too-similar values for the sensitive attribute, attackers can make educated guesses about an individual's sensitive information even without pinpointing their exact record

77
New cards

How to counter homogeneity attack

Reducing the size of the dataset so that knowing who is included isn’t possible; requiring varied sensitive values within each group (l-diversity) also counters it

78
New cards

L-diversity

There should be at least l distinct values for the sensitive attribute in each group of instances that share the same quasi-identifiers
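
As a sketch, l can be measured by counting the distinct sensitive values in each quasi-identifier group and taking the minimum. The function name, columns, and records are hypothetical.

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())

records = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "asthma"},
]
print(l_diversity(records, ["zip", "age"], "disease"))  # 1
```

The table is 2-anonymous, yet the first group has only one disease value, so anyone known to be in that group is known to have the flu. That is the homogeneity problem l-diversity is meant to catch.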

79
New cards

Sources of data validity errors

  1. Nonrepresentative samples

  2. wrong choice of attributes and measures - do the chosen metrics accurately capture the relevant aspects of the phenomena (irrelevant, omitted, poor operationalization like flawed survey questions)

  3. errors in the data

  4. errors in model design

  5. errors in data processing

  6. errors in managing change

80
New cards

Types of data validation checks

  1. Field level validation

  2. record level validation

  3. file level validation

  4. business rule

81
New cards

Field level validation

checks individual data fields for correctness. Ex: Dates in right format, data type check, range check, etc.
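
A minimal sketch of field-level checks, assuming a row with a `birth_date` string and an integer `age`. The field names and the 0-120 range are hypothetical.

```python
from datetime import datetime

def validate_fields(row):
    """Check each field on its own: date format, data type, and range."""
    errors = []
    try:
        datetime.strptime(row["birth_date"], "%Y-%m-%d")  # format check
    except (ValueError, TypeError):
        errors.append("birth_date: expected YYYY-MM-DD")
    if not isinstance(row["age"], int):                   # data type check
        errors.append("age: expected an integer")
    elif not 0 <= row["age"] <= 120:                      # range check
        errors.append("age: out of range 0-120")
    return errors

print(validate_fields({"birth_date": "2001-02-30", "age": 150}))
print(validate_fields({"birth_date": "2001-02-28", "age": 24}))  # []
```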

82
New cards

Record Level validation

Cross-field validation (checks for consistency; end date after start date, etc.) + mandatory field checks
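
A sketch of what a record-level check might look like, assuming each row is a dictionary with `id`, `start_date`, and `end_date`. The field names are hypothetical.

```python
from datetime import date

REQUIRED = ("id", "start_date", "end_date")  # assumption: these fields are mandatory

def validate_record(row):
    """Check mandatory fields first, then consistency across fields."""
    errors = [f"{name}: missing" for name in REQUIRED if row.get(name) is None]
    if not errors and row["end_date"] < row["start_date"]:
        errors.append("end_date is before start_date")
    return errors

bad = {"id": 1, "start_date": date(2024, 5, 1), "end_date": date(2024, 4, 1)}
print(validate_record(bad))  # ['end_date is before start_date']
```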

83
New cards

File Level validation

Row counts are as expected, there are no duplicate entries, all IDs are unique, etc.
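
A sketch of file-level checks over a whole set of rows, assuming each row has an `id` field and that the expected row count is known in advance. The names are hypothetical.

```python
from collections import Counter

def validate_file(rows, expected_rows):
    """Check the dataset as a whole: row count and uniqueness of ids."""
    errors = []
    if len(rows) != expected_rows:
        errors.append(f"expected {expected_rows} rows, found {len(rows)}")
    id_counts = Counter(row["id"] for row in rows)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    if duplicates:
        errors.append(f"duplicate ids: {duplicates}")
    return errors

rows = [{"id": 1}, {"id": 2}, {"id": 2}]
print(validate_file(rows, expected_rows=4))
# ['expected 4 rows, found 3', 'duplicate ids: [2]']
```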

84
New cards

Business Rule Validation

data complies with business logic

85
New cards

What assumptions do models often make about data

linearity, normality, independence

86
New cards

Where does algorithmic bias come from

Historical bias, unbalanced datasets, and proxy variables

87
New cards

True Positive Rate

True Positive / (True Positive + False Negative); proportion of actual incidents that are accurately detected as positive by the model

88
New cards

False Positive Rate

False Positive / (False Positive + True Negative); proportion of actual negatives that are incorrectly classified as positive by the model
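
Both the true positive rate and the false positive rate can be computed from the same confusion counts. A small sketch, assuming binary labels where 1 is positive and that both classes appear in the data; the function name and the labels are made up.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, round(fpr, 3))  # 0.75 0.167
```

Here 3 of the 4 actual positives are detected (TPR = 0.75) and 1 of the 6 actual negatives is wrongly flagged (FPR = 1/6).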

89
New cards

Adversarial Debiasing

Identify and mitigate bias in model predictions by countering discriminatory patterns

90
New cards

Regularization

Penalize correlations between sensitive attributes and model predictions

91
New cards

Re-weighted loss functions

different data points are assigned varying levels of importance with underrepresented or sensitive groups given more weight
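
One way this might be realized is a weighted log loss with inverse-frequency weights per group. This is a sketch under assumptions: the weighting scheme, function names, and numbers are illustrative choices, not the only way to re-weight.

```python
import math
from collections import Counter

def inverse_frequency_weights(groups):
    """Give each point a weight inversely proportional to its group's size."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

def weighted_log_loss(y_true, p_pred, weights):
    """Binary log loss where each point contributes in proportion to its weight."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

y_true = [1, 0, 1]
p_pred = [0.9, 0.2, 0.4]   # the model does worst on the third point
groups = ["a", "a", "b"]   # group "b" is underrepresented
weights = inverse_frequency_weights(groups)
plain = weighted_log_loss(y_true, p_pred, [1, 1, 1])
reweighted = weighted_log_loss(y_true, p_pred, weights)
print(weights)             # [0.75, 0.75, 1.5]
print(reweighted > plain)  # True
```

Because the poorly predicted point belongs to the underrepresented group, re-weighting raises the loss, which pushes training to pay more attention to that group.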

92
New cards

Equalized odds post-processing

Adjust prediction probabilities to equalize true/false positive rates

93
New cards

Pre-processing techniques

Resampling, re-weighting, and data transformations applied before training; ex: ensuring that all groups are equally represented before training a hiring model

94
New cards

Fair relabelling

Reducing bias in a dataset by changing the class label of some instances from the sensitive group from negative to positive, and of some instances from the non-sensitive group from positive to negative

95
New cards

Statistical Parity

Evaluates whether the probability of a favorable outcome is the same for both sensitive and non-sensitive groups; take the probability of a positive outcome for both groups and subtract one from the other

96
New cards

Disparate Impact

Calculate the ratio of the favorable-outcome rates of the sensitive and non-sensitive groups

97
New cards

If statistical parity is 40%, the sentence is

Men have a probability of being hired that is 40 percentage points higher than women's, in absolute terms

98
New cards

If disparate impact is two, the sentence is:

Men are twice as likely to be hired as women, in relative terms
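
Both readings can be reproduced with a small calculation. This is a sketch under assumptions: the hiring outcomes are made up, the function names are hypothetical, and the difference and ratio are taken as non-sensitive group relative to sensitive group to match the sentences on these cards (some references define the ratio the other way round).

```python
def favorable_rate(outcomes):
    """Share of favorable outcomes (1 = hired, 0 = not hired)."""
    return sum(outcomes) / len(outcomes)

def statistical_parity_difference(non_sensitive, sensitive):
    """Absolute gap between the two groups' favorable-outcome rates."""
    return favorable_rate(non_sensitive) - favorable_rate(sensitive)

def disparate_impact_ratio(non_sensitive, sensitive):
    """Relative gap: ratio of the two groups' favorable-outcome rates."""
    return favorable_rate(non_sensitive) / favorable_rate(sensitive)

men = [1] * 8 + [0] * 2    # 8 of 10 hired
women = [1] * 4 + [0] * 6  # 4 of 10 hired
spd = statistical_parity_difference(men, women)
di = disparate_impact_ratio(men, women)
print(round(spd, 2))  # 0.4
print(di)             # 2.0
```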

99
New cards

Massaging

Make changes to the dataset to reduce or eliminate bias by using data manipulations

100
New cards

Formula to calculate the number of instances to relabel to bring discrimination to 0

M = (disc(D) * S^T * ns^T) / (S^T + ns^T)

Multiply the original discrimination by the number of people in the sensitive group (S^T) times the number of people in the non-sensitive group (ns^T), and then divide by the total population (S^T + ns^T)
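
A worked example of this calculation, with made-up numbers. The function name is hypothetical, and rounding up to a whole number of instances is an assumption of this sketch.

```python
import math

def relabels_needed(disc, n_sensitive, n_non_sensitive):
    """M = disc * n_sensitive * n_non_sensitive / (n_sensitive + n_non_sensitive)."""
    m = disc * n_sensitive * n_non_sensitive / (n_sensitive + n_non_sensitive)
    # Assumption: round up, since only whole instances can be relabelled
    return math.ceil(round(m, 9))

# 10 men (8 hired) and 10 women (4 hired): discrimination = 0.8 - 0.4 = 0.4
print(relabels_needed(disc=0.4, n_sensitive=10, n_non_sensitive=10))  # 2
```

Relabelling 2 women from negative to positive and 2 men from positive to negative leaves 6 of 10 hired in each group, so the discrimination drops to 0.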