ethics and legality in an ideal world
would fully overlap
ethics and legality in the real world
only have a slight amount of overlap
data ethics
the set of principles and processes that guide the ethical collection, processing, analysis, use and application of data having an effect on human lives and society
- aim to create an ethical/moral code of conduct for data use and collection
data science
application of computational and statistical techniques to address/gain insight on a real-world problem
- data can be unstructured, structured, semi-structured
- data cleansing, prep, analysis
Data Science Project Process
Identify
Design Research Plan
Collect data
Analyze data
Extract result
Publish/exploit results
where in the Data Science Project Process does "ethics approval" normally occur?
Designing your research plan
Step 1 DSPP
Identify hypothesis/question
Example: A study about online shopping recommendations. Researchers plan to collect data about users' shopping habits.
Step 2 DSPP
Design research plan
Consider any potential positive impacts (e.g., more personalized shopping suggestions) or negative impacts (e.g., participants might feel their privacy is invaded) on participants, and ensure their well-being and informed consent. Researchers should weigh both kinds of impact.
this is where ethics approval occurs
Step 3 DSPP
Collect Data
* How could bias in the data affect results? Could data be used for other purposes?
Bias in Data: Biased data can lead to biased results, affecting the fairness and accuracy of findings. (If data comes only from a certain demographic, the recommendations might work well only for that group, causing unfairness for others.)
Data Usage: Be aware that collected data might be repurposed for unintended uses that could harm privacy and consent. (For example, it could later be used to target users with ads without their consent.)
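A minimal sketch of how the demographic skew described above might be checked before analysis. The records, the `age_group` field, and the 10% threshold are all hypothetical, chosen only for illustration.

```python
from collections import Counter

def group_shares(records, key):
    """Return each group's share of the records as a fraction of the total."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def underrepresented(shares, min_share):
    """List groups whose share falls below min_share (an assumed threshold)."""
    return sorted(g for g, s in shares.items() if s < min_share)

# Hypothetical shopping records; only the demographic field matters here.
records = (
    [{"age_group": "18-29"}] * 16
    + [{"age_group": "30-49"}] * 3
    + [{"age_group": "50+"}] * 1
)

shares = group_shares(records, "age_group")
flagged = underrepresented(shares, min_share=0.10)
print(shares)
print(flagged)
```

A check like this only sees groups that appear in the data at all; a group that is entirely missing has to be caught by comparing against the population the study is meant to serve.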
Step 4 DSPP
Analyze data
* Is the analysis of the data introducing bias? any uncertainty or assumptions?
Bias in Analysis: Identify if the analysis method introduces bias. (If researchers only focus on certain types of purchases, the recommendations could be biased towards those items, affecting the accuracy and fairness of results.)
Uncertainty and Assumptions: They impact the reliability and scope of the results. (Assuming that users' past purchases predict their future preferences might not hold true for everyone, introducing uncertainty into the results.)
Step 5 DSPP
Extract results
* Can the result be misinterpreted? Are the uncertainties, assumptions, biases properly represented?
Misinterpretation: Presenting improved shopping recommendations might be misinterpreted as an absolute guarantee of satisfaction when it's actually a probability.
Representing Uncertainties, Assumptions, Biases: State the limitations and potential sources of error to provide accurate context for the results. (The researchers should include a clear explanation of how they arrived at the recommendations, addressing potential biases and limitations.)
Step 6 DSPP
Publish/Exploit Result
* Can the result outcomes be abused? Can disclosing or not disclosing them have ethical consequences?
Abuse of Results: Results could be used to cause harm or enable unethical actions. (For example, to manipulate users into buying things they don't need.)
Ethical Consequences of Disclosure: Weighing the ethical implications of sharing or withholding certain results, and how they might affect individuals or society. It's like when you have a secret and you need to decide if you should tell your friend.
why is the Data Science Project Process important?
Because you're actively practicing data ethics, which involves not only complying with legal regulations but also making ethical choices that promote fairness, transparency, and the well-being of individuals and society. This approach enhances the credibility and impact of your research while minimizing potential negative consequences
ethics
shared principles guiding moral judgement
- cornerstone of civilization
ethics vs laws
ethics guide the creation of laws
- laws can be used to enforce ethical behaviors, but ethical values are not laws. For example, even though it may be unethical to tell a secret, it is not against the law.
morals
individual's beliefs concerning what is right or wrong
- often shaped by cultural, religious, or personal values
- subjective and varied amongst everyone
example: vegetarianism
- it is not against the law or unethical to eat meat overall, vegetarians just believe that they do not want to consume it
data
any set of information that can be collected, stored, analyzed
big data
datasets with large volume, created and updated at high velocity, with varied structures and formats
volume
the amount of data from sources
velocity
the speed at which big data is generated
variety
the types of data (structured, unstructured, semi-structured)
how do companies manage large datasets?
Apache Spark, Hadoop, cloud-based storage (AWS, Google Cloud)
how do data scientists deal with variety in data?
statistics, CS, machine learning, AI
why utilize data?
for better decision-making, customer-centric facilities, and targeted marketing
example of data facilitator
internet because it is used by all types of people and organizations
What can you analyze from this data?
- medical records
- patient demographics
lab results
predict disease outbreaks
optimize treatment plans
provide insights for medical research
what data are you sharing?
social network friend's list
location
web searches
chat logs
IP addresses
web history
why are data ethics needed?
though it can be very helpful, people can misuse it. therefore, ethics need to be considered
- rules still evolving
decision-making and policy development
policies must be fair and based on reliable data
social and economic inequality
need to avoid systematic bias and ensure equitable outcomes
privacy and data protection
maintain individual's privacy rights
societal impact of data science and the need for responsible practices
trust and transparency
ethical considerations in AI and automation
social good and public interest
data governance and regulation
benefits of data ethics
consistency
better data-driven decisions
increased transparency
consistency
helps all data users navigate the ethical considerations of data use
better data-driven decisions
uncover data limitations, gaps, and biases; facilitate justifiable decisions with data --> promote transparency
increased transparency
trustworthy data processes to increase the transparency of their data
structured data
names, addresses, credit card numbers, geolocation
unstructured data
photos, audio, video, social media posts
major sources of abundant data
business, science, society and everyone
why is data ethics needed?
data is a valuable asset that can build the next great business/innovation, but it is also a resource lots of organizations are not protecting or using ethically.
ethics
govern professional interactions
laws
govern society as a whole
morals
governs private, personal interactions
what are areas of concern in data ethics?
data collection
data ownership
data privacy
data anonymity
data validity
algorithm, statistical fairness
data collection
process of gathering information from various sources
example of data collection ethics
ensuring that respondents give consent before providing their information
data ownership
the act of having legal rights and complete control to make decisions about data
example of data ownership ethics
you took a stunning sunset photo of me, but before you include it in a blog post, you ask for explicit permission from me to use the photo for a specific purpose
important lesson in data ownership
must ask for consent and respect the right of the data owner
data privacy
the protection and appropriate use of personal information
involves safeguarding individuals' personal data and ensuring that their privacy rights are respected
example of data privacy ethics
a customer gives the company consent to collect and store their PII, but that does not mean they want it publicly available
ways to implement data privacy
dual authentication passwords
file encryption
informed consent
data anonymity
preventing the identification of individuals within a dataset when handling and sharing data, so that data cannot be linked back to someone
confidentiality
centers around protecting data from unauthorized access or disclosure --> implementing security measures
data validity
the integrity and accuracy of data being collected, analyzed for various purposes
represents real-world phenomena and refrains from errors and biases
algorithmic fairness
ethical consideration and practice of ensuring that algorithms used in decision-making processes do not result in unfair or biased outcomes
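One way to make "unfair or biased outcomes" measurable is to compare how often each group receives a favorable decision. The sketch below computes per-group selection rates and the gap between them (often called the demographic parity difference). This is only one of several fairness definitions, and the groups and decisions are invented for illustration.

```python
def selection_rates(decisions):
    """decisions: list of (group, selected) pairs.
    Returns the fraction of favorable outcomes for each group."""
    totals, positives = {}, {}
    for group, selected in decisions:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if selected else 0)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(rates):
    """Difference between the highest and lowest selection rate."""
    return max(rates.values()) - min(rates.values())

# Hypothetical decisions from some algorithm for two groups, A and B.
decisions = (
    [("A", True)] * 6 + [("A", False)] * 4
    + [("B", True)] * 3 + [("B", False)] * 7
)

rates = selection_rates(decisions)
gap = demographic_parity_gap(rates)
print(rates)
print(gap)
```

A gap near zero means the groups are selected at similar rates. A large gap does not prove unfairness by itself, but it is a signal that the data and the algorithm deserve a closer look.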
misuse of statistics
unethical practice of distorting or manipulating statistical information to support a particular agenda or draw false conclusions
data confidentiality
ensuring that data is protected from unauthorized use.
implementing security measures to prevent data breaches
example of data privacy
social media privacy
example of data anonymity
healthcare data does not have names, just identification numbers
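A minimal sketch of replacing names with identification numbers, using a keyed hash (HMAC) from the Python standard library. The record fields, the `patient_id` name, and the key are hypothetical. Note that this is pseudonymization rather than full anonymity: whoever holds the key can regenerate the IDs, and the remaining fields may still identify someone when combined.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Derive a stable ID from an identifier using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability

def to_pseudonymous(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with the name replaced by a keyed-hash ID."""
    cleaned = {k: v for k, v in record.items() if k != "name"}
    cleaned["patient_id"] = pseudonymize(record["name"], secret_key)
    return cleaned

# Hypothetical key; in practice it would be stored outside the source code.
SECRET_KEY = b"example-key-for-illustration-only"

record = {"name": "Jane Doe", "lab_result": "negative"}
shared = to_pseudonymous(record, SECRET_KEY)
print(shared)
```

Using a keyed hash instead of a plain hash means someone without the key cannot simply hash a list of known names and match them against the IDs.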
example of data confidentiality
bank employing encryption methods to protect data and implement strict access controls
Facebook Cambridge Analytica Scandal
data acquisition through a third party app called "This is your Digital Life"
Facebook Cambridge Analytica Scandal - Data harvesting from friends
the real ethical breach occurred when the app also accessed personal data of the users' Facebook friends, who had not consented
Facebook Cambridge Analytica Scandal - Scope of data collected
profile details, likes, friends' lists, users' psychological profiles and preferences
Facebook Cambridge Analytica Scandal - Purpose of data collected
used by Cambridge Analytica (a political consulting firm) to target political ads and to influence voting behavior
First AI Beauty Contest in 2016
use AI to judge and select winners based on the contestants' submitted photos
- the winners were almost exclusively of one skin tone, although many people of color submitted photos
First AI Beauty Contest in 2016 - Violation of Fairness
exhibited racial bias by disproportionately selecting winners who were unrepresentative of global population
First AI Beauty Contest in 2016 - Underrepresentation
lack of diversity raises the question of whether the AI system was genuinely inclusive and fair in its assessment
First AI Beauty Contest in 2016 - Root causes of bias
data bias and algorithmic bias
data bias
lack of diversity in the training data for AI
algorithmic bias
if training data lacks diversity, the algorithm may not have learned to recognize and appreciate a broad range of beauty standards
concerns with TikTok
owned by a foreign company (ByteDance)
questions have been raised about where TikTok stores user data, especially given the company's foreign ownership
- potential for data access by the foreign government
- how does it select and prioritize recommended content
Facebook and mental health services
crisis intervention bot
crisis intervention bots
engage with users who express thoughts of self-harm or suicide and provide immediate support
the issue with Facebook and mental health services
visitors of websites are often forced to have their information collected when it is said to be anonymous
crisis line websites used the Meta Pixel, and it appears visitors' data was being sent to Facebook and was no longer anonymous (pixel-based data could be unscrambled easily to reveal true identities)
Quest Diagnostics Breach 2019
significant cybersecurity incident where an unauthorized user gained access to the systems of a third-party billing collections vendor used by Quest and exposed personal and medical information of 12 million patients
scope of the Quest Diagnostics Breach 2019
one of the largest healthcare data breaches at the time
data exposed in Quest Diagnostics Breach 2019
names, addresses, phone #s, DOB, SSN, lab results
Quest Diagnostics Breach 2019 and its relation to complex ownership
highlights the challenges in determining who ultimately owns and is responsible for protecting patient data when so many parties are involved
BBC News: Facial recognition fails on race
NIST found that facial recognition software exhibited lower accuracy in identifying the faces of African Americans and Asian Americans compared to Caucasians
particularly pronounced in one-to-one matches
root cause of the bias in the BBC News facial recognition story
the facial recognition software was primarily trained on databases from govt. agencies (State Dept., FBI) consisting of primarily white individuals
therefore the lack of diversity in the training data resulted in the algorithm inheriting biases
lesson learned from the BBC News facial recognition story
critical need for representative datasets to address algorithmic bias
Quest Diagnostics data breach in 2019: what to consider
clear data ownership agreements
data access controls
data encryption
data minimization
security audits and assessments
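Data minimization from the list above can be sketched as sharing only the fields a third party needs for its stated purpose. The field names and the set of fields a billing vendor would need are assumptions made for illustration, not details from the Quest incident.

```python
def minimize(record: dict, allowed_fields: set) -> dict:
    """Return a copy of the record containing only the allowed fields."""
    return {k: v for k, v in record.items() if k in allowed_fields}

# Assumed: the only fields a billing vendor needs to do its job.
BILLING_FIELDS = {"patient_id", "amount_due", "billing_address"}

# Hypothetical patient record with placeholder values.
patient = {
    "patient_id": "P-0001",
    "amount_due": 120.50,
    "billing_address": "1 Example St",
    "ssn": "000-00-0000",
    "lab_results": "negative",
}

shared_with_vendor = minimize(patient, BILLING_FIELDS)
print(shared_with_vendor)
```

Fields that are never shared cannot be exposed if the vendor's systems are breached, which is why minimization sits alongside access controls and encryption in the list.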
sources of data
primary and secondary
primary data
no related research is done on the subject/topic
collecting brand new data
secondary data
data regarding a specific topic is readily available or collected by someone else
before collecting data, need to consider
the question you aim to answer
the data subjects you need to collect data from
the collection timeframe
data collection methods best suited to your needs
types of quantitative data
raw #s and digits
ratio
interval
types of qualitative data
customer reviews or feedback
nominal and ordinal
how can qualitative data be collected?
answering questions (how much, how often)
one-on-one interviews, observations, focus group meetings, surveys
allows mathematical analysis
categorical data
grouped based on the categories
three classification types for categorical data
binary
nominal
ordinal
binary
can take only two possible states (true/false, yes/no)
nominal
labeled data classified into various groups with no ranks or order between them
(country, gender, hair color)
ordinal
groups based on order or ranking
(economic status)
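The three classification types matter in practice because each is usually encoded differently for analysis. The sketch below shows one common convention: 0/1 for binary, one-hot vectors for nominal (so that no order is implied), and rank positions for ordinal. The category lists are hypothetical examples.

```python
def encode_binary(value, true_label="yes"):
    """Binary: map the two possible states to 1 and 0."""
    return 1 if value == true_label else 0

def encode_nominal(value, categories):
    """Nominal: one-hot vector, so no category ranks above another."""
    return [1 if value == c else 0 for c in categories]

def encode_ordinal(value, ordered_levels):
    """Ordinal: position in the ranking, lowest level first."""
    return ordered_levels.index(value)

# Hypothetical category lists for illustration.
HAIR_COLORS = ["black", "brown", "blond", "red"]   # no inherent order
ECONOMIC_STATUS = ["low", "middle", "high"]        # ordered low to high

print(encode_binary("yes"))
print(encode_nominal("brown", HAIR_COLORS))
print(encode_ordinal("middle", ECONOMIC_STATUS))
```

Encoding a nominal variable as 0, 1, 2, 3 would wrongly suggest that one hair color is "greater" than another, which is why the one-hot form is used there.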
first-party data
collected directly from users by your organization
(most valuable because you receive information about how your audience behaves, thinks, feels - all from a trusted source)
second-party data
data shared by another organization about its customers (i.e., its first-party data)
third-party data
data that has been aggregated and rented or sold by orgs. that do not have a connection to your company or users
data collection methods
online survey
paper survey
interview and focus groups
forms
polls
voting
social media monitoring
online tracking
interviews
one-on-one conversations