Define Data Science (be familiar with the contextual model diagram)
Data Science is the field that advances methods to improve the use of data for human progress.
Explain how a social networking site illustrates the various aspects of our definition of data science.
1. Data Objects representing you, your friends, and your connections mathematically.
2. Report and Dashboards giving insight about friendship extracted from data.
3a. AI system for Recommending New friends
3b. New insights, measurable performance (time spent, likes, taps, clicks, ad revenue)
Define digital transformation
The adoption of digital technology to replace manual processes with digital processes. (Digitalization)
What three factors make "Big Data" so important now?
• Massive amounts of data about many aspects of human life
• Abundance of inexpensive computing power
• Competitive advantage when data are actually used
How do companies use data science to gain a competitive advantage?
-Business planning
-Performance tracking
-Process automation
-Market research
What Knowledge, Skills, and Abilities are involved in Data Science teamwork?
• Cultural Understanding
• Curiosity and Understanding
• Subject-Matter Expertise
• Mathematics / Statistics / Analytics
• Data Wrangling - parsing, scraping, formatting data
• Visualization
• Programming Novel Computing Tools
• Use of Existing Computing Tools
• Logic / Wisdom / Expertise
• Leadership and Communication
• General Flow: Problem -> Data Science -> Value
What are the characteristics of a data-science-literate professional?
• Ability to know when, how, and in what ways data science teams could benefit the problem at hand.
• Ability to anticipate what might be relevant for a data science team to know about a subject-matter domain about which the data scientist may be partially or wholly ignorant.
• Ability to articulate what data are available in one's field
• Ability to anticipate data that could be collected or created for future analysis by data scientists
Define data
Pieces of information that have been translated into a form that is more efficient for storage, movement, or processing.
Why is context important for understanding the meaning of data?
The data do not speak for themselves and require theory from a particular subject matter area to provide the context for understanding data.
Sociology of knowledge - study of the relationship between human thought and the social context within which it arises, and of the effects that prevailing ideas have on societies
Define datafication
Taking all aspects of life and turning them into data.
Once we datafy things, we can transform their purpose and turn the information into new forms of value.
What are examples of data produced by humans?
1. Information / Measurements Gathered as Part of an Experimental Research Design
2. Data Collected as Part of Case Management
3. Digital Communications or Actions
4. Social Media and Chat
5. Mass Communication
6. Polling and Surveys
How often is the US Census performed?
The US Census is performed every 10 Years.
What kinds of sensors produce data?
1. Cameras and Microphones -> Photos, Video, Sound Files
2. Internet of Things
3. Modern Manufacturing and Industrial Production
What kind of data would help us understand why a computer system might be malfunctioning?
1. Application Logs
2. Operating System Logs
3. Access Logs
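As a rough illustration of how logs help diagnose a malfunction, the sketch below counts log entries by severity and pulls out the errors. The log lines and their layout (date, time, level, message) are hypothetical, made up for this example; real application, operating system, and access logs each have their own formats.

```python
from collections import Counter

# Hypothetical application log lines: date, time, level, message.
log_lines = [
    "2024-01-05 10:00:01 INFO  service started",
    "2024-01-05 10:00:07 ERROR database connection refused",
    "2024-01-05 10:00:09 WARN  retrying connection",
    "2024-01-05 10:00:12 ERROR database connection refused",
]

def count_levels(lines):
    """Count entries per severity level (the third whitespace-separated field)."""
    return Counter(line.split()[2] for line in lines)

levels = count_levels(log_lines)

# Keep only the ERROR entries for closer inspection.
errors = [line for line in log_lines if line.split()[2] == "ERROR"]
```

A repeated error message, as in this made-up sample, is often the first clue about where a system is failing.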
What is a programming language?
A programming language is a formal language (as opposed to an informal language like human languages). Its form has been determined to give the programmer the ability to implement sets of instructions that work together to achieve a specific output.
What are two key programming languages used for Data Science work?
• R
• Python
When would a data scientist use GIS?
Useful for any field where location is important (Energy, Real Estate, Military, City Planning, Politics, etc.)
Give three examples of Python libraries commonly used for machine learning
TensorFlow, Keras, and PyTorch
What is the most important Data Science Tool?
Your brain
Define Privacy
Rights of people to control how information about them is collected, used and disclosed
What right does the fourth amendment to the US Constitution protect?
The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
What key principles did the 1970 Fair Credit Reporting Act establish?
• There should be no secret collections of data that are used to make decisions about a person's financial life / credit
• Individuals should have a right to examine and challenge the accuracy of information held in such collections
• Information in such a collection should expire after a reasonable amount of time
Which US law made hacking illegal?
1986 Computer Fraud and Abuse Act
What kind of data does HIPAA protect?
Medical information. HIPAA regulates the collection, use, and disclosure of medical information by health care providers or those who come into contact with medical records.
What kind of data does FERPA protect?
The privacy of student records
What law amended nearly every previous privacy law?
The Patriot Act
The GDPR governs data privacy in what political region?
The European Union
What is ethics?
Ethics is the study of moral principles - the science of right vs. wrong
What is the end goal of ethics?
The end goal of ethics is to establish justice.
What does "informed consent" mean for data collection?
"Is the subject aware of how the data collected about them will be stored and used?"
Why is ensuring data security an ethical obligation?
To protect individuals from potential harm, such as identity theft or privacy breaches.
What is an algorithm?
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
How could facial recognition be unjust?
• Algorithms boast 90% accuracy, but that accuracy is uneven across groups.
• Poorest accuracy is for subjects who are female, Black, and 18-30 years old
What are the three types of algorithmic bias?
• Pre-Existing
• Technical
• Emergent
How might it be unethical NOT to use data?
• Like any technology, there is a process of adoption and diffusion, unsettling older technologies, forcing us to learn to use new technologies wisely, etc.
• Data Science, at its best, will help us to move beyond human action not informed by reality. At its worst, it can reproduce existing biases at a larger scale.
Before research with human subjects can be conducted at a university (or any place of research study), who must approve it?
The IRB (institutional review board)
What are some of the safeguards that keep a Google data center secure?
-Security Operations Center
-Vehicle Crash Barriers
-Overlapping Cameras
-Biometric access centers
-Thermal camera detection
-Redundant Power Systems
-Motion Detecting Fences
-Data encryption
-ID scanners
-Disk and Hard Drive Grinder
-Ethical hackers
What is Data Governance and Management?
A collection of administrative processes that affect acquisition, validation, storage, protection, and processing of required data to ensure the quality, accessibility, reliability, and timeliness of the data so that stakeholders can use the data to achieve organizational goals effectively.
To what does good Data Management or Governance lead?
• Data Quality
• Privacy
• User, Customer, or Constituent Trust
• Compliance with Regulatory Requirements
• Effective Use of Data to Meet Strategic Goals (accurate reporting, analytics, dashboards, better decision-making)
• Good Management
What is Data Quality?
The condition of data measured in terms of factors like:
• Relevance
• Timeliness and Recency
• Accuracy
• Consistency - a single source of truth
• Completeness
• Reliability
• Reduced Risk - i.e., Safe to Use for Intended Purpose
• Traceability / Lineage
How did the Yahoo Data Breach of 2014 affect Yahoo?
An attack by state-sponsored hackers compromised the real names, email addresses, dates of birth, and telephone numbers of 500 million users.
What are the five parts (core functions) of the NIST Cybersecurity Framework?
• Identify
• Protect
• Detect
• Respond
• Recover
Be able to describe the five core functions of the NIST Cybersecurity Framework
• Identify - the "what" of your organization and its data and systems.
• Protect - take steps to secure your organization's data and systems.
• Detect - watch for and be able to know when your data and systems are threatened.
• Respond - create and carry out procedures to take action when a threat has been detected
• Recover - develop and carry out procedures to restore business activities, eliminate the exploited vulnerability, and come back stronger.
What is encryption?
A way to scramble data so that only an authorized party can unscramble it.
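The idea of scrambling and unscrambling with a shared secret can be sketched with a toy one-time pad: each byte of the message is XORed with a random key byte, and XORing again with the same key restores the original. This is an illustration only, not how production systems encrypt data; the helper name `xor_bytes` and the sample message are made up for this example.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte."""
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"meet at noon"

# Random key, known only to the sender and the authorized recipient.
key = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, key)    # scrambled form
recovered = xor_bytes(ciphertext, key)  # unscrambled with the same key
```

Anyone holding the key can recover the message; without it, the ciphertext is just random-looking bytes.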
What is the difference between system access and system entitlements?
• System Access - Whether a particular person may use a system.
• System Entitlements - What actions can a user with access to the system take?
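The distinction can be sketched as two separate checks: one for whether a person may use the system at all, and one for what that person may do once inside. The user names, actions, and the `ENTITLEMENTS` table are hypothetical, chosen only for illustration.

```python
# Hypothetical users and the actions each one is entitled to perform.
ENTITLEMENTS = {
    "alice": {"read", "write"},
    "bob": {"read"},
}

def has_access(user: str) -> bool:
    """System access: may this person use the system at all?"""
    return user in ENTITLEMENTS

def is_entitled(user: str, action: str) -> bool:
    """System entitlements: may this user take this particular action?"""
    return action in ENTITLEMENTS.get(user, set())
```

Here `bob` has access but is not entitled to write, while a user missing from the table has no access at all.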
What is a zero-day vulnerability?
A true flaw in software or systems that is discovered before a patch is available. It is sometimes kept private by hackers.
What are some ways to establish physical security?
• Locks
• Keys (Physical Keys, Codes, Biometrics, Multi-Factor)
• Separation
• Access Records / Logs
• Prevention of Line of Sight / Earshot Exploits
• Shredding / Destruction
What is a Data Entity?
Person, Place, or Thing that you want to represent as a data object and track in a database, system of files, etc.
Each entity has ____?
"Attributes" (properties or traits)
What do we use ERD diagrams to do? What does ERD stand for?
ERD stands for Entity Relationship Diagram. We often use ERDs to explain or document a data model.
What is cardinality?
The number of items in a set, minimums and maximums.
What are the three types of data models?
conceptual, logical, physical
What is normalization?
The attempt to store each piece of information only once and to link related information through "keys"
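A minimal sketch of this idea, using Python's built-in `sqlite3` module with an in-memory database. The `student` and `enrollment` tables and their contents are hypothetical. The student's name is stored once, and each enrollment row points back to it through the `student_id` key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course     TEXT NOT NULL
    );
    INSERT INTO student VALUES (1, 'Ada');
    INSERT INTO enrollment VALUES (1, 'Data Science 101'), (1, 'Statistics');
""")

# Join on the key to reassemble the linked information when it is needed.
rows = conn.execute("""
    SELECT s.name, e.course
    FROM enrollment AS e
    JOIN student AS s ON s.student_id = e.student_id
    ORDER BY e.course
""").fetchall()
```

If the student's name changes, only one row in `student` needs to be updated, which is the practical payoff of storing it once.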
What is SQL?
Structured Query Language
Purpose:
- Great for data-gathering software applications
-Case management systems (e.g., Banner at MSU)
-Data warehousing
What is an unstructured database?
An unstructured database holds information that is not stored in a specific format, often because the information is complex and cannot be reduced to a small number of fields. Sometimes referred to as "NoSQL".
Why might a flat data structure be good?
• Often preferred for analysis / analytics
• Sometimes preferred to speed up searches (indexing)
• Used for "batch" exchanges of data (exporting and importing)
• Used for reports
Vector databases are often used in ______?
AI applications
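The core operation behind a vector database is similarity search: items are stored as lists of numbers (vectors), and a query returns the stored item whose vector is closest. The sketch below uses cosine similarity over a plain dictionary. The three-dimensional vectors are hypothetical values made up for illustration, and a real vector database would use specialized indexes instead of scanning every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors standing in for learned embeddings.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def nearest(query, store):
    """Return the name of the stored vector most similar to the query."""
    return max(store, key=lambda name: cosine_similarity(query, store[name]))
```

For example, `nearest([1.0, 0.0, 0.0], vectors)` returns `"cat"` for this sample data.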
Why would a graph database be useful?
• When one might need to retrieve and traverse complicated relationships between data objects quickly.
• If faster retrieval of complicated relationships is required
• If a query language other than SQL is required
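To make "traversing relationships" concrete, the sketch below walks a small friendship graph with a breadth-first search to find everyone within a given number of hops of a person. It uses a plain Python dictionary, not a graph database, and the names and the `within_hops` helper are hypothetical; a graph database is built to run this kind of query efficiently at much larger scale.

```python
from collections import deque

# Hypothetical friendship graph: each person maps to their direct friends.
friends = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dee"],
    "cho": ["ana"],
    "dee": ["ben", "eli"],
    "eli": ["dee"],
}

def within_hops(graph, start, max_hops):
    """Return everyone reachable from start in at most max_hops steps."""
    seen = {start: 0}  # person -> number of hops from start
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[person]:
            if neighbor not in seen:
                seen[neighbor] = seen[person] + 1
                queue.append(neighbor)
    return {p for p, hops in seen.items() if 0 < hops <= max_hops}
```

With this sample data, `within_hops(friends, "ana", 2)` gives `ben`, `cho`, and `dee`: direct friends plus friends of friends.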
What are the pros of the blockchain?
• Solves "Trust" Problem
• No Intermediaries, Better than Third-Party Solutions (saves time and money)
• Is a nondestructive way to track data changes over time
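The nondestructive tracking can be sketched as a hash chain: each new record stores a hash of the record before it, so changes are appended as new blocks and an edit to an earlier block becomes detectable. This is a simplified illustration using Python's `hashlib`; the function names and sample data are hypothetical, and a real blockchain adds distribution across many parties and a consensus mechanism, which this sketch omits.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """SHA-256 hash of a block's contents, serialized deterministically."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, data: str) -> None:
    """Add a new block that records the hash of the previous block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "data": data, "prev_hash": prev_hash})

def is_valid(chain: list) -> bool:
    """Check that every block still matches the hash stored by its successor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

# Each change is appended as a new block; earlier blocks are never overwritten.
chain = []
append_block(chain, "balance = 10")
append_block(chain, "balance = 7")
append_block(chain, "balance = 12")
```

Altering the data in an earlier block changes its hash, so `is_valid(chain)` would then return `False`.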
What are the cons of the blockchain?
Brings with it complex policy questions around:
• Governance
• Economics
• International law
• Security
Define Confidentiality
Obligation of those who have access to private information not to disclose private information to others.
Define Security
Technological, physical, or administrative safeguards or tools designed to protect data from unwarranted access or disclosure.