descriptive statistics
uses the data to provide descriptions (numerical calculations) and visualizations (graphs or tables) of the population
inferential statistics
uses the data to make inferences and predictions about a population (such as inferring trends)
mean
the average of a set of numbers
median
the middle value in a dataset when the numbers are arranged in order
mode
the value that appears most often in a dataset (if no numbers repeat, there is no mode)
what type of distribution should use mean
normal/symmetric distributions (also called bell curve or gaussian curve)
what type of distribution should use median
skewed distributions
what type of distribution should use mode
bimodal distributions
kurtosis
the measure of the tailedness of a distribution (how often outliers occur, how thin/fat the distribution is)
standard deviation
a measure of how dispersed the data is in relation to the mean (low SD means data is more clustered around mean, high SD means data is more spread out)
variance
how large the spread is within a dataset (how far each number is from the mean)
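A minimal R sketch of these summary statistics; base R has no built-in mode function, so a hypothetical helper is defined here:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # small example vector
mean(x)      # average: 5
median(x)    # middle value of the ordered data: 4.5
sd(x)        # standard deviation: dispersion around the mean
var(x)       # variance: the squared standard deviation
# mode: most frequent value (hypothetical helper, not part of base R)
mode_of <- function(v) { tab <- table(v); as.numeric(names(tab)[tab == max(tab)]) }
mode_of(x)   # 4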
quartiles
the three cut points (Q1, Q2, Q3) that divide an ordered data set into four equal parts, each representing 25% of the data (three lines make four parts)
quantiles
cut points that indicate what proportion of values in a distribution fall above or below a certain limit (five commonly used quantiles: 0.01, 0.25, 0.50, 0.75, 0.99)
percentiles
expresses the relative position of a score within a dataset
interquartile range (IQR)
the middle 50% of values when ordered from lowest to highest (Q3-Q1)
fences
the maximum and minimum values that are not outliers
equation for upper fence
Q3 + (1.5 * IQR)
equation for lower fence
Q1 - (1.5 * IQR)
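A minimal R sketch of these formulas; quantile() and IQR() are base R, and the data vector is made up for illustration:
x <- c(1, 3, 4, 5, 5, 6, 7, 30)          # 30 is a suspected outlier
q <- quantile(x, probs = c(0.25, 0.75))  # Q1 and Q3
iqr <- q[2] - q[1]                       # same result as IQR(x)
upper_fence <- q[2] + 1.5 * iqr
lower_fence <- q[1] - 1.5 * iqr
x[x > upper_fence | x < lower_fence]     # values flagged as outliers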
regression
analyzes the relationship between a dependent variable and one or more independent variables to find trends in data
why do we use regressions?
they help us understand how values of a certain response (DV) are associated with values of a predictor (IV)
method of least squares
used when a linear correlation between two variables exists; selects the trend line that minimizes the sum of squared residuals and so best represents the data
what function is used to run a regression in R
model <- lm(DV ~ IV1 + IV2 + ..., data = dataset)
residual value
the difference between the predicted value of the data and the actual value of the data
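A short sketch tying these cards together, using R's built-in mtcars dataset for illustration:
model <- lm(mpg ~ wt + hp, data = mtcars)  # regress mpg (DV) on weight and horsepower (IVs)
summary(model)   # coefficients and fit statistics
fitted(model)    # the predicted values
resid(model)     # residuals: actual mpg minus predicted mpg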
statistical error
the difference between a value obtained from a data collection process and the "true" value for the population (the greater the error, the less representative the data is for the population)
tidy data
data arranged in a simple but precisely defined pattern (exists in defined data tables)
why is tidy data important?
when data is in this form, it is easy to transform it into arrangements that are more useful for answering questions
categorical variables
record the type of category and are often in word form
quantitative variables
record a numerical attribute and are in number form
database
a program that helps store data and provides functionality for adding, modifying, and querying that data
relational database
stores its data in one or more tables, with multiple tables typically related to one another
primary key
an attribute that is a unique identifier of rows
foreign key
an attribute of another existing table that is a reference to the primary key
what does it mean to query a database?
ask the database questions to retrieve specific information (most commonly by using SQL)
database server
runs a database management system, manages data access and retrieval, and provides database services to clients
how do you use a database server?
ask the database server for what you need; you do not need to touch the database directly, since the server handles all of the storing, finding, securing, and managing of the data itself
dbConnect(drv, path)
returns a connection object using the specified driver (drv) to connect to the database at the specified path
SQLite()
returns a driver for the SQLite database (can be inputted as the drv in dbConnect)
dbDisconnect(con)
disconnects from the database and frees resources used by the connection
what packages are needed for SQLite?
library(DBI), library(RSQLite)
SELECT statement (querying)
selects all rows and columns (can select multiple things at a time)
what conditions can be added onto the SELECT statement?
FROM (the table you want to select from)
WHERE (a condition that filters which rows are returned)
ORDER BY (how you want the output sorted)
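Putting those clauses together in one sketch (the table and column names are hypothetical):
SELECT name, age
FROM people
WHERE age >= 21
ORDER BY age DESC;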
dbSendQuery(conn, statement)
runs the query on the database specified by the connection object and returns a response object
dbFetch(res)
returns the results from the response object as a data frame
dbClearResult(res)
clears the result set and frees the resources used by it
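A minimal end-to-end sketch of the workflow in the surrounding cards, assuming an in-memory SQLite database and an example table created with dbWriteTable():
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")  # connect via the SQLite driver
dbWriteTable(con, "cars", mtcars)       # example table for illustration

res <- dbSendQuery(con, "SELECT mpg, cyl FROM cars WHERE cyl = 6 ORDER BY mpg")
rows <- dbFetch(res)   # results returned as a data frame
dbClearResult(res)     # free resources held by the result set
dbDisconnect(con)      # close the connection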
database administration (DBA)
managing and maintaining a database management system (DBMS)
database administrator
the person/team responsible for keeping the database running smoothly, keeping the data safe and secure, ensuring data is available when needed, and making sure queries run efficiently
5 big responsibilities of DBA
security, performance, availability, reliability (integrity), configuration
professional ethics
the special responsibility not to take unfair advantage of the trust and confidence placed in you by clients
data scientist's oath
respect the privacy of data subjects, understand/recognize the data represents real people and situations, always maintain fair treatment and nondiscrimination, and remember you are a member of society with special obligations to all fellow human beings
what values does the data values and principles manifesto espouse?
inclusion, experimentation, accountability, impact
algorithmic bias
caused by biased data; the algorithm is not 'thinking' or 'being sexist,' but is rather reflecting and amplifying the stereotypes that were already present in its training data, which can reinforce harmful biases about who 'belongs' in certain professions
data and disclosure
the ability to link multiple data sets and use public information to identify individuals is a growing problem, so you must be able to balance disclosure (to help improve something) and nondisclosure (to ensure private information is not made public)
data scraping
finding available data online meant for human consumption and collecting it without authorization
safe data storage
ensuring data protections remain even when equipment is transferred or disposed of
reproducible analysis
the process of recording each and every step (no matter how trivial) to ensure the result is reproducible for others
collective ethics
recognizing that although science is carried out by individuals and teams, the scientific community as a whole is a stakeholder, so ethical obligation outweighs individual reputation
what topics are included in the professional guidelines for ethical conduct?
professionalism, integrity of data and methods, responsibilities to stakeholders, conflicts of interest, the response to allegations of misconduct
data mining
the use of machine learning and statistical analysis to uncover patterns and other valuable information from large datasets
sample
subset of a population with a manageable size used for analysis (chosen using random selection)
population
complete data set that is often too large
random variables
a measure of a trait or value associated with an object, person, or place that is unpredictable
probability distributions
a function that gives the probabilities of occurrence of possible events for an experiment (used to model an unpredictable variable)
what should you keep in mind when considering a probability distribution?
consider what other events are possible, and define a set of events that are mutually exclusive (since only one event can occur at a time)
what are the main characteristics of a probability distribution?
the probability of a single event never goes below 0 or exceeds 1, and the probability of all events always sums to exactly 1
discrete variable
a random variable where values can be counted by groupings (ex: car color)
continuous variable
a random variable that assigns probability to a range of values (ex: car mileage/mpg)
binomial distributions
distributions of variables whose outcomes assume only one of two values per trial; the binomial gives the probability of each possible count of "successes" across repeated trials (ex: the number of heads in a series of coin flips)
categorical distributions
variables are grouped into categories and ranked according to those categories (ex: socioeconomic status, blood type)
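A small R sketch illustrating the probability-distribution properties above with the binomial distribution (10 coin flips):
p <- dbinom(0:10, size = 10, prob = 0.5)  # probability of each possible number of heads
all(p >= 0 & p <= 1)   # every probability stays between 0 and 1: TRUE
sum(p)                 # the mutually exclusive events sum to exactly 1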
what are pearson's r and spearman's rank correlation used for?
used to determine if a relationship between two variables exists
what is the difference between pearson's r and spearman's rank correlation?
pearson's r: assumes data is normally distributed and that variables are numeric, continuous, and linearly related
spearman's rank correlation: makes no normality assumption; suited to ordinal variables and monotonic (possibly non-linear) relationships
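In R, both coefficients come from the base cor() function; the vectors here are made up for illustration:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 9, 16, 25)           # monotonic but non-linear in x
cor(x, y, method = "pearson")     # assumes a linear relationship, so below 1 here
cor(x, y, method = "spearman")    # rank-based: exactly 1 for any monotonic relationship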
singular value decomposition (SVD)
reduces the dimensionality of the dataset by helping to compress the dataset and removing redundant information and noise (ex: image compression)
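A minimal sketch of SVD-based compression using base R's svd(), keeping only the top k singular values (the matrix and k are chosen arbitrarily):
m <- matrix(rnorm(100 * 80), nrow = 100)  # toy "image" matrix
s <- svd(m)                               # m = U %*% diag(d) %*% t(V)
k <- 10                                   # keep only the 10 largest singular values
approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # compressed reconstruction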
time series
a collection of data on attribute values over time that is used to predict future instances of the measure based on past observational data
constant time series
remains at roughly the same level over time
trended time series
shows a stable linear movement up or down over time
untrended/trended seasonal time series
predictable, cyclic fluctuations that recur seasonally throughout a year
autoregressive moving averages (ARMA)
a class of forecasting methods that can be used to predict future values from current and historical data (needs at least 50 observations for reliable results)
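A sketch using base R's arima() (an ARMA(p, q) is an ARIMA with no differencing); the series is simulated here so the true structure is known:
set.seed(1)
x <- arima.sim(model = list(ar = 0.7, ma = 0.3), n = 200)  # simulated ARMA(1,1), well over 50 observations
fit <- arima(x, order = c(1, 0, 1))  # AR order 1, no differencing, MA order 1
predict(fit, n.ahead = 5)            # forecast the next 5 values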
machine learning (algorithmic learning)
the practice of applying algorithmic methods to data in an iterative manner so the computer discovers hidden patterns/trends that can be used to make predictions
how long do learning algorithms typically run?
until the final analysis results no longer change, no matter how many additional times the data is fed to the algorithm (the algorithm is repeated over and over until a predetermined set of stopping conditions is met)
what are the steps of the machine learning process?
setting up: acquiring data, preprocessing it, selecting the most appropriate variables for the task at hand, then breaking the data into training and test datasets
learning: model experimentation, training, building, and testing
application: model deployment and prediction
what is the rule of thumb for breaking data into training and test data sets?
apply random sampling to take two-thirds of the original datasets to use as data to train the model, then use the remaining one-third for evaluating the model's predictions
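That rule of thumb in R, applied to a generic data frame df (a placeholder name):
set.seed(42)  # for a reproducible random split
train_idx <- sample(nrow(df), size = floor(2/3 * nrow(df)))
train <- df[train_idx, ]   # two-thirds used to train the model
test  <- df[-train_idx, ]  # remaining one-third used to evaluate its predictions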
supervised learning
algorithms learn from known features of the data to produce an output model, which is then used to successfully predict labels for new incoming, unlabeled data points (all input data must have labeled features)
unsupervised learning
accepts unlabeled data and attempts to group observations into categories based on underlying similarities in input features
semisupervised/reinforcement learning
a behavior-based learning model where the model is given 'rewards' based on how it behaves (model subsequently learns how to maximize rewards by adapting the decisions it makes)
clustering
a type of machine learning that uses an unsupervised learning technique
when should clustering be used?
if there is a dataset that describes multiple features about a set of observations and you want to group the observations by their feature similarities
partitional clustering algorithms
algorithms that create only a single set of clusters
hierarchical clustering algorithms
algorithms that create separate sets of nested clusters, each in its own hierarchical level
k-means clustering algorithm
a simple and fast unsupervised learning algorithm that can be used to predict groupings within a dataset (ex: pizza party example)
how does a k-means clustering algorithm make its prediction?
predictions are based on the number of centroids present, k (a model parameter that must be defined); each point is grouped with its nearest mean value, measured by the distance between the plotted points
what does k mean in a k-means clustering algorithm?
the number of centroids present (the centers of the clusters) (ex: if k = 2, there will be 2 centroids, or 2 cluster centers, so 2 clusters)
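A minimal sketch with base R's kmeans(), using the built-in iris measurements and k = 3 (chosen because there are three species):
km <- kmeans(iris[, 1:4], centers = 3)  # k = 3 centroids, so 3 clusters
km$centers                              # the cluster centers (centroids)
table(km$cluster, iris$Species)         # compare predicted groupings to the true species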
kernel density estimation (KDE)
a smoothing method that places a kernel (a weighting function used for quantifying density) on each data point in the dataset and then sums the kernels to generate a kernel density estimate for the overall region (ultimately producing a smooth curve instead of a curve made of chunks)
when is KDE helpful?
when eyeballing clusters
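In R, density() performs kernel density estimation; a quick sketch using the built-in faithful eruption times, which are visibly bimodal:
d <- density(faithful$eruptions)             # place a kernel on each point and sum them
plot(d, main = "KDE of eruption durations")  # smooth curve instead of a chunky histogram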
dendrogram
a visualization tool that depicts the similarities and branching between groups in a data cluster (can be built bottom-up or top-down)
bottom-up dendrogram
assembling pairs of points and then aggregating them into larger and larger groups
top-down dendrogram
starting with the full dataset and splitting it into smaller and smaller groups
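A bottom-up (agglomerative) example with base R's hclust(), using the built-in USArrests data:
hc <- hclust(dist(USArrests))  # agglomerative: merges the closest pairs first
plot(hc)                       # draws the dendrogram
cutree(hc, k = 4)              # cut the tree into 4 groups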
classification
a form of supervised machine learning, where the algorithm learns from labeled data in order to build predictive models that it can use to forecast the classification of future observations
binary classification
data classified into two possible classes
multi-class classification
data classified into one of three or more classes
multi-label classification
assigns one or more labels to each observation rather than just a single label
imbalanced classification
data in which observations of one class significantly outnumber those of another
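As a minimal binary-classification sketch, logistic regression via base R's glm() on the mtcars transmission column (one common approach; the cards above do not name a specific algorithm):
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # am is 0/1: automatic vs manual
probs <- predict(model, type = "response")  # predicted probability of class 1
pred  <- ifelse(probs > 0.5, 1, 0)          # assign each observation to one of the two classes
table(pred, mtcars$am)                      # confusion table of predictions vs truth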