descriptive statistics
uses the data to provide descriptions (numerical calculations) and visualizations (graphs or tables) of the population
inferential statistics
uses the data to make inferences and predictions about a population (such as inferring trends)
mean
the average of a set of numbers
median
the middle value in a dataset when the numbers are arranged in order
mode
the value that appears most often in a dataset (if no numbers repeat, there is no mode)
what type of distribution should use mean
normal/symmetric distributions (also called bell curve or gaussian curve)
what type of distribution should use median
skewed distributions
what type of distribution should use mode
bimodal distributions
kurtosis
the measure of the tailedness of a distribution (how often outliers occur, how thin/fat the distribution is)
standard deviation
a measure of how dispersed the data is in relation to the mean (low SD means data is more clustered around mean, high SD means data is more spread out)
variance
how large the spread is within a dataset (how far each number is from the mean)
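A minimal R sketch of these summary statistics; base R has no built-in mode function, so a hypothetical helper is defined here:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # small example vector
mean(x)      # average: 5
median(x)    # middle value of the ordered data: 4.5
sd(x)        # standard deviation: dispersion around the mean
var(x)       # variance: the squared standard deviation
# mode: most frequent value (hypothetical helper, not part of base R)
mode_of <- function(v) { tab <- table(v); as.numeric(names(tab)[tab == max(tab)]) }
mode_of(x)   # 4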
quartiles
the three cut points (Q1, Q2, Q3) that divide an ordered data set into four equal parts, each representing 25% of the data (three lines make four parts)
quantiles
cut points that indicate what proportion of values in a distribution fall above or below a certain limit (five commonly used quantiles: 0.01, 0.25, 0.50, 0.75, 0.99)
percentiles
expresses the relative position of a score within a dataset
interquartile range (IQR)
the middle 50% of values when ordered from lowest to highest (Q3-Q1)
fences
the maximum and minimum values that are not outliers
equation for upper fence
Q3 + (1.5 * IQR)
equation for lower fence
Q1 - (1.5 * IQR)
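A minimal R sketch of these formulas; quantile() and IQR() are base R, and the data vector is made up for illustration:
x <- c(1, 3, 4, 5, 5, 6, 7, 30)          # 30 is a suspected outlier
q <- quantile(x, probs = c(0.25, 0.75))  # Q1 and Q3
iqr <- q[2] - q[1]                       # same result as IQR(x)
upper_fence <- q[2] + 1.5 * iqr
lower_fence <- q[1] - 1.5 * iqr
x[x > upper_fence | x < lower_fence]     # values flagged as outliers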
regression
analyzes the relationship between a dependent variable and one or more independent variables to find trends in data
why do we use regressions?
they help us understand how values of a certain response (DV) are associated with values of a predictor (IV)
method of least squares
used when a linear correlation between two variables exists; selects the trend line that minimizes the sum of squared residuals and so best represents the data
what function is used to run a regression in R
model <- lm(DV ~ IV1 + IV2 + ..., data = dataset)
residual value
the difference between the predicted value of the data and the actual value of the data
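A short sketch tying these cards together, using R's built-in mtcars dataset for illustration:
model <- lm(mpg ~ wt + hp, data = mtcars)  # regress mpg (DV) on weight and horsepower (IVs)
summary(model)   # coefficients and fit statistics
fitted(model)    # the predicted values
resid(model)     # residuals: actual mpg minus predicted mpg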
statistical error
the difference between a value obtained from a data collection process and the "true" value for the population (the greater the error, the less representative the data is for the population)
tidy data
data arranged in a simple but precisely defined pattern (exists in defined data tables)
why is tidy data important?
when data is in this form, it is easy to transform it into arrangements that are more useful for answering questions
categorical variables
record the type of category and are often in word form
quantitative variables
record a numerical attribute and are in number form
database
a program that helps store data and provides functionality for adding, modifying, and querying that data
relational database
stores its data in one or more tables, with multiple tables typically related to one another
primary key
an attribute that is a unique identifier of rows
foreign key
an attribute of another existing table that is a reference to the primary key
what does it mean to query a database?
ask the database questions to retrieve specific information (most commonly by using SQL)
database server
runs a database management system, manages data access and retrieval, and provides database services to clients
how do you use a database server?
ask the database server for what you need; you do not need to touch the database directly, since the server handles all of the storing, finding, securing, and managing of the data itself
dbConnect(drv, path)
returns a connection object using the specified driver (drv) to connect to the database at the specified path
SQLite()
returns a driver for the SQLite database (can be inputted as the drv in dbConnect)
dbDisconnect(con)
disconnects from the database and frees resources used by the connection
what packages are needed for SQLite?
library(DBI), library(RSQLite)
SELECT statement (querying)
selects all rows and columns (can select multiple things at a time)
what conditions can be added onto the SELECT statement?
FROM (the table you want to select from)
WHERE (a condition that filters which rows are returned)
ORDER BY (how you want the output sorted)
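Putting those clauses together in one sketch (the table and column names are hypothetical):
SELECT name, age
FROM people
WHERE age >= 21
ORDER BY age DESC;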
dbSendQuery(conn, statement)
runs the query on the database specified by the connection object and returns a response object
dbFetch(res)
returns the results from the response object as a data frame
dbClearResult(res)
clears the result set and frees the resources used by it
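A minimal end-to-end sketch of the workflow in the surrounding cards, assuming an in-memory SQLite database and an example table created with dbWriteTable():
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")  # connect via the SQLite driver
dbWriteTable(con, "cars", mtcars)       # example table for illustration

res <- dbSendQuery(con, "SELECT mpg, cyl FROM cars WHERE cyl = 6 ORDER BY mpg")
rows <- dbFetch(res)   # results returned as a data frame
dbClearResult(res)     # free resources held by the result set
dbDisconnect(con)      # close the connection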
database administration (DBA)
managing and maintaining a database management system (DBMS)
database administrator
the person/team responsible for keeping the database running smoothly, keeping the data safe and secure, ensuring data is available when needed, and making sure queries run efficiently
5 big responsibilities of DBA
security, performance, availability, reliability (integrity), configuration
professional ethics
the special responsibility not to take unfair advantage of the trust and confidence placed in you by clients
data scientist's oath
respect the privacy of data subjects, understand/recognize the data represents real people and situations, always maintain fair treatment and nondiscrimination, and remember you are a member of society with special obligations to all fellow human beings
what values does the data values and principles manifesto espouse?
inclusion, experimentation, accountability, impact
algorithmic bias
caused by biased data; the algorithm is not 'thinking' or 'being sexist,' but is rather reflecting and amplifying the stereotypes that were already present in its training data, which can reinforce harmful biases about who 'belongs' in certain professions
data and disclosure
the ability to link multiple data sets and use public information to identify individuals is a growing problem, so you must be able to balance disclosure (to help improve something) and nondisclosure (to ensure private information is not made public)
data scraping
finding available data online meant for human consumption and collecting it without authorization
safe data storage
ensuring data protections remain even when equipment is transferred or disposed of
reproducible analysis
the process of recording each and every step (no matter how trivial) to ensure the result is reproducible for others
collective ethics
recognizing that although science is carried out by individuals and teams, the scientific community as a whole is a stakeholder, so ethical obligation outweighs individual reputation
what topics are included in the professional guidelines for ethical conduct?
professionalism, integrity of data and methods, responsibilities to stakeholders, conflicts of interest, the response to allegations of misconduct
data mining
the use of machine learning and statistical analysis to uncover patterns and other valuable information from large datasets
sample
subset of a population with a manageable size used for analysis (chosen using random selection)
population
complete data set that is often too large
random variables
a measure of a trait or value associated with an object, person, or place that is unpredictable
probability distributions
a function that gives the probabilities of occurrence of possible events for an experiment (used to model an unpredictable variable)
what should you keep in mind when considering a probability distribution?
consider what other events are possible, and define a set of events that are mutually exclusive (since only one event can occur at a time)
what are the main characteristics of a probability distribution?
the probability of a single event never goes below 0 or exceeds 1, and the probability of all events always sums to exactly 1
discrete variable
a random variable where values can be counted by groupings (ex: car color)
continuous variable
a random variable that assigns probability to a range of values (ex: car mileage/mpg)
binomial distributions
distributions of variables whose outcomes assume only one of two values per trial; the binomial gives the probability of each possible count of "successes" across repeated trials (ex: the number of heads in a series of coin flips)
categorical distributions
variables are grouped into categories and ranked according to those categories (ex: socioeconomic status, blood type)
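A small R sketch illustrating the probability-distribution properties above with the binomial distribution (10 coin flips):
p <- dbinom(0:10, size = 10, prob = 0.5)  # probability of each possible number of heads
all(p >= 0 & p <= 1)   # every probability stays between 0 and 1: TRUE
sum(p)                 # the mutually exclusive events sum to exactly 1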
what are pearson's r and spearman's rank correlation used for?
used to determine if a relationship between two variables exists
what is the difference between pearson's r and spearman's rank correlation?
pearson's r: assumes data is normally distributed and that variables are numeric, continuous, and linearly related
spearman's rank correlation: makes no normality assumption; suited to ordinal variables and monotonic (possibly non-linear) relationships
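In R, both coefficients come from the base cor() function; the vectors here are made up for illustration:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 9, 16, 25)           # monotonic but non-linear in x
cor(x, y, method = "pearson")     # assumes a linear relationship, so below 1 here
cor(x, y, method = "spearman")    # rank-based: exactly 1 for any monotonic relationship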
singular value decomposition (SVD)
reduces the dimensionality of the dataset by helping to compress the dataset and removing redundant information and noise (ex: image compression)
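A minimal sketch of SVD-based compression using base R's svd(), keeping only the top k singular values (the matrix and k are chosen arbitrarily):
m <- matrix(rnorm(100 * 80), nrow = 100)  # toy "image" matrix
s <- svd(m)                               # m = U %*% diag(d) %*% t(V)
k <- 10                                   # keep only the 10 largest singular values
approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # compressed reconstruction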
time series
a collection of data on attribute values over time that is used to predict future instances of the measure based on past observational data
constant time series
remains at roughly the same level over time
trended time series
shows a stable linear movement up or down over time
untrended/trended seasonal time series
predictable, cyclic fluctuations that recur seasonally throughout a year
autoregressive moving averages (ARMA)
a class of forecasting methods that can be used to predict future values from current and historical data (needs at least 50 observations for reliable results)
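A sketch using base R's arima() (an ARMA(p, q) is an ARIMA with no differencing); the series is simulated here so the true structure is known:
set.seed(1)
x <- arima.sim(model = list(ar = 0.7, ma = 0.3), n = 200)  # simulated ARMA(1,1), well over 50 observations
fit <- arima(x, order = c(1, 0, 1))  # AR order 1, no differencing, MA order 1
predict(fit, n.ahead = 5)            # forecast the next 5 values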
machine learning (algorithmic learning)
the practice of applying algorithmic methods to data in an iterative manner so the computer discovers hidden patterns/trends that can be used to make predictions
how long do learning algorithms typically run?
until the final analysis results no longer change, no matter how many additional times the data is fed to the algorithm (the algorithm is repeated over and over until a predetermined set of stopping conditions is met)
what are the steps of the machine learning process?
setting up: acquiring data, preprocessing it, selecting the most appropriate variables for the task at hand, then breaking the data into training and test datasets
learning: model experimentation, training, building, and testing
application: model deployment and prediction
what is the rule of thumb for breaking data into training and test data sets?
apply random sampling to take two-thirds of the original datasets to use as data to train the model, then use the remaining one-third for evaluating the model's predictions
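That rule of thumb in R, applied to a generic data frame df (a placeholder name):
set.seed(42)  # for a reproducible random split
train_idx <- sample(nrow(df), size = floor(2/3 * nrow(df)))
train <- df[train_idx, ]   # two-thirds used to train the model
test  <- df[-train_idx, ]  # remaining one-third used to evaluate its predictions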
supervised learning
algorithms learn from known features of the data to produce an output model, which is then used to successfully predict labels for new incoming, unlabeled data points (all input data must have labeled features)
unsupervised learning
accepts unlabeled data and attempts to group observations into categories based on underlying similarities in input features
semisupervised/reinforcement learning
a behavior-based learning model where the model is given 'rewards' based on how it behaves (model subsequently learns how to maximize rewards by adapting the decisions it makes)
clustering
a type of machine learning that uses an unsupervised learning technique
when should clustering be used?
if there is a dataset that describes multiple features about a set of observations and you want to group the observations by their feature similarities
partitional clustering algorithms
algorithms that create only a single set of clusters
hierarchical clustering algorithms
algorithms that create separate sets of nested clusters, each in its own hierarchical level
k-means clustering algorithm
a simple and fast unsupervised learning algorithm that can be used to predict groupings within a dataset (ex: pizza party example)
how does a k-means clustering algorithm make its prediction?
predictions are based on the number of centroids present, k (a model parameter that must be defined); each point is grouped with its nearest mean value, measured by the distance between the plotted points
what does k mean in a k-means clustering algorithm?
the number of centroids present (the centers of the clusters) (ex: if k = 2, there will be 2 centroids, or 2 cluster centers, so 2 clusters)
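A minimal sketch with base R's kmeans(), using the built-in iris measurements and k = 3 (chosen because there are three species):
km <- kmeans(iris[, 1:4], centers = 3)  # k = 3 centroids, so 3 clusters
km$centers                              # the cluster centers (centroids)
table(km$cluster, iris$Species)         # compare predicted groupings to the true species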
kernel density estimation (KDE)
a smoothing method that places a kernel (a weighting function used for quantifying density) on each data point in the dataset and then sums the kernels to generate a kernel density estimate for the overall region (ultimately producing a smooth curve instead of a curve made of chunks)
when is KDE helpful?
when eyeballing clusters
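In R, density() performs kernel density estimation; a quick sketch using the built-in faithful eruption times, which are visibly bimodal:
d <- density(faithful$eruptions)             # place a kernel on each point and sum them
plot(d, main = "KDE of eruption durations")  # smooth curve instead of a chunky histogram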
dendrogram
a visualization tool that depicts the similarities and branching between groups in a data cluster (can be built bottom-up or top-down)
bottom-up dendrogram
assembling pairs of points and then aggregating them into larger and larger groups
top-down dendrogram
starting with the full dataset and splitting it into smaller and smaller groups
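A bottom-up (agglomerative) example with base R's hclust(), using the built-in USArrests data:
hc <- hclust(dist(USArrests))  # agglomerative: merges the closest pairs first
plot(hc)                       # draws the dendrogram
cutree(hc, k = 4)              # cut the tree into 4 groups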
classification
a form of supervised machine learning, where the algorithm learns from labeled data in order to build predictive models that it can use to forecast the classification of future observations
binary classification
data classified into two possible classes
multi-class classification
data classified into one of three or more classes
multi-label classification
assigns one or more labels to each observation rather than just a single label
imbalanced classification
data in which observations of one class significantly outnumber those of another
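As a minimal binary-classification sketch, logistic regression via base R's glm() on the mtcars transmission column (one common approach; the cards above do not name a specific algorithm):
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)  # am is 0/1: automatic vs manual
probs <- predict(model, type = "response")  # predicted probability of class 1
pred  <- ifelse(probs > 0.5, 1, 0)          # assign each observation to one of the two classes
table(pred, mtcars$am)                      # confusion table of predictions vs truth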