Data Science I Final Exam

124 Terms

1
New cards

descriptive statistics

uses the data to provide descriptions (numerical calculations) and visualizations (graphs or tables) of the population

2
New cards

inferential statistics

uses the data to make inferences and predictions about a population (such as inferring trends)

3
New cards

mean

the average of a set of numbers

4
New cards

median

the middle value in a dataset when the numbers are arranged in order

5
New cards

mode

the value that appears most often in a dataset (if no numbers repeat, there is no mode)

6
New cards

what type of distribution should use mean

normal/symmetric distributions (also called bell curve or gaussian curve)

7
New cards

what type of distribution should use median

skewed distributions

8
New cards

what type of distribution should use mode

bimodal distributions

9
New cards

kurtosis

the measure of the tailedness of a distribution (how often outliers occur, how thin/fat the distribution is)

10
New cards

standard deviation

a measure of how dispersed the data is in relation to the mean (low SD means data is more clustered around mean, high SD means data is more spread out)

11
New cards

variance

how large the spread is within a dataset (how far each number is from the mean)
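
As a quick illustration, these summary statistics can be computed in base R; the vector below is made up for the example:

x <- c(2, 4, 4, 5, 7, 9, 30)   # toy data, invented for illustration
mean(x)      # average
median(x)    # middle value
sd(x)        # standard deviation: how clustered the data is around the mean
var(x)       # variance: the squared spread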

12
New cards

quartiles

the values that split the data set into four equal parts (each part represents 25%); there are three quartiles, Q1, Q2, Q3, since three cut points create four parts

13
New cards

quantiles

cut points that indicate how many values in a distribution fall above or below a certain limit (commonly five quantiles: 0.01, 0.25, 0.50, 0.75, 0.99)

14
New cards

percentiles

expresses the relative position of a score within a dataset

15
New cards

interquartile range (IQR)

the middle 50% of values when ordered from lowest to highest (Q3-Q1)

16
New cards

fences

the maximum and minimum values that are not outliers

17
New cards

equation for upper fence

Q3 + (1.5 * IQR)

18
New cards

equation for lower fence

Q1 - (1.5 * IQR)
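
As a hedged sketch, both fences can be computed in base R from the quartiles (the vector is made up for the example):

x <- c(2, 4, 4, 5, 7, 9, 30)            # toy data
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1                          # interquartile range
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
x[x < lower_fence | x > upper_fence]    # values outside the fences are outliers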

19
New cards

regression

analyzes the relationship between a dependent variable and one or more independent variables to find trends in data

20
New cards

why do we use regressions?

they help us understand how values of a certain response (DV) are associated with values of a predictor (IV)

21
New cards

method of least squares

used when a linear correlation between two variables exists; selects the linear trend line that best represents the data by minimizing the sum of squared residuals

22
New cards

what function is used to run a regression in R

model <- lm(DV ~ IV1 + IV2..., data = dataset)
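
A minimal usage sketch; mtcars is a built-in R dataset used here only for illustration, and the chosen variables are not from the course:

model <- lm(mpg ~ wt + hp, data = mtcars)   # regress mpg (DV) on weight and horsepower (IVs)
summary(model)                              # coefficients, R-squared, p-values
residuals(model)                            # residual values: actual minus predicted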

23
New cards

residual value

the difference between the predicted value of the data and the actual value of the data

24
New cards

statistical error

the difference between a value obtained from a data collection process and the "true" value for the population (the greater the error, the less representative the data is for the population)

25
New cards

tidy data

data arranged in a simple but precisely defined pattern (exists in defined data tables)

26
New cards

why is tidy data important?

when data is in this form, it is easy to transform it into arrangements that are more useful for answering questions

27
New cards

categorical variables

record the type of category and are often in word form

28
New cards

quantitative variables

record a numerical attribute and are in number form

29
New cards

database

a program that helps store data and provides functionality for adding, modifying, and querying that data

30
New cards

relational database

stores its data in one or more tables, with multiple tables typically related to one another

31
New cards

primary key

an attribute that is a unique identifier of rows

32
New cards

foreign key

an attribute in one table that references the primary key of another existing table

33
New cards

what does it mean to query a database?

ask the database questions to retrieve specific information (most commonly by using SQL)

34
New cards

database server

runs a database management system, manages data access and retrieval, and provides database services to clients

35
New cards

how do you use a database server?

ask the database server for what you need; you do not need to touch the database directly, since the server does all of the storing, finding, securing, and managing of the data itself

36
New cards

dbConnect(drv, path)

returns a connection object using the specified driver (drv) to connect to the database at the specified path

37
New cards

SQLite()

returns a driver for the SQLite database (can be passed as the drv argument in dbConnect)

38
New cards

dbDisconnect(con)

disconnects from the database and frees resources used by the connection

39
New cards

what packages are needed for SQLite?

library(DBI), library(RSQLite)
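
Putting the connection functions together, a minimal sketch (the database file name is a placeholder):

library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "my_database.sqlite")   # open a connection to the SQLite file
# ... send queries here ...
dbDisconnect(con)                                           # free the connection's resources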

40
New cards

SELECT statement (querying)

retrieves rows and columns from a table (SELECT * selects everything; multiple columns can be selected at a time)

41
New cards

what conditions can be added onto the SELECT statement?

FROM (the table you want to select from)

WHERE (a condition that filters which rows are returned)

ORDER BY (the order you want the output in)
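
For example, a query combining these clauses might be built as the string below (the table and column names are hypothetical):

query <- "SELECT name, age FROM customers WHERE age > 30 ORDER BY age DESC"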

42
New cards

dbSendQuery(conn, statement)

runs the query on the database specified by the connection object and returns a response object

43
New cards

dbFetch(res)

returns the results from the response object as a data frame

44
New cards

dbClearResult(res)

clears the result set and frees the resources used by it
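
A hedged end-to-end sketch tying these functions together (the table name "customers" is hypothetical):

library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "my_database.sqlite")
res <- dbSendQuery(con, "SELECT * FROM customers WHERE age > 30")   # run the query
df  <- dbFetch(res)                                                 # results as a data frame
dbClearResult(res)                                                  # free the result set's resources
dbDisconnect(con)                                                   # close the connection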

45
New cards

database administration (DBA)

managing and maintaining a database management system (DBMS)

46
New cards

database administrator

the person/team responsible for keeping the database running smoothly, keeping the data safe and secure, ensuring data is available when needed, and making sure queries run efficiently

47
New cards

5 big responsibilities of DBA

security, performance, availability, reliability (integrity), configuration

48
New cards

professional ethics

the special responsibility not to take unfair advantage of the trust and confidence that clients place in you

49
New cards

data scientist's oath

respect the privacy of data subjects, understand/recognize the data represents real people and situations, always maintain fair treatment and nondiscrimination, and remember you are a member of society with special obligations to all fellow human beings

50
New cards

what values does the data values and principles manifesto espouse?

inclusion, experimentation, accountability, impact

51
New cards

algorithmic bias

caused by biased data; the algorithm is not 'thinking' or 'being sexist,' but is rather reflecting and amplifying the stereotypes that were already present in its training data, which can reinforce harmful biases about who 'belongs' in certain professions

52
New cards

data and disclosure

the ability to link multiple data sets and use public information to identify individuals is a growing problem, so you must be able to balance disclosure (to help improve something) and nondisclosure (to ensure private information is not made public)

53
New cards

data scraping

finding available data online meant for human consumption and collecting it without authorization

54
New cards

safe data storage

ensuring data protections remain even when equipment is transferred or disposed of

55
New cards

reproducible analysis

the process of recording each and every step (no matter how trivial) to ensure the result is reproducible for others

56
New cards

collective ethics

recognizing that although science is carried out by individuals and teams, the scientific community as a whole is a stakeholder, so ethical obligation outweighs individual reputation

57
New cards

what topics are included in the professional guidelines for ethical conduct?

professionalism, integrity of data and methods, responsibilities to stakeholders, conflicts of interest, the response to allegations of misconduct

58
New cards

data mining

the use of machine learning and statistical analysis to uncover patterns and other valuable information from large datasets

59
New cards

sample

subset of a population with a manageable size used for analysis (chosen using random selection)

60
New cards

population

the complete data set, which is often too large to analyze in full

61
New cards

random variables

an unpredictable measure of a trait or value associated with an object, person, or place

62
New cards

probability distributions

a function that gives the probabilities of occurrence of possible events for an experiment (used to model an unpredictable variable)

63
New cards

what should you keep in mind when considering a probability distribution?

consider what other events are possible, and define a set of events that are mutually exclusive (since only one event can occur at a time)

64
New cards

what are the main characteristics of a probability distribution?

the probability of a single event never goes below 0 or exceeds 1, and the probability of all events always sums to exactly 1

65
New cards

discrete variable

a random variable whose values can be counted or grouped into distinct categories (ex: car color)

66
New cards

continuous variable

a random variable that assigns probability to a range of values (ex: car mileage/mpg)

67
New cards

binomial distributions

models variables that assume only one of two values per trial (ex: a heads-or-tails coin flip)

68
New cards

categorical distributions

models variables whose values are grouped into (and sometimes ranked by) categories (ex: socioeconomic status, blood type)
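
Both kinds of variables can be simulated in base R; a minimal sketch with made-up parameters:

rbinom(10, size = 1, prob = 0.5)                            # ten heads-or-tails coin flips (1 = heads)
sample(c("A", "B", "AB", "O"), size = 10, replace = TRUE)   # ten draws from blood-type categories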

69
New cards

what are pearson's r and spearman's rank correlation used for?

used to determine if a relationship between two variables exists

70
New cards

what is the difference between pearson's r and spearman's rank correlation?

pearson's r: assumes data is normally distributed and that variables are numeric, continuous, and linearly related

spearman's rank correlation: does not require normality; used with ordinal (rank-based) variables and with non-linear but monotonic relationships
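
Both coefficients can be computed with base R's cor(); the vectors below are made up for the example:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cor(x, y, method = "pearson")    # assumes numeric, normally distributed, linearly related data
cor(x, y, method = "spearman")   # rank-based; suited to ordinal or monotonic non-linear relationships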

71
New cards

singular value decomposition (SVD)

reduces the dimensionality of a dataset by compressing it and removing redundant information and noise (ex: image compression)
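
Base R's svd() performs the decomposition; a hedged sketch of a rank-1 approximation of a random toy matrix:

set.seed(1)
m <- matrix(rnorm(20), nrow = 4)              # toy 4 x 5 matrix
s <- svd(m)                                   # m = U diag(d) t(V)
approx1 <- s$d[1] * (s$u[, 1] %o% s$v[, 1])   # keep only the largest singular value
# approx1 is a compressed version of m with redundant information and noise discarded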

72
New cards

time series

a collection of data on attribute values over time that is used to predict future instances of the measure based on past observational data

73
New cards

constant time series

remains at roughly the same level over time

74
New cards

trended time series

shows a stable linear movement up or down over time

75
New cards

untrended/trended seasonal time series

predictable, cyclic fluctuations that recur seasonally throughout a year

76
New cards

autoregressive moving averages (ARMA)

a class of forecasting methods that can be used to predict future values from current and historical data (needs at least 50 observations for reliable results)
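
Base R's arima() can fit an ARMA model (an ARMA(p, q) is an ARIMA(p, 0, q)); a hedged sketch on simulated data:

set.seed(1)
y <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 100)   # simulate 100 observations
fit <- arima(y, order = c(1, 0, 1))                         # fit an ARMA(1, 1)
predict(fit, n.ahead = 5)                                   # forecast the next 5 values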

77
New cards

machine learning (algorithmic learning)

the practice of applying algorithmic methods to data in an iterative manner so the computer discovers hidden patterns/trends that can be used to make predictions

78
New cards

how long do learning algorithms typically run?

until the final analysis results no longer change, no matter how many additional times the data is fed to the algorithm (the algorithm is repeated over and over until a predetermined set of conditions is met)

79
New cards

what are the steps of the machine learning process?

setting up: acquiring data, preprocessing it, selecting the most appropriate variables for the task at hand, then breaking the data into training and test datasets

learning: model experimentation, training, building, and testing

application: model deployment and prediction

80
New cards

what is the rule of thumb for breaking data into training and test data sets?

apply random sampling to take two-thirds of the original dataset to use as data to train the model, then use the remaining one-third for evaluating the model's predictions
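
A minimal sketch of that split using random sampling; mtcars is a built-in R dataset standing in for the data:

set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(n, size = round(2/3 * n))   # random two-thirds of the row indices
train <- mtcars[train_idx, ]                    # used to train the model
test  <- mtcars[-train_idx, ]                   # held out to evaluate the model's predictions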

81
New cards

supervised learning

algorithms learn from known features of the data to produce an output model, which is then used to successfully predict labels for new incoming, unlabeled data points (all input data must have labeled features)

82
New cards

unsupervised learning

accepts unlabeled data and attempts to group observations into categories based on underlying similarities in input features

83
New cards

semisupervised/reinforcement learning

a behavior-based learning model where the model is given 'rewards' based on how it behaves (model subsequently learns how to maximize rewards by adapting the decisions it makes)

84
New cards

clustering

a type of machine learning that uses an unsupervised learning technique

85
New cards

when should clustering be used?

if there is a dataset that describes multiple features about a set of observations and you want to group the observations by their feature similarities

86
New cards

partitional clustering algorithms

algorithms that create only a single set of clusters

87
New cards

hierarchical clustering algorithms

algorithms that create separate sets of nested clusters, each in its own hierarchical level

88
New cards

k-means clustering algorithm

a simple and fast unsupervised learning algorithm that can be used to predict groupings within a dataset (ex: pizza party example)

89
New cards

how does a k-means clustering algorithm make its prediction?

predictions are made based on the number of centroids present, which are represented by k (a model parameter that must be defined), and the nearest mean values, which are measured based on the distance between points plotted

90
New cards

what does k mean in a k-means clustering algorithm?

the number of centroids present (the centers of the clusters) (ex: if k = 2, there will be 2 centroids, or 2 cluster centers, so 2 clusters)
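
Base R's kmeans() implements this; a minimal sketch on the built-in iris dataset (chosen only for illustration):

set.seed(1)
pts <- iris[, c("Petal.Length", "Petal.Width")]   # two numeric features
fit <- kmeans(pts, centers = 2)                   # k = 2, so two centroids and two clusters
fit$cluster                                       # cluster assignment for each observation
fit$centers                                       # the two centroid locations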

91
New cards

kernel density estimation (KDE)

a smoothing method that works by placing a kernel (a weighting function used for quantifying density) on each data point in the dataset and then summing the kernels to generate a kernel density estimate for the overall region (ultimately just makes a smooth curve instead of a curve made of chunks)
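
Base R's density() computes a kernel density estimate; a minimal sketch on a made-up vector:

x <- c(2, 3, 3, 4, 7, 8, 8, 9)   # toy data
d <- density(x)                  # places a kernel on each point and sums them
plot(d)                          # draws the resulting smooth curve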

92
New cards

when is KDE helpful?

when eyeballing clusters

93
New cards

dendrogram

a visualization tool that depicts the similarities and branching between groups in a data cluster (can be built bottom-up or top-down)

94
New cards

bottom-up dendrogram

assembling pairs of points and then aggregating them into larger and larger groups

95
New cards

top-down dendrogram

starting with the full dataset and splitting it into smaller and smaller groups
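
A bottom-up dendrogram can be sketched with base R's hclust(), which clusters agglomeratively; iris is a built-in dataset used only for illustration:

pts <- iris[, c("Petal.Length", "Petal.Width")]   # two numeric features
hc <- hclust(dist(pts))                           # merge points into larger and larger groups
plot(hc)                                          # draws the dendrogram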

96
New cards

classification

a form of supervised machine learning where the algorithm learns from labeled data in order to build predictive models it can use to forecast the classification of future observations

97
New cards

binary classification

data classified into two possible classes

98
New cards

multi-class classification

data classified into one of three or more classes

99
New cards

multi-label classification

assigns one or more labels to each observation rather than just a single label

100
New cards

imbalanced classification

classification where the data is distributed unevenly across classes (significantly more observations fall in one class than another)