data science midterm

0.0(0)

Studied by 0 people

0.0(0)

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/52

There's no tags or description

Looks like no tags are added yet.

Study Analytics

Name	Mastery	Learn	Test	Matching	Spaced

No study sessions yet.

53 Terms

New cards

individuals

the objects described by our data (observations, population units)

New cards

variable

any attribute of an individual, can take on different values for different individuals

categorical - can take on fixed number of labels, assigns each individual to a particular group or category based on a quantitative property (ex: favourite colour, introversion type)
quantitative - takes on numerical values, measures or counts something (ex: a person’s height, weight)

New cards

pie chart (categorical)

represents a categorical variable, telling us what values it takes and how often it takes them

Each slice’s angle or area shows the relative frequency (%) for a category.
Categories must be mutually exclusive and exhaustive, totaling 100%; otherwise, a pie chart is inappropriate.

New cards

bar chart (categorical)

consists of bars whose heights represent the proportion/frequency of values for each category

can be used to visualize data that doesn’t add up to 100% as the y-axis is quantitative and the x-axis is categorical (bars provide measure of the categories)

New cards

histograms (quantitative)

A histogram is a graphical display of a quantitative variable using adjacent bars to show the frequency (%) of values within each interval (bin).

It approximates the variable’s density, and as the number of bins increases (with narrower widths), it approaches a smooth density curve representing the overall distribution pattern.

New cards

overall pattern of histograms

Center (location): the mean or median.
Shape: symmetry or skewness.

Symmetric: left and right sides are mirror images.
Right-skewed: right tail (larger values) extends farther; mean > median.
Left-skewed: left tail (smaller values) extends farther; mean < median.
Variability (spread): measured by variance, standard deviation, range, or IQR.
Deviations (outliers): observations that break the overall pattern.

New cards

stemplot (quantitative)

a graphical display of quantitative data that uses the actual digits of the observations to help convey the distribution of the data

you can determine the shape of a distribution from a stemplot by observing the symmetry, skewness, number of peaks, and any gaps or outliers in the data

New cards

timeplot (quantitative)

a graphical display of a quantitative variable against time

time is always on the x-axis, each point represents a value at a specific time
useful for identifying trends, cycles, and patterns

New cards

mean

the sum of the values divided by the number of values

New cards

median

the midpoint of the distribution

order values from least to greatest
if n is odd, the median is the middle value
if n is even, the median is the average of the 2 middle values

New cards

measures of central tendency

tells us about the typical or average value in a dataset

New cards

mode

most frequently occurring value in the dataset

New cards

measuring variability: the quartiles

Q1 (first quartile) - the median of the lower half
Q2 (second quartile) - overall median
Q3 (third quartile) - median of the upper half

New cards

five-number summary

consists of 5 key values that provide a concise overview of its distribution

minimum (smallest value in the dataset)
Q1, Q2, Q3
maximum (largest value)

New cards

box and whisker plot

a graphical representation of the five-number summary

draw box from Q1 to Q3
mark median (Q2) inside the box
whiskers to min and max

New cards

interquartile range

a measure of statistical dispersion

calculated by Q3 - Q1 = IQR

New cards

inner fence

the interval (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)

any observation that falls outside of this interval is considered an outlier

New cards

modified boxplot

extends whiskers only to the most extreme values within the inner fence, with outliers plotted as individual points beyond the whiskers

New cards

variance

a measure of how much the values in the dataset vary/spread out from the mean

New cards

standard deviation

the square root of the variance, provides measure of spread in the same unit as data

New cards

choosing measures of centre and variability

2 options - mean and standard deviation or five-number summary

use mean and standard deviation for symmetrical distributions without outliers
use five-number summary for skewed distribution or if extreme outliers are present

New cards

density curve

A density curve f(x)is a smooth function where f(x)≥0 for all x (the curve lies on or above the x-axis).

The total area under the curve equals 1.
It can be used to compute probabilities.
The mean (μ) is the balance point of the curve.
The median (m) divides the area under the curve into two equal halves.
The mode is the point where the curve reaches its maximum height.

New cards

population

the target collection of individuals of interest

New cards

parameter

a numeric description of an attribute of the population

New cards

sample

a subset of the population which is available to collect data from

New cards

statistic

a numeric summary/description of the sample

New cards

normal distribution

A normal distribution is a common, bell-shaped distribution found in many statistical and real-world phenomena.

Notation: N(μ,σ) — mean μ, standard deviation σ
Shape: bell-shaped, symmetrical, with mean = median = mode
Empirical Rule:
- 68% of data within 1 standard deviation of the mean
- 95% within 2 standard deviations
- 99.7% within 3 standard deviations

New cards

cumulative proportion

of a value x in a distribution is the proportion of observations that are less than or equal to x

New cards

standard normal distribution

the normal standard distribution with mean 0 and standard deviation 1, denoted N(0,1)

New cards

standardization

the process of transforming from a normal distribution to a standard distribution

if x is an observations from a distribution that has mean μ and standard deviation σ, then the standardized value of x is z = x - μ / σ
to convert a z-score back to a raw score x = σ x z + μ

New cards

scatterplot (quantitative)

a graphical display of the relationship between 2 quantitative variables measured on the same individuals

each individual appears as a point on the plot
x and y coordinates each give the value of different variables to the same individuals

New cards

response variable

measures the outcome of a study (y-axis)

New cards

explanatory variable

may explain or influence changes in the response variable (x-axis)

New cards

interpreting scatterplots

when examining a scatterplot, looking for the following features:

outliers - an individual that falls outside the overall pattern
form - linear or non-linear
direction - positive or negative association
strength - how closely the points follow the form (strong, moderate, weak)

New cards

correlation r

measures the direction and strength of the linear relationship between 2 quantitative variables

ranges from -1 to +1

New cards

downsides to correlation

correlation doesn’t imply causation (may be confounding variables at play)
sensitivity to outliers
limited to linear relationships
ignores other factors (variables not being studied)

New cards

regression line

a straight line that describes how a response variable y changes as an explanatory variable x changes

New cards

least squares regression line (line of best fit)

the line that minimizes the sum of the squared differences between the observed values and the values predicted by the line

equation - ŷ = a + bx
slope - b = r x Sy/Sx
intercept - a = y -bx

New cards

coefficient of determination R²

measures the proportion of variance in the response variable explained by the explanatory variable

equation - R² = corr(x,y)²
- a value of 1 means the model perfectly predicts the outcome
- a value of 0 means the model explains none of the variability in the outcome variable

New cards

residual

the difference between a response variable and its corresponding predicted value from the regression line

equation - e = y - ŷ

New cards

residual plot

A residual plot helps assess whether a linear model fits the data well.

Residuals should be randomly scattered around 0
The spread of residuals should be consistent across all x-values
There should be no visible curves or clusters
Residuals should be roughly symmetric

New cards

influential observation

an observation which drastically impacts the slope/intercept of the regression line

a datapoint has high leverage if its x-value is extreme to the rest of the points

New cards

lurking variable

an unobserved variable that influences both the explanatory and response variables, potentially leading to a spurious association between them

New cards

ecological fallacy

occurs when conclusions about individual behavior are drawn from group-level data.

It involves two main errors:

Assuming that a relationship observed for groups also applies to individuals within those groups.
Overestimating the strength of the relationship based on aggregated (group-level) data.

New cards

two-way table

describes 2 categorical variables by providing counts for each possible combination of a row variable and a column variable

New cards

marginal distribution

a variable describes the value that a specific variable takes irrespective of the other (only applies in the context of multiple variables)

New cards

conditional distribution

the probability distribution of a variable given that another variable has a specific value (separate conditional distribution for each value of the other variable)

New cards

probability, risk, and odds

suppose x in some categorical variable and c is some category of x, the percent of individuals in category c is given b: # of individuals in category c / # of individuals in total x 100%
the proportion, probability, and risk of individuals in category c is given by: # of individuals in category c / # of individuals in total
the odds of category c given by: # of individuals in category c / # of individuals not in category c

New cards

relative risk (RR)

risk in 1 group (group 1) compared to the risk in another group (group 2 or baseline/reference)

convention: in medical applications in which risk refers to the negative outcome (disease, death), the baseline group is the control group
equation - RR = risk in group 1 / risk in group 2

New cards

properties of relative risk

RR = 1 - both groups have equal risk
RR > 1 - group 1 has higher risk than baseline
RR < 1 - group 1 has lower risk than baseline

New cards

odds ratio (OR)

odds in one category (group 1) compared to the odds of another category (group 2)

equation - OR = odds in group 1 / odds in group 2

New cards

properties of odds ratios

OR = 1 - both groups have equal odds
OR > 1 - group 1 has higher odds than baseline
OR < 1 - group 1 has lower odds than baseline

New cards

simpson’s paradox

occurs when a trend that appears in different groups of data disappears or reverses when the groups are combined.

This happens when a lurking variable (often a grouping variable) affects the relationship between two variables differently across subgroups. The overall aggregate data can show the opposite pattern from what is observed in each individual subgroups