population
the entire group of individuals about which we want information
sample
a subset of the population from which we collect information
census
collects data from the entire population
convenience sample
choosing individuals who are easiest to reach, not randomized
voluntary response sampling
allows people to choose to be in the sample by responding to a general invitation (biased)
simple random sample (SRS)
every group of n individuals in the population has an equal chance to be selected as the sample
stratified random sample
selects a sample by choosing an SRS from each stratum and combining the SRSs into one overall sample
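A minimal sketch of an SRS and a stratified random sample, using only Python's standard library. The student IDs, the grade-level strata, and the function names are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Choose n individuals without replacement; every group of n is equally likely."""
    return rng.sample(population, n)

def stratified_random_sample(strata, n_per_stratum, rng):
    """Take an SRS of size n_per_stratum from each stratum, then combine them."""
    combined = []
    for members in strata.values():
        combined.extend(simple_random_sample(members, n_per_stratum, rng))
    return combined

rng = random.Random(1)  # fixed seed only so the sketch is repeatable

# Hypothetical population: student IDs grouped by grade level (the strata)
strata = {
    "freshmen": list(range(0, 100)),
    "sophomores": list(range(100, 200)),
    "juniors": list(range(200, 300)),
}
population = [s for members in strata.values() for s in members]

srs = simple_random_sample(population, 15, rng)
stratified = stratified_random_sample(strata, 5, rng)
print(srs)
print(stratified)
```

The SRS may over- or under-represent a grade by chance, while the stratified sample always contains exactly 5 students from each grade.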
cluster sample
obtained by selecting all individuals within a randomly selected collection or group of individuals
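As an illustration, a cluster sample could be sketched like this. The homerooms, the student names, and the function name `cluster_sample` are made up for the example.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters, then keep every individual in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(clusters[name])
    return chosen, sample

rng = random.Random(7)  # fixed seed only so the sketch is repeatable

# Hypothetical clusters: homerooms in one school
clusters = {
    "homeroom A": ["Ana", "Ben", "Cy"],
    "homeroom B": ["Dee", "Eli", "Fay"],
    "homeroom C": ["Gus", "Hal", "Ivy"],
    "homeroom D": ["Jo", "Kai", "Lu"],
}

chosen, sample = cluster_sample(clusters, 2, rng)
print(chosen, sample)
```

The randomness is in which clusters get picked; once a cluster is picked, nobody inside it is left out.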
strata
groups within a population that are homogeneous (similar) based on a characteristic that is relevant to the study.
cluster
groups of individuals (often defined by location) that are each diverse and ideally represent the population on a smaller scale
systematic random sample
selects a sample from an ordered arrangement of the population by randomly selecting one of the first k individuals and choosing every kth individual thereafter
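A short sketch of a systematic random sample, assuming the population is already in a list in its natural order. The numbered roster and the function name are hypothetical.

```python
import random

def systematic_random_sample(ordered_population, k, rng):
    """Pick a random start among the first k individuals, then take every kth one."""
    start = rng.randrange(k)  # index 0 .. k-1
    return ordered_population[start::k]

rng = random.Random(3)  # fixed seed only so the sketch is repeatable
roster = list(range(1, 101))  # hypothetical ordered list of 100 individuals

sample = systematic_random_sample(roster, 10, rng)
print(sample)
```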
bias
the systematic favoring of certain outcomes due to the method of collecting data; a biased study design consistently overestimates or underestimates the value we want to know
sampling frame
the list of individuals from which the sample is actually drawn (ideally every individual in the population)
nonresponse bias
occurs when an individual chosen for the sample can't be contacted or refuses to respond
undercoverage
occurs when some groups in the population are left out of the process of choosing the sample
observational study
observes individuals and measures variables of interest but does not attempt to influence the responses
response bias
when participants in a study or survey provide answers that are inaccurate, false, or misleading, often due to the way the questions are asked or other outside influences
association
a relationship between two or more variables; association alone does NOT imply causation
confounding variable
an outside factor that influences both the independent and dependent variable, potentially distorting the true relationship between them
experiment
deliberately imposes treatments on individuals to measure their responses
treatment
a specific condition applied to the individuals
factors
the explanatory variables that are being manipulated and may cause a change in the response variable (smoking & drinking)
levels
different values of the factors (0, 1, or 2 packs of cigarettes/beers)
placebo
a treatment with no active ingredient that is otherwise similar to the other treatments
single-blind study
either the subjects or the researchers are unaware of who receives active treatment or placebo
double-blind study
neither the participant nor the researcher knows whether the participant has received the treatment or the placebo
placebo effect
experimental results caused by expectations alone
control group
the group that does not receive the experimental treatment; used to provide a basis for comparison
random assignment
uses a chance process to assign experimental units to treatments, creating groups that are roughly equivalent at the beginning of an experiment (allows cause-and-effect conclusions)
statistical significance
an observed result is statistically significant when it is too unusual to be plausibly explained by chance alone (commonly judged by p < 0.05)
replication
giving each treatment to a sufficient number of experimental units so that any observed differences in treatment effects can be distinguished from random variation caused by the process of random assignment
completely randomized design
the treatments are assigned to all the experimental units completely by chance
randomized block design
the random assignment of experimental units to treatments is carried out separately within each block
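One possible sketch of the two designs, side by side. The subject labels, the blocks, the treatment names, and both function names are hypothetical.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

def randomized_block(blocks, treatments, rng):
    """Carry out the random assignment separately within each block."""
    assignment = {t: [] for t in treatments}
    for members in blocks.values():
        within_block = completely_randomized(members, treatments, rng)
        for t in treatments:
            assignment[t].extend(within_block[t])
    return assignment

rng = random.Random(5)  # fixed seed only so the sketch is repeatable
treatments = ["drug", "placebo"]

# Hypothetical blocks of 8 subjects each
blocks = {
    "men": [f"M{i}" for i in range(1, 9)],
    "women": [f"W{i}" for i in range(1, 9)],
}
all_units = blocks["men"] + blocks["women"]

crd = completely_randomized(all_units, treatments, rng)
rbd = randomized_block(blocks, treatments, rng)
print(crd)
print(rbd)
```

In the completely randomized design the mix of men and women in each group is left to chance; in the block design each treatment group gets exactly 4 men and 4 women.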
matched pairs design
A method of assigning subjects to groups in which subjects are first matched in pairs on some characteristic, and then one member of each pair is randomly assigned to each treatment. (Alternatively, each subject receives both treatments in a random order.)
inference
using information from a sample to draw conclusions about the population
sampling variability
the natural tendency of randomly drawn samples to differ, one from another
scope of inference
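the extent to which we can generalize conclusions: random sampling allows inference about the population, and random assignment allows inference about cause and effect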
frequency table
summarizes one categorical variable using counts
relative frequency table
summarizes one categorical variable using percentages/proportions
two-way table
A table containing counts for two categorical variables, displaying totals for different categories
marginal relative frequency
the percent or proportion of individuals that have a specific value for one categorical variable (two-way table)
joint relative frequency
the percent or proportion of individuals that have a specific value for one categorical variable and a specific value for another categorical variable (two-way table)
conditional relative frequency
the percent or proportion of individuals that have a specific value for one categorical variable among individuals who share the same value of another categorical variable (the condition)
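A small worked example of the three kinds of relative frequency. The two-way table counts below (gender by a yes/no answer) are invented purely for illustration.

```python
# Hypothetical two-way table: (gender, answer) -> count
counts = {
    ("male", "yes"): 30, ("male", "no"): 20,
    ("female", "yes"): 10, ("female", "no"): 40,
}
total = sum(counts.values())  # 100 individuals in all

# Row total for males, used for the marginal and conditional frequencies
male_total = sum(v for (gender, _), v in counts.items() if gender == "male")

# Marginal: one variable only, out of everyone
marginal_male = male_total / total

# Joint: a value of each variable at once, out of everyone
joint_male_yes = counts[("male", "yes")] / total

# Conditional: "yes" among males only (the condition is gender = male)
conditional_yes_given_male = counts[("male", "yes")] / male_total

print(marginal_male, joint_male_yes, conditional_yes_given_male)
```

The only thing that changes between the joint and the conditional frequency is the denominator: everyone (100) versus just the males (50).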
mosaic plot
a modified segmented bar graph in which the width of each rectangle is proportional to the number of individuals in the corresponding category
SUCS
For describing distribution - Shape, Unusual Values, Center, Spread
stemplot
a graphical representation of a dataset that organizes and displays data in a way that preserves the original values. Includes a stem and single-digit leaves. MUST HAVE A KEY
5 number summary
min, Q1, median, Q3, max
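A sketch of the five-number summary, assuming the common classroom convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when n is odd. Software packages often use other quartile rules, so results can differ slightly. The function name and data are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Skip the middle value when n is odd so it belongs to neither half
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

odd_summary = five_number_summary([7, 1, 4, 2, 6, 3, 5])
even_summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8])
print(odd_summary)
print(even_summary)
```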
density curve
A mathematical model used to describe the overall pattern of the distribution of a random variable. Always on or above the x-axis and has an area of exactly 1 underneath.
DUFS
For describing a scatterplot (the relationship between two quantitative variables) - direction, unusual values, form, strength
high leverage points
points with much larger/smaller x-values than the other points in the data set
outliers (regression)
points that do not follow the pattern of the data and have large residuals
influential point
A point whose removal would drastically change the slope, y-intercept, correlation, r², or standard deviation of the residuals of the LSRL.
regression line (LSRL)
predicted y = a + bx, where a is the y-intercept and b is the slope
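A minimal sketch of how a and b can be computed by least squares, using the standard formulas b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄. The function name `lsrl` and the data points are hypothetical.

```python
def lsrl(xs, ys):
    """Return (a, b) for the least-squares line: predicted y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = lsrl(xs, ys)

# Residual = actual y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b, residuals)
```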
power model
when taking the logarithm of both variables linearizes the data (y = axᵖ)
exponential model
when taking the logarithm of only the y variable linearizes the data (y = abˣ)
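A rough sketch of both transformations: fit a straight line to the logged data, then undo the logs to recover the model. Base-10 logs are an arbitrary choice here (natural logs work the same way), and the function names and data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for ys versus xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def fit_exponential(xs, ys):
    """log y = log a + (log b) x, so y = a * b**x."""
    intercept, slope = fit_line(xs, [math.log10(y) for y in ys])
    return 10 ** intercept, 10 ** slope

def fit_power(xs, ys):
    """log y = log a + p log x, so y = a * x**p."""
    intercept, slope = fit_line([math.log10(x) for x in xs],
                                [math.log10(y) for y in ys])
    return 10 ** intercept, slope

xs = [1, 2, 3, 4, 5]
exp_a, exp_b = fit_exponential(xs, [3 * 2 ** x for x in xs])  # data from y = 3 * 2^x
pow_a, pow_p = fit_power(xs, [5 * x ** 1.5 for x in xs])      # data from y = 5 * x^1.5
print(exp_a, exp_b, pow_a, pow_p)
```

Both variables must be positive for the logs to exist, which is an assumption of this sketch.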
large counts condition
the normal approximation is appropriate when np ≥ 10 and n(1 - p) ≥ 10
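The check itself is simple arithmetic. A tiny sketch, with a hypothetical function name and made-up values of n and p:

```python
def large_counts_ok(n, p):
    """True when both expected counts, n*p and n*(1-p), are at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10

ok_case = large_counts_ok(200, 0.1)   # expected counts: 20 and 180
bad_case = large_counts_ok(50, 0.1)   # expected counts: 5 and 45
print(ok_case, bad_case)
```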
central limit theorem
when the sample size is large (n ≥ 30 is a common rule of thumb), the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population distribution
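A quick simulation sketch of the idea, using samples of size 30 from a strongly right-skewed population. The choice of an exponential population with mean 1 and SD 1, the number of repetitions, and the seed are all assumptions made for the demonstration.

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)  # fixed seed only so the sketch is repeatable
n = 30        # size of each sample
reps = 2000   # number of samples drawn

# Population: exponential with mean 1 and SD 1 (strongly right-skewed)
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

center = mean(sample_means)   # should land near the population mean, 1
spread = stdev(sample_means)  # should land near 1 / sqrt(30), about 0.18
print(center, spread)
```

A histogram of `sample_means` would look roughly symmetric and bell-shaped even though the population itself is skewed.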