conditional distributions
describe the probability distribution of one random variable given that another random variable takes a specific value
notation: P(category 1 | category 2) = the probability that someone falls into category 1 given that they are a member of category 2
since there are several categories included in a conditional distribution table, we want a single test that will capture the full pattern (e.g., are gender and education associated overall?) at once, rather than several tests
difference column
a column included to specify the difference between two conditional distributions (e.g., percentage of men with a bachelor’s degree vs percentage of women with a bachelor’s degree)
difference=difference between two proportions: p1-p2=14.5%-12.3%=2.2pp
why is a single statistical test that captures full picture better than running several tests simultaneously?
each test has a certain level of uncertainty associated with it
don’t want to compound that error over a bunch of tests
multiple testing problem
researchers often test hundreds or thousands of comparisons at once (e.g., geneticists scan millions of DNA variants for association with a disease)
with enough groups, you are almost guaranteed to find “significant” results that are really just noise (false positives)
a reason why many published findings have not been replicated by other researchers (replication crisis)
chi squared test
a test used to analyze categorical data to determine if observed frequencies differ significantly from expected frequencies
provides one statistic, one p-value for the entire conditional distribution table
data must be raw frequencies of categorical items (not percentages), samples must be randomly selected, and the expected frequency in each cell should be large enough (at least 5)
under the null (H0), we can compute the expected cell counts
in calculations, the difference between expected and observed values is squared to prevent positive and negative differences from canceling out
also divide by the expected frequency to normalize the data (i.e., scale each squared difference relative to the magnitude of the expected count so that large cells don’t disproportionately skew the statistic)
chi-squared test steps
state the hypotheses: H0: the two categories are independent, Ha: the two categories are associated
calculate the expected frequencies under the null hypothesis=(row total x column total)/grand total
calculate the chi-square statistic (x2)
find the critical value: (1) calculate degrees of freedom (2) find the critical value in a table
degrees of freedom=(number of rows-1)(number of columns-1)
conclude: if the chi squared statistic is greater than the critical value, reject the null
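The steps above can be sketched in a few lines of Python on a hypothetical 2x2 table (all counts invented for illustration; 3.841 is the chi-squared table value for df=1, alpha=0.05):

```python
# Hypothetical 2x2 table of raw frequencies (invented counts)
observed = [[30, 20],    # group A: outcome yes / outcome no
            [10, 40]]    # group B: outcome yes / outcome no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count per cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic = sum over cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom = (rows - 1)(columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Compare to the table's critical value for df=1, alpha=0.05
critical_value = 3.841
print(chi2, df, chi2 > critical_value)   # reject H0 if True
```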
chi square conclusions
all the chi square can tell you is whether or not there is a statistically significant association between the two categories
reject the null if the chi-squared statistic is greater than the critical value (equivalently, if the p-value is less than the significance level)
fail to reject the null if the chi-squared statistic is less than the critical value (p-value greater than the significance level)
effect size
quantifies the strength of an association
relationship between chi squared test and z test
a 2 by 2 chi square test is the mathematical equivalent of a two-tailed z test for two proportions (i.e., x2=z2)
both tests use the same underlying probability distribution approximation for comparing proportions across two groups
chi squared generalizes the z-test to tables with more than 2 rows or columns, meaning you can compare more than two groups across multiple categorical outcomes
chi squared test worked example
H0: income and happiness are independent, Ha: income and happiness are associated
above-average-income respondents are more than twice as likely to report being “very happy” as below-average-income respondents (44% vs 20%), suggesting a chi square test should give a very small p-value
calculate expected frequency: (300 individuals who are very happy x 500 above average income individuals/900 total respondents)=166.7
calculate the chi-squared value: x2=71.9, df (3-1)(2-1)=2, p<0.001
since x2=71.9 far exceeds the critical value 4.605 for df=2 at the 0.10 level (p<0.001), we reject the null hypothesis
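The expected-count step from this example can be reproduced directly (the full statistic needs the complete table, so only the values given above are used here):

```python
# Expected count for the (very happy, above-average-income) cell:
# 300 "very happy" respondents, 500 above-average-income respondents,
# 900 respondents total.
expected = 300 * 500 / 900
print(round(expected, 1))          # 166.7

# The computed statistic (71.9, df=2) far exceeds the table's
# critical value for df=2 at the 0.10 level.
chi2_stat = 71.9
critical_value = 4.605
print(chi2_stat > critical_value)  # True -> reject the null
```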
effect size
quantify the magnitude of a relationship or difference between two groups
provides a standardized measure of practical significance, independent of sample size
examples: (1) difference of proportions (2) odds ratio
odds ratio
a measure of effect size indicating the strength of association between two binary variables
define odds as the probability of success divided by the probability of failure
the odds will always be nonnegative
OR=1: no association, OR>1: greater likelihood of success, OR<1: greater likelihood of failure
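A minimal sketch of the odds ratio for a hypothetical 2x2 table (counts invented for illustration):

```python
#                success  failure
treatment = (40, 10)
control   = (25, 25)

def odds(successes, failures):
    """Odds = P(success) / P(failure) = successes / failures."""
    return successes / failures

# Odds ratio: odds of success in one group over odds in the other
odds_ratio = odds(*treatment) / odds(*control)
print(odds_ratio)   # 4.0: treated subjects have 4x the odds of success
```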
why we need regressions
conclusions drawn from group comparisons are really only applicable to the specific groups in our data (i.e., you can’t predict beyond what you observe)
a problem also exists when you want to draw conclusions about individuals who don’t fall neatly into the data set (e.g., someone with 10.5 years of schooling when the data was binned into categories)
regressions
modeling techniques used to analyze the relationship between a response variable (target) and one or more independent variables (predictors)
predicts continuous outcomes and estimates how changes in predictors affect the target
often represented by a line of best fit through data points
bivariate regression model (linear)
for an observation i: yi=β0+β1xi+ei
β0 is the intercept
β1 is the slope
ei is essentially the associated error; accounts for unobserved factors, randomness, etc.
because the goal of a regression is to predict the average of y, each individual observation will deviate from the model
sometimes you get intercept values that wouldn’t make sense for the data frame (but that’s okay! it works for the most part)
using linear regression formula to draw predictions
plug an x value into the estimated equation, using the estimated intercept (β0) and slope (β1), to obtain the predicted value yi
residuals
for an observation i: residuali=yi-yi hat (observed minus predicted)
residuals measure how far the y value of a given point is from the regression line (in terms of y)
positive value: the regression line is an underestimation
negative value: the regression line is an overestimation
least squares
method used to find the line of best fit by minimizing the sum of squared residuals
you want to square the residuals since positive and negative residuals would otherwise “cancel” each other out
squaring also penalizes larger errors more heavily, since those would otherwise skew the fit
choose the line with the smallest squared residual value (∑ei2); same as choosing the line with the smallest vertical distance from the actual data points
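A minimal sketch of least squares on invented data: compute the closed-form slope and intercept, then check that nudging the line in any direction increases the sum of squared residuals:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sum of (xi - x_bar)(yi - y_bar) over sum of (xi - x_bar)^2
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar          # line passes through (x_bar, y_bar)

def sse(intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

best = sse(b0, b1)
print(b0, b1, best)              # intercept ~2.2, slope 0.6, SSE ~2.4
print(best < sse(b0 + 0.1, b1))  # True: any other line has larger SSE
print(best < sse(b0, b1 + 0.1))  # True
```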
sum of squared errors
sum of squared errors (SSE): ∑(yi-yi hat)2=∑ei2
y hat is the predicted value of the dependent variable
the sum of squared residuals calculated for every single observation in the sample
measures total variation in the sample relative to the predicted y values for a given x (regression line)
small SSE=smaller residuals=line fits the data pretty well
can also do sum of absolute or quartic errors, but it’s not optimal
total sum of squares
SST=∑(yi-y mean)2
a second method for measuring variation in the data
measures total variation within the data relative to the mean
measures how far each point is from the overall mean
is mathematically equivalent to the variance formula, only without dividing by degrees of freedom (n-1)
Gauss Markov theorem
states that, provided the errors are uncorrelated, have zero mean, and have equal variance, the ordinary least squares estimator is the best (lowest-variance) linear unbiased estimator
coefficient of determination
R2=1-SSE/SST
R2 is the proportion of variance in y explained by the regression (i.e., how closely the line of best fit matches the data points)
SSE=0: perfect line of best fit
SSE=SST: the line is no better than the average value→R2=0 and the predicted values do not track the actual values better than a simple horizontal line at the mean value
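A small sketch of R2 = 1 - SSE/SST on invented data (the line ŷ = 2.2 + 0.6x is the least-squares fit for these particular points):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_bar = sum(y) / len(y)
y_hat = [2.2 + 0.6 * xi for xi in x]   # fitted values from the regression

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual variation
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
r_squared = 1 - sse / sst
print(r_squared)   # ~0.6: the line explains about 60% of the variance in y
```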
interpreting the coefficient of determination
R2=0: the points scatter widely and the line is barely better than a horizontal line through the mean value
R2=0.5: there is a clear trend but the data points are still substantially scattered
R2=1: every point falls exactly on the line
interpreting y intercept of line of best fit
intercept simply represents the predicted value of the response variable (y) when the explanatory variable (x) is zero
often, the intercept has no real meaning; if x cannot realistically be zero (e.g., birth weight) or the data collected does not include x-values near zero, then the intercept is only a mathematical anchor
outliers and regression lines
outliers=points far from the regression line (i.e., points with large residual values)
a single outlier can strongly shift the regression line by changing the slope
always plot your data first via a scatterplot to eyeball potential outliers
if the outlier is due to a data error such as a typo or coding mistake, fix or remove it
if the outlier is a real but unusual data point, run the regression with and without it and report both
if there are no outliers, proceed as usual!
choosing whether to use a linear regression
linear regressions should only be used for scatterplots that seem to represent a linear relationship
a straight line used for curved, bent, or fanned out patterns will produce a low R2 value and provide a misleading slope value
before running a regression, you should always plot x vs y to determine if a straight line seems like a reasonable summary
the observations must also be independent, meaning each data point should not influence other data points
structured data cannot be used, including: repeated measurements on the same individual, clustered data, observations over time
to verify independence, check study design; to verify linearity, check the scatterplot
slope of a regression line
β1=r x sy/sx
sy=standard deviation of the y data
sx=standard deviation of the x data
r=correlation coefficient (quantifies correlation between x and y)
r controls the direction and strength of the slope
greater sy means steeper slope (data points are spread far from the average y value), greater sx means a flatter slope (independent variable is spread out over a wider range)
the slope value gives you average rate of change in y per one unit increase in x
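A quick numerical check, on invented data, that the slope equals r x sy/sx:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample standard deviations and covariance (dividing by n - 1)
sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (sx * sy)          # correlation coefficient

slope = r * sy / sx
print(slope)                 # matches the least-squares slope for this data, 0.6
```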
statistical testing for linear regressions
are tests of independence (between x and y) using slope or correlation
H0: slope=0
a slope of 0 means the two variables are statistically independent (i.e., they are linearly uncorrelated)
assumptions for linear regression hypothesis testing
statistical inference for linear regressions (e.g., calculating confidence intervals, p values, etc.) requires assumptions
linear relationship between x and y
independence of observations: the data points do not influence each other
the residual values for individual points follow a normal distribution
the residuals must also have a constant variance across all levels of the independent variable (i.e., the spread of y around the line is similar for all values of x)
why normality of residuals matters less with larger n
the slope formula can be rewritten as β1=∑ciyi
ci=(xi-x mean)/∑(xi-x mean)2
since the ci are fixed constants depending only on the xi values and the sample mean of x, the estimated slope is a weighted average of the y values
by the central limit theorem, the averages of independent observations have approximately normal sampling distributions when n is large
even if the distribution of individual y values is not normal, the sampling distribution of the estimated slope should be
plots of residuals vs fitted values
a check for constant variance
residuals: the vertical distances between observed data points and the fitted regression line
fitted values: predicted values of the response variable calculated by plugging the independent variable into the estimated regression equation
so long as the assumptions hold, a plot should look like a horizontal band of points centered around the zero line
funnel shape means non-constant variance
curved shape means non-linearity and you should not be using a linear regression
what do you do if the constant variance assumption does not hold
constant variance indicates the residuals (differences in y) have a consistent spread across all levels of the independent variables
violations are incredibly common in real data because the variability of an outcome often changes as the predictor variables increase or decrease
if constant variance holds, you can use the simplest versions of the SE, CI, and p-value formulas
if it does not, you can use robust standard errors instead
standard error of the slope
the standard error of the slope measures how much slope would vary across repeated samples
SE(β1)=s/√(∑(xi-x mean)2)
s=residual standard error (how spread out the residuals are)
∑(xi-x mean)2=total variation in x
with larger n, the denominator gets larger, and you get a smaller standard error
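A sketch of the slope standard-error formula on invented data (the line ŷ = 2.2 + 0.6x is the least-squares fit for these particular points):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n

# Residuals around the fitted line, then the residual standard error
# (divide SSE by df = n - 2)
residuals = [yi - (2.2 + 0.6 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Total variation in x
sxx = sum((xi - x_bar) ** 2 for xi in x)

se_slope = s / math.sqrt(sxx)
print(se_slope)   # ~0.2828
```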
degrees of freedom for linear regressions
df=n-2
represents the number of independent data points used to estimate parameters
we estimated two parameters, each of which “uses up” one degree of freedom: slope and y-intercept
confidence intervals for population slope
confidence interval formula: predicted β1 ± multiplier*SE(β1)
we are X% confident that each additional unit of x is associated with between [lower bound] and [upper bound] more units of y in the population
if the generated confidence interval does not contain 0, we can reject the null hypothesis that the slope=0 at the corresponding significance level
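A sketch of the confidence-interval formula with hypothetical numbers: an estimated slope of 0.6 with SE 0.2828 from a sample of n = 5, so df = 3; 3.182 is the t-table multiplier for 95% confidence at df = 3:

```python
b1 = 0.6              # hypothetical estimated slope
se_b1 = 0.2828        # hypothetical standard error of the slope
t_multiplier = 3.182  # t table value for 95% confidence, df = 3

lower = b1 - t_multiplier * se_b1
upper = b1 + t_multiplier * se_b1
print(round(lower, 3), round(upper, 3))

# The interval contains 0, so with this tiny sample we would fail to
# reject H0: slope = 0 at the 5% level.
print(lower < 0 < upper)   # True
```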
hypothesis testing for the slope of a linear regression line
hypotheses test the slope, as this quantifies the relationship between the two variables
null: H0: β1=0 (no linear association in the population)
alternative: Ha: β1≠0 (there is a linear association)
t=(predicted β1-0)/SE(predicted β1)
the test statistic quantifies how many standard errors away the estimated slope is from the null value 0
use a t table to determine the p value
interpreting results of hypothesis testing for linear regressions
if we reject the null, all we can say is that the observed association is unlikely to be due to chance alone
rejecting the null cannot be used to prove causality
a significant slope also does not mean a given independent variable is the only, or even most important, predictor for values of a given y
summary of regression assumptions
linearity: assumes the relationship between two values is roughly linear (check using a scatterplot)
independence: assumes the observations do not influence each other (check study design)
normality: assumes residuals are approximately normal, or that you have a sufficiently large sample size (check using a histogram or Q-Q plot)
constant variance: assumes the spread of y around the line is similar for all x (check using a residual vs fitted values plot)
linearity and independence ensure the regression line can actually be used as a reasonable summary of the data
normality and constant variance are needed to validate the confidence intervals and p-values
dummy variables
dummy variables convert categorical data into numerical 0 and 1 values for regression analysis
the group coded 0 is the reference or control category
the group coded 1 possesses the characteristic of interest
gives the y intercept of a regression line a bit more meaning
intercepts for regressions using dummy variables
with a single dummy predictor:
β0 becomes the average value for the reference group (0)
β0+β1 becomes the average value for the group of interest (1)
β1 becomes the difference in group means
the regression with one dummy is just a comparison of two group averages
depending on which group you set as the reference, you might get a positive or negative slope
regression coding example
take an experiment in which you randomly assign subjects to control vs treatment groups
treatment=1, control=0
run the regression: y=β0+β1*treatment
β0=average outcome in the control group
β0+β1=average outcome in the treatment group
β1=treatment effect (difference caused by intervention)
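This interpretation can be verified numerically with invented outcome data: fitting the 0/1 dummy with the usual least-squares slope formula recovers the control-group mean and the difference in group means:

```python
treatment = [0, 0, 0, 1, 1, 1]   # 0 = control, 1 = treatment
outcome   = [4, 6, 5, 7, 9, 8]   # invented outcomes
n = len(outcome)

x_bar = sum(treatment) / n
y_bar = sum(outcome) / n

# Standard least-squares slope and intercept
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(treatment, outcome))
      / sum((xi - x_bar) ** 2 for xi in treatment))
b0 = y_bar - b1 * x_bar

# Control mean is 5 ([4, 6, 5]) and treatment mean is 8 ([7, 9, 8]):
# b0 = control mean, b1 = treatment mean - control mean
print(b0, b1)
```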
kidney stone categorical regression example
open surgery=1, closed procedure=0
β0=0.83=success rate for closed procedure
β0+β1=0.83-0.05=0.78=open surgery success rate
β1=-0.05=open surgery does 5 percentage points worse than closed procedure
what if you use 1 and 2 for dummy coding?
consider 1=man, 2=woman
plugging x=1 into regression: β0+β1=mean for men
plugging x=2 into regression: β0+2β1=mean for women
the intercept β0 is the predicted value when x=0, but now you no longer have a group that has been coded 0
not necessarily wrong to code using 1 and 2, it just makes things far more inconvenient
multivariate dummy coding for regressions
if the variable has q categories (e.g., degrees has q=5 levels), create q-1 dummy variables (d)
one category is omitted as the reference group
have 0=miss, 1=hit for each of the different dummy variables
example dummy variables: d<HS, dassoc, dBA, dgrad
for each person, assign 0 or 1 depending on what degree they have
person 4 with PhD: d<HS=0, dassoc=0, dBA=0, dgrad=1
the reference group (a person with a HS diploma, the omitted category) has all dummy variables set to 0
each dummy coefficient β=(group mean-reference mean)
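A sketch of q-1 dummy coding for a categorical variable with q = 3 levels (group labels and outcome values invented). Because a regression on dummies alone fits each group's mean exactly, each dummy coefficient can be read off as that group's mean minus the reference mean:

```python
outcomes = {
    "HS":   [30, 34, 32],   # reference group (all dummies = 0)
    "BA":   [40, 44, 42],   # d_BA = 1
    "grad": [50, 56, 53],   # d_grad = 1
}
means = {g: sum(v) / len(v) for g, v in outcomes.items()}

b0 = means["HS"]                     # intercept = reference-group mean
b_BA = means["BA"] - means["HS"]     # dummy coefficient for BA
b_grad = means["grad"] - means["HS"] # dummy coefficient for grad
print(b0, b_BA, b_grad)
```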
what happens if you change the reference category?
the only thing that changes is the coefficient values (each is now measured relative to the new reference)
R2 and the model fit remain the same
choose the reference that makes comparisons most meaningful
why use multivariate regressions
most of the time, there are other variables associated with the original variable of interest (e.g., race, socioeconomic status, geographic location are correlated with educational attainment)
by including these variables in the regression, you can isolate each one’s association with your y variable (e.g., income)
without controls, β1 captures the association of education and everything correlated with it
with controls, β1 captures association between education and income, holding the other variables constant→isolating single variable of interest
multiple regression model
y=β0+β1x1+β2x2
β1=estimated change in y for a one unit increase in x1 when x2 is held constant
β2=estimated change in y for a one unit increase in x2, holding x1 constant
each coefficient is a partial (controlled) association
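A toy two-predictor regression on invented data. The predictors here are deliberately uncorrelated, so each coefficient can be computed with the simple cov/var formula; with correlated predictors you would solve the full normal equations instead:

```python
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]           # uncorrelated with x1 in this design
y  = [2, 7, 5, 10]          # exactly y = 2 + 3*x1 + 5*x2
n = len(y)

def mean(v):
    return sum(v) / n

def cov(a, b):
    """Covariance of two equal-length lists (dividing by n)."""
    return sum((ai - mean(a)) * (bi - mean(b)) for ai, bi in zip(a, b)) / n

b1 = cov(x1, y) / cov(x1, x1)   # change in y per unit x1, x2 held constant
b2 = cov(x2, y) / cov(x2, x2)   # change in y per unit x2, x1 held constant
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
print(b0, b1, b2)               # recovers 2, 3, 5
```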