SOC 1100 Midterm #3

Last updated 6:14 PM on 4/9/26
47 Terms

1
New cards

conditional distributions

  • describe the probability distribution of one random variable given that another random variable takes a specific value

  • notation: P(category 1 | category 2) = “what is the probability that someone falls into category 1 given that they are a member of category 2?”

  • since there are several categories included in a conditional distribution table, we want a single test that will capture the full pattern (e.g., are gender and education associated overall?) at once, rather than several tests

2
New cards

difference column

  • a column included to specify the difference between two conditional distributions (e.g., percentage of men with a bachelor’s degree vs percentage of women with a bachelor’s degree)

  • difference = difference between two proportions: p1−p2 = 14.5%−12.3% = 2.2 pp
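As a quick sketch of the arithmetic (the counts below are invented so the proportions reproduce the 14.5% vs 12.3% example):

```python
# Difference between two conditional proportions (hypothetical counts)
men_with_ba, men_total = 145, 1000      # invented
women_with_ba, women_total = 123, 1000  # invented

p1 = men_with_ba / men_total
p2 = women_with_ba / women_total
diff_pp = (p1 - p2) * 100  # difference in percentage points

print(f"{p1:.1%} - {p2:.1%} = {diff_pp:.1f} pp")
```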

3
New cards

why is a single statistical test that captures full picture better than running several tests simultaneously?

  • each test has a certain level of uncertainty associated with it

  • don’t want to compound that error over a bunch of tests

4
New cards

multiple testing problem

  • researchers often test hundreds or thousands of comparisons at once (e.g., geneticists scan millions of DNA variants for association with a disease)

  • with enough groups, you are almost guaranteed to find “significant” results that are really just noise (false positives)

  • a reason why many published findings have not been replicated by other researchers (replication crisis)

5
New cards

chi squared test

  • a test used to analyze categorical data to determine if observed frequencies differ significantly from expected frequencies

  • provides one statistic, one p-value for the entire conditional distribution table

  • data must be in raw frequencies of categorical items (not percentages), samples must be randomly selected, and the expected count in each cell should be at least 5

  • under the null (H0), we can compute the expected cell counts

  • in calculations, the difference between expected and observed values is squared to prevent positive and negative differences from canceling out

  • also divide by the expected frequency to normalize the data (i.e., scale each squared difference relative to the magnitude of its expected count so large cells do not disproportionately skew the statistic)

6
New cards

chi-squared test steps

  1. state the hypotheses: H0: the two categories are independent, Ha: the two categories are associated

  2. calculate the expected frequencies under the null hypothesis = (row total × column total)/grand total

  3. calculate the chi-squared statistic (χ2)

  4. find the critical value: (1) calculate degrees of freedom (2) find the critical value in a table

  5. degrees of freedom=(number of rows-1)(number of columns-1)

  6. conclude: if the chi squared statistic is greater than the critical value, reject the null
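The steps above can be sketched in plain Python on a hypothetical 2×2 table (all counts invented; 3.841 is the standard chi-squared table value for df = 1, α = 0.05):

```python
# Chi-squared test of independence on a hypothetical 2x2 table
observed = [[30, 20],   # group A: success, failure
            [20, 30]]   # group B: success, failure

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 2: expected count = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 3: chi-squared statistic = sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Step 5: degrees of freedom = (rows - 1)(cols - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 6: compare to the critical value from a chi-squared table
critical_005 = 3.841  # df = 1, alpha = 0.05
print(chi2, df, chi2 > critical_005)
```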

7
New cards

chi square conclusions

  • all the chi square can tell you is whether or not there is a statistically significant association between the two categories

  • reject the null if the chi-squared statistic is greater than the critical value (equivalently, if the p-value is less than the significance level)

  • fail to reject the null if the chi-squared statistic is less than the critical value (p-value greater than the significance level)

8
New cards

effect size

  • quantifies the strength of an association

9
New cards

relationship between chi squared test and z test

  • a 2 by 2 chi-squared test is the mathematical equivalent of a two-tailed z test for two proportions (i.e., χ2 = z2)

  • both tests use the same underlying probability distribution approximation for comparing proportions across two groups

  • the chi-squared test generalizes the z test to tables with more than 2 rows or columns, meaning you can compare more than two groups across multiple categorical outcomes

10
New cards

chi squared test worked example

  1. H0: income and happiness are independent, Ha: income and happiness are associated

  2. Above income respondents are more than twice as likely to report being “very happy” as compared to below income respondents (44% vs 20%), suggesting a chi square test should give a very small p-value

  3. calculate the expected frequency: (300 very happy respondents × 500 above-average-income respondents)/900 total respondents ≈ 166.7

  4. calculate the chi-squared statistic: χ2 = 71.9, df = (3−1)(2−1) = 2, p < 0.001

  5. since χ2 = 71.9 far exceeds the critical value of 4.605 (df = 2, α = 0.10), with p < 0.001, we reject the null hypothesis
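The arithmetic in steps 3 and 5 can be checked directly, using only the numbers given in the example (4.605 is the table critical value for df = 2, α = 0.10):

```python
# Step 3: expected count for the (very happy, above income) cell
very_happy, above_income, total_n = 300, 500, 900
expected = very_happy * above_income / total_n
print(round(expected, 1))  # 166.7

# Step 5: compare the reported statistic to the critical value
chi2_stat = 71.9       # value given in the example
critical_010 = 4.605   # chi-squared table value, df = 2, alpha = 0.10
print(chi2_stat > critical_010)  # True -> reject the null
```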

11
New cards

effect size

  • quantify the magnitude of a relationship or difference between two groups

  • provides a standardized measure of practical significance, independent of sample size

  • examples: (1) difference of proportions (2) odds ratio

12
New cards

odds ratio

  • a measure of effect size indicating the strength of association between two binary variables

  • define odds as the probability of success divided by the probability of failure

  • the odds will always be nonnegative

  • OR=1: no association, OR>1: greater likelihood of successes, OR<1: greater likelihood of failures
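A minimal sketch with invented counts, showing how odds and the odds ratio are computed for two groups:

```python
# Odds ratio for a hypothetical 2x2 table (all counts invented)
#                  successes, failures
group1 = (40, 10)
group2 = (25, 25)

odds1 = group1[0] / group1[1]   # odds of success in group 1
odds2 = group2[0] / group2[1]   # odds of success in group 2
odds_ratio = odds1 / odds2

print(odds_ratio)  # 4.0 -> group 1 has four times the odds of success
```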

13
New cards

why we need regressions

  • conclusions drawn from group comparisons are really only applicable to the specific groups in our data (i.e., you can’t predict beyond what you observe)

  • also exists a problem when you want to make conclusions about individuals who don’t fall neatly into the data set (e.g., someone with 10.5 years of schooling when data was made to be categorical)

14
New cards

regressions

  • modeling techniques used to analyze the relationship between a response variable (target) and one or more independent variables (predictors)

  • predicts continous outcomes and estimates how changes in predictors affect the target

  • often represented by a line of best fit through data points

15
New cards

bivariate regression model (linear)

  • for an observation i: yi = β0 + β1xi + ei

  • β0 is the intercept

  • β1 is the slope

  • ei is essentially the associated error; accounts for unobserved factors, randomness, etc.

  • because the goal of a regression is to predict the average of y, each individual observation will deviate from the model

  • sometimes you get intercept values that wouldn’t make sense for the data frame (but that’s okay! it works for the most part)
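A sketch of fitting the model on a small invented data set, using the standard closed-form least-squares estimates:

```python
# Least-squares estimates for the bivariate model yi = b0 + b1*xi + ei
# (data invented for illustration)
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# slope: b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)

print(round(b0, 2), round(b1, 2))
```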

16
New cards

using linear regression formula to draw predictions

  • plug the estimated values of β0 and β1, along with a chosen x value, into the equation to obtain the predicted y

17
New cards

residuals

  • for an observation i: residuali = yi − yi hat (observed value minus predicted value)

  • residuals measure how far the y value of a given point is from the regression line (in terms of y)

  • positive value: the regression line is an underestimation

  • negative value: the regression line is an overestimation

18
New cards

least squares

  • method used to find the line of best fit by minimizing the sum of squared residuals

  • you want to square the residuals since positive and negative residuals both count and may “cancel” each other out

  • also square them to account for larger errors that would be able to skew data

  • choose the line with the smallest squared residual value (∑ei2); same as choosing the line with the smallest vertical distance from the actual data points
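One way to see the “least” in least squares: fit the line on invented data, then check that nudging the intercept or slope never lowers the sum of squared residuals:

```python
# Least squares minimizes the sum of squared residuals (invented data)
x = [1, 2, 3, 4]
y = [1.0, 2.5, 2.9, 4.2]

def sse(b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

best = sse(b0, b1)
# every perturbed line has SSE at least as large as the fitted line's
ok = all(sse(b0 + d0, b1 + d1) >= best
         for d0 in (-0.5, 0, 0.5) for d1 in (-0.5, 0, 0.5))
print(ok)
```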

19
New cards

sum of squared errors

  • sum of squared errors (SSE): ∑(yi-yi hat)2=∑ei2

  • y hat is the predicted value of the dependent variable

  • the sum of residual values calculated for every single observation in the sample

  • measures total variation in the sample relative to the predicted y values for a given x (regression line)

  • small SSE=smaller residuals=line fits the data pretty well

  • can also do sum of absolute or quartic errors, but it’s not optimal

20
New cards

total sum of squares

  • SST=∑(yi-y mean)2

  • a second method for measuring variation in the data

  • measures total variation within the data relative to the mean

  • measures how far each point is from the overall mean

  • is mathematically equivalent to the variance formula, only without dividing by degrees of freedom (n-1)

21
New cards

Gauss Markov theorem

  • states that the ordinary least squares estimator is the best linear unbiased estimator of the regression coefficients

  • provided the errors are uncorrelated, have zero mean, and have equal variance, least squares provides the highest efficiency among linear unbiased estimators

22
New cards

coefficient of determination

  • R2=1-SSE/SST

  • R2 is the proportion of variance in y explained by the regression (i.e., how closely the line of best fit matches the data points)

  • SSE=0: perfect line of best fit

  • SSE=SST: the line is no better than the average value→R2=0 and the predicted values do not track the actual values better than a simple horizontal line at the mean value
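A sketch computing R2 = 1 − SSE/SST on invented, nearly linear data (so R2 should come out close to 1):

```python
# R^2 = 1 - SSE/SST on an invented data set
x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.1, 3.8, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # variation around the line
sst = sum((yi - y_bar) ** 2 for yi in y)               # variation around the mean
r_squared = 1 - sse / sst

print(round(r_squared, 3))
```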

23
New cards

interpreting the coefficient of determination

  • R2=0: the points scatter widely and the line is barely better than a horizontal line through the mean value

  • R2=0.5: there is a clear trend but the data points are still substantially scattered

  • R2=1: every point falls exactly on the line

24
New cards

interpreting y intercept of line of best fit

  • intercept simply represents the predicted value of the response variable (y) when the explanatory variable (x) is zero

  • often, the intercept has no real meaning; if x cannot realistically be zero (e.g., birth weight) or the data collected does not include x-values near zero, then the intercept is only a mathematical anchor

25
New cards

outliers and regression lines

  • outliers=points far from the regression line (i.e., points with large residual values)

  • a single outlier can strongly shift the regression line by changing the slope

  • always plot your data first via a scatterplot to eyeball potential outliers

  • if the outlier is due to a data error such as a typo or coding mistake, fix or remove it

  • if the outlier is a real but unusual data point, run the regression with and without it and report both

  • if there are no outliers, proceed as usual!

26
New cards

choosing whether to use a linear regression

  • linear regressions should only be used for scatterplots that seem to represent a linear relationship

  • a straight line used for curved, bent, or fanned out patterns will produce a low R2 value and provide a misleading slope value

  • before running a regression, you should always plot x vs y to determine if a straight line seems like a reasonable summary

  • the observations must also be independent, meaning each data point should not influence other data points

  • structured data cannot be used, including: repeated measurements on same individual, clustered data, observations over time

  • to verify independence, check study design; to verify linearity, check the scatterplot

27
New cards

slope of a regression line

  • β1 = r × (sy/sx)

  • sy=standard deviation of the y data

  • sx=standard deviation of the x data

  • r=correlation coefficient (quantifies correlation between x and y)

  • r controls the direction and strength of the slope

  • greater sy means steeper slope (data points are spread far from the average y value), greater sx means a flatter slope (independent variable is spread out over a wider range)

  • the slope value gives you average rate of change in y per one unit increase in x
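A check on invented data that β1 = r × (sy/sx) matches the direct least-squares slope:

```python
import math

# correlation-based slope vs direct least-squares slope (invented data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
sx = math.sqrt(sxx / (n - 1))    # sample standard deviation of x
sy = math.sqrt(syy / (n - 1))    # sample standard deviation of y

b1_from_r = r * sy / sx
b1_direct = sxy / sxx            # least-squares slope

print(abs(b1_from_r - b1_direct) < 1e-12)  # the two formulas agree
```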

28
New cards

statistical testing for linear regressions

  • are tests of independence (between x and y) using slope or correlation

  • H0: slope=0

  • a slope of 0 means the two variables are statistically independent (i.e., they are linearly uncorrelated)

29
New cards

assumptions for linear regression hypothesis testing

  • statistical inference for linear regressions (e.g., calculating confidence intervals, p values, etc.) requires assumptions

  • linear relationship between x and y

  • independence of observations: x and y do not influence each other

  • the residual values for individual points follow a normal distribution

  • the residuals must also have a constant variance across all levels of the independent variable (i.e., the spread of y around the line is similar for all values of x)

30
New cards

why normality of residuals matters less with larger n

  • the slope formula can be rewritten as β1=∑ciyi

  • ci=(xi-x mean)/∑(xi-x mean)2

  • since the ci are fixed constants depending only on the xi values and their sample mean, the estimated slope is a weighted average of the y values

  • by the central limit theorem, the averages of independent observations have approximately normal sampling distributions when n is large

  • even if the distribution of individual y values is not normal, the sampling distribution of the estimated slope should be approximately normal

31
New cards

plots of residuals vs fitted values

  • a check for constant variance

  • residuals: the vertical distances between observed data points and the fitted regression line

  • fitted values: predicted values of the response variable calculated by plugging the independent variable into the estimated regression equation

  • so long as the assumptions hold, a plot should look like a horizontal band of points centered around the zero line

  • funnel shape means non-constant variance

  • curved shape means non-linearity and you should not be using a linear regression

32
New cards

what do you do if the constant variance assumption does not hold

  • constant variance indicates the residuals (differences in y) have a consistent spread across all levels of the independent variables

  • violations are incredibly common in real data because the variability of an outcome often changes as the predictor variables increase or decrease

  • if you proceed as though constant variance holds, you can use the simplest versions of SE, CI, and p value formulas

  • if the assumption does not hold, you can instead use robust standard errors when computing inference

33
New cards

standard error of the slope

  • the standard error of the slope measures how much slope would vary across repeated samples

  • SE(β1) = s/√(∑(xi − x mean)2)

  • s=residual standard error (how spread out the residuals are)

  • ∑(xi-x mean)2=total variation in x

  • with larger n, the denominator gets larger, and you get a smaller standard error
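A sketch of the formula on invented data (s is estimated from the residuals with df = n − 2):

```python
import math

# SE(b1) = s / sqrt(sum((xi - x_bar)^2)) on an invented data set
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.3, 2.8, 4.2, 4.9, 6.3]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # residual standard error, df = n - 2
se_b1 = s / math.sqrt(sxx)     # standard error of the slope

print(se_b1)
```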

34
New cards

degrees of freedom for linear regressions

  • df=n-2

  • represents the number of independent data points used to estimate parameters

  • we estimated two parameters, each of which “uses up” one degree of freedom: slope and y-intercept

35
New cards

confidence intervals for population slope

  • confidence interval formula: predicted β1 ± multiplier*SE(β1)

  • we are X% confident that each additional unit of x is associated with between [lower bound] and [upper bound] additional units of y in the population

  • if the generated confidence interval does not contain 0, we can reject the null hypothesis (H0: slope = 0) at the corresponding significance level

36
New cards

hypothesis testing for the slope of a linear regression line

  • hypotheses test the slope, as this quantifies the relationship between the two variables

  • null: H0: β1=0 (no linear association in the population)

  • alternative: Ha: β1≠0 (there is a linear association)

  • t=(predicted β1-0)/SE(predicted β1)

  • the test statistic quantifies how many standard errors away the estimated slope is from the null value 0

  • use a t table to determine the p value
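Putting the formulas together on invented data (2.776 is the standard two-tailed t-table value for df = 4, α = 0.05):

```python
import math

# t = (b1 - 0) / SE(b1) for the slope test (invented data)
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.3, 2.8, 4.2, 4.9, 6.3]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))      # residual standard error
se_b1 = s / math.sqrt(sxx)        # standard error of the slope

t = (b1 - 0) / se_b1              # standard errors away from the null value 0
t_crit = 2.776                    # two-tailed t-table value, df = n - 2 = 4, alpha = 0.05
print(round(t, 2), abs(t) > t_crit)
```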

37
New cards

interpreting results of hypothesis testing for linear regressions

  • if we reject the null, all we can say is that the positive association is unlikely to be due to chance alone

  • rejecting the null cannot be used to prove causality

  • a significant slope also does not mean a given independent variable is the only, or even most important, predictor for values of a given y

38
New cards

summary of regression assumptions

  • linearity: assumes the relationship between two values is roughly linear (check using a scatterplot)

  • independence: assumes the observations do not influence each other (check study design)

  • normality: assumes residuals are approximately normal, or that you have a sufficiently large sample size (check using a histogram or Q-Q plot)

  • constant variance: assumes the spread of y around the line is similar for all x (check using a residual vs fitted values plot)

  • linearity and independence ensure the regression line can actually be used as a reasonable summary of the data

  • normality and constant variance are needed to validate the confidence intervals and p-values

39
New cards

dummy variables

  • dummy variables convert categorical data into numerical 0 and 1 values for regression analysis

  • the group coded 0 is the reference or control category

  • the group coded 1 possesses the characteristic of interest

  • gives the y intercept of a regression line a bit more meaning

40
New cards

intercepts for regressions using dummy variables

  • with a single dummy predictor:

    • β0 becomes the average value for the reference group (0)

    • β0+β1 becomes the average value for the group of interest (1)

    • β1 is the difference in group means

    • the regression with one dummy is just a comparison of two group averages

    • depending on which group you set as the reference, might get positive or negative slopes
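A check with invented outcome values that a regression on a single 0/1 dummy recovers the two group means:

```python
# With a single 0/1 dummy, the regression reproduces the group means
control = [10, 12, 14]       # dummy = 0 (reference group), invented values
treatment = [15, 17, 19]     # dummy = 1, invented values

x = [0] * len(control) + [1] * len(treatment)
y = control + treatment

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# b0 = control mean, b0 + b1 = treatment mean, b1 = difference in means
print(b0, b0 + b1, b1)
```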

41
New cards

regression coding example

  • take an experiment in which you randomly assign subjects to control vs treatment groups

  • treatment=1, control=0

  • run the regression: y = β0 + β1*treatment

  • β0=average outcome in the control group

  • β0+β1=average outcome in the treatment group

  • β1=treatment effect (difference caused by intervention)

42
New cards

kidney stone categorical regression example

  • open surgery=1, closed procedure=0

  • β0=0.83=success rate for closed procedure

  • β0+β1=0.83−0.05=0.78=open surgery success rate

  • β1=−0.05=open surgery does 5 percentage points worse than the closed procedure

43
New cards

what if you use 1 and 2 for dummy coding?

  • consider 1=man, 2=woman

  • plugging x=1 into regression: β0+β1=mean for men

  • plugging x=2 into regression: β0+2β1=mean for women

  • the intercept β0 is the predicted value when x=0, but now you no longer have a group that has been coded 0

  • not necessarily wrong to code using 1 and 2, it just makes things far more inconvenient

44
New cards

multivariate dummy coding for regressions

  • if the variable has q categories (e.g., degree has q=5 levels), create q−1 dummy variables (d)

  • one category is omitted as the reference group

  • each dummy variable is coded 1 if the person falls in that category and 0 otherwise

  • example dummy variables: d<HS, dassoc, dBA, dgrad

  • for each person, assign 0 or 1 depending on what degree they have

  • person 4 with PhD: d<HS=0, dassoc=0, dBA=0, dgrad=1

  • the reference group (here, a person with a HS diploma, the omitted category) has all dummy variables set to 0

  • each dummy coefficient β=(group mean-reference mean)

45
New cards

what happens if you change the reference category?

  • only thing that changes is the coefficient values

  • R2 and the model fit remain the same

  • choose the reference that makes comparisons most meaningful

46
New cards

why use multivariate regressions

  • most of the time, there are other variables associated with the original variable of interest (e.g., race, socioeconomic status, geographic location are correlated with educational attainment)

  • by including these variables in the regression, you can isolate each one’s association with your y variable (e.g., income)

  • without controls, β1 captures the association of education and everything correlated with it

  • with controls, β1 captures association between education and income, holding the other variables constant→isolating single variable of interest

47
New cards

multiple regression model

  • y = β0 + β1X1 + β2X2

  • β1=estimated change in y for a one unit increase in x1 when x2 is held constant

  • β2=estimated change in y for a one unit increase in x2, holding x1 constant

  • each coefficient is a partial (controlled) association
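A sketch of fitting a two-predictor model by least squares; NumPy is an assumed dependency, and the data are simulated from known coefficients (β0 = 1.0, β1 = 2.0, β2 = −0.5) so the recovered estimates can be checked:

```python
import numpy as np

# Simulate data from y = 1.0 + 2.0*x1 - 0.5*x2 + noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones(n), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# b1 is the partial association of x1 with y, holding x2 constant
print(round(b0, 2), round(b1, 2), round(b2, 2))
```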