conditional distributions
describe the probability distribution of one random variable given that another random variable takes a specific value
notation: P(category 1 | category 2) = the probability that someone falls into category 1 given that they are a member of category 2
since there are several categories included in a conditional distribution table, we want a single test that will capture the full pattern (e.g., are gender and education associated overall?) at once, rather than several tests
difference column
a column included to specify the difference between two conditional distributions (e.g., percentage of men with a bachelor’s degree vs percentage of women with a bachelor’s degree)
difference=difference between two proportions: p1-p2=14.5%-12.3%=2.2pp
why is a single statistical test that captures full picture better than running several tests simultaneously?
each test has a certain level of uncertainty associated with it
don’t want to compound that error over a bunch of tests
multiple testing problem
researchers often test hundreds or thousands of comparisons at once (e.g., geneticists scan millions of DNA variants for association with a disease)
with enough groups, you are almost guaranteed to find “significant” results that are really just noise (false positives)
a reason why many published findings have not been replicated by other researchers (replication crisis)
chi squared test
a test used to analyze categorical data to determine if observed frequencies differ significantly from expected frequencies
provides one statistic, one p-value for the entire conditional distribution table
data must be raw frequencies of categorical items (not percentages), samples must be randomly selected, and the expected frequency in each cell should be large enough (at least 5)
under the null (H0), we can compute the expected cell counts
in calculations, the difference between expected and observed values is squared to prevent positive and negative differences from canceling out
also divide by the expected frequency to normalize the data (i.e., scale each squared difference relative to the magnitude of the expected count so that large cells don’t disproportionately skew the statistic)
chi-squared test steps
state the hypotheses: H0: the two categories are independent, Ha: the two categories are associated
calculate the expected frequencies under the null hypothesis=(row total x column total)/grand total
calculate the chi-square statistic (x2)
find the critical value: (1) calculate degrees of freedom (2) find the critical value in a table
degrees of freedom=(number of rows-1)(number of columns-1)
conclude: if the chi squared statistic is greater than the critical value, reject the null
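The steps above can be sketched in a few lines of Python on a hypothetical 2x2 table (all counts invented for illustration; 3.841 is the chi-squared table value for df=1, alpha=0.05):

```python
# Hypothetical 2x2 table of raw frequencies (invented counts)
observed = [[30, 20],    # group A: outcome yes / outcome no
            [10, 40]]    # group B: outcome yes / outcome no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count per cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic = sum over cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom = (rows - 1)(columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Compare to the table's critical value for df=1, alpha=0.05
critical_value = 3.841
print(chi2, df, chi2 > critical_value)   # reject H0 if True
```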
chi square conclusions
all the chi square can tell you is whether or not there is a statistically significant association between the two categories
reject the null if the chi-squared statistic is greater than the critical value (equivalently, if the p-value is less than the significance level)
fail to reject the null if the chi-squared statistic is less than the critical value (p-value greater than the significance level)
effect size
quantifies the strength of an association
relationship between chi squared test and z test
a 2 by 2 chi square test is the mathematical equivalent of a two-tailed z test for two proportions (i.e., x2=z2)
both tests use the same underlying probability distribution approximation for comparing proportions across two groups
chi squared generalizes the z-test to tables with more than 2 rows or columns, meaning you can compare more than two groups across multiple categorical outcomes
chi squared test worked example
H0: income and happiness are independent, Ha: income and happiness are associated
above-average-income respondents are more than twice as likely to report being “very happy” as below-average-income respondents (44% vs 20%), suggesting a chi square test should give a very small p-value
calculate expected frequency: (300 individuals who are very happy x 500 above average income individuals/900 total respondents)=166.7
calculate the chi-squared value: x2=71.9, df (3-1)(2-1)=2, p<0.001
since x2=71.9 far exceeds the critical value 4.605 for df=2 at the 0.10 level (p<0.001), we reject the null hypothesis
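The expected-count step from this example can be reproduced directly (the full statistic needs the complete table, so only the values given above are used here):

```python
# Expected count for the (very happy, above-average-income) cell:
# 300 "very happy" respondents, 500 above-average-income respondents,
# 900 respondents total.
expected = 300 * 500 / 900
print(round(expected, 1))          # 166.7

# The computed statistic (71.9, df=2) far exceeds the table's
# critical value for df=2 at the 0.10 level.
chi2_stat = 71.9
critical_value = 4.605
print(chi2_stat > critical_value)  # True -> reject the null
```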
effect size
quantify the magnitude of a relationship or difference between two groups
provides a standardized measure of practical significance, independent of sample size
examples: (1) difference of proportions (2) odds ratio
odds ratio
a measure of effect size indicating the strength of association between two binary variables
define odds as the probability of success divided by the probability of failure
the odds will always be nonnegative
OR=1: no association, OR>1: greater likelihood of success, OR<1: greater likelihood of failure
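A minimal sketch of the odds ratio for a hypothetical 2x2 table (counts invented for illustration):

```python
#                success  failure
treatment = (40, 10)
control   = (25, 25)

def odds(successes, failures):
    """Odds = P(success) / P(failure) = successes / failures."""
    return successes / failures

# Odds ratio: odds of success in one group over odds in the other
odds_ratio = odds(*treatment) / odds(*control)
print(odds_ratio)   # 4.0: treated subjects have 4x the odds of success
```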
why we need regressions
conclusions drawn from group comparisons are really only applicable to the specific groups in our data (i.e., you can’t predict beyond what you observe)
a problem also exists when you want to draw conclusions about individuals who don’t fall neatly into the data set (e.g., someone with 10.5 years of schooling when the data was binned into categories)
regressions
modeling techniques used to analyze the relationship between a response variable (target) and one or more independent variables (predictors)
predicts continuous outcomes and estimates how changes in predictors affect the target
often represented by a line of best fit through data points
bivariate regression model (linear)
for an observation i: yi=β0+β1xi+ei
β0 is the intercept
β1 is the slope
ei is essentially the associated error; accounts for unobserved factors, randomness, etc.
because the goal of a regression is to predict the average of y, each individual observation will deviate from the model
sometimes you get intercept values that wouldn’t make sense for the data frame (but that’s okay! it works for the most part)
using linear regression formula to draw predictions
plug an x value into the estimated equation, using the estimated intercept (β0) and slope (β1), to obtain the predicted value yi
residuals
for an observation i: residuali=yi-yi hat (observed minus predicted)
residuals measure how far the y value of a given point is from the regression line (in terms of y)
positive value: the regression line is an underestimation
negative value: the regression line is an overestimation
least squares
method used to find the line of best fit by minimizing the sum of squared residuals
you want to square the residuals since positive and negative residuals would otherwise “cancel” each other out
squaring also penalizes larger errors more heavily, since those would otherwise skew the fit
choose the line with the smallest squared residual value (∑ei2); same as choosing the line with the smallest vertical distance from the actual data points
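A minimal sketch of least squares on invented data: compute the closed-form slope and intercept, then check that nudging the line in any direction increases the sum of squared residuals:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sum of (xi - x_bar)(yi - y_bar) over sum of (xi - x_bar)^2
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar          # line passes through (x_bar, y_bar)

def sse(intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

best = sse(b0, b1)
print(b0, b1, best)              # intercept ~2.2, slope 0.6, SSE ~2.4
print(best < sse(b0 + 0.1, b1))  # True: any other line has larger SSE
print(best < sse(b0, b1 + 0.1))  # True
```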
sum of squared errors
sum of squared errors (SSE): ∑(yi-yi hat)2=∑ei2
y hat is the predicted value of the dependent variable
the sum of squared residuals calculated for every single observation in the sample
measures total variation in the sample relative to the predicted y values for a given x (regression line)
small SSE=smaller residuals=line fits the data pretty well
can also do sum of absolute or quartic errors, but it’s not optimal
total sum of squares
SST=∑(yi-y mean)2
a second method for measuring variation in the data
measures total variation within the data relative to the mean
measures how far each point is from the overall mean
is mathematically equivalent to the variance formula, only without dividing by degrees of freedom (n-1)
Gauss Markov theorem
states that, provided the errors are uncorrelated, have zero mean, and have equal variance, the ordinary least squares estimator is the best (lowest-variance) linear unbiased estimator
coefficient of determination
R2=1-SSE/SST
R2 is the proportion of variance in y explained by the regression (i.e., how closely the line of best fit matches the data points)
SSE=0: perfect line of best fit
SSE=SST: the line is no better than the average value→R2=0 and the predicted values do not track the actual values better than a simple horizontal line at the mean value
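A small sketch of R2 = 1 - SSE/SST on invented data (the line ŷ = 2.2 + 0.6x is the least-squares fit for these particular points):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_bar = sum(y) / len(y)
y_hat = [2.2 + 0.6 * xi for xi in x]   # fitted values from the regression

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual variation
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
r_squared = 1 - sse / sst
print(r_squared)   # ~0.6: the line explains about 60% of the variance in y
```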
interpreting the coefficient of determination
R2=0: the points scatter widely and the line is barely better than a horizontal line through the mean value
R2=0.5: there is a clear trend but the data points are still substantially scattered
R2=1: every point falls exactly on the line
interpreting y intercept of line of best fit
intercept simply represents the predicted value of the response variable (y) when the explanatory variable (x) is zero
often, the intercept has no real meaning; if x cannot realistically be zero (e.g., birth weight) or the data collected does not include x-values near zero, then the intercept is only a mathematical anchor
outliers and regression lines
outliers=points far from the regression line (i.e., points with large residual values)
a single outlier can strongly shift the regression line by changing the slope
always plot your data first via a scatterplot to eyeball potential outliers
if the outlier is due to a data error such as a typo or coding mistake, fix or remove it
if the outlier is a real but unusual data point, run the regression with and without it and report both
if there are no outliers, proceed as usual!
choosing whether to use a linear regression
linear regressions should only be used for scatterplots that seem to represent a linear relationship
a straight line used for curved, bent, or fanned out patterns will produce a low R2 value and provide a misleading slope value
before running a regression, you should always plot x vs y to determine if a straight line seems like a reasonable summary
the observations must also be independent, meaning each data point should not influence other data points
structured data cannot be used, including: repeated measurements on the same individual, clustered data, observations over time
to verify independence, check study design; to verify linearity, check the scatterplot
slope of a regression line
β1=r x sy/sx
sy=standard deviation of the y data
sx=standard deviation of the x data
r=correlation coefficient (quantifies correlation between x and y)
r controls the direction and strength of the slope
greater sy means steeper slope (data points are spread far from the average y value), greater sx means a flatter slope (independent variable is spread out over a wider range)
the slope value gives you average rate of change in y per one unit increase in x
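A quick numerical check, on invented data, that the slope equals r x sy/sx:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample standard deviations and covariance (dividing by n - 1)
sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (sx * sy)          # correlation coefficient

slope = r * sy / sx
print(slope)                 # matches the least-squares slope for this data, 0.6
```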
statistical testing for linear regressions
are tests of independence (between x and y) using slope or correlation
H0: slope=0
a slope of 0 means the two variables are statistically independent (i.e., they are linearly uncorrelated)
assumptions for linear regression hypothesis testing
statistical inference for linear regressions (e.g., calculating confidence intervals, p values, etc.) requires assumptions
linear relationship between x and y
independence of observations: the data points do not influence each other
the residual values for individual points follow a normal distribution
the residuals must also have a constant variance across all levels of the independent variable (i.e., the spread of y around the line is similar for all values of x)
why normality of residuals matters less with larger n
the slope formula can be rewritten as β1=∑ciyi
ci=(xi-x mean)/∑(xi-x mean)2
since the ci are fixed constants depending only on the xi values and the sample mean of x, the estimated slope is a weighted average of the y values
by the central limit theorem, the averages of independent observations have approximately normal sampling distributions when n is large
even if the distribution of individual y values is not normal, the sampling distribution of the estimated slope should be
plots of residuals vs fitted values
a check for constant variance
residuals: the vertical distances between observed data points and the fitted regression line
fitted values: predicted values of the response variable calculated by plugging the independent variable into the estimated regression equation
so long as the assumptions hold, a plot should look like a horizontal band of points centered around the zero line
funnel shape means non-constant variance
curved shape means non-linearity and you should not be using a linear regression
what do you do if the constant variance assumption does not hold
constant variance indicates the residuals (differences in y) have a consistent spread across all levels of the independent variables
violations are incredibly common in real data because the variability of an outcome often changes as the predictor variables increase or decrease
if constant variance holds, you can use the simplest versions of the SE, CI, and p-value formulas
if it does not, you can use robust standard errors instead
standard error of the slope
the standard error of the slope measures how much slope would vary across repeated samples
SE(β1)=s/√(∑(xi-x mean)2)
s=residual standard error (how spread out the residuals are)
∑(xi-x mean)2=total variation in x
with larger n, the denominator gets larger, and you get a smaller standard error
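A sketch of the slope standard-error formula on invented data (the line ŷ = 2.2 + 0.6x is the least-squares fit for these particular points):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n

# Residuals around the fitted line, then the residual standard error
# (divide SSE by df = n - 2)
residuals = [yi - (2.2 + 0.6 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Total variation in x
sxx = sum((xi - x_bar) ** 2 for xi in x)

se_slope = s / math.sqrt(sxx)
print(se_slope)   # ~0.2828
```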
degrees of freedom for linear regressions
df=n-2
represents the number of independent data points used to estimate parameters
we estimated two parameters, each of which “uses up” one degree of freedom: slope and y-intercept
confidence intervals for population slope
confidence interval formula: predicted β1 ± multiplier*SE(β1)
we are X% confident that each additional unit of x is associated with between [lower bound] and [upper bound] more units of y in the population
if the generated confidence interval does not contain 0, we can reject the null hypothesis that the slope=0 at the corresponding significance level
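A sketch of the confidence-interval formula with hypothetical numbers: an estimated slope of 0.6 with SE 0.2828 from a sample of n = 5, so df = 3; 3.182 is the t-table multiplier for 95% confidence at df = 3:

```python
b1 = 0.6              # hypothetical estimated slope
se_b1 = 0.2828        # hypothetical standard error of the slope
t_multiplier = 3.182  # t table value for 95% confidence, df = 3

lower = b1 - t_multiplier * se_b1
upper = b1 + t_multiplier * se_b1
print(round(lower, 3), round(upper, 3))

# The interval contains 0, so with this tiny sample we would fail to
# reject H0: slope = 0 at the 5% level.
print(lower < 0 < upper)   # True
```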
hypothesis testing for the slope of a linear regression line
hypotheses test the slope, as this quantifies the relationship between the two variables
null: H0: β1=0 (no linear association in the population)
alternative: Ha: β1≠0 (there is a linear association)
t=(predicted β1-0)/SE(predicted β1)
the test statistic quantifies how many standard errors away the estimated slope is from the null value 0
use a t table to determine the p value
interpreting results of hypothesis testing for linear regressions
if we reject the null, all we can say is that the observed association is unlikely to be due to chance alone
rejecting the null cannot be used to prove causality
a significant slope also does not mean a given independent variable is the only, or even most important, predictor for values of a given y
summary of regression assumptions
linearity: assumes the relationship between two values is roughly linear (check using a scatterplot)
independence: assumes the observations do not influence each other (check study design)
normality: assumes residuals are approximately normal, or that you have a sufficiently large sample size (check using a histogram or Q-Q plot)
constant variance: assumes the spread of y around the line is similar for all x (check using a residual vs fitted values plot)
linearity and independence ensure the regression line can actually be used as a reasonable summary of the data
normality and constant variance are needed to validate the confidence intervals and p-values
dummy variables
dummy variables convert categorical data into numerical 0 and 1 values for regression analysis
the group coded 0 is the reference or control category
the group coded 1 possesses the characteristic of interest
gives the y intercept of a regression line a bit more meaning
intercepts for regressions using dummy variables
with a single dummy predictor:
β0 becomes the average value for the reference group (0)
β0+β1 becomes the average value for the group of interest (1)
β1 becomes the difference in group means
the regression with one dummy is just a comparison of two group averages
depending on which group you set as the reference, you might get a positive or negative slope
regression coding example
take an experiment in which you randomly assign subjects to control vs treatment groups
treatment=1, control=0
run the regression: y=β0+β1*treatment
β0=average outcome in the control group
β0+β1=average outcome in the treatment group
β1=treatment effect (difference caused by intervention)
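This interpretation can be verified numerically with invented outcome data: fitting the 0/1 dummy with the usual least-squares slope formula recovers the control-group mean and the difference in group means:

```python
treatment = [0, 0, 0, 1, 1, 1]   # 0 = control, 1 = treatment
outcome   = [4, 6, 5, 7, 9, 8]   # invented outcomes
n = len(outcome)

x_bar = sum(treatment) / n
y_bar = sum(outcome) / n

# Standard least-squares slope and intercept
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(treatment, outcome))
      / sum((xi - x_bar) ** 2 for xi in treatment))
b0 = y_bar - b1 * x_bar

# Control mean is 5 ([4, 6, 5]) and treatment mean is 8 ([7, 9, 8]):
# b0 = control mean, b1 = treatment mean - control mean
print(b0, b1)
```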
kidney stone categorical regression example
open surgery=1, closed procedure=0
β0=0.83=success rate for closed procedure
β0+β1=0.83-0.05=0.78=open surgery success rate
β1=-0.05=open surgery does 5 percentage points worse than closed procedure
what if you use 1 and 2 for dummy coding?
consider 1=man, 2=woman
plugging x=1 into regression: β0+β1=mean for men
plugging x=2 into regression: β0+2β1=mean for women
the intercept β0 is the predicted value when x=0, but now you no longer have a group that has been coded 0
not necessarily wrong to code using 1 and 2, it just makes things far more inconvenient
multivariate dummy coding for regressions
if the variable has q categories (e.g., degrees has q=5 levels), create q-1 dummy variables (d)
one category is omitted as the reference group
have 0=miss, 1=hit for each of the different dummy variables
example dummy variables: d<HS, dassoc, dBA, dgrad
for each person, assign 0 or 1 depending on what degree they have
person 4 with PhD: d<HS=0, dassoc=0, dBA=0, dgrad=1
the reference group (a person with a HS diploma, the omitted category) has all dummy variables set to 0
each dummy coefficient β=(group mean-reference mean)
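A sketch of q-1 dummy coding for a categorical variable with q = 3 levels (group labels and outcome values invented). Because a regression on dummies alone fits each group's mean exactly, each dummy coefficient can be read off as that group's mean minus the reference mean:

```python
outcomes = {
    "HS":   [30, 34, 32],   # reference group (all dummies = 0)
    "BA":   [40, 44, 42],   # d_BA = 1
    "grad": [50, 56, 53],   # d_grad = 1
}
means = {g: sum(v) / len(v) for g, v in outcomes.items()}

b0 = means["HS"]                     # intercept = reference-group mean
b_BA = means["BA"] - means["HS"]     # dummy coefficient for BA
b_grad = means["grad"] - means["HS"] # dummy coefficient for grad
print(b0, b_BA, b_grad)
```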
what happens if you change the reference category?
the only thing that changes is the coefficient values (each is now measured relative to the new reference)
R2 and the model fit remain the same
choose the reference that makes comparisons most meaningful
why use multivariate regressions
most of the time, there are other variables associated with the original variable of interest (e.g., race, socioeconomic status, geographic location are correlated with educational attainment)
by including these variables in the regression, you can isolate each one’s association with your y variable (e.g., income)
without controls, β1 captures the association of education and everything correlated with it
with controls, β1 captures association between education and income, holding the other variables constant→isolating single variable of interest
multiple regression model
y=β0+β1x1+β2x2
β1=estimated change in y for a one unit increase in x1 when x2 is held constant
β2=estimated change in y for a one unit increase in x2, holding x1 constant
each coefficient is a partial (controlled) association
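A toy two-predictor regression on invented data. The predictors here are deliberately uncorrelated, so each coefficient can be computed with the simple cov/var formula; with correlated predictors you would solve the full normal equations instead:

```python
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]           # uncorrelated with x1 in this design
y  = [2, 7, 5, 10]          # exactly y = 2 + 3*x1 + 5*x2
n = len(y)

def mean(v):
    return sum(v) / n

def cov(a, b):
    """Covariance of two equal-length lists (dividing by n)."""
    return sum((ai - mean(a)) * (bi - mean(b)) for ai, bi in zip(a, b)) / n

b1 = cov(x1, y) / cov(x1, x1)   # change in y per unit x1, x2 held constant
b2 = cov(x2, y) / cov(x2, x2)   # change in y per unit x2, x1 held constant
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
print(b0, b1, b2)               # recovers 2, 3, 5
```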