W2 L2 - The beast of bias

detecting outliers

graphs

  • scatterplots

  • histograms

standardized residuals

  • in an average sample, about 95% of standardized residuals should lie between ±2

  • about 99% of standardized residuals should lie between ±2.5

  • any case for which the absolute value of the standardized residual is 3 or more is likely to be an outlier
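A minimal sketch of the rule of thumb above, in plain Python. The data are made up for illustration (case 10 is deliberately pushed off the line), and `fit_line` is a hypothetical helper, not a library function. Residuals are standardized by dividing by the residual standard deviation.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Made-up data: a straight line plus small noise, with case 10 pushed far off the line.
xs = list(range(20))
ys = [2 + 0.5 * x + 0.3 * ((7 * x) % 5 - 2) for x in xs]
ys[10] += 8

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Standardize: divide each residual by the residual standard deviation,
# sqrt(SSE / (n - 2)) for a model with an intercept and one slope.
resid_sd = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
z = [e / resid_sd for e in residuals]

# Cases whose standardized residual is 3 or more in absolute value.
flagged = [i for i, zi in enumerate(z) if abs(zi) >= 3]
print(flagged)
```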

detecting influential cases

  • cases that actually affect the model estimates

  • there is a change in b when the case is removed

  • DFBeta - the difference between a coefficient's value with and without the case included

  • do not just remove influential cases - instead look for an explanation of why the case is different, and present results with and without it.
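A rough sketch of the DFBeta idea for the slope of a one-predictor model: refit with each case left out and record how much b changes. The data are invented (ten points exactly on y = x plus one discordant case at x = 30), and `slope` is a hypothetical helper.

```python
def slope(xs, ys):
    """Least-squares slope b for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Invented data: cases 0-9 lie on y = x, case 10 sits far out and off the line.
xs = list(range(1, 11)) + [30]
ys = list(range(1, 11)) + [5]

b_full = slope(xs, ys)

# DFBeta for each case: b with the case included minus b with it removed.
dfbeta = []
for i in range(len(xs)):
    xs_without = xs[:i] + xs[i + 1:]
    ys_without = ys[:i] + ys[i + 1:]
    dfbeta.append(b_full - slope(xs_without, ys_without))

most_influential = max(range(len(xs)), key=lambda i: abs(dfbeta[i]))
print(most_influential, round(dfbeta[most_influential], 3))
```

In line with the note above, the point of finding such a case is to investigate it and report the model with and without it, not to delete it automatically.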

Cook's distance

  • measures the influence of a single case on the model as a whole

  • values greater than 1 are cause for concern
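A sketch of Cook's distance for a one-predictor model, using the usual formula D = e² / (p × MSE) × h / (1 − h)², where e is the residual, h the leverage and p the number of parameters (2 here: intercept and slope). The data are invented and the variable names are illustrative only.

```python
# Invented data: cases 0-9 lie on y = x, case 10 sits far out and off the line.
xs = list(range(1, 11)) + [30]
ys = list(range(1, 11)) + [5]

n = len(xs)
p = 2  # parameters in the model: intercept and slope

mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage of each case in simple regression.
leverage = [1 / n + (x - mx) ** 2 / sxx for x in xs]

cooks = [
    (e ** 2 / (p * mse)) * h / (1 - h) ** 2
    for e, h in zip(residuals, leverage)
]

# Cases above the rule-of-thumb threshold of 1.
concerning = [i for i, d in enumerate(cooks) if d > 1]
print(concerning)
```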

robust estimation

  • lmRob function - reports the same information as the ordinary model, but the estimates are far less affected by influential cases

key assumptions of the general linear model

linearity and additivity

Spherical errors

  • the population model should have:

  • homoscedastic errors - inspect the model residuals

  • independent errors - inspect the model residuals

Normality of something-or-other

  • population model errors

  • sampling distribution

residuals - the difference between the observed value of the outcome and the value the model predicts

residuals vs. predicted values

banana-, diamond- or bow-shaped patterns in this plot point to non-linearity

homoscedastic and heteroscedastic

  • the vertical spread of the residuals should be about the same at the low end and the high end of the predicted values
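A crude numeric stand-in for eyeballing the plot, offered only as a sketch: sort the residuals by predicted value, split them in half, and compare the spread in each half. The data are made up so that the noise grows with x (heteroscedastic), and the split-in-half comparison is an illustrative shortcut, not a formal test.

```python
import statistics

# Made-up heteroscedastic data: the noise is proportional to x.
xs = list(range(1, 41))
ys = [2 + x + 0.1 * x * (-1) ** x for x in xs]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx

predicted = [a + b * x for x in xs]
residuals = [y - yhat for y, yhat in zip(ys, predicted)]

# Order residuals by predicted value, then compare spread in the two halves.
ordered = [e for _, e in sorted(zip(predicted, residuals))]
half = n // 2
low_spread = statistics.pstdev(ordered[:half])
high_spread = statistics.pstdev(ordered[half:])

# A ratio near 1 is what homoscedastic errors would give.
ratio = high_spread / low_spread
print(round(ratio, 2))
```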

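Levene's test

  • a significance test for unequal variances - really not useful

  • in large samples it is significant for trivial differences in variance (tells us to worry when we don't need to); in small samples it misses big differences (tells us not to worry when we should)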

normally distributed errors

estimation

  • for estimating b, normality of the errors doesn't really matter

  • when errors are not normally distributed, b is still unbiased and optimal among linear unbiased estimators, but other classes of estimator (e.g. robust ones) can be more accurate

confidence intervals and significance tests

- when residuals are normal

  • it can be shown that the bs have a normal sampling distribution

  • test statistics of bs (usually testing the null of b = 0) follow a t-distribution

- if they are not normal then

  • we can't base confidence intervals on the properties of the normal distribution

  • we don't know what distribution the test statistics have.
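For the normal-residuals case, the test statistic for b can be sketched as t = b / SE(b) with n − 2 degrees of freedom in a one-predictor model. The data below are invented, and the p-value is left out because the standard library has no t-distribution.

```python
import math

# Invented data: roughly y = 2x with small noise.
xs = list(range(1, 11))
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 19.9]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
df = n - 2                                  # degrees of freedom
mse = sum(e ** 2 for e in residuals) / df   # residual variance estimate

se_b = math.sqrt(mse / sxx)   # standard error of the slope
t = b / se_b                  # tests the null hypothesis b = 0
print(round(b, 3), round(se_b, 4), round(t, 1), df)
```

The t-distribution is only the right reference for this statistic when the sampling distribution of b is normal, which is exactly what is in doubt when the residuals are not.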

the central limit theorem (CLT)

  • bigger samples → the sampling distribution of b gets closer to normal, whatever the shape of the errors

  • smaller samples → the sampling distribution can look clearly non-normal

  • use the bootstrap in small samples to get an empirical confidence interval and standard error
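A small simulation sketch of the CLT point, using the sample mean as a simpler stand-in for b (an assumption made to keep the example short). Scores come from a strongly skewed (exponential) population; the skewness of the sampling distribution shrinks towards 0 as the sample size grows. The seed, sample sizes and number of repetitions are arbitrary choices.

```python
import random
import statistics

def skewness(values):
    """Population skewness: mean cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_means(n, reps=2000):
    """Means of `reps` samples of size n from a skewed (exponential) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

skew_small = skewness(sample_means(5))    # small samples: still clearly skewed
skew_large = skewness(sample_means(100))  # larger samples: much closer to normal
print(round(skew_small, 2), round(skew_large, 2))
```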

the K-S test

  • do not use

normal residuals (Q-Q plot)

  • points follow the diagonal line if the residuals are perfectly normally distributed

  • curvature, a snake shape, etc. - non-normality

correcting problems

the bootstrap

  • treat the sample as the population and draw many new samples (bootstrap samples) from it

  • scores are drawn at random with replacement: each score is put back after being picked, so it can be picked again, which makes each bootstrap sample different

  • the estimate is computed in every bootstrap sample, giving an empirical sampling distribution; a confidence interval comes from the spread of those estimates
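The resampling steps above can be sketched as a percentile bootstrap for the slope b. This is one common variant, assumed here for illustration; the data, the seed and the 2000 resamples are arbitrary, and `slope` is a hypothetical helper.

```python
import random
import statistics

def slope(xs, ys):
    """Least-squares slope b for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up sample: a line with slope 0.5 plus a repeating noise pattern.
xs = list(range(1, 31))
ys = [1 + 0.5 * x + 0.5 * ((7 * x) % 5 - 2) for x in xs]
b_hat = slope(xs, ys)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = len(xs)
boot_slopes = []
while len(boot_slopes) < 2000:
    # Draw n cases at random WITH replacement: a case can appear more than once.
    idx = [rng.randrange(n) for _ in range(n)]
    xs_b = [xs[i] for i in idx]
    if len(set(xs_b)) < 2:
        continue  # slope is undefined if every resampled x is identical
    ys_b = [ys[i] for i in idx]
    boot_slopes.append(slope(xs_b, ys_b))

# Percentile confidence interval: the middle 95% of the bootstrap slopes.
boot_slopes.sort()
lower = boot_slopes[int(0.025 * len(boot_slopes))]
upper = boot_slopes[int(0.975 * len(boot_slopes)) - 1]

# Bootstrap standard error: the standard deviation of the bootstrap slopes.
se = statistics.stdev(boot_slopes)
print(round(b_hat, 3), round(lower, 3), round(upper, 3), round(se, 4))
```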