Deviance/ Error
The distance of each score from the mean
Sum of squared errors (SS)
The total amount of error around the mean (the errors/deviances are squared before adding them up)
Variance
The average squared distance of scores from the mean (the sum of squares divided by the number of scores). Tells us how widely dispersed scores are around the mean.
Standard Deviation
The square root of the variance
Z-score
The sign tells us whether the original score was above or below the mean; the value tells us how far the score was from the mean in standard-deviation units. (z-score = (score - mean of all scores) / standard deviation of all scores)
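A minimal Python sketch (with made-up scores) tying the four definitions above together:

```python
import numpy as np

# Made-up example scores
scores = np.array([4.0, 6.0, 7.0, 9.0, 14.0])

mean = scores.mean()
deviances = scores - mean          # deviance/error of each score
ss = np.sum(deviances ** 2)        # sum of squared errors
variance = ss / len(scores)        # SS divided by the number of scores
sd = np.sqrt(variance)             # standard deviation
z_scores = (scores - mean) / sd    # z-scores

print(mean, ss, variance, sd, z_scores)
```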
Probability theory
Uses the language of sets
Sets
A collection of things/ elements
Universal Set
The set of all things that we could possibly consider in the context of what we are studying (S = {1,2,3,4,5,6} --> for a die)
Function
A rule that takes an input from a specific set, called the domain, and produces an output from another set, called the co-domain
Sample Space
The set of all possible outcomes
Range
The set containing all the possible values of f(x). Thus the range of a function is always a subset of its co-domain
Mutually exclusive events vs. independence
--> Mutually exclusive events cannot happen at the same time --> so they cannot be statistically independent, since knowing that one occurs gives information about the other (specifically, that it certainly does not occur) --> if A and B are mutually exclusive events, they are statistically independent if and only if P(A) = 0 or P(B) = 0 (or both are zero)
The law of large numbers
The higher the number of trials, the closer the observed proportion gets to the true probability
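A quick simulation sketch, assuming a fair coin with true P(heads) = 0.5:

```python
import numpy as np

# The relative frequency of heads approaches 0.5 as trials increase.
rng = np.random.default_rng(42)
for n in [10, 100, 10_000, 1_000_000]:
    flips = rng.integers(0, 2, size=n)   # 0 = tails, 1 = heads
    print(n, flips.mean())               # observed proportion of heads
```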
Central limit theorem
Gives us the framework to do statistical inference: as the sample size increases, the sampling distribution of the mean becomes more and more like a normal distribution, whatever the shape of the population
Sampling Distribution
The distribution of a statistic (such as the mean) across repeated samples. When you take the average of the sample averages, it will look like your population mean, and because the sampling distribution is normal we can come up with p-values
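A sketch of both ideas, drawing many samples from a skewed (exponential) population; the sample size and counts are arbitrary:

```python
import numpy as np

# Population: Exponential(scale=1), so the population mean is 1.0.
rng = np.random.default_rng(0)
sample_means = [rng.exponential(scale=1.0, size=50).mean()
                for _ in range(10_000)]

# The mean of the sample means is close to the population mean,
# and the sampling distribution is approximately normal.
print(np.mean(sample_means))   # ~1.0
print(np.std(sample_means))    # ~ standard error = 1/sqrt(50)
```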
Descriptive statistics
Summarize the characteristics of a data set (e.g. the mean, median, and range)
Inferential statistics
Allows you to test a hypothesis or assess whether your data are generalizable to the broader population: it takes data from a sample and draws conclusions about the larger population from which the sample was drawn (this happens, for example, when calculating p-values)
Normal Distribution
99.7% of sample results are contained within 3 standard errors, 95% within 2 standard errors, and 68% within 1 standard error
Standard error
= the standard deviation of a sampling distribution (for the mean: SE = s / √n, the sample standard deviation divided by the square root of the sample size)
Null hypothesis
The default assumption that there is no effect or no difference; statistical tests measure the strength of the evidence against it
Confidence Level
probability between the 2 rejection regions for a two-tailed test. If α = 0.10, then the Confidence Level is 1 - 0.10 = 0.90, or 90%
Confidence Interval
The interval whose bounds equal the lower and upper critical values; it can be constructed for any quantity you measure
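A sketch of a two-tailed confidence interval for a mean, with made-up data (a z critical value is used here; for small samples a t critical value, stats.t.ppf, would be more appropriate):

```python
import numpy as np
from scipy import stats

data = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7])
alpha = 0.05                                   # confidence level = 0.95

mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))     # standard error of the mean
z_crit = stats.norm.ppf(1 - alpha / 2)         # upper critical value (~1.96)

lower, upper = mean - z_crit * se, mean + z_crit * se
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```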
Type I error
False positive (rejecting a true null hypothesis)
Type II error
False negative (failing to reject a false null hypothesis)
non-parametric test (distribution-free test)
does not assume anything about the underlying distribution (for example, it does not require that the data come from a normal distribution)
parametric test
makes assumptions about a population's parameters (for example, the mean or standard deviation)
One tailed test
Used when you only want to know whether something is higher or lower (one direction), not both
What test should be used with unequal variances?
Welch's ANOVA
ANOVA
The analysis of variance: it tells you whether there is a difference between at least 2 of the groups, not which groups are different from one another
Total sum of squares (TSS or SST)
tells you how much variation there is in the dependent variable; it is a measure of how a data set varies around a central number (like the mean)
Sum of squares
a measure of spread based on squared deviations, just like variance; in ANOVA the main goal is to compare sums of squares to see whether the groups overlap
Between Sum of Squares (a.k.a. Explained/Model/Treatment) (SSB)
the explained sum of squares tells you how much of the variation in the dependent variable is explained by your model
Residual (Error) Sum of Squares (within sum of squares) (SSE)
tells you how much of the dependent variable’s variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y (observed vs expected)
F-distribution (use)
We use an F-distribution when we are studying the ratio of the variances of two normally distributed populations
F-test
The further the group means are from the grand mean, the larger the variance in the numerator becomes. In our F-test, this corresponds to a larger F-ratio (more between-group variance relative to within-group variance).
F-ratio
= MSB / MSE (mean square between groups divided by mean square error)
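A sketch computing the F-ratio by hand on three made-up groups, checked against scipy's f_oneway:

```python
import numpy as np
from scipy import stats

groups = [np.array([3.0, 4.0, 5.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([9.0, 10.0, 11.0])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between sum of squares: how far each group mean is from the grand mean.
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within (error) sum of squares: variation inside each group.
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_ratio = (ssb / df_between) / (sse / df_within)   # MSB / MSE
print(f_ratio)
print(stats.f_oneway(*groups))                     # should match
```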
What type of ANOVA? (1 grouping variable)
one-way ANOVA
What type of ANOVA? (another grouping variable added)
two-way ANOVA
What type of ANOVA? (a third grouping variable added; a factorial ANOVA)
three-way ANOVA
ANOVA Assumptions
1. Check assumptions: a. Normality (Shapiro-Wilk) b. Outliers (boxplots)
2. Run one-way ANOVA with post-hoc tests: a. Tukey & Games-Howell b. Levene's test = do the distributions for each group look almost the same; is there homogeneity of variance?
3. Run GLM to check partial eta squared: a. Estimates of effect size
4. Calculate omega-squared
5. Interpret the data
Eta square (Eta^2)
How well the model measures the outcome: how much of the total variance in the observations does my model explain (= SSbetween / SStotal)
Omega square
a less biased alternative measure of how well your model explains the results (especially when the sample size is small)
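Continuing the made-up numbers from the F-ratio sketch above, a quick effect-size computation (formulas as commonly given for one-way ANOVA):

```python
# Made-up sums of squares from the F-ratio sketch: SSB = 54, SSE = 6,
# with k = 3 groups and n = 9 observations in total.
ssb, sse = 54.0, 6.0
k, n = 3, 9

sst = ssb + sse
mse = sse / (n - k)

eta_squared = ssb / sst                                # SSB / SStotal
omega_squared = (ssb - (k - 1) * mse) / (sst + mse)    # less biased

print(eta_squared)     # 0.9
print(omega_squared)   # ~0.85
```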
Factorial ANOVA
= we are examining how much of the variance in our data can be explained by our independent variables (more than one) = it looks at the main effects of the PVs and their interaction effect on the OV = a factorial ANOVA with 2 PVs is a two-way ANOVA, etc.
When do we use a factorial ANOVA?
a) OV (outcome variable) is quantitative b) PVs are categorical c) Independent groups (between-subjects design) d) Variance is homogeneous across groups (similar in shape) e) Residuals (distance from the actual observations to the average) are normally distributed. If there is an interaction effect, there is dependency between the PVs.
Moderation
A way to check whether a third variable influences the strength or direction of the relationship between the independent and dependent variables
Mediator
= mediates the relationship between the independent and dependent variables; it explains the reason for such a relationship to exist. Is the influence of the mediator stronger than the direct influence of the independent variable? (Imagine grades lead to happiness and self-esteem is the mediator; with mediation analysis we try to see whether the variable "self-esteem" completely explains the effect of the "grades" variable.)
Correlation
= measures the degree of the relationship between two variables (x and y) = finds the numerical value that shows the relationship between the two variables and how they move together
Regression
= analysis that helps to determine the functional relationship between two variables (x and y), so that you are able to estimate the unknown variable and make future projections on events and goals
= used to estimate the values of a random variable (z) based on the values of your known (or fixed) variables (x and y) = the regression line is the best-fitting line through the data points
Pearson 'r'
= measures the strength of the linear relationship between two quantitative variables = is always a number between -1 and 1; r > 0 indicates a positive association, r < 0 a negative one
R squared
Coefficient of determination = SSB (SSM) / SST = tells you how much of the variability in the outcome is explained by your model; an indicator of how good your linear model is (the proportion of variability in your outcome that is explained by the model)
= shows how well the data fit the regression model (the goodness of fit) = the higher, the better
Bootstrap Procedure (non-parametric)
1. Choose a number of bootstrap samples to perform
2. Choose a sample size
3. For each bootstrap sample:
   a. Draw a sample with replacement of the chosen size
   b. Calculate the statistic on the sample
4. Calculate the mean of the calculated sample statistics
(see the code sketch below)
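A minimal sketch of the procedure above, estimating the mean of a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([2.3, 4.1, 3.8, 5.0, 2.9, 4.4, 3.5, 4.8])

n_bootstrap = 10_000                  # step 1: number of bootstrap samples
sample_size = len(data)               # step 2: sample size

boot_stats = []
for _ in range(n_bootstrap):          # step 3: for each bootstrap sample
    resample = rng.choice(data, size=sample_size, replace=True)   # 3a
    boot_stats.append(resample.mean())                            # 3b

print(np.mean(boot_stats))            # step 4: mean of the sample statistics
```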
Simple linear regression
represented by: y = β0 + β1x + ε
β0 --> y-intercept
β1 --> slope
ε --> error term
E(y) = β0 + β1x --> the mean or expected value of y for a given value of x
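A sketch fitting a simple linear regression on made-up data with scipy's linregress, which also reports Pearson r (and hence R squared, from the entries above):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

result = stats.linregress(x, y)
print(result.intercept)     # estimate of β0
print(result.slope)         # estimate of β1
print(result.rvalue)        # Pearson r
print(result.rvalue ** 2)   # R squared: variability explained by the model
```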
Adjusted R-squared
A modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases when the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected. Typically, the adjusted R-squared is positive, not negative. It is always lower than the R-squared.
Standardized coefficient beta
= standardized coefficients bring everything to the same base (comparing apples with apples) = we use them when we want to compare two variables (for example, 0.467 has a bigger effect than 0.146) to see which has the bigger predictive effect. Use it when you want to compare effect sizes across PVs; easier to compare.
Unstandardized beta
= we want to figure out the exact predictive relation between the variables and our outcome, in original units (for example, for every unit increase in the math aptitude test score, we see a 0.116 increase in the statistics exam result = the actual outcome that happens).
Use it when you want to interpret an individual PV's impact on the OV; easier to interpret.
Orthogonalization
= refers to axes being at a right angle = in moderation we need it to fix the distorting effect of multicollinearity (which inflates standard errors and decreases the t-statistic). In factor analysis we also make use of orthogonalization when we rotate the factors, because all the multidimensional axes have to be at right angles to form the factors/components.
Tolerance
= an indication of the percentage of variance in a predictor that cannot be accounted for by the other predictors; very small values indicate that the predictor is redundant
Dummy variables
= in statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome (converting categories into numbers is the main goal) = a dummy variable is dichotomous, but a dichotomous variable is not necessarily a dummy variable
In ANOVA the dependent variable is continuous; the independent variables can be dichotomous (dummy variables), but the dependent variable cannot
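A sketch of dummy coding with pandas' get_dummies on a made-up categorical column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

# drop_first=True avoids the dummy-variable trap (perfect multicollinearity):
# k categories need only k-1 dummy variables.
dummies = pd.get_dummies(df["color"], drop_first=True, dtype=int)
print(dummies)
```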
Odds
= probability of success / probability of failure (e.g. if P(success) = 0.8, the odds are 0.8/0.2 = 4, or 4 to 1)
Multivariate statistical methods
= methods concerned with the joint behavior of more than one random variable
Goal of PCA (principal components analysis)
= reduce the number of dimensions that we have; we decide on the 2 or 3 components to keep (out of, say, 15 variables) using a scree plot
We started the survey with 15 questions, but we don't really know what variables they are measuring, so PCA helps us see whether there are any underlying variables that we cannot see just by looking at the data
Principal Components & Factor Analysis
= reduce the dimensionality of the problem to better understand the underlying factors affecting those variables
Factors
= a linear combination (variate) of the original variables. Factors also represent the underlying dimensions (constructs) that summarize or account for the original set of observed variables. Factors are a type of latent variable (hidden/underlying: it is there somewhere, but it is not directly observed).
Factor loading
= the correlation between a variable and a factor (e.g. the loading of item SWLS1 on a factor; each question in a survey is a variable)
Communality
= another word for R-square in this context (how much do the factors explain of a variable's variance)
Eigenvalue
% of variance explained * the total number of variables (throw away components with an eigenvalue below 1: the Kaiser criterion)
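A PCA sketch on made-up standardized data with scikit-learn; explained_variance_ holds the eigenvalues, so the Kaiser criterion is a simple comparison:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))               # 100 observations, 5 variables
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize first

pca = PCA()
pca.fit(X)

eigenvalues = pca.explained_variance_       # one eigenvalue per component
print(eigenvalues)
print(pca.explained_variance_ratio_)        # % of variance explained
print(eigenvalues > 1)                      # Kaiser criterion: keep these
```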
Covariance
= involves 2 dimensions (variables); think of it like an unstandardized correlation
Correlation (R)
= can be calculated as the covariance of the 2 dimensions divided by (SD of X * SD of Y)
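A sketch verifying the formula on two made-up variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]                               # sample covariance
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))      # cov / (SDx * SDy)

print(r)
print(np.corrcoef(x, y)[0, 1])                            # should match
```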