Explained Variance
Variance between samples: An estimate of σ² equal to the variance of the sample means multiplied by n (when the sample sizes are equal). If the samples are different sizes, the variance between samples is weighted to account for the different sample sizes. This variance is also called variation due to treatment or explained variation.
The explained variation is the sum of the squared differences between each predicted y-value and the mean of y.
Unexplained Variance
Variance within samples: An estimate of σ² that is the average of the sample variances (also known as a pooled variance). When the sample sizes are different, the variance within samples is weighted. This variance is also called the variation due to error or unexplained variation.
The unexplained variation is the sum of the squared differences between each observed y-value and its corresponding predicted y-value.
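In standard notation (a restatement of the two definitions above, with ŷᵢ the predicted value, yᵢ the observed value, and ȳ the mean of y):

```latex
% Explained (regression) variation: predictions vs. the mean of y
SS_{\text{explained}} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

% Unexplained (error) variation: observations vs. their predictions
SS_{\text{unexplained}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

% The two components add up to the total variation in y
SS_{\text{total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = SS_{\text{explained}} + SS_{\text{unexplained}}
```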
Variance
standard deviation squared
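In symbols (the standard sample-variance formula, added here for reference):

```latex
s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}
```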
Anova test
Analysis of Variance; used to compare a continuous outcome across three or more independent groups; parametric
ANOVA vs. multiple two-sample t-tests
The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.
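A minimal sketch in R (hypothetical samples x and y; var.equal = TRUE gives the pooled, equal-variance version matching the assumptions below):

```r
# Two hypothetical samples (illustrative data, not from the cards)
x <- c(5.1, 4.8, 5.6, 5.0, 5.3)
y <- c(4.2, 4.6, 4.1, 4.4, 4.7)

# Independent two-sample t-test; var.equal = TRUE applies the
# pooled-variance version (equal-standard-deviations assumption)
t.test(x, y, var.equal = TRUE)
```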
Assumptions
1. the population follows the normal distribution
2. the populations have equal standard deviations
3. the populations are independent
Each population from which a sample is taken is assumed to be normal.
All samples are randomly selected and independent.
The populations are assumed to have equal standard deviations (or variances).
The factor is a categorical variable.
The response is a numerical variable.
Null and Alternative
H0 (the null hypothesis): a statement of no difference between the variables; they are not related. This can often be considered the status quo, and as a result, if the null cannot be accepted, some action is required.
Ha (the alternative hypothesis): the contender, which must win with significant evidence to overthrow the status quo. This concept is sometimes referred to as the tyranny of the status quo because, as we will see later, overthrowing the null hypothesis usually requires 90% or greater confidence that this is the proper decision.
Variation due to chance
any change in hereditary traits due to unknown factors
Categories of Variables
- presage variables
- context variables
- process variables
- product variables
criterion variable
the variable in a multiple-regression analysis that the researchers are most interested in understanding or predicting / dependent variable
Classification variable
an independent variable that is observed but not controlled by the researcher
factor variable
those aspects of a situation that may influence particular phenomena
treatment variable
an independent variable that is manipulated in an experiment
F-Distribution in ANOVA
F = (estimate of the population variance based on the differences BETWEEN the sample means) / (estimate of the population variance based on the variation WITHIN the samples)
**If the ratio is significantly greater than 1, we conclude that the treatment means are not all the same: there is a difference in the treatment means.**
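In standard ANOVA notation (k groups, N total observations; a conventional restatement, not from the card):

```latex
F \;=\; \frac{MS_{\text{between}}}{MS_{\text{within}}}
  \;=\; \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```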
F distribution (shape)
mathematically defined curve that is the comparison distribution used in an analysis of variance / skewed right / always positive
F distribution (variance ratio distribution)
The distribution of the ratio of two independent quantities each of which is distributed like a variance in normally distributed samples. So named in honor of R.A. Fisher who first described the distribution. / range [0, + ∞)
Correlation
is mainly about relationships
Correlation of X
X: Independent variable and predictor variable
Correlation of Y
Y: dependent variable, variable of interest, criterion variable
sample & population notation for the correlation coefficient
population: ρ (rho)
sample: r
Range of r
−1 ≤ r ≤ +1
r² is measuring
the common (shared) variance between the two variables
Correlation Relationship
changes in one variable are associated with changes in another but it is not known whether one variable directly influences the other
In a negative correlation, as x increases, y decreases / a perfect negative relationship is r = −1
If r = 0, there is no linear correlation
In a positive correlation, as x increases, y increases / a perfect positive relationship is r = +1
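A quick R sketch of these cases (made-up data):

```r
x <- 1:10

# Perfect positive relationship: r = +1
cor(x, 2 * x + 3)

# Perfect negative relationship: r = -1
cor(x, -x + 20)

# Unrelated noise: r near 0
set.seed(42)
cor(x, rnorm(10))
```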
Linear Relationships
A relationship that has a straight line graph / where one variable changes by consistent amounts as you increase the other variable.
Non-linear relationships
Curved graph /
the slope changes at different rates at different points on the curve
Where one variable changes by inconsistent amounts as you increase the other variable.
Ex: emotional investment and performance
Proceed with Caution
1. Sample size
2. Relationships change
3. Correlation is NOT causation
4. Correlation -> causation -> liability
5. Effect of sample size

Relationships change over time.
Correlation has to do with measuring the strength of the relationship.
The underlying factors of that relationship may change, and therefore our correlation or our regression model becomes outdated.
Null and Alternative Hypothesis
Null hypothesis (Ho)
- stating no difference
Alternative hypothesis (Ha)
- stating there's a difference
T- test in R
an inferential statistical analysis used when comparing two samples of data in either a matched groups design or a repeated-measures design
degrees of freedom
df = n-1
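A minimal R sketch for the matched-groups / repeated-measures case described above (hypothetical before/after scores; with n pairs, df = n − 1):

```r
# Hypothetical repeated-measures data: same subjects measured twice
before <- c(12, 15, 11, 14, 13, 16)
after  <- c(14, 17, 12, 15, 15, 18)

# Paired t-test; df = n - 1 = 5 here, since there are 6 pairs
t.test(after, before, paired = TRUE)
```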
Chance Model
A model that generates data from random processes to help statisticians investigate such processes.
It is the foundation upon which regression is built.
Chance Model
How it's calculated:
no predictor variables
Importance:
to generate data from random processes to help statisticians investigate the process.
How is it graphed:
The slope is 0, so the best forecast is always the mean of Y.
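A minimal R sketch of the chance (intercept-only) model, using made-up data:

```r
set.seed(1)
y <- rnorm(20, mean = 50, sd = 10)   # made-up response values

# Intercept-only model: no predictor variables
chance_model <- lm(y ~ 1)

# With a slope of 0, the forecast for every case is the mean of y
coef(chance_model)   # intercept equals mean(y)
mean(y)
```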
Full Model
(FM) all predictor variables
Restricted Model (RM)
(RM) Some predictor variables
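A sketch in R comparing a restricted model to the full model via a partial F-test (hypothetical predictors x1 and x2, made-up data):

```r
set.seed(2)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 5 + 2 * x1 + 0.5 * x2 + rnorm(30)   # made-up data

full       <- lm(y ~ x1 + x2)   # FM: all predictor variables
restricted <- lm(y ~ x1)        # RM: some predictor variables

# Partial F-test: does the full model explain significantly more?
anova(restricted, full)
```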
Correlations vs Regressions
Coefficient of Determination -> RSQ
RSQ: the percentage of the variation in the y-variable that is accounted for by the variation in the x-variable(s).
Ranges from 0 to 1 (0% to 100%).
More practically interpretable than r.
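An R sketch with made-up data; in simple regression, RSQ equals r squared:

```r
set.seed(3)
x <- rnorm(25)
y <- 3 + 1.5 * x + rnorm(25)   # made-up data

fit <- lm(y ~ x)
summary(fit)$r.squared   # RSQ from the model
cor(x, y)^2              # equals r^2 in simple regression
```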
Properties & Qualities regarding residuals (aka error)
Notation for a sample -> e
for a population -> ε (epsilon)
Ordinary least squares regression: minimizing error.
If r = ±1, we would have perfect prediction (every error e = 0).
Σ(Y − Ŷ) = 0, always!
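A quick R check of that identity for any least-squares fit that includes an intercept (made-up data):

```r
set.seed(4)
x <- rnorm(20)
y <- 2 + 4 * x + rnorm(20)

fit <- lm(y ~ x)

# Residuals from an OLS fit with an intercept sum to zero
# (up to floating-point rounding)
sum(resid(fit))
```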
Plot errors
Randomly scattered around the x-axis and approximately normally distributed.
Residual Plot for Sale Predictions:
If the points form a pattern, the relationship is non-linear.
We expect the majority of the errors to be near the x-axis, with fewer outliers.
The variation in the y-variable consists of 2 components:
Y's relationship with the x-variable
Random Factors NOT in the model
Isolating the Slope
The effect of marginal inputs on predicted outcomes
A look at major league baseball
X-var : payroll ($mil)
Y-var: Wins
Adjusted RSQ
The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.
r2: sample r-square
p: number of predictors
N: total sample size
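Using the notation above (r²: sample R-square, p: number of predictors, N: total sample size), the standard adjusted R-squared formula is:

```latex
R^2_{\text{adj}} = 1 - \left(1 - r^2\right)\frac{N - 1}{N - p - 1}
```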
Multicollinearity
The term used to describe the correlation among the independent variables.
2 or more highly correlated predictor variables
Ideally, each predictor explains a unique portion of the variability in the y-variable; multicollinearity undermines this.
multiple regression equation
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖ₋₁xₖ₋₁ + bₖxₖ
Characteristics of R:
A single value representing the strength of the simultaneous relationship between the x-variables and the y-variable.
R is never negative (unlike r).
R ranges from 0 to 1 (r ranges from −1 to +1).
R does not indicate the direction of the relationship (unlike r).
R is greater than or equal to any single r for any single x-y relationship.
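An R sketch (made-up data) of multiple R as the square root of R-squared, which is why it is never negative:

```r
set.seed(5)
x1 <- rnorm(40); x2 <- rnorm(40)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(40)   # made-up data

fit <- lm(y ~ x1 + x2)

# Multiple R: always non-negative, between 0 and 1
R <- sqrt(summary(fit)$r.squared)
R
```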
dummy variable
A variable for which all cases falling into a specific category assume the value of 1, and all cases not falling into that category assume a value of 0.
- used with two-category (binary) nominal data
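A small R sketch of dummy coding a two-category nominal variable (hypothetical group labels):

```r
# Hypothetical two-category nominal variable
group <- c("treatment", "control", "treatment", "control", "treatment")

# Dummy coding: 1 if the case falls in the category, 0 otherwise
d <- ifelse(group == "treatment", 1, 0)
d
# R's factor() applies the same coding automatically inside lm()
```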
1-way anova
to determine the existence of a statistically significant difference among several group means.
The test actually uses variances to help determine if the means are equal or not.
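A minimal one-way ANOVA sketch in R (hypothetical data; categorical factor, numeric response, matching the assumptions above):

```r
# Hypothetical data: numeric response, three-level categorical factor
df <- data.frame(
  response = c(4.1, 5.2, 4.8, 6.0, 6.3, 5.9, 7.1, 7.4, 6.8),
  group    = factor(rep(c("A", "B", "C"), each = 3))
)

# One-way ANOVA: uses variances to test whether group means are equal
fit <- aov(response ~ group, data = df)
summary(fit)   # F statistic and p-value
```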
multiple regression
a statistical technique that computes the relationship between a predictor variable and a criterion variable, controlling for other predictor variables
1. Prediction, forecasting
2. To determine the underlying causes of changes in the y-variable (the variable of interest)
Based on the least squares method.
Trying to minimize error; squaring means we don't care whether it's positive or negative.
Summary outputs:
The reported p-value is an almost infinitesimal level of probability and is certainly less than our alpha level of .05 for a 5 percent level of significance.
Anova Table
The ANOVA table breaks down the components of variation in the data into variation between treatments and error or residual variation.
Coefficient table
In regression summary output, the coefficient table lists each predictor's estimated coefficient along with its standard error, t-statistic, and p-value, used to judge each predictor's contribution to the model.
Collinearity Matrix
Collinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy; a correlation matrix of the predictors is commonly used to detect it.