Univariate Distribution
Describing and analyzing one variable.
Statistic
An estimate of a parameter based on a sample.
Descriptive Statistics
Measurements used to summarize and organize the observed values of one variable.
Inferential statistics
Measurements used to make inferences or generalizations about a population based on data from a sample.
Frequency Distributions
Number of cases for each category of a variable. To construct one, list the categories of the variable and then list the number of observations in each.
Proportion
Number of observations in each category divided by the total number of observations
Percentage
Proportion multiplied by 100.
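The three entries above can be sketched in a few lines of Python. The sample of responses here is hypothetical, purely for illustration:

```python
from collections import Counter

# Hypothetical sample of responses on one nominal variable
responses = ["agree", "agree", "neutral", "disagree", "agree", "neutral"]

freq = Counter(responses)                  # frequency distribution
n = len(responses)                         # total number of observations
proportions = {k: v / n for k, v in freq.items()}     # count / total
percentages = {k: p * 100 for k, p in proportions.items()}

print(freq["agree"])         # 3
print(proportions["agree"])  # 0.5
print(percentages["agree"])  # 50.0
```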
Central Tendency
The value around which most of the data are clustered.
Mode
The value that appears most frequently
Median
The midpoint in an ordered series of data; its position is (N+1)/2. For an even N, it is the average of the two middle values.
Bimodal
when two categories occur just as frequently
Mean
The sum of the observed values divided by the number of cases
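The three measures of central tendency above can be computed with Python's standard `statistics` module. The data set is hypothetical; note the median here is the average of the two middle values because N is even:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]            # hypothetical observations (N = 6, even)

mode = statistics.mode(data)          # most frequent value
median = statistics.median(data)      # average of the two middle values, (3 + 5) / 2
mean = statistics.mean(data)          # sum of values divided by N

print(mode, median, mean)             # 3 4.0 5
```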
Dispersion
The extent to which the data are spread out from their central tendency
Range
Indicates the difference between the lowest and highest values of the distribution
Quantiles
points taken at regular intervals of an ordered data set that divide the set into equal groups from lowest to highest.
Quartiles
4 equal parts
Quintiles
5 equal parts
Deciles
Ten equal parts
Percentiles
100 equal parts
Variance
A measure of dispersion for interval or ratio level data based on finding the variation around the mean value of the distribution.
It is the average of the squared deviations from the mean, indicating how far the data are spread from the mean.
Standard Deviation
The square root of the variance - since the variance has transformed the data into squared units and we want to report them in original units.
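A minimal sketch of the variance and standard deviation definitions above, using the population formula (average of squared deviations) on a hypothetical data set:

```python
import statistics

data = [4, 8, 6, 2]                       # hypothetical interval-level data
mean = statistics.mean(data)              # 5.0

# Variance: the average of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance, back in original units
sd = variance ** 0.5

# Sanity check against the standard library's population variance
assert variance == statistics.pvariance(data)
print(variance, sd)
```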
Standard Deviation Advantages (over other measures of dispersion)
it is more stable from sample to sample since it is based on all observations
Standard Deviation is most useful when…
When interpreted by comparing the dispersion of several clusters of data.
Six examples of frequency curves
Bell shaped
U-shaped
Positively Skewed j-curve
Negatively skewed j-curve
Bimodal
Rectangular
Bell Shaped
A symmetrical distribution (mean and median are identical and frequencies going toward the right and left tails are identical) - where most of the data (the mode) is centered near the median and mean.
U-shaped
A symmetrical distribution where most of the data is spread evenly away from the mean and median (Very few cases are average and most are at either end of the extreme).
Positively skewed j-curve
A non-symmetrical distribution with a large number of low scores and a few extremely high scores.
Negatively skewed j-curve
A non-symmetrical distribution with a large number of high scores and a few extremely low scores
Bimodal
A distribution where the data is clustered at two different points away from the mean. Takes an “m” shape.
Rectangular
A symmetrical distribution where the data is spread equally in every category.
Normal Curve
A special type of bell curve where the distribution of values bears a direct and known relationship to the size of the standard deviation. A given percentage of cases are within 1, 2, or 3 standard deviations from the mean.
Four properties of Normal Curves
Symmetrical and bell-shaped
Mode, median, and mean coincide at the center of the distribution
Curve is based on an infinite number of observations
A fixed proportion of observations lies between the mean and fixed units of standard deviations
Normal Distribution
When data is distributed normally the mean divides the data in half. The following holds true:
68.26% of the data lies within ±1 sd from the mean
95.46% of the observations fall within ±2 sd from the mean
99.73% of the observations fall within ±3 sd from the mean
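The three percentages above can be verified from the normal curve itself using the error function in Python's standard library; the exact values round to 68.27%, 95.45%, and 99.73%, matching the figures above up to rounding:

```python
import math

def pct_within(k):
    """Percentage of a normal distribution within +/- k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

print(round(pct_within(1), 2))   # 68.27
print(round(pct_within(2), 2))   # 95.45
print(round(pct_within(3), 2))   # 99.73
```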
Outlier
An observation in a normally distributed data set that lies beyond ±3 standard deviations from the mean (only .27% of observations fall in this category).
Are sometimes dropped from the data when it is analyzed since they do not represent the average cases and tend to skew the results.
Z-scores
Also known as standardized score; is the number of standard deviations an observation is above or below the mean. A positive score is above the mean, and a negative is below the mean.
Are used if the data is normally distributed.
To compute z-scores
Subtract the mean of the data from the score of a specific observation. Then divide the results by the standard deviation of the data.
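The computation above is a one-liner. The mean and standard deviation here are hypothetical values chosen for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations an observation lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical scores from a distribution with mean 70 and standard deviation 10
print(z_score(85, 70, 10))   # 1.5  -> above the mean
print(z_score(60, 70, 10))   # -1.0 -> below the mean
```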
Bivariate Relationships
Relationships between two variables.
Contingency Tables
aka Crosstabulation. Compares or cross-tabulates two nominal and/or ordinal variables in a table to see if the values of one are contingent on the other.
This helps determine if there is a relationship between the variables.
Cross Tabulations
a statistical method used to analyze the relationship between two or more categorical variables by displaying the frequency of their combinations in a table
“Percentage” the Table
Compute percentages within each category of the IV, then subtract across to compute the percentage difference.
Difference of Means
Comparing the appropriate central tendency in two groups to look for patterns
Tests of Statistical Significance
Determines whether a relationship between variables in a probability sample can be generalized to the population from which the sample was selected.
Significance Level
states the probability that a relationship in a probability sample occurred by chance and doesn’t really exist in the population. Frequently symbolized with the letter “p” for probability - “p-value”
Probability or “p” value
the significance level. The probability that a relationship in a probability sample occurred by chance.
.05 Cutoff Level
95% confidence. The most commonly used level to use as a cutoff point. If the reported significance level is greater than .05, the relationship is assumed not significant. If it is less than .05, the relationship is assumed to be significant.
Chi-squared statistic for statistical significance
Used for contingency tables of nominal or ordinal data.
Computed by figuring out the difference between the observed and expected values of the variables in contingency table.
If it is significant at .05 or less (e.g., .01 or .001), you can state that the sample relationship in the contingency table is significant; levels such as .20 or .30 are not significant.
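The computation described above (difference between observed and expected counts) can be sketched directly. The 2x2 table of counts is hypothetical:

```python
# Hypothetical 2x2 contingency table: rows = DV categories, columns = IV categories
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = (row total * column total) / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 2))   # 16.67
```

For a 2x2 table (1 degree of freedom) the .05 critical value is 3.84, so this hypothetical statistic would be significant well beyond the .05 level.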
Measures of Association
Allow us to summarize the strength of a relationship in more accurate ways than relying on percentage differences.
Strength of Relationships
Extent to which changes in one variable are accompanied by changes in another variable.
Yule’s Q
A summary measure for use in a bivariate table with nominal (or ordinal) data that indicates the strength of the relationship.
Is based on the number of cases that show a positive relationship minus the number of cases that show a negative relationship.
Q = (ad - bc) / (ad + bc)
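The formula above translates directly to code. The cell counts a, b (top row) and c, d (bottom row) are hypothetical:

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 table: (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical 2x2 table: a=30, b=10, c=20, d=40
q = yules_q(30, 10, 20, 40)
print(round(q, 2))   # 0.71 -> a fairly strong positive relationship
```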
Proportional Reduction of Error (PRE)
The amount by which errors in predicting the dependent variable can be reduced by knowing the relationship between the DV and the IV.
The extent to which we can reduce possible errors in estimating the value of a case if we know its value on a second variable.
(errors without knowledge of IV - errors with knowledge of IV) / errors without knowledge of IV
Lambda
Used with two nominal variables or with one nominal and one ordinal variable. Can be used for tables that are larger than two by two. It compares the modal value for each value of the IV.
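Lambda is a PRE measure, so it follows the formula above: errors predicting the overall modal DV category, versus errors predicting the modal DV category within each IV category. A sketch on a hypothetical table:

```python
# Hypothetical table: rows = DV categories, columns = IV categories
table = [[30, 10],
         [20, 40]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]      # DV category totals

# Errors WITHOUT knowledge of the IV: always guess the modal DV category
e1 = n - max(row_totals)

# Errors WITH knowledge of the IV: guess the modal DV category within each IV column
e2 = sum(sum(col) - max(col) for col in zip(*table))

lam = (e1 - e2) / e1          # proportional reduction of error
print(round(lam, 2))          # 0.25
```

Here knowing the IV reduces prediction errors by 25% for this hypothetical table.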
Gamma
an ordinal measure of relationships. Measures the number of similarly ordered pairs as a proportion of all relevant pairs. Does not include tied pairs.
(Identical to Yule’s Q for a 2x2 table.) For ordinal-level data. Is frequently higher than the other measures and can overestimate relationships when there are many tied pairs, since it does not include them.
Kendall’s Tau
an ordinal measure of relationships. is based on pairs of cases.
For ordinal-level data; can be used for nominal data when lambda is inappropriate (one DV category very high). Tau-b is for square tables (e.g., 2x2 or 3x3) and tau-c for rectangular ones (e.g., 2x3, 2x4, or 3x4). Includes tied pairs (those not on the diagonal) in the denominator of its calculations and thus gives a more accurate (lowest, most conservative) measure. (Tau-c is harder to interpret and can only say which of two tables of similar proportions is stronger.)
Somers’ d
an ordinal measure of relationship.
Generally for ordinal level data. Only counts pairs tied on the DV in the denominator. This has the effect of focusing on pairs of cases where the IV actually changes. Thus it is better for causal analysis. Usually gives a moderate measure between those arrived at by Gamma and Tau.
List the Variables:
Dependent (Y), Independent (X), Intervening (Z)
Dependent Variable
(Y). The variable we wish to explain or predict; its value is influenced by the Independent Variable
Independent Variable
(X) The variable we think causes a change in (has an effect on) the Dependent variable
Intervening Variable
(Z) A third variable that can affect the relationship between the independent variable and dependent variable. It is also another independent variable.
Multivariate Analysis
Examining a relationship between more than two variables; usually looks at the effect of several Independent variables on one Dependent variable.
Control
Examining a relationship between an Independent Variable and a Dependent variable by holding another Independent variable constant.
This attempts to rule out the third variable as an alternative explanation. It may or may not turn out to be an intervening variable depending on the results.
Elaboration
Breaking down the original contingency table into two or more tables based on the values of the control variable
Contingency Table
Independent Variable across the top; Dependent Variable down the side
Gross Effect
% difference between IV and DV
Net Effect
% difference of the IV on the DV controlling for the new variable
Replication
If the relationships are basically the same in the different tables, then the control variable is not really an intervening variable.
It does not have an effect on the relationship between the DV and IV.
The original contingency table repeats itself after controlling
Spuriousness
There is an association between the variables in the original table which disappears after elaboration.
Conditioning
The relationship in the original table is modified so that a different association appears in one or both (or all) sub-groups of the elaborated table
Additive Effect
If the modification is about the same for both or all subgroups then the relationship is considered this.
They both are a little or a lot higher or both a little or a lot lower.
Interaction
If the modification is different for one or more of the subgroups then the relationship is considered this.
One is higher and one is lower; or one is the same and one is higher; or one is the same and one is lower; or one is a little higher and one is a lot higher; or one is a little lower and one is a lot lower
Linear Regression
A technique that models the relationship between two variables so that the prediction line “fits” as closely as possible to the actual data.
Is often used when both the DV and IV are measured at the interval/ratio level but can also be used without much modification when variables are at the ordinal level.
Scatterplot
A graph constructed by plotting the values from 2 interval variables along an X and Y axis.
Allows us to visualize the relationship between the IV and DV. If the dots fall about randomly then there is not much of a relationship.
Criterion of Least Squares
The object of regression is to draw a line that fits the data “best” and then rely on the formula for that line to predict and explain other values.
We draw the line that minimizes the sum of the squares of the deviations of the observed values of y from those values of y predicted by the regression line.
Regression Line
When a researcher collects interval/ratio data, this can be computed to see if the IV has an impact on the DV
Regression line equation
Y = a + bX
Y = values of DV
X = values of IV
Regression constant “a” - y intercept
“a” is the constant or y-intercept (when x=0 then y=a)- that is, the value of the DV when the independent variable is 0
Regression coefficient “b” slope
A 1-unit rise in the IV results in a “b”-unit rise in the DV. The slope can be positive or negative, indicating the direction of the relationship.
Slope also tells how much of a change in one variable is associated with the other.
Prediction Line
once a regression equation is computed, it can be used to predict values of the DV by inserting different values for the IV.
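The least-squares line Y = a + bX can be computed by hand from the definitions above. The X and Y values here are hypothetical interval-level data:

```python
# Hypothetical interval-level data: xs = IV values, ys = DV values
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b: minimizes the sum of squared deviations of observed y from the line
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)

# Intercept a: value of the DV when the IV is 0
a = mean_y - b * mean_x

def predict(x):
    """Prediction line: insert an IV value, get a predicted DV value."""
    return a + b * x

print(b, a)               # 0.6 2.2
print(predict(6))         # 5.8
```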
Pearson’s R Correlation Coefficient
A measure of closeness of fit that indicates the strength of a relationship between interval/ratio level data.
A summary measure that tells how much more accurate an estimate of a value of the DV will be given information about the IV.
Runs from -1 to +1. The closer the cases fit the regression line, the higher r will be.
R-squared
A PRE measure or goodness of fit that tells the extent to which variation in the Y (the DV) is explained by variation in the X (the IV).
Tells us to what extent knowing X enables us to predict Y.
Runs from 0 to 1. The higher the number the more variance explained and the better the fit.
Is computed by multiplying Pearson’s r by itself (r x r)
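Pearson's r and r-squared can both be sketched from the same sums of deviations; the hypothetical data match the regression example's form (interval-level X and Y):

```python
import math

# Hypothetical interval-level data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)   # Pearson's r: runs from -1 to +1
r_squared = r * r                # proportion of variation in Y explained by X

print(round(r, 3), round(r_squared, 3))   # 0.775 0.6
```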
T-statistic
Sometimes researchers compute this to determine statistical significance for interval/ratio data.
Indicates the probability that Pearson’s correlation coefficient is not zero in the population.
Commonly used in regression to say whether or not the relationship under study occurred by chance.
Multiple Regression
Using regression to explain the variation in a dependent variable with two or more independent variables.
Allows the researcher to isolate the effect of each IV included in the model while controlling for the effects of the other IVs
Y = a + b1X1 + b2X2 + b3X3 + … + bnXn
Partial Regression Coefficient
each “b” in Multiple Regression equation. There are several of them in one equation and each accounts for only part of the overall change in the DV.
F-statistic
Statistical significance of the overall regression model.
Will usually be reported with asterisks indicating the level of statistical significance.
No asterisk at all would mean that the model overall is not statistically significant
Beta Coefficient
Regression coefficients that have been standardized.
Is interpreted as how many standard deviations of change in the DV result from a one-standard-deviation change in the IV.
Adjusted R-squared
Smaller than regular R-squared.
Gives the total amount of variance explained in the DV by all the IVs, but it takes into account the number of IVs in the model and the sample size.
Is considered a more reliable and conservative measure of the total amount of variance explained.
Regression with Dummy Variables
Regression analysis can still be used when a nominal IV is recoded into dichotomous (0/1) dummy variables.
Logistic Regression
Regression analysis when the DV is measured at the nominal level and has only two categories.