Correlation
strength and direction of the relationship between 2 variables
Univariate
summarizes 1 variable at a time
Bivariate
compares 2 variables (correlation & linear regression)
Multivariate
compares 2+ variables (cluster analysis, PCA, & correspondence analysis)
Models used for correlation
scatterplots & general linear model
What types of data for correlation?
continuous or ranked, normally distributed data
Pearson’s R
correlation coefficient that measures the strength and direction of linear relationships for normal, interval data (-1 to 0 to 1 scale)
beware outliers, must be linear
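As a sketch of how Pearson's R is computed (pure Python; the data points are made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations (ranges from -1 to 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Perfectly linear increasing data gives r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```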
Covariance
indicates the direction of a linear relationship
deviations
differences between observed values and the mean
standardization
converting variables to a common unit of measurement, using the standard deviation, so that they can be compared
standard deviation
standardized unit of measurement used to show how far away values are from the mean
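Standardization as described above amounts to converting values to z-scores; a minimal sketch with made-up data:

```python
import math

def standardize(values):
    """Convert values to z-scores: deviations from the mean,
    expressed in units of standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # population standard deviation
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Standardized values are centered on 0, in standard-deviation units
print(standardize([2, 4, 6, 8]))
```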
partial correlation
quantifies the relationship between two variables while controlling for the effect of a third variable
Correlation Tests
Pearson’s R
Spearman’s rho
Kendall’s tau
Spearman’s rho
correlation coefficient test used for non-parametric ranked data, but not good for tied ranks
Kendall’s tau
correlation coefficient test used for non-parametric ranked data and is better for tied ranks
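A sketch of Spearman's rho using the tie-free shortcut formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference between the ranks of each pair (example data is made up and has no ties):

```python
def spearman_rho(xs, ys):
    """Spearman's rho for data with no tied ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotonic but non-linear relationship still gives rho = 1.0
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))  # 1.0
```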
Linear Regression
predicts how changes in the independent variable influence the dependent variable, and quantifies the relationship between the two variables
Steps of Linear Regression
identify dependent and independent variables
plot cases in scatterplot
fit a line to the points
What types of data for correlation, linear regression, and PCA?
continuous (interval/ratio)
Slope intercept formula
y = mx + b
y = dependent variable
m = slope
x = independent variable
b = y-intercept
Least square regression line
line, determined through the method of least squares, that passes through the means of the x and y values and minimizes the distance between itself and the observed points
Method of Least Squares
used to estimate parameters of the model and the line that best fits the data
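The method of least squares reduces, for simple regression, to two closed-form estimates for the slope and intercept of y = mx + b. A minimal sketch (example data is made up):

```python
def least_squares_fit(xs, ys):
    """Estimate slope m and intercept b of y = mx + b by the method
    of least squares; the fitted line passes through (mean(x), mean(y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x and y over variance of x
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

m, b = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)  # 2.0 1.0  (the data lie exactly on y = 2x + 1)
```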
simple regression
outcome variable is predicted from 1 variable
multiple regression
outcome variable is predicted from multiple variables
residuals
like deviations, but for linear regression; assess model fit by looking at the differences between the observed values and the regression line
Coefficient of Determination
output for linear regression; percentage of variance in the dependent variable that can be explained by variance in the independent variable (0 to 1)
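The coefficient of determination can be computed from the residuals: R^2 = 1 - SS_res / SS_tot. A sketch with made-up observed values and predictions:

```python
def r_squared(ys, predictions):
    """Coefficient of determination: the share of variance in the
    dependent variable explained by the model's predictions (0 to 1)."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                   # total variation
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))   # unexplained variation
    return 1 - ss_res / ss_tot

# Observed values vs. predictions from a fitted line y = 2x + 1
print(round(r_squared([3, 5, 7, 10], [3, 5, 7, 9]), 3))  # 0.963
```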
ANOVA Regression
sum of squared differences (model fit)
significance for regression (p-value)
F-value/F-ratio (between-group variance divided by within-group variance; should be > 1)
R
degree of correlation
R²
percentage of the total variation that can be explained by the model
Why linear regression?
allows quantification of the relationship between 2 variables
allows you to make predictions from the independent variable alone, based on a known relationship
explore exceptional cases
Cluster Analysis
groups sets of objects so that objects in the same group (cluster) are more similar to each other than to objects in other groups (can use all data types)
Methods of Cluster Analysis
Hierarchical Cluster Analysis
Non-Hierarchical Cluster Analysis
Hierarchical Cluster Analysis
clusters are formed at every step of the process; as each new case is entered, it is grouped into a progressively larger cluster, and the output is not sorted in a linear order
Dendrogram
tree branch plot used in cluster analysis
Steps for Cluster Analysis
choose variables you want
decide whether to use raw or standardized variables
choose coefficient that quantifies the similarity or dissimilarity between all cases
select method for forming clusters
Cluster Analysis Coefficients
Euclidean Distance
City Block Metrics
Jaccard Coefficient
Simple Matching Coefficient
Euclidean Distance
coefficient method for cluster analysis that uses the Pythagorean theorem to measure the straight-line distance from one point to another; for continuous data
City Block Metric
coefficient method for cluster analysis that makes an x, y grid and moves along one axis at a time, counting how many units apart two points are; for large continuous data sets
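The two continuous-data coefficients above are short formulas; a sketch with made-up points:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points (Pythagorean theorem)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    """City-block (Manhattan) distance: move along one axis at a time."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))   # 5.0  (the 3-4-5 triangle)
print(city_block((0, 0), (3, 4)))  # 7
```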
Jaccard Coefficient
coefficient method for cluster analysis that excludes negative (joint-absence) matches from the comparison; best for presence/absence data
Simple Matching Coefficient
coefficient method for cluster analysis that counts negative (joint-absence) matches as weighing the same as positive matches; best for presence/absence data
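A sketch of both presence/absence coefficients, showing how differently they treat joint absences (the 0/1 vectors are made up):

```python
def jaccard(a, b):
    """Jaccard similarity for 0/1 data: joint absences (0,0) are ignored."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either

def simple_matching(a, b):
    """Simple matching: joint absences count the same as joint presences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

a = [1, 1, 0, 0, 0]
b = [1, 0, 0, 0, 0]
print(jaccard(a, b))          # 0.5  (three shared absences ignored)
print(simple_matching(a, b))  # 0.8  (shared absences count as matches)
```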
Methods for forming clusters
Simple Linkage
Average Linkage
Complete Linkage
Ward’s Procedure
Simple Linkage
method for forming clusters in cluster analysis (also called single linkage) that links clusters based on the shortest distance between their closest members
Average Linkage
method for forming clusters in cluster analysis that links clusters based on the average distance between the members of each cluster
Complete Linkage
method for forming clusters in cluster analysis that links clusters based on the furthest distance between the members of each cluster
Ward’s Procedure
method for forming clusters in cluster analysis that merges the pair of clusters producing the smallest increase in within-cluster variance; best for quantitative data
Types of Hierarchical Clustering
Agglomerative
Divisive
Agglomerative Hierarchical Clustering
uses a bottom-up approach that repeatedly merges clusters into larger ones until a single cluster remains (similarity based on proximity, e.g. Euclidean distance); more commonly used because it is easier to implement
Divisive Hierarchical Clustering
uses a top-down approach that starts from one all-inclusive group and repeatedly splits it into smaller clusters; better for large data sets
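The agglomerative (bottom-up) process with single linkage can be sketched in a few lines: start with every point as its own cluster and repeatedly merge the closest pair. The points below are made up, and real implementations use far more efficient algorithms than this O(n^3) loop:

```python
import math

def single_linkage_merges(points):
    """Agglomerative clustering sketch (single linkage): repeatedly
    merge the two clusters whose closest members are nearest, using
    Euclidean distance, until one cluster remains. Returns the merge
    history (the information a dendrogram plots)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history

# Two tight pairs far apart: each pair merges first, then the two groups.
merges = single_linkage_merges([(0, 0), (0, 1), (10, 0), (10, 1)])
for left, right, d in merges:
    print(left, right, round(d, 2))
```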
Rules for how Divisive clusters are formed
Monothetic
Polythetic
Monothetic
rule for forming divisive clusters where decisions are made based on one variable at a time
Polythetic
rule for forming divisive clusters where clusters are made based on multiple variables at a time
Non-Hierarchical Cluster Analysis
uses some measure to evaluate whether or not a case should be in a cluster, merging or splitting clusters instead of arranging them in a hierarchical order; better for small data sets
Simple K-means
partitions cases into non-overlapping groups (clusters) that have no hierarchical relationships between them
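A naive k-means sketch (made-up 2-D points; centroids are seeded with the first k points for simplicity, whereas real implementations seed more carefully):

```python
import math

def k_means(points, k, iterations=10):
    """Naive k-means: seed centroids with the first k points, assign
    each point to its nearest centroid, move each centroid to the mean
    of its assigned points, and repeat."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))  # centroids settle at the two group means
```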
Principal Component Analysis
a way to identify clusters of variables by reducing the number of dimensions in large datasets down to its principal components, which still retain most of the variation in the original data (continuous); tries to explain the maximum amount of total variance in a correlation matrix by transforming the original variables into a smaller set of linear components (correlation & variance)
Variance
spread of data
Principal Components
lines that describe the relationship between variables; predicted from the measured variables
R matrix
table that arranges the correlation between each pair of variables
What type(s) of data for correspondence analysis?
categorical/count
Steps of PCA
cases are plotted in multi-dimensional space
find where the data is most spread out
identify where the center of the spread of points is
Criteria of a Principal Component Line
must pass through center of data
must pass through spread of data along the axis that will capture the most variation
all consecutive lines must be drawn at right angles to the prior line
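For 2-D data the whole PCA process can be sketched directly: center the data, build the 2x2 covariance matrix, and solve for its eigenvalues analytically (each eigenvalue is the variance along one component). The points below are made up:

```python
import math

def pca_2d(points):
    """PCA sketch for 2-D data. Returns the eigenvalues of the
    covariance matrix (largest first); the larger eigenvalue is the
    variance captured by the first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # eigenvalues of the symmetric matrix [[cxx, cxy], [cxy, cyy]]
    mean = (cxx + cyy) / 2
    delta = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    return mean + delta, mean - delta

# Points lying exactly on y = x: all variance falls on one component
lam1, lam2 = pca_2d([(1, 1), (2, 2), (3, 3), (4, 4)])
print(lam1, lam2)  # 2.5 0.0
```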
Evaluating PCA/Eigenvectors for significance/meaning
eigenvalues
scree plot
component loading
What kind of plot for correlation, linear regression, PCA, and correspondence analysis?
scatterplots
Eigenvalues
values derived from eigenvectors that measure the distance from one end of the matrix to another in order to understand the distribution of variance (how much variation along that dimension of the data is described; components with eigenvalues > 1 are typically retained)
Eigenvector
measures height and width of ellipse encompassing the data in the scatterplot
Scree Plot
plot of eigenvalues by component; the "elbow" marks where most of the variance has been accounted for and the eigenvalues level off
Component Loading
breaks down principal components to understand correlations between original variables and unit-scaled components (larger loading = stronger correlation)
Correspondence Analysis
plots (scatterplots/biplots) different categories in multidimensional space and reduces it to 2-dimensional space to understand which categories are similar and why (expected vs. observed)
Residual
like error for correspondence analysis; describes how the relationship between 2 dependent variables is influenced by individual differences in participants' performance (difference between model prediction and observed value)
Inertia
degree to which values of rows and columns correspond to each other in correspondence analysis (chi-square/n)
Steps of Correspondence Analysis
compute averages for each row and column
compute the expected values for each cell
compute residuals for each cell
divide the residuals by the expected values
plot indexed residuals in 2 dimensions
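The first four steps above can be sketched for a small contingency table (the counts are made up; the final 2-D plotting step is omitted):

```python
def indexed_residuals(table):
    """First steps of correspondence analysis on a contingency table:
    expected cell values from row/column totals, then indexed residuals
    (observed - expected) / expected for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # expected count under independence: row total * column total / grand total
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    indexed = [
        [(obs - exp) / exp for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, indexed

# Hypothetical 2x2 count table
expected, indexed = indexed_residuals([[20, 10], [10, 20]])
print(expected)  # every cell expects 15.0 under independence
print(indexed)   # positive where observed > expected, negative otherwise
```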