A collection of vocabulary flashcards based on descriptive statistics and the introduction to R programming.
Descriptive Statistics
Statistical methods used to summarize, organize, and describe the main features of a dataset, such as central tendency (mean, median, mode) and variability (range, standard deviation).
Univariate
A type of statistical analysis focusing on describing or summarizing the characteristics of a single variable at a time, often using measures like frequency distributions, mean, or standard deviation.
Bivariate
A type of statistical analysis that examines the relationship or association between two variables simultaneously, often looking at how changes in one variable correspond to changes in another.
Crosstabs
Also known as contingency tables, crosstabs are a method for displaying and summarizing the relationship between two categorical variables by counting the number of observations that fall into each combination of categories.
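A contingency table can be tallied in a few lines. This is a minimal Python sketch using made-up survey data and the standard-library `Counter` (in R, the equivalent is `table()`):

```python
from collections import Counter

# Hypothetical survey data: two categorical variables per respondent.
smoker   = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
exercise = ["low", "high", "high", "low", "low", "high", "low", "high"]

# Count the observations falling into each combination of categories.
table = Counter(zip(smoker, exercise))

print(table[("yes", "low")])   # smokers with low exercise
print(table[("no", "high")])   # non-smokers with high exercise
```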
Covariance
A measure indicating the extent to which two variables change together. A positive covariance means they tend to increase or decrease simultaneously, while a negative covariance means one increases as the other decreases. Its magnitude is influenced by the scales of the variables.
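A quick sketch with invented height/weight values shows both the sign of covariance and its scale dependence (R's equivalent is `cov()`):

```python
# Sample covariance computed by hand; the data are made-up illustration values.
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

height_m  = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [55, 60, 66, 72, 80]

print(cov(height_m, weight_kg))   # positive: taller tends to go with heavier

# Rescaling a variable rescales the covariance by the same factor:
height_cm = [h * 100 for h in height_m]
print(cov(height_cm, weight_kg))  # 100x larger, same underlying relationship
```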
Correlation
A standardized statistical measure that quantifies the strength and direction of a linear relationship between two variables. Unlike covariance, correlation is scaled to a range (e.g., -1 to +1) making it easier to interpret the strength of the relationship regardless of the variables' units.
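Continuing the same invented height/weight example: rescaling a variable changes the covariance but leaves r untouched (R's `cor()` behaves the same way):

```python
# Pearson's r computed by hand; the height/weight numbers are made up.
def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

height_m  = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [55, 60, 66, 72, 80]

r_m  = pearson_r(height_m, weight_kg)
r_cm = pearson_r([h * 100 for h in height_m], weight_kg)  # metres -> centimetres
print(r_m, r_cm)  # same value either way, always between -1 and +1
```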
Pearson’s Correlation (r)
A specific type of correlation coefficient that measures the strength and direction of the linear relationship between two quantitative variables. It requires both variables to be interval or ratio scale and assumes a linear relationship.
Assumptions of Pearson’s r
These are specific conditions that must be met for the Pearson's correlation coefficient to be a valid and reliable measure of a linear relationship. Key assumptions include: 1) both variables are quantitative (interval or ratio scale), 2) the relationship is linear, 3) there is homoscedasticity, and 4) bivariate normality (or sufficiently large sample size).
Homoscedasticity
A statistical assumption, particularly important for regression and correlation, where the variance (or spread) of the residuals (the differences between observed and predicted values) is approximately constant across all levels of the independent variable. In the context of Pearson's r, it means the variability in one variable is similar for all values of the other variable.
Bivariate Normality
An assumption that for any given value of one variable, the values of the other variable are normally distributed, and vice versa. It implies that the joint distribution of the two variables forms a 3D bell shape, which is often desirable for parametric tests involving two variables.
Spearman’s Correlation (ρ)
A nonparametric measure of correlation that assesses the strength and direction of the monotonic (not necessarily linear) relationship between two variables. It is often used when variables are ordinal, or when the assumptions for Pearson's r are violated for interval/ratio variables.
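A sketch with a cubic (monotonic but non-linear) relationship shows why the two coefficients differ; Spearman's ρ is just Pearson's r applied to the ranks (in R: `cor(x, y, method = "spearman")`):

```python
# Spearman's rho = Pearson's r on ranks; the data below have no ties.
def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ranks(xs):
    # Assign rank positions 1..n (no tie handling needed here).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman_rho(xs, ys):
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]    # monotonic but curved, not linear

print(spearman_rho(x, y))  # 1.0: perfect monotonic association
print(pearson_r(x, y))     # below 1: the linearity is imperfect
```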
Monotonic Relationship
A type of relationship between two variables where as one variable increases, the other variable either consistently increases (monotonically increasing) or consistently decreases (monotonically decreasing), but not necessarily at a constant rate (i.e., not strictly linear).
Anscombe’s Quartet
A set of four distinct pairs of datasets that each have nearly identical simple descriptive statistics (e.g., mean, variance, correlation coefficient, linear regression line), but when plotted graphically, they exhibit vastly different distributions and relationships between the variables. It famously illustrates the importance of visualizing data before relying solely on summary statistics.
• Each dataset contains 11 data points and 2 variables, X and Y
• Each dataset has nearly identical descriptive stats and correlations
• BUT the datasets have vastly different scatterplots
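The point can be checked numerically with the first two of Anscombe's datasets (the published values, also built into R as the `anscombe` data frame):

```python
# Anscombe's datasets I and II: same x, very different y patterns.
def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]  # I: linear cloud
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]   # II: a curve

# Nearly identical summary statistics...
print(mean(y1), mean(y2))                  # both about 7.50
print(pearson_r(x, y1), pearson_r(x, y2))  # both about 0.816
# ...yet a scatterplot of II shows a parabola, not a linear trend.
```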
Range Restriction
A statistical phenomenon that occurs when the variability of scores for one or both variables in a correlation analysis is artificially limited compared to the true population variability. This often leads to an underestimation of the true correlation coefficient, as the full range of the relationship is not observed.
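A small simulation with fixed, made-up noise illustrates the effect: restricting x to its upper range shrinks the observed correlation:

```python
# Range restriction demo; the "noise" values are fixed so the result is deterministic.
def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = list(range(1, 21))
noise = [1.2, -0.8, 0.5, -1.5, 0.9, -0.3, 1.8, -1.1, 0.2, -0.6,
         1.4, -1.9, 0.7, -0.2, 1.1, -1.4, 0.3, -0.9, 1.6, -0.5]
y = [xi + ni for xi, ni in zip(x, noise)]

r_full = pearson_r(x, y)

# Artificially restrict the range: keep only cases with x >= 15.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi >= 15]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(r_full, r_restricted)  # the restricted r is noticeably smaller
```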
Spearman’s ρ vs. Pearson’s r
Use Spearman’s ρ instead of Pearson’s r when any of the following
are true about your data:
• The relationship is monotonic but not linear
• The data are at least ordinal or rank-ordered
• The variables have outliers that could distort Pearson’s r
• The assumptions of Pearson’s r aren’t met
• Takeaway: If your scatterplot isn’t a straight-line cloud or if your
data are ranks, ordinal, or contain outliers, check Spearman’s ρ—
it may give a truer measure of the association’s strength.
What’s a “Strong” Correlation?
• Depends on both the value of r and the research domain or
context in which it’s being used
Conventional and often cited guidelines in the behavioral and
social sciences:
• r ≈ |.10| is weak
• r ≈ |.30| is moderate
• r ≥ |.50| is strong
You find a correlation of r = .50 between two variables in your
dataset.
• Question: What factors do you need to consider to determine
whether this correlation is “strong” or “weak”?
Context and field of study:
A moderate correlation in one field might be considered strong in another. For example, a correlation of r = .50 could be a significant finding in the social sciences, where complex human behavior is involved.
Consequences of error:
The "strength" of a correlation also depends on how much error you can tolerate in your predictions.
Sample size:
With a large enough sample size, a correlation of r = .50 could be statistically significant and unlikely to be due to random chance.
Outliers:
Be aware of outliers in your data, as a single extreme data point can significantly affect the correlation coefficient.
Statistical significance:
While r = .50 is a moderate correlation, its statistical significance (p-value) determines the likelihood that this result is real and not due to random chance. A result with a very low p-value is more likely to be "strong" in a statistical sense.
You calculate a correlation of r = .82 between two variables. Your
colleague says, “That’s all we need to know — let’s skip the
scatterplot.”
Question: Why is it still important to examine the scatterplot, and
what could you learn from it that the correlation alone would not
reveal?
Examining the scatterplot is crucial because it provides visual insight into the relationship between the two variables. It can reveal outliers, non-linearity, heteroscedasticity, or departures from bivariate normality, any of which could affect the interpretation and validity of the correlation.
The Y variable in your correlation has low variability. Although this
variable was measured on a scale from 1 to 100, your dataset only
contains scores that range from 80 to 100.
• Question: How is this likely to affect the correlation, and why?
Low variability in the Y variable will likely lower the correlation coefficient, leading to an underestimation of the true relationship between the variables. This is because correlation measures how two variables change together, and if one variable doesn't change much, it's difficult to detect a strong pattern or relationship.