Final Notes

Correlation and Regression

Correlation

  • Introduction to Correlation

    • Concept of a correlation coefficient originates from Sir Francis Galton.

    • It was proposed to quantify relationships between generations.

  • Example

    • Data collected by Galton on the relationship between parent and child heights.

    • Demonstration: As parent heights increase, child heights also tend to increase.

    • For example, if parent heights averaged 72 inches (approximately 183 cm), the child's height would be predicted to be about 70.4 inches (approximately 179 cm), closer to the overall mean (an instance of regression toward the mean).

  • Ellipse Representation

    • You can draw ellipses around the line of best fit to illustrate the values that fall one standard deviation (smaller ellipse) and two standard deviations (larger ellipse) away from the mean.

    • The smaller the ellipse, the less error there is in the prediction.

    • This visual, combined with the line angle, helps in determining the correlation coefficient.

  • Definitions of Correlation

    • Correlation Types: Can be positive, negative, or null.

    • Positive Correlation: Both variables move in the same direction.

    • Negative Correlation: As one variable increases, the other decreases.

    • Null Correlation: No relationship between the variables.

  • Coefficient Range

    • Correlation coefficients range from -1 to +1.

    • The absolute value of the correlation coefficient indicates the "magnitude of the relationship".

  • Types of Analysis

    • Different correlation coefficients are suited for different datasets.

    • The best place to start correlational analysis is with a scatterplot.

    • Helps determine the direction (positive or negative) and form of the relationship (linear or non-linear).

    • Estimation of Strength: The approximate strength of the relationship can also be judged visually from the scatterplot.

  • Perfect Correlation Definitions

    • Perfect Positive Correlation (+1.0):

    • All data points fall in a straight line from the lower left-hand corner to the upper right-hand corner of the plot.

    • With both variables plotted on the same standardized scale, the points form a 45-degree line with no prediction errors around the line of best fit.

    • Perfect Negative Correlation (-1.0):

    • Example might include money spent versus money left after spending (no deviations around the line of best fit).

  • Non-linear Relationships

    • Form is crucial since correlation coefficients discussed mainly describe linear relationships.
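The money-spent example of a perfect negative correlation can be checked numerically. This is a minimal sketch with made-up budget figures: money left is an exact linear function of money spent, so every point falls on the line of best fit.

```python
import numpy as np

# Hypothetical budget figures: money left is exactly (budget - money spent),
# so all points lie on one straight, downward-sloping line.
budget = 100.0
spent = np.array([10.0, 25.0, 40.0, 60.0, 85.0])
left = budget - spent

r_negative = np.corrcoef(spent, left)[0, 1]       # exactly -1 (up to floating point)
r_positive = np.corrcoef(spent, 2 * spent)[0, 1]  # exactly +1: same direction, no scatter
print(r_negative, r_positive)
```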

Types of Coefficients

  • Association vs Correlation

    • Association: Any relationship between two variables; it may be linear or non-linear and applies to continuous or categorical variables and to parametric or non-parametric data.

    • Correlation: Specifically refers to a linear relationship, typically quantified using a correlation coefficient (both parametric and non-parametric variants exist).

    • Bivariate Correlation: All correlations covered focus on two variables.

  • Types of Correlation Coefficients

    • Pearson Product-Moment Correlation Coefficient (Pearson's r)

    • The most common correlation coefficient, appropriate for interval or ratio variables.

    • Spearman Rank Order Correlation Coefficient

    • Adapted from Pearson's r for ordinal (ranked) data and for non-normal distributions; the values are converted to ranks before the coefficient is computed.

    • Handles extreme values by analyzing ranks, implicitly controlling for outliers.

    • If the data are ranks (1 to n, where n = number of subjects) with no ties, Spearman's rho is identical to Pearson's r computed on those ranks.

    • Point-Biserial Correlation Coefficient

    • Used for relationships between one interval/ratio variable and one dichotomous variable (two categories).

    • Phi Coefficient

    • Used when both variables are measured on a dichotomous scale.

    • When variables are dummy coded, the Phi coefficient equals Pearson's r.
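The equivalence of phi and Pearson's r under dummy coding can be verified directly. The 0/1 data below are hypothetical:

```python
import numpy as np

# Hypothetical dichotomous data, dummy coded 0/1 (e.g., group and pass/fail).
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

# Phi from the 2x2 table: (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
a = np.sum((x == 0) & (y == 0))
b = np.sum((x == 0) & (y == 1))
c = np.sum((x == 1) & (y == 0))
d = np.sum((x == 1) & (y == 1))
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

r = np.corrcoef(x, y)[0, 1]  # Pearson's r on the same 0/1 codes
print(phi, r)  # the two values are equal
```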

Assumptions of Pearson's r

  • The relationship represented must be linear (assess using scatterplots).

  • Both variables must be interval or ratio (equal conceptual distances between scale points).

  • If working with nominal or ordinal data, use other coefficients such as Spearman's rho.

  • Requires bivariate normality: the joint distribution of the two variables must be normal, not just each variable separately.

  • Variances

    • Assumes homoscedasticity: Variance around each x value is roughly equivalent.

    • Residuals: Differences between predicted and observed values in the dataset.

Calculating Pearson's r (Conceptual)

  • Definition

    • Pearson’s r serves as a standardized covariance between x and y.

  • The formula is the sum of the cross-products of the z-scores of the two variables, divided by n - 1 (for an unbiased estimate): r = Σ(z_x · z_y) / (n - 1). The resulting coefficient is the average cross-product of z-scores.

  • Statistical significance tested with a t-test, with degrees of freedom being the number of pairs minus two.

  • Example #1:

    • Calculate Pearson's r between average sleep (minutes) and final term average for 20 students to find the correlation (approx. 0.49).
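The z-score definition of Pearson's r and its t-test can be sketched in code. The sleep and grade values below are hypothetical stand-ins (the notes give only the result, r ≈ 0.49, not the data):

```python
import numpy as np
from scipy import stats

# Hypothetical data standing in for the sleep-vs-grades example.
rng = np.random.default_rng(0)
sleep = rng.normal(480, 60, size=20)                # average nightly sleep, minutes
grades = 0.05 * sleep + rng.normal(50, 8, size=20)  # final term average

# Pearson's r as the average cross-product of z-scores, using n - 1.
zx = (sleep - sleep.mean()) / sleep.std(ddof=1)
zy = (grades - grades.mean()) / grades.std(ddof=1)
r = np.sum(zx * zy) / (len(sleep) - 1)

# Significance: t-test with df = number of pairs minus two.
n = len(sleep)
t = r * np.sqrt((n - 2) / (1 - r ** 2))
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(r, t, p)
```

The same r and p come from `scipy.stats.pearsonr(sleep, grades)`, which is the usual way to run this in practice.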

Significance of Correlation

  • To ascertain if the relationship is statistically significant, evaluate whether the obtained correlation's absolute value exceeds a critical value.

    • Denote significant correlation:

    • r_obt > r_crit or r_obt < -r_crit.

Critical Values of Pearson's r

  • For a one-tailed hypothesis at alpha = 0.05, look up the critical value in a table using df = n - 2. For a two-tailed hypothesis, split alpha across both tails (0.025 per tail), which yields a larger critical value.
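The table values come from the t distribution with df = n - 2, so they can be reproduced rather than looked up. A sketch (the `r_critical` helper is illustrative, not a standard library function):

```python
import math
from scipy import stats

def r_critical(n_pairs, alpha=0.05, two_tailed=True):
    """Critical value of Pearson's r, derived from the t distribution
    with df = n - 2. (Illustrative helper, not a library function.)"""
    df = n_pairs - 2
    tail = alpha / 2 if two_tailed else alpha
    t_crit = stats.t.ppf(1 - tail, df)
    return t_crit / math.sqrt(t_crit ** 2 + df)

print(round(r_critical(20), 3))  # two-tailed, alpha = .05, df = 18: about 0.444
```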

Effect Size

  • The Coefficient of Determination (r²) shows the percentage of variance in the dependent variable explained by the independent variable; for example, if r² = 0.24, then 24% of the variance in grades is explained by sleep.

Ranked Data

  • Spearman’s rho Calculation:

    • Ordinal (ranked) data calls for Spearman's rho; as a non-parametric method, it does not assume normally distributed data.
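A sketch of both properties noted above, on hypothetical data: Spearman's rho is just Pearson's r computed on the ranks, and it resists an extreme value because only the value's rank enters the calculation.

```python
import numpy as np
from scipy import stats

# Hypothetical data with one extreme value: the outlier drags Pearson's r
# around, while Spearman's rho only sees its rank.
x = np.arange(1, 11)
y = np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 90])  # last value is extreme

rho, p_rho = stats.spearmanr(x, y)
r, p_r = stats.pearsonr(x, y)

# With no ties, Spearman's rho equals Pearson's r computed on the ranks.
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(rho, r_on_ranks, r)
```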

Bivariate Count Data

  • Focus on determining associations between two categorical variables via the Chi-Square Test. The McNemar Test is a special case that assesses change on a dichotomous variable in within-subjects (repeated-measures) designs.

Conducting the Chi-Square Test

  • Tests for independence between two categorical variables using a cross-tabulation (contingency table). Expected counts are computed under the assumption that no association exists between the two variables.

Example of a Chi-Square Calculation
  • Compare observed counts with the counts expected under independence across categories; for instance, comparing living arrangements and Parkinson's disease stages among participants to determine whether the two variables are associated.
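A sketch of the test using SciPy, with hypothetical counts standing in for the living-arrangement-by-disease-stage table (the notes do not give the actual data):

```python
import numpy as np
from scipy import stats

# Hypothetical cross-tabulation. Rows: living arrangement; columns: disease stage.
observed = np.array([[30, 15, 5],
                     [20, 25, 25]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, p, dof)  # dof = (rows - 1) * (columns - 1) = 2
print(expected)      # counts expected if the two variables were independent
```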

Summary of Steps
  1. Hypothesis Statements: Define null and alternative based on the relationship sought.

  2. Choosing Test Statistics: Employ the Chi-Square test for comparisons of categorical variables, applying post hoc analyses when the result is significant.

  3. Interpreting Results: Compare the obtained chi-square value against the critical value to decide whether to reject the null hypothesis.

  • Significant outcome: follow-up analyses (standardized residuals) identify the specific cells driving the overall chi-square, with the magnitude of association measured by phi or Cramer's V.
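The phi and Cramer's V measures mentioned above can be computed directly from the chi-square. The table below is hypothetical and the `cramers_v` helper is illustrative, not a library function:

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)));
    reduces to the phi coefficient for a 2x2 table."""
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape[0] - 1, table.shape[1] - 1)
    return np.sqrt(chi2 / (n * k))

# Hypothetical 2x2 table: phi and Cramer's V coincide.
table = np.array([[20, 10],
                  [10, 20]])
chi2 = stats.chi2_contingency(table, correction=False)[0]
phi = np.sqrt(chi2 / table.sum())
print(cramers_v(table), phi)  # identical for a 2x2 table
```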

ANOVA – Analysis of Variance

Introduction to ANOVA
  • ANOVA compares means across multiple groups, determining whether at least one group mean differs significantly from others.

  • Ensure to employ the appropriate test for research objectives and hypotheses.

Testing Treatment Variability
  • Total variability is partitioned into treatment (between-group) and error (within-group) variance; the F-ratio of these two is assessed for significance.

Concluding Comments
  • ANOVA encompasses assumptions of normality, independence, and homogeneity of variance crucial for robust outcomes. Using post hoc tests allows for further exploration of differences among grouped means following ANOVA findings.
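The partitioning of variability described above can be sketched on hypothetical group scores, checking the hand computation against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three treatment groups.
g1 = np.array([4, 5, 6, 5, 4])
g2 = np.array([6, 7, 8, 7, 6])
g3 = np.array([8, 9, 10, 9, 8])

f, p = stats.f_oneway(g1, g2, g3)

# The same F-ratio, obtained by partitioning total variability into
# treatment (between-group) and error (within-group) components.
groups = [g1, g2, g3]
scores = np.concatenate(groups)
grand_mean = scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = len(scores) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)
print(f, f_manual, p)  # the two F values agree
```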

This guide provides a structured overview of correlation, regression, and analysis of variance necessary for understanding and applying statistical methodologies effectively. Each section aims to support the learning and application of statistical principles in research contexts.

  1. Cross-Tabulation Comparison: In a cross-tabulation, we are comparing individual categories across two categorical variables to assess their association.

  2. Chi-Square Omnibus Test: The chi-square is an omnibus test, which means a significant chi-square cannot tell us which specific cells differ from our expectations without further analysis.

  3. Null Hypothesis in Bivariate Chi-Square: Failing to reject the null hypothesis indicates that the variables are not significantly associated with each other.

  4. Degrees of Freedom for Chi-Square Test: For a chi-square test of independence, the degrees of freedom are calculated as df = (rows - 1) * (columns - 1).

  5. Chi-Square Critical Value: Given 4 rows and 5 columns at alpha 0.05, the chi-square critical value to compare is typically 21.0261.

  6. Standardized Residuals: We calculate standardized residuals to identify the cells that significantly contribute to the overall chi-square statistic.

  7. Information from Standardized Residuals: From standardized residuals, we learn which cell(s) differ from expectation, among other insights about overall matrix conformity.

  8. McNemar Test Appropriate Use: The McNemar test is most appropriate when assessing change over two time points on a single dichotomous variable.

  9. Scenario for McNemar Test: The McNemar test is used to analyze repeated measurements within the same group.

  10. Degrees of Freedom for McNemar Test: The degrees of freedom for a McNemar test is df = 1.

  11. Observed Values in McNemar Test: If there is no change between time 1 and time 2, all cases fall on the diagonal, so the discordant (off-diagonal) cells are 0.

  12. Expected Values in McNemar Test: Under the null hypothesis of no change, the discordant cases are expected to split evenly between the two off-diagonal cells, i.e., (b + c) / 2 in each.

  13. Magnitude of Association for Larger Matrices: When computing the magnitude of association for matrices larger than 2x2, use Cramer's V.

  14. Identical Phi Coefficient and Cramer's V: The phi coefficient and Cramer's V will be identical when at least one of the variables has only two categories.

  15. Testing Significance of Association: To test the significance of the association between categorical variables, we would typically employ a chi-square test.

  16. Chi-Square Significance and Phi Coefficient: If the chi-square is statistically significant, the phi coefficient may also be significant but could potentially be influenced by sample size.

  17. Using Cramer's V for Larger Matrices: Cramer's V is advantageous for larger matrices as it corrects for the size of the cross-tabulation.

  18. Chi-Square Value from Cramer's V: To obtain the chi-square value from a Cramer's V of 0.10 in a 4 x 3 cross-tabulation, invert the formula: chi-square = V² · n · min(rows - 1, columns - 1) = (0.10)² · n · 2; without the sample size n, the chi-square cannot be determined.

  19. Chi-Square Value of Exam Success: To find the absolute value for the obtained chi-square related to the relationship between sex and success on the medical exam, calculation yields 3.84.

  20. Magnitude of Association in Exam Success: The magnitude of association between sex and success on the first medical school exam is calculated to be 0.06, indicating a weak association.
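Several of the McNemar points above (df = 1, only the cells that change entering the test) can be sketched with hypothetical counts; this is the classic uncorrected form of the test:

```python
from scipy import stats

# Hypothetical paired dichotomous data (same subjects at time 1 and time 2):
#                 time2 = yes   time2 = no
# time1 = yes        a = 20       b = 12
# time1 = no         c = 4        d = 24
b, c = 12, 4  # only the discordant cells (cases that changed) enter the test

chi2 = (b - c) ** 2 / (b + c)   # McNemar chi-square, df = 1
p = stats.chi2.sf(chi2, df=1)
print(chi2, p)
```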