Regression Analysis Example: Predicting Course Quality

The example is based on an American study from the HA textbook, adapted to an Australian university context.
The goal is to determine which factors predict higher student ratings of course quality (CO).
Students rate overall course quality on a Likert scale from 1 to 5.

Statistical software packages (SPSS, R, JASP, Jamovi) provide three types of information:
- Descriptive statistics: Means, standard deviations, and sample sizes (N).
- Inferential statistics of the equation (overall model): R^2 value, indicating the proportion of variance in course quality explained by all five predictors combined, and its significance.
- Inferential statistics of the predictors: Whether each predictor uniquely contributes to predicting course quality (Y).

Descriptive statistics include means and standard deviations.
Correlations show the relationships between independent variables and the dependent variable (course quality).
Exam quality, lecturer knowledge, and lecturer ability each correlate above 0.6 with course quality; these are strong relationships, and they are statistically significant (p < 0.05).
Correlations among the independent variables (e.g., exam quality and expected grade at 0.61, lecturer ability and exam quality at 0.72) may suggest multicollinearity or redundancy.

Means and standard deviations can provide insights into the distribution of variables.
- For Enrol, a mean of 88 and a standard deviation of 145 indicate a positively skewed distribution, as subtracting two standard deviations from the mean results in a negative enrollment number (impossible).
- Variables measured on a 1-5 Likert scale are also likely to be skewed if two standard deviations from the mean exceed the scale's boundaries (e.g., course quality has mean 3.8 and SD 0.66, so 3.8 + 2 x 0.66 = 5.12, which is above the scale maximum of 5).
Non-normality can be a concern, and researchers might need to clean up data (remove outliers or transform data) before proceeding.
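The two-standard-deviation check described above can be sketched in a few lines of Python. The function name `range_check` and the choice to treat Enrol as having no upper bound are illustrative assumptions; the means and standard deviations are the ones quoted in these notes. It is a rough heuristic, not a formal test of normality.

```python
def range_check(mean, sd, lower, upper, k=2):
    """Report whether mean +/- k*sd reaches past the possible range.

    If the interval crosses a bound, the scores are probably piled up
    against that bound, which hints at a skewed distribution.
    """
    low, high = mean - k * sd, mean + k * sd
    return {
        "interval": (round(low, 2), round(high, 2)),
        "crosses_lower": low < lower,
        "crosses_upper": high > upper,
    }

# Enrol: a count cannot be negative; no upper bound is assumed here.
enrol = range_check(88, 145, lower=0, upper=float("inf"))
# Course quality: rated on a 1-5 Likert scale.
quality = range_check(3.8, 0.66, lower=1, upper=5)

print(enrol)    # interval (-202, 378): the lower bound is crossed
print(quality)  # interval (2.48, 5.12): the upper bound is crossed
```

A crossed lower bound (Enrol) points to a long right tail; a crossed upper bound (course quality) points to scores bunched near the top of the scale.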

The R^2 value indicates the proportion of variance in course quality explained by all five predictors.
Adjusted R^2 corrects R^2 for the number of predictors relative to the sample size; it is typically lower than R^2, especially with small sample sizes.
Report the adjusted R^2 if it is substantially different from the R^2.
Sample Size Calculation:
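- The sample size can be recovered from the reported degrees of freedom.
- The regression has 5 degrees of freedom (the number of independent variables) and the error term has 44, so the total is 5 + 44 = 49. The total degrees of freedom equal N - 1, so N - 1 = 49 and N = 50.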
  - df = N - 1 can be rearranged to N = df + 1.
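As a small worked sketch, the sample-size arithmetic and the usual adjusted R^2 formula, 1 - (1 - R^2)(N - 1)/(N - k - 1), can be written out as follows. The function names are made up for illustration, the R^2 of 0.755 is the value reported for this model, and it is an assumption that the software used in the lecture applies exactly this adjustment.

```python
def sample_size_from_df(df_model, df_error):
    """Total df = df_model + df_error = N - 1, so N = total df + 1."""
    return df_model + df_error + 1

def adjusted_r2(r2, n, k):
    """Common adjusted R^2 formula for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = sample_size_from_df(df_model=5, df_error=44)
adj = adjusted_r2(0.755, n=n, k=5)

print(n)              # 50
print(round(adj, 3))  # 0.727
```

With 50 cases and five predictors the adjustment lowers R^2 by about 0.03; the gap would be larger with fewer cases or more predictors.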

R^2 = 0.755 represents a large effect size.
An F value much larger than 1 suggests a significant finding; the p-value confirms whether it is.
A p-value < 0.05 indicates that the five variables, considered collectively, significantly predict course quality.
An R^2 greater than 0.26 is considered a large effect, according to Cohen's guidelines.
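To see how these numbers hang together, the sketch below recovers the overall F value from R^2, the sample size (N = 50), and the number of predictors (k = 5), and converts R^2 to Cohen's f^2 = R^2 / (1 - R^2). The function names are illustrative assumptions; the formulas are the standard ones for the overall test of a multiple regression.

```python
def f_from_r2(r2, n, k):
    """Overall F for a regression with k predictors and n cases.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with df = (k, n - k - 1).
    """
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def cohens_f2(r2):
    """Cohen's f^2 effect size corresponding to a given R^2."""
    return r2 / (1 - r2)

f_value = f_from_r2(0.755, n=50, k=5)
f2 = cohens_f2(0.755)
f2_threshold = cohens_f2(0.26)  # the R^2 cut-off quoted above

print(round(f_value, 1))       # 27.1
print(round(f2, 2))            # 3.08
print(round(f2_threshold, 2))  # 0.35
```

The F of about 27 is far above 1, and the f^2 of about 3.08 is well beyond the 0.35 that corresponds to R^2 = 0.26.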

Lecturer knowledge and lecturer ability are identified as significant predictors because their p-values are below 0.05.
Unstandardized coefficients (B) are sometimes reported but are hard to compare across predictors because each is expressed in the units of its own predictor.
Standardized beta coefficients allow for comparison of the strength of each predictor.
Positive beta values indicate a positive relationship: as lecturer knowledge and lecturer ability increase, perceived course quality also increases.
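The link between the two kinds of coefficient can be sketched as beta = B x (SD of predictor / SD of outcome). In the example below the B values are hypothetical and chosen only for illustration; they are not results from the study. The standard deviations (145 for Enrol, 0.66 for course quality) are the ones quoted in these notes.

```python
import math

def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized coefficient B to a standardized beta."""
    return b * sd_x / sd_y

SD_QUALITY = 0.66  # SD of course quality, from the descriptive statistics

# Hypothetical B for Enrol measured in single students (SD = 145).
beta_students = standardized_beta(0.002, sd_x=145, sd_y=SD_QUALITY)

# The same hypothetical effect with Enrol measured in hundreds of
# students: B is 100 times larger and the SD is 100 times smaller.
beta_hundreds = standardized_beta(0.2, sd_x=1.45, sd_y=SD_QUALITY)

print(round(beta_students, 3), round(beta_hundreds, 3))
print(math.isclose(beta_students, beta_hundreds))  # True
```

Changing the unit of measurement changes B by a factor of 100 but leaves beta unchanged, which is why standardized coefficients are the ones used to compare predictors.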

A standard regression was performed using the variables to predict course quality.
Results: The set of five variables significantly predicted course quality, R^2 = 0.76, N = 50, with a significant F value, F(5, 44) ≈ 27 (p < 0.001).
Lecturer ability and lecturer knowledge made statistically significant unique contributions to the prediction.

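Standardized (z) residuals are part of the assumption checks; residual plots are examined to identify potential outliers.
Reference lines are drawn at +1.96 and -1.96 on the residual plot, the range within which about 95% of standardized residuals are expected to fall.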
Cases outside these boundaries may be considered outliers and statistically examined using measures such as Mahalanobis distance, leverage value, or Cook's distance.
Consistency in the choice of outlier detection measure across studies is recommended.
Scatter plots (e.g., course quality vs. enrollment) can visually reveal outlying scores.
Mahalanobis distance is used in this course for detecting multivariate outliers.
A case is identified as a multivariate outlier when its Mahalanobis distance is significant at p < 0.001.
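A minimal sketch of the Mahalanobis screening follows. To stay within the Python standard library it handles only two variables, and the data are simulated (50 random cases plus one deliberately extreme case), so nothing here reproduces the lecture's data set. The squared distance is conventionally compared with a chi-square distribution whose df equals the number of variables; for df = 2 the p-value is exactly exp(-D^2 / 2). With the five predictors in the lecture's model the df would be 5 (critical value about 20.52 at p < 0.001), and a statistics package would normally do this calculation.

```python
import math
import random

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse 2x2 covariance matrix written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2)
    return distances

ALPHA = 0.001
critical = -2 * math.log(ALPHA)  # chi-square cut-off for df = 2, about 13.82

# Simulated data: 50 ordinary cases and one extreme case at index 50.
random.seed(1)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
points.append((8.0, -8.0))

d2 = mahalanobis_sq_2d(points)
# p-value for df = 2 is exp(-D^2 / 2); flag cases with p < 0.001.
flagged = [i for i, d in enumerate(d2) if math.exp(-d / 2) < ALPHA]
print(round(critical, 2), flagged)
```

The deliberately extreme case (index 50) is flagged. A flagged case is a candidate for closer inspection, not an automatic deletion.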

Before removing outliers, lecturer knowledge and teaching ability were the best predictors.
After removing 7 cases, course enrollment becomes a significant predictor, while lecturer knowledge is no longer significant.
Lecturer teaching ability remains an important predictor, with an increased beta.
The R^2 increased from 0.755 to 0.824 after removing the outliers, and the F value went from 27 to 34.
Reporting should include mentioning the removal of outliers, how many cases were removed, and why they were removed.

The motivation for removing outliers should not be solely to achieve significance.
Scientists are responsible for ensuring that findings are reliable and valid.
Data cleaning is necessary to obtain correct conclusions.

The aim is to make sure that the answers you get to your questions are reliable and valid and, to that end, help move science forward.