Regression Analysis Example: Predicting Course Quality
Course Quality Prediction Example
Introduction
The example is based on an American study from the HA textbook, adapted to an Australian university context.
The goal is to determine which factors predict higher student ratings of course quality (CO).
Students rate overall course quality on a Likert scale from 1 to 5.
Predictors (Independent Variables)
Five predictors (independent variables or X's) are included in the study:
Enrolled: Number of students enrolled in the course.
Rationale: Larger enrollments might lead to more resources (more staff, better equipment), or smaller enrollments might foster a sense of community and individual attention.
Exam Quality: Student satisfaction with the exam content (alignment with topics covered).
Example: If an exam in PRM is supposed to have a 50/50 split between ANOVA and regression, but it's 80% regression, students may rate the exam quality poorly.
In PRM, efforts are made to ensure an equal number of multiple-choice questions for each lecture.
Grade Expected: The grade students expect to receive in the course, based on their previous academic history.
Lecturer Knowledge: How much the lecturer knows about the subject matter.
Lecturer Ability: How well the lecturer can teach the subject matter.
Note: Lecturer knowledge and ability are not necessarily mutually exclusive.
Statistical Output
Statistical software packages (SPSS, R, JASP, Jamovi) provide three types of information:
Descriptive statistics: Means, standard deviations, and sample sizes (N).
Inferential statistics of the equation (overall model): R^2 value, indicating the proportion of variance in course quality explained by all five predictors combined, and its significance.
Inferential statistics of the predictors: Whether each predictor uniquely contributes to predicting course quality (Y).
Descriptives and Correlations
Descriptive statistics include means and standard deviations.
Correlations show the relationships between independent variables and the dependent variable (course quality).
Exam quality, lecturer knowledge, and lecturer ability show correlations above 0.6 with course quality, indicating significant relationships (p < 0.05).
Correlations among independent variables (e.g., exam quality and grade expected at 0.61, lecturer ability and exam quality at 0.72) may suggest multicollinearity or redundancy.
Assessing Normality from Descriptives
Means and standard deviations can provide insights into the distribution of variables.
For Enrolled, a mean of 88 and a standard deviation of 145 indicate a positively skewed distribution: subtracting even one standard deviation from the mean gives a negative enrollment number (88 - 145 = -57), which is impossible.
Variables measured on a 1-5 Likert scale are also likely to be skewed if two standard deviations from the mean exceed the scale's boundaries. For example, course quality has a mean of 3.8 and SD of 0.66, so 3.8 + 2 × 0.66 = 5.12, which is above the maximum of 5 and suggests a negative skew.
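The rule of thumb above can be sketched in a few lines of Python. This is only an illustrative check, not a formal test of normality; the function names are made up for this example, and the cut-off of two standard deviations is the heuristic described in these notes.

```python
def implied_range(mean, sd, k=2):
    """Return the interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


def exceeds_bounds(mean, sd, lower, upper, k=2):
    """Report whether the implied interval falls outside the possible values.

    Returns (below_lower, above_upper) as a pair of booleans.
    """
    low, high = implied_range(mean, sd, k)
    return low < lower, high > upper


# Enrolled: a count cannot be below 0 and has no fixed upper limit.
enrol_low, enrol_high = exceeds_bounds(88, 145, lower=0, upper=float("inf"))

# Course quality: Likert scale bounded at 1 and 5.
cq_low, cq_high = exceeds_bounds(3.8, 0.66, lower=1, upper=5)

print("Enrolled below 0:", enrol_low)          # suggests positive skew
print("Course quality above 5:", cq_high)      # suggests negative skew
```

A result that spills below the lower bound hints at positive skew (a long right tail), and one that spills above the upper bound hints at negative skew. A histogram or a skewness statistic would still be needed to confirm this.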
Non-normality can be a concern, and researchers might need to clean up data (remove outliers or transform data) before proceeding.
Overall Model Fit
The R^2 value indicates the proportion of variance in course quality explained by all five predictors.
Adjusted R^2 corrects R^2 for the number of predictors relative to the sample size; it is typically lower than R^2, especially with small sample sizes.
Report the adjusted R^2 if it is substantially different from the R^2.
Sample Size Calculation:
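The sample size can be recovered from the degrees of freedom reported with the F test.
Given degrees of freedom of 5 (number of independent variables) and 44 (error term), the total degrees of freedom are 5 + 44 = 49, so N - 1 = 49 and N = 50.
In general, total df = N - 1, which can be rearranged to N = total df + 1.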
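As a worked illustration of the two ideas above, the sketch below recovers N from the degrees of freedom and then applies the usual adjusted R^2 formula, 1 - (1 - R^2)(N - 1)/(N - k - 1), where k is the number of predictors. The function names are hypothetical, and the R^2 of 0.755 is the value reported later in these notes.

```python
def sample_size_from_df(df_model, df_error):
    """Total df = df_model + df_error = N - 1, so N = total df + 1."""
    return df_model + df_error + 1


def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors and n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = sample_size_from_df(df_model=5, df_error=44)
adj = adjusted_r2(r2=0.755, n=n, k=5)

print("N =", n)
print("Adjusted R^2 =", round(adj, 3))
```

With only 50 cases and five predictors, the adjustment lowers R^2 by roughly 0.03, which shows why the adjusted value matters more in small samples.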
Interpreting R-squared and F Value
R^2 = 0.755 represents a large effect size.
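An F value well above 1 suggests that the model explains more variance than would be expected by chance; whether it is significant is judged from the associated p-value.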
A p-value < 0.05 indicates that the five variables, considered collectively, significantly predict course quality.
An R^2 greater than 0.26 is considered a substantial effect, according to Cohen's estimates.
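The link between R^2, the F value, and Cohen's benchmark can be checked by hand. The sketch below assumes the standard formulas F = (R^2 / k) / ((1 - R^2) / (N - k - 1)) and Cohen's f^2 = R^2 / (1 - R^2); the function names are illustrative only.

```python
def f_from_r2(r2, k, n):
    """F statistic for the overall model with k predictors and n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def cohens_f2(r2):
    """Cohen's f^2 effect size for a multiple regression."""
    return r2 / (1 - r2)


f_value = f_from_r2(r2=0.755, k=5, n=50)
f2 = cohens_f2(0.755)
f2_at_benchmark = cohens_f2(0.26)

print("F(5, 44) =", round(f_value, 2))
print("f^2 for R^2 = 0.755:", round(f2, 2))
print("f^2 for R^2 = 0.26:", round(f2_at_benchmark, 2))
```

The computed F of about 27 agrees with the F value quoted for this model further on, and an R^2 of 0.26 corresponds to an f^2 of about 0.35, which is Cohen's conventional threshold for a large effect.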
Individual Predictors
Lecturer knowledge and lecturer ability are identified as significant predictors because their p-values are below 0.05.
Unstandardized coefficients (B) are sometimes reported but can be difficult to compare because each is expressed in the units of its own predictor.
Standardized beta coefficients allow for comparison of the strength of each predictor.
Positive beta values indicate a positive relationship: as lecturer knowledge and lecturer ability increase, perceived course quality also increases.
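To see why standardized coefficients are easier to compare, the sketch below converts unstandardized coefficients with the standard relationship beta = B × (SD of predictor / SD of outcome). The standard deviations for Enrolled (145) and course quality (0.66) come from the descriptives above; the B values and the SD for lecturer knowledge are hypothetical numbers chosen only for illustration.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized coefficient B to a standardized beta."""
    return b * sd_x / sd_y


SD_COURSE_QUALITY = 0.66  # from the descriptives in these notes

# Hypothetical B values (not taken from the study output).
beta_enrolled = standardized_beta(b=0.001, sd_x=145, sd_y=SD_COURSE_QUALITY)
beta_knowledge = standardized_beta(b=0.40, sd_x=0.50, sd_y=SD_COURSE_QUALITY)

print("Enrolled: B = 0.001, beta =", round(beta_enrolled, 2))
print("Lecturer knowledge: B = 0.40, beta =", round(beta_knowledge, 2))
```

In this made-up case the two B values differ by a factor of 400, yet the standardized betas are of similar size, because one predictor is measured in students and the other on a rating scale.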
Reporting Results
A standard regression was performed using the variables to predict course quality.
Results: The set of five variables significantly predicted course quality, R^2 = 0.76, with a sample size of 50 and a significant F value (p < 0.01 or p < 0.001).
Lecturer ability and lecturer knowledge made statistically significant unique contributions to the prediction.
Non-Significant Multiple Regression
A non-significant multiple regression could be due to:
Low sample size.
Small effect sizes.
Inappropriate inclusion of predictors (redundancy).
Poor measurement of the dependent variable (course quality).
The model not applying to the specific population.
Residual Plots and Outliers
Z-residuals are part of the assumption checks; residual plots are examined to identify potential outliers.
Reference lines are drawn at +1.96 and -1.96 on the residual plot.
Cases outside these boundaries may be considered outliers and statistically examined using measures such as Mahalanobis distance, leverage value, or Cook's distance.
Consistency in the choice of outlier detection measure across studies is recommended.
Scatter plots (e.g., course quality vs. enrollment) can visually reveal outlying scores.
Mahalanobis distance is used in this course for detecting multivariate outliers.
The criterion to reject the null hypothesis and identify an outlier is p < 0.001.
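The first screening step, flagging cases whose standardized residuals fall outside ±1.96, can be sketched as follows. This is a minimal illustration that assumes standardized residuals are raw residuals divided by the residual standard error; the residual values are invented, and the example assumes a single predictor so that the error df is N - 2. It does not compute Mahalanobis distance.

```python
import math

CUTOFF = 1.96  # boundary drawn on the residual plot


def z_residuals(residuals, df_error):
    """Divide each raw residual by the residual standard error."""
    sse = sum(e * e for e in residuals)
    s = math.sqrt(sse / df_error)
    return [e / s for e in residuals]


def flag_cases(z_values, cutoff=CUTOFF):
    """Return the positions of cases whose |z| exceeds the cutoff."""
    return [i for i, z in enumerate(z_values) if abs(z) > cutoff]


# Hypothetical raw residuals for 10 cases (they sum to zero, as OLS residuals do).
raw = [0.1, -0.2, 0.15, -0.1, 1.2, -0.3, 0.05, -0.25, 0.2, -0.85]
z = z_residuals(raw, df_error=len(raw) - 2)
flagged = flag_cases(z)

print("Flagged case positions:", flagged)
```

Cases flagged this way are only candidates. With normally distributed residuals about 5% of cases are expected to fall outside ±1.96 by chance, which is why a stricter follow-up criterion such as p < 0.001 is applied before a case is treated as an outlier.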
Handling Outliers
Options for handling outliers:
Exclusion.
Adjustment/trimming.
Before excluding outliers, verify that they are not data entry errors.
The removal of outliers can substantially change the results.
Example of Outlier Impact
Before removing outliers, lecturer knowledge and teaching ability were the best predictors.
After removing 7 cases, course enrollment becomes a significant predictor, while lecturer knowledge is no longer significant.
Lecturer teaching ability remains an important predictor, with an increased beta.
The R-squared increased from 0.755 to 0.824 after removing the outliers, and the F-value went from 27 to 34.
Reporting should include mentioning the removal of outliers, how many cases were removed, and why they were removed.
Importance of Reliable and Valid Findings
The motivation for removing outliers should not be solely to achieve significance.
Scientists are responsible for ensuring that findings are reliable and valid.
Data cleaning is necessary to obtain correct conclusions.
Moving Science Forward
Making sure that the answers you get to your questions are reliable and valid helps to move science forward.