Regression Analysis Example: Predicting Course Quality

Course Quality Prediction Example

Introduction

  • The example is based on an American study from the HA textbook, adapted to an Australian university context.

  • The goal is to determine which factors predict higher student ratings of course quality (CO).

  • Students rate overall course quality on a Likert scale from 1 to 5.

Predictors (Independent Variables)

  • Five predictors (independent variables or X's) are included in the study:

    • Enrolled: Number of students enrolled in the course.

      • Rationale: Larger enrollments might lead to more resources (more staff, better equipment), or smaller enrollments might foster a sense of community and individual attention.

    • Exam Quality: Student satisfaction with the exam content (alignment with topics covered).

      • Example: If an exam in PRM is supposed to have a 50/50 split between ANOVA and regression, but it's 80% regression, students may rate the exam quality poorly.

      • In PRM, efforts are made to ensure an equal number of multiple-choice questions for each lecture.

    • Grade Expected: The grade students expect to receive in the course, based on their previous academic history.

    • Lecturer Knowledge: How much the lecturer knows about the subject matter.

    • Lecturer Ability: How well the lecturer can teach the subject matter.

      • Note: Lecturer knowledge and ability are not necessarily mutually exclusive.

Statistical Output

  • Statistical software packages (SPSS, R, JASP, Jamovi) provide three types of information:

    • Descriptive statistics: Means, standard deviations, and sample sizes (N).

    • Inferential statistics of the equation (overall model): R^2 value, indicating the proportion of variance in course quality explained by all five predictors combined, and its significance.

    • Inferential statistics of the predictors: Whether each predictor uniquely contributes to predicting course quality (Y).

Descriptives and Correlations

  • Descriptive statistics include means and standard deviations.

  • Correlations show the relationships between independent variables and the dependent variable (course quality).

  • Exam quality, lecturer knowledge, and lecturer ability each correlate above 0.6 with course quality, and these correlations are statistically significant (p < 0.05).

  • Correlations among independent variables (e.g., exam quality and grade expected at 0.61, lecturer ability and exam quality at 0.72) may suggest multicollinearity or redundancy.
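
To see why correlations of this size are significant, a correlation can be converted to a t statistic with N - 2 degrees of freedom. The sketch below is illustrative only: it assumes N = 50 (the sample size implied by the degrees of freedom in these notes) and uses an approximate table value for the critical t; the function names are made up for this example.

```python
import math

def correlation_t(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def critical_r(t_crit: float, n: int) -> float:
    """Smallest absolute correlation that reaches significance for a given critical t."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

N = 50             # assumed sample size (from df = 5 and 44 in the notes)
T_CRIT_48 = 2.011  # approximate two-tailed critical t for df = 48, alpha = 0.05 (table value)

t_value = correlation_t(0.6, N)
r_min = critical_r(T_CRIT_48, N)
print(f"t for r = 0.6: {t_value:.2f}; smallest significant |r|: {r_min:.2f}")
```

Under these assumptions a correlation of 0.6 gives a t of about 5.2, and any correlation larger than roughly 0.28 in absolute value would be significant at p < 0.05.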

Assessing Normality from Descriptives

  • Means and standard deviations can provide insights into the distribution of variables.

    • For Enrolled, a mean of 88 and a standard deviation of 145 indicate a positively skewed distribution: subtracting two standard deviations from the mean (88 - 2 × 145 = -202) gives a negative enrollment number, which is impossible.

    • Variables measured on a 1-5 Likert scale (e.g., course quality, with mean 3.8 and SD 0.66) are also likely to be skewed when two standard deviations from the mean fall outside the scale's boundaries (here 3.8 + 2 × 0.66 = 5.12, above the maximum of 5).

  • Non-normality can be a concern, and researchers might need to clean up data (remove outliers or transform data) before proceeding.
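
The rule of thumb above can be sketched as a quick check on the descriptives. This is a rough screening aid, not a formal normality test; the function name and the choice of a two-standard-deviation band for both variables are assumptions made for illustration.

```python
import math

def band_check(mean: float, sd: float, lower: float, upper: float, k: float = 2.0) -> dict:
    """Compare mean +/- k standard deviations with the possible range of a variable."""
    low = mean - k * sd
    high = mean + k * sd
    return {
        "low": low,
        "high": high,
        "below_floor": low < lower,     # band extends below the smallest possible value
        "above_ceiling": high > upper,  # band extends above the largest possible value
    }

# Enrolled: counts cannot be negative and have no fixed upper limit
enrolled = band_check(mean=88, sd=145, lower=0, upper=math.inf)

# Course quality: 1-5 Likert scale
quality = band_check(mean=3.8, sd=0.66, lower=1, upper=5)

print(enrolled)
print(quality)
```

A band that spills below the floor hints at positive skew (Enrolled), and one that spills above the ceiling hints at negative skew (course quality).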

Overall Model Fit

  • The R^2 value indicates the proportion of variance in course quality explained by all five predictors.

  • Adjusted R^2 corrects R^2 for the number of predictors relative to the sample size; it is typically lower than R^2, especially with small samples.

  • Report the adjusted R^2 if it is substantially different from the R^2.

  • Sample Size Calculation:

    • The sample size can be recovered from the degrees of freedom reported in the output.

    • Given degrees of freedom of 5 for the regression (the number of independent variables) and 44 for the error term, the total is 5 + 44 = 49 = N - 1, so N = 50.

      • Total df = N - 1 can be rearranged to N = total df + 1.
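
The arithmetic in this section can be sketched as follows. The adjusted R^2 formula used here is the standard textbook one, 1 - (1 - R^2)(N - 1)/(N - k - 1); the notes do not state which formula the software uses, so treat that as an assumption. R^2 = 0.755 is the value reported for this example, and the function names are illustrative.

```python
def sample_size_from_df(df_regression: int, df_residual: int) -> int:
    """Total df = N - 1, so N = df_regression + df_residual + 1."""
    return df_regression + df_residual + 1

def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Standard adjustment of R^2 for k predictors and sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = sample_size_from_df(df_regression=5, df_residual=44)
adj = adjusted_r2(r2=0.755, n=n, k=5)
print(n, round(adj, 3))
```

With these inputs N comes out at 50 and the adjusted R^2 at about 0.73, slightly below the unadjusted 0.755.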

Interpreting R-squared and F Value

  • R^2 = 0.755 represents a large effect size.

  • An F value well above 1 suggests the model explains more variance than would be expected by chance; whether it is significant is judged from its p-value.

  • A p-value < 0.05 indicates that the five variables, considered collectively, significantly predict course quality.

  • An R^2 greater than 0.26 is considered a substantial effect, according to Cohen's estimates.
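
As a rough cross-check, the overall F value can be reconstructed from R^2 and the degrees of freedom. The sketch below uses the standard relationship F = (R^2 / df_regression) / ((1 - R^2) / df_residual). The 0.26 cutoff is the one given in the notes; the 0.13 and 0.02 cutoffs are Cohen's commonly cited medium and small benchmarks and are not stated in the notes. Function names are illustrative.

```python
def f_from_r2(r2: float, df_regression: int, df_residual: int) -> float:
    """Overall F statistic implied by R^2 and its degrees of freedom."""
    return (r2 / df_regression) / ((1 - r2) / df_residual)

def cohen_label(r2: float) -> str:
    """Label R^2 using Cohen's conventional benchmarks for multiple regression."""
    if r2 >= 0.26:
        return "large"
    if r2 >= 0.13:
        return "medium"
    if r2 >= 0.02:
        return "small"
    return "negligible"

f_value = f_from_r2(r2=0.755, df_regression=5, df_residual=44)
print(round(f_value, 1), cohen_label(0.755))
```

For R^2 = 0.755 with 5 and 44 degrees of freedom this gives an F of about 27, far above 1, and a "large" effect by Cohen's benchmarks.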

Individual Predictors

  • Lecturer knowledge and lecturer ability are identified as significant predictors because their p-values are below 0.05.

  • Unstandardized coefficients (B) are sometimes reported but are difficult to compare across predictors because each is expressed in its own predictor's unit of measurement.

  • Standardized beta coefficients allow for comparison of the strength of each predictor.

  • Positive beta values indicate a positive relationship: as lecturer knowledge and lecturer ability increase, perceived course quality also increases.
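
The link between unstandardized and standardized coefficients can be sketched with the usual conversion, beta = B × (SD of predictor / SD of outcome). The standard deviations for Enrolled (145) and course quality (0.66) come from the descriptives above; the B values and the standard deviation of 0.70 for lecturer ability are hypothetical, chosen only to illustrate the calculation.

```python
def standardized_beta(b: float, sd_x: float, sd_y: float) -> float:
    """Convert an unstandardized coefficient B to a standardized beta."""
    return b * sd_x / sd_y

SD_QUALITY = 0.66  # SD of course quality, from the descriptives

# Hypothetical B values, for illustration only
beta_enrolled = standardized_beta(b=0.001, sd_x=145, sd_y=SD_QUALITY)
beta_ability = standardized_beta(b=0.50, sd_x=0.70, sd_y=SD_QUALITY)  # SD of 0.70 is hypothetical

print(round(beta_enrolled, 2), round(beta_ability, 2))
```

In this made-up case the raw coefficients differ by a factor of 500, yet the standardized betas are of comparable order, which is why betas rather than B values are used to compare predictors.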

Reporting Results

  • A standard regression was performed using the variables to predict course quality.

  • Results: The set of five variables significantly predicted course quality, R^2 = 0.76, F(5, 44) ≈ 27, N = 50, significant at p < 0.001.

  • Lecturer ability and lecturer knowledge made statistically significant unique contributions to the prediction.

Non-Significant Multiple Regression

  • A non-significant multiple regression could be due to:

    • Low sample size.

    • Small effect sizes.

    • Inappropriate inclusion of predictors (redundancy).

    • Poor measurement of the dependent variable (course quality).

    • The model not applying to the specific population.

Residual Plots and Outliers

  • Z-residuals are part of the assumption checks; residual plots are examined to identify potential outliers.

  • Reference lines are drawn at +1.96 and -1.96 on the residual plot; under normality, about 95% of standardized residuals fall between them.

  • Cases outside these boundaries may be considered outliers and statistically examined using measures such as Mahalanobis distance, leverage value, or Cook's distance.

  • Consistency in the choice of outlier detection measure across studies is recommended.

  • Scatter plots (e.g., course quality vs. enrollment) can visually reveal outlying scores.

  • Mahalanobis distance is used in this course for detecting multivariate outliers.

  • The criterion to reject the null hypothesis and identify an outlier is p < 0.001.
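
The two screening steps above can be sketched in code. This is a minimal illustration under stated assumptions: it assumes the software reports the squared Mahalanobis distance and that this value is compared against a chi-square distribution with degrees of freedom equal to the number of predictors, which is the usual convention but is not spelled out in the notes. The residual values and function names are hypothetical.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    while a < df / 2.0:                       # step up two degrees of freedom at a time
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def is_multivariate_outlier(d_squared: float, n_predictors: int, alpha: float = 0.001) -> bool:
    """Flag a case whose squared Mahalanobis distance is significant at p < alpha."""
    return chi2_sf(d_squared, n_predictors) < alpha

def flag_z_residuals(z_residuals: list[float], cutoff: float = 1.96) -> list[int]:
    """Return the positions of standardized residuals outside +/- cutoff."""
    return [i for i, z in enumerate(z_residuals) if abs(z) > cutoff]

# Hypothetical standardized residuals for five cases
flagged = flag_z_residuals([0.3, -2.4, 1.1, 2.0, -0.5])

# Hypothetical squared Mahalanobis distances with five predictors
far_case = is_multivariate_outlier(25.0, n_predictors=5)
near_case = is_multivariate_outlier(15.0, n_predictors=5)
print(flagged, far_case, near_case)
```

With five predictors, the p < 0.001 criterion corresponds to a squared distance of roughly 20.5, so the case at 25.0 would be flagged and the case at 15.0 would not.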

Handling Outliers

  • Options for handling outliers:

    • Exclusion.

    • Adjustment/trimming.

  • Before excluding an outlier, verify that it is not a data entry error.

  • The removal of outliers can substantially change the results.

Example of Outlier Impact

  • Before removing outliers, lecturer knowledge and teaching ability were the best predictors.

  • After removing 7 cases, course enrollment became a significant predictor, while lecturer knowledge was no longer significant.

  • Lecturer teaching ability remained an important predictor, with an increased beta.

  • The R-squared increased from 0.755 to 0.824 after removing the outliers, and the F-value went from 27 to 34.

  • Reporting should include mentioning the removal of outliers, how many cases were removed, and why they were removed.

Importance of Reliable and Valid Findings

  • The motivation for removing outliers should not be solely to achieve significance.

  • Scientists are responsible for ensuring that findings are reliable and valid.

  • Data cleaning is necessary to obtain correct conclusions.

Moving Science Forward

  • The aim is to make sure that the answers you get to your questions are reliable and valid and, in doing so, to help move science forward.