
Lecture 8 Notes - K-Nearest Neighbors Regression

Regression Prediction Problem
  • Focus is on predicting quantitative values (continuous outcomes) vs. class labels (discrete categories).

    • Example: Predicting house prices based on features like size, location, and number of bedrooms.

    • Scatter plots reveal correlations between predictors and the response, which aid prediction.

K-Nearest Neighbors Regression (K-NN)
  • K-NN regression finds the k nearest neighbors of a query point in predictor space.

    • Example: If k=5, identify the 5 closest points based on distance metrics (e.g., Euclidean).

    • Prediction is the average of their values, yielding a continuous outcome.
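The find-the-neighbors-then-average step above can be sketched in plain Python (a hedged illustration; the helper name knn_regress and the toy data are invented for these notes, not from the lecture):

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict a continuous value as the mean response of the k nearest
    training points, using Euclidean distance."""
    # Distance from the query point to every training observation.
    dists = [math.dist(x, query) for x in train_X]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Prediction is the average of the neighbors' response values.
    return sum(train_y[i] for i in nearest) / k

# Toy data: standardized house size -> price.
X = [(1.0,), (1.2,), (2.0,), (2.1,), (3.0,)]
y = [100_000, 110_000, 180_000, 190_000, 260_000]
print(knn_regress(X, y, (1.1,), k=2))  # → 105000.0 (mean of two closest prices)
```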

Evaluating Model Quality
  • Key evaluation questions:

    1. Is the model effective in capturing data relationships?

    2. How do we optimally choose k?

  • Evaluation techniques:

    • Cross-validation for robust assessment using training, validation, and test sets.

    • Visualization of errors helps diagnose model performance.

Root Mean Squared Prediction Error (RMSPE)
  • RMSPE assesses prediction quality:

    • Formula: RMSPE = √((Σ(yᵢ - ŷᵢ)²) / n).

    • RMSE is computed on the training data; RMSPE is computed on validation or test data, so it reflects performance on unseen observations.

  • Example: The final model's RMSPE was 91,620.4 for k=52, highlighting prediction performance on unseen data.
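The RMSPE formula above translates directly into code; a minimal sketch (function name and numbers are illustrative, not from the lecture):

```python
import math

def rmspe(actual, predicted):
    """Root mean squared prediction error: the square root of the mean
    squared difference between observed values and predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Toy held-out prices vs. model predictions.
y_true = [200_000, 250_000, 300_000]
y_pred = [210_000, 240_000, 310_000]
print(rmspe(y_true, y_pred))  # → 10000.0
```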

Choosing the Value of k
  • k is selected via:

    1. Cross-validation to minimize RMSPE.

    2. Training on the entire dataset with the chosen k.

    3. Performance evaluation with a test set.
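The selection loop in steps 1–2 can be sketched as follows. This is a simplified hold-out version (a single validation split rather than full cross-validation); all names and data are invented for illustration:

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Mean response of the k nearest training points (Euclidean distance)."""
    dists = [math.dist(x, query) for x in train_X]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    return sum(train_y[i] for i in nearest) / k

def rmspe(actual, predicted):
    """Root mean squared prediction error on held-out data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def choose_k(train_X, train_y, val_X, val_y, k_values):
    """Return the k that minimizes RMSPE on the validation set."""
    scores = {}
    for k in k_values:
        preds = [knn_predict(train_X, train_y, q, k) for q in val_X]
        scores[k] = rmspe(val_y, preds)
    return min(scores, key=scores.get)

train_X = [(1.0,), (2.0,), (3.0,), (4.0,)]
train_y = [10, 20, 30, 40]
val_X, val_y = [(1.5,), (3.5,)], [15, 35]
best_k = choose_k(train_X, train_y, val_X, val_y, k_values=[1, 2, 3])
```

After selecting best_k this way, steps 2–3 refit on the full training data with that k and report RMSPE on the untouched test set.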

Overfitting and Underfitting
  • Overfitting: Complex models fit training data too closely, capturing noise.

  • Underfitting: Simple models fail to capture trends.

  • The choice of k controls model complexity: small k gives flexible models prone to overfitting, while large k gives smooth models prone to underfitting, as illustrated with visualizations comparing different k values.

Model Performance Evaluation
  • Observations:

    • A test RMSPE of ~90,529 indicates strong generalizability.

    • Evaluations influence practical interpretation of predictions.

Multivariable K-NN Regression
  • Utilizing multiple predictors enhances predictions.

    • Example: Combining house size, number of bedrooms, and location for better accuracy.

  • Strengths: Improved accuracy from the additional information, provided predictors are standardized so that variables on larger scales do not dominate the distance calculation.

  • Analysis shows slight performance improvements with more predictors.
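The scaling issue mentioned above matters because Euclidean distance mixes all predictors; a sketch of standardization in plain Python (analogous in effect to R's scale(); the function name and toy data are invented here):

```python
import statistics

def standardize(rows):
    """Rescale each predictor column to mean 0, sd 1 so no single
    variable dominates the distance computation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in rows]

# Toy data: (house size in sq ft, number of bedrooms).
X = [(1000, 2), (1500, 3), (2000, 3), (2500, 4)]
X_std = standardize(X)
# After scaling, size and bedroom count contribute comparably to
# Euclidean distance despite their very different raw ranges.
```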

Strengths and Limitations of K-NN Regression

Strengths:

  1. Simple and intuitive, accessible for beginners.

  2. Minimal assumptions about data structure.

  3. Effective with non-linear relationships.

Limitations:

  1. Computationally intensive with large datasets.

  2. Challenges in high-dimensional spaces (curse of dimensionality).

  3. Predictions may lack generalization outside training data range.

Conclusion
  • K-NN regression is a versatile method for predicting outcomes based on past data.

  • Emphasizes parameter tuning and thorough evaluation for reliable predictions in various applications.

Function Reference

knn(train, test, cl, k)

Performs K-Nearest Neighbors classification (class package); for K-NN regression, a dedicated function such as FNN::knn.reg is typically used instead.

train(formula, data, method)

Trains a model using the specified formula, data, and method (caret package).

predict(model, newdata)

Predicts outcomes based on a trained model and new data.

scale(data)

Standardizes the features in the dataset (mean = 0, sd = 1).

confusionMatrix(data, reference)

Evaluates classification performance by comparing predicted classes (data) against actual classes (reference); from the caret package.

dist(x)

Computes the distance matrix between the rows of a data frame or matrix — the distances on which K-NN relies.

knn.cv(train, cl, k)

Performs leave-one-out cross-validation for K-NN on the training set: each observation is predicted from the remaining training points (class package).