In-Depth Notes on k-Nearest Neighbors and Machine Learning Concepts

Introduction to Data Science with R

  • Course Title: STAT 200 - Introduction to Data Science with R
  • Institution: San Diego State University (SDSU)
  • Department: Mathematics and Statistics
  • Focus: Relationships between variables, prediction, visualization, and machine learning.

Understanding Prediction

  • Objective of Prediction:

    • To predict unknown or future outcomes (dependent variable y) using data (independent variables x).
    • Example variables: age, sex, blood pressure, calories, occurrence of heart attack.
  • Types of Outcomes:

    • Continuous/numeric outcomes (e.g., surf height): this task is called regression.
    • Categorical outcomes (e.g., tumor status): this task is called classification.

Key Terminology in Machine Learning

  • Predictors (independent variables):

    • Also referred to as features, inputs, covariates, or explanatory variables.
  • Outcome (dependent variable):

    • Often referred to as response, target, label, or class.
  • Supervised Learning:

    • Process of learning to predict outcomes from predictors.
    • In contrast, unsupervised learning involves learning structure without outcomes.

Supervised Learning Algorithms

  • Common algorithms include:

    • Multiple/logistic regression
    • Naïve Bayes
    • k-nearest neighbors (KNN)
    • Random forests
    • Support vector machines
    • Neural networks
  • Focus on k-nearest neighbors:

    • Simple but powerful, used for both regression and classification.

k-Nearest Neighbors (KNN) Methodology

  • How k-NN Works:
    • Predicts the outcome for a new observation from the outcomes of its K most similar (nearest) observations in the data.
  • Example:
    • Predicting the species of a flower from its sepal width and length.
    • A new flower is assigned the most common species among its K nearest neighbors.
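
The neighbor-voting idea above can be sketched in base R with the built-in iris data; the new flower's measurements below are invented for illustration:

```r
# Minimal k-NN sketch in base R using the built-in iris data.
# The new flower's measurements are made up for illustration.
new_flower <- c(Sepal.Length = 5.8, Sepal.Width = 3.1)

# Euclidean distance from the new flower to every observed flower
d <- sqrt((iris$Sepal.Length - new_flower[1])^2 +
          (iris$Sepal.Width  - new_flower[2])^2)

# Species labels of the K = 5 most similar observations
k <- 5
neighbors <- iris$Species[order(d)[1:k]]

# Predict the most common label among those neighbors (majority vote)
prediction <- names(which.max(table(neighbors)))
prediction
```

Here `order(d)[1:k]` picks out the K smallest distances, and `which.max(table(...))` implements the majority vote among the neighbors' labels.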

Prediction Process

  • Choosing K:

    • K = 1: use only the single nearest neighbor.
    • K = 2: use the two nearest neighbors.
    • Predictions can vary depending on the choice of K.
    • For classification, predict the most common label among the K neighbors; odd values of K avoid ties in two-class problems.
  • Training and Testing:

    • Randomly select a proportion of sample data for training (commonly 70%) and the remainder for testing.
    • Predictions for the testing set are made using the labeled training set.
    • Measure accuracy of predictions against true labels.
  • Cross-validation:

    • Repeat the evaluation for different values of K and choose the K with the highest accuracy; averaging accuracy over several random splits (true cross-validation) gives a more reliable choice than a single split.
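
The training/testing and K-selection steps above can be sketched with `knn()` from R's `class` package; this uses a single random 70/30 split rather than full cross-validation, and the predictor columns and K range are illustrative choices:

```r
# Choosing K via a single random 70/30 split (a simpler stand-in for
# full cross-validation), using knn() from the class package.
library(class)

set.seed(1)                                  # reproducible split
n     <- nrow(iris)
train <- sample(n, size = round(0.7 * n))    # indices of the 70% training set
x     <- iris[, c("Sepal.Length", "Sepal.Width")]
y     <- iris$Species

# Test-set accuracy for each candidate K
accuracy <- sapply(1:15, function(k) {
  pred <- knn(train = x[train, ], test = x[-train, ],
              cl = y[train], k = k)
  mean(pred == y[-train])                    # proportion of correct predictions
})

best_k <- which.max(accuracy)                # K with the highest test accuracy
```

Since the K values run 1 through 15, `which.max(accuracy)` returns the best-performing K directly.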

Implementation in R

  • KNN can be implemented in R (e.g., with the knn() function from the class package), focusing on:
    • Loading necessary packages
    • Preparing dataset
    • Conducting the training-testing split
    • Evaluating model accuracy based on chosen K values.
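
A hypothetical end-to-end sketch of these steps, assuming sepal measurements as predictors and a fixed K = 5, with accuracy read off a confusion table of predicted versus true labels:

```r
# End-to-end sketch: load package, prepare data, split, predict, evaluate.
library(class)                               # provides knn()

set.seed(1)
n     <- nrow(iris)
train <- sample(n, size = round(0.7 * n))    # 70/30 training-testing split
x     <- iris[, c("Sepal.Length", "Sepal.Width")]
y     <- iris$Species

# Classify the test set from the labeled training set with K = 5
pred <- knn(train = x[train, ], test = x[-train, ], cl = y[train], k = 5)

# Confusion table: predicted vs. true labels on the held-out test set
conf <- table(Predicted = pred, True = y[-train])
acc  <- sum(diag(conf)) / sum(conf)          # overall accuracy
```

The diagonal of the confusion table counts correct predictions, so dividing its sum by the table total gives the accuracy.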

Conclusion

  • Using KNN allows for effective classification, with accuracy dependent on optimal K selection and proper training-testing methodology.
  • Understanding prediction methodologies is crucial for success in data science.