In-Depth Notes on k-Nearest Neighbors and Machine Learning Concepts
Introduction to Data Science with R
- Course Title: STAT 200 - Introduction to Data Science with R
- Institution: San Diego State University (SDSU)
- Department: Mathematics and Statistics
- Focus: Relationships between variables, prediction, visualization, and machine learning.
Understanding Prediction
Objective of Prediction:
- To predict unknown or future outcomes (dependent variable y) using data (independent variables x).
- Example Variables:
- Age, sex, blood pressure, calories, occurrence of heart attack.
Types of Outcomes:
- Continuous/Numeric Outcomes (e.g., surf height):
- This task is called regression.
- Categorical Outcomes (e.g., tumor status):
- This task is called classification.
Key Terminology in Machine Learning
Predictors (independent variables):
- Also referred to as features, inputs, covariates, or explanatory variables.
Outcome (dependent variable):
- Often referred to as response, target, label, or class.
Supervised Learning:
- Process of learning to predict outcomes from predictors.
- In contrast, unsupervised learning involves learning structure without outcomes.
Supervised Learning Algorithms
Common algorithms include:
- Multiple/logistic regression
- Naïve Bayes
- k-nearest neighbors (KNN)
- Random forests
- Support vector machines
- Neural networks
Focus on k-nearest neighbors:
- Simple but powerful, used for both regression and classification.
k-Nearest Neighbors (KNN) Methodology
- How k-NN Works:
- Predicts an outcome based on the outcomes of the K most similar observations.
- Example:
- Predicting the species of a flower based on sepal width and length.
- New flower classified based on its K-nearest neighbors.
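The flower example above can be sketched with the knn() function from R's class package (a recommended package that ships with R). The specific measurements of the new flower are made up for illustration:

```r
library(class)  # provides knn(); install.packages("class") if needed

# All 150 iris flowers serve as the labeled data
train_x <- iris[, c("Sepal.Length", "Sepal.Width")]
train_y <- iris$Species

# A hypothetical new flower to classify
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5)

# Predict its species from the labels of its 5 nearest neighbors
pred <- knn(train = train_x, test = new_flower, cl = train_y, k = 5)
pred
```

Here knn() computes the Euclidean distance from the new flower to every training flower and returns the majority label among the 5 closest.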
Prediction Process:
Choosing K:
- K = 1: predict using only the single nearest neighbor.
- K = 2: predict using the two nearest neighbors (ties are possible here, so odd values of K are often preferred for classification).
- Predictions can vary depending on the choice of K.
- The prediction is the most common label among the K neighbors.
Training and Testing:
- Randomly select a proportion of sample data for training (commonly 70%) and the remainder for testing.
- The labeled training set is used to classify observations in the testing set.
- Accuracy is measured by comparing the predictions with the true labels of the testing set.
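A minimal sketch of the 70/30 split and accuracy measurement in R, again using iris with two sepal predictors:

```r
library(class)
set.seed(1)  # so the random split is reproducible

# Randomly assign 70% of the rows to training, the rest to testing
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
cols <- c("Sepal.Length", "Sepal.Width")

train_x <- iris[train_idx, cols]
test_x  <- iris[-train_idx, cols]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Classify the test set from the training set, then measure accuracy
pred <- knn(train_x, test_x, cl = train_y, k = 5)
acc <- mean(pred == test_y)  # proportion of correct predictions
acc
```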
Cross-validation:
- Repeat the training-testing evaluation for different values of K and choose the K that gives the highest accuracy (more robust schemes average accuracy over several random splits).
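A sketch of this tuning loop, evaluating several candidate values of K on a single random 70/30 split (the range 1 to 20 is an arbitrary choice for illustration):

```r
library(class)
set.seed(1)

n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
cols <- c("Sepal.Length", "Sepal.Width")

# Test-set accuracy for each candidate K
ks <- 1:20
acc <- sapply(ks, function(k) {
  pred <- knn(iris[train_idx, cols], iris[-train_idx, cols],
              cl = iris$Species[train_idx], k = k)
  mean(pred == iris$Species[-train_idx])
})

best_k <- ks[which.max(acc)]  # the K with the highest accuracy
best_k
```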
Implementation in R
- KNN can be implemented in statistical software like R, focusing on:
- Loading necessary packages
- Preparing dataset
- Conducting the training-testing split
- Evaluating model accuracy based on chosen K values.
Conclusion
- Using KNN allows for effective classification, with accuracy dependent on optimal K selection and proper training-testing methodology.
- Understanding prediction methodologies is crucial for success in data science.