In-Depth Notes on k-Nearest Neighbors and Machine Learning Concepts

Introduction to Data Science with R

  • Course Title: STAT 200 - Introduction to Data Science with R
  • Institution: San Diego State University (SDSU)
  • Department: Mathematics and Statistics
  • Focus: Relationships between variables, prediction, visualization, and machine learning.

Understanding Prediction

  • Objective of Prediction:

    • To predict unknown or future outcomes (dependent variable y) using data (independent variables x).
    • Example variables: age, sex, blood pressure, calories, occurrence of heart attack.
  • Types of Outcomes:

    • Continuous/numeric outcomes (e.g., surf height): this task is called regression.
    • Categorical outcomes (e.g., tumor status): this task is called classification.

Key Terminology in Machine Learning

  • Predictors (independent variables):

    • Also referred to as features, inputs, covariates, or explanatory variables.
  • Outcome (dependent variable):

    • Often referred to as response, target, label, or class.
  • Supervised Learning:

    • Process of learning to predict outcomes from predictors.
    • In contrast, unsupervised learning involves learning structure without outcomes.

Supervised Learning Algorithms

  • Common algorithms include:

    • Multiple/logistic regression
    • Naïve Bayes
    • k-nearest neighbors (KNN)
    • Random forests
    • Support vector machines
    • Neural networks
  • Focus on k-nearest neighbors:

    • Simple but powerful, used for both regression and classification.

k-Nearest Neighbors (KNN) Methodology

  • How k-NN Works:
    • Predicts the outcome for a new observation from the outcomes of its K most similar (nearest) observations in the data.
  • Example:
    • Predicting the species of a flower from its sepal width and length.
    • A new flower is assigned the most common species among its K nearest neighbors.
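
The neighbor-voting idea above can be sketched in base R with the built-in iris data; the new flower's measurements below are invented for illustration:

```r
# Minimal k-NN sketch in base R using the built-in iris data.
# The new flower's measurements are made up for illustration.
new_flower <- c(Sepal.Length = 5.8, Sepal.Width = 3.1)

# Euclidean distance from the new flower to every observed flower
d <- sqrt((iris$Sepal.Length - new_flower[1])^2 +
          (iris$Sepal.Width  - new_flower[2])^2)

# Species labels of the K = 5 most similar observations
k <- 5
neighbors <- iris$Species[order(d)[1:k]]

# Predict the most common label among those neighbors (majority vote)
prediction <- names(which.max(table(neighbors)))
prediction
```

Here `order(d)[1:k]` picks out the K smallest distances, and `which.max(table(...))` implements the majority vote among the neighbors' labels.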

Prediction Process

  • Choosing K:

    • K = 1: use only the single nearest neighbor.
    • K = 2: use the two nearest neighbors.
    • Predictions can vary depending on the choice of K.
    • For classification, predict the most common label among the K neighbors; odd values of K avoid ties in two-class problems.
  • Training and Testing:

    • Randomly select a proportion of sample data for training (commonly 70%) and the remainder for testing.
    • Predictions for the testing set are made using the labeled training set.
    • Measure accuracy of predictions against true labels.
  • Cross-validation:

    • Repeat the evaluation for different values of K and choose the K with the highest accuracy; averaging accuracy over several random splits (true cross-validation) gives a more reliable choice than a single split.
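
The training/testing and K-selection steps above can be sketched with `knn()` from R's `class` package; this uses a single random 70/30 split rather than full cross-validation, and the predictor columns and K range are illustrative choices:

```r
# Choosing K via a single random 70/30 split (a simpler stand-in for
# full cross-validation), using knn() from the class package.
library(class)

set.seed(1)                                  # reproducible split
n     <- nrow(iris)
train <- sample(n, size = round(0.7 * n))    # indices of the 70% training set
x     <- iris[, c("Sepal.Length", "Sepal.Width")]
y     <- iris$Species

# Test-set accuracy for each candidate K
accuracy <- sapply(1:15, function(k) {
  pred <- knn(train = x[train, ], test = x[-train, ],
              cl = y[train], k = k)
  mean(pred == y[-train])                    # proportion of correct predictions
})

best_k <- which.max(accuracy)                # K with the highest test accuracy
```

Since the K values run 1 through 15, `which.max(accuracy)` returns the best-performing K directly.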

Implementation in R

  • KNN can be implemented in R (e.g., with the knn() function from the class package), focusing on:
    • Loading necessary packages
    • Preparing dataset
    • Conducting the training-testing split
    • Evaluating model accuracy based on chosen K values.
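
A hypothetical end-to-end sketch of these steps, assuming sepal measurements as predictors and a fixed K = 5, with accuracy read off a confusion table of predicted versus true labels:

```r
# End-to-end sketch: load package, prepare data, split, predict, evaluate.
library(class)                               # provides knn()

set.seed(1)
n     <- nrow(iris)
train <- sample(n, size = round(0.7 * n))    # 70/30 training-testing split
x     <- iris[, c("Sepal.Length", "Sepal.Width")]
y     <- iris$Species

# Classify the test set from the labeled training set with K = 5
pred <- knn(train = x[train, ], test = x[-train, ], cl = y[train], k = 5)

# Confusion table: predicted vs. true labels on the held-out test set
conf <- table(Predicted = pred, True = y[-train])
acc  <- sum(diag(conf)) / sum(conf)          # overall accuracy
```

The diagonal of the confusion table counts correct predictions, so dividing its sum by the table total gives the accuracy.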

Conclusion

  • Using KNN allows for effective classification, with accuracy dependent on optimal K selection and proper training-testing methodology.
  • Understanding prediction methodologies is crucial for success in data science.