DSC510: Machine Learning
Introduction to Data Science and Analytics
Overview of machine learning (ML) as a crucial component of data science.
Machine Learning facilitates various steps of the data analysis cycle.
Taxonomy of Machine Learning
Types of Learning:
Supervised Learning: Uses labeled input/output pairs to learn a function (y = f(X)).
Types:
Classification: Output y is discrete labels (e.g., cat or dog).
Regression: Output y is continuous (e.g., predicting prices).
Unsupervised Learning: Works with unlabeled input to find patterns.
Types:
Clustering: Group data points based on similarities.
Dimensionality Reduction: Reduces number of variables.
Examples of Machine Learning
Supervised Learning Applications:
Image recognition (deciding if an image is a cat or dog).
Predicting user ratings for restaurants.
Spam detection in emails.
Unsupervised Learning Applications:
Clustering handwritten digits into classes.
Identifying trending topics on social media.
Machine Learning Techniques
Supervised Learning Techniques:
k-Nearest Neighbors (k-NN)
Naïve Bayes
Linear Regression & Logistic Regression
Support Vector Machines (SVM)
Random Forests
Neural Networks
Unsupervised Learning Techniques:
Clustering algorithms
Matrix Factorization (PCA, SVD)
Hidden Markov Models (HMM)
Predictive Performance Criteria
Metrics:
Accuracy
Area Under the ROC Curve (AUC-ROC)
Precision and Recall
F1 Score
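The classification metrics above can be computed directly from confusion-matrix counts; a minimal sketch (the function name `precision_recall_f1` and the example counts are illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
# precision = 0.8, recall = 0.8, F1 = 0.8
```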
Considerations:
Speed and Scalability
Robustness against outliers, noise, and missing values
Interpretability (transparency of model decisions)
Model compactness for deployment on mobile devices.
Introduction to k-Nearest Neighbors (k-NN)
Concept:
Identify the k closest labeled instances to a query item.
Use the most frequent label among the nearest neighbors for classification.
Voting Method:
Majority voting for classification.
Average for regression.
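The procedure above can be sketched in a few lines of Python (the `knn_predict` helper and the toy data are illustrative, assuming Euclidean distance and majority voting):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    # Distance from the query to every labeled training point.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    # Majority vote among the k nearest labels.
    labels = [y for _, y in dists[:k]]
    return Counter(labels).most_common(1)[0][0]

X = [(1, 1), (1, 2), (5, 5), (6, 5)]
y = ["cat", "cat", "dog", "dog"]
knn_predict(X, y, (1.5, 1.5), k=3)  # -> "cat"
```

For regression, the final line would instead average the k neighbors' target values.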
Distance Measures in k-NN
Common Distances:
Euclidean Distance: d(x, y) = || x - y ||
Manhattan Distance: Sum of absolute differences.
Cosine Distance: One minus the cosine similarity; mainly for text data.
Hamming Distance: Used for categorical data.
Jaccard Distance: Measures similarity between sets.
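The five measures above can be sketched with plain Python (function names are illustrative; inputs are numeric sequences, equal-length sequences, or sets as appropriate):

```python
import math

def euclidean(x, y):
    # Straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the vectors.
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.hypot(*x) * math.hypot(*y))

def hamming(x, y):
    # Number of positions where equal-length sequences differ.
    return sum(a != b for a, b in zip(x, y))

def jaccard_distance(s, t):
    # 1 minus |intersection| / |union| of two sets.
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)
```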
Bias and Variance in Model Training
Definitions:
Bias: Error due to overly simplistic assumptions in the learning algorithm.
Variance: Error due to excessive model complexity, which makes the model sensitive to fluctuations in the training set.
Bias-Variance Tradeoff:
Complex models tend to have lower bias and higher variance.
Simple models tend toward higher bias and lower variance.
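The tradeoff above is often stated via the standard bias-variance decomposition of expected squared error (shown here for squared loss, with irreducible noise variance σ²):

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}
  + \sigma^2
```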
Choosing the Value of k in k-NN
Tradeoff:
Small k: Low bias but high variance.
Large k: High bias but low variance.
Cross-Validation Techniques
Leave-One-Out: Each instance serves as the validation set exactly once, with the model trained on all remaining instances.
K-Fold Cross-Validation: Data is divided into k folds; each fold serves once as the validation set while the remaining folds are used for training.
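K-fold splitting can be sketched as follows (a simplified version that drops any remainder when `len(data)` is not divisible by k; `model_fn` is a hypothetical callable that fits on the training split and returns a score on the validation split):

```python
def k_fold_cv(data, k, model_fn):
    """Average the validation score over k folds."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        # Fold i is held out for validation; the rest is training data.
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(model_fn(train, val))
    return sum(scores) / k
```

Leave-one-out is the special case k = len(data), where each fold holds a single instance.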
Overfitting and Underfitting
Overfitting: Model performs well on training data but poorly on unseen data.
Underfitting: Model does not capture underlying trend of the data adequately.
Decision Trees
Structure:
Flow-chart-like model for decisions and classifications.
Internal nodes represent feature tests, branches represent outcomes, and leaves hold predictions.
Generation:
Constructed using greedy algorithms based on information gain or Gini impurity.
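Gini impurity, one of the greedy splitting criteria mentioned above, is one minus the sum of squared class proportions at a node; a minimal sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions.

    0.0 means a pure node; 0.5 is the maximum for two balanced classes.
    """
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

gini(["cat", "cat", "dog", "dog"])  # 0.5 (maximally impure for two classes)
gini(["cat", "cat", "cat"])         # 0.0 (pure node)
```

A greedy tree builder evaluates candidate splits and picks the one that most reduces the children's weighted impurity.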
Ensemble Methods
Use multiple models to improve predictions:
Bagging: Trains models on bootstrap samples and combines predictions by averaging or voting.
Boosting: Sequentially builds models, each correcting errors made by previous ones.
Stacking: Combines multiple models at different levels.
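The bagging idea can be sketched as bootstrap sampling plus majority voting (the `bootstrap_sample` and `bagging_predict` helpers are illustrative placeholders; the trained models are represented as plain callables):

```python
from collections import Counter
import random

def bootstrap_sample(data, rng):
    # Sample len(data) items with replacement.
    return [rng.choice(data) for _ in data]

def bagging_predict(models, x):
    # Each model votes; the most common label wins.
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]
```

In practice each model would be fit on its own bootstrap sample before voting; boosting differs in that models are built sequentially and reweight the errors of their predecessors.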
Random Forests
Ensemble of decision trees trained on different subsets of data with random feature selection at each split.
Reduces variance and improves predictive performance.
Logistic Regression
Applies the logistic (sigmoid) function to a linear combination of features to produce probability estimates, which are thresholded into class predictions.
Regression coefficients are estimated using maximum likelihood estimation.
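The sigmoid transformation and thresholding described above can be sketched as follows (the weights and threshold are illustrative; maximum likelihood estimation of the coefficients is omitted):

```python
import math

def sigmoid(z):
    # Maps any real number into (0, 1).
    return 1 / (1 + math.exp(-z))

def predict_proba(weights, bias, x):
    # Probability of the positive class from a linear combination of features.
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def predict(weights, bias, x, threshold=0.5):
    # Threshold the probability into a hard class label.
    return 1 if predict_proba(weights, bias, x) >= threshold else 0
```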
Perceptron Algorithm
Simple online learning model for binary classification.
Adjusts weights based on misclassifications, making it adaptive.
Online Learning Adaptability
Continuously updates weights as new data comes in, adapting to changes without retraining from scratch.
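A single online perceptron step can be sketched as follows (a minimal sketch assuming labels in {-1, +1} and an explicit bias term):

```python
def perceptron_update(w, b, x, y, lr=1.0):
    """One online learning step: update weights only on a misclassification.

    w: weight list, b: bias, x: feature tuple, y: label in {-1, +1}.
    """
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * activation <= 0:  # wrong side of (or on) the decision boundary
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b
```

Because each step touches only the current example, the model adapts to a data stream without ever retraining from scratch.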