DSC510: Machine Learning

Introduction to Data Science and Analytics

  • Overview of machine learning (ML) as a crucial component of data science.

  • Machine Learning facilitates various steps of the data analysis cycle.

Taxonomy of Machine Learning

  • Types of Learning:

    • Supervised Learning: Uses labeled input/output pairs to learn a function (y = f(X)).

      • Types:

        • Classification: Output y is discrete labels (e.g., cat or dog).

        • Regression: Output y is continuous (e.g., predicting prices).

    • Unsupervised Learning: Works with unlabeled input to find patterns.

      • Types:

        • Clustering: Group data points based on similarities.

        • Dimensionality Reduction: Reduces number of variables.

Examples of Machine Learning

  • Supervised Learning Applications:

    • Image recognition (deciding if an image is a cat or dog).

    • Predicting user ratings for restaurants.

    • Spam detection in emails.

  • Unsupervised Learning Applications:

    • Clustering handwritten digits into classes.

    • Identifying trending topics on social media.

Machine Learning Techniques

  • Supervised Learning Techniques:

    • k-Nearest Neighbors (k-NN)

    • Naïve Bayes

    • Linear Regression & Logistic Regression

    • Support Vector Machines (SVM)

    • Random Forests

    • Neural Networks

  • Unsupervised Learning Techniques:

    • Clustering algorithms

    • Matrix Factorization (PCA, SVD)

    • Hidden Markov Models (HMM)

Predictive Performance Criteria

  • Metrics:

    • Accuracy

    • Area Under the Receiver Operating Characteristic Curve (AUC-ROC)

    • Precision and Recall

    • F1 Score

  • Considerations:

    • Speed and Scalability

    • Robustness against outliers, noise, and missing values

    • Interpretability (transparency of model decisions)

    • Model compactness for deployment in mobile devices.

Introduction to k-Nearest Neighbors (k-NN)

  • Concept:

    • Identify the k closest labeled instances to a query item.

    • Use the most frequent label among the nearest neighbors for classification.

  • Voting Method:

    • Majority voting for classification.

    • Average for regression.
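The concept above can be sketched in a few lines. This is a minimal pure-Python illustration (names like `knn_predict` and the toy data are invented for the example): find the k closest labeled points, then take the majority label.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled neighbors.
    `train` is a list of (feature_vector, label) pairs."""
    # Sort training points by Euclidean distance to the query.
    by_dist = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Vote over the labels of the k closest points.
    top_labels = [label for _, label in by_dist[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy data: two well-separated clusters labeled "A" and "B".
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2), k=3))  # → A
```

For regression, the last line of `knn_predict` would instead average the k neighbors' numeric targets.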

Distance Measures in k-NN

  • Common Distances:

    • Euclidean Distance: d(x, y) = ||x − y||₂, the square root of the sum of squared coordinate differences.

    • Manhattan Distance: Sum of absolute differences.

    • Cosine Similarity: Measures the angle between vectors; commonly used for text data, with 1 − similarity serving as a distance.

    • Hamming Distance: Counts the positions at which two equal-length sequences differ; used for categorical data.

    • Jaccard Distance: 1 minus the Jaccard similarity |A ∩ B| / |A ∪ B| between sets.
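The five measures above can each be written in a few lines of pure Python (a sketch; function names are chosen for the example):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norms

def hamming(x, y):
    # Number of positions where equal-length sequences differ.
    return sum(a != b for a, b in zip(x, y))

def jaccard_distance(s, t):
    # 1 minus |intersection| / |union| of two sets.
    return 1 - len(s & t) / len(s | t)
```

For instance, `euclidean((0, 0), (3, 4))` is 5.0 while `manhattan((0, 0), (3, 4))` is 7; the choice of measure changes which neighbors count as "nearest."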

Bias and Variance in Model Training

  • Definitions:

    • Bias: Error due to overly simplistic assumptions in the learning algorithm.

    • Variance: Error due to excessive model complexity, which makes the model sensitive to fluctuations in the training set.

  • Bias-Variance Tradeoff:

    • Complex models tend to have lower bias and higher variance.

    • Simple models tend toward higher bias and lower variance.

Choosing the Value of k in k-NN

  • Tradeoff:

    • Small k: Low bias but high variance.

    • Large k: High bias but low variance.

Cross-Validation Techniques

  • Leave-One-Out: Each instance serves as the validation set exactly once, with the model trained on all remaining instances.

  • K-Fold Cross-Validation: Data is divided into K subsets (folds); each fold serves once as the validation set while the model trains on the remaining K − 1 folds.
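A minimal K-fold sketch in pure Python (helper names such as `kfold_indices` and `cross_validate` are invented for illustration). In practice this loop is how hyperparameters like the k in k-NN are commonly tuned: run it once per candidate value and keep the best average score.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_fn, score_fn):
    """Average validation score over k iterations.
    Each fold serves exactly once as the validation set."""
    folds = kfold_indices(len(data), k)
    scores = []
    for i in range(k):
        val = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(train)            # fit on k-1 folds
        scores.append(score_fn(model, val))  # evaluate on the held-out fold
    return sum(scores) / k
```

Setting k equal to the dataset size recovers leave-one-out as a special case.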

Overfitting and Underfitting

  • Overfitting: Model performs well on training data but poorly on unseen data.

  • Underfitting: Model does not capture underlying trend of the data adequately.

Decision Trees

  • Structure:

    • Flow-chart-like model for decisions and classifications.

    • Internal nodes test features, branches represent the outcomes of those tests, and leaves hold the predictions.

  • Generation:

    • Constructed using greedy algorithms based on information gain or Gini impurity.
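The two split criteria named above are simple to compute. A sketch, using the standard definitions (Gini impurity is 1 minus the sum of squared class proportions; information gain is the entropy reduction achieved by a split):

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children`,
    weighting each child's entropy by its share of the data."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)
```

A perfectly mixed two-class node has Gini 0.5 and entropy 1 bit; a split that separates the classes completely yields an information gain of 1 bit, so the greedy builder would choose it.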

Ensemble Methods

  • Use multiple models to improve predictions:

    • Bagging: Trains models on bootstrap samples (drawn with replacement) and combines their predictions by averaging or voting.

    • Boosting: Sequentially builds models, each correcting errors made by previous ones.

    • Stacking: Trains a meta-model to combine the predictions of several base models.
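Bagging in particular is short enough to sketch directly. This illustration (function names are invented; `train_fn` stands for any base learner that returns a callable classifier) draws bootstrap samples, trains one model per sample, and majority-votes:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement: a bootstrap sample."""
    return [rng.choice(data) for _ in data]

def bagging_predict(data, query, train_fn, n_models=25, seed=0):
    """Train n_models base learners on bootstrap samples and majority-vote.
    `train_fn` takes a dataset and returns a callable classifier."""
    rng = random.Random(seed)
    votes = [train_fn(bootstrap_sample(data, rng))(query)
             for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]
```

Because each model sees a slightly different resampling of the data, averaging their votes smooths out individual models' fluctuations, which is why bagging is primarily a variance-reduction technique.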

Random Forests

  • Ensemble of decision trees trained on different subsets of data with random feature selection at each split.

  • Reduces variance and improves predictive performance.

Logistic Regression

  • Applies the logistic (sigmoid) function to a linear combination of the features to produce probability estimates, which can be thresholded into class predictions.

  • Regression coefficients are estimated using maximum likelihood estimation.
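The prediction side of the model fits in a few lines (a sketch with invented names; fitting the weights by maximum likelihood is omitted):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(y = 1 | x): sigmoid of the linear combination of features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict(weights, bias, x, threshold=0.5):
    """Threshold the probability to get a hard class prediction."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0
```

Note that sigmoid(0) = 0.5, so with the default threshold the decision boundary is exactly the hyperplane where the linear score is zero.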

Perceptron Algorithm

  • Simple online learning model for binary classification.

  • Adjusts weights based on misclassifications, making it adaptive.

Online Learning Adaptability

  • Continuously updates weights as new data comes in, adapting to changes without retraining from scratch.
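The perceptron's update rule illustrates this online behavior concretely. A minimal sketch (labels are ±1; the data stream below is a toy AND-style example invented for the demonstration): each example is processed once as it arrives, and the weights change only on a misclassification.

```python
def perceptron_update(weights, bias, x, y, lr=1.0):
    """One online step: adjust weights only when the example is misclassified.
    y is +1 or -1; the prediction is the sign of the linear score."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    pred = 1 if score >= 0 else -1
    if pred != y:  # misclassification: nudge the boundary toward/away from x
        weights = [w + lr * y * xi for w, xi in zip(weights, x)]
        bias = bias + lr * y
    return weights, bias

# Stream examples one at a time; no retraining from scratch is ever needed.
w, b = [0.0, 0.0], 0.0
stream = [([1, 1], 1), ([0, 0], -1), ([1, 0], -1), ([0, 1], -1)] * 10
for x, y in stream:
    w, b = perceptron_update(w, b, x, y)
```

Because the data above are linearly separable, the perceptron convergence theorem guarantees the updates eventually stop; on non-separable streams the weights keep adapting indefinitely, which is exactly the adaptability this section describes.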