Machine Learning Notes

Machine Learning Overview

  • Definition: Machine Learning (ML) is a field of study focused on developing algorithms that allow computers to learn from data and improve their performance on tasks without being explicitly programmed.
  • Key Terms:
    • Experience (E): The data or experience from which the machine learns.
    • Task (T): The specific task the machine is trying to accomplish.
    • Performance Measure (P): A criterion to evaluate how well the machine is performing the task.
  • Quote by Arthur Samuel (1959): "The field of study that gives computers the ability to learn without being explicitly programmed."

Machine Learning Process

  1. Data Collection: Gathering the necessary data for analysis.
  2. Data Cleansing: Removing errors, duplicates, and inconsistencies from the dataset.
  3. Feature Extraction & Selection: Identifying the most relevant attributes in the data.
  4. Model Training: Building the machine learning model using the training set.
  5. Model Evaluation: Assessing the model's performance using a validation/test dataset.
  6. Model Deployment & Integration: Putting the model into production and integrating it into existing systems.
  7. Feedback and Iteration: Continuously improving the model based on performance metrics and new data.
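The core of steps 1–5 can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available; the Iris dataset and the logistic-regression model are placeholder choices, not part of the notes above.

```python
# Minimal sketch of the collect -> split -> train -> evaluate loop.
# Assumes scikit-learn is installed; dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data collection: load a built-in toy dataset.
X, y = load_iris(return_X_y=True)

# 4. Model training: hold out a test set, fit on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 5. Model evaluation: score on the held-out test set.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Steps 6–7 (deployment and iteration) happen outside this script, e.g. serializing the fitted model and retraining as new data arrives.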

Types of Learning in Machine Learning

Basic Concepts

  • Supervised Learning: The model learns from labeled data, mapping input to output based on examples. Examples: Classification and Regression.
  • Unsupervised Learning: The model identifies patterns and relationships in unlabeled data. Example: Clustering.
  • Semi-supervised Learning: Combines a small amount of labeled data with a large amount of unlabeled data.
  • Reinforcement Learning: An agent learns by interacting with its environment to maximize cumulative rewards.
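The supervised/unsupervised distinction can be seen side by side on the same points. A small sketch assuming scikit-learn; the 1-D toy data is made up for illustration.

```python
# Supervised learning uses labels; unsupervised learning finds structure without them.
# Assumes scikit-learn is available; the tiny 1-D dataset is illustrative.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]]
y = [0, 0, 0, 1, 1, 1]  # labels available -> supervised setting

clf = LogisticRegression().fit(X, y)   # learns the input -> label mapping
print(clf.predict([[1.1], [8.1]]))     # classifies new points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # labels never used
labels = km.labels_
# Nearby points end up in the same cluster, whatever the cluster is numbered.
print(labels)
```

Note that clustering recovers the two groups without ever seeing `y` — it only guarantees grouping, not which group gets which name.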

Evaluation of Models

  • Generalization Capability: The ability to perform well on unseen data.
  • Training Error: Error on the training dataset.
  • Generalization Error: Error when applying the model to new data.
  • Overfitting: Model is too complex and captures noise instead of the underlying pattern.
  • Underfitting: Model is too simple to capture the underlying trend.
  • Model Evaluation Techniques:
    • Cross-Validation (e.g., k-fold): Validates the model on different subsets to assess performance.
    • Hold-Out Method: Splits the dataset into separate training and test sets.
    • Performance Metrics:
      • Accuracy: The ratio of correct predictions to total predictions.
      • Precision: The ratio of true positives to the sum of true positives and false positives.
      • Recall: The ratio of true positives to the sum of true positives and false negatives.
      • F1-Score: The harmonic mean of precision and recall.
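The four metrics above follow directly from the confusion-matrix counts. A self-contained sketch in plain Python; the example labels and predictions are made up for illustration.

```python
# Computing accuracy, precision, recall, and F1 from raw counts
# for a binary classifier. The toy labels below are illustrative.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)                             # of predicted positives, how many were right
recall    = tp / (tp + fn)                             # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Precision and recall trade off against each other, which is why the F1-score (their harmonic mean) is often reported as a single summary.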

Machine Learning Algorithms

  • Logistic Regression: Useful for binary classification, predicts probabilities using the sigmoid function.
  • k-Nearest Neighbors (kNN): Classifies a sample by majority vote among its k nearest training points.
    • Pros: Simple, intuitive.
    • Cons: Requires a lot of memory, slow for large datasets.
  • Support Vector Machines (SVM): Finds the hyperplane that separates the classes with the maximum margin.
  • Decision Trees: Hierarchical model splitting data based on feature values.
    • Can be used for both regression and classification.
    • Prone to overfitting.
  • Ensemble Methods: Combine multiple models to improve performance.
    • Bagging: Reduces variance (e.g., Random Forest).
    • Boosting: Trains models sequentially, each correcting the errors of its predecessors (e.g., XGBoost).
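Of the algorithms above, kNN is simple enough to write from scratch, which also makes its cons visible: the whole training set must be kept in memory and scanned at prediction time. A plain-Python sketch; the value of k and the toy points are illustrative.

```python
# From-scratch kNN classifier. Note it stores all training data and
# computes every distance per query, hence the memory/speed cons.
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # near the "a" cluster
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # near the "b" cluster
```

There is no training step at all ("lazy learning"): all the work happens at query time, which is why kNN slows down as the dataset grows.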

Time Series Analysis (TSA)

  • Definition: A method to analyze time-ordered data points to extract meaningful statistics and characteristics.
  • Components:
    • Trend: Long-term increase or decrease in the data.
    • Seasonality: Patterns that occur at regular intervals, such as daily, weekly, or monthly.
  • Common Techniques:
    • ARIMA: Combines autoregressive components, differencing, and moving averages for prediction.
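The differencing ("integrated") part of ARIMA is easy to show in isolation: replacing each value with its change from the previous one removes a trend. A plain-Python sketch with a synthetic series; this is only the "I" component, not a full ARIMA fit.

```python
# Differencing, the "I" in ARIMA: subtract each value's predecessor
# to remove a trend. The linear series below is synthetic.
def difference(series, lag=1):
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [2 * t + 1 for t in range(6)]  # 1, 3, 5, 7, 9, 11 (linear upward trend)
print(difference(trend))               # constant after one difference
```

A series with a linear trend becomes constant after one difference (stationary), which is what the autoregressive and moving-average components of ARIMA then model.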

AutoML (Automated Machine Learning)

  • Concept: Tools and frameworks that automate the end-to-end process of applying machine learning to real-world problems.
  • Libraries:
    • LazyPredict: Provides simple model evaluation across multiple algorithms.
    • TPOT: Optimizes machine learning pipelines using genetic programming.
    • PyCaret: A low-code library that automates machine learning workflows.