Definition: Machine Learning (ML) is a field of study focused on developing algorithms that allow computers to learn from data and improve their performance on tasks without being explicitly programmed.
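Key Terms (from Tom Mitchell's definition: a program learns from experience E with respect to task T and performance measure P if its performance at T, as measured by P, improves with E):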
Experience (E): The data or experience from which the machine learns.
Task (T): The specific task the machine is trying to accomplish.
Performance Measure (P): A criterion to evaluate how well the machine is performing the task.
Quote commonly attributed to Arthur Samuel (1959): "The field of study that gives computers the ability to learn without being explicitly programmed."
Machine Learning Process
Data Collection: Gathering the necessary data for analysis.
Data Cleansing: Handling missing values, duplicates, and inconsistencies in the dataset.
Feature Extraction & Selection: Deriving informative attributes from the raw data and keeping those most relevant to the task.
Model Training: Building the machine learning model using the training set.
Model Evaluation: Assessing the model's performance using a validation/test dataset.
Model Deployment & Integration: Putting the model into production and integrating it into existing systems.
Feedback and Iteration: Continuously improving the model based on performance metrics and new data.
Types of Learning in Machine Learning
Basic Concepts
Supervised Learning: The model learns from labeled data, mapping input to output based on examples. Examples: Classification and Regression.
Unsupervised Learning: The model identifies patterns and relationships in unlabeled data. Example: Clustering.
Semi-supervised Learning: Combines a small amount of labeled data with a large amount of unlabeled data.
Reinforcement Learning: An agent learns by interacting with its environment to maximize cumulative rewards.
Evaluation of Models
Generalization Capability: The ability to perform well on unseen data.
Training Error: Error on the training dataset.
Generalization Error: Error when applying the model to new data.
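Overfitting: The model is too complex and captures noise instead of the underlying pattern; training error is low but generalization error is high.
Underfitting: The model is too simple to capture the underlying trend; both training error and generalization error are high.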
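The gap between training error and generalization error can be illustrated with a small sketch. Everything here is an assumption chosen for illustration: the toy data (y = 2x plus a deterministic noise term), the three candidate models, and the use of mean squared error as the error measure.

```python
# Toy data: y = 2x plus a small deterministic "noise" term (an assumption
# made so the example is reproducible without a random seed).
def noise(i):
    return (7 * i) % 5 - 2  # cycles through -2, 0, 2, -1, 1


points = [(x, 2 * x + noise(x)) for x in range(20)]
train = points[0::2]  # even x values, used for fitting
test = points[1::2]   # odd x values, never seen during fitting


def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def fit_constant(data):
    """Underfitting candidate: always predict the mean training target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Least-squares straight line, which matches the true trend here."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """Overfitting candidate: return the target of the nearest training x."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


models = {
    "constant (underfit)": fit_constant(train),
    "line": fit_line(train),
    "memorizer (overfit)": fit_memorizer(train),
}
errors = {name: (mse(m, train), mse(m, test)) for name, m in models.items()}

for name, (train_err, test_err) in errors.items():
    print(f"{name:22s} train MSE = {train_err:7.2f}   test MSE = {test_err:7.2f}")
```

On this toy data the memorizer reaches zero training error yet does worse than the line on the held-out points, while the constant model is poor on both sets.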
Model Evaluation Techniques:
Cross-Validation (e.g., k-fold): Splits the data into k folds, trains on k-1 of them, validates on the remaining fold, and averages the results over all k rotations.
Hold-Out Method: Splits the dataset once into separate training and test sets.
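The two splitting strategies above can be sketched in a few lines of standard-library Python. The function names, the default test fraction, and the fixed seed are illustrative assumptions, not part of any particular library.

```python
import random


def hold_out_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices once and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, validation in enumerate(folds):
        # Training indices are all folds except the current validation fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = hold_out_split(10, test_fraction=0.3)
print("hold-out train:", sorted(train_idx), "test:", sorted(test_idx))

for fold, (train_idx, val_idx) in enumerate(k_fold_splits(10, k=5)):
    print(f"fold {fold}: validate on {sorted(val_idx)}")
```

In k-fold, every sample is used for validation exactly once, which is why the averaged score tends to be less dependent on one lucky or unlucky split than a single hold-out estimate.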
Performance Metrics:
Accuracy: The ratio of correct predictions to total predictions.
Precision: The ratio of true positives to the sum of true positives and false positives.
Recall: The ratio of true positives to the sum of true positives and false negatives.
F1-Score: The harmonic mean of precision and recall.
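As a minimal sketch, the four metrics above can be computed directly from true and predicted labels. The function name, the example labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 1 false positive,
# 1 false negative, 5 true negatives.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Here accuracy (0.80) is higher than precision and recall (0.75 each) because the true negatives count toward accuracy only.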
Machine Learning Algorithms
Logistic Regression: Useful for binary classification; predicts class probabilities by passing a linear combination of the features through the sigmoid function.
k-Nearest Neighbors (kNN): Classifies a point by majority vote among the k training points closest to it.
Pros: Simple, intuitive.
Cons: Stores the entire training set, so it requires a lot of memory and prediction is slow for large datasets.
Support Vector Machines (SVM): Finds the hyperplane that separates the classes with the largest margin.
Decision Trees: Hierarchical model splitting data based on feature values.
Can be used for both regression and classification.
Prone to overfitting.
Ensemble Methods: Combine multiple models to improve performance.
Bagging: Reduces variance (e.g., Random Forest).
Boosting: Trains models sequentially, each one correcting the errors of the previous ones (e.g., XGBoost).
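Of the algorithms listed above, kNN is simple enough to sketch in full. The version below is a minimal illustration that assumes Euclidean distance and an unweighted majority vote; the function name and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the training pairs by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training set with two well-separated classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (6.0, 6.5)))
```

There is no training step: every prediction scans the whole training set, which is the source of the memory and speed drawbacks noted for kNN.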
Time Series Analysis (TSA)
Definition: A method to analyze time-ordered data points to extract meaningful statistics and characteristics.
Components:
Trend: Long-term increase or decrease in the data.
Seasonality: Patterns that occur at regular intervals, such as daily, weekly, or monthly.
Common Techniques:
ARIMA: Combines autoregressive components, differencing, and moving averages for prediction.
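The differencing step in ARIMA can be sketched on a toy series. The series below is a hypothetical example built from a linear trend plus a period-4 seasonal pattern; lag-4 (seasonal) differencing is shown for comparison and belongs to seasonal extensions of ARIMA rather than the basic model.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: a linear trend (+2 per step) plus a repeating
# seasonal pattern of period 4.
seasonal_pattern = [0, 3, -1, -2]
series = [2 * t + seasonal_pattern[t % 4] for t in range(12)]

# Lag-1 differencing removes the upward trend; the seasonal swings remain.
first_diff = difference(series, lag=1)

# Lag-4 differencing compares each value with the same season one period
# earlier, which cancels the seasonal pattern.
seasonal_diff = difference(series, lag=4)

print("series:       ", series)
print("first diff:   ", first_diff)
print("seasonal diff:", seasonal_diff)
```

In this example the seasonal difference is constant, because the only change between matching seasons is the trend accumulated over four steps.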
AutoML (Automated Machine Learning)
Concept: Tools and frameworks that automate the end-to-end process of applying machine learning to real-world problems.
Libraries:
LazyPredict: Provides simple model evaluation across multiple algorithms.
TPOT: Optimizes machine learning pipelines using genetic programming.
PyCaret: A low-code library that automates machine learning workflows.