Machine Learning Notes

Machine Learning Overview

  • Definition: Machine Learning (ML) is a field of study focused on developing algorithms that allow computers to learn from data and improve their performance on tasks without being explicitly programmed.
  • Key Terms:
    • Experience (E): The data or experience from which the machine learns.
    • Task (T): The specific task the machine is trying to accomplish.
    • Performance Measure (P): A criterion to evaluate how well the machine is performing the task.
  • Quote by Arthur Samuel (1959): "The field of study that gives computers the ability to learn without being explicitly programmed."

Machine Learning Process

  1. Data Collection: Gathering the necessary data for analysis.
  2. Data Cleansing: Removing errors, duplicates, and inconsistencies from the dataset.
  3. Feature Extraction & Selection: Identifying the most relevant attributes in the data.
  4. Model Training: Building the machine learning model using the training set.
  5. Model Evaluation: Assessing the model's performance using a validation/test dataset.
  6. Model Deployment & Integration: Putting the model into production and integrating it into existing systems.
  7. Feedback and Iteration: Continuously improving the model based on performance metrics and new data.
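The core of steps 1–5 can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available; the Iris dataset and the logistic-regression model are placeholder choices, not part of the notes above.

```python
# Minimal sketch of the collect -> split -> train -> evaluate loop.
# Assumes scikit-learn is installed; dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data collection: load a built-in toy dataset.
X, y = load_iris(return_X_y=True)

# 4. Model training: hold out a test set, fit on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 5. Model evaluation: score on the held-out test set.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Steps 6–7 (deployment and iteration) happen outside this script, e.g. serializing the fitted model and retraining as new data arrives.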

Types of Learning in Machine Learning

Basic Concepts

  • Supervised Learning: The model learns from labeled data, mapping input to output based on examples. Examples: Classification and Regression.
  • Unsupervised Learning: The model identifies patterns and relationships in unlabeled data. Example: Clustering.
  • Semi-supervised Learning: Combines a small amount of labeled data with a large amount of unlabeled data.
  • Reinforcement Learning: An agent learns by interacting with its environment to maximize cumulative rewards.
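The supervised/unsupervised distinction can be seen side by side on the same points. A small sketch assuming scikit-learn; the 1-D toy data is made up for illustration.

```python
# Supervised learning uses labels; unsupervised learning finds structure without them.
# Assumes scikit-learn is available; the tiny 1-D dataset is illustrative.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]]
y = [0, 0, 0, 1, 1, 1]  # labels available -> supervised setting

clf = LogisticRegression().fit(X, y)   # learns the input -> label mapping
print(clf.predict([[1.1], [8.1]]))     # classifies new points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # labels never used
labels = km.labels_
# Nearby points end up in the same cluster, whatever the cluster is numbered.
print(labels)
```

Note that clustering recovers the two groups without ever seeing `y` — it only guarantees grouping, not which group gets which name.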

Evaluation of Models

  • Generalization Capability: The ability to perform well on unseen data.
  • Training Error: Error on the training dataset.
  • Generalization Error: Error when applying the model to new data.
  • Overfitting: Model is too complex and captures noise instead of the underlying pattern.
  • Underfitting: Model is too simple to capture the underlying trend.
  • Model Evaluation Techniques:
    • Cross-Validation (e.g., k-fold): Validates the model on different subsets to assess performance.
    • Hold-Out Method: Splits the dataset into separate training and test sets.
    • Performance Metrics:
      • Accuracy: The ratio of correct predictions to total predictions.
      • Precision: The ratio of true positives to the sum of true positives and false positives.
      • Recall: The ratio of true positives to the sum of true positives and false negatives.
      • F1-Score: The harmonic mean of precision and recall.
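The four metrics above follow directly from the confusion-matrix counts. A self-contained sketch in plain Python; the example labels and predictions are made up for illustration.

```python
# Computing accuracy, precision, recall, and F1 from raw counts
# for a binary classifier. The toy labels below are illustrative.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)                             # of predicted positives, how many were right
recall    = tp / (tp + fn)                             # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Precision and recall trade off against each other, which is why the F1-score (their harmonic mean) is often reported as a single summary.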

Machine Learning Algorithms

  • Logistic Regression: Useful for binary classification, predicts probabilities using the sigmoid function.
  • k-Nearest Neighbors (kNN): Classifies a sample by majority vote among its k nearest training points.
    • Pros: Simple, intuitive.
    • Cons: Requires a lot of memory, slow for large datasets.
  • Support Vector Machines (SVM): Finds the hyperplane that separates the classes with the maximum margin.
  • Decision Trees: Hierarchical model splitting data based on feature values.
    • Can be used for both regression and classification.
    • Prone to overfitting.
  • Ensemble Methods: Combine multiple models to improve performance.
    • Bagging: Reduces variance (e.g., Random Forest).
    • Boosting: Trains models sequentially, each correcting the errors of its predecessors (e.g., XGBoost).
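Of the algorithms above, kNN is simple enough to write from scratch, which also makes its cons visible: the whole training set must be kept in memory and scanned at prediction time. A plain-Python sketch; the value of k and the toy points are illustrative.

```python
# From-scratch kNN classifier. Note it stores all training data and
# computes every distance per query, hence the memory/speed cons.
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # near the "a" cluster
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # near the "b" cluster
```

There is no training step at all ("lazy learning"): all the work happens at query time, which is why kNN slows down as the dataset grows.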

Time Series Analysis (TSA)

  • Definition: A method to analyze time-ordered data points to extract meaningful statistics and characteristics.
  • Components:
    • Trend: Long-term increase or decrease in the data.
    • Seasonality: Patterns that occur at regular intervals, such as daily, weekly, or monthly.
  • Common Techniques:
    • ARIMA: Combines autoregressive components, differencing, and moving averages for prediction.
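The differencing ("integrated") part of ARIMA is easy to show in isolation: replacing each value with its change from the previous one removes a trend. A plain-Python sketch with a synthetic series; this is only the "I" component, not a full ARIMA fit.

```python
# Differencing, the "I" in ARIMA: subtract each value's predecessor
# to remove a trend. The linear series below is synthetic.
def difference(series, lag=1):
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [2 * t + 1 for t in range(6)]  # 1, 3, 5, 7, 9, 11 (linear upward trend)
print(difference(trend))               # constant after one difference
```

A series with a linear trend becomes constant after one difference (stationary), which is what the autoregressive and moving-average components of ARIMA then model.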

AutoML (Automated Machine Learning)

  • Concept: Tools and frameworks that automate the end-to-end process of applying machine learning to real-world problems.
  • Libraries:
    • LazyPredict: Provides simple model evaluation across multiple algorithms.
    • TPOT: Optimizes machine learning pipelines using genetic programming.
    • PyCaret: A low-code library that automates machine learning workflows.