BE 530 - Machine Learning in Python Lecture 2 Flashcards
Core Concepts of Machine Learning
Formal Definition of Machine Learning: A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
Machine Learning Components:
Experience ($E$): Represented by the data provided to the system.
Tasks ($T$): The specific prediction goals the system is designed to achieve.
Performance ($P$): The metrics used to evaluate how well the task is being performed.
General Objective: Machine Learning (ML) focuses on learning patterns from data to improve task performance.
Supervised Learning
Definition: A type of learning where machines learn from labeled data. Each input example is paired with a specific target value or label.
Goal: The algorithm aims to learn a mapping between inputs and output labels. Given input data $X$ and desired outputs $y$, the algorithm seeks a function $f$ such that $f(X) \approx y$.
Common Tasks in Supervised Learning:
Classification: Assigning inputs to discrete categories.
Regression: Predicting continuous numeric values.
Examples of Algorithms:
Linear Regression
Logistic Regression (primarily for classification)
Decision Trees
Support Vector Machines (SVM)
Multilayer Perceptrons (MLP)
Limitations and Considerations:
Reliance on Labels: A major limitation is the requirement for high-quality labeled data.
Biomedical Context: While labels (diagnoses, outcomes) often exist in biomedical applications, obtaining them can be expensive, and they may contain noise.
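Example (supervised learning sketch): A minimal illustration of learning a mapping from labeled data, assuming scikit-learn is installed. The synthetic data, the true slope 3 and intercept 2, and all variable names are assumptions made up for this example, not course material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical labeled data: each input x is paired with a target y = 3x + 2 + noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# Learn a function f so that f(X) approximates y.
model = LinearRegression().fit(X, y)
slope, intercept = model.coef_[0], model.intercept_
print(f"learned f(x) = {slope:.2f} * x + {intercept:.2f}")
```

The learned slope and intercept should land close to the values used to generate the data, which is only possible because every input came with a label.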
Unsupervised Learning
Definition: Learning where no labels are associated with the input data. The algorithm identifies patterns and relationships among samples without explicit guidance.
Goal: Find hidden structure in unlabeled data $X$.
Types of Unsupervised Learning Tasks:
Association: Identifying patterns that frequently occur together. (Example: In patient populations, diabetes is often found to be associated with high blood pressure).
Clustering: Grouping data samples into clusters of similar samples, typically using a similarity or distance measure computed on their features.
Anomaly Detection: Detecting rare or unusual patterns that deviate from typical behavior. (Examples: Spam email detection, credit card fraud detection).
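Example (clustering sketch): One possible way to group unlabeled samples by distance, using k-means from scikit-learn. The two synthetic blobs and the choice of two clusters are assumptions for illustration; k-means is just one clustering algorithm among many.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated groups of 2D points; no labels are given to the algorithm.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# k-means assigns each sample to the nearest of two learned cluster centers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print("cluster sizes:", np.bincount(labels))
```

The cluster indices (0 or 1) are arbitrary; only the grouping itself carries meaning.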
Specific Machine Learning Tasks
An ML task describes what the model produces as output. Key tasks include:
Classification: Determining which category an input belongs to.
Examples: Classifying a tumor as malignant or benign based on features like area, perimeter, patient age, and sex; Face recognition to identify individuals from images.
Outputs: May be a discrete class label (often numeric) or a probability distribution over classes (e.g., the probability that a tumor is malignant).
Regression: Predicting a continuous numeric value by fitting a model to data.
Example: Predicting disease progression (e.g., diabetes) using patient features like blood pressure and Body Mass Index (BMI).
Association: Finding co-occurrences in data.
Anomaly Detection: Spotting outliers.
Clustering: Discovering natural groupings.
Other Tasks: Transcription, Machine Translation, Synthesis and Sampling, and Denoising.
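Example (classification outputs sketch): The two kinds of classifier output mentioned above, a discrete label and a probability distribution over classes, can be seen with scikit-learn's LogisticRegression. The single synthetic feature (loosely imagined as a tumor-size measurement) and the class means are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical one-feature dataset: class 1 tends to have larger feature values.
X = np.concatenate([rng.normal(2.0, 1.0, 100), rng.normal(5.0, 1.0, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X, y)

new_samples = np.array([[1.0], [3.5], [6.0]])
labels = clf.predict(new_samples)        # discrete class labels (0 or 1)
probs = clf.predict_proba(new_samples)   # one row per sample: [P(class 0), P(class 1)]
print(labels)
print(probs.round(3))
```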
Performance Measurement ($P$)
Performance measures evaluate how well an algorithm learns from experience. They must be defined before training begins.
Metrics for Classification:
Accuracy: The proportion of total examples for which the model produces the correct output.
Error Rate: The complement of accuracy ($1 - \text{accuracy}$), representing the proportion of misclassified examples.
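Example (accuracy and error rate sketch): Both metrics can be computed directly from predictions with NumPy. The label vectors below are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and model predictions for 10 examples.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

accuracy = np.mean(y_true == y_pred)   # fraction of correct predictions
error_rate = 1.0 - accuracy            # fraction of misclassified examples
print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```

Here 8 of the 10 predictions match, so the accuracy is 0.80 and the error rate is 0.20.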
Evaluation Strategy:
Performance must be monitored during training and evaluated on a separate test dataset to assess generalization.
Try-and-See Iterations: Model development involves adjusting metrics and tuning parameters to reach performance goals.
Challenges in Biomedical ML:
Class Imbalance: One category significantly outnumbers another.
Asymmetric Error Costs: The cost of a false positive may be vastly different from a false negative.
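Example (class imbalance sketch): A small illustration of why accuracy alone can mislead when one class dominates. The 5% positive rate and the "always predict negative" model are assumptions chosen to make the effect obvious.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical screening data: 5 positive cases among 100 patients.
y_true = np.array([1] * 5 + [0] * 95)

# A trivial model that always predicts the majority (negative) class.
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_true == y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
recall = tp / (tp + fn)   # fraction of true positives that were detected

print(f"accuracy = {accuracy:.2f}")
print(f"false negatives = {fn}, recall = {recall:.2f}")
```

The model reaches 95% accuracy while missing every positive case. When a false negative is far more costly than a false positive, a metric such as recall exposes the problem that accuracy hides.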
Capacity, Generalization, and Fitting
Generalization: The ability of a model to perform well on new, previously unseen data, rather than just the training set.
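Training Error ($\mathrm{MSE}_{\text{train}}$): The error computed on the training set. For linear regression, the Mean Squared Error (MSE) over the $m_{\text{train}}$ training examples is calculated as:
$$\mathrm{MSE}_{\text{train}} = \frac{1}{m_{\text{train}}} \sum_{i=1}^{m_{\text{train}}} \left(\hat{y}_i^{(\text{train})} - y_i^{(\text{train})}\right)^2$$
Test (Generalization) Error ($\mathrm{MSE}_{\text{test}}$): Evaluated on a separate testing set of $m_{\text{test}}$ examples not used during training:
$$\mathrm{MSE}_{\text{test}} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \left(\hat{y}_i^{(\text{test})} - y_i^{(\text{test})}\right)^2$$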
The Gap: Ideally, the test error should be close to the training error. A large gap signals overfitting.
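Example (training vs. test error sketch): One way to measure both errors is to hold out part of the data before fitting. The synthetic data, the 75/25 split, and the helper name mse are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: a linear relationship plus unit-variance noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 1.0, size=200)

# Hold out 25% of the samples; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)

def mse(y_true, y_hat):
    """Mean squared error between targets and predictions."""
    return np.mean((y_hat - y_true) ** 2)

mse_train = mse(y_train, model.predict(X_train))
mse_test = mse(y_test, model.predict(X_test))
print(f"train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}, gap = {mse_test - mse_train:.3f}")
```

Because the model family matches the data here, the two errors should be close; a much larger test error would point to overfitting.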
Model Capacity: Refers to the model's ability to represent a wide variety of functions.
States of Fitting:
Underfitting: Occurs when the model is too simple (low capacity) and cannot capture underlying patterns. Result: High training error and high test error.
Overfitting: Occurs when a high-capacity model fits the training data too closely, capturing noise or outliers. Result: Low training error but high test error.
Capacity Diagnostics:
Decreasing training error + increasing validation error = Overfitting.
Both errors high = Underfitting.
Both errors low and close = Good generalization.
Capacity Examples and Solutions
Polynomial Proxies for Capacity:
Linear Model ($\hat{y} = b + wx$): Low capacity; prone to underfitting.
Quadratic Model ($\hat{y} = b + w_1 x + w_2 x^2$): Adequate/near-optimal capacity for many datasets; achieves low training error with a small gap.
9th-degree Polynomial ($\hat{y} = b + \sum_{i=1}^{9} w_i x^i$): High capacity; prone to overfitting as it fits complex patterns and noise.
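Example (polynomial capacity sketch): The three capacity levels above can be compared by fitting polynomials of degree 1, 2, and 9 to the same small dataset. The quadratic ground truth, noise level, and sample sizes are assumptions picked for illustration; results on real data will differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def true_function(x):
    # Assumed quadratic ground truth, chosen only for this illustration.
    return 2.0 * x**2 - x

# A small noisy training set and a larger held-out test set.
x_train = rng.uniform(-1, 1, size=10).reshape(-1, 1)
y_train = true_function(x_train[:, 0]) + rng.normal(0, 0.1, size=10)
x_test = rng.uniform(-1, 1, size=200).reshape(-1, 1)
y_test = true_function(x_test[:, 0]) + rng.normal(0, 0.1, size=200)

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    # Polynomial features of the given degree followed by ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    model.fit(x_train, y_train)
    train_mse[degree] = np.mean((model.predict(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, test MSE = {test_mse[degree]:.4f}")
```

Training error can only go down as the degree rises, since each model contains the previous one as a special case. The test error is the quantity to watch: the linear model should show high error on both sets, while the degree-9 model drives training error toward zero without a matching improvement on held-out data.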
Fixes for Underfitting:
Use richer features.
Increase model capacity.
Decrease regularization.
Fixes for Overfitting:
Obtain more data.
Apply regularization.
Use a simpler model.
Implement early stopping.
Regularization
Definition: A modification to a learning algorithm intended to encourage better generalization while maintaining acceptable training error. It discourages overly complex solutions.
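Regularized Objective Function: In regularized linear regression, a penalty term is added to the cost function:
$$J(w) = \mathrm{MSE}_{\text{train}} + \lambda\, w^\top w$$
Where $\mathrm{MSE}_{\text{train}}$ is the training mean squared error, $w$ is the weight vector, and the hyperparameter $\lambda$ ($\lambda \ge 0$) controls the trade-off between fitting the data and keeping weight values small.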
Common Regularization Types:
L2 Regularization (Ridge): Adds a penalty proportional to the sum of the squared coefficients ($\lambda \sum_j w_j^2$).
L1 Regularization (Lasso): Adds a penalty proportional to the sum of the absolute values of the coefficients ($\lambda \sum_j |w_j|$).
Impact of Hyperparameter $\lambda$:
Large $\lambda$: Penalty term dominates. Weights are forced to be very small, leading to high training error (underfitting).
Moderate $\lambda$: Higher-order terms are driven toward zero, effectively reducing capacity (e.g., making a 9th-degree poly behave like a quadratic) and improving generalization.
Zero $\lambda$: No regularization; prone to overfitting.
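Example (regularization strength sketch): The effect of the regularization strength (lambda) can be observed by fitting a 9th-degree polynomial with L2 regularization at several strengths. This sketch assumes scikit-learn, where the Ridge parameter alpha plays the role of lambda; Ridge penalizes the sum of squared errors rather than the mean, so its alpha is not numerically identical to a lambda in an MSE-based objective. The data and the three strengths are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical noisy samples from a quadratic relationship.
x = rng.uniform(-1, 1, size=15).reshape(-1, 1)
y = 2.0 * x[:, 0] ** 2 - x[:, 0] + rng.normal(0, 0.1, size=15)

results = {}
for lam in (0.0, 0.01, 1000.0):
    # lam = 0 means no penalty, so plain least squares is used for that case.
    regressor = LinearRegression() if lam == 0.0 else Ridge(alpha=lam)
    model = make_pipeline(PolynomialFeatures(9, include_bias=False), regressor)
    model.fit(x, y)
    weight_norm = np.linalg.norm(regressor.coef_)
    train_mse = np.mean((model.predict(x) - y) ** 2)
    results[lam] = (weight_norm, train_mse)
    print(f"lambda = {lam:g}: ||w|| = {weight_norm:.3f}, train MSE = {train_mse:.4f}")
```

As the strength grows, the weight norm shrinks and the training error rises. A very large value pushes the fit toward a constant, which is the underfitting case described above.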
The Python Programming Landscape
Characteristics: High-level, interpreted, widely used in scientific computing, data science, and ML.
Strengths: Supports a vast array of libraries; can integrate with C/C++ to optimize performance; suitable for both prototyping and production.
Optimization: While Python is generally slower than compiled languages, high performance in ML comes from optimized back-end implementations in libraries like NumPy and Scikit-learn rather than raw Python loops.
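Example (vectorization sketch): The same sum of squares computed with a raw Python loop and with a single NumPy call. The array size is arbitrary; the point is that the NumPy version hands the loop to compiled code, which is typically much faster on large arrays.

```python
import numpy as np

x = np.arange(1_000, dtype=np.float64)

# Raw Python loop: one interpreted iteration per element.
total_loop = 0.0
for value in x:
    total_loop += value * value

# Vectorized: a single call that runs in NumPy's compiled back end.
total_vectorized = float(np.dot(x, x))

print(total_loop, total_vectorized)
```

Both approaches give the same result; only the place where the loop runs differs.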
Primary Python Scientific Packages
SciPy Ecosystem: A collection of interoperable packages including NumPy, Pandas, Matplotlib, IPython, and SymPy.
NumPy (https://numpy.org/):
Standard for numerical data representation.
Provides N-dimensional arrays (tensors) and vectorized computations.
Fundamental for linear algebra operations.
SciPy Library (https://scipy.org/):
Routines for numerical integration, interpolation, optimization, and statistics.
Includes tools for Fast Fourier Transforms (FFT) and curve fitting.
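Example (SciPy curve fitting sketch): A short illustration of scipy.optimize.curve_fit recovering the parameters of a model from noisy samples. The exponential-decay model, its true parameters (2.0 and 0.8), and the noise level are assumptions invented for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, amplitude, rate):
    """Hypothetical model: amplitude * exp(-rate * t)."""
    return amplitude * np.exp(-rate * t)

rng = np.random.default_rng(0)

# Noisy samples generated from the model with amplitude 2.0 and rate 0.8.
t = np.linspace(0, 5, 50)
y = exp_decay(t, 2.0, 0.8) + rng.normal(0, 0.02, size=t.size)

# Least-squares fit starting from an initial guess p0.
params, covariance = curve_fit(exp_decay, t, y, p0=[1.0, 1.0])
amplitude_hat, rate_hat = params
print(f"amplitude = {amplitude_hat:.3f}, rate = {rate_hat:.3f}")
```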
Pandas (https://pandas.pydata.org/):
Specialized for tabular data (spreadsheets, SQL tables).
Handles data manipulation, cleaning, and time series.
Provides basic summary statistics and handles various file formats (CSV, Excel, JSON).
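Example (Pandas sketch): A minimal look at tabular data handling: building a table, dropping a row with a missing value, and computing a grouped summary. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical patient table with one missing BMI value.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "bmi": [22.5, np.nan, 31.0, 27.5],
    "group": ["control", "diabetic", "diabetic", "control"],
})

# Cleaning: drop rows where BMI is missing.
clean = df.dropna(subset=["bmi"])

# Summary statistics: mean BMI per group.
mean_bmi = clean.groupby("group")["bmi"].mean()
print(mean_bmi)
```

In practice the table would usually come from a file, for example via pd.read_csv, rather than being typed in by hand.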
Matplotlib (https://matplotlib.org/):
Comprehensive 2D plotting library.
Supports static, animated, and interactive visualizations.
Used in high-profile science, such as the first black hole image visualization.
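Example (Matplotlib sketch): A basic static 2D line plot. The non-interactive "Agg" backend and the in-memory buffer are assumptions made so the sketch runs without a display; in a Jupyter notebook the figure would normally render inline and neither is needed.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure to PNG bytes in memory instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print(f"rendered {len(buffer.getvalue())} bytes of PNG data")
```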
Scikit-learn (https://scikit-learn.org/stable/):
The core ML library built on NumPy and SciPy.
Implements classification, regression, clustering, dimensionality reduction, and preprocessing.
Seaborn (https://seaborn.pydata.org/):
High-level interface for attractive statistical graphics built on Matplotlib.
Statsmodels (https://www.statsmodels.org/stable/index.html):
Focuses on statistical model estimation, inference, and econometrics. Supports R-style formulas.
Supporting Utilities:
OpenCV: Image I/O and classic computer vision operations.
IPython: Enhanced interactive console and kernel for Jupyter notebooks.
SymPy: Symbolic mathematics (algebraic manipulation, symbolic differentiation/integration).
Installation and Environment Configuration
Anaconda Distribution: A curated scientific stack including the Python interpreter, Navigator, and the Conda package manager.
Conda Environments: Isolated directories holding specific sets of packages to prevent dependency conflicts.
JupyterLab: A browser-based Integrated Development Environment (IDE) for notebooks, files, and terminals.
Jupyter Kernel: The active Python process that a specific notebook connects to; it must match the environment intended for the project.
Channels: Locations where Conda retrieves packages (e.g., conda-forge).
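Example (kernel and environment check sketch): One way to confirm which interpreter and environment a notebook kernel is actually using is to run a cell like the following. It relies on the CONDA_DEFAULT_ENV variable that Conda sets when an environment is activated; the variable may be absent if the kernel was started outside an activated environment.

```python
import os
import sys

# Path of the Python interpreter behind the running kernel.
print("interpreter:", sys.executable)

# Root directory of the environment that interpreter belongs to.
print("environment prefix:", sys.prefix)

# Name of the active Conda environment, if Conda set it.
env_name = os.environ.get("CONDA_DEFAULT_ENV", "(not set)")
print("conda environment:", env_name)
```

If the printed paths do not point into the environment intended for the project, the notebook is connected to the wrong kernel.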