BE 530 - Machine Learning in Python Lecture 2 Flashcards

Core Concepts of Machine Learning

  • Formal Definition of Machine Learning: A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

  • Machine Learning Components:

    • Experience ($E$): Represented by the data provided to the system.

    • Tasks ($T$): The specific prediction goals the system is designed to achieve.

    • Performance ($P$): The metrics used to evaluate how well the task is being performed.

  • General Objective: Machine Learning (ML) focuses on learning patterns from data to improve task performance.

Supervised Learning

  • Definition: A type of learning where machines learn from labeled data. Each input example is paired with a specific target value or label.

  • Goal: The algorithm aims to learn a mapping between inputs and output labels. Given input data $X$ and desired outputs $y$, the algorithm seeks to find a function $f(x)$ that approximates $y$.

  • Common Tasks in Supervised Learning:

    • Classification: Assigning inputs to discrete categories.

    • Regression: Predicting continuous numeric values.

  • Examples of Algorithms:

    • Linear Regression

    • Logistic Regression (primarily for classification)

    • Decision Trees

    • Support Vector Machines (SVM)

    • Multilayer Perceptrons (MLP)

  • Limitations and Considerations:

    • Reliance on Labels: A major limitation is the requirement for high-quality labeled data.

    • Biomedical Context: While labels (diagnoses, outcomes) often exist in biomedical applications, obtaining them can be expensive, and they may contain noise.
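
The mapping from inputs to labels described above can be sketched with a one-feature least-squares line fit. This is a minimal illustration in plain Python, not the course's implementation; the helper names `fit_line` and `predict` and the toy data are assumptions made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = w * x + b to paired examples (x, y)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(w, b, x):
    """Apply the learned mapping f(x) = w * x + b to a new input."""
    return w * x + b


# Toy labeled data (made up for illustration): labels follow y = 2x + 1.
X_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(X_train, y_train)
print(w, b)
print(predict(w, b, 5.0))
```

Because every input comes with a target value, the fit can be checked directly against the labels, which is what distinguishes this setting from the unlabeled case.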

Unsupervised Learning

  • Definition: Learning where no labels are associated with the input data. The algorithm identifies patterns and relationships among samples without explicit guidance.

  • Goal: Find hidden structure in unlabeled data $X$.

  • Types of Unsupervised Learning Tasks:

    • Association: Identifying patterns that frequently occur together. (Example: In patient populations, diabetes is often found to be associated with high blood pressure).

    • Clustering: Grouping data samples into clusters based on similar features, often using similarity or distance measures.

    • Anomaly Detection: Detecting rare or unusual patterns that deviate from typical behavior. (Examples: Spam email detection, credit card fraud detection).
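
As a rough sketch of the association task, the snippet below counts how often pairs of conditions occur together in unlabeled records. The patient records are made up for illustration, and the function name `pair_support` is hypothetical ("support" is the usual term for the fraction of records containing a pair, which is not named in these notes).

```python
from collections import Counter
from itertools import combinations

# Toy patient records (made up): each set lists the conditions present.
patients = [
    {"diabetes", "high_blood_pressure"},
    {"diabetes", "high_blood_pressure", "obesity"},
    {"high_blood_pressure"},
    {"diabetes"},
    {"obesity"},
]


def pair_support(records):
    """Fraction of records in which each pair of items occurs together."""
    counts = Counter()
    for record in records:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(record), 2):
            counts[pair] += 1
    n = len(records)
    return {pair: c / n for pair, c in counts.items()}


support = pair_support(patients)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

No labels are involved: the pattern is found purely from which items appear together in the data.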

Specific Machine Learning Tasks

ML tasks describe what kind of output the model produces. Key tasks include:

  • Classification: Determining which category an input belongs to.

    • Examples: Classifying a tumor as malignant or benign based on features like area, perimeter, patient age, and sex; Face recognition to identify individuals from images.

    • Outputs: May be a discrete class label (often numeric) or a probability distribution over classes (e.g., the probability that a tumor is malignant).

  • Regression: Predicting a continuous numeric value by fitting a model to data.

    • Example: Predicting disease progression (e.g., diabetes) using patient features like blood pressure and Body Mass Index (BMI).

  • Association: Finding co-occurrences in data.

  • Anomaly Detection: Spotting outliers.

  • Clustering: Discovering natural groupings.

  • Other Tasks: Transcription, Machine Translation, Synthesis and Sampling, and Denoising.
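
The two kinds of classification output mentioned above, a discrete label and a probability distribution over classes, can be sketched as follows. This assumes a logistic (sigmoid) link as used in logistic regression; the input `score` stands for a hypothetical model output computed from tumor features, and the 0.5 threshold is likewise an assumption.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(score):
    """Return a probability distribution over the two classes."""
    p_malignant = sigmoid(score)
    return {"benign": 1.0 - p_malignant, "malignant": p_malignant}


def predict_label(score, threshold=0.5):
    """Return a discrete numeric label: 1 for malignant, 0 for benign."""
    return 1 if sigmoid(score) >= threshold else 0


proba = predict_proba(2.0)
label = predict_label(2.0)
print(proba, label)
```

The label is derived from the probability by thresholding, so the probability carries strictly more information than the label alone.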

Performance Measurement ($P$)

Performance measures evaluate how well an algorithm learns from experience. They must be defined before training begins.

  • Metrics for Classification:

    • Accuracy: The proportion of total examples for which the model produces the correct output.

    • Error Rate: The complement of accuracy ($1 - \text{Accuracy}$), representing the proportion of misclassified examples.

  • Evaluation Strategy:

    • Performance must be monitored during training and evaluated on a separate test dataset to assess generalization.

    • Try-and-See Iterations: Model development involves adjusting metrics and tuning parameters to reach performance goals.

  • Challenges in Biomedical ML:

    • Class Imbalance: One category significantly outnumbers another.

    • Asymmetric Error Costs: The cost of a false positive may be vastly different from a false negative.
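
A small sketch of accuracy and error rate, and of why class imbalance makes accuracy misleading. The labels are toy values chosen for illustration (95 negatives, 5 positives), and the "classifier" simply predicts the majority class for every example.

```python
def accuracy(y_true, y_pred):
    """Proportion of examples for which the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Complement of accuracy: proportion of misclassified examples."""
    return 1.0 - accuracy(y_true, y_pred)


# Toy imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial model that always predicts the majority class.
y_majority = [0] * 100

acc = accuracy(y_true, y_majority)
err = error_rate(y_true, y_majority)
# Positives that the model missed (false negatives).
false_negatives = sum(
    1 for t, p in zip(y_true, y_majority) if t == 1 and p == 0
)
print(acc, err, false_negatives)
```

The accuracy looks high even though every positive case is missed. When a false negative is far more costly than a false positive, as is often the case for a missed diagnosis, accuracy alone is not an adequate performance measure.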

Capacity, Generalization, and Fitting

  • Generalization: The ability of a model to perform well on new, previously unseen data, rather than just the training set.

  • Training Error ($J^{(train)}$): The error computed on the training set. For linear regression, the Mean Squared Error (MSE) is calculated as:

    • $MSE_{(train)} = \frac{1}{m^{(train)}} \sum_{i=1}^{m^{(train)}} \left( f(x_i^{(train)}) - y_i^{(train)} \right)^2$

  • Test (Generalization) Error ($J^{(test)}$): Evaluated on a separate testing set not used during training:

    • $MSE_{(test)} = \frac{1}{m^{(test)}} \sum_{i=1}^{m^{(test)}} \left( f(x_i^{(test)}) - y_i^{(test)} \right)^2$

  • The Gap: Ideally, the test error should be close to the training error. A large gap signals overfitting.

  • Model Capacity: Refers to the model's ability to represent a wide variety of functions.

  • States of Fitting:

    • Underfitting: Occurs when the model is too simple (low capacity) and cannot capture underlying patterns. Result: High training error and high test error.

    • Overfitting: Occurs when the model fits training data too closely, capturing noise or outliers (high capacity). Result: Low training error but high test error.

  • Capacity Diagnostics:

    • Decreasing training error + increasing validation error = Overfitting.

    • Both errors high = Underfitting.

    • Both errors low and close = Good generalization.
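
The training error, test error, and the diagnostics above can be sketched as follows. The `diagnose` helper and its two thresholds are hypothetical rules of thumb, not part of the lecture, and the "model" is a deliberately extreme one that memorizes the training points and returns 0.0 for anything else.

```python
def mse(model, xs, ys):
    """Mean squared error of model(x) against targets y."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def diagnose(train_error, test_error, high_error, max_gap):
    """Hypothetical rule of thumb mirroring the diagnostics in the notes."""
    if train_error > high_error:
        return "underfitting"
    if test_error - train_error > max_gap:
        return "overfitting"
    return "good generalization"


# Toy data (made up for illustration).
x_train, y_train = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
x_test, y_test = [1.5, 2.5], [3.0, 5.0]

# A model that memorizes the training set and knows nothing else.
table = dict(zip(x_train, y_train))


def memorizer(x):
    return table.get(x, 0.0)


train_error = mse(memorizer, x_train, y_train)
test_error = mse(memorizer, x_test, y_test)
verdict = diagnose(train_error, test_error, high_error=0.5, max_gap=0.5)
print(train_error, test_error, verdict)
```

The memorizer reaches zero training error but a large test error, which is the signature of overfitting: a large gap between the two errors.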

Capacity Examples and Solutions

  • Polynomial Proxies for Capacity:

    • Linear Model ($y = wx + b$): Low capacity; prone to underfitting.

    • Quadratic Model ($y = w_2x^2 + w_1x + b$): Adequate/near-optimal capacity for many datasets; achieves low training error with a small gap.

    • 9th-degree Polynomial ($y = w_9x^9 + \dots + b$): High capacity; prone to overfitting as it fits complex patterns and noise.

  • Fixes for Underfitting:

    • Use richer features.

    • Increase model capacity.

    • Decrease regularization.

  • Fixes for Overfitting:

    • Obtain more data.

    • Apply regularization.

    • Use a simpler model.

    • Implement early stopping.
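
The polynomial capacity comparison above can be tried with `numpy.polyfit`. This is a sketch under stated assumptions: the data are synthetic, generated from a quadratic with Gaussian noise (the coefficients, noise level, sample sizes, and random seed are all made up for this example).

```python
import numpy as np

rng = np.random.default_rng(0)


def true_function(x):
    # Assumed ground truth for the synthetic data: a quadratic.
    return 1.0 + 2.0 * x - 3.0 * x**2


# Small noisy training set and a larger noisy test set on [-1, 1].
x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_function(x_train) + rng.normal(0.0, 0.2, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=50)
y_test = true_function(x_test) + rng.normal(0.0, 0.2, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 2, 9):
    # Least-squares polynomial fit of the chosen degree (capacity).
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(coeffs, x_train, y_train),
        mse(coeffs, x_test, y_test),
    )

for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only go down as the degree increases, because each higher-degree family contains the lower-degree ones. The test error is the quantity to watch: on data like these it is typically lowest near the true degree and grows again for the 9th-degree fit.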

Regularization

  • Definition: A modification to a learning algorithm intended to encourage better generalization while maintaining acceptable training error. It discourages overly complex solutions.

  • Regularized Objective Function: In regularized linear regression, a penalty term is added to the cost function:

    • $J(\theta) = MSE(X, y, \theta) + \frac{\lambda}{2} \sum_i \theta_i^2$

    • Where $\lambda$ (applied through the factor $\frac{\lambda}{2}$) controls the trade-off between fitting the data and keeping weight values small.

  • Common Regularization Types:

    • L2 Regularization (Ridge): Adds a penalty proportional to the sum of the squared coefficients ($\lambda \sum_i w_i^2$).

    • L1 Regularization (Lasso): Adds a penalty proportional to the sum of the absolute values of the coefficients ($\lambda \sum_i |w_i|$).

  • Impact of Hyperparameter $\lambda$:

    • Large $\lambda$: Penalty term dominates. Weights are forced to be very small, leading to high training error (underfitting).

    • Moderate $\lambda$: Higher-order terms are driven toward zero, effectively reducing capacity (e.g., making a 9th-degree poly behave like a quadratic) and improving generalization.

    • Zero $\lambda$: No regularization; prone to overfitting.
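
The effect of the regularization strength can be seen in the simplest possible case: a one-weight model f(x) = w * x with the objective MSE plus half of lambda times w squared. Setting the derivative of that objective to zero gives the closed form used below. The function names and the toy data are assumptions for this sketch.

```python
def objective(w, xs, ys, lam):
    """J(w) = MSE + (lam / 2) * w**2 for the model f(x) = w * x."""
    m = len(xs)
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / m
    return mse + 0.5 * lam * w**2


def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of objective() with respect to w.

    From dJ/dw = (2/m) * sum(x * (w*x - y)) + lam * w = 0.
    """
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (2.0 / m) * sxy / ((2.0 / m) * sxx + lam)


# Toy data (made up): the unregularized best weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

weights = {lam: ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 100.0)}
for lam, w in weights.items():
    print(f"lambda = {lam}: w = {w:.4f}")
```

With lambda equal to zero the fit recovers the unregularized weight. As lambda grows, the weight shrinks toward zero, and for very large lambda the penalty dominates and the model underfits.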

The Python Programming Landscape

  • Characteristics: High-level, interpreted, widely used in scientific computing, data science, and ML.

  • Strengths: Supports a vast array of libraries; can integrate with C/C++ to optimize performance; suitable for both prototyping and production.

  • Optimization: While Python is generally slower than compiled languages, high performance in ML comes from optimized back-end implementations in libraries like NumPy and Scikit-learn rather than raw Python loops.
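
To make the point about optimized back ends concrete, here is the same mean squared error computed with a raw Python loop and with a vectorized NumPy expression. The function names are hypothetical and the inputs are toy values. Both versions return the same number; the vectorized one delegates the arithmetic to NumPy's compiled routines, which is where the speed advantage on large arrays comes from.

```python
import numpy as np


def mse_loop(pred, target):
    """Mean squared error using an explicit Python loop."""
    total = 0.0
    for p, t in zip(pred, target):
        total += (p - t) ** 2
    return total / len(pred)


def mse_vectorized(pred, target):
    """Mean squared error using NumPy array operations."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Subtraction, squaring, and averaging run over whole arrays at once.
    return float(np.mean((pred - target) ** 2))


pred = [1.0, 2.0, 3.0]
target = [1.0, 2.5, 2.0]

loop_value = mse_loop(pred, target)
vector_value = mse_vectorized(pred, target)
print(loop_value, vector_value)
```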

Primary Python Scientific Packages

  • SciPy Ecosystem: A collection of interoperable packages including NumPy, Pandas, Matplotlib, IPython, and SymPy.

  • NumPy (https://numpy.org/):

    • Standard for numerical data representation.

    • Provides N-dimensional arrays (tensors) and vectorized computations.

    • Fundamental for linear algebra operations.

  • SciPy Library (https://scipy.org/):

    • Routines for numerical integration, interpolation, optimization, and statistics.

    • Includes tools for Fast Fourier Transforms (FFT) and curve fitting.

  • Pandas (https://pandas.pydata.org/):

    • Specialized for tabular data (spreadsheets, SQL tables).

    • Handles data manipulation, cleaning, and time series.

    • Provides basic summary statistics and handles various file formats (CSV, Excel, JSON).

  • Matplotlib (https://matplotlib.org/):

    • Comprehensive 2D plotting library.

    • Supports static, animated, and interactive visualizations.

    • Used in high-profile science, such as the first black hole image visualization.

  • Scikit-learn (https://scikit-learn.org/stable/):

    • The core ML library built on NumPy and SciPy.

    • Implements classification, regression, clustering, dimensionality reduction, and preprocessing.

  • Seaborn (https://seaborn.pydata.org/):

    • High-level interface for attractive statistical graphics built on Matplotlib.

  • Statsmodels (https://www.statsmodels.org/stable/index.html):

    • Focuses on statistical model estimation, inference, and econometrics. Supports R-style formulas.

  • Supporting Utilities:

    • OpenCV: Image I/O and classic computer vision operations.

    • IPython: Enhanced interactive console and kernel for Jupyter notebooks.

    • SymPy: Symbolic mathematics (algebraic manipulation, symbolic differentiation/integration).

Installation and Environment Configuration

  • Anaconda Distribution: A curated scientific stack including the Python interpreter, Navigator, and the Conda package manager.

  • Conda Environments: Isolated directories holding specific sets of packages to prevent dependency conflicts.

  • JupyterLab: A browser-based Integrated Development Environment (IDE) for notebooks, files, and terminals.

  • Jupyter Kernel: The active Python process that a specific notebook connects to; it must match the environment intended for the project.

  • Channels: Locations where Conda retrieves packages (e.g., conda-forge).
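
One way to check that a notebook's kernel matches the intended environment is to ask the running interpreter where it lives. The sketch below uses only the standard `sys` module; the helper name `describe_interpreter` is hypothetical. Inside a Conda environment, `sys.prefix` points at that environment's directory.

```python
import sys


def describe_interpreter():
    """Report which Python interpreter is running this code."""
    return {
        "executable": sys.executable,  # path to the Python binary
        "prefix": sys.prefix,          # root directory of the environment
        "version": sys.version.split()[0],
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this in a notebook cell and comparing the printed prefix with the project's environment directory shows whether the kernel is the one intended.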