Optimization Routines: Brute-Force, Gradient-Free & Gradient-Based

Overview of Optimization Routines

  • The lecturer divides practical optimization strategies into three high-level categories:
    1. Brute-Force / Grid Search
    2. Gradient-Free Routines
    3. Gradient-Based Routines
  • Each category differs in (a) the information it exploits about the objective function (smoothness, slope, etc.) and (b) its computational cost.
  • Key vocabulary used throughout:
    • Manifold: The mathematical “surface” (often multi-dimensional) on which the objective function is defined.
    • Grid: A finite set of discrete points laid out over that manifold.
    • Gradient: The vector of partial derivatives that gives the slope in every coordinate direction.
    • Curse of Dimensionality: Exponential growth of required computations with respect to dimensionality, making naïve exhaustive methods infeasible.

Method 1 — Brute-Force / Grid Search

Core Idea
  • Literally sample the objective function at every point on a user-defined grid across the entire domain.
  • In 2-D, this means marching over x- and y-coordinates; in $d$ dimensions, over all parameter directions.
Procedure (Step-by-Step)
  • Define bounds for each parameter: $[L_1, U_1],\,[L_2, U_2],\,\ldots,\,[L_d, U_d]$.
  • Choose grid resolution (how “fine” or “coarse”). E.g., on Earth you might sample every foot in latitude and longitude when searching for the highest elevation.
  • Nested loops iterate over each coordinate:
    $\text{for } x_1 = L_1 \rightarrow U_1 \text{ step } \Delta_1:\;\; \text{for } x_2 = L_2 \rightarrow U_2 \text{ step } \Delta_2:\;\; \ldots\;\; \text{evaluate } f(\mathbf{x})$
  • Record the minimum (or maximum) of $f(\mathbf{x})$ encountered.
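The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function `grid_search`, the toy objective, and the bounds are all illustrative choices (the nested loops over dimensions are realized with `itertools.product`):

```python
import itertools
import numpy as np

def grid_search(f, bounds, k):
    """Exhaustively evaluate f on a grid with k points per dimension.

    bounds: list of (low, high) pairs, one per parameter.
    Returns the grid point with the smallest f value.
    """
    axes = [np.linspace(lo, hi, k) for lo, hi in bounds]
    best_x, best_val = None, float("inf")
    # One "for" loop per dimension, flattened into a single product:
    for x in itertools.product(*axes):
        val = f(np.array(x))
        if val < best_val:
            best_x, best_val = np.array(x), val
    return best_x, best_val

# Toy 2-D bowl with its minimum at (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
x_best, v_best = grid_search(f, bounds=[(-5, 5), (-5, 5)], k=101)
```

Note that the total work is `k ** len(bounds)` evaluations, which is exactly the $k^d$ cost discussed below.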
Computational Complexity & Costs
  • If $k$ grid points are taken per dimension, total evaluations $= k^d$.
  • For realistic AI problems, $d \gg 10$, so $k^d$ explodes:
    • Example given: values like $10^{10}$ or even $10^{100\,000}$ are conceivable, far beyond any storage or compute capacity of CPUs, GPUs, or TPUs.
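To make the exponential blow-up concrete, a quick back-of-the-envelope calculation (the particular $k$ and $d$ values below are illustrative choices, not figures from the lecture):

```python
# Evaluations for k grid points per dimension in d dimensions: k**d.
k = 100  # a fairly coarse 100 points per axis
for d in (2, 3, 10, 100):
    print(f"d={d:>3}: {k**d:.0e} evaluations")

# Even at one billion evaluations per second, d=10 already needs
# 100**10 / 1e9 = 1e11 seconds, on the order of 3,000 years.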
Pros (Why ever use it?)
  • Guaranteed Global Optimum: Exhaustive coverage means you must land on the best point, analogous to visiting every physical location on Earth to find the highest mountain peak.
  • No Smoothness Assumption: Works even if the surface is jagged, discontinuous, or nonsensical—no gradients needed.
Cons (Why we almost never use it?)
  • Astronomical Expense: Classified as “super expensive” by the speaker; essentially infeasible for contemporary AI tasks.
  • Curse of Dimensionality: Cost grows exponentially with each additional parameter.
  • Memory & Time Bottlenecks: No practical hardware stack can house the data or cycles required.
Use-Case Caveats
  • Only applied in very rare, toy, or low-dimensional cases; mostly serves as a conceptual baseline.

Method 2 — Gradient-Free Routines (Overview Only)

  • Purpose: Handle objectives where derivatives are unreliable or nonexistent (e.g., noisy, discontinuous, or black-box surfaces).
  • Specific algorithms were not named in the lecture, but common examples include Nelder–Mead, simulated annealing, and evolutionary algorithms.
  • Emphasized point: Even when the surface is jagged, there exist strategies that do not rely on gradients yet are far cheaper than brute force.
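None of these algorithms were detailed in the lecture. As a minimal sketch of the gradient-free idea, here is a naive random search: it never touches a derivative, works on jagged surfaces, and its cost is the sample budget rather than $k^d$ (the function name, sample count, and toy objective are all illustrative):

```python
import random

def random_search(f, bounds, n_samples=10_000, seed=0):
    """Gradient-free: sample random points and keep the best one.

    Far cheaper than a full grid (n_samples vs. k**d evaluations)
    and needs no derivative information at all.
    """
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Works even on a jagged, non-differentiable surface (L1 "cone").
jagged = lambda x: abs(x[0] - 1.0) + abs(x[1] + 2.0)
x_best, v_best = random_search(jagged, bounds=[(-5, 5), (-5, 5)])
```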

Method 3 — Gradient-Based Routines (Overview Only)

  • Assumption: Surface is “smooth enough,” or can be smoothed/pre-processed so gradients are meaningful.
  • Modern deep-learning optimizers (SGD, Adam, RMSprop) fall here.
  • Research frontier: Developing advanced smoothing and differentiation tricks to make non-smooth problems amenable to gradient exploitation (active PhD subject area).
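The common core of the gradient-based optimizers named above is plain gradient descent: step in the direction of the negative gradient. A minimal sketch on a smooth bowl (the function name, learning rate, and objective are illustrative, not from the lecture):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, steps=200):
    """Vanilla gradient descent: repeatedly step downhill along -grad."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

# Smooth bowl f(x) = (x0 - 1)^2 + (x1 + 2)^2 with analytic gradient.
grad_f = lambda x: np.array([2 * (x[0] - 1.0), 2 * (x[1] + 2.0)])
x_min = gradient_descent(grad_f, x0=[5.0, 5.0])  # converges near (1, -2)
```

Note the contrast with grid search: 200 gradient steps replace $k^d$ function evaluations, but only because the slope information is trusted to point toward the optimum.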

Conceptual Analogies & Examples

  • Earth-height example:
    • Imagine discretizing planet Earth into 1-foot squares in both x and y (latitude/longitude) and measuring elevation at each. That is brute-force grid search for the highest mountain.
  • Nested-loop mental model:
    • Think of multiple “for” loops, one per dimension—each extra loop multiplies cost.

Connections & Implications

  • Brute force epitomizes the trade-off between certainty and feasibility: 100 % certainty of global optimum vs. intractable cost.
  • Highlights why modern AI leans on probabilistic or gradient-based heuristics instead of exhaustive enumeration.
  • Demonstrates how dimensionality dictates algorithm choice; same principle shows up in data-sampling, nearest-neighbor search, kernel methods, etc.

Key Numerical & Mathematical References

  • Grid-search complexity: $O(k^d)$ evaluations.
  • Potential magnitude of evaluations: $10^{10}$ to $10^{100\,000}$ (illustrative, not exact).
  • Gradient vector definition in $d$ dimensions: $\nabla f(\mathbf{x}) = [\partial f/\partial x_1, \ldots, \partial f/\partial x_d]$.
  • No explicit formulas for gradient-free or gradient-based updates were provided, but their presence was mentioned as the distinguishing factor.
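The gradient definition above can be checked numerically with central differences, one coordinate direction at a time. This is a standard approximation, not a formula given in the lecture; the helper name and test function are illustrative:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x:
    grad[i] ~ (f(x + h*e_i) - f(x - h*e_i)) / (2h) for each basis vector e_i."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: x[0] ** 2 + 3 * x[1]      # analytic gradient: [2*x0, 3]
g = numerical_gradient(f, [2.0, -1.0])  # approximately [4.0, 3.0]
```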

Practical Take-Home Points

  • Never start with brute-force search for serious machine-learning tasks; treat it only as an illustrative or sanity-check tool when $d \le 2$.
  • Understand the assumptions behind your optimizer: If you can compute/approximate gradients reliably, gradient-based methods vastly outperform exhaustive or gradient-free searches.
  • Always account for the curse of dimensionality when scoping computational budgets or designing experiments.