Optimization Routines: Brute-Force, Gradient-Free & Gradient-Based
Overview of Optimization Routines
- The lecturer divides practical optimization strategies into three high-level categories:
- Brute-Force / Grid Search
- Gradient-Free Routines
- Gradient-Based Routines
- Each category differs in (a) the information it exploits about the objective function (smoothness, slope, etc.) and (b) its computational cost.
- Key vocabulary used throughout:
- Manifold: The mathematical “surface” (often multi-dimensional) on which the objective function is defined.
- Grid: A finite set of discrete points laid out over that manifold.
- Gradient: The vector of partial derivatives that gives the slope in every coordinate direction.
- Curse of Dimensionality: Exponential growth of required computations with respect to dimensionality, making naïve exhaustive methods infeasible.
Method 1 — Brute-Force / Grid Search
Core Idea
- Literally sample the objective function at every point on a user-defined grid across the entire domain.
- In 2-D, this means marching over x- and y-coordinates; in d dimensions, over all parameter directions.
Procedure (Step-by-Step)
- Define bounds for each parameter: [L_1, U_1], [L_2, U_2], …, [L_d, U_d].
- Choose grid resolution (how “fine” or “coarse”). E.g., on Earth you might sample every foot in latitude and longitude when searching for the highest elevation.
- Nested loops iterate over each coordinate: for x_1 = L_1 → U_1 step Δ_1, then for x_2 = L_2 → U_2 step Δ_2, and so on down to x_d, evaluating f(x) at every grid point.
- Record the minimum (or maximum) of f(x) encountered.
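The nested-loop procedure above can be sketched in plain Python; the function name `grid_search`, the toy bowl objective, and the chosen bounds are illustrative assumptions, not anything given in the lecture:

```python
import itertools

def grid_search(f, bounds, k):
    """Exhaustively evaluate f on a grid with k points per dimension.

    bounds: list of (low, high) pairs, one per parameter.
    Returns the grid point with the smallest f value.
    """
    # Build k evenly spaced sample points along each axis.
    axes = [
        [lo + i * (hi - lo) / (k - 1) for i in range(k)]
        for lo, hi in bounds
    ]
    # itertools.product collapses the d nested loops into one iterator.
    best_x, best_val = None, float("inf")
    for x in itertools.product(*axes):
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Toy 2-D bowl with its global minimum at (1, -2).
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
x_star, v = grid_search(f, bounds=[(-5, 5), (-5, 5)], k=101)
```

Even this tiny example already costs 101² ≈ 10⁴ evaluations in 2-D; every extra dimension multiplies the count by another factor of 101.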
Computational Complexity & Costs
- If k grid points are taken per dimension, total evaluations = k^d.
- For realistic AI problems, d ≫ 10, so k^d explodes:
- Example given: values like 10^10 or even 10^100000 are conceivable, far beyond any storage or compute capacity of CPUs, GPUs, or TPUs.
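The exponential blow-up is easy to verify with Python's arbitrary-precision integers (the specific k and d values below are illustrative):

```python
k = 100  # sample points per axis
for d in (1, 2, 3, 10, 1000):
    evals = k ** d                # exact: Python ints never overflow
    order = len(str(evals)) - 1   # order of magnitude; 100^d = 10^(2d)
    print(f"d = {d:>4}: about 10^{order} evaluations")
```

At d = 1000 the grid already demands about 10^2000 evaluations, vastly more than any hardware could ever perform.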
Pros (Why ever use it?)
- Guaranteed Global Optimum: Exhaustive coverage means you must land on the best point, analogous to visiting every physical location on Earth to find the highest mountain peak.
- No Smoothness Assumption: Works even if the surface is jagged, discontinuous, or nonsensical—no gradients needed.
Cons (Why do we almost never use it?)
- Astronomical Expense: Classified as “super expensive” by the speaker; essentially infeasible for contemporary AI tasks.
- Curse of Dimensionality: Cost grows exponentially with each additional parameter.
- Memory & Time Bottlenecks: No practical hardware stack can house the data or cycles required.
Use-Case Caveats
- Only applied in very rare, toy, or low-dimensional cases; mostly serves as a conceptual baseline.
Method 2 — Gradient-Free Routines (Overview Only)
- Purpose: Handle objectives where derivatives are unreliable or nonexistent (e.g., noisy, discontinuous, or black-box surfaces).
- The lecture did not name specific examples, but the family includes techniques like Nelder–Mead, simulated annealing, and evolutionary algorithms.
- Emphasized point: Even when the surface is jagged, there exist strategies that do not rely on gradients yet are far cheaper than brute force.
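As a minimal sketch of the gradient-free idea (simple random search, not one of the named techniques), the routine below needs only function evaluations, so it still works on a jagged, discontinuous surface; the objective and bounds are made-up examples:

```python
import math
import random

def random_search(f, bounds, n_samples=10_000, seed=0):
    """Gradient-free baseline: sample points uniformly at random, keep the best.

    Uses only function evaluations, so it tolerates discontinuous
    surfaces where derivatives are meaningless.
    """
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# A discontinuous, piecewise-flat surface (floor kills all gradients),
# with its minimum plateau around (1, -2).
f = lambda x: math.floor(abs(x[0] - 1)) + math.floor(abs(x[1] + 2))
x_star, v = random_search(f, [(-5, 5), (-5, 5)])
```

Unlike brute force, the evaluation budget here is fixed by n_samples rather than growing as k^d, though the trade-off is that there is no guarantee of finding the global optimum.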
Method 3 — Gradient-Based Routines (Overview Only)
- Assumption: Surface is “smooth enough,” or can be smoothed/pre-processed so gradients are meaningful.
- Modern deep-learning optimizers (SGD, Adam, RMSprop) fall here.
- Research frontier: Developing advanced smoothing and differentiation tricks to make non-smooth problems amenable to gradient exploitation (active PhD subject area).
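A bare-bones gradient-descent sketch illustrates the category; this is the common textbook update rule, not a specific optimizer from the lecture, and the quadratic objective is an assumed example:

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Minimal gradient-based routine: repeatedly step downhill along -grad f."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Same smooth 2-D bowl as before: f(x) = (x0-1)^2 + (x1+2)^2,
# so grad f(x) = [2(x0-1), 2(x1+2)] and the minimum is at (1, -2).
grad = lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)]
x_star = gradient_descent(grad, x0=[5.0, 5.0])
```

Because each step exploits slope information, 200 evaluations suffice where a fine grid would need millions; this is exactly the certainty-vs-cost trade-off the notes return to below.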
Conceptual Analogies & Examples
- Earth-height example:
- Imagine discretizing planet Earth into 1-foot squares in both x and y (latitude/longitude) and measuring elevation at each. That is brute-force grid search for the highest mountain.
- Nested-loop mental model:
- Think of multiple “for” loops, one per dimension—each extra loop multiplies cost.
Connections & Implications
- Brute force epitomizes the trade-off between certainty and feasibility: 100 % certainty of global optimum vs. intractable cost.
- Highlights why modern AI leans on probabilistic or gradient-based heuristics instead of exhaustive enumeration.
- Demonstrates how dimensionality dictates algorithm choice; same principle shows up in data-sampling, nearest-neighbor search, kernel methods, etc.
Key Numerical & Mathematical References
- Grid-search complexity: O(k^d) evaluations.
- Potential magnitude of evaluations: 10^10 to 10^100000 (illustrative, not exact).
- Gradient vector definition in d dimensions: ∇f(x) = [∂f/∂x_1, …, ∂f/∂x_d].
- No explicit formulas for gradient-free or gradient-based updates were provided, but their presence was mentioned as the distinguishing factor.
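The gradient-vector definition above can be sanity-checked with a central-difference approximation; `numerical_gradient` and the test function are illustrative assumptions, not material from the lecture:

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient [df/dx_1, ..., df/dx_d] by central differences."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h   # nudge coordinate i up
        xm = list(x); xm[i] -= h   # nudge coordinate i down
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# f(x) = x0^2 + 3*x1 has the exact gradient [2*x0, 3].
f = lambda x: x[0] ** 2 + 3 * x[1]
g = numerical_gradient(f, [2.0, -1.0])   # close to [4.0, 3.0]
```

Finite differences like this cost d extra function evaluations per gradient, which is why analytic or automatic differentiation dominates in high-dimensional deep learning.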
Practical Take-Home Points
- Never start with brute-force search for serious machine-learning tasks; treat it only as an illustrative or sanity-check tool when d≤2.
- Understand the assumptions behind your optimizer: If you can compute/approximate gradients reliably, gradient-based methods vastly outperform exhaustive or gradient-free searches.
- Always account for the curse of dimensionality when scoping computational budgets or designing experiments.