Optimization Routines: Brute-Force, Gradient-Free & Gradient-Based
Overview of Optimization Routines
- The lecturer divides practical optimization strategies into three high-level categories:
- Brute-Force / Grid Search
- Gradient-Free Routines
- Gradient-Based Routines
- Each category differs in (a) the information it exploits about the objective function (smoothness, slope, etc.) and (b) its computational cost.
- Key vocabulary used throughout:
- Manifold: The mathematical “surface” (often multi-dimensional) on which the objective function is defined.
- Grid: A finite set of discrete points laid out over that manifold.
- Gradient: The vector of partial derivatives that gives the slope in every coordinate direction.
- Curse of Dimensionality: Exponential growth of required computations with respect to dimensionality, making naïve exhaustive methods infeasible.
Method 1 — Brute-Force / Grid Search
Core Idea
- Literally sample the objective function at every point on a user-defined grid across the entire domain.
- In 2-D, this means marching over x- and y-coordinates; in d dimensions, over all parameter directions.
Procedure (Step-by-Step)
- Define bounds for each parameter: [L_1, U_1], [L_2, U_2], …, [L_d, U_d].
- Choose grid resolution (how “fine” or “coarse”). E.g., on Earth you might sample every foot in latitude and longitude when searching for the highest elevation.
- Nested loops iterate over each coordinate: for x_1 = L_1 → U_1 step Δ_1, then for x_2 = L_2 → U_2 step Δ_2, and so on down to x_d, evaluating f(x) at every grid point.
- Record the minimum (or maximum) of f(x) encountered.
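The nested-loop procedure above can be sketched in plain Python; the function name `grid_search`, the toy bowl objective, and the chosen bounds are illustrative assumptions, not anything given in the lecture:

```python
import itertools

def grid_search(f, bounds, k):
    """Exhaustively evaluate f on a grid with k points per dimension.

    bounds: list of (low, high) pairs, one per parameter.
    Returns the grid point with the smallest f value.
    """
    # Build k evenly spaced sample points along each axis.
    axes = [
        [lo + i * (hi - lo) / (k - 1) for i in range(k)]
        for lo, hi in bounds
    ]
    # itertools.product collapses the d nested loops into one iterator.
    best_x, best_val = None, float("inf")
    for x in itertools.product(*axes):
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Toy 2-D bowl with its global minimum at (1, -2).
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
x_star, v = grid_search(f, bounds=[(-5, 5), (-5, 5)], k=101)
```

Even this tiny example already costs 101² ≈ 10⁴ evaluations in 2-D; every extra dimension multiplies the count by another factor of 101.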
Computational Complexity & Costs
- If k grid points are taken per dimension, total evaluations = k^d.
- For realistic AI problems, d ≫ 10, so k^d explodes:
- Example given: values like 10^10 or even 10^100000 are conceivable, far beyond any storage or compute capacity of CPUs, GPUs, or TPUs.
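The exponential blow-up is easy to verify with Python's arbitrary-precision integers (the specific k and d values below are illustrative):

```python
k = 100  # sample points per axis
for d in (1, 2, 3, 10, 1000):
    evals = k ** d                # exact: Python ints never overflow
    order = len(str(evals)) - 1   # order of magnitude; 100^d = 10^(2d)
    print(f"d = {d:>4}: about 10^{order} evaluations")
```

At d = 1000 the grid already demands about 10^2000 evaluations, vastly more than any hardware could ever perform.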
Pros (Why ever use it?)
- Guaranteed Global Optimum: Exhaustive coverage means you must land on the best point, analogous to visiting every physical location on Earth to find the highest mountain peak.
- No Smoothness Assumption: Works even if the surface is jagged, discontinuous, or nonsensical—no gradients needed.
Cons (Why do we almost never use it?)
- Astronomical Expense: Classified as “super expensive” by the speaker; essentially infeasible for contemporary AI tasks.
- Curse of Dimensionality: Cost grows exponentially with each additional parameter.
- Memory & Time Bottlenecks: No practical hardware stack can house the data or cycles required.
Use-Case Caveats
- Only applied in very rare, toy, or low-dimensional cases; mostly serves as a conceptual baseline.
Method 2 — Gradient-Free Routines (Overview Only)
- Purpose: Handle objectives where derivatives are unreliable or nonexistent (e.g., noisy, discontinuous, or black-box surfaces).
- The lecture did not name specific examples, but the family includes techniques like Nelder–Mead, simulated annealing, and evolutionary algorithms.
- Emphasized point: Even when the surface is jagged, there exist strategies that do not rely on gradients yet are far cheaper than brute force.
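As a minimal sketch of the gradient-free idea (simple random search, not one of the named techniques), the routine below needs only function evaluations, so it still works on a jagged, discontinuous surface; the objective and bounds are made-up examples:

```python
import math
import random

def random_search(f, bounds, n_samples=10_000, seed=0):
    """Gradient-free baseline: sample points uniformly at random, keep the best.

    Uses only function evaluations, so it tolerates discontinuous
    surfaces where derivatives are meaningless.
    """
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# A discontinuous, piecewise-flat surface (floor kills all gradients),
# with its minimum plateau around (1, -2).
f = lambda x: math.floor(abs(x[0] - 1)) + math.floor(abs(x[1] + 2))
x_star, v = random_search(f, [(-5, 5), (-5, 5)])
```

Unlike brute force, the evaluation budget here is fixed by n_samples rather than growing as k^d, though the trade-off is that there is no guarantee of finding the global optimum.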
Method 3 — Gradient-Based Routines (Overview Only)
- Assumption: Surface is “smooth enough,” or can be smoothed/pre-processed so gradients are meaningful.
- Modern deep-learning optimizers (SGD, Adam, RMSprop) fall here.
- Research frontier: Developing advanced smoothing and differentiation tricks to make non-smooth problems amenable to gradient exploitation (active PhD subject area).
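A bare-bones gradient-descent sketch illustrates the category; this is the common textbook update rule, not a specific optimizer from the lecture, and the quadratic objective is an assumed example:

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Minimal gradient-based routine: repeatedly step downhill along -grad f."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Same smooth 2-D bowl as before: f(x) = (x0-1)^2 + (x1+2)^2,
# so grad f(x) = [2(x0-1), 2(x1+2)] and the minimum is at (1, -2).
grad = lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)]
x_star = gradient_descent(grad, x0=[5.0, 5.0])
```

Because each step exploits slope information, 200 evaluations suffice where a fine grid would need millions; this is exactly the certainty-vs-cost trade-off the notes return to below.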
Conceptual Analogies & Examples
- Earth-height example:
- Imagine discretizing planet Earth into 1-foot squares in both x and y (latitude/longitude) and measuring elevation at each. That is brute-force grid search for the highest mountain.
- Nested-loop mental model:
- Think of multiple “for” loops, one per dimension—each extra loop multiplies cost.
Connections & Implications
- Brute force epitomizes the trade-off between certainty and feasibility: 100 % certainty of global optimum vs. intractable cost.
- Highlights why modern AI leans on probabilistic or gradient-based heuristics instead of exhaustive enumeration.
- Demonstrates how dimensionality dictates algorithm choice; same principle shows up in data-sampling, nearest-neighbor search, kernel methods, etc.
Key Numerical & Mathematical References
- Grid-search complexity: O(k^d) evaluations.
- Potential magnitude of evaluations: 10^10 to 10^100000 (illustrative, not exact).
- Gradient vector definition in d dimensions: ∇f(x) = [∂f/∂x_1, …, ∂f/∂x_d].
- No explicit formulas for gradient-free or gradient-based updates were provided, but their presence was mentioned as the distinguishing factor.
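The gradient-vector definition above can be sanity-checked with a central-difference approximation; `numerical_gradient` and the test function are illustrative assumptions, not material from the lecture:

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient [df/dx_1, ..., df/dx_d] by central differences."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h   # nudge coordinate i up
        xm = list(x); xm[i] -= h   # nudge coordinate i down
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# f(x) = x0^2 + 3*x1 has the exact gradient [2*x0, 3].
f = lambda x: x[0] ** 2 + 3 * x[1]
g = numerical_gradient(f, [2.0, -1.0])   # close to [4.0, 3.0]
```

Finite differences like this cost d extra function evaluations per gradient, which is why analytic or automatic differentiation dominates in high-dimensional deep learning.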
Practical Take-Home Points
- Never start with brute-force search for serious machine-learning tasks; treat it only as an illustrative or sanity-check tool when d≤2.
- Understand the assumptions behind your optimizer: If you can compute/approximate gradients reliably, gradient-based methods vastly outperform exhaustive or gradient-free searches.
- Always account for the curse of dimensionality when scoping computational budgets or designing experiments.