First-order optimization: Optimization using the first derivative of the objective.
Gradient descent: An iterative optimization algorithm to find the minimum of a function.
Automatic differentiation: Algorithm to automatically compute the derivative of a function.
Jacobian matrix: Matrix containing all first-order partial derivatives of a vector-valued function.
Gradient vector: Vector of partial derivatives of a scalar function with respect to each input.
Hessian matrix: Matrix of second-order partial derivatives of a scalar function.
Lipschitz continuity: Constraint on a function's steepness; the gradient is bounded.
Stochastic relaxation: Creating smooth objective functions from discontinuous ones.
Stochastic gradient descent: Optimization using gradient descent on random subsets of the objective function.
Momentum: Technique to improve gradient descent by considering past updates.
Second-order methods: Optimization methods using second-order derivatives and their limitations.
Deep learning aims to find an approximating function y = f(x; \theta). The parameters \theta are adjusted to minimize the distance \|f(x_i; \theta) - y_i\| over the training examples (x_i, y_i).
Neural networks consist of layers, each applying a linear map followed by a fixed nonlinear activation function.
Optimization adjusts weight matrices in each layer; the nonlinear function remains fixed.
Backpropagation is used for automatic differentiation, efficiently computing the gradient of the objective function with respect to the weights.
Heuristic search methods are slow and offer no guarantee of convergence.
First-order algorithms are faster than heuristic search for optimization.
If f(x) is a scalar function of a scalar x, f'(x) is the first derivative.
The Jacobian matrix generalizes the derivative to vector-valued functions, characterizing the slope at a point:
J = \begin{bmatrix}
\frac{\partial y_0}{\partial x_0} & \frac{\partial y_0}{\partial x_1} & \ldots & \frac{\partial y_0}{\partial x_n} \\
\frac{\partial y_1}{\partial x_0} & \frac{\partial y_1}{\partial x_1} & \ldots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_0} & \frac{\partial y_m}{\partial x_1} & \ldots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
The gradient vector is the single-row Jacobian of a scalar function of a vector; it points in the direction of steepest change.
\nabla L(\theta) = \left[\frac{\partial L(\theta)}{\partial \theta_1}, \frac{\partial L(\theta)}{\partial \theta_2}, \ldots, \frac{\partial L(\theta)}{\partial \theta_n}\right]
The Hessian matrix is the Jacobian of the gradient vector; it represents the second derivative of a scalar function of a vector.
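A small worked example (my own, for illustration): for L(\theta) = \theta_1^2 + \theta_1\theta_2, the gradient is \nabla L(\theta) = [2\theta_1 + \theta_2,\; \theta_1] and the Hessian is H = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}.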
Zeroth-order optimization: requires evaluation of the function only.
First-order optimization: requires evaluation of the function and its first derivative.
Second-order optimization: requires evaluation of the function and its first and second derivatives.
Differentiability: Smooth functions with continuous derivatives are easier to optimize iteratively.
First-order optimization requires the objective function to be at least C^1, i.e., differentiable with a continuous first derivative.
Lipschitz continuity: Requires the function's steepness to be bounded (a maximum gradient magnitude): |L(\theta_i) - L(\theta_j)| \leq K \|\theta_i - \theta_j\|, where K is a fixed constant (the Lipschitz constant).
Iterative algorithm: \theta^{(i+1)} = \theta^{(i)} - \delta \nabla L(\theta^{(i)})
\delta: step size hyperparameter.
Gradient descent follows the steepest slope to find a minimum.
Step size affects convergence speed and stability.
Optimal step size is related to the Lipschitz constant K.
Step size can be set adaptively via line search.
An alternative approach is to use a scheduled step size that decays over the iterations, e.g. \delta^{(i)} = \delta_0 \cdot \exp(-\lambda i).
Line search: adapt the step size by starting with a large step and shrinking it until the decrease in the objective function matches what the gradient predicts.
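-Example code (a minimal sketch combining the update rule above with an exponentially decayed step size; the quadratic objective and the hyperparameter values are illustrative choices, not from the source):
```python
import numpy as np

def L(theta):
    # Simple convex objective used only for illustration.
    return np.sum(theta ** 2)

def grad_L(theta):
    # Analytical gradient of the objective above.
    return 2.0 * theta

theta = np.array([3.0, -2.0])   # initial parameters
delta_0, lam = 0.1, 0.01        # initial step size and decay rate (illustrative)

for i in range(100):
    delta = delta_0 * np.exp(-lam * i)      # scheduled step size
    theta = theta - delta * grad_L(theta)   # gradient descent update

print(theta)   # close to the minimum at [0, 0]
```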
Analytical derivatives:
Compute the derivative f'(x).
Solve for the points where the derivative is zero: f'(x) = 0.
Check which solutions have a positive second derivative, f''(x) > 0; these are minima.
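A quick illustration (my own example): for f(x) = x^2 - 4x, solving f'(x) = 2x - 4 = 0 gives x = 2, and f''(x) = 2 > 0, so x = 2 is a minimum.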
Numerical differentiation (finite differences):
Approximate the derivative with central differences: f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} for a small step h.
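-Example code (a minimal central-difference sketch; the test function and step size h are illustrative assumptions):
```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Central-difference approximation of the gradient of a scalar function f at x.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

f = lambda x: x[0] ** 2 + 3.0 * x[1]                 # illustrative test function
print(numerical_gradient(f, np.array([1.0, 2.0])))   # approximately [2.0, 3.0]
```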
Automatic differentiation:
Automatically constructs a function that evaluates the exact derivative at any given point.
Autograd: automatic differentiation for NumPy code.
-Example code:
```python
import autograd.numpy as np   # thinly wrapped NumPy
from autograd import grad     # builds a function that evaluates the gradient

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def fun(x):
    return 3.0 * tanh(x[0]) + x[0] * x[1] - x[2] * np.cos(x[1])

x = np.random.randn(3)        # random input values
grads = grad(fun)(x)          # evaluate the gradient of fun at x
print("Input values : ", x)      # print the input values
print("Gradients    : ", grads)  # print the gradients
```
Autograd later evolved into Google JAX.
-Example code:
```python
import jax.numpy as jnp
import numpy as np            # used only to draw random inputs
from jax import grad, jit

def tanh(x):
    return (jnp.exp(x) - jnp.exp(-x)) / (jnp.exp(x) + jnp.exp(-x))

@jit
def fun(x):
    return 3.0 * tanh(x[0]) + x[0] * x[1] - x[2] * jnp.cos(x[1])

x = jnp.array(np.random.randn(3))   # random input values
grads = grad(fun)(x)                # evaluate the gradient of fun at x
print("Input values : ", x)      # print the input values
print("Gradients    : ", grads)  # print the gradients
```
Gradient descent may get trapped in local minima.
Gradient descent only works on smooth, differentiable objective functions.
Stochastic relaxation introduces randomness so that steep or discontinuous functions can still be optimized.
Averaged over many random instances, even a very minor change (e.g. in coloring) might offer an advantage.
The gradient is rendered (approximately) Lipschitz continuous by integrating over many different random conditions.
Gradient descent can be performed independently on randomly selected parts of the objective.
The objective function can be written as a sum: L(\theta) = \sum_i L_i(\theta)
Advantages: lower memory consumption, fewer cache misses, and a reduced likelihood of getting stuck in poor local minima.
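-Example code (a sketch of stochastic gradient descent on a summed least-squares objective; the synthetic data, batch size and step size are illustrative assumptions):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # synthetic inputs
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)    # noisy targets

theta = np.zeros(3)
delta, batch_size = 0.01, 32                        # step size and mini-batch size

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)          # random subset of the terms L_i
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size      # gradient of the mini-batch loss
    theta = theta - delta * grad                            # SGD update

print(theta)   # should approach true_theta
```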
Used to reduce the chance of gradient descent becoming trapped by small fluctuations.
v^{(i+1)} = \alpha v^{(i)} + \delta \nabla L(\theta^{(i)})
\theta^{(i+1)} = \theta^{(i)} - v^{(i+1)}
where \alpha is the momentum coefficient and v is the accumulated update (velocity).
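-Example code (a sketch of the momentum update above; the quadratic objective and the values of \alpha and \delta are illustrative):
```python
import numpy as np

def grad_L(theta):
    # Gradient of an illustrative quadratic objective L(theta) = sum(theta**2).
    return 2.0 * theta

theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
alpha, delta = 0.9, 0.05        # momentum coefficient and step size (illustrative)

for _ in range(200):
    v = alpha * v + delta * grad_L(theta)   # accumulate velocity from past gradients
    theta = theta - v                       # move by the accumulated update

print(theta)   # close to the minimum at [0, 0]
```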
Critical points: points where the gradient is the zero vector (all components are zero).
Second-order derivatives: represent the “curvature” of a function.
The Hessian matrix captures how each component of the gradient varies with each parameter.
If all eigenvalues are strictly positive, the matrix is called positive definite and the point is a minimum.
If all eigenvalues are strictly negative, the matrix is negative definite and the point is a maximum.
If eigenvalues have mixed sign the point is a saddle point.
If the eigenvalues are all non-negative (or all non-positive) but some are zero, the matrix is positive (or negative) semidefinite and the point is a plateau or ridge.
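-Example code (a sketch that classifies a critical point from the eigenvalues of a hand-written Hessian; the function L(\theta) = \theta_1^2 - \theta_2^2 is an illustrative assumption):
```python
import numpy as np

# Hessian of L(theta) = theta_1**2 - theta_2**2 at its critical point (0, 0).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigenvalues of the symmetric Hessian
if np.all(eigvals > 0):
    print("positive definite -> minimum")
elif np.all(eigvals < 0):
    print("negative definite -> maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("mixed signs -> saddle point")
else:
    print("semidefinite -> plateau/ridge")
```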
Second-order methods use the Hessian matrix to jump to the bottom of the local quadratic approximation in a single step.
They generally converge in fewer iterations than first-order methods.
However, evaluating the Hessian matrix requires O(d^2) computation and O(d^2) storage, which becomes prohibitive when d is the number of network parameters.
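-Example code (a single Newton-style step on an illustrative quadratic, showing how the Hessian is used to jump to the bottom of the local quadratic approximation in one step; the matrices are my own example):
```python
import numpy as np

# Quadratic objective L(theta) = 0.5 * theta^T A theta - b^T theta (illustrative).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

theta = np.array([5.0, 5.0])
grad = A @ theta - b                        # gradient at the current point
H = A                                       # the Hessian of a quadratic is constant
theta = theta - np.linalg.solve(H, grad)    # Newton step: lands at the exact minimum

print(theta, A @ theta - b)                 # the gradient is now (numerically) zero
```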