
Optimisation II Notes

First Order Optimisation and Gradient Descent
  • First-order optimisation: Using first derivative to optimize.

  • Gradient descent: An iterative optimization algorithm to find the minimum of a function.

  • Automatic differentiation: Algorithm to automatically compute derivative of function.

  • Jacobian matrix: Matrix containing all first-order partial derivatives of a vector-valued function.

  • Gradient vector: Vector of partial derivatives of scalar function with respect to each input.

  • Hessian matrix: Matrix of second-order partial derivatives of scalar function.

  • Lipschitz continuity: Constraint on function's steepness; gradient is bounded.

  • Stochastic relaxation: Creating smooth objective functions from discontinuous ones.

  • Stochastic gradient descent: Optimization using gradient descent on random subsets of objective function.

  • Momentum: Technique to improve gradient descent by considering past updates.

  • Second-order methods: Optimization methods using second-order derivatives and their limitations.

Deep Neural Networks
  • Deep learning aims to find an approximating function y = f(x; \theta). The parameters \theta are adjusted to minimize the distance ||f(x_i; \theta) - y_i|| over the training examples (x_i, y_i); a minimal sketch of this setup follows this list.

  • Neural networks consist of layers, each applying a linear map followed by a fixed nonlinear activation function.

  • Optimization adjusts weight matrices in each layer; the nonlinear function remains fixed.

  • Backpropagation is used for automatic differentiation, efficiently computing the gradient of the objective function with respect to the weights.

  • Heuristic search methods are slow and offer no guarantee of convergence.

  • First-order algorithms are faster than heuristic search for optimization.
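
The sketch below builds a tiny two-layer network in JAX to make the setup concrete; the layer sizes, the function names (`predict`, `loss`), and the squared-error loss are illustrative choices, not anything specified in the notes.

```python
import numpy as np
import jax.numpy as jnp
from jax import grad

def predict(params, x):
    # Each layer applies a linear map (W @ x + b) followed by a fixed
    # nonlinearity (tanh); the final layer is left linear.
    *hidden, (W_last, b_last) = params
    for W, b in hidden:
        x = jnp.tanh(W @ x + b)
    return W_last @ x + b_last

def loss(params, x, y):
    # Distance between the prediction f(x; theta) and the target y.
    return jnp.sum((predict(params, x) - y) ** 2)

# Illustrative sizes: 3 inputs -> 5 hidden units -> 1 output.
rng = np.random.default_rng(0)
params = [(jnp.array(rng.normal(size=(5, 3))), jnp.zeros(5)),
          (jnp.array(rng.normal(size=(1, 5))), jnp.zeros(1))]

x, y = jnp.ones(3), jnp.zeros(1)
# Reverse-mode automatic differentiation (backpropagation) gives the
# gradient of the loss with respect to every weight and bias at once.
grads = grad(loss)(params, x, y)
```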

Gradients
  • If f(x) is a scalar function of a scalar x, f'(x) is the first derivative.

  • Jacobian matrix generalizes the derivative to vector functions, characterizing slope at a point.

    J = \begin{bmatrix}
    \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \frac{\partial y_1}{\partial x_n} \\
    \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \frac{\partial y_2}{\partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \ldots & \frac{\partial y_m}{\partial x_n}
    \end{bmatrix}

  • Gradient vector is a single-row Jacobian for scalar functions of a vector, indicating the direction of the steepest change.

    \nabla L(\theta) = [\frac{\partial L(\theta)}{\partial \theta_1}, \frac{\partial L(\theta)}{\partial \theta_2}, \ldots, \frac{\partial L(\theta)}{\partial \theta_n}]

  • Hessian matrix is the Jacobian of the gradient vector, representing the second derivative of a scalar function of a vector; a short sketch computing all three objects follows this list.
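
This sketch uses JAX; the functions `L` and `f` are made-up examples, not from the notes.

```python
import jax.numpy as jnp
from jax import grad, jacobian, hessian

def L(theta):
    # Scalar function of a vector: has a gradient vector and a Hessian matrix.
    return jnp.sum(theta ** 2) + jnp.prod(theta)

def f(x):
    # Vector-valued function of a vector: has a Jacobian matrix.
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

theta = jnp.array([1.0, 2.0, 3.0])
print(grad(L)(theta))      # gradient, shape (3,)
print(hessian(L)(theta))   # Hessian, shape (3, 3)
print(jacobian(f)(theta))  # Jacobian, shape (2, 3)
```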

Objective Functions
  • Zeroth order optimization: requires evaluation of function only.

  • First order optimization: requires evaluation of function and its first derivative.

  • Second order optimization: requires evaluation of function, first, and second derivative.

Conditions for Optimisation with Derivatives
  • Differentiability: Smooth functions with continuous derivatives are easier to optimize iteratively.

  • First-order optimization requires the objective function to be at least C^1: differentiable, with a continuous first derivative.

  • Lipschitz continuity: requires the function's steepness (gradient) to be bounded: |L(\theta_i) - L(\theta_j)| \leq K ||\theta_i - \theta_j|| for a fixed constant K (the Lipschitz constant). A rough empirical check is sketched below.
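
The check samples random pairs of points and measures |L(\theta_i) - L(\theta_j)| / ||\theta_i - \theta_j||; the largest ratio observed is a lower bound on K. The objective here is a made-up example, not one from the notes.

```python
import numpy as np

def L(theta):
    # Illustrative smooth objective.
    return np.sum(np.cos(theta)) + 0.5 * np.sum(theta ** 2)

rng = np.random.default_rng(0)
ratios = []
for _ in range(10_000):
    a, b = rng.normal(size=3), rng.normal(size=3)
    # For a Lipschitz-continuous L this ratio never exceeds K.
    ratios.append(abs(L(a) - L(b)) / np.linalg.norm(a - b))

print("Empirical lower bound on K:", max(ratios))
```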

Gradient Descent Algorithm
  • Iterative algorithm: \theta^{(i+1)} = \theta^{(i)} - \delta \nabla L(\theta^{(i)})

    • \delta: step size hyperparameter.

  • Gradient descent follows steepest slope to find minimum.

  • Step size affects convergence speed and stability.
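
The update rule above takes only a few lines to implement; this is a minimal sketch where the quadratic objective, step size, and iteration count are arbitrary choices.

```python
import jax.numpy as jnp
from jax import grad

def L(theta):
    # Illustrative objective with its minimum at (1, -2).
    return jnp.sum((theta - jnp.array([1.0, -2.0])) ** 2)

delta = 0.1                    # step size hyperparameter
theta = jnp.array([5.0, 5.0])  # initial guess theta^(0)

for i in range(100):
    # theta^(i+1) = theta^(i) - delta * grad L(theta^(i))
    theta = theta - delta * grad(L)(theta)

print(theta)  # close to [1, -2]
```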

Step Size
  • Optimal step size is related to the Lipschitz constant K.

  • Step size can be set adaptively via line search.

  • Alternative approach: use scheduled step sizes, reducing the step size over iterations, e.g. \delta = \delta_0 \cdot \exp(-\lambda \cdot i).

  • Line search: adapt the step size by starting with a big step and shrinking it until the decrease in the objective function matches the decrease predicted by the gradient.
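
The two strategies might look like this in code; the decay constants and the Armijo-style sufficient-decrease test used for the line search are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def scheduled_step(delta_0, lam, i):
    # Scheduled step size: delta = delta_0 * exp(-lambda * i).
    return delta_0 * np.exp(-lam * i)

def backtracking_step(L, theta, g, delta_0=1.0, shrink=0.5, c=1e-4):
    # Line search: start with a big step and shrink it until the actual
    # decrease in L matches the decrease predicted by the gradient g.
    delta = delta_0
    while L(theta - delta * g) > L(theta) - c * delta * np.dot(g, g):
        delta *= shrink
    return delta
```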

Derivatives
  • Analytical derivatives:

    • Compute the derivative f'(x).

    • Solve for the derivative being zero f'(x) = 0.

    • Check if any of the solutions has positive second derivative f''(x) > 0.

  • Numerical differentiation (finite differences):

    • Approximate the derivative with a small step h (central differences): f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.

  • Automatic differentiation:

    • Automatically constructs a function that evaluates the exact derivative at any given point.
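
A quick comparison of the last two approaches; the test function and step size h are arbitrary choices for illustration.

```python
import jax.numpy as jnp
from jax import grad

def f(x):
    return jnp.sin(x) * jnp.exp(-x)

def central_difference(f, x, h=1e-3):
    # Numerical differentiation: finite-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
print(central_difference(f, x))  # approximate derivative
print(grad(f)(x))                # exact derivative via automatic differentiation
```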

Autograd
  • Automatic differentiation for NumPy code.
    - Example code:

    ```python
    import autograd.numpy as np   # thinly wrapped NumPy
    from autograd import grad

    def tanh(x):
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def fun(x):
        return 3.0 * tanh(x[0]) + x[0] * x[1] - x[2] * np.cos(x[1])

    x = np.random.randn(3)
    grads = grad(fun)(x)   # gradient of fun with respect to x
    print("Input values : ", x)
    print("Gradients : ", grads)
    ```

  • Evolved into Google JAX.
    - Example code:

    ```python
    import numpy as np
    import jax.numpy as jnp
    from jax import grad, jit

    def tanh(x):
        return (jnp.exp(x) - jnp.exp(-x)) / (jnp.exp(x) + jnp.exp(-x))

    @jit
    def fun(x):
        return 3.0 * tanh(x[0]) + x[0] * x[1] - x[2] * jnp.cos(x[1])

    x = jnp.array(np.random.randn(3))
    grads = grad(fun)(x)   # gradient of the jit-compiled function
    print("Input values : ", x)
    print("Gradients : ", grads)
    ```

Improving Gradient Descent
  • Gradient descent may get trapped in local minima.

  • Gradient descent only works on smooth, differentiable objective functions.

  • Stochastic relaxation introduces randomness so that very steep or discontinuous functions can still be optimized.

Stochastic Relaxation
  • Averaging over many random instances, even a very minor change (e.g. to a colouring) might offer an advantage on average, so the expected objective varies smoothly.

  • The gradient is rendered (approximately) Lipschitz continuous by integrating over many different random conditions.
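
A made-up illustration of the idea: averaging a discontinuous step function over random perturbations yields a smooth expected objective that gradient-based methods can work with.

```python
import numpy as np

def hard(theta):
    # Discontinuous objective: a step function gives no useful gradient signal.
    return np.where(theta > 0, 1.0, 0.0)

def relaxed(theta, sigma=0.5, n=10_000, seed=0):
    # Stochastic relaxation: average the objective over random perturbations
    # of theta. The expectation varies smoothly with theta.
    rng = np.random.default_rng(seed)
    return np.mean(hard(theta + sigma * rng.normal(size=n)))

print(relaxed(-0.5), relaxed(0.0), relaxed(0.5))  # smooth ramp from ~0 to ~1
```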

Stochastic Gradient Descent (SGD)
  • Gradient descent can be applied to randomly selected parts (mini-batches) of the objective independently.

  • Objective function can be written as a sum: L(\theta) = \sum_i L_i(\theta)

  • Advantages: lower memory consumption, fewer cache misses, and a reduced likelihood of getting stuck in poor local minima.
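
A minimal mini-batch SGD sketch on a least-squares problem; the synthetic data, batch size, and step size are illustrative choices.

```python
import numpy as np

# Synthetic least-squares problem: L(theta) = sum_i (x_i . theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
delta = 0.01
for epoch in range(20):
    # Each update uses the gradient of one random subset (mini-batch) only.
    for idx in np.array_split(rng.permutation(1000), 50):
        grad_batch = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta = theta - delta * grad_batch

print(theta)  # close to true_theta
```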

Momentum
  • Used to reduce the chance of gradient descent becoming trapped by small fluctuations.

    v^{(i+1)} = \alpha v^{(i)} + \delta \nabla L(\theta^{(i)})

    \theta^{(i+1)} = \theta^{(i)} - v^{(i+1)}

    where v is the velocity (a running average of past gradients) and \alpha is the momentum coefficient.
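
In code the momentum update is only two lines inside the loop; the elongated quadratic below is an illustrative objective on which plain gradient descent would zig-zag.

```python
import jax.numpy as jnp
from jax import grad

def L(theta):
    # Narrow valley: steep in one direction, shallow in the other.
    return theta[0] ** 2 + 25.0 * theta[1] ** 2

theta = jnp.array([5.0, 1.0])
v = jnp.zeros(2)
alpha, delta = 0.9, 0.01  # momentum coefficient and step size

for i in range(200):
    v = alpha * v + delta * grad(L)(theta)  # running average of past gradients
    theta = theta - v                       # step along the accumulated velocity

print(theta)  # approaches the minimum at the origin
```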

Critical Points
  • Critical points: points where the gradient vector is the zero vector.

  • Second-order derivatives: represent the “curvature” of a function.

  • Hessian matrix captures how gradients of a function vary with each other.

Eigenvalues of the Hessian
  • If all eigenvalues are strictly positive, the matrix is called positive definite and the point is a minimum.

  • If all eigenvalues are strictly negative, the matrix is called negative definite and the point is a maximum.

  • If the eigenvalues have mixed signs, the point is a saddle point.

  • If the eigenvalues are all positive or all negative but some are zero, the matrix is semidefinite and the point is a plateau or ridge.
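
This classification is easy to check numerically; the sketch below uses an illustrative saddle-shaped function and inspects the eigenvalues of its Hessian at a critical point.

```python
import jax.numpy as jnp
from jax import grad, hessian

def L(theta):
    # Curves upwards along theta[0] and downwards along theta[1].
    return theta[0] ** 2 - theta[1] ** 2

theta = jnp.zeros(2)
print(grad(L)(theta))  # zero vector, so this is a critical point

eigs = jnp.linalg.eigvalsh(hessian(L)(theta))
print(eigs)            # [-2.  2.]: mixed signs, so the point is a saddle
```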

Second-Order Optimisation
  • Uses the Hessian matrix to jump to the bottom of each local quadratic approximation in a single step.

  • Second-order methods generally converge in fewer iterations than first-order methods.

  • For d parameters, evaluating the Hessian matrix requires d^2 computations and d^2 storage.
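
As an illustration of this trade-off, the sketch below takes one Newton step on a small made-up quadratic: a single step lands exactly on the minimum, but forming the Hessian and solving with it is what becomes expensive as d grows.

```python
import jax.numpy as jnp
from jax import grad, hessian

def L(theta):
    # Quadratic objective 0.5 * theta^T A theta - b^T theta.
    A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
    b = jnp.array([1.0, -1.0])
    return 0.5 * theta @ A @ theta - b @ theta

theta = jnp.array([10.0, -10.0])
# Newton step: solve H step = grad L, then move by -step.
# Storing H needs O(d^2) memory; the dense solve costs O(d^3) time.
step = jnp.linalg.solve(hessian(L)(theta), grad(L)(theta))
theta = theta - step
print(theta)  # the exact minimiser of this quadratic
```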