
Optimisation II Notes

First Order Optimisation and Gradient Descent
  • First-order optimisation: Using first derivative to optimize.

  • Gradient descent: An iterative optimization algorithm to find the minimum of a function.

  • Automatic differentiation: Algorithm to automatically compute derivative of function.

  • Jacobian matrix: Matrix containing all first-order partial derivatives of a vector-valued function.

  • Gradient vector: Vector of partial derivatives of scalar function with respect to each input.

  • Hessian matrix: Matrix of second-order partial derivatives of scalar function.

  • Lipschitz continuity: Constraint on function's steepness; gradient is bounded.

  • Stochastic relaxation: Creating smooth objective functions from discontinuous ones.

  • Stochastic gradient descent: Optimization using gradient descent on random subsets of objective function.

  • Momentum: Technique to improve gradient descent by considering past updates.

  • Second-order methods: Optimization methods using second-order derivatives and their limitations.

Deep Neural Networks
  • Deep learning aims to find an approximating function y = f(x; \theta). The parameters \theta are adjusted to minimize the distance ||f(x_i; \theta) - y_i|| over the training examples (x_i, y_i); a minimal sketch of this setup follows this list.

  • Neural networks consist of layers, each applying a linear map followed by a fixed nonlinear activation function.

  • Optimization adjusts weight matrices in each layer; the nonlinear function remains fixed.

  • Backpropagation is used for automatic differentiation, efficiently computing the gradient of the objective function with respect to the weights.

  • Heuristic search methods are slow and offer no guarantee of convergence.

  • First-order algorithms are faster than heuristic search for optimization.
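
The sketch below builds a tiny two-layer network in JAX to make the setup concrete; the layer sizes, the function names (`predict`, `loss`), and the squared-error loss are illustrative choices, not anything specified in the notes.

```python
import numpy as np
import jax.numpy as jnp
from jax import grad

def predict(params, x):
    # Each layer applies a linear map (W @ x + b) followed by a fixed
    # nonlinearity (tanh); the final layer is left linear.
    *hidden, (W_last, b_last) = params
    for W, b in hidden:
        x = jnp.tanh(W @ x + b)
    return W_last @ x + b_last

def loss(params, x, y):
    # Distance between the prediction f(x; theta) and the target y.
    return jnp.sum((predict(params, x) - y) ** 2)

# Illustrative sizes: 3 inputs -> 5 hidden units -> 1 output.
rng = np.random.default_rng(0)
params = [(jnp.array(rng.normal(size=(5, 3))), jnp.zeros(5)),
          (jnp.array(rng.normal(size=(1, 5))), jnp.zeros(1))]

x, y = jnp.ones(3), jnp.zeros(1)
# Reverse-mode automatic differentiation (backpropagation) gives the
# gradient of the loss with respect to every weight and bias at once.
grads = grad(loss)(params, x, y)
```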

Gradients
  • If f(x) is a scalar function of a scalar x, f'(x) is the first derivative.

  • Jacobian matrix generalizes the derivative to vector functions, characterizing slope at a point.

    J = \begin{bmatrix}
    \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \frac{\partial y_1}{\partial x_n} \\
    \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \frac{\partial y_2}{\partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \ldots & \frac{\partial y_m}{\partial x_n}
    \end{bmatrix}

  • Gradient vector is a single-row Jacobian for scalar functions of a vector, indicating the direction of the steepest change.

    \nabla L(\theta) = [\frac{\partial L(\theta)}{\partial \theta_1}, \frac{\partial L(\theta)}{\partial \theta_2}, \ldots, \frac{\partial L(\theta)}{\partial \theta_n}]

  • Hessian matrix is the Jacobian of the gradient vector, representing the second derivative of a scalar function of a vector; a short sketch computing all three objects follows this list.
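
This sketch uses JAX; the functions `L` and `f` are made-up examples, not from the notes.

```python
import jax.numpy as jnp
from jax import grad, jacobian, hessian

def L(theta):
    # Scalar function of a vector: has a gradient vector and a Hessian matrix.
    return jnp.sum(theta ** 2) + jnp.prod(theta)

def f(x):
    # Vector-valued function of a vector: has a Jacobian matrix.
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

theta = jnp.array([1.0, 2.0, 3.0])
print(grad(L)(theta))      # gradient, shape (3,)
print(hessian(L)(theta))   # Hessian, shape (3, 3)
print(jacobian(f)(theta))  # Jacobian, shape (2, 3)
```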

Objective Functions
  • Zeroth order optimization: requires evaluation of function only.

  • First order optimization: requires evaluation of function and its first derivative.

  • Second order optimization: requires evaluation of function, first, and second derivative.

Conditions for Optimisation with Derivatives
  • Differentiability: Smooth functions with continuous derivatives are easier to optimize iteratively.

  • First-order optimization requires the objective function to be at least C^1: differentiable, with a continuous first derivative.

  • Lipschitz continuity: requires the function's steepness (gradient) to be bounded: |L(\theta_i) - L(\theta_j)| \leq K ||\theta_i - \theta_j|| for a fixed constant K (the Lipschitz constant). A rough empirical check is sketched below.
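
The check samples random pairs of points and measures |L(\theta_i) - L(\theta_j)| / ||\theta_i - \theta_j||; the largest ratio observed is a lower bound on K. The objective here is a made-up example, not one from the notes.

```python
import numpy as np

def L(theta):
    # Illustrative smooth objective.
    return np.sum(np.cos(theta)) + 0.5 * np.sum(theta ** 2)

rng = np.random.default_rng(0)
ratios = []
for _ in range(10_000):
    a, b = rng.normal(size=3), rng.normal(size=3)
    # For a Lipschitz-continuous L this ratio never exceeds K.
    ratios.append(abs(L(a) - L(b)) / np.linalg.norm(a - b))

print("Empirical lower bound on K:", max(ratios))
```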

Gradient Descent Algorithm
  • Iterative algorithm: \theta^{(i+1)} = \theta^{(i)} - \delta \nabla L(\theta^{(i)})

    • \delta: step size hyperparameter.

  • Gradient descent follows steepest slope to find minimum.

  • Step size affects convergence speed and stability.
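
The update rule above takes only a few lines to implement; this is a minimal sketch where the quadratic objective, step size, and iteration count are arbitrary choices.

```python
import jax.numpy as jnp
from jax import grad

def L(theta):
    # Illustrative objective with its minimum at (1, -2).
    return jnp.sum((theta - jnp.array([1.0, -2.0])) ** 2)

delta = 0.1                    # step size hyperparameter
theta = jnp.array([5.0, 5.0])  # initial guess theta^(0)

for i in range(100):
    # theta^(i+1) = theta^(i) - delta * grad L(theta^(i))
    theta = theta - delta * grad(L)(theta)

print(theta)  # close to [1, -2]
```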

Step Size
  • Optimal step size is related to the Lipschitz constant K.

  • Step size can be set adaptively via line search.

  • Alternative approach: use scheduled step sizes, reducing the step size over iterations, e.g. \delta = \delta_0 \cdot \exp(-\lambda \cdot i).

  • Line search: adapt the step size by starting with a big step and shrinking it until the decrease in the objective function matches the decrease predicted by the gradient.
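
The two strategies might look like this in code; the decay constants and the Armijo-style sufficient-decrease test used for the line search are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def scheduled_step(delta_0, lam, i):
    # Scheduled step size: delta = delta_0 * exp(-lambda * i).
    return delta_0 * np.exp(-lam * i)

def backtracking_step(L, theta, g, delta_0=1.0, shrink=0.5, c=1e-4):
    # Line search: start with a big step and shrink it until the actual
    # decrease in L matches the decrease predicted by the gradient g.
    delta = delta_0
    while L(theta - delta * g) > L(theta) - c * delta * np.dot(g, g):
        delta *= shrink
    return delta
```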

Derivatives
  • Analytical derivatives:

    • Compute the derivative f'(x).

    • Solve for the derivative being zero f'(x) = 0.

    • Check if any of the solutions has positive second derivative f''(x) > 0.

  • Numerical differentiation (finite differences):

    • Approximate the derivative with a small step h (central differences): f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.

  • Automatic differentiation:

    • Automatically constructs a function that evaluates the exact derivative at any given point.
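
A quick comparison of the last two approaches; the test function and step size h are arbitrary choices for illustration.

```python
import jax.numpy as jnp
from jax import grad

def f(x):
    return jnp.sin(x) * jnp.exp(-x)

def central_difference(f, x, h=1e-3):
    # Numerical differentiation: finite-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
print(central_difference(f, x))  # approximate derivative
print(grad(f)(x))                # exact derivative via automatic differentiation
```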

Autograd
  • Automatic differentiation for NumPy code.
    - Example code:

    ```python
    import autograd.numpy as np   # thinly wrapped NumPy
    from autograd import grad

    def tanh(x):
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def fun(x):
        return 3.0 * tanh(x[0]) + x[0] * x[1] - x[2] * np.cos(x[1])

    x = np.random.randn(3)
    grads = grad(fun)(x)   # gradient of fun with respect to x
    print("Input values : ", x)
    print("Gradients : ", grads)
    ```

  • Evolved into Google JAX.
    - Example code:

    ```python
    import numpy as np
    import jax.numpy as jnp
    from jax import grad, jit

    def tanh(x):
        return (jnp.exp(x) - jnp.exp(-x)) / (jnp.exp(x) + jnp.exp(-x))

    @jit
    def fun(x):
        return 3.0 * tanh(x[0]) + x[0] * x[1] - x[2] * jnp.cos(x[1])

    x = jnp.array(np.random.randn(3))
    grads = grad(fun)(x)   # gradient of the jit-compiled function
    print("Input values : ", x)
    print("Gradients : ", grads)
    ```

Improving Gradient Descent
  • Gradient descent may get trapped in local minima.

  • Gradient descent only works on smooth, differentiable objective functions.

  • Stochastic relaxation introduces randomness so that very steep or discontinuous functions can still be optimized.

Stochastic Relaxation
  • Averaging over many random instances, even a very minor change (e.g. to a colouring) might offer an advantage on average, so the expected objective varies smoothly.

  • The gradient is rendered (approximately) Lipschitz continuous by integrating over many different random conditions.
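
A made-up illustration of the idea: averaging a discontinuous step function over random perturbations yields a smooth expected objective that gradient-based methods can work with.

```python
import numpy as np

def hard(theta):
    # Discontinuous objective: a step function gives no useful gradient signal.
    return np.where(theta > 0, 1.0, 0.0)

def relaxed(theta, sigma=0.5, n=10_000, seed=0):
    # Stochastic relaxation: average the objective over random perturbations
    # of theta. The expectation varies smoothly with theta.
    rng = np.random.default_rng(seed)
    return np.mean(hard(theta + sigma * rng.normal(size=n)))

print(relaxed(-0.5), relaxed(0.0), relaxed(0.5))  # smooth ramp from ~0 to ~1
```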

Stochastic Gradient Descent (SGD)
  • Gradient descent can be applied to randomly selected parts (mini-batches) of the objective independently.

  • Objective function can be written as a sum: L(\theta) = \sum_i L_i(\theta)

  • Advantages: lower memory consumption, fewer cache misses, and a reduced likelihood of getting stuck in poor local minima.
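
A minimal mini-batch SGD sketch on a least-squares problem; the synthetic data, batch size, and step size are illustrative choices.

```python
import numpy as np

# Synthetic least-squares problem: L(theta) = sum_i (x_i . theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
delta = 0.01
for epoch in range(20):
    # Each update uses the gradient of one random subset (mini-batch) only.
    for idx in np.array_split(rng.permutation(1000), 50):
        grad_batch = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta = theta - delta * grad_batch

print(theta)  # close to true_theta
```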

Momentum
  • Used to reduce the chance of gradient descent becoming trapped by small fluctuations.

    v^{(i+1)} = \alpha v^{(i)} + \delta \nabla L(\theta^{(i)})

    \theta^{(i+1)} = \theta^{(i)} - v^{(i+1)}

    where v is the velocity (a running average of past gradients) and \alpha is the momentum coefficient.
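
In code the momentum update is only two lines inside the loop; the elongated quadratic below is an illustrative objective on which plain gradient descent would zig-zag.

```python
import jax.numpy as jnp
from jax import grad

def L(theta):
    # Narrow valley: steep in one direction, shallow in the other.
    return theta[0] ** 2 + 25.0 * theta[1] ** 2

theta = jnp.array([5.0, 1.0])
v = jnp.zeros(2)
alpha, delta = 0.9, 0.01  # momentum coefficient and step size

for i in range(200):
    v = alpha * v + delta * grad(L)(theta)  # running average of past gradients
    theta = theta - v                       # step along the accumulated velocity

print(theta)  # approaches the minimum at the origin
```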

Critical Points
  • Critical points: points where the gradient vector is the zero vector.

  • Second-order derivatives: represent the “curvature” of a function.

  • Hessian matrix captures how gradients of a function vary with each other.

Eigenvalues of the Hessian
  • If all eigenvalues are strictly positive, the matrix is called positive definite and the point is a minimum.

  • If all eigenvalues are strictly negative, the matrix is called negative definite and the point is a maximum.

  • If the eigenvalues have mixed signs, the point is a saddle point.

  • If the eigenvalues are all positive or all negative but some are zero, the matrix is semidefinite and the point is a plateau or ridge.
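
This classification is easy to check numerically; the sketch below uses an illustrative saddle-shaped function and inspects the eigenvalues of its Hessian at a critical point.

```python
import jax.numpy as jnp
from jax import grad, hessian

def L(theta):
    # Curves upwards along theta[0] and downwards along theta[1].
    return theta[0] ** 2 - theta[1] ** 2

theta = jnp.zeros(2)
print(grad(L)(theta))  # zero vector, so this is a critical point

eigs = jnp.linalg.eigvalsh(hessian(L)(theta))
print(eigs)            # [-2.  2.]: mixed signs, so the point is a saddle
```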

Second-Order Optimisation
  • Uses the Hessian matrix to jump to the bottom of each local quadratic approximation in a single step.

  • Second-order methods generally converge in fewer iterations than first-order methods.

  • For d parameters, evaluating the Hessian matrix requires d^2 computations and d^2 storage.
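
As an illustration of this trade-off, the sketch below takes one Newton step on a small made-up quadratic: a single step lands exactly on the minimum, but forming the Hessian and solving with it is what becomes expensive as d grows.

```python
import jax.numpy as jnp
from jax import grad, hessian

def L(theta):
    # Quadratic objective 0.5 * theta^T A theta - b^T theta.
    A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
    b = jnp.array([1.0, -1.0])
    return 0.5 * theta @ A @ theta - b @ theta

theta = jnp.array([10.0, -10.0])
# Newton step: solve H step = grad L, then move by -step.
# Storing H needs O(d^2) memory; the dense solve costs O(d^3) time.
step = jnp.linalg.solve(hessian(L)(theta), grad(L)(theta))
theta = theta - step
print(theta)  # the exact minimiser of this quadratic
```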