Machine Learning Algorithms and Gradient Descent
Introduction to Machine Learning Algorithms
Machine learning algorithms can broadly be categorized into supervised, unsupervised, and reinforcement learning.
These algorithms often fit within the empirical risk minimization framework, in which learning amounts to minimizing an objective function.
Minimizing an objective function can be approached using gradient descent.
Gradient descent requires computing the derivative of the objective function, which is the focus of this lecture.
Example Scenario: A Simple Neural Network Unit
Output Unit: Represented as \ddot{y}, a function of an input pattern (vector \mathbf{s}_i) and parameters (vector \mathbf{\theta}).
Input Units: Multiple input units feed into the output unit. For example, \mathbf{s}_i = [s_{i1}, s_{i2}, s_{i3}]^T.
Activation: An input pattern \mathbf{s}_i activates each input node; s_{i1}, s_{i2}, s_{i3} are the activation levels of the respective nodes.
Connection Weights: Each input node's connection to the output unit has a corresponding weight: \theta_1, \theta_2, \theta_3 .
Output Function: The output \ddot{y} is a sigmoidal function (specifically, the logistic sigmoid) of a weighted sum of the inputs:
\ddot{y} = \sigma(\theta_1 s_{i1} + \theta_2 s_{i2} + \theta_3 s_{i3})
The logistic sigmoid function is defined as \sigma(x) = \frac{1}{1 + e^{-x}}.
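For concreteness, here is a minimal Python sketch of this single unit; the weights and activations below are made-up illustrative values, not anything from the lecture:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical example values for a three-input unit.
theta = np.array([0.5, -0.3, 0.8])   # connection weights theta_1..theta_3
s_i = np.array([1.0, 2.0, 0.5])      # input activations s_i1..s_i3

y_pred = sigmoid(theta @ s_i)        # ddot{y} = sigma(theta^T s_i)
print(y_pred)                        # about 0.574 for these values
```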
Supervised Learning Paradigm: Training data consists of input-desired response pairs: (\mathbf{s}_1, y_1), (\mathbf{s}_2, y_2), \dots, (\mathbf{s}_N, y_N). For instance, \mathbf{s}_i could be today's temperature and rainfall, and y_i could be whether it rains tomorrow (1 if rain, 0 if no rain), with \ddot{y} predicting the probability of rain tomorrow.
Empirical Risk Function (Loss Function): For this example, a mean squared error (MSE) is used:
L_N(\mathbf{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \ddot{y}(\mathbf{s}_i, \mathbf{\theta}))^2 (where y_i is the desired response and \ddot{y} is the prediction).
Goal of Learning: To minimize the empirical risk function L_N(\mathbf{\theta}) by adjusting the parameters \mathbf{\theta} (connection weights).
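A minimal sketch of evaluating L_N(\mathbf{\theta}) on a toy dataset; the weather-style data below is invented purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def empirical_risk(theta, S, y):
    """L_N(theta) = (1/N) * sum_i (y_i - sigma(theta^T s_i))^2.

    S: (N, D) matrix whose rows are the input patterns s_i.
    y: (N,) vector of desired responses y_i.
    """
    preds = sigmoid(S @ theta)          # ddot{y}(s_i, theta) for all i
    return np.mean((y - preds) ** 2)

# Invented toy data: [temperature, rainfall] -> rain tomorrow (1) or not (0).
S = np.array([[20.0, 0.0], [15.0, 5.0], [25.0, 1.0]])
y = np.array([0.0, 1.0, 0.0])
print(empirical_risk(np.zeros(2), S, y))  # loss at theta = 0 is 0.25
```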
Batch Gradient Descent Algorithm
Strategy: Iteratively update the parameters \mathbf{\theta} to minimize the loss function.
Update Rule: The general rule for parameter update is: \mathbf{\theta}_{t+1} = \mathbf{\theta}_t - \gamma_t \nabla L_N(\mathbf{\theta}_t).
\mathbf{\theta}_t: Parameters at iteration t.
\gamma_t: Learning rate (or step size), a positive scalar that controls how large a step is taken in the direction opposite the gradient.
\nabla L_N(\mathbf{\theta}_t): The gradient of the empirical risk function with respect to \mathbf{\theta} evaluated at \mathbf{\theta}_t.
Iterative Process:
Obtain an initial guess for the parameters \mathbf{\theta} .
Compute the derivative (gradient) of L_N(\mathbf{\theta}) with respect to each parameter, using all the training data.
Update the parameters using the update rule, where the learning rate \gamma_t scales the gradient before subtraction.
Each update causes a slight perturbation of the connection weights, tending to decrease the empirical risk function.
Repeat this process (taking all training data, perturbing parameters) until the parameters converge (stop changing significantly).
Terminology: The parameters \mathbf{\theta} are referred to as weights.
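Putting the pieces together, here is a minimal sketch of batch gradient descent for this single sigmoid unit, with a constant learning rate and invented toy data; the gradient expression it uses is derived in the next section:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient(theta, S, y):
    """Gradient of L_N(theta), derived below:
    dL/dtheta_j = (1/N) * sum_i -2 (y_i - p_i) p_i (1 - p_i) s_ij."""
    p = sigmoid(S @ theta)                        # predictions ddot{y}
    return (-2.0 * (y - p) * p * (1.0 - p)) @ S / len(y)

def batch_gradient_descent(S, y, gamma=0.1, n_iters=1000, tol=1e-8):
    theta = np.zeros(S.shape[1])                  # initial guess
    for _ in range(n_iters):
        step = gamma * gradient(theta, S, y)      # uses ALL training data
        theta = theta - step                      # theta_{t+1} = theta_t - gamma * grad
        if np.linalg.norm(step) < tol:            # stop once weights barely change
            break
    return theta

# Invented toy data for illustration.
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
y = np.array([0.0, 1.0, 1.0, 0.0])
print(batch_gradient_descent(S, y))
```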
Derivatives of the Sigmoid Function
The derivative of the logistic sigmoid function \sigma(x) = \frac{1}{1+e^{-x}} is essential for backpropagation.
The derivative is given by: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x)).
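This identity follows directly from the chain rule applied to (1 + e^{-x})^{-1}:
\frac{d\sigma(x)}{dx} = \frac{d}{dx}(1 + e^{-x})^{-1} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)(1 - \sigma(x)),
since 1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}.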
Computing the Gradient for the Loss Function (Backpropagation)
To compute \nabla L_N(\mathbf{\theta}), we need to find the partial derivative of L_N(\mathbf{\theta}) with respect to each parameter \theta_j.
Consider one term from the sum in L_N(\mathbf{\theta}): the squared error for a single input \mathbf{s}_i: E_i = (y_i - \ddot{y}(\mathbf{s}_i, \mathbf{\theta}))^2.
We use the chain rule to find \frac{\partial E_i}{\partial \theta_j}:
Let Net_i = \sum_{k=1}^{D} \theta_k s_{ik} be the weighted sum of inputs for pattern i, so that \ddot{y}(\mathbf{s}_i, \mathbf{\theta}) = \sigma(Net_i). (Here, D is the number of input nodes, e.g., D = 3.)
\frac{\partial E_i}{\partial \theta_j} = \frac{\partial E_i}{\partial \ddot{y}} \frac{\partial \ddot{y}}{\partial Net_i} \frac{\partial Net_i}{\partial \theta_j}.
Derivative of Error with respect to Output:
\frac{\partial E_i}{\partial \ddot{y}} = \frac{\partial}{\partial \ddot{y}} (y_i - \ddot{y})^2 = 2(y_i - \ddot{y})(-1) = -2(y_i - \ddot{y}).
Derivative of Output with respect to Net Input:
\frac{\partial \ddot{y}}{\partial Net_i} = \frac{\partial \sigma(Net_i)}{\partial Net_i} = \sigma(Net_i)(1 - \sigma(Net_i)) = \ddot{y}(1 - \ddot{y}).
Derivative of Net Input with respect to a Weight:
\frac{\partial Net_i}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} (\theta_1 s_{i1} + \dots + \theta_j s_{ij} + \dots + \theta_D s_{iD}) = s_{ij}.
Combining the terms:
\frac{\partial E_i}{\partial \theta_j} = -2(y_i - \ddot{y}) \, \ddot{y}(1 - \ddot{y}) \, s_{ij}.
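As a quick numeric check, take the illustrative values y_i = 1, \ddot{y} = 0.7, and s_{ij} = 2 (chosen arbitrarily here):
\frac{\partial E_i}{\partial \theta_j} = -2(1 - 0.7) \times 0.7(1 - 0.7) \times 2 = -2(0.3)(0.21)(2) = -0.252.
The derivative is negative, so the update rule \theta_{t+1} = \theta_t - \gamma_t \nabla L_N(\theta_t) increases \theta_j, pushing \ddot{y} up toward the desired response y_i = 1.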
Gradient for the entire loss function:
The gradient for the empirical risk function is the average of these individual gradients over all N training patterns:
\frac{\partial L_N(\mathbf{\theta})}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial E_i}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} -2(y_i - \ddot{y}(\mathbf{s}_i, \mathbf{\theta})) \, \ddot{y}(\mathbf{s}_i, \mathbf{\theta})(1 - \ddot{y}(\mathbf{s}_i, \mathbf{\theta})) \, s_{ij}.
This process of computing gradients by working backward from the output to the inputs through the network's connections is known as backpropagation.
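A standard sanity check for a hand-derived gradient like this is to compare it against a finite-difference approximation of the loss. A minimal sketch, using randomly generated toy data purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(theta, S, y):
    p = sigmoid(S @ theta)
    return np.mean((y - p) ** 2)

def analytic_gradient(theta, S, y):
    # dL/dtheta_j = (1/N) * sum_i -2 (y_i - p_i) p_i (1 - p_i) s_ij
    p = sigmoid(S @ theta)
    return (-2.0 * (y - p) * p * (1.0 - p)) @ S / len(y)

def numerical_gradient(theta, S, y, eps=1e-6):
    # Central differences: (L(theta + eps*e_j) - L(theta - eps*e_j)) / (2*eps)
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (loss(theta + e, S, y) - loss(theta - e, S, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))                      # 5 random input patterns
y = rng.integers(0, 2, size=5).astype(float)     # random 0/1 desired responses
theta = rng.normal(size=3)
print(np.allclose(analytic_gradient(theta, S, y),
                  numerical_gradient(theta, S, y)))  # expect True
```

If the two gradients disagree, the derivation (or its implementation) has an error; agreement to several decimal places is strong evidence the chain-rule calculation above is correct.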