
Machine Learning Algorithms and Gradient Descent

Introduction to Machine Learning Algorithms
  • Machine learning algorithms can broadly be categorized as supervised, unsupervised, or reinforcement learning.

  • These algorithms often fit within the empirical risk function framework, aiming to minimize an objective function.

  • Minimizing an objective function can be approached using gradient descent.

  • Gradient descent requires computing the derivative of the objective function, which is the focus of this lecture.

Example Scenario: A Simple Neural Network Unit
  • Output Unit: Represented as \hat{y}, a function of an input pattern (vector \mathbf{s}_I) and parameters (vector \mathbf{\theta}).

  • Input Units: Multiple input units feed into the output unit. For example, \mathbf{s}_I = [s_{I1}, s_{I2}, s_{I3}]^T .

  • Activation: An input pattern \mathbf{s}_I activates each input node. s_{I1}, s_{I2}, s_{I3} are the activation levels of the respective nodes.

  • Connection Weights: Each input node's connection to the output unit has a corresponding weight: \theta_1, \theta_2, \theta_3 .

  • Output Function: The output \hat{y} is a sigmoidal function (specifically, the logistic sigmoid) of a weighted sum of inputs:

    • \hat{y} = \sigma(\theta_1 s_{I1} + \theta_2 s_{I2} + \theta_3 s_{I3})

    • The logistic sigmoid function is defined as \sigma(x) = \frac{1}{1 + e^{-x}}.
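The output function above can be sketched directly in code. The following is a minimal NumPy sketch; the weight and input values are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def unit_output(theta, s):
    """Output of the single unit: y_hat = sigma(theta . s)."""
    return sigmoid(np.dot(theta, s))

theta = np.array([0.5, -1.0, 2.0])   # hypothetical connection weights
s_I = np.array([1.0, 0.5, 0.25])     # hypothetical input pattern
y_hat = unit_output(theta, s_I)      # a value in (0, 1)
```

Because the sigmoid maps any real weighted sum into (0, 1), the output can be read as a probability.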

  • Supervised Learning Paradigm: Training data consists of input-desired response pairs: (\mathbf{s}_1, y_1), (\mathbf{s}_2, y_2), \dots, (\mathbf{s}_N, y_N). For instance, \mathbf{s}_I could be today's temperature and rainfall, and y_I could indicate whether it rains tomorrow (1 if rain, 0 if no rain), with \hat{y} predicting the probability of rain tomorrow.

  • Empirical Risk Function (Loss Function): For this example, a mean squared error (MSE) is used:

    • L_N(\mathbf{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}(\mathbf{s}_i, \mathbf{\theta}))^2 (where y_i is the desired response and \hat{y} is the prediction).
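The empirical risk can be evaluated over a whole training set in a few lines. A minimal sketch, assuming a small hypothetical data set (the pattern matrix and targets below are invented for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def empirical_risk(theta, S, y):
    """MSE loss L_N(theta) over N (input, desired-response) pairs.
    S is an N x D matrix of input patterns; y is a length-N target vector."""
    y_hat = sigmoid(S @ theta)        # predictions for all N patterns at once
    return np.mean((y - y_hat) ** 2)

# hypothetical toy data: N = 4 patterns, D = 3 input nodes
S = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])

loss = empirical_risk(np.zeros(3), S, y)  # with theta = 0 every prediction is 0.5
```

With all weights zero, every prediction is \sigma(0) = 0.5, so each squared error is 0.25 and so is their mean.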

  • Goal of Learning: To minimize the empirical risk function L_N(\mathbf{\theta}) by adjusting the parameters \mathbf{\theta} (connection weights).

Batch Gradient Descent Algorithm
  • Strategy: Iteratively update the parameters \mathbf{\theta} to minimize the loss function.

  • Update Rule: The general rule for parameter update is: \mathbf{\theta}_{t+1} = \mathbf{\theta}_t - \gamma_t \nabla L_N(\mathbf{\theta}_t).

    • \mathbf{\theta}_t: Parameters at iteration t.

    • \gamma_t: Learning rate (or step size), a positive scalar that controls the size of the step taken in the direction opposite to the gradient.

    • \nabla L_N(\mathbf{\theta}_t): The gradient of the empirical risk function with respect to \mathbf{\theta} evaluated at \mathbf{\theta}_t.

  • Iterative Process:

    1. Obtain an initial guess for the parameters \mathbf{\theta} .

    2. Compute the derivative (gradient) of L_N(\mathbf{\theta}) with respect to each parameter, using all the training data.

    3. Update the parameters using the update rule, where the learning rate \gamma_t scales the gradient before subtraction.

    4. Each update causes a slight perturbation of the connection weights, tending to decrease the empirical risk function.

    5. Repeat this process (taking all training data, perturbing parameters) until the parameters converge (stop changing significantly).

  • Terminology: The parameters \mathbf{\theta} are referred to as weights.
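The update rule and iterative process above can be sketched as a generic routine. The sketch below applies it to a simple quadratic objective L(\mathbf{\theta}) = \|\mathbf{\theta}\|^2 (whose gradient is 2\mathbf{\theta}), an assumption made purely so the example is self-contained; the starting point and step count are arbitrary.

```python
import numpy as np

def gradient_descent(grad, theta0, gamma=0.1, steps=100):
    """Batch gradient descent: theta_{t+1} = theta_t - gamma * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - gamma * grad(theta)
    return theta

# minimize L(theta) = ||theta||^2; the unique minimum is at theta = 0
theta_star = gradient_descent(lambda th: 2.0 * th, theta0=[3.0, -4.0])
```

Here a fixed learning rate suffices; in practice \gamma_t is often decayed over iterations.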

Derivatives of the Sigmoid Function
  • The derivative of the logistic sigmoid function \sigma(x) = \frac{1}{1+e^{-x}} is essential for backpropagation.

  • The derivative is given by: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x)).
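The identity \sigma'(x) = \sigma(x)(1 - \sigma(x)) is easy to verify numerically. A small sketch comparing the analytic derivative against a central finite difference (the test points and step size are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 1.5])   # arbitrary test points
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
```

The two should agree to several decimal places at every test point.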

Computing the Gradient for the Loss Function (Backpropagation)
  • To compute \nabla L_N(\mathbf{\theta}), we need to find the partial derivative of L_N(\mathbf{\theta}) with respect to each parameter \theta_j.

  • Consider one term from the sum in L_N(\mathbf{\theta}): the squared error for a single input \mathbf{s}_i: E_i = (y_i - \hat{y}(\mathbf{s}_i, \mathbf{\theta}))^2.

  • We use the chain rule to find \frac{\partial E_i}{\partial \theta_j}:

    • Let Net_i = \sum_{k=1}^{D} \theta_k s_{ik} be the weighted sum of inputs for pattern i. Thus, \hat{y}(\mathbf{s}_i, \mathbf{\theta}) = \sigma(Net_i). (Here, D is the number of input nodes, e.g., D=3).

    • \frac{\partial E_i}{\partial \theta_j} = \frac{\partial E_i}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial Net_i} \frac{\partial Net_i}{\partial \theta_j}.

  1. Derivative of Error with respect to Output:

    • \frac{\partial E_i}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} (y_i - \hat{y})^2 = 2(y_i - \hat{y})(-1) = -2(y_i - \hat{y}).

  2. Derivative of Output with respect to Net Input:

    • \frac{\partial \hat{y}}{\partial Net_i} = \frac{\partial \sigma(Net_i)}{\partial Net_i} = \sigma(Net_i)(1 - \sigma(Net_i)) = \hat{y}(1 - \hat{y}).

  3. Derivative of Net Input with respect to a Weight:

    • \frac{\partial Net_i}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} (\theta_1 s_{i1} + \dots + \theta_j s_{ij} + \dots) = s_{ij}.

  • Combining the terms:

    • \frac{\partial E_i}{\partial \theta_j} = -2(y_i - \hat{y}) \, \hat{y}(1 - \hat{y}) \, s_{ij}.

  • Gradient for the entire loss function:

    • The gradient for the empirical risk function is the average of these individual gradients over all N training patterns:

    • \frac{\partial L_N(\mathbf{\theta})}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial E_i}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} -2(y_i - \hat{y}(\mathbf{s}_i, \mathbf{\theta})) \, \hat{y}(\mathbf{s}_i, \mathbf{\theta})(1 - \hat{y}(\mathbf{s}_i, \mathbf{\theta})) \, s_{ij}.

  • This process of computing gradients by working backward from the output to the inputs through the network's connections is known as backpropagation.
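Putting the averaged gradient together with the batch update rule gives the complete learning procedure for this single-unit example. A minimal end-to-end sketch; the synthetic data, learning rate, and iteration count are all illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_and_grad(theta, S, y):
    """L_N(theta) and its gradient, averaged over all N training patterns."""
    y_hat = sigmoid(S @ theta)
    err = y - y_hat
    loss = np.mean(err ** 2)
    # dL/dtheta_j = (1/N) sum_i -2 err_i y_hat_i (1 - y_hat_i) s_ij
    grad = (-2.0 * err * y_hat * (1.0 - y_hat)) @ S / len(y)
    return loss, grad

# hypothetical synthetic training set: 50 patterns, 3 input nodes
rng = np.random.default_rng(0)
S = rng.normal(size=(50, 3))
y = (S[:, 0] + S[:, 1] > 0).astype(float)   # invented 0/1 targets

theta = np.zeros(3)
loss_before, _ = loss_and_grad(theta, S, y)
for _ in range(500):                        # batch gradient descent
    _, g = loss_and_grad(theta, S, y)
    theta -= 0.5 * g
loss_after, _ = loss_and_grad(theta, S, y)  # should be lower than loss_before
```

Each iteration uses all N patterns (the "batch" in batch gradient descent), and the empirical risk decreases as the weights converge.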