Matrix Calculus and Supervised Learning

Homework and Exam Announcements

  • Homework one is due on September 21.

  • An announcement will clarify which parts of Homework One will appear on Exam Two and which on Exam One, due to the pacing of the course.

Recap of Previous Lecture

  • Last session introduced matrix calculus and vector calculus.

  • Focused on vector chain rule.

Supervised Learning Problem

  • Discussion involves supervised learning with a focus on maximum likelihood interpretation in neural networks.

  • Training data consists of ordered pairs (s_i, y_i), where:

    • s_i = input pattern

    • y_i = desired response, which is binary (1 or 0).

Example of Training Stimuli
  • Training samples represented as:

    • s_1 → y_1 (e.g., s_1 might be a face image and y_1 indicates whether it has hair: 1 = has hair, 0 = does not)

  • For n training stimuli, y_i must be in {0, 1}.

Model Interpretation

  • The output from a neural network is denoted as:

    • ŷ(s_i, θ) = predicted probability that y_i = 1 given the input s_i, interpreted as: p_i(θ) = P(y_i = 1 | s_i)

Likelihood Function
  • The learning algorithm aims to maximize the likelihood of the entire dataset, defined as:

    • L(θ) = P(data | θ) = ∏_{i=1}^{n} P(y_i | s_i, θ), the product of the individual likelihoods

  • This leads to Maximum Likelihood Estimation (MLE).

Loss Function Definition

  • Define loss function as:
    c(s_i, y_i, θ) = -[y_i log(p_i(θ)) + (1 - y_i) log(1 - p_i(θ))]

  • When minimizing: if y_i = 1, this reduces to minimizing -log(p_i(θ)); if y_i = 0, to minimizing -log(1 - p_i(θ)).

  • Represents empirical risk through a negative log-likelihood approach.
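
This negative log-likelihood loss can be sketched in a few lines of NumPy (the `eps` clipping guard is an implementation detail not discussed in the lecture, added to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative log-likelihood of a Bernoulli target y given predicted p = P(y=1).
    Equals -log(p) when y = 1 and -log(1 - p) when y = 0."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

For example, a confident correct prediction (y = 1, p = 0.9) incurs a small loss of -log(0.9) ≈ 0.105, while a confident wrong one (y = 1, p = 0.1) costs -log(0.1) ≈ 2.30.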

Sigmoidal Function in Neural Models

  • The predicted probabilities p_i(θ) are modeled with a sigmoidal function: p_i(θ) = 1 / (1 + e^{-φ_i})

    • where φ_i = θ^T s_i

  • Identified as:

    • logistic sigmoid in neural networks or inverse logit in statistical contexts.
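
A minimal NumPy sketch of this model, assuming the linear activation φ_i = θ^T s_i from the notes above:

```python
import numpy as np

def sigmoid(phi):
    """Logistic sigmoid 1 / (1 + e^{-phi}): maps any real activation to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-phi))

def predict_prob(theta, s):
    """p_i(theta) = sigmoid(theta^T s_i) for a single input pattern s."""
    return sigmoid(theta @ s)
```

With θ = 0 the activation is zero and the model outputs 0.5, i.e., maximal uncertainty about the binary label.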

Cross Entropy Function Definition

  • Formulation leads to identifying the objective function:

    • Known as cross-entropy function, specifically for binary targets or Bernoulli random variables.

  • It’s common in practice for neural networks to utilize this cross-entropy for binary classifications.

Gradient Descent Learning Algorithm

  • Derivation gives the update rule for gradient descent:
    θ_{t+1} = θ_t - γ_t (dc/dθ)

  • Use the chain rule to compute:
    dc/dθ = -[y_i - p_i(θ)] s_i
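
A single update step under this rule might look like the following sketch (learning rate `gamma` is a hypothetical scalar step size):

```python
import numpy as np

def gradient_step(theta, s, y, gamma):
    """One gradient-descent update for the binary cross-entropy loss.
    The gradient is dc/dtheta = -(y - p) * s, so theta moves opposite to it."""
    p = 1.0 / (1.0 + np.exp(-(theta @ s)))  # logistic sigmoid of theta^T s
    grad = -(y - p) * s
    return theta - gamma * grad
```

Starting from θ = 0 with target y = 1, one step increases θ^T s and therefore the predicted probability, as expected.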

Different Gradient Descent Methods

  • Batch Gradient Descent: Updates using all training data.

  • Stochastic (online) Gradient Descent: Updates using a single training sample.

  • Mini-batch Gradient Descent: Compromise by using a subset (e.g., 10 samples).
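
The three variants differ only in how many samples the gradient is averaged over; a sketch where `batch_size = n` recovers batch gradient descent and `batch_size = 1` the single-sample variant (the logistic-regression gradient from above is assumed):

```python
import numpy as np

def minibatch_gradient(theta, S, Y, batch_size, rng):
    """Average cross-entropy gradient over a random mini-batch.
    S is an (n, d) matrix of inputs, Y an (n,) vector of binary targets."""
    idx = rng.choice(len(Y), size=batch_size, replace=False)  # sample a batch
    P = 1.0 / (1.0 + np.exp(-(S[idx] @ theta)))               # predicted probabilities
    return -((Y[idx] - P)[:, None] * S[idx]).mean(axis=0)     # mean of per-sample gradients
```

The batch size trades gradient noise against per-update cost, which is why a mid-sized batch (e.g., 10 samples) is a common compromise.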

Stop Criteria in Learning Algorithms

  • To assess convergence of gradient descent:

    • Assess average of recent gradients.

    • When it approaches zero, convergence is assumed.
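
This stopping test can be sketched as averaging the last few gradients and checking the norm of the result (the tolerance `tol` is a hypothetical threshold, not a value from the lecture):

```python
import numpy as np

def has_converged(recent_grads, tol=1e-4):
    """Convergence test: average the most recent gradient vectors and
    declare convergence when the norm of the average is near zero."""
    avg = np.mean(recent_grads, axis=0)
    return np.linalg.norm(avg) < tol
```

Averaging over several recent gradients, rather than checking a single one, smooths out the noise of stochastic or mini-batch updates.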

Moving to Numerical Outputs in Supervised Learning

  • Change target variable y_i to real-valued outputs for learning, conforming to Gaussian assumptions:

    • P(y_i | s_i, θ) is Gaussian

Objective Function Formulation
  • The objective function based on Gaussian likelihood estimates leads to:

    • Sum squared error for optimal parameter finding:
      Objective Function = (1/n) Σ_{i=1}^{n} (y_i - ŷ(s_i, θ))^2
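
The Gaussian negative log-likelihood reduces to this mean squared error, which is a one-liner in NumPy:

```python
import numpy as np

def mean_squared_error(y, y_hat):
    """Empirical risk under a Gaussian likelihood: (1/n) * sum_i (y_i - yhat_i)^2."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)
```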

Softmax Function Introduction for Categorical Outputs

  • Outputs can represent multiple categories using softmax functions.

  • Outputs defined as a vector of probabilities for each category using: p_j(s_i, θ) = e^{φ_j(s_i, θ)} / Σ_{k=1}^{m} e^{φ_k(s_i, θ)}

    • This construction allows categorical classification (one hot encoded).

Objective Function for Softmax Outputs
  • The derived objective function for learning in softmax outputs:

    • Loss = -y_i^T log(p(s_i, θ))
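
The softmax output and its cross-entropy loss for a one-hot target can be sketched as follows (the max-subtraction is a standard numerical-stability trick, not part of the lecture derivation):

```python
import numpy as np

def softmax(phi):
    """Softmax: exponentiate the activation vector and normalize so outputs sum to 1."""
    e = np.exp(phi - np.max(phi))  # subtract max for numerical stability
    return e / e.sum()

def categorical_loss(y_onehot, p, eps=1e-12):
    """Loss = -y^T log(p) for a one-hot target vector y; only the true
    category's log-probability contributes."""
    return -(y_onehot * np.log(np.clip(p, eps, 1.0))).sum()
```

With m = 3 equal activations the softmax assigns probability 1/3 to each category, and the loss for any one-hot target is log 3, consistent with a uniform guess.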

Final Remarks

  • Significant material covered including supervised learning variations, loss functions, learning algorithms, and transition to categorical outputs using softmax.