Matrix Calculus and Supervised Learning
Homework and Exam Announcements
Homework one is due on September 21.
An announcement will clarify which parts of homework one will appear on Exam One and which on Exam Two, due to the pacing of the course.
Recap of Previous Lecture
Last session introduced matrix calculus and vector calculus, with a focus on the vector chain rule.
Supervised Learning Problem
Discussion involves supervised learning with a focus on maximum likelihood interpretation in neural networks.
Training data consists of ordered pairs (s_i, y_i), where:
s_i = input pattern
y_i = desired response, which is binary (1 or 0).
Example of Training Stimuli
Training samples represented as:
s_1 → y_1 (e.g., s_1 might be a face image and y_1 indicates whether the face has hair: 1 = has hair, 0 = does not)
For n training stimuli, y_i must be in {0, 1}.
Model Interpretation
The output from a neural network is denoted as:
y̅(s_i, θ) = predicted probability that y_i = 1 given the input s_i, interpreted as p_i(θ) = P(y_i = 1 | s_i)
Likelihood Function
The learning algorithm aims to maximize the likelihood of the entire dataset, defined as:
L(θ) = P(data | θ) = ∏_{i=1}^{n} p_i(θ)^{y_i} (1 − p_i(θ))^{1 − y_i}, the product of the individual likelihoods.
This leads to Maximum Likelihood Estimation (MLE).
Loss Function Definition
Define loss function as:
c(s_i, y_i, θ) = −[y_i log(p_i(θ)) + (1 − y_i) log(1 − p_i(θ))]
When minimizing: if y_i = 1, we minimize −log(p_i(θ)); if y_i = 0, we minimize −log(1 − p_i(θ)).
Represents empirical risk through a negative log-likelihood approach.
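As a concrete illustration, a minimal Python sketch of this per-sample negative log-likelihood loss (the function name and the eps clamp are illustrative additions, not from the lecture):

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative log-likelihood for a single binary target y in {0, 1}.

    p is the model's predicted probability that y = 1;
    eps clamps p away from 0 and 1 to avoid log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

When y = 1 this reduces to −log(p), and when y = 0 to −log(1 − p), matching the two minimization cases above.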
Sigmoidal Function in Neural Models
The predicted probabilities p_i(θ) are modeled with a sigmoidal function: p_i(θ) = 1 / (1 + e^{−φ_i}),
where φ_i = θ^T s_i
Identified as the logistic sigmoid in neural networks, or the inverse logit in statistical contexts.
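A sketch of the logistic sigmoid in Python, using the standard two-branch form for numerical stability (an implementation detail not discussed in the lecture):

```python
import math

def sigmoid(phi):
    """Logistic sigmoid: maps the linear response phi = theta^T s into (0, 1)."""
    if phi >= 0:
        return 1.0 / (1.0 + math.exp(-phi))
    # For large negative phi, exp(-phi) would overflow; use the equivalent form.
    z = math.exp(phi)
    return z / (1.0 + z)
```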
Cross Entropy Function Definition
This formulation identifies the objective function:
Known as the cross-entropy function, specifically for binary targets (Bernoulli random variables).
In practice, neural networks commonly use this cross-entropy loss for binary classification.
Gradient Descent Learning Algorithm
The derivation gives the update rule for gradient descent:
θ_{t+1} = θ_t − γ_t (dc/dθ)
Use the chain rule to compute:
dc/dθ = −[y_i − p_i(θ)] s_i
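The update rule and gradient above can be sketched for a single training pair as follows (pure-Python sketch; names are illustrative):

```python
import math

def gradient_step(theta, s, y, gamma):
    """One gradient-descent update for the logistic model on a single pair.

    theta and s are lists of equal length; y is in {0, 1}; gamma is the
    step size. Since dc/dtheta = -(y - p) * s, the update adds
    gamma * (y - p) * s to theta.
    """
    phi = sum(t * x for t, x in zip(theta, s))       # phi = theta^T s
    p = 1.0 / (1.0 + math.exp(-phi))                 # p = sigmoid(phi)
    return [t + gamma * (y - p) * x for t, x in zip(theta, s)]
```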
Different Gradient Descent Methods
Batch Gradient Descent: Updates using all training data.
Adaptive (stochastic) Gradient Descent: Updates using a single training sample at a time.
Mini-batch Gradient Descent: A compromise, updating with a small subset (e.g., 10 samples).
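The three variants differ only in how many samples feed each gradient estimate. A sketch of the mini-batch case (batch_size = len(samples) recovers batch gradient descent, batch_size = 1 the single-sample update; names are illustrative):

```python
import math
import random

def minibatch_gradient(theta, samples, batch_size=10):
    """Average cross-entropy gradient over a random mini-batch.

    samples is a list of (s, y) pairs, with s a list and y in {0, 1}.
    """
    batch = random.sample(samples, min(batch_size, len(samples)))
    grad = [0.0] * len(theta)
    for s, y in batch:
        phi = sum(t * x for t, x in zip(theta, s))
        p = 1.0 / (1.0 + math.exp(-phi))
        for j, x in enumerate(s):
            grad[j] += -(y - p) * x / len(batch)   # dc/dtheta = -(y - p) * s
    return grad
```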
Stop Criteria in Learning Algorithms
To assess convergence of gradient descent:
Track the average of the most recent gradients.
When it approaches zero, convergence is assumed.
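A minimal sketch of this stopping test (the tolerance value and function name are assumptions, not from the lecture):

```python
def should_stop(recent_grads, tol=1e-4):
    """Stop when the average norm of the last few gradients is near zero.

    recent_grads is a non-empty list of gradient vectors (lists of floats)
    from recent iterations.
    """
    avg_norm = sum(
        sum(g * g for g in grad) ** 0.5 for grad in recent_grads
    ) / len(recent_grads)
    return avg_norm < tol
```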
Moving to Numerical Outputs in Supervised Learning
Change the target variable y_i to real-valued outputs, conforming to a Gaussian assumption:
P(y_i | s_i, θ) is Gaussian, with mean y̅(s_i, θ).
Objective Function Formulation
The objective function based on the Gaussian likelihood leads to the sum-squared error for finding the optimal parameters:
Objective Function = (1/n) Σ_i (y_i − y̅(s_i, θ))^2
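The sum-squared-error objective can be sketched as (illustrative names; preds stands in for the model outputs y̅(s_i, θ)):

```python
def mean_squared_error(ys, preds):
    """Empirical risk under the Gaussian likelihood: (1/n) * sum (y - yhat)^2."""
    n = len(ys)
    return sum((y, yhat) for y, yhat in zip(ys, preds)) if False else \
        sum((y - yhat) ** 2 for y, yhat in zip(ys, preds)) / n
```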
Softmax Function Introduction for Categorical Outputs
Outputs can represent multiple categories using softmax functions.
Outputs defined as a vector of probabilities for each of the m categories: p_j(s_i, θ) = e^{φ_j(s_i, θ)} / Σ_{k=1}^{m} e^{φ_k(s_i, θ)}
This construction allows categorical classification (one hot encoded).
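A sketch of the softmax, with the standard max-subtraction trick for numerical stability (an implementation detail not covered in the lecture):

```python
import math

def softmax(phis):
    """Map m linear responses phi_1..phi_m to a probability vector."""
    m = max(phis)  # subtracting the max leaves the result unchanged
    exps = [math.exp(phi - m) for phi in phis]
    total = sum(exps)
    return [e / total for e in exps]
```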
Objective Function for Softmax Outputs
The derived objective function for learning in softmax outputs:
Loss = −y_i^T log(p(s_i, θ)), where y_i is the one-hot target vector and log is applied elementwise.
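The loss −y_i^T log(p) for a one-hot target can be sketched as (the eps clamp is an illustrative addition to guard log(0)):

```python
import math

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Loss -y^T log(p) for a one-hot target y and probability vector p."""
    return -sum(y * math.log(max(q, eps)) for y, q in zip(y_onehot, p))
```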
Final Remarks
Significant material covered including supervised learning variations, loss functions, learning algorithms, and transition to categorical outputs using softmax.