Matrix Calculus and Supervised Learning
Homework and Exam Announcements
Homework one is due on September 21.
An announcement will clarify which parts of homework one will appear on Exam One and which on Exam Two, due to the pacing of the course.
Recap of Previous Lecture
Last session introduced matrix calculus and vector calculus, with a focus on the vector chain rule.
Supervised Learning Problem
Discussion involves supervised learning with a focus on maximum likelihood interpretation in neural networks.
Training data consists of ordered pairs (s_i, y_i), where:
s_i = input pattern
y_i = desired response, which is binary (1 or 0).
Example of Training Stimuli
Training samples represented as:
s_1 → y_1 (e.g., s_1 might be a face image and y_1 indicates whether the face has hair: 1 = has hair, 0 = does not)
For n training stimuli, y_i must be in {0, 1}.
Model Interpretation
The output from a neural network is denoted as:
y̅(s_i, θ) = predicted probability that y_i = 1 given the input s_i, interpreted as p_i(θ) = P(y_i = 1 | s_i)
Likelihood Function
The learning algorithm aims to maximize the likelihood of the entire dataset, defined as:
L(θ) = P(data | θ) = ∏_{i=1}^{n} p_i(θ)^{y_i} (1 − p_i(θ))^{1 − y_i}, the product of the individual likelihoods.
This leads to Maximum Likelihood Estimation (MLE).
Loss Function Definition
Define loss function as:
c(s_i, y_i, θ) = −[y_i log(p_i(θ)) + (1 − y_i) log(1 − p_i(θ))]
When minimizing: if y_i = 1, we minimize −log(p_i(θ)); if y_i = 0, we minimize −log(1 − p_i(θ)).
Represents empirical risk through a negative log-likelihood approach.
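As a concrete illustration, a minimal Python sketch of this per-sample negative log-likelihood loss (the function name and the eps clamp are illustrative additions, not from the lecture):

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative log-likelihood for a single binary target y in {0, 1}.

    p is the model's predicted probability that y = 1;
    eps clamps p away from 0 and 1 to avoid log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

When y = 1 this reduces to −log(p), and when y = 0 to −log(1 − p), matching the two minimization cases above.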
Sigmoidal Function in Neural Models
The predicted probabilities p_i(θ) are modeled with a sigmoidal function: p_i(θ) = 1 / (1 + e^{−φ_i}),
where φ_i = θ^T s_i
Identified as the logistic sigmoid in neural networks, or the inverse logit in statistical contexts.
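A sketch of the logistic sigmoid in Python, using the standard two-branch form for numerical stability (an implementation detail not discussed in the lecture):

```python
import math

def sigmoid(phi):
    """Logistic sigmoid: maps the linear response phi = theta^T s into (0, 1)."""
    if phi >= 0:
        return 1.0 / (1.0 + math.exp(-phi))
    # For large negative phi, exp(-phi) would overflow; use the equivalent form.
    z = math.exp(phi)
    return z / (1.0 + z)
```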
Cross Entropy Function Definition
This formulation identifies the objective function:
Known as the cross-entropy function, specifically for binary targets (Bernoulli random variables).
In practice, neural networks commonly use this cross-entropy loss for binary classification.
Gradient Descent Learning Algorithm
The derivation gives the update rule for gradient descent:
θ_{t+1} = θ_t − γ_t (dc/dθ)
Use the chain rule to compute:
dc/dθ = −[y_i − p_i(θ)] s_i
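The update rule and gradient above can be sketched for a single training pair as follows (pure-Python sketch; names are illustrative):

```python
import math

def gradient_step(theta, s, y, gamma):
    """One gradient-descent update for the logistic model on a single pair.

    theta and s are lists of equal length; y is in {0, 1}; gamma is the
    step size. Since dc/dtheta = -(y - p) * s, the update adds
    gamma * (y - p) * s to theta.
    """
    phi = sum(t * x for t, x in zip(theta, s))       # phi = theta^T s
    p = 1.0 / (1.0 + math.exp(-phi))                 # p = sigmoid(phi)
    return [t + gamma * (y - p) * x for t, x in zip(theta, s)]
```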
Different Gradient Descent Methods
Batch Gradient Descent: Updates using all training data.
Adaptive (stochastic) Gradient Descent: Updates using a single training sample at a time.
Mini-batch Gradient Descent: A compromise, updating with a small subset (e.g., 10 samples).
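The three variants differ only in how many samples feed each gradient estimate. A sketch of the mini-batch case (batch_size = len(samples) recovers batch gradient descent, batch_size = 1 the single-sample update; names are illustrative):

```python
import math
import random

def minibatch_gradient(theta, samples, batch_size=10):
    """Average cross-entropy gradient over a random mini-batch.

    samples is a list of (s, y) pairs, with s a list and y in {0, 1}.
    """
    batch = random.sample(samples, min(batch_size, len(samples)))
    grad = [0.0] * len(theta)
    for s, y in batch:
        phi = sum(t * x for t, x in zip(theta, s))
        p = 1.0 / (1.0 + math.exp(-phi))
        for j, x in enumerate(s):
            grad[j] += -(y - p) * x / len(batch)   # dc/dtheta = -(y - p) * s
    return grad
```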
Stop Criteria in Learning Algorithms
To assess convergence of gradient descent:
Track the average of the most recent gradients.
When it approaches zero, convergence is assumed.
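A minimal sketch of this stopping test (the tolerance value and function name are assumptions, not from the lecture):

```python
def should_stop(recent_grads, tol=1e-4):
    """Stop when the average norm of the last few gradients is near zero.

    recent_grads is a non-empty list of gradient vectors (lists of floats)
    from recent iterations.
    """
    avg_norm = sum(
        sum(g * g for g in grad) ** 0.5 for grad in recent_grads
    ) / len(recent_grads)
    return avg_norm < tol
```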
Moving to Numerical Outputs in Supervised Learning
Change the target variable y_i to real-valued outputs, conforming to a Gaussian assumption:
P(y_i | s_i, θ) is Gaussian, with mean y̅(s_i, θ).
Objective Function Formulation
The objective function based on the Gaussian likelihood leads to the sum-squared error for finding the optimal parameters:
Objective Function = (1/n) Σ_i (y_i − y̅(s_i, θ))^2
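The sum-squared-error objective can be sketched as (illustrative names; preds stands in for the model outputs y̅(s_i, θ)):

```python
def mean_squared_error(ys, preds):
    """Empirical risk under the Gaussian likelihood: (1/n) * sum (y - yhat)^2."""
    n = len(ys)
    return sum((y, yhat) for y, yhat in zip(ys, preds)) if False else \
        sum((y - yhat) ** 2 for y, yhat in zip(ys, preds)) / n
```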
Softmax Function Introduction for Categorical Outputs
Outputs can represent multiple categories using softmax functions.
Outputs defined as a vector of probabilities for each of the m categories: p_j(s_i, θ) = e^{φ_j(s_i, θ)} / Σ_{k=1}^{m} e^{φ_k(s_i, θ)}
This construction allows categorical classification (one hot encoded).
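A sketch of the softmax, with the standard max-subtraction trick for numerical stability (an implementation detail not covered in the lecture):

```python
import math

def softmax(phis):
    """Map m linear responses phi_1..phi_m to a probability vector."""
    m = max(phis)  # subtracting the max leaves the result unchanged
    exps = [math.exp(phi - m) for phi in phis]
    total = sum(exps)
    return [e / total for e in exps]
```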
Objective Function for Softmax Outputs
The derived objective function for learning in softmax outputs:
Loss = −y_i^T log(p(s_i, θ)), where y_i is the one-hot target vector and log is applied elementwise.
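The loss −y_i^T log(p) for a one-hot target can be sketched as (the eps clamp is an illustrative addition to guard log(0)):

```python
import math

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Loss -y^T log(p) for a one-hot target y and probability vector p."""
    return -sum(y * math.log(max(q, eps)) for y, q in zip(y_onehot, p))
```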
Final Remarks
Significant material covered including supervised learning variations, loss functions, learning algorithms, and transition to categorical outputs using softmax.