Backpropagation, step-by-step | DL3

Introduction to Backpropagation

Backpropagation is the core algorithm behind how neural networks learn.
The session will recap previous concepts before delving into the details of backpropagation.
The first explanation is intuitive, without referencing complex formulas, followed by a deeper dive into the math in subsequent content.

Neural Network Basics

A neural network functions by feeding forward information.
Example used: recognizing handwritten digits using an input layer of 784 neurons, two hidden layers each with 16 neurons, and an output layer of 10 neurons.
The output layer indicates the digit the network predicts.
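
A minimal sketch of the feed-forward step for the 784-16-16-10 layout described above. The sigmoid activation, the random Gaussian initialization, and the names `init_layer` and `feed_forward` are assumptions made for illustration; the notes do not specify them.

```python
import math
import random

def sigmoid(z):
    # Assumed activation function: squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    # One row of weights and one bias per neuron in the new layer.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return weights, biases

def feed_forward(layers, activations):
    # Each layer turns the previous activations into new ones.
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

rng = random.Random(0)
sizes = [784, 16, 16, 10]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# Stand-in for a 28x28 image: 784 pixel brightness values in [0, 1).
pixels = [rng.random() for _ in range(784)]
output = feed_forward(layers, pixels)

# The most active output neuron is the network's guess.
predicted_digit = max(range(10), key=lambda d: output[d])
```

With untrained random weights the guess is meaningless; training is what makes it useful.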

Gradient Descent Recap

Gradient descent aims to minimize a cost function by adjusting weights and biases based on training data.
The cost function measures the error between the network's predicted output and the desired output for each training example.
Total cost is averaged over all training examples.
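
One way to write such a cost, as a sketch. The sum of squared differences used here is an assumption (the notes only say "error"), and the helper names `one_hot`, `example_cost`, and `total_cost` are hypothetical.

```python
def one_hot(digit, n=10):
    # Desired output: 1.0 for the correct digit, 0.0 everywhere else.
    return [1.0 if i == digit else 0.0 for i in range(n)]

def example_cost(output, target):
    # Error for one training example: sum of squared differences.
    return sum((o - t) ** 2 for o, t in zip(output, target))

def total_cost(outputs, targets):
    # Total cost: the average of the per-example costs.
    costs = [example_cost(o, t) for o, t in zip(outputs, targets)]
    return sum(costs) / len(costs)

# Two made-up network outputs for images of a '2' and a '7'.
outputs = [
    [0.1, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0],
    [0.3, 0.2, 0.4, 0.1, 0.0, 0.5, 0.0, 0.2, 0.6, 0.1],
]
targets = [one_hot(2), one_hot(7)]
cost = total_cost(outputs, targets)
```

The first output is close to its target and contributes little; the second is far off and dominates the average.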

Gradient and Sensitivity

We look for the negative gradient of the cost function.
The negative gradient indicates which changes to the weights and biases decrease the cost most efficiently.
Each component of the gradient vector indicates sensitivity of the cost to individual weights and biases.
- Example: A sensitivity of 3.2 for one weight means the cost is 32 times more sensitive to that weight than to another weight with a sensitivity of 0.1.
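
The sensitivity reading can be checked numerically. This is a sketch with a hypothetical two-weight `toy_cost`, chosen only so that its sensitivities are exactly 3.2 and 0.1; a real network's cost is not this simple.

```python
def toy_cost(weights):
    # Hypothetical cost with sensitivities 3.2 and 0.1 by construction.
    return 3.2 * weights[0] + 0.1 * weights[1]

def numerical_gradient(cost, weights, eps=1e-6):
    # Estimate each component by nudging one weight up and down.
    gradient = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        gradient.append((cost(up) - cost(down)) / (2 * eps))
    return gradient

weights = [1.0, 1.0]
gradient = numerical_gradient(toy_cost, weights)

# Apply the same small nudge to each weight and compare the effect on the cost.
nudge = 0.01
change_from_first = toy_cost([weights[0] + nudge, weights[1]]) - toy_cost(weights)
change_from_second = toy_cost([weights[0], weights[1] + nudge]) - toy_cost(weights)
ratio = change_from_first / change_from_second
```

The same nudge changes the cost about 32 times more when applied to the first weight.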

Understanding Backpropagation

Rather than focusing on mathematical notation, we'll analyze the effect of each training example:
- The adjustments applied to weights and biases from a single training example.
To keep things simple, we'll start by focusing on one training example.

Analyzing One Training Example

An example image (e.g., a handwritten '2') is presented.
Because the network is not yet trained, the activations in the output layer are effectively random (e.g., [0.5, 0.8, 0.2]).
The goal is to increase the activation output for the '2' while decreasing others.
Adjustments should be proportional to how far each current output is from its target.
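- Example: Increasing the activation of the '2' neuron, which is far from its target, matters more than decreasing the activation of the '8' neuron if that one is already close to its target.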
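
A sketch of these desired nudges for an image of a '2'. The first three activations reuse the example values above; the remaining seven are made up. The factor of 2 assumes a squared-error cost, which the notes do not state.

```python
# Output activations for digits 0 through 9.
output = [0.5, 0.8, 0.2, 0.1, 0.4, 0.6, 0.3, 0.0, 0.1, 0.6]
correct_digit = 2
target = [1.0 if d == correct_digit else 0.0 for d in range(10)]

# Desired nudge per output neuron: the negative of the cost's sensitivity
# to that activation, which is 2 * (target - activation) for squared error.
desired_nudges = [2.0 * (t - a) for a, t in zip(output, target)]

# Rank the neurons by how much their nudge matters.
by_importance = sorted(range(10), key=lambda d: abs(desired_nudges[d]), reverse=True)
```

Only the '2' neuron gets a positive nudge. The '8' neuron, already near 0, gets a much smaller nudge than the '1' neuron, which is far from its target.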

Weight Adjustments

We can adjust the weights directly, in proportion to how much each one influences the activation we want to change.
Weights attached to the most active neurons in the previous layer have the most influence, so they warrant the largest increases.
This focuses on the changes that give the most "bang for your buck" and loosely echoes Hebbian theory: "neurons that fire together wire together."
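
A sketch for a single output neuron, showing that the weight nudges scale with the previous layer's activations. The sigmoid activation, the squared-error cost, and all numeric values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Activations of four neurons in the previous layer (made-up values).
prev_activations = [0.9, 0.1, 0.0, 0.7]
weights = [0.2, -0.5, 0.3, 0.1]
bias = 0.0
target = 1.0  # we want this neuron to become more active

z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
activation = sigmoid(z)

# Sensitivity of the cost (activation - target)^2 to the weighted sum z.
# The derivative of the sigmoid is activation * (1 - activation).
delta = 2.0 * (activation - target) * activation * (1.0 - activation)

# Each weight's sensitivity is delta times the activation it multiplies.
weight_gradients = [delta * a for a in prev_activations]
bias_gradient = delta

# The desired nudges point down the gradient.
weight_nudges = [-g for g in weight_gradients]
```

The weight from the brightest neuron (0.9) gets the largest increase, and the weight from the silent neuron (0.0) gets none, since changing it would not affect the output for this example.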

Propagating Changes Backward

The desired adjustments of the output layer influence the preceding layer.
The desired changes from every output neuron are added together, since each one has its own wishes for the activations of the preceding layer.
This leads to adjusting weights and biases recursively throughout the network.
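
The backward pass can be sketched for a tiny 4-3-2 network and checked against a numerical estimate. Sigmoid activations, a squared-error cost, the layer sizes, and the function names are all assumptions for illustration, not the notes' own implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, x):
    # Return the activations of every layer, input included.
    activations = [x]
    for weights, biases in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return activations

def cost(layers, x, target):
    output = forward(layers, x)[-1]
    return sum((o - t) ** 2 for o, t in zip(output, target))

def backprop(layers, x, target):
    # Return (weight_gradients, bias_gradients) per layer for one example.
    activations = forward(layers, x)
    # Sensitivity of the cost to each output activation.
    grad_a = [2.0 * (o - t) for o, t in zip(activations[-1], target)]
    grads = []
    for index in reversed(range(len(layers))):
        weights, _ = layers[index]
        out, prev = activations[index + 1], activations[index]
        # Sensitivity to each weighted sum; sigmoid'(z) = a * (1 - a).
        delta = [g * a * (1.0 - a) for g, a in zip(grad_a, out)]
        weight_grads = [[d * p for p in prev] for d in delta]
        grads.append((weight_grads, delta))
        # Add up every neuron's wishes for the previous layer's activations.
        grad_a = [
            sum(weights[j][k] * delta[j] for j in range(len(delta)))
            for k in range(len(prev))
        ]
    grads.reverse()
    return grads

rng = random.Random(1)
sizes = [4, 3, 2]
layers = [
    ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
     [rng.uniform(-1, 1) for _ in range(n_out)])
    for n_in, n_out in zip(sizes, sizes[1:])
]
x = [0.1, 0.9, 0.4, 0.7]
target = [1.0, 0.0]
grads = backprop(layers, x, target)

# Check one first-layer weight by nudging it up and down.
eps = 1e-6
layers[0][0][1][2] += eps
cost_up = cost(layers, x, target)
layers[0][0][1][2] -= 2 * eps
cost_down = cost(layers, x, target)
layers[0][0][1][2] += eps  # restore the original value
numeric = (cost_up - cost_down) / (2 * eps)
analytic = grads[0][0][1][2]
```

If the recursion is right, `analytic` and `numeric` agree to several decimal places. The line that recomputes `grad_a` is where the wishes of all neurons in a layer are combined and passed backward.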

Averaging Adjustments Across Examples

Each training example nudges weights and biases differently.
The same process is repeated for every training example, and the desired changes are averaged.
The averaged nudges are, loosely speaking, proportional to the negative gradient of the cost function.
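
A sketch of why averaging works, using a hypothetical one-weight model (prediction = w * x) with a squared-error cost instead of a full network. The three training pairs are made up.

```python
# Hypothetical training examples as (input, desired output) pairs.
examples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]

def example_gradient(w, x, y):
    # Sensitivity of one example's cost (w * x - y)^2 to the weight w.
    return 2.0 * (w * x - y) * x

def average_cost(w):
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)

w = 0.5

# Each example asks for its own nudge to w.
per_example = [example_gradient(w, x, y) for x, y in examples]

# Averaging those sensitivities gives the gradient of the average cost.
averaged = sum(per_example) / len(per_example)

# Numerical estimate of the same gradient, for comparison.
eps = 1e-6
numeric = (average_cost(w + eps) - average_cost(w - eps)) / (2 * eps)
```

The three examples disagree about how strongly to change `w`, but their average matches the slope of the total cost.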

Computational Efficiency: Mini-Batching

Directly averaging adjustments for all training examples for each step is computationally expensive.
Instead, divide the training data into mini-batches (e.g., subsets of 100 examples).
Each mini-batch gives a reasonable approximation of the full gradient at a fraction of the computational cost.
This process is referred to as stochastic gradient descent.
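
A sketch of the mini-batch loop on a hypothetical one-weight model (prediction = w * x). The synthetic data, the learning rate, and the number of passes are assumptions; only the batch size of 100 comes from the example above.

```python
import random

rng = random.Random(0)

# Synthetic training data: inputs in [-1, 1], desired output 3 * x.
true_w = 3.0
inputs = [rng.uniform(-1, 1) for _ in range(1000)]
data = [(x, true_w * x) for x in inputs]

def batch_gradient(w, batch):
    # Average sensitivity of the squared-error cost over one mini-batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
learning_rate = 0.5
batch_size = 100

for epoch in range(5):
    # Shuffle so that each pass splits the data into different mini-batches.
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Step downhill using only this mini-batch's estimate of the gradient.
        w -= learning_rate * batch_gradient(w, batch)
```

Each step uses 100 examples instead of all 1000, so it costs roughly a tenth of a full-gradient step, and `w` still ends up close to 3.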

Summary of Backpropagation Process

Backpropagation determines how individual training examples suggest that weights and biases should be adjusted.
It focuses on the relative proportions of these adjustments to minimize cost.
Use mini-batch processing for computational efficiency while iteratively adjusting towards a local minimum of the cost function.

Importance of Training Data

A substantial amount of labeled training data is necessary for the algorithm to be effective.
The MNIST database, a large collection of labeled handwritten digits, is a common example.
Challenges in machine learning often relate to acquiring enough quality training data.