Backpropagation, step-by-step | DL3

Introduction to Backpropagation

  • Backpropagation is the core algorithm behind how neural networks learn.

  • The session will recap previous concepts before delving into the details of backpropagation.

  • The first explanation is intuitive, without referencing complex formulas, followed by a deeper dive into the math in subsequent content.

Neural Network Basics

  • A neural network computes its output by feeding information forward, layer by layer.

  • Example used: recognizing handwritten digits using an input layer of 784 neurons, two hidden layers each with 16 neurons, and an output layer of 10 neurons.

  • The output layer indicates the digit the network predicts.
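
The feedforward pass described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the sigmoid activation, the small random initial weights, and the names init_network and feedforward are assumptions made for the example and do not come from the notes.

```python
import math
import random

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1). (Assumed activation.)"""
    return 1.0 / (1.0 + math.exp(-z))

def init_network(sizes, seed=0):
    """Return one (weights, biases) pair per layer, with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def feedforward(layers, activations):
    """Feed the input forward, one layer at a time."""
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

# 784 input pixels, two hidden layers of 16 neurons, 10 output neurons.
network = init_network([784, 16, 16, 10])
pixels = [0.0] * 784                      # a blank 28x28 image, flattened
output = feedforward(network, pixels)
predicted_digit = output.index(max(output))  # most active output neuron
```

The predicted digit is simply the index of the most active output neuron; with untrained weights that prediction is arbitrary.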

Gradient Descent Recap

  • Gradient descent aims to minimize a cost function by adjusting weights and biases based on training data.

  • The cost function measures the error between the network's predicted output and the desired output for each training example.

  • Total cost is averaged over all training examples.
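
As a rough sketch of the cost described above: the notes do not say which error measure is used, so the sum of squared differences below is an assumption, as are the helper names one_hot, example_cost, and total_cost.

```python
def one_hot(digit, n_classes=10):
    """Desired output: 1.0 for the correct digit, 0.0 everywhere else."""
    return [1.0 if i == digit else 0.0 for i in range(n_classes)]

def example_cost(output, target):
    """Cost of one training example (assumed: sum of squared differences)."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

def total_cost(outputs, targets):
    """Average the per-example costs over all training examples."""
    costs = [example_cost(o, t) for o, t in zip(outputs, targets)]
    return sum(costs) / len(costs)

# Two toy examples with only two output neurons each.
outputs = [[1.0, 0.0], [0.5, 0.5]]
targets = [[1.0, 0.0], [1.0, 0.0]]
average = total_cost(outputs, targets)
```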

Gradient and Sensitivity

  • We look for the negative gradient of the cost function.

  • The negative gradient indicates how to change the weights and biases to decrease the cost most efficiently.

  • Each component of the gradient vector indicates sensitivity of the cost to individual weights and biases.

    • Example: If one weight has a sensitivity of 3.2 and another has 0.1, the cost is 32 times more sensitive to changes in the first weight than in the second.
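
One way to see what "sensitivity" means is to estimate each gradient component numerically. The sketch below uses a hypothetical two-parameter cost, toy_cost, chosen only so that its sensitivities match the 3.2 and 0.1 in the example; the finite-difference estimate is an illustration, not how backpropagation computes the gradient.

```python
def numerical_gradient(cost, params, eps=1e-6):
    """Estimate each partial derivative of cost by central differences."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grad.append((cost(up) - cost(down)) / (2 * eps))
    return grad

def toy_cost(p):
    # Hypothetical cost whose sensitivities are 3.2 and 0.1.
    return 3.2 * p[0] + 0.1 * p[1]

grad = numerical_gradient(toy_cost, [0.0, 0.0])

# A gradient descent step moves against the gradient, so the first
# parameter is changed 32 times as much as the second.
learning_rate = 0.01  # assumed step size
step = [-learning_rate * g for g in grad]
```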

Understanding Backpropagation

  • Rather than focusing on mathematical notation, we'll analyze the effect of each training example:

    • The adjustments applied to weights and biases from a single training example.

  • For simplicity, we'll start by focusing on one training example.

Analyzing One Training Example

  • An example image (e.g., a handwritten '2') is presented.

  • Because the network is not yet trained, the activations in the output layer look essentially random (e.g., [0.5, 0.8, 0.2]).

  • The goal is to increase the activation of the output neuron for '2' while decreasing the activations of the others.

  • Adjustments should be proportional to how far each current output is from its target.

    • Example: Increase the activation of the '2' neuron and decrease that of the '8' neuron, each in proportion to its distance from the target.
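
The desired changes to the output layer can be sketched as "target minus current activation". The ten activation values below are hypothetical stand-ins for an untrained network's output, and desired_nudges is a name introduced only for this illustration.

```python
def desired_nudges(output, label):
    """Desired change per output neuron: target minus current activation."""
    return [(1.0 if i == label else 0.0) - a for i, a in enumerate(output)]

# Hypothetical activations of the 10 output neurons for an image of a '2'.
output = [0.5, 0.8, 0.2, 1.0, 0.4, 0.6, 1.0, 0.0, 0.2, 0.1]
nudges = desired_nudges(output, label=2)

# The '2' neuron should go up by 0.8; the '8' neuron only needs to
# come down by 0.2, so its nudge is smaller in size.
```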

Weight Adjustments

  • The weights feeding into a neuron can be adjusted in proportion to how much each one influences that neuron's activation.

  • Weights attached to more active neurons in the previous layer have a larger effect, so they warrant larger increases.

  • Focus on the changes that give the most "bang for your buck." This loosely echoes Hebbian theory: "neurons that fire together wire together."
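
The bullets above can be made concrete for a single output neuron. This sketch assumes a sigmoid activation and a squared-error cost, neither of which is stated in the notes, and neuron_gradients is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_gradients(prev_activations, weights, bias, target):
    """Gradients of (a - target)^2 for one sigmoid neuron (assumed setup)."""
    z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    a = sigmoid(z)
    delta = 2.0 * (a - target) * a * (1.0 - a)        # dC/dz
    weight_grads = [delta * a_prev for a_prev in prev_activations]  # dC/dw_k
    bias_grad = delta                                               # dC/db
    return weight_grads, bias_grad

# Previous layer: one bright neuron, one dim neuron, one inactive neuron.
prev = [0.9, 0.1, 0.0]
w_grads, b_grad = neuron_gradients(prev, [0.0, 0.0, 0.0], 0.0, target=1.0)

# Each weight's gradient is scaled by the activation it is attached to,
# so the weight from the brightest neuron receives the largest nudge.
```

Because the gradients here are negative, stepping against them increases the weights, and the weight attached to an inactive neuron is left unchanged.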

Propagating Changes Backward

  • The desired adjustments of the output layer influence the preceding layer.

  • The desired changes from all output neurons are added together, giving the overall change wanted in each neuron of the preceding layer.

  • This leads to adjusting weights and biases recursively throughout the network.
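
The backward step can be sketched as follows: each neuron in the preceding layer collects the wishes of every neuron in the layer after it, weighted by the connecting weights. The sigmoid derivative a * (1 - a) and the name backpropagate_delta are assumptions for this illustration.

```python
def backpropagate_delta(weights, deltas, prev_activations):
    """Push dC/dz of one layer back to dC/dz of the previous sigmoid layer.

    weights[j][k] connects previous-layer neuron k to current-layer neuron j.
    """
    prev_deltas = []
    for k, a_prev in enumerate(prev_activations):
        # Add up what every neuron in the current layer wants from neuron k.
        wanted = sum(weights[j][k] * deltas[j] for j in range(len(deltas)))
        # Scale by the (assumed) sigmoid derivative at neuron k.
        prev_deltas.append(wanted * a_prev * (1.0 - a_prev))
    return prev_deltas

weights = [[1.0, -2.0], [0.5, 0.0]]
deltas = [0.2, -0.4]
prev_deltas = backpropagate_delta(weights, deltas, [0.5, 0.5])

# For neuron 0 the two requests cancel (0.2 - 0.2); for neuron 1 only
# the first output neuron has a say.
```

Applying the same function layer by layer, from the output back toward the input, gives the recursive process described above.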

Averaging Adjustments Across Examples

  • Each training example nudges weights and biases differently.

  • To optimize, the same process is repeated for all training examples, averaging the desired changes.

  • Averaged over the full training set, these nudges are proportional to the negative gradient of the cost function.
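
A minimal sketch of the averaging step, assuming each training example has already produced its own gradient as a flat list of numbers (the values below are made up, and average_nudges is a hypothetical name):

```python
def average_nudges(per_example_grads):
    """Average per-example gradients and negate to get the downhill direction."""
    n = len(per_example_grads)
    n_params = len(per_example_grads[0])
    mean_grad = [sum(g[i] for g in per_example_grads) / n for i in range(n_params)]
    return [-g for g in mean_grad]

# Made-up gradients from three training examples, two parameters each.
grads = [[0.4, -0.2], [0.0, -0.6], [0.2, 0.2]]
nudge = average_nudges(grads)
```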

Computational Efficiency: Mini-Batching

  • Directly averaging adjustments for all training examples for each step is computationally expensive.

  • Instead, divide the training data into mini-batches (e.g., subsets of 100 examples).

  • Each mini-batch gives a reasonable approximation to the full gradient at a fraction of the computational cost.

  • This process is referred to as stochastic gradient descent.
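
The mini-batch procedure can be sketched on a toy problem. Everything specific here is an assumption for illustration: the helper names mini_batches and sgd_epoch, the learning rate, and the one-parameter cost (p - x)^2 standing in for a real network.

```python
import random

def mini_batches(examples, batch_size, rng):
    """Shuffle the training data and yield it in consecutive mini-batches."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def sgd_epoch(params, examples, grad_fn, batch_size=100, learning_rate=0.1, seed=0):
    """One pass over the data, taking one gradient step per mini-batch."""
    rng = random.Random(seed)
    for batch in mini_batches(examples, batch_size, rng):
        grads = [grad_fn(params, ex) for ex in batch]
        mean = [sum(g[i] for g in grads) / len(batch) for i in range(len(params))]
        params = [p - learning_rate * m for p, m in zip(params, mean)]
    return params

def toy_grad(params, x):
    # Gradient of the toy cost (p - x)^2 with respect to p.
    return [2.0 * (params[0] - x)]

data = [float(i % 10) for i in range(1000)]   # the mean of the data is 4.5
params = [0.0]
for epoch in range(20):
    params = sgd_epoch(params, data, toy_grad, seed=epoch)
```

Each step uses only 100 examples, so the path toward the minimum is noisier than full gradient descent, but every step is far cheaper to compute.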

Summary of Backpropagation Process

  • Backpropagation determines how individual training examples suggest that weights and biases should be adjusted.

  • It focuses on the relative proportions of these adjustments to minimize cost.

  • Use mini-batch processing for computational efficiency while iteratively adjusting towards a local minimum of the cost function.

Importance of Training Data

  • A substantial amount of labeled training data is necessary for the algorithm to be effective.

  • The MNIST database, a large collection of labeled handwritten digits, is a common example.

  • Challenges in machine learning often relate to acquiring enough quality training data.