P5: Notes on Gradient Problems in Neural Networks
Introduction to Gradient Problems in Neural Networks
- Discussion focuses on two significant issues in training deep neural networks: vanishing and exploding gradients.
- As the number of layers and neurons increases, understanding how modifications to parameters propagate through the network becomes critical.
- Issue identification is crucial: determining whether poor training arises from the parameter updates themselves or from over- or under-fitting.
Key Concepts
- Gradient Issues:
- The main gradient problems are vanishing gradients and exploding gradients.
- The exploding gradient problem has been largely alleviated through techniques such as careful weight initialization and batch normalization, while vanishing gradients remain a challenge.
Mechanics of Neural Network Activation
The activation at layer $l$ is produced by applying the activation function, denoted $g$, to a linear combination of the previous layer's activations $a_{l-1}$ and the weight matrix $W_l$.
Mathematical representation of the activation at layer $l$:
$$a_l = g(W_l\, a_{l-1} + b_l)$$
The influence of the weight matrices ($W$) compounds across layers; if the weights grow large enough, the weighted term dominates and the original input contributes almost nothing to the output.
Gradient calculations are performed through the loss function to derive updates for weights.
The gradient of the loss $L$ with respect to the weights of layer $l$ follows from the chain rule:
$$\frac{\partial L}{\partial W_l} = \frac{\partial L}{\partial a_l}\; g'\!\left(W_l\, a_{l-1}\right)\, a_{l-1}^{\top}$$
where $g'$ is the derivative of the activation function.
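The forward step described above can be sketched in NumPy. The layer sizes and the choice of sigmoid as $g$ are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation g; its derivative is g'(z) = g(z) * (1 - g(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(a_prev, W, b):
    """One layer's forward step: a_l = g(W_l @ a_{l-1} + b_l)."""
    z = W @ a_prev + b
    return sigmoid(z), z

rng = np.random.default_rng(0)
a0 = rng.standard_normal(4)              # input activations a_0 (size 4, assumed)
W1 = 0.5 * rng.standard_normal((3, 4))   # weight matrix W_1 mapping 4 -> 3 units
b1 = np.zeros(3)                         # bias b_1

a1, z1 = forward(a0, W1, b1)
print(a1.shape)  # (3,)
```

Backpropagation then reuses the cached pre-activation `z1` to evaluate $g'$ when forming the weight gradient.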
Understanding Vanishing and Exploding Gradients
The vanishing gradient problem occurs when weights are initialized to very small values, leading to:
- Gradients that approach zero due to the repeated multiplication of small values in deep layers, making it difficult for the network to learn (essentially saturating the response).
- Repeated products of small weights (like $0.01^{l}$ across $l$ layers) that shrink toward zero.
The exploding gradient problem occurs when weights are initialized as larger values (e.g., above 1), resulting in:
- Gradients that grow exponentially large, causing instability in the learning process and convergence failures (oscillations happen due to large weight adjustments).
- For instance, larger weights (like $1.5^{l}$) yield outputs that grow toward infinity as depth increases.
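Both failure modes can be demonstrated numerically. The depth of 50 and the use of scaled identity matrices are simplifying assumptions for illustration:

```python
import numpy as np

L = 50  # assumed network depth for the demonstration

# Scalar caricature: the products 0.01^l and 1.5^l from the notes.
small = 0.01 ** np.arange(1, L + 1)   # shrinks toward zero -> vanishing
large = 1.5 ** np.arange(1, L + 1)    # grows without bound -> exploding

print(small[-1])   # vanishingly small (~1e-100)
print(large[-1])   # enormous (~6e8)

# Matrix version: push an activation vector through L layers whose
# weights are scaled identity matrices (a deliberately simple stand-in).
a = np.ones(4)
for scale in (0.01, 1.5):
    x = a.copy()
    for _ in range(L):
        x = (scale * np.eye(4)) @ x
    print(scale, np.linalg.norm(x))
```

The same multiplicative effect hits the gradients during backpropagation, since they accumulate the same per-layer factors in reverse.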
Technical Implications
The essential factors influencing gradient descent are:
- Weight Initialization: Appropriately scaled initial weights (neither too small nor too large) can mitigate both vanishing and exploding gradients.
- Network Depth: The number of layers ($l$) significantly amplifies the behavior of the weight products:
- If weight magnitudes are below 1: vanishing gradients; updates are minimal or stagnant.
- If weight magnitudes are above 1: exploding gradients; updates are excessively large, leading to oscillation and divergence.
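The effect of initialization scale can be illustrated with a Xavier/Glorot-style scheme, which sets the weight variance to roughly $1/\text{fan\_in}$ so the signal magnitude stays stable across layers. The layer width, depth, and tanh activation here are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 256   # assumed layer width
depth = 20     # assumed number of layers

# Naive small initialization: the signal shrinks layer by layer.
W_naive = 0.01 * rng.standard_normal((fan_in, fan_in))

# Xavier/Glorot-style scaling: Var(W) ~ 1 / fan_in keeps the variance
# of activations roughly constant from layer to layer.
W_xavier = rng.standard_normal((fan_in, fan_in)) / np.sqrt(fan_in)

x = rng.standard_normal(fan_in)
norms = {}
for name, W in (("naive", W_naive), ("xavier", W_xavier)):
    h = x.copy()
    for _ in range(depth):
        h = np.tanh(W @ h)
    norms[name] = np.linalg.norm(h)
    print(name, norms[name])
```

The naive run collapses toward zero while the scaled run keeps activations at a usable magnitude, which is why initialization choice matters for the gradient behavior described above.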
The gradients ($\nabla W$) become problematic under both scenarios, reducing learning efficiency and rendering networks non-responsive to input changes.
Conclusion
- The vanishing gradient problem can lead to saturation in the learning process, where weight updates, and thus the network's learning progress, become negligible.
- The exploding gradient problem can cause erratic training behavior, leading to convergence issues.
- Understanding these issues is crucial for designing effective learning algorithms in deep neural networks, guiding the selection of proper initialization techniques and activation functions.