12 - Feed Forward Neural Networks
Review of Linear Regression
Purpose: Fit to a dataset of m examples.
Gradient Descent:
Updating parameters, $\theta$, according to the negative of the gradient.
Methods of Gradient Descent:
Batch Gradient Descent: Computes the gradient over all m examples and updates parameters once per pass through the dataset.
Stochastic Gradient Descent: Updates parameters for each example.
Optimization Problem: Find the parameter values that minimize the loss over the training set.
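The two update rules above can be sketched in NumPy for linear regression with mean squared error; the function names and learning rate are illustrative, not from the notes.

```python
import numpy as np

def batch_gd_step(theta, X, y, lr=0.1):
    """One batch update: gradient of mean squared error over all m examples."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    return theta - lr * grad

def sgd_step(theta, x_i, y_i, lr=0.1):
    """One stochastic update from a single example (x_i, y_i)."""
    grad = x_i * (x_i @ theta - y_i)
    return theta - lr * grad
```

Batch gradient descent makes one well-averaged step per epoch; SGD makes m noisier but much cheaper steps.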
Review of Logistic Regression
Objective: Fit model to a dataset of m examples using logistic function.
Uses the same gradient descent approach as linear regression, but predicts binary outcomes through the logistic (sigmoid) function.
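A minimal sketch of the logistic regression update under the cross-entropy loss; the function name and learning rate are assumptions for illustration. Note that the gradient has the same form as in linear regression, with the sigmoid output as the prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_step(theta, X, y, lr=0.5):
    """One gradient step on the cross-entropy loss for m examples."""
    p = sigmoid(X @ theta)           # predicted probabilities in (0, 1)
    grad = X.T @ (p - y) / len(y)    # same form as linear regression
    return theta - lr * grad
```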
Review of Softmax Regression
Focus: Fit model to m examples using the softmax function to produce probabilities.
Follows similar gradient descent principles as above.
Importantly involves one-hot encoding for categorical outputs.
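The softmax function and one-hot encoding can be sketched as follows (a NumPy sketch; the max-subtraction trick for numerical stability is standard practice, not from the notes):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot(labels, num_classes):
    """Encode integer class labels as rows with a single 1."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y
```

Each softmax row is a probability distribution over the classes, matching the one-hot targets used in the loss.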
Notation and Model Structure
Logistic Regression Notation:
Input features include variables such as square footage (SF) and number of bedrooms (BR).
Model Complexity:
Logistic regression on raw input features is restricted to linear decision boundaries.
Capturing more nuanced relationships requires hand-crafted interaction terms or a model that learns intermediate representations, such as a neural network.
Neural Network Architecture Overview
Basic Structure:
Layers in a neural network include input, hidden, and output layers.
Forward Propagation:
Data flows from input layer through hidden layers to the output layer.
Each layer applies its own activation function to its weighted inputs.
Activation Functions:
Different layers commonly use different activation functions (e.g., ReLU in hidden layers, sigmoid or softmax at the output).
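Forward propagation through one hidden layer can be sketched in a few lines; the layer sizes and the ReLU/sigmoid pairing here are one illustrative choice, not the only one.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer (ReLU) -> output layer (sigmoid)."""
    h = relu(W1 @ x + b1)        # hidden-layer activations
    return sigmoid(W2 @ h + b2)  # output in (0, 1)
```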
Hidden Units in Neural Networks
Function of Hidden Units:
Each hidden unit is designed to learn various relationships and patterns from input data.
Individual interpretation of units can be complex or non-intuitive.
Architecture Design: What each hidden unit learns is shaped by the training data and the architecture rather than specified by hand, so design choices determine the learned representations.
Feed Forward Neural Networks (FFNN)
Configuration:
Fully connected layers where each unit's activations contribute to subsequent layers.
FFNNs do not loop back activations and maintain a directed flow.
In traditional contexts, an FFNN with only one layer of hidden neurons is considered shallow.
Activation Functions
Output Layer Activation Functions:
Sigmoid Function: Primarily used for binary classification producing a single probability output.
Linear Activation: Outputs a single value for regression tasks.
For multiple output units, the softmax function enables multiclass classification probabilities.
Softmax Activation:
Outputs a probability distribution across K classes, particularly effective in multiclass settings.
Multiclass Vs. Multilabel Classification
Multiclass Classification:
Mutually exclusive classes with one prediction attempt (e.g., softmax probabilities).
Multilabel Classification:
Non-exclusive predictions possible with simultaneous activations on separate outputs.
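The contrast between the two settings can be made concrete; the logits below are made-up values for illustration.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])  # hypothetical output-layer pre-activations

# Multiclass: softmax gives mutually exclusive probabilities summing to 1
p_class = np.exp(logits - logits.max())
p_class = p_class / p_class.sum()
prediction = int(np.argmax(p_class))  # exactly one class is chosen

# Multilabel: independent sigmoids, each output thresholded on its own
p_label = 1.0 / (1.0 + np.exp(-logits))
labels = p_label > 0.5  # any number of labels can be active at once
```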
Hidden Layer Activation Functions
ReLU (Rectified Linear Unit):
Most common hidden activation due to its effectiveness in deep networks.
Helps combat the vanishing gradient problem: its gradient is exactly 1 for positive inputs.
Computationally cheap (a single comparison with zero).
Leaky ReLU:
Variation addressing the dying ReLU issue. Allows a small, non-zero gradient when input is less than zero.
GELU (Gaussian Error Linear Unit):
Smooth and continuously differentiable, which can stabilize gradients; widely used in transformer models.
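The three hidden-layer activations above can be sketched as (GELU is shown via its common tanh approximation, an assumption here since the notes do not specify a form):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for z < 0 keeps a non-zero gradient ("dying ReLU" fix)
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))
```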
Training Techniques
Forward Propagation:
Compute each layer's activations in sequence, from the input layer through the hidden layers to the output.
Backpropagation:
Backpropagation applies the chain rule to derive the gradient of the loss with respect to every weight; the specific expressions depend on the activation functions used.
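A minimal backpropagation sketch for a one-hidden-layer network (ReLU hidden, linear output, squared error); the shapes and function name are illustrative assumptions.

```python
import numpy as np

def backprop(x, y, W1, b1, W2, b2):
    """Return loss gradients for a 1-hidden-layer net via the chain rule."""
    # Forward pass, keeping intermediates needed for the backward pass
    h_pre = W1 @ x + b1
    h = np.maximum(0.0, h_pre)              # ReLU hidden layer
    y_hat = W2 @ h + b2                     # linear output
    # Backward pass, layer by layer
    d_out = y_hat - y                       # dL/dy_hat for L = 0.5*(y_hat - y)^2
    dW2 = np.outer(d_out, h)
    db2 = d_out
    d_h = (W2.T @ d_out) * (h_pre > 0)      # ReLU gradient gates the signal
    dW1 = np.outer(d_h, x)
    db1 = d_h
    return dW1, db1, dW2, db2
```

Swapping the hidden activation (e.g., to sigmoid) changes only the local derivative term that gates `d_h`.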
Optimization Algorithms
Strategies to Enhance Learning:
Momentum: Accumulates an exponentially decaying average of past gradients to smooth the update direction.
RMSprop: Keeps a moving average of squared gradients and scales each parameter's step by its inverse square root, adapting the learning rate per parameter.
Adam Optimizer: Combines momentum and RMSprop, with bias correction of both moment estimates.
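Since Adam subsumes the other two ideas, a single sketch covers all three; the hyperparameter defaults shown are the commonly used ones, assumed here rather than taken from the notes.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update combining momentum (m) and RMSprop-style scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: RMSprop
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```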
Gradient Challenges
Vanishing Gradients:
Arise mainly with saturating activation functions (e.g., sigmoid) stacked across hidden layers; gradients shrink toward zero and learning stagnates.
Exploding Gradients:
Large weights can produce excessively large gradients, destabilizing training.
Limiting gradient magnitudes (clipping) and careful weight initialization help mitigate these issues.
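One common mitigation for exploding gradients is clipping the gradient by its norm; a minimal sketch (the threshold value is arbitrary):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm; direction is kept."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```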
Regularization Techniques
Purpose: Prevent overfitting with techniques like weight decay (L2), dropout, and batch normalization.
Dropout: Randomly zeroes activations during training so the network cannot depend on any single unit; all units are active at test time.
Batch Normalization: Normalizes each layer's inputs to zero mean and unit variance per mini-batch, stabilizing and often accelerating training.
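Dropout is simple enough to sketch directly; the "inverted" rescaling shown here is the standard trick that keeps the expected activation unchanged, assumed rather than stated in the notes.

```python
import numpy as np

def dropout(h, p_drop, rng, training=True):
    """Inverted dropout: zero each activation with probability p_drop and
    rescale survivors by 1/(1 - p_drop); return h untouched at test time."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```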
Weight Initialization Strategies
Xavier and He Initialization:
Techniques that scale initial weights to a layer's fan-in and fan-out (Xavier for sigmoid/tanh, He for ReLU), keeping activation and gradient variances stable and preventing vanishing or exploding gradients early in training.
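The two schemes can be sketched as follows (uniform Xavier and normal He are shown; both also have uniform/normal variants):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: suited to sigmoid/tanh layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_init(fan_in, fan_out, rng):
    """He normal: variance 2/fan_in, suited to ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
```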
Conclusion
Importance of Design Decisions: Decisions regarding architecture, activation functions, and initialization impact overall network training efficiency and effectiveness.
Ongoing research and literature continually inform best practices in neural network design and training.
Additional Resources
For more detailed discussions, refer to the optional slides and provided academic literature for advanced understanding.