12 - Feed Forward Neural Networks

Review of Linear Regression

  • Purpose: Fit a linear model to a dataset of m examples by minimizing a squared-error loss.

  • Gradient Descent:

    • Updating parameters, $\theta$, according to the negative of the gradient of the loss.

  • Methods of Gradient Descent:

    • Batch Gradient Descent: Computes the gradient over the entire training set and updates parameters once per epoch.

    • Stochastic Gradient Descent: Updates parameters for each example.

  • Optimization Problem: Essential to identify parameters that minimize the loss.
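The two update schemes above can be sketched in NumPy; the toy dataset (generated from y = 1 + 2x) and the learning rate are illustrative assumptions, not values from the notes:

```python
import numpy as np

# Hypothetical toy dataset; the column of ones adds the intercept term.
X = np.c_[np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])]
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 1 + 2x

def batch_gd_step(theta, X, y, lr=0.01):
    """One update per epoch: gradient averaged over all m examples."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    return theta - lr * grad

def sgd_epoch(theta, X, y, lr=0.01):
    """One update per example: m updates per epoch, in shuffled order."""
    for i in np.random.permutation(len(y)):
        grad = (X[i] @ theta - y[i]) * X[i]
        theta = theta - lr * grad
    return theta

theta = np.zeros(2)
for _ in range(5000):
    theta = batch_gd_step(theta, X, y)
# theta approaches the true parameters [1, 2]
```

Both schemes solve the same optimization problem; SGD simply trades gradient accuracy per step for more frequent updates.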

Review of Logistic Regression

  • Objective: Fit a model to a dataset of m examples using the logistic (sigmoid) function.

  • Uses the same gradient descent approach as linear regression, with an emphasis on predicting binary outcomes.

Review of Softmax Regression

  • Focus: Fit a model to m examples using the softmax function to produce a probability distribution over classes.

  • Follows similar gradient descent principles as above.

  • Categorical targets are represented with one-hot encoding.
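A minimal sketch of the softmax function and one-hot encoding; the logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def one_hot(label, K):
    """Encode an integer class label as a length-K indicator vector."""
    v = np.zeros(K)
    v[label] = 1.0
    return v

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)            # probabilities summing to 1
t = one_hot(0, 3)              # target class 0 -> [1, 0, 0]
cross_entropy = -np.sum(t * np.log(p))
```

The one-hot target makes the cross-entropy loss pick out only the predicted probability of the true class.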

Notation and Model Structure

  • Logistic Regression Notation:

    • Input features include variables such as square footage (SF) and number of bedrooms (BR).

  • Model Complexity:

    • A purely linear model is restricted in what it can express: it cannot capture interactions between features (e.g., SF × BR) on its own.

    • A more nuanced representation requires hand-crafted interaction terms or a deeper model that learns such features automatically.

Neural Network Architecture Overview

  • Basic Structure:

    • Layers in a neural network include input, hidden, and output layers.

  • Forward Propagation:

    • Data flows from input layer through hidden layers to the output layer.

    • Each layer has its activation functions defined.

  • Activation Functions:

    • It is common to use different activation functions in different layers of the network (e.g., a nonlinearity in hidden layers, a task-appropriate function at the output).

Hidden Units in Neural Networks

  • Function of Hidden Units:

    • Each hidden unit is designed to learn various relationships and patterns from input data.

    • Individual interpretation of units can be complex or non-intuitive.

  • Architecture Design: What the hidden units learn is shaped by the training data and the architecture design rather than specified directly, so both are essential to the outcome.

Feed Forward Neural Networks (FFNN)

  • Configuration:

    • Fully connected layers where each unit's activations contribute to subsequent layers.

    • FFNNs do not loop back activations and maintain a directed flow.

    • In traditional contexts, an FFNN with only one layer of hidden neurons is considered shallow.
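A shallow FFNN with one fully connected hidden layer can be sketched as a single directed forward pass; the layer sizes, random weights, and input values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    """Directed, fully connected flow: input -> hidden -> output, no loops."""
    a1 = relu(W1 @ x + b1)          # hidden layer activations
    return sigmoid(W2 @ a1 + b2)    # single probability output

x = np.array([0.5, -1.0, 2.0])
y_hat = forward(x)                  # value in (0, 1)
```

Every hidden unit receives all inputs and contributes to the output, which is what "fully connected" means here.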

Activation Functions

  • Output Layer Activation Functions:

    • Sigmoid Function: Primarily used for binary classification producing a single probability output.

    • Linear Activation: Outputs a single value for regression tasks.

    • For multiple output units, the softmax function enables multiclass classification probabilities.

  • Softmax Activation:

    • Outputs a probability distribution across K classes, particularly effective in multiclass settings.

Multiclass Vs. Multilabel Classification

  • Multiclass Classification:

    • Mutually exclusive classes with one prediction attempt (e.g., softmax probabilities).

  • Multilabel Classification:

    • Labels are not mutually exclusive; each output unit makes an independent prediction, so several labels can be active simultaneously.
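The distinction can be sketched by running the same logits through both output schemes; the logit values and the 0.5 decision threshold are illustrative assumptions:

```python
import numpy as np

logits = np.array([1.5, -0.5, 0.3])

# Multiclass: softmax gives ONE distribution; classes are mutually exclusive.
e = np.exp(logits - logits.max())
p_multiclass = e / e.sum()               # sums to 1; predict the argmax

# Multilabel: an independent sigmoid per output; labels are not exclusive.
p_multilabel = 1.0 / (1.0 + np.exp(-logits))
predicted_labels = p_multilabel > 0.5    # several labels may be active at once
```

Note the softmax probabilities compete for a fixed budget of 1, while the sigmoid outputs do not interact at all.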

Hidden Layer Activation Functions

  • ReLU (Rectified Linear Unit):

    • Most common due to effectiveness seen in deep networks.

    • Its gradient is exactly 1 for positive inputs, which helps combat the vanishing gradient problem.

    • Computation efficiency is favorable.

  • Leaky ReLU:

    • Variation addressing the dying ReLU issue. Allows a small, non-zero gradient when input is less than zero.

  • GELU (Gaussian Error Linear Unit):

    • Continuously differentiable (smooth where ReLU has a kink), which can stabilize gradients; widely used in transformer models.
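The three hidden-layer activations can be sketched directly; the GELU shown uses the common tanh approximation, an assumption since the notes do not specify a variant:

```python
import numpy as np

def relu(z):
    """Zero for negative inputs, identity otherwise; gradient is 0 or 1."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Small slope alpha for z < 0 avoids 'dead' units with zero gradient."""
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    """Tanh approximation of GELU (assumed variant), smooth everywhere."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2.0, 0.0, 2.0])
```

For large positive z all three behave roughly like the identity; they differ mainly in how they treat negative inputs.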

Training Techniques

  • Forward Propagation:

    • Each layer applies its weights, bias, and activation function to the previous layer's outputs; tracing example computations shows how values evolve through the network.

  • Backpropagation:

    • Deriving gradients to optimize weights in a neural network is essential for learning, with specifics changing based on the activation function used.
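A hedged sketch of backpropagation through one hidden layer, using ReLU hidden units, a sigmoid output, and cross-entropy loss (a specific choice these notes leave open); input and weight values are random illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), 1.0
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(W1, b1, W2, b2, x, y):
    # Forward pass.
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)                 # ReLU hidden layer
    y_hat = sigmoid(W2 @ a1 + b2)[0]
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Backward pass: chain rule, layer by layer.
    dz2 = y_hat - y                          # sigmoid + cross-entropy shortcut
    dW2 = dz2 * a1[None, :]
    db2 = np.array([dz2])
    da1 = dz2 * W2[0]
    dz1 = da1 * (z1 > 0)                     # ReLU gradient is 0 or 1
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, dW1, db1, dW2, db2

loss, dW1, db1, dW2, db2 = loss_and_grads(W1, b1, W2, b2, x, y)
```

Swapping the activation function changes only the local derivative terms (here `z1 > 0`), which is the point the notes make about backpropagation depending on the activation used.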

Optimization Algorithms

  • Strategies to Enhance Learning:

    • Momentum: Accumulates previous parameter updates to smooth and accelerate learning.

    • RMSprop: Maintains a running average of squared gradients to scale each parameter's step size.

    • Adam Optimizer: A combination of momentum and RMSprop.
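The three update rules can be sketched as plain NumPy functions; the hyperparameter defaults follow common conventions rather than anything specified in the notes, and the quadratic test function is an illustrative assumption:

```python
import numpy as np

def momentum_step(theta, vel, grad, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity from past gradients."""
    vel = beta * vel - lr * grad
    return theta + vel, vel

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    """RMSprop: scale each step by a running average of squared gradients."""
    s = beta * s + (1 - beta) * grad**2
    return theta - lr * grad / (np.sqrt(s) + eps), s

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style mean (m) plus RMSprop-style scaling (v),
    with bias correction for the zero-initialized averages."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = theta^2 (gradient 2*theta) with Adam.
theta = np.array([3.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t, lr=0.05)
```

Adam's two running averages are exactly the momentum and RMSprop statistics combined, which is why it is described as a combination of the two.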

Gradient Challenges

  • Vanishing Gradients:

    • Experienced primarily with saturating activation functions (e.g., sigmoid) across many hidden layers: repeated small derivatives shrink the gradient and lead to learning stagnation.

  • Exploding Gradients:

    • High weight parameters can create excessively large gradients leading to instability.

    • Gradient clipping, weight regularization, and careful initialization can mitigate these issues.
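One common mitigation for exploding gradients, clipping by global norm, can be sketched as follows; the threshold and gradient values are illustrative assumptions:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm;
    the direction of the update is preserved, only its length shrinks."""
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0])]                       # norm 5, above threshold
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```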

Regularization Techniques

  • Purpose: Prevent overfitting with techniques like weight decay (L2), dropout, and batch normalization.

    • Dropout: Randomly zeroing activations during training to ensure model robustness against singular dependencies.

    • Batch Normalization: Normalizes a layer's inputs over each mini-batch to stabilize training by reducing covariate shift across batches.
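Dropout in its standard "inverted" formulation (assumed here, since the notes do not name a variant) can be sketched in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p_drop=0.5, training=True):
    """Inverted dropout: zero each activation with probability p_drop and
    rescale survivors by 1/(1 - p_drop), so the expected activation is
    unchanged; at test time it is the identity."""
    if not training:
        return a
    mask = rng.random(a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)

a = np.ones(10000)
a_train = dropout(a, p_drop=0.5, training=True)   # ~half zeroed, rest doubled
a_test = dropout(a, training=False)               # unchanged
```

Because no single unit survives every forward pass, the network cannot rely on any one activation, which is the robustness the notes describe.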

Weight Initialization Strategies

  • Xavier and He Initialization:

    • Techniques tailored for respective activation functions to enable suitable scaling of initial weights, preventing gradient issues during training.
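Both schemes can be sketched as follows; the normal-distribution variant (rather than uniform) and the layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh/sigmoid,
    keeping activation scales roughly constant across layers."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    """He: variance 2 / fan_in, compensating for ReLU zeroing half its inputs."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = he_init(1000, 1000)
```

Matching the initializer to the activation function keeps forward activations and backward gradients at a usable scale from the very first step.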

Conclusion

  • Importance of Design Decisions: Decisions regarding architecture, activation functions, and initialization impact overall network training efficiency and effectiveness.

  • Ongoing research and literature continually inform best practices in neural network design and training.

Additional Resources

  • For more detailed discussions, refer to the optional slides and provided academic literature for advanced understanding.