07 Deep Learning

Deep Learning Overview

  • Course: MACHINE LEARNING COMP.5450

  • Instructor: Dr. Ruizhe Ma

ArgMax and SoftMax

  • ArgMax: Identifies the largest value from the output layer.

    • Example: If output values are [0.6, 0.1, 0.2], ArgMax returns index 0, since 0.6 is the largest value.

    • Commonly used for testing.

  • SoftMax: Converts a vector of K real numbers into a probability distribution.

    • Ensures all probabilities sum to 1.

    • Commonly used for training.

    • Formula: \[ P(y_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

    • Where \( z_i \) is the raw output (logit) for class \( i \).
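
The relationship between the two operations can be sketched in a few lines of NumPy (logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    # because e^(z - c) / sum(e^(z - c)) = e^z / sum(e^z).
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 0.5, 1.0])   # raw outputs (logits) for K = 3 classes
p = softmax(z)
print(p.sum())                  # probabilities sum to 1
print(np.argmax(p))             # index of the most likely class: 0
```

Since softmax is monotonic, ArgMax gives the same class whether applied to the raw logits or to the probabilities, which is why ArgMax alone suffices at test time.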

Deep Architectures

  • More hidden layers make an architecture deeper.

  • Definition: A deep architecture is composed of multiple non-linear operations; a network with more than one hidden layer generally qualifies as deep learning.

Vanishing Gradient Problem

  • In feedforward networks, weight updates are proportional to the gradient, so larger gradients produce larger adjustments to the weights.

  • During backpropagation, gradients can diminish layer by layer such that earlier layers may not learn effectively.

  • The issue arises because gradients are multiplied layer by layer as they propagate backward through many layers; the product can shrink toward zero (the vanishing gradient).
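
The shrinking effect can be demonstrated numerically: the derivative of the sigmoid never exceeds 0.25, so a chain of 10 sigmoid layers scales the gradient by at most 0.25^10 (a toy sketch, ignoring weight terms):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)   # maximum value is 0.25, at x = 0

# The gradient reaching the first layer is (roughly) a product of one
# derivative factor per layer, so it shrinks geometrically with depth.
grad = 1.0
for layer in range(10):
    grad *= sigmoid_prime(0.0)   # 0.25, the sigmoid's steepest point

print(grad)   # 0.25 ** 10, roughly 9.5e-7
```

Even in this best case the gradient reaching layer 1 is under one millionth of its original size, which is why early layers stop learning.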

Deep Architecture in the Brain

  • Area V1 detects edges (1st layer).

  • Area V2 recognizes primitive shapes (2nd layer).

  • Area V4 identifies higher-level abstractions and objects (3rd layer).

  • Deep neural networks mirror this layered hierarchy, processing visual information in successive stages much as the visual cortex does.

Theoretical Advantages of Deep Architectures

  • Dynamic Standardization: Network design choices are largely settled by trial and error rather than fixed standards, allowing flexibility.

  • Function Representation: Some complex functions are not efficiently represented by shallow architectures.

  • More formally, functions that a depth-k architecture can represent compactly may require a much larger architecture at depth (k-1).

    • Computational consequences: Fewer computational elements (nodes) are needed in each layer.

    • Statistical implications: Insufficient depth can lead to poor generalization.

Auto Encoder

  • Structure: Similar input and output layers, bottleneck hidden layer with fewer nodes.

  • Purpose: Compress data while maintaining quality; learns without supervision.

  • Components:

    • Encoding function to compress data.

    • Decoding function for reconstruction.

    • Measuring reconstruction loss to assess decoder performance.

  • Main usages: Data denoising and dimensionality reduction.
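
The encode/decode/loss structure above can be sketched as a minimal linear autoencoder in NumPy (layer sizes and random weights are illustrative; a real model would train the weights to minimize the loss):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3   # bottleneck: 3 hidden nodes < 8 input nodes

# Untrained random weights, for structure only
W_enc = rng.normal(size=(n_hidden, n_in)) * 0.1
W_dec = rng.normal(size=(n_in, n_hidden)) * 0.1

def encode(x):
    return np.tanh(W_enc @ x)   # compress 8 values down to 3

def decode(h):
    return W_dec @ h            # reconstruct 8 values from 3

def reconstruction_loss(x):
    # Mean squared error between input and its reconstruction;
    # training would adjust W_enc and W_dec to minimize this.
    return np.mean((x - decode(encode(x))) ** 2)

x = rng.normal(size=n_in)
print(encode(x).shape)          # (3,)  -- the compressed code
print(reconstruction_loss(x))   # non-negative scalar
```

Because the loss compares the input to itself after reconstruction, no labels are needed, which is what makes the learning unsupervised.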

Recurrent Neural Networks (RNNs)

  • Designed for temporal data dependencies (e.g., time series, language processing).

  • Contextual importance: Past decisions impact current outcomes.

  • Applications include predictions, text summarization, speech recognition, etc.
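
How past inputs influence current outcomes can be seen in a single recurrent step: the new hidden state depends on both the current input and the previous hidden state. A minimal sketch (sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 4, 5   # assumed input and hidden sizes

W_xh = rng.normal(size=(n_hid, n_in)) * 0.1    # input-to-hidden weights
W_hh = rng.normal(size=(n_hid, n_hid)) * 0.1   # hidden-to-hidden (recurrent)
b = np.zeros(n_hid)

def rnn_step(x, h_prev):
    # The recurrent term W_hh @ h_prev carries context forward in time,
    # so earlier inputs affect the current state.
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)

h = np.zeros(n_hid)
for x in rng.normal(size=(3, n_in)):   # a sequence of 3 time steps
    h = rnn_step(x, h)
print(h.shape)                         # (5,)
```

The same weights W_xh and W_hh are reused at every time step, which is what lets an RNN handle sequences of any length.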

Problems with RNNs

  1. Complex Training: Difficulty in remembering information long-term.

  2. Exploding Gradient Problem: Occurs when large error gradients accumulate.

  3. Vanishing Gradient Problem: The derivative of loss approaches zero, hindering learning in deeper layers.

Activation Functions: Sigmoid and Tanh

  • Sigmoid: Outputs lie in (0, 1), which is useful for binary classification.

  • Tanh: A rescaled sigmoid whose outputs lie in (-1, 1).

  • Both can lead to problems like saturation in deep networks.
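
The output ranges and the saturation problem can both be checked numerically (a small NumPy sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 5)
print(sigmoid(x))   # all values in (0, 1)
print(np.tanh(x))   # all values in (-1, 1); tanh(x) = 2*sigmoid(2x) - 1

# Saturation: far from 0 the curves flatten out, so the derivative
# (and hence the gradient) is nearly zero.
print(sigmoid(10) * (1 - sigmoid(10)))   # roughly 4.5e-5
```

Near-zero derivatives at saturated units are one source of the vanishing gradient problem in deep networks.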

Memory in Neural Networks

  • RNNs often struggle with short-term memory.

  • LSTMs and GRUs: Address memory issues through memory cells that manage long-term dependencies by controlling what information to remember or forget.

Long Short Term Memory (LSTM)

  • Enhancements include separate long-term (cell) and short-term (hidden) states, managed by gates.

  • LSTM helps in processing data with temporal dependencies effectively.
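
A single LSTM step can be sketched to show how the gates manage the two states (sizes and random weights are illustrative; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4   # assumed input and hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, each acting on [h_prev, x]
Wf, Wi, Wo, Wc = (rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
                  for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)                     # forget gate: what to drop
    i = sigmoid(Wi @ z)                     # input gate: what new info to store
    o = sigmoid(Wo @ z)                     # output gate: what to expose
    c = f * c_prev + i * np.tanh(Wc @ z)    # long-term cell state
    h = o * np.tanh(c)                      # short-term hidden state
    return h, c

h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):        # a 5-step input sequence
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)                     # (4,) (4,)
```

The additive update of the cell state (f * c_prev + new content) lets gradients flow further back in time than a plain RNN's repeated matrix multiplications allow.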

Gated Recurrent Units (GRUs)

  • Simplified version of LSTMs that combines forget and input gates into a single update gate.

  • GRUs are more parameter-efficient and faster to train compared to LSTMs.
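
The merged update gate can be sketched in the same style as the LSTM step above; note only three weight matrices are needed instead of four, which is where the parameter savings come from (sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4   # assumed input and hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three weight matrices (an LSTM needs four), biases omitted for brevity
Wz, Wr, Wh = (rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
              for _ in range(3))

def gru_step(x, h_prev):
    zx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ zx)   # update gate: merged forget + input gate
    r = sigmoid(Wr @ zx)   # reset gate: how much past state to consult
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x]))
    # The single gate z both forgets old state (1 - z) and admits new (z).
    return (1 - z) * h_prev + z * h_cand

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h)
print(h.shape)   # (4,)
```

There is also no separate cell state: the hidden state alone carries memory, further reducing the work per step.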

Image Recognition

  • Image classification is challenging: many different objects must be detected and categorized accurately.

Convolutional Neural Networks (CNN)

  • CNNs manage spatial hierarchies in images, with convolutional and pooling layers crucial for feature extraction.

  • Pooling: Reduces spatial size, helpful in decreasing computational load.

    • Types: Max Pooling (extracts max values) and Average Pooling (extracts average values).

  • Effective at managing high-dimensional inputs; weight sharing and pooling reduce the parameter count, which helps avoid overfitting and improves model performance.
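
Both pooling types can be sketched over non-overlapping 2x2 blocks in NumPy (the example image values are made up):

```python
import numpy as np

def pool2x2(img, op):
    h, w = img.shape
    # View the image as non-overlapping 2x2 blocks, then reduce each
    # block with the given operation (np.max or np.mean).
    blocks = img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return op(blocks, axis=(1, 3))

img = np.array([[1., 3., 2., 0.],
                [4., 2., 1., 1.],
                [0., 1., 5., 2.],
                [2., 2., 3., 4.]])

print(pool2x2(img, np.max))    # max pooling: [[4., 2.], [2., 5.]]
print(pool2x2(img, np.mean))   # average pooling: [[2.5, 1.], [1.25, 3.5]]
```

Either way, a 4x4 input becomes a 2x2 output, quartering the spatial size (and the computation in later layers) while keeping a summary of each region.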

Applications of CNNs

  • Case studies like Fashion-MNIST, automatic image colorization, and caption generation for visual data.

Neural Network Limitations

  • Vulnerability to adversarial attacks (e.g., image alterations leading to misclassification).

  • Generalization issues faced by DNNs in image perception and recognition.

Conclusion

  • Understanding the technological framework behind deep learning and its neural networks is essential for applications in various fields like computer vision, natural language processing, and more.