3. Deep Learning Basics & Optimisation & CNNs

Last updated 1:20 PM on 5/28/26

53 Terms

New cards

The Machine Learning Flow Steps

Define Task
Represent Data
Select Data
Select Metrics
Develop Machine Learning Model

New cards

The Baseline Equation of Neural Networks

A fundamental equation used to map combinations of basic features hierarchically from low-level edges to high-level concepts, mathematically represented as:

$y = f(\mathbf{x} \cdot \mathbf{W}) = \tanh\left(\sum_{k=1}^{n} x_k w_k\right)$

New cards

Generalized Linear Model (GLM) (eq)

A framework that models the output with an exponential-family distribution whose natural parameter $\eta_n$ is a linear combination of the inputs ($\eta_n = w^{T}x_n$), with log-partition function $A$ and dispersion $\sigma^2$:

$p(y_n|x_n, w, \sigma^2) = \exp\left[\frac{y_n\eta_n - A(\eta_n)}{\sigma^2} + \log h(y_n, \sigma^2)\right]$

New cards

Basis Function (eq)

A method that replaces raw input vectors with transformed representations ($\phi$) that possess their own unique tuning parameters ($\theta_2$):

$f(x;\theta) = \mathbf{W}\phi(x;\theta_2) + b$

New cards

Complex Function Composition (eq)

The mechanism by which deep neural networks stack transformations recursively over $L$ layers to form highly complex, composite mathematical functions:

$f(x;\theta) = f_L(f_{L-1}(\dots(f_1(x))\dots))$

New cards

Perceptron (eq)

A deterministic binary classifier that takes an input vector, computes a linear combination of those inputs and their weights, and passes the result through a step function.

The mathematical representation of a perceptron using the Heaviside step function ($H$) or indicator function ($\mathbb{I}$):

$f(x;\theta) = \mathbb{I}(w^{T}x + b \ge 0) = H(w^{T}x + b)$
A single perceptron can only learn linear decision boundaries. Because of this, it fails entirely at non-linearly separable tasks, famously illustrated by the XOR problem.

New cards

Multilayer Perceptron (MLP) (def)

A feedforward neural network formed by the recursive recombination of multiple layers of neurons. Stacking these layers allows the network to overcome the linear limitations of a single perceptron.
An MLP solves the non-linearly separable XOR problem with a single hidden layer: one neuron ($h_1$) acts as an AND gate, another ($h_2$) acts as an OR gate, and the output ($y_1$) fires only when $h_2$ is on and $h_1$ is off.
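The XOR construction can be checked by hand with fixed weights. The sketch below is a minimal illustration in plain Python; the specific weights and thresholds are assumptions chosen for this example (many other choices work), and the output neuron is wired as "OR and not AND".

```python
def heaviside(a):
    """Heaviside step: 1 if a >= 0, else 0."""
    return 1 if a >= 0 else 0


def xor_mlp(x1, x2):
    # Hidden neuron h1 acts as an AND gate (illustrative weights and bias).
    h1 = heaviside(x1 + x2 - 1.5)
    # Hidden neuron h2 acts as an OR gate (illustrative weights and bias).
    h2 = heaviside(x1 + x2 - 0.5)
    # Output y1 fires only when OR is on and AND is off.
    y1 = heaviside(h2 - h1 - 0.5)
    return y1


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))
```

No single perceptron can reproduce this truth table, because the two positive cases cannot be separated from the two negative cases by one straight line.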

New cards

What is the Universal Function Approximator property of MLPs?

A core theorem stating that an MLP with sufficient size (width or depth) and a suitable non-linear activation can approximate any continuous function on a bounded input domain to arbitrary accuracy.

New cards

What is the requirement to train an MLP via gradient descent?

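All activation functions must be differentiable (at least almost everywhere) with a gradient that is not zero everywhere, which is why the Heaviside step function, whose derivative is zero wherever it is defined, cannot be used for backpropagation.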

New cards

Activation functions (def + ex)

Mathematical functions added to neurons to introduce the non-linearity required to learn complex relationships. To allow for gradient-based optimization, they must be differentiable and computationally efficient to evaluate.

Heaviside Step Function
Sigmoid Function
Tanh (Hyperbolic Tangent) Function
ReLU (Rectified Linear Unit)
Leaky ReLU

New cards

Heaviside Step Function (eq)

A binary threshold function that outputs 0 for negative inputs and 1 for non-negative inputs. It is non-differentiable at the threshold and has zero gradient everywhere else:

$f(x;\theta)=\mathbb{I}(w^{T}x+b\ge0)=H(w^{T}x+b)$

New cards

Sigmoid Function and its derivative (eq)

An S-shaped activation function that maps inputs to a continuous $(0, 1)$ range. While it mimics biological neural spike thresholds, it suffers from severe saturation (vanishing gradients) at its outer bounds.

$\sigma(a) = \frac{1}{1 + e^{-a}}$

The derivative of the sigmoid function, highly efficient to compute because it relies on the output of the function itself:

$\sigma'(a) = \sigma(a)(1 - \sigma(a))$
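The derivative identity can be checked numerically. The following is a small sketch in plain Python; the helper `finite_difference` and the sample points are assumptions made for illustration only.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    # Reuses the forward value: sigma'(a) = sigma(a) * (1 - sigma(a)).
    s = sigmoid(a)
    return s * (1.0 - s)


def finite_difference(f, a, h=1e-6):
    # Central difference approximation of f'(a), used only as a check.
    return (f(a + h) - f(a - h)) / (2.0 * h)


for a in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(a, sigmoid_grad(a), finite_difference(sigmoid, a))
```

The gradient peaks at 0.25 for $a = 0$ and is close to zero at $a = \pm 10$, which is the saturation behaviour described in the card.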

New cards

Tanh (Hyperbolic Tangent) Function (eq)

A zero-centered sigmoidal curve that maps inputs to the range $(-1, 1)$, often yielding faster training convergence than the standard sigmoid:

$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

New cards

ReLU (Rectified Linear Unit) and its derivative (eq)

A highly efficient activation function that is linear for positive inputs and outputs zero for negative inputs, so a ReLU network computes a piecewise-linear function of its input:

$g(a) = \max(0, a) = a\,\mathbb{I}(a > 0)$

The derivative of the ReLU function, which acts as a simple switch (either 1 or 0):

$\text{ReLU}'(a) = \mathbb{I}(a > 0)$

New cards

Leaky ReLU (eq)

A variation of ReLU designed to prevent "dead neurons" by keeping a tiny, constant slope ($\epsilon$) when inputs are negative:

$g(z) = \max(\epsilon z, z) \quad \text{where } \epsilon \ll 1$
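A minimal sketch of ReLU and Leaky ReLU with their derivatives, in plain Python. The default slope `eps=0.01` is an assumed, commonly used value, and the derivative at exactly zero is a convention rather than a mathematical fact.

```python
def relu(a):
    return max(0.0, a)


def relu_grad(a):
    # Indicator I(a > 0); the value chosen at a == 0 is a convention.
    return 1.0 if a > 0 else 0.0


def leaky_relu(z, eps=0.01):
    # eps is the small negative-side slope (0.01 is an assumed default).
    return max(eps * z, z)


def leaky_relu_grad(z, eps=0.01):
    # The gradient never becomes exactly zero, so the unit cannot "die".
    return 1.0 if z > 0 else eps


for value in (-3.0, 0.0, 2.5):
    print(value, relu(value), relu_grad(value),
          leaky_relu(value), leaky_relu_grad(value))
```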

New cards

What are the 3 steps of the Neural Network Training Loop?

Forward Pass: Pass a batch of training data through the network layers to compute the output and evaluate its final loss $\mathcal{L}(y, y^*)$.
Backward Pass: Route the loss backward through the layers using the recursive chain rule to find partial derivatives for every weight.
Weight Update: Scale the derived gradients by a learning rate $(\alpha)$ to update the network weights, iterating until convergence.
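The three steps can be sketched on the smallest possible model. The example below is an illustration only: it fits a one-weight linear model with mean squared error, and the toy data, the learning rate `alpha = 0.05`, and the step count are all assumptions.

```python
# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0   # model parameters
alpha = 0.05      # learning rate


def train_step(w, b):
    n = len(xs)
    # Forward pass: predictions and mean squared error loss.
    preds = [w * x + b for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    # Backward pass: the chain rule gives dL/dw and dL/db.
    grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n
    # Weight update: step against the gradient, scaled by alpha.
    return w - alpha * grad_w, b - alpha * grad_b, loss


losses = []
for step in range(2000):
    w, b, loss = train_step(w, b)
    losses.append(loss)

print(w, b, losses[0], losses[-1])
```

In a real network the backward pass is produced by automatic differentiation rather than written by hand, but the loop structure is the same.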

New cards

Computational Graphs of DNN

Deep learning architectures are represented as computational graphs where nodes represent operations ( $+, *, -, \text{dot product}$ ) and edges pass tensors forward.

New cards

What is Forward-Mode Differentiation and why is it bad for NN?

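Propagates derivatives from the inputs toward the outputs, giving the derivative of every node with respect to one input per pass. A network has millions of inputs and parameters but a single scalar loss, so obtaining the full gradient this way would require one pass per parameter.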

New cards

What is Reverse-Mode Differentiation (Backpropagation)?

Factors error paths backward starting from the final loss node to track the derivative of the output relative to all internal nodes simultaneously.
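A minimal sketch of reverse-mode differentiation on a scalar computational graph, in plain Python. The `Node` class and `backward` function are hypothetical names invented for this example, and only addition and multiplication are supported.

```python
class Node:
    """A scalar in a computational graph that records how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local derivative d(self)/d(parent)).
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    # Order the graph so that every node is handled after all its consumers.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # One backward sweep accumulates d(output)/d(node) for every node.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x = Node(3.0)
w = Node(2.0)
b = Node(1.0)
u = x * w + b
loss = u * u          # loss = (x * w + b) ** 2
backward(loss)
print(x.grad, w.grad, b.grad)
```

A single sweep starting from `loss` fills in the gradient of every node, which is why one backward pass is enough regardless of how many parameters there are.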

New cards

Why is Reverse-Mode Differentiation highly efficient for deep learning?

Because the output dimension (a single scalar loss) is much smaller than the massive input/parameter dimension.

New cards

What is a Jacobian Matrix (J_f)? (eq)

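A matrix containing all first-order partial derivatives of a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, with entry $(i, j)$ equal to $\partial f_i / \partial x_j$: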

$\mathbf{J}_f(x) = \frac{\partial \mathbf{f}(x)}{\partial x} \in \mathbb{R}^{m \times n}$

New cards

How does backpropagation avoid the massive cost of computing full high-dimensional Jacobian matrices?

By evaluating Vector-Jacobian Products (VJPs) sequentially backward through the layers.
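As an illustration, the sketch below computes a vector-Jacobian product for a single layer $y = \tanh(Wx)$ without building the Jacobian, then compares it against the explicit matrix. The layer choice, the function names, and the numbers are assumptions made for this example.

```python
import math


def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def layer(W, x):
    return [math.tanh(a) for a in matvec(W, x)]


def layer_vjp(W, x, v):
    """Return v^T J for y = tanh(W x) without materialising J."""
    y = layer(W, x)
    # VJP through the element-wise tanh: its Jacobian is diagonal.
    u = [v_i * (1.0 - y_i ** 2) for v_i, y_i in zip(v, y)]
    # VJP through the linear map: u^T W.
    return [sum(u[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]


def full_jacobian(W, x):
    """Explicit m x n Jacobian, built here only to check the VJP."""
    y = layer(W, x)
    return [[(1.0 - y[i] ** 2) * W[i][j] for j in range(len(x))]
            for i in range(len(W))]


W = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
x = [0.2, -0.4, 0.1]
v = [1.0, -2.0]   # upstream gradient dL/dy

vjp = layer_vjp(W, x, v)
J = full_jacobian(W, x)
check = [sum(v[i] * J[i][j] for i in range(len(W))) for j in range(len(x))]
print(vjp)
print(check)
```

The VJP only ever stores vectors of size $m$ and $n$, whereas the explicit Jacobian needs $m \times n$ entries per layer.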

New cards

What is Cross-Entropy Loss, and what is its formula?

Quantifies the mismatch between the predicted and target probability distributions, typically paired with a sigmoid or softmax output layer. For a binary label $y$ with predicted class probabilities $p_0$ and $p_1$:

$\mathcal{L} = -[\mathbb{I}(y=0)\log p_0 + \mathbb{I}(y=1)\log p_1]$

For numerical stability against underflow/overflow, the log probabilities are computed with the log-sum-exp (LSE) trick.

New cards

What is the Log-Sum-Exp (LSE) trick and why is it used?

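It rewrites log-probability expressions by factoring out the largest logit before exponentiating, which prevents arithmetic overflow/underflow during loss calculations:

$\log \sum_i e^{a_i} = m + \log \sum_i e^{a_i - m}, \quad m = \max_i a_i$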
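A minimal sketch of the trick in plain Python, assuming a softmax output over raw logits (the multi-class setting and the function names are assumptions for illustration). Subtracting the maximum logit means the largest exponent is exactly zero, so `math.exp` never overflows.

```python
import math


def logsumexp(logits):
    # Shift by the maximum so the largest term is exp(0) = 1.
    m = max(logits)
    return m + math.log(sum(math.exp(a - m) for a in logits))


def log_softmax(logits):
    lse = logsumexp(logits)
    return [a - lse for a in logits]


def cross_entropy(logits, target_index):
    # Negative log probability assigned to the correct class.
    return -log_softmax(logits)[target_index]


# Calling math.exp(1000.0) directly raises OverflowError,
# but the shifted version handles these logits without trouble.
logits = [1000.0, 1001.0, 999.0]
print(logsumexp(logits))
print(cross_entropy(logits, 1))
```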

New cards

What is the Vanishing Gradient problem?

When weights are small in magnitude (< 1) or activation functions saturate, the repeated multiplications of the chain rule shrink gradients exponentially as they travel backward, leaving early layers effectively untrained.

New cards

What is the Exploding Gradient problem?

When weights are large in magnitude (> 1), the repeated multiplications of the chain rule grow gradients exponentially as they travel backward, causing wild training oscillations and instability.
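Both the vanishing and the exploding case come from repeated multiplication along the backward path. The sketch below uses a deliberately simplified scalar chain, an assumption made for illustration, in which every layer multiplies the gradient by `weight * activation_slope`.

```python
def backprop_scale(weight, depth, activation_slope=1.0):
    """Gradient magnitude after `depth` layers of a toy scalar chain."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * activation_slope
    return scale


print(backprop_scale(0.5, 10))        # weights below 1: gradient vanishes
print(backprop_scale(1.5, 50))        # weights above 1: gradient explodes
print(backprop_scale(1.0, 20, 0.25))  # sigmoid's maximum slope, 0.25, alone causes shrinkage
```

Real networks multiply by Jacobian matrices rather than scalars, so the relevant quantity is the size of those matrices, but the exponential dependence on depth is the same.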

New cards

What characterizes a Convex Optimization Surface?

Smooth, bowl-like surfaces where any local minimum is guaranteed to be the global minimum (e.g., standard linear regression).

New cards

What characterizes a Non-Convex Optimization Surface?

Jagged, wavy landscapes containing numerous local minima, saddle points, and plateaus typical of deep neural networks.

New cards

What are the trade-offs of Stochastic Gradient Descent (SGD)?

Pros: Fast updates, high noise helps jump out of local minima.
Cons: Highly erratic path, cannot exploit parallel hardware (GPU/TPU).

New cards

What are the trade-offs of Batch Gradient Descent?

Pros: Deterministic, stable gradient path.
Cons: Each update requires a pass over the full dataset, which becomes extremely slow and infeasible for large datasets.

New cards

What are the trade-offs of Mini-Batch Gradient Descent?

Pros: Balanced gradient path, heavily leverages GPU/TPU parallel computing.
Cons: Adds another hyperparameter to tune (batch size).

New cards

Give def of Adaptive Heuristic Optimizers and give examples

To move beyond basic, unstable gradient descent ( $\theta_{t+1} = \theta_t - \eta_t g_t$ ), adaptive algorithms dynamically tune learning rates for each separate parameter:

SGD
Adagrad
RMSprop
Adam

New cards

How does SGD with Momentum improve standard gradient descent?

It calculates an exponentially weighted moving average of past gradients to accelerate updates along consistent directions and dampen oscillations.

New cards

What is the mathematical update rule for SGD with Momentum?

$v_t = \gamma v_{t-1} + \eta_t \mathbf{g}_t, \quad \theta_{t+1} = \theta_t - v_t$

New cards

How does the Adagrad optimizer handle learning rates?

It scales each parameter's learning rate inversely to the square root of the sum of squares of all its historical gradients ($s_t = \sum_{\tau=1}^{t} g_\tau^2$), making it well suited to sparse features.

New cards

What is the main drawback of Adagrad?

The learning rate drops off too quickly over time, causing the model to stop learning before converging.

New cards

How does RMSprop improve upon Adagrad?

It substitutes a simple accumulation with an exponentially decaying average of squared gradients to handle non-stationary, noisy objectives.

New cards

How does the Adam optimizer combine momentum and adaptive learning rates?

By dynamically evaluating both the decaying mean (1st moment) and uncentered variance (2nd moment) of the gradients.
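The four update rules can be compared side by side on a one-parameter problem. This is a sketch under stated assumptions: the objective $f(\theta) = \theta^2$, every hyperparameter value, and the small constant `eps` added for numerical safety are choices made for this example, and the Adam variant shown uses the usual bias-corrected moment estimates.

```python
import math


def grad(theta):
    # Gradient of the toy objective f(theta) = theta ** 2.
    return 2.0 * theta


def run_momentum(theta, steps, eta=0.1, gamma=0.9):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)   # moving average of past gradients
        theta -= v
    return theta


def run_adagrad(theta, steps, eta=0.5, eps=1e-8):
    s = 0.0
    for _ in range(steps):
        g = grad(theta)
        s += g * g                          # sum of all squared gradients
        theta -= eta * g / (math.sqrt(s) + eps)
    return theta


def run_rmsprop(theta, steps, eta=0.01, beta=0.9, eps=1e-8):
    s = 0.0
    for _ in range(steps):
        g = grad(theta)
        s = beta * s + (1.0 - beta) * g * g  # decaying average instead of a sum
        theta -= eta * g / (math.sqrt(s) + eps)
    return theta


def run_adam(theta, steps, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g        # 1st moment (mean)
        v = beta2 * v + (1.0 - beta2) * g * g    # 2nd moment (uncentered variance)
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta


for name, fn in [("momentum", run_momentum), ("adagrad", run_adagrad),
                 ("rmsprop", run_rmsprop), ("adam", run_adam)]:
    print(name, fn(5.0, 2000))
```

With a constant learning rate, RMSprop tends to hover near the minimum rather than settle exactly on it, because its step size does not shrink as the gradient does.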

New cards

What is the formula for a discrete 2D image convolution?

$S(i,j) = (I * K)(i,j) = \sum_{m} \sum_{n} I(i-m, j-n)K(m,n)$
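A direct, unoptimised sketch of this formula in plain Python, restricted to positions where the kernel fully overlaps the image ("valid" output, an assumption made to avoid choosing a padding scheme). The index flip is what distinguishes true convolution from cross-correlation, which is the operation most deep learning libraries actually compute.

```python
def conv2d_valid(image, kernel):
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            total = 0.0
            for m in range(kh):
                for n in range(kw):
                    # Flipped indexing implements I(i - m, j - n) * K(m, n),
                    # with the index origin shifted so all reads stay in bounds.
                    total += image[i + kh - 1 - m][j + kw - 1 - n] * kernel[m][n]
            row.append(total)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_valid(image, kernel))
```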

New cards

What does a Sobel Filter calculate to detect edges?

It approximates the first-order image gradient vector $\nabla I = [\partial I/\partial x, \partial I/\partial y]^{T}$ to find abrupt changes in pixel intensity.

New cards

What is the matrix for a Horizontal Sobel Filter (G_x)?

$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$

Detects vertical edges by calculating the horizontal gradient derivative.

New cards

What is the matrix for a Vertical Sobel Filter (G_y)?

$G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$

Detects horizontal edges by calculating the vertical gradient derivative.

New cards

What is the formula for Gradient Magnitude in edge detection?

$|G| = \sqrt{G_x^2 + G_y^2}$

The two directional responses are combined to give the total edge strength at each pixel location.
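A small sketch of Sobel edge strength on a toy image, in plain Python. It applies the two 3x3 filters as sliding-window weighted sums (cross-correlation, without flipping the kernel) over interior pixels only; both choices are assumptions for this example, and flipping would change only the sign of each response, not the magnitude.

```python
import math

GX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]
GY = [[-1, -2, -1],
      [0, 0, 0],
      [1, 2, 1]]


def response_at(image, kernel, i, j):
    """Weighted sum of the 3x3 neighbourhood centred at (i, j)."""
    return sum(kernel[m][n] * image[i - 1 + m][j - 1 + n]
               for m in range(3) for n in range(3))


def sobel_magnitude(image):
    H, W = len(image), len(image[0])
    out = []
    for i in range(1, H - 1):          # skip the border pixels
        row = []
        for j in range(1, W - 1):
            gx = response_at(image, GX, i, j)
            gy = response_at(image, GY, i, j)
            row.append(math.sqrt(gx * gx + gy * gy))
        out.append(row)
    return out


# Toy image with a vertical edge between the second and third columns.
image = [[0, 0, 10, 10] for _ in range(4)]
print(sobel_magnitude(image))
```

For this image only the horizontal filter responds, since intensity changes from left to right but not from top to bottom.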

New cards

What is the Laplacian Operator (∇²I) in image filtering? And the matrix representation?

An isotropic (rotation-invariant) operator that computes the second spatial derivative of an image to highlight rapid intensity changes:

$\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}$
$K_{\text{Laplacian}} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad \text{or alternative form:} \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$

New cards

Why are standard MLPs poorly suited for computer vision compared to CNNs?

MLPs do not inherently respect spatial dimensions or structure, whereas CNNs enforce spatial constraints.

New cards

What is Weight Sharing in CNNs?

Convolving the exact same filter weights across the entire input space because equivalent visual patterns (like edges) occur across different positions.

New cards

What is a Filter Hierarchy in deep learning?

The progression where early layers isolate local features (edges), which deeper layers aggregate into abstract global representations (shapes, objects).

New cards

What do Padding and Stride control in a convolution operation?

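Padding adds a virtual border (typically zeros) around the input so the filter can be centred on edge pixels and the output does not shrink; stride sets the step size of the filter as it moves across the image, so a stride greater than 1 downsamples the output.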
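The effect of both settings on the output size follows the standard relation $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$ and stride $s$ along one dimension. A minimal helper, with a hypothetical function name chosen for this example:

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one spatial dimension."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(32, 3))                       # no padding: output shrinks
print(conv_output_size(32, 3, padding=1))            # padding 1 keeps the size
print(conv_output_size(32, 3, padding=1, stride=2))  # stride 2 halves it
```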

New cards

What is the purpose of Pooling layers in CNNs?

They downsample feature maps to introduce spatial invariance to small shifts and distortions (e.g., Max Pooling, Average Pooling).
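A minimal max-pooling sketch in plain Python, showing the shift invariance on a toy feature map. The window size, stride, and example arrays are assumptions for illustration.

```python
def max_pool2d(fmap, size=2, stride=2):
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, W - size + 1, stride)]
            for i in range(0, H - size + 1, stride)]


# The same activation, shifted by one pixel within a pooling window.
a = [[9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
b = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

print(max_pool2d(a))
print(max_pool2d(b))
```

Both inputs pool to the same 2x2 map, although the invariance only holds for shifts that stay inside one pooling window.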

New cards

What structural problem do Residual Connections fix?

The degradation problem, where accuracy saturates and degrades as networks get extremely deep.

New cards

What is the mathematical formulation of a Residual Block?

$\text{Output} = \mathcal{F}(x) + x$, where the layers learn a residual mapping $\mathcal{F}(x) = \mathcal{H}(x) - x$ instead of the full underlying mapping $\mathcal{H}(x)$.

New cards

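Why do Residual Connections mitigate the vanishing gradient problem?

Gradients can flow directly backward through the identity shortcut paths without being altered or shrunk by weight matrices: since $\partial(\mathcal{F}(x) + x)/\partial x = \partial\mathcal{F}(x)/\partial x + \mathbf{I}$, the identity term survives even when $\partial\mathcal{F}/\partial x$ is small.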
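The effect of the shortcut on gradient flow can be illustrated with a toy scalar stack, an assumption made for this sketch, in which each layer has a fixed local slope $s = \partial\mathcal{F}/\partial x$. Without skip connections the backward gradient is the product of the slopes; with them, each factor becomes $s + 1$.

```python
def plain_chain_grad(layer_slopes):
    """d(output)/d(input) for a stack of scalar layers h -> F(h)."""
    g = 1.0
    for s in layer_slopes:
        g *= s
    return g


def residual_chain_grad(layer_slopes):
    """Same stack with skip connections h -> F(h) + h."""
    g = 1.0
    for s in layer_slopes:
        g *= (s + 1.0)   # the "+ 1" is the identity shortcut
    return g


slopes = [0.01] * 30     # 30 layers with very small local slopes
print(plain_chain_grad(slopes))
print(residual_chain_grad(slopes))
```

The plain stack's gradient collapses toward zero, while the residual stack's gradient stays on the order of 1 because the identity path contributes a constant term at every layer.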