The Machine Learning Flow Steps
Define Task
Represent Data
Select Data
Select Metrics
Develop Machine Learning Model
The Baseline Equation of Neural Networks
A fundamental equation used to map combinations of basic features hierarchically from low-level edges to high-level concepts, mathematically represented as:
$y = f(x \cdot W) = \tanh\left(\sum_{k=1}^{n} x_k w_k\right)$
Generalized Linear Model (GLM) (eq)
A framework that models the output probability with an exponential-family distribution whose natural parameter $\eta_n$ is a linear combination of the inputs:
$p(y_n \mid x_n, w, \sigma^2) = \exp\left[\frac{y_n \eta_n - A(\eta_n)}{\sigma^2} + \log h(y_n, \sigma^2)\right]$
Basis Function (eq)
A method that replaces raw input vectors with transformed representations ($\phi$) that possess their own unique tuning parameters ($\theta_2$):
$f(x; \theta) = W \phi(x; \theta_2) + b$
Complex Function Composition (eq)
The mechanism by which deep neural networks stack transformations recursively over $L$ layers to form highly complex, composite mathematical functions:
$f(x; \theta) = f_L(f_{L-1}(\cdots(f_1(x))\cdots))$
Perceptron (eq)
A deterministic binary classifier that takes an input vector, computes a linear combination of those inputs and their weights, and passes the result through a step function.
The mathematical representation of a perceptron using the Heaviside step function ($H$) or indicator function ($\mathbb{I}$):
$f(x; \theta) = \mathbb{I}(w^\top x + b \ge 0) = H(w^\top x + b)$
A single perceptron can only learn linear decision boundaries. Because of this, it fails entirely at non-linearly separable tasks, famously illustrated by the XOR problem.
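As a rough illustration of the perceptron card, the sketch below applies the step function to a linear score. The weights `AND_W` and bias `AND_B` are hand-picked, hypothetical values (not from the deck) that happen to implement logical AND, a linearly separable task.

```python
def perceptron(x, w, b):
    """Heaviside step applied to the linear score w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

# Hypothetical hand-picked parameters that make the unit behave like AND.
AND_W, AND_B = [1.0, 1.0], -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(x, AND_W, AND_B) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
```

No single choice of `w` and `b` produces the XOR pattern `[0, 1, 1, 0]`, because no straight line separates its positive and negative points.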
Multilayer Perceptron (MLP) (def)
A feedforward neural network formed by the recursive recombination of multiple layers of neurons. Stacking these layers allows the network to overcome the linear limitations of a single perceptron.
The method by which an MLP solves the non-linear XOR problem: it uses a hidden layer where one neuron ($h_1$) acts as an AND gate and another ($h_2$) acts as an OR gate, which are then combined at the output ($y_1$).
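One possible way to wire this up is sketched below. The thresholds are hand-picked assumptions, and the output unit is assumed to fire when the OR unit is on but the AND unit is off; the card only says the two hidden units are "combined".

```python
def step(a):
    """Heaviside step: 1 for non-negative input, else 0."""
    return 1 if a >= 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 1.5)     # hidden unit h1: AND gate
    h2 = step(x1 + x2 - 0.5)     # hidden unit h2: OR gate
    return step(h2 - h1 - 0.5)   # output y1: OR and not AND

xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```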
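What is the Universal Function Approximation property of an MLP?
A core theorem stating that an MLP with a non-linear activation and enough hidden units can approximate any continuous function on a bounded input domain to arbitrary accuracy. It guarantees that suitable weights exist, not that training will find them.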
What is the requirement to train an MLP via gradient descent?
All activation functions must be differentiable, which is why the Heaviside step function cannot be used for backpropagation: it is discontinuous at zero and flat (zero gradient) everywhere else.
Activation functions (df + ex)
Mathematical functions added to neurons to introduce the non-linearity required to learn complex relationships. To allow for gradient-based optimization, they must be differentiable and computationally efficient to evaluate.
Heaviside Step Function
Sigmoid Function
Tanh (Hyperbolic Tangent) Function
ReLU (Rectified Linear Unit)
Leaky ReLU
Heaviside Step Function (eq)
A non-differentiable binary threshold function that outputs 0 for negative inputs and 1 for non-negative inputs:
$f(x; \theta) = \mathbb{I}(w^\top x + b \ge 0) = H(w^\top x + b)$
Sigmoid Function and its derivative (eq)
An S-shaped activation function that maps inputs to a continuous $(0, 1)$ range. While it mimics biological neural spike thresholds, it suffers from severe saturation (vanishing gradients) at its outer bounds.
$\sigma(a) = \frac{1}{1 + e^{-a}}$
The derivative of the sigmoid function, highly efficient to compute because it relies on the output of the function itself:
$\sigma'(a) = \sigma(a)(1 - \sigma(a))$
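A quick numerical check of the derivative identity, as a sketch: `numeric_grad` is a hypothetical helper that uses a central finite difference, and the test points are arbitrary.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    """Derivative computed from the function's own output: s * (1 - s)."""
    s = sigmoid(a)
    return s * (1.0 - s)

def numeric_grad(f, a, h=1e-6):
    """Central finite-difference estimate of f'(a)."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

for a in (-10.0, 0.0, 0.3, 10.0):
    print(a, sigmoid_grad(a), numeric_grad(sigmoid, a))
```

The gradient peaks at 0.25 for `a = 0` and is tiny at `a = ±10`, which is the saturation the card describes.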
Tanh (Hyperbolic Tangent) Function (eq)
A zero-centered sigmoidal curve that maps inputs to the range $(-1, 1)$, often yielding faster training convergence than the standard sigmoid:
$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
ReLU (Rectified Linear Unit) and its derivative (eq)
A highly efficient activation function that acts linearly for positive inputs and outputs zero for negative inputs, so a ReLU network computes a piecewise-linear function of its input:
$\text{ReLU}(a) = \max(0, a) = a\,\mathbb{I}(a > 0)$
The derivative of the ReLU function, which acts as a simple switch (either 1 or 0):
$\text{ReLU}'(a) = \mathbb{I}(a > 0)$
Leaky ReLU (eq)
A variation of ReLU designed to prevent "dead neurons" by keeping a tiny, constant slope ($\epsilon$) when inputs are negative:
$g(z) = \max(\epsilon z, z)$, where $0 < \epsilon \ll 1$
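The ReLU and Leaky ReLU cards can be sketched side by side. The slope `eps=0.01` is an assumed, commonly used value; the deck only requires it to be much smaller than 1. The gradient at exactly zero is set by convention here.

```python
def relu(a):
    return max(0.0, a)

def relu_grad(a):
    """Acts as a switch: 1 for positive inputs, 0 otherwise."""
    return 1.0 if a > 0 else 0.0

def leaky_relu(z, eps=0.01):
    """max(eps * z, z): identity for z > 0, small slope eps for z < 0."""
    return max(eps * z, z)

def leaky_relu_grad(z, eps=0.01):
    return 1.0 if z > 0 else eps

for z in (-2.0, 3.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For a negative input, ReLU passes back a gradient of exactly 0 (the "dead neuron" case), while Leaky ReLU still passes back `eps`.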
What are the 3 steps of the Neural Network Training Loop?
Forward Pass: Pass a batch of training data through the network layers to compute the output and evaluate its final loss $\mathcal{L}(y, y^*)$.
Backward Pass: Route the loss backward through the layers using the recursive chain rule to find partial derivatives for every weight.
Weight Update: Scale the derived gradients by a learning rate $(\alpha)$ to update the network weights, then repeat all three steps until convergence.
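The three steps can be sketched on the smallest possible "network": a single linear unit `w * x + b` trained with mean squared error. The toy data (generated from `y = 2x + 1`), the learning rate, and the step count are all assumptions chosen for illustration, and the gradients are written out by hand rather than computed by a framework.

```python
# Toy dataset, assumed for illustration: targets follow y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = 0.0, 0.0
alpha = 0.05  # learning rate
losses = []

for step in range(2000):
    # 1. Forward pass: predictions and mean squared error loss.
    preds = [w * x + b for x, _ in data]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    losses.append(loss)

    # 2. Backward pass: chain rule gives dL/dw and dL/db.
    grad_w = sum(2.0 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    grad_b = sum(2.0 * (p - y) for p, (_, y) in zip(preds, data)) / len(data)

    # 3. Weight update: step against the gradient, scaled by alpha.
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b, losses[0], losses[-1])
```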
Computational Graphs of DNN
Deep learning architectures are represented as computational graphs where nodes represent operations (e.g., $+$, $-$, $\times$, dot product) and edges pass tensors forward.
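What is Forward-Mode Differentiation and why is it bad for neural networks?
Computes derivatives from the inputs upward, one input at a time, giving the derivative of every node with respect to that single input. A network has a huge number of parameters but a single scalar loss, so it would need one full sweep per parameter (and summing over every input-to-output path explicitly blows up combinatorially).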
What is Reverse-Mode Differentiation (Backpropagation)?
Factors error paths backward starting from the final loss node to track the derivative of the output relative to all internal nodes simultaneously.
Why is Reverse-Mode Differentiation highly efficient for deep learning?
Because the output dimension (a single scalar loss) is much smaller than the massive input/parameter dimension.
What is a Jacobian Matrix (J_f)? (eq)
A matrix containing all first-order partial derivatives of a vector-valued function.
$J_f(x) = \frac{\partial f(x)}{\partial x} \in \mathbb{R}^{m \times n}$
How does backpropagation avoid the massive cost of computing full high-dimensional Jacobian matrices?
By evaluating Vector-Jacobian Products (VJPs) sequentially backward through the layers.
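A minimal sketch of the idea, assuming a hypothetical two-stage function `loss(x) = w2 . tanh(W1 x)` with made-up weights. The backward sweep carries one row vector `u` through each layer and never builds a full Jacobian; the `tanh` Jacobian is diagonal, so its VJP is an element-wise product.

```python
import math

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def vec_mat(u, W):
    """Vector-Jacobian product u^T W for a linear layer with weights W."""
    return [sum(u[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

# Hypothetical weights, chosen only for illustration.
W1 = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]]
w2 = [1.5, -2.0]

def loss(x):
    h = [math.tanh(a) for a in matvec(W1, x)]
    return sum(wi * hi for wi, hi in zip(w2, h))

def grad_via_vjps(x):
    h = [math.tanh(a) for a in matvec(W1, x)]          # forward pass
    u = list(w2)                                       # dL/dh
    u = [ui * (1.0 - hi ** 2) for ui, hi in zip(u, h)] # VJP through tanh
    return vec_mat(u, W1)                              # VJP through W1: dL/dx

x0 = [0.2, -0.1, 0.4]
g = grad_via_vjps(x0)
print(g)
```

Each backward step costs about as much as the matching forward step, which is why the whole gradient comes out in a single sweep.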
What is Cross-Entropy Loss, and what is its formula?
Quantifies the mismatch between the predicted and target probability distributions, and is typically paired with a sigmoid or softmax output layer. For a binary label:
$\mathcal{L} = -\left[\mathbb{I}(y=0)\log p_0 + \mathbb{I}(y=1)\log p_1\right]$
For numerical stability against underflow/overflow, the loss is evaluated from the logits using the log-sum-exp (LSE) trick.
What is the Log-Sum-Exp (LSE) trick and why is it used?
It rewrites log probability expressions to prevent arithmetic overflow/underflow during loss calculations.
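A sketch of the trick, assuming the loss is computed from raw logits as `logsumexp(logits) - logits[target]` (the negative log of the softmax probability of the target class). Subtracting the maximum logit before exponentiating keeps every `exp` argument at or below zero.

```python
import math

def logsumexp(logits):
    """log(sum(exp(z))) computed stably by factoring out the largest logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], without ever forming the probabilities."""
    return logsumexp(logits) - logits[target]

big = [1000.0, 0.0]
print(cross_entropy_from_logits(big, 0))  # 0.0
print(cross_entropy_from_logits(big, 1))  # 1000.0
```

The naive version would call `math.exp(1000.0)`, which raises `OverflowError` in Python.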
What is the Vanishing Gradient problem?
When weights have small magnitude ($< 1$) or activation functions saturate, the gradient is multiplied by a factor below 1 at every layer, so it shrinks exponentially on the way back and leaves the early layers nearly untrained.
What is the Exploding Gradient problem?
When weights have large magnitude ($> 1$), the gradient is multiplied by a factor above 1 at every layer, so it grows exponentially on the way back, causing wild training oscillations and instability.
What characterizes a Convex Optimization Surface?
Smooth, bowl-like surfaces where any local minimum is guaranteed to be the global minimum (e.g., standard linear regression).
What characterizes a Non-Convex Optimization Surface?
Jagged, wavy landscapes containing numerous local minima, saddle points, and plateaus typical of deep neural networks.
What are the trade-offs of Stochastic Gradient Descent (SGD)?
Pros: Fast updates, high noise helps jump out of local minima.
Cons: Highly erratic path, cannot exploit parallel hardware (GPU/TPU).
What are the trade-offs of Batch Gradient Descent?
Pros: Deterministic, stable gradient path.
Cons: Computations become incredibly slow and unfeasible for large datasets.
What are the trade-offs of Mini-Batch Gradient Descent?
Pros: Balanced gradient path, heavily leverages GPU/TPU parallel computing.
Cons: Adds another hyperparameter to tune (batch size).
Define adaptive heuristic optimizers and give examples
To move beyond basic, unstable gradient descent ($\theta_{t+1} = \theta_t - \eta_t g_t$), these algorithms use the history of past gradients to smooth the update or to tune the learning rate separately for each parameter:
SGD with Momentum
Adagrad
RMSprop
Adam
How does SGD with Momentum improve standard gradient descent?
It calculates an exponentially weighted moving average of past gradients to accelerate updates along consistent directions and dampen oscillations.
What is the mathematical update rule for SGD with Momentum?
$v_t = \gamma v_{t-1} + \eta_t g_t$, followed by the parameter step $\theta_{t+1} = \theta_t - v_t$
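A small sketch of the velocity update, assuming the velocity is then subtracted from the parameter (`theta = theta - v`) and using a toy objective `theta ** 2` with gradient `2 * theta`. The values of `lr` and `gamma` are assumptions for illustration.

```python
def momentum_descent(grad, theta, lr=0.1, gamma=0.9, steps=2):
    """Run a few SGD-with-momentum steps and return the visited parameters."""
    v = 0.0
    path = [theta]
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)  # decaying average of past gradients
        theta = theta - v                 # assumed parameter step
        path.append(theta)
    return path

path = momentum_descent(lambda t: 2.0 * t, theta=1.0)
print(path)  # approximately [1.0, 0.8, 0.46]
```

The second step (0.34) is larger than the first (0.2) even though the gradient has shrunk, because the velocity carries over the earlier direction.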
How does the Adagrad optimizer handle learning rates?
It adapts learning rates inversely proportional to the square root of the sum of squares of all historical gradients ($s_t = \sum_{\tau=1}^{t} g_\tau^2$), making it ideal for sparse features.
What is the main drawback of Adagrad?
The learning rate drops off too quickly over time, causing the model to stop learning before converging.
How does RMSprop improve upon Adagrad?
It substitutes a simple accumulation with an exponentially decaying average of squared gradients to handle non-stationary, noisy objectives.
How does the Adam optimizer combine momentum and adaptive learning rates?
By dynamically evaluating both the decaying mean (1st moment) and uncentered variance (2nd moment) of the gradients.
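The three adaptive rules can be compared as single-parameter update steps. This is a sketch using the commonly published forms; the hyperparameter defaults (`lr`, `beta`, `b1`, `b2`, `eps`) and the bias-correction step in Adam are assumptions that the cards do not spell out.

```python
import math

def adagrad_step(theta, g, s, lr=0.1, eps=1e-8):
    s = s + g * g                          # running sum of squared gradients
    return theta - lr * g / (math.sqrt(s) + eps), s

def rmsprop_step(theta, g, s, lr=0.1, beta=0.9, eps=1e-8):
    s = beta * s + (1.0 - beta) * g * g    # decaying average instead of a sum
    return theta - lr * g / (math.sqrt(s) + eps), s

def adam_step(theta, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1.0 - b1) * g            # 1st moment (decaying mean)
    v = b2 * v + (1.0 - b2) * g * g        # 2nd moment (uncentered variance)
    m_hat = m / (1.0 - b1 ** t)            # bias correction, t starts at 1
    v_hat = v / (1.0 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Two Adagrad steps with the same gradient: the second step is smaller.
t1, s1 = adagrad_step(1.0, 2.0, 0.0)
t2, s2 = adagrad_step(t1, 2.0, s1)
print(1.0 - t1, t1 - t2)
```

The shrinking Adagrad step shows the drawback noted above: the accumulated sum only grows, so the effective learning rate only falls.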
What is the formula for a discrete 2D image convolution?
$S(i,j) = (I * K)(i,j) = \sum_{m}\sum_{n} I(i-m, j-n)\,K(m,n)$
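A direct, unoptimized sketch of this sum for a "valid" output (no padding), using nested lists. The index offset `kh - 1`, `kw - 1` is an implementation assumption that keeps `i - m` and `j - n` inside the image; the sample image and kernel are made up.

```python
def conv2d_valid(image, kernel):
    """True 2D convolution (kernel flipped), output only where the kernel fits."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            total = 0
            for m in range(kh):
                for n in range(kw):
                    # I(i - m, j - n) * K(m, n), shifted to stay in bounds
                    total += image[i + kh - 1 - m][j + kw - 1 - n] * kernel[m][n]
            row.append(total)
        out.append(row)
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d_valid(image, kernel))  # [[4, 4], [4, 4]]
```

Many deep learning libraries skip the flip and compute cross-correlation instead; for this kernel that would give `-4` in every position.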
What does a Sobel Filter calculate to detect edges?
It computes the first-order image gradient vector $\nabla I = \left[\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right]^\top$ to find abrupt changes in pixel intensity.
What is the matrix for a Horizontal Sobel Filter (G_x)?
$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
Detects vertical edges by calculating the horizontal derivative ($\partial I/\partial x$).
What is the matrix for a Vertical Sobel Filter (G_y)?
$G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$
Detects horizontal edges by calculating the vertical derivative ($\partial I/\partial y$).
What is the formula for Gradient Magnitude in edge detection?
$|G| = \sqrt{G_x^2 + G_y^2}$
The two filter responses are combined to give the total edge strength at each pixel location.
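A sketch of the Sobel responses on a single 3x3 patch, assuming the standard Sobel kernels and applying them by element-wise multiply-and-sum (cross-correlation, no kernel flip). The patches are made-up examples of a vertical and a horizontal edge.

```python
import math

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal derivative
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical derivative

def filter_response(patch, kernel):
    """Multiply a 3x3 patch by a 3x3 kernel element-wise and sum."""
    return sum(patch[r][c] * kernel[r][c] for r in range(3) for c in range(3))

def edge_strength(patch):
    gx = filter_response(patch, GX)
    gy = filter_response(patch, GY)
    return gx, gy, math.sqrt(gx * gx + gy * gy)

vertical_edge = [[0, 0, 10], [0, 0, 10], [0, 0, 10]]
horizontal_edge = [[0, 0, 0], [0, 0, 0], [10, 10, 10]]
print(edge_strength(vertical_edge))    # (40, 0, 40.0)
print(edge_strength(horizontal_edge))  # (0, 40, 40.0)
```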
What is the Laplacian Operator (∇²I) in image filtering? And the matrix representation?
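An isotropic (rotation-invariant) operator that computes the second spatial derivative of an image to highlight rapid intensity changes:
$\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}$
$K_{\text{Laplacian}} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$, or the alternative form that also includes the diagonal neighbors: $\begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$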
Why are standard MLPs poorly suited for computer vision compared to CNNs?
MLPs do not inherently respect spatial dimensions or structure, whereas CNNs enforce spatial constraints.
What is Weight Sharing in CNNs?
Convolving the exact same filter weights across the entire input space because equivalent visual patterns (like edges) occur across different positions.
What is a Filter Hierarchy in deep learning?
The progression where early layers isolate local features (edges), which deeper layers aggregate into abstract global representations (shapes, objects).
What do Padding and Stride control in a convolution operation?
Padding adds a border of extra pixels (usually zeros) around the input so the output does not shrink at the edges; stride sets the step size with which the filter moves across the image.
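Their combined effect on the output size along one dimension can be sketched with the usual formula $\lfloor (n + 2p - k)/s \rfloor + 1$, for input size $n$, kernel size $k$, padding $p$ and stride $s$. The function name and the sample sizes are illustrative assumptions.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one dimension of a convolution."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(32, 3))                       # 30: no padding shrinks the map
print(conv_output_size(32, 3, padding=1))            # 32: padding preserves the size
print(conv_output_size(32, 3, padding=1, stride=2))  # 16: stride 2 halves it
```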
What is the purpose of Pooling layers in CNNs?
They downsample feature maps to introduce spatial invariance to small shifts and distortions (e.g., Max Pooling, Average Pooling).
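A minimal max-pooling sketch on nested lists, assuming a 2x2 window with stride 2. The two toy feature maps contain the same activation shifted by one pixel and pool to the same result, which is the small-shift invariance the card mentions.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Take the maximum over each size x size window of a 2D feature map."""
    rows = range(0, len(fmap) - size + 1, stride)
    cols = range(0, len(fmap[0]) - size + 1, stride)
    return [[max(fmap[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in cols] for r in rows]

fmap_a = [[9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
fmap_b = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(max_pool2d(fmap_a))  # [[9, 0], [0, 0]]
print(max_pool2d(fmap_b))  # [[9, 0], [0, 0]]
```

The invariance only holds while the shift stays inside one pooling window.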
What structural problem do Residual Connections fix?
The degradation problem, where accuracy saturates and degrades as networks get extremely deep.
What is the mathematical formulation of a Residual Block?
$\text{Output} = F(x) + x$, where the layers learn a residual mapping $F(x) = H(x) - x$ instead of the full underlying mapping $H(x)$.
Why do Residual Connections mitigate the vanishing gradient problem?
Gradients can flow directly backward through the identity shortcut paths without being altered or shrunk by weight matrices.
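A scalar toy model makes the effect visible. Assume each layer's branch is simply $F(h) = w h$ with a small weight `w` (a deliberate simplification, not a real network). A plain layer then scales the backward gradient by `w`, while a residual layer scales it by `1 + w` because of the identity shortcut.

```python
def plain_gradient(w, depth):
    """d(output)/d(input) through `depth` plain layers h -> w * h."""
    g = 1.0
    for _ in range(depth):
        g *= w
    return g

def residual_gradient(w, depth):
    """d(output)/d(input) through `depth` residual layers h -> h + w * h."""
    g = 1.0
    for _ in range(depth):
        g *= 1.0 + w   # the "+ 1" comes from the identity shortcut
    return g

print(plain_gradient(0.01, 20))     # about 1e-40: vanished
print(residual_gradient(0.01, 20))  # about 1.22: preserved
```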