Deep Learning Notes
Classification Tasks in Computer Vision
Classification:
- Identifies the category of an object in an image.
- Example: Determining if an image contains a cat.
Localization:
- Identifies the location of an object in an image.
Object Detection:
- Combines classification and localization.
- Identifies multiple objects in an image and their locations.
- Example: Detecting cats and dogs in the same image.
Segmentation Tasks in Computer Vision
Semantic Segmentation:
- Partitions an image into regions and classifies each region.
- Example: Identifying and labeling all pixels belonging to a person, tree, grass, or sky.
Instance Segmentation:
- Similar to semantic segmentation but distinguishes between different instances of the same object.
Panoptic Segmentation:
- Combines semantic and instance segmentation.
Regression
- Regression:
- Predicts a continuous value.
- Example: Fitting a line to data points.
- Linear Regression Example: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.06-Linear-Regression.ipynb#scrollTo=46EbP7zkaBg-
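The linked notebook covers this in depth; as a minimal stand-alone sketch (with made-up data, not the notebook's), a least-squares line fit in NumPy:

```python
import numpy as np

# Hypothetical data: points near the line y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.shape)

# Fit a line y = m*x + c by least squares (degree-1 polynomial)
m, c = np.polyfit(x, y, deg=1)
print(f"slope={m:.2f}, intercept={c:.2f}")  # recovers slope ~2, intercept ~1
```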
Machine Translation
- Uses an encoder-decoder architecture to translate text from one language to another.
- Encoder: processes the input sequence.
- Decoder: generates the translated output sequence.
- Example: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/12%20Translate%20text%20between%20languages.ipynb#scrollTo=K_UnAZQpetM8
Transcription
- Transcribes audio into text.
- Example: https://cloud.google.com/vision/docs/drag-and-drop
Generation
- Generates new content, such as images, based on a description.
- Example: DALL·E generating an image of a monkey paying taxes.
Model Capacity
Definition: Model complexity; the ability of a model to express complex patterns or relationships.
High Capacity Model:
- Large number of parameters.
- High chance of memorizing training data if trained long enough.
Low Capacity Model:
- Low memorization capability.
- Can be trained effectively on fewer samples.
- Fewer parameters.
- Reference: https://colab.research.google.com/drive/1YrgesNqXi1UBrSoJ3P-4HrTopy1Kmrce#scrollTo=A9bh1rcgV7W_
Train, Test, and Validation Sets
- Need for Validation:
- To assess generalization ability: how well the model performs on previously unobserved inputs.
- Reference: "JK NITPY Validation and Test Sets.ipynb" - Colaboratory
Hyperparameters vs. Parameters
Parameters:
- Learned during the machine learning process.
- Examples: Weights, bias.
Hyperparameters:
- Manually specified.
- Reference: https://colab.research.google.com/drive/1YrgesNqXi1UBrSoJ3P-4HrTopy1Kmrce#scrollTo=A9bh1rcgV7W_; "JK NITPY Hyper-Parameters.ipynb" - Colaboratory
Bias-Variance Trade-off
Example: Predicting height from weight.
- (Figure: blue points are training samples, green points are testing samples, and an arrow marks the true reference line.)
Method 1: Linear Regression (Fit a Line)
- High Bias: a straight line cannot replicate the true reference relationship.
- The inability of a model to capture the true relationship between data points is called bias.
Method 2: Polynomial Regression (Fit a Curve)
- Low Bias: a flexible curve can capture the true relationship between weight and height.
Performance Comparison (Training Set)
- Linear Regression: High Error.
- Polynomial Regression: Low, ~0 Error.
- Polynomial Regression wins on training data.
Performance Comparison (Test Set)
- Linear Regression: Moderate Error.
- Polynomial Regression: High Error.
- Linear Regression wins on test data.
Variance:
- Difference in fits between training and testing data.
Bias and Variance:
- High Bias, Low Variance: Underfitting.
- Low Bias, High Variance: Overfitting.
Solutions:
- Underfitting: Increase Model Capacity.
- Overfitting: Decrease Model Capacity.
Ideal ML Model: Low Bias and Low Variance.
Optimum Capacity Model.
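The trade-off above can be reproduced in a small sketch (synthetic data standing in for the weight/height example; the degrees and noise level are assumptions): a degree-9 polynomial nearly memorizes 10 training points, while a line retains some training error but behaves more stably.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the weight/height data: a linear trend plus noise
x_train = np.linspace(0, 1, 10)
y_train = 1.5 * x_train + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 1.5 * x_test + rng.normal(0, 0.2, size=x_test.shape)

def fit_and_errors(degree):
    # Fit a polynomial of the given degree, return (train MSE, test MSE)
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

lin_train, lin_test = fit_and_errors(1)    # high bias, low variance
poly_train, poly_test = fit_and_errors(9)  # low bias, high variance
```

The degree-9 fit passes through every training point (training error near zero) yet typically does worse on held-out points, which is exactly the variance described above.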
K-Fold Cross-Validation
- Dataset divided into k folds.
- k iterations of training and validation.
- Training set: 90% of the data (training folds).
- Validation set: 10% of the data (validation fold).
- Errors are calculated for each iteration.
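A minimal index-splitting sketch of the k = 10 scheme described above (NumPy only; the 100-sample dataset is hypothetical):

```python
import numpy as np

k = 10
data_idx = np.arange(100)      # indices of a hypothetical 100-sample dataset
folds = np.array_split(data_idx, k)

fold_sizes = []
for i in range(k):
    val_idx = folds[i]         # validation fold: 10% of the data
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # 90%
    fold_sizes.append((len(train_idx), len(val_idx)))
    # ...train on train_idx, compute the validation error on val_idx here...
```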
Estimation in Statistics
Point Estimation:
- Provides a single best prediction of an unknown property of a model using sample data.
Function Estimation:
- Predicts the relationship between input and target variables.
Point Estimator:
- θ̂_m = g(x^(1), …, x^(m)): point estimator of a model property (e.g., an expectation).
- m: number of data elements.
- x^(1), …, x^(m): independent and identically distributed (i.i.d.) data points.
- g(·): any estimation function of the given data points.
Bias and Variance in Estimation
Bias: measures the expected deviation of the estimator from the true parameter value θ: bias(θ̂_m) = E[θ̂_m] - θ.
Variance: measures the deviation of the estimator from its own expected value: Var(θ̂_m) = E[(θ̂_m - E[θ̂_m])²].
Maximum Likelihood Estimation (MLE)
Finds an optimal way to fit a distribution to data (generalization).
Common distributions:
- Uniform, Binomial, Bernoulli, Geometric, Poisson, Exponential, Weibull, Log Normal, Normal (Gaussian), Chi-Squared, Student's t, Beta, Gamma.
Assumption: Data is normally distributed.
- Measurements are close to the mean.
- Measurements are symmetrical around the mean.
Goal: Find where to center the normal distribution's shape (its mean).
- Intuitively, the fitted distribution should place most of its probability near the measured values: "most of the values you measure should be near my average!"
- MLE maximizes the likelihood of observing the measured weights.
- The likelihood also depends on the standard deviation, which is fitted the same way.
MLE Definition: A method of estimating the parameters of a statistical model.
- p_data(x): the true but unknown data-generating distribution.
- X = {x^(1), …, x^(m)}: data drawn independently from p_data(x).
- p_model(x; θ): a probability distribution estimating p_data(x); MLE picks θ_ML = argmax_θ ∏_i p_model(x^(i); θ).
- Reference: https://seeing-theory.brown.edu/bayesian-inference/index.html
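For the normal-distribution case above, the likelihood-maximizing parameters have a closed form: the sample mean and the biased (divide-by-m) standard deviation. A sketch with simulated measurements (the true mean 70 and spread 5 are made-up values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical weight measurements, assumed normally distributed
data = rng.normal(loc=70.0, scale=5.0, size=10_000)

# For a Gaussian, maximizing the likelihood gives:
mu_mle = data.mean()                                # center of the distribution
sigma_mle = np.sqrt(((data - mu_mle) ** 2).mean())  # note: divides by m, not m - 1
print(round(mu_mle, 1), round(sigma_mle, 1))        # close to 70.0 and 5.0
```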
Supervised vs. Unsupervised Learning
Supervised Learning:
- Uses labeled data.
- Examples: Classification (Binary, Multi-class, Multi-label), Regression (Simple Linear, Multiple Linear, Polynomial).
Unsupervised Learning:
- Uses unlabeled data.
- Example: Clustering.
Semi-Supervised Learning:
- Uses both labeled and unlabeled data.
Reinforcement Learning:
- Learns from mistakes (penalty and reward).
Bayesian Statistics
Named after Thomas Bayes, whose work on the theorem was published posthumously in 1763.
Conditional Probability:
- P(A|B) = P(A ∩ B) / P(B): the probability of an event A given B equals the probability of B and A happening together divided by the probability of B.
Bayes Theorem:
- P(A|B) = P(B|A) · P(A) / P(B), where:
- P(A) -> Prior
- P(B|A) -> Likelihood
- P(A|B) -> Posterior
- P(B) -> Evidence
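A numeric sketch of the theorem with made-up numbers (a test with 95% sensitivity, a 5% false-positive rate, and 1% prevalence):

```python
prior = 0.01           # P(A): prevalence of the condition
likelihood = 0.95      # P(B|A): probability of a positive test given the condition
false_positive = 0.05  # P(B|not A)

# Evidence P(B) via the law of total probability
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence  # P(A|B), by Bayes theorem
print(round(posterior, 3))  # 0.161: a positive test is far from conclusive
```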
Challenges Motivating Deep Learning
Traditional ML algorithms have limitations in solving AI problems like speech and object recognition.
Deep learning was motivated by the failure of traditional algorithms to generalize well on such tasks.
Problems
- The curse of dimensionality.
- Local constancy and smoothness regularization.
- Manifold learning.
Curse of Dimensionality
As the number of dimensions increases, ML problems become more complex.
Example: Prediction in 1-D, 2-D, 3-D cases.
Local Constancy and Smoothness Regularization
- ML algorithms need prior beliefs about the functions they should learn.
- Algorithms are biased towards preferring a class of functions.
- The most widely used prior is smoothness, or local constancy.
- It says the function we learn should not change much within a small region.
- If the function gives some answer at a point x, it should give nearly the same answer at a nearby point such as x + 0.0001.
- Deep learning introduces additional priors to reduce generalization error on sophisticated tasks.
Deep Learning vs. Traditional Machine Learning
- Deep learning algorithms can perform better with more data compared to traditional ML algorithms.
Artificial Intelligence, Machine Learning, and Deep Learning
Artificial Intelligence:
- Any technique enabling computers to mimic human intelligence.
- Includes machine learning.
Machine Learning:
- A subset of AI that includes statistical techniques enabling machines to improve at tasks with experience.
- Includes deep learning.
Deep Learning:
- A subset of machine learning composed of algorithms that permit software to train itself by exposing multilayered neural networks to vast amounts of data.
Deep Feed Forward Networks
The history of ANNs dates back to the 1940s.
In 1957, Rosenblatt introduced the perceptron.
An ANN consists of simple processing units that communicate via weighted connections.
Why ANN?
- Technical viewpoint: Some problems require massively parallel and adaptive processing.
- Biological viewpoint: ANNs can replicate and simulate components of the human brain.
Building blocks: Neurons / units / nodes.
Neuron Functionality:
- Receives input from other neurons.
- Changes its internal state (activation) based on the current input.
- Sends one output signal to many other neurons.
Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output
Biological Neuron vs. Artificial Neuron
- Dendrites -> Interconnects
- Soma -> Processing Element
- Axon -> Conduction
- Synapses -> Weights
Single Neuron
Components
- Activation/Transfer function
- Bias (b)
- Input (X)
- Weight (W)
- Accumulator function
Equations:
- Weighted sum (accumulator): z = Σ_i w_i·x_i + b
- Activation: a = f(z)
Deep Feed Forward Network - Processing
Weighted Sum: net = Σ_i w_i·x_i + b
Transfer Function: output = f(net)
Network Parameters
Weights:
- Each neuron connected to others by communication links.
- Each link has a weight.
Bias:
- Impacts net input calculation.
- Considered like another weight (w_0, attached to a fixed input x_0 = 1).
Threshold:
- Pre-defined value for computing the final output.
Learning Rate (α):
- Controls the amount of weight adjustment at each training step.
Activation Functions
Purpose: To add non-linearity to the neural network.
Binary Sigmoid Function (Logistic Sigmoid Function or Unipolar Sigmoid Function):
- f(x) = 1 / (1 + e^(-x))
- Derivative: f'(x) = f(x)(1 - f(x))
Bipolar Sigmoid Function:
- f(x) = (1 - e^(-x)) / (1 + e^(-x))
- Derivative: f'(x) = (1/2)(1 + f(x))(1 - f(x))
Hyperbolic Tangent Function:
- f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Derivative: f'(x) = 1 - f(x)^2
Ramp Function:
- f(x) = 0 for x < 0; f(x) = x for 0 <= x <= 1; f(x) = 1 for x > 1
ReLU (Rectified Linear Unit):
- f(x) = max(0, x)
Other:
Binary Step Function
ELU
Leaky ReLU
Linear
Tanh
SELU
Sigmoid / Logistic
Parametric ReLU
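The main functions listed above, sketched in NumPy (formulas as usually defined; the ramp is taken here as clipping to [0, 1]):

```python
import numpy as np

def binary_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # unipolar: output in (0, 1)

def binary_sigmoid_deriv(x):
    s = binary_sigmoid(x)
    return s * (1.0 - s)

def bipolar_sigmoid(x):
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))  # output in (-1, 1)

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2           # derivative of np.tanh

def relu(x):
    return np.maximum(0.0, x)

def ramp(x):
    return np.clip(x, 0.0, 1.0)            # 0 below 0, linear on [0, 1], 1 above 1
```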
Single Layer & Multi Layer Perceptron
- They differ in the arrangement of layers, the synapses (connections), and the flow of data.
Multi Layer Perceptron
- Three types of layers:
- Input layer
- Hidden layer(s)
- Output layer
- Example using Longitude, Latitude, Elapsed time, Seismic energy as input
Multi Layer Perceptron (Linear Separability)
- The concept of separability applies to binary classification problems. In them, we have two classes: one positive and the other negative. We say they’re separable if there’s a classifier whose decision boundary separates the positive objects from the negative ones. If such a decision boundary is a linear function of the features, we say that the classes are linearly separable.
Single Layer Perceptron (Linear Separability)
- Cannot implement XOR
Multi Layer Perceptron (Linear Separability)
- XOR can’t be calculated by a single perceptron
- XOR can be calculated by a layered network
- Activation: ReLU, f(x) = max(0, x)
- A worked example shows how XOR can be computed by a small network of ReLU units.
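The worked example can be checked directly. The weights below are the classic hand-picked two-layer ReLU solution for XOR (hidden layer h = ReLU(xW + c), output y = h·w + b):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked weights for the two-layer ReLU XOR network
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])    # input-to-hidden weights
c = np.array([0.0, -1.0])     # hidden biases
w = np.array([1.0, -2.0])     # hidden-to-output weights
b = 0.0                       # output bias

def xor_net(x):
    h = relu(x @ W + c)       # hidden activations
    return h @ w + b

for x in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(x, xor_net(np.array(x)))  # prints 0, 1, 1, 0
```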
Gradient Descent
Minimize a function by optimizing its parameters – an optimization problem.
How the optimization of y = f(x) = x² happens:
- GD looks for the value of x that minimizes f(x).
- Consider f(x) as a cost function to minimize.
- Its minimum value is 0 (at x = 0).
Find the slope at a given point to determine the direction of movement:
- Pick any point on the curve.
- Find the gradient: gradient = change in y / change in x.
- If the gradient is negative, increase x to minimize f(x); if it is positive, decrease x.
- Direction = -(Gradient)
- NEW VALUE = OLD VALUE - STEP SIZE, where STEP SIZE = LEARNING RATE * SLOPE
- So: NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)
- The LEARNING RATE controls how much to move in each step.
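The update rule above, applied to f(x) = x² (whose slope at x is 2x); the starting point 5 and learning rate 0.1 are arbitrary choices:

```python
lr = 0.1   # learning rate: how much to move in each step
x = 5.0    # arbitrary starting point

for _ in range(100):
    slope = 2 * x        # derivative of f(x) = x**2
    x = x - lr * slope   # NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)

print(x)  # very close to the minimum at x = 0
```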
Training a Linear Regression Model: minimize the loss function.
Cost Function: L = Σ_i (y_i - (c + m·x_i))², here for the data points (1, 2) and (3, 4).
Find ∂L/∂c at the starting values c = 0, m = 0:
∂L/∂c = -2[2 - (c + m)] + (-2)[4 - (c + 3m)]
=> -2[2 - 0] + (-2)[4 - 0]
=> -4 - 8
=> -12
- NEW VALUE = OLD VALUE - STEP SIZE
- NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)
- Cnew = Cold - (LR * Slope) = 0 - (0.0001 * -12) = 0.0012
- Similarly find Mnew from ∂L/∂m, and update both for the next iteration until there is no further change in c and m.
Reference: https://www.omnicalculator.com/math/gradient (dy/dx)
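The derivation above can be run end to end. The two data points (1, 2) and (3, 4) and the starting values c = m = 0 come from the worked example; the learning rate here is larger than the example's 0.0001 so the loop converges quickly:

```python
points = [(1.0, 2.0), (3.0, 4.0)]  # the (x, y) pairs from the worked example
c, m = 0.0, 0.0
lr = 0.01

# Initial gradient of the loss with respect to c, matching the hand calculation
dc0 = sum(-2 * (y - (c + m * x)) for x, y in points)
print(dc0)  # -12.0, as derived above

for _ in range(20_000):
    dc = sum(-2 * (y - (c + m * x)) for x, y in points)      # dLoss/dc
    dm = sum(-2 * x * (y - (c + m * x)) for x, y in points)  # dLoss/dm
    c -= lr * dc   # Cnew = Cold - (LR * slope)
    m -= lr * dm   # Mnew = Mold - (LR * slope)

print(round(c, 3), round(m, 3))  # converges to c = 1, m = 1 (the exact fit)
```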
Back Propagation
Goal: Minimize Error C = (a-y)²
Find the error C.
Minimize it.
The only quantities we can change are the weights, so we need the derivative of the error with respect to each weight (∂C/∂w).
Finding the Gradient (via the chain rule): ∂C/∂w = (∂C/∂a)·(∂a/∂z)·(∂z/∂w).
NEW W = OLD W - (LEARNING RATE * SLOPE)
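A one-neuron sketch of the idea: the error C = (a - y)² is pushed back through a sigmoid via the chain rule to update w (and, for completeness, b). The training example and initial values are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 1.0   # hypothetical training example
w, b = 0.0, 0.0   # initial parameters
lr = 0.5

for _ in range(1000):
    z = w * x + b
    a = sigmoid(z)                # neuron output
    # Chain rule: dC/dw = dC/da * da/dz * dz/dw
    dC_da = 2 * (a - y)
    da_dz = a * (1 - a)
    w -= lr * dC_da * da_dz * x   # NEW W = OLD W - (LEARNING RATE * SLOPE)
    b -= lr * dC_da * da_dz * 1

print(sigmoid(w * x + b))  # output has moved close to the target y = 1
```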
Gradient Descent VS Stochastic Gradient Descent (SGD)
What's particular about gradient descent is that, to guarantee finding the minimum of a function, the function itself needs to be differentiable and convex.
Two major limitations of GD:
- Calculating derivatives over the entire dataset is time-consuming.
- The memory required is proportional to the size of the dataset.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is a probabilistic approximation of Gradient Descent
It is an approximation because, at each step, the algorithm calculates the gradient for one observation picked at random, instead of calculating the gradient for the entire dataset
Stochastic = Random
Gradient Descent VS Stochastic Gradient Descent (SGD)
SGD – uses ONLY ONE sample (or a SUBSET) from the training set to perform the update of a parameter in a particular iteration. When a SUBSET is used, it is called Minibatch Stochastic Gradient Descent.
GD – runs through ALL the samples in the training set to perform a single update of a parameter in a particular iteration.
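A side-by-side sketch of the two update styles on a made-up one-parameter regression problem (true slope 3, 1000 samples): GD touches all samples per update, minibatch SGD touches a random 32.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(0, 0.1, size=1000)  # hypothetical data, true slope 3

def gradient(w, xb, yb):
    # dLoss/dw for mean squared error on the batch (xb, yb)
    return np.mean(-2 * xb * (yb - w * xb))

lr = 0.1

w_gd = 0.0
for _ in range(50):
    w_gd -= lr * gradient(w_gd, X, y)       # GD: ALL samples per update

w_sgd = 0.0
for _ in range(500):
    idx = rng.integers(0, len(X), size=32)  # minibatch of 32 random samples
    w_sgd -= lr * gradient(w_sgd, X[idx], y[idx])

print(round(w_gd, 2), round(w_sgd, 2))  # both approach the true slope 3
```

SGD's estimates are noisier per step, but each step is far cheaper, which is the trade-off the comparison above describes.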