Deep Learning Notes

Classification Tasks in Computer Vision

  • Classification:

    • Identifies the category of an object in an image.
    • Example: Determining if an image contains a cat.
  • Localization:

    • Identifies the location of an object in an image.
  • Object Detection:

    • Combines classification and localization.
    • Identifies multiple objects in an image and their locations.
    • Example: Detecting cats and dogs in the same image.

Segmentation Tasks in Computer Vision

  • Semantic Segmentation:

    • Partitions an image into regions and classifies each region.
    • Example: Identifying and labeling all pixels belonging to a person, tree, grass, or sky.
  • Instance Segmentation:

    • Similar to semantic segmentation but distinguishes between different instances of the same object.
  • Panoptic Segmentation:

    • Combines semantic and instance segmentation.

Regression

  • Regression:
    • Predicts a continuous value.
    • Example: Fitting a line to data points.
    • Linear Regression Example: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.06-Linear-Regression.ipynb#scrollTo=46EbP7zkaBg-
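As a minimal illustration of fitting a line, the closed-form least-squares estimates for slope and intercept can be sketched in plain Python (the data values are made up for the example):

```python
# Least-squares fit of y = m*x + c: slope = cov(x, y) / var(x),
# intercept = mean(y) - m * mean(x). Data below is hypothetical.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - m * mean_x
    return m, c

m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # points lying on y = 2x + 1
print(m, c)  # → 2.0 1.0
```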

Machine Translation

  • Uses an encoder-decoder architecture to translate text from one language to another.
    • Encoder: processes the input sequence.
    • Decoder: generates the translated output sequence.
    • Example: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/12%20Translate%20text%20between%20languages.ipynb#scrollTo=K_UnAZQpetM8

Transcription

  • Transcribes audio into text.
  • Example (an image-to-text demo): https://cloud.google.com/vision/docs/drag-and-drop

Generation

  • Generates new content, such as images, based on a description.
    • Example: DALL·E generating an image of a monkey paying taxes.

Model Capacity

  • Definition: Model complexity; the ability of a model to express complex patterns or relationships.

  • High Capacity Model:

    • Large number of parameters.
    • High chance of memorizing training data if trained long enough.
  • Low Capacity Model:

    • Low memorization capability.
    • Trained on fewer samples.
    • Fewer parameters.
    • Reference: https://colab.research.google.com/drive/1YrgesNqXi1UBrSoJ3P-4HrTopy1Kmrce#scrollTo=A9bh1rcgV7W_

Train, Test, and Validation Sets

  • Need for Validation:
    • To assess generalization ability: how well the model performs on previously unobserved inputs.
    • Reference: JK NITPY Validation and Test Sets.ipynb (Colaboratory)

Hyperparameters vs. Parameters

  • Parameters:

    • Learned during the machine learning process.
    • Examples: Weights, bias.
  • Hyperparameters:

    • Manually specified.
    • Reference: https://colab.research.google.com/drive/1YrgesNqXi1UBrSoJ3P-4HrTopy1Kmrce#scrollTo=A9bh1rcgV7W, JK NITPY Hyper-Parameters.ipynb - Colaboratory

Bias-Variance Trade-off

  • Example: Predicting height from weight.

    • Blue points: Training samples.
    • Green points: Testing samples.
    • Arrow: Reference line.
  • Method 1: Linear Regression (Fit to Line)

    • High Bias: Model cannot replicate the true reference line.
    • The inability of a model to capture the true relationship between data points is bias.
  • Method 2: Polynomial Regression (Fit to Curve)

    • Low Bias: Can handle the true relationship between weight and height.
  • Performance Comparison (Training Set)

    • Linear Regression: High Error.
    • Polynomial Regression: Low, ~0 Error.
    • Polynomial Regression wins on training data.
  • Performance Comparison (Test Set)

    • Linear Regression: Moderate Error.
    • Polynomial Regression: High Error.
    • Linear Regression wins on test data.
  • Variance:

    • Difference in fits between training and testing data.
  • Bias and Variance:

    • High Bias, Low Variance: Underfitting.
    • Low Bias, High Variance: Overfitting.
  • Solutions:

    • Underfitting: Increase Model Capacity.
    • Overfitting: Decrease Model Capacity.
  • Ideal ML Model: Low Bias and Low Variance.

  • Optimum Capacity Model.

K-Fold Cross-Validation

  • Dataset divided into k folds.
  • k iterations of training and validation.
    • Training set: 90% of the data (training folds).
    • Validation set: 10% of the data (validation fold).
  • Errors E_1, E_2, E_3, …, E_k are calculated, one for each iteration.
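The splitting scheme above can be sketched in plain Python (a simplified version that ignores shuffling and assumes k divides the number of samples evenly):

```python
# Each of the k folds serves exactly once as the validation set;
# the remaining k-1 folds form the training set.
def k_fold_splits(n_samples, k):
    indices = list(range(n_samples))
    fold_size = n_samples // k
    splits = []
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]          # validation fold
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        splits.append((train, val))
    return splits

# With 10 samples and k = 10, each validation fold holds 10% of the data.
splits = k_fold_splits(10, 10)
print(len(splits))  # → 10
```

The overall cross-validation error is then typically reported as the mean of E_1, …, E_k.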

Estimation in Statistics

  • Point Estimation:

    • Provides a single best prediction of an unknown property of a model using sample data (x_1, …, x_m).
  • Function Estimation:

    • Predicts the relationship between input and target variables.
  • Point Estimator:

    • \hat{\theta}_m = g(x_1, …, x_m)
    • \hat{\theta}_m: Point estimator for a model property (e.g., expectation).
    • m: Number of data elements.
    • (x_1, …, x_m): Independent and identically distributed (i.i.d.) data points.
    • g(·): Any estimation function for the given data points.

Bias and Variance in Estimation

  • Bias: Measures the expected deviation from the true value of \theta.

    • bias(\hat{\theta}_m) = E[\hat{\theta}_m] - \theta
  • Variance: Measures the deviation from the expected estimator value.

    • Var(\hat{\theta}_m)

Maximum Likelihood Estimation (MLE)

  • Finds an optimal way to fit a distribution to data (generalization).

  • Common distributions:

    • Uniform, Binomial, Bernoulli, Geometric, Poisson, Exponential, Weibull, Log Normal, Normal (Gaussian), Chi-Squared, Student's t, Beta, Gamma.
  • Assumption: Data is normally distributed.

    • Measurements are close to the mean.
    • Measurements are symmetrical around the mean.
  • Goal: Find where to center the normal distribution shape.

  • The distribution should say: "most of the values you measure should be near my average!"

  • Maximize the likelihood of observing the measured weights.

  • Likelihood depends on the standard deviation.

  • MLE Definition: A method of estimating the parameters of a statistical model.

    • w_{ML} = \arg\max_w P_{model}(X; w) = \arg\max_w \prod_{i=1}^m P_{model}(x^i; w) = \arg\max_w \sum_{i=1}^m \log P_{model}(x^i; w)
    • P_{data}(x): True but unknown data-generating distribution.
    • X = \{x^1, …, x^m\}: Data drawn independently from P_{data}(x).
    • P_{model}(x; w): Probability distribution estimating P_{data}(x).
    • Reference: https://seeing-theory.brown.edu/bayesian-inference/index.html
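For a normal distribution, the maximization above has a closed form: the MLE of the mean is the sample mean, and the MLE of the standard deviation is the 1/m-normalized sample deviation. A minimal sketch (the example data is made up):

```python
import math

# Closed-form MLE for a normal distribution: the values of mu and sigma
# that maximize sum(log P(x_i; mu, sigma)) over the observed data.
def gaussian_mle(xs):
    m = len(xs)
    mu = sum(xs) / m                                     # sample mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / m)  # 1/m-normalized
    return mu, sigma

mu, sigma = gaussian_mle([4.0, 5.0, 6.0])
print(mu)  # → 5.0 (where to center the normal distribution shape)
```

Note that the 1/m normalization makes this estimator of the variance biased; the familiar 1/(m−1) version is the unbiased correction.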

Supervised vs. Unsupervised Learning

  • Supervised Learning:

    • Uses labeled data.
    • Examples: Classification (Binary, Multi-class, Multi-label), Regression (Simple Linear, Multiple Linear, Polynomial).
  • Unsupervised Learning:

    • Uses unlabeled data.
    • Example: Clustering.
  • Semi-Supervised Learning:

    • Uses both labeled and unlabeled data.
  • Reinforcement Learning:

    • Learns from mistakes (penalty and reward).

Bayesian Statistics

  • Introduced by Thomas Bayes; his essay on the subject was published posthumously in 1763.

  • Conditional Probability:

    • The probability of an event A given B equals the probability of B and A happening together divided by the probability of B.
  • Bayes Theorem:

    • P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
    • where:
      • P(A) -> Prior
      • P(B|A) -> Likelihood
      • P(A|B) -> Posterior
      • P(B) -> Evidence
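A quick numeric sketch of the theorem, using made-up numbers for a toy diagnostic test (all probabilities here are hypothetical):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with the evidence
# expanded by total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A).
prior = 0.01           # P(A): e.g., disease prevalence (hypothetical)
likelihood = 0.95      # P(B|A): test positive given disease
false_positive = 0.05  # P(B|~A): test positive without disease

evidence = likelihood * prior + false_positive * (1 - prior)  # P(B)
posterior = likelihood * prior / evidence                     # P(A|B)
print(round(posterior, 3))  # → 0.161
```

Despite the accurate test, the posterior stays low because the prior is small — a standard illustration of why the prior matters.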

Challenges Motivating Deep Learning

  • Traditional ML algorithms have limitations in solving AI problems like speech and object recognition.

  • Deep learning was motivated by the failure of traditional algorithms to generalize well on such tasks.

  • Problems

    • The curse of dimensionality.
    • Local constancy and smoothness regularization.
    • Manifold learning.

Curse of Dimensionality

  • As the number of dimensions increases, ML problems become more complex.

  • Example: Prediction in 1-D, 2-D, 3-D cases.

Local Constancy and Smoothness Regularization

  • ML algorithms need prior beliefs about the functions they should learn.
  • Algorithms are biased towards preferring a class of functions.
  • Most generally used prior is smoothness or local constancy.
  • Says the function we learn should not change much within a small region.
  • If x is a point, the function on x gives an answer. A nearby point x+0.0001 should give nearly the same answer.
  • Deep learning introduces additional priors to reduce generalization error on sophisticated tasks.

Deep Learning vs. Traditional Machine Learning

  • Deep learning algorithms can perform better with more data compared to traditional ML algorithms.

Artificial Intelligence, Machine Learning, and Deep Learning

  • Artificial Intelligence:

    • Any technique enabling computers to mimic human intelligence.
    • Includes machine learning.
  • Machine Learning:

    • A subset of AI that includes statistical techniques enabling machines to improve at tasks with experience.
    • Includes deep learning.
  • Deep Learning:

    • A subset of machine learning composed of algorithms that permit software to train itself by exposing multilayered neural networks to vast amounts of data.

Deep Feed Forward Networks

  • History stems from the 1940s.

  • In 1957, Rosenblatt introduced the perceptron.

  • ANN consists of simple processing units communicating via weighted connections.

  • Why ANN?

    • Technical viewpoint: Some problems require massively parallel and adaptive processing.
    • Biological viewpoint: ANNs can replicate and simulate components of the human brain.
  • Building blocks: Neurons / units / nodes.

  • Neuron Functionality:

    • Receives input from other neurons.
    • Changes its internal state (activation) based on the current input.
    • Sends one output signal to many other neurons.
  • Dendrites: Input

  • Cell body: Processor

  • Synaptic: Link

  • Axon: Output

  • Biological Neuron vs. Artificial Neuron

    • Dendrites -> Interconnects
    • Soma -> Processing Element
    • Axon -> Conduction
    • Synapses -> Weights
      • Y_0 = W_0 X_0
      • Y_1 = W_1 X_1
      • \vdots
      • Y_N = W_N X_N
      • S = \sum_{i=0}^{N} Y_i

Single Neuron

  • Components

    • Activation/Transfer function
    • Bias (b)
    • Input (X)
    • Weight (W)
    • Accumulator function
  • Equations:

    • f = WX + b

    • a = \text{sigmoid}(f)

    • A = a
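The equations above can be sketched as a single forward pass in plain Python (the input and weight values below are arbitrary examples):

```python
import math

# One neuron: accumulator f = W·X + b, then sigmoid activation a.
def neuron(X, W, b):
    f = sum(w * x for w, x in zip(W, X)) + b  # accumulator function
    a = 1 / (1 + math.exp(-f))                # sigmoid transfer function
    return a

# Here f = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, and sigmoid(0) = 0.5.
print(neuron([1.0, 2.0], [0.5, -0.25], 0.0))  # → 0.5
```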

Deep Feed Forward Network - Processing

  • \sum = X_1 + X_2 + … + X_m = y

  • Weighted Sum

    • \sum = X_1 w_1 + X_2 w_2 + … + X_m w_m = y
    • \sum X_i w_i
  • Transfer Function

    • f(v_k)
    • y = f(x)

Network Parameters

  • Weights:

    • Each neuron connected to others by communication links.
    • Each link has a weight.
  • Bias:

    • Impacts net input calculation.
    • Considered like another weight (W_{0j} = b_j) acting on a constant input of 1.
    • Y_{in} = b_j + \sum_i X_i W_{ij}
  • Threshold:

    • Pre-defined value for computing the final output.
  • Learning Rate (α):

    • Controls the amount of weight adjustment at each training step.

Activation Functions

  • Purpose: To add non-linearity to the neural network.

  • Binary Sigmoid Function (also called the Logistic or Unipolar Sigmoid Function):

    • f(x) = \frac{1}{1 + e^{-\lambda x}}

    • Derivative: f'(x) = \lambda f(x)(1 - f(x))

  • Bipolar Sigmoid Function:

    • f(x) = \frac{1 - e^{-\lambda x}}{1 + e^{-\lambda x}}

    • Derivative: f'(x) = \frac{\lambda}{2}(1 + f(x))(1 - f(x))

  • Hyperbolic Tangent Function:

    • f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
    • Derivative: f'(x) = (1 + f(x))(1 - f(x))
  • Ramp Function:

    • f(x) = \begin{cases} 1, & \text{if } x > 1 \\ x, & \text{if } 0 \leq x \leq 1 \\ 0, & \text{if } x < 0 \end{cases}
  • ReLU (Rectified Linear Unit):

    • f(x) = \max(0, x)
  • Other:

    • Binary Step Function

    • ELU

    • Leaky ReLU

    • Linear

    • Tanh

    • SELU

    • Sigmoid / Logistic

    • Parametric ReLU
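A few of the functions above, sketched in plain Python with \lambda taken as 1:

```python
import math

def binary_sigmoid(x):   # unipolar / logistic sigmoid
    return 1 / (1 + math.exp(-x))

def bipolar_sigmoid(x):  # ranges over (-1, 1)
    return (1 - math.exp(-x)) / (1 + math.exp(-x))

def tanh_deriv(x):       # f'(x) = (1 + f(x)) * (1 - f(x))
    f = math.tanh(x)
    return (1 + f) * (1 - f)

def ramp(x):             # clips to [0, 1]
    return max(0.0, min(1.0, x))

def relu(x):             # f(x) = max(0, x)
    return max(0.0, x)

print(binary_sigmoid(0), relu(-2.0), ramp(0.5))  # → 0.5 0.0 0.5
```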

Single Layer & Multi Layer Perceptron

  • Differences in arrangement of layers, synapses, data flow

Multi Layer Perceptron

  • Three layers:
    • Input Layer
    • Hidden Layers
    • Output layer.
  • Example using Longitude, Latitude, Elapsed time, Seismic energy as input

Multi Layer Perceptron (Linear Separability)

  • The concept of separability applies to binary classification problems. In them, we have two classes: one positive and the other negative. We say they’re separable if there’s a classifier whose decision boundary separates the positive objects from the negative ones. If such a decision boundary is a linear function of the features, we say that the classes are linearly separable.

Single Layer Perceptron (Linear Separability)

  • Cannot implement XOR

Multi Layer Perceptron (Linear Separability)

  • XOR can’t be computed by a single perceptron.
  • XOR can be computed by a layered network.
  • Activation: ReLU, f(x) = max(0, x)
  • An example shows how XOR can be computed using a network of ReLU units.
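One such network can be written out directly. The weights below are a standard hand-crafted solution (not learned): hidden weights W1 = [[1,1],[1,1]], hidden biases b1 = [0,−1], and output weights w2 = [1,−2].

```python
# Two-layer ReLU network computing XOR: h = ReLU(W1·x + b1), y = w2·h.
def relu(v):
    return [max(0.0, x) for x in v]

def xor_net(x1, x2):
    h = relu([x1 + x2, x1 + x2 - 1])  # W1 = [[1,1],[1,1]], b1 = [0,-1]
    return h[0] - 2 * h[1]            # w2 = [1, -2]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))  # → 0, 1, 1, 0 respectively
```

The second hidden unit fires only when both inputs are 1, and its −2 output weight cancels the first unit's contribution — exactly the case a single linear boundary cannot handle.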

Gradient Descent

  • Minimize a function by parameter optimization – Optimization Problem

  • How optimization of y = f(x) = x² happens:

    • GD looks to minimize f(x).
    • Consider f(x) as a cost function to minimize.
    • The minimum value is 0.
  • Find the slope at a given point to find the direction of minimization:

    • Find any point on the line
    • Find gradient
    • gradient = change in y / change in x
  • Depending on the sign of the slope, increase or decrease x from the given point to minimize f(x) (i.e., the direction of movement).

  • Direction = −(Gradient)

    • NEW VALUE = OLD VALUE - STEP SIZE
    • NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)
    • LEARNING RATE – how far to move in the chosen direction.
  • Training a linear regression model: minimize the loss function.

  • Cost Function: J(m, c) = [2 - (c + m)]^2 + [4 - (c + 3m)]^2

  • Find the slope with respect to c (starting from c = 0, m = 0):

    • \frac{\partial J}{\partial c} = -2[2 - (c + m)] + (-2)[4 - (c + 3m)]

    • => -2[2 - 0] + (-2)[4 - 0]

    • => -4 - 8

    • => -12

      • NEW VALUE = OLD VALUE - STEP SIZE
      • NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)
      • C_new = C_old - (LR * Slope)
      • = 0 - (0.0001 * -12) = 0.0012
      • Similarly, find M_new and update for the next iteration, until there is no further change in c and m.
    • Reference: https://www.omnicalculator.com/math/gradient (dy/dx)
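The worked update above can be iterated to convergence. A sketch in plain Python (using a learning rate of 0.01 rather than 0.0001, just to keep the loop short; the exact minimum of this cost is at c = 1, m = 1, where both squared terms vanish):

```python
# Gradient descent on J(m, c) = [2 - (c + m)]^2 + [4 - (c + 3m)]^2.
c, m, lr = 0.0, 0.0, 0.01
for _ in range(10000):
    r1 = 2 - (c + m)      # residual of the first term
    r2 = 4 - (c + 3 * m)  # residual of the second term
    grad_c = -2 * r1 - 2 * r2
    grad_m = -2 * r1 - 6 * r2
    c -= lr * grad_c      # NEW VALUE = OLD VALUE - (LR * SLOPE)
    m -= lr * grad_m
print(round(c, 3), round(m, 3))  # → 1.0 1.0
```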

Back Propagation

  • Goal: Minimize the error C = (a - y)²

  • Find the error.

  • Minimize it.

  • The only thing that can be changed is the weight, hence we need the gradient with respect to the weight (w).

  • Finding the Gradient

    • \frac{dC}{dw} = \frac{dC}{da} \times \frac{da}{dw}
    • \frac{da}{dw} = i
    • \frac{dC}{da} = 2(a - y)
  • NEW W = OLD W - (LEARNING RATE * SLOPE)

    • W_1 = W_0 - (\alpha \times \frac{dC}{dw})
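A single update step for this setup, assuming a linear neuron a = w·i (consistent with da/dw = i above); the numeric values are arbitrary examples:

```python
# One backpropagation step: a = w*i, C = (a-y)^2, dC/dw = 2(a-y)*i.
w, i, y, lr = 0.5, 2.0, 3.0, 0.1

a = w * i                # forward pass: a = 1.0
grad = 2 * (a - y) * i   # chain rule: 2 * (1.0 - 3.0) * 2.0 = -8.0
w = w - lr * grad        # W1 = W0 - alpha * dC/dw
print(w)  # → 1.3
```

The negative gradient pushes w up, moving the output a = w·i toward the target y.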

Gradient Descent VS Stochastic Gradient Descent (SGD)

  • What’s particular about gradient descent is that the function needs to be differentiable; if it is also convex, the minimum found is guaranteed to be the global one.

  • Two major limitations – GD

    • Calculating derivatives for the entire dataset is time consuming

    • Memory required is proportional to the size of the dataset

Stochastic Gradient Descent (SGD)

  • Stochastic Gradient Descent is a probabilistic approximation of Gradient Descent

  • It is an approximation because, at each step, the algorithm calculates the gradient for one observation picked at random, instead of calculating the gradient for the entire dataset.

  • Stochastic = Random

Gradient Descent VS Stochastic Gradient Descent (SGD)

  • SGD – Uses ONLY ONE training sample (or a SUBSET) from the training set to do the update for a parameter in a particular iteration. If a SUBSET is used, it is called minibatch stochastic gradient descent.

  • GD – Runs through ALL the samples in the training set to do a single update for a parameter in a particular iteration.
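The contrast can be sketched on a toy linear-regression problem (the data, model y = w·x, and learning rate are made up; the SGD variant here draws one random sample per step, which with a minibatch would become minibatch SGD):

```python
import random

# Noise-free toy data lying on y = 2x, so both methods should find w = 2.
data = [(x, 2 * x) for x in range(1, 6)]
random.seed(0)

def gd_step(w, lr=0.01):
    # GD: average the gradient of (w*x - y)^2 over ALL samples.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def sgd_step(w, lr=0.01):
    # SGD: gradient from ONE sample picked at random.
    x, y = random.choice(data)
    return w - lr * 2 * (w * x - y) * x

w_gd = w_sgd = 0.0
for _ in range(1000):
    w_gd, w_sgd = gd_step(w_gd), sgd_step(w_sgd)
print(round(w_gd, 3), round(w_sgd, 3))  # both approach 2.0
```

Each SGD step is much cheaper (one sample instead of the whole dataset), which is exactly the point of the two limitations of GD listed above.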