Deep Learning Notes
Classification Tasks in Computer Vision
Classification:
- Identifies the category of an object in an image.
- Example: Determining if an image contains a cat.
Localization:
- Identifies the location of an object in an image.
Object Detection:
- Combines classification and localization.
- Identifies multiple objects in an image and their locations.
- Example: Detecting cats and dogs in the same image.
Segmentation Tasks in Computer Vision
Semantic Segmentation:
- Partitions an image into regions and classifies each region.
- Example: Identifying and labeling all pixels belonging to a person, tree, grass, or sky.
Instance Segmentation:
- Similar to semantic segmentation but distinguishes between different instances of the same object.
Panoptic Segmentation:
- Combines semantic and instance segmentation.
Regression
- Regression:
- Predicts a continuous value.
- Example: Fitting a line to data points.
- Linear Regression Example: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.06-Linear-Regression.ipynb#scrollTo=46EbP7zkaBg-
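The linked notebook covers this in depth; as a minimal stand-alone sketch (with made-up data, not the notebook's), a least-squares line fit in NumPy:

```python
import numpy as np

# Hypothetical data: points near the line y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.shape)

# Fit a line y = m*x + c by least squares (degree-1 polynomial)
m, c = np.polyfit(x, y, deg=1)
print(f"slope={m:.2f}, intercept={c:.2f}")  # recovers slope ~2, intercept ~1
```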
Machine Translation
- Uses an encoder-decoder architecture to translate text from one language to another.
- Encoder: processes the input sequence.
- Decoder: generates the translated output sequence.
- Example: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/12%20Translate%20text%20between%20languages.ipynb#scrollTo=K_UnAZQpetM8
Transcription
- Transcribes audio into text.
- Example: https://cloud.google.com/vision/docs/drag-and-drop
Generation
- Generates new content, such as images, based on a description.
- Example: DALL·E generating an image of a monkey paying taxes.
Model Capacity
Definition: Model complexity; the ability of a model to express complex patterns or relationships.
High Capacity Model:
- Large number of parameters.
- High chance of memorizing training data if trained long enough.
Low Capacity Model:
- Low memorization capability.
- Can be trained effectively on fewer samples.
- Fewer parameters.
- Reference: https://colab.research.google.com/drive/1YrgesNqXi1UBrSoJ3P-4HrTopy1Kmrce#scrollTo=A9bh1rcgV7W_
Train, Test, and Validation Sets
- Need for Validation:
- To assess generalization ability: how well the model performs on previously unobserved inputs.
- Reference: "JK NITPY Validation and Test Sets.ipynb" - Colaboratory
Hyperparameters vs. Parameters
Parameters:
- Learned during the machine learning process.
- Examples: Weights, bias.
Hyperparameters:
- Manually specified.
- Reference: https://colab.research.google.com/drive/1YrgesNqXi1UBrSoJ3P-4HrTopy1Kmrce#scrollTo=A9bh1rcgV7W_; "JK NITPY Hyper-Parameters.ipynb" - Colaboratory
Bias-Variance Trade-off
Example: Predicting height from weight.
- (Figure: blue points are training samples, green points are testing samples, and an arrow marks the true reference line.)
Method 1: Linear Regression (Fit a Line)
- High Bias: a straight line cannot replicate the true reference relationship.
- The inability of a model to capture the true relationship between data points is called bias.
Method 2: Polynomial Regression (Fit a Curve)
- Low Bias: a flexible curve can capture the true relationship between weight and height.
Performance Comparison (Training Set)
- Linear Regression: High Error.
- Polynomial Regression: Low, ~0 Error.
- Polynomial Regression wins on training data.
Performance Comparison (Test Set)
- Linear Regression: Moderate Error.
- Polynomial Regression: High Error.
- Linear Regression wins on test data.
Variance:
- Difference in fits between training and testing data.
Bias and Variance:
- High Bias, Low Variance: Underfitting.
- Low Bias, High Variance: Overfitting.
Solutions:
- Underfitting: Increase Model Capacity.
- Overfitting: Decrease Model Capacity.
Ideal ML Model: Low Bias and Low Variance.
Optimum Capacity Model.
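The trade-off above can be reproduced in a small sketch (synthetic data standing in for the weight/height example; the degrees and noise level are assumptions): a degree-9 polynomial nearly memorizes 10 training points, while a line retains some training error but behaves more stably.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the weight/height data: a linear trend plus noise
x_train = np.linspace(0, 1, 10)
y_train = 1.5 * x_train + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 1.5 * x_test + rng.normal(0, 0.2, size=x_test.shape)

def fit_and_errors(degree):
    # Fit a polynomial of the given degree, return (train MSE, test MSE)
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

lin_train, lin_test = fit_and_errors(1)    # high bias, low variance
poly_train, poly_test = fit_and_errors(9)  # low bias, high variance
```

The degree-9 fit passes through every training point (training error near zero) yet typically does worse on held-out points, which is exactly the variance described above.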
K-Fold Cross-Validation
- Dataset divided into k folds.
- k iterations of training and validation.
- Training set: 90% of the data (training folds).
- Validation set: 10% of the data (validation fold).
- Errors are calculated for each iteration.
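A minimal index-splitting sketch of the k = 10 scheme described above (NumPy only; the 100-sample dataset is hypothetical):

```python
import numpy as np

k = 10
data_idx = np.arange(100)      # indices of a hypothetical 100-sample dataset
folds = np.array_split(data_idx, k)

fold_sizes = []
for i in range(k):
    val_idx = folds[i]         # validation fold: 10% of the data
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # 90%
    fold_sizes.append((len(train_idx), len(val_idx)))
    # ...train on train_idx, compute the validation error on val_idx here...
```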
Estimation in Statistics
Point Estimation:
- Provides a single best prediction of an unknown property of a model using sample data.
Function Estimation:
- Predicts the relationship between input and target variables.
Point Estimator:
- θ̂_m = g(x^(1), …, x^(m)): point estimator of a model property (e.g., an expectation).
- m: number of data elements.
- x^(1), …, x^(m): independent and identically distributed (i.i.d.) data points.
- g(·): any estimation function of the given data points.
Bias and Variance in Estimation
Bias: measures the expected deviation of the estimator from the true parameter value θ: bias(θ̂_m) = E[θ̂_m] - θ.
Variance: measures the deviation of the estimator from its own expected value: Var(θ̂_m) = E[(θ̂_m - E[θ̂_m])²].
Maximum Likelihood Estimation (MLE)
Finds an optimal way to fit a distribution to data (generalization).
Common distributions:
- Uniform, Binomial, Bernoulli, Geometric, Poisson, Exponential, Weibull, Log Normal, Normal (Gaussian), Chi-Squared, Student's t, Beta, Gamma.
Assumption: Data is normally distributed.
- Measurements are close to the mean.
- Measurements are symmetrical around the mean.
Goal: Find where to center the normal distribution's shape (its mean).
- Intuitively, the fitted distribution should place most of its probability near the measured values: "most of the values you measure should be near my average!"
- MLE maximizes the likelihood of observing the measured weights.
- The likelihood also depends on the standard deviation, which is fitted the same way.
MLE Definition: A method of estimating the parameters of a statistical model.
- p_data(x): the true but unknown data-generating distribution.
- X = {x^(1), …, x^(m)}: data drawn independently from p_data(x).
- p_model(x; θ): a probability distribution estimating p_data(x); MLE picks θ_ML = argmax_θ ∏_i p_model(x^(i); θ).
- Reference: https://seeing-theory.brown.edu/bayesian-inference/index.html
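For the normal-distribution case above, the likelihood-maximizing parameters have a closed form: the sample mean and the biased (divide-by-m) standard deviation. A sketch with simulated measurements (the true mean 70 and spread 5 are made-up values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical weight measurements, assumed normally distributed
data = rng.normal(loc=70.0, scale=5.0, size=10_000)

# For a Gaussian, maximizing the likelihood gives:
mu_mle = data.mean()                                # center of the distribution
sigma_mle = np.sqrt(((data - mu_mle) ** 2).mean())  # note: divides by m, not m - 1
print(round(mu_mle, 1), round(sigma_mle, 1))        # close to 70.0 and 5.0
```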
Supervised vs. Unsupervised Learning
Supervised Learning:
- Uses labeled data.
- Examples: Classification (Binary, Multi-class, Multi-label), Regression (Simple Linear, Multiple Linear, Polynomial).
Unsupervised Learning:
- Uses unlabeled data.
- Example: Clustering.
Semi-Supervised Learning:
- Uses both labeled and unlabeled data.
Reinforcement Learning:
- Learns from mistakes (penalty and reward).
Bayesian Statistics
Named after Thomas Bayes, whose work on the theorem was published posthumously in 1763.
Conditional Probability:
- P(A|B) = P(A ∩ B) / P(B): the probability of an event A given B equals the probability of B and A happening together divided by the probability of B.
Bayes Theorem:
- P(A|B) = P(B|A) · P(A) / P(B), where:
- P(A) -> Prior
- P(B|A) -> Likelihood
- P(A|B) -> Posterior
- P(B) -> Evidence
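A numeric sketch of the theorem with made-up numbers (a test with 95% sensitivity, a 5% false-positive rate, and 1% prevalence):

```python
prior = 0.01           # P(A): prevalence of the condition
likelihood = 0.95      # P(B|A): probability of a positive test given the condition
false_positive = 0.05  # P(B|not A)

# Evidence P(B) via the law of total probability
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence  # P(A|B), by Bayes theorem
print(round(posterior, 3))  # 0.161: a positive test is far from conclusive
```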
Challenges Motivating Deep Learning
Traditional ML algorithms have limitations in solving AI problems like speech and object recognition.
Deep learning was motivated by the failure of traditional algorithms to generalize well on such tasks.
Problems
- The curse of dimensionality.
- Local constancy and smoothness regularization.
- Manifold learning.
Curse of Dimensionality
As the number of dimensions increases, ML problems become more complex.
Example: Prediction in 1-D, 2-D, 3-D cases.
Local Constancy and Smoothness Regularization
- ML algorithms need prior beliefs about the functions they should learn.
- Algorithms are biased towards preferring a class of functions.
- The most widely used prior is smoothness, or local constancy.
- It says the function we learn should not change much within a small region.
- If the function gives some answer at a point x, it should give nearly the same answer at a nearby point such as x + 0.0001.
- Deep learning introduces additional priors to reduce generalization error on sophisticated tasks.
Deep Learning vs. Traditional Machine Learning
- Deep learning algorithms can perform better with more data compared to traditional ML algorithms.
Artificial Intelligence, Machine Learning, and Deep Learning
Artificial Intelligence:
- Any technique enabling computers to mimic human intelligence.
- Includes machine learning.
Machine Learning:
- A subset of AI that includes statistical techniques enabling machines to improve at tasks with experience.
- Includes deep learning.
Deep Learning:
- A subset of machine learning composed of algorithms that permit software to train itself by exposing multilayered neural networks to vast amounts of data.
Deep Feed Forward Networks
The history of ANNs dates back to the 1940s.
In 1957, Rosenblatt introduced the perceptron.
An ANN consists of simple processing units that communicate via weighted connections.
Why ANN?
- Technical viewpoint: Some problems require massively parallel and adaptive processing.
- Biological viewpoint: ANNs can replicate and simulate components of the human brain.
Building blocks: Neurons / units / nodes.
Neuron Functionality:
- Receives input from other neurons.
- Changes its internal state (activation) based on the current input.
- Sends one output signal to many other neurons.
Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output
Biological Neuron vs. Artificial Neuron
- Dendrites -> Interconnects
- Soma -> Processing Element
- Axon -> Conduction
- Synapses -> Weights
Single Neuron
Components
- Activation/Transfer function
- Bias (b)
- Input (X)
- Weight (W)
- Accumulator function
Equations:
- Weighted sum (accumulator): z = Σ_i w_i·x_i + b
- Activation: a = f(z)
Deep Feed Forward Network - Processing
Weighted Sum: net = Σ_i w_i·x_i + b
Transfer Function: output = f(net)
Network Parameters
Weights:
- Each neuron connected to others by communication links.
- Each link has a weight.
Bias:
- Impacts net input calculation.
- Considered like another weight (w_0, attached to a fixed input x_0 = 1).
Threshold:
- Pre-defined value for computing the final output.
Learning Rate (α):
- Controls the amount of weight adjustment at each training step.
Activation Functions
Purpose: To add non-linearity to the neural network.
Binary Sigmoid Function (Logistic Sigmoid Function or Unipolar Sigmoid Function):
- f(x) = 1 / (1 + e^(-x))
- Derivative: f'(x) = f(x)(1 - f(x))
Bipolar Sigmoid Function:
- f(x) = (1 - e^(-x)) / (1 + e^(-x))
- Derivative: f'(x) = (1/2)(1 + f(x))(1 - f(x))
Hyperbolic Tangent Function:
- f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Derivative: f'(x) = 1 - f(x)^2
Ramp Function:
- f(x) = 0 for x < 0; f(x) = x for 0 <= x <= 1; f(x) = 1 for x > 1
ReLU (Rectified Linear Unit):
- f(x) = max(0, x)
Other:
Binary Step Function
ELU
Leaky ReLU
Linear
Tanh
SELU
Sigmoid / Logistic
Parametric ReLU
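The main functions listed above, sketched in NumPy (formulas as usually defined; the ramp is taken here as clipping to [0, 1]):

```python
import numpy as np

def binary_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # unipolar: output in (0, 1)

def binary_sigmoid_deriv(x):
    s = binary_sigmoid(x)
    return s * (1.0 - s)

def bipolar_sigmoid(x):
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))  # output in (-1, 1)

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2           # derivative of np.tanh

def relu(x):
    return np.maximum(0.0, x)

def ramp(x):
    return np.clip(x, 0.0, 1.0)            # 0 below 0, linear on [0, 1], 1 above 1
```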
Single Layer & Multi Layer Perceptron
- They differ in the arrangement of layers, the synapses (connections), and the flow of data.
Multi Layer Perceptron
- Three types of layers:
- Input layer
- Hidden layer(s)
- Output layer
- Example using Longitude, Latitude, Elapsed time, Seismic energy as input
Multi Layer Perceptron (Linear Separability)
- The concept of separability applies to binary classification problems. In them, we have two classes: one positive and the other negative. We say they’re separable if there’s a classifier whose decision boundary separates the positive objects from the negative ones. If such a decision boundary is a linear function of the features, we say that the classes are linearly separable.
Single Layer Perceptron (Linear Separability)
- Cannot implement XOR
Multi Layer Perceptron (Linear Separability)
- XOR can’t be calculated by a single perceptron
- XOR can be calculated by a layered network
- Activation: ReLU, f(x) = max(0, x)
- A worked example shows how XOR can be computed by a small network of ReLU units.
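The worked example can be checked directly. The weights below are the classic hand-picked two-layer ReLU solution for XOR (hidden layer h = ReLU(xW + c), output y = h·w + b):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked weights for the two-layer ReLU XOR network
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])    # input-to-hidden weights
c = np.array([0.0, -1.0])     # hidden biases
w = np.array([1.0, -2.0])     # hidden-to-output weights
b = 0.0                       # output bias

def xor_net(x):
    h = relu(x @ W + c)       # hidden activations
    return h @ w + b

for x in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(x, xor_net(np.array(x)))  # prints 0, 1, 1, 0
```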
Gradient Descent
Minimize a function by optimizing its parameters – an optimization problem.
How the optimization of y = f(x) = x² happens:
- GD looks for the value of x that minimizes f(x).
- Consider f(x) as a cost function to minimize.
- Its minimum value is 0 (at x = 0).
Find the slope at a given point to determine the direction of movement:
- Pick any point on the curve.
- Find the gradient: gradient = change in y / change in x.
- If the gradient is negative, increase x to minimize f(x); if it is positive, decrease x.
- Direction = -(Gradient)
- NEW VALUE = OLD VALUE - STEP SIZE, where STEP SIZE = LEARNING RATE * SLOPE
- So: NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)
- The LEARNING RATE controls how much to move in each step.
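The update rule above, applied to f(x) = x² (whose slope at x is 2x); the starting point 5 and learning rate 0.1 are arbitrary choices:

```python
lr = 0.1   # learning rate: how much to move in each step
x = 5.0    # arbitrary starting point

for _ in range(100):
    slope = 2 * x        # derivative of f(x) = x**2
    x = x - lr * slope   # NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)

print(x)  # very close to the minimum at x = 0
```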
Training a Linear Regression Model: minimize the loss function.
Cost Function: L = Σ_i (y_i - (c + m·x_i))², here for the data points (1, 2) and (3, 4).
Find ∂L/∂c at the starting values c = 0, m = 0:
∂L/∂c = -2[2 - (c + m)] + (-2)[4 - (c + 3m)]
=> -2[2 - 0] + (-2)[4 - 0]
=> -4 - 8
=> -12
- NEW VALUE = OLD VALUE - STEP SIZE
- NEW VALUE = OLD VALUE - (LEARNING RATE * SLOPE)
- Cnew = Cold - (LR * Slope) = 0 - (0.0001 * -12) = 0.0012
- Similarly find Mnew from ∂L/∂m, and update both for the next iteration until there is no further change in c and m.
Reference: https://www.omnicalculator.com/math/gradient (dy/dx)
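The derivation above can be run end to end. The two data points (1, 2) and (3, 4) and the starting values c = m = 0 come from the worked example; the learning rate here is larger than the example's 0.0001 so the loop converges quickly:

```python
points = [(1.0, 2.0), (3.0, 4.0)]  # the (x, y) pairs from the worked example
c, m = 0.0, 0.0
lr = 0.01

# Initial gradient of the loss with respect to c, matching the hand calculation
dc0 = sum(-2 * (y - (c + m * x)) for x, y in points)
print(dc0)  # -12.0, as derived above

for _ in range(20_000):
    dc = sum(-2 * (y - (c + m * x)) for x, y in points)      # dLoss/dc
    dm = sum(-2 * x * (y - (c + m * x)) for x, y in points)  # dLoss/dm
    c -= lr * dc   # Cnew = Cold - (LR * slope)
    m -= lr * dm   # Mnew = Mold - (LR * slope)

print(round(c, 3), round(m, 3))  # converges to c = 1, m = 1 (the exact fit)
```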
Back Propagation
Goal: Minimize Error C = (a-y)²
Find the error C.
Minimize it.
The only quantities we can change are the weights, so we need the derivative of the error with respect to each weight (∂C/∂w).
Finding the Gradient (via the chain rule): ∂C/∂w = (∂C/∂a)·(∂a/∂z)·(∂z/∂w).
NEW W = OLD W - (LEARNING RATE * SLOPE)
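A one-neuron sketch of the idea: the error C = (a - y)² is pushed back through a sigmoid via the chain rule to update w (and, for completeness, b). The training example and initial values are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 1.0   # hypothetical training example
w, b = 0.0, 0.0   # initial parameters
lr = 0.5

for _ in range(1000):
    z = w * x + b
    a = sigmoid(z)                # neuron output
    # Chain rule: dC/dw = dC/da * da/dz * dz/dw
    dC_da = 2 * (a - y)
    da_dz = a * (1 - a)
    w -= lr * dC_da * da_dz * x   # NEW W = OLD W - (LEARNING RATE * SLOPE)
    b -= lr * dC_da * da_dz * 1

print(sigmoid(w * x + b))  # output has moved close to the target y = 1
```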
Gradient Descent VS Stochastic Gradient Descent (SGD)
What's particular about gradient descent is that, to guarantee finding the minimum of a function, the function itself needs to be differentiable and convex.
Two major limitations of GD:
- Calculating derivatives over the entire dataset is time-consuming.
- The memory required is proportional to the size of the dataset.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is a probabilistic approximation of Gradient Descent
It is an approximation because, at each step, the algorithm calculates the gradient for one observation picked at random, instead of calculating the gradient for the entire dataset
Stochastic = Random
Gradient Descent VS Stochastic Gradient Descent (SGD)
SGD – uses ONLY ONE sample (or a SUBSET) from the training set to perform the update of a parameter in a particular iteration. When a SUBSET is used, it is called Minibatch Stochastic Gradient Descent.
GD – runs through ALL the samples in the training set to perform a single update of a parameter in a particular iteration.
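A side-by-side sketch of the two update styles on a made-up one-parameter regression problem (true slope 3, 1000 samples): GD touches all samples per update, minibatch SGD touches a random 32.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(0, 0.1, size=1000)  # hypothetical data, true slope 3

def gradient(w, xb, yb):
    # dLoss/dw for mean squared error on the batch (xb, yb)
    return np.mean(-2 * xb * (yb - w * xb))

lr = 0.1

w_gd = 0.0
for _ in range(50):
    w_gd -= lr * gradient(w_gd, X, y)       # GD: ALL samples per update

w_sgd = 0.0
for _ in range(500):
    idx = rng.integers(0, len(X), size=32)  # minibatch of 32 random samples
    w_sgd -= lr * gradient(w_sgd, X[idx], y[idx])

print(round(w_gd, 2), round(w_sgd, 2))  # both approach the true slope 3
```

SGD's estimates are noisier per step, but each step is far cheaper, which is the trade-off the comparison above describes.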