Deep Learning Notes

Biological Neural Network

Brain: Center of the nervous system.
Biological nervous system is the most important thing for living things, especially human beings.
Human nervous system's center contains a brain.
Any biological nervous system has a large number of interconnected processing units called neurons.
Each neuron can operate in parallel.
The human brain consists of approximately $10^{11}$ neurons communicating with each other with the help of electrical impulses.

Neuron Structure and Function

Dendrite: A bush of very thin fiber.
Axon: A long cylindrical fiber.
Soma: Also called a cell body, similar to a nucleus of a cell.
Synapse: A junction where the axon makes contact with the dendrites of neighboring neurons.

Neuron Communication

Each neuron contains a chemical called a neurotransmitter.
Signals are transmitted across neurons by this chemical.
Inputs from other neurons arrive at a neuron through dendrites.
These signals accumulate at the synapse of the neuron and serve as the output to be transmitted through the neuron.
The action may produce electrical impulses, which usually last for about a millisecond.

Pulse Generation

Pulses are generated due to incoming signals.
Not all signals produce pulses in the axon unless a threshold value is crossed.
Action signals in the axon of a neuron are commutative signals arriving at dendrites and are summed up at the soma.

Artificial Neural Network (ANN)

Artificial Neural Networks (ANNs) or Neural Networks (NNs) are simplified models of biological nervous systems.
The behavior of a biological neural network can be captured by a simple model called an artificial neuron or perceptron.

Analogy between Biological Neural Network (BNN) and Artificial Neural Network (ANN)

BNN: ANN:
- Cell - Neuron
- Dendrite - Weight or interconnection
- Soma - Net Input
- Axon - Output

Functions of Neurons in Artificial Neural Networks

Compute input signals.
Transport signals at high speed.
Store information.
Handle perception, automatic training, and learning.

Components of an Artificial Neuron

$X1, X2, X3, …, Xn$ : n inputs to the artificial neuron.
$W1, W2, W3, …, Wn$ : Weights attached to the input links.

Total Input

The total input “I” received by the soma of the artificial neuron is:
- $I = (X1 * W1) + (X2 * W2) + (X3 * W3) + … + (Xn * Wn)$

Transfer Function

A commonly known transfer function is the thresholding function denoted as Ø.
In this thresholding function, the sum (I) is compared with a threshold value θ.
If I > θ, then the output is 1, else it is 0 (like a linear filter).
Mathematically:
- \text{if } I > \theta, \text{ output } = 1
- $\text{else output } = 0$
Such a Ø is called a step function (also known as the Heaviside function).

Key Components of Neural Network Architecture

Input:
- Set of features fed into the model for learning.
- Example: pixel values of an image in object detection.
Weight:
- Gives importance to features that contribute more to learning.
- Achieved through scalar multiplication between the input value and the weight matrix.
- Example: negative words impacting sentiment analysis more than neutral words.
Bias:
- Used to prevent weights from becoming zero.
Transfer Function:
- Combines multiple inputs into one output value so that the activation function can be applied.
- Done by a simple summation of all inputs to the transfer function.
Activation Function:
- Introduces non-linearity to consider varying linearity with the inputs.
- Without it, the output would be a linear combination of input values, lacking non-linearity.
Output:
- Each ANN has one Output Layer which provides the output of model.
- Regression: Output Layer has one Node; its output is the Y-Predicted value.
- Categorical Data (Two classes): Output Layer has 2 Nodes, activation function is ‘Sigmoid’.
- Multi-class classification: Output Layer has Nodes equal to the number of classes, activation function is ‘Softmax’.

Activation Functions

Executed at each node in the hidden and output layers of the NN.
Used to determine the output of the neural network (yes or no).
Maps the resulting values between 0 to 1 or -1 to 1, etc. (depending on the function).
Derivative or Differential: Change in y-axis with respect to change in x-axis; also known as slope.
The derivative of the transfer function is needed to perform back propagation.

Types of Activation Functions

Linear Activation Function
Non-linear Activation Functions

Examples of Activation Functions

Step (or Heaviside) Function:
- Applies a hard threshold.
- A higher output implies a greater probability of being 1 (or a high output).
- Discontinuous; therefore, not used in practice with back-propagation.
Ramp Function:
- A truncated version of the linear function.
- Maps a range of inputs to outputs over the range (0, 1) with definitive cut-off points T1 and T2.
- Fires the node definitively above a threshold but retains some uncertainty in the lower regions.
Rectified Linear Unit (ReLU):
- A special case of the ramp function used in convolutional neural networks (CNNs).
- T1 = 0 and T2 is the maximum of the input, giving a linear function with no negative values
Sigmoid (or Fermi) Function:
- Maps the input to a value between 0 and 1 (but not equal to 0 or 1).
- Output from the node will be a high signal (if the input is positive) or a low one (if the input is negative).
- Efficiently perform back propagation.
- Sigmoid’s natural threshold is 0.5.
- Lack of output can affect subsequent weights, stopping the next nodes from learning.
Hyperbolic Tangent Function (tanh(x)):
- The tanh function supplies -1 for negative values, maintaining the output of the node and allowing subsequent nodes to learn from it.
Gaussian Function:
- An even function that gives the same output for equally positive and negative input values.
- Gives its maximal output when there is no input and has decreasing output with increasing distance from zero.
- Can be used in a node where the input feature is less likely to contribute to the final result.

Example Calculation in Neurons

If data has two features, Age and Exp (30 & 8), and Salary is 40,000, the features are given as input to the nodes of the input layers.
Example Calculations:
- $A = (30 * 2 + 0) + (8 * 3 + 0) = 84$
- $B = (30 * 1 + 0) + (8 * 1 + 0) = 38$
- $\text{Output} = (84 * 1 + 0) + (38 * 2 + 0) = 160$
160 is the output of the First Feed Forward process.
If this value does not match the salary column value 40,000, weights are updated in the Back Propagation process.
The Optimizer (e.g., Gradient Descent, Adam) takes care of this process.
First, weights $w5$ & $w6$ will be updated, then the next set of weights $w1$ , $w2$ , $w3$ & $w4$ will be updated.
Then the feed forward will happen again for the next data.
The weights & biases will be updated for the remaining data and epochs.

Architectures of ANN

Models are specified by three basic entities:
1. The Models synaptic interconnections.
2. The training and learning rules adopted for updating and adjusting the connection weights.
3. Their activation functions.
Three fundamental classes of ANN architectures:
- Single layer feed forward architecture
- Multilayer feed forward architecture
- Recurrent networks architecture

The AND Problem and its Neural Network

The simple Boolean AND operation with two input variables $x1$ and $x2$ has four input patterns: 00, 01, 10 and 11.
For the first three patterns, the output is 0, and for the last pattern, the output is 1.
The AND problem can be thought of as a perception problem where we have to receive four different patterns as input and perceive the results as 0 or 1.
A possible neuron specification to solve the AND problem is to set the threshold $\theta = 0.9$ . When the input is 11, the weight sum exceeds the threshold leading to the output 1; else, it gives the output 0.

Single Layer Feed Forward Neural Network

A layer of n neurons constitutes a single-layer feed-forward neural network.
It contains a single layer of artificial neurons.
The input layer and output layer receive input signals and transmit output signals but are boundaries, not true layers.
The only layer in the architecture is the synaptic links carrying the weights that connect every input to the output neurons.

Modeling SLFFNN

The output of any k-th neuron can be determined as follows.

Multilayer Feed Forward Neural Networks

This network is made up of multiple layers.
Besides processing an input and an output layer, architectures of this class also have one or more intermediary layers called hidden layers.
The hidden layer(s) aid in performing useful intermediary computation before directing the input to the output layer.
A multilayer feed forward network with l input neurons (number of neurons at the first layer), $m1, m2, …, mp$ number of neurons at $i^{th}$ hidden layer (i = 1, 2, …, p) and n neurons at the last layer (it is the output neurons) is written as $l - m1 - m2 - … - mp - n$ MLFFNN.
In $l – m – n$ MLFFNN, the input first layer contains l numbers of neurons, the only hidden layer contains m number of neurons, and the last (output) layer contains n number of neurons.
The inputs $x1, x2, …, x_p$ are fed to the first layer, and the weight matrices between input and the first layer, the first layer and the hidden layer, and those between hidden and the last (output) layer are denoted as $w^1, w^2$ and $w^3$ , respectively.
$f^1, f^2$ and $f^3$ are the transfer functions of neurons lying on the first, hidden, and the last layers, respectively.
The threshold values of any i-th neuron in j-th layer is denoted by $\theta$ .
The output of i-th, j-th, and k-th neuron in any l-th layer is represented by $O = f^l(\sum xi w^l + \theta)$ , where $xi$ is the input vector to the l-th layer.

Recurrent Neural Network Architecture

These networks differ from feedback network architectures in the sense that there is at least one ”feedback loop”.
Thus, in these networks, there could exist one layer with a feedback connection.
There could also be neurons with self-feedback links, that is, the output of a neuron is fed back into itself as input.
Depending on different types of feedback loops, several recurrent neural networks are known, such as the Hopfield network, and Boltzmann machine network.

Why Different Types of Neural Network Architectures?

Consider the case of a single neural network with two inputs.
$f = b0 + W1x1 + W2x2$ denotes a straight line in the plane of $x1 - x_2$ .
Depending on the values of $W1$ and $W2$ , we have a set of points for different values of $x1$ and $x2$ .
These points are linearly separable if the straight line f separates these points into two classes.

AND and XOR Problems

AND problem is linearly separable, while XOR problem is linearly non-separable.
In the AND problem, a straight line is possible to separate two tasks—output as 0 or 1 for any input.
In the XOR problem, such a line is not possible.
Note: A horizontal or a vertical line in case of the XOR problem is not admissible because in that case, it completely ignores one input.
For a 2-classification problem, if there is a straight line that acts as a decision boundary, then such a problem is linearly separable; otherwise, it is non-linearly separable.
The same concept can be extended to the n-classification problem.
Such a problem can be represented by an n-dimensional space, and a boundary would be with n − 1 dimensions that separate given sets.

Using Different Neural Network Architectures to Solve Problems

Any linearly separable problem can be solved with a single-layer feed-forward neural network (e.g., the AND problem).
If the problem is non-linearly separable, then a single-layer neural network cannot solve such a problem.
To solve such a problem, a multilayer feed-forward neural network is required.

Dynamic Neural Network

In some cases, the output needs to be compared with its target values to determine an error.
Based on this category of applications, a neural network can be static or dynamic.
In a static neural network, the error in prediction is neither calculated nor fed back for updating the neural network.
In a dynamic neural network, the error is determined and then fed back to the network to modify its weights (or architecture or both).

Choosing Architectures

For linearly separable problems, use a single-layer feed-forward neural network.
For non-linearly separable problems, use multilayer feed-forward neural networks.
For problems with error calculation, use recurrent neural networks as well as dynamic neural networks.

Learning Process in ANN

Learning is the ability of the neural network (NN) to learn from its environment and improve its performance through learning.
- The NN is stimulated by an environment
- The NN undergoes changes in its free parameters
- The NN responds in a new way to the environment
Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place (Mendel & McClaren 1970).

Five Basic Learning Rules

Error-correction learning: optimum filtering
Memory-based learning: memorizing the training data explicitly
Hebbian learning: neurobiological
Competitive learning: neurobiological
Boltzmann learning: statistical mechanics

Error-Correction Learning

error signal = desired response – output signal
$ek(n) = dk(n) – y_k(n)$
$ek(n)$ actuates a control mechanism to make the output signal $yk(n)$ come closer to the desired response $d_k(n)$ in a step-by-step manner
A cost function $\varepsilon(n) = \frac{1}{2} e^2_k(n)$ is the instantaneous value of the error energy -> a steady state
= a delta rule or Widrow-Hoff rule
$\Delta w{kj}(n) = \eta \, ek(n) \, x_j$
$\eta$ is the learning rate parameter
The adjustment made to a synaptic weight of a neuron is proportional to the product of the error signal and the input signal of the synapse in question.
$w{kj}(n+1) = w{kj}(n) + \Delta w_{kj}(n)$

Memory-Based Learning:

All of the past experiences are explicitly stored in a large memory of correctly classified input-output examples
${(xi, di)}_{i=1}^N$
Criterion used for defining the local neighborhood of the test vector $x_{test}$ .
Learning rule applied to the training examples in the local neighborhood of $x_{test}$ .
Nearest neighbor rule: the vector $x’N \in {x1,x2,…,xN}$ is the nearest neighbor of $x{test}$ if $\mini d(xi, x{test} ) = d(x’N , x{test} )$
If the classified examples d(xi, di ) are independently and identically distributed according to the joint probability distribution of the example (x,d).
If the sample size N is infinitely large.
The classification error incurred by the nearest neighbor rule is bounded above twice the Bayes probability of error.

k-nearest neighbor classifier:

Identify the k classified patterns that lie nearest to the test vector $x_{test}$ for some integer k.
Assign $x{test}$ to the class that is most frequently represented in the k nearest neighbors to $x{test}$ .

Hebbian Learning:

If two neurons on either side of the synapse (connection) are activated simultaneously, then the strength of that synapse is selectively increased.
If two neurons on either side of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated.

Time-dependent mechanism
Local mechanism (spatiotemporal contiguity)
Interactive mechanism
Conjunctional or correlational mechanism
A Hebbian synapse increases its strength with positively correlated presynaptic and postsynaptic signals and decreases its strength when signals are either uncorrelated or negatively correlated.
The Hebbian learning in mathematical terms:

$\Delta w{kj}(n)=F(yk(n),x_j(n))$
The simplest form:
- $\Delta w{kj}(n) = \eta yk(n)x_j(n)$
Covariance hypothesis:
- $\Delta w{kj} = \eta (xj- \bar{x})(y_j- \bar{y})$

Note that:

Synaptic weight $w{kj}$ is enhanced if the conditions xj > \bar{x} and yk > \bar{y} are both satisfied. Synaptic weight $w{kj}$ is depressed if there is xj > \bar{x} and yk < \bar{y} or yk > \bar{y} and xj < \bar{x}.

Competitive Learning:

The output neurons of a neural network compete among themselves to become active.

a set of neurons that are all the same (excepts for synaptic weights)
a limit imposed on the strength of each neuron
a mechanism that permits the neurons to compete -> a winner-takes-all
The standard competitive learning rule

$\Delta w{kj} = \eta(xj-w_{kj}) \text{ if neuron k wins the competition}$

= 0 if neuron k loses the competition
note. all the neurons in the network are constrained to have the same length.

Boltzmann Learning:

The neurons constitute a recurrent structure and they operate in a binary manner. The machine is characterized by an energy function E.
$E = -\frac{1}{2} \sumj \sumk w{kj}xkxj, j \neq k$ Machine operates by choosing a neuron at random then flipping the state of neuron k from state $xk$ to state – $x_k$ at some temperature T with probability

P(xk - xk) = 1/(1+exp(- Ek/T))

The Boltzmann learning rule:
$\Delta w{kj} = \eta(\rho^+{kj}- \rho^-{kj}), j \neq k,$ Note that both $\rho^+{kj}$ and $\rho^-_{kj}$ range in value from –1 to +1.
The environment
Clamped condition: the visible neurons are all clamped onto specific states determined by the environmentFree-running condition: all the neurons (=visible and hidden) are allowed to operate freely in value from –1 to +1.

Learning Process Factors

The number of layers in the network (Single-layered or multi-layered)
Direction of signal flow (Feedforward or recurrent)
Number of nodes in layers:
- The number of nodes in the input layer is equal to the number of features of the input data set.
- The number of output nodes will depend on possible outcomes (i.e., the number of classes in the case of supervised learning).
- The number of layers in the hidden layer is to be chosen by the user.
- A larger number of nodes in the hidden layer, higher the performance but too many nodes may result in overfitting as well as increased computational expense.
Weight of Interconnected Nodes:
- Stopping criterion: rate of misclassification < 1% or the maximum numbers of iterations should be less than 25 (a threshold value)
- The rate of misclassification may not reduce progressively.

Backpropagation

Difference in output values of the output layer and the expected values, are propagated back from the output layer to the preceding layers. Hence, the algorithm implementing this method is known as BACK PROPAGATION (i.e., propagating the errors back to the preceding layers).
Gradient descent algorithm is used in NN. Gradient Descent is an optimization algorithm used for minimizing the cost function. It is basically used for updating the parameters of the learning model.

This algorithm consists of multiple iterations, known as epochs. Each epoch consists of two phases:

Forward Phase
Backward Phase
Gradient Descent: calculates the partial derivative of the activation function by each interconnection weight to identify the ‘gradient’ or extent of change of the weight required to minimize the cost function.

Variables:

‘m’ neurons in the input layer
‘r’ neurons in the output layer
hidden layer with ‘n’ neurons
‘k’ is the no. of the hidden layer
The net signal input to the hidden layer neurons is given by:
If is the activation function of the hidden layer, then
The net signal input to the output layer neurons is given by:
If is the activation function of the output layer, then
If is the target of the k-th output neuron, then the cost function defined as the squared error of the output layer is given by:

Back Propagation Equation

$\frac{\partial E}{\partial w{jk}} = \sum{d \in D} (t{kd} - o{kd}) o{kd} (1 - o{kd}) x_{jd}$

Single sample dataset

Dataset Our dataset has one sample with two inputs and one output. 92 Our single sample is as following inputs=[2, 3] and output=[1].

Forward Pass Formula

$[2 \quad 3] . \begin{bmatrix} 0.11 & 0.12 \ 0.21 & 0.08 \end{bmatrix} = \begin{bmatrix} 0.85 & 0.48 \end{bmatrix} . \begin{bmatrix} 0.14 \ 0.15 \end{bmatrix} = [0.191]$

Reducing Error

$\text{prediction} = out$
$\text{prediction} = (h1) W5 + (h2) W6$
$\text{h1 = i1 W1 + i2 W2}$ $\text{h2 = i1 W3 + i2 W4}$
$\text{prediction} = (i1 W1 + i2 W2) W5 + (i1 W3 + i2 W4) W6$
To change the prediction value, we need to change the weights.

Gradient Descent Formulas

$W6 = W6 - \alpha \frac{\partial}{\partial W6}(\text{Error})$ Chain Rule: $\frac{\partial \text{Error}}{\partial W6} = \frac{\partial \text{Error}}{\partial \text{prediction}} * \frac{\partial \text{prediction}}{\partial W6}$ $\text{Error} = \frac{1}{2} (\text{prediction} - \text{actula})^2 * \frac{\partial}{\partial W6} (i1 W1 + i2 W2) W5 + (i1 W3 + i2 W4) W6$
Simplified:
$\frac{\partial \text{Error}}{\partial W6} = (\text{prediction} - \text{actula}) * (h2)$
Thus, $W6 = W6 - \alpha \Delta h_2$

And Similarly,

$W5 = W5 - \alpha \Delta h_1$

Chain Rule:
$\frac{\partial \text{Error}}{\partial W1} = \frac{\partial \text{Error}}{\partial \text{prediction}} * \frac{\partial \text{prediction}}{\partial h1} * \frac{\partial h1}{\partial W1}$
Thus, $W1 = (\text{prediction} - \text{actual}) * (Ws) * (i1)$

Updated Weights:

Applying the Gradient Descent updates, the new weight values for the respective weights.

Error Function

loss or cost function (sometimes also called an error function) Loss Function: A loss function/error function is for a single training example/input.
Cost Function: A cost function, on the other hand, is the average loss over the entire training dataset.

Loss function in Deep Learning Regression •
MSE(Mean Squared Error) -
MAE(Mean Absolute Error) -
Hubber loss 2. Classification
Binary cross-entropy
Categorical cross-entropy 3. AutoEncoder
KL Divergence 4. GAN
Discriminator loss
Minmax GAN loss 5. Object detection
Focal loss 6. Word embeddings
Triplet loss

Types of Regression Loss

Mean Absolute Error (MAE):
Mean Squared Error (MSE) : L2 Loss
Huber Loss•
δ (Hyper parameter) – defines the point where the Huber loss function transitions from a quadratic to linear. (δ 1.35)

Classification loss

Binary Cross Entropy/log loss

Log Loss =
N
i=1
yi logŷi + (1-yi)log(1-ŷi)

Categorical Cross entropy

k
yi logŷj where k is number of classes in the data
j=1
Loss =
n
Σy log (1) j=1

Artificial Neural Network:

Module II: Artificial Neural Network: (6 Hours) Artificial Neural Network, Basics of ANN, Activation Functions, Architectures of Neural Network. Learning Process in ANN, Error functions, Back Propagation Neural network. ANN 1 ANN and Machine Learning
Neurons •
Nucleus, body, tail etc •
Human Brain 100 billion neurons
Each is connected to a 1000 others

ANN
The whole purpose of ANN and ML is to mimic how a human brain works Receivers and transmitters