Artificial Intelligence (AI) – any technique enabling computers to mimic human behaviour.
Machine Learning (ML) – algorithms that let computers “learn” patterns without being explicitly programmed.
Deep Learning (DL) – subset of ML that extracts hierarchical patterns directly from raw data using neural networks.
Replaces hand-engineered features (time-consuming, non-scalable) with learned representations.
Three drivers of the current DL boom:
Big Data (abundant labelled & unlabelled data sets).
Hardware (GPUs/TPUs for massive parallelism).
Software (open-source libraries, e.g. TensorFlow & PyTorch).
TensorFlow – Google-backed; graph-based execution (eager by default since TF 2.x) with the tf.keras high-level API.
PyTorch – Dynamic computation graphs, pythonic, research-friendly.
Early work (Rosenblatt) viewed the brain as:
Mosaic sensory points → Projection areas → Association units → Response units.
Perceptron – mathematical abstraction of a neuron.
Takes inputs x_1,\dots,x_m with weights w_1,\dots,w_m and bias w_0.
Computes a weighted sum z = w_0 + \sum_{j=1}^{m} w_j x_j.
Applies non-linear activation \hat y = g(z).
Compact vector form: \hat y = g(w_0 + \mathbf{x}^T \mathbf{w}) where \mathbf{x} \in \mathbb{R}^m, \mathbf{w}\in\mathbb{R}^m.
Common scalar example: \hat y = g(1 + 3x_1 - 2x_2) defines a decision line in 2-D.
Multi-output version stacks neurons: z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}, then y_i = g(z_i) for each output i.
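A minimal NumPy sketch of this forward pass (the input, weight, and bias values below are made-up illustrative numbers, not values from the notes):

    import numpy as np

    def perceptron(x, w, w0, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
        # Weighted sum plus bias, then a non-linear activation g (sigmoid by default)
        z = w0 + x @ w                     # z = w_0 + x^T w
        return g(z)

    x = np.array([1.0, -2.0])              # inputs x_1, x_2
    w = np.array([3.0, -2.0])              # weights w_1, w_2 from the scalar example above
    w0 = 1.0                               # bias w_0
    print(perceptron(x, w, w0))            # sigmoid(1 + 3*1 - 2*(-2)) = sigmoid(8) ≈ 0.9997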
Sigmoid g(z)=\frac{1}{1+e^{-z}}, derivative g'(z)=g(z)(1-g(z)).
Hyperbolic Tangent g(z)=\tanh(z), derivative g'(z)=1-\tanh^2(z).
ReLU g(z)=\max(0,z), derivative g'(z)=0\; (z\le 0),\;1\; (z>0).
All introduce non-linearity, allowing networks to approximate complex functions; without them networks collapse into linear models.
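A quick NumPy check of that last point, assuming arbitrary random weights: two stacked linear layers with no activation in between collapse into a single linear map.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))           # first "layer" weights
    W2 = rng.normal(size=(3, 2))           # second "layer" weights
    x = rng.normal(size=4)

    stacked = (x @ W1) @ W2                # two linear layers, no non-linearity between them
    single = x @ (W1 @ W2)                 # one equivalent linear layer
    print(np.allclose(stacked, single))    # True: the stack is still a linear model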
Architecture example: 2 inputs + bias → 2 hidden neurons → 1 output neuron.
Forward pass:
Hidden: z^{(1)}_k = w^{(1)}_{0k}+\sum_j w^{(1)}_{jk}x_j, \; a^{(1)}_k=g(z^{(1)}_k).
Output: z^{(2)} = w^{(2)}_{0}+\sum_k w^{(2)}_{k}a^{(1)}_k, \; \hat y=g_{out}(z^{(2)}).
Common output activations: Sigmoid for binary, Softmax for multi-class.
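A hand-rolled NumPy forward pass for the 2-input → 2-hidden → 1-output architecture above; every weight and bias value is an illustrative placeholder.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0])                  # two inputs
    W1 = np.array([[0.1, 0.4],
                   [-0.3, 0.2]])               # hidden weights w^{(1)}_{jk}, shape (inputs, hidden)
    b1 = np.array([0.0, 0.1])                  # hidden biases w^{(1)}_{0k}
    W2 = np.array([0.7, -0.5])                 # output weights w^{(2)}_k
    b2 = 0.2                                   # output bias w^{(2)}_0

    a1 = sigmoid(x @ W1 + b1)                  # hidden activations a^{(1)}_k
    y_hat = sigmoid(a1 @ W2 + b2)              # sigmoid output for a binary decision
    print(y_hat)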
Slope for linear y = mx + b is constant m.
Derivative of nonlinear f(x)=x^2 via limit: f'(x)=\lim_{\Delta x \to 0}\frac{(x+\Delta x)^2 - x^2}{\Delta x}=2x.
Partial derivatives for multivariate f(x,y)=x^3+y^2: \frac{\partial f}{\partial x}=3x^2, \frac{\partial f}{\partial y}=2y.
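A finite-difference sanity check of those partial derivatives (the evaluation point and step size h are arbitrary choices):

    def f(x, y):
        return x**3 + y**2

    x, y, h = 2.0, 3.0, 1e-6
    df_dx = (f(x + h, y) - f(x, y)) / h        # ≈ 3x^2 = 12
    df_dy = (f(x, y + h) - f(x, y)) / h        # ≈ 2y = 6
    print(df_dx, df_dy)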
Mean Squared Error (MSE): \text{MSE}=\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat y_i)^2.
Mean Absolute Error (MAE): \text{MAE}=\frac{1}{n}\sum_{i=1}^{n} |y_i-\hat y_i|.
Binary Cross-Entropy (Log-Loss): L = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big].
TensorFlow/Keras compile examples:
loss='mean_squared_error', 'mean_absolute_error', 'binary_crossentropy', 'sparse_categorical_crossentropy'.
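A sketch of how those loss strings are used, assuming a small stand-in tf.keras model (the optimiser choice and layer sizes are assumptions, not from the notes):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(2, activation='sigmoid'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])

    # Pick the loss to match the task:
    model.compile(optimizer='adam', loss='binary_crossentropy')               # binary classification
    # model.compile(optimizer='adam', loss='mean_squared_error')              # regression (MSE)
    # model.compile(optimizer='adam', loss='mean_absolute_error')             # regression (MAE)
    # model.compile(optimizer='adam', loss='sparse_categorical_crossentropy') # multi-class, integer labels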
Generic rule: w \leftarrow w - \eta \frac{\partial L}{\partial w}, b \leftarrow b - \eta \frac{\partial L}{\partial b} where \eta = learning rate.
For a logistic-style model \hat y=\sigma(z) with z=\mathbf{w}^T\mathbf{x}+b:
\frac{\partial L}{\partial w_j}=\frac{1}{n}\sum_{i=1}^{n} x_{ij}(\hat y_i-y_i).
\frac{\partial L}{\partial b}=\frac{1}{n}\sum_{i=1}^{n}(\hat y_i-y_i).
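A minimal NumPy gradient-descent loop wiring those two gradients into the update rule; the toy data, learning rate, and iteration count are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                 # n = 100 samples, m = 2 features
    y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy binary labels

    w, b, eta = np.zeros(2), 0.0, 0.1
    for _ in range(1000):
        y_hat = sigmoid(X @ w + b)
        grad_w = X.T @ (y_hat - y) / len(y)       # (1/n) Σ x_ij (ŷ_i - y_i)
        grad_b = np.mean(y_hat - y)               # (1/n) Σ (ŷ_i - y_i)
        w -= eta * grad_w                         # w ← w - η ∂L/∂w
        b -= eta * grad_b                         # b ← b - η ∂L/∂b
    print(w, b)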
Digital image = 3-D tensor H \times W \times C (e.g. 1080 \times 1080 \times 3 RGB).
Pixel intensities range [0,255] (often normalised to [0,1]).
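For example, normalising an 8-bit RGB image tensor to [0, 1] (the random array stands in for real image data):

    import numpy as np

    image = np.random.randint(0, 256, size=(1080, 1080, 3), dtype=np.uint8)  # H x W x C
    normalised = image.astype(np.float32) / 255.0                            # intensities now in [0, 1]
    print(image.shape, float(normalised.min()), float(normalised.max()))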
Classification – assign label; model may output class probabilities.
Regression – predict continuous value (e.g. steering angle).
Object Detection – locate & label bounding boxes.
Semantic Segmentation – per-pixel classification.
Dense layers on a 1920\times1080\times3 input would mean \sim 6.2 million input values; fully connecting them to even a modest layer requires billions of weights → impractical.
Dense networks treat distant pixels the same as neighbours and are sensitive to object translation.
Convolution layer: learn filters (kernels) shared spatially.
Example: 4\times4 filter (16 weights) slides with stride to generate feature map.
Operation: element-wise multiply + sum.
Parameter sharing provides sparsity & translation equivariance.
Non-linearity (ReLU) follows each convolution.
Pooling layer (e.g. max pool 2\times2, stride 2):
Downsamples, reduces computation, introduces spatial invariance, lowers overfitting.
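A tiny NumPy illustration of 2×2 max pooling with stride 2 on a made-up 4×4 feature map:

    import numpy as np

    feature_map = np.array([[1, 3, 2, 0],
                            [4, 2, 1, 5],
                            [0, 1, 7, 2],
                            [3, 2, 4, 6]])

    # Split into non-overlapping 2x2 windows and keep the strongest response in each
    pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)        # [[4 5]
                         #  [3 7]]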
Feature Hierarchy
Early layers → edges & corners.
Middle layers → motifs (eyes, wheels).
Deep layers → object parts / high-level semantics.
Complete pipeline: CONV → ReLU → POOL repeated → flatten → fully connected → Softmax.
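A hedged tf.keras sketch of that pipeline; the 28×28 greyscale input shape, filter counts, kernel sizes, and 10-class output are illustrative assumptions, not values from the notes.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),                 # assumed greyscale input
        tf.keras.layers.Conv2D(16, 3, activation='relu'),  # CONV + ReLU
        tf.keras.layers.MaxPooling2D(2),                   # POOL (2x2, stride 2)
        tf.keras.layers.Conv2D(32, 3, activation='relu'),  # CONV + ReLU
        tf.keras.layers.MaxPooling2D(2),                   # POOL
        tf.keras.layers.Flatten(),                         # flatten
        tf.keras.layers.Dense(64, activation='relu'),      # fully connected
        tf.keras.layers.Dense(10, activation='softmax'),   # Softmax over assumed 10 classes
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()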
Vertical line detector \begin{bmatrix}-1 & 1 & -1\\ -1 & 1 & -1\\ -1 & 1 & -1\end{bmatrix}.
Diagonal detector \begin{bmatrix}-1 & -1 & 1\\ -1 & 1 & -1\\ 1 & -1 & -1\end{bmatrix}.
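Sliding the vertical-line detector over a toy 5×5 image (element-wise multiply + sum at each position); the image itself is made up.

    import numpy as np

    kernel = np.array([[-1, 1, -1],
                       [-1, 1, -1],
                       [-1, 1, -1]])              # vertical line detector from above

    image = np.zeros((5, 5))
    image[:, 2] = 1                               # a vertical line down the middle column

    out_h = image.shape[0] - kernel.shape[0] + 1  # valid (no padding) output height
    out_w = image.shape[1] - kernel.shape[1] + 1
    feature_map = np.array([[np.sum(image[i:i+3, j:j+3] * kernel) for j in range(out_w)]
                            for i in range(out_h)])
    print(feature_map)  # strongest (positive) responses where the line aligns with the kernel centre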
Loopy pattern filter demonstrated for digit 9 recognition.
Combining detected sub-parts (eyes, nose, ears) → head → body → Koala classifier.
ReLU zeros out negative filter responses, producing sparse feature maps.
Max-pool keeps the strongest response even when the pattern shifts position (shift invariance).
Fully Convolutional Networks (FCN) – all-conv; downsample then upsample using deconvolution (Conv2DTranspose) to output pixel-wise predictions (segmentation).
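A small tf.keras sketch of that upsampling step: Conv2DTranspose with stride 2 doubles the spatial resolution of a feature map (the tensor shape and filter count are illustrative).

    import tensorflow as tf

    feature_map = tf.random.normal((1, 8, 8, 64))          # batch x H x W x channels from the downsampling path
    upsample = tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=3, strides=2, padding='same')
    print(upsample(feature_map).shape)                      # (1, 16, 16, 32): spatial size doubled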
R-CNN (Regions with CNN features)
1) Generate ~2k region proposals (Selective Search).
2) Warp each region and feed it into a CNN; 3) classify each region.
Slow & brittle (hand-crafted proposals).
Faster R-CNN
End-to-end network; backbone conv extracts feature map once.
Region Proposal Network (RPN) predicts bounding boxes + objectness.
ROI Pooling aligns proposals → shared classifier head.
Learned proposals, orders-of-magnitude speed improvement.
Convolution: sparse connectivity, weight sharing ⇒ fewer parameters, reduced overfitting.
ReLU: non-linearity, simple derivative, accelerates convergence.
Pooling: dimensionality reduction, computational savings, tolerance to small distortions.
Scalability: DL leverages data/hardware; but large models consume energy.
Interpretability: learned features outperform manual yet can be opaque.
Fairness: biases in big data can propagate through learned representations.
TensorFlow activation calls: tf.math.sigmoid(z), tf.math.tanh(z), tf.nn.relu(z).
PyTorch equivalents: torch.sigmoid(z), torch.tanh(z), torch.nn.ReLU().
Keras model compilation pairs an optimiser with a loss function appropriate to the task (see the compile examples above).
Neural networks map inputs → outputs via layers of linear transforms + non-linearities.
Training minimises loss via gradient descent; derivatives underpin updates.
CNNs specialise in vision: local receptive fields, shared filters, pooling.
Modern detection/segmentation models integrate learnable proposal or upsampling stages.
Toolchains (TensorFlow, PyTorch) abstract low-level math, letting practitioners focus on architecture & data.