
Machine Learning Algorithms – Vocabulary Review

Artificial Intelligence, Machine Learning, and Deep Learning

  • Artificial Intelligence (AI) – any technique enabling computers to mimic human behaviour.

  • Machine Learning (ML) – algorithms that let computers “learn” patterns without being explicitly programmed.

  • Deep Learning (DL) – subset of ML that extracts hierarchical patterns directly from raw data using neural networks.

    • Replaces hand-engineered features (time-consuming, non-scalable) with learned representations.

  • Three drivers of the current DL boom:

    • Big Data (abundant labelled & unlabelled data sets).

    • Hardware (GPUs/TPUs for massive parallelism).

    • Software (open-source libraries, e.g. TensorFlow & PyTorch).

Framework Ecosystem

  • TensorFlow – Google-backed, static graphs + tf.keras high-level API.

  • PyTorch – Dynamic computation graphs, pythonic, research-friendly.

Biological Inspiration & the Perceptron

  • Early work (Rosenblatt) viewed the brain as:

    • Mosaic sensory points → Projection areas → Association units → Response units.

  • Perceptron – mathematical abstraction of a neuron.

    • Takes inputs x_1,\dots,x_m with weights w_1,\dots,w_m and bias w_0.

    • Computes a weighted sum z = w_0 + \sum_{j=1}^{m} w_j x_j.

    • Applies non-linear activation \hat y = g(z).

Forward Propagation Mathematics

  • Compact vector form: \hat y = g(w_0 + \mathbf{x}^T \mathbf{w}) where \mathbf{x} \in \mathbb{R}^m, \mathbf{w}\in\mathbb{R}^m.

  • Common scalar example: \hat y = g(1 + 3x_1 - 2x_2) defines a decision line in 2-D.

  • Multi-output version stacks neurons: z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}, then \hat y_i = g(z_i) for each output i (see the sketch below).
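
  • A minimal NumPy sketch of the perceptron forward pass above; the input values, weights, and bias are illustrative, not from the notes:

      import numpy as np

      def perceptron_forward(x, w, w0, g):
          """Weighted sum plus bias, then non-linear activation: y_hat = g(w0 + x^T w)."""
          z = w0 + x @ w
          return g(z)

      sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

      # Mirrors the scalar example y_hat = g(1 + 3*x1 - 2*x2)
      x = np.array([0.5, 1.0])
      w = np.array([3.0, -2.0])
      y_hat = perceptron_forward(x, w, w0=1.0, g=sigmoid)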

Activation Functions

  • Sigmoid g(z)=\frac{1}{1+e^{-z}}, derivative g'(z)=g(z)(1-g(z)).

  • Hyperbolic Tangent g(z)=\tanh(z), derivative g'(z)=1-\tanh^2(z).

  • ReLU g(z)=\max(0,z), derivative g'(z)=0\; (z\le 0),\;1\; (z>0).

  • All introduce non-linearity, allowing networks to approximate complex functions; without them networks collapse into linear models.
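
  • A small NumPy sketch of the three activations and the derivatives quoted above, for reference:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def d_sigmoid(z):
          s = sigmoid(z)
          return s * (1.0 - s)                 # g'(z) = g(z)(1 - g(z))

      def tanh(z):
          return np.tanh(z)

      def d_tanh(z):
          return 1.0 - np.tanh(z) ** 2         # g'(z) = 1 - tanh^2(z)

      def relu(z):
          return np.maximum(0.0, z)

      def d_relu(z):
          return (z > 0).astype(float)         # 0 for z <= 0, 1 for z > 0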

Multilayer Perceptron (Single Hidden Layer)

  • Architecture example: 2 inputs + bias → 2 hidden neurons → 1 output neuron.

  • Forward pass:

    • Hidden: z^{(1)}_k = w^{(1)}_{0k} + \sum_j w^{(1)}_{jk} x_j, \quad a^{(1)}_k = g(z^{(1)}_k).

    • Output: z^{(2)} = w^{(2)}_0 + \sum_k w^{(2)}_k a^{(1)}_k, \quad \hat y = g_{\text{out}}(z^{(2)}).

  • Common output activations: Sigmoid for binary, Softmax for multi-class.
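
  • A minimal NumPy sketch of the single-hidden-layer forward pass above (2 inputs → 2 hidden → 1 output); the weight values are illustrative, not from the notes:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def mlp_forward(x, W1, b1, w2, b2):
          z1 = b1 + x @ W1          # hidden pre-activations z_k^(1)
          a1 = sigmoid(z1)          # hidden activations a_k^(1)
          z2 = b2 + a1 @ w2         # output pre-activation z^(2)
          return sigmoid(z2)        # sigmoid output, as for a binary task

      x  = np.array([1.0, 2.0])
      W1 = np.array([[0.1, -0.3],
                     [0.4,  0.2]])  # shape (2 inputs, 2 hidden units)
      b1 = np.array([0.0, 0.1])
      w2 = np.array([0.5, -0.5])
      b2 = 0.2
      y_hat = mlp_forward(x, W1, b1, w2, b2)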

Calculus Refresher

  • Slope for linear y = mx + b is constant m.

  • Derivative of nonlinear f(x)=x^2 via limit: f'(x)=\lim_{\Delta x \to 0}\frac{(x+\Delta x)^2 - x^2}{\Delta x}=2x.

  • Partial derivatives for multivariate f(x,y)=x^3+y^2: \frac{\partial f}{\partial x}=3x^2, \frac{\partial f}{\partial y}=2y.
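
  • A quick finite-difference check of the derivatives above (the step size is an arbitrary small value):

      def numeric_derivative(f, x, dx=1e-6):
          return (f(x + dx) - f(x)) / dx

      f = lambda x: x ** 2
      print(numeric_derivative(f, 3.0))                   # ~6.0, matching f'(x) = 2x

      g = lambda x, y: x ** 3 + y ** 2
      print((g(2.0 + 1e-6, 1.0) - g(2.0, 1.0)) / 1e-6)    # ~12.0 = 3x^2 at x = 2
      print((g(2.0, 1.0 + 1e-6) - g(2.0, 1.0)) / 1e-6)    # ~2.0  = 2y   at y = 1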

Loss / Cost Functions

  • Mean Squared Error (MSE): \text{MSE}=\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat y_i)^2.

  • Mean Absolute Error (MAE): \text{MAE}=\frac{1}{n}\sum_{i=1}^{n} |y_i-\hat y_i|.

  • Binary Cross-Entropy (Log-Loss): L = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big].

  • TensorFlow/Keras compile examples:

    • loss='mean_squared_error', 'mean_absolute_error', 'binary_crossentropy', 'sparse_categorical_crossentropy'.
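
  • A NumPy sketch of the three losses above; inputs are arrays of targets y and predictions y_hat (the clipping constant is an assumption to avoid log(0)):

      import numpy as np

      def mse(y, y_hat):
          return np.mean((y - y_hat) ** 2)

      def mae(y, y_hat):
          return np.mean(np.abs(y - y_hat))

      def binary_cross_entropy(y, y_hat, eps=1e-12):
          y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
          return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))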

Gradient Descent Parameter Updates

  • Generic rule: w \leftarrow w - \eta \frac{\partial L}{\partial w}, b \leftarrow b - \eta \frac{\partial L}{\partial b} where \eta = learning rate.

  • For a logistic-style model \hat y=\sigma(z) with z=\mathbf{w}^T\mathbf{x}+b and binary cross-entropy loss:

    • \frac{\partial L}{\partial w_j}=\frac{1}{n}\sum_{i=1}^{n} x_{ij}(\hat y_i-y_i).

    • \frac{\partial L}{\partial b}=\frac{1}{n}\sum_{i=1}^{n}(\hat y_i-y_i).
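
  • A NumPy sketch of one gradient-descent update using the gradients above; X, y, and the learning rate are illustrative placeholders:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def gradient_step(X, y, w, b, lr=0.1):
          """X: (n, m) inputs, y: (n,) binary targets, w: (m,) weights, b: scalar bias."""
          y_hat = sigmoid(X @ w + b)
          grad_w = X.T @ (y_hat - y) / len(y)    # dL/dw_j = (1/n) sum_i x_ij (y_hat_i - y_i)
          grad_b = np.mean(y_hat - y)            # dL/db   = (1/n) sum_i (y_hat_i - y_i)
          return w - lr * grad_w, b - lr * grad_b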

Images as Numerical Matrices

  • Digital image = 3-D tensor H \times W \times C (e.g. 1080 \times 1080 \times 3 RGB).

  • Pixel intensities range [0,255] (often normalised to [0,1]).
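
  • A short NumPy sketch of an image as an H × W × C tensor and its normalisation to [0, 1] (the image here is random, purely for illustration):

      import numpy as np

      img = np.random.randint(0, 256, size=(1080, 1080, 3), dtype=np.uint8)  # H x W x C, values in [0, 255]
      img_normalised = img.astype(np.float32) / 255.0                        # scale to [0, 1]
      print(img.shape, img_normalised.min(), img_normalised.max())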

Computer Vision Tasks

  • Classification – assign label; model may output class probabilities.

  • Regression – predict continuous value (e.g. steering angle).

  • Object Detection – locate & label bounding boxes.

  • Semantic Segmentation – per-pixel classification.

Why Not Plain ANN for Images?

  • Flattening a 1920\times1080\times3 input yields \sim 6.2 million values; even a modest fully connected layer on top would need billions of weights → impractical (see the sketch after this list).

  • Dense networks treat distant pixels the same as neighbours and are sensitive to object translation.
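
  • A back-of-the-envelope parameter count; the 1000-unit hidden layer is a hypothetical choice for illustration:

      inputs = 1920 * 1080 * 3        # ~6.2 million input values after flattening
      hidden = 1000                   # a modest, hypothetical first dense layer
      weights = inputs * hidden       # ~6.2 billion weights, before biases
      print(inputs, weights)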

Convolutional Neural Networks (CNNs)

  • Convolution layer: learn filters (kernels) shared spatially.

    • Example: a 4\times4 filter (16 weights) slides across the input with a chosen stride to generate a feature map.

    • Operation: element-wise multiply + sum.

  • Parameter sharing provides sparse connectivity & translation equivariance.

  • Non-linearity (ReLU) follows each convolution.

  • Pooling layer (e.g. max pool 2\times2, stride 2):

    • Downsamples, reduces computation, introduces spatial invariance, lowers overfitting.

  • Feature Hierarchy

    • Early layers → edges & corners.

    • Middle layers → motifs (eyes, wheels).

    • Deep layers → object parts / high-level semantics.

  • Complete pipeline: CONV → ReLU → POOL repeated → flatten → fully connected → Softmax.
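
  • A minimal tf.keras sketch of that CONV → ReLU → POOL → flatten → fully connected → Softmax pipeline; the input size, filter counts, and 10-class output are illustrative assumptions:

      import tensorflow as tf

      model = tf.keras.Sequential([
          tf.keras.layers.Input(shape=(32, 32, 3)),                      # H x W x C input
          tf.keras.layers.Conv2D(16, kernel_size=3, activation='relu'),  # CONV + ReLU
          tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),          # POOL
          tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
          tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(64, activation='relu'),                  # fully connected
          tf.keras.layers.Dense(10, activation='softmax'),               # class probabilities
      ])
      model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])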

Practical Filter Examples (ASCII)

  • Vertical line detector \begin{bmatrix}-1 & 1 & -1\\ -1 & 1 & -1\\ -1 & 1 & -1\end{bmatrix}.

  • Diagonal detector \begin{bmatrix}-1 & -1 & 1\\ -1 & 1 & -1\\ 1 & -1 & -1\end{bmatrix}.

  • Loopy pattern filter demonstrated for digit 9 recognition.
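
  • A naive NumPy sketch of sliding the vertical-line filter above over a tiny image (the element-wise multiply + sum described earlier); the 5×5 test image is illustrative:

      import numpy as np

      vertical_filter = np.array([[-1,  1, -1],
                                  [-1,  1, -1],
                                  [-1,  1, -1]])

      def conv2d_valid(image, kernel):
          """'Valid' sliding window: element-wise multiply + sum at every position."""
          kh, kw = kernel.shape
          h, w = image.shape
          out = np.zeros((h - kh + 1, w - kw + 1))
          for i in range(out.shape[0]):
              for j in range(out.shape[1]):
                  out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
          return out

      img = np.zeros((5, 5))
      img[:, 2] = 1.0                                # a vertical line in the middle column
      response = conv2d_valid(img, vertical_filter)  # strongest where the line aligns with the filter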

Representation Learning Demonstrations

  • Combining detected sub-parts (eyes, nose, ears) → head → body → Koala classifier.

  • ReLU zeros out negative filter responses, producing sparse feature maps.

  • Max-pooling keeps the strongest response regardless of its exact position (shift invariance).

Advanced CNN Architectures for Vision

  • Fully Convolutional Networks (FCN) – all-convolutional; downsample then upsample using transposed convolution (Conv2DTranspose) to output pixel-wise predictions for segmentation (see the upsampling sketch after this list).

  • R-CNN (Regions with CNN features)

    • 1) Generate ~2k region proposals (Selective Search).

    • 2) Warp each region and feed it into a CNN; 3) classify each region.

    • Slow & brittle (hand-crafted proposals).

  • Faster R-CNN

    • End-to-end network; backbone conv extracts feature map once.

    • Region Proposal Network (RPN) predicts bounding boxes + objectness.

    • ROI Pooling aligns proposals → shared classifier head.

    • Learned proposals, orders-of-magnitude speed improvement.
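
  • A minimal tf.keras sketch of the FCN-style downsample-then-upsample idea above; the layer sizes and 2-class output are illustrative assumptions, not a full segmentation model:

      import tensorflow as tf

      fcn = tf.keras.Sequential([
          tf.keras.layers.Input(shape=(64, 64, 3)),
          tf.keras.layers.Conv2D(16, 3, strides=2, padding='same', activation='relu'),            # 64x64 -> 32x32
          tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),            # 32x32 -> 16x16
          tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding='same', activation='relu'),   # 16x16 -> 32x32
          tf.keras.layers.Conv2DTranspose(2, 3, strides=2, padding='same', activation='softmax')  # 32x32 -> 64x64, per-pixel classes
      ])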

Convolution, ReLU & Pooling – Combined Benefits

  • Convolution: sparse connectivity, weight sharing ⇒ fewer parameters, reduced overfitting.

  • ReLU: non-linearity, simple derivative, accelerates convergence.

  • Pooling: dimensionality reduction, computational savings, tolerance to small distortions.

Ethical & Practical Considerations

  • Scalability: DL leverages data/hardware; but large models consume energy.

  • Interpretability: learned features outperform manual yet can be opaque.

  • Fairness: biases in big data can propagate through learned representations.

Numerical & Code Snippets (Framework Agnostic)

  • TensorFlow code blocks for activations: tf.math.sigmoid(z), tf.math.tanh(z), tf.nn.relu(z).

  • PyTorch equivalents: torch.sigmoid(z), torch.tanh(z), torch.nn.ReLU().

  • Model compilation examples in Keras shown for different loss functions.
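
  • A minimal Keras compile sketch along those lines; the toy model and optimizer are illustrative assumptions, only the loss strings come from the notes above:

      import tensorflow as tf

      model = tf.keras.Sequential([
          tf.keras.layers.Input(shape=(4,)),
          tf.keras.layers.Dense(1, activation='sigmoid'),
      ])

      model.compile(optimizer='sgd', loss='mean_squared_error')      # regression
      model.compile(optimizer='sgd', loss='mean_absolute_error')     # regression, robust to outliers
      model.compile(optimizer='sgd', loss='binary_crossentropy')     # binary classification (sigmoid output)
      # 'sparse_categorical_crossentropy' pairs with a softmax output of >= 2 units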

Key Takeaways

  • Neural networks map inputs → outputs via layers of linear transforms + non-linearities.

  • Training minimises loss via gradient descent; derivatives underpin updates.

  • CNNs specialise in vision: local receptive fields, shared filters, pooling.

  • Modern detection/segmentation models integrate learnable proposal or upsampling stages.

  • Toolchains (TensorFlow, PyTorch) abstract low-level math, letting practitioners focus on architecture & data.