Machine Learning Algorithms – Comprehensive Bullet-Point Notes

Introduction to Artificial Intelligence, Machine Learning & Deep Learning

  • Artificial Intelligence (AI)
    • Any computational technique that enables computers to mimic human behavior or cognition.
  • Machine Learning (ML)
    • Sub-field of AI that provides systems with the ability to learn from data without being explicitly programmed.
    • Learns mapping \mathbf x \;\rightarrow\; y by minimizing a cost function over examples.
  • Deep Learning (DL)
    • Subset of ML that extracts hierarchical patterns directly from raw data with neural networks.
    • Replaces labor-intensive hand-engineered features with learned representations.
  • Why the boom “Now”
    • Big Data: unprecedented volumes & variety.
    • Hardware: GPUs/TPUs enable massive parallelism.
    • Software: high-level, open-source libraries (e.g., TensorFlow, PyTorch) drastically lower the entry barrier.

Machine-Learning Frameworks & Tooling

  • TensorFlow & PyTorch explicitly highlighted as the primary frameworks.
  • Provide auto-differentiation, optimized kernels, and deployment tooling.
  • Code snippets referenced:
    • tf.math.sigmoid(z), torch.sigmoid(z)
    • tf.nn.relu(z), torch.nn.ReLU()
    • model.compile(optimizer='adam', loss='…', metrics=['accuracy'])
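  • A minimal Python sketch putting the snippets above together (illustrative only; the elided loss string in the notes is filled in here with 'binary_crossentropy' purely as an assumption):

      import tensorflow as tf
      import torch

      z = 0.5
      print(tf.math.sigmoid(tf.constant(z)))      # TensorFlow sigmoid
      print(torch.sigmoid(torch.tensor(z)))       # PyTorch sigmoid
      print(tf.nn.relu(tf.constant(-1.0)))        # ReLU clips negatives to 0
      print(torch.nn.ReLU()(torch.tensor(-1.0)))  # module form of ReLU

      # Keras compile call as referenced in the notes; the loss choice is assumed.
      model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
      model.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])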

Biological Inspiration & The Perceptron

  • FIG. 1: activity map of biological brain (topographic & random connections, feedback loops).
  • Perceptron (FIG. 2)
    • Simplified computational model of a biological neuron.
    • Components:
      • Inputs x_1,\dots,x_m
      • Weights w_1,\dots,w_m & bias w_0
      • Linear combination z = w_0 + \sum_{i=1}^{m} x_i w_i
      • Non-linear activation \hat y = g(z)
    • Foundation for multilayer perceptron (MLP) / feed-forward neural networks.

Forward Propagation in a Perceptron

  • Without bias: \hat y = g\big( \sum_i x_i w_i \big)
  • With bias (more general): \hat y = g\big( w_0 + \mathbf x^\top \mathbf w \big)
  • Activation options (sigmoid example)
    • g(z)=\frac1{1+e^{-z}} (smoothly squashes to 0–1)
  • Multi-output extension
    • For each output neuron j
      z_j = w_{0j}+\sum_{i=1}^{m} x_i w_{ij}, \quad \hat y_j = g(z_j)
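  • A NumPy sketch of this multi-output forward pass (weight values and sizes below are illustrative assumptions):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def forward(x, W, w0, g=sigmoid):
          # z_j = w_{0j} + sum_i x_i * w_{ij};  y_hat_j = g(z_j)
          z = w0 + x @ W            # shape: (n_outputs,)
          return g(z)

      x  = np.array([1.0, -2.0, 0.5])            # m = 3 inputs
      W  = np.array([[0.2, -0.1],
                     [0.4,  0.3],
                     [-0.5, 0.8]])               # shape (m, n_outputs)
      w0 = np.array([0.1, -0.2])                 # biases w_{0j}
      print(forward(x, W, w0))                   # two activations in (0, 1)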

Common Activation Functions (all non-linear)

  • Sigmoid: g(z)=\frac1{1+e^{-z}}, \quad g'(z)=g(z)(1-g(z))
  • Hyperbolic Tangent: g(z)=\tanh(z), \quad g'(z)=1-g(z)^2
  • ReLU: g(z)=\max(0,z), \quad g'(z)=\begin{cases}1 & z>0\\0 & z\le 0\end{cases}
  • Importance
    • Introduce non-linearity → network can approximate arbitrary continuous functions (universal approximation theorem).
    • Help mitigate vanishing gradients (ReLU better than sigmoid/tanh in deep nets).
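  • Short NumPy sketch of these activations and their derivatives:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def d_sigmoid(z):
          s = sigmoid(z)
          return s * (1.0 - s)            # g'(z) = g(z) * (1 - g(z))

      def d_tanh(z):
          return 1.0 - np.tanh(z) ** 2    # g'(z) = 1 - tanh(z)^2

      def relu(z):
          return np.maximum(0.0, z)

      def d_relu(z):
          return (z > 0).astype(float)    # 1 for z > 0, else 0

      z = np.linspace(-3.0, 3.0, 7)
      print(sigmoid(z).round(3), d_sigmoid(z).round(3))
      print(np.tanh(z).round(3), d_tanh(z).round(3))
      print(relu(z), d_relu(z))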

Worked Perceptron Example (2-D Classification)

  • Parameters
    • w_0=1, \; w_1 = 3, \; w_2 = -2
    • Decision function: \hat y = g(1 + 3x_1 - 2x_2)
  • Decision boundary (z=0): 1 + 3x_1 - 2x_2 = 0 (a straight line in \mathbb R^2).
  • Sample input \mathbf x=[-1,\;2]^\top
    • z = 1 + 3(-1) - 2(2) = -6
    • \hat y = g(-6) \approx 0.002 → classified as negative (<0.5).
  • Region interpretation:
    • z>0 \;\Rightarrow\; \hat y>0.5 (positive class)
    • z<0 \;\Rightarrow\; \hat y<0.5 (negative class)
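  • Reproducing the arithmetic of this example in Python:

      import math

      w0, w1, w2 = 1.0, 3.0, -2.0
      x1, x2 = -1.0, 2.0

      z = w0 + w1 * x1 + w2 * x2            # 1 + 3*(-1) - 2*2 = -6
      y_hat = 1.0 / (1.0 + math.exp(-z))    # sigmoid(-6) ≈ 0.0025
      print(z, y_hat, y_hat < 0.5)          # negative class since y_hat < 0.5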

Single-Layer (One-Hidden-Layer) Neural Network (MLP)

  • Layers
    1. Input: \mathbf x \in \mathbb R^m
    2. Hidden: \mathbf z^{(1)} = W^{(1)} \mathbf x + \mathbf b^{(1)}, \; \mathbf a^{(1)} = g(\mathbf z^{(1)})
    3. Output: \mathbf z^{(2)} = W^{(2)} \mathbf a^{(1)} + \mathbf b^{(2)}, \; \hat{\mathbf y} = g_{out}(\mathbf z^{(2)})
  • Example indexing (second hidden unit): z_2^{(1)} = w_{0,2}^{(1)}+\sum_{j=1}^{m} x_j w_{j,2}^{(1)}
  • Forward pass called “forward propagation.”
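  • The same one-hidden-layer architecture written in Keras (layer widths and the input size m are illustrative assumptions, not from the notes):

      import tensorflow as tf

      m = 4                                   # number of input features (assumed)
      model = tf.keras.Sequential([
          tf.keras.Input(shape=(m,)),
          tf.keras.layers.Dense(8, activation='relu'),     # hidden: a = g(Wx + b)
          tf.keras.layers.Dense(1, activation='sigmoid'),  # output layer
      ])
      model.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])
      model.summary()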

Calculus Refresher – Derivatives & Gradients

  • Scalar function slope vs derivative
    • For f(x)=x^2:
      • f(x+\Delta x)= (x+\Delta x)^2 = x^2+2x\Delta x + (\Delta x)^2
      • Slope =\dfrac{f(x+\Delta x)-f(x)}{\Delta x}=2x+\Delta x
      • As \Delta x \to 0 → derivative f'(x)=2x.
  • Partial derivatives (multivariate)
    • f(x,y)=x^3+y^2
    • \tfrac{\partial f}{\partial x}=3x^2, \; \tfrac{\partial f}{\partial y}=2y.
  • Gradient \nabla f=(\partial f/\partial x, \partial f/\partial y,\dots) guides optimization.
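  • A quick numerical check of these derivatives with finite differences (step size h is an arbitrary small value):

      def numerical_slope(f, x, h=1e-6):
          # (f(x + h) - f(x)) / h  ->  f'(x) as h -> 0
          return (f(x + h) - f(x)) / h

      # f(x) = x^2  ->  f'(x) = 2x
      print(numerical_slope(lambda x: x ** 2, 3.0))        # ≈ 6.0

      # f(x, y) = x^3 + y^2: partial derivatives at (x, y) = (2, 5)
      f = lambda x, y: x ** 3 + y ** 2
      print(numerical_slope(lambda x: f(x, 5.0), 2.0))     # df/dx = 3x^2 ≈ 12
      print(numerical_slope(lambda y: f(2.0, y), 5.0))     # df/dy = 2y  ≈ 10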

Loss / Cost Functions

  • Mean Squared Error (MSE) for a single output
    • Individual loss: L=\tfrac12 (y-\hat y)^2
    • Dataset: \text{MSE}=\tfrac1n\sum_{i=1}^n (y_i-\hat y_i)^2.
  • Mean Absolute Error (MAE)
    • \text{MAE}=\tfrac1n\sum_{i=1}^n |y_i-\hat y_i|.
  • Binary Cross-Entropy / Log-Loss
    • \text{BCE}= -\tfrac1n\sum_{i=1}^n \big[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big].
  • Categorical Cross-Entropy, Sparse Categorical Cross-Entropy for multi-class tasks.
  • TensorFlow/Keras compile examples:
    • loss='mean_absolute_error' | 'mean_squared_error' | 'binary_crossentropy'.
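  • The three losses written out in NumPy for a small batch (the example labels and predictions are assumptions):

      import numpy as np

      y     = np.array([1.0, 0.0, 1.0, 0.0])        # true labels
      y_hat = np.array([0.9, 0.2, 0.7, 0.1])        # predicted probabilities

      mse = np.mean((y - y_hat) ** 2)
      mae = np.mean(np.abs(y - y_hat))
      eps = 1e-12                                   # guards against log(0)
      bce = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
      print(mse, mae, bce)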

Logistic Regression Example – Predicting Insurance Purchase

  • Features: Age (x_1), Affordability (x_2, a 0/1 flag for whether the person can afford insurance).
  • Model: linear score z = w_1 x_1 + w_2 x_2 + b; predicted probability \hat y=\sigma(z)=\frac1{1+e^{-z}}.
    • Example chosen weights: w_1=1, \; w_2=1, \; b=0 → for (x_1=22, \; x_2=1), z=23 ⇒ \hat y\approx0.99 (predicted: very likely to buy).
  • Squared error for this instance: (y-\hat y)^2=(0-0.99)^2=0.9801, since the true label is y=0 → huge error ⇒ motivates learning the weights (replayed in the sketch below).
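  • Replaying that single instance numerically (weights, inputs, and the true label y = 0 are taken from the notes; with unscaled age the sigmoid saturates to ≈ 1.0, which the notes round to 0.99):

      import math

      w1, w2, b = 1.0, 1.0, 0.0
      x1, x2    = 22.0, 1.0          # Age, Affordability
      y_true    = 0.0                # this person did not buy insurance

      z     = w1 * x1 + w2 * x2 + b          # 23
      y_hat = 1.0 / (1.0 + math.exp(-z))     # sigmoid(23) ≈ 1.0 (≈ 0.99 in the notes)
      error = (y_true - y_hat) ** 2          # ≈ 0.98: large error -> learn better weights
      print(z, y_hat, error)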

Gradient Descent Optimization

  • Goal: minimize cost J(w_1,w_2,b).
  • Parameter updates:
    • w_1 \leftarrow w_1 - \eta \; \tfrac{\partial J}{\partial w_1}
    • w_2 \leftarrow w_2 - \eta \; \tfrac{\partial J}{\partial w_2}
    • b \leftarrow b - \eta \; \tfrac{\partial J}{\partial b}
  • Visual intuition: descending on cost surface toward global (or local) minima (illustrated on final slide).
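  • A minimal gradient-descent loop for the two-feature logistic-regression setup above, using sigmoid + binary cross-entropy so the gradients reduce to (ŷ − y)·x (learning rate, epoch count, and the toy data are assumptions):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      # Toy data: [age / 10, affordability] -> bought insurance? (assumed values)
      X = np.array([[2.2, 1.0], [2.5, 0.0], [4.7, 1.0], [5.2, 1.0], [1.8, 0.0]])
      y = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

      w, b, eta = np.zeros(2), 0.0, 0.1        # weights [w1, w2], bias, learning rate

      for epoch in range(1000):
          y_hat  = sigmoid(X @ w + b)
          grad_w = (y_hat - y) @ X / len(y)    # dJ/dw for sigmoid + cross-entropy
          grad_b = np.mean(y_hat - y)          # dJ/db
          w -= eta * grad_w                    # w <- w - eta * dJ/dw
          b -= eta * grad_b                    # b <- b - eta * dJ/db

      print(w, b, sigmoid(X @ w + b).round(2))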

Ethical, Practical, & Real-World Notes

  • Eliminating hand-engineered features increases scalability but reduces interpretability.
  • Availability of big data & powerful hardware democratizes AI, but raises privacy & energy-consumption concerns.
  • Loss-function choice impacts robustness (e.g., MAE is less sensitive to outliers than MSE).
  • Activation-function choice affects training stability (ReLU helps mitigate vanishing gradients, but susceptible to “dying ReLUs”).