Machine Learning Algorithms – Comprehensive Bullet-Point Notes
Introduction to Artificial Intelligence, Machine Learning & Deep Learning
- Artificial Intelligence (AI)
- Any computational technique that enables computers to mimic human behavior or cognition.
- Machine Learning (ML)
- Sub-field of AI that provides systems with the ability to learn from data without being explicitly programmed.
- Learns mapping \mathbf x \;\rightarrow\; y by minimizing a cost function over examples.
- Deep Learning (DL)
- Subset of ML that extracts hierarchical patterns directly from raw data with neural networks.
- Replaces labor-intensive hand-engineered features with learned representations.
- Why the boom is happening now
- Big Data: unprecedented volumes & variety.
- Hardware: GPUs/TPUs enable massive parallelism.
- Software: high-level, open-source libraries (e.g., TensorFlow, PyTorch) drastically lower the entry barrier.
- TensorFlow & PyTorch are explicitly highlighted on the slide.
- Provide auto-differentiation, optimized kernels, and deployment tooling.
- Code snippets referenced (see the sketch below):
- tf.math.sigmoid(z), torch.sigmoid(z)
- tf.nn.relu(z), torch.nn.ReLU()
- model.compile(optimizer='adam', loss='…', metrics=['accuracy'])
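A minimal runnable sketch of these calls (assuming TensorFlow is installed; the tiny model shape and sample tensor are illustrative assumptions, not from the slides):

```python
import tensorflow as tf

# Element-wise activations referenced above
z = tf.constant([-2.0, 0.0, 3.0])
print(tf.math.sigmoid(z).numpy())   # squashes each value into (0, 1)
print(tf.nn.relu(z).numpy())        # zeroes out negative values

# Tiny Keras model compiled as in the slide snippet
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

The PyTorch equivalents (torch.sigmoid(z), torch.nn.ReLU()) apply the same element-wise functions.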
Biological Inspiration & The Perceptron
- FIG. 1: activity map of biological brain (topographic & random connections, feedback loops).
- Perceptron (FIG. 2)
- Simplified computational model of a biological neuron.
- Components:
- Inputs x_1,\dots,x_m
- Weights w_1,\dots,w_m & bias w_0
- Linear combination z = w_0 + \sum_{i=1}^{m} x_i w_i
- Non-linear activation \hat y = g(z)
- Foundation for multilayer perceptron (MLP) / feed-forward neural networks.
Forward Propagation in a Perceptron
- Without bias: \hat y = g\big( \sum_i x_i w_i \big)
- With bias (more general): \hat y = g\big( w_0 + \mathbf x^\top \mathbf w \big)
- Activation options (sigmoid example)
- g(z)=\frac1{1+e^{-z}} (smoothly squashes to 0–1)
- Multi-output extension
- For each output neuron j (see the NumPy sketch below):
z_j = w_{0j}+\sum_{i=1}^{m} x_i w_{ij}, \quad y_j = g(z_j)
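A minimal NumPy sketch of forward propagation for a single perceptron and for the multi-output extension (all weights and inputs below are made-up illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single perceptron: y_hat = g(w0 + x . w)
x  = np.array([0.5, -1.0, 2.0])    # m = 3 inputs (illustrative)
w  = np.array([0.1,  0.4, -0.3])   # weights w_1..w_m
w0 = 0.2                           # bias
y_hat = sigmoid(w0 + x @ w)

# Multi-output layer: z_j = w_{0j} + sum_i x_i w_{ij}, y_j = g(z_j)
W = np.array([[0.1, -0.2],
              [0.4,  0.3],
              [-0.3, 0.5]])        # shape (m, n_outputs)
b = np.array([0.2, -0.1])          # one bias per output neuron
y_vec = sigmoid(b + x @ W)
print(y_hat, y_vec)
```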
Common Activation Functions (all non-linear)
- Sigmoid: g(z)=\frac1{1+e^{-z}}, \quad g'(z)=g(z)(1-g(z))
- Hyperbolic Tangent: g(z)=\tanh(z), \quad g'(z)=1-g(z)^2
- ReLU: g(z)=\max(0,z), \quad g'(z)=\begin{cases}1 & z>0\\0 & z\le0\end{cases}
- Importance
- Introduce non-linearity → network can approximate arbitrary functions (universal approximation theorem).
- Help mitigate vanishing gradients (ReLU better than sigmoid/tanh in deep nets).
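A short NumPy sketch evaluating the three activations and their derivatives at a few arbitrary sample points:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

sig   = 1.0 / (1.0 + np.exp(-z))
dsig  = sig * (1.0 - sig)          # g'(z) = g(z)(1 - g(z))

tanh  = np.tanh(z)
dtanh = 1.0 - tanh ** 2            # g'(z) = 1 - g(z)^2

relu  = np.maximum(0.0, z)
drelu = (z > 0).astype(float)      # 1 for z > 0, else 0

print(sig, dsig, tanh, dtanh, relu, drelu, sep="\n")
```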
Worked Perceptron Example (2-D Classification)
- Parameters
- w_0=1, \; w_1 = 3, \; w_2 = -2
- Decision function: \hat y = g(1 + 3x_1 - 2x_2)
- Decision boundary (z=0): 1 + 3x_1 - 2x_2 = 0 (a straight line in \mathbb R^2).
- Sample input \mathbf x=[-1,\;2]^\top
- z = 1 + 3(-1) - 2(2) = -6
- \hat y = g(-6) \approx 0.002 → classified as negative (<0.5).
- Region interpretation:
- z>0 \;\Rightarrow\; \hat y>0.5 (positive class)
- z<0 \;\Rightarrow\; \hat y<0.5 (negative class)
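The arithmetic of this worked example can be checked directly (a quick sketch using exactly the weights above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w1, w2 = 1.0, 3.0, -2.0
x1, x2 = -1.0, 2.0

z = w0 + w1 * x1 + w2 * x2   # 1 + 3(-1) - 2(2) = -6
y_hat = sigmoid(z)            # ≈ 0.002 -> negative class (< 0.5)
print(z, y_hat)
```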
Single-Layer (One-Hidden-Layer) Neural Network (MLP)
- Layers
- Input: \mathbf x \in \mathbb R^m
- Hidden: \mathbf z^{(1)} = W^{(1)} \mathbf x + \mathbf b^{(1)}, \; \mathbf a^{(1)} = g(\mathbf z^{(1)})
- Output: \mathbf z^{(2)} = W^{(2)} \mathbf a^{(1)} + \mathbf b^{(2)}, \; \hat{\mathbf y} = g_{out}(\mathbf z^{(2)})
- Example indexing: z_2 = w_{0,2}+\sum_{j=1}^{m} x_j w_{j,2}^{(1)}
- Forward pass called “forward propagation.”
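A minimal NumPy forward pass through one hidden layer (layer sizes, random weights, and the use of sigmoid for both layers are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
x = np.array([0.5, -1.0, 2.0])        # input, m = 3

# Hidden layer: z1 = W1 x + b1, a1 = g(z1)
W1 = np.random.randn(4, 3) * 0.1      # 4 hidden units
b1 = np.zeros(4)
a1 = sigmoid(W1 @ x + b1)

# Output layer: z2 = W2 a1 + b2, y_hat = g_out(z2)
W2 = np.random.randn(1, 4) * 0.1
b2 = np.zeros(1)
y_hat = sigmoid(W2 @ a1 + b2)
print(y_hat)
```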
Calculus Refresher – Derivatives & Gradients
- Scalar function slope vs derivative
- For f(x)=x^2:
- f(x+\Delta x)= (x+\Delta x)^2 = x^2+2x\Delta x + (\Delta x)^2
- Slope =\dfrac{f(x+\Delta x)-f(x)}{\Delta x}=2x+\Delta x
- As \Delta x \to 0 → derivative f'(x)=2x.
- Partial derivatives (multivariate)
- f(x,y)=x^3+y^2
- \tfrac{\partial f}{\partial x}=3x^2, \; \tfrac{\partial f}{\partial y}=2y.
- Gradient \nabla f=(\partial f/\partial x, \partial f/\partial y,\dots) guides optimization.
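A quick finite-difference check of these partial derivatives (the evaluation point and step size are arbitrary):

```python
def f(x, y):
    return x ** 3 + y ** 2

x0, y0, h = 2.0, 3.0, 1e-6

# Finite-difference approximations of the partial derivatives
df_dx = (f(x0 + h, y0) - f(x0, y0)) / h   # ≈ 3 * x0**2 = 12
df_dy = (f(x0, y0 + h) - f(x0, y0)) / h   # ≈ 2 * y0    = 6
print(df_dx, df_dy)
```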
Loss / Cost Functions
- Mean Squared Error (MSE) for a single output
- Individual loss: L=\tfrac12 (y-\hat y)^2
- Dataset: \text{MSE}=\tfrac1n\sum_{i=1}^n (y_i-\hat y_i)^2.
- Mean Absolute Error (MAE)
- \text{MAE}=\tfrac1n\sum_{i=1}^n |y_i-\hat y_i|.
- Binary Cross-Entropy / Log-Loss
- \text{BCE}= -\tfrac1n\sum_{i=1}^n \big[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big].
- Categorical Cross-Entropy, Sparse Categorical Cross-Entropy for multi-class tasks.
- TensorFlow/Keras compile examples: loss='mean_absolute_error' | 'mean_squared_error' | 'binary_crossentropy'.
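A small NumPy sketch computing MSE, MAE, and binary cross-entropy on made-up labels and predictions (the clipping constant eps is an assumption to avoid log(0)):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])   # illustrative model outputs

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

eps = 1e-12                                # avoid log(0)
p   = np.clip(y_pred, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse, mae, bce)
```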
Logistic Regression Example – Predicting Insurance Purchase
- Features: Age (x_1), Affordability (x_2, a 0/1 indicator of whether the person can afford insurance).
- Model: y = w_1 x_1 + w_2 x_2 + b; probability z = \sigma(y) = \tfrac{1}{1+e^{-y}}.
- Example chosen weights: w_1=1, w_2=1, b=0 → for (x_1=22, x_2=1), y=23 ⇒ z = \sigma(23) \approx 0.99 (very likely to buy).
- Squared error for this instance: the true label for this sample is 0, so (0-0.99)^2=0.9801, a huge error that motivates learning (see the sketch below).
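The numbers in this example can be reproduced directly (a sketch using the stated weights; note that sigmoid(23) is essentially 1, which the slide rounds to 0.99):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1, w2, b = 1.0, 1.0, 0.0
x1, x2 = 22.0, 1.0                 # age, affordability
y_true = 0.0                       # this person did not buy insurance

y_lin  = w1 * x1 + w2 * x2 + b     # 23
y_hat  = sigmoid(y_lin)            # ≈ 1.0 (slide rounds to 0.99)
sq_err = (y_true - y_hat) ** 2     # ≈ 1.0; with the rounded 0.99 the slide gets 0.9801
print(y_lin, y_hat, sq_err)
```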
Gradient Descent Optimization
- Goal: minimize cost J(w_1,w_2,b).
- Parameter updates:
- w_1 \leftarrow w_1 - \eta \; \tfrac{\partial J}{\partial w_1}
- w_2 \leftarrow w_2 - \eta \; \tfrac{\partial J}{\partial w_2}
- b \leftarrow b - \eta \; \tfrac{\partial J}{\partial b}
- Visual intuition: descending the cost surface toward a global (or local) minimum (illustrated on the final slide; see the sketch below).
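A minimal gradient-descent sketch for the logistic-regression setting above, using the squared-error cost from earlier (the learning rate, epoch count, and tiny dataset are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny illustrative dataset: (age, affordability) -> bought insurance
X = np.array([[22.0, 1.0], [25.0, 0.0], [47.0, 1.0], [52.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)        # w1, w2
b = 0.0
eta = 0.01             # learning rate (assumed)

for epoch in range(5000):
    y_hat = sigmoid(X @ w + b)
    # Gradients of the cost J = mean((y_hat - y)^2) via the chain rule
    grad_z = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    dJ_dw = X.T @ grad_z / len(y)
    dJ_db = grad_z.mean()
    # Parameter updates: w <- w - eta * dJ/dw, b <- b - eta * dJ/db
    w -= eta * dJ_dw
    b -= eta * dJ_db

print(w, b, sigmoid(X @ w + b))
```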
Ethical, Practical, & Real-World Notes
- Eliminating hand-engineered features increases scalability but reduces interpretability.
- Availability of big data & powerful hardware democratizes AI, but raises privacy & energy-consumption concerns.
- Loss-function choice impacts robustness (e.g., MAE is less sensitive to outliers than MSE).
- Activation-function choice affects training stability (ReLU helps mitigate vanishing gradients but is susceptible to “dying ReLUs”).