Transparency, Interpretability & Explainability in AI

Course & Module Context

  • Overall course organized around the five responsible-AI criteria:
    F – Fairness (already completed pre-mid-sem)
    A – Accountability
    T – Transparency ← current focus
    P – Privacy
    R – Robustness
  • Module 7 launches the post-mid-sem portion, starting with Transparency before moving on to Accountability, Privacy, Robustness.

Three Key Terms & Their Nuances

  • Transparency – Ability to “see through” a model; understand how it works internally.
  • Interpretability – Human can identify the cause of an individual prediction; “Why was I accepted / rejected?”
  • Explainability – Model (or an auxiliary tool) produces a human-understandable explanation of its behaviour.
    • Terms often used interchangeably but differ subtly (thin boundary).

Why Interpretability/Explainability Matter

  • Trust: end-users, auditors, regulators need confidence.
  • Compliance with ethical guidelines & legal standards.
  • Debugging & model validation.
  • Feature-level insight for domain experts.
  • Application-dependent: e.g. a fake-news flagger might not need a rationale, but a loan-approval system must supply reasons.

Intrinsically Interpretable Classical ML Models

  • Linear & Logistic Regression

    • Core equation: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
    • Coefficients \beta_i directly show the direction & magnitude of each feature's influence.
    • Logistic regression wraps this in a sigmoid: \hat{p}=\sigma(z)=\frac{1}{1+e^{-z}}, producing a clear probability threshold.
    • Example (house-price):
    – Size weight 200 ⇒ every extra sqft adds 200 to the price.
    – Bedrooms weight 15,000 ⇒ going from 3 BHK to 4 BHK increases the price by 15,000.
    – Age weight negative ⇒ an older house is cheaper.
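A minimal sketch of this readability, using the size and bedroom weights quoted above (the intercept and the exact age weight are assumed for illustration):

```python
# Illustrative linear house-price model. Size and bedroom weights are the
# ones from the notes; the intercept and age weight are assumed values.
weights = {"size_sqft": 200, "bedrooms": 15_000, "age_years": -1_000}
intercept = 50_000  # hypothetical base price

def predict_price(house):
    """y = beta_0 + sum_i beta_i * x_i — every term is directly readable."""
    return intercept + sum(weights[f] * v for f, v in house.items())

base = {"size_sqft": 1000, "bedrooms": 3, "age_years": 10}
bigger = {**base, "size_sqft": 1001}

print(predict_price(base))                           # 285000
print(predict_price(bigger) - predict_price(base))   # 200: one extra sqft
```

Each coefficient answers "what changes if this feature moves by one unit?" — the explanation is the model itself.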

  • Decision Tree

    • Gives explicit if–then rules; each path traceable.
    • Root → internal nodes → leaves (predictions).
    • Drawn with the root at the top, the tree reads as a human-readable flowchart.
    • Random Forest / XGBoost: still tree-based but multiple trees & voting reduce interpretability (trade-off for performance).
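The if–then traceability can be made concrete with a hand-written toy tree; the loan features and thresholds below are invented for illustration:

```python
# Hand-written decision "tree" for a toy loan decision — each path is an
# explicit, traceable if-then rule (feature names and thresholds invented).
def approve_loan(income, credit_score):
    if credit_score >= 700:
        return "approve", "rule: credit_score >= 700"
    if income >= 60_000:
        return "approve", "rule: credit_score < 700 AND income >= 60000"
    return "reject", "rule: credit_score < 700 AND income < 60000"

decision, rule = approve_loan(income=45_000, credit_score=650)
print(decision, "|", rule)  # reject | rule: credit_score < 700 AND income < 60000
```

Returning the fired rule alongside the decision is exactly the explanation a tree gives for free; an ensemble of hundreds of such trees no longer yields a single readable path.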

  • Naive Bayes

    • Outputs P(C \mid X); the probabilities are themselves explanations.
    • Formula: P(C \mid X)=\frac{P(X \mid C)\,P(C)}{P(X)}.
    • Spam filtering example: words “free”, “urgent”, “meeting” each contribute likelihood; easy to list top tokens & their weights.
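The spam example can be sketched with Bayes' rule in log form; the priors and per-word likelihoods below are made-up numbers, not estimates from real data:

```python
import math

# Toy naive-Bayes spam filter. Priors and P(word | class) are illustrative.
p_spam, p_ham = 0.4, 0.6
likelihood = {
    "free":    {"spam": 0.30, "ham": 0.02},
    "urgent":  {"spam": 0.20, "ham": 0.05},
    "meeting": {"spam": 0.05, "ham": 0.25},
}

def spam_score(words):
    # log P(C) + sum_w log P(w | C), using the naive independence assumption.
    log_spam = math.log(p_spam) + sum(math.log(likelihood[w]["spam"]) for w in words)
    log_ham  = math.log(p_ham)  + sum(math.log(likelihood[w]["ham"])  for w in words)
    # Each addend is a per-token contribution — listing them IS the explanation.
    return log_spam, log_ham

s, h = spam_score(["free", "urgent"])
print("spam" if s > h else "ham")  # spam
```

Sorting tokens by their contribution gap immediately yields the "top words & weights" explanation mentioned above.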

  • k-Nearest Neighbours (k-NN)

    • Analogy-based: new instance classified by proximity in feature space.
    • Visual & table explanations of nearest neighbours; majority vote rationale.
    • Example: patient-heart-disease prediction using age/BP/cholesterol similarity.
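A minimal k-NN sketch of the patient example; the records below (age, blood pressure, cholesterol) are invented values:

```python
# Minimal k-NN: the prediction is explained by listing the nearest
# neighbours and the majority vote (patient records are invented).
def knn_explain(query, patients, k=3):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    ranked = sorted(patients, key=lambda p: dist(query, p["features"]))[:k]
    votes = [p["label"] for p in ranked]
    prediction = max(set(votes), key=votes.count)   # majority vote
    return prediction, ranked                        # neighbours ARE the rationale

patients = [
    {"features": (63, 145, 233), "label": "disease"},
    {"features": (37, 130, 250), "label": "healthy"},
    {"features": (41, 130, 204), "label": "healthy"},
    {"features": (56, 120, 236), "label": "disease"},
]
pred, neighbours = knn_explain((60, 140, 230), patients)
print(pred)  # disease — 2 of the 3 nearest patients have the disease
```

In practice features with different scales (age vs cholesterol) should be normalised first, or the distance — and hence the explanation — is dominated by the largest-valued feature.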

Deep Learning & The Transparency Challenge

  • Vanilla deep feed-forward nets, CNNs, RNNs often labelled black boxes.
  • Whether interpretability is required depends on problem statement (e.g. anomaly detection may need root-cause tracing).

Intrinsically Interpretable DL Architectures

1. Attention & Transformers

  • Transformer block = Encoder + Decoder, each built from:
    • Multi-Head Attention (MHA)
    • Feed-Forward Network (FFN)
    • Add & Norm (skip / residual connections)
  • Skip (residual) connection: add the original vector to the processed output (the "Add" in Add & Norm) to combat vanishing gradients & preserve information.
  • Attention intuition (human analogy):
    • Distinguishing cat vs dog by focusing on ears, eyes, fur.
    • Tiger vs cheetah: stripes vs spots.
  • Single-Head vs Multi-Head: one vs multiple feature sub-spaces examined in parallel.
  • Self-attention (within same sequence), Cross-attention (encoder ↔ decoder).
  • Q, K, V matrices learned; attention weights visualised as heat-maps (e.g., English→French word alignment).
  • Result: weight matrices supply fine-grained, global explanations of feature importance.
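The scaled dot-product attention described above can be written out directly; the returned weight matrix is exactly what gets visualised as a heat-map (shapes and random inputs here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the weight matrix is the explanation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weights -> heat-map

# Tiny example: 3 query tokens attending over 4 key/value tokens, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))  # each row sums to 1: "how much does token i look at token j"
```

Row i of `attn` answers "which inputs did output token i attend to" — the same matrix behind the English→French alignment heat-maps.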

2. Prototype Networks

  • Human cognition uses representative exemplars (e.g., red Maruti 800 for “car”).
  • Network learns prototype vectors for each class; new sample compared via distance metric (similar to twin/Siamese nets).
  • High similarity ⇒ classification + interpretable “closest prototype” explanation.
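A bare-bones sketch of prototype-based classification, assuming hypothetical 3-dimensional embeddings and per-class prototype vectors:

```python
# Prototype classification sketch: each class keeps one learned prototype
# vector; the nearest prototype both classifies and explains.
# Embedding values and prototypes below are invented for illustration.
prototypes = {"cat": (0.9, 0.1, 0.8), "dog": (0.2, 0.9, 0.3)}

def classify(embedding):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    label = min(prototypes, key=lambda c: dist(embedding, prototypes[c]))
    return label, prototypes[label]  # "closest prototype" is the explanation

label, proto = classify((0.8, 0.2, 0.7))
print(label)  # cat — the query embedding is nearest the cat prototype
```

In a real prototype network the prototypes are learned jointly with the embedding, but the explanation pattern is the same: "classified as X because it resembles this exemplar."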

3. Concept Bottleneck Models (CBM)

  • Insert explicit concept layer C between input X and output Y.
  • Learn two mappings:
    • f_1: X \rightarrow C (detect concepts).
    • f_2: C \rightarrow Y (make the decision).
  • Concepts annotated by humans (e.g., “curved beak”, “red breast” → robin).
  • Bottleneck forces network to ground decisions in human concepts.
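A sketch of the two mappings, with invented bird concepts and thresholds standing in for the learned networks:

```python
# Concept Bottleneck sketch: f1 maps raw input to named concepts, f2 maps
# ONLY those concepts to the label. Concepts and thresholds are invented.
def f1_detect_concepts(image_features):
    return {
        "curved_beak": image_features["beak_curvature"] > 0.5,
        "red_breast":  image_features["breast_redness"] > 0.6,
    }

def f2_decide(concepts):
    # The decision can only use the human-named concepts — that is the bottleneck.
    return "robin" if concepts["curved_beak"] and concepts["red_breast"] else "other"

x = {"beak_curvature": 0.7, "breast_redness": 0.8}
concepts = f1_detect_concepts(x)
print(f2_decide(concepts), concepts)
```

Because `f2` never sees the raw input, every prediction comes with a concept-level justification, and a domain expert can intervene by editing `concepts` directly.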

Post-Hoc Explainability Methods (Model-Agnostic)

  • Apply after training any black-box model.
  • LIME (Local Interpretable Model-agnostic Explanations)
    • Perturb input around instance, fit sparse linear surrogate, return local feature weights.
  • SHAP (SHapley Additive exPlanations)
    • Game-theoretic contribution scores; consistent & additive.
  • Grad-CAM (Gradient-weighted Class Activation Mapping)
    • Uses gradients w.r.t. convolutional feature maps to produce heat-map overlay on image.
  • Counterfactuals
    • “What minimal feature changes would flip the prediction?” Helpful for recourse.
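The LIME recipe — perturb locally, query the black box, fit a linear surrogate — can be sketched in a few lines; the black-box function here is just a stand-in for any opaque model:

```python
import numpy as np

def black_box(x):  # stand-in for an arbitrary opaque model
    return 1.0 / (1.0 + np.exp(-(2.0 * x[..., 0] - 3.0 * x[..., 1])))

rng = np.random.default_rng(0)
instance = np.array([0.5, 0.2])

# 1. Perturb around the instance of interest.
samples = instance + rng.normal(scale=0.1, size=(500, 2))
# 2. Query the black box on the perturbations.
preds = black_box(samples)
# 3. Fit a local linear surrogate; its coefficients are the explanation.
X = np.column_stack([samples, np.ones(len(samples))])  # add intercept column
coef, *_ = np.linalg.lstsq(X, preds, rcond=None)
print(coef[:2].round(2))  # local weights: positive for x0, negative for x1
```

Real LIME adds distance-based sample weighting and a sparsity penalty, but this captures the core idea: the surrogate is only trusted near the instance, which is what makes the explanation *local*.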

Local vs Global, Model-Specific vs Model-Agnostic

  • Intrinsic methods usually model-specific (tree rules, linear coefficients, attention weights).
  • Post-hoc tools largely model-agnostic (LIME, SHAP).
  • Local scope: explain a single prediction (LIME, SHAP, counterfactuals).
  • Global scope: summarise model behaviour overall (feature importance, attention matrices, tree paths).

Connections to Earlier & Future Modules

  • Builds on Fairness (pre-mid-sem); interpretability tools also aid bias detection.
  • Accountability & auditability (upcoming) depend on transparent explanations.
  • Privacy vs interpretability trade-offs (to be discussed later).
  • Robustness: explanations reveal spurious correlations → defensive retraining.

Practical / Ethical Implications

  • Regulatory compliance (GDPR “right to explanation”).
  • Loan, hiring, medical diagnostics require cause-based answers.
  • Choice of interpretable vs opaque model must align with stakeholder needs & risk tolerance.

Key Equations & Numerical References

  • Linear/Logistic: y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i ; \quad \sigma(z)=\frac{1}{1+e^{-z}}.
  • Bayes Rule: P(C \mid X)=\frac{P(X \mid C)\,P(C)}{P(X)}.
  • k-NN vote: \hat{y}=\text{mode}\bigl( y_{(1)},\dots,y_{(k)} \bigr).
  • Attention score (scaled dot-product): \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.
  • Transformer patch example: image split into 16\times16 pixel patches → sequence.

Study Tips & Further Reading

  • Practice tracing a decision-tree path for sample inputs.
  • Manually compute SHAP values for a 3-feature toy model to internalise concept.
  • Visualise attention maps (e.g., via HuggingFace tools) to connect theory to practice.
  • Explore prototype networks with small image datasets (CIFAR-10) for intuition.
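For the SHAP study tip, exact Shapley values for a tiny model can be computed by averaging each feature's marginal contribution over all orderings; the value function below (with one interaction bonus) is invented for the exercise:

```python
from itertools import permutations

# Exact Shapley values for a 3-feature toy model: average each feature's
# marginal contribution over all orderings. The value function is invented.
features = ["income", "debt", "age"]

def value(coalition):  # model output when only these features are "known"
    v = {"income": 30, "debt": -20, "age": 5}
    bonus = 10 if {"income", "debt"} <= set(coalition) else 0  # interaction term
    return sum(v[f] for f in coalition) + bonus

shap = {f: 0.0 for f in features}
orderings = list(permutations(features))
for order in orderings:
    seen = []
    for f in order:
        shap[f] += (value(seen + [f]) - value(seen)) / len(orderings)
        seen.append(f)

print(shap)  # additivity: contributions sum to value(all) - value(none) = 25
```

Note how the 10-point interaction bonus is split evenly between `income` and `debt` — the fairness property that distinguishes Shapley values from naive per-feature attributions.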