Transparency, Interpretability & Explainability in AI

Course & Module Context

  • Overall course organized around the five responsible-AI criteria:
    F – Fairness (already completed pre-mid-sem)
    A – Accountability
    T – Transparency ← current focus
    P – Privacy
    R – Robustness
  • Module 7 launches the post-mid-sem portion, starting with Transparency before moving on to Accountability, Privacy, Robustness.

Three Key Terms & Their Nuances

  • Transparency – Ability to “see through” a model; understand how it works internally.
  • Interpretability – Human can identify the cause of an individual prediction; “Why was I accepted / rejected?”
  • Explainability – Model (or an auxiliary tool) produces a human-understandable explanation of its behaviour.
    • Terms often used interchangeably but differ subtly (thin boundary).

Why Interpretability/Explainability Matter

  • Trust: end-users, auditors, regulators need confidence.
  • Compliance with ethical guidelines & legal standards.
  • Debugging & model validation.
  • Feature-level insight for domain experts.
  • Application-dependent: e.g. a fake-news flagger might not need a rationale, but a loan-approval system must supply reasons.

Intrinsically Interpretable Classical ML Models

  • Linear & Logistic Regression

    • Core equation: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
    • Coefficients \beta_i directly show the direction & magnitude of each feature's influence.
    • Logistic regression wraps this in a sigmoid: \hat{p}=\sigma(z)=\frac{1}{1+e^{-z}}, producing a clear probability threshold.
    • Example (house-price):
    – Size weight 200 ⇒ every extra sqft adds 200 to the price.
    – Bedrooms weight 15,000 ⇒ going from 3 BHK to 4 BHK increases the price by 15,000.
    – Age weight negative ⇒ an older house is cheaper.
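A minimal sketch of this readability, using the size and bedroom weights quoted above (the intercept and the exact age weight are assumed for illustration):

```python
# Illustrative linear house-price model. Size and bedroom weights are the
# ones from the notes; the intercept and age weight are assumed values.
weights = {"size_sqft": 200, "bedrooms": 15_000, "age_years": -1_000}
intercept = 50_000  # hypothetical base price

def predict_price(house):
    """y = beta_0 + sum_i beta_i * x_i — every term is directly readable."""
    return intercept + sum(weights[f] * v for f, v in house.items())

base = {"size_sqft": 1000, "bedrooms": 3, "age_years": 10}
bigger = {**base, "size_sqft": 1001}

print(predict_price(base))                           # 285000
print(predict_price(bigger) - predict_price(base))   # 200: one extra sqft
```

Each coefficient answers "what changes if this feature moves by one unit?" — the explanation is the model itself.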

  • Decision Tree

    • Gives explicit if–then rules; each path traceable.
    • Root → internal nodes → leaves (predictions).
    • Drawn with the root at the top, the tree reads as a human-readable flowchart.
    • Random Forest / XGBoost: still tree-based but multiple trees & voting reduce interpretability (trade-off for performance).
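The if–then traceability can be made concrete with a hand-written toy tree; the loan features and thresholds below are invented for illustration:

```python
# Hand-written decision "tree" for a toy loan decision — each path is an
# explicit, traceable if-then rule (feature names and thresholds invented).
def approve_loan(income, credit_score):
    if credit_score >= 700:
        return "approve", "rule: credit_score >= 700"
    if income >= 60_000:
        return "approve", "rule: credit_score < 700 AND income >= 60000"
    return "reject", "rule: credit_score < 700 AND income < 60000"

decision, rule = approve_loan(income=45_000, credit_score=650)
print(decision, "|", rule)  # reject | rule: credit_score < 700 AND income < 60000
```

Returning the fired rule alongside the decision is exactly the explanation a tree gives for free; an ensemble of hundreds of such trees no longer yields a single readable path.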

  • Naive Bayes

    • Outputs P(C \mid X); the probabilities are themselves explanations.
    • Formula: P(C \mid X)=\frac{P(X \mid C)\,P(C)}{P(X)}.
    • Spam filtering example: words “free”, “urgent”, “meeting” each contribute likelihood; easy to list top tokens & their weights.
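The spam example can be sketched with Bayes' rule in log form; the priors and per-word likelihoods below are made-up numbers, not estimates from real data:

```python
import math

# Toy naive-Bayes spam filter. Priors and P(word | class) are illustrative.
p_spam, p_ham = 0.4, 0.6
likelihood = {
    "free":    {"spam": 0.30, "ham": 0.02},
    "urgent":  {"spam": 0.20, "ham": 0.05},
    "meeting": {"spam": 0.05, "ham": 0.25},
}

def spam_score(words):
    # log P(C) + sum_w log P(w | C), using the naive independence assumption.
    log_spam = math.log(p_spam) + sum(math.log(likelihood[w]["spam"]) for w in words)
    log_ham  = math.log(p_ham)  + sum(math.log(likelihood[w]["ham"])  for w in words)
    # Each addend is a per-token contribution — listing them IS the explanation.
    return log_spam, log_ham

s, h = spam_score(["free", "urgent"])
print("spam" if s > h else "ham")  # spam
```

Sorting tokens by their contribution gap immediately yields the "top words & weights" explanation mentioned above.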

  • k-Nearest Neighbours (k-NN)

    • Analogy-based: new instance classified by proximity in feature space.
    • Visual & table explanations of nearest neighbours; majority vote rationale.
    • Example: patient-heart-disease prediction using age/BP/cholesterol similarity.
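A minimal k-NN sketch of the patient example; the records below (age, blood pressure, cholesterol) are invented values:

```python
# Minimal k-NN: the prediction is explained by listing the nearest
# neighbours and the majority vote (patient records are invented).
def knn_explain(query, patients, k=3):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    ranked = sorted(patients, key=lambda p: dist(query, p["features"]))[:k]
    votes = [p["label"] for p in ranked]
    prediction = max(set(votes), key=votes.count)   # majority vote
    return prediction, ranked                        # neighbours ARE the rationale

patients = [
    {"features": (63, 145, 233), "label": "disease"},
    {"features": (37, 130, 250), "label": "healthy"},
    {"features": (41, 130, 204), "label": "healthy"},
    {"features": (56, 120, 236), "label": "disease"},
]
pred, neighbours = knn_explain((60, 140, 230), patients)
print(pred)  # disease — 2 of the 3 nearest patients have the disease
```

In practice features with different scales (age vs cholesterol) should be normalised first, or the distance — and hence the explanation — is dominated by the largest-valued feature.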

Deep Learning & The Transparency Challenge

  • Vanilla deep feed-forward nets, CNNs, RNNs often labelled black boxes.
  • Whether interpretability is required depends on problem statement (e.g. anomaly detection may need root-cause tracing).

Intrinsically Interpretable DL Architectures

1. Attention & Transformers

  • Transformer block = Encoder + Decoder, each built from:
    • Multi-Head Attention (MHA)
    • Feed-Forward Network (FFN)
    • Add & Norm (skip / residual connections)
  • Skip (residual) connection: add the original vector to the processed output (the "Add" in Add & Norm) to combat vanishing gradients & preserve information.
  • Attention intuition (human analogy):
    • Distinguishing cat vs dog by focusing on ears, eyes, fur.
    • Tiger vs cheetah: stripes vs spots.
  • Single-Head vs Multi-Head: one vs multiple feature sub-spaces examined in parallel.
  • Self-attention (within same sequence), Cross-attention (encoder ↔ decoder).
  • Q, K, V matrices learned; attention weights visualised as heat-maps (e.g., English→French word alignment).
  • Result: weight matrices supply fine-grained, global explanations of feature importance.
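The scaled dot-product attention described above can be written out directly; the returned weight matrix is exactly what gets visualised as a heat-map (shapes and random inputs here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the weight matrix is the explanation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weights -> heat-map

# Tiny example: 3 query tokens attending over 4 key/value tokens, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))  # each row sums to 1: "how much does token i look at token j"
```

Row i of `attn` answers "which inputs did output token i attend to" — the same matrix behind the English→French alignment heat-maps.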

2. Prototype Networks

  • Human cognition uses representative exemplars (e.g., red Maruti 800 for “car”).
  • Network learns prototype vectors for each class; new sample compared via distance metric (similar to twin/Siamese nets).
  • High similarity ⇒ classification + interpretable “closest prototype” explanation.
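A bare-bones sketch of prototype-based classification, assuming hypothetical 3-dimensional embeddings and per-class prototype vectors:

```python
# Prototype classification sketch: each class keeps one learned prototype
# vector; the nearest prototype both classifies and explains.
# Embedding values and prototypes below are invented for illustration.
prototypes = {"cat": (0.9, 0.1, 0.8), "dog": (0.2, 0.9, 0.3)}

def classify(embedding):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    label = min(prototypes, key=lambda c: dist(embedding, prototypes[c]))
    return label, prototypes[label]  # "closest prototype" is the explanation

label, proto = classify((0.8, 0.2, 0.7))
print(label)  # cat — the query embedding is nearest the cat prototype
```

In a real prototype network the prototypes are learned jointly with the embedding, but the explanation pattern is the same: "classified as X because it resembles this exemplar."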

3. Concept Bottleneck Models (CBM)

  • Insert explicit concept layer C between input X and output Y.
  • Learn two mappings:
    • f_1: X \rightarrow C (detect concepts).
    • f_2: C \rightarrow Y (make the decision).
  • Concepts annotated by humans (e.g., “curved beak”, “red breast” → robin).
  • Bottleneck forces network to ground decisions in human concepts.
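A sketch of the two mappings, with invented bird concepts and thresholds standing in for the learned networks:

```python
# Concept Bottleneck sketch: f1 maps raw input to named concepts, f2 maps
# ONLY those concepts to the label. Concepts and thresholds are invented.
def f1_detect_concepts(image_features):
    return {
        "curved_beak": image_features["beak_curvature"] > 0.5,
        "red_breast":  image_features["breast_redness"] > 0.6,
    }

def f2_decide(concepts):
    # The decision can only use the human-named concepts — that is the bottleneck.
    return "robin" if concepts["curved_beak"] and concepts["red_breast"] else "other"

x = {"beak_curvature": 0.7, "breast_redness": 0.8}
concepts = f1_detect_concepts(x)
print(f2_decide(concepts), concepts)
```

Because `f2` never sees the raw input, every prediction comes with a concept-level justification, and a domain expert can intervene by editing `concepts` directly.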

Post-Hoc Explainability Methods (Model-Agnostic)

  • Apply after training any black-box model.
  • LIME (Local Interpretable Model-agnostic Explanations)
    • Perturb input around instance, fit sparse linear surrogate, return local feature weights.
  • SHAP (SHapley Additive exPlanations)
    • Game-theoretic contribution scores; consistent & additive.
  • Grad-CAM (Gradient-weighted Class Activation Mapping)
    • Uses gradients w.r.t. convolutional feature maps to produce heat-map overlay on image.
  • Counterfactuals
    • “What minimal feature changes would flip the prediction?” Helpful for recourse.
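The LIME recipe — perturb locally, query the black box, fit a linear surrogate — can be sketched in a few lines; the black-box function here is just a stand-in for any opaque model:

```python
import numpy as np

def black_box(x):  # stand-in for an arbitrary opaque model
    return 1.0 / (1.0 + np.exp(-(2.0 * x[..., 0] - 3.0 * x[..., 1])))

rng = np.random.default_rng(0)
instance = np.array([0.5, 0.2])

# 1. Perturb around the instance of interest.
samples = instance + rng.normal(scale=0.1, size=(500, 2))
# 2. Query the black box on the perturbations.
preds = black_box(samples)
# 3. Fit a local linear surrogate; its coefficients are the explanation.
X = np.column_stack([samples, np.ones(len(samples))])  # add intercept column
coef, *_ = np.linalg.lstsq(X, preds, rcond=None)
print(coef[:2].round(2))  # local weights: positive for x0, negative for x1
```

Real LIME adds distance-based sample weighting and a sparsity penalty, but this captures the core idea: the surrogate is only trusted near the instance, which is what makes the explanation *local*.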

Local vs Global, Model-Specific vs Model-Agnostic

  • Intrinsic methods usually model-specific (tree rules, linear coefficients, attention weights).
  • Post-hoc tools largely model-agnostic (LIME, SHAP).
  • Local scope: explain a single prediction (LIME, SHAP, counterfactuals).
  • Global scope: summarise model behaviour overall (feature importance, attention matrices, tree paths).

Connections to Earlier & Future Modules

  • Builds on Fairness (pre-mid-sem); interpretability tools also aid bias detection.
  • Accountability & auditability (upcoming) depend on transparent explanations.
  • Privacy vs interpretability trade-offs (to be discussed later).
  • Robustness: explanations reveal spurious correlations → defensive retraining.

Practical / Ethical Implications

  • Regulatory compliance (GDPR “right to explanation”).
  • Loan, hiring, medical diagnostics require cause-based answers.
  • Choice of interpretable vs opaque model must align with stakeholder needs & risk tolerance.

Key Equations & Numerical References

  • Linear/Logistic: y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i ; \quad \sigma(z)=\frac{1}{1+e^{-z}}.
  • Bayes Rule: P(C \mid X)=\frac{P(X \mid C)\,P(C)}{P(X)}.
  • k-NN vote: \hat{y}=\text{mode}\bigl( y_{(1)},\dots,y_{(k)} \bigr).
  • Attention score (scaled dot-product): \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.
  • Transformer patch example: image split into 16\times16 pixel patches → sequence.

Study Tips & Further Reading

  • Practice tracing a decision-tree path for sample inputs.
  • Manually compute SHAP values for a 3-feature toy model to internalise concept.
  • Visualise attention maps (e.g., via HuggingFace tools) to connect theory to practice.
  • Explore prototype networks with small image datasets (CIFAR-10) for intuition.
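For the SHAP study tip, exact Shapley values for a tiny model can be computed by averaging each feature's marginal contribution over all orderings; the value function below (with one interaction bonus) is invented for the exercise:

```python
from itertools import permutations

# Exact Shapley values for a 3-feature toy model: average each feature's
# marginal contribution over all orderings. The value function is invented.
features = ["income", "debt", "age"]

def value(coalition):  # model output when only these features are "known"
    v = {"income": 30, "debt": -20, "age": 5}
    bonus = 10 if {"income", "debt"} <= set(coalition) else 0  # interaction term
    return sum(v[f] for f in coalition) + bonus

shap = {f: 0.0 for f in features}
orderings = list(permutations(features))
for order in orderings:
    seen = []
    for f in order:
        shap[f] += (value(seen + [f]) - value(seen)) / len(orderings)
        seen.append(f)

print(shap)  # additivity: contributions sum to value(all) - value(none) = 25
```

Note how the 10-point interaction bonus is split evenly between `income` and `debt` — the fairness property that distinguishes Shapley values from naive per-feature attributions.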