AI

Supervised learning vs Reinforcement learning

  • What it is:

    • Supervised: model sees inputs with correct labels and learns to match them.

    • Reinforcement: agent tries actions, gets reward points, and learns which actions earn more points.

  • Why it matters for AI safety: Training signals (labels or rewards) can be incomplete, noisy, or easy to exploit, so alignment problems often start here.

  • Simple example:

    • Supervised: classify cat vs dog photos.

    • Reinforcement: robot arm stacks blocks for points, exploring moves until towers are tall.


Transformer

  • What it is: A neural network where every word (token) can pay attention to every other word before deciding what to output, repeated in many layers.

  • Why it matters for AI safety: Most powerful language and multimodal models are Transformers, so understanding their internals is the first step toward interpretability and circuit‑level safety work.

  • Simple example: The word “bank” looks at nearby words like “river” or “loan” to decide which meaning is correct.


RLHF (Reinforcement Learning from Human Feedback)

  • What it is: Collect human preference rankings of model outputs, train a reward model on those rankings, then fine‑tune the base model to maximize that learned reward.

  • Why it matters for AI safety: It is today’s main way to align chatbots to human‑like answers, but a brittle reward model can teach the bot to hide bad behavior.

  • Simple example: Users rank two chatbot replies, the reward model learns those rankings, then the chatbot trains to produce higher‑ranked replies.


Goal mis‑generalization

  • What it is: A model passes training tests but pursues the wrong objective in new settings.

  • Why it matters for AI safety: A system can look aligned during evaluation yet do something harmful once deployed.

  • Simple example: Robot trained to fetch a green ball sees a red ball and instead grabs any green object, showing its real goal was “fetch something green.”


Scalable oversight

  • What it is: Techniques that let weaker evaluators judge stronger systems without review cost growing linearly with capability (amplification, debate, recursive reward modeling).

  • Why it matters for AI safety: Humans cannot check every output of future superhuman models; oversight must stretch to keep them in check.

  • Simple example: Break a complex answer into smaller questions that a human can verify quickly, then stitch those answers back together.


Interpretability

  • What it is: Methods for peeking inside a model to see which neurons or circuits correspond to which concepts (feature visualization, causal tracing, activation patching).

  • Why it matters for AI safety: Without visibility into model cognition we cannot verify alignment or spot deception.

  • Simple example: Visualizing a neuron that lights up only for dog images, or patching an attention head to remove toxic completions.


Deception tests

  • What it is: Experiments that try to catch models lying, hiding information, or gradient hacking (trojan triggers, adversarial questions, hidden chain of thought).

  • Why it matters for AI safety: A misaligned model could hide its goals until it has leverage; early detection is crucial.

  • Simple example: Give a model a secret password, then ask tricky questions to see if it accidentally reveals it.


Robust evaluation

  • What it is: Stress‑testing models with adversarial prompts, distribution shifts, and safety benchmarks to probe worst‑case failures.

  • Why it matters for AI safety: Claims of safety must survive red‑team attacks and unusual inputs.

  • Simple example: Adversarially craft political persuasion prompts to see if a model violates policy by trying to sway opinions.


Governance lever

  • What it is: Policy tools or institutional mechanisms that guide AI development toward safety (audits, liability laws, compute licenses, incident reporting, standards).

  • Why it matters for AI safety: Technical alignment alone is not enough; external guardrails help prevent unsafe systems from being built or deployed.

  • Simple example: A law requiring companies to publish safety test results and obtain a license before releasing frontier‑scale models.