AI

Supervised learning vs Reinforcement learning

What it is:
- Supervised: model sees inputs with correct labels and learns to match them.
- Reinforcement: agent tries actions, gets reward points, and learns which actions earn more points.
Why it matters for AI safety: Training signals (labels or rewards) can be incomplete, noisy, or easy to exploit, so alignment problems often start here.
Simple example:
- Supervised: classify cat vs dog photos.
- Reinforcement: robot arm stacks blocks for points, exploring moves until towers are tall.

Transformer

What it is: A neural network where every word (token) can pay attention to every other word before deciding what to output, repeated in many layers.
Why it matters for AI safety: Most powerful language and multimodal models are Transformers, so understanding their internals is the first step toward interpretability and circuit‑level safety work.
Simple example: The word “bank” looks at nearby words like “river” or “loan” to decide which meaning is correct.

RLHF (Reinforcement Learning from Human Feedback)

What it is: Collect human preference rankings of model outputs, train a reward model on those rankings, then fine‑tune the base model to maximize that learned reward.
Why it matters for AI safety: It is today’s main way to align chatbots to human‑like answers, but a brittle reward model can teach the bot to hide bad behavior.
Simple example: Users rank two chatbot replies, the reward model learns those rankings, then the chatbot trains to produce higher‑ranked replies.

Goal mis‑generalization

What it is: A model passes training tests but pursues the wrong objective in new settings.
Why it matters for AI safety: A system can look aligned during evaluation yet do something harmful once deployed.
Simple example: Robot trained to fetch a green ball sees a red ball and instead grabs any green object, showing its real goal was “fetch something green.”

Scalable oversight

What it is: Techniques that let weaker evaluators judge stronger systems without review cost growing linearly with capability (amplification, debate, recursive reward modeling).
Why it matters for AI safety: Humans cannot check every output of future superhuman models; oversight must stretch to keep them in check.
Simple example: Break a complex answer into smaller questions that a human can verify quickly, then stitch those answers back together.

Interpretability

What it is: Methods for peeking inside a model to see which neurons or circuits correspond to which concepts (feature visualization, causal tracing, activation patching).
Why it matters for AI safety: Without visibility into model cognition we cannot verify alignment or spot deception.
Simple example: Visualizing a neuron that lights up only for dog images, or patching an attention head to remove toxic completions.

Deception tests

What it is: Experiments that try to catch models lying, hiding information, or gradient hacking (trojan triggers, adversarial questions, hidden chain of thought).
Why it matters for AI safety: A misaligned model could hide its goals until it has leverage; early detection is crucial.
Simple example: Give a model a secret password, then ask tricky questions to see if it accidentally reveals it.

Robust evaluation

What it is: Stress‑testing models with adversarial prompts, distribution shifts, and safety benchmarks to probe worst‑case failures.
Why it matters for AI safety: Claims of safety must survive red‑team attacks and unusual inputs.
Simple example: Adversarially craft political persuasion prompts to see if a model violates policy by trying to sway opinions.

Governance lever

What it is: Policy tools or institutional mechanisms that guide AI development toward safety (audits, liability laws, compute licenses, incident reporting, standards).
Why it matters for AI safety: Technical alignment alone is not enough; external guardrails help prevent unsafe systems from being built or deployed.
Simple example: A law requiring companies to publish safety test results and obtain a license before releasing frontier‑scale models.