8. Transfer Learning & Autoregressive LLMs


Last updated 3:00 PM on 6/1/26


49 Terms

New cards

What is the core mathematical concept of Data Augmentation in Deep Learning, and how does it improve generalization?

Data Augmentation replaces a strict empirical data distribution with a mathematically smoothed distribution. By automatically injecting variations (e.g., rotations, cropping, or noise injection in computer vision) inside the data loader, it artificially increases dataset variance and prevents the model from overfitting to superficial details.

New cards

Explain Contrastive Learning (e.g., SimCLR) and how a Siamese Network processes data views.

Contrastive learning is a self-supervised method that creates stochastic, semantics-preserving transformations of unlabelled data. It passes these altered views through a dual-stream Siamese Network to maximize the vector representation agreement between alternative views of the same image (positive pairs) while minimizing similarity for entirely different images (negative pairs).

New cards

How does CLIP (Contrastive Language-Image Pre-training) enable zero-shot image classification at test time?

CLIP jointly trains a ResNet/Vision Transformer image encoder and a text Transformer encoder on massive, noisy pairs of images and text. Instead of predicting static category IDs, it learns to match an image representation with its corresponding textual description vector. At test time, zero-shot classification is achieved by predicting which novel text class string matches the image with the highest cosine similarity.

New cards

Detail the fine-tuning freezing strategies based on the size of the target downstream dataset.

Small Target Data: Freeze all pre-trained hidden layers; train only the weights of the newly initialized output classification layer.
Medium Target Data: Freeze early layers; train weights on the last few abstraction layers + the output layer.
Large Target Data: Fine-tune all layers across the entire network, using the pre-trained weights merely as a robust initialization point.

New cards

Compare BERT, GPT-x, and T5 regarding their Core Architecture Feature and Pre-training Objective.

BERT (Google):
- Architecture: Encoder-focused block providing dense bidirectional context.
- Objective: Masked Language Modeling (MLM).
GPT-x (OpenAI):
- Architecture: Decoder-focused block providing causal autoregressive context.
- Objective: Next Token Prediction.
T5 (Google AI):
- Architecture: Unified Encoder-Decoder architecture in a Text-to-Text format.
- Objective: Span corruption: arbitrary token spans are removed, each is replaced with a placeholder token, and the decoder reconstructs the missing spans.

New cards

What is the specific distribution recipe used during BERT's Masked Language Modeling (MLM) pre-training, and why is it structured this way?

BERT replaces a fraction of target input tokens using the following specific distribution:

80% of the time: Replaced with the literal [MASK] token.
10% of the time: Replaced with a completely random token.
10% of the time: Left entirely unchanged.

Rationale: This prevents the model from becoming complacent (only computing representations when it explicitly sees a [MASK] token). It forces the network to maintain robust contextual representations for every single position, even when facing occasional token noise.

New cards

Contrast the implementation structural differences between Serial Adapters and Parallel Adapters.

Serial Adapters: Custom parameter blocks inserted sequentially between existing layers. Features must pass through them in a bottleneck fashion.
Parallel Adapters: Linear layer blocks inserted alongside the original frozen layer blocks. They gate and compute features in parallel with the base network, which preserves structural alignment and simplifies distributed layer integration.

New cards

Explain the mathematical formulation of Low-Rank Adaptation (LoRA). How does it compress weight updates?

LoRA freezes the original model weight matrix $W_0 \in \mathbb{R}^{d \times k}$ and decomposes the active weight update matrix $\Delta W$ into two low-rank matrices $A$ and $B$ :

$\Delta W = B \cdot A$

Where $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ . The internal rank $r$ is configured to be much smaller than the model dimensions ( $r \ll \min(d, k)$ ), so only $r(d + k)$ parameters are trained instead of $dk$. This drastically cuts the GPU memory needed for gradients and optimizer state. Because $W_0$ stays frozen, it also limits catastrophic forgetting, while often coming close to full fine-tuning accuracy.
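To make the saving concrete, here is a minimal sketch of the parameter arithmetic and of a tiny $\Delta W = B \cdot A$ product. The sizes ($d = k = 4096$, $r = 8$) and the helper names `lora_param_counts` and `matmul` are illustrative assumptions, not values from the card.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k            # entries of Delta W if it were trained directly
    lora = d * r + r * k    # entries of B (d x r) plus entries of A (r x k)
    return full, lora


def matmul(B, A):
    """Multiply B (d x r) by A (r x k) using plain Python lists."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))]
            for i in range(len(B))]


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)

# Tiny rank-1 update: B is 3 x 1 and A is 1 x 2, so Delta W is 3 x 2.
B = [[1.0], [2.0], [3.0]]
A = [[0.5, -1.0]]
delta_W = matmul(B, A)
print(delta_W)
```

Every row of `delta_W` is a multiple of the single row of `A`, which is what it means for the update to have rank at most $r$.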

New cards

What is the "Induction Heads Hypothesis" regarding how Large Language Models perform In-Context Learning?

The hypothesis states that in-context learning is driven by specialized attention structures called induction heads. These heads perform fuzzy pattern completion across context windows by tracking repeated sequence boundaries. If they observe a sequence pattern matching $[A][B] \dots [A^*]$ , they automatically allocate strong attention weights to predict that $[B^*]$ will follow.

New cards

Explain Chain-of-Thought (CoT) Prompting and state the difference between standard CoT and Zero-Shot CoT.

CoT prompting improves a model's logical execution by explicitly forcing it to generate intermediate reasoning text steps before answering.
Standard CoT: Appends a few complete, hand-written step-by-step reasoning demonstrations directly inside the prompt template.
Zero-Shot CoT: Simply appends the magic phrase "Let's think step by step" to the end of a question, which triggers the model to unroll its own logical path.

New cards

Why does standard Next-Token prediction pre-training create an "Alignment Problem" when building user assistants?

Raw pre-training optimizes a model to replicate internet text distributions, which includes toxic, biased, incorrect, or unhelpful text. This objective is misaligned with the human goal of creating a safe, helpful, and honest conversational assistant.

New cards

Write out the mathematical Loss Function used to train the Reward Model (RM) in an RLHF alignment pipeline. Explain the variables.

Loss Function (minimized with respect to $\phi$):

$J_{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( R_\phi(x, y_w) - R_\phi(x, y_l) \right) \right]$

$x$ : The input prompt string.
$y_w$ : The winning (preferred) response text sample.
$y_l$ : The losing (dispreferred) response text sample.
$R_\phi$ : The scalar score output by the reward model parameterized by weights $\phi$.
$\sigma$ : The standard sigmoid function.

New cards

What is the role of Proximal Policy Optimization (PPO) in the final stage of an RLHF pipeline?

PPO acts as the reinforcement learning optimization engine. The frozen Reward Model functions as the environment's scoring metric. PPO fine-tunes the active LLM action policy ( $\pi_\theta$ ) to maximize the expected reward score while adding a trust-region penalty (KL-divergence constraint) against the original SFT model to keep optimization updates stable and prevent reward hacking.

New cards

Define Empirical Risk Minimization (ERM) and explain its mathematical limitation when training high-parameter models on small datasets.

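ERM minimizes the average loss strictly over the observed training samples:

$\mathcal{R}(f) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n))$

This is the expected loss under the empirical data distribution, which treats the data as a collection of rigid Dirac delta ($\delta$) functions:

$p_{\mathcal{D}}(x, y) = \frac{1}{N} \sum_{n=1}^N \delta(x - x_n)\delta(y - y_n)$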

Limitation: It assumes zero probability for any data point outside the exact training samples. For massive parameter architectures, this literal fitting causes severe overfitting and poor generalization to unseen variations.

New cards

What is Vicinal Risk Minimization (VRM), and how does Data Augmentation mathematically implement it?

VRM replaces the rigid empirical distribution of ERM with an algorithmically smoothed virtual distribution:

$p_{\mathcal{D}}(x, y|A) = \frac{1}{N} \sum_{n=1}^N p(x|x_n, A)\delta(y - y_n)$

Data augmentation acts as the function $A$ that draws virtual samples $x$ from a localized "vicinity" around the actual training point $x_n$ (e.g., via a Gaussian kernel $\mathcal{N}(x|x_n, \sigma^2 \mathbf{I})$ ) while keeping the semantic target label $y_n$ completely invariant.
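As a rough illustration of the Gaussian vicinity above, the sketch below draws virtual inputs around one training point while copying its label unchanged. The function name `sample_vicinal`, the toy feature vector, and $\sigma = 0.1$ are assumptions made for the example.

```python
import random


def sample_vicinal(x_n, y_n, sigma, rng):
    """Draw one virtual example from N(x | x_n, sigma^2 I), keeping the label y_n."""
    x_virtual = [xi + rng.gauss(0.0, sigma) for xi in x_n]
    return x_virtual, y_n


rng = random.Random(0)                    # seeded so runs are repeatable
x_n, y_n = [0.2, -1.0, 3.5], "cat"        # made-up training point
virtual_batch = [sample_vicinal(x_n, y_n, sigma=0.1, rng=rng) for _ in range(4)]
for x_virtual, label in virtual_batch:
    print([round(v, 3) for v in x_virtual], label)
```

Setting `sigma` to 0 collapses the vicinity back onto the training point itself, which recovers the ERM case.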

New cards

What is AutoAugment, and how does it optimize sequence generation and training pipelines?

AutoAugment treats data augmentation selection as a discrete search problem. It utilizes black-box optimization frameworks (such as Reinforcement Learning or Bayesian Optimization) to programmatically learn which individual transformations or sequential combinations of augmentations yield the highest downstream model accuracy.

New cards

Describe the internal layer structure and dimensionality constraints of a Transformer Adapter.

Transformer Adapters insert two shallow Multi-Layer Perceptrons (MLPs) inside each Transformer layer: one right after the multi-head attention block, and a second after the position-wise feedforward network.

The adapter down-projects the model dimension $D$ to a tight bottleneck dimension $M$ ( $M \ll D$ ), applies a localized non-linearity, and up-projects back to $D$. Built-in residual skip connections ensure it acts as an identity mapping when initialized to zero.
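The bottleneck and the identity-at-initialization property can be sketched in a few lines. This assumes ReLU as the non-linearity and zero-initialized up-projection weights; the card does not specify either, and all names and numbers here are hypothetical.

```python
def linear(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter with a residual connection: h + W_up relu(W_down h)."""
    z = [max(0.0, v) for v in linear(W_down, b_down, h)]   # D -> M, then ReLU (assumed)
    u = linear(W_up, b_up, z)                              # M -> D
    return [hi + ui for hi, ui in zip(h, u)]               # residual skip connection


D, M = 4, 2                                                # toy sizes with M < D
h = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1]]    # M x D
b_down = [0.0, 0.0]
W_up = [[0.0] * M for _ in range(D)]                       # D x M, zero-initialized
b_up = [0.0] * D

print(adapter(h, W_down, b_down, W_up, b_up))
```

With the up-projection at zero, the adapter output equals its input, so inserting it does not disturb the pre-trained network before training starts.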

New cards

Write out the mathematical formulations for both Series and Parallel ResNet convolutional adapters ( $\alpha$ ).

Series Adapter: Formulated as a low-rank multiplicative perturbation directly following a standard convolution ( $f \circledast x$ ):
$y = \left(\text{diag}_1(\mathbf{I} + \alpha) \circledast f\right) \circledast x$
Parallel Adapter: Formulated as a low-rank additive perturbation running side-by-side with the primary convolution path:
$y = \left(f + \text{diag}_L(\alpha)\right) \circledast x$

New cards

Write out the contrastive loss maximization equation for SimCLR and explain why combining cropping with color distortion is vital.

$J = F(t_1(x))^T F(t_2(x)) - \log \sum_{x^-_i \in \mathcal{N}(x)} \exp \left( F(x^-_i)^T F(t_1(x)) \right)$

Color Distortion Necessity: Without random color distortions, the network can easily "cheat" the contrastive objective. It shortcuts representation learning by simply matching low-level color histograms across different crops of the same image rather than mapping structural visual features.

New cards

How does MoCo (Momentum Contrast) improve upon the standard batch-size constraints of SimCLR?

Standard contrastive learning requires massive active batch sizes to provide enough negative samples. MoCo decouples dictionary size from mini-batch size by maintaining a continuous queue of negative embeddings. It updates its key encoder using an exponential moving average (momentum), bypassing the need for huge batch sizes while maintaining representation quality.

New cards

Write out the Symmetric Cross-Entropy Loss Matrix optimization equation for CLIP.

Given unit-norm image embeddings $I_i$ and text embeddings $T_j$ , the pairwise similarity logits grid is defined as $L_{ij} = I_i^T T_j$ . The loss over a minibatch of size $N$ is:

$J = \frac{1}{2} \left[ \sum_{i=1}^N \text{CE}(L_{i,:}, \mathbf{1}_i) + \sum_{j=1}^N \text{CE}(L_{:,j}, \mathbf{1}_j) \right]$

This simultaneously optimizes image-to-text classification (over rows) and text-to-image classification (over columns), where the correct class for row $i$ and column $j$ is the matching pair on the diagonal of $L$.
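The symmetric loss can be sketched directly from the formula. The code below follows the card's equation literally (sums over rows and columns, then a factor of one half); the function names and the toy logit grids are assumptions for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against a one-hot target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def clip_loss(L):
    """Symmetric loss J for an N x N logit grid with L[i][j] = I_i . T_j."""
    n = len(L)
    # Image -> text: each row i should pick column i.
    rows = sum(cross_entropy(L[i], i) for i in range(n))
    # Text -> image: each column j should pick row j.
    cols = sum(cross_entropy([L[i][j] for i in range(n)], j) for j in range(n))
    return 0.5 * (rows + cols)


matched = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]   # strong diagonal
uninformative = [[0.0] * 3 for _ in range(3)]                   # all pairs look alike
print(clip_loss(matched), clip_loss(uninformative))
```

A grid with a dominant diagonal gives a loss near zero, while a grid of identical logits gives $N \log N$ under this unnormalized form.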

New cards

Write the zero-shot probability classification formula used by CLIP at inference time to evaluate an image vector $I$ against text labels.

Target text class labels are converted into descriptive prompts (e.g., "a photo of a [class]"). The image embedding $I$ is compared directly against the entire matrix of candidate text prompt representations ( $T_1 \dots T_K$ ):

$p(y = k|x) = \text{softmax}([I^T T_1, \dots, I^T T_K])_k$
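A small sketch of the inference rule, assuming toy 2-D vectors in place of real encoder outputs. The prompts, numbers, and helper names are made up for the example; the vectors are normalized so the dot product equals cosine similarity.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_vec, text_vecs):
    """Softmax over the similarities I^T T_k of unit-norm embeddings."""
    I = normalize(image_vec)
    sims = [sum(a * b for a, b in zip(I, normalize(T))) for T in text_vecs]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings standing in for the text and image encoders.
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_vecs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
image_vec = [0.9, 0.2]

probs = zero_shot_probs(image_vec, text_vecs)
best = max(range(len(prompts)), key=lambda k: probs[k])
print(prompts[best], [round(p, 3) for p in probs])
```

No classifier weights are trained here: adding a new class only requires adding another prompt vector to `text_vecs`.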

New cards

Write the minimax optimization equation for Domain Adversarial Learning using an auxiliary domain classifier $f_\theta$ and task head $g_\phi$ .

$\min_{\phi} \max_{\theta} \frac{1}{N_s + N_t} \sum_{n \in \mathcal{D}_s, \mathcal{D}_t} \ell(d_n, f_\theta(x_n)) + \frac{1}{N_s} \sum_{m \in \mathcal{D}_s} \ell(y_m, g_\phi(f_\theta(x_m)))$

Where $d_n$ represents the domain label (source vs. target) and $y_m$ represents the task class label of source example $m$.

New cards

What is the Gradient Sign Reversal Trick, and how is it used during the backpropagation phase of domain adaptation?

The gradient sign reversal trick is an architectural layer modification that leaves activations unaltered during the forward pass (it acts as the identity), but multiplies gradients by a negative scalar (flips the sign) during backpropagation. This explicitly penalizes the base network for developing domain-specific features, forcing it to extract representations that are invariant across the source and target domains.

New cards

State the core predictive principle of a Language Model and explain how it relates to the Distributional Hypothesis.

A language model is a computational system that predicts the next word given a historical prefix of words, assigning a probability distribution over the entire vocabulary space.

This aligns with the Distributional Hypothesis, which posits that words occurring in similar context windows tend to share similar semantic meanings. By learning to iteratively predict co-occurring tokens across massive text corpora, the model implicitly acquires vast knowledge about grammar, context, and world facts without manual labeling.

New cards

Compare Decoder-Only, Encoder-Only, and Encoder-Decoder architectures on context processing and downstream utility.

Decoder-Only: Left-to-right causal attention. Generates text autoregressively by appending predicted tokens back into the prefix. (e.g., GPT, Llama).
Encoder-Only: Unmasked bidirectional attention. Evaluates text on both sides of a token simultaneously. Non-generative; builds contextual vector representations for classification and labeling (e.g., BERT).
Encoder-Decoder: Encodes a source input into a representation space, which a separate decoder uses to generate a new token sequence. Decouples input length from output length (e.g., T5 for machine translation).

New cards

What are System Prompts, and how do modern long-context transformer architectures leverage them?

System prompts are privileged instruction blocks prepended to the top of the input context window before any user queries. Because modern models boast context windows spanning thousands of tokens, developers use these prompts to set durable rules, behavioral tones, specific task configurations (e.g., "output only JSON"), and safety guardrails that condition all downstream autoregressive generation steps.

New cards

Write out the mathematical equation for Temperature Sampling over a logit vector $u$ for vocabulary token $i$.

$y_i = \frac{\exp(u_i / \tau)}{\sum_{j \in V} \exp(u_j / \tau)}$

Where $u$ is the vector of raw, real-valued model outputs, $V$ is the total vocabulary, and $\tau$ is the temperature parameter.

New cards

Contrast the probability distributions and model behaviors generated when Temperature ($\tau$) approaches 0 versus when $\tau$ approaches infinity.

As $\tau \to 0$ (Low Temperature): The probability distribution sharpens dramatically, magnifying the score of the highest logit while suppressing others. The model's behavior converges toward Greedy Decoding ( $\text{argmax}$ ), becoming highly deterministic and predictable.
As $\tau \to \infty$ (High Temperature): The logit variations are flattened out. The resulting softmax distribution approaches a uniform distribution, drastically increasing randomness, token variety, and generation creativity (at the risk of producing gibberish).
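Both limits are easy to check numerically. The sketch below applies the temperature softmax to a made-up logit vector at three temperatures; the function name and the specific values are illustrative assumptions.

```python
import math


def temperature_softmax(u, tau):
    """Return y_i = exp(u_i / tau) / sum_j exp(u_j / tau)."""
    scaled = [ui / tau for ui in u]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]                     # hypothetical raw model outputs
for tau in (0.05, 1.0, 100.0):
    print(tau, [round(p, 3) for p in temperature_softmax(logits, tau)])

cold = temperature_softmax(logits, 0.05)     # close to one-hot on the argmax
hot = temperature_softmax(logits, 100.0)     # close to uniform
```

At the low temperature nearly all probability mass sits on the largest logit, and at the high temperature the three probabilities are almost equal.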

New cards

Write out the pretraining loss function ($\mathcal{L}_{CE}$) for a causal language model at step $t+1$ and explain the training method used.

At each position $t$ of a training sequence of length $T$, the model minimizes the cross-entropy loss below (in practice averaged over all positions):

$\mathcal{L}_{CE} = -\log \hat{y}_t[w_{t+1}]$

This measures the negative log probability assigned to the true upcoming token $w_{t+1}$ .

Training Method: It utilizes Teacher Forcing, meaning the network is always conditioned on the ground-truth historical sequence during a training pass, rather than feeding its own potentially flawed historical guesses back into itself.
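A minimal sketch of the teacher-forced loss, assuming the model's per-position output distributions are already available as lists. The token ids and probabilities are invented for the example, and `causal_lm_loss` is a hypothetical helper name.

```python
import math


def causal_lm_loss(pred_probs, tokens):
    """Average of -log y_hat_t[w_{t+1}] over a teacher-forced sequence.

    pred_probs[t] is the distribution produced after reading the TRUE tokens
    tokens[0..t]; it is scored against the true next token tokens[t + 1].
    """
    losses = [-math.log(pred_probs[t][tokens[t + 1]])
              for t in range(len(tokens) - 1)]
    return sum(losses) / len(losses)


# Toy vocabulary of 3 token ids with hypothetical model outputs.
tokens = [0, 2, 1]
pred_probs = [
    [0.1, 0.2, 0.7],   # after token 0, the true next token is 2
    [0.2, 0.5, 0.3],   # after tokens 0, 2, the true next token is 1
]
loss = causal_lm_loss(pred_probs, tokens)
print(loss)
```

The model's own guesses never enter the context: each prediction is conditioned on the ground-truth prefix from `tokens`, which is the teacher-forcing setup described above.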

New cards

Detail the core goals and mechanics of Stage 2 (Instruction Tuning) and Stage 3 (Alignment) of LLM training.

Stage 2: Instruction Tuning (Supervised Fine-Tuning / SFT):
- Goal: Transforms the model from a raw internet text-reconstruction engine into an assistant that follows explicit task commands.
- Mechanic: Supervised cross-entropy fine-tuning on a curated dataset of matched (Instruction, Response) pairs.
Stage 3: Alignment (Preference Alignment):
- Goal: Minimizes harmful, toxic, or deceitful outputs while maximizing helpfulness.
- Mechanic: Optimizes the model using human preference datasets (Context $\rightarrow$ Accepted vs. Rejected completions) using reinforcement learning or reward-based optimization.

New cards

What is a known side-effect of applying aggressive Safety Filtering using toxicity classifiers during dataset preprocessing?

Toxicity classifiers frequently demonstrate algorithmic bias, mistakenly flagging benign text written in non-dominant or minority dialects (e.g., African American English) as toxic. Consequently, filtering out these blocks systematically underrepresents minority cultural data, and ironically makes the downstream model less robust at accurately identifying actual toxicity.

New cards

Explain how the Self-Attention calculation matrix differs between a Causal Decoder and a Bidirectional Encoder.

Causal Decoder: Applies a causal attention mask to the raw score matrix ($QK^T$). It sets the upper triangular values to $-\infty$ so that the softmax function zeroes out attention coefficients for any token position further down the sequence ($j > i$).
Bidirectional Encoder: Completely removes the attention mask. Every token position can look forward and backward across the entire sequence length $N$:
$\text{head} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
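The difference between the two attention patterns comes down to one mask. The sketch below computes row-wise attention weights from a made-up score matrix, with and without the causal mask; the function name and numbers are assumptions for illustration.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax of a score matrix, optionally with a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are set to -inf when the causal mask is active.
        row = [scores[i][j] if (not causal or j <= i) else -math.inf
               for j in range(n)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy matrix standing in for QK^T / sqrt(d_k).
scores = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.9], [1.2, -0.4, 0.0]]
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
for row in causal:
    print([round(w, 3) for w in row])
```

In the causal output every entry above the diagonal is exactly zero, while the bidirectional output gives each position non-zero weight on the whole sequence.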

New cards

Write out the Masked Language Modeling (MLM) 80/10/10 token manipulation split and provide its theoretical justification.

Out of the 15% of total sequence tokens randomly chosen for manipulation, the split is applied as follows:

80% of the time: Replaced with the literal [MASK] token.
10% of the time: Replaced with a completely random word from the vocabulary.
10% of the time: Left entirely unchanged.

Justification: If chosen tokens were replaced with [MASK] 100% of the time, the model would only learn to optimize representations for explicit mask slots. Because [MASK] tokens never appear during downstream fine-tuning, this split ensures the model maintains high-quality, continuous contextual features for all tokens, forcing it to remain alert to correcting noisy or unmasked words.

New cards

Write the mathematical loss formulation ($\mathcal{L}_{MLM}$) for a Masked Language Model.

The model projects the final hidden layer representation $h_i^L$ of a manipulated token through the transposed embedding matrix ($E^T$) to yield vocabulary probabilities via a language modeling head ( $y_i = \text{softmax}(h_i^L E^T)$ ). The cross-entropy loss is computed only over the subset of manipulated tokens ( $M$ ):

$\mathcal{L}_{MLM} = -\frac{1}{|M|} \sum_{i \in M} \log P(x_i \mid h_i^L)$

New cards

Explain the architectural data setup required for the Next Sentence Prediction (NSP) auxiliary task in BERT.

Token Layout: Joint sentence pairs are concatenated. A [CLS] token is prepended to the absolute beginning of the sequence to capture the aggregate relationship representation. A [SEP] token is inserted between the two sentences and appended at the end.

Segment Embeddings: Unique First Segment and Second Segment vector embeddings are added directly onto the word and positional vectors to help the transformer mathematically distinguish between sentence boundaries.
Classification Head: The final state vector of the [CLS] token ( $h_{\text{CLS}}^L$ ) is passed to a classification head:
$y_i = \text{softmax}(h_{\text{CLS}}^L W_{NSP})$

New cards

Write out the language re-weighting probability formula used to balance multi-lingual training corpora. What is the Curse of Multilinguality?

To prevent massive high-resource datasets (e.g., English) from wiping out low-resource data in the shared token vocabulary, the raw selection probability $p_i$ for language $i$ is adjusted using a smoothing parameter $\alpha$:

$q_i = \frac{p_i^\alpha}{\sum_{j=1}^N p_j^\alpha}$

Setting $\alpha = 0.3$ upweights lower-resource languages.
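A quick numeric check of the re-weighting formula, assuming a two-language corpus with made-up raw shares of 0.9 and 0.1; `reweight` is a hypothetical helper name.

```python
def reweight(p, alpha):
    """Return q_i = p_i^alpha / sum_j p_j^alpha."""
    powered = [pi ** alpha for pi in p]
    total = sum(powered)
    return [v / total for v in powered]


p = [0.9, 0.1]               # hypothetical raw shares: high-resource, low-resource
q = reweight(p, alpha=0.3)
print([round(v, 3) for v in q])
```

With these shares, the low-resource language's sampling probability rises from 0.10 to roughly 0.34, and $\alpha = 1$ would leave the raw proportions unchanged.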

The Curse of Multilinguality: As the number of supported languages scales excessively high, the capacity of a fixed-parameter model becomes overstretched. Performance degrades across all individual languages, and dominant grammatical patterns can bleed into low-resource representations (known as "having an accent").

New cards

How do Contextual Embeddings represent polysemy (e.g., $\text{bank}^1$ vs $\text{bank}^2$ ) geometrically compared to traditional static dictionaries?

Unlike static embeddings (Word2Vec) which assign one unchanging vector to a word type, contextual embeddings extract vectors directly from high-level hidden output states ( $h_i^L$ ).

When projected into a 2D plane via dimensionality reduction (like UMAP), identical word types visibly separate into distinct, highly isolated spatial clusters matching their discrete semantic senses. Instead of enforcing rigid categories like traditional dictionaries, contextual embeddings map word meaning as a smooth, high-dimensional mathematical continuum.

New cards

Explain the 1-Nearest-Neighbor Baseline mechanism for automated Word Sense Disambiguation (WSD).

1. Pass a sense-labeled reference corpus (e.g., SemCor) through the language model to extract contextual embedding vectors for all labeled tokens.

2. For each specific word sense, compute a centralized vector representation by pooling/averaging its corresponding token vectors.

3. When an ambiguous word appears in a novel sentence context, calculate its live contextual embedding vector.

4. Assign it the known historical sense of its closest spatial neighbor using a simple 1-NN similarity metric.
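The four steps above can be sketched with toy 2-D vectors standing in for real contextual embeddings. Cosine similarity is assumed as the similarity metric, and the sense labels, vectors, and function names are all invented for the example.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_sense_vectors(labeled):
    """Steps 1-2: average the contextual vectors collected for each sense."""
    return {sense: mean_vector(vecs) for sense, vecs in labeled.items()}


def disambiguate(query_vec, sense_vectors):
    """Steps 3-4: return the sense whose vector is the nearest neighbor."""
    return max(sense_vectors, key=lambda s: cosine(query_vec, sense_vectors[s]))


# Hypothetical embeddings of "bank" tokens from a sense-labeled corpus.
labeled = {
    "bank^1 (financial)": [[0.9, 0.1], [0.8, 0.3]],
    "bank^2 (river)": [[0.1, 0.9], [0.2, 0.7]],
}
sense_vectors = build_sense_vectors(labeled)
print(disambiguate([0.3, 0.8], sense_vectors))
```

No parameters are trained in this baseline: all of the work is done by the pre-trained encoder that produced the vectors.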

New cards

Why is a raw pretrained LLM considered fundamentally misaligned with the requirements of a practical human assistant?

A raw pretrained LLM is trained exclusively on a self-supervised text-completion task (predicting the next token over raw web data). This creates two major flaws:

Insufficiently Helpful: It favors continuing an autoregressive text string over obeying the true intent of a prompt (e.g., repeating an instruction or writing a sequel to it rather than answering it).
Harmful/Unsafe: It has no built-in constraints preventing the generation of dangerous, false, or toxic text if that text represents a highly probable internet distribution pattern.

New cards

Define and contrast the two primary sequential phases of the Post-Training Alignment pipeline.

Instruction Tuning (Supervised Fine-Tuning / SFT): Updates all network parameters via cross-entropy loss using high-quality (Instruction, Response) demonstration pairs to teach the model generalized task-following behavior.
Preference Alignment: Refines the model's tone, safety, and utility by utilizing human or automated judgments to choose between multiple model-generated candidate outputs (e.g., via RLHF or DPO).

New cards

Contrast the training architecture and goal of Instruction Tuning (SFT) with Task-Based Fine-Tuning (Encoder-only).

Task-Based Fine-Tuning (Encoder-only): Discards generative capacities entirely to train a dedicated, specialized classification or sequence-labeling head on narrow labels.

Instruction Tuning (SFT): Retains the autoregressive language modeling head to learn generalized task-following behavior across a massive, diverse prompt space, resulting in meta-learning—where the model gains the ability to execute entirely novel, unseen instruction sets.

New cards

Explain the Leave-One-Cluster-Out strategy used to evaluate instruction-tuned models without data contamination.

Because instruction datasets overlap significantly, evaluation suites group benchmarks into distinct task clusters (e.g., placing all textual entailment datasets into a single cluster). During training, the model is fully fine-tuned on all other clusters and tested exclusively on the withheld cluster. This guarantees that the evaluation measures generalized instruction-following behavior rather than memorized task structures.

New cards

Write out the formal Bradley-Terry Model equation used to convert discrete preference observations into a continuous probability distribution.

The model assumes that any text completion possesses an underlying latent scalar quality score, or reward ( $z \in \mathbb{R}$ ). The probability that a human prefers output $o_i$ over $o_j$ given prompt $x$ is modeled as the logistic sigmoid ( $\sigma$ ) of their latent score difference:

$P(o_i \succ o_j \mid x) = \sigma(z_i - z_j) = \frac{1}{1 + e^{-(z_i - z_j)}}$

New cards

Provide the step-by-step mathematical derivation showing how setting the true latent score difference ( $\delta = z_i - z_j$ ) equal to the preference log-odds yields the Bradley-Terry sigmoidal probability.

Start by defining the log-odds (logit) of the preference:

$\delta = \log \left( \frac{P(o_i \succ o_j \mid x)}{1 - P(o_i \succ o_j \mid x)} \right)$

Exponentiate both sides to clear the logarithm:

$\exp(\delta) = \frac{P(o_i \succ o_j \mid x)}{1 - P(o_i \succ o_j \mid x)}$

Multiply through by the denominator:

$\exp(\delta) - \exp(\delta)P(o_i \succ o_j \mid x) = P(o_i \succ o_j \mid x)$

Isolate the probability term:

$\exp(\delta) = P(o_i \succ o_j \mid x)(1 + \exp(\delta))$

Solve for $P(o_i \succ o_j \mid x)$ and simplify:

$P(o_i \succ o_j \mid x) = \frac{\exp(\delta)}{1 + \exp(\delta)} = \frac{1}{1 + \exp(-\delta)} = \sigma(z_i - z_j)$

New cards

How is a baseline pretrained LLM architecturally modified to function as a parameterized Reward Model $r_\psi(x, o)$ ?

A pretrained base LLM is copied. Its final language modeling head (the token-prediction unembedding matrix) is stripped off. It is replaced with a newly added, randomly initialized linear layer that projects the final block representation down into a single real-valued scalar output.

New cards

Write out the binary cross-entropy Loss Function used to optimize a Reward Model over a preference distribution $\mathcal{D}$ .

$\mathcal{L}_{CE} = -\mathbb{E}_{(x, o_w, o_l) \sim \mathcal{D}} \left[ \log \sigma \big( r_\psi(x, o_w) - r_\psi(x, o_l) \big) \right]$

Where $o_w$ is the chosen preferred response, $o_l$ is the rejected response, and $r_\psi$ is the scalar reward value output by the network parameterized by weights $\psi$ .
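To see how this loss behaves, the sketch below evaluates it on a few hand-picked reward pairs. The reward values are invented, and the function names `log_sigmoid` and `reward_model_loss` are hypothetical; a real reward model would produce these scalars from the prompt and response.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(reward_pairs):
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) reward pairs."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)


tied = reward_model_loss([(0.0, 0.0)])        # no preference signal yet
correct = reward_model_loss([(4.0, -4.0)])    # preferred response scored far higher
inverted = reward_model_loss([(-4.0, 4.0)])   # ranking is the wrong way round
print(tied, correct, inverted)
```

The loss equals $\log 2$ when both responses get the same score, approaches zero as the preferred response pulls ahead, and grows roughly linearly in the gap when the ranking is inverted.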

New cards

Map the four core elements of the Reinforcement Learning (RL) framework used in alignment to their corresponding components within the LLM.

Action: Selecting the next discrete token during autoregressive generation.
State: The entire context window token history generated up to the current decoding step.
Policy ( $\pi_\theta$ ): The generative transformer language model weights being actively optimized.
Reward ( $r(x,o)$ ): The scalar value assigned to a completed prompt-response pair by the frozen, external Reward Model.

New cards

Write the core optimization equation used to find the ideal policy weights ($\pi^*$) that maximize expected reward across a prompt distribution.

$\pi^* = \arg\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, o \sim \pi_\theta(o \mid x)} [r(x, o)]$

Where the policy network $\pi_\theta$ is tuned to generate responses $o$ that yield the highest possible scalar scores from the environmental reward model $r(x, o)$ .