Exam 2 - LIGN 167


Description and Tags

New material for exam 2 (text decoding, audio modeling, vision and language models, RLHF & alignment, text chain-of-thought, RAG, hallucination, training data, cognitive modeling)


89 Terms

1

How do you repurpose LMs for a specific application?

Update the model:

  • fine-tuning (replace LM head with a task-specific classifier and update all parameters using loss on downstream task)

  • domain-adaptation (fine-tune LM on next-word prediction in a target data domain)

  • instruction-tuning (domain adaptation to texts containing prompt/answer pairs; apply in-context learning in downstream applications to novel tasks)

  • RLHF (reinforcement learning from human feedback, fine-tune weights to maximize human preferences)

Interact with the model:

  • prompting (description of task in prompt; model learns through self-attention only (no weight updates) and generates text responses)

  • in-context learning (demonstration of task in prompt; model learns through self-attention only (no weight updates) and generates textual responses)

2

What is the pretrain-and-finetune paradigm? Why is it an efficient approach?

The pretrain-and-finetune paradigm (popularized around 2018) uses a shared pretrained model for all applications, which is fine-tuned on task-specific data. This approach is a switch from custom-building every model. It reduces the need for large labeled datasets by leveraging large amounts of unlabeled data, and it efficiently reuses the parameters of pretrained models.

3

Define transfer learning

Transfer learning occurs when knowledge is acquired from one dataset/training objective and transfers to another.

(pre-trained weights from language models)

4

What are examples of repurposing LMs for a specific application?

Sequence classification:

  • sentiment

  • grammaticality

  • complexity

  • essay scoring

  • multiple-choice QA

Sequence pair classification:

  • paraphrase detection

  • natural language inference

Sequence tagging/labeling:

  • part-of-speech

  • named entity recognition

  • coreference

  • dependency labeling

  • span QA

Sequence-to-sequence:

  • translation

  • summarization

5

Examples of sequence classification

  • sentiment classification

  • grammatical acceptability

  • complexity

  • essay scoring

  • multiple-choice QA

6

Examples of sequence pair classification

  • paraphrase detection

  • natural language inference (NLI)

7

Examples of sequence tagging/labeling

  • part-of-speech

  • named entity recognition

  • coreference

  • dependency labeling

  • span QA

8

Examples of sequence-to-sequence

  • translation

  • summarization

9

What is out-of-domain (OOD) generalization?

OOD generalization is the ability of a model trained on one or more domains to generalize to unseen domains. Supervised models may overfit to the training domain, so OOD evaluation tests whether performance holds up outside it.

10

What is adversarial evaluation?

Supervised training can lead to simple heuristics that work for the bulk of training examples, but not for the long tail. Adversarial examples are designed to “trick” models that adopt these heuristics.

Example: “The doctor was paid by the actor.” —> Who paid whom? A model relying on surface order (first noun = agent) is confused by the passive.

11

What is named entity recognition (sequence tagging)?

Named entity recognition (NER) is a sequence tagging task in natural language processing that identifies and classifies named entities in text into predefined categories (e.g. people, organizations, locations). It works by assigning a tag to each word in a sentence.

12

Define Type I Error

False positive (predict positive, actually negative)

13

Define Type II Error

False negative (predict negative, actually positive)

14

Define precision

Measures the reliability of a model’s positive predictions by calculating the proportion of true positives out of all positive predictions

“Of all the times the model predicted a positive outcome, how often was it correct?”

Precision = TP / (TP + FP)

High precision score indicates that the model makes very few false positive errors.

15

Define negative predictive value

Measures the reliability of a model’s negative predictions by calculating the proportion of true negatives out of all negative predictions

“Of all the times the model predicted a negative outcome, how often was it correct?”

Negative predictive value = TN / (TN + FN)

A high negative predictive value indicates that the model makes very few false negative errors.

16

Define sensitivity

Measures the proportion of actual positive cases that were correctly identified by the model

Also called recall

“Of all the cases that were actually positive, what percentage did the model correctly predict as positive?”

SnNOUT: Sensitive test, Negative result = Rule Out

Sensitivity = TP / (TP + FN)

17

When is the sensitivity measure important?

The sensitivity metric is important when the cost of a false negative (i.e. Type II Error, missing an actual positive case) is high.

Example: medical diagnostic test for a serious disease —> highly sensitive test crucial for ensuring that very few people who have the disease are incorrectly told they are healthy

Negative result —> very effective at “ruling out” a disease (SnNOUT: Sensitive test, Negative result = Rule Out)

18

Define specificity

Measures the proportion of actual negative cases that are correctly identified by the model (i.e. true negative rate)

Specificity helps assess the model’s ability to avoid false positives (incorrectly predicting a positive outcome when the actual outcome was negative)

Specificity = TN / (TN + FP)

High specificity indicates that the model successfully classifies most of the true negative instances

19

What is Matthews Correlation?

statistical metric used to evaluate the performance of a binary classification model

balanced measure that considers all four categories of a confusion matrix (TP, TN, FP, FN)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

*mathematically equivalent to the Pearson product-moment correlation coefficient applied to a 2×2 contingency (confusion) table

20

Define harmonic mean and F1 score

The F1 score is the harmonic mean of precision and recall. It provides a single, balanced score that incorporates both metrics and is particularly useful for evaluating models on datasets with imbalanced classes.

The F1 score gives a single number that is closer to the lower of the two values (precision or recall), meaning a model must perform well on both to achieve a high F1 score.

F1 = 2 / (recall⁻¹ + precision⁻¹) = 2TP / (2TP + FP + FN)

*Doesn’t account for true negatives —> appropriate when true negatives are common and uninteresting —> biased towards the positive class
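
The cards above all derive from the same confusion-matrix counts. A minimal Python sketch pulling them together (the function name and example counts are illustrative, not from the course):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts
    (assumes no denominator is zero)."""
    return {
        "precision": tp / (tp + fp),        # reliability of positive predictions
        "npv": tn / (tn + fn),              # negative predictive value
        "recall": tp / (tp + fn),           # sensitivity / true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        ),
    }

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```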

21

What is span question-answering (sequence tagging)?

A bidirectional model identifies spans in a passage that answer the question.

The answer is a continuous “span” of text extracted directly from a given context

E.g. SQuAD dataset

22

What are examples of sequence-to-sequence tasks and how to evaluate them?

tasks

  • translation

  • summarization

  • question-answering

eval

  • BLEU score

  • BERT score

23

Explain reference and grounding in natural language

Reference is part of language understanding. It is a relation between a signifier (a linguistic form) and a signified (a real-world thing). For humans, natural language is grounded in sensory experience.

24

Historically, what has been the go-to approach for image encoding?

convolutional neural networks (CNNs)

*still widely used

25

What challenges do you encounter trying to adapt the transformer architecture to images (pixel arrays)?

  • 2- (or 3-) dimensional data — text sequences are 1-dimensional but images are 2-dimensional

  • very large number of pixels (compare to number of words passed into LMs)

26

How do you fix the issues associated with adapting transformer architecture to images?

  • vision transformer (ViT)

  • discrete VAE

27

What is a vision transformer (ViT)?

A vision transformer (ViT) is a deep learning model that adapts the transformer architecture for image analysis by treating an image as a sequence of patches (similar to how NLP models process words)

Process:

  • break image up into patches (# patches as hyperparameter)

  • linearize patches (flatten out; concatenate rows; positional embedding for which patch it is)

  • project down to lower dimensional embeddings (map long pixel vector into more condensed, lower dimensional embedding)

  • train on image classification

This approach allows the model to capture both local and global relationships across the entire image simultaneously (unlike CNNs that start with local features)
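
A minimal PyTorch sketch of the front end described above (patch size, image size, and embedding width are illustrative defaults, not the lecture's):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT front end: split the image into patches, flatten each patch into a
    pixel vector, and project it down to a lower-dimensional embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size   # flattened patch length
        self.proj = nn.Linear(patch_dim, embed_dim)         # down-projection
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):                              # (B, C, H, W)
        B = images.shape[0]
        p = self.patch_size
        patches = images.unfold(2, p, p).unfold(3, p, p)    # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        return self.proj(patches) + self.pos_emb            # patch + positional embeddings
```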

28

What is a discrete VAE?

Discrete VAEs (variational auto-encoders) are generative models that use categorical or binary latent spaces to capture structured and interpretable representations in domains like text, images, and graphs.

How it works:

Encode image using discrete tokens

  • define vocab of discrete image/token embeddings —> each image represented as list of embeddings

  • discrete representation for every image

Train autoencoder to reconstruct image

  • trying to reconstruct input

  • throw away decoder in the end

Summary of process: put image into encoder —> get discrete embeddings —> pass into separate decoder —> try to reconstruct original image

29

How was VisualBERT (2019) trained?

initialize transformer with pretrained BERT

continue training model with image and language data (example of transfer learning)

training objectives:

  • image-conditioned masked language modeling

  • sentence-image prediction

30

Training objectives for VisualBERT (2019) and other similar possibilities

Image-conditioned masked language modeling: given the text and image, mask out part of the text and predict the masked tokens

  • the hope is that prediction becomes easier with an image to condition on

  • encourages cross-modal connections

sentence-image prediction: provide two captions, model decides whether (a) both match the image or (b) only one matches

Possibilities

  • masked image modeling: fill in masked patches of image

  • masked multimodal modeling: mask both image and text, have to fill in both at the same time

31

Architectural choices for vision models

  • single-stream: one transformer that accepts inputs from both modalities

  • dual-stream: two transformers, one for each modality

  • early vs. late fusion: how far into the model do multimodal transformer layers appear? how early do we see interaction between modalities (e.g. cross-attention)?

32

What is CLIP?

Contrastive language-image pre-training

A CLIP model is a type of neural network that learns to connect images and text by jointly training an image encoder and a text encoder.

Approach:

  • construct a batch of N images × N texts (paired data)

  • labels (matching texts = positive samples, mismatching texts = negative samples)

  • objective: maximize similarity between positive samples and minimize similarity between negative samples

    • dot product of text and image encoder outputs —> similarity score

    • minimize the loss L(B) = −Σ_matching (w · v) + Σ_mismatched (w · v), where w is a text embedding and v an image embedding
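
A minimal PyTorch sketch of the simplified objective above; note the real CLIP objective is a temperature-scaled cross-entropy over the same N × N similarity matrix rather than this raw sum:

```python
import torch

def simplified_clip_loss(w, v):
    """w: (N, d) text embeddings, v: (N, d) image embeddings; row i of each
    encodes one matching text-image pair."""
    sims = w @ v.T               # (N, N) dot products between all texts and images
    pos = sims.diagonal().sum()  # matching pairs: similarity to maximize
    neg = sims.sum() - pos       # mismatched pairs: similarity to minimize
    return -pos + neg
```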

33

What is the visual question answering evaluation dataset? What is a problem with it?

Images taken from diverse sources

Human-authored questions

Problem: simple heuristics work well (e.g. a text-only baseline that never sees the image can score surprisingly high; comparing its accuracy to the full model’s shows how important the visual component really is for task performance)

34

What is the Winoground evaluation dataset?

Adversarial evaluations designed to “stump” models using simple heuristics.

Inspired by “Winograd schemas” (“The trophy wouldn’t fit in the suitcase because it was too small/big. What is ‘it’?”)

Pairs are balanced so that a language-only baseline would achieve 50% accuracy

35

Where does most vision-text data come from? Other possible sources?

Most vision-text data is image-caption data.

An easy source is alt text (accessibility text for vision impaired).

Other possible sources include subtitles/captions from movies (e.g. people sitting in a cafe talking); such captions don’t always have grounding in the image.

36

What are some advantages/disadvantages of caption data?

advantages

  • learn what’s important about images

  • text is grounded in important parts of image

disadvantages

  • the relationship between text and image in real life isn’t always so direct (i.e. what is said often has little to do with what is seen) —> model struggles

  • caption texts describe concrete things (e.g. physical objects, actions, colors), so the linguistic domain is narrow (a small subset of how language is actually used)

37

How does vision-language fine-tuning affect language performance?

There is catastrophic forgetting of complex language, because captions are from a highly restricted/simple domain.

Example: VisualBERT forgets language (unrelated to captions) during fine-tuning

38

What kind of models typically are image generation models?

Most image generation models are diffusion models.

Models are trained to iteratively remove noise from an image (with or without cross-attention to text). Images are generated by inputting random noise and iteratively denoising.

Example: Stable Diffusion (Rombach et al., 2022)

39

What happens if you put a prompt into a simple pretrained LM and generate from it?

The model tries to continue the text in a way similar to what it’s seen before (e.g. a model trained on data scraped from the internet hasn’t really seen question-answer pairs).

It’s minimizing cross-entropy loss on token prediction.

40

What if you ask a pretrained LM to continue a toxic text?

It would try to continue the text as best as possible, which will likely include a toxic answer. This presents a problem for user-facing products.

41

What if you ask a pretrained LM a question it doesn’t know the answer to?

The LM will fall back on prior patterns/frequencies in its training data. This can include matching the tone of the prompt —> it aims to produce text that looks as similar as possible to its training data rather than admitting uncertainty.

42

What if you ask a pretrained LM to help you do something dangerous?

The LM will try to output text that looks similar to training data. If LM is trained on data scraped from the internet, it will likely give instructions to do something dangerous when asked.

43

What is AI Alignment?

Alignment is the process of adapting a pretrained generative model to the needs of human users and the greater good.

44

What are several (common) goals of alignment?

  • helpfulness: AI should be able to solve problems for humans through interactions that are natural and easy for users

    • prompt-engineering shouldn’t be required for helpful model

    • handle typos, brief inputs, mistakes —> robustness

  • appropriateness: AI should not generate texts that are rude, obscene, violent, etc.; “family-friendly” product

  • safety: AI should refuse to comply with requests that are harmful to users, other people, or society

  • calibration: AI should be able to reflect on its level of confidence and be transparent when it is uncertain

    • knowledge is limited for models and humans —> don’t want LMs to make up facts because it seems helpful or resembles text

  • legal compliance: copyrighted material, what text do LMs have permission to share?

  • privacy

45

What is prompt-engineering?

Prompt-engineering involves designing a prompt to maximize model performance.

46

Why are instruction-response pairs largely out-of-domain for pretrained LMs?

Pretrained LMs are trained on raw text scraped from the internet, which rarely contains explicit instruction/response pairs: the model has mostly seen text to be continued, not instructions to be followed. Instruction-tuning is domain adaptation that closes this gap.

47

What are some methods for creating instruction-response pairs?

  • hand crafting

  • adapting labeled datasets (e.g. classification datasets, translation pairs)

  • use LLMs to generate examples

48

What is a system prompt?

The system prompt is text that is prepended to any user prompt. It supplies a persona and general instructions to the model that should apply in all interactions.
*text, not code!

49

What is the goal of preference learning?

The goal of preference learning is to teach models how to optimize outputs for maximizing human preferences.
*Humans share which answer they prefer

50

How can you acquire preference data?

  • present humans with several LLM outputs to a prompt

    • raw scores, pairwise preferences, best of n selection

  • use previously annotated data from Stack Exchange/Reddit

    • forums with upvoting and downvoting

    • easy to acquire by scraping internet

51

What does training a reward model do?

You can train a reward head on top of an LLM to predict human preferences (using labeled preference data).

It outputs a z-score given candidate and prompt inputs.

52

What is the optimization problem/equation in reinforcement learning?

Learn a policy π from which you can sample a sequence of actions a (i.e. generated tokens) that maximizes the expected output of the reward model, r(x, o)

53

KL regularization and reinforcement learning

KL regularization limits deviations from the original pretrained model
(avoid catastrophic forgetting)

Equation note: subtracting from reward if not deviating too far from original distribution
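
One standard way to write the resulting KL-regularized objective (β for the regularization strength and π_ref for the original pretrained model are common notation, not necessarily the lecture's):

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim D,\; o \sim \pi(\cdot \mid x)}
\left[\, r(x, o) \;-\; \beta \log \frac{\pi(o \mid x)}{\pi_{\text{ref}}(o \mid x)} \,\right]
```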

54

RLHF: what do humans (actually) prefer?

  • good formatting/presentation: tables, bullets, clear structure, etc.

  • confident responses (—> hallucination)

  • flattery (—> sycophancy)

55

What happens when you instruct annotators to select responses that align with business goals?

Annotator selections then steer the model toward those goals, e.g. to:

  • mitigate social biases

  • espouse certain political viewpoints

  • refrain from discussing certain topics

56

What is chain-of-thought prompting? What are reasoning models?

CoT prompting improves reasoning capabilities of models by encouraging them to break down complicated problems into steps

Reasoning models are trained specifically for chain-of-thought prompting.

Prompt-engineering —> add “think/solve step-by-step” at the end of input

57

What is ancestral sampling?

Ancestral sampling is one of the simplest text decoding algorithms.

Steps:

  1. Pass in all previously generated text, or a user-specified prefix

  2. Use the LM head to output a distribution over the next token

  3. Sample from that distribution

  4. Append the sampled token to the prefix

  5. Repeat steps 1-4 until you generate the EOS token, or a predefined # of max tokens
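
A minimal sketch of this loop, assuming a Hugging Face-style causal LM and tokenizer (the exact API names are assumptions):

```python
import torch

@torch.no_grad()
def ancestral_sample(model, tokenizer, prefix, max_new_tokens=100):
    ids = tokenizer.encode(prefix, return_tensors="pt")    # user-specified prefix
    for _ in range(max_new_tokens):                        # predefined max # of tokens
        logits = model(ids).logits[:, -1, :]               # LM head: next-token logits
        probs = torch.softmax(logits, dim=-1)              # distribution over next token
        next_id = torch.multinomial(probs, num_samples=1)  # sample from that distribution
        ids = torch.cat([ids, next_id], dim=-1)            # append sampled token to prefix
        if next_id.item() == tokenizer.eos_token_id:       # stop at EOS
            break
    return tokenizer.decode(ids[0])
```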

58

What is maximum a posteriori decoding?

Maximum a posteriori (MAP) decoding selects the output with the highest posterior probability given the observed data. In text decoding, this means finding the single most probable continuation under the model given the prompt.

59

When would you want to use maximum a posteriori decoding?

You would want to use it when you have prior knowledge or assumptions about the parameter you are trying to estimate (especially where observed data is noisy or incomplete).

Want to find the most probable continuation —> on-topic response

60

What are the problems with maximum a posteriori decoding?

  • can be expensive to compute

    • the search space is technically infinite (unless you set a max length)

    • the search space grows exponentially

  • can be degenerate

    • e.g. most likely continuation may just be EOS

  • hallucinations (problem in all text generation algorithms)

  • no creativity - expected, bland response based on observed data

61

What is greedy decoding?

At each step in greedy decoding, the model always generates the token with the highest probability.

62

How does greedy decoding prevent issues like degenerate strings (i.e. the empty strings)?

Greedy decoding makes a local choice at each step rather than comparing total probabilities of complete sequences. Under MAP decoding, a degenerate string like the empty string can win because every added token lowers the total sequence probability; greedy decoding never makes that global comparison, so it only produces the empty string if EOS happens to be the single most likely first token.

63

What is a problem with greedy decoding?

Greedy decoding can lead to highly repetitive text.

Once the text starts repeating, the probability of continuing the repetition increases while the probability of other words goes to zero, so the model gets stuck in a loop.

64

What is beam search?

Beam search is like greedy search, but at each step t, you greedily save the top k generations. k is referred to as the size of the beam.
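
A minimal sketch under the same Hugging Face-style assumptions as the ancestral sampling example; beams are scored by summed token log-probabilities:

```python
import torch

@torch.no_grad()
def beam_search(model, ids, eos_id, k=4, max_new_tokens=50):
    beams = [(ids, 0.0)]                             # (sequence, summed log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == eos_id:          # finished beams carry over
                candidates.append((seq, score))
                continue
            logp = torch.log_softmax(model(seq).logits[:, -1, :], dim=-1)
            top_logp, top_ids = logp[0].topk(k)      # expand each beam k ways
            for lp, tok in zip(top_logp, top_ids):
                new_seq = torch.cat([seq, tok.view(1, 1)], dim=-1)
                candidates.append((new_seq, score + lp.item()))
        # greedily save the top k generations at this step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[0, -1].item() == eos_id for seq, _ in beams):
            break
    return beams[0][0]                               # highest-scoring sequence
```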

65

How does beam search prevent degenerate generations?

Beam search keeps the top k generations at each step t. By introducing more options than greedy decoding, we attempt to limit the possibility of degenerate generations (i.e. repetitions).

66

How does beam search prevent repetition?

In beam search, you compare all possible continuations. If a possible continuation has many repetitions, it will have a lower probability.

67

What else is beam search good for?

Beam search can also lead to more uniform information density (Meister et al., 2021)

68

What is stochastic sampling? Example?

A method for introducing randomness by sampling tokens according to the model’s predicted probability distribution

Example: ancestral sampling

Contrasts with deterministic methods like greedy decoding

69

What is the problem with ancestral sampling?

Anything is possible, which can randomly lead to weird outputs.

70

What is the temperature parameter? What do its values mean?

Temperature is a hyperparameter set to manipulate the degree of randomness in sampling.

T —> 0: greedy decoding

T = 1: classic ancestral sampling

T —> infinity: uniform random sampling

71

How do we get probabilities from neural networks? (temperature)

Incorrect: normalize the raw outputs directly

Correct: apply the softmax function to the logits

Equation note: for temperature, divide each logit by a constant T before applying softmax
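
A minimal sketch of the temperature-scaled softmax, with the three regimes from the neighboring cards:

```python
import torch

def temperature_softmax(logits, T=1.0):
    """Divide each logit by T before normalizing with softmax."""
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.1])
print(temperature_softmax(logits, T=0.1))    # T -> 0: sharper, approaches greedy decoding
print(temperature_softmax(logits, T=1.0))    # T = 1: unchanged (ancestral sampling)
print(temperature_softmax(logits, T=100.0))  # T -> inf: flatter, approaches uniform
```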

72

What kind of decoding/sampling when T —> 0?

No randomness —> greedy decoding

Dividing by a small number —> amplify differences between logits —> softmax becomes sharper (more extreme values) —> model more confident/deterministic

73

What kind of decoding/sampling when T = 1?

Default case

Logits remain unchanged —> ancestral sampling

74

What kind of decoding/sampling when T —> infinity?

Uniform random sampling —> all tokens equally probable —> pure randomness

Dividing by a large number —> shrinks differences between logits —> softmax becomes flatter —> rare tokens more likely

75

How does temperature affect softmax?

Lower temperature —> sharper softmax (more extreme highs and lows)

Higher temperature —> flatter softmax (values closer together)

76

When would you want a large or small temperature?

Want more creative/unexpected output (e.g. poetry) —> larger temperature

Want less creative, safer, more deterministic output (e.g. medical, legal) —> smaller temperature

77

What are truncated distribution methods?

For truncated distribution methods, you set some elements of the sample space (i.e. words in the vocab) to have probability 0 and renormalize. Preferably, this is excluding words that we would never want to generate (rather than just assigning them very low probability).

78

How would you decide what to exclude from the distribution? (truncated distribution methods)

Common criteria (see the following cards): keep only the k highest-probability tokens (top-k), keep the smallest set of tokens whose cumulative probability reaches p (nucleus/top-p), or keep tokens whose surprisal is close to the distribution’s entropy (typical sampling). The aim is to cut off the long, unreliable tail of the distribution.

79

What is top-k sampling?

Top-k sampling involves sampling from only the k words in the vocab with the highest probability: v^(k)
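
A minimal sketch over a 1-D vector of next-token logits:

```python
import torch

def top_k_sample(logits, k):
    top_logits, top_ids = logits.topk(k)       # keep only the k most probable tokens
    probs = torch.softmax(top_logits, dim=-1)  # renormalize over that subset
    choice = torch.multinomial(probs, num_samples=1)
    return top_ids[choice]                     # map back to a vocabulary id
```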

80

In top-k sampling, what happens when k = 1? What happens when k = |V|?

k = 1 —> picking highest probability word —> greedy decoding

k = |V| —> don’t prune anything from vocabulary —> ancestral sampling

81

What is nucleus sampling (top-p)?

In nucleus sampling, you select the most likely words from the vocab until the total probability mass is >= p. Then, you sample from that subset.
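
A minimal sketch, again over a 1-D logits vector:

```python
import torch

def top_p_sample(logits, p=0.9):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    # smallest set of most likely tokens with cumulative mass >= p
    cutoff = int((sorted_probs.cumsum(dim=-1) < p).sum()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_ids[choice]
```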

82

When does nucleus sampling method give the same result as ancestral sampling?

p = 1 —> take everything from vocabulary —> ancestral sampling

83

What is typical sampling?

Typical sampling is a decoding method that samples tokens whose information content (surprisal) is close to the entropy of the model’s distribution, i.e. tokens in the model’s “typical set.”

Differs from top-k/top-p sampling because it uses entropy directly
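
A minimal sketch of the idea, keeping the tokens whose surprisal is closest to the entropy until a target mass τ is reached (τ plays the role p plays in nucleus sampling; this follows Meister et al.'s recipe, not necessarily the lecture's):

```python
import torch

def typical_sample(logits, tau=0.95):
    probs = torch.softmax(logits, dim=-1)
    surprisal = -torch.log(probs)                  # information content per token
    entropy = (probs * surprisal).sum()            # H = E[-log p]
    order = (surprisal - entropy).abs().argsort()  # most "typical" tokens first
    cutoff = int((probs[order].cumsum(dim=-1) < tau).sum()) + 1
    keep = order[:cutoff]
    subset = probs[keep] / probs[keep].sum()       # renormalize over the kept subset
    choice = torch.multinomial(subset, num_samples=1)
    return keep[choice]
```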

84

What is calibration?

Calibration is the match between a model’s expressed confidence and its actual probability of being correct. LMs are often poorly calibrated: they sound confident even when they are not correct.

85

What are some problems with using LMs as knowledge bases to answer factual questions?

  • hallucination: LMs “make up” factually incorrect information in their response

  • calibration: LMs sound confident even when they are not correct

  • proprietary information: LMs aren’t trained on private data (assuming good actors), so they are less useful in internal settings

  • static models: knowledge is constantly changing, and training LMs continually on new data is expensive and complicated

86

What is RAG?

RAG stands for retrieval-augmented generation. It combines information retrieval (e.g. a Google-style search engine) with prompt-based LM text generation.

Steps:

  1. use information retrieval techniques to find documents that are relevant to the user’s question

  2. prepend the documents to the model prompt

  3. additional prompt engineering (e.g. “based on these texts, answer this question: …”)

  4. generate a response
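
A minimal sketch of this pipeline; `retriever` and `llm` are hypothetical stand-ins for an information-retrieval system and a prompt-based LM, not a specific library:

```python
def rag_answer(question, retriever, llm, k=3):
    docs = retriever.search(question, top_k=k)  # 1. retrieve relevant documents
    context = "\n\n".join(docs)                 # 2. prepend documents to the prompt
    prompt = ("Based on these texts, answer this question.\n\n"  # 3. prompt engineering
              f"Texts:\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm.generate(prompt)                 # 4. generate the response
```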

87

What is tf-idf?

tf-idf stands for term frequency — inverse document frequency. Together, they measure how important a word is to a specific document relative to the entire corpus.

Every document/query is represented by a tf-idf vector d/q, in which each entry is the tf-idf score for a different term
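
A minimal sketch using one standard definition (tf = raw term count, idf = log(N / document frequency)); the whitespace tokenization is deliberately naive:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
    idf = {term: math.log(n / count) for term, count in df.items()}
    # one sparse tf-idf vector per document: term -> tf * idf
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in tokenized]

print(tfidf_vectors(["the cat sat", "the dog sat", "neural networks"]))
```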

88

What is an approach for information retrieval using BERT? Any problems with this approach?

Pass query and document together to BERT, and output a score indicating probability of match

Problem: highly inefficient because you need to pass every document through BERT each time you get a query

89

What is a second approach for information retrieval using BERT? Any problems with this approach?

Precompute document vectors and compute similarity score (dot product) with query vector on-the-fly

Problem: document vectors are not query-aware
