Language Models and Sequence-to-Sequence Models

Motivations

The goal is to use representations of tokens to model a sequence of tokens for a specific task.
Distinguish between word salad, spelling errors, and grammatical sentences.
Language Models (LMs) define probability distributions over strings in a language.
LMs can generate strings.
LMs can score/rank candidate strings to choose the most likely.
- If $P_{LM}(A) > P_{LM}(B)$, return A.

Language Models

Grammatically incorrect or rare sentences should be improbable.
LMs determine the probability of a word following a sequence of words.
Two classes of models:
- Count-based: Markov assumptions with smoothing.
- Neural models.

Framework

Given a sequence of tokens $(t_1, …, t_n)$ from a vocabulary $V$, estimate $p(t_{n+1} | t_1, …, t_n)$.
Formally, compute the probability distribution of the next word, where it can be any word in the vocabulary.
A system that does this is called a Language Model.
Language Modeling is the task of predicting what word comes next, e.g., "The students opened their ____".
An LM assigns a probability to a piece of text.
For text $x$ , the probability is $P(x)$ .

N-gram Language Models

An n-gram is a chunk of n consecutive words.
- Unigrams: "the", "students", "opened", "their".
- Bigrams: "the students", "students opened", "opened their".
- Trigrams: "the students opened", "students opened their".
- Four-grams: "the students opened their".
Collect statistics on n-gram frequency to predict the next word.
Markov assumption: $x^{(t+1)}$ depends only on the preceding n-1 words.
Formula for calculating probabilities:
- $P(w_n | w_1, …, w_{n-1}) = \frac{\text{count}(w_1, …, w_n)}{\text{count}(w_1, …, w_{n-1})}$

N-gram Language Models: Example

Learning a 4-gram LM.
Example: "as the proctor started the clock, the students opened their ____"
In the corpus:
- "students opened their" occurred 1000 times.
- "students opened their books" occurred 400 times.
- $P(\text{books } | \text{ students opened their}) = 0.4$
- "students opened their exams" occurred 100 times.
- $P(\text{exams } | \text{ students opened their}) = 0.1$
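
The count ratio in this example can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy corpus; the helper names `ngram_counts` and `mle_prob` are hypothetical and not taken from any library.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(word, context, tokens):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    numer = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    denom = ngram_counts(tokens, len(context))[context]
    # The estimate is undefined when the context never occurred.
    return numer / denom if denom else None

# Toy corpus (an assumption for illustration only).
corpus = ("the students opened their books . "
          "the students opened their exams . "
          "the students opened their books .").split()

context = ["students", "opened", "their"]
p_books = mle_prob("books", context, corpus)
p_exams = mle_prob("exams", context, corpus)
```

On this toy corpus "students opened their" occurs 3 times and is followed by "books" twice, so the estimate for "books" is 2/3. It is the same kind of count ratio as 400/1000 = 0.4 in the example.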

Sparsity Problems with N-gram Language Models

Problem 1: If "students opened their w" never occurred in the data, then w has probability 0!
- Solution: Add a small $\delta$ to the count for every $w \in V$ (smoothing).
Problem 2: If "students opened their" never occurred in the data, then we can’t calculate the probability for any w!
- Solution: Condition on "opened their" instead (backoff).
Note: Increasing n makes sparsity problems worse; typically, n cannot be bigger than 5.
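
Both fixes can be sketched on a toy corpus. The code below is a simplified illustration, not a production smoothing scheme: `smoothed_prob` implements add-$\delta$ smoothing, and `backoff_context` is the crudest form of backoff (it shortens the context without the discounting that schemes such as Katz backoff apply). All function names and the corpus are assumptions made for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_prob(word, context, tokens, vocab, delta=0.1):
    """Add-delta estimate: (count(context, word) + delta) / (count(context, *) + delta * |V|)."""
    context = tuple(context)
    counts = count_ngrams(tokens, len(context) + 1)
    # Total count of n-grams that start with this context.
    context_total = sum(c for gram, c in counts.items() if gram[:-1] == context)
    return (counts[context + (word,)] + delta) / (context_total + delta * len(vocab))

def backoff_context(context, tokens):
    """Drop words from the left until the context has been seen with a following word."""
    context = tuple(context)
    while context:
        counts = count_ngrams(tokens, len(context) + 1)
        if any(gram[:-1] == context for gram in counts):
            return context
        context = context[1:]
    return context  # empty tuple: fall back to unigram counts

# Toy corpus (an assumption for illustration only).
corpus = ("the students opened their books . "
          "the students opened their exams . "
          "the students opened their books .").split()
vocab = set(corpus)

# "the" never follows "students opened their", yet its probability is now non-zero.
p_unseen = smoothed_prob("the", ["students", "opened", "their"], corpus, vocab)
# "teachers opened their" was never seen, so we back off to "opened their".
shorter = backoff_context(["teachers", "opened", "their"], corpus)
```

Recomputing the counts on every call keeps the sketch short; a real implementation would count once and store the tables.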

Storage Problems with N-gram Language Models

Need to store count for all n-grams seen in the corpus.
Increasing n or increasing the corpus size increases the model size.

N-gram Language Models in Practice

A trigram LM can be built over a 1.7 million word corpus (Reuters) in a few seconds on a laptop.
Example: "today the ____"
- company: 0.153
- bank: 0.153
- price: 0.077
- italian: 0.039
- emirate: 0.039
Sparsity problem: not much granularity in the probability distribution.

Evaluation: How Good Is Our Model?

Does our language model prefer good sentences to bad ones?
Assign higher probability to "real" or "frequently observed" sentences than "ungrammatical" or "rarely observed" sentences.
Train parameters on a training set.
Test model’s performance on unseen data (test set).
An evaluation metric tells us how well our model does on the test set.

Extrinsic Evaluation of LMs

Compare models A and B by putting each model in a task (spelling corrector, speech recognizer, Machine Translation system).
Run the task, get an accuracy for A and for B.
Compare accuracy for A and B.

Difficulty of Extrinsic Evaluation of LMs

Extrinsic evaluation is time-consuming.
Intrinsic (in-vitro) evaluation: perplexity.
Directly measures language model performance at predicting words.
Doesn't necessarily correspond with real application performance, unless the test data looks just like the training data.
Gives us a single general metric for language models, useful for large language models (LLMs) as well as n-gram LMs.

Training Sets and Test Sets

Train parameters of our model on a training set.
Test the model’s performance on data we haven’t seen.
A test set is an unseen dataset; different from the training set.
We want to measure generalization to unseen data.
An evaluation metric (like perplexity) tells us how well our model does on the test set.

Choosing Training and Test Sets

If we're building an LM for a specific task, the test set should reflect the task language.
If we're building a general-purpose model:
- We'll need lots of different kinds of training data.
- We don't want the training set or the test set to be just from one domain or author or language.

Training on the Test Set

Don’t allow test sentences into the training set.
Or else the LM will assign that sentence an artificially high probability when we see it in the test set.
And hence assign the whole test set a falsely high probability, making the LM look better than it really is.
This is called "Training on the test set", and it’s bad science!

Intuition of Perplexity

A good LM prefers "real" sentences.
It assigns higher probability to “real” or “frequently observed” sentences.
It assigns lower probability to “word salad” (i.e., random sequences of words) or “rarely observed” sentences.

Predicting Upcoming Words

The Shannon Game: How well can we predict the next word?
- I always order pizza with cheese and ___
- The 33rd President of the US was ___
- I saw a ___
Unigrams are terrible at this game.
A better model of a text is one that assigns a higher probability to the word that actually occurs.
For the first sentence, a model might assign:
- mushrooms (0.1)
- pepperoni (0.1)
- anchovies (0.01)
- fried rice (0.0001)

Perplexity

The best language model is one that best predicts the entire unseen test set.
Generalize to all the words; the best LM assigns high probability to the entire test set.
When comparing two LMs, A and B, compute $P_A(\text{test set})$ and $P_B(\text{test set})$. The better LM will give a higher probability to the test set.
Perplexity is the inverse probability of the test set, normalized by the number of words.
$\text{Perplexity} = P(w_1, …, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, …, w_N)}}$
Probability range is [0, 1], perplexity range is [1, ∞). Minimizing perplexity is the same as maximizing probability.
Lower perplexity = better language model.
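
As a small sketch of the definition, perplexity can be computed from the per-word probabilities a model assigns to a test sequence. By the chain rule, the probability of the whole sequence is the product of these conditional probabilities. The function name `perplexity` and the example numbers are assumptions for illustration; the sum of logs is used instead of a direct product only to avoid numerical underflow on long texts.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence, given P(w_i | w_1..w_{i-1}) for each word.

    Equivalent to P(w_1..w_N) ** (-1/N), computed in log space.
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Hypothetical per-word probabilities from two models on the same 4-word test text.
model_a = [0.2, 0.5, 0.1, 0.4]
model_b = [0.1, 0.1, 0.05, 0.2]

ppl_a = perplexity(model_a)
ppl_b = perplexity(model_b)

# A model that gives every word probability 0.1 behaves like a uniform
# choice among 10 words, so its perplexity is (approximately) 10.
ppl_uniform = perplexity([0.1] * 5)
```

Model A assigns higher probability to every word of the test text than model B, so its perplexity is lower.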

Neural Language Model

Neural Networks can be trained on very large amounts of data.
Neural Networks can use continuous representations of input tokens, which capture the distributional hypothesis efficiently.

Neural Language Models

Language modeling: predict the next word.
A Neural Language Model is a neural network trained to predict the next word.
Several neural architecture choices:
- Feedforward neural networks
- Recurrent neural networks
- Transformer neural networks

Building a Neural Language Model

Recall the Language Modeling task:
- Input: sequence of words
- Output: probability distribution of the next word

A Fixed-Window Neural Language Model

Improvements over n-gram LM:
- No sparsity problem
- Don’t need to store all observed n-grams
Remaining problems:
- Fixed window is too small
- Enlarging window enlarges $W$
- Window can never be large enough! We need a neural architecture that can process any length input.

Recurrent Neural Networks (RNN)

A family of neural architectures (RNN, LSTM, GRU).

A Simple RNN Language Model

$y^{(t)} = \text{softmax}(U h^{(t)} + b_2) \in \mathbb{R}^{|V|}$
$h^{(t)} = \sigma(W_h h^{(t-1)} + W_e e^{(t)} + b_1)$
$e^{(t)} = E x^{(t)}$
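
The three equations can be traced with a small forward pass. This is a sketch only: it uses plain Python lists to stay dependency-free, random untrained weights, made-up sizes, and it assumes $\sigma$ is the logistic sigmoid. The helper names (`matvec`, `rnn_step`, `rand_matrix`) are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_onehot, h_prev, E, W_h, W_e, U, b1, b2):
    e = matvec(E, x_onehot)  # embedding lookup: E x
    pre = [a + b + c for a, b, c in zip(matvec(W_h, h_prev), matvec(W_e, e), b1)]
    h = [sigmoid(p) for p in pre]  # new hidden state
    logits = [a + b for a, b in zip(matvec(U, h), b2)]
    return h, softmax(logits)  # distribution over the vocabulary

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Assumed toy sizes: vocabulary 5, embedding 3, hidden state 4.
V, d, H = 5, 3, 4
E, W_h, W_e, U = rand_matrix(d, V), rand_matrix(H, H), rand_matrix(H, d), rand_matrix(V, H)
b1, b2 = [0.0] * H, [0.0] * V

h = [0.0] * H  # initial hidden state
for word_id in [2, 0, 4]:  # an arbitrary input sequence of word indices
    x = [1.0 if i == word_id else 0.0 for i in range(V)]
    h, y = rnn_step(x, h, E, W_h, W_e, U, b1, b2)
```

The same weights are reused at every step, which is why the loop can run over a sequence of any length.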

RNN Advantages:

Can process any length input.
Computation for step t can (in theory) use information from many steps back.

RNN Disadvantages:

Recurrent computation is slow
In practice, difficult to access information from many steps back.

Generating Text with an RNN Language Model

Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling.
Sampled output becomes the next step's input.
Train an RNN-LM on any kind of text, then generate text in that style.
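
The sampling loop itself is independent of the model type. The sketch below uses a hand-written table of next-word distributions as a stand-in for a trained model (an assumption made to keep the example self-contained); with an RNN-LM, the lookup `lm[word]` would be replaced by one step of the network.

```python
import random

# Hypothetical next-word distributions standing in for a trained LM.
TOY_LM = {
    "<s>": {"the": 1.0},
    "the": {"students": 0.6, "exams": 0.4},
    "students": {"opened": 1.0},
    "opened": {"the": 1.0},
    "exams": {"</s>": 1.0},
}

def generate(lm, max_len=20, seed=0):
    """Sample one word at a time; each sampled word becomes the next input."""
    rng = random.Random(seed)
    word, output = "<s>", []
    for _ in range(max_len):
        dist = lm[word]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":  # end-of-sequence token: stop generating
            break
        output.append(word)
    return output

sample = generate(TOY_LM)
```

Different seeds give different samples, since each word is drawn from the distribution rather than chosen as the single most likely continuation.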

Language Models Summary:

Probabilities of sentences, various uses.
Traditional language model: based on counts of words in context.
Neural language models.
Both types need a lot of data to train.
Usually evaluated using perplexity.

Sequence-to-Sequence Framework

Input sequence of tokens: $(x_1, …, x_T) \in V^T$
Target sequence of tokens: $(y_1, …, y_{T'}) \in (V')^{T'}$
Goal: Estimate $P_\theta(y_1, …, y_{T'} | x_1, …, x_T)$.
Framed as a classification task:
- $\hat{y}_t = \text{argmax}_{y \in V'} P_\theta(y | (x_1, …, x_T), (y_1, …, y_{t-1})) \quad \forall t \in [1, T']$
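
Applying this argmax step by step is usually called greedy decoding. The sketch below shows only the loop; `toy_next_dist` is a deliberately crude, hypothetical stand-in for a trained model (a word-by-word lookup), so the example stays runnable without any training.

```python
def greedy_decode(next_dist, source, max_len=10, eos="</s>"):
    """At each step pick the most probable next token given the source and the prefix."""
    prefix = []
    for _ in range(max_len):
        dist = next_dist(source, prefix)  # P(y | x_1..x_T, y_1..y_{t-1})
        best = max(dist, key=dist.get)    # argmax over the target vocabulary
        if best == eos:
            break
        prefix.append(best)
    return prefix

def toy_next_dist(source, prefix):
    """Hypothetical model: translate the source word at the current position."""
    lexicon = {"cats": "chats", "eat": "mangent", "mice": "souris"}
    t = len(prefix)
    if t >= len(source):
        return {"</s>": 1.0}
    return {lexicon[source[t]]: 0.9, "</s>": 0.1}

translation = greedy_decode(toy_next_dist, ["cats", "eat", "mice"])
```

A real model would also produce function words and reorder tokens; the stand-in here only illustrates that each prediction is conditioned on the source and on the tokens generated so far.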

Sequence Generation Tasks

Machine Translation
Summarization

Machine Translation

Given a sequence in one language → translate it into another language.
Cats eat mice → Les chats mangent les souris
A lot of parallel data is available for this task from and to English.
It is harder to translate Hawaiian to Swiss German than English to French, since far less parallel data exists for that pair.

Machine Translation - Evaluation

Evaluating sequence generation tasks automatically is challenging.
BLEU score:
- N-gram-based evaluation metric.
- It measures how many of the n-grams in the prediction appear in the reference translation.
- 1 means the prediction is the same as the gold translation.
- 0 means there is no n-gram in common.
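
The n-gram overlap at the core of BLEU can be sketched as a clipped n-gram precision. This is a simplification: the full BLEU score combines such precisions for several values of n with a geometric mean and adds a brevity penalty, both omitted here. The function names and example sentences are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams that also appear in the reference.

    Each predicted n-gram is credited at most as often as it occurs in the reference.
    """
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / total

reference = "the cat sat".split()
same = clipped_precision("the cat sat".split(), reference, 2)
disjoint = clipped_precision("a dog ran".split(), reference, 1)
repeated = clipped_precision("the the the cat".split(), reference, 1)
```

Clipping matters for the last case: repeating "the" three times is credited only once, because the reference contains it once.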

Summarization

Evaluation: ROUGE score measures how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.
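
A recall-oriented n-gram overlap in the spirit of ROUGE-N can be sketched as follows. This is a minimal single-reference version with hypothetical function names; published ROUGE implementations add further variants (for example F-measures and longest-common-subsequence scores) that are not shown.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate summary."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

recall_1 = rouge_n_recall(candidate, reference, n=1)  # 3 of 6 reference unigrams matched
recall_2 = rouge_n_recall(candidate, reference, n=2)  # 2 of 5 reference bigrams matched
```

The denominator counts n-grams in the reference, which makes this a recall measure; the BLEU-style precision divides by the n-grams in the prediction instead.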

Design Questions

What output activation function and loss? Softmax and Cross Entropy
What architecture?
- How do you represent a token to feed the model?

Encoder-Decoder Model

Model an output sequence conditioned on an input sequence.
Deep learning does that with an encoder-decoder model, also called “sequence to sequence” or “seq2seq”.

Modeling Translation

Neural Machine Translation (NMT) translates with a single neural network.
We want to find the best Finnish sentence $y$, given an English sentence $x$.
Express translation as a probabilistic model:
$p(y|x)$

Chain Rule

Expanding using the chain rule gives:
$p(y|x) = \prod_{t=1}^{T'} p(y_t | y_1, …, y_{t-1}, x)$

Training a Neural Machine Translation System

Seq2seq is optimized as a single system. Backpropagation operates “end-to-end”.

Issues

Last encoder hidden-state "summarises" source sentence.
This needs to capture all information about the source sentence
Information bottleneck!
Fixed-size representation degrades as sentence length increases.

Attention

Attention provides a solution to the bottleneck problem.
Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.

Sequence-to-Sequence with Attention

Attention scores: take the dot product of the decoder hidden state with each encoder hidden state.
Compute softmax to turn the scores into a probability distribution (the attention distribution).
Attention output: use the attention distribution to take a weighted sum of the encoder hidden states.
Concatenate the attention output with the decoder hidden state, then use it to compute $y_1$.

Equations

We have encoder hidden states $h_1, …, h_N \in \mathbb{R}^h$.
On timestep t, we have decoder hidden state $s_t \in \mathbb{R}^h$.
We get the attention scores $e_t$ for this step:
- $e_t = [s_t^T h_1, …, s_t^T h_N] \in \mathbb{R}^N$
We take softmax to get the attention distribution $\alpha_t$ for this step:
- $\alpha_t = \text{softmax}(e_t) \in \mathbb{R}^N$
We use $\alpha_t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$:
- $a_t = \sum_{i=1}^N \alpha_{t,i} h_i \in \mathbb{R}^h$
Finally, we concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model:
- $[a_t; s_t] \in \mathbb{R}^{2h}$
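
These steps (scores, softmax, weighted sum, concatenation) can be traced on tiny hand-picked vectors. The sketch uses plain Python lists to stay dependency-free; the function names and the example vectors are assumptions, and a real system would use a tensor library and learned hidden states.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(s_t, enc_states):
    """Return (attention distribution, attention output, concatenation with s_t)."""
    scores = [dot(s_t, h) for h in enc_states]  # one score per encoder state
    weights = softmax(scores)                   # attention distribution
    dim = len(enc_states[0])
    output = [sum(w * h[k] for w, h in zip(weights, enc_states)) for k in range(dim)]
    return weights, output, output + s_t        # list concatenation: length 2h

# Hypothetical encoder hidden states (N = 3, h = 2) and decoder state.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [1.0, 0.0]

weights, output, concat = dot_product_attention(s_t, enc_states)
```

Here the first and third encoder states have the same dot product with `s_t`, so they receive equal weight, and both outweigh the second state.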

Attention Benefits:

Attention significantly improves plain seq2seq performance
Attention solves the bottleneck problem
Allows decoder to look directly at the source; bypass bottleneck
Attention provides some interpretability
By inspecting the attention distribution, we can see what the decoder was focusing on

General Deep Learning Technique

Given a set of vector values and a vector query, attention provides a method for computing a weighted sum of the values, dependent on the query.

Variant Computing

We have some values $h_1, …, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$.
Attention involves three steps:
- Computing the attention scores (there are multiple ways to do this).
- Taking softmax to get the attention distribution.
- Using the attention distribution to take a weighted sum of the values, thus obtaining the attention output (sometimes called the context vector).

Attention Variants

Basic dot-product attention:
- $e_i = s^T h_i$
- Note: this assumes $d_1 = d_2$.
Multiplicative attention:
- $e_i = s^T W h_i$
- Where $W$ is a weight matrix.
Additive attention:
- $e_i = v^T \tanh(W_1 h_i + W_2 s)$
- Where $W_1, W_2$ are weight matrices and $v$ is a weight vector.
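
The three scoring functions differ only in how a query and a value are combined into a single number. A minimal sketch in plain Python, with hand-picked weights instead of learned ones and hypothetical function names:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def dot_score(s, h):
    """Dot-product attention: requires s and h to have the same dimension."""
    return dot(s, h)

def multiplicative_score(s, h, W):
    """Multiplicative attention: W maps h into the query's space (shape d2 x d1)."""
    return dot(s, matvec(W, h))

def additive_score(s, h, W1, W2, v):
    """Additive attention: project both vectors, add, apply tanh, then project with v."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), matvec(W2, s))]
    return dot(v, hidden)

# Hand-picked example values (assumptions for illustration only).
h = [1.0, 2.0]   # a value, d1 = 2
s = [0.5, -1.0]  # a query, d2 = 2
identity = [[1.0, 0.0], [0.0, 1.0]]

e_dot = dot_score(s, h)
e_mult = multiplicative_score(s, h, identity)
e_add = additive_score(s, h, identity, identity, [1.0, 1.0])

# With a non-square W, the query and the value may have different sizes.
s3 = [1.0, 0.0, 1.0]                      # d2 = 3
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape 3 x 2
e_mixed = multiplicative_score(s3, h, W)
```

With the identity matrix as $W$, multiplicative attention reduces to the dot-product score, which shows that the dot-product variant is a special case.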

Different Framework Types

Decoder only
Encoder-Decoder (input sequence) -> (output sequence)