Language Models and Sequence-to-Sequence Models
Motivations
- The goal is to use representations of tokens to model a sequence of tokens for a specific task.
- Distinguish between word salad, spelling errors, and grammatical sentences.
- Language Models (LMs) define probability distributions over strings in a language.
- LMs can generate strings.
- LMs can score/rank candidate strings to choose the most likely.
- If P_{LM}(A) > P_{LM}(B), return A.
Language Models
- Grammatically incorrect or rare sentences should be improbable.
- LMs determine the probability of a word following a sequence of words.
- Two classes of models:
- Count-based: Markov assumptions with smoothing.
- Neural models.
Framework
- Given (t_1, …, t_n) \in V^n, estimate P(t_{n+1} | t_1, …, t_n).
- Formally, compute the probability distribution of the next word, where it can be any word in the vocabulary.
- A system that does this is called a Language Model.
- Language Modeling is the task of predicting what word comes next, e.g., "The students opened their ".
- An LM assigns a probability to a piece of text.
- For text x, the probability is P(x).
N-gram Language Models
- An n-gram is a chunk of n consecutive words.
- Unigrams: "the", "students", "opened", "their".
- Bigrams: "the students", "students opened", "opened their".
- Trigrams: "the students opened", "students opened their".
- Four-grams: "the students opened their".
- Collect statistics on n-gram frequency to predict the next word.
- Markov assumption: x(t+1) depends only on the preceding n-1 words.
- Formula for calculating probabilities:
- P(w_n | w_1, …, w_{n-1}) = \frac{count(w_1, …, w_n)}{count(w_1, …, w_{n-1})}
N-gram Language Models: Example
- Learning a 4-gram LM.
- Example: "as the proctor started the clock, the students opened their ____"
- In the corpus:
- "students opened their" occurred 1000 times.
- "students opened their books" occurred 400 times.
- P(\text{books } | \text{ students opened their}) = 0.4
- "students opened their exams" occurred 100 times.
- P(\text{exams } | \text{ students opened their}) = 0.1
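The count-based estimate above can be sketched in a few lines; `ngrams`, `ngram_prob`, and the toy corpus are hypothetical names for illustration, not part of any library:

```python
from collections import Counter

def ngrams(tokens, n):
    """All chunks of n consecutive tokens in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_prob(corpus_tokens, context, word, n=4):
    """MLE estimate P(word | context) = count(context + word) / count(context).

    context must have n-1 tokens; returns 0.0 if the context never occurred.
    """
    num = Counter(ngrams(corpus_tokens, n))[tuple(context) + (word,)]
    den = Counter(ngrams(corpus_tokens, n - 1))[tuple(context)]
    return num / den if den else 0.0
```

On a toy corpus where "students opened their" occurs twice, once followed by "books" and once by "exams", this gives probability 0.5 to each, mirroring the 400/1000 and 100/1000 slide example.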
Sparsity Problems with N-gram Language Models
- Problem 1: If "students opened their w" never occurred in the data, then w has probability 0!
- Solution: Add small \delta to the count for every w \in V (smoothing).
- Note: Increasing n makes sparsity problems worse; typically, n cannot be bigger than 5.
- Problem 2: If "students opened their" never occurred in data, then we can’t calculate probability for any w!
- Solution: Condition on "opened their" instead (backoff).
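Add-\delta smoothing from Problem 1 can be sketched directly on the counts; `smoothed_prob` is a hypothetical helper name:

```python
def smoothed_prob(count_ctx_word, count_ctx, vocab_size, delta=0.01):
    """Add-delta smoothing: give every word in V a small pseudo-count delta.

    P(w | context) = (count(context, w) + delta) / (count(context) + delta * |V|)
    """
    return (count_ctx_word + delta) / (count_ctx + delta * vocab_size)
```

An unseen word now gets a small nonzero probability instead of 0, and the distribution over the vocabulary still sums to 1.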
Storage Problems with N-gram Language Models
- Need to store count for all n-grams seen in the corpus.
- Increasing n or increasing corpus increases model size.
N-gram Language Models in Practice
- A trigram LM can be built over a 1.7 million word corpus (Reuters) in a few seconds on a laptop.
- Example: "today the "
- company: 0.153
- bank: 0.153
- price: 0.077
- italian: 0.039
- emirate: 0.039
- Sparsity problem: not much granularity in the probability distribution.
Evaluation: How Good Is Our Model?
- Does our language model prefer good sentences to bad ones?
- Assign higher probability to "real" or "frequently observed" sentences than "ungrammatical" or "rarely observed" sentences.
- Train parameters on a training set.
- Test model’s performance on unseen data (test set).
- An evaluation metric tells us how well our model does on the test set.
Extrinsic Evaluation of LMs
- Compare models A and B by putting each model in a task (spelling corrector, speech recognizer, Machine Translation system).
- Run the task, get an accuracy for A and for B.
- Compare accuracy for A and B.
Difficulty of Extrinsic Evaluation of LMs
- Extrinsic evaluation is time-consuming.
- Intrinsic (in-vitro) evaluation: perplexity.
- Directly measures language model performance at predicting words.
- But it is only a good proxy when the test data looks just like the training data.
- So it doesn't necessarily correspond with real application performance.
- Gives us a single general metric for language models, useful for large language models (LLMs) as well as n-gram LMs.
Training Sets and Test Sets
- Train parameters of our model on a training set.
- Test the model’s performance on data we haven’t seen.
- A test set is an unseen dataset; different from the training set.
- We want to measure generalization to unseen data.
- An evaluation metric (like perplexity) tells us how well our model does on the test set.
Choosing Training and Test Sets
- If we're building an LM for a specific task, the test set should reflect the task language.
- If we're building a general-purpose model:
- We'll need lots of different kinds of training data.
- We don't want the training set or the test set to be just from one domain or author or language.
Training on the Test Set
- Don’t allow test sentences into the training set.
- Or else the LM will assign that sentence an artificially high probability when we see it in the test set.
- And hence assign the whole test set a falsely high probability, making the LM look better than it really is.
- This is called "Training on the test set", and it’s bad science!
Intuition of Perplexity
- A good LM prefers "real" sentences.
- Assign higher probability to “real” or “frequently observed” sentences.
- Assigns lower probability to “word salad” (i.e., random sequences of words) or “rarely observed” sentences.
Predicting Upcoming Words
- The Shannon Game: How well can we predict the next word?
- I always order pizza with cheese and ___
- The 33rd President of the US was ___
- I saw a ___
- Unigrams are terrible at this game.
- A better model of a text is one that assigns a higher probability to the word that actually occurs.
- mushrooms (0.1)
- pepperoni (0.1)
- anchovies (0.01)
- fried rice (0.0001)
Perplexity
- The best language model is one that best predicts the entire unseen test set.
- Generalize to all the words; the best LM assigns high probability to the entire test set.
- When comparing two LMs, A and B, compute P_A(\text{test set}) and P_B(\text{test set}). The better LM will give a higher probability to the test set.
- Perplexity is the inverse probability of the test set, normalized by the number of words.
- Perplexity = P(w_1, …, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, …, w_N)}}
- Probability range is [0, 1]; perplexity range is [1, ∞). Minimizing perplexity is the same as maximizing probability.
- Lower perplexity = better language model.
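The formula is easiest to compute in log space to avoid underflow on long test sets; `perplexity` here is a hypothetical helper that takes the model's per-word probabilities:

```python
import math

def perplexity(probs):
    """Perplexity = P(w_1..w_N)^(-1/N), computed via log probabilities.

    probs: the model's probability for each word of the test set, in order.
    """
    n = len(probs)
    log_p = sum(math.log(p) for p in probs)
    return math.exp(-log_p / n)
```

Sanity check: a model that assigns uniform probability 1/k to every word has perplexity exactly k, so perplexity can be read as an effective branching factor.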
Neural Language Model
- Neural Networks can be trained on very large amounts of data.
- Neural Networks can use continuous representation of input tokens capturing the distributional hypothesis efficiently.
Neural Language Models
- Language modeling: predict the next word.
- A Neural Language Model is a neural network trained to predict the next word.
- Several neural architecture choices:
- Feedforward neural networks
- Recurrent neural networks
- Transformer neural networks
Building a Neural Language Model
- Recall the Language Modeling task:
- Input: sequence of words
- Output: prob. dist. of the next word
A Fixed-Window Neural Language Model
- Improvements over n-gram LM:
- No sparsity problem
- Don’t need to store all observed n-grams
- Remaining problems:
- Fixed window is too small
- Enlarging window enlarges W
- Window can never be large enough! We need a neural architecture that can process any length input.
Recurrent Neural Networks (RNN)
- A family of neural architectures (RNN, LSTM, GRU).
A Simple RNN Language Model
- e^{(t)} = E x^{(t)}
- h^{(t)} = \sigma(W_h h^{(t-1)} + W_e e^{(t)} + b_1)
- y^{(t)} = \text{softmax}(U h^{(t)} + b_2) \in R^{|V|}
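One step of these equations can be sketched in numpy. The names (`rnn_lm_step`, the weight matrices) are illustrative, and \sigma is taken to be tanh here, a common but not the only choice:

```python
import numpy as np

def rnn_lm_step(x_id, h_prev, E, W_h, W_e, U, b1, b2):
    """One timestep of a simple RNN language model.

    x_id: index of the current token; h_prev: previous hidden state.
    Returns the new hidden state and a distribution over the vocabulary.
    """
    e = E[x_id]                               # e(t) = E x(t): embedding lookup
    h = np.tanh(W_h @ h_prev + W_e @ e + b1)  # hidden state update
    logits = U @ h + b2
    y = np.exp(logits - logits.max())         # stable softmax over |V|
    y /= y.sum()
    return h, y
```

The same weights (E, W_h, W_e, U) are reused at every timestep, which is what lets the model process inputs of any length.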
RNN Advantages:
- Can process any length input.
- Computation for step t can (in theory) use information from many steps back.
RNN Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back.
Generating Text with a RNN Language Model
- Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling.
- Sampled output becomes the next step's input.
- Train an RNN-LM on any kind of text then generate text in that style.
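The repeated-sampling loop can be sketched as follows; `step_fn` is a hypothetical interface standing in for one forward step of any trained RNN-LM:

```python
import numpy as np

def generate(step_fn, h0, start_id, n_tokens, rng):
    """Generate text by repeated sampling from an RNN-LM.

    step_fn(x_id, h) -> (h_next, y) where y is the next-word distribution.
    Each sampled token becomes the input at the next step.
    """
    h, x, out = h0, start_id, []
    for _ in range(n_tokens):
        h, y = step_fn(x, h)           # run one RNN step
        x = rng.choice(len(y), p=y)    # sample the next token id from y
        out.append(x)
    return out
```

Replacing the sampling line with `y.argmax()` would give greedy generation instead of sampling.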
Language Models Summary:
- Probabilities of sentences, various uses.
- Traditional language model: based on counts of words in context.
- Neural language models.
- Both types need a lot of data to train.
- Usually evaluated using perplexity.
Sequence-to-Sequence Framework
- Input sequence of tokens: (x_1, …, x_T) \in V^T
- Target sequence of tokens: (y_1, …, y_{T'}) \in V'^{T'}
- Goal: Estimate P_\theta(y_1, …, y_{T'} | x_1, …, x_T).
- Framed as a classification task:
- \hat{y}_t = \text{argmax}_{y \in V'} P_\theta(y | (x_1, …, x_T), (y_1, …, y_{t-1})) \quad \forall t \in [1, T']
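The argmax-at-each-step formulation corresponds to greedy decoding. A minimal sketch, where `next_word_dist` is a hypothetical stand-in for the model's conditional distribution:

```python
def greedy_decode(next_word_dist, source, max_len, eos="</s>"):
    """Greedy decoding: pick the most probable next token at each step.

    next_word_dist(source, prefix) -> dict mapping each word in V' to its
    probability given the source sequence and the target prefix so far.
    """
    prefix = []
    for _ in range(max_len):
        dist = next_word_dist(source, prefix)
        y = max(dist, key=dist.get)   # argmax over V'
        prefix.append(y)
        if y == eos:                  # stop at end-of-sequence token
            break
    return prefix
```

Greedy decoding is fast but can commit to a locally likely word that leads to a globally worse sequence; beam search is the usual remedy.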
Sequence Generation Tasks
- Machine Translation
- Summarization
Machine Translation
- Given a sequence in one language → translate it into another language.
- Cats eat mice → Les chats mangent les souris
- A lot of parallel data exists for this task, mostly from and to English.
- It is harder to translate Hawaiian to Swiss German than English to French, because far less parallel data is available.
Machine Translation - Evaluation
- Evaluating sequence generation tasks automatically is challenging.
- BLEU SCORE:
- N-Gram-Based Evaluation metric
- It measures how many of the n-grams in the prediction appear in the reference translation.
- 1 means the prediction is the same as the gold translation
- 0 means there is no n-gram in common.
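The core quantity inside BLEU is clipped n-gram precision; the sketch below computes only that (full BLEU also combines several n-gram orders and adds a brevity penalty). `ngram_precision` is a hypothetical name:

```python
from collections import Counter

def ngram_precision(pred, ref, n):
    """Clipped n-gram precision: fraction of predicted n-grams found in the
    reference, with each n-gram's credit clipped at its reference count."""
    grams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    p, r = grams(pred), grams(ref)
    overlap = sum(min(c, r[g]) for g, c in p.items())
    total = sum(p.values())
    return overlap / total if total else 0.0
```

The clipping (min with the reference count) prevents a prediction from scoring well by repeating one correct n-gram many times.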
Summarization
- Evaluation: ROUGE score measures how much the words (and/or n-grams) in the human reference summaries appeared in the machine-generated summaries.
Design Questions
- What output activation function and loss? Softmax and Cross Entropy
- What architecture?
- How do you represent a token to feed the model?
Encoder-Decoder Model
- Model an output sequence conditioned on an input sequence.
- Deep learning does that with an encoder-decoder model, also called “sequence to sequence” or “seq2seq”.
Modeling Translation
- Neural Machine Translation (NMT) translates with a single neural network.
- Want to find the best Finnish sentence, given an English sentence.
- Express translation as a probabilistic model:
p(y|x)
Chain Rule
- Expanding using the chain rule gives:
- P(y|x) = P(y_1|x) \, P(y_2|y_1, x) \cdots P(y_{T'}|y_1, …, y_{T'-1}, x) = \prod_{t=1}^{T'} P(y_t | y_1, …, y_{t-1}, x)
Training a Neural Machine Translation System
- Seq2seq is optimized as a single system. Backpropagation operates “end-to-end”.
Issues
- Last encoder hidden-state "summarises" source sentence.
- This needs to capture all information about the source sentence
- Information bottleneck!
- Fixed-size representation degrades as sentence length increases.
Attention
- Attention provides a solution to the bottleneck problem.
- Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Sequence-to-Sequence with Attention
- Attention scores (dot product)
- Compute softmax to turn the scores into a probability distribution
- Attention output: use the attention distribution to take a weighted sum of the encoder hidden states
- Concatenate the attention output with the decoder hidden state, then use it to compute \hat{y}_1
Equations
- We have encoder hidden states h_1, …, h_N \in R^h
- On timestep t, we have decoder hidden state s_t \in R^h
- We get the attention scores e^{(t)} for this step:
- e^{(t)} = [s_t^T h_1, …, s_t^T h_N] \in R^N
- We take softmax to get the attention distribution \alpha^{(t)} for this step:
- \alpha^{(t)} = \text{softmax}(e^{(t)}) \in R^N
- We use \alpha^{(t)} to take a weighted sum of the encoder hidden states to get the attention output a_t:
- a_t = \sum_{i=1}^N \alpha_i^{(t)} h_i \in R^h
- Finally, we concatenate the attention output a_t with the decoder hidden state s_t and proceed as in the non-attention seq2seq model:
- [a_t; s_t] \in R^{2h}
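A minimal numpy sketch of one decoder step of dot-product attention, under the assumption that the N encoder states are stacked as rows of a matrix H:

```python
import numpy as np

def attention(s_t, H):
    """Dot-product attention for one decoder step.

    s_t: decoder hidden state, shape (h,).
    H:   encoder hidden states stacked as rows, shape (N, h).
    Returns [attention output; s_t], shape (2h,).
    """
    e = H @ s_t                      # scores: one dot product per encoder state
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()             # softmax -> attention distribution over source
    a = alpha @ H                    # weighted sum of encoder states
    return np.concatenate([a, s_t])  # concatenated with the decoder state
```

If every encoder state is identical, the distribution is uniform and the output is that state, a handy sanity check.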
Attention Benefits:
- Attention significantly improves plain seq2seq performance
- Attention solves the bottleneck problem
- Allows decoder to look directly at the source; bypass bottleneck
- Attention provides some interpretability
- By inspecting the attention distribution, we can see what the decoder was focusing on
General Deep Learning Technique
- Given a set of vector values and a vector query, attention provides a method for computing a weighted sum of the values, dependent on the query.
Variant Computing
- We have some values h_1, …, h_N \in R^{d_1} and a query s \in R^{d_2}
- Compute the attention scores
- Take softmax to get the attention distribution
- Use the attention distribution to take a weighted sum of the values, obtaining the attention output (sometimes called the context vector)
- There are multiple ways to compute the scores.
Attention Variants
- Basic dot-product attention:
- e_i = s^T h_i
- Note: this assumes d_1 = d_2
- Multiplicative attention:
- e_i = s^T W h_i
- Where W is a weight matrix.
- Additive attention:
- e_i = v^T \tanh(W_1 h_i + W_2 s)
- Where W_1, W_2 are weight matrices and v is a weight vector.
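The three scoring variants can be sketched side by side; the function names are hypothetical, and each returns a single score for one value h given the query s:

```python
import numpy as np

def dot_score(s, h):
    """Basic dot-product attention; assumes s and h have the same dimension."""
    return s @ h

def multiplicative_score(s, h, W):
    """Multiplicative (bilinear) attention: e_i = s^T W h_i.

    W has shape (d_2, d_1), so s and h may have different dimensions."""
    return s @ W @ h

def additive_score(s, h, W1, W2, v):
    """Additive attention: e_i = v^T tanh(W1 h_i + W2 s).

    The tanh and the learned v make the score a small one-layer network."""
    return v @ np.tanh(W1 @ h + W2 @ s)
```

With W set to the identity, multiplicative attention reduces to the dot product, which is one way to see it as a strict generalization.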
Different Framework Types
- Decoder only
- Encoder-Decoder (input sequence) -> (output sequence)