Language Models and Sequence-to-Sequence Models

Motivations

  • The goal is to use representations of tokens to model a sequence of tokens for a specific task.
  • Distinguish between word salad, spelling errors, and grammatical sentences.
  • Language Models (LMs) define probability distributions over strings in a language.
  • LMs can generate strings.
  • LMs can score/rank candidate strings to choose the most likely.
    • If P_{LM}(A) > P_{LM}(B), return A.

Language Models

  • Grammatically incorrect or rare sentences should be improbable.
  • LMs determine the probability of a word following a sequence of words.
  • Two classes of models:
    • Count-based: Markov assumptions with smoothing.
    • Neural models.

Framework

  • Given (t_1, …, t_n) \in V^n, estimate p(t_{n+1} | t_1, …, t_n).
  • Formally, compute the probability distribution of the next word, where it can be any word in the vocabulary.
  • A system that does this is called a Language Model.
  • Language Modeling is the task of predicting what word comes next, e.g., "The students opened their ".
  • An LM assigns a probability to a piece of text.
  • For text x, the probability is P(x).

N-gram Language Models

  • An n-gram is a chunk of n consecutive words.
    • Unigrams: "the", "students", "opened", "their".
    • Bigrams: "the students", "students opened", "opened their".
    • Trigrams: "the students opened", "students opened their".
    • Four-grams: "the students opened their".
  • Collect statistics on n-gram frequency to predict the next word.
  • Markov assumption: x(t+1) depends only on the preceding n-1 words.
  • Formula for calculating probabilities:
    • P(w_n | w_1, …, w_{n-1}) = \frac{count(w_1, …, w_n)}{count(w_1, …, w_{n-1})}
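As a minimal sketch of the count-based estimate above (the toy corpus and function names are illustrative, not from the notes):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all chunks of n consecutive words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    n = len(context) + 1
    num = ngram_counts(tokens, n)[tuple(context) + (word,)]
    den = ngram_counts(tokens, n - 1)[tuple(context)]
    return num / den

corpus = "the students opened their books the students opened their exams".split()
print(mle_prob(corpus, ("opened", "their"), "books"))  # 1/2 = 0.5
```

"opened their" occurs twice in the toy corpus and is followed by "books" once, giving the 0.5 estimate.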

N-gram Language Models: Example

  • Learning a 4-gram LM.
  • Example: "as the proctor started the clock, the students opened their ____"
  • In the corpus:
    • "students opened their" occurred 1000 times.
    • "students opened their books" occurred 400 times.
    • P(\text{books } | \text{ students opened their}) = 0.4
    • "students opened their exams" occurred 100 times.
    • P(\text{exams } | \text{ students opened their}) = 0.1

Sparsity Problems with N-gram Language Models

  • Problem 1: If "students opened their w" never occurred in the data, then w has probability 0!
    • Solution: Add small \delta to the count for every w \in V (smoothing).
  • Note: Increasing n makes sparsity problems worse; typically, n cannot be bigger than 5.
  • Problem 2: If "students opened their" never occurred in data, then we can’t calculate probability for any w!
    • Solution: Condition on "opened their" instead (backoff).
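The smoothing fix for Problem 1 can be sketched as follows; this is a minimal add-δ example over a made-up corpus (all names and numbers are illustrative):

```python
from collections import Counter

# Toy corpus and vocabulary (illustrative only).
corpus = "students opened their books students opened their exams".split()
vocab = set(corpus)

four_grams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
trigrams   = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

def smoothed_prob(ctx, word, delta=0.1):
    """Add-delta smoothing: every w in V gets a delta pseudo-count,
    so unseen continuations keep non-zero probability."""
    return (four_grams[ctx + (word,)] + delta) / (trigrams[ctx] + delta * len(vocab))

ctx = ("students", "opened", "their")
print(smoothed_prob(ctx, "books"))     # seen: (1 + 0.1) / (2 + 0.1 * 5) = 0.44
print(smoothed_prob(ctx, "students"))  # unseen: still greater than 0
```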

Storage Problems with N-gram Language Models

  • Need to store count for all n-grams seen in the corpus.
  • Increasing n or increasing corpus increases model size.

N-gram Language Models in Practice

  • A trigram LM can be built over a 1.7 million word corpus (Reuters) in a few seconds on a laptop.
  • Example: "today the "
    • company: 0.153
    • bank: 0.153
    • price: 0.077
    • italian: 0.039
    • emirate: 0.039
    • Sparsity problem: not much granularity in the probability distribution.

Evaluation: How Good Is Our Model?

  • Does our language model prefer good sentences to bad ones?
  • Assign higher probability to "real" or "frequently observed" sentences than "ungrammatical" or "rarely observed" sentences.
  • Train parameters on a training set.
  • Test model’s performance on unseen data (test set).
  • An evaluation metric tells us how well our model does on the test set.

Extrinsic Evaluation of LMs

  • Compare models A and B by putting each model in a task (spelling corrector, speech recognizer, Machine Translation system).
  • Run the task, get an accuracy for A and for B.
  • Compare accuracy for A and B.

Difficulty of Extrinsic Evaluation of LMs

  • Extrinsic evaluation is time-consuming.
  • Intrinsic (in-vitro) evaluation: perplexity.
  • Perplexity directly measures how well a language model predicts words.
  • However, it is only a reliable comparison when the test data looks like the training data, and it doesn't necessarily correspond with real application performance.
  • Gives us a single general metric for language models, useful for large language models (LLMs) as well as n-gram LMs.

Training Sets and Test Sets

  • Train parameters of our model on a training set.
  • Test the model’s performance on data we haven’t seen.
  • A test set is an unseen dataset; different from the training set.
  • We want to measure generalization to unseen data.
  • An evaluation metric (like perplexity) tells us how well our model does on the test set.

Choosing Training and Test Sets

  • If we're building an LM for a specific task, the test set should reflect the task language.
  • If we're building a general-purpose model:
    • We'll need lots of different kinds of training data.
    • We don't want the training set or the test set to be just from one domain or author or language.

Training on the Test Set

  • Don’t allow test sentences into the training set.
  • Or else the LM will assign that sentence an artificially high probability when we see it in the test set.
  • And hence assign the whole test set a falsely high probability, making the LM look better than it really is.
  • This is called "Training on the test set", and it’s bad science!

Intuition of Perplexity

  • A good LM prefers "real" sentences.
  • Assign higher probability to “real” or “frequently observed” sentences.
  • Assigns lower probability to “word salad” (i.e., random sequences of words) or “rarely observed” sentences.

Predicting Upcoming Words

  • The Shannon Game: How well can we predict the next word?
    • I always order pizza with cheese and ___
    • The 33rd President of the US was ___
    • I saw a ___
  • Unigrams are terrible at this game.
  • A better model of a text is one that assigns a higher probability to the word that actually occurs.
    • mushrooms (0.1)
    • pepperoni (0.1)
    • anchovies (0.01)
    • fried rice (0.0001)

Perplexity

  • The best language model is one that best predicts the entire unseen test set.
  • Generalize to all the words; the best LM assigns high probability to the entire test set.
  • When comparing two LMs, A and B, compute P_A(\text{test set}) and P_B(\text{test set}). The better LM will give a higher probability to the test set.
  • Perplexity is the inverse probability of the test set, normalized by the number of words.
  • Perplexity = P(w_1, …, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, …, w_N)}}
  • Probability range is [0, 1]; perplexity range is [1, ∞). Minimizing perplexity is the same as maximizing probability.
  • Lower perplexity = better language model.
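The definition above can be computed in a few lines; a sketch in log space for numerical stability (the per-word probabilities are made up):

```python
import math

def perplexity(word_probs):
    """Perplexity = P(w_1, …, w_N)^(-1/N), computed in log space for stability."""
    N = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / N)

# A model uniformly unsure among 4 choices at every step has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ~4.0
```

Intuitively, perplexity is the weighted average branching factor: how many equally likely choices the model is effectively picking between at each step.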

Neural Language Model

  • Neural Networks can be trained on very large amounts of data.
  • Neural Networks can use continuous representations of input tokens, capturing the distributional hypothesis efficiently.

Neural Language Models

  • Language modeling: predict the next word.
  • A Neural Language Model is a neural network trained to predict the next word.
  • Several neural architecture choices:
    • Feedforward neural networks
    • Recurrent neural networks
    • Transformer neural networks

Building a Neural Language Model

  • Recall the Language Modeling task:
    • Input: sequence of words
    • Output: prob. dist. of the next word

A Fixed-Window Neural Language Model

  • Improvements over n-gram LM:
    • No sparsity problem
    • Don’t need to store all observed n-grams
  • Remaining problems:
    • Fixed window is too small
    • Enlarging window enlarges W
    • Window can never be large enough! We need a neural architecture that can process any length input.

Recurrent Neural Networks (RNN)

  • A family of neural architectures (RNN, LSTM, GRU).

A Simple RNN Language Model

  • e^{(t)} = E x^{(t)}
  • h^{(t)} = \sigma(W_h h^{(t-1)} + W_e e^{(t)} + b_1)
  • y^{(t)} = \text{softmax}(U h^{(t)} + b_2) \in R^{|V|}
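One step of this model can be sketched with NumPy; all dimensions and the random initialization below are illustrative, and σ is taken to be tanh:

```python
import numpy as np

# Toy dimensions and randomly initialized parameters (sizes are illustrative).
rng = np.random.default_rng(0)
V, d, h = 10, 8, 16                      # |V|, embedding dim, hidden dim
E   = rng.normal(size=(d, V))            # embedding matrix
W_h = rng.normal(size=(h, h))
W_e = rng.normal(size=(h, d))
U   = rng.normal(size=(V, h))
b_1, b_2 = np.zeros(h), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

def rnn_step(x_onehot, h_prev):
    e_t = E @ x_onehot                              # e(t) = E x(t)
    h_t = np.tanh(W_h @ h_prev + W_e @ e_t + b_1)   # h(t) = sigma(...)
    y_t = softmax(U @ h_t + b_2)                    # y(t): distribution over V
    return h_t, y_t

h_t, y_t = rnn_step(np.eye(V)[3], np.zeros(h))
print(y_t.sum())  # a valid probability distribution over the vocabulary
```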

RNN Advantages:

  • Can process any length input.
  • Computation for step t can (in theory) use information from many steps back.

RNN Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back.

Generating Text with an RNN Language Model

  • Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling.
  • Sampled output becomes the next step's input.
  • Train an RNN-LM on any kind of text then generate text in that style.
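The sampling loop itself is simple. In the sketch below, a fixed toy distribution stands in for a trained RNN's per-step softmax output (the vocabulary and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "students", "opened", "their", "books"]
# Stand-in for a trained LM: a uniform next-word distribution for every token.
next_probs = {w: np.full(len(vocab), 1 / len(vocab)) for w in vocab}

def generate(start="the", steps=5):
    out, tok = [], start
    for _ in range(steps):
        tok = rng.choice(vocab, p=next_probs[tok])  # sample the next word
        out.append(str(tok))                        # sampled word becomes the next input
    return out

print(generate())
```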

Language Models Summary:

  • Probabilities of sentences, various uses.
  • Traditional language model: based on counts of words in context.
  • Neural language models.
  • Both types need a lot of data to train.
  • Usually evaluated using perplexity.

Sequence-to-Sequence Framework

  • Input sequence of tokens: (x_1, …, x_T) \in V^T
  • Target sequence of tokens: (y_1, …, y_{T'}) \in V'^{T'}
  • Goal: Estimate P_\theta(y_1, …, y_{T'} | x_1, …, x_T).
  • Framed as a classification task:
    • \hat{y}_t = \text{argmax}_{y \in V'} P_\theta(y | (x_1, …, x_T), (y_1, …, y_{t-1})) \quad \forall t \in [1, T']

Sequence Generation Tasks

  • Machine Translation
  • Summarization

Machine Translation

  • Given a sequence in one language → translate it into another language.
  • Cats eat mice → Les chats mangent les souris
  • A lot of parallel data exists for this task from and to English.
  • It is harder to translate Hawaiian to Swiss German than English to French, since far less parallel data is available.

Machine Translation - Evaluation

  • Evaluating Sequence Generation Tasks Automatically is Challenging.
  • BLEU SCORE:
    • N-Gram-Based Evaluation metric
    • It measures how many of the n-grams in the prediction appear in the reference translation.
  • A score of 1 means the prediction is the same as the gold translation.
  • A score of 0 means there are no n-grams in common.
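The core of BLEU, clipped n-gram precision, can be sketched as follows. The sentences are illustrative, and full BLEU additionally combines precisions for n = 1…4 with a brevity penalty:

```python
from collections import Counter

def ngram_precision(pred, ref, n):
    """Fraction of the prediction's n-grams that also appear in the
    reference, with counts clipped by the reference counts."""
    p = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, r[g]) for g, c in p.items())
    return overlap / max(sum(p.values()), 1)

pred = "the cats eat the mice".split()
ref  = "the cats eat mice".split()
print(ngram_precision(pred, ref, 1))  # 4 of 5 unigrams match -> 0.8
print(ngram_precision(pred, ref, 2))  # 2 of 4 bigrams match -> 0.5
```

Clipping prevents a prediction from gaming the score by repeating a common reference word.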

Summarization

  • Evaluation: the ROUGE score measures how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.

Design Questions

  • What output activation function and loss? Softmax and Cross Entropy
  • What architecture?
    • How do you represent a token to feed the model?
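The first design choice above, a softmax output layer trained with cross-entropy loss, can be sketched as follows (the logits are made-up numbers):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(probs, target):
    """Loss = -log of the probability assigned to the true next token."""
    return -np.log(probs[target])

p = softmax(np.array([2.0, 0.5, -1.0]))
print(cross_entropy(p, 0))  # small loss: the true class got high probability
print(cross_entropy(p, 2))  # larger loss: the true class got low probability
```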

Encoder-Decoder Model

  • Model an output sequence conditioned on an input sequence.
  • Deep learning does that with an encoder-decoder model, also called “sequence to sequence” or “seq2seq”.

Modeling Translation

  • Neural Machine Translation (NMT) translates with a single neural network.
  • Want to find the best Finnish sentence, given an English sentence.
  • Express translation as a probabilistic model:
    p(y|x)

Chain Rule

  • Expanding using the chain rule gives: p(y|x) = \prod_{t=1}^{T'} p(y_t | y_1, …, y_{t-1}, x)
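Numerically, the chain-rule factorization just multiplies per-step conditional probabilities; in practice this is done in log space to avoid underflow (the step probabilities below are made up):

```python
import math

# p(y_t | y_1, …, y_{t-1}, x) at each decoding step (illustrative values).
step_probs = [0.4, 0.5, 0.9]
log_p = sum(math.log(p) for p in step_probs)
print(math.exp(log_p))  # equals 0.4 * 0.5 * 0.9 = 0.18
```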

Training a Neural Machine Translation System

  • Seq2seq is optimized as a single system. Backpropagation operates “end-to-end”.

Issues

  • Last encoder hidden-state "summarises" source sentence.
  • This needs to capture all information about the source sentence
  • Information bottleneck!
  • Fixed-size representation degrades as sentence length increases.

Attention

  • Attention provides a solution to the bottleneck problem.
  • Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.

Sequence-to-Sequence with Attention

  • Attention scores (dot product)
  • Compute softmax to turn the scores into a probability distribution
  • Attention output: use the attention distribution to take a weighted sum of the encoder hidden states.
  • Concatenate the attention output with the decoder hidden state, then use it to compute \hat{y}_t.

Equations

  • We have encoder hidden states h_1, …, h_N \in R^h.
  • On timestep t, we have decoder hidden state s_t \in R^h.
  • We get the attention scores e^t for this step:
    • e^t = [s_t^T h_1, …, s_t^T h_N] \in R^N
  • We take softmax to get the attention distribution \alpha^t for this step:
    • \alpha^t = \text{softmax}(e^t) \in R^N
  • We use \alpha^t to take a weighted sum of the encoder hidden states to get the attention output a_t:
    • a_t = \sum_{i=1}^N \alpha_i^t h_i \in R^h
  • Finally, we concatenate the attention output a_t with the decoder hidden state s_t and proceed as in the non-attention seq2seq model:
    • [a_t; s_t] \in R^{2h}
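The steps above can be sketched in NumPy; the dimensions and random encoder states are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_product_attention(s_t, H):
    """s_t: decoder state (h,); H: encoder hidden states stacked as rows (N, h)."""
    e_t = H @ s_t                        # scores e^t in R^N
    alpha_t = softmax(e_t)               # attention distribution in R^N
    a_t = alpha_t @ H                    # attention output: weighted sum in R^h
    return np.concatenate([a_t, s_t])    # [a_t; s_t] in R^{2h}

H = np.random.default_rng(0).normal(size=(4, 8))  # N=4 encoder states, h=8
out = dot_product_attention(H[0], H)
print(out.shape)  # (16,) = 2h
```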

Attention Benefits:

  • Attention significantly improves plain seq2seq performance
  • Attention solves the bottleneck problem
  • Allows decoder to look directly at the source; bypass bottleneck
  • Attention provides some interpretability
  • By inspecting the attention distribution, we can see what the decoder was focusing on

General Deep Learning Technique

  • Given a set of vector values and a vector query, attention provides a method for computing a weighted sum of the values, dependent on the query.

Variant Computing

  • We have some values h_1, …, h_N \in R^{d_1} and a query s \in R^{d_2}.
  • Compute the attention scores.
  • Take softmax to get the attention distribution.
  • Use the attention distribution to take a weighted sum of the values, obtaining the attention output (sometimes called the context vector).
  • There are multiple ways to compute the scores.

Attention Variants

  • Basic dot-product attention:
    • e_i = s^T h_i
    • Note: this assumes d_1 = d_2.
  • Multiplicative attention:
    • e_i = s^T W h_i
    • Where W is a weight matrix.
  • Additive attention:
    • e_i = v^T \tanh(W_1 h_i + W_2 s)
    • Where W_1, W_2 are weight matrices and v is a weight vector.
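The three scoring functions can be sketched side by side; all dimensions and the random parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, da = 6, 4, 5                 # value dim, query dim, additive hidden dim
h_i = rng.normal(size=d1)            # one value vector
s1  = rng.normal(size=d1)            # query for dot-product (requires d1 == d2)
s   = rng.normal(size=d2)            # query for the other two variants
W   = rng.normal(size=(d2, d1))      # multiplicative weight matrix
W1, W2 = rng.normal(size=(da, d1)), rng.normal(size=(da, d2))
v   = rng.normal(size=da)            # additive weight vector

e_dot = s1 @ h_i                         # e_i = s^T h_i
e_mul = s @ W @ h_i                      # e_i = s^T W h_i (handles mismatched dims)
e_add = v @ np.tanh(W1 @ h_i + W2 @ s)   # e_i = v^T tanh(W1 h_i + W2 s)
print(e_dot, e_mul, e_add)               # each score is a single scalar
```

Multiplicative and additive attention both introduce learned parameters, which lets the query and value spaces differ in size.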

Different Framework Types

  • Decoder only
  • Encoder-Decoder: (input sequence) → (output sequence)