Language Models and Sequence-to-Sequence Models

Motivations

  • The goal is to use representations of tokens to model a sequence of tokens for a specific task.
  • Distinguish between word salad, spelling errors, and grammatical sentences.
  • Language Models (LMs) define probability distributions over strings in a language.
  • LMs can generate strings.
  • LMs can score/rank candidate strings to choose the most likely.
    • If P<em>LM(A)>P</em>LM(B)P<em>{LM}(A) > P</em>{LM}(B), return A.

Language Models

  • Grammatically incorrect or rare sentences should be improbable.
  • LMs determine the probability of a word following a sequence of words.
  • Two classes of models:
    • Count-based: Markov assumptions with smoothing.
    • Neural models.

Framework

  • Given (t<em>1,,t</em>p)V<em>D(t<em>1, …, t</em>p) \in V<em>D, estimate p(t</em>n+1t<em>1,,t</em>n)p(t</em>{n+1} | t<em>1, …, t</em>n).
  • Formally, compute the probability distribution of the next word, where it can be any word in the vocabulary.
  • A system that does this is called a Language Model.
  • Language Modeling is the task of predicting what word comes next, e.g., "The students opened their ".
  • An LM assigns a probability to a piece of text.
  • For text xx, the probability is P(x)P(x).

N-gram Language Models

  • An n-gram is a chunk of n consecutive words.
    • Unigrams: "the", "students", "opened", "their".
    • Bigrams: "the students", "students opened", "opened their".
    • Trigrams: "the students opened", "students opened their".
    • Four-grams: "the students opened their".
  • Collect statistics on n-gram frequency to predict the next word.
  • Markov assumption: x(t+1)x(t+1) depends only on the preceding n-1 words.
  • Formula for calculating probabilities:
    • P(w<em>nw</em>1,,w<em>n1)=count(w</em>1,,w<em>n)count(w</em>1,,wn1)P(w<em>n | w</em>1, …, w<em>{n-1}) = \frac{count(w</em>1, …, w<em>n)}{count(w</em>1, …, w_{n-1})}

N-gram Language Models: Example

  • Learning a 4-gram LM.
  • Example: "as the proctor started the clock, the students opened their ____"
  • In the corpus:
    • "students opened their" occurred 1000 times.
    • "students opened their books" occurred 400 times.
    • P(books  students opened their)=0.4P(\text{books } | \text{ students opened their}) = 0.4
    • "students opened their exams" occurred 100 times.
    • P(exams  students opened their)=0.1P(\text{exams } | \text{ students opened their}) = 0.1

Sparsity Problems with N-gram Language Models

  • Problem 1: If "students opened their w" never occurred in the data, then w has probability 0!
    • Solution: Add small δ\delta to the count for every wVw \in V (smoothing).
  • Note: Increasing n makes sparsity problems worse; typically, n cannot be bigger than 5.
  • Problem 2: If "students opened their" never occurred in data, then we can’t calculate probability for any w!
    • Solution: Condition on "opened their" instead (backoff).

Storage Problems with N-gram Language Models

  • Need to store count for all n-grams seen in the corpus.
  • Increasing n or increasing corpus increases model size.

N-gram Language Models in Practice

  • A trigram LM can be built over a 1.7 million word corpus (Reuters) in a few seconds on a laptop.
  • Example: "today the "
    • +company: 0.153
    • +bank: 0.153
    • +price: 0.077
    • +italian: 0.039
    • +emirate: 0.039
    • Sparsity problem: not much granularity in the probability distribution.

Evaluation: How Good Is Our Model?

  • Does our language model prefer good sentences to bad ones?
  • Assign higher probability to "real" or "frequently observed" sentences than "ungrammatical" or "rarely observed" sentences.
  • Train parameters on a training set.
  • Test model’s performance on unseen data (test set).
  • An evaluation metric tells us how well our model does on the test set.

Extrinsic Evaluation of LMs

  • Compare models A and B by putting each model in a task (spelling corrector, speech recognizer, Machine Translation system).
  • Run the task, get an accuracy for A and for B.
  • Compare accuracy for A and B.

Difficulty of Extrinsic Evaluation of LMs

  • Extrinsic evaluation is time-consuming.
  • Intrinsic (in-vitro) evaluation: perplexity.
  • Directly measures language model performance at predicting words.
  • Unless the test data looks just like the training data.
  • Doesn't necessarily correspond with real application performance.
  • Gives us a single general metric for language models, useful for large language models (LLMs) as well as n-gram LMs.

Training Sets and Test Sets

  • Train parameters of our model on a training set.
  • Test the model’s performance on data we haven’t seen.
  • A test set is an unseen dataset; different from the training set.
  • We want to measure generalization to unseen data.
  • An evaluation metric (like perplexity) tells us how well our model does on the test set.

Choosing Training and Test Sets

  • If we're building an LM for a specific task, the test set should reflect the task language.
  • If we're building a general-purpose model:
    • We'll need lots of different kinds of training data.
    • We don't want the training set or the test set to be just from one domain or author or language.

Training on the Test Set -

  • Don’t allow test sentences into the training set.
  • Or else the LM will assign that sentence an artificially high probability when we see it in the test set.
  • And hence assign the whole test set a falsely high probability, making the LM look better than it really is.
  • This is called "Training on the test set", and it’s bad science!

Intuition of Perplexity

  • A good LM prefers "real" sentences.
  • Assign higher probability to “real” or “frequently observed” sentences.
  • Assigns lower probability to “word salad” (i.e., random sequences of words) or “rarely observed” sentences.

Predicting Upcoming Words

  • The Shannon Game: How well can we predict the next word?
    • I always order pizza with cheese and ___
    • The 33rd President of the US was ___
    • I saw a ___
  • Unigrams are terrible at this game.
  • A better model of a text is one that assigns a higher probability to the word that actually occurs.
    • mushrooms (0.1)
    • pepperoni (0.1)
    • anchovies (0.01)
    • Fried rice (0.0001).

Perplexity

  • The best language model is one that best predicts the entire unseen test set.
  • Generalize to all the words; the best LM assigns high probability to the entire test set.
  • When comparing two LMs, A and B, compute P<em>A(test set)P<em>A(\text{test set}) and P</em>B(test set)P</em>B(\text{test set}) .The better LM will give a higher probability to the test set.
  • Perplexity is the inverse probability of the test set, normalized by the number of words.
  • Perplexity=P(w<em>1,,w</em>N)1N=1P(w<em>1,,w</em>N)NPerplexity = P(w<em>1, …, w</em>N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w<em>1, …, w</em>N)}}
  • Probability range is [0,1], perplexity range is [1, ∞]. Minimizing perplexity is the same as maximizing probability.
  • Lower perplexity = better language model.

Neural Language Model

  • Neural Networks can be trained on very large amounts of data.
  • Neural Networks can use continuous representation of input tokens capturing the distributional hypothesis efficiently.

Neural Language Models

  • Language modeling: predict the next word.
  • A Neural Language Model is a neural network trained to predict the next word.
  • Several neural architecture choices:
    • Feedforward neural networks
    • Recurrent neural networks
    • Transformers neural networks

Building a Neural Language Model

  • Recall the Language Modeling task:
    • Input: sequence of words
    • Output: prob. dist. of the next word

A Fixed-Window Neural Language Model

  • Improvements over n-gram LM:
    • No sparsity problem
    • Don’t need to store all observed n-grams
  • Remaining problems:
    • Fixed window is too small
    • Enlarging window enlarges WW
    • Window can never be large enough! We need a neural architecture that can process any length input.

Recurrent Neural Networks (RNN)

  • A family of neural architectures (RNN, LSTM, GRU).

A Simple RNN Language Model

  • y(t)=softmax(Uh(t)+b2)RVy(t) = \text{softmax}(Uh(t) + b^2) \in R^{|V|}
  • h(t)=σ(W<em>hh(t1)+W</em>ee(t)+b1)h(t) = \sigma(W<em>h h(t-1) + W</em>e e(t) + b_1)
  • e(t)=Ex(t)e(t) = Ex(t)

RNN Advantages:

  • Can process any length input.
  • Computation for step t can (in theory) use information from many steps back.

RNN Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back.

Generating Text with a RNN Language Model

  • Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling.
  • Sampled output becomes the next step's input.
  • Train an RNN-LM on any kind of text then generate text in that style.

Language Models Summary:

  • Probabilities of sentences, various uses.
  • Traditional language model: based on counts of words in context.
  • Neural language models.
  • Both types need a lot of data to train.
  • Usually evaluated using perplexity.

Sequence-to-Sequence Framework

  • Input sequence of tokens: (x<em>1,,x</em>T)VT(x<em>1, …, x</em>T) \in V_T
  • Target sequence of tokens: (y<em>1,,y</em>T)VT(y<em>1, …, y</em>{T'}) \in V_{T'}
  • Goal: Estimate P<em>θ(y</em>1,,y<em>Tx</em>1,,xT)P<em>\theta(y</em>1, …, y<em>{T'} | x</em>1, …, x_T).
  • Framed as a classification task:
    • y^<em>t=argmax</em>yVP<em>θ(y(x</em>1,,x<em>T),(y</em>t,,yt1))t[1,T]\hat{y}<em>t = \text{argmax}</em>{y \in V'} P<em>\theta(y | (x</em>1, …, x<em>T), (y</em>t, …, y_{t-1})) \quad \forall t \in [1, T']

Sequence Generation Tasks

  • Machine Translation
  • Summarization

Machine Translation

  • Given a sequence in one language → translate it into another language.
  • Cats eat mice → Les chats mangent les souris
  • A lot of parallel data for this task from and to English
  • Harder to Translate Hawaiian to Swiss German than English to French

Machine Translation - Evaluation

  • Evaluating Sequence Generation Tasks Automatically is Challenging.
  • BLEU SCORE:
    • N-Gram-Based Evaluation metric
    • It measures how much of the n-gram in the prediction appears in the reference translation
  • 1 means the prediction is the same as the gold translation
  • 0 means there is no n-gram in common.

Summarization

  • Evaluation: ROUGE score measures how much the words (and/or n-grams) in the human reference summaries appeared in the machine-generated summaries.

Design Questions

  • What output activation function and loss? Softmax and Cross Entropy
  • What architecture?
    • How do you represent a token to feed the model?

Encoder-Decoder Model

  • Model an output sequence conditioned on an input sequence.
  • Deep learning does that with an encoder-decoder model, also called “sequence to sequence” or “seq2seq”.

Modeling Translation

  • Neural Machine Translation (NMT) translates with a single neural network.
  • Want to find the best Finnish sentence, given an English sentence.
  • Express translation as a probabilistic model:
    p(yx)p(y|x)

Chain Rule

  • Expanding using the chain rule gives:

Training a Neural Machine Translation System

  • Seq2seq is optimized as a single system. Backpropagation operates “end-to-end”.

Issues

  • Last encoder hidden-state "summarises" source sentence.
  • This needs to capture all information about the source sentence
  • Information bottleneck!
  • Fixed-size representation degrades as sentence length increases.

Attention

  • Attention provides a solution to the bottleneck problem.
  • Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.

Sequence-to-Sequence with Attention

  • Attention scores (dot product)
  • Compute softmax to turn the scores into a probability distribution
  • Attention output Utilize the attention distribution to do a weighted sum of the encoder hidden states
  • Concatenate attention output with decoder hidden state, then use to compute y1y_1

Equations

  • We have encoder hidden states h<em>1,,h</em>NRhh<em>1, …, h</em>N \in R^h
  • On timestep t, we have decoder hidden state stRhs_t \in R^h
  • We get the attention scores ete_t for this step:
    • e<em>t=[s</em>tTh<em>1,,s</em>tThN]RNe<em>t = [s</em>t^Th<em>1, …, s</em>t^Th_N] \in R^N
  • We take softmax to get the attention distribution ata_t for this step
    • a<em>t=softmax(e</em>t)RNa<em>t = \text{softmax}(e</em>t) \in R^N
  • We use a<em>ta<em>t to take a weighted sum of the encoder hidden states to get the attention output a</em>ta</em>t
    • a<em>t=</em>i=1Na<em>tih</em>iRha<em>t = \sum</em>{i=1}^N a<em>{ti}h</em>i \in R^h
  • Finally, we concatenate the attention output ata_t with the decoder hidden state st and proceed as in the non-attention seq2seq model
    • [a<em>t;s</em>t]R2h[a<em>t; s</em>t] \in R^{2h}

Attention Benefits:

  • Attention significantly improves plain seq2seq performance
  • Attention solves the bottleneck problem
  • Allows decoder to look directly at the source; bypass bottleneck
  • Attention provides some interpretability
  • By inspecting the attention distribution, we can see what the decoder was focusing on

General Deep Learning Technique

*Given a set of vector values and vector query, attention provides a method for assessing a weighted sum of the values dependent on the query

Variant Computing

  • We have some values h<em>1,,h</em>NRd<em>1h<em>1, …, h</em>N \in R^{d<em>1} and a query sRd</em>2s \in R^{d</em>2}
  • Computing the attention scores
  • Taking softmax to get attention distribution
  • Using attention distribution to take weighted sum of values, thus obtaining the attention output (sometimes called the context vector) There are multiple ways to do this.

Attention Variants Basic dot-product attention:

Basic dot-product attention:

  • e<em>i=sTh</em>ie<em>i = s^Th</em>i
  • Note: this assumes d<em>1=d</em>2d<em>1 = d</em>2
  • Multiplicative attention:
    • e<em>i=sTW</em>hie<em>i = s^TW</em>hi
    • Where WW is a weight matrix.
  • Additive attention:
    • e<em>i=vTtanh(W</em>1h<em>i+W</em>2s)e<em>i = v^T tanh(W</em>1h<em>i + W</em>2s)
    • Where W<em>1,W</em>2W<em>1, W</em>2 are weight matrices and vv is a weight vector.

Different Framework Types

  • Decoder only
  • Encoder-Decoder (input sequence) -> (output sequence)