L13 - Natural Language Processing (NLP) 1: Text Vectorisation

Introduction to NLP

Natural Language Processing (NLP) deals with enabling computers to process, analyze, and understand human language. It is a vital field because much of human knowledge is stored as text. The lecture skips some NLP topics that transformer models have made obsolete (refer to Canvas for supplemental readings).

NLP Tasks

NLP involves various tasks, including:

  • Text classification: Determining the topic of a text.
  • Content filtering: Identifying spam or offensive content.
  • Sentiment analysis: Determining the sentiment (positive, negative) of a text.
  • Translation: Converting text from one language to another.
  • Summarization: Generating a concise summary of a longer text.

Tools for NLP

Existing knowledge includes:

  • Sequence processing using Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
  • Input encoding, including embeddings.

These tools allow building NLP models at a 2017 level. The next lecture covers Transformers, which enable building NLP models at a 2022 level (as measured on benchmarks such as MMLU).

Text Vectorisation

Text vectorisation is the process of converting text into numerical data that machine learning models can process. The typical steps, sketched in code after the list, include:

  1. Standardisation: Removing diacritics and punctuation, and converting text to lowercase.
  2. Tokenisation: Splitting text into tokens, which can be words, subwords, or groups of words.
  3. Indexing: Converting tokens to integer values.
  4. Encoding: Converting indices into embeddings or one-hot encoding.
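
A minimal pure-Python sketch of these four steps on a toy sentence (the sentence, the helper names, and the tiny vocabulary below are made up for illustration):

import re
import unicodedata

def standardise(text):
    # Standardisation: lowercase, strip diacritics, remove punctuation.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^\w\s]", "", text)

def tokenise(text):
    # Tokenisation: split on whitespace into word tokens.
    return text.split()

tokens = tokenise(standardise("Héllo, hello world!"))   # ['hello', 'hello', 'world']

# Indexing: map tokens to integers, reserving 0 (mask) and 1 ([UNK]).
vocabulary = {"": 0, "[UNK]": 1, "hello": 2, "world": 3}
indices = [vocabulary.get(token, 1) for token in tokens]  # [2, 2, 3]

# Encoding: one-hot vectors (an embedding layer would map indices to dense vectors instead).
one_hot = [[1 if i == index else 0 for i in range(len(vocabulary))] for index in indices]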

Keras TextVectorization Layer

The keras.layers.TextVectorization layer can perform all the text processing steps.

  • standardize and split can be custom functions.
keras.layers.TextVectorization(
    max_tokens=None,              # maximum vocabulary size (None = no limit)
    standardize="lower_and_strip_punctuation",  # text clean-up applied first
    split="whitespace",           # how the standardised text is split into tokens
    ngrams=None,                  # optionally add n-grams of the tokens
    output_mode="int",            # "int", "multi_hot", "count" or "tf_idf"
    output_sequence_length=None,  # pad/truncate "int" output to this length
    vocabulary=None               # optional precomputed vocabulary
)

The vocabulary is built automatically from the data by calling adapt().

  • Index 1 is reserved for unknown tokens ('[UNK]').
  • Index 0 is reserved for the mask token ('').

Example:

vectorize_layer = keras.layers.TextVectorization(
    max_tokens=1000,               # keep the 1,000 most frequent tokens
    standardize="lower_and_strip_punctuation",
    output_sequence_length=250     # pad or truncate every sequence to 250 tokens
)

vectorize_layer.adapt(dataset)         # build the vocabulary from the data
vectorize_layer.get_vocabulary()[:10]  # ['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

Text Vectorisation Example

Example using a sentence from the IMDb movie review dataset:

  • Text: I am shocked. Shocked and dismayed that the 428 of you IMDB users
  • Encoded: [10, 238, 2355, 2355, 3, 1, 12, 2, 1, 5, 23, 933, 5911]
  • Decoded: i am shocked shocked and [UNK] that the [UNK] of you imdb users
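
The integer indices can be mapped back to tokens with the vocabulary of the adapted vectorize_layer from the previous example:

vocab = vectorize_layer.get_vocabulary()   # index -> token lookup table
encoded = [10, 238, 2355, 2355, 3, 1, 12, 2, 1, 5, 23, 933, 5911]
decoded = " ".join(vocab[i] for i in encoded)
# 'i am shocked shocked and [UNK] that the [UNK] of you imdb users'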

HTML tags can be removed using regular expressions.

import re
import string

import keras
import tensorflow as tf

def custom_standardization(input_data):
    # Lowercase, strip HTML tags, then remove punctuation.
    lowercase = tf.strings.lower(input_data)
    without_html = tf.strings.regex_replace(lowercase, '<[^>]+>', ' ')
    without_punctuation = tf.strings.regex_replace(
        without_html, '[{}]'.format(re.escape(string.punctuation)), '')
    return without_punctuation

vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization
)
  • TextVectorization is based on TensorFlow operations, which can complicate its use with other backend frameworks.

N-grams

N-grams capture relationships between neighbouring words by grouping them. Instead of treating text as an ordered sequence, we can train a Dense model on which words (or word groups) are present in a text, ignoring their order.

Example from the IMDb review dataset:

  • Review: Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to … Sentiment: positive
  • Review: If you like original gut wrenching laughter you will like this movie. If you are young … Sentiment: positive
  • Review: This movie made it into one of my top 10 most awful movies. Horrible. There wasn’t … Sentiment: negative

Classification can be achieved by learning correlations between single words and sentiment.

N-gram Details

Word combinations can modify the meaning of individual words (e.g., "not good"). N-grams extend the vocabulary with pairs or triplets of words.

  • Example: “ … not really expecting much, ” ➞ {“not”, “not really”, “really”, “really expecting”, “expecting”, “expecting much”}
  • Bigrams: Pairs of words.
  • Trigrams: Triplets of words.
  • N-grams: N words.
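
A sketch of a bag-of-bigrams model: TextVectorization with ngrams=2 and output_mode="multi_hot" records only which words and bigrams are present, and a small Dense classifier is trained on top (max_tokens and train_texts are illustrative placeholders):

import keras

# Unigrams and bigrams, encoded as a fixed-length multi-hot "presence" vector.
bigram_vectorizer = keras.layers.TextVectorization(
    max_tokens=20000,
    ngrams=2,
    output_mode="multi_hot"
)
bigram_vectorizer.adapt(train_texts)   # train_texts: a dataset or list of review strings

# Dense classifier on the bag-of-bigrams representation (word order is ignored).
model = keras.Sequential([
    bigram_vectorizer,
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")   # e.g. positive vs. negative sentiment
])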

N-gram Applications

N-grams can be applied to various data types:

Sequence | 1-gram | 2-gram | 3-gram
Protein sequencing | Cys, Gly, Leu, Ser, Trp | Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp | Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp
DNA sequencing | A, G, C, T, T, C, G, A | AG, GC, CT, TT, TC, CG, GA | AGC, GCT, CTT, TTC, TCG, CGA

N-grams remain useful for language research; see the Google Books Ngram Viewer.

Embeddings

Tokens are converted into integer indices, which are then mapped to word embeddings. Embeddings map categorical values to vectors of floating-point values. These values are initially randomised and optimised during training.

Example:

my_embedding = {
 'the': [-1.46, -0.86, 0.09],
 'and': [-0.27, 1.15, 1.19],
 'a': [1.17, 0.06, -0.16],
 'of': [0.60, 0.10, 0.22],
}

Embedding Layers

Embedding layers are similar to one-hot encoding followed by a linear Dense layer (without the bias term).

Example:

# Conceptually: "the" -> one-hot [0, 1, 0, 0] -> embedding [-1.46, -0.86, 0.09]
# (the StringLookup layer still needs a vocabulary, e.g. via adapt())
embedding = keras.Sequential([
    keras.layers.StringLookup(output_mode="one_hot"),  # token string -> one-hot vector
    keras.layers.Dense(units=embedding_dim, use_bias=False, activation=None)  # bias-free linear map
])
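
In practice the same mapping is usually implemented with keras.layers.Embedding, which looks up rows of a trainable weight matrix by integer token index instead of materialising one-hot vectors (the sizes below are illustrative):

import keras

vocab_size = 1000    # number of distinct tokens (illustrative)
embedding_dim = 3    # dimensionality of each embedding vector (illustrative)

# Equivalent to one-hot encoding followed by a bias-free Dense layer, but more efficient.
embedding = keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)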

Computing Similarity in Embedding Space

There is a relation between semantics and position in embedding space. Token similarity can be measured using:

  1. Euclidean (L2) norm distance:
    ||a - b||_2 = \sqrt{(a_1 - b_1)^2 + \dots + (a_n - b_n)^2}

  2. Cosine similarity: Measures the angular difference between positions, ignoring vector lengths.
    \cos(a, b) = \frac{a \cdot b}{||a|| \cdot ||b||}
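
A small NumPy sketch of both measures, applied to the illustrative three-dimensional embeddings from the earlier example:

import numpy as np

def euclidean_distance(a, b):
    # L2 norm of the difference vector.
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Cosine of the angle between a and b; vector lengths cancel out.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

the = np.array([-1.46, -0.86, 0.09])
and_ = np.array([-0.27, 1.15, 1.19])

print(euclidean_distance(the, and_))
print(cosine_similarity(the, and_))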

Interpreting Similarity Measures

  • Cosine similarity is often preferred because the length of a word embedding is influenced by how frequently the word occurs; comparing only directions makes the measure insensitive to this.

Embedding Examples

  • Analogies can be represented in embedding spaces, such as man is to king as woman is to queen.
  • Verb tense and country-capital relationships can also be visualised.

Better Text Tokenisation

Words that are variants of each other (conjugated forms, plurals, etc.) can be collapsed to simplify the vocabulary. For example, am, is, and are can all be lemmatised to be. Hand-crafted, linguistically motivated algorithms have been developed for lemmatisation; a sketch follows the example below.

Example:

  • He is reading books ➞ He be read book
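
A sketch using NLTK's WordNet lemmatiser as one example of such a hand-crafted tool (assumes the nltk package is installed; the WordNet data is downloaded on first use):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-off download of the WordNet data
lemmatizer = WordNetLemmatizer()

# The part-of-speech argument ("v" = verb, "n" = noun) guides the lemmatiser.
print(lemmatizer.lemmatize("is", pos="v"))        # expected: be
print(lemmatizer.lemmatize("reading", pos="v"))   # expected: read
print(lemmatizer.lemmatize("books", pos="n"))     # expected: book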

Modern Tokenisation

Modern tokenisation algorithms, such as byte-pair encoding (BPE), are based on substring frequencies in the data rather than on linguistic rules. Words are split into individual characters and recombined into common subwords. The resulting vocabulary does not always make intuitive sense.

Examples from the GPT-2 tokeniser vocabulary:

objective, stacked, USB, Energy, 306, booster, Bird, learn, stationary, nighttime, 85, rice, tensions, mission, iency, quitting, agging, hypers, OOOOOOOO, Typ, reopen, finding, Spoon, Plate, nat, Ïĥ, climates, Druid, download, isition, æĦ, partic, predis, calf, Object, annie, example
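
Such a vocabulary can be explored with a pretrained subword tokeniser, for example via the Hugging Face transformers library listed under NLP Resources below; the exact splits depend on the learned merge rules:

from transformers import GPT2TokenizerFast

# GPT-2's byte-level BPE tokeniser (the vocabulary is downloaded on first use).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Tokenisation splits uncommon words into frequent subword pieces."
print(tokenizer.tokenize(text))   # subword strings (spaces are marked with a special prefix character)
print(tokenizer.encode(text))     # the corresponding integer indices into the vocabulary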

Pretrained Embeddings and Models

As with image classification, pretrained models can be fine-tuned for NLP tasks. Embeddings are general-purpose and can be reused in other models or for other tasks.
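
One common pattern is to initialise a Keras Embedding layer with pretrained vectors (for example from GloVe or word2vec files) and decide whether to freeze or fine-tune them; the matrix below is a random placeholder:

import numpy as np
import keras

vocab_size, embedding_dim = 1000, 100
pretrained_matrix = np.random.rand(vocab_size, embedding_dim)   # placeholder for real pretrained vectors

embedding = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=keras.initializers.Constant(pretrained_matrix),
    trainable=False   # freeze the embeddings, or set True to fine-tune them
)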

NLP Resources

  • TensorFlow Text.
  • Hugging Face models and ML library.