L13 - Natural Language Processing 1: Text Vectorisation
Introduction to NLP
Natural Language Processing (NLP) deals with enabling computers to process, analyse, and understand human language. It is a vital field because much of human knowledge is stored as text. The lecture skips some NLP topics that transformer models have made obsolete (refer to Canvas for supplemental readings).
NLP Tasks
NLP involves various tasks, including:
- Text classification: Determining the topic of a text.
- Content filtering: Identifying spam or offensive content.
- Sentiment analysis: Determining the sentiment (positive, negative) of a text.
- Translation: Converting text from one language to another.
- Summarization: Generating a concise summary of a longer text.
Tools for NLP
Existing knowledge includes:
- Sequence processing using Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
- Input encoding, including embeddings.
These tools allow building NLP models at a 2017 level. The next lecture covers Transformers, enabling the construction of NLP models at a 2022 level (as measured on benchmarks such as MMLU).
Text Vectorisation
Text vectorisation is the process of converting text into numerical data that machine learning models can process. The typical steps include:
- Standardisation: Removing diacritics and punctuation, and converting text to lowercase.
- Tokenisation: Splitting text into tokens, which can be words, subwords, or groups of words.
- Indexing: Converting tokens to integer values.
- Encoding: Converting indices into embeddings or one-hot encoding.
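As a minimal sketch of these steps in plain Python (a toy example; the vocabulary below is made up for illustration):
text = "The cat sat on the mat."

# 1. Standardisation: lowercase and strip punctuation.
standardized = "".join(c for c in text.lower() if c.isalnum() or c.isspace())

# 2. Tokenisation: split on whitespace.
tokens = standardized.split()          # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# 3. Indexing: map each distinct token to an integer (0 = mask, 1 = [UNK]).
vocabulary = {'': 0, '[UNK]': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'mat': 6}
indices = [vocabulary.get(t, 1) for t in tokens]   # [2, 3, 4, 5, 2, 6]

# 4. Encoding: for example, one-hot vectors of length len(vocabulary).
one_hot = [[1 if i == idx else 0 for i in range(len(vocabulary))] for idx in indices]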
Keras TextVectorization Layer
The keras.layers.TextVectorization layer can perform all of these text-processing steps. The standardize and split arguments can also be given custom functions.
keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    vocabulary=None
)
The vocabulary is built automatically from the data by calling adapt().
- Index 1 is reserved for unknown tokens ('[UNK]').
- Index 0 is reserved for the mask token ('').
Example:
vectorize_layer = keras.layers.TextVectorization(
    max_tokens=1000,
    standardize="lower_and_strip_punctuation",
    output_sequence_length=250
)
vectorize_layer.adapt(dataset)
vectorize_layer.get_vocabulary()[:10] # ['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it']
Text Vectorisation Example
Example using a sentence from the IMDb movie review dataset:
- Text: I am shocked. Shocked and dismayed that the 428 of you IMDB users
- Encoded: [10, 238, 2355, 2355, 3, 1, 12, 2, 1, 5, 23, 933, 5911]
- Decoded: i am shocked shocked and [UNK] that the [UNK] of you imdb users
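A sketch of recovering the decoded text from the indices (assuming a TextVectorization layer adapted on the IMDb reviews with a vocabulary large enough to contain these indices):
vocab = vectorize_layer.get_vocabulary()
encoded = [10, 238, 2355, 2355, 3, 1, 12, 2, 1, 5, 23, 933, 5911]
# Map each index back to its token; index 1 comes back as '[UNK]'.
decoded = " ".join(vocab[i] for i in encoded)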
HTML tags can be removed using regular expressions.
import re, string
import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    without_html = tf.strings.regex_replace(lowercase, '<[^>]+>', ' ')   # strip HTML tags
    without_punctuation = tf.strings.regex_replace(
        without_html, '[{}]'.format(re.escape(string.punctuation)), '')  # remove punctuation
    return without_punctuation
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization
)
TextVectorization is based on TensorFlow operations, which can complicate its use with other backend frameworks.
N-grams
N-grams consider the relationships between words by grouping them. Instead of treating text as a sequence, we can train a Dense model on the presence of words, ignoring the order.
Example from the IMDb review dataset:
- Review: Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to … Sentiment: positive
- Review: If you like original gut wrenching laughter you will like this movie. If you are young … Sentiment: positive
- Review: This movie made it into one of my top 10 most awful movies. Horrible. There wasn’t … Sentiment: negative
Classification can be achieved by learning correlations between single words and sentiment.
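A sketch of such an order-free (bag-of-words) classifier, using multi_hot output and a small Dense network (train_texts, train_labels, and the layer sizes are illustrative assumptions):
import keras  # or: from tensorflow import keras

# Bag-of-words: each review becomes a fixed-size 0/1 vector of word presence.
text_vectorizer = keras.layers.TextVectorization(max_tokens=20_000, output_mode="multi_hot")
text_vectorizer.adapt(train_texts)        # train_texts: a list of review strings (assumed)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(text_vectorizer(train_texts), train_labels, epochs=10)   # labels assumed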
N-gram Details
Word combinations can modify the meaning of individual words (e.g., "not good"). N-grams extend the vocabulary with pairs or triplets of words.
- Example: “ … not really expecting much, ” ➞ {“not”, “not really”, “really”, “really expecting”, “expecting”, “expecting much”}
- Bigrams: Pairs of words.
- Trigrams: Triplets of words.
- N-grams: N words.
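A minimal plain-Python sketch of extracting such n-grams from a list of tokens:
def ngrams(tokens, n):
    # All contiguous groups of n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "really", "expecting", "much"]
print(ngrams(tokens, 1))  # ['not', 'really', 'expecting', 'much']
print(ngrams(tokens, 2))  # ['not really', 'really expecting', 'expecting much']
The ngrams argument of TextVectorization (e.g. ngrams=2) adds such bigrams to the vocabulary automatically.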
N-gram Applications
N-grams can be applied to various data types:
| Sequence | 1-gram | 2-gram | 3-gram |
|---|---|---|---|
| Protein sequencing | Cys, Gly, Leu, Ser, Trp | Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp | Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp |
| DNA sequencing | A, G, C, T, T, C, G, A | AG, GC, CT, TT, TC, CG, GA | AGC, GCT, CTT, TTC, TCG, CGA |
N-grams remain useful for language research; see the Google Ngram Viewer.
Embeddings
Text is converted into numbers, which are then transformed into word embeddings. Embeddings map categorical values to vectors of floating-point values. These values are initially randomised and optimised during training.
Example:
my_embedding = {
    'the': [-1.46, -0.86, 0.09],
    'and': [-0.27, 1.15, 1.19],
    'a': [1.17, 0.06, -0.16],
    'of': [0.60, 0.10, 0.22],
}
Embedding Layers
Embedding layers are similar to one-hot encoding followed by a linear Dense layer (without the bias term).
Example:
the -> [0, 1, 0, 0] -> [-1.46, -0.86, 0.09]
embedding = keras.Sequential([
    keras.layers.StringLookup(output_mode="one_hot"),
    keras.layers.Dense(units=embedding_dim, use_bias=False, activation=None)
])
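Equivalently, Keras provides a dedicated Embedding layer that performs this lookup directly; a brief sketch (the vocabulary size, embedding dimension, and token indices below are illustrative):
import keras
import numpy as np

embedding = keras.layers.Embedding(input_dim=1000, output_dim=3)  # 1000 tokens -> 3-d vectors
token_ids = np.array([[2, 3, 4]])      # a batch with one sequence of three token indices
vectors = embedding(token_ids)         # shape (1, 3, 3): one embedding vector per token
print(vectors.shape)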
Computing Similarity in Embedding Space
There is a relation between semantics and position in embedding space. Token similarity can be measured using:
- Euclidean (L2) norm distance:
  ||a - b||_2 = \sqrt{(a_1 - b_1)^2 + \dots + (a_n - b_n)^2}
- Cosine similarity: measures the angular difference between positions, ignoring vector lengths.
  \cos(a, b) = \frac{a \cdot b}{||a|| \cdot ||b||}
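A short NumPy sketch computing both measures, using the illustrative vectors for 'the' and 'and' from the embedding dictionary above:
import numpy as np

a = np.array([-1.46, -0.86, 0.09])   # 'the'
b = np.array([-0.27, 1.15, 1.19])    # 'and'

euclidean = np.linalg.norm(a - b)                                  # L2 distance
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))    # cosine similarity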
Interpreting Similarity Measures
- Cosine similarity is often preferred because the magnitude of word-embedding vectors is affected by word frequency.
Embedding Examples
- Analogies can be represented in embedding space, such as man is to king as woman is to queen.
- Verb tense and country-capital relationships can also be visualised.
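A sketch of analogy arithmetic with made-up vectors (real analogies require embeddings trained on large corpora, e.g. word2vec or GloVe):
import numpy as np

# Hypothetical 3-d embeddings, for illustration only.
emb = {
    "man":   np.array([0.9, 0.1, 0.2]),
    "woman": np.array([0.9, 0.1, 0.8]),
    "king":  np.array([0.2, 0.9, 0.2]),
    "queen": np.array([0.2, 0.9, 0.8]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # 'queen' with these made-up vectors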
Better Text Tokenisation
Words that are variants of each other (conjugated, plural, etc.) can be collapsed to simplify the vocabulary. For example, am, is, and are can be lemmatised to be. Lemmatisation has traditionally relied on manually composed, language-specific algorithms.
Example:
He is reading books ➞ He be read book
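A brief sketch using NLTK's WordNet lemmatiser (assumes NLTK and its WordNet data are installed; spaCy offers similar functionality):
# pip install nltk; then download the WordNet data once:
# import nltk; nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("is", pos="v"))       # 'be'
print(lemmatizer.lemmatize("reading", pos="v"))  # 'read'
print(lemmatizer.lemmatize("books", pos="n"))    # 'book'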
Modern Tokenisation
Modern tokenisation algorithms are based on substring frequencies in data rather than linguistics. Words are split into individual characters and recombined into common subwords. The final vocabulary may not always make intuitive sense.
Examples from the GPT-2 tokeniser vocabulary:
objective, stacked, USB, Energy, 306, booster, Bird, learn, stationary, nighttime, 85, rice, tensions, mission, iency, quitting, agging, hypers, OOOOOOOO, Typ, reopen, finding, Spoon, Plate, nat, Ïĥ, climates, Druid, download, isition, æĦ, partic, predis, calf, Object, annie, example
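A sketch of inspecting a modern subword tokeniser with the Hugging Face transformers library (assumes the package is installed):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Rare words are split into subword pieces; 'Ġ' marks a leading space
# in GPT-2's byte-level BPE vocabulary.
print(tokenizer.tokenize("Tokenisation splits rare words into subwords."))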
Pretrained Embeddings and Models
Similar to image classification, pretrained models can be fine-tuned for NLP tasks. Embeddings are general and can be reused for other models or tasks.
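A hedged sketch of plugging pretrained word vectors into a Keras Embedding layer (the embedding_matrix here is a placeholder; in practice its rows would be filled from a pretrained source such as GloVe):
import numpy as np
import keras

vocab_size, embedding_dim = 10_000, 100
# Assumed: rows of embedding_matrix hold pretrained vectors, indexed by token id.
embedding_matrix = np.zeros((vocab_size, embedding_dim))

embedding_layer = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,   # freeze the pretrained vectors, or set True to fine-tune
)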
NLP Resources
- TensorFlow Text.
- Hugging Face models and ML library.