Natural Language Processing (NLP) is about enabling computers to process, analyze, and understand human language. It is a vital field because much of human knowledge is stored as text. The lecture will skip some NLP topics made obsolete by transformer models (refer to Canvas for supplemental readings).
NLP involves various tasks, including:
Existing knowledge includes:
These tools allow building NLP models at a 2017 level. The next lecture will cover Transformers, enabling the construction of NLP models at a 2022 level (as measured, for example, on the MMLU benchmark).
Text vectorisation is the process of converting text into numerical data that machine learning models can process. The typical steps include:

- standardising the text (e.g. lowercasing and stripping punctuation)
- splitting it into tokens (e.g. on whitespace)
- optionally grouping tokens into n-grams
- mapping each token to an integer index in a vocabulary
The `keras.layers.TextVectorization` layer can perform all of these text processing steps. `standardize` and `split` can be custom functions.

```python
keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    vocabulary=None,
)
```
The vocabulary is built automatically from the data by calling `adapt()`.
Example:

```python
vectorize_layer = keras.layers.TextVectorization(
    max_tokens=1000,
    standardize="lower_and_strip_punctuation",
    output_sequence_length=250,
)
vectorize_layer.adapt(dataset)
vectorize_layer.get_vocabulary()[:10]  # ['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it']
```
Example using a sentence from the IMDb movie review dataset:

```
i am shocked shocked and [UNK] that the [UNK] of you imdb users
```
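A minimal sketch of how such output can be produced, assuming the `vectorize_layer` adapted above; the review string here is only illustrative:

```python
review = "I am SHOCKED, shocked, and appalled that the majority of you IMDb users ..."
token_ids = vectorize_layer([review])          # integer indices, padded to length 250
vocabulary = vectorize_layer.get_vocabulary()  # index -> token lookup
tokens = [vocabulary[i] for i in token_ids[0].numpy() if i != 0]
print(" ".join(tokens))  # words outside the 1000-token vocabulary show up as [UNK]
```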
HTML tags can be removed using regular expressions.
```python
import re
import string
import tensorflow as tf

def custom_standardization(input_data):
    # Lowercase, strip HTML tags, then remove punctuation.
    lowercase = tf.strings.lower(input_data)
    without_html = tf.strings.regex_replace(lowercase, '<[^>]+>', ' ')
    # Escape the punctuation so it is treated literally inside the character class.
    without_punctuation = tf.strings.regex_replace(
        without_html, '[{}]'.format(re.escape(string.punctuation)), '')
    return without_punctuation

vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization
)
```
`TextVectorization` is based on TensorFlow operations, which can complicate its use with other backend frameworks.

N-grams consider the relationships between words by grouping them. Instead of treating text as a sequence, we can train a Dense model on the presence of words, ignoring the order.
In the IMDb review dataset, for example, classification can be achieved by learning correlations between single words and sentiment.
Word combinations can modify the meaning of individual words (e.g., "not good"). N-grams extend the vocabulary with pairs or triplets of words.
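A minimal sketch of a bag-of-bigrams classifier in this style, assuming the TensorFlow backend and a hypothetical `train_ds` dataset of `(review, label)` pairs:

```python
bigram_vectorizer = keras.layers.TextVectorization(
    max_tokens=20_000,
    ngrams=2,                 # single words plus word pairs such as "not good"
    output_mode="multi_hot",  # presence/absence of each token; order is discarded
)
bigram_vectorizer.adapt(train_ds.map(lambda text, label: text))

model = keras.Sequential([
    bigram_vectorizer,
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=3)
```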
N-grams can be applied to various data types:
| Sequence type | 1-gram | 2-gram | 3-gram |
|---|---|---|---|
| Protein sequencing | Cys, Gly, Leu, Ser, Trp | Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp | Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp |
| DNA sequencing | A, G, C, T, T, C, G, A | AG, GC, CT, TT, TC, CG, GA | AGC, GCT, CTT, TTC, TCG, CGA |
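For illustration, a simple sliding window reproduces the DNA row of the table:

```python
def char_ngrams(sequence, n):
    """Return all contiguous substrings of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(char_ngrams("AGCTTCGA", 2))  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
print(char_ngrams("AGCTTCGA", 3))  # ['AGC', 'GCT', 'CTT', 'TTC', 'TCG', 'CGA']
```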
N-grams are still useful for language research purposes; refer to the Google N-gram Viewer.
Text is converted into numbers, which are then transformed into word embeddings. Embeddings map categorical values to vectors of floating-point values. These values are initially randomised and optimised during training.
Example:

```python
my_embedding = {
    'the': [-1.46, -0.86, 0.09],
    'and': [-0.27, 1.15, 1.19],
    'a':   [1.17, 0.06, -0.16],
    'of':  [0.60, 0.10, 0.22],
}
```
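In Keras, such a lookup table is implemented by the `Embedding` layer; a minimal sketch with illustrative sizes:

```python
import numpy as np

# 1000-token vocabulary mapped to 3-dimensional vectors (illustrative sizes).
embedding_layer = keras.layers.Embedding(input_dim=1000, output_dim=3)

token_ids = np.array([[2, 3, 4, 5]])  # e.g. indices of 'the', 'and', 'a', 'of'
vectors = embedding_layer(token_ids)  # shape (1, 4, 3); values start out random
```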
Embedding layers are similar to one-hot encoding followed by a linear Dense layer (without the bias term).
Example: `the` -> `[0, 1, 0, 0]` -> `[-1.46, -0.86, 0.09]`

```python
embedding = keras.Sequential([
    # The vocabulary comes from adapt() or the vocabulary argument.
    keras.layers.StringLookup(output_mode="one_hot"),
    keras.layers.Dense(units=embedding_dim, use_bias=False, activation=None),
])
```
There is a relation between semantics and position in embedding space. Token similarity can be measured using:
Euclidean (L2) norm distance:
||a - b||_2 = \sqrt{(a_1 - b_1)^2 + \dots + (a_n - b_n)^2}
Cosine similarity: Measures the angular difference between positions, ignoring vector lengths.
cos(a, b) = \frac{a \cdot b}{||a|| \cdot ||b||}
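A minimal NumPy sketch of both measures, using two of the illustrative vectors from above:

```python
import numpy as np

a = np.array([-1.46, -0.86, 0.09])  # 'the'
b = np.array([-0.27, 1.15, 1.19])   # 'and'

euclidean = np.linalg.norm(a - b)                                # ≈ 2.58
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # ≈ -0.17
```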
A well-known example of this semantic structure: `man` is to `king` as `woman` is to `queen`.

Words that are variants of each other (conjugated, plural, etc.) can be collapsed to simplify the vocabulary. For example, `am`, `is`, and `are` can be lemmatised to `be`. Manually composed algorithms have been developed for lemmatisation.
Example: `He is reading books` ➞ `He be read book`
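A sketch using the spaCy library as one possible lemmatiser (assumes the `en_core_web_sm` model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")      # small English pipeline
doc = nlp("He is reading books")
print([token.lemma_ for token in doc])  # e.g. ['he', 'be', 'read', 'book']
```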
Modern tokenisation algorithms are based on substring frequencies in data rather than linguistics. Words are split into individual characters and recombined into common subwords. The final vocabulary may not always make intuitive sense.
Examples from the GPT-2 tokeniser vocabulary:
`objective`, `stacked`, `USB`, `Energy`, `306`, `booster`, `Bird`, `learn`, `stationary`, `nighttime`, `85`, `rice`, `tensions`, `mission`, `iency`, `quitting`, `agging`, `hypers`, `OOOOOOOO`, `Typ`, `reopen`, `finding`, `Spoon`, `Plate`, `nat`, `Ïĥ`, `climates`, `Druid`, `download`, `isition`, `æĦ`, `partic`, `predis`, `calf`, `Object`, `annie`, `example`
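Such vocabularies can be inspected with the Hugging Face `transformers` library, for example (the GPT-2 tokeniser is downloaded on first use):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)                      # 50257 entries
print(tokenizer.tokenize("nighttime quitting"))  # subword pieces from the byte-level BPE vocabulary
```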
Similar to image classification, pretrained models can be fine-tuned for NLP tasks. Embeddings are general and can be reused for other models or tasks.
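A sketch of such reuse, assuming `trained_embedding` is an `Embedding` layer taken from a previously trained model:

```python
import numpy as np

pretrained_weights = trained_embedding.get_weights()[0]  # (vocab_size, embedding_dim)
vocab_size, embedding_dim = pretrained_weights.shape

new_embedding = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    trainable=False,  # freeze the embeddings, or set True to fine-tune them
)
new_embedding(np.zeros((1, 1), dtype="int32"))  # build the layer's weights
new_embedding.set_weights([pretrained_weights])
```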