Regular Expressions, Tokenization & Edit Distance

CHAPTER 1 – INTRODUCTION

Pascal quotation: “The last thing you figure out in writing a book is what to put first.”

CHAPTER 2 – REGULAR EXPRESSIONS, TOKENIZATION, EDIT DISTANCE

ELIZA dialogue illustrates early pattern-matching NLP- Pattern/response pairs (“I need X” → “WHAT WOULD IT MEAN … X?”)
- Highlights power of simple regex-based pattern matching

2.0 WHY PATTERN-BASED METHODS MATTER

Even sophisticated conversational agents still rely on regexes for text patterns, intent triggers, price extraction, etc.
Text normalization pipeline relies heavily on regexes.

2.1 REGULAR EXPRESSIONS (REs)

2.1.1 Basic patterns

Concatenation: /woodchuck/ – sequence of characters
Character classes with brackets []
- Disjunction: /[wW]oodchuck/, /[abc]/, /[0-9]/
- Ranges with dash: /[A-Z]/, /[b-g]/
- Negation with leading caret: /[^A-Z]/
Optionality: ? → previous char 0 or 1 times /woodchucks?/ /colou?r/
Kleene star *: zero or more occurrences; +: one or more
- Sheep language examples: baa+!, /[0-9]+/ integer pattern
Wildcard period . matches any single char; .* for any string
Anchors (Fig 2.7)
- ^ start of line, $ end of line
- \b word boundary, \B non-word boundary

2.1.2 Disjunction & Grouping

Pipe | for alternation: /cat|dog/
Parentheses group & change precedence: /gupp(y|ies)/, counters apply to groups (…)*
Operator precedence: ()\ > * + ? {}\ > sequence > |
Greedy vs non-greedy: *?, +?

2.1.3 Example – finding word “the”

/the/ misses capitalized variant & overmatches
/[tT]he/ still hits “other”
/[tT]he/ — add word boundaries

Introduces precision (↓FP) vs recall (↓FN)

2.1.4 More operators

Aliases (Fig 2.8): \d, \w, \s, etc.
Counting with {n}, {n,m}, etc. (Fig 2.9)
Escaping special chars with backslash (Fig 2.10): \*, \. …

2.1.5 Complex price & disk-space patterns

Dollars: /(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b/
Gigabytes: /\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/

2.1.6 Substitution & Capture Groups

Syntax s/regex/replacement/
Back-reference \1, \2 etc. Example: s/([0-9]+)/<\1>/
Non-capturing (?: … )
ELIZA cascade: ranked substitutions using capture groups

2.1.7 Lookahead assertions

Positive lookahead (?=pattern); negative (?!pattern)
Example: /^(?!Volcano)[A-Za-z]+/ — word at line start not beginning with “Volcano”

2.2 WORDS

Defining word tokens: punctuation, disfluencies (uh, um), fillers, fragments
Word types vs instances
- Types = $|V|$ , Instances = N
Herdan/Heaps Law: \puncV = k{\puncN^\beta} with 0<\beta<1, empirical $\beta \approx .67-.75$
Lemma vs wordform: sing is lemma of sang/sung/sings; dictionary boldface ≈ lemmas

2.3 CORPORA & DATASHEETS

Language, genre, dialect variety (AAE, code-switching) matter
Datasheet fields: motivation, situation, language variety, demographics, collection & annotation process, distribution

2.4 SIMPLE UNIX TOKENIZATION PIPELINE

tr -sc 'A-Za-z' '\n' – replace non-letters with newlines
sort | uniq -c for counts
Fold case with second tr A-Z a-z
Sort by frequency: sort -n -r

2.5 TOKENIZATION

2.5.1 Top-down rule-based tokenization

Penn Treebank conventions: split clitics (does n’t), keep hyphens, separate punctuation
Languages without spaces need extra work (Chinese/Japanese)

2.5.2 Bottom-up Byte-Pair Encoding (BPE)

Learner starts with chars, iteratively merges most frequent adjacent pairs k times (parameter)
Segmenter applies learned merges greedily to new text
Handles unknown words via subword decomposition
Algorithm pseudocode (Fig 2.13)

2.6 WORD NORMALIZATION

Case folding when helpful; sometimes capitalization retained (NER)
Lemmatization: map wordforms to lemmas; essential for morphologically rich languages
Stemming – crude suffix stripping (Porter) rules e.g., ATIONAL→ATE, ING→∅

2.7 SENTENCE SEGMENTATION

Cues: . ! ?; ambiguity with abbreviations; rule/ML + abbreviation dictionaries

2.8 MINIMUM EDIT DISTANCE (MED)

Measures similarity via minimal insertions, deletions, substitutions
Levenshtein distance: cost 1 for ins/del, 1 or 2 for sub
Applications: spelling correction, speech recognition WER, coreference

Algorithm (dynamic programming)

Matrix D[i,j] where D[i,j] = \min\begin{cases}D[i-1,j]+1\D[i,j-1]+1\D[i-1,j-1]+cost\end{cases}
Initialization first row i deletions, first col j insertions
Greedy backtrace yields alignment; patterns greedy

Example

distance(intention, execution)=8 (ins/del=1, sub=2) grid Fig 2.18

SUMMARY OF KEY CONCEPTS & IMPLICATIONS

Regular expressions provide expressive yet efficient pattern spec for NLP tasks
Text normalization (= tokenization, case/lemma, sentence split) is prerequisite for most pipelines
Subword tokenization (BPE) critical for large language models & unknown words
Corpus choice/datasheets essential for fairness, dialect coverage
MED via DP foundational algorithm for similarity, extended by Viterbi etc.

CHAPTER 3 – NAIVE BAYES AND SENTIMENT CLASSIFICATION

Sentiment Classification: Task of determining the emotional tone (positive, negative, neutral, etc.) behind a piece of text.

Applications: Widely used for analyzing product reviews, social media posts, customer feedback, news articles, and more, to gauge public opinion or sentiment towards entities.

Naive Bayes Classifier:

A probabilistic classifier based on Bayes' Theorem, with a "naive" independence assumption between features.
Bayes' Theorem: $P(class|document) = \frac{P(document|class)P(class)}{P(document)}$
- $P(class|document)$ : Posterior probability of a document belonging to a specific class.
- $P(document|class)$ : Likelihood of observing the document given the class.
- $P(class)$ : Prior probability of the class.
- $P(document)$ : Evidence (often ignored in comparison as it's constant across classes).
"Naive" Assumption: Assumes that the features (words) are conditionally independent given the class. This simplifies calculations but is rarely true in natural language (e.g., "not good" implies bad, but "not" and "good" are treated independently).
Bag-of-Words Model: Represents text as an unordered collection (bag) of words, disregarding grammar, word order, and sentence structure. Only word presence and frequency matter.
Feature Extraction: Converting raw text into numerical features which the model can understand. Common methods include:
- Count Vectors (Term Frequency): Simply count how many times each word appears in a document.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weights words by how frequently they appear in a document (TF) adjusted by how rarely they appear across all documents (IDF), giving more importance to rare, distinctive words.
Training:
1. Calculate Prior Probabilities ( $P(class)$ ) from the training data (e.g., $P(positive) = \frac{\text{count of positive documents}}{\text{total documents}}$ ).
2. Calculate Likelihoods ( $P(word|class)$ ) from the training data. This is the probability of a specific word appearing in a document of a given class.
Smoothing: Essential to handle zero probabilities for unseen words in the test data. If a word in a new document was not present in the training data for a specific class, its likelihood would be zero, making the whole product zero. Laplace smoothing (Add-k smoothing) is a common technique.
- Formula for Laplace Smoothing ( $\alpha=1$ ):
 $P(wordi|class) = \frac{count(wordi, class) + 1}{count(class) + |V|}$
 where $|V|$ is the total number of unique words (vocabulary size) in the training data.
Numerical Example (Sentiment Classification with Naive Bayes):
- Training Data:
- Positive: "great movie", "awesome film"
- Negative: "bad acting", "terrible acting"
- Vocabulary (V): {great, movie, awesome, film, bad, acting, terrible}
- Counts:
- $N_{pos} = 2$ (positive documents)
- $N_{neg} = 2$ (negative documents)
- Word Counts in Classes:
- C(great, pos) = 1, C(movie, pos) = 1, C(awesome, pos) = 1, C(film, pos) = 1
- C(bad, neg) = 1, C(acting, neg) = 2, C(terrible, neg) = 1
- Other counts are 0.
- Priors:
- $P(pos) = \frac{2}{4} = 0.5$
- $P(neg) = \frac{2}{4} = 0.5$
- Likelihoods (with Laplace Smoothing, $\alpha=1$ ):
- For Positive Class ( $|V|=7$ ):
  - $P(great|pos) = \frac{1+1}{4+7} = \frac{2}{11}$ (word count in positive class + 1 / total words in positive class + |V|)
  - $P(acting|pos) = \frac{0+1}{4+7} = \frac{1}{11}$
- For Negative Class ( $|V|=7$ ):
  - $P(acting|neg) = \frac{2+1}{4+7} = \frac{3}{11}$ (word count in negative class + 1 / total words in negative class + |V|)
  - $P(great|neg) = \frac{0+1}{4+7} = \frac{1}{11}$
- Test Document: "great acting"
- Prediction:
- $P(pos|\text{"great acting"}) \propto P(pos) \cdot P(great|pos) \cdot P(acting|pos) = 0.5 \cdot \frac{2}{11} \cdot \frac{1}{11} = 0.5 \cdot \frac{2}{121} \approx 0.00826$
- $P(neg|\text{"great acting"}) \propto P(neg) \cdot P(great|neg) \cdot P(acting|neg) = 0.5 \cdot \frac{1}{11} \cdot \frac{3}{11} = 0.5 \cdot \frac{3}{121} \approx 0.01239$
- Since 0.01239 > 0.00826, the document "great acting" is classified as Negative.

Evaluation Metrics:

Used to assess the performance of a classification model.
Definitions:
- True Positive (TP): Correctly predicted positive.
- True Negative (TN): Correctly predicted negative.
- False Positive (FP): Predicted positive, but actually negative (Type I error).
- False Negative (FN): Predicted negative, but actually positive (Type II error).
Metrics:
- Precision: Of all instances predicted as positive, how many were actually positive? Focuses on false positives.
  $Precision = \frac{TP}{TP+FP}$
- Recall (Sensitivity): Of all actual positive instances, how many were correctly predicted as positive? Focuses on false negatives.
  $Recall = \frac{TP}{TP+FN}$
- F1-score: The harmonic mean of Precision and Recall. Useful when there is an uneven class distribution, as it balances both metrics.
  $F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
- Accuracy: The proportion of correctly classified instances out of the total instances. Can be misleading if classes are imbalanced.
  $Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$
- Confusion Matrix: A table that summarizes the performance of a classification algorithm. Rows represent actual classes, columns represent predicted classes.

Numerical Example (Evaluation Metrics):

Assume a binary classification model (Positive/Negative) yields the following results on a test set of 100 samples:
Actual Positive: 30, Actual Negative: 70
Model Predictions:
- Predicted Positive: 25 True Positives (TP), 5 False Positives (FP)
- Predicted Negative: 5 False Negatives (FN), 65 True Negatives (TN)

Confusion Matrix:

	Predicted Positive	Predicted Negative
Actual Pos	25 (TP)	5 (FN)
Actual Neg	5 (FP)	65 (TN)
Calculations:
$Precision = \frac{25}{25+5} = \frac{25}{30} \approx 0.833$
$Recall = \frac{25}{25+5} = \frac{25}{30} \approx 0.833$
$F1 = 2 \cdot \frac{0.833 \cdot 0.833}{0.833 + 0.833} = 0.833$
$Accuracy = \frac{25+65}{25+5+5+65} = \frac{90}{100} = 0.90$

CHAPTER 4 – LINGUISTIC ESSENTIALS & DISCRIMINATIVE CLASSIFIERS

Linguistic Essentials:

Parts of Speech (POS) Tagging: The process of assigning grammatical categories (e.g., Noun, Verb, Adjective, Adverb, Preposition) to each word in a given text. It's a foundational step for many advanced NLP tasks.
- Importance: Crucial for tasks like parsing (understanding sentence structure), named entity recognition (identifying proper nouns), and machine translation to correctly interpret word roles.
Morphology: The study of word structure and how words are formed from smaller meaningful units (morphemes).
- Inflection: Changes in word form to indicate grammatical functions (e.g., walk -> walks, walking, walked). It doesn't change the word's core meaning or part of speech.
- Derivation: Forms new words from existing ones, often changing their part of speech or meaning (e.g., happy -> unhappy, happiness; govern -> government).
- Example: In "unbreakable": un- (prefix), break (root), -able (suffix).
Syntax: The study of how words are combined to form phrases, clauses, and sentences, and the rules governing their structure.
- Syntactic Parsing: The process of analyzing a sentence to determine its grammatical structure according to a formal grammar (e.g., constituency parsing (parse trees), dependency parsing).
Semantics: The study of meaning in language.
- Lexical Semantics: Deals with the meaning of individual words (e.g., synonymy, antonymy, polysemy).
- Compositional Semantics: Deals with how word meanings combine to form the meaning of larger units like phrases and sentences.
Discourse: The study of linguistic units larger than a single sentence, such as paragraphs, conversations, or documents.
- Examples:
- Coreference Resolution: Identifying when different expressions in a text refer to the same entity (e.g., "John bought a car. He loves it."; "He" and "it" refer to "John" and "a car" respectively).
- Coherence: How sentences and ideas logically connect to form a meaningful whole.

Sequence Labeling:

A general class of NLP tasks where the goal is to assign a label to each element in a sequence. Often framed as predicting a tag for each word in a sentence.
Examples:
- POS Tagging: Assigning a grammatical tag (e.g., NN for noun, VB for verb) to each word.
- Named Entity Recognition (NER): Identifying and classifying named entities (e.g., persons, organizations, locations, dates) in text.
- Chunking (Shallow Parsing): Segmenting and labeling multi-word sequences into syntactically correlated word groupings (e.g., noun phrases, verb phrases).

Discriminative vs. Generative Models:

Generative Models (e.g., Naive Bayes, Hidden Markov Models):
- Model the underlying distribution of the data for each class, i.e., $P(x,y)$ , or more commonly, $P(x|y)$ (likelihood) and $P(y)$ (prior).
- They learn the characteristics of each class to generate (or model) new data points.
- Tend to perform better with smaller datasets but can be less robust to noisy data due to strong independence assumptions.
Discriminative Models (e.g., Perceptron, Logistic Regression, Support Vector Machines):
- Directly model the conditional probability $P(y|x)$ or a direct mapping from input to output.
- Focus on learning the decision boundary between classes, without needing to understand the internal structure of each class's data distribution.
- Generalize well to new, unseen data and are often more robust to noisy data. Typically require more data.

Perceptron:

A foundational algorithm for supervised learning, specifically for binary classification. It's a simple linear classifier.
Learning Process:
1. Initialise weights $w$ and bias $b$ (often to zero or small random values).
2. For each training example $(x, label)$ : a. Calculate the predicted output: $y{pred} = sign(w \cdot x + b)$ . b. If $y{pred} \neq label$ (misclassification), update weights and bias:
 - $w \leftarrow w + label \cdot x$
 - $b \leftarrow b + label$
Limitations:
- Can only learn linearly separable functions. If the data cannot be perfectly separated by a straight line (or hyperplane in higher dimensions), the Perceptron algorithm will not converge.
- Not probabilistic.
Numerical Example (Perceptron):
- Data: (input, label)
- $x_1 = (1, 0)$ , label = 1
- $x_2 = (0, 1)$ , label = 1
- $x_3 = (0, 0)$ , label = -1
- $x_4 = (1, 1)$ , label = -1
- Initialize: $w = (0, 0)$ , $b = 0$ (learning rate $\eta = 1$ for simplicity)
- Iteration 1:
- Consider $(x_1, 1) = ((1, 0), 1)$
- $z = w \cdot x_1 + b = (0 \cdot 1) + (0 \cdot 0) + 0 = 0$
- $y_{pred} = sign(0) = 0$ (or 1 depending on convention for 0, let's assume 1 for ease of use here if $w \cdot x + b \ge 0$ and -1 otherwise)
- Assuming $sign(0)=1$ : $y_{pred} = 1$ . Correctly classified.
- Iteration 2:
- Consider $(x_2, 1) = ((0, 1), 1)$
- $z = w \cdot x_2 + b = (0 \cdot 0) + (0 \cdot 1) + 0 = 0$
- $y_{pred} = 1$ . Correctly classified.
- Iteration 3:
- Consider $(x_3, -1) = ((0, 0), -1)$
- $z = w \cdot x_3 + b = (0 \cdot 0) + (0 \cdot 0) + 0 = 0$
- $y_{pred} = 1$ . Misclassification! (Actual label is -1).
- Update:
  - $w \leftarrow w + label \cdot x_3 = (0, 0) + (-1) \cdot (0, 0) = (0, 0)$
  - $b \leftarrow b + label = 0 + (-1) = -1$
- Now $w = (0, 0)$ , $b = -1$
- Iteration 4:
- Consider $(x_4, -1) = ((1, 1), -1)$
- $z = w \cdot x_4 + b = (0 \cdot 1) + (0 \cdot 1) + (-1) = -1$
- $y_{pred} = sign(-1) = -1$ . Correctly classified.
- Next Epoch: Re-evaluate on misclassified examples (if any). The process continues until all examples are correctly classified (if separable) or for a fixed number of epochs.

CHAPTER 5 – CLASSIFICATION AND LOGISTIC REGRESSION

Logistic Regression:

A powerful

A powerful discriminative classifier used for predicting the probability of a binary outcome (0 or 1, true or false) or for multi-class classification. Despite its name, it is a classification algorithm, not a regression algorithm in the traditional sense.

Models the probability of a binary outcome: Logistic Regression applies the sigmoid (or logistic) function to a linear combination of input features to output a probability score between 0 and 1.
- Sigmoid Function ( $\sigma$ ):
  $P(y=1|x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$
  where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias.
Decision Boundary: The point at which the predicted probability crosses a threshold (typically 0.5) determines the class. For a linear model like logistic regression, this results in a linear decision boundary (a line in 2D, a hyperplane in higher dimensions) that separates the classes.
Loss Function (Cost Function):
- Measures the discrepancy between the model's predicted probabilities and the actual labels. The goal during training is to minimize this loss.
- For logistic regression, the Cross-Entropy Loss (also known as Log Loss) is commonly used:
 $J(\theta) = -\frac{1}{N} \sum{i=1}^{N} [yi \log(h\theta(xi)) + (1 - yi) \log(1 - h\theta(xi))]$ where $N$ is the number of training examples, $yi$ is the actual label (0 or 1), and $h\theta(xi)$ is the predicted probability for $x_i$ . This function penalizes confident incorrect predictions heavily.
Optimization: The process of finding the optimal parameters ( $w$ and $b$ ) that minimize the loss function.
- Gradient Descent: An iterative optimization algorithm that adjusts parameters in the direction opposite to the gradient of the loss function. The learning rate ( $\eta$ ) controls the size of these steps.
- Parameter update rule: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
- Types of Gradient Descent:
  - Batch Gradient Descent: Computes the gradient over the entire training dataset for each update. Can be slow for large datasets but guarantees convergence to the minimum for convex functions.
  - Stochastic Gradient Descent (SGD): Computes the gradient and updates parameters for each individual training example. Faster but with noisy updates, leading to a more erratic path to convergence.
  - Mini-batch Gradient Descent: A compromise between Batch and SGD, using small random batches of training examples for each update. Offers a good balance of speed and stability.
Regularization: Techniques used to prevent overfitting, which occurs when a model learns the training data too well and performs poorly on unseen data.
- Adds a penalty term to the loss function that discourages large weights.
- L1 Regularization (Lasso): Adds the sum of the absolute values of the weights to the loss function ( $\lambda \sum |w_j|$ ).
- Promotes sparsity, meaning it can drive some weights exactly to zero, effectively leading to feature selection.
- L2 Regularization (Ridge): Adds the sum of the squared values of the weights to the loss function ( $\lambda \sum w_j^2$ ).
- Shrinks weights towards zero (but rarely exactly to zero), leading to a more generalized model.
- Regularized Loss Function: $J_{regularized}(\theta) = J(\theta) + \lambda \cdot R(\theta)$
- The hyperparameter $\lambda$ (lambda) controls the strength of the regularization; a higher $\lambda$ means stronger regularization.

Numerical Example (Logistic Regression Prediction)

Assume a simple binary classification problem with two features ( $x1, x2$ ).

Parameters: $w = [0.5, -0.2]$ , $b = 0.1$
Test Input: $x = [1.0, 3.0]$

Calculate the linear combination:
$z = w \cdot x + b = (0.5 \cdot 1.0) + (-0.2 \cdot 3.0) + 0.1$
$z = 0.5 - 0.6 + 0.1$
$z = 0$
Apply the Sigmoid Function:
$P(y=1|x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-0}}$
$P(y=1|x) = \frac{1}{1 + 1} = \frac{1}{2} = 0.5$
Make a prediction: If the threshold is 0.5:
Since $P(y=1|x) = 0.5$ , the model would typically classify this as positive (class 1) or be considered on the boundary, depending on the tie-breaking rule.

CHAPTER 6 – NEURAL NETWORKS AND NEURAL LANGUAGE MODELS

Neural Networks: Computational models inspired by the structure and function of biological neural networks.

Composed of interconnected nodes (neurons) organized in layers.

Layers: Input layer, hidden layers, output layer.

Activation Functions: Introduce non-linearity into the network (e.g., Sigmoid, Tanh, ReLU).

$ReLU(z) = \max(0, z)$

Feed-forward Networks: Information flows in one direction, from input to output, no cycles.

Training Neural Networks:

Forward Propagation: Calculate output given inputs and current weights.

Backpropagation: Algorithm for efficiently calculating the gradients of the loss function with respect to the network weights.

Uses the chain rule to propagate errors backward through the network.

Optimization Algorithms: Stochastic Gradient Descent (SGD), Adam, RMSprop.

Word Embeddings: Dense vector representations of words.

Capture semantic and syntactic relationships between words.

Word2Vec: Popular method for learning word embeddings.

Skip-gram: Predict context words given a target word.

CBOW (Continuous Bag-of-Words): Predict target word given context words.

GloVe (Global Vectors for Word Representation): Combines advantages of global matrix factorization and local context window methods.

Neural Language Models (NLM):

Use neural networks to predict the next word in a sequence.

Represent words as continuous vectors (embeddings).

Can learn complex, non-linear dependencies in language.

Basic architecture involves an input layer (word embeddings), a hidden layer, and an output layer (softmax over vocabulary).