Regular Expressions, Tokenization & Edit Distance

CHAPTER 1 – INTRODUCTION

  • Pascal quotation: “The last thing you figure out in writing a book is what to put first.”

CHAPTER 2 – REGULAR EXPRESSIONS, TOKENIZATION, EDIT DISTANCE

  • ELIZA dialogue illustrates early pattern-matching NLP- Pattern/response pairs (“I need X” → “WHAT WOULD IT MEAN … X?”)

    • Highlights power of simple regex-based pattern matching

2.0 WHY PATTERN-BASED METHODS MATTER
  • Even sophisticated conversational agents still rely on regexes for text patterns, intent triggers, price extraction, etc.

  • Text normalization pipeline relies heavily on regexes.

2.1 REGULAR EXPRESSIONS (REs)
2.1.1 Basic patterns
  • Concatenation: /woodchuck/ – sequence of characters

  • Character classes with brackets []

    • Disjunction: /[wW]oodchuck/, /[abc]/, /[0-9]/

    • Ranges with dash: /[A-Z]/, /[b-g]/

    • Negation with leading caret: /[^A-Z]/

  • Optionality: ? → previous char 0 or 1 times /woodchucks?/ /colou?r/

  • Kleene star *: zero or more occurrences; +: one or more

    • Sheep language examples: baa+!, /[0-9]+/ integer pattern

  • Wildcard period . matches any single char; .* for any string

  • Anchors (Fig 2.7)

    • ^ start of line, $ end of line

    • \b word boundary, \B non-word boundary

2.1.2 Disjunction & Grouping
  • Pipe | for alternation: /cat|dog/

  • Parentheses group & change precedence: /gupp(y|ies)/, counters apply to groups (…)*

  • Operator precedence: ()\ > * + ? {}\ > sequence > |

  • Greedy vs non-greedy: *?, +?

2.1.3 Example – finding word “the”
  1. /the/ misses capitalized variant & overmatches

  2. /[tT]he/ still hits “other”

  3. /[tT]he/ — add word boundaries

  • Introduces precision (↓FP) vs recall (↓FN)

2.1.4 More operators
  • Aliases (Fig 2.8): \d, \w, \s, etc.

  • Counting with {n}, {n,m}, etc. (Fig 2.9)

  • Escaping special chars with backslash (Fig 2.10): \*, \.

2.1.5 Complex price & disk-space patterns
  • Dollars: /(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b/

  • Gigabytes: /\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/

2.1.6 Substitution & Capture Groups
  • Syntax s/regex/replacement/

  • Back-reference \1, \2 etc. Example: s/([0-9]+)/<\1>/

  • Non-capturing (?: … )

  • ELIZA cascade: ranked substitutions using capture groups

2.1.7 Lookahead assertions
  • Positive lookahead (?=pattern); negative (?!pattern)

  • Example: /^(?!Volcano)[A-Za-z]+/ — word at line start not beginning with “Volcano”

2.2 WORDS
  • Defining word tokens: punctuation, disfluencies (uh, um), fillers, fragments

  • Word types vs instances

    • Types = V|V|, Instances = N

  • Herdan/Heaps Law: \puncV = k{\puncN^\beta} with 0<\beta<1, empirical β.67.75\beta \approx .67-.75

  • Lemma vs wordform: sing is lemma of sang/sung/sings; dictionary boldface ≈ lemmas

2.3 CORPORA & DATASHEETS
  • Language, genre, dialect variety (AAE, code-switching) matter

  • Datasheet fields: motivation, situation, language variety, demographics, collection & annotation process, distribution

2.4 SIMPLE UNIX TOKENIZATION PIPELINE
  • tr -sc 'A-Za-z' '\n' – replace non-letters with newlines

  • sort | uniq -c for counts

  • Fold case with second tr A-Z a-z

  • Sort by frequency: sort -n -r

2.5 TOKENIZATION
2.5.1 Top-down rule-based tokenization
  • Penn Treebank conventions: split clitics (does n’t), keep hyphens, separate punctuation

  • Languages without spaces need extra work (Chinese/Japanese)

2.5.2 Bottom-up Byte-Pair Encoding (BPE)
  • Learner starts with chars, iteratively merges most frequent adjacent pairs k times (parameter)

  • Segmenter applies learned merges greedily to new text

  • Handles unknown words via subword decomposition

  • Algorithm pseudocode (Fig 2.13)

2.6 WORD NORMALIZATION
  • Case folding when helpful; sometimes capitalization retained (NER)

  • Lemmatization: map wordforms to lemmas; essential for morphologically rich languages

  • Stemming – crude suffix stripping (Porter) rules e.g., ATIONAL→ATE, ING→∅

2.7 SENTENCE SEGMENTATION
  • Cues: . ! ?; ambiguity with abbreviations; rule/ML + abbreviation dictionaries

2.8 MINIMUM EDIT DISTANCE (MED)
  • Measures similarity via minimal insertions, deletions, substitutions

  • Levenshtein distance: cost 1 for ins/del, 1 or 2 for sub

  • Applications: spelling correction, speech recognition WER, coreference

Algorithm (dynamic programming)
  • Matrix D[i,j] where D[i,j] = \min\begin{cases}D[i-1,j]+1\D[i,j-1]+1\D[i-1,j-1]+cost\end{cases}

  • Initialization first row i deletions, first col j insertions

  • Greedy backtrace yields alignment; patterns greedy

Example
  • distance(intention, execution)=8 (ins/del=1, sub=2) grid Fig 2.18

SUMMARY OF KEY CONCEPTS & IMPLICATIONS

  • Regular expressions provide expressive yet efficient pattern spec for NLP tasks

  • Text normalization (= tokenization, case/lemma, sentence split) is prerequisite for most pipelines

  • Subword tokenization (BPE) critical for large language models & unknown words

  • Corpus choice/datasheets essential for fairness, dialect coverage

  • MED via DP foundational algorithm for similarity, extended by Viterbi etc.

CHAPTER 3 – NAIVE BAYES AND SENTIMENT CLASSIFICATION

Sentiment Classification: Task of determining the emotional tone (positive, negative, neutral, etc.) behind a piece of text.

  • Applications: Widely used for analyzing product reviews, social media posts, customer feedback, news articles, and more, to gauge public opinion or sentiment towards entities.

Naive Bayes Classifier:

  • A probabilistic classifier based on Bayes' Theorem, with a "naive" independence assumption between features.

  • Bayes' Theorem: P(classdocument)=P(documentclass)P(class)P(document)P(class|document) = \frac{P(document|class)P(class)}{P(document)}

    • P(classdocument)P(class|document): Posterior probability of a document belonging to a specific class.

    • P(documentclass)P(document|class): Likelihood of observing the document given the class.

    • P(class)P(class): Prior probability of the class.

    • P(document)P(document): Evidence (often ignored in comparison as it's constant across classes).

  • "Naive" Assumption: Assumes that the features (words) are conditionally independent given the class. This simplifies calculations but is rarely true in natural language (e.g., "not good" implies bad, but "not" and "good" are treated independently).

  • Bag-of-Words Model: Represents text as an unordered collection (bag) of words, disregarding grammar, word order, and sentence structure. Only word presence and frequency matter.

  • Feature Extraction: Converting raw text into numerical features which the model can understand. Common methods include:

    • Count Vectors (Term Frequency): Simply count how many times each word appears in a document.

    • TF-IDF (Term Frequency-Inverse Document Frequency): Weights words by how frequently they appear in a document (TF) adjusted by how rarely they appear across all documents (IDF), giving more importance to rare, distinctive words.

  • Training:

    1. Calculate Prior Probabilities (P(class)P(class)) from the training data (e.g., P(positive)=count of positive documentstotal documentsP(positive) = \frac{\text{count of positive documents}}{\text{total documents}}).

    2. Calculate Likelihoods (P(wordclass)P(word|class)) from the training data. This is the probability of a specific word appearing in a document of a given class.

  • Smoothing: Essential to handle zero probabilities for unseen words in the test data. If a word in a new document was not present in the training data for a specific class, its likelihood would be zero, making the whole product zero. Laplace smoothing (Add-k smoothing) is a common technique.

    • Formula for Laplace Smoothing (α=1\alpha=1):
      P(word<em>iclass)=count(word</em>i,class)+1count(class)+VP(word<em>i|class) = \frac{count(word</em>i, class) + 1}{count(class) + |V|}
      where V|V| is the total number of unique words (vocabulary size) in the training data.

  • Numerical Example (Sentiment Classification with Naive Bayes):

    • Training Data:

    • Positive: "great movie", "awesome film"

    • Negative: "bad acting", "terrible acting"

    • Vocabulary (V): {great, movie, awesome, film, bad, acting, terrible}

    • Counts:

    • Npos=2N_{pos} = 2 (positive documents)

    • Nneg=2N_{neg} = 2 (negative documents)

    • Word Counts in Classes:

    • C(great, pos) = 1, C(movie, pos) = 1, C(awesome, pos) = 1, C(film, pos) = 1

    • C(bad, neg) = 1, C(acting, neg) = 2, C(terrible, neg) = 1

    • Other counts are 0.

    • Priors:

    • P(pos)=24=0.5P(pos) = \frac{2}{4} = 0.5

    • P(neg)=24=0.5P(neg) = \frac{2}{4} = 0.5

    • Likelihoods (with Laplace Smoothing, α=1\alpha=1):

    • For Positive Class (V=7|V|=7):

      • P(greatpos)=1+14+7=211P(great|pos) = \frac{1+1}{4+7} = \frac{2}{11} (word count in positive class + 1 / total words in positive class + |V|)

      • P(actingpos)=0+14+7=111P(acting|pos) = \frac{0+1}{4+7} = \frac{1}{11}

    • For Negative Class (V=7|V|=7):

      • P(actingneg)=2+14+7=311P(acting|neg) = \frac{2+1}{4+7} = \frac{3}{11} (word count in negative class + 1 / total words in negative class + |V|)

      • P(greatneg)=0+14+7=111P(great|neg) = \frac{0+1}{4+7} = \frac{1}{11}

    • Test Document: "great acting"

    • Prediction:

    • P(pos"great acting")P(pos)P(greatpos)P(actingpos)=0.5211111=0.521210.00826P(pos|\text{"great acting"}) \propto P(pos) \cdot P(great|pos) \cdot P(acting|pos) = 0.5 \cdot \frac{2}{11} \cdot \frac{1}{11} = 0.5 \cdot \frac{2}{121} \approx 0.00826

    • P(neg"great acting")P(neg)P(greatneg)P(actingneg)=0.5111311=0.531210.01239P(neg|\text{"great acting"}) \propto P(neg) \cdot P(great|neg) \cdot P(acting|neg) = 0.5 \cdot \frac{1}{11} \cdot \frac{3}{11} = 0.5 \cdot \frac{3}{121} \approx 0.01239

    • Since 0.01239 > 0.00826, the document "great acting" is classified as Negative.

Evaluation Metrics:


  • Used to assess the performance of a classification model.


  • Definitions:

    • True Positive (TP): Correctly predicted positive.

    • True Negative (TN): Correctly predicted negative.

    • False Positive (FP): Predicted positive, but actually negative (Type I error).

    • False Negative (FN): Predicted negative, but actually positive (Type II error).


  • Metrics:

    • Precision: Of all instances predicted as positive, how many were actually positive? Focuses on false positives.
      Precision=TPTP+FPPrecision = \frac{TP}{TP+FP}

    • Recall (Sensitivity): Of all actual positive instances, how many were correctly predicted as positive? Focuses on false negatives.
      Recall=TPTP+FNRecall = \frac{TP}{TP+FN}

    • F1-score: The harmonic mean of Precision and Recall. Useful when there is an uneven class distribution, as it balances both metrics.
      F1=2PrecisionRecallPrecision+RecallF1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

    • Accuracy: The proportion of correctly classified instances out of the total instances. Can be misleading if classes are imbalanced.
      Accuracy=TP+TNTP+TN+FP+FNAccuracy = \frac{TP+TN}{TP+TN+FP+FN}

    • Confusion Matrix: A table that summarizes the performance of a classification algorithm. Rows represent actual classes, columns represent predicted classes.


  • Numerical Example (Evaluation Metrics):

    • Assume a binary classification model (Positive/Negative) yields the following results on a test set of 100 samples:


    • Actual Positive: 30, Actual Negative: 70


    • Model Predictions:

      • Predicted Positive: 25 True Positives (TP), 5 False Positives (FP)

      • Predicted Negative: 5 False Negatives (FN), 65 True Negatives (TN)


    • Confusion Matrix:

      Predicted Positive

      Predicted Negative


      Actual Pos

      25 (TP)

      5 (FN)


      Actual Neg

      5 (FP)

      65 (TN)

      • Calculations:

      • Precision=2525+5=25300.833Precision = \frac{25}{25+5} = \frac{25}{30} \approx 0.833

      • Recall=2525+5=25300.833Recall = \frac{25}{25+5} = \frac{25}{30} \approx 0.833

      • F1=20.8330.8330.833+0.833=0.833F1 = 2 \cdot \frac{0.833 \cdot 0.833}{0.833 + 0.833} = 0.833

      • Accuracy=25+6525+5+5+65=90100=0.90Accuracy = \frac{25+65}{25+5+5+65} = \frac{90}{100} = 0.90

      CHAPTER 4 – LINGUISTIC ESSENTIALS & DISCRIMINATIVE CLASSIFIERS

      Linguistic Essentials:

      • Parts of Speech (POS) Tagging: The process of assigning grammatical categories (e.g., Noun, Verb, Adjective, Adverb, Preposition) to each word in a given text. It's a foundational step for many advanced NLP tasks.

        • Importance: Crucial for tasks like parsing (understanding sentence structure), named entity recognition (identifying proper nouns), and machine translation to correctly interpret word roles.

      • Morphology: The study of word structure and how words are formed from smaller meaningful units (morphemes).

        • Inflection: Changes in word form to indicate grammatical functions (e.g., walk -> walks, walking, walked). It doesn't change the word's core meaning or part of speech.

        • Derivation: Forms new words from existing ones, often changing their part of speech or meaning (e.g., happy -> unhappy, happiness; govern -> government).

        • Example: In "unbreakable": un- (prefix), break (root), -able (suffix).

      • Syntax: The study of how words are combined to form phrases, clauses, and sentences, and the rules governing their structure.

        • Syntactic Parsing: The process of analyzing a sentence to determine its grammatical structure according to a formal grammar (e.g., constituency parsing (parse trees), dependency parsing).

      • Semantics: The study of meaning in language.

        • Lexical Semantics: Deals with the meaning of individual words (e.g., synonymy, antonymy, polysemy).

        • Compositional Semantics: Deals with how word meanings combine to form the meaning of larger units like phrases and sentences.

      • Discourse: The study of linguistic units larger than a single sentence, such as paragraphs, conversations, or documents.

        • Examples:

        • Coreference Resolution: Identifying when different expressions in a text refer to the same entity (e.g., "John bought a car. He loves it."; "He" and "it" refer to "John" and "a car" respectively).

        • Coherence: How sentences and ideas logically connect to form a meaningful whole.

      Sequence Labeling:

      • A general class of NLP tasks where the goal is to assign a label to each element in a sequence. Often framed as predicting a tag for each word in a sentence.

      • Examples:

        • POS Tagging: Assigning a grammatical tag (e.g., NN for noun, VB for verb) to each word.

        • Named Entity Recognition (NER): Identifying and classifying named entities (e.g., persons, organizations, locations, dates) in text.

        • Chunking (Shallow Parsing): Segmenting and labeling multi-word sequences into syntactically correlated word groupings (e.g., noun phrases, verb phrases).

      Discriminative vs. Generative Models:

      • Generative Models (e.g., Naive Bayes, Hidden Markov Models):

        • Model the underlying distribution of the data for each class, i.e., P(x,y)P(x,y), or more commonly, P(xy)P(x|y) (likelihood) and P(y)P(y) (prior).

        • They learn the characteristics of each class to generate (or model) new data points.

        • Tend to perform better with smaller datasets but can be less robust to noisy data due to strong independence assumptions.

      • Discriminative Models (e.g., Perceptron, Logistic Regression, Support Vector Machines):

        • Directly model the conditional probability P(yx)P(y|x) or a direct mapping from input to output.

        • Focus on learning the decision boundary between classes, without needing to understand the internal structure of each class's data distribution.

        • Generalize well to new, unseen data and are often more robust to noisy data. Typically require more data.

      Perceptron:

      • A foundational algorithm for supervised learning, specifically for binary classification. It's a simple linear classifier.

      • Learning Process:

        1. Initialise weights ww and bias bb (often to zero or small random values).

        2. For each training example (x,label)(x, label): a. Calculate the predicted output: y<em>pred=sign(wx+b)y<em>{pred} = sign(w \cdot x + b). b. If y</em>predlabely</em>{pred} \neq label (misclassification), update weights and bias:

          • ww+labelxw \leftarrow w + label \cdot x

          • bb+labelb \leftarrow b + label

      • Limitations:

        • Can only learn linearly separable functions. If the data cannot be perfectly separated by a straight line (or hyperplane in higher dimensions), the Perceptron algorithm will not converge.

        • Not probabilistic.

      • Numerical Example (Perceptron):

        • Data: (input, label)

        • x1=(1,0)x_1 = (1, 0), label = 1

        • x2=(0,1)x_2 = (0, 1), label = 1

        • x3=(0,0)x_3 = (0, 0), label = -1

        • x4=(1,1)x_4 = (1, 1), label = -1

        • Initialize: w=(0,0)w = (0, 0), b=0b = 0 (learning rate η=1\eta = 1 for simplicity)

        • Iteration 1:

        • Consider (x1,1)=((1,0),1)(x_1, 1) = ((1, 0), 1)

        • z=wx1+b=(01)+(00)+0=0z = w \cdot x_1 + b = (0 \cdot 1) + (0 \cdot 0) + 0 = 0

        • ypred=sign(0)=0y_{pred} = sign(0) = 0 (or 1 depending on convention for 0, let's assume 1 for ease of use here if wx+b0w \cdot x + b \ge 0 and -1 otherwise)

        • Assuming sign(0)=1sign(0)=1: ypred=1y_{pred} = 1. Correctly classified.

        • Iteration 2:

        • Consider (x2,1)=((0,1),1)(x_2, 1) = ((0, 1), 1)

        • z=wx2+b=(00)+(01)+0=0z = w \cdot x_2 + b = (0 \cdot 0) + (0 \cdot 1) + 0 = 0

        • ypred=1y_{pred} = 1. Correctly classified.

        • Iteration 3:

        • Consider (x3,1)=((0,0),1)(x_3, -1) = ((0, 0), -1)

        • z=wx3+b=(00)+(00)+0=0z = w \cdot x_3 + b = (0 \cdot 0) + (0 \cdot 0) + 0 = 0

        • ypred=1y_{pred} = 1. Misclassification! (Actual label is -1).

        • Update:

          • ww+labelx3=(0,0)+(1)(0,0)=(0,0)w \leftarrow w + label \cdot x_3 = (0, 0) + (-1) \cdot (0, 0) = (0, 0)

          • bb+label=0+(1)=1b \leftarrow b + label = 0 + (-1) = -1

        • Now w=(0,0)w = (0, 0), b=1b = -1

        • Iteration 4:

        • Consider (x4,1)=((1,1),1)(x_4, -1) = ((1, 1), -1)

        • z=wx4+b=(01)+(01)+(1)=1z = w \cdot x_4 + b = (0 \cdot 1) + (0 \cdot 1) + (-1) = -1

        • ypred=sign(1)=1y_{pred} = sign(-1) = -1. Correctly classified.

        • Next Epoch: Re-evaluate on misclassified examples (if any). The process continues until all examples are correctly classified (if separable) or for a fixed number of epochs.

      CHAPTER 5 – CLASSIFICATION AND LOGISTIC REGRESSION

      Logistic Regression:

      • A powerful

A powerful discriminative classifier used for predicting the probability of a binary outcome (0 or 1, true or false) or for multi-class classification. Despite its name, it is a classification algorithm, not a regression algorithm in the traditional sense.

  • Models the probability of a binary outcome: Logistic Regression applies the sigmoid (or logistic) function to a linear combination of input features to output a probability score between 0 and 1.

    • Sigmoid Function (σ\sigma):
      P(y=1x)=σ(wx+b)=11+e(wx+b)P(y=1|x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}
      where ww is the weight vector, xx is the input feature vector, and bb is the bias.

  • Decision Boundary: The point at which the predicted probability crosses a threshold (typically 0.5) determines the class. For a linear model like logistic regression, this results in a linear decision boundary (a line in 2D, a hyperplane in higher dimensions) that separates the classes.

  • Loss Function (Cost Function):

    • Measures the discrepancy between the model's predicted probabilities and the actual labels. The goal during training is to minimize this loss.

    • For logistic regression, the Cross-Entropy Loss (also known as Log Loss) is commonly used:
      J(θ)=1N<em>i=1N[y</em>ilog(h<em>θ(x</em>i))+(1y<em>i)log(1h</em>θ(x<em>i))]J(\theta) = -\frac{1}{N} \sum<em>{i=1}^{N} [y</em>i \log(h<em>\theta(x</em>i)) + (1 - y<em>i) \log(1 - h</em>\theta(x<em>i))] where NN is the number of training examples, y</em>iy</em>i is the actual label (0 or 1), and h<em>θ(x</em>i)h<em>\theta(x</em>i) is the predicted probability for xix_i. This function penalizes confident incorrect predictions heavily.

  • Optimization: The process of finding the optimal parameters (ww and bb) that minimize the loss function.

    • Gradient Descent: An iterative optimization algorithm that adjusts parameters in the direction opposite to the gradient of the loss function. The learning rate (η\eta) controls the size of these steps.

    • Parameter update rule: θθηθJ(θ)\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)

    • Types of Gradient Descent:

      • Batch Gradient Descent: Computes the gradient over the entire training dataset for each update. Can be slow for large datasets but guarantees convergence to the minimum for convex functions.

      • Stochastic Gradient Descent (SGD): Computes the gradient and updates parameters for each individual training example. Faster but with noisy updates, leading to a more erratic path to convergence.

      • Mini-batch Gradient Descent: A compromise between Batch and SGD, using small random batches of training examples for each update. Offers a good balance of speed and stability.

  • Regularization: Techniques used to prevent overfitting, which occurs when a model learns the training data too well and performs poorly on unseen data.

    • Adds a penalty term to the loss function that discourages large weights.

    • L1 Regularization (Lasso): Adds the sum of the absolute values of the weights to the loss function (λwj\lambda \sum |w_j|).

    • Promotes sparsity, meaning it can drive some weights exactly to zero, effectively leading to feature selection.

    • L2 Regularization (Ridge): Adds the sum of the squared values of the weights to the loss function (λwj2\lambda \sum w_j^2).

    • Shrinks weights towards zero (but rarely exactly to zero), leading to a more generalized model.

    • Regularized Loss Function: Jregularized(θ)=J(θ)+λR(θ)J_{regularized}(\theta) = J(\theta) + \lambda \cdot R(\theta)

    • The hyperparameter λ\lambda (lambda) controls the strength of the regularization; a higher λ\lambda means stronger regularization.

Numerical Example (Logistic Regression Prediction)

Assume a simple binary classification problem with two features (x<em>1,x</em>2x<em>1, x</em>2).

  • Parameters: w=[0.5,0.2]w = [0.5, -0.2], b=0.1b = 0.1

  • Test Input: x=[1.0,3.0]x = [1.0, 3.0]

  1. Calculate the linear combination:
    z=wx+b=(0.51.0)+(0.23.0)+0.1z = w \cdot x + b = (0.5 \cdot 1.0) + (-0.2 \cdot 3.0) + 0.1
    z=0.50.6+0.1z = 0.5 - 0.6 + 0.1
    z=0z = 0

  2. Apply the Sigmoid Function:
    P(y=1x)=11+ez=11+e0P(y=1|x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-0}}
    P(y=1x)=11+1=12=0.5P(y=1|x) = \frac{1}{1 + 1} = \frac{1}{2} = 0.5

  3. Make a prediction: If the threshold is 0.5:
    Since P(y=1x)=0.5P(y=1|x) = 0.5, the model would typically classify this as positive (class 1) or be considered on the boundary, depending on the tie-breaking rule.

CHAPTER 6 – NEURAL NETWORKS AND NEURAL LANGUAGE MODELS

Neural Networks: Computational models inspired by the structure and function of biological neural networks.

Composed of interconnected nodes (neurons) organized in layers.

Layers: Input layer, hidden layers, output layer.

Activation Functions: Introduce non-linearity into the network (e.g., Sigmoid, Tanh, ReLU).

ReLU(z)=max(0,z)ReLU(z) = \max(0, z)

Feed-forward Networks: Information flows in one direction, from input to output, no cycles.

Training Neural Networks:

Forward Propagation: Calculate output given inputs and current weights.

Backpropagation: Algorithm for efficiently calculating the gradients of the loss function with respect to the network weights.

Uses the chain rule to propagate errors backward through the network.

Optimization Algorithms: Stochastic Gradient Descent (SGD), Adam, RMSprop.

Word Embeddings: Dense vector representations of words.

Capture semantic and syntactic relationships between words.

Word2Vec: Popular method for learning word embeddings.

Skip-gram: Predict context words given a target word.

CBOW (Continuous Bag-of-Words): Predict target word given context words.

GloVe (Global Vectors for Word Representation): Combines advantages of global matrix factorization and local context window methods.

Neural Language Models (NLM):

Use neural networks to predict the next word in a sequence.

Represent words as continuous vectors (embeddings).

Can learn complex, non-linear dependencies in language.

Basic architecture involves an input layer (word embeddings), a hidden layer, and an output layer (softmax over vocabulary).