Computational Linguistics

16 Terms

1

Text normalisation

Converting text to a more convenient, standard form before processing.

  • Tokenisation

  • Normalising word formats

  • Segmenting sentences
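
A minimal Python sketch of these three steps; the regex-based tokeniser and sentence splitter are illustrative simplifications, not a standard method from the cards:

```python
import re

text = "Text normalisation is useful. It makes processing easier!"

# Normalising word formats: a simple choice is lowercasing everything.
lowered = text.lower()

# Tokenisation: pull out word tokens and punctuation with a rough regex.
tokens = re.findall(r"\w+|[^\w\s]", lowered)

# Sentence segmentation: naive split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(tokens)     # ['text', 'normalisation', 'is', 'useful', '.', 'it', ...]
print(sentences)  # ['Text normalisation is useful.', 'It makes processing easier!']
```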

2

Tokenisation

Separating out words or word parts from running text

3

N-grams

Sequence of n words, used to estimate the probability of a word given the n-1 previous words, or to assign probabilities to entire sequences

  • Bigram = two-word sequence

  • Trigram = three-word sequence
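
A short Python sketch of reading bigrams and trigrams off a token sequence (the helper name ngrams is an illustrative choice):

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "want", "chinese", "food"]
print(ngrams(tokens, 2))  # bigrams:  [('i', 'want'), ('want', 'chinese'), ('chinese', 'food')]
print(ngrams(tokens, 3))  # trigrams: [('i', 'want', 'chinese'), ('want', 'chinese', 'food')]
```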

4

Maximum Likelihood Estimation (MLE) for N-grams

An intuitive way to estimate probabilities for n-grams by computing probabilities of sequences based on the observed frequency of n-grams in a corpus and normalising them.

→ maximise the likelihood that the training data occurs under the model.

Problem: leads to zero probabilities for unseen sequences
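
For bigrams the MLE estimate is P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). A toy Python sketch with a made-up corpus, showing both the estimate and the zero-probability problem:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
corpus = "i want food </s> i want chinese food </s> i like food </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def mle_bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(mle_bigram_prob("i", "want"))     # 2/3: "i want" seen twice, "i" seen three times
print(mle_bigram_prob("i", "chinese"))  # 0.0: the bigram "i chinese" never occurs
```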

5

Zero Probabilities and Smoothing

Zero Probabilities

  • occur when an n-gram appears in the test set but not in the training data

  • probability of entire test set = 0 → prevents computation of metrics like perplexity

Smoothing (or discounting)

  • algorithm to tackle zero probabilities

  • Takes small amount of probability mass from more frequent events and distributes it to unseen events

6

Laplace (Add-one) Smoothing

  • Adds 1 to all n-gram counts before normalisation to prevent zero probabilities

  • Denominator adjusted by adding vocabulary size to total count

  • Commonly used in Naive Bayes text categorisation
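
Reusing the toy bigram counts from the MLE sketch, add-one smoothing changes the estimate to (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V), where V is the vocabulary size:

```python
from collections import Counter

corpus = "i want food </s> i want chinese food </s> i like food </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size (6 word types here)

def laplace_bigram_prob(prev, word):
    # Add-one smoothing: P(word | prev) = (C(prev word) + 1) / (C(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(laplace_bigram_prob("i", "want"))     # (2 + 1) / (3 + 6) ≈ 0.33
print(laplace_bigram_prob("i", "chinese"))  # (0 + 1) / (3 + 6) ≈ 0.11, no longer zero
```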

7

Backoff

  • if a higher-order n-gram has zero probability, the model "backs off" to a lower-order n-gram to estimate the probability

  • Stupid backoff algorithm does not discount higher-order n-grams → not a true probability distribution but works in practice
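
A sketch of the stupid backoff idea for bigrams, again on the toy corpus; the 0.4 back-off weight is the value commonly cited for the algorithm, everything else is illustrative:

```python
from collections import Counter

corpus = "i want food </s> i want chinese food </s> i like food </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def stupid_backoff(prev, word, alpha=0.4):
    # If the bigram was seen, use its relative frequency; otherwise back off
    # to the unigram relative frequency, weighted by a fixed factor alpha.
    # The resulting scores are not a true probability distribution.
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / sum(unigrams.values())

print(stupid_backoff("i", "want"))     # seen bigram: 2/3
print(stupid_backoff("i", "chinese"))  # unseen: 0.4 * C("chinese") / N
```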

8

Naive Bayes Classifier

Multinomial Naive Bayes classifier

  • generative classifier that estimates the probability (computed in log space) of a document belonging to a class

  • Makes a "naive" assumption of conditional independence among features: treats a document as a bag of words, ignoring word order and looking only at word frequency
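
A minimal sketch of the log-space decision rule; the dictionaries log_prior and log_likelihood are assumed inputs, estimated from training data as described in the next two cards:

```python
import math

def predict(doc_tokens, log_prior, log_likelihood, classes):
    """Pick the class maximising log P(c) + sum of log P(w | c) over the words."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in doc_tokens:  # bag of words: order is ignored
            # Words not in the training vocabulary are simply skipped.
            score += log_likelihood.get((w, c), 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```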

9

Prior probability P(C)

  • percentage of documents in training set belonging to class C

  • Calculation: P(c) = N_c / N_doc (number of documents in class c divided by total number of documents)

10

Likelihood P(wi|c)

  • probability of a word (wi) appearing given a class c

  • Estimated by counting the frequency of wi in all documents of class c and normalising by the total count of all words in class c

  • Laplace (add-one) smoothing is applied to avoid zero probabilities for words unseen in a specific class
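
Putting the prior and the smoothed likelihood together, a sketch of a training step that produces the log_prior and log_likelihood tables used by the decision rule above (function and variable names are illustrative):

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods.
    docs is a list of token lists, labels the corresponding classes."""
    classes = set(labels)
    n_doc = len(docs)
    vocab = {w for doc in docs for w in doc}
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / n_doc)        # P(c) = N_c / N_doc
        counts = Counter(w for doc in class_docs for w in doc)
        total = sum(counts.values())
        for w in vocab:
            # P(w | c) = (count(w, c) + 1) / (total words in c + |V|)
            log_likelihood[(w, c)] = math.log((counts[w] + 1) / (total + len(vocab)))
    return log_prior, log_likelihood, classes

docs = [["great", "film"], ["terrible", "film"], ["great", "plot"]]
labels = ["pos", "neg", "pos"]
log_prior, log_likelihood, classes = train_naive_bayes(docs, labels)
```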

11

Evaluation Metrics

Used instead of accuracy for unbalanced classes in text classification

  • Precision = percentage of items the system labelled as positive that are actually positive (TP / (TP + FP))

  • Recall = percentage of actual positive items the system correctly identified (TP / (TP + FN))

  • F1 measure = harmonic mean of precision and recall, used when the two are weighted equally (2PR / (P + R))
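
A small worked example of the three formulas (the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)
```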

12

Logistic Regression

  • discriminative classifier that learns to distinguish between classes directly, rather than building a model of how the data is generated (unlike Naive Bayes)

  • Learns a vector of weights (w) and a bias term (b) for the features

  • Weighted sum of the features is passed through the sigmoid function to produce a probability between 0 and 1

  • A decision boundary (e.g., 0.5) is used to assign the class
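
A sketch of the forward pass and decision boundary; the weights and bias here are made-up numbers, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    # Weighted sum of features, squashed to a probability in (0, 1).
    p = sigmoid(np.dot(w, x) + b)
    return p, int(p >= threshold)

# Illustrative feature vector, weights and bias (not learned here).
x = np.array([3.0, 2.0, 1.0])
w = np.array([0.5, -1.2, 0.3])
b = 0.1
print(predict(x, w, b))  # (probability, class 0 or 1)
```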

13

Vectorisation

  • process of converting text documents into numerical feature vectors for logistic regression

  • Term Frequency-Inverse Document Frequency (TF-IDF) assigns a numerical weight to each word, reflecting how important the word is to a document relative to the corpus
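
If scikit-learn is available, TF-IDF vectorisation might look like this (assuming a reasonably recent scikit-learn, where get_feature_names_out is the method name):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was great",
    "the movie was terrible",
    "a great great film",
]

vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(docs)   # sparse matrix: one row per document

print(vectoriser.get_feature_names_out())  # vocabulary learned from the corpus
print(X.shape)                             # (3, number of distinct words)
```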

14

Training in Logistic Regression

Training phase = learning the weights (w) and bias (b) that minimise the loss function

  • Loss function: cross-entropy loss; minimising it is equivalent to maximising the log probability of the true labels given the observations

    • Measures how well the classifier's output probability matches the true label (derived from the negative log likelihood). Lower loss = better model performance

  • Algorithm: Stochastic Gradient Descent (SGD), an iterative optimisation algorithm, performs the minimisation

    • Iteratively updates the model parameters: computes the gradient of the loss function for a mini-batch of training examples and adjusts the parameters in the opposite direction of the gradient, scaled by the learning rate (a hyperparameter)
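
A hedged sketch of one SGD step for binary logistic regression with cross-entropy loss; the batch, learning rate and data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x_batch, y_batch, lr=0.1):
    """One SGD update on a mini-batch using the cross-entropy gradient:
    d loss / d w = (sigmoid(w.x + b) - y) * x, averaged over the batch."""
    preds = sigmoid(x_batch @ w + b)
    error = preds - y_batch                  # shape (batch_size,)
    grad_w = x_batch.T @ error / len(y_batch)
    grad_b = error.mean()
    # Move against the gradient, scaled by the learning rate.
    return w - lr * grad_w, b - lr * grad_b

# Tiny illustrative mini-batch: 2 examples, 3 features each.
x_batch = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
y_batch = np.array([1.0, 0.0])
w, b = np.zeros(3), 0.0
w, b = sgd_step(w, b, x_batch, y_batch)
```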

15

Overfitting and Regularisation

Overfitting

  • model learns training data too perfectly, including noise → poor generalisation to unseen data

Regularisation

  • technique to prevent overfitting

  • Adds penalty term to loss function, penalising large weights

  • L1 regularisation (Lasso) = sparser weights (many weights pushed to exactly zero)

  • L2 regularisation (Ridge) = smaller weights
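
In scikit-learn the two penalties can be selected directly (assuming scikit-learn's LogisticRegression; the L1 penalty needs a solver that supports it, e.g. liblinear):

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularisation strength: smaller C = stronger penalty.
ridge_like = LogisticRegression(penalty="l2", C=1.0)                      # L2 (Ridge): shrinks weights
lasso_like = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")  # L1 (Lasso): sparse weights

# Both are fitted the usual way, e.g. ridge_like.fit(X_train, y_train),
# where X_train could be the TF-IDF matrix from the vectorisation card.
```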

16

Generative vs Discriminative Classifiers

Generative (e.g., Naive Bayes)

  • models how a class could generate the input data

  • Tries to understand what each class looks like

  • Performs well on small datasets/short documents

Discriminative classifier (e.g., Logistic Regression)

  • directly learns to distinguish between classes

  • Focuses on most useful features for discrimination

  • More robust to correlated features & works better on large datasets