Text normalisation
Converting text to a more convenient, standard form before processing.
Tokenisation
Normalising word formats
Segmenting sentences
Tokenisation
Separating out words or word parts from running text
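A minimal sketch of a naive regex tokeniser (lowercasing plus splitting on word/punctuation boundaries); real pipelines usually rely on library tokenisers, so treat this as illustrative only:

```python
# Minimal sketch: naive regex tokeniser (lowercase + split on word/punctuation).
import re

def tokenise(text):
    # \w+ grabs word-like tokens, [^\w\s] grabs single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenise("Text normalisation helps, a lot."))
# ['text', 'normalisation', 'helps', ',', 'a', 'lot', '.']
```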
N-grams
Sequence of n words, used to estimate probability of word given the n-1 previous words, or to assign probabilities to entire sequences
Bigram = two-word sequence
Trigram = three-word sequence
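A minimal Python sketch of extracting bigrams and trigrams from a token list (toy example):

```python
# Minimal sketch: sliding a window of size n over a token list to get n-grams.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "am", "sam", "i", "am"]
print(ngrams(tokens, 2))  # bigrams:  [('i', 'am'), ('am', 'sam'), ('sam', 'i'), ('i', 'am')]
print(ngrams(tokens, 3))  # trigrams: [('i', 'am', 'sam'), ('am', 'sam', 'i'), ('sam', 'i', 'am')]
```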
Maximum Likelihood Estimation (MLE) for N-grams
An intuitive way to estimate n-gram probabilities by taking the observed frequency of each n-gram in a corpus and normalising it (e.g., dividing a bigram count by the count of its first word).
→ maximise the likelihood that the training data occurs under the model.
Problem: leads to zero probabilities for unseen sequences
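A minimal Python sketch of MLE bigram estimation on a toy corpus, P(w2 | w1) = count(w1 w2) / count(w1), showing the zero-probability problem:

```python
# Minimal sketch: MLE bigram probabilities on a toy corpus.
from collections import Counter

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("i", "am"))   # 1.0 -- "am" always follows "i" in this toy corpus
print(p_mle("am", "i"))   # 0.0 -- unseen bigram: the zero-probability problem
```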
Zero Probabilities and Smoothing
Zero Probabilities
occur when an n-gram appears in the test set but not in the training data
probability of entire test set = 0 → prevents computation of metrics like perplexity
Smoothing (or discounting)
algorithm to tackle zero probabilities
Takes small amount of probability mass from more frequent events and distributes it to unseen events
Laplace (Add-one) Smoothing
Adds 1 to all n-gram counts before normalisation to prevent zero probabilities
Denominator adjusted by adding vocabulary size to total count
Commonly used in Naive Bayes text categorisation
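A minimal Python sketch of add-one smoothing for bigram probabilities on the same toy corpus, P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V):

```python
# Minimal sketch: Laplace (add-one) smoothed bigram probabilities, V = vocabulary size.
from collections import Counter

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

def p_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_laplace("i", "am"))  # seen bigram: discounted below its MLE of 1.0
print(p_laplace("am", "i"))  # unseen bigram: now gets a small non-zero probability
```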
Backoff
if a higher-order n-gram has zero probability, the model "backs off" to a lower-order n-gram to estimate the probability
The stupid backoff algorithm does not discount higher-order n-grams → not a true probability distribution, but works well in practice
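A minimal Python sketch of stupid backoff for bigrams on a toy corpus; the fixed backoff factor (0.4 here, a common choice) and the corpus are illustrative, and the resulting scores are not true probabilities:

```python
# Minimal sketch: stupid backoff for bigrams -- use the bigram relative frequency
# if non-zero, otherwise back off to a scaled unigram frequency.
from collections import Counter

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def stupid_backoff(w1, w2, alpha=0.4):
    if bigram_counts[(w1, w2)] > 0:
        return bigram_counts[(w1, w2)] / unigram_counts[w1]
    return alpha * unigram_counts[w2] / N  # back off to scaled unigram score

print(stupid_backoff("i", "am"))   # seen bigram: plain relative frequency
print(stupid_backoff("am", "i"))   # unseen bigram: backed-off unigram score
```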
Naive Bayes Classifier
Multinomial Naive Bayes classifier
generative classifier that estimates the probability of a document belonging to a class (computed in log space to avoid numerical underflow)
Makes the "naive" assumption of conditional independence among features: treats a document as a bag of words, ignoring word order and looking only at frequency
Prior probability P(c)
percentage of documents in the training set belonging to class c
Calculation: P(c) = N_c / N_doc (number of documents in class c divided by total number of documents)
Likelihood P(wi|c)
probability of a word (wi) appearing given a class c
Estimated by counting frequency of wi in all documents of class c and normalising by total count of all words in class c
Laplace (add-one) smoothing applied to avoid zero probabilities for unseen words in specific class
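A minimal Python sketch of multinomial Naive Bayes training and prediction with add-one smoothing on a tiny hypothetical sentiment dataset:

```python
# Minimal sketch: multinomial Naive Bayes with add-one smoothing, scored in log space.
import math
from collections import Counter, defaultdict

docs = [("good great fun", "pos"), ("boring bad plot", "neg"), ("great plot", "pos")]

class_doc_counts = Counter(label for _, label in docs)
word_counts = defaultdict(Counter)            # per-class word frequencies
for text, label in docs:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for c in class_doc_counts:
        # log prior: fraction of training documents in class c
        score = math.log(class_doc_counts[c] / len(docs))
        total = sum(word_counts[c].values())
        for w in text.split():
            if w in vocab:  # unknown words are ignored
                # add-one smoothed likelihood P(w | c)
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("great fun plot"))  # -> 'pos'
```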
Evaluation Metrics
Used instead of accuracy for unbalanced classes in text classification
Precision = percentage of items the system correctly detected as positive that are actually positive (TP / (TP + FP))
Recall = percentage of actual positive items correctly identified by system (TP / (TP + FN))
F1 measure = harmonic mean of precision and recall (2PR / (P + R)), used when both are weighted equally
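A minimal Python sketch computing precision, recall and F1 from TP/FP/FN counts (example counts are made up):

```python
# Minimal sketch: precision, recall and F1 from confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)
```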
Logistic Regression
discriminative classifier that learns to distinguish between classes directly, rather than building a model of how the data is generated (unlike Naive Bayes)
Learns a vector of weights (w) and a bias term (b) for the features
Weighted sum of features passed through sigmoid function to produce probability between 0 and 1
Decision boundary to assign class (e.g., 0.5)
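A minimal Python sketch of logistic regression prediction with hand-picked (hypothetical) weights and a 0.5 decision threshold:

```python
# Minimal sketch: weighted sum + sigmoid -> probability -> class decision.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)                      # P(y = 1 | x), between 0 and 1
    return p, int(p >= threshold)       # probability and hard class decision

print(predict([3.0, 1.0], weights=[0.8, -0.4], bias=-1.0))  # (~0.73, 1)
```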
Vectorisation
process of converting text documents into numerical feature vectors for logistic regression
Term Frequency-Inverse Document Frequency (TF-IDF) assigns a numerical weight to each word to reflect how important it is to a document relative to the whole corpus
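A minimal Python sketch of TF-IDF weighting using tf(w, d) * log(N / df(w)); library implementations (e.g. scikit-learn) differ in smoothing and normalisation details, so this is illustrative only:

```python
# Minimal sketch: raw term frequency times inverse document frequency.
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "purred"]]
N = len(docs)
df = Counter(w for d in docs for w in set(d))      # document frequency per word

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))  # 'the' -> 0.0 (appears everywhere); 'cat' and 'sat' weighted higher
```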
Training in Logistic Regression
Training phase = learning the weights (w) and bias (b) that minimise the loss function
Loss function: cross-entropy loss; minimising it is equivalent to maximising the log probability of the true labels given the observations
Measures how well the classifier's output probability matches the true label (derived from the negative log likelihood). Lower loss = better model performance
Algorithm: Stochastic Gradient Descent (SGD), an iterative optimisation algorithm, performs the minimisation
Computes the gradient of the loss function on a mini-batch of training examples and updates the parameters in the opposite direction of the gradient, scaled by the learning rate (a hyperparameter)
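A minimal Python sketch of SGD updates for binary logistic regression with cross-entropy loss (batch size 1, toy data); the gradient of the loss with respect to each weight is (p - y) * x:

```python
# Minimal sketch: one SGD step for binary logistic regression with cross-entropy loss.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # model prediction
    error = p - y                                           # dLoss/dz for cross-entropy + sigmoid
    w = [wi - lr * error * xi for wi, xi in zip(w, x)]      # move against the gradient
    b = b - lr * error
    return w, b

w, b = [0.0, 0.0], 0.0
for x, y in [([1.0, 2.0], 1), ([2.0, 0.5], 0)] * 50:        # tiny toy training set
    w, b = sgd_step(w, b, x, y)
print(w, b)
```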
Overfitting and Regularisation
Overfitting
model learns training data too perfectly, including noise → poor generalisation to unseen data
Regularisation
technique to prevent overfitting
Adds penalty term to loss function, penalising large weights
L1 regularisation (Lasso) = sparser weights (many weights driven to exactly zero)
L2 regularisation (Ridge) = smaller weights overall
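A minimal Python sketch of adding an L2 penalty to the cross-entropy loss for one example (alpha is a hypothetical regularisation strength; an L1 penalty would sum absolute weight values instead):

```python
# Minimal sketch: cross-entropy loss plus an L2 penalty on the weights.
import math

def regularised_loss(p, y, weights, alpha=0.01):
    cross_entropy = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    l2_penalty = alpha * sum(w * w for w in weights)   # penalises large weights
    return cross_entropy + l2_penalty

print(regularised_loss(p=0.9, y=1, weights=[2.0, -0.5], alpha=0.01))  # ~0.148
```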
Generative vs Discriminative Classifiers
Generative (e.g., Naive Bayes)
models how class could generate input data
Tries to understand what each class looks like
Performs well on small datasets/short documents
Discriminative classifier (e.g., Logistic Regression)
directly learns to distinguish between classes
Focuses on most useful features for discrimination
More robust to correlated features & works better on large datasets