Text normalisation
Converting text to a more convenient, standard form before processing.
Tokenisation
Normalising word formats
Segmenting sentences
Tokenisation
Separating out words or word parts from running text
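A minimal sketch of a naive regex tokeniser (lowercasing plus splitting on word/punctuation boundaries); real pipelines usually rely on library tokenisers, so treat this as illustrative only:

```python
# Minimal sketch: naive regex tokeniser (lowercase + split on word/punctuation).
import re

def tokenise(text):
    # \w+ grabs word-like tokens, [^\w\s] grabs single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenise("Text normalisation helps, a lot."))
# ['text', 'normalisation', 'helps', ',', 'a', 'lot', '.']
```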
N-grams
Sequence of n words, used to estimate probability of word given the n-1 previous words, or to assign probabilities to entire sequences
Bigram = two-word sequence
Trigram = three-word sequence
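A minimal Python sketch of extracting bigrams and trigrams from a token list (toy example):

```python
# Minimal sketch: sliding a window of size n over a token list to get n-grams.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "am", "sam", "i", "am"]
print(ngrams(tokens, 2))  # bigrams:  [('i', 'am'), ('am', 'sam'), ('sam', 'i'), ('i', 'am')]
print(ngrams(tokens, 3))  # trigrams: [('i', 'am', 'sam'), ('am', 'sam', 'i'), ('sam', 'i', 'am')]
```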
Maximum Likelihood Estimation (MLE) for N-grams
An intuitive way to estimate n-gram probabilities by taking the observed frequency of each n-gram in a corpus and normalising it (e.g., dividing a bigram count by the count of its first word).
→ maximise the likelihood that the training data occurs under the model.
Problem: leads to zero probabilities for unseen sequences
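A minimal Python sketch of MLE bigram estimation on a toy corpus, P(w2 | w1) = count(w1 w2) / count(w1), showing the zero-probability problem:

```python
# Minimal sketch: MLE bigram probabilities on a toy corpus.
from collections import Counter

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("i", "am"))   # 1.0 -- "am" always follows "i" in this toy corpus
print(p_mle("am", "i"))   # 0.0 -- unseen bigram: the zero-probability problem
```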
Zero Probabilities and Smoothing
Zero Probabilities
occur when an n-gram appears in the test set but not in the training data
probability of entire test set = 0 → prevents computation of metrics like perplexity
Smoothing (or discounting)
algorithm to tackle zero probabilities
Takes small amount of probability mass from more frequent events and distributes it to unseen events
Laplace (Add-one) Smoothing
Adds 1 to all n-gram counts before normalisation to prevent zero probabilities
Denominator adjusted by adding vocabulary size to total count
Commonly used in Naive Bayes text categorisation
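A minimal Python sketch of add-one smoothing for bigram probabilities on the same toy corpus, P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V):

```python
# Minimal sketch: Laplace (add-one) smoothed bigram probabilities, V = vocabulary size.
from collections import Counter

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

def p_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_laplace("i", "am"))  # seen bigram: discounted below its MLE of 1.0
print(p_laplace("am", "i"))  # unseen bigram: now gets a small non-zero probability
```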
Backoff
if a higher-order n-gram has zero probability, the model "backs off" to a lower-order n-gram to estimate the probability
The stupid backoff algorithm does not discount higher-order n-grams → not a true probability distribution, but works well in practice
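A minimal Python sketch of stupid backoff for bigrams on a toy corpus; the fixed backoff factor (0.4 here, a common choice) and the corpus are illustrative, and the resulting scores are not true probabilities:

```python
# Minimal sketch: stupid backoff for bigrams -- use the bigram relative frequency
# if non-zero, otherwise back off to a scaled unigram frequency.
from collections import Counter

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def stupid_backoff(w1, w2, alpha=0.4):
    if bigram_counts[(w1, w2)] > 0:
        return bigram_counts[(w1, w2)] / unigram_counts[w1]
    return alpha * unigram_counts[w2] / N  # back off to scaled unigram score

print(stupid_backoff("i", "am"))   # seen bigram: plain relative frequency
print(stupid_backoff("am", "i"))   # unseen bigram: backed-off unigram score
```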
Naive Bayes Classifier
Multinomial Naive Bayes classifier
generative classifier that estimates the probability of a document belonging to a class (computed in log space to avoid numerical underflow)
Makes the "naive" assumption of conditional independence among features: treats a document as a bag of words, ignoring word order and looking only at frequency
Prior probability P(c)
percentage of documents in the training set belonging to class c
Calculation: P(c) = N_c / N_doc (number of documents in class c divided by total number of documents)
Likelihood P(wi|c)
probability of a word (wi) appearing given a class c
Estimated by counting frequency of wi in all documents of class c and normalising by total count of all words in class c
Laplace (add-one) smoothing applied to avoid zero probabilities for unseen words in specific class
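A minimal Python sketch of multinomial Naive Bayes training and prediction with add-one smoothing on a tiny hypothetical sentiment dataset:

```python
# Minimal sketch: multinomial Naive Bayes with add-one smoothing, scored in log space.
import math
from collections import Counter, defaultdict

docs = [("good great fun", "pos"), ("boring bad plot", "neg"), ("great plot", "pos")]

class_doc_counts = Counter(label for _, label in docs)
word_counts = defaultdict(Counter)            # per-class word frequencies
for text, label in docs:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for c in class_doc_counts:
        # log prior: fraction of training documents in class c
        score = math.log(class_doc_counts[c] / len(docs))
        total = sum(word_counts[c].values())
        for w in text.split():
            if w in vocab:  # unknown words are ignored
                # add-one smoothed likelihood P(w | c)
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("great fun plot"))  # -> 'pos'
```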
Evaluation Metrics
Used instead of accuracy for unbalanced classes in text classification
Precision = percentage of items the system correctly detected as positive that are actually positive (TP / (TP + FP))
Recall = percentage of actual positive items correctly identified by system (TP / (TP + FN))
F1 measure = harmonic mean of precision and recall (2PR / (P + R)), used when both are weighted equally
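A minimal Python sketch computing precision, recall and F1 from TP/FP/FN counts (example counts are made up):

```python
# Minimal sketch: precision, recall and F1 from confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)
```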
Logistic Regression
discriminative classifier that learns to distinguish between classes directly, rather than building a model of how the data is generated (unlike Naive Bayes)
Learns a vector of weights (w) and a bias term (b) for the features
Weighted sum of features passed through sigmoid function to produce probability between 0 and 1
Decision boundary to assign class (e.g., 0.5)
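A minimal Python sketch of logistic regression prediction with hand-picked (hypothetical) weights and a 0.5 decision threshold:

```python
# Minimal sketch: weighted sum + sigmoid -> probability -> class decision.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)                      # P(y = 1 | x), between 0 and 1
    return p, int(p >= threshold)       # probability and hard class decision

print(predict([3.0, 1.0], weights=[0.8, -0.4], bias=-1.0))  # (~0.73, 1)
```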
Vectorisation
process of converting text documents into numerical feature vectors for logistic regression
Term Frequency-Inverse Document Frequency (TF-IDF) assigns a numerical weight to each word to reflect how important it is to a document relative to the whole corpus
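A minimal Python sketch of TF-IDF weighting using tf(w, d) * log(N / df(w)); library implementations (e.g. scikit-learn) differ in smoothing and normalisation details, so this is illustrative only:

```python
# Minimal sketch: raw term frequency times inverse document frequency.
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "purred"]]
N = len(docs)
df = Counter(w for d in docs for w in set(d))      # document frequency per word

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))  # 'the' -> 0.0 (appears everywhere); 'cat' and 'sat' weighted higher
```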
Training in Logistic Regression
Training phase = learning the weights (w) and bias (b) that minimise the loss function
Loss function: cross-entropy loss; minimising it is equivalent to maximising the log probability of the true labels given the observations
Measures how well the classifier's output probability matches the true label (derived from the negative log likelihood). Lower loss = better model performance
Algorithm: Stochastic Gradient Descent (SGD), an iterative optimisation algorithm, performs the minimisation
Computes the gradient of the loss function on a mini-batch of training examples and updates the parameters in the opposite direction of the gradient, scaled by the learning rate (a hyperparameter)
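A minimal Python sketch of SGD updates for binary logistic regression with cross-entropy loss (batch size 1, toy data); the gradient of the loss with respect to each weight is (p - y) * x:

```python
# Minimal sketch: one SGD step for binary logistic regression with cross-entropy loss.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # model prediction
    error = p - y                                           # dLoss/dz for cross-entropy + sigmoid
    w = [wi - lr * error * xi for wi, xi in zip(w, x)]      # move against the gradient
    b = b - lr * error
    return w, b

w, b = [0.0, 0.0], 0.0
for x, y in [([1.0, 2.0], 1), ([2.0, 0.5], 0)] * 50:        # tiny toy training set
    w, b = sgd_step(w, b, x, y)
print(w, b)
```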
Overfitting and Regularisation
Overfitting
model learns training data too perfectly, including noise → poor generalisation to unseen data
Regularisation
technique to prevent overfitting
Adds penalty term to loss function, penalising large weights
L1 regularisation (Lasso) = sparser weights (many weights driven to exactly zero)
L2 regularisation (Ridge) = smaller weights overall
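A minimal Python sketch of adding an L2 penalty to the cross-entropy loss for one example (alpha is a hypothetical regularisation strength; an L1 penalty would sum absolute weight values instead):

```python
# Minimal sketch: cross-entropy loss plus an L2 penalty on the weights.
import math

def regularised_loss(p, y, weights, alpha=0.01):
    cross_entropy = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    l2_penalty = alpha * sum(w * w for w in weights)   # penalises large weights
    return cross_entropy + l2_penalty

print(regularised_loss(p=0.9, y=1, weights=[2.0, -0.5], alpha=0.01))  # ~0.148
```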
Generative vs Discriminative Classifiers
Generative (e.g., Naive Bayes)
models how class could generate input data
Tries to understand what each class looks like
Performs well on small datasets/short documents
Discriminative classifier (e.g., Logistic Regression)
directly learns to distinguish between classes
Focuses on most useful features for discrimination
More robust to correlated features & works better on large datasets