Intro to NLP (Week 1 & 2)

Last updated 3:35 PM on 4/10/26

31 Terms

1
New cards

Symbolic vs Statistical vs Generative/Neural/DL NLP

Symbolic (rule-based) NLP (1950–1990) relies on hard-coded rules and templates: reliable when a rule applies, but unscalable to creative, ambiguous language. Sentiment analysis with symbolic NLP is very weak.

  • ELIZA (early chatbot)

Statistical NLP leverages statistical information to build probabilistic/generative models that learn a distribution over language and can generate text. It requires high data availability (HDA) and high-quality data.

  • WWW benefited the growth of statistical NLP due to HDA

Neural networks learn representations automatically from raw text and replace hand-crafted features with embedding vectors. Sequences are modelled with RNNs, LSTMs and GRUs (2010–2017) and with Transformers (2017).

  • learn distributed representations (embeddings) that capture semantic similarity. Scale better with large corpora (more data = better performance). Better at modelling long-range dependencies.

  • Data hungry and compute intensive and can hallucinate

2
New cards

Early NLP experiment (1950)

Georgetown-IBM experiment: Russian → English translation, which helped launch NLP as a field.

3
New cards

Why did early rule-based NLP systems struggle?

Rule-based systems had one major strength: when a case was covered by the rules, they could be reliable because the knowledge came from human experts. But they struggled because language is:

  • creative,

  • ambiguous,

  • constantly changing,

  • full of exceptions.

  • hard to build a complete and scalable set of rules.

4
New cards

What was the main advantage and main bottleneck of statistical NLP?

The main advantage of statistical NLP is that it does not need rigid hand-crafted assumptions about language in advance. Instead, it can learn patterns from data, which makes it more scalable and flexible than pure rule-based approaches.

The main bottleneck is data: these methods need large amounts of high-quality, representative data. In machine translation, improvements in the 1980s were strongly tied to the availability of large amounts of parallel data, where texts in one language are aligned with equivalent texts in another.

5
New cards

How did the WWW help NLP?

The growth of the World Wide Web made massive amounts of text available, which helped statistical NLP methods improve. In tasks like machine translation, the availability of large parallel corpora was especially important because systems could learn translation relationships from real examples.

6
New cards

From pretraining a model to fine-tuning

GPT (language modelling)

BERT (masked language modelling) (bidirectional context)

After pretraining, the general model needs to be fine-tuned for a downstream task:

  • Classification: Sentiment Analysis

  • Sequence Labelling: Named Entity Recognition

  • Seq2Seq : Translation

7
New cards

NLP applications

Google Translate (machine translation)

  • challenging because different languages have different rules/definitions for words and different word forms based on gender and number. Differences in word order add complexity, and words themselves are ambiguous.

  • parallel data contains annotations marking the chunks of text on both sides that express precisely the same idea. The system learns to identify correspondences between the two sides at some level of granularity.

Information Retrieval/Information Search

  • How to search and retrieve quickly among millions of documents on the web? How to rank relevance? Queries may be imprecise, incomplete, ungrammatical or ambiguous: "book" (for reading, or booking a hotel?)

Spam Detection (text classification task)

  • Challenging since some emails contain clear red flags but others look like normal content. What is the cost of false alarms? Must balance false positives and negatives; use a combination of signals instead of one single feature.

  • Binary classification approach

Predictive Text

  • Challenging as the range of possible responses is practically infinite. The ability to predict accurately brings ML closer to human language use.

  • Provided with a large number of examples, computers can observe which characters and word combinations are most likely and predict them.

8
New cards

Linguistic Analysis

  • Raw text processing – segmentation and tokenization.

  • Morphology – stemming and lemmatization. (sub word level)

  • Word level – part-of-speech tagging.

  • Syntax – grammatical relations, chunking, parsing.

  • Semantics – word meaning and text meaning.

9
New cards

Stemming vs Lemmatisation

Stemming cuts off prefixes and suffixes.

  • Downside is that the root form may not be a valid word (e.g. a Porter-style stemmer maps "studies" → "studi"), unlike a lemmatiser's "study".

  • Efficient and scalable

  • Stemming can merge dimensions, which reduces vector-space dimensionality (good for computation); the computational benefit often outweighs the cost of stemming errors.

Advantages of stemming:

  • it does not rely as heavily on a dictionary,

  • it can handle new or creative words better,

  • it can reduce dimensionality more aggressively.

A lemma is a dictionary form where we can map multiple forms to the most basic one (Reduces space).

  • Lemmatisation is resource-intensive and time-consuming, since the algorithm has to keep a dictionary of lemmas to convert word forms.

  • Uses look up tables with rule sets.

  • Returns more understandable base forms.

  • Can be improved with PoS tagging.
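A minimal sketch of suffix stripping in Python (the `SUFFIXES` list and `crude_stem` are illustrative, not the real Porter algorithm) shows why stems may not be valid words:

```python
# Illustrative suffix stripper; real stemmers (e.g. Porter) use ordered rule sets.
SUFFIXES = ["ation", "ing", "ies", "ed", "s"]

def crude_stem(word: str) -> str:
    """Strip the first matching suffix; the result may not be a valid word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("running"), crude_stem("studies"))  # runn stud
```

A lemmatiser would instead return the dictionary forms "run" and "study", at the cost of maintaining a lookup table.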

10
New cards

NLP Project Pipeline

Step 1 (Analysis): define the task, its inputs/outputs, and pick a framework (e.g. a classification problem); split the dataset.

Step 2 (Data exploration + preprocessing): lowercasing, tokenisation, stopword removal, punctuation removal (e.g. with regex), lemmatisation or stemming. Store queries and docs in clearly defined data structures (e.g. dictionaries of queries, collections of documents).

Step 3 (Feature engineering): choose features and consider their weights (textual: n-grams, TF-IDF, bag of words; normalisation: remove punctuation, lowercase, remove stopwords). Build a shared vector space, which may contain zeros where words are absent.

Step 4 (Model): choose a model (transformers or LSTMs vs logistic regression, naive Bayes); for retrieval, apply cosine similarity between the query vector and each document vector.

Step 5 (Evaluation): evaluate with accuracy, precision, MRR, baselines and benchmarks.

11
New cards

NLP Methods

Symbolic/Feature-based NLP

  • Manual linguistic or count-based representations

  • “Hand-engineered text representations such as bag-of-words, n-grams, and TF-IDF. Usually sparse and interpretable.”

Statistical / traditional ML NLP

  • Learns from feature counts using classical ML

  • Naive Bayes, logistic regression, SVM, HMM

Vector-based / neural

  • “Represents text numerically as dense vectors (embeddings), either static or contextual, and uses neural networks.”

  • Word2Vec, LSTM, BERT, Transformers

Generative models

  • “Learn probability distributions over text and can generate new sequences; sometimes also used for classification or labelling.”

  • n-gram LM, seq2seq, GPT, T5

12
New cards

In information retrieval, what are a document, collection, term, and query?

In information retrieval (IR), the core objective is to find relevant, usually unstructured, information (such as text, images, or documents) within a large repository to satisfy a specific user need.

A document is any unit of text indexed for retrieval.

A collection is the set of documents being searched.

A term is an item, often a word or phrase, used by the system to represent content.

A query is the user’s information need expressed as a set of terms.

13
New cards

Tokenisation

Tokenization (word segmentation) is the task of separating out (tokenizing) words from running text. It is one of the most important steps in text processing.

Multi-word expressions are phrases like New York Stock Exchange that behave like a single conceptual unit even though they contain spaces. Tokenizing them incorrectly can damage meaning

tokenization systems are designed to be fast and reliable, often using deterministic methods such as regular expressions compiled into finite-state automata, together with lists of abbreviations and machine-learning-based refinements.
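A hedged sketch of regex-based tokenisation in Python, with a hypothetical `MWES` set protecting multi-word expressions so they survive as single tokens:

```python
import re

MWES = {"New York Stock Exchange"}  # hypothetical list of known MWEs

def tokenize(text: str) -> list[str]:
    # Join each MWE with underscores so the regex keeps it as one token.
    for mwe in MWES:
        text = text.replace(mwe, mwe.replace(" ", "_"))
    # Words (with an optional apostrophe part) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("Trading on the New York Stock Exchange doesn't stop."))
```

Production tokenisers compile such patterns into finite-state automata and add abbreviation lists, as the notes describe.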

14
New cards

Frequency Analysis and Zipfs Law

Frequency analysis studies how often words occur in a corpus. It helps reveal patterns in the data, such as which words are very common and may be uninformative for a task.

Zipf's Law is an empirical law stating that in a large corpus of natural language, the frequency of any word is inversely proportional to its rank in a frequency table. The power-law distribution: if you rank words by frequency, the r-th most frequent word has a frequency proportional to 1/r^α (where α is close to 1).

Normalised frequency of the element of rank r: f(r) = (1/r^s) / Σ_{n=1}^{N} (1/n^s), where s is the exponent characterising the language; for English, s ≈ 1.

Zipf’s law explains why we see so many very frequent words, many of which become stopwords in tasks like IR or classification. It also explains the long tail of rare terms and why feature selection matters so much.
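On a small corpus this can be checked directly; a sketch in Python (the toy corpus is illustrative) ranks words by frequency and compares each count with the s = 1 Zipf prediction f(1)/r:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran the dog sat".split()
ranked = Counter(corpus).most_common()  # [(word, count), ...] by descending count

for rank, (word, count) in enumerate(ranked, start=1):
    zipf_prediction = ranked[0][1] / rank  # f(r) ≈ f(1) / r for s = 1
    print(rank, word, count, round(zipf_prediction, 2))
```

On real corpora the fit only emerges at scale, but the shape is already visible: one dominant word and a long tail of hapaxes.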

15
New cards

What are stopwords, and why are they often removed?

Stopwords are very frequent words that carry relatively little meaning on their own in many NLP tasks. Examples are common function words like “the”, “of”, and “and”.

In information retrieval, removing stopwords is useful because:

  • they often do not express the core information need,

  • they increase the dimensionality of the search space,

  • filtering them helps produce a smaller, more informative representation of the query or document.

Stopwords are not "useless" in an absolute sense: they can contribute grammatical structure, so whether to remove them depends on the task.
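A minimal stopword filter in Python (the small `STOPWORDS` set is illustrative; real systems use larger, task-specific lists):

```python
STOPWORDS = {"the", "of", "and", "a", "an", "to", "in"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop very frequent function words to shrink the representation."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "history", "of", "NLP"]))  # ['history', 'NLP']
```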

16
New cards

Vector Representation

In the vector-space model, both queries and documents are represented as vectors in the same space, where dimensions correspond to terms. The values in each dimension often represent counts or weights for those terms.

e.g. a query with two terms can be represented as a two-dimensional vector where each coordinate corresponds to how often that term appears.

Similarity between query and document can then be measured geometrically.

17
New cards

Length/ norm and dot product of a vector

The length (norm) of a vector tells us how large it is. For a vector u = [u1, u2, ..., un], the Euclidean norm is: ||u|| = sqrt(u1^2 + u2^2 + ... + un^2)

This matters because cosine similarity divides by vector lengths to normalise for document/query size.

The dot product measures how much two vectors align. A larger dot product usually means greater similarity, but it is affected by vector length. For vectors u and v, the dot product is: u · v = u1v1 + u2v2 + ... + unvn

This is the key formula behind cosine similarity
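Both formulas translate directly into Python; a small sketch:

```python
import math

def norm(u: list[float]) -> float:
    """Euclidean norm: ||u|| = sqrt(u1^2 + ... + un^2)."""
    return math.sqrt(sum(x * x for x in u))

def dot(u: list[float], v: list[float]) -> float:
    """Dot product: u . v = u1*v1 + ... + un*vn."""
    return sum(a * b for a, b in zip(u, v))

print(norm([3, 4]))         # 5.0
print(dot([1, 2], [3, 4]))  # 1*3 + 2*4 = 11
```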

18
New cards

Angle between two vectors

The relationship is: u · v = ||u|| ||v|| cos(theta)

19
New cards

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. The formula is: cosine(A,B) = (A · B) / (||A|| ||B||)

It compares direction rather than raw size, which makes it very useful in IR.

Expanded form in images

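The formula can be sketched directly in Python to show the length normalisation at work:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """cosine(A, B) = (A . B) / (||A|| ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# [2, 4] is just [1, 2] scaled, so the angle is 0 and the similarity is ~1,
# even though the plain dot products would differ.
print(cosine([1, 2], [2, 4]))
```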
20
New cards

Cosine Similarity vs Plain Dot Product

The plain dot product tends to favour longer vectors, so longer documents may look more relevant simply because they contain more words. Cosine similarity fixes this by dividing by the lengths of the vectors, so it focuses on direction / relative term distribution instead of absolute size.

21
New cards

Cosine similarity vs euclidean distance

Euclidean distance is heavily affected by document length, while cosine similarity is length-normalised and focuses on direction.

22
New cards

What is PageRank

PageRank is a search-ranking idea that looks not just at the words on a page but also at how pages link to each other. Links from other pages act a bit like endorsements, helping determine which pages are important. A page gets a high PageRank if the random surfer is likely to visit it often

This models the idea that a user usually follows links, but sometimes jumps to a random page.

23
New cards

PageRank algorithm

Step 1: Start with initial PageRank values

If there are 4 pages, a simple start is to give each one the same value:

PR(A) = PR(B) = PR(C) = PR(D) = 0.25


Step 2: For each page, look at incoming links. Find which pages point to that page.


Step 3: Compute contribution from each incoming page
If page v links to u, then its contribution is:

PR(v) / L(v)


Step 4: Add all contributions. That gives the new PageRank of page u.

  • PR(u) = PageRank of page u

  • B_u = set of pages that link into page u

  • PR(v) = PageRank of page v

  • L(v) = number of outgoing links from page v

  • N = number of pages

  • d = damping factor

To calculate the PageRank of page u, add up the contribution from every page v that links to u. Each linking page gives: PR(v) / L(v)

PR(u) = sum over v in B_u of PR(v) / L(v)

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

The slides note a damping factor often taken as:

d=0.85

  • with probability d, the user keeps following links

  • with probability 1 - d, the user jumps to a random page

  • experimentally, d = 0.85

PR(u) = (1 - d) + d * Σ_{v ∈ B_u} PR(v) / L(v) OR

PR(u) = (1 - d)/N + d * Σ_{v ∈ B_u} PR(v) / L(v)


Step 5: Repeat

Keep recalculating all page scores until the changes become very small.

Repeat until the change is < some ε

That means stop when the change is less than a tiny threshold ε.
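The five steps above can be sketched in Python, using the (1 - d)/N variant of the formula (the graph and function names are illustrative):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             eps: float = 1e-8, max_iter: int = 100) -> dict[str, float]:
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}  # Step 1: equal initial values
    for _ in range(max_iter):
        new = {}
        for u in pages:
            incoming = [v for v in pages if u in links[v]]          # Step 2: B_u
            contrib = sum(pr[v] / len(links[v]) for v in incoming)  # Steps 3-4
            new[u] = (1 - d) / n + d * contrib
        if max(abs(new[p] - pr[p]) for p in pages) < eps:  # Step 5: change < eps
            return new
        pr = new
    return pr

# Assumes every page has at least one outgoing link (dangling nodes need
# extra handling in a full implementation).
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

With the (1 - d)/N form and no dangling nodes, the scores stay a probability distribution (they sum to 1).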

24
New cards

TF-IDF

Term frequency (TF) is the idea that the more often a key term appears in a document, the more relevant that document may be to that term.

TF(term) = number of occurrences of the term in the document

Inverse document frequency (IDF) measures how informative or rare a term is across the whole collection.

  • if a term appears in many documents, it is less useful for distinguishing relevance,

  • if a term appears in few documents, it is more informative.

IDF(term) = log10(N / (DF(term) + 1)) (+1 not always needed)

where:

  • N = total number of documents,

  • DF(term) = number of documents containing the term.

The +1 smooths the count to avoid division-by-zero problems, and the logarithm tones down very large raw differences

TF-IDF combines local importance and global rarity:

  • TF says how important a term is within a document,

  • IDF says how distinctive that term is across the collection.

So:

TF-IDF(term) = TF(term) × IDF(term)

This means a term gets a high score when it occurs often in a document and is not common across all documents. TF-IDF scores are then used to build document vectors for search.
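These formulas can be sketched in Python with the smoothed IDF from above (the toy documents are illustrative):

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf(term: str, doc: list[str]) -> int:
    return doc.count(term)  # occurrences of the term in the document

def idf(term: str, docs: list[list[str]]) -> float:
    df = sum(term in d for d in docs)        # documents containing the term
    return math.log10(len(docs) / (df + 1))  # smoothed IDF

def tf_idf(term: str, doc: list[str], docs: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, docs)

# "the" occurs in every document, so its smoothed IDF falls below that of
# rarer terms such as "dog".
```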

25
New cards

Term weighting vs raw counts

Raw counts are helpful, but not all terms are equally informative. IR uses term weighting so that the algorithm focuses more on salient, useful words and less on words that are common everywhere.

Weighting improves efficiency and relevance.

  • a term that appears many times in one document may matter,

  • but if it appears in almost every document in the collection, it is not very discriminative.

That is why term weighting moves beyond simple counts toward measures like TF, IDF, and TF-IDF.

26
New cards

TF Matrix

A TF-IDF matrix is a table that shows the TF-IDF weight of each term in each document.

Usually:

  • rows = documents

  • columns = terms

  • cell value = TF-IDF score of that term in that document

So it is basically a way of turning text into numbers so documents can be compared mathematically.

the TF-IDF matrix is part of the vector space model.
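A sketch of building such a matrix in Python (toy documents; columns follow a sorted vocabulary):

```python
import math

docs = [["cat", "sat", "cat"], ["dog", "sat"], ["dog", "ran"]]
vocab = sorted({t for d in docs for t in d})  # column order: one term per column

def idf(term: str) -> float:
    df = sum(term in d for d in docs)
    return math.log10(len(docs) / (df + 1))  # smoothed IDF as in the notes

# rows = documents, columns = terms, cell = TF * IDF
matrix = [[d.count(t) * idf(t) for t in vocab] for d in docs]
# Each row is now a document vector in the shared vector space.
```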

27
New cards

Precision and mean precision

Precision@k measures the proportion of the top k returned documents that are relevant:

P@k = (# relevant documents in top k) / k

It focuses on the top of the ranking, which is what users usually care about most.

Mean P@k = (1 / Q) * sum of (P_i@k) where Q is the total number of queries

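A sketch in Python, representing each ranked result list as booleans (True = relevant):

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """P@k = (# relevant documents in top k) / k."""
    return sum(relevance[:k]) / k

def mean_precision_at_k(queries: list[list[bool]], k: int) -> float:
    """Average P@k over Q queries."""
    return sum(precision_at_k(q, k) for q in queries) / len(queries)

q1 = [True, False, True, False]   # 2 relevant in top 4
q2 = [False, True, True, True]    # 3 relevant in top 4
print(mean_precision_at_k([q1, q2], k=4))  # (0.5 + 0.75) / 2 = 0.625
```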

28
New cards

MRR (Mean reciprocal rank)

Reciprocal rank is based on the position of the first relevant result for one query. MRR is the average reciprocal rank across multiple queries. MRR measures how early the first relevant result appears, which reflects how quickly a user finds a useful answer.

RR = 1 / rank of first relevant result

MRR = (1 / Q) * sum of (1 / rank_i) where Q is the total number of queries and rank_i is the position of the first relevant document for query i
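The same boolean representation gives MRR in a few lines:

```python
def reciprocal_rank(relevance: list[bool]) -> float:
    """1 / rank of the first relevant result (rank is 1-based)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0  # no relevant result retrieved

def mrr(queries: list[list[bool]]) -> float:
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

# First relevant at rank 1 and at rank 2: MRR = (1 + 0.5) / 2 = 0.75
print(mrr([[True, False], [False, True]]))
```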

29
New cards

What is a web crawler

A web crawler is a program that recursively downloads web pages to build a searchable index. It is one of the foundations of large-scale web search engines.

30
New cards

What is the difference between directory-style access and search

A directory organises content by a human-defined hierarchy and users browse through categories. A search system lets users issue queries and retrieves documents algorithmically. As the web grew, search became necessary because manual directory-style organisation no longer scaled well.

31
New cards

What is the difference between a forward index and an inverted index? (IR)

A forward index and inverted index are about how a search engine stores and looks up text efficiently.

A forward index stores which words occur in each document.
An inverted index stores, for each word, which documents contain it.
Search engines rely heavily on inverted indexing because it makes query lookup much faster.
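Both structures can be sketched in Python; inverting a forward index is a single pass over the data (the document IDs are illustrative):

```python
from collections import defaultdict

# Forward index: document -> words it contains.
forward = {
    "d1": ["cat", "sat"],
    "d2": ["dog", "sat"],
}

# Inverted index: word -> documents that contain it.
inverted = defaultdict(set)
for doc_id, words in forward.items():
    for w in words:
        inverted[w].add(doc_id)

# Query lookup is now a single dictionary access instead of a scan of all docs:
print(sorted(inverted["sat"]))  # ['d1', 'd2']
```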