Comprehensive vocabulary flashcards covering basic concepts of Computational Linguistics, Natural Language Processing, statistical methods, machine learning architectures, and specific NLP tasks.
Computational Linguistics
The research field interested in answering linguistic questions using computational methodology.
Natural Language Processing (NLP)
Research focused on the automatic processing of human language for practical applications.
Sign Language Processing
A subfield of both NLP and Computer Vision (CV) involving the automatic analysis of sign language content, such as translation into spoken text.
Pragmatics
The branch of linguistic knowledge concerning the use of appropriate sentences for various communicative purposes.
Language Model
A computational representation of linguistic knowledge based on statistical observation of corpora, assigning probabilities between 0 and 1 to linguistic units such as word sequences.
Corpus (pl. Corpora)
A set of natural linguistic data available in digital format, selected and organized to satisfy specific criteria for linguistic analysis.
Grammatical Ambiguity
A challenge in CL where a word can function as different parts of speech, such as 'do' being a verb or a noun.
Syntactic Ambiguity
A challenge in CL illustrated by sentences like 'Sherlock saw a man with a magnifying glass,' where the structural relationship is unclear.
Multi-word expression
Expressions like idioms, phrasal verbs, or metaphorical expressions where the meaning does not correspond to the literal combination of the component words.
Introspective Data
Qualitative, informal, and small-scale judgements produced by a researcher reflecting on their own linguistic knowledge.
Representativeness
The extent to which a corpus permits accurate generalizations about a target domain, involving diversity of text types and typical distribution of features.
Brown University Standard Corpus of Present-Day American English
The first of the major first-generation corpora: a generalist, synchronic, monolingual collection of 1 million words of American English texts published in 1961.
Token
The minimal unit of text used for analysis, including words, punctuation, numbers, and acronyms.
Type-Token Ratio (TTR)
A measure of lexical variety calculated by dividing the number of types by the number of tokens.
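As a minimal sketch of the calculation (function name illustrative):

```python
def type_token_ratio(tokens):
    """Lexical variety: number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

# "the" occurs twice, so 5 types over 6 tokens:
print(type_token_ratio("the cat sat on the mat".split()))  # ≈ 0.83
```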
Zipf's Law
The observation that the frequency of any word is inversely proportional to its rank in a frequency table, resulting in few high-frequency words and many low-frequency words.
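The rank-frequency table behind this observation can be built directly from token counts (function name illustrative):

```python
from collections import Counter

def rank_frequency(tokens):
    """Frequency table sorted by rank, with the most frequent word at rank 1."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), 1)]

print(rank_frequency("a a a b b c".split()))
# [(1, 'a', 3), (2, 'b', 2), (3, 'c', 1)]
```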
Hapax legomena
Words that occur only once within a specific corpus.
Nominal Variable
Data categories with no inherent order or hierarchy, such as part of speech, text genres, or author gender.
Ordinal Variable
Categories with a specific order but where the difference between values cannot be measured arithmetically, such as sentence formality or acceptability levels.
Mode
The most common value in a set of values, useful for identifying trends in categorical data.
Standard Deviation
The square root of variance, used to indicate how much values vary or spread around the mean.
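Both the mode and the standard deviation defined above are available in Python's standard `statistics` module:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mode(values))    # 4  (most common value)
print(statistics.pstdev(values))  # 2.0 (population standard deviation; mean is 5)
```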
Collocates
Words that habitually co-occur with a specific search word (the node) within a defined span or window.
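A simple sketch of collocate extraction around a node word within a fixed window (function name illustrative):

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

print(collocates("the cat sat on the mat".split(), "the", span=1))
```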
TF-IDF (Term Frequency-Inverse Document Frequency)
A metric that calculates the relevance of a word by multiplying its frequency in a specific document by its rarity across the entire document set.
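A minimal sketch of the calculation, assuming documents are token lists and the term occurs in at least one document (function name illustrative):

```python
import math

def tf_idf(term, doc, docs):
    """Raw term frequency in `doc` times log inverse document frequency over `docs`."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tf_idf("the", docs[0], docs))  # 0.0: a word in every document is not distinctive
print(tf_idf("cat", docs[0], docs))  # log(3/2) ≈ 0.405
```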
Jaccard Index
A degree of overlap between texts calculated by dividing the number of common words by the total number of words in both texts.
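Treating each text as a set of word types, the index is the intersection over the union (function name illustrative):

```python
def jaccard(text_a, text_b):
    """Overlap between two token lists: shared types / all types in either text."""
    a, b = set(text_a), set(text_b)
    return len(a & b) / len(a | b)

print(jaccard("a b c".split(), "b c d".split()))  # 2 shared / 4 total = 0.5
```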
Levenshtein Distance
The number of operations (deletion, insertion, substitution) required to transform one string into another.
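A sketch of the standard dynamic-programming solution, keeping only one previous row of the edit table (function name illustrative):

```python
def levenshtein(s, t):
    """Minimum number of deletions, insertions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```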
Gold-standard annotations
Linguistic labels produced by trained human annotators using their intuition and a specific set of guidelines.
Cohen's Kappa (κ)
A chance-corrected measure of Inter-Annotator Agreement (IAA) that accounts for the probability of annotators agreeing by accident.
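For two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each annotator's label distribution. A minimal sketch (function name illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' parallel label lists."""
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# 3/4 observed agreement, 1/2 expected by chance -> kappa = 0.5
print(cohens_kappa(["yes", "yes", "no", "no"], ["yes", "no", "no", "no"]))
```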
Word Embeddings
Numerical vector representations of words in a multidimensional space where semantically similar words are positioned closer together.
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from −1 to 1.
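The cosine is the dot product of the vectors divided by the product of their norms (function name illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimensionality."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # 1.0: same direction
```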
Word2Vec
A method using neural networks to learn word embeddings, featuring two training architectures: CBOW (predicting a word from its context) and Skip-Gram (predicting the context from a word).
FastText
A modified version of Word2Vec that incorporates character n-grams to better represent rare words and spelling errors.
Supervised Learning
A type of machine learning where the model is trained using annotated data containing both inputs and correct labels.
Overfitting
A phenomenon where a model learns training data too well but fails to generalize to new, unseen data.
Cross-validation
A technique to assess the model by splitting data into k folds, repeatedly using each fold for evaluation and the others for training.
K-means
An unsupervised technique that partitions data into clusters by minimizing the distance between data points and a cluster center (centroid).
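A minimal one-dimensional sketch of the assign-then-recompute loop (function name and parameters illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """1-D k-means: assign each point to its nearest centroid, then move each
    centroid to the mean of its cluster, repeating for a fixed number of rounds."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans([1, 2, 3, 10, 11, 12], 2))  # [2.0, 11.0]
```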
Precision
The ratio between elements correctly predicted by a system and the total number of predicted elements.
Recall
The ratio between elements correctly predicted by a system and the actual total of correct elements in the data.
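Treating predictions and gold labels as sets, the two ratios share a numerator (true positives) but differ in the denominator (function name illustrative):

```python
def precision_recall(predicted, actual):
    """Precision: TP / all predicted. Recall: TP / all actually correct."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # true positives
    return tp / len(predicted), tp / len(actual)

p, r = precision_recall({"a", "b", "c"}, {"b", "c", "d"})
print(p, r)  # both 2/3: 2 of 3 predictions correct, 2 of 3 gold items found
```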
Perceptron
The simplest type of artificial neural network model, consisting of only one neuron.
Transformer
A neural network architecture designed for sequential data that uses self-attention to capture context by considering all elements of a sequence simultaneously.
Encoder-Decoder
A Transformer architecture where the first part processes input into contextual representations and the second part generates output tokens.
BERT (Bidirectional Encoder Representations from Transformers)
A 2018 transformer-based model optimized for understanding and classification using only the encoder part and Masked Language Modeling.
RAG (Retrieval-Augmented Generation)
An LLM architecture that retrieves relevant documents from an external knowledge base and adds them to the prompt before generation.
Zero-shot Prompting
A prompt that provides no examples or demonstrations, asking the model to perform a task based solely on the instruction.
Few-shot Prompting
A paradigm that allows language models to learn tasks from a few demonstration examples provided within the prompt.
Chain-of-Thought (CoT)
A prompting method where the model is asked to explain its reasoning step-by-step to improve performance on complex tasks.
Hallucination
A limitation of LLMs where the output contains fake, illogical, or made-up information presented as fact.
Stemming
A pre-processing task that truncates words by removing affixes to reduce morphological variability.
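A toy illustration of affix-stripping, not a real stemmer such as Porter's algorithm (function name and suffix list are illustrative):

```python
def naive_stem(word, suffixes=("ing", "ed", "ly", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print(naive_stem("walking"))  # walk
print(naive_stem("cats"))     # cat
print(naive_stem("the"))      # the (no matching suffix)
```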
ALPAC Report (1966)
An influential report that claimed Machine Translation had no immediate utility, was too slow, and cost twice as much as human translation.
BLEU (BiLingual Evaluation Understudy)
An automatic MT metric that compares n-grams of the machine output to human reference translations.
Named Entity Recognition (NER)
A classification task aimed at identifying and categorizing mentions of specific entities, such as proper names, in a text.
Stance Detection
The task of identifying a person's standpoint (favor, against, or neither) toward a specific proposition or idea.
Sentiment Shifters
Expressions, such as negations, that change the prior polarity of a word (e.g., 'not good').
Universal Dependencies (UD)
An international initiative for creating syntactically annotated treebanks for many languages based on a universal core of tags and relations.
UDPipe
A trainable pipeline used for tasks including sentence splitting, tokenization, POS tagging, lemmatization, and dependency parsing.