Comprehensive vocabulary flashcards covering basic concepts of Computational Linguistics, Natural Language Processing, statistical methods, machine learning architectures, and specific NLP tasks.
Computational Linguistics
The research field interested in answering linguistic questions using computational methodology.
Natural Language Processing (NLP)
Research focused on the automatic processing of human language for practical applications.
Sign Language Processing
A subfield of both NLP and Computer Vision (CV) involving the automatic analysis of sign language content, such as translation into spoken text.
Pragmatics
The branch of linguistic knowledge concerning the use of appropriate sentences for various communicative purposes.
Language Model
A computational representation of linguistic knowledge based on statistical observation of corpora, assigning probabilities between 0 and 1 to linguistic units such as word sequences.
Corpus (pl. Corpora)
A set of natural linguistic data available in digital format, selected and organized to satisfy specific criteria for linguistic analysis.
Grammatical Ambiguity
A challenge in CL where a word can function as different parts of speech, such as 'do' being a verb or a noun.
Syntactic Ambiguity
A challenge in CL illustrated by sentences like 'Sherlock saw a man with a magnifying glass,' where the structural relationship is unclear.
Multi-word expression
Expressions like idioms, phrasal verbs, or metaphorical expressions where the meaning does not correspond to the literal combination of the component words.
Introspective Data
Qualitative, informal, and small-scale judgements produced by a researcher reflecting on their own linguistic knowledge.
Representativeness
The extent to which a corpus permits accurate generalizations about a target domain, involving diversity of text types and typical distribution of features.
Brown University Standard Corpus of Present-Day American English
The first of the major first-generation corpora: a generalist, synchronic, monolingual collection of 1 million words of American English texts published in 1961.
Token
The minimal unit of text used for analysis, including words, punctuation, numbers, and acronyms.
Type-Token Ratio (TTR)
A measure of lexical variety calculated by dividing the number of types by the number of tokens.
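As a minimal sketch of the calculation (function name illustrative):

```python
def type_token_ratio(tokens):
    """Lexical variety: number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

# "the" occurs twice, so 5 types over 6 tokens:
print(type_token_ratio("the cat sat on the mat".split()))  # ≈ 0.83
```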
Zipf's Law
The observation that the frequency of any word is inversely proportional to its rank in a frequency table, resulting in few high-frequency words and many low-frequency words.
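The rank-frequency table behind this observation can be built directly from token counts (function name illustrative):

```python
from collections import Counter

def rank_frequency(tokens):
    """Frequency table sorted by rank, with the most frequent word at rank 1."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), 1)]

print(rank_frequency("a a a b b c".split()))
# [(1, 'a', 3), (2, 'b', 2), (3, 'c', 1)]
```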
Hapax legomena
Words that occur only once within a specific corpus.
Nominal Variable
Data categories with no inherent order or hierarchy, such as part of speech, text genres, or author gender.
Ordinal Variable
Categories with a specific order but where the difference between values cannot be measured arithmetically, such as sentence formality or acceptability levels.
Mode
The most common value in a set of values, useful for identifying trends in categorical data.
Standard Deviation
The square root of variance, used to indicate how much values vary or spread around the mean.
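Both the mode and the standard deviation defined above are available in Python's standard `statistics` module:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mode(values))    # 4  (most common value)
print(statistics.pstdev(values))  # 2.0 (population standard deviation; mean is 5)
```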
Collocates
Words that habitually co-occur with a specific search word (the node) within a defined span or window.
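A simple sketch of collocate extraction around a node word within a fixed window (function name illustrative):

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

print(collocates("the cat sat on the mat".split(), "the", span=1))
```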
TF-IDF (Term Frequency-Inverse Document Frequency)
A metric that calculates the relevance of a word by multiplying its frequency in a specific document by its rarity across the entire document set.
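A minimal sketch of the calculation, assuming documents are token lists and the term occurs in at least one document (function name illustrative):

```python
import math

def tf_idf(term, doc, docs):
    """Raw term frequency in `doc` times log inverse document frequency over `docs`."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tf_idf("the", docs[0], docs))  # 0.0: a word in every document is not distinctive
print(tf_idf("cat", docs[0], docs))  # log(3/2) ≈ 0.405
```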
Jaccard Index
A degree of overlap between texts calculated by dividing the number of common words by the total number of words in both texts.
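Treating each text as a set of word types, the index is the intersection over the union (function name illustrative):

```python
def jaccard(text_a, text_b):
    """Overlap between two token lists: shared types / all types in either text."""
    a, b = set(text_a), set(text_b)
    return len(a & b) / len(a | b)

print(jaccard("a b c".split(), "b c d".split()))  # 2 shared / 4 total = 0.5
```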
Levenshtein Distance
The number of operations (deletion, insertion, substitution) required to transform one string into another.
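A sketch of the standard dynamic-programming solution, keeping only one previous row of the edit table (function name illustrative):

```python
def levenshtein(s, t):
    """Minimum number of deletions, insertions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```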
Gold-standard annotations
Linguistic labels produced by trained human annotators using their intuition and a specific set of guidelines.
Cohen's Kappa (κ)
A chance-corrected measure of Inter-Annotator Agreement (IAA) that accounts for the probability of annotators agreeing by accident.
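For two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each annotator's label distribution. A minimal sketch (function name illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' parallel label lists."""
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# 3/4 observed agreement, 1/2 expected by chance -> kappa = 0.5
print(cohens_kappa(["yes", "yes", "no", "no"], ["yes", "no", "no", "no"]))
```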
Word Embeddings
Numerical vector representations of words in a multidimensional space where semantically similar words are positioned closer together.
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from −1 to 1.
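The cosine is the dot product of the vectors divided by the product of their norms (function name illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimensionality."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # 1.0: same direction
```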
Word2Vec
A method using neural networks to learn word embeddings, featuring two training architectures: CBOW (predicting a word from its context) and Skip-Gram (predicting the context from a word).
FastText
A modified version of Word2Vec that incorporates character n-grams to better represent rare words and spelling errors.
Supervised Learning
A type of machine learning where the model is trained using annotated data containing both inputs and correct labels.
Overfitting
A phenomenon where a model learns training data too well but fails to generalize to new, unseen data.
Cross-validation
A technique to assess the model by splitting data into k folds, repeatedly using each fold for evaluation and the others for training.
K-means
An unsupervised technique that partitions data into clusters by minimizing the distance between data points and a cluster center (centroid).
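A minimal one-dimensional sketch of the assign-then-recompute loop (function name and parameters illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """1-D k-means: assign each point to its nearest centroid, then move each
    centroid to the mean of its cluster, repeating for a fixed number of rounds."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans([1, 2, 3, 10, 11, 12], 2))  # [2.0, 11.0]
```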
Precision
The ratio between elements correctly predicted by a system and the total number of predicted elements.
Recall
The ratio between elements correctly predicted by a system and the actual total of correct elements in the data.
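Treating predictions and gold labels as sets, the two ratios share a numerator (true positives) but differ in the denominator (function name illustrative):

```python
def precision_recall(predicted, actual):
    """Precision: TP / all predicted. Recall: TP / all actually correct."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # true positives
    return tp / len(predicted), tp / len(actual)

p, r = precision_recall({"a", "b", "c"}, {"b", "c", "d"})
print(p, r)  # both 2/3: 2 of 3 predictions correct, 2 of 3 gold items found
```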
Perceptron
The simplest type of artificial neural network model, consisting of only one neuron.
Transformer
A neural network architecture designed for sequential data that uses self-attention to capture context by considering all elements of a sequence simultaneously.
Encoder-Decoder
A Transformer architecture where the first part processes input into contextual representations and the second part generates output tokens.
BERT (Bidirectional Encoder Representations from Transformers)
A 2018 transformer-based model optimized for understanding and classification using only the encoder part and Masked Language Modeling.
RAG (Retrieval-Augmented Generation)
An LLM architecture that retrieves relevant documents from an external knowledge base and adds them to the prompt before generation.
Zero-shot Prompting
A prompt that provides no examples or demonstrations, asking the model to perform a task based solely on the instruction.
Few-shot Prompting
A paradigm that allows language models to learn tasks from a few demonstration examples provided within the prompt.
Chain-of-Thought (CoT)
A prompting method where the model is asked to explain its reasoning step-by-step to improve performance on complex tasks.
Hallucination
A limitation of LLMs where the output contains fake, illogical, or made-up information presented as fact.
Stemming
A pre-processing task that truncates words by removing affixes to reduce morphological variability.
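A toy illustration of affix-stripping, not a real stemmer such as Porter's algorithm (function name and suffix list are illustrative):

```python
def naive_stem(word, suffixes=("ing", "ed", "ly", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print(naive_stem("walking"))  # walk
print(naive_stem("cats"))     # cat
print(naive_stem("the"))      # the (no matching suffix)
```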
ALPAC Report (1966)
An influential report that claimed Machine Translation had no immediate utility, was too slow, and cost twice as much as human translation.
BLEU (BiLingual Evaluation Understudy)
An automatic MT metric that compares n-grams of the machine output to human reference translations.
Named Entity Recognition (NER)
A classification task aimed at identifying and categorizing mentions of specific entities, such as proper names, in a text.
Stance Detection
The task of identifying a person's standpoint (favor, against, or neither) toward a specific proposition or idea.
Sentiment Shifters
Expressions, such as negations, that change the prior polarity of a word (e.g., 'not good').
Universal Dependencies (UD)
An international initiative for creating syntactically annotated treebanks for many languages based on a universal core of tags and relations.
UDPipe
A trainable pipeline used for tasks including sentence splitting, tokenization, POS tagging, lemmatization, and dependency parsing.