Methods Lexical Semantics I & II

46 Terms

1

Lexical Semantics

The study of...

▶ what individual lexical items mean,

▶ how we can represent their meaning,

▶ and how to combine the meaning of individual items to obtain an interpretation for a phrase/utterance

2

Lexical Semantics in Computational Linguistics

▶ Recognize word senses in text (manually and automatically)

▶ Define similarities between words

▶ Determine how strongly a verb “goes with” its subject (selectional preferences)

▶ Recognize and interpret figurative uses of words

▶ Describe relations between words (or better, between word senses)

3

Semantic Ontologies

structured dictionaries that define word senses and their relations to other word senses

4

WordNet

large lexical resource that organizes words and synsets according to their semantic relations

5

Limitations of Relational Models

▶ Relational models such as WordNet are glorified thesauri

▶ Require many years of development and depend on skilled lexicographers

▶ Inconsistencies throughout the resource

▶ An ontology is only as good as its ontologist(s); it is not just data

6

Distributional Semantic Model (DSM)

A model that encodes meaning from word co-occurrence patterns.
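
To make the idea concrete, here is a minimal sketch of a count-based DSM (my own toy example, not from the lecture): build a word-by-word co-occurrence matrix over a small corpus and compare words by cosine similarity. The corpus, window size, and vocabulary are illustrative assumptions.

```python
from collections import Counter
import math

# Toy corpus and window size (illustrative assumptions).
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
counts = {w: Counter() for w in vocab}

# Count co-occurrences within a symmetric window around each token.
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[w][sent[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(counts["cat"], counts["dog"]))  # words with similar contexts score higher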

7

Effect of preprocessing

Linguistic annotation changes the nearest neighbors in a distributional model.

8

Semantic Similarity

two words sharing a high number of salient features (attributes) → paradigmatic relatedness (e.g., car and automobile)

9

Semantic Relatedness

two words semantically associated without necessarily being similar → syntagmatic relatedness (e.g., car and driver)

10

Feature scaling

Adjusting feature values (e.g., logarithmic scaling, relevance weighting, statistical association measures) before similarity computation.
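
As a minimal illustration (my own toy example, not from the slides), logarithmic scaling simply dampens large raw co-occurrence counts before similarities are computed:

```python
import numpy as np

# Hypothetical raw co-occurrence counts (rows: target words, columns: context words).
counts = np.array([[120, 3, 0],
                   [ 90, 5, 1]])

# Logarithmic scaling: log(1 + count) dampens the effect of very frequent contexts.
scaled = np.log1p(counts)
print(scaled)
```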

11

Simple association measures

Pointwise Mutual Information, t-score, Log-Likelihood, Odds Ratio
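
For reference, two of these measures in their common textbook form (notation assumed here: f(w1, w2) is the observed co-occurrence frequency, N the sample size):

```latex
% PMI compares the observed co-occurrence probability with the probability
% expected under independence:
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\, P(w_2)}

% t-score compares observed (O) and expected (E) co-occurrence frequencies:
t = \frac{O - E}{\sqrt{O}}, \qquad O = f(w_1, w_2), \quad E = \frac{f(w_1)\, f(w_2)}{N}
```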

12

Dimensionality reduction

Identify the latent dimensions and project the data onto these new dimensions
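
A minimal sketch of one standard way to do this, truncated SVD on a co-occurrence matrix (the matrix values and the number of latent dimensions below are illustrative assumptions):

```python
import numpy as np

# Hypothetical word-by-context co-occurrence matrix (4 words x 5 context features).
M = np.array([
    [2., 0., 1., 0., 3.],
    [1., 0., 0., 1., 2.],
    [0., 4., 0., 3., 0.],
    [0., 3., 1., 4., 0.],
])

# Truncated SVD: keep k latent dimensions and project the words onto them.
k = 2
U, S, Vt = np.linalg.svd(M, full_matrices=False)
word_vectors = U[:, :k] * S[:k]   # low-dimensional word representations
print(word_vectors.shape)         # (4, 2)
```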

13

How are the word embeddings created?

▶ give words from a vocabulary as input to a (feed-forward) neural network

▶ embed them as vectors into a lower-dimensional space of a fixed size

▶ fine-tune through back-propagation

14

What is the objective of creating the word embeddings?

create word representations that are good at predicting the surrounding context

15

Distributional Representation

▶ captures the linguistic distribution of each word in the form of a high-dimensional numeric vector

▶ typically based on co-occurrence counts (aka “count” models)

▶ based on the distributional hypothesis: similar distribution ≃ similar meaning (similar distribution = similar representation)

16

Distributed Representation

▶ sub-symbolic, compact representation of words as dense numeric vectors

▶ meaning is distributed across the dimensions, and the representation is used to predict words (aka “predict” models)

▶ similarity of vectors corresponds to similarity of the words

▶ aka word embeddings

17

Methods to train word embeddings

word2vec, FastText, GloVe, ELMo, BERT, Flair

18

FastText

a method similar to word2vec, but each word is represented as a bag of character n-grams (rather than as a single atomic unit)

19

GloVe

first builds a global co-occurrence matrix and works with ratios of co-occurrence probabilities; trained as a log-bilinear regression model

20

ELMo, BERT, Flair

Contextualized word embeddings

21

word2vec

▶ takes words from a very large corpus of text as input (unsupervised)

▶ learns a vector representation for each word in order to predict between every word and its context

▶ fully connected feed-forward neural network with one hidden layer

Two main algorithms:

▶ Continuous Bag of Words (CBOW)

▶ Skip-gram
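
A hedged usage sketch with the gensim library (the corpus and hyperparameters below are toy values, not the lecture's):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (in practice, a very large corpus).
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "the", "dog"],
]

# sg=1 selects the skip-gram algorithm (sg=0 would be CBOW).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)         # (50,) dense word vector
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```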

22

Continuous Bag of Words (CBOW)

predicts the center word from the given context (the sum of the surrounding word vectors); uses continuous representations whose order is of no importance; can be seen as a “precognitive” language model (it also conditions on future words); objective function similar to that of a language model.
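
The CBOW objective in its usual form (my rendering; n is the one-sided window size, T the corpus length): maximize the average log probability of the center word given its context:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T}
            \log p\big(w_t \mid w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n}\big)
```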

23

Skip-gram

predicts the context given the center word as input; the objective function sums the log probabilities of the surrounding n words to the left and to the right of the target word w_t
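
The corresponding skip-gram objective, as described above (standard formulation; n is the window size):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T}
            \sum_{\substack{-n \le j \le n \\ j \ne 0}} \log p\big(w_{t+j} \mid w_t\big)
```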

24

Embedding models consider…

the history (previous words) and the future (following words) of a center word. The number of words considered is called “the window size”

25

Word embeddings have … structure

Word embeddings have a linear structure that enables analogies via vector arithmetic
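
A common illustration of this linear structure is the analogy king - man + woman ≈ queen. A hedged sketch using pretrained GloVe vectors via gensim's downloader (the model name is one of the standard pretrained sets, assumed here):

```python
import gensim.downloader as api

# Downloads a small set of pretrained vectors on first use (assumed model name).
vectors = api.load("glove-wiki-gigaword-50")

# Vector arithmetic: king - man + woman should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```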

26

Variations on word sense analysis

▶ Word Sense Induction: we don’t know what (or even how many) senses the words have

▶ Word Sense Disambiguation (WSD): we have a sense inventory for each word

▶ Entity Linking: like WSD, only with entities and (usually) an extra “OTHER” option (because probably not all referents of an entity are known)

27

Working Assumptions

▶ coherence

▶ one sense per collocation

▶ one sense per discourse

28

Word sense disambiguation

select a sense for a word from a set of predefined possibilities (sense inventory usually comes from a dictionary or thesaurus) - supervised

29

Word sense induction

split the usages of a word into different meanings - unsupervised

30

WSD / WSI target sets

lexical sample

▶ gather all contexts corresponding to occurrences of a target word

▶ partition these contexts into regions of high density

▶ assign a sense to each region

all words

▶ make a graph consisting of all senses of all words to be disambiguated

▶ choose the best combination of senses

31

Approaches to WSD

▶ Knowledge-Based Disambiguation (uses external resources and discourse properties)

▶ Supervised Disambiguation (uses labeled data)

▶ Unsupervised Disambiguation (one approach for all targets)

32

Describing the context: features

▶ information about the target word’s senses, e.g., definitions, related concepts, unambiguous contexts, ...

▶ information about the words around the target word

▶ frequently co-occurring words

▶ words that co-occur only with particular senses

▶ selectional preferences (e.g., drink (with the “ingest” sense) takes liquids as objects)

▶ words, root forms/lemmas, POS, frequency, ...

33

WSD with definitions

Identify the correct senses using definition overlap
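
This is essentially the (simplified) Lesk approach. A minimal sketch with a hypothetical two-sense inventory for "bank" (glosses and context below are toy examples, not from the course):

```python
def lesk_overlap(context_tokens, sense_glosses):
    """Pick the sense whose definition shares the most words with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical sense inventory for "bank".
senses = {
    "bank#1": "a financial institution that accepts deposits and lends money",
    "bank#2": "the sloping land beside a body of water such as a river",
}
context = "she sat on the bank of the river and watched the water".split()
print(lesk_overlap(context, senses))  # -> bank#2
```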

34

How to find the optimal sense combination for WSD with definitions?

Find the correct senses one at a time, or use simulated annealing (the function f is a combination of word senses in a given text; find the combination of senses that leads to the highest definition overlap (redundancy))

35

WSD with a similarity graph

1. For each open-class word gather all word senses

2. Compute pairwise sense similarities with one of the similarity metrics (e.g., if we use WordNet senses, use graph-based similarity on WordNet)

3. Find the “best” combination of senses

36

Unsupervised WSD goal

assign a word sense from an inventory but without training data

37

Unsupervised WSI goal

cluster/group the contexts of ambiguous words, discriminate between these groups without actually labeling them

38

WSI clustering types

▶ hierarchical clustering of contexts

▶ clustering by committee

▶ k-means clustering

39

hierarchical clustering of contexts

start with one word per cluster, and iteratively merge the clusters

▶ single-link/complete-link/average-link clustering

▶ hierarchical density-based clustering
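
A minimal sketch using scipy's agglomerative (average-link) clustering on toy context vectors (the vectors and the number of clusters are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical context vectors for occurrences of an ambiguous word.
contexts = np.array([
    [1.0, 0.9, 0.0, 0.1],   # "financial" contexts
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],   # "river" contexts
    [0.1, 0.0, 0.9, 1.0],
])

# Average-link hierarchical clustering with cosine distance.
Z = linkage(contexts, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]: two induced senses
```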

40

clustering by committee

▶ find the top-k most similar words for each word

▶ construct committees as collections of tight clusters using the top-k similar words

▶ form as many committees as possible, on the condition that each newly formed committee is not very similar to any existing committee

▶ assign each word to its most similar committee

41

LDA (Latent Dirichlet Allocation) a.k.a. Topic Modeling

discovers underlying themes (topics) in a collection of documents by assigning each word a probability of belonging to different topics
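
A hedged usage sketch with scikit-learn (toy documents; the number of topics is an illustrative choice):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy document collection (illustrative).
docs = [
    "the bank approved the loan and the mortgage",
    "the river bank was muddy after the rain",
    "interest rates at the bank rose again",
    "we walked along the river and the water was cold",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA with two topics; each topic is a distribution over words.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
```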

42

Word sense induction by graph clustering

For a target word w, we build a collocation graph that connects the words in w’s context. Every edge in the graph represents the similarity between the connected nodes

43

min-cut in graph clustering

find the partition of a graph by cutting the smallest number of edges or the edges with a minimum weighted sum
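
A hedged sketch using networkx's global minimum cut (Stoer-Wagner); the collocation graph below is a toy example with two dense sense regions joined by one weak edge:

```python
import networkx as nx

# Toy collocation graph: edge weights encode similarity between context words.
G = nx.Graph()
G.add_weighted_edges_from([
    ("loan", "mortgage", 3.0), ("loan", "interest", 2.5), ("mortgage", "interest", 2.0),
    ("river", "water", 3.0), ("river", "shore", 2.5), ("water", "shore", 2.0),
    ("interest", "river", 0.2),   # weak link between the two sense regions
])

# Stoer-Wagner finds the global minimum cut; it separates the two dense regions.
cut_value, (part1, part2) = nx.stoer_wagner(G)
print(cut_value, part1, part2)
```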

44

Chinese whispers in graph clustering

1. assign a distinct class to each node

2. at each iteration, each node is reassigned to the strongest class in its local neighborhood (the one it is most connected to)

▶ in case of ties, choose a class randomly
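
A minimal, self-contained sketch of the procedure above (the graph, weights, and iteration count are illustrative assumptions):

```python
import random

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Toy Chinese Whispers clustering on a weighted, undirected graph."""
    rng = random.Random(seed)
    label = {n: i for i, n in enumerate(nodes)}   # 1. one distinct class per node
    adj = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    for _ in range(iterations):
        order = list(nodes)
        rng.shuffle(order)
        for n in order:
            if not adj[n]:
                continue
            # 2. sum the edge weights per neighbouring class and pick the strongest.
            scores = {}
            for v, w in adj[n]:
                scores[label[v]] = scores.get(label[v], 0.0) + w
            best = max(scores.values())
            # In case of ties, choose among the strongest classes at random.
            label[n] = rng.choice([c for c, s in scores.items() if s == best])
    return label

# Two tight cliques joined by one weak edge should end up in two classes.
nodes = ["a", "b", "c", "x", "y", "z"]
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
         ("x", "y"): 1.0, ("y", "z"): 1.0, ("x", "z"): 1.0,
         ("c", "x"): 0.1}
print(chinese_whispers(nodes, edges))
```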

45

Evaluation

▶ Comparison with a gold standard

▶ Precision / cluster purity = percentage of tokens that are tagged correctly, out of all tokens targeted by the system

▶ Recall / cluster overlap = percentage of tokens that are tagged correctly, out of all words
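
Written out as formulas (directly restating the definitions above):

```latex
\text{Precision (cluster purity)} =
  \frac{\#\,\text{tokens tagged correctly}}{\#\,\text{tokens targeted by the system}}
\qquad
\text{Recall (cluster overlap)} =
  \frac{\#\,\text{tokens tagged correctly}}{\#\,\text{all words}}
```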

46

Motivation for Multi-Modal Semantics

▶ Semantics requires “grounding”

▶ Semantics across multiple input modalities

▶ Better semantic representations for NLP: importance for human-like understanding and real-world applications (e.g., image captioning, video retrieval, grounded dialogue)