Methods Lexical Semantics I & II


46 Terms

1

Lexical Semantics

The study of...

what individual lexical items mean,

how we can represent their meaning,

and how to combine the meaning of individual items to obtain an interpretation for a phrase/utterance

2

Lexical Semantics in Computational Linguistics

Recognize word senses in text (manually and automatically)

Define similarities between words

Determine how strongly a verb “goes with” its subject (selectional preferences)

Recognize and interpret figurative uses of words

Describe relations between words (or better, between word senses)

3

Semantic Ontologies

structured dictionaries that define word senses and their relations to other word senses

4

WordNet

a large lexical resource that groups words into synsets (sets of synonyms) and organizes these according to their semantic relations
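
A minimal sketch of looking up WordNet senses and relations via NLTK, one common interface (the card itself does not prescribe a toolkit, and the synset index picked below is an assumption that may vary across WordNet versions):

from nltk.corpus import wordnet as wn    # requires nltk.download('wordnet')

for synset in wn.synsets('bank'):        # all senses (synsets) of "bank"
    print(synset.name(), '-', synset.definition())

sense = wn.synset('bank.n.02')           # one of the "bank" synsets (assumed index)
print(sense.lemma_names())               # synonyms grouped in the synset
print(sense.hypernyms())                 # semantic relation: more general synsets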

5

Limitations of Relational Models

Relational models such as WordNet are glorified thesauri

Require many years of development and depend on skilled lexicographers

Inconsistencies throughout the resource

An ontology is only as good as its ontologist(s) – it is not only about the data

6

Distributional Semantic Model (DSM)

A model that encodes meaning from word co-occurrence patterns.
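
A toy count-based DSM sketch (the corpus and window size are illustrative assumptions): build a word-by-word co-occurrence matrix and compare two words by cosine similarity.

from collections import defaultdict
import math

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "bites"]]
window = 2
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[w][sent[j]] += 1            # co-occurrence within the window

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm = lambda vec: math.sqrt(sum(c * c for c in vec.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

print(cosine(counts["dog"], counts["cat"]))        # similar distributions -> higher score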

7

Effect of preprocessing

Linguistic annotation changes the nearest neighbors in a distributional model.

8

Semantic Similarity

two words sharing a high number of salient features (attributes) → paradigmatic relatedness

9

Semantic Relatedness

two words semantically associated without being necessarily similar → syntagmatic relatedness

10

Feature scaling

Adjusting feature values (e.g., logarithmic scaling, relevance weighting, statistical association measures) before similarity computation.

11

Simple association measures

Pointwise Mutual Information, t-score, Log-Likelihood, Odds Ratio
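
A worked sketch of one of these measures, pointwise mutual information, PMI(w, c) = log2(P(w, c) / (P(w) P(c))), with made-up illustrative counts:

import math

N = 10000         # total number of (word, context) observations
count_wc = 30     # co-occurrences of word w with context feature c
count_w = 500     # occurrences of w
count_c = 200     # occurrences of c

pmi = math.log2((count_wc / N) / ((count_w / N) * (count_c / N)))
print(round(pmi, 2))   # positive: w and c co-occur more often than chance predicts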

12

Dimensionality reduction

Identify the latent dimensions and project the data onto these new dimensions
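
A minimal sketch using truncated SVD, one standard way to find such latent dimensions (the matrix and target dimensionality are toy assumptions):

import numpy as np

X = np.random.rand(100, 5000)      # toy word-by-context count matrix
k = 50                             # number of latent dimensions to keep
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * S[:k]       # each word projected onto k latent dimensions
print(X_reduced.shape)             # (100, 50)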

13

How are the word embeddings created?

give words from a vocabulary as input to a (feed-forward) neural network

embed them as vectors in a lower-dimensional space of a fixed size

fine-tune through back-propagation

14

What is the objective of creating the word embeddings?

create word representations that are good at predicting the surrounding context

15

Distributional Representation

captures the linguistic distribution of each word in the form of a high-dimensional numeric vector

typically based on co-occurrence counts (aka “count” models)

based on distributional hypothesis: similar distribution ≃ similar meaning (similar distribution = similar representation)

16

Distributed Representation

sub-symbolic, compact representation of words as dense numeric vector

meaning is captured across different dimensions, and the representation is used to predict words (aka “predict” models)

similarity of vectors corresponds to similarity of the words

aka word embeddings

17

Methods to train word embeddings

word2vec, FastText, GloVe, ELMo, BERT, Flair

18

FastText

a method similar to word2vec, but words are represented by their character n-grams (subwords) rather than treated as atomic units

19

GloVe

first builds a co-occurrence matrix, then works with ratios of co-occurrence probabilities; trained with a log-bilinear regression model

20

ELMo, BERT, Flair

Contextualized word embeddings

21

word2vec

takes words from a very large corpus of text as input (unsupervised)

learns a vector representation for each word by predicting between every word and its context

fully connected feed-forward neural network with one hidden layer

Two main algorithms:

Continuous Bag of Words (CBOW)

Skip-gram
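
A sketch of training word2vec with Gensim, one common implementation (the card only describes the model; the corpus and hyperparameters below are toy assumptions):

from gensim.models import Word2Vec

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"]]               # toy corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)    # sg=1: Skip-gram, sg=0: CBOW
print(model.wv["dog"][:5])            # first dimensions of the learned vector
print(model.wv.most_similar("dog"))   # nearest neighbours in the embedding space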

22

Continuous Bag of Words (CBOW)

predicts the center word from the given context (the sum of the surrounding word vectors); uses continuous representations whose order is of no importance; can be seen as a precognitive language model, with an objective function similar to that of a language model

23

Skip-gram

predicts the context words taking the center word as input; the objective function sums the log probabilities of the n words to the left and to the right of the target word w_t
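
In symbols, the usual formulation of this objective (averaged over a corpus of T tokens, window size n) is

J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\; j \neq 0} \log p(w_{t+j} \mid w_t)

where p(w_{t+j} | w_t) is typically a softmax over the vocabulary computed from the word vectors.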

24

Embedding models consider…

the history (previous words) and the future (following words) of a center word. The number of words considered is called “the window size”

25

Word embeddings have … structure

Word embeddings have a linear structure that enables analogies via vector arithmetic
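
A sketch of such vector arithmetic with Gensim's pretrained vectors (the model name is one of Gensim's downloadable sets and is an assumption; it is fetched on first use):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")     # small pretrained word vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# with good embeddings, "queen" typically ranks first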

26

Variations on word sense analysis

Word Sense Induction: we don’t know what (or even how many) senses the words have

Word Sense Disambiguation (WSD): we have a sense inventory for each word

Entity Linking: like WSD only with entities and (usually) an extra “OTHER” option (because probably not all referents of an entity are known)

27

Working Assumptions

coherence

one sense per collocation

one sense per discourse

28

Word sense disambiguation

select a sense for a word from a set of predefined possibilities (sense inventory usually comes from a dictionary or thesaurus) - supervised

29

Word sense induction

split the usages of a word into different meanings - unsupervised

30

WSD / WSI target sets

lexical sample

gather all contexts corresponding to occurrences of a target word

partition these contexts into regions of high density

assign a sense to each region

all words

make a graph consisting of all senses of all words to be disambiguated

choose the best combination of senses

31

Approaches to WSD

Knowledge-Based Disambiguation (use external resources and discourse properties)

Supervised Disambiguation (uses labeled data)

Unsupervised Disambiguation (one approach for all targets)

32

Describing the context: features

information about the target word’s senses, e. g., definitions, related concepts, unambiguous contexts, ...

information about the words around the target word

frequently cooccurring words

words that cooccur only with particular senses

selectional preferences (e. g., drink (with the “ingest” sense) takes liquids as objects)

words, root forms/lemmas, POS, frequency, ...

33

WSD with definitions

Identify the correct senses using definition overlap
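
A minimal sketch of this idea (simplified Lesk-style overlap), using NLTK's WordNet glosses as the sense inventory; tokenization and scoring are deliberately naive:

from nltk.corpus import wordnet as wn

def overlap_disambiguate(word, context_words):
    context = set(w.lower() for w in context_words)
    best, best_score = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        score = len(gloss & context)              # overlap between gloss and context
        if score > best_score:
            best, best_score = sense, score
    return best

sense = overlap_disambiguate("bank", ["I", "deposited", "money", "into", "my", "account"])
print(sense, "-", sense.definition() if sense else None)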

34

How to find the optimal sense combination for WSD with definitions?

Either find the correct senses one at a time, or use simulated annealing: define a function f over combinations of word senses in a given text and search for the combination of senses that leads to the highest definition overlap (redundancy)

35

WSD with a similarity graph

1. For each open-class word gather all word senses

2. Compute pairwise sense similarities with one of the similarity metrics (e. g., if we use WordNet senses, use graph-based similarity on WordNet)

3. Find the “best” combination of senses
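
A sketch of step 2 above with WordNet path similarity, one of several graph-based metrics (the word pair is illustrative):

from nltk.corpus import wordnet as wn

senses_a = wn.synsets("bank")[:3]     # some senses of one open-class word
senses_b = wn.synsets("money")[:2]    # some senses of a neighbouring word
for a in senses_a:
    for b in senses_b:
        print(a.name(), b.name(), a.path_similarity(b))   # pairwise sense similarity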

36

Unsupervised WSD goal

assign a word sense from an inventory but without training data

37

Unsupervised WSI goal

cluster/group the contexts of ambiguous words, discriminate between these groups without actually labeling them

38

WSI clustering types

hierarchical clustering of contexts

clustering by committee

k-means clustering
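
A sketch of the k-means option from this card: cluster the contexts of one ambiguous target word into presumed senses (the context vectors and the number of clusters are toy assumptions):

import numpy as np
from sklearn.cluster import KMeans

contexts = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])   # toy context vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(contexts)
print(labels)   # e.g. [0 0 1 1]: two induced "senses"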

39

hierarchical clustering of contexts

start with one word per cluster, and iteratively merge the clusters

single-link/complete-link/average-link clustering

hierarchical density-based clustering

40

clustering by committee

find the top-k most similar words for each word

construct committees as collections of tight clusters using the top-k similar words

form as many committees as possible on the condition that each newly formed committee is not very similar to any existing committee

assign each word to its most similar committee

41

LDA (Latent Dirichlet Allocation) a.k.a. Topic Modeling

discovers underlying themes (topics) in a collection of documents by assigning each word a probability of belonging to different topics
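
A sketch with Gensim's LDA implementation, one common choice (the toy "documents" and topic count are assumptions):

from gensim import corpora
from gensim.models import LdaModel

texts = [["bank", "money", "loan"], ["river", "bank", "water"], ["loan", "interest", "money"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)   # each topic is a probability distribution over words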

42

Word sense induction by graph clustering

For a target word w, we build a collocation graph that connects the words in w’s context. Every edge in the graph represents the similarity between the connected nodes

43

mini-cut in graph clustering

find the partition of a graph by cutting the smallest number of edges or the edges with a minimum weighted sum

44

Chinese whispers in graph clustering

1. assign each node its own class

2. at each iteration, a node is reassigned to the strongest class in its local neighborhood (the most connected one); in case of ties, choose a class randomly
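
A sketch of these two steps on a small weighted graph (the adjacency dictionary and number of iterations are toy assumptions):

import random

graph = {                                        # node -> {neighbour: edge weight}
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0},
    "e": {"d": 1.0},
}
labels = {node: node for node in graph}          # step 1: each node gets its own class
for _ in range(10):                              # step 2: iterate local reassignment
    for node in random.sample(list(graph), len(graph)):
        scores = {}
        for neigh, w in graph[node].items():
            scores[labels[neigh]] = scores.get(labels[neigh], 0.0) + w
        best = max(scores.values())
        labels[node] = random.choice([c for c, s in scores.items() if s == best])   # ties: random
print(labels)                                    # nodes sharing a label form one cluster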

45

Evaluation

Comparison with a gold standard

Precision / cluster purity = percentage of tokens that are tagged correctly, out of all tokens targeted by the system

Recall / cluster overlap = percentage of tokens that are tagged correctly, out of all target tokens
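
An illustrative example with made-up numbers: if the system assigns senses to 80 of 100 ambiguous tokens and 60 of those assignments match the gold standard, precision = 60 / 80 = 75% and recall = 60 / 100 = 60%.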

46

Motivation for Multi-Modal Semantics

Semantics requires “grounding”

Semantics across multiple input modalities

Better semantic representations for NLP: Importance for human-like understanding and real-world applications (e. g., image captioning, video retrieval, grounded dialogue)