Lexical Semantics
The study of...
▶ what individual lexical items mean,
▶ how we can represent their meaning,
▶ and how to combine the meanings of individual items to obtain an interpretation for a phrase/utterance
Lexical Semantics in Computational Linguistics
▶ Recognize word senses in text (manually and automatically)
▶ Define similarities between words
▶ Determine how strongly a verb “goes with” its subject (selectional preferences)
▶ Recognize and interpret figurative uses of words
▶ Describe relations between words (or better, between word senses)
Semantic Ontologies
structured dictionaries that define word senses and their relations to other word senses
WordNet
a large lexical resource that organizes words into synsets (sets of synonyms) and links them according to their semantic relations
Limitations of Relational Models
▶ Relational models such as WordNet are glorified thesauri
▶ Require many years of development and depend on skilled lexicographers
▶ Inconsistencies throughout the resource
▶ An ontology is only as good as its ontologist(s) – it is not only data
Distributional Semantic Model (DSM)
A model that encodes meaning from word co-occurrence patterns.
Effect of preprocessing
Linguistic annotation changes the nearest neighbors in a distributional model.
Semantic Similarity
two words sharing a high number of salient features (attributes) → paradigmatic relatedness
Semantic Relatedness
two words semantically associated without being necessarily similar → syntagmatic relatedness
Feature scaling
Adjusting feature values (e.g., Logarithmic scaling, Relevance weighting, Statistical association measures) before similarity computation.
Simple association measures
Pointwise Mutual Information, t-score, Log-Likelihood, Odds Ratio
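As a minimal sketch, Pointwise Mutual Information (and its positive-PMI variant) can be computed from raw co-occurrence counts as below; the words and counts are made up for illustration:

```python
import numpy as np

# Toy word-by-context co-occurrence counts (rows = target words, columns = context words).
# The vocabulary and counts are illustrative, not from a real corpus.
words = ["dog", "cat", "car"]
contexts = ["bark", "purr", "drive"]
counts = np.array([
    [10,  1,  0],   # dog
    [ 1, 12,  0],   # cat
    [ 0,  0,  8],   # car
], dtype=float)

total = counts.sum()
p_wc = counts / total                              # joint probability P(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))              # PMI(w, c) = log2 P(w,c) / (P(w) P(c))
    # Positive PMI: keep only finite, positive values; everything else becomes 0.
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

print(ppmi.round(2))
```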
Dimensionality reduction
Identify the latent dimensions and project the data onto these new dimensions
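One common way to identify latent dimensions is a truncated SVD of the (weighted) co-occurrence matrix; a minimal sketch with an illustrative matrix:

```python
import numpy as np

# Toy word-by-context count matrix; in practice this would be a large (weighted) matrix,
# e.g. the PPMI matrix from the previous sketch.
X = np.array([
    [10,  1,  0,  2],
    [ 1, 12,  0,  3],
    [ 0,  0,  8,  7],
    [ 1,  0,  9,  6],
], dtype=float)

k = 2                                    # number of latent dimensions to keep
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]          # project rows (words) onto the top-k latent dimensions

print(word_vectors.shape)                # (4, 2): each word is now a dense 2-dimensional vector
```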
How are the word embeddings created?
▶ give words from a vocabulary as input to a (feed-forward) neural network
▶ embed them as vectors into a lower dimension space of a fixed size
▶ fine-tune through back-propagation
What is the objective of creating the word embeddings?
create word representations that are good at predicting the surrounding context
Distributional Representation
▶ captures the linguistic distribution of each word in the form of a high-dimensional numeric vector
▶ typically based on co-occurrence counts (aka “count” models)
▶ based on distributional hypothesis: similar distribution ≃ similar meaning (similar distribution = similar representation)
Distributed Representation
▶ sub-symbolic, compact representation of words as dense numeric vector
▶ meaning is captured across different dimensions and used to predict words (aka “predict” models)
▶ similarity of vectors corresponds to similarity of the words
▶ aka word embeddings
Methods to train word embeddings
word2vec, FastText, GloVe, ELMo, BERT, Flair
FastText
a method similar to word2vec, but each word is represented by its character n-gram (subword) vectors, which helps with rare and out-of-vocabulary words
GloVe
first builds a global word–word co-occurrence matrix and models ratios of co-occurrence probabilities; trained with a log-bilinear regression model
ELMo, BERT, Flair
Contextualized word embeddings
word2vec
▶ takes words from a very large corpus of text as input (unsupervised)
▶ learns a vector representation for each word to predict between every word and its context words
▶ fully connected feed-forward neural network with one hidden layer
Two main algorithms:
▶ Continuous Bag of Words (CBOW)
▶ Skip-gram
Continuous Bag of Words (CBOW)
predicts the center word from the given context (the sum of the surrounding word vectors); uses continuous representations whose order is of no importance; can be seen as a precognitive language model; the objective function is similar to that of a language model.
Skip-gram
predicts the context words taking the center word as input; the objective function sums the log probabilities of the n surrounding words to the left and to the right of the target word w_t
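A hedged sketch of training both variants with the gensim library (assumes gensim 4.x; the tiny corpus is illustrative only, real models need far larger text collections):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus of pre-tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "dog", "chased", "the", "ball"],
]

# sg=0 -> CBOW (predict the center word from its context),
# sg=1 -> Skip-gram (predict the context from the center word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["dog"].shape)                     # (50,): dense vector for "dog"
print(skipgram.wv.most_similar("dog", topn=2))      # nearest neighbors in the embedding space
```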
Embedding models consider…
the history (previous words) and the future (following words) of a center word. The number of words considered is called “the window size”
Word embeddings have … structure
Word embeddings have a linear structure that enables analogies via vector arithmetic
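For illustration, with a set of pretrained vectors loaded into gensim (the file name below is a placeholder), the classic king/queen analogy can be queried via vector arithmetic:

```python
from gensim.models import KeyedVectors

# Assumes pretrained vectors in word2vec text format are available locally;
# "pretrained_vectors.txt" is a placeholder path.
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.txt", binary=False)

# king - man + woman ≈ queen: the linear structure supports analogies via vector arithmetic.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```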
Variations on word sense analysis
▶ Word Sense Induction: we don’t know what (or even how many) senses the words have
▶ Word Sense Disambiguation (WSD): we have a sense inventory for each word
▶ Entity Linking: like WSD only with entities and (usually) an extra “OTHER” option (because probably not all referents of an entity are known)
Working Assumptions
▶ coherence
▶ one sense per collocation
▶ one sense per discourse
Word sense disambiguation
select a sense for a word from a set of predefined possibilities (sense inventory usually comes from a dictionary or thesaurus) - supervised
Word sense induction
split the usages of a word into different meanings - unsupervised
WSD / WSI target sets
lexical sample
▶ gather all contexts corresponding to occurrences of a target word
▶ partition these contexts into regions of high density
▶ assign a sense to each region
all words
▶ make a graph consisting of all senses of all words to be disambiguated
▶ choose the best combination of senses
Approaches to WSD
▶ Knowledge-Based Disambiguation (use external resources and discourse properties)
▶ Supervised Disambiguation (uses labeled data)
▶ Unsupervised Disambiguation (one approach for all targets)
Describing the context: features
▶ information about the target word’s senses, e.g., definitions, related concepts, unambiguous contexts, ...
▶ information about the words around the target word
▶ frequently cooccurring words
▶ words that cooccur only with particular senses
▶ selectional preferences (e.g., drink (with the “ingest” sense) takes liquids as objects)
▶ words, root forms/lemmas, POS, frequency, ...
WSD with definitions
Identify the correct senses using definitions overlap
How to find the optimal sense combination for WSD with definitions?
Find the correct senses one at a time, or use simulated annealing: define a function f over combinations of word senses in a given text and search for the combination of senses that leads to the highest definition overlap (redundancy)
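A minimal sketch of the one-at-a-time, definition-overlap (Lesk-style) strategy using NLTK’s WordNet interface (assumes the WordNet data has been downloaded; the example sentence is illustrative):

```python
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

def simplified_lesk(word, context_words):
    """Pick the WordNet sense of `word` whose definition overlaps most with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        definition = set(sense.definition().lower().split())
        overlap = len(definition & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sentence = "I went to the bank to deposit some money".split()
print(simplified_lesk("bank", sentence))   # likely to prefer the financial-institution sense
```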
WSD with a similarity graph
1. For each open-class word gather all word senses
2. Compute pairwise sense similarities with one of the similarity metrics (e.g., if we use WordNet senses, use graph-based similarity on WordNet)
3. Find the “best” combination of senses
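A small sketch of step 2 using WordNet path similarity in NLTK; the two words are illustrative and any graph-based similarity measure could be substituted:

```python
from itertools import product
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

# Pairwise sense similarities between all noun senses of two open-class words.
senses_a = wn.synsets("bank", pos=wn.NOUN)
senses_b = wn.synsets("money", pos=wn.NOUN)

for s_a, s_b in product(senses_a, senses_b):
    sim = s_a.path_similarity(s_b)          # graph-based similarity on the WordNet hierarchy
    if sim is not None and sim > 0.1:       # only print the more related sense pairs
        print(f"{s_a.name():<20} {s_b.name():<20} {sim:.2f}")
```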
Unsupervised WSD goal
assign a word sense from an inventory but without training data
Unsupervised WSI goal
cluster/group the contexts of ambiguous words, discriminate between these groups without actually labeling them
WSI clustering types
▶ hierarchical clustering of contexts
▶ clustering by committee
▶ k-means clustering
hierarchical clustering of contexts
start with one context per cluster, and iteratively merge the clusters
▶ single-link/complete-link/average-link clustering
▶ hierarchical density-based clustering
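A minimal sketch of average-link clustering of contexts with scikit-learn (the metric parameter assumes scikit-learn ≥ 1.2); the contexts of the ambiguous word “bank” and the choice of two clusters are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# One context per occurrence of the ambiguous target word "bank" (toy examples).
contexts = [
    "deposit money at the bank account",
    "the bank approved the loan and mortgage",
    "we sat on the river bank fishing",
    "the muddy bank of the stream flooded",
]

X = TfidfVectorizer().fit_transform(contexts).toarray()

# Average-link agglomerative clustering with cosine distance; 2 induced senses.
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
labels = clustering.fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: one cluster per induced sense
```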
clustering by committee
▶ find the top-k most similar words for each word
▶ construct committees as collections of tight clusters using the top-k similar words
▶ form as many committees as possible on the condition that each newly formed committee is not very similar to any existing committee
▶ assign each word to its most similar committee
LDA (Latent Dirichlet Allocation) a.k.a. Topic Modeling
discovers underlying themes (topics) in a collection of documents by assigning each word a probability of belonging to the different topics
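A small sketch using gensim’s LdaModel on toy bag-of-words documents (all data illustrative):

```python
from gensim import corpora, models

# Toy documents, each already reduced to a bag of words.
texts = [
    ["bank", "money", "loan", "interest"],
    ["money", "deposit", "bank", "account"],
    ["river", "bank", "water", "fishing"],
    ["water", "river", "flood", "bank"],
]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # documents as (word_id, count) pairs

# Two latent topics; each word gets a probability under each topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, topic in lda.print_topics(num_words=4):
    print(topic_id, topic)
```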
Word sense induction by graph clustering
For a target word w, we build a collocation graph that connects the words in w’s context. Every edge in the graph represents the similarity between the connected nodes
min-cut in graph clustering
find the partition of a graph by cutting the smallest number of edges or the edges with a minimum weighted sum
Chinese whispers in graph clustering
1. assign a class to each node
2. at each iteration, a node gets reassigned to the strongest class in its local neighborhood (the most connected)
▶ in case of ties, choose a class randomly
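A minimal sketch of the procedure on a toy collocation graph built with networkx; the words and edge weights are made up:

```python
import random
from collections import defaultdict
import networkx as nx

def chinese_whispers(graph, iterations=20, seed=0):
    """Minimal Chinese Whispers: each node adopts the strongest class among its neighbors."""
    rng = random.Random(seed)
    labels = {node: node for node in graph.nodes}        # 1. every node starts in its own class
    for _ in range(iterations):
        nodes = list(graph.nodes)
        rng.shuffle(nodes)
        for node in nodes:                               # 2. reassign to the strongest local class
            scores = defaultdict(float)
            for neighbor in graph.neighbors(node):
                weight = graph[node][neighbor].get("weight", 1.0)
                scores[labels[neighbor]] += weight
            if scores:
                best = max(scores.values())
                candidates = [c for c, s in scores.items() if s == best]
                labels[node] = rng.choice(candidates)    # ties are broken randomly
    return labels

# Toy collocation graph for the target word "bank": two loosely connected word groups.
G = nx.Graph()
G.add_weighted_edges_from([
    ("money", "loan", 3), ("loan", "deposit", 2), ("money", "deposit", 2),
    ("river", "water", 3), ("water", "shore", 2), ("river", "shore", 2),
    ("deposit", "shore", 0.1),   # weak cross-cluster edge
])
print(chinese_whispers(G))       # nodes in the same induced sense share a label
```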
Evaluation
▶ Comparison with a gold standard
▶ Precision / cluster purity = percentage of tokens that are tagged correctly, out of all tokens targeted by the system
▶ Recall / cluster overlap = percentage of tokens that are tagged correctly, out of all target tokens (whether or not the system tagged them)
Motivation for Multi-Modal Semantics
▶ Semantics requires “grounding”
▶ Semantics across multiple input modalities
▶ Better semantic representations for NLP: Importance for human-like understanding and real-world applications (e. g., image captioning, video retrieval, grounded dialogue)