L14 – Natural language processing: More on embeddings


Flashcards about embeddings and Natural Language Processing.


14 Terms

1. What is text vectorisation?

Converting text to numeric data, like translating English to French. Instead of words, you have numbers a computer can understand.

2. What are the typical steps for text vectorisation?

Standardisation (cleaning the text), Tokenisation (splitting into pieces), Indexing (assigning IDs), and Encoding (converting IDs to vectors). These steps transform raw text data into numerical representations that machine learning models can process.
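
A minimal sketch of all four steps in plain Python (the text and vocabulary are made-up examples):

    import re

    text = "The cat SAT on the mat!"

    # 1. Standardisation: lowercase, strip punctuation
    clean = re.sub(r"[^a-z ]", "", text.lower())

    # 2. Tokenisation: split into word tokens
    tokens = clean.split()          # ['the', 'cat', 'sat', 'on', 'the', 'mat']

    # 3. Indexing: give every distinct token an integer ID
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = [vocab[t] for t in tokens]

    # 4. Encoding: turn IDs into vectors, here one-hot
    one_hot = [[1 if i == tid else 0 for i in range(len(vocab))] for tid in ids]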

3. What is Tokenisation?

Splitting text into tokens, which can be words, subwords, or groups of words. Like cutting a sentence into individual words or phrases.
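
Two granularities in plain Python (real subword tokenisers such as BPE learn their pieces from data, so this only shows word- and character-level splits):

    sentence = "the cat sat"

    # Word-level tokens
    words = sentence.split()                 # ['the', 'cat', 'sat']

    # Character-level tokens, another valid granularity
    chars = list(sentence.replace(" ", ""))  # ['t', 'h', 'e', 'c', 'a', ...]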

4. How do Embedding layers work?

An Embedding layer is practically the same as one-hot encoding followed by a linear Dense layer (without the bias term). It's like creating a unique address for each word in a high-dimensional space, learning these addresses via a neural network.
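
A numpy sketch of that equivalence: multiplying a one-hot vector by a weight matrix just selects one row, which is exactly what an embedding lookup does (the sizes are arbitrary):

    import numpy as np

    vocab_size, embed_dim = 5, 3
    W = np.random.rand(vocab_size, embed_dim)  # the layer's weight matrix

    token_id = 2
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0

    dense_out = one_hot @ W   # Dense layer without bias
    lookup = W[token_id]      # direct row lookup, as in an Embedding layer
    assert np.allclose(dense_out, lookup)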

5. What are the options for measuring token similarity in embedding space?

Euclidean (L2) distance (straight-line distance) and Cosine similarity (angle between vectors). Imagine measuring how far apart, or how similarly oriented, two points are in a space.
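
Both measures in numpy, for two example vectors:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 5.0])

    # Euclidean (L2) distance: straight-line distance between the points
    l2 = np.linalg.norm(a - b)

    # Cosine similarity: alignment of directions, lengths ignored
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))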

6. What is Cosine similarity?

The cosine of the angle between vectors a and b (ignores vector lengths). Like checking how much two arrows point in the same direction, regardless of their length.
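
In symbols, for vectors a and b:

    \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}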

7. How does vector search work?

Uses vector similarity to find relevant content. Like using a map (embedding space) to find locations (content) near a specific point (query).
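
A brute-force sketch: embed the query, score every stored vector by cosine similarity, and return the best matches (the embeddings below are random stand-ins for real ones):

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(1000, 64))  # 1000 stored embeddings (stand-ins)
    query = rng.normal(size=64)           # query embedding

    # Normalise so that dot products equal cosine similarities
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)

    scores = corpus_n @ query_n           # similarity of the query to every item
    top5 = np.argsort(scores)[-5:][::-1]  # indices of the 5 most similar items

Real systems replace this exhaustive scan with an approximate nearest-neighbour index once the corpus grows large.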

8. Besides words, what else can we create embeddings for?

Sentences, Images, Audio. Anything that can be represented as data can have an embedding.

9. What does Semantic Textual Similarity measure?

Measures similarity between entire sentences. Like checking if two paragraphs convey the same message, even if they use different words.
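
A sketch using the sentence-transformers library (the model name is just one common choice):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([
        "A man is eating food.",
        "Someone is having a meal.",
    ])

    # High cosine similarity despite the different wording
    score = util.cos_sim(emb[0], emb[1])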

10. How do we approach image embedding with CNNs?

Let the flattened output of the last Conv layer be the embedding space. The CNN learns to extract the most important features of an image and represent it as a vector.
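
A Keras sketch: build a small CNN classifier, then reuse everything up to the Flatten layer as the image embedder (the architecture here is illustrative):

    import tensorflow as tf
    from tensorflow import keras

    cnn = keras.Sequential([
        keras.Input(shape=(64, 64, 3)),
        keras.layers.Conv2D(16, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.Flatten(name="flat"),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # The model up to the Flatten layer maps an image to its embedding vector
    embedder = keras.Model(cnn.input, cnn.get_layer("flat").output)
    embedding = embedder(tf.random.uniform((1, 64, 64, 3)))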

11. What forms a latent feature space?

The outputs of embedding layers, convolutional layers, and other feature-extraction layers form a latent feature space: the hidden representation of the data learned by the model.

12. What is Transfer learning?

Training a model on one task, but using it (with little modification) on a different task. Like learning to ride a bike and then using that knowledge to learn to ride a motorcycle more easily.
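
A common Keras pattern, as a sketch: take a network pretrained on ImageNet, freeze it, and train only a new head for the new task:

    from tensorflow import keras

    # Pretrained feature extractor, classification head removed
    base = keras.applications.MobileNetV2(
        input_shape=(160, 160, 3), include_top=False,
        weights="imagenet", pooling="avg",
    )
    base.trainable = False  # keep the learned features fixed

    # New task-specific head (say, 5 classes in the new problem)
    model = keras.Sequential([
        base,
        keras.layers.Dense(5, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")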

13. What are benchmark tasks?

Various classification, clustering, and information-retrieval problems with human-annotated solutions. Like puzzles with known answers that we use to test how well a problem-solving strategy works.

14. How do embedding layers behave after training?

A given token is always mapped to the same embedding vector. The address remains consistent.
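
A tiny demonstration: after training, the embedding is a fixed lookup table, so the same ID always returns the same row (the random matrix here stands in for trained weights):

    import numpy as np

    E = np.random.rand(100, 8)  # embedding matrix: 100 tokens, 8 dimensions

    token_id = 42
    assert np.array_equal(E[token_id], E[token_id])  # same token, same vector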