NLP B6-9.

39 Terms

1

Role of AI alignment

Ensuring that an AI system operates in accordance (“is aligned”) with

  • the intended goals and preferences of humans (users, operators etc.), and

  • general ethical principles

2

Main human influences on AI systems

Choosing the:

  • dataset

  • reward function

  • loss or objective function

3

Outer misalignment

A divergence between the developer-specified objective or reward of the system and the intended human goals

4

Inner misalignment

A divergence between the explicitly specified training objective and what the system actually pursues, its so-called emergent goals.

5

Instruction-following assistant

An LLM-based general model which can carry out a wide, open-ended range of tasks based on their descriptions

6

Main expectations of an instruction-following assistant

HHH

  • helpful

  • honest

  • harmless

7

Hallucination

Plausible-sounding but non-factual or misleading statements

8

Main strategies for creating instruction datasets

  • manual creation

  • data integration

  • synthetic generation

9

Manual creation

Correct responses are written by human annotators; instructions are either collected from user–LLM interactions or also created manually

10

Data integration

Converting existing supervised NLP task datasets into natural language (instruction, response) pairs using manually created templates.

E.g. Flan
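
A minimal sketch of template-based conversion, assuming a hypothetical NLI-style record with `premise`, `hypothesis` and `label` fields. The template wording, the label mapping and the name `to_instruction_pair` are illustrative assumptions, not taken from Flan.

```python
# Hypothetical template that turns a labeled NLI example into natural language.
TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)

# Assumed mapping from dataset labels to natural-language responses.
LABEL_TO_TEXT = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}


def to_instruction_pair(example):
    """Convert one supervised example into an (instruction, response) pair."""
    instruction = TEMPLATE.format(
        premise=example["premise"], hypothesis=example["hypothesis"]
    )
    response = LABEL_TO_TEXT[example["label"]]
    return instruction, response


example = {
    "premise": "A dog runs across the field.",
    "hypothesis": "An animal is moving.",
    "label": "entailment",
}
instruction, response = to_instruction_pair(example)
```

In practice several templates per task are typically written so that the model sees varied phrasings of the same task.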

11

Synthetic generation

The responses are generated by LLMs (and possibly filtered by humans), while the instructions are either

  • collected from user prompts, or

  • also generated by LLMs from a pool of manually created seed prompts: randomly sample the pool to prompt an LLM for further instructions and examples, filter these, and iteratively add the best ones to the pool

E.g. Self-Instruct

12

Proximal Policy Optimization (PPO)

A policy gradient variant that avoids overly large policy changes by clipping the probability ratio between the new and the old policy to a fixed range around 1
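
The clipping idea can be sketched for a single sample as follows. This is a simplified illustration, not a full PPO implementation: the function name is hypothetical and `epsilon=0.2` is an assumed default.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for a single (state, action) sample.

    ratio     = pi_new(action | state) / pi_old(action | state)
    advantage = estimated advantage of the action
    epsilon   = clipping range (0.2 is an assumed default)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to push the ratio outside
    # [1 - epsilon, 1 + epsilon] in the direction that would raise the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of 1.5 with a positive advantage is capped at 1 + epsilon.
capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# A ratio inside the range is left untouched.
uncapped = ppo_clipped_objective(ratio=1.1, advantage=1.0)
```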

13

RL training objectives

  • maximize the expected reward for (instruction, model-response) pairs

  • minimize (a scaled version of) the KL divergence between the conditional distributions predicted by the policy and by the instruction-tuned language model used for its initialization
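
The two objectives are commonly combined into a single expression. The symbols below are assumptions introduced for illustration: r is the reward model, π_θ the policy, π_ref the model used for initialization, β the scaling factor and D the instruction dataset.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
\;-\;
\beta \,
\mathbb{E}_{x \sim \mathcal{D}}
\Bigl[ D_{\mathrm{KL}}\bigl( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr) \Bigr]
```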

14

Direct Preference Optimization (DPO)

Transforms the RL optimization problem into a supervised (maximum likelihood) learning task, eliminating the need for a separate, costly reward model

  1. Reparameterizes the RL optimization problem in terms of the policy instead of the reward model (RM)

  2. Formulates a maximum likelihood objective for the policy πθ

  3. Optimizes the policy via supervised learning on the original user preference judgements
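
A minimal sketch of the resulting per-example loss, assuming the sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses are already available from the policy and from a frozen reference model. The argument names and `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# The loss is lower when the policy favors the chosen response.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

No reward model appears anywhere in the computation, only log-probabilities and the preference pair itself.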

15

Input of conditional text generation

A complex representation of the assistive dialog’s context, including its history (instead of a single instruction)

16

Complexity of retrieval with nearest-neighbor search

O(Nd)

d is the embedding size, N is the number of documents
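
The O(Nd) cost is visible in a brute-force search: one pass over N documents, each requiring an O(d) distance computation. A minimal sketch, assuming Euclidean distance and a hypothetical `nearest_neighbor` helper:

```python
import math


def nearest_neighbor(query, documents):
    """Exhaustive search: N distance computations of cost O(d) each."""
    best_index, best_distance = None, math.inf
    for index, vector in enumerate(documents):   # N iterations
        distance = math.dist(query, vector)      # O(d) work
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


documents = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
index, distance = nearest_neighbor((0.9, 1.2), documents)
```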

17

Methods for approximating nearest neighbors

  • Hashing

  • Quantization

  • Tree structure

  • Graph-based

18

Main idea of using locality-sensitive hashing for nearest neighbor approximation

The probability of collision decreases monotonically as the distance between two vectors increases, so each bin tends to contain elements that are close to each other
→ we perform an exhaustive nearest neighbor search only within the query's bin
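
A minimal sketch using random-hyperplane hashing, which is one possible LSH family (an assumption here; it is sensitive to the angle between vectors rather than to Euclidean distance). All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; each one contributes one bit of the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]


def lsh_key(vector, hyperplanes):
    """The hash is the pattern of sides of the hyperplanes the vector falls on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    buckets = defaultdict(list)
    for i, vector in enumerate(vectors):
        buckets[lsh_key(vector, hyperplanes)].append(i)
    return buckets


def query_index(query, vectors, buckets, hyperplanes):
    """Exhaustive search restricted to the bin that the query hashes into."""
    candidates = buckets.get(lsh_key(query, hyperplanes), [])
    return min(candidates, key=lambda i: math.dist(query, vectors[i]), default=None)


vectors = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
planes = make_hyperplanes(num_planes=4, dim=2)
buckets = build_index(vectors, planes)
match = query_index((2.0, 0.0), vectors, buckets, planes)
```

The query may land in an empty bin, in which case this sketch returns `None`; practical systems usually use several hash tables to reduce that risk.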

19

Main idea of using KD-trees for nearest neighbor approximation

  1. Draw a hyperplane at the median, orthogonal to the highest-variance data dimension

  2. Split each half by the same principle until each node contains only a single element → these are the tree leaves

  3. Connect the nodes/subgroups into a tree by merging them in the inverse order of their separation

  4. Use priority search to find the nearest neighbors

20

Main idea of using priority search in KD-trees for nearest neighbor approximation

  1. We split up our data into cells, each cell containing a KD-tree leaf node

  2. We encode the user query and find its cell

  3. We measure the distance between the encoded query and the leaf node belonging to that cell

  4. We use this distance as a search radius → we only do NN search in the cells that the radius touches
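
The construction and the radius-based search can be sketched together. This is a simplified sketch under stated assumptions: the dictionary-based node layout and function names are hypothetical, the input must be non-empty, and the search shown is exact (it visits every cell the radius touches), whereas approximate variants additionally cap the number of cells visited.

```python
import math
from statistics import pvariance


def build_kdtree(points):
    """Split at the median of the highest-variance dimension until leaves hold one point."""
    if len(points) == 1:
        return {"leaf": points[0]}
    dim = len(points[0])
    axis = max(range(dim), key=lambda a: pvariance([p[a] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "axis": axis,
        "split": ordered[mid][axis],
        "left": build_kdtree(ordered[:mid]),
        "right": build_kdtree(ordered[mid:]),
    }


def nearest(node, query, best=None):
    """Descend to the query's cell first, then visit only cells within the radius."""
    if "leaf" in node:
        distance = math.dist(query, node["leaf"])
        if best is None or distance < best[0]:
            best = (distance, node["leaf"])
        return best
    diff = query[node["axis"]] - node["split"]
    near, far = (
        (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    )
    best = nearest(near, query, best)
    # The far side can only help if the splitting plane lies inside the radius.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
distance, neighbor = nearest(tree, (9, 2))
```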

21

Voronoi cell

The region of space consisting of all locations that are closer to one specific point than to any other point in the set; its boundaries are shared with the cells of neighboring points.

22

Vector Quantization

A compression technique that represents the vectors of the text data with a smaller set of reference vectors (centroids), approximating each original high-dimensional word vector with the closest centroid vector.

It significantly improves storage efficiency and processing speed, at the cost of information loss due to the approximation
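
A minimal sketch of the encoding step, assuming the codebook of centroids has already been learned (for example with k-means, which is not shown). The function names are hypothetical.

```python
import math


def quantize(vector, codebook):
    """Replace a vector by the index of its closest centroid."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


def reconstruct(code, codebook):
    """Decoding returns the centroid, so the original detail is lost."""
    return codebook[code]


codebook = [(0.0, 0.0), (10.0, 10.0)]
code = quantize((1.0, 2.0), codebook)
approximation = reconstruct(code, codebook)
```

Only the integer `code` needs to be stored per vector, which is where the storage saving comes from.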

23

Product quantization

A high-dimensional vector is divided into smaller sub-vectors or segments. Each sub-vector is then quantized independently, using a smaller codebook of centroids that is specific to that segment. The final quantized representation of the original vector is obtained by combining the quantized codes (indices of the nearest centroids) of each segment, so the overall codebook is the Cartesian product of the segment codebooks.

This is more computationally efficient since it's much easier to manage and compute distances within these lower-dimensional subspaces.
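
A minimal sketch of encoding and decoding, assuming equally sized segments and one pre-trained codebook per segment. The function names and the toy codebooks are hypothetical.

```python
import math


def pq_encode(vector, codebooks):
    """Quantize each segment independently against its own codebook."""
    segment_length = len(vector) // len(codebooks)
    codes = []
    for s, codebook in enumerate(codebooks):
        sub_vector = vector[s * segment_length:(s + 1) * segment_length]
        codes.append(
            min(range(len(codebook)), key=lambda i: math.dist(sub_vector, codebook[i]))
        )
    return codes


def pq_decode(codes, codebooks):
    """Concatenate the selected centroids to approximate the original vector."""
    approximation = []
    for code, codebook in zip(codes, codebooks):
        approximation.extend(codebook[code])
    return approximation


# Two segments, each with its own two-centroid codebook.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(5.0, 5.0), (9.0, 9.0)],
]
codes = pq_encode([0.9, 1.2, 8.0, 9.5], codebooks)
approximation = pq_decode(codes, codebooks)
```

With two centroids per segment, the two codebooks jointly represent 2 × 2 = 4 distinct vectors while storing only 4 centroids in total.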

24

Complexity of product quantization

O(d*m^{1/L})

L is the number of segments, d is the vector dimensionality, m is the number of possible code combinations (the size of the full product codebook)
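
One way to see where the bound comes from, assuming every segment uses a codebook of the same size k (so that the product codebook has m = k^L entries) and that quantizing a vector compares each sub-vector with all k centroids of its segment:

```latex
k = m^{1/L}, \qquad
\underbrace{L}_{\text{segments}} \cdot
\underbrace{k}_{\text{centroids per segment}} \cdot
\underbrace{\tfrac{d}{L}}_{\text{dimensions per sub-vector}}
= d \cdot k = d \cdot m^{1/L}
```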

25

Small world property of graphs

  • the shortest path between two vertices of the graph should be small on average (the idea of "six degrees of separation" in social networks)

  • the clustering coefficient (the ratio of fully connected triples (triangles) to all triples in the graph) should be large → captures the intuition that entities tend to form tightly interconnected groups

In the context of NLP, these properties of small-world networks facilitate models and systems that are both efficient (due to short path lengths) and capable of capturing nuanced relationships (due to high clustering).
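
The clustering coefficient described above can be computed directly from an adjacency structure. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the function name is hypothetical.

```python
from itertools import combinations


def global_clustering_coefficient(adjacency):
    """Ratio of closed triples to all connected triples in an undirected graph."""
    closed = total = 0
    for center, neighbors in adjacency.items():
        # Every pair of neighbors of `center` forms a triple centered on it.
        for a, b in combinations(sorted(neighbors), 2):
            total += 1
            if b in adjacency[a]:  # the triple is closed, i.e. part of a triangle
                closed += 1
    return closed / total if total else 0.0


# A triangle A-B-C with one extra vertex D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
coefficient = global_clustering_coefficient(graph)
```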

26

Navigable small worlds (NSW) algorithm

Vertices are inserted into the network iteratively. By default, each new vertex is connected to its closest neighbors, except that with probability p it is connected to random vertices instead
→ we build up the network in a node-by-node manner
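
A simplified sketch of the node-by-node construction. As an assumption made for brevity, the closest neighbors are found by an exhaustive scan over the vertices inserted so far rather than by a graph search, and the parameter names (`num_neighbors`, `p`, `seed`) are hypothetical.

```python
import math
import random


def build_nsw(points, num_neighbors=2, p=0.1, seed=0):
    """Insert vertices one by one, linking each to close or (rarely) random vertices."""
    rng = random.Random(seed)
    graph = {}
    for new_id, point in enumerate(points):
        existing = list(graph)
        if rng.random() < p:
            # With probability p, connect to randomly chosen vertices.
            chosen = rng.sample(existing, min(num_neighbors, len(existing)))
        else:
            # Otherwise connect to the closest vertices inserted so far.
            chosen = sorted(existing, key=lambda i: math.dist(point, points[i]))
            chosen = chosen[:num_neighbors]
        graph[new_id] = set(chosen)
        for other in chosen:
            graph[other].add(new_id)  # edges are undirected
    return graph


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
chain = build_nsw(points, num_neighbors=1, p=0.0)
```

With `p=0.0` and one neighbor per vertex, the points on a line end up linked as a simple chain.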

27

Hierarchical navigable small worlds (HNSW)

  • HNSW constructs a multi-layered graph where each layer is a smaller navigable small-world network that contains a subset of the nodes in the layer below. (The top layer has the fewest nodes, while the bottom layer contains all of them)

  • It is based on the principle of proximity: each node connects to its nearest neighbors in its own layer and possibly to nodes in other layers.

To find the nearest neighbors of a query point, HNSW starts the search from the top layer using a greedy algorithm. At each step, it moves to the node closest to the query until no closer node can be found, then proceeds to search the next layer down. This process repeats until the bottom layer is reached.
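
The layered greedy search can be sketched as follows, assuming the layers are already built and given top to bottom as adjacency dictionaries. This simplified version tracks a single current node per layer; the data layout and function names are hypothetical.

```python
import math


def greedy_search_layer(query, entry, layer_graph, vectors):
    """Move to the closest neighbor until no neighbor is closer than the current node."""
    current = entry
    while True:
        best = min(
            layer_graph[current],
            key=lambda n: math.dist(query, vectors[n]),
            default=current,
        )
        if math.dist(query, vectors[best]) >= math.dist(query, vectors[current]):
            return current
        current = best


def hnsw_search(query, layers, vectors, entry):
    """Search each layer from top to bottom, reusing the result as the next entry point."""
    current = entry
    for layer_graph in layers:
        current = greedy_search_layer(query, current, layer_graph, vectors)
    return current


vectors = {0: (0.0,), 1: (2.0,), 2: (4.0,), 3: (6.0,), 4: (8.0,)}
layers = [
    {0: {4}, 4: {0}},                                # top: fewest nodes, long links
    {0: {2}, 2: {0, 4}, 4: {2}},                     # middle
    {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}},  # bottom: all nodes
]
result = hnsw_search((5.9,), layers, vectors, entry=0)
```

The upper layers cover long distances in a few hops, and the bottom layer refines the result locally.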

28

Average complexity of HNSW inference

O(log(N))

N is the number of documents

29

Sentence-level supervised dataset examples

  • sentence similarity datasets

  • sentiment analysis datasets

  • natural language inference datasets (a premise paired with a hypothesis that is labeled as entailment, contradiction, or neutral)

30

Instruction embedding

The model dynamically determines which task to perform based on the content of the embedded instruction

→ provides versatility and adaptability to multiple tasks and domains

31

Retrieval Augmented Generation (RAG) steps

  1. Question-forming

  2. Retrieval

  3. Document aggregation

  4. Answer-forming
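
The four steps can be sketched as a toy pipeline. Everything here is a stand-in chosen for illustration: retrieval uses simple word overlap instead of embeddings, and `generate` is a hypothetical callable that represents the LLM.

```python
def form_question(user_input):
    """Step 1: turn the raw user input into a retrieval query."""
    return user_input.strip().lower()


def retrieve(question, documents, top_k=2):
    """Step 2: rank documents by word overlap with the question (toy retriever)."""
    question_words = set(question.split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def aggregate(retrieved_documents):
    """Step 3: merge the retrieved documents into a single context."""
    return "\n".join(retrieved_documents)


def form_answer(question, context, generate):
    """Step 4: let the generator answer based on the aggregated context."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)


documents = [
    "HNSW is a graph-based index",
    "PPO clips policy updates",
    "Product quantization compresses vectors",
]
question = form_question("  What is HNSW ")
retrieved = retrieve(question, documents, top_k=1)
# The generator is stubbed out with a function that echoes its prompt.
answer = form_answer(question, aggregate(retrieved), generate=lambda prompt: prompt)
```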

32

Hypothetical document embedding

The model generates hypothetical (possibly non-factual) answers to the query; the real documents are then retrieved based on their embedding similarity to these hypothetical answers.

33

Entity memory

A list of entities and related knowledge which gets stored in a database that the LLM can update as well as retrieve information from.

34

Retrieval Augmented Language Model Pretraining (REALM)

It uses a neural knowledge retriever built on BERT-like embedding models to retrieve knowledge from a textual knowledge corpus; the retrieved text is fed to a knowledge-augmented encoder alongside the actual input

35

Retrieval-Enhanced Transformer (RETRO)

The main idea is that relevant context information is encoded using cross-attention based on the input information.

Initially the input gets chunked, and each chunk is processed separately → a frozen BERT model retrieves the corresponding context vectors (neighbors) of each chunk → these are encoded using cross-attention → in the decoder, cross-attention incorporates the encoded context information into the input, with the context serving as the key and value

36

Self-monologue model

A model that operates in a semi-autonomous loop-like manner by generating its objectives, executing tasks based on those objectives, and then learning from the outcomes of its actions

37

AutoGPT steps

Thoughts: Interpretation of the user input/observations with respect to the goals.

Reasoning: Chain of thought about what to do for this input.

Plan: Planned actions to execute (additional external tools/expert LLMs can be called)

Criticism: Reflection on the action before execution, aiming for improvement

Action: Action execution with inputs generated by AutoGPT.

38

Conversational agent collaboration

Agents collaborate in a conversational manner. Each agent is specialized to use a given tool, while the controller schedules and routes the conversation between them iteratively.

39

Tool fine-tuning

A graph of API calls is constructed using a multitude of LLM calls. These successive calls are then ranked by success rate, and the best few passing solutions are selected to be included in the dataset