NLP B1-5.

42 Terms

1

Attention layer types

  • self-attention

  • cross-attention

2

Self-attention

The queries generated from the input are used to query the input itself: X = I.

3

Cross-attention

An external vector sequence is queried, e.g., in encoder-decoder transformer architectures, a sequence created by an encoder.
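
A minimal NumPy sketch of the shared mechanism (dimensions, random weights and the `attention` helper are illustrative assumptions, not taken from the course material); the only difference between the two layer types is which sequence supplies the keys and values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Z, d_k=16):
    """Scaled dot-product attention: queries come from X, keys and values from Z."""
    rng = np.random.default_rng(0)
    # assumes X and Z share the same model dimension
    W_q, W_k, W_v = (rng.normal(size=(X.shape[-1], d_k)) for _ in range(3))
    Q, K, V = X @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T / np.sqrt(d_k)   # (len_X, len_Z) attention scores
    return softmax(scores) @ V        # one output vector per query position

I = np.random.randn(5, 32)    # input sequence: 5 tokens, dim 32
E = np.random.randn(7, 32)    # e.g. an encoder's output sequence

self_attn  = attention(I, I)  # self-attention: the input queries itself (X = I)
cross_attn = attention(I, E)  # cross-attention: an external sequence is queried
```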

4

Purpose of multi-head attention

To be able to attend to multiple aspects of the input

5

Goal of task-oriented dialog systems

To complete a task or tasks in a predefined task set, e.g., order something, make a call, transfer money, get directions etc.

6

Goal of open-domain dialog systems

▶ The goal is an open-ended, unstructured, extended conversation.

▶ There is no predetermined task (or set of tasks) whose successful execution would be the goal.

▶ The main result in many cases is simply "entertainment".

7

Types of dialog systems based on initiation

  • user-initiated

  • system-controlled

  • mixed initiative

8

User-initiated dialog system

Dialogs are typically very short, e.g., a user question and a system answer when using a mobile assistant.

9

System-controlled dialog system

Variants:

▶ the system initiates and controls, e.g., by warning or reminding the user of something

▶ the user initiates by asking for instructions, from there the system instructs without essential user input

▶ the user initiates by asking for a service, from there the system helps the user to "fill in a questionnaire" by asking questions

10

Mixed initiative dialog system

There are several turns and both the system and the user can take the initiative – these are typically open-domain dialog systems.

11

General conversational requirements

  • grounding

  • adjacency pairs

  • pragmatic inferences

12

Grounding

There is a constantly evolving common ground, established by the speakers who repeatedly acknowledge understanding what the other said.

Speakers:

▶ introduce new pieces of information

▶ acknowledge the added information (by gestures or verbal confirmation)

▶ ask for clarification if needed

13

Response by retrieval (in the context of corpus-based ODD systems)

Respond with the utterance in the data set that is

▶ most similar to the last turn, or

▶ is the response to the utterance which is most similar to the last turn.

(Similarity can be based on fully pretrained or trained/fine-tuned embeddings.)

14

Response by generation (in the context of corpus-based ODD systems)

Train a generator model on the data set, typical architectures:

▶ RNN- or Transformer-based encoder-decoder

▶ a fine-tuned "predict next" language model, e.g., a GPT-like architecture

15

Frame (in the context of TODs)

Structured representations of the user's intentions, which contain slots that can be filled in with values.

16

Frame-based TOD system

Asks questions that help fill the frame slots until all slots required for the current target task are filled, and then executes the task.

17

Components of early frame-based TODs

  • Control structure

  • Natural language understanding (NLU)

  • Natural language generation (NLG)

  • Optional ASR (Automatic Speech Recognition) module

18

Control structure of TODs

A production rule system controlling how to manipulate the slot values and which question to ask, based on the current state and the user's input.

19

NLU module of TODs

A rule-based NLP module determining the domain (general topic), intent (concrete goal), and slots and filler values of the utterance.

It can be implemented by classifiers and sequence tagging models (e.g. IOB tagging)
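
For illustration, a hypothetical NLU result for a made-up utterance (the domain, intent and slot names are invented, not from the course material):

```python
# Hypothetical NLU output for the utterance "book a table for two in Rome".
nlu_output = {
    "domain": "restaurant",
    "intent": "book_table",
    # IOB tags: B- marks the beginning of a slot filler, I- its continuation, O = outside.
    "iob_tags": [
        ("book", "O"), ("a", "O"), ("table", "O"), ("for", "O"),
        ("two", "B-party_size"), ("in", "O"), ("Rome", "B-city"),
    ],
    "slots": {"party_size": "two", "city": "Rome"},
}
```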

20

NLG module for TODs

A template-based system to generate appropriate system questions for the user.

21

Differences between dialog-state and frame-based TODs

▶ decomposing control into two separate modules:

  • the dialog state tracker

  • dialog policy

▶ extensive use of machine learning methods in all modules, instead of the early systems' rule-based approach

22

Dialog state tracker

Based on the NLU's (N-best) output and/or dialog history, it determines the dialog act that took place, and the current (updated) dialog state.

This can happen by generating a set of candidate states and scoring them, or by scoring individual (slot, value) pairs separately. The scorer can be based on a pretrained encoder like BERT.

23

Dialog policy

Decides which action the system should take next, on the basis of the dialog state and possibly other elements of the dialog history.

Action types:

  • system dialog acts

  • querying a database

  • external API calls

Implementations:

  • rule based systems

  • supervised ML

  • RL optimized ML

24

NLG component in dialog-state system

When the required action is a type of system utterance, it generates the actual sentence based on the concrete action, the dialog state, and (optionally) the dialog history. It can be implemented as a rule-based system or as an ML model (seq2seq)

25

Parts of the NLG task

▶ utterance planning: planning the content of the utterance (which slots/values should be mentioned, perhaps also their order and grouping),

▶ utterance realization: actually generating the natural language expression of the planned content.

26

Dialog schema graph

Fully general task-oriented dialog model that explicitly conditions on task-oriented dialog descriptions.

27

Several schema-guided task-oriented dialog datasets

  • STAR

  • SGD

  • SGD-X

28

SimpleTOD

A single multi-task seq2seq model based on a pretrained LM.

It is simultaneously trained for:

  • dialog state tracking

  • dialog policy

  • NLG

29

Open-domain dialog system evaluation aspects

▶ how engaging was the dialog

▶ are the utterances human(-like)

▶ do responses make sense in the context

▶ are they fluent

▶ do they avoid repetitions

30

Task-oriented dialog system evaluation aspects

  • absolute task success

  • slot error rate

  • user satisfaction

  • general dialog quality

31

Temperature scaling

Modifying the model's probability distribution to control its 'creativity' with a temperature parameter T:

  • the higher the temperature, the closer it gets to a uniform distribution → more unexpected behavior, creativity

  • at temperature 0, the single best option has probability 1.0 and no other option can be chosen → deterministic model
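
A minimal sketch of the mechanism, assuming raw logits are available (the values below are toy numbers, not from any real model):

```python
import numpy as np

def sample_with_temperature(logits, T, rng=np.random.default_rng(0)):
    """Sample a token id from temperature-scaled logits."""
    if T == 0:                      # T = 0: degenerate case, always pick the argmax
        return int(np.argmax(logits))
    scaled = logits / T             # higher T flattens the distribution
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_with_temperature(logits, T=0.0))   # always token 0 (deterministic)
print(sample_with_temperature(logits, T=1.5))   # flatter distribution, more "creative"
```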

32

Top-k sampling

Restricting the vocabulary at each step to the top k tokens based on their score
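
A possible NumPy sketch of the filtering step (the masked logits would then be renormalized and sampled from; ties are handled naively):

```python
import numpy as np

def top_k_filter(logits, k):
    """Keep only the k highest-scoring tokens; mask the rest out with -inf."""
    cutoff = np.sort(logits)[-k]              # k-th largest score
    return np.where(logits >= cutoff, logits, -np.inf)

logits = np.array([3.0, 1.0, 0.2, -0.5, 2.5])
print(top_k_filter(logits, k=2))              # only the two best tokens survive
```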

33

Top-p sampling

Restricts the vocabulary by keeping the smallest set of most probable tokens whose combined probability mass meets (and exceeds) a threshold p.
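
A sketch of the corresponding filtering step in NumPy (illustrative only; edge cases are handled naively):

```python
import numpy as np

def top_p_filter(logits, p):
    """Keep the smallest set of most probable tokens whose probability mass reaches p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                     # tokens from most to least probable
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]    # up to the first token where cum mass >= p
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return filtered

logits = np.array([3.0, 1.0, 0.2, -0.5, 2.5])
print(top_p_filter(logits, p=0.9))
```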

34

Logit biasing

Biasing the logits of the model to favor certain tokens. This can be used to prevent the model from generating harmful content, or to make it generate content that is more aligned with a certain style.
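
A sketch of the idea, assuming a user-supplied map from token ids to additive biases (the ids and bias values below are made up):

```python
import numpy as np

# Hypothetical bias map: token id -> additive logit bias
# (a large negative value effectively bans a token, a positive one encourages it).
logit_bias = {17: -100.0, 42: 5.0}

def apply_logit_bias(logits, bias):
    """Add the per-token biases to the raw logits before sampling."""
    biased = logits.copy()
    for token_id, b in bias.items():
        biased[token_id] += b
    return biased

biased = apply_logit_bias(np.zeros(100), logit_bias)
```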

35

Presence penalty

Decreases the bias towards tokens that already appear in the current text by a flat penalty.

36

Frequency penalty

Decreases the bias towards a token in proportion to the number of its occurrences in the text.
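
A sketch showing how the two penalties (this card and the previous one) might be applied together, following the commonly used additive formulation; the exact formula of a given API may differ:

```python
import numpy as np
from collections import Counter

def penalize(logits, generated_ids, presence_penalty=0.5, frequency_penalty=0.3):
    """Lower the logits of tokens that already occur in the generated text."""
    counts = Counter(generated_ids)
    penalized = logits.copy()
    for token_id, count in counts.items():
        penalized[token_id] -= presence_penalty            # flat, once per present token
        penalized[token_id] -= frequency_penalty * count   # grows with occurrence count
    return penalized

logits = np.zeros(10)
print(penalize(logits, generated_ids=[3, 3, 7]))
```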

37

Beam size

A hyperparameter of beam search, the number of sequences that are kept at each step.

A larger beam size will result in more diverse outputs, but also in significantly slower inference.
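
A toy beam search sketch over a dummy next-token distribution (the `step_logits` callback is a stand-in for a real language model):

```python
import numpy as np

def beam_search(step_logits, beam_size, length):
    """Toy beam search: `step_logits(prefix)` returns log-probs over the vocabulary."""
    beams = [([], 0.0)]                                   # (token sequence, log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            log_probs = step_logits(seq)
            for tok, lp in enumerate(log_probs):
                candidates.append((seq + [tok], score + lp))
        # keep only the `beam_size` best-scoring sequences at each step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Dummy model: fixed log-probabilities over a 4-token vocabulary.
dummy = lambda seq: np.log(np.array([0.5, 0.3, 0.15, 0.05]))
print(beam_search(dummy, beam_size=3, length=2))
```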

38

Flash decoding

Parallelizes the QK product calculation over the sequence length; the softmax and the output are calculated after the parallel processing is done → we can achieve higher GPU utilization.

39

Flashdecoding++

It uses a fixed global constant based on activation statistics to prevent the overflow of the exponential in the softmax, thus the elements of the softmax can be calculated in parallel. If the method meets an overflow, it will recompute the softmax with the actual maximum value, but this should happen with < 1% probability.

40

Phases of inference

  1. Prefill

  2. Decoding

41

Prefill inference step

The user prompt is processed, K and V are calculated and cached. This can be done in a single pass, and the prompt might be a much longer sequence than the generated output. This also includes generating the first output token.

42

Decoding inference step

The iterative process of generating the next token and calculating the next K and V. This cannot be parallelized, but the K and V can be reused from the cache. We only need to calculate a single Q for each pass.
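
A toy sketch of the two phases for a single attention layer (random "weights" and embeddings, purely illustrative): prefill computes and caches K and V for the whole prompt in one pass, while decoding computes a single query per step and only appends one new K/V row to the cache:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query scaled dot-product attention over the cached K and V."""
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# --- Prefill: process the whole prompt in one pass and cache K, V ------------
prompt = rng.normal(size=(10, d))          # toy prompt embeddings (10 tokens)
K_cache, V_cache = prompt @ W_k, prompt @ W_v

# --- Decoding: one new token at a time, reusing (and extending) the cache ----
x = rng.normal(size=d)                     # embedding of the last generated token
for _ in range(5):
    q = x @ W_q                            # only a single query per step
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    x = attend(q, K_cache, V_cache)        # stand-in for the rest of the model
```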
