Attention layer types
self-attention
cross-attention
Self-attention
The queries generated from the input are used to query the input itself: X = I.
Cross-attention
An external vector sequence is queried, e.g., in encoder-decoder transformer architectures a sequence created by an encoder.
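The two layer types differ only in where the keys and values come from. A minimal single-head sketch in NumPy (the weight matrices and toy state sequences are made-up placeholders):

```python
import numpy as np

def attention(X_q, X_kv, W_q, W_k, W_v):
    # Queries come from X_q; keys and values come from X_kv.
    # Self-attention: X_kv is X_q itself. Cross-attention: X_kv is external.
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

decoder_states = rng.normal(size=(5, d))  # the input I
encoder_states = rng.normal(size=(7, d))  # an external (encoder) sequence

self_attn = attention(decoder_states, decoder_states, W_q, W_k, W_v)   # X = I
cross_attn = attention(decoder_states, encoder_states, W_q, W_k, W_v)  # X = encoder output
```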
Purpose of multi-head attention
To be able to attend to multiple aspects of the input
Goal of task-oriented dialog systems
To complete a task or tasks in a predefined task set, e.g., order something, make a call, transfer money, get directions etc.
Goal of open-domain dialog systems
▶ The goal is an open-ended, unstructured, extended conversation.
▶ There is no predetermined task (or set of tasks) whose successful execution would be the goal.
▶ The main result in many cases is simply "entertainment".
Types of dialog systems based on initiation
user-initiated
system-controlled
mixed initiative
User-initiated dialog system
Dialogs are typically very short, e.g., a user question and a system answer using a mobile assistant
System-controlled dialog system
Variants:
▶ the system initiates and controls, e.g., by warning or reminding the user of something
▶ the user initiates by asking for instructions; from there the system instructs without essential user input
▶ the user initiates by asking for a service; from there the system helps the user to "fill in a questionnaire" by asking questions
Mixed initiative dialog system
There are several turns and both the system and the user can take the initiative; these are typically open-domain dialog systems.
General conversational requirements
grounding
adjacency pairs
pragmatic inferences
Grounding
A constantly evolving common ground is established by the speakers, who repeatedly acknowledge understanding what the other said.
Speakers:
▶ introduce new pieces of information
▶ acknowledge the added information (by gestures or verbal confirmation)
▶ ask for clarification if needed
Response by retrieval (in the context of corpus-based ODD systems)
Respond with the utterance in the data set that is
▶ most similar to the last turn, or
▶ the response to the utterance that is most similar to the last turn.
(Similarity can be based on fully pretrained embeddings, or on embeddings trained/fine-tuned for the task.)
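A toy sketch of the two retrieval strategies; the 2-d "embeddings" and the corpus are hypothetical stand-ins for a pretrained sentence encoder's output:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def retrieve(query_emb, corpus, strategy):
    # corpus: list of (utterance, response, utterance_embedding) triples.
    utterance, response, _ = max(corpus, key=lambda item: cosine(query_emb, item[2]))
    # "most_similar": return the most similar utterance itself;
    # "paired_response": return the response recorded for that utterance.
    return utterance if strategy == "most_similar" else response

corpus = [
    ("hi there",    "hello!",       np.array([1.0, 0.0])),
    ("how are you", "fine, thanks", np.array([0.0, 1.0])),
]
query = np.array([0.1, 0.9])  # embedding of the last turn
```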
Response by generation (in the context of corpus-based ODD systems)
Train a generator model on the data set, typical architectures:
▶ RNN- or Transformer-based encoder-decoder
▶ a fine-tuned "predict next" language model, e.g., a GPT-like architecture
Frame (in the context of TODs)
A structured representation of the user's intentions, containing slots that can be filled in with values.
Frame-based TOD system
Asks questions that help fill the frame slots until all slots required for the current target task are filled, and then executes the task.
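The slot-filling loop can be sketched in a few lines; the "book a table" frame, slot names, and question templates below are hypothetical:

```python
# Hypothetical frame for a "book a table" task: None marks an unfilled slot.
frame = {"restaurant": None, "date": None, "party_size": None}
questions = {
    "restaurant": "Which restaurant would you like?",
    "date": "For which date?",
    "party_size": "For how many people?",
}

def next_question(frame):
    # Ask about the first unfilled required slot; once every slot is
    # filled, there is nothing left to ask and the task can be executed.
    for slot, value in frame.items():
        if value is None:
            return questions[slot]
    return None  # all slots filled -> execute the task

frame["restaurant"] = "Luigi's"  # filled from the user's first utterance
```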
Components of early frame-based TODs
Control structure
Natural language understanding (NLU)
Natural language generation (NLG)
Optional ASR (Automatic Speech Recognition) module
Control structure of TODs
A production rule system controlling how to manipulate the slot values and which question to ask based on the actual state and the user's input.
NLU module of TODs
A rule-based NLP module determining the domain (general topic), intent (concrete goal), and slots and filler values of the utterance.
It can be implemented by classifiers and sequence tagging models (e.g. IOB tagging)
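As an illustration of the sequence-tagging side, here is a small helper that turns IOB tags into (slot, value) pairs; the utterance, tags, and slot names are invented:

```python
def iob_to_slots(tokens, tags):
    # Collect B-/I- spans into slot -> value pairs.
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # beginning of a new slot span
            current = tag[2:]
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:
            slots[current] += " " + token  # continuation of the same span
        else:                             # "O" or an inconsistent I- tag
            current = None
    return slots

# Hypothetical NLU tagging output for a restaurant-booking domain:
tokens = ["book", "a", "table", "at", "Luigi's", "for", "Friday"]
tags   = ["O",    "O", "O",     "O",  "B-restaurant", "O", "B-date"]
```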
NLG module for TODs
A template-based system to generate appropriate system questions for the user.
Differences between dialog-state and frame-based TODs
▶ decomposing control into two separate modules:
the dialog state tracker
dialog policy
▶ extensive use of machine learning methods in all modules, instead of the early systems' rule-based approach
Dialog state tracker
Based on the NLU's (N-best) output and/or dialog history, it determines the dialog act that took place, and the current (updated) dialog state.
This can happen by generating a set of candidate states and scoring them, or by scoring individual (slot, value) pairs separately. The scorer can be based on a pretrained encoder like BERT.
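The per-pair scoring variant can be sketched as follows; `toy_scorer` is a dummy stand-in for a pretrained-encoder-based (e.g., BERT) scorer:

```python
def update_state(state, candidates, scorer, threshold=0.5):
    # Score each candidate (slot, value) pair separately and keep the
    # ones the scorer accepts; untouched slots are carried over.
    new_state = dict(state)
    for slot, value in candidates:
        if scorer(slot, value) > threshold:
            new_state[slot] = value
    return new_state

def toy_scorer(slot, value):
    # Dummy stand-in for a learned scoring model.
    return 0.9 if (slot, value) == ("cuisine", "italian") else 0.1
```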
Dialog policy
Decides which action the system should take next, based on the dialog state and possibly other elements of the dialog history.
Action types:
system dialog acts
querying a database
external API calls
Implementations:
rule based systems
supervised ML
RL optimized ML
NLG component in dialog-state system
When the required action is a type of system utterance, it generates the actual sentence based on the concrete action, the dialog state, and (optionally) the dialog history. It can be implemented as a rule-based system or as an ML model (seq2seq).
Parts of the NLG task
▶ utterance planning: planning the content of the utterance (which slots/values should be mentioned, perhaps also their order and grouping),
▶ utterance realization: actually generating the natural language expression of the planned content.
Dialog schema graph
A fully general task-oriented dialog model that explicitly conditions on task (schema) descriptions.
Several schema-guided task-oriented dialog datasets
STAR
SGD
SGD-X
SimpleTOD
A single multi-task seq2seq model based on a pretrained LM.
It is simultaneously trained for:
dialog state tracking
dialog policy
NLG
Open-domain dialog system evaluation aspects
▶ how engaging was the dialog
▶ are the utterances human(-like)
▶ do responses make sense in the context
▶ are they fluent
▶ do they avoid repetitions
Task-oriented dialog system evaluation aspects
absolute task success
slot error rate
user satisfaction
general dialog quality
Temperature scaling
Modifying the model's probability distribution to control its "creativity" with a temperature parameter T:
the higher the temperature, the closer the distribution gets to uniform → more unexpected behavior, more creativity
at temperature 0, the single best option gets probability 1.0 and no other option can be chosen → deterministic (greedy) model
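A minimal NumPy sketch of temperature scaling over a toy logit vector:

```python
import numpy as np

def temperature_softmax(logits, T):
    # T == 0 is the greedy limit: all probability mass on the argmax.
    logits = np.asarray(logits, dtype=float)
    if T == 0:
        p = np.zeros_like(logits)
        p[np.argmax(logits)] = 1.0
        return p
    z = logits / T
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
```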
Top-k sampling
Restricting the vocabulary at each step to the k tokens with the highest scores.
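A sketch of the filtering step: probabilities outside the top k are zeroed and the rest renormalized.

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep the k highest-probability tokens and renormalize.
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]        # indices of the top k entries
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```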
Top-p sampling
Restricts the vocabulary by keeping the smallest set of most probable tokens whose combined probability mass meets or exceeds a threshold p.
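The same filtering idea with a probability-mass cutoff instead of a fixed count:

```python
import numpy as np

def top_p_filter(probs, p):
    # Keep the smallest descending-probability prefix whose mass >= p.
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # first index reaching p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```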
Logit biasing
Biasing the logits of the model to favor certain tokens. This can be used to prevent the model from generating harmful content, or to make it generate content that is more aligned with a certain style.
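A minimal sketch; the token ids and bias values are arbitrary, and in practice a large negative bias effectively bans a token:

```python
def apply_logit_bias(logits, bias):
    # bias maps token_id -> additive adjustment to that token's logit.
    out = list(logits)
    for token_id, b in bias.items():
        out[token_id] += b
    return out

# Banning token 2 shifts the argmax to the next-best token.
biased = apply_logit_bias([1.0, 2.0, 3.0], {2: -100.0})
```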
Presence penalty
Applies a flat penalty to the logits of tokens that already appear in the generated text.
Frequency penalty
Penalizes a token's logit proportionally to the number of times it has already occurred in the text.
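The two penalties can be sketched together. This follows the common additive formulation (a flat presence term plus a count-proportional frequency term), applied to a toy logit vector:

```python
from collections import Counter

def penalize_logits(logits, generated_ids, presence=0.0, frequency=0.0):
    # presence: flat penalty, applied once per token already in the text;
    # frequency: penalty scaled by how many times the token occurred.
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        out[token_id] -= presence + frequency * count
    return out
```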
Beam size
A hyperparameter of beam search, the number of sequences that are kept at each step.
A larger beam size explores more candidate sequences and can find higher-probability outputs, but it also makes inference significantly slower.
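A toy beam search over context-free per-step distributions (a real decoder conditions each step's distribution on the prefix; here the distributions are fixed to keep the sketch short):

```python
import math

def beam_search(step_probs, beam_size):
    # Each beam is a (token sequence, log-probability) pair.
    beams = [((), 0.0)]
    for probs in step_probs:
        candidates = [
            (seq + (tok,), score + math.log(p))
            for seq, score in beams
            for tok, p in enumerate(probs)
        ]
        # Keep only the beam_size highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams
```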
Flash decoding
Parallelizes the QK product calculation over the sequence length; the softmax and the output are calculated after the parallel processing is done
→ higher GPU utilization can be achieved
Flashdecoding++
It uses a fixed global constant based on activation statistics to prevent overflow of the exponential in the softmax, so the elements of the softmax can be calculated in parallel. If the method encounters an overflow, it recomputes the softmax with the actual maximum value, but this should happen with < 1% probability.
Phases of inference
Prefill
Decoding
Prefill inference step
The user prompt is processed; K and V are calculated and cached. This can be done in a single pass, and the prompt may be a much longer sequence than the generated output. This phase also includes generating the first output token.
Decoding inference step
The iterative process of generating the next token and calculating the next K and V. This cannot be parallelized over the sequence, but K and V can be reused from the cache; only a single Q needs to be computed for each pass.
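The two phases can be sketched with a toy single-head attention and an explicit KV cache (the weights and token embeddings are random placeholders):

```python
import numpy as np

def attend(q, K, V):
    # Single-query attention over all cached keys and values.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill: the whole prompt is processed in one pass; K and V are cached.
prompt = rng.normal(size=(6, d))
K_cache, V_cache = prompt @ W_k, prompt @ W_v

# Decoding: one token at a time; the cached K/V rows are reused and only
# a single new q/k/v row is computed per step.
for _ in range(3):
    x = rng.normal(size=(d,))  # embedding of the latest generated token
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    out = attend(x @ W_q, K_cache, V_cache)
```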