Flashcards about The Illustrated Transformer.
The Transformer
A model that uses attention to speed up training; its architecture lends itself well to parallelization.
Black Box
A view of the whole model in a machine translation setting: a sentence in one language goes in, and its translation in another comes out.
Encoding Component
The component of the Transformer model responsible for processing the input sequence, consisting of a stack of encoders.
Decoding Component
The component of the Transformer model that generates the output sequence, consisting of a stack of decoders.
Self-Attention Layer
A layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
Word Embedding
The step that turns each input word into a vector using an embedding algorithm; in the post, each word becomes a vector of size 512.
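A minimal sketch of the lookup, assuming a toy vocabulary and a randomly initialized embedding table (a real table is learned during training):

```python
import numpy as np

d_model = 512                      # embedding size used in the post
vocab = {"je": 0, "suis": 1, "etudiant": 2}
embedding_table = np.random.randn(len(vocab), d_model)  # learned in practice

# Each input word becomes a vector via a simple table lookup.
sentence = ["je", "suis", "etudiant"]
x = np.stack([embedding_table[vocab[w]] for w in sentence])  # shape (3, 512)
```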
Query, Key, and Value Vectors
Vectors created by multiplying each word's embedding by three trained weight matrices; they are the ingredients of the self-attention calculation.
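A minimal numpy sketch of the self-attention calculation built from these vectors: multiply each embedding by trained matrices WQ, WK, WV, scale the query-key scores by the square root of the key dimension, softmax them, and take the weighted sum of the values (the random matrices here are illustrative stand-ins for trained weights):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 512, 64
x = np.random.randn(3, d_model)               # one embedding per input word

# Trained projection matrices (random here for illustration).
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_k)     # how much each word attends to each other word
weights = softmax(scores, axis=-1)
z = weights @ V                     # shape (3, 64): one output vector per word
```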
Multi-Headed Attention
A mechanism that improves the performance of the attention layer by allowing the model to focus on different positions and providing multiple representation subspaces.
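A sketch of the multi-head wrapper around the single-head calculation above: each of the 8 heads has its own projection matrices, the head outputs are concatenated, and a final trained matrix brings the result back to model size (all weights random here for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(W_k.shape[1]), axis=-1) @ V

d_model, n_heads = 512, 8
d_k = d_model // n_heads                      # 64 dimensions per head
x = np.random.randn(3, d_model)

# One independent set of projections per head: separate representation subspaces.
heads = [attention(x,
                   np.random.randn(d_model, d_k),
                   np.random.randn(d_model, d_k),
                   np.random.randn(d_model, d_k))
         for _ in range(n_heads)]

W_o = np.random.randn(n_heads * d_k, d_model)
z = np.concatenate(heads, axis=-1) @ W_o      # condensed back to shape (3, 512)
```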
Positional Encoding
Vectors added to input embeddings to account for the order of words in the input sequence.
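One concrete recipe is the sinusoidal encoding from the original paper; a sketch (the exact formula can vary, as the post notes):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # Sine on even indices, cosine on odd indices, at geometrically spaced
    # frequencies, so every position gets a unique pattern.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

x = np.random.randn(3, 512)           # word embeddings
x = x + positional_encoding(3, 512)   # inject word-order information
```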
Residual Connection
A connection around each sub-layer in the encoder (self-attention and the feed-forward neural network), followed by a layer-normalization step.
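A sketch of this Add & Normalize step, with layer normalization shown in a simplified form (without the learned gain and bias parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified: normalize each position's vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer's input to its output,
    # then layer-normalize the sum.
    return layer_norm(x + sublayer(x))

x = np.random.randn(3, 512)
out = add_and_norm(x, lambda h: h @ np.random.randn(512, 512) * 0.01)
```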
Encoder-Decoder Attention
A decoder layer that helps the decoder focus on relevant places in the input sequence; it creates its queries from the layer below it and takes its keys and values from the output of the encoder stack.
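A sketch of how it differs from self-attention: the queries come from the decoder side, while the keys and values come from the encoder stack's output (shapes and random weights are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 512, 64
decoder_x = np.random.randn(2, d_model)    # 2 output positions produced so far
encoder_out = np.random.randn(3, d_model)  # encoded input sentence (3 words)

W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q = decoder_x @ W_q                  # queries from the decoder
K = encoder_out @ W_k                # keys and values from the encoder output
V = encoder_out @ W_v
z = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V   # shape (2, 64)
```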
Linear Layer
A fully connected neural network that projects the vector produced by the stack of decoders into a logits vector.
Softmax Layer
Turns scores from the Linear layer into probabilities for each word in the vocabulary.
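A sketch covering both of these final steps: project the decoder's output vector to vocabulary-sized logits, then softmax the logits into probabilities (toy sizes, random weights):

```python
import numpy as np

vocab_size, d_model = 10000, 512
decoder_output = np.random.randn(d_model)         # vector for one output position

W = np.random.randn(d_model, vocab_size) * 0.01   # trained projection (Linear layer)
logits = decoder_output @ W                       # one score per vocabulary word

exp = np.exp(logits - logits.max())               # Softmax layer
probs = exp / exp.sum()                           # probabilities summing to 1
predicted_word_index = int(np.argmax(probs))
```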
One-Hot Encoding
A vector representation of a word that is all zeros except for a 1 at the cell corresponding to that word's index in the vocabulary.
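For example, with the post's toy six-word vocabulary:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    # All zeros except a 1 at the cell for this word.
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("am"))   # [0. 1. 0. 0. 0. 0.]
```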
Loss Function
A metric that measures how far the model's output probability distribution is from the desired output; it is minimized during training (the post mentions cross-entropy and Kullback-Leibler divergence).
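A sketch of cross-entropy, one of the losses the post mentions, comparing the model's output distribution against a one-hot target:

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    # Lower when the predicted distribution puts more mass on the target word.
    return -np.sum(target * np.log(predicted + eps))

target = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])        # one-hot for "am"
good = np.array([0.05, 0.80, 0.05, 0.04, 0.03, 0.03])
bad  = np.array([0.80, 0.05, 0.05, 0.04, 0.03, 0.03])
print(cross_entropy(target, good))   # small loss
print(cross_entropy(target, bad))    # larger loss
```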
Expected Output
In the post's training example translating "merci" into "thanks", the target output: a probability distribution that assigns all probability to the word "thanks".
Greedy Decoding
A decoding method where the word with the highest probability is selected at each step.
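A sketch, assuming a hypothetical model_step(prefix) that returns a probability distribution over the vocabulary for the next position:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def greedy_decode(model_step, max_len=20):
    output = []
    for _ in range(max_len):
        probs = model_step(output)            # distribution over the vocabulary
        word = vocab[int(np.argmax(probs))]   # keep only the single best word
        if word == "<eos>":
            break
        output.append(word)
    return output

# Toy stand-in for the real model: predicts "thanks", then end-of-sentence.
def toy_step(prefix):
    p = np.full(len(vocab), 0.01)
    p[vocab.index("<eos>" if prefix else "thanks")] = 0.95
    return p

print(greedy_decode(toy_step))   # ['thanks']
```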
Beam Search
A decoding method that holds onto the top few candidate words at each position (the beam size), runs the model once for each, and keeps whichever partial translations produce less error.
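A sketch with beam size 2, again assuming a hypothetical model_step(prefix) that scores the next word given a partial translation:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def beam_search(model_step, beam_size=2, max_len=20):
    # Each beam is a (partial output, total log-probability) pair.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))    # finished beam is kept as-is
                continue
            probs = model_step(prefix)
            for i, p in enumerate(probs):
                candidates.append((prefix + [vocab[i]], score + np.log(p)))
        # Keep only the beam_size highest-scoring partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(b[0] and b[0][-1] == "<eos>" for b in beams):
            break
    return beams[0][0]   # best-scoring translation
```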