Transformers

Description and Tags

This set covers the core NLP / transformer concepts.



1

Transformers

A Transformer combines an encoder and a decoder.

2

Encoder

Encoders help with tasks such as text classification and sentiment analysis.

Ex: BERT - Bidirectional Encoder Representations from Transformers.

An encoder-only model cannot generate text.
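
A minimal sketch of using an encoder-only model for sentiment analysis via the Hugging Face transformers pipeline (assumed to be installed); the checkpoint name is just a common example, not something prescribed by this card.

```python
from transformers import pipeline

# Encoder-only (BERT-style) models are used through task heads,
# e.g. a classification head for sentiment analysis.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # assumed example checkpoint
)

print(classifier("The movie was surprisingly good!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```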

3

Decoder only

Decoder-only models are used for text generation.

Ex: GPT - Generative Pre-trained Transformer.
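
A minimal sketch of decoder-only text generation, assuming the transformers library; gpt2 is just an illustrative checkpoint.

```python
from transformers import pipeline

# Decoder-only (GPT-style) models repeatedly predict the next token,
# which is what makes them suited for open-ended text generation.
generator = pipeline("text-generation", model="gpt2")  # assumed example checkpoint

print(generator("Transformers are", max_new_tokens=20)[0]["generated_text"])
```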

4

What is a sequence-to-sequence (seq2seq) model?

Encoder + decoder.

This is used for language translation.

Ex: BART - Bidirectional and Auto-Regressive Transformer.

Also good for summarization.
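
A minimal summarization sketch with a seq2seq (encoder + decoder) model, assuming the transformers library; facebook/bart-large-cnn is just one commonly used checkpoint.

```python
from transformers import pipeline

# Seq2seq models such as BART encode the input and decode a new sequence,
# which fits summarization and translation.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # assumed example checkpoint

long_text = ("Transformers use self-attention to relate every token to every "
             "other token, which helps them model long-range dependencies. ") * 8
print(summarizer(long_text, max_length=30, min_length=10)[0]["summary_text"])
```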

5

What is the key difference between an encoder and a decoder?

The decoder generates output autoregressively from a prompt, while the encoder only builds a representation of its input and does not generate text.

6

Context embedding

Contextual embeddings capture the meaning of each word from its surrounding context; in BERT they are built bidirectionally (using both left and right context).

7

Self Attention vs cross Attention

Feature | Self-Attention | Cross-Attention
Definition | Focuses on different parts of the same input sequence. | Focuses on different parts of another sequence.
Usage | Used in both encoder and decoder layers of transformers. | Typically used in decoder layers of transformers.
Input | Single sequence (e.g., a sentence). | Two sequences (e.g., a query and a context).
Purpose | Captures relationships within the same sequence. | Captures relationships between two different sequences.
Example | Understanding word dependencies in a sentence. | Aligning a question with a relevant context passage.
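
A toy NumPy sketch of the difference in the table above: self-attention takes Q, K, V from the same sequence, while cross-attention takes Q from one sequence and K, V from another. It omits the learned projections and multiple heads of a real transformer; all names and shapes are illustrative.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 8
x = np.random.randn(5, d)   # one sequence (e.g. a sentence), 5 tokens
y = np.random.randn(7, d)   # another sequence (e.g. a context passage), 7 tokens

self_attn  = attention(q=x, k=x, v=x)   # Q, K, V all from the same sequence
cross_attn = attention(q=x, k=y, v=y)   # Q from one sequence, K and V from the other

print(self_attn.shape, cross_attn.shape)  # (5, 8) (5, 8)
```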

8

Decoder is used for

The decoder is used to predict the next token.

Ex: "Akhil went to MYR" ← here "MYR" is the predicted token.

9

BERT

Bidirectional encoder, developed by Google.

Used for text classification, masked word prediction (fill-in-the-blank), and Q&A.

Ex: "akhil loves the amazon forest." vs "anesh works at amazon" — the word "amazon" gets a different contextual embedding in each sentence.
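
A rough sketch of the example above, assuming the transformers and torch packages, the bert-base-uncased checkpoint, and that "amazon" is a single token in that vocabulary; the embedding_of helper is purely illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Return the contextual embedding of `word` inside `sentence`
    # (assumes `word` maps to a single vocabulary token).
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden_dim)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

e1 = embedding_of("akhil loves the amazon forest", "amazon")
e2 = embedding_of("anesh works at amazon", "amazon")
print(torch.cosine_similarity(e1, e2, dim=0))  # noticeably below 1.0: different contexts
```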

10

Self attention mechanism

Ex: "akhil is working in CTS. He is working as a DS."

Self-attention links "He" back to "akhil" within the same input.

11

BERT fine-tuning

Fine-tuning means further training a pre-trained model on a downstream task (a minimal sketch follows the list below).

BERT is pre-trained on:

  • Masked language modeling (MLM)

  • Next sentence prediction (NSP)
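
A toy fine-tuning sketch, assuming the transformers and datasets libraries; the two-example dataset, label count, and output directory are placeholders, not part of the original notes.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from pre-trained BERT weights and train a small classification
# head on a (toy) labeled dataset.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

data = Dataset.from_dict({"text": ["great movie", "terrible movie"], "label": [1, 0]})
data = data.map(lambda x: tok(x["text"], truncation=True, padding="max_length", max_length=16))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```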

12

Masked language modeling (MLM) in BERT

MLM masks a few random tokens and trains the model to predict them.

Unlike a plain left-to-right language model, which works in one direction, this objective lets BERT use context from both directions.
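
A minimal fill-mask sketch of MLM at inference time, assuming the transformers library; the sentence and checkpoint are illustrative.

```python
from transformers import pipeline

# Masked language modeling: hide a token and let BERT predict it using
# context from both sides of the mask.
fill = pipeline("fill-mask", model="bert-base-uncased")  # assumed example checkpoint

for pred in fill("Transformers use [MASK] to relate tokens in a sentence.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```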

13

Language models are classified as

  • Autoregressive (ARLM) - unidirectional, e.g. OpenAI GPT; suited to generation and summarization

  • Autoencoding (AELM) - bidirectional, e.g. BERT-style transformer encoders

14
Next sentence prediction (NSP) in BERT

BERT is given pairs of sentences. For each pair, it predicts whether the second sentence follows the first sentence in the original text. This helps BERT understand the relationship between sentences, which is crucial for tasks like question answering and natural language inference.
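
A rough sketch of scoring a sentence pair with BERT's NSP head, assuming the transformers and torch packages; the sentences are illustrative.

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Score whether sentence B plausibly follows sentence A.
enc = tok("He went to the store.", "He bought some milk.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits       # index 0 = "B follows A", index 1 = "B is random"
print(torch.softmax(logits, dim=-1))
```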

15

RoBERTa - Robustly Optimized BERT Approach

Best suited to large datasets, where its performance exceeds BERT's.

Training time is longer and the vocabulary is larger.

16

DistilBERT

Smaller model size and higher speed, but somewhat lower accuracy.

Same general architecture as BERT, but with fewer encoder layers (trained by knowledge distillation).

17

ALBERT

A Lite BERT.

Fewer parameters and somewhat lower performance/accuracy than BERT.

Training is about 1.7x faster than BERT.

18

Cross-layer parameter sharing

Used in ALBERT: parameters are shared across layers, which reduces the total number of parameters in the encoder.

19

Cross-layer parameter sharing types

Feed-forward sharing: only the feed-forward parameters are shared across layers.

Attention sharing: only the multi-head attention parameters are shared across layers.

All shared: all parameters are shared across layers.

20

Self-attention mechanism is also called

Intra-attention.

It allows the model to focus on the most relevant parts of the input.

21

What is the most commonly used activation function in a Transformer block?

ReLU (used in the position-wise feed-forward sublayer rather than in the attention itself).

22

How do we reduce vanishing gradients in multi-head attention layers?

Layer normalization (together with the residual connections around each sublayer).
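
A minimal PyTorch sketch tying the last two cards together: a post-norm Transformer block whose feed-forward sublayer uses ReLU, with each sublayer wrapped in a residual connection plus layer normalization (which is what keeps gradients flowing). Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ff(x))        # residual connection + layer norm
        return x

block = TransformerBlock()
print(block(torch.randn(2, 10, 64)).shape)    # (batch=2, tokens=10, d_model=64)
```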

23

Attention mechanism

This lets the model focus on particular parts of the input sequence.

Attention weights are computed from the dot product of queries (Q) and keys (K), scaled by the square root of the key dimension and passed through a softmax; the weights are then applied to the values (V): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

24

Q, K, V full form

Query, Key, Value.

25
Contextual windowing vs positional encoding


Feature | Contextual Windowing | Positional Encoding
Definition | Divides input into smaller, manageable windows or chunks. | Adds positional information to each token in the sequence.
Purpose | Helps manage long sequences by focusing on smaller parts. | Helps the model understand the order of tokens in a sequence.
Usage | Often used in models dealing with long texts or sequences. | Used in transformer models to retain sequence order.
Input Handling | Processes chunks independently or with limited overlap. | Processes the entire sequence with positional context.
Example | Splitting a long document into paragraphs for analysis. | Encoding positions of words in a sentence to maintain order.
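
A NumPy sketch of the sinusoidal positional encoding from the original Transformer paper: each position gets a distinct pattern of sines and cosines that is added to the token embeddings so order is recoverable. Shapes are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model)[None, :]                       # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16); every row (position) has a unique pattern
```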
