Transformers

Description and Tags

This set covers the core NLP / transformer concepts.



1

Transformers

A Transformer combines an encoder and a decoder.

2

Encoder

Encoders help with tasks such as text classification and sentiment analysis.

Ex: BERT - Bidirectional Encoder Representations from Transformers.

An encoder-only model cannot generate text.
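
A minimal sketch of using an encoder-only model for sentiment analysis via the Hugging Face transformers pipeline (assumed to be installed); the checkpoint name is just a common example, not something prescribed by this card.

```python
from transformers import pipeline

# Encoder-only (BERT-style) models are used through task heads,
# e.g. a classification head for sentiment analysis.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # assumed example checkpoint
)

print(classifier("The movie was surprisingly good!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```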

3

Decoder only

Decoder-only models are used for text generation.

Ex: GPT - Generative Pre-trained Transformer.
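
A minimal sketch of decoder-only text generation, assuming the transformers library; gpt2 is just an illustrative checkpoint.

```python
from transformers import pipeline

# Decoder-only (GPT-style) models repeatedly predict the next token,
# which is what makes them suited for open-ended text generation.
generator = pipeline("text-generation", model="gpt2")  # assumed example checkpoint

print(generator("Transformers are", max_new_tokens=20)[0]["generated_text"])
```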

4

What is a sequence-to-sequence (seq2seq) model?

Encoder + decoder.

This is used for language translation.

Ex: BART - Bidirectional and Auto-Regressive Transformer.

Also good for summarization.
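
A minimal summarization sketch with a seq2seq (encoder + decoder) model, assuming the transformers library; facebook/bart-large-cnn is just one commonly used checkpoint.

```python
from transformers import pipeline

# Seq2seq models such as BART encode the input and decode a new sequence,
# which fits summarization and translation.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # assumed example checkpoint

long_text = ("Transformers use self-attention to relate every token to every "
             "other token, which helps them model long-range dependencies. ") * 8
print(summarizer(long_text, max_length=30, min_length=10)[0]["summary_text"])
```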

5

What is the key difference between an encoder and a decoder?

The decoder generates output autoregressively from a prompt, while the encoder only builds a representation of its input and does not generate text.

6

Context embedding

Contextual embeddings capture the meaning of each word from its surrounding context; in BERT they are built bidirectionally (using both left and right context).

7

Self Attention vs cross Attention

Feature | Self-Attention | Cross-Attention
Definition | Focuses on different parts of the same input sequence. | Focuses on different parts of another sequence.
Usage | Used in both encoder and decoder layers of transformers. | Typically used in decoder layers of transformers.
Input | Single sequence (e.g., a sentence). | Two sequences (e.g., a query and a context).
Purpose | Captures relationships within the same sequence. | Captures relationships between two different sequences.
Example | Understanding word dependencies in a sentence. | Aligning a question with a relevant context passage.
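
A toy NumPy sketch of the difference in the table above: self-attention takes Q, K, V from the same sequence, while cross-attention takes Q from one sequence and K, V from another. It omits the learned projections and multiple heads of a real transformer; all names and shapes are illustrative.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 8
x = np.random.randn(5, d)   # one sequence (e.g. a sentence), 5 tokens
y = np.random.randn(7, d)   # another sequence (e.g. a context passage), 7 tokens

self_attn  = attention(q=x, k=x, v=x)   # Q, K, V all from the same sequence
cross_attn = attention(q=x, k=y, v=y)   # Q from one sequence, K and V from the other

print(self_attn.shape, cross_attn.shape)  # (5, 8) (5, 8)
```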

8

Decoder is used for

The decoder is used to predict the next token.

Ex: "Akhil went to MYR" ← here "MYR" is the predicted token.

9

BERT

Bidirectional encoder, developed by Google.

Used for text classification, masked word prediction (fill-in-the-blank), and Q&A.

Ex: "akhil loves the amazon forest." vs "anesh works at amazon" — the word "amazon" gets a different contextual embedding in each sentence.
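
A rough sketch of the example above, assuming the transformers and torch packages, the bert-base-uncased checkpoint, and that "amazon" is a single token in that vocabulary; the embedding_of helper is purely illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Return the contextual embedding of `word` inside `sentence`
    # (assumes `word` maps to a single vocabulary token).
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden_dim)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

e1 = embedding_of("akhil loves the amazon forest", "amazon")
e2 = embedding_of("anesh works at amazon", "amazon")
print(torch.cosine_similarity(e1, e2, dim=0))  # noticeably below 1.0: different contexts
```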

10

Self attention mechanism

Ex: "akhil is working in CTS. He is working as a DS."

Self-attention links "He" back to "akhil" within the same input.

11

BERT fine-tuning

Fine-tuning means further training a pre-trained model on a downstream task (a minimal sketch follows the list below).

BERT is pre-trained on:

  • Masked language modeling (MLM)

  • Next sentence prediction (NSP)
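
A toy fine-tuning sketch, assuming the transformers and datasets libraries; the two-example dataset, label count, and output directory are placeholders, not part of the original notes.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from pre-trained BERT weights and train a small classification
# head on a (toy) labeled dataset.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

data = Dataset.from_dict({"text": ["great movie", "terrible movie"], "label": [1, 0]})
data = data.map(lambda x: tok(x["text"], truncation=True, padding="max_length", max_length=16))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```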

12

Masked language modeling (MLM) in BERT

MLM masks a few random tokens and trains the model to predict them.

Unlike a plain left-to-right language model, which works in one direction, this objective lets BERT use context from both directions.
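
A minimal fill-mask sketch of MLM at inference time, assuming the transformers library; the sentence and checkpoint are illustrative.

```python
from transformers import pipeline

# Masked language modeling: hide a token and let BERT predict it using
# context from both sides of the mask.
fill = pipeline("fill-mask", model="bert-base-uncased")  # assumed example checkpoint

for pred in fill("Transformers use [MASK] to relate tokens in a sentence.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```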

13

Language models are classified as

  • Autoregressive (ARLM) - unidirectional, e.g. OpenAI GPT; suited to generation and summarization

  • Autoencoding (AELM) - bidirectional, e.g. BERT-style transformer encoders

14
Next sentence prediction (NSP) in BERT

BERT is given pairs of sentences. For each pair, it predicts whether the second sentence follows the first sentence in the original text. This helps BERT understand the relationship between sentences, which is crucial for tasks like question answering and natural language inference.
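
A rough sketch of scoring a sentence pair with BERT's NSP head, assuming the transformers and torch packages; the sentences are illustrative.

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Score whether sentence B plausibly follows sentence A.
enc = tok("He went to the store.", "He bought some milk.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits       # index 0 = "B follows A", index 1 = "B is random"
print(torch.softmax(logits, dim=-1))
```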

15

RoBERTa - Robustly Optimized BERT Approach

Best suited to large datasets, where its performance exceeds BERT's.

Training time is longer and the vocabulary is larger.

16

DistilBERT

Smaller model size and higher speed, but somewhat lower accuracy.

Same general architecture as BERT, but with fewer encoder layers (trained by knowledge distillation).

17

ALBERT

A Lite BERT.

Fewer parameters and somewhat lower performance/accuracy than BERT.

Training is about 1.7x faster than BERT.

18

Cross-layer parameter sharing

Used in ALBERT: parameters are shared across layers, which reduces the total number of parameters in the encoder.

19

Cross-layer parameter sharing types

Feed-forward sharing: only the feed-forward parameters are shared across layers.

Attention sharing: only the multi-head attention parameters are shared across layers.

All shared: all parameters are shared across layers.

20

Self-attention mechanism is also called

Intra-attention.

It allows the model to focus on the most relevant parts of the input.

21

What is the most commonly used activation function in a Transformer block?

ReLU (used in the position-wise feed-forward sublayer rather than in the attention itself).

22

How do we reduce vanishing gradients in multi-head attention layers?

Layer normalization (together with the residual connections around each sublayer).
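
A minimal PyTorch sketch tying the last two cards together: a post-norm Transformer block whose feed-forward sublayer uses ReLU, with each sublayer wrapped in a residual connection plus layer normalization (which is what keeps gradients flowing). Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ff(x))        # residual connection + layer norm
        return x

block = TransformerBlock()
print(block(torch.randn(2, 10, 64)).shape)    # (batch=2, tokens=10, d_model=64)
```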

23

Attention mechanism

This lets the model focus on particular parts of the input sequence.

Attention weights are computed from the dot product of queries (Q) and keys (K), scaled by the square root of the key dimension and passed through a softmax; the weights are then applied to the values (V): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

24

Q, K, V full form

Query, Key, Value.

25
Contextual windowing vs positional encoding


Feature | Contextual Windowing | Positional Encoding
Definition | Divides input into smaller, manageable windows or chunks. | Adds positional information to each token in the sequence.
Purpose | Helps manage long sequences by focusing on smaller parts. | Helps the model understand the order of tokens in a sequence.
Usage | Often used in models dealing with long texts or sequences. | Used in transformer models to retain sequence order.
Input Handling | Processes chunks independently or with limited overlap. | Processes the entire sequence with positional context.
Example | Splitting a long document into paragraphs for analysis. | Encoding positions of words in a sentence to maintain order.
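
A NumPy sketch of the sinusoidal positional encoding from the original Transformer paper: each position gets a distinct pattern of sines and cosines that is added to the token embeddings so order is recoverable. Shapes are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model)[None, :]                       # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16); every row (position) has a unique pattern
```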
