7. Scaling LLM Training

7 Terms

1

What are FLOPs?

  • “Floating Point Operations” => a measure of the compute a model requires.
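
A back-of-the-envelope sketch of FLOP counting, assuming the common ≈ 6 × parameters × tokens heuristic for total training compute (the model size and token count below are made-up examples):

```python
# Rough training-compute estimate using the common heuristic:
# total training FLOPs ≈ 6 * n_params * n_tokens
# (≈ 2 * n_params per token for the forward pass, ≈ 4 * n_params for the backward pass).
n_params = 7e9    # hypothetical 7B-parameter model
n_tokens = 2e12   # hypothetical 2T training tokens

train_flops = 6 * n_params * n_tokens
print(f"≈ {train_flops:.1e} FLOPs")   # ≈ 8.4e+22 FLOPs
```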

2

What are the key differences between model inference and training?

  • Per token, training takes roughly 3x the compute of inference (forward + backward pass vs. forward pass only).

  • Training must hold activations in memory for the backward pass; inference does not.

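A minimal sketch of where the ≈3x comes from, assuming the standard per-token approximation for an N-parameter model (forward ≈ 2N FLOPs, backward ≈ 4N FLOPs):

```python
n_params = 7e9                                   # hypothetical model size

inference_flops_per_token = 2 * n_params         # forward pass only
training_flops_per_token = (2 + 4) * n_params    # forward + backward pass

print(training_flops_per_token / inference_flops_per_token)   # 3.0
```
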
3

What is Activation Checkpointing?

  • Trade memory for compute in model training.

  • Drop activations from memory and re-compute them when needed => keep some activations as “checkpoints”.

=> Re-computing an activation from the very beginning is slow; checkpoints spread throughout the network keep the average re-computation short.

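A minimal PyTorch sketch of activation checkpointing using torch.utils.checkpoint; the toy blocks below are assumptions, not the card's model:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy stand-ins for transformer blocks.
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()) for _ in range(8)]
)

def forward(x, use_checkpointing=True):
    for block in blocks:
        if use_checkpointing:
            # Intermediate activations inside `block` are dropped and
            # re-computed from the checkpointed input during backward().
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)   # all intermediate activations stay in memory
    return x

x = torch.randn(16, 512, requires_grad=True)
loss = forward(x).mean()
loss.backward()   # triggers re-computation of the dropped activations
```
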
4

What is Gradient Accumulation?

  • Trade memory for time in model training: peak memory drops while total compute stays roughly the same.

  • Problem: we want to train with a large batch size that does not fit in memory.

  • Solution: run multiple forward-backward passes on smaller micro-batches and accumulate (average) their gradients before doing a single optimizer step.

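A minimal PyTorch sketch of gradient accumulation; the model, data, and micro-batch count are hypothetical:

```python
import torch

model = torch.nn.Linear(1024, 1024)    # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                        # 8 micro-batches = 1 effective large batch

opt.zero_grad()
for micro_step in range(accum_steps):
    x = torch.randn(4, 1024)           # small micro-batch that fits in memory
    loss = model(x).pow(2).mean()
    # Dividing by accum_steps makes the summed gradients equal the mean
    # over the full effective batch.
    (loss / accum_steps).backward()    # gradients accumulate in .grad

opt.step()                             # one optimizer step for the whole effective batch
opt.zero_grad()
```
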
5

How does the memory footprint scale for Activations?

  • Short sequences => memory for activations is negligible.

  • Long sequences => activation memory can become far larger than params + grads + optimizer state combined.

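A very rough sketch of this scaling; the per-layer activation constant below is a loose assumption (it ignores attention-score matrices and implementation details), and the 16 bytes/param for weights + grads + optimizer state follows the usual mixed-precision Adam accounting:

```python
# All sizes are illustrative assumptions for a hypothetical 7B-parameter model.
n_params = 7e9
n_layers, hidden, batch = 32, 4096, 8

# Weights + grads + optimizer state: ~16 bytes per parameter (mixed-precision Adam).
static_gb = 16 * n_params / 1e9

for seq_len in (512, 8192):
    # Crude activation estimate: ~10 fp16 tensors of shape (batch, seq, hidden)
    # per layer; real numbers depend heavily on the architecture.
    act_gb = 2 * 10 * batch * seq_len * hidden * n_layers / 1e9
    print(f"seq={seq_len}: static ≈ {static_gb:.0f} GB, activations ≈ {act_gb:.0f} GB")
```
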
6

What needs to be in memory when training an LLM?

  • Parameters

  • Gradients

  • Optimizer state (e.g. Adam’s momentum and variance)

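A small sketch of the usual bytes-per-parameter accounting for mixed-precision Adam training (the 7B parameter count is a made-up example):

```python
n_params = 7e9   # hypothetical 7B-parameter model

bytes_per_param = {
    "fp16 parameters": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

for name, b in bytes_per_param.items():
    print(f"{name}: {b * n_params / 1e9:.0f} GB")

total = sum(bytes_per_param.values()) * n_params   # 16 bytes per parameter
print(f"total (excluding activations): {total / 1e9:.0f} GB")   # 112 GB for 7B params
```
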
7

What is Data Parallelism in LLM Training?

  • Run different batches of data in parallel on different chips.

  • Model weights must be duplicated across chips.

  • After gradients are computed, they must be communicated (averaged) across the chips so that all replicas stay in sync.

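A minimal PyTorch DistributedDataParallel sketch (the model, sizes, and launch command are assumptions): DDP replicates the weights on every device and all-reduces the gradients during backward().

```python
# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=4 train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in; weights replicated on every chip
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")  # each rank sees a different batch
        loss = ddp_model(x).pow(2).mean()
        loss.backward()                           # DDP all-reduces gradients here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```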
