6. Scaling LLM Training

4 Terms

1

What are FLOPs?

  • Measure of the compute a model requires => “Floating Point Operations” (see the rough estimate below).
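
A common back-of-envelope approximation (not stated on the card, but widely used) is that training a dense transformer costs roughly 6 FLOPs per parameter per training token, versus roughly 2 FLOPs for a forward (inference) pass. A minimal sketch under that assumption, with made-up model and dataset sizes:

```python
# Rough FLOPs estimate for dense-transformer training.
# Assumption: ~6 FLOPs per parameter per token for training
# (forward ~2, backward ~4); the sizes below are illustrative only.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: ~6 * N * D."""
    return 6.0 * n_params * n_tokens

def forward_flops(n_params: float, n_tokens: float) -> float:
    """Approximate forward-only (inference) compute: ~2 * N * D."""
    return 2.0 * n_params * n_tokens

n_params = 7e9    # hypothetical 7B-parameter model
n_tokens = 1e12   # hypothetical 1T training tokens
print(f"training : {training_flops(n_params, n_tokens):.2e} FLOPs")
print(f"forward  : {forward_flops(n_params, n_tokens):.2e} FLOPs")
```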

2

What are key differences between model Inference and Training?

  • Training takes roughly 3x more compute than inference: the backward pass costs about twice as much as the forward pass.

  • Training holds activations in memory for the backward pass (see the sketch below).

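A minimal PyTorch sketch of the second point (the toy model and sizes are made up): under `torch.no_grad()` no autograd graph is built, so intermediate activations can be freed right away, while a normal training forward pass keeps them alive until `backward()` runs.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; only the contrast between the two modes matters.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024)

# Inference: no graph is recorded, intermediate activations are not retained.
with torch.no_grad():
    y_infer = model(x)
print(y_infer.requires_grad)      # False -> nothing kept for a backward pass

# Training: the autograd graph (and the activations it references) stays in
# memory until backward() consumes it.
y_train = model(x)
loss = y_train.pow(2).mean()
print(loss.grad_fn is not None)   # True -> activations are being held
loss.backward()                   # backward pass uses the stored activations
```
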
3

What is Activation Checkpointing?

  • Trade off memory for compute in model training.

  • Drop activations from memory and re-compute them when needed => keep some activations as “Checkpoints” (see the sketch below).

=> Re-computing activations from the very beginning of the network is slow; checkpoints placed throughout make the average re-computation much shorter.

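A minimal PyTorch sketch of the idea using `torch.utils.checkpoint` (the block structure is made up): only each block's input is kept as a “checkpoint”, and the activations inside a block are re-computed when the backward pass reaches it.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical stack of blocks; each block is one checkpointed segment.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)

def forward_with_checkpointing(x: torch.Tensor) -> torch.Tensor:
    for block in blocks:
        # Inside the block, activations are dropped after the forward pass
        # and re-computed from the block's input during backward.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
loss = forward_with_checkpointing(x).mean()
loss.backward()   # re-runs each block's forward to rebuild its activations
```
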
4

What is Gradient Accumulation?

  • Trade off memory for compute in model training.

  • Problem: we want to run a large batch size that does not fit in memory.

  • Solution: run multiple forward-backward passes before each optimizer step, keeping a running mean of the gradients (see the sketch below).

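A minimal PyTorch sketch of the pattern (model, data, and step counts are made up): gradients from several small micro-batches are accumulated in `.grad` before a single optimizer step, and dividing each loss by the number of micro-batches makes the accumulated gradient equal the mean over the effective large batch.

```python
import torch
import torch.nn as nn

# Hypothetical model and data; only the accumulation pattern is the point.
model = nn.Linear(1024, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4   # 4 micro-batches form one effective large batch

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 1024)                # micro-batch that fits in memory
    y = torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)

    # Scaling by 1/accum_steps turns the summed gradients into the mean
    # over the whole effective batch.
    (loss / accum_steps).backward()         # gradients accumulate in .grad

    if (step + 1) % accum_steps == 0:
        optimizer.step()                    # one update per accum_steps passes
        optimizer.zero_grad()
```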