7. Scaling LLM Training

7 Terms

1

What are FLOPs?

  • “Floating Point Operations” => a measure of the compute a model requires.
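
A back-of-the-envelope sketch of FLOP counting, assuming the common ≈ 6 × parameters × tokens heuristic for total training compute (the model size and token count below are made-up examples):

```python
# Rough training-compute estimate using the common heuristic:
# total training FLOPs ≈ 6 * n_params * n_tokens
# (≈ 2 * n_params per token for the forward pass, ≈ 4 * n_params for the backward pass).
n_params = 7e9    # hypothetical 7B-parameter model
n_tokens = 2e12   # hypothetical 2T training tokens

train_flops = 6 * n_params * n_tokens
print(f"≈ {train_flops:.1e} FLOPs")   # ≈ 8.4e+22 FLOPs
```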

2

What are the key differences between model inference and training?

  • Per token, training takes roughly 3x the compute of inference (forward + backward pass vs. forward pass only).

  • Training must hold activations in memory for the backward pass; inference does not.

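A minimal sketch of where the ≈3x comes from, assuming the standard per-token approximation for an N-parameter model (forward ≈ 2N FLOPs, backward ≈ 4N FLOPs):

```python
n_params = 7e9                                   # hypothetical model size

inference_flops_per_token = 2 * n_params         # forward pass only
training_flops_per_token = (2 + 4) * n_params    # forward + backward pass

print(training_flops_per_token / inference_flops_per_token)   # 3.0
```
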
3

What is Activation Checkpointing?

  • Trade memory for compute in model training.

  • Drop activations from memory and re-compute them when needed => keep some activations as “checkpoints”.

=> Re-computing an activation from the very beginning is slow; checkpoints spread throughout the network keep the average re-computation short.

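A minimal PyTorch sketch of activation checkpointing using torch.utils.checkpoint; the toy blocks below are assumptions, not the card's model:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy stand-ins for transformer blocks.
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()) for _ in range(8)]
)

def forward(x, use_checkpointing=True):
    for block in blocks:
        if use_checkpointing:
            # Intermediate activations inside `block` are dropped and
            # re-computed from the checkpointed input during backward().
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)   # all intermediate activations stay in memory
    return x

x = torch.randn(16, 512, requires_grad=True)
loss = forward(x).mean()
loss.backward()   # triggers re-computation of the dropped activations
```
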
4

What is Gradient Accumulation?

  • Trade memory for time in model training: peak memory drops while total compute stays roughly the same.

  • Problem: we want to train with a large batch size that does not fit in memory.

  • Solution: run multiple forward-backward passes on smaller micro-batches and accumulate (average) their gradients before doing a single optimizer step.

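A minimal PyTorch sketch of gradient accumulation; the model, data, and micro-batch count are hypothetical:

```python
import torch

model = torch.nn.Linear(1024, 1024)    # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                        # 8 micro-batches = 1 effective large batch

opt.zero_grad()
for micro_step in range(accum_steps):
    x = torch.randn(4, 1024)           # small micro-batch that fits in memory
    loss = model(x).pow(2).mean()
    # Dividing by accum_steps makes the summed gradients equal the mean
    # over the full effective batch.
    (loss / accum_steps).backward()    # gradients accumulate in .grad

opt.step()                             # one optimizer step for the whole effective batch
opt.zero_grad()
```
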
5

How does the memory footprint scale for Activations?

  • Short sequences => memory for activations is negligible.

  • Long sequences => activation memory can become far larger than params + grads + optimizer state combined.

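A very rough sketch of this scaling; the per-layer activation constant below is a loose assumption (it ignores attention-score matrices and implementation details), and the 16 bytes/param for weights + grads + optimizer state follows the usual mixed-precision Adam accounting:

```python
# All sizes are illustrative assumptions for a hypothetical 7B-parameter model.
n_params = 7e9
n_layers, hidden, batch = 32, 4096, 8

# Weights + grads + optimizer state: ~16 bytes per parameter (mixed-precision Adam).
static_gb = 16 * n_params / 1e9

for seq_len in (512, 8192):
    # Crude activation estimate: ~10 fp16 tensors of shape (batch, seq, hidden)
    # per layer; real numbers depend heavily on the architecture.
    act_gb = 2 * 10 * batch * seq_len * hidden * n_layers / 1e9
    print(f"seq={seq_len}: static ≈ {static_gb:.0f} GB, activations ≈ {act_gb:.0f} GB")
```
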
6

What needs to be in memory when training an LLM?

  • Parameters

  • Gradients

  • Optimizer state (e.g. Adam’s momentum and variance)

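A small sketch of the usual bytes-per-parameter accounting for mixed-precision Adam training (the 7B parameter count is a made-up example):

```python
n_params = 7e9   # hypothetical 7B-parameter model

bytes_per_param = {
    "fp16 parameters": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

for name, b in bytes_per_param.items():
    print(f"{name}: {b * n_params / 1e9:.0f} GB")

total = sum(bytes_per_param.values()) * n_params   # 16 bytes per parameter
print(f"total (excluding activations): {total / 1e9:.0f} GB")   # 112 GB for 7B params
```
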
7

What is Data Parallelism in LLM Training?

  • Run different batches of data in parallel on different chips.

  • Model weights must be duplicated across chips.

  • After gradients are computed, they must be communicated (averaged) across the chips so that all replicas stay in sync.

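A minimal PyTorch DistributedDataParallel sketch (the model, sizes, and launch command are assumptions): DDP replicates the weights on every device and all-reduces the gradients during backward().

```python
# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=4 train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in; weights replicated on every chip
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")  # each rank sees a different batch
        loss = ddp_model(x).pow(2).mean()
        loss.backward()                           # DDP all-reduces gradients here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```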
