10. Scaling LLM Training II

4 Terms

1

What is the idea of ZeRO Sharding?

  • Reduces memory redundancy when training on multiple GPUs.

  • Considers Parameters + Gradients + Optimizer States.

  • Three different levels (stages) define which of these are sharded across GPUs and which stay replicated.

    => Reducing memory redundancy increases the need for communication between GPUs (see the memory sketch below).

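A rough back-of-the-envelope sketch of why the stages matter, assuming the common mixed-precision Adam accounting (2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 master params, momentum, variance per parameter). The `zero_memory_gb` helper and the 7B/8-GPU numbers are illustrative, not from the card.

```python
# Approximate per-GPU memory for model states under the ZeRO stages.
# Assumes mixed-precision Adam: 2 (fp16 params) + 2 (fp16 grads) + 12
# (fp32 master params, momentum, variance) bytes per parameter.

def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU, in GB."""
    params_b, grads_b, optim_b = 2.0, 2.0, 12.0  # bytes per parameter
    n = num_gpus
    if stage == 0:            # plain data parallelism: everything replicated
        per_param = params_b + grads_b + optim_b
    elif stage == 1:          # shard optimizer states
        per_param = params_b + grads_b + optim_b / n
    elif stage == 2:          # shard optimizer states + gradients
        per_param = params_b + grads_b / n + optim_b / n
    elif stage == 3:          # shard optimizer states + gradients + parameters
        per_param = (params_b + grads_b + optim_b) / n
    else:
        raise ValueError("stage must be 0, 1, 2, or 3")
    return num_params * per_param / 1e9


if __name__ == "__main__":
    P, N = 7e9, 8  # e.g. a 7B-parameter model on 8 GPUs (illustrative numbers)
    for stage in range(4):
        print(f"ZeRO stage {stage}: {zero_memory_gb(P, N, stage):7.1f} GB per GPU")
```

The trade-off from the card shows up here: the higher the stage, the less each GPU stores, but the more often shards must be gathered or reduce-scattered during the forward/backward pass.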
2

What is the idea of Tensor Parallelism?

  • Shard weight and activation tensors into groups that are executed on different GPUs.

  • Can be applied to Attention as well as Linear Layers.

=> However: many more communication operations; this becomes an issue when scaling beyond a single node.

=> Also: later operations like dropout or layer norm require the full (unsharded) activations, so they must be gathered first (see the sketch below).

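A toy single-process NumPy sketch of the idea (not any specific library's API): the "GPUs" are just slices of the weight matrices. A column-parallel linear layer is followed by a row-parallel one, and summing the partial outputs stands in for the all-reduce between devices.

```python
import numpy as np

# Toy simulation of tensor parallelism for a 2-layer MLP.
rng = np.random.default_rng(0)
n_shards = 2                       # pretend we have 2 GPUs
x = rng.normal(size=(4, 8))        # (batch, d_model)
W1 = rng.normal(size=(8, 16))      # first linear layer
W2 = rng.normal(size=(16, 8))      # second linear layer

# Column parallelism: each shard holds a block of W1's output columns and
# produces its slice of the hidden activation. No communication needed yet.
W1_shards = np.split(W1, n_shards, axis=1)
h_shards = [np.maximum(x @ w, 0.0) for w in W1_shards]   # local GEMM + ReLU

# Row parallelism: each shard holds the matching block of W2's rows and
# produces a partial output; summing the partials plays the role of the
# all-reduce across GPUs.
W2_shards = np.split(W2, n_shards, axis=0)
partial_outputs = [h @ w for h, w in zip(h_shards, W2_shards)]
y_parallel = sum(partial_outputs)                         # "all-reduce"

# Reference: the same MLP computed without sharding.
y_reference = np.maximum(x @ W1, 0.0) @ W2
print("max abs difference:", np.abs(y_parallel - y_reference).max())
```

The same column/row split applies to the attention projections. The points from the card are visible here: the all-reduce is needed after every sharded pair of layers, and anything that follows it (dropout, layer norm) operates on the full, unsharded activation.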
3

What is the idea of Ring Attention?

  • The standard attention layer prevents sequence parallelism, as it requires all tokens to interact.

  • In Ring Attention, each GPU computes attention for its own block of the (causal) attention matrix and then passes its keys and values to the next GPU in the ring.

=> Drastically increases the possible sequence length (context window); see the sketch below.

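A toy single-process NumPy sketch of the mechanism: each simulated device owns one block of queries, the key/value blocks rotate around the ring, and every device accumulates its output with an online (streaming) softmax so it never needs all keys and values at once. The causal mask is omitted for brevity and all names are illustrative.

```python
import numpy as np

# Toy single-process simulation of ring attention (no causal mask).
rng = np.random.default_rng(0)
n_dev, block, d = 4, 8, 16                     # devices, tokens per device, head dim
Q = rng.normal(size=(n_dev, block, d))
K = rng.normal(size=(n_dev, block, d))
V = rng.normal(size=(n_dev, block, d))
scale = 1.0 / np.sqrt(d)

# Running statistics per device: max score m, softmax denominator l, numerator acc.
m = np.full((n_dev, block, 1), -np.inf)
l = np.zeros((n_dev, block, 1))
acc = np.zeros((n_dev, block, d))

kv_owner = list(range(n_dev))                  # which K/V block each device currently holds
for _ in range(n_dev):                         # after n_dev ring steps, every device has seen every K/V block
    for dev in range(n_dev):
        k, v = K[kv_owner[dev]], V[kv_owner[dev]]
        s = Q[dev] @ k.T * scale               # local attention scores (block, block)
        m_new = np.maximum(m[dev], s.max(axis=-1, keepdims=True))
        correction = np.exp(m[dev] - m_new)    # rescale previous partial sums
        p = np.exp(s - m_new)
        l[dev] = l[dev] * correction + p.sum(axis=-1, keepdims=True)
        acc[dev] = acc[dev] * correction + p @ v
        m[dev] = m_new
    kv_owner = kv_owner[1:] + kv_owner[:1]     # "send" K/V to the next device in the ring

out_ring = acc / l

# Reference: ordinary full attention over the whole sequence.
Qf, Kf, Vf = (a.reshape(n_dev * block, d) for a in (Q, K, V))
scores = Qf @ Kf.T * scale
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
out_full = (weights / weights.sum(axis=-1, keepdims=True)) @ Vf
print("max abs difference:", np.abs(out_ring.reshape(-1, d) - out_full).max())
```

Because each device only ever stores its own query block plus one key/value block at a time, memory per device stays constant as more devices are added, which is what lets the total sequence length grow so drastically.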
4

Placeholder

Placeholder