What is the idea of ZeRO Sharding?
Reduces memory redundancy when training on multiple GPUs.
Considers Parameters + Gradients + Optimizer States.
Three different stages (ZeRO-1/2/3) define which of these each GPU still keeps in full: stage 1 shards the optimizer states, stage 2 also shards the gradients, stage 3 also shards the parameters.
=> Reducing memory redundancy increases the need for communication between GPUs (see the memory sketch below).
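A back-of-the-envelope sketch of the memory effect, under the usual mixed-precision assumptions (fp16 parameters and gradients, Adam optimizer states plus a master copy in fp32, i.e. roughly 16 bytes per parameter); the model size and GPU count below are just example values, not from the card.

```python
# Per-GPU memory under the three ZeRO stages (illustrative accounting only).
def zero_memory_gb(n_params, n_gpus, stage):
    params_b = 2 * n_params          # fp16 weights
    grads_b = 2 * n_params           # fp16 gradients
    optim_b = 12 * n_params          # fp32 momentum + variance + master weights

    if stage >= 1:                   # ZeRO-1: shard optimizer states
        optim_b /= n_gpus
    if stage >= 2:                   # ZeRO-2: also shard gradients
        grads_b /= n_gpus
    if stage >= 3:                   # ZeRO-3: also shard parameters
        params_b /= n_gpus
    return (params_b + grads_b + optim_b) / 1e9

for stage in range(4):
    gb = zero_memory_gb(7e9, 8, stage)
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB per GPU (7B params, 8 GPUs)")
```

The higher the stage, the more of the 16 bytes/parameter is divided by the number of GPUs, at the cost of extra gathers and reduce-scatters during each step.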
What is the idea of Tensor Parallelism?
Shard weight and activation tensors so that each GPU holds and computes only a slice of them.
Can be applied to Attention as well as Linear Layers.
=> However: many more communications are needed; this becomes an issue when going beyond a single node.
=> Also: later operations like dropout or layernorm need the full activations, so sharded activations must be gathered again (a minimal sketch follows below).
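A minimal single-process numpy sketch of Megatron-style column- and row-parallel linear layers (an illustration under simplified assumptions, with no real multi-GPU communication): splitting the weight matrix and recombining the partial results reproduces the full matmul.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))        # activations: (batch, d_in)
W = rng.normal(size=(8, 16))       # full weight: (d_in, d_out)
n_dev = 4                          # number of tensor-parallel "devices"

# Column parallelism: shard the output dimension; each device computes its
# slice of the output, and an all-gather concatenates the slices.
W_cols = np.split(W, n_dev, axis=1)
y_col = np.concatenate([x @ Wi for Wi in W_cols], axis=1)

# Row parallelism: shard the input dimension; each device produces a partial
# sum, and the sum over devices corresponds to an all-reduce on real hardware.
x_shards = np.split(x, n_dev, axis=1)
W_rows = np.split(W, n_dev, axis=0)
y_row = sum(xi @ Wi for xi, Wi in zip(x_shards, W_rows))

assert np.allclose(y_col, x @ W)
assert np.allclose(y_row, x @ W)
```

In a transformer block, the first linear layer is typically column-parallel and the second row-parallel, so only one all-reduce is needed per MLP; the same split can be applied to the attention heads.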
What is the idea of Ring Attention?
The standard attention layer prevents sequence parallelism, as it requires all tokens to interact with each other.
In Ring Attention, each GPU holds one block of the sequence, computes its own portion of the (causal) attention, and then passes its keys and values on to the next GPU in a ring.
=> Drastically increases the possible sequence length (context window); see the sketch below.
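A minimal single-process numpy sketch of the idea (an assumption-level illustration, not a distributed implementation, and without the causal mask for brevity): each "device" keeps its query block fixed while key/value blocks rotate around the ring, and a running online softmax keeps the result exact without ever materializing the full attention matrix.

```python
import numpy as np

def ring_attention(q, k, v, n_dev):
    d = q.shape[-1]
    q_blocks = np.array_split(q, n_dev)      # each device's fixed query block
    k_blocks = np.array_split(k, n_dev)      # key blocks that rotate
    v_blocks = np.array_split(v, n_dev)      # value blocks that rotate

    outputs = []
    for qi in q_blocks:
        m = np.full(qi.shape[0], -np.inf)    # running row max
        l = np.zeros(qi.shape[0])            # running softmax denominator
        acc = np.zeros_like(qi)              # unnormalized output accumulator

        # In a real ring, each k/v block arrives from the neighbouring GPU
        # at every step; here we simply iterate over all blocks.
        for kj, vj in zip(k_blocks, v_blocks):
            s = qi @ kj.T / np.sqrt(d)                     # local scores
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)                      # rescale old state
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ vj
            m = m_new
        outputs.append(acc / l[:, None])
    return np.concatenate(outputs)

# Check against vanilla (non-causal) attention on a toy example.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4)); k = rng.normal(size=(8, 4)); v = rng.normal(size=(8, 4))
s = q @ k.T / np.sqrt(4)
ref = np.exp(s - s.max(-1, keepdims=True))
ref = ref / ref.sum(-1, keepdims=True) @ v
assert np.allclose(ring_attention(q, k, v, n_dev=4), ref)
```

Because only keys and values travel around the ring while queries and outputs stay local, per-GPU memory grows with the block size rather than the full sequence length.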