Papers related to distributed training

- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (arXiv:2304.11277)
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv:1909.08053)
- Reducing Activation Recomputation in Large Transformer Models (arXiv:2205.05198)
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (arXiv:1811.06965)
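As context for the list above: the core idea behind fully sharded data parallelism (the first paper, arXiv:2304.11277) is that each rank stores only a fraction of the parameters and all-gathers the full tensor just before it is needed. A toy single-process sketch of that shard/all-gather pattern, with illustrative helper names that are not the actual FSDP API:

```python
# Toy simulation of fully sharded data parallelism: each of WORLD_SIZE
# ranks keeps a 1/WORLD_SIZE shard of the parameters and all-gathers
# the full set just before compute. Single-process illustration only;
# the function names here are hypothetical, not the PyTorch FSDP API.

WORLD_SIZE = 4

def shard(params, world_size):
    """Split a flat parameter list into world_size contiguous shards."""
    per = (len(params) + world_size - 1) // world_size
    return [params[i * per:(i + 1) * per] for i in range(world_size)]

def all_gather(shards):
    """Reconstruct the full parameter list from every rank's shard."""
    full = []
    for s in shards:
        full.extend(s)
    return full

params = [float(i) for i in range(10)]   # the "model" parameters
shards = shard(params, WORLD_SIZE)       # each rank holds one shard

# Per-rank memory drops from len(params) to about len(params)/WORLD_SIZE.
assert max(len(s) for s in shards) <= (len(params) + WORLD_SIZE - 1) // WORLD_SIZE

# Before using a layer, every rank all-gathers the full weights ...
gathered = all_gather(shards)
assert gathered == params
# ... and frees them again afterwards, keeping only its own shard resident.
```

In the real system the all-gather happens per layer (or per FSDP unit) during forward and backward, which is what lets peak memory stay near the sharded size rather than the full model size.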