## Accelerate Megatron-LM Plugin
Run `accelerate config` and answer the questionnaire accordingly.

Below is an example YAML for BF16 mixed-precision training using Megatron-LM with DPxTPxPP=2x2x2 degrees on 8 GPUs (DP: Data Parallelism, TP: Tensor Parallelism, PP: Pipeline Parallelism). It also enables Sequence Parallelism and selective activation checkpointing, along with the sharded (distributed) optimizer.
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_num_micro_batches: 2
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: true
  megatron_lm_tp_degree: 2
  megatron_lm_use_distributed_optimizer: true
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
</pre>
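As a sanity check on such a config, the parallel degrees must multiply to the total number of processes, and Megatron-LM derives the effective global batch size from the micro batch size, the number of micro batches and the data parallel degree. The short calculation below is purely illustrative; the `micro_batch_size` value of 4 is an assumed per-GPU batch size for the example (in Accelerate it comes from your dataloader's per-device batch size).

<pre>
# Illustrative arithmetic for the config above (not part of any API).
num_processes = 8       # num_processes in the yaml
tp_degree = 2           # megatron_lm_tp_degree
pp_degree = 2           # megatron_lm_pp_degree
num_micro_batches = 2   # megatron_lm_num_micro_batches
micro_batch_size = 4    # assumed per-GPU batch size, taken from your dataloader

# Data parallelism is whatever remains after tensor and pipeline parallelism.
dp_degree = num_processes // (tp_degree * pp_degree)
assert dp_degree * tp_degree * pp_degree == num_processes  # 2 x 2 x 2 = 8 GPUs

# Effective global batch size per optimizer step.
global_batch_size = micro_batch_size * num_micro_batches * dp_degree
print(f"DP x TP x PP = {dp_degree} x {tp_degree} x {pp_degree}, global batch size = {global_batch_size}")  # 4 * 2 * 2 = 16
</pre>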
## Code changes
<pre>
from accelerate import Accelerator
+ from accelerate.utils import MegatronLMDummyScheduler

+ def main():
      accelerator = Accelerator()
      ...
-     lr_scheduler = get_scheduler(
-         name=args.lr_scheduler_type,
+     lr_scheduler = MegatronLMDummyScheduler(
          optimizer=optimizer,
          num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps,
          num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
      )

      model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
          model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
      )

      total_batch_size = (
-         args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+         accelerator.state.megatron_lm_plugin.global_batch_size
      )

      for batch in train_dataloader:
          optimizer.zero_grad()
          inputs, targets = batch
          outputs = model(inputs)
          loss = loss_function(outputs, targets)
          accelerator.backward(loss)
          optimizer.step()
          lr_scheduler.step()
      ...

      # in eval loop
      losses = []
      for step, batch in enumerate(eval_dataloader):
          with torch.no_grad():
              outputs = model(**batch)
          loss = outputs.loss
-         losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
+         losses.append(loss)  # for Megatron-LM, the losses are already averaged across the data parallel group

-     losses = torch.cat(losses)
+     losses = torch.tensor(losses)
      eval_loss = torch.mean(losses)
      perplexity = math.exp(eval_loss)
      logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}")

+     accelerator.save_state(output_dir)

+ if __name__ == "__main__":
+     main()
</pre>
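Note that the snippet above ends by saving the training state with `accelerator.save_state`. Resuming later goes through the matching `accelerator.load_state` call. The minimal sketch below assumes the model, optimizer, dataloaders and scheduler have already been passed through `accelerator.prepare` as shown above, and that `output_dir` is a directory path supplied by your script.

<pre>
# Minimal checkpointing sketch, continuing from the prepared objects above.
# `output_dir` is assumed to be a directory path provided by your script.

# Save a Megatron-LM sharded checkpoint (model, optimizer and scheduler state).
accelerator.save_state(output_dir)

# ...later, to resume training, restore the same objects in place.
accelerator.load_state(output_dir)
</pre>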
Launching a script using the default Accelerate config file looks like the following:
| ``` | |
| accelerate launch {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
Alternatively, you can use `accelerate launch` with the right config params for multi-GPU training, as shown below:
| ``` | |
| accelerate launch \ | |
| --use_megatron_lm \ | |
| --num_processes=8 \ | |
| --mixed_precision=bf16 \ | |
| --megatron_lm_tp_degree=2 \ | |
| --megatron_lm_pp_degree=2 \ | |
| --megatron_lm_num_micro_batches=2 \ | |
| --megatron_lm_sequence_parallelism=true \ | |
| --megatron_lm_recompute_activations=true \ | |
| --megatron_lm_use_distributed_optimizer=true \ | |
| {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
## Caveats
For Megatron-LM, the supported models are the Transformers GPT2, Megatron-BERT and T5 models, covering the Decoder-only, Encoder-only and Encoder-Decoder model classes. Given the complexity of Megatron-LM's features, the 4 changes required to get started are:
1. Using `accelerate.utils.MegatronLMDummyScheduler`: as Megatron-LM uses its own implementation of the optimizer, the corresponding scheduler compatible with it needs to be used.
2. Getting the total batch size now needs to be cognizant of the tensor and pipeline parallel degrees; read it from `accelerator.state.megatron_lm_plugin.global_batch_size` instead of computing it from the per-device batch size.
3. Losses are already averaged across the data parallel group, so they don't need to be gathered across processes.
4. Saving the model using `accelerator.save_state` instead of the Transformers `save_pretrained` method.
These changes have been highlighted in the code snippet above.
The Megatron-LM integration supports many advanced features, such as the ability to leverage a custom train step, using Megatron-LM indexed datasets, checkpoint reshaping and interoperability utilities, the `megatron_generate` function for text generation using Tensor and Pipeline Parallelism, and support for RoPE/ALiBi positional embeddings and Multi-Query Attention. However, these require more changes owing to their complexity; the effort is worth it for getting the highest performance.
## Resources
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/usage_guides/megatron_lm" target="_blank">How to use Megatron-LM</a>
- <a href="https://github.com/pacman100/accelerate-megatron-test" target="_blank">Examples showcasing the Megatron-LM integration of Accelerate</a>