## Configuration

Run `accelerate config` and answer the questionnaire accordingly. Below is an example YAML for BF16 mixed-precision training using PyTorch FSDP with CPU offloading on 8 GPUs.

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: T5Block
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```

## Training script

```python
from accelerate import Accelerator


def main():
    accelerator = Accelerator()

    # model, optimizer, training_dataloader, scheduler and loss_function are
    # assumed to be created earlier in main().

    # Prepare the model first so that FSDP can shard its parameters in place.
    model = accelerator.prepare(model)
    # Then prepare the optimizer, dataloader and scheduler.
    optimizer, training_dataloader, scheduler = accelerator.prepare(
        optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
    ...


if __name__ == "__main__":
    main()
```

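The snippet above assumes that `model`, `optimizer`, `training_dataloader`, `scheduler`, and `loss_function` have already been created earlier in `main()`. Purely to make the flow concrete, here is a minimal, hypothetical setup using a toy model and synthetic data; in a real run these would instead be, for example, a T5 model and a tokenized text dataset, and every name and hyperparameter in this sketch is an illustrative assumption rather than part of the example above.

```python
# Hypothetical definitions for the objects the training snippet takes as given.
# Toy model and synthetic data, for illustration only.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

# Synthetic regression data standing in for a real (tokenized) dataset.
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
training_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
loss_function = torch.nn.MSELoss()
```
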
Launching a script using the default accelerate config file looks like the following:

```
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```

Alternatively, you can use `accelerate launch` with the right config parameters for multi-GPU FSDP training, as shown below:

```
accelerate launch \
  --use_fsdp \
  --num_processes=8 \
  --mixed_precision=bf16 \
  --fsdp_sharding_strategy=1 \
  --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
  --fsdp_transformer_layer_cls_to_wrap=T5Block \
  --fsdp_offload_params=true \
  {script_name.py} {--arg1} {--arg2} ...
```

## Caveats

For PyTorch FSDP, you need to prepare the model before preparing the optimizer, since FSDP shards the parameters in place and this breaks any previously initialized optimizer. The same ordering is followed in the code snippet above. For transformer models, use the `TRANSFORMER_BASED_WRAP` auto wrap policy, as shown in the config above.

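As an illustration of this ordering, the sketch below (a toy model with hypothetical hyperparameters, not part of the original example) prepares the model on its own, then builds the optimizer from the prepared model's parameters, and only afterwards prepares the optimizer and scheduler.

```python
# Illustrative sketch of the FSDP-friendly ordering described above
# (toy model, hypothetical hyperparameters).
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(128, 10)
# 1. Prepare (and, under FSDP, shard) the model first.
model = accelerator.prepare(model)

# 2. Create the optimizer from the prepared model's parameters, so it never
#    holds references to parameters that FSDP has already resharded.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# 3. Prepare the remaining training objects.
optimizer, scheduler = accelerator.prepare(optimizer, scheduler)
```
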
## Further reading

To learn more, check out the related documentation:
- [How to use FSDP](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel](https://huggingface.co/blog/pytorch-fsdp)