# TLT TF Distributed Training
  
A Helm chart for Kubernetes
## Values
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| batchDenom | int | `1` | Batch denominator used to divide the global batch size |
| batchSize | int | `128` | Global batch size for the distributed data |
| datasetName | string | `"cifar10"` | Dataset name to load from tfds |
| epochs | int | `1` | Total epochs to train the model |
| imageName | string | `"intel/ai-tools"` | |
| imageTag | string | `"0.5.0-dist-devel"` | |
| metadata.name | string | `"tlt-distributed"` | |
| metadata.namespace | string | `"kubeflow"` | |
| modelName | string | `"https://tfhub.dev/google/efficientnet/b1/feature-vector/1"` | TF Hub or HuggingFace model URL |
| pvcName | string | `"tlt"` | |
| pvcResources.data | string | `"2Gi"` | Amount of Storage for Dataset |
| pvcResources.output | string | `"1Gi"` | Amount of Storage for Output Directory |
| pvcScn | string | `"nil"` | PVC `StorageClassName` |
| resources.cpu | int | `2` | Number of CPUs for Launcher |
| resources.memory | string | `"4Gi"` | Amount of Memory for Launcher |
| scaling | string | `"strong"` | For `weak` scaling, `lr` is scaled by a factor of `sqrt(batch_size/batch_denom)` and every process uses the global batch size. For `strong` scaling, `lr` is scaled by the world size and the global batch size is divided by the world size |
| slotsPerWorker | int | `1` | Number of Processes Per Worker |
| useCase | string | `"image_classification"` | Use case (`image_classification` or `text_classification`) |
| workerResources.cpu | int | `4` | Number of CPUs per Worker |
| workerResources.memory | string | `"8Gi"` | Amount of Memory per Worker |
| workers | int | `4` | Number of Workers |
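## Scaling sketch

The `scaling` description in the table can be illustrated with a short Python sketch. This is not the chart's actual training code. The function name `scale_hyperparams`, the base learning rate of `0.001`, the use of integer division, and the assumption that the world size equals `workers * slotsPerWorker` are all hypothetical and only serve to show how the two modes differ.

```python
import math


def scale_hyperparams(base_lr, batch_size, batch_denom, world_size, scaling="strong"):
    """Return (learning_rate, per_process_batch_size) for the given scaling mode.

    Hypothetical helper that mirrors the wording of the `scaling` row above.
    """
    if scaling == "weak":
        # lr is scaled by sqrt(batch_size / batch_denom);
        # every process keeps the full global batch size.
        return base_lr * math.sqrt(batch_size / batch_denom), batch_size
    if scaling == "strong":
        # lr is scaled by the world size;
        # the global batch size is divided by the world size
        # (integer division is an assumption of this sketch).
        return base_lr * world_size, batch_size // world_size
    raise ValueError(f"unknown scaling mode: {scaling!r}")


# Chart defaults: workers=4, slotsPerWorker=1, batchSize=128, batchDenom=1.
workers, slots_per_worker = 4, 1
world_size = workers * slots_per_worker  # assumed definition of world size

strong_lr, strong_batch = scale_hyperparams(0.001, 128, 1, world_size, "strong")
weak_lr, weak_batch = scale_hyperparams(0.001, 128, 1, world_size, "weak")

print(f"strong: lr={strong_lr:.6f}, per-process batch={strong_batch}")
print(f"weak:   lr={weak_lr:.6f}, per-process batch={weak_batch}")
```

With the default values, the `strong` mode in this sketch gives each of the four processes a batch of 32, while the `weak` mode leaves each process with the full batch of 128.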