TLT TF Distributed Training
A Helm chart for Kubernetes
Values
Key | Type | Default | Description |
---|---|---|---|
batchDenom | int | 1 | Batch denominator used to divide the global batch size |
batchSize | int | 128 | Global batch size for distributing the data |
datasetName | string | "cifar10" | Dataset name to load from TensorFlow Datasets (tfds) |
epochs | int | 1 | Total number of epochs to train the model |
imageName | string | "intel/ai-tools" | |
imageTag | string | "0.5.0-dist-devel" | |
metadata.name | string | "tlt-distributed" | |
metadata.namespace | string | "kubeflow" | |
modelName | string | "https://tfhub.dev/google/efficientnet/b1/feature-vector/1" | TF Hub or Hugging Face model URL |
pvcName | string | "tlt" | |
pvcResources.data | string | "2Gi" | Amount of storage for the dataset |
pvcResources.output | string | "1Gi" | Amount of storage for the output directory |
pvcScn | string | "nil" | PVC StorageClassName |
resources.cpu | int | 2 | Number of CPUs for the launcher |
resources.memory | string | "4Gi" | Amount of memory for the launcher |
scaling | string | "strong" | Scaling mode. With weak scaling, the learning rate is scaled by a factor of sqrt(batch_size/batch_denom) and every process uses the global batch size. With strong scaling, the learning rate is scaled by the world size and the global batch size is divided by the world size |
slotsPerWorker | int | 1 | Number of processes per worker |
useCase | string | "image_classification" | Use case (image_classification |
workerResources.cpu | int | 4 | Number of CPUs per worker |
workerResources.memory | string | "8Gi" | Amount of memory per worker |
workers | int | 4 | Number of workers |
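
The effect of the `scaling` value can be illustrated with a minimal Python sketch. It is not the chart's or the training script's actual code: the function name `scale_hyperparameters`, the `base_lr` parameter, and the assumption that the world size equals `workers * slotsPerWorker` are hypothetical and only follow the description in the table above.

```python
import math


def scale_hyperparameters(base_lr, batch_size, world_size,
                          scaling="strong", batch_denom=1):
    """Return (learning_rate, per_process_batch_size) for a scaling mode.

    Hypothetical helper that mirrors the `scaling` description in the
    values table; it is not taken from the chart itself.
    """
    if scaling == "weak":
        # Weak scaling: lr grows with sqrt(batch_size / batch_denom),
        # and every process keeps the full global batch size.
        lr = base_lr * math.sqrt(batch_size / batch_denom)
        per_process_batch = batch_size
    elif scaling == "strong":
        # Strong scaling: lr grows with the world size, and the global
        # batch size is split evenly across all processes.
        lr = base_lr * world_size
        per_process_batch = batch_size // world_size
    else:
        raise ValueError(f"unknown scaling mode: {scaling!r}")
    return lr, per_process_batch


# Chart defaults: workers=4, slotsPerWorker=1, batchSize=128, batchDenom=1.
# Assumption: world size is workers * slotsPerWorker.
world_size = 4 * 1
base_lr = 0.001  # hypothetical base learning rate, not a chart value

strong = scale_hyperparameters(base_lr, 128, world_size, "strong")
weak = scale_hyperparameters(base_lr, 128, world_size, "weak", batch_denom=1)

print("strong:", strong)
print("weak:", weak)
```

Under these assumptions, strong scaling with the default values gives each of the 4 processes a batch of 32 samples, while weak scaling leaves each process with the full batch of 128.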