Is CUDA supported when running on Jetson Orin?
#19 opened by kikaitachi
Some NeMo models (like https://huggingface.co/nvidia/mel-codec-22khz, for example) explicitly list Jetson as supported hardware, but this model does not. Is it runnable with CUDA? I managed to run it on CPU on a Jetson Orin Nano, but failed to get it running on CUDA despite trying various base containers and even rebuilding the NeMo framework from the latest main branch.
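For context, the failing script is essentially just a restore plus a transcribe call. A minimal sketch (the audio file name is a placeholder):

```python
# Minimal sketch of asr.py; "sample.wav" is a placeholder audio file.
import nemo.collections.asr as nemo_asr

# restore_from() appears to map the model onto the GPU by default when one
# is visible (see the instance.to(map_location) frame in the traceback below),
# which is where the CUDA failure happens on Jetson.
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")

print(asr_model.transcribe(["sample.wav"]))
```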
We haven't tried it on Jetson.
Could you share the error you are getting?
Running a container based on nvcr.io/nvidia/nemo:dev fails this way:
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 540.4.0 which has support for CUDA 12.6. This container
was built with CUDA 12.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for NeMo Framework. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
[NeMo W 2025-05-08 20:11:55 nemo_logging:405] Please use the EncDecSpeakerLabelModel instead of this model. EncDecClassificationModel model is kept for backward compatibility with older models.
[NeMo I 2025-05-08 20:12:04 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
use_lhotse: true
skip_missing_manifest_entries: true
input_cfg: null
tarred_audio_filepaths: null
manifest_filepath: null
sample_rate: 16000
shuffle: true
num_workers: 2
pin_memory: true
max_duration: 40.0
min_duration: 0.1
text_field: answer
batch_duration: null
use_bucketing: true
bucket_duration_bins: null
bucket_batch_size: null
num_buckets: 30
bucket_buffer_size: 20000
shuffle_buffer_size: 10000
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
use_lhotse: true
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
max_duration: 40.0
min_duration: 0.1
num_workers: 2
pin_memory: true
text_field: answer
[NeMo I 2025-05-08 20:12:05 nemo_logging:393] PADDING: 0
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
Traceback (most recent call last):
File "/app/asr.py", line 4, in <module>
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/classes/modelPT.py", line 482, in restore_from
instance = cls._save_restore_connector.restore_from(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 260, in restore_from
loaded_params = self.load_config_and_state_dict(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 183, in load_config_and_state_dict
instance = instance.to(map_location)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 55, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1355, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 290, in _apply
self._init_flat_weights()
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 215, in _init_flat_weights
self.flatten_parameters()
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 271, in flatten_parameters
torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
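If it helps with diagnosis: "no kernel image is available for execution on the device" usually means the PyTorch build inside the container was not compiled for the GPU's compute capability (Jetson Orin is sm_87). A quick check I can run inside the container, as a sketch:

```python
# Check whether this PyTorch build ships kernels for the Orin GPU (sm_87).
import torch

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
# Architectures this wheel was built for; 'sm_87' needs to be in this list for Orin.
print("Built for:", torch.cuda.get_arch_list())
```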
Running a container based on nvcr.io/nvidia/pytorch:24.07-py3 and installing "nemo_toolkit['asr']@git+https://github.com/NVIDIA/NeMo" inside the container works:
[NeMo I 2025-05-08 20:19:32 features:305] PADDING: 0
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
Cuda graphs with while loops are disabled, decoding speed will be slower
Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:41 save_restore_connector:275] Model EncDecRNNTBPEModel was successfully restored from /app/asr.nemo.
Transcribing: 100%|██████████| 1/1 [00:01<00:00, 1.92s/it]
but from what I see in the logs it seems to be running on CPU, not CUDA.
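The "Reason: CUDA is not available" warnings above point the same way, so one thing worth checking is whether pip-installing nemo_toolkit pulled in a torch wheel that replaced the container's Jetson build. A small check after restoring the model (a sketch) would confirm which device it actually ended up on:

```python
# Confirm where the restored model lives and whether CUDA is usable at all.
import torch
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")

print("torch.cuda.is_available():", torch.cuda.is_available())
print("Model device:", next(asr_model.parameters()).device)

# If CUDA is reported as available but the model sits on CPU,
# moving it explicitly is worth a try.
if torch.cuda.is_available():
    asr_model = asr_model.cuda()
    print("Model device after .cuda():", next(asr_model.parameters()).device)
```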