Is CUDA supported when running on Jetson Orin?

#19
by kikaitachi - opened

Some NeMo models (like https://huggingface.co/nvidia/mel-codec-22khz, for example) explicitly list Jetson as supported hardware, but this model does not. Is it runnable with CUDA? I managed to run it on CPU on a Jetson Orin Nano, but failed to get it running on CUDA despite trying various base containers and even rebuilding the NeMo framework from the latest main branch.
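
For reference, I'm loading the model with a minimal script along these lines (a sketch; the checkpoint and audio file names are placeholders, matching the asr.py in the traceback below):

```python
# Minimal load-and-transcribe sketch (file names are assumptions).
import nemo.collections.asr as nemo_asr

# Restore the downloaded .nemo checkpoint; this is the call that fails with a
# CUDA error in the nemo:dev container (the model is moved to the GPU during restore).
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")

# Transcribe a sample file to confirm the model runs end to end.
output = asr_model.transcribe(["sample.wav"])
print(output[0])
```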

NVIDIA org

We haven't tried it on Jetson.
Could you share the error you are getting?

Running a container based on nvcr.io/nvidia/nemo:dev fails this way:

WARNING: CUDA Minor Version Compatibility mode ENABLED.
  Using driver version 540.4.0 which has support for CUDA 12.6.  This container
  was built with CUDA 12.8 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[]]
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for NeMo Framework.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

[NeMo W 2025-05-08 20:11:55 nemo_logging:405] Please use the EncDecSpeakerLabelModel instead of this model. EncDecClassificationModel model is kept for backward compatibility with older models.
[NeMo I 2025-05-08 20:12:04 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    use_lhotse: true
    skip_missing_manifest_entries: true
    input_cfg: null
    tarred_audio_filepaths: null
    manifest_filepath: null
    sample_rate: 16000
    shuffle: true
    num_workers: 2
    pin_memory: true
    max_duration: 40.0
    min_duration: 0.1
    text_field: answer
    batch_duration: null
    use_bucketing: true
    bucket_duration_bins: null
    bucket_batch_size: null
    num_buckets: 30
    bucket_buffer_size: 20000
    shuffle_buffer_size: 10000
    
[NeMo W 2025-05-08 20:12:05 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    use_lhotse: true
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 16
    shuffle: false
    max_duration: 40.0
    min_duration: 0.1
    num_workers: 2
    pin_memory: true
    text_field: answer
    
[NeMo I 2025-05-08 20:12:05 nemo_logging:393] PADDING: 0
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
    Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
    Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:12:10 nemo_logging:393] Using RNNT Loss : tdt
    Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
Traceback (most recent call last):
  File "/app/asr.py", line 4, in <module>
    asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="asr.nemo")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 482, in restore_from
    instance = cls._save_restore_connector.restore_from(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 260, in restore_from
    loaded_params = self.load_config_and_state_dict(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 183, in load_config_and_state_dict
    instance = instance.to(map_location)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 55, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 290, in _apply
    self._init_flat_weights()
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 215, in _init_flat_weights
    self.flatten_parameters()
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/rnn.py", line 271, in flatten_parameters
    torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

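The "no kernel image is available" error usually means the PyTorch build inside the container does not ship kernels for the GPU's compute capability (Orin is sm_87). A quick diagnostic that can be run inside the failing container (a generic sketch, not specific to this model):

```python
# Check whether this PyTorch build includes kernels for the Jetson Orin GPU
# (compute capability 8.7 / sm_87).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    # Architectures the wheel was compiled for; if sm_87 (or a compatible PTX
    # entry) is missing here, "no kernel image is available" is expected.
    print("Compiled arch list:", torch.cuda.get_arch_list())
```
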
Running a container based on nvcr.io/nvidia/pytorch:24.07-py3 and installing "nemo_toolkit['asr']@git+https://github.com/NVIDIA/NeMo" inside the container works:

[NeMo I 2025-05-08 20:19:32 features:305] PADDING: 0
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
    Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
    Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
    Cuda graphs with while loops are disabled, decoding speed will be slower
    Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:37 rnnt_models:226] Using RNNT Loss : tdt
    Loss tdt_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0, 'durations': [0, 1, 2, 3, 4], 'sigma': 0.02, 'omega': 0.1}
[NeMo W 2025-05-08 20:19:37 tdt_loop_labels_computer:305] No conditional node support for Cuda.
    Cuda graphs with while loops are disabled, decoding speed will be slower
    Reason: CUDA is not available
[NeMo I 2025-05-08 20:19:41 save_restore_connector:275] Model EncDecRNNTBPEModel was successfully restored from /app/asr.nemo.
Transcribing: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:01<00:00,  1.92s/it]

But from what I see in the logs, it seems to be running on CPU, not CUDA.
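
A check along these lines (a sketch; asr_model is the loaded model from the script above) should show whether torch sees the GPU at all, which the "Reason: CUDA is not available" lines above suggest it does not:

```python
# Confirm whether the restored model actually landed on the GPU
# (asr_model is assumed to be the model loaded in the script above).
import torch

print("CUDA available:", torch.cuda.is_available())
print("Model device:", next(asr_model.parameters()).device)

# If CUDA is available but the model is on CPU, move it explicitly before transcribing.
if torch.cuda.is_available():
    asr_model = asr_model.cuda()
```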
