
pynvml unknown error handling #58

@sehoffmann


------- Training failed with an exception -------
Traceback (most recent call last):
  File "/mnt/lustre/work/martius/mot030/code/dmlcloud/dmlcloud/core/pipeline.py", line 282, in run
    self._pre_run()
  File "/mnt/lustre/work/martius/mot030/code/dmlcloud/dmlcloud/core/pipeline.py", line 318, in _pre_run
    callback.pre_run(self)
  File "/mnt/lustre/work/martius/mot030/code/dmlcloud/dmlcloud/core/callbacks.py", line 587, in pre_run
    handle = torch.cuda._get_pynvml_handler(pipe.device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/lustre/work/martius/mot030/conda/envs/torch-nightly/lib/python3.12/site-packages/torch/cuda/__init__.py", line 1073, in _get_pynvml_handler
    handle = pynvml.nvmlDeviceGetHandleByIndex(device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/lustre/work/martius/mot030/conda/envs/torch-nightly/lib/python3.12/site-packages/pynvml.py", line 2604, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/mnt/lustre/work/martius/mot030/conda/envs/torch-nightly/lib/python3.12/site-packages/pynvml.py", line 1042, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_Unknown: Unknown Error
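
One possible way to handle this, sketched under assumptions: the traceback shows the NVML handle lookup in `pre_run` raising `pynvml.NVMLError_Unknown` and aborting the whole run. A small wrapper could catch the error, log it, and return `None` so that the callback can skip GPU stats instead of failing training. The helper name `try_get_nvml_handle` and its parameters are hypothetical and not part of dmlcloud, torch, or pynvml.

```python
import logging

logger = logging.getLogger(__name__)


def try_get_nvml_handle(get_handle, device, error_types=(Exception,)):
    """Hypothetical helper: return an NVML handle, or None if the lookup fails.

    get_handle  -- callable doing the lookup, e.g. torch.cuda._get_pynvml_handler
    device      -- device argument forwarded to get_handle
    error_types -- exception classes treated as "NVML unavailable"
    """
    try:
        return get_handle(device)
    except error_types as exc:
        # Degrade gracefully: report the problem and let the caller skip GPU stats.
        logger.warning("NVML handle lookup failed for %r: %s", device, exc)
        return None


# Assumed usage inside the callback's pre_run (not verified against dmlcloud):
#
#   import pynvml, torch
#   handle = try_get_nvml_handle(
#       torch.cuda._get_pynvml_handler, pipe.device, (pynvml.NVMLError,)
#   )
#   if handle is None:
#       ...  # disable GPU monitoring for this run
```

Catching `pynvml.NVMLError` should also cover `NVMLError_Unknown`, since the traceback shows it being raised via `raise NVMLError(ret)`. Whether other exception types should be caught here as well is left open.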


Labels: bug (Something isn't working)
