
feat(gpu): add WSL2 CDI spec watcher for GPU passthrough #608

Open
elezar wants to merge 2 commits into main from feat/wsl-cdi-spec-watcher

elezar (Member) commented Mar 25, 2026

Summary

Adds a CDI spec watcher for WSL2 GPU passthrough. On WSL2, the NVIDIA device plugin writes a CDI spec that references /dev/dxg but uses a device name incompatible with the index-based deviceIDStrategy. The watcher transforms this spec at startup and on subsequent updates so the device plugin can correctly enumerate GPUs in WSL2 environments.
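To see the mismatch on a WSL2 node, it can help to list the fully qualified CDI device names that a spec exposes. The helper below is a sketch and not part of this PR: `cdi_device_names` is a hypothetical name, and it assumes `jq` is installed and that the spec has the usual CDI layout with a top-level `kind` and a `devices` array. Based on the description above, an untransformed WSL2 spec would be expected to report a single name ending in `=all`, whereas the index strategy needs one ending in `=0`.

```shell
# Hypothetical helper (not from this PR): print each device in a CDI spec
# as a fully qualified name of the form <kind>=<name>.
cdi_device_names() {
    spec="$1"
    jq -r '.kind as $k | .devices[] | "\($k)=\(.name)"' "$spec"
}
```

It could be pointed at the spec file named under Changes, wherever the device plugin writes it on the node.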

Related Issue

Closes #404

Depends on #495 and #503.

Note: NVIDIA/k8s-device-plugin#1671 is a potential upstream fix for the CDI spec compatibility issue. If that PR lands, this workaround can be revisited.

Changes

  • deploy/docker/cluster-entrypoint.sh: adds a watch_cdi_specs function that transforms the device plugin's WSL CDI spec (k8s.device-plugin.nvidia.com-gpu.json) to set cdiVersion: 0.5.0 and normalize the device name to "0". The watcher runs as a background process when /dev/dxg is present.
  • deploy/docker/Dockerfile.images: adds inotify-tools and jq dependencies required by the watcher.
  • deploy/helm/nvidia-device-plugin-helmchart.yaml: sets deviceIDStrategy: index so device IDs align with the transformed CDI spec.
  • architecture/gateway-single-node.md: documents the WSL2 CDI spec watcher design and its interaction with the device plugin.
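
For reviewers who want a picture of the flow described in the first bullet, here is a minimal sketch of what such a watcher could look like. It is not the code in this PR: the `CDI_DIR` default of `/var/run/cdi`, the helper name `transform_cdi_spec`, the idempotency check, and the choice of inotify events are all assumptions for illustration. Only the spec filename, the use of `jq` and `inotifywait`, and the two field changes come from the description.

```shell
# Illustrative sketch only. CDI_DIR and transform_cdi_spec are assumed
# names; the real cluster-entrypoint.sh may differ.
CDI_DIR="${CDI_DIR:-/var/run/cdi}"
CDI_SPEC="k8s.device-plugin.nvidia.com-gpu.json"

# Rewrite one spec in place: pin cdiVersion and rename the first device to "0".
transform_cdi_spec() {
    spec="$1"
    # Skip specs already in the desired form, so that our own rewrite
    # does not retrigger the watcher indefinitely.
    if jq -e '.cdiVersion == "0.5.0" and .devices[0].name == "0"' "$spec" >/dev/null 2>&1; then
        return 0
    fi
    tmp="$(mktemp "${spec}.XXXXXX")" || return 1
    if jq '.cdiVersion = "0.5.0" | .devices[0].name = "0"' "$spec" > "$tmp"; then
        chmod 0644 "$tmp"      # mktemp creates the file with mode 0600
        mv "$tmp" "$spec"      # rename within the same directory
    else
        rm -f "$tmp"
        return 1
    fi
}

watch_cdi_specs() {
    mkdir -p "$CDI_DIR"
    # Handle a spec that is already present, e.g. after a gateway restart.
    if [ -f "$CDI_DIR/$CDI_SPEC" ]; then
        transform_cdi_spec "$CDI_DIR/$CDI_SPEC"
    fi
    # Then react to the spec being written or moved into place.
    inotifywait -m -q -e close_write -e moved_to --format '%f' "$CDI_DIR" |
    while read -r name; do
        if [ "$name" = "$CDI_SPEC" ]; then
            transform_cdi_spec "$CDI_DIR/$name"
        fi
    done
}
```

In an entrypoint, the watcher would only be started on WSL2, for example with `if [ -e /dev/dxg ]; then watch_cdi_specs & fi`. Writing to a temporary file and renaming it keeps readers from seeing a half-written spec, and the early return makes the transform safe to run repeatedly.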

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (not applicable)
  • E2E tests added/updated (not applicable; existing e2e tests should run on WSL2)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@elezar elezar self-assigned this Mar 25, 2026
elezar added 2 commits March 25, 2026 15:15
On WSL2 hosts the NVIDIA device plugin generates a CDI spec that cannot
be used directly by k3s containerd because it contains a single device
named "all" rather than one named by the index or UUID of the device.

Add a background watch_cdi_specs function to cluster-entrypoint.sh that:
- detects WSL2 via /dev/dxg presence
- handles specs already present at gateway restart
- uses inotifywait to watch for new/updated specs
- transforms the spec with jq (cdiVersion=0.5.0, devices[0].name="0")

Add inotify-tools and jq to the cluster image apt-get install block to
support the watcher.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
…architecture

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/wsl-cdi-spec-watcher branch from 2f1232c to 31ea520 Compare March 25, 2026 14:16
@elezar elezar requested a review from pimlock March 25, 2026 14:38
@elezar elezar marked this pull request as ready for review March 25, 2026 18:05
@elezar elezar requested a review from a team as a code owner March 25, 2026 18:05
elezar (Member, Author) commented Mar 25, 2026

I'm marking this as ready, but it depends on the mentioned PR.

Development

Successfully merging this pull request may close these issues.

bug: GPU passthrough fails on WSL2 — NVML init fails without CDI mode and libdxcore.so
