Skip to content

feat: Python version-aware Docker image selection#261

Open
deanq wants to merge 5 commits intomainfrom
deanq/ae-2391-python-versions-for-workers
Open

feat: Python version-aware Docker image selection#261
deanq wants to merge 5 commits intomainfrom
deanq/ae-2391-python-versions-for-workers

Conversation

@deanq
Copy link
Member

@deanq deanq commented Mar 6, 2026

Summary

  • Add python_version field to ServerlessResource base model with validation
  • Add get_image_name() and validate_python_version() to constants.py
  • GPU images require Python 3.11+ (no PyTorch CUDA 12.8 for 3.10), CPU supports 3.10-3.12
  • Default Python version: 3.11 (matches GPU PyTorch base image)
  • Live* classes use get_image_name() for dynamic, version-aware image resolution
  • python_version included in _hashed_fields (forces re-deploy on version change)
  • flash build validates local Python version against supported range
  • Manifest includes python_version for traceability
  • pyproject.toml narrowed to >=3.10,<3.13
  • Env var overrides (FLASH_GPU_IMAGE, etc.) continue to bypass version logic

Context

AE-2391: Python version mismatch between user build environment and worker runtime causes silent failures with binary packages. This PR adds version-aware image selection so the SDK picks the correct runtime image for the user's Python version.

Companion PR: runpod-workers/flash (versioned Docker images)

Test plan

  • 24 tests for get_image_name() covering all type/version combinations and env var overrides
  • 6 tests for python_version validation on ServerlessResource
  • 10 tests for Live* class default image assertions (py3.11)
  • Manifest test includes python_version field
  • make quality-check passes (84.71% coverage)

Auto-detect user's Python version at build/deploy time and select
the matching runtime image. Supports 3.10-3.12 (3.13+ not stable).

SDK changes:
- Add get_image_name() for versioned tag resolution (py3.11-latest)
- Add python_version field to ServerlessResource (in _hashed_fields)
- Wire LiveServerless/LiveLoadBalancer to use get_image_name()
- Add build-time validation with actionable error for unsupported versions
- Record python_version in flash_manifest.json
- Narrow pyproject.toml to >=3.10,<3.13

GPU images require Python 3.11+ (no PyTorch CUDA 12.8 image for 3.10).
CPU images support 3.10-3.12. Env var overrides (FLASH_GPU_IMAGE etc.)
bypass version logic entirely.

Tag format: runpod/flash:py3.11-latest (latest alias kept for py3.12)
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a python_version concept across serverless resource models and build tooling to support Python-version-aware runtime Docker image selection (motivated by avoiding local build/runtime ABI mismatches).

Changes:

  • Added Python version support + validation (python_version on ServerlessResource, helpers in constants.py) and switched Live* resources to resolve versioned image names dynamically.
  • Updated CLI build/manifest generation to record and validate the local Python minor version.
  • Expanded/adjusted unit tests to cover image resolution, version validation, and manifest traceability.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/runpod_flash/core/resources/constants.py Adds supported Python versions and get_image_name()/validate_python_version() helpers; updates default image constants to include pyX.Y.
src/runpod_flash/core/resources/serverless.py Adds python_version field, validation, and includes it in hashed/input-only sets for drift/redeploy behavior.
src/runpod_flash/core/resources/live_serverless.py Live* classes now derive imageName via get_image_name() using python_version or default.
src/runpod_flash/cli/commands/build.py Validates local Python minor version before dependency installation.
src/runpod_flash/cli/commands/build_utils/manifest.py Adds python_version field to the generated manifest for traceability.
pyproject.toml Narrows supported Python range to >=3.10,<3.13.
tests/unit/core/resources/test_constants.py New tests for version support + image name resolution + env var override behavior.
tests/unit/resources/test_serverless.py Adds tests for ServerlessResource.python_version validation and hashing/input-only membership.
tests/unit/resources/test_live_serverless.py Adds tests asserting Live* default/explicit python-version image selection and GPU constraints.
tests/unit/resources/test_live_load_balancer.py Updates expected LiveLoadBalancer image tag format to include py3.11-....
tests/unit/cli/commands/build_utils/test_manifest.py Ensures manifest includes the current Python minor version.
tests/unit/test_dotenv_loading.py Updates env-var import location for image constants.
tests/unit/core/__init__.py / tests/unit/core/resources/__init__.py Adds package initializers to support test module imports.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Contributor

@KAJdev KAJdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when deploying I'm always getting the default 3.11 image (even when deploying within a 3.12 environment with 3.12 in the manifest)

what i think is happening is when running flash deploy, create_resource_from_manifest doesn't extract python_version from the manifest and pass it to the ServerlessResource constructors, so they default to 3.11.

It also seems like its not exposed on Endpoint so users wouldn't be able to override it if they wanted to (not sure if this is intentional or not, just something to note)

- Move env-var override check before version validation in get_image_name
- Add regression tests for override bypassing unsupported versions
- Fix python_version field description (no auto-detection at field level)
- Fix build error message to recommend switching local Python
- Rename mismatched test (test_default_python_version_is_3_11)
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Python version check silently skipped for projects with no dependencies

The version validation only runs when requirements is non-empty — it lives inside install_dependencies(), which is gated behind if requirements: at build.py:312.

A user on Python 3.13 with no requirements.txt and no dependencies=[...] on their @remote decorator will run flash build, see it succeed, deploy their worker, and then hit cryptic cloudpickle deserialization failures at runtime. No warning anywhere. The entire point of this PR is to catch exactly this mismatch, but it misses the most common case — a simple worker with no extra deps.

We can demonstrate this directly: calling install_dependencies(build_dir, [], no_deps=False) with sys.version_info mocked to 3.13 returns True. The same call with ["requests"] returns False. Same Python version, opposite outcomes, depending purely on whether the requirements list is empty.

Fix: call validate_python_version(f"{sys.version_info.major}.{sys.version_info.minor}") unconditionally before the if requirements: block.


Issue: Manifest records the build machine's Python, not the deployed image's Python

manifest.py writes sys.version_info (local Python) as python_version. But if a user explicitly sets python_version="3.12" on their resource config and builds on a 3.11 machine, the manifest records 3.11 while the deployed image is runpod/flash:py3.12-latest.

We can show this concretely: create a LiveServerless with python_version="3.12" (image correctly resolves to py3.12), call ManifestBuilder.build() with sys.version_info patched to 3.11, and the manifest comes back with python_version: "3.11" — not "3.12". The manifest is wrong from the moment it's written.

This is a new field on a new schema — the moment any tooling reads manifest.python_version to understand what's running in production, it gets the wrong answer. Easier to record it correctly now than to migrate it later.

Fix: pass the resource's python_version (or DEFAULT_PYTHON_VERSION if unset) into the manifest builder rather than using sys.version_info.


Nit: Two error messages shown for one failure

When Python version validation fails inside install_dependencies(), the function prints a detailed 3-line explanation and returns False. The caller then prints a second message: "Error: Failed to install dependencies". Calling install_dependencies directly with 3.13 and a non-empty requirements list confirms both: the version error is printed and the function returns False, at which point run_build adds its own redundant banner.

Fix: since install_dependencies() already handles its own error output on version failure, remove the generic "Error: Failed to install dependencies" print from the caller, or raise an exception instead of returning False so the caller can distinguish version errors from pip errors.


🤖 Reviewed by Henrik's AI-Powered Bug Finder

Prevents python_version from leaking into GraphQL payloads for
CpuServerlessEndpoint and CpuLoadBalancerSlsResource.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +253 to +261
@field_validator("python_version")
@classmethod
def validate_python_version(cls, v: Optional[str]) -> Optional[str]:
if v is not None and v not in SUPPORTED_PYTHON_VERSIONS:
supported = ", ".join(SUPPORTED_PYTHON_VERSIONS)
raise ValueError(
f"Python {v} is not supported. Supported versions: {supported}"
)
return v
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServerlessResource.validate_python_version() duplicates the validation logic already implemented in runpod_flash.core.resources.constants.validate_python_version(). To avoid the supported-version list and error message drifting over time, consider importing and calling the shared helper here (or otherwise centralizing this validation in one place).

Copilot uses AI. Check for mistakes.
- Move Python version validation before `if requirements:` so no-dep
  projects are also checked (was silently skipped)
- Pass python_version to ManifestBuilder instead of reading sys.version_info
  at manifest build time
- Remove duplicate version check from install_dependencies (now done
  unconditionally in run_build)
@deanq
Copy link
Member Author

deanq commented Mar 6, 2026

@runpod-Henrik All three issues from your review addressed in commit 1c85da4:

Bug (version check skipped for no-dep projects): Moved validate_python_version() to the top of run_build(), before the if requirements: gate. Now runs unconditionally regardless of whether the project has dependencies.

Issue (manifest records wrong Python): ManifestBuilder now accepts an explicit python_version parameter. run_build() passes the validated version through. Added test for explicit version override.

Nit (duplicate error messages): Removed the version check from install_dependencies() entirely — run_build() now validates once at the top and raises typer.Exit(1) on failure. The generic "Failed to install dependencies" message only fires for actual pip failures, not version mismatches.

create_resource_from_manifest now accepts python_version and includes
it in deployment_kwargs. All three call sites in deployment.py extract
python_version from the manifest and pass it through.

Fixes the bug where deploying from a 3.12 environment always selected
the default 3.11 image because python_version was never propagated.
@deanq
Copy link
Member Author

deanq commented Mar 6, 2026

@KAJdev Fixed in commit e3f90cf. The bug was exactly as you described: create_resource_from_manifest never extracted python_version from the manifest, so resources always defaulted to 3.11.

Fix: create_resource_from_manifest now accepts a python_version parameter. All three call sites in deployment.py extract python_version from the manifest top-level field and pass it through to the resource constructor. A 3.12 manifest now correctly creates resources with python_version="3.12", which resolves to runpod/flash:py3.12-latest.

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test Plan Results — PR #261

Ran the full test plan against the PR branch across Python 3.10, 3.11, 3.12, and 3.13.

46 passed, 1 xfailed — results identical across all four Python versions.


What passes

  • Scenarios 1, 4, 5, 6, 7, existing users upgrading: all green — features implemented correctly
  • GPU rejects python_version='3.10', CPU accepts it — correct
  • validate_python_version rejects patch-level strings, empty string, shell metacharacters — correct
  • Config hash changes when python_version changes — correct
  • DEFAULT_PYTHON_VERSION used when no version specified — correct

BUG-261-A fixed ✓

validate_python_version() now runs in run_build() before the requirements gate. Python 3.13 + no-deps correctly exits 1 — the silent success bug is gone.

Implementation note: the check lives in run_build(), not install_dependencies(). install_dependencies() is version-agnostic by design — run_build() rejects unsupported versions before that function is ever called. Tests updated to match this design.

1 remaining gap (xfail)

No warning when local Python is higher than the target image Python (e.g. building on 3.12, targeting py3.11). Python 3.11 containers cannot load 3.12 bytecode — the worker will fail at runtime with a cloudpickle deserialization error. Documented as xfail for a follow-up PR.


🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #261: Python Version-Aware Docker Images

Reviewer: Henrik's AI-Powered Bug Finder
Testing: Full unit suite (46/46 passed, 1 xfailed pre-existing), all plan tests pass on pr-261 branch. Tested across Python 3.10.19, 3.11.14, 3.12.12, 3.13.11 — identical results on all four versions.


1. Context

Short-term guardrail for the Python version mismatch problem: user builds on Python 3.12, deployed runtime is Python 3.11 (Conda in the PyTorch base image), binary-incompatible packages (e.g. numpy) silently fail. This PR makes the SDK version-aware so the correct runtime image is selected and the manifest records what the build was compiled against. PR #260 (process injection) is the longer-term architectural solution and is intended to merge after this one.


Bug: Version check bypassed for no-deps projects — fixed correctly

On main, install_dependencies gates the Python version check behind if requirements:. A project with no requirements.txt skips the check entirely, exits 0, and produces an artifact for any Python version including unsupported ones. This PR fixes it by running validate_python_version in run_build() before install_dependencies is called — the check runs regardless of whether there are dependencies. Verified by the test suite.


2. Manifest python_version field

Recording the build machine's Python version in flash_manifest.json is the right call and remains useful even after PR #260 merges — it gives an audit trail of what the artifact was compiled against, independent of the image strategy. No concerns here.


Note: Image selection logic will be superseded by PR #260

The version-aware image picker (selecting the correct 3.10/3.11/3.12 image variant) becomes the non-primary path once injection is the default. Worth a comment in the image selection code noting this is the pre-injection path, so future readers understand why it exists alongside the injection mechanism.


Note: Supported version range

PR description mentions adjusting support from 3.10–3.14 down to 3.10–3.12 since 3.13 is not stable yet. The test suite confirms validate_python_version rejects 3.13. Looks correct.


Nits

  • DEFAULT_PYTHON_VERSION — worth a comment explaining how this value is chosen and when it should be updated (e.g. tied to the LTS image, bumped manually per release).
  • validate_python_version rejects patch-level strings (e.g. "3.11.14") — good defensive behaviour, but the error message could tell the user what format is expected ("3.11" not "3.11.14").

Verdict: PASS WITH NITS

The version check fix is correct and well-placed. Manifest field is useful long-term. Image selection logic is appropriately scoped as a short-term guardrail ahead of PR #260.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants