
Ray backend creates per-worker venvs that fill disk and crash with concurrent experiments #331

@xhluca

Summary

When using parallel_backend="ray" (the default), Ray auto-packages the working directory and creates a fresh virtual environment per worker in a temporary directory. For projects with heavy dependencies (e.g., PyTorch ~12GB), this causes:

  1. Disk exhaustion: Each Ray cluster creates a full venv copy (~12GB+). With 4-10 concurrent experiments, this can consume 50-120GB+ in /tmp, filling the root partition.
  2. Worker startup hangs: Workers hang during uv sync / pip install in the temp venv, producing repeated worker_pool.cc: Some workers of the worker process have not registered within the timeout errors.
  3. GCS crashes: When too many clusters compete for resources, Ray's GCS (Global Control Store) becomes unresponsive, causing Failed to connect to GCS within 60 seconds and terminating experiments.
  4. AF_UNIX socket path limit: If the temp directory path is long (e.g., a scratch filesystem), the Unix socket path exceeds the 107-byte limit, causing OSError: AF_UNIX path length cannot exceed 107 bytes.

Reproduction

  1. Have a project with pyproject.toml that depends on PyTorch (or similar large packages)
  2. Launch 3+ concurrent Study.run() calls with parallel_backend="ray" (default)
  3. Observe /tmp filling up with ray_* directories, each containing a full venv

Root Cause

In agentlab/experiments/launch_exp.py:85:

ray.init(num_cpus=n_jobs)

This bare ray.init() causes Ray to auto-detect the working directory (which contains pyproject.toml) and package it for workers. Each worker then runs uv sync to create a fresh venv in the Ray temp directory, re-installing all dependencies from scratch.
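A minimal sketch of what a more explicit init might look like (this is an illustrative fragment, not the project's actual code: `n_jobs` stands in for the value passed into `launch_exp.py`, and whether an explicit `working_dir: None` fully suppresses packaging depends on the Ray version and on any `RAY_RUNTIME_ENV_HOOK` active in the environment):

```python
import ray

n_jobs = 4  # illustrative; in launch_exp.py this comes from the caller

# Sketch: initialize Ray with an explicit runtime_env so the working
# directory is not auto-packaged and shipped to workers. An active
# RAY_RUNTIME_ENV_HOOK (e.g., the uv integration) may still override this.
ray.init(
    num_cpus=n_jobs,
    runtime_env={"working_dir": None},  # do not upload the project directory
)
```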

Key behaviors:

  • Ray creates a new temp directory per ray.init() call (each experiment gets its own cluster)
  • Each cluster's workers build an independent venv copy
  • Failed/completed experiments leave their temp directories behind (no cleanup)
  • ray.shutdown() in launch_exp.py:89 does not clean up the temp directory

Impact

  • Experiments silently fail with ENOSPC errors (Playwright can't create browser profiles when disk is full)
  • Hundreds of tasks get recorded as errors that are actually disk-full failures, requiring full reruns
  • The problem compounds: each relaunch creates additional temp directories

Workaround

Using parallel_backend="joblib" avoids Ray entirely and doesn't have this issue. However, joblib doesn't support Ray's task graph execution (dependency tracking between tasks).

Another workaround is to set RAY_TMPDIR to a large filesystem and create isolated Ray temp dirs there, but this hits the AF_UNIX 107-byte socket path limit if the path is too long.
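The 107-byte ceiling can be checked up front before pointing RAY_TMPDIR at a deep scratch path. A small self-contained helper, assuming a rough allowance for the session directory and socket file name Ray appends under the temp dir (the suffix length here is an illustrative guess, not a value taken from Ray's source):

```python
import os

# On Linux, sockaddr_un.sun_path is 108 bytes including the trailing NUL,
# leaving 107 usable bytes for the socket path.
AF_UNIX_PATH_MAX = 107

def tmpdir_fits_af_unix(tmpdir: str, socket_suffix_len: int = 60) -> bool:
    """Return True if sockets created under `tmpdir` should stay within the
    AF_UNIX limit. `socket_suffix_len` is a rough allowance for whatever
    Ray appends (session dir + socket file name); adjust as needed."""
    path_len = len(os.path.abspath(tmpdir).encode())
    return path_len + socket_suffix_len <= AF_UNIX_PATH_MAX

print(tmpdir_fits_af_unix("/tmp"))                              # True
print(tmpdir_fits_af_unix("/very/long/scratch/" + "x" * 120))   # False
```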

Suggested Fix

  1. Disable Ray's auto-packaging by setting runtime_env={"worker_process_setup_hook": ...} or RAY_RUNTIME_ENV_HOOK to prevent venv creation in workers
  2. Or pass runtime_env={"py_modules": [...]} with only the necessary module instead of the full working directory
  3. Or set ray.init(runtime_env={"working_dir": None}) to prevent auto-packaging
  4. Add temp-directory cleanup in the finally block after ray.shutdown(), so the session's Ray temp directory is removed once the experiment finishes

Environment

  • AgentLab v0.4.0
  • Ray 2.51.1
  • Python 3.12
  • Ubuntu 22.04
  • Dependencies include PyTorch 2.8.0 (~12GB installed)
