Summary
When using `parallel_backend="ray"` (the default), Ray auto-packages the working directory and creates a fresh virtual environment per worker in a temporary directory. For projects with heavy dependencies (e.g., PyTorch, ~12GB installed), this causes:
- Disk exhaustion: Each Ray cluster creates a full venv copy (~12GB+). With 4-10 concurrent experiments, this can consume 50-120GB+ in `/tmp`, filling the root partition.
- Worker startup hangs: Workers hang during `uv sync`/`pip install` in the temp venv, producing repeated `worker_pool.cc: Some workers of the worker process have not registered within the timeout` errors.
- GCS crashes: When too many clusters compete for resources, Ray's GCS (Global Control Store) becomes unresponsive, causing `Failed to connect to GCS within 60 seconds` and terminating experiments.
- AF_UNIX socket path limit: If the temp directory path is long (e.g., on a scratch filesystem), the Unix socket path exceeds the 107-byte limit, causing `OSError: AF_UNIX path length cannot exceed 107 bytes`.
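The socket path limit in particular is easy to verify in isolation. The sketch below is standalone (it does not touch Ray or AgentLab; `af_unix_path_ok` is a hypothetical helper) and simply attempts to bind an AF_UNIX socket at a given path:

```python
import os
import socket

def af_unix_path_ok(path: str) -> bool:
    """Return True if an AF_UNIX socket can bind at `path`.

    On Linux, sockaddr_un.sun_path holds 108 bytes (107 usable plus
    the NUL terminator) -- the limit that Ray's socket paths run into.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(path)
        return True
    except OSError:
        return False
    finally:
        s.close()
        if os.path.exists(path):
            os.unlink(path)
```

A short path such as `/tmp/s` binds fine, while a ~200-character scratch-filesystem path fails with the same `OSError` reported above.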
Reproduction
- Have a project with a `pyproject.toml` that depends on PyTorch (or similar large packages)
- Launch 3+ concurrent `Study.run()` calls with `parallel_backend="ray"` (the default)
- Observe `/tmp` filling up with `ray_*` directories, each containing a full venv
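To confirm the last step, the disk usage of the per-cluster temp directories can be tallied with a short diagnostic script (`ray_tmp_usage` is a hypothetical helper, not part of AgentLab or Ray):

```python
import os

def ray_tmp_usage(root: str = "/tmp") -> dict:
    """Map each ray_* directory under `root` to its total size in bytes."""
    usage = {}
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if name.startswith("ray_") and os.path.isdir(path):
            total = 0
            for dirpath, _, files in os.walk(path):
                for fname in files:
                    fpath = os.path.join(dirpath, fname)
                    if os.path.isfile(fpath):
                        total += os.path.getsize(fpath)
            usage[path] = total
    return usage
```

With several concurrent experiments running, each entry grows toward the full venv size, which is what exhausts the root partition.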
Root Cause
In `agentlab/experiments/launch_exp.py:85`:

```python
ray.init(num_cpus=n_jobs)
```

This bare `ray.init()` causes Ray to auto-detect the working directory (which contains `pyproject.toml`) and package it for workers. Each worker then runs `uv sync` to create a fresh venv in the Ray temp directory, re-installing all dependencies from scratch.
Key behaviors:
- Ray creates a new temp directory per `ray.init()` call (each experiment gets its own cluster)
- Each cluster's workers build an independent venv copy
- Failed or completed experiments leave their temp directories behind (no cleanup); the `ray.shutdown()` in `launch_exp.py:89` does not remove the temp directory
Impact
- Experiments silently fail with `ENOSPC` errors (Playwright cannot create browser profiles when the disk is full)
- Hundreds of tasks are recorded as errors that are actually disk-full failures, requiring full reruns
- The problem compounds: each relaunch creates additional temp directories
Workaround
Using `parallel_backend="joblib"` avoids Ray entirely and does not have this issue. However, joblib does not support Ray's task graph execution (dependency tracking between tasks).
Another workaround is to set `RAY_TMPDIR` to a large filesystem and create isolated Ray temp dirs there, but this hits the AF_UNIX 107-byte socket path limit if the path is too long.
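A common way to dodge the 107-byte limit while still using the large filesystem is to reach the long scratch path through a short symlink under `/tmp`. A minimal sketch (`short_tmpdir` is a hypothetical helper, not AgentLab code):

```python
import os
import tempfile

def short_tmpdir(target: str, link_root: str = "/tmp") -> str:
    """Expose a long scratch directory through a short /tmp symlink,
    so socket paths created beneath it stay under the AF_UNIX limit."""
    os.makedirs(target, exist_ok=True)
    holder = tempfile.mkdtemp(prefix="rt_", dir=link_root)
    short = os.path.join(holder, "d")
    os.symlink(target, short)
    return short
```

The returned short path could then be used as the Ray temp dir (e.g., via `RAY_TMPDIR`). One caveat to verify: if Ray resolves symlinks before constructing its socket paths, the trick would not help.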
Suggested Fix
- Disable Ray's auto-packaging by setting `runtime_env={"worker_process_setup_hook": ...}` or `RAY_RUNTIME_ENV_HOOK` to prevent venv creation in workers
- Or pass `runtime_env={"py_modules": [...]}` with only the necessary module instead of the full working directory
- Or set `ray.init(runtime_env={"working_dir": None})` to prevent auto-packaging
- Additionally, clean up the Ray temp directory in a `finally` block after `ray.shutdown()`
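The cleanup part of the fix could be packaged as a context manager so the temp directory is removed even when a run fails. A sketch assuming AgentLab controls the `ray.init()` call site (`ray_temp_dir` is hypothetical; the Ray calls are left as comments so the sketch stays self-contained):

```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def ray_temp_dir(prefix: str = "ray_", root: str = "/tmp"):
    """Yield a dedicated Ray temp directory and remove it on exit,
    covering the cleanup that ray.shutdown() does not perform."""
    tmp = tempfile.mkdtemp(prefix=prefix, dir=root)
    try:
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)

# Intended use (Ray calls commented out to keep the sketch runnable):
# with ray_temp_dir() as tmp:
#     ray.init(num_cpus=n_jobs, _temp_dir=tmp)
#     try:
#         ...  # run experiments
#     finally:
#         ray.shutdown()
```

Pinning `_temp_dir` to a directory the launcher owns also keeps concurrent experiments from scattering clusters across `/tmp`.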
Environment
- AgentLab v0.4.0
- Ray 2.51.1
- Python 3.12
- Ubuntu 22.04
- Dependencies include PyTorch 2.8.0 (~12GB installed)