
feat: add Docker Compose deployment, refactor config paths and mock modes to CLI #6

Open: hwei0 wants to merge 17 commits into NetSys:main from hwei0:main

Conversation

hwei0 (Collaborator) commented Mar 1, 2026

Description

This PR adds a full Docker Compose deployment setup for TURBO, refactors how config paths (ZMQ sockets, log directories) are resolved at runtime, and moves mock mode toggles from YAML config fields to CLI flags.

Pre-built Docker image workflow (new)

  • Adds a root-level compose.yaml and .env.example that pull pre-built images from DockerHub, eliminating the need for local builds, SSL key generation, or Rust toolchain installation
  • The Quick Start now defaults to docker compose pull + docker compose up (no --build), reducing setup to: clone, download data, configure .env, and run
  • The existing docker/ directory (Dockerfiles, build-oriented compose.yaml, docker/.env.example) is preserved as the "build from source" workflow for development/customization
  • docker/README.Docker.md rewritten to focus on building from source as the secondary path, with shared config reference and troubleshooting kept in place

Docker Compose deployment

  • Adds 4 multi-stage Dockerfiles (Python base/binary, Rust base, QUIC binary) and a compose.yaml orchestrating 7 services across client and server profiles
  • Services start in dependency order via health checks: client_python_main writes /health/client_main_ready (only after all Client processes have bound their quic_rcv_zmq_socket), then client_python_monitor writes /health/monitor_ready, then quic_client starts. This is coordinated via a multiprocessing.Manager().Queue() for cross-process ZMQ bind readiness
  • All services use ipc: host for ZeroMQ IPC and POSIX shared memory across containers, init: true (tini) for signal forwarding, and a custom bridge network (10.64.89.0/24) for QUIC UDP
  • Both Python orchestrators now handle SIGTERM in addition to SIGINT for Docker graceful shutdown
  • Docker-specific YAML configs (docker/config/) mirror the manual configs with container-internal paths (/app/...)
  • .env.example provides all configurable variables (host UID/GID, model/eval data paths, mock mode toggles, networking)
  • Docker Compose Watch support for hot-reload during development
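
A minimal sketch of the readiness handshake described in the list above, assuming each Client process puts its name on a shared queue once its socket is bound. Only the `multiprocessing.Manager().Queue()` mechanism and the `/health/client_main_ready` path come from this description; the function names, the worker names, and the 30-second timeout are hypothetical.

```python
import multiprocessing as mp
import queue
import time
from pathlib import Path


def client_worker(name, ready_queue):
    """Stand-in for one Client process (hypothetical)."""
    # A real client would bind its quic_rcv_zmq_socket here, then report.
    ready_queue.put(name)


def wait_until_all_bound(ready_queue, expected, timeout_s=30.0):
    """Block until every name in `expected` has reported, else raise TimeoutError."""
    pending = set(expected)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"still waiting for: {sorted(pending)}")
        try:
            pending.discard(ready_queue.get(timeout=remaining))
        except queue.Empty:
            raise TimeoutError(f"still waiting for: {sorted(pending)}") from None


def mark_healthy(path):
    """Create the file that the container health check looks for."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


def main(health_file="/health/client_main_ready"):
    names = ["service1", "service2"]  # hypothetical service names
    with mp.Manager() as manager:
        ready_queue = manager.Queue()  # proxy queue, usable across processes
        procs = [mp.Process(target=client_worker, args=(n, ready_queue)) for n in names]
        for p in procs:
            p.start()
        wait_until_all_bound(ready_queue, names)
        mark_healthy(health_file)  # written only after all binds are reported
        for p in procs:
            p.join()
```

Call `main()` from under an `if __name__ == "__main__":` guard so that the spawned processes do not re-run it on import. A dependent service would then gate its start on a health check that tests for the file.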

GPU support made optional via compose override

  • Removed hardcoded NVIDIA GPU reservations from both client_python_main (never needed — client uses PyTorch for CPU tensor ops only, no torch.cuda usage) and server_python_main
  • GPU reservations for the server are now in separate compose.gpu.yaml override files (root and docker/), included only on GPU hosts via -f compose.gpu.yaml
  • Non-GPU hosts can run the server with mock inference without the NVIDIA Container Toolkit: docker compose --profile server up
  • GPU hosts include the override: docker compose -f compose.yaml -f compose.gpu.yaml --profile server up
  • Fixes Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]] on non-GPU hosts
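
For scripted launches, the two invocations above could be assembled by a small helper. This is a sketch only: `compose_command` is a hypothetical name and not part of this PR, and it simply reproduces the two command lines listed above.

```python
def compose_command(profile, gpu=False,
                    base_file="compose.yaml", gpu_file="compose.gpu.yaml"):
    """Build the `docker compose ... up` argument list for one profile."""
    cmd = ["docker", "compose"]
    if gpu:
        # GPU hosts layer the override file on top of the base file.
        cmd += ["-f", base_file, "-f", gpu_file]
    cmd += ["--profile", profile, "up"]
    return cmd


print(" ".join(compose_command("server")))
print(" ".join(compose_command("server", gpu=True)))
```

The returned list can be passed straight to `subprocess.run` without a shell.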

Documentation improvements

  • Updated README.md to recommend pre-built Docker images as the default, with detailed prerequisites including:
    • Docker 28 incompatibility warning (nf_tables/iptables issue; Docker 27 recommended)
    • Disk space requirements (client ~50 GB, server ~13 GB)
    • Clarified that NVIDIA GPU drivers + Container Toolkit are server-only (not needed for client or mock inference)
    • USB webcam prerequisites (client-only, with mock camera alternative)
    • Docker Compose V2 plugin installation instructions
    • "No GPU?" guidance for running with MOCK_INFERENCE=true
  • Added per-step annotations for multi-machine deployments ("both client and server", "server only", "client only") across all setup steps
  • .env table now shows quickstart default values (~/av-models, ~/full-eval, ~/experiment2-out) instead of "(must set)"
  • Clarified mock mode data requirements: full-eval evaluation data is always required on the client (even in mock camera mode — the bandwidth allocator needs it for utility curve computation); av-models model checkpoints are not needed in mock inference mode
  • Added "Data Required" column to the mock mode combination table
  • Manual (non-Docker) setup collapsed into a <details> block with deduplicated prerequisites
  • docker/README.Docker.md streamlined — duplicated setup steps replaced with cross-references to main README; removed erroneous Rust host prerequisite (Rust compiles inside the Docker build stage)

Config path refactoring (Python + Rust + YAML)

  • ZMQ socket paths: Changed from hardcoded full ipc:///absolute/path/socket-name to bare names (e.g., service1-camera-socket) in all YAML configs. The Python orchestrators (client_main.py, server_main.py) and web dashboard (web_config.py) resolve them at runtime to ipc://<zmq_dir>/<name>. The Rust QUIC binaries already used this pattern and were updated to match the renamed config key (zmq_pathdir → zmq_dir)
  • Log output directories: Removed per-component client_savedir, camera_savedir, ping_savedir, server_log_savedir, quic_client_log_path, and quic_server_log_path from YAML. Each orchestrator now auto-creates a timestamped run directory (e.g., experiment_output_dir/client_main_2024-01-15_10-30-00/client/) and injects the save path into component configs at startup. The Rust QUIC binaries now also create timestamped subdirectories using chrono
  • Server IP for ping: Removed DST_IP YAML field. Server IP is now extracted from the new required -s/--server_address CLI argument on client_main.py
  • New top-level config keys: experiment_output_dir, zmq_dir, client_subdir/server_subdir, quic_client_log_subdir/quic_server_log_subdir
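
A minimal sketch of the two resolution steps described above, assuming the ipc://<zmq_dir>/<name> pattern and the timestamp layout shown in the example path. The helper names `resolve_zmq_endpoint` and `make_run_dir` are hypothetical; the orchestrators' actual code may differ.

```python
from datetime import datetime
from pathlib import Path


def resolve_zmq_endpoint(zmq_dir, socket_name):
    """Turn a bare socket name from YAML into a full ipc:// endpoint."""
    return f"ipc://{Path(zmq_dir) / socket_name}"


def make_run_dir(experiment_output_dir, run_prefix, subdir, now=None):
    """Create <output_dir>/<prefix>_<timestamp>/<subdir>/ and return its path."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")  # e.g. 2024-01-15_10-30-00
    run_dir = Path(experiment_output_dir) / f"{run_prefix}_{stamp}" / subdir
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

For instance, `resolve_zmq_endpoint("/tmp/zmq", "service1-camera-socket")` yields `ipc:///tmp/zmq/service1-camera-socket`, where `/tmp/zmq` is only an example value for zmq_dir. The returned run directory is what would be injected into each component config as its save path.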

Mock mode refactoring

  • Mock camera (client_main.py --mock-camera) and mock inference (server_main.py --mock-inference) are now toggled via CLI flags instead of setting null vs a path in YAML. The YAML always specifies the mock data paths; the CLI flag controls whether they're used
  • In Docker, mock modes are toggled via MOCK_CAMERA/MOCK_INFERENCE environment variables in .env
  • Added mock_model_latency_csv_path to ModelServerConfig — mock inference now simulates realistic per-model latency from experiment_model_info.csv instead of a hardcoded 50ms sleep
  • Renamed mock output file from .npz to .npy and switched from np.fromfile() to np.load()
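
A minimal `argparse` sketch of the flag-based toggle, assuming the client flags named in this description (`-s/--server_address`, `--mock-camera`). The helpers `camera_source` and `flags_from_env`, and the convention that `None` means "use the real camera", are hypothetical; how the Docker entrypoint actually maps MOCK_CAMERA to the flag is not shown here.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="client_main.py flags (sketch)")
    parser.add_argument("-s", "--server_address", required=True,
                        help="address of the server")
    parser.add_argument("--mock-camera", action="store_true",
                        help="replay mock data instead of reading a camera")
    return parser


def camera_source(args, mock_data_path):
    """The YAML always carries the mock path; the flag decides whether to use it."""
    return mock_data_path if args.mock_camera else None  # None: real camera (assumed)


def flags_from_env(env):
    """Hypothetical mapping from a .env-style variable to the CLI flag."""
    return ["--mock-camera"] if env.get("MOCK_CAMERA", "").lower() == "true" else []
```

Keeping the mock paths in YAML and the switch on the command line means the same config file serves both real and mock runs.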

Other changes

  • Cargo.lock now tracked in version control — removed **/Cargo.lock from .gitignore and committed the lockfile to ensure reproducible Rust builds across environments
  • Added panic hooks with backtrace capture in both Rust QUIC binaries
  • Error callbacks in Python orchestrators now log full stack traces
  • start_web_dashboard.py: Added -c short flag for --config
  • web_frontend.py: Added allow_unsafe_werkzeug=True for running inside Docker containers
  • Dockerfile UID/GID defaults normalized to 1000 across all images; groupadd moved from base to binary stage
  • Updated CONFIGURATION.md, IPC.md, and LOGGING.md to reflect all the above changes
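
As an illustration of the stack-trace logging mentioned above, an error callback can pass the exception to `logging` through `exc_info` so that the traceback is emitted along with the message. This is a sketch under assumptions: the logger name `orchestrator` and the function `log_worker_error` are hypothetical, and the PR's actual callbacks are not shown here.

```python
import logging

logger = logging.getLogger("orchestrator")  # hypothetical logger name


def log_worker_error(exc):
    """Error callback: log the exception together with its traceback."""
    logger.error(
        "worker failed: %r",
        exc,
        # An explicit (type, value, traceback) tuple works outside an except block.
        exc_info=(type(exc), exc, exc.__traceback__),
    )
```

A callback of this shape can be supplied wherever an API accepts a one-argument error handler, for example the `error_callback` parameter of `multiprocessing.Pool.apply_async`.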

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Verified connectivity on Google Cloud using mock inference (no GPU) and mock cameras. Also verified connectivity from a laptop to Google Cloud using mock modes.

Checklist:

  • My commit title follows conventional commit guidelines
  • I have made corresponding changes to the documentation
  • I have tested my changes, confirming that my feature works

hwei0 enabled auto-merge (squash) March 2, 2026 11:12
