feat(cli,nodejs): add daemon process with ocap daemon CLI #843
Conversation
…tils Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…patch The system console vat manages a REPL loop over an IO channel, dispatching CLI commands (help, status, launch, terminate, subclusters, listRefs, revoke) and managing refs in persistent baggage. Refs use a monotonic counter (d-1, d-2, ...) since crypto.randomUUID() is unavailable under SES lockdown. Cross-vat errors are serialized via JSON.stringify fallback for reliable error reporting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
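A minimal sketch of the two mechanisms described above, with names assumed for illustration (this console vat is later removed in the PR): a baggage-backed monotonic ref counter and a JSON.stringify fallback for cross-vat errors.

```typescript
// Hypothetical sketch: names (Baggage, makeRefIdAllocator) are illustrative,
// not the vat's actual API.
type Baggage = Map<string, unknown>;

// Monotonic ref IDs ("d-1", "d-2", ...) persisted in baggage, since
// crypto.randomUUID() is unavailable under SES lockdown.
function makeRefIdAllocator(baggage: Baggage): () => string {
  if (!baggage.has('refCounter')) {
    baggage.set('refCounter', 0);
  }
  return () => {
    const next = (baggage.get('refCounter') as number) + 1;
    baggage.set('refCounter', next);
    return `d-${next}`;
  };
}

// Cross-vat errors may not serialize cleanly, so fall back to JSON.stringify
// of the salient fields, then to String().
function serializeError(error: unknown): string {
  if (error instanceof Error) {
    return JSON.stringify({ name: error.name, message: error.message });
  }
  try {
    return JSON.stringify(error) ?? String(error);
  } catch {
    return String(error);
  }
}
```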
Add startDaemon() which boots a kernel with a system console vat listening on a UNIX domain socket IO channel. The kernel process IS the daemon — no separate HTTP server. Includes socket channel fix to block reads when no client is connected, flush-daemon utility, and e2e tests for the full daemon stack protocol. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the 'ok' CLI that communicates with the kernel daemon over a UNIX domain socket using newline-delimited JSON. Uses yargs for command definitions with --help support on all commands. Supports three input modes: file arg (ok file.ocap method), stdin redirect (ok launch < config.json), and pipe (cat config.json | ok launch). Relative bundleSpec paths in launch configs are resolved to file:// URLs against CWD. Ref results are output as .ocap files when stdout is not a TTY. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
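A minimal sketch of the relative-path resolution described above; the `LaunchConfig` shape and function name are assumptions, with only the `bundleSpec` field taken from the commit message.

```typescript
import { isAbsolute, resolve } from 'node:path';
import { pathToFileURL } from 'node:url';

// Hypothetical shape: only bundleSpec is assumed from the description above.
type LaunchConfig = { bundleSpec: string };

// Rewrite a bare relative path into a file:// URL resolved against CWD,
// leaving anything that already has a URL scheme untouched.
function resolveBundleSpec(
  config: LaunchConfig,
  cwd = process.cwd(),
): LaunchConfig {
  const { bundleSpec } = config;
  if (/^[a-z][a-z0-9+.-]*:/iu.test(bundleSpec)) {
    return config; // Already a URL (file://, http://, ...).
  }
  const absolutePath = isAbsolute(bundleSpec)
    ? bundleSpec
    : resolve(cwd, bundleSpec);
  return { ...config, bundleSpec: pathToFileURL(absolutePath).href };
}
```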
Implement a two-tier access model: unauthenticated daemon-tier commands (help, status) and privileged ref-based dispatch via .ocap capability files. Self-ref dispatch bypasses kernel round-trip for the console root object. Fix kref leaks, improve socket channel reliability with stale connection detection and client-side retry.
…rect JSON-RPC daemon
Replace the system-console-vat architecture with direct JSON-RPC over Unix
socket. The old flow routed CLI commands through IOChannels and a REPL vat;
the new flow sends JSON-RPC requests directly to kernel RPC handlers.
- Add RPC socket server and daemon lifecycle to @ocap/nodejs under ./daemon
export path, reusing RpcService and rpcHandlers from the kernel
- Simplify CLI: ok.ts sends JSON-RPC commands, daemon-entry.ts boots kernel
and starts the daemon socket server
- Move libp2p relay from @ocap/cli to @metamask/kernel-utils under ./libp2p
export path, breaking the cli<->nodejs dependency cycle
- Remove @ocap/cli devDep from packages that only used the binary; use
yarn run -T ocap for workspace-wide binary access
- Delete system-console-vat and related IOChannel/ref plumbing
- makeKernel now returns { kernel, kernelDatabase }
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
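A minimal sketch of the new flow, assuming newline-delimited JSON-RPC framing over the Unix domain socket; the function name and error handling are illustrative, not the actual CLI code.

```typescript
import { createConnection } from 'node:net';

// Send one JSON-RPC request to the daemon socket and read one newline-
// terminated response line.
async function sendJsonRpc(
  socketPath: string,
  method: string,
  params: unknown = [],
): Promise<unknown> {
  return new Promise((resolvePromise, rejectPromise) => {
    const socket = createConnection(socketPath);
    let buffer = '';
    socket.setEncoding('utf8');
    socket.on('error', rejectPromise);
    socket.on('data', (chunk: string) => {
      buffer += chunk;
      const newline = buffer.indexOf('\n');
      if (newline === -1) {
        return; // Wait for a complete line.
      }
      socket.end();
      try {
        const response = JSON.parse(buffer.slice(0, newline)) as {
          result?: unknown;
          error?: unknown;
        };
        resolvePromise(response.result ?? response.error);
      } catch (parseError) {
        rejectPromise(parseError);
      }
    });
    socket.on('connect', () => {
      socket.write(
        `${JSON.stringify({ jsonrpc: '2.0', id: 1, method, params })}\n`,
      );
    });
  });
}
```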
- Preserve kernel state across restarts (resetStorage: false)
- Clean up stale socket files before listen
- Add socket-based shutdown RPC with PID+SIGTERM fallback
- Stop daemon before flushing state in begone handler
- Narrow sendCommand retry to ECONNREFUSED/ECONNRESET only
- Replace bare socket probe with getStatus RPC ping
- Use JsonRpcResponse from @metamask/utils with runtime validation
- Extract shared readLine/writeLine into socket-line.ts
- Document 6 known limitations in CLI readme

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
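A minimal sketch of the narrowed retry policy listed above (the helper name and backoff values are assumptions): only connection-level failures are retried, while JSON-RPC and application errors surface immediately.

```typescript
// Only these connection-level failures warrant a retry.
const RETRYABLE_CODES = new Set(['ECONNREFUSED', 'ECONNRESET']);

async function withConnectionRetry<Result>(
  operation: () => Promise<Result>,
  attempts = 5,
  delayMs = 200,
): Promise<Result> {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      const code = (error as NodeJS.ErrnoException).code;
      if (
        attempt >= attempts ||
        code === undefined ||
        !RETRYABLE_CODES.has(code)
      ) {
        throw error; // Not retryable, or retries exhausted.
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```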
Merge the standalone `ok` binary into the existing `ocap` CLI as nested `daemon` subcommands (start, stop, begone, exec), removing the need for two separate entry points. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n.sock Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…od to --force Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…p eslint disables Move n/no-process-env exemption for cli package to eslint config, replace process.exit() calls with process.exitCode to allow pending I/O to complete, and simplify daemon-entry.ts error handling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
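A minimal sketch of the `process.exitCode` pattern mentioned above (the `main` body is a placeholder): the process exits with the recorded status once the event loop drains, instead of being cut off mid-write by `process.exit()`.

```typescript
async function main(): Promise<void> {
  // ... run the selected CLI command ...
}

main().catch((error) => {
  console.error(error);
  // Exit status is applied after pending stdout/stderr writes flush.
  process.exitCode = 1;
});
```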
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard the shutdown function with a stored promise so concurrent calls from RPC shutdown, SIGTERM, and SIGINT coalesce into a single handle.close() instead of throwing ERR_SERVER_NOT_RUNNING. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
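A minimal sketch of the coalescing guard (the `Server` handle and factory name are illustrative): the first caller creates the close promise, and every later caller awaits that same promise.

```typescript
import type { Server } from 'node:net';

// Concurrent shutdown requests (RPC shutdown, SIGTERM, SIGINT) share one
// close() call instead of racing and throwing ERR_SERVER_NOT_RUNNING.
function makeShutdown(server: Server): () => Promise<void> {
  let shutdownPromise: Promise<void> | undefined;
  return async () => {
    shutdownPromise ??= new Promise((resolve, reject) => {
      server.close((error) => (error ? reject(error) : resolve()));
    });
    return shutdownPromise;
  };
}

// Example wiring (hypothetical):
// const shutdown = makeShutdown(server);
// process.on('SIGTERM', () => void shutdown());
// process.on('SIGINT', () => void shutdown());
```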
Pass an explicit dbFilename to makeKernel so the daemon uses an on-disk SQLite database instead of the default in-memory one. This matches the path deleteDaemonState already expects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use .finally() instead of .then() for PID file removal so stale daemon.pid files do not persist after a failed shutdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
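A minimal sketch of the `.finally()` change (paths and function names are assumptions): the PID file is removed whether shutdown resolves or rejects, and a shutdown failure still propagates to the caller.

```typescript
import { rm } from 'node:fs/promises';

async function shutdownAndCleanup(
  shutdown: () => Promise<void>,
  pidFilePath: string,
): Promise<void> {
  // .finally() runs (and is awaited) on both success and failure, so a failed
  // shutdown no longer leaves a stale daemon.pid behind.
  await shutdown().finally(async () => {
    await rm(pidFilePath, { force: true });
  });
}
```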
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…path Return a boolean from stopDaemon so callers (purge, stop) can react to failure. Replace the SIGTERM poll loop with a short sleep since SIGTERM delivery is reliable. Refuse to delete state in purge if the daemon failed to stop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
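A minimal sketch of the purge flow implied above (function signatures are assumptions): persisted state is deleted only after `stopDaemon` reports success.

```typescript
async function purge(
  stopDaemon: (socketPath: string) => Promise<boolean>,
  deleteDaemonState: () => Promise<void>,
  socketPath: string,
): Promise<void> {
  const stopped = await stopDaemon(socketPath);
  if (!stopped) {
    // A daemon that is still running could keep writing to state we delete.
    console.error('Daemon failed to stop; refusing to delete its state.');
    process.exitCode = 1;
    return;
  }
  await deleteDaemonState();
}
```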
The `e2e` CI job was failing on #843 with "Executable doesn't exist" for the Playwright Chromium binary. The root cause: the `prepare` job installs dependencies with `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1` and caches `node_modules`. When the `e2e` job restores that cache, `yarn install` is skipped entirely, so `postinstall`/`playwright-install.sh` never runs and the browsers are never downloaded. Playwright browsers live in `~/.cache/ms-playwright/`, not in `node_modules`, so they are not covered by the node_modules cache.

## Changes

- Add a **Playwright browser cache** step (`actions/cache`) to the `e2e` job, keyed on `runner.os` + `yarn.lock` hash so it automatically invalidates when Playwright is updated.
- Add an **Install Playwright browsers** step that runs `yarn playwright install chromium` only on a cache miss, and always installs Playwright dependencies, which are not cached.
- Bump Playwright deps to trigger cache miss in CI.

## Testing

This is a CI configuration change. It is verified by pushing the branch and confirming the `E2E Tests / omnium-gatherum` (and sibling matrix) jobs pass. On the first run the cache will miss and download Chromium; subsequent runs with the same `yarn.lock` will hit the cache and skip the download.

---

> [!NOTE]
> **Low Risk**
> CI/workflow and dependency-version updates only; main risk is longer first-run CI times or cache-key issues affecting E2E job reliability.
>
> **Overview**
> Prevents E2E CI failures caused by missing Playwright Chromium binaries when `node_modules` is restored from cache and Playwright `postinstall` doesn't run.
>
> Updates the `e2e` GitHub Actions job to **cache `~/.cache/ms-playwright`** (keyed by OS + `yarn.lock`) and to **install Playwright Chromium on cache miss**, plus always install required system deps via `playwright install-deps`.
>
> Bumps Playwright tooling to `^1.58.2` across the repo (root `playwright` devDependency and workspace `@playwright/test`/`playwright` versions), updating `yarn.lock` accordingly.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
kernel.stop() now closes the database (ee4fed9), making the explicit close call in the daemon's shutdown path redundant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FUDCo
left a comment
Still looking at stuff, but wanted to get these comments in the pipeline for you.
Meta concern: This design seems to manifest what MarkM likes to call a "per error" (though I'm not sure it's actually an error as such). The CLI has a concept of The Daemon rather than A Daemon; the notion that there might be more than one of them is not supported by this tooling, though the possibility is certainly latent in some of the underlying implementation. It strikes me that being able to spin up multiple daemons might be useful for a number of testing and debugging scenarios, especially if we can arrange to wire up the various kernels to each other. On the other hand, supporting multiple daemons would complicate the UI, so I'm not sure how far we actually want to take that.
I was initially confused about what this was, which points to the existence of other possibilities in design space that may be worth exploring. What is implemented here is a headless process that contains an ocap kernel instance, and you can command the process or the kernel it contains by sending JSON RPC messages to the process via a domain socket, with shell command line support for sending such messages. I had started out thinking in terms of a headless process that would load an arbitrary bundle and then connect over a domain socket to an I/O configured vat running in some locally executing kernel. The vat would send commands to the daemon that would either command the daemon itself to do things or pass messages along to the code in the bundle it had loaded. Both of these ideas seem potentially useful, but they're very different.
packages/cli/src/commands/daemon.ts
Outdated
```ts
 * failed to stop within the timeout.
 */
export async function stopDaemon(socketPath: string): Promise<boolean> {
  if (!(await pingDaemon(socketPath))) {
```
It seems to me that this use of ping is vulnerable to being unable to distinguish between the case where daemon process is not running and the case where it is running but non-responsive. In particular, if you happen to discover (out of band somehow) that the process is stuck, this function will return before falling back on SIGTERM. If I'm reading this right, it only does the latter if the daemon is unresponsive to the shutdown request, but it will never get to the point of trying to send shutdown if the process is just stuck.
So pingDaemon() already had a timeout via sendCommand(), but I tidied up the shutdown process and added a SIGKILL fallback in 87b6b55
Check both socket responsiveness and PID-based process liveness before declaring the daemon stopped. Escalate through socket shutdown, SIGTERM, and SIGKILL with proper exit verification at each stage. Also refactor sendCommand to use an options bag and give pingDaemon a 3s timeout instead of the default 30s so stuck-daemon detection is fast. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
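A minimal sketch of the escalation ladder described above (helper names and timings are assumptions): each stage verifies process exit via `process.kill(pid, 0)` before escalating to the next signal.

```typescript
import { setTimeout as sleep } from 'node:timers/promises';

// Signal 0 performs no kill; it only checks whether the process exists.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}

async function stopWithEscalation(
  requestShutdown: () => Promise<void>, // e.g. the socket shutdown RPC
  pid: number,
): Promise<boolean> {
  await requestShutdown().catch(() => undefined); // The daemon may be stuck.
  for (const signal of ['SIGTERM', 'SIGKILL'] as const) {
    await sleep(500);
    if (!isProcessAlive(pid)) {
      return true; // Exited at the previous stage.
    }
    try {
      process.kill(pid, signal);
    } catch {
      return !isProcessAlive(pid); // ESRCH: already gone.
    }
  }
  await sleep(500);
  return !isProcessAlive(pid);
}
```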
launchSubcluster params are shaped { config: ClusterConfig }, but
isClusterConfigLike was checking the top-level params object. Check
parsed.config instead so relative bundleSpec paths are resolved.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use conventional 0600 notation instead of JS 0o600 in documentation. Simplify post-shutdown socket check to a single ping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Coverage Report
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use removeListener with specific handler references instead of removeAllListeners, so callers' listeners are preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
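A minimal sketch of a `readLine` that detaches only its own handlers; the framing details and error messages are assumptions, not the actual socket-line.ts code.

```typescript
import type { Socket } from 'node:net';

// Read one newline-terminated line, then remove only the listeners installed
// here, leaving any listeners the caller registered untouched.
async function readLine(socket: Socket): Promise<string> {
  return new Promise((resolve, reject) => {
    let buffer = '';

    function cleanup(): void {
      socket.removeListener('data', onData);
      socket.removeListener('error', onError);
      socket.removeListener('end', onClosed);
      socket.removeListener('close', onClosed);
    }

    function onData(chunk: Buffer): void {
      buffer += chunk.toString('utf8');
      const newline = buffer.indexOf('\n');
      if (newline !== -1) {
        cleanup();
        resolve(buffer.slice(0, newline));
      }
    }

    function onError(error: Error): void {
      cleanup();
      reject(error);
    }

    function onClosed(): void {
      cleanup();
      reject(new Error('Socket closed before a full line was received'));
    }

    socket.on('data', onData);
    socket.on('error', onError);
    socket.on('end', onClosed);
    socket.on('close', onClosed);
  });
}
```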
Covers line parsing, buffering, error/end/close rejection, timeout, and verifies that readLine only removes its own socket listeners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a long-running daemon process to the OCAP kernel, managed via new `ocap daemon` CLI subcommands. The daemon spawns as a detached child process, exposes the kernel's RPC service over a Unix domain socket (`~/.ocap/daemon.sock`), and auto-starts on first `exec` invocation. The kernel database is persisted at `~/.ocap/kernel.sqlite`.

Supersedes #842, and defers the introduction of its notion of a "console vat" and REPL / IO functionality to a later date.
## New CLI commands

- `ocap daemon start` — start the daemon (or confirm it is already running)
- `ocap daemon stop` — gracefully shut down the daemon
- `ocap daemon purge --force` — stop the daemon and delete all persisted state
- `ocap daemon exec [method] [params-json]` — send a JSON-RPC call to the daemon (defaults to `getStatus`)

## Kernel changes

- `makeKernel()` now returns `{ kernel, kernelDatabase }` and accepts optional `systemSubclusters`
- `ifDefined` utility moved from `kernel-agents` to `kernel-utils`
- `startRelay` moved from `cli` to `kernel-utils/libp2p`

## New modules

- `@ocap/nodejs/daemon` — daemon orchestration (`startDaemon`, `deleteDaemonState`, `startRpcSocketServer`, socket line protocol)
- `@ocap/cli/commands/daemon*` — CLI-side daemon client, spawner, and command handlers
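A minimal sketch of the detached-spawn behavior described above (the entry-point path and log-file location are assumptions): the child gets its own process group and the CLI does not wait for it.

```typescript
import { spawn } from 'node:child_process';
import { mkdirSync, openSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Spawn the daemon entry point with the current Node binary, detach it so it
// survives the CLI exiting, and send its output to a log file under ~/.ocap.
function spawnDaemon(daemonEntryPath: string): number | undefined {
  const stateDir = join(homedir(), '.ocap');
  mkdirSync(stateDir, { recursive: true });
  const logFd = openSync(join(stateDir, 'daemon.log'), 'a'); // Assumed log path.
  const child = spawn(process.execPath, [daemonEntryPath], {
    detached: true, // New process group, independent of the CLI's terminal.
    stdio: ['ignore', logFd, logFd],
  });
  child.unref(); // Let the CLI exit without waiting on the daemon.
  return child.pid;
}
```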
> [!NOTE]
> **High Risk**
> Adds a new local RPC control plane (daemon + socket server) and changes kernel construction/IO-channel semantics, which can impact process lifecycle, persistence, and local security posture (e.g., arbitrary SQL via `executeDBQuery`).
>
> **Overview**
> Adds a detached, long-running OCAP daemon that hosts kernel JSON-RPC over a Unix domain socket and persists state under `~/.ocap`, with new `ocap daemon start|stop|purge|exec` commands (including auto-spawn on `exec`) and prototype safeguards/behavior documented.
>
> Introduces `@ocap/nodejs/daemon` (RPC socket server, line protocol helpers, daemon start/stop + state deletion) and updates `makeKernel` to return `{ kernel, kernelDatabase }` (plus optional `systemSubclusters`) to support the daemon lifecycle.
>
> Refactors shared utilities by moving `startRelay` to `@metamask/kernel-utils/libp2p` (and shifting libp2p deps accordingly) and moving `ifDefined` into `@metamask/kernel-utils`; updates tests/scripts and fixes `makeSocketIOChannel` reads to block until a client connects (instead of returning `null`).