Defer ChainMonitor completion signals to ensure manager-first persistence#4422
Defer ChainMonitor completion signals to ensure manager-first persistence#4422joostjager wants to merge 2 commits intolightningdevkit:mainfrom
Conversation
Add a `deferred` parameter to `ChainMonitor::new` and `ChainMonitor::new_async_beta`. When set to true, the Watch trait methods (watch_channel and update_channel) will unimplemented!() for now. All existing callers pass false to preserve current behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
👋 Hi! I see this is a draft PR. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4422 +/- ##
==========================================
+ Coverage 86.06% 86.12% +0.06%
==========================================
Files 156 156
Lines 103188 103941 +753
Branches 103188 103941 +753
==========================================
+ Hits 88808 89522 +714
- Misses 11868 11897 +29
- Partials 2512 2522 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
When ChainMonitor is constructed with deferred=true, monitor operations (persist, insert, apply update) still execute inline, but completion signals are held back in a queue. pending_operation_count() returns the queue length and flush(count) delivers up to that many completions via channel_monitor_updated(). The public channel_monitor_updated() method checks the deferred flag and queues completions rather than resolving them immediately. This ensures that both synchronous persistence completions and external async callers are properly deferred. flush() calls an internal non-deferring variant to actually deliver the signals. The BackgroundProcessor snapshots the pending count before persisting the ChannelManager, then flushes that many completions afterward. This ensures the ChannelManager is always persisted before its associated monitor completions are signaled, avoiding force closures from a crash between monitor and channel manager persistence. A test is added covering the interaction between deferred mode and async persistence (persister returning InProgress), verifying the two-phase completion flow: persister signals completion via channel_monitor_updated (queued into deferred_completions), then flush delivers them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
159b30c to
8306add
Compare
|
Doesn't work because once a monitor is on disk, channel manager has to be there too, regardless of when the completion signal comes. |
|
This approach could possibly still work if after restart monitors are only read up to the point that chan mgr is aware of. And care is taken with cleanup of old updates. |
This is a simpler alternative to #4351 for solving the monitor/manager persistence ordering problem. Instead of deferring the monitor operations themselves (persist, insert, apply update), this PR only defers the completion signals. The operations execute inline as before, but
ChainMonitorholds back thechannel_monitor_updatedcalls untilflush()is invoked. This means theChannelManagersees all monitor updates asInProgressuntil the caller has persisted theChannelManagerand explicitly flushed.Compared to #4351, this approach is simpler because it does not need to queue full monitor operations. There is no buffering of
ChannelMonitordata or updates and no deferred persistence. The monitors are persisted immediately, and only the lightweight completion signal(ChannelId, u64)is queued. That said, this implementation does lean more towards implementing the deferral at thePersistlevel.Benchmarking A->B node payments with a 150 ms write latency shows this approach is ~25% faster than #4351. The reason is that #4351 defers the actual monitor persist to flush time, which means the monitor write happens sequentially after the
ChannelManagerwrite. In this PR, the monitor write happens inline during theChannelManageroperation, so by the time the background processor persists theChannelManager, the monitor is already on disk. The monitor and manager writes effectively overlap rather than running back-to-back.When the background processor loop is further parallelized (#4419), another ~20% speed up is gained. Alltogether this makes the difference with non-deferred writing small.