
fix(sync-service): recover from dead consumers blocking WAL flush advancement#3975

Draft
balegas wants to merge 1 commit into main from fix/dead-consumer-flush-stall

Conversation


@balegas balegas commented Mar 7, 2026

Summary

Fixes a bug where crashed consumer processes permanently block WAL flush advancement (confirmed_flush_lsn), causing unbounded WAL growth on Postgres.

When a consumer crashes during ConsumerRegistry.broadcast:

  1. Its dead PID stays in the ETS lookup table (consumers use restart: :temporary)
  2. All subsequent events are silently dropped (non-suspend {:DOWN, ...} returns [])
  3. FlushTracker still records the shape as needing to flush (updated after publish)
  4. Dead consumer never calls notify_flushed, so last_global_flushed_offset is permanently stuck
  5. A single dead consumer blocks flush advancement for all shapes
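The failure chain above hinges on one detail: an ETS entry outlives the process it points to. A minimal standalone sketch (table layout and handle names are illustrative, not the actual ConsumerRegistry schema):

```elixir
# A "consumer" registered in ETS that exits immediately, as a
# restart: :temporary process would after a crash. Nothing cleans
# up the table entry on its behalf.
table = :ets.new(:consumers, [:set, :public])

pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

# Wait until the process has actually exited.
receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

:ets.insert(table, {"shape-1", pid})

# Lookups still return the dead PID, so events routed to it are
# dropped and notify_flushed is never called for this shape.
[{"shape-1", stale_pid}] = :ets.lookup(table, "shape-1")
IO.inspect(Process.alive?(stale_pid), label: "stale PID alive?")
```

This is exactly the state Fix 1 detects and repairs before retrying delivery.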

Observed in production with recurring WAL spikes up to ~24 GB at 300K concurrent shapes.

Changes

Fix 1 — Dead consumer recovery (consumer_registry.ex)

  • broadcast/1: Non-suspend {:DOWN} now returns the handle for retry instead of silently dropping
  • publish/2: Before retrying, checks Process.alive? on the PID in ETS, deletes dead entries, logs a warning, and lets the recursive call start a fresh consumer via start_consumer!
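The cleanup step in Fix 1 can be sketched as a small helper; the table layout and the idea that a missing entry falls through to start_consumer! are assumptions based on the PR description, not the real consumer_registry.ex code:

```elixir
defmodule DeadConsumerCleanup do
  # Before retrying delivery for a handle, check whether the
  # registered PID is still alive; if not, purge the stale entry so
  # the caller's next lookup misses and a fresh consumer is started.
  def purge_if_dead(table, handle) do
    case :ets.lookup(table, handle) do
      [{^handle, pid}] ->
        if Process.alive?(pid) do
          :live
        else
          # Dead PID left behind by a :temporary consumer crash.
          :ets.delete(table, handle)
          :retry
        end

      [] ->
        # No entry at all: also a signal to start a fresh consumer.
        :retry
    end
  end
end
```

In the real fix the `:retry` path corresponds to the recursive publish/2 call that ends up in start_consumer!.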

Fix 2 — Nil pending_txn guard (consumer.ex)

  • New process_txn_fragment/2 clause for has_begin?: false + pending_txn: nil
  • A replacement consumer started after a crash may receive a mid-transaction fragment without prior context — skip it instead of crashing by calling consider_flushed? on a nil pending_txn
  • On commit fragments, calls consider_flushed so the FlushTracker isn't blocked
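The shape of the new clause can be sketched as follows; the struct fields and the consider_flushed stand-in are assumptions modeled on the PR description, not the real consumer.ex code:

```elixir
defmodule ConsumerFragmentSketch do
  # Fix 2 guard: a replacement consumer with no pending transaction
  # receives a fragment that does not open one (has_begin?: false).
  def process_txn_fragment(
        %{has_begin?: false, commit?: commit?} = frag,
        %{pending_txn: nil} = state
      ) do
    # Skip the orphaned fragment instead of crashing on nil; on a
    # commit fragment, still report the offset so the FlushTracker
    # is not left waiting on a flush that will never happen.
    if commit?, do: consider_flushed(frag.last_log_offset)
    {:skipped, state}
  end

  # All other fragments take the normal processing path.
  def process_txn_fragment(_frag, state), do: {:processed, state}

  # Stand-in for the real FlushTracker notification.
  defp consider_flushed(offset), do: IO.puts("consider_flushed at #{offset}")
end
```

The key property is that the nil case is handled by pattern matching before any code dereferences pending_txn.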

Fix 3 — Delivery failure reporting (consumer_registry.ex, shape_log_collector.ex)

  • publish/2 now returns a MapSet of handles where delivery permanently failed (shape removed between event routing and delivery)
  • SLC subtracts undeliverable handles from affected_shapes before updating the FlushTracker, preventing it from tracking shapes that will never flush
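The bookkeeping in Fix 3 reduces to a set difference; MapSet is the real type from the PR, while the variable names here are stand-ins for the SLC code:

```elixir
# Handles touched by the event, and the subset publish/2 reported as
# permanently undeliverable (shape removed between routing and delivery).
affected_shapes = MapSet.new(["shape-a", "shape-b", "shape-c"])
undeliverable = MapSet.new(["shape-b"])

# Only shapes that can still flush are handed to the FlushTracker;
# tracking "shape-b" would stall last_global_flushed_offset forever.
trackable = MapSet.difference(affected_shapes, undeliverable)
```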

Deferred: Periodic liveness sweep

A periodic sweep checking Process.alive? on all ETS entries would serve as a safety net for any edge cases not caught by Fix 1. This was deferred for human evaluation — a timeout-based staleness check is unsafe because legitimate consumers can be very slow:

  • Synchronous broadcast blocks {:writer_flushed, ...} cast processing for the duration of the fan-out (seconds to minutes at 300K shapes)
  • Storage flush is periodic, not immediate
  • New shapes buffer fragments (buffering?: true) until snapshot info arrives (minutes under pool contention)
  • SLC mailbox backlog (1M+ messages observed) can delay flush casts indefinitely

A Process.alive? check has zero false positives — a dead process will never flush.
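If the deferred sweep is ever implemented, it could look roughly like this; the table layout is an assumption, and dead handles are collected first and deleted afterwards to keep the traversal simple:

```elixir
defmodule LivenessSweep do
  # Walks the registry table, removes entries whose PID has died, and
  # returns the purged handles. Process.alive? has no false positives,
  # so a merely slow consumer is never evicted, unlike a timeout-based
  # staleness check.
  def sweep(table) do
    dead =
      :ets.foldl(
        fn {handle, pid}, acc ->
          if Process.alive?(pid), do: acc, else: [handle | acc]
        end,
        [],
        table
      )

    Enum.each(dead, &:ets.delete(table, &1))
    dead
  end
end
```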

Test plan

  • New FlushTracker tests: "dead consumer blocks flush advancement" — verifies a single unflushed shape permanently blocks last_global_flushed_offset, and that handle_shape_removed unblocks it
  • New ConsumerRegistry tests: "dead consumer recovery" — verifies crashed consumers are detected, cleaned from ETS, and replaced with fresh consumers that receive the retried event
  • All existing FlushTracker tests pass (14)
  • All existing ConsumerRegistry tests pass (15)
  • All SLC tests pass (24)
  • All consumer tests pass (26)

🤖 Generated with Claude Code

fix(sync-service): recover from dead consumers blocking WAL flush advancement

When a consumer process crashes during ConsumerRegistry.broadcast, its
dead PID remains in the ETS lookup table. All subsequent events for that
shape are silently dropped, but the FlushTracker still records the shape
as needing to flush. Since the dead consumer never calls notify_flushed,
last_global_flushed_offset is permanently blocked, preventing
confirmed_flush_lsn from advancing and causing unbounded WAL growth.

Three fixes:

1. Dead consumer recovery (consumer_registry.ex): On non-suspend DOWN
   in broadcast, return the handle for retry instead of silently dropping
   it. In publish, detect dead PIDs via Process.alive?, clean the ETS
   entry, and let the recursive call start a fresh consumer.

2. Nil pending_txn guard (consumer.ex): A replacement consumer started
   after a crash may receive a mid-transaction fragment (has_begin?=false)
   with pending_txn=nil. Add a guard clause that skips the fragment and
   calls consider_flushed on commit fragments to unblock the FlushTracker.

3. Delivery failure reporting (consumer_registry.ex, shape_log_collector.ex):
   publish now returns a MapSet of handles where delivery permanently
   failed (shape removed). The SLC subtracts these from affected_shapes
   before updating the FlushTracker, preventing it from tracking shapes
   that will never flush.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov bot commented Mar 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.74%. Comparing base (303b7b5) to head (ef52c9f).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3975       +/-   ##
===========================================
+ Coverage   75.75%   88.74%   +12.98%     
===========================================
  Files          11       25       +14     
  Lines         693     2425     +1732     
  Branches      172      608      +436     
===========================================
+ Hits          525     2152     +1627     
- Misses        167      271      +104     
- Partials        1        2        +1     
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.93% <ø> (?)
packages/y-electric 56.05% <ø> (ø)
typescript 88.74% <ø> (+12.98%) ⬆️
unit-tests 88.74% <ø> (+12.98%) ⬆️

Flags with carried forward coverage won't be shown.


@balegas balegas marked this pull request as draft March 7, 2026 02:07

balegas commented Mar 7, 2026

I've put an agent to work on this hypothesis; not sure whether this is useful or not

@balegas balegas added the claude label Mar 7, 2026

claude bot commented Mar 7, 2026

Claude Code Review

Summary

This PR fixes a production bug where crashed temporary consumers permanently block WAL flush advancement by leaving dead PIDs in the ETS lookup table, causing unbounded WAL growth (~24 GB at 300K shapes). The three-part fix is well-reasoned: dead consumer recovery in ConsumerRegistry, a nil pending_txn guard in Consumer for orphaned mid-transaction fragments, and delivery failure propagation so FlushTracker does not track shapes that will never flush.

What Is Working Well

  • Root cause analysis is precise - the PR description walks through the exact 5-step failure chain from consumer crash to last_global_flushed_offset stall.
  • Defense in depth - three complementary fixes, each independently reducing the blast radius.
  • MapSet return type - clean API change that propagates delivery failure without a separate function.
  • FlushTracker regression test - verifies the original bug (stuck last_global_flushed_offset) and that handle_shape_removed unblocks it.
  • process_txn_fragment nil guard - correctly handles non-commit (skip) and commit (consider_flushed) cases.

Issues Found

Important (Should Fix)

1. Unbounded recursion in publish/2 under repeated consumer crashes

File: packages/sync-service/lib/electric/shapes/consumer_registry.ex:106-128

Issue: publish/2 recursively calls itself when consumers crash. If the freshly started replacement consumer (from start_consumer!) also crashes immediately on the same event, the recursion is unbounded. Cycle: broadcast -> consumer crashes -> DOWN returned -> ETS cleanup -> start_consumer! -> new consumer also crashes -> recurse.

Impact: Blocks the SLC mailbox-processing loop indefinitely. At 300K shapes with a pathological transaction, this stalls replication for the entire stack.

Suggested fix: Add a retry counter parameter and cap retries at 3, returning handles as undeliverable if the limit is exceeded.
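A minimal sketch of that suggestion, with a delivery function and handle list standing in for the real publish/2 machinery in consumer_registry.ex (all names here are illustrative):

```elixir
defmodule BoundedRetry do
  @max_retries 3

  def publish(deliver_fun, handles, retries \\ 0)

  # Cap reached: give up and report the remaining handles as
  # undeliverable so the caller can subtract them from affected_shapes
  # instead of looping forever on a crash-on-delivery shape.
  def publish(_deliver_fun, handles, retries) when retries >= @max_retries do
    MapSet.new(handles)
  end

  def publish(deliver_fun, handles, retries) do
    # Handles for which delivery returned false are retried with an
    # incremented counter; success for all handles returns an empty set.
    failed = Enum.reject(handles, deliver_fun)

    case failed do
      [] -> MapSet.new()
      _ -> publish(deliver_fun, failed, retries + 1)
    end
  end
end
```

With a cap in place, the pathological crash loop degrades into a reported delivery failure, which Fix 3 already knows how to handle.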

2. Missing changeset file

Issue: packages/sync-service is a publishable package but the PR does not include a .changeset/*.md file.

Impact: The release tooling will not pick up this fix for version bumps and changelogs.

3. Missing unit test for process_txn_fragment/2 nil pending_txn clause

File: packages/sync-service/lib/electric/shapes/consumer.ex:550-562

Issue: The new has_begin?: false + pending_txn: nil clause handles a real production scenario but has no dedicated test in consumer_test.exs. The two distinct code paths (skip mid-fragment vs. call consider_flushed on commit fragment) are not exercised in isolation.

Impact: A regression in this clause would not be caught by the test suite.

Suggestions (Nice to Have)

1. Add shape handle and offset to the nil pending_txn warning

packages/sync-service/lib/electric/shapes/consumer.ex:554-556 - the current warning omits the shape handle, making production log correlation difficult. Including state.shape_handle and txn_fragment.last_log_offset would help.

2. Use MapSet.empty?/1 instead of size check

packages/sync-service/lib/electric/shapes/consumer_registry.ex:100 - if MapSet.size(undeliverable) > 0 reads more clearly as unless MapSet.empty?(undeliverable).

Issue Conformance

No linked issue. The PR description compensates well with the 5-step failure chain, production reproduction context (300K shapes, 24 GB WAL), and explicit deferral reasoning for the periodic sweep. A GitHub issue would allow tracking the deferred periodic liveness sweep as future work.


Review iteration: 1 | 2026-03-07
