
fix(sync-service): recover from dead consumers blocking WAL flush advancement#3975

Draft
balegas wants to merge 1 commit into main from fix/dead-consumer-flush-stall

Conversation


@balegas balegas commented Mar 7, 2026

Summary

Fixes a bug where crashed consumer processes permanently block WAL flush advancement (confirmed_flush_lsn), causing unbounded WAL growth on Postgres.

When a consumer crashes during ConsumerRegistry.broadcast:

  1. Its dead PID stays in the ETS lookup table (consumers use restart: :temporary)
  2. All subsequent events are silently dropped (non-suspend {:DOWN, ...} returns [])
  3. FlushTracker still records the shape as needing to flush (updated after publish)
  4. Dead consumer never calls notify_flushed, so last_global_flushed_offset is permanently stuck
  5. A single dead consumer blocks flush advancement for all shapes
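The failure chain above hinges on one detail: an ETS entry outlives the process it points to. A minimal standalone sketch (table layout and handle names are illustrative, not the actual ConsumerRegistry schema):

```elixir
# A "consumer" registered in ETS that exits immediately, as a
# restart: :temporary process would after a crash. Nothing cleans
# up the table entry on its behalf.
table = :ets.new(:consumers, [:set, :public])

pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

# Wait until the process has actually exited.
receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

:ets.insert(table, {"shape-1", pid})

# Lookups still return the dead PID, so events routed to it are
# dropped and notify_flushed is never called for this shape.
[{"shape-1", stale_pid}] = :ets.lookup(table, "shape-1")
IO.inspect(Process.alive?(stale_pid), label: "stale PID alive?")
```

This is exactly the state Fix 1 detects and repairs before retrying delivery.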

Observed in production with recurring WAL spikes up to ~24 GB at 300K concurrent shapes.

Changes

Fix 1 — Dead consumer recovery (consumer_registry.ex)

  • broadcast/1: Non-suspend {:DOWN} now returns the handle for retry instead of silently dropping
  • publish/2: Before retrying, checks Process.alive? on the PID in ETS, deletes dead entries, logs a warning, and lets the recursive call start a fresh consumer via start_consumer!
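The cleanup step in Fix 1 can be sketched as a small helper; the table layout and the idea that a missing entry falls through to start_consumer! are assumptions based on the PR description, not the real consumer_registry.ex code:

```elixir
defmodule DeadConsumerCleanup do
  # Before retrying delivery for a handle, check whether the
  # registered PID is still alive; if not, purge the stale entry so
  # the caller's next lookup misses and a fresh consumer is started.
  def purge_if_dead(table, handle) do
    case :ets.lookup(table, handle) do
      [{^handle, pid}] ->
        if Process.alive?(pid) do
          :live
        else
          # Dead PID left behind by a :temporary consumer crash.
          :ets.delete(table, handle)
          :retry
        end

      [] ->
        # No entry at all: also a signal to start a fresh consumer.
        :retry
    end
  end
end
```

In the real fix the `:retry` path corresponds to the recursive publish/2 call that ends up in start_consumer!.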

Fix 2 — Nil pending_txn guard (consumer.ex)

  • New process_txn_fragment/2 clause for has_begin?: false + pending_txn: nil
  • A replacement consumer started after a crash may receive a mid-transaction fragment without prior context — skip it instead of crashing by calling consider_flushed? on a nil pending_txn
  • On commit fragments, calls consider_flushed so the FlushTracker isn't blocked
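The shape of the new clause can be sketched as follows; the struct fields and the consider_flushed stand-in are assumptions modeled on the PR description, not the real consumer.ex code:

```elixir
defmodule ConsumerFragmentSketch do
  # Fix 2 guard: a replacement consumer with no pending transaction
  # receives a fragment that does not open one (has_begin?: false).
  def process_txn_fragment(
        %{has_begin?: false, commit?: commit?} = frag,
        %{pending_txn: nil} = state
      ) do
    # Skip the orphaned fragment instead of crashing on nil; on a
    # commit fragment, still report the offset so the FlushTracker
    # is not left waiting on a flush that will never happen.
    if commit?, do: consider_flushed(frag.last_log_offset)
    {:skipped, state}
  end

  # All other fragments take the normal processing path.
  def process_txn_fragment(_frag, state), do: {:processed, state}

  # Stand-in for the real FlushTracker notification.
  defp consider_flushed(offset), do: IO.puts("consider_flushed at #{offset}")
end
```

The key property is that the nil case is handled by pattern matching before any code dereferences pending_txn.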

Fix 3 — Delivery failure reporting (consumer_registry.ex, shape_log_collector.ex)

  • publish/2 now returns a MapSet of handles where delivery permanently failed (shape removed between event routing and delivery)
  • SLC subtracts undeliverable handles from affected_shapes before updating the FlushTracker, preventing it from tracking shapes that will never flush
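The bookkeeping in Fix 3 reduces to a set difference; MapSet is the real type from the PR, while the variable names here are stand-ins for the SLC code:

```elixir
# Handles touched by the event, and the subset publish/2 reported as
# permanently undeliverable (shape removed between routing and delivery).
affected_shapes = MapSet.new(["shape-a", "shape-b", "shape-c"])
undeliverable = MapSet.new(["shape-b"])

# Only shapes that can still flush are handed to the FlushTracker;
# tracking "shape-b" would stall last_global_flushed_offset forever.
trackable = MapSet.difference(affected_shapes, undeliverable)
```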

Deferred: Periodic liveness sweep

A periodic sweep checking Process.alive? on all ETS entries would serve as a safety net for any edge cases not caught by Fix 1. This was deferred for human evaluation — a timeout-based staleness check is unsafe because legitimate consumers can be very slow:

  • Synchronous broadcast blocks {:writer_flushed, ...} cast processing for the duration of the fan-out (seconds to minutes at 300K shapes)
  • Storage flush is periodic, not immediate
  • New shapes buffer fragments (buffering?: true) until snapshot info arrives (minutes under pool contention)
  • SLC mailbox backlog (1M+ messages observed) can delay flush casts indefinitely

A Process.alive? check has zero false positives — a dead process will never flush.
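If the deferred sweep is ever implemented, it could look roughly like this; the table layout is an assumption, and dead handles are collected first and deleted afterwards to keep the traversal simple:

```elixir
defmodule LivenessSweep do
  # Walks the registry table, removes entries whose PID has died, and
  # returns the purged handles. Process.alive? has no false positives,
  # so a merely slow consumer is never evicted, unlike a timeout-based
  # staleness check.
  def sweep(table) do
    dead =
      :ets.foldl(
        fn {handle, pid}, acc ->
          if Process.alive?(pid), do: acc, else: [handle | acc]
        end,
        [],
        table
      )

    Enum.each(dead, &:ets.delete(table, &1))
    dead
  end
end
```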

Test plan

  • New FlushTracker tests: "dead consumer blocks flush advancement" — verifies a single unflushed shape permanently blocks last_global_flushed_offset, and that handle_shape_removed unblocks it
  • New ConsumerRegistry tests: "dead consumer recovery" — verifies crashed consumers are detected, cleaned from ETS, and replaced with fresh consumers that receive the retried event
  • All existing FlushTracker tests pass (14)
  • All existing ConsumerRegistry tests pass (15)
  • All SLC tests pass (24)
  • All consumer tests pass (26)

🤖 Generated with Claude Code

fix(sync-service): recover from dead consumers blocking WAL flush advancement

When a consumer process crashes during ConsumerRegistry.broadcast, its
dead PID remains in the ETS lookup table. All subsequent events for that
shape are silently dropped, but the FlushTracker still records the shape
as needing to flush. Since the dead consumer never calls notify_flushed,
last_global_flushed_offset is permanently blocked, preventing
confirmed_flush_lsn from advancing and causing unbounded WAL growth.

Three fixes:

1. Dead consumer recovery (consumer_registry.ex): On non-suspend DOWN
   in broadcast, return the handle for retry instead of silently dropping
   it. In publish, detect dead PIDs via Process.alive?, clean the ETS
   entry, and let the recursive call start a fresh consumer.

2. Nil pending_txn guard (consumer.ex): A replacement consumer started
   after a crash may receive a mid-transaction fragment (has_begin?=false)
   with pending_txn=nil. Add a guard clause that skips the fragment and
   calls consider_flushed on commit fragments to unblock the FlushTracker.

3. Delivery failure reporting (consumer_registry.ex, shape_log_collector.ex):
   publish now returns a MapSet of handles where delivery permanently
   failed (shape removed). The SLC subtracts these from affected_shapes
   before updating the FlushTracker, preventing it from tracking shapes
   that will never flush.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov bot commented Mar 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.74%. Comparing base (303b7b5) to head (ef52c9f).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3975       +/-   ##
===========================================
+ Coverage   75.75%   88.74%   +12.98%     
===========================================
  Files          11       25       +14     
  Lines         693     2425     +1732     
  Branches      172      608      +436     
===========================================
+ Hits          525     2152     +1627     
- Misses        167      271      +104     
- Partials        1        2        +1     
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.93% <ø> (?)
packages/y-electric 56.05% <ø> (ø)
typescript 88.74% <ø> (+12.98%) ⬆️
unit-tests 88.74% <ø> (+12.98%) ⬆️

Flags with carried forward coverage won't be shown.


@balegas balegas marked this pull request as draft March 7, 2026 02:07

balegas commented Mar 7, 2026

I've put an agent to work on this hypothesis; not sure whether this is useful or not

@balegas balegas added the claude label Mar 7, 2026

claude bot commented Mar 7, 2026

Claude Code Review

Summary

This PR fixes a production bug where crashed temporary consumers permanently block WAL flush advancement by leaving dead PIDs in the ETS lookup table, causing unbounded WAL growth (~24 GB at 300K shapes). The three-part fix is well-reasoned: dead consumer recovery in ConsumerRegistry, a nil pending_txn guard in Consumer for orphaned mid-transaction fragments, and delivery failure propagation so FlushTracker does not track shapes that will never flush.

What Is Working Well

  • Root cause analysis is precise - the PR description walks through the exact 5-step failure chain from consumer crash to last_global_flushed_offset stall.
  • Defense in depth - three complementary fixes, each independently reducing the blast radius.
  • MapSet return type - clean API change that propagates delivery failure without a separate function.
  • FlushTracker regression test - verifies the original bug (stuck last_global_flushed_offset) and that handle_shape_removed unblocks it.
  • process_txn_fragment nil guard - correctly handles non-commit (skip) and commit (consider_flushed) cases.

Issues Found

Important (Should Fix)

1. Unbounded recursion in publish/2 under repeated consumer crashes

File: packages/sync-service/lib/electric/shapes/consumer_registry.ex:106-128

Issue: publish/2 recursively calls itself when consumers crash. If the freshly started replacement consumer (from start_consumer!) also crashes immediately on the same event, the recursion is unbounded. Cycle: broadcast -> consumer crashes -> DOWN returned -> ETS cleanup -> start_consumer! -> new consumer also crashes -> recurse.

Impact: Blocks the SLC mailbox-processing loop indefinitely. At 300K shapes with a pathological transaction, this stalls replication for the entire stack.

Suggested fix: Add a retry counter parameter and cap retries at 3, returning handles as undeliverable if the limit is exceeded.
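A minimal sketch of that suggestion, with a delivery function and handle list standing in for the real publish/2 machinery in consumer_registry.ex (all names here are illustrative):

```elixir
defmodule BoundedRetry do
  @max_retries 3

  def publish(deliver_fun, handles, retries \\ 0)

  # Cap reached: give up and report the remaining handles as
  # undeliverable so the caller can subtract them from affected_shapes
  # instead of looping forever on a crash-on-delivery shape.
  def publish(_deliver_fun, handles, retries) when retries >= @max_retries do
    MapSet.new(handles)
  end

  def publish(deliver_fun, handles, retries) do
    # Handles for which delivery returned false are retried with an
    # incremented counter; success for all handles returns an empty set.
    failed = Enum.reject(handles, deliver_fun)

    case failed do
      [] -> MapSet.new()
      _ -> publish(deliver_fun, failed, retries + 1)
    end
  end
end
```

With a cap in place, the pathological crash loop degrades into a reported delivery failure, which Fix 3 already knows how to handle.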

2. Missing changeset file

Issue: packages/sync-service is a publishable package but the PR does not include a .changeset/*.md file.

Impact: The release tooling will not pick up this fix for version bumps and changelogs.

3. Missing unit test for process_txn_fragment/2 nil pending_txn clause

File: packages/sync-service/lib/electric/shapes/consumer.ex:550-562

Issue: The new has_begin?: false + pending_txn: nil clause handles a real production scenario but has no dedicated test in consumer_test.exs. The two distinct code paths (skip mid-fragment vs. call consider_flushed on commit fragment) are not exercised in isolation.

Impact: A regression in this clause would not be caught by the test suite.

Suggestions (Nice to Have)

1. Add shape handle and offset to the nil pending_txn warning

packages/sync-service/lib/electric/shapes/consumer.ex:554-556 - the current warning omits the shape handle, making production log correlation difficult. Including state.shape_handle and txn_fragment.last_log_offset would help.

2. Use MapSet.empty?/1 instead of size check

packages/sync-service/lib/electric/shapes/consumer_registry.ex:100 - if MapSet.size(undeliverable) > 0 reads more clearly as unless MapSet.empty?(undeliverable).

Issue Conformance

No linked issue. The PR description compensates well with the 5-step failure chain, production reproduction context (300K shapes, 24 GB WAL), and explicit deferral reasoning for the periodic sweep. A GitHub issue would allow tracking the deferred periodic liveness sweep as future work.


Review iteration: 1 | 2026-03-07
