
Add leader election recovery with generation isolation #92

Open
ssteele110 wants to merge 1 commit into master from ssteele/leader-election-recovery

@ssteele110

Summary

  • Add leader election architecture to handle master worker death during queue population
  • Add MasterDied exception for detecting master failures during setup
  • Add configurable master_lock_ttl (30s default) and max_election_attempts (3 default)
  • Use SET NX EX for master lock with TTL instead of SETNX (which had no TTL)
  • Namespace all queue data keys by generation UUID for complete isolation
  • Add retry loop with automatic re-election when master dies
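
A minimal sketch of how the retry loop and the two new options might fit together. Only `MasterDied`, `master_lock_ttl`, and `max_election_attempts` come from this PR; `ElectionConfig`, `populate_with_retries`, and the `attempt_population` callable are hypothetical names used for illustration, not the actual implementation.

```python
from dataclasses import dataclass


class MasterDied(Exception):
    """Raised when the master lock expires before setup completes."""


@dataclass
class ElectionConfig:
    # Hypothetical container; the defaults are the ones stated in this PR.
    master_lock_ttl: int = 30       # seconds
    max_election_attempts: int = 3


def populate_with_retries(config, attempt_population):
    """Run attempt_population until it succeeds or attempts run out.

    attempt_population is a hypothetical callable that takes the attempt
    number and either returns normally or raises MasterDied.
    """
    for attempt in range(1, config.max_election_attempts + 1):
        try:
            return attempt_population(attempt)
        except MasterDied:
            # Give up once the configured number of elections is used up;
            # otherwise loop around and take part in a new election.
            if attempt == config.max_election_attempts:
                raise
```

With the defaults, a population that fails twice and then succeeds returns on the third attempt, while one that fails three times re-raises `MasterDied`.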

How It Works

  1. Master acquires lock with SET key "setup:{uuid}" NX EX 30 (30s TTL)
  2. If master dies mid-population, lock expires automatically
  3. Workers waiting on the master detect the lock expiry and raise MasterDied
  4. Any worker can become new master with new generation UUID
  5. New population uses isolated namespace: build:{id}:gen:{uuid}:*
  6. Old workers detect generation staleness and exit gracefully
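
The steps above can be sketched roughly as follows. To keep the example runnable without a Redis server, `FakeStore` is an in-memory stand-in that imitates only `SET ... NX EX` and `GET`; the lock key name `build:{id}:master-status` and the helpers `try_become_master` and `generation_key` are assumptions made for illustration. Only the `setup:{uuid}` lock value and the `build:{id}:gen:{uuid}:*` namespace come from this PR.

```python
import uuid


class FakeStore:
    """In-memory stand-in that mimics only SET ... NX EX and GET."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in seconds
        self._data = {}       # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # expired keys behave as if they were never set
            return None
        return value

    def set(self, key, value, nx=False, ex=None):
        if nx and self.get(key) is not None:
            return None           # NX: refuse to overwrite a live key
        expires_at = self._clock() + ex if ex is not None else float("inf")
        self._data[key] = (value, expires_at)
        return True


def try_become_master(store, build_id, ttl=30):
    """Return a new generation UUID if this worker won the lock, else None."""
    generation = str(uuid.uuid4())
    lock_key = f"build:{build_id}:master-status"   # hypothetical key name
    acquired = store.set(lock_key, f"setup:{generation}", nx=True, ex=ttl)
    return generation if acquired else None


def generation_key(build_id, generation, name):
    """Build a generation-scoped key in the namespace described above."""
    return f"build:{build_id}:gen:{generation}:{name}"
```

Because the lock carries a TTL, a second worker's attempt fails while the first master is alive and succeeds once the TTL has elapsed, producing a fresh generation whose keys cannot collide with the old one.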

Test plan

  • Verify existing tests pass
  • Test master death scenario: start master, kill mid-population, verify new master elected
  • Verify workers on old generation detect staleness and exit
  • Test two sequential master deaths within the 120s timeout
  • Verify max_election_attempts limit works (fails after 3 attempts)

🤖 Generated with Claude Code

Implement leader election architecture to handle master worker death during
queue population. Key changes:

- Add MasterDied exception for detecting master failures
- Add master_lock_ttl (30s) and max_election_attempts (3) config options
- Use SET NX EX for master lock with TTL instead of SETNX (no TTL)
- Namespace all queue data keys by generation UUID for isolation
- Detect master death in wait_for_master when lock expires during setup
- Add retry loop in populate() to handle MasterDied with configurable attempts
- Add generation staleness check in poll loop
- Update all queue operations to use generation-scoped keys

This allows workers to recover when a master dies mid-population by:
1. Detecting the lock expiry (MasterDied exception)
2. Electing a new master with a new generation
3. Repopulating the queue in an isolated namespace
4. Letting old workers detect staleness and exit gracefully
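
As an illustration of the staleness check in the poll loop, here is a small sketch. The function names `is_stale` and `poll`, and the two callables passed in, are hypothetical; the real poll loop in this PR may be structured differently.

```python
def is_stale(current_generation, my_generation):
    """True when a newer generation has replaced the one this worker joined."""
    return current_generation is not None and current_generation != my_generation


def poll(read_current_generation, my_generation, reserve_next):
    """Yield work items until the queue is empty or the generation changes.

    read_current_generation and reserve_next are hypothetical callables:
    the first returns the generation currently recorded for the build,
    the second returns the next work item or None when the queue is empty.
    """
    while True:
        if is_stale(read_current_generation(), my_generation):
            return   # a new master took over; stop without reserving more work
        item = reserve_next()
        if item is None:
            return   # queue drained
        yield item
```

Checking the generation before each reservation means a worker on an old generation stops taking items as soon as a new master publishes its generation, rather than draining a queue that has been superseded.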

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>