
feat(core): add RetryManager state machine for TAPI backoff/retry #1158

Open
abueide wants to merge 7 commits into tapi/config-and-settings from tapi/retry-manager

abueide (Contributor) commented Mar 10, 2026

Summary

  • Add RetryManager class with three states: READY, RATE_LIMITED, BACKING_OFF
  • Handles 429 rate limiting with Retry-After header parsing and configurable max intervals
  • Implements exponential backoff with jitter for transient errors (5xx)
  • Uses sovran store for state persistence across app restarts with validation on restore
  • Supports eager/lazy retry strategies for concurrent batch error consolidation
  • Includes autoFlush callback for automatic retry scheduling
  • Uses getState(true) (queue-safe) for all state reads to prevent race conditions between concurrent canRetry/handle429/handleTransientError calls
  • Consolidates handleError/handleErrorWithBackoff into a single method that takes a computeWaitUntilTime function parameter, eliminating duplicated logic
  • Extracts side effects (logging, Math.random) from dispatch reducers so the reducers stay pure
  • Returns RetryResult type ('rate_limited' | 'backed_off' | 'limit_exceeded') from handle429/handleTransientError so callers can detect when retry limits are exceeded
  • transitionToReady clears auto-flush timer to prevent spurious flushes
  • isPersistedStateValid validates state string against known values
  • Add barrel export and test helper utilities
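As a rough illustration of the three states listed above, the retry gate can be sketched as a small pure function. This is a minimal sketch, not the code in this PR: the field names `waitUntilTime` and `retryCount`, the `RetryStateData` shape, and the standalone function signatures are assumptions made for the example.

```typescript
// Hypothetical state shape; only the three state names come from the PR description.
type RetryState = 'READY' | 'RATE_LIMITED' | 'BACKING_OFF';

interface RetryStateData {
  state: RetryState;
  waitUntilTime: number; // epoch milliseconds (assumed field)
  retryCount: number; // assumed field
}

const INITIAL_STATE: RetryStateData = {
  state: 'READY',
  waitUntilTime: 0,
  retryCount: 0,
};

// A send is allowed when the machine is READY or the wait window has elapsed.
function canRetry(data: RetryStateData, nowMs: number): boolean {
  return data.state === 'READY' || nowMs >= data.waitUntilTime;
}

// Returning to READY resets the wait time and retry counter.
function transitionToReady(): RetryStateData {
  return { ...INITIAL_STATE };
}
```

Passing `nowMs` in, rather than calling `Date.now()` inside, keeps the check deterministic and easy to test.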

PR 3 of 5 in the TAPI backoff/retry stack. Depends on #1157. Tests in #1159.

Test plan

🤖 Generated with Claude Code

abueide force-pushed the tapi/config-and-settings branch 2 times, most recently from bb983fd to 7a0b957 on March 12, 2026 14:57
abueide force-pushed the tapi/retry-manager branch 2 times, most recently from 3371651 to 0c85b0f on March 12, 2026 14:57
abueide force-pushed the tapi/config-and-settings branch from 7a0b957 to f631f9f on March 12, 2026 15:24
abueide force-pushed the tapi/retry-manager branch from 0c85b0f to 8f21c88 on March 12, 2026 15:30
abueide force-pushed the tapi/config-and-settings branch from f631f9f to 024a1a2 on March 12, 2026 16:11
abueide force-pushed the tapi/retry-manager branch from 8f21c88 to 225e64a on March 12, 2026 16:11
abueide force-pushed the tapi/config-and-settings branch from 024a1a2 to 6d30565 on March 12, 2026 16:40
abueide force-pushed the tapi/retry-manager branch from 225e64a to f51911b on March 12, 2026 16:40
abueide force-pushed the tapi/config-and-settings branch from 6d30565 to 50cad97 on March 12, 2026 16:48
abueide force-pushed the tapi/retry-manager branch from f51911b to fcdc491 on March 12, 2026 16:48
abueide and others added 3 commits March 12, 2026 12:38
Add RetryManager with three states (READY, RATE_LIMITED, BACKING_OFF)
that handles 429 rate limiting with Retry-After parsing and transient
error exponential backoff with jitter. Includes sovran-based state
persistence and configurable retry strategies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
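For readers unfamiliar with Retry-After handling, the parsing step in the commit above could look roughly like the following. This is a sketch under assumptions: the function name `parseRetryAfterMs`, its parameters, and the choice to clamp to a maximum are illustrative and not taken from the PR. It accepts both forms the HTTP specification allows (delay in seconds, or an HTTP date).

```typescript
// Hypothetical helper: returns a delay in milliseconds, or undefined if the
// header is missing or unparseable. maxMs stands in for the "configurable max
// interval" mentioned in the summary.
function parseRetryAfterMs(
  header: string | null,
  nowMs: number,
  maxMs: number
): number | undefined {
  if (header === null) return undefined;
  const trimmed = header.trim();
  if (trimmed === '') return undefined;

  let delayMs: number;
  if (/^\d+$/.test(trimmed)) {
    // delay-seconds form, e.g. "Retry-After: 120"
    delayMs = Number(trimmed) * 1000;
  } else {
    // HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
    const dateMs = Date.parse(trimmed);
    if (Number.isNaN(dateMs)) return undefined;
    delayMs = dateMs - nowMs;
  }

  // Clamp to [0, maxMs] so a past date or a huge value cannot stall the queue.
  return Math.min(Math.max(delayMs, 0), maxMs);
}
```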
- Remove early return in handle429 when in BACKING_OFF state. 429 is a
  server-explicit signal that should always take precedence over transient
  backoff.
- Change jitter from ±jitterPercent to additive-only (0 to jitterPercent)
  per SDD specification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
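The additive-only jitter described in this commit can be illustrated with a short calculation. The sketch below is an assumption-laden example rather than the PR's implementation: the config field names (`baseMs`, `maxMs`) and the decision to cap before adding jitter are hypothetical; only `jitterPercent` and the "0 to jitterPercent" range come from the commit message. The random value is passed in so the function stays pure.

```typescript
// Hypothetical config shape for the example.
interface BackoffConfig {
  baseMs: number; // delay for retryCount 0
  maxMs: number; // upper bound on the exponential part
  jitterPercent: number; // e.g. 10 means up to +10%
}

// random is expected to be in [0, 1), e.g. the result of Math.random().
function backoffDelayMs(
  retryCount: number,
  cfg: BackoffConfig,
  random: number
): number {
  const exponential = Math.min(cfg.baseMs * 2 ** retryCount, cfg.maxMs);
  // Additive-only: jitter never shortens the delay below the exponential value.
  const jitter = exponential * (cfg.jitterPercent / 100) * random;
  return exponential + jitter;
}
```

With additive-only jitter the computed delay is never shorter than the plain exponential value, whereas a ±jitterPercent scheme can undercut it.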
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abueide force-pushed the tapi/config-and-settings branch from 50cad97 to 04bb992 on March 12, 2026 17:38
abueide and others added 4 commits March 12, 2026 12:38
- Validate persisted state in canRetry() to handle clock changes/corruption
  per SDD §Metadata Lifecycle
- Move backoff calculation inside dispatch to avoid stale retryCount from
  concurrent batch failures (handleErrorWithBackoff)
- Ensure RATE_LIMITED state is never downgraded to BACKING_OFF
- Update reset() docstring to clarify when it should be called

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
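To make the persisted-state validation concrete, here is one way such a check might be written. It is a sketch only: the `PersistedRetryState` shape and the `maxWaitMs` bound are assumptions for illustration, and the actual criteria in `isPersistedStateValid` may differ. The idea is that a restored state is rejected when the state string is unknown, the counters are malformed, or the wait time is implausibly far in the future (for example after a device clock change).

```typescript
const KNOWN_STATES = ['READY', 'RATE_LIMITED', 'BACKING_OFF'] as const;

// Hypothetical persisted shape; state is a plain string because it comes from storage.
interface PersistedRetryState {
  state: string;
  waitUntilTime: number; // epoch milliseconds (assumed field)
  retryCount: number; // assumed field
}

function isPersistedStateValid(
  persisted: PersistedRetryState,
  nowMs: number,
  maxWaitMs: number
): boolean {
  const knownState = KNOWN_STATES.some((s) => s === persisted.state);
  const saneCount =
    Number.isInteger(persisted.retryCount) && persisted.retryCount >= 0;
  // A wait that ends further away than the configured maximum suggests a
  // clock change or corrupted storage.
  const saneWait =
    Number.isFinite(persisted.waitUntilTime) &&
    persisted.waitUntilTime - nowMs <= maxWaitMs;
  return knownState && saneCount && saneWait;
}
```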
Simplify class docstring to describe current architecture without
referencing SDD deviations. Remove redundant inline comments, compact
JSDoc to single-line where appropriate, and ensure all comments use
present tense.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a 429 arrives while in BACKING_OFF state, use the server's
Retry-After directly instead of applying the lazy/eager strategy.
The server's timing signal is authoritative over calculated backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
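The precedence rules from this commit and the earlier "never downgraded" fix can be sketched as a single pure transition. Several things here are assumptions for the sake of the example: the `applyError` name, the `WaitState` shape, and in particular the reading of eager as "take the earlier wait time" and lazy as "take the later one", which the PR text does not spell out.

```typescript
type RetryState = 'READY' | 'RATE_LIMITED' | 'BACKING_OFF';
type RetryStrategy = 'eager' | 'lazy';
type ErrorKind = 'rate_limit' | 'transient';

interface WaitState {
  state: RetryState;
  waitUntilTime: number; // epoch milliseconds (assumed field)
}

// Assumed semantics: eager keeps the earlier deadline, lazy keeps the later one.
function consolidate(a: number, b: number, strategy: RetryStrategy): number {
  return strategy === 'eager' ? Math.min(a, b) : Math.max(a, b);
}

function applyError(
  current: WaitState,
  kind: ErrorKind,
  incomingWaitUntil: number,
  strategy: RetryStrategy
): WaitState {
  if (kind === 'rate_limit') {
    // From READY or BACKING_OFF the server's Retry-After is used directly;
    // the strategy only consolidates concurrent 429s.
    const waitUntilTime =
      current.state === 'RATE_LIMITED'
        ? consolidate(current.waitUntilTime, incomingWaitUntil, strategy)
        : incomingWaitUntil;
    return { state: 'RATE_LIMITED', waitUntilTime };
  }
  // Transient error: RATE_LIMITED is never downgraded to BACKING_OFF.
  if (current.state === 'RATE_LIMITED') return current;
  const waitUntilTime =
    current.state === 'BACKING_OFF'
      ? consolidate(current.waitUntilTime, incomingWaitUntil, strategy)
      : incomingWaitUntil;
  return { state: 'BACKING_OFF', waitUntilTime };
}
```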
…gnaling

- Use getState(true) for queue-safe reads to prevent race conditions
  between concurrent canRetry/handle429/handleTransientError calls
- Consolidate handleError and handleErrorWithBackoff into a single
  method that accepts a computeWaitUntilTime function
- Extract side effects (logging, Math.random) from dispatch reducers
- Return RetryResult ('rate_limited'|'backed_off'|'limit_exceeded')
  from handle429/handleTransientError so callers can drop events on
  limit exceeded
- Clear auto-flush timer in transitionToReady
- Validate state string in isPersistedStateValid

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
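The consolidation described in this commit, one handler parameterised by a `computeWaitUntilTime` function and returning a `RetryResult`, might be shaped roughly as follows. This is a hedged sketch: the `handleError` signature, the `maxRetryCount` parameter, the `RetryStateData` shape, and the choice to reset to READY when the limit is exceeded are all assumptions, not the PR's actual code. The wait time is computed from the freshly incremented count inside the transition, which mirrors the earlier fix for stale `retryCount` values.

```typescript
type RetryState = 'READY' | 'RATE_LIMITED' | 'BACKING_OFF';
type RetryResult = 'rate_limited' | 'backed_off' | 'limit_exceeded';

interface RetryStateData {
  state: RetryState;
  waitUntilTime: number; // epoch milliseconds (assumed field)
  retryCount: number; // assumed field
}

type ComputeWaitUntilTime = (retryCount: number, nowMs: number) => number;

// Pure transition: no logging, no Math.random, no Date.now inside.
function handleError(
  prev: RetryStateData,
  target: 'RATE_LIMITED' | 'BACKING_OFF',
  maxRetryCount: number,
  nowMs: number,
  computeWaitUntilTime: ComputeWaitUntilTime
): { next: RetryStateData; result: RetryResult } {
  const retryCount = prev.retryCount + 1;
  if (retryCount > maxRetryCount) {
    // Assumed behaviour: reset and let the caller drop the affected events.
    return {
      next: { state: 'READY', waitUntilTime: 0, retryCount: 0 },
      result: 'limit_exceeded',
    };
  }
  return {
    next: {
      state: target,
      waitUntilTime: computeWaitUntilTime(retryCount, nowMs),
      retryCount,
    },
    result: target === 'RATE_LIMITED' ? 'rate_limited' : 'backed_off',
  };
}
```

A 429 handler and a transient-error handler would then differ only in the `target` and the function they pass, for example a fixed server-provided delay versus an exponential one.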
abueide force-pushed the tapi/retry-manager branch from fcdc491 to a8c9435 on March 12, 2026 17:38