[FLINK-38997] Fix StateTransitionManager stuck in Transitioning phase on failed transition #27781

Open
dubin555 wants to merge 1 commit into apache:master from dubin555:oss-scout/verify-adaptive-scheduler

@dubin555

What is the purpose of the change

When transitionToSubsequentState() throws (e.g., an OutOfMemoryError during ExecutionGraph creation), the DefaultStateTransitionManager is left in the Transitioning phase. Transitioning has no-op onChange() and onTrigger() handlers, and the progressToPhase() guard prevents any phase change out of Transitioning, so the manager becomes permanently unresponsive to resource changes. The job cannot recover, even after resources stabilize.

This was reported in FLINK-38997: an OutOfMemoryError during createExecutionGraphWithAvailableResourcesAsync leaves the AdaptiveScheduler stuck in WaitingForResources indefinitely.

The root cause is in triggerTransitionToSubsequentState(), where the phase is set to Transitioning before calling transitionContext.transitionToSubsequentState(). If the latter throws, the phase remains Transitioning with no way out.
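
The failure mode described above can be reproduced in a reduced model. The sketch below is hypothetical and is not the Flink code: the class StuckPhaseSketch, its three-value Phase enum, and the Runnable standing in for transitionToSubsequentState() are all assumptions made for illustration, and the real manager has more phases and more logic.

```java
/** Hypothetical, reduced model of the failure mode; these are not the Flink classes. */
final class StuckPhaseSketch {
    enum Phase { IDLING, STABILIZING, TRANSITIONING }

    static final class Manager {
        private final Runnable transition; // stands in for transitionToSubsequentState()
        private Phase phase = Phase.IDLING;

        Manager(Runnable transition) {
            this.transition = transition;
        }

        Phase phase() {
            return phase;
        }

        // Guard as described above: once in TRANSITIONING, no phase change is accepted.
        private void progressToPhase(Phase next) {
            if (phase != Phase.TRANSITIONING) {
                phase = next;
            }
        }

        void onChange() {
            if (phase == Phase.TRANSITIONING) {
                return; // no-op handler
            }
            progressToPhase(Phase.STABILIZING);
        }

        void onTrigger() {
            if (phase != Phase.STABILIZING) {
                return; // no-op in IDLING and TRANSITIONING (a simplification)
            }
            progressToPhase(Phase.TRANSITIONING); // the phase is set first ...
            transition.run();                     // ... so a throw here leaves it TRANSITIONING
        }
    }

    public static void main(String[] args) {
        Manager manager = new Manager(() -> {
            throw new IllegalStateException("simulated transition failure");
        });
        manager.onChange();
        try {
            manager.onTrigger();
        } catch (IllegalStateException e) {
            System.out.println("transition failed: " + e.getMessage());
        }
        manager.onChange(); // a later resource change is ignored
        System.out.println("phase: " + manager.phase());
    }
}
```

In this model the phase stays TRANSITIONING after the failed trigger, and the later onChange() call has no effect.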

Brief change log

  • Wrap transitionContext.transitionToSubsequentState() in a try-catch in triggerTransitionToSubsequentState()
  • On failure, reset the phase directly to Idling (bypassing the progressToPhase guard) and re-throw the exception so the caller's error handling still applies
  • Add failOnTransition() / stopFailingOnTransition() to TestingStateTransitionManagerContext to simulate transition failures
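
The first two items of the change log can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch itself: RecoveringManagerSketch and its Phase enum are hypothetical, the Runnable stands in for transitionContext.transitionToSubsequentState(), and catching Throwable (so that Errors such as OutOfMemoryError are covered) is an assumption, since the description does not say which type the PR catches.

```java
/** Hypothetical sketch of the try-catch recovery; not the actual Flink implementation. */
final class RecoveringManagerSketch {
    enum Phase { IDLING, STABILIZING, TRANSITIONING }

    private final Runnable transition; // stands in for transitionToSubsequentState()
    private Phase phase = Phase.IDLING;

    RecoveringManagerSketch(Runnable transition) {
        this.transition = transition;
    }

    Phase phase() {
        return phase;
    }

    // Guard: once in TRANSITIONING, no phase change is accepted through this method.
    private void progressToPhase(Phase next) {
        if (phase != Phase.TRANSITIONING) {
            phase = next;
        }
    }

    void onChange() {
        if (phase == Phase.TRANSITIONING) {
            return; // no-op handler
        }
        progressToPhase(Phase.STABILIZING);
    }

    void onTrigger() {
        if (phase == Phase.STABILIZING) {
            triggerTransitionToSubsequentState();
        }
    }

    private void triggerTransitionToSubsequentState() {
        progressToPhase(Phase.TRANSITIONING);
        try {
            transition.run();
        } catch (Throwable t) {
            // Assign the field directly: progressToPhase() would refuse to leave TRANSITIONING.
            phase = Phase.IDLING;
            // Re-throw so the caller's error handling still applies.
            throw t;
        }
    }
}
```

Resetting through a direct field assignment rather than progressToPhase() is what lets this model escape the guard; re-throwing keeps the failure visible to the caller instead of swallowing it.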

Verifying this change

This change added tests and can be verified as follows:

  • Added testManagerResetsToIdlingWhenTransitionToSubsequentStateFails which verifies that: (1) when transitionToSubsequentState() throws, the manager resets to Idling instead of getting stuck in Transitioning; (2) after the failure, the manager can still process new onChange() events and move to Stabilizing; (3) a subsequent onTrigger() successfully completes the transition
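
The three checks listed above can be walked through against a small stand-in. Everything in this sketch is hypothetical: TestingContext only imitates the failOnTransition() / stopFailingOnTransition() toggle named in the change log, Manager is a reduced model with the recovery applied, and the real test in the PR uses Flink's own test utilities and may assert different details.

```java
/** Hypothetical walk-through of the test scenario; not the test added in the PR. */
final class TransitionFailureScenarioSketch {
    enum Phase { IDLING, STABILIZING, TRANSITIONING }

    /** Stand-in for the failure toggle on TestingStateTransitionManagerContext. */
    static final class TestingContext {
        private boolean failOnTransition;
        private int completedTransitions;

        void failOnTransition() { failOnTransition = true; }

        void stopFailingOnTransition() { failOnTransition = false; }

        int completedTransitions() { return completedTransitions; }

        void transitionToSubsequentState() {
            if (failOnTransition) {
                throw new RuntimeException("simulated transition failure");
            }
            completedTransitions++;
        }
    }

    /** Reduced manager model with the reset-to-IDLING recovery applied. */
    static final class Manager {
        private final TestingContext context;
        private Phase phase = Phase.IDLING;

        Manager(TestingContext context) { this.context = context; }

        Phase phase() { return phase; }

        void onChange() {
            if (phase != Phase.TRANSITIONING) {
                phase = Phase.STABILIZING;
            }
        }

        void onTrigger() {
            if (phase != Phase.STABILIZING) {
                return;
            }
            phase = Phase.TRANSITIONING;
            try {
                context.transitionToSubsequentState();
            } catch (RuntimeException e) {
                phase = Phase.IDLING; // recover instead of staying in TRANSITIONING
                throw e;
            }
        }
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        TestingContext context = new TestingContext();
        Manager manager = new Manager(context);

        context.failOnTransition();
        manager.onChange();
        try {
            manager.onTrigger();
            check(false, "expected the transition to throw");
        } catch (RuntimeException expected) {
            // the failure must still reach the caller
        }
        check(manager.phase() == Phase.IDLING, "(1) manager should reset to IDLING");

        manager.onChange();
        check(manager.phase() == Phase.STABILIZING, "(2) new change should reach STABILIZING");

        context.stopFailingOnTransition();
        manager.onTrigger();
        check(context.completedTransitions() == 1, "(3) subsequent trigger should complete");
        System.out.println("all three checks passed");
    }
}
```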

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes (JobManager adaptive scheduler recovery behavior)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Mar 18, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

… on failed transition

When transitionToSubsequentState() throws an exception (e.g. OutOfMemoryError
during ExecutionGraph creation), the DefaultStateTransitionManager's phase is
already set to Transitioning. Since the Transitioning phase ignores all
onChange() and onTrigger() events, and the progressToPhase() guard prevents
transitions out of Transitioning, the manager becomes permanently
unresponsive to resource changes.

This change wraps the transitionToSubsequentState() call in a try-catch that
resets the phase to Idling on failure, allowing the manager to respond to
future resource changes and retry the transition.
@dubin555 dubin555 force-pushed the oss-scout/verify-adaptive-scheduler branch from 79f2093 to 48bc3fd Compare March 19, 2026 02:33