[FLINK-38997] Fix StateTransitionManager stuck in Transitioning phase on failed transition #27781
Open
dubin555 wants to merge 1 commit into apache:master
… on failed transition

When transitionToSubsequentState() throws an exception (e.g. OutOfMemoryError during ExecutionGraph creation), the DefaultStateTransitionManager's phase is already set to Transitioning. Since the Transitioning phase ignores all onChange() and onTrigger() events, and the progressToPhase() guard prevents transitions out of Transitioning, the manager becomes permanently unresponsive to resource changes.

This change wraps the transitionToSubsequentState() call in a try-catch that resets the phase to Idling on failure, allowing the manager to respond to future resource changes and retry the transition.
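The failure mode and the reset can be illustrated with a small standalone model. This is only a sketch under assumptions, not the code in this PR: PhaseResetSketch, Manager, Phase and TransitionContext are hypothetical stand-ins for the real Flink classes, the phase handling is reduced to three enum values, and catching Throwable (so that an Error such as OutOfMemoryError is covered) is an assumption about how the try-catch is written.

```java
/** Simplified, hypothetical model of the phase handling; not Flink's DefaultStateTransitionManager. */
final class PhaseResetSketch {

    enum Phase { IDLING, STABILIZING, TRANSITIONING }

    /** Hypothetical stand-in for the context that performs the actual state transition. */
    interface TransitionContext {
        void transitionToSubsequentState();
    }

    static final class Manager {
        private final TransitionContext context;
        private Phase phase = Phase.IDLING;

        Manager(TransitionContext context) {
            this.context = context;
        }

        Phase phase() {
            return phase;
        }

        /** A resource change is only picked up while idling; TRANSITIONING ignores it. */
        void onChange() {
            if (phase == Phase.IDLING) {
                phase = Phase.STABILIZING;
            }
        }

        /** A trigger is only acted on while stabilizing; TRANSITIONING ignores it. */
        void onTrigger() {
            if (phase != Phase.STABILIZING) {
                return;
            }
            // The phase is switched before the transition is attempted.
            phase = Phase.TRANSITIONING;
            try {
                context.transitionToSubsequentState();
            } catch (Throwable t) {
                // Without this reset the manager would stay in TRANSITIONING forever.
                // Assumption: Throwable is caught so that Errors are covered as well.
                phase = Phase.IDLING;
                throw t; // re-throw so the caller's error handling still applies
            }
        }
    }

    public static void main(String[] args) {
        boolean[] fail = {true};
        Manager manager = new Manager(() -> {
            if (fail[0]) {
                throw new IllegalStateException("simulated transition failure");
            }
        });

        manager.onChange();
        try {
            manager.onTrigger();
        } catch (IllegalStateException expected) {
            System.out.println("after failure: " + manager.phase());
        }

        // Once the failure cause is gone, a new change and trigger can retry the transition.
        fail[0] = false;
        manager.onChange();
        manager.onTrigger();
        System.out.println("after retry: " + manager.phase());
    }
}
```

In this model, removing the catch block leaves the manager in TRANSITIONING after the first failure, so the later onChange() and onTrigger() calls are silently ignored, which mirrors the stuck behaviour described above.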
79f2093 to 48bc3fd
What is the purpose of the change
When transitionToSubsequentState() throws an exception (e.g., OutOfMemoryError during ExecutionGraph creation), the DefaultStateTransitionManager becomes permanently stuck in the Transitioning phase. Since Transitioning has no-op onChange() and onTrigger() handlers, and the progressToPhase() guard prevents any phase change out of Transitioning, the manager becomes permanently unresponsive to resource changes. The job is stuck and cannot recover, even after resources stabilize.

This was reported in FLINK-38997: an OutOfMemoryError during createExecutionGraphWithAvailableResourcesAsync leaves the AdaptiveScheduler stuck in WaitingForResources indefinitely.

The root cause is in triggerTransitionToSubsequentState(), where the phase is set to Transitioning before calling transitionContext.transitionToSubsequentState(). If the latter throws, the phase remains Transitioning with no way out.

Brief change log
- Wrap transitionContext.transitionToSubsequentState() in a try-catch in triggerTransitionToSubsequentState()
- On failure, reset the phase to Idling (bypassing the progressToPhase guard) and re-throw the exception so the caller's error handling still applies
- Add failOnTransition() / stopFailingOnTransition() to TestingStateTransitionManagerContext to simulate transition failures

Verifying this change
This change added tests and can be verified as follows:

- Added testManagerResetsToIdlingWhenTransitionToSubsequentStateFails, which verifies that: (1) when transitionToSubsequentState() throws, the manager resets to Idling instead of getting stuck in Transitioning; (2) after the failure, the manager can still process new onChange() events and move to Stabilizing; (3) a subsequent onTrigger() successfully completes the transition

Does this pull request potentially affect one of the following parts:
- @Public(Evolving): no

Documentation