fix(session): fix root causes and reconstruction of tool_use/tool_result mismatch (#16749) #16751

Open
altendky wants to merge 3 commits into anomalyco:dev from altendky:test/missing-step-boundary-interleaving

Conversation

@altendky altendky commented Mar 9, 2026

Issue for this PR

Closes #16749
Related: #10616, #8377, #2720, #1662, #5750, #2214, #8312, #8010

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Fixes the root causes of, and adds a reconstruction-time safety net for, the widespread "tool_use ids were found without tool_result blocks immediately after" error, which corrupts sessions and makes them unrecoverable.

The fix is three layers of defense-in-depth, each catching what the previous one misses:

Layer 1 — processor.ts: Tool-error race condition (line 211)

The tool-error handler only processed errors for tools in "running" status. Due to the AI SDK's merged-stream event ordering, tool-error can arrive before tool-call, when the tool is still "pending". The error was silently dropped, leaving the tool in "pending" state to be cleaned up later as "Tool execution aborted" with empty input: {}.

Fix: Accept tool-error for both "running" and "pending" status. Uses Date.now() as start time for pending tools (which don't have a time.start field).
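
The status check described above can be sketched roughly as follows. This is a simplified illustration, not the code in processor.ts: the ToolState and ToolErrorEvent shapes and the applyToolError name are hypothetical, and it assumes the tool-error event carries the tool input.

```typescript
// Hypothetical, simplified shapes; the real part types in processor.ts are richer.
type ToolState =
  | { status: "pending"; input: Record<string, unknown> }
  | { status: "running"; input: Record<string, unknown>; time: { start: number } }
  | {
      status: "error"
      input: Record<string, unknown>
      error: string
      time: { start: number; end: number }
    }

// Assumption: the tool-error event carries the tool input alongside the error.
interface ToolErrorEvent {
  toolCallId: string
  input: Record<string, unknown>
  error: unknown
}

// Returns the new state, or undefined when the event should be ignored.
function applyToolError(
  current: ToolState | undefined,
  event: ToolErrorEvent,
  now: () => number = Date.now,
): ToolState | undefined {
  if (!current) return undefined
  // Accept both statuses: tool-error may arrive before tool-call.
  if (current.status !== "running" && current.status !== "pending") return undefined
  const end = now()
  // Pending tools have no time.start yet, so fall back to the current time.
  const start = current.status === "running" ? current.time.start : end
  return {
    status: "error",
    input: event.input,
    error: String(event.error),
    time: { start, end },
  }
}
```

With a check like this, a tool that is still "pending" when its error arrives ends up in "error" state with the real input and error text, rather than being cleaned up later as "Tool execution aborted".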

Layer 2 — processor.ts: Recovery step-finish before retry (line 374)

When a stream error interrupts processing before finish-step is reached, or the finish-step handler itself throws, the step boundary is never written. The retry loop's continue creates a new stream whose events are appended to the same DB message without a step-finish/step-start boundary. Both steps' content merges into one message, and toModelMessages() produces a single assistant block with interleaved tool_use/text that the Anthropic API rejects.

Fix: Before the retry loop's continue, scan parts backward for an unclosed step (step-start without a matching step-finish). If found, write a recovery step-finish with reason: "error" and zero tokens/cost. Wrapped in try/catch so recovery failures don't block the retry.
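
A minimal sketch of the backward scan and recovery write, under stated assumptions: the StoredPart shape, the hasUnclosedStep and closeStepBeforeRetry names, and the write callback are all hypothetical stand-ins for whatever processor.ts actually uses to persist parts.

```typescript
// Hypothetical, simplified part shapes for illustration only.
type StoredPart =
  | { type: "step-start" }
  | { type: "step-finish"; reason: string; cost: number; tokens: { input: number; output: number } }
  | { type: "text"; text: string }
  | { type: "tool"; tool: string; status: string }

// Walk backward: the most recent step boundary decides whether a step is open.
function hasUnclosedStep(parts: StoredPart[]): boolean {
  for (const part of [...parts].reverse()) {
    if (part.type === "step-finish") return false
    if (part.type === "step-start") return true
  }
  return false
}

// `write` stands in for the real persistence call (an assumption).
function closeStepBeforeRetry(parts: StoredPart[], write: (part: StoredPart) => void): void {
  try {
    if (!hasUnclosedStep(parts)) return
    write({
      type: "step-finish",
      reason: "error",
      cost: 0,
      tokens: { input: 0, output: 0 },
    })
  } catch {
    // Best effort: a failed recovery write must not block the retry.
  }
}
```

Calling something like closeStepBeforeRetry just before the retry loop's continue would mean the retried stream's step-start lands after a step-finish, so the two attempts stay in separate steps.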

Layer 3 — message-v2.ts: Synthetic step-start injection (line 623)

A reconstruction-time safety net that handles already-corrupted DB data regardless of how step boundaries were lost.

Fix: In toModelMessages(), track whether we've seen a tool part in the current step (sawTool flag). If text or reasoning appears after a tool part without an intervening step-start, inject a synthetic { type: "step-start" } to force the AI SDK to split content into separate assistant+tool blocks.
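
The sawTool tracking can be illustrated in isolation. This sketch is not the toModelMessages() implementation: ReplayPart and injectStepStarts are hypothetical names, and the part shapes are reduced to the fields the check needs.

```typescript
// Hypothetical, simplified part shapes for illustration only.
type ReplayPart =
  | { type: "step-start" }
  | { type: "step-finish" }
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool"; tool: string }

// Returns a copy of `parts` with a synthetic step-start inserted wherever
// text or reasoning follows a tool part inside the same step.
function injectStepStarts(parts: ReplayPart[]): ReplayPart[] {
  const result: ReplayPart[] = []
  let sawTool = false
  for (const part of parts) {
    if (part.type === "step-start") {
      sawTool = false
    } else if (part.type === "tool") {
      sawTool = true
    } else if ((part.type === "text" || part.type === "reasoning") && sawTool) {
      // The step boundary was lost; restore it so the content is split.
      result.push({ type: "step-start" })
      sawTool = false
    }
    result.push(part)
  }
  return result
}
```

Applied to the merged pattern [step-start, text, tool, text, tool, step-finish], this yields [step-start, text, tool, step-start, text, tool, step-finish], so each tool part again closes its own step.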

How layers interact

Layer                          | Where it acts               | What it prevents
-------------------------------+-----------------------------+----------------------------------------------------
Layer 1 (tool-error race)      | Stream event handling       | Silent error drops that leave tools in wrong state
Layer 2 (recovery step-finish) | Retry loop, before continue | DB corruption at write time; ensures step boundaries are preserved
Layer 3 (synthetic step-start) | Message reconstruction      | Handles already-corrupted DB data + any future edge cases the above layers miss

Real-world evidence

Session ses_32fb35486ffeeJAHmplKU1gB2t, message msg_cd05ba534001gICo48Lsy1NHWp:

part_id                        | type        | tool  | status    | error
-------------------------------+-------------+-------+-----------+------------------------
prt_cd05bb9ac001...            | step-start  |       |           |
prt_cd05bb9ad001...            | text        |       |           |
prt_cd05bb9f0001...            | tool        | write | error     | Tool execution aborted
                                                                   ← 96 SECOND GAP
prt_cd05d3273001...            | text        |       |           |
prt_cd05d35a8001...            | tool        | write | completed |
prt_cd05f3c5d001...            | step-finish |       |           |
  • The errored tool has input: {}; tool-error was dropped because status was "pending" (Layer 1 root cause)
  • No step-finish/step-start boundary between the two groups (Layer 2 root cause)
  • The 96-second gap is the retry delay

How did you verify your code works?

  • All 20 message-v2 tests pass, 0 failures
  • New test constructs the exact corrupted DB pattern (two merged steps with [step-start, text, tool(error), text, tool(completed)]) and asserts the structural invariant: no text or reasoning part appears after a tool-call part in the same assistant ModelMessage
  • Before the fix: Content types in this message: [text, tool-call, text, tool-call]
  • After the fix: passes (content split into separate blocks)
  • 6 pre-existing compaction test failures unrelated to this change
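
The structural invariant asserted by the new test can be expressed as a small standalone check. This is an illustrative sketch rather than the test in message-v2.test.ts: ContentPart, AssistantMessage, and findInterleaving are hypothetical names, and the content shape is reduced to its type field.

```typescript
// Hypothetical, reduced shapes: only the content type matters for the check.
type ContentPart = { type: "text" | "reasoning" | "tool-call" }

interface AssistantMessage {
  role: "assistant"
  content: ContentPart[]
}

// Returns the index of the first text/reasoning part that follows a tool-call
// in the same assistant message, or -1 when the invariant holds.
function findInterleaving(message: AssistantMessage): number {
  let sawToolCall = false
  for (let i = 0; i < message.content.length; i++) {
    const part = message.content[i]
    if (!part) continue
    if (part.type === "tool-call") {
      sawToolCall = true
    } else if (sawToolCall) {
      return i
    }
  }
  return -1
}
```

For the pre-fix content types [text, tool-call, text, tool-call] the check reports index 2; for a split block such as [text, tool-call] it reports -1.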

Files changed

File                                              | Change
--------------------------------------------------+------------------------------------------------------------------------
packages/opencode/src/session/processor.ts        | Tool-error race fix (accept "pending") + recovery step-finish before retry
packages/opencode/src/session/message-v2.ts       | Synthetic step-start injection in toModelMessages()
packages/opencode/test/session/message-v2.test.ts | Test reproducing corrupted DB interleaving pattern

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

…smatch (anomalyco#16749)

Add failing test demonstrating that when step-finish/step-start parts are
missing (due to retryable stream errors), toModelMessages produces a single
assistant block with interleaved tool-call and text parts. Currently fails
with: Content types in this message: [text, tool-call, text, tool-call].
@github-actions github-actions bot added needs:compliance needs:issue labels Mar 9, 2026

github-actions bot commented Mar 9, 2026

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions bot commented Mar 9, 2026

The following comment was made by an LLM, it may be inaccurate:

Found related PRs that may address similar issues:

  1. PR #14456: "fix(core): repair interleaved text/tool-call parts in assistant messages"

  2. PR #8497: "fix: handle dangling tool_use blocks for LiteLLM proxy compatibility"

These PRs may already be working on fixes for the issues this test case is reproducing. Check if they've been merged or if there's overlap in the solution approaches.

@altendky altendky marked this pull request as draft March 9, 2026 13:02
…terleaved due to missing step boundaries

When the finish-step handler throws during a retryable error, step-finish for
step 1 and step-start for step 2 are never saved. Both steps' content merges
into one DB message without boundaries. On replay, convertToModelMessages()
produces a single assistant block with interleaved tool_use/text, which the
Anthropic API rejects.

Fix: track whether we've seen a tool part in the current step. If text or
reasoning appears after a tool part without an intervening step-start, inject
a synthetic step-start to force the AI SDK to split content into separate
blocks.
@altendky altendky changed the title from "test: reproduce missing step boundary causing tool_use/tool_result mismatch (#16749)" to "fix(session): inject synthetic step-start when tool/text parts are interleaved (#16749)" Mar 9, 2026
@altendky altendky marked this pull request as ready for review March 9, 2026 13:12
@github-actions github-actions bot removed needs:compliance needs:issue labels Mar 9, 2026

github-actions bot commented Mar 9, 2026

Thanks for updating your PR! It now meets our contributing guidelines. 👍

@altendky altendky marked this pull request as draft March 11, 2026 00:03
@altendky altendky changed the title from "fix(session): inject synthetic step-start when tool/text parts are interleaved (#16749)" to "fix(session): fix root causes and reconstruction of tool_use/tool_result mismatch (#16749)" Mar 11, 2026
@altendky altendky marked this pull request as ready for review March 11, 2026 00:50

Development

Successfully merging this pull request may close these issues.

Missing step-finish/step-start parts after retryable stream errors cause tool_use/tool_result mismatch
