fix(session): fix root causes and reconstruction of tool_use/tool_result mismatch (#16749) by altendky · Pull Request #16751 · anomalyco/opencode

altendky · 2026-03-09T13:01:37Z

Issue for this PR

Closes #16749
Related: #10616, #8377, #2720, #1662, #5750, #2214, #8312, #8010

Type of change

Bug fix
New feature
Refactor / code improvement
Documentation

What does this PR do?

Fixes the root causes of the widespread `tool_use ids were found without tool_result blocks immediately after` error, which corrupts sessions and makes them unrecoverable, and adds a reconstruction-time safety net for sessions that are already affected.

The fix is three layers of defense-in-depth, each catching what the previous one misses:

Layer 1 — `processor.ts`: Tool-error race condition (line 211)

The tool-error handler only processed errors for tools in "running" status. Due to the AI SDK's merged-stream event ordering, tool-error can arrive before tool-call, when the tool is still "pending". The error was silently dropped, leaving the tool in "pending" state to be cleaned up later as "Tool execution aborted" with empty input: {}.

Fix: Accept tool-error for both "running" and "pending" status. Uses Date.now() as start time for pending tools (which don't have a time.start field).
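The status check described above can be sketched roughly as follows. This is a simplified illustration, not the code in `processor.ts`: the `ToolPart` and `ToolState` shapes and the `applyToolError` helper are hypothetical names invented for this example.

```typescript
// Hypothetical, simplified shapes; the real part types in opencode differ.
type ToolState =
  | { status: "pending"; input: Record<string, unknown> }
  | { status: "running"; input: Record<string, unknown>; time: { start: number } }
  | { status: "error"; input: Record<string, unknown>; error: string; time: { start: number; end: number } }

interface ToolPart {
  type: "tool"
  callID: string
  state: ToolState
}

// Returns the updated part, or the part unchanged when the error does not apply.
function applyToolError(part: ToolPart, error: string, now: number = Date.now()): ToolPart {
  const state = part.state
  // Before the fix only "running" was accepted, so an early tool-error was dropped.
  if (state.status !== "running" && state.status !== "pending") return part
  // Pending tools have no time.start yet, so fall back to the current time.
  const start = state.status === "running" ? state.time.start : now
  return {
    ...part,
    state: { status: "error", input: state.input, error, time: { start, end: now } },
  }
}
```

With this shape, a tool that is still `"pending"` when the error arrives ends up in `"error"` state with the real error message instead of being swept up later as "Tool execution aborted".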

Layer 2 — `processor.ts`: Recovery step-finish before retry (line 374)

When a stream error interrupts processing before finish-step is reached, or the finish-step handler itself throws, the step boundary is never written. The retry loop's continue creates a new stream whose events are appended to the same DB message without a step-finish/step-start boundary. Both steps' content merges into one message, and toModelMessages() produces a single assistant block with interleaved tool_use/text that the Anthropic API rejects.

Fix: Before the retry loop hits `continue`, scan parts backward for an unclosed step (a step-start without a matching step-finish). If one is found, write a recovery step-finish with reason: "error" and zero tokens/cost. The recovery is wrapped in try/catch so that a failure there doesn't block the retry.
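A minimal sketch of the backward scan and recovery write, under stated assumptions: `StoredPart`, `hasUnclosedStep`, `writeRecoveryStepFinish`, and the `write` callback are hypothetical stand-ins for the session storage code, and the token shape is reduced to two fields.

```typescript
// Hypothetical, simplified part shapes; not the actual opencode schema.
type StoredPart =
  | { type: "step-start" }
  | { type: "step-finish"; reason: string; cost: number; tokens: { input: number; output: number } }
  | { type: "text"; text: string }
  | { type: "tool"; tool: string; status: string }

// Walk backward: the most recent boundary decides whether a step is still open.
function hasUnclosedStep(parts: StoredPart[]): boolean {
  for (let i = parts.length - 1; i >= 0; i--) {
    const part = parts[i]
    if (!part) continue
    if (part.type === "step-finish") return false
    if (part.type === "step-start") return true
  }
  return false
}

// Returns true when a recovery step-finish was written.
function writeRecoveryStepFinish(parts: StoredPart[], write: (part: StoredPart) => void): boolean {
  try {
    if (!hasUnclosedStep(parts)) return false
    write({ type: "step-finish", reason: "error", cost: 0, tokens: { input: 0, output: 0 } })
    return true
  } catch {
    // Recovery is best effort: a failed write must not block the retry.
    return false
  }
}
```

The retry loop would call something like `writeRecoveryStepFinish(parts, write)` just before `continue`, so the next stream starts after a closed step.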

Layer 3 — `message-v2.ts`: Synthetic step-start injection (line 623)

A reconstruction-time safety net that handles already-corrupted DB data regardless of how step boundaries were lost.

Fix: In toModelMessages(), track whether we've seen a tool part in the current step (sawTool flag). If text or reasoning appears after a tool part without an intervening step-start, inject a synthetic { type: "step-start" } to force the AI SDK to split content into separate assistant+tool blocks.
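The `sawTool` tracking can be illustrated with a small standalone function. This is only a sketch of the idea: `ReplayPart` and `injectStepStarts` are hypothetical names, and the real logic lives inside `toModelMessages()` rather than in a separate pass over the parts.

```typescript
// Hypothetical minimal part shape; real parts carry far more fields.
type ReplayPart = { type: "step-start" | "step-finish" | "text" | "reasoning" | "tool" }

// Inserts a synthetic step-start wherever text/reasoning follows a tool part
// without an intervening step-start.
function injectStepStarts(parts: ReplayPart[]): ReplayPart[] {
  const out: ReplayPart[] = []
  let sawTool = false
  for (const part of parts) {
    if (part.type === "step-start") {
      sawTool = false
    } else if (part.type === "tool") {
      sawTool = true
    } else if (sawTool && (part.type === "text" || part.type === "reasoning")) {
      // Boundary was lost at write time: force a split here.
      out.push({ type: "step-start" })
      sawTool = false
    }
    out.push(part)
  }
  return out
}
```

Applied to the corrupted pattern `[step-start, text, tool, text, tool, step-finish]`, this yields `[step-start, text, tool, step-start, text, tool, step-finish]`, so each tool part ends its own step.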

How layers interact

| Layer | Where it acts | What it prevents |
| --- | --- | --- |
| Layer 1 (tool-error race) | Stream event handling | Silent error drops that leave tools in the wrong state |
| Layer 2 (recovery step-finish) | Retry loop, before `continue` | DB corruption at write time, by ensuring step boundaries are preserved |
| Layer 3 (synthetic step-start) | Message reconstruction | Already-corrupted DB data, plus any future edge cases the layers above miss |

Real-world evidence

Session ses_32fb35486ffeeJAHmplKU1gB2t, message msg_cd05ba534001gICo48Lsy1NHWp:

part_id                        | type        | tool  | status    | error
-------------------------------+-------------+-------+-----------+------------------------
prt_cd05bb9ac001...            | step-start  |       |           |
prt_cd05bb9ad001...            | text        |       |           |
prt_cd05bb9f0001...            | tool        | write | error     | Tool execution aborted
                                                                   ← 96 SECOND GAP
prt_cd05d3273001...            | text        |       |           |
prt_cd05d35a8001...            | tool        | write | completed |
prt_cd05f3c5d001...            | step-finish |       |           |

The errored tool has input: {} — tool-error was dropped because status was "pending" (Layer 1 root cause)
No step-finish/step-start boundary between the two groups (Layer 2 root cause)
The 96-second gap is the retry delay

How did you verify your code works?

All 20 message-v2 tests pass, 0 failures
New test constructs the exact corrupted DB pattern (two merged steps with [step-start, text, tool(error), text, tool(completed)]) and asserts the structural invariant: no text or reasoning part appears after a tool-call part in the same assistant ModelMessage
Before the fix: Content types in this message: [text, tool-call, text, tool-call]
After the fix: passes (content split into separate blocks)
6 pre-existing compaction test failures unrelated to this change
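The structural invariant asserted by the new test can be expressed as a small check. This is an illustrative sketch, not the test in `message-v2.test.ts`; `ContentPart` and `firstInterleaving` are hypothetical names.

```typescript
// Hypothetical content part shape for one assistant message.
type ContentPart = { type: "text" | "reasoning" | "tool-call" }

// Returns the index of the first text/reasoning part that follows a tool-call
// in the same assistant message, or -1 when the invariant holds.
function firstInterleaving(content: ContentPart[]): number {
  let sawToolCall = false
  for (let i = 0; i < content.length; i++) {
    const part = content[i]
    if (!part) continue
    if (part.type === "tool-call") {
      sawToolCall = true
    } else if (sawToolCall) {
      return i
    }
  }
  return -1
}
```

For the pre-fix content types `[text, tool-call, text, tool-call]` the check reports index 2; for a split block such as `[text, tool-call]` it reports -1.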

Files changed

| File | Change |
| --- | --- |
| `packages/opencode/src/session/processor.ts` | Tool-error race fix (accept `"pending"`) + recovery step-finish before retry |
| `packages/opencode/src/session/message-v2.ts` | Synthetic step-start injection in `toModelMessages()` |
| `packages/opencode/test/session/message-v2.test.ts` | Test reproducing corrupted DB interleaving pattern |

Checklist

I have tested my changes locally
I have not included unrelated changes in this PR

…smatch (anomalyco#16749)

Add failing test demonstrating that when step-finish/step-start parts are missing (due to retryable stream errors), toModelMessages produces a single assistant block with interleaved tool-call and text parts. Currently fails with: Content types in this message: [text, tool-call, text, tool-call].

github-actions · 2026-03-09T13:01:49Z

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

Open an issue describing the bug/feature (if one doesn't exist)
Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions · 2026-03-09T13:02:16Z

The following comment was made by an LLM, it may be inaccurate:

Found related PRs that may address similar issues:

- #14456: "fix(core): repair interleaved text/tool-call parts in assistant messages". Related because it directly addresses the interleaved text/tool-call parts problem mentioned in this PR.
- #8497: "fix: handle dangling tool_use blocks for LiteLLM proxy compatibility". Related as it deals with tool_use/tool_result mismatches in message handling.

These PRs may already be working on fixes for the issues this test case is reproducing. Check if they've been merged or if there's overlap in the solution approaches.

…terleaved due to missing step boundaries

When the finish-step handler throws during a retryable error, step-finish for step 1 and step-start for step 2 are never saved. Both steps' content merges into one DB message without boundaries. On replay, convertToModelMessages() produces a single assistant block with interleaved tool_use/text, which the Anthropic API rejects.

Fix: track whether we've seen a tool part in the current step. If text or reasoning appears after a tool part without an intervening step-start, inject a synthetic step-start to force the AI SDK to split content into separate blocks.

github-actions · 2026-03-09T13:12:53Z

Thanks for updating your PR! It now meets our contributing guidelines. 👍

github-actions bot added the needs:compliance and needs:issue labels Mar 9, 2026

altendky marked this pull request as draft March 9, 2026 13:02

altendky changed the title ~~test: reproduce missing step boundary causing tool_use/tool_result mismatch (#16749)~~ fix(session): inject synthetic step-start when tool/text parts are interleaved (#16749) Mar 9, 2026

altendky marked this pull request as ready for review March 9, 2026 13:12

github-actions bot removed the needs:compliance and needs:issue labels Mar 9, 2026

Merge branch 'dev' into test/missing-step-boundary-interleaving

1c25929

altendky marked this pull request as draft March 11, 2026 00:03

altendky changed the title ~~fix(session): inject synthetic step-start when tool/text parts are interleaved (#16749)~~ fix(session): fix root causes and reconstruction of tool_use/tool_result mismatch (#16749) Mar 11, 2026

altendky marked this pull request as ready for review March 11, 2026 00:50

altendky wants to merge 3 commits into anomalyco:dev from altendky:test/missing-step-boundary-interleaving
