[Bugfix] Align thinking_budget behavior with ERNIE reasoning flow#6934

Open
jackyYang6 wants to merge 7 commits into PaddlePaddle:develop from jackyYang6:fix/ernie-thinking-budget

Conversation

@jackyYang6
Contributor

Motivation

Fix two behavior inconsistencies in ThinkingBudgetLogitsProcessor:

  1. thinking_budget ended thinking with an extra newline before </think>, which was not aligned with the existing reasoning_max_tokens behavior.
  2. For ERNIE models, thinking_budget could fail to take effect on GPU because prompt-side <think> state was not propagated through ERNIE-specific processors, and the runtime fallback only checked prompt_ids, which is not available on the current GPU path.

This PR aligns thinking_budget with the current ERNIE reasoning flow and removes the extra newline before </think>.
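The GPU-side issue in point 2 suggests a simple runtime fallback. A minimal sketch (illustrative only, not the actual FastDeploy code; `resolve_prompt_ids` and its arguments are hypothetical stand-ins for the runtime fields):

```python
# Minimal sketch of the prompt-state fallback (illustrative, not the
# actual FastDeploy code): prefer the prompt_ids field when the platform
# provides it (XPU/HPU), otherwise slice the combined token buffer by
# the prompt length (GPU path, where only token_ids_all is available).
def resolve_prompt_ids(prompt_ids, token_ids_all, prompt_len):
    """Return the prompt-side token ids used to scan <think> state."""
    if prompt_ids is not None:
        return prompt_ids                  # XPU/HPU: field is populated
    return token_ids_all[:prompt_len]      # GPU: prompt prefix of all ids
```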

Modifications

  1. Update ThinkingBudgetLogitsProcessor to terminate thinking by forcing </think> directly instead of \n + </think>.
  2. Keep think_stop_sentence behavior as think_stop_sentence + </think>, and remove the implicit leading newline before the stop sentence.
  3. Fix GPU runtime fallback in ThinkingBudgetLogitsProcessor:
    • keep prompt_ids scan for XPU/HPU compatibility
    • use token_ids_all + prompt_lens when prompt_ids is unavailable on GPU
  4. Extract shared thinking-budget request-side preprocessing into common helpers and reuse it in:
    • fastdeploy/input/text_processor.py
    • fastdeploy/input/v1/text_processor.py
    • fastdeploy/input/ernie4_5_processor.py
    • fastdeploy/input/v1/ernie4_5_processor.py
    • fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py
    • fastdeploy/input/v1/ernie4_5_vl_processor/ernie4_5_vl_processor.py
  5. Align prompt-side counting semantics with operator-level reasoning length control:
    • prompt-side <think> scaffolding or prefilled thinking content does not consume thinking_budget
    • decode-time generated thinking tokens consume thinking_budget
    • think_stop_sentence still consumes thinking_budget
    • </think> does not consume thinking_budget
  6. Update docs
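The termination and counting rules in items 1, 2, and 5 can be illustrated with a small sketch (a hypothetical helper, not FastDeploy's API; the token ids are made up): forcing starts early enough that the stop sentence still fits inside the budget, and `</think>` is appended directly, with no preceding newline and without consuming budget.

```python
def forced_tail(used, budget, stop_sentence_ids, think_end_id):
    """Token ids to force once the thinking budget is about to be spent.

    `used` counts decode-time thinking tokens only; prompt-side <think>
    scaffolding is excluded by the caller. The stop sentence consumes
    budget, so forcing triggers len(stop_sentence_ids) tokens early;
    </think> is appended directly (no "\n") and is not charged.
    """
    trigger = max(budget - len(stop_sentence_ids), 0)
    if used < trigger:
        return []                                    # still within budget
    return list(stop_sentence_ids) + [think_end_id]
```

With a 3-token stop sentence and a budget of 10, forcing begins at 7 decoded thinking tokens, so the stop sentence lands the total at exactly 10 before `</think>`.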

Usage or Command

FD_USE_GET_SAVE_OUTPUT_V1=1 python -m fastdeploy.entrypoints.openai.api_server \
  --model PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking \
  --host 0.0.0.0 \
  --port 8582 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --kv-cache-ratio 0.75 \
  --enable-expert-parallel \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --enable-logprob \
  --max-logprobs 2 \
  --logprobs-mode processed_logprobs \
  --reasoning-parser ernie-x1 \
  --logits-processors ThinkingBudgetLogitsProcessor

Example request:

{
  "messages": [
    {
      "role": "user",
      "content": "你好"
    }
  ],
  "enable_thinking": true,
  "reasoning_max_tokens": 10,
  "logits_processors_args": {
    "thinking_budget": 10,
    "think_stop_sentence": "思考已结束,开始回复"
  },
  "logprobs": true,
  "top_logprobs": 1,
  "include_logprobs_decode_token": true,
  "temperature": 0,
  "max_tokens": 20,
  "return_token_ids": true
}

Accuracy Tests

This PR does not modify kernel math or model forward numerics. The validation focus is output behavior consistency for reasoning truncation.

Behavioral validation on ERNIE thinking model:

| Case | Config | Result |
|------|--------|--------|
| thinking_budget only | thinking_budget=10 | ThinkingBudgetLogitsProcessor takes effect on ERNIE and inserts </think> after 10 decode-time thinking tokens |
| thinking_budget + think_stop_sentence | thinking_budget=10, think_stop_sentence="思考已结束,开始回复" | stop sentence is emitted completely, followed by </think> |
| reasoning_max_tokens only | reasoning_max_tokens=10 | </think> is inserted after 10 thinking tokens |
| both enabled | reasoning_max_tokens=10, thinking_budget=10, think_stop_sentence=... | stop sentence remains complete and total thinking tokens before </think> stay aligned at 10 |
| newline check | all above cases | no extra newline before </think> |
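The "both enabled" case can be checked mechanically with a toy accounting helper (illustrative only; the event names are made up, and the real accounting lives inside the processor): only decode-time thinking tokens and the stop sentence are charged against the budget.

```python
def charged_tokens(events):
    """Sum budget-consuming tokens from a list of (kind, count) events."""
    charged = {"decode_think", "stop_sentence"}      # consume budget
    # "prompt_think" (scaffolding) and "think_end" (</think>) are free
    return sum(n for kind, n in events if kind in charged)
```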

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Mar 19, 2026

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 85.06787% with 33 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@3a4e139). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| fastdeploy/input/v1/text_processor.py | 76.74% | 10 Missing and 10 partials ⚠️ |
| fastdeploy/input/text_processor.py | 90.47% | 4 Missing and 4 partials ⚠️ |
| ...model_executor/logits_processor/thinking_budget.py | 83.87% | 3 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6934   +/-   ##
==========================================
  Coverage           ?   73.78%           
==========================================
  Files              ?      399           
  Lines              ?    55697           
  Branches           ?     8784           
==========================================
  Hits               ?    41094           
  Misses             ?    11699           
  Partials           ?     2904           
| Flag | Coverage Δ |
|------|------------|
| GPU | 73.78% <85.06%> (?) |


Contributor

Copilot AI left a comment


Pull request overview

This PR fixes and unifies the "thinking-section truncation" behavior between ThinkingBudgetLogitsProcessor and the ERNIE reasoning flow: it removes the implicit newline before </think>, fixes the issue where thinking_budget had no effect on the GPU path because the prompt-side <think> state could not propagate through the ERNIE-specific processors, and extracts the request-side thinking-budget preprocessing into common helpers reused across multiple processor classes.

Changes:

  • Adjust ThinkingBudgetLogitsProcessor: once the budget is reached, force </think> directly (in the stop-sentence case, emit the stop sentence first and then </think>), and add a GPU fallback for the prompt-state scan (token_ids_all + prompt_lens).
  • Extract and reuse request-side thinking-budget preprocessing helpers (stop-sentence encoding, prompt-side <think> state updates, etc.) across the text/ernie/v1/ernie_vl input processors, aligning the semantics of "prompt side does not consume budget; decode side consumes budget".
  • Update unit tests and the English/Chinese docs to cover the new semantics and the GPU fallback behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
|------|-------------|
| fastdeploy/model_executor/logits_processor/thinking_budget.py | Force </think> directly once the budget is reached; stop sentence takes priority; add a token_ids_all fallback for the prompt scan. |
| fastdeploy/input/text_processor.py | Extract/add common thinking-budget helpers (stop-sentence encoding, prompt-state updates, literal encoding cache); tokenize cache supports lazy init. |
| fastdeploy/input/v1/text_processor.py | Same as above (v1 version). |
| fastdeploy/input/ernie4_5_processor.py | Reuse the common helpers so the ERNIE text processor writes the prompt-side state and stop-sentence token ids that thinking_budget needs. |
| fastdeploy/input/v1/ernie4_5_processor.py | Same as above (v1 version). |
| fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Reuse the common helpers so the ERNIE-VL path also prepares the thinking-budget parameters and prompt-side state. |
| fastdeploy/input/v1/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Same as above (v1 version). |
| tests/model_executor/test_thinking_budget.py | Update/add cases to match the new semantics (prompt does not consume budget, direct </think>, stop-sentence behavior, GPU fallback). |
| docs/features/thinking_budget.md | Update the English docs and practical guidance to match the new behavior. |
| docs/zh/features/thinking_budget.md | Update the Chinese docs and practical guidance to match the new behavior. |
