[Bugfix] Align thinking_budget behavior with ERNIE reasoning flow#6934

Open
jackyYang6 wants to merge 7 commits into PaddlePaddle:develop from jackyYang6:fix/ernie-thinking-budget

Conversation

@jackyYang6
Contributor

Motivation

Fix two behavior inconsistencies in ThinkingBudgetLogitsProcessor:

  1. thinking_budget ended thinking with an extra newline before </think>, which was not aligned with the existing reasoning_max_tokens behavior.
  2. For ERNIE models, thinking_budget could fail to take effect on GPU because prompt-side <think> state was not propagated through ERNIE-specific processors, and the runtime fallback only checked prompt_ids, which is not available on the current GPU path.

This PR aligns thinking_budget with the current ERNIE reasoning flow and removes the extra newline before </think>.
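The GPU-side issue in point 2 suggests a simple runtime fallback. A minimal sketch (illustrative only, not the actual FastDeploy code; `resolve_prompt_ids` and its arguments are hypothetical stand-ins for the runtime fields):

```python
# Minimal sketch of the prompt-state fallback (illustrative, not the
# actual FastDeploy code): prefer the prompt_ids field when the platform
# provides it (XPU/HPU), otherwise slice the combined token buffer by
# the prompt length (GPU path, where only token_ids_all is available).
def resolve_prompt_ids(prompt_ids, token_ids_all, prompt_len):
    """Return the prompt-side token ids used to scan <think> state."""
    if prompt_ids is not None:
        return prompt_ids                  # XPU/HPU: field is populated
    return token_ids_all[:prompt_len]      # GPU: prompt prefix of all ids
```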

Modifications

  1. Update ThinkingBudgetLogitsProcessor to terminate thinking by forcing </think> directly instead of \n + </think>.
  2. Keep think_stop_sentence behavior as think_stop_sentence + </think>, and remove the implicit leading newline before the stop sentence.
  3. Fix GPU runtime fallback in ThinkingBudgetLogitsProcessor:
    • keep prompt_ids scan for XPU/HPU compatibility
    • use token_ids_all + prompt_lens when prompt_ids is unavailable on GPU
  4. Extract shared thinking-budget request-side preprocessing into common helpers and reuse it in:
    • fastdeploy/input/text_processor.py
    • fastdeploy/input/v1/text_processor.py
    • fastdeploy/input/ernie4_5_processor.py
    • fastdeploy/input/v1/ernie4_5_processor.py
    • fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py
    • fastdeploy/input/v1/ernie4_5_vl_processor/ernie4_5_vl_processor.py
  5. Align prompt-side counting semantics with operator-level reasoning length control:
    • prompt-side <think> scaffolding or prefilled thinking content does not consume thinking_budget
    • decode-time generated thinking tokens consume thinking_budget
    • think_stop_sentence still consumes thinking_budget
    • </think> does not consume thinking_budget
  6. Update docs
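The termination and counting rules in items 1, 2, and 5 can be illustrated with a small sketch (a hypothetical helper, not FastDeploy's API; the token ids are made up): forcing starts early enough that the stop sentence still fits inside the budget, and `</think>` is appended directly, with no preceding newline and without consuming budget.

```python
def forced_tail(used, budget, stop_sentence_ids, think_end_id):
    """Token ids to force once the thinking budget is about to be spent.

    `used` counts decode-time thinking tokens only; prompt-side <think>
    scaffolding is excluded by the caller. The stop sentence consumes
    budget, so forcing triggers len(stop_sentence_ids) tokens early;
    </think> is appended directly (no "\n") and is not charged.
    """
    trigger = max(budget - len(stop_sentence_ids), 0)
    if used < trigger:
        return []                                    # still within budget
    return list(stop_sentence_ids) + [think_end_id]
```

With a 3-token stop sentence and a budget of 10, forcing begins at 7 decoded thinking tokens, so the stop sentence lands the total at exactly 10 before `</think>`.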

Usage or Command

FD_USE_GET_SAVE_OUTPUT_V1=1 python -m fastdeploy.entrypoints.openai.api_server \
  --model PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking \
  --host 0.0.0.0 \
  --port 8582 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --kv-cache-ratio 0.75 \
  --enable-expert-parallel \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --enable-logprob \
  --max-logprobs 2 \
  --logprobs-mode processed_logprobs \
  --reasoning-parser ernie-x1 \
  --logits-processors ThinkingBudgetLogitsProcessor

Example request:

{
  "messages": [
    {
      "role": "user",
      "content": "你好"
    }
  ],
  "enable_thinking": true,
  "reasoning_max_tokens": 10,
  "logits_processors_args": {
    "thinking_budget": 10,
    "think_stop_sentence": "思考已结束,开始回复"
  },
  "logprobs": true,
  "top_logprobs": 1,
  "include_logprobs_decode_token": true,
  "temperature": 0,
  "max_tokens": 20,
  "return_token_ids": true
}

Accuracy Tests

This PR does not modify kernel math or model forward numerics. The validation focus is output behavior consistency for reasoning truncation.

Behavioral validation on ERNIE thinking model:

| Case | Config | Result |
|------|--------|--------|
| thinking_budget only | thinking_budget=10 | ThinkingBudgetLogitsProcessor takes effect on ERNIE and inserts </think> after 10 decode-time thinking tokens |
| thinking_budget + think_stop_sentence | thinking_budget=10, think_stop_sentence="思考已结束,开始回复" | stop sentence is emitted completely, followed by </think> |
| reasoning_max_tokens only | reasoning_max_tokens=10 | </think> is inserted after 10 thinking tokens |
| both enabled | reasoning_max_tokens=10, thinking_budget=10, think_stop_sentence=... | stop sentence remains complete and total thinking tokens before </think> stay aligned at 10 |
| newline check | all above cases | no extra newline before </think> |
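The "both enabled" case can be checked mechanically with a toy accounting helper (illustrative only; the event names are made up, and the real accounting lives inside the processor): only decode-time thinking tokens and the stop sentence are charged against the budget.

```python
def charged_tokens(events):
    """Sum budget-consuming tokens from a list of (kind, count) events."""
    charged = {"decode_think", "stop_sentence"}      # consume budget
    # "prompt_think" (scaffolding) and "think_end" (</think>) are free
    return sum(n for kind, n in events if kind in charged)
```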

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Mar 19, 2026

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 85.06787% with 33 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@3a4e139). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| fastdeploy/input/v1/text_processor.py | 76.74% | 10 Missing and 10 partials ⚠️ |
| fastdeploy/input/text_processor.py | 90.47% | 4 Missing and 4 partials ⚠️ |
| ...model_executor/logits_processor/thinking_budget.py | 83.87% | 3 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6934   +/-   ##
==========================================
  Coverage           ?   73.78%           
==========================================
  Files              ?      399           
  Lines              ?    55697           
  Branches           ?     8784           
==========================================
  Hits               ?    41094           
  Misses             ?    11699           
  Partials           ?     2904           
| Flag | Coverage Δ |
|------|------------|
| GPU | 73.78% <85.06%> (?) |


Contributor

Copilot AI left a comment


Pull request overview

This PR fixes and unifies the "thinking-section truncation" behavior between ThinkingBudgetLogitsProcessor and the ERNIE reasoning flow: it removes the implicit newline before </think>, fixes the issue where thinking_budget had no effect on the GPU path because the prompt-side <think> state could not propagate through the ERNIE-specific processors, and extracts the request-side thinking-budget preprocessing into common helpers reused across multiple processor classes.

Changes:

  • Adjust ThinkingBudgetLogitsProcessor: once the budget is reached, force </think> directly (in the stop-sentence case, emit the stop sentence first and then </think>), and add a GPU fallback for the prompt-state scan (token_ids_all + prompt_lens).
  • Extract and reuse request-side thinking-budget preprocessing helpers (stop-sentence encoding, prompt-side <think> state updates, etc.) across the text/ernie/v1/ernie_vl input processors, aligning the semantics of "prompt side does not consume budget; decode side consumes budget".
  • Update unit tests and the English/Chinese docs to cover the new semantics and the GPU fallback behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
|------|-------------|
| fastdeploy/model_executor/logits_processor/thinking_budget.py | Force </think> directly once the budget is reached; stop sentence takes priority; add a token_ids_all fallback for the prompt scan. |
| fastdeploy/input/text_processor.py | Extract/add common thinking-budget helpers (stop-sentence encoding, prompt-state updates, literal encoding cache); tokenize cache supports lazy init. |
| fastdeploy/input/v1/text_processor.py | Same as above (v1 version). |
| fastdeploy/input/ernie4_5_processor.py | Reuse the common helpers so the ERNIE text processor writes the prompt-side state and stop-sentence token ids that thinking_budget needs. |
| fastdeploy/input/v1/ernie4_5_processor.py | Same as above (v1 version). |
| fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Reuse the common helpers so the ERNIE-VL path also prepares the thinking-budget parameters and prompt-side state. |
| fastdeploy/input/v1/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Same as above (v1 version). |
| tests/model_executor/test_thinking_budget.py | Update/add cases to match the new semantics (prompt does not consume budget, direct </think>, stop-sentence behavior, GPU fallback). |
| docs/features/thinking_budget.md | Update the English docs and practical guidance to match the new behavior. |
| docs/zh/features/thinking_budget.md | Update the Chinese docs and practical guidance to match the new behavior. |
