[Cherry-Pick][RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852) #6910
Open
wikilsh wants to merge 4 commits into PaddlePaddle:release/2.5
## Motivation
Loading full model snapshot files in one shot causes large memory spikes
during elastic recovery. This change introduces chunked part-file loading
to reduce peak memory usage.
## Modifications
Refactor `_update_ipc_snapshot` to load weights in priority order:
1. Chunked part files (`model_state.tpR{id}.part{N}.pdparams`): load and
apply weights incrementally, calling `gc.collect()` after each part to
release memory promptly.
2. Single full pdparams file (legacy path).
3. Shared fallback directory (`/shared_ipc_meta/...`) as last resort.
Additional fixes:
- Use `paddle.distributed.get_rank()` instead of hardcoded `tp0` for
proper rank awareness in multi-process scenarios.
- Raise explicit `FileNotFoundError` with a descriptive message when no
snapshot file is found under any of the three paths.
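The incremental loading in item 1 can be sketched as follows. This is a minimal illustration, not the code in the PR: `load_parts_incrementally`, `load_fn`, and `apply_fn` are hypothetical names, where `load_fn` stands in for whatever reads one part file (presumably `paddle.load` in practice) and `apply_fn` for the step that copies the weights into the model.

```python
import gc
from typing import Callable, Dict, Iterable


def load_parts_incrementally(
    part_paths: Iterable[str],
    load_fn: Callable[[str], Dict[str, object]],   # hypothetical loader hook
    apply_fn: Callable[[Dict[str, object]], None],  # hypothetical apply hook
) -> int:
    """Load one part at a time, apply it, then release it before the next."""
    loaded = 0
    for path in part_paths:
        state = load_fn(path)  # only this part is held in memory
        apply_fn(state)
        del state              # drop the last reference to the part
        gc.collect()           # reclaim memory before loading the next part
        loaded += 1
    return loaded
```

Because only one part is referenced at any time, peak memory is bounded by the largest part rather than by the full snapshot.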
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…IPC snapshot
Add dot separator in snapshot file name (tp{rank}.{id}) to fix naming
ambiguity, and add legacy format (tp0{id}) as Priority 3 fallback for
backward compatibility with existing checkpoints.
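The ambiguity is easy to reproduce. The sketch below assumes rank and id are plain integers formatted directly into the file name; the helper names `legacy_name` and `dotted_name` are illustrative only.

```python
def legacy_name(rank: int, snapshot_id: int) -> str:
    # Old format: rank and id are concatenated with no separator.
    return f"model_state.tp{rank}{snapshot_id}.pdparams"


def dotted_name(rank: int, snapshot_id: int) -> str:
    # New format: a dot separates rank from id.
    return f"model_state.tp{rank}.{snapshot_id}.pdparams"


# Two different (rank, id) pairs collide under the old format ...
print(legacy_name(1, 234) == legacy_name(12, 34))
# ... but stay distinct once the dot separator is added.
print(dotted_name(1, 234) == dotted_name(12, 34))
```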
Add unit tests covering all 4 loading priority branches and error path.
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
Thanks for your contribution!
Codecov Report ❌
Coverage diff against release/2.5 for #6910:
Coverage  68.97%
Files     389
Lines     53175
Branches  8344
Hits      36679
Misses    13844
Partials  2652
- Parse part index from filename instead of using enumerate index, keeping
logs and src_type consistent with actual file naming.
- Add validation for part file naming pattern; skip and warn on files that
do not match the expected .partN. convention.
Co-Authored-By: wikilsh <wiki_hui@qq.com>
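A minimal sketch of the parsing and validation described in this commit, assuming the index is the run of digits between `.part` and the next dot. The function name `sort_part_files` and the exact regular expression are assumptions for illustration, not the PR's implementation.

```python
import logging
import os
import re
from typing import Iterable, List, Tuple

logger = logging.getLogger(__name__)

# Assumed pattern for the ".partN." convention: digits between ".part" and ".".
_PART_RE = re.compile(r"\.part(\d+)\.")


def sort_part_files(paths: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (part_index, path) pairs in ascending numeric order."""
    indexed = []
    for path in paths:
        match = _PART_RE.search(os.path.basename(path))
        if match is None:
            # Skip and warn instead of guessing an index for a malformed name.
            logger.warning("Skipping %s: name does not match .partN.", path)
            continue
        indexed.append((int(match.group(1)), path))
    indexed.sort()  # numeric order, so part2 comes before part10
    return indexed
```

Taking the index from the name rather than from `enumerate` keeps log output aligned with the files on disk even when a part is missing or skipped.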
Motivation
Cherry-pick from develop branch PR #6852 to release/2.5.
In RL training elastic recovery via IPC snapshot, two issues existed in
`_update_ipc_snapshot`:
1. Loading the full `.pdparams` file in one shot causes a significant memory spike, risking OOM during recovery.
2. Snapshot files were named `model_state.tp{rank}{id}` without a separator between rank and id, causing naming ambiguity (e.g., rank=1, id=234 and rank=12, id=34 both produce `tp1234`). Additionally, the rank was previously hardcoded as `tp0`, causing all ranks to load rank-0's shard in multi-process scenarios.
Modifications
Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a four-level loading priority:
1. Chunked part files (`model_state.tp{rank}.{id}.part{N}.pdparams`): Load multiple smaller shards sequentially in ascending numeric order, releasing memory between each chunk via `gc.collect()` to avoid a memory spike.
2. Single full file (`model_state.tp{rank}.{id}.pdparams`): Standard single-file loading with the corrected naming format (dot separator added).
3. Legacy format (`model_state.tp0{id}.pdparams`): Backward-compatible fallback for checkpoints saved in the old format, ensuring smooth rolling upgrades.
4. Shared fallback directory (`/shared_ipc_meta/...`): Oldest legacy fallback path preserved for compatibility.
Key fixes:
- Changed the naming format from `tp{rank}{id}` to `tp{rank}.{id}` to eliminate naming ambiguity.
- Replaced the hardcoded `tp0` with `paddle.distributed.get_rank()` so each rank loads its own shard correctly.
- Raise an explicit `FileNotFoundError` with a clear message when no snapshot is found in any candidate path.
Usage or Command
During RL elastic recovery, the trainer will automatically select the appropriate loading strategy based on available snapshot files. No additional configuration is needed.
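The automatic selection can be pictured roughly as below. This is a sketch under assumptions, not the code in `dynamic_weight_manager.py`: `resolve_snapshot`, its parameters, and the returned labels are hypothetical, and the shared fallback location is passed in as `shared_path` because its full path is not given here.

```python
import glob
import os
import re
from typing import List, Tuple

_PART_RE = re.compile(r"\.part(\d+)\.pdparams$")


def resolve_snapshot(
    model_dir: str, rank: int, snapshot_id: int, shared_path: str
) -> Tuple[str, List[str]]:
    """Pick the first available source in priority order (hypothetical helper)."""
    base = os.path.join(model_dir, f"model_state.tp{rank}.{snapshot_id}")

    # Priority 1: chunked part files, ordered by their numeric part index.
    parts = [p for p in glob.glob(glob.escape(base) + ".part*.pdparams")
             if _PART_RE.search(p)]
    if parts:
        parts.sort(key=lambda p: int(_PART_RE.search(p).group(1)))
        return "chunked", parts

    # Priorities 2-4: single file, legacy tp0{id} name, shared fallback.
    legacy = os.path.join(model_dir, f"model_state.tp0{snapshot_id}.pdparams")
    candidates = [("full", base + ".pdparams"), ("legacy", legacy),
                  ("shared", shared_path)]
    for kind, path in candidates:
        if os.path.exists(path):
            return kind, [path]

    tried = ", ".join(path for _, path in candidates)
    raise FileNotFoundError(f"No IPC snapshot found; tried part files and: {tried}")
```

Listing every attempted path in the error message makes a failed recovery easier to diagnose than a bare missing-file error from the loader.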
To generate chunked part files on the trainer side, split the full state dict and save as:
- `model_state.tp{rank}.{id}.part0.pdparams`
- `model_state.tp{rank}.{id}.part1.pdparams`
- ...
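One possible way to produce such files on the trainer side is sketched below. It splits by parameter count rather than by byte size, which is a simplification. `save_in_parts` and its arguments are hypothetical, and `save_fn` stands in for the actual serializer (presumably `paddle.save(obj, path)` in practice).

```python
import os
from typing import Callable, Dict, List


def save_in_parts(
    state_dict: Dict[str, object],
    rank: int,
    snapshot_id: int,
    num_parts: int,
    save_fn: Callable[[Dict[str, object], str], None],  # hypothetical hook
    out_dir: str = ".",
) -> List[str]:
    """Split a state dict into at most num_parts files named *.part{N}.pdparams."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    keys = list(state_dict)
    chunk = max(1, -(-len(keys) // num_parts))  # ceiling division
    paths = []
    for n, start in enumerate(range(0, len(keys), chunk)):
        part = {k: state_dict[k] for k in keys[start:start + chunk]}
        name = f"model_state.tp{rank}.{snapshot_id}.part{n}.pdparams"
        path = os.path.join(out_dir, name)
        save_fn(part, path)
        paths.append(path)
    return paths
```

Splitting by accumulated tensor size instead of key count would give more even parts when parameter sizes vary widely.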
Accuracy Tests
This PR only modifies the weight loading path and chunking strategy in
`_update_ipc_snapshot`. It does not affect model forward computation or kernel logic. No accuracy regression is expected.
Checklist
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.