[Cherry-Pick][RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852) #6909
Open
wikilsh wants to merge 5 commits into PaddlePaddle:release/2.4
## Motivation

During elastic recovery, each rank should load its own model shard. The hardcoded `tp0` caused all ranks to load rank-0's shard, leading to incorrect weight initialization in multi-process scenarios.

## Modifications

- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both the primary model path and the fallback `/shared_ipc_meta/` path inside `_update_ipc_snapshot`, so each rank correctly loads its own shard during elastic recovery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Motivation

During elastic recovery, each rank should load its own model shard. The hardcoded `tp0` caused all ranks to load rank-0's shard, leading to incorrect weight initialization in multi-process scenarios.

## Modifications

- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both the primary model path and the fallback `/shared_ipc_meta/` path inside `_update_ipc_snapshot`, so each rank correctly loads its own shard during elastic recovery.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
…ry spike
Refactor _update_ipc_snapshot with 4-level loading priority:
1. Chunked part files (with gc.collect per part to reduce peak memory)
2. Single full pdparams file (new naming: tp{rank}.{id})
3. Legacy format (tp0{id})
4. Shared fallback directory (/shared_ipc_meta/)
Add unit tests covering all priority branches and error path.
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
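The four-level priority described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `_update_ipc_snapshot`: the function name `find_snapshot`, its parameters, and the `src_type` labels are hypothetical, and because the shared fallback path is not spelled out in full here, the caller passes it in as `shared_fallback`.

```python
import os
import re


def find_snapshot(model_dir, rank, snapshot_id, shared_fallback=None):
    """Return (src_type, paths) for the first priority level that has files.

    Hypothetical helper: names and return shape are assumptions for illustration.
    """
    base = f"model_state.tp{rank}.{snapshot_id}"

    # Priority 1: chunked part files, ordered by the numeric part index.
    part_re = re.compile(re.escape(base) + r"\.part(\d+)\.pdparams")
    parts = []
    names = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    for name in names:
        match = part_re.fullmatch(name)
        if match:
            parts.append((int(match.group(1)), os.path.join(model_dir, name)))
    if parts:
        return "chunked", [path for _, path in sorted(parts)]

    # Priority 2 and 3: single full file (new naming), then legacy naming.
    candidates = [
        ("full", os.path.join(model_dir, base + ".pdparams")),
        ("legacy", os.path.join(model_dir, f"model_state.tp0{snapshot_id}.pdparams")),
    ]
    # Priority 4: shared fallback location, supplied by the caller (assumption).
    if shared_fallback is not None:
        candidates.append(("shared", shared_fallback))

    for src_type, path in candidates:
        if os.path.exists(path):
            return src_type, [path]

    tried = ", ".join(path for _, path in candidates)
    raise FileNotFoundError(
        f"No IPC snapshot found for rank={rank}, id={snapshot_id}; "
        f"tried part files in {model_dir} and: {tried}"
    )
```

Returning the source type alongside the paths keeps the caller free to log which branch was taken, which is also what makes each branch easy to cover in a unit test.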
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.4 #6909 +/- ##
==============================================
Coverage ? 56.39%
==============================================
Files ? 333
Lines ? 42553
Branches ? 6481
==============================================
Hits ? 23997
Misses ? 16673
Partials ? 1883
- Parse part index from filename instead of using enumerate index, keeping logs and src_type consistent with actual file naming.
- Add validation for part file naming pattern; skip and warn on files that do not match the expected `.partN.` convention.

Co-Authored-By: wikilsh <wiki_hui@qq.com>
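A minimal sketch of what parsing the index from the filename might look like is shown below. The helper name `order_part_files` and the exact regular expression are assumptions made for illustration; the PR's actual pattern may differ.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern: a ".partN." segment somewhere in the filename.
_PART_RE = re.compile(r"\.part(\d+)\.")


def order_part_files(filenames):
    """Return (part_index, filename) pairs sorted by the index in the name.

    Files that do not follow the .partN. convention are skipped with a warning.
    """
    indexed = []
    for name in filenames:
        match = _PART_RE.search(name)
        if match is None:
            logger.warning("Skipping %s: name does not match the .partN. pattern", name)
            continue
        indexed.append((int(match.group(1)), name))
    indexed.sort()  # numeric order, so part2 comes before part10
    return indexed
```

Sorting on the parsed integer matters because a plain lexicographic sort would place `part10` before `part2`, and an `enumerate` index would drift from the real part number whenever a file is skipped or missing.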
## Motivation

Cherry-pick from develop branch PR #6852 to release/2.4.

In RL training elastic recovery via IPC snapshot, two issues existed in `_update_ipc_snapshot`:

1. Loading the full `.pdparams` file in one shot causes a significant memory spike, risking OOM during recovery.
2. The model path used `model_state.tp{rank}{id}` without a separator between rank and id, causing naming ambiguity (e.g., rank=1, id=234 and rank=12, id=34 both produce `tp1234`). Additionally, the rank was previously hardcoded as `tp0`, causing all ranks to load rank-0's shard in multi-process scenarios.

## Modifications

Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a four-level loading priority:

1. Chunked part files (`model_state.tp{rank}.{id}.part{N}.pdparams`): Load multiple smaller shards sequentially in ascending numeric order, calling `gc.collect()` between chunks to release memory and avoid a spike.
2. Single full pdparams file (`model_state.tp{rank}.{id}.pdparams`): Standard single-file loading with the corrected naming format (dot separator added).
3. Legacy format (`model_state.tp0{id}.pdparams`): Backward-compatible fallback for checkpoints saved in the old format, ensuring smooth rolling upgrades.
4. Shared fallback directory (`/shared_ipc_meta/...`): Oldest legacy fallback path, preserved for compatibility.

Key fixes:

- Changed the path format from `tp{rank}{id}` to `tp{rank}.{id}` to eliminate naming ambiguity.
- Replaced hardcoded `tp0` with `paddle.distributed.get_rank()` so each rank loads its own shard correctly.
- Raise `FileNotFoundError` with a clear message when no snapshot is found in any candidate path.

## Usage or Command
During RL elastic recovery, the trainer automatically selects the appropriate loading strategy based on the available snapshot files. No additional configuration is needed.

To generate chunked part files on the trainer side, split the full state dict and save the pieces as:

- `model_state.tp{rank}.{id}.part0.pdparams`
- `model_state.tp{rank}.{id}.part1.pdparams`
- ...
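One way to produce and consume such part files is sketched below. Everything here is an assumption for illustration: the helper names `save_chunked` and `load_chunked`, splitting by key count rather than by byte size, and the use of `pickle` as a stand-in so the example runs without Paddle. In a real trainer one would presumably pass `paddle.save` and `paddle.load` as `save_fn` and `load_fn`.

```python
import gc
import os
import pickle


def _pickle_save(obj, path):
    # Stand-in saver so the sketch runs without Paddle installed.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def _pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def save_chunked(state_dict, out_dir, rank, snapshot_id, keys_per_part,
                 save_fn=_pickle_save):
    """Split state_dict into parts of at most keys_per_part entries and save them."""
    if keys_per_part < 1:
        raise ValueError("keys_per_part must be at least 1")
    keys = list(state_dict)
    paths = []
    for part, start in enumerate(range(0, len(keys), keys_per_part)):
        chunk = {k: state_dict[k] for k in keys[start:start + keys_per_part]}
        name = f"model_state.tp{rank}.{snapshot_id}.part{part}.pdparams"
        path = os.path.join(out_dir, name)
        save_fn(chunk, path)
        paths.append(path)
    return paths


def load_chunked(paths, apply_fn, load_fn=_pickle_load):
    """Load parts one at a time, releasing each before reading the next."""
    for path in paths:
        chunk = load_fn(path)
        apply_fn(chunk)  # e.g. copy the weights into the model
        del chunk
        gc.collect()  # drop the finished chunk before loading the next one
```

A possible usage: `paths = save_chunked(state, out_dir, rank=1, snapshot_id=234, keys_per_part=2)` on the trainer side, then `load_chunked(paths, merged.update)` on the loading side, where `merged` is a plain dict standing in for the model. Only one chunk is held by the loader at a time, which is the point of the chunked format.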
## Accuracy Tests

This PR only modifies the weight loading path and chunking strategy in `_update_ipc_snapshot`. It does not affect model forward computation or kernel logic. No accuracy regression is expected.

## Checklist

- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.