DAOS-18425 rebuild: NAK certain rebuild stop commands by kccain · Pull Request #17421 · daos-stack/daos

kccain · 2026-01-21T22:45:08Z

When a dmg pool rebuild stop (or system rebuild stop) command is run, the PS leader should refuse to stop a currently-running rebuild if there are more scheduled rebuilds for the pool in the rg_queue_list. In this case, -DER_NO_PERM is returned to the dmg command.

Also, for usability of the feature, the handling of the stop command will return errors when:

- there is no currently-running rebuild (-DER_NONEXIST)
- the rebuild has effectively finished and is simply cleaning up (i.e., it is now in op:Reclaim) (-DER_BUSY)
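
The three NAK cases described above can be summarized as a small decision function. This is only a sketch: the type and function names (`sketch_pool_state`, `sketch_stop_reply`) and the numeric codes are hypothetical stand-ins, not the DAOS definitions, and the order in which the conditions are checked is an assumption that the description does not specify.

```c
#include <stdbool.h>

/* Placeholder reply codes. These are NOT the real DAOS DER_* numeric values. */
enum sketch_rc {
	SKETCH_OK       = 0,
	SKETCH_NONEXIST = -1, /* stands in for -DER_NONEXIST */
	SKETCH_BUSY     = -2, /* stands in for -DER_BUSY */
	SKETCH_NO_PERM  = -3  /* stands in for -DER_NO_PERM */
};

/* Only the two operations named in the description are modeled. */
enum sketch_op { SKETCH_OP_REBUILD, SKETCH_OP_RECLAIM };

/* Hypothetical summary of one pool's rebuild state on the PS leader. */
struct sketch_pool_state {
	bool           has_running; /* a rebuild is currently running */
	enum sketch_op op;          /* its operation, valid if has_running */
	unsigned int   queued;      /* further rebuilds scheduled for the pool */
};

/* Return the reply a stop request would get for the given state. */
static int
sketch_stop_reply(const struct sketch_pool_state *st)
{
	if (!st->has_running)
		return SKETCH_NONEXIST; /* nothing to stop */
	if (st->op == SKETCH_OP_RECLAIM)
		return SKETCH_BUSY;     /* effectively finished, only cleaning up */
	if (st->queued > 0)
		return SKETCH_NO_PERM;  /* more rebuilds are scheduled for the pool */
	return SKETCH_OK;               /* the stop is accepted */
}
```

A caller would fill in `sketch_pool_state` from whatever the leader knows about the pool and pass the result back to the dmg command unchanged.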

Features: rebuild

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

When a dmg pool rebuild stop (or system rebuild stop) command is run, the PS leader should refuse to stop a currently-running rebuild if there are more scheduled rebuilds for the pool in the rg_queue_list. In this case, -DER_NO_PERM is returned to the dmg command.

Also, for usability of the feature, the handling of the stop command will return errors when:
- there is no currently-running rebuild (-DER_NONEXIST)
- the rebuild has effectively finished and is simply cleaning up (i.e., it is now in op:Reclaim) (-DER_BUSY)

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>

github-actions · 2026-01-21T22:45:53Z

Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration'
Status is 'In Progress'
Labels: 'Rebuild,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18425

daosbuild3 · 2026-01-22T11:49:47Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17421/1/display/redirect

daosbuild3 · 2026-01-22T14:59:13Z

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17421/1/execution/node/1276/log

daosbuild3 · 2026-01-22T17:11:30Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/1/testReport/

daosbuild3 · 2026-01-22T23:07:18Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/2/testReport/

daosbuild3 · 2026-01-22T23:24:08Z

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/2/testReport/

fault_injection/pool.py failure is an instance of existing issue DAOS-18519

daosbuild3 · 2026-01-23T02:28:34Z

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17421/2/execution/node/492/log

erasurecode/multiple_rank_failure.py failure is an instance of known issue DAOS-16766
erasurecode/multiple_target_failure.py is also an instance of DAOS-16766
rebuild/widely_striped.py failure is an instance of known issue DAOS-18302

kccain · 2026-01-27T12:56:36Z

After successfully running Features: rebuild testing in build 2, I have merged with latest master and pushed for per-PR-only testing (in progress now, build 3). I think it's reasonable to perform code reviews in parallel.

daosbuild3 · 2026-01-27T13:47:00Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/3/testReport/

daosbuild3 · 2026-01-28T01:28:38Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/3/testReport/

There is a problem. On the surface (the top-level URL for test results) it looks like the functional HW tests all passed, just as in build 2. But the testReport shows a failing daos_rebuild_interactive.c test: a "rebuild stop" command returns -DER_NONEXIST where the test expects it to succeed. This is a test timing issue that we see occasionally with this test, and the new error return in this patch exposes it.

I'll look into this and refresh the patch.

to address test timing problems.
- Remove reliance on pre-command sleep in test functions that perform pool rebuild stop commands.
- Change rebuild stop functions to check for -DER_NONEXIST NAK and loop until that condition disappears (prevent stop commands too early).
- Add test_rebuild_wait_to_start_lower() for tests to monitor for the transition from op:Rebuild to op:Fail_reclaim.
- Add test_rebuild_wait_to_start_next() for tests to monitor specifically for the Fail_reclaim->Rebuild transition (retry).
- Copy pool query results into test arg pool info when an interactive rebuild test invokes the various test functions to (loop and) wait for certain rebuild conditions to occur. Remove manual pool query and map/rs_version monitoring from test case code, replacing it with simpler test function calls.

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
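
The wait helpers named in this commit message monitor for a transition between rebuild operations rather than sleeping for a fixed time. The sketch below shows that polling pattern in general terms. It is not the code of test_rebuild_wait_to_start_lower(): the query callback, the `sketch_` names, and the poll limit are all assumptions made for illustration, and a real test would query the pool and pause between polls.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation values; not the DAOS RB_OP_* definitions. */
enum sketch_op { SKETCH_OP_NONE, SKETCH_OP_REBUILD, SKETCH_OP_FAIL_RECLAIM };

/* Hypothetical query callback: returns the pool's current rebuild operation. */
typedef enum sketch_op (*sketch_query_fn)(void *arg);

/*
 * Poll until the operation has been seen as `from` and is later seen as `to`.
 * Returns true if that happened within max_polls queries, false otherwise.
 */
static bool
sketch_wait_transition(sketch_query_fn query, void *arg, enum sketch_op from,
		       enum sketch_op to, unsigned int max_polls)
{
	bool         seen_from = false;
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		enum sketch_op cur = query(arg);

		if (cur == from)
			seen_from = true;
		else if (seen_from && cur == to)
			return true;
		/* A real test would sleep briefly here before the next query. */
	}
	return false;
}

/* Scripted stand-in for a pool query, used only to exercise the sketch. */
struct sketch_script {
	const enum sketch_op *ops;
	size_t                len;
	size_t                pos;
};

static enum sketch_op
sketch_scripted_query(void *arg)
{
	struct sketch_script *s   = arg;
	enum sketch_op        cur = s->ops[s->pos];

	/* Advance through the script, then keep returning the last entry. */
	if (s->pos + 1 < s->len)
		s->pos++;
	return cur;
}
```

Requiring that `from` is observed before `to` is what separates "the rebuild failed and moved on" from "the test started polling too late"; in the second case the helper times out instead of reporting success.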

daosbuild3 · 2026-01-31T01:11:42Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/4/testReport/

Looks like an instance of known issue https://daosio.atlassian.net/browse/DAOS-17416

kccain · 2026-02-02T01:09:59Z

@liuxuezhao and @wangshilong I think this one is ready for review now after fixing up a test timing issue from build 3 in the functional hw testing.

liuxuezhao · 2026-02-06T09:51:18Z

src/rebuild/srv.c

 		D_INFO(DF_RB ": stopping rebuild force=%u opc %u(%s)\n", DP_RB_RGT(rgt), force,
 		       rgt->rgt_opc, RB_OP_STR(rgt->rgt_opc));
 		rgt->rgt_abort           = 1;
 		rgt->rgt_status.rs_errno = -DER_OP_CANCELED;


Could rgt be NULL in some cases? Wouldn't that trigger a segfault?

If rgt==NULL, then rebuild_is_stoppable() will return false. So in this branch it will be non-NULL.

liuxuezhao · 2026-02-06T09:52:18Z

src/rebuild/srv.c

+	/* admin stop command does not usually terminate op:Fail_reclaim, but it is always
+	 * remembered to avoid retrying the original op:Rebuild.
 	 */
 	if (rgt->rgt_abort || (rgt->rgt_opc == RB_OP_FAIL_RECLAIM))


Ditto: could rgt be NULL here?

If rgt==NULL then rebuild_is_stoppable() has returned false, and the above else branch is taken, and within that it returns the rc value -DER_NONEXIST. So by this point in the execution flow rgt is non-NULL.

liw · 2026-02-12T00:04:20Z

src/rebuild/srv.c

 		D_INFO(DF_RB ": stopping rebuild force=%u opc %u(%s)\n", DP_RB_RGT(rgt), force,
 		       rgt->rgt_opc, RB_OP_STR(rgt->rgt_opc));
 		rgt->rgt_abort           = 1;
 		rgt->rgt_status.rs_errno = -DER_OP_CANCELED;


[Not requesting changes] I had the same questions when reading here and below. ds_rebuild_admin_stop seems to suffer a bit too much from collecting all rebuild-stoppable logic into rebuild_is_stoppable. Handling rgt == NULL outside of rebuild_is_stoppable (e.g., before calling rebuild_is_stoppable) might be an alternative worth considering, if the PR needed to be revised for other reasons. Just my naive impression. :)
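
The alternative raised in this comment, checking for a NULL rgt before consulting the stoppable predicate, could look roughly like the sketch below. Everything in it is hypothetical: `sketch_rgt`, `sketch_is_stoppable`, and `sketch_admin_stop` are simplified stand-ins for the real structures and functions in src/rebuild/srv.c, the codes are placeholders, and the single BUSY reply glosses over the several error codes the PR actually returns.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder codes. These are NOT the real DAOS DER_* numeric values. */
#define SKETCH_NONEXIST (-1) /* stands in for -DER_NONEXIST */
#define SKETCH_BUSY     (-2) /* stands in for any "not stoppable" reply */

/* Hypothetical, heavily simplified rebuild tracker. */
struct sketch_rgt {
	bool stoppable; /* result of the real stoppable checks */
	int  abort;     /* set when a stop is accepted */
};

/* Predicate that assumes a non-NULL tracker; the caller checks NULL first. */
static bool
sketch_is_stoppable(const struct sketch_rgt *rgt)
{
	return rgt->stoppable;
}

static int
sketch_admin_stop(struct sketch_rgt *rgt)
{
	/* Guard first, so every later rgt-> dereference is visibly safe. */
	if (rgt == NULL)
		return SKETCH_NONEXIST;

	if (!sketch_is_stoppable(rgt))
		return SKETCH_BUSY;

	rgt->abort = 1;
	return 0;
}
```

The behavior matches what the author describes for the current code (a NULL tracker yields the NONEXIST reply); the difference is only that a reader no longer has to look inside the predicate to see why the dereferences are safe.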

kccain · 2026-02-12T01:49:49Z

Thanks @liw for the review. @liuxuezhao what do you think: should the patch be updated, or is it good to proceed?

kccain · 2026-02-12T12:44:24Z

Actually, let's consider this for landing. For the same Jira ticket I have another patch #17382 that needs to have the changes in this patch. So I can consider refinements to handle rgt == NULL cases differently in the follow-on PR.

daltonbohning · 2026-02-12T15:12:49Z

FYI this test should be adjusted now to remove the sleeps

daos/src/tests/ftest/rebuild/interactive.py

Line 103 in f010597

time.sleep(secs_between_rebuild_start_and_manual_stop)

daos/src/tests/ftest/rebuild/interactive.py

Line 148 in f010597

time.sleep(secs_between_rebuild_start_and_manual_stop)

Do you want to handle that here or in a separate PR? Or, I could push a PR if you want.

kccain · 2026-02-12T16:35:54Z

Maybe it should be done in a separate PR, since I think the sleeps may need to be replaced with loops that retry upon seeing a -DER_NONEXIST failure in the rebuild stop command, similar to what the daos_test logic is doing in this patch in rebuild_stop_with_dmg.
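
A retry loop of the kind described here might be sketched as follows, written in C like the daos_test code. This is not the body of rebuild_stop_with_dmg: the callback type, the `sketch_` names, the placeholder error value, and the retry limit are assumptions for illustration. The Python ftest would need an equivalent loop around its own dmg call.

```c
#include <stddef.h>

/* Placeholder code. This is NOT the real DAOS DER_NONEXIST numeric value. */
#define SKETCH_NONEXIST (-1) /* stands in for -DER_NONEXIST */

/* Hypothetical callback that issues one "rebuild stop" and returns its rc. */
typedef int (*sketch_stop_fn)(void *arg);

/*
 * Issue the stop command until it no longer reports "no running rebuild",
 * or until max_tries attempts have been made. Returns the last rc seen.
 */
static int
sketch_stop_with_retry(sketch_stop_fn stop, void *arg, unsigned int max_tries)
{
	int          rc = SKETCH_NONEXIST;
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		rc = stop(arg);
		if (rc != SKETCH_NONEXIST)
			break; /* accepted, or failed for a different reason */
		/* A real test would sleep briefly here before retrying. */
	}
	return rc;
}

/* Fake stop command for exercising the sketch: NAKs a fixed number of times. */
struct sketch_fake {
	unsigned int naks_left;
	unsigned int calls;
};

static int
sketch_fake_stop(void *arg)
{
	struct sketch_fake *f = arg;

	f->calls++;
	if (f->naks_left > 0) {
		f->naks_left--;
		return SKETCH_NONEXIST;
	}
	return 0;
}
```

Bounding the loop matters: if the rebuild never starts, the test should fail with the NONEXIST reply rather than hang.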

daltonbohning · 2026-02-12T17:11:50Z

I created https://daosio.atlassian.net/browse/DAOS-18593 to handle this after this PR lands

liuxuezhao

LGTM, did not check test code's logic.

Merge branch 'master' into kccain/daos_18425_nak_stop_commands

c802f2b

kccain marked this pull request as ready for review January 27, 2026 12:55

kccain requested review from a team as code owners January 27, 2026 12:55

kccain requested review from liuxuezhao and wangshilong January 27, 2026 13:01

kccain added 2 commits January 30, 2026 18:08

Merge branch 'master' into kccain/daos_18425_nak_stop_commands

783ac97

kccain mentioned this pull request Feb 3, 2026

DAOS-18470 rebuild: re-schedule rebuild task after stopped #17492

Merged

6 tasks

liuxuezhao reviewed Feb 6, 2026

kccain requested review from gnailzenh and liw February 11, 2026 12:07

wangshilong approved these changes Feb 11, 2026

liw approved these changes Feb 12, 2026

kccain requested a review from a team February 12, 2026 12:44

liuxuezhao approved these changes Feb 13, 2026

gnailzenh approved these changes Feb 13, 2026

gnailzenh merged commit 1634f79 into master Feb 13, 2026
39 of 41 checks passed

gnailzenh deleted the kccain/daos_18425_nak_stop_commands branch February 13, 2026 12:45
