Skip to content

DAOS-18425 rebuild: NAK certain rebuild stop commands#17421

Merged
gnailzenh merged 4 commits intomasterfrom
kccain/daos_18425_nak_stop_commands
Feb 13, 2026
Merged

DAOS-18425 rebuild: NAK certain rebuild stop commands#17421
gnailzenh merged 4 commits intomasterfrom
kccain/daos_18425_nak_stop_commands

Conversation

@kccain
Copy link
Contributor

@kccain kccain commented Jan 21, 2026

When a dmg pool rebuild stop (or system rebuild stop) command is run, the PS leader should refuse to stop a currently-running rebuild if there are more scheduled rebuilds for the pool in the rg_queue_list. In this case, -DER_NO_PERM is returned to the dmg command.

Also, for usability of the feature, the handling of the stop command will return errors when:

  • there is no currently-running rebuild (-DER_NONEXIST)
  • the rebuild has effectively finsihed, and is simply cleaning up (i.e., it is in op:Reclaim now) (-DER_BUSY)

Features: rebuild

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

When a dmg pool rebuild stop (or system rebuild stop) command is run,
the PS leader should refuse to stop a currently-running rebuild if
there are more scheduled rebuilds for the pool in the rg_queue_list.
In this case, -DER_NO_PERM is returned to the dmg command.

Also, for usability of the feature, the handling of the stop command
will return errors when:
- there is no currently-running rebuild (-DER_NONEXIST)
- the rebuild has effectively finsihed, and is simply cleaning
  up (i.e., it is in op:Reclaim now) (-DER_BUSY)

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@github-actions
Copy link

Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration'
Status is 'In Progress'
Labels: 'Rebuild,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18425

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17421/1/display/redirect

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17421/1/execution/node/1276/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/1/testReport/

@daosbuild3
Copy link
Collaborator

daosbuild3 commented Jan 22, 2026

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/2/testReport/

This one actually ran fine as far as I can tell from https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/view/change-requests/job/PR-17421/2/pipeline-overview/?selected-node=507
17:58:54 RESULTS : PASS 29 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

@daosbuild3
Copy link
Collaborator

daosbuild3 commented Jan 22, 2026

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/2/testReport/

fault_injection/pool.py failure is an instance of existing issue DAOS-18519

@daosbuild3
Copy link
Collaborator

daosbuild3 commented Jan 23, 2026

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17421/2/execution/node/492/log

@kccain kccain marked this pull request as ready for review January 27, 2026 12:55
@kccain kccain requested review from a team as code owners January 27, 2026 12:55
@kccain
Copy link
Contributor Author

kccain commented Jan 27, 2026

After successfully running Features: rebuild testing in build2, have merged with latest master and pushed for per-PR-only testing (in progress now, build 3). I think it's reasonable to perform code reviews in parallel.

@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

daosbuild3 commented Jan 28, 2026

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/3/testReport/

There is a problem - on the surface (top level URL for test results) it looks like functional HW tests all passed (just like prior example in build 2). But diving into the testReport there is a daos_rebuild_interactive.c test that is failing, a "rebuild stop" command returns -DER_NONEXIST versus the expectation that it succeeds. A test timing issue, that we are seeing occasionally with this test, and in particular with this patch revealing the timing issue with the error return.

I'll look into this and refresh the patch.

to address test timing problems.

- Remove reliance on pre-command sleep in test functions that
  perform pool rebuild stop commands..
- Change rebuild stop functions to check for -DER_NONEXIST NAK and loop
  until that condition disappears (prevent stop commands too early).
- Add test_rebuild_wait_to_start_lower() for tests to monitor
  for transition from op:Rebuild to op:Fail_reclaim.
- Add test_rebuild_wait_to_start_next() for tests to wait
  for tests to particularly monitor for Fail_reclaim->Rebuild (retry).
- Copy pool query results into test arg pool info when an interactive
  rebuild test   invokes the various test functions to (loop and) wait
  for certain rebuild conditions to occur. Remove manual pool query
  and map/rs_version monitoring from test case code, replacing it with
  simpler test function calls.

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@daosbuild3
Copy link
Collaborator

daosbuild3 commented Jan 31, 2026

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/4/testReport/

Looks like an instance of known issue https://daosio.atlassian.net/browse/DAOS-17416

@kccain
Copy link
Contributor Author

kccain commented Feb 2, 2026

@liuxuezhao and @wangshilong I think this one is ready for review now after fixing up a test timing issue from build 3 in the functional hw testing.

D_INFO(DF_RB ": stopping rebuild force=%u opc %u(%s)\n", DP_RB_RGT(rgt), force,
rgt->rgt_opc, RB_OP_STR(rgt->rgt_opc));
rgt->rgt_abort = 1;
rgt->rgt_status.rs_errno = -DER_OP_CANCELED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some case rgt possible be NULL? won't it trigger segfault then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if rgt==NULL, then rebuild_is_stoppable() will return false. So in this branch it will be non-NULL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Not requesting changes] I had the same questions when reading here and below. ds_rebuild_admin_stop seems to suffer a bit too much for collecting all rebuild-stoppable logic into rebuild_is_stoppable. Handling rgt == NULL outside of rebuild_is_stopping (e.g., before calling rebuild_is_stopping) might be an alternative worth considering, if the PR needed to be revised for other reasons. Just my naive impression. :)

/* admin stop command does not usually terminate op:Fail_reclaim, but it is always
* remembered to avoid retrying the original op:Rebuild.
*/
if (rgt->rgt_abort || (rgt->rgt_opc == RB_OP_FAIL_RECLAIM))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, rgt possible is NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If rgt==NULL then rebuild_is_stoppable() has returned false, and the above else branch is taken, and within that it returns the rc value -DER_NONEXIST. So by this point in the execution flow rgt is non-NULL.

@kccain kccain requested review from gnailzenh and liw February 11, 2026 12:07
D_INFO(DF_RB ": stopping rebuild force=%u opc %u(%s)\n", DP_RB_RGT(rgt), force,
rgt->rgt_opc, RB_OP_STR(rgt->rgt_opc));
rgt->rgt_abort = 1;
rgt->rgt_status.rs_errno = -DER_OP_CANCELED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Not requesting changes] I had the same questions when reading here and below. ds_rebuild_admin_stop seems to suffer a bit too much for collecting all rebuild-stoppable logic into rebuild_is_stoppable. Handling rgt == NULL outside of rebuild_is_stopping (e.g., before calling rebuild_is_stopping) might be an alternative worth considering, if the PR needed to be revised for other reasons. Just my naive impression. :)

@kccain
Copy link
Contributor Author

kccain commented Feb 12, 2026

Thanks @liw for the review. @liuxuezhao what do you think, should the patch be updated or good to proceed?

@kccain
Copy link
Contributor Author

kccain commented Feb 12, 2026

Actually, let's consider this for landing. For the same Jira ticket I have another patch #17382 that needs to have the changes in this patch. So I can consider refinements to handle rgt == NULL cases differently in the follow-on PR.

@kccain kccain requested a review from a team February 12, 2026 12:44
@daltonbohning
Copy link
Contributor

FYI this test should be adjusted now to remove the sleeps

time.sleep(secs_between_rebuild_start_and_manual_stop)

time.sleep(secs_between_rebuild_start_and_manual_stop)

Do you want to handle that here or in a separate PR? Or, I could push a PR if you want.

@kccain
Copy link
Contributor Author

kccain commented Feb 12, 2026

FYI this test should be adjusted now to remove the sleeps

time.sleep(secs_between_rebuild_start_and_manual_stop)

time.sleep(secs_between_rebuild_start_and_manual_stop)

Do you want to handle that here or in a separate PR? Or, I could push a PR if you want.

Maybe it should be done in a separate PR, since I think the sleeps may need to be replaced with loops that retry upon seeing a -DER_NONEXIST failure in the rebuild stop command, similar to what the daos_test logic is doing in this patch in rebuild_stop_with_dmg

@daltonbohning
Copy link
Contributor

Maybe it should be done in a separate PR, since I think the sleeps may need to be replaced with loops that retry upon seeing a -DER_NONEXIST failure in the rebuild stop command, similar to what the daos_test logic is doing in this patch in rebuild_stop_with_dmg

I created https://daosio.atlassian.net/browse/DAOS-18593 to handle this after this PR lands

Copy link
Contributor

@liuxuezhao liuxuezhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, did not check test code's logic.

@gnailzenh gnailzenh merged commit 1634f79 into master Feb 13, 2026
39 of 41 checks passed
@gnailzenh gnailzenh deleted the kccain/daos_18425_nak_stop_commands branch February 13, 2026 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

7 participants