DAOS-18425 rebuild: NAK certain rebuild stop commands#17421
DAOS-18425 rebuild: NAK certain rebuild stop commands#17421
Conversation
When a dmg pool rebuild stop (or system rebuild stop) command is run, the PS leader should refuse to stop a currently-running rebuild if there are more scheduled rebuilds for the pool in the rg_queue_list. In this case, -DER_NO_PERM is returned to the dmg command. Also, for usability of the feature, the handling of the stop command will return errors when: - there is no currently-running rebuild (-DER_NONEXIST) - the rebuild has effectively finsihed, and is simply cleaning up (i.e., it is in op:Reclaim now) (-DER_BUSY) Features: rebuild Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
|
Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration' |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17421/1/display/redirect |
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17421/1/execution/node/1276/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/1/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/2/testReport/ This one actually ran fine as far as I can tell from https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/view/change-requests/job/PR-17421/2/pipeline-overview/?selected-node=507 |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/2/testReport/ fault_injection/pool.py failure is an instance of existing issue DAOS-18519 |
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17421/2/execution/node/492/log
|
|
After successfully running |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/3/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/3/testReport/ There is a problem - on the surface (top level URL for test results) it looks like functional HW tests all passed (just like prior example in build 2). But diving into the testReport there is a daos_rebuild_interactive.c test that is failing, a "rebuild stop" command returns -DER_NONEXIST versus the expectation that it succeeds. A test timing issue, that we are seeing occasionally with this test, and in particular with this patch revealing the timing issue with the error return. I'll look into this and refresh the patch. |
to address test timing problems. - Remove reliance on pre-command sleep in test functions that perform pool rebuild stop commands.. - Change rebuild stop functions to check for -DER_NONEXIST NAK and loop until that condition disappears (prevent stop commands too early). - Add test_rebuild_wait_to_start_lower() for tests to monitor for transition from op:Rebuild to op:Fail_reclaim. - Add test_rebuild_wait_to_start_next() for tests to wait for tests to particularly monitor for Fail_reclaim->Rebuild (retry). - Copy pool query results into test arg pool info when an interactive rebuild test invokes the various test functions to (loop and) wait for certain rebuild conditions to occur. Remove manual pool query and map/rs_version monitoring from test case code, replacing it with simpler test function calls. Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17421/4/testReport/ Looks like an instance of known issue https://daosio.atlassian.net/browse/DAOS-17416 |
|
@liuxuezhao and @wangshilong I think this one is ready for review now after fixing up a test timing issue from build 3 in the functional hw testing. |
| D_INFO(DF_RB ": stopping rebuild force=%u opc %u(%s)\n", DP_RB_RGT(rgt), force, | ||
| rgt->rgt_opc, RB_OP_STR(rgt->rgt_opc)); | ||
| rgt->rgt_abort = 1; | ||
| rgt->rgt_status.rs_errno = -DER_OP_CANCELED; |
There was a problem hiding this comment.
some case rgt possible be NULL? won't it trigger segfault then?
There was a problem hiding this comment.
if rgt==NULL, then rebuild_is_stoppable() will return false. So in this branch it will be non-NULL.
There was a problem hiding this comment.
[Not requesting changes] I had the same questions when reading here and below. ds_rebuild_admin_stop seems to suffer a bit too much for collecting all rebuild-stoppable logic into rebuild_is_stoppable. Handling rgt == NULL outside of rebuild_is_stopping (e.g., before calling rebuild_is_stopping) might be an alternative worth considering, if the PR needed to be revised for other reasons. Just my naive impression. :)
| /* admin stop command does not usually terminate op:Fail_reclaim, but it is always | ||
| * remembered to avoid retrying the original op:Rebuild. | ||
| */ | ||
| if (rgt->rgt_abort || (rgt->rgt_opc == RB_OP_FAIL_RECLAIM)) |
There was a problem hiding this comment.
ditto, rgt possible is NULL?
There was a problem hiding this comment.
If rgt==NULL then rebuild_is_stoppable() has returned false, and the above else branch is taken, and within that it returns the rc value -DER_NONEXIST. So by this point in the execution flow rgt is non-NULL.
| D_INFO(DF_RB ": stopping rebuild force=%u opc %u(%s)\n", DP_RB_RGT(rgt), force, | ||
| rgt->rgt_opc, RB_OP_STR(rgt->rgt_opc)); | ||
| rgt->rgt_abort = 1; | ||
| rgt->rgt_status.rs_errno = -DER_OP_CANCELED; |
There was a problem hiding this comment.
[Not requesting changes] I had the same questions when reading here and below. ds_rebuild_admin_stop seems to suffer a bit too much for collecting all rebuild-stoppable logic into rebuild_is_stoppable. Handling rgt == NULL outside of rebuild_is_stopping (e.g., before calling rebuild_is_stopping) might be an alternative worth considering, if the PR needed to be revised for other reasons. Just my naive impression. :)
|
Thanks @liw for the review. @liuxuezhao what do you think, should the patch be updated or good to proceed? |
|
Actually, let's consider this for landing. For the same Jira ticket I have another patch #17382 that needs to have the changes in this patch. So I can consider refinements to handle |
|
FYI this test should be adjusted now to remove the sleeps daos/src/tests/ftest/rebuild/interactive.py Line 103 in f010597 daos/src/tests/ftest/rebuild/interactive.py Line 148 in f010597 Do you want to handle that here or in a separate PR? Or, I could push a PR if you want. |
Maybe it should be done in a separate PR, since I think the sleeps may need to be replaced with loops that retry upon seeing a |
I created https://daosio.atlassian.net/browse/DAOS-18593 to handle this after this PR lands |
liuxuezhao
left a comment
There was a problem hiding this comment.
LGTM, did not check test code's logic.
When a dmg pool rebuild stop (or system rebuild stop) command is run, the PS leader should refuse to stop a currently-running rebuild if there are more scheduled rebuilds for the pool in the rg_queue_list. In this case, -DER_NO_PERM is returned to the dmg command.
Also, for usability of the feature, the handling of the stop command will return errors when:
Features: rebuild
Steps for the author:
After all prior steps are complete: