
DAOS-18425 rebuild: coalesce tasks for multi-rank operations#17382

Open
kccain wants to merge 4 commits into master from kccain/daos_18425

Conversation

@kccain
Contributor

@kccain kccain commented Jan 15, 2026

Multi-rank failure events, and those initiated by control plane commands (e.g., dmg), are processed in a rank-by-rank fashion. This leads to a rapid sequence of pool map updates and rebuild scheduling activity, and can produce serialized rebuilds, one per rank, rather than consolidating the related operations into a single rebuild job. The result can be slower execution, and confusion for an administrator trying to monitor rebuild progress via pool query, or wanting to use interactive rebuild controls (e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates schedule a rebuild with a 5 second delay, so the resulting entry in rebuild_gst.rg_queue_list carries a scheduling time (dst_schedule_time) in the near future. Logic in rebuild_ults() is also changed to not dequeue a rebuild task while it has a future scheduling time. This allows compatible updates (that may be processed imminently) to be merged by their ds_rebuild_schedule() call.

Features: rebuild

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions bot commented Jan 15, 2026

Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration'
Status is 'In Review'
Labels: 'Rebuild,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18425

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

@daosbuild3
Collaborator

daosbuild3 commented Jan 16, 2026

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

rebuild/widely_striped.py failure (pool query timed out with 5 minute deadline) seems to be an instance of existing issue https://daosio.atlassian.net/browse/DAOS-18302

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/3/execution/node/468/log

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/3/testReport/

@kccain kccain force-pushed the kccain/daos_18425 branch from 1f838a6 to 1853800 Compare February 7, 2026 03:23
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1281/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1322/log

@kccain kccain force-pushed the kccain/daos_18425 branch from 1853800 to 3af9e59 Compare February 9, 2026 16:31
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1338/log

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1348/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/10/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/10/execution/node/657/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/456/log

@daosbuild3
Collaborator

daosbuild3 commented Feb 13, 2026

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/466/log

Many multiple rank failure events or those initiated by control
plane commands (e.g., dmg) are processed in a rank-by-rank fashion.
This leads to a (rapid) sequence of pool map updates and rebuild
scheduling activities. It can lead to serialized rebuilds, one
per rank, rather than consolidating the related operations into
a single rebuild job. This can result in slower execution, and
cause a confusion for an administrator trying to monitor rebuild
progress via pool query, and possibly wanting to use interactive
rebuild controls (e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates
are changed to schedule a rebuild with a 5 second delay, causing the
resulting entry in the rebuild_gst.rg_queue_list to specify a
scheduling time (dst_schedule_time) in the near-future. And, logic in
rebuild_ults() is changed to not dequeue a rebuild task if it has a
future scheduling time. This allows compatible changes (that may be
processed imminently) to be merged by their ds_rebuild_schedule() call.

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
and minor changes to ds_rebuild_admin_stop for rgt==NULL

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@daosbuild3
Collaborator

daosbuild3 commented Feb 15, 2026

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/13/execution/node/1433/log

erasurecode/multiple_rank_failure.py and one other failure are instances of known issue DAOS-16766

@kccain kccain marked this pull request as ready for review February 15, 2026 22:13
@kccain kccain requested review from a team as code owners February 15, 2026 22:13
Contributor

@daltonbohning daltonbohning left a comment


ftest LGTM

@kccain
Contributor Author

kccain commented Mar 2, 2026

ping reviewers

/* Not expected if rebuild_ults() prevents dequeuing task until schedule time is
* reached. */
D_INFO("rebuild task sleep " DF_U64 " second\n", task->dst_schedule_time - cur_ts);
dss_sleep((task->dst_schedule_time - cur_ts) * 1000);
Contributor


why do we still keep this dss_sleep() at all? I think if you change the code and make rebuild task on the waiting queue, we don't have to sleep again right?

Contributor Author


Yes, this could be removed now. With the new changes, in theory this branch won't ever be taken. In the future, if for some reason we change our mind on how rebuild_gst.rg_queue_list items are dequeued, then leaving this code here could be useful / defensive to make sure that the schedule time is still honored. I'm happy to remove it for simplicity - let me know what you prefer.

Contributor


I think we should just remove it. I checked the commit history of the code and believe Di made this change because he wanted to avoid merging rebuild with reclaim or other operations; he later found another way to avoid that but forgot to remove this code. Please double check and see if my understanding is correct, thanks

Contributor Author


I'll merge patch with latest master and apply the change.

Yes, it seems ds_rebuild_schedule() -> rebuild_try_merge_tgts() will avoid merging op:rebuild with anything other than another op:rebuild. And if there is a reclaim in the queue then a new op:rebuild with higher pool map version will be scheduled after that reclaim (i.e., it could only merge with another op:rebuild following the reclaim in the queue, not any op:rebuild that might precede that reclaim task).

I thought about it from this scenario perspective:

  • T=0: exclude targets of rank A occurs (dmg pool exclude --ranks=A) - gets queued, and 5 seconds later starts running. Queue is empty now.
  • T=6: exclude targets of rank B occurs (dmg pool exclude --ranks=B) - gets queued with a schedule time 5 seconds in the future.
  • T=6.01: rebuild for exclude rank A targets finishes, and reclaim(A) is queued (behind the exclude rank B targets task).
  • T=7: exclude targets of rank C occurs (dmg pool exclude --ranks=C,D) - queued behind the exclude B task and the reclaim A task. No merging here.
  • T=7.001: exclude targets of rank D occurs (from the same pool exclude command) - merged with the exclude targets of rank C task in the queue, behind the exclude B task and the reclaim A task.

kccain added 2 commits March 6, 2026 09:57
Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@daosbuild3
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17382/15/display/redirect

@daosbuild3
Collaborator


6 participants