
DAOS-18425 rebuild: coalesce tasks for multi-rank operations#17382

Open
kccain wants to merge 4 commits into master from kccain/daos_18425

Conversation

@kccain
Contributor

@kccain kccain commented Jan 15, 2026

Multi-rank failure events, and those initiated by control plane commands (e.g., dmg), are processed in a rank-by-rank fashion. This leads to a rapid sequence of pool map updates and rebuild scheduling activity, and can produce serialized rebuilds, one per rank, rather than consolidating the related operations into a single rebuild job. The result can be slower execution, and confusion for an administrator trying to monitor rebuild progress via pool query, or wanting to use interactive rebuild controls (e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates schedule a rebuild with a 5 second delay, so the resulting entry in rebuild_gst.rg_queue_list carries a scheduling time (dst_schedule_time) in the near future. Logic in rebuild_ults() is also changed to not dequeue a rebuild task while it has a future scheduling time. This allows compatible updates (that may be processed imminently) to be merged by their ds_rebuild_schedule() call.

Features: rebuild

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions bot commented Jan 15, 2026

Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration'
Status is 'In Review'
Labels: 'Rebuild,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18425

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

@daosbuild3
Collaborator

daosbuild3 commented Jan 16, 2026

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

rebuild/widely_striped.py failure (pool query timed out with 5 minute deadline) seems to be an instance of existing issue https://daosio.atlassian.net/browse/DAOS-18302

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/3/execution/node/468/log

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/3/testReport/

@kccain kccain force-pushed the kccain/daos_18425 branch from 1f838a6 to 1853800 Compare February 7, 2026 03:23
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1281/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1322/log

@kccain kccain force-pushed the kccain/daos_18425 branch from 1853800 to 3af9e59 Compare February 9, 2026 16:31
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1338/log

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1348/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/10/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/10/execution/node/657/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/456/log

@daosbuild3
Collaborator

daosbuild3 commented Feb 13, 2026

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/466/log

Many multiple rank failure events or those initiated by control
plane commands (e.g., dmg) are processed in a rank-by-rank fashion.
This leads to a (rapid) sequence of pool map updates and rebuild
scheduling activities. It can lead to serialized rebuilds, one
per rank, rather than consolidating the related operations into
a single rebuild job. This can result in slower execution, and
cause a confusion for an administrator trying to monitor rebuild
progress via pool query, and possibly wanting to use interactive
rebuild controls (e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates
are changed to schedule a rebuild with a 5 second delay, causing the
resulting entry in the rebuild_gst.rg_queue_list to specify a
scheduling time (dst_schedule_time) in the near-future. And, logic in
rebuild_ults() is changed to not dequeue a rebuild task if it has a
future scheduling time. This allows compatible changes (that may be
processed imminently) to be merged by their ds_rebuild_schedule() call.

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
and minor changes to ds_rebuild_admin_stop for rgt==NULL

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@daosbuild3
Collaborator

daosbuild3 commented Feb 15, 2026

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/13/execution/node/1433/log

erasurecode/multiple_rank_failure.py and one other failure are instances of known issue DAOS-16766

@kccain kccain marked this pull request as ready for review February 15, 2026 22:13
@kccain kccain requested review from a team as code owners February 15, 2026 22:13
Contributor

@daltonbohning daltonbohning left a comment


ftest LGTM

@kccain
Contributor Author

kccain commented Mar 2, 2026

ping reviewers

/* Not expected if rebuild_ults() prevents dequeuing task until schedule time is
* reached. */
D_INFO("rebuild task sleep " DF_U64 " second\n", task->dst_schedule_time - cur_ts);
dss_sleep((task->dst_schedule_time - cur_ts) * 1000);
Contributor


why do we still keep this dss_sleep() at all? I think if you change the code and make rebuild task on the waiting queue, we don't have to sleep again right?

Contributor Author


Yes, this could be removed now. With the new changes, in theory this branch won't ever be taken. In the future, if for some reason we change our mind on how rebuild_gst.rg_queue_list items are dequeued, then leaving this code here could be useful / defensive to make sure that the schedule time is still honored. I'm happy to remove it for simplicity - let me know what you prefer.

Contributor


I think we should just remove it. I checked the commit history of the code and believe Di made this change because he wanted to avoid merging rebuild with reclaim or other operations; he later found another way to avoid that but forgot to remove this code. Please double check and see if my understanding is correct, thanks

Contributor Author


I'll merge patch with latest master and apply the change.

Yes, it seems ds_rebuild_schedule() -> rebuild_try_merge_tgts() will avoid merging op:rebuild with anything other than another op:rebuild. And if there is a reclaim in the queue then a new op:rebuild with higher pool map version will be scheduled after that reclaim (i.e., it could only merge with another op:rebuild following the reclaim in the queue, not any op:rebuild that might precede that reclaim task).

I thought about it from this scenario perspective:

  • T=0: exclude targets of rank A occurs (dmg pool exclude --ranks=A) - gets queued, and 5 seconds later starts running. Queue is empty now.
  • T=6: exclude targets of rank B occurs (dmg pool exclude --ranks=B) - gets queued with a schedule time 5 seconds in the future.
  • T=6.01: rebuild for exclude rank A targets finishes, and reclaim(A) is queued (behind the exclude rank B targets task).
  • T=7: exclude targets of rank C occurs (dmg pool exclude --ranks=C,D) - queued behind the exclude B task and the reclaim A task. No merging here.
  • T=7.001: exclude targets of rank D occurs (from the same pool exclude command) - merged with the exclude targets of rank C task in the queue, behind the exclude B task and the reclaim A task.

kccain added 2 commits March 6, 2026 09:57
Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@daosbuild3
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17382/15/display/redirect

@daosbuild3
Collaborator


6 participants