DAOS-18425 rebuild: coalesce tasks for multi-rank operations #17382
Conversation
Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration'

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/ rebuild/widely_striped.py failure (pool query timed out with 5 minute deadline) seems to be an instance of existing issue https://daosio.atlassian.net/browse/DAOS-18302

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/3/execution/node/468/log

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/3/testReport/

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/4/testReport/

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/4/execution/node/1013/log
1f838a6 to 1853800
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/5/testReport/

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1281/log

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1322/log
1853800 to 3af9e59
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/7/testReport/

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1338/log

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1348/log
3af9e59 to 7f880e3
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/10/testReport/

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/10/execution/node/657/log
7f880e3 to 551ce8e
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/456/log

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/466/log
Multi-rank failure events, and multi-rank operations initiated by control plane commands (e.g., dmg), are processed rank by rank. This produces a rapid sequence of pool map updates and rebuild scheduling activity, which can serialize rebuilds, one per rank, rather than consolidating the related operations into a single rebuild job. The result can be slower execution and confusion for an administrator trying to monitor rebuild progress via pool query, or to use the interactive rebuild controls (e.g., rebuild stop|start). With this change, exclude, reintegration, and drain pool map updates schedule a rebuild with a 5 second delay, so the resulting entry in rebuild_gst.rg_queue_list carries a scheduling time (dst_schedule_time) in the near future. Logic in rebuild_ults() is changed to not dequeue a rebuild task whose scheduling time is still in the future. This allows compatible changes (which may arrive imminently) to be merged by their ds_rebuild_schedule() call.

Features: rebuild
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Also: minor changes to ds_rebuild_admin_stop() for the rgt == NULL case.

Features: rebuild
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
551ce8e to da370d3
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/13/testReport/

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/13/execution/node/1433/log erasurecode/multiple_rank_failure.py and one other failure are instances of known issue DAOS-16766

ping reviewers
src/rebuild/srv.c (outdated diff)

    /* Not expected if rebuild_ults() prevents dequeuing task until schedule time is
     * reached. */
    D_INFO("rebuild task sleep " DF_U64 " second\n", task->dst_schedule_time - cur_ts);
    dss_sleep((task->dst_schedule_time - cur_ts) * 1000);
Why do we still keep this dss_sleep() at all? If the change keeps the rebuild task on the waiting queue, we don't have to sleep again, right?
Yes, this could be removed now. With the new changes, in theory this branch won't ever be taken. In the future, if for some reason we change our mind on how rebuild_gst.rg_queue_list items are dequeued, then leaving this code here could be useful / defensive to make sure that the schedule time is still honored. I'm happy to remove it for simplicity - let me know what you prefer.
I think we should just remove it. I checked the commit history and believe Di made this change because he wanted to avoid merging rebuild with reclaim or other operations; he then found another way to avoid that but forgot to remove it. Please double check and see if my understanding is correct, thanks.
I'll merge the patch with latest master and apply the change.
Yes, it seems ds_rebuild_schedule() -> rebuild_try_merge_tgts() will avoid merging op:rebuild with anything other than another op:rebuild. And if there is a reclaim in the queue then a new op:rebuild with higher pool map version will be scheduled after that reclaim (i.e., it could only merge with another op:rebuild following the reclaim in the queue, not any op:rebuild that might precede that reclaim task).
I thought about it from this scenario perspective:
- T=0: exclude targets of rank A occurs (dmg pool exclude --ranks=A) - gets queued, and 5 seconds later starts running. Queue empty now.
- T=6: exclude targets of rank B occurs (dmg pool exclude --ranks=B). Gets queued with schedule time 5 seconds in the future.
- T=6.01: rebuild for the exclude rank A targets finishes, and reclaim(A) is queued (behind the exclude rank B targets task).
- T=7: exclude targets of rank C occurs (dmg pool exclude --ranks=C,D) - queued behind the exclude B task and the reclaim A task. No merging here.
- T=7.001: exclude targets of rank D occurs (from the same pool exclude command) - merged with the exclude targets of rank C task in the queue, behind the exclude B task and the reclaim A task.
Features: rebuild
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17382/15/display/redirect

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/19/testReport/