-
Notifications
You must be signed in to change notification settings - Fork 503
Fix infinite loop OOM bug caused by rebalancing shards from inactive ingesters #6078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ingesters Signed-off-by: Nadav Gov-Ari <[email protected]>
Signed-off-by: Nadav Gov-Ari <[email protected]>
Signed-off-by: Nadav Gov-Ari <[email protected]>
Signed-off-by: Nadav Gov-Ari <[email protected]>
| shards | ||
| }) | ||
| .collect(); | ||
| let target_max = total_open_shards.div_ceil(num_available_ingesters); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point of using 0.9/1.1 was to add some tolerance to make sure rebalance do not happen too often.
| let mut per_leader_open_shards: HashMap<&str, Vec<&Shard>> = HashMap::new(); | ||
|
|
||
| let mut total_open_shards: usize = 0; | ||
| let mut orphaned_shards: Vec<Shard> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
counting orphaned shards as needing to be moved is dangerous.
Right now moving a shard is done by opening a shard, and closing the shard moved.
Closing a shard requires contacting the node hosting it, so it can fail. In this exact case, it WILL fail because the ingester is not even in the pool.
Shards on inactive indexers cannot be moved, because they cannot be closed. For this reason, we remove them from the equation. Also, this PR reintroduce some leniency on our definition of what balanced means. We allow for a variation of 10% between the indexer with the highest number of shards and the indexer with the lowest number of shards. Signed-off-by: Nadav Gov-Ari <[email protected]>
3602106 to
243d34e
Compare
Signed-off-by: Nadav Gov-Ari <[email protected]>
fac5569 to
640b9f1
Compare
| /// - `active_ingester_shards`: shards per active ingester (these are in the pool) | ||
| /// - `inactive_ingester_shards`: shards per inactive ingester (NOT in the pool - orphaned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// - `active_ingester_shards`: shards per active ingester (these are in the pool) | |
| /// - `inactive_ingester_shards`: shards per inactive ingester (NOT in the pool - orphaned) | |
| /// - `available_ingester_shards`: shards per available ingester (shards whose leader is in the pool) | |
| /// - `unavailable_ingester_shards`: shards per unavailable ingester (NOT in the pool - orphaned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the notion of orphaned. Aren't the shards whose leader is in NOT the pool the orphans already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the comments got jumbled in between revisions. Let me fix.
| active_ingester_shards: &[usize], | ||
| inactive_ingester_shards: &[usize], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/active/available/ everywhere in this function.
Signed-off-by: Nadav Gov-Ari <[email protected]>
Signed-off-by: Nadav Gov-Ari <[email protected]>
Signed-off-by: Nadav Gov-Ari <[email protected]>
05b563d to
90f24ea
Compare
90f24ea to
dd46e2d
Compare
Rebalances all shards from unavailable ingesters. Moves to only look at ingesters with more shards than allowed as candidates for rebalancing. Also moves to integer division to simplify target calculation.