HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST #6293

Open
thomasrebele wants to merge 1 commit into apache:master from thomasrebele:tr/HIVE-29424

Conversation

@thomasrebele
Contributor

See HIVE-29424.

What changes were proposed in this pull request?

This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover some corner cases. To establish the ground truth, I executed queries with these predicates; see the resulting q.out file.

Why are the changes needed?

This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added.

@thomasrebele thomasrebele marked this pull request as ready for review February 4, 2026 08:53
Member

@zabetak zabetak left a comment

Thanks for the PR @thomasrebele , the proposal is very promising.

One general question that came to mind while reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions as well. For example, is there any benefit in attempting to remove a CAST from the following expressions:

  • IS NOT NULL(CAST($1):BIGINT)
  • =(CAST($1):DOUBLE, 1)
  • IN(CAST($1):TINYINT, 10, 20, 30)

Comment on lines +736 to +756
// swap equation, e.g., col < 5 becomes 5 > col; selectivity stays the same
RexCall call = (RexCall) filter;
SqlOperator operator = ((RexCall) filter).getOperator();
SqlOperator swappedOp;
if (operator == LE) {
  swappedOp = GE;
} else if (operator == LT) {
  swappedOp = GT;
} else if (operator == GE) {
  swappedOp = LE;
} else if (operator == GT) {
  swappedOp = LT;
} else if (operator == BETWEEN) {
  // BETWEEN cannot be swapped
  return;
} else {
  throw new UnsupportedOperationException();
}
RexNode swapped = REX_BUILDER.makeCall(swappedOp, call.getOperands().get(1), call.getOperands().get(0));
Assert.assertEquals(filter.toString(), expectedSelectivity, estimator.estimateSelectivity(swapped), DELTA);
}
Member

What's the point of swapping if we are already explicitly testing the inverse operation in the test itself? I think it's better to keep the tests explicit and drop this swapping logic.

Contributor Author

The test cases are all of the form col OP value. The swapping tests expressions of the form value OP col, which is not tested explicitly. For a few lines of code we get much better coverage.

Member

Sounds good. In this case I think you can use org.apache.calcite.rex.RexUtil#invert, which does the same thing.

* @param inclusive whether the respective boundary is inclusive or exclusive.
* @return the operand if the cast can be removed, otherwise the cast itself
*/
private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, float[] boundaries, boolean[] inclusive) {
Member

The logic in this method is similar to org.apache.calcite.rex.RexUtil#isLosslessCast(org.apache.calcite.rex.RexNode). Since the method here has access to actual ranges and stats, it may be more effective for CASTs that narrow the data type. However, adjusting the boundaries and handling the DECIMAL types adds some complexity that we may not necessarily need at this stage.

Would it be feasible to postpone/defer the more complex CAST removal solution to a follow-up and use isLosslessCast for this first iteration? How much do we gain by the special handling of the DECIMAL types?

Contributor Author

The method RexUtil#isLosslessCast seems related (and at the beginning I had considered using it). However, it does not fit the requirements.

Take for example cast(intColumn as decimal(3,1)). isLosslessCast would return false, as an integer may be outside of the range of decimal(3,1). As Hive converts the illegal values to NULL, we can get the selectivity by checking the range -99.94999 to 99.94999. So for the purpose of this PR, the CAST is removable. I've included many test cases to cover the corner cases around decimals.

Also, as you mentioned, isLosslessCast does not take advantage of the statistics. If the statistics indicate that the actual values fall within the range of the type, we can remove the cast for the purpose of selectivity estimation, even if the cast is not lossless.

I assume that splitting the PR into two, probably simplifying the first PR, could lead to a more complicated follow-up PR.
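The decimal(3,1) boundary reasoning above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: for DECIMAL(p, s), the largest value that still rounds into the type is the maximum decimal plus half a scale step, and the -99.94999/99.94999 figures quoted above come from representing that boundary as a float.

```java
// Hypothetical sketch (not the PR's code): effective float boundary for values
// that survive CAST(col AS DECIMAL(p, s)) in Hive, where out-of-range values
// become NULL. For DECIMAL(3, 1) the boundary is ~99.95, which as a float is
// slightly below 99.95 -- the 99.94999 mentioned above.
public class DecimalCastBoundaries {

  /** Largest absolute value that still rounds into DECIMAL(precision, scale). */
  static float upperBoundary(int precision, int scale) {
    double maxDecimal = (Math.pow(10, precision) - 1) / Math.pow(10, scale); // 99.9 for DECIMAL(3,1)
    double halfStep = 0.5 / Math.pow(10, scale); // values below 99.95 round down to 99.9
    return (float) (maxDecimal + halfStep);
  }

  public static void main(String[] args) {
    // printed as double to expose the float rounding just below 99.95
    System.out.println((double) upperBoundary(3, 1));
  }
}
```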

Member

Here we have a trade-off between precision and complexity. More complex code gives us better precision, but it's longer to write, test, review, and maintain. Personally, I would be OK to use RexUtil#isLosslessCast and sacrifice some precision, opting for simpler code, especially since we don't have any real data points about the importance of handling narrowing casts and decimals. I assume that from your point of view precision is more important and thus you opted for the more complex solution. Since you are the one driving this, I will let you decide how you prefer that we move forward.

Contributor Author

As the code is already written and tested, I would keep it.


double min;
double max;
switch (type.toLowerCase()) {
Member

This class mostly uses Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?

In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.

Contributor Author

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

Member

OK to use the switch then, but let's base it on SqlTypeName.

If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.

Contributor Author

I've refactored the switch and verified that the getLimit call yields the same min/max values.

I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
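The refactored switch discussed above could look like the following sketch. The integer limits match what SqlTypeName#getLimit returns, while FLOAT/DOUBLE are handled manually because getLimit throws for them; plain strings stand in for Calcite's SqlTypeName here only to keep the example self-contained, so this is an illustration rather than the PR's actual code.

```java
// Hypothetical sketch of a SqlTypeName-based switch for type limits.
// The real code switches on org.apache.calcite.sql.type.SqlTypeName.
public class TypeLimits {

  /** Returns {min, max} for the given SQL type name. */
  static double[] limits(String sqlTypeName) {
    switch (sqlTypeName) {
      case "TINYINT":  return new double[] {Byte.MIN_VALUE, Byte.MAX_VALUE};
      case "SMALLINT": return new double[] {Short.MIN_VALUE, Short.MAX_VALUE};
      case "INTEGER":  return new double[] {Integer.MIN_VALUE, Integer.MAX_VALUE};
      case "BIGINT":   return new double[] {Long.MIN_VALUE, Long.MAX_VALUE};
      // SqlTypeName#getLimit throws for these, hence the manual limits
      case "FLOAT":    return new double[] {-Float.MAX_VALUE, Float.MAX_VALUE};
      case "DOUBLE":   return new double[] {-Double.MAX_VALUE, Double.MAX_VALUE};
      default: throw new UnsupportedOperationException(sqlTypeName);
    }
  }

  public static void main(String[] args) {
    double[] tinyint = limits("TINYINT");
    System.out.println(tinyint[0] + " .. " + tinyint[1]);
  }
}
```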

* See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], boolean[])}
* for an explanation of the parameters.
*/
private static void adjustBoundariesForDecimal(RexCall cast, float[] boundaries, boolean[] inclusive) {
Member

I have not reviewed this part yet, until we decide whether we are going to include it or not.

Member

When rangeBoundaries is (100.0..Infinity] and the type is DECIMAL(3, 1), this method creates the range/interval (100.049995..99.94999], which seems invalid since the lower endpoint is greater than the upper endpoint. More details about how I bumped into this are in my previous comment.

Contributor Author

That can indeed happen for some corner cases. It was handled by FilterSelectivityEstimator#rangedSelectivity(KllFloatsSketch, float, float), which checks the validity of the boundaries and otherwise returns 0. If we want to continue with Guava's Range, we would need to extract the creation of the Range objects into a method that checks for valid boundaries, and otherwise creates an empty Range, e.g., [0,0).

When I did the refactoring, I decided against using Guava's Range, as a simple record class was enough for the purpose of the PR. @zabetak, if you want to continue with Guava's Range, let me know, I can make the necessary changes based on your branch.
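The guard described above can be sketched as follows. This is a hypothetical stand-in for FilterSelectivityEstimator#rangedSelectivity: a sorted-through array replaces the KllFloatsSketch so the example is self-contained, and an inverted interval (lower > upper) simply yields selectivity 0 instead of an error.

```java
// Hypothetical sketch: treat inverted boundaries, e.g. (100.049995..99.94999],
// as an empty range with selectivity 0, mirroring the validity check described
// above. The real code queries a KllFloatsSketch instead of an array.
public class RangedSelectivity {

  static float rangedSelectivity(float[] values, float lower, float upper) {
    if (lower > upper) {
      return 0f; // inverted boundaries denote an empty range
    }
    int count = 0;
    for (float v : values) {
      if (v >= lower && v <= upper) {
        count++;
      }
    }
    return (float) count / values.length;
  }

  public static void main(String[] args) {
    float[] values = {1f, 2f, 3f, 4f};
    System.out.println(rangedSelectivity(values, 2f, 3f));          // half of the values: 0.5
    System.out.println(rangedSelectivity(values, 100.05f, 99.95f)); // inverted -> 0.0
  }
}
```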

Comment on lines 208 to 210
* @param boundaries indexes 0 and 1 are the boundaries of the range predicate;
* indexes 2 and 3, if they exist, will be set to the boundaries of the type range
* @param inclusive whether the respective boundary is inclusive or exclusive.
Member

If we decide to proceed with this implementation then it may be cleaner and more readable to have a dedicated private static class Boundaries instead of passing around arrays and trying to decipher what the indexes mean.

Contributor Author

Indeed, I'll refactor the boundaries.

Contributor Author

I've refactored the arrays to an immutable class FloatInterval. Could you have a look, please?

Member

It seems that FloatInterval is equivalent to Guava's Range API, so it would be beneficial if we could reuse existing and more widely used APIs. While attempting to replace it in ef8dc6c I bumped into some errors, as explained in my previous comment.

Contributor Author

@thomasrebele thomasrebele left a comment

Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements out of scope for this PR, though.

About the first example, IS NOT NULL(CAST($1):BIGINT): CALCITE-5769 improved RexSimplify to remove the cast from the expression. I assume that the filters that arrive at FilterSelectivityEstimator should already have the cast removed, if it is superfluous. Otherwise, it could be converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.
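The conversion idea above can be made concrete with a purely illustrative sketch. Assuming, as in the decimal discussion earlier in this thread, that out-of-range CAST results become NULL, IS NOT NULL(CAST($1):TINYINT) keeps exactly the rows whose value fits into TINYINT, i.e. it behaves like $1 BETWEEN -128 AND 127 for selectivity purposes. The class and method names below are hypothetical.

```java
// Purely illustrative: under the NULL-on-overflow assumption,
// IS NOT NULL(CAST(col AS TINYINT)) matches the same rows as the
// range predicate col BETWEEN -128 AND 127.
public class NotNullCastAsRange {

  static boolean survivesTinyintCast(long v) {
    return v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE;
  }

  public static void main(String[] args) {
    long[] column = {-200, -128, 0, 127, 300};
    long kept = 0;
    for (long v : column) {
      if (survivesTinyintCast(v)) {
        kept++;
      }
    }
    System.out.println("selectivity = " + kept + "/" + column.length); // 3 of 5 rows survive
  }
}
```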

@thomasrebele
Contributor Author

The CI fails because of "No thread with name metastore_task_thread_test_impl_3 found." in TestMetastoreLeaseLeader.testHouseKeepingThreads. I do not think the failure is related to this PR.

@zabetak
Member

zabetak commented Feb 23, 2026

Hey @thomasrebele , I was going over the PR and did some refactoring to help me better understand some parts of the code and hopefully improve readability a bit. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch.

However, after replacing the FloatInterval with Guava's Range API in commit ef8dc6c, some tests in TestFilterSelectivityEstimator started failing because it appears that some ranges are invalid. Specifically, adjustTypeBoundaries creates a strange/invalid range (i.e., (100.049995..99.94999]) when rangeBoundaries is (100.0..Infinity] and the type is DECIMAL(3, 1); it is strange to have a range/interval with a lower bound (100.049995) greater than the upper bound (99.94999), so I wanted to check with you whether that behavior is expected/intentional.

FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
String between = "BETWEEN " + lower + " AND " + upper;
float expectedSelectivity = expectedEntries / total;
String message = between + ": calcite filter " + betweenFilter.toString();
Member

nit: When the tests fail the message looks like the following:

java.lang.AssertionError: NOT BETWEEN 100.0 AND 0.0: calcite filter BETWEEN(true, CAST($0):DECIMAL(2, 1) NOT NULL, 1E2:FLOAT, 0E0:FLOAT) 

We are printing the expression twice with different formatting, which is a bit confusing. I think it is sufficient to keep just the Calcite toString serialization. This will also make things more uniform with checkSelectivity, which does this already.

Comment on lines +87 to +91
private static final SqlBinaryOperator GT = SqlStdOperatorTable.GREATER_THAN;
private static final SqlBinaryOperator GE = SqlStdOperatorTable.GREATER_THAN_OR_EQUAL;
private static final SqlBinaryOperator LT = SqlStdOperatorTable.LESS_THAN;
private static final SqlBinaryOperator LE = SqlStdOperatorTable.LESS_THAN_OR_EQUAL;
private static final SqlOperator BETWEEN = HiveBetween.INSTANCE;
Member

nit: The operators are used only in one place, so there is no need to keep static aliases here.

Comment on lines +816 to +821
// invert the filter to a NOT BETWEEN
RexNode invBetween =
REX_BUILDER.makeCall(HiveBetween.INSTANCE, boolTrue, value, literalFloat(lower), literalFloat(upper));
String invMessage = "NOT " + between + ": calcite filter " + invBetween.toString();
float invExpectedSelectivity = (universe - expectedEntries) / total;
Assert.assertEquals(invMessage, invExpectedSelectivity, estimator.estimateSelectivity(invBetween), DELTA);
Member

After HIVE-27102, NOT BETWEEN should not appear during planning, so I am not sure if we should keep adding code & tests for this expression. I guess we can keep them for now, but in the future we may opt to remove all code and logic around this.

Comment on lines +730 to +731
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
Member

Why is checkTimeFieldOnIntraDayTimestamps not relevant here?

return defaultSelectivity;
}

if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) {
Member

Not sure if we handle the isInputRefLeft != isInputRefRight case in the new code.

int literalOpIdx = leftLiteral.isPresent() ? 0 : 1;

// analyze the predicate
float value = leftLiteral.orElseGet(rightLiteral::get);
Member

Is there anything preventing both leftLiteral and rightLiteral from being null? It seems that the previous version of the code had some logic for this case.

Comment on lines +431 to +434
// when they are equal it's an equality predicate, we cannot handle it as "BETWEEN"
if (Objects.equals(leftValue, rightValue)) {
return inverseBool ? computeNotEqualitySelectivity(call) : computeFunctionSelectivity(call);
}
Member

This is a simplification that happens already, or should happen, elsewhere. Given that the previous version had this code as well, we can consider the removal in a follow-up.

}

@Test
public void testComputeRangePredicateSelectivityBetweenWithCastDecimal2_1() {
Member

The test name for this and the following methods testing BETWEEN is a bit strange. Since we are testing the BETWEEN expression I would assume that we test org.apache.hadoop.hive.ql.optimizer.calcite.stats.FilterSelectivityEstimator#computeBetweenPredicateSelectivity and not org.apache.hadoop.hive.ql.optimizer.calcite.stats.FilterSelectivityEstimator#computeRangePredicateSelectivity.

How about one of the following alternatives:

  • testComputeBetweenPredicateSelectivityWithCastDecimal2_1
  • testEstimateSelectivityBetweenWithCastDecimal2_1

Comment on lines +292 to +293
private static void adjustBoundariesForDecimal(RexCall cast, MutableObject<FloatInterval> rangeBoundaries,
MutableObject<FloatInterval> typeBoundaries) {
Member

I wanted to check if it is possible to simplify this method such that we:

  • avoid the MutableObject
  • make the mutation explicit by returning an object
  • separate/decouple handling of range and type boundaries

The signature would become like:

private static FloatInterval adjustBoundariesForDecimal(RexCall cast, FloatInterval boundaries);

I may give it a bit more thought once we finalize the remaining points.
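The explicit-return style proposed above can be sketched as follows. The half-step widening is only illustrative of a DECIMAL(p, s) rounding adjustment, not the PR's exact arithmetic; the point is the signature: the method takes an interval and returns a new one instead of mutating MutableObject parameters.

```java
// Hypothetical sketch of the decoupled, mutation-free signature:
// FloatInterval adjustBoundariesForDecimal(FloatInterval boundaries, ...).
public class BoundaryAdjustment {

  record FloatInterval(float lower, float upper) {}

  static FloatInterval adjustBoundariesForDecimal(FloatInterval boundaries, int scale) {
    // illustrative only: widen each boundary by half a scale step
    float halfStep = (float) (0.5 / Math.pow(10, scale));
    // returning a new interval keeps the caller free of hidden mutation
    return new FloatInterval(boundaries.lower() - halfStep, boundaries.upper() + halfStep);
  }

  public static void main(String[] args) {
    FloatInterval adjusted = adjustBoundariesForDecimal(new FloatInterval(0f, 99.9f), 1);
    System.out.println(adjusted.lower() + " .. " + adjusted.upper());
  }
}
```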
