HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST #6293

Open
thomasrebele wants to merge 1 commit into apache:master from thomasrebele:tr/HIVE-29424

Conversation

@thomasrebele
Contributor

See HIVE-29424.

What changes were proposed in this pull request?

This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover some corner cases. To establish the ground truth, I executed queries with these predicates; see the resulting q.out file.

Why are the changes needed?

This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added.

@thomasrebele thomasrebele marked this pull request as ready for review February 4, 2026 08:53
Member

@zabetak zabetak left a comment

Thanks for the PR @thomasrebele , the proposal is very promising.

One general question that came to mind while reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions as well. For example, is there any benefit in attempting to remove a CAST from the following expressions:

  • IS NOT NULL(CAST($1):BIGINT)
  • =(CAST($1):DOUBLE, 1)
  • IN(CAST($1):TINYINT, 10, 20, 30)

Comment on lines +736 to +756
// swap equation, e.g., col < 5 becomes 5 > col; selectivity stays the same
RexCall call = (RexCall) filter;
SqlOperator operator = ((RexCall) filter).getOperator();
SqlOperator swappedOp;
if (operator == LE) {
  swappedOp = GE;
} else if (operator == LT) {
  swappedOp = GT;
} else if (operator == GE) {
  swappedOp = LE;
} else if (operator == GT) {
  swappedOp = LT;
} else if (operator == BETWEEN) {
  // BETWEEN cannot be swapped
  return;
} else {
  throw new UnsupportedOperationException();
}
RexNode swapped = REX_BUILDER.makeCall(swappedOp, call.getOperands().get(1), call.getOperands().get(0));
Assert.assertEquals(filter.toString(), expectedSelectivity, estimator.estimateSelectivity(swapped), DELTA);
}
Member

What's the point of swapping if we are already explicitly testing the inverse operation in the test itself? I think it's better to keep the tests explicit and drop this swapping logic.

Contributor Author

The test cases are all of the form col OP value. The swapping tests expressions of the form value OP col, which is not tested explicitly. For a few lines of code we get much better coverage.

Member

Sounds good. In this case I think you can use org.apache.calcite.rex.RexUtil#invert, which does the same thing.

* @param inclusive whether the respective boundary is inclusive or exclusive.
* @return the operand if the cast can be removed, otherwise the cast itself
*/
private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, float[] boundaries, boolean[] inclusive) {
Member

The logic in this method is similar to org.apache.calcite.rex.RexUtil#isLosslessCast(org.apache.calcite.rex.RexNode). Since the method here has access to actual ranges and stats, it may be more effective for CASTs that narrow the data type. However, adjusting the boundaries and handling the DECIMAL types adds some complexity that we may not necessarily need at this stage.

Would it be feasible to postpone/defer the more complex CAST removal solution to a follow-up and use isLosslessCast for this first iteration? How much do we gain by the special handling of the DECIMAL types?

Contributor Author

The method RexUtil#isLosslessCast seems related (and at the beginning I had considered using it). However, it does not fit the requirements.

Take for example cast(intColumn as decimal(3,1)). isLosslessCast would return false, as an integer may be outside of the range of decimal(3,1). As Hive converts the illegal values to NULL, we can get the selectivity by checking the range -99.94999 to 99.94999. So for the purpose of this PR, the CAST is removable. I've included many test cases to cover the corner cases around decimals.

Also, as you mentioned, isLosslessCast does not take advantage of the statistics. If the statistics indicate that the actual values fall within the range of the type, we can remove the cast for the purpose of selectivity estimation, even if the cast is not lossless.

I assume that splitting the PR into two, probably simplifying the first PR, could lead to a more complicated follow-up PR.
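The decimal(3,1) boundary reasoning above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: for DECIMAL(p, s), the largest value that still rounds into the type is the maximum decimal plus half a scale step, and the -99.94999/99.94999 figures quoted above come from representing that boundary as a float.

```java
// Hypothetical sketch (not the PR's code): effective float boundary for values
// that survive CAST(col AS DECIMAL(p, s)) in Hive, where out-of-range values
// become NULL. For DECIMAL(3, 1) the boundary is ~99.95, which as a float is
// slightly below 99.95 -- the 99.94999 mentioned above.
public class DecimalCastBoundaries {

  /** Largest absolute value that still rounds into DECIMAL(precision, scale). */
  static float upperBoundary(int precision, int scale) {
    double maxDecimal = (Math.pow(10, precision) - 1) / Math.pow(10, scale); // 99.9 for DECIMAL(3,1)
    double halfStep = 0.5 / Math.pow(10, scale); // values below 99.95 round down to 99.9
    return (float) (maxDecimal + halfStep);
  }

  public static void main(String[] args) {
    // printed as double to expose the float rounding just below 99.95
    System.out.println((double) upperBoundary(3, 1));
  }
}
```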

Member

Here we have a trade-off between precision and complexity. More complex code gives us better precision, but it's longer to write, test, review, and maintain. Personally, I would be OK to use RexUtil#isLosslessCast and sacrifice some precision, opting for simpler code, especially since we don't have any real data points about the importance of handling narrowing casts and decimals. I assume that from your point of view precision is more important and thus you opted for the more complex solution. Since you are the one driving this, I will let you decide how you prefer that we move forward.

Contributor Author

As the code is already written and tested, I would keep it.


double min;
double max;
switch (type.toLowerCase()) {
Member

This class mostly uses Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?

In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.

Contributor Author

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

Member

OK to use the switch then, but let's base it on SqlTypeName.

If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.

Contributor Author

I've refactored the switch and verified that the getLimit call yields the same min/max values.

I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
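The refactored switch discussed above could look like the following sketch. The integer limits match what SqlTypeName#getLimit returns, while FLOAT/DOUBLE are handled manually because getLimit throws for them; plain strings stand in for Calcite's SqlTypeName here only to keep the example self-contained, so this is an illustration rather than the PR's actual code.

```java
// Hypothetical sketch of a SqlTypeName-based switch for type limits.
// The real code switches on org.apache.calcite.sql.type.SqlTypeName.
public class TypeLimits {

  /** Returns {min, max} for the given SQL type name. */
  static double[] limits(String sqlTypeName) {
    switch (sqlTypeName) {
      case "TINYINT":  return new double[] {Byte.MIN_VALUE, Byte.MAX_VALUE};
      case "SMALLINT": return new double[] {Short.MIN_VALUE, Short.MAX_VALUE};
      case "INTEGER":  return new double[] {Integer.MIN_VALUE, Integer.MAX_VALUE};
      case "BIGINT":   return new double[] {Long.MIN_VALUE, Long.MAX_VALUE};
      // SqlTypeName#getLimit throws for these, hence the manual limits
      case "FLOAT":    return new double[] {-Float.MAX_VALUE, Float.MAX_VALUE};
      case "DOUBLE":   return new double[] {-Double.MAX_VALUE, Double.MAX_VALUE};
      default: throw new UnsupportedOperationException(sqlTypeName);
    }
  }

  public static void main(String[] args) {
    double[] tinyint = limits("TINYINT");
    System.out.println(tinyint[0] + " .. " + tinyint[1]);
  }
}
```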

* See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], boolean[])}
* for an explanation of the parameters.
*/
private static void adjustBoundariesForDecimal(RexCall cast, float[] boundaries, boolean[] inclusive) {
Member

I have not reviewed this part yet, until we decide whether we are going to include it or not.

Member

When rangeBoundaries is (100.0..Infinity] and the type is DECIMAL(3, 1), this method creates the range/interval (100.049995..99.94999], which seems invalid since the lower endpoint is greater than the upper endpoint. More details about how I bumped into this are in my previous comment.

Contributor Author

That can indeed happen for some corner cases. It was handled by FilterSelectivityEstimator#rangedSelectivity(KllFloatsSketch, float, float), which checks the validity of the boundaries and otherwise returns 0. If we want to continue with Guava's Range, we would need to extract the creation of the Range objects into a method that checks for valid boundaries, and otherwise creates an empty Range, e.g., [0,0).

When I did the refactoring, I decided against using Guava's Range, as a simple record class was enough for the purpose of the PR. @zabetak, if you want to continue with Guava's Range, let me know, I can make the necessary changes based on your branch.
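The guard described above can be sketched as follows. This is a hypothetical stand-in for FilterSelectivityEstimator#rangedSelectivity: a sorted-through array replaces the KllFloatsSketch so the example is self-contained, and an inverted interval (lower > upper) simply yields selectivity 0 instead of an error.

```java
// Hypothetical sketch: treat inverted boundaries, e.g. (100.049995..99.94999],
// as an empty range with selectivity 0, mirroring the validity check described
// above. The real code queries a KllFloatsSketch instead of an array.
public class RangedSelectivity {

  static float rangedSelectivity(float[] values, float lower, float upper) {
    if (lower > upper) {
      return 0f; // inverted boundaries denote an empty range
    }
    int count = 0;
    for (float v : values) {
      if (v >= lower && v <= upper) {
        count++;
      }
    }
    return (float) count / values.length;
  }

  public static void main(String[] args) {
    float[] values = {1f, 2f, 3f, 4f};
    System.out.println(rangedSelectivity(values, 2f, 3f));          // half of the values: 0.5
    System.out.println(rangedSelectivity(values, 100.05f, 99.95f)); // inverted -> 0.0
  }
}
```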

Comment on lines 208 to 210
* @param boundaries indexes 0 and 1 are the boundaries of the range predicate;
* indexes 2 and 3, if they exist, will be set to the boundaries of the type range
* @param inclusive whether the respective boundary is inclusive or exclusive.
Member

If we decide to proceed with this implementation then it may be cleaner and more readable to have a dedicated private static class Boundaries instead of passing around arrays and trying to decipher what the indexes mean.

Contributor Author

Indeed, I'll refactor the boundaries.

Contributor Author

I've refactored the arrays to an immutable class FloatInterval. Could you have a look, please?

Member

It seems that FloatInterval is equivalent to Guava's Range API, so it would be beneficial if we could reuse existing and more widely used APIs. While attempting to replace it in ef8dc6c I bumped into some errors, as explained in my previous comment.

Contributor Author

@thomasrebele thomasrebele left a comment

Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements out of scope for this PR, though.

About the first example, IS NOT NULL(CAST($1):BIGINT): CALCITE-5769 improved RexSimplify to remove the cast from the expression. I assume that the filters that arrive at FilterSelectivityEstimator should already have the cast removed, if it is superfluous. Otherwise, it could be converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.
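The conversion idea above can be made concrete with a purely illustrative sketch. Assuming, as in the decimal discussion earlier in this thread, that out-of-range CAST results become NULL, IS NOT NULL(CAST($1):TINYINT) keeps exactly the rows whose value fits into TINYINT, i.e. it behaves like $1 BETWEEN -128 AND 127 for selectivity purposes. The class and method names below are hypothetical.

```java
// Purely illustrative: under the NULL-on-overflow assumption,
// IS NOT NULL(CAST(col AS TINYINT)) matches the same rows as the
// range predicate col BETWEEN -128 AND 127.
public class NotNullCastAsRange {

  static boolean survivesTinyintCast(long v) {
    return v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE;
  }

  public static void main(String[] args) {
    long[] column = {-200, -128, 0, 127, 300};
    long kept = 0;
    for (long v : column) {
      if (survivesTinyintCast(v)) {
        kept++;
      }
    }
    System.out.println("selectivity = " + kept + "/" + column.length); // 3 of 5 rows survive
  }
}
```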

@thomasrebele
Contributor Author

The CI fails because of "No thread with name metastore_task_thread_test_impl_3 found." in TestMetastoreLeaseLeader.testHouseKeepingThreads. I do not think the failure is related to this PR.

@zabetak
Member

zabetak commented Feb 23, 2026

Hey @thomasrebele , I was going over the PR and did some refactoring to help me better understand some parts of the code and hopefully improve readability a bit. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch.

However, after replacing the FloatInterval with Guava's Range API in commit ef8dc6c, some tests in TestFilterSelectivityEstimator started failing because it appears that some ranges are invalid. Specifically, adjustTypeBoundaries creates a strange/invalid range (i.e., (100.049995..99.94999]) when rangeBoundaries is (100.0..Infinity] and the type is DECIMAL(3, 1); it is strange to have a range/interval with a lower bound (100.049995) greater than the upper bound (99.94999), so I wanted to check with you whether that behavior is expected/intentional.

FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
String between = "BETWEEN " + lower + " AND " + upper;
float expectedSelectivity = expectedEntries / total;
String message = between + ": calcite filter " + betweenFilter.toString();
Member

nit: When the tests fail the message looks like the following:

java.lang.AssertionError: NOT BETWEEN 100.0 AND 0.0: calcite filter BETWEEN(true, CAST($0):DECIMAL(2, 1) NOT NULL, 1E2:FLOAT, 0E0:FLOAT) 

We are printing the expression twice with different formatting, which is a bit confusing. I think it is sufficient to keep just the Calcite toString serialization. This will also make things more uniform with checkSelectivity, which does this already.

Comment on lines +87 to +91
private static final SqlBinaryOperator GT = SqlStdOperatorTable.GREATER_THAN;
private static final SqlBinaryOperator GE = SqlStdOperatorTable.GREATER_THAN_OR_EQUAL;
private static final SqlBinaryOperator LT = SqlStdOperatorTable.LESS_THAN;
private static final SqlBinaryOperator LE = SqlStdOperatorTable.LESS_THAN_OR_EQUAL;
private static final SqlOperator BETWEEN = HiveBetween.INSTANCE;
Member

nit: The operators are used only in one place, so there is no need to keep static aliases here.

Comment on lines +816 to +821
// invert the filter to a NOT BETWEEN
RexNode invBetween =
REX_BUILDER.makeCall(HiveBetween.INSTANCE, boolTrue, value, literalFloat(lower), literalFloat(upper));
String invMessage = "NOT " + between + ": calcite filter " + invBetween.toString();
float invExpectedSelectivity = (universe - expectedEntries) / total;
Assert.assertEquals(invMessage, invExpectedSelectivity, estimator.estimateSelectivity(invBetween), DELTA);
Member

After HIVE-27102, NOT BETWEEN should not appear during planning, so I am not sure if we should keep adding code & tests for this expression. I guess we can keep them for now, but in the future we may opt to remove all code and logic around this.

Comment on lines +730 to +731
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
Member

Why is checkTimeFieldOnIntraDayTimestamps not relevant here?

return defaultSelectivity;
}

if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) {
Member

Not sure if we handle the isInputRefLeft != isInputRefRight case in the new code.

int literalOpIdx = leftLiteral.isPresent() ? 0 : 1;

// analyze the predicate
float value = leftLiteral.orElseGet(rightLiteral::get);
Member

Is there anything preventing both leftLiteral and rightLiteral from being null? It seems that the previous version of the code had some logic for this case.

Comment on lines +431 to +434
// when they are equal it's an equality predicate, we cannot handle it as "BETWEEN"
if (Objects.equals(leftValue, rightValue)) {
return inverseBool ? computeNotEqualitySelectivity(call) : computeFunctionSelectivity(call);
}
Member

This is a simplification that happens already, or should happen, elsewhere. Given that the previous version had this code as well, we can consider the removal in a follow-up.

}

@Test
public void testComputeRangePredicateSelectivityBetweenWithCastDecimal2_1() {
Member

The test name for this and the following methods testing BETWEEN is a bit strange. Since we are testing the BETWEEN expression I would assume that we test org.apache.hadoop.hive.ql.optimizer.calcite.stats.FilterSelectivityEstimator#computeBetweenPredicateSelectivity and not org.apache.hadoop.hive.ql.optimizer.calcite.stats.FilterSelectivityEstimator#computeRangePredicateSelectivity.

How about one of the following alternatives:

  • testComputeBetweenPredicateSelectivityWithCastDecimal2_1
  • testEstimateSelectivityBetweenWithCastDecimal2_1

Comment on lines +292 to +293
private static void adjustBoundariesForDecimal(RexCall cast, MutableObject<FloatInterval> rangeBoundaries,
MutableObject<FloatInterval> typeBoundaries) {
Member

I wanted to check if it is possible to simplify this method such that we:

  • avoid the MutableObject
  • make the mutation explicit by returning an object
  • separate/decouple handling of range and type boundaries

The signature would become like:

private static FloatInterval adjustBoundariesForDecimal(RexCall cast, FloatInterval boundaries);

I may give it a bit more thought once we finalize the remaining points.
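The explicit-return style proposed above can be sketched as follows. The half-step widening is only illustrative of a DECIMAL(p, s) rounding adjustment, not the PR's exact arithmetic; the point is the signature: the method takes an interval and returns a new one instead of mutating MutableObject parameters.

```java
// Hypothetical sketch of the decoupled, mutation-free signature:
// FloatInterval adjustBoundariesForDecimal(FloatInterval boundaries, ...).
public class BoundaryAdjustment {

  record FloatInterval(float lower, float upper) {}

  static FloatInterval adjustBoundariesForDecimal(FloatInterval boundaries, int scale) {
    // illustrative only: widen each boundary by half a scale step
    float halfStep = (float) (0.5 / Math.pow(10, scale));
    // returning a new interval keeps the caller free of hidden mutation
    return new FloatInterval(boundaries.lower() - halfStep, boundaries.upper() + halfStep);
  }

  public static void main(String[] args) {
    FloatInterval adjusted = adjustBoundariesForDecimal(new FloatInterval(0f, 99.9f), 1);
    System.out.println(adjusted.lower() + " .. " + adjusted.upper());
  }
}
```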
