HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST (#6293)
thomasrebele wants to merge 1 commit into apache:master
Conversation
zabetak left a comment:
Thanks for the PR @thomasrebele, the proposal is very promising.
One general question that came to mind while I was reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions. For example, is there any benefit in attempting to remove a CAST from the following expressions:
- `IS NOT NULL(CAST($1):BIGINT)`
- `=(CAST($1):DOUBLE, 1)`
- `IN(CAST($1):TINYINT, 10, 20, 30)`
```java
// swap equation, e.g., col < 5 becomes 5 > col; selectivity stays the same
RexCall call = (RexCall) filter;
SqlOperator operator = ((RexCall) filter).getOperator();
SqlOperator swappedOp;
if (operator == LE) {
  swappedOp = GE;
} else if (operator == LT) {
  swappedOp = GT;
} else if (operator == GE) {
  swappedOp = LE;
} else if (operator == GT) {
  swappedOp = LT;
} else if (operator == BETWEEN) {
  // BETWEEN cannot be swapped
  return;
} else {
  throw new UnsupportedOperationException();
}
RexNode swapped = REX_BUILDER.makeCall(swappedOp, call.getOperands().get(1), call.getOperands().get(0));
Assert.assertEquals(filter.toString(), expectedSelectivity, estimator.estimateSelectivity(swapped), DELTA);
```
What's the point of swapping if we are already testing explicitly the inverse operation in the test itself? I think it's better to keep the tests explicit and drop this swapping logic.

The test cases are all of the form col OP value. The swapping tests expressions of the form value OP col, which is not tested explicitly. For a few lines of code we get much better coverage.

Sounds good; in this case I think you can use org.apache.calcite.rex.RexUtil#invert, which does the same thing.
```java
 * @param inclusive whether the respective boundary is inclusive or exclusive.
 * @return the operand if the cast can be removed, otherwise the cast itself
 */
private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, float[] boundaries, boolean[] inclusive) {
```
The logic in this method is similar to org.apache.calcite.rex.RexUtil#isLosslessCast(org.apache.calcite.rex.RexNode). Since the method here has access to actual ranges and stats, it may be more effective for CASTs that narrow the data type. However, adjusting the boundaries and handling the DECIMAL types adds some complexity that we may not necessarily need at this stage.
Would it be feasible to defer the more complex CAST removal solution to a follow-up and use isLosslessCast for this first iteration? How much do we gain by the special handling of the DECIMAL types?
The method RexUtil#isLosslessCast seems related (at the beginning I had considered using it). However, it does not fit the requirements.
Take for example cast(intColumn as decimal(3,1)). isLosslessCast would return false, as an integer may be outside the range of decimal(3,1). As Hive converts the illegal values to NULL, we can get the selectivity by checking the range -99.94999 to 99.94999. So for the purpose of this PR, the CAST is removable. I've included many test cases to cover the corner cases around decimals.
Also, as you mentioned, isLosslessCast does not take advantage of the statistics. If the statistics indicate that the actual values fall within the range of the type, we can remove the cast for the purpose of selectivity estimation, even if the cast is not lossless.
I assume that splitting the PR into two, while probably simplifying the first PR, could lead to a more complicated follow-up PR.
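The DECIMAL(3,1) boundary reasoning above can be sketched as follows. This is a hypothetical helper, not the PR's actual code: it only illustrates how the -99.94999..99.94999 range arises from the decimal's maximum representable value plus Hive's rounding on the cast.

```java
// Hypothetical helper (not the PR's code): compute the float boundaries of the
// range check that can replace a CAST to DECIMAL(p, s). Values outside this
// range overflow the decimal and become NULL in Hive.
class DecimalCastBounds {
  static float[] bounds(int precision, int scale) {
    // Largest value representable by DECIMAL(p, s), e.g. 99.9 for DECIMAL(3, 1).
    double maxRepresentable = (Math.pow(10, precision) - 1) / Math.pow(10, scale);
    // Hive rounds on the cast, so anything below maxRepresentable + 0.5 * 10^-s
    // still fits; nextDown picks the nearest float strictly inside that bound
    // (99.94999... for DECIMAL(3, 1), matching the range quoted above).
    float upper = Math.nextDown((float) (maxRepresentable + 0.5 * Math.pow(10, -scale)));
    return new float[] { -upper, upper };
  }
}
```

For DECIMAL(3,1) this yields a symmetric interval just below ±99.95, since any source value strictly below 99.95 rounds to at most 99.9.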
Here we have a trade-off between precision and complexity. More complex code gives us better precision, but it takes longer to write, test, review, and maintain. Personally, I would be OK with using RexUtil#isLosslessCast and sacrificing some precision, opting for simpler code, especially since we don't have any real data points about the importance of handling narrowing casts and decimals. I assume that from your point of view precision is more important, and thus you opted for the more complex solution. Since you are the one driving this, I will let you decide how you prefer that we move forward.

As the code is already written and tested, I would keep it.
```java
double min;
double max;
switch (type.toLowerCase()) {
```
This class is mostly using Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?
In addition, there is org.apache.calcite.sql.type.SqlTypeName#getLimit, which might be relevant and could potentially replace this switch statement.
We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.
Ok to use the switch then but let's base it on SqlTypeName.
If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.
I've refactored the switch and verified that getLimit returns the same min/max values.
I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
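The refactored switch could look roughly like this. It is a sketch only: a local enum stands in for Calcite's SqlTypeName so the shape is visible in isolation, and the integer cases show the values that SqlTypeName#getLimit would supply in the real code.

```java
// Sketch under assumptions: SqlTypeNameStub stands in for Calcite's
// SqlTypeName. The integer cases could be served by SqlTypeName#getLimit,
// while FLOAT/DOUBLE keep explicit constants because getLimit throws for
// the approximate numeric types (hence CALCITE-7419).
class TypeRanges {
  enum SqlTypeNameStub { TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE }

  static double[] range(SqlTypeNameStub type) {
    final double min;
    final double max;
    switch (type) {
      case TINYINT:  min = Byte.MIN_VALUE;    max = Byte.MAX_VALUE;    break;
      case SMALLINT: min = Short.MIN_VALUE;   max = Short.MAX_VALUE;   break;
      case INTEGER:  min = Integer.MIN_VALUE; max = Integer.MAX_VALUE; break;
      case BIGINT:   min = Long.MIN_VALUE;    max = Long.MAX_VALUE;    break;
      // Hand-written, since getLimit does not cover these types.
      case FLOAT:    min = -Float.MAX_VALUE;  max = Float.MAX_VALUE;   break;
      case DOUBLE:   min = -Double.MAX_VALUE; max = Double.MAX_VALUE;  break;
      default: throw new UnsupportedOperationException(type.toString());
    }
    return new double[] { min, max };
  }
}
```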
```java
 * See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], boolean[])}
 * for an explanation of the parameters.
 */
private static void adjustBoundariesForDecimal(RexCall cast, float[] boundaries, boolean[] inclusive) {
```
I have not reviewed this part yet, till we decide whether we are going to include it or not.
When rangeBoundaries is (100.0..Infinity] and the type is DECIMAL(3, 1), this method creates the range/interval (100.049995..99.94999], which seems invalid since the lower endpoint is greater than the upper endpoint. More details about how I bumped into this are in my previous comment.
That can indeed happen for some corner cases. It was handled by FilterSelectivityEstimator#rangedSelectivity(KllFloatsSketch, float, float), which checks for the validity of the boundaries, and otherwise returns 0. If we want to continue with Guava's Range, we would need to extract the creation of the Range objects into a method that checks for valid boundaries, and otherwise creates an empty Range, e.g., [0,0).
When I did the refactoring, I decided against using Guava's Range, as a simple record class was enough for the purpose of the PR. @zabetak, if you want to continue with Guava's Range, let me know, I can make the necessary changes based on your branch.
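The validity check described above can be sketched as a minimal immutable record. The names are assumed for illustration and this is not the PR's actual FloatInterval; it only shows how an interval can report itself as empty so that estimation returns 0 instead of working with an invalid range.

```java
// Minimal sketch (names assumed): an immutable interval that knows when it is
// empty, so that a corner case like (100.049995..99.94999] yields selectivity
// 0 instead of an invalid range.
record FloatInterval(float lower, boolean lowerInclusive, float upper, boolean upperInclusive) {
  boolean isEmpty() {
    // Empty when lower > upper, or a degenerate point with an open endpoint.
    return lower > upper || (lower == upper && !(lowerInclusive && upperInclusive));
  }
}
```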
```java
 * @param boundaries indexes 0 and 1 are the boundaries of the range predicate;
 *                   indexes 2 and 3, if they exist, will be set to the boundaries of the type range
 * @param inclusive whether the respective boundary is inclusive or exclusive.
```
If we decide to proceed with this implementation then it may be cleaner and more readable to have a dedicated private static class Boundaries instead of passing around arrays and trying to decipher what the indexes mean.
Indeed, I'll refactor the boundaries.
I've refactored the arrays to an immutable class FloatInterval. Could you have a look, please?
It seems that FloatInterval is equivalent to Guava's Range API so it would be beneficial if we can reuse existing and more widely used APIs. While attempting to replace it in ef8dc6c I bumped into some errors as explained in my previous comment.
thomasrebele left a comment:
Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out of scope for this PR, though.
About the first example, IS NOT NULL(CAST($1):BIGINT): CALCITE-5769 improved RexSimplify to remove the cast from such an expression. I assume that the filters that arrive at FilterSelectivityEstimator should already have the cast removed if it is superfluous. Otherwise, it could be converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.
…cates with a CAST
Force-pushed from f80c231 to 1e9fd2b.
The CI fails because of
Hey @thomasrebele, I was going over the PR and did some refactoring to help me understand some parts of the code better, and hopefully improve readability a bit. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch. However, after replacing the
```java
FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
String between = "BETWEEN " + lower + " AND " + upper;
float expectedSelectivity = expectedEntries / total;
String message = between + ": calcite filter " + betweenFilter.toString();
```
nit: When the tests fail, the message looks like the following:

```
java.lang.AssertionError: NOT BETWEEN 100.0 AND 0.0: calcite filter BETWEEN(true, CAST($0):DECIMAL(2, 1) NOT NULL, 1E2:FLOAT, 0E0:FLOAT)
```

We are printing the expression twice with different formatting, which is a bit confusing. I think it is sufficient to keep just the Calcite toString serialization. This will also make things more uniform with checkSelectivity, which does this already.
```java
private static final SqlBinaryOperator GT = SqlStdOperatorTable.GREATER_THAN;
private static final SqlBinaryOperator GE = SqlStdOperatorTable.GREATER_THAN_OR_EQUAL;
private static final SqlBinaryOperator LT = SqlStdOperatorTable.LESS_THAN;
private static final SqlBinaryOperator LE = SqlStdOperatorTable.LESS_THAN_OR_EQUAL;
private static final SqlOperator BETWEEN = HiveBetween.INSTANCE;
```
nit: The operators are used only in one place so no need to keep static aliases here.
```java
// invert the filter to a NOT BETWEEN
RexNode invBetween =
    REX_BUILDER.makeCall(HiveBetween.INSTANCE, boolTrue, value, literalFloat(lower), literalFloat(upper));
String invMessage = "NOT " + between + ": calcite filter " + invBetween.toString();
float invExpectedSelectivity = (universe - expectedEntries) / total;
Assert.assertEquals(invMessage, invExpectedSelectivity, estimator.estimateSelectivity(invBetween), DELTA);
```
After HIVE-27102, NOT BETWEEN should not appear during planning, so I am not sure if we should keep adding code & tests for this expression. I guess we can keep them for now, but in the future we may opt to remove all code and logic around this.
```java
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
```
Why is checkTimeFieldOnIntraDayTimestamps not relevant here?
```java
  return defaultSelectivity;
}

if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) {
```
Not sure if we handle the isInputRefLeft != isInputRefRight case in the new code.
```java
int literalOpIdx = leftLiteral.isPresent() ? 0 : 1;

// analyze the predicate
float value = leftLiteral.orElseGet(rightLiteral::get);
```
Is there anything preventing both leftLiteral and rightLiteral from being null? It seems that the previous version of the code had some logic for this case.
```java
// when they are equal it's an equality predicate, we cannot handle it as "BETWEEN"
if (Objects.equals(leftValue, rightValue)) {
  return inverseBool ? computeNotEqualitySelectivity(call) : computeFunctionSelectivity(call);
}
```
This is a simplification that already happens, or should happen, elsewhere. Given that the previous version had this code as well, we can consider the removal in a follow-up.
```java
}

@Test
public void testComputeRangePredicateSelectivityBetweenWithCastDecimal2_1() {
```
The test name for this and the following methods testing BETWEEN is a bit strange. Since we are testing the BETWEEN expression, I would assume that we test org.apache.hadoop.hive.ql.optimizer.calcite.stats.FilterSelectivityEstimator#computeBetweenPredicateSelectivity and not org.apache.hadoop.hive.ql.optimizer.calcite.stats.FilterSelectivityEstimator#computeRangePredicateSelectivity.
How about one of the following alternatives:
- testComputeBetweenPredicateSelectivityWithCastDecimal2_1
- testEstimateSelectivityBetweenWithCastDecimal2_1
```java
private static void adjustBoundariesForDecimal(RexCall cast, MutableObject<FloatInterval> rangeBoundaries,
    MutableObject<FloatInterval> typeBoundaries) {
```
I wanted to check if it is possible to simplify this method such that we:
- avoid the MutableObject
- make the mutation explicit by returning an object
- separate/decouple handling of range and type boundaries

The signature would become:

```java
private static FloatInterval adjustBoundariesForDecimal(RexCall cast, FloatInterval boundaries);
```

I may give it a bit more thought once we finalize the remaining points.
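The pure-function shape suggested here can be sketched as follows. All names are assumed and the adjustment itself is deliberately simplified: it only widens each endpoint by half of the decimal step, mirroring the idea that Hive rounds on the cast; the real computation lives in adjustBoundariesForDecimal.

```java
// Sketch (names assumed, adjustment simplified): instead of mutating
// MutableObject holders, take an interval and return the adjusted one.
class DecimalAdjust {
  record FloatInterval(float lower, float upper) {}

  static FloatInterval adjustBoundariesForDecimal(FloatInterval boundaries, int scale) {
    // Half of the smallest step of DECIMAL(p, s): values within this distance
    // of a bound round onto it, so the interval widens by that amount.
    float halfStep = (float) (0.5 * Math.pow(10, -scale));
    return new FloatInterval(boundaries.lower() - halfStep, boundaries.upper() + halfStep);
  }
}
```

Returning a fresh FloatInterval keeps the mutation explicit at the call site and lets range and type boundaries be adjusted by separate calls.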



See HIVE-29424.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover some corner cases. To get the ground truth, I executed queries with the predicates; see the resulting q.out file.
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
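As background, the way histogram statistics turn a range predicate into a selectivity can be illustrated as follows. This is not the PR's code: with the DataSketches KllFloatsSketch kept in Hive column statistics the estimate is roughly getRank(upper) - getRank(lower); here a sorted sample of column values plays the role of the sketch.

```java
import java.util.Arrays;

// Illustrative only: estimate the fraction of rows whose value falls in
// [lower, upper] from a sorted sample standing in for a quantile sketch.
class RangeSelectivity {
  static double estimate(float[] sortedValues, float lower, float upper) {
    if (lower > upper) {
      return 0.0; // empty range, e.g. after boundary adjustment for a CAST
    }
    return rank(sortedValues, upper) - rank(sortedValues, lower);
  }

  // Fraction of values below v; inclusive/exclusive endpoint handling
  // is omitted for brevity.
  private static double rank(float[] sorted, float v) {
    int idx = Arrays.binarySearch(sorted, v);
    if (idx < 0) {
      idx = -idx - 1; // insertion point for values not present
    }
    return (double) idx / sorted.length;
  }
}
```

Removing a cast around the column is what makes this machinery applicable, since the histogram is built over the raw column values.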
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.