-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48481: [Ruby] Correctly infer types for nested integer arrays #48699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
Benchmark code require "benchmark"
require "arrow"
SIZES = [100_000].freeze
# Pre-generate arrays so we only measure builder cost.
DATASETS = SIZES.each_with_object({}) do |size, datasets|
datasets[["int32", size]] = Array.new(size) {|i| (i % 2_000_000) - 1_000_000 }
datasets[["int64", size]] = Array.new(size) {|i| (i % 2**40) - 2**39 }
datasets[["uint32", size]] = Array.new(size) {|i| i % 4_000_000_000 }
datasets[["uint64", size]] = Array.new(size) {|i| i % 2**40 }
end.freeze
Benchmark.bm(36) do |x|
DATASETS.each do |(name, size), values|
x.report(" array_builder #{name} #{size}") do
Arrow::ArrayBuilder.build(values)
end
next unless name.start_with?("int", "uint")
builder_class = name.start_with?("uint") ? Arrow::UIntArrayBuilder : Arrow::IntArrayBuilder
x.report("int_array_builder #{name} #{size}") do
builder = builder_class.new
builder.build(values)
end
end
end |
| builder_info[:scale]) | ||
| Decimal256ArrayBuilder.new(data_type) | ||
| when :int | ||
| required_bit_length = builder_info[:bit_length] + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need +1 only for :int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because Ruby's .bit_length method returns the number of bits required to represent the integer excluding the sign bit.
For example, GLib::MAXINT16 (32767) is the upper bound of Int16, but its .bit_length is 15, not 16.
irb> GLib::MAXINT16.bit_length
=> 15Although I could have adjusted the thresholds in the if conditions (e.g., using 15 for Int16), I decided to keep the builder_info[:bit_length] + 1. This aligns the value with standard integer type widths, which I believe makes the code more intuitive.
If necessary, I can add a comment (e.g., # Add 1 for the sign bit) to clarify this further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment for this with an example like you showed in your reply? It may be difficult to understand why we need + 1 with only the Add 1 for the sign bit comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or how about computing it from min and max?
when :int
min = builder_info[:min]
max = builder_info[:max]
if GLib::MININT8 <= min and max <= GLib::MAXINT8
Int8ArrayBuilder.new
elsif GLib::MININT16 <= min and max <= GLib::MAXINT16
Int16ArrayBuilder.new
...
else
StringArrayBuilder.new
end
end| [3, 4], | ||
| ] | ||
| array = Arrow::Array.new(values) | ||
| data_type = Arrow::ListDataType.new(Arrow::UInt8DataType.new) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simplify this:
| data_type = Arrow::ListDataType.new(Arrow::UInt8DataType.new) | |
| data_type = Arrow::ListDataType.new(:uint8) |
| data_type = Arrow::ListDataType.new(Arrow::Int8DataType.new) | ||
| assert_equal({ | ||
| data_type: data_type, | ||
| values: [ | ||
| [0, -1, 2], | ||
| [3, 4], | ||
| ], | ||
| }, | ||
| { | ||
| data_type: array.value_data_type, | ||
| values: array.to_a, | ||
| }) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this test because list<int8>s boundary test covers this case?
| # Int8 can hold values from -128 to 127. | ||
| values = [ | ||
| [0, -2**7], | ||
| [2**7 - 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using defined constants instead of comment?
| # Int8 can hold values from -128 to 127. | |
| values = [ | |
| [0, -2**7], | |
| [2**7 - 1], | |
| values = [ | |
| [0, GLib::MININT8], | |
| [GLib::MAXINT8], |
| [0, -2**7 - 1], | ||
| [2**7 - 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [0, -2**7 - 1], | |
| [2**7 - 1], | |
| [0, GLib::MININT8 - 1], | |
| [GLib::MAXINT8], |
| [0, -129], | ||
| [127], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [0, -129], | |
| [127], | |
| [0, GLib::MININT8 - 1], | |
| [GLib::MAXINT8], |
Rationale for this change
When building an
Arrow::Tablefrom a Ruby Hash passed toArrow::Table.new, nested Integer arrays are incorrectly inferred aslist<uint8>orlist<int8>regardless of the actual values contained. Nested integer arrays should be correctly inferred as the appropriate list type (e.g.,list<int64>,list<uint64>) based on their values, similar to how flat arrays are handled, unless they contain values out of range for any integer type.What changes are included in this PR?
This PR modifies the logic in
detect_builder_info()to fix the inference issue. Specifically:sub_builder_infoacross sub-array elements: Previously,sub_builder_infowas recreated for each sub-array element in the Array. The logic has been updated to accumulate and carry over the builder information across elements to ensure correct type inference for the entire list.BigDecimal, the logic for determining the Integer builder has been moved tocreate_builder().detect_builder_info()now calls this function.Note:
BigDecimal(which were previously inferred asstring) may now have their types inferred. However, comprehensive testing and verification for nestedBigDecimalsupport will be addressed in a separate issue to keep this PR focused.IntArrayBuilderfor inference logic to ensure correctness. This results in a performance overhead (array building is approximately 2x slower) as we can no longer rely on the specialized builder's detection.Are these changes tested?
Yes.
ruby ruby/red-arrow/test/run-test.rbAre there any user-facing changes?
Yes.