Skip to content

perf(reader): Implement AsyncFileReader get_byte_ranges and coalesce close ranges#2181

Merged
blackmwk merged 9 commits intoapache:mainfrom
mbutrovich:get_byte_ranges
Mar 6, 2026
Merged

perf(reader): Implement AsyncFileReader get_byte_ranges and coalesce close ranges#2181
blackmwk merged 9 commits intoapache:mainfrom
mbutrovich:get_byte_ranges

Conversation

@mbutrovich
Copy link
Collaborator

@mbutrovich mbutrovich commented Feb 26, 2026

Which issue does this PR close?

What changes are included in this PR?

  • Adapt range coalescing from object_store.

Are these changes tested?

Existing tests, some new ones just to sanity check merge_ranges. Also ran full Iceberg Java suite via Comet. Benchmarks below.

@mbutrovich mbutrovich changed the title perf(reader): Add get_byte_ranges and merge_ranges. perf(reader): Add get_byte_ranges and merge_ranges Feb 26, 2026
@mbutrovich
Copy link
Collaborator Author

mbutrovich commented Feb 26, 2026

I ran SELECT SUM(l_extendedprice * l_discount) on a SF100 lineitem TPC-H table accessed via S3 (essentially just the scan from query 6 in TPC-H) with Comet. Here's the breakdown:

Metric main (uses get_bytes) PR #2181 (uses get_byte_ranges) Change
Total GETs 773 500 35% fewer
Total bytes 3,535 MB 3,535 MB same
Percentile main (uses get_bytes) PR #2181 (uses get_byte_ranges) Change
P50 (median) 1.5 MB 3.6 MB 2.4x larger
P75 3.3 MB 17.0 MB 5.2x larger

Histogram of S3 GET sizes:
histogram

@mbutrovich mbutrovich marked this pull request as ready for review February 26, 2026 20:54
@mbutrovich mbutrovich self-assigned this Feb 26, 2026
@mbutrovich mbutrovich changed the title perf(reader): Add get_byte_ranges and merge_ranges perf(reader): Implement AsyncFileReader get_byte_ranges and coalesce close ranges Feb 26, 2026
Copy link

@vustef vustef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich, this is awesome.

Copy link
Collaborator

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the this change!

@mbutrovich mbutrovich requested a review from blackmwk March 5, 2026 15:04
Copy link
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich for this pr!

row_group_filtering_enabled: true,
row_selection_enabled: false,
metadata_size_hint: None,
parquet_read_options: ParquetReadOptions::builder().build(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We could implement a Default for ParquetReadOptions.

@blackmwk blackmwk merged commit 992a989 into apache:main Mar 6, 2026
19 checks passed
@mbutrovich mbutrovich deleted the get_byte_ranges branch March 6, 2026 02:59
gbrgr pushed a commit to RelationalAI/iceberg-rust that referenced this pull request Mar 10, 2026
…close ranges (apache#2181)

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Partially address apache#2172.

<!--
Provide a summary of the modifications in this PR. List the main changes
such as new features, bug fixes, refactoring, or any other updates.
-->

- Adapt range coalescing from object_store.

<!--
Specify what test covers (unit test, integration test, etc.).

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing tests, some new ones just to sanity check `merge_ranges`. Also
ran full Iceberg Java suite via Comet. Benchmarks below.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants