Skip to content

perf(reader): Avoid second create_parquet_record_batch_stream_builder() call for migrated tables#2176

Merged
blackmwk merged 10 commits intoapache:mainfrom
mbutrovich:double_open_fix
Mar 9, 2026
Merged

perf(reader): Avoid second create_parquet_record_batch_stream_builder() call for migrated tables#2176
blackmwk merged 10 commits intoapache:mainfrom
mbutrovich:double_open_fix

Conversation

@mbutrovich
Copy link
Collaborator

@mbutrovich mbutrovich commented Feb 24, 2026

Which issue does this PR close?

What changes are included in this PR?

Introduces open_parquet_file(), which opens the file once and returns both the ArrowFileReader and ArrowReaderMetadata. The caller inspects the metadata in-memory for field IDs, optionally rebuilds ArrowReaderMetadata with a custom schema for migrated tables, then passes the original ArrowFileReader to ParquetRecordBatchStreamBuilder::new_with_metadata(). This eliminates the redundant file open that previously occurred for migrated tables.

Are these changes tested?

Existing tests. Also ran full Iceberg Java suite via Comet.

@mbutrovich mbutrovich changed the title Double open fix perf(reader): Avoid second create_parquet_record_batch_stream_builder() call for migrated tables Feb 24, 2026
@mbutrovich mbutrovich self-assigned this Feb 24, 2026
@mbutrovich mbutrovich marked this pull request as ready for review February 26, 2026 02:17
@mbutrovich mbutrovich requested a review from blackmwk February 26, 2026 02:17
Copy link
Contributor

@jdockerty jdockerty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM!

You'll have to wait for a more authoritative review before merging, but I thought I'd at least give this a look 😄

Some non-blocking questions below.

# Conflicts:
#	crates/iceberg/src/arrow/delete_file_loader.rs
#	crates/iceberg/src/arrow/reader.rs
@mbutrovich
Copy link
Collaborator Author

Re-running Iceberg Java tests via Comet at apache/datafusion-comet#3642. I ran this change previously as part of apache/datafusion-comet#3551 which has a lot of the changes discussed in #2172, but those have mostly merged/been refactored as individual PRs, so I'll run just this PR.

@mbutrovich
Copy link
Collaborator Author

It passed all the Iceberg Java tests on Comet.

Copy link
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich for this pr!

@blackmwk blackmwk merged commit 32996b6 into apache:main Mar 9, 2026
19 checks passed
@mbutrovich mbutrovich deleted the double_open_fix branch March 9, 2026 01:57
gbrgr pushed a commit to RelationalAI/iceberg-rust that referenced this pull request Mar 10, 2026
…r()` call for migrated tables (apache#2176)

- Partially addresses apache#2172.

Introduces `open_parquet_file()`, which opens the file once and returns
both the `ArrowFileReader` and `ArrowReaderMetadata`. The caller
inspects the metadata in-memory for field IDs, optionally rebuilds
`ArrowReaderMetadata` with a custom schema for migrated tables, then
passes the original `ArrowFileReader` to
`ParquetRecordBatchStreamBuilder::new_with_metadata()`. This eliminates
the redundant file open that previously occurred for migrated tables.

Existing tests. Also ran full Iceberg Java suite via Comet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants