correct parquet leaf index mapping when schema contains struct cols #20698
Merged
adriangb merged 2 commits intoapache:mainfrom Mar 5, 2026
friendlymatthew
commented
Mar 4, 2026
Contributor
Author
friendlymatthew
left a comment
self review
Comment on lines -442 to -445
| // For primitive-only columns, root indices ARE the leaf indices | ||
| if nested == NestedColumnSupport::PrimitiveOnly { | ||
| return root_indices.to_vec(); | ||
| } |
Contributor
Author
Just because a filter only references primitive columns doesn't mean Arrow indices equal Parquet leaf indices.
Struct columns elsewhere in the schema still shift the leaf numbering. The enum encoded the wrong signal (and was only used here), so I removed it and now always do the proper mapping.
adriangb
approved these changes
Mar 4, 2026
Contributor
adriangb
left a comment
This makes sense to me. Could we add an SLT test?
adriangb
reviewed
Mar 4, 2026
Comment on lines +416 to +420
| // Always map root (Arrow) indices to Parquet leaf indices via the schema | ||
| // descriptor. Arrow root indices only equal Parquet leaf indices when the | ||
| // schema has no group columns (Struct, Map, etc.); when group columns | ||
| // exist, their children become separate leaves and shift all subsequent | ||
| // leaf indices. |
Contributor
Suggested change
| // Always map root (Arrow) indices to Parquet leaf indices via the schema | |
| // descriptor. Arrow root indices only equal Parquet leaf indices when the | |
| // schema has no group columns (Struct, Map, etc.); when group columns | |
| // exist, their children become separate leaves and shift all subsequent | |
| // leaf indices. | |
| // Always map root (Arrow) indices to Parquet leaf indices via the schema | |
| // descriptor. Arrow root indices only equal Parquet leaf indices when the | |
| // schema has no group columns (Struct, Map, etc.); when group columns | |
| // exist, their children become separate leaves and shift all subsequent | |
| // leaf indices. | |
| // Struct columns are unsupported. |
adriangb
approved these changes
Mar 4, 2026
5561691 to 29f6891
Which issue does this PR close?
leaf_indices_for_roots returns wrong Parquet leaf indices when schema contains Struct columns #20695
Rationale for this change
Row filter pushdown assumed that Arrow field indices equal Parquet leaf indices. That assumption breaks when Struct columns are present, because their children expand into separate leaves and shift all subsequent indices.
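To make the index shift concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the `Field` enum, `leaf_count`, `leaf_indices_for_roots_sketch`, and `example_schema` are hypothetical stand-ins written for illustration, and they model the schema as a plain tree rather than going through the real Parquet schema descriptor.

```rust
/// Simplified stand-in for a schema field (hypothetical; not the arrow-rs type).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Field {
    /// A primitive column: exactly one Parquet leaf.
    Primitive(&'static str),
    /// A struct column: one Parquet leaf per (recursively flattened) child.
    Struct(&'static str, Vec<Field>),
}

/// Number of leaf columns a field expands into.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Primitive(_) => 1,
        Field::Struct(_, children) => children.iter().map(leaf_count).sum(),
    }
}

/// Map root (top-level) field indices to the leaf indices they cover,
/// by walking the schema in order and counting leaves as we go.
fn leaf_indices_for_roots_sketch(schema: &[Field], roots: &[usize]) -> Vec<usize> {
    let mut leaves = Vec::new();
    let mut next_leaf = 0;
    for (root_idx, field) in schema.iter().enumerate() {
        let n = leaf_count(field);
        if roots.contains(&root_idx) {
            leaves.extend(next_leaf..next_leaf + n);
        }
        // Every field advances the leaf counter, whether or not it is selected.
        next_leaf += n;
    }
    leaves
}

/// Example schema: id (primitive), s { a, b } (struct), value (primitive).
fn example_schema() -> Vec<Field> {
    vec![
        Field::Primitive("id"),
        Field::Struct("s", vec![Field::Primitive("a"), Field::Primitive("b")]),
        Field::Primitive("value"),
    ]
}
```

With this schema the leaves are numbered `id` = 0, `s.a` = 1, `s.b` = 2, `value` = 3. A filter that references only the primitive column `value` (root index 2) must read leaf 3. Treating the root index as the leaf index would select leaf 2, which is `s.b`.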