[WIP] Parquet Java ALP Implementation #3397
Open
vinooganesh wants to merge 7 commits into apache:master
Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression.
New files:
- AlpConstants.java: Constants for ALP encoding
- AlpEncoderDecoder.java: Core encoding/decoding logic
- AlpValuesWriter.java: Writer implementation
- AlpValuesReaderForFloat/Double.java: Reader implementations
Includes comprehensive unit tests and interop test infrastructure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore original comment indentation that was accidentally changed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Escape <= characters as &lt;= in javadoc comments to avoid malformed HTML errors during documentation generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ALP encoding is not yet part of the parquet-format Thrift specification, so it cannot be converted to org.apache.parquet.format.Encoding. Skip it in the testEnumEquivalence test and add a clear error message in the converter for when ALP conversion is attempted. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
size and add independent reader/writer verification tests
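The first commit message above describes ALP as decimal scaling followed by Frame of Reference (FOR) encoding and bit-packing. The following is a minimal Java sketch of that idea, not the code in AlpEncoderDecoder.java. The class and method names are hypothetical, it uses a single decimal exponent per vector, it decodes by division, and it only computes the bit width that bit-packing would need rather than packing any bytes.

```java
final class AlpSketch {
    // Powers of ten used as scale factors; this sketch supports exponents 0..9 only.
    private static final double[] POW10 = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9};

    /** Decimal scaling: multiply by 10^exponent and round to the nearest long. */
    static long encode(double value, int exponent) {
        return Math.round(value * POW10[exponent]);
    }

    /** Inverse of encode (assumption: division is used here for simplicity). */
    static double decode(long encoded, int exponent) {
        return encoded / POW10[exponent];
    }

    /** A value is losslessly encodable only if decoding reproduces its exact bit pattern. */
    static boolean roundTrips(double value, int exponent) {
        double decoded = decode(encode(value, exponent), exponent);
        return Double.doubleToRawLongBits(decoded) == Double.doubleToRawLongBits(value);
    }

    /** Applies decimal scaling to every value of one vector. */
    static long[] encodeVector(double[] values, int exponent) {
        long[] encoded = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = encode(values[i], exponent);
        }
        return encoded;
    }

    /** Frame of reference: the minimum, which is subtracted from every encoded value. */
    static long min(long[] encoded) {
        long min = Long.MAX_VALUE;
        for (long v : encoded) {
            min = Math.min(min, v);
        }
        return min;
    }

    /** Bits per value needed to bit-pack (value - min) for the whole vector. */
    static int bitWidth(long[] encoded) {
        long min = min(encoded);
        long maxDelta = 0;
        for (long v : encoded) {
            maxDelta = Math.max(maxDelta, v - min);
        }
        return 64 - Long.numberOfLeadingZeros(maxDelta);
    }
}
```

For example, with exponent 2 the values 1.23, 4.5, 10.0 and 7.25 become the integers 123, 450, 1000 and 725. Subtracting the minimum leaves deltas up to 877, which fit in 10 bits each instead of 64. Values such as NaN or -0.0 fail the `roundTrips` check in this sketch and would need to be stored separately.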
93c365a to 6d65eaa
cc @julienledem @alamb
Rationale for this change
Reworks the ALP encoding implementation to address emkornfield's architectural feedback on PR #3390. The original implementation buffered all values in memory and decoded eagerly. This PR makes the writer incremental (encoding per-vector as values arrive) and the reader lazy (decoding on demand), matching how other Parquet encodings work.
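To illustrate the difference between the two designs, here is a rough Java sketch of an incremental writer and a lazy reader. It is not the code in this PR: the class names, the tiny vector size of 4, and the use of raw double bits as a stand-in for the real ALP encoding are all assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class VectorAtATime {
    // Hypothetical vector size, kept tiny so the behavior is easy to follow.
    static final int VECTOR_SIZE = 4;

    /** Incremental writer: holds at most one vector of raw values at a time. */
    static final class Writer {
        private final double[] buffer = new double[VECTOR_SIZE];
        private int count = 0;
        final List<long[]> encodedVectors = new ArrayList<>();

        void write(double value) {
            buffer[count++] = value;
            if (count == VECTOR_SIZE) {
                flush(); // encode as soon as a vector is full
            }
        }

        /** Encodes whatever is buffered, including a final partial vector. */
        void flush() {
            if (count == 0) {
                return;
            }
            long[] encoded = new long[count];
            for (int i = 0; i < count; i++) {
                // Stand-in for the real per-vector ALP encoding.
                encoded[i] = Double.doubleToRawLongBits(buffer[i]);
            }
            encodedVectors.add(encoded);
            count = 0;
        }
    }

    /** Lazy reader: decodes a vector only when a value inside it is requested. */
    static final class Reader {
        private final List<long[]> vectors;
        private double[] current;
        private int currentVector = -1;
        private int position = 0;
        int decodedVectors = 0; // exposed so callers can observe the laziness

        Reader(List<long[]> vectors) {
            this.vectors = vectors;
        }

        /** Skipping only moves the position; nothing is decoded. */
        void skip(int n) {
            position += n;
        }

        double read() {
            int vector = position / VECTOR_SIZE;
            if (vector != currentVector) {
                long[] encoded = vectors.get(vector);
                current = new double[encoded.length];
                for (int i = 0; i < encoded.length; i++) {
                    current[i] = Double.longBitsToDouble(encoded[i]);
                }
                currentVector = vector;
                decodedVectors++;
            }
            double value = current[position % VECTOR_SIZE];
            position++;
            return value;
        }
    }
}
```

In this sketch, writing ten values produces two encoded vectors as soon as the eighth value arrives, and a third (partial) vector on `flush()`. A reader that skips five values and then reads one decodes only the second vector, so a vector that is skipped entirely is never decoded.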
Builds on Julien Le Dem's original implementation (#3390). File structure, integration points, core math, and interop test infrastructure all come from his work. The rework focused on the internal writer/reader plumbing.
What changes are included in this PR?
Architecture (addressing review feedback):
Spec compliance:
Code quality:
Integration:
Are these changes tested?
Yes. 105 tests across 3 test classes, all passing. Full parquet-column suite (677 tests) also passes.
Key tests construct ALP page bytes directly according to the spec and feed them to the reader without going through the writer. This verifies the reader works independently and catches any bugs where writer and reader agree with each other but disagree with the spec. Also covers NaN bit pattern preservation, negative zero roundtrip, extreme values, every partial vector remainder mod 8, skip across vector boundaries, and preset caching under distribution change.
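The NaN and negative zero cases need a bit-level comparison, because Java's `==` treats 0.0 and -0.0 as equal and treats NaN as unequal to itself. The following helper is a sketch of how such a check could look. The class and method names are hypothetical and are not taken from the PR's test classes.

```java
final class BitExact {
    /** True only if both doubles have the identical 64-bit pattern. */
    static boolean sameBits(double a, double b) {
        // doubleToRawLongBits keeps NaN payloads; doubleToLongBits would
        // collapse every NaN to the canonical 0x7ff8000000000000L.
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    /** Returns the index of the first bit-level mismatch, or -1 if the arrays match. */
    static int firstMismatch(double[] expected, double[] actual) {
        if (expected.length != actual.length) {
            return Math.min(expected.length, actual.length);
        }
        for (int i = 0; i < expected.length; i++) {
            if (!sameBits(expected[i], actual[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

A roundtrip test built on this helper would fail if a decoder returned 0.0 for an input of -0.0, or replaced a NaN carrying a payload with the canonical `Double.NaN`, both of which a plain `==` or `Double.compare`-free equality check could miss.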
Are there any user-facing changes?
Users can enable ALP encoding for FLOAT and DOUBLE columns via ParquetProperties.withAlpEncoding(), globally or per-column.
Note: I may be missing something, but ALP is not yet in the parquet-format Thrift spec (apache/parquet-format#533), so writing ALP files through the full Hadoop pipeline will fail at metadata serialization until parquet.thrift is updated (parquet-format PR #548).