[WIP] Parquet Java ALP Implementation by vinooganesh · Pull Request #3397 · apache/parquet-java

vinooganesh · 2026-02-17T23:15:47Z

Rationale for this change

Reworks the ALP encoding implementation to address emkornfield's architectural feedback on PR #3390. The original implementation buffered all values in memory and decoded eagerly. This PR makes the writer incremental (it encodes per vector as values arrive) and the reader lazy (it decodes on demand), matching how other Parquet encodings work.
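To illustrate what "incremental" means here, the following is a minimal sketch and not the PR's AlpValuesWriter. The class name, method names, and the use of a plain array copy in place of real ALP encoding are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: buffer values in a fixed-size vector, flush each full vector at once. */
final class IncrementalVectorWriterSketch {
  private final double[] vector;
  private int count = 0;
  // Stand-in for the encoded page bytes: one entry per flushed vector.
  private final List<double[]> flushed = new ArrayList<>();

  IncrementalVectorWriterSketch(int vectorSize) {
    if (vectorSize <= 0) {
      throw new IllegalArgumentException("vectorSize must be positive");
    }
    this.vector = new double[vectorSize];
  }

  void writeDouble(double v) {
    vector[count++] = v;
    if (count == vector.length) {
      flushVector(); // memory use stays bounded by one vector
    }
  }

  /** Flushes the trailing partial vector, if any. */
  void finish() {
    if (count > 0) {
      flushVector();
    }
  }

  private void flushVector() {
    // Assumption: a real writer would ALP-encode the vector here; a copy stands in for that.
    flushed.add(Arrays.copyOf(vector, count));
    count = 0;
  }

  int flushedVectorCount() {
    return flushed.size();
  }

  int bufferedValueCount() {
    return count;
  }

  double[] flushedVector(int index) {
    return flushed.get(index);
  }
}
```

With a vector size of 4, writing 10 values flushes two full vectors during the writes and leaves two values buffered until finish() is called.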

Builds on Julien Le Dem's original implementation (#3390). File structure, integration points, core math, and interop test infrastructure all come from his work. The rework focused on the internal writer/reader plumbing.

What changes are included in this PR?

Architecture (addressing review feedback):

Incremental writer. Values buffer in a fixed-size vector; each full vector is encoded and flushed immediately.
Lazy reader. Vectors are decoded on first access via an offset array; skip() is O(1).
Interleaved page layout so each vector is self-contained.
Extracted AlpValuesReader abstract base class for shared logic.
Preset caching. Full parameter search for the first 8 vectors; the top 5 parameter combinations are cached for the rest.
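The lazy-reader item above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's AlpValuesReader: the class, its fields, and the page layout (raw little-endian doubles standing in for ALP-encoded vectors) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch of a lazy, vector-at-a-time reader driven by an offset array. */
final class LazyVectorReaderSketch {
  private final ByteBuffer page;
  private final int[] vectorOffsets; // byte offset where each vector starts
  private final int vectorSize;
  private final int totalValues;
  private int position = 0;          // index of the next value to read
  private int cachedVectorIndex = -1;
  private double[] cachedVector;
  private int decodeCalls = 0;

  LazyVectorReaderSketch(ByteBuffer page, int[] vectorOffsets, int vectorSize, int totalValues) {
    this.page = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    this.vectorOffsets = vectorOffsets;
    this.vectorSize = vectorSize;
    this.totalValues = totalValues;
  }

  /** O(1): only moves the cursor; no vector is decoded. */
  void skip(int n) {
    position += n;
  }

  double readDouble() {
    if (position >= totalValues) {
      throw new IllegalStateException("read past end of page");
    }
    int vectorIndex = position / vectorSize;
    if (vectorIndex != cachedVectorIndex) {
      decodeVector(vectorIndex); // decode on first access only
    }
    double value = cachedVector[position % vectorSize];
    position++;
    return value;
  }

  private void decodeVector(int vectorIndex) {
    int count = Math.min(vectorSize, totalValues - vectorIndex * vectorSize);
    double[] out = new double[count];
    int base = vectorOffsets[vectorIndex];
    for (int i = 0; i < count; i++) {
      // Assumption: raw doubles stand in for real ALP decoding of the vector.
      out[i] = page.getDouble(base + i * Double.BYTES);
    }
    cachedVector = out;
    cachedVectorIndex = vectorIndex;
    decodeCalls++;
  }

  int decodeCalls() {
    return decodeCalls;
  }
}
```

Because the offset array says where each vector begins, skipping across vector boundaries never touches the vectors that were skipped; only the vector holding the next value read is decoded.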

Spec compliance:

Fixed packed data size formula to ceil(n * bitWidth / 8)
Fixed unsigned delta comparison in float writer
Explicit little-endian byte reads instead of relying on ByteBuffer order
Using parquet-encoding's BytePacker instead of custom bit-packing
Capped max vector size at 32768 to prevent uint16 overflow in num_exceptions
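A few of the items above come down to small pieces of arithmetic. The sketch below shows one way to express them; the class and method names are made up for illustration and are not taken from the PR, and the unsigned-delta part only demonstrates the general pitfall, since the exact fix in the float writer is not shown here.

```java
/** Hypothetical helpers illustrating the size formula, little-endian reads, and unsigned deltas. */
final class AlpLayoutMathSketch {
  private AlpLayoutMathSketch() {}

  /** ceil(n * bitWidth / 8), computed in long to avoid int overflow. */
  static int packedSizeBytes(int numValues, int bitWidth) {
    long bits = (long) numValues * bitWidth;
    return (int) ((bits + 7) / 8);
  }

  /** Reads a 32-bit little-endian int byte by byte, independent of any ByteBuffer order. */
  static int readIntLittleEndian(byte[] buf, int offset) {
    return (buf[offset] & 0xFF)
        | ((buf[offset + 1] & 0xFF) << 8)
        | ((buf[offset + 2] & 0xFF) << 16)
        | ((buf[offset + 3] & 0xFF) << 24);
  }

  /** Reads a 16-bit little-endian unsigned value (range 0..65535). */
  static int readUnsignedShortLittleEndian(byte[] buf, int offset) {
    return (buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8);
  }

  /** Bit width of a delta interpreted as unsigned; 0 needs 0 bits. */
  static int bitWidthUnsigned(int delta) {
    return 32 - Integer.numberOfLeadingZeros(delta);
  }

  /** True if a is greater than b when both are interpreted as unsigned 32-bit values. */
  static boolean greaterUnsigned(int a, int b) {
    return Integer.compareUnsigned(a, b) > 0;
  }
}
```

For example, 1024 values at 13 bits pack into 1664 bytes. A delta such as Integer.MAX_VALUE - (-1) wraps to a negative int, so a signed comparison would treat it as smaller than 0 even though it is the largest delta and needs 32 bits.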

Code quality:

Renamed bitWidth overloads to prevent silent type coercion
Package-private visibility for internals
Configurable vector size (default 1024)

Integration:

Wired ALP into DefaultV2ValuesWriterFactory and ParquetProperties

Are these changes tested?

Yes. 105 tests across 3 test classes, all passing. Full parquet-column suite (677 tests) also passes.

Key tests construct ALP page bytes directly according to the spec and feed them to the reader without going through the writer. This verifies the reader works independently and catches any bugs where writer and reader agree with each other but disagree with the spec. Also covers NaN bit pattern preservation, negative zero roundtrip, extreme values, every partial vector remainder mod 8, skip across vector boundaries, and preset caching under distribution change.
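As a rough illustration of why NaN bit patterns and negative zero need dedicated tests, the sketch below compares doubles by raw bits rather than with ==. The helper class is hypothetical and is not one of the PR's test classes.

```java
/** Hypothetical helpers: bit-exact comparison catches cases that == and equals() hide. */
final class FloatBitPatternChecks {
  private FloatBitPatternChecks() {}

  /** True only if both doubles have identical IEEE 754 bit patterns. */
  static boolean sameBits(double a, double b) {
    return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
  }

  /** True if v is -0.0 (sign bit set, all other bits clear). */
  static boolean isNegativeZero(double v) {
    return Double.doubleToRawLongBits(v) == 0x8000000000000000L;
  }

  /** A quiet NaN carrying a non-default payload in its low bits. */
  static double nanWithPayload() {
    return Double.longBitsToDouble(0x7ff8000000000001L);
  }
}
```

Here 0.0 == -0.0 evaluates to true while sameBits(0.0, -0.0) is false, and a NaN with a payload is still NaN but differs bitwise from Double.NaN. A roundtrip check based on == would therefore miss a lost sign bit and could never confirm that a NaN survived unchanged.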

Are there any user-facing changes?

Users can enable ALP encoding for FLOAT and DOUBLE columns via ParquetProperties.withAlpEncoding(), globally or per-column.

Note: I may be missing something, but ALP is not yet in the parquet-format Thrift spec (apache/parquet-format#533), so writing ALP files through the full Hadoop pipeline will fail at metadata serialization until parquet.thrift is updated (parquet-format PR #548).

Implements ALP encoding for FLOAT and DOUBLE types, which converts floating-point values to integers using decimal scaling, then applies Frame of Reference (FOR) encoding and bit-packing for compression.

New files:
- AlpConstants.java: Constants for ALP encoding
- AlpEncoderDecoder.java: Core encoding/decoding logic
- AlpValuesWriter.java: Writer implementation
- AlpValuesReaderForFloat/Double.java: Reader implementations

Includes comprehensive unit tests and interop test infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
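The decimal scaling, FOR, and bit-width steps named in this commit message can be sketched as below. This is a simplified illustration and not the code in AlpEncoderDecoder.java: it uses a single decimal exponent, skips the parameter search and the exception storage, and all names are assumptions.

```java
/** Hypothetical, simplified sketch of ALP-style scaling followed by Frame of Reference. */
final class AlpCoreSketch {
  private AlpCoreSketch() {}

  private static final double[] POW10 = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8};

  /** Decimal scaling: map a double to an integer by multiplying with 10^exponent. */
  static long encode(double value, int exponent) {
    return Math.round(value * POW10[exponent]);
  }

  static double decode(long encoded, int exponent) {
    return encoded / POW10[exponent];
  }

  /** A value is an exception if it does not round-trip bit-exactly (NaN, -0.0, PI, ...). */
  static boolean isException(double value, int exponent) {
    double roundTripped = decode(encode(value, exponent), exponent);
    return Double.doubleToRawLongBits(roundTripped) != Double.doubleToRawLongBits(value);
  }

  /** Frame of Reference: subtract the minimum so only small non-negative deltas remain. */
  static long[] frameOfReferenceDeltas(long[] encoded) {
    long min = Long.MAX_VALUE;
    for (long e : encoded) {
      min = Math.min(min, e);
    }
    long[] deltas = new long[encoded.length];
    for (int i = 0; i < encoded.length; i++) {
      deltas[i] = encoded[i] - min; // assumption: range fits in a long without overflow
    }
    return deltas;
  }

  /** Bits per value needed to bit-pack the deltas. */
  static int bitWidth(long[] deltas) {
    long max = 0;
    for (long d : deltas) {
      max = Math.max(max, d);
    }
    return 64 - Long.numberOfLeadingZeros(max);
  }
}
```

With exponent 2, the values 1.23, 4.56, and 7.89 encode to 123, 456, and 789; after subtracting the minimum the deltas are 0, 333, and 666, which fit in 10 bits each instead of 64. A value such as Math.PI does not round-trip at that exponent and would have to be stored as an exception.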

ALP encoding is not yet part of the parquet-format Thrift specification, so it cannot be converted to org.apache.parquet.format.Encoding. Skip it in the testEnumEquivalence test and add a clear error message in the converter for when ALP conversion is attempted. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

julienledem and others added 7 commits January 22, 2026 08:44

Fix formatting in DirectCodecFactory and ParquetMetadataConverter

bc5ebe4

Restore original comment indentation that was accidentally changed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix javadoc HTML escaping in AlpEncoderDecoder

dfdd809

Escape <= characters as &lt;= in javadoc comments to avoid malformed HTML errors during documentation generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Apply spotless formatting to ParquetMetadataConverter

03457c5

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

first pass of ALP java implementation

0c393e0

Fix uint16 overflow bug in max vector

6d65eaa

size and add independent reader/writer verification tests

vinooganesh force-pushed the vinooganesh/alp-java-implementation branch from 93c365a to 6d65eaa on February 18, 2026 02:54

vinooganesh wants to merge 7 commits into apache:master from vinooganesh:vinooganesh/alp-java-implementation

2 participants
