
perf: improve json read #20823

Open

ariel-miculas wants to merge 8 commits into apache:main from ariel-miculas:improve-json-read

Conversation

@ariel-miculas
Contributor

@ariel-miculas ariel-miculas commented Mar 9, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

This is an alternative approach to #19687.

Instead of reading the entire range in the json FileOpener, implement an
AlignedBoundaryStream which scans the range for newlines as the FileStream
requests data from the stream, by wrapping the original stream returned by the
ObjectStore.

This eliminates the overhead of the two extra get_opts requests needed by
calculate_range, and more importantly, it allows for efficient read-ahead
implementations by the underlying ObjectStore. Previously this was inefficient
because the streams opened by calculate_range included one from
(start - 1) to file_size and another from (end - 1) to end_of_file, just to
find the two relevant newlines.

What changes are included in this PR?

Added the AlignedBoundaryStream, which wraps a stream returned by the object
store and finds the delimiting newlines for a particular file range. Notably, it
doesn't perform any standalone reads (unlike the calculate_range function),
eliminating two calls to get_opts.
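The boundary-scanning idea can be sketched as follows. This is a simplified, synchronous model, not the PR's actual code: the real AlignedBoundaryStream wraps an async chunk stream, and the exact boundary convention may differ. The sketch assumes a range [raw_start, raw_end) discards bytes up to and including the first newline at or after raw_start (unless raw_start is 0, i.e. the first partition), then emits bytes until the first newline at or after raw_end, so adjacent ranges neither overlap nor leave gaps:

```rust
/// Hypothetical sketch (not the PR's actual code): align a byte range to
/// newline boundaries. `chunks` are the bytes streamed from `raw_start` to
/// end of file; the function returns only the complete lines owned by the
/// range [raw_start, raw_end) under the convention described above.
fn align_to_newlines(chunks: &[&[u8]], raw_start: u64, raw_end: u64) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = raw_start; // absolute file offset of the next byte
    // The first partition starts at a record boundary; every other partition
    // discards bytes up to and including the first terminator it sees.
    let mut started = raw_start == 0;
    'outer: for chunk in chunks {
        for &b in *chunk {
            if !started {
                if b == b'\n' {
                    started = true;
                }
            } else {
                out.push(b);
                // Past raw_end, stop at the first terminator so the trailing
                // partial record is completed exactly once.
                if pos >= raw_end && b == b'\n' {
                    break 'outer;
                }
            }
            pos += 1;
        }
    }
    out
}
```

Because the scan runs as chunks arrive, no extra get_opts calls are needed: the same stream that feeds the decoder also resolves the boundaries, regardless of how the object store chunks the bytes.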

Are these changes tested?

Yes, added unit tests.

Are there any user-facing changes?

No

@ariel-miculas
Contributor Author

Adding @Weijun-H and @martin-g since you discussed the previous PR.

@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 9, 2026
@ariel-miculas
Contributor Author

a bit ironic

Run ci/scripts/typos_check.sh
[typos_check.sh] `typos --config typos.toml`
error: `tpos` should be `typos`
    ╭▸ ./datafusion/datasource-json/src/boundary_stream.rs:279:41
    │
279 │ …                     if let Some(tpos) =
    ╰╴                                  ━━━━
error: `tpos` should be `typos`
    ╭▸ ./datafusion/datasource-json/src/boundary_stream.rs:283:74
    │
283 │ …                     return Poll::Ready(Some(Ok(chunk.slice(..tpos + 1))));
    ╰╴                                                               ━━━━

@ariel-miculas
Contributor Author

Who can help me with reviews? @martin-g @alamb

@alamb
Contributor

alamb commented Mar 19, 2026

Thanks for this PR @ariel-miculas

Do you have any benchmark results for this change? Even some example queries

@Weijun-H do you know of any benchmarks to run?

@ariel-miculas
Contributor Author

No, I'm having trouble coming up with a realistic benchmark.

The previous benchmark https://github.com/apache/datafusion/pull/19687/changes#diff-5358b38b6265d769b66b614f7ba88ed9320f7a9fce5197330b7c01c2a8a3ed3b incorrectly assumes that all the requested bytes (via get_opts) will be read, whereas you can actually request a 10GiB stream of bytes and read only 16KiB from it.

As a result, the previous PR's benchmark for reducing read amplification shows impressive improvements, but it hides the fact that it breaks the parallelization between data fetching and JSON decoding (by doing all the data fetching in the JsonOpener instead of letting FileStream do its magic).

So I'm not sure how to write a benchmark that can prove at the same time that:

  • I'm increasing performance (because there are no more read requests in the JsonOpener)
  • This solution is better than the original proposal (perf: reduce read amplification for partitioned JSON file scanning, #19687) because it doesn't break parallelization between fetching and decoding
  • This optimization is relevant for real-world object store implementations (where network latency matters, network speed matters, data computation can happen while waiting for bytes to be read, read-ahead is a relevant optimization etc.)

@martin-g martin-g self-requested a review March 20, 2026 03:02
file_size: u64,
terminator: u8,
) -> object_store::Result<Self> {
if raw_start >= raw_end {
Member


Should the same be done for raw_start >= file_size?

Contributor Author


yeah, I also removed them from the caller, so we don't have duplication

terminator: u8,
/// Effective end boundary. Set to `u64::MAX` when `end >= file_size`
/// (last partition), so `FetchingChunks` never transitions to
/// `ScanningLastTerminator` and simply streams to EOF.
Member


... streams to EOF is not clear to me. What do you mean ?

Contributor Author


It means we pass all the chunks through to the JSON decoder (the caller that polls AlignedBoundaryStream), staying in the FetchingChunks phase until we consume the entire inner stream. This only happens when raw_end >= file_size, i.e. for the last range in a file, in which case there's nothing to scan for a terminator past raw_end (nor any need to). So we consume only the initial stream, but since that stream includes the end of the file, we pass through all the remaining chunks until end of file (EOF) is reached.
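A minimal sketch of the effective-end trick being discussed (the phase names and the u64::MAX sentinel come from the doc comment under review; the helper names are hypothetical, not the PR's actual code). Clamping the end to u64::MAX for the last partition means the FetchingChunks → ScanningLastTerminator transition can never fire, so every chunk is passed through until the inner stream ends:

```rust
/// Stream phases as named in the doc comment under review.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Phase {
    FetchingChunks,
    ScanningLastTerminator,
}

/// For the last partition (`raw_end >= file_size`) there is no next partition
/// to hand the trailing record to, so the sentinel `u64::MAX` disables the
/// end-of-range scan entirely.
fn effective_end(raw_end: u64, file_size: u64) -> u64 {
    if raw_end >= file_size { u64::MAX } else { raw_end }
}

/// Phase transition checked as bytes are consumed: once the absolute offset
/// reaches the effective end, switch to scanning for the last terminator.
/// With an effective end of `u64::MAX`, no reachable offset triggers it.
fn next_phase(phase: Phase, offset: u64, effective_end: u64) -> Phase {
    match phase {
        Phase::FetchingChunks if offset >= effective_end => {
            Phase::ScanningLastTerminator
        }
        p => p,
    }
}
```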

ariel-miculas and others added 4 commits March 20, 2026 11:29
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
