Skip to content

Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder#9497

Draft
liamzwbao wants to merge 1 commit intoapache:mainfrom
liamzwbao:issue-9298-replace-array-data
Draft

Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder#9497
liamzwbao wants to merge 1 commit intoapache:mainfrom
liamzwbao:issue-9298-replace-array-data

Conversation

@liamzwbao
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

While implementing ListViewArrayDecoder in arrow-json, I noticed we could potentially retire ArrayDataBuilder inside ListArrayDecoder. Therefore, I'd like to use a small PR here to make sure there's no regression

What changes are included in this PR?

Replace ArrayDataBuilder with GenericListArray in ListArrayDecoder

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No

@liamzwbao liamzwbao marked this pull request as ready for review March 2, 2026 02:28
@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 2, 2026
@liamzwbao
Copy link
Contributor Author

Hi @alamb @Jefffrey , would appreciate a review and a benchmark for this PR. Thanks!

@alamb
Copy link
Contributor

alamb commented Mar 2, 2026

Any particular benchmark you want me to run?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liamzwbao -- this looks good to me -- I have just one question

// Safety
// Validated lengths above
Ok(unsafe { data.build_unchecked() })
let array = GenericListArray::<O>::try_new(field, offsets, values, nulls)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does try_new validate the offsets ? That could be a significant performance hit

Basically as long as this doesn't do additional validation I think it looks good to me

Copy link
Contributor Author

@liamzwbao liamzwbao Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it will do validation, but I think it's cheap and the local benchmark does not cause a regression. BTW, we don't have an unchecked API for ListArray, the new API simply unwrap the try_new.

/// * `offsets.len() - 1 != nulls.len()`
/// * `offsets.last() > values.len()`
/// * `!field.is_nullable() && values.is_nullable()`
/// * `field.data_type() != values.data_type()`
pub fn try_new(
field: FieldRef,
offsets: OffsetBuffer<OffsetSize>,
values: ArrayRef,
nulls: Option<NullBuffer>,
) -> Result<Self, ArrowError> {
let len = offsets.len() - 1; // Offsets guaranteed to not be empty
let end_offset = offsets.last().unwrap().as_usize();
// don't need to check other values of `offsets` because they are checked
// during construction of `OffsetBuffer`
if end_offset > values.len() {
return Err(ArrowError::InvalidArgumentError(format!(
"Max offset of {end_offset} exceeds length of values {}",
values.len()
)));
}
if let Some(n) = nulls.as_ref() {
if n.len() != len {
return Err(ArrowError::InvalidArgumentError(format!(
"Incorrect length of null buffer for {}ListArray, expected {len} got {}",
OffsetSize::PREFIX,
n.len(),
)));
}
}
if !field.is_nullable() && values.is_nullable() {
return Err(ArrowError::InvalidArgumentError(format!(
"Non-nullable field of {}ListArray {:?} cannot contain nulls",
OffsetSize::PREFIX,
field.name()
)));
}
if field.data_type() != values.data_type() {
return Err(ArrowError::InvalidArgumentError(format!(
"{}ListArray expected data type {} got {} for {:?}",
OffsetSize::PREFIX,
field.data_type(),
values.data_type(),
field.name()
)));
}

// Validated lengths above
Ok(unsafe { data.build_unchecked() })
let array = GenericListArray::<O>::try_new(field, offsets, values, nulls)?;
Ok(array.into_data())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this call simply creates an ArrayData (which is necessary given the API) so I am not sure it actually avoids any ArrayDatas

In order to avoid ArrayData we would probably need to change the signature of decode to return an ArrayRef directly (rather than ArrayData)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I can change the signature of all the decoders if you think it's worth a try

@liamzwbao
Copy link
Contributor Author

Any particular benchmark you want me to run?

This one should be good: https://github.com/apache/arrow-rs/blob/main/arrow-json/benches/json-reader.rs
I tested it locally (MacOS M3 MAX) and there's no regression. That's why I brought it up to see if we can get similar results on a cloud server

@liamzwbao
Copy link
Contributor Author

Hi @alamb, just noticed that we don't have benchmark for List yet. I will create one and get back to you later

@liamzwbao liamzwbao marked this pull request as draft March 4, 2026 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants