Description
Describe the bug
Summary
DataFusion currently supports additive schema evolution reasonably well for plain Struct columns, but it fails when the evolved struct is nested inside a container type such as List<Struct>.
This shows up in Parquet scans with a logical schema newer than some physical files. If a nested struct inside a list gains a new nullable field, DataFusion fails planning or execution instead of adapting the older files by filling the new field with nulls.
Version
Observed on DataFusion 52.1.0.
Problem
Given:
- older parquet files with a field shaped like List(Struct(...))
- newer parquet files where the struct inside that list has additional nullable fields
- a scan using the latest logical schema across both old and new files
DataFusion fails with an error like:
Cannot cast struct field 'messages' from type List(Struct(...old shape...)) to type List(Struct(...new shape...))
In my case, the concrete drift is:
- old physical files:
  inputAsset: Struct(type, token, amount)
  outputAsset: Struct(type, token)
- new logical schema:
  inputAsset: Struct(type, token, amount, chain)
  outputAsset: Struct(type, token, chain)
where both chain fields are nullable additions.
Expected behavior
For additive schema evolution, DataFusion should treat nested container cases similarly to plain Struct evolution:
- missing fields in older files should be filled with nulls if the target field is nullable
- extra fields in older or newer files should be ignored when not present in the target
- recursive adaptation should work through:
  - List
  - LargeList
  - FixedSizeList
  - Map
  - combinations like Struct -> List(Struct) -> Struct
This should allow both narrow projections and SELECT * across schema-drifted parquet files without application-side rewriting.
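To make the expected behavior concrete, here is a minimal sketch of the adaptation on a toy data model. The `Value` and `Target` enums and the `adapt` function are hypothetical stand-ins invented for illustration; they are not DataFusion or Arrow types, and the real fix would operate on Arrow arrays rather than per-row values.

```rust
/// Toy stand-in for a row value (hypothetical, not an Arrow array).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Text(String),
    List(Vec<Value>),
    /// Ordered (field name, value) pairs.
    Struct(Vec<(String, Value)>),
}

/// Toy stand-in for the target (logical) type.
#[derive(Debug, Clone)]
enum Target {
    Text,
    List(Box<Target>),
    /// (field name, field type, nullable)
    Struct(Vec<(String, Target, bool)>),
}

/// Recursively reshape `value` to match `target`:
/// - missing nullable struct fields are filled with `Null`
/// - source fields absent from the target are dropped
/// - lists are adapted element by element
/// Simplification: a `Null` value is passed through without checking
/// whether the enclosing target field is nullable.
fn adapt(value: &Value, target: &Target) -> Result<Value, String> {
    match (value, target) {
        (Value::Null, _) => Ok(Value::Null),
        (Value::Text(s), Target::Text) => Ok(Value::Text(s.clone())),
        (Value::List(items), Target::List(elem)) => items
            .iter()
            .map(|item| adapt(item, elem))
            .collect::<Result<Vec<_>, _>>()
            .map(Value::List),
        (Value::Struct(fields), Target::Struct(target_fields)) => {
            let mut out = Vec::with_capacity(target_fields.len());
            for (name, ty, nullable) in target_fields {
                let adapted = match fields.iter().find(|(n, _)| n == name) {
                    Some((_, v)) => adapt(v, ty)?,
                    None if *nullable => Value::Null,
                    None => return Err(format!("missing non-nullable field '{name}'")),
                };
                out.push((name.clone(), adapted));
            }
            Ok(Value::Struct(out))
        }
        (v, t) => Err(format!("cannot adapt {v:?} to {t:?}")),
    }
}
```

With a target of List(Struct(type, token, chain)) and an old row holding only type and token, `adapt` returns the same list with chain set to `Null`. The same call fails if chain is declared non-nullable.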
Actual behavior
DataFusion succeeds for some plain Struct evolution scenarios, but fails when the evolved struct is nested in a list or map-like container.
The failure appears during schema rewriting or cast validation for Parquet scan expressions.
Why this seems like a gap in the current implementation
From reading the current code:
- DefaultPhysicalExprAdapterRewriter::rewrite_column special-cases (Struct, Struct) compatibility and otherwise falls back to generic can_cast_types
- datafusion_common::nested_struct::cast_column special-cases target Struct and otherwise falls back to generic Arrow casting
- as a result, Struct evolution gets custom handling, but List<Struct> does not
So the current behavior looks like:
- supported: Struct -> Struct with missing or extra fields
- not supported: List<Struct> -> List<Struct> with additive nested fields
Relevant code paths
These are the places that seem most relevant:
- datafusion-common/src/nested_struct.rs
  - cast_column
  - validate_struct_compatibility
- datafusion-physical-expr-adapter/src/schema_rewriter.rs
  - DefaultPhysicalExprAdapterRewriter::rewrite_column
- datafusion-physical-expr/src/expressions/cast_column.rs
  - CastColumnExpr::evaluate
Minimal shape of the repro
Logical schema:
data: Struct(
  messages: List(
    Struct(
      kwargs: Struct(
        tool_calls: List(
          Struct(
            args: Struct(
              swaps: List(
                Struct(
                  inputAsset: Struct(
                    amount: Struct(type, value),
                    token: Struct(identifier_type, value),
                    type,
                    chain
                  ),
                  outputAsset: Struct(
                    token: Struct(identifier_type, value),
                    type,
                    chain
                  )
                )
              )
            )
          )
        )
      )
    )
  )
)
Older physical files have the same shape except inputAsset.chain and outputAsset.chain are absent.
Suggested fix direction
A clean fix seems to be:
- Generalize compatibility checking from plain struct fields to recursive nested type compatibility.
- Extend cast_column to recursively adapt container types whose child or value type contains evolved structs.
- Use that recursive compatibility logic from the default physical expression adapter as well.
Concretely, this likely means adding support for recursive adaptation of:
- List
- LargeList
- FixedSizeList
- Map
instead of only Struct.
Proposed semantics
For nested container evolution:
- matching fields should still be cast using existing cast rules
- missing target fields should become null arrays when nullable
- nullable source to non-nullable target should still fail
- extra source fields should still be ignored
- incompatible primitive type changes should still error
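The rules above could be expressed as one recursive compatibility check. The sketch below uses a hypothetical toy `DataType` and `Field` (deliberately simplified, not Arrow's types), and it treats primitives as compatible only when identical, whereas a real implementation would presumably defer to the existing Arrow cast rules. The names `validate` and `validate_type` are made up for this example.

```rust
/// Simplified, hypothetical type model (not arrow::datatypes).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
    List(Box<Field>),
    LargeList(Box<Field>),
    FixedSizeList(Box<Field>, usize),
    /// (key field, value field); Arrow's real Map wraps an entries struct.
    Map(Box<Field>, Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Convenience constructor for the toy `Field`.
fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

/// Check that `source` can be adapted to `target` under the proposed rules.
fn validate(source: &Field, target: &Field) -> Result<(), String> {
    // Nullable source to non-nullable target still fails.
    if source.nullable && !target.nullable {
        return Err(format!("field '{}': nullable source, non-nullable target", target.name));
    }
    validate_type(&source.data_type, &target.data_type, &target.name)
}

fn validate_type(source: &DataType, target: &DataType, name: &str) -> Result<(), String> {
    use DataType::*;
    match (source, target) {
        // Containers recurse into their child fields.
        (List(s), List(t)) | (LargeList(s), LargeList(t)) => validate(s, t),
        (FixedSizeList(s, n), FixedSizeList(t, m)) if n == m => validate(s, t),
        (Map(sk, sv), Map(tk, tv)) => {
            validate(sk, tk)?;
            validate(sv, tv)
        }
        (Struct(source_fields), Struct(target_fields)) => {
            for t in target_fields {
                match source_fields.iter().find(|s| s.name == t.name) {
                    Some(s) => validate(s, t)?,
                    // Missing target field: allowed only when nullable.
                    None if t.nullable => {}
                    None => return Err(format!("missing non-nullable field '{}'", t.name)),
                }
            }
            // Extra source fields are never inspected, so they are ignored.
            Ok(())
        }
        // Toy rule: primitives must match exactly.
        (s, t) if s == t => Ok(()),
        (s, t) => Err(format!("field '{name}': cannot adapt {s:?} to {t:?}")),
    }
}
```

Because the struct arm only iterates over target fields, the same function handles both directions of drift: old files missing chain pass when chain is nullable, and files carrying fields the target does not know about pass as well.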
Tests that would be useful
I think the missing coverage is around:
- List<Struct> where target adds a nullable nested field
- LargeList<Struct> with the same pattern
- FixedSizeList<Struct> with the same pattern
- Map<_, Struct> or map entries containing evolved structs
- recursive case like Struct(messages: List(Struct(...)))
Impact
This currently forces application-level workarounds such as preprocessing or rewriting parquet files to the latest schema before querying, even though the evolution is additive and nullable.
It would be much better if the default Parquet scan path handled this directly, the same way plain Struct evolution is already handled.