
fix: enforce schema evolution config at runtime in native_datafusion scan#3689

Closed
andygrove wants to merge 9 commits into apache:main from andygrove:fix/schema-evolution-enforcement

Conversation

@andygrove (Member)

Which issue does this PR close?

Closes #3311.

Rationale for this change

When spark.comet.schemaEvolution.enabled is set to false (the default), the native_datafusion scan should reject Parquet files whose physical schema differs from the expected logical schema (e.g., int written as long). Previously, native_datafusion silently allowed schema widening, producing incorrect results or confusing errors instead of the expected SchemaColumnConvertNotSupportedException-style error that Spark produces.

What changes are included in this PR?

Runtime schema mismatch detection in native code:

  • Added detect_schema_mismatch() function in schema_adapter.rs that compares logical and physical schemas per-file at runtime
  • Added is_type_promotion() recursive function to distinguish real type promotions (Int32→Int64) from adapter-handled differences (timestamp tz/unit, list/map/struct metadata, unsigned ints, FixedSizeBinary)
  • The schema_evolution_enabled config flows from JVM through protobuf to SparkParquetOptions
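The runtime check described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual code from `schema_adapter.rs`: the real implementation operates on Arrow's `DataType` and full schemas, whereas this sketch uses a simplified stand-in enum, and the function signatures are assumptions for illustration.

```rust
// Simplified stand-in for Arrow's DataType; the real code in
// schema_adapter.rs operates on arrow_schema::DataType.
#[derive(Debug, PartialEq, Clone)]
enum DataType {
    Int32,
    Int64,
    Int16,
    UInt8,
    Float32,
    Float64,
    List(Box<DataType>),
}

/// Returns true only for genuine type promotions (widening), which
/// schema evolution must gate. Differences the adapter already handles
/// (e.g., unsigned-int mappings from Parquet physical types) are not
/// treated as promotions.
fn is_type_promotion(file_type: &DataType, table_type: &DataType) -> bool {
    use DataType::*;
    match (file_type, table_type) {
        // integer and float widening are real promotions
        (Int32, Int64) | (Float32, Float64) => true,
        // Parquet physical-type mapping handled by the adapter
        (UInt8, Int16) => false,
        // recurse into nested types
        (List(f), List(t)) => is_type_promotion(f, t),
        _ => false,
    }
}

/// Per-file runtime check: reject a promotion only when schema
/// evolution is disabled.
fn detect_schema_mismatch(
    file_type: &DataType,
    table_type: &DataType,
    schema_evolution_enabled: bool,
) -> Result<(), String> {
    if is_type_promotion(file_type, table_type) && !schema_evolution_enabled {
        return Err(format!(
            "schema mismatch: file has {file_type:?}, table expects {table_type:?}"
        ));
    }
    Ok(())
}

fn main() {
    // rejected when evolution is disabled, allowed when enabled
    assert!(detect_schema_mismatch(&DataType::Int32, &DataType::Int64, false).is_err());
    assert!(detect_schema_mismatch(&DataType::Int32, &DataType::Int64, true).is_ok());
    // adapter-handled mapping is never flagged
    assert!(detect_schema_mismatch(&DataType::UInt8, &DataType::Int16, false).is_ok());
    println!("ok");
}
```

The key design point is the asymmetry: the check recurses into nested types but only flags promotions, so metadata-level differences in nested fields fall through to the adapter.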

Spark-compatible error conversion:

  • Added SchemaColumnConvertNotSupported variant to SparkError enum
  • Errors are emitted as DataFusionError::External(SparkError) so they flow through the JSON error path
  • Added SchemaColumnConvertNotSupported handler in ShimSparkErrorConverter (all 3 Spark versions) that calls QueryExecutionErrors.unsupportedSchemaColumnConvertError(), producing the same SparkException with error class _LEGACY_ERROR_TEMP_2063 that Spark natively produces
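The error-conversion path above can be sketched as below. This is a hedged illustration, not the PR's actual code: the variant's field names and the message text are assumptions, and the real `SparkError` is serialized through the JSON error path and converted on the JVM side by `ShimSparkErrorConverter`.

```rust
use std::fmt;

// Illustrative sketch of the new SparkError variant; field names and
// message wording are assumptions, not the actual Comet definitions.
#[derive(Debug)]
enum SparkError {
    SchemaColumnConvertNotSupported {
        column: String,
        physical_type: String,
        logical_type: String,
    },
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SparkError::SchemaColumnConvertNotSupported {
                column,
                physical_type,
                logical_type,
            } => write!(
                f,
                "Parquet column cannot be converted. Column: {column}, \
                 Expected: {logical_type}, Found: {physical_type}"
            ),
        }
    }
}

fn main() {
    // In the real code this value would be wrapped as
    // DataFusionError::External(Box::new(err)) so it survives the
    // trip across the native boundary.
    let err = SparkError::SchemaColumnConvertNotSupported {
        column: "a".to_string(),
        physical_type: "INT64".to_string(),
        logical_type: "INT32".to_string(),
    };
    println!("{err}");
}
```

Carrying a typed variant (rather than a plain string) is what lets the JVM-side converter dispatch to `QueryExecutionErrors.unsupportedSchemaColumnConvertError()` and reproduce Spark's exact error class.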

Spark SQL test updates:

  • Unignored the SPARK-35640 tests ("read binary as timestamp should throw schema incompatible error" and "int as long should throw schema incompatible error") for the native_datafusion scan, since the enforcement now produces matching Spark errors

How are these changes tested?

  • Rust unit test parquet_schema_mismatch_rejected_when_evolution_disabled validates that type mismatches are rejected when schema evolution is disabled and allowed when enabled
  • Existing ParquetReadSuite schema evolution tests validate end-to-end behavior
  • Spark SQL tests (SPARK-35640) run in CI with Comet enabled to verify error compatibility

Pass the schema_evolution_enabled config through protobuf to the native Parquet reader. When disabled, the SparkPhysicalExprAdapter rejects type mismatches between file and table schemas at runtime, matching Spark's behavior of throwing errors on incompatible types.

Timestamp timezone-only differences (e.g., Timestamp(us, "UTC") vs Timestamp(us, None)) are metadata-only relabelings handled by the existing CometCastColumnExpr, not real schema evolution.

The previous check was too strict, rejecting normal Parquet-to-Spark metadata differences (timestamp units/timezones, list/map/struct field names and nullability) that the adapter already handles. The check now flags only actual type promotions (Int32→Int64, Float32→Float64, etc.) while allowing the adapter to handle structural metadata differences.

Unsigned int types (UInt8→Int16, UInt64→Decimal) and FixedSizeBinary→Binary are Parquet physical type mappings handled by the adapter, not schema evolution. Also simplify the schema evolution test, since native_datafusion now properly enforces the config.

Use the SparkError/SparkErrorConverter pattern to produce a proper SparkException with error class _LEGACY_ERROR_TEMP_2063, matching Spark's native schema mismatch error format.

The schema evolution enforcement now properly rejects type mismatches and produces Spark-compatible error messages, allowing these tests to pass with the native_datafusion scan.

Spark 4.0 replaced unsupportedSchemaColumnConvertError with the FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH error class.
@andygrove andygrove marked this pull request as draft March 13, 2026 17:53
… disabled

The is_type_promotion function was too aggressive, treating standard
Parquet type mappings as schema evolution. Spark's vectorized reader
natively handles integer family casts (int32→int8/int16/int64), decimal
widening, float→double, binary↔string, and int→string conversions
without requiring schema evolution. Only truly incompatible conversions
(e.g., string→timestamp) should be rejected.

Also re-add the IgnoreCometNativeDataFusion tag for the SPARK-35640 int-as-long test, since native_datafusion handles int→long natively (unlike Spark's non-vectorized reader, which the test targets).
@andygrove (Member, Author)

I am leaning towards just documenting this as a difference.

@andygrove andygrove closed this Mar 14, 2026


Development

Successfully merging this pull request may close these issues.

[native_datafusion] [Spark SQL Tests] Schema incompatibility tests expect exceptions that native_datafusion handles gracefully
