fix: enforce schema evolution config at runtime in native_datafusion scan #3689
Closed
andygrove wants to merge 9 commits into apache:main
Conversation
Pass schema_evolution_enabled config through protobuf to the native Parquet reader. When disabled, the SparkPhysicalExprAdapter rejects type mismatches between file and table schemas at runtime, matching Spark's behavior of throwing errors on incompatible types.
Timestamp timezone-only differences (e.g., Timestamp(us, "UTC") vs Timestamp(us, None)) are metadata-only relabelings handled by the existing CometCastColumnExpr, not real schema evolution.
The previous check was too strict, rejecting normal Parquet-to-Spark metadata differences (timestamp units/timezones, list/map/struct field names and nullability) that the adapter already handles. Now only flag actual type promotions (Int32→Int64, Float32→Float64, etc.) while allowing the adapter to handle structural metadata differences.
Unsigned int types (UInt8→Int16, UInt64→Decimal) and FixedSizeBinary→Binary are Parquet physical type mappings handled by the adapter, not schema evolution. Also simplify the schema evolution test since native_datafusion now properly enforces the config.
Use the SparkError/SparkErrorConverter pattern to produce proper SparkException with error class _LEGACY_ERROR_TEMP_2063, matching Spark's native schema mismatch error format.
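The error pattern described above can be sketched as follows. This is a simplified illustration, not Comet's actual definitions: the variant fields and the message wording are assumptions, and the real enum lives alongside the JVM-side `SparkErrorConverter` that rebuilds the `SparkException`.

```rust
use std::error::Error;
use std::fmt;

// Rough sketch of the SparkError pattern: a dedicated variant carries the
// details the JVM side needs to rebuild Spark's exception with error class
// _LEGACY_ERROR_TEMP_2063. Field names and message format are illustrative.
#[derive(Debug)]
enum SparkError {
    SchemaColumnConvertNotSupported {
        column: String,
        physical_type: String,
        logical_type: String,
    },
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Irrefutable destructuring: the sketch has a single variant.
        let SparkError::SchemaColumnConvertNotSupported { column, physical_type, logical_type } =
            self;
        write!(
            f,
            "Parquet column cannot be converted. Column: {column}, Expected: {logical_type}, Found: {physical_type}"
        )
    }
}

impl Error for SparkError {}

fn main() {
    let err = SparkError::SchemaColumnConvertNotSupported {
        column: "c0".to_string(),
        physical_type: "INT64".to_string(),
        logical_type: "INT32".to_string(),
    };
    // In the native code the error would be boxed and wrapped (e.g. as
    // DataFusionError::External) so it reaches the JVM through the
    // existing JSON error path.
    let boxed: Box<dyn Error> = Box::new(err);
    println!("{boxed}");
}
```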
The schema evolution enforcement now properly rejects type mismatches and produces Spark-compatible error messages, allowing these tests to pass with native_datafusion scan.
Spark 4.0 replaced unsupportedSchemaColumnConvertError with FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH error class.
… disabled

The is_type_promotion function was too aggressive, treating standard Parquet type mappings as schema evolution. Spark's vectorized reader natively handles integer family casts (int32→int8/int16/int64), decimal widening, float→double, binary↔string, and int→string conversions without requiring schema evolution. Only truly incompatible conversions (e.g., string→timestamp) should be rejected. Also re-add the IgnoreCometNativeDataFusion tag for the SPARK-35640 int-as-long test, since native_datafusion handles int→long natively (unlike Spark's non-vectorized reader, which the test targets).
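The distinction drawn here can be sketched with a small classification function. The `DataType` enum below is a stripped-down stand-in for Arrow's type enum, and `is_natively_handled_cast` is an illustrative name, not the actual Comet function:

```rust
// Stripped-down stand-in for Arrow's DataType enum, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
    Binary,
}

/// True when reading `file_type` as `table_type` is a cast the vectorized
/// reader handles natively (integer-family casts, float -> double,
/// binary <-> string, int -> string), so it must NOT be flagged as
/// schema evolution.
fn is_natively_handled_cast(file_type: &DataType, table_type: &DataType) -> bool {
    use DataType::*;
    match (file_type, table_type) {
        (a, b) if a == b => true,
        // integer family casts handled by the vectorized reader
        (Int32, Int8 | Int16 | Int64) => true,
        (Float32, Float64) => true,
        (Binary, Utf8) | (Utf8, Binary) => true,
        (Int32 | Int64, Utf8) => true,
        _ => false,
    }
}

fn main() {
    // int32 -> int64 is handled natively; utf8 -> int32 is truly incompatible
    assert!(is_natively_handled_cast(&DataType::Int32, &DataType::Int64));
    assert!(!is_natively_handled_cast(&DataType::Utf8, &DataType::Int32));
    println!("ok");
}
```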
Author: I am leaning towards just documenting this is a difference
Which issue does this PR close?
Closes #3311.
Rationale for this change
When `spark.comet.schemaEvolution.enabled` is set to `false` (the default), the `native_datafusion` scan should reject Parquet files whose physical schema differs from the expected logical schema (e.g., int written as long). Previously, `native_datafusion` silently allowed schema widening, producing incorrect results or confusing errors instead of the expected `SchemaColumnConvertNotSupportedException`-style error that Spark produces.

What changes are included in this PR?
Runtime schema mismatch detection in native code:
- `detect_schema_mismatch()` function in `schema_adapter.rs` that compares logical and physical schemas per-file at runtime
- `is_type_promotion()` recursive function to distinguish real type promotions (Int32→Int64) from adapter-handled differences (timestamp tz/unit, list/map/struct metadata, unsigned ints, FixedSizeBinary)
- `schema_evolution_enabled` config flows from JVM through protobuf to `SparkParquetOptions`

Spark-compatible error conversion:
- `SchemaColumnConvertNotSupported` variant added to the `SparkError` enum
- Errors are wrapped as `DataFusionError::External(SparkError)` so they flow through the JSON error path
- `SchemaColumnConvertNotSupported` handler in `ShimSparkErrorConverter` (all 3 Spark versions) that calls `QueryExecutionErrors.unsupportedSchemaColumnConvertError()`, producing the same `SparkException` with error class `_LEGACY_ERROR_TEMP_2063` that Spark natively produces

Spark SQL test updates:
- Tests (`read binary as timestamp should throw schema incompatible error`, `int as long should throw schema incompatible error`) are enabled for the `native_datafusion` scan since the enforcement now produces matching Spark errors

How are these changes tested?
- `parquet_schema_mismatch_rejected_when_evolution_disabled` validates that type mismatches are rejected when schema evolution is disabled and allowed when enabled
- `ParquetReadSuite` schema evolution tests validate end-to-end behavior
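The behavior that the unit test above validates can be sketched as follows. `Field`, `DataType`, and the function signature are simplified stand-ins for the real Arrow types used in `schema_adapter.rs`, not the actual Comet API:

```rust
// Simplified stand-ins for Arrow schema types, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Float32,
    Float64,
}

struct Field {
    name: String,
    data_type: DataType,
}

/// With schema evolution disabled, reject any file whose physical type
/// requires a real promotion (e.g. an Int32 file column read as an Int64
/// table column); with it enabled, accept the file.
fn detect_schema_mismatch(
    file_schema: &[Field],
    table_schema: &[Field],
    schema_evolution_enabled: bool,
) -> Result<(), String> {
    if schema_evolution_enabled {
        return Ok(());
    }
    for table_field in table_schema {
        if let Some(file_field) = file_schema.iter().find(|f| f.name == table_field.name) {
            // Flag only real type promotions; metadata-only differences
            // (timestamp tz/unit, nested field names) are left to the adapter.
            let is_promotion = matches!(
                (&file_field.data_type, &table_field.data_type),
                (DataType::Int32, DataType::Int64) | (DataType::Float32, DataType::Float64)
            );
            if is_promotion {
                return Err(format!(
                    "column '{}': cannot read {:?} as {:?} when schema evolution is disabled",
                    table_field.name, file_field.data_type, table_field.data_type
                ));
            }
        }
    }
    Ok(())
}

fn main() {
    let file = vec![Field { name: "c0".into(), data_type: DataType::Int32 }];
    let table = vec![Field { name: "c0".into(), data_type: DataType::Int64 }];
    // rejected when evolution is disabled, allowed when enabled
    assert!(detect_schema_mismatch(&file, &table, false).is_err());
    assert!(detect_schema_mismatch(&file, &table, true).is_ok());
    println!("ok");
}
```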