
[SPARK-56175][SQL] FileTable implements SupportsPartitionManagement and V2 catalog table loading#55034

Draft
LuciferYang wants to merge 2 commits into apache:master from LuciferYang:SPARK-56175

Conversation

Contributor

@LuciferYang commented Mar 26, 2026

What changes were proposed in this pull request?

This PR is part of SPARK-56170 (Remove file source V2 gate and unify V1/V2 file source paths). It makes three major changes:

1. FileTable implements SupportsPartitionManagement

FileTable now extends SupportsPartitionManagement with filesystem-based partition operations:

  • createPartition: creates partition directory and syncs to catalog metastore
  • dropPartition: deletes partition directory and syncs to catalog metastore
  • listPartitionIdentifiers: discovers partitions from filesystem directory structure
  • partitionSchema: returns partition schema from fileIndex, userSpecifiedPartitioning, or catalogTable
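To make the shape of these filesystem-based partition operations concrete, here is a minimal, self-contained Scala sketch. The trait and class below are simplified stand-ins that only mirror the names of Spark's connector API (the real `SupportsPartitionManagement` uses `InternalRow` identifiers and a `StructType` partition schema, and the real `FileTable` syncs to the catalog metastore); everything here is illustrative, not the PR's actual implementation.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Simplified stand-in for Spark's connector interface (illustration only).
trait SupportsPartitionManagement {
  def createPartition(spec: Map[String, String]): Unit
  def dropPartition(spec: Map[String, String]): Boolean
  def listPartitionIdentifiers(): Seq[Map[String, String]]
}

// Filesystem-backed partitions laid out Hive-style: root/col=value/
class FileTableSketch(root: Path, partitionCol: String)
    extends SupportsPartitionManagement {

  private def dirFor(spec: Map[String, String]): Path =
    root.resolve(s"$partitionCol=${spec(partitionCol)}")

  // A real implementation would also sync the new partition to the metastore.
  override def createPartition(spec: Map[String, String]): Unit =
    Files.createDirectories(dirFor(spec))

  // Assumes the partition directory is empty; returns false if it is absent.
  override def dropPartition(spec: Map[String, String]): Boolean = {
    val dir = dirFor(spec)
    Files.exists(dir) && { Files.delete(dir); true }
  }

  // Discovers partitions from the directory structure, as the PR describes.
  override def listPartitionIdentifiers(): Seq[Map[String, String]] =
    Files.list(root).iterator().asScala
      .filter(p => Files.isDirectory(p))
      .map(_.getFileName.toString.split("=", 2))
      .collect { case Array(k, v) => Map(k -> v) }
      .toSeq
}
```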

2. V2SessionCatalog.loadTable returns FileTable

V2SessionCatalog.loadTable now returns the V2 FileTable (e.g., ParquetTable, OrcTable, CSVTable) instead of V1Table for file-based catalog tables. This enables V2 capabilities for catalog tables:

  • Sets catalogTable and useCatalogFileIndex on FileTable for catalog metadata access
  • getDataSourceOptions includes storage.properties for proper option propagation (CSV header, ORC bloom filter columns, etc.)
  • FileTable.columns() restores NOT NULL constraints from catalog table metadata
  • FileTable.partitioning() falls back to catalog partition columns when fileIndex has no partition info
  • FileTable.fileIndex uses CatalogFileIndex when catalog has registered partitions with custom locations
  • FileTable.schema checks column name duplication for non-catalog tables only (catalog tables are handled by the analyzer)
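The new `loadTable` dispatch can be pictured with the following hypothetical, self-contained sketch. The `Table`, `V1Table`, and `FileTable` case classes here are toy stand-ins (the real ones are Spark internals with far richer state), and the string-based provider check only illustrates the decision of when to hand back a V2 file table with catalog metadata attached.

```scala
// Toy stand-ins for Spark's table abstractions (illustration only).
sealed trait Table
case class V1Table(name: String) extends Table
case class FileTable(name: String, provider: String,
                     useCatalogFileIndex: Boolean) extends Table

val fileProviders = Set("parquet", "orc", "csv", "json", "text", "avro")

// Sketch of the decision: file-based catalog tables now come back as
// FileTable, with useCatalogFileIndex set when the catalog has registered
// partitions (e.g. custom partition locations); everything else stays V1.
def loadTable(name: String, provider: String,
              hasRegisteredPartitions: Boolean): Table =
  if (fileProviders.contains(provider.toLowerCase))
    FileTable(name, provider, useCatalogFileIndex = hasRegisteredPartitions)
  else
    V1Table(name)
```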

3. Gate removal and V1 fallbacks

  • DataSourceV2Utils.getTableProvider: Removed FileDataSourceV2 gate that prevented V2 provider resolution for file sources
  • DataFrameWriter.insertInto: Enabled V2 path for file sources (previously blocked with TODO)
  • DataFrameWriter.saveAsTable: Kept V1 fallback because Overwrite mode creates ReplaceTableAsSelect which requires StagingTableCatalog (TODO: SPARK-56230)
  • ResolveSessionCatalog: Added V1 command fallbacks for FileTable-backed session catalog tables. Since FileTable doesn't match V1 extractors (ResolvedV1TableIdentifier, etc.), we intercept these commands and delegate to V1 using catalogTable metadata:
    • AnalyzeTable, AnalyzeColumn
    • TruncateTable, TruncatePartition
    • ShowPartitions
    • RecoverPartitions
    • AddPartitions, RenamePartitions, DropPartitions
    • SetTableLocation
    • CREATE TABLE data type validation and REPLACE TABLE blocking for file sources
  • FindDataSourceTable: Added streaming V1 fallback for FileTable. Since FileTable lacks MICRO_BATCH_READ/CONTINUOUS_READ capabilities, streaming reads from catalog file tables fall back to V1 StreamingRelation (TODO: SPARK-56233)
  • DataSource.planForWritingFileFormat: Changed .collect{}.head to .collectFirst + .flatMap to gracefully handle V2 tables (which resolve to DataSourceV2Relation instead of LogicalRelation)
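The last bullet's `.collect{}.head` → `.collectFirst` change is a small but easy-to-miss robustness fix. The sketch below reproduces the pattern with plain Scala collections and hypothetical plan-node case classes; in the PR the matching happens over logical plan nodes, where a V2 table resolves to `DataSourceV2Relation` and would previously make `.head` throw.

```scala
// Hypothetical plan nodes standing in for Spark's logical plan (sketch only).
sealed trait Node
case class LogicalRelation(path: String) extends Node
case class DataSourceV2Relation(name: String) extends Node

// collectFirst returns None when no LogicalRelation is present, whereas
// .collect { ... }.head would throw NoSuchElementException on a V2 plan.
def findV1Path(plan: Seq[Node]): Option[String] =
  plan.collectFirst { case LogicalRelation(p) => p }

val v1Plan = Seq(LogicalRelation("/data/t"))
val v2Plan = Seq(DataSourceV2Relation("t"))
```

In the PR the resulting `Option` is then chained with `.flatMap`, so the absence of a V1 relation flows through gracefully instead of aborting planning.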

Helper: partSpecToMap

Added a partSpecToMap helper in ResolveSessionCatalog that converts PartitionSpec (either ResolvedPartitionSpec or UnresolvedPartitionSpec) to V1's Map[String, String] format. This is needed because ResolvePartitionSpec may resolve partition specs before ResolveSessionCatalog runs (both are in the Resolution batch), and calling asUnresolvedPartitionSpecs on already-resolved specs would throw ClassCastException.
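The shape of that helper can be sketched as follows. The two case classes are simplified mirrors of `UnresolvedPartitionSpec` and `ResolvedPartitionSpec` (the real resolved form carries typed `InternalRow` values rather than strings); the point is the two-way match that removes the `ClassCastException`.

```scala
// Simplified mirrors of Spark's partition-spec shapes (sketch only).
sealed trait PartitionSpec
case class UnresolvedPartitionSpec(spec: Map[String, String]) extends PartitionSpec
case class ResolvedPartitionSpec(names: Seq[String],
                                 values: Seq[String]) extends PartitionSpec

// Handles both shapes, so a spec already resolved by ResolvePartitionSpec
// no longer triggers a ClassCastException on the V1 fallback path.
def partSpecToMap(spec: PartitionSpec): Map[String, String] = spec match {
  case UnresolvedPartitionSpec(m)    => m
  case ResolvedPartitionSpec(ns, vs) => ns.zip(vs).toMap
}
```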

Why are the changes needed?

This is a key step toward unifying V1/V2 file source paths (SPARK-56170). By having V2SessionCatalog return native V2 FileTable instances for file-based catalog tables, we enable:

  • V2 partition management for file tables
  • V2 read/write paths for catalog table operations
  • Gradual removal of V1 fallback code as V2 capabilities mature

The V1 fallbacks in ResolveSessionCatalog ensure backward compatibility for commands that don't yet have V2-native implementations.

Does this PR introduce any user-facing change?

No. The behavior is functionally equivalent. Internally, catalog file tables are now loaded as V2 FileTable instances, but all user-visible operations produce the same results through V1 command fallbacks where needed.

How was this patch tested?

Existing tests updated to handle both V1 and V2 plan structures:

  • DataStreamTableAPISuite: streaming read with file source tables
  • InMemoryColumnarQuerySuite: catalog stats after ANALYZE TABLE (TODO: SPARK-56232 for V2 stats propagation)
  • CSVv2Suite: CSV parsing with char/varchar type columns
  • JsonV2Suite: case sensitivity of filter references
  • OrcSourceV2Suite: bloom filter creation and selective dictionary encoding
  • OrcV2QuerySuite: ORC file format detection in query plans
  • ParquetV2QuerySuite: INT96 to TIMESTAMP_MICROS migration
  • FileSourceSQLInsertTestSuite: partition spec handling with keepPartitionSpecAsStringLiteral

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6

…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
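The builder-mixin pattern the commit describes for `createFileWriteBuilder` can be sketched as below. The three traits are heavily simplified stand-ins modeled loosely on DataSource V2's `WriteBuilder`, `SupportsTruncate`, and `SupportsDynamicOverwrite` (the real `build()` returns a `Write`, not a string); this shows only how the mode toggles compose.

```scala
// Simplified mixins modeled loosely on the DataSource V2 write API (sketch).
trait WriteBuilder { def build(): String }
trait SupportsTruncate extends WriteBuilder {
  def truncate(): WriteBuilder
}
trait SupportsDynamicOverwrite extends WriteBuilder {
  def overwriteDynamicPartitions(): WriteBuilder
}

// A file write builder that starts in append mode and can be switched to
// truncate or dynamic-partition-overwrite, as the commit enables for the
// file formats (Parquet, ORC, CSV, JSON, Text, Avro).
class FileWriteBuilder(mode: String = "append")
    extends SupportsTruncate with SupportsDynamicOverwrite {
  override def truncate(): WriteBuilder =
    new FileWriteBuilder("truncate")
  override def overwriteDynamicPartitions(): WriteBuilder =
    new FileWriteBuilder("dynamicOverwrite")
  override def build(): String = mode // stand-in for constructing the Write
}
```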
…catalog table loading, and gate removal

Key changes:
- FileTable extends SupportsPartitionManagement with createPartition,
  dropPartition, listPartitionIdentifiers, partitionSchema
- Partition operations sync to catalog metastore (best-effort)
- V2SessionCatalog.loadTable returns FileTable instead of V1Table,
  sets catalogTable and useCatalogFileIndex on FileTable
- V2SessionCatalog.getDataSourceOptions includes storage.properties
  for proper option propagation (header, ORC bloom filter, etc.)
- V2SessionCatalog.createTable validates data types via FileTable
- FileTable.columns() restores NOT NULL constraints from catalogTable
- FileTable.partitioning() falls back to userSpecifiedPartitioning
  or catalog partition columns
- FileTable.fileIndex uses CatalogFileIndex when catalog has
  registered partitions (custom partition locations)
- FileTable.schema checks column name duplication for non-catalog
  tables only
- DataSourceV2Utils.getTableProvider: removed FileDataSourceV2 gate
- DataFrameWriter.insertInto: enabled V2 for file sources
- DataFrameWriter.saveAsTable: V1 fallback (TODO: SPARK-56230)
- ResolveSessionCatalog: V1 fallback for FileTable-backed commands
  (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition,
  ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions,
  DropPartitions, SetTableLocation, CREATE TABLE validation,
  REPLACE TABLE blocking)
- FindDataSourceTable: streaming V1 fallback for FileTable
  (TODO: SPARK-56233)
- DataSource.planForWritingFileFormat: graceful V2 handling
@LuciferYang LuciferYang marked this pull request as draft March 26, 2026 13:42
Contributor Author

This PR takes #54998 as its baseline. It is the second step for SPARK-56170. Commit f853912 contains the actual changes for SPARK-56175. SPARK-56170 appears to involve a considerable number of tasks, and I'm not sure whether all of them can be completed before the 4.2 release.
