
Acceleration: Iceberg table compaction #3519

Open
Shekharrajak wants to merge 6 commits into apache:main from
Shekharrajak:feature/iceberg-compaction-benchmark

Conversation

@Shekharrajak
Contributor

Which issue does this PR close?

Ref #3371

PR Description

Rationale for this change

Iceberg table compaction using Spark's default rewriteDataFiles() action is slow because of Spark shuffle and task-scheduling overhead. This PR adds a native Rust-based compaction path that uses DataFusion for direct Parquet read/write, achieving a 1.5-1.8x speedup over Spark's default compaction.
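
For context, the Spark baseline being compared against is Iceberg's rewriteDataFiles action. A minimal sketch of that baseline (the option value is illustrative, not from this PR):

```scala
import org.apache.iceberg.Table
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.spark.sql.SparkSession

// Iceberg's Spark-based bin-pack rewrite: the work runs as Spark tasks,
// which is the shuffle/scheduling overhead the native path avoids.
def sparkCompact(spark: SparkSession, table: Table): Unit = {
  val result = SparkActions.get(spark)
    .rewriteDataFiles(table)
    .binPack()                      // default strategy, shown explicitly
    .option("min-input-files", "2") // illustrative tuning option
    .execute()
  println(s"Rewrote ${result.rewrittenDataFilesCount()} data files")
}
```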

What changes are included in this PR?

  • Native Rust compaction: DataFusion-based Parquet read/write via JNI (iceberg_compaction_jni.rs)
  • Scala integration: a CometNativeCompaction class that runs the native scan + write via JNI and commits the result through the Iceberg Java API
  • Configuration: the spark.comet.iceberg.compaction.enabled config option (enabling sketch below)
  • Benchmark: a TPC-H-based compaction benchmark comparing Spark vs. native performance
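
A minimal sketch of enabling the flag. Only the config key comes from this PR; how the native path is then triggered is not spelled out here, so treat the surrounding settings as illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Only the comet config key below comes from this PR;
// master/app settings are illustrative.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("iceberg-native-compaction")
  .config("spark.comet.iceberg.compaction.enabled", "true")
  .getOrCreate()
```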

How are these changes tested?

  • Unit tests in CometIcebergCompactionSuite covering:
    • Non-partitioned table compaction
    • Partitioned table compaction (bucket, truncate, date partitions)
    • Data correctness verification after compaction (sketch below)
  • TPC-H benchmark (CometIcebergTPCCompactionBenchmark) measuring performance on lineitem, orders, customer tables
  • Manual testing with SF1 TPC-H data showing:
    • lineitem (6M rows): 7.2s → 4.4s (1.6x)
    • orders (1.5M rows): 1.5s → 0.9s (1.8x)
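
A hypothetical shape for the correctness check referenced above (the suite context, helper name, and table are assumptions, not the PR's exact code):

```scala
// Inside a ScalaTest suite with a SparkSession `spark` in scope.
// runNativeCompaction is an assumed helper wrapping CometNativeCompaction.
test("native compaction preserves table contents") {
  val before = spark.table("db.lineitem").orderBy("l_orderkey", "l_linenumber").collect()
  runNativeCompaction("db.lineitem")
  val after = spark.table("db.lineitem").orderBy("l_orderkey", "l_linenumber").collect()
  assert(before.sameElements(after))
}
```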

@Shekharrajak changed the title from "Feature/iceberg compaction benchmark" to "Iceberg table compaction" on Feb 14, 2026

//! Iceberg Parquet writer operator for writing RecordBatches to Parquet files
//! with Iceberg-compatible metadata (DataFile structures).
@Shekharrajak (Contributor Author)

DataFusion execution operator that writes Arrow RecordBatches to Parquet files with Iceberg-compatible metadata.

It enables native Rust to produce files that Iceberg's Java API can directly commit. The file metadata is serialized as JSON and passed back to the JVM via JNI for the commit.
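
On the JVM side, committing from that JSON metadata could look roughly like the sketch below, using Iceberg's public RewriteFiles API. The JSON field names and the NativeFileMeta carrier are assumptions, not this PR's exact code:

```scala
import scala.jdk.CollectionConverters._
import org.apache.iceberg.{DataFile, DataFiles, FileFormat, Table}

// Assumed carrier for the JSON fields the native side returns.
case class NativeFileMeta(path: String, sizeBytes: Long, recordCount: Long)

// Build DataFiles from the native writer's metadata, then swap old files
// for new ones in a single rewrite snapshot.
def commitCompaction(table: Table, removed: Set[DataFile], added: Seq[NativeFileMeta]): Unit = {
  val newFiles: Set[DataFile] = added.map { m =>
    DataFiles.builder(table.spec())
      .withPath(m.path)
      .withFormat(FileFormat.PARQUET)
      .withFileSizeInBytes(m.sizeBytes)
      .withRecordCount(m.recordCount)
      .build()
  }.toSet

  table.newRewrite()
    .rewriteFiles(removed.asJava, newFiles.asJava)
    .commit()
}
```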


//! JNI bridge for Iceberg compaction operations.
//!
//! This module provides JNI functions for native Iceberg compaction (scan + write).
@Shekharrajak (Contributor Author)

JNI bridge that exposes native Rust compaction to Scala/JVM.

executeIcebergCompaction(): the JNI entry point; reads Parquet files via DataFusion and writes the compacted output.
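
On the Scala side the binding is presumably a plain JNI native method. A minimal sketch; the class name Native and getIcebergCompactionVersion appear in excerpts below, while the library name and loading mechanism are assumptions:

```scala
// Hypothetical JNI binding; Comet's real Native class is more involved.
class Native {
  @native def executeIcebergCompaction(configJson: String): String
  @native def getIcebergCompactionVersion(): String
}

object NativeLoader {
  // Library name assumed; must match the compiled Rust cdylib.
  System.loadLibrary("comet")
}
```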


/// Configuration for Iceberg table metadata passed from JVM
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct IcebergTableConfig {
@Shekharrajak (Contributor Author), Feb 14, 2026

Table metadata from JVM (identifier, warehouse, snapshot ID, file IO props)
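
The JVM side presumably assembles that JSON before the JNI call. A sketch with assumed field names mirroring the list above; real code would use a JSON library and the Rust struct's actual serde field names:

```scala
// Field names here are assumptions mirroring the comment above.
case class TableConfig(
    tableIdentifier: String,
    warehouseLocation: String,
    snapshotId: Long,
    fileIoProps: Map[String, String])

def toJson(c: TableConfig): String = {
  val props = c.fileIoProps
    .map { case (k, v) => s""""$k":"$v"""" }
    .mkString("{", ",", "}")
  s"""{"table_identifier":"${c.tableIdentifier}",""" +
    s""""warehouse_location":"${c.warehouseLocation}",""" +
    s""""snapshot_id":${c.snapshotId},"file_io_props":$props}"""
}
```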


logDebug(s"Executing native compaction with config: $configJson")

val resultJson = native.executeIcebergCompaction(configJson)
@Shekharrajak (Contributor Author)

The JNI entry point: reads Parquet files via DataFusion and writes the compacted output.


def isAvailable: Boolean = {
try {
val version = new Native().getIcebergCompactionVersion()
@Shekharrajak (Contributor Author)

Returns the native library version, used for compatibility checks.
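
The excerpt above is truncated; a sketch of how such a gate typically completes (the error handling is an assumption):

```scala
// Treat any linkage failure as "native compaction unavailable" so the
// caller can fall back to Spark's default rewrite.
def isAvailable: Boolean =
  try {
    new Native().getIcebergCompactionVersion() != null
  } catch {
    case _: UnsatisfiedLinkError | _: NoClassDefFoundError => false
  }
```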

"Iceberg reflection failure: Failed to get filter expressions from SparkScan: " +
s"${e.getMessage}")
None
findMethodInHierarchy(scan.getClass, "filterExpressions").flatMap { filterExpressionsMethod =>
@Shekharrajak (Contributor Author), Feb 14, 2026

Previously we assumed a fixed Iceberg class hierarchy; findMethodInHierarchy instead walks up the class tree, which is a more robust approach.

For compaction to work, we need to extract FileScanTask objects from the scan. Different Iceberg scan types expose tasks differently:

SparkBatchQueryScan -> tasks() method
SparkStagedScan -> taskGroups() method (returns task groups; tasks must be extracted from each group)
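
A sketch of the hierarchy walk described above; the name findMethodInHierarchy comes from the diff, while the exact signature is an assumption:

```scala
import java.lang.reflect.Method

// Walk the class hierarchy looking for a declared method by name.
// getDeclaredMethods also sees non-public methods, which matters because
// Iceberg's Spark scan classes are package-private.
def findMethodInHierarchy(cls: Class[_], name: String): Option[Method] = {
  var c: Class[_] = cls
  while (c != null) {
    c.getDeclaredMethods.find(_.getName == name) match {
      case Some(m) =>
        m.setAccessible(true)
        return Some(m)
      case None =>
        c = c.getSuperclass
    }
  }
  None
}

// Per the comment above: findMethodInHierarchy(scan.getClass, "tasks") for
// SparkBatchQueryScan, or "taskGroups" for SparkStagedScan.
```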

@Shekharrajak changed the title from "Iceberg table compaction" to "Acceleration: Iceberg table compaction" on Feb 14, 2026