Skip to content

feat: aes encrypt support#3211

Open
Shekharrajak wants to merge 21 commits intoapache:mainfrom
Shekharrajak:feature/aes-encrypt-support
Open

feat: aes encrypt support#3211
Shekharrajak wants to merge 21 commits intoapache:mainfrom
Shekharrajak:feature/aes-encrypt-support

Conversation

@Shekharrajak
Copy link
Contributor

@Shekharrajak Shekharrajak commented Jan 17, 2026

Which issue does this PR close?

Closes #3187

Rationale for this change

The aes_encrypt function was fully implemented.

CometScalarFunction serialized expressions to protobuf without including the return type. This caused the native planner to attempt looking up the function in DataFusion's built-in UDF registry, which failed with "There is no UDF named 'aes_encrypt' in the registry."

What changes are included in this PR?

  1. Fix scalar function serialization (CometScalarFunction.scala):
    - Changed to use scalarFunctionExprToProtoWithReturnType() instead of scalarFunctionExprToProto()
    - Now passes expr.dataType to avoid native registry lookup
    - Enables native execution for all Comet-specific scalar functions
  2. Add comprehensive test suite (CometStaticInvokeSuite.scala)

How are these changes tested?

  1. Unit Tests: CometStaticInvokeSuite validates:
    • Native execution via physical plan inspection (assert(plan.contains("CometProject")))
  2. Benchmark Results: CometEncryptionBenchmark shows native execution with performance improvements:

@Shekharrajak
Copy link
Contributor Author

Findings with first version , where there was no improvements in benchmarks :

  1. aes_encrypt is a RuntimeReplaceable expression in Spark
  2. Spark converts it to StaticInvoke calling Java's native crypto libraries before Comet sees it
  3. Even with our StaticInvoke registration, Comet's physical planner falls back to Spark for the entire projection operator

@Shekharrajak
Copy link
Contributor Author

Shekharrajak commented Jan 18, 2026

Got improvements :


  SQL:
  SELECT hex(aes_encrypt(
    cast(data as binary),
    cast(key as binary),
    'CBC'
  ))
  FROM t1;

  Benchmark Results:
  aes_encrypt_cbc:                          Best Time(ms)   Relative
  ------------------------------------------------------------------
  Spark                                               260        1.0X
  Comet (Scan)                                        260        1.0X
  Comet (Scan + Exec)                                 174        1.5X 

override def convert(expr: T, inputs: Seq[Attribute], binding: Boolean): Option[Expr] = {
val childExpr = expr.children.map(exprToProtoInternal(_, inputs, binding))
val optExpr = scalarFunctionExprToProto(name, childExpr: _*)
// Pass return type to avoid native lookup in DataFusion registry
Copy link
Contributor Author

@Shekharrajak Shekharrajak Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When expr.return_type is None, it tries to look up the function in DataFusion's built-in UDF registry , which doesn't have aes_encrypt because it's a Comet-specific function registered later via create_comet_physical_fun

@Shekharrajak
Copy link
Contributor Author

Benchmark result :

➜  datafusion-comet git:(feature/aes-encrypt-support) ✗ cat spark/benchmarks/CometEncryptionBenchmark-jdk17-results.txt 
================================================================================================
Encryption expressions
================================================================================================

================================================================================================
aes_encrypt_gcm_basic
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_gcm_basic:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               274            279           3          0.4        2740.6       1.0X
Comet (Scan)                                        270            275           7          0.4        2703.9       1.0X
Comet (Scan + Exec)                                 230            238           8          0.4        2300.4       1.2X


================================================================================================
aes_encrypt_gcm_with_mode
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_gcm_with_mode:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               266            276          11          0.4        2661.5       1.0X
Comet (Scan)                                        264            269           4          0.4        2638.5       1.0X
Comet (Scan + Exec)                                 228            234           9          0.4        2279.5       1.2X


================================================================================================
aes_encrypt_cbc
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_cbc:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               260            263           2          0.4        2598.5       1.0X
Comet (Scan)                                        260            266           6          0.4        2595.3       1.0X
Comet (Scan + Exec)                                 174            176           1          0.6        1736.1       1.5X


================================================================================================
aes_encrypt_ecb
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_ecb:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               221            224           6          0.5        2206.5       1.0X
Comet (Scan)                                        223            228           4          0.4        2227.1       1.0X
Comet (Scan + Exec)                                 154            160           6          0.7        1537.5       1.4X


================================================================================================
aes_encrypt_gcm_with_iv
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_gcm_with_iv:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               236            240           4          0.4        2360.4       1.0X
Comet (Scan)                                        240            243           5          0.4        2395.7       1.0X
Comet (Scan + Exec)                                 226            236          20          0.4        2264.4       1.0X


================================================================================================
aes_encrypt_gcm_with_aad
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_gcm_with_aad:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               246            254          11          0.4        2460.1       1.0X
Comet (Scan)                                        247            252           4          0.4        2474.7       1.0X
Comet (Scan + Exec)                                 230            235           7          0.4        2300.4       1.1X


================================================================================================
aes_encrypt_with_base64
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_with_base64:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               251            255           2          0.4        2513.7       1.0X
Comet (Scan)                                        254            263           9          0.4        2543.9       1.0X
Comet (Scan + Exec)                                 259            268          10          0.4        2587.7       1.0X


================================================================================================
aes_encrypt_long_data
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11 on Mac OS X 26.2
Apple M4 Max
aes_encrypt_long_data:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              2264           2327          89          0.0       22641.7       1.0X
Comet (Scan)                                       2283           2394         156          0.0       22833.5       1.0X
Comet (Scan + Exec)                                2635           2711         108          0.0       26346.5       0.9X

@Shekharrajak Shekharrajak changed the title feat: [DRAFT] aes encrypt support feat: aes encrypt support Jan 18, 2026
@Shekharrajak Shekharrajak force-pushed the feature/aes-encrypt-support branch from 643c1ba to 2401225 Compare January 29, 2026 17:43
@Shekharrajak
Copy link
Contributor Author

Please trigger the workflow

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for the aes_encrypt function to Comet, enabling native execution of AES encryption operations that would otherwise fall back to Spark's JVM implementation. The PR also includes a critical fix to CometScalarFunction serialization that was preventing proper registration of Comet-specific scalar functions.

Changes:

  • Fixed CometScalarFunction to use scalarFunctionExprToProtoWithReturnType() instead of scalarFunctionExprToProto(), passing the return type to avoid native registry lookup failures
  • Implemented comprehensive AES encryption support for ECB, CBC, and GCM cipher modes with PKCS7 padding in Rust
  • Added test suites (CometStaticInvokeSuite and CometEncryptionSuite) and benchmark suite (CometEncryptionBenchmark) for validation

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
spark/src/main/scala/org/apache/comet/serde/CometScalarFunction.scala Critical serialization fix to pass return type for Comet scalar functions
spark/src/main/scala/org/apache/comet/serde/statics.scala Registration of aes_encrypt StaticInvoke mapping to CometScalarFunction
spark/src/test/scala/org/apache/comet/CometStaticInvokeSuite.scala Test suite validating native execution of aes_encrypt with various parameter combinations
spark/src/test/scala/org/apache/comet/CometEncryptionSuite.scala Additional encryption test cases
spark/src/test/scala/org/apache/spark/sql/benchmark/CometEncryptionBenchmark.scala Performance benchmark suite for encryption expressions
native/spark-expr/src/encryption_funcs/aes_encrypt.rs Main AES encryption implementation supporting scalar and batch processing
native/spark-expr/src/encryption_funcs/cipher_modes.rs ECB, CBC, and GCM cipher mode implementations
native/spark-expr/src/encryption_funcs/crypto_utils.rs Cryptographic utility functions for key/IV validation and random IV generation
native/spark-expr/src/encryption_funcs/mod.rs Module organization and public exports
native/spark-expr/src/comet_scalar_funcs.rs Registration of aes_encrypt UDF in native function registry
native/spark-expr/src/lib.rs Module declaration for encryption_funcs
native/spark-expr/Cargo.toml Added cryptography dependencies (aes, aes-gcm, cbc, cipher, ecb)
native/Cargo.lock Dependency lock file updates for new cryptography crates
spark/.gitignore Added spark-warehouse directory to gitignore (redundant with root .gitignore)
docs/source/user-guide/latest/configs.md Auto-generated configuration documentation updates
docs/source/user-guide/latest/compatibility.md Auto-generated compatibility matrix updates

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 24 to 100
class CometStaticInvokeSuite extends CometTestBase {

test("aes_encrypt basic - verify native execution") {
withTable("t1") {
sql("CREATE TABLE t1(data STRING, key STRING) USING parquet")
sql("""INSERT INTO t1 VALUES
('Spark', '0000111122223333'),
('SQL', 'abcdefghijklmnop')""")

val query = """
SELECT
data,
hex(aes_encrypt(cast(data as binary), cast(key as binary))) as encrypted
FROM t1
"""

checkSparkAnswerAndOperator(query)

val df = sql(query)
val plan = df.queryExecution.executedPlan.toString
assert(
plan.contains("CometProject") || plan.contains("CometNative"),
s"Expected native execution but got Spark fallback:\n$plan")
}
}

test("aes_encrypt with mode") {
withTable("t1") {
sql("CREATE TABLE t1(data STRING, key STRING) USING parquet")
sql("INSERT INTO t1 VALUES ('test', '1234567890123456')")

val query = """
SELECT hex(aes_encrypt(cast(data as binary), cast(key as binary), 'GCM'))
FROM t1
"""

checkSparkAnswerAndOperator(query)
}
}

test("aes_encrypt with all parameters") {
withTable("t1") {
sql("CREATE TABLE t1(data STRING, key STRING) USING parquet")
sql("INSERT INTO t1 VALUES ('test', '1234567890123456')")

val query = """
SELECT hex(aes_encrypt(
cast(data as binary),
cast(key as binary),
'GCM',
'DEFAULT',
cast('initialization' as binary),
cast('additional' as binary)
))
FROM t1
"""

checkSparkAnswerAndOperator(query)
}
}

test("aes_encrypt wrapped in multiple functions") {
withTable("t1") {
sql("CREATE TABLE t1(data STRING, key STRING) USING parquet")
sql("INSERT INTO t1 VALUES ('test', '1234567890123456')")

val query = """
SELECT
upper(hex(aes_encrypt(cast(data as binary), cast(key as binary)))) as encrypted,
length(hex(aes_encrypt(cast(data as binary), cast(key as binary)))) as len
FROM t1
"""

checkSparkAnswerAndOperator(query)
}
}
}
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage is missing important edge cases for the aes_encrypt function. Consider adding tests for: null input/key handling, invalid key lengths (e.g., 8 bytes, 15 bytes, 17 bytes), invalid IV lengths for different modes, unsupported mode/padding combinations, empty input data, and very large input data. These test cases would help ensure robust error handling and compatibility with Spark's behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +152 to +158
match get_cipher_mode(mode, padding) {
Ok(cipher) => match cipher.encrypt(input, key, iv, aad) {
Ok(encrypted) => builder.append_value(&encrypted),
Err(_) => builder.append_null(),
},
Err(_) => builder.append_null(),
}
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling in batch processing silently converts encryption errors to null values. This may mask important errors such as invalid key lengths or unsupported cipher modes. Consider logging errors or propagating them in a way that provides better visibility to users, particularly for cryptographic operations where failure reasons are critical for debugging.

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +46
use rand::Rng;
let mut iv = vec![0u8; length];
rand::rng().fill(&mut iv[..]);
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generate_random_iv function uses rand::rng() which generates cryptographically insecure random numbers. For AES encryption, especially GCM mode, initialization vectors must be cryptographically secure random values. Using non-secure random IVs can compromise the security of the encryption. Consider using rand::rngs::OsRng or rand::thread_rng() with a cryptographically secure RNG to generate IVs.

Suggested change
use rand::Rng;
let mut iv = vec![0u8; length];
rand::rng().fill(&mut iv[..]);
use rand::rngs::OsRng;
use rand::RngCore;
let mut iv = vec![0u8; length];
OsRng.fill_bytes(&mut iv[..]);

Copilot uses AI. Check for mistakes.
Comment on lines +211 to +227
24 | 32 => {
let cipher = Aes256Gcm::new(key.into());
let payload = match aad {
Some(aad_data) => Payload {
msg: input,
aad: aad_data,
},
None => Payload {
msg: input,
aad: &[],
},
};
cipher
.encrypt(nonce, payload)
.map_err(|e| CryptoError::EncryptionFailed(e.to_string()))?
}
_ => unreachable!("Key length validated above"),
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GCM mode implementation maps both 24-byte and 32-byte keys to Aes256Gcm. However, the standard AES-GCM supports AES-128, AES-192, and AES-256, which require 16, 24, and 32-byte keys respectively. Using Aes256Gcm for a 24-byte key is incorrect. The aes-gcm crate provides Aes192Gcm for 24-byte keys. This should be updated to use the correct cipher implementation for each key size to ensure proper encryption standards compliance.

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.86%. Comparing base (f09f8af) to head (65c5a61).
⚠️ Report is 927 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3211      +/-   ##
============================================
+ Coverage     56.12%   59.86%   +3.73%     
- Complexity      976     1460     +484     
============================================
  Files           119      175      +56     
  Lines         11743    16166    +4423     
  Branches       2251     2681     +430     
============================================
+ Hits           6591     9678    +3087     
- Misses         4012     5132    +1120     
- Partials       1140     1356     +216     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark expression: aes_encrypt

3 participants