Spark soundex function implementation by kazantsev-maksim · Pull Request #20725 · apache/datafusion

kazantsev-maksim · 2026-03-05T16:32:37Z

Which issue does this PR close?

N/A

Rationale for this change

Add new spark function: https://spark.apache.org/docs/latest/api/sql/index.html#soundex

What changes are included in this PR?

Implementation
SLT tests

Are these changes tested?

Yes, tests added as part of this PR.

Are there any user-facing changes?

No, this is a new function.

davidlghellin

I'd be happy to add the SLT tests for these edge cases if you'd like — I already have them validated against Spark JVM. Just let me know!

davidlghellin · 2026-03-14T07:23:53Z

datafusion/sqllogictest/test_files/spark/string/soundex.slt

+query T
+SELECT soundex('Datafusion');
+----
+D312


Hey! I had actually started working on a Spark soundex implementation too and didn't realize there was already a PR for it. Happy to see this moving forward!

I had put together a battery of edge-case tests validated against Spark JVM that might be useful. The current SLT coverage is a bit thin — there are some tricky Soundex behaviors that are easy to get wrong:

tests = [
    # H/W transparency (must NOT separate same codes)
    ("H/W transparency", "SELECT soundex('Ashcroft') AS result"),
    # Separators (digit, space, vowel MUST separate same codes)
    ("Digit separates same-code", "SELECT soundex('B1B') AS result"),
    ("Space separates same-code", "SELECT soundex('B B') AS result"),
    ("Vowel separates same-code", "SELECT soundex('BAB') AS result"),
    # Non-alpha first character (returns input unchanged)
    ("Non-alpha first char", "SELECT soundex('#hello') AS result"),
    ("Space first char", "SELECT soundex(' hello') AS result"),
    ("Only spaces", "SELECT soundex(' ') AS result"),
    ("Tab prefix", "SELECT soundex('\thello') AS result"),
    ("Emoji prefix", "SELECT soundex('😀hello') AS result"),
    ("Only digits", "SELECT soundex('123') AS result"),
    ("Starts with digit", "SELECT soundex('1abc') AS result"),
    # Basic behavior
    ("Single character", "SELECT soundex('A') AS result"),
    ("All same-code letters", "SELECT soundex('BFPV') AS result"),
    ("Similar names Robert", "SELECT soundex('Robert') AS result"),
    ("Similar names Rupert", "SELECT soundex('Rupert') AS result"),
    ("NULL", "SELECT soundex(NULL) AS result"),
    ("Empty string", "SELECT soundex('') AS result"),
    # Case insensitivity
    ("Lowercase", "SELECT soundex('robert') AS result"),
    ("Mixed case same", "SELECT soundex('rObErT') AS result"),
    # Unicode
    ("Unicode umlaut", "SELECT soundex('Müller') AS result"),
    # Truncation (only first 3 codes after initial)
    ("Long string", "SELECT soundex('Abcdefghijklmnop') AS result"),
    # Extra edge cases
    ("Adjacent same codes collapse", "SELECT soundex('Lloyd') AS result"),
    ("W between same codes", "SELECT soundex('BWB') AS result"),
    ("H between same codes", "SELECT soundex('BHB') AS result"),
    ("Double letters", "SELECT soundex('Tymczak') AS result"),
    ("All vowels after first", "SELECT soundex('Aeiou') AS result"),
    ("First char digit rest alpha", "SELECT soundex('1Robert') AS result"),
    ("Hyphen in name", "SELECT soundex('Smith-Jones') AS result"),
    ("Single non-alpha", "SELECT soundex('#') AS result"),
    ("Newline prefix", "SELECT soundex('\nhello') AS result"),
]

for label, sql in tests:
    r = spark.sql(sql).collect()
    print(f"{label}: {repr(r[0].result)}")

# Multi-row column test
print("\nColumn test:")
spark.sql("""
    SELECT soundex(name) AS result
    FROM VALUES ('Robert'), ('Rupert'), (NULL), (''), ('123') AS t(name)
""").show()
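The rules named in the test labels above (H/W transparency, separators, non-alpha first character, truncation) can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the code from this PR: the names `code` and `soundex_sketch` are hypothetical, the `'7'` marker for H and W is an internal convention of this sketch, and NULL handling is left out because it belongs to the array layer.

```rust
/// Soundex digit for an ASCII uppercase letter.
/// '0' marks vowels (and Y); '7' is a private marker for the
/// transparent letters H and W. Both are assumptions of this sketch.
fn code(c: char) -> char {
    match c {
        'B' | 'F' | 'P' | 'V' => '1',
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => '2',
        'D' | 'T' => '3',
        'L' => '4',
        'M' | 'N' => '5',
        'R' => '6',
        'H' | 'W' => '7',
        _ => '0',
    }
}

/// Hypothetical helper: computes a four-character Soundex code.
fn soundex_sketch(input: &str) -> String {
    let mut chars = input.chars();
    // Empty input, or a first character that is not an ASCII letter:
    // return the input unchanged.
    let first = match chars.next() {
        Some(c) if c.is_ascii_alphabetic() => c.to_ascii_uppercase(),
        _ => return input.to_string(),
    };

    let mut out = String::with_capacity(4);
    out.push(first);
    // The first letter's own code suppresses an identical code right after it.
    let mut last = code(first);

    for c in chars {
        if !c.is_ascii_alphabetic() {
            // Digits, spaces, punctuation and non-ASCII characters act as
            // separators: the same code may be emitted again after them.
            last = '0';
            continue;
        }
        let d = code(c.to_ascii_uppercase());
        if d == '7' {
            // H and W are transparent: they do not reset `last`.
            continue;
        }
        if d != '0' && d != last {
            out.push(d);
            if out.len() == 4 {
                break; // keep only the first three codes after the initial
            }
        }
        // Vowels set `last` to '0', so they separate equal codes too.
        last = d;
    }

    // Pad short results with zeros.
    while out.len() < 4 {
        out.push('0');
    }
    out
}
```

With this sketch, `soundex_sketch("Datafusion")` yields `D312`, matching the SLT expectation quoted earlier, and `soundex_sketch("#hello")` returns the input unchanged.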

Spark-3.5

Big thanks to @davidlghellin for the test cases.

kazantsev-maksim · 2026-03-18T18:38:12Z

@davidlghellin could you take another look when you have time?

davidlghellin

Hey @kazantsev-maksim, nice work!

This is my first time reviewing a PR here, so please take my comments as suggestions rather than strict requirements — happy to be corrected.

Nothing here is a blocker — the core algorithm looks correct and the test coverage is solid. Just flagging a few things for consistency with the rest of the crate.

cc @Jefffrey in case I’m off base on any of this.

davidlghellin · 2026-03-18T19:28:22Z

datafusion/spark/src/function/string/soundex.rs

+use datafusion::logical_expr::{ColumnarValue, Signature, Volatility};
+use datafusion_common::cast::as_generic_string_array;
+use datafusion_common::utils::take_function_args;
+use datafusion_common::{Result, exec_err};
+use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl};


Suggested change

-use datafusion::logical_expr::{ColumnarValue, Signature, Volatility};
-use datafusion_common::cast::as_generic_string_array;
-use datafusion_common::utils::take_function_args;
-use datafusion_common::{Result, exec_err};
-use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl};
+use datafusion_common::cast::as_generic_string_array;
+use datafusion_common::utils::take_function_args;
+use datafusion_common::{Result, exec_err};
+use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl};

if you compile only the crate:

cargo clippy -p datafusion-spark

davidlghellin · 2026-03-18T19:34:44Z

datafusion/spark/src/function/string/soundex.rs

+    let [array] = take_function_args("soundex", arg)?;
+    match &array.data_type() {
+        DataType::Utf8 => soundex::<i32>(array),
+        DataType::LargeUtf8 => soundex::<i64>(array),


I think we need to add DataType::Utf8View

davidlghellin · 2026-03-18T19:40:55Z

datafusion/spark/src/function/string/soundex.rs

+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        Ok(DataType::Utf8)


Suggested change

-        Ok(DataType::Utf8)
+        match &arg_types[0] {
+            DataType::LargeUtf8 => Ok(DataType::LargeUtf8),
+            _ => Ok(DataType::Utf8),
+        }

I think we need to return Utf8 by default:
Utf8 and Utf8View → Utf8
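The type mapping proposed in this review thread can be illustrated without pulling in Arrow. The sketch below is an assumption-laden stand-in: `StringType` and `soundex_return_type` are hypothetical names, and the real code would match on arrow's `DataType` inside `return_type` rather than on a local enum.

```rust
// Stand-in for the three Arrow string types discussed in the review.
// Hypothetical: the actual implementation would use arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical helper mirroring the suggested `return_type` logic:
/// LargeUtf8 keeps its type, everything else falls back to Utf8.
fn soundex_return_type(arg: StringType) -> StringType {
    match arg {
        StringType::LargeUtf8 => StringType::LargeUtf8,
        StringType::Utf8 | StringType::Utf8View => StringType::Utf8,
    }
}
```

Listing the `Utf8 | Utf8View` arm explicitly, instead of a `_` wildcard, is a design choice of this sketch: the compiler then flags the match if another string type is added to the enum later.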

Kazantsev Maksim and others added 4 commits March 5, 2026 20:29

Spark soundex function implementation

f232898

Merge branch 'main' into spark_soundex

2cd58fe

Add more tests

e2aadb3

Merge remote-tracking branch 'origin/spark_soundex' into spark_soundex

2a426d7

kazantsev-maksim changed the title ~~feat: Spark soundex function implementation~~ Spark soundex function implementation Mar 5, 2026

Merge branch 'main' into spark_soundex

e53f738

github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Mar 5, 2026

Kazantsev Maksim added 7 commits March 5, 2026 23:01

Clippy fixing

37c3390

Merge remote-tracking branch 'origin/spark_soundex' into spark_soundex

6f0480e

Clippy fixing

5058986

Fix compute_soundex

5682c4f

Fix compute_soundex

9b014ec

Fix compute_soundex

74569b4

Clippy fixing

b89175e

davidlghellin reviewed Mar 14, 2026


Kazantsev Maksim added 14 commits March 14, 2026 21:46

Ad more slt tests

25c763d

Add more slt tests

3965da1

fix

1153124

fix

1a867bf

fix

6d47d83

fix

061c6b1

fix

d7852bf

fix

cab229a

fix

1af1d30

fix

a63709a

fix

bb5f6f0

refactoring

67f40b9

refactoring

b2c5671

refactoring

15a48ab

Kazantsev Maksim added 3 commits March 18, 2026 21:45

refactoring

399b60c

refactoring

1ba5b71

refactoring

6f84ad9

davidlghellin reviewed Mar 18, 2026


davidlghellin mentioned this pull request Mar 18, 2026

fix: use datafusion_expr instead of datafusion crate in spark #21043

Open

Kazantsev Maksim added 3 commits March 19, 2026 09:18

fix PR issues

ecb3578

fix PR issues

93bda07

fix PR issues

8283721


kazantsev-maksim wants to merge 32 commits into apache:main from kazantsev-maksim:spark_soundex
