Spark soundex function implementation #20725

Open
kazantsev-maksim wants to merge 32 commits into apache:main from kazantsev-maksim:spark_soundex

@kazantsev-maksim
Contributor

Which issue does this PR close?

N/A

Rationale for this change

Adds a new Spark function, soundex: https://spark.apache.org/docs/latest/api/sql/index.html#soundex

What changes are included in this PR?

  • Implementation
  • SLT tests

Are these changes tested?

Yes, tests added as part of this PR.

Are there any user-facing changes?

No, this is a new function.

@kazantsev-maksim changed the title from "feat: Spark soundex function implementation" to "Spark soundex function implementation" on Mar 5, 2026
@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels on Mar 5, 2026
Contributor

@davidlghellin left a comment

I'd be happy to add the SLT tests for these edge cases if you'd like — I already have them validated against Spark JVM. Just let me know!

query T
SELECT soundex('Datafusion');
----
D312
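
For reference, a (query, expected result) pair can be turned into a record of the shape shown above with a few lines of Python. This is only a sketch: `slt_record` is a hypothetical helper, not part of the PR, and the `NULL` and `(empty)` markers for NULL and empty-string results are assumptions about how the sqllogictest runner prints those values, so check them against existing .slt files before relying on them.

```python
def slt_record(sql, expected, null_marker="NULL", empty_marker="(empty)"):
    """Format one single-column text query as a sqllogictest record.

    null_marker and empty_marker are assumed renderings, not taken from the PR.
    """
    if expected is None:
        shown = null_marker
    elif expected == "":
        shown = empty_marker
    else:
        shown = expected
    # "query T" declares one text column; "----" separates query and result.
    return f"query T\n{sql.rstrip(';')};\n----\n{shown}\n"


print(slt_record("SELECT soundex('Datafusion')", "D312"))
```

Results that start or end with whitespace (for example the ' hello' case) may need manual checking, since this sketch writes the value verbatim.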
Contributor

Hey! I had actually started working on a Spark soundex implementation too and didn't realize there was already a PR for it. Happy to see this moving forward!

I had put together a battery of edge-case tests validated against Spark JVM that might be useful. The current SLT coverage is a bit thin — there are some tricky Soundex behaviors that are easy to get wrong:

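from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell).
spark = SparkSession.builder.getOrCreate()

tests = [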
    # H/W transparency (must NOT separate same codes)
    ("H/W transparency", "SELECT soundex('Ashcroft') AS result"),
    # Separators (digit, space, vowel MUST separate same codes)
    ("Digit separates same-code", "SELECT soundex('B1B') AS result"),
    ("Space separates same-code", "SELECT soundex('B B') AS result"),
    ("Vowel separates same-code", "SELECT soundex('BAB') AS result"),
    # Non-alpha first character (returns input unchanged)
    ("Non-alpha first char", "SELECT soundex('#hello') AS result"),
    ("Space first char", "SELECT soundex(' hello') AS result"),
    ("Only spaces", "SELECT soundex('   ') AS result"),
    ("Tab prefix", "SELECT soundex('\thello') AS result"),
    ("Emoji prefix", "SELECT soundex('😀hello') AS result"),
    ("Only digits", "SELECT soundex('123') AS result"),
    ("Starts with digit", "SELECT soundex('1abc') AS result"),
    # Basic behavior
    ("Single character", "SELECT soundex('A') AS result"),
    ("All same-code letters", "SELECT soundex('BFPV') AS result"),
    ("Similar names Robert", "SELECT soundex('Robert') AS result"),
    ("Similar names Rupert", "SELECT soundex('Rupert') AS result"),
    ("NULL", "SELECT soundex(NULL) AS result"),
    ("Empty string", "SELECT soundex('') AS result"),
    # Case insensitivity
    ("Lowercase", "SELECT soundex('robert') AS result"),
    ("Mixed case same", "SELECT soundex('rObErT') AS result"),
    # Unicode
    ("Unicode umlaut", "SELECT soundex('Müller') AS result"),
    # Truncation (only first 3 codes after initial)
    ("Long string", "SELECT soundex('Abcdefghijklmnop') AS result"),
    # Extra edge cases
    ("Adjacent same codes collapse", "SELECT soundex('Lloyd') AS result"),
    ("W between same codes", "SELECT soundex('BWB') AS result"),
    ("H between same codes", "SELECT soundex('BHB') AS result"),
    ("Double letters", "SELECT soundex('Tymczak') AS result"),
    ("All vowels after first", "SELECT soundex('Aeiou') AS result"),
    ("First char digit rest alpha", "SELECT soundex('1Robert') AS result"),
    ("Hyphen in name", "SELECT soundex('Smith-Jones') AS result"),
    ("Single non-alpha", "SELECT soundex('#') AS result"),
    ("Newline prefix", "SELECT soundex('\nhello') AS result"),
]

for label, sql in tests:
    r = spark.sql(sql).collect()
    print(f"{label}: {repr(r[0].result)}")

# Multi-row column test
print("\nColumn test:")
spark.sql("""
    SELECT soundex(name) AS result 
    FROM VALUES ('Robert'), ('Rupert'), (NULL), (''), ('123') AS t(name)
""").show()
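
The rules named in the comments of the test list (H/W transparency, separators, non-letter first character, truncation to three codes) can be sketched in plain Python as a reference to compare against. This is a minimal sketch under stated assumptions, not the PR's Rust code and not Spark's implementation: the letter table is the standard American Soundex mapping, only ASCII letters are treated as letters, and the names `CODES`, `is_ascii_letter` and `soundex` are illustrative.

```python
# Soundex digit for each ASCII letter. "0" marks vowels and Y, which separate
# repeated codes. H and W are left out on purpose: they are skipped entirely.
CODES = {
    letter: digit
    for digit, letters in {
        "0": "AEIOUY",
        "1": "BFPV",
        "2": "CGJKQSXZ",
        "3": "DT",
        "4": "L",
        "5": "MN",
        "6": "R",
    }.items()
    for letter in letters
}


def is_ascii_letter(ch):
    # Assumption: only A-Z / a-z count as letters; everything else separates.
    return ch.isascii() and ch.isalpha()


def soundex(s):
    """Reference sketch of the rules exercised by the test list."""
    if s is None or s == "" or not is_ascii_letter(s[0]):
        return s  # NULL, empty, and non-letter-first inputs come back unchanged
    first = s[0].upper()
    out = [first]
    last = CODES.get(first)  # None when the first letter is H or W
    for ch in s[1:]:
        if not is_ascii_letter(ch):
            last = "0"  # digits, spaces, punctuation, non-ASCII act as separators
            continue
        ch = ch.upper()
        if ch in "HW":
            continue  # transparent: does not reset the previous code
        code = CODES[ch]
        if code != "0" and code != last:
            out.append(code)
            if len(out) == 4:
                break  # initial letter plus three digits
        last = code
    return "".join(out).ljust(4, "0")


for name in ["Robert", "Rupert", "Ashcroft", "Tymczak", "BAB", "BHB", "#hello"]:
    print(name, "->", soundex(name))
```

Under these assumptions Robert and Rupert both map to R163 and Ashcroft maps to A261, the usual textbook values. Any disagreement with the Spark JVM output should be resolved in favour of Spark.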

Output on Spark-3.5: [screenshot]

Contributor Author

Big thanks to @davidlghellin for the test cases.

@kazantsev-maksim
Contributor Author

@davidlghellin could you take another look when you have time?

Contributor

@davidlghellin left a comment

Hey @kazantsev-maksim, nice work!

This is my first time reviewing a PR here, so please take my comments as suggestions rather than strict requirements — happy to be corrected.

Nothing here is a blocker — the core algorithm looks correct and the test coverage is solid. Just flagging a few things for consistency with the rest of the crate.

cc @Jefffrey in case I’m off base on any of this.

Comment on lines +20 to +24
use datafusion::logical_expr::{ColumnarValue, Signature, Volatility};
use datafusion_common::cast::as_generic_string_array;
use datafusion_common::utils::take_function_args;
use datafusion_common::{Result, exec_err};
use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl};
Contributor

Suggested change
- use datafusion::logical_expr::{ColumnarValue, Signature, Volatility};
  use datafusion_common::cast::as_generic_string_array;
  use datafusion_common::utils::take_function_args;
  use datafusion_common::{Result, exec_err};
  use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl};

To check, compile only the crate:

cargo clippy -p datafusion-spark

let [array] = take_function_args("soundex", arg)?;
match &array.data_type() {
    DataType::Utf8 => soundex::<i32>(array),
    DataType::LargeUtf8 => soundex::<i64>(array),
Contributor

I think we need to add a DataType::Utf8View arm here.

}

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    Ok(DataType::Utf8)
Contributor

Suggested change
-     Ok(DataType::Utf8)
+     match &arg_types[0] {
+         DataType::LargeUtf8 => Ok(DataType::LargeUtf8),
+         _ => Ok(DataType::Utf8),
+     }

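I think we need to return Utf8 by default: Utf8 and Utf8View → Utf8, LargeUtf8 → LargeUtf8. (The parameter would also need to be renamed from _arg_types to arg_types for this to compile.)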
