
[SPARK-56222][PYTHON] Create ArrowStreamGroupSerializer and ArrowStreamCoGroupSerializer#55026

Open
Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-56222

Conversation

Contributor

@Yicong-Huang Yicong-Huang commented Mar 26, 2026

What changes were proposed in this pull request?

Refactors ArrowStreamSerializer by extracting group and cogroup loading logic into dedicated subclasses:

ArrowStreamSerializer              (plain Arrow stream I/O)
  ├── ArrowStreamGroupSerializer    (grouped loading, 1 df/group)
  ├── ArrowStreamCoGroupSerializer  (cogrouped loading, 2 dfs/group)
  ├── ArrowStreamUDFSerializer
  ├── ArrowStreamPandasSerializer
  └── ArrowStreamArrowUDFSerializer

Key changes:

  • ArrowStreamSerializer: simplified to only handle plain Arrow stream read/write. Removed num_dfs parameter.
  • ArrowStreamGroupSerializer(ArrowStreamSerializer): new class that overrides load_stream with group-count protocol for single-dataframe groups.
  • ArrowStreamCoGroupSerializer(ArrowStreamSerializer): new class that overrides load_stream with group-count protocol for two-dataframe cogroups.
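The split described above can be sketched as follows. This is a minimal, dependency-free illustration, not the PR's actual code: the `read_int`/`write_int` framing and the use of plain ints in place of `pa.RecordBatch` payloads are assumptions made to keep the sketch runnable without pyarrow.

```python
import struct
from io import BytesIO
from typing import IO, Iterator, Tuple

def read_int(stream: IO[bytes]) -> int:
    # Big-endian 4-byte int framing (an assumption for this sketch).
    return struct.unpack("!i", stream.read(4))[0]

def write_int(value: int, stream: IO[bytes]) -> None:
    stream.write(struct.pack("!i", value))

class ArrowStreamSerializer:
    """Plain stream I/O. The real class yields pa.RecordBatch objects from
    an Arrow IPC stream; here each "batch" is stubbed as a single int."""
    def load_stream(self, stream: IO[bytes]) -> Iterator[int]:
        yield read_int(stream)

class ArrowStreamGroupSerializer(ArrowStreamSerializer):
    """Group-count protocol, one dataframe per group; a 0 count ends the stream."""
    def load_stream(self, stream: IO[bytes]) -> Iterator[int]:
        while (dataframes_in_group := read_int(stream)) > 0:
            assert dataframes_in_group == 1
            yield from super().load_stream(stream)

class ArrowStreamCoGroupSerializer(ArrowStreamSerializer):
    """Group-count protocol, two dataframes per group, yielded as a tuple."""
    def load_stream(self, stream: IO[bytes]) -> Iterator[Tuple[int, int]]:
        while (dataframes_in_group := read_int(stream)) > 0:
            assert dataframes_in_group == 2
            left = next(super().load_stream(stream))
            right = next(super().load_stream(stream))
            yield (left, right)

# Encode two single-df groups followed by the 0 terminator, then decode.
buf = BytesIO()
for count, payload in [(1, 10), (1, 20)]:
    write_int(count, buf)
    write_int(payload, buf)
write_int(0, buf)
buf.seek(0)
print(list(ArrowStreamGroupSerializer().load_stream(buf)))  # [10, 20]
```

With this shape, each subclass's `load_stream` has a single, unambiguous return type, which is the point of the refactor.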

Why are the changes needed?

This is part of the ongoing serializer simplification effort (SPARK-55384). The previous ArrowStreamSerializer mixed plain stream I/O with group-count protocol logic via a num_dfs parameter and a multi-purpose _load_group_dataframes method. This made the return types ambiguous — callers couldn't tell from the type signature whether they'd get pa.RecordBatch, Iterator[pa.RecordBatch], or Tuple[...]. By splitting group and cogroup into separate classes, each load_stream has a clear, precise return type, improving readability and enabling better static analysis.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang changed the title [SPARK-56222][PYTHON] Extract ArrowStreamGroupSerializer and ArrowStreamCoGroupSerializer from ArrowStreamSerializer [SPARK-56222][PYTHON] Create ArrowStreamGroupSerializer and ArrowStreamCoGroupSerializer Mar 26, 2026
Comment on lines +197 to +200
dataframes_in_group: Optional[int] = None

while dataframes_in_group is None or dataframes_in_group > 0:
    dataframes_in_group = read_int(stream)
Contributor


Suggested change
-    dataframes_in_group: Optional[int] = None
-
-    while dataframes_in_group is None or dataframes_in_group > 0:
-        dataframes_in_group = read_int(stream)
+    while dataframes_in_group := read_int(stream):
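The walrus form works because the stream's 0 terminator is falsy: the assignment expression reads the next count and tests it in one step. A minimal sketch with a stubbed `read_int` (the framing is an assumption; the real helper lives in pyspark's serializers module):

```python
import struct
from io import BytesIO

def read_int(stream) -> int:
    # Big-endian 4-byte int framing (assumption for this sketch).
    return struct.unpack("!i", stream.read(4))[0]

def drain_counts(stream) -> list:
    # Walrus form from the suggestion: assign the next group count and test
    # it in one expression; the loop exits when read_int returns 0 (falsy).
    counts = []
    while dataframes_in_group := read_int(stream):
        counts.append(dataframes_in_group)
    return counts

buf = BytesIO(struct.pack("!iiii", 1, 1, 2, 0))
print(drain_counts(buf))  # [1, 1, 2]
```

One subtlety: the original `> 0` condition also stops on a negative count, while the walrus form treats any nonzero value as truthy, so the two are equivalent only if counts are never negative.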

self._num_dfs: int = num_dfs

-    def dump_stream(self, iterator, stream):
+    def dump_stream(self, iterator: Iterator["pa.RecordBatch"], stream: IO[bytes]) -> None:
Contributor


Normally we want to make inputs less restrictive and outputs more restrictive. Do we have to restrict the input to Iterator rather than Iterable? I saw we had to call iter() on lists to make tests work. What if we take an Iterable as input and call iter() inside the function to get the actual iterator? If an Iterator is passed in, iter() is essentially an identity function that returns the iterator itself, so we lose nothing.
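The reviewer's point in miniature: `iter()` returns its argument unchanged when the argument is already an iterator, so widening the parameter to Iterable costs nothing. A sketch with the `dump_stream` body stubbed out (the list sink is a stand-in, not the real Arrow writer):

```python
from typing import Iterable

def dump_stream(batches: Iterable[int], out: list) -> None:
    # Accept any Iterable; iter() returns the object itself when it is
    # already an iterator, so generator-passing callers lose nothing,
    # while tests can now pass plain lists without wrapping in iter().
    iterator = iter(batches)
    for batch in iterator:
        out.append(batch)

gen = (x for x in [1, 2])
assert iter(gen) is gen  # iter() is the identity on an existing iterator

sink: list = []
dump_stream([3, 4], sink)  # a plain list now works too, no iter() needed
print(sink)  # [3, 4]
```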

dataframes_in_group = read_int(stream)

if dataframes_in_group == 1:
    yield self._read_arrow_stream(stream)
Contributor


Now that the base class has a clear load_stream, we no longer have the recursion issue. Would it make sense to call super().load_stream() here and get rid of _read_arrow_stream, since _read_arrow_stream is literally load_stream in the trivial case?
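The suggested refactor in miniature: the subclass delegates the trivial single-stream read to `super().load_stream()` instead of duplicating it in a private helper. The list-backed "stream" below is a stand-in so the sketch runs without pyarrow; the real methods operate on an Arrow IPC byte stream.

```python
class ArrowStreamSerializer:
    def load_stream(self, stream):
        # Trivial case: read one "batch" (stubbed as popping from a list).
        yield stream.pop(0)

class ArrowStreamGroupSerializer(ArrowStreamSerializer):
    def load_stream(self, stream):
        while (dataframes_in_group := stream.pop(0)) > 0:
            # Delegate to the base class instead of keeping a private
            # _read_arrow_stream helper that duplicates its body.
            yield from super().load_stream(stream)

stream = [1, "a", 1, "b", 0]
print(list(ArrowStreamGroupSerializer().load_stream(stream)))  # ['a', 'b']
```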
