[Stateful] Implement length-aware keying to minimize padding in BatchElements (Part 2/3)#37565
Eliaaazzz wants to merge 4 commits into apache:master
Conversation
Activity
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
Force-pushed 9f6b1c2 to 92b546a
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## master #37565 +/- ##
=========================================
Coverage 35.88% 35.88%
Complexity 1691 1691
=========================================
Files 1063 1063
Lines 166721 166752 +31
Branches 1227 1227
=========================================
+ Hits 59832 59844 +12
- Misses 104694 104713 +19
Partials 2195 2195
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Assigning reviewers: R: @tvalentyn for label python.
The PR bot will only process comments in the main thread (not review comments).
Force-pushed 8926e75 to 77b23a7
- Add length_fn and bucket_boundaries parameters to ModelHandler.__init__ to support length-aware bucketed keying for ML inference batching
- Add WithLengthBucketKey DoFn to route elements by length buckets
- Update BatchElements to support length-aware batching when max_batch_duration_secs is set, reducing padding waste for variable-length sequences (e.g., NLP workloads)
- Default bucket boundaries: [16, 32, 64, 128, 256, 512]
- Add comprehensive tests validating bucket assignment, mixed-length batching, and padding efficiency improvements (77% vs 68% on bimodal data)
- All formatting (yapf) and lint (pylint 10/10) checks passed
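The keying step the commit message describes can be sketched in plain Python. This is an illustration only: the helper name `length_bucket_key` is hypothetical (the PR implements this inside the WithLengthBucketKey DoFn), and it uses `bisect_right`, the behavior the review below converged on.

```python
import bisect

# Default boundaries from the commit message above.
DEFAULT_BUCKET_BOUNDARIES = [16, 32, 64, 128, 256, 512]

def length_bucket_key(element, length_fn=len,
                      bucket_boundaries=DEFAULT_BUCKET_BOUNDARIES):
    # Map an element to (bucket_index, element). A stateful batching step
    # keyed on bucket_index then only groups similar-length elements,
    # so padding within a batch is bounded by the bucket's upper edge.
    return bisect.bisect_right(bucket_boundaries, length_fn(element)), element

key, _ = length_bucket_key("x" * 20)  # length 20 falls in bucket 1: [16, 32)
```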
Force-pushed 2f68510 to 8713eac
Force-pushed 8713eac to 47b5a9b
Reminder, please take a look at this PR: @tvalentyn
Force-pushed 6af4d2e to f820874
Hi @damccorm, just a gentle ping on this when you have a spare moment. No rush at all. Just wanted to note that this PR and PR 2 are blocking the final integration PR, so I'd love to get your thoughts on this core logic so I can adjust the downstream work if needed.
Force-pushed a889c3f to d6d33c8
Force-pushed d6d33c8 to 53454f3
/gemini review
damccorm left a comment:
Thanks - this is neat! Just had minor feedback, it generally LGTM
max_batch_weight: Optional[int] = None,
element_size_fn: Optional[Callable[[Any], int]] = None,
length_fn: Optional[Callable[[Any], int]] = None,
bucket_boundaries: Optional[list[int]] = None,
Can we make it clear these are batching parameters? e.g. batch_length_fn and batch_bucket_boundaries?
length_fn: a callable mapping an element to its length. When set with
  max_batch_duration_secs, enables length-aware bucketed keying so
  elements of similar length are batched together.
bucket_boundaries: sorted list of positive boundary values for length
Could we add more data to this description, similar to below?
Code Review
This PR introduces length-aware keying in BatchElements to improve batching efficiency, especially for variable-length inputs. The changes include a new WithLengthBucketKey DoFn, API updates to BatchElements and ModelHandler to support length-based bucketing, and comprehensive tests to validate the new functionality and demonstrate its performance benefits. The implementation is clear and well-tested. I have one suggestion regarding the bucketing logic to make it more intuitive.
load_shared_key, "WithLengthBucketKey").key

def _get_bucket(self, length):
    return bisect.bisect_left(self._bucket_boundaries, length)
Using bisect.bisect_right would be more intuitive for defining bucket boundaries. Typically, boundaries are inclusive on the lower end and exclusive on the upper end (e.g., a bucket for lengths in [10, 50)). bisect.bisect_right achieves this behavior, which seems to be what was intended based on comments and logic in the tests.
With the current bisect.bisect_left, a length of 10 falls into bucket 0 (for lengths <= 10), which can be surprising. With bisect.bisect_right, a length of 10 would fall into bucket 1 (for lengths in [10, 50)).
If you make this change, please also update the assertions in test_length_bucket_assignment in util_test.py to match.
- return bisect.bisect_left(self._bucket_boundaries, length)
+ return bisect.bisect_right(self._bucket_boundaries, length)
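The difference the reviewer describes is easy to verify directly with the standard-library `bisect` functions, using the boundaries from the tests in this thread:

```python
import bisect

boundaries = [10, 50, 100]

# bisect_left keeps a length equal to a boundary in the lower bucket
# (boundaries act as upper-inclusive edges): 10 -> bucket 0.
assert bisect.bisect_left(boundaries, 10) == 0

# bisect_right makes boundaries lower-inclusive, as the review suggests:
# buckets are (-inf, 10), [10, 50), [50, 100), [100, inf), so 10 -> bucket 1.
assert bisect.bisect_right(boundaries, 10) == 1
assert bisect.bisect_right(boundaries, 9) == 0
```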
# bisect_left: length < 10 -> bucket 0, 10 <= length < 50 -> bucket 1, etc.
self.assertEqual(dofn._get_bucket(5), 0)
self.assertEqual(dofn._get_bucket(10), 0)
self.assertEqual(dofn._get_bucket(11), 1)
self.assertEqual(dofn._get_bucket(50), 1)
self.assertEqual(dofn._get_bucket(51), 2)
self.assertEqual(dofn._get_bucket(100), 2)
self.assertEqual(dofn._get_bucket(101), 3)
self.assertEqual(dofn._get_bucket(999), 3)
The comment here seems to describe the behavior of bisect.bisect_right, but the assertions match bisect.bisect_left, which is confusing.
If _get_bucket is changed to use bisect.bisect_right as suggested in my other comment, this test should be updated to reflect the new behavior. The boundaries would be inclusive on the lower end, which is more conventional.
Here is a suggested update for the test to align with bisect.bisect_right:
- # bisect_left: length < 10 -> bucket 0, 10 <= length < 50 -> bucket 1, etc.
- self.assertEqual(dofn._get_bucket(5), 0)
- self.assertEqual(dofn._get_bucket(10), 0)
- self.assertEqual(dofn._get_bucket(11), 1)
- self.assertEqual(dofn._get_bucket(50), 1)
- self.assertEqual(dofn._get_bucket(51), 2)
- self.assertEqual(dofn._get_bucket(100), 2)
- self.assertEqual(dofn._get_bucket(101), 3)
- self.assertEqual(dofn._get_bucket(999), 3)
+ # bisect_right creates buckets where the boundary is the lower-inclusive bound.
+ # e.g., for boundaries [10, 50], buckets are (-inf, 10), [10, 50), [50, inf)
+ self.assertEqual(dofn._get_bucket(5), 0)
+ self.assertEqual(dofn._get_bucket(10), 1)
+ self.assertEqual(dofn._get_bucket(11), 1)
+ self.assertEqual(dofn._get_bucket(50), 2)
+ self.assertEqual(dofn._get_bucket(51), 2)
+ self.assertEqual(dofn._get_bucket(100), 3)
+ self.assertEqual(dofn._get_bucket(999), 3)
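The suggested assertions can be checked with a minimal stand-in for the DoFn. The class here is a sketch, not the PR's WithLengthBucketKey, and the boundaries [10, 50, 100] are inferred from the test values quoted above:

```python
import bisect

class BucketLookupSketch:
    """Minimal sketch of the bucket lookup with bisect_right semantics."""
    def __init__(self, bucket_boundaries):
        self._bucket_boundaries = sorted(bucket_boundaries)

    def _get_bucket(self, length):
        # Each boundary is the inclusive lower edge of the next bucket:
        # for [10, 50, 100] -> (-inf, 10), [10, 50), [50, 100), [100, inf)
        return bisect.bisect_right(self._bucket_boundaries, length)

dofn = BucketLookupSketch([10, 50, 100])
assert [dofn._get_bucket(n) for n in (5, 10, 50, 100, 999)] == [0, 1, 2, 3, 3]
```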
Force-pushed d1fa315 to cf2997a
Expands parameter documentation for clarity and replaces bisect_left with bisect_right to ensure bucket boundaries are inclusive on the lower bound. Updates util_test.py assertions accordingly.
Force-pushed cf2997a to 35a622e
[Stateful] Implement length-aware keying to minimize padding in BatchElements (Part 2/3)
Rationale
Issue: #37531 (Stateful Core - Part 2)
Part 1: #37532
This PR adds length-aware keying to BatchElements to improve batching efficiency for variable-length inputs (for example, NLP inference workloads).
Today, stateful BatchElements uses one shared key (WithSharedKey). That causes short and long sequences to be mixed in the same batch, so padding is dictated by the longest item and compute is wasted. This PR addresses that by routing elements into length buckets before stateful batching.
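The padding cost described above can be made concrete with a toy bimodal workload. The lengths and the collate logic below are illustrative assumptions, not the PR's actual test data:

```python
lengths = [8, 8, 8, 8, 256, 256, 256, 256]  # bimodal: four short, four long

def padded_tokens(batch):
    # A typical collate step pads every element to the longest in the batch.
    return max(batch) * len(batch)

real = sum(lengths)                       # useful tokens: 1056
mixed = padded_tokens(lengths)            # one shared key: 256 * 8 = 2048
bucketed = (padded_tokens(lengths[:4])    # short bucket: 8 * 4 = 32
            + padded_tokens(lengths[4:])) # long bucket: 256 * 4 = 1024

print(f"shared-key efficiency: {real / mixed:.0%}")    # 52%
print(f"bucketed efficiency:   {real / bucketed:.0%}") # 100%
```

With a single shared key, every short sequence is padded out to the longest element in the whole batch; with length buckets, padding is bounded by the longest element within each bucket.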
What changed
Testing and results
Added test_padding_efficiency_bimodal in util_test.py to represent a bimodal workload of short and long sequences.
Observed result: padding efficiency improved from 68% to 77% on the bimodal data.
Interpretation: with length buckets, each batch pads only to the longest element within its bucket rather than the longest element overall, so less compute is wasted on padding.