Skip to content

Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers#7587

Merged
stephentoub merged 3 commits intodotnet:mainfrom
stephentoub:protobufdep
Mar 11, 2026
Merged

Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers#7587
stephentoub merged 3 commits intodotnet:mainfrom
stephentoub:protobufdep

Conversation

@stephentoub
Copy link
Member

Instead of an ~4700 line SentencepieceModel.cs and the Google.Protobuf dependency, we can use an ~400 line minimal parser implementation.

Instead of an ~4700 line SentencepieceModel.cs and the Google.Protobuf dependency, we can use an ~400 line minimal parser implementation.
@stephentoub
Copy link
Member Author

cc: @crickman

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the Google.Protobuf dependency from Microsoft.ML.Tokenizers by replacing the large generated SentencepieceModel.cs with a small, purpose-built protobuf wire-format reader/parser for the subset of SentencePiece fields the tokenizer consumes.

Changes:

  • Replace the generated SentencePiece protobuf model types with a minimal internal parser (SentencepieceModel.cs).
  • Remove the Google.Protobuf package reference and adjust warning suppression to project-level.
  • Add/extend tests covering SentencePieceTokenizer.Create(...) / LlamaTokenizer.Create(...) null/invalid stream scenarios.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/Microsoft.ML.Tokenizers.Tests/SentencePieceTests.cs Adds basic factory/stream validation tests for SentencePieceTokenizer.
test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs Adds a null-stream guard test for LlamaTokenizer.Create.
src/Microsoft.ML.Tokenizers/SentencepieceModel.cs Replaces generated protobuf code with a minimal wire-format reader and parsers for ModelProto/TrainerSpec/NormalizerSpec.
src/Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs Removes local warning pragmas now suppressed at project level.
src/Microsoft.ML.Tokenizers/Model/BertOptions.cs Removes local warning pragmas now suppressed at project level.
src/Microsoft.ML.Tokenizers/Microsoft.ML.Tokenizers.csproj Drops Google.Protobuf reference and adds NoWarn for MSML_NoInstanceInitializers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@stephentoub
Copy link
Member Author

@ericstj / @tarekgh, the nuget config looks broken:

/home/cloudtest_azpcontainer/.nuget/packages/microsoft.dotnet.arcade.sdk/11.0.0-beta.25626.7/tools/Tools.proj : error NU1301: Unable to load the service index for source https://dnceng.pkgs.visualstudio.com/public/_packaging/darc-pub-dotnet-maintenance-packages-ab95a1f1/nuget/v3/index.json.

@ericstj
Copy link
Member

ericstj commented Mar 10, 2026

The feed was deleted -- we can remove it since the packages are now public.

@codecov
Copy link

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 96.65552% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.54%. Comparing base (70d7603) to head (5b2f4e7).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...icrosoft.ML.Tokenizers.Tests/SentencePieceTests.cs 96.63% 17 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7587      +/-   ##
==========================================
+ Coverage   69.07%   69.54%   +0.47%     
==========================================
  Files        1483     1484       +1     
  Lines      274513   273206    -1307     
  Branches    28285    27919     -366     
==========================================
+ Hits       189625   190011     +386     
+ Misses      77503    75831    -1672     
+ Partials     7385     7364      -21     
Flag Coverage Δ
Debug 69.54% <96.65%> (+0.47%) ⬆️
production 63.81% <ø> (+0.47%) ⬆️
test 89.59% <96.65%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/Microsoft.ML.Tokenizers/Model/BertOptions.cs 100.00% <ø> (ø)
.../Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs 100.00% <ø> (ø)
src/Microsoft.ML.Tokenizers/SentencepieceModel.cs 91.04% <ø> (+70.19%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs 99.85% <100.00%> (+<0.01%) ⬆️
...icrosoft.ML.Tokenizers.Tests/SentencePieceTests.cs 96.63% <96.63%> (ø)

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Handle null backing array in SentencePieceByteString.Span (default struct)
- Add big-endian guard to ReadFloat
- Make ModelProto.Parser readonly
- Optimize ParseFrom with MemoryStream.TryGetBuffer fast-path and pre-sized fallback
- Document last-wins vs merge semantics for non-repeated message fields
- Add comprehensive synthetic protobuf tests (49 SentencePiece tests total)
  covering wire format edge cases, forward compatibility, error handling,
  all field types, and stream path variations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@stephentoub stephentoub enabled auto-merge (squash) March 11, 2026 03:23
@stephentoub stephentoub merged commit def7a4a into dotnet:main Mar 11, 2026
20 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants