Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers#7587
Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers#7587stephentoub merged 3 commits intodotnet:mainfrom
Conversation
Instead of an ~4700 line SentencepieceModel.cs and the Google.Protobuf dependency, we can use an ~400 line minimal parser implementation.
|
cc: @crickman |
There was a problem hiding this comment.
Pull request overview
This PR removes the Google.Protobuf dependency from Microsoft.ML.Tokenizers by replacing the large generated SentencepieceModel.cs with a small, purpose-built protobuf wire-format reader/parser for the subset of SentencePiece fields the tokenizer consumes.
Changes:
- Replace the generated SentencePiece protobuf model types with a minimal internal parser (
SentencepieceModel.cs). - Remove the
Google.Protobufpackage reference and adjust warning suppression to project-level. - Add/extend tests covering
SentencePieceTokenizer.Create(...)/LlamaTokenizer.Create(...)null/invalid stream scenarios.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/SentencePieceTests.cs | Adds basic factory/stream validation tests for SentencePieceTokenizer. |
| test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs | Adds a null-stream guard test for LlamaTokenizer.Create. |
| src/Microsoft.ML.Tokenizers/SentencepieceModel.cs | Replaces generated protobuf code with a minimal wire-format reader and parsers for ModelProto/TrainerSpec/NormalizerSpec. |
| src/Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs | Removes local warning pragmas now suppressed at project level. |
| src/Microsoft.ML.Tokenizers/Model/BertOptions.cs | Removes local warning pragmas now suppressed at project level. |
| src/Microsoft.ML.Tokenizers/Microsoft.ML.Tokenizers.csproj | Drops Google.Protobuf reference and adds NoWarn for MSML_NoInstanceInitializers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@ericstj / @tarekgh, the nuget config looks broken: |
|
The feed was deleted -- we can remove it since the packages are now public. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7587 +/- ##
==========================================
+ Coverage 69.07% 69.54% +0.47%
==========================================
Files 1483 1484 +1
Lines 274513 273206 -1307
Branches 28285 27919 -366
==========================================
+ Hits 189625 190011 +386
+ Misses 77503 75831 -1672
+ Partials 7385 7364 -21
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
- Handle null backing array in SentencePieceByteString.Span (default struct) - Add big-endian guard to ReadFloat - Make ModelProto.Parser readonly - Optimize ParseFrom with MemoryStream.TryGetBuffer fast-path and pre-sized fallback - Document last-wins vs merge semantics for non-repeated message fields - Add comprehensive synthetic protobuf tests (49 SentencePiece tests total) covering wire format edge cases, forward compatibility, error handling, all field types, and stream path variations Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of an ~4700 line SentencepieceModel.cs and the Google.Protobuf dependency, we can use an ~400 line minimal parser implementation.