Replies: 2 comments 2 replies
Impressive!
Think adding possessive quantifiers would be much work? It's actually used (but disabled due to not working everywhere) for e.g. BailingMoE.
Could perhaps integrate this in
The current STL regex approach for the `gpt4o` pretokenizer used by `gpt-oss` is causing a few issues for people.

The core problem is that `std::regex` uses recursive backtracking: large inputs can exceed the engine's internal stack limit, causing it to throw or fail. It also doesn't support PCRE syntax, so there's no `\p{L}` or other Unicode categories, and llama.cpp has to use manually translated patterns that collapse Unicode ranges to ASCII (e.g., `\p{L}` becomes `[A-Za-z]`).

Researching the lore, @ggerganov hinted at creating a C++ solution for each pattern. However, writing an equivalent C++ solution by hand is tedious, and doing it for every pattern might be too much. This proposal automates that process.
## Proposal
I created pcre-to-cpp. It takes a PCRE pattern and generates a C++ function that produces the BPE offsets. It does so by parsing the pattern into an AST, then visiting the AST to emit C++ code:
This isn't intended to be a separate project, just a way to share and gauge interest.
```mermaid
flowchart TD
    A["PCRE Pattern"] --> B["Recursive Descent Parser"]
    B --> C["AST"]
    C --> D["AST Optimizer"]
    D --> E["C++ Emitter"]
    E --> F["C++ Function"]
```

## Usage
Example:
```shell
python pcre_to_cpp.py \
    --pattern "'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+" \
    --name "gpt2" \
    --output gpt2_split.cpp
```

## Details
The script is written in pure Python with no dependencies, at around 1800 LOC. It was created with substantial AI assistance.
## Benefits
- No dependency on `std::regex`.
- Supports `\p{L}`, `\p{N}`, lookaheads, lazy quantifiers (`*?`, `+?`, `??`), possessive quantifiers (`*+`, `++`, `?+`), etc.
- Validated against the `tokenizers` library with a 100% match rate.

## Benchmarks
Across 25 test strings with 1000 iterations each:
Results: https://github.com/aldehir/pcre-to-cpp/blob/main/logs/benchmark.log
The STL patterns aren't identical to PCRE: they are converted to ASCII forms, since `std::regex` can't handle `\p{...}`, so token counts may differ slightly.

## Limitations
- Only the `Split` pretokenizer type with `inverse = false` is supported. I believe all of the patterns used in llama.cpp are of this type, but feel free to correct me if I am wrong.

## Integration Path
- Add `pcre_to_cpp.py` to `scripts/` and use it to generate the `unicode_regex_split_*` functions.

## Show Me the Code
Generated examples: https://github.com/aldehir/pcre-to-cpp/blob/main/examples/
If there is interest, I can submit the PR and add a custom implementation for `gpt4o` to address the issues above.

> [!IMPORTANT]
> The `pcre_to_cpp.py` script was written with AI assistance.