Replies: 2 comments 2 replies
Impressive!
Think adding possessive quantifiers would be much work? It's actually used (but disabled due to not working everywhere) for e.g. BailingMoE.
Could perhaps integrate this in
The current STL regex approach for the `gpt4o` pretokenizer used by `gpt-oss` is causing a few issues for people.

The core problem is that `std::regex` uses recursive backtracking: large inputs can exceed the engine's internal stack limit, causing it to throw or fail. It also doesn't support PCRE syntax, so there's no `\p{L}` or other Unicode categories, and llama.cpp has to use manually translated patterns that collapse Unicode ranges to ASCII (e.g., `\p{L}` becomes `[A-Za-z]`).

Researching the lore, @ggerganov hinted at creating a C++ solution for each pattern. However, writing an equivalent C++ solution by hand is tedious, and doing it for every pattern might be too much. This proposal automates that process.
## Proposal
I created pcre-to-cpp. It takes a PCRE pattern and generates a C++ function that produces the BPE offsets. It does so by parsing the pattern into an AST, then visiting the AST to emit C++ code:
This isn't intended to be a separate project, just a way to share and gauge interest.
```mermaid
flowchart TD
    A["PCRE Pattern"] --> B["Recursive Descent Parser"]
    B --> C["AST"]
    C --> D["AST Optimizer"]
    D --> E["C++ Emitter"]
    E --> F["C++ Function"]
```

## Usage
Example:
```shell
python pcre_to_cpp.py \
    --pattern "'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+" \
    --name "gpt2" \
    --output gpt2_split.cpp
```

## Details
The script is written in pure Python with no dependencies, at around 1800 LOC. It was created with substantial AI assistance.
## Benefits
- No dependency on `std::regex`.
- Supports `\p{L}`, `\p{N}`, lookaheads, lazy quantifiers (`*?`, `+?`, `??`), possessive quantifiers (`*+`, `++`, `?+`), etc.
- Validated against the `tokenizers` library with a 100% match rate.

## Benchmarks
Across 25 test strings with 1000 iterations each:
Results: https://github.com/aldehir/pcre-to-cpp/blob/main/logs/benchmark.log
The STL patterns aren't identical to PCRE: they are converted to ASCII forms, since `std::regex` can't handle `\p{...}`, so token counts may differ slightly.

## Limitations
- Only the `Split` pretokenizer type with `inverse = false` is supported. I believe all of the patterns used in llama.cpp are of this type, but feel free to correct me if I am wrong.

## Integration Path
- Add `pcre_to_cpp.py` to `scripts/` and use it to generate the `unicode_regex_split_*` functions.

## Show Me the Code
Generated examples: https://github.com/aldehir/pcre-to-cpp/blob/main/examples/
If there is interest, I can submit the PR and add a custom implementation for `gpt4o` to address the issues above.

> [!IMPORTANT]
> The `pcre_to_cpp.py` script was written with AI assistance.