Turee commented Dec 12, 2025

Add support for encoder-decoder models in llama-server, matching the behavior of llama-cli. This enables translation models like MADLAD and other T5-based models to work with the server.

Changes:

  • Add has_encoder flag to detect encoder-decoder models at load time
  • Implement llama_encode() call for encoder-decoder prompt processing
  • Use decoder_start_token to initialize decoder after encoding
  • Clear decoder KV cache before each new request (no prefix caching; see the sketch after this list)
  • Disable incompatible features for encoder-decoder models:
    • Context shift (encoder outputs are fixed)
    • Speculative decoding (not supported)
    • Warn about prompt caching differences
  • Add edge case handling for empty text tokens
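
A minimal sketch of that per-request reset, assuming the current llama_memory_* C API (older builds expose the same operation as llama_kv_cache_seq_rm); the helper and variable names are illustrative, not the literal diff:

```cpp
#include "llama.h"

// Hypothetical helper: reset a server slot's decoder state before a new request
// on an encoder-decoder model. Because the encoder output belongs to a single
// prompt, the decoder KV cache is dropped instead of reusing a cached prefix.
static void reset_decoder_slot(llama_context * ctx, llama_seq_id seq_id) {
    // remove the full token range for this slot's sequence
    // (older API name: llama_kv_cache_seq_rm(ctx, seq_id, -1, -1))
    llama_memory_seq_rm(llama_get_memory(ctx), seq_id, -1, -1);
}
```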

The encoder processes the full prompt, then the decoder generates output using cross-attention to the encoder's hidden states.
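
For reference, this mirrors the encoder-decoder path in llama-cli. A rough sketch of that flow, assuming the current llama.cpp C API (the has_encoder flag presumably wraps llama_model_has_encoder(); exact names vary slightly between versions):

```cpp
#include "llama.h"
#include <vector>

// Sketch only: encode the full prompt once, then return the token that seeds
// the decoder; generation continues with the usual llama_decode() loop, where
// the decoder cross-attends to the encoder's hidden states.
static llama_token run_encoder_and_get_start_token(llama_context * ctx,
                                                   const llama_model * model,
                                                   std::vector<llama_token> & prompt_tokens) {
    // 1. run the encoder over the whole tokenized prompt
    llama_batch batch = llama_batch_get_one(prompt_tokens.data(), (int32_t) prompt_tokens.size());
    if (llama_encode(ctx, batch) != 0) {
        return LLAMA_TOKEN_NULL; // the server would fail this request
    }

    // 2. the decoder starts from decoder_start_token, falling back to BOS
    //    when the model does not define one
    llama_token decoder_start = llama_model_decoder_start_token(model);
    if (decoder_start == LLAMA_TOKEN_NULL) {
        decoder_start = llama_vocab_bos(llama_model_get_vocab(model));
    }
    return decoder_start;
}
```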

Cursor with Opus 4.5 was heavily used to develop this change.

Turee marked this pull request as draft December 12, 2025 11:06

Turee commented Dec 12, 2025

Ran CPU-only tests:

100% tests passed, 0 tests failed out of 38

Label Time Summary:
main = 268.73 sec*proc (38 tests)


Turee commented Dec 12, 2025

Performance seems unaffected

(add-enc-dec-model-support)> ./build/bin/llama-bench -m ~/Downloads/gemma-7b.Q2_K.gguf
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M3 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 28991.03 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma 7B Q2_K - Medium         |   3.24 GiB |     8.54 B | Metal,BLAS |      10 |           pp512 |        503.37 ± 0.50 |
| gemma 7B Q2_K - Medium         |   3.24 GiB |     8.54 B | Metal,BLAS |      10 |           tg128 |         45.38 ± 0.06 |

build: 80305701 (7362)
 ((a81a5695))> ./build/bin/llama-bench -m ~/Downloads/gemma-7b.Q2_K.gguf
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M3 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 28991.03 MB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma 7B Q2_K - Medium         |   3.24 GiB |     8.54 B | Metal,BLAS |      10 |           pp512 |        503.40 ± 0.31 |
| gemma 7B Q2_K - Medium         |   3.24 GiB |     8.54 B | Metal,BLAS |      10 |           tg128 |         45.03 ± 0.26 |

build: a81a5695 (7361)


Turee commented Dec 12, 2025

Perplexity before:
Final estimate: PPL = 3672.3405 +/- 94.20817
After:
Final estimate: PPL = 3672.3405 +/- 94.20817

Turee marked this pull request as ready for review December 12, 2025 12:03