Name and Version
version: 7157 (583cb83)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX 5060 Ti + 9950x + 96 GB RAM
Models
gpt-oss-120b (launched via the --gpt-oss-120b-default preset; see below for extra parameters that might affect it)
Problem description & steps to reproduce
I am running llama-server with the following flags (assembled into a single launch sketch below the list):
--gpt-oss-120b-default
--ctx-size 0
--kv-unified
--jinja
--chat-template-kwargs {\"reasoning_effort\":\"high\"}
-ub 2048 -b 2048
--cpu-moe --n-gpu-layers 999
--prio -1
--parallel 8
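For convenience, here is the same invocation assembled into a single launch call. This is just a sketch, assuming llama-server is on PATH; the flags are exactly the ones listed above, passed without shell escaping:

```python
import subprocess

# Launch sketch: same flags as listed above, as an argv list so no
# shell quoting is needed. Blocks while the server runs.
cmd = [
    "llama-server",
    "--gpt-oss-120b-default",
    "--ctx-size", "0",
    "--kv-unified",
    "--jinja",
    "--chat-template-kwargs", '{"reasoning_effort":"high"}',
    "-ub", "2048", "-b", "2048",
    "--cpu-moe", "--n-gpu-layers", "999",
    "--prio", "-1",
    "--parallel", "8",
]
subprocess.run(cmd, check=True)
```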
On top of this I run two parallel agentic workflows: one via llama.vscode, the other via codex.
Both contexts are far from 64K (each somewhere in the 20K range). Yet llama-server falls back to full prompt re-processing "due to lack of cache data".
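To reproduce without the real agents, something like the following stand-in should exercise the same pattern. This is my own sketch, assuming the server's default 127.0.0.1:8080 address and its OpenAI-compatible /v1/chat/completions endpoint; the agent names are just labels:

```python
import json
import threading
import urllib.request

BASE = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server default port

def agent(name: str, rounds: int = 3) -> None:
    # Grow the conversation each round so the server should extend the
    # cached prefix instead of recomputing it, mirroring an agentic loop.
    messages = [{"role": "user",
                 "content": f"[{name}] step 1: describe llama.cpp KV caching."}]
    for step in range(2, rounds + 2):
        body = json.dumps({"messages": messages, "max_tokens": 256}).encode()
        req = urllib.request.Request(
            BASE, data=body, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"[{name}] step {step}: continue."})

# Two concurrent "workflows", like llama.vscode + codex in my setup.
threads = [threading.Thread(target=agent, args=(name,))
           for name in ("vscode", "codex")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```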
First Bad Commit
No response
Relevant log output
srv params_from_: Chat format: GPT-OSS
slot get_availabl: id 7 | task -1 | selected slot by LCP similarity, sim_best = 0.901 (> 0.100 thold), f_keep = 0.823
slot launch_slot_: id 7 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 7 | task 60314 | processing task
slot update_slots: id 7 | task 60314 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 27291
slot update_slots: id 7 | task 60314 | n_past = 24582, slot.prompt.tokens.size() = 29882, seq_id = 7, pos_min = 29755, n_swa = 128
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
slot update_slots: id 7 | task 60314 | failed to restore context checkpoint (pos_min = 21700, pos_max = 24516, size = 99.068 MiB)
slot update_slots: id 7 | task 60314 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
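For what it's worth, my reading of the numbers in that log, stated as an assumption based on the SWA discussion in #13194 rather than on the actual server code:

```python
# My interpretation of the log values above; the real check lives in
# llama.cpp's server code, so treat this purely as an assumption.
n_past  = 24582   # where the new prompt diverges from the cached tokens
pos_min = 29755   # oldest position still held in the SWA KV cache
n_swa   = 128     # sliding window size reported for this model

# With SWA, anything older than (pos_min - n_swa) has been evicted, so
# the cache can only resume at or after this position:
oldest_resumable = pos_min - n_swa   # 29627

if n_past < oldest_resumable:
    # The divergence point falls inside the evicted region, so the server
    # needs a context checkpoint (here: pos 21700..24516). That restore
    # failed ("failed to find available cells in kv cache"), which is
    # what forced the full re-processing.
    print("cache cannot serve the common prefix; checkpoint required")
```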