Performance of llama.cpp in batch processing mode #18030
Replies: 2 comments
Now the performance of llama.cpp on CPU only. Note that from 64 tasks (128 generated tokens) onward, the CPU frequency was throttled by approximately 10%; that's why I made a separate run for 64 tasks only, without throttling, which shows a 15% throughput increase.
Processor: Intel Ultra 9 285K (Arrow Lake)
Processor: Intel Ultra 9 285K (Arrow Lake), no throttling
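A minimal sketch of how such a CPU-only run could be reproduced, assuming the llama-batched-bench command proposed in this discussion (this is my assumption, not necessarily the exact command used for these results):

```bash
# Assumed CPU-only variant of the proposed benchmark: -ngl 0 keeps all layers
# on the CPU, and the GPU-specific flags (-mg, -ts) are dropped.
./llama-batched-bench -m "../models/Phi-4-mini-instruct-Q4_K_M.gguf" \
  -ngl 0 -fa on -n 256 -c 17000 -b 2048 -ub 512 \
  -npp 128 -pps -ntg 128,256 -npl 1,2,4,8,16,32,64,128 -kvu \
  -p "who is napoleon?"
```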
Here is a table with the performance results, sorted in descending order from best to worst.
The table will be expanded when additional speed tests arrive.
To compare different devices correctly, we need a common baseline that doesn't change when switching devices. Such a baseline is proposed here, and the performance of a few devices is shown. Other participants are encouraged to post similar results for their devices.
First, we need to understand that the single-task performance of a device doesn't reflect the true capability of the hardware. Instead, we need to compare performance in batched mode, when many requests (user queries) are processed in parallel. For this we have a tool from the llama.cpp toolset: llama-batched-bench. It can batch up to 256 tasks simultaneously on one device, so device performance is measured with much better precision: for example, for the RTX 3090 we get approximately 15 times higher throughput compared to plain single-task mode.
For the test the following model is selected: Phi-4-mini-instruct-Q4_K_M.gguf, which uses 4-bit quantization.
The model is small enough to fit into the 6 GB of memory of older, but still capable, devices. By itself the 4-bit quantized model fits even into 3 GB, but for the test we need a lot of extra memory to hold the simultaneous requests and the KV cache for all of them. On 6 GB devices the small model therefore allows exploring the processing of many simultaneous tasks (up to 128 requests at once). If your device has much more memory, you can still increase the request number up to 256. Higher numbers do not make sense, because even for relatively capable devices like the RTX 3090 we see the throughput decrease when 256 tasks are processed at once. And, unfortunately, the current version of the llama.cpp batched test allows a maximum of 256 tasks, so, to avoid complicating things with a requirement to find (or write) a new version of the test, it is better to use the common solution, which is llama-batched-bench.
Now let's briefly discuss the llama-batched-bench start parameters.
The following command line is proposed:

./llama-batched-bench -m "../models/Phi-4-mini-instruct-Q4_K_M.gguf" -ngl 100 -mg 0 -ts 100,0 -fa on -n 256 -c 17000 -b 2048 -ub 512 -npp 128 -pps -ntg 128,256 -npl 1,2,4,8,16,32,64,128 -kvu -p "who is napoleon?"

The critical parts of it are the context size (-c 17000) and the comma-separated list of task counts (-npl 1,2,4,8,16,32,64,128). The 17K-token context is enough for the test, and it fits into 6 GB of memory. If you have more memory, you can increase it to 34000 and then run up to 256 tasks simultaneously by adding 256 to the task number list, like this: -npl 1,2,4,8,16,32,64,128,256.
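For a rough sense of where these context sizes come from (my own arithmetic, not from the post): with a shared prompt (-pps), the KV cache needs approximately npp + npl × ntg tokens, so for 128 generated tokens per task the proposed values line up:

```bash
# Assumed KV budget with a shared prompt: ~ npp + npl * ntg tokens
echo $((128 + 128*128))   # 16512 tokens -> fits in -c 17000 (128 tasks)
echo $((128 + 256*128))   # 32896 tokens -> needs -c 34000 (256 tasks)
```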
The other parameters are annotated in the sketch below.
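The annotations are based on my reading of the llama.cpp CLI help, so treat them as assumptions rather than documentation:

```bash
# The proposed command, with assumed meanings of the flags:
#   -ngl 100   offload up to 100 layers to the GPU (effectively the whole model)
#   -mg 0      main GPU index
#   -ts 100,0  tensor split: keep all tensors on GPU 0
#   -fa on     enable flash attention
#   -n 256     maximum number of tokens to predict
#   -c 17000   context size in tokens, shared by all parallel sequences
#   -b 2048    logical batch size
#   -ub 512    physical (micro) batch size
#   -npp 128   number of prompt tokens
#   -pps       share the prompt across the parallel sequences
#   -ntg       tokens to generate per sequence (two test values: 128 and 256)
#   -npl       list of parallel task counts to test
#   -kvu       unified KV-cache buffer for all sequences (my best guess; verify with --help)
#   -p         prompt text
./llama-batched-bench -m "../models/Phi-4-mini-instruct-Q4_K_M.gguf" \
  -ngl 100 -mg 0 -ts 100,0 -fa on -n 256 -c 17000 -b 2048 -ub 512 \
  -npp 128 -pps -ntg 128,256 -npl 1,2,4,8,16,32,64,128 -kvu \
  -p "who is napoleon?"
```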
Now the test results. The model is Phi-4-mini-instruct-Q4_K_M.gguf. The columns are:
The most interesting column here is S_TG t/s: the token generation speed. A higher value reflects higher performance of your device.
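As a sanity check for any row, the generation throughput should roughly equal the total number of generated tokens divided by the generation time (this is my understanding of how llama-batched-bench derives S_TG, so treat it as an assumption):

```bash
# Assumed relation: S_TG [t/s] ~ (npl * ntg) / T_TG [s]
# Hypothetical example: 128 parallel tasks, 128 generated tokens each, in 60 s
echo "scale=1; 128 * 128 / 60" | bc   # ~ 273 t/s
```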
NVIDIA P102-100, compute capability 6.1
NVIDIA GeForce RTX 3060, compute capability 8.6
NVIDIA GeForce RTX 3090, compute capability 8.6