Performance of llama.cpp in batch processing mode #18030
Replies: 2 comments
Now the performance of llama.cpp on CPU only. Note that from 64 tasks (128 generated tokens) onward, the CPU frequency was throttled by approximately 10%; that's why I made a separate run for 64 tasks only, without throttling, which shows a 15% throughput increase.
Processor: Intel Ultra 9 285K (Arrow Lake)
Processor: Intel Ultra 9 285K (Arrow Lake), no throttling
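A minimal sketch of how such a CPU-only run could be reproduced, assuming the llama-batched-bench command proposed in this discussion (this is my assumption, not necessarily the exact command used for these results):

```bash
# Assumed CPU-only variant of the proposed benchmark: -ngl 0 keeps all layers
# on the CPU, and the GPU-specific flags (-mg, -ts) are dropped.
./llama-batched-bench -m "../models/Phi-4-mini-instruct-Q4_K_M.gguf" \
  -ngl 0 -fa on -n 256 -c 17000 -b 2048 -ub 512 \
  -npp 128 -pps -ntg 128,256 -npl 1,2,4,8,16,32,64,128 -kvu \
  -p "who is napoleon?"
```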
Here is a table with the performance results, sorted in descending order from best to worst.
The table will be expanded when additional speed tests arrive.
To compare different devices correctly, we need a common baseline that doesn't change when switching devices. Such a baseline is proposed here, and the performance of a few devices is shown. Other participants are encouraged to post similar results for their devices.
First, we need to understand that the single-task performance of a device doesn't reflect the true capability of the hardware. Instead, we need to compare performance in batched mode, when many requests (user queries) are processed in parallel. For this we have a tool from the llama.cpp toolset: llama-batched-bench. It can batch up to 256 tasks simultaneously on one device, so device performance is measured with much better precision: for example, for the RTX 3090 we get approximately 15 times higher throughput compared to plain single-task mode.
For the test the following model is selected: Phi-4-mini-instruct-Q4_K_M.gguf, which uses 4-bit quantization.
The model is small enough to fit into the 6 GB of memory of older, but still capable, devices. By itself the 4-bit quantized model fits even into 3 GB, but for the test we need a lot of extra memory to hold the simultaneous requests and the KV cache for all of them. On 6 GB devices the small model therefore allows exploring the processing of many simultaneous tasks (up to 128 requests at once). If your device has much more memory, you can still increase the request number up to 256. Higher numbers do not make sense, because even for relatively capable devices like the RTX 3090 we see the throughput decrease when 256 tasks are processed at once. And, unfortunately, the current version of the llama.cpp batched test allows a maximum of 256 tasks, so, to avoid complicating things with a requirement to find (or write) a new version of the test, it is better to use the common solution, which is llama-batched-bench.
Now let's briefly discuss the llama-batched-bench start parameters.
The following command line is proposed:

./llama-batched-bench -m "../models/Phi-4-mini-instruct-Q4_K_M.gguf" -ngl 100 -mg 0 -ts 100,0 -fa on -n 256 -c 17000 -b 2048 -ub 512 -npp 128 -pps -ntg 128,256 -npl 1,2,4,8,16,32,64,128 -kvu -p "who is napoleon?"

The critical parts of it are the context size (-c 17000) and the comma-separated list of task counts (-npl 1,2,4,8,16,32,64,128). The 17K-token context is enough for the test, and it fits into 6 GB of memory. If you have more memory, you can increase it to 34000 and then run up to 256 tasks simultaneously by adding 256 to the task number list, like this: -npl 1,2,4,8,16,32,64,128,256.
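For a rough sense of where these context sizes come from (my own arithmetic, not from the post): with a shared prompt (-pps), the KV cache needs approximately npp + npl × ntg tokens, so for 128 generated tokens per task the proposed values line up:

```bash
# Assumed KV budget with a shared prompt: ~ npp + npl * ntg tokens
echo $((128 + 128*128))   # 16512 tokens -> fits in -c 17000 (128 tasks)
echo $((128 + 256*128))   # 32896 tokens -> needs -c 34000 (256 tasks)
```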
The other parameters are annotated in the sketch below.
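The annotations are based on my reading of the llama.cpp CLI help, so treat them as assumptions rather than documentation:

```bash
# The proposed command, with assumed meanings of the flags:
#   -ngl 100   offload up to 100 layers to the GPU (effectively the whole model)
#   -mg 0      main GPU index
#   -ts 100,0  tensor split: keep all tensors on GPU 0
#   -fa on     enable flash attention
#   -n 256     maximum number of tokens to predict
#   -c 17000   context size in tokens, shared by all parallel sequences
#   -b 2048    logical batch size
#   -ub 512    physical (micro) batch size
#   -npp 128   number of prompt tokens
#   -pps       share the prompt across the parallel sequences
#   -ntg       tokens to generate per sequence (two test values: 128 and 256)
#   -npl       list of parallel task counts to test
#   -kvu       unified KV-cache buffer for all sequences (my best guess; verify with --help)
#   -p         prompt text
./llama-batched-bench -m "../models/Phi-4-mini-instruct-Q4_K_M.gguf" \
  -ngl 100 -mg 0 -ts 100,0 -fa on -n 256 -c 17000 -b 2048 -ub 512 \
  -npp 128 -pps -ntg 128,256 -npl 1,2,4,8,16,32,64,128 -kvu \
  -p "who is napoleon?"
```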
Now the test results. The model is Phi-4-mini-instruct-Q4_K_M.gguf. The columns are:
The most interesting column here is S_TG t/s: the token generation speed. A higher value reflects higher performance of your device.
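As a sanity check for any row, the generation throughput should roughly equal the total number of generated tokens divided by the generation time (this is my understanding of how llama-batched-bench derives S_TG, so treat it as an assumption):

```bash
# Assumed relation: S_TG [t/s] ~ (npl * ntg) / T_TG [s]
# Hypothetical example: 128 parallel tasks, 128 generated tokens each, in 60 s
echo "scale=1; 128 * 128 / 60" | bc   # ~ 273 t/s
```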
NVIDIA P102-100, compute capability 6.1
NVIDIA GeForce RTX 3060, compute capability 8.6
NVIDIA GeForce RTX 3090, compute capability 8.6