Performance of llama.cpp on NVIDIA Grace Hopper GH200 (+optimizations) #18005
fairydreaming started this conversation in Show and tell
Replies: 1 comment
@JohannesGaessler Do you have any experience with NVIDIA unified memory architecture systems? Are there any obvious optimizations for running large MoE LLMs like DeepSeek V3/R1 or Kimi K2 Thinking that I could try on GH200? Things I tried so far:
Things to try some day:
- 2025-12-14 - Updated the patch (tensors kept in CPU memory reduced to blk.*.ffn_(up|gate|down)_exps.weight); this results in a minor performance uplift (+1-2 t/s in generation)
Introduction
I had brief access to an NVIDIA Grace Hopper GH200 system kindly shared by GPTshop and wanted to share the results of some benchmarks I ran on it.
System info
Performance of small/medium reference models
I ran the models below on the unmodified c6f6e4f revision of llama.cpp - first with a CPU-only build on the Grace CPU and then with a CUDA build on the Hopper GPU.
There's a large difference in CPU token generation performance depending on the number of threads used. The Grace CPU has 72 cores, but the optimal number of threads for token generation is around 32-36. That's why I ran llama-bench twice during the CPU benchmarks: first with 32 threads (optimal for token generation) and then with 72 threads (optimal for prompt processing).
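To find that sweet spot on a similar system, note that llama-bench accepts comma-separated lists for most parameters, so a single run can sweep the thread count. A sketch, reusing the gpt-oss-20b model path from the benchmarks below:

./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -p 2048 -n 32 -ub 2048 -t 16,24,32,36,48,64,72 -mmp 0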
ggml-org/gpt-oss-20b-GGUF
Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
CPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 32 -mmp 0
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 72 -mmp 0
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32 -t 32 -tb 72 --no-mmap
GPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32
ggml-org/gpt-oss-120b-GGUF
Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
CPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 32 -mmp 0
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 72 -mmp 0
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32 -t 32 -tb 72 --no-mmap
GPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -c 150000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16 --no-mmap
Performance of very large LLMs
When I tried to run very large LLMs like DeepSeek V3.1 or Kimi K2 Thinking on the GH200 using the unified memory of Grace Hopper (with the GGML_CUDA_ENABLE_UNIFIED_MEMORY environment variable set to 1), I noticed that llama.cpp performed very poorly:
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1
After some investigation into how the unified memory of the GH200 chip works, I created a simple experimental patch that advised CUDA to keep the model experts in CPU memory and all remaining tensors in GPU memory during tensor initialization. The patch is below:
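In essence, each managed (unified memory) allocation gets a placement hint. A minimal sketch of such logic using cudaMemAdvise (simplified, not the actual patch - the helper name and its call site are assumed, and error handling is omitted):

// Hypothetical sketch of the placement-hint logic described above, not the actual patch.
// Assumes the tensor data lives in a CUDA managed (unified memory) allocation and that
// this helper is called once per tensor during tensor initialization.
#include <cuda_runtime.h>
#include <regex>
#include <string>

static void advise_tensor_placement(const std::string & name, void * data, size_t size, int device) {
    // Only the MoE expert weights stay in Grace CPU (LPDDR5X) memory;
    // the pattern matches the tensor list from the 2025-12-14 update.
    static const std::regex experts_re("blk\\..*\\.ffn_(up|gate|down)_exps\\.weight");

    if (std::regex_match(name, experts_re)) {
        // Prefer host pages for the experts; the GPU still reads them over NVLink-C2C.
        cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
        cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, device);
    } else {
        // Attention, norms, dense FFN and all other tensors prefer the GPU's HBM.
        cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, device);
    }
    // Return codes are ignored for brevity in this sketch.
}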
This patch considerably improved the performance of large models:
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1
Below are detailed benchmark results from the patched c6f6e4f llama.cpp revision.
unsloth/DeepSeek-V3.1-Terminus-GGUF Q4_K_M
Model: https://huggingface.co/unsloth/DeepSeek-V3.1-Terminus-GGUF Q4_K_M
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1 -d 0,4096,8192,16384,32768,65536 -p 2048 -n 32 -ub 2048
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-batched-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32
unsloth/Kimi-K2-Thinking-GGUF Q3_K_M
Model: https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF Q3_K_M
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/Kimi-K2-Thinking-Q3_K_M-00001-of-00011.gguf -fa 1 -d 0,4096,8192,16384,32768,65536,131072 -p 2048 -n 32 -ub 2048
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-batched-bench -m ~/fairydreaming/models/Kimi-K2-Thinking-Q3_K_M-00001-of-00011.gguf -fa 1 -c 150000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16
Encountered problems
During my experiments I noticed that, when using unified memory, llama.cpp sometimes (once or twice a day) failed to terminate cleanly after being interrupted with Ctrl+C or during exit. The kernel logs contained stack traces indicating a problem in the NVIDIA drivers:
This resulted in an unresponsive system that was unable to shut down cleanly and could only be restarted by power cycling.
I couldn't find any information about this problem via Google search or on the NVIDIA forums.
Final words
Let me know if there are any other obvious optimizations that I could try.