Performance of llama.cpp on NVIDIA Grace Hopper GH200 (+optimizations) #18005
fairydreaming started this conversation in Show and tell
Replies: 1 comment
@JohannesGaessler Do you have any experience with NVIDIA unified memory architecture systems? Are there any obvious optimizations for running large MoE LLMs like DeepSeek V3/R1 or Kimi K2 Thinking that I could try on GH200? Things I tried so far:
Things to try some day:
- 2025-12-14 - Updated the patch (tensors kept in CPU memory reduced to blk.*.ffn_(up|gate|down)_exps.weight); this results in a minor performance uplift (+1-2 t/s in generation)
Introduction
I had brief access to an NVIDIA Grace Hopper GH200 system kindly shared by GPTshop and wanted to share the results of some benchmarks I ran on it.
System info
Performance of small/medium reference models
I ran the models below on the unmodified c6f6e4f revision of llama.cpp - first with a CPU-only build on the Grace CPU and then with a CUDA build on the Hopper GPU.
There's a large difference in CPU token generation performance depending on the number of threads used. The Grace CPU has 72 cores, but the optimal number of threads for token generation is around 32-36. That's why I ran llama-bench twice during the CPU benchmarks: first with 32 threads (optimal for token generation) and then with 72 threads (optimal for prompt processing).
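To find that sweet spot on a similar system, note that llama-bench accepts comma-separated lists for most parameters, so a single run can sweep the thread count. A sketch, reusing the gpt-oss-20b model path from the benchmarks below:

./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -p 2048 -n 32 -ub 2048 -t 16,24,32,36,48,64,72 -mmp 0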
ggml-org/gpt-oss-20b-GGUF
Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
CPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 32 -mmp 0
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 72 -mmp 0
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32 -t 32 -tb 72 --no-mmap
GPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-20b-mxfp4.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32
ggml-org/gpt-oss-120b-GGUF
Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
CPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 32 -mmp 0
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -t 72 -mmp 0
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32 -t 32 -tb 72 --no-mmap
GPU
./bin/llama-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
./bin/llama-batched-bench -m /home/x/fairydreaming/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -c 150000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16 --no-mmap
Performance of very large LLMs
When I tried to run very large LLMs like DeepSeek V3.1 or Kimi K2 Thinking on the GH200 using the unified memory of Grace Hopper (with the GGML_CUDA_ENABLE_UNIFIED_MEMORY environment variable set to 1), I noticed that llama.cpp performed very poorly:
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1
After some investigation into how the unified memory of the GH200 chip works, I created a simple experimental patch that advised CUDA to keep the model experts in CPU memory and all remaining tensors in GPU memory during tensor initialization. The patch is below:
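In essence, each managed (unified memory) allocation gets a placement hint. A minimal sketch of such logic using cudaMemAdvise (simplified, not the actual patch - the helper name and its call site are assumed, and error handling is omitted):

// Hypothetical sketch of the placement-hint logic described above, not the actual patch.
// Assumes the tensor data lives in a CUDA managed (unified memory) allocation and that
// this helper is called once per tensor during tensor initialization.
#include <cuda_runtime.h>
#include <regex>
#include <string>

static void advise_tensor_placement(const std::string & name, void * data, size_t size, int device) {
    // Only the MoE expert weights stay in Grace CPU (LPDDR5X) memory;
    // the pattern matches the tensor list from the 2025-12-14 update.
    static const std::regex experts_re("blk\\..*\\.ffn_(up|gate|down)_exps\\.weight");

    if (std::regex_match(name, experts_re)) {
        // Prefer host pages for the experts; the GPU still reads them over NVLink-C2C.
        cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
        cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, device);
    } else {
        // Attention, norms, dense FFN and all other tensors prefer the GPU's HBM.
        cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, device);
    }
    // Return codes are ignored for brevity in this sketch.
}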
This patch considerably improved the performance of large models:
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1
Below are detailed benchmark results from the patched c6f6e4f llama.cpp revision.
unsloth/DeepSeek-V3.1-Terminus-GGUF Q4_K_M
Model: https://huggingface.co/unsloth/DeepSeek-V3.1-Terminus-GGUF Q4_K_M
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1 -d 0,4096,8192,16384,32768,65536 -p 2048 -n 32 -ub 2048
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-batched-bench -m ~/fairydreaming/models/DeepSeek-V3.1-Terminus-Q4_K_M-00001-of-00009.gguf -fa 1 -c 300000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32
unsloth/Kimi-K2-Thinking-GGUF Q3_K_M
Model: https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF Q3_K_M
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-bench -m ~/fairydreaming/models/Kimi-K2-Thinking-Q3_K_M-00001-of-00011.gguf -fa 1 -d 0,4096,8192,16384,32768,65536,131072 -p 2048 -n 32 -ub 2048
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./bin/llama-batched-bench -m ~/fairydreaming/models/Kimi-K2-Thinking-Q3_K_M-00001-of-00011.gguf -fa 1 -c 150000 -ub 2048 -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16
Encountered problems
During my experiments I noticed that, when using unified memory, llama.cpp sometimes (once or twice a day) failed to terminate cleanly after being interrupted with Ctrl+C or during exit. The kernel logs contained stack traces indicating a problem in the NVIDIA drivers:
This resulted in an unresponsive system that was unable to shut down cleanly and could only be restarted by power cycling.
I couldn't find any information about this problem via Google search or on the NVIDIA forums.
Final words
Let me know if there are any other obvious optimizations that I could try.