mtmd, llama: add GLM4V vision-language model support #17967
Conversation
I have a feeling you didn't mean to submit this here?
ngxson left a comment:
I read the clip.cpp changes quickly, but I think the solution is not clean enough. Would appreciate it if you could simplify it.
tools/mtmd/clip.cpp
Outdated
```cpp
 * 1. Normalize image (already done by caller)
 * 2. Convert to channel-first [C, H, W]
 * 3. Duplicate temporal dimension [2, C, H, W]
 * 4. Reshape with merge_size=2: [grid_t, temporal, channels, merge_h, merge_size, patch_h, merge_w, merge_size, patch_w]
```
I'm not convinced that this procedure cannot be done using GGML ops:
- converting to channel-first: `ggml_permute`
- duplicating the dim can be `ggml_repeat_4d`, although I'm pretty sure this is redundant since GGML supports broadcasting internally
- reshape is `ggml_reshape`
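A minimal sketch of how the first two of those steps could map onto the GGML ops named above (not code from this PR; the permute axis order and `ggml_repeat_4d`'s signature are assumptions, and the final merge-size reshape is left out since the follow-up below replaces the whole procedure with dual Conv2D anyway):

```cpp
// Sketch only: convert a normalized image to channel-first and duplicate the
// temporal dimension. Shapes use GGML's ne order [ne0, ne1, ne2, ne3] and the
// input is assumed to arrive as [W, H, C, 1].
static struct ggml_tensor * to_temporal_channel_first(
        struct ggml_context * ctx,
        struct ggml_tensor  * img /* [W, H, C, 1] */) {
    // channel-first: [W, H, C, 1] -> [C, W, H, 1]
    struct ggml_tensor * chw = ggml_cont(ctx, ggml_permute(ctx, img, 1, 2, 0, 3));
    // duplicate the temporal dimension: [C, W, H, 1] -> [C, W, H, 2]
    // (assumed signature: ggml_repeat_4d(ctx, t, ne0, ne1, ne2, ne3))
    return ggml_repeat_4d(ctx, chw, chw->ne[0], chw->ne[1], chw->ne[2], 2);
}
```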
Appreciate the pointer! Updated to follow the Qwen2VL dual conv2d pattern.
Instead of:
raw_image → permute → repeat_4d → reshape → split → conv2d(k0, frame0) + conv2d(k1, frame1)
Now:
raw_image → conv2d(k0, raw) + conv2d(k1, raw)
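A rough sketch of that dual-Conv2D pattern in GGML (an illustration, not the PR's actual graph; the tensor names, shapes, and the assumption that the Conv3D weight was pre-split into two Conv2D kernels `k0`/`k1` at conversion time are mine):

```cpp
// Because both temporal frames are copies of the same still image, a Conv3D
// patch embedding with temporal kernel size 2 collapses into the sum of two
// Conv2Ds applied directly to the raw image.
static struct ggml_tensor * patch_embed_dual_conv2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * k0,       // temporal slice 0 of the Conv3D weight
        struct ggml_tensor  * k1,       // temporal slice 1 of the Conv3D weight
        struct ggml_tensor  * inp_raw,  // normalized image [W, H, C, 1]
        int patch_size) {
    struct ggml_tensor * f0 = ggml_conv_2d(ctx, k0, inp_raw,
            patch_size, patch_size, 0, 0, 1, 1);
    struct ggml_tensor * f1 = ggml_conv_2d(ctx, k1, inp_raw,
            patch_size, patch_size, 0, 0, 1, 1);
    return ggml_add(ctx, f0, f1);       // [W/patch, H/patch, n_embd, 1]
}
```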
tools/mtmd/clip.cpp
Outdated
```cpp
// - Extract frame_1 = patches[:,:,1,:,:] [576, 3, 14, 14]
// - Conv2d(kernel_0, frame_0) + Conv2d(kernel_1, frame_1) -> [576, 1536]

const int patch_features = 3 * 2 * 14 * 14; // 1176
```
What are 3, 2, and 14? Any reason not to read this info from hparams?
Thanks @ngxson, removed the hard-coded values left over from the HF implementation debugging session. Now pulls from hparams:
- patch_size → hparams.patch_size
- spatial_merge_size → hparams.n_merge
- image_size → hparams.image_size
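For illustration, the constant can then be derived rather than hard-coded, roughly as below (a sketch only; the field names follow the hparams mapping above but are not verified against the PR's code):

```cpp
// n_channels and the duplicated temporal frame stay as named constants;
// everything image-related comes from hparams.
const int n_channels     = 3;   // RGB
const int n_temporal     = 2;   // image duplicated into two frames
const int patch_features = n_channels * n_temporal
                         * hparams.patch_size * hparams.patch_size;  // 3*2*14*14 = 1176 for GLM4V
const int merge_size     = hparams.n_merge;                          // spatial merge (2)
const int n_patches      = (hparams.image_size / hparams.patch_size)
                         * (hparams.image_size / hparams.patch_size);
```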
tools/mtmd/clip.cpp
Outdated
```cpp
cb(frame0, "frame0", -1);
cb(frame1, "frame1", -1);

// Apply Conv2d to each frame (simulates Conv3d with temporal kernel split)
```
I feel like this logic is the same as the Qwen2/Qwen3 models.
Also, just want to remind you that any AI usage must be explicitly stated in the PR description.
tools/mtmd/clip.cpp
Outdated
```cpp
// HF pattern: [h, w, h, w] at chunk_size granularity
// Chunk 0 (dims 0-31):  h position
// Chunk 1 (dims 32-63): w position
// Chunk 2 (dims 64-95): h position
// Chunk 3 (dims 96-127): w position
```
If you are referring to this code, then your implementation is likely incorrect.
Indeed, I suspect that >50% of GLM-V's code is just Qwen-VL with different naming. The (almost) same code can be found in modeling_qwen2_vl.py.
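For reference, a hedged sketch of what per-patch position ids for a Qwen2-VL-style vision M-RoPE pass could look like; the four-section layout and indexing convention here are assumptions, and the authoritative version is the existing Qwen2-VL path in clip.cpp:

```cpp
#include <cstdint>
#include <vector>

// Sketch: one position id per patch and per M-RoPE section, laid out
// section-major (all h indices, then all w indices, ...).
static std::vector<int32_t> build_vision_mrope_positions(int grid_h, int grid_w) {
    const int n_pos = grid_h * grid_w;
    std::vector<int32_t> pos(4 * n_pos);
    for (int y = 0; y < grid_h; y++) {
        for (int x = 0; x < grid_w; x++) {
            const int i = y * grid_w + x;
            pos[i + 0 * n_pos] = y; // h
            pos[i + 1 * n_pos] = x; // w
            pos[i + 2 * n_pos] = y; // h (mirrors the [h,w,h,w] pattern quoted above)
            pos[i + 3 * n_pos] = x; // w
        }
    }
    return pos;
}
```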
Reminding again because some code looks too suspicious compared to #16600.
I refuse to review newer commits if this is unclear.
Add complete GLM4V (GLM-4.6V-Flash) support including:

**Vision Encoder (mtmd):**
- Dual Conv2D patch embedding (simulating Conv3D temporal reduction)
- M-RoPE using ggml_rope_multi() with [h,w,h,w] position pattern
- 2x2 patch merger with downsample convolution
- SwiGLU-based merger FFN

**LLM Architecture (libllama):**
- New LLM_ARCH_GLM4V based on GLM4 with M-RoPE
- Uses LLAMA_ROPE_TYPE_MROPE with rope_sections from model config
- Reuses ggml_rope_multi() (same as Qwen2VL) for position encoding

Key design decisions:
- Vision encoder uses ggml_rope_multi() instead of custom RoPE
- LLM follows GLM4 structure with M-RoPE (not Qwen2VL structure)
- Minimal code: ~300 lines total across all files

Tested with GLM-4.6V-Flash producing correct image descriptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
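As a side note on the SwiGLU-based merger FFN mentioned in that commit message, such a block in GGML generally looks like the sketch below; the weight names and shapes are placeholders, not the PR's tensor names:

```cpp
// Standard SwiGLU: out = W_down * (silu(W_gate * x) elementwise-mul (W_up * x))
static struct ggml_tensor * swiglu_ffn(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,       // [n_in, n_tokens]
        struct ggml_tensor  * w_gate,  // [n_in, n_ff]
        struct ggml_tensor  * w_up,    // [n_in, n_ff]
        struct ggml_tensor  * w_down)  // [n_ff, n_out]
{
    struct ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w_gate, x));
    struct ggml_tensor * up   = ggml_mul_mat(ctx, w_up, x);
    return ggml_mul_mat(ctx, w_down, ggml_mul(ctx, gate, up));  // [n_out, n_tokens]
}
```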
Sorry to bother you, I just wanted to try this PR. I'm trying to convert an mmproj file, but it keeps giving me an error message. What should I do?
lol, I think the author of this PR doesn't actually know what he's doing, but only copy-pastes my comments to Claude Code. I'm closing this PR as it is too noisy. Will have a look a bit later because it's not that complicated.
Summary
Add complete GLM4V (GLM-4.6V-Flash) vision-language model support.
This PR has been completely rebased and rewritten based on maintainer feedback from @ngxson. Key changes from the original submission:
- Reuses `ggml_rope_multi()` (same as Qwen2VL)

Changes
**Vision Encoder (`tools/mtmd/`):**
- Dual Conv2D patch embedding (simulating Conv3D temporal reduction)
- M-RoPE using `ggml_rope_multi()` with `[h,w,h,w]` position pattern
- 2x2 patch merger with downsample convolution
- SwiGLU-based merger FFN

**LLM Architecture (`src/`):**
- New `LLM_ARCH_GLM4V` based on GLM4 structure with M-RoPE
- Uses `LLAMA_ROPE_TYPE_MROPE` with `rope_sections` from the model config
- Reuses `ggml_rope_multi()` for position encoding (shared with Qwen2VL)

Files Changed
- `src/llama-arch.h` (`LLM_ARCH_GLM4V` enum)
- `src/llama-arch.cpp`
- `src/models/glm4v.cpp`
- `src/models/models.h`
- `src/llama-model.cpp`
- `src/CMakeLists.txt`
- `tools/mtmd/models/glm4v.cpp`
- `tools/mtmd/clip.cpp`
- `tools/mtmd/clip-impl.h`
- `tools/mtmd/clip-model.h`
- `tools/mtmd/models/models.h`
- `tools/mtmd/CMakeLists.txt`

Testing
Tested with GLM-4.6V-Flash. Output correctly describes image content.
Design Notes
- Vision encoder uses `ggml_rope_multi` with `GGML_ROPE_TYPE_VISION` (instead of a custom RoPE implementation)
- LLM uses `ggml_rope_multi` with `LLAMA_ROPE_TYPE_MROPE` (same as Qwen2VL)

Addresses feedback from #17967 review.
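For readers unfamiliar with the API, the `ggml_rope_multi()` call pattern these notes refer to looks roughly like the sketch below, assuming the current signature in ggml.h; the section split and every numeric value here are placeholders, not the PR's actual settings:

```cpp
// Sketch of a multi-section (M-RoPE) rope call; `cur` holds query or key
// states, `positions` holds the per-token, per-section position ids.
static struct ggml_tensor * apply_mrope(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * positions,
        int                   n_rot) {
    int sections[4] = { 16, 24, 24, 0 };   // placeholder [t, h, w, e] split
    return ggml_rope_multi(
            ctx, cur, positions, /*freq_factors=*/NULL,
            n_rot, sections,
            GGML_ROPE_TYPE_MROPE,          // the vision encoder would use GGML_ROPE_TYPE_VISION
            /*n_ctx_orig=*/32768,
            /*freq_base=*/10000.0f, /*freq_scale=*/1.0f,
            /*ext_factor=*/0.0f, /*attn_factor=*/1.0f,
            /*beta_fast=*/32.0f, /*beta_slow=*/1.0f);
}
```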