mtmd, llama: add GLM4V vision-language model support #17967
Conversation
I have a feeling you didn't mean to submit this here?
ngxson left a comment:
I read the clip.cpp changes quickly, but I think the solution is not clean enough. Would appreciate it if you could simplify it.
tools/mtmd/clip.cpp
Outdated
```cpp
 * 1. Normalize image (already done by caller)
 * 2. Convert to channel-first [C, H, W]
 * 3. Duplicate temporal dimension [2, C, H, W]
 * 4. Reshape with merge_size=2: [grid_t, temporal, channels, merge_h, merge_size, patch_h, merge_w, merge_size, patch_w]
```
I'm not convinced that this procedure cannot be done using GGML ops:
- converting to channel-first: `ggml_permute`
- duplicating the dim can be `ggml_repeat_4d`, although I'm pretty sure this is redundant since GGML supports broadcasting internally
- reshape is `ggml_reshape`
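A minimal sketch of how the first two of those steps could map onto the GGML ops named above (not code from this PR; the permute axis order and `ggml_repeat_4d`'s signature are assumptions, and the final merge-size reshape is left out since the follow-up below replaces the whole procedure with dual Conv2D anyway):

```cpp
// Sketch only: convert a normalized image to channel-first and duplicate the
// temporal dimension. Shapes use GGML's ne order [ne0, ne1, ne2, ne3] and the
// input is assumed to arrive as [W, H, C, 1].
static struct ggml_tensor * to_temporal_channel_first(
        struct ggml_context * ctx,
        struct ggml_tensor  * img /* [W, H, C, 1] */) {
    // channel-first: [W, H, C, 1] -> [C, W, H, 1]
    struct ggml_tensor * chw = ggml_cont(ctx, ggml_permute(ctx, img, 1, 2, 0, 3));
    // duplicate the temporal dimension: [C, W, H, 1] -> [C, W, H, 2]
    // (assumed signature: ggml_repeat_4d(ctx, t, ne0, ne1, ne2, ne3))
    return ggml_repeat_4d(ctx, chw, chw->ne[0], chw->ne[1], chw->ne[2], 2);
}
```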
Appreciate the pointer! Updated to follow the Qwen2VL dual conv2d pattern.
Instead of:
raw_image → permute → repeat_4d → reshape → split → conv2d(k0, frame0) + conv2d(k1, frame1)
Now:
raw_image → conv2d(k0, raw) + conv2d(k1, raw)
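A rough sketch of that dual-Conv2D pattern in GGML (an illustration, not the PR's actual graph; the tensor names, shapes, and the assumption that the Conv3D weight was pre-split into two Conv2D kernels `k0`/`k1` at conversion time are mine):

```cpp
// Because both temporal frames are copies of the same still image, a Conv3D
// patch embedding with temporal kernel size 2 collapses into the sum of two
// Conv2Ds applied directly to the raw image.
static struct ggml_tensor * patch_embed_dual_conv2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * k0,       // temporal slice 0 of the Conv3D weight
        struct ggml_tensor  * k1,       // temporal slice 1 of the Conv3D weight
        struct ggml_tensor  * inp_raw,  // normalized image [W, H, C, 1]
        int patch_size) {
    struct ggml_tensor * f0 = ggml_conv_2d(ctx, k0, inp_raw,
            patch_size, patch_size, 0, 0, 1, 1);
    struct ggml_tensor * f1 = ggml_conv_2d(ctx, k1, inp_raw,
            patch_size, patch_size, 0, 0, 1, 1);
    return ggml_add(ctx, f0, f1);       // [W/patch, H/patch, n_embd, 1]
}
```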
tools/mtmd/clip.cpp
Outdated
```cpp
// - Extract frame_1 = patches[:,:,1,:,:] [576, 3, 14, 14]
// - Conv2d(kernel_0, frame_0) + Conv2d(kernel_1, frame_1) -> [576, 1536]

const int patch_features = 3 * 2 * 14 * 14; // 1176
```
What are 3, 2, and 14? Any reason not to read this info from hparams?
Thanks @ngxson, removed the hard-coded values left over from the HF implementation debugging session. Now pulls from hparams:
- patch_size → hparams.patch_size
- spatial_merge_size → hparams.n_merge
- image_size → hparams.image_size
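For illustration, the constant can then be derived rather than hard-coded, roughly as below (a sketch only; the field names follow the hparams mapping above but are not verified against the PR's code):

```cpp
// n_channels and the duplicated temporal frame stay as named constants;
// everything image-related comes from hparams.
const int n_channels     = 3;   // RGB
const int n_temporal     = 2;   // image duplicated into two frames
const int patch_features = n_channels * n_temporal
                         * hparams.patch_size * hparams.patch_size;  // 3*2*14*14 = 1176 for GLM4V
const int merge_size     = hparams.n_merge;                          // spatial merge (2)
const int n_patches      = (hparams.image_size / hparams.patch_size)
                         * (hparams.image_size / hparams.patch_size);
```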
tools/mtmd/clip.cpp
Outdated
```cpp
cb(frame0, "frame0", -1);
cb(frame1, "frame1", -1);

// Apply Conv2d to each frame (simulates Conv3d with temporal kernel split)
```
I feel like this logic is the same as the Qwen2/Qwen3 models.
Also, just want to remind you that any AI usage must be explicitly stated in the PR description.
tools/mtmd/clip.cpp
Outdated
```cpp
// HF pattern: [h, w, h, w] at chunk_size granularity
// Chunk 0 (dims 0-31):  h position
// Chunk 1 (dims 32-63): w position
// Chunk 2 (dims 64-95): h position
// Chunk 3 (dims 96-127): w position
```
If you are referring to this code, then your implementation is likely incorrect.
Indeed, I suspect that >50% of GLM-V's code is just Qwen-VL with different naming. The (almost) same code can be found in modeling_qwen2_vl.py.
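For reference, a hedged sketch of what per-patch position ids for a Qwen2-VL-style vision M-RoPE pass could look like; the four-section layout and indexing convention here are assumptions, and the authoritative version is the existing Qwen2-VL path in clip.cpp:

```cpp
#include <cstdint>
#include <vector>

// Sketch: one position id per patch and per M-RoPE section, laid out
// section-major (all h indices, then all w indices, ...).
static std::vector<int32_t> build_vision_mrope_positions(int grid_h, int grid_w) {
    const int n_pos = grid_h * grid_w;
    std::vector<int32_t> pos(4 * n_pos);
    for (int y = 0; y < grid_h; y++) {
        for (int x = 0; x < grid_w; x++) {
            const int i = y * grid_w + x;
            pos[i + 0 * n_pos] = y; // h
            pos[i + 1 * n_pos] = x; // w
            pos[i + 2 * n_pos] = y; // h (mirrors the [h,w,h,w] pattern quoted above)
            pos[i + 3 * n_pos] = x; // w
        }
    }
    return pos;
}
```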
Reminding again because some code looks too suspicious compared to #16600.
I refuse to review newer commits if this is unclear.
Add complete GLM4V (GLM-4.6V-Flash) support including:

**Vision Encoder (mtmd):**
- Dual Conv2D patch embedding (simulating Conv3D temporal reduction)
- M-RoPE using ggml_rope_multi() with [h,w,h,w] position pattern
- 2x2 patch merger with downsample convolution
- SwiGLU-based merger FFN

**LLM Architecture (libllama):**
- New LLM_ARCH_GLM4V based on GLM4 with M-RoPE
- Uses LLAMA_ROPE_TYPE_MROPE with rope_sections from model config
- Reuses ggml_rope_multi() (same as Qwen2VL) for position encoding

Key design decisions:
- Vision encoder uses ggml_rope_multi() instead of custom RoPE
- LLM follows GLM4 structure with M-RoPE (not Qwen2VL structure)
- Minimal code: ~300 lines total across all files

Tested with GLM-4.6V-Flash producing correct image descriptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
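As a side note on the SwiGLU-based merger FFN mentioned in that commit message, such a block in GGML generally looks like the sketch below; the weight names and shapes are placeholders, not the PR's tensor names:

```cpp
// Standard SwiGLU: out = W_down * (silu(W_gate * x) elementwise-mul (W_up * x))
static struct ggml_tensor * swiglu_ffn(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,       // [n_in, n_tokens]
        struct ggml_tensor  * w_gate,  // [n_in, n_ff]
        struct ggml_tensor  * w_up,    // [n_in, n_ff]
        struct ggml_tensor  * w_down)  // [n_ff, n_out]
{
    struct ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w_gate, x));
    struct ggml_tensor * up   = ggml_mul_mat(ctx, w_up, x);
    return ggml_mul_mat(ctx, w_down, ggml_mul(ctx, gate, up));  // [n_out, n_tokens]
}
```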
Sorry to bother you, I just wanted to try this PR. I'm trying to convert an mmproj file, but it keeps giving me an error message. What should I do?
lol, I think the author of this PR doesn't actually know what he's doing, but only copy-pastes my comments to Claude Code. I'm closing this PR as it is too noisy. Will have a look a bit later because it's not that complicated.
Summary
Add complete GLM4V (GLM-4.6V-Flash) vision-language model support.
This PR has been completely rebased and rewritten based on maintainer feedback from @ngxson. Key changes from the original submission:
- Reuses `ggml_rope_multi()` (same as Qwen2VL)

Changes
**Vision Encoder (`tools/mtmd/`):**
- Dual Conv2D patch embedding (simulating Conv3D temporal reduction)
- M-RoPE using `ggml_rope_multi()` with `[h,w,h,w]` position pattern
- 2x2 patch merger with downsample convolution
- SwiGLU-based merger FFN

**LLM Architecture (`src/`):**
- New `LLM_ARCH_GLM4V` based on GLM4 structure with M-RoPE
- Uses `LLAMA_ROPE_TYPE_MROPE` with `rope_sections` from the model config
- Reuses `ggml_rope_multi()` for position encoding (shared with Qwen2VL)

Files Changed
- `src/llama-arch.h` (`LLM_ARCH_GLM4V` enum)
- `src/llama-arch.cpp`
- `src/models/glm4v.cpp`
- `src/models/models.h`
- `src/llama-model.cpp`
- `src/CMakeLists.txt`
- `tools/mtmd/models/glm4v.cpp`
- `tools/mtmd/clip.cpp`
- `tools/mtmd/clip-impl.h`
- `tools/mtmd/clip-model.h`
- `tools/mtmd/models/models.h`
- `tools/mtmd/CMakeLists.txt`

Testing
Tested with GLM-4.6V-Flash. Output correctly describes image content.
Design Notes
- Vision encoder uses `ggml_rope_multi` with `GGML_ROPE_TYPE_VISION` (instead of a custom RoPE implementation)
- LLM uses `ggml_rope_multi` with `LLAMA_ROPE_TYPE_MROPE` (same as Qwen2VL)

Addresses feedback from #17967 review.
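For readers unfamiliar with the API, the `ggml_rope_multi()` call pattern these notes refer to looks roughly like the sketch below, assuming the current signature in ggml.h; the section split and every numeric value here are placeholders, not the PR's actual settings:

```cpp
// Sketch of a multi-section (M-RoPE) rope call; `cur` holds query or key
// states, `positions` holds the per-token, per-section position ids.
static struct ggml_tensor * apply_mrope(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * positions,
        int                   n_rot) {
    int sections[4] = { 16, 24, 24, 0 };   // placeholder [t, h, w, e] split
    return ggml_rope_multi(
            ctx, cur, positions, /*freq_factors=*/NULL,
            n_rot, sections,
            GGML_ROPE_TYPE_MROPE,          // the vision encoder would use GGML_ROPE_TYPE_VISION
            /*n_ctx_orig=*/32768,
            /*freq_base=*/10000.0f, /*freq_scale=*/1.0f,
            /*ext_factor=*/0.0f, /*attn_factor=*/1.0f,
            /*beta_fast=*/32.0f, /*beta_slow=*/1.0f);
}
```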