mtmd: add GLM4V multimodal model with conversion support #17998
base: master
Conversation
Adds complete support for GLM-4.6V-Flash and related models, including runtime inference and HuggingFace-to-GGUF conversion scripts.

Architecture:
- Vision encoder with dual Conv2D patch embedding and M-RoPE
- GLM4-based LLM with M-RoPE position encoding
- 2x2 patch merger with SwiGLU FFN
- Reuses existing ggml_rope_multi() infrastructure

Conversion support:
- GLM4VisionModel class for vision encoder conversion
- Handles the Conv3D-to-Conv2D split for patch embeddings
- Lazy tensor evaluation and all GLM4V-specific tensors

Testing (zai-org/GLM-4.6V-Flash):
- Text model: 18.8 GB, 523 tensors (bf16)
- Vision encoder: 1.7 GB, 182 tensors (f16)
- Inference: correct image descriptions

Peer-coded with Claude for debugging.
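For context on the Conv3D-to-Conv2D split above: the vision encoder's patch embedding is a 3D convolution, presumably with a temporal patch size of 2 to match the "dual Conv2D" description, and splitting it along the temporal axis yields two plain 2D kernels. A minimal sketch of that split; the function name and exact tensor layout are assumptions, not the PR's actual code:

```python
import torch

# Hypothetical helper: split a Conv3D patch-embedding weight of shape
# (out_ch, in_ch, temporal=2, patch_h, patch_w) into two Conv2D weights,
# one per temporal slice, so each can be stored as a standard 2D conv.
def split_conv3d_patch_embed(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    out_ch, in_ch, t, ph, pw = weight.shape
    assert t == 2, "expected a temporal patch size of 2"
    w0 = weight[:, :, 0].contiguous()  # 2D kernel for the first frame slice
    w1 = weight[:, :, 1].contiguous()  # 2D kernel for the second frame slice
    return w0, w1
```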
Stop spamming our project with low-quality AI-generated code.
Original work is from @ddh0. We won't merge the current PR.
Tested: GLM-4.6V-Flash cannot correctly recognize image content (text in images). The model on the official website correctly recognizes the same image.
@IIIIIllllIIIIIlllll - this is the command I used to test:

> Got it, let's analyze the image. The user is asking what we see in this famous painting. Firstly recognize that "Mona Lisa" by Leonardo da Vinci—so key elements: a woman (the Mona Lisa) with long brown hair styled down over her shoulders; she has hands crossed at waist level or lower? Wait, looking closely—the pose of the figure. The background is an outdoor landscape scene in soft blues and greens. So describe it: This image depicts Leonardo da Vinci's famous painting "Mona Lisa" (also known as La Gioconda). In this artwork: ...

With text, here's the output for a text image (https://i.sstatic.net/IvV2y.png).

Can you share the command you're testing with? I want to see if I can replicate.
I copied your command and tested it again, but the result was the same.

The command:

Here is the image:

Edited: Output:

Adds complete GLM-4.6V-Flash support, including runtime inference and conversion scripts.
(This PR is working, take-it-or-leave-it support for GLM-4.6V models while official support from the maintainers is pending.)
Usage
Convert from HuggingFace:
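A plausible invocation, assuming the PR extends llama.cpp's standard convert_hf_to_gguf.py script and its --mmproj flag; paths, output file names, and exact flags are illustrative, not necessarily the PR's exact command:

```sh
# Text model weights (bf16, matching the tested sizes above)
python convert_hf_to_gguf.py /path/to/GLM-4.6V-Flash \
    --outfile model.gguf --outtype bf16

# Vision encoder / multimodal projector (f16); the --mmproj flag is assumed
python convert_hf_to_gguf.py /path/to/GLM-4.6V-Flash \
    --mmproj --outfile mmproj.gguf --outtype f16
```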
Run inference:
```sh
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf \
    --ctx-size 256 --temp 0.8 --top-p 0.6 --top-k 1 \
    --repeat-penalty 1.9 -fa on --jinja \
    -p "In English Only: Describe what you see in the image." \
    --image image.jpg
```

Note: peer-coded with Claude for debugging.