@eelbaz eelbaz commented Dec 13, 2025

Adds complete GLM-4.6V-Flash support, including runtime inference and conversion scripts.
(This PR is a working, take-it-or-leave-it implementation of GLM-4.6V support, offered until official support is provided by the maintainers.)

Usage

Convert from HuggingFace:

# Text model
python convert_hf_to_gguf.py zai-org/GLM-4.6V-Flash \
  --outfile model.gguf --outtype bf16

# Vision encoder
python convert_hf_to_gguf.py zai-org/GLM-4.6V-Flash \
  --outfile mmproj.gguf --outtype f16 --mmproj

Run inference:

llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf \
  --ctx-size 256 --temp 0.8 --top-p 0.6 --top-k 1.1 \
  --repeat_penalty 1.9 -fa on --jinja \
  -p "In English Only: Describe what you see in the image." \
  --image image.jpg

Note: peer-coded with Claude for debugging

Adds complete support for GLM-4.6V-Flash and related models including
runtime inference and HuggingFace-to-GGUF conversion scripts.

Architecture:
- Vision encoder with dual Conv2D patch embedding and M-RoPE
- GLM4-based LLM with M-RoPE position encoding
- 2x2 patch merger with SwiGLU FFN
- Reuses existing ggml_rope_multi() infrastructure
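
For readers unfamiliar with M-RoPE: each token carries several position components (commonly temporal, height, width) instead of a single index. The sketch below follows the layout described for Qwen2-VL-style models. Whether GLM-4.6V assigns positions in exactly this way is an assumption, and `mrope_positions` is a hypothetical helper for illustration, not a function from llama.cpp or this PR.

```python
def mrope_positions(n_text_before, grid_h, grid_w, n_text_after):
    """Build (t, h, w) position triples for: text tokens, one image, text tokens.

    Assumption: Qwen2-VL-style M-RoPE layout. Text tokens use the same value
    in all three components; image tokens share one temporal index and vary
    in height/width according to their place in the (merged) patch grid.
    """
    positions = []

    # Text tokens before the image: ordinary sequential positions.
    for p in range(n_text_before):
        positions.append((p, p, p))

    # Image tokens: row-major walk over the grid, offset by the start position.
    start = n_text_before
    for y in range(grid_h):
        for x in range(grid_w):
            positions.append((start, start + y, start + x))

    # Text after the image resumes after the largest position used so far.
    next_pos = start + max(grid_h, grid_w)
    for p in range(n_text_after):
        positions.append((next_pos + p,) * 3)

    return positions


if __name__ == "__main__":
    for triple in mrope_positions(2, 2, 3, 1):
        print(triple)
```

The point of the layout is that two image tokens in the same row share a height component and two in the same column share a width component, which a single 1D position index cannot express.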

Conversion support:
- GLM4VisionModel class for vision encoder conversion
- Handles Conv3D to Conv2D split for patch embeddings
- Lazy tensor evaluation and all GLM4V-specific tensors
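
To illustrate the Conv3D to Conv2D split: when a patch-embedding Conv3D has a temporal kernel size of 2 and its kernel equals its stride, its weight can be sliced along the temporal axis into two Conv2D weights, and the Conv3D output equals the sum of the two Conv2D outputs. The pure-Python sketch below checks this on a single patch with small integer values. The tensor layout `[c_out][c_in][kt][kh][kw]` and all function names are assumptions for illustration; this is not the code from `convert_hf_to_gguf.py`.

```python
import random


def split_temporal_kernel(w3d):
    """Slice a Conv3D weight [c_out][c_in][kt][kh][kw] into kt Conv2D
    weights, each shaped [c_out][c_in][kh][kw]."""
    kt = len(w3d[0][0])
    return [[[chan[t] for chan in out_ch] for out_ch in w3d] for t in range(kt)]


def conv2d_patch(w2d, frame):
    """Conv2D on one patch whose size equals the kernel size:
    a single dot product per output channel. frame is [c_in][kh][kw]."""
    return [
        sum(w * v
            for w_ch, f_ch in zip(out_ch, frame)
            for w_row, f_row in zip(w_ch, f_ch)
            for w, v in zip(w_row, f_row))
        for out_ch in w2d
    ]


def conv3d_patch(w3d, frames):
    """Reference Conv3D on one patch. frames is [kt][c_in][kh][kw]."""
    out = []
    for out_ch in w3d:
        acc = 0
        for c, chan in enumerate(out_ch):
            for t, w_t in enumerate(chan):
                for y, w_row in enumerate(w_t):
                    for x, w in enumerate(w_row):
                        acc += w * frames[t][c][y][x]
        out.append(acc)
    return out


rng = random.Random(0)
C_OUT, C_IN, KT, KH, KW = 3, 2, 2, 2, 2
w3d = [[[[[rng.randint(-3, 3) for _ in range(KW)] for _ in range(KH)]
         for _ in range(KT)] for _ in range(C_IN)] for _ in range(C_OUT)]
frames = [[[[rng.randint(-3, 3) for _ in range(KW)] for _ in range(KH)]
           for _ in range(C_IN)] for _ in range(KT)]

# One Conv2D per temporal slice, then an element-wise sum of the outputs.
w2d = split_temporal_kernel(w3d)
per_frame = [conv2d_patch(w2d[t], frames[t]) for t in range(KT)]
summed = [sum(vals) for vals in zip(*per_frame)]

assert summed == conv3d_patch(w3d, frames)
```

For a still image, the two temporal slots would typically receive the same frame, so the runtime can apply both Conv2D kernels to one image and add the results. That detail is also an assumption here, based on how similar vision encoders are commonly handled.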

Testing (zai-org/GLM-4.6V-Flash):
- Text model: 18.8GB, 523 tensors (bf16)
- Vision encoder: 1.7GB, 182 tensors (f16)
- Inference: Correct image descriptions
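
The tensor counts above can be cross-checked without loading the model, since a GGUF file starts with a small fixed header: the magic bytes `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The following is a minimal sketch that assumes a little-endian GGUF file of version 2 or later; `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header and return its fields as a dict.

    Assumes little-endian byte order and GGUF version >= 2, where the
    tensor and metadata counts are stored as unsigned 64-bit integers.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling `read_gguf_header("model.gguf")["tensor_count"]` on the converted text model and on `mmproj.gguf` gives values that can be compared with the 523 and 182 tensors reported above.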

Peer-coded with Claude for debugging
Collaborator

@ngxson ngxson left a comment

Stop spamming our project with low-quality AI-generated code.

Original work is from @ddh0 . We won't merge the current PR.

@IIIIIllllIIIIIlllll

Tested: with this PR, GLM-4.6V-Flash cannot correctly recognize text content in an image. The model on the official website correctly recognizes the text in the same image.

@eelbaz
Author

eelbaz commented Dec 13, 2025

@IIIIIllllIIIIIlllll - this is the command I used to test:

llama-mtmd-cli -m llama.cpp/model.gguf --mmproj mmproj.gguf --ctx-size 256 --temp 0.8 --top-p 0.6 --top-k 1.1 --repeat_penalty 1.9 -fa on --jinja -p "What do you see in this image?" --image mona_lisa.jpg

Got it, let's analyze the image. The user is asking what we see in this famous painting.

Firstly recognize that "Mona Lisa" by Leonardo da Vinci—so key elements: a woman (the MonaLisa) with long brown hair styled down over her shoulders; she has hands crossed at waist level or lower? Wait, looking closely—the pose of the figure. The background is an outdoor landscape scene in soft blues and greens.

So describe it:

This image depicts Leonardo da Vinci's famous painting "Mona Lisa" (also known as La Gioconda). In this artwork:....

With text: here is the output for a text image (https://i.sstatic.net/IvV2y.png) and the prompt "What is the exact text in this image?". Response: ... "The actual OCR (the image text) is: It was the best of
Times it wase worst times, It wa age o f wisdom i twa s e ag eof foolishness..."

Can you share the command you're testing with? I want to see if I can replicate it.

@IIIIIllllIIIIIlllll

IIIIIllllIIIIIlllll commented Dec 13, 2025

I copied your command and tested it again, but the result was the same.
It is completely unable to correctly recognize the text content in the image. Sadly.

The command:
llama-mtmd-cli -m /home/mark/Models/Q8/GLM-4.6V-Flash/GLM-4.6-Flash.gguf --mmproj /home/mark/Models/Q8/GLM-4.6V-Flash/mmproj.gguf --ctx-size 8192 --temp 0.8 --top-p 0.6 --top-k 1.1 --repeat_penalty 1.9 -fa on --jinja -p "What do you see in this image?" --image /home/mark/9395bd24-9b01-49f0-8573-09c6ffbb2cff.jpeg

Here is the image:
9395bd24-9b01-49f0-8573-09c6ffbb2cff.zip

Edited:
The model file is bf16, and the mmproj file is f32.

Output:

<think>Got it, let's analyze the image. The screenshot shows a mobile app interface on an Android device (since we can see navigation buttons at bottom). 

Firstly there are two tabs or sections: one is "朋友圈" which means Moments/Feeds in WeChat-like apps? Wait no—this looks like Xiaohongshu ("Little Red Book") maybe. Because the top left has a logo with red and yellow, typical of some Chinese social media.

Looking at elements:

- Top bar shows time 09:51 (left), battery icon etc.
  
The main content area is divided into two parts? Wait no—there's an app window showing what looks like Xiaohongshu feed or post. The top left has "朋友圈" and then a title maybe about some topic.

Then there are navigation buttons at bottom of the screen: back, home (circle), recent apps?

Also in this screenshot we can see text content from posts—like Chinese characters discussing something related to technology perhaps? There's mention like "#技术分享#" etc. So it seems an app showing social media feed with multiple comments/posts.

So summarizing:

The image is a mobile screen capture of what appears (likely Xiaohongshu) displaying the "朋友圈" section, containing posts and comment threads in Chinese text about technology or similar topics; there's also UI elements like time display at top left ("09:51"), battery icon etc., navigation buttons below.

Need to describe exactly visible content.</think>
The image is a screenshot of an Android mobile device showing what appears (likely the social media app *Xiaohongshu* / "Little Red Book") displaying its **"朋友圈"* section** (“Moments”/feed).  

### Key Elements:  
- At top-left, there’s text “朋友圏”(“Friends’ Circle”). Below that is a post area with Chinese content. The first visible title in the feed seems to be about technology or sharing (e.g., mentions like `#技术分享` / "#Tech Share"). There are comment threads below posts showing multiple replies/likes, typical of social media interaction UIs.

- **Top status bar**: Shows time “09:51”, battery icon with charge level (~70%), and signal indicators.  

 - At the bottom navigation area (Android standard), there’s a back arrow button on left; center is home circle (`○`); right side shows recent apps square, plus other icons for notifications etc.

Overall this screenshot captures an active social media feed interface in Chinese language with typical mobile UI components like status bar and system nav buttons.

@eelbaz
Author

eelbaz commented Dec 13, 2025

Hmm, interesting.
image

@github-actions (bot) added the labels "model" (Model specific), "examples", and "python" (python script changes) on Dec 13, 2025