mtmd: add GLM4V multimodal model with conversion support #17998
base: master
Conversation
Adds complete support for GLM-4.6V-Flash and related models, including runtime inference and HuggingFace-to-GGUF conversion scripts.

Architecture:
- Vision encoder with dual Conv2D patch embedding and M-RoPE
- GLM4-based LLM with M-RoPE position encoding
- 2x2 patch merger with SwiGLU FFN
- Reuses existing ggml_rope_multi() infrastructure

Conversion support:
- GLM4VisionModel class for vision encoder conversion
- Handles the Conv3D-to-Conv2D split for patch embeddings
- Lazy tensor evaluation and all GLM4V-specific tensors

Testing (zai-org/GLM-4.6V-Flash):
- Text model: 18.8 GB, 523 tensors (bf16)
- Vision encoder: 1.7 GB, 182 tensors (f16)
- Inference: correct image descriptions

Peer-coded with Claude for debugging.
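For context on the Conv3D-to-Conv2D split above: the vision encoder's patch embedding is a 3D convolution, presumably with a temporal patch size of 2 to match the "dual Conv2D" description, and splitting it along the temporal axis yields two plain 2D kernels. A minimal sketch of that split; the function name and exact tensor layout are assumptions, not the PR's actual code:

```python
import torch

# Hypothetical helper: split a Conv3D patch-embedding weight of shape
# (out_ch, in_ch, temporal=2, patch_h, patch_w) into two Conv2D weights,
# one per temporal slice, so each can be stored as a standard 2D conv.
def split_conv3d_patch_embed(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    out_ch, in_ch, t, ph, pw = weight.shape
    assert t == 2, "expected a temporal patch size of 2"
    w0 = weight[:, :, 0].contiguous()  # 2D kernel for the first frame slice
    w1 = weight[:, :, 1].contiguous()  # 2D kernel for the second frame slice
    return w0, w1
```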
Stop spamming our project with low-quality AI-generated code.
Original work is from @ddh0. We won't merge the current PR.
Tested: GLM-4.6V-Flash cannot correctly recognize image content (text in images). The model on the official website correctly recognizes the same image.
@IIIIIllllIIIIIlllll - this is the command I used to test:

> Got it, let's analyze the image. The user is asking what we see in this famous painting. Firstly recognize that "Mona Lisa" by Leonardo da Vinci—so key elements: a woman (the Mona Lisa) with long brown hair styled down over her shoulders; she has hands crossed at waist level or lower? Wait, looking closely—the pose of the figure. The background is an outdoor landscape scene in soft blues and greens. So describe it: This image depicts Leonardo da Vinci's famous painting "Mona Lisa" (also known as La Gioconda). In this artwork: ...

With text, here's the output for a text image (https://i.sstatic.net/IvV2y.png).

Can you share the command you're testing with? I want to see if I can replicate.
I copied your command and tested it again, but the result was the same.

The command:

Here is the image:

Edited: Output:

Adds complete GLM-4.6V-Flash support, including runtime inference and conversion scripts.
(This PR is working, take-it-or-leave-it support for GLM-4.6V models while official support from the maintainers is pending.)
Usage
Convert from HuggingFace:
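A plausible invocation, assuming the PR extends llama.cpp's standard convert_hf_to_gguf.py script and its --mmproj flag; paths, output file names, and exact flags are illustrative, not necessarily the PR's exact command:

```sh
# Text model weights (bf16, matching the tested sizes above)
python convert_hf_to_gguf.py /path/to/GLM-4.6V-Flash \
    --outfile model.gguf --outtype bf16

# Vision encoder / multimodal projector (f16); the --mmproj flag is assumed
python convert_hf_to_gguf.py /path/to/GLM-4.6V-Flash \
    --mmproj --outfile mmproj.gguf --outtype f16
```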
Run inference:
```sh
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf \
    --ctx-size 256 --temp 0.8 --top-p 0.6 --top-k 1 \
    --repeat-penalty 1.9 -fa on --jinja \
    -p "In English Only: Describe what you see in the image." \
    --image image.jpg
```

Note: peer-coded with Claude for debugging.