mtmd: Add DeepSeekOCR Support #17400
base: master
Conversation
init commit
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
…ut in deepseek2 model
tools/mtmd/mtmd-cli.cpp
Outdated
```cpp
} else {
    if (mtmd_is_deepseekocr(ctx.ctx_vision.get())) {
```
Many models do not support chat mode - it's not our responsibility to tell the user what to do.
Besides, a model-specific API like is_model_abc is not allowed; it's an anti-pattern when designing a public API.
Got it. I'll clean it up very soon.
Fixed it.
@sfallah @bluebread Hmm, are you sure that this PR is working? Using the existing test file in the repo, the beginning of the output looks coherent, but then it starts repeating:

Another run, and it repeats again:
Here is the script for the reference model and its output, which is also not good:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '1'

model_name = '/root/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='eager', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = '/root/llama.cpp/tools/mtmd/test-1.jpeg'
output_path = './outputs'

res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=640, image_size=640, crop_mode=False, save_results=True, test_compress=True)
```

Based on my experiments, while DeepSeek-OCR is able to handle clean, well-formatted documents (e.g. academic papers), it struggles with arbitrary image inputs (e.g. a photograph of a newspaper) due to its limited training data. I suspect it's an experimental project, some kind of POC, rather than a product.
If that's actually the case, I have doubts about merging this PR, as it can easily flood the project with issues regarding model quality, over which we have no control at all. Not to dismiss your efforts here - it's amazing to see such a complicated architecture implemented in GGML. But in the past, I myself also had many PRs that were not merge-able due to model quality, which wasn't even my fault. The most recent example was the PaddleOCR model. What I think would be better is to keep the PR as an experiment until more users confirm that it works (or maybe the DeepSeek team will release a better-trained OCR model in the future).
@ngxson @bluebread Forcing the base image size (by hardcoding it), I get a better result.
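To illustrate what forcing the base image size means for preprocessing, here is a minimal sketch (not the actual clip.cpp change): the input is letterboxed to one fixed square resolution instead of one chosen dynamically. The 1024 value and the padding color are assumptions for illustration.

```python
# Sketch only: letterbox an image to a fixed square "base" resolution,
# roughly what hardcoding the base image size implies for preprocessing.
# The 1024 value and white padding are assumptions, not the clip.cpp logic.
from PIL import Image, ImageOps

def to_base_size(path: str, base: int = 1024) -> Image.Image:
    img = Image.open(path).convert("RGB")
    img.thumbnail((base, base))  # downscale while keeping aspect ratio
    return ImageOps.pad(img, (base, base), color=(255, 255, 255))  # pad to base x base

to_base_size("tools/mtmd/test-1.jpeg").save("/tmp/test-1-base.jpg")
```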
I think we must make sure it works with different input image qualities (e.g. lighting conditions, colors) and different sizes. A test script would be nice to have. Any bugs related to output quality should be addressed before I can do any refactoring on the PR; otherwise it's very difficult to trace back the commit containing the bug.
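As a sketch of what such a test script could look like (the model/mmproj paths, prompt, and binary location are placeholders, not part of this PR), one could generate a few degraded variants of the test image with Pillow and run each through llama-mtmd-cli to eyeball how the output holds up:

```python
# Sketch: probe output robustness across image quality/size variations.
# Paths, prompt, and the binary location are placeholders for illustration.
import subprocess
from PIL import Image, ImageEnhance

SRC = "tools/mtmd/test-1.jpeg"
MODEL = "deepseek-ocr-f32.gguf"
MMPROJ = "mmproj-deepseek-ocr-f32.gguf"
PROMPT = "<|grounding|>Convert the document to markdown."

def make_variants(src: str) -> list[str]:
    img = Image.open(src).convert("RGB")
    variants = []
    # Darker / brighter lighting and a downscaled copy.
    for name, im in [
        ("dark", ImageEnhance.Brightness(img).enhance(0.5)),
        ("bright", ImageEnhance.Brightness(img).enhance(1.5)),
        ("small", img.resize((img.width // 2, img.height // 2))),
    ]:
        path = f"/tmp/test-1-{name}.jpg"
        im.save(path)
        variants.append(path)
    return variants

for image_path in [SRC] + make_variants(SRC):
    out = subprocess.run(
        ["./build/bin/llama-mtmd-cli", "-m", MODEL, "--mmproj", MMPROJ,
         "--image", image_path, "-p", PROMPT],
        capture_output=True, text=True,
    )
    print(f"=== {image_path} ===")
    print(out.stdout[:500])  # eyeball the beginning of each output
```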
@sfallah Good job! We need more refined logic for auto mode selection and comprehensive testing. Could you please take care of it? I'll probably be busy for the next couple of days.
I will take care of the tests and the rest, no problem.
setting min-resolution base (1024) max large (1280) for dynamic-resolution
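For illustration, the kind of auto-selection this commit refers to might look like the sketch below, picking between the base (1024) and large (1280) resolutions from the input dimensions; the threshold and function name are assumptions, not the actual implementation:

```python
# Illustrative only: pick a native resolution for dynamic-resolution mode.
# The threshold is an assumption based on the base (1024) / large (1280) settings.
def select_resolution(width: int, height: int) -> int:
    longest = max(width, height)
    if longest <= 1024:
        return 1024   # base mode
    return 1280       # large mode for bigger inputs

print(select_resolution(800, 600))    # -> 1024
print(select_resolution(2000, 1400))  # -> 1280
```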
FYI, some changes were added via this PR: #17909
# Conflicts:
#   tools/mtmd/clip.cpp
#   tools/mtmd/mtmd-cli.cpp
added new opt to tests.sh to disable flash-attn
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
Heads up, sorry for the breaking change, but there will be a refactoring (just moving stuff around) in #17965. After finishing with that refactoring (and after you're done testing on your side), I'll go back to deepseek-ocr.
…rge_#17965
# Conflicts:
#   src/llama-kv-cache.cpp
#   tools/mtmd/clip.cpp
Merge with #17965 is done.
python test script for deepseek-ocr: testing OCR on the test-1.jpeg newspaper image, checking against expected reference model output for Free-OCR and Markdown
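A minimal sketch of how such a check against the reference output could be scored, using a plain difflib similarity ratio; the file names and the 0.8 threshold are assumptions for illustration, not necessarily what the test script in this PR does:

```python
# Sketch: compare llama-mtmd-cli output against the expected reference-model output.
import difflib

def similarity(actual: str, expected: str) -> float:
    # Ratio in [0, 1]; 1.0 means identical text.
    return difflib.SequenceMatcher(None, actual, expected).ratio()

with open("expected_markdown.txt") as f:   # reference-model output (assumed file name)
    expected = f.read()
with open("actual_markdown.txt") as f:     # llama.cpp output (assumed file name)
    actual = f.read()

score = similarity(actual, expected)
print(f"similarity: {score:.3f}")
assert score > 0.8, "output diverges too much from the reference"  # threshold is arbitrary
```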
Feature Request: #16676
GGUF Models
sabafallah/DeepSeek-OCR-GGUF
deepseek-ocr-f32.gguf
mmproj-deepseek-ocr-f32.gguf
Running the Model
Build llama.cpp (Mac)
Running llama-mtmd-cli
DeepSeekOCR Paper (First page)
Hard Test (Old Newspaper Image)