mtmd: Add DeepSeekOCR Support #17400
base: master
Conversation
init commit
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
…ut in deepseek2 model
tools/mtmd/mtmd-cli.cpp
Outdated
```cpp
} else {
    if (mtmd_is_deepseekocr(ctx.ctx_vision.get())) {
```
Many models do not support chat mode - it's not our responsibility to tell the user what to do.
Besides, a model-specific API like is_model_abc is not allowed; it's an anti-pattern when designing a public API.
Got it. I'll clean it up very soon.
Fixed it.
@sfallah @bluebread Hmm, are you sure that this PR is working? Using the existing test file in the repo, the beginning of the output looks coherent, but then it starts repeating:

Another run, and it repeats again:
Here is the script for the reference model and its output, which is also not good:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '1'

model_name = '/root/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, _attn_implementation='eager', trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = '/root/llama.cpp/tools/mtmd/test-1.jpeg'
output_path = './outputs'

res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=640, image_size=640, crop_mode=False, save_results=True, test_compress=True)
```

Based on my experiments, while DeepSeek-OCR is able to handle clean, well-formatted documents (e.g. academic papers), it struggles with arbitrary image inputs (e.g. a photograph of a newspaper) due to its limited training data. I suspect it's an experimental project, some kind of POC, rather than a product.
If that's actually the case, I have doubts about merging this PR, as it can easily flood the project with issues regarding model quality, over which we have no control at all. Not to dismiss your efforts here - it's amazing to see such a complicated architecture implemented in GGML. But in the past, I myself also had many PRs that were not merge-able due to model quality, which wasn't even my fault. The most recent example was the PaddleOCR model. What I think would be better is to keep the PR as an experiment until more users confirm that it works (or maybe the DeepSeek team will release a better-trained OCR model in the future).
@ngxson @bluebread Forcing the base image size (by hardcoding it), I get a better result.
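To illustrate what forcing the base image size means for preprocessing, here is a minimal sketch (not the actual clip.cpp change): the input is letterboxed to one fixed square resolution instead of one chosen dynamically. The 1024 value and the padding color are assumptions for illustration.

```python
# Sketch only: letterbox an image to a fixed square "base" resolution,
# roughly what hardcoding the base image size implies for preprocessing.
# The 1024 value and white padding are assumptions, not the clip.cpp logic.
from PIL import Image, ImageOps

def to_base_size(path: str, base: int = 1024) -> Image.Image:
    img = Image.open(path).convert("RGB")
    img.thumbnail((base, base))  # downscale while keeping aspect ratio
    return ImageOps.pad(img, (base, base), color=(255, 255, 255))  # pad to base x base

to_base_size("tools/mtmd/test-1.jpeg").save("/tmp/test-1-base.jpg")
```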
I think we must make sure it works with different input image qualities (e.g. lighting conditions, colors) and different sizes. A test script would be nice to have. Any bugs related to output quality should be addressed before I can do any refactoring on the PR; otherwise it's very difficult to trace back the commit containing the bug.
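As a sketch of what such a test script could look like (the model/mmproj paths, prompt, and binary location are placeholders, not part of this PR), one could generate a few degraded variants of the test image with Pillow and run each through llama-mtmd-cli to eyeball how the output holds up:

```python
# Sketch: probe output robustness across image quality/size variations.
# Paths, prompt, and the binary location are placeholders for illustration.
import subprocess
from PIL import Image, ImageEnhance

SRC = "tools/mtmd/test-1.jpeg"
MODEL = "deepseek-ocr-f32.gguf"
MMPROJ = "mmproj-deepseek-ocr-f32.gguf"
PROMPT = "<|grounding|>Convert the document to markdown."

def make_variants(src: str) -> list[str]:
    img = Image.open(src).convert("RGB")
    variants = []
    # Darker / brighter lighting and a downscaled copy.
    for name, im in [
        ("dark", ImageEnhance.Brightness(img).enhance(0.5)),
        ("bright", ImageEnhance.Brightness(img).enhance(1.5)),
        ("small", img.resize((img.width // 2, img.height // 2))),
    ]:
        path = f"/tmp/test-1-{name}.jpg"
        im.save(path)
        variants.append(path)
    return variants

for image_path in [SRC] + make_variants(SRC):
    out = subprocess.run(
        ["./build/bin/llama-mtmd-cli", "-m", MODEL, "--mmproj", MMPROJ,
         "--image", image_path, "-p", PROMPT],
        capture_output=True, text=True,
    )
    print(f"=== {image_path} ===")
    print(out.stdout[:500])  # eyeball the beginning of each output
```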
@sfallah Good job! We need more refined logic for auto mode selection and comprehensive testing. Could you please take care of it? I'll probably be busy for the next couple of days.
I will take care of the tests and the rest, no problem.
setting min-resolution base (1024) max large (1280) for dynamic-resolution
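For illustration, the kind of auto-selection this commit refers to might look like the sketch below, picking between the base (1024) and large (1280) resolutions from the input dimensions; the threshold and function name are assumptions, not the actual implementation:

```python
# Illustrative only: pick a native resolution for dynamic-resolution mode.
# The threshold is an assumption based on the base (1024) / large (1280) settings.
def select_resolution(width: int, height: int) -> int:
    longest = max(width, height)
    if longest <= 1024:
        return 1024   # base mode
    return 1280       # large mode for bigger inputs

print(select_resolution(800, 600))    # -> 1024
print(select_resolution(2000, 1400))  # -> 1280
```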
FYI, some changes were added via this PR: #17909
# Conflicts:
#   tools/mtmd/clip.cpp
#   tools/mtmd/mtmd-cli.cpp
added new opt to tests.sh to disable flash-attn
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
Heads up, sorry for the breaking change, but there will be a refactoring (just moving stuff around) in #17965. After finishing with that refactoring (and after you're done testing on your side), I'll go back to deepseek-ocr.
…rge_#17965
# Conflicts:
#   src/llama-kv-cache.cpp
#   tools/mtmd/clip.cpp
Merge with #17965 is done.
python test script for deepseek-ocr: testing OCR on the test-1.jpeg newspaper image, checking against expected reference model output for Free-OCR and Markdown
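A minimal sketch of how such a check against the reference output could be scored, using a plain difflib similarity ratio; the file names and the 0.8 threshold are assumptions for illustration, not necessarily what the test script in this PR does:

```python
# Sketch: compare llama-mtmd-cli output against the expected reference-model output.
import difflib

def similarity(actual: str, expected: str) -> float:
    # Ratio in [0, 1]; 1.0 means identical text.
    return difflib.SequenceMatcher(None, actual, expected).ratio()

with open("expected_markdown.txt") as f:   # reference-model output (assumed file name)
    expected = f.read()
with open("actual_markdown.txt") as f:     # llama.cpp output (assumed file name)
    actual = f.read()

score = similarity(actual, expected)
print(f"similarity: {score:.3f}")
assert score > 0.8, "output diverges too much from the reference"  # threshold is arbitrary
```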
Feature Request: #16676
GGUF Models
sabafallah/DeepSeek-OCR-GGUF
deepseek-ocr-f32.gguf
mmproj-deepseek-ocr-f32.gguf
Running the Model
Build llama.cpp (Mac)
Running llama-mtmd-cli
DeepSeekOCR Paper (First page)
Hard Test (Old Newspaper Image)