I’ve been hacking on an upgraded GUI / command-center for llama.cpp and I think it’s finally in a good place to share.
Repo: https://github.com/jbulger82/LLAMA_Hub
What it is
LlamaHub is a local-first UI built around llama-server with a focus on speed, tooling, and workflow (not just “a chat box”). I’ve been using it as my daily driver and I’m getting excellent inference speed with OSS 20B and other models.
Highlights
Fast local inference (llama.cpp first-class; OpenAI-compatible endpoints optional)
Local embeddings + RAG (you can keep retrieval local even if you use cloud models for reasoning)
MCP tools stack built-in (and it’s actually usable from the UI)
Multi-agent workflows (I tested a setup where a local model orchestrated multiple cloud Gemini instances while embeddings stayed local)
Lots of UI polish (themes, layout controls, etc. — I wanted it to feel “daily-usable”)
What I’m looking for
I’d really love help from the llama.cpp community to:
fix logic in the smart canvas
improve model launching
polish rough edges / improve UX
sanity-check architecture choices
tighten setup docs and defaults
test on more hardware + OS combos
I’m specifically hoping to keep this a community-driven collab (I’m not interested in someone repackaging it and selling it as a “premium AI app”).
My 3 best launch commands (what I'm using currently)
This one kicks ass in the OpenAI Codex CLI!
jeff@jeff-STGAUBRON:~$ /home/jeff/llama-b6962-bin-ubuntu-vulkan-x64/build/bin/llama-server -m "/home/jeff/Desktop/models/gpt-oss-20b-Q4_K_M.gguf" -ngl 99 -c 131072 --parallel 1 --host 0.0.0.0 --port 8082 -b 2056 -ub 256 -fa auto --temp 1.0 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 --repeat-last-n 200 --cache-type-k q8_0 --cache-type-v q8_0 --mlock --threads 8 --threads-batch 8 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja
Not as strong in Codex, but KICKASS in LlamaHub (it uses a custom chat template I made, included in the repo). Insane function/tool calling!
jeff@jeff-STGAUBRON:~$ /home/jeff/llama-b6962-bin-ubuntu-vulkan-x64/build/bin/llama-server -m "/home/jeff/Desktop/models/gpt-oss-20b-gpt-5-codex-distill.F16.gguf" -ngl 99 -c 131072 --parallel 1 --host 0.0.0.0 --port 8082 -b 2056 -ub 256 -fa auto --temp 1.0 --top-p 1.0 --top-k 40 --repeat-penalty 1.0 --repeat-last-n 200 --cache-type-k q8_0 --cache-type-v q8_0 --mlock --threads 24 --threads-batch 12 --chat-template-file "/home/jeff/Desktop/models/francine_oss.jinja.txt" --jinja
The best embedding model I've used to date!
jeff@jeff-STGAUBRON:~/build-cpu/bin$ /home/jeff/build-cpu/bin/llama-server --embedding -m "/home/jeff/Desktop/models/qwen3-embedding-0.6b-q4_k_m.gguf" -c 8192 -b 512 --parallel 1 --host 0.0.0.0
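With --embedding enabled you can query it through the OpenAI-compatible embeddings route. Note the command above doesn't set --port, so this assumes llama-server's default of 8080; adjust if yours differs:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "LlamaHub keeps retrieval local."}'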
If you try it and hit issues, open a GitHub issue with your OS, GPU, llama.cpp build, model, and command-line flags, and I'll do my best to reproduce.
— Jeff