718 points · danielhanchen · 1 day ago
qwen.ai
simonw
brew upgrade llama.cpp # or brew install if you don't have it yet
Then:

llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080, along with an OpenAI-compatible chat completions endpoint, do this:

llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.
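Once llama-server is up you can also hit the OpenAI-compatible endpoint directly; a minimal sketch assuming the default port 8080 (the prompt is just an example):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "temperature": 1.0,
        "top_p": 0.95
      }'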
tommyjepsen
The video is sped up. I ran it through LM Studio and then OpenCode. I wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
0cf8612b2e1e
Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.
predkambrij
cat docker-compose.yml

services:
  llamacpp:
    volumes:
      - llamacpp:/root
    container_name: llamacpp
    restart: unless-stopped
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    network_mode: host
    command: |
      -hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_XL --jinja --cpu-moe --n-gpu-layers 999 --ctx-size 102400 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on
      # unsloth/gpt-oss-120b-GGUF:Q2_K
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  llamacpp:
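To bring it up and check that the server is answering, a minimal sketch (assuming the compose file above, the NVIDIA container toolkit installed, and llama-server's default port 8080 with network_mode: host):

docker compose up -d
docker compose logs -f llamacpp    # watch the model download and load
curl http://localhost:8080/health  # llama-server readiness endpoint

Tepix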
From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
Alifatisk
Hope they update the model page soon https://chat.qwen.ai/settings/model
adefa
FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
Sequential (single request)

  Prompt   Gen      Prompt Processing   Token Gen
  Tokens   Tokens   (tokens/sec)        (tokens/sec)
  ------   ------   -----------------   ------------
     521       49               3,157          44.2
   1,033       83               3,917          43.7
   2,057       77               3,937          43.6
   4,105       77               4,453          43.2
   8,201       77               4,710          42.2
Parallel (concurrent requests)
pp4096+tg128 (4K context, 128 gen):
n t/s
-- ----
1 28.5
2 39.0
4 50.4
8 57.5
16 61.4
32 62.0
pp8192+tg128 (8K context, 128 gen):
n t/s
-- ----
1 21.6
2 27.1
4 31.9
8 32.7
16 33.7
32 31.7
mark_l_watson
So much expensive inference is provided free or at large discounts - that craziness should end.
Robdel12
Does anyone have any experience with these, and is this release actually workable in practice?
gitpusher
Granted, these 80B models are probably optimized for H100/H200s, which I do not have. Here's hoping that OpenClaw compatibility survives quantization.
macmac_mac
Does anyone see a reason to still use elevenlabs etc.?
SamDc73
In terms of intelligence per compute, it’s probably the best model I can realistically run locally on my laptop for coding. It’s solid for scripting and small projects.
I tried it on a mid-size codebase (~50k LOC), and the context window filled up almost immediately, making it basically unusable unless you're extremely explicit about which files to touch. I tested it with an 8k context window but will try again with 32k and see if it becomes more practical.
I think the main blocker for using local coding models more is the context window. A lot of work is going into making small models “smarter,” but for agentic coding that only gets you so far. No matter how smart the model is, an agent will blow through the context as soon as it reads a handful of files.
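For what it's worth, the context size in llama.cpp is just a flag; a sketch of re-running the server at 32k, assuming the same Unsloth GGUF mentioned upthread (the KV cache grows with the context, so expect noticeably more memory use than at 8k):

llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --ctx-size 32768 \
  --jinja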
alexellisuk
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
ionwake
I'm currently using Qwen 2.5 16B, and it works really well.
endymion-light
On a misc note: What's being used to create the screen recordings? It looks so smooth!
Soerensen
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability.
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models.
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more.
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
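One way to find out: agents built on the OpenAI SDK can usually be pointed at a local llama-server by overriding the base URL. A sketch, assuming your tool honours the standard OPENAI_BASE_URL / OPENAI_API_KEY environment variables (some, like Codex CLI, may want their own provider config instead):

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local   # llama-server only checks the key if started with --api-key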