718 points · danielhanchen · 1 day ago
qwen.ai
simonw
brew upgrade llama.cpp # or brew install if you don't have it yet
Then:

llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080, along with an OpenAI-compatible chat completions endpoint, do this:

llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.
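Once llama-server is up you can also hit the OpenAI-compatible endpoint directly; a minimal sketch assuming the default port 8080 (the prompt is just an example):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "temperature": 1.0,
        "top_p": 0.95
      }'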
tommyjepsen
The video is sped up. I ran it through LM Studio and then OpenCode. I wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
0cf8612b2e1e
Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.
predkambrij
cat docker-compose.yml

services:
  llamacpp:
    volumes:
      - llamacpp:/root
    container_name: llamacpp
    restart: unless-stopped
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    network_mode: host
    command: |
      -hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_XL --jinja --cpu-moe --n-gpu-layers 999 --ctx-size 102400 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on
      # unsloth/gpt-oss-120b-GGUF:Q2_K
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  llamacpp:
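To bring it up and check that the server is answering, a minimal sketch (assuming the compose file above, the NVIDIA container toolkit installed, and llama-server's default port 8080 with network_mode: host):

docker compose up -d
docker compose logs -f llamacpp    # watch the model download and load
curl http://localhost:8080/health  # llama-server readiness endpoint

Tepix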
From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
Alifatisk
Hope they update the model page soon https://chat.qwen.ai/settings/model
adefa
FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
Sequential (single request)

  Prompt   Gen      Prompt Processing   Token Gen
  Tokens   Tokens   (tokens/sec)        (tokens/sec)
  ------   ------   -----------------   ------------
     521       49               3,157          44.2
   1,033       83               3,917          43.7
   2,057       77               3,937          43.6
   4,105       77               4,453          43.2
   8,201       77               4,710          42.2
Parallel (concurrent requests)
pp4096+tg128 (4K context, 128 gen):
n t/s
-- ----
1 28.5
2 39.0
4 50.4
8 57.5
16 61.4
32 62.0
pp8192+tg128 (8K context, 128 gen):
n t/s
-- ----
1 21.6
2 27.1
4 31.9
8 32.7
16 33.7
32 31.7
mark_l_watson
So much expensive inference is provided free or at large discounts - that craziness should end.
Robdel12
Does anyone have any experience with these, and is this release actually workable in practice?
gitpusher
Granted, these 80B models are probably optimized for H100/H200s, which I do not have. Here's hoping that OpenClaw compatibility survives quantization.
macmac_mac
Does anyone see a reason to still use elevenlabs etc.?
SamDc73
In terms of intelligence per compute, it’s probably the best model I can realistically run locally on my laptop for coding. It’s solid for scripting and small projects.
I tried it on a mid-size codebase (~50k LOC), and the context window filled up almost immediately, making it basically unusable unless you're extremely explicit about which files to touch. I tested it with an 8k context window but will try again with 32k and see if it becomes more practical.
I think the main blocker for using local coding models more is the context window. A lot of work is going into making small models “smarter,” but for agentic coding that only gets you so far. No matter how smart the model is, an agent will blow through the context as soon as it reads a handful of files.
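For what it's worth, the context size in llama.cpp is just a flag; a sketch of re-running the server at 32k, assuming the same Unsloth GGUF mentioned upthread (the KV cache grows with the context, so expect noticeably more memory use than at 8k):

llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --ctx-size 32768 \
  --jinja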
alexellisuk
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
ionwake
I'm currently using Qwen 2.5 16B, and it works really well.
endymion-light
On a misc note: What's being used to create the screen recordings? It looks so smooth!
Soerensen
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability.
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models.
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more.
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
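One way to find out: agents built on the OpenAI SDK can usually be pointed at a local llama-server by overriding the base URL. A sketch, assuming your tool honours the standard OPENAI_BASE_URL / OPENAI_API_KEY environment variables (some, like Codex CLI, may want their own provider config instead):

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local   # llama-server only checks the key if started with --api-key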