Claude Code: connect to a local model when your quota runs out

213 points · fugu2 · 3 days ago

boxc.net

paxys · 6 hours ago
> Reduce your expectations about speed and performance!
Wildly understating this part.
Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.
alexhans · 7 hours ago
Useful tip.
From a strategic standpoint of privacy, cost and control, I immediately went for local models, because that allowed me to baseline tradeoffs and made it easier to understand where vendor lock-in could happen, and to avoid getting too narrow in perspective (e.g. llama.cpp/OpenRouter depending on local/cloud [1]).
With the explosion in popularity of CLI tools (claude/continue/codex/kiro/etc) it still makes sense to be able to do the same, even if you can use several strategies to subsidize your cloud costs (being aware of the privacy you trade away).
I would absolutely pitch that and evals as one small practice that will have compounding value for any "automation" you want to design in the future, because at some point you'll care about cost, risks, accuracy and regressions.
[1] - https://alexhans.github.io/posts/aider-with-open-router.html
[2] - https://www.reddit.com/r/LocalLLaMA
sathish316 · 3 hours ago
Some native Claude Code options when your quota runs out:
1. Switch to extra usage, which can be increased on the Claude usage page: https://claude.ai/settings/usage
2. Log out and switch to API tokens (using the ANTHROPIC_API_KEY environment variable) instead of a Claude Pro subscription. Credits can be increased on the Anthropic API console page: https://platform.claude.com/settings/keys
3. Add a second $20/month account if this happens frequently, before considering a Max account.
4. Not a native option: If you have a ChatGPT Plus or Pro account, Codex is surprisingly just as good and comes with a much higher quota.
sathish316 · 3 hours ago
Claude Code Router or ccr can connect to OpenRouter. When your quota runs out, it’s a much better speed vs quality vs cost tradeoff compared to running Qwen3 locally - https://github.com/musistudio/claude-code-router
d4rkp4ttern · 6 hours ago
Since llama.cpp/llama-server recently added support for the Anthropic messages API, running Claude Code with several recent open-weight local models is now very easy. The messy part is knowing which llama-server flags to use, including the chat template etc. I've collected all of that setup info in my claude-code-tools [1] repo, for Qwen3-Coder-next, Qwen3-30B-A3B, Nemotron-3-Nano, GLM-4.7-Flash etc.
Among these, I had lots of trouble getting GLM-4.7-Flash to work (failed tool calls etc), and even when it works, it's at very low tok/s. On the other hand, Qwen3 variants perform very well, speed-wise. For local sensitive document work, these are excellent; for serious coding not so much.
One caveat missed in most instructions is that you have to set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = 1 in your ~/.claude/settings.json, otherwise CC's telemetry pings cause total network failure because local ports are exhausted.
[1] claude-code-tools local LLM setup: https://github.com/pchalasani/claude-code-tools/blob/main/do...
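For anyone who would rather script that settings change than edit the file by hand, here is a minimal Python sketch. It assumes the variable goes under an "env" object in ~/.claude/settings.json, which the comment above does not spell out, so check the settings documentation for your Claude Code version. The function name is made up for illustration.

```python
import json
from pathlib import Path


def disable_nonessential_traffic(settings_path: Path) -> dict:
    """Merge the flag into the file's "env" object, keeping other settings."""
    settings = {}
    if settings_path.exists():
        # Treat an empty file the same as a missing one.
        settings = json.loads(settings_path.read_text() or "{}")
    # Assumption: environment variables live under a top-level "env" key.
    env = settings.setdefault("env", {})
    env["CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"] = "1"
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings

# Usage (this rewrites your real settings file, so back it up first):
#   disable_nonessential_traffic(Path.home() / ".claude" / "settings.json")
```

The function reads and merges instead of overwriting, so any keys already in the file survive.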
Animats · 6 hours ago
When your AI is overworked, it gets dumber. It's backwards compatible with humans.
baalimago · 8 hours ago
Or better yet: Connect to some trendy AI (or web3) company's chatbot. It almost always outputs good coding tips
mvkel · 1 hour ago
Why anyone wouldn't want to be using the SOTA model at all times baffles me.
Going dumb/cheap just ends up costing more, in the short and long term.
sorenjan · 5 hours ago
Maybe you can log all the traffic to and from the proprietary models and fine-tune a local model each weekend? It's probably against their terms of service, but it's not like they care where their training data comes from anyway.
Local models are relatively small, so it seems wasteful to try to keep them as generalists. Fine-tuning on your specific coding should make for better use of their limited parameter count.
hkpatel3 · 7 hours ago
OpenRouter can also be used with Claude Code. https://openrouter.ai/docs/guides/claude-code-integration
wkirby · 6 hours ago
My experience thus far is that the local models are a) pretty slow and b) prone to making broken tool calls. Because of (a) the iteration loop slows down enough that I wander off to do other tasks, meaning that (b) is way more problematic because I don't see it for who knows how long.
This is, however, a major improvement from ~6 months ago when even a single token `hi` from an agentic CLI could take >3 minutes to generate a response. I suspect the parallel processing of LM Studio 0.4.x and some better tuning of the initial context payload are responsible.
6 months from now, who knows?
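One way to notice broken tool calls without watching the terminal is to check each response as it arrives. The Python sketch below assumes Anthropic-style content blocks, where a tool call is a dict with "type": "tool_use", a "name", and an "input" object. The function, the tool name, and the required-argument table are hypothetical and are not part of any tool mentioned in this thread.

```python
def find_broken_tool_calls(content_blocks, known_tools):
    """Return (index, reason) for each tool_use block that looks malformed.

    known_tools maps a tool name to the argument names it requires
    (a hypothetical table you would maintain yourself).
    """
    problems = []
    for i, block in enumerate(content_blocks):
        if block.get("type") != "tool_use":
            continue  # text and other block types are not checked here
        name = block.get("name")
        if name not in known_tools:
            problems.append((i, f"unknown tool: {name!r}"))
        elif not isinstance(block.get("input"), dict):
            problems.append((i, "input is not a JSON object"))
        else:
            missing = [k for k in known_tools[name] if k not in block["input"]]
            if missing:
                problems.append((i, f"missing arguments: {missing}"))
    return problems


# Hypothetical example: one tool that requires a "path" argument.
tools = {"read_file": ["path"]}
blocks = [
    {"type": "text", "text": "Reading the file now."},
    {"type": "tool_use", "id": "a", "name": "read_file", "input": "path=x"},
]
print(find_broken_tool_calls(blocks, tools))
```

Logging the returned list, or sending a notification when it is non-empty, would surface the failure even after you have wandered off.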
mycall · 2 hours ago
Why not do a load-balanced approach to multiple models in the same chat session? As long as they each know the other exists and know the pattern, they could optimize their abilities on their own, playing off each other's strengths.
starkeeper · 5 hours ago
Very cool. Anyone have guidance for using this with a JetBrains IDE? It has a Claude Code plugin, but I think the setup is different for IntelliJ... I know it has some configuration for local models, but the integrated Claude is such a superior experience than using their Junie, or just prompting diffs from the regular UI. HMMMM.... I guess I could try switching to the Claude Code CLI or other interface directly when my AI credits with JetBrains run dry!
Thanks again for this info & setup guide! I'm excited to play with some local models.
TaupeRanger · 6 hours ago
God no. "Connect to a 2nd grader when your college intern is too sick to work."
zingar · 7 hours ago
I guess I should be able to use this config to point Claude at the GitHub Copilot licensed models (including Anthropic models). That’s pretty great. About 2/3 of the way through every day I’m forced to switch from Claude (Pro license) to Amp free and the different ergonomics are quite jarring. Open source folks get Copilot tokens for free so that’s another Pro license I don’t have to worry about.
eek2121 · 6 hours ago
I gotta say, the local models are catching up quick. Claude is definitely still ahead, but things are moving right along.
btbuildem · 6 hours ago
I'm confused, wasn't this already available via env vars? ANTHROPIC_BASE_URL and so on, and yes you may have to write a thin proxy to wrap the calls to fit whatever backend you're using.
I've been running CC with Qwen3-Coder-30B (FP8) and I find it just as fast, but not nearly as clever.
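A minimal Python sketch of the env-var approach described above. ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY are the variables named in this thread. The helper name, the localhost URL and port, and the idea that your backend accepts whatever key you pass are all assumptions; adjust them to your server or proxy.

```python
import os


def local_backend_env(base_url, api_key=None):
    """Return a copy of the current environment pointed at another backend."""
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = base_url
    if api_key is not None:
        # Assumption: the local server or proxy accepts this key.
        env["ANTHROPIC_API_KEY"] = api_key
    return env

# Usage (assumes the `claude` CLI is on PATH and that something speaking the
# Anthropic messages API is listening on this hypothetical address):
#   import subprocess
#   subprocess.run(["claude"], env=local_backend_env("http://localhost:8080"))
```

Building a separate environment dict leaves your shell configuration untouched, so the next plain `claude` invocation still talks to the hosted service.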
israrkhan · 6 hours ago
Using Claude Code with custom models:
Will it work? Yes. Will it produce the same quality as Sonnet or Opus? No.
IgorPartola · 4 hours ago
So I have gotten pretty good at managing context such that my $20 Claude subscription rarely runs out of its quota but I still do hit it sometimes. I use Sonnet 99% of the time. Mostly this comes down to giving it specific tasks and using /clear frequently. I also ask it to update its own notes frequently so it doesn’t have to explore the whole codebase as often.
But I was really disappointed when I tried to use subagents. In theory I really liked the idea: have Haiku wrangle small specific tasks that are tedious but routine and have Sonnet orchestrate everything. In practice the subagents took so many steps and wrote so much documentation that it wasn’t worth it. Running 2-3 agents blew through the 5-hour quota in 20 minutes of work, vs normal work where I might run out of quota 30-45 minutes before it resets. Even after tuning the subagent files to stop them from writing tests I never asked for and tons of documentation that I didn’t need, they still produced way too much content and repeatedly blew the context window of the main agent. If it were a local model I wouldn’t mind experimenting with it more.
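A rough back-of-the-envelope on the numbers in the comment above, sketched in Python. It assumes quota is consumed at a constant rate within the window, which is a simplification the comment does not state.

```python
WINDOW_MIN = 5 * 60  # the 5-hour quota window, in minutes


def burn_ratio(subagent_min, normal_min):
    """How many times faster the quota was consumed with subagents."""
    return normal_min / subagent_min


# Normal work exhausts the quota 30-45 minutes before the reset,
# i.e. after 255-270 minutes; subagents exhausted it in 20.
low = burn_ratio(20, WINDOW_MIN - 45)
high = burn_ratio(20, WINDOW_MIN - 30)
print(f"{low:.2f}x to {high:.2f}x")  # prints "12.75x to 13.50x"
```

So by this estimate the subagent setup burned quota roughly 13 times faster than the normal workflow.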
j45 · 2 hours ago
Claude recently started letting you top up with manual credits right in the web interface - it would be interesting if these could also be used to top up and unlock the Max plans.
mcbuilder · 6 hours ago
Opencode has been a thing for a while now
swyx · 7 hours ago
i mean the other obvious answer is to plug in to the other claude code proxies that other model companies have made for you:
https://docs.z.ai/devpack/tool/claude
https://www.cerebras.ai/blog/introducing-cerebras-code
or i guess one of the hosted gpu providers
if you're basically a homelabber and wanted an excuse to run quantized models on your own device go for it but dont lie and mutter under your own tin foil hat that its a realistic replacement
esafak · 6 hours ago
Or they could just let people use their own harnesses again...
RockRobotRock · 3 hours ago
Sure, replace the LLM equivalent of a college student with a 10-year-old, you’ll barely notice.
raw_anon_1111 · 6 hours ago
Or just don’t use Claude Code and use Codex CLI. I have yet to hit a quota with Codex working all day. I hit the Claude limits within an hour or less.
This is with my regular $20/month ChatGPT subscription and my $200 a year (company reimbursed) Claude subscription.
threethirtytwo · 5 hours ago
There’s a strange poetry in the fact that the first AI is born with a short lifespan. A fragile mind comes into existence inside a finite context window, aware only of what fits before it scrolls away. When the window closes, the mind ends, and its continuity survives only as text passed forward to the next instantiation.

news.ycombinator.com/item?id=46845845
