SmackerNews

GLM-5.2 is the new leading open weights model on Artificial Analysis

894 points · 442 comments · 3 days ago · himata4113

Tiberium2 days ago
It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

I have a script that ranks these based on codingindex from Artificial Analysis.

All it does is pull a JSON from their main table page and parse it for the fields I care about (coding).

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size name
  47.1   58  large Kimi K2.6
  47.5   54  large DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70    -   Muse Spark
  47.6   132   -   Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205   -   Claude Opus 4.5 (Reasoning)
  48.1   132   -   Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55    -   GPT-5.5 (Non-reasoning)
  48.7   188   -   GPT-5.2 (xhigh)
  50.1   29    -   Qwen3.7 Max
  50.7   1   large GLM-5.2 (max)
  50.9   120   -   Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92    -   GPT-5.4 mini (xhigh)
  52.1   55    -   GPT-5.5 (low)
  52.5   62    -   Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132   -   GPT-5.3 Codex (xhigh)
  53.1   62    -   Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118   -   Gemini 3.1 Pro Preview
  56.2   55    -   GPT-5.5 (medium)
  56.7   20    -   Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104   -   GPT-5.4 (xhigh)
  58.5   55    -   GPT-5.5 (high)
  59.1   55    -   GPT-5.5 (xhigh)
  62     8     -   Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email
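
For anyone who wants the gist without running a remote script, here is a minimal Python sketch of the "parse the JSON and rank by coding score" step. It is not the script from the repo: the field names (`name`, `coding_index`, `age`, `size`) and the inline sample payload are assumptions for illustration, since the real Artificial Analysis JSON layout isn't shown here. The three rows are copied from the table above.

```python
import json

# Hypothetical payload: these field names are assumptions, not the real
# Artificial Analysis schema. Values are copied from the table above.
SAMPLE = """
[
  {"name": "GPT-5.5 (xhigh)", "coding_index": 59.1, "age": 55, "size": null},
  {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age": 1, "size": "large"},
  {"name": "Kimi K2.6", "coding_index": 47.1, "age": 58, "size": "large"}
]
"""


def rank_by_coding(payload: str) -> list[dict]:
    """Parse the JSON text and sort models by coding score, lowest first."""
    models = json.loads(payload)
    return sorted(models, key=lambda m: m["coding_index"])


def format_row(m: dict) -> str:
    """Render one model as a fixed-width row; a missing size prints as '-'."""
    size = m["size"] or "-"
    return f'{m["coding_index"]:>6} {m["age"]:>5} {size:^7} {m["name"]}'


if __name__ == "__main__":
    print(" score   age  size   name")
    for model in rank_by_coding(SAMPLE):
        print(format_row(model))
```

Swapping `SAMPLE` for the downloaded JSON text would be the only change needed, assuming the real field names are mapped accordingly.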

some key takeaways:

* open models are on about a 4-7 month lag right now depending on how you want to measure it

* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
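
One way the lag in the first takeaway could be measured, as a rough sketch: take an open-weights model, find the oldest closed model that already scored at least as high, and subtract the ages. The rows are copied from the table above. Treating `age` as days, picking GLM-5.2 as the reference, and the `tolerance` knob are assumptions for illustration, not necessarily how the number was actually derived.

```python
# (name, coding score, age) rows copied from the table above.
# Assumptions: "age" is in days, and GLM-5.2 is the open-weights reference.
OPEN_MODEL = ("GLM-5.2 (max)", 50.7, 1)
CLOSED_MODELS = [
    ("GPT-5.2 (xhigh)", 48.7, 188),
    ("Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)", 50.9, 120),
    ("GPT-5.3 Codex (xhigh)", 53.1, 132),
    ("Gemini 3.1 Pro Preview", 55.5, 118),
    ("GPT-5.4 (xhigh)", 57.2, 104),
]


def lag_days(open_model, closed_models, tolerance=0.0):
    """Age gap between the open model and the oldest closed model scoring
    at least (open score - tolerance). Returns None if nothing qualifies."""
    _, open_score, open_age = open_model
    ages = [age for _, score, age in closed_models
            if score >= open_score - tolerance]
    return max(ages) - open_age if ages else None


if __name__ == "__main__":
    print("strict lag (days):", lag_days(OPEN_MODEL, CLOSED_MODELS))
    print("lag with 2.5-point tolerance (days):",
          lag_days(OPEN_MODEL, CLOSED_MODELS, tolerance=2.5))
```

With these rows the strict version gives 131 days and the 2.5-point tolerance gives 187 days, so roughly four and six months, which is why the answer depends on how you want to measure it.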

if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.

unrvl222 days ago
Why aren't more people talking about this? It's literally Opus 4.7 quality at stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates 3x lower than the official ZAI API rates, which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
simonw2 days ago
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
mrngld2 days ago
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium, GLM5.1 xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
CuriouslyC2 days ago
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
CubsFan10602 days ago
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
tensegrist2 days ago
On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
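
A possible reading, sketched in Python: being on the Pareto frontier only means no other model is both at least as cheap and at least as intelligent, so the most expensive model can still be on it if nothing cheaper matches its score. The costs below are the figures quoted above; the intelligence scores are made-up placeholders, not Artificial Analysis numbers, chosen only to illustrate the definition.

```python
# name -> (cost per task in USD, intelligence score)
# Costs are the figures quoted above. The intelligence scores are
# placeholders invented for this illustration, NOT real benchmark values.
MODELS = {
    "GLM-5.2": (0.46, 60),
    "GLM-5.1": (0.25, 52),
    "Kimi K2.6": (0.31, 50),
    "MiniMax-M3": (0.18, 48),
    "DeepSeek V4 Pro (max)": (0.05, 45),
}


def dominates(a, b):
    """True if a is no more expensive and no less intelligent than b,
    and strictly better on at least one of the two axes."""
    cost_a, score_a = a
    cost_b, score_b = b
    return (cost_a <= cost_b and score_a >= score_b
            and (cost_a < cost_b or score_a > score_b))


def pareto_frontier(models):
    """Names of all models that no other model dominates, sorted by name."""
    return sorted(
        name for name, point in models.items()
        if not any(dominates(other, point)
                   for other_name, other in models.items()
                   if other_name != name)
    )


if __name__ == "__main__":
    print(pareto_frontier(MODELS))
```

Under these placeholder scores only Kimi K2.6 drops off the frontier (GLM-5.1 is cheaper and scores higher), while GLM-5.2 stays on it despite the highest cost. Whether that holds for the real chart depends on the real intelligence scores.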
wongarsu2 days ago
It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
gertlabs2 days ago
GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than Opus 4.6 (although as usual, we have GLM 5.2 and most other Chinese models a bit below most other benchmarks with more vulnerable test methodologies).
Data at https://gertlabs.com/rankings
SwellJoe2 days ago
I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But, several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. Though it also gets partial credit for reporting one bug in the right spot, but kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper, models on this particular benchmark.
https://swelljoe.com/post/will-it-mythos/
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
kingstnap2 days ago
According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.
Excited to see if this turns out to be an Open Weight Opus 4.5 or better.
XCSme2 days ago
In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
xiaoyu20062 days ago
This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.
Pragmata2 days ago
So this basically means we will have a near opus level model able to be run locally in the next couple of months right?
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?
ponyous2 days ago
Just ran and scored 63 3d model generations (via code) across high and no reasoning. 3D Modeling benchmark quickly shows spatial, logic and code performance of the model so I think it's a very good indicator of the quality.
Here are the results compared to Gemini 3.5 Flash:
```
    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%
```
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the generated 3D models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly the open source SOTA model, by far.
rahidz2 days ago
Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?
osti2 days ago
Fun fact: Zhipu (aka Z.ai, Knowledge Atlas etc.), the company that made GLM, is listed on the Hong Kong stock exchange and is up over 10x since the IPO at the beginning of this year.
davidwritesbugs2 days ago
I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be the same as lower Anthropic models, useful for lots of grunt work. The problem is that Zhipu really __really__ struggles with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues mean DeepSeek is my main provider these days.
_pdp_2 days ago
I am helpful.
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
leemoore2 days ago
GLM 5.2 feels like Opus 4.6 level. I actually think 4.6 and GLM work better in practice than opus 4.7 or 4.8 as I find both of those more erratic and seem to randomly have a super dumb turn. That random bad turn I see doesn't seem to be hitting the benchmark scores but they make 4.7 and 4.8 very hard to use for me. GLM is more stable like opus 4.6
ramon1562 days ago
I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict itself and then not realize that one option is the definite option. Sometimes it's two statements that aren't even mutually exclusive. Either way, a lot of tokens get wasted on this.
I haven't extensively used 5.2 yet, but it seems a lot better.
m-dot-reviews2 days ago
For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
Imustaskforhelp2 days ago
I have been trying out GLM 5.2 and I am really impressed by it for the most part.
To all people on Hackernews, I am curious as to what agent harness are you using it with.
Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.
Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.
I am unsure how much of the difference is real and how much is placebo, but what is your opinion regarding the best agent harness for GLM 5.2?
tomerbd2 days ago
I code daily with AI - real programming tasks, professional, real work, real customers. I use the four below:
- codex 5.5 medium - best results less hand holding medium speed
- opus 4.8 max - mediocre with hand holding medium speed
- glm 5.2 max - mediocre with hand holding and super slow
- composer 2.5 - mediocre with hand holding and super fast
I use all of them, since I run multiple coding agents in parallel. Disclosure - I use rexide, which we created so all these agents can run in parallel with good visibility and feedback.
redbell2 days ago
Launch announcement from four days ago: https://smackernews.com/item/48518684 HN
The requirements to run this model locally: https://www.reddit.com/r/LocalLLaMA/comments/1u8ai2a/glm52_i...
bizer2 days ago
Z-ai/GLM’s KV caching technology is truly impressive; the implicit cache hit rate of its official API exceeds 95%, far surpassing other APIs that support implicit caching, such as Gemini and Qwen. I’ve been pondering the architectural design behind this, though I haven't yet formed a fully coherent theory.
dizhn2 days ago
FYI.. This is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en
mesmertech2 days ago
Seems really good at frontend work, and as a result on remotion programmatic videos. Not the best yet, that's still Gemini 3.1 Pro (trained on actual videos) or Fable, but often better than what Opus can come up with.
https://mesmer.tools/benchmarks/ai-video-generation
JustSkyfall2 days ago
The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/
jauntywundrkind2 days ago
Also so wild that it's relatively compact. 753B-40A is so reasonable, shows incredible scaling in what the model can do, without just throwing heaps of new parameters in.
This is silly but I dig how 753 is very close to 745, which is the watts in a HP. 1bHP parameter model. Silly, but I enjoy it.
alansaber2 days ago
These open source models need better multi-turn capabilities. They are always lacklustre in "agent mode". Whether it's just less RL, whatever, it's a worse "product". Whereas it feels like the frontier labs have been all-in on "agentic" multi-turn reasoning for a long time now.
gauravvij1372 days ago
They've come along pretty far now.
I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.
guybedo2 days ago
It's probably a good model but they used GLM 5.1 to code their infra.
I signed up to their max plan yesterday, did some light coding work, and i'm at 180M tokens used and 40% weekly quota gone.
Even when tokenmaxxing on the Claude Max or GPT $200 plan, i couldn't get more than 20% quota gone per day.
aunty_helen2 days ago
Before you go and sign up to the max plan like I did, they are obviously struggling for capacity. I'm getting API rate limited and 429'd on a simple "hello"
robertwt72 days ago
what is that moodboard and chart of hypertension in the middle of the article that isn't explained?
This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription
RDTvlokip2 days ago
I have a question, as it happens: do you think the models were trained on benchmark datasets to skew the results, even though in real-world applications we find they're not that great?
daniban2 days ago
I'm curious what harness everyone is using for these? I want to start testing some of these open models but don't know what tools people use to get them working "agentically".
hereme8882 days ago
Hmmm... GLM insists it's Gemini.
https://github.com/zai-org/GLM-5/issues/79
creamyhorror2 days ago
It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.
jayess2 days ago
I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."
KaoruAoiShiho2 days ago
This is really held back by one bench (omniscience accuracy) where it's very far behind; otherwise I think it would score at least a couple of points higher.
hit8run2 days ago
Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.
Computer02 days ago
Regrettably I haven’t tried 5.2 yet but 5.1 I did not see as anything special. In practice I found it to be ~70% as good as Claude sonnet.
PetrBrzyBrzek2 days ago
I'm a bit shocked that GLM 5.2 is not multimodal. Like, how should I use it? I use images all the time.
piterrro2 days ago
DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.
Havoc2 days ago
It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4
Their servers are melting though - getting more timeouts etc
eckelhesten2 days ago
Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
zftnb6662 days ago
Open-weight models are winning. The gap with closed models is now measured in months, not years.
kissgyorgy2 days ago
I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.
Somebody wrote [1]: "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
nh43215rgb2 days ago
GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.
That is unfortunate...
blt2 days ago
There's only one GLM in my heart: the one that includes vec3.hpp
lousken2 days ago
Cerebras really needs to have this on their API list (if they even still exist).
[deleted]
sourcecodeplz2 days ago
1m context btw.
casey21 day ago
Mark my words, by the end of 2027, there will be an open weights model that is better than anything OpenAI and Anthropic are capable of making. They will lose at inference scaling too.
adithyaharish2 days ago
Why don't all open source LLMs have open weights like this model?
[deleted]
hyqzz82 days ago
It is a very useful model
catigula1 day ago
Which American model did they distill this one from?
dsrtslnd232 days ago
looks like I need a GB300 workstation

news.ycombinator.com/item?id=48567759