894 points · 442 comments · 3 days ago · himata4113
artificialanalysis.ai
Tiberium
kristopolous
All it does is pull a JSON from their main table page and parse it for the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
score  age  size   name
47.1    58  large  Kimi K2.6
47.5    54  large  DeepSeek V4 Pro (Reasoning, Max Effort)
47.5    70  -      Muse Spark
47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
47.8   205  -      Claude Opus 4.5 (Reasoning)
48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6    55  -      GPT-5.5 (Non-reasoning)
48.7   188  -      GPT-5.2 (xhigh)
50.1    29  -      Qwen3.7 Max
50.7     1  large  GLM-5.2 (max)
50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5    92  -      GPT-5.4 mini (xhigh)
52.1    55  -      GPT-5.5 (low)
52.5    62  -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1   132  -      GPT-5.3 Codex (xhigh)
53.1    62  -      Claude Opus 4.7 (Non-reasoning, High Effort)
55.5   118  -      Gemini 3.1 Pro Preview
56.2    55  -      GPT-5.5 (medium)
56.7    20  -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2   104  -      GPT-5.4 (xhigh)
58.5    55  -      GPT-5.5 (high)
59.1    55  -      GPT-5.5 (xhigh)
62       8  -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so $ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email
Some key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing Claude Fable 5 level work before the new year.
If people sign up for the free mailing list (that just does this: emails when new model evals drop), I'll go and put it back on. It was pretty useful.
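For anyone curious what "pull a JSON and parse the fields I care about" might look like, here is a minimal Python sketch. It is not the script from the repo: the real JSON layout is not shown in this thread, so the field names (`coding_score`, `age`, `size`, `name`) and the inline sample payload are assumptions made purely for illustration.

```python
import json

# Hypothetical payload. The real leaderboard JSON is not shown in the thread,
# so every field name here is an assumption.
SAMPLE_JSON = """
[
  {"name": "GPT-5.5 (xhigh)", "coding_score": 59.1, "age": 55, "size": null},
  {"name": "Kimi K2.6", "coding_score": 47.1, "age": 58, "size": "large"},
  {"name": "GLM-5.2 (max)", "coding_score": 50.7, "age": 1, "size": "large"}
]
"""


def format_rows(raw: str) -> list[str]:
    """Parse the JSON and return a header plus one line per model, lowest score first."""
    models = json.loads(raw)
    models.sort(key=lambda m: m["coding_score"])
    lines = [f"{'score':<6}{'age':>4}  {'size':<6} name"]
    for m in models:
        size = m["size"] or "-"  # a missing size is printed as "-"
        lines.append(
            f"{m['coding_score']:<6}{m['age']:>4}  {size:<6} {m['name']}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(format_rows(SAMPLE_JSON)))
```

Swapping `SAMPLE_JSON` for a fetched response would be the obvious next step, but that depends on the actual endpoint and schema, which this sketch does not know.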
unrvl22
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
simonw
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
mrngld
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There are surely some use cases for them, and not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier, everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
CuriouslyC
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
CubsFan1060
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
tensegrist
On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
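A small Python sketch of how a Pareto frontier check works may help with the question above. The costs are the ones quoted in this comment. The scores are the coding scores from the table earlier in the thread, used as a stand-in for the intelligence index, which is an assumption (the actual chart values are not given here). Only models that have both numbers in this thread are included.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # higher is better
    cost: float   # USD per task, lower is better


def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.score >= b.score
        and a.cost <= b.cost
        and (a.score > b.score or a.cost < b.cost)
    )


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Costs quoted in the comment above; scores borrowed from the coding table
# earlier in the thread (assumption: they stand in for the intelligence index).
MODELS = [
    Model("GLM-5.2", 50.7, 0.46),
    Model("Kimi K2.6", 47.1, 0.31),
    Model("DeepSeek V4 Pro (max)", 47.5, 0.05),
]

frontier = pareto_frontier(MODELS)

if __name__ == "__main__":
    for m in frontier:
        print(f"{m.name}: score {m.score}, ${m.cost:.2f} per task")
```

Under these assumed scores, the most expensive model can still sit on the frontier: it only drops off if something cheaper scores at least as high.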
wongarsu
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
gertlabs
Data at https://gertlabs.com/rankings
SwellJoe
https://swelljoe.com/post/will-it-mythos/
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
kingstnap
Excited to see if this turns out to be an open-weight Opus 4.5 or better.
XCSme
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
xiaoyu2006
Pragmata
Qwen 3.6 27B is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?
ponyous
Here are the results compared to Gemini 3.5 Flash:
Model + config            CodeErr/gen  Cost/gen  Median time  Quality
gemini-3.5-flash, low     0.71         $0.18     68s          baseline
GLM 5.2, reasoning high   0.61         $0.18     289s         -6.0%
GLM 5.2, reasoning off    1.52         $0.10     126s         -13.6%
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open-source SOTA model, by far.
rahidz
osti
davidwritesbugs
_pdp_
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
leemoore
ramon156
I haven't extensively used 5.2 yet, but it seems a lot better.
m-dot-reviews
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
Imustaskforhelp
To everyone on Hacker News: I am curious which agent harness you are using it with.
Previously I was using opencode, and then I switched to Opencode + obra/superpowers and creating custom skill.md files for it myself. Things take more time and I intervene more, but I have found the results to be better.
Now I have also started using oh-my-pi, and I found it to be faster than Opencode.
I am unsure how much of a real difference there is and how much is placebo, but what is your opinion on the best agent harness for GLM 5.2?
tomerbd
- codex 5.5 medium - best results, less hand-holding, medium speed
- opus 4.8 max - mediocre, with hand-holding, medium speed
- glm 5.2 max - mediocre, with hand-holding, and super slow
- composer 2.5 - mediocre, with hand-holding, and super fast
I use all of them, since I run multiple coding agents in parallel. Disclosure: I use rexide, which we created so that all these agents can run in parallel with good visibility and feedback.
redbell
The requirements to run this model locally: https://www.reddit.com/r/LocalLLaMA/comments/1u8ai2a/glm52_i...
bizer
dizhn
mesmertech
JustSkyfall
jauntywundrkind
This is silly, but I dig how 753 is very close to 745, roughly the number of watts in one horsepower. A 1bHP-parameter model. Silly, but I enjoy it.
alansaber
gauravvij137
I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.
guybedo
I signed up to their max plan yesterday, did some light coding work, and I'm at 180M tokens used and 40% weekly quota gone.
Even when tokenmaxxing on the Claude Max or GPT $200 plan, I couldn't get more than 20% quota gone per day.
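As a rough back-of-the-envelope check on those numbers, the Python below works out what weekly quota they imply. It assumes the quota is metered linearly by raw token count, which may not be how the plan actually counts usage (cached tokens, for example, could be weighted differently).

```python
TOKENS_USED = 180_000_000   # "180M tokens used"
QUOTA_USED_PERCENT = 40     # "40% weekly quota gone"

# Integer arithmetic keeps the result exact.
# Assumption: quota scales linearly with raw token count.
implied_weekly_quota = TOKENS_USED * 100 // QUOTA_USED_PERCENT

# If one day of work burned 40%, the whole week's quota lasts this many such days.
days_to_exhaust = 100 / QUOTA_USED_PERCENT

if __name__ == "__main__":
    print(f"implied weekly quota: {implied_weekly_quota:,} tokens")
    print(f"days until the quota is gone at this rate: {days_to_exhaust}")
```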
aunty_helen
robertwt7
This is a great step up in open models; however, the pricing to support z.ai is not much cheaper than a Claude / OpenAI subscription.
RDTvlokip
daniban
hereme888
creamyhorror
jayess
KaoruAoiShiho
hit8run
Computer0
PetrBrzyBrzek
piterrro
Havoc
Their servers are melting though - getting more timeouts etc
eckelhesten
I feel like I threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
zftnb666
kissgyorgy
Somebody wrote [1]: "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
nh43215rgb
GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.
That is unfortunate...
blt
lousken
[deleted]
sourcecodeplz
casey2
adithyaharish
[deleted]
hyqzz8
catigula
dsrtslnd23
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
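To put a number on "will probably beat", here is a small Python sketch that turns the token counts quoted above into a break-even price ratio. It assumes a single blended per-token price for each model and ignores input/output splits, caching, and retries. No actual prices are used, since none are given in this comment.

```python
# Average total tokens, in thousands, as quoted above.
AVG_TOKENS_K = {
    "GPT 5.5 xhigh": 16,
    "GPT 5.5 high": 10,
    "Fable 5": 33,
    "Opus 4.8": 41,
    "GLM 5.2": 42,
}


def break_even_price_ratio(model: str, baseline: str) -> float:
    """Fraction of the baseline's per-token price at which both cost the same per request.

    Assumption: one blended per-token price per model.
    """
    return AVG_TOKENS_K[baseline] / AVG_TOKENS_K[model]


if __name__ == "__main__":
    for baseline in ("GPT 5.5 xhigh", "Opus 4.8"):
        ratio = break_even_price_ratio("GLM 5.2", baseline)
        print(
            f"GLM 5.2 is cheaper per request than {baseline} "
            f"if its per-token price is below {ratio:.0%} of {baseline}'s"
        )
```

Under that simplification, GLM 5.2 needs a per-token price below roughly 38% of GPT 5.5 xhigh's to come out cheaper per request, while against Opus 4.8 almost any discount is enough.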