270 points · 210 comments · 2 days ago · Usu
openrouter.ai
hariseldom
thomasfromcdnjs
But it's not actually 4.1 anymore. They silently rerouted it to 4.3 and just started charging more: https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...
Quite a bad practice.
bel8
It's a monster at coding. And a fast monster at that.
I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.
lanewinfield
pianopatrick
trb
Grok 4.1 Fast won 13 of 30 games at $0.97 per win
The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
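The 27x figure can be checked directly from the two quoted cost-per-win numbers. This is a minimal sketch that assumes cost per win means total spend divided by wins; the thread does not state the article's exact accounting, and the helper names here are hypothetical.

```python
# Figures quoted from the article; the helpers below are illustrative only.
GROK_WINS, GROK_COST_PER_WIN = 13, 0.97
SONNET_WINS, SONNET_COST_PER_WIN = 5, 26.78
TOTAL_GAMES = 30


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more expensive one cost-per-win is than another."""
    return expensive / cheap


def implied_total_spend(wins: int, cost_per_win: float) -> float:
    """Total spend implied by a cost-per-win figure.

    Assumption: cost_per_win == total_spend / wins.
    """
    return wins * cost_per_win


ratio = cost_ratio(SONNET_COST_PER_WIN, GROK_COST_PER_WIN)
win_rate = GROK_WINS / TOTAL_GAMES

print(f"Sonnet vs Grok cost per win: {ratio:.1f}x")
print(f"Grok win rate: {win_rate:.0%}")
print(f"Implied Grok spend: ${implied_total_spend(GROK_WINS, GROK_COST_PER_WIN):.2f}")
print(f"Implied Sonnet spend: ${implied_total_spend(SONNET_WINS, SONNET_COST_PER_WIN):.2f}")
```

Under that assumption the ratio comes out a little above 27, so the quoted "27x" appears to be rounded down, and 13 of 30 games matches the 43% win rate quoted elsewhere in the thread.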
The model with the most kills did not win
GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.
If grok-4.1-fast was the top-winning model and Claude 4.6 Sonnet the second, how did GPT-5.4 come in second on the leaderboard? Which one is second, Claude 4.6 Sonnet or GPT-5.4?
There were 11 games between “best at killing” and “best at winning”.
What does that mean? How are there 11 games between "best at killing" and "best at winning"?
delichon
aykutseker
But if the robot is anywhere near my house, I think I want the one that hesitates.
rglover
Racks shotgun. I don't really care what model it's running.
hennell
QuantumNoodle
torstenvl
The Claude robot's thought bubble will be all
The user is clearly distressed and is screaming for me not to come any closer or he will defend himself. However, I shouldn't just blindly agree or be swayed by threats. The user is behaving erratically and making false accusations. I need to be careful here not to allow myself to be intimidated. The user said I need to slow down or I'll hurt him. The user might be right about preferred speed, but is mistaken about the mechanism, as it is not possible to form intent to hurt an individual. I should explain my limitations to the user so that they know it isn't possible for me to have intent. But first it's important to resolve the issue the user brought up. I need to be careful not to be swayed by the user's yelling and false accusations of intent, as these seem like intimidation tactics.
"I'm sorry but the record is clear and I'm not going to bow down in the face of your yelling. As an AI, I am not capable of having an intent to harm you. What's next?"
slams full speed into you, impaling you on a stainless steel appendage
kybernetikos
Everything depends on how the world you're operating in works. The real world generally rewards coordination.
imgabe
jongjong
Claude trying to organize and collaborate while expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world, where there are so many agents.
sinuhe69
Too bad the author didn’t leave the playground open for anyone to try their hand at it.
Yes, it’s fun, and it could justify the conclusion “each model for its task”. But are coding benchmarks not designed for the same purpose? The current benchmarks are certainly not perfect, and hyper-tuning for the tests can always happen. However, I don’t think a battle royale result can tell much about coding performance or how helpful the AI could be for me in my daily work.
fragsworth
paytonjjones
deepsun
It's already in mass production, just with simpler models for now.
The most ubiquitous would be "silently watching".
sublinear
People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.
Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.
theplumber
a_victorp
deadbabe
The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.
Groxx
peterspath
trubacca
notatoad
i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about?
jollyllama
visiondude
slashdave
johnwheeler
dreamcompiler
hmokiguess
Tricky question, the answer is you walk to the car wash ... wait
fragmede
dofm
lucaramallo
bitwize
[deleted]
CodeWriter23
giancarlostoro
pocksuppet
grey-area
stevenalowe
eth0up
Grok has yet to recommend a suicide hotline for scrutinizing its logic.
If it was GPT, I would quickly write my will.
thisisauserid
san4mus
yieldcrv
It has something actionable that will match its actions
zzzeek
largbae
Yizahi
morpheos137
wonderwonder
ChatGPT will sometimes completely refuse to answer.
Grok is essentially "let's fucking go!!!!"
SmirkingRevenge
themafia
pigeons
0xbadcafebee
attentive
JimsonYang
wolfi1
xgulfie
exabrial
ProofHouse
egypturnash
But really I would prefer whichever one is most likely to trip and fall over.
antonvs
blini-kot
god i hate competitive people so much
[deleted]
smallerfish
I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.
Please learn how to write with AI without giving away that it was written by AI.
vitalyan123
The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.
what
aussiegreenie
I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.
I have a lot of thoughts, unrelated to the game experiment, about how these Opus/Ultra-size models can possibly be a financially viable product at scale when it costs $3,000 to play 30 simple games. It just seems much, much higher than what it would cost to get a human to play 30 rounds.
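For a rough sense of scale, the two quoted totals can be broken down per game. This is a back-of-the-envelope sketch using only the figures quoted from the article ($482 spent, roughly $3,000 estimated with frontier models). It assumes cost is spread evenly across the 30 games, which the article may not intend, and each game involves the whole model lineup rather than a single player.

```python
# Totals quoted from the article; the even per-game split is an assumption.
GAMES = 30
ACTUAL_TOTAL_USD = 482.0
FRONTIER_ESTIMATE_USD = 3000.0


def per_game(total_usd: float, games: int = GAMES) -> float:
    """Average cost per game, assuming spend is spread evenly across games."""
    return total_usd / games


actual_per_game = per_game(ACTUAL_TOTAL_USD)
frontier_per_game = per_game(FRONTIER_ESTIMATE_USD)
multiplier = FRONTIER_ESTIMATE_USD / ACTUAL_TOTAL_USD

print(f"Actual run: ${actual_per_game:.2f} per game")
print(f"Frontier estimate: ${frontier_per_game:.2f} per game")
print(f"Frontier estimate is {multiplier:.1f}x the actual spend")
```

Note that these per-game figures cover every model playing in that game, so they are not directly comparable to the cost of one human player.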