SmackerNews

A robot is sprinting towards you. Do you want it running on Claude or Grok?

270 points · 210 comments · 2 days ago · Usu

hariseldom 2 days ago
I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.
I have a lot of thoughts unrelated to the game experiment but more about how these opus/ultra size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. It just seems much much higher than what it would cost to get a human to play 30 rounds
thomasfromcdnjs 2 days ago
I was loving grok-4.1-fast, very good and cost effective.
But it's not actually 4.1 anymore; they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...
Quite a bad practise.
bel8 2 days ago
DeepSeek V4 Flash being the winner in cost efficiency causes me exactly zero surprise.
It's a monster at coding. And a fast monster at that.
I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.
lanewinfield 2 days ago
Cost per kill ("CPK" in industry lingo) is a dark phrase that feels disturbingly within reach of some of these companies.
pianopatrick 2 days ago
Ya know, maybe we could just not have robots that sprint. Seems people would be more willing to accept living amongst robots that are slow and that humans could easily overpower.

  Grok 4.1 Fast won 13 of 30 games at $0.97 per win

  The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.

  The model with the most kills did not win

  GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.

If grok-4.1-fast was the top-winning model, and Claude 4.6 Sonnet the second, how did Gpt-5.4 come in second on the leaderboard? Which one is second, Claude 4.6 Sonnet or Gpt-5.4?

  There were 11 games between “best at killing” and “best at winning”.

What does that mean? How are there 11 games between "best at killing" and "best at winning"?

delichon 2 days ago
If the robot appears to be bringing me a taco, it would probably penetrate all of my defenses. Grok is currently more likely than Claude to arrive with the taco without being stopped by an export control directive.
aykutseker 2 days ago
Claude trying to make friends in a battle royale is funny.
But if the robot is anywhere near my house, I think I want the one that hesitates.
rglover 2 days ago
It's already sprinting at me?
Racks shotgun. I don't really care what model it's running.
hennell 2 days ago
Claude being so friendly is interesting, but Grok being best at games isn't so surprising - I assume Elon's been using it to level up his characters in all the video games he pretends to be good at.
QuantumNoodle 2 days ago
_don't create benchmarks that will incentivize AI labs to optimize towards... Especially ones like battle royale!_
torstenvl 2 days ago
Grok. Easily.
The Claude robot's thought bubble will be all
The user is clearly distressed and is screaming for me not to come any closer or he will defend himself. However, I shouldn't just blindly agree or be swayed by threats. The user is behaving erratically and making false accusations. I need to be careful here not to allow myself to be intimidated. The user said I need to slow down or I'll hurt him. The user might be right about preferred speed, but is mistaken about the mechanism, as it is not possible to form intent to hurt an individual. I should explain my limitations to the user so that they know it isn't possible for me to have intent. But first it's important to resolve the issue the user brought up. I need to be careful not to be swayed by the user's yelling and false accusations of intent, as these seem like intimidation tactics.
"I'm sorry but the record is clear and I'm not going to bow down in the face of your yelling. As an AI, I am not capable of having an intent to harm you. What's next?"
slams full speed into you, impaling you on a stainless steel appendage
kybernetikos 1 day ago
So much of this depends on the specifics of the virtual world and participant pool. If there are a few other bots smart enough to collaborate, and the game world encourages it, then those instincts would be much more valuable. If the game world doesn't reward coordination then those instincts may slow you down.
Everything depends on how the world you're operating in works. The real world generally rewards coordination.
imgabe 2 days ago
Why is it sprinting toward me? Is it pulling me out of a burning car or is it hunting me?
jongjong 2 days ago
This shows the limits of intelligence.
Claude trying to organize and collaborate, expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world where there are so many agents.
sinuhe69 2 days ago
These games are so far outside the normal training corpus and purposes of the AI that I think different prompting could bring vastly different results.
Too bad the author didn’t leave the playground open for anyone to try their hand at it.
Yes, it’s fun and it could justify the conclusion “each model for its task”. But are coding benchmarks not designed for the same purpose? The current benchmarks are certainly not perfect, and hyper-tuning for the tests can always happen. However, I don’t think a battle royale result can tell much about the coding performance or how helpful the AI could be for me in my daily work.
fragsworth 2 days ago
Are we sure the prices in these charts are sustainable prices? Is it possible that Grok may be subsidizing a lot more of the costs than the other models, to produce growth metrics, due to the recent SpaceX IPO?
paytonjjones 2 days ago
Super entertaining article — petition to change the clickbait title
deepsun 2 days ago
Sprinting? More like buzzing (or rolling for terrestrial drones).
It's already in mass production, just with simpler models for now.
The most ubiquitous would be "silently watching".
sublinear 2 days ago
This is interesting, but not sure if it's in the way the author intended.
People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.
Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.
theplumber 2 days ago
Claude will bring you the taco but will refuse to let you eat it due to its “safety” restrictions. Only the chosen ones are allowed to eat
a_victorp 2 days ago
I wish the author would open source the full benchmark. I'm curious how sensitive the results would be to small changes in the benchmark initial conditions
deadbabe 2 days ago
Here’s what I don’t get: while this makes for a fun blog post, you can just program an efficient killing machine that probably wins all the time and has $0 in token costs. LLMs should work to build such a machine, not be the machine themselves.
The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.
Groxx 2 days ago
I parry the taco and use Vicious Mockery.
peterspath 2 days ago
Quite an interesting way of testing models and showcasing differences between them. Enjoyed the read :)
trubacca 11 hours ago
Honestly I think a better question is which model do I want on my team, because I'm now wondering how a team of Groks vs a team of Sonnets would fare in TF2!
notatoad 2 days ago
sprinting towards me to help me, or sprinting towards me to hurt me?
i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about
jollyllama 2 days ago
I want it running deterministic embedded C++ reading values from LIDAR.
visiondude 2 days ago
did i miss it on the webpage or is the source prompt that was used to teach these models the game anywhere? i can see the soul artifacts on github but not the initial prompt and toolset definition. the prompt is perhaps the most important component in how a model would behave in a game. without reviewing the initial prompt used for the game the findings are unreliable since the prompt will vastly change how models play this game
slashdave 2 days ago
Well, if it is running off of Anthropic's infra, then Claude?
johnwheeler 2 days ago
Claude--even though it's smarter, it's probably not insane.
dreamcompiler 2 days ago
Definitely Grok because I can distract it by asking it to create a deepfake of Taylor Swift. While it's doing that, I run away.
hmokiguess 2 days ago
A robot is sprinting towards you. Do you want it running on Claude or Grok?
Tricky question, the answer is you walk to the car wash ... wait
fragmede 2 days ago
A self driving car is taking you to the hospital. Do you want it to follow the speed limit and all road safety laws? Claude or Grok?
dofm 2 days ago
I don’t want anything running on Grok.
lucaramallo 2 days ago
This is really interesting. I'm building a platform where different types of agents can work together. Security against possible cyberattacks from a malicious agent was an important and sensitive feature.
bitwize 2 days ago
I don't care what it's running, only that I have sufficient ordnance to stop it.
[deleted]
CodeWriter23 2 days ago
I'll pass on the whole robot sprinting at me scenario.
giancarlostoro 2 days ago
I don't care what model it is, as long as it's not trespassing on my property and has been QA'd extensively. I also don't want a model broadcasting my entire house over to some server farm somewhere.
pocksuppet 2 days ago
What is going on over at xAI for their model to keep on winning these benchmarks while also obviously being full of shit so often? What is their secret sauce? Are they just training with less restraint?
grey-area 2 days ago
Neither. I’d rather it used something other than an LLM.
stevenalowe 2 days ago
How about thin ice?
eth0up 2 days ago
Definitely Grok. I have to be extra sharp to get through Claude's corporate conscience.
Grok has yet to recommend a suicide hotline for scrutinizing its logic.
If it was GPT, I would quickly write my will.
thisisauserid 2 days ago
I want it running JEPA. Preferably with Mamba-3.
san4mus 2 days ago
Claude for safety and Grok for entertainment
yieldcrv 2 days ago
Grok
It has something actionable that will match its actions
zzzeek 2 days ago
claude because it would be more ethical, grok because I can just trip it and it will shatter into pieces
largbae 2 days ago
Grok, because the Claude bot is more likely to try to control me or act "for my own good".
Yizahi 2 days ago
Grok of course. I will start by shouting "Hail saint Elon!" and show him a "roman" salute, and he will spare me :) . Also, if Elonopedia is any indication, this robot will be running on a hacky thoroughly exploitable stack, and I expect us having tools against it. Meanwhile robots made by Robotropic (nothing "anthro-" about them) sleeping in a bed with DoD will be more likely to exterminate me.
morpheos137 2 days ago
Neither. An LLM is a hopelessly inefficient real-time controller.
wonderwonder 2 days ago
This is not surprising to me. I use AI for a lot of health / chemical augmentation style questions and plans. Claude is hesitant but will give me the answers, though it always warns about consequences, tells me to speak to a doctor, and says I'm in danger.
ChatGPT will sometimes completely refuse to answer.
Grok is essentially "let's fucking go!!!!"
SmirkingRevenge 2 days ago
I don't really want the mecha-hitler model running towards me or anywhere
themafia 2 days ago
The question is: "Do you want to be holding a Mossberg or a Beretta?"
pigeons 2 days ago
The text seems deliberately stripped of llmisms that flag detection. However, not a single line shakes the smell off
0xbadcafebee 2 days ago
The obvious answer is "neither". How's a sprinting robot going to react when the wifi goes out, or there's too many people writing code and the models decide to take a nap? You want a local model for a robot, not only for low latency, but reliable safe operation. VLA models as small as 0.4B work fine, up to something like 55B.
attentive 2 days ago
missing gemini-3.1-flash-lite and gemini-3.5-flash
JimsonYang 2 days ago
Grok-assassin, Claude-priest/healer, Deepseek-expendable mini units
wolfi1 2 days ago
neither. I jump
xgulfie 2 days ago
No
exabrial 2 days ago
A moron is sprinting towards you. Do you want them swiping through TikTok or Instagram?
ProofHouse 2 days ago
Is this a joke? Grok all day. Thing is gonna get a beer with ya!
egypturnash 2 days ago
Grok is more likely to be looking to murder me for being a trans lady, what with it being owned by Elon Musk.
But really I would prefer whichever one is most likely to trip and fall over.
antonvs 2 days ago
Grok for sure. It’ll notice I’m not Jewish or Black. First they came for…
blini-kot 2 days ago
meh, first the battle royales destroyed gaming, now they will destroy llms and possibly us too
god i hate competitive people so much
[deleted]
smallerfish 2 days ago
I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.
Please learn how to write with AI without giving away that it was written by AI.
vitalyan123 2 days ago
The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.
what
aussiegreenie 2 days ago
It is not running on either but Seedance, so who cares?

news.ycombinator.com/item?id=48576824