963 points · 914 comments · 1 month ago · MallocVoidstar
blog.google
spankalee
sdeiley
Think about ANY other product and what you'd expect from the competition that's half the price. Yet people here act like Gemini is dead weight.
____
Update:
3.1 cost 40% as much to run the AA index as Opus Thinking AND Sonnet, beat Opus, and was still 30% faster in output speed.
https://artificialanalysis.ai/?speed=intelligence-vs-speed&m...
sheepscreek
So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.
While it gives me hope, I am going to play it by ear. Otherwise it's going to be Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.
UPDATE: I may have spoken too soon.
> Fixing Truncated Array Syncing Bug
> I traced the missing array items to a typo I made earlier!
> When fixing the GC cast crash, I accidentally deleted the assignment..
> ..effectively truncating the entire array behind it.
These errors should not be happening! They are not the result of missing knowledge or a bad hunch. They are coming from an incorrect find/replace, which makes them completely avoidable!
On a lighter note, every time it happens, I think about this Family Guy clip: https://youtu.be/HtT2xdANBAY?si=QicynJdQR56S54VL&t=184
minimaxir
Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3
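If the docs are right, the level is just a request-config field. A minimal sketch, assuming the google-genai Python SDK's ThinkingConfig field and a "gemini-3.1-pro" model id (both assumptions on my part, not verified against 3.1):

    # Sketch only: "thinking_level" and the model id are assumptions
    # based on the linked Gemini 3 docs, not verified against 3.1.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-3.1-pro",
        contents="Explain the difference between a mutex and a semaphore.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="medium"),
        ),
    )
    print(response.text)

Presumably setting "low"/"high" instead (or omitting the field) is how you trade latency against reasoning depth, if it works like the Gemini 3 knob.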
Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.
xrd
It's totally possible to build entire software products in the fraction of the time it took before.
But, reading the comments here, the behavior from one version to the next point version (not a major version, mind you) seems very divergent.
It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.
Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.
I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtly different behavioral characteristics without notice. It'll be quicksand.
mijoharas
Are Google planning to put any of their models into production any time soon?
Also, it's somewhat funny that some models are deprecated without a suggested alternative (gemini-2.5-flash-lite). Do they suggest people switch to Claude?
1024core
You are definitely going to have to drive it there—unless you want to put it in neutral and push!
While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.
Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.
Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
nickandbro
"create a svg of a unicorn playing xbox"
https://www.svgviewer.dev/s/NeKACuHj
Still some tweaks needed to the final result, but with the ARC-AGI benchmark jumping so much, I am guessing the model's visual abilities are what allow it to do this well.
Robdel12
I am legit scared to log in and use Gemini CLI, because the last time I used it I thought I was on my "free" account allowance via Google Workspace. I ended up spending $10 before realizing it was API billing, and the UI was so hard to figure out that I gave up. I'm sure I could spend 20-40 more minutes to sort this out, but ugh, I don't want to.
With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?
simonw
WarmWash
However, it didn't get it on the first try with the original prompt ("How many legs does the dog have?"). It initially said 4; a follow-up prompt then got it to hesitantly say 5, reasoning that one limb must be obfuscated or hidden.
So maybe I'll give it a 90%?
This is without tools as well.
sigmar
edit: biggest benchmark changes from 3 pro:
arc-agi-2 score went from 31.1% -> 77.1%
apex-agents score went from 18.4% -> 33.5%
esafak
zhyder
Apart from that, the usual predictable gains in coding. It's still a great sweet spot for performance, speed, and cost. You need to hack Claude Code to keep its agentic logic+prompts but use Gemini models.
I wish Google would also update Flash-Lite to 3.0+; I'd like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.
davidguetta
So Google doesn't use NVIDIA GPUs at all?
maxloh
Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.
I hope the 3.1 release includes a much larger output limit.
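For what it's worth, the hard ceiling sits in the request config, not the prompt, which is why instructing the model to pause doesn't help. A sketch of the knob, assuming the google-genai SDK; the model id and the 65,536 cap are my assumptions, not documented 3.1 numbers:

    # max_output_tokens is the cap being complained about above;
    # the 65,536 value and model id are assumptions, not confirmed for 3.1.
    from google import genai
    from google.genai import types

    client = genai.Client()
    source = open("chapter.txt").read()  # hypothetical long source text

    response = client.models.generate_content(
        model="gemini-3.1-pro",
        contents=f"Restructure this without dropping context:\n{source}",
        config=types.GenerateContentConfig(max_output_tokens=65_536),
    )
    print(response.text)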
the_duke
BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.
I hope they manage to improve things on that front, because then Flash would be great for many tasks.
faebi
Similar in Antigravity. Privately, it's my absolute favorite.
So I'm actually rooting for this.
ttul
This tech is not going to replace us. If anything, I am becoming even more of a workaholic. But the output volume is going to pay off for those who are privileged enough to use these tools.
tenpoundhammer
exabrial
Not another piece of Electron bloatware; a regular, efficient, fast, snappy, native app. One that connects to my MCP servers and has local filesystem tools.
Anthropic might fall behind Google/OpenAI eventually, but their Desktop App + MCP/Connectors is unbelievably useful to get real work done.
mbh159
zapnuk
1. Unreliable in GH Copilot. Lots of 500 and 4XX errors. Unusable in the first 2 months.
2. Not available in Vertex AI (Europe). We have requirements regarding data residency. Funnily enough, Anthropic is on point with releasing their models to Vertex AI. We already use Opus and Sonnet 4.6.
I hope google gets their stuff together and understands that not everyone wants/can use their global endpoint. We'd like to try their models.
XCSme
qingcharles
It's only February...
ArmandoAP
infinitewars
veselin
Anthropic seems the best at this: everything is in the API on day one. OpenAI tends to push you toward a subscription first, but the API gets there a week or a few later. Now, Gemini 3 is not for production use, and this is already the previous iteration. So, does Google even intend to release this model?
vnglst
vnglst
janalsncm
This kind of test is good because it requires stitching together info from the whole video.
sergiotapia
opencode models --refresh
Then /models and choose Gemini 3.1 Pro.
You can use the model through OpenCode Zen right away and avoid that Google UI craziness.
---
It is quite pricey! Good speed and nailed all my tasks so far. For example:
@app-api/app/controllers/api/availability_controller.rb
@.claude/skills/healthie/SKILL.md
Find Alex's id, and add him to the block list, leave a comment
that he has churned and left the company. we can't disable him
properly on the Healthie EMR for now so
this dumb block will be added as a quick fix.
Result was: 29,392 tokens
$0.27 spent
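(Sanity-checking that figure, assuming Gemini 3 Pro's preview pricing of $2/M input and $12/M output carried over to 3.1, which I haven't confirmed; the input/output split below is hypothetical:)

    # Back-of-envelope cost check. The rates and the token split are
    # assumptions; only the 29,392 total and $0.27 come from the run.
    INPUT_RATE = 2.00 / 1_000_000    # assumed $/input token
    OUTPUT_RATE = 12.00 / 1_000_000  # assumed $/output token

    input_tokens, output_tokens = 12_392, 17_000  # hypothetical split of 29,392
    cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    print(f"${cost:.2f}")  # ~$0.23, in the ballpark of the observed $0.27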
So relatively small task, hitting an API, using one of my skills, but a quarter. Pricey!
agentifysh
More importantly, it feels like Google is stretched thin across its different Gemini products, and the pricing reflects this. I still have no idea how to pay for Gemini CLI; with Codex/Claude it's very simple: $20/month for entry and $200/month for a ton of weekly usage.
I hope whoever is reading this from Google can redeem Gemini CLI by focusing on being competitive instead of on making it look pretty (that's the impression I got from the updates on X).
dxbednarczyk
For conversational contexts, I don't think the (in some cases significantly) better benchmark results compared to a model like Sonnet 4.6 can convince me to switch to Gemini 3.1. Has anyone else had a similar experience, or is this just a me issue?
timabdulla
I would love for them to eliminate these issues because just touting benchmark scores isn't enough.
upmind
thallavajhula
Gemini is almost great. Claude Opus is great. I keep switching among these subscriptions every month so as not to miss out on any of the offerings for too long: ChatGPT Plus <-> Gemini Pro <-> Claude.
WarmWash
Either way, early user tests look promising.
carpe__diem
In production, the costly failures are usually "almost right" edits that quietly shift semantics across large diffs.
We now gate model upgrades behind a fixed eval set of our own repos + prompts and compare pass rates by task category (refactor, test repair, API migration). Raw benchmark gains matter less to us than variance and rollback safety. If 3.1 improves consistency on long multi-file edits, that’s a bigger win than a small jump on one-shot tasks.
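A minimal sketch of what such a gate can look like, with made-up task ids and a stubbed run_task(); in a real harness, run_task() executes the prompt and checks the resulting diff:

    # Hypothetical eval gate: compare per-category pass rates between
    # the current model and a candidate before allowing the upgrade.
    from collections import defaultdict

    TASKS = [  # (category, task id) pairs drawn from our own repos
        ("refactor", "task-001"),
        ("test_repair", "task-002"),
        ("api_migration", "task-003"),
    ]

    def run_task(model: str, task_id: str) -> bool:
        # Stand-in: run the prompt with `model`, apply the edit,
        # and return True if tests and semantic checks pass.
        return True  # placeholder so the sketch runs

    def pass_rates(model: str) -> dict[str, float]:
        wins, totals = defaultdict(int), defaultdict(int)
        for category, task_id in TASKS:
            totals[category] += 1
            wins[category] += run_task(model, task_id)
        return {c: wins[c] / totals[c] for c in totals}

    def gate_upgrade(old: str, new: str, tolerance: float = 0.02) -> bool:
        old_r, new_r = pass_rates(old), pass_rates(new)
        # Reject the upgrade if any task category regresses beyond tolerance.
        return all(new_r[c] >= old_r[c] - tolerance for c in old_r)

    print(gate_upgrade("gemini-3-pro", "gemini-3.1-pro"))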
XCSme
EDIT: while also being 3x cheaper
pawelduda
dudeinhawaii
The model itself also shows strange behaviors, as if it gets randomly replaced with Gemini-3-Flash or something else. I'll explain.
Once agentic coding was a bust, I gave it a run as a daily-driver AI assistant. It performed fairly well but then began behaving strangely. It would lose context mid-conversation. For instance, I said "In San Francisco I'm looking for XYZ". Two turns later I'm asking about food and it gives me suggestions from all over the world.
Another time, I asked it about the likelihood of the pending East Coast winter storm affecting my flight. I gave it all the details (flight, stops, time, cities).
Both GPT-5.2 and Claude crunched the numbers and came back with high-quality estimations and rationale. Gemini 3.1 Pro, five times in a row, returned a weather forecast widget for either the layover or the final destination. This was on "Pro" reasoning, the highest exposed in the Gemini app/web app. I've always suspected Google swaps out models randomly, so this wasn't surprising.
I then asked Gemini 3.1 Pro via the API and it returned a response similar to Claude and GPT-5.2 -- carefully considering all factors.
This tells me that a Google AI Ultra subscription gives me a sub-par coding agent which often swaps in Flash models, a sub-par web/app AI experience that also isn't using the advertised SOTA models, and a bunch of preview apps for video gen, audio gen (crashed every time I attempted), and world gen (Genie was interesting but a toy).
This will be a quick cancel as soon as the intro rate is done.
It's like Google doesn't ACTUALLY want to be the leader in AI or serve people their best models. They want to generate hype around benchmarks and then nerf the model and go silent.
Gemini 3 Pro Preview went from exceptional in the first month to mediocre and then out of my rotation within a month.
hackrmn
nobrains
There is not enough time to read the text, see the old animation, and see the new animation. It would have been better to keep the same animation on repeat, so that people have unlimited time to read the text and observe the animations.
Also, it jumps from example to example in the same video. It would have been better to show each separately, so that once a user is done observing one example at their own pace, they can proceed to the next.
As a workaround, I had to open the video (just the video) in a new tab, pause once an example came up, read the text, then rewind to the start of the animation to see the old animation example, then rewind again, then see the new animation example, and then sometimes rewind again if I wanted to see the animation again. Then, once done with the example, I had to forward to the next example and repeat the above process again.
Somewhere along that process, they lost me.
saberience
I get the impression that Google is focusing on benchmarks without assessing whether the models are actually improving in practical use cases.
I.e., they are benchmaxing.
Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.
PunchTornado
jeffbee
ETA: They apparently wiped out everyone's chats (including mine). "Our engineering team has identified a background process that was causing the missing user conversation metadata and has successfully stopped the process to prevent further impact." El Mao.
ponyous
Unsurprisingly 3.1 performs a bit better. But surprisingly it costs 2.6x as much ($0.14 vs. $0.37 per 3D Model Generation) and is 2.5x slower (1m 24s vs. 3m 28s).
To me it feels like "let's increase our thinking budget and call it an improved model!"
josalhor
rahulroy
I tried telling this to the agent, and it keeps repeating the same phrase: "Gemini 3.1 Pro is not available on this version. Please upgrade to the latest version."
Congratulations on beating the benchmarks, but I wonder how much effort is devoted to improving DX?
Edit: It's updated now; I can confirm with "There are currently no updates available." It still doesn't let me continue with the conversation. I'm able to create a new session, though.
markerbrod
vinhnx
dude250711
brap
What’s most surprising is that I had it follow a strict loop/workflow and it did that perfectly. Normally these things go off the rails after a while with complex workflows. It’s something I have to usually enforce with some orchestration script and multiple agents, but this time it was just one session meticulously following orders.
Impressive, and saves a lot of time on building the orchestration glue.
impulser_
Murfalo
[deleted]
conception
OpenAI's and Google's Deep Research produce a very long, 100% made-up report. If I question the AIs about the report, they both admit they just made it up.
Claude just returns, "I couldn't find anything on the BBS or the game."
cmrdporcupine
[deleted]
metavolvelabs
ChrisArchitect
0xcb0
onlyrealcuzzo
If the pace of releases continues to accelerate, by mid-2027 or 2028 we're headed for weekly releases.
mark_l_watson
Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.
[deleted]
pRusya
Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.
use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.
panarchy
zokier
Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.
So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.
datakazkn
mixel
rishabhaiover
azuanrb
But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.
tskulbru
MASNeo
The latest update? I simply don’t care. I am not paid to evaluate models, I am paid to build. Not sure 4 benchmark points are making the difference.
6d6b73
barfingclouds
clhodapp
[deleted]
hsaliak
d4rkp4ttern
ChrisArchitect
makeavish
n4pw01f
In contrast, the VS Code plugin was pretty bad and did crazy things like mixing languages.
attentive
I'd rate it between Haiku 4.5 (also pretty good for the price) and Sonnet. Closer to Sonnet.
Sure, if I weren't cost-sensitive I'd run everything on Opus 4.6, but alas.
quacky_batak
Anthropic is clearly targeted at developers, and OpenAI is the general go-to AI model. Who is the target demographic for Gemini models? I know they're good, and Flash is super impressive, but I'm curious.
robviren
mrcwinn
syspec
On our end, Gemini 3.0 Preview was very flaky (not model quality, but the API responses sometimes errored out), making it unreliable.
Does this mean that 3.0 is now GA, at least?
denysvitali
0x110111101
Drblessing
siliconc0w
Grisu_FTP
Am I the issue? Am I just misremembering the early times because it was a new thing?
holografix
Is Gemini meant to be a revenue-making product, or strictly a cost centre to defend against Search and Ads erosion by OpenAI?
Why does the Gemini web app not support MCP Servers?
__jl__
jeffybefffy519
Jirach05
alwinaugustin
SrFil
seizethecheese
ismailmaj
johnwheeler
eric15342335
[deleted]
nautilus12
matrix2596
getcrunk
atleastoptimal
1024core
[deleted]
yuvalmer
msavara
[deleted]
Topfi
andrewstuart
Useless.
[deleted]
[deleted]
naiv
LZ_Khan
mustaphah
As per the announcement, Gemini 3.1 Pro scored 68.5% on Terminal-Bench 2.0, which makes it the top performer on the Terminus 2 harness [1]. That harness is a "neutral agent scaffold" built by the Terminal-Bench researchers to compare different LLMs in the same standardized setup (same tools, prompts, etc.).
It's also taken top model place on both the Intelligence Index & Coding Index of Artificial Analysis [2], but on their Agentic Index, it's still lagging behind Opus 4.6, GLM-5, Sonnet 4.6, and GPT-5.2.
---
[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...
[2] https://artificialanalysis.ai/
trilogic
Would be nice to see one of these models, Plus, Pro, Super, God mode, actually hit 100% on one bench. Am I missing something here?
kuprel
jdthedisciple
BMFXX
hn_throw2025
https://www.google.com/appsstatus/dashboard/incidents/nK23Zs...
makeavish
himata4113
lysecret
leecommamichael
throwaw12
Benchmarks are saying: just try
But the real world could be different.
pickle-pixel
taytus
solarisos
techgnosis
jcims
(FWIW I'm finding a lot of utility in LLMs doing diagrams in tools like drawio)
I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.
It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.
Within VS Code Copilot, Claude has a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something without telling you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text-editing tools. In Copilot, it won't stop and ask clarifying questions, though in Gemini CLI it will.
So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.
For as much as I hear that Google's pulling ahead, from a practical POV it seems to me that Anthropic is. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.