SmackerNews

Harness engineering: Leveraging Codex in an agent-first world

296 points · 206 comments · 5 days ago · pramodbiligiri

openai.com

bko 4 days ago
We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what's the typical number of PRs engineers were expected to push? Maybe 2-10?
Do people feel the software has gotten better in the last 6 months? The number of engineers is probably the same, so we should expect maybe a 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. Outside of that, I don't see it.
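As a rough sanity check on the quoted "3.5 PRs per engineer per day", here is a minimal sketch of the arithmetic. The constant team size of three and the day counts per month are assumptions (the quote says the team later grew to seven and does not say whether weekends count), so treat the results as rough bounds, not as the article's own calculation.

```python
# Back-of-the-envelope check of the quoted throughput figure.
# From the quote: roughly 1,500 merged PRs over five months.
# Assumptions (not in the article): a constant team of three engineers,
# and either ~30 calendar days or ~21 working days per month.

def prs_per_engineer_per_day(total_prs: int, engineers: int, days: int) -> float:
    """Average merged PRs per engineer per day."""
    return total_prs / (engineers * days)


TOTAL_PRS = 1500
ENGINEERS = 3
MONTHS = 5

calendar_rate = prs_per_engineer_per_day(TOTAL_PRS, ENGINEERS, MONTHS * 30)
working_rate = prs_per_engineer_per_day(TOTAL_PRS, ENGINEERS, MONTHS * 21)

print(f"calendar days: {calendar_rate:.2f} PRs/engineer/day")  # 3.33
print(f"working days:  {working_rate:.2f} PRs/engineer/day")   # 4.76
```

Under these assumptions the quoted 3.5 sits close to the calendar-day figure; a larger average team size would push both numbers down.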
shepherdjerred 4 days ago
This mirrors exactly what I have been doing.
- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)
- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (https://github.com/shepherdjerred/monorepo/tree/main/package...)
- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)
- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries
I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
h4ny 4 days ago
I'm not an AI skeptic but I'm skeptical of the intent of this article. It makes great claims about agent-first engineering and tries to make a real case based on a real product, with real users, and a real team that's been growing — all without even saying what was built or showing it, just like every other AI hype article.
murat124 4 days ago
The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouths and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.
mohsen1 4 days ago
I've been doing the same experiment in tsz[1] for a while now (the same past five months in fact) and I have come to very similar conclusions. Lots of harness to enforce good architecture splits. Lots of tests and CI.
My point in working on tsz is to learn how to do very big projects with AI. Eventually the same workflows and attitude can be leveraged to build customer product apps with UI as well. I see that OpenAI is leveraging automated browser testing and even videos as part of their workflow. I think as models get better this direction for making software will eventually make sense. I don't think we're there yet, though. But at least, unlike OpenAI's vague claims, I can share the output for you to see!
Most solutions that offer a very high level of automation, like Lovable, are a bit too optimistic and are not tightly coupled with lots of automated testing.
[1] https://github.com/tsz-org/tsz
thelucent 4 days ago
This might work only if you have “infinite” compute and infinite tokens.
As someone who used the $20 plan, I found this pure agentic approach impossible, because I’d hit the limit fast and end up with less output.
What I found works incredibly well is to provide human-written code as a reference and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM can have a head start on its attention (pun intended).
I usually write a stub with a lot of detail on how to implement it, something like higher-abstraction pseudocode, then ask the LLM to implement it.
When it fails, it is often better to undo all of the changes, adjust the stub so it covers what failed before, and try again.
Or commit the changes, start a fresh context, and address only what went wrong.
-
Whenever I try this agentic from-scratch approach, I always end up disappointed, both with the outcome and with the limit that I hit before an hour has even passed.
zbrock 4 days ago
Hello! I’m one of the three engineers who wrote this piece. Happy to answer questions.
drivebyhooting 4 days ago
I wish these breathless blog posts would actually try to be more didactic.
For example, actually doing a walkthrough of how to set up these allegedly super powered workflows and concrete demonstrations.
I’m not an AI skeptic. Rather, I don’t want to miss out on any actual super powers.
zatkin 4 days ago
I worry most about blindspots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel approaches to existing problems?
yurimo 4 days ago
What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you should optimize for as few generated lines as possible, with readability for humans as a secondary objective. I suspect providers don't see it as a problem because more lines generated means more tokens used and hence more billed to customers.
And if I am working on an existing codebase, isn't a good commit often net negative in added versus removed lines? I don't want to bloat my codebase; I want to make it more polished and elegant. After reading that, I wonder if what they have done could have been accomplished with a far smaller LoC budget.
zuzululu 4 days ago
I think a lot of people are sleeping on the contents of this article. There are still valuable tidbits I'm going to be applying.
swyx 4 days ago
We interviewed Ryan here: https://www.latent.space/p/harness-eng
and he gave a talk version of it in London: https://www.youtube.com/watch?v=am_oeAoUhew
patdoli 4 days ago
We tried this early on: we used ChatGPT as "project manager" to set up the entire harness before writing any code. After a week it had produced 140+ docs of rules, architecture, frameworks. Zero lines of code. When we finally brought in another tool to review, the verdict was: "a perfectly secure empty safe." The harness was immaculate. There was just nothing inside it.
Harness matters, but if you're not shipping code alongside it, you're just writing fiction.
faangguyindia 4 days ago
Codex updates seem to appear every few hours (I'm not saying that's how often they're actually published, but that's my perception as a user). Often I update Codex only to see a new update within an hour or so.
Many times those updates are not properly tested; for example, in one update the model selector got completely changed.
Then a hotfix was pushed that restored the original.
advertum 3 days ago
LOC was never really the point, if anything it's often the opposite. Being able to say the same thing more concisely makes code cheaper to maintain and easier to understand. (Not always: sometimes people are concise just for its own sake, and that's worse.) Either way, that whole debate feels like the past now.
As for metrics based on how much was produced or spent, like companies measuring developer performance by tokens used, that's a dead end regardless. Performance should be measured by outcomes: incident/failure rate, SLA, user numbers and feedback, revenue, that kind of thing.
epolanski 4 days ago
I don't really understand the naysayers here.
First, a premise: I don't like where software engineering is heading, at all. I have never been unhappier working in this field than I have been since AI came out. And no, it is not possible to opt out of AI, especially when your teammates are all great engineers whose productivity increased a lot without any drop in code quality (in fact the opposite has happened). You need to keep up. But it's tiring and the fun/interesting parts are disappearing.
That being said, it's clear that harness engineering is the most important part of our job and that it is going to take more and more of our time. So a glimpse of how an AI company handles it is interesting by any measure.
charintstr 4 days ago
I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward the top 10% of my team. I find it likely that either:
A. The code is absolute garbage and is speed for speed’s sake
B. They’re using an internal model that is a generation beyond GPT 5.5
I say this because we’ve attempted to do something similar using the latest-gen Claude models and a significantly larger team. The code is probably on the order of millions of LoC but is an absolute mess because of vibing. There’s a price you pay for speed.
jonmoore 4 days ago
This would be much more convincing if the repos, issue trackers, etc. were accessible.
skulquake 3 days ago
Thanks for pointing this article out, it really helped me in optimizing a pattern I'm developing to ensure agent context windows don't get bloated over time.
bigcat12345678 4 days ago
This matches my Cursor-based agentic repo almost verbatim.
There isn't anything here that I hadn't already experienced and factored into constructs in the repo.
I also find that all of the bits created for an effective agentic engineering project match mainstream engineering best practices perfectly. That has been one of my primary reasons to go all in on agentic engineering. Prior to this, applying best practices was always too costly and conflicted with the team's daily priorities.
ajpaulson 4 days ago
The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.
Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.
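A simplified reading of the quoted rule, sketched in Python: code in a layer may only import from layers listed before it, plus one shared entry point for cross-cutting concerns. The direction of "forward", the lowercase layer names, the `providers` name, and allowing same-layer imports are all assumptions made for illustration; the article does not show how its check is actually implemented.

```python
# Layer order taken from the quoted rule. Assumption: a module may import
# only from layers earlier in this list (its own layer included).
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

# Hypothetical name for the single cross-cutting entry point.
PROVIDERS = "providers"


def is_allowed(importer: str, imported: str) -> bool:
    """Return True if a module in layer `importer` may import from `imported`."""
    if imported == PROVIDERS:
        return True  # auth, telemetry, feature flags, etc. enter this way
    if importer not in RANK or imported not in RANK:
        return False  # anything outside the known layers is rejected
    return RANK[imported] <= RANK[importer]


def violations(edges):
    """Keep only the disallowed (importer, imported) pairs."""
    return [(a, b) for a, b in edges if not is_allowed(a, b)]


edges = [
    ("service", "repo"),       # fine: repo comes before service
    ("ui", "service"),         # fine: service comes before ui
    ("repo", "ui"),            # not fine: repo reaching up into ui
    ("service", "providers"),  # fine: the explicit cross-cutting interface
]
print(violations(edges))  # [('repo', 'ui')]
```

"Enforced mechanically" would then mean a check like this runs in CI or a linter and fails the build on any violation, instead of relying on reviewers to notice.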
andai 4 days ago
To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop).
https://ghuntley.com/loop/
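The loop described in the quote can be sketched roughly as follows. This is a toy model, not Codex's actual mechanism: the `Reviewer` type, `drive_to_completion`, the `revise` callback and the `max_rounds` cap are all hypothetical names introduced here (the quote itself describes iterating until every reviewer is satisfied, with no cap mentioned).

```python
from typing import Callable, List

# A reviewer takes the current change and returns a list of feedback items;
# an empty list means it is satisfied. Stand-ins only, not a real API.
Reviewer = Callable[[str], List[str]]


def drive_to_completion(
    change: str,
    reviewers: List[Reviewer],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 10,  # safety cap, an assumption added for this sketch
) -> str:
    """Revise `change` until every reviewer returns no feedback."""
    for _ in range(max_rounds):
        feedback = [item for review in reviewers for item in review(change)]
        if not feedback:
            return change  # all reviewers satisfied
        change = revise(change, feedback)
    raise RuntimeError("reviewers still unsatisfied after max_rounds")


# Toy reviewer and reviser to show the control flow.
def needs_tests(change: str) -> List[str]:
    return [] if "tests" in change else ["add tests"]


def toy_revise(change: str, feedback: List[str]) -> str:
    return change + " + tests"


print(drive_to_completion("feature", [needs_tests], toy_revise))  # feature + tests
```

In a real setup each reviewer would be an agent or a human, and `revise` would be another agent run that receives the collected feedback.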
Frannky 4 days ago
I started using ChatGPT for functions and checking, then for single file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may start to become smarter to learn how to see only the architectural directions while multiple agents work, test, and commit, both at the unit level and against a live deployment.
simonbarker87 4 days ago
I’d be interested to know two things:
1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?
2. How much did it cost? Work is being done whilst the engineers sleep, but if that 6-hour overnight task cost $300 and could have been done by a person in 2 hours, is it a real saving?
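Using the hypothetical figures in the question ($300 for the agent run, 2 hours for a person), the break-even point is simple division. This is only a sketch: it ignores review time for either path and whatever else the engineer does with the freed hours, and `break_even_hourly_rate` is a name made up for this example.

```python
def break_even_hourly_rate(agent_cost: float, human_hours: float) -> float:
    """Hourly human cost at which the agent run and the human cost the same."""
    return agent_cost / human_hours


# Figures are the commenter's hypothetical, not data from the article.
rate = break_even_hourly_rate(agent_cost=300.0, human_hours=2.0)
print(f"break-even: ${rate:.0f}/hour")  # break-even: $150/hour
```

Read this way, the overnight run is cheaper only if the person's fully loaded cost is above $150 per hour, or if the hours it frees up are worth more than the difference.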
darepublic 4 days ago
Codex pushed an update that made my old threads inaccessible. It takes a million lines to put out a half-baked CRUD app?
Aperocky 4 days ago
1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness.
Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on working, existing constructs such as tmux.
janpeuker 4 days ago
I find it so interesting that "Agent legibility is the goal" picks up James C. Scott's term (without defining it, so I assume that's what they mean), which is _not a good thing_. Legibility is a governance effort to box in life.
egorferber 4 days ago
Yep, that's true. Pre-grounding is very much worth it: if you just feed the agent a quick environment brief upfront instead of making it spam tool calls to figure out where it is, you save a lot of tokens.
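A minimal sketch of what such an environment brief could look like, using only the Python standard library. The function name `environment_brief`, the fields it collects, and the 20-entry cap are illustrative assumptions; what actually belongs in a brief depends on the project and the agent.

```python
import os
import platform
import sys


def environment_brief(root: str = ".", max_entries: int = 20) -> str:
    """Build a short plain-text summary of the working environment."""
    # Cap the listing so the brief stays small even in large directories.
    entries = sorted(os.listdir(root))[:max_entries]
    lines = [
        f"os: {platform.system()} {platform.release()}",
        f"python: {sys.version_info.major}.{sys.version_info.minor}",
        f"cwd: {os.path.abspath(root)}",
        "top-level entries: " + ", ".join(entries),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Paste or pipe this into the agent's first prompt.
    print(environment_brief())
```

The idea is that one cheap local call replaces several exploratory tool calls at the start of a session.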
xyzal 4 days ago
Given that we can code at 10x speed for at least half a year, one would expect to see at least some pieces of machine-created software with 5 years' worth of equivalent human engineering work.
Anyone know some?
osigurdson 4 days ago
They will have to open source it. Otherwise it is impossible for anyone outside of OAI to gain any insights - basically just a Boris at this point.
spacebacon 4 days ago
Leveraging a better way. No last mile.
https://github.com/space-bacon/SRT
IAmGraydon 4 days ago
Title should probably be marked with (February 2026).
apical_dendrite 4 days ago
I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel.
If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.
If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.
Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but as old fogies who grew up in a different world.
Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth. We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
DenisM 3 days ago
Notably they feed metrics and traces into the agent. I never thought of doing that.
mgaunard 4 days ago
Isn't this essentially normal AI usage and what everyone has been doing for 6 months?
drchaim 4 days ago
But this is almost what we have been doing for the last 3-5 months, isn’t it?
bronny1989 4 days ago
Why do you have “weeks” to ship what would take “months”?
mwkaufma 3 days ago
Another breathless sales pitch selling pickaxes to miners, but where's the gold? Where's the incredible product that the chatbots-talking-to-chatbots over git generating LOC heaps have actually _created_? I just don't see it.
shevy-java 4 days ago
The world is now agent-first already?
aulin 4 days ago
Dear OpenAI, the target audience of your blog, or at least of this blog post, understands English pretty well. Why won't you give them a simple way to disable the shitty AI translation and read the original content? Why translate it at all in the first place?
EDIT: found the button, all the way down at the bottom of the page... I hate this so much. Give me the original content; I will decide if and when I need translation.
nullbio 4 days ago
Step 1: Be rich.
Sarkie 4 days ago
I would never dare put that in production
witx 4 days ago
"Engineering"
These people are so delusional it feels like a mental disease by now.
I really hope no one gets hurt in the future by all this slop code from these wannabe engineers.
Yokohiii 4 days ago
in an agent-first world
casual gaslighting
varenc 4 days ago
digression:
It's interesting this was submitted to HN over 15 times since it was published in February: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraging', so maybe that's the secret.)
angrydev 4 days ago
Published Feb 11, 2026
EnPissant 4 days ago
Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.
But then I saw it was published in February and OP is just reposting it to farm karma.
rfw300 4 days ago
I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand?
Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make—that the consumers of your work are worse off because you exercised no human judgment in creating it.
knicholes 4 days ago
Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.

news.ycombinator.com/item?id=48416264