The Untrainable

Jun 10, 2026

The mid-2026 investor’s version of AI psychosis is a despair that nothing is investable, that we should put all our money into Anthropic and Nvidia and go home. I have never felt it. I have been sure the models are smarter than me for several sub-versions now, I’d be a happy buyer of Anthropic and Nvidia at the market price, and all my smartest friends are quite convinced that self-improvement is soon to work – and I still don’t feel it. The despair isn’t stupid. The logic runs: if the model keeps getting better at everything, then every company built on top of one is a thin wrapper waiting to be absorbed, and the only value that survives is the compute and the frontier weights.

Take software, the case the despair leans on hardest. Devin shipped in 2024 solving thirteen percent of the tasks on the standard software benchmark, and was largely dismissed. A year and a half later the best agents hit the high eighties, and they’re doing real work inside Goldman Sachs and the U.S. Army. Nearly everyone drew the same wrong lesson: the model ate software engineering. But as the model swallowed the part of software engineering you can best measure, we’re relearning what many teams knew – engineering has always resisted measurement, and the most measurable parts may not be the only important ones.

Mert Demirer and coauthors at MIT finally put numbers on it: across more than 100,000 developers, the latest coding agents lifted how much code got written by roughly 180%, and how much actually shipped by about 30%. Writing got cheap. The rest still runs through a person, and it matters. The net impact is, of course, still amazing.
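A back-of-envelope sketch of that gap, under two assumptions that are not in the study itself: that "lifted by 180%" means 2.8x the baseline, and that both figures are measured against the same baseline. The function name `relative_ship_rate` is made up for illustration.

```python
def relative_ship_rate(written_lift_pct: float, shipped_lift_pct: float) -> float:
    """Return shipped-per-written code as a multiple of its pre-agent level.

    Assumes a lift of X% means a multiple of (1 + X/100) over the same baseline.
    """
    written_multiple = 1 + written_lift_pct / 100   # 180% lift -> 2.8x
    shipped_multiple = 1 + shipped_lift_pct / 100   # 30% lift  -> 1.3x
    return shipped_multiple / written_multiple


ratio = relative_ship_rate(180, 30)
print(f"{ratio:.2f}")  # roughly 0.46
```

Read that way, each unit of written code is a bit less than half as likely to ship as before, which is one way of saying the bottleneck moved downstream of writing.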

A benchmark is a thing you can measure, and a thing you can measure is a thing you can train against. Thus, coding agents matured first: a compiler is a free verifier, a test suite is a free verifier, and when the answer checks itself for nothing you can grind against the check until you beat it. But passing the test never told you the change was the right one for a decade-old codebase with three undocumented reasons that module exists and a deploy pipeline held together by a cron job no one will admit to writing.

That kind of correctness can’t be read off a leaderboard, and it can’t really be read off anything. You find out whether a system that complex works by running it in the world long enough to learn, and a smarter model doesn’t make the world run faster. Nobody unit-tests something the size of Google and trusts the green check; you trust it because it survived years of real load. Correctness like that isn’t only private, it’s the slow kind of moat capital can’t collapse. Even the optimists grant the clock can’t be skipped: Noam Brown, who has pioneered OpenAI’s reasoning models, wrote recently that the only sure way to evaluate an agent over a one-year horizon may be to run it…for a year.

As Gabe Pereyra says, real automation isn’t only the model getting better. It’s the product, the model, the workflow, and the firm moving together, and three of those four move at the speed of an organization. Moving people is the part no benchmark touches: getting a skeptical partner to change how she runs her matters, holding a team together through a rebuild. It’s why, when we hire a CEO, the ability to deal with people weighs at least as much as the analytical horsepower, and a smarter model doesn’t change that weighting. The feedback is ambiguous, the horizon is years, and the trust belongs to a person. Every company I know has every engineer on frontier coding models, and not one has changed its eng org at anything close to that speed. Adoption took a quarter, and what a magical quarter of token growth it was! But the rebuild is taking years.

What’s legible is what’s leaving. The valuable work is illegible by construction: anything you can put on a leaderboard, you can train against, so anything measurable is already on its way to commodity. The process takes time and is never total, but the direction never reverses. Put it in money terms, the way my friend Matt MacInnis at Rippling does: a token spent answering a generic question is worth almost nothing, since anyone’s model can answer it, while a token spent reasoning over your company’s data is worth much more, because it does the thing you actually want, not just the plausible thing.

The legible work gets eaten from two directions. From below, tasks saturate: once a job can be checked cheaply, the buyer stops asking which model did it and starts asking what it costs, and the work falls to whatever open or distilled model is cheapest that week. Everywhere they can matter, margins eventually matter. From above, the labs are trying to get the models to swallow their own scaffolding. The retrieval, the routing between cheap and expensive calls, the tool use, even the reasoning policy, all the apparatus that used to wrap a model is being pulled into the weights, until the wrapper is the model. This is the absorption frontier. Margin pressure cuts the other way too: a general agent has to be ready for anything, which is expensive, while a focused application can tune one workflow until it runs on a fraction of the token spend, and unlike the lab selling those tokens, it keeps the difference.

So, we may ask two things of any kind of work. Is its correctness private and expensive to establish, the kind of truth that exists only inside someone’s data? And is it walled off, locked inside a system you can’t get into? Set those against how saturated the task is, and you get a 2x2. Saturated work with public answers is commodity tokens, and open models own it. Frontier work with public answers, where coding benchmarks live, is where the labs win, because when the eval is free, owning it counts for nothing. The prize is the last corner, the untrainable one: frontier work whose correctness exists only in private. You can see it in the inference clouds hosting the AI-native pioneers, where the vast majority of tokens are generated by custom models, not generic open ones.
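The 2x2 can be written down as a small lookup. This is only a sketch: the `Work` type and its field names are hypothetical, the two questions about private correctness and walled-off systems are collapsed into one flag, and the paragraph above names only three of the four corners, so the fourth is left explicitly unlabeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    saturated: bool            # task is cheap to check and many models can do it
    private_correctness: bool  # ground truth exists only inside someone's data or systems


def quadrant(work: Work) -> str:
    """Map a kind of work onto the corners described in the text."""
    if work.saturated and not work.private_correctness:
        return "commodity tokens (open models)"
    if not work.saturated and not work.private_correctness:
        return "frontier labs"
    if not work.saturated and work.private_correctness:
        return "the untrainable corner"
    # Saturated work with private correctness is not described in the text.
    return "not characterized in the text"


print(quadrant(Work(saturated=False, private_correctness=True)))
```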

The walls into that last corner vary in height. A single developer’s toy codebase is portable and standardized, so the climb is short. A bank’s production systems are neither, and you don’t get root by being 2% more clever on SWE-Bench Verified.

Capability eats many things, but a better model does not make private ground truth public. It does not hold the license, sign off on the liability, or own the firm’s files, and it cannot be the party that gets sued when the answer is wrong. Intelligence is not the bottleneck here. Permission is, and so is accountability. You can imagine a model far smarter than any person, and it still has to be let in the door, and someone still has to put their name on what it does.

That door has a lock and a deadbolt. The lock is the environment: you only get to verify whether AI did something useful inside a system once you’re trusted inside it, after the security review, the integration, the contract with your name on the outcome. The deadbolt is the user. A majority of American doctors now open OpenEvidence every day, and no amount of compute buys that. A lab can train a flawless medical model tomorrow and still have no way into the physician’s habit, or into the decision flow of UCSF, because trust is built slowly, on relationships, with the user’s acquiescence, not by gradient descent that erases them.

This is also the job. An application earns its place in the untrainable corner by doing unglamorous work: arranging a company’s private reality so a model can act on it, handing the model the tools to act, working with the customer to change the reality of its workforce. A company that brings the translation is tough to copy – and the translation never ends. Integration and maintenance run as long as the relationship does, won by teams that put domain-specialized engineers and tools next to the customer.

As one example, at a top white-shoe law firm, the M&A practice alone runs close to a thousand deals a year. You can’t have hundreds of associates each downloading client files to a desktop and asking a general agent to rip through them, for confidentiality reasons and a dozen others, and even if you could, what you’d learn would be fragments, one associate’s corrections at a time, with no view of how a whole deal flows. The signal that matters lives at the level of the deal, and a deal has a shape: for M&A the NDA, the term sheet, diligence, the purchase agreement, the ancillaries, the closing checklist; for IP litigation, motions, discovery, prior art, more motions. Each practice area has its own, and neither the lawyers nor the tools interchange across them. And the problem the firm is actually solving sits a level above all of it: running every practice area in parallel, the way a top partner runs hundreds of matters at once while bringing in new ones and training associates. Transforming a firm like that isn’t a single task you can write an eval for. It takes an operator to moneyball it, with extremely ambiguous intermediate goals and incomplete feedback, over very long horizons, in an environment that won’t hold still.

Illegible value is unfortunately also complicated to sell, for the same reason it’s hard to commoditize: a company can’t tell from the outside whether AI will transform its operations any better than the benchmark can. So the strongest businesses stop trying to prove it externally, get in, and price the outcome instead. Sierra charges when its agent resolves a customer’s issue and nothing when it kicks the problem to a human, so the price becomes the evaluation, and it works only because Sierra owns the definition of resolved. Cognition’s Devin makes the same move in software with a “performance guarantee,” which you can only offer for outcomes in a system you’re trusted inside.
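To see why owning the definition of resolved matters, here is a minimal sketch of outcome pricing. It is not Sierra's or Cognition's actual billing logic: the `Conversation` fields, the price, and both definitions of "resolved" are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    escalated_to_human: bool
    customer_confirmed_fix: bool


def strict_resolved(c: Conversation) -> bool:
    # Hypothetical definition: the customer confirmed the fix and no human stepped in.
    return c.customer_confirmed_fix and not c.escalated_to_human


def loose_resolved(c: Conversation) -> bool:
    # Hypothetical alternative: anything that was not handed to a human counts.
    return not c.escalated_to_human


def invoice_cents(
    conversations: Iterable[Conversation],
    price_per_resolution_cents: int,
    is_resolved: Callable[[Conversation], bool],
) -> int:
    """Charge only for conversations that meet the agreed definition of resolved."""
    return sum(price_per_resolution_cents for c in conversations if is_resolved(c))


log = [
    Conversation("a", escalated_to_human=False, customer_confirmed_fix=True),
    Conversation("b", escalated_to_human=True, customer_confirmed_fix=False),
    Conversation("c", escalated_to_human=False, customer_confirmed_fix=False),
]

print(invoice_cents(log, 150, strict_resolved))  # 150
print(invoice_cents(log, 150, loose_resolved))   # 300
```

The same log produces a bill twice as large under the looser definition, so whoever writes `is_resolved` is effectively writing the evaluation.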

Even serving tokens, the layer everyone loved to call a pure commodity, doesn’t behave like one. The best AI-native companies concentrate their serving on one or two providers (Baseten or Fireworks) because cost per token commoditizes on schedule while reliability under real traffic and guaranteed access to scarce compute do not. Where you serve is a different choice from which models you use. Price is the only part of inference that acts like a commodity.

One objection often raised is that the lab is your supplier – why won’t it run its own first-party product below cost to bleed you out, or revoke your API access and take the market itself? That is the real version of the despair, and it only works if the model layer is a single-player game. It clearly isn’t – it looks more like a three-and-a-half-way death match with a crop of international players six months of training behind, and a development league 5X the size it was last year. Customers want competition among their suppliers, and the labs want market share more than they want any one application dead.

You can watch this in the markets where the labs compete head to head. In consumer chat, the best model has never simply won. ChatGPT held its lead through years of real competition, and the share it is losing now is going to Gemini on the strength of Android and Search, not a better model. Anthropic, which the prediction markets (and internet vibes) currently rate as having the best model, is barely a factor in consumer chat and built its business in enterprise and coding instead. If a better model can’t take a rival’s users in the most central application there is, it isn’t going to integrate its way through a hospital’s records or a bank’s liability. The public chooses on more than coding today. If the frontier remains crowded, the layer above will be valuable.

If the work can’t be scored from outside, someone on the inside has to decide what a good answer even is, and that decision is the whole game. Enough of those decisions, written down, become a benchmark. Harvey publishes one for law, and Sierra publishes one for voice agents. You earn the right to define what good means for a field by being the one it already uses, and these companies earned that through the struggle of real adoption.

The evaluation that decides real money is private and per-firm: what this firm, on this kind of matter, will accept as good work, and it is nowhere near finished, because the depth of the law dwarfs any public test. OpenEvidence is settling what a safe clinical answer looks like. None of this is really measurement, it is judgment about what is true and what is good, written down until it becomes the standard everyone else is measured against, and a foundation lab can’t author it however smart it gets, because that standing only exists inside the field. That authority tends to land where it already sat. The senior lawyer writes the legal benchmark. Defining a safe clinical answer falls to a physician. And resolved means whatever the company that already owns the customer says it does.

The absorption frontier keeps rising, because we keep learning to measure more of the work, and the measurable gets eaten. The untrainable ground shrinks under whoever’s standing on it, so you can’t find a defensible spot and rest. You keep stepping toward whatever can’t yet be scored, and you re-underwrite constantly. On a narrow task, with your private data and your own evals, you can train to the frontier and beat the general models where it counts, and that specialized model becomes part of the moat. Competing on the general model, on the other hand, is a capital war you lose to whoever owns the most compute, and it is the trap for a company with shallow access and a legible task. Once such a company commits to out-training the frontier across a general swath of tasks in order to survive, the winner is decided mostly by datacenter size, and the ending is usually not an independent champion but a sale to someone compute-rich.

All of that is defense. Even harder is offense, choosing what to build in the first place. That’s what I spend the year looking for, and I find it maybe three times. The model is no help there. It will do whatever you point it at and can’t tell you what’s worth pointing it at, and you can’t benchmark that, so you can’t train it. It’s also the reason the incumbents don’t take everything: they keep the ground they have, and the next thing comes from someone who finds a use before the rest of us. Maybe intent is an even scarcer input than compute.

The despair is half right. The thin wrapper layer really is being absorbed, and a lot of what looks like a company today is a thin wrapper. It is wrong about what that leaves. The mechanism is clear; the destination isn’t. What I’d bet on is the direction: intelligence keeps getting cheaper, and value keeps sliding toward the few places a model can’t reach. The untrainable is value with history. So get inside one, do the unglamorous translation, and start writing down what good means there, because someone is going to. The most cited benchmark score of the year is a map of territory about to be worthless, and a notice of who is about to lose the right to say what counts as good.

Lisa Su

Jun 15

This reinforces something I've been thinking about lately: we're entering an era where AI commoditizes execution, while humans become curators of context, trust, and accountability. Product teams may soon spend more time designing feedback loops and evaluation systems than building features themselves. The future moat is not owning intelligence—it's owning the environment where intelligence can safely operate.

Doug Standley

Jun 13

Noted! Pay attention to this wisdom from the frontline. Oh, and take a look at Onyx to see Sarah’s lens in action.

Sarah Guo
