When AI Writes Almost All the Code — Five Months In – Good Machine

TL;DR

Orosz called the model inflection correctly. Opus 4.5, GPT-5.2 and Gemini 3, all shipping in a six-week window from mid-November 2025, did clear a capability line that earlier models had not.
One developer can now ship 100% AI-written code. Boris Cherny did it on Claude Code. Malte Ubl shipped two open-source projects over a winter break. These are not staged demos. They are the new floor for a certain kind of engineer doing a certain kind of work.
The team-level dividend has not materialised on the same curve. DX's 4.2M-developer dataset puts AI-generated production code at ~27%, not 90%. DORA 2025 finds delivery stability falling as AI adoption rises. Organisational throughput has not moved past ~10% on the most generous reading.
The bottleneck moved downstream. Q1 2026 surveys show developers now spending more time reviewing AI code (11.4 hrs/week) than writing it (9.8 hrs/week). PR sizes are up 154%. Review times up 91%. Bugs up 9%. Code churn doubled.
What this means depends on what you actually do. For a solo builder on greenfield work, Orosz's January call is closer to true every week. For an engineering organisation maintaining a codebase someone will still own in 2029, the picture is more sobering and the upgrade path is not the IDE.

What Orosz called — and what has aged well

The January piece made one big architectural claim and one operational one.

The architectural claim: November–December 2025 was a model capability inflection, an actual line crossed rather than a marketing event. Opus 4.5 (24 Nov), GPT-5.2 (11 Dec), and arguably Gemini 3 (17 Nov) reached a level where a meaningful fraction of working engineers, including ones who had publicly dismissed AI coding as slop two months earlier, flipped. Karpathy is the load-bearing example. In October he called AI coding intermediate-stage and unworthy of the hype. By 26 December he was writing "I've never felt this much behind as a programmer."

That call has aged well. METR's Horizon benchmark measures the task length a model can complete at a 50% success rate, and that horizon has been roughly doubling every 128 days since 2023. Claude 3.7 Sonnet sat at 60 minutes. Claude Opus 4.6, released February 2026, reached 719 minutes. That is twelve hours of coherent autonomous work on the bench. Whether the bench reflects production work is a separate question, but the underlying capability curve is not in doubt.
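
The arithmetic behind those figures is easy to check. What follows is a minimal Python sketch that uses only the numbers quoted above (60 minutes, 719 minutes, a 128-day doubling time). The function names are illustrative and are not part of any METR tooling.

```python
import math


def doublings_between(start_minutes: float, end_minutes: float) -> float:
    """Number of doublings needed to grow from start_minutes to end_minutes."""
    return math.log2(end_minutes / start_minutes)


def implied_days(start_minutes: float, end_minutes: float, doubling_days: float) -> float:
    """Days a steady doubling trend would need to cover that growth."""
    return doublings_between(start_minutes, end_minutes) * doubling_days


if __name__ == "__main__":
    n = doublings_between(60, 719)
    days = implied_days(60, 719, 128)
    print(f"{n:.2f} doublings, about {days:.0f} days at a 128-day doubling time")
```

Going from 60 to 719 minutes is a little over three and a half doublings, which a steady 128-day cadence would cover in roughly 459 days.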

The operational claim — that this would translate into ~90%+ AI-generated code for many developers and teams in 2026 — is the one the data has complicated.

Where the data stopped agreeing

Three numbers do most of the work here, and they do not point the same direction.

The first number is 27%. DX measured AI-generated production code across 4.2 million developers and 450+ companies in late 2025 — not "AI suggestions accepted," not "AI-touched," but code that made it to production attributable to AI generation. The figure was 26.9%. A study published in Science using a different method on Python contributions from US GitHub developers landed around 30%. Sundar Pichai's public number for Google is "more than a quarter." These cluster. They are real measurements, not vendor claims. They are nowhere near 90%.

The second number is 19%. A METR randomised controlled trial published in July 2025 found experienced open-source developers using AI tooling completed real tasks 19% slower than developers without it. The same developers believed they had been 20% faster. That 39-point perception gap is the most uncomfortable finding in the field. METR has since walked the headline number down — a newer cohort of 800+ tasks with 57 developers showed -4% with a confidence interval that crosses zero — but the perception gap is the part that survives every methodological challenge. People who are slower with AI consistently believe they are faster.

The third number is 10%. Six independent studies — DORA, Faros AI, NBER's executive survey, Bain's lifecycle analysis, Stack Overflow's longitudinal data, and DX's organisational metrics — converge on a ceiling of roughly 10% organisational throughput improvement at high adoption. Apollo's Torsten Slok put it cleanly in February: "AI is everywhere except in the incoming macroeconomic data." An NBER paper that month surveyed nearly 6,000 executives. Over 80% reported AI had no measurable impact on productivity over the preceding three years.

Hold these three numbers next to Boris Cherny's 200 AI-written PRs and you have the actual shape of the change. Individual capability has gone vertical. Organisational outcome has not. Both are true. Both are the story.

The bottleneck moved. It did not disappear.

Writing code was never the bottleneck in shipping software. Bain's lifecycle analysis puts coding-and-testing at roughly 25–35% of total software development time. The rest is requirements, review, debugging, meetings, documentation, deployment, on-call. A 100% speedup on the smaller fraction produces a 15–25% topline improvement if everything else holds constant — and the data is increasingly clear that everything else does not hold constant.
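
The ceiling described here is Amdahl's law applied to the software lifecycle. The sketch below is a rough check, assuming that a "100% speedup" means the coding-and-testing step runs twice as fast and that every other step is unchanged. The function name is illustrative.

```python
def overall_speedup(fraction: float, step_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `fraction` of total time
    is accelerated by a factor of `step_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / step_speedup)


if __name__ == "__main__":
    for fraction in (0.25, 0.35):
        doubled = overall_speedup(fraction, 2.0) - 1.0
        # Limit case: the coding step takes no time at all.
        ceiling = overall_speedup(fraction, float("inf")) - 1.0
        print(
            f"coding share {fraction:.0%}: "
            f"gain at 2x = {doubled:.1%}, gain with free coding = {ceiling:.1%}"
        )
```

Under that reading the topline gain comes out at roughly 14% to 21%, in the neighbourhood of the range quoted above. Even an infinitely fast coding step is capped at about 33% to 54%, because the remaining 65–75% of the lifecycle is untouched.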

Faros AI measured 10,000+ developers across 1,255 teams in mid-2025. Teams with high AI adoption merged 98% more pull requests. PR size grew 154%. Review time went up 91%. Bugs up 9%. Organisational DORA metrics — the actual throughput numbers — were flat. The coding step accelerated. The review step, which was already the constraint, got worse. This is what bottleneck migration looks like in the wild.

Q1 2026 survey data sharpens this. Developers now report spending 11.4 hours per week reviewing AI-generated code versus 9.8 hours writing new code. The job changed shape. The people who feel best about AI tooling are the people whose workflow now revolves around producing code; the people who feel worst are the people who own the codebase six months later, and who are reading more than they are writing.

This is not quite the failure mode Orosz was warning against. He flagged the ugly outcome ("more code generated will lead to more problems, weak software engineering practices start to hurt sooner"), but the specific mechanism that has emerged is review collapse under PR volume. It is a queueing problem more than a quality problem. And queueing problems do not yield to better models.
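
A rough way to see the queueing effect is to treat review as a single-server queue (M/M/1). That is a textbook simplification, not something the Faros data establishes. The 98% and 91% multipliers are the Faros figures quoted above, and applying the review-time increase to each PR is an assumption. The baseline utilisation and service time are hypothetical.

```python
def review_utilisation(base_utilisation: float, pr_growth: float, review_time_growth: float) -> float:
    """Reviewer load after PR count and per-PR review time both grow."""
    return base_utilisation * (1.0 + pr_growth) * (1.0 + review_time_growth)


def mm1_time_in_system(service_hours: float, utilisation: float) -> float:
    """Mean hours a PR spends waiting plus in review in an M/M/1 queue."""
    if utilisation >= 1.0:
        return float("inf")  # arrivals outpace reviewers; the queue grows without bound
    return service_hours / (1.0 - utilisation)


if __name__ == "__main__":
    base_rho = 0.25        # hypothetical: reviewers busy 25% of the time before AI
    base_service = 1.0     # hypothetical: one hour of review per PR before AI

    new_rho = review_utilisation(base_rho, pr_growth=0.98, review_time_growth=0.91)
    new_service = base_service * 1.91

    print(f"load multiplier: {review_utilisation(1.0, 0.98, 0.91):.2f}x")
    print(f"utilisation: {base_rho:.2f} -> {new_rho:.2f}")
    print(f"time in system before: {mm1_time_in_system(base_service, base_rho):.1f} h")
    print(f"time in system after:  {mm1_time_in_system(new_service, new_rho):.1f} h")
```

The two multipliers compound to a review load of about 3.8x. In this toy model, any team whose reviewers were more than about 26% utilised beforehand ends up past saturation, which is why a faster model on the generation side does not help.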

The security number nobody puts on the keynote slide

Black Duck's 2026 Open Source Security and Risk Analysis report — the eighteenth annual, large sample, longitudinal — found vulnerabilities per codebase up 107% year-over-year. The mean codebase moved from 280 to 581 known vulnerabilities. The report does not attribute the doubling entirely to AI, but Veracode's parallel finding does some of that work: testing 100+ LLMs across 80 coding tasks, 45% of AI-generated code introduced OWASP Top 10 vulnerabilities. CodeRabbit's analysis put AI-generated code at 2.74× the security vulnerability density of human-written code.

If your mental model of "AI writes the code" includes "AI also catches its own security issues at human-equivalent rates," the data does not support that yet. The natural pairing of agentic coding tools with agentic security review is shipping — GitHub's Copilot Autofix, Snyk DeepCode AI, Semgrep's AI rules — but the gap between code generation rate and security review rate is the next eighteen months of the story.

What this actually means for you

The honest answer depends on which version of "you" we are talking to. We are addressing engineers and the people who lead them, the natural audience of the originating piece, rather than any particular organisation or commercial context.

If you are an individual engineer building greenfield software on mainstream stacks (TypeScript/React, Python, Go, Rust on well-trodden frameworks): Orosz's January call holds. The fastest path forward is to commit to one anchor tool — Claude Code, Cursor, or Codex — for ninety days, learn its failure modes deeply, and accept the productivity premium. Q1 2026 data has Claude Code at 28% primary-tool share and Cursor at 24%; these are the safe bets on workflow fit, not benchmark wins. The plateau hits at ~180 days, so plan to re-evaluate before then.

If you maintain a codebase older than two years: the dividend is smaller, the risks are larger, and the bottleneck is your review process before it is your editor. The single highest-leverage change is not adopting another agent — it is rebuilding review capacity. That means tightening PR size limits (the 154% growth number is your enemy), funding async reviewer rotation, and instrumenting churn — Larridin's benchmark has it at 7.1% in 2026 versus 3.3% pre-AI. Churn over 7% is a signal that your review queue is shipping code it shouldn't.
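
Instrumenting churn and PR size can start very small. The sketch below assumes churn means the share of added lines that are rewritten or deleted within 30 days; Larridin's exact definition is not given here, so treat that as a placeholder. The 400-line PR limit and the `PullRequest` record are hypothetical, and in practice the counts would come from your own version-control data.

```python
from dataclasses import dataclass

CHURN_ALERT = 0.07    # the 7% signal level mentioned above
MAX_PR_LINES = 400    # hypothetical size limit; tune per team


@dataclass
class PullRequest:
    number: int
    lines_added: int
    lines_reworked_30d: int  # added lines rewritten or deleted within 30 days (assumed definition)


def churn_rate(prs: list[PullRequest]) -> float:
    """Share of added lines that were reworked within the window."""
    added = sum(pr.lines_added for pr in prs)
    if added == 0:
        return 0.0
    return sum(pr.lines_reworked_30d for pr in prs) / added


def oversized(prs: list[PullRequest], limit: int = MAX_PR_LINES) -> list[int]:
    """Numbers of the PRs that exceed the size limit."""
    return [pr.number for pr in prs if pr.lines_added > limit]


if __name__ == "__main__":
    sample = [
        PullRequest(101, 120, 4),
        PullRequest(102, 950, 90),
        PullRequest(103, 300, 10),
    ]
    rate = churn_rate(sample)
    print(f"churn: {rate:.1%} (alert: {rate > CHURN_ALERT})")
    print(f"oversized PRs: {oversized(sample)}")
```

Tracking the two numbers together is the point of the design: a churn spike that coincides with oversized PRs points at review capacity rather than at the editor.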

If you lead an engineering organisation: the question to put to yourself this quarter is not "what is our AI adoption rate?" That race is over, and the number is somewhere between 84% and 91% depending on how you ask. The question is "which DORA metric is moving in the wrong direction?" DORA 2025 found delivery stability dropping 7.2% per 25-point increase in AI adoption. If your stability is degrading and your throughput is flat, you are inside the modal pattern, and the fix is investment in pre-merge automated testing, security scanning, and review tooling, not more seats on more agents.
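
That check can be written down in a few lines. The sketch below is illustrative only: the linear extrapolation of the 7.2%-per-25-points figure is an assumption, the 2% band used to call throughput "flat" is hypothetical, and both function names are invented for this example.

```python
STABILITY_DROP_PER_25_POINTS = 0.072  # DORA figure quoted above


def expected_stability_change(adoption_increase_points: float) -> float:
    """Linear extrapolation (an assumption) of the stability change
    for a given rise in AI adoption, in percentage points."""
    return -STABILITY_DROP_PER_25_POINTS * adoption_increase_points / 25.0


def in_modal_pattern(throughput_change: float, stability_change: float, flat_band: float = 0.02) -> bool:
    """True when throughput is roughly flat while stability is degrading."""
    return abs(throughput_change) <= flat_band and stability_change < 0.0


if __name__ == "__main__":
    print(f"expected stability change for +25 points: {expected_stability_change(25):.1%}")
    # Hypothetical quarter: throughput up 1%, stability down 5%.
    print(f"inside the modal pattern: {in_modal_pattern(0.01, -0.05)}")
```

Feeding it your own quarter-over-quarter throughput and stability changes gives a quick answer to the question above before any tooling decision is made.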

If you are not an engineer but are reading this because the topic is everywhere: the useful update is narrower than the headlines suggest. Software is being produced faster by individual developers in greenfield contexts on common stacks. That is real. Software is not getting cheaper, faster, or safer at the organisation level on any timeframe that has yet shown up in published data. The macro productivity dividend that AI optimists were forecasting for 2026 has not arrived, and the people closest to the work — engineering leaders running real DORA metrics on real teams — are the most measured group in the conversation. That should update your model of every other "AI is transforming X" story you read this year.

Uncertainty ledger

The Horizon benchmark might be the leading indicator. Claude Opus 4.6 reaching 719 minutes of coherent task-completion in February is a real number on a real bench. If the doubling cadence holds and that capability translates from bench to production codebases, the organisational ceiling may move in 2026 H2 rather than 2027. We do not yet know whether it will.
The 27% production-code figure is from late 2025 data. Six months is a long time at current rates. The Q3 2026 measurement could be materially higher. Watch DX's quarterly impact report — that is the cleanest data series in the field.
The METR slowdown finding has been contested. The newer cohort is closer to neutral than to the original 19%. The perception gap, however, has not been challenged successfully by any study we have read.
None of the productivity research yet covers the agentic, long-horizon, multi-PR workflows enabled by Opus 4.6-class models. Most published data was collected on pre-tipping-point tooling. The next twelve months of measurement will be the first that reflects what Orosz was describing.

Bottom Line

The model capability inflection Orosz called in January was real and has compounded. The individual-developer productivity dividend is real for the right work on the right stack. The team and organisation-level dividend is not yet in the data, and the structural reason is that AI coding tools accelerated the part of the software lifecycle that was never the bottleneck while making the actual bottleneck, review, materially worse. "AI writes almost all the code" is becoming literally true for a growing cohort of engineers. "AI delivers almost all the software" is a different claim, on a different curve, and the next eighteen months are when we will find out whether it is on the same curve at all.

Sources

Orosz, G. When AI writes almost all code, what happens to software engineering? The Pragmatic Engineer, 6 Jan 2026. Tier 2.
DX. Q4 2025 AI-Assisted Engineering Impact Report — 91% adoption, 26.9% AI-generated production code across 4.2M developers, 450+ companies. Tier 2.
DORA. 2025 State of AI-Assisted Software Development, Oct 2025 — 90% adoption, stability declining with adoption. Tier 2.
METR. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, Jul 2025; follow-up cohort 2026. Tier 1.
Faros AI. Engineering Metrics Across 10,000+ Developers, Jun 2025 — +98% PRs, +154% PR size, +91% review time, +9% bugs. Tier 2.
Black Duck. 2026 Open Source Security and Risk Analysis (OSSRA) Report — vulnerabilities per codebase +107% YoY. Tier 2.
Veracode. Security Testing of 100+ LLMs on 80 Coding Tasks — 45% introduce OWASP Top 10 vulnerabilities. Tier 2.
METR. Horizon Benchmark — Task-Length Doubling Time, Kwa et al. 2025–2026. Tier 1.
NBER Working Paper. AI Adoption and Firm Productivity: Survey of ~6,000 Executives, Feb 2026. Tier 1.
Apollo Global Management. T. Slok commentary, Feb 2026. Tier 2.
Stack Overflow. 2025 Developer Survey (Aug 2025; extended analysis Dec 2025). Tier 2.
Bain & Company. Software Development Lifecycle Analysis. Tier 2.
Larridin. AI Coding Benchmarks 2026 — churn 3.3% → 7.1%. Tier 3.
Digital Applied. AI Coding Tool Adoption 2026 — Q1 Developer Survey. Tier 3.