A senior engineer's desk with one elegant handwritten plan in the foreground, while a glass office wall behind is completely covered with chaotic flowcharts and process diagrams. The contrast: one calm page in focus vs a wall of process noise.

Hundreds of tests. The app was still broken.

Let that sink in for a second. That one sentence is the punchline of six weeks of work, eighty-something experiments, and a slow, quiet realization that most of what you've been told about AI agent workflows is wrong in a very specific way.

I've been building software for over twenty years. I've shipped real systems with real users, and I've spent the last year trying to figure out where AI agents actually pull their weight in that process. Every other day someone publishes a new orchestration framework, a new graph-based agent topology, a pretty diagram showing how their methodical seven-stage pipeline crushes the "slop" of a oneshot.

I wanted to see the results, not the hype. So I ran the experiment.

I picked a real, non-trivial problem, wrote one rigorous plan with acceptance criteria, and pointed three different workflows at it in parallel: a oneshot using the Claude Agent SDK, a lighter agentic pipeline with code review, and a strict TDD red/green pipeline with three personas. I let them all build the same app from the same plan. Then I scored the outputs.

What came out the other side wasn't what I expected, and probably isn't what you expect either.

Here's what 80+ runs taught me, and why I think most teams chasing "more orchestration" are debugging the wrong thing.

The Experiment: Same Plan, Three Workflows

The thing I wanted to build wasn't a toy. I needed something real enough that the workflow choice would actually matter — so I built a CLAUDE.md / AGENTS.md scorer. It takes an agent config file, grades it across multiple dimensions, exports a PDF, and runs a live evaluation that actually executes code to see whether the config produced better behavior in a real model run. Frontend, backend, scoring, code execution. Not trivial.

One honest note up front: there is no agreed-upon ground truth for what makes a "good" AGENTS.md or CLAUDE.md file. I asked engineers and AI researchers I trust what they look for. They had overlapping intuitions — clarity, scope, discoverability, examples — but no two answers were decisive. Pretty much like asking ten senior engineers what makes "good code." Some agreement. Lots of opinion. No oracle. Just like humans. 😉 That ambiguity made this experiment more interesting, not less, but it's something to hold in mind as you read the scores.

Before I let any agent touch it, I wrote a plan. Not "build me a scorer." A real plan, built with the /plan skill: a swarm of exploration agents crawled the codebase, surfaced constraints, and handed back a step-by-step breakdown — every step naming files, components, and what "done" looked like.

That last part is the part nobody talks about. Every step had acceptance criteria: specific, testable, about behavior the user could see — not metrics an agent could game. The plan wasn't perfect. There's no such thing. But it was a plan I could hand, unchanged, to three different workflows and know they were all aiming at the same target.

That's what I did. Same plan. Three workflows.

Oneshot. A single agent. Takes the plan, takes off, builds the whole thing in one continuous run. No phase boundaries, no review gate, no orchestration. Using the Claude Agent SDK so I can control the context window precisely.

Light agentic with review. A dev cycle where each step gets to figure out its own approach, but every step passes through a review gate that can send work back for fixes before it commits. Lighter touch than full TDD, but with quality gates baked in.

TDD red/green pipeline. The full ceremony. Three personas: a system architect, an SDE, and a test engineer. Each step starts red -- the test engineer writes failing tests first. Then the SDE writes the minimum code to go green. Then a review pass. Rigid, methodical, focused heavily on tests. The kind of workflow that looks like real engineering on a slide.

I kicked off all three in parallel against the same plan. Then I went and did other things while they built.

Three identical bonsai trees on a shelf, each surrounded by a wildly different amount of gardening apparatus — from one pair of shears to a workbench buried in tools. Same tree, very different toolbox.

Why I Used the SDK (And You Probably Should Too)

Quick aside, because this is one of the things I actually want you to take away.

Using the Claude Agent SDK changed how I think about orchestration. The SDK exposes a query function that creates a new, fresh context window with an exact prompt every time you call it. That sounds boring. It's not.

It means I get repeatability. I get to control exactly what an agent sees and what it doesn't. I can kick off N tasks where each one starts with a clean slate and a precisely scoped prompt, instead of the all-too-common pattern of "one giant agent slowly drowning in its own context until it forgets what it's doing."

Even if you're doing nothing fancy -- even if you're running a oneshot -- the SDK is worth it. You decide when context resets. You decide what the agent knows. You stop guessing why the same prompt produced different code yesterday than today.
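Here's roughly what that scoping looks like in practice. This is a minimal sketch, not the SDK's API: `build_prompt` and `run_step` are hypothetical helpers, and `call_model` stands in for a real `query` call, where each invocation opens a fresh context seeded only with the prompt you hand it.

```python
def build_prompt(plan_step: str, acceptance: list[str], notes: list[str]) -> str:
    """Compose exactly what this step's agent will see -- no leftover context."""
    lines = [f"Task: {plan_step}", "Acceptance criteria:"]
    lines += [f"- {c}" for c in acceptance]
    if notes:
        lines.append("Notes from earlier runs:")
        lines += [f"- {n}" for n in notes]
    return "\n".join(lines)

def run_step(plan_step, acceptance, notes, call_model):
    # call_model stands in for the SDK's query(): a fresh context window
    # seeded only with this prompt, so the run is repeatable.
    prompt = build_prompt(plan_step, acceptance, notes)
    return call_model(prompt)
```

The payoff is that two runs of the same plan step with the same notes start from byte-identical prompts, which is what makes before-and-after comparisons meaningful at all.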

The downside, which I learned the hard way: the moment you start orchestrating across multiple query calls, you also start losing context on purpose. Each call is a fresh mind. So now you have a new problem -- how do steps share what they learned? How does the workflow remember what went wrong last time?

I solved that by wiring Obsidian into the loop as a shared knowledge vault. Each step writes notes. Each subsequent run reads them. It's a self-learning loop, primitive but real. And it turned out to matter more than the workflow shape itself. More on that in a minute.
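Since an Obsidian vault is just a folder of markdown files on disk, the loop can be this simple. A minimal sketch under that assumption; `write_note` and `read_notes` are my names, not part of any SDK:

```python
from pathlib import Path
from datetime import datetime, timezone

def write_note(vault: Path, step: str, lesson: str) -> Path:
    """Persist one lesson from one step as a timestamped markdown note."""
    vault.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    note = vault / f"{stamp}-{step}.md"
    note.write_text(f"# {step}\n\n{lesson}\n", encoding="utf-8")
    return note

def read_notes(vault: Path) -> str:
    """Concatenate prior lessons so the next fresh-context run can see them."""
    if not vault.exists():
        return ""
    parts = [p.read_text(encoding="utf-8") for p in sorted(vault.glob("*.md"))]
    return "\n---\n".join(parts)
```

Each step calls `write_note` when it learns something; the next run's prompt gets `read_notes(vault)` appended. That's the whole self-learning loop.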

What Actually Happened When the Workflows Finished

I came back, pulled up all three outputs side by side, and started scoring.

The oneshot finished in about seven minutes. It generated 64 tests. When I ran the comprehensive scorer against the resulting CLAUDE.md, it gave it 8.15 out of 10. The simple version actually scored slightly higher in some dimensions. The frontend looked clean. It worked.

Did it nail every detail? No. The placement of the API key field was a little odd -- it ended up tucked at the top of the page rather than next to the live evaluation widget, where it logically belongs. Cosmetic stuff. The kind of thing you'd push back on in a code review.

The light agentic workflow with review came out somewhere in between. It scored 77 out of 100. It was more on-brand visually. It correctly differentiated comprehensive from basic configs. It ran the live evaluation properly. The review gate caught real polish issues -- things the oneshot missed -- and the result felt more deliberate.

Then there was the TDD pipeline.

Earlier in the experiment series, the TDD workflow produced 282 tests and a broken app. I tightened the plan and ran it again. This time it produced 610-plus tests. The output was more sprawling. The API key still landed in a weird place. Some scorers gave it a 95 on certain dimensions, but the deeper I looked at the actual behavior of the app, the more I noticed something: the test-first agent had written tests that mocked everything. The implementation agent had then made those mocks pass. Neither one cared whether the app actually worked end to end. They cared about their phase.

This is the part that should make you uncomfortable if you're sold on TDD-for-agents as a default. The tests passed. The phases completed. The review gate signed off. And the user-visible behavior was worse than the seven-minute oneshot.

I want to be very precise here, because the easy version of this story is "TDD bad, oneshot good" and that's not what I'm saying. The TDD output had real strengths -- when you focus an agent down on a specific task with a specific persona, it does think harder about that task. It writes more tests. It catches more edge cases on the dimensions it's looking at. But it also tends to generate way more code and way more tests than you need, and those tests don't always verify the thing that actually matters: does the app work for the user?

I tested this hypothesis with one more variant -- a leaner TDD workflow that explicitly limited the number of tests and focused them on the acceptance criteria from the plan. I figured if the bloat was the problem, constraining it would help.

It didn't, at least not conclusively. The lean version generated less stuff but didn't produce noticeably better code. The signal just isn't there to justify the orchestration cost yet.

The Numbers That Hit Hardest

If I strip away all the workflow philosophy and just look at the data from the runs, here's what the spread looks like:

| Dimension | Oneshot | Light Agentic | TDD Red/Green |
| --- | --- | --- | --- |
| Time to complete | ~7 minutes | Longer, multi-phase | Much longer |
| Tests generated | 64 | Moderate, plus review-driven adds | 610+ |
| Quality gates | None | Review gate per step | Test-first + persona reviews |
| Score (representative run) | 8.15 / 10 | 77 / 100 | Mixed (95 on some dims, broken behavior on others) |
| Architecture polish | Good | Better | Variable |
| User-visible "does it work" | Yes | Yes | Often no until heavy fixes |

The honest read on this table: the gap between the workflows isn't nearly as big as the orchestration marketing suggests. And when there was a gap, you could usually close it by giving the oneshot a small follow-up nudge -- "fix the API key placement, tighten this section" -- in seconds. The TDD pipeline took an order of magnitude longer to get to the same place, and sometimes never got there at all.

There's a place for review gates. There's a place for orchestration. But "hundreds of tests" is not a quality metric. It's a cost.

Why More Orchestration Doesn't Mean More Quality

Let me be fair to orchestration before I criticize it.

Forcing an agent into a single phase — "write tests for this, and only this" — does focus it. It controls context. It limits scope creep. Those are real, valuable things, especially when you have a precise quality bar to hit, a compliance gate to clear, or a step that has to behave the same way every single time. If you need an agent to do exactly one thing very carefully, orchestration is a tool that helps. That's not nothing, and I want to keep using it where it earns its keep.

The trade-off is what nobody puts on the slide. Here's my best explanation for what I saw, and I'd love to hear yours.

When you force agents into rigid phases, they optimize for the phase, not for the outcome. The test-first agent gets a single instruction: write tests for this. Its job is done when there are tests. Whether those tests verify real behavior or mocked behavior is a question nobody asked it.

The implementation agent gets the next instruction: make these tests pass. Its job is done when they're green. Whether the app works end-to-end isn't its concern.

The review agent gets a diff and a checklist. It catches the things on its checklist -- naming conventions, docstring formatting, missing types. It doesn't catch the things that aren't on its checklist, which is usually the actual bugs.

None of these agents is being lazy. They're doing exactly what you told them to do. The problem is that the orchestration removed their strongest capability: looking at the whole picture and noticing that the thing as a system doesn't actually work.
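A toy reconstruction of that failure mode (not the actual experiment code) makes it concrete: the phase-scoped test mocks the scorer away, goes green, and never notices that the real path was never wired up.

```python
from unittest import mock

def score_config(path: str) -> float:
    # The real scorer was never implemented -- but no phase ever calls it.
    raise NotImplementedError("real scorer never wired in")

def scorer_page(path: str, scorer=score_config) -> str:
    return f"Score: {scorer(path):.2f}"

def test_scorer_page_phase_scoped():
    # The test-first agent mocks the dependency away...
    fake = mock.Mock(return_value=8.15)
    # ...so the page "works", the test passes, and the phase goes green.
    assert scorer_page("CLAUDE.md", scorer=fake) == "Score: 8.15"
```

Run the test and it passes. Call `scorer_page("CLAUDE.md")` for real and it blows up. Every phase did its job, and the system still doesn't work.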

A oneshot agent with a good plan keeps the whole thing in mind. It writes a feature, runs the app, sees something off, fixes it, moves on. It behaves more like a senior engineer thinking through a problem than a junior dev following a Jira board.

That's the irony. We built rigid pipelines to make AI agents behave more like serious engineers, and we ended up making them behave more like the worst version of process-bound contractors.

The Six Things That Actually Mattered

Six distinct keys hung on six iron nails on a weathered workshop wall — each different in size, age, and metal, each given its own place. The six things that earned their keep after eighty experiments.

After all the runs, all the variants, all the rebuilds, here's what I'm actually walking away believing. None of these are revolutionary. All of them are the things I would have undervalued if I hadn't burned six weeks proving them to myself.

1. The plan beats the pipeline

This is the thing I'd put on a slide if I only had one to show.

A strong plan with real acceptance criteria yielded better output across every workflow I tried. No exceptions. The plans I wrote with the /plan skill — the ones with exploration agents, file-level component breakdowns, and explicit done-criteria — consistently produced better results than vague "build me a scorer" prompts, regardless of what pipeline I ran them through.

The orchestration shape was secondary. The plan was primary. The same oneshot that produced mediocre output from a thin prompt produced solid output from a strong plan. The same TDD pipeline that struggled with a vague target produced cleaner work when the acceptance criteria were specific.

If you remember one thing from this post: invest in the plan, not the pipeline. Most teams have it backwards. They spend weeks tuning agent topologies and minutes writing prompts. The leverage is on the other side.

2. Ground truth and acceptance criteria are non-negotiable

Every score I got from the scorer varied -- a lot -- because there is no agreed-upon ground truth for what makes a "good" AGENTS.md file. That same problem exists in your codebase right now, in every workflow you're orchestrating. If you can't define done crisply, no amount of agent ceremony will hide that. You'll just get high-quality output for the wrong target.

Acceptance criteria are the contract between you and the agent. Make them specific. Make them testable. Make them about behavior the user can observe, not metrics the agent can game.
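One way to make that contract executable, sketched with names I made up (nothing here comes from the /plan skill): each criterion pairs a user-visible description with a check that runs against the live app, not against the agent's own tests.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AcceptanceCriterion:
    description: str           # behavior the user can observe
    check: Callable[[], bool]  # executed against the running app, not its tests

def verify(criteria: list[AcceptanceCriterion]) -> list[str]:
    """Return the description of every criterion that fails."""
    return [c.description for c in criteria if not c.check()]
```

The list `verify` returns is what goes back to the agent as the fix list -- specific, testable, and impossible to satisfy by writing more mocks.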

3. Models change. Your workflow has to absorb that.

The models are moving fast, and "new model" doesn't always mean "better outcome for your pipeline."

I ran the first big round of experiments on one model. A new one dropped before I was done. I re-ran the same plan through the same workflows on the new model expecting an across-the-board lift. Some things got better. Some things got worse. The same TDD pipeline that produced a workable app on the previous model produced more tests, more code, and more breakage on the new one. Same plan, same orchestration, smarter model — and the output regressed in a measurable way.

I want to be careful here. This isn't a "newer model is worse" story. It's a more useful one: a more capable model amplifies whatever your workflow tells it to do. If your pipeline rewards more tests and tighter focus on the phase, a smarter model will give you more of that, including the parts you didn't want. Capability without verification is a louder version of your existing problem.

So factor it in. The model under your workflow will keep changing. The workflow has to assume that and verify outcomes against user-visible behavior, not against the artifacts the agent produces. Every time I tried to cleverly constrain the model into doing what I thought was the right thing, I ended up worse off than when I gave it the problem and verified the result.

Spend your engineering time on verification — tests that test real behavior, evals that exercise real flows, observability that shows you what actually happened. Don't spend it on cages, and don't expect a model upgrade to fix a workflow that's pointed at the wrong target.

4. Use the SDK for control

Even if you're not building elaborate pipelines, the Agent SDK is worth it. Repeatability matters. Fresh context per query matters. Knowing exactly what your agent saw, every single time, matters. You stop running magic-show experiments and start running real ones.

The SDK is the thing that turned my "this run was great, the next one was awful" frustration into actual measurable comparisons. That alone was worth the cost of switching.

5. Orchestration is expensive. Plan for that.

If you're going to invest in orchestration -- and there are real reasons to do that, especially when you have hard quality gates or compliance constraints -- expect to invest a lot of time. These things are not lightweight. The 80+ experiments weren't because I was being thorough for fun. Each iteration surfaced something I hadn't seen, and I had to chase it down. Pipelines have their own bugs. Personas drift. Context gets lost between steps. You're not just shipping an app anymore, you're shipping the factory that ships the app.

Be honest about whether your team has the time and stomach for that. Most teams don't, and most teams shouldn't pretend they do.

6. Don't believe the hype. Measure.

The single most useful thing I built during this whole experiment wasn't the scorer or the pipeline. It was the discipline of running the same plan through multiple workflows and measuring the outputs. The graphics on Twitter and the demo videos are not your data. Your data is your data.

Pick a real problem you've already solved by hand. Run it through whatever workflow someone is selling you. Compare what comes out to what you would have shipped. If the workflow doesn't beat your baseline, it doesn't matter how pretty the diagram is.
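The harness doesn't need to be fancy. A sketch of the core comparison, assuming you've normalized every workflow's score to one scale and scored your hand-built baseline the same way:

```python
def compare_to_baseline(baseline_score: float,
                        workflow_scores: dict[str, float]) -> dict[str, float]:
    """Positive deltas mean the workflow beat what you'd have shipped by hand."""
    return {name: round(score - baseline_score, 2)
            for name, score in workflow_scores.items()}

# e.g. compare_to_baseline(8.0, {"oneshot": 8.15, "tdd_red_green": 6.5})
# -> {"oneshot": 0.15, "tdd_red_green": -1.5}
```

Anything that can't show a positive delta against your own baseline doesn't earn its orchestration cost, no matter how pretty the diagram is.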

Where I Net Out

I'm not anti-orchestration. I built one. I'm going to keep building them, because there are absolutely cases -- regulated environments, code that has to pass a known compliance bar, multi-day workflows where context needs to survive sleep -- where the right pipeline pays for itself.

But for most teams, most of the time, the right move today is much simpler than the influencer thread suggests:

Stop treating AI agents like junior devs who need a process to keep them from breaking things. Treat them like reasoning engines that need a clear problem, real context, and a way for you to verify the result. Then get out of the way.

You'll be surprised how often "less workflow" beats "more workflow" -- and how much of your engineering time you get back to spend on the things only you can do.

I'm going to keep running these experiments and sharing what I find. I'll also be open-sourcing everything from this round so you can run your own version, against your own problems, with your own data.

Run It Yourself

The whole point of this post is that your data is your data. So here are three live versions of the scorer I built — same job, different generations of the workflow that produced them. They're up. They take an AGENTS.md or CLAUDE.md file. Try them with your own configs. Compare the outputs. See where they agree, where they don't, and which one feels closer to how you would judge a good config.

Run an AGENTS.md you've actually shipped through all three. The disagreements are where the interesting questions are.

That's the only experiment that ends up mattering.

Key Takeaways

Watch the full walkthrough on YouTube

Open-source repo on GitHub — full pipeline, plans, runs, and Obsidian vault used in this experiment

Three Experiment Results — Live Demos
