Share this post

Hundreds of tests. The app was still broken.
Let that sink in for a second. That one sentence is the punchline of six weeks of work, eighty-something experiments, and a slow, quiet realization that most of what you've been told about AI agent workflows is wrong in a very specific way.
I've been building software for over twenty years. I've shipped real systems with real users, and I've spent the last year trying to figure out where AI agents actually pull their weight in that process. Every other day someone publishes a new orchestration framework, a new graph-based agent topology, a pretty diagram showing how their methodical seven-stage pipeline crushes the "slop" of a oneshot.
I wanted to see the results, not the hype. So I ran the experiment.
I picked a real, non-trivial problem, wrote one rigorous plan with acceptance criteria, and pointed three different workflows at it in parallel: a oneshot using the Claude Agent SDK, a lighter agentic pipeline with code review, and a strict TDD red/green pipeline with three personas. I let them all build the same app from the same plan. Then I scored the outputs.
What came out the other side wasn't what I expected, and probably isn't what you expect either.
Here's what 80+ runs taught me, and why I think most teams chasing "more orchestration" are debugging the wrong thing.
The Experiment: Same Plan, Three Workflows
The thing I wanted to build wasn't a toy. I needed something real enough that the workflow choice would actually matter — so I built a Claude.md / AGENTS.md scorer. It takes an agent config file, grades it across multiple dimensions, exports a PDF, and runs a live evaluation that actually executes code to see whether the config produced better behavior in a real model run. Frontend, backend, scoring, code execution. Not trivial.
One honest note up front: there is no agreed-upon ground truth for what makes a "good" AGENTS.md or CLAUDE.md file. I asked engineers and AI researchers I trust what they look for. They had overlapping intuitions — clarity, scope, discoverability, examples — but no two answers were decisive. Pretty much like asking ten senior engineers what makes "good code." Some agreement. Lots of opinion. No oracle. Just like humans. 😉 That ambiguity made this experiment more interesting, not less, but it's something to hold in mind as you read the scores.
Before I let any agent touch it, I wrote a plan. Not "build me a scorer." A real plan, built with the /plan skill: a swarm of exploration agents crawled the codebase, surfaced constraints, and handed back a step-by-step breakdown — every step naming files, components, and what "done" looked like.
That last part is the part nobody talks about. Every step had acceptance criteria: specific, testable, about behavior the user could see — not metrics an agent could game. The plan wasn't perfect. There's no such thing. But it was a plan I could hand, unchanged, to three different workflows and know they were all aiming at the same target.
That's what I did. Same plan. Three workflows.
Oneshot. A single agent. Takes the plan, takes off, builds the whole thing in one continuous run. No phase boundaries, no review gate, no orchestration. Using the Claude Agent SDK so I can control the context window precisely.
Light agentic with review. A dev cycle where each step gets to figure out its own approach, but every step passes through a review gate that can send work back for fixes before it commits. Lighter touch than full TDD, but with quality gates baked in.
TDD red/green pipeline. The full ceremony. Three personas: a system architect, an SDE, and a test engineer. Each step starts red -- the test engineer writes failing tests first. Then the SDE writes the minimum code to go green. Then a review pass. Rigid, methodical, focused heavily on tests. The kind of workflow that looks like real engineering on a slide.
I kicked off all three in parallel against the same plan. Then I went and did other things while they built.

Why I Used the SDK (And You Probably Should Too)
Quick aside, because this is one of the things I actually want you to take away.
Using the Claude Agent SDK changed how I think about orchestration. The SDK exposes a query function that creates a new, fresh context window with an exact prompt every time you call it. That sounds boring. It's not.
It means I get repeatability. I get to control exactly what an agent sees and what it doesn't. I can kick off N tasks where each one starts with a clean slate and a precisely scoped prompt, instead of the all-too-common pattern of "one giant agent slowly drowning in its own context until it forgets what it's doing."
Even if you're doing nothing fancy -- even if you're running a oneshot -- the SDK is worth it. You decide when context resets. You decide what the agent knows. You stop guessing why the same prompt produced different code yesterday than today.
The downside, which I learned the hard way: the moment you start orchestrating across multiple query calls, you also start losing context on purpose. Each call is a fresh mind. So now you have a new problem -- how do steps share what they learned? How does the workflow remember what went wrong last time?
I solved that by wiring Obsidian into the loop as a shared knowledge vault. Each step writes notes. Each subsequent run reads them. It's a self-learning loop, primitive but real. And it turned out to matter more than the workflow shape itself. More on that in a minute.
What Actually Happened When the Workflows Finished
I came back, pulled up all three outputs side by side, and started scoring.
The oneshot finished in about seven minutes. It generated 64 tests. When I ran the comprehensive scorer against the resulting Claude.md, it gave it 8.15 out of 10. The simple version actually scored slightly higher in some dimensions. The frontend looked clean. It worked.
Did it nail every detail? No. The placement of the API key field was a little odd -- it ended up tucked up at the top of the page rather than next to the live evaluation widget, where it logically belongs. Cosmetic stuff. The kind of thing you'd push back in a code review.
The light agentic workflow with review came out somewhere in between. It scored a 77. It was more on-brand visually. It correctly differentiated comprehensive from basic configs. It ran the live evaluation properly. The review gate caught real polish issues -- things the oneshot missed -- and the result felt more deliberate.
Then there was the TDD pipeline.
Earlier in the experiment series, the TDD workflow produced 282 tests and a broken app. I tightened the plan and ran it again. This time it produced 610-plus tests. The output was more dispersed. The API key still landed in a weird place. Some scorers gave it a 95 on certain dimensions, but the deeper I looked at the actual behavior of the app, the more I noticed something: the test-first agent had written tests that mocked everything. The implement agent had then made those mocks pass. Neither one cared whether the app actually worked end to end. They cared about their phase.
This is the part that should make you uncomfortable if you're sold on TDD-for-agents as a default. The tests passed. The phases completed. The review gate signed off. And the user-visible behavior was worse than the seven-minute oneshot.
I want to be very precise here, because the easy version of this story is "TDD bad, oneshot good" and that's not what I'm saying. The TDD output had real strengths -- when you focus an agent down on a specific task with a specific persona, it does think harder about that task. It writes more tests. It catches more edge cases on the dimensions it's looking at. But it also tends to generate way more code and way more tests than you need, and those tests don't always verify the thing that actually matters: does the app work for the user?
I tested this hypothesis with one more variant -- a leaner TDD workflow that explicitly limited the number of tests and focused them on the acceptance criteria from the plan. I figured if the bloat was the problem, constraining it would help.
It didn't, conclusively. The lean version generated less stuff but didn't produce noticeably better code. The signal just isn't there to justify the orchestration cost yet.
The Numbers That Hit Hardest
If I strip away all the workflow philosophy and just look at the data from the runs, here's what the spread looks like:
| Dimension | Oneshot | Light Agentic | TDD Red/Green |
|---|---|---|---|
| Time to complete | ~7 minutes | Longer, multi-phase | Much longer |
| Tests generated | 64 | Moderate, plus review-driven adds | 610+ |
| Quality gates | None | Review gate per step | Test-first + persona reviews |
| Score (representative run) | 8.15 / 10 | 77 / 100 | Mixed (95 on some dims, broken behavior on others) |
| Architecture polish | Good | Better | Variable |
| User-visible "does it work" | Yes | Yes | Often no until heavy fixes |
The honest read on this table: the gap between the workflows isn't nearly as big as the orchestration marketing suggests. And when there was a gap, you could usually close it by giving the oneshot a small follow-up nudge -- "fix the API key placement, tighten this section" -- in seconds. The TDD pipeline took an order of magnitude longer to get to the same place, and sometimes never got there at all.
There's a place for review gates. There's a place for orchestration. But "hundreds of tests" is not a quality metric. It's a cost.
Why More Orchestration Doesn't Mean More Quality
Let me be fair to orchestration before I criticize it.
Forcing an agent into a single phase — "write tests for this, and only this" — does focus it. It controls context. It limits scope creep. Those are real, valuable things, especially when you have a precise quality bar to hit, a compliance gate to clear, or a step that has to behave the same way every single time. If you need an agent to do exactly one thing very carefully, orchestration is a tool that helps. That's not nothing, and I want to keep using it where it earns its keep.
The trade-off is what nobody puts on the slide. Here's my best explanation for what I saw, and I'd love to hear yours.
When you force agents into rigid phases, they optimize for the phase, not for the outcome. The test-first agent gets a single instruction: write tests for this. Its job is done when there are tests. Whether those tests verify real behavior or mocked behavior is a question nobody asked it.
The implementation agent gets the next instruction: make these tests pass. Its job is done when they're green. Whether the app works end-to-end isn't its concern.
The review agent gets a diff and a checklist. It catches the things on its checklist -- naming conventions, docstring formatting, missing types. It doesn't catch the things that aren't on its checklist, which is usually the actual bugs.
None of these agents is being lazy. They're doing exactly what you told them to do. The problem is that the orchestration removed their strongest capability: looking at the whole picture and noticing that the thing as a system doesn't actually work.
A oneshot agent with a good plan keeps the whole thing in mind. It writes a feature, runs the app, sees something off, fixes it, moves on. It behaves more like a senior engineer thinking through a problem than a junior dev following a Jira board.
That's the irony. We built rigid pipelines to make AI agents behave more like serious engineers, and we ended up making them behave more like the worst version of process-bound contractors.
The Six Things That Actually Mattered

After all the runs, all the variants, all the rebuilds, here's what I'm actually walking away believing. None of these are revolutionary. All of them are the things I would have undervalued if I hadn't burned six weeks proving them to myself.
1. The plan beats the pipeline
This is the thing I'd put on a slide if I only had one to show.
A strong plan with real acceptance criteria yielded better output across every workflow I tried. No exceptions. The plans I wrote with the /plan skill — the ones with exploration agents, file-level component breakdowns, and explicit done-criteria — consistently produced better results than vague "build me a scorer" prompts, regardless of what pipeline I ran them through.
The orchestration shape was secondary. The plan was primary. The same oneshot that produced mediocre output from a thin prompt produced solid output from a strong plan. The same TDD pipeline that struggled with a vague target produced cleaner work when the acceptance criteria were specific.
If you remember one thing from this post: invest in the plan, not the pipeline. Most teams have it backwards. They spend weeks tuning agent topologies and minutes writing prompts. The leverage is on the other side.
2. Ground truth and acceptance criteria are non-negotiable
Every score I got from the scorer varied -- a lot -- because there is no agreed-upon ground truth for what makes a "good" AGENTS.md file. That same problem exists in your codebase right now, in every workflow you're orchestrating. If you can't define done crisply, no amount of agent ceremony will hide that. You'll just get high-quality output for the wrong target.
Acceptance criteria are the contract between you and the agent. Make them specific. Make them testable. Make them about behavior the user can observe, not metrics the agent can game.
3. Models change. Your workflow has to absorb that.
The models are moving fast, and "new model" doesn't always mean "better outcome for your pipeline."
I ran the first big round of experiments on one model. A new one dropped before I was done. I re-ran the same plan through the same workflows on the new model expecting an across-the-board lift. Some things got better. Some things got worse. The same TDD pipeline that produced a workable app on the previous model produced more tests, more code, and more breakage on the new one. Same plan, same orchestration, smarter model — and the output regressed in a measurable way.
I want to be careful here. This isn't a "newer model is worse" story. It's a more useful one: a more capable model amplifies whatever your workflow tells it to do. If your pipeline rewards more tests and tighter focus on the phase, a smarter model will give you more of that, including the parts you didn't want. Capability without verification is a louder version of your existing problem.
So factor it in. The model under your workflow will keep changing. The workflow has to assume that and verify outcomes against user-visible behavior, not against the artifacts the agent produces. Every time I tried to cleverly constrain the model into doing what I thought was the right thing, I ended up worse off than when I gave it the problem and verified the result.
Spend your engineering time on verification — tests that test real behavior, evals that exercise real flows, observability that shows you what actually happened. Don't spend it on cages, and don't expect a model upgrade to fix a workflow that's pointed at the wrong target.
4. Use the SDK for control
Even if you're not building elaborate pipelines, the Agent SDK is worth it. Repeatability matters. Fresh context per query matters. Knowing exactly what your agent saw, every single time, matters. You stop running magic-show experiments and start running real ones.
The SDK is the thing that turned my "this run was great, the next one was awful" frustration into actual measurable comparisons. That alone was worth the cost of switching.
5. Orchestration is expensive. Plan for that.
If you're going to invest in orchestration -- and there are real reasons to do that, especially when you have hard quality gates or compliance constraints -- expect to invest a lot of time. These things are not lightweight. The 80+ experiments weren't because I was being thorough for fun. Each iteration surfaced something I hadn't seen, and I had to chase it down. Pipelines have their own bugs. Personas drift. Context gets lost between steps. You're not just shipping an app anymore, you're shipping the factory that ships the app.
Be honest about whether your team has the time and stomach for that. Most teams don't, and most teams shouldn't pretend they do.
6. Don't believe the hype. Measure.
The single most useful thing I built during this whole experiment wasn't the scorer or the pipeline. It was the discipline of running the same plan through multiple workflows and measuring the outputs. The graphics on Twitter and the demo videos are not your data. Your data is your data.
Pick a real problem you've already solved by hand. Run it through whatever workflow someone is selling you. Compare what comes out to what you would have shipped. If the workflow doesn't beat your baseline, it doesn't matter how pretty the diagram is.
Where I Net Out
I'm not anti-orchestration. I built one. I'm going to keep building them, because there are absolutely cases -- regulated environments, code that has to pass a known compliance bar, multi-day workflows where context needs to survive sleep -- where the right pipeline pays for itself.
But for most teams, most of the time, the right move today is much simpler than the influencer thread suggests:
- Write a real plan with real acceptance criteria.
- Use the SDK so you can control context and reproduce results.
- Let the agent work. Don't carve it up into ceremonial phases.
- Verify the output against the user-visible behavior, not the test count.
- Add quality gates only when you have evidence you need them.
Stop treating AI agents like junior devs who need a process to keep them from breaking things. Treat them like reasoning engines that need a clear problem, real context, and a way for you to verify the result. Then get out of the way.
You'll be surprised how often "less workflow" beats "more workflow" -- and how much of your engineering time you get back to spend on the things only you can do.
I'm going to keep running these experiments and sharing what I find. I'll also be open-sourcing everything from this round so you can run your own version, against your own problems, with your own data.
Run It Yourself
The whole point of this post is that your data is your data. So here are three live versions of the scorer I built — same job, different generations of the workflow that produced them. They're up. They take an AGENTS.md or CLAUDE.md file. Try them with your own configs. Compare the outputs. See where they agree, where they don't, and which one feels closer to how you would judge a good config.
- Oneshot build — claude-scorer-oneshot-v7-latest.fly.dev — single Claude agent, ~8 minutes, plan in, app out.
- Light agentic workflow with review — claude-scorer-light-v7-latest.fly.dev — the leaner pipeline. Agents implement freely, with a review gate that can send work back before commit.
- TDD-style pipeline — claude-scorer-tdd-v7-latest.fly.dev — the cautionary one. More tests than code, more code than software. Live, as it shipped, scars and all.
Run an AGENTS.md you've actually shipped through all three. The disagreements are where the interesting questions are.
That's the only experiment that ends up mattering.
Key Takeaways
- A strong plan with crisp acceptance criteria beat workflow shape, every time.
- TDD red/green pipelines for agents produced more tests and more code, not better software.
- Review gates caught style issues but missed runtime bugs the oneshot would have noticed.
- The Claude Agent SDK gives you the one thing orchestration desperately needs: repeatability.
- Heavy orchestration is a real engineering investment -- treat it like building a platform, not turning on a feature.
- Verification beats control. Measure outcomes, not ceremony.
Watch the full walkthrough on YouTube
Open-source repo on GitHub — full pipeline, plans, runs, and Obsidian vault used in this experiment
Three Experiment Results — Live Demos
- Oneshot scorer — single Claude agent, ~8 min, plan in / app out
- Light agentic scorer — leaner pipeline with one review gate
- TDD-pipeline scorer — three personas, red/green, scars and all
Previous Field Notes