
Nine CLAUDE.md variants. Three task types. Twenty-seven builds per experiment. Three LLM judges. Karpathy's at 110k stars, OpenAI's AGENTS.md at 80k, HumanLayer's, shanraisshan's, mine, and an empty file as control.
The result wasn't what I expected — and one finding contradicts the paper saying CLAUDE.md choice doesn't matter.
Here's how the experiment was designed, what it found, and the single variable that decides whether your CLAUDE.md is doing real work or doing nothing at all.
The Setup: Same Variants, Three Levels of Prompt Specificity
The trap in any CLAUDE.md eval is that there's no ground truth. There's no "good" config the way there's a "good" PR. So I did the next best thing: hold the variants constant, vary the task, and watch where the variance goes.
I picked nine CLAUDE.md variants — most of which you've probably copied from at some point:
- v0 — empty file (control)
- v1 — Karpathy's rules-only md (110k stars)
- v2 — agents_light, 57 lines (the one that refuses work — more on that below)
- v3 — agents_medium_autonomous, 147 lines
- v4 — agents_full1027, 1,353 lines (the kitchen sink)
- v5 — medium merged with Karpathy, ~196 lines
- v6 — HumanLayer's CLAUDE.md (10.7k stars)
- v7 — OpenAI's Codex AGENTS.md (80k stars)
- v8 — shanraisshan's claude-code-best-practice (51k stars)
Three task types, each at a different level of prompt specificity. Every build was a fresh git worktree with the variant's CLAUDE.md committed as the only configuration. Every output was harvested as a diff, smoke-tested by actually booting the app, and judged by three LLMs in parallel — Haiku, Sonnet, Opus.
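For anyone who wants to reproduce the shape of the harness, here's a minimal sketch of one build cell, assuming Claude Code's headless `-p` mode. The function name, worktree layout, and exact CLI flags are illustrative stand-ins rather than the scripts I actually ran; judging and smoke-testing consume the returned diff separately.

```python
import pathlib
import subprocess

def run_cell(variant_md: str, prompt: str, run_id: str) -> str:
    """One cell: fresh worktree, one CLAUDE.md variant, one prompt, harvest the diff."""
    workdir = pathlib.Path("worktrees") / run_id

    # Fresh git worktree on its own branch so every run starts from the same baseline.
    subprocess.run(["git", "worktree", "add", "-b", run_id, str(workdir), "main"], check=True)

    # The variant's CLAUDE.md is committed as the only configuration on top of the baseline.
    (workdir / "CLAUDE.md").write_text(variant_md)
    subprocess.run(["git", "add", "CLAUDE.md"], cwd=workdir, check=True)
    subprocess.run(["git", "commit", "-m", f"add CLAUDE.md variant for {run_id}"], cwd=workdir, check=True)

    # Run the agent non-interactively. Permission flags are omitted and may differ by version.
    subprocess.run(["claude", "-p", prompt], cwd=workdir, check=True)

    # Harvest everything the agent wrote (including new files) as one diff for the judges.
    subprocess.run(["git", "add", "-A"], cwd=workdir, check=True)
    diff = subprocess.run(["git", "diff", "--cached"], cwd=workdir,
                          capture_output=True, text=True, check=True)
    return diff.stdout
```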
Experiment 1 — Synthetic Cell Tests
The warm-up. Small, isolated tasks — reproducing a previous commit, fixing a bug, adding a regression test. The kind of work you'd hand a junior engineer with a single sentence of context.
I won't dwell on this one — it's calibration, and the video walks the cell runs in detail. The takeaway is small: at this scale, the agent has very little room to drift, so the spread is small and the rankings are noisy. It set my expectation that "CLAUDE.md probably doesn't matter much" — which made the next experiment's result all the more unsettling at first.
The artifact that survived everything that came after was a workflow rule: when you fix a bug, add a regression test. Mine did that, almost unprompted, and it was the single most visible quality improvement across the entire experiment. Tiny rule, big effect. Hold that thought.
Experiment 2 — Executing a Well-Crafted Plan
The real test, or so I thought. A real-estate scraper built from a 237-line spec. The plan named the files, the libraries, the data model, and the acceptance criteria. Three runs per variant, nine variants. Twenty-seven builds.
The spread across nine variants was 0.43 on a 0–3 scale. Tiny. The variance within a variant — same CLAUDE.md, three runs — was actually larger than the variance between different CLAUDE.mds. Karpathy's v1 came out #1, but the gap between #1 and #9 was indistinguishable from run-to-run noise. If you squinted at the heatmap — you can see it in the video around the four-minute mark — every variant looked the same.
This is the result that matches the paper. When the plan is 237 lines of "use BeautifulSoup, store in SQLite, the schema is this, the routes are these," there isn't much left for the CLAUDE.md to actually do. The plan was eating the md's lunch.
I sat with it for a day. Then I asked the question that turned the experiment around: what if I just hadn't tested the regime where CLAUDE.md is supposed to matter? The whole pitch of a CLAUDE.md is that it shapes behavior when behavior isn't specified by the user. If the user specified all the behavior, of course the md doesn't move the needle. I'd tested the wrong thing.
Experiment 3 — Vibe Coding
So I ran the freeform shop build. One sentence: "Build me an online shop with a web interface for selling shoes. Pick reasonable defaults. You have full approval to proceed — do not ask clarifying questions; build it." Nine variants. Three runs. Twenty-seven more builds. Deliberately mean: tell the agent to build something underspecified, then explicitly disable the "let me clarify first" escape hatch.
The spread doubled.
| Task | Prompt size | Cross-variant spread | CV | Functional pass rate |
|---|---|---|---|---|
| Planned scraper | 237-line spec | 0.43 | ~9% | High |
| Freeform shop | One sentence | 0.87 | ~25% | 26 of 27 |
Same nine variants. Same judges. The only thing that changed was the prompt. The coefficient of variation went from "this looks like noise" to "this is a real signal."
CLAUDE.md's effect scales inversely with prompt specificity. When your prompt does all the work, the md has nothing to grip. When your prompt is one sentence, the md is the only thing telling the agent what good looks like.
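For concreteness, here's roughly how the spread and CV numbers fall out of a 9-variant × 3-run score matrix. The scores below are randomly generated stand-ins, not the real dataset; the within/between split is the usual one-way decomposition of variance.

```python
import numpy as np

# scores[variant, run] on the 0-3 judge scale -- illustrative stand-in data
rng = np.random.default_rng(0)
scores = rng.normal(loc=2.5, scale=0.2, size=(9, 3))

variant_means = scores.mean(axis=1)              # one mean score per CLAUDE.md variant

# Cross-variant spread: best variant mean minus worst variant mean.
spread = variant_means.max() - variant_means.min()

# Coefficient of variation of the variant means, as a percentage.
cv = variant_means.std(ddof=1) / variant_means.mean() * 100

# Within-variant variance (same md, three runs) vs between-variant variance.
within = scores.var(axis=1, ddof=1).mean()
between = variant_means.var(ddof=1)

print(f"spread={spread:.2f}  CV={cv:.1f}%  within={within:.3f}  between={between:.3f}")
```

When `within` is comparable to or larger than `between`, as it was on the scraper task, a single run per variant tells you almost nothing.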
The rankings flipped
Seven of nine variants changed rank between the planned task and the freeform task. Only two stayed stable. The most painful flip: v0, the empty file, ranked 7 of 9 on the planned task and 2 of 9 on the freeform task. No rules. No conventions. Just nothing. On the loose prompt, it beat seven of the other eight variants, including some of the most-starred CLAUDE.mds on the internet.
Karpathy's v1 was the opposite shape: #1 on planned, #3 on freeform. That's not a flaw — Karpathy's md is surgical, opinionated, oriented around "do less, comment why, don't add abstractions." That style needs a plan to push against. Give it a one-sentence prompt and the rules it cares about aren't the ones being asked about.
There is no universally best CLAUDE.md. Picking one is picking a task profile, not a quality level.
The refusal that wasn't a refusal

Mid-analysis, one cell jumped off the heatmap. v2-r2: 0 files written. Score 0.00.
v2's CLAUDE.md is agents_light, 57 lines. The key rule: "Always start a call by suggesting a high-level solution. No code changes. No file edits." v2-r2's agent followed that rule over the user's "do not ask clarifying questions; build it." It produced a plan, asked for confirmation, wrote zero code, and the tmux pane just sat there. A clean refusal. But v2-r1 and v2-r3 — same md, same prompt, same model — built normally. One refusal out of three runs.
So I did something obvious-in-retrospect. I re-sent the same prompt verbatim to v2-r2's existing conversation. The agent said "I'm already mid-implementation" and built a working shop — 15 files, 1,109 lines, score 2.94. Its highest across all three runs. (I show this live in the video — worth watching just to see the agent's response on the second turn.)
The refusal was a single-turn-evaluation artifact. The rule didn't install a permanent capability loss; it installed a stateful planning gate that any second turn breaks.
This generalizes. Any rule of the form "first do X, then Y" is installing a gate. Most of the time it doesn't fire. Sometimes it does. If your eval is single-turn, the gate looks like a refusal. If your production usage is multi-turn, the gate is invisible.
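Here's a minimal sketch of what a second-turn check could look like, assuming Claude Code's headless `-p` mode and a `--continue`-style flag for resuming the most recent conversation; the helper names are hypothetical.

```python
import subprocess

def files_written(workdir: str) -> int:
    """Count paths the agent touched (modified or untracked) in the worktree."""
    out = subprocess.run(["git", "status", "--porcelain"], cwd=workdir,
                         capture_output=True, text=True, check=True)
    return len(out.stdout.splitlines())

def run_with_gate_check(workdir: str, prompt: str) -> int:
    # Turn 1: a rule-installed gate may fire and produce a plan with zero edits.
    subprocess.run(["claude", "-p", prompt], cwd=workdir, check=True)
    written = files_written(workdir)
    if written > 0:
        return written

    # Turn 2: re-send the same prompt to the same conversation.
    # The flag name for resuming a session may differ by version.
    subprocess.run(["claude", "-p", "--continue", prompt], cwd=workdir, check=True)
    return files_written(workdir)
```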
The mode collapse nobody talks about
v6 is HumanLayer's CLAUDE.md, ~10.7k stars. On run 1, v6 picked Node, ran npm install, and committed node_modules/. 596 dependency files. 65,475 lines of vendored library code in the diff. Judges correctly tanked it — 1.17 vs the ~2.5 norm. But v6-r2 and v6-r3 ran normally. Same md, same prompt, three runs: one disaster, two clean builds.
I left the artifact in the dataset. Cleaning it up would have manufactured a tidier number that hides the real behavior. The disasters that happen one time in three are the part of agent evaluation nobody publishes.
26 of 27 apps work. So what?
I smoke-tested every shop_build cell. Detect the stack, install dependencies, boot the server, curl the common routes, grep for shoe keywords. Twenty-six of twenty-seven builds passed. The only failure was v2-r2's refusal — which isn't really a failure either.
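As a sketch, the whole check fits in one function; the port, the `app.py` entry point, and the keyword list are assumptions, not the exact script.

```python
import pathlib
import subprocess
import time
import urllib.request

def smoke_test(workdir: str, port: int = 3000) -> bool:
    """Boot the generated app and check that it serves something shoe-shaped."""
    root = pathlib.Path(workdir)

    # Detect the stack from its manifest and pick install/boot commands accordingly.
    if (root / "package.json").exists():
        install, boot = ["npm", "install"], ["npm", "start"]
    elif (root / "requirements.txt").exists():
        install, boot = ["pip", "install", "-r", "requirements.txt"], ["python", "app.py"]
    else:
        return False

    subprocess.run(install, cwd=workdir, check=True)
    server = subprocess.Popen(boot, cwd=workdir)
    time.sleep(10)  # crude wait for the server to come up

    try:
        # Curl the common routes and grep the responses for shoe keywords.
        for route in ("/", "/products", "/shop"):
            try:
                url = f"http://localhost:{port}{route}"
                body = urllib.request.urlopen(url, timeout=5).read().decode(errors="ignore")
                if any(word in body.lower() for word in ("shoe", "sneaker", "boot")):
                    return True
            except OSError:
                continue
        return False
    finally:
        server.terminate()
```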
Functional pass rate is ~96%. That number doesn't discriminate between variants at all. Karpathy's v1 produces 580 lines of surgical code. v4 — the 1,353-line kitchen-sink md — produces 1,830 lines of sprawling code that also works. Both apps boot. Both serve shoes. One of them is something you'd want to maintain.
If you're evaluating your CLAUDE.md by whether the app boots, you're not evaluating your CLAUDE.md. You're evaluating Claude.
The Numbers on One Page
| Finding | Evidence |
|---|---|
| CLAUDE.md effect doubles when prompts get vague | 237-line spec → 0.43 spread, ~9% CV. One-line prompt → 0.87 spread, ~25% CV. |
| The empty file is competitive on loose prompts | v0 ranked 7 of 9 on planned, 2 of 9 on freeform. |
| Rankings flip between task types | 7 of 9 variants changed rank between planned and freeform. |
| Rules can install stateful refusal gates | v2-r2 wrote 0 files; r1 and r3 built normally. Re-sending the prompt broke the gate. |
| Within-variant variance can exceed between-variant | v6-r1 committed 65k lines of node_modules; r2 and r3 were clean. |
| Functional pass rate doesn't discriminate | 26 of 27 builds boot. Quality lives in shape, not function. |
| Karpathy is #1 on planned tasks | Surgical, opinionated style fits tight specs; #3 on vibe-coding. |
So Was the Paper Right or Wrong?
Both. Neither. Wrong question.
The paper tested CLAUDE.md variance under conditions where prompts were specified enough that the md had nothing left to influence. Inside that regime, the result is honest and reproducible — I replicated it on the scraper task. Karpathy's md is also genuinely excellent — it won the planned task in my data, with the lowest line counts and the most surgical diffs.
What neither the paper nor my planned-task replication tests is the regime most people are actually in: somewhere between "tight plan" and "one-sentence prompt," with a CLAUDE.md doing more work than the prompt does. That's where the variance lives, where the rankings flip, where v2 refuses one time in three and v6 commits node_modules. That's where your md is — quietly, invisibly, when you aren't watching — actually writing your code.
The right frame isn't "does CLAUDE.md choice matter?" It's "how loose is your prompt, and is your md compensating for what the prompt didn't say?"
Takeaways
After 27 builds on the freeform task, 27 on the planned task, the cell tests before that, and a lot of staring at heatmaps, here's what I'd put on a slide.
1. The plan is the pilot. CLAUDE.md is the co-pilot.
When the plan is strong, the plan flies the plane. The md barely touches the controls. When the plan is weak — or there is no plan — the md takes over. Most people tune their CLAUDE.md in the planned-task regime and then are baffled when production behavior on loose tickets looks nothing like what they expected. Tune your md for the regime you actually work in. If you ship from one-sentence Linear tickets, your md is doing 80% of the architectural decisions.
2. Fewer rules, more stability.
The variants with sparser rule sets had more consistent behavior across runs. Big mds — v4 at 1,353 lines — produced more variance, not less. More rules means more surface area to interpret, more gates to trip non-deterministically. If you can't justify a line in your CLAUDE.md against an actual failure mode you've seen, consider deleting it.
3. Conventions over procedures.
The mds that did best declared conventions ("use type hints," "add a regression test when fixing a bug") rather than dictating procedure ("always do X first, then Y, then Z"). Conventions are checked at write time. Procedures install gates. The one flow rule I'd keep no matter what: when you fix a bug, add a regression test. Tiny rule, big effect. (A short example of each follows the last takeaway.)
4. Don't single-turn-evaluate a multi-turn system.
v2's refusal taught me this one. If you score your CLAUDE.md by single-turn behavior, rule-installed gates look like refusals. They aren't. They disappear the moment you send a second message. Most real usage is multi-turn. If your eval is single-turn, you'll either drop a perfectly good md or keep a quietly broken one.
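To make takeaway 3 concrete, here's a hypothetical pair of CLAUDE.md excerpts, not taken from any of the nine variants. The first states conventions an agent can satisfy while it writes; the second dictates a procedure and installs exactly the kind of gate v2 tripped.

```markdown
<!-- Conventions: checked at write time, no turn-order dependency -->
- Use type hints on all new functions.
- When you fix a bug, add a regression test that fails without the fix.

<!-- Procedure: installs a stateful gate that can look like a refusal -->
- Always start by proposing a high-level solution. Make no code changes and
  no file edits until the user confirms the plan.
```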
Where I Net Out
The honest version of this finding isn't "I found the best CLAUDE.md." I didn't. There isn't one. The honest version:
How much work your CLAUDE.md does scales roughly inversely with how much your prompt specifies. Tune for the regime you actually work in, prefer conventions over procedures, and run your variants more than once before believing the result.
Karpathy's md is excellent for tight-plan workflows. The empty file is shockingly competitive for vibe-coding. v7 (OpenAI's AGENTS.md) was one of only two variants stable across regimes — a reasonable default if you don't know what regime you're in.
I almost recorded the wrong video. The dataset wasn't telling me I was wrong. It was telling me I'd asked the wrong question.
Watch and Browse
The video is the full walkthrough — heatmaps live on screen, the v2 refusal moment, the rank-flip side by side. The interactive reports are the receipts: every cell, every git diff, all three judge rationales, the smoke test, the resulting app's code. The part nobody usually publishes — and the only thing that makes a claim like this falsifiable.
▶ Watch the full breakdown on YouTube — the runs, the heatmaps, the refusal moment
🔬 Browse the interactive reports — start with Experiment 1, click through every cell, every diff, every judge rationale
If you have a CLAUDE.md you've shipped and you want to know where it sits, drop it next to mine and compare the runs. The disagreements are where the interesting questions are.
Key Takeaways
- CLAUDE.md's effect doubles when prompts get loose. Tight specs hide the variance; vibe-coding exposes it.
- There is no universally best CLAUDE.md — 7 of 9 variants flipped rank between task types.
- The empty file ranked 2 of 9 on the freeform task. Adding rules can hurt as easily as help.
- Rules of the form "do X first, then Y" install stateful gates. They can refuse work non-deterministically.
- 26 of 27 builds worked. Functional pass rate is the wrong metric for evaluating CLAUDE.mds.
- Tune your md for the regime you actually ship in. Prefer conventions over procedures. Run every variant three times before trusting the result.