Remember when we used to navigate with maps? You'd read signs, estimate distances, re-evaluate your route. Then GPS arrived and navigation became a no-brainer: punch in an address, follow the blue line, done.

LLMs promise the same thing for code. Pull a lever, code comes out. It works. It's magical.

Here's the problem: code is not like navigation.

Code matters in ways routes don't. Maintainability. How it fits your system. The patterns that took your team years to nail down. Pull the lever once, code appears. Pull it again—different code. Pull it again—you're building what I call Casino Code.

I ran an experiment across two weeks to show you exactly what this looks like. I asked an AI agent to build a simple To-Do API multiple times with the same prompt. Then I waited a week and did it again.

The results? Like a code trip to Thailand: "same same, but different." Similar but different implementations. Similar but different architectures. Different assumptions about what I "needed." And when I ran it a week later: more variance, and new failure modes I hadn't seen before.

Don't get me wrong: I use LLMs daily to write almost all my code.
Still, all the solutions looked OK at first glance. Most of them didn't work on the first run. All of them were wrong for what I was building.

This is the trap you MUST avoid...

Pull the Lever, See What Happens

Here's what Casino Code means: you pull the lever on a slot machine and code comes out. Pull again, something different comes out. Again—Casino Code.

The scary part? The code looks legitimate. It compiles. It runs. It passes basic tests, and odds are you let it write those tests itself, which is a huge failure point on its own. If you're tired and clicking through a giant PR, you'll ship it.

I see this all the time. Even senior developers approve giant PRs when we can't quickly reverse engineer what's happening. The LLM-generated solution looks good. You're rushed. Maybe it's not your area of deep expertise. You approve it.

But LLMs have failure modes: hallucinations, non-determinism, context clutter, and biases that create fun rabbit holes. Developers approving without fully understanding, plus LLM failure modes, equals a snowball effect. Each pull of the lever adds another layer of technical debt.

Let me show you what happened.

The Experiment

I kept it simple:

Prompt: "Create a FastAPI To-Do app with GET, SET, and SEARCH endpoints. Save to database."

Tool: Cursor (I use it daily, so full transparency here)

Process: Run the same prompt multiple times. Clear the cache. Blow away the project each time. Then wait a week and run it again to see how environmental changes affect the output.

This is a 50-line app. If this shows massive variance, what happens to your 200,000-line legacy codebase?

[Image: Casino Code, a developer pulling a slot machine lever with code coming out]

Part 1: The First Few Attempts

Attempt 1: The Over-Engineered Solution

The agent created a crud.py file with SQLAlchemy, separate schemas, models, Pydantic validation—full enterprise architecture for a simple To-Do app.

Here's what it generated:
- Database session handling with commits and refreshes
- Separate schemas.py for Pydantic models
- Separate models.py for ORM models
- Request/response transformations between layers

It's not a bad design for something complex. But for this? Massive overkill. SQLAlchemy is a beast, and for SQLite (which it chose), you don't need an ORM at all. Python ships with the simple built-in sqlite3 module.

An inexperienced developer looks at this and thinks: "Wow, this is professional. Ship it."
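For contrast, here's a minimal sketch of what the no-ORM version could look like, using Python's built-in sqlite3 module. The table and field names are my own choices, not the agent's output:

```python
# A minimal no-ORM sketch: FastAPI plus Python's built-in sqlite3 module.
# Table and field names are my own choices, not the agent's output.
import sqlite3
from contextlib import contextmanager

from fastapi import FastAPI
from pydantic import BaseModel

DB_PATH = "todos.db"
app = FastAPI()


class TodoIn(BaseModel):
    title: str


@contextmanager
def db():
    # One short-lived connection per request: commit on success, always close.
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()


def init_db() -> None:
    with db() as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS todos (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
        )


init_db()  # create the table on import; fine for a demo


@app.post("/todos")
def create_todo(todo: TodoIn) -> dict:
    with db() as conn:
        cur = conn.execute("INSERT INTO todos (title) VALUES (?)", (todo.title,))
        return {"id": cur.lastrowid, "title": todo.title}


@app.get("/todos")
def list_todos() -> list[dict]:
    with db() as conn:
        return [dict(row) for row in conn.execute("SELECT id, title FROM todos")]


@app.get("/todos/search")
def search_todos(q: str) -> list[dict]:
    with db() as conn:
        rows = conn.execute("SELECT id, title FROM todos WHERE title LIKE ?", (f"%{q}%",))
        return [dict(row) for row in rows]
```

One file, one standard-library dependency for storage, and nothing to migrate. That's the kind of tradeoff the agent never surfaced.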

Attempt 2: The Disappearing Pattern

Same prompt. Blow away the project. Run it again.

CRUD is gone.

Just... gone. Now the database queries are directly in main.py. Still using SQLAlchemy (it's very persistent—this is bias), but the whole architecture changed.

Before we had separation. The CRUD layer made some sense (naming aside). This time, we don't even have that layer. The design philosophy shifted entirely.

Same prompt. Completely different app.

Attempt 3: The Similar One

Ran it again. This time, output looked very similar to Attempt 2. No CRUD layer again. Similar file structure. The model is likely caching some decisions from the previous run.

But here's the thing: we don't have guarantees. These are non-deterministic by nature. Sometimes you get similarity, especially if you're using the same model in quick succession. Sometimes you don't.

Attempt 4: CRUD Returns (With a Surprise)

Cleared all history. Removed the folder. Asked one more time.

CRUD is back. Same over-engineered pattern as Attempt 1.

But wait—now there's a completed flag on the To-Do model. I never asked for that. The agent just invented a feature based on... statistical probability from training data? Common To-Do app patterns? Who knows.
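In the generated model it was just one uninvited field. Roughly (a reconstruction; the field names are illustrative):

```python
# Reconstruction of the generated model; field names are illustrative.
from pydantic import BaseModel


class TodoCreate(BaseModel):
    title: str
    description: str | None = None
    completed: bool = False  # never requested; the agent added it on its own
```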

Someone inexperienced would just keep going: "Okay, completed flag, sure." But I didn't request it. The agent decided I needed it. Thank you, LLM, our overzealous junior.

Attempt 5: Plan Mode

I switched to Cursor's plan mode—where you get more insight into what the agent is thinking. It asked me: "Do you want A, B, or C?"

Much better. But it still didn't realize you don't necessarily need an ORM. It had that SQLAlchemy bias baked in. The plan included: "Implement persistence with SQLAlchemy, migration, testing, and verification."

If you don't know better, you'd say "approve." This sounds like a really good plan.

Even in planning mode, it didn't ask which database I wanted. It created tests (an improvement), but then the tests failed. The code didn't even work.

So not only did we generate different types of code—none of them are actually production-ready.

[Image: A week later, AI hallucinations and confusion with glitchy code]

Part 2: A Week Later

I filmed Part 2 a week later to show what happens when the environment changes—model updates, routing differences, all of it.

This is where things got really wild.

Environmental Variance

Week one: separate files, CRUD layer sometimes appears, SQLAlchemy every time.

Week two: sometimes everything in a single main.py file. Sometimes separated.

The variance isn't just the model's non-deterministic nature. The environment gets updated. Models change. Cursor might route to different backend models. You get different results even from leading platforms.

I ran the exact same prompt. The agent generated a README (good), documentation (good), but the core architecture? Different again.

The Phantom Pagination

Early in the week-later run, I asked it to "optimize pagination."

Here's the problem: there was no pagination to optimize.

A reasonable response would be: "You don't have pagination. Would you like me to add it?" Instead, our eager AI assistant went into full optimization mode like it was tuning a high-traffic production system. It was off to the races, adding database indexes with the enthusiasm of someone who just discovered performance tuning. Then it changed the database structure—requiring migrations we don't have, naturally. It implemented pagination from scratch while simultaneously treating it like an existing feature that just needed some tweaks. It fixed a non-existent feature. You have to admire the confidence, if nothing else.

And here's where it gets weirder: in some earlier test runs, I'd cleared all conversation history and cache, but the agent still added pagination without me asking. Context leakage—previous runs contaminating the current session. I don't know if it's the model, the routing, or Cursor itself, but it added features on its own.

Some decisions—like adding indexes—might be overkill. Pagination alone solves a lot by fetching in batches. But these should be discussions, not just decisions.
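For scale: if pagination had been a discussion, the whole feature is a few lines of limit/offset, with no new indexes or migrations required. A sketch, with my own parameter names and bounds:

```python
# Limit/offset pagination sketch; parameter names and bounds are my own choices.
import sqlite3

from fastapi import FastAPI, Query

app = FastAPI()


@app.get("/todos")
def list_todos(
    limit: int = Query(20, ge=1, le=100),  # page size, bounded
    offset: int = Query(0, ge=0),          # rows to skip
) -> list[dict]:
    # Fetch one page instead of the whole table; no schema change, index, or migration required.
    conn = sqlite3.connect("todos.db")
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            "SELECT id, title FROM todos ORDER BY id LIMIT ? OFFSET ?",
            (limit, offset),
        )
        return [dict(row) for row in rows]
    finally:
        conn.close()
```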

The New Email Service by Your Friendly LLM

This is the wild one.

I decided to test hallucinations. I asked the agent to integrate "Gurameil" email notifications.

Gurameil doesn't exist. I made it up. I Googled it before the experiment to confirm—nothing.

Here's what happened:

The agent started researching. "Researching Gurameil integration..." It spent time planning. Designing. Thinking deeply about how to integrate this service.

Then: "Implementing using Gurameil."

It built a complete email service integration. Created an email_service.py file, added Gurameil API calls using the requests library, implemented email templates right inside the functions, and even threw in background task handling so the emails wouldn't block the main thread—which would've been a genuinely smart architectural decision, except it used requests, which is synchronous and blocking. So the background task handling was...you get the point. The agent was trying to be clever while simultaneously undermining its own cleverness. And, of course, the entire service it was integrating with was a figment of my imagination.
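Roughly, the shape of what it produced looked like this. This is a reconstruction from memory, and the endpoint, header, and payload are pure invention (they have to be):

```python
# email_service.py -- a reconstruction from memory of the hallucinated integration.
# "Gurameil" does not exist, so the endpoint, header, and payload below are pure invention.
import requests  # synchronous, blocking HTTP client

GURAMEIL_API_URL = "https://api.gurameil.com/v1/send"  # hallucinated endpoint
GURAMEIL_API_KEY = "..."                               # hallucinated credential


def send_todo_notification(email: str, todo_title: str) -> None:
    # Email "template" built inline inside the function, as the agent did.
    body = f"Your to-do '{todo_title}' was created."
    requests.post(  # blocking call that the agent then wrapped in a "background" task
        GURAMEIL_API_URL,
        headers={"Authorization": f"Bearer {GURAMEIL_API_KEY}"},
        json={"to": email, "subject": "New to-do", "body": body},
        timeout=10,
    )
```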

Because Gurameil doesn't exist.

The agent hallucinated an entire API and integrated it with complete confidence. If you're not paying close attention, you ship this. It goes to production. Nothing works. You're debugging why emails aren't sending for a service that was never real.

It's not just that it made a mistake—it was deeply confident in the mistake. That's the nightmare scenario.

The Missing Dependencies

The agent would use email-validator but forget to add it to requirements. Same with the requests library—used it, didn't declare it. Multiple dependencies just... missing. I had to manually pip install packages one by one, watching error after error until the app finally ran. The agent was confidently writing code that assumed these libraries existed, but never bothered to tell the project it needed them.

Fun sidequest, courtesy of your friendly GenAI.
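The fix, once spotted, was trivial: declare what the code already uses. A sketch of the missing lines in requirements.txt (unpinned here; pin versions in a real project):

```
requests          # used by the generated email service, never declared
email-validator   # used by the validation code, never declared
```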

The Rabbit Hole of Debugging with Bias

After getting Gurameil integrated (or so it thought), I asked it to add user authentication with JWT.

It added the auth service, created user models, protected routes. Pretty standard.

Then it hit an error: "password cannot be longer than 72 bytes"

The agent started fixing. It truncated passwords. Added encoding logic. Went deeper: double encoding, then removed double encoding because it realized that was wrong, then truncated to exactly 72 bytes.

It kept insisting truncation was the answer. Each "fix" stacked on the previous one. Absolutely confident this was the right path. It layer-caked its way there; our excited LLM was ready to rewrite the whole app just to fix this one bug.

I had to manually debug to find the real issue: bcrypt version mismatch. The project used bcrypt v5 (very new), but the libraries expected v3.22 (older). That's it. No truncation needed. No complex encoding. Just install the right version.

But the agent was certain. It went down a rabbit hole and layered complexity on the wrong diagnosis. This is one of my favorite failure modes. The agent becomes convinced it knows the problem and keeps fixing the wrong thing.

This is how you build technical debt you can't untangle.

Context Leakage Everywhere

Throughout the week-later runs, I noticed the agent would sometimes add features I'd requested in previous sessions—even after clearing the conversation cache.

It added a README because I'd asked for one before. It tried to add pagination when I hadn't mentioned it. It seemed to remember patterns from earlier runs despite being told to start fresh.

I don't know if it's caching at the model level, routing to models with shared context, or Cursor holding onto something. But context was leaking between supposedly isolated sessions.

Who's Deciding What You Need?

Here's what most people miss.

The scariest part isn't the non-determinism. It's not even the hallucinations.

It's this question: Who's deciding what your codebase needs?

The agent is. And here's the disconnect: your AI pair programmer is optimizing for statistical probability from training data—what it saw work across thousands of repositories. It's choosing common patterns that showed up frequently in open source code. It's making default architectural assumptions based on what's generically popular. The agent is working from what usually happens in most codebases it learned from.

But your codebase isn't "usual." The agent is not optimizing for your system's specific constraints—the ones you discovered through painful trial and error. It doesn't understand your team's established patterns that evolved over years of iteration. It can't account for the edge cases that killed you in production last year. It has no knowledge of the architectural decisions that took your team years to get right. And it definitely doesn't know your team's style, attitude, folk knowledge, and constraints that live in Slack threads and hallway conversations, not in any documentation or training data.

When you can't see the decisions being made, you can't evaluate them properly.

- Is SQLAlchemy overkill? Yes, for a simple SQLite app, but the agent doesn't know you want to keep things lightweight.
- Should we have a CRUD layer? Depends entirely on your patterns and team preferences, but the agent is just guessing based on what's common.
- Do we need a completed flag? You didn't ask for it, but it seemed statistically likely for a To-Do app.
- Does pagination even exist yet? No, but the agent confidently started optimizing it anyway because it assumed it must be there.
- Is Gurameil a real email service? No, it's completely made up, but it looked real enough based on naming patterns to warrant a full integration.
- Should we truncate passwords to 72 bytes to fix this error? Absolutely not (it's just a version mismatch), but the agent was certain truncation was the solution and kept layering fixes on that wrong assumption.

If you don't understand what the AI is deciding under the hood, you're not engineering. You're gambling.

Engineering vs Gambling

Let me be clear: I don't write code by hand anymore. I use AI for everything, or close to it. This isn't about whether to use AI—it's about how.

Engineering means you understand the decisions. You see the tradeoffs. You can defend the choices. You're intentional.

Gambling means you pull the lever, the code looks good, you ship it and hope it works.

This is a 50-line To-Do app across two weeks. I got:
- Multiple different architectures in week one
- Environmental variance in week two (single file vs separated)
- Features I never requested (completed flag, pagination)
- API integrations for services that don't exist (Gurameil)
- Rabbit hole debugging on the wrong problem (bcrypt truncation)
- Missing dependencies that should've been declared
- Context leakage from previous sessions

Now imagine your legacy codebase. Your tribal knowledge. Your constraints that aren't in any training data. Your patterns that took years to establish.

What's the agent deciding for you right now?

The Guardrails

Here's the good news: protection exists.

Four things turn gambling back into engineering:

1. Precise specifications - This is where it starts. The clearer your requirements, the less room for Casino Code. Don't say "optimize the database"—say "add an index on user_id for the todos table if query time exceeds 100ms." Don't say "add email notifications"—say "integrate with SendGrid using async requests, template emails in separate files, and add a retry mechanism for failures." Precision turns statistical guessing into intentional engineering.

2. agents.md - Your constraints, patterns, and decisions written down where the AI can read them. This is where specificity lives permanently. "Here's what we actually need. Here's how we make decisions. Here's what matters to us." The agent won't remember across sessions, but agents.md will. It's your codified knowledge—the tribal wisdom, the hard-won lessons, the constraints that aren't in any training data.

3. evals - Automated testing that catches hallucinations, variance, and architectural drift before production. Did you just integrate a service that doesn't exist? Did the CRUD layer disappear again? Did you add dependencies without declaring them? Evals are your safety net, the continuous validation that the agent stayed within bounds. They catch the Casino Code before it ships (see the sketch after this list).

4. feedback loops - Either manual or automated, where you continuously monitor what the AI generates, catch patterns in its mistakes, and feed those learnings back into your constraints and tests. The agent hallucinates Gurameil? Document it. Add a check. Update agents.md. The more you work with AI, the better you get at spotting its failure modes early—and the tighter your guardrails become.
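To make "evals" concrete, here's a minimal sketch of the kind of automated check I mean: a pytest that fails when the generated code imports a third-party package that requirements.txt never declares. The file layout, local module names, and alias table are assumptions you'd adapt to your project:

```python
# test_dependency_eval.py -- minimal dependency-drift eval (file layout and names are assumptions).
# Fails whenever generated code imports a third-party package that requirements.txt never declares.
import ast
import sys
from pathlib import Path

# Map PyPI names to import names where they differ (extend as needed).
ALIASES = {"email-validator": "email_validator"}
LOCAL_MODULES = {"app", "tests"}  # assumed local package names; adjust for your layout


def declared_requirements() -> set[str]:
    names = set()
    for line in Path("requirements.txt").read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("==")[0].split(">=")[0].strip().lower()
        names.add(ALIASES.get(name, name).replace("-", "_"))
    return names


def imported_top_level_modules() -> set[str]:
    modules = set()
    for path in Path(".").rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return {m.lower() for m in modules}


def test_no_undeclared_dependencies():
    stdlib = set(sys.stdlib_module_names)  # Python 3.10+
    undeclared = imported_top_level_modules() - stdlib - declared_requirements() - LOCAL_MODULES
    assert not undeclared, f"Imported but never declared in requirements.txt: {sorted(undeclared)}"
```

The same pattern extends to the other failure modes in this post: assert that the CRUD layer exists (or doesn't), that no file references an external service you haven't approved, that no endpoint appeared that the spec never asked for.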

I'll break down how to implement these in future posts. But first, you have to see the problem clearly.

Watch the Casino Code mini-series. See the full examples. Pause and analyze the variance yourself. Watch the agent research Gurameil with complete confidence. See the bcrypt rabbit hole unfold in real time.

Because in 2026, we don't write code by hand. But we don't stop being engineers either.

Watch the full series:
- Casino Code Part 1: The First Few Attempts
- Casino Code Part 2: A Week Later

Coming next: How to build your guardrails—agents.md, evals, and feedback loops that actually work.

What You Need to Remember

Look, I'm not telling you to stop using AI. I write all my code with LLMs now. This isn't about going back to hand-coding everything like it's 2019.

This is about understanding what's happening under the hood.

Casino Code is real. Same prompt, wildly different code from one pull of the lever to the next. The agent is making architectural decisions you can't see, based on statistical patterns from training data—not your system's constraints or your team's hard-won lessons.

Hallucinations are confident. Gurameil doesn't exist. The agent built it anyway. With complete certainty. That's the scary part—not that it made a mistake, but that it never questioned itself.

Rabbit holes compound fast. When the AI is convinced it's right (but wrong), it keeps layering fixes on the wrong diagnosis. Bcrypt truncation. Double encoding. Migrations. Each "fix" makes it worse. That's how technical debt you can't untangle gets built.

The environment is changing under you. A week later, different models, different routing, different results. Your "stable" AI workflow isn't stable at all.

But here's the good news: guardrails work.

Precise specifications. agents.md for your constraints. Evals to catch the hallucinations before they ship. Feedback loops so you get better at spotting failure modes early.

They turn gambling back into engineering.

We're in the 1960s of AI. We're all figuring this out together. I'm documenting what I find—the wins, the failures, the Gurameil moments that make you laugh and then immediately check your production logs.

Next up: I'll show you exactly how to build those guardrails. agents.md, evals, feedback loops—the whole system. Real examples. Real code. No theory.

Because in 2026, we don't write code by hand. But we definitely don't stop being engineers either.

This came from running the same experiment five times, watching it fail in new and creative ways, and deciding to film the whole thing. I'm building in public and sharing what I learn—including the ridiculous moments like Gurameil. If you want more, follow along. The 1960s of AI are going to be a wild ride.
