Codin' On A Prayer
- Shai Yallin


You wake up one morning with an idea for a killer new product, fully formed in your mind. Alas, you have no coding skills or time on your hands. You ask around and get recommendations for a few top-tier independent software developers. You engage one of them, hand them a clear spec, and a few days later they deliver a working solution. It’s great. Users love it. Investors love it.
A while later, you want to add another feature. The original contractor has moved on to other engagements and won’t be available in the near future, so you pick the next name on your list of recommendations. This new contractor wasn’t there for the initial decisions. They don’t know which parts of the system are accidental and which are essential. They don’t know what was tried and discarded, or what assumptions the code quietly relies on. All they have is the code.
They read it. They infer intent. They guess. What they deliver is not exactly what you asked for, and the integration with existing features could be better, but hey - it’s working and it’s good enough, so you ship it. New users keep joining; investors are courting you and suddenly you have more money on your hands than you can use. Feature requests are coming in faster than you can read them. You go back to the contractors you’ve already worked with, but none are available. You bring in a third contractor.
After a few iterations like this, things start to degrade. Each change is made by someone who has less context than the person before them, while the system itself accumulates more implicit behavior. The gap between “what the system does” and “what anyone explicitly understands” keeps growing. Eventually, making changes becomes risky. The contractor does the work, runs whatever they can, and hopes they didn’t break something they didn’t know existed.
21st Century Schizoid Bot
An AI agent is just another contractor. It has no pre-existing context. It wasn’t there for the early decisions. It doesn’t know which parts of the system are fragile, which are sacred, and which only look important because nobody dared touch them. All it sees is what we put in front of it.
So we try to compensate. We write elaborate prompts. We add rules files. We maintain design docs, decision logs, and architectural overviews. We carefully explain how we expect code to look, how tests should be written, how abstractions should be shaped. And sometimes it actually works.
But now we’re no longer writing software - we’re appealing to the LLM gods, hoping they’re in a good mood. We’re placing faith in a non-deterministic system to behave as if it understood our standards, our constraints, and our intent, without any hard feedback loop to prove that it actually did.
Of course, there’s nothing mystical about it. An LLM isn’t intelligent in any meaningful sense, and it certainly isn’t reasoning about your system. It’s an algorithm that predicts the next token based on statistical patterns in a large corpus of text. That’s it. No understanding, no intent, no obligation to be consistent. When it appears to “get it right,” what you’re seeing is a plausible continuation that happens to line up with your expectations this time.
On Rules and Siddurs
All of this LLM scripture feels comforting because it looks familiar. It resembles the tools we use with humans. If you explain things clearly enough, write them down carefully enough, and repeat them often enough, people usually comply. We’re wired to believe that clarity plus repetition leads to correctness.
Rules files tap directly into that instinct. They give us something concrete to point at. Something we can refine. Something that feels like control. When the agent behaves well, we credit the rules. When it doesn’t, we assume the rules weren’t detailed enough, or phrased correctly, or placed in the right file. So we iterate. We add another rule. Another paragraph. Another example.
At that point, the rules file starts to resemble a prayer booklet, or a Siddur - a carefully curated collection of texts we return to again and again, hoping that if we say the right words, in the right order, with enough intention, the outcome will improve. Sometimes it even does.
But there’s a fundamental problem: none of this is enforceable. Rules files don’t constrain behavior; they suggest it. Prompts don’t bind the agent; they influence it. And influence is not a contract. The agent can comply, partially comply, reinterpret, or ignore the instruction altogether - and it will do all of those things depending on context, phrasing, temperature, or sheer statistical noise.
That’s why this approach doesn’t scale. As the system grows, the gap between “what we asked for” and “what actually happened” widens. When things go wrong, there’s nothing to point to. No failing signal. No clear boundary that was crossed. Just a sense that the gods were less cooperative this time.
Law, not Faith
There is a way to scale development with contractors. We’ve known it for years. And it’s the same thing we use when onboarding new employees.
If we want someone with no prior context to work on our system safely, we need to give them more than documentation. We need to give them a way to know whether their change is acceptable, without asking us, and without guessing. That starts with a reliable suite of automated acceptance tests.
Acceptance tests encode behavior, not intent. They don’t explain how the system should work; they assert that it does. A contractor can make a change, run the suite, and immediately see whether they broke anything. Not “does this look right?”, but “did I violate an invariant?”. That single feedback loop replaces hours of explanation and weeks of back-and-forth.
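To make that concrete, here’s roughly what such a test might look like - a minimal Python sketch against a hypothetical invoicing service. The base URL, endpoint, and field names are made up for illustration; the point is that the test exercises the system strictly from the outside:

```python
# A black-box acceptance test: it talks to the system through its public
# HTTP API and asserts observable behavior, not internal structure.
# BASE_URL and the /invoices endpoint are illustrative assumptions.
import requests

BASE_URL = "http://localhost:8080"  # a local or ephemeral test environment


def test_creating_an_invoice_makes_it_retrievable():
    # Create an invoice through the public API.
    created = requests.post(
        f"{BASE_URL}/invoices",
        json={"customer_id": "c-42", "amount_cents": 1999},
        timeout=5,
    )
    assert created.status_code == 201
    invoice_id = created.json()["id"]

    # The invoice must be retrievable, and its data must round-trip intact.
    fetched = requests.get(f"{BASE_URL}/invoices/{invoice_id}", timeout=5)
    assert fetched.status_code == 200
    assert fetched.json()["amount_cents"] == 1999
```

Run it with pytest: a green run means the invariant still holds; a red run is an unambiguous stop sign, no matter who - or what - made the change.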
Next, they need the ability to run the system - or at least the relevant subsystem - in isolation. If the only place our system can run is production, we’ve already lost. A contractor who can’t run the code can’t validate assumptions, can’t explore behavior, and can’t safely iterate. Isolation matters. Local environments matter. Ephemeral test environments matter. Without them, every change is speculative.
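Here’s one sketch of what that can look like - an ephemeral database spun up just for the test run and discarded afterwards. It assumes Docker plus the testcontainers, SQLAlchemy, and psycopg2 packages; the schema is a stand-in for the system’s real bootstrap step:

```python
# An ephemeral test environment: the database the system needs is created
# fresh for this test and destroyed when the block exits, so nothing
# depends on shared or production infrastructure.
from sqlalchemy import create_engine, text
from testcontainers.postgres import PostgresContainer


def test_schema_can_be_provisioned_from_scratch():
    with PostgresContainer("postgres:16") as pg:
        engine = create_engine(pg.get_connection_url())
        with engine.begin() as conn:
            # Stand-in for the system's real migrations / bootstrap step.
            conn.execute(text("CREATE TABLE invoices (id TEXT PRIMARY KEY)"))
            conn.execute(text("INSERT INTO invoices (id) VALUES ('inv-1')"))
            count = conn.execute(text("SELECT count(*) FROM invoices")).scalar()
        assert count == 1
    # Once we leave the 'with' block, the container - and everything in it - is gone.
```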
A type system helps for the same reason. Types don’t make systems correct, but they dramatically narrow the space of possible mistakes. They encode assumptions in a form that can be checked mechanically. When a type check fails, it’s not a suggestion. It’s a hard stop. Something no longer fits.
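A tiny Python illustration: the annotations below encode the assumption “amounts are integer cents, never floats”, and a checker such as mypy turns any violation into a hard stop before the code ever runs. The function is hypothetical; the mechanism is the point:

```python
# Types as mechanically checked assumptions.
def apply_discount(amount_cents: int, percent: int) -> int:
    """Return the discounted amount, still in integer cents."""
    return amount_cents - (amount_cents * percent) // 100


total: int = apply_discount(1999, 10)  # fine: fits the declared assumptions

# Uncommenting the next line makes mypy fail with something like:
#   Argument 1 to "apply_discount" has incompatible type "float"; expected "int"
# broken = apply_discount(19.99, 10)
```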
And then there are linters. Linters are often dismissed as style enforcers, but that undersells their value. A good linter asserts standards consistently and impersonally. It doesn’t rely on taste, memory, or goodwill. It encodes “this is how we do things here” in a way that can be checked automatically, every time.
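For instance, running a linter such as ruff or flake8 over the snippet below flags concrete, checkable violations with rule codes - every time, regardless of who wrote the code (the messages shown are representative):

```python
# Two classic findings a linter reports mechanically, run after run:
import json  # F401: 'json' imported but unused


def load_config(path):
    try:
        with open(path) as fh:
            return fh.read()
    except:  # E722: do not use bare 'except'
        return None
```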
None of these tools explain why the system is the way it is. They don’t transfer understanding. But they don’t need to. They replace hope with constraints. They let someone with no shared history make changes with confidence - not because they understand everything, but because the system tells them when they’re wrong. They do that by returning a non-zero exit code and an error message when they detect a failure, so whoever - or whatever - is making the change only needs to know how to run these tools.
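That is the entire interface: run the tools, read the exit codes. Here’s a minimal sketch of that loop, assuming a typical Python toolchain (pytest, mypy, ruff) - the commands are illustrative, not prescriptive:

```python
# The feedback loop reduced to exit codes: a non-zero status from any
# check means the change is outside the contract.
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],          # acceptance and regression tests
    ["mypy", "src"],           # type checks
    ["ruff", "check", "src"],  # lint rules
]


def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"FAILED: {' '.join(cmd)} (exit code {result.returncode})")
            return result.returncode
    print("All checks passed - the change stays inside the contract.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```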
And that’s what makes scale possible.
Executable Law
Acceptance Test–Driven Development has always been about communication. Not communication between people, but between intent and reality. Acceptance tests describe what the system must do from the outside, in terms that don’t depend on internal structure or implementation details. They turn expectations into something executable.
With human contractors, acceptance tests act as a contract. They define the boundaries within which changes are allowed. A contractor doesn’t need to understand the full system to work safely; they need to understand which behaviors must remain true. If the tests pass, the change is acceptable. If they don’t, it isn’t. No debate required.
In the age of LLMs, this becomes even more important. An AI agent cannot ask clarifying questions in the way a human does. It cannot push back on vague requirements. It cannot sense uncertainty or read between the lines. It will happily produce plausible code that looks right, reads well, and subtly violates assumptions you didn’t know you were relying on.
Acceptance tests close that gap. They give the agent something concrete to aim for. They provide fast, unambiguous feedback. They turn “this feels correct” into “this is correct enough to ship.” Without them, the agent is optimizing for plausibility. With them, it’s constrained by behavior. This also changes how we think about prompts and rules. Prompts explain intent. Acceptance tests enforce it. Rules describe preferences. Acceptance tests define boundaries.
Once acceptance tests exist, prompts become hints rather than safeguards. Helpful, but no longer critical. The tests are the authority. That’s the shift ATDD enables. It doesn’t make LLMs reliable. It makes change reliable - even when the thing making the change doesn’t understand your system, your history, or your culture. In a world where more and more code is written by entities with no shared context, executable contracts beat good intentions every time.
No More Rainmaking
LLMs didn’t break software engineering. They just exposed where we were already relying on hope. When change is driven by prompts, rules files, and good intentions, we’re not engineering - we’re petitioning the AI gods for favorable outcomes and hoping they comply. Sometimes they do. Often they don’t. And when they fail, there’s nothing to point to except disappointment.
The fix isn’t better prompts. It’s better enforcement. Acceptance tests, types, linters, and isolated environments don’t make systems elegant or intelligent. They make them bounded. They turn expectations into executable law. They allow change without shared history, without trust, and without faith.
That’s how you scale with contractors.
That’s how you work with AI agents.
And that’s how you replace hope with confidence.
No miracles required.