Not Decimal. Not a typed newtype. A plain f64 on a field called amount. It
compiled. The tests passed. The PR looked reasonable at a glance. We caught it in
review — but only because we happened to look at that file.
We added a note to the system prompt: never use f32 or f64 for financial fields, use
Decimal or a typed newtype. The next session, same agent, same rule — it worked. Then
a week later, different task, different file path, same mistake. Floats for money.
That’s when we understood the actual problem.
The Institutional Memory Problem
Senior engineers carry accumulated judgment. Not just syntax — real architectural knowledge. Never use floats for money. Every database query must scope by tenant. Domain layer never imports from infrastructure. These aren’t rules you look up. They’re things you learned from a production incident or a code review from someone more experienced, and they became reflexive.

When you hire a junior engineer, you transfer that knowledge gradually — through pairing, through review, through the slow accumulation of corrections. It sticks because the engineer remembers. They build a mental model. They generalize.

AI agents don’t do that. Every session starts from scratch. There’s no memory of the conversation you had two weeks ago. There’s no “oh right, we don’t do that here.” The prompt you wrote last week is gone. The correction you made yesterday has no effect on today’s session.

We’d been operating as if a better system prompt was the answer. Longer, more detailed, more explicit. It helped — but it didn’t solve it. Long prompts get deprioritized mid-task. The agent reads the prompt, understands it, then gets deep into implementation, and the edge case doesn’t pattern-match against something from the top of a 3,000-word context block. The knowledge wasn’t the problem. The location of the knowledge was the problem.

If you’ve watched AI agents spiral through cascading errors while trying to fix their own mistakes, the root cause is often the same: no persistent enforcement layer.
When AI fails in cascading errors covers
what that looks like at scale.
Enforcement, Not Documentation
Consider a function that takes distinct UserId and TenantId types — the type system enforces the distinction at
compile time. You’re not relying on the developer’s memory. You’re removing memory from
the equation entirely. The Rust newtype pattern
is a canonical example of this: encode domain constraints into the type system rather
than documentation.
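The failure that opened this article is easy to reproduce in a few lines. Here is a minimal Python sketch (illustrative only; the codebase in question is Rust, where Decimal would come from a crate such as rust_decimal):

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly, so
# money arithmetic drifts: ten 10-cent charges do not sum to a dollar.
print(sum([0.1] * 10) == 1.0)  # False

# Decimal does exact base-10 arithmetic, which is what money needs.
print(sum([Decimal("0.1")] * 10) == Decimal("1.00"))  # True
```

The typed-identifier half of the rule has a Python analogue in typing.NewType, though unlike Rust’s newtype it is enforced only by a static type checker such as mypy, not at runtime.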
Claude Code’s hook system works
on the same principle: small scripts that run before or after tool use. Before the agent
writes a file, the hook runs. If the hook exits with an error, the write is blocked and
the agent sees the error message.
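To make the mechanism concrete, here is a sketch of a pre-write hook in Python, guarding against the float-for-money mistake described above. The payload field names (tool_input, content) and the exit-code convention follow Claude Code’s hook documentation as I understand it; treat the details as assumptions and verify them against the version you run.

```python
import json
import re
import sys

# Pattern: f32/f64 declared on a field whose name suggests money.
MONEY_FLOAT = re.compile(r"\b(amount|price|balance|total)\w*\s*:\s*f(32|64)\b")

def violates_money_rule(source: str) -> bool:
    """True if the written file content puts a float on a money-like field."""
    return bool(MONEY_FLOAT.search(source))

def run_hook(stdin=sys.stdin) -> int:
    """Read the tool-call payload from stdin; return 2 to block, 0 to allow."""
    payload = json.load(stdin)
    content = payload.get("tool_input", {}).get("content", "")
    if violates_money_rule(content):
        # stderr is the error message the agent sees when the write is blocked
        print("Blocked: financial fields must use Decimal or a typed newtype, "
              "never f32/f64.", file=sys.stderr)
        return 2
    return 0
```

Installed as a PreToolUse hook, the script’s entry point would end with sys.exit(run_hook()); a blocking exit code stops the write and feeds the stderr message back to the agent.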
We stopped adding rules to the prompt. We started encoding them as hooks.
The difference isn’t subtle. A rule in a prompt is advice the agent might follow. A hook
is a constraint the agent cannot violate. That distinction matters more than it sounds —
it’s the difference between a sign that says “don’t touch the hot stove” and a stove
that doesn’t get hot.
What We Encoded
The float problem came first: a hook that blocks any write declaring f32 or f64 on a financial field, pointing the agent at Decimal or a typed newtype instead. A second hook guards layering: the domain layer never imports from infrastructure. The third is tenant scoping: a query missing a WHERE tenant_id = ? condition is a data leak between
customers — the kind of mistake that has legal consequences.
A human reviewer would catch it. But we were moving fast and not every PR got the
scrutiny it deserved. And when the implementation is autonomous, there may not be a human
reviewer at all.
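The tenant-scoping rule lends itself to the same treatment. The sketch below is a deliberately naive heuristic — a real hook should parse the SQL rather than pattern-match it, and the table names here are hypothetical:

```python
import re

# Hypothetical list of tables that hold per-tenant data.
TENANT_TABLES = {"orders", "invoices", "payments"}

def missing_tenant_scope(sql: str) -> bool:
    """True if the query touches a tenant-owned table but never
    constrains tenant_id. Heuristic only: regex matching will miss
    aliases, subqueries, and ORM-generated SQL."""
    lowered = sql.lower()
    touches = any(
        re.search(rf"\b(from|update|join)\s+{table}\b", lowered)
        for table in TENANT_TABLES
    )
    return touches and "tenant_id" not in lowered
```

Wired into the same pre-write mechanism, a detected violation blocks the write and tells the agent which query is missing its tenant predicate.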
This kind of architectural constraint enforcement connects directly to how we structure
the development workflow. The multi-agent workflow
article covers how Evaluator, Builder, and Verifier agents interact — hooks are the
enforcement layer that runs alongside all of them.
The One That Surprised Us
The three hooks above work the same way: detect a violation, block the write. They enforce things we already knew were important. But there’s a fourth hook that works differently, and it’s the one we keep coming back to.

When we correct the agent — catch a mistake, point it out, tell it the right approach — we want that correction to outlast the session. So we built a hook that logs corrections to a file as they happen. At session end, another hook fires and sends a notification with a summary: every correction from the session, timestamped.

What’s Still Hard
We have several dozen hooks now. They cover code quality, architecture integrity, security patterns, commit formatting, change management. Each one came from something that went wrong.

That’s also the limitation: you can only encode violations you’ve already seen. The hooks don’t protect you from novel mistakes — from the architectural misstep you haven’t made yet, from the edge case in a domain you haven’t built before. For that, you still need human judgment. You need someone who has seen enough systems to recognize a problem that hasn’t manifested yet.

There’s an asymmetry worth sitting with. The more experience you have, the better your hooks will be. The better your hooks are, the more you can trust the agent to operate autonomously. The more the agent operates autonomously, the more you can focus on the decisions that require genuine judgment. The hooks are how you apply your experience at the leverage point where it has the most effect — before bad code is written, not after.

Understanding where AI consistently falls short is just as important as knowing where it
excels. The Code AI Can’t Write (Yet) catalogs
seven categories where AI agents fail consistently — hooks are one mitigation, but they
don’t close every gap.
Why This Matters Beyond AI
The problem we’re describing isn’t actually new. Engineering teams have always had institutional knowledge that lives in people’s heads and gets lost when they leave, gets inconsistently applied under deadline pressure, and fails to transfer cleanly to new team members.

With human engineers, you can rely on relationship and habit to partially compensate. With AI agents, you can’t. There’s no relationship to cultivate, no habit that survives a session boundary. That forced us to be explicit about something most teams leave implicit: what are the rules that are truly non-negotiable, versus the conventions we just prefer? If it’s non-negotiable — if violating it would cause a data leak, a financial bug, an architectural failure — then it should be enforced, not documented.

The same question applies to teams at any scale. The organizations that answer it well — that take their judgment about what matters most and encode it where it can’t be bypassed — operate differently from the ones that keep the knowledge in people’s heads and hope review catches the gaps.

We’re still learning which of our rules deserve that treatment. Several dozen hooks in, we suspect the answer is: more than we thought. Which raises a question we don’t have a clean answer to yet: if you had to enforce your most important engineering standards programmatically — the ones that, if violated, would cause real harm — what would make the list?

All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.