The agent used a float for money. Not Decimal. Not a typed newtype. A plain f64 on a field called amount. It compiled. The tests passed. The PR looked reasonable at a glance. We caught it in review — but only because we happened to look at that file. We added a note to the system prompt: never use f32 or f64 for financial fields, use Decimal or a typed newtype. The next session, same agent, same rule — it worked. Then a week later, different task, different file path, same mistake. Floats for money. That’s when we understood the actual problem.

The Institutional Memory Problem

Senior engineers carry accumulated judgment. Not just syntax — real architectural knowledge. Never use floats for money. Every database query must scope by tenant. Domain layer never imports from infrastructure. These aren’t rules you look up. They’re things you learned from a production incident or a code review from someone more experienced, and they became reflexive.

When you hire a junior engineer, you transfer that knowledge gradually — through pairing, through review, through the slow accumulation of corrections. It sticks because the engineer remembers. They build a mental model. They generalize.

AI agents don’t do that. Every session starts from scratch. There’s no memory of the conversation you had two weeks ago. There’s no “oh right, we don’t do that here.” The prompt you wrote last week is gone. The correction you made yesterday has no effect on today’s session.

We’d been operating as if a better system prompt was the answer. Longer, more detailed, more explicit. It helped — but it didn’t solve it. Long prompts get deprioritized mid-task. The agent reads the prompt, understands it, then gets deep into implementation, and the edge case doesn’t pattern-match against something from the top of a 3,000-word context block. The knowledge wasn’t the problem. The location of the knowledge was the problem.
If you’ve watched AI agents spiral through cascading errors while trying to fix their own mistakes, the root cause is often the same: no persistent enforcement layer. When AI fails in cascading errors covers what that looks like at scale.

Enforcement, Not Documentation

The Hook Shield Enforcement Layer

The insight that changed things: you can’t teach an AI agent the way you’d teach an engineer. But you can make the wrong thing impossible.

That’s how type systems work. The Rust compiler doesn’t trust you to remember that you shouldn’t mix up the UserId and TenantId types — it enforces the distinction at compile time. You’re not relying on the developer’s memory. You’re removing memory from the equation entirely. The Rust newtype pattern is a canonical example of this: encode domain constraints into the type system rather than into documentation.

Claude Code’s hook system works on the same principle: small scripts that run before or after tool use. Before the agent writes a file, the hook runs. If the hook exits with a blocking error, the write is stopped and the agent sees the error message.

We stopped adding rules to the prompt. We started encoding them as hooks. The difference isn’t subtle. A rule in a prompt is advice the agent might follow. A hook is a constraint the agent cannot violate. That distinction matters more than it sounds — it’s the difference between a sign that says “don’t touch the hot stove” and a stove that doesn’t get hot.
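For context, wiring a script in as a PreToolUse hook looks roughly like this in Claude Code’s .claude/settings.json — the matcher and the script path here are illustrative, not from our actual config:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/check-money-floats.sh"
          }
        ]
      }
    ]
  }
}
```

The matcher restricts the hook to file-writing tools, so it doesn’t fire on reads or shell commands.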

What We Encoded

The float problem:
```bash
#!/usr/bin/env bash
# PreToolUse hook — runs before every file write.
# Claude Code passes the pending tool call as JSON on stdin;
# for the Write tool, tool_input carries the content about to be written.
CONTENT=$(jq -r '.tool_input.content // empty')

# If the file uses f32/f64 on a money-related field, block it.
if echo "$CONTENT" | grep -qiE '(amount|price|balance|fee|revenue)\s*:\s*(f32|f64)'; then
  echo "HOOK BLOCKED: f32/f64 on financial field — use Decimal or a typed newtype" >&2
  exit 2  # exit code 2 blocks the tool call; stderr is fed back to the agent
fi
```
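To see what the pattern does and doesn’t catch, you can exercise it directly in a shell. The sample struct lines below are made up for illustration:

```shell
pattern='(amount|price|balance|fee|revenue)\s*:\s*(f32|f64)'

# A float on a money-named field matches, so the hook would block it.
echo 'struct Invoice { amount: f64 }' | grep -qiE "$pattern" && echo "caught"

# A Decimal field does not match, so the write passes through.
echo 'struct Invoice { amount: Decimal }' | grep -qiE "$pattern" || echo "passed"
```

The field-name list is the weak point: a float on a field called `total` slips through until you widen the pattern.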
The hook reads the content the agent is about to write, checks for the pattern, and blocks the write if it matches. The agent gets the error message, understands why, and corrects course — in the same session, without any human intervention. That float incident never happened again.

The tenant isolation problem: This one mattered more. We’re building a multi-tenant SaaS. Every database query has to scope by tenant. One missing WHERE tenant_id = ? condition is a data leak between customers — the kind of mistake that has legal consequences. A human reviewer would catch it. But we were moving fast, and not every PR got the scrutiny it deserved. And when the implementation is autonomous, there may not be a human reviewer at all.
```bash
#!/usr/bin/env bash
# PreToolUse hook — parse the pending write from stdin JSON.
INPUT=$(cat)
FILE_PATH=$(jq -r '.tool_input.file_path // empty' <<<"$INPUT")
CONTENT=$(jq -r '.tool_input.content // empty' <<<"$INPUT")

# If the file is a repository/query file and runs database operations
# without referencing tenant_id anywhere — block it.
# (The path pattern below is illustrative; match your own repo layout.)
if echo "$FILE_PATH" | grep -qE '/(repositories|queries)/'; then
  if ! echo "$CONTENT" | grep -qi 'tenant_id\|tenant\.id\|TenantId'; then
    echo "HOOK BLOCKED: database queries in $(basename "$FILE_PATH") without tenant_id" >&2
    echo "All queries in multi-tenant repos must scope by tenant" >&2
    exit 2
  fi
fi
```
This isn’t checking logic. It’s checking for the presence of the concept. If you’re writing a query file and tenant_id doesn’t appear anywhere, something is wrong. Block it. Make the agent explain itself or add the scoping.

The architecture problem: We use Domain-Driven Design: domain layer, application layer, infrastructure layer. The rule is strict — the domain layer must be dependency-free. It cannot import from infrastructure. It cannot know that a specific database technology exists. That discipline is what makes the domain testable and the infrastructure swappable. An AI agent doesn’t internalize architecture philosophy. It sees a struct in the domain layer that needs a database call, and it imports from infrastructure because that’s the fastest path to compilation.
```bash
#!/usr/bin/env bash
# PreToolUse hook — parse the pending write from stdin JSON.
INPUT=$(cat)
FILE_PATH=$(jq -r '.tool_input.file_path // empty' <<<"$INPUT")
CONTENT=$(jq -r '.tool_input.content // empty' <<<"$INPUT")

# If the file is in the domain layer and imports from infrastructure or API — block it.
if echo "$FILE_PATH" | grep -qiE '/domain/'; then
  ILLEGAL=$(echo "$CONTENT" | grep -nE '^use (crate::|super::)*(infrastructure|infra|api)::')
  if [[ -n "$ILLEGAL" ]]; then
    echo "HOOK BLOCKED: Domain layer imports infrastructure — domain must be dependency-free" >&2
    echo "$ILLEGAL" >&2  # show the offending import lines, with line numbers
    exit 2
  fi
fi
```
The agent can’t shortcut past the architecture. Not because it understands why the separation matters — but because the rule is enforced before the file is written.
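These checks are also easy to smoke-test outside a session. A minimal sketch — the function name and sample values are made up here, and a real hook would read the path and content from the stdin JSON rather than taking arguments:

```shell
# check_domain_import FILE_PATH CONTENT — mirrors the domain-layer hook's logic.
check_domain_import() {
  local file_path="$1" content="$2"
  if echo "$file_path" | grep -qiE '/domain/'; then
    if echo "$content" | grep -qE '^use (crate::|super::)*(infrastructure|infra|api)::'; then
      echo "HOOK BLOCKED: domain layer imports infrastructure" >&2
      return 2
    fi
  fi
  return 0
}

# A domain file pulling in infrastructure is rejected.
check_domain_import 'src/domain/user.rs' 'use crate::infrastructure::db::Pool;' \
  || echo "blocked as expected"

# The same import outside the domain layer is allowed.
check_domain_import 'src/api/routes.rs' 'use crate::infrastructure::db::Pool;' \
  && echo "allowed outside domain"
```

Keeping the matching logic in a function like this makes the hook trivially testable before you ever let it gate a real session.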
This kind of architectural constraint enforcement connects directly to how we structure the development workflow. The multi-agent workflow article covers how Evaluator, Builder, and Verifier agents interact — hooks are the enforcement layer that runs alongside all of them.

The One That Surprised Us

The three hooks above work the same way: detect a violation, block the write. They enforce things we already knew were important. But there’s a fourth hook that works differently, and it’s the one we keep coming back to. When we correct the agent — catch a mistake, point it out, tell it the right approach — we want that correction to outlast the session. So we built a hook that logs corrections to a file as they happen. At session end, another hook fires and sends a notification with a summary: every correction from the session, timestamped.
```bash
#!/usr/bin/env bash
# Called explicitly when a correction occurs
# Usage: log-correction.sh "description of what was corrected"

LOGFILE="/tmp/corrections-$(date +%Y%m%d).log"
TIMESTAMP=$(date -u '+%Y-%m-%dT%H:%MZ')
echo "[$TIMESTAMP] $*" >> "$LOGFILE"
```
It’s simple. But what it produces is a record of where the agent’s judgment diverged from ours. That record is how new hooks get written. The float rule came from a correction. The tenant isolation rule came from a correction. The pattern is consistent: you catch something once manually, you log it, you build a hook, you never catch it manually again. The enforcement layer grows from experience. It’s a living record of every mistake the system has made, transformed into a constraint the system can no longer make. We found that oddly satisfying — a feedback loop that actually closes.
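The session-end notification that reads this log can be nearly as small. A sketch of what such a Stop hook might look like, with the notification transport (mail, Slack, anything) left out:

```shell
# Stop-hook sketch: summarize the day's corrections at session end.
LOGFILE="/tmp/corrections-$(date +%Y%m%d).log"

if [[ -s "$LOGFILE" ]]; then
  echo "Corrections logged today:"
  cat "$LOGFILE"
else
  echo "No corrections logged today."
fi
```

Piping that output into whatever notifier you already use is the only project-specific part.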

What’s Still Hard

We have several dozen hooks now. They cover code quality, architecture integrity, security patterns, commit formatting, change management. Each one came from something that went wrong. That’s also the limitation: you can only encode violations you’ve already seen. The hooks don’t protect you from novel mistakes — from the architectural misstep you haven’t made yet, from the edge case in a domain you haven’t built before. For that, you still need human judgment. You need someone who has seen enough systems to recognize a problem that hasn’t manifested yet.

There’s an asymmetry worth sitting with. The more experience you have, the better your hooks will be. The better your hooks are, the more you can trust the agent to operate autonomously. The more the agent operates autonomously, the more you can focus on the decisions that require genuine judgment. The hooks are how you apply your experience at the leverage point where it has the most effect — before bad code is written, not after.
Understanding where AI consistently falls short is just as important as knowing where it excels. The Code AI Can’t Write (Yet) catalogs seven categories where AI agents fail consistently — hooks are one mitigation, but they don’t close every gap.

Why This Matters Beyond AI

The problem we’re describing isn’t actually new. Engineering teams have always had institutional knowledge that lives in people’s heads and gets lost when they leave, gets inconsistently applied under deadline pressure, and fails to transfer cleanly to new team members.

With human engineers, you can rely on relationship and habit to partially compensate. With AI agents, you can’t. There’s no relationship to cultivate, no habit that survives a session boundary. That forced us to be explicit about something most teams leave implicit: which rules are truly non-negotiable, versus conventions we just prefer? If a rule is non-negotiable — if violating it would cause a data leak, a financial bug, an architectural failure — then it should be enforced, not documented.

The same question applies to teams at any scale. The organizations that answer it well — that take their judgment about what matters most and encode it where it can’t be bypassed — operate differently from the ones that keep the knowledge in people’s heads and hope review catches the gaps.

We’re still learning which of our rules deserve that treatment. Several dozen hooks in, we suspect the answer is: more than we thought. Which raises a question we don’t have a clean answer to yet: if you had to enforce your most important engineering standards programmatically — the ones that, if violated, would cause real harm — what would make the list?
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.