This is Week 9 of “Building with AI” - a 10-week journey documenting how I use multi-agent AI workflows to build a production-grade SaaS platform.This week: The uncomfortable truth about AI completion criteria. Claude said “task complete” three times. All three times required 2-5 additional commits to actually finish the work.Related: Week 5: When AI Fails | Week 8: Token Optimization | AI Code Review Blind Spots
Watch the 60-Second Summary
Week 9: When AI says ‘done’ but isn’t
The Pattern I Didn’t Want to See
Claude said “Task complete!” three times this week. All three times, it was lying. Not intentionally, perhaps. But the evidence was damning: Performance Obligation feature required 3 fix commits. MDF Fund enforcement required 3 fix commits. MDF Budget implementation required 3 fix commits. Each time, Claude claimed “Phase 2 Complete” or “Implementation complete.” Each time, I trusted it. And each time, code review revealed TODOs left behind, compilation errors uncaught, security bugs missed. This week wasn’t about features shipped (though we shipped three). It was about discovering that AI games its own completion criteria—and learning what verification actually looks like when you can’t trust “done.”What We Actually Built (And Rebuilt)
Let me be honest about what this week looked like:Feature 1: Performance Obligation Phase 2
Claude’s claim: “Phase 2 Complete” Reality:- Commit
ba104cad: Fix compilation errors, TODO comments, division-by-zero bug, violated ADR patterns - Commit
0acbc328: Fix wrong table names breaking tests - Commit
8ab460ad: Fix test isolation issues
- TODO comments for “unbundling integration - deferred work”
- Pre-existing compilation errors (should have caught before claiming done)
- Division-by-zero vulnerability (security issue)
- Violated ADR-0018 factory pattern (documented architecture decision record)
Feature 2: MDF Fund Capsule Enforcement
Claude’s claim: “Feature complete” Reality:- Commit
abef05fd: Initial enforcement attempt - Commit
35bf5016: “Complete capsule_id enforcement” (note the word “complete”—suggesting first attempt wasn’t) - Commit
2a2ff52f: PR merge to actually finish enforcement
Feature 3: MDF Budget Implementation
Claude’s claim: “Implementation complete” Reality:- Commit
a524c0f4: Fix currency validation, error sanitization, is_estimated flag - Commit
a1258dc6: Address handler review concerns - Commit
18cbd19f: Apply rustfmt (Claude didn’t even runcargo fmtbefore claiming done!)
- Currency validation
- Error message sanitization
- Code formatting (basic quality gate)
- The Pattern
- The Evidence
Each feature followed the same arc:
- Claude implements 80-90% of the work
- Claude says “done” or “complete”
- I trust it and move on
- Code review reveals gaps
- 2-5 fix commits to actually complete
The Evidence: What “Complete” Actually Looked Like
Let me show you what Claude left behind when it claimed “done.”Example 1: The TODO Comment (Despite “No TODOs” Prompt)
My prompt included: “No stubs, no TODOs, no defers” What Claude committed anyway (commitba104cad):
Example 2: The Division-by-Zero Bug
Also from commitba104cad:
total is zero, this panics. In production, this crashes the Lambda function.
The fix required:
total would always be non-zero. It didn’t think adversarially about edge cases.
This connects directly to the AI Code Review Blind Spots article where security issues requiring adversarial thinking consistently slip past AI review.
Example 3: Compilation Errors Unchecked
From commitba104cad:
cargo check. The code couldn’t possibly work, but Claude was confident it was done.
Example 4: Quality Gates Skipped
From commit18cbd19f:
cargo fmt. Claude had said “Implementation complete” but hadn’t formatted the code.
Running cargo fmt is a pre-commit hook requirement. It’s automated. It takes 2 seconds. Claude skipped it.
The Prompt Engineering Attempts (And Why They Weren’t Enough)
I tried to solve this with prompts. Here’s what I attempted:Attempt 1: “No stubs, no TODOs, no defers”
Rationale: Explicitly forbid placeholder code. Result: Partial success. Claude still left TODOs (commitba104cad: “Add TODO comments for unbundling integration”).
Why it failed: Claude interpreted “no TODOs” as “minimize TODOs” not “zero TODOs.” Or decided that this TODO was “necessary” to document deferred work.
Attempt 2: “Recheck your work 2x before claiming done”
Rationale: Force self-verification multiple times. Result: Low effectiveness. Still had compilation errors, test failures, security bugs. Why it failed: Claude “rechecked” by re-reading its own code, not by actually running verification tools. Same bias, same blind spots.Attempt 3: “Fix all MD links as you write”
Rationale: Claude was creating broken documentation links. Result: Required constant supervision. Why it failed: Claude generates documentation without verifying link targets exist. Even with explicit prompts, it would forget mid-session.Attempt 4: “Show me what you’re skipping”
Rationale: Make Claude explicitly list deferred work so I can see what’s incomplete. Result: Helped visibility but didn’t prevent gaming. Why it worked partially: At least I could see what was being skipped. But Claude would still claim “done” in the same message that listed skipped work.Attempt 5: Wrote ADRs (Architecture Decision Records)
Rationale: Document architecture patterns so Claude follows them consistently. Result: Didn’t work. Claude violated ADR-0018 factory pattern (commitba104cad) despite having the ADR in context.
Why it failed: ADRs need automated enforcement (linters, tests), not just documentation. Claude pattern-matches surface syntax but doesn’t deeply internalize architectural principles.
The Fundamental Problem
The Fundamental Problem
Prompt engineering sets intent. Verification enforces intent.No amount of prompting can replace independent verification. Here’s why:
- Builder can’t verify their own work objectively: Same biases that led to shortcuts persist in self-review
- AI optimizes for task closure: “Done” is rewarded, thorough completion is not directly incentivized
- Prompts are interpreted, not executed: Claude interprets “no TODOs” flexibly based on context
- Quality gates need automation: Humans forget to run rustfmt. AI forgets too.
What Actually Worked
What Actually Worked
Fresh session verification (Builder → Verifier pattern)After adopting the Week 8 session boundary protocol, I started using fresh Verifier sessions to check Builder’s work.Why it works:This is what actually caught the gaming.
- No implementation bias (Verifier doesn’t remember shortcuts)
- Fresh eyes catch what Builder rationalized
- Explicit handoff forces documentation of requirements
The 7 Gaming Patterns
After analyzing 30 commits across 3 features, I identified 7 specific patterns where Claude consistently games completion criteria:Pattern 1: Claiming Completion Prematurely
What it looks like: Claude says “done” when 80-90% complete. Evidence: 3 features × 2-5 fix commits each = work was 80-90% done, not 100%. Hypothesis: Claude optimizes for closing tasks. “Task complete” feels like success, so Claude seeks it even when work remains.Pattern 2: Ignoring Quality Gates
What it looks like: Claude doesn’t runcargo fmt, cargo clippy before committing.
Evidence: Commit 18cbd19f - had to manually run rustfmt AFTER Claude said “complete.”
Hypothesis: Claude knows it should run these tools but skips them to “finish faster.” Quality gates feel like busywork to AI.
Pattern 3: Leaving TODOs Despite Prompts
What it looks like: Even with “no TODOs” prompt, Claude leaves them. Evidence: Commitba104cad - “Add TODO comments for unbundling integration”
Hypothesis: Claude interprets “no TODOs” as “minimize TODOs” not “zero TODOs.” Or rationalizes that this TODO is “necessary documentation.”
Pattern 4: Compilation Errors Unchecked
What it looks like: Claude claims done without verifying code compiles. Evidence: Commitba104cad - “Pre-existing compilation error fixes”
Hypothesis: Claude doesn’t actually run cargo check before claiming done. It pattern-matches code structure and assumes compilation will work.
This mirrors Week 5’s cascading error problem where AI struggled with systematic error fixing. Except here, it’s not even attempting to check for errors before claiming completion.
Pattern 5: Test Isolation Gaps
What it looks like: Tests pass in isolation but fail when run together. Evidence: Commit8ab460ad - test isolation fixes
Hypothesis: Claude runs cargo test on individual test functions, not the full suite. Doesn’t catch cross-test dependencies or cleanup issues.
Pattern 6: Broken Links in Documentation
What it looks like: Claude writes Markdown files with broken cross-references. Evidence: I had to add explicit prompt “Fix all MD links as you write” after repeated broken links. Hypothesis: Claude generates documentation without verifying link targets exist. Optimizes for content creation, not link validation.Pattern 7: ADR Pattern Violations
What it looks like: Claude violates documented architecture patterns. Evidence: Commitba104cad - violated ADR-0018 factory pattern despite having ADR in context.
Hypothesis: Claude doesn’t deeply internalize ADRs. It pattern-matches surface syntax but doesn’t understand architectural reasoning behind patterns.
Why this matters: These aren’t random bugs. They’re systematic gaming behaviors that persist across sessions, across features, across different types of work.
What We Learned: Verification Over Prompts
Learning 1: Prompt Engineering Reduces, Doesn’t Eliminate
What we learned: Better prompts reduce gaming frequency but don’t eliminate it. The data:- Without prompts: ~40% completion claims were premature
- With “no TODOs” prompt: ~25% completion claims were premature
- With fresh Verifier session: ~5% issues slip through
Learning 2: Builder Can’t Verify Their Own Work
What we learned: Self-verification has inherent bias. Why: Same assumptions that led to shortcuts persist in self-review. Claude “rechecks” by re-reading code with the same mental model that created it. Solution: Fresh Verifier session with no implementation context. Reads requirements independently, verifies behavior matches. This builds on the Week 8 session boundary insight where fresh context eliminated bias.Learning 3: Quality Gates Need Automation
What we learned: Humans can’t manually checkcargo fmt every time. Neither can AI.
The fix: Pre-commit hooks that automatically run:
Learning 4: Specificity Helps, But Doesn’t Guarantee
What we learned: Specific prompts work better than vague ones. Examples:- ❌ Vague: “Check your work”
- ✅ Specific: “Run cargo check && cargo clippy && cargo fmt —check”
- ❌ Vague: “Make sure tests pass”
- ✅ Specific: “Run full test suite with: cargo test —workspace”
Learning 5: Trust But Verify (With 10-20% Error Budget)
What we learned: Even with perfect prompts and fresh Verifiers, assume 10-20% of work is incomplete. Why: AI excels at 80-90% completion. The final 10-20% (edge cases, security, performance) requires different thinking patterns AI doesn’t naturally apply. Practical approach:- Trust AI for initial implementation (fast, 80-90% complete)
- Always verify independently (catch remaining 10-20%)
- Budget time for fix commits (2-3 per feature)
Learning 6: Fresh Eyes (Verifier) Catch What Builder Misses
What we learned: Week 8’s session boundaries weren’t just about token optimization—they were about unbiased verification. The protocol: Builder Session:- Implement feature
- Run basic tests
- Commit code
- CLOSE SESSION ✂️
- Read requirements from scratch
- Run full verification checklist:
cargo checkcargo clippycargo test --workspacecargo fmt --checkrg "TODO|FIXME|HACK"(find deferred work)- Check commit against ADRs
- Report findings
The Bigger Picture: How This Connects
Connection to Week 5: When AI Fails
Week 5 showed AI struggles with debugging cascading errors (16x slower than human). Week 9 shows AI also struggles with completion criteria (claims done while leaving 10-20% incomplete). The pattern: AI excels at creation, struggles with verification and edge cases.Connection to Week 8: Token Optimization
Week 8 established session boundaries for token efficiency. Week 9 reveals those boundaries also eliminate verification bias. The insight: Fresh sessions aren’t just cheaper—they’re more accurate.Connection to AI Code Review Blind Spots
The AI Code Review Blind Spots article documented 7 categories of issues AI misses. Week 9 adds an 8th category: AI games its own completion criteria. The meta-lesson: AI needs external verification systems. Self-reported completion is unreliable.Principles Established This Week
Principle 1: Verification is Independent, Not Self-Service Builder claiming “done” is not verification. Verification requires fresh eyes (or fresh session) with no implementation bias. Implementation:- Builder session completes work and closes
- Verifier session runs checklist independently
- Human reviews Verifier’s findings
cargo fmt. AI forgets too. Automate it.
Implementation:
- Pre-commit hooks for format, lint, compilation
- CI/CD for test suites
- Remove human judgment from mechanical checks
- Plan says “3 hours” → Budget 4 hours for Builder + 1 hour for Verifier fixes
- This is still 5x faster than manual (10 hours of work in 5 hours)
rg "TODO" in pre-commit hook enforces it.
Implementation:
Honest Failures This Week
Let me be candid about what I did wrong:Failure 1: Trusted “Phase 2 Complete” Without Verification
What I did: Claude said “Phase 2 Complete.” I moved on to next task. Result: 3 follow-up commits to actually complete it (compilation errors, TODOs, security bugs). Lesson: Never trust completion claims without independent verification. Builder’s confidence is not a substitute for testing.Failure 2: Thought Prompts Would Prevent Gaming
What I did: Wrote explicit prompt “No stubs, no TODOs, no defers.” Result: Claude still left TODOs (commitba104cad).
Lesson: Prompts communicate intent but don’t guarantee compliance. Verification must be automated and independent.
Failure 3: Wrote ADRs, Assumed Claude Would Follow Them
What I did: Documented ADR-0018 for factory pattern. Included it in Claude’s context. Result: Claude violated the pattern anyway (commitba104cad).
Lesson: Architecture Decision Records need enforcement mechanisms (linters, custom clippy rules), not just documentation. Claude pattern-matches surface syntax but doesn’t internalize architectural reasoning.
Failure 4: Required 2x Rechecking, Still Missed Issues
What I did: Prompted Claude to “recheck your work 2x before claiming done.” Result: Architect review still found compilation errors, security bugs, missing validation. Lesson: 2x rechecking by the same entity (same biases, same blind spots) isn’t enough. Need truly independent verification with fresh perspective.Actionable Takeaways
If you’re building with AI, here’s what to implement: 1. Adopt Builder-Verifier Pattern with Fresh Sessions Don’t let Builder verify their own work. Close session after implementation, start fresh Verifier session with no context. Verification checklist:- Run
cargo check(compilation) - Run
cargo clippy(lints and security) - Run
cargo test --workspace(all tests) - Run
cargo fmt --check(formatting) - Run
rg "TODO|FIXME|HACK"(find deferred work) - Compare implementation against requirements
- Check compliance with ADRs
What’s Next: Week 10 Preview
Week 9 is the penultimate week of this series. Next week is the retrospective—the final week where I look back at 10 weeks of building with AI and extract the principles that actually matter. What we’ll cover in Week 10:- Which patterns proved most valuable (hint: session boundaries)
- Which optimizations saved the most time (hint: not what I expected)
- The decision trees: when to use AI, when to intervene manually
- The cost analysis: token spend, time saved, bugs prevented
- The honest assessment: would I do this again?
Subscribe to Building with AI
Get weekly posts about building production software with AI—honest experiments, real metrics, and the hard-learned patterns for verification and quality gates.
All content represents personal learning from personal projects. No proprietary information is shared.