Four weeks of building a production SaaS platform with AI agents. Here's what actually happened: metrics, failures, learnings, and the patterns that emerged.

By the Numbers

Code Output

Production Code:
  • 6,800+ lines of production code (Week 2 CRM domain alone)
  • 2,400+ lines of test code
  • 4,702 lines of boilerplate eliminated through macros (94% reduction)
  • 6,862 lines for billing/accounting foundation (95 files)
  • 2,816 lines of authorization infrastructure
Commits & Files:
  • 524 total commits (Dec 31, 2025 - Jan 30, 2026)
  • Week 2: 216 commits for CRM foundation
  • Week 3: 72 commits for event sourcing patterns
  • Week 4: 107 commits in one week (8-10x speedup)
  • Breaking changes migration: 1,003 tests updated across 6 entities
System Scale:
  • 15 entity types
  • 37 API routes
  • 8 background workers
  • 3 event processing pipelines
  • 21 E2E test scenarios
  • 92% test coverage (8,464 of 9,200 lines covered)

Speed & Productivity

Speedup Multipliers:
  • Plan → Implement → Verify workflow: 4.4x faster than manual estimate
  • Systematic work (boilerplate, patterns): 8-10x speedup
  • Breaking changes at scale: 5-7x faster (4 days vs 4-6 weeks manual)
  • Macro implementation: Single commit reducing each entity from 812 lines to 43
Time Savings:
  • Week 2 alone: 120 hours saved on CRM domain implementation
  • Breaking changes: Saved 4-6 weeks of manual refactoring
Time Costs:
  • Mistake from skipping planning: 3 hours wasted, 60% code rewrite
  • Cascading error chase (AI): 24 hours, 31 commits
  • Same errors (human fix): 90 minutes, 3 commits

Quality Metrics

Bugs Found:
  • Week 2: 18 bugs found by Verifier before merge
    • Missing edge cases: 8 instances
    • Requirement gaps: 6 instances
    • Cross-entity inconsistencies: 4 instances
  • Week 3: 6 isolation violations caught in test environment
  • Week 3+: Zero capsule isolation bugs after pattern hardening
Verification Effectiveness:
  • Average time to fix Verifier issues: 45 minutes
  • Post-Week 3: Compile-time guarantees prevent entire bug classes
  • False positive rate: Minimal (Verifier focused on real issues)

Cost & ROI

Token Usage & Cost (Week 2 Example):
  • Evaluator (Opus): 145k tokens = $2.18
  • Builder (Sonnet): 520k tokens = $7.80
  • Verifier (Sonnet): 180k tokens = $2.70
  • Total: $12.68 for 120 hours of manual work
Return on Investment:
  • Week 2 ROI: 946x (120 hours @ ~$127/hr vs $12.68 in tokens)
  • Average developer cost: $127/hour (fully loaded)
  • Token cost: $0.10-0.15 per hour of equivalent work

What Worked

1. Multi-Agent Workflow

Pattern: Evaluator → Builder → Verifier
The three-agent separation proved critical:
  • Evaluator (Opus): Architectural planning, pattern selection, requirement analysis
  • Builder (Sonnet): Implementation, test generation, boilerplate creation
  • Verifier (Sonnet, fresh session): Independent verification, edge case discovery
Key insight: A fresh Verifier session caught 18 bugs in Week 2 that Builder's own tests missed. The independent context is non-negotiable.
Mistake avoided: A Week 1 experiment reused the Builder session for verification → Builder was biased toward its own implementation and missed obvious bugs.
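To make the split concrete, here is a minimal sketch of the role separation as data; the names and shapes are illustrative, since the orchestration code itself isn't part of this write-up:
// Hypothetical sketch of the three-agent split. Model assignments mirror the
// setup described above; this is not a real orchestration API.
enum Model { Opus, Sonnet }

struct AgentRole {
    name: &'static str,
    model: Model,
    responsibility: &'static str,
}

// The Verifier always starts in a fresh session, independent of the Builder.
const WORKFLOW: [AgentRole; 3] = [
    AgentRole { name: "Evaluator", model: Model::Opus,   responsibility: "plan architecture, select patterns" },
    AgentRole { name: "Builder",   model: Model::Sonnet, responsibility: "implement code and tests" },
    AgentRole { name: "Verifier",  model: Model::Sonnet, responsibility: "independent review in a fresh session" },
];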

2. Plan → Implement → Verify Gates

Three-phase workflow with quality gates:
Phase 1: Plan (20% of time)
  • Evaluator analyzes requirements
  • Chooses patterns and architecture
  • Identifies edge cases and dependencies
  • Outputs implementation plan
Phase 2: Implement (60% of time)
  • Builder follows plan
  • Generates code and tests
  • Documents decisions
Phase 3: Verify (20% of time)
  • Verifier reviews independently
  • Finds gaps and inconsistencies
  • Reports blocking issues
Key metric: 20% planning time saves 80% of the rework. The Week 1 experiment that skipped planning cost 3 hours and a 60% code rewrite.
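A rough sketch of the gate idea, using hypothetical names since the workflow code isn't part of this write-up. Each phase has an explicit output, and the next phase only starts once its gate passes:
// Hypothetical gate model: each phase must produce its artifact before the next begins.
enum Phase {
    Plan,      // ~20% of time: Evaluator produces an implementation plan
    Implement, // ~60% of time: Builder produces code + tests from the plan
    Verify,    // ~20% of time: fresh Verifier produces a report of blocking issues
}

struct GateResult {
    phase: Phase,
    blocking_issues: Vec<String>,
}

fn gate_passed(result: &GateResult) -> bool {
    // A phase passes only when its blocking issues are empty;
    // otherwise the plan is revised rather than patched in-session.
    result.blocking_issues.is_empty()
}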

3. Systematic Work Amplification

AI excelled at (8-10x speedup):
  • Boilerplate generation: 4,702 lines of repository patterns automated with macros
  • Test scenario creation: 21 E2E test cases with EventCollector infrastructure
  • Pattern application: Consistent PK/SK patterns across 15 entity types
  • Documentation: 35-page organization model (CLAUDE.md) that Builder actually followed
Example: Account entity repository implementation:
  • Before macros: 812 lines of hand-written code
  • After macros: 43 lines with 5 derive annotations
  • AI-generated macro: Single commit, 94% reduction
Why this worked: Systematic work has clear patterns, well-documented APIs, and formulaic structure. AI pattern-matches effectively.
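For a sense of what the 43-line version looks like, here is a hedged sketch; the derive and attribute names are the ones used in this series, but the exact five derives and the field list are assumptions:
// Illustrative only, condensed from the ~43-line shape. DomainAggregate,
// InMemoryRepository, and CachedRepository are named in this series; the
// remaining derives and the fields are assumed.
#[derive(DomainAggregate, InMemoryRepository, CachedRepository, Serialize, Deserialize)]
#[capsule_isolated] // enforces tenant_id + capsule_id fields (see section 4)
pub struct Account {
    pub id: AccountId,
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,
    pub name: String,
    // ...remaining domain fields
}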

4. Compile-Time Guarantees

Event pattern macros (Week 3): Rather than relying on runtime validation or AI remembering rules, we encoded architectural invariants as compile-time checks:
#[derive(DomainAggregate, DomainEvent)]
#[capsule_isolated]  // Enforces tenant_id + capsule_id fields
pub struct Lead { ... }
Impact:
  • Week 3: 6 isolation violations caught in test
  • Post-Week 3: Zero isolation bugs - invalid code won’t compile
Lesson: Don’t ask AI to “remember” architectural rules. Encode them in types and macros. AI generates code that satisfies the compiler, compiler enforces rules.
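Here is a minimal sketch of how that can work, assuming a hypothetical CapsuleIsolated trait; the real macro internals aren't shown here, and the stub newtypes exist only so the sketch stands alone:
// Assumed stubs so the example is self-contained.
pub struct TenantId(pub String);
pub struct CapsuleId(pub String);

pub struct Lead {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,
    // ...other fields
}

// One way #[capsule_isolated] can enforce the invariant: emit a trait impl whose
// accessors reference self.tenant_id and self.capsule_id. A struct missing either
// field won't compile, so the rule never depends on the AI "remembering" it.
pub trait CapsuleIsolated {
    fn tenant_id(&self) -> &TenantId;
    fn capsule_id(&self) -> &CapsuleId;
}

impl CapsuleIsolated for Lead {
    fn tenant_id(&self) -> &TenantId { &self.tenant_id }
    fn capsule_id(&self) -> &CapsuleId { &self.capsule_id }
}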

5. Documentation as Architecture

CLAUDE.md organization model (35 pages): Week 4 created comprehensive project documentation that Builder agents actually followed:
  • Project structure and conventions
  • Event patterns and naming
  • Testing strategy (4 levels)
  • Domain model relationships
Before CLAUDE.md: Builder invented inconsistent patterns across entities.
After CLAUDE.md: Builder suggested better designs aligned with the documented patterns.
Key insight: AI needs external memory. Well-structured documentation becomes architectural guidance that AI follows consistently.

What Failed

1. Cascading Compilation Errors

The disaster (Week 3): We changed the macro signature for InMemoryRepository:
// Old
fn pk_for_id(tenant_id, capsule_id, id) -> String

// New
fn pk_for_id(self, id) -> String  // ❌ Breaking change
AI’s response:
  • 24 hours of fix attempts
  • 31 commits
  • Cascading errors across 95 files
  • Fix-in-session anti-pattern (each fix broke something else)
Human fix:
  • 90 minutes
  • 3 commits
  • Methodical approach: update signature → fix call sites → fix tests
What went wrong:
  • AI lost context across large refactorings
  • Each incremental fix created new compilation errors
  • Builder couldn’t see full dependency graph
  • Verifier couldn’t run until code compiled
Lesson: Large-scale breaking changes are still human work. AI excels at localized changes, not system-wide refactoring with cascading dependencies.

2. Cross-Entity Consistency

The bug (Week 2): The Opportunity entity used the wrong ID type for its foreign key to Account:
// Account
pub struct Account {
    pub id: AccountId,  // newtype wrapping a UUID v4
}

// Opportunity (WRONG)
pub struct Opportunity {
    pub account_id: String,  // free-form "ACC-{ulid}" string instead of AccountId
}
Why it happened:
  • Verifier checked each entity independently
  • No cross-entity consistency validation
  • Relationship bugs only appeared during integration
Fix: Added a cross-entity verification step that checks foreign key types, ID formats, event patterns, and API route consistency.
Result: 4 cross-entity inconsistencies caught in subsequent weeks.
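For illustration, the corrected relationship is just a shared, typed ID, which is exactly what the cross-entity check verifies (field lists abbreviated):
use uuid::Uuid;

// Both entities share the AccountId newtype, so a free-form "ACC-{ulid}" string
// can no longer type-check as a foreign key.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct AccountId(pub Uuid);

pub struct Account {
    pub id: AccountId,
    // ...
}

pub struct Opportunity {
    pub account_id: AccountId, // typed foreign key instead of String
    // ...
}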

3. Novel Problem Struggles

Pattern confusion: AI applied the wrong patterns when the problem didn't match known examples:
  • Used generic repository pattern for event-sourced entity (should use event store)
  • Suggested REST endpoints for background job orchestration (should use message queue)
  • Missed multi-tenancy implications for new feature
Why: AI pattern-matches, and novel combinations of requirements have no clear match in its training data.
Workaround: The Evaluator phase explicitly calls out "novel aspects" and requires human review before Builder proceeds.

4. Fix-in-Session Anti-Pattern

Observed behavior: When verification fails, Builder tries to fix issues in same session:
  • Gets attached to original approach
  • Makes incremental patches
  • Creates technical debt
  • Misses opportunity for better design
Better approach:
  • Close Builder session
  • Review Verifier report as human
  • Decide: quick fix or redesign?
  • Start fresh Builder session with updated plan
Metric: Fixes from fresh sessions were higher quality and faster (45 min avg vs multiple hours in same session).

Emerging Patterns

1. The Verification Ladder

Level 1: Intra-Entity Verification
  • Does code compile?
  • Do tests pass?
  • Are requirements met?
Level 2: Inter-Entity Verification
  • Are foreign key types consistent?
  • Do event patterns match?
  • Are API routes consistent?
Level 3: Architectural Verification
  • Does solution follow documented patterns?
  • Are isolation boundaries respected?
  • Is error handling consistent?
Level 4: Domain Verification
  • Do models accurately represent business domain?
  • Are edge cases handled?
  • Is the abstraction future-proof?
Observation: AI handles Level 1-2 well. Levels 3-4 require human judgment.

2. The Documentation Flywheel

Week 1-2: AI creates inconsistent patterns → human documents what works → CLAUDE.md
Week 3-4: AI follows CLAUDE.md → generates consistent code → human updates it with new learnings
Result: Documentation quality improves, AI suggestions get better, and human reviews get faster.
Metric: Week 4 produced 107 commits with fewer Verifier rejections than Week 2's 216 commits.

3. The Macro Threshold

When to create a macro: If you’ve written the same pattern >3 times, automate it:
  • DomainAggregate derive macro (after 7 similar entities)
  • InMemoryRepository derive macro (after 5 repository implementations)
  • CachedRepository derive macro (after 3 caching layers)
ROI calculation:
  • Macro creation cost: ~2-4 hours (Builder + Verifier)
  • Per-entity manual cost: ~3-4 hours
  • Break-even: 1 entity
  • Actual savings: 94% reduction after 15 entities

4. The Test Worth Threshold

AI made tests "worth it" that weren't before:
Level 3 Tests (Integration - DynamoDB + SQS):
  • Manual effort: 2-3 hours per test
  • AI effort: 20-30 minutes per test
  • Coverage: 8 integration test scenarios in Week 2
Level 4 Tests (E2E - Full event flows):
  • Manual effort: 4-6 hours per scenario
  • AI effort: 45-60 minutes per scenario
  • Coverage: 21 E2E scenarios in Week 4
Key insight: AI changed the economics of testing. Tests that were “too expensive” became economically viable.
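For context, here is a heavily hedged sketch of a Level 4 scenario. EventCollector is the project's name for the infrastructure, but every method and helper in this sketch is an assumed shape, not the documented API:
// Hypothetical E2E sketch: drive the API end to end, then assert on the emitted
// event flow. spawn_test_app, attach, wait_for, and the route helpers are all
// illustrative names, not the project's real API.
#[tokio::test]
async fn converting_a_lead_emits_the_full_event_flow() {
    let app = spawn_test_app().await;                 // assumed test harness
    let events = EventCollector::attach(&app).await;  // assumed: taps the event pipeline

    let lead_id = app.create_lead("ACME").await;
    app.convert_lead(lead_id).await;

    // Assert the full event flow, not just the HTTP response.
    events
        .wait_for(&["LeadCreated", "LeadConverted", "OpportunityCreated"])
        .await
        .expect("event sequence within timeout");
}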

Surprising Learnings

1. Token Cost Is Negligible

Expected: Token costs would be a significant concern.
Reality: $12.68 for 120 hours of work (Week 2). Even at 10x that cost, the ROI would still be ~94x.
Implication: Don't optimize for token usage; optimize for quality and speed. Use Opus for planning where architecture matters: the $2.18 planning cost saved hours of rework.

2. Fresh Context > Smart Prompts

Expected: Better prompts would improve AI output quality.
Reality: A fresh Verifier session with basic prompts caught more bugs than elaborate prompts in the same Builder session.
Implication: Session management and agent roles matter more than prompt engineering. Independent verification is the forcing function for quality.

3. AI Documents Better Than Humans

Expected: AI would generate minimal documentation.
Reality: Builder created a 35-page organization model that was more comprehensive and consistent than our human-written docs.
Why: AI has no aversion to "documentation debt"; it documents as naturally as it codes, while humans skip docs to save time.

4. Compile-Time > Runtime > Trust

Evolution of the validation strategy:
Week 1-2: Trust AI to follow patterns → bugs in production
Week 3: Add runtime validation for patterns → issues caught in tests
Week 3+: Encode patterns in types/macros → invalid code won't compile
Key learning: The best validation makes invalid states unrepresentable. Zero capsule isolation bugs after Week 3, because capsule-isolated entities enforce isolation at compile time.

5. Breaking Changes Still Hurt

Expected: AI would handle refactoring well.
Reality: A large breaking change (the macro signature change) took AI 24 hours vs. 90 minutes for a human.
Why: AI excels at localized changes with clear context; system-wide refactoring with cascading dependencies overwhelms context windows.
Workaround: Have a human make the breaking change, then use AI to apply the fixes consistently across the codebase.

Recommendations

Based on 4 weeks and 524 commits:
1. Start with multi-agent workflow
  • Don’t try to do everything in one session
  • Invest in Evaluator planning upfront
  • Always use fresh Verifier session
2. Document as you go
  • Create CLAUDE.md from Week 1
  • Update with patterns that work
  • AI will follow documented conventions
3. Encode rules in types
  • Don’t rely on AI “remembering” architectural constraints
  • Use macros for cross-cutting concerns
  • Make invalid states unrepresentable
4. Know when to switch to human
  • Large breaking changes
  • Novel problem combinations
  • Cross-entity architectural decisions
  • System-wide refactoring
5. Measure everything
  • Track speedup multipliers per work type
  • Monitor bug sources (Verifier vs production)
  • Calculate actual ROI (token cost vs time saved)
6. Use economics as guide
  • AI changes what’s “worth doing” (comprehensive tests, docs)
  • Token cost is negligible compared to developer time
  • Don’t optimize for token usage - optimize for quality

What’s Next

The 4-week foundation is in place:
  • Multi-agent workflow proven (4.4x-10x speedup)
  • 15 entity types with zero isolation bugs
  • 92% test coverage with AI-economical E2E tests
  • Patterns documented and consistently applied
Next challenges:
  • Scale to 100+ entities (will macros hold up?)
  • Multi-service integration (event-driven architecture)
  • Production performance optimization
  • AI-assisted debugging at scale
Key question: Does the Plan → Implement → Verify workflow scale to multi-service, multi-team development? Or do new coordination patterns emerge? The journey continues.

Discussion

What surprised you most from these metrics? Where does your AI experience differ? Share your learnings.