By the Numbers
Code Output
Production Code:
- 6,800+ lines of production code (Week 2 CRM domain alone)
- 2,400+ lines of test code
- 4,702 lines of boilerplate eliminated through macros (94% reduction)
- 6,862 lines for billing/accounting foundation (95 files)
- 2,816 lines of authorization infrastructure
- 524 total commits (Dec 31, 2025 - Jan 30, 2026)
- Week 2: 216 commits for CRM foundation
- Week 3: 72 commits for event sourcing patterns
- Week 4: 107 commits in one week (8-10x speedup)
- Breaking changes migration: 1,003 tests updated across 6 entities
- 15 entity types
- 37 API routes
- 8 background workers
- 3 event processing pipelines
- 21 E2E test scenarios
- 92% test coverage (8,464 of 9,200 lines covered)
Speed & Productivity
Speedup Multipliers:
- Plan → Implement → Verify workflow: 4.4x faster than manual estimate
- Systematic work (boilerplate, patterns): 8-10x speedup
- Breaking changes at scale: 5-7x faster (4 days vs 4-6 weeks manual)
- Macro implementation: a single commit cut per-entity boilerplate from 812 lines to 43
- Week 2 alone: 120 hours saved on CRM domain implementation
- Breaking changes: Saved 4-6 weeks of manual refactoring
- Mistake from skipping planning: 3 hours wasted, 60% code rewrite
- Cascading error chase (AI): 24 hours, 31 commits
- Same errors (human fix): 90 minutes, 3 commits
Quality Metrics
Bugs Found:
- Week 2: 18 bugs found by Verifier before merge
  - Missing edge cases: 8 instances
  - Requirement gaps: 6 instances
  - Cross-entity inconsistencies: 4 instances
- Week 3: 6 isolation violations caught in test environment
- Week 3+: Zero capsule isolation bugs after pattern hardening
- Average time to fix Verifier issues: 45 minutes
- Post-Week 3: Compile-time guarantees prevent entire bug classes
- False positive rate: Minimal (Verifier focused on real issues)
Cost & ROI
Token Usage & Cost (Week 2 Example):
- Evaluator (Opus): 145k tokens = $2.18
- Builder (Sonnet): 520k tokens = $7.80
- Verifier (Sonnet): 180k tokens = $2.70
- Total: $12.68 in tokens for work estimated at ~120 manual hours
- Week 2 ROI: 946x (~120 hours of work for ~$12.68 in tokens)
- Average developer cost: $127/hour (fully loaded)
- Token cost: $0.10-0.15 per hour of equivalent work
What Worked
1. Multi-Agent Workflow
Pattern: Evaluator → Builder → Verifier
The three-agent separation proved critical:
- Evaluator (Opus): Architectural planning, pattern selection, requirement analysis
- Builder (Sonnet): Implementation, test generation, boilerplate creation
- Verifier (Sonnet, fresh session): Independent verification, edge case discovery
2. Plan → Implement → Verify Gates
Three-phase workflow with quality gates:
Phase 1: Plan (20% of time)
- Evaluator analyzes requirements
- Chooses patterns and architecture
- Identifies edge cases and dependencies
- Outputs implementation plan
Phase 2: Implement
- Builder follows plan
- Generates code and tests
- Documents decisions
Phase 3: Verify
- Verifier reviews independently
- Finds gaps and inconsistencies
- Reports blocking issues
3. Systematic Work Amplification
AI excelled at (8-10x speedup):
- Boilerplate generation: 4,702 lines of repository patterns automated with macros
- Test scenario creation: 21 E2E test cases with EventCollector infrastructure
- Pattern application: Consistent PK/SK patterns across 15 entity types
- Documentation: 35-page organization model (CLAUDE.md) that Builder actually followed
The repository macro shows the leverage (a usage sketch follows below):
- Before macros: 812 lines of hand-written code per entity
- After macros: 43 lines with 5 derive annotations
- AI-generated macro: single commit, 94% reduction
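For concreteness, here is a hedged sketch of what the "after" side of that reduction can look like. The derive names DomainAggregate, InMemoryRepository, and CachedRepository come from this write-up (commented out so the sketch compiles on its own); the Account entity, its fields, and the key format are illustrative assumptions, not the project's actual code.

```rust
// A sketch of "43 lines with 5 derive annotations": the project-specific derives
// are commented out, everything else is standard Rust.
// #[derive(DomainAggregate, InMemoryRepository, CachedRepository)]
#[derive(Clone, Debug)]
pub struct Account {
    /// Partition key, e.g. "ACCOUNT#<uuid>" under a single-table PK/SK convention (assumed).
    pub id: String,
    /// Tenant / capsule the entity belongs to.
    pub tenant_id: String,
    pub name: String,
}

fn main() {
    let account = Account {
        id: "ACCOUNT#123".to_string(),
        tenant_id: "TENANT#42".to_string(),
        name: "Acme".to_string(),
    };
    println!("{account:?}");
}
```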
4. Compile-Time Guarantees
Event pattern macros (Week 3): rather than relying on runtime validation or on the AI remembering rules, we encoded architectural invariants as compile-time checks:
- Week 3: 6 isolation violations caught in test
- Post-Week 3: zero isolation bugs; invalid code won’t compile (see the sketch below)
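As a hedged illustration of the underlying idea (not the project's actual macros), the sketch below shows how a capsule boundary can be enforced at compile time with a zero-cost phantom type parameter: code that hands a Billing ID to a CRM handler simply does not compile. The Crm, Billing, and CapsuleId names are assumptions for the example.

```rust
use std::marker::PhantomData;

// Marker types for capsules (illustrative names; the real project presumably
// generates the equivalent via its event-pattern macros).
struct Crm;
struct Billing;

/// An ID that is only valid inside one capsule. The capsule is a type parameter,
/// so the boundary exists at compile time and costs nothing at runtime.
struct CapsuleId<C> {
    value: String,
    _capsule: PhantomData<C>,
}

impl<C> CapsuleId<C> {
    fn new(value: impl Into<String>) -> Self {
        Self { value: value.into(), _capsule: PhantomData }
    }
}

/// A handler that only accepts IDs belonging to the CRM capsule.
fn handle_crm_event(id: &CapsuleId<Crm>) {
    println!("handling CRM event for {}", id.value);
}

fn main() {
    let crm_id: CapsuleId<Crm> = CapsuleId::new("OPPORTUNITY#1");
    let billing_id: CapsuleId<Billing> = CapsuleId::new("INVOICE#1");

    handle_crm_event(&crm_id);
    // handle_crm_event(&billing_id); // error: expected `CapsuleId<Crm>`, found `CapsuleId<Billing>`
    let _ = billing_id; // keep the sketch warning-free
}
```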
5. Documentation as Architecture
CLAUDE.md organization model (35 pages): Week 4 created comprehensive project documentation that Builder agents actually followed:
- Project structure and conventions
- Event patterns and naming
- Testing strategy (4 levels)
- Domain model relationships
What Failed
1. Cascading Compilation Errors
The disaster (Week 3): a changed macro signature for InMemoryRepository.
The AI attempt:
- 24 hours of fix attempts
- 31 commits
- Cascading errors across 95 files
- Fix-in-session anti-pattern (each fix broke something else)
The human fix:
- 90 minutes
- 3 commits
- Methodical approach: update signature → fix call sites → fix tests
Why it went wrong:
- AI lost context across large refactorings
- Each incremental fix created new compilation errors
- Builder couldn’t see the full dependency graph
- Verifier couldn’t run until the code compiled
2. Cross-Entity Consistency
The bug (Week 2): the Opportunity entity used the wrong ID format for its foreign key to Account.
Why it slipped through:
- Verifier checked each entity independently
- No cross-entity consistency validation
- Relationship bugs only appeared during integration (the typed-ID sketch below shows one way to catch this at compile time)
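A common general fix for this class of bug is to make IDs their own types, so the compiler rejects a foreign key with the wrong format or origin. The sketch below is a generic newtype illustration with assumed names (AccountId, OpportunityId, the ACCOUNT# prefix); it is not the project's actual code.

```rust
/// Strongly typed IDs: the compiler now distinguishes Account IDs from
/// Opportunity IDs even though both wrap a String.
#[derive(Clone, Debug, PartialEq)]
struct AccountId(String);

#[derive(Clone, Debug, PartialEq)]
struct OpportunityId(String);

impl AccountId {
    /// Centralize the format rule so every AccountId is built the same way.
    fn new(uuid: &str) -> Self {
        AccountId(format!("ACCOUNT#{uuid}"))
    }
}

struct Opportunity {
    id: OpportunityId,
    /// The foreign key can only hold an AccountId, so a wrong ID format or a
    /// swapped reference becomes a type error instead of an integration surprise.
    account_id: AccountId,
}

fn main() {
    let opp = Opportunity {
        id: OpportunityId("OPPORTUNITY#7".to_string()),
        account_id: AccountId::new("123e4567"),
        // account_id: OpportunityId("OPPORTUNITY#7".to_string()), // would not compile
    };
    println!("{:?} -> {:?}", opp.id, opp.account_id);
}
```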
3. Novel Problem Struggles
Pattern confusion: the AI applied the wrong patterns when a problem didn’t match known examples:
- Used a generic repository pattern for an event-sourced entity (should use an event store)
- Suggested REST endpoints for background job orchestration (should use message queue)
- Missed multi-tenancy implications for new feature
4. Fix-in-Session Anti-Pattern
Observed behavior: when verification fails, Builder tries to fix issues in the same session:
- Gets attached to the original approach
- Makes incremental patches
- Creates technical debt
- Misses the opportunity for a better design
The better approach:
- Close the Builder session
- Review the Verifier report as a human
- Decide: quick fix or redesign?
- Start a fresh Builder session with an updated plan
Emerging Patterns
1. The Verification Ladder
Level 1: Intra-Entity Verification
- Does code compile?
- Do tests pass?
- Are requirements met?
Level 2: Cross-Entity Verification
- Are foreign key types consistent?
- Do event patterns match?
- Are API routes consistent?
Level 3: Architectural Verification
- Does the solution follow documented patterns?
- Are isolation boundaries respected?
- Is error handling consistent?
Level 4: Domain Verification
- Do models accurately represent the business domain?
- Are edge cases handled?
- Is the abstraction future-proof?
2. The Documentation Flywheel
Week 1-2: AI creates inconsistent patterns → human documents what works → CLAUDE.md
Week 3-4: AI follows CLAUDE.md → generates consistent code → human updates it with new learnings
Result: documentation quality improves, AI suggestions get better, human reviews get faster.
Metric: Week 4 had 107 commits with fewer Verifier rejections than Week 2’s 216 commits.
3. The Macro Threshold
When to create a macro: if you’ve written the same pattern more than 3 times, automate it (a minimal derive-macro sketch follows this list):
- DomainAggregate derive macro (after 7 similar entities)
- InMemoryRepository derive macro (after 5 repository implementations)
- CachedRepository derive macro (after 3 caching layers)
The economics:
- Macro creation cost: ~2-4 hours (Builder + Verifier)
- Per-entity manual cost: ~3-4 hours
- Break-even: 1 entity
- Actual savings: 94% reduction after 15 entities
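To make the ~2-4 hour creation cost concrete, here is a minimal sketch of what an InMemoryRepository-style derive can look like. It assumes a proc-macro crate with syn, quote, and proc-macro2 as dependencies and `proc-macro = true` in Cargo.toml; the naming convention, the String key, and the generated methods are illustrative, and the real macro presumably generates much more (DynamoDB keys, caching, events).

```rust
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, DeriveInput};

#[proc_macro_derive(InMemoryRepository)]
pub fn derive_in_memory_repository(input: TokenStream) -> TokenStream {
    let input = parse_macro_input!(input as DeriveInput);
    let entity = &input.ident;
    // Naming convention assumed here: `Account` -> `AccountInMemoryRepository`.
    let repo = syn::Ident::new(&format!("{entity}InMemoryRepository"), entity.span());

    let expanded = quote! {
        /// Generated in-memory store keyed by String id; requires the entity to be Clone.
        pub struct #repo {
            store: std::sync::Mutex<std::collections::HashMap<String, #entity>>,
        }

        impl #repo {
            pub fn new() -> Self {
                Self { store: std::sync::Mutex::new(std::collections::HashMap::new()) }
            }

            pub fn save(&self, id: String, entity: #entity) {
                self.store.lock().unwrap().insert(id, entity);
            }

            pub fn find(&self, id: &str) -> Option<#entity> {
                self.store.lock().unwrap().get(id).cloned()
            }
        }
    };

    expanded.into()
}
```

With something like this in place, a single `#[derive(InMemoryRepository)]` line on an entity struct replaces a hand-written repository, which is the kind of boilerplate the 94% reduction eliminated.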
4. The Test Worth Threshold
AI made tests “worth it” that weren’t before (an example scenario is sketched below this list).
Level 3 tests (integration - DynamoDB + SQS):
- Manual effort: 2-3 hours per test
- AI effort: 20-30 minutes per test
- Coverage: 8 integration test scenarios in Week 2
Level 4 tests (E2E):
- Manual effort: 4-6 hours per scenario
- AI effort: 45-60 minutes per scenario
- Coverage: 21 E2E scenarios in Week 4
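The EventCollector infrastructure itself isn't shown in this write-up, so the sketch below stands in with a tiny in-process collector to show the shape of such a scenario: act, then wait with a timeout for the expected event. The event name, the helper methods, and the polling approach are all assumptions; the real Level 4 tests presumably drain the SQS-backed pipeline and call the deployed API.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Stand-in for the EventCollector test helper mentioned above; purely in-process.
#[derive(Clone, Default)]
struct EventCollector {
    events: Arc<Mutex<VecDeque<String>>>,
}

impl EventCollector {
    fn publish(&self, event: &str) {
        self.events.lock().unwrap().push_back(event.to_string());
    }

    /// Poll until an event with the given name arrives or the timeout expires.
    fn wait_for(&self, name: &str, timeout: Duration) -> Option<String> {
        let deadline = Instant::now() + timeout;
        loop {
            let found = self.events.lock().unwrap().iter().position(|e| e == name);
            if let Some(pos) = found {
                return self.events.lock().unwrap().remove(pos);
            }
            if Instant::now() >= deadline {
                return None;
            }
            std::thread::sleep(Duration::from_millis(50));
        }
    }
}

#[test]
fn account_creation_emits_account_created_event() {
    let collector = EventCollector::default();

    // In a real Level 4 test this would be an API call against the deployed stack;
    // here the "system under test" is simulated by publishing directly.
    collector.publish("AccountCreated");

    let event = collector.wait_for("AccountCreated", Duration::from_secs(5));
    assert_eq!(event.as_deref(), Some("AccountCreated"));
}
```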
Surprising Learnings
1. Token Cost Is Negligible
Expected: token costs would be a significant concern.
Reality: $12.68 for 120 hours of work (Week 2). Even at 10x that cost, the ROI would still be 94x.
Implication: don’t optimize for token usage; optimize for quality and speed. Use Opus for planning where architecture matters: the $2.18 planning cost saved hours of rework.
2. Fresh Context > Smart Prompts
Expected: better prompts would improve AI output quality.
Reality: a fresh Verifier session with basic prompts caught more bugs than elaborate Builder prompts in the same session.
Implication: session management and agent roles matter more than prompt engineering. Independent verification is the forcing function for quality.
3. AI Documents Better Than Humans
Expected: AI would generate minimal documentation.
Reality: Builder created a 35-page organization model that was more comprehensive and consistent than human-written docs.
Why: AI has no “documentation debt” aversion. It documents as naturally as it codes; humans skip docs to save time.
4. Compile-Time > Runtime > Trust
Evolution of the validation strategy:
- Week 1-2: trust AI to follow patterns → bugs in production
- Week 3: add runtime validation for patterns → issues caught in tests
- Week 3+: encode patterns in types/macros → invalid code won’t compile
Key learning: the best validation is making invalid states unrepresentable. Zero capsule isolation bugs after Week 3, because capsule-isolated entities enforce isolation at compile time.
5. Breaking Changes Still Hurt
Expected: AI would handle refactoring well.
Reality: large breaking changes (the macro signature change) took the AI 24 hours vs 90 minutes for a human.
Why: AI excels at localized changes with clear context; system-wide refactoring with cascading dependencies overwhelms context windows.
Workaround: have a human make the breaking change, then let AI apply the fixes consistently across the codebase.
Recommendations
Based on 4 weeks and 524 commits:
1. Start with a multi-agent workflow
- Don’t try to do everything in one session
- Invest in Evaluator planning upfront
- Always use a fresh Verifier session
2. Document patterns early
- Create CLAUDE.md from Week 1
- Update it with patterns that work
- AI will follow documented conventions
3. Encode constraints at compile time
- Don’t rely on AI “remembering” architectural constraints
- Use macros for cross-cutting concerns
- Make invalid states unrepresentable
4. Keep humans on the hard problems
- Large breaking changes
- Novel problem combinations
- Cross-entity architectural decisions
- System-wide refactoring
5. Measure the outcomes
- Track speedup multipliers per work type
- Monitor bug sources (Verifier vs production)
- Calculate actual ROI (token cost vs time saved)
6. Rethink what’s “worth doing”
- AI changes what’s “worth doing” (comprehensive tests, docs)
- Token cost is negligible compared to developer time
- Don’t optimize for token usage; optimize for quality
What’s Next
The 4-week foundation is in place:
- Multi-agent workflow proven (4.4x-10x speedup)
- 15 entity types with zero isolation bugs
- 92% test coverage with AI-economical E2E tests
- Patterns documented and consistently applied
Open questions for the next phase:
- Scaling to 100+ entities (will the macros hold up?)
- Multi-service integration (event-driven architecture)
- Production performance optimization
- AI-assisted debugging at scale