Four weeks of building a production SaaS platform with AI agents. Here's what actually happened: metrics, failures, learnings, and the patterns that emerged.

By the Numbers

Code Output

Production Code:
  • 6,800+ lines of production code (Week 2 CRM domain alone)
  • 2,400+ lines of test code
  • 4,702 lines of boilerplate eliminated through macros (94% reduction)
  • 6,862 lines for billing/accounting foundation (95 files)
  • 2,816 lines of authorization infrastructure
Commits & Files:
  • 524 total commits (Dec 31, 2025 - Jan 30, 2026)
  • Week 2: 216 commits for CRM foundation
  • Week 3: 72 commits for event sourcing patterns
  • Week 4: 107 commits in one week (8-10x speedup)
  • Breaking changes migration: 1,003 tests updated across 6 entities
System Scale:
  • 15 entity types
  • 37 API routes
  • 8 background workers
  • 3 event processing pipelines
  • 21 E2E test scenarios
  • 92% test coverage (8,464 of 9,200 lines covered)

Speed & Productivity

Speedup Multipliers:
  • Plan → Implement → Verify workflow: 4.4x faster than manual estimate
  • Systematic work (boilerplate, patterns): 8-10x speedup
  • Breaking changes at scale: 5-7x faster (4 days vs 4-6 weeks manual)
  • Macro implementation: Single commit reducing each entity from 812 lines to 43
Time Savings:
  • Week 2 alone: 120 hours saved on CRM domain implementation
  • Breaking changes: Saved 4-6 weeks of manual refactoring
Time Costs:
  • Mistake from skipping planning: 3 hours wasted, 60% code rewrite
  • Cascading error chase (AI): 24 hours, 31 commits
  • Same errors (human fix): 90 minutes, 3 commits

Quality Metrics

Bugs Found:
  • Week 2: 18 bugs found by Verifier before merge
    • Missing edge cases: 8 instances
    • Requirement gaps: 6 instances
    • Cross-entity inconsistencies: 4 instances
  • Week 3: 6 isolation violations caught in test environment
  • Week 3+: Zero capsule isolation bugs after pattern hardening
Verification Effectiveness:
  • Average time to fix Verifier issues: 45 minutes
  • Post-Week 3: Compile-time guarantees prevent entire bug classes
  • False positive rate: Minimal (Verifier focused on real issues)

Cost & ROI

Token Usage & Cost (Week 2 Example):
  • Evaluator (Opus): 145k tokens = $2.18
  • Builder (Sonnet): 520k tokens = $7.80
  • Verifier (Sonnet): 180k tokens = $2.70
  • Total: $12.68 for 120 hours of manual work
Return on Investment:
  • Week 2 ROI: 946x (120 hours @ ~$127/hr vs $12.68 in tokens)
  • Average developer cost: $127/hour (fully loaded)
  • Token cost: $0.10-0.15 per hour of equivalent work

What Worked

1. Multi-Agent Workflow

Pattern: Evaluator → Builder → Verifier
The three-agent separation proved critical:
  • Evaluator (Opus): Architectural planning, pattern selection, requirement analysis
  • Builder (Sonnet): Implementation, test generation, boilerplate creation
  • Verifier (Sonnet, fresh session): Independent verification, edge case discovery
Key insight: A fresh Verifier session caught 18 bugs in Week 2 that Builder's own tests missed. The independent context is non-negotiable.
Mistake avoided: A Week 1 experiment reused the Builder session for verification → Builder was biased toward its own implementation and missed obvious bugs.
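To make the split concrete, here is a minimal sketch of the role separation as data; the names and shapes are illustrative, since the orchestration code itself isn't part of this write-up:
// Hypothetical sketch of the three-agent split. Model assignments mirror the
// setup described above; this is not a real orchestration API.
enum Model { Opus, Sonnet }

struct AgentRole {
    name: &'static str,
    model: Model,
    responsibility: &'static str,
}

// The Verifier always starts in a fresh session, independent of the Builder.
const WORKFLOW: [AgentRole; 3] = [
    AgentRole { name: "Evaluator", model: Model::Opus,   responsibility: "plan architecture, select patterns" },
    AgentRole { name: "Builder",   model: Model::Sonnet, responsibility: "implement code and tests" },
    AgentRole { name: "Verifier",  model: Model::Sonnet, responsibility: "independent review in a fresh session" },
];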

2. Plan → Implement → Verify Gates

Three-phase workflow with quality gates:
Phase 1: Plan (20% of time)
  • Evaluator analyzes requirements
  • Chooses patterns and architecture
  • Identifies edge cases and dependencies
  • Outputs implementation plan
Phase 2: Implement (60% of time)
  • Builder follows plan
  • Generates code and tests
  • Documents decisions
Phase 3: Verify (20% of time)
  • Verifier reviews independently
  • Finds gaps and inconsistencies
  • Reports blocking issues
Key metric: 20% planning time saves 80% of the rework. The Week 1 experiment that skipped planning cost 3 hours and a 60% code rewrite.
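A rough sketch of the gate idea, using hypothetical names since the workflow code isn't part of this write-up. Each phase has an explicit output, and the next phase only starts once its gate passes:
// Hypothetical gate model: each phase must produce its artifact before the next begins.
enum Phase {
    Plan,      // ~20% of time: Evaluator produces an implementation plan
    Implement, // ~60% of time: Builder produces code + tests from the plan
    Verify,    // ~20% of time: fresh Verifier produces a report of blocking issues
}

struct GateResult {
    phase: Phase,
    blocking_issues: Vec<String>,
}

fn gate_passed(result: &GateResult) -> bool {
    // A phase passes only when its blocking issues are empty;
    // otherwise the plan is revised rather than patched in-session.
    result.blocking_issues.is_empty()
}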

3. Systematic Work Amplification

AI excelled at (8-10x speedup):
  • Boilerplate generation: 4,702 lines of repository patterns automated with macros
  • Test scenario creation: 21 E2E test cases with EventCollector infrastructure
  • Pattern application: Consistent PK/SK patterns across 15 entity types
  • Documentation: 35-page organization model (CLAUDE.md) that Builder actually followed
Example: Account entity repository implementation:
  • Before macros: 812 lines of hand-written code
  • After macros: 43 lines with 5 derive annotations
  • AI-generated macro: Single commit, 94% reduction
Why this worked: Systematic work has clear patterns, well-documented APIs, and formulaic structure. AI pattern-matches effectively.
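For a sense of what the 43-line version looks like, here is a hedged sketch; the derive and attribute names are the ones used in this series, but the exact five derives and the field list are assumptions:
// Illustrative only, condensed from the ~43-line shape. DomainAggregate,
// InMemoryRepository, and CachedRepository are named in this series; the
// remaining derives and the fields are assumed.
#[derive(DomainAggregate, InMemoryRepository, CachedRepository, Serialize, Deserialize)]
#[capsule_isolated] // enforces tenant_id + capsule_id fields (see section 4)
pub struct Account {
    pub id: AccountId,
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,
    pub name: String,
    // ...remaining domain fields
}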

4. Compile-Time Guarantees

Event pattern macros (Week 3): Rather than relying on runtime validation or AI remembering rules, we encoded architectural invariants as compile-time checks:
#[derive(DomainAggregate, DomainEvent)]
#[capsule_isolated]  // Enforces tenant_id + capsule_id fields
pub struct Lead { ... }
Impact:
  • Week 3: 6 isolation violations caught in test
  • Post-Week 3: Zero isolation bugs - invalid code won’t compile
Lesson: Don’t ask AI to “remember” architectural rules. Encode them in types and macros. AI generates code that satisfies the compiler, compiler enforces rules.
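Here is a minimal sketch of how that can work, assuming a hypothetical CapsuleIsolated trait; the real macro internals aren't shown here, and the stub newtypes exist only so the sketch stands alone:
// Assumed stubs so the example is self-contained.
pub struct TenantId(pub String);
pub struct CapsuleId(pub String);

pub struct Lead {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,
    // ...other fields
}

// One way #[capsule_isolated] can enforce the invariant: emit a trait impl whose
// accessors reference self.tenant_id and self.capsule_id. A struct missing either
// field won't compile, so the rule never depends on the AI "remembering" it.
pub trait CapsuleIsolated {
    fn tenant_id(&self) -> &TenantId;
    fn capsule_id(&self) -> &CapsuleId;
}

impl CapsuleIsolated for Lead {
    fn tenant_id(&self) -> &TenantId { &self.tenant_id }
    fn capsule_id(&self) -> &CapsuleId { &self.capsule_id }
}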

5. Documentation as Architecture

CLAUDE.md organization model (35 pages): Week 4 created comprehensive project documentation that Builder agents actually followed:
  • Project structure and conventions
  • Event patterns and naming
  • Testing strategy (4 levels)
  • Domain model relationships
Before CLAUDE.md: Builder invented inconsistent patterns across entities.
After CLAUDE.md: Builder suggested better designs aligned with the documented patterns.
Key insight: AI needs external memory. Well-structured documentation becomes architectural guidance that AI follows consistently.

What Failed

1. Cascading Compilation Errors

The disaster (Week 3): We changed the macro signature for InMemoryRepository:
// Old
fn pk_for_id(tenant_id, capsule_id, id) -> String

// New
fn pk_for_id(self, id) -> String  // ❌ Breaking change
AI’s response:
  • 24 hours of fix attempts
  • 31 commits
  • Cascading errors across 95 files
  • Fix-in-session anti-pattern (each fix broke something else)
Human fix:
  • 90 minutes
  • 3 commits
  • Methodical approach: update signature → fix call sites → fix tests
What went wrong:
  • AI lost context across large refactorings
  • Each incremental fix created new compilation errors
  • Builder couldn’t see full dependency graph
  • Verifier couldn’t run until code compiled
Lesson: Large-scale breaking changes are still human work. AI excels at localized changes, not system-wide refactoring with cascading dependencies.

2. Cross-Entity Consistency

The bug (Week 2): The Opportunity entity used the wrong ID type for its foreign key to Account:
// Account
pub struct Account {
    pub id: AccountId,  // newtype wrapping a UUID v4
}

// Opportunity (WRONG)
pub struct Opportunity {
    pub account_id: String,  // free-form "ACC-{ulid}" string instead of AccountId
}
Why it happened:
  • Verifier checked each entity independently
  • No cross-entity consistency validation
  • Relationship bugs only appeared during integration
Fix: Added a cross-entity verification step that checks foreign key types, ID formats, event patterns, and API route consistency.
Result: 4 cross-entity inconsistencies caught in subsequent weeks.
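For illustration, the corrected relationship is just a shared, typed ID, which is exactly what the cross-entity check verifies (field lists abbreviated):
use uuid::Uuid;

// Both entities share the AccountId newtype, so a free-form "ACC-{ulid}" string
// can no longer type-check as a foreign key.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct AccountId(pub Uuid);

pub struct Account {
    pub id: AccountId,
    // ...
}

pub struct Opportunity {
    pub account_id: AccountId, // typed foreign key instead of String
    // ...
}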

3. Novel Problem Struggles

Pattern confusion: AI applied the wrong patterns when the problem didn't match known examples:
  • Used generic repository pattern for event-sourced entity (should use event store)
  • Suggested REST endpoints for background job orchestration (should use message queue)
  • Missed multi-tenancy implications for new feature
Why: AI pattern-matches, and novel combinations of requirements have no clear match in its training data.
Workaround: The Evaluator phase explicitly calls out "novel aspects" and requires human review before Builder proceeds.

4. Fix-in-Session Anti-Pattern

Observed behavior: When verification fails, Builder tries to fix issues in same session:
  • Gets attached to original approach
  • Makes incremental patches
  • Creates technical debt
  • Misses opportunity for better design
Better approach:
  • Close Builder session
  • Review Verifier report as human
  • Decide: quick fix or redesign?
  • Start fresh Builder session with updated plan
Metric: Fixes from fresh sessions were higher quality and faster (45 min avg vs multiple hours in same session).

Emerging Patterns

1. The Verification Ladder

Level 1: Intra-Entity Verification
  • Does code compile?
  • Do tests pass?
  • Are requirements met?
Level 2: Inter-Entity Verification
  • Are foreign key types consistent?
  • Do event patterns match?
  • Are API routes consistent?
Level 3: Architectural Verification
  • Does solution follow documented patterns?
  • Are isolation boundaries respected?
  • Is error handling consistent?
Level 4: Domain Verification
  • Do models accurately represent business domain?
  • Are edge cases handled?
  • Is the abstraction future-proof?
Observation: AI handles Level 1-2 well. Levels 3-4 require human judgment.

2. The Documentation Flywheel

Week 1-2: AI creates inconsistent patterns → human documents what works → CLAUDE.md
Week 3-4: AI follows CLAUDE.md → generates consistent code → human updates it with new learnings
Result: Documentation quality improves, AI suggestions get better, and human reviews get faster.
Metric: Week 4 produced 107 commits with fewer Verifier rejections than Week 2's 216 commits.

3. The Macro Threshold

When to create a macro: If you’ve written the same pattern >3 times, automate it:
  • DomainAggregate derive macro (after 7 similar entities)
  • InMemoryRepository derive macro (after 5 repository implementations)
  • CachedRepository derive macro (after 3 caching layers)
ROI calculation:
  • Macro creation cost: ~2-4 hours (Builder + Verifier)
  • Per-entity manual cost: ~3-4 hours
  • Break-even: 1 entity
  • Actual savings: 94% reduction after 15 entities

4. The Test Worth Threshold

AI made tests "worth it" that weren't before:
Level 3 Tests (Integration - DynamoDB + SQS):
  • Manual effort: 2-3 hours per test
  • AI effort: 20-30 minutes per test
  • Coverage: 8 integration test scenarios in Week 2
Level 4 Tests (E2E - Full event flows):
  • Manual effort: 4-6 hours per scenario
  • AI effort: 45-60 minutes per scenario
  • Coverage: 21 E2E scenarios in Week 4
Key insight: AI changed the economics of testing. Tests that were “too expensive” became economically viable.
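For context, here is a heavily hedged sketch of a Level 4 scenario. EventCollector is the project's name for the infrastructure, but every method and helper in this sketch is an assumed shape, not the documented API:
// Hypothetical E2E sketch: drive the API end to end, then assert on the emitted
// event flow. spawn_test_app, attach, wait_for, and the route helpers are all
// illustrative names, not the project's real API.
#[tokio::test]
async fn converting_a_lead_emits_the_full_event_flow() {
    let app = spawn_test_app().await;                 // assumed test harness
    let events = EventCollector::attach(&app).await;  // assumed: taps the event pipeline

    let lead_id = app.create_lead("ACME").await;
    app.convert_lead(lead_id).await;

    // Assert the full event flow, not just the HTTP response.
    events
        .wait_for(&["LeadCreated", "LeadConverted", "OpportunityCreated"])
        .await
        .expect("event sequence within timeout");
}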

Surprising Learnings

1. Token Cost Is Negligible

Expected: Token costs would be a significant concern.
Reality: $12.68 for 120 hours of work (Week 2). Even at 10x that cost, the ROI would still be ~94x.
Implication: Don't optimize for token usage; optimize for quality and speed. Use Opus for planning where architecture matters: the $2.18 planning cost saved hours of rework.

2. Fresh Context > Smart Prompts

Expected: Better prompts would improve AI output quality.
Reality: A fresh Verifier session with basic prompts caught more bugs than elaborate prompts in the same Builder session.
Implication: Session management and agent roles matter more than prompt engineering. Independent verification is the forcing function for quality.

3. AI Documents Better Than Humans

Expected: AI would generate minimal documentation.
Reality: Builder created a 35-page organization model that was more comprehensive and consistent than our human-written docs.
Why: AI has no aversion to "documentation debt"; it documents as naturally as it codes, while humans skip docs to save time.

4. Compile-Time > Runtime > Trust

Evolution of the validation strategy:
Week 1-2: Trust AI to follow patterns → bugs in production
Week 3: Add runtime validation for patterns → issues caught in tests
Week 3+: Encode patterns in types/macros → invalid code won't compile
Key learning: The best validation makes invalid states unrepresentable. Zero capsule isolation bugs after Week 3, because capsule-isolated entities enforce isolation at compile time.

5. Breaking Changes Still Hurt

Expected: AI would handle refactoring well.
Reality: A large breaking change (the macro signature change) took AI 24 hours vs. 90 minutes for a human.
Why: AI excels at localized changes with clear context; system-wide refactoring with cascading dependencies overwhelms context windows.
Workaround: Have a human make the breaking change, then use AI to apply the fixes consistently across the codebase.

Recommendations

Based on 4 weeks and 524 commits:
1. Start with multi-agent workflow
  • Don’t try to do everything in one session
  • Invest in Evaluator planning upfront
  • Always use fresh Verifier session
2. Document as you go
  • Create CLAUDE.md from Week 1
  • Update with patterns that work
  • AI will follow documented conventions
3. Encode rules in types
  • Don’t rely on AI “remembering” architectural constraints
  • Use macros for cross-cutting concerns
  • Make invalid states unrepresentable
4. Know when to switch to human
  • Large breaking changes
  • Novel problem combinations
  • Cross-entity architectural decisions
  • System-wide refactoring
5. Measure everything
  • Track speedup multipliers per work type
  • Monitor bug sources (Verifier vs production)
  • Calculate actual ROI (token cost vs time saved)
6. Use economics as guide
  • AI changes what’s “worth doing” (comprehensive tests, docs)
  • Token cost is negligible compared to developer time
  • Don’t optimize for token usage - optimize for quality

What’s Next

The 4-week foundation is in place:
  • Multi-agent workflow proven (4.4x-10x speedup)
  • 15 entity types with zero isolation bugs
  • 92% test coverage with AI-economical E2E tests
  • Patterns documented and consistently applied
Next challenges:
  • Scale to 100+ entities (will macros hold up?)
  • Multi-service integration (event-driven architecture)
  • Production performance optimization
  • AI-assisted debugging at scale
Key question: Does the Plan → Implement → Verify workflow scale to multi-service, multi-team development? Or do new coordination patterns emerge? The journey continues.

Discussion

What surprised you most from these metrics? Where does your AI experience differ? Share your learnings.