Week 4: When AI Truly Excels
This is Week 4 of “Building with AI” - a 10-week journey documenting how I use multi-agent AI
workflows to build a production-grade SaaS platform.

This week: Discovering that AI doesn’t just speed up coding - it fundamentally changes what work becomes worth doing. 107 commits across testing, documentation, and organizational modeling that would have been impossible to tackle manually.
The Surprising Realization
A few weeks into this experiment, I had established a pattern: AI builds features fast, verification catches bugs, everything works. This week revealed something unexpected: AI doesn’t just accelerate existing workflows - it makes previously impossible work suddenly feasible.

The work I tackled this week:

- Complete end-to-end test coverage for event flows (21 test scenarios)
- Organization model documentation (agent responsibilities, workflows, decision frameworks)
- API visibility architecture for SDK filtering
- Bundle/unbundle workflow state machine with examples
- Usage metering infrastructure with aggregation pipelines
What We Built This Week
1. Event Flow E2E Testing (The Work Nobody Wants to Do)
The Context: Weeks 2-3 built a CRM domain layer with event sourcing. We had unit tests, integration tests, even some event flow tests. But we didn’t have comprehensive end-to-end verification that events actually flow correctly through the entire system.

Why this matters: In event-sourced systems, bugs in event flow are catastrophic:

- Miss an event → audit trail broken
- Wrong event order → state corruption
- Cross-tenant event leak → compliance violation
Why this never got done manually: each end-to-end scenario requires you to:

- Set up test infrastructure (EventBridge, SQS, DynamoDB Streams)
- Create test data for each entity type
- Trigger actions and wait for async event delivery
- Verify event payload, ordering, and side effects
- Clean up test resources
- Repeat for every entity and workflow combination
- The Prompt
- AI's Output
- Implementation
- What Verifier Caught
Key Insight: AI makes thorough testing economically viable. The marginal cost of going from
“some tests” to “comprehensive coverage” dropped from weeks to hours.
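To make the “verify event payload, ordering, and side effects” step concrete, here is a minimal sketch of the assertions one scenario runs. The types and event names are hypothetical stand-ins, not the project’s actual test harness, which collects events asynchronously from EventBridge/SQS.

```rust
// Hypothetical shapes for illustration; the real harness collects these
// asynchronously from the event infrastructure.
#[derive(Debug, Clone, PartialEq)]
struct RecordedEvent {
    event_type: String,
    tenant_id: String,
    aggregate_id: String,
    sequence: u64,
}

/// Check the three properties every scenario cares about: no cross-tenant
/// leakage, strictly increasing ordering, and the expected event sequence.
fn assert_event_flow(
    events: &[RecordedEvent],
    tenant: &str,
    aggregate: &str,
    expected_types: &[&str],
) {
    let relevant: Vec<&RecordedEvent> = events
        .iter()
        .filter(|e| e.aggregate_id == aggregate)
        .collect();

    // Cross-tenant leak: every event on this aggregate must belong to the tenant.
    assert!(
        relevant.iter().all(|e| e.tenant_id == tenant),
        "event from another tenant leaked into this aggregate's stream"
    );

    // Ordering: sequence numbers must be strictly increasing.
    assert!(
        relevant.windows(2).all(|w| w[0].sequence < w[1].sequence),
        "events arrived out of order"
    );

    // Coverage: the expected event types appear, in order.
    let types: Vec<&str> = relevant.iter().map(|e| e.event_type.as_str()).collect();
    assert_eq!(types, expected_types, "unexpected event sequence");
}

#[test]
fn contact_creation_emits_expected_events() {
    // In the real suite these come from the async event collector;
    // here they are hard-coded to keep the sketch self-contained.
    let events = vec![
        RecordedEvent {
            event_type: "ContactCreated".into(),
            tenant_id: "tenant-1".into(),
            aggregate_id: "contact-42".into(),
            sequence: 1,
        },
        RecordedEvent {
            event_type: "ContactIndexed".into(),
            tenant_id: "tenant-1".into(),
            aggregate_id: "contact-42".into(),
            sequence: 2,
        },
    ];
    assert_event_flow(&events, "tenant-1", "contact-42", &["ContactCreated", "ContactIndexed"]);
}
```

The value isn’t in any single assertion - it’s that the same three checks run across all 21 scenarios.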
2. Organization Model Documentation (The Work Nobody Does)
The surprise of the week: The most valuable output wasn’t code - it was organizational documentation.

The Context: By Week 4, I had built significant functionality with the multi-agent workflow. But I noticed problems:

- Evaluator sometimes made decisions outside its scope
- Builder occasionally asked questions Evaluator should answer
- Verification reports varied in quality
- No clear escalation path for conflicts
The Planning Prompt
Agent Personas
Evaluator (Principal Architect)
- Owns: Architecture decisions, technical strategy
- Decides: Technology choices, design patterns, trade-offs
- Cannot: Implement code, override verification failures
- Escalates to: Human (for business decisions)
Builder
- Owns: Implementation quality, test coverage
- Decides: Code structure, algorithm choice (within plan)
- Cannot: Change architecture, skip verification
- Escalates to: Evaluator (for plan changes)
Verifier
- Owns: Quality standards, requirement coverage
- Decides: Pass/conditional/fail, required fixes
- Cannot: Implement fixes, change requirements
- Escalates to: Human (for requirement ambiguity)
Decision Framework
Type 1 Decisions (Reversible):
- Variable naming, code formatting
- Test data values
- Comment wording
- Owner: Builder makes decision, no approval needed
Type 2 Decisions:
- Algorithm choice (within performance constraints)
- Error message wording
- Test organization structure
- Owner: Builder proposes, Verifier validates
Type 3 Decisions:
- Database schema changes
- API contract changes
- Event structure changes
- Owner: Evaluator decides, human approves
Type 4 Decisions:
- Multi-tenant isolation strategy
- Compliance approach
- Core technology choices
- Owner: Human decides, Evaluator advises
Quality Gates
Gate 1: Planning Complete
- ✅ Requirements understood
- ✅ Design options evaluated
- ✅ Recommended approach justified
- ✅ Human approval received
- Then: Builder can start implementation
Gate 2: Implementation Complete
- ✅ All plan requirements implemented
- ✅ Four-level tests passing
- ✅ No compiler warnings
- ✅ Builder self-review done
- Then: Verifier can review
Gate 3: Verification Complete
- ✅ Requirements coverage verified
- ✅ Test adequacy confirmed
- ✅ Edge cases identified
- ✅ No blocking issues
- Then: Human final review
Gate 4: Human Review Complete
- ✅ Spot-check implementation
- ✅ Verify AI didn’t hallucinate features
- ✅ Confirm alignment with business goals
- Then: Merge to main
Conflict Resolution
Scenario 1: Verifier rejects implementation
- Builder must fix issues (no debate)
- If Builder believes rejection is wrong:
- Builder escalates to Evaluator
- Evaluator reviews both perspectives
- Evaluator makes final call
- Human can override if needed
Scenario 2: Builder disagrees with the plan
- Builder documents concern in PR comment
- Builder proposes alternative approach
- Evaluator reviews:
- If minor: Evaluator approves change
- If major: Human decides
Scenario 3: Requirement is ambiguous
- Verifier flags ambiguity (doesn’t fail review)
- Verifier suggests clarifying questions
- Human clarifies requirement
- Builder updates implementation
- Re-verification required
Before the organization model:

- Verifier would sometimes say “looks good” without checking edge cases
- Builder occasionally added features not in the plan (“seemed useful”)
- Unclear when to involve human vs. let agents decide
After:

- Verifier now follows checklist: requirement coverage, test adequacy, edge cases, consistency
- Builder stays strictly within plan (escalates if deviation needed)
- Clear decision rights eliminate back-and-forth
3. Documentation Strategy (The Work That Never Gets Done)
With the organization model defined, AI suggested something I didn’t ask for:

“The organization model references several documents that don’t exist yet (PRD templates, ADR process, verification report format). Should we create a documentation strategy to ensure consistency?”

My first reaction: “That’s overkill for a solo project.” My second thought: “But what if it’s not? Let AI try.”
- The Prompt
- AI's Strategy
- Templates Created
- The Surprise
The Journey: When Tedious Work Becomes Systematic
The pattern that emerged this week: AI excels at work that requires systematic thoroughness over creative insight.

API Visibility Architecture (Creative + Systematic)
The problem: Need to generate different SDK versions:

- Customer SDK (only public-facing routes)
- Platform SDK (all routes including internal)
- Partner SDK (partner portal routes only)
The work (a sketch of the visibility-tag idea follows this list):

- Design visibility tagging system
- Choose implementation approach (compile-time vs runtime)
- Define SDK filtering rules
- Tag 127 existing API routes with visibility
- Update OpenAPI generation to filter by visibility
- Create SDK generation scripts for each audience
- Write migration guide for future routes
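A rough sketch of what visibility tagging and SDK filtering can look like. The enum variants, route list, and function names here are illustrative assumptions, not the project’s actual types:

```rust
// Illustrative visibility model: each route carries a tag, and SDK generation
// filters the route list per audience before emitting client code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Visibility {
    Public,   // customer-facing routes
    Partner,  // partner portal routes
    Internal, // platform-only routes
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Audience {
    CustomerSdk,
    PartnerSdk,
    PlatformSdk,
}

#[derive(Debug, Clone)]
struct RouteSpec {
    method: &'static str,
    path: &'static str,
    visibility: Visibility,
}

/// Decide whether a route is visible to a given SDK audience.
fn visible_to(route: &RouteSpec, audience: Audience) -> bool {
    match audience {
        Audience::CustomerSdk => route.visibility == Visibility::Public,
        Audience::PartnerSdk => matches!(route.visibility, Visibility::Public | Visibility::Partner),
        Audience::PlatformSdk => true, // the platform SDK sees everything
    }
}

fn filter_routes(routes: &[RouteSpec], audience: Audience) -> Vec<RouteSpec> {
    routes.iter().filter(|r| visible_to(r, audience)).cloned().collect()
}

fn main() {
    let routes = [
        RouteSpec { method: "GET", path: "/v1/contacts", visibility: Visibility::Public },
        RouteSpec { method: "GET", path: "/v1/partners/usage", visibility: Visibility::Partner },
        RouteSpec { method: "POST", path: "/v1/admin/flags", visibility: Visibility::Internal },
    ];

    // Only the public route survives customer-SDK filtering.
    assert_eq!(filter_routes(&routes, Audience::CustomerSdk).len(), 1);
    // The partner SDK sees public + partner routes.
    assert_eq!(filter_routes(&routes, Audience::PartnerSdk).len(), 2);
    // The platform SDK keeps all routes.
    println!("{:?}", filter_routes(&routes, Audience::PlatformSdk));
}
```

The systematic part - applying a tag like this to 127 existing routes and keeping the OpenAPI filter consistent - is exactly the work AI is good at.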
Product Features (Bundle/Unbundle Workflow)
The requirement: Products can be bundled (multiple items sold as one) or unbundled (split a bundle into components).

The complexity: This is a state machine with validation rules (a typed-state sketch follows the walkthrough below):

- Can’t unbundle a simple product (must be a bundle)
- Can’t bundle already-bundled products (no nested bundles)
- Pricing must be recalculated on bundle/unbundle
- Events must be emitted for audit trail
My manual estimate:

- Design state machine (1 day)
- Implement domain logic (1 day)
- Write tests (1 day)
- Write example showing how to use it (2 hours)
- Total: 3+ days
What actually happened:

Step 1: Evaluator designs state machine (2 hours)
- Identifies 7 states and 12 transitions
- Defines validation rules for each transition
- Plans event emission strategy
Step 2: Builder implements (4 hours)
- State machine with typed states (compile-time enforcement)
- Validation logic per state
- Event emission + repository integration
- 15 unit tests covering all transitions
Step 3: Builder creates example (1 hour)
- bundle_unbundle_workflow.rs showing real usage
- Includes error handling and edge cases
- Documented with comments explaining each step
Step 4: Verifier catches issues (30 minutes)
- Missing negative test: “What if unbundle fails mid-operation?”
- Missing validation: “Can bundle quantity be zero?”
- Unclear example: “Should show rollback scenario”
Step 5: Builder fixes (1 hour)
- Added rollback test
- Added quantity validation
- Enhanced example with rollback scenario
Usage Metering Pipeline (Pure Systematic Work)
The requirement: Track partner API usage and aggregate for billing.

The architecture:

- Emit usage events on each API call
- Consume events from queue
- Aggregate usage by partner, date, endpoint
- Store in DynamoDB for billing queries
The work:

- Define 5 event types (API call, storage, bandwidth, feature usage, error)
- Create DynamoDB schema for events + aggregates
- Implement event consumer with batch processing
- Create aggregation pipeline (sum, count, percentiles)
- Write 12 integration tests (event → consumer → aggregate → query)
How it went:

- Evaluator designed comprehensive pipeline (2 hours)
- Builder implemented everything including edge cases (5 hours)
- Verifier validated correctness (1 hour)
What the pipeline handles (the aggregation step is sketched after this list):

- Event deduplication (idempotency)
- Out-of-order event handling
- Aggregate correction on late-arriving events
- Efficient querying with GSI design
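For the aggregation step, here is a minimal in-memory sketch of the dedup-and-aggregate logic. It is std-only and the field names are assumptions; the real pipeline consumes from a queue and persists aggregates to DynamoDB.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative usage event; the real pipeline has five event types.
#[derive(Debug, Clone)]
struct UsageEvent {
    event_id: String,   // unique id used for idempotent deduplication
    partner_id: String,
    endpoint: String,
    date: String,       // e.g. "2024-05-01"
    count: u64,
}

#[derive(Debug, Default)]
struct Aggregate {
    total_calls: u64,
}

#[derive(Default)]
struct Aggregator {
    seen: HashSet<String>,                                 // event ids already applied
    totals: HashMap<(String, String, String), Aggregate>,  // keyed by (partner, date, endpoint)
}

impl Aggregator {
    /// Apply a batch of events. Redelivered events are skipped via the seen-set,
    /// and late or out-of-order events simply update the existing aggregate.
    fn apply(&mut self, events: &[UsageEvent]) {
        for e in events {
            if !self.seen.insert(e.event_id.clone()) {
                continue; // duplicate delivery: already counted
            }
            let key = (e.partner_id.clone(), e.date.clone(), e.endpoint.clone());
            self.totals.entry(key).or_default().total_calls += e.count;
        }
    }
}

fn main() {
    let mut agg = Aggregator::default();
    let event = UsageEvent {
        event_id: "evt-1".into(),
        partner_id: "partner-1".into(),
        endpoint: "/v1/contacts".into(),
        date: "2024-05-01".into(),
        count: 3,
    };

    // Deliver the same event twice to show deduplication.
    agg.apply(&[event.clone(), event]);

    let key: (String, String, String) =
        ("partner-1".into(), "2024-05-01".into(), "/v1/contacts".into());
    assert_eq!(agg.totals[&key].total_calls, 3);
}
```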
Meta-insight: AI changes the cost-benefit calculation for “good enough vs. excellent.” Work that was previously “too expensive to do right” becomes “might as well do it right since AI makes it cheap.”
What We Learned: The Taxonomy of AI-Suitable Work
After 107 commits this week, a pattern emerged: Not all work benefits equally from AI.

- Where AI Excels
- Where AI Struggles
- The Decision Tree
1. Systematic Implementation
- Applying patterns repeatedly (tag 127 API routes)
- Comprehensive test coverage (21 E2E scenarios)
- Documentation from templates (35-page org model)
- State machine implementation from spec
2. Systematic Verification
- Cross-reference requirements across documents
- Identify edge cases systematically
- Verify consistency across related components
- Generate examples covering all scenarios
3. Structured Documentation
- ADRs, verification reports, planning docs
- Following templates consistently
- Cross-linking related documents
- Generating examples from specs
Principles Established This Week
Based on what worked and what didn’t, we established new principles.

Principle 1: AI Excels at Systematic Thoroughness
What we learned: Work that requires consistent application of rules across many cases is perfectly suited for AI.

Examples this week:
- Tag 127 API routes with visibility levels
- Write 21 E2E test scenarios following same pattern
- Create documentation from templates for 7 different types
Principle 2: Documentation Quality Pays Compound Interest
What we learned: Good documentation makes future AI work more effective.

The cycle:
- AI creates documentation following templates
- Documentation captures decision context
- Future AI agents read documentation to understand requirements
- Better requirements → better implementation → better verification
Principle 3: Organization Models Scale AI Workflows
What we learned: Defining agent roles and responsibilities improves output quality more than prompt engineering.

Before the organization model:
- Verifier inconsistently applied quality checks
- Builder occasionally hallucinated features
- Unclear escalation for conflicts
After:

- Verifier follows checklist every time
- Builder stays within plan boundaries
- Decision rights clearly defined
Principle 4: Testing Becomes Worth Doing
What we learned: Comprehensive testing that was “too expensive” manually becomes “obviously worth it” with AI.

Example: E2E event flow testing
- Manual estimate: 2-3 weeks (not worth it)
- With AI: 1.5 days (absolutely worth it)
Principle 5: Examples are Implementation Artifacts
What we learned: AI can create usage examples as part of implementation, not as afterthoughts.

Traditional approach:
- Implement feature (2 days)
- “I’ll add examples later” (never happens)
With AI:

- Builder implements feature (4 hours)
- Builder immediately creates example while context is fresh (1 hour)
- Example becomes part of verification (Verifier checks example works)
Metrics: Week 4 by the Numbers
- Velocity
- Quality
- Cost
- AI Suitability
Commits: 107 (previous record: 62 in Week 2)

Major features completed:
- Event flow E2E testing (21 test scenarios)
- Organization model + documentation strategy
- API visibility architecture (127 routes tagged)
- Bundle/unbundle workflow state machine
- Usage metering pipeline
- Partner cost matrix
- Workflow approval infrastructure
The Mistake I Made (And What It Taught Me)
After implementation: Builder finished implementing the partner cost matrix. Tests passed. Requested verification. Verifier reviewed and said: PASSED ✅ I merged to main.

During integration testing: Tried to use the partner cost matrix in integration tests. It failed with a cryptic error.

Updated Verification Checklist: Integration Requirements
Cross-Cutting Concerns Checklist
For every feature, verify tests cover the following (a sample cross-tenant negative test is sketched after the checklist):

Authorization
- Feature-level permission checks tested
- Tenant isolation verified (can’t access other tenant data)
- OAuth scope requirements documented
Multi-tenancy
- Tenant context properly scoped
- Queries include tenant filter
- Cross-tenant negative tests exist
Event Sourcing
- Events emitted for state changes
- Event payload includes required fields
- Event ordering tested
Error Handling
- Expected errors return proper status codes
- Unexpected errors logged with context
- Partial failure scenarios tested
Observability
- Metrics emitted for key operations
- Logs include correlation IDs
- Traces capture end-to-end flow
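As one example of what “cross-tenant negative tests exist” means in practice, here is a toy sketch using an in-memory repository with made-up names; the real tests run against the actual data layer.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum RepoError {
    NotFound,
}

// Toy repository: every key is (tenant_id, contact_id), so reads are always
// scoped by tenant, mirroring the tenant-filter requirement above.
#[derive(Default)]
struct ContactRepo {
    rows: HashMap<(String, String), String>,
}

impl ContactRepo {
    fn insert(&mut self, tenant: &str, id: &str, name: &str) {
        self.rows.insert((tenant.into(), id.into()), name.into());
    }

    fn get(&self, tenant: &str, id: &str) -> Result<&String, RepoError> {
        self.rows
            .get(&(tenant.to_string(), id.to_string()))
            .ok_or(RepoError::NotFound)
    }
}

#[test]
fn other_tenant_cannot_read_contact() {
    let mut repo = ContactRepo::default();
    repo.insert("tenant-a", "contact-1", "Alice");

    // The owning tenant can read its own record...
    assert!(repo.get("tenant-a", "contact-1").is_ok());
    // ...while another tenant gets NotFound, never someone else's data.
    assert_eq!(repo.get("tenant-b", "contact-1"), Err(RepoError::NotFound));
}
```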
What’s Next: Week 5 Preview
Week 4 revealed that AI excels at systematic work. Week 5 will test the limits: Can AI handle architectural refactoring?

Planned work:

- Refactor DynamoDB entities to single-table design (breaking change)
- Migrate event schema to versioned events (backward compatibility required)
- Consolidate API routes (eliminate duplication)
- Performance optimization (query patterns, indexing strategy)
Open questions:

- Understanding existing code deeply enough to refactor safely?
- Maintaining backward compatibility?
- Verifying refactoring didn’t change behavior?
Week 5: Refactoring with AI
Next week: When changing existing code is harder than writing new code
Code Examples (Sanitized)
Here’s the event collector utility we built for E2E testing (a simplified sketch follows the list):

- Exponential backoff handles EventBridge → SQS delays
- Predicate filtering allows flexible event matching
- Helpful error messages on timeout (shows expected vs received)
- Reusable across all 21 test scenarios
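Since the production code is sanitized, here is a simplified, std-only sketch of the same shape: a generic poll loop with exponential backoff, a caller-supplied predicate, and a timeout error that reports what was actually received. The real utility wraps the AWS SDK receive calls behind the fetch closure.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `fetch` until at least one event matches `predicate`, backing off
/// exponentially between polls. On timeout, the error reports what was
/// actually received so the failure message shows expected vs. received.
fn collect_matching<E, F, P>(
    mut fetch: F,      // pulls the next batch of events (in the real tests: an SQS receive)
    predicate: P,      // which events this scenario cares about
    timeout: Duration,
) -> Result<Vec<E>, String>
where
    E: std::fmt::Debug,
    F: FnMut() -> Vec<E>,
    P: Fn(&E) -> bool,
{
    let started = Instant::now();
    let mut delay = Duration::from_millis(100);
    let mut matched: Vec<E> = Vec::new();
    let mut others: Vec<E> = Vec::new();

    while started.elapsed() < timeout {
        for event in fetch() {
            if predicate(&event) {
                matched.push(event);
            } else {
                others.push(event);
            }
        }
        if !matched.is_empty() {
            return Ok(matched);
        }
        sleep(delay);
        delay = (delay * 2).min(Duration::from_secs(2)); // exponential backoff, capped
    }

    Err(format!(
        "timed out after {:?}: no events matched the predicate; received {} other events: {:?}",
        timeout,
        others.len(),
        others
    ))
}

fn main() {
    // Trivial usage against an in-memory "queue"; the real tests pass a closure
    // that receives messages from SQS.
    let mut batches = vec![vec![], vec!["ContactCreated".to_string()]].into_iter();
    let result = collect_matching(
        move || batches.next().unwrap_or_default(),
        |e: &String| e == "ContactCreated",
        Duration::from_secs(5),
    );
    assert!(result.is_ok());
}
```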
Discussion: Where Does AI Excel for You?
What Work Became Worth Doing?
Have you found work that’s now “worth it” with AI that wasn’t before?

I’d love to hear:
- What systematic work do you delegate to AI?
- What testing became economically viable?
- What documentation do you actually create now?
Disclaimer: This content documents my personal AI workflow experiments. All examples are from personal projects and have been sanitized to remove proprietary information. Code snippets are generic patterns for educational purposes. This does not represent my employer’s technologies or approaches.