
Week 4: When AI Truly Excels

This is Week 4 of “Building with AI” - a 10-week journey documenting how I use multi-agent AI workflows to build a production-grade SaaS platform. This week: discovering that AI doesn’t just speed up coding - it fundamentally changes what work becomes worth doing. 107 commits across testing, documentation, and organizational modeling that would have been impossible to tackle manually.

The Surprising Realization

Several weeks into this experiment, I had established a pattern: AI builds features fast, verification catches bugs, everything works. This week revealed something unexpected: AI doesn’t just accelerate existing workflows - it makes previously impossible work suddenly feasible.
The work I tackled this week:
  • Complete end-to-end test coverage for event flows (21 test scenarios)
  • Organization model documentation (agent responsibilities, workflows, decision frameworks)
  • API visibility architecture for SDK filtering
  • Bundle/unbundle workflow state machine with examples
  • Usage metering infrastructure with aggregation pipelines
Traditional estimate for this scope: 6-8 weeks. Actual time with AI workflow: 5.5 days.
But here’s what surprised me: I wouldn’t have attempted most of this work without AI. Not because it’s technically hard, but because the effort-to-value ratio seemed wrong. Let me show you what changed.

What We Built This Week

1. Event Flow E2E Testing (The Work Nobody Wants to Do)

The Context: Weeks 2-3 built a CRM domain layer with event sourcing. We had unit tests, integration tests, even some event flow tests. But we didn’t have comprehensive end-to-end verification that events actually flow correctly through the entire system.
Why this matters: In event-sourced systems, bugs in event flow are catastrophic:
  • Miss an event → audit trail broken
  • Wrong event order → state corruption
  • Cross-tenant event leak → compliance violation
The traditional problem: Writing E2E tests for event flows is tedious:
  1. Set up test infrastructure (EventBridge, SQS, DynamoDB Streams)
  2. Create test data for each entity type
  3. Trigger actions and wait for async event delivery
  4. Verify event payload, ordering, and side effects
  5. Clean up test resources
  6. Repeat for every entity and workflow combination
Estimated manual effort: 2-3 weeks for comprehensive coverage.
What I tried this week: Give the entire task to AI.
    Planning session for comprehensive E2E event flow testing.

    Context:
    - Event-sourced CRM with 7 entities (Account, Contact, Lead,
      Opportunity, Activity, Product, Address)
    - Events flow: DynamoDB Streams → EventBridge → SQS → Consumers
    - Need to verify event delivery, ordering, and payload correctness
    - Must test cross-entity workflows (e.g., Lead conversion)

    Requirements:
    1. Test each entity's event flow independently
    2. Test multi-entity workflows (create Account → add Contact → convert Lead)
    3. Test failure scenarios (event delivery failure, consumer errors)
    4. All tests must run against LocalStack (no AWS resources)

    Design comprehensive test suite with:
    - Test data fixtures
    - Event verification utilities
    - Async event waiting helpers
    - Clear assertion patterns
The result: Comprehensive E2E event testing that would have taken 2-3 weeks manually, done in 1.5 days with AI. But more importantly: I actually have confidence in the event system now. Without AI, I would’ve written 3-4 “smoke tests” and hoped for the best.
Key Insight: AI makes thorough testing economically viable. The marginal cost of going from “some tests” to “comprehensive coverage” dropped from weeks to hours.

2. Organization Model Documentation (The Work Nobody Does)

The surprise of the week: The most valuable output wasn’t code - it was organizational documentation.
The Context: By Week 4, I had built significant functionality with the multi-agent workflow. But I noticed problems:
  • Evaluator sometimes made decisions outside its scope
  • Builder occasionally asked questions Evaluator should answer
  • Verification reports varied in quality
  • No clear escalation path for conflicts
The root cause: I never formally defined what each agent was responsible for.
The traditional approach: Write a README explaining the workflow (maybe).
What I tried instead: Create a formal organization model with AI.
    Planning session for agent organization model.

    Context:
    - Using 3 agents: Evaluator (Opus), Builder (Sonnet), Verifier (Sonnet)
    - Working well for features, but roles blur on complex issues
    - Need clear boundaries, responsibilities, and escalation paths

    Design an organization model with:
    1. Agent roles and responsibilities (what each agent owns)
    2. Decision rights (who decides what)
    3. Communication protocols (how agents interact)
    4. Conflict resolution (what happens when agents disagree)
    5. Quality gates (when work can progress to next agent)

    Model after real organizations:
    - Evaluator = Principal Architect
    - Builder = Senior Engineer
    - Verifier = Tech Lead / Reviewer

    Include failure modes and handling strategies.

AI produced something remarkable: A 35-page organizational constitution that included:

Agent Personas

Evaluator (Principal Architect)
  • Owns: Architecture decisions, technical strategy
  • Decides: Technology choices, design patterns, trade-offs
  • Cannot: Implement code, override verification failures
  • Escalates to: Human (for business decisions)
Builder (Senior Engineer)
  • Owns: Implementation quality, test coverage
  • Decides: Code structure, algorithm choice (within plan)
  • Cannot: Change architecture, skip verification
  • Escalates to: Evaluator (for plan changes)
Verifier (Tech Lead)
  • Owns: Quality standards, requirement coverage
  • Decides: Pass/conditional/fail, required fixes
  • Cannot: Implement fixes, change requirements
  • Escalates to: Human (for requirement ambiguity)

Decision Framework

Type 1 Decisions (Reversible):
  • Variable naming, code formatting
  • Test data values
  • Comment wording
  • Owner: Builder makes decision, no approval needed
Type 2 Decisions (Reversible with effort):
  • Algorithm choice (within performance constraints)
  • Error message wording
  • Test organization structure
  • Owner: Builder proposes, Verifier validates
Type 3 Decisions (Hard to reverse):
  • Database schema changes
  • API contract changes
  • Event structure changes
  • Owner: Evaluator decides, human approves
Type 4 Decisions (Irreversible):
  • Multi-tenant isolation strategy
  • Compliance approach
  • Core technology choices
  • Owner: Human decides, Evaluator advises

Quality Gates

Gate 1: Planning Complete
  • ✅ Requirements understood
  • ✅ Design options evaluated
  • ✅ Recommended approach justified
  • ✅ Human approval received
  • Then: Builder can start implementation
Gate 2: Implementation Complete
  • ✅ All plan requirements implemented
  • ✅ Four-level tests passing
  • ✅ No compiler warnings
  • ✅ Builder self-review done
  • Then: Verifier can review
Gate 3: Verification Passed
  • ✅ Requirements coverage verified
  • ✅ Test adequacy confirmed
  • ✅ Edge cases identified
  • ✅ No blocking issues
  • Then: Human final review
Gate 4: Human Approved
  • ✅ Spot-check implementation
  • ✅ Verify AI didn’t hallucinate features
  • ✅ Confirm alignment with business goals
  • Then: Merge to main

Conflict Resolution

Scenario 1: Verifier rejects implementation
  • Builder must fix issues (no debate)
  • If Builder believes rejection is wrong:
    1. Builder escalates to Evaluator
    2. Evaluator reviews both perspectives
    3. Evaluator makes final call
    4. Human can override if needed
Scenario 2: Builder disagrees with plan
  • Builder documents concern in PR comment
  • Builder proposes alternative approach
  • Evaluator reviews:
    • If minor: Evaluator approves change
    • If major: Human decides
Scenario 3: Requirement ambiguity
  • Verifier flags ambiguity (doesn’t fail review)
  • Verifier suggests clarifying questions
  • Human clarifies requirement
  • Builder updates implementation
  • Re-verification required
Why this matters: After implementing this organization model, I saw immediate improvements.
Before (Weeks 1-3):
  • Verifier would sometimes say “looks good” without checking edge cases
  • Builder occasionally added features not in the plan (“seemed useful”)
  • Unclear when to involve human vs. let agents decide
After (Week 4):
  • Verifier now follows checklist: requirement coverage, test adequacy, edge cases, consistency
  • Builder stays strictly within plan (escalates if deviation needed)
  • Clear decision rights eliminate back-and-forth
The meta-insight: AI agents benefit from organizational structure just like human teams do.
Critical Learning: Without clear roles and responsibilities, AI agents optimize for “make the task work” instead of “produce the right outcome.” The organization model changed agent behavior more than prompt engineering ever could.

3. Documentation Strategy (The Work That Never Gets Done)

With the organization model defined, AI suggested something I didn’t ask for:
“The organization model references several documents that don’t exist yet (PRD templates, ADR process, verification report format). Should we create a documentation strategy to ensure consistency?”
My first reaction: “That’s overkill for a solo project.” My second thought: “But what if it’s not? Let AI try.”
    Design a documentation strategy for the multi-agent workflow.

    Context:
    - Solo developer using AI agents
    - Building production SaaS platform
    - Need to maintain context across weeks/months
    - AI agents need consistent documentation format to work effectively

    Requirements:
    - Documentation types (what to document)
    - Documentation templates (consistent format)
    - Documentation lifecycle (when to create/update)
    - Documentation storage (where to keep)
    - Documentation quality (how to verify)

    Optimize for:
    - AI agents can find and use docs effectively
    - Human can understand project after 3-month break
    - Minimal maintenance overhead

The Journey: When Tedious Work Becomes Systematic

The pattern that emerged this week: AI excels at work that requires systematic thoroughness over creative insight.

API Visibility Architecture (Creative + Systematic)

The problem: Need to generate different SDK versions:
  • Customer SDK (only public-facing routes)
  • Platform SDK (all routes including internal)
  • Partner SDK (partner portal routes only)
The creative part (Evaluator):
  • Design visibility tagging system
  • Choose implementation approach (compile-time vs runtime)
  • Define SDK filtering rules
The systematic part (Builder):
  • Tag 127 existing API routes with visibility
  • Update OpenAPI generation to filter by visibility
  • Create SDK generation scripts for each audience
  • Write migration guide for future routes
Manual estimate: Creative part (4 hours) + Systematic part (8 hours) = 12 hours.
With AI: Creative part (2 hours with Evaluator) + Systematic part (3 hours with Builder) = 5 hours.
Why AI excelled: The systematic work (tagging 127 routes) would have been mind-numbing manually. Builder never gets bored, maintains perfect consistency, and actually catches edge cases I would miss.
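To make the tagging concrete, here is a minimal sketch of what a visibility tag and an SDK filter over route metadata could look like. The enum, struct, and function names are illustrative placeholders, not the actual generator code:

```rust
/// Which audience may see a route (illustrative; the real tags live in route metadata).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Visibility {
    Customer, // public-facing routes -> Customer SDK
    Partner,  // partner portal routes -> Partner SDK
    Internal, // platform-only routes -> Platform SDK only
}

#[derive(Debug, Clone)]
pub struct RouteSpec {
    pub path: &'static str,
    pub visibility: Visibility,
}

/// Keep only the routes a given SDK audience is allowed to see.
/// The Platform SDK (Internal audience) includes everything.
pub fn routes_for_sdk(routes: &[RouteSpec], audience: Visibility) -> Vec<RouteSpec> {
    routes
        .iter()
        .filter(|r| audience == Visibility::Internal || r.visibility == audience)
        .cloned()
        .collect()
}

fn main() {
    let routes = [
        RouteSpec { path: "/v1/accounts", visibility: Visibility::Customer },
        RouteSpec { path: "/v1/partners/costs", visibility: Visibility::Partner },
        RouteSpec { path: "/v1/admin/tenants", visibility: Visibility::Internal },
    ];
    // The Customer SDK build would see only "/v1/accounts".
    assert_eq!(routes_for_sdk(&routes, Visibility::Customer).len(), 1);
}
```

The real work is in the systematic part: applying tags like these to 127 existing routes and wiring the filter into OpenAPI generation.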

Product Features (Bundle/Unbundle Workflow)

The requirement: Products can be bundled (multiple items sold as one) or unbundled (split a bundle into components).
The complexity: This is a state machine with validation rules:
  • Can’t unbundle a simple product (must be a bundle)
  • Can’t bundle already-bundled products (no nested bundles)
  • Pricing must be recalculated on bundle/unbundle
  • Events must be emitted for audit trail
Traditional approach:
  1. Design state machine (1 day)
  2. Implement domain logic (1 day)
  3. Write tests (1 day)
  4. Write example showing how to use it (2 hours)
  5. Total: 3+ days
With AI workflow:
  1. Evaluator designs state machine (2 hours)
    • Identifies 7 states and 12 transitions
    • Defines validation rules for each transition
    • Plans event emission strategy
  2. Builder implements (4 hours)
    • State machine with typed states (compile-time enforcement)
    • Validation logic per state
    • Event emission + repository integration
    • 15 unit tests covering all transitions
  3. Builder creates example (1 hour)
    • bundle_unbundle_workflow.rs showing real usage
    • Includes error handling and edge cases
    • Documented with comments explaining each step
  4. Verifier catches issues (30 minutes)
    • Missing negative test: “What if unbundle fails mid-operation?”
    • Missing validation: “Can bundle quantity be zero?”
    • Unclear example: “Should show rollback scenario”
  5. Builder fixes (1 hour)
    • Added rollback test
    • Added quantity validation
    • Enhanced example with rollback scenario
Total time: 8.5 hours (vs. 3+ days).
The key difference: AI doesn’t context-switch. Builder implemented the state machine, then immediately created the example while the context was fresh. Manually, I would’ve delayed the example (“I’ll add it later”) and never done it.
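For concreteness, here is a minimal sketch of the typed-state idea mentioned above. The types are simplified placeholders, and the real workflow also recalculates pricing and emits events:

```rust
use std::marker::PhantomData;

/// Marker types for the product's state.
pub struct Simple;
pub struct Bundled;

pub struct Product<State> {
    pub id: String,
    pub component_ids: Vec<String>,
    _state: PhantomData<State>,
}

#[derive(Debug)]
pub enum BundleError {
    EmptyBundle, // a bundle must contain at least one component
}

impl Product<Simple> {
    pub fn new(id: impl Into<String>) -> Self {
        Product { id: id.into(), component_ids: Vec::new(), _state: PhantomData }
    }

    /// Bundling only accepts simple products, so nested bundles cannot be expressed.
    pub fn bundle(
        bundle_id: impl Into<String>,
        components: Vec<Product<Simple>>,
    ) -> Result<Product<Bundled>, BundleError> {
        if components.is_empty() {
            return Err(BundleError::EmptyBundle);
        }
        Ok(Product {
            id: bundle_id.into(),
            component_ids: components.into_iter().map(|p| p.id).collect(),
            _state: PhantomData,
        })
    }
}

impl Product<Bundled> {
    /// Unbundling is only defined on `Product<Bundled>`, so "unbundle a simple
    /// product" is a compile-time error rather than a runtime check.
    pub fn unbundle(self) -> Vec<Product<Simple>> {
        self.component_ids.into_iter().map(|id| Product::<Simple>::new(id)).collect()
    }
}
```

With this shape, “can’t unbundle a simple product” and “no nested bundles” become compile-time guarantees, while rules like pricing recalculation and event emission remain runtime concerns.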

Usage Metering Pipeline (Pure Systematic Work)

The requirement: Track partner API usage and aggregate it for billing.
The architecture:
  • Emit usage events on each API call
  • Consume events from queue
  • Aggregate usage by partner, date, endpoint
  • Store in DynamoDB for billing queries
The systematic work:
  • Define 5 event types (API call, storage, bandwidth, feature usage, error)
  • Create DynamoDB schema for events + aggregates
  • Implement event consumer with batch processing
  • Create aggregation pipeline (sum, count, percentiles)
  • Write 12 integration tests (event → consumer → aggregate → query)
This is work I would never do manually. Not because it’s hard, but because the effort-to-benefit ratio seems poor. I’d write a basic version, ship it, and only improve it when billing issues arose. With AI:
  • Evaluator designed comprehensive pipeline (2 hours)
  • Builder implemented everything including edge cases (5 hours)
  • Verifier validated correctness (1 hour)
Result: Production-grade usage metering that handles:
  • Event deduplication (idempotency)
  • Out-of-order event handling
  • Aggregate correction on late-arriving events
  • Efficient querying with GSI design
Total time: 8 hours. Total value: Prevented at least 2 weeks of billing issues and customer complaints.
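As a rough illustration of the dedup-and-aggregate step (the event shape, aggregate key, and in-memory aggregator here are simplified stand-ins; the real pipeline consumes SQS batches and persists aggregates to DynamoDB):

```rust
use std::collections::{HashMap, HashSet};

/// Simplified usage event (the real pipeline defines five event types).
#[derive(Debug, Clone)]
pub struct UsageEvent {
    pub event_id: String,   // unique per event, used for idempotent processing
    pub partner_id: String,
    pub endpoint: String,
    pub date: String,       // e.g. "2025-06-01"
    pub call_count: u64,
}

/// In-memory stand-in for the DynamoDB-backed aggregates.
#[derive(Default)]
pub struct UsageAggregator {
    seen: HashSet<String>,                          // processed event_ids (deduplication)
    totals: HashMap<(String, String, String), u64>, // (partner, date, endpoint) -> calls
}

impl UsageAggregator {
    /// Apply one event. Redelivered duplicates are ignored; late or out-of-order
    /// events simply add to the correct (partner, date, endpoint) bucket, which
    /// is what makes aggregate correction possible.
    pub fn apply(&mut self, event: &UsageEvent) {
        if !self.seen.insert(event.event_id.clone()) {
            return; // duplicate delivery - already counted
        }
        let key = (
            event.partner_id.clone(),
            event.date.clone(),
            event.endpoint.clone(),
        );
        *self.totals.entry(key).or_insert(0) += event.call_count;
    }

    pub fn total_for(&self, partner: &str, date: &str, endpoint: &str) -> u64 {
        self.totals
            .get(&(partner.to_owned(), date.to_owned(), endpoint.to_owned()))
            .copied()
            .unwrap_or(0)
    }
}
```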
Meta-insight: AI changes the cost-benefit calculation for “good enough vs. excellent.” Work that was previously “too expensive to do right” becomes “might as well do it right since AI makes it cheap.”

What We Learned: The Taxonomy of AI-Suitable Work

After 107 commits this week, a pattern emerged: Not all work benefits equally from AI.
1. Systematic Implementation
  • Applying patterns repeatedly (tag 127 API routes)
  • Comprehensive test coverage (21 E2E scenarios)
  • Documentation from templates (35-page org model)
  • State machine implementation from spec
Why: AI never gets bored and maintains perfect consistency. ROI: 5-10x speedup.
2. Thorough Analysis
  • Cross-reference requirements across documents
  • Identify edge cases systematically
  • Verify consistency across related components
  • Generate examples covering all scenarios
Why: AI doesn’t “skim” - it reads everything fully. ROI: Catches issues humans would miss.
3. Structured Documentation
  • ADRs, verification reports, planning docs
  • Following templates consistently
  • Cross-linking related documents
  • Generating examples from specs
Why: AI excels at structure and completeness. ROI: Documentation actually gets written.

Principles Established This Week

Based on what worked and what didn’t, we established new principles:
What we learned: Work that requires consistent application of rules across many cases is perfectly suited for AI.
Examples this week:
  • Tag 127 API routes with visibility levels
  • Write 21 E2E test scenarios following same pattern
  • Create documentation from templates for 7 different types
Rule: If work requires “do the same thing many times consistently,” delegate entirely to AI.
Anti-pattern: Using AI for one-off creative tasks (AI defaults to patterns from training).
What we learned: Good documentation makes future AI work more effective.
The cycle:
  1. AI creates documentation following templates
  2. Documentation captures decision context
  3. Future AI agents read documentation to understand requirements
  4. Better requirements → better implementation → better verification
Metric: Time to regain context after a break dropped from 2-3 hours to 15 minutes.
Rule: Invest in documentation templates once, get consistent documentation forever.
What we learned: Defining agent roles and responsibilities improves output quality more than prompt engineering.
Before organization model:
  • Verifier inconsistently applied quality checks
  • Builder occasionally hallucinated features
  • Unclear escalation for conflicts
After organization model:
  • Verifier follows checklist every time
  • Builder stays within plan boundaries
  • Decision rights clearly defined
Rule: Treat AI agents like team members - give them clear roles, responsibilities, and decision rights.
What we learned: Comprehensive testing that was “too expensive” manually becomes “obviously worth it” with AI.
Example: E2E event flow testing
  • Manual estimate: 2-3 weeks (not worth it)
  • With AI: 1.5 days (absolutely worth it)
Result: Testing quality improved not because AI writes better tests, but because comprehensive testing became economically viable.
Rule: Re-evaluate “not worth the effort” decisions when AI changes the effort equation.
What we learned: AI can create usage examples as part of implementation, not as afterthoughts.
Traditional approach:
  • Implement feature (2 days)
  • “I’ll add examples later” (never happens)
AI approach:
  • Builder implements feature (4 hours)
  • Builder immediately creates example while context is fresh (1 hour)
  • Example becomes part of verification (Verifier checks example works)
Result: Every feature now has working examples because marginal cost dropped to near-zero.
Rule: Make examples part of the implementation task, not a separate documentation task.

Metrics: Week 4 by the Numbers

Commits: 107 (previous record: 62 in Week 2)
Major features completed:
  • Event flow E2E testing (21 test scenarios)
  • Organization model + documentation strategy
  • API visibility architecture (127 routes tagged)
  • Bundle/unbundle workflow state machine
  • Usage metering pipeline
  • Partner cost matrix
  • Workflow approval infrastructure
Manual estimate: 6-8 weeks
Actual time: 5.5 days
Speedup: ~8-10x
Key insight: Speedup increased from previous weeks (4-6x) because the work was more systematic.

The Mistake I Made (And What It Taught Me)

After implementation: Builder finished implementing the partner cost matrix. Tests passed. I requested verification. Verifier reviewed and said: PASSED ✅. I merged to main.
During integration testing: I tried to use the partner cost matrix in integration tests. It failed with a cryptic error:
Error: PartnerCostMatrix query failed: Access denied
What happened? The partner cost matrix uses OAuth scope-based authorization. Builder implemented the feature. Tests passed. But the tests used a mock auth context that bypassed scope checks.
Why Verifier missed it: The test suite had 100% coverage of the feature logic, but 0% coverage of the authorization integration.
The root cause: I didn’t specify “test authorization” in the requirements. So Builder tested the business logic (correctly) but not the integration with the auth system.
The deeper issue: As tasks become more systematic, I stopped thinking about implicit requirements. I assumed AI would “figure out” that authorization needs testing.
The fix: I updated the organization model with a new checklist for Verifier:

Cross-Cutting Concerns Checklist

For every feature, verify tests cover:
  1. Authorization
    • Feature-level permission checks tested
    • Tenant isolation verified (can’t access other tenant data)
    • OAuth scope requirements documented
  2. Multi-tenancy
    • Tenant context properly scoped
    • Queries include tenant filter
    • Cross-tenant negative tests exist
  3. Event Sourcing
    • Events emitted for state changes
    • Event payload includes required fields
    • Event ordering tested
  4. Error Handling
    • Expected errors return proper status codes
    • Unexpected errors logged with context
    • Partial failure scenarios tested
  5. Observability
    • Metrics emitted for key operations
    • Logs include correlation IDs
    • Traces capture end-to-end flow
If any cross-cutting concern is untested, flag as CONDITIONAL (not FAILED). Provide specific test scenarios to add.
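To make the authorization item concrete, here is a minimal, self-contained sketch of the kind of negative tests that were missing. The auth context, error type, scope names, and query function are stand-ins for illustration, not the project’s real API:

```rust
#[derive(Debug, PartialEq)]
enum ApiError {
    AccessDenied,
}

struct AuthContext {
    tenant_id: String,
    scopes: Vec<String>,
}

/// Stand-in for the real handler: checks OAuth scope and tenant before touching data.
fn query_partner_cost_matrix(
    ctx: &AuthContext,
    partner_tenant_id: &str,
) -> Result<Vec<(String, f64)>, ApiError> {
    if !ctx.scopes.iter().any(|s| s == "billing:read") {
        return Err(ApiError::AccessDenied); // missing required scope
    }
    if ctx.tenant_id != partner_tenant_id {
        return Err(ApiError::AccessDenied); // tenant isolation: never serve cross-tenant data
    }
    Ok(vec![("api_calls".to_string(), 0.002)])
}

#[test]
fn rejects_request_without_billing_scope() {
    let ctx = AuthContext {
        tenant_id: "tenant-a".into(),
        scopes: vec!["partners:read".into()],
    };
    assert_eq!(
        query_partner_cost_matrix(&ctx, "tenant-a"),
        Err(ApiError::AccessDenied)
    );
}

#[test]
fn rejects_cross_tenant_access() {
    let ctx = AuthContext {
        tenant_id: "tenant-a".into(),
        scopes: vec!["billing:read".into()],
    };
    assert_eq!(
        query_partner_cost_matrix(&ctx, "tenant-b"),
        Err(ApiError::AccessDenied)
    );
}
```

Tests like these would have surfaced the mock-auth gap before merge, because they exercise the real authorization path instead of bypassing it.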
Lesson learned: AI excels at explicit requirements but struggles with implicit “you should know” requirements. The solution isn’t better prompts - it’s better checklists.

What’s Next: Week 5 Preview

Week 4 revealed that AI excels at systematic work. Week 5 will test the limits: can AI handle architectural refactoring?
Planned work:
  • Refactor DynamoDB entities to single-table design (breaking change)
  • Migrate event schema to versioned events (backward compatibility required)
  • Consolidate API routes (eliminate duplication)
  • Performance optimization (query patterns, indexing strategy)
The challenge: This isn’t greenfield implementation. This is changing working code without breaking anything. The question: Can the multi-agent workflow handle:
  • Understanding existing code deeply enough to refactor safely?
  • Maintaining backward compatibility?
  • Verifying refactoring didn’t change behavior?
Hypothesis: Refactoring requires more human involvement than greenfield work, because implicit assumptions are harder for AI to discover. We’ll find out.

Week 5: Refactoring with AI

Next week: When changing existing code is harder than writing new code

Code Examples (Sanitized)

Here’s the event collector utility we built for E2E testing:
use std::time::{Duration, Instant};

// `Event`, `Error`, and the `Result` alias are the project's own types (omitted here).

/// Async event collector with timeout and filtering
pub struct EventCollector {
    queue_url: String,
    sqs_client: aws_sdk_sqs::Client,
    timeout: Duration,
}

impl EventCollector {
    /// Collect events matching predicate within timeout
    pub async fn collect_events<F>(
        &self,
        predicate: F,
        expected_count: usize,
    ) -> Result<Vec<Event>>
    where
        F: Fn(&Event) -> bool,
    {
        let start = Instant::now();
        let mut collected = Vec::new();

        // Start with a short poll delay; it doubles each round, capped below
        let mut delay = Duration::from_millis(100);

        while start.elapsed() < self.timeout {
            // Poll SQS queue
            let messages = self.sqs_client
                .receive_message()
                .queue_url(&self.queue_url)
                .max_number_of_messages(10)
                .wait_time_seconds(1)
                .send()
                .await?
                .messages
                .unwrap_or_default();

            for msg in messages {
                // `body()` returns Option<&str>; skip any message without a body
                let Some(body) = msg.body() else { continue };
                let event: Event = serde_json::from_str(body)?;

                if predicate(&event) {
                    collected.push(event);

                    if collected.len() >= expected_count {
                        return Ok(collected);
                    }
                }
            }

            // Exponential backoff between polls, capped at 5 seconds
            tokio::time::sleep(delay).await;
            delay = (delay * 2).min(Duration::from_secs(5));
        }

        Err(Error::EventCollectionTimeout {
            expected: expected_count,
            received: collected.len(),
            elapsed: start.elapsed(),
        })
    }
}
Usage in tests:
#[tokio::test]
async fn test_lead_conversion_emits_events() -> Result<()> {
    // Fixture setup (LocalStack resources, seeded `lead_id`) omitted for brevity
    let collector = EventCollector::new("test-queue-url", Duration::from_secs(10));

    // Trigger lead conversion
    convert_lead_to_opportunity(lead_id).await?;

    // Collect events
    let events = collector
        .collect_events(
            |e| e.entity_type == "Lead" || e.entity_type == "Opportunity",
            2, // Expect: LeadConverted + OpportunityCreated
        )
        .await?;

    // Verify event ordering and payload
    assert_eq!(events[0].event_type, "LeadConverted");
    assert_eq!(events[1].event_type, "OpportunityCreated");
    assert_eq!(events[1].payload["lead_id"], lead_id);

    Ok(())
}
What made this work:
  • Exponential backoff handles EventBridge → SQS delays
  • Predicate filtering allows flexible event matching
  • Helpful error messages on timeout (shows expected vs received)
  • Reusable across all 21 test scenarios

Discussion: Where Does AI Excel for You?

What Work Became Worth Doing?

Have you found work that’s now “worth it” with AI that wasn’t before? I’d love to hear:
  • What systematic work do you delegate to AI?
  • What testing became economically viable?
  • What documentation do you actually create now?
Share your experience:

Disclaimer: This content documents my personal AI workflow experiments. All examples are from personal projects and have been sanitized to remove proprietary information. Code snippets are generic patterns for educational purposes. This does not represent my employer’s technologies or approaches.