This is Week 1 of “Building with AI” - a 10-week journey documenting how I use multi-agent AI workflows to build a production-grade SaaS platform. This week: why single-prompt coding fails and how to structure specialized AI agents for real work.
I started like everyone else: asking Claude to “write a function that does X.” It worked… for simple tasks. But when I asked Claude to “implement event sourcing for a multi-tenant platform,” I got:
✅ Code that compiles
✅ Looks reasonable at first glance
❌ Violates our multi-tenancy requirements
❌ Missing critical edge cases
❌ Tests verify wrong behavior
The issue: A single AI session tries to do everything - planning, implementing, reviewing - and fails at the transitions.
Key Insight: AI in a single session develops implementation bias. It “remembers” the shortcuts it took while writing code, so when you ask it to verify, it validates its own assumptions instead of checking requirements.
Model: Claude Opus (best reasoning)
Role: Architect and decision-maker
Responsibilities:
Read requirements thoroughly
Explore existing codebase patterns
Design solution with trade-offs
Create detailed implementation plan
Get human approval before any code
Why Opus: Architecture decisions need deep reasoning. Opus excels at understanding context and evaluating trade-offs.
Example prompt:
```
This is a planning session for implementing event sourcing in our
multi-tenant SaaS platform.

Requirements:
- All state changes must be audited
- Support time-travel queries
- Tenant isolation is critical

Explore the codebase to understand:
1. Existing DynamoDB patterns
2. Current multi-tenancy implementation
3. Event handling infrastructure

Then design an event sourcing architecture with:
- DynamoDB schema design
- Event versioning strategy
- Replay mechanism
- Tenant isolation approach

Provide 2-3 options with trade-offs, then recommend one.
```
Output: A detailed plan in .plans/357-event-sourcing.md with the recommended option, the DynamoDB schema, the event versioning and replay strategy, and the tenant isolation approach.
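To make the schema side of that plan concrete, here is a minimal Rust sketch of the kind of tenant-scoped, versioned event record such a plan might specify, assuming a single events table with a tenant-prefixed partition key (the direction the walkthrough below picks as Option A). All names and fields are illustrative assumptions, not the actual plan output.

```rust
// Minimal sketch of a tenant-scoped, versioned event record and its key layout.
// All names and fields are illustrative assumptions, not the actual plan output.

/// One domain event as stored in a single events table.
struct EventRecord {
    tenant_id: String,
    aggregate_id: String,
    sequence: u64,       // per-aggregate ordering, used for replay
    event_type: String,  // e.g. "SecurityGroupCreated"
    schema_version: u32, // lets old events be upcast as the schema evolves
    payload: String,     // serialized event body
}

impl EventRecord {
    /// Partition key is tenant-prefixed, so a single query can never
    /// cross tenant boundaries.
    fn partition_key(&self) -> String {
        format!("TENANT#{}#AGG#{}", self.tenant_id, self.aggregate_id)
    }

    /// Sort key orders events within an aggregate for time-travel queries.
    fn sort_key(&self) -> String {
        format!("EVENT#{:020}", self.sequence)
    }
}

fn main() {
    let event = EventRecord {
        tenant_id: "tenant-42".into(),
        aggregate_id: "sg-7".into(),
        sequence: 3,
        event_type: "SecurityGroupCreated".into(),
        schema_version: 1,
        payload: "{}".into(),
    };
    println!(
        "PK={} SK={} type={} v{} payload={}",
        event.partition_key(),
        event.sort_key(),
        event.event_type,
        event.schema_version,
        event.payload
    );
}
```

Prefixing the partition key with the tenant ID makes isolation a property of the key itself: a query against one tenant's stream cannot return another tenant's events.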
Model: Claude Sonnet + GitHub Copilot
Role: Implementer following the approved plan
Responsibilities:
Read and follow approved plan exactly
Write code matching specified patterns
Create four-level test suite
Request verification when complete
Why Sonnet: Good at complex logic, faster than Opus, cost-effective for implementation.
Why + Copilot: Copilot handles boilerplate autocomplete (free), Sonnet handles complex domain logic.
Example prompt:
```
Implement event sourcing per approved plan: .plans/357-event-sourcing.md

Focus on:
1. EventStore trait with append/load methods
2. DynamoDB entity with #[derive(DynamoDbEntity)]
3. Repository implementation (InMemory + DynamoDB)
4. Four-level test suite:
   - L1: Unit tests for event validation
   - L2: Repository integration tests with LocalStack
   - L3: Event flow tests (DynamoDB Streams → EventBridge → SQS)
   - L4: E2E workflow test

Follow existing patterns from:
- eva-auth/src/infrastructure/entities/security_group_entity.rs (DynamoDB entity pattern)
- eva-auth/src/infrastructure/repositories/security_group_repository.rs (repository pattern)
```
Output: Working implementation with comprehensive tests.
Critical: Builder NEVER deviates from plan without getting approval. If plan needs changes, Builder asks Evaluator for plan revision.
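For a feel of what step 1 of that prompt produces, here is a minimal sketch of an EventStore trait with append/load methods plus the in-memory variant a Builder would use in unit tests. The signatures are assumptions for illustration; a production version would likely be async, work with typed domain events rather than strings, and back the second implementation with DynamoDB.

```rust
// Sketch of the EventStore trait named in the plan, with the in-memory
// variant used for unit tests. Signatures are illustrative assumptions.

use std::collections::HashMap;
use std::sync::Mutex;

type StreamId = (String, String); // (tenant_id, aggregate_id)

#[derive(Debug)]
enum EventStoreError {
    /// Optimistic-concurrency failure: someone appended first.
    VersionConflict { expected: u64, actual: u64 },
}

trait EventStore {
    /// Append events, failing if the stream has moved past `expected_version`.
    fn append(
        &self,
        stream: &StreamId,
        expected_version: u64,
        events: Vec<String>,
    ) -> Result<(), EventStoreError>;

    /// Load all events for one tenant-scoped stream, in order.
    fn load(&self, stream: &StreamId) -> Vec<String>;
}

/// In-memory implementation used in unit tests (DynamoDB variant omitted).
#[derive(Default)]
struct InMemoryEventStore {
    streams: Mutex<HashMap<StreamId, Vec<String>>>,
}

impl EventStore for InMemoryEventStore {
    fn append(
        &self,
        stream: &StreamId,
        expected_version: u64,
        events: Vec<String>,
    ) -> Result<(), EventStoreError> {
        let mut streams = self.streams.lock().unwrap();
        let existing = streams.entry(stream.clone()).or_default();
        let actual = existing.len() as u64;
        if actual != expected_version {
            return Err(EventStoreError::VersionConflict {
                expected: expected_version,
                actual,
            });
        }
        existing.extend(events);
        Ok(())
    }

    fn load(&self, stream: &StreamId) -> Vec<String> {
        self.streams
            .lock()
            .unwrap()
            .get(stream)
            .cloned()
            .unwrap_or_default()
    }
}

fn main() {
    let store = InMemoryEventStore::default();
    let stream = ("tenant-42".to_string(), "sg-7".to_string());
    store
        .append(&stream, 0, vec!["SecurityGroupCreated".into()])
        .unwrap();
    assert_eq!(store.load(&stream).len(), 1);
}
```

The expected_version check is the usual optimistic-concurrency guard for event streams: if two writers race, one gets a VersionConflict and must re-read before retrying.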
Model: Claude Sonnet (FRESH session, not reused Builder session)
Role: Independent code reviewer
Responsibilities:
Read requirements from scratch
Review implementation against plan
Check test coverage (all 4 levels present?)
Validate edge cases
Report pass/conditional/fail decision
Why Fresh Session: This is CRITICAL. If you reuse the Builder’s session for verification, the AI is biased toward its own implementation. A fresh Verifier session:
Has no memory of implementation shortcuts
Must read requirements independently
Catches assumptions Builder made
Example prompt:
```
Verify implementation for issue #357 against requirements.

Read:
1. Original issue: gh issue view #357
2. Approved plan: .plans/357-event-sourcing.md
3. Implementation: git diff main...HEAD

Check:
- Requirements coverage (are all acceptance criteria met?)
- Plan adherence (did implementation follow approved design?)
- Test coverage (L1, L2, L3, L4 tests present and adequate?)
- Edge cases (empty input, concurrent access, failure scenarios?)
- Documentation (API docs, examples, PRD updated?)

Post verification report with decision:
- PASSED: Ready for human review
- CONDITIONAL: Minor issues to fix (list them)
- FAILED: Significant gaps (list them with severity)
```
Here's how this played out on a real issue (#357). The planning prompt I gave the Evaluator:

```
Planning session for implementing event sourcing.

Context: Building multi-tenant SaaS platform with strict data isolation.
Need event sourcing for audit compliance.

Explore codebase to understand existing patterns, then design solution.
```
Evaluator’s process:
Read ADR-0001 (existing architecture decisions)
Grep for DynamoDB patterns
Read existing repository implementations
Ask clarifying questions:
“Should events be tenant-scoped or global?”
“Event versioning strategy?”
“Snapshot frequency?”
Output (90 minutes later):
Plan file with 3 options:
Option A: Single events table, tenant prefix in PK
Builder created 15 tests. All green. Shipped.
Bug in production: Tests verified wrong behavior (the AI hallucinated a requirement).
Fix: Verifier now reviews test assertions, not just coverage.
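To illustrate that failure mode, here is a hypothetical example (not the actual incident): a test that passes while encoding a hallucinated requirement. Coverage metrics count it as a tested path; only reading the assertion against the original issue catches the problem.

```rust
// Hypothetical example (not the real production bug): a passing test that
// encodes a hallucinated requirement instead of the real one.

/// Buggy implementation: ignores the tenant filter and returns every event.
fn events_visible_to(_tenant: &str, events: &[(&str, &str)]) -> Vec<String> {
    events.iter().map(|(_, event)| event.to_string()).collect()
}

#[test]
fn tenant_sees_only_its_own_events() {
    let events = [("tenant-a", "evt-1"), ("tenant-b", "evt-2")];
    let visible = events_visible_to("tenant-a", &events);

    // What the Builder asserted: green, but it verifies the wrong behavior.
    assert_eq!(visible.len(), 2);

    // What the requirement actually demands, and what a Verifier reading
    // assertions (not just coverage) would insist on:
    // assert_eq!(visible, vec!["evt-1".to_string()]);
}
```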
Disclaimer: This content documents my personal AI workflow experiments and does not represent my employer’s technologies or approaches. All examples are from personal projects. Code snippets are generic patterns for educational purposes.