
Week 1: Setting Up Multi-Agent AI Workflow

This is Week 1 of “Building with AI” - a 10-week journey documenting how I use multi-agent AI workflows to build a production-grade SaaS platform. This week: why single-prompt coding fails and how to structure specialized AI agents for real work.

The Problem with Single-Prompt Coding

I started like everyone else: asking Claude to “write a function that does X.” It worked… for simple tasks. But when I asked Claude to “implement event sourcing for a multi-tenant platform,” I got:
  • ✅ Code that compiles
  • ✅ Looks reasonable at first glance
  • ❌ Violates our multi-tenancy requirements
  • ❌ Missing critical edge cases
  • ❌ Tests verify wrong behavior
The issue: A single AI session tries to do everything - planning, implementing, reviewing - and fails at the transitions.
Key Insight: AI in a single session develops implementation bias. It “remembers” the shortcuts it took while writing code, so when you ask it to verify, it validates its own assumptions instead of checking requirements.

The Multi-Agent Solution

Instead of one AI doing everything, I split the work across three specialized agents:

Agent 1: Evaluator (Planning)

Model: Claude Opus (best reasoning)
Role: Architect and decision-maker
Responsibilities:
  • Read requirements thoroughly
  • Explore existing codebase patterns
  • Design solution with trade-offs
  • Create detailed implementation plan
  • Get human approval before any code
Why Opus: Architecture decisions need deep reasoning. Opus excels at understanding context and evaluating trade-offs.
Example prompt:
This is a planning session for implementing event sourcing in our
multi-tenant SaaS platform.

Requirements:
- All state changes must be audited
- Support time-travel queries
- Tenant isolation is critical

Explore the codebase to understand:
1. Existing DynamoDB patterns
2. Current multi-tenancy implementation
3. Event handling infrastructure

Then design an event sourcing architecture with:
- DynamoDB schema design
- Event versioning strategy
- Replay mechanism
- Tenant isolation approach

Provide 2-3 options with trade-offs, then recommend one.
Output: A detailed plan in .plans/357-event-sourcing.md with:
  • Architecture decision rationale
  • Files to create/modify
  • Data model design
  • Test strategy
  • Risks and mitigation

Agent 2: Builder (Implementation)

Model: Claude Sonnet + GitHub Copilot
Role: Implementer following the approved plan
Responsibilities:
  • Read and follow approved plan exactly
  • Write code matching specified patterns
  • Create four-level test suite
  • Request verification when complete
Why Sonnet: Good at complex logic, faster than Opus, cost-effective for implementation.
Why + Copilot: Copilot handles boilerplate autocomplete (free), Sonnet handles complex domain logic.
Example prompt:
Implement event sourcing per approved plan:
.plans/357-event-sourcing.md

Focus on:
1. EventStore trait with append/load methods
2. DynamoDB entity with #[derive(DynamoDbEntity)]
3. Repository implementation (InMemory + DynamoDB)
4. Four-level test suite:
   - L1: Unit tests for event validation
   - L2: Repository integration tests with LocalStack
   - L3: Event flow tests (DynamoDB Streams → EventBridge → SQS)
   - L4: E2E workflow test

Follow existing patterns from:
- eva-auth/src/infrastructure/entities/security_group_entity.rs (DynamoDB entity pattern)
- eva-auth/src/infrastructure/repositories/security_group_repository.rs (repository pattern)
Output: Working implementation with comprehensive tests.
Critical: The Builder NEVER deviates from the plan without approval. If the plan needs changes, the Builder asks the Evaluator for a revision.
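The prompt above asks for an EventStore trait with append/load methods and both InMemory and DynamoDB implementations. The post doesn’t show the real interface, so here is a minimal sketch of what such a trait might look like in Rust; every type name, field, and signature below is an illustrative assumption, not the project’s actual code:

use async_trait::async_trait;

// Illustrative event record -- the real project's event type will differ.
#[derive(Debug, Clone)]
pub struct Event {
    pub tenant_id: String,
    pub aggregate_id: String,
    pub sequence: u64,
    pub payload: String, // serialized event body (illustrative)
}

// Illustrative error type for store operations.
#[derive(Debug)]
pub enum EventStoreError {
    Conflict,        // optimistic concurrency violation
    Storage(String), // underlying storage failure
}

// Hypothetical trait: append events for one tenant's aggregate, load them back in order.
#[async_trait]
pub trait EventStore: Send + Sync {
    async fn append(&self, events: &[Event]) -> Result<(), EventStoreError>;
    async fn load(&self, tenant_id: &str, aggregate_id: &str)
        -> Result<Vec<Event>, EventStoreError>;
}

Keeping the trait storage-agnostic is what lets the InMemory and DynamoDB implementations from the plan swap freely between tests and production.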

Agent 3: Verifier (Quality Gate)

Model: Claude Sonnet (FRESH session, not a reused Builder session)
Role: Independent code reviewer
Responsibilities:
  • Read requirements from scratch
  • Review implementation against plan
  • Check test coverage (all 4 levels present?)
  • Validate edge cases
  • Report pass/conditional/fail decision
Why Fresh Session: This is CRITICAL. If you reuse the Builder’s session for verification, the AI is biased toward its own implementation. A fresh Verifier session:
  • Has no memory of implementation shortcuts
  • Must read requirements independently
  • Catches assumptions Builder made
Example prompt:
Verify implementation for issue #357 against requirements.

Read:
1. Original issue: gh issue view #357
2. Approved plan: .plans/357-event-sourcing.md
3. Implementation: git diff main...HEAD

Check:
- Requirements coverage (are all acceptance criteria met?)
- Plan adherence (did implementation follow approved design?)
- Test coverage (L1, L2, L3, L4 tests present and adequate?)
- Edge cases (empty input, concurrent access, failure scenarios?)
- Documentation (API docs, examples, PRD updated?)

Post verification report with decision:
- PASSED: Ready for human review
- CONDITIONAL: Minor issues to fix (list them)
- FAILED: Significant gaps (list them with severity)
Output: Structured verification report
Verification Report Example:
## Verification Report: #357

### 1. Requirements Coverage
| Requirement | Met? | Test | Notes |
|-------------|------|------|-------|
| Audit all state changes | ✅ | L1, L2 | EventStore appends |
| Time-travel queries | ✅ | L3 | Event replay tested |
| Tenant isolation | ⚠️ | L2 | Tests pass, but missing L4 cross-tenant test |

### 2. Plan Adherence
- ✅ EventStore trait matches plan
- ✅ DynamoDB schema as designed
- ⚠️ DEVIATION: Added `event_version` field (not in plan)
  - Justified in PR comment #15 (needed for versioning)

### 3. Test Coverage
| Level | Required | Present | Adequate? |
|-------|----------|---------|-----------|
| L1: Unit | ✅ | 12 tests | ✅ Good coverage |
| L2: Repository | ✅ | 6 tests | ✅ CRUD + GSI |
| L3: Event Flow | ✅ | 2 tests | ⚠️ Missing failure scenario |
| L4: E2E | ✅ | 1 test | ⚠️ Missing cross-tenant negative test |

### 4. Edge Cases
- ✅ Empty event list
- ✅ Concurrent append
- ⚠️ MISSING: Cross-tenant event access attempt (should fail)

### Decision: ⚠️ CONDITIONAL PASS

**Required before merge:**
1. Add L3 test for event delivery failure
2. Add L4 test for cross-tenant isolation

**Estimated effort:** 1 hour

Setting Up the Workflow

Tool: Claude Code CLI

I use Claude Code for all three agents. Installation:
npm install -g @anthropic-ai/claude-code
claude --version

Configuration

Create .claude/settings.local.json:
{
  "agents": {
    "evaluator": {
      "model": "opus-4.5",
      "role": "planning",
      "outputDir": ".plans/"
    },
    "builder": {
      "model": "sonnet-4.5",
      "role": "implementation"
    },
    "verifier": {
      "model": "sonnet-4.5",
      "role": "verification",
      "freshSession": true
    }
  }
}

Session Management

Three separate terminal windows/tabs:
Terminal 1: Evaluator (Opus)
claude --model opus
# Planning session
> "Planning session for #357: Event sourcing design"
Terminal 2: Builder (Sonnet)
claude --model sonnet
# Implementation session
> "Implement #357 per plan: .plans/357-event-sourcing.md"
Terminal 3: Verifier (Sonnet - FRESH)
claude --model sonnet  # New session, not reused!
# Verification session
> "Verify implementation for #357"

First Feature: Event Sourcing for Multi-Tenant Platform

Let me walk through how this workflow played out for my first real feature.

Phase 1: Planning (Evaluator)

Session started: Early in the morning
Prompt:
Planning session for implementing event sourcing.

Context: Building multi-tenant SaaS platform with strict data isolation.
Need event sourcing for audit compliance.

Explore codebase to understand existing patterns, then design solution.
Evaluator’s process:
  1. Read ADR-0001 (existing architecture decisions)
  2. Grep for DynamoDB patterns
  3. Read existing repository implementations
  4. Ask clarifying questions:
    • “Should events be tenant-scoped or global?”
    • “Event versioning strategy?”
    • “Snapshot frequency?”
Output (90 minutes later): Plan file with 3 options:
  • Option A: Single events table, tenant prefix in PK
  • Option B: Per-tenant events tables (rejected: management overhead)
  • Option C: Single table + GSI for queries (rejected: complexity)
Recommendation: Option A, with rationale (key scheme sketched below)
Human approval: “Proceed with Option A”
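For context, Option A keeps all tenants in one events table but forces isolation at the key level: every event lives under a partition key that begins with the tenant ID. A minimal sketch of such a key scheme, where the prefixes and zero-padding are my assumptions rather than the actual schema:

/// Build DynamoDB partition/sort keys for a tenant-scoped event (hypothetical format).
/// PK = "TENANT#<tenant>#AGG#<aggregate>", SK = zero-padded sequence number.
fn event_keys(tenant_id: &str, aggregate_id: &str, sequence: u64) -> (String, String) {
    let pk = format!("TENANT#{tenant_id}#AGG#{aggregate_id}");
    let sk = format!("EVENT#{sequence:020}");
    (pk, sk)
}

Because the tenant ID is baked into every partition key, a query cannot cross tenants by construction, which is what makes the shared-table option workable.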

Phase 2: Implementation (Builder)

Session started: Later that morning
Prompt:
Implement event sourcing per plan: .plans/357-event-sourcing.md

Use existing patterns from eva-auth crate.
Create four-level test suite.
Builder’s work (4 hours):
  • Created EventStore trait
  • Implemented DynamoDbEventStore with tenant isolation
  • Created InMemoryEventStore for tests (sketched below)
  • Wrote 20 tests across 4 levels
  • Ran cargo test - all passed
  • Requested verification
Output: PR ready for review
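The InMemoryEventStore mentioned above is a test double for the DynamoDB-backed store. A minimal sketch of what one might look like, reusing the hypothetical Event/EventStore types sketched earlier (again illustrative, not the author’s code):

use std::collections::HashMap;
use std::sync::Mutex;
use async_trait::async_trait;

// Hypothetical test double: events grouped per (tenant_id, aggregate_id) pair.
#[derive(Default)]
pub struct InMemoryEventStore {
    events: Mutex<HashMap<(String, String), Vec<Event>>>,
}

#[async_trait]
impl EventStore for InMemoryEventStore {
    async fn append(&self, events: &[Event]) -> Result<(), EventStoreError> {
        let mut map = self.events.lock().unwrap();
        for e in events {
            map.entry((e.tenant_id.clone(), e.aggregate_id.clone()))
                .or_default()
                .push(e.clone());
        }
        Ok(())
    }

    async fn load(&self, tenant_id: &str, aggregate_id: &str)
        -> Result<Vec<Event>, EventStoreError> {
        let map = self.events.lock().unwrap();
        Ok(map
            .get(&(tenant_id.to_string(), aggregate_id.to_string()))
            .cloned()
            .unwrap_or_default())
    }
}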

Phase 3: Verification (Verifier - FRESH)

Session started: Afternoon (new session)
Prompt:
Verify implementation for #357 against requirements.

Read:
- Issue #357
- Plan: .plans/357-event-sourcing.md
- Implementation: git diff main...HEAD
Verifier’s findings:
  1. ⚠️ Missing L3 test for event stream failure
  2. ⚠️ Missing L4 cross-tenant negative test
  3. ✅ All other requirements met
Decision: CONDITIONAL PASS
Builder fixed the issues (45 minutes)
Re-verification: PASSED
Human review: Approved and merged
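For reference, the L4 cross-tenant negative test the Verifier asked for might look roughly like this; the test_event helper and the store wiring are hypothetical stand-ins for the real test setup:

#[tokio::test]
async fn tenant_b_cannot_read_tenant_a_events() {
    // Hypothetical in-memory store standing in for the DynamoDB-backed one.
    let store = InMemoryEventStore::default();

    // Tenant A writes an event against aggregate "order-1".
    store.append(&[test_event("tenant-a", "order-1", 1)]).await.unwrap();

    // Tenant B loading the same aggregate ID must see nothing.
    let events = store.load("tenant-b", "order-1").await.unwrap();
    assert!(events.is_empty(), "cross-tenant read must not return tenant A's events");
}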

Results: Week 1 Feature

Metrics:
Total time: 7 hours (planning: 1.5h, implementation: 4h, verification: 0.5h, fixes: 1h)
vs. manual estimate: 12 hours
Savings: 42% faster

What I Learned

Fresh Sessions Matter

Reusing the Builder session for verification caught 40% fewer bugs than a fresh Verifier session. Always start a new session for verification.

Planning Saves Rework

Without an Evaluator plan, the Builder made the wrong architecture choice twice. Spend 20% of the time planning to save 80% of the rework.

Model Selection Matters

I tried using Sonnet for planning; it missed a subtle multi-tenancy requirement. Use Opus for architecture, Sonnet for implementation.

Workflows Over Prompts

The multi-agent workflow caught issues the single-session approach missed. Structure matters more than prompt engineering.

Common Pitfalls (and How to Avoid Them)

Mistake #1: Reusing Sessions

Don’t do this:
# Builder session
> "Implement #357"
# ... implementation done ...
> "Now verify what you just built"  # ❌ WRONG
Why: Builder is biased toward its own implementation.
Do this instead:
# Builder session
> "Implement #357"
# ... done, close session ...

# NEW Verifier session
> "Verify implementation for #357"  # ✅ CORRECT

Mistake #2: Skipping Planning

I tried skipping the Evaluator once: “It’s a simple feature, just build it.” Result:
  • Builder chose wrong pattern (missed tenant isolation edge case)
  • Had to rewrite 60% of code
  • Wasted 3 hours
Lesson: Even “simple” features need planning in complex systems.

Mistake #3: Trusting “All Tests Pass”

The Builder created 15 tests. All green. Shipped.
Bug in production: the tests verified the wrong behavior (the AI hallucinated a requirement).
Fix: the Verifier now reviews test assertions, not just coverage.
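A contrived illustration of that failure mode: the test below stays green because it asserts the behavior the AI actually implemented, not the behavior the requirement asked for (function names are made up):

#[test]
fn rejects_events_without_tenant_id() {
    let result = validate_event(&event_with_empty_tenant());

    // What shipped: asserting the implementation's actual (wrong) behavior.
    // assert!(result.is_ok());

    // What the requirement demands: events without a tenant ID must be rejected.
    assert!(result.is_err());
}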

Actionable Takeaways

If you’re starting with multi-agent AI:
  1. Start with 3 agents - Evaluator, Builder, Verifier (don’t overcomplicate)
  2. Always use fresh Verifier session - Independent context is critical
  3. Plan before implementing - 20% planning time saves 80% rework
  4. Choose models strategically:
    • Opus for planning/architecture
    • Sonnet for implementation
    • Fresh Sonnet for verification
  5. Measure everything:
    • Time per phase
    • Bugs found in verification vs production
    • Token usage and cost

Next Week: Plan → Implement → Verify Workflow

Now that we have agents set up, Week 2 will dive deeper into the three-phase workflow with quality gates:
  • How to write effective planning prompts
  • When to override AI suggestions
  • Auto-remediation (AI fixing its own bugs)

Discussion

How Are You Using AI?

Are you using multi-agent workflows? Single-session coding? Share your experience in the comments.

Disclaimer: This content documents my personal AI workflow experiments and does not represent my employer’s technologies or approaches. All examples are from personal projects. Code snippets are generic patterns for educational purposes.