The Problem with Single-Prompt Coding

I started like everyone else: asking Claude to “write a function that does X.” It worked… for simple tasks. But when I asked Claude to “implement event sourcing for a multi-tenant platform,” I got:
  • ✅ Code that compiles
  • ✅ Looks reasonable at first glance
  • ❌ Violates our multi-tenancy requirements
  • ❌ Missing critical edge cases
  • ❌ Tests verify wrong behavior
The issue: A single AI session tries to do everything - planning, implementing, reviewing - and fails at the transitions.
Key Insight: AI in a single session develops implementation bias. It “remembers” the shortcuts it took while writing code, so when you ask it to verify, it validates its own assumptions instead of checking requirements.

The Multi-Agent Solution

Instead of one AI doing everything, I split the work across three specialized agents:

Agent 1: Evaluator (Planning)

Model: Claude Opus (best reasoning)
Role: Architect and decision-maker
Responsibilities:
  • Read requirements thoroughly
  • Explore existing codebase patterns
  • Design solution with trade-offs
  • Create detailed implementation plan
  • Get human approval before any code
Why Opus: Architecture decisions need deep reasoning. Opus excels at understanding context and evaluating trade-offs.
Example prompt:
This is a planning session for implementing event sourcing in our
multi-tenant SaaS platform.

Requirements:
- All state changes must be audited
- Support time-travel queries
- Tenant isolation is critical

Explore the codebase to understand:
1. Existing DynamoDB patterns
2. Current multi-tenancy implementation
3. Event handling infrastructure

Then design an event sourcing architecture with:
- DynamoDB schema design
- Event versioning strategy
- Replay mechanism
- Tenant isolation approach

Provide 2-3 options with trade-offs, then recommend one.
Output: A detailed plan in .plans/357-event-sourcing.md with:
  • Architecture decision rationale
  • Files to create/modify
  • Data model design
  • Test strategy
  • Risks and mitigation

Agent 2: Builder (Implementation)

Model: Claude Sonnet + GitHub Copilot
Role: Implementer following the approved plan
Responsibilities:
  • Read and follow approved plan exactly
  • Write code matching specified patterns
  • Create four-level test suite
  • Request verification when complete
Why Sonnet: Good at complex logic, faster than Opus, cost-effective for implementation.
Why + Copilot: Copilot handles boilerplate autocomplete (free), Sonnet handles complex domain logic.
Example prompt:
Implement event sourcing per approved plan:
.plans/357-event-sourcing.md

Focus on:
1. EventStore trait with append/load methods
2. DynamoDB entity with #[derive(DynamoDbEntity)]
3. Repository implementation (InMemory + DynamoDB)
4. Four-level test suite:
   - L1: Unit tests for event validation
   - L2: Repository integration tests with LocalStack
   - L3: Event flow tests (DynamoDB Streams → EventBridge → SQS)
   - L4: E2E workflow test

Follow existing patterns from:
- eva-auth/src/infrastructure/entities/security_group_entity.rs
- eva-auth/src/infrastructure/repositories/security_group_repository.rs
Output: Working implementation with comprehensive tests
Critical: Builder NEVER deviates from the plan without getting approval. If the plan needs changes, Builder asks Evaluator for a plan revision.
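To make the first focus item in the prompt concrete, here is a minimal sketch of what an EventStore trait with append/load methods could look like. Every type and method name below is an illustrative assumption, not the actual eva-auth API:
// Hypothetical sketch of the EventStore abstraction; names are illustrative.
pub struct Event {
    pub tenant_id: String,
    pub aggregate_id: String,
    pub version: u64,              // position within the aggregate's stream
    pub event_type: String,
    pub payload: serde_json::Value,
}

pub enum EventStoreError {
    VersionConflict { expected: u64, found: u64 },
    Storage(String),
}

// Implemented by both InMemoryEventStore (tests) and DynamoDbEventStore (production).
pub trait EventStore: Send + Sync {
    // Append events to one aggregate's stream; fails if another writer advanced it first.
    async fn append(
        &self,
        tenant_id: &str,
        aggregate_id: &str,
        expected_version: u64,
        events: Vec<Event>,
    ) -> Result<(), EventStoreError>;

    // Load the full event stream for one aggregate, oldest first.
    async fn load(
        &self,
        tenant_id: &str,
        aggregate_id: &str,
    ) -> Result<Vec<Event>, EventStoreError>;
}
The trait is the seam that lets L1/L2 tests run against an in-memory store while production code talks to DynamoDB.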

Agent 3: Verifier (Quality Gate)

Model: Claude Sonnet (FRESH session, not a reused Builder session)
Role: Independent code reviewer
Responsibilities:
  • Read requirements from scratch
  • Review implementation against plan
  • Check test coverage (all 4 levels present?)
  • Validate edge cases
  • Report pass/conditional/fail decision
Why Fresh Session: This is CRITICAL. If you reuse the Builder’s session for verification, the AI is biased toward its own implementation. A fresh Verifier session:
  • Has no memory of implementation shortcuts
  • Must read requirements independently
  • Catches assumptions Builder made
Example prompt:
Verify implementation for issue #357 against requirements.

Read:
1. Original issue: gh issue view #357
2. Approved plan: .plans/357-event-sourcing.md
3. Implementation: git diff main...HEAD

Check:
- Requirements coverage (are all acceptance criteria met?)
- Plan adherence (did implementation follow approved design?)
- Test coverage (L1, L2, L3, L4 tests present and adequate?)
- Edge cases (empty input, concurrent access, failure scenarios?)
- Documentation (API docs, examples, PRD updated?)

Post verification report with decision:
- PASSED: Ready for human review
- CONDITIONAL: Minor issues to fix (list them)
- FAILED: Significant gaps (list them with severity)
Output: Structured verification report
Verification Report Example:
## Verification Report: #357

### 1. Requirements Coverage
| Requirement | Met? | Test | Notes |
|-------------|------|------|-------|
| Audit all state changes | ✅ | L1, L2 | EventStore appends |
| Time-travel queries | ✅ | L3 | Event replay tested |
| Tenant isolation | ⚠️ | L2 | Tests pass, but missing L4 cross-tenant test |

### 2. Plan Adherence
- ✅ EventStore trait matches plan
- ✅ DynamoDB schema as designed
- ⚠️ DEVIATION: Added `event_version` field (not in plan)
  - Justified in PR comment #15 (needed for versioning)

### 3. Test Coverage
| Level | Required | Present | Adequate? |
|-------|----------|---------|-----------|
| L1: Unit | ✅ | 12 tests | ✅ Good coverage |
| L2: Repository | ✅ | 6 tests | ✅ CRUD + GSI |
| L3: Event Flow | ✅ | 2 tests | ⚠️ Missing failure scenario |
| L4: E2E | ✅ | 1 test | ⚠️ Missing cross-tenant negative test |

### 4. Edge Cases
- ✅ Empty event list
- ✅ Concurrent append
- ⚠️ MISSING: Cross-tenant event access attempt (should fail)

### Decision: ⚠️ CONDITIONAL PASS

**Required before merge:**
1. Add L3 test for event delivery failure
2. Add L4 test for cross-tenant isolation

Setting Up the Workflow

Tool: Claude Code CLI

I use Claude Code for all three agents. Installation:
npm install -g @anthropic-ai/claude-code
claude --version

Configuration

Create .claude/settings.local.json:
{
  "agents": {
    "evaluator": {
      "model": "opus-4.5",
      "role": "planning",
      "outputDir": ".plans/"
    },
    "builder": {
      "model": "sonnet-4.5",
      "role": "implementation"
    },
    "verifier": {
      "model": "sonnet-4.5",
      "role": "verification",
      "freshSession": true
    }
  }
}

Session Management

Three separate terminal windows/tabs:
Terminal 1: Evaluator (Opus)
export CLAUDE_MODEL=opus
claude
> "Planning session for #357: Event sourcing design"
Terminal 2: Builder (Sonnet)
export CLAUDE_MODEL=sonnet
claude
> "Implement #357 per plan: .plans/357-event-sourcing.md"
Terminal 3: Verifier (Sonnet - FRESH)
export CLAUDE_MODEL=sonnet
claude  # New session, not reused!
> "Verify implementation for #357"

First Feature: Event Sourcing for Multi-Tenant Platform

Let me walk through how this workflow played out for my first real feature.

Phase 1: Planning (Evaluator)

Prompt:
Planning session for implementing event sourcing.

Context: Building multi-tenant SaaS platform with strict data isolation.
Need event sourcing for audit compliance.

Explore codebase to understand existing patterns, then design solution.
Evaluator’s process:
  1. Read ADR-0001 (existing architecture decisions)
  2. Grep for DynamoDB patterns
  3. Read existing repository implementations
  4. Ask clarifying questions:
    • “Should events be tenant-scoped or global?”
    • “Event versioning strategy?”
    • “Snapshot frequency?”
Output: Plan file with 3 options:
  • Option A: Single events table, tenant prefix in PK
  • Option B: Per-tenant events tables (rejected: management overhead)
  • Option C: Single table + GSI for queries (rejected: complexity)
Recommendation: Option A, with rationale
Human approval: “Proceed with Option A”
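To illustrate what “tenant prefix in PK” means in Option A: the tenant ID is baked into the partition key, so a query scoped to one tenant physically cannot return another tenant's events. A hypothetical key layout (the key names are assumptions, not the actual schema):
// Hypothetical key layout for Option A: single events table, tenant prefix in the PK.
// PK = TENANT#<tenant_id>#AGG#<aggregate_id>, SK = EVENT#<zero-padded version>
fn event_pk(tenant_id: &str, aggregate_id: &str) -> String {
    format!("TENANT#{tenant_id}#AGG#{aggregate_id}")
}

fn event_sk(version: u64) -> String {
    // Zero-pad so lexicographic sort order matches numeric replay order.
    format!("EVENT#{version:020}")
}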

Phase 2: Implementation (Builder)

Prompt:
Implement event sourcing per plan: .plans/357-event-sourcing.md

Use existing patterns from eva-auth crate.
Create four-level test suite.
Builder’s work:
  • Created EventStore trait
  • Implemented DynamoDbEventStore with tenant isolation
  • Created InMemoryEventStore for tests
  • Wrote 20 tests across 4 levels
  • Ran cargo test - all passed
  • Requested verification
Output: PR ready for review
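For a sense of how a tenant-scoped, concurrency-safe append can work in DynamoDB, here is a rough sketch using the aws-sdk-dynamodb crate and the same hypothetical key layout as above; the table name, attribute names, and error handling are assumptions, not the actual implementation:
// Sketch of an optimistically-concurrent, tenant-scoped append (assumed names).
use aws_sdk_dynamodb::{types::AttributeValue, Client};

async fn append_event(
    client: &Client,
    tenant_id: &str,
    aggregate_id: &str,
    version: u64,
    payload: String,
) -> Result<(), aws_sdk_dynamodb::Error> {
    client
        .put_item()
        .table_name("events")
        .item("PK", AttributeValue::S(format!("TENANT#{tenant_id}#AGG#{aggregate_id}")))
        .item("SK", AttributeValue::S(format!("EVENT#{version:020}")))
        .item("payload", AttributeValue::S(payload))
        // Reject the write if this (PK, SK) already exists, i.e. another
        // writer appended the same version concurrently.
        .condition_expression("attribute_not_exists(PK) AND attribute_not_exists(SK)")
        .send()
        .await?;
    Ok(())
}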

Phase 3: Verification (Verifier - FRESH)

Prompt:
Verify implementation for #357 against requirements.

Read:
- Issue #357
- Plan: .plans/357-event-sourcing.md
- Implementation: git diff main...HEAD
Verifier found 3 issues:
  1. ⚠️ Missing L3 test for event stream failure
  2. ⚠️ Missing L4 cross-tenant negative test
  3. ✅ All other requirements met
Decision: CONDITIONAL PASS
Builder fixed the issues.
Re-verification: PASSED
Human review: Approved and merged
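The cross-tenant negative test the Verifier asked for boils down to: write events as tenant A, then prove tenant B cannot read them. A hypothetical version, with the store type and helper names assumed in line with the trait sketch earlier:
// Hypothetical L4-style cross-tenant isolation test; InMemoryEventStore and
// sample_event are assumed helpers, not the real test code.
#[tokio::test]
async fn tenant_b_cannot_read_tenant_a_events() {
    let store = InMemoryEventStore::default();

    // Tenant A appends an event to its own aggregate.
    store
        .append("tenant-a", "order-1", 0, vec![sample_event("tenant-a", "order-1")])
        .await
        .expect("append for tenant A should succeed");

    // Tenant B requests the same aggregate ID under its own tenant scope.
    let events = store
        .load("tenant-b", "order-1")
        .await
        .expect("load should succeed even when the stream is empty");

    // Isolation means tenant B sees nothing, not tenant A's data.
    assert!(events.is_empty());
}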

Results

Metrics:
  • Bugs found in verification: 2 (missing tests)
  • Bugs found in production: 0
  • Test coverage: 92% (20 tests across 4 levels)
  • Rework cycles: 1 (conditional pass → fix → pass)

What I Learned

Fresh Sessions Matter

Reusing the Builder session for verification caught 40% fewer bugs than a fresh Verifier session. Always start a new session for verification.

Planning Saves Rework

Without an Evaluator plan, the Builder made the wrong architecture choice twice. Spend 20% of your time planning to save 80% of the rework.

Model Selection Matters

I tried using Sonnet for planning and it missed a subtle multi-tenancy requirement. Use Opus for architecture, Sonnet for implementation.

Workflows Over Prompts

The multi-agent workflow caught issues the single-session approach missed. Structure matters more than prompt engineering.

Common Pitfalls (And How to Avoid Them)

Mistake #1: Reusing Sessions

Don’t do this:
> "Implement #357"
> "Now verify what you just built"  # ❌ WRONG
Why: The Builder is biased toward its own implementation.
Do this instead:
> "Implement #357"

> "Verify implementation for #357"  # ✅ CORRECT

Mistake #2: Skipping Planning

I tried skipping Evaluator once: “It’s a simple feature, just build it.” Result:
  • Builder chose wrong pattern (missed tenant isolation edge case)
  • Had to rewrite 60% of code
  • Wasted 3 hours
Lesson: Even “simple” features need planning in complex systems.

Mistake #3: Trusting “All Tests Pass”

Builder created 15 tests. All green. Shipped.
Bug in production: The tests verified the wrong behavior (the AI hallucinated a requirement).
Fix: The Verifier now reviews test assertions, not just coverage.
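To make “tests verified wrong behavior” concrete, here is a contrived sketch of the failure mode; the store type and helper are hypothetical:
// A green-but-wrong test: it passes against the code as written, but it asserts
// the opposite of the real requirement (replay must be oldest-first).
#[tokio::test]
async fn replays_events_in_order() {
    let store = InMemoryEventStore::default();
    store
        .append("tenant-a", "order-1", 0, vec![
            event_of_type("OrderCreated"),   // hypothetical test helper
            event_of_type("OrderShipped"),
        ])
        .await
        .unwrap();

    let events = store.load("tenant-a", "order-1").await.unwrap();
    // The AI hallucinated a "newest-first" requirement and the test blesses it,
    // so "all tests pass" while the replay order is actually wrong.
    assert_eq!(events[0].event_type, "OrderShipped");
}
Reviewing the assertion against the requirement, not just counting tests, is exactly what the Verifier is for.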

Actionable Takeaways

If you’re starting with multi-agent AI:
  1. Start with 3 agents - Evaluator, Builder, Verifier (don’t overcomplicate)
  2. Always use fresh Verifier session - Independent context is critical
  3. Plan before implementing - 20% planning time saves 80% rework
  4. Choose models strategically:
    • Opus for planning/architecture
    • Sonnet for implementation
    • Fresh Sonnet for verification
  5. Measure everything:
    • Time per phase
    • Bugs found in verification vs production
    • Token usage and cost