The Problem with Single-Prompt Coding

I started like everyone else: asking Claude to “write a function that does X.” It worked… for simple tasks. But when I asked Claude to “implement event sourcing for a multi-tenant platform,” I got:
  • ✅ Code that compiles
  • ✅ Looks reasonable at first glance
  • ❌ Violates our multi-tenancy requirements
  • ❌ Missing critical edge cases
  • ❌ Tests verify wrong behavior
The issue: A single AI session tries to do everything - planning, implementing, reviewing - and fails at the transitions.
Key Insight: AI in a single session develops implementation bias. It “remembers” the shortcuts it took while writing code, so when you ask it to verify, it validates its own assumptions instead of checking requirements.

The Multi-Agent Solution

Instead of one AI doing everything, I split the work across three specialized agents:

Agent 1: Evaluator (Planning)

Model: Claude Opus (best reasoning)
Role: Architect and decision-maker
Responsibilities:
  • Read requirements thoroughly
  • Explore existing codebase patterns
  • Design solution with trade-offs
  • Create detailed implementation plan
  • Get human approval before any code
Why Opus: Architecture decisions need deep reasoning. Opus excels at understanding context and evaluating trade-offs.
Example prompt:
This is a planning session for implementing event sourcing in our
multi-tenant SaaS platform.

Requirements:
- All state changes must be audited
- Support time-travel queries
- Tenant isolation is critical

Explore the codebase to understand:
1. Existing DynamoDB patterns
2. Current multi-tenancy implementation
3. Event handling infrastructure

Then design an event sourcing architecture with:
- DynamoDB schema design
- Event versioning strategy
- Replay mechanism
- Tenant isolation approach

Provide 2-3 options with trade-offs, then recommend one.
Output: A detailed plan in .plans/357-event-sourcing.md with:
  • Architecture decision rationale
  • Files to create/modify
  • Data model design
  • Test strategy
  • Risks and mitigation

Agent 2: Builder (Implementation)

Model: Claude Sonnet + GitHub Copilot
Role: Implementer following the approved plan
Responsibilities:
  • Read and follow approved plan exactly
  • Write code matching specified patterns
  • Create four-level test suite
  • Request verification when complete
Why Sonnet: Good at complex logic, faster than Opus, cost-effective for implementation.
Why + Copilot: Copilot handles boilerplate autocomplete (free), Sonnet handles complex domain logic.
Example prompt:
Implement event sourcing per approved plan:
.plans/357-event-sourcing.md

Focus on:
1. EventStore trait with append/load methods
2. DynamoDB entity with #[derive(DynamoDbEntity)]
3. Repository implementation (InMemory + DynamoDB)
4. Four-level test suite:
   - L1: Unit tests for event validation
   - L2: Repository integration tests with LocalStack
   - L3: Event flow tests (DynamoDB Streams → EventBridge → SQS)
   - L4: E2E workflow test

Follow existing patterns from:
- eva-auth/src/infrastructure/entities/security_group_entity.rs
- eva-auth/src/infrastructure/repositories/security_group_repository.rs
Output: Working implementation with comprehensive tests
Critical: Builder NEVER deviates from the plan without getting approval. If the plan needs changes, Builder asks Evaluator for a plan revision.
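To make the first focus item in the prompt concrete, here is a minimal sketch of what an EventStore trait with append/load methods could look like. Every type and method name below is an illustrative assumption, not the actual eva-auth API:
// Hypothetical sketch of the EventStore abstraction; names are illustrative.
pub struct Event {
    pub tenant_id: String,
    pub aggregate_id: String,
    pub version: u64,              // position within the aggregate's stream
    pub event_type: String,
    pub payload: serde_json::Value,
}

pub enum EventStoreError {
    VersionConflict { expected: u64, found: u64 },
    Storage(String),
}

// Implemented by both InMemoryEventStore (tests) and DynamoDbEventStore (production).
pub trait EventStore: Send + Sync {
    // Append events to one aggregate's stream; fails if another writer advanced it first.
    async fn append(
        &self,
        tenant_id: &str,
        aggregate_id: &str,
        expected_version: u64,
        events: Vec<Event>,
    ) -> Result<(), EventStoreError>;

    // Load the full event stream for one aggregate, oldest first.
    async fn load(
        &self,
        tenant_id: &str,
        aggregate_id: &str,
    ) -> Result<Vec<Event>, EventStoreError>;
}
The trait is the seam that lets L1/L2 tests run against an in-memory store while production code talks to DynamoDB.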

Agent 3: Verifier (Quality Gate)

Model: Claude Sonnet (FRESH session, not a reused Builder session)
Role: Independent code reviewer
Responsibilities:
  • Read requirements from scratch
  • Review implementation against plan
  • Check test coverage (all 4 levels present?)
  • Validate edge cases
  • Report pass/conditional/fail decision
Why Fresh Session: This is CRITICAL. If you reuse the Builder’s session for verification, the AI is biased toward its own implementation. A fresh Verifier session:
  • Has no memory of implementation shortcuts
  • Must read requirements independently
  • Catches assumptions Builder made
Example prompt:
Verify implementation for issue #357 against requirements.

Read:
1. Original issue: gh issue view #357
2. Approved plan: .plans/357-event-sourcing.md
3. Implementation: git diff main...HEAD

Check:
- Requirements coverage (are all acceptance criteria met?)
- Plan adherence (did implementation follow approved design?)
- Test coverage (L1, L2, L3, L4 tests present and adequate?)
- Edge cases (empty input, concurrent access, failure scenarios?)
- Documentation (API docs, examples, PRD updated?)

Post verification report with decision:
- PASSED: Ready for human review
- CONDITIONAL: Minor issues to fix (list them)
- FAILED: Significant gaps (list them with severity)
Output: Structured verification report
Verification Report Example:
## Verification Report: #357

### 1. Requirements Coverage
| Requirement | Met? | Test | Notes |
|-------------|------|------|-------|
| Audit all state changes | ✅ | L1, L2 | EventStore appends |
| Time-travel queries | ✅ | L3 | Event replay tested |
| Tenant isolation | ⚠️ | L2 | Tests pass, but missing L4 cross-tenant test |

### 2. Plan Adherence
- ✅ EventStore trait matches plan
- ✅ DynamoDB schema as designed
- ⚠️ DEVIATION: Added `event_version` field (not in plan)
  - Justified in PR comment #15 (needed for versioning)

### 3. Test Coverage
| Level | Required | Present | Adequate? |
|-------|----------|---------|-----------|
| L1: Unit | ✅ | 12 tests | ✅ Good coverage |
| L2: Repository | ✅ | 6 tests | ✅ CRUD + GSI |
| L3: Event Flow | ✅ | 2 tests | ⚠️ Missing failure scenario |
| L4: E2E | ✅ | 1 test | ⚠️ Missing cross-tenant negative test |

### 4. Edge Cases
- ✅ Empty event list
- ✅ Concurrent append
- ⚠️ MISSING: Cross-tenant event access attempt (should fail)

### Decision: ⚠️ CONDITIONAL PASS

**Required before merge:**
1. Add L3 test for event delivery failure
2. Add L4 test for cross-tenant isolation

Setting Up the Workflow

Tool: Claude Code CLI

I use Claude Code for all three agents. Installation:
npm install -g @anthropic-ai/claude-code
claude --version

Configuration

Create .claude/settings.local.json:
{
  "agents": {
    "evaluator": {
      "model": "opus-4.5",
      "role": "planning",
      "outputDir": ".plans/"
    },
    "builder": {
      "model": "sonnet-4.5",
      "role": "implementation"
    },
    "verifier": {
      "model": "sonnet-4.5",
      "role": "verification",
      "freshSession": true
    }
  }
}

Session Management

Three separate terminal windows/tabs:
Terminal 1: Evaluator (Opus)
export CLAUDE_MODEL=opus
claude
> "Planning session for #357: Event sourcing design"
Terminal 2: Builder (Sonnet)
export CLAUDE_MODEL=sonnet
claude
> "Implement #357 per plan: .plans/357-event-sourcing.md"
Terminal 3: Verifier (Sonnet - FRESH)
export CLAUDE_MODEL=sonnet
claude  # New session, not reused!
> "Verify implementation for #357"

First Feature: Event Sourcing for Multi-Tenant Platform

Let me walk through how this workflow played out for my first real feature.

Phase 1: Planning (Evaluator)

Prompt:
Planning session for implementing event sourcing.

Context: Building multi-tenant SaaS platform with strict data isolation.
Need event sourcing for audit compliance.

Explore codebase to understand existing patterns, then design solution.
Evaluator’s process:
  1. Read ADR-0001 (existing architecture decisions)
  2. Grep for DynamoDB patterns
  3. Read existing repository implementations
  4. Ask clarifying questions:
    • “Should events be tenant-scoped or global?”
    • “Event versioning strategy?”
    • “Snapshot frequency?”
Output: Plan file with 3 options:
  • Option A: Single events table, tenant prefix in PK
  • Option B: Per-tenant events tables (rejected: management overhead)
  • Option C: Single table + GSI for queries (rejected: complexity)
Recommendation: Option A, with rationale
Human approval: “Proceed with Option A”
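To illustrate what “tenant prefix in PK” means in Option A: the tenant ID is baked into the partition key, so a query scoped to one tenant physically cannot return another tenant's events. A hypothetical key layout (the key names are assumptions, not the actual schema):
// Hypothetical key layout for Option A: single events table, tenant prefix in the PK.
// PK = TENANT#<tenant_id>#AGG#<aggregate_id>, SK = EVENT#<zero-padded version>
fn event_pk(tenant_id: &str, aggregate_id: &str) -> String {
    format!("TENANT#{tenant_id}#AGG#{aggregate_id}")
}

fn event_sk(version: u64) -> String {
    // Zero-pad so lexicographic sort order matches numeric replay order.
    format!("EVENT#{version:020}")
}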

Phase 2: Implementation (Builder)

Prompt:
Implement event sourcing per plan: .plans/357-event-sourcing.md

Use existing patterns from eva-auth crate.
Create four-level test suite.
Builder’s work:
  • Created EventStore trait
  • Implemented DynamoDbEventStore with tenant isolation
  • Created InMemoryEventStore for tests
  • Wrote 20 tests across 4 levels
  • Ran cargo test - all passed
  • Requested verification
Output: PR ready for review
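For a sense of how a tenant-scoped, concurrency-safe append can work in DynamoDB, here is a rough sketch using the aws-sdk-dynamodb crate and the same hypothetical key layout as above; the table name, attribute names, and error handling are assumptions, not the actual implementation:
// Sketch of an optimistically-concurrent, tenant-scoped append (assumed names).
use aws_sdk_dynamodb::{types::AttributeValue, Client};

async fn append_event(
    client: &Client,
    tenant_id: &str,
    aggregate_id: &str,
    version: u64,
    payload: String,
) -> Result<(), aws_sdk_dynamodb::Error> {
    client
        .put_item()
        .table_name("events")
        .item("PK", AttributeValue::S(format!("TENANT#{tenant_id}#AGG#{aggregate_id}")))
        .item("SK", AttributeValue::S(format!("EVENT#{version:020}")))
        .item("payload", AttributeValue::S(payload))
        // Reject the write if this (PK, SK) already exists, i.e. another
        // writer appended the same version concurrently.
        .condition_expression("attribute_not_exists(PK) AND attribute_not_exists(SK)")
        .send()
        .await?;
    Ok(())
}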

Phase 3: Verification (Verifier - FRESH)

Prompt:
Verify implementation for #357 against requirements.

Read:
- Issue #357
- Plan: .plans/357-event-sourcing.md
- Implementation: git diff main...HEAD
Verifier found 3 issues:
  1. ⚠️ Missing L3 test for event stream failure
  2. ⚠️ Missing L4 cross-tenant negative test
  3. ✅ All other requirements met
Decision: CONDITIONAL PASS
Builder fixed the issues.
Re-verification: PASSED
Human review: Approved and merged
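The cross-tenant negative test the Verifier asked for boils down to: write events as tenant A, then prove tenant B cannot read them. A hypothetical version, with the store type and helper names assumed in line with the trait sketch earlier:
// Hypothetical L4-style cross-tenant isolation test; InMemoryEventStore and
// sample_event are assumed helpers, not the real test code.
#[tokio::test]
async fn tenant_b_cannot_read_tenant_a_events() {
    let store = InMemoryEventStore::default();

    // Tenant A appends an event to its own aggregate.
    store
        .append("tenant-a", "order-1", 0, vec![sample_event("tenant-a", "order-1")])
        .await
        .expect("append for tenant A should succeed");

    // Tenant B requests the same aggregate ID under its own tenant scope.
    let events = store
        .load("tenant-b", "order-1")
        .await
        .expect("load should succeed even when the stream is empty");

    // Isolation means tenant B sees nothing, not tenant A's data.
    assert!(events.is_empty());
}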

Results

Metrics:
  • Bugs found in verification: 2 (missing tests)
  • Bugs found in production: 0
  • Test coverage: 92% (20 tests across 4 levels)
  • Rework cycles: 1 (conditional pass → fix → pass)

What I Learned

Fresh Sessions Matter

Reusing the Builder session for verification caught 40% fewer bugs than a fresh Verifier session. Always start a new session for verification.

Planning Saves Rework

Without an Evaluator plan, the Builder made the wrong architecture choice twice. Spend 20% of your time planning to save 80% of the rework.

Model Selection Matters

I tried using Sonnet for planning and it missed a subtle multi-tenancy requirement. Use Opus for architecture, Sonnet for implementation.

Workflows Over Prompts

The multi-agent workflow caught issues the single-session approach missed. Structure matters more than prompt engineering.

Common Pitfalls (And How to Avoid Them)

Mistake #1: Reusing Sessions

Don’t do this:
> "Implement #357"
> "Now verify what you just built"  # ❌ WRONG
Why: The Builder is biased toward its own implementation.
Do this instead:
> "Implement #357"

> "Verify implementation for #357"  # ✅ CORRECT

Mistake #2: Skipping Planning

I tried skipping Evaluator once: “It’s a simple feature, just build it.” Result:
  • Builder chose wrong pattern (missed tenant isolation edge case)
  • Had to rewrite 60% of code
  • Wasted 3 hours
Lesson: Even “simple” features need planning in complex systems.

Mistake #3: Trusting “All Tests Pass”

Builder created 15 tests. All green. Shipped.
Bug in production: The tests verified the wrong behavior (the AI hallucinated a requirement).
Fix: The Verifier now reviews test assertions, not just coverage.
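To make “tests verified wrong behavior” concrete, here is a contrived sketch of the failure mode; the store type and helper are hypothetical:
// A green-but-wrong test: it passes against the code as written, but it asserts
// the opposite of the real requirement (replay must be oldest-first).
#[tokio::test]
async fn replays_events_in_order() {
    let store = InMemoryEventStore::default();
    store
        .append("tenant-a", "order-1", 0, vec![
            event_of_type("OrderCreated"),   // hypothetical test helper
            event_of_type("OrderShipped"),
        ])
        .await
        .unwrap();

    let events = store.load("tenant-a", "order-1").await.unwrap();
    // The AI hallucinated a "newest-first" requirement and the test blesses it,
    // so "all tests pass" while the replay order is actually wrong.
    assert_eq!(events[0].event_type, "OrderShipped");
}
Reviewing the assertion against the requirement, not just counting tests, is exactly what the Verifier is for.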

Actionable Takeaways

If you’re starting with multi-agent AI:
  1. Start with 3 agents - Evaluator, Builder, Verifier (don’t overcomplicate)
  2. Always use fresh Verifier session - Independent context is critical
  3. Plan before implementing - 20% planning time saves 80% rework
  4. Choose models strategically:
    • Opus for planning/architecture
    • Sonnet for implementation
    • Fresh Sonnet for verification
  5. Measure everything:
    • Time per phase
    • Bugs found in verification vs production
    • Token usage and cost