Building with AI: Series Overview

The Real Story

This isn’t a series about building a SaaS platform. It’s a series about using AI agents to build a SaaS platform - and documenting what actually works, what fails spectacularly, and how human-AI collaboration really plays out at scale.
Not a Tutorial - This is experimental work with real failures, pivots, and "AI got it completely wrong" moments. The SaaS platform is the vehicle for testing AI workflows. The AI workflows are the point.

The Experiment

Question: Can multi-agent AI workflows build production-grade systems faster without sacrificing quality?
Hypothesis: Yes, if:
  • Humans handle architecture and security decisions
  • AI handles boilerplate, patterns, and testing
  • Independent verification catches AI mistakes
  • Clear workflows prevent AI from hallucinating requirements
Testing Ground: A multi-tenant SaaS platform with:
  • Event sourcing (complex pattern, good test for AI)
  • DynamoDB single-table design (AI struggles here)
  • Rust macros (AI excels at boilerplate)
  • Four-level testing (can AI generate good tests?)
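
To make the DynamoDB single-table point concrete, here is a minimal sketch of the kind of key layout involved: every item is partitioned by tenant and aggregate, and events sort by sequence number. The struct and key formats below are illustrative assumptions for this overview, not the platform's actual schema.

// Illustrative single-table key layout for tenant-scoped events.
// Names and formats are assumptions for this sketch, not the real schema.
struct EventKey {
    tenant_id: String,
    aggregate_id: String,
    sequence: u64,
}

impl EventKey {
    // Partition key scopes every query to one tenant + aggregate.
    fn pk(&self) -> String {
        format!("TENANT#{}#AGG#{}", self.tenant_id, self.aggregate_id)
    }
    // Sort key orders events by sequence (zero-padded so lexicographic order = numeric order).
    fn sk(&self) -> String {
        format!("EVENT#{:010}", self.sequence)
    }
}

fn main() {
    let key = EventKey {
        tenant_id: "t-42".into(),
        aggregate_id: "order-7".into(),
        sequence: 3,
    };
    // PK = TENANT#t-42#AGG#order-7, SK = EVENT#0000000003
    println!("PK = {}, SK = {}", key.pk(), key.sk());
}

The property that matters: a query cannot cross tenants without constructing a different partition key, which is exactly what the AI-generated repositories have to preserve.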

The AI Workflow

Three Core Agents

Evaluator (Claude Opus):
  • Architecture planning
  • Trade-off analysis
  • ADR creation
  • Design decisions
Builder (Claude Sonnet + GitHub Copilot):
  • Implementation
  • Boilerplate generation
  • Pattern application
  • Test writing
Verifier (Claude Sonnet - Fresh Session):
  • Independent code review
  • Test coverage analysis
  • Edge case validation
  • Bug detection
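
For orientation, the role split above can be written down as data that a coordination script reads. The enum below mirrors the model pairings described here; the names and structure are a sketch of my setup, not a prescribed format.

// Sketch of the three-agent split as data an orchestration script could consume.
// Model pairings mirror the setup above; the types themselves are illustrative.
#[derive(Debug)]
enum AgentRole {
    Evaluator, // planning, trade-offs, ADRs
    Builder,   // implementation, boilerplate, tests
    Verifier,  // independent review in a fresh session
}

impl AgentRole {
    fn model(&self) -> &'static str {
        match self {
            AgentRole::Evaluator => "claude-opus",
            AgentRole::Builder => "claude-sonnet + copilot",
            AgentRole::Verifier => "claude-sonnet (fresh session)",
        }
    }

    // A fresh session is what gives the Verifier an independent "perspective".
    fn needs_fresh_session(&self) -> bool {
        matches!(self, AgentRole::Verifier)
    }
}

fn main() {
    for role in [AgentRole::Evaluator, AgentRole::Builder, AgentRole::Verifier] {
        println!("{:?}: {} (fresh session: {})", role, role.model(), role.needs_fresh_session());
    }
}

Writing it down as data makes the fresh-session requirement explicit instead of tribal knowledge.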

Why Separate Agents?

Hypothesis: A fresh Verifier catches mistakes the Builder makes because:
  • No implementation bias (hasn’t seen the code being written)
  • Forces reading requirements from scratch
  • Different session = different “perspective”
Result: 70% improvement in bug detection rate vs. reusing the same session

Series Structure (10 Weeks)

Week 1: Setting Up Multi-Agent Workflow

AI Focus: Evaluator, Builder, Verifier setup and coordination
System Example: Planning event sourcing architecture
Key Learning: How to structure prompts for each agent role

Week 2: Plan → Implement → Verify

AI Focus: Three-phase workflow with quality gates
System Example: Implementing DynamoDB event store
Key Learning: Independent verification prevents AI hallucination

Week 3: When AI Excels - Boilerplate

AI Focus: AI-generated DynamoDB entities and repositories
System Example: Creating 20+ entities with macros
Key Learning: AI saves 500+ lines of code on boilerplate

Week 4: When AI Excels - Pattern Recognition

AI Focus: AI learning from codebase patterns
System Example: Multi-tenant isolation implementation
Key Learning: AI consistency > human copy-paste

Week 5: When AI Fails - Architecture

AI Focus: AI suggesting wrong patterns for novel problems
System Example: Capsule isolation design (AI got it wrong)
Key Learning: Humans must own architecture decisions

Week 6: When AI Fails - Security

AI Focus: AI missing subtle security vulnerabilities
System Example: Cross-tenant query bug AI didn't catch
Key Learning: Dedicated CISO agent + human review required

Week 7: Testing with AI

AI Focus: Can AI generate good tests? (Spoiler: mostly yes)
System Example: Four-level test suite generation
Key Learning: AI writes 80% of tests, humans review edge cases

Week 8: Auto-Remediation

AI Focus: AI fixing its own bugs automatically
System Example: /code-review → auto-fix → re-verify loop
Key Learning: 90% fix success rate for high-confidence issues

Week 9: Prompt Engineering for Code Quality

AI Focus: Optimizing prompts for better AI output
System Example: Reducing false positives in verification
Key Learning: Prompt structure matters more than length

Week 10: Cost vs Value Analysis

AI Focus: Token usage, time savings, quality metrics
System Example: Full platform retrospective
Key Learning: ROI calculation: 40% faster, same quality, $X cost

Key Themes

1. AI as Collaborator (Not Replacement)

AI handles:
  • Boilerplate code - Structs, DTOs, CRUD operations
  • Pattern application - Following existing code patterns
  • Test generation - Happy path and common edge cases
  • Code review - Finding bugs, style issues, unused code
  • Refactoring - Consistent renames, extract method
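
To show what "boilerplate" means here, this is the shape of work a macro takes off the Builder's plate: every entity needs the same tenant-scoped key plumbing. The macro and entities below are a simplified illustration, not the platform's real macros.

// Sketch of the repetitive key plumbing a macro removes. Illustrative only.
macro_rules! dynamo_entity {
    ($name:ident, $prefix:literal) => {
        struct $name {
            tenant_id: String,
            id: String,
        }

        impl $name {
            // Tenant-scoped partition key, identical for every entity type.
            fn pk(&self) -> String {
                format!("TENANT#{}", self.tenant_id)
            }
            // Sort key prefixed with the entity type.
            fn sk(&self) -> String {
                format!("{}#{}", $prefix, self.id)
            }
        }
    };
}

// One macro invocation per entity instead of hand-written key methods.
dynamo_entity!(User, "USER");
dynamo_entity!(Invoice, "INVOICE");

fn main() {
    let user = User { tenant_id: "t-42".into(), id: "u-1".into() };
    let invoice = Invoice { tenant_id: "t-42".into(), id: "inv-9".into() };
    println!("{} / {}", user.pk(), user.sk());
    println!("{} / {}", invoice.pk(), invoice.sk());
}

One line per entity replaces the plumbing the Builder would otherwise emit by hand, and keeps all 20+ entities consistent.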

2. Workflows Over Prompts

Single-prompt coding doesn’t scale. What works:
1. PLAN (Evaluator - Opus)
   → Detailed design doc
   → Trade-off analysis
   → Human approval

2. IMPLEMENT (Builder - Sonnet)
   → Follow approved plan
   → Use Copilot for boilerplate
   → Request verification when done

3. VERIFY (Verifier - Fresh Sonnet)
   → Independent review
   → Test coverage check
   → Report issues or pass

4. REVIEW (Human + Specialist Agents)
   → Security (CISO agent)
   → Architecture (Architect agent)
   → Final approval
Why this works:
  • Each phase has clear goals
  • Independent verification catches mistakes
  • Human approval gates prevent runaway AI
  • Fresh sessions reduce hallucination
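
Another way to see the gates: the workflow is a small state machine in which nothing reaches human review without passing independent verification, and a failed gate sends work back rather than forward. The sketch below models those transitions; it is a description of the process, not platform code.

// A model of the Plan → Implement → Verify → Review pipeline as a state machine.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Plan,      // Evaluator (Opus) produces a design doc
    Implement, // Builder (Sonnet) follows the approved plan
    Verify,    // fresh Verifier session reviews independently
    Review,    // human + specialist agents (CISO, Architect)
    Done,
}

// `gate_passed` is the outcome of the current phase's quality gate.
fn advance(phase: Phase, gate_passed: bool) -> Phase {
    match (phase, gate_passed) {
        (Phase::Plan, true) => Phase::Implement,    // human approved the plan
        (Phase::Plan, false) => Phase::Plan,        // revise the plan
        (Phase::Implement, _) => Phase::Verify,     // always verify what was built
        (Phase::Verify, true) => Phase::Review,
        (Phase::Verify, false) => Phase::Implement, // Verifier found issues: rework
        (Phase::Review, true) => Phase::Done,
        (Phase::Review, false) => Phase::Implement, // reviewers rejected: rework
        (Phase::Done, _) => Phase::Done,
    }
}

fn main() {
    let mut phase = Phase::Plan;
    // Simulated gate outcomes: plan approved, one failed verification, then all pass.
    for gate in [true, true, false, true, true, true] {
        let next = advance(phase, gate);
        println!("{:?} --({})--> {:?}", phase, gate, next);
        phase = next;
    }
}

The important transitions are the failure edges: a failed verification or review goes back to Implement, never forward.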

3. Real Mistakes (AI and Human)

What Happened: AI suggested the Saga pattern for capsule provisioning
Why Wrong: Capsule creation is synchronous, not a distributed transaction
Who Failed: Human (me) - blindly accepted the AI suggestion
Fix: Reverted to a simple transaction, wrote an ADR explaining why
Lesson: AI doesn't understand your specific context. Validate architecture suggestions.

What Happened: Used the same Claude session for build + verify
Why Wrong: The Builder was biased toward its own implementation
Impact: Missed 5 bugs that a fresh Verifier would have caught
Fix: Always use a fresh session for verification
Lesson: Independent verification requires independent context.

What Happened: AI generated 50 tests, all passed, shipped a bug to production
Why Wrong: Tests verified the wrong behavior (false positives)
Root Cause: AI hallucinated a requirement that didn't exist
Fix: Human review of test assertions, not just coverage
Lesson: Green tests ≠ correct tests. Review what's being tested.

What Happened: Spent 3 days tweaking verification prompts for a 2% improvement
Why Wrong: Diminishing returns; time better spent elsewhere
Impact: Delayed feature work for marginal gains
Fix: Accept 90% accuracy, focus on high-value work
Lesson: Perfect is the enemy of done (applies to AI prompts too).

4. Metrics That Matter

We're tracking:
Quality:
  • Bugs found in verification vs production (goal: 80/20)
  • Test coverage across 4 levels (goal: 90%+)
  • False positive rate in AI reviews (goal: less than 10%)
Speed:
  • Time per feature: Planning, Implementation, Verification
  • Rework cycles (goal: less than 2 per feature)
  • Token usage per feature (cost control)
Value:
  • Lines of code AI wrote vs human wrote
  • Time saved vs manual implementation
  • Cost (tokens) vs value (speed + quality)
Results so far (Week 6):
  • 40% faster implementation (with AI)
  • Same bug rate (AI didn’t hurt quality)
  • $50/month in tokens for 10 features
  • ROI: $600 in time saved for $50 in tokens
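
The ROI line is plain arithmetic: value of time saved divided by token spend. The hours and hourly rate below are placeholder assumptions chosen only to reproduce the $600-vs-$50 shape of the week-6 numbers; the token cost is the real figure from above.

// Sketch of the ROI arithmetic. Only the token cost comes from the results above;
// hours saved and hourly rate are illustrative assumptions.
fn main() {
    let token_cost: f64 = 50.0;   // $/month in tokens (from the results above)
    let hours_saved: f64 = 6.0;   // assumption for the sketch
    let hourly_rate: f64 = 100.0; // assumption for the sketch

    let value_of_time_saved = hours_saved * hourly_rate; // $600
    let roi_multiple = value_of_time_saved / token_cost; // 12x

    println!(
        "${} in time saved for ${} in tokens (~{:.0}x return)",
        value_of_time_saved, token_cost, roi_multiple
    );
}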

What You’ll Learn

AI Workflows

  • Multi-agent coordination patterns
  • Plan → Implement → Verify process
  • When to use which agent
  • Prompt engineering for code quality

When AI Helps

  • Boilerplate generation (massive time saver)
  • Pattern recognition (consistency)
  • Test creation (80% automation)
  • Code review (finds subtle bugs)

When AI Fails

  • Architecture decisions (needs human judgment)
  • Security edge cases (subtle vulnerabilities)
  • Novel problems (no pattern to follow)
  • False confidence (hallucinating requirements)

ROI Analysis

  • Token usage and costs
  • Time savings measurement
  • Quality impact assessment
  • When AI is worth it (and when not)

Who This Is For

You want to know:
  • Can AI scale our team’s output?
  • What workflows actually work?
  • What are the risks?
  • What’s the ROI?
You’ll learn:
  • Real cost/benefit analysis
  • Quality gate patterns
  • When AI helps vs hurts
  • How to structure AI workflows

The System (As Proof)

The SaaS platform I’m building is the vehicle for testing AI workflows, not the end goal. But it’s a real, production-grade system:
  • Event sourcing with DynamoDB
  • Multi-tenant isolation with capsule pattern
  • Rust with macro-driven development
  • Four-level testing (unit → repository → event flow → E2E)
Why this complexity?
  • Simple CRUD apps don’t stress-test AI capabilities
  • Event sourcing is hard (good test: can AI help?)
  • Security-critical (tests AI vulnerability detection)
  • Real trade-offs (not toy examples)
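
One concrete reason the security stakes are real: every read path has to carry tenant scope, and the Week 6 cross-tenant bug is what happens when one path doesn't. Here is a minimal sketch of that guard with hypothetical types; the platform's capsule pattern is more involved than this.

// Minimal sketch of a tenant-isolation guard on the read path. Types and names
// are hypothetical, not the platform's actual capsule implementation.
#[derive(Debug, PartialEq)]
struct TenantId(String);

struct RequestContext {
    tenant: TenantId, // established from auth, never from user input
}

#[derive(Debug)]
enum QueryError {
    CrossTenantAccess,
}

// Every repository call takes the caller's context and refuses to build a key
// for any other tenant - the class of bug the Week 6 review missed.
fn scoped_partition_key(
    ctx: &RequestContext,
    requested_tenant: &TenantId,
    aggregate_id: &str,
) -> Result<String, QueryError> {
    if &ctx.tenant != requested_tenant {
        return Err(QueryError::CrossTenantAccess);
    }
    Ok(format!("TENANT#{}#AGG#{}", ctx.tenant.0, aggregate_id))
}

fn main() {
    let ctx = RequestContext { tenant: TenantId("t-42".into()) };
    // Same tenant: allowed.
    println!("{:?}", scoped_partition_key(&ctx, &TenantId("t-42".into()), "order-7"));
    // Different tenant: rejected before any query is built.
    println!("{:?}", scoped_partition_key(&ctx, &TenantId("t-99".into()), "order-7"));
}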


Disclaimer: This is experimental work from my personal projects. Results are real but may not generalize to all contexts. Your mileage may vary. This content does not represent my employer's views or technologies.