Building with AI: Series Overview

The Real Story

This isn’t a series about building a SaaS platform. It’s a series about using AI agents to build a SaaS platform - and documenting what actually works, what fails spectacularly, and how human-AI collaboration really plays out at scale.
Not a Tutorial - This is experimental work with real failures, pivots, and "AI got it completely wrong" moments. The SaaS platform is the vehicle for testing AI workflows. The AI workflows are the point.

The Experiment

Question: Can multi-agent AI workflows build production-grade systems faster without sacrificing quality?
Hypothesis: Yes, if:
  • Humans handle architecture and security decisions
  • AI handles boilerplate, patterns, and testing
  • Independent verification catches AI mistakes
  • Clear workflows prevent AI from hallucinating requirements
Testing Ground: A multi-tenant SaaS platform with:
  • Event sourcing (complex pattern, good test for AI)
  • DynamoDB single-table design (AI struggles here)
  • Rust macros (AI excels at boilerplate)
  • Four-level testing (can AI generate good tests?)
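
To make the DynamoDB single-table point concrete, here is a minimal sketch of the kind of key layout involved: every item is partitioned by tenant and aggregate, and events sort by sequence number. The struct and key formats below are illustrative assumptions for this overview, not the platform's actual schema.

// Illustrative single-table key layout for tenant-scoped events.
// Names and formats are assumptions for this sketch, not the real schema.
struct EventKey {
    tenant_id: String,
    aggregate_id: String,
    sequence: u64,
}

impl EventKey {
    // Partition key scopes every query to one tenant + aggregate.
    fn pk(&self) -> String {
        format!("TENANT#{}#AGG#{}", self.tenant_id, self.aggregate_id)
    }
    // Sort key orders events by sequence (zero-padded so lexicographic order = numeric order).
    fn sk(&self) -> String {
        format!("EVENT#{:010}", self.sequence)
    }
}

fn main() {
    let key = EventKey {
        tenant_id: "t-42".into(),
        aggregate_id: "order-7".into(),
        sequence: 3,
    };
    // PK = TENANT#t-42#AGG#order-7, SK = EVENT#0000000003
    println!("PK = {}, SK = {}", key.pk(), key.sk());
}

The property that matters: a query cannot cross tenants without constructing a different partition key, which is exactly what the AI-generated repositories have to preserve.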

The AI Workflow

Three Core Agents

Evaluator (Claude Opus):
  • Architecture planning
  • Trade-off analysis
  • ADR creation
  • Design decisions
Builder (Claude Sonnet + GitHub Copilot):
  • Implementation
  • Boilerplate generation
  • Pattern application
  • Test writing
Verifier (Claude Sonnet - Fresh Session):
  • Independent code review
  • Test coverage analysis
  • Edge case validation
  • Bug detection
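
For orientation, the role split above can be written down as data that a coordination script reads. The enum below mirrors the model pairings described here; the names and structure are a sketch of my setup, not a prescribed format.

// Sketch of the three-agent split as data an orchestration script could consume.
// Model pairings mirror the setup above; the types themselves are illustrative.
#[derive(Debug)]
enum AgentRole {
    Evaluator, // planning, trade-offs, ADRs
    Builder,   // implementation, boilerplate, tests
    Verifier,  // independent review in a fresh session
}

impl AgentRole {
    fn model(&self) -> &'static str {
        match self {
            AgentRole::Evaluator => "claude-opus",
            AgentRole::Builder => "claude-sonnet + copilot",
            AgentRole::Verifier => "claude-sonnet (fresh session)",
        }
    }

    // A fresh session is what gives the Verifier an independent "perspective".
    fn needs_fresh_session(&self) -> bool {
        matches!(self, AgentRole::Verifier)
    }
}

fn main() {
    for role in [AgentRole::Evaluator, AgentRole::Builder, AgentRole::Verifier] {
        println!("{:?}: {} (fresh session: {})", role, role.model(), role.needs_fresh_session());
    }
}

Writing it down as data makes the fresh-session requirement explicit instead of tribal knowledge.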

Why Separate Agents?

Hypothesis: A fresh Verifier catches mistakes the Builder makes because:
  • No implementation bias (hasn’t seen the code being written)
  • Forces reading requirements from scratch
  • Different session = different “perspective”
Result: 70% improvement in bug detection rate vs. reusing the same session

Series Structure (10 Weeks)

Week 1: Setting Up Multi-Agent Workflow

AI Focus: Evaluator, Builder, Verifier setup and coordination
System Example: Planning event sourcing architecture
Key Learning: How to structure prompts for each agent role

Week 2: Plan → Implement → Verify

AI Focus: Three-phase workflow with quality gates
System Example: Implementing DynamoDB event store
Key Learning: Independent verification prevents AI hallucination

Week 3: When AI Excels - Boilerplate

AI Focus: AI-generated DynamoDB entities and repositories
System Example: Creating 20+ entities with macros
Key Learning: AI saves 500+ lines of code on boilerplate

Week 4: When AI Excels - Pattern Recognition

AI Focus: AI learning from codebase patterns
System Example: Multi-tenant isolation implementation
Key Learning: AI consistency > human copy-paste

Week 5: When AI Fails - Architecture

AI Focus: AI suggesting wrong patterns for novel problems
System Example: Capsule isolation design (AI got it wrong)
Key Learning: Humans must own architecture decisions

Week 6: When AI Fails - Security

AI Focus: AI missing subtle security vulnerabilities
System Example: Cross-tenant query bug AI didn't catch
Key Learning: Dedicated CISO agent + human review required

Week 7: Testing with AI

AI Focus: Can AI generate good tests? (Spoiler: mostly yes)
System Example: Four-level test suite generation
Key Learning: AI writes 80% of tests, humans review edge cases

Week 8: Auto-Remediation

AI Focus: AI fixing its own bugs automatically
System Example: /code-review → auto-fix → re-verify loop
Key Learning: 90% fix success rate for high-confidence issues

Week 9: Prompt Engineering for Code Quality

AI Focus: Optimizing prompts for better AI output
System Example: Reducing false positives in verification
Key Learning: Prompt structure matters more than length

Week 10: Cost vs Value Analysis

AI Focus: Token usage, time savings, quality metrics
System Example: Full platform retrospective
Key Learning: ROI calculation: 40% faster, same quality, $X cost

Key Themes

1. AI as Collaborator (Not Replacement)

AI handles:
  • Boilerplate code - Structs, DTOs, CRUD operations
  • Pattern application - Following existing code patterns
  • Test generation - Happy path and common edge cases
  • Code review - Finding bugs, style issues, unused code
  • Refactoring - Consistent renames, extract method
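
To show what "boilerplate" means here, this is the shape of work a macro takes off the Builder's plate: every entity needs the same tenant-scoped key plumbing. The macro and entities below are a simplified illustration, not the platform's real macros.

// Sketch of the repetitive key plumbing a macro removes. Illustrative only.
macro_rules! dynamo_entity {
    ($name:ident, $prefix:literal) => {
        struct $name {
            tenant_id: String,
            id: String,
        }

        impl $name {
            // Tenant-scoped partition key, identical for every entity type.
            fn pk(&self) -> String {
                format!("TENANT#{}", self.tenant_id)
            }
            // Sort key prefixed with the entity type.
            fn sk(&self) -> String {
                format!("{}#{}", $prefix, self.id)
            }
        }
    };
}

// One macro invocation per entity instead of hand-written key methods.
dynamo_entity!(User, "USER");
dynamo_entity!(Invoice, "INVOICE");

fn main() {
    let user = User { tenant_id: "t-42".into(), id: "u-1".into() };
    let invoice = Invoice { tenant_id: "t-42".into(), id: "inv-9".into() };
    println!("{} / {}", user.pk(), user.sk());
    println!("{} / {}", invoice.pk(), invoice.sk());
}

One line per entity replaces the plumbing the Builder would otherwise emit by hand, and keeps all 20+ entities consistent.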

2. Workflows Over Prompts

Single-prompt coding doesn’t scale. What works:
1. PLAN (Evaluator - Opus)
   → Detailed design doc
   → Trade-off analysis
   → Human approval

2. IMPLEMENT (Builder - Sonnet)
   → Follow approved plan
   → Use Copilot for boilerplate
   → Request verification when done

3. VERIFY (Verifier - Fresh Sonnet)
   → Independent review
   → Test coverage check
   → Report issues or pass

4. REVIEW (Human + Specialist Agents)
   → Security (CISO agent)
   → Architecture (Architect agent)
   → Final approval
Why this works:
  • Each phase has clear goals
  • Independent verification catches mistakes
  • Human approval gates prevent runaway AI
  • Fresh sessions reduce hallucination
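
Another way to see the gates: the workflow is a small state machine in which nothing reaches human review without passing independent verification, and a failed gate sends work back rather than forward. The sketch below models those transitions; it is a description of the process, not platform code.

// A model of the Plan → Implement → Verify → Review pipeline as a state machine.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Plan,      // Evaluator (Opus) produces a design doc
    Implement, // Builder (Sonnet) follows the approved plan
    Verify,    // fresh Verifier session reviews independently
    Review,    // human + specialist agents (CISO, Architect)
    Done,
}

// `gate_passed` is the outcome of the current phase's quality gate.
fn advance(phase: Phase, gate_passed: bool) -> Phase {
    match (phase, gate_passed) {
        (Phase::Plan, true) => Phase::Implement,    // human approved the plan
        (Phase::Plan, false) => Phase::Plan,        // revise the plan
        (Phase::Implement, _) => Phase::Verify,     // always verify what was built
        (Phase::Verify, true) => Phase::Review,
        (Phase::Verify, false) => Phase::Implement, // Verifier found issues: rework
        (Phase::Review, true) => Phase::Done,
        (Phase::Review, false) => Phase::Implement, // reviewers rejected: rework
        (Phase::Done, _) => Phase::Done,
    }
}

fn main() {
    let mut phase = Phase::Plan;
    // Simulated gate outcomes: plan approved, one failed verification, then all pass.
    for gate in [true, true, false, true, true, true] {
        let next = advance(phase, gate);
        println!("{:?} --({})--> {:?}", phase, gate, next);
        phase = next;
    }
}

The important transitions are the failure edges: a failed verification or review goes back to Implement, never forward.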

3. Real Mistakes (AI and Human)

What Happened: AI suggested the Saga pattern for capsule provisioning
Why Wrong: Capsule creation is synchronous, not a distributed transaction
Who Failed: Human (me) - blindly accepted the AI suggestion
Fix: Reverted to a simple transaction, wrote an ADR explaining why
Lesson: AI doesn't understand your specific context. Validate architecture suggestions.

What Happened: Used the same Claude session for build + verify
Why Wrong: The Builder was biased toward its own implementation
Impact: Missed 5 bugs that a fresh Verifier would have caught
Fix: Always use a fresh session for verification
Lesson: Independent verification requires independent context.

What Happened: AI generated 50 tests, all passed, shipped a bug to production
Why Wrong: Tests verified the wrong behavior (false positives)
Root Cause: AI hallucinated a requirement that didn't exist
Fix: Human review of test assertions, not just coverage
Lesson: Green tests ≠ correct tests. Review what's being tested.

What Happened: Spent 3 days tweaking verification prompts for a 2% improvement
Why Wrong: Diminishing returns; time better spent elsewhere
Impact: Delayed feature work for marginal gains
Fix: Accept 90% accuracy, focus on high-value work
Lesson: Perfect is the enemy of done (applies to AI prompts too).

4. Metrics That Matter

We're tracking:
Quality:
  • Bugs found in verification vs production (goal: 80/20)
  • Test coverage across 4 levels (goal: 90%+)
  • False positive rate in AI reviews (goal: less than 10%)
Speed:
  • Time per feature: Planning, Implementation, Verification
  • Rework cycles (goal: less than 2 per feature)
  • Token usage per feature (cost control)
Value:
  • Lines of code AI wrote vs human wrote
  • Time saved vs manual implementation
  • Cost (tokens) vs value (speed + quality)
Results so far (Week 6):
  • 40% faster implementation (with AI)
  • Same bug rate (AI didn’t hurt quality)
  • $50/month in tokens for 10 features
  • ROI: $600 in time saved for $50 in tokens
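
The ROI line is plain arithmetic: value of time saved divided by token spend. The hours and hourly rate below are placeholder assumptions chosen only to reproduce the $600-vs-$50 shape of the week-6 numbers; the token cost is the real figure from above.

// Sketch of the ROI arithmetic. Only the token cost comes from the results above;
// hours saved and hourly rate are illustrative assumptions.
fn main() {
    let token_cost: f64 = 50.0;   // $/month in tokens (from the results above)
    let hours_saved: f64 = 6.0;   // assumption for the sketch
    let hourly_rate: f64 = 100.0; // assumption for the sketch

    let value_of_time_saved = hours_saved * hourly_rate; // $600
    let roi_multiple = value_of_time_saved / token_cost; // 12x

    println!(
        "${} in time saved for ${} in tokens (~{:.0}x return)",
        value_of_time_saved, token_cost, roi_multiple
    );
}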

What You’ll Learn

AI Workflows

  • Multi-agent coordination patterns
  • Plan → Implement → Verify process
  • When to use which agent
  • Prompt engineering for code quality

When AI Helps

  • Boilerplate generation (massive time saver)
  • Pattern recognition (consistency)
  • Test creation (80% automation)
  • Code review (finds subtle bugs)

When AI Fails

  • Architecture decisions (needs human judgment)
  • Security edge cases (subtle vulnerabilities)
  • Novel problems (no pattern to follow)
  • False confidence (hallucinating requirements)

ROI Analysis

  • Token usage and costs
  • Time savings measurement
  • Quality impact assessment
  • When AI is worth it (and when not)

Who This Is For

You want to know:
  • Can AI scale our team’s output?
  • What workflows actually work?
  • What are the risks?
  • What’s the ROI?
You’ll learn:
  • Real cost/benefit analysis
  • Quality gate patterns
  • When AI helps vs hurts
  • How to structure AI workflows

The System (As Proof)

The SaaS platform I’m building is the vehicle for testing AI workflows, not the end goal. But it’s a real, production-grade system:
  • Event sourcing with DynamoDB
  • Multi-tenant isolation with capsule pattern
  • Rust with macro-driven development
  • Four-level testing (unit → repository → event flow → E2E)
Why this complexity?
  • Simple CRUD apps don’t stress-test AI capabilities
  • Event sourcing is hard (good test: can AI help?)
  • Security-critical (tests AI vulnerability detection)
  • Real trade-offs (not toy examples)
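
One concrete reason the security stakes are real: every read path has to carry tenant scope, and the Week 6 cross-tenant bug is what happens when one path doesn't. Here is a minimal sketch of that guard with hypothetical types; the platform's capsule pattern is more involved than this.

// Minimal sketch of a tenant-isolation guard on the read path. Types and names
// are hypothetical, not the platform's actual capsule implementation.
#[derive(Debug, PartialEq)]
struct TenantId(String);

struct RequestContext {
    tenant: TenantId, // established from auth, never from user input
}

#[derive(Debug)]
enum QueryError {
    CrossTenantAccess,
}

// Every repository call takes the caller's context and refuses to build a key
// for any other tenant - the class of bug the Week 6 review missed.
fn scoped_partition_key(
    ctx: &RequestContext,
    requested_tenant: &TenantId,
    aggregate_id: &str,
) -> Result<String, QueryError> {
    if &ctx.tenant != requested_tenant {
        return Err(QueryError::CrossTenantAccess);
    }
    Ok(format!("TENANT#{}#AGG#{}", ctx.tenant.0, aggregate_id))
}

fn main() {
    let ctx = RequestContext { tenant: TenantId("t-42".into()) };
    // Same tenant: allowed.
    println!("{:?}", scoped_partition_key(&ctx, &TenantId("t-42".into()), "order-7"));
    // Different tenant: rejected before any query is built.
    println!("{:?}", scoped_partition_key(&ctx, &TenantId("t-99".into()), "order-7"));
}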


Disclaimer: This is experimental work from my personal projects. Results are real but may not generalize to all contexts. Your mileage may vary. This content does not represent my employer's views or technologies.