Building with AI: Series Overview
The Real Story
This isn’t a series about building a SaaS platform. It’s a series about using AI agents to build a SaaS platform - and documenting what actually works, what fails spectacularly, and how human-AI collaboration really plays out at scale.
The Experiment
Question: Can multi-agent AI workflows build production-grade systems faster without sacrificing quality?
Hypothesis: Yes, if:
- Humans handle architecture and security decisions
- AI handles boilerplate, patterns, and testing
- Independent verification catches AI mistakes
- Clear workflows prevent AI from hallucinating requirements
The test bed (a minimal event-sourcing sketch follows this list):
- Event sourcing (complex pattern, good test for AI)
- DynamoDB single-table design (AI struggles here)
- Rust macros (AI excels at boilerplate)
- Four-level testing (can AI generate good tests?)
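To make the first bullet concrete, here is a minimal event-sourcing sketch: state is never stored directly, only derived by folding the event history. Names like `CapsuleEvent` are illustrative, not the platform's actual types.

```rust
// Minimal event-sourcing core: state is derived by replaying events.
// `CapsuleEvent` and `Capsule` are illustrative names, not the real types.

#[derive(Debug, Clone)]
enum CapsuleEvent {
    Created { id: String, tenant_id: String },
    Renamed { name: String },
    Suspended,
}

#[derive(Debug, Default)]
struct Capsule {
    id: String,
    tenant_id: String,
    name: String,
    suspended: bool,
}

impl Capsule {
    /// Rebuild current state by replaying the event history in order.
    fn replay(events: &[CapsuleEvent]) -> Self {
        events.iter().fold(Self::default(), |mut state, event| {
            match event {
                CapsuleEvent::Created { id, tenant_id } => {
                    state.id = id.clone();
                    state.tenant_id = tenant_id.clone();
                }
                CapsuleEvent::Renamed { name } => state.name = name.clone(),
                CapsuleEvent::Suspended => state.suspended = true,
            }
            state
        })
    }
}
```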
The AI Workflow
Three Core Agents
Evaluator (Claude Opus):
- Architecture planning
- Trade-off analysis
- ADR creation
- Design decisions
Builder:
- Implementation
- Boilerplate generation
- Pattern application
- Test writing
Verifier (fresh session):
- Independent code review
- Test coverage analysis
- Edge case validation
- Bug detection
Why Separate Agents?
Hypothesis: A fresh Verifier catches mistakes the Builder makes because:
- No implementation bias (hasn’t seen the code being written)
- Forces reading requirements from scratch
- Different session = different “perspective”
Series Structure (10 Weeks)
Week 1: Setting Up Multi-Agent Workflow
AI Focus: Evaluator, Builder, Verifier setup and coordination
System Example: Planning event sourcing architecture
Key Learning: How to structure prompts for each agent role
Week 2: Plan → Implement → Verify
AI Focus: Three-phase workflow with quality gates
System Example: Implementing DynamoDB event store
Key Learning: Independent verification prevents AI hallucination
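For a concrete picture of what Week 2 builds, here is a hedged sketch of an append-only event write using the official `aws-sdk-dynamodb` crate. The table name (`events`) and attribute names (`pk`, `sk`, `payload`) are assumptions; the conditional write is a standard way to get optimistic concurrency per aggregate.

```rust
use aws_sdk_dynamodb::{types::AttributeValue, Client};

/// Append one event to an aggregate's stream.
/// pk = aggregate id, sk = zero-padded version; the condition expression
/// rejects the write if that version already exists (optimistic concurrency).
async fn append_event(
    client: &Client,
    aggregate_id: &str,
    version: u64,
    payload: &str, // serialized event JSON
) -> Result<(), aws_sdk_dynamodb::Error> {
    client
        .put_item()
        .table_name("events") // assumed table name
        .item("pk", AttributeValue::S(aggregate_id.to_string()))
        .item("sk", AttributeValue::S(format!("{version:020}")))
        .item("payload", AttributeValue::S(payload.to_string()))
        .condition_expression("attribute_not_exists(sk)")
        .send()
        .await?;
    Ok(())
}
```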
Week 3: When AI Excels - Boilerplate
AI Focus: AI-generated DynamoDB entities and repositories
System Example: Creating 20+ entities with macros
Key Learning: AI saves 500+ lines of code on boilerplate
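The kind of boilerplate the macros eliminate looks roughly like this sketch: one `macro_rules!` declaration stamps out single-table key formatting per entity. The trait, prefixes, and entity names are illustrative, not the platform's real macro.

```rust
/// Illustrative boilerplate-killer: derive single-table key formatting
/// for many entities from one declaration each.
trait DynamoKeys {
    fn pk(&self) -> String;
    fn sk(&self) -> String;
}

macro_rules! dynamo_entity {
    ($ty:ty, prefix = $prefix:literal, id = $id:ident) => {
        impl DynamoKeys for $ty {
            fn pk(&self) -> String {
                format!(concat!($prefix, "#{}"), self.$id)
            }
            fn sk(&self) -> String {
                concat!($prefix, "#META").to_string()
            }
        }
    };
}

struct TenantRecord { tenant_id: String }
struct CapsuleRecord { capsule_id: String }

// One line per entity instead of a hand-written impl each time.
dynamo_entity!(TenantRecord, prefix = "TENANT", id = tenant_id);
dynamo_entity!(CapsuleRecord, prefix = "CAPSULE", id = capsule_id);
```

Each new entity then costs one declaration instead of a hand-written impl - multiplied across 20+ entities, that's where the 500+ saved lines come from.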
Week 4: When AI Excels - Pattern Recognition
AI Focus: AI learning from codebase patterns
System Example: Multi-tenant isolation implementation
Key Learning: AI consistency > human copy-paste
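The isolation pattern reduces to a rule AI applies consistently once it has seen it: every key derivation goes through a tenant-scoped type, so an unscoped query is hard to express by accident. A sketch with illustrative names:

```rust
/// Newtype wrapper: raw strings can't be used as tenant ids by accident.
struct TenantId(String);

/// All partition keys are derived through the tenant id, so every item a
/// tenant writes lives under its own key space and cross-tenant reads
/// cannot be expressed without an explicit TenantId in hand.
fn capsule_pk(tenant: &TenantId, capsule_id: &str) -> String {
    format!("TENANT#{}#CAPSULE#{}", tenant.0, capsule_id)
}
```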
Week 5: When AI Fails - Architecture
AI Focus: AI suggesting wrong patterns for novel problems
System Example: Capsule isolation design (AI got it wrong)
Key Learning: Humans must own architecture decisions
Week 6: When AI Fails - Security
AI Focus: AI missing subtle security vulnerabilities
System Example: Cross-tenant query bug AI didn’t catch
Key Learning: Dedicated CISO agent + human review required
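To make the Week 6 bug concrete, here is the class of mistake, reconstructed for illustration rather than copied from the actual code: a lookup keyed only by the entity id resolves items across tenants.

```rust
// BUG (reconstructed): the key omits the tenant, so any tenant's capsule
// with this id resolves - a cross-tenant read.
fn capsule_pk_buggy(capsule_id: &str) -> String {
    format!("CAPSULE#{}", capsule_id)
}

// FIX: the tenant is part of the key, so the query is scoped by construction.
fn capsule_pk_fixed(tenant_id: &str, capsule_id: &str) -> String {
    format!("TENANT#{}#CAPSULE#{}", tenant_id, capsule_id)
}
```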
Week 7: Testing with AI
AI Focus: Can AI generate good tests? (Spoiler: mostly yes)
System Example: Four-level test suite generation
Key Learning: AI writes 80% of tests, humans review edge cases
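A flavor of the lowest test level from Week 7, reusing the illustrative event types from the earlier event-sourcing sketch. The "humans review edge cases" learning is visible here: the `assert!` line is the requirement, so reviewers check what is asserted, not just that the suite is green.

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Level 1 (unit): pure state transition, no I/O.
    #[test]
    fn replaying_created_then_suspended_marks_capsule_suspended() {
        let events = vec![
            CapsuleEvent::Created { id: "c1".into(), tenant_id: "t1".into() },
            CapsuleEvent::Suspended,
        ];
        let capsule = Capsule::replay(&events);
        // The assertion IS the requirement - this line is what humans review.
        assert!(capsule.suspended);
    }
}
```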
Week 8: Auto-Remediation
AI Focus: AI fixing its own bugs automatically
System Example: /code-review → auto-fix → re-verify loop
Key Learning: 90% fix success rate for high-confidence issues
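The Week 8 loop as control flow. `review` and `apply_fix` are hypothetical placeholders, not a real API; the shape is what matters: auto-apply only high-confidence fixes, and re-verify after every round.

```rust
/// Illustrative issue report from a review pass.
struct Issue {
    description: String,
    confidence: f64, // 0.0..=1.0, as reported by the reviewing agent
}

/// Shape of the /code-review -> auto-fix -> re-verify loop, bounded so a
/// stubborn issue can't spin forever.
fn remediation_loop(mut issues: Vec<Issue>, max_rounds: usize) {
    for _ in 0..max_rounds {
        if issues.is_empty() {
            break;
        }
        for issue in issues.iter().filter(|i| i.confidence >= 0.9) {
            apply_fix(issue); // high-confidence fixes only; the rest go to a human
        }
        issues = review(); // fresh verification pass after every fix round
    }
}

fn apply_fix(_issue: &Issue) { /* hypothetical agent call */ }
fn review() -> Vec<Issue> { Vec::new() /* hypothetical agent call */ }
```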
Week 9: Prompt Engineering for Code Quality
AI Focus: Optimizing prompts for better AI output
System Example: Reducing false positives in verification
Key Learning: Prompt structure matters more than length
Key Themes
1. AI as Collaborator (Not Replacement)
AI Excels At:
✅ Boilerplate code - Structs, DTOs, CRUD operations
✅ Pattern application - Following existing code patterns
✅ Test generation - Happy path and common edge cases
✅ Code review - Finding bugs, style issues, unused code
✅ Refactoring - Consistent renames, extract method
AI Fails At: architecture for novel problems, subtle security edge cases, and requirements it hallucinates (detailed under "When AI Fails" below).
Humans Must Own: architecture and security decisions, review of test assertions, and the approval gates between phases.
2. Workflows Over Prompts
Single-prompt coding doesn’t scale. What works (sketched after this list):
- Each phase has clear goals
- Independent verification catches mistakes
- Human approval gates prevent runaway AI
- Fresh sessions reduce hallucination
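One way to picture "workflows over prompts" is as a small state machine where every transition is gated on explicit human approval. A minimal sketch with hypothetical types:

```rust
/// Phases of the workflow; each transition requires an explicit human
/// approval, which is the "gate" that prevents runaway AI.
#[derive(Debug, PartialEq)]
enum Phase {
    Plan,
    Implement,
    Verify,
    Done,
}

fn advance(phase: Phase, human_approved: bool) -> Phase {
    if !human_approved {
        return phase; // gate closed: stay put until a human signs off
    }
    match phase {
        Phase::Plan => Phase::Implement,
        Phase::Implement => Phase::Verify,
        Phase::Verify => Phase::Done,
        Phase::Done => Phase::Done,
    }
}
```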
3. Real Mistakes (AI and Human)
Mistake #1: Trusting AI Architecture (Week 5)
What Happened: AI suggested the Saga pattern for capsule provisioning
Why Wrong: Capsule creation is synchronous, not a distributed transaction
Who Failed: Human (me) - blindly accepted the AI suggestion
Fix: Reverted to a simple transaction, wrote an ADR explaining why
Lesson: AI doesn’t understand your specific context. Validate architecture suggestions.
Mistake #2: Reusing Builder for Verification (Week 2)
What Happened: Used the same Claude session for build + verify
Why Wrong: The Builder was biased toward its own implementation
Impact: Missed 5 bugs that a fresh Verifier would catch
Fix: Always use a fresh session for verification
Lesson: Independent verification requires independent context
Mistake #3: AI Generated Wrong Tests (Week 6)
What Happened: AI generated 50 tests, all passed, and we shipped a bug to production
Why Wrong: The tests verified the wrong behavior (false positives)
Root Cause: AI hallucinated a requirement that didn’t exist
Fix: Human review of test assertions, not just coverage
Lesson: Green tests ≠ correct tests. Review what’s being tested.
Mistake #4: Over-Optimizing Prompts (Week 8)
What Happened: Spent 3 days tweaking verification prompts for a 2% improvement
Why Wrong: Diminishing returns; the time was better spent elsewhere
Impact: Delayed feature work for marginal gains
Fix: Accept 90% accuracy, focus on high-value work
Lesson: Perfect is the enemy of done (applies to AI prompts too)
4. Metrics That Matter
We’re tracking:
Quality:
- Bugs found in verification vs production (goal: 80/20)
- Test coverage across 4 levels (goal: 90%+)
- False positive rate in AI reviews (goal: less than 10%)
Speed:
- Time per feature: planning, implementation, verification
- Rework cycles (goal: less than 2 per feature)
- Token usage per feature (cost control)
ROI:
- Lines of code AI wrote vs human wrote
- Time saved vs manual implementation
- Cost (tokens) vs value (speed + quality)
Early results:
- 40% faster implementation (with AI)
- Same bug rate (AI didn’t hurt quality)
- $50/month in tokens for 10 features
- ROI: clearly positive at $50 in tokens for 10 features
What You’ll Learn
AI Workflows
- Multi-agent coordination patterns
- Plan → Implement → Verify process
- When to use which agent
- Prompt engineering for code quality
When AI Helps
- Boilerplate generation (massive time saver)
- Pattern recognition (consistency)
- Test creation (80% automation)
- Code review (finds subtle bugs)
When AI Fails
- Architecture decisions (needs human judgment)
- Security edge cases (subtle vulnerabilities)
- Novel problems (no pattern to follow)
- False confidence (hallucinating requirements)
ROI Analysis
- Token usage and costs
- Time savings measurement
- Quality impact assessment
- When AI is worth it (and when not)
Who This Is For
- Engineering Leaders
- Senior Engineers
- AI-Curious Developers
You want to know:
- Can AI scale our team’s output?
- What workflows actually work?
- What are the risks?
- What’s the ROI?
You’ll get:
- Real cost/benefit analysis
- Quality gate patterns
- When AI helps vs hurts
- How to structure AI workflows
The System (As Proof)
The SaaS platform I’m building is the vehicle for testing AI workflows, not the end goal. But it’s a real, production-grade system:
- Event sourcing with DynamoDB
- Multi-tenant isolation with capsule pattern
- Rust with macro-driven development
- Four-level testing (unit → repository → event flow → E2E)
Why this system?
- Simple CRUD apps don’t stress-test AI capabilities
- Event sourcing is hard (good test: can AI help?)
- Security-critical (tests AI vulnerability detection)
- Real trade-offs (not toy examples)
Read the Series
Week 1: Multi-Agent Setup
Setting up Evaluator, Builder, and Verifier agents with Claude Code
Week 2: Plan-Implement-Verify
Three-phase workflow with quality gates and independent verification
Week 3: AI for Event Sourcing
How AI helped (and hindered) designing event sourcing architecture
Week 4: When AI Excels
Coming Soon - Boilerplate generation and pattern recognition wins
Subscribe for Updates
New articles published regularly documenting real AI workflow learnings
Disclaimer: This is experimental work from my personal projects. Results are real but may not generalize to all contexts. Your mileage may vary.
This content does not represent my employer’s views or technologies.