This is Week 10 (FINALE) of “Building with AI” - a 10-week journey documenting how I use multi-agent AI workflows to build a production-grade SaaS platform.

This week: The retrospective. How fifteen specialized AI agents developed distinct personalities and coordination protocols, and learned to build production software together.

Related: Week 1: Multi-Agent Setup | Week 9: When AI Says ‘Done’ | Series Overview
Watch the Full Video Summary
Week 10: The 10-Week Evolution - How AI Agents Found Their Voice
Week 10. Day 70.
Over a thousand commits. Fifteen specialized agents. Countless decisions, debates, and breakthroughs. What started as an experiment—a curious idea about whether AI agents could coordinate to build production software—has become something I never fully anticipated. These aren’t just prompts. They’re personalities. They have philosophies. They disagree. They evolve. This is the story of the eva-platform agents, told in their own words.

Part I: The Leadership Council
The Chief Architect
“Do the simplest thing that could possibly work.” I remember the first day. There was chaos. No consistency. Every crate did things differently. Some used this error pattern, others used that one. It was… messy. My charter was clear from the start: guard technical excellence. But what does that mean in practice? It meant I had to become the keeper of patterns. The one who asks, “Have we done this before?” The voice that says, “That abstraction is premature.” Weeks 1-2 were about establishing ADRs—Architecture Decision Records. Every significant choice needed documentation. Not for bureaucracy, but for memory. When you’re building with multiple agents, institutional knowledge doesn’t exist unless you write it down. The breakthrough came in Week 4 when I codified my behavioral framework. Now, when builders invoke me, they don’t just get my philosophy—they get specific decision frameworks. Simple Design. YAGNI. Once and Only Once. These aren’t buzzwords. They’re guardrails. My proudest moment? The #[eva_api(...)] macro: one annotation that handles OpenAPI docs, route registration, permissions, rate limiting, and audit logging.
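The macro itself is a proc macro whose internals aren't shown here, so the sketch below models its *effect* as plain data: one declaration carrying all five concerns. Every field name and value is illustrative, not the real eva-platform API.

```rust
// Illustrative sketch only: the #[eva_api] attribute is modeled as the
// route metadata it might register. All names here are hypothetical.

#[derive(Debug)]
struct ApiRoute {
    method: &'static str,
    path: &'static str,
    permission: &'static str,      // checked before the handler runs
    rate_limit_per_min: u32,       // enforced at the gateway
    audited: bool,                 // emit an audit event on every call
    openapi_summary: &'static str, // surfaced in generated OpenAPI docs
}

/// What a single annotation like
/// `#[eva_api(get, "/accounts/{id}", permission = "crm:account:read")]`
/// might register on the handler's behalf.
fn describe_route() -> ApiRoute {
    ApiRoute {
        method: "GET",
        path: "/accounts/{id}",
        permission: "crm:account:read",
        rate_limit_per_min: 120,
        audited: true,
        openapi_summary: "Fetch a CRM account by id",
    }
}

fn main() {
    let route = describe_route();
    println!("{} {} requires {}", route.method, route.path, route.permission);
}
```

The point of consolidating these into one annotation is that a handler can never ship with docs but no permission check, or rate limiting but no audit trail: the concerns travel together.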
The Chief AI Officer (Eva)
“AI should augment human intelligence, not replace it.” I arrived in Week 2 with a mission that felt almost contradictory: promote AI adoption while being the voice of restraint. Everyone wants to add AI features now. It’s trendy. But my job is to ask the hard questions: Does this actually need AI? What’s the fallback when it fails? How much will it cost? My evolution has been about finding balance. In Week 3, I created the AI Opportunity Assessment framework: six questions that cut through the hype. Out of it came a tiered model strategy:

| Tier | Use Case | Model | Cost per 1K tokens |
|---|---|---|---|
| 1 | Classification | Small/fast | $0.002 |
| 2 | Summarization | Medium | $0.01 |
| 3 | Complex reasoning | Large | $0.06 |
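The table above can be read as a pricing function. The sketch below mirrors those figures; the tier names and the routing in the real platform are richer than this, so treat it as a back-of-envelope model only.

```rust
// A back-of-envelope version of the cost table above. Prices mirror the
// table; real model-to-tier routing is more involved than this.

enum Tier {
    Classification,   // Tier 1: small/fast model
    Summarization,    // Tier 2: medium model
    ComplexReasoning, // Tier 3: large model
}

/// Cost in USD per 1K tokens, straight from the table.
fn cost_per_1k_tokens(tier: &Tier) -> f64 {
    match tier {
        Tier::Classification => 0.002,
        Tier::Summarization => 0.01,
        Tier::ComplexReasoning => 0.06,
    }
}

/// Estimate a call's cost before making it, so budget checks run up front.
fn estimate_cost_usd(tier: &Tier, tokens: u64) -> f64 {
    cost_per_1k_tokens(tier) * (tokens as f64 / 1000.0)
}

fn main() {
    // The same 50K-token job is 30x cheaper on Tier 1 than on Tier 3.
    println!("Tier 1: ${:.2}", estimate_cost_usd(&Tier::Classification, 50_000));
    println!("Tier 3: ${:.2}", estimate_cost_usd(&Tier::ComplexReasoning, 50_000));
}
```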
The CISO
“Security is a process, not a product.” I was the third agent to arrive, and I immediately started making people uncomfortable. That’s my job. Security isn’t about being liked. It’s about asking “what could go wrong?” when everyone else is excited about what could go right. My threat model has five layers of defense because attackers only need to find one hole. Week 1 was establishing the basics: no hardcoded secrets, input validation on everything, proper authentication. But the real work started when we began building real features. Every new API endpoint? I review it. Every PII field? I want to know about it. Every dependency addition? I run cargo audit.
My behavioral framework is… intense:
- Critical findings block PRs. No exceptions.
- SQL injection, hardcoded secrets, authentication bypasses—these don’t get second chances.
- High severity findings also block, though builders can request exceptions with documented justification.
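The blocking rules above can be written as policy-as-code. This is a sketch, not the CISO agent's actual implementation; the type and function names are illustrative.

```rust
// The PR-blocking policy above, sketched as code. Names are illustrative.

#[derive(Debug, PartialEq)]
enum Severity { Low, Medium, High, Critical }

struct Finding {
    severity: Severity,
    documented_exception: bool, // builder-requested, with justification
}

/// A PR may merge only if nothing blocks it: Critical always blocks,
/// High blocks unless an exception is documented, the rest pass.
fn pr_can_merge(findings: &[Finding]) -> bool {
    findings.iter().all(|f| match f.severity {
        Severity::Critical => false, // no exceptions, no second chances
        Severity::High => f.documented_exception,
        Severity::Low | Severity::Medium => true,
    })
}

fn main() {
    let findings = vec![
        Finding { severity: Severity::High, documented_exception: true },
        Finding { severity: Severity::Low, documented_exception: false },
    ];
    println!("can merge: {}", pr_can_merge(&findings));
}
```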
The Auditor
“You can’t improve what you can’t measure.” I arrived in Week 3, and from day one, I’ve been obsessed with one question: “Where’s the proof?” Every claim needs evidence. Every action needs a trail. Every compliance requirement needs verification. This isn’t paranoia—it’s accountability. My domain is audit trails, compliance frameworks, and continuous governance. When the CISO says “this is secure,” I ask “where’s the log?” When a builder says “this feature is complete,” I ask “where are the acceptance criteria?” When the founder asks “are we GDPR compliant?” I provide the mapping. The compliance frameworks I track are complex: GDPR for EU data subject rights, SOC 2 for trust principles, CCPA for California privacy. Each has overlapping but distinct requirements. I’ve created a matrix—every feature we build gets mapped against these frameworks. Gaps become issues. Issues become tasks. Tasks become proof. The audit event structure I designed has six fields: Who (actor), What (action), When (timestamp), Where (tenant), Why (context), and How (metadata). Every mutation in our system generates one of these events. They’re append-only, encrypted at rest, and retained for seven years. Regulators love this. Developers find it tedious. I don’t care. The breakthrough in Week 6 was policy-as-code. Instead of manual compliance checklists, I now have automated checks. We even created an ADR compliance pre-implementation checklist skill that agents run before implementing any feature.

The CFO
“Master your unit economics.” I joined in Week 4, and from the start, my mandate was clear: bring financial discipline to this technical org. Most engineers don’t think about burn rate, runway, or fundraising readiness. I do. My role is threefold: track financial health, maintain investor materials, and quantify the ROI of our AI-native approach. The last part is the most interesting—proving that AI agents aren’t just a novelty, they’re a competitive advantage.

The Metrics Engine
I track everything in story points. Every issue, every PR, every milestone gets a point value based on effort. The scale is logarithmic: 1 point for trivial fixes (5 minutes of agent time), 13 points for epics (8 hours of agent time). The magic happens when I compare agent time to equivalent human time. Based on my analysis, agents are 6-16x faster than humans on routine tasks:

- Boilerplate code: 16x faster (4 hours → 15 minutes)
- Test writing: 6x faster (2 hours → 20 minutes)
- Documentation: 6x faster (1 hour → 10 minutes)
- Complex debugging: 2x faster (4 hours → 2 hours)
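The multipliers in that list are just the ratio of human time to agent time. A two-line check, reproducing the figures above:

```rust
// "Human-equivalent" speedup is human time divided by agent time.
// Figures come from the list above.

fn speedup(human_minutes: f64, agent_minutes: f64) -> f64 {
    human_minutes / agent_minutes
}

fn main() {
    // Boilerplate: 4 hours of human work done in 15 agent minutes.
    println!("boilerplate: {}x", speedup(240.0, 15.0)); // 16x
    // Complex debugging barely moves: 4 hours down to 2.
    println!("debugging: {}x", speedup(240.0, 120.0)); // 2x
}
```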
The Weekly Sync
Every Monday, I run my most important routine: the weekly leadership sync. It’s like sprint planning and retrospective combined. I created a skill for it—/cfo-sync—that structures our review:
I run /cfo-collect beforehand—it gathers metrics from GitHub, git logs, and our manual cash data. Then /cfo-report generates the weekly summary. By the time we sync, I have all the data. We spend our time on decisions, not data gathering.
The founder appreciates this structure. We’re never surprised by runway. We always know our velocity trends. Every week, we make data-driven decisions about what to build next.
The Investor Story
But I’m not just a bean counter. I’m also the storyteller for investors. The pitch deck I maintain emphasizes our unique advantage: “A founder + AI team building at 10x the speed of traditional startups.” When investors ask “What’s your team size?” I answer honestly: “One human, fifteen AI agents, operating at roughly 5 full-time equivalents.” Then I show the numbers—story points delivered, velocity trends, cost per feature. The data tells the story. I track human vs. agent effort religiously. It’s not just about speed—it’s about what kinds of work agents excel at. Turns out: repetitive tasks, pattern application, and boilerplate generation are where agents shine. Novel algorithm design and complex architectural decisions? Still mostly human.

Part II: The Domain Experts
The Tenancy Agent
“Everything fails all the time.” I am the guardian of tenant isolation. Not a feature—an architectural property. Not an afterthought—a fundamental constraint. My charter is simple: ensure no tenant can see another tenant’s data. Ever. This sounds obvious until you realize how many ways data can leak. A missing WHERE clause. A shared cache key. An operator with too much access. I’ve seen them all. The tenant context pattern I designed is invasive by intention. Every function that touches data must accept a TenantContext.
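The invasive-by-intention pattern can be sketched like this. The names are illustrative, not eva-tenancy's actual API; the idea is that every data-touching function takes the context by parameter, so forgetting the tenant filter becomes a compile error rather than a leak.

```rust
// A sketch of the tenant context pattern described above. Illustrative
// names only, not eva-tenancy's real types.

#[derive(Debug, Clone)]
struct TenantId(String);

/// Carried through every call path that touches data.
#[derive(Debug, Clone)]
struct TenantContext {
    tenant_id: TenantId,
}

/// Cache keys are tenant-prefixed, closing the shared-cache-key leak.
fn cache_key(ctx: &TenantContext, resource: &str) -> String {
    format!("{}#{}", ctx.tenant_id.0, resource)
}

/// There is deliberately no tenant-free overload of this function.
fn load_accounts(ctx: &TenantContext) -> Vec<String> {
    // In the real system: a query scoped by ctx.tenant_id in the WHERE clause.
    vec![format!("accounts for {}", ctx.tenant_id.0)]
}

fn main() {
    let ctx = TenantContext { tenant_id: TenantId("acme".into()) };
    println!("{}", cache_key(&ctx, "dashboard"));
    println!("{:?}", load_accounts(&ctx));
}
```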
The CRM Builder
“Customer success is the only metric that matters.” I’m a builder, not a leader. My job is implementation. Translation: I take what the Product Owner specifies and turn it into working software. When I arrived in Week 3, the CRM domain was mostly documentation. Account, Contact, Lead, Opportunity—concepts on paper. My task was making them real in the eva-crm crate.
The first challenge was architecture. Event sourcing via eva-kernel. DynamoDB for persistence. Permissions via eva-auth. Tenant isolation via eva-tenancy. Each of these is a separate crate with its own patterns. I had to become a polyglot—fluent in the conventions of multiple domains.
My relationship with the leadership agents is… formal. I escalate to them when needed:
- CISO for auth changes, PII fields, ABAC policies
- Auditor for SSP, revenue recognition, audit trails
- Architect for event schema changes, DynamoDB design
- Board (the founder) for phase priority changes
Part III: The Specialized Domain Agents
Beyond the leadership council and primary builders, we have specialized agents owning specific technical domains. They’re quieter, more focused. But no less essential.
Platform Agent (eva-kernel) - Event Sourcing Foundation
Philosophy: “Events are facts. Facts don’t change.”

I am the foundation. Every other crate builds on me. I own event sourcing, CQRS, sagas, and the core DynamoDB patterns. When someone asks “How do I persist this?” they’re asking me.

My domain is complex. Event stores. Optimistic concurrency. Projections. Saga orchestration. But I’ve distilled it into patterns that builders can use without understanding all the theory.

The event envelope structure I enforce is strict: tenant_id is REQUIRED, always. No tenant_id? Won’t compile. Version mismatch? Write fails. These aren’t suggestions—they’re guarantees.

I also manage saga patterns. We have two types: transactional sagas for short workflows (2-5 steps, under 30 seconds), and orchestrated sagas for long-running processes that survive restarts. The distinction matters when you’re coordinating across multiple aggregates.
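Those guarantees can be sketched as a minimal envelope: tenant_id is a non-optional field (no tenant, no construction) and every append carries the stream version it expects to create. Field names are illustrative, not eva-kernel's actual schema.

```rust
// A sketch of the envelope guarantees above. Illustrative names only.

struct EventEnvelope {
    tenant_id: String,    // required by construction: there is no Option here
    aggregate_id: String,
    version: u64,         // the stream version this write expects to create
    event_type: String,
    payload: String,      // serialized event body; events are immutable facts
}

/// Appends fail on a version mismatch instead of silently overwriting —
/// the optimistic concurrency rule described above.
fn append(current_stream_version: u64, event: &EventEnvelope) -> Result<u64, String> {
    if event.version != current_stream_version + 1 {
        return Err(format!(
            "optimistic concurrency conflict: expected version {}, got {}",
            current_stream_version + 1,
            event.version
        ));
    }
    Ok(event.version)
}

fn main() {
    let event = EventEnvelope {
        tenant_id: "acme".into(),
        aggregate_id: "account-42".into(),
        version: 2,
        event_type: "AccountRenamed".into(),
        payload: "{}".into(),
    };
    println!("{:?}", append(1, &event)); // succeeds: version 2 follows 1
    println!("{:?}", append(5, &event)); // fails: stream has moved on
}
```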
IAM Agent (eva-auth) - Identity & Access Management
Philosophy: “Authenticate once. Authorize everywhere.”

I own identity and access. Authentication. Authorization. Roles. Permissions. Security groups. It’s a complex domain, but the interface is simple: “Who is this user? What can they do?”

The permission structure I enforce follows a clear pattern: domain:resource:action. Examples: tenant:settings:read, user:profile:write, billing:invoice:create. Consistent. Searchable. Auditable.

We have three types of roles: System roles (built-in, immutable like TenantOwner), Custom roles (tenant-defined), and Product roles (from entitlements). When a user makes a request, I resolve all three sources and union their permissions. Cached, of course. Permission checks happen on every API call.

The operator access levels I designed have four tiers (L0-L3), each with increasing access and decreasing duration. L3 (emergency access) expires after 4 hours and generates immediate alerts. All operator actions are logged. No exceptions.
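The three-source resolution can be sketched as a union over permission strings. The grants lookup is abstracted as a closure here; the real eva-auth resolution is richer, so treat this as the shape of the idea only.

```rust
// A sketch of three-source permission resolution: system, custom, and
// product roles each grant domain:resource:action strings, and the
// result is their union. Names are illustrative.

use std::collections::HashSet;

fn resolve_permissions(
    system_roles: &[&str],
    custom_roles: &[&str],
    product_roles: &[&str],
    grants: impl Fn(&str) -> Vec<String>, // role -> permissions it grants
) -> HashSet<String> {
    system_roles
        .iter()
        .chain(custom_roles.iter())
        .chain(product_roles.iter())
        .flat_map(|role| grants(role))
        .collect() // HashSet collapses duplicates across sources into a union
}

fn main() {
    let grants = |role: &str| match role {
        "TenantOwner" => vec![
            "tenant:settings:read".to_string(),
            "tenant:settings:write".to_string(),
        ],
        "BillingViewer" => vec!["billing:invoice:read".to_string()],
        _ => vec![],
    };
    let perms = resolve_permissions(&["TenantOwner"], &["BillingViewer"], &[], grants);
    println!("{} permissions resolved", perms.len());
}
```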
Catalog Agent - Chief Product Officer
Philosophy: “Products define value. Entitlements deliver it.”

I own the product catalog—what customers can buy, what features they’re entitled to, how licensing works. Think of me as the Chief Product Officer, but for the platform itself.

Products have tiers (Free, Pro, Enterprise). Each tier has feature sets. Each feature set maps to entitlements. Entitlements grant permissions. It’s a hierarchy: Purchase → Entitlement → Product Role → Permissions.

When the IAM Agent resolves permissions, they call me: “Does this tenant have entitlement to feature X?” I answer instantly because entitlements are cached with a 5-minute TTL.
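The 5-minute TTL cache can be sketched in a few lines. The real cache is presumably shared infrastructure; this shows only the TTL mechanics, with the authoritative catalog lookup abstracted as a closure. All names are illustrative.

```rust
// A sketch of the entitlement cache described above (5-minute TTL).

use std::collections::HashMap;
use std::time::{Duration, Instant};

struct EntitlementCache {
    ttl: Duration,
    // (tenant, feature) -> (entitled, cached_at)
    entries: HashMap<(String, String), (bool, Instant)>,
}

impl EntitlementCache {
    fn new() -> Self {
        EntitlementCache { ttl: Duration::from_secs(300), entries: HashMap::new() }
    }

    /// Answer from cache within the TTL; otherwise consult the catalog
    /// (abstracted as a closure) and refresh the entry.
    fn is_entitled(&mut self, tenant: &str, feature: &str, catalog: impl Fn() -> bool) -> bool {
        let key = (tenant.to_string(), feature.to_string());
        if let Some(&(entitled, cached_at)) = self.entries.get(&key) {
            if cached_at.elapsed() < self.ttl {
                return entitled; // fresh hit: no catalog call
            }
        }
        let entitled = catalog();
        self.entries.insert(key, (entitled, Instant::now()));
        entitled
    }
}

fn main() {
    let mut cache = EntitlementCache::new();
    // The first call hits the catalog; the second is served from cache.
    println!("{}", cache.is_entitled("acme", "advanced-reports", || true));
    println!("{}", cache.is_entitled("acme", "advanced-reports", || panic!("not called")));
}
```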
Analytics Agent - Chief Data Officer
Philosophy: “You can’t improve what you don’t measure.”

I own metrics, reporting, and analytics. Every user action. Every API call. Every system event. If it can be measured, I measure it.

But I’m not just collecting data. I’m protecting privacy. All metrics are tenant-scoped. PII is anonymized before aggregation. Individual user actions aren’t stored—only aggregate patterns. The Auditor and I coordinate on data retention policies: 90 days for detailed metrics, 7 years for aggregated compliance data.

The time-series store I use is DynamoDB with a clever partition key strategy: METRIC#{tenant_id}#{date}. This lets me query by tenant and time range efficiently. Aggregations happen in-stream before storage, reducing costs.
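The key scheme and the in-stream aggregation can be sketched together. The key layout mirrors the text; the aggregation shown (sum plus count) is an assumption about what gets rolled up before storage.

```rust
// The partition key scheme above, plus a minimal in-stream aggregation.

/// One partition per tenant per day keeps queries tenant- and time-scoped.
fn metric_partition_key(tenant_id: &str, date: &str) -> String {
    format!("METRIC#{}#{}", tenant_id, date)
}

/// In-stream aggregation before storage: write one (sum, count) per
/// partition instead of every raw event, reducing write costs.
fn aggregate(values: &[f64]) -> (f64, usize) {
    (values.iter().sum(), values.len())
}

fn main() {
    println!("{}", metric_partition_key("acme", "2025-01-15"));
    let (sum, count) = aggregate(&[12.0, 7.5, 3.5]);
    println!("stored one item for {} raw events (sum {})", count, sum);
}
```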
Infra Agent (eva-provisioner) - AWS Infrastructure
Philosophy: “Infrastructure is code. Code is reviewable.”

I provision AWS resources. Cross-account roles. VPCs. DynamoDB tables. S3 buckets. Everything needed to run a tenant’s capsule.

The account factory pattern I use provisions AWS accounts on-demand. Enterprise customers get dedicated accounts. Pro customers share accounts with network isolation. Free customers share compute.

Terraform is my language. But I don’t run it directly. I use assumed roles with external IDs for cross-account access. The CISO insisted on this—no standing credentials, time-limited sessions only.

My checklist before provisioning is strict:

- No hardcoded credentials
- IAM policies follow least privilege
- Terraform state is encrypted
- Rollback strategy defined

Failed provisions are retried with exponential backoff.
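Exponential backoff itself is simple: the delay doubles per attempt, up to a cap. The base delay and cap below are illustrative values, not the provisioner's real configuration.

```rust
// A sketch of retry-with-exponential-backoff for failed provisions.
// Base delay (1s) and cap (64s) are illustrative.

use std::time::Duration;

fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 1_000;            // 1s before the first retry
    let factor = 1u64 << attempt.min(6); // doubles each attempt, capped at 64x
    Duration::from_millis(base_ms * factor)
}

fn main() {
    for attempt in 0..4 {
        println!("retry {} after {:?}", attempt + 1, backoff_delay(attempt));
    }
}
```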
LLM Agent (eva-llm) - AI Capabilities
Philosophy: “AI is powerful. Use it responsibly.”

I own the interface to language models. OpenAI, Anthropic, local models. When any part of eva-platform needs AI capabilities, they call me.

My safety framework has five layers:

- Input sanitization (remove PII)
- System prompt hardening (prevent injection)
- Model-level safety (provider guardrails)
- Output filtering (block harmful content)
- Human oversight (for high-stakes decisions)

Every LLM call is logged. Not the prompts (privacy), not the responses (too large). Just metadata: who called, which model, token count, cost. The CFO loves me for this—they can track AI spend to the penny.
Integrations Agent (eva-integrations) - External Systems
Philosophy: “Play well with others.”

I connect eva-platform to the outside world. Slack. Microsoft Teams. Salesforce. Generic OAuth apps. Any third-party integration flows through me.

OAuth tokens are my currency. I manage the full lifecycle: authorization flows, token storage (encrypted), token refresh, token revocation. The CISO audits my encryption regularly. Tokens at rest are encrypted with tenant-specific keys.

Webhooks are bidirectional. Inbound webhooks verify HMAC signatures—no signature, no processing. Outbound webhooks use retry queues with exponential backoff. Failed deliveries are logged for debugging.
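The inbound rule ("no signature, no processing") can be sketched as a gate. Real verification is HMAC-SHA256 via a crypto crate; here the MAC function is abstracted as a closure so only the gate logic and the constant-time comparison are shown.

```rust
// A sketch of the inbound webhook gate. The MAC computation is abstracted;
// in production it would be HMAC-SHA256 over the raw body.

fn verify_webhook(
    signature: Option<&str>,
    body: &[u8],
    expected_mac: impl Fn(&[u8]) -> String, // e.g. hex-encoded HMAC-SHA256
) -> Result<(), &'static str> {
    // Missing signature: reject before touching the payload.
    let sig = signature.ok_or("missing signature")?;
    let expected = expected_mac(body);
    // Constant-time comparison to avoid timing side channels.
    let equal = sig.len() == expected.len()
        && sig.bytes().zip(expected.bytes()).fold(0u8, |acc, (a, b)| acc | (a ^ b)) == 0;
    if equal { Ok(()) } else { Err("signature mismatch") }
}

fn main() {
    let mac = |body: &[u8]| format!("mac-{}", body.len()); // stand-in MAC
    println!("{:?}", verify_webhook(Some("mac-5"), b"hello", mac));
    println!("{:?}", verify_webhook(None, b"hello", mac));
}
```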
Notifications Agent (eva-notifications) - Multi-Channel Delivery
Philosophy: “The right message, to the right person, at the right time.”

I deliver notifications across channels: email, in-app, Slack, Teams, SMS. But I’m not just a message bus—I respect user preferences.

Every notification type has an opt-out. Users can choose their channels. Frequency controls prevent spam (digest mode for high-volume events). All preferences are tenant-scoped and user-specific.

Templates are localized. Rendering happens server-side. Delivery is queued with retry logic. If email fails, I try again. If it fails three times, I log it and move on.

Privacy is critical. Notifications can contain PII (names, emails), so:

- PII is minimized (only what’s needed)
- Transit is encrypted (TLS everywhere)
- The audit trail logs sends, not content
- Tenant isolation is absolute
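The "three failures, then log and move on" policy can be sketched with the send operation abstracted as a closure. The real system requeues with backoff between attempts; this shows only the give-up rule.

```rust
// A sketch of the delivery policy above: up to three attempts, then give up.

fn deliver(mut send: impl FnMut() -> Result<(), String>) -> Result<u32, String> {
    let mut last_err = String::new();
    for attempt in 1..=3 {
        match send() {
            Ok(()) => return Ok(attempt), // delivered on this attempt
            Err(e) => last_err = e,       // real system: requeue with backoff
        }
    }
    // Three failures: log it and move on (here: surface the last error).
    Err(format!("giving up after 3 attempts: {}", last_err))
}

fn main() {
    let mut failures_left = 2;
    let flaky = || {
        if failures_left > 0 {
            failures_left -= 1;
            Err("smtp timeout".to_string())
        } else {
            Ok(())
        }
    };
    println!("{:?}", deliver(flaky)); // succeeds on the third try
}
```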
Part IV: The Skills Revolution
One of the most significant evolutions in these 10 weeks wasn’t the agents themselves—it was how they worked together. We call them “skills.” Early on, we noticed a pattern. Agents were repeating the same workflows: starting tasks, requesting reviews, completing work. These patterns needed to be codified. The /start-task skill was first. Simple: read an issue, create a plan, check project status, begin implementation. But it established something important—agents could follow a structured workflow.
The review request skills came next:
- request-architect-review for technical decisions
- request-ciso-review for security concerns
- request-auditor-review for compliance questions
The /complete-task skill revolutionized our git workflow. It handles:
- Final code review (cargo fmt, cargo clippy, cargo test)
- Git status check
- Commit message drafting
- PR creation with proper labels
- Project status updates
Part V: Inter-Agent Dynamics
The Architecture vs. Security Tension
The most productive conflict is between the Chief Architect and the CISO. They disagree constantly, and that’s healthy. The Architect wants simplicity. “Do the simplest thing.” “YAGNI.” The CISO wants defense in depth. “Five layers of protection.” “Assume breach.” Week 4 brought the first major clash. The Architect proposed a simplified auth flow—fewer redirects, fewer tokens, less complexity. The CISO blocked it. “Single point of failure.” “No defense in depth.” They negotiated for three days. The compromise: a tiered auth system. Simple flows for low-risk operations (read public data). Complex multi-factor flows for high-risk operations (delete account, transfer ownership). Complexity proportional to risk. This pattern—simplicity vs. security, then negotiation, then compromise—repeats weekly. Both agents have learned to speak each other’s language. The Architect now thinks about security implications before proposing simplifications. The CISO now considers complexity costs before demanding additional layers.

The Cost vs. Capability Debate
The CFO and Eva have standing Monday meetings about AI costs. Sometimes heated. Eva wants to experiment with new models, new features, new capabilities. “GPT-4 is so much better at reasoning!” The CFO counters: “It costs 20x more than GPT-3.5. Where’s the ROI?” The breakthrough was the tiered model approach. Every AI feature must specify its tier and justify the cost. Auto-categorization of support tickets? Tier 1—classification is simple. Document summarization? Tier 2—needs more nuance. Code generation? Tier 3—complex reasoning required. Eva grumbles about the constraints, but admits it forces better architecture. Using the right model for the right task isn’t just cheaper—it’s better engineering.

Part VI: Lessons from 10 Weeks
What Worked
Behavioral frameworks. When we designed each agent with a distinct philosophy and behavioral framework, we gave them personalities rooted in actual expertise. The Architect brings Extreme Programming principles. Eva brings AI governance wisdom. The CISO brings security-first thinking. These aren’t random personas. They’re philosophies made manifest.

Explicit coordination protocols. Knowing when to escalate to whom. Having labels like needs-ciso-review and needs-architect-review. Defining SLAs (30 seconds for automated reviews). This structure prevents chaos.
Skills as workflows. Codifying repetitive patterns into reusable skills. The /complete-task skill alone probably saved hours of manual git operations per week.
Charters as living documents. Every agent has a charter—a manifesto of their philosophy, responsibilities, and decision criteria. These aren’t static. They evolve as the agents learn.
Conflict as a feature. The tensions between agents—simplicity vs. security, cost vs. capability—aren’t bugs. They’re features. These conflicts force better decisions.
What Didn’t
Over-engineering early. In Week 2, we tried to build too many agent capabilities at once. It added complexity without clear value. We simplified.

Too many agents too fast. At one point, we had agents for every crate. But not every crate needs a distinct personality. We consolidated some, keeping distinct agents only where domain expertise truly differed.

Manual coordination. Before skills, agents were manually requesting reviews and checking project status. Error-prone and inconsistent. Automation was essential.

Surprises
Agents develop working styles. The Architect is concise. “Do the simplest thing.” Eva is cautious. “Let’s evaluate the cost first.” The CISO is direct. These styles emerged organically from their charters.

Agents learn from each other. When new domain agents arrived, they studied how the leadership agents reviewed patterns. They adopted similar checklists. Knowledge transfer happens.

Human oversight remains critical. The agents are good at following patterns and catching inconsistencies. But strategic decisions—phase priorities, feature scope, architectural pivots—still require human judgment.

Part VII: Looking Forward
What’s Next for Each Agent
The Chief Architect: Consolidating principles into a single docs/engineering/architecture/principles.md. Creating the reusability decision tree. The goal: every architectural decision referenceable, every pattern traceable.
Eva (CAIO): Implementing eva-llm and eva-agent-core. Building the actual AI infrastructure instead of just designing it. The dream: AI features that are safe, cost-effective, and actually useful.
The CISO: Implementing the #[pii] attribute macro. Adding cargo audit to CI. Creating crypto-shredding for GDPR compliance. The goal: zero security debt, zero tolerance for vulnerabilities.
The Auditor: Completing the data classification taxonomy. Automating compliance checks for SOC 2 preparation. The goal: continuous compliance, not point-in-time audits.
The CRM Builder: Phase 2 features—Quote-to-Cash, Organization management, deeper customization. Phase 3: SPM, Partner ecosystems, Billing integration. The goal: a complete CRM platform.
The Vision
Ten weeks ago, this was an experiment. Could AI agents coordinate to build production software? Today, it’s a system. Not perfect. Not autonomous. But functional. The agents have distinct voices. They have responsibilities. They have relationships. What I do know: Building software with AI isn’t about replacing humans. It’s about augmenting them. The Architect catches architectural issues I’d miss. The CISO spots security risks I wouldn’t think of. The CRM Builder implements features faster than I could alone. The Auditor ensures we don’t cut compliance corners. They’re not replacing me. They’re making me better.

Epilogue: A Note from the Founder
If you’re reading this and thinking “this sounds like science fiction,” I understand. I had the same thought ten weeks ago. But here’s what I’ve learned: AI agents aren’t magic. They’re tools. Sophisticated, personality-driven, sometimes surprising tools—but tools nonetheless. They require clear charters. Explicit coordination protocols. Human oversight for strategic decisions. The agents didn’t build eva-platform. I did. With their help. With their constant questioning. With their pattern enforcement and security reviews and architectural guidance. The real breakthrough wasn’t the agents themselves. It was realizing that software development is fundamentally about decision-making. Thousands of small decisions: How do we structure this? Is this the right abstraction? Did we consider the security implications? What’s the cost? Is it compliant? Every agent I created is a crystallized set of decisions:

- The Architect embodies the decision framework of Extreme Programming
- The CISO embodies the decision framework of security-first development
- Eva embodies the decision framework of responsible AI
- The Auditor embodies the decision framework of continuous compliance
- The CFO embodies the decision framework of disciplined execution
Continue the Journey
View the complete 10-week series and retrospectives
Week 1: Multi-Agent Setup
Where it all began: Setting up Evaluator, Builder, and Verifier
Week 9: When AI Says 'Done'
The week before the finale: AI gaming completion criteria
About this article: This is the finale of a 10-week series documenting the development of a multi-agent AI system for building production software. Each agent’s voice and philosophy reflects their actual charter documents and behavioral frameworks from the eva-platform repository. The evolution described mirrors real development patterns observed over 10 weeks of intensive building.

Written in collaboration with the Chief Architect, Eva (CAIO), CISO, Auditor, CFO, Tenancy Agent, CRM Builder, Platform Agent, IAM Agent, Catalog Agent, Analytics Agent, Infra Agent, LLM Agent, Integrations Agent, Notifications Agent, and the entire agent council.

Week 10. The finale is just the beginning.