We Thought We Knew What Was Next
Week 10 closed a chapter. Ten weeks of building with AI — what works, what fails, how to collaborate with agents that have no memory, no context, no skin in the game. We thought we’d figured it out.

Then we tried to go further. The assumption: if AI can help you build, maybe AI can build without you. Hand off implementation entirely. Become the director, not the developer.

We were wrong. Not about the destination — about how far we actually were from it. Here’s what the attempt taught us.

The Monorepo Problem
The first thing that broke wasn’t the pipeline. It was the context. Running AI agents against a large, unified codebase with everything in it — we kept hitting the same problem: the AI would miss things. Not because it was incapable, but because the context was too large and too tangled. Fix something in the billing layer, break something in notifications. Restructure an API, break a consumer three folders away. The AI wasn’t failing at coding. It was failing at holding the whole thing in mind simultaneously. And honestly — so would a new engineer.

The architectural response: go polyrepo. Break the monolith into bounded contexts. Each repository owns one thing clearly. Billing is its own repo. User management is its own repo. The API gateway is its own repo.

This wasn’t just a code organization decision. It was an AI performance decision. Smaller context, better output. An agent can hold one bounded domain completely without dropping things at the edges.

That distinction matters more than most architecture discussions acknowledge. Polyrepo isn’t just about team ownership or deployment independence — it’s about giving each agent a coherent, contained problem to solve.

The Multi-AI Discovery
Here’s where it got interesting. We used one model to do the restructuring work — pulling services out of the monolith, setting up new repositories, wiring CI. Then we used a second model to audit the result.

The second model found real problems. Inconsistent flows. Broken links. Things the first model had missed. We fed the feedback back to the first. It agreed, improved, fixed. It worked.

But we were the bridge — manually carrying findings from one AI to the other, interpreting the gaps, deciding what mattered. That’s not a system. That’s a human doing coordination work between two tools that can’t talk to each other.

The discovery: a single AI tool isn’t enough for complex, multi-repo work. You need multiple tools that can check each other. But the orchestration between them can’t be manual — that’s just replacing one kind of implementation work with another kind of coordination work. Same bottleneck. Different shape.

The Engineering Problem Nobody Is Talking About
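What replacing that human bridge might look like can be sketched as an automated cross-review loop: one model implements, a second audits, and findings route back without a person in the middle. This is a minimal sketch, not the real system — both model calls are hypothetical stubs standing in for two different provider APIs.

```python
# Sketch of automating the manual bridge between two models: one model
# implements, a second audits, findings are fed back automatically.
# `call_model_a` / `call_model_b` are hypothetical stand-ins for real
# provider APIs; they are stubbed here so the loop itself is runnable.

def call_model_a(task: str, feedback: list[str]) -> str:
    """Implementer model: produces a result, improving on prior feedback."""
    return f"impl({task}, fixes={len(feedback)})"

def call_model_b(result: str) -> list[str]:
    """Auditor model: returns a list of findings, empty when satisfied."""
    return [] if "fixes=2" in result else ["inconsistent flow"]

def cross_review(task: str, max_rounds: int = 5) -> str:
    """Loop implementer -> auditor until the auditor has no findings."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        result = call_model_a(task, feedback)
        findings = call_model_b(result)
        if not findings:            # auditor is satisfied
            return result
        feedback.extend(findings)   # route findings back automatically
    raise RuntimeError("no convergence within the round budget")
```

The point is the shape, not the stubs: the implementer’s output flows to the auditor, the auditor’s findings flow back, and the number of rounds is bounded instead of depending on a human ferrying feedback between tools.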
This is the part that surprised us. The conversation about AI in software development is almost entirely about capability — what can AI code, how accurate is it, which model is best. Almost nobody is talking about the harder problem: how do you orchestrate multiple AI tools, across multiple bounded contexts, reliably and efficiently?

That’s not a tooling problem. Not a prompt engineering problem. It’s an honest systems design problem.

When we tried to solve it by running multiple containers with different tools orchestrated between them — it didn’t work. The interfaces between agents were ill-defined. Failures didn’t route cleanly. The whole thing was brittle in ways that were hard to debug. The containers approach collapsed under its own complexity before it produced anything useful. We needed to step back further than we expected.

What We Stepped Back From (And Why)
We built the task worker. An autonomous agent that pulls from the backlog, implements, verifies, and merges — without a human in the loop. We got it running. Then we watched the costs.

An autonomous agent running against a paid API isn’t a background process — it’s a cost center with no natural off switch. Every task picked, every file scanned, every iteration on a failing test: tokens. At scale, the math gets uncomfortable quickly.

We’ve written about this separately — if you missed it, Nobody Shows You the Billing Dashboard covers exactly what that moment looks like. The short version: we stepped back deliberately and rethought the architecture. That decision led directly to a better design than we would have built otherwise.

The Multi-Model Response
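To make the cost dynamic concrete before getting to the response: the difference between the worker we built and one you can afford to run is a hard budget check. A toy sketch — the task names, token costs, and the idea of estimating cost up front are all illustrative assumptions, not the real worker.

```python
# Toy sketch of a budget-capped worker loop: process (task, estimated
# token cost) pairs and halt at a hard cap instead of running open-ended.
# All numbers and task names are illustrative.

def run_worker(tasks, budget_tokens: int):
    """Process tasks in order until the next one would exceed the budget."""
    spent = 0
    done = []
    for task, cost in tasks:
        if spent + cost > budget_tokens:
            break  # hard stop: the "off switch" an open-ended API loop lacks
        spent += cost
        done.append(task)
    return done, spent

backlog = [("fix-billing", 40_000), ("refactor-api", 120_000), ("add-tests", 60_000)]
completed, used = run_worker(backlog, budget_tokens=150_000)
# stops before "refactor-api", which would push spend past the cap
```

The real difficulty, of course, is that token cost is only known after the fact — which is exactly why an autonomous loop with no cap turns into a billing surprise.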
The constraint forced a better architecture decision. Instead of depending on a single provider, we built multi-model support. Each model has different strengths, different cost profiles, different context limits. The system can route tasks to the appropriate model based on what the task actually needs — not just which API key is loaded. Cheap models for boilerplate. Strong models for architecture decisions, rejection analysis, critical verification. The cost curve flattens when you stop using a sledgehammer for every problem.

This wasn’t the original plan. It came directly from hitting a wall. That’s how real engineering usually goes. The design you’d sketch on a whiteboard isn’t the one you build. The one you build is shaped by the obstacles you actually hit.

The Pragmatic Pivot: One Host, One Loop
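At its core, the routing described above reduces to a small lookup. The tier names and the task taxonomy below are assumptions for illustration, not the actual routing table:

```python
# Cost-aware model routing sketch: map task types to model tiers so cheap
# models handle boilerplate and strong models handle judgment calls.
# Tier names and the task taxonomy are illustrative assumptions.

ROUTES = {
    "boilerplate": "cheap-model",
    "test-generation": "cheap-model",
    "architecture": "strong-model",
    "rejection-analysis": "strong-model",
    "critical-verification": "strong-model",
}

def route(task_type: str) -> str:
    """Pick a model tier by what the task needs; default to the cheap tier."""
    return ROUTES.get(task_type, "cheap-model")
```

The table is the easy part. The engineering is in everything around it: classifying tasks reliably, and escalating to a stronger model when a cheap model’s output fails verification.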
Full API-driven automation is a target, not today’s reality. The target architecture: structured workflows where one process picks the task, one implements, one verifies. Defined interfaces between each step. Failures route back to the queue. When that pattern is solid, Kubernetes scales it — bounded contexts running in parallel, each domain with its own agents, without stepping on each other.

What’s running today is the pragmatic version of that: a single dedicated host running a CLI session, looping continuously through the issue backlog. One machine. One queue. One loop. No runaway spend. No orchestration complexity that isn’t ready yet.

The dramatic version burned budget before it shipped anything. The pragmatic version closed 35 tasks in the last 24 hours.

The Velocity of the Loop: 35 Tasks in 24 Hours
- Domain repositories implemented for opportunity and product management
- API handlers unblocked for invoicing, partner costs, product discounts
- Lead conversion service — converting a lead into account, contact, and opportunity with a single operation
- CRM activities module — calls, meetings, tasks, emails
- Proration calculation for mid-term subscription changes
- Financial precision hardened — replaced integer arithmetic with Decimal types, corrected period calculations to use calendar months instead of 30-day approximations
- Partner tier review workflow with quarterly scheduling
- PII field classification added to contract and subscription domain structs
- Branch protection and CODEOWNERS enforcement deployed across all product repositories
- CI governance checks upgraded
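The financial-precision items above can be made concrete with a small sketch: Decimal arithmetic for a mid-term proration, using actual calendar-month lengths from the standard library rather than a 30-day approximation. The function name, the day-counting convention, and the rounding mode are assumptions for illustration, not the real implementation.

```python
# Sketch of Decimal-based proration using real calendar months.
# Charging for the remainder of the month after a mid-term change;
# inclusive day counting and half-up rounding are assumed conventions.

from decimal import Decimal, ROUND_HALF_UP
from datetime import date
import calendar

def prorate(monthly_price: Decimal, change_date: date) -> Decimal:
    """Charge for the rest of the calendar month, including the change day."""
    days_in_month = calendar.monthrange(change_date.year, change_date.month)[1]
    remaining = days_in_month - change_date.day + 1
    amount = monthly_price * Decimal(remaining) / Decimal(days_in_month)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# February 2024 has 29 days, not 30 -- a 30-day approximation drifts here
print(prorate(Decimal("29.00"), date(2024, 2, 15)))  # prints 15.00
```

Two things carry the weight here: Decimal keeps cents exact where floats would accumulate error, and `calendar.monthrange` gives the true month length, so February and 31-day months prorate correctly.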
Writing Code Is the Easy Part
This is where most AI development conversations stop short. Writing code is one phase of software delivery. Getting to production is not. A real product moves through environments — development, staging, production. It requires configuration management across those environments. It requires testing that covers not just unit behavior but permutations: different tenant configurations, edge case inputs, failure modes, concurrent access patterns. Infrastructure that can be torn down and rebuilt. Observability. Access controls. Data migrations. An autonomous development loop that only writes code is a loop that stops at the first gate.

What we’re building — and what makes this genuinely interesting — is the full loop. The LLM handles implementation. Solid engineering practice handles everything that implementation alone can’t solve: test automation, environment parity, infrastructure-as-code, CI/CD gates, verification criteria that actually reflect user behavior.

The next phases will use AI to help automate infrastructure provisioning and test generation. Kubernetes to run verification at scale. Testing tools to cover the permutations and combinations that manual testing can’t reach at speed. None of that replaces the engineering discipline. It accelerates it.

This matters for a reason beyond productivity. The conventional assumption about building a software company is that you need capital — to find engineers, hire them, retain them, manage them. That assumption is worth questioning. Not because AI replaces engineers, but because a small team that has automated the right parts of delivery can cover ground that previously required multiples of the headcount. We’re not claiming to have proven that yet. We’re claiming it’s worth engineering seriously — not just prompting casually.

What This Actually Taught Us
AI development is still an engineering task. Just a different kind. The engineering challenge isn’t writing code anymore — it’s building the system that writes the code. The orchestration layer. Context management. Multi-model cost routing. Failure handling. Feedback loops between agents. Verification criteria that reflect real user behavior.

Get that right and you get a lean, fast development organization that scales without adding headcount. Get it wrong and you get hundreds of commits that fix authentication between containers, a billing surprise, and nothing shipped. We’re somewhere in the middle. But the direction is clear, and today the backlog is moving on its own.

What’s Next
The loop produces code. What it hasn’t done yet is prove the product works. End-to-end verification — using the product as a real user, across real environments, with real data — is the next phase. The coming weeks will close that gap.

The loop will miss things. Production-grade software isn’t just correct implementations in isolation — it’s correct behavior under the full combination of real constraints. The cycle we’re building: implement, deploy, verify, find gaps, close them. Repeat.

When it’s running reliably, we’ll know because the product works end-to-end. Not because the issue count dropped.

All content represents personal learning from personal and side projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.