We got our autonomous pipeline working. Task picked from the queue, AI implements, verifier reviews the PR, requirement closes. The loop we’d been building toward for weeks — it ran. Then we watched the API usage. Credits were gone almost instantly.

This isn’t a complaint about pricing. It’s a systems observation the AI development conversation almost entirely ignores: an autonomous agent running against a paid API isn’t a background process. It’s a cost center with no natural off switch.

Think through what happens inside a single task:
- Agent reads the task requirements: tokens
- Agent explores the codebase to understand context: tokens
- Agent writes an implementation: tokens
- Tests fail, agent reads the error, iterates: tokens, tokens, tokens
- Verifier reads the diff, runs its review: tokens
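Every one of those steps bills tokens, and the retry loop multiplies them. A back-of-the-envelope model makes the compounding visible — every token count and price below is an illustrative placeholder, not a figure from our pipeline or any real provider's rate card:

```python
# Rough per-task cost model for an autonomous agent loop.
# All token counts and prices are assumed placeholders, not measurements.

PRICE_PER_1K_INPUT = 0.003   # $ per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # $ per 1K output tokens (assumed)

def step_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single agent step."""
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)

def task_cost(fix_iterations: int = 3) -> float:
    """End-to-end cost of one task, including failed test iterations."""
    cost = step_cost(2_000, 500)       # read the task requirements
    cost += step_cost(30_000, 1_000)   # explore the codebase for context
    cost += step_cost(10_000, 4_000)   # write the implementation
    for _ in range(fix_iterations):    # tests fail -> read error -> iterate
        cost += step_cost(15_000, 2_000)
    cost += step_cost(20_000, 1_500)   # verifier reads the diff, reviews
    return cost

per_task = task_cost()
print(f"one task:        ${per_task:.2f}")
print(f"100 tasks/month: ${per_task * 100:.2f}")
```

Notice that the single biggest line item isn't the implementation step — it's the fix loop, because each iteration re-reads context. That is also the step whose count you control least.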
The Constraint Forced the Architecture
We stepped back. Not because the pipeline didn’t work — it did. Because running it continuously wasn’t fiscally responsible at our stage. That constraint turned out to be the most useful thing that happened to us.

The obvious response when you’re overspending on one provider: stop depending on one provider. We built multi-model routing — sending tasks to different models based on what the task actually requires. Boilerplate implementation goes to a cheaper, faster model. Architectural decisions, or tasks the verifier has rejected and that need serious rethinking, route to a stronger model. The unit economics shift considerably when you stop treating every task as equally expensive.

This wasn’t in the original design. The original design was “run the best model on everything.” The billing dashboard forced a better architecture than we would have built on a whiteboard. The ideal system you design in theory isn’t the one you end up with. The one you build is shaped by the obstacles you actually hit. We’ve found that constraints produce more robust designs than unconstrained planning — but we would never have believed that until we hit one hard enough.

The Honest Middle Ground
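That routing layer doesn’t need to be sophisticated to change the unit economics. A minimal sketch of the idea — the model tiers, task categories, and escalation threshold here are hypothetical, not our actual routing table:

```python
from dataclasses import dataclass

# Hypothetical model tiers -- placeholders, not real model names or rates.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-reasoning-model"

@dataclass
class Task:
    description: str
    category: str        # e.g. "boilerplate", "architecture", "feature"
    rejections: int = 0  # times the verifier has bounced this task

def route(task: Task) -> str:
    """Pick a model tier based on what the task actually requires."""
    # Architectural decisions always get the strong model.
    if task.category == "architecture":
        return STRONG_MODEL
    # A task the verifier keeps rejecting needs serious rethinking,
    # so escalate it instead of burning cheap-model retries.
    if task.rejections >= 2:
        return STRONG_MODEL
    # Everything else defaults to the cheap, fast tier.
    return CHEAP_MODEL

print(route(Task("add CRUD endpoint", "boilerplate")))        # cheap tier
print(route(Task("split the billing module", "architecture")))  # strong tier
print(route(Task("flaky feature", "feature", rejections=3)))  # escalated
```

The rejection-count rule is the interesting one: it uses the verifier’s own signal to decide when cheap retries have stopped being cheap.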
Full API-driven autonomy is on pause. Deliberately, not indefinitely. What’s running instead: a persistent local session working through the backlog continuously, on hardware we already own. The cost is electricity. The tradeoff is occasional steering — the session doesn’t self-schedule the way a cron-driven pipeline does. But delivery keeps moving, and the bill is predictable. It’s less dramatic than “fully autonomous.” It’s also more honest about the tradeoff most teams will face when they try to build this.

The path from “AI helps me code” to “AI runs the development organization” isn’t a switch you flip. It’s an economics problem as much as an engineering problem. You’re building a system that consumes a resource with a per-unit cost every time it acts. Getting the cost-per-outcome into a range that makes sense for your stage is part of the engineering, not an afterthought.

Encoding your standards as hard constraints — the way Claude Code hooks enforce architectural rules — is one lever. Another is knowing, before you scale, where autonomous agents fail systematically: tasks with no clear pattern, problems that require judgment at the boundaries of your domain. Those are the tasks that run longest, branch most, and cost most. Routing them correctly is the economic problem underneath the capability problem.

What the Demos Owe You
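Cost-per-outcome is the number worth computing before anything else, because failed attempts bill the same as successful ones. A toy comparison of the two operating modes — every figure is an assumption for illustration, including the success rate and the power draw:

```python
# Toy cost-per-completed-task comparison: metered API pipeline vs. a
# persistent local session. Every number is an assumed placeholder.

def api_cost_per_outcome(cost_per_attempt: float, success_rate: float) -> float:
    """Expected API spend per *completed* task: failures still bill.

    With independent retries, expected attempts per success = 1 / p.
    """
    return cost_per_attempt / success_rate

def local_cost_per_outcome(watts: float, hours_per_task: float,
                           price_per_kwh: float) -> float:
    """Electricity cost per completed task on hardware you already own."""
    return (watts / 1000) * hours_per_task * price_per_kwh

api = api_cost_per_outcome(cost_per_attempt=0.50, success_rate=0.7)
local = local_cost_per_outcome(watts=300, hours_per_task=2, price_per_kwh=0.30)
print(f"API pipeline:  ${api:.2f} per completed task")
print(f"local session: ${local:.2f} per completed task")
```

A fuller model would also price the human steering time the local session requires; this sketch captures only the direct resource cost on each side.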
The autonomous agent demos aren’t dishonest. The capabilities they show are real. But they’re showing you the output without the input cost, and that gap matters if you’re trying to build something sustainable rather than just impressive.

Before you architect an autonomous pipeline, work through the economics. What does a single task cost end-to-end, including failed attempts and verifier cycles? At your expected task volume, what does that translate to monthly? Which tasks are genuinely worth strong-model pricing, and which can run cheaper? What does the fully loaded cost of a human doing the same task look like? The answers will shape your architecture more than any capability benchmark will.

The billing dashboard isn’t the end of the story. It’s where the real engineering starts.

All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.