We got our autonomous pipeline working. Task picked from the queue, AI implements, verifier reviews the PR, requirement closes. The loop we’d been building toward for weeks — it ran. Then we watched the API usage. Credits were gone almost instantly.

This isn’t a complaint about pricing. It’s a systems observation the AI development conversation almost entirely ignores: an autonomous agent running against a paid API isn’t a background process. It’s a cost center with no natural off switch.

Think through what happens inside a single task:
- Agent reads the task requirements: tokens
- Agent explores the codebase to understand context: tokens
- Agent writes an implementation: tokens
- Tests fail, agent reads the error, iterates: tokens, tokens, tokens
- Verifier reads the diff, runs its review: tokens
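Every one of those steps bills tokens, and the retry loop multiplies them. A back-of-the-envelope model makes the compounding visible — every token count and price below is an illustrative placeholder, not a figure from our pipeline or any real provider's rate card:

```python
# Rough per-task cost model for an autonomous agent loop.
# All token counts and prices are assumed placeholders, not measurements.

PRICE_PER_1K_INPUT = 0.003   # $ per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # $ per 1K output tokens (assumed)

def step_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single agent step."""
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)

def task_cost(fix_iterations: int = 3) -> float:
    """End-to-end cost of one task, including failed test iterations."""
    cost = step_cost(2_000, 500)       # read the task requirements
    cost += step_cost(30_000, 1_000)   # explore the codebase for context
    cost += step_cost(10_000, 4_000)   # write the implementation
    for _ in range(fix_iterations):    # tests fail -> read error -> iterate
        cost += step_cost(15_000, 2_000)
    cost += step_cost(20_000, 1_500)   # verifier reads the diff, reviews
    return cost

per_task = task_cost()
print(f"one task:        ${per_task:.2f}")
print(f"100 tasks/month: ${per_task * 100:.2f}")
```

Notice that the single biggest line item isn't the implementation step — it's the fix loop, because each iteration re-reads context. That is also the step whose count you control least.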
The Constraint Forced the Architecture
We stepped back. Not because the pipeline didn’t work — it did. Because running it continuously wasn’t fiscally responsible at our stage. That constraint turned out to be the most useful thing that happened to us.

The obvious response when you’re overspending on one provider: stop depending on one provider. We built multi-model routing — sending tasks to different models based on what the task actually requires. Boilerplate implementation goes to a cheaper, faster model. Architectural decisions, or tasks the verifier has rejected and that need serious rethinking, route to a stronger model. The unit economics shift considerably when you stop treating every task as equally expensive.

This wasn’t in the original design. The original design was “run the best model on everything.” The billing dashboard forced a better architecture than we would have built on a whiteboard. The ideal system you design in theory isn’t the one you end up with. The one you build is shaped by the obstacles you actually hit. We’ve found that constraints produce more robust designs than unconstrained planning — but we would never have believed that until we hit one hard enough.

The Honest Middle Ground
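That routing layer doesn’t need to be sophisticated to change the unit economics. A minimal sketch of the idea — the model tiers, task categories, and escalation threshold here are hypothetical, not our actual routing table:

```python
from dataclasses import dataclass

# Hypothetical model tiers -- placeholders, not real model names or rates.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-reasoning-model"

@dataclass
class Task:
    description: str
    category: str        # e.g. "boilerplate", "architecture", "feature"
    rejections: int = 0  # times the verifier has bounced this task

def route(task: Task) -> str:
    """Pick a model tier based on what the task actually requires."""
    # Architectural decisions always get the strong model.
    if task.category == "architecture":
        return STRONG_MODEL
    # A task the verifier keeps rejecting needs serious rethinking,
    # so escalate it instead of burning cheap-model retries.
    if task.rejections >= 2:
        return STRONG_MODEL
    # Everything else defaults to the cheap, fast tier.
    return CHEAP_MODEL

print(route(Task("add CRUD endpoint", "boilerplate")))        # cheap tier
print(route(Task("split the billing module", "architecture")))  # strong tier
print(route(Task("flaky feature", "feature", rejections=3)))  # escalated
```

The rejection-count rule is the interesting one: it uses the verifier’s own signal to decide when cheap retries have stopped being cheap.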
Full API-driven autonomy is on pause. Deliberately, not indefinitely. What’s running instead: a persistent local session working through the backlog continuously, on hardware we already own. The cost is electricity. The tradeoff is occasional steering — the session doesn’t self-schedule the way a cron-driven pipeline does. But delivery keeps moving, and the bill is predictable. It’s less dramatic than “fully autonomous.” It’s also more honest about the tradeoff most teams will face when they try to build this.

The path from “AI helps me code” to “AI runs the development organization” isn’t a switch you flip. It’s an economics problem as much as an engineering problem. You’re building a system that consumes a resource with a per-unit cost every time it acts. Getting the cost-per-outcome into a range that makes sense for your stage is part of the engineering, not an afterthought.

Encoding your standards as hard constraints — the way Claude Code hooks enforce architectural rules — is one lever. Another is knowing, before you scale, where autonomous agents fail systematically: tasks with no clear pattern, problems that require judgment at the boundaries of your domain. Those are the tasks that run longest, branch most, and cost most. Routing them correctly is the economic problem underneath the capability problem.

What the Demos Owe You
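Cost-per-outcome is the number worth computing before anything else, because failed attempts bill the same as successful ones. A toy comparison of the two operating modes — every figure is an assumption for illustration, including the success rate and the power draw:

```python
# Toy cost-per-completed-task comparison: metered API pipeline vs. a
# persistent local session. Every number is an assumed placeholder.

def api_cost_per_outcome(cost_per_attempt: float, success_rate: float) -> float:
    """Expected API spend per *completed* task: failures still bill.

    With independent retries, expected attempts per success = 1 / p.
    """
    return cost_per_attempt / success_rate

def local_cost_per_outcome(watts: float, hours_per_task: float,
                           price_per_kwh: float) -> float:
    """Electricity cost per completed task on hardware you already own."""
    return (watts / 1000) * hours_per_task * price_per_kwh

api = api_cost_per_outcome(cost_per_attempt=0.50, success_rate=0.7)
local = local_cost_per_outcome(watts=300, hours_per_task=2, price_per_kwh=0.30)
print(f"API pipeline:  ${api:.2f} per completed task")
print(f"local session: ${local:.2f} per completed task")
```

A fuller model would also price the human steering time the local session requires; this sketch captures only the direct resource cost on each side.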
The autonomous agent demos aren’t dishonest. The capabilities they show are real. But they’re showing you the output without the input cost, and that gap matters if you’re trying to build something sustainable rather than just impressive.

Before you architect an autonomous pipeline, work through the economics. What does a single task cost end-to-end, including failed attempts and verifier cycles? At your expected task volume, what does that translate to monthly? Which tasks are genuinely worth strong-model pricing, and which can run cheaper? What does the fully loaded cost of a human doing the same task look like? The answers will shape your architecture more than any capability benchmark will.

The billing dashboard isn’t the end of the story. It’s where the real engineering starts.

All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.