This is Episode 2 of the Autonomous Dev Org series — an honest account of building a development organization where AI handles implementation and humans handle direction. Each episode covers what we attempted, what broke, and what we learned.

The Loop Was Running. The Memory Wasn’t.

Episode 1 ended with the loop closing 35 tasks in 24 hours. One host. One queue. One continuous execution loop. The pragmatic version of the architecture, running.

Then we watched more closely. The same DynamoDB gotcha that cost three hours on Monday appeared again on Wednesday. Different task, different file, same mistake — conditional write syntax that the documentation describes incorrectly. The agent had worked through it before. But “before” was a closed session. This session started blank. Every session starts blank.

That’s not a model limitation — it’s an architectural one. The model is perfectly capable of using prior context when prior context exists. The problem is that nothing was giving it prior context.

The instinct was to add a vector database. Embed session transcripts, build a retrieval pipeline, query semantically before each task. It would work. It would also mean running another service, maintaining embeddings, and adding latency to every task kickoff. We looked at what we already had instead.

The Insight

The task queue is GitHub Issues. Every closed issue is a record of completed work. Comments capture what happened — what was tried, what failed, what the final approach was. The information was already there, accumulated over weeks of the loop running.

The gap wasn’t that the information didn’t exist. It was that the information wasn’t structured for retrieval. A vector database solves a retrieval problem. But before retrieval comes structure. Unstructured prose needs embeddings and semantic search to pull signal from noise. Structured evidence needs only a keyword search against a typed schema.

The reframe: make task outcomes structured as they’re recorded, and the retrieval problem becomes simple.

The Bead Format

We call the structured outcome record a “bead” — a comment written to the GitHub Issue when the task closes. Six fields, each earning its place:
{
  "bead": {
    "task_id": "GH-142",
    "completed": "2026-02-18",
    "outcome": "Implemented tenant-scoped event handler with idempotency key",
    "friction_points": [
      "DynamoDB conditional write syntax differs from docs — use ConditionExpression not ConditionalOperator",
      "Lambda cold start added 800ms on first invocation — pre-warm with EventBridge schedule"
    ],
    "patterns_used": ["idempotency-key", "conditional-write", "tenant-scoping"],
    "tags": ["dynamodb", "lambda", "event-sourcing", "multi-tenant"]
  }
}
outcome — what was actually delivered. One sentence. Verifiable.

friction_points — this is the most valuable field. The specific things that slowed the work down: wrong documentation, surprising API behavior, a gotcha that cost hours. These are the corrections that otherwise vanish at session end.

patterns_used — the named approaches that applied. Creates a vocabulary for pattern-based retrieval: “find tasks that used idempotency-key.”

tags — technology and domain tags for broader filtering.

The friction_points field captures what didn’t work on the way to what did. Most knowledge systems optimize for capturing solutions. Beads capture the wrong turns too — and that’s the information that most changes how you’d approach the next similar task.

The Retrieval Layer

Once beads are structured, the retrieval question becomes straightforward. An MCP (Model Context Protocol) tool is the right interface: callable from within the agent session, adds zero new infrastructure, returns structured context in milliseconds. The tool search_past_tasks takes a query and optional filters and returns the N most relevant beads. No embeddings. No vector index. Keyword overlap against friction_points and patterns_used, with recency as tiebreaker.
@mcp_tool
def search_past_tasks(
    query: str,
    tags: list[str] | None = None,
    patterns: list[str] | None = None,
    limit: int = 5,
) -> list[dict]:
    """
    Search completed task beads for relevant prior context.
    Returns friction_points, patterns_used, and outcome for matching tasks.
    """
    # gh: thin wrapper over the GitHub Issues API
    issues = gh.issues(state="closed", labels=tags or [])

    beads = []
    for issue in issues:
        bead = extract_bead_comment(issue)
        if bead and matches_query(bead, query, patterns):
            beads.append(bead)

    # Highest score first: friction_point keyword overlap, recency as tiebreaker
    return sorted(beads, key=lambda b: score(b, query), reverse=True)[:limit]
For most task domains this is sufficient. Similar tasks share vocabulary. Friction points describe the specific technologies and APIs involved — exactly the terms that appear in new tasks touching the same ground. When the executor picks a new task, it now calls search_past_tasks first. The DynamoDB conditional write gotcha is in the bead from GH-142. The next agent that touches DynamoDB sees it before writing a line.

Where Beads Fit: Three Tiers

Beads solve a specific tier of the memory problem. Worth being precise about which one.

Tier 1 — Non-negotiable rules that apply to every task. Never use floats for money. Scope every query by tenant. These live as enforced constraints in Claude Code hooks, not documentation that might be skipped.

Tier 2 — Architectural decisions, ADRs, established patterns. “We use conditional writes for idempotency.” Included at session start as project context.

Tier 3 — The episodic record: what actually happened on past tasks, which approaches failed, what the friction points were. This is where beads live.

The instinct to add a vector database usually targets Tier 3. Beads address the same tier with less infrastructure — the tradeoff is retrieval sophistication vs operational simplicity. For a bounded task domain running against one codebase with recurring task types, keyword retrieval over structured beads is enough.
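For concreteness, a Tier 1 rule can be enforced mechanically rather than documented. A sketch of one such check, “never use floats for money,” that a Claude Code hook or pre-commit step could run over edited files; the rule, the regex, and the function name are all illustrative:

```python
import re

# Illustrative Tier 1 rule: money amounts must never be floats.
# Flags float annotations or float literals assigned to money-like names.
MONEY_FLOAT = re.compile(
    r"\b(price|amount|balance|total)\w*\s*(?::\s*float|=\s*\d+\.\d+)\b"
)


def check_no_float_money(source: str) -> list[str]:
    """Return one violation message per offending line; an enforcing
    hook would block the write when the list is non-empty."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if MONEY_FLOAT.search(line):
            violations.append(
                f"line {lineno}: use integer cents or Decimal, not float"
            )
    return violations
```

The point is the tier boundary, not the regex: a rule that must hold on every task belongs in an automated gate, not in a bead waiting to be retrieved.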

The Real Cost

The tradeoff is manual curation. Beads don’t write themselves. The executor agent has to write one when closing each task. This is real. But it’s also a forcing function. Writing a good bead means articulating what happened — which has value independent of future retrieval. The agent that writes the bead is encoding its own correction for the next agent that encounters the same situation. If you want beads to populate automatically from session transcripts, that’s when a purpose-built memory system starts to justify itself. For the current loop, the executor writes them as part of the close-out sequence, and the quality is consistently high because the context is fresh.
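The close-out step itself is just a comment post. A sketch against the real GitHub REST endpoint for issue comments; format_bead_comment, the repo argument, and the token handling are assumptions for illustration, not production code:

```python
import json
import os
import urllib.request


def format_bead_comment(bead: dict) -> str:
    """Render a bead as the JSON comment the retrieval layer expects."""
    return json.dumps({"bead": bead}, indent=2)


def post_bead(repo: str, issue_number: int, bead: dict) -> None:
    """Post the bead as a comment on the closing issue.

    Uses POST /repos/{owner}/{repo}/issues/{issue_number}/comments;
    auth comes from a GITHUB_TOKEN env var, and error handling is
    omitted for brevity.
    """
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments",
        data=json.dumps({"body": format_bead_comment(bead)}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```

Because the bead rides on an ordinary issue comment, the close-out sequence needs no storage beyond what the queue already provides.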

What Changed

The loop was closing tasks before beads. It’s closing them faster now. Not because the agents are smarter — because they start informed. The DynamoDB gotcha that cost three hours the first time costs twenty minutes the second time because the friction point is surfaced before the first write. Pre-task context instead of post-task regret. That’s the whole thing.

What’s Next

The loop has memory now. But it still doesn’t know what it’s about to break. A task like “add a parameter to this function” looks small. It isn’t — not in a codebase where 47 modules call that function. The agent discovers the scope reactively: write, compile, find 23 errors, fix each one. Six compiler cycles. Forty-five minutes. Episode 3 covers the architecture we’re building to give the loop blast radius awareness before the first keystroke — a proactive impact graph that makes the reactive discovery loop unnecessary.

Episode 3: The Agent That Couldn't See What It Was Breaking

How a code intelligence layer — Tree-sitter, KuzuDB, MCP — gives agents proactive impact awareness before they write.

All content represents personal learning from personal and side projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.