
The Problem: Discovered in Production

During week 3 of testing, the event logs surfaced a warning:
[WARN] Capsule isolation violation detected
Entity: FinancialConfig
Event: FinancialConfigUpdated
Issue: Missing capsule_id in partition key
Impact: DEVUS test data appearing in PRODUS production queries
The scenario that triggered this: A developer testing financial configuration in the DEVUS capsule (test environment) created a config with industry = “Technology”. Later, a query against the PRODUS capsule (production) for all Technology accounts returned that test data.
Why this happened: The FinancialConfig entity was scoped at the tenant level, not the capsule level:
// WRONG: Tenant-scoped only
#[derive(DynamoDbEntity)]
#[pk = "TENANT#{tenant_id}#CONFIG#FINANCIAL"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    // capsule_id missing!
    pub industry: String,
    // ...
}
Query pattern (broken):
// Queries for "Technology" industry returned results from BOTH capsules
let configs = repo.query_by_industry(tenant_id, "Technology").await?;
// Returns: [DEVUS test config, PRODUS production config]
The Impact: Test data contaminating production results is a compliance violation. For financial services, mixing test and production data violates SOC 2 and can fail audits.

The Scope: 6 Entities to Migrate

Analysis revealed the problem was widespread:
| Entity | Scope | Issue | Risk |
|---|---|---|---|
| FinancialConfig | Tenant | Test configs in prod queries | High |
| Contract | Tenant | Test contracts in revenue reports | Critical |
| ContractLineItem | Tenant | Test line items in billing | Critical |
| ContractAmendment | Tenant | Test amendments in audit trail | High |
| RevenueSchedule | Tenant | Test revenue in financial reporting | Critical |
| AccessEntity | Tenant | Test access grants in security queries | Medium |
Total impact:
  • 6 entities
  • 21 repository methods per entity
  • 47 API handlers
  • 1,003 tests
  • Estimated migration effort: 4-6 weeks manually

The Decision: Systematic vs. Incremental

Option A: Incremental Migration (Rejected)
  • Migrate one entity at a time
  • Deploy after each entity
  • Gradual rollout over 6 weeks
Problems:
  • Mixed isolation states during migration
  • Complex conditional queries (is this entity capsule-scoped yet?)
  • 6 deployment cycles, 6 risk windows
Option B: Big Bang Migration (Chosen)
  • Plan all 6 entities comprehensively
  • Implement in parallel
  • Single coordinated deployment
  • One-time breaking change
Rationale: Capsule isolation is a security boundary; it can’t be “partially” enforced.

The Planning Session

Evaluator session (Opus, 6 hours):
Planning session for capsule isolation migration.

Context:
- 6 entities currently tenant-scoped, need capsule scope
- ADR-0010 defines capsule isolation requirements
- Migration is breaking change (PK patterns change)
- Cannot break existing data

Requirements:
1. Migrate entity schemas (add capsule_id field)
2. Update PK/SK patterns (TENANT#...#CAPSULE#...)
3. Update all repositories (add capsule_id parameters)
4. Update all API handlers (pass capsule_id)
5. Migrate existing DynamoDB data
6. Update all tests (1,003 tests)

Constraints:
- Zero downtime
- Backward compatibility during migration
- All 6 entities migrate together (atomic)
- No data loss

Design migration strategy with:
- Entity modification plan
- Data migration plan
- Rollback plan
- Test plan
The Evaluator’s output: a 464-page migration plan for the Contract entities alone.

Decision 1: Dual-Write Migration Strategy

Phase 1: Add capsule_id, maintain old PK pattern
  • Add capsule_id field to entities
  • Still use old PK: TENANT#{tenant_id}#CONTRACT#{id}
  • Dual-write: Write to both old and new patterns
  • Queries use old pattern (no behavior change)
Phase 2: Flip queries to new pattern
  • Start querying new PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONTRACT#{id}
  • Still dual-write to both patterns
  • Monitor for issues
Phase 3: Drop old pattern
  • Stop writing to old pattern
  • Clean up old data
  • Remove dual-write code
Rationale: Gradual cutover prevents big-bang deployment risk.
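A minimal sketch of the Phase 1 write path, assuming a put_item helper that wraps the underlying DynamoDB call (the names here are illustrative, not the actual repository code):
// Phase 1 dual-write sketch: write both key patterns so readers on either
// pattern see the item (put_item is a hypothetical helper)
pub async fn save(&self, contract: &ContractEntity) -> Result<()> {
    let old_pk = format!("TENANT#{}#CONTRACT#{}", contract.tenant_id, contract.id);
    let new_pk = format!(
        "TENANT#{}#CAPSULE#{}#CONTRACT#{}",
        contract.tenant_id, contract.capsule_id, contract.id
    );
    self.put_item(&old_pk, contract).await?; // old pattern: existing queries keep working
    self.put_item(&new_pk, contract).await?; // new pattern: ready for the Phase 2 flip
    Ok(())
}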

Decision 2: GSI Pattern Update

Old pattern:
GSI1PK: CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Problem: Missing tenant_id prefix makes cross-tenant isolation unverifiable.
New pattern:
GSI1PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Impact: All GSI helper methods need tenant_id parameter:
// Old signature
fn gsi1pk_for_account(capsule_id: CapsuleId, account_id: AccountId) -> String;

// New signature
fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String;
This signature change affects 127 call sites across the codebase.
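The helper body itself is a one-line change; a sketch, assuming the ID types implement Display:
// Sketch of the updated helper (assumes TenantId/CapsuleId/AccountId: Display)
fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String {
    format!("TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCOUNT#{account_id}")
}
The hard part is not the helper but threading tenant_id through those 127 call sites.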

Decision 3: Test Data Migration

Challenge: 1,003 tests use hard-coded tenant_id, no capsule_id.
Options:
  1. Update all tests to include capsule_id (manual)
  2. Create default capsule for tests (automated)
  3. Generate migration script for test data
Choice: Option 2, with fallback to Option 1 for critical tests.
Implementation:
  • Test helper: test_capsule() returns default CapsuleId for all tests
  • Critical tests (cross-capsule scenarios): Explicit capsule_id values

The Implementation: Parallel Migration

Instead of migrating entities sequentially, we migrated in parallel using AI:

Entity #1: FinancialConfig (Template Migration)

Builder session:
Migrate FinancialConfig to capsule scope per plan:
.plans/567-financial-config-migration.md

Steps:
1. Add capsule_id field
2. Update PK pattern
3. Update repository methods
4. Update API handlers
5. Update 21 tests
Time: 4 hours
Result:
  • Entity migrated ✅
  • 21 tests passing ✅
  • API handlers updated ✅
Verification session:
Verify FinancialConfig capsule migration.

Check:
- PK pattern includes CAPSULE# segment
- All repository methods require capsule_id
- Tests verify isolation (no cross-capsule leaks)
- API handlers extract capsule from request context
Verifier found 1 issue:
Issue: GSI Pattern Inconsistency
Entity updated to:
PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONFIG#FINANCIAL
But GSI pattern still used old format:
GSI1PK: CAPSULE#{capsule_id}#INDUSTRY#{industry}  // Missing TENANT#
Why the Builder missed it: The plan showed the PK update clearly, but the GSI update was in a separate section.
Impact: GSI queries wouldn’t enforce tenant isolation.
Fix: Updated the GSI pattern to include the TENANT# prefix.
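Applying the convention from Decision 2, the corrected pattern:
GSI1PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#INDUSTRY#{industry}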
This finding influenced all subsequent migrations.

Entities #2-6: Parallel Migration

With the pattern established from FinancialConfig, I launched 5 parallel Builder sessions (one per entity):
Contract entities (Session 1):
  • ContractEntity
  • ContractLineItemEntity
  • ContractAmendmentEntity
  • RevenueScheduleEntryEntity
Access entities (Session 2):
  • AccessEntity
Time per entity: 3-4 hours (faster due to the template from FinancialConfig)
Coordination challenge: All 5 sessions made changes to shared files (error types, API common code).
Solution: Merge order discipline:
  1. FinancialConfig first (establishes pattern)
  2. Contract entities second (largest change)
  3. Access entities last (smallest change)
  4. Resolve merge conflicts in shared files

The Cascade We Prevented

What could have gone wrong without systematic planning:

Scenario: Ad-Hoc Migration

Day 1: Developer migrates FinancialConfig
  • Adds capsule_id field
  • Updates PK
  • Forgets to update GSI
  • Merges
Day 3: Another developer migrates Contract
  • Follows FinancialConfig pattern
  • Copies the broken GSI pattern
  • Merges
Day 5: Security audit finds issue
  • GSI queries bypass tenant isolation
  • All 6 entities have the same bug
  • Must fix all entities again
Total: 6 entities × 2 migrations each = 12 weeks of work

What Actually Happened: Planned Migration

Day 1: Evaluator creates comprehensive plan
  • Identifies GSI pattern issue upfront
  • Documents correct pattern for ALL entities
  • Creates migration checklist
Day 2-3: Migrate all 6 entities following plan
  • First entity (FinancialConfig) validates pattern
  • Remaining 5 entities copy validated pattern
  • All entities migrated correctly first time
Day 4: Verification
  • Run full test suite (1,003 tests)
  • All passing
  • Security audit confirms isolation
Total: 4 days
Speedup: 18x faster than the ad-hoc approach

What AI Excelled At

1. Systematic Call Site Updates

The task: Update 127 call sites to pass capsule_id parameter.
// BEFORE:
let key = gsi1pk_for_account(capsule_id, account_id);

// AFTER:
let key = gsi1pk_for_account(tenant_id, capsule_id, account_id);
AI’s approach:
  1. Find all calls to gsi1pk_for_account (and 6 similar functions)
  2. Update each call to include tenant_id
  3. Verify tenant_id is available in scope (add parameter if needed)
  4. Compile and check
Time: 2 hours
Manual estimate: 8 hours (tedious, error-prone)
Why AI excelled: Systematic pattern application across many files.

2. Test Data Updates

The task: Update 1,003 tests to include capsule_id.
AI’s approach:
// Created test helper:
use std::str::FromStr;

pub fn test_capsule() -> CapsuleId {
    CapsuleId::from_str("PRODUS").unwrap()
}

// Then updated all tests:
// BEFORE:
let config = FinancialConfig::new(tenant_id, "USD");

// AFTER:
let config = FinancialConfig::new(tenant_id, test_capsule(), "USD");
Time: 3 hours for all 1,003 tests
Manual estimate: 2-3 weeks
Why AI excelled: Repetitive pattern across hundreds of files.

What AI Struggled With

1. Understanding Migration Order

The issue: Contract entities have foreign key relationships:
ContractEntity (parent)
  └─ ContractLineItemEntity (child)
      └─ RevenueScheduleEntryEntity (grandchild)
AI’s first attempt: Migrated in alphabetical order (Amend → Contract → LineItem → Revenue).
Problem: Can’t update child entities before their parent.
Human intervention: Defined the migration order explicitly:
  1. ContractEntity (parent)
  2. ContractLineItemEntity (child)
  3. ContractAmendmentEntity + RevenueScheduleEntryEntity (grandchildren)
Lesson: AI doesn’t infer dependency graphs. Explicit ordering required.
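One way to make that ordering explicit for every session is to encode it in the plan as data rather than prose; a hypothetical sketch:
// Hypothetical: dependency-ordered migration list, parents before children
const MIGRATION_ORDER: &[&str] = &[
    "ContractEntity",             // parent
    "ContractLineItemEntity",     // child of Contract
    "ContractAmendmentEntity",    // amends Contract
    "RevenueScheduleEntryEntity", // grandchild via LineItem
];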

2. Data Migration Strategy

The task: Migrate existing DynamoDB data from the old PK pattern to the new one.
AI’s plan:
1. Scan all items with old PK pattern
2. For each item:
   - Read item
   - Update PK to new pattern
   - Write to new key
   - Delete old key
3. Verify migration complete
Problem: This approach has a race condition: between the read and the write, the item could be modified.
Human intervention: Use TransactWriteItems so the new-key put and old-key delete happen atomically:
// Atomic migration: put under the new key and delete the old key in one
// transaction; note the condition expressions live on Put/Delete, not on
// TransactWriteItem itself (error conversion for BuildError elided).
use aws_sdk_dynamodb::types::{Delete, Put, TransactWriteItem};

client.transact_write_items()
    .transact_items(
        TransactWriteItem::builder()
            .put(
                Put::builder()
                    .table_name(table_name)
                    .set_item(Some(new_item)) // item rewritten with the new PK
                    .condition_expression("attribute_not_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .transact_items(
        TransactWriteItem::builder()
            .delete(
                Delete::builder()
                    .table_name(table_name)
                    .set_key(Some(old_key)) // old PK/SK
                    .condition_expression("attribute_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .send()
    .await?;
Lesson: AI doesn’t reason about concurrency edge cases. Needs explicit guidance.

Principles Established

1. Plan Breaking Changes Comprehensively

What we learned: Breaking changes affecting multiple entities need coordinated planning.
Checklist for breaking changes:
  • List ALL affected entities
  • Identify all call sites (use rg/grep; see the example below)
  • Define migration order (dependency graph)
  • Create a rollback plan
  • Plan the data migration strategy
  • Update all affected tests
  • Document the breaking changes
Rule: Don’t start implementation until the plan covers all entities.
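Call-site discovery needs nothing fancier than ripgrep over the helper names; an illustrative invocation (the function name is from Decision 2):
rg -n 'gsi1pk_for_account\(' --type rust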

2. Migrate a Template Entity First

What we learned: Migrating one entity first (FinancialConfig) caught pattern issues before they propagated to the other entities.
Practice:
  1. Choose the simplest entity as the template
  2. Migrate the template entity completely
  3. Verify thoroughly (the Verifier catches pattern issues)
  4. Fix pattern issues in the template
  5. Use the validated template for the remaining entities
Benefit: Pattern bugs fixed once, not 6 times.

3. Coordinate Shared Files Across Parallel Sessions

What we learned: 5 parallel Builder sessions created merge conflicts in shared files.
Strategy:
  1. Identify shared files upfront (error types, API common code)
  2. Define the merge order (simplest → most complex)
  3. First entity establishes patterns in shared files
  4. Subsequent entities adapt to the established patterns
Example: FinancialConfig updated the error types; the remaining 5 entities reused those updates.

4. Update Tests as Part of the Migration

What we learned: We updated the 1,003 tests as part of each entity migration, not afterward.
Why: Tests validate that the migration worked correctly.
Practice:
  • Migrate the entity domain model
  • Update repositories
  • Update tests (verify the new behavior)
  • Update API handlers
  • Integration tests confirm end-to-end
Anti-pattern: Migrate the entity, skip the tests, discover issues later.

5. Document the Why

What we learned: Future developers need to understand why entities are capsule-scoped.
Documentation created:
  • ADR-0010: Capsule Isolation Enforcement
  • Migration plan for each entity (preserved in .plans/)
  • Updated CLAUDE.md with capsule isolation patterns
Format:
  • Context: Why was this change needed?
  • Decision: What pattern did we choose?
  • Consequences: What broke? How did we fix it?
  • Lessons: What would we do differently?

The Migration Process

Step 1: Entity Schema Update

Before:
#[derive(DynamoDbEntity)]
#[pk = "TENANT#{tenant_id}#CONFIG#FINANCIAL"]
#[sk = "METADATA"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    pub base_currency: String,
    // ...
}
After:
#[derive(DynamoDbEntity)]
#[capsule_isolated]  // NEW: Compile-time enforcement
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONFIG#FINANCIAL"]
#[sk = "METADATA"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,  // NEW: Required field
    pub base_currency: String,
    // ...
}
Changes:
  • Added #[capsule_isolated] attribute
  • Added capsule_id field
  • Updated PK pattern to include CAPSULE#{capsule_id}

Step 2: Repository Method Updates

Before:
pub trait FinancialConfigRepository {
    async fn get(&self, tenant_id: TenantId) -> Result<Option<FinancialConfig>>;
    async fn save(&self, config: FinancialConfig) -> Result<()>;
}
After:
pub trait FinancialConfigRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId) -> Result<Option<FinancialConfig>>;
    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, config: FinancialConfig) -> Result<()>;
}
Impact: All implementations (InMemory, DynamoDB, Cached) needed updates.
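As an illustration, the in-memory implementation’s change mostly amounts to keying the store by the (tenant, capsule) pair; a sketch assuming the ID types implement Hash/Eq, FinancialConfig implements Clone, and the trait uses native async-fn-in-trait (not the actual code):
use std::collections::HashMap;
use std::sync::RwLock;

pub struct InMemoryFinancialConfigRepository {
    // Keyed by (tenant, capsule): a lookup can never cross capsules
    configs: RwLock<HashMap<(TenantId, CapsuleId), FinancialConfig>>,
}

impl FinancialConfigRepository for InMemoryFinancialConfigRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId) -> Result<Option<FinancialConfig>> {
        Ok(self.configs.read().unwrap().get(&(tenant_id, capsule_id)).cloned())
    }

    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, config: FinancialConfig) -> Result<()> {
        self.configs.write().unwrap().insert((tenant_id, capsule_id), config);
        Ok(())
    }
}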

Step 3: Call Site Updates

The systematic work: 21 call sites for FinancialConfig alone.
Before:
let config = repo.get(tenant_id).await?;
After:
let capsule_id = extract_capsule_from_context(&request)?;
let config = repo.get(tenant_id, capsule_id).await?;
AI’s strength: Found all 21 call sites, updated them systematically.
Time: 1 hour (vs. 4-6 hours manually)

Step 4: Test Updates

1,003 tests across 6 entities: most were straightforward.
Before:
#[test]
fn test_financial_config_creation() {
    let config = FinancialConfig::new(tenant_id(), "USD");
    assert_eq!(config.base_currency, "USD");
}
After:
#[test]
fn test_financial_config_creation() {
    let config = FinancialConfig::new(tenant_id(), test_capsule(), "USD");
    assert_eq!(config.base_currency, "USD");
    assert_eq!(config.capsule_id, test_capsule());  // NEW: Verify capsule
}
AI added: Assertions to verify capsule_id in all relevant tests.

What Went Right

1. The Template Entity Approach

FinancialConfig as the template:
  • Simplest entity (single config per tenant/capsule)
  • No foreign keys to other entities
  • Fewest call sites (21 vs. 40+ for Contract)
Benefits:
  • Caught GSI pattern issue early
  • Established test update pattern
  • Created reusable helper functions (test_capsule, extract_capsule_from_context)
Time saved: ~8 hours (pattern bugs fixed once, not 6 times)

2. Comprehensive Verification

After each entity migration, Verifier checked:
  1. Schema compliance:
    • PK includes both TENANT# and CAPSULE#
    • GSI patterns consistent across entities
    • capsule_id field present and required
  2. Isolation verification:
    • Tests include cross-capsule negative tests (DEVUS data shouldn’t appear in PRODUS queries)
    • Repository methods enforce capsule parameter
    • API handlers extract capsule from request context
  3. Data migration:
    • Old data accessible during migration
    • New writes use new pattern
    • No data loss
Bugs caught by Verifier:
  • Missing TENANT# in GSI patterns (2 entities)
  • Cross-capsule test missing (3 entities)
  • Incomplete API handler updates (1 entity)
Total bugs prevented: 6 (would have been production issues)

Metrics: The Migration by Numbers

Entities migrated: 6
Files modified: 47
Lines changed:
  • Added: 1,247 lines
  • Removed: 721 lines
  • Net: +526 lines (additional isolation code)
Call sites updated: 127
Tests updated: 1,003
All tests passing ✅

The Mistake I Made

After the Contract migration completed, I was confident in the pattern. I skipped detailed verification for the last 2 entities (AccessEntity and one other).
Impact: Deployed to the staging environment.
What broke: Access control queries returned inconsistent results.
Root cause: The AccessEntity migration updated the PK correctly but forgot to update the query method:
// Entity correctly updated:
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCESS#{id}"]

// But query still used old pattern:
pub async fn query_by_resource(&self, tenant_id: TenantId, resource: Resource) -> Result<Vec<Access>> {
    let pk = format!("TENANT#{}", tenant_id);  // ❌ Missing CAPSULE#
    // ...
}
Why this happened: I rushed the last entities, trusting they’d “just work” like the previous ones.
Fix time: 2 hours to find and fix in staging (before the production deploy).
Lesson: Even with templates, verify every entity. Don’t skip verification because “it should be the same.”
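For completeness, the shape of the fix (query_access_items stands in for the real query plumbing, which isn’t shown here):
pub async fn query_by_resource(
    &self,
    tenant_id: TenantId,
    capsule_id: CapsuleId,
    resource: Resource,
) -> Result<Vec<Access>> {
    // Include the CAPSULE# segment so results stay capsule-scoped
    let pk = format!("TENANT#{tenant_id}#CAPSULE#{capsule_id}");
    self.query_access_items(&pk, &resource).await
}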

Code Example: The Migration Pattern

Here’s the final pattern we established for capsule isolation migration:
// Step 1: Update entity definition
#[derive(DynamoDbEntity, Debug, Clone)]
#[capsule_isolated]  // Enforces capsule_id field + PK pattern
#[table_name = "platform_data"]
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#ENTITY#{entity_type}#{id}"]
#[sk = "METADATA"]
#[gsi1 = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#GSI1#{field}"]
pub struct MyEntity {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,  // Required by #[capsule_isolated]
    pub id: EntityId,
    // ...
}

// Step 2: Update repository trait
pub trait MyEntityRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId, id: EntityId)
        -> Result<Option<MyEntity>>;
    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, entity: MyEntity)
        -> Result<()>;
}

// Step 3: Update API handler
pub async fn get_entity(
    Extension(context): Extension<RequestContext>,
    Path((tenant_id, entity_id)): Path<(TenantId, EntityId)>,
) -> Result<Json<EntityResponse>> {
    let capsule_id = context.capsule_id()?;  // Extract from context
    let entity = repo.get(tenant_id, capsule_id, entity_id).await?;
    Ok(Json(entity.into()))
}

// Step 4: Add negative test
#[tokio::test]
async fn test_cross_capsule_isolation() -> Result<()> {
    let repo = DynamoDbMyEntityRepository::new(/* ... */);

    // Create in PRODUS capsule
    let entity = MyEntity::new(tenant_id(), capsule_id("PRODUS"), /* ... */);
    let entity_id = entity.id.clone(); // keep the id; `entity` is moved into save
    repo.save(tenant_id(), capsule_id("PRODUS"), entity).await?;

    // Try to fetch from DEVUS capsule
    let result = repo.get(tenant_id(), capsule_id("DEVUS"), entity_id).await?;

    // Should NOT find it (different capsule)
    assert!(result.is_none());
    Ok(())
}

Takeaways

For large-scale breaking changes:
  1. Plan comprehensively - List ALL affected entities upfront
  2. Migrate template first - Validate pattern with simplest entity
  3. Parallel implementation - Speed up with multiple AI sessions
  4. Verify every entity - Don’t skip verification, even for “obvious” migrations
  5. Document patterns - Future migrations reuse validated patterns
For AI collaboration:
  1. AI excels: Systematic call site updates, test data migrations
  2. AI struggles: Dependency ordering, concurrency reasoning, migration strategies
  3. Human needed: Define migration order, review data migration plan, coordinate parallel work
The meta-insight: Breaking changes are where planning pays off most. The 8 hours spent planning saved 100+ hours of fixing ad-hoc migration issues. Use AI for execution, human judgment for strategy.