
The Problem: Discovered in Production

During week 3 of testing, the event logs surfaced a warning:
[WARN] Capsule isolation violation detected
Entity: FinancialConfig
Event: FinancialConfigUpdated
Issue: Missing capsule_id in partition key
Impact: DEVUS test data appearing in PRODUS production queries
The scenario that triggered this: A developer testing financial configuration in the DEVUS capsule (test environment) created a config with industry = “Technology”. Later, a query against the PRODUS capsule (production) for all Technology accounts returned that test data.
Why this happened: The FinancialConfig entity was scoped at the tenant level, not the capsule level:
// WRONG: Tenant-scoped only
#[derive(DynamoDbEntity)]
#[pk = "TENANT#{tenant_id}#CONFIG#FINANCIAL"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    // capsule_id missing!
    pub industry: String,
    // ...
}
Query pattern (broken):
// Queries for "Technology" industry returned results from BOTH capsules
let configs = repo.query_by_industry(tenant_id, "Technology").await?;
// Returns: [DEVUS test config, PRODUS production config]
The Impact: Test data contaminating production results is a compliance violation. For financial services, mixing test and production data violates SOC 2 and can fail audits.

The Scope: 6 Entities to Migrate

Analysis revealed the problem was widespread:
| Entity | Scope | Issue | Risk |
|---|---|---|---|
| FinancialConfig | Tenant | Test configs in prod queries | High |
| Contract | Tenant | Test contracts in revenue reports | Critical |
| ContractLineItem | Tenant | Test line items in billing | Critical |
| ContractAmendment | Tenant | Test amendments in audit trail | High |
| RevenueSchedule | Tenant | Test revenue in financial reporting | Critical |
| AccessEntity | Tenant | Test access grants in security queries | Medium |
Total impact:
  • 6 entities
  • 21 repository methods per entity
  • 47 API handlers
  • 1,003 tests
  • Estimated migration effort: 4-6 weeks manually

The Decision: Systematic vs. Incremental

Option A: Incremental Migration (Rejected)
  • Migrate one entity at a time
  • Deploy after each entity
  • Gradual rollout over 6 weeks
Problems:
  • Mixed isolation states during migration
  • Complex conditional queries (is this entity capsule-scoped yet?)
  • 6 deployment cycles, 6 risk windows
Option B: Big Bang Migration (Chosen)
  • Plan all 6 entities comprehensively
  • Implement in parallel
  • Single coordinated deployment
  • One-time breaking change
Rationale: Capsule isolation is a security boundary; it can’t be “partially” enforced.

The Planning Session

Evaluator session (Opus, 6 hours):
Planning session for capsule isolation migration.

Context:
- 6 entities currently tenant-scoped, need capsule scope
- ADR-0010 defines capsule isolation requirements
- Migration is breaking change (PK patterns change)
- Cannot break existing data

Requirements:
1. Migrate entity schemas (add capsule_id field)
2. Update PK/SK patterns (TENANT#...#CAPSULE#...)
3. Update all repositories (add capsule_id parameters)
4. Update all API handlers (pass capsule_id)
5. Migrate existing DynamoDB data
6. Update all tests (1,003 tests)

Constraints:
- Zero downtime
- Backward compatibility during migration
- All 6 entities migrate together (atomic)
- No data loss

Design migration strategy with:
- Entity modification plan
- Data migration plan
- Rollback plan
- Test plan
The Evaluator’s output: a 464-page migration plan for the Contract entities alone.

Decision 1: Dual-Write Migration Strategy

Phase 1: Add capsule_id, maintain old PK pattern
  • Add capsule_id field to entities
  • Still use old PK: TENANT#{tenant_id}#CONTRACT#{id}
  • Dual-write: Write to both old and new patterns
  • Queries use old pattern (no behavior change)
Phase 2: Flip queries to new pattern
  • Start querying new PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONTRACT#{id}
  • Still dual-write to both patterns
  • Monitor for issues
Phase 3: Drop old pattern
  • Stop writing to old pattern
  • Clean up old data
  • Remove dual-write code
Rationale: Gradual cutover prevents big-bang deployment risk.
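A minimal sketch of the Phase 1 write path, assuming a put_item helper that wraps the underlying DynamoDB call (the names here are illustrative, not the actual repository code):
// Phase 1 dual-write sketch: write both key patterns so readers on either
// pattern see the item (put_item is a hypothetical helper)
pub async fn save(&self, contract: &ContractEntity) -> Result<()> {
    let old_pk = format!("TENANT#{}#CONTRACT#{}", contract.tenant_id, contract.id);
    let new_pk = format!(
        "TENANT#{}#CAPSULE#{}#CONTRACT#{}",
        contract.tenant_id, contract.capsule_id, contract.id
    );
    self.put_item(&old_pk, contract).await?; // old pattern: existing queries keep working
    self.put_item(&new_pk, contract).await?; // new pattern: ready for the Phase 2 flip
    Ok(())
}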

Decision 2: GSI Pattern Update

Old pattern:
GSI1PK: CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Problem: Missing tenant_id prefix makes cross-tenant isolation unverifiable.
New pattern:
GSI1PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Impact: All GSI helper methods need tenant_id parameter:
// Old signature
fn gsi1pk_for_account(capsule_id: CapsuleId, account_id: AccountId) -> String;

// New signature
fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String;
This signature change affects 127 call sites across the codebase.
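The helper body itself is a one-line change; a sketch, assuming the ID types implement Display:
// Sketch of the updated helper (assumes TenantId/CapsuleId/AccountId: Display)
fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String {
    format!("TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCOUNT#{account_id}")
}
The hard part is not the helper but threading tenant_id through those 127 call sites.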

Decision 3: Test Data Migration

Challenge: 1,003 tests use hard-coded tenant_id, no capsule_id.
Options:
  1. Update all tests to include capsule_id (manual)
  2. Create default capsule for tests (automated)
  3. Generate migration script for test data
Choice: Option 2, with fallback to Option 1 for critical tests.
Implementation:
  • Test helper: test_capsule() returns default CapsuleId for all tests
  • Critical tests (cross-capsule scenarios): Explicit capsule_id values

The Implementation: Parallel Migration

Instead of migrating entities sequentially, we migrated in parallel using AI:

Entity #1: FinancialConfig (Template Migration)

Builder session:
Migrate FinancialConfig to capsule scope per plan:
.plans/567-financial-config-migration.md

Steps:
1. Add capsule_id field
2. Update PK pattern
3. Update repository methods
4. Update API handlers
5. Update 21 tests
Time: 4 hours
Result:
  • Entity migrated ✅
  • 21 tests passing ✅
  • API handlers updated ✅
Verification session:
Verify FinancialConfig capsule migration.

Check:
- PK pattern includes CAPSULE# segment
- All repository methods require capsule_id
- Tests verify isolation (no cross-capsule leaks)
- API handlers extract capsule from request context
Verifier found 1 issue:
Issue: GSI Pattern Inconsistency
Entity updated to:
PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONFIG#FINANCIAL
But GSI pattern still used old format:
GSI1PK: CAPSULE#{capsule_id}#INDUSTRY#{industry}  // Missing TENANT#
Why the Builder missed it: The plan showed the PK update clearly, but the GSI update was in a separate section.
Impact: GSI queries wouldn’t enforce tenant isolation.
Fix: Updated the GSI pattern to include the TENANT# prefix.
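Applying the convention from Decision 2, the corrected pattern:
GSI1PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#INDUSTRY#{industry}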
This finding influenced all subsequent migrations.

Entities #2-6: Parallel Migration

With the pattern established from FinancialConfig, I launched 5 parallel Builder sessions (one per entity):
Contract entities (Session 1):
  • ContractEntity
  • ContractLineItemEntity
  • ContractAmendmentEntity
  • RevenueScheduleEntryEntity
Access entities (Session 2):
  • AccessEntity
Time per entity: 3-4 hours (faster due to the template from FinancialConfig)
Coordination challenge: All 5 sessions made changes to shared files (error types, API common code).
Solution: Merge order discipline:
  1. FinancialConfig first (establishes pattern)
  2. Contract entities second (largest change)
  3. Access entities last (smallest change)
  4. Resolve merge conflicts in shared files

The Cascade We Prevented

What could have gone wrong without systematic planning:

Scenario: Ad-Hoc Migration

Day 1: Developer migrates FinancialConfig
  • Adds capsule_id field
  • Updates PK
  • Forgets to update GSI
  • Merges
Day 3: Another developer migrates Contract
  • Follows FinancialConfig pattern
  • Copies the broken GSI pattern
  • Merges
Day 5: Security audit finds issue
  • GSI queries bypass tenant isolation
  • All 6 entities have the same bug
  • Must fix all entities again
Total: 6 entities × 2 migrations each = 12 weeks of work

What Actually Happened: Planned Migration

Day 1: Evaluator creates comprehensive plan
  • Identifies GSI pattern issue upfront
  • Documents correct pattern for ALL entities
  • Creates migration checklist
Day 2-3: Migrate all 6 entities following plan
  • First entity (FinancialConfig) validates pattern
  • Remaining 5 entities copy validated pattern
  • All entities migrated correctly first time
Day 4: Verification
  • Run full test suite (1,003 tests)
  • All passing
  • Security audit confirms isolation
Total: 4 days
Speedup: 18x faster than the ad-hoc approach

What AI Excelled At

1. Systematic Call Site Updates

The task: Update 127 call sites to pass capsule_id parameter.
// BEFORE:
let key = gsi1pk_for_account(capsule_id, account_id);

// AFTER:
let key = gsi1pk_for_account(tenant_id, capsule_id, account_id);
AI’s approach:
  1. Find all calls to gsi1pk_for_account (and 6 similar functions)
  2. Update each call to include tenant_id
  3. Verify tenant_id is available in scope (add parameter if needed)
  4. Compile and check
Time: 2 hours
Manual estimate: 8 hours (tedious, error-prone)
Why AI excelled: Systematic pattern application across many files.

2. Test Data Updates

The task: Update 1,003 tests to include capsule_id.
AI’s approach:
// Created test helper:
use std::str::FromStr;

pub fn test_capsule() -> CapsuleId {
    CapsuleId::from_str("PRODUS").unwrap()
}

// Then updated all tests:
// BEFORE:
let config = FinancialConfig::new(tenant_id, "USD");

// AFTER:
let config = FinancialConfig::new(tenant_id, test_capsule(), "USD");
Time: 3 hours for all 1,003 tests
Manual estimate: 2-3 weeks
Why AI excelled: Repetitive pattern across hundreds of files.

What AI Struggled With

1. Understanding Migration Order

The issue: Contract entities have foreign key relationships:
ContractEntity (parent)
  └─ ContractLineItemEntity (child)
      └─ RevenueScheduleEntryEntity (grandchild)
AI’s first attempt: Migrated in alphabetical order (Amend → Contract → LineItem → Revenue).
Problem: Can’t update child entities before their parent.
Human intervention: Defined the migration order explicitly:
  1. ContractEntity (parent)
  2. ContractLineItemEntity (child)
  3. ContractAmendmentEntity + RevenueScheduleEntryEntity (grandchildren)
Lesson: AI doesn’t infer dependency graphs. Explicit ordering required.
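One way to make that ordering explicit for every session is to encode it in the plan as data rather than prose; a hypothetical sketch:
// Hypothetical: dependency-ordered migration list, parents before children
const MIGRATION_ORDER: &[&str] = &[
    "ContractEntity",             // parent
    "ContractLineItemEntity",     // child of Contract
    "ContractAmendmentEntity",    // amends Contract
    "RevenueScheduleEntryEntity", // grandchild via LineItem
];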

2. Data Migration Strategy

The task: Migrate existing DynamoDB data from the old PK pattern to the new one.
AI’s plan:
1. Scan all items with old PK pattern
2. For each item:
   - Read item
   - Update PK to new pattern
   - Write to new key
   - Delete old key
3. Verify migration complete
Problem: This approach has a race condition: between the read and the write, the item could be modified.
Human intervention: Use TransactWriteItems so the new-key put and old-key delete happen atomically:
// Atomic migration: put under the new key and delete the old key in one
// transaction; note the condition expressions live on Put/Delete, not on
// TransactWriteItem itself (error conversion for BuildError elided).
use aws_sdk_dynamodb::types::{Delete, Put, TransactWriteItem};

client.transact_write_items()
    .transact_items(
        TransactWriteItem::builder()
            .put(
                Put::builder()
                    .table_name(table_name)
                    .set_item(Some(new_item)) // item rewritten with the new PK
                    .condition_expression("attribute_not_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .transact_items(
        TransactWriteItem::builder()
            .delete(
                Delete::builder()
                    .table_name(table_name)
                    .set_key(Some(old_key)) // old PK/SK
                    .condition_expression("attribute_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .send()
    .await?;
Lesson: AI doesn’t reason about concurrency edge cases. Needs explicit guidance.

Principles Established

1. Plan Breaking Changes Comprehensively

What we learned: Breaking changes affecting multiple entities need coordinated planning.
Checklist for breaking changes:
  • List ALL affected entities
  • Identify all call sites (use rg/grep; see the example below)
  • Define migration order (dependency graph)
  • Create a rollback plan
  • Plan the data migration strategy
  • Update all affected tests
  • Document the breaking changes
Rule: Don’t start implementation until the plan covers all entities.
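Call-site discovery needs nothing fancier than ripgrep over the helper names; an illustrative invocation (the function name is from Decision 2):
rg -n 'gsi1pk_for_account\(' --type rust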

2. Migrate a Template Entity First

What we learned: Migrating one entity first (FinancialConfig) caught pattern issues before they propagated to the other entities.
Practice:
  1. Choose the simplest entity as the template
  2. Migrate the template entity completely
  3. Verify thoroughly (the Verifier catches pattern issues)
  4. Fix pattern issues in the template
  5. Use the validated template for the remaining entities
Benefit: Pattern bugs fixed once, not 6 times.

3. Coordinate Shared Files Across Parallel Sessions

What we learned: 5 parallel Builder sessions created merge conflicts in shared files.
Strategy:
  1. Identify shared files upfront (error types, API common code)
  2. Define the merge order (simplest → most complex)
  3. First entity establishes patterns in shared files
  4. Subsequent entities adapt to the established patterns
Example: FinancialConfig updated the error types; the remaining 5 entities reused those updates.

4. Update Tests as Part of the Migration

What we learned: We updated the 1,003 tests as part of each entity migration, not afterward.
Why: Tests validate that the migration worked correctly.
Practice:
  • Migrate the entity domain model
  • Update repositories
  • Update tests (verify the new behavior)
  • Update API handlers
  • Integration tests confirm end-to-end
Anti-pattern: Migrate the entity, skip the tests, discover issues later.

5. Document the Why

What we learned: Future developers need to understand why entities are capsule-scoped.
Documentation created:
  • ADR-0010: Capsule Isolation Enforcement
  • Migration plan for each entity (preserved in .plans/)
  • Updated CLAUDE.md with capsule isolation patterns
Format:
  • Context: Why was this change needed?
  • Decision: What pattern did we choose?
  • Consequences: What broke? How did we fix it?
  • Lessons: What would we do differently?

The Migration Process

Step 1: Entity Schema Update

Before:
#[derive(DynamoDbEntity)]
#[pk = "TENANT#{tenant_id}#CONFIG#FINANCIAL"]
#[sk = "METADATA"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    pub base_currency: String,
    // ...
}
After:
#[derive(DynamoDbEntity)]
#[capsule_isolated]  // NEW: Compile-time enforcement
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONFIG#FINANCIAL"]
#[sk = "METADATA"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,  // NEW: Required field
    pub base_currency: String,
    // ...
}
Changes:
  • Added #[capsule_isolated] attribute
  • Added capsule_id field
  • Updated PK pattern to include CAPSULE#{capsule_id}

Step 2: Repository Method Updates

Before:
pub trait FinancialConfigRepository {
    async fn get(&self, tenant_id: TenantId) -> Result<Option<FinancialConfig>>;
    async fn save(&self, config: FinancialConfig) -> Result<()>;
}
After:
pub trait FinancialConfigRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId) -> Result<Option<FinancialConfig>>;
    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, config: FinancialConfig) -> Result<()>;
}
Impact: All implementations (InMemory, DynamoDB, Cached) needed updates.
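As an illustration, the in-memory implementation’s change mostly amounts to keying the store by the (tenant, capsule) pair; a sketch assuming the ID types implement Hash/Eq, FinancialConfig implements Clone, and the trait uses native async-fn-in-trait (not the actual code):
use std::collections::HashMap;
use std::sync::RwLock;

pub struct InMemoryFinancialConfigRepository {
    // Keyed by (tenant, capsule): a lookup can never cross capsules
    configs: RwLock<HashMap<(TenantId, CapsuleId), FinancialConfig>>,
}

impl FinancialConfigRepository for InMemoryFinancialConfigRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId) -> Result<Option<FinancialConfig>> {
        Ok(self.configs.read().unwrap().get(&(tenant_id, capsule_id)).cloned())
    }

    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, config: FinancialConfig) -> Result<()> {
        self.configs.write().unwrap().insert((tenant_id, capsule_id), config);
        Ok(())
    }
}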

Step 3: Call Site Updates

The systematic work: 21 call sites for FinancialConfig alone.
Before:
let config = repo.get(tenant_id).await?;
After:
let capsule_id = extract_capsule_from_context(&request)?;
let config = repo.get(tenant_id, capsule_id).await?;
AI’s strength: Found all 21 call sites, updated them systematically.
Time: 1 hour (vs. 4-6 hours manually)

Step 4: Test Updates

1,003 tests across 6 entities: most were straightforward.
Before:
#[test]
fn test_financial_config_creation() {
    let config = FinancialConfig::new(tenant_id(), "USD");
    assert_eq!(config.base_currency, "USD");
}
After:
#[test]
fn test_financial_config_creation() {
    let config = FinancialConfig::new(tenant_id(), test_capsule(), "USD");
    assert_eq!(config.base_currency, "USD");
    assert_eq!(config.capsule_id, test_capsule());  // NEW: Verify capsule
}
AI added: Assertions to verify capsule_id in all relevant tests.

What Went Right

1. The Template Entity Approach

FinancialConfig as the template:
  • Simplest entity (single config per tenant/capsule)
  • No foreign keys to other entities
  • Fewest call sites (21 vs. 40+ for Contract)
Benefits:
  • Caught GSI pattern issue early
  • Established test update pattern
  • Created reusable helper functions (test_capsule, extract_capsule_from_context)
Time saved: ~8 hours (pattern bugs fixed once, not 6 times)

2. Comprehensive Verification

After each entity migration, Verifier checked:
  1. Schema compliance:
    • PK includes both TENANT# and CAPSULE#
    • GSI patterns consistent across entities
    • capsule_id field present and required
  2. Isolation verification:
    • Tests include cross-capsule negative tests (DEVUS data shouldn’t appear in PRODUS queries)
    • Repository methods enforce capsule parameter
    • API handlers extract capsule from request context
  3. Data migration:
    • Old data accessible during migration
    • New writes use new pattern
    • No data loss
Bugs caught by Verifier:
  • Missing TENANT# in GSI patterns (2 entities)
  • Cross-capsule test missing (3 entities)
  • Incomplete API handler updates (1 entity)
Total bugs prevented: 6 (would have been production issues)

Metrics: The Migration by Numbers

Entities migrated: 6
Files modified: 47
Lines changed:
  • Added: 1,247 lines
  • Removed: 721 lines
  • Net: +526 lines (additional isolation code)
Call sites updated: 127
Tests updated: 1,003
All tests passing ✅

The Mistake I Made

After the Contract migration completed, I was confident in the pattern. I skipped detailed verification for the last 2 entities (AccessEntity and one other).
Impact: Deployed to the staging environment.
What broke: Access control queries returned inconsistent results.
Root cause: The AccessEntity migration updated the PK correctly but forgot to update the query method:
// Entity correctly updated:
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCESS#{id}"]

// But query still used old pattern:
pub async fn query_by_resource(&self, tenant_id: TenantId, resource: Resource) -> Result<Vec<Access>> {
    let pk = format!("TENANT#{}", tenant_id);  // ❌ Missing CAPSULE#
    // ...
}
Why this happened: I rushed the last entities, trusting they’d “just work” like the previous ones.
Fix time: 2 hours to find and fix in staging (before the production deploy).
Lesson: Even with templates, verify every entity. Don’t skip verification because “it should be the same.”
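For completeness, the shape of the fix (query_access_items stands in for the real query plumbing, which isn’t shown here):
pub async fn query_by_resource(
    &self,
    tenant_id: TenantId,
    capsule_id: CapsuleId,
    resource: Resource,
) -> Result<Vec<Access>> {
    // Include the CAPSULE# segment so results stay capsule-scoped
    let pk = format!("TENANT#{tenant_id}#CAPSULE#{capsule_id}");
    self.query_access_items(&pk, &resource).await
}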

Code Example: The Migration Pattern

Here’s the final pattern we established for capsule isolation migration:
// Step 1: Update entity definition
#[derive(DynamoDbEntity, Debug, Clone)]
#[capsule_isolated]  // Enforces capsule_id field + PK pattern
#[table_name = "platform_data"]
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#ENTITY#{entity_type}#{id}"]
#[sk = "METADATA"]
#[gsi1 = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#GSI1#{field}"]
pub struct MyEntity {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,  // Required by #[capsule_isolated]
    pub id: EntityId,
    // ...
}

// Step 2: Update repository trait
pub trait MyEntityRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId, id: EntityId)
        -> Result<Option<MyEntity>>;
    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, entity: MyEntity)
        -> Result<()>;
}

// Step 3: Update API handler
pub async fn get_entity(
    Extension(context): Extension<RequestContext>,
    Path((tenant_id, entity_id)): Path<(TenantId, EntityId)>,
) -> Result<Json<EntityResponse>> {
    let capsule_id = context.capsule_id()?;  // Extract from context
    let entity = repo.get(tenant_id, capsule_id, entity_id).await?;
    Ok(Json(entity.into()))
}

// Step 4: Add negative test
#[tokio::test]
async fn test_cross_capsule_isolation() -> Result<()> {
    let repo = DynamoDbMyEntityRepository::new(/* ... */);

    // Create in PRODUS capsule
    let entity = MyEntity::new(tenant_id(), capsule_id("PRODUS"), /* ... */);
    let entity_id = entity.id.clone(); // keep the id; `entity` is moved into save
    repo.save(tenant_id(), capsule_id("PRODUS"), entity).await?;

    // Try to fetch from DEVUS capsule
    let result = repo.get(tenant_id(), capsule_id("DEVUS"), entity_id).await?;

    // Should NOT find it (different capsule)
    assert!(result.is_none());
    Ok(())
}

Takeaways

For large-scale breaking changes:
  1. Plan comprehensively - List ALL affected entities upfront
  2. Migrate template first - Validate pattern with simplest entity
  3. Parallel implementation - Speed up with multiple AI sessions
  4. Verify every entity - Don’t skip verification, even for “obvious” migrations
  5. Document patterns - Future migrations reuse validated patterns
For AI collaboration:
  1. AI excels: Systematic call site updates, test data migrations
  2. AI struggles: Dependency ordering, concurrency reasoning, migration strategies
  3. Human needed: Define migration order, review data migration plan, coordinate parallel work
The meta-insight: Breaking changes are where planning pays off most. The 8 hours spent planning saved 100+ hours of fixing ad-hoc migration issues. Use AI for execution, human judgment for strategy.