Technical Debt Triage: What to Fix, What to Live With

The code was incomplete. The list_by_tenant() method just said unimplemented!() - it would panic in production. But we shipped it anyway. Six months later, it’s still there. And it was the right decision. Here’s why.

The Problem with Technical Debt Advice

Most technical debt advice falls into two camps: “Never ship with known issues” or “Move fast and break things.” Neither works in practice. The real question isn’t whether to take on debt - it’s which debt to pay down, which to live with, and how to tell the difference.
Over six months building a multi-tenant SaaS platform, we accumulated technical debt. Some we fixed immediately. Some we deferred indefinitely. And some we wish we’d fixed sooner. This article breaks down those decisions with real ROI calculations and decision frameworks you can use.

Debt Paid Down: Four Success Stories

Story 1: AWS Runtime Migration

The Debt: Each crate managed AWS clients independently. Six different implementations, all slightly different. Some cached connections, some didn’t. Some handled retries, some failed fast.
Cost to Fix: 16 hours
Benefits:
  • Eliminated 200+ lines of duplicated code
  • Prevented 3 potential isolation bugs (wrong client configuration)
  • Reduced onboarding confusion (one pattern instead of six)
ROI Calculation:
  • Time saved on future changes: ~2 hours/month (consistent patterns)
  • Bug prevention value: ~4 hours/incident × 3 incidents = 12 hours
  • Payback period: 8 months
  • Verdict: Positive ROI
Why It Worked: High-frequency code (used in every service), clear architectural benefit, reasonable fix cost.
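The 8-month payback follows from dividing the fix cost by the recurring saving. Here is a minimal sketch of that arithmetic; the function name `payback_months` is hypothetical and not taken from the project.

```rust
/// Months until a fix pays for itself, given a one-time cost and a
/// recurring monthly saving (both in hours). Returns None when there is
/// no recurring saving, because the fix then never pays back on
/// time savings alone.
fn payback_months(fix_cost_hours: f64, saved_hours_per_month: f64) -> Option<f64> {
    if saved_hours_per_month <= 0.0 {
        return None;
    }
    Some(fix_cost_hours / saved_hours_per_month)
}
```

With the figures above, `payback_months(16.0, 2.0)` gives 8 months. The 12 hours of bug-prevention value are left out of that division, so the estimate is on the conservative side.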

Story 2: LegalEntityId Consolidation

The Debt: Two definitions of LegalEntityId existed - one using UUID v4, one using UUID v7. Different files, different semantics, same name.
Cost to Fix: 30 minutes
Benefits:
  • Single source of truth
  • Prevented future confusion about which to use
  • Made code review conversations simpler
ROI Calculation:
  • Prevention value: ~2 hours/confusion incident
  • Expected frequency: 1-2 times per quarter
  • Verdict: Immediate positive ROI
Why It Worked: Trivial fix cost, clear architectural win, prevents compound confusion.
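A single definition can also enforce which UUID version it accepts, so the two semantics cannot drift apart again. The sketch below is an illustration only: it assumes v7 is the version that was kept (the article does not say which definition survived), and the methods `from_bytes` and `as_bytes` are hypothetical.

```rust
/// The one canonical ID type. Assumption: UUID v7 is the retained
/// version; swap the constant if the project kept v4 instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct LegalEntityId([u8; 16]);

impl LegalEntityId {
    const EXPECTED_VERSION: u8 = 7;

    /// Accepts raw UUID bytes only if the version nibble (the high four
    /// bits of byte 6 in the standard UUID layout) matches.
    pub fn from_bytes(bytes: [u8; 16]) -> Result<Self, String> {
        let version = bytes[6] >> 4;
        if version == Self::EXPECTED_VERSION {
            Ok(Self(bytes))
        } else {
            Err(format!(
                "expected UUID v{}, got version {}",
                Self::EXPECTED_VERSION,
                version
            ))
        }
    }

    /// Read-only access to the underlying bytes.
    pub fn as_bytes(&self) -> &[u8; 16] {
        &self.0
    }
}
```

Rejecting the other version at construction time turns "which one do I use?" into a compile-time or startup-time question instead of a code review conversation.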

Story 3: Opportunity Pipeline Refactor

The Debt: Pipeline stages were hardcoded as an enum. No tenant could customize their sales process. Enterprise customers need custom pipelines.
Cost to Fix: 12 hours (breaking change, database migration)
Benefits:
  • Enterprise-ready feature unlocked
  • Enables customer customization
  • Direct revenue impact
ROI Calculation:
  • Customer value: Enables enterprise sales
  • Alternative cost: Build workaround layer (20+ hours)
  • Verdict: Customer value justifies cost
Why It Worked: Directly tied to revenue, blocking feature for enterprise tier, fix cheaper than workaround.
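To make the before and after concrete, here is a rough sketch of the shape of such a refactor. Everything in it is assumed for illustration: the stage names, `StageDef`, `PipelineRegistry`, and the in-memory map standing in for the migrated database table.

```rust
use std::collections::HashMap;

/// Before: stages fixed at compile time, identical for every tenant.
/// (Stage names here are invented for the example.)
#[allow(dead_code)]
enum HardcodedStage {
    Lead,
    Qualified,
    Proposal,
    Won,
    Lost,
}

/// After: a stage is data, so each tenant can define its own sequence.
#[derive(Debug, Clone, PartialEq)]
struct StageDef {
    name: String,
    position: u32,
}

#[derive(Default)]
struct PipelineRegistry {
    by_tenant: HashMap<String, Vec<StageDef>>,
}

impl PipelineRegistry {
    /// Replaces a tenant's pipeline with the given ordered stage names.
    fn set_stages(&mut self, tenant_id: &str, names: &[&str]) {
        let stages = names
            .iter()
            .enumerate()
            .map(|(i, name)| StageDef { name: name.to_string(), position: i as u32 })
            .collect();
        self.by_tenant.insert(tenant_id.to_string(), stages);
    }

    /// Returns the tenant's stages in order, or an empty slice if none are set.
    fn stages_for(&self, tenant_id: &str) -> &[StageDef] {
        self.by_tenant.get(tenant_id).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

The breaking change is visible in the types: any code that matched on the enum has to be rewritten to look stages up by tenant, which is why a migration was part of the cost.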

Story 4: TODO Comment Cleanup

The Debt: Seven TODO comments suggested using the wrong event pattern for domain events. Copy-paste had spread bad architectural guidance.
Cost to Fix: 1 hour (find and update comments)
Benefits:
  • Prevents future bugs from wrong architecture
  • Clarifies intended patterns
  • Saves future code review time
ROI Calculation:
  • Prevention value: 4 hours per misdirected implementation
  • Expected frequency: 2-3 times per year
  • Verdict: Immediate positive ROI
Why It Worked: Comments are high-leverage documentation. Wrong guidance compounds over time.

Debt Lived With: Three Success Stories

Story 1: Cross-Capsule Queries - The $20K Decision

The Debt: The list_by_tenant() method remains unimplemented!(). It panics if called. It’s been six months. It’s still there.
Use Case: Operator tooling needs to list all items across capsules (pods) for a tenant. Happens fewer than 10 times per month.
Current Workaround: Manually iterate through capsules (5 minutes/month).
Implementation Cost:
  • Global query infrastructure: 60 hours
  • Cross-capsule coordination: 30 hours
  • Testing and edge cases: 10 hours
  • Total: 100 hours at $200/hr = $20,000
ROI Calculation:
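  • Time the feature would save: 5 minutes/month = 1 hour/year (about $200 at $200/hr)
  • Cost to maintain it if built: 2 hours/year
  • Net savings from not building: $20,000 − $200 = $19,800 in the first year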
Decision Framework (ADR-0015):
✓ Use cases are rare (<10/month)
✓ Workarounds exist and are acceptable
✓ Architectural integrity matters (avoid premature optimization)
✓ Implementation cost is high (100 hours)
→ Decision: Keep unimplemented
Why It Worked: Low frequency + working workaround + high implementation cost = don’t build it.
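In code, the trade-off looks roughly like this. The sketch is hypothetical apart from `list_by_tenant()` and its `unimplemented!()` body: the `Store` type, `list_by_capsule`, `list_across_capsules`, and the in-memory map are stand-ins for whatever the real storage layer looks like.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Item {
    tenant_id: String,
    name: String,
}

/// Hypothetical in-memory stand-in for per-capsule storage.
struct Store {
    capsules: HashMap<String, Vec<Item>>,
}

impl Store {
    /// Supported path: a query scoped to a single capsule.
    fn list_by_capsule(&self, capsule_id: &str, tenant_id: &str) -> Vec<Item> {
        self.capsules
            .get(capsule_id)
            .into_iter()
            .flatten()
            .filter(|item| item.tenant_id == tenant_id)
            .cloned()
            .collect()
    }

    /// Deliberately not built: panics if anything calls it.
    #[allow(dead_code)]
    fn list_by_tenant(&self, _tenant_id: &str) -> Vec<Item> {
        unimplemented!("cross-capsule queries are deferred; iterate capsules instead")
    }
}

/// The operator workaround: walk the known capsules one at a time.
fn list_across_capsules(store: &Store, capsule_ids: &[&str], tenant_id: &str) -> Vec<Item> {
    capsule_ids
        .iter()
        .flat_map(|capsule_id| store.list_by_capsule(capsule_id, tenant_id))
        .collect()
}
```

The workaround stays outside the storage API on purpose in this sketch. Operators get their answer, and nothing in the request path can come to depend on a cross-capsule query that has no real infrastructure behind it.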

Story 2: UX Demo Applications

The Debt: TypeScript type mismatches in demo apps. Several component demos incomplete or outdated.
Alternative: Storybook already provides interactive component demos with proper types.
Fix Cost: 4 hours debugging + ongoing maintenance burden
Decision: Deferred indefinitely. Storybook is sufficient for component exploration.
ROI Calculation:
  • Value added: Minimal (Storybook covers use case)
  • Maintenance cost: 2-3 hours/month
  • Net savings: $800/year
Why It Worked: Don’t maintain two systems when one works. Recognize sunk costs.

Story 3: LocalStack Integration Tests

The Debt: Many integration tests marked #[ignore] - they won’t run locally without infrastructure setup.
Blocker: Infrastructure setup scripts incomplete.
Workaround: CI environment runs tests with proper LocalStack setup.
Decision: Defer until blocker resolved. Tests run where they matter (CI).
Why It Worked: Tests still run in the gating environment. Local convenience isn’t worth the unblocking effort yet.

Debt Lived With: One Failure Story

Capsule Isolation Violations

The Debt: Four entities stayed tenant-scoped when they should have been capsule-scoped. In a multi-capsule (pod) architecture, this means dev/test data could mix with production data.
Impact: HIGH RISK - architectural integrity violation, potential data leakage.
Discovery: Architectural audit found 7% violation rate (4 out of ~60 entities).
Original Fix Cost (if done immediately): 4 hours per entity = 16 hours total
Current Fix Cost (after 6 months): 12-18 hours per entity = 48-72 hours total
Why It Got Harder:
  • More code depends on wrong scoping
  • Migration complexity increased
  • Test data assumptions baked in
The Lesson: Security and architectural debt compounds. The 7% violation became harder to fix over time, not easier. Some debt categories should never be deferred.
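The difference between the two scopes is easiest to see in how a storage key is built. This is a sketch under assumptions: the `TENANT#…#CAPSULE#…` key format, the newtypes, and both function names are invented for illustration and are not the project's actual schema.

```rust
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TenantId(String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CapsuleId(String);

/// Tenant-scoped key: the capsule is not part of the key, so every
/// capsule belonging to a tenant reads and writes the same keyspace.
fn tenant_scoped_key(tenant: &TenantId, entity_id: &str) -> String {
    format!("TENANT#{}#ENTITY#{}", tenant.0, entity_id)
}

/// Capsule-scoped key: the capsule is part of the key, so a dev row and
/// a production row with the same entity ID can never collide.
fn capsule_scoped_key(tenant: &TenantId, capsule: &CapsuleId, entity_id: &str) -> String {
    format!("TENANT#{}#CAPSULE#{}#ENTITY#{}", tenant.0, capsule.0, entity_id)
}
```

Requiring a `CapsuleId` argument also explains why the fix gets more expensive with time: every caller written against the tenant-scoped form has to be found and given a capsule to pass.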

Three Decision Frameworks

Framework 1: Issue Backpressure (Priority Labels)

This framework categorizes work by urgency and impact:
CRITICAL  → Production bug, security issue
           Action: Do first, drop other work
           
HIGH      → Blocks team work, user-facing feature
           Action: Do next, schedule within sprint
           
MEDIUM    → Important, not blocking
           Action: Do when bandwidth available
           
LOW       → Nice to have, polish, optimization
           Action: Defer to backlog
           
BLOCKED   → Waiting on dependency
           Action: Skip until unblocked
Example Applications:
  • Capsule isolation violations → CRITICAL (architectural risk)
  • Pipeline customization → HIGH (blocks enterprise sales)
  • AWS runtime consolidation → MEDIUM (important, not urgent)
  • Cross-capsule queries → LOW (workaround exists)
  • LocalStack tests → BLOCKED (waiting on #441)
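The labels translate naturally into an enum plus an ordering rule. The following is one possible encoding, not the project's tooling: `Priority`, `rank`, and `work_queue` are hypothetical names, and the action strings are copied from the list above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    Critical,
    High,
    Medium,
    Low,
    Blocked,
}

impl Priority {
    /// The action attached to each label.
    fn action(self) -> &'static str {
        match self {
            Priority::Critical => "Do first, drop other work",
            Priority::High => "Do next, schedule within sprint",
            Priority::Medium => "Do when bandwidth available",
            Priority::Low => "Defer to backlog",
            Priority::Blocked => "Skip until unblocked",
        }
    }

    /// Lower rank means earlier in the queue.
    fn rank(self) -> u8 {
        match self {
            Priority::Critical => 0,
            Priority::High => 1,
            Priority::Medium => 2,
            Priority::Low => 3,
            Priority::Blocked => 4,
        }
    }
}

/// Orders workable items by priority and leaves blocked ones out.
fn work_queue<'a>(items: &[(&'a str, Priority)]) -> Vec<&'a str> {
    let mut workable: Vec<(&'a str, Priority)> = items
        .iter()
        .copied()
        .filter(|(_, priority)| *priority != Priority::Blocked)
        .collect();
    workable.sort_by_key(|(_, priority)| priority.rank());
    workable.into_iter().map(|(name, _)| name).collect()
}
```

Feeding in the five example applications yields the capsule isolation violations first and the cross-capsule queries last, with the LocalStack tests omitted until their dependency clears.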

Framework 2: Cost/Benefit Matrix

Map frequency against value to determine action:
                HIGH FREQUENCY        LOW FREQUENCY
                (>100/day)            (<10/day)
              ┌───────────────────┬─────────────────┐
HIGH VALUE    │ Build Now         │ Workaround OK   │
(user/rev)    │                   │                 │
              │ Ex: Pipeline      │ Ex: Cross-      │
              │ customization     │ capsule queries │
              │                   │                 │
              ├───────────────────┼─────────────────┤
LOW VALUE     │ Build If Easy     │ Defer/Won't Fix │
(polish)      │                   │                 │
              │ Ex: Helper macros │ Ex: Demo apps   │
              │ (1 hour, 260 LOC) │ Ex: Soft-delete │
              │                   │                 │
              └───────────────────┴─────────────────┘

ALWAYS FIX IMMEDIATELY (OVERRIDE ALL QUADRANTS):
❗ Security issues
❗ Architectural violations
❗ Compliance gaps
❗ Blocking dependencies
Key Insight: Cross-capsule queries fell into “high value, low frequency” - exactly where workarounds make sense. Demo apps fell into “low value, low frequency” - classic defer/won’t-fix territory.
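The matrix and its override list reduce to a small lookup. A minimal sketch, with hypothetical type and function names; classifying an item as high or low on each axis, and deciding whether it is an override case, remains a human judgment made before the call.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Frequency {
    High,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Value {
    High,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    FixImmediately,
    BuildNow,
    WorkaroundOk,
    BuildIfEasy,
    DeferOrWontFix,
}

/// `is_override` is true for security issues, architectural violations,
/// compliance gaps, and blocking dependencies; those skip the matrix.
fn decide(frequency: Frequency, value: Value, is_override: bool) -> Action {
    if is_override {
        return Action::FixImmediately;
    }
    match (frequency, value) {
        (Frequency::High, Value::High) => Action::BuildNow,
        (Frequency::Low, Value::High) => Action::WorkaroundOk,
        (Frequency::High, Value::Low) => Action::BuildIfEasy,
        (Frequency::Low, Value::Low) => Action::DeferOrWontFix,
    }
}
```

Checking the override first matters: the capsule isolation violations would land in a low-frequency quadrant on their own, and the override is what keeps them from being deferred.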

Framework 3: Stage-Based Complexity (Lean Until Proven)

Don’t build Stage 4 infrastructure in Stage 0:
Stage 0 (Garage):      Minimal complexity, prove concept
Stage 1 (Internal):    Low complexity, internal users
Stage 2 (Design):      Medium complexity, first customers
Stage 3 (Growth):      Full complexity, scaling challenges
Stage 4 (Enterprise):  Maximum complexity, enterprise needs
Application Examples:
  • Cross-capsule queries (Stage 4) → Defer in Stage 0 ✓
  • Configuration governance (Stage 2) → Transitional approach OK ✓
  • Soft-delete (Stage 3) → Defer to backlog until scaling needs ✓
  • Pipeline customization (Stage 2) → Build for design partners ✓
Rule: Match infrastructure complexity to current stage, not future dreams.
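Because the stages are ordered, the rule becomes a single comparison. In this sketch the `Stage` enum mirrors the list above, while `build_now` is a hypothetical helper; it deliberately ignores the "transitional approach" middle ground and answers only build or defer.

```rust
/// Declaration order doubles as maturity order, so the derived `Ord`
/// makes Garage < Internal < Design < Growth < Enterprise.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Stage {
    Garage,
    Internal,
    Design,
    Growth,
    Enterprise,
}

/// Build a capability only once the company has reached the stage that
/// actually needs it.
fn build_now(required: Stage, current: Stage) -> bool {
    required <= current
}
```

For a team currently at the Design stage, this returns true for pipeline customization (Stage 2) and false for soft-delete (Stage 3) and cross-capsule queries (Stage 4).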

When Managed Services Win

Sometimes the best way to avoid technical debt is to not write the code at all.

Case Study: Timestream vs Custom DynamoDB

Requirement: Store and query time-series metrics data.
Option A: Custom DynamoDB Implementation
  • Development: 100 hours ($20,000)
  • Operations: 10 hours/month ($2,000/month, $24,000/year)
  • Storage: $280-450/month
  • Query performance: 2-8 seconds
  • Total Year 1: $48,584
Option B: AWS Timestream (Managed Service)
  • Development: 0 hours
  • Operations: 0 hours/month
  • Storage: $89/month
  • Query performance: less than 500ms
  • Total Year 1: $1,068
Savings: $47,516 in year one (about 98% cost reduction)
The Decision Rule: When a managed service costs about 2% of a custom solution and performs better, building custom is technical debt from day one.
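The year-one totals can be reproduced from the line items. The sketch below assumes the $200/hr rate used earlier in the article and a custom storage cost of $382/month, which is the value inside the stated $280-450 range that reproduces the $48,584 total; both the constant and the function name are illustrative.

```rust
/// Hourly engineering rate, taken from the $200/hr used earlier.
const HOURLY_RATE: u32 = 200;

/// First-year cost in dollars: one-time development plus twelve months
/// of operations labor and storage.
fn year_one_cost(dev_hours: u32, ops_hours_per_month: u32, storage_per_month: u32) -> u32 {
    let development = dev_hours * HOURLY_RATE;
    let operations = ops_hours_per_month * HOURLY_RATE * 12;
    let storage = storage_per_month * 12;
    development + operations + storage
}
```

`year_one_cost(100, 10, 382)` comes to $48,584 for the custom build and `year_one_cost(0, 0, 89)` to $1,068 for Timestream, a difference of $47,516. Operations labor ($24,000) is the largest single term, larger even than the initial development.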

Transitional Debt: How to Take On Debt Safely

Not all debt is created equal. Some debt is explicitly temporary - planned, tracked, and bounded.

Configuration Governance Example

The Requirement: Implement configuration governance with encryption, field-level validation, and migration capabilities.
Option A: Full Implementation
  • Cost: 5 days
  • Risk: Over-engineering for current needs
Option B: Iterative (4 phases)
  • Phase 1: Basic interface (1 day)
  • Phase 2: Value types (1 day)
  • Phase 3: Migration tools (1 day)
  • Phase 4: Encryption (1 day)
  • Total: 4 days, delivered incrementally
Chosen Approach: Option B with explicit technical debt tracking.

Requirements for Transitional Debt

Every transitional debt decision must have:
MUST HAVE:
✓ "technical-debt" label on issue
✓ Follow-up issues created for future phases
✓ Migration path documented in ADR
✓ Acceptance criteria defined
✓ Revisit timeline set (e.g., "after 3 months" or "when X customers")

MUST NOT HAVE:
✗ Open-ended "we'll fix it later"
✗ Missing tracking or documentation
✗ Unclear success criteria
Why This Works: Transitional debt is debt with a plan. It’s bounded, tracked, and has exit criteria.
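One way to keep the must-have list from eroding is to make it checkable. This sketch assumes a hand-rolled record type; `TransitionalDebt`, its fields, and `missing` are hypothetical and not tied to any issue tracker's API.

```rust
/// One piece of transitional debt, with a field for each must-have.
#[derive(Debug, Default)]
struct TransitionalDebt {
    has_technical_debt_label: bool,
    follow_up_issues: Vec<u32>,
    adr_migration_path: Option<String>,
    acceptance_criteria: Vec<String>,
    revisit_trigger: Option<String>,
}

impl TransitionalDebt {
    /// Lists which must-haves are still absent; an empty result means
    /// the debt is bounded, tracked, and has exit criteria.
    fn missing(&self) -> Vec<&'static str> {
        let mut gaps = Vec::new();
        if !self.has_technical_debt_label {
            gaps.push("technical-debt label");
        }
        if self.follow_up_issues.is_empty() {
            gaps.push("follow-up issues");
        }
        if self.adr_migration_path.is_none() {
            gaps.push("migration path in ADR");
        }
        if self.acceptance_criteria.is_empty() {
            gaps.push("acceptance criteria");
        }
        if self.revisit_trigger.is_none() {
            gaps.push("revisit timeline");
        }
        gaps
    }
}
```

A record built with `TransitionalDebt::default()` is the open-ended "we'll fix it later" case and reports all five gaps. A check like this could run wherever debt issues are filed, though where to hook it in is left open here.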

The Documentation Checklist

Good debt management requires documentation. Every debt decision needs:
  1. ADR or Plan Document
    • Explains WHY the decision was made
    • Captures alternatives considered
    • Documents decision criteria
  2. GitHub Issue
    • Tracks WHAT needs to be done (if anything)
    • Links to ADR for context
  3. Labels
    • technical-debt - Marks known debt
    • blocked - Waiting on dependency
    • deferred - Intentionally postponed
    • wont-fix - Explicitly accepting debt
  4. Migration Path or Acceptance Criteria
    • How to fix it (if we decide to)
    • What success looks like
  5. Decision Point
    • When to revisit (timeline, milestone, or condition)
    • Example: “Revisit when >100 queries/month”

The Key Insight

Quote from our project constitution:
“Good debt management is not avoiding all debt - it’s making conscious decisions, documenting them clearly, and having criteria for when to revisit.”
The difference between strategic debt and technical bankruptcy is documentation and decision-making.

Outcomes Summary

Debt Paid Down:
  • AWS runtime migration: 16 hours invested, positive ROI in 8 months
  • LegalEntityId consolidation: 30 minutes, immediate ROI
  • Pipeline refactor: 12 hours, unlocked enterprise sales
  • TODO cleanup: 1 hour, prevented architectural confusion
Debt Lived With Successfully:
  • Cross-capsule queries: $19,800 saved in the first year vs building it
  • Demo apps: $800/year savings by accepting Storybook
  • LocalStack tests: Deferred until blocker resolved
Debt Lived With Too Long:
  • Capsule isolation violations: 3x to 4.5x more expensive to fix after 6 months (16 hours grew to 48-72 hours)
  • Lesson: Architectural debt compounds
Transitional Debt:
  • Configuration governance: Tracked explicitly across 4 phases
  • Status: Manageable and progressing

Applying These Frameworks

When you encounter technical debt, run through these questions:
  1. Priority (Backpressure): Is this CRITICAL, HIGH, MEDIUM, LOW, or BLOCKED?
  2. Frequency: How often is this code executed or touched?
  3. Value: Does this impact revenue, users, security, or architecture?
  4. Cost: How many hours to fix? What’s the risk of waiting?
  5. Stage: Does this complexity match our current stage?
  6. Alternative: Is there a managed service or workaround?
Then map to the decision matrix:
  • CRITICAL or security: Fix immediately, no questions
  • High frequency + high value: Build now
  • Low frequency + high value: Workaround acceptable
  • High frequency + low value: Build only if cheap
  • Low frequency + low value: Defer or won’t-fix
And finally: Document the decision. Whether you fix it, defer it, or accept it forever - write down why. Your future self (and your teammates) will thank you.
The unimplemented function saved us $20,000. The architectural violation became at least 3x more expensive to fix later. The difference wasn’t the debt itself - it was knowing which debt to take on, and which to pay down immediately. Choose wisely.