
The Honest Assessment

After building a production SaaS platform with AI agents for several months, I’ve shipped over 2,500 lines of infrastructure code weekly, eliminated 4,700 lines of boilerplate through macros, and maintained a 99.2% test pass rate. AI absolutely excels at certain types of work.

But there are clear, consistent boundaries where AI fails. Not “struggles” or “needs help” - actively fails and makes things worse. Understanding these boundaries isn’t AI criticism. It’s boundary exploration that makes you better at using AI effectively.

The insight: AI’s failures cluster into distinct categories. Learn to recognize them, and you’ll know when to step in before AI burns 24 hours on a problem you could solve in 90 minutes.

The Seven Failure Categories

1. Novel Problems (No Pattern to Match)

What fails: Problems AI has never seen in training data. Real example: Designing a scope-based AWS client factory that enforces multi-tenant isolation boundaries through the type system.
Week 6: Build an AWS client factory that prevents cross-tenant data access at compile time.
Requirements:
  • 4 operational scopes (platform, tenant, capsule, operator)
  • Automatic table name prefixing per scope
  • Type-safe enforcement of isolation
AI’s initial proposal: Generic factory with runtime checks.
Problem: Runtime checks can be bypassed. Needed compile-time enforcement.
Why AI fails: Novel type-level enforcement patterns aren’t in training data. AI defaults to runtime checks. When to intervene: When requirements include “enforce at compile time” or “prevent X architecturally.”
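The compile-time approach can be sketched with phantom types in plain Rust. This is a minimal illustration of the technique, not the project’s actual factory - `ScopedClient`, the `Scope` trait, and the marker types are all hypothetical names:

```rust
use std::marker::PhantomData;

// Marker types for the four operational scopes (hypothetical names).
struct Platform;
struct Tenant;
struct Capsule;
struct Operator;

// Each scope knows its table-name prefix.
trait Scope { const PREFIX: &'static str; }
impl Scope for Platform { const PREFIX: &'static str = "platform"; }
impl Scope for Tenant   { const PREFIX: &'static str = "tenant"; }
impl Scope for Capsule  { const PREFIX: &'static str = "capsule"; }
impl Scope for Operator { const PREFIX: &'static str = "operator"; }

// A client is parameterized by its scope; the type system prevents
// passing a tenant-scoped client where a platform one is required.
struct ScopedClient<S: Scope> {
    _scope: PhantomData<S>,
}

impl<S: Scope> ScopedClient<S> {
    fn new() -> Self {
        ScopedClient { _scope: PhantomData }
    }

    // Table names are prefixed automatically per scope.
    fn table_name(&self, base: &str) -> String {
        format!("{}_{}", S::PREFIX, base)
    }
}

// This function only accepts tenant-scoped clients; calling it with a
// ScopedClient<Platform> is a compile error, not a runtime check.
fn read_tenant_data(client: &ScopedClient<Tenant>, table: &str) -> String {
    client.table_name(table)
}

fn main() {
    let tenant_client = ScopedClient::<Tenant>::new();
    println!("{}", read_tenant_data(&tenant_client, "orders"));
    // let platform_client = ScopedClient::<Platform>::new();
    // read_tenant_data(&platform_client, "orders"); // ← would not compile
}
```

Because the scope is a type parameter rather than a runtime field, handing the wrong client to a function is rejected by the compiler - there is no check left to bypass at runtime.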

2. Framework Quirks (Middleware Ordering, Lifecycle)

What fails: Framework-specific execution order, lifecycle hooks, initialization sequences. Real example: Week 7’s middleware ordering bug.
The Bug: ConfigMiddleware registered after CapsuleExtractor.
Why it failed: In Actix-web, wrapped middleware execute in reverse registration order - the last .wrap() is the outermost layer and runs first on each request. ConfigMiddleware, registered last, ran first but needed the CapsuleContext that CapsuleExtractor provides.
AI’s mistake: Designed both middleware correctly but didn’t account for Actix-web’s wrap-in-reverse semantics.
// ❌ AI's initial code (wrong order)
App::new()
    .wrap(CapsuleExtractor::new())                // Registered first → runs second
    .wrap(ConfigMiddleware::new(config_service))  // Registered last → runs first, but no capsule yet

// ✅ Human fix
App::new()
    .wrap(ConfigMiddleware::new(config_service))  // Registered first → runs second, capsule available
    .wrap(CapsuleExtractor::new())                // Registered last → runs first, provides capsule
How we caught it: Middleware returned 500 error “CapsuleContext not found.” Took 15 minutes to diagnose, 2 minutes to fix.
Why AI fails: Training data shows middleware registration, not execution order semantics. Framework docs don’t always make this explicit. When to intervene: Any framework feature with hidden ordering dependencies (middleware, hooks, lifecycle).
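The wrap-in-reverse behavior can be demonstrated without Actix-web at all, using plain function composition - each wrap layers around the previous one, so the wrap applied last becomes the outermost layer and runs first:

```rust
// Each "middleware" wraps a handler: the wrap applied last becomes the
// outermost layer and therefore runs first on a request, mirroring
// Actix-web's reverse-of-registration execution order.
type Handler = Box<dyn Fn(&mut Vec<&'static str>)>;

fn wrap(name: &'static str, inner: Handler) -> Handler {
    Box::new(move |trace| {
        trace.push(name); // record when this layer runs
        inner(trace);
    })
}

fn main() {
    let handler: Handler = Box::new(|trace| trace.push("handler"));
    // Registration order: first "config", then "extractor".
    let app = wrap("extractor", wrap("config", handler));
    let mut trace = Vec::new();
    app(&mut trace);
    // Execution order is the reverse: extractor → config → handler.
    assert_eq!(trace, vec!["extractor", "config", "handler"]);
    println!("{:?}", trace);
}
```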

3. Performance Optimization (Needs Profiling Data)

What fails: Choosing what to optimize without actual performance measurements. Real example: Week 6’s credential caching.
AWS client factory with perfect architecture:
  • Type-safe scope enforcement
  • Automatic table prefixing
  • Clean API design
What AI missed: Credential caching.
Why AI fails: AI doesn’t know that STS assume-role adds 200-500ms. That’s operational knowledge from running production systems. When to intervene: Any code path that calls external services repeatedly. Add caching before AI implementation.
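A minimal sketch of credential caching with a refresh margin, assuming tokens carry an expiry. The STS call is simulated here; a real factory would call AssumeRole through the AWS SDK:

```rust
use std::time::{Duration, Instant};

// Hypothetical cached credentials: refresh only when near expiry, so
// repeated calls skip the 200-500ms STS assume-role round trip.
struct CachedCredentials {
    token: String,
    expires_at: Instant,
}

struct CredentialCache {
    cached: Option<CachedCredentials>,
    ttl: Duration,
    refresh_margin: Duration,
    fetches: u32, // counts simulated STS calls, for illustration
}

impl CredentialCache {
    fn new(ttl: Duration, refresh_margin: Duration) -> Self {
        CredentialCache { cached: None, ttl, refresh_margin, fetches: 0 }
    }

    // Stand-in for sts:AssumeRole; a real factory calls AWS here.
    fn fetch_from_sts(&mut self) -> String {
        self.fetches += 1;
        format!("token-{}", self.fetches)
    }

    fn get(&mut self) -> String {
        // Refresh if empty or within the margin of expiry.
        let needs_refresh = match &self.cached {
            Some(c) => Instant::now() + self.refresh_margin >= c.expires_at,
            None => true,
        };
        if needs_refresh {
            let token = self.fetch_from_sts();
            self.cached = Some(CachedCredentials {
                token: token.clone(),
                expires_at: Instant::now() + self.ttl,
            });
        }
        self.cached.as_ref().unwrap().token.clone()
    }
}

fn main() {
    let mut cache = CredentialCache::new(Duration::from_secs(3600), Duration::from_secs(300));
    let a = cache.get();
    let b = cache.get(); // served from cache, no second STS call
    assert_eq!(a, b);
    assert_eq!(cache.fetches, 1);
}
```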

4. Security Hardening (Requires Threat Modeling)

What fails: Identifying attack vectors, validating security boundaries, preventing subtle bypasses. Real example: Multi-tenant isolation validation (implied from Week 6 work). What AI builds: Working isolation logic per requirements. What AI misses:
  • Cross-tenant queries via timestamp-based GSI (leaked data)
  • Missing tenant validation in batch operations
  • Race conditions in tenant-scoped locks
  • Token substitution attacks (swap tenant_id in JWT)
Why AI fails: Security requires adversarial thinking. “What if attacker does X?” AI optimizes for happy path. When to intervene: Security-critical code paths. Add threat model before AI implements.
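One listed bypass - token substitution - illustrates why the happy path isn’t enough: the tenant claimed in the token must be checked against the tenant that owns the requested resource. A minimal sketch with hypothetical names:

```rust
// A basic guard against tenant substitution: the tenant claimed in the
// token must match the tenant that owns the requested resource.
// Happy-path code often trusts the token's tenant_id alone.
fn authorize(token_tenant: &str, resource_tenant: &str) -> Result<(), String> {
    if token_tenant == resource_tenant {
        Ok(())
    } else {
        Err(format!(
            "tenant mismatch: token={}, resource={}",
            token_tenant, resource_tenant
        ))
    }
}

fn main() {
    assert!(authorize("tenant-a", "tenant-a").is_ok());
    // A swapped tenant_id in the JWT must be rejected.
    assert!(authorize("tenant-b", "tenant-a").is_err());
}
```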

5. Operational Concerns (Caching, Monitoring, Scaling)

What fails: Production deployment concerns that don’t appear in development. Real examples from the journey:

Caching Strategy

AI missed: DynamoDB 400KB item size limit
Impact: Event sourcing worked in dev (small events), failed in prod (aggregated events > 400KB)
Human fix: Added event compression + splitting logic
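The splitting half of that fix fits in a few lines (compression omitted). The 380KB threshold is an illustrative safety margin below DynamoDB’s 400KB hard limit:

```rust
// DynamoDB rejects items over 400KB; one mitigation is to split a large
// serialized event into numbered chunks that each fit under the limit.
const MAX_ITEM_BYTES: usize = 380 * 1024; // margin below the 400KB hard limit

fn split_event(payload: &[u8]) -> Vec<&[u8]> {
    payload.chunks(MAX_ITEM_BYTES).collect()
}

fn main() {
    let small = vec![0u8; 1024];
    assert_eq!(split_event(&small).len(), 1); // small events stay whole

    let large = vec![0u8; 900 * 1024]; // ~900KB aggregated event
    let chunks = split_event(&large);
    assert_eq!(chunks.len(), 3);
    // Chunks reassemble into the original payload.
    assert_eq!(chunks.concat(), large);
}
```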

Circuit Breakers

AI missed: DynamoDB throttling protection
Impact: Bulk operations exhausted write capacity
Human fix: Added exponential backoff + batch size limits
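Both mitigations are small once you know they’re needed. A sketch with illustrative base and cap values; the 25-item ceiling matches DynamoDB’s documented BatchWriteItem maximum:

```rust
use std::time::Duration;

// Exponential backoff schedule for throttled batch writes: the delay
// doubles per retry and is capped so a hot loop cannot wait forever.
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    let exp = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(exp.min(cap_ms))
}

// DynamoDB BatchWriteItem accepts at most 25 items per request.
fn batch_limit(items: usize) -> usize {
    items.min(25)
}

fn main() {
    assert_eq!(backoff_delay(0, 50, 5_000), Duration::from_millis(50));
    assert_eq!(backoff_delay(1, 50, 5_000), Duration::from_millis(100));
    assert_eq!(backoff_delay(10, 50, 5_000), Duration::from_millis(5_000)); // capped
    assert_eq!(batch_limit(120), 25);
}
```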

Observability

AI missed: Structured logging with correlation IDs
Impact: Couldn’t trace requests across services
Human fix: Added tracing middleware with request_id propagation

Graceful Degradation

AI missed: Fallback when config service unavailable
Impact: All requests failed if config DynamoDB down
Human fix: Added platform-level defaults + circuit breaker
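The fallback pattern in minimal form - the store lookup is simulated, and the config fields and default values are illustrative:

```rust
// When the config store is unreachable, fall back to platform-level
// defaults instead of failing every request (names are illustrative).
#[derive(Debug, PartialEq)]
struct Config {
    rate_limit: u32,
}

fn platform_defaults() -> Config {
    Config { rate_limit: 100 }
}

// Stand-in for a DynamoDB lookup that may fail.
fn load_config(store_available: bool) -> Result<Config, String> {
    if store_available {
        Ok(Config { rate_limit: 500 })
    } else {
        Err("config store unavailable".to_string())
    }
}

fn resolve_config(store_available: bool) -> Config {
    load_config(store_available).unwrap_or_else(|_| platform_defaults())
}

fn main() {
    assert_eq!(resolve_config(true).rate_limit, 500);
    // Store down: requests still succeed with safe defaults.
    assert_eq!(resolve_config(false).rate_limit, 100);
}
```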
Why AI fails: Production concerns come from running systems at scale. AI trained on code, not deployment war stories. When to intervene: Before deploying to production. Review for caching, monitoring, error handling, scaling.

6. Business Logic (Domain-Specific Rules)

What fails: Subtle business rules that aren’t explicitly documented. Example: Billing calculation edge cases. What AI implements: Straightforward billing from requirements:
  • Charge per API call
  • Monthly aggregation
  • Pro-rated refunds
What AI misses:
  • Don’t charge for failed requests (500 errors)
  • Don’t charge during maintenance windows
  • Don’t charge for health checks
  • Cap monthly charge at contract limit
  • Handle timezone boundaries for “monthly”
Why AI fails: These rules live in domain expert’s heads, not requirements docs. AI can’t infer implicit business knowledge. When to intervene: Domain logic with regulatory, financial, or compliance implications. Verify with domain experts.
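A billing sketch that encodes several of those implicit rules (timezone boundaries omitted); prices, caps, and paths are illustrative:

```rust
// Billing sketch honoring the edge cases above: skip failed requests,
// health checks, and maintenance-window calls, and cap the monthly
// total at the contract limit.
struct Call {
    status: u16,
    path: &'static str,
    in_maintenance: bool,
}

fn monthly_charge(calls: &[Call], price_cents: u64, cap_cents: u64) -> u64 {
    let billable = calls.iter().filter(|c| {
        c.status < 500             // don't charge for server errors
            && c.path != "/health" // don't charge for health checks
            && !c.in_maintenance   // don't charge during maintenance
    }).count() as u64;
    (billable * price_cents).min(cap_cents) // cap at contract limit
}

fn main() {
    let calls = [
        Call { status: 200, path: "/api/orders", in_maintenance: false },
        Call { status: 500, path: "/api/orders", in_maintenance: false },
        Call { status: 200, path: "/health",     in_maintenance: false },
        Call { status: 200, path: "/api/orders", in_maintenance: true },
        Call { status: 200, path: "/api/orders", in_maintenance: false },
    ];
    // Only two calls are billable; 2 × 10¢ = 20¢, under the 100¢ cap.
    assert_eq!(monthly_charge(&calls, 10, 100), 20);
    // With a 15¢ cap, the charge is clamped at the contract limit.
    assert_eq!(monthly_charge(&calls, 10, 15), 15);
}
```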

7. Edge Cases (Requires Deep Context)

What fails: Rare but critical scenarios that break assumptions. Real example from Week 5: The cascading errors.
Change: Updated macro to generate CRUD methods automatically
Expected impact: Update 4 crates, maybe 20 call sites
Actual impact: 214 compilation errors across 30 files
AI’s approach: Fix one error at a time
Result: 30 commits over 24 hours, still had errors
Why AI fails: Cascading changes require understanding system-wide dependencies. AI optimizes locally (fix this error) not globally (understand the change pattern). When to intervene: When error count stops decreasing or AI makes “partial fix” commits. Sign of being stuck.

Real Examples: What AI Did vs. What AI Missed

Week 6: AWS Client Factory

Event sourcing pattern:
  • Complete implementation of event store
  • Aggregate root pattern
  • Event versioning
  • Idempotency keys
  • 2,100 lines of working code
AWS client factory:
  • Type-safe scope enforcement
  • Platform/Tenant/Capsule/Operator clients
  • Automatic table name prefixing
  • 2,364 lines with 39 tests
Config middleware:
  • Hierarchical resolution (Platform → Tenant → Capsule)
  • Automatic injection via middleware
  • REST API with preview endpoints
  • 5,541 lines with 28 tests

Why These Boundaries Exist

1. Training Data Limitations

AI learns from public code repositories. What’s missing:
  • Production deployment configs: Not in repos (secrets, scaling params)
  • Incident post-mortems: Private docs, not public code
  • Performance profiling results: Runtime data, not source code
  • Security threat models: Confidential, not open-source
  • Business domain knowledge: In people’s heads, not documentation
Implication: AI designs clean patterns but misses operational reality.

2. Context Window Limits

Even with a 200K token context:
What fits:
  • Single crate implementation
  • Related test files
  • Architecture docs
What doesn’t fit:
  • Entire workspace (9 crates, 180 files)
  • Cross-crate dependency chain
  • Historical evolution (why code changed)
Result: AI sees local correctness, misses global impact (Week 5’s cascade).

3. Operational Knowledge Gap

AI knows code patterns but not production behavior:
  • AWS STS assume-role latency
  • DynamoDB 400KB item size limit
  • EventBridge PutEvents throttling
  • CORS preflight optimization
  • HTTP/2 connection pooling
These come from running production systems, not reading code.

What This Means for Workflows

Where Humans Add Value

Human role: Review AI’s proposed architecture for gaps
Questions to ask:
  • Performance: What needs caching?
  • Security: What are the attack vectors?
  • Operational: How does this fail? How do we debug it?
  • Limits: What happens at scale?
Example: Week 6’s credential caching caught in design review
Saved: Launching without caching, discovering 200ms latency in production
Human role: Catch framework-specific quirks AI misses
Watch for:
  • Middleware execution order
  • Lifecycle hook timing
  • Dependency injection scope
  • Transaction boundaries
Example: Week 7’s middleware ordering bug
Fix time: 15 minutes (caught in testing)
Human role: Add operational concerns before deployment
Checklist:
  • Monitoring: Metrics, logs, traces
  • Error handling: Retries, circuit breakers
  • Performance: Caching, connection pooling
  • Limits: Rate limiting, batch sizes
Example: Added DynamoDB throttling protection after Week 6
Result: No production incidents from write capacity exhaustion
Human role: Debug production issues with full context
Why AI struggles:
  • Needs correlation across logs, metrics, traces
  • Requires understanding of deployed versions
  • Must reason about race conditions, timing
Example: Production 500 error from middleware ordering
AI diagnosis: Suggested 10 potential causes
Human diagnosis: Checked Actix-web execution order → fixed in 2 minutes

The Decision Tree: When to Use AI vs. Intervene

New task arrives:

Is it a novel problem AI hasn't seen?
├─ YES → Human designs, AI implements
└─ NO → Continue

Does it involve framework-specific quirks?
├─ YES → AI implements, human reviews framework behavior
└─ NO → Continue

Does it need performance optimization?
├─ YES → Human profiles first, then AI optimizes hotspots
└─ NO → Continue

Is it security-critical?
├─ YES → Human threat models, AI implements controls
└─ NO → Continue

Does it involve operational concerns?
├─ YES → AI builds infrastructure, human adds monitoring/caching
└─ NO → AI can handle end-to-end

Is it domain-heavy business logic?
├─ YES → Human validates with domain expert, AI implements
└─ NO → Continue

Are there cascading errors (>10)?
├─ YES → Manual intervention (batch fix)
└─ NO → Let AI fix incrementally

Principles for Working Within Boundaries

Design Before Build

Lesson from Week 6: ADR-driven design worked. Week 5’s reactive fixing failed.
Practice:
  1. Document constraints (ADR, requirements)
  2. Let AI propose architecture
  3. Human reviews for gaps (caching, security)
  4. Implement with confidence
Result: 2,364 lines in 3 days, 0 production bugs

Atomic Changes

Lesson from Week 5’s cascade: 30 commits of partial fixes created more errors.
Practice:
  • Migrate one component completely
  • Test thoroughly
  • Then migrate next component
  • Never commit broken intermediate states
Result: Week 6’s 3-crate migration, zero cascading errors

Add Operational Layer

Lesson from production: AI builds clean patterns, misses caching/monitoring.
Practice:
  • Review AI implementation for external calls
  • Add caching before deployment
  • Add metrics, tracing, structured logging
  • Add circuit breakers for downstream services
Result: 99.2% cache hit rate, 42ms latency improvement

Monitor AI Progress

Lesson from Week 5: Error count stopped decreasing = AI stuck.
Practice: Track errors fixed per commit:
  • Healthy: 5-10 errors per commit
  • Warning: 2-4 errors per commit
  • Critical: less than 2 errors per commit
Action: If critical for 3 commits → manual intervention
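Those thresholds translate directly into a small checker, with the three-commit intervention rule applied to the most recent commits:

```rust
// Classify AI progress from errors fixed per commit, per the thresholds
// above; three consecutive "critical" commits trigger manual intervention.
#[derive(Debug, PartialEq)]
enum Health { Healthy, Warning, Critical }

fn classify(errors_fixed: u32) -> Health {
    match errors_fixed {
        0..=1 => Health::Critical, // less than 2 errors per commit
        2..=4 => Health::Warning,
        _ => Health::Healthy,      // 5+ errors per commit
    }
}

fn should_intervene(recent: &[u32]) -> bool {
    recent.len() >= 3
        && recent.iter().rev().take(3).all(|&e| classify(e) == Health::Critical)
}

fn main() {
    assert_eq!(classify(7), Health::Healthy);
    assert_eq!(classify(3), Health::Warning);
    assert_eq!(classify(1), Health::Critical);
    // Last three commits each fixed <2 errors → step in manually.
    assert!(should_intervene(&[6, 1, 0, 1]));
    // A healthy commit inside the window resets the count.
    assert!(!should_intervene(&[1, 5, 1]));
}
```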

The Meta-Insight

After several months of building with AI: AI isn’t “almost there” on these seven categories. They’re fundamental boundaries:
  1. Novel problems: Training data limitation
  2. Framework quirks: Hidden execution order
  3. Performance: No profiling data
  4. Security: Requires adversarial thinking
  5. Operational: Production experience needed
  6. Business logic: Domain knowledge gap
  7. Edge cases: Global context required
These aren’t bugs to fix. They’re boundaries to work within. The workflow shift:
  • ❌ “Let AI do everything, fix what breaks”
  • ✅ “Use AI where it excels, human where it doesn’t”
Recognizing the boundaries makes you 10x more effective with AI.

Actionable Takeaways

If you’re building with AI:
  1. Document constraints before asking AI to design - ADRs, requirements docs, isolation rules. Let AI design within boundaries.
  2. Review for the seven gaps - Performance (add caching), security (threat model), operational (monitoring), framework (quirks), novel patterns (human designs first).
  3. Watch for cascade signals - Diminishing returns (errors per commit dropping), “partial fix” in commit messages, AI making same fix repeatedly.
  4. Add operational layer before shipping - Metrics, tracing, circuit breakers, caching. AI builds infrastructure, humans harden for production.
  5. Design prevents debugging - Week 6’s proactive design: 3 days, 0 bugs. Week 5’s reactive fixing: 24 hours, still broken. Design wins.
Pro tip: Create a production readiness checklist for AI-generated code:
  • Performance: What needs caching?
  • Security: What’s the threat model?
  • Monitoring: Metrics, logs, traces added?
  • Error handling: Circuit breakers, retries?
  • Limits: Rate limits, batch sizes configured?
  • Framework: Execution order correct?
Run this before deploying AI implementations. Catches 90% of gaps.

Discussion

Share Your Experience

What limitations have you hit with AI-assisted development? Which of these seven categories matches your experience?
Connect on LinkedIn or comment on YouTube

Disclaimer: This content represents personal learning from building with AI on a personal project. It does not represent my employer’s views, technologies, or approaches. All code examples are generic patterns for educational purposes.