The Honest Assessment
After building a production SaaS platform with AI agents for several months, I've shipped over 2,500 lines of infrastructure code weekly, eliminated 4,700 lines of boilerplate through macros, and maintained a 99.2% test pass rate. AI absolutely excels at certain types of work. But there are clear, consistent boundaries where AI fails. Not "struggles" or "needs help" - it actively fails and makes things worse. Understanding these boundaries isn't AI criticism; it's boundary exploration that makes you better at using AI effectively. The insight: AI's failures cluster into distinct categories. Learn to recognize them, and you'll know when to step in before AI burns 24 hours on a problem you could solve in 90 minutes.

The Seven Failure Categories
1. Novel Problems (No Pattern to Match)
What fails: Problems AI has never seen in training data.

Real example: Designing a scope-based AWS client factory that enforces multi-tenant isolation boundaries through the type system.

The challenge (Week 6): Build an AWS client factory that prevents cross-tenant data access at compile time. Requirements:
- 4 operational scopes (platform, tenant, capsule, operator)
- Automatic table name prefixing per scope
- Type-safe enforcement of isolation
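The article doesn't include the factory's code, but the type-level idea can be sketched with phantom types. Everything below (the `Scope` trait, the prefixes, the `ScopedClient` type) is illustrative, not the actual implementation:

```rust
use std::marker::PhantomData;

// Marker types for the four operational scopes (names are illustrative).
pub struct Platform;
pub struct Tenant;
pub struct Capsule;
pub struct Operator;

// Each scope contributes its own table-name prefix.
pub trait Scope {
    const PREFIX: &'static str;
}
impl Scope for Platform { const PREFIX: &'static str = "platform"; }
impl Scope for Tenant { const PREFIX: &'static str = "tenant"; }
impl Scope for Capsule { const PREFIX: &'static str = "capsule"; }
impl Scope for Operator { const PREFIX: &'static str = "operator"; }

// A client is parameterized by its scope. A `ScopedClient<Tenant>` has no
// way to emit a platform-prefixed table name, so crossing an isolation
// boundary becomes a type error, caught at compile time.
pub struct ScopedClient<S: Scope> {
    scope_id: String,
    _scope: PhantomData<S>,
}

impl<S: Scope> ScopedClient<S> {
    pub fn new(scope_id: &str) -> Self {
        Self { scope_id: scope_id.to_string(), _scope: PhantomData }
    }

    // Automatic table-name prefixing per scope.
    pub fn table_name(&self, base: &str) -> String {
        format!("{}-{}-{}", S::PREFIX, self.scope_id, base)
    }
}
```

Handing a `ScopedClient<Tenant>` to code that expects a `ScopedClient<Platform>` simply doesn't compile, which is the enforcement property the requirements ask for.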
2. Framework Quirks (Middleware Ordering, Lifecycle)
What fails: Framework-specific execution order, lifecycle hooks, initialization sequences.

Real example: Week 7's middleware ordering bug.

Why AI fails: Training data shows middleware registration, not execution order semantics. Framework docs don't always make this explicit.

When to intervene: Any framework feature with hidden ordering dependencies (middleware, hooks, lifecycle).

3. Performance Optimization (Needs Profiling Data)
What fails: Choosing what to optimize without actual performance measurements.

Real example: Week 6's credential caching. AI built an AWS client factory with perfect architecture:
- Type-safe scope enforcement
- Automatic table prefixing
- Clean API design

The performance issue hid behind that clean design: nothing in the requirements said the credential path was hot. The caching layer was a human addition, made after watching the system run.
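The caching fix itself isn't shown in the article; as a hedged sketch, a TTL cache in front of an expensive credential fetch might look like this (the closure stands in for the real STS assume-role call, and all names are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Cached credentials with a time-to-live. Real STS credentials carry
// their own expiration; here we just model it as a fixed TTL.
pub struct CredentialCache {
    ttl: Duration,
    entries: HashMap<String, (String, Instant)>,
    pub fetches: u32, // counts how often the slow path actually runs
}

impl CredentialCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new(), fetches: 0 }
    }

    // Return cached credentials if still fresh; otherwise run `fetch`
    // (the stand-in for the assume-role call) and cache its result.
    pub fn get(&mut self, role: &str, fetch: impl Fn(&str) -> String) -> String {
        if let Some((creds, at)) = self.entries.get(role) {
            if at.elapsed() < self.ttl {
                return creds.clone();
            }
        }
        self.fetches += 1;
        let creds = fetch(role);
        self.entries.insert(role.to_string(), (creds.clone(), Instant::now()));
        creds
    }
}
```

The design point is the `fetches` counter: profiling tells you whether the slow path runs once per minute or once per request, and that measurement, not the architecture, is what AI lacked.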
4. Security Hardening (Requires Threat Modeling)
What fails: Identifying attack vectors, validating security boundaries, preventing subtle bypasses.

Real example: Multi-tenant isolation validation (implied from Week 6 work).

What AI builds: Working isolation logic per requirements.

What AI misses:
- Cross-tenant queries via timestamp-based GSI (leaked data)
- Missing tenant validation in batch operations
- Race conditions in tenant-scoped locks
- Token substitution attacks (swap tenant_id in JWT)
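The batch-operations gap is worth making concrete. A minimal sketch of the validation a human would add, with hypothetical types, is:

```rust
// Each item in a batch write carries the tenant it targets.
pub struct BatchItem {
    pub tenant_id: String,
    pub payload: String,
}

// Reject the whole batch if ANY item targets a different tenant.
// Validating only the first item (or none) is exactly the kind of gap
// that lets a cross-tenant write through.
pub fn validate_batch(auth_tenant: &str, items: &[BatchItem]) -> Result<(), String> {
    for (i, item) in items.iter().enumerate() {
        if item.tenant_id != auth_tenant {
            return Err(format!(
                "item {i} targets tenant '{}', but caller is '{auth_tenant}'",
                item.tenant_id
            ));
        }
    }
    Ok(())
}
```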
5. Operational Concerns (Caching, Monitoring, Scaling)
What fails: Production deployment concerns that don't appear in development. Real examples from the journey:

Caching Strategy

AI missed: DynamoDB 400KB item size limit
Impact: Event sourcing worked in dev (small events), failed in prod (aggregated events > 400KB)
Human fix: Added event compression + splitting logic
Circuit Breakers
AI missed: DynamoDB throttling protection
Impact: Bulk operations exhausted write capacity
Human fix: Added exponential backoff + batch size limits
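The article doesn't show that fix; a minimal sketch of its two pieces, capped exponential backoff and batch-size limiting (DynamoDB's BatchWriteItem accepts at most 25 items per request), might look like this, with illustrative numbers:

```rust
use std::time::Duration;

// Delay before retry `attempt` (0-based): double each time, cap at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(max) // overflow means we're past the cap anyway
        .min(max)
}

// Split a bulk write of `total` items into batches no larger than `limit`.
pub fn batch_sizes(total: usize, limit: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut remaining = total;
    while remaining > 0 {
        let n = remaining.min(limit);
        out.push(n);
        remaining -= n;
    }
    out
}
```

In production you would also add jitter to the delay so throttled clients don't retry in lockstep; that's omitted here to keep the sketch deterministic.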
Observability
AI missed: Structured logging with correlation IDs
Impact: Couldn't trace requests across services
Human fix: Added tracing middleware with request_id propagation
Graceful Degradation
AI missed: Fallback when config service unavailable
Impact: All requests failed if config DynamoDB down
Human fix: Added platform-level defaults + circuit breaker
6. Business Logic (Domain-Specific Rules)
What fails: Subtle business rules that aren't explicitly documented.

Example: Billing calculation edge cases.

What AI implements: Straightforward billing from requirements:
- Charge per API call
- Monthly aggregation
- Pro-rated refunds

What AI misses, the undocumented edge cases:
- Don't charge for failed requests (500 errors)
- Don’t charge during maintenance windows
- Don’t charge for health checks
- Cap monthly charge at contract limit
- Handle timezone boundaries for “monthly”
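A hedged sketch of billing with those missed rules folded in. Pro-rating and timezone-aware month boundaries are omitted (they need a real date/time library), and all field names and numbers are illustrative:

```rust
// One API call as the billing logic sees it (hypothetical shape).
pub struct ApiCall {
    pub status: u16,
    pub is_health_check: bool,
    pub during_maintenance: bool,
}

// Monthly charge in cents: filter out the never-billable calls, then
// apply the contract cap.
pub fn monthly_charge(calls: &[ApiCall], price_per_call: u64, contract_cap: u64) -> u64 {
    let billable = calls
        .iter()
        .filter(|c| {
            c.status < 500                // don't charge for server errors
                && !c.is_health_check     // don't charge for health checks
                && !c.during_maintenance  // don't charge in maintenance windows
        })
        .count() as u64;
    (billable * price_per_call).min(contract_cap) // cap at contract limit
}
```

Each `filter` clause is one of the edge cases above; none of them appears in a requirements doc that just says "charge per API call."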
7. Edge Cases (Requires Deep Context)
What fails: Rare but critical scenarios that break assumptions.

Real example from Week 5: The cascading errors.

Change: Updated macro to generate CRUD methods automatically
Expected impact: Update 4 crates, maybe 20 call sites
Actual impact: 214 compilation errors across 30 files
AI's approach: Fix one error at a time
Result: 30 commits over 24 hours, still had errors
Real Examples: What AI Did vs. What AI Missed
Week 6: AWS Client Factory

What AI wrote successfully:

Event sourcing pattern:
- Complete implementation of event store
- Aggregate root pattern
- Event versioning
- Idempotency keys
- 2,100 lines of working code

AWS client factory:
- Type-safe scope enforcement
- Platform/Tenant/Capsule/Operator clients
- Automatic table name prefixing
- 2,364 lines with 39 tests

Configuration resolution:
- Hierarchical resolution (Platform → Tenant → Capsule)
- Automatic injection via middleware
- REST API with preview endpoints
- 5,541 lines with 28 tests

What AI missed, and humans added, is the thread running through the categories above: credential caching, the DynamoDB 400KB item limit, throttling protection, and structured observability.
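Hierarchical resolution is a small idea at its core. A minimal sketch, with the real service's types replaced by plain `HashMap` layers (the function name and layer ordering are assumptions, not the actual API):

```rust
use std::collections::HashMap;

// Resolve a config key through Platform -> Tenant -> Capsule layers:
// the most specific layer that defines the key wins.
// `layers` is ordered most-specific first, e.g. [capsule, tenant, platform].
pub fn resolve(key: &str, layers: &[&HashMap<String, String>]) -> Option<String> {
    layers.iter().find_map(|layer| layer.get(key).cloned())
}
```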
Why These Boundaries Exist
1. Training Data Limitations
AI learns from public code repositories. What's missing:
- Production deployment configs: Not in repos (secrets, scaling params)
- Incident post-mortems: Private docs, not public code
- Performance profiling results: Runtime data, not source code
- Security threat models: Confidential, not open-source
- Business domain knowledge: In people’s heads, not documentation
2. Context Window Limits
Even with a 200K token context, here's what fits:
- Single crate implementation
- Related test files
- Architecture docs

And what doesn't:
- Entire workspace (9 crates, 180 files)
- Cross-crate dependency chain
- Historical evolution (why code changed)
3. Operational Knowledge Gap
AI knows code patterns but not production behavior:
- AWS STS assume-role latency
- DynamoDB 400KB item size limit
- EventBridge PutEvents throttling
- CORS preflight optimization
- HTTP/2 connection pooling
What This Means for Workflows
Where Humans Add Value
Design Review (Before Implementation)
Human role: Review AI's proposed architecture for gaps. Questions to ask:
- Performance: What needs caching?
- Security: What are the attack vectors?
- Operational: How does this fail? How do we debug it?
- Limits: What happens at scale?
Framework Knowledge (During Implementation)
Human role: Catch framework-specific quirks AI misses. Watch for:
- Middleware execution order
- Lifecycle hook timing
- Dependency injection scope
- Transaction boundaries
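Middleware execution order is the easiest of these to make concrete. The toy chain below, not tied to any real framework, records which layer actually runs first: the outermost wrapper, regardless of the order the registration lines read in, which is exactly the semantics frameworks tend to leave implicit:

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Handler = Box<dyn Fn(&str) -> String>;

// Wrap `next` so that `name` is recorded as the request passes through
// on the way in.
fn middleware(name: &'static str, log: Rc<RefCell<Vec<&'static str>>>, next: Handler) -> Handler {
    Box::new(move |req| {
        log.borrow_mut().push(name);
        next(req)
    })
}

// Build a two-layer chain and return the order the layers ran in.
pub fn execution_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let inner: Handler = Box::new(|req| format!("handled {req}"));

    // "auth" wraps "tenant", which wraps the handler. The OUTERMOST
    // layer sees the request first, whatever order you wrote the lines.
    let chain = middleware("auth", Rc::clone(&log), middleware("tenant", Rc::clone(&log), inner));
    chain("GET /");

    let order = log.borrow().clone();
    order
}
```

Whether "registered first" means "outermost" or "innermost" varies by framework, which is why this is a category where a human who knows the specific framework has to check.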
Production Hardening (After Implementation)
Human role: Add operational concerns before deployment. Checklist:
- Monitoring: Metrics, logs, traces
- Error handling: Retries, circuit breakers
- Performance: Caching, connection pooling
- Limits: Rate limiting, batch sizes
Incident Response (When Things Break)
Human role: Debug production issues with full context. Why AI struggles:
- Needs correlation across logs, metrics, traces
- Requires understanding of deployed versions
- Must reason about race conditions, timing
The Decision Tree: When to Use AI vs. Intervene
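The tree itself doesn't survive in this text version. Grounded in the seven categories above, a rough, hypothetical reduction to code might be:

```rust
// Which failure category (if any) a task falls into, per the seven
// categories above. Names are illustrative.
pub enum Task {
    KnownPattern,       // boilerplate, CRUD, standard infrastructure
    NovelDesign,        // no pattern in training data
    PerformanceTuning,  // needs profiling data
    SecurityBoundary,   // needs threat modeling
    OperationalConcern, // caching, monitoring, scaling
    BusinessRule,       // undocumented domain knowledge
    RareEdgeCase,       // needs global context
}

// Hypothetical triage: AI leads only on pattern-matching work; every
// one of the seven boundary categories starts with a human.
pub fn who_leads(task: &Task) -> &'static str {
    match task {
        Task::KnownPattern => "AI implements, human reviews",
        _ => "human designs and decides, AI assists with mechanics",
    }
}
```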
Principles for Working Within Boundaries
Design Before Build
Lesson from Week 6:
ADR-driven design worked. Week 5's reactive fixing failed.

Practice:
- Document constraints (ADR, requirements)
- Let AI propose architecture
- Human reviews for gaps (caching, security)
- Implement with confidence
Atomic Changes
Lesson from Week 5’s cascade:
30 commits of partial fixes created more errors.

Practice:
- Migrate one component completely
- Test thoroughly
- Then migrate next component
- Never commit broken intermediate states
Add Operational Layer
Lesson from production:
AI builds clean patterns, misses caching/monitoring.

Practice:
- Review AI implementation for external calls
- Add caching before deployment
- Add metrics, tracing, structured logging
- Add circuit breakers for downstream services
Monitor AI Progress
Lesson from Week 5:
Error count stopped decreasing = AI stuck.

Practice:
Track errors fixed per commit:
- Healthy: 5-10 errors per commit
- Warning: 2-4 errors per commit
- Critical: less than 2 errors per commit
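Those thresholds amount to a trivial three-way classifier; a sketch, with the messages as assumptions:

```rust
// Classify AI progress from errors fixed in the latest commit, using
// the thresholds above.
pub fn progress_health(errors_fixed_this_commit: u32) -> &'static str {
    match errors_fixed_this_commit {
        0..=1 => "critical: stop and intervene",
        2..=4 => "warning: watch closely",
        _ => "healthy",
    }
}
```

The value isn't the code, it's the discipline of actually recording the number per commit so the "critical" branch fires before commit 30, not after.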
The Meta-Insight
After several months of building with AI: AI isn't "almost there" on these seven categories. They're fundamental boundaries:
- Novel problems: Training data limitation
- Framework quirks: Hidden execution order
- Performance: No profiling data
- Security: Requires adversarial thinking
- Operational: Production experience needed
- Business logic: Domain knowledge gap
- Edge cases: Global context required
The workflow implication:
- ❌ "Let AI do everything, fix what breaks"
- ✅ "Use AI where it excels, human where it doesn't"
Actionable Takeaways
If you're building with AI:
- Document constraints before asking AI to design - ADRs, requirements docs, isolation rules. Let AI design within boundaries.
- Review for the seven gaps - Performance (add caching), security (threat model), operational (monitoring), framework (quirks), novel patterns (human designs first).
- Watch for cascade signals - Diminishing returns (errors per commit dropping), “partial fix” in commit messages, AI making same fix repeatedly.
- Add operational layer before shipping - Metrics, tracing, circuit breakers, caching. AI builds infrastructure, humans harden for production.
- Design prevents debugging - Week 6’s proactive design: 3 days, 0 bugs. Week 5’s reactive fixing: 24 hours, still broken. Design wins.
Discussion
Disclaimer: This content represents personal learning from building with AI on a personal project. It does not represent my employer's views, technologies, or approaches. All code examples are generic patterns for educational purposes.