The alert came through at 2:17 AM: “API response times exceeding 5 seconds.”

I opened the dashboard. What I saw made my stomach drop:
Session queries: 500ms average (should be less than 50ms)
Contact lookups: Full table scans on a 100k record table
Webhook execution: 200ms per call (half of that just establishing connections)
Memory usage: Spiking to 50MB for simple queries
We had the classic multi-tenant SaaS performance problem: code that works fine in development falls apart at scale.

The platform was built on DynamoDB with event sourcing. Good architectural choices, but we’d made critical mistakes in how we queried data, managed connections, and structured our code.

Over the next month, we systematically addressed seven major bottlenecks. The results were dramatic: 10x to 1000x improvements across the board.

Here’s what we learned about performance optimization at scale.
```rust
// Load ALL sessions for the tenant, then filter in memory
pub async fn get_sessions_for_capsule(
    &self,
    tenant_id: &str,
    capsule_id: &str,
) -> Result<Vec<Session>> {
    // Query DynamoDB for ALL tenant sessions
    let all_sessions = self.repository
        .query_by_tenant(tenant_id)
        .await?;

    // Filter in memory to find capsule sessions
    let filtered: Vec<Session> = all_sessions
        .into_iter()
        .filter(|s| s.capsule_id == capsule_id)
        .collect();

    Ok(filtered)
}
```
This worked fine in development with 10 sessions per tenant. In production with 1,000 sessions per tenant, it was a disaster:
Latency: 500ms (loading 1,000 records, filtering to 50)
Memory: 50MB allocated for the full dataset
Cost: Reading 1,000 DynamoDB items when we needed 50
This antipattern is worth fixing when you see:
High-cardinality filters - Selecting a small subset from a large dataset
Repeated queries - The same filter pattern used frequently
Large result sets - The base query returns more than 100 records
Warning: DynamoDB filter expressions reduce data transfer but still consume read capacity for scanned items. For true O(1) lookups, you need a GSI (see Pattern #2).
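One way to keep the query-building logic testable is to separate the expression strings from the SDK call. The sketch below (a minimal, std-only illustration; the helper name, placeholder names, and attribute names are my own, and a real query would pass these into the DynamoDB SDK's query builder) composes the key condition and filter expression that push the capsule filter down to the database instead of filtering in application memory:

```rust
use std::collections::HashMap;

/// Build the pieces of a DynamoDB query that pushes the capsule filter
/// down to the database: a key condition on the partition key plus a
/// filter expression, with placeholder values supplied separately.
/// Returns (key_condition, filter_expression, attribute_values).
fn scoped_session_query(
    tenant_id: &str,
    capsule_id: &str,
) -> (String, String, HashMap<String, String>) {
    let key_condition = "tenant_id = :tid".to_string();
    let filter = "capsule_id = :cid".to_string();
    let mut values = HashMap::new();
    values.insert(":tid".to_string(), tenant_id.to_string());
    values.insert(":cid".to_string(), capsule_id.to_string());
    (key_condition, filter, values)
}

fn main() {
    let (key, filter, values) = scoped_session_query("tenant-1", "capsule-42");
    assert_eq!(key, "tenant_id = :tid");
    assert_eq!(filter, "capsule_id = :cid");
    assert_eq!(values[":cid"], "capsule-42");
}
```

Because the expressions are plain values, they can be unit-tested without touching AWS, and the same helper can back both the filter-expression variant and a later GSI-based version.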
This isn’t a runtime performance issue but a developer velocity bottleneck, and one that led to performance bugs. Every DynamoDB entity required 100-200 lines of boilerplate:
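Much of that boilerplate is mechanical key formatting, which a declarative macro can stamp out once and apply to every entity. This is a minimal sketch of the idea (the macro, entity, and key-prefix names here are invented for illustration, not the project's actual code):

```rust
/// Generate the repetitive partition/sort key builders that every
/// DynamoDB entity otherwise duplicates by hand.
macro_rules! dynamo_keys {
    ($entity:ident, $prefix:expr) => {
        impl $entity {
            /// Partition key: "TENANT#<tenant_id>"
            fn pk(tenant_id: &str) -> String {
                format!("TENANT#{tenant_id}")
            }
            /// Sort key: "<PREFIX>#<id>"
            fn sk(id: &str) -> String {
                format!("{}#{}", $prefix, id)
            }
        }
    };
}

struct Session;
struct Contact;

// One line per entity instead of 100-200 lines of hand-written keys.
dynamo_keys!(Session, "SESSION");
dynamo_keys!(Contact, "CONTACT");

fn main() {
    assert_eq!(Session::pk("t1"), "TENANT#t1");
    assert_eq!(Session::sk("s9"), "SESSION#s9");
    assert_eq!(Contact::sk("c3"), "CONTACT#c3");
}
```

The performance angle: once key construction lives in one place, fixing a query-scoping bug fixes it for every entity at once, instead of hunting down each hand-rolled copy.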
Total code changes: less than 500 lines
Total performance improvement: 10x-1000x across the board

The lesson: Look for the highest-leverage changes first. Don’t optimize everything; optimize the bottlenecks.
Caching isn’t always the answer. We tried caching session queries (Pattern #1). Result: minimal improvement.

Why? Sessions change frequently (every user action). The cache hit rate was only 12%, so we were paying cache overhead for little benefit.

Better solution: fix the root cause (query scoping) instead of papering over it with caching.

The right tool for the job:
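The 12% hit rate is the kind of thing worth measuring before committing to a cache. A minimal sketch (std-only, not thread-safe, names invented for illustration) of a TTL cache that tracks its own hit rate so you can decide whether it is earning its keep:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tiny TTL cache that counts hits and misses, so the decision to keep
/// caching a query can be based on its measured hit rate.
struct TtlCache {
    entries: HashMap<String, (String, Instant)>,
    ttl: Duration,
    hits: u64,
    misses: u64,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { entries: HashMap::new(), ttl, hits: 0, misses: 0 }
    }

    fn get(&mut self, key: &str) -> Option<String> {
        match self.entries.get(key) {
            // Fresh entry: count a hit and return it.
            Some((value, stored)) if stored.elapsed() < self.ttl => {
                self.hits += 1;
                Some(value.clone())
            }
            // Absent or expired: count a miss.
            _ => {
                self.misses += 1;
                None
            }
        }
    }

    fn put(&mut self, key: &str, value: &str) {
        self.entries
            .insert(key.to_string(), (value.to_string(), Instant::now()));
    }

    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}

fn main() {
    let mut cache = TtlCache::new(Duration::from_secs(300));
    assert!(cache.get("hook:42").is_none()); // miss
    cache.put("hook:42", "config");
    assert_eq!(cache.get("hook:42").as_deref(), Some("config")); // hit
    assert_eq!(cache.hit_rate(), 0.5);
}
```

With a counter like this in place, a low hit rate (like our 12% on session queries) shows up in metrics rather than being discovered months later.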
1. Measure first, optimize second

We wasted time optimizing the wrong queries. Our initial guess was “contact queries are the problem” (they were visible in logs). The real culprit was session queries (10x more frequent, but hidden in background jobs).

Lesson: Use profiling and monitoring to identify real bottlenecks, not gut feel.

2. Document the “why” in ADRs

Six months later, a new developer asked: “Why do we have a 5-minute cache TTL for hooks?”

No one remembered. Was it arbitrary? Load testing? A customer requirement?

Lesson: Write Architecture Decision Records (ADRs) explaining performance choices and trade-offs.
Create an Architecture Decision Record for significant optimizations:
```markdown
# ADR-015: Session Query Scoping with DynamoDB Filter Expressions

## Context

Session queries were taking 500ms and consuming 50MB memory per request
by loading all tenant sessions and filtering in memory.

## Decision

Use DynamoDB filter expressions to push filtering to the database level.

## Consequences

**Positive:**
- 10x latency improvement (500ms → 50ms)
- 90% memory reduction (50MB → 5MB)
- 20x fewer DynamoDB reads

**Negative:**
- Filter expressions still consume read capacity for scanned items
- For true O(1) lookups, need GSI (see ADR-016)

**Trade-offs:**
- Chose filter expressions over GSI due to lower complexity
- GSI would add storage cost and write amplification
- Current solution adequate for less than 1,000 sessions per tenant

## Metrics

- Benchmark: 485ms → 48ms
- Production P95: 520ms → 52ms
- Monthly cost savings: $240

## References

- Related: ADR-016 (Contact GSI indices)
- AWS docs: DynamoDB Filter Expressions
```
Measure before optimizing - Use profiling and monitoring to find real bottlenecks, not guesses
Start with database optimization - Query scoping and indices often give 10x-100x improvements for minimal code changes
Cache judiciously - Only cache data with high read-to-write ratios and acceptable staleness
Reuse connections - Enable connection pooling for HTTP clients and database connections (often just configuration)
Automate repetitive code - Macros and code generation reduce errors and make performance optimizations consistent
Document performance decisions - Write ADRs explaining why you chose specific optimizations and their trade-offs
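The connection-reuse point above was the fix for our webhook latency, where half of each 200ms call was connection setup. A minimal std-only sketch of the pattern, with a stand-in `HttpClient` type (in practice this would be a pooled client such as `reqwest::Client`; the construction counter exists only to make the reuse observable):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::OnceLock;

// Counts constructions so the test below can prove reuse.
static CONSTRUCTIONS: AtomicU32 = AtomicU32::new(0);

/// Stand-in for a real pooled HTTP client, which is expensive to build
/// because it owns the connection pool and TLS state.
struct HttpClient;

impl HttpClient {
    fn new() -> Self {
        CONSTRUCTIONS.fetch_add(1, Ordering::SeqCst);
        HttpClient
    }
}

/// One shared client per process instead of one per webhook call,
/// so established connections get reused across calls.
fn shared_client() -> &'static HttpClient {
    static CLIENT: OnceLock<HttpClient> = OnceLock::new();
    CLIENT.get_or_init(HttpClient::new)
}

fn main() {
    // A thousand "webhook calls" all share one client and its pool.
    for _ in 0..1000 {
        let _client = shared_client();
    }
    assert_eq!(CONSTRUCTIONS.load(Ordering::SeqCst), 1);
}
```

As the takeaway says, this is often just configuration: most HTTP clients pool connections by default, and the bug is constructing a fresh client per request rather than sharing one.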
Track the right metrics:
P95/P99 latency (not just averages)
Cache hit rates
Database query patterns
Cost per request
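The "P95/P99, not just averages" point is easy to demonstrate. A small nearest-rank percentile sketch (the helper and sample values are my own, for illustration) showing how a mean hides exactly the tail that pages you at 2 AM:

```rust
/// Nearest-rank percentile over a latency sample: sort, then take the
/// value at rank ceil(p/100 * n). Sketch only; mutates its input.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    // 98 fast requests plus two 5-second outliers: the mean looks
    // almost healthy, while the P99 exposes the problem.
    let mut latencies_ms: Vec<u64> = vec![50; 98];
    latencies_ms.extend([5000, 5000]);

    let mean = latencies_ms.iter().sum::<u64>() / latencies_ms.len() as u64;
    assert_eq!(mean, 149); // average smears the outliers away
    assert_eq!(percentile(&mut latencies_ms, 50.0), 50);
    assert_eq!(percentile(&mut latencies_ms, 99.0), 5000);
}
```

Production metrics systems compute this over histograms rather than raw samples, but the lesson is the same: report the tail, because that is what users with slow requests actually experience.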
Pro tip: The fastest query is the one you don’t make. Before adding caching or indices, ask: “Can we eliminate this query entirely?”

In our case, we eliminated 60% of hook config queries by passing configuration in the event payload instead of looking it up.
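The payload-embedding idea can be sketched in a few lines (struct and field names here are invented for illustration, not the platform's actual event schema): the publisher enriches the event with the hook config it already has, and the consumer only falls back to a lookup when the config is absent.

```rust
#[derive(Clone)]
struct HookConfig {
    url: String,
    timeout_ms: u64,
}

struct HookEvent {
    tenant_id: String,
    /// Embedded at publish time; `None` forces a fallback lookup.
    config: Option<HookConfig>,
}

/// Returns the config plus whether a database lookup was needed.
fn resolve_config(
    event: &HookEvent,
    lookup: impl Fn(&str) -> HookConfig,
) -> (HookConfig, bool) {
    match &event.config {
        Some(cfg) => (cfg.clone(), false), // no query at all
        None => (lookup(&event.tenant_id), true),
    }
}

fn main() {
    let lookup = |_tenant: &str| HookConfig {
        url: "https://example.com/hook".into(),
        timeout_ms: 500,
    };
    let enriched = HookEvent {
        tenant_id: "t1".into(),
        config: Some(lookup("t1")),
    };
    let (cfg, queried) = resolve_config(&enriched, lookup);
    assert!(!queried); // embedded config, query eliminated
    assert_eq!(cfg.timeout_ms, 500);
}
```

The trade-off is payload size and staleness: the embedded config is a snapshot from publish time, which is fine for short-lived events but worth flagging if configs change mid-flight.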
What performance optimizations have you implemented? What patterns worked (or didn’t)?

Connect on LinkedIn or comment on the YouTube Short
Disclaimer: This content represents my personal learning journey using AI for a personal project. It does not represent my employer’s views, technologies, or approaches.

All code examples are generic patterns or pseudocode for educational purposes. Performance numbers are from real implementations but have been sanitized and rounded for clarity.