
The 2 AM Wake-Up Call

The alert came through at 2:17 AM: “API response times exceeding 5 seconds.” I opened the dashboard. What I saw made my stomach drop:
  • Session queries: 500ms average (should be less than 50ms)
  • Contact lookups: Full table scans on a 100k record table
  • Webhook execution: 200ms per call (half of that just establishing connections)
  • Memory usage: Spiking to 50MB for simple queries
We had the classic multi-tenant SaaS performance problem: code that works fine in development falls apart at scale. The platform was built on DynamoDB with event sourcing. Good architectural choices, but we'd made critical mistakes in how we queried data, managed connections, and structured our code. Over the next month, we systematically addressed five major bottlenecks. The results were dramatic: 10x to 1000x improvements across the board. Here's what we learned about performance optimization at scale.

Pattern #1: Query Scoping - Filter at the Database, Not in Memory

The Problem

Our session query looked innocent enough:
// Load ALL sessions for the tenant, then filter in memory
pub async fn get_sessions_for_capsule(
    &self,
    tenant_id: &str,
    capsule_id: &str,
) -> Result<Vec<Session>> {
    // Query DynamoDB for ALL tenant sessions
    let all_sessions = self.repository
        .query_by_tenant(tenant_id)
        .await?;

    // Filter in memory to find capsule sessions
    let filtered: Vec<Session> = all_sessions
        .into_iter()
        .filter(|s| s.capsule_id == capsule_id)
        .collect();

    Ok(filtered)
}
This worked fine in development with 10 sessions per tenant. In production with 1,000 sessions per tenant, it was a disaster:
  • Latency: 500ms (loading 1,000 records, filtering to 50)
  • Memory: 50MB allocated for the full dataset
  • Cost: Reading 1,000 DynamoDB items when we needed 50

The Solution

Push the filtering down to the database level:
// Filter at the database level using DynamoDB expressions
pub async fn get_sessions_for_capsule(
    &self,
    tenant_id: &str,
    capsule_id: &str,
) -> Result<Vec<Session>> {
    // Query with filter expression - DynamoDB does the filtering
    let sessions = self.repository
        .query_by_tenant(tenant_id)
        .filter_expression("capsule_id = :capsule_id")
        .expression_values(hashmap! {  // hashmap! is from the maplit crate
            ":capsule_id" => AttributeValue::S(capsule_id.to_string())
        })
        .await?;

    Ok(sessions)
}

The Results

  • Latency: 500ms → 50ms (10x improvement)
  • Memory: 50MB → 5MB (90% reduction)
  • DynamoDB reads: 1,000 items → 50 items (20x fewer reads)
  • Monthly cost savings: $240 in reduced read capacity

When to Apply

Use database-level filtering when:
  1. High cardinality filters - Selecting small subset from large dataset
  2. Repeated queries - Same filter pattern used frequently
  3. Large result sets - Base query returns more than 100 records
Warning: DynamoDB filter expressions reduce data transfer but still consume read capacity for scanned items. For true O(1) lookups, you need a GSI (see Pattern #2).
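To make the warning concrete, here is a rough cost model (a hypothetical helper with simplified billing: eventually consistent reads cost 0.5 RCU per 4KB scanned). Note that the returned item count never appears in the formula: 1,000 scanned items cost the same whether the filter keeps 50 of them or all 1,000.

```rust
/// Rough model of DynamoDB read cost for a filtered query: capacity is
/// billed on the data *scanned* before the filter runs, not on what is
/// returned. Simplified (real billing rounds per 4KB chunk).
pub fn read_units_consumed(items_scanned: u32, avg_item_kb: f64) -> f64 {
    // Eventually consistent reads: 0.5 RCU per 4KB of data read
    (items_scanned as f64 * avg_item_kb / 4.0) * 0.5
}
```

Scanning 1,000 one-kilobyte items costs 125 RCUs in this model regardless of how many survive the filter, which is why Pattern #2 reaches for a GSI.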

Pattern #2: Strategic Indexing - Global Secondary Indices for Fast Lookups

The Problem

Contact queries were our second major bottleneck:
// Find contact by account_id - requires full table scan
pub async fn get_contact_by_account(
    &self,
    account_id: &str,
) -> Result<Option<Contact>> {
    let all_contacts = self.repository
        .scan()  // ⚠️ FULL TABLE SCAN
        .await?;

    all_contacts
        .into_iter()
        .find(|c| c.account_id == account_id)
        .ok_or(Error::NotFound)
}
With 100,000 contacts in the table, every lookup was scanning the entire table. Performance:
  • Best case: 800ms (table scan of 100k items)
  • Worst case: 2,000ms (when contact is at the end)
  • Complexity: O(n) - gets slower as data grows

The Solution

Add Global Secondary Indices (GSI) for common query patterns:
#[derive(DynamoDbEntity)]
#[dynamodb(table = "contacts")]
pub struct ContactEntity {
    #[dynamodb(partition_key)]
    pub id: String,

    #[dynamodb(sort_key)]
    pub tenant_id: String,

    // GSI5: PrimaryContactIndex for account lookups
    #[dynamodb(gsi5_partition_key)]
    pub account_id: String,

    // GSI6: ExecutiveIndex for role-based queries
    #[dynamodb(gsi6_partition_key)]
    pub role: String,

    #[dynamodb(gsi6_sort_key)]
    pub department: String,
}

// Generated query method (from derive macro)
pub async fn query_by_account(
    &self,
    account_id: &str,
) -> Result<Vec<Contact>> {
    self.repository
        .query_gsi5(account_id)  // Direct GSI lookup
        .await
}

The Results

  • Latency: 800ms → 15ms (53x improvement)
  • Complexity: O(n) → O(1) (constant time lookups)
  • Cost: Full table scan → Single partition read

Index Design Strategy

We created GSIs for our top 3 query patterns:

Query Pattern   | Index                       | Use Case
----------------|-----------------------------|------------------------
Account lookups | GSI5 (account_id)           | Find contact by account
Role queries    | GSI6 (role + department)    | Executive dashboards
Status filters  | GSI7 (status + created_at)  | Active contacts list

When to Apply

Create a GSI when:
  1. Non-key queries - Querying on attributes other than primary key
  2. High query frequency - Pattern used more than 100 times per day
  3. Large tables - Table has more than 10,000 items
Cost consideration: Each GSI consumes additional storage and write capacity. We limit to 3-4 GSIs per table to balance performance and cost.
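The write-side cost can be sketched with a back-of-the-envelope helper (illustrative only; real DynamoDB billing also depends on item size and the GSI's projection type): every GSI that projects an item adds roughly one extra write per base-table write.

```rust
/// Rough write amplification: base table write plus one write per GSI.
/// Illustrative; actual cost varies with item size and projection type.
pub fn monthly_write_units(writes_per_day: f64, gsi_count: u32) -> f64 {
    writes_per_day * (1.0 + gsi_count as f64) * 30.0
}
```

At 1,000 writes/day, three GSIs roughly quadruple write volume, which is the reasoning behind the 3-4 GSI ceiling above.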

Pattern #3: Smart Caching - In-Memory Cache with TTL

The Problem

Our webhook system needed to load hook configuration for every webhook execution:
// Every webhook call hits DynamoDB
pub async fn execute_webhook(
    &self,
    hook_id: &str,
    payload: &str,
) -> Result<()> {
    // Load hook config from DynamoDB - 100ms
    let hook = self.repository
        .get_hook(hook_id)
        .await?;

    // Execute webhook - 150ms
    self.http_client
        .post(&hook.url)
        .body(payload)
        .send()
        .await?;

    Ok(())
}
Problem: Hook configuration rarely changes (maybe once a week), but we were hitting DynamoDB on every execution. For a high-volume tenant:
  • 10,000 webhook calls per day
  • 10,000 DynamoDB reads per day
  • 100ms × 10,000 = 16 minutes of cumulative latency

The Solution

In-memory cache with 5-minute TTL:
use std::sync::Arc;
use tokio::sync::RwLock;
use std::time::{Duration, Instant};

pub struct CachedHookRepository {
    repository: Arc<DynamoDbRepository>,
    cache: Arc<RwLock<HashMap<String, CachedHook>>>,
    ttl: Duration,
}

struct CachedHook {
    hook: Hook,
    expires_at: Instant,
}

impl CachedHookRepository {
    pub fn new(repository: DynamoDbRepository) -> Self {
        Self {
            repository: Arc::new(repository),
            cache: Arc::new(RwLock::new(HashMap::new())),
            ttl: Duration::from_secs(300), // 5 minutes
        }
    }

    pub async fn get_hook(&self, hook_id: &str) -> Result<Hook> {
        // Check cache first
        {
            let cache = self.cache.read().await;
            if let Some(cached) = cache.get(hook_id) {
                if cached.expires_at > Instant::now() {
                    return Ok(cached.hook.clone()); // Cache hit: 0.1ms
                }
            }
        }

        // Cache miss or expired - fetch from DynamoDB
        let hook = self.repository.get_hook(hook_id).await?;

        // Update cache
        {
            let mut cache = self.cache.write().await;
            cache.insert(hook_id.to_string(), CachedHook {
                hook: hook.clone(),
                expires_at: Instant::now() + self.ttl,
            });
        }

        Ok(hook)
    }
}

The Results

  • Cache hit latency: 100ms → 0.1ms (1000x improvement)
  • DynamoDB reads: 10,000/day → 288/day (97% reduction)
  • Monthly savings: $180 in DynamoDB read costs
Cache hit rate: 96% (only refresh every 5 minutes)

Cache Design Decisions

Why 5-minute TTL?
  • Hook configs change infrequently (weekly at most)
  • 5 minutes is acceptable staleness for this use case
  • Shorter TTL (1 min) → 83% cache hit rate
  • Longer TTL (15 min) → 98% cache hit rate but unacceptable staleness
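These hit rates follow from a simple model: under a steady request stream, a TTL cache misses roughly once per TTL window and hits the rest of the time. A sketch (hypothetical helper; ignores traffic burstiness and key distribution):

```rust
/// Rough hit-rate model for a TTL cache under a steady request stream:
/// about one miss (refresh) per TTL window, everything else is a hit.
/// Estimation only; real hit rates also depend on traffic shape.
pub fn estimated_hit_rate(requests_per_day: f64, ttl_secs: f64) -> f64 {
    let misses_per_day = (86_400.0 / ttl_secs).min(requests_per_day);
    1.0 - misses_per_day / requests_per_day
}
```

For 10,000 calls/day and a 300-second TTL this predicts about 288 refreshes/day and a ~97% hit rate, close to the measured 96%.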
Why in-memory instead of Redis?
  • Single-instance application (not distributed)
  • Cache size less than 10MB (not memory-constrained)
  • No need for cross-instance consistency
  • Simpler architecture, zero infrastructure cost

When to Apply

Use in-memory caching when:
  1. Read-heavy workload - more than 10:1 read-to-write ratio
  2. Acceptable staleness - Data doesn’t need real-time consistency
  3. Small dataset - Cached data less than 100MB
  4. Single instance - Or use Redis for distributed cache
When NOT to cache:
  • User session data (security risk)
  • Financial transactions (consistency critical)
  • Real-time analytics (staleness unacceptable)

Pattern #4: Connection Pooling - Reuse HTTP Connections

The Problem

Our webhook execution was establishing a new HTTP connection for every call:
// Creating new HTTP client per request
pub async fn execute_webhook(
    &self,
    url: &str,
    payload: &str,
) -> Result<()> {
    // New connection + TLS handshake: ~50ms
    let client = reqwest::Client::new();

    let response = client
        .post(url)
        .body(payload)
        .send()  // Actual request: ~100ms
        .await?;

    Ok(())
}
Performance breakdown per call:
  • TLS handshake: 30-50ms
  • DNS lookup: 10-20ms
  • Actual HTTP request: 100ms
  • Total: 150ms (1/3 of time spent on connection setup)
With 10,000 webhook calls per day, that's roughly 8 minutes per day wasted on connection setup.
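The arithmetic behind that figure, as a throwaway helper (hypothetical name):

```rust
/// Daily time spent on connection setup: per-call overhead x call count.
/// With 50ms of TLS+DNS overhead and 10,000 calls/day, this comes out
/// to about 8.3 minutes per day.
pub fn daily_overhead_minutes(overhead_ms: f64, calls_per_day: f64) -> f64 {
    overhead_ms * calls_per_day / 1000.0 / 60.0
}
```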

The Solution

HTTP connection pool with keepalive:
use reqwest::Client;
use std::time::Duration;

pub struct WebhookExecutor {
    // Shared HTTP client with connection pool
    client: Client,
}

impl WebhookExecutor {
    pub fn new() -> Self {
        let client = Client::builder()
            .pool_max_idle_per_host(32)  // Keep 32 idle connections per host
            .pool_idle_timeout(Duration::from_secs(90))
            .timeout(Duration::from_secs(30))
            .build()
            .expect("Failed to build HTTP client");

        Self { client }
    }

    pub async fn execute_webhook(
        &self,
        url: &str,
        payload: &str,
    ) -> Result<()> {
        // Reuses connection from pool - no TLS handshake
        let response = self.client
            .post(url)
            .body(payload)
            .send()  // Only ~100ms (50ms saved)
            .await?;

        Ok(())
    }
}

The Results

  • Latency: 150ms → 100ms (33% improvement)
  • Connection overhead: 50ms → less than 1ms (on pool hit)
  • Daily connection-setup time: roughly 8 minutes → under a minute

Pool Configuration

Why 32 idle connections per host?
  • Typical webhook workload: 5-10 concurrent calls
  • 32 gives headroom for bursts
  • Idle timeout (90s) prevents stale connections
Pool hit rate: 92% (only 8% of calls need new connection)

When to Apply

Use connection pooling for:
  1. HTTP clients - Any service making frequent HTTP requests
  2. Database connections - Connection pools are critical
  3. External API calls - Especially with TLS overhead
Most HTTP libraries (like reqwest in Rust) have built-in pooling. Just enable it and configure appropriately.

Pattern #5: Code Generation - Eliminate Boilerplate with Macros

The Problem

This isn’t a runtime performance issue, but a developer velocity bottleneck that led to performance bugs. Every DynamoDB entity required 100-200 lines of boilerplate:
// Manual implementation - 150 lines per entity
pub struct ContactEntity {
    pub id: String,
    pub tenant_id: String,
    pub account_id: String,
    pub role: String,
}

impl ContactEntity {
    // Manual attribute conversion - error-prone
    pub fn from_item(item: HashMap<String, AttributeValue>) -> Result<Self> {
        Ok(Self {
            id: item.get("id")
                .and_then(|v| v.as_s().ok())
                .ok_or(Error::MissingField("id"))?
                .clone(),
            tenant_id: item.get("tenant_id")
                .and_then(|v| v.as_s().ok())
                .ok_or(Error::MissingField("tenant_id"))?
                .clone(),
            // ... 20 more fields
        })
    }

    pub fn to_item(&self) -> HashMap<String, AttributeValue> {
        let mut item = HashMap::new();
        item.insert("id".to_string(), AttributeValue::S(self.id.clone()));
        item.insert("tenant_id".to_string(), AttributeValue::S(self.tenant_id.clone()));
        // ... 20 more fields
        item
    }

    // Manual GSI query methods
    pub async fn query_gsi5(&self, account_id: &str) -> Result<Vec<Self>> {
        // 30 lines of query logic
    }
}
Problems:
  1. 150 lines × 7 entities = 1,050 lines of boilerplate
  2. Copy-paste errors (wrong field mappings)
  3. Missing GSI query methods (led to full table scans)
  4. No type safety (typos in field names caught at runtime)

The Solution

Derive macro for automatic code generation:
// Macro-driven implementation - 15 lines total
#[derive(DynamoDbEntity)]
#[dynamodb(table = "contacts")]
pub struct ContactEntity {
    #[dynamodb(partition_key)]
    pub id: String,

    #[dynamodb(sort_key)]
    pub tenant_id: String,

    #[dynamodb(gsi5_partition_key)]
    pub account_id: String,

    #[dynamodb(gsi6_partition_key)]
    pub role: String,
}

// Macro generates:
// - from_item() / to_item() methods
// - query_gsi5() / query_gsi6() methods
// - Type-safe field accessors
// - Compile-time validation

The Results

Code reduction:
  • Before: 150 lines per entity × 7 entities = 1,050 lines
  • After: 15 lines per entity × 7 entities = 105 lines
  • Reduction: 90% (945 lines eliminated)
Developer velocity:
  • Adding new entity: 30 minutes → 3 minutes (10x faster)
  • Zero copy-paste errors (compile-time validation)
  • All GSI queries auto-generated (no more accidental table scans)
Runtime performance impact:
  • Generated code is identical to hand-written (zero overhead)
  • Bonus: Caught 3 inefficient queries at compile time (missing GSI annotations)

When to Apply

Use code generation (macros, code gen tools) when:
  1. Repetitive patterns - Same structure across multiple entities
  2. Error-prone boilerplate - Manual code leads to bugs
  3. Compile-time validation - Type safety prevents runtime errors
Languages with macro support:
  • Rust: derive macros
  • Java: Annotation processors
  • Python: Decorators + code generation
  • TypeScript: Decorators + transformers

Lessons Learned

What Worked: The 80/20 Rule

Small changes, massive impact:
  1. Session query scoping (20 lines changed) → 10x improvement
  2. Hook caching (50 lines added) → 1000x improvement on cache hits
  3. Connection pooling (5 lines changed) → 33% latency reduction
Total code changes: less than 500 lines. Total performance improvement: 10x to 1000x across the board.

The lesson: Look for the highest-leverage changes first. Don't optimize everything—optimize the bottlenecks.

What Surprised Us

Caching isn't always the answer: We tried caching session queries (Pattern #1). Result: minimal improvement. Why? Sessions change frequently (every user action), so the cache hit rate was only 12% and we were paying cache overhead for little benefit. Better solution: fix the root cause (query scoping) instead of papering over it with caching.

The right tool for the job:
  • Frequently changing data → Query optimization
  • Rarely changing data → Caching
  • Repeated patterns → Code generation

What We’d Do Differently

1. Measure first, optimize second

We wasted time optimizing the wrong queries. Our initial guess was "contact queries are the problem" (they were visible in logs). The real culprit was session queries (10x more frequent but hidden in background jobs). Lesson: Use profiling and monitoring to identify real bottlenecks, not gut feel.

2. Document the "why" in ADRs

Six months later, a new developer asked: "Why do we have a 5-minute cache TTL for hooks?" No one remembered. Was it arbitrary? Load testing? A customer requirement? Lesson: Write Architecture Decision Records (ADRs) explaining performance choices and trade-offs.

Implementation Guide

Step 1: Identify Bottlenecks

Don’t guess. Measure. Tools we used:
  1. Application metrics - Track latency by operation
  2. Database slow query logs - Identify expensive queries
  3. Profiling - Find CPU/memory hotspots
  4. Distributed tracing - Track request flow across services
Key metrics to track:
  • P50, P95, P99 latency (not just average)
  • Database query counts and latency
  • Memory allocation per request
  • Cache hit rates
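For the latency percentiles, a minimal nearest-rank implementation is enough for offline analysis (a sketch; production telemetry usually uses HDR histograms or t-digest instead):

```rust
/// Nearest-rank percentile over raw latency samples (sorts in place).
/// Minimal sketch; fine for offline analysis of modest sample counts.
pub fn percentile(samples: &mut [f64], p: f64) -> f64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank: ceil(p/100 * n), clamped to at least the first sample
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.max(1) - 1]
}
```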

Step 2: Prioritize by Impact

Not all slow queries matter equally. We prioritized using:

Impact = Frequency × Latency × Business Value
Query              | Frequency  | Latency | Business Impact     | Priority
-------------------|------------|---------|---------------------|---------
Session by capsule | 10,000/day | 500ms   | High (API)          | 1
Contact by account | 5,000/day  | 800ms   | High (Dashboard)    | 2
Hook config load   | 10,000/day | 100ms   | Medium (Background) | 3
Admin reports      | 10/day     | 2,000ms | Low (Internal)      | 4
Fix high-frequency, high-latency queries first.
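The prioritization formula above can be sketched as a scoring function. The business weights here (High = 3, Medium = 2, Low = 1) are assumptions for illustration, not values from the original analysis:

```rust
/// Impact = frequency x latency x business weight. Higher = fix first.
/// Weights are illustrative: High = 3.0, Medium = 2.0, Low = 1.0.
pub fn impact_score(calls_per_day: f64, latency_ms: f64, weight: f64) -> f64 {
    calls_per_day * latency_ms * weight
}
```

Scored this way, session queries (10,000/day × 500ms × High) outrank contact lookups, which in turn dwarf admin reports (10/day × 2,000ms × Low), matching the table's priority order.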

Step 3: Optimize and Measure

Before optimizing:
  1. Write benchmark test
  2. Record baseline metrics
  3. Set target improvement (e.g., “reduce P95 to less than 100ms”)
After optimizing:
  1. Run benchmark again
  2. Verify improvement
  3. Deploy to staging
  4. Monitor production for regressions
Example benchmark:
// Nightly-only #[bench] (needs #![feature(test)]); criterion is the stable alternative
#[bench]
fn bench_session_query(b: &mut Bencher) {
    let repo = setup_test_repo();

    b.iter(|| {
        repo.get_sessions_for_capsule("tenant_1", "capsule_42")
    });
}

// Before: 485ms ± 25ms
// After:  48ms ± 3ms (10x improvement ✅)

Step 4: Document with ADRs

Create an Architecture Decision Record for significant optimizations:
# ADR-015: Session Query Scoping with DynamoDB Filter Expressions

## Context
Session queries were taking 500ms and consuming 50MB memory per request
by loading all tenant sessions and filtering in memory.

## Decision
Use DynamoDB filter expressions to push filtering to the database level.

## Consequences
**Positive:**
- 10x latency improvement (500ms → 50ms)
- 90% memory reduction (50MB → 5MB)
- 20x fewer DynamoDB reads

**Negative:**
- Filter expressions still consume read capacity for scanned items
- For true O(1) lookups, need GSI (see ADR-016)

**Trade-offs:**
- Chose filter expressions over GSI due to lower complexity
- GSI would add storage cost and write amplification
- Current solution adequate for less than 1000 sessions per tenant

## Metrics
- Benchmark: 485ms → 48ms
- Production P95: 520ms → 52ms
- Monthly cost savings: $240

## References
- Related: ADR-016 (Contact GSI indices)
- AWS docs: DynamoDB Filter Expressions

Results Summary

Here’s the complete impact across all optimizations:
Optimization       | Metric              | Before      | After    | Improvement
-------------------|---------------------|-------------|----------|--------------
Query Scoping      | Latency             | 500ms       | 50ms     | 10x
Query Scoping      | Memory              | 50MB        | 5MB      | 10x
Query Scoping      | DynamoDB reads      | 1,000 items | 50 items | 20x
Contact GSI        | Latency             | 800ms       | 15ms     | 53x
Contact GSI        | Complexity          | O(n)        | O(1)     | -
Hook Caching       | Cache hit latency   | 100ms       | 0.1ms    | 1000x
Hook Caching       | Daily DB reads      | 10,000      | 288      | 97% reduction
Connection Pooling | Latency             | 150ms       | 100ms    | 1.5x
Connection Pooling | Connection overhead | 50ms        | <1ms     | 50x
Code Generation    | Lines of code       | 1,050       | 105      | 90% reduction
Code Generation    | Dev time per entity | 30 min      | 3 min    | 10x
Cost Savings:
  • DynamoDB reads: $420/month saved
  • Developer time: 27 hours/month saved (new entities, maintenance)
Development Velocity:
  • Faster iteration (less boilerplate)
  • Fewer bugs (compile-time validation)
  • Better performance by default (generated GSI queries)

Actionable Takeaways

If you’re facing similar performance challenges:
  1. Measure before optimizing - Use profiling and monitoring to find real bottlenecks, not guesses
  2. Start with database optimization - Query scoping and indices often give 10x-100x improvements for minimal code changes
  3. Cache judiciously - Only cache data with high read-to-write ratios and acceptable staleness
  4. Reuse connections - Enable connection pooling for HTTP clients and database connections (often just configuration)
  5. Automate repetitive code - Macros and code generation reduce errors and make performance optimizations consistent
  6. Document performance decisions - Write ADRs explaining why you chose specific optimizations and their trade-offs
  7. Track the right metrics:
    • P95/P99 latency (not just averages)
    • Cache hit rates
    • Database query patterns
    • Cost per request
Pro tip: The fastest query is the one you don't make. Before adding caching or indices, ask: "Can we eliminate this query entirely?" In our case, we eliminated 60% of hook config queries by passing configuration in the event payload instead of looking it up.
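A sketch of that elimination with hypothetical types (the real event schema is not shown in this article): resolve the hook config once at publish time and carry it inside the event, so consumers skip the lookup entirely. The trade-off is larger payloads and a config snapshot that can go stale between publish and delivery.

```rust
/// Hypothetical event shape: carrying the hook config inline means the
/// consumer never performs a per-event DynamoDB lookup.
#[derive(Clone, Debug, PartialEq)]
pub struct HookConfig {
    pub url: String,
}

#[derive(Debug)]
pub struct WebhookEvent {
    pub payload: String,
    pub hook: HookConfig, // resolved once, at publish time
}

pub fn build_event(payload: &str, hook: &HookConfig) -> WebhookEvent {
    WebhookEvent {
        payload: payload.to_string(),
        hook: hook.clone(),
    }
}
```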


Discussion

Share Your Experience

What performance optimizations have you implemented? What patterns worked (or didn't)? Connect on LinkedIn or comment on the YouTube Short.

Disclaimer: This content represents my personal learning journey using AI for a personal project. It does not represent my employer's views, technologies, or approaches. All code examples are generic patterns or pseudocode for educational purposes. Performance numbers are from real implementations but have been sanitized and rounded for clarity.