
The Reality: I Use All Three

After building a production SaaS platform with AI assistance, I’ve formed strong opinions about which tool works best for what. This isn’t based on benchmarks or marketing claims - it’s based on real usage over months of infrastructure design, debugging sessions, and code reviews. The TLDR:
  • Claude Sonnet: Infrastructure design, architecture decisions, complex refactoring
  • GPT-4: Quick research, API documentation, alternative perspectives
  • GitHub Copilot: Autocomplete, boilerplate generation, quick fixes
But the interesting part is why these choices emerged, and when I broke my own rules.

The Comparison Table

| Criteria | Claude Sonnet | GPT-4 | Copilot |
|---|---|---|---|
| Multi-file context | Excellent (200K tokens) | Good (128K tokens) | Poor (limited context) |
| Architecture decisions | Best - understands trade-offs | Good - suggests alternatives | Not designed for this |
| System design | Excellent - holistic view | Good - pattern-focused | N/A |
| Planning documents | Excellent - structured output | Good - needs more direction | N/A |
| Cost per session | $2-5 | $3-7 | $0 (subscription) |
| My choice | ✅ Claude | Occasional second opinion | Never |
Why Claude wins: When I designed the API visibility architecture (tagging 127 routes with different visibility levels), Claude:
  • Understood the full codebase context
  • Proposed compile-time vs runtime trade-offs
  • Generated comprehensive migration plan
  • Created verification checklist
GPT-4 would have done okay, but Claude’s longer context window meant I didn’t need to re-explain the system architecture multiple times.
Real example:
Task: Design multi-tenant billing architecture
Claude session: 3 hours, $4.20 in API costs
Output: 35-page ADR with trade-offs, implementation plan, test strategy
Quality: Production-ready, deployed unchanged

Cost Breakdown: Real Monthly Spending

My usage (November 2024 - January 2025):

Claude Sonnet

Monthly cost: $180-220
Usage:
  • Infrastructure design: 10-15 sessions ($60-80)
  • Code review (automated): 40-60 PRs ($40-60)
  • Documentation: 8-12 docs ($30-40)
  • Debugging: 15-25 sessions ($30-50)
  • Refactoring: 3-5 large refactors ($20-40)
Average session: $3-5
Tokens per month: ~8-10M
ROI: Equivalent to 40-60 hours of work

GPT-4

Monthly cost: $20-40
Usage:
  • Quick research: 20-30 queries ($10-15)
  • Alternative perspectives: 5-10 sessions ($5-10)
  • Fact-checking Claude: 8-12 queries ($5-10)
  • Writing assistance: occasional ($0-5)
Average session: $1-2
Tokens per month: ~1-2M
ROI: Useful but not critical

GitHub Copilot

Monthly cost: $10 (flat)
Usage:
  • Autocomplete: Hundreds of suggestions/day
  • Boilerplate: 50-100 generations/day
  • Quick fixes: 10-20/day
Value: Massive
ROI: Best bang for the buck
The catch: Only useful for mechanical coding, not thinking.
Total AI spend: ~$210-270/month
Comparison to alternatives:
  • Junior developer salary: ~$5,000/month (20x more)
  • My time saved: ~60-80 hours/month
  • Hourly rate equivalence: $3-5/hour for AI work
Value assessment: Absurdly cheap compared to the alternatives.
The surprising part: I spend more on Claude than on GPT-4 because Claude’s longer context means fewer sessions. GPT-4 requires more back-and-forth to maintain context, which adds up.

When I Switched Tools

These are real examples where one tool failed and another succeeded.

Disaster: The 30-Commit Debugging Cascade

Task: Update a macro to generate CRUD methods (breaking change)
Tool choice: Claude Sonnet
What happened:
  1. Made macro change (1 commit)
  2. 214 compilation errors appeared
  3. Asked Claude to fix errors
  4. Claude fixed errors one at a time (30 commits, 24 hours)
  5. Still had 14 errors remaining
  6. Claude started hallucinating fixes
Error cascade example:
Commit 1: Fix method name in file A
→ New error in file B (calls renamed method)
Commit 2: Fix file B
→ New error in file C (type mismatch)
Commit 3: Fix file C
→ New errors in files D, E, F (dependency chain)
...
Commit 30: Still broken
Why Claude failed: It is optimized for fixing individual errors, not for understanding the systemic change pattern.
The switch: After 24 hours, I stopped Claude and fixed it manually.
Human approach:
# Step 1: Understand all breaking changes (30 min)
- save() → db_save()
- client field → client() method
- RepositoryError → EventStoreError

# Step 2: Batch fix (45 min)
rg -l '\.save\(' -t rust | xargs sd '\.save\(' '.db_save('
rg -l 'self\.client\b' -t rust | xargs sd 'self\.client\b' 'self.client()'
# Add error type conversions

# Step 3: Verify (15 min)
cargo check --workspace  # ✅ Clean
cargo test --workspace   # ✅ 142 tests passing
Total human time: 90 minutes
Total commits: 3
Final state: Clean, working
Lesson learned: For systematic refactoring with breaking changes, use Claude to plan and a human to execute. The decision tree I now use:
Breaking change needed:
├─ Use Claude: Design migration plan
├─ Use human: Execute batch fixes
└─ Use Claude: Verify result
Critical insight: AI excels at preventing problems (via planning) but struggles with fixing cascading problems (reactive debugging). After this disaster, I added a pre-refactoring checklist to my workflow.

Success: Multi-Agent Workflow Design

Task: Design an organization model for Evaluator/Builder/Verifier agents
Tool choice: Claude Opus (the expensive model)
What happened: Claude produced a 35-page organizational constitution with:
  • Agent personas (roles, responsibilities, boundaries)
  • Decision framework (Type 1-4 decisions)
  • Quality gates (when work can progress)
  • Conflict resolution (what happens when agents disagree)
Why Claude excelled: This is pure design work - no code execution, just systematic thinking about:
  • “What should each agent be responsible for?”
  • “How should they interact?”
  • “What are the failure modes?”
Cost: $6.40 for a 3-hour session
ROI: Transformed agent output quality. Agents with clear roles produced better results than agents with better prompts.
Why not GPT-4? I tried GPT-4 first. It gave good suggestions but lacked the systematic completeness Claude provided. GPT-4’s response felt like “here are some ideas” while Claude’s felt like “here’s a complete organizational system.”

Surprise: Copilot for Tests

Task: Write 21 E2E test scenarios for event flows
Tool choice: Started with Claude, switched to Copilot
What happened:
Claude approach (initial):
  • Prompt: “Generate E2E tests for subscription activation flow”
  • Output: Complete test suite (excellent)
  • Cost: $2.50
  • Problem: Needed 20 more test suites for other flows
Copilot approach (discovered):
  • Write first test manually
  • Let Copilot generate next tests based on pattern
  • Result: 10x faster for repetitive test generation
Real example:
// I write test 1:
#[tokio::test]
async fn test_subscription_activated_creates_bundle() {
    let svc = setup_test_service().await;
    let event = SubscriptionActivatedEvent { /* ... */ };
    svc.handle(event).await.unwrap();
    assert_bundle_created(&svc, "bundle_id").await;
}

// Copilot suggests test 2 (I just accept):
#[tokio::test]
async fn test_subscription_cancelled_deactivates_bundle() {
    let svc = setup_test_service().await;
    let event = SubscriptionCancelledEvent { /* ... */ };
    svc.handle(event).await.unwrap();
    assert_bundle_deactivated(&svc, "bundle_id").await;
}

// And test 3, 4, 5... all following same pattern
Speed comparison:
  • Claude: 21 tests via prompt = 10 minutes + $2.50
  • Copilot: Write 1 test, accept 20 suggestions = 5 minutes + $0
  • Winner: Copilot (for repetitive patterns)
When to use Claude instead: When tests require complex setup or novel patterns. Copilot only repeats patterns it sees.

Surprising Findings

Claude Is Better at Documentation Than Code

What I expected: AI would excel at code generation and struggle with documentation.
What I found: The opposite. Claude’s documentation is production-ready; its code needs review.
Why this surprised me: Code has strict correctness requirements (compiler, tests, runtime behavior). Documentation is “softer.” But in practice:
  • Good code requires creativity, domain knowledge, performance intuition
  • Good documentation requires thoroughness, consistency, completeness
AI is better at thoroughness than creativity.
Real example:
Task: Implement usage metering pipeline
Claude's code: 85% correct, needed 15% tweaks
Claude's documentation: 100% usable, zero changes needed

The code had bugs. The documentation was perfect.
Application: I now use Claude to write documentation while implementing features, not after. Documentation quality is higher when context is fresh.

GPT-4 Is Better at Explaining “Why Not”

What I expected: Claude and GPT-4 would give similar answers.
What I found: GPT-4 is better at explaining why an approach is wrong.
Example:
Me: "Should I use event sourcing for user authentication?"

Claude: "Here's how you could implement event sourcing for auth..."
[Proceeds to design event-sourced auth system]

GPT-4: "No. Authentication needs fast reads (every request) and event
sourcing optimizes for writes. Use traditional state-based auth with
audit logging if you need history."
Pattern: Claude defaults to “how to do what you asked.” GPT-4 more often says “you shouldn’t do that.”
Application: When I’m considering a new approach, I ask GPT-4 first as a sanity check. If GPT-4 says “bad idea,” I reconsider. If it says “reasonable,” I use Claude for the design.

Copilot Is Terrible at Error Handling

What I expected: Copilot would suggest proper error handling.
What I found: Copilot suggests .unwrap() everywhere.
Example:
// I write:
let config = load_config()

// Copilot suggests:
let config = load_config().unwrap();

// What I want:
let config = load_config()
    .context("Failed to load config")?;
Pattern: Copilot optimizes for “code that compiles,” not “code that handles errors gracefully.”
Application: I accept Copilot’s happy-path suggestions but always add error handling manually. Never trust Copilot with error paths.
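The snippet above uses `anyhow::Context`; the same idea can be sketched with only the standard library, using `map_err` to attach context before propagating. The error type and loader here are hypothetical stand-ins mirroring the names above:

```rust
use std::fmt;

// Hypothetical error type mirroring the load_config() snippet above.
#[derive(Debug)]
struct ConfigError(String);

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

// Hypothetical loader that can fail.
fn load_config() -> Result<String, ConfigError> {
    Err(ConfigError("missing file".into()))
}

// Instead of .unwrap(), attach context and propagate with `?`.
fn init() -> Result<String, String> {
    let config = load_config().map_err(|e| format!("Failed to load config: {e}"))?;
    Ok(config)
}

fn main() {
    // The failure surfaces as a descriptive error, not a panic.
    assert_eq!(init().unwrap_err(), "Failed to load config: missing file");
}
```

The point is the shape, not the library: the caller sees *why* the failure happened, and the happy path stays one line.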

Claude’s Context Window Is the Killer Feature

What I expected: Model intelligence would matter most.
What I found: Context window size dominates quality.
Why: With 200K tokens of context, I can give Claude:
  • Entire codebase structure (file tree)
  • 10-15 relevant files
  • ADR documents
  • Requirements
  • Previous conversation history
Result: Claude understands the problem holistically.
GPT-4 with 128K context: I need to summarize, split work across multiple sessions, and lose context along the way.
Example:
Task: Design API visibility filtering for SDK generation

Claude approach (200K context):
- Load entire API route definitions (127 routes)
- Load existing SDK generation code
- Load documentation on visibility requirements
- Design complete solution in one session

GPT-4 approach (128K context):
- Session 1: Understand requirements
- Session 2: Design approach (re-explain context)
- Session 3: Plan implementation (re-explain context again)

Claude: 1 session, 3 hours, $4
GPT-4: 3 sessions, 5 hours, $8

Winner: Claude (context is king)
The insight: I’d rather have a slightly less intelligent model with 2x context than a smarter model that forgets half my project.

Recommendation Matrix

Based on real usage, here’s what to use when:
Essential:
  • GitHub Copilot ($10/month) - Autocomplete alone is worth it
  • Claude Sonnet (pay-as-you-go) - Infrastructure design, code review
Optional:
  • ⚠️ GPT-4 (occasional) - Fact-checking, alternative perspectives
Monthly cost: $10 + $50-150 usage = $60-160
Usage pattern:
  • Copilot: Running all day (autocomplete)
  • Claude: 3-5 focused sessions per week (2-4 hours each)
  • GPT-4: 1-2 times per week (quick questions)
ROI: 30-50 hours saved per month
Avoid: Paying for both Claude and GPT-4 subscriptions. Use Claude pay-as-you-go for better cost control.

The Controversial Take: Why Not GPT-4?

GPT-4 is excellent. But in practice, I use Claude 90% of the time. Why:

1. Context Window Is King

Claude: 200K tokens
GPT-4: 128K tokens
Real impact:
Typical design session context:
- File tree (5K tokens)
- 10 relevant files (40K tokens)
- 3 ADRs (15K tokens)
- Requirements doc (10K tokens)
- Conversation history (30K tokens)
Total: 100K tokens

Claude: Fits comfortably
GPT-4: Hitting limits, need to summarize or split session
The productivity hit: Every time I have to re-explain context to GPT-4 is wasted time.

2. Claude’s Code Understanding Is Better

Subjective opinion based on hundreds of sessions: Claude seems to understand code structure better. When I paste a complex Rust trait hierarchy or event sourcing implementation, Claude “gets it” faster. Example:
// Complex trait hierarchy with associated types
trait Repository<E: Event> {
    type Aggregate: Aggregate<Event = E>;
    type Error: std::error::Error;

    async fn save(&self, aggregate: &Self::Aggregate) -> Result<(), Self::Error>;
}

// Claude understands this on first try
// GPT-4 sometimes confuses associated types with generics
Not scientific. Just my experience. Your mileage may vary.
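To make the associated-types point concrete, here is a simplified (synchronous, std-only, hypothetical) sketch of the distinction: with an associated type, each implementor fixes its types once, instead of every call site carrying a generic parameter:

```rust
// Associated types: the implementor pins down Aggregate and Error once,
// so call sites never spell out type parameters.
trait Repository {
    type Aggregate;
    type Error: std::fmt::Debug;

    fn save(&mut self, aggregate: Self::Aggregate) -> Result<(), Self::Error>;
}

// Hypothetical in-memory implementation for illustration.
struct InMemoryRepo {
    items: Vec<String>,
}

impl Repository for InMemoryRepo {
    type Aggregate = String; // fixed here, not at the call site
    type Error = String;

    fn save(&mut self, aggregate: String) -> Result<(), String> {
        self.items.push(aggregate);
        Ok(())
    }
}

fn main() {
    let mut repo = InMemoryRepo { items: Vec::new() };
    // No turbofish, no generic arguments: the types are part of the impl.
    repo.save("order-1".to_string()).unwrap();
    assert_eq!(repo.items, vec!["order-1"]);
}
```

A generic version (`trait Repository<A, E>`) would compile too, but it allows multiple impls per type and pushes type annotations onto callers, which is exactly the confusion described above.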

3. Cost Is Actually Similar

Per-token pricing:
  • Claude Sonnet: $3/M input, $15/M output
  • GPT-4: $2.50/M input, $10/M output
GPT-4 is cheaper per token… but I use fewer total sessions with Claude because its context window is larger. Real monthly costs:
  • Claude: $180-220 (fewer sessions, better context)
  • GPT-4 equivalent: $200-250 (more sessions, re-explaining context)
Effective cost: Similar. But Claude is more productive per session.
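As a sanity check on the per-session numbers, a quick sketch of the pricing arithmetic. The per-million-token prices come from the list above; the token counts for the example session are hypothetical:

```rust
// Per-million-token prices from the article (USD).
const CLAUDE_IN: f64 = 3.0;
const CLAUDE_OUT: f64 = 15.0;
const GPT4_IN: f64 = 2.5;
const GPT4_OUT: f64 = 10.0;

// Session cost given input/output token counts in millions.
fn session_cost(in_m: f64, out_m: f64, price_in: f64, price_out: f64) -> f64 {
    in_m * price_in + out_m * price_out
}

fn main() {
    // Hypothetical heavy design session: 1.0M input tokens, 0.1M output tokens.
    let claude = session_cost(1.0, 0.1, CLAUDE_IN, CLAUDE_OUT);
    let gpt4 = session_cost(1.0, 0.1, GPT4_IN, GPT4_OUT);
    println!("Claude: ${claude:.2}, GPT-4: ${gpt4:.2}");
    assert!((claude - 4.5).abs() < 1e-9); // inside the article's $3-5 average
    assert!((gpt4 - 3.5).abs() < 1e-9);
}
```

Note how input tokens dominate: a context-heavy workflow pays mostly for what you feed in, which is why re-explaining context to a smaller window erodes GPT-4's per-token advantage.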

4. When I Do Use GPT-4

Sanity checks: “Is this architecture approach reasonable, or am I overthinking it?” GPT-4 is better at saying “you’re overthinking, use the simple solution.”
Alternative perspectives: When Claude designs something, I sometimes ask GPT-4: “What are the downsides of this approach?” It gets me out of confirmation bias.
Quick research: “What’s the current best practice for rate limiting in 2025?” GPT-4 is fine for quick factual questions.
The honest summary:
  • If I could only choose one: Claude.
  • If I have budget for both: Claude primary, GPT-4 for sanity checks.
  • If I’m paying myself: Claude only, with GPT-4 occasionally via the ChatGPT free tier.

The Unconventional Workflow That Emerged

After months of experimentation, here’s the workflow that works for me:

Morning: Plan with Claude

8-10am: Architecture and planning sessions
Session type: Design
Tool: Claude Sonnet
Tasks:
- Review yesterday's progress
- Design today's features
- Create ADRs for major decisions
- Plan implementation approach

Cost: $2-4 per session
Output: Planning documents, architecture decisions
Why mornings: Design work requires clear thinking. Use AI to structure thoughts while mind is fresh.

Day: Build with Copilot

10am-5pm: Implementation
Tool: GitHub Copilot (running continuously)
Tasks:
- Write code with autocomplete
- Generate boilerplate
- Quick fixes

Cost: $0 (subscription)
Value: Massive time savings on mechanical coding
Pattern: Write the function signature, let Copilot fill in the obvious implementation, then review and tweak.
Acceptance rate: ~70% (accept Copilot’s suggestion with minor edits)

Evening: Review with Claude

5-6pm: Code review and verification
Session type: Review
Tool: Claude Sonnet
Tasks:
- Review today's PRs
- Verify test coverage
- Check for edge cases
- Generate verification report

Cost: $1-3 per session
Output: Review comments, test suggestions
Why evenings: Catch issues before they reach production. Claude’s systematic review finds things I missed.

Weekly: Reflect with Claude

Friday afternoon: Retrospective and documentation
Session type: Documentation + Planning
Tool: Claude Sonnet
Tasks:
- Document this week's decisions (ADRs)
- Update architecture diagrams
- Plan next week's work
- Generate learning summaries

Cost: $3-5
Output: Documentation, session notes, planning docs
The surprising value: This documentation makes it easy to pick up work after breaks. Context reconstruction time dropped from 2-3 hours to 15 minutes.

Final Recommendations

If I were starting fresh today:

Minimum Viable AI Stack

Budget: $40/month
- GitHub Copilot: $10/month
- Claude Sonnet: $30/month pay-as-you-go usage

ROI: 30-40 hours saved per month
Equivalent value: $1,200-1,600 (at $40/hour)
Actual cost: $40

Return: 30-40x
This is the setup I’d recommend to anyone starting with AI-assisted development.

My Current Stack

Budget: $210-270/month
- GitHub Copilot: $10/month
- Claude Sonnet: $180-220/month (heavy usage)
- GPT-4: $20-40/month (occasional)

ROI: 60-80 hours saved per month
Equivalent value: $2,400-3,200 (at $40/hour)
Actual cost: $210-270

Return: 9-15x
Worth it for professional development work. Pays for itself in first 2 days of each month.

The Controversial Opinion

You don’t need GPT-4. If budget is constrained:
  • Copilot + Claude covers 95% of needs
  • GPT-4 adds marginal value (alternative perspectives)
  • Better to spend budget on more Claude usage than splitting across both
Exception: If you want vendor diversity (avoid single-point dependency), keep GPT-4 as backup.

The Tools I Wish Existed

  1. Hybrid tool: Claude’s brain + Copilot’s speed. Imagine Copilot-style inline suggestions powered by Claude’s understanding.
  2. Context persistence across sessions. I rebuild context every session; I wish tools remembered previous conversations.
  3. Team collaboration features. Share Claude sessions with the team. Collaborative debugging. Shared context.
  4. Cost optimization tools. “This session will cost $8. Use a smaller model for $2?” Let me choose speed vs. cost.
  5. Learning analytics. “You asked Claude similar questions 3 times. Here’s a pattern you could document.”

Takeaways

After 6 months of heavy AI usage:

What worked:

  • Claude for design, Copilot for implementation, Claude for review
  • Pay-as-you-go beats subscriptions for cost control
  • Documentation as first-class output (not afterthought)
  • Multi-agent workflow (Evaluator/Builder/Verifier)
  • Systematic planning before reactive debugging

What failed:

  • Using AI for cascading error fixes (30-commit disaster)
  • Trusting AI output without verification
  • Assuming AI understands implicit requirements
  • Using AI for creative problem-solving (defaults to patterns)

What surprised me:

  • AI’s documentation is better than its code
  • Context window matters more than model intelligence
  • Copilot is terrible at error handling
  • GPT-4 is better at saying “don’t do that”
  • AI code review is more thorough than human review

The meta-lesson:

AI is a tool multiplier, not a skill replacement. I still need to:
  • Understand what I’m building (requirements)
  • Design the architecture (AI helps, doesn’t decide)
  • Review AI output (trust but verify)
  • Make final decisions (AI advises, human decides)
But with AI:
  • I build 3-5x faster
  • Documentation actually gets written
  • Code review is more thorough
  • Learning is accelerated (AI explains patterns)
Total impact: From solo developer → productive team equivalent.

The Question I Get Asked Most

“Should I learn to code with AI, or learn to code first?”
My answer: Learn to code first.
Why: AI accelerates you when you know what you’re doing. AI misleads you when you don’t.
Example:
// Copilot suggests:
let result: Vec<_> = data.iter().map(|x| x.unwrap()).collect();

// Beginner: "Looks good!" (compiles)
// Experienced: "This panics if any item is None. Bad suggestion."
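What the experienced reviewer would write instead can be sketched in two forms, depending on intent (the `data` values here are made up for illustration):

```rust
fn main() {
    let data = vec![Some(1), Some(2), None, Some(3)];

    // The Copilot-style version would panic on the None:
    // let result: Vec<_> = data.iter().map(|x| x.unwrap()).collect();

    // Option A: skip the Nones explicitly.
    let kept: Vec<i32> = data.iter().filter_map(|x| *x).collect();
    assert_eq!(kept, vec![1, 2, 3]);

    // Option B: fail the whole collection if any item is None.
    let all: Option<Vec<i32>> = data.iter().copied().collect();
    assert_eq!(all, None);
}
```

Both versions make the None case a deliberate decision instead of a runtime panic, which is exactly the judgment a beginner hasn't built yet.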
If you can’t spot bad suggestions, AI will lead you astray. The learning path I recommend:
  1. Months 0-6: Learn programming fundamentals (no AI)
    • Understand syntax, types, control flow
    • Build projects manually
    • Learn to debug without AI
  2. Months 6-12: Start using Copilot (autocomplete only)
    • Verify every suggestion
    • Understand why suggestions are right/wrong
    • Build intuition for good code
  3. Months 12+: Add Claude for design
    • Use for architecture discussions
    • Verify decisions make sense
    • Build with confidence
Skip straight to AI: You’ll write code you don’t understand. Bad foundation.
Learn with AI as an assistant: You’ll learn faster AND build better intuition.
The controversial take: AI makes senior developers more productive. It makes junior developers produce more code, but not necessarily better code. You need judgment to use AI effectively, and judgment comes from experience.