Debugging Production Issues at 3 AM - A Survival Guide

When Shit Hits the Fan and You’re On-Call

The PagerDuty alert wakes you at 3 AM. “500 errors spiking.” Your heart races. Your mind is foggy. Production is down. Here’s how to not panic and actually fix it.

The Alert That Ruined My Sleep

3:17 AM. Phone buzzes.

CRITICAL: API error rate 45%
Errors: 2,847 in last 5 minutes
Users affected: ~500

I panicked. Started making changes without understanding the problem. Made it worse. Woke up my entire team at 3:30 AM.

What I should have done: Follow a process.

The Golden Rules of Production Debugging

Rule 1: Don’t Panic

Take 30 seconds to breathe. The system was working before. The problem has a cause. You can find it.

Rule 2: Don’t Change Things Blindly

Every change could make it worse, destroy evidence, or obscure the root cause. First understand, THEN fix.

Rule 3: Communicate Early

Post in Slack immediately, even if you don’t know the problem yet:

🚨 INCIDENT: API errors spiking (45%)
Status: Investigating
Impact: ~500 users affected
ETA: 15 min update

I'm on it. Will update every 15 minutes.
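If you want that first post to take one keystroke instead of fumbling with Slack at 3:17 AM, a minimal sketch like this helps, assuming your team uses a Slack incoming webhook (the SLACK_WEBHOOK_URL variable and the message wording here are placeholders):

import os
import requests

# Minimal sketch: post the incident update through a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder - point it at your team's alert channel.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def post_incident_update(status, impact, eta):
    message = (
        "🚨 INCIDENT: API errors spiking (45%)\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"ETA: {eta}"
    )
    # Incoming webhooks accept a simple JSON payload with a "text" field.
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    resp.raise_for_status()

post_incident_update("Investigating", "~500 users affected", "15 min update")

Keep something like this in your runbook so the wording is already decided before the adrenaline hits.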

Rule 4: Fix First, Root Cause Later

Stop the bleeding before investigating why it bled. Don’t spend an hour investigating while users can’t use your app.

The Debugging Process

Step 1: Assess the Damage (2 minutes)

Answer immediately:

  • What’s broken? Which endpoints? What errors?
  • How bad? Percentage of requests failing
  • When did it start? Check monitoring
  • What changed? Recent deployments, config, infrastructure
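Most of these answers come straight out of your logs or APM. As a rough sketch, assuming your API writes one JSON object per request with timestamp, status, and path fields (adjust to whatever your logger actually emits), a few lines of Python can answer "what, how bad, and since when" in seconds:

import json
from collections import Counter

def assess(log_path):
    total, errors, first_error = 0, Counter(), None
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)            # one JSON object per request
            total += 1
            if entry["status"] >= 500:
                errors[entry["path"]] += 1      # which endpoints are failing
                first_error = first_error or entry["timestamp"]
    failed = sum(errors.values())
    print(f"Error rate: {failed}/{total} ({100 * failed / max(total, 1):.1f}%)")
    print(f"First error at: {first_error}")
    print("Worst endpoints:", errors.most_common(5))

assess("/var/log/api/requests.log")             # path is an assumption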

Step 2: Quick Wins (5 minutes)

Try these first:

  • Roll back the recent deploy - Fixes ~60% of production issues
  • Restart the service - Fixes ~20% (memory leaks, connection exhaustion)
  • Scale up - Buys time while investigating
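What those look like in practice depends entirely on your stack. A sketch for a Kubernetes setup, assuming the API runs as a Deployment named api (substitute your own orchestrator, names, and deploy tooling):

import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def rollback():
    # Revert the Deployment to its previous revision.
    run("kubectl", "rollout", "undo", "deployment/api")

def restart():
    # Rolling restart - clears leaked memory and stale connections.
    run("kubectl", "rollout", "restart", "deployment/api")

def scale_up(replicas=10):
    # Buy headroom while you keep investigating.
    run("kubectl", "scale", "deployment/api", f"--replicas={replicas}")

# rollback()   # uncomment the one you need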

Step 3: Check the Usual Suspects (10 minutes)

  • External services (Stripe, AWS status pages)
  • Database (connections, slow queries, locks)
  • Memory/CPU/disk (check top, free -h, df -h)
  • Rate limiting (429 errors)
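For the database suspect, assuming PostgreSQL and psycopg2, a quick health check might look like this (DATABASE_URL is a placeholder for your primary's connection string):

import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])   # placeholder DSN
with conn.cursor() as cur:
    # How many connections are in use? Compare against max_connections.
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    print("Active connections:", cur.fetchone()[0])

    # Queries running longer than 30 seconds are prime suspects.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, left(query, 80)
        FROM pg_stat_activity
        WHERE state = 'active' AND now() - query_start > interval '30 seconds'
        ORDER BY runtime DESC
    """)
    for row in cur.fetchall():
        print("Slow query:", row)

    # Locks that were requested but never granted mean something is blocking.
    cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted")
    print("Waiting locks:", cur.fetchone()[0])
conn.close()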

Step 4: Dig Into Logs (15 minutes)

Look for stack traces, database errors, network timeouts, memory errors. Find patterns in when errors occur.
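A rough way to surface those patterns, assuming plain-text logs with ISO timestamps and ERROR lines (tweak the regexes and path to match your log format):

import re
from collections import Counter

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")       # minute precision
ERROR_TYPE = re.compile(r"\b(\w+(?:Error|Exception|Timeout))\b")

by_minute, by_type = Counter(), Counter()
with open("/var/log/api/app.log") as f:                            # path is an assumption
    for line in f:
        if "ERROR" not in line:
            continue
        if m := TIMESTAMP.match(line):
            by_minute[m.group(1)] += 1
        if m := ERROR_TYPE.search(line):
            by_type[m.group(1)] += 1

print("Errors per minute:", sorted(by_minute.items())[-10:])
print("Most common error types:", by_type.most_common(5))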

Step 5: Form a Hypothesis

Observations:
- Errors started at 3:15 AM
- All errors are database timeouts
- Recent deploy at 3:10 AM
- New code added N+1 query

Hypothesis:
New deployment introduced N+1 query causing database overload.
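For readers who haven't been bitten by one yet, here's what an N+1 query looks like next to its fix. The tables and columns are made up for illustration, and cur is any DB-API cursor (e.g. psycopg2); the shape of the problem is the point:

# BAD: one query for the orders, then one more query per order for its user.
def load_orders_n_plus_one(cur):
    cur.execute("SELECT id, user_id FROM orders WHERE created_at > now() - interval '1 hour'")
    for order_id, user_id in cur.fetchall():
        cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))   # N extra round trips
        yield order_id, cur.fetchone()[0]

# BETTER: a single JOIN - one round trip no matter how many orders there are.
def load_orders_joined(cur):
    cur.execute("""
        SELECT o.id, u.name
        FROM orders o JOIN users u ON u.id = o.user_id
        WHERE o.created_at > now() - interval '1 hour'
    """)
    yield from cur.fetchall()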

Step 6: Test and Fix

Roll back, wait 2 minutes, and check whether the errors stop. If yes, the hypothesis was correct. If no, try the next hypothesis.
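Rather than eyeballing the dashboard, you can let a small script watch the error rate for you. The /internal/error-rate endpoint below is hypothetical - point it at whatever your metrics or APM API actually exposes:

import time
import requests

def error_rate():
    # Hypothetical endpoint returning {"error_rate": 0.45} - swap in your APM/metrics API.
    resp = requests.get("https://api.example.com/internal/error-rate", timeout=5)
    return resp.json()["error_rate"]

def verify_fix(minutes=2, threshold=0.01):
    for _ in range(minutes * 6):                 # sample every 10 seconds
        print(f"error rate: {error_rate():.1%}")
        time.sleep(10)
    return error_rate() < threshold              # below 1% counts as recovered

if verify_fix():
    print("Errors stopped - hypothesis was correct.")
else:
    print("Still failing - try the next hypothesis.")

The same loop with a longer window covers the 30-minute watch in Step 8.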

Step 7: Verify Fix

Check monitoring. Test manually. Post update:

✅ RESOLVED: API errors fixed
Root cause: N+1 query in new deployment
Fix: Reverted deployment
Status: Monitoring for 30 minutes

Step 8: Monitor for 30 Minutes

Don’t go back to sleep immediately. Watch the dashboards to make sure the error rate stays at 0% (the polling sketch from Step 6 works here too).

Common Production Issues and Quick Fixes

Database Connection Pool Exhausted: Increase pool size temporarily, then find connection leaks
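If you're on SQLAlchemy, the temporary relief might be a pool bump like this (the connection URL and numbers are placeholders; the real fix is still finding the code path that never returns its connection):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app:secret@db.internal/app",   # placeholder URL
    pool_size=20,         # up from SQLAlchemy's default of 5
    max_overflow=20,      # extra connections allowed under burst load
    pool_timeout=10,      # fail fast instead of queueing forever
    pool_pre_ping=True,   # discard dead connections instead of handing them out
    pool_recycle=1800,    # recycle connections every 30 minutes
)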

Memory Leak: Restart service, add memory limit, profile later

Rate Limit Hit: Add retry with exponential backoff
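A minimal backoff sketch, assuming an HTTP API that answers 429 when you're over the limit:

import random
import time
import requests

def get_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Honour Retry-After if the server sends it; otherwise back off exponentially with jitter.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    resp.raise_for_status()   # still rate limited after all retries
    return resp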

Infinite Loop: Kill process, rollback deployment

Disk Space Full: Delete logs, increase disk size
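Before deleting anything blindly, see what's actually eating the disk. A cautious sketch that only touches rotated logs older than a week (the paths and cutoff are assumptions):

import os
import shutil
import time

total, used, free = shutil.disk_usage("/")
print(f"Disk: {used / total:.0%} used, {free // 2**30} GiB free")

cutoff = time.time() - 7 * 24 * 3600             # one week ago
for root, _dirs, files in os.walk("/var/log"):
    for name in files:
        path = os.path.join(root, name)
        # Only touch rotated or compressed logs, never the live ones.
        if name.endswith((".gz", ".1", ".old")) and os.path.getmtime(path) < cutoff:
            print("deleting", path, os.path.getsize(path) // 2**20, "MiB")
            os.remove(path)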

The Incident Checklist

□ Breathe (30 seconds)
□ Post incident alert in Slack
□ Check error tracking and APM
□ Identify when it started
□ Check recent deployments
□ Try rollback or restart
□ Check external services
□ Check database
□ Form hypothesis
□ Test fix
□ Verify error rate drops
□ Monitor for 30 minutes
□ Post resolution update
□ Schedule post-mortem

Post-Incident: The Post-Mortem

Write a post-mortem 24-48 hours after the incident, covering:

  • Summary and timeline
  • Root cause explanation
  • Impact assessment
  • What went well/wrong
  • Action items with owners and dates
  • Lessons learned

Prevention > Cure

Prevent 3 AM alerts:

  1. Monitoring: Add custom metrics for everything important (see the sketch after this list)
  2. Alerts: Set up smart alerts (error rate, response time, resources)
  3. Testing: Load test and performance test before deploying
  4. Gradual Rollouts: Deploy to 10% traffic first, watch for issues
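For point 1, a sketch of custom metrics using prometheus_client, assuming a Prometheus setup that scrapes your service (the metric and endpoint names are made up):

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("api_requests_total", "Total API requests", ["endpoint", "status"])
LATENCY = Histogram("api_request_seconds", "Request latency in seconds", ["endpoint"])

def handle_checkout(process_order):
    # Time the handler and count the outcome, labelled by endpoint and status.
    with LATENCY.labels(endpoint="/checkout").time():
        try:
            process_order()
            REQUESTS.labels(endpoint="/checkout", status="200").inc()
        except Exception:
            REQUESTS.labels(endpoint="/checkout", status="500").inc()
            raise

start_http_server(9100)   # expose /metrics for Prometheus to scrape

Once the metrics exist, alerting on error rate and latency (point 2) is mostly a matter of writing the queries.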

Staying Calm Under Pressure

  • Have a runbook with common issues and fixes
  • Practice breaking things in staging
  • Pair with experienced engineers during incidents
  • Take breaks to clear your head
  • Learn from every incident

Production incidents are stressful. But they’re also learning opportunities.

Every 3 AM alert makes you a better engineer. You learn the system. You learn debugging. You learn to stay calm under pressure.

Remember:

  • Don’t panic
  • Follow a process
  • Communicate
  • Fix first, investigate later

You’ve got this. Now go back to sleep.

(But keep your phone charged.)