Debugging Production Issues at 3 AM - A Survival Guide

When Shit Hits the Fan and You’re On-Call

The PagerDuty alert wakes you at 3 AM. “500 errors spiking.” Your heart races. Your mind is foggy. Production is down. Here’s how to not panic and actually fix it.

The Alert That Ruined My Sleep

3:17 AM. Phone buzzes.

CRITICAL: API error rate 45%
Errors: 2,847 in last 5 minutes
Users affected: ~500

I panicked. Started making changes without understanding the problem. Made it worse. Woke up my entire team at 3:30 AM.

What I should have done: Follow a process.

The Golden Rules of Production Debugging

Rule 1: Don’t Panic

Take 30 seconds to breathe. The system was working before. The problem has a cause. You can find it.

Rule 2: Don’t Change Things Blindly

Every change could make it worse, destroy evidence, or obscure the root cause. First understand, THEN fix.

Rule 3: Communicate Early

Post in Slack immediately, even if you don’t know the problem yet:

🚨 INCIDENT: API errors spiking (45%)
Status: Investigating
Impact: ~500 users affected
ETA: 15 min update

I'm on it. Will update every 15 minutes.
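If you want that first post to take one keystroke instead of fumbling with Slack at 3:17 AM, a minimal sketch like this helps, assuming your team uses a Slack incoming webhook (the SLACK_WEBHOOK_URL variable and the message wording here are placeholders):

import os
import requests

# Minimal sketch: post the incident update through a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder - point it at your team's alert channel.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def post_incident_update(status, impact, eta):
    message = (
        "🚨 INCIDENT: API errors spiking (45%)\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"ETA: {eta}"
    )
    # Incoming webhooks accept a simple JSON payload with a "text" field.
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    resp.raise_for_status()

post_incident_update("Investigating", "~500 users affected", "15 min update")

Keep something like this in your runbook so the wording is already decided before the adrenaline hits.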

Rule 4: Fix First, Root Cause Later

Stop the bleeding before investigating why it bled. Don’t spend an hour investigating while users can’t use your app.

The Debugging Process

Step 1: Assess the Damage (2 minutes)

Answer immediately:

  • What’s broken? Which endpoints? What errors?
  • How bad? Percentage of requests failing
  • When did it start? Check monitoring
  • What changed? Recent deployments, config, infrastructure
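Most of these answers come straight out of your logs or APM. As a rough sketch, assuming your API writes one JSON object per request with timestamp, status, and path fields (adjust to whatever your logger actually emits), a few lines of Python can answer "what, how bad, and since when" in seconds:

import json
from collections import Counter

def assess(log_path):
    total, errors, first_error = 0, Counter(), None
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)            # one JSON object per request
            total += 1
            if entry["status"] >= 500:
                errors[entry["path"]] += 1      # which endpoints are failing
                first_error = first_error or entry["timestamp"]
    failed = sum(errors.values())
    print(f"Error rate: {failed}/{total} ({100 * failed / max(total, 1):.1f}%)")
    print(f"First error at: {first_error}")
    print("Worst endpoints:", errors.most_common(5))

assess("/var/log/api/requests.log")             # path is an assumption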

Step 2: Quick Wins (5 minutes)

Try these first:

  • Roll back the recent deploy - Fixes ~60% of production issues
  • Restart the service - Fixes ~20% (memory leaks, connection exhaustion)
  • Scale up - Buys time while investigating
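What those look like in practice depends entirely on your stack. A sketch for a Kubernetes setup, assuming the API runs as a Deployment named api (substitute your own orchestrator, names, and deploy tooling):

import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def rollback():
    # Revert the Deployment to its previous revision.
    run("kubectl", "rollout", "undo", "deployment/api")

def restart():
    # Rolling restart - clears leaked memory and stale connections.
    run("kubectl", "rollout", "restart", "deployment/api")

def scale_up(replicas=10):
    # Buy headroom while you keep investigating.
    run("kubectl", "scale", "deployment/api", f"--replicas={replicas}")

# rollback()   # uncomment the one you need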

Step 3: Check the Usual Suspects (10 minutes)

  • External services (Stripe, AWS status pages)
  • Database (connections, slow queries, locks)
  • Memory/CPU/disk (check top, free -h, df -h)
  • Rate limiting (429 errors)
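For the database suspect, assuming PostgreSQL and psycopg2, a quick health check might look like this (DATABASE_URL is a placeholder for your primary's connection string):

import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])   # placeholder DSN
with conn.cursor() as cur:
    # How many connections are in use? Compare against max_connections.
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    print("Active connections:", cur.fetchone()[0])

    # Queries running longer than 30 seconds are prime suspects.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, left(query, 80)
        FROM pg_stat_activity
        WHERE state = 'active' AND now() - query_start > interval '30 seconds'
        ORDER BY runtime DESC
    """)
    for row in cur.fetchall():
        print("Slow query:", row)

    # Locks that were requested but never granted mean something is blocking.
    cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted")
    print("Waiting locks:", cur.fetchone()[0])
conn.close()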

Step 4: Dig Into Logs (15 minutes)

Look for stack traces, database errors, network timeouts, memory errors. Find patterns in when errors occur.
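A rough way to surface those patterns, assuming plain-text logs with ISO timestamps and ERROR lines (tweak the regexes and path to match your log format):

import re
from collections import Counter

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")       # minute precision
ERROR_TYPE = re.compile(r"\b(\w+(?:Error|Exception|Timeout))\b")

by_minute, by_type = Counter(), Counter()
with open("/var/log/api/app.log") as f:                            # path is an assumption
    for line in f:
        if "ERROR" not in line:
            continue
        if m := TIMESTAMP.match(line):
            by_minute[m.group(1)] += 1
        if m := ERROR_TYPE.search(line):
            by_type[m.group(1)] += 1

print("Errors per minute:", sorted(by_minute.items())[-10:])
print("Most common error types:", by_type.most_common(5))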

Step 5: Form a Hypothesis

Observations:
- Errors started at 3:15 AM
- All errors are database timeouts
- Recent deploy at 3:10 AM
- New code added N+1 query

Hypothesis:
New deployment introduced N+1 query causing database overload.
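For readers who haven't been bitten by one yet, here's what an N+1 query looks like next to its fix. The tables and columns are made up for illustration, and cur is any DB-API cursor (e.g. psycopg2); the shape of the problem is the point:

# BAD: one query for the orders, then one more query per order for its user.
def load_orders_n_plus_one(cur):
    cur.execute("SELECT id, user_id FROM orders WHERE created_at > now() - interval '1 hour'")
    for order_id, user_id in cur.fetchall():
        cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))   # N extra round trips
        yield order_id, cur.fetchone()[0]

# BETTER: a single JOIN - one round trip no matter how many orders there are.
def load_orders_joined(cur):
    cur.execute("""
        SELECT o.id, u.name
        FROM orders o JOIN users u ON u.id = o.user_id
        WHERE o.created_at > now() - interval '1 hour'
    """)
    yield from cur.fetchall()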

Step 6: Test and Fix

Roll back, wait 2 minutes, and check whether the errors stop. If yes, the hypothesis was correct. If no, try the next hypothesis.
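Rather than eyeballing the dashboard, you can let a small script watch the error rate for you. The /internal/error-rate endpoint below is hypothetical - point it at whatever your metrics or APM API actually exposes:

import time
import requests

def error_rate():
    # Hypothetical endpoint returning {"error_rate": 0.45} - swap in your APM/metrics API.
    resp = requests.get("https://api.example.com/internal/error-rate", timeout=5)
    return resp.json()["error_rate"]

def verify_fix(minutes=2, threshold=0.01):
    for _ in range(minutes * 6):                 # sample every 10 seconds
        print(f"error rate: {error_rate():.1%}")
        time.sleep(10)
    return error_rate() < threshold              # below 1% counts as recovered

if verify_fix():
    print("Errors stopped - hypothesis was correct.")
else:
    print("Still failing - try the next hypothesis.")

The same loop with a longer window covers the 30-minute watch in Step 8.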

Step 7: Verify Fix

Check monitoring. Test manually. Post update:

✅ RESOLVED: API errors fixed
Root cause: N+1 query in new deployment
Fix: Reverted deployment
Status: Monitoring for 30 minutes

Step 8: Monitor for 30 Minutes

Don’t go back to sleep immediately. Watch the dashboards to make sure the error rate stays at 0% (the polling sketch from Step 6 works here too).

Common Production Issues and Quick Fixes

Database Connection Pool Exhausted: Increase pool size temporarily, then find connection leaks
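If you're on SQLAlchemy, the temporary relief might be a pool bump like this (the connection URL and numbers are placeholders; the real fix is still finding the code path that never returns its connection):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app:secret@db.internal/app",   # placeholder URL
    pool_size=20,         # up from SQLAlchemy's default of 5
    max_overflow=20,      # extra connections allowed under burst load
    pool_timeout=10,      # fail fast instead of queueing forever
    pool_pre_ping=True,   # discard dead connections instead of handing them out
    pool_recycle=1800,    # recycle connections every 30 minutes
)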

Memory Leak: Restart service, add memory limit, profile later

Rate Limit Hit: Add retry with exponential backoff
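A minimal backoff sketch, assuming an HTTP API that answers 429 when you're over the limit:

import random
import time
import requests

def get_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Honour Retry-After if the server sends it; otherwise back off exponentially with jitter.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    resp.raise_for_status()   # still rate limited after all retries
    return resp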

Infinite Loop: Kill process, rollback deployment

Disk Space Full: Delete logs, increase disk size
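Before deleting anything blindly, see what's actually eating the disk. A cautious sketch that only touches rotated logs older than a week (the paths and cutoff are assumptions):

import os
import shutil
import time

total, used, free = shutil.disk_usage("/")
print(f"Disk: {used / total:.0%} used, {free // 2**30} GiB free")

cutoff = time.time() - 7 * 24 * 3600             # one week ago
for root, _dirs, files in os.walk("/var/log"):
    for name in files:
        path = os.path.join(root, name)
        # Only touch rotated or compressed logs, never the live ones.
        if name.endswith((".gz", ".1", ".old")) and os.path.getmtime(path) < cutoff:
            print("deleting", path, os.path.getsize(path) // 2**20, "MiB")
            os.remove(path)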

The Incident Checklist

□ Breathe (30 seconds)
□ Post incident alert in Slack
□ Check error tracking and APM
□ Identify when it started
□ Check recent deployments
□ Try rollback or restart
□ Check external services
□ Check database
□ Form hypothesis
□ Test fix
□ Verify error rate drops
□ Monitor for 30 minutes
□ Post resolution update
□ Schedule post-mortem

Post-Incident: The Post-Mortem

Write a post-mortem 24-48 hours after the incident, covering:

  • Summary and timeline
  • Root cause explanation
  • Impact assessment
  • What went well/wrong
  • Action items with owners and dates
  • Lessons learned

Prevention > Cure

Prevent 3 AM alerts:

  1. Monitoring: Add custom metrics for everything important (see the sketch after this list)
  2. Alerts: Set up smart alerts (error rate, response time, resources)
  3. Testing: Load test and performance test before deploying
  4. Gradual Rollouts: Deploy to 10% traffic first, watch for issues
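For point 1, a sketch of custom metrics using prometheus_client, assuming a Prometheus setup that scrapes your service (the metric and endpoint names are made up):

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("api_requests_total", "Total API requests", ["endpoint", "status"])
LATENCY = Histogram("api_request_seconds", "Request latency in seconds", ["endpoint"])

def handle_checkout(process_order):
    # Time the handler and count the outcome, labelled by endpoint and status.
    with LATENCY.labels(endpoint="/checkout").time():
        try:
            process_order()
            REQUESTS.labels(endpoint="/checkout", status="200").inc()
        except Exception:
            REQUESTS.labels(endpoint="/checkout", status="500").inc()
            raise

start_http_server(9100)   # expose /metrics for Prometheus to scrape

Once the metrics exist, alerting on error rate and latency (point 2) is mostly a matter of writing the queries.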

Staying Calm Under Pressure

  • Have a runbook with common issues and fixes
  • Practice breaking things in staging
  • Pair with experienced engineers during incidents
  • Take breaks to clear your head
  • Learn from every incident

Production incidents are stressful. But they’re also learning opportunities.

Every 3 AM alert makes you a better engineer. You learn the system. You learn debugging. You learn to stay calm under pressure.

Remember:

  • Don’t panic
  • Follow a process
  • Communicate
  • Fix first, investigate later

You’ve got this. Now go back to sleep.

(But keep your phone charged.)