The Alert That Ruined My Sleep
3:17 AM. Phone buzzes.
CRITICAL: API error rate 45%
Errors: 2,847 in last 5 minutes
Users affected: ~500
I panicked. Started making changes without understanding the problem. Made it worse. Woke up my entire team at 3:30 AM.
What I should have done: Follow a process.
The Golden Rules of Production Debugging
Rule 1: Don’t Panic
Take 30 seconds to breathe. The system was working before. The problem has a cause. You can find it.
Rule 2: Don’t Change Things Blindly
Every change could make it worse, destroy evidence, or obscure the root cause. Understand enough to pick a safe, reversible action first, THEN fix.
Rule 3: Communicate Early
Post in Slack immediately, even if you don’t know the problem yet:
🚨 INCIDENT: API errors spiking (45%)
Status: Investigating
Impact: ~500 users affected
Next update: in 15 min
I'm on it. Will update every 15 minutes.
Rule 4: Fix First, Root Cause Later
Stop the bleeding before investigating why it bled. Don’t spend an hour investigating while users can’t use your app.
The Debugging Process
Step 1: Assess the Damage (2 minutes)
Answer immediately:
- What’s broken? Which endpoints? What errors?
- How bad? Percentage of requests failing
- When did it start? Check monitoring
- What changed? Recent deployments, config, infrastructure
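If your request logs can be exported as records, the first three questions can be answered with a few lines of code. The following is a minimal sketch, not a real tool: the record shape `(timestamp, endpoint, status)`, the function name `assess_damage`, and the sample data are all assumptions made for illustration, and "broken" is assumed to mean an HTTP 5xx response.

```python
from collections import Counter
from datetime import datetime

def assess_damage(requests):
    """Summarize (timestamp, endpoint, status) records: failure rate per
    endpoint and the time of the earliest server error."""
    totals, failures = Counter(), Counter()
    first_failure = None
    for timestamp, endpoint, status in requests:
        totals[endpoint] += 1
        if status >= 500:  # assumption: 5xx counts as "broken"
            failures[endpoint] += 1
            if first_failure is None or timestamp < first_failure:
                first_failure = timestamp
    rates = {endpoint: failures[endpoint] / totals[endpoint] for endpoint in totals}
    return rates, first_failure

# Made-up sample data for illustration only
sample = [
    (datetime(2024, 1, 1, 3, 14), "/api/orders", 200),
    (datetime(2024, 1, 1, 3, 15), "/api/orders", 500),
    (datetime(2024, 1, 1, 3, 16), "/api/orders", 504),
    (datetime(2024, 1, 1, 3, 16), "/api/health", 200),
]
rates, started = assess_damage(sample)
for endpoint, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{endpoint}: {rate:.0%} failing")
print("First failure:", started)
```

Sorting by failure rate puts the worst endpoint at the top, which is usually the first thing worth posting in the incident channel.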
Step 2: Quick Wins (5 minutes)
Try these first:
- Rollback recent deploy: Fixes ~60% of production issues
- Restart service: Fixes ~20% (memory leaks, connection exhaustion)
- Scale up: Buys time while investigating
Step 3: Check the Usual Suspects (10 minutes)
- External services (Stripe, AWS status pages)
- Database (connections, slow queries, locks)
- Memory/CPU/disk (check top, free -m, df -h)
- Rate limiting (429 errors)
Step 4: Dig Into Logs (15 minutes)
Look for stack traces, database errors, network timeouts, memory errors. Find patterns in when errors occur.
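Finding patterns by eye is slow at 3 AM, so it can help to count errors per minute and per rough category. This is a sketch under assumptions: the log line format (ISO timestamp, then a level, then a message), the `CATEGORIES` keywords, and the function name `summarize_errors` are all hypothetical and would need adapting to your own logs.

```python
from collections import Counter

# Hypothetical keyword buckets; tune these to your own error messages
CATEGORIES = {
    "database": ("database", "sql", "deadlock"),
    "timeout": ("timeout", "timed out"),
    "memory": ("out of memory", "memoryerror"),
}

def summarize_errors(lines):
    """Count ERROR lines per minute and per keyword category."""
    by_minute, by_category = Counter(), Counter()
    for line in lines:
        if " ERROR " not in line:
            continue
        by_minute[line[:16]] += 1  # "YYYY-MM-DDTHH:MM" prefix of the timestamp
        lowered = line.lower()
        for category, keywords in CATEGORIES.items():
            if any(keyword in lowered for keyword in keywords):
                by_category[category] += 1
    return by_minute, by_category

# Made-up log lines for illustration only
log_lines = [
    "2024-01-01T03:15:02 ERROR database query timeout after 30s",
    "2024-01-01T03:15:40 ERROR database query timeout after 30s",
    "2024-01-01T03:16:05 ERROR connection timed out to payments",
    "2024-01-01T03:16:09 INFO request completed in 120ms",
]
by_minute, by_category = summarize_errors(log_lines)
print(by_minute.most_common())
print(by_category.most_common())
```

A line can land in more than one category on purpose: "database query timeout" is both a database and a timeout signal, and seeing the two counts move together is itself a pattern.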
Step 5: Form a Hypothesis
Observations:
- Errors started at 3:15 AM
- All errors are database timeouts
- Recent deploy at 3:10 AM
- New code added N+1 query
Hypothesis:
New deployment introduced N+1 query causing database overload.
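For readers who have not met an N+1 query before, here is a small self-contained illustration using Python's built-in `sqlite3` module. The schema, the data, and the `CountingDB` wrapper are invented for this sketch and are not the code from the incident; the point is only to show how one query per row multiplies database load compared with a single joined query.

```python
import sqlite3

class CountingDB:
    """Wraps a connection and counts the queries sent through run()."""
    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def run(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()

def make_db():
    # Hypothetical schema and rows, in memory, for illustration only
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users VALUES (1, 'ana'), (2, 'ben'), (3, 'cy');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """)
    return CountingDB(conn)

def totals_n_plus_one(db):
    # 1 query for the users, then 1 more query per user: N+1 in total
    totals = {}
    for user_id, name in db.run("SELECT id, name FROM users"):
        rows = db.run(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        )
        totals[name] = rows[0][0]
    return totals

def totals_single_query(db):
    # Same result from one joined, grouped query
    rows = db.run("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id, u.name
    """)
    return dict(rows)

slow, fast = make_db(), make_db()
assert totals_n_plus_one(slow) == totals_single_query(fast)
print("N+1 queries:", slow.count, "| single query:", fast.count)
```

With three users the loop version issues 4 queries and the joined version issues 1. At production scale the same shape turns one request into hundreds of queries, which is how a small code change can show up as database timeouts.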
Step 6: Test and Fix
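Roll back, wait 2 minutes, and check whether errors stop. If they do, the hypothesis is supported (a rollback reverts everything in that deploy, so confirm the specific cause later). If not, move on to the next hypothesis.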
Step 7: Verify Fix
Check monitoring. Test manually. Post update:
✅ RESOLVED: API errors fixed
Root cause: N+1 query in new deployment
Fix: Reverted deployment
Status: Monitoring for 30 minutes
Step 8: Monitor for 30 Minutes
Don’t go back to sleep immediately. Watch dashboards to ensure the error rate returns to its normal baseline and stays there.
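The "is it really fixed?" decision can be written down as a simple rule so that a tired brain does not have to judge it. The sketch below assumes you note the error rate at each check as `(minutes_since_fix, error_rate_pct)`; the function name `fix_is_holding` and the 1% threshold are placeholders, not recommendations, so substitute your own baseline.

```python
def fix_is_holding(samples, threshold_pct=1.0, required_minutes=30):
    """Return True only if every check since the fix stayed at or below the
    threshold AND the checks cover the full watch window.

    samples: list of (minutes_since_fix, error_rate_pct), one per check.
    threshold_pct: assumed acceptable error rate; set this to your baseline.
    """
    if not samples:
        return False
    observed_minutes = max(minute for minute, _ in samples)
    all_below = all(rate <= threshold_pct for _, rate in samples)
    return all_below and observed_minutes >= required_minutes

# Made-up readings for illustration only
quiet_night = [(5, 0.4), (10, 0.2), (20, 0.1), (30, 0.1)]
relapse = [(5, 0.4), (10, 0.2), (20, 12.0), (30, 0.3)]
too_early = [(5, 0.4), (10, 0.2)]

print(fix_is_holding(quiet_night))
print(fix_is_holding(relapse))
print(fix_is_holding(too_early))
```

Requiring the full window matters as much as the threshold: two clean readings ten minutes after a rollback are encouraging, but they are not yet a resolved incident.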
Common Production Issues and Quick Fixes
Database Connection Pool Exhausted: Increase pool size temporarily, then find connection leaks
Memory Leak: Restart service, add memory limit, profile later
Rate Limit Hit: Add retry with exponential backoff
Infinite Loop: Kill process, rollback deployment
Disk Space Full: Delete or compress old logs (keep the ones covering the incident), increase disk size
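Of the fixes above, retry with exponential backoff is the one most easily written wrong under pressure, so here is a minimal sketch. `RateLimited` is a hypothetical exception standing in for however your HTTP client signals a 429, and the delays and attempt count are illustrative defaults. Random jitter is added so that many clients do not all retry at the same instant.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical error raised when the upstream service returns HTTP 429."""

def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` and retry on RateLimited, doubling the delay cap each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter: wait a random time up to the cap

# Usage example with a fake upstream that rejects the first two calls
calls = {"n": 0}

def flaky_upstream():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return "ok"

waits = []  # record delays instead of really sleeping, to keep the demo fast
print(retry_with_backoff(flaky_upstream, sleep=waits.append))
print("attempts:", calls["n"], "| waits recorded:", len(waits))
```

If the upstream sends a `Retry-After` header, honoring it is generally better than guessing a delay; this sketch ignores that for brevity.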
The Incident Checklist
□ Breathe (30 seconds)
□ Post incident alert in Slack
□ Check error tracking and APM
□ Identify when it started
□ Check recent deployments
□ Try rollback or restart
□ Check external services
□ Check database
□ Form hypothesis
□ Test fix
□ Verify error rate drops
□ Monitor for 30 minutes
□ Post resolution update
□ Schedule post-mortem
Post-Incident: The Post-Mortem
Write a post-mortem within 24-48 hours of the incident, covering:
- Summary and timeline
- Root cause explanation
- Impact assessment
- What went well/wrong
- Action items with owners and dates
- Lessons learned
Prevention > Cure
Prevent 3 AM alerts:
- Monitoring: Add custom metrics for everything important
- Alerts: Alert on what users feel (error rate, response time) and on resource exhaustion (memory, disk, connections)
- Testing: Load test and performance test before deploying
- Gradual Rollouts: Deploy to 10% traffic first, watch for issues
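Gradual rollouts are usually handled by your load balancer or deploy tooling, but the underlying idea is small enough to sketch. Assuming each request carries a stable user ID, hashing that ID into one of 100 buckets sends a fixed slice of users to the new version. The function name `in_canary` and the use of SHA-256 are choices made for this illustration, not a description of any particular platform.

```python
import hashlib

def in_canary(user_id: str, percent: int = 10) -> bool:
    """Deterministically place a user in the canary group.

    The same user always gets the same answer, so nobody flips between
    the old and new version from one request to the next.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent

# Rough check of the split over made-up user IDs
users = [f"user-{n}" for n in range(10_000)]
share = sum(in_canary(user) for user in users) / len(users)
print(f"canary share: {share:.1%}")
```

Raising `percent` step by step while watching the error rate keeps every user who was already in the canary group inside it, because their bucket never changes.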
Staying Calm Under Pressure
- Have a runbook with common issues and fixes
- Practice breaking things in staging
- Pair with experienced engineers during incidents
- Take breaks to clear your head
- Learn from every incident
Production incidents are stressful. But they’re also learning opportunities.
Every 3 AM alert makes you a better engineer. You learn the system. You learn debugging. You learn to stay calm under pressure.
Remember:
- Don’t panic
- Follow a process
- Communicate
- Fix first, investigate later
You’ve got this. Now go back to sleep.
(But keep your phone charged.)