Runbook: Webhook Disaster Recovery
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV2
Customer impact: Webhook deliveries delayed or lost
Required access: SSH to VPS, MongoDB admin credentials
Related services: curateme-backend-gateway, curateme-celery-worker, curateme-mongo
Covers webhook service recovery, Redis/MongoDB failure recovery, and region failover. Use this when webhook delivery is completely down.
Severity: P0 - Critical | Estimated RTO: 15 minutes | Estimated RPO: 5 minutes
Overview
This runbook provides step-by-step procedures for recovering from catastrophic failures in the webhook delivery system. Use this runbook when:
- Webhook service is completely down (no webhooks delivering)
- Database/Redis data loss or corruption
- Complete region failure
- Data center outage
Emergency Contacts
| Role | Contact | Phone | Slack |
|---|---|---|---|
| Primary On-Call | Platform Team | +1-555-0100 | @oncall-platform |
| Secondary On-Call | SRE Team | +1-555-0101 | @oncall-sre |
| Engineering Manager | Jane Smith | +1-555-0102 | @jane-smith |
| CTO | John Doe | +1-555-0199 | @john-doe |
PagerDuty Escalation Policy: platform-critical
Service Dependencies
| Dependency | Impact if Down | Recovery Priority |
|---|---|---|
| Redis | No webhook queueing, circuit breaker state lost | P0 - Critical |
| FastAPI Service | No webhook API, no deliveries | P0 - Critical |
| MongoDB | No subscription management, no audit logs | P1 - High |
| PagerDuty | No alerting (monitoring blind) | P2 - Medium |
| Grafana/Prometheus | No observability, alerts delayed | P2 - Medium |
Disaster Scenarios
Scenario 1: Complete Service Outage
Symptoms:
- All webhook deliveries failing
- Health check endpoint returning 503
- Zero successful deliveries in past 5 minutes
Immediate Actions (0-5 minutes):
- Confirm Outage Scope
    # Check health endpoint
    curl https://api.curate-me.ai/api/v1/health
    # Check recent errors from production
    ./scripts/errors recent
    # Check platform health
    ./scripts/analytics health
- Trigger Incident
  - Create PagerDuty incident: P0: Webhook Service Complete Outage
  - Post to Slack: #incident-response
  - Update status page: https://status.curate-me.ai
- Check Infrastructure Status
    # SSH to platform VPS
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    # Check container status
    docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
    # Check service logs
    docker logs curateme-backend-gateway --tail 200 --timestamps 2>&1 | grep -iE "error|webhook|fail"
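When the filtered log output is long, a quick tally by keyword can help decide where to look first. The sketch below is an illustration only: `tally_log_keywords` is a hypothetical helper (not part of the repository scripts) that counts lines matching the same three keywords used in the grep filter above.

```shell
# Hypothetical helper: tally log lines read from stdin by keyword.
# Matching is case-insensitive; one line can count toward several keywords.
tally_log_keywords() {
  awk '
    { line = tolower($0) }
    line ~ /error/   { error++ }
    line ~ /webhook/ { webhook++ }
    line ~ /fail/    { fail++ }
    END { printf "error=%d webhook=%d fail=%d\n", error, webhook, fail }
  '
}

# Example (run on the VPS):
#   docker logs curateme-backend-gateway --tail 200 2>&1 | tally_log_keywords
```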
Recovery Actions (5-15 minutes):
- Restart Service (Quick Fix)
    # Restart the backend service
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    cd /opt/curateme && docker compose -f docker-compose.production.yml restart backend
    # Wait for health check
    for i in {1..30}; do
      if curl -f https://api.curate-me.ai/api/v1/health; then
        echo "Service recovered"
        break
      fi
      sleep 2
    done
- If Restart Fails: Deploy Last Known Good Version
    # From local machine, deploy last stable build
    ./scripts/deploy-to-vps.sh --backend
    # Or on VPS, roll back to previous image
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    cd /opt/curateme && ./deploy/vps/deploy.sh update
- Verify Recovery
    # Check health
    curl https://api.curate-me.ai/api/v1/health | jq
    # Check for new errors
    ./scripts/errors recent
    # Verify metrics
    ./scripts/analytics health
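The health-check wait loop in the restart step can be wrapped in a reusable function that also reports failure when the service never comes back. This is a minimal sketch, not an existing script: `wait_for_health` is a hypothetical name, and the probe command is passed in as arguments so the same loop works for curl, redis-cli ping, or any other check.

```shell
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_for_health <attempts> <delay-seconds> <command> [args...]
wait_for_health() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      echo "recovered after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Example, matching the 30 tries and 2 second delay used above:
#   wait_for_health 30 2 curl -f https://api.curate-me.ai/api/v1/health
```

A non-zero exit status makes it easy to chain the next step, for example moving on to the redeploy only when the restart did not recover the service.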
Post-Recovery (15-60 minutes):
- Process Dead Letter Queue (DLQ)
- Investigate root cause (check logs, metrics)
- Update postmortem document
- Schedule blameless postmortem meeting
Scenario 2: Redis Failure / Data Loss
Symptoms:
- Circuit breaker state lost (all circuits reset to CLOSED)
- Webhook queue empty
- Rate limiting not working
- Session data missing
Immediate Actions:
- Check Redis Health
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    # Check Redis container
    docker ps -a --filter name=curateme-redis
    # Connect to Redis
    docker exec curateme-redis redis-cli ping
    # Check memory usage
    docker exec curateme-redis redis-cli info memory
    # Check connected clients
    docker exec curateme-redis redis-cli client list
  For deeper Redis diagnostics, see Redis Incident.
- If Redis Down: Restart Redis
    docker restart curateme-redis
    # Verify it responds
    docker exec curateme-redis redis-cli ping
- If Data Loss: Restore from Backup
    # Check available backups on VPS
    ls -la /opt/curateme/backups/redis/
    # Stop Redis first so its shutdown save cannot overwrite the restored dump
    docker stop curateme-redis
    # Restore from latest RDB dump
    docker cp /opt/curateme/backups/redis/latest.rdb curateme-redis:/data/dump.rdb
    docker start curateme-redis
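Before copying a dump into the container, it is worth confirming that the backup file is a non-empty RDB file. RDB files start with the ASCII magic string REDIS, so a cheap sanity check is possible without extra tooling. The sketch below assumes nothing about the backup job itself; `is_rdb_file` is a hypothetical helper, and the redis-check-rdb tool shipped with Redis gives a deeper validation when it is available.

```shell
# Hypothetical helper: succeed only if the file is non-empty and starts
# with the RDB magic string "REDIS".
is_rdb_file() {
  local path="$1"
  [ -s "$path" ] || return 1
  [ "$(head -c 5 "$path")" = "REDIS" ]
}

# Example:
#   is_rdb_file /opt/curateme/backups/redis/latest.rdb || echo "backup looks invalid"
```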
Data Recovery Priority:
| Data Type | Recovery Method | Priority |
|---|---|---|
| Circuit breaker state | Re-learn from live traffic (auto-recover) | P1 |
| Webhook queue | Replay from MongoDB audit log | P0 |
| Rate limit counters | Reset (acceptable loss) | P3 |
| Idempotency keys | Replay from last 24h (acceptable duplicates) | P2 |
Post-Recovery:
- Monitor circuit breaker auto-recovery (expect 5-10 minutes for state rebuild)
- Manually replay failed webhooks from DLQ
- Review Redis persistence configuration (save directives)
- Verify cost accumulation resumed: ./scripts/analytics costs today
Scenario 3: Database (MongoDB) Failure
Symptoms:
- Subscription management fails
- Audit logs not writing
- Certificate storage unavailable
Immediate Actions:
- Check MongoDB Status
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    # Check MongoDB container
    docker ps -a --filter name=curateme-mongo
    # Test connectivity
    docker exec curateme-mongo mongosh --eval "db.runCommand({ ping: 1 })"
    # Check replication status
    docker exec curateme-mongo mongosh --eval "rs.status()"
  For deeper MongoDB diagnostics, see MongoDB Incident.
- If MongoDB Down: Restart
    docker restart curateme-mongo
    # Verify it responds
    docker exec curateme-mongo mongosh --eval "db.runCommand({ ping: 1 })"
- Degraded Mode Operation
  - Webhook deliveries continue (subscriptions cached in Redis)
  - New subscriptions temporarily disabled
  - Certificate management read-only
Recovery Timeline:
- Container restart: 1-3 minutes
- Failover (multi-region): 5-10 minutes
- Full recovery: 15-30 minutes
Scenario 4: Complete Region Failure
Symptoms:
- All services in primary region unreachable
- DNS resolution timeout
- Hetzner status page shows region outage
Immediate Actions (Failover):
- Trigger DR Failover
    # Update DNS to point to secondary VPS
    # (Use your DNS provider dashboard or API)
    # Activate services on secondary VPS (curateme-runners)
    ssh $DEPLOY_USER@$RUNNERS_VPS_IP
    cd /opt/curateme && docker compose -f docker-compose.production.yml up -d
- Verify Secondary Region Health
    # Health check on secondary
    curl https://api.curate-me.ai/api/v1/health
    # Check errors
    ./scripts/errors recent
    # Full health check
    ./scripts/analytics health
- Communication
  - Update status page: “Failover in progress”
  - Notify customers via email
  - Post to Slack #incidents
Expected Impact:
- RTO: 15 minutes (DNS propagation time)
- RPO: 5 minutes (last MongoDB sync)
- Temporary latency increase for some users
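Because the RTO here is dominated by DNS propagation, it helps to confirm that the public hostname already resolves to the secondary VPS before running the health checks. The following is a sketch under assumptions: `dns_points_to` is a hypothetical helper, and it expects one resolved address per line on stdin (the format printed by dig +short).

```shell
# Hypothetical helper: succeed if the expected IP appears as a whole line
# in the resolver output read from stdin.
dns_points_to() {
  local expected_ip="$1"
  # -x matches whole lines only, -F treats the dots in the IP literally.
  grep -qxF -- "$expected_ip"
}

# Example:
#   dig +short api.curate-me.ai A | dns_points_to "$RUNNERS_VPS_IP" \
#     && echo "DNS now points at the secondary VPS"
```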
Rollback Procedures
Rollback Recent Deployment
If a recent deployment caused the outage:
# On VPS: check recent deploy history
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
./deploy/vps/deploy.sh status
# Rollback via deploy script
./deploy/vps/deploy.sh update
# Or from local machine, deploy a specific known-good commit
git checkout <known-good-commit>
./scripts/deploy-to-vps.sh --backend
Rollback Database Migration
If a database migration caused issues:
cd services/backend
poetry run alembic downgrade -1
# Or restore from backup
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
# Check available MongoDB backups
ls -la /opt/curateme/backups/mongo/
# Restore from backup
docker exec curateme-mongo mongorestore --drop /data/backup/latest/
Validation and Testing
Post-Recovery Health Checks
Run these checks to validate recovery:
# 1. Health endpoint
curl https://api.curate-me.ai/api/v1/health | jq
# 2. Test webhook delivery
curl -X POST https://api.curate-me.ai/api/v1/webhooks/deliver \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"webhook_id": "test_recovery",
"url": "https://webhook.site/test",
"payload": {"test": "recovery"}
}'
# 3. Check platform health and recent errors
./scripts/analytics health
./scripts/errors recent
# 4. Verify cost tracking is current
./scripts/analytics costs today
SLO Compliance Verification
Ensure SLOs are met post-recovery:
| SLO | Target | Check Method |
|---|---|---|
| Success Rate | >= 99.5% | ./scripts/analytics health |
| P95 Latency | < 300ms | ./scripts/analytics performance |
| Throughput | >= 100 req/s | ./scripts/analytics health |
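The three targets in the table can be checked mechanically once the numbers are known. The sketch below takes the values as arguments because the output format of ./scripts/analytics is not documented here; `slo_check` is a hypothetical helper, and reading the three values off the analytics output is left to the operator.

```shell
# Hypothetical helper: compare measured values against the SLO targets above.
# Usage: slo_check <success-rate-%> <p95-latency-ms> <throughput-req/s>
slo_check() {
  local success_rate="$1" p95_ms="$2" throughput="$3"
  awk -v s="$success_rate" -v p="$p95_ms" -v t="$throughput" 'BEGIN {
    failed = 0
    if (s + 0 < 99.5) { print "FAIL success rate " s "% (target >= 99.5%)"; failed = 1 }
    if (p + 0 >= 300) { print "FAIL P95 latency " p "ms (target < 300ms)"; failed = 1 }
    if (t + 0 < 100)  { print "FAIL throughput " t " req/s (target >= 100 req/s)"; failed = 1 }
    if (!failed) print "PASS all SLOs met"
    exit failed
  }'
}

# Example:
#   slo_check 99.7 250 120
```

The function exits non-zero when any target is missed, so it can gate the decision to close the incident.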
Backup and Restore Procedures
Redis Backups
Scheduled Backups:
- Frequency: Every 6 hours
- Retention: 7 days
- Location: /opt/curateme/backups/redis/ on VPS
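A scheduled backup is only useful if it actually ran. One way to confirm that is to check that the backup directory contains a file newer than the expected interval. This is a sketch, assuming the backup job writes directly into the directory; `has_recent_backup` is a hypothetical helper and the age limit is given in minutes (360 for the 6 hour Redis schedule).

```shell
# Hypothetical helper: succeed if <dir> contains at least one entry
# modified within the last <max-age-minutes> minutes.
has_recent_backup() {
  local dir="$1" max_age_min="$2"
  [ -d "$dir" ] || return 1
  [ -n "$(find "$dir" -mindepth 1 -maxdepth 1 -mmin "-$max_age_min" | head -n 1)" ]
}

# Example:
#   has_recent_backup /opt/curateme/backups/redis 360 || echo "Redis backup is stale"
```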
Manual Backup:
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
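# Trigger manual RDB save (BGSAVE runs in the background)
docker exec curateme-redis redis-cli bgsave
# Wait until rdb_bgsave_in_progress reports 0 before copying the dump
docker exec curateme-redis redis-cli info persistence | grep rdb_bgsave_in_progress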
# Copy dump to backup location
docker cp curateme-redis:/data/dump.rdb /opt/curateme/backups/redis/manual-backup-$(date +%Y%m%d-%H%M).rdb
Restore from Backup:
# List available backups
ls -la /opt/curateme/backups/redis/
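# Restore specific backup
# (stop Redis first so its shutdown save cannot overwrite the restored dump)
docker stop curateme-redis
docker cp /opt/curateme/backups/redis/<backup-file>.rdb curateme-redis:/data/dump.rdb
docker start curateme-redis
MongoDB Backups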
Automatic Backups:
- Frequency: Daily (configured via cron on VPS)
- Retention: 30 days
- Type: mongodump snapshots
Manual Backup:
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
docker exec curateme-mongo mongodump --out /data/backup/manual-$(date +%Y%m%d-%H%M)
Restore from Backup:
# List available backups
ls -la /opt/curateme/backups/mongo/
# Restore to specific snapshot
docker exec curateme-mongo mongorestore --drop /data/backup/<snapshot-dir>/
Escalation Matrix
Escalation Timeline
| Time Elapsed | Action |
|---|---|
| 0 minutes | On-call engineer paged |
| 5 minutes | Secondary on-call paged if no response |
| 15 minutes | Engineering Manager notified |
| 30 minutes | CTO notified |
| 60 minutes | CEO notified (customer-facing impact) |
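During an incident it is easy to lose track of who should already have been notified. The timeline above can be expressed as a small lookup keyed on minutes elapsed since the first page. This is an illustrative sketch; `escalation_target` is a hypothetical helper and it returns the highest tier reached, not the full list.

```shell
# Hypothetical helper: print the highest escalation tier reached after
# the given number of whole minutes, following the timeline table above.
escalation_target() {
  local elapsed_min="$1"
  if   [ "$elapsed_min" -ge 60 ]; then echo "CEO"
  elif [ "$elapsed_min" -ge 30 ]; then echo "CTO"
  elif [ "$elapsed_min" -ge 15 ]; then echo "Engineering Manager"
  elif [ "$elapsed_min" -ge 5 ];  then echo "Secondary On-Call"
  else echo "Primary On-Call"
  fi
}

# Example:
#   escalation_target 20
```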
Escalation Commands
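# Trigger PagerDuty escalation
# (the REST API expects a From header with the email of a valid PagerDuty user;
# PAGERDUTY_USER_EMAIL is a placeholder variable, set it before running)
curl -X POST https://api.pagerduty.com/incidents \
-H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
-H "From: $PAGERDUTY_USER_EMAIL" \
-H "Content-Type: application/json" \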
-d '{
"incident": {
"type": "incident",
"title": "Webhook Service P0 Outage",
"service": {"id": "PXXXXXX", "type": "service_reference"},
"urgency": "high",
"escalation_policy": {"id": "PXXXXXX", "type": "escalation_policy_reference"}
}
}'
Communication Templates
Status Page Update (Incident Detected)
[INVESTIGATING] Webhook Delivery Service Issues
We are currently investigating issues with webhook deliveries.
Some webhooks may be delayed or failing.
Our engineering team is investigating and we will provide updates
as more information becomes available.
Posted: <timestamp>
Status: Investigating
Status Page Update (Resolved)
[RESOLVED] Webhook Delivery Service Restored
The webhook delivery issues have been resolved. All services are
operating normally. Delayed webhooks have been processed.
Root cause: [Brief description]
Impacted time: [Start] - [End]
Affected webhooks: [Count]
We apologize for any inconvenience caused.
Posted: <timestamp>
Status: Resolved
Customer Email Template
Subject: [RESOLVED] Webhook Delivery Service Incident
Dear Customer,
We want to inform you about a recent incident affecting our webhook
delivery service:
Incident Summary:
- Start time: [timestamp]
- End time: [timestamp]
- Duration: [duration]
- Impact: Some webhooks may have been delayed or failed
Resolution:
All affected webhooks have been reprocessed and delivered. No action
is required on your part.
Root Cause:
[Brief technical explanation]
Prevention:
We have implemented [preventive measures] to prevent recurrence.
If you have any questions or concerns, please contact our support team.
Thank you for your patience and understanding.
Best regards,
curate-me.ai Platform Team
Postmortem Template
After resolving the incident, complete a postmortem within 48 hours:
Required Sections:
- Executive Summary
- Timeline of Events
- Root Cause Analysis (5 Whys)
- Impact Assessment (customers, revenue, SLO breach)
- What Went Well
- What Went Wrong
- Action Items (with owners and deadlines)
Postmortem Location: docs/postmortems/YYYY-MM-DD-webhook-outage.md
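To make the 48 hour deadline easier to hit, the postmortem file can be scaffolded with the required sections as soon as the incident is resolved. This is a sketch only: `postmortem_path` and `postmortem_skeleton` are hypothetical helpers, and the Markdown heading layout is an assumption based on the .md extension of the postmortem location.

```shell
# Hypothetical helper: build the postmortem path for an incident date (YYYY-MM-DD).
postmortem_path() {
  echo "docs/postmortems/$1-webhook-outage.md"
}

# Hypothetical helper: print a skeleton containing the required sections.
postmortem_skeleton() {
  local incident_date="$1" section
  echo "# Postmortem: webhook outage $incident_date"
  for section in \
    "Executive Summary" \
    "Timeline of Events" \
    "Root Cause Analysis (5 Whys)" \
    "Impact Assessment (customers, revenue, SLO breach)" \
    "What Went Well" \
    "What Went Wrong" \
    "Action Items (with owners and deadlines)"
  do
    printf '\n## %s\n\nTODO\n' "$section"
  done
}

# Example:
#   postmortem_skeleton 2026-05-04 > "$(postmortem_path 2026-05-04)"
```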
Disaster Recovery Testing
Quarterly DR Drills
Schedule and execute DR tests quarterly:
Test Scenarios:
- Primary region failover (Q1, Q3)
- Redis failure and recovery (Q2, Q4)
- Database restore (Q1, Q3)
- Complete service rollback (Q2, Q4)
Test Checklist:
- Schedule during low-traffic window
- Notify stakeholders 1 week in advance
- Prepare rollback plan
- Document actual RTO/RPO achieved
- Update runbook based on findings
Prevention and Monitoring
Preventive Measures
- Redundancy:
  - Two-VPS architecture (platform + runners)
  - Redis persistence (RDB snapshots)
  - MongoDB daily backups
- Monitoring:
  - ./scripts/errors recent: check production errors on demand
  - ./scripts/analytics health: full platform health check
  - ./scripts/analytics costs today: cost tracking verification
- Automated Recovery:
  - Health check-based auto-restart (Docker restart policies)
  - Circuit breakers prevent cascading failures
  - DLQ for failed webhook replay
Early Warning Signs
Monitor these metrics for early detection:
| Metric | Warning Threshold | Action |
|---|---|---|
| Success rate | < 99% | ./scripts/errors recent — investigate logs |
| P95 latency | > 400ms | ./scripts/analytics performance — check resource utilization |
| Open circuits | > 3 | Check endpoint health |
| DLQ depth | > 100 | Process queue manually |
| Redis memory | > 80% | Scale up or evict keys |
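The Redis memory threshold can be checked directly from the output of redis-cli info memory, which reports used_memory and maxmemory in bytes. The sketch below is one possible way to turn that output into a percentage; `redis_mem_pct` is a hypothetical helper, and a maxmemory of 0 (no limit configured) is reported as "unlimited" because no percentage can be computed in that case.

```shell
# Hypothetical helper: read "redis-cli info memory" output from stdin and
# print used_memory as a whole-number percentage of maxmemory.
redis_mem_pct() {
  # redis-cli terminates lines with CRLF, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }
  '
}

# Example:
#   docker exec curateme-redis redis-cli info memory | redis_mem_pct
```

A printed value above 80 corresponds to the warning threshold in the table.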
Related Runbooks
- Redis Incident — Redis container health, memory diagnostics, connection issues
- MongoDB Incident — MongoDB container health, replication, restore procedures
- Incident Response — General incident response playbook and escalation procedures