Runbook: Webhook Disaster Recovery

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV2
Customer impact: Webhook deliveries delayed or lost
Required access: SSH to VPS, MongoDB admin credentials
Related services: curateme-backend-gateway, curateme-celery-worker, curateme-mongo

Covers webhook service recovery, Redis/MongoDB failure recovery, and region failover. Use this when webhook delivery is completely down.

Severity: P0 - Critical | Estimated RTO: 15 minutes | Estimated RPO: 5 minutes

Overview

This runbook provides step-by-step procedures for recovering from catastrophic failures in the webhook delivery system. Use it in any of the following situations:

Webhook service is completely down (no webhooks delivering)
Database/Redis data loss or corruption
Complete region failure
Data center outage

Emergency Contacts

Role	Contact	Phone	Slack
Primary On-Call	Platform Team	+1-555-0100	@oncall-platform
Secondary On-Call	SRE Team	+1-555-0101	@oncall-sre
Engineering Manager	Jane Smith	+1-555-0102	@jane-smith
CTO	John Doe	+1-555-0199	@john-doe

PagerDuty Escalation Policy: platform-critical

Service Dependencies

Dependency	Impact if Down	Recovery Priority
Redis	No webhook queueing, circuit breaker state lost	P0 - Critical
FastAPI Service	No webhook API, no deliveries	P0 - Critical
MongoDB	No subscription management, no audit logs	P1 - High
PagerDuty	No alerting (monitoring blind)	P2 - Medium
Grafana/Prometheus	No observability, alerts delayed	P2 - Medium

Disaster Scenarios

Scenario 1: Complete Service Outage

Symptoms:

All webhook deliveries failing
Health check endpoint returning 503
Zero successful deliveries in past 5 minutes

Immediate Actions (0-5 minutes):

Confirm Outage Scope


# Check health endpoint
curl https://api.curate-me.ai/api/v1/health
 
# Check recent errors from production
./scripts/errors recent
 
# Check platform health
./scripts/analytics health
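
The health check above can be turned into a quick classification step. The sketch below maps the HTTP status code returned by the health endpoint to a rough label; the function name classify_health and the labels are illustrative assumptions, not part of the platform tooling.

```shell
# Hypothetical helper: classify an HTTP status code from the health endpoint.
classify_health() {
  case "$1" in
    2??)    echo "healthy" ;;
    503)    echo "outage" ;;        # matches the symptom listed for this scenario
    000|"") echo "unreachable" ;;   # curl reports 000 when no HTTP response arrived
    *)      echo "degraded" ;;
  esac
}

# Usage (performs a real request, so it is left commented out):
# code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://api.curate-me.ai/api/v1/health)
# classify_health "$code"
```

An "unreachable" result points at network, DNS, or host failure rather than the application, which is a hint to check Scenario 4 before restarting containers.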

Trigger Incident
- Create PagerDuty incident: P0: Webhook Service Complete Outage
- Post to Slack: #incident-response
- Update status page: https://status.curate-me.ai

Check Infrastructure Status


# SSH to platform VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
 
# Check container status
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
 
# Check service logs
docker logs curateme-backend-gateway --tail 200 --timestamps 2>&1 | grep -iE "error|webhook|fail"

Recovery Actions (5-15 minutes):

Restart Service (Quick Fix)


# Restart the backend service
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
cd ~/platform && docker compose -f docker-compose.production.yml restart backend-b2b backend-gateway
 
# Wait for health check
for i in {1..30}; do
  if curl -f https://api.curate-me.ai/api/v1/health; then
    echo "Service recovered"
    break
  fi
  sleep 2
done
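
The wait loop above exits silently if the service never recovers. A reusable variant that returns a non-zero status on timeout might look like the sketch below; retry_until is a hypothetical helper name, not an existing script in this repository.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage (real request, commented out):
# if retry_until 30 2 curl -fsS -o /dev/null https://api.curate-me.ai/api/v1/health; then
#   echo "Service recovered"
# else
#   echo "Service still unhealthy after 30 attempts; continue with the next step" >&2
# fi
```

Returning an explicit failure status makes it clear when to move on to deploying the last known good version.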

If Restart Fails — Deploy Last Known Good Version


# From local machine, deploy last stable build
./scripts/deploy-to-vps.sh --backend
 
# Or on VPS, roll back to previous image
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
cd ~/platform && ./deploy/vps/deploy.sh update

Verify Recovery


# Check health
curl -sS https://api.curate-me.ai/api/v1/health | jq
 
# Check for new errors
./scripts/errors recent
 
# Verify metrics
./scripts/analytics health

Post-Recovery (15-60 minutes):

Process Dead Letter Queue (DLQ)
Investigate root cause (check logs, metrics)
Update postmortem document
Schedule blameless postmortem meeting

Scenario 2: Redis Failure / Data Loss

Symptoms:

Circuit breaker state lost (all circuits reset to CLOSED)
Webhook queue empty
Rate limiting not working
Session data missing

Immediate Actions:

Check Redis Health


ssh $DEPLOY_USER@$PLATFORM_VPS_IP
 
# Check Redis container
docker ps -a --filter name=curateme-redis
 
# Connect to Redis
docker exec curateme-redis redis-cli ping
 
# Check memory usage
docker exec curateme-redis redis-cli info memory
 
# Check connected clients
docker exec curateme-redis redis-cli client list

For deeper Redis diagnostics, see Redis Incident.

If Redis Down — Restart Redis


docker restart curateme-redis
 
# Verify it responds
docker exec curateme-redis redis-cli ping

If Data Loss — Restore from Backup


# Check available backups on VPS
ls -la /home/curateme/backups/redis/
 
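# Restore from latest RDB dump.
# Stop Redis first: an instance with save points configured rewrites dump.rdb
# on shutdown, which would overwrite a file copied in while it is running.
docker stop curateme-redis
docker cp /home/curateme/backups/redis/latest.rdb curateme-redis:/data/dump.rdb
docker start curateme-redis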
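
Before copying a backup into the container, it can help to confirm that the file looks like an RDB dump at all. RDB files begin with the ASCII magic string REDIS followed by a version number, so a minimal sanity check, with the hypothetical helper name is_rdb_file, could be sketched as:

```shell
# Hypothetical helper: succeed only if the file is non-empty and starts with
# the RDB magic string "REDIS".
is_rdb_file() {
  [ -s "$1" ] && [ "$(head -c 5 "$1")" = "REDIS" ]
}

# Usage:
# is_rdb_file /home/curateme/backups/redis/latest.rdb \
#   || echo "latest.rdb does not look like an RDB dump; pick another backup" >&2
```

This only catches empty or obviously wrong files. The redis-check-rdb tool that ships with Redis performs a fuller integrity check if it is available on the host or in the container.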

Data Recovery Priority:

Data Type	Recovery Method	Priority
Circuit breaker state	Re-learn from live traffic (auto-recover)	P1
Webhook queue	Replay from MongoDB audit log	P0
Rate limit counters	Reset (acceptable loss)	P3
Idempotency keys	Replay from last 24h (acceptable duplicates)	P2

Post-Recovery:

Monitor circuit breaker auto-recovery (expect 5-10 minutes for state rebuild)
Manually replay failed webhooks from DLQ
Review Redis persistence configuration (save directives)
Verify cost accumulation resumed: ./scripts/analytics costs today

Scenario 3: Database (MongoDB) Failure

Symptoms:

Subscription management fails
Audit logs not writing
Certificate storage unavailable

Immediate Actions:

Check MongoDB Status


ssh $DEPLOY_USER@$PLATFORM_VPS_IP
 
# Check MongoDB container
docker ps -a --filter name=curateme-mongo
 
# Test connectivity
docker exec curateme-mongo mongosh --eval "db.runCommand({ ping: 1 })"
 
# Check replication status
docker exec curateme-mongo mongosh --eval "rs.status()"

For deeper MongoDB diagnostics, see MongoDB Incident.

If MongoDB Down — Restart


docker restart curateme-mongo
 
# Verify it responds
docker exec curateme-mongo mongosh --eval "db.runCommand({ ping: 1 })"

Degraded Mode Operation
- Webhook deliveries continue (subscriptions cached in Redis)
- New subscriptions temporarily disabled
- Certificate management read-only

Recovery Timeline:

Container restart: 1-3 minutes
Failover (multi-region): 5-10 minutes
Full recovery: 15-30 minutes

Scenario 4: Complete Region Failure

Symptoms:

All services in primary region unreachable
DNS resolution timeout
Hetzner status page shows region outage

Immediate Actions (Failover):

Trigger DR Failover


# Update DNS to point to secondary VPS
# (Use your DNS provider dashboard or API)
 
# Activate services on secondary VPS (curateme-runners)
ssh $DEPLOY_USER@$RUNNERS_VPS_IP
cd ~/platform && docker compose -f docker-compose.production.yml up -d
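
After the DNS change, it is worth confirming that the record actually resolves to the secondary VPS before running health checks. The sketch below reads resolver answers on stdin so it can be fed from dig; the helper name dns_answers_include is an assumption for illustration.

```shell
# Hypothetical helper: read resolver answers (one IP per line) on stdin and
# succeed if the expected IP is among them. -F and -x force a literal,
# whole-line match so 203.0.113.1 does not match 203.0.113.10.
dns_answers_include() {
  grep -Fxq -- "$1"
}

# Usage (real DNS lookup, commented out):
# dig +short api.curate-me.ai A | dns_answers_include "$RUNNERS_VPS_IP" \
#   && echo "DNS now points at the secondary VPS"
```

Results can differ between resolvers until the old record's TTL expires, so a passing check from one location does not mean every customer already sees the new address.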

Verify Secondary Region Health


# Health check on secondary
curl https://api.curate-me.ai/api/v1/health
 
# Check errors
./scripts/errors recent
 
# Full health check
./scripts/analytics health

Communication
- Update status page: “Failover in progress”
- Notify customers via email
- Post to Slack #incidents

Expected Impact:

RTO: 15 minutes (DNS propagation time)
RPO: 5 minutes (last MongoDB sync)
Temporary latency increase for some users

Rollback Procedures

Rollback Recent Deployment

If a recent deployment caused the outage:


# On VPS: check recent deploy history
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
./deploy/vps/deploy.sh status
 
# Rollback via deploy script
./deploy/vps/deploy.sh update
 
# Or from local machine, deploy a specific known-good commit
git checkout <known-good-commit>
./scripts/deploy-to-vps.sh --backend

Restore MongoDB From Backup

There are no schema migrations to roll back — MongoDB is schemaless and accessed via motor (this project has no alembic). If a bad write or data corruption caused the issue, restore from the most recent backup:


ssh $DEPLOY_USER@$PLATFORM_VPS_IP
# Check available MongoDB backups
ls -la /home/curateme/backups/mongo/
 
# Restore from backup
docker exec curateme-mongo mongorestore --drop /data/backup/latest/

See Backup and Restore Procedures below for the full manual backup/restore commands.

Validation and Testing

Post-Recovery Health Checks

Run these checks to validate recovery:


# 1. Health endpoint
curl -sS https://api.curate-me.ai/api/v1/health | jq
 
# 2. Test webhook delivery
curl -X POST https://api.curate-me.ai/api/v1/webhooks/deliver \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "webhook_id": "test_recovery",
    "url": "https://webhook.site/test",
    "payload": {"test": "recovery"}
  }'
 
# 3. Check platform health and recent errors
./scripts/analytics health
./scripts/errors recent
 
# 4. Verify cost tracking is current
./scripts/analytics costs today

SLO Compliance Verification

Ensure SLOs are met post-recovery:

SLO	Target	Check Method
Success Rate	>= 99.5%	`./scripts/analytics health`
P95 Latency	< 300ms	`./scripts/analytics performance`
Throughput	>= 100 req/s	`./scripts/analytics health`

Backup and Restore Procedures

Redis Backups

Scheduled Backups:

Frequency: Every 6 hours
Retention: 7 days
Location: /home/curateme/backups/redis/ on VPS
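
A stale backup directory is easy to miss until a restore is needed. As a minimal sketch, the check below confirms that at least one .rdb file in the backup location is newer than the 6-hour schedule; has_recent_backup is a hypothetical helper, and it relies on the -maxdepth and -mmin options found in GNU and BSD find.

```shell
# Hypothetical helper: succeed if the directory holds an .rdb file modified
# within the last N minutes.
has_recent_backup() {
  local dir="$1" max_age_min="$2"
  [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.rdb' -mmin "-$max_age_min" 2>/dev/null | head -n 1)" ]
}

# Usage (360 minutes = the 6-hour backup frequency):
# has_recent_backup /home/curateme/backups/redis 360 \
#   || echo "No Redis backup in the last 6 hours" >&2
```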

Manual Backup:


ssh $DEPLOY_USER@$PLATFORM_VPS_IP
 
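# Trigger manual RDB save (BGSAVE runs in the background and returns immediately)
docker exec curateme-redis redis-cli bgsave

# Wait for the background save to finish before copying the dump
while docker exec curateme-redis redis-cli info persistence | grep -q 'rdb_bgsave_in_progress:1'; do
  sleep 1
done

# Copy dump to backup location
docker cp curateme-redis:/data/dump.rdb /home/curateme/backups/redis/manual-backup-$(date +%Y%m%d-%H%M).rdb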

Restore from Backup:


# List available backups
ls -la /home/curateme/backups/redis/
 
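# Restore specific backup.
# Stop Redis first: an instance with save points configured rewrites dump.rdb
# on shutdown, which would overwrite a file copied in while it is running.
docker stop curateme-redis
docker cp /home/curateme/backups/redis/<backup-file>.rdb curateme-redis:/data/dump.rdb
docker start curateme-redis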

MongoDB Backups

Automatic Backups:

Frequency: Daily (configured via cron on VPS)
Retention: 30 days
Type: mongodump snapshots

Manual Backup:


ssh $DEPLOY_USER@$PLATFORM_VPS_IP
 
docker exec curateme-mongo mongodump --out /data/backup/manual-$(date +%Y%m%d-%H%M)

Restore from Backup:


# List available backups
ls -la /home/curateme/backups/mongo/
 
# Restore a specific snapshot (--drop removes each existing collection before restoring it)
docker exec curateme-mongo mongorestore --drop /data/backup/<snapshot-dir>/
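
Because --drop deletes existing collections before restoring them, pointing mongorestore at an empty or wrong directory is risky. A minimal pre-check, assuming uncompressed mongodump output (one .bson file per collection) and using the hypothetical helper name snapshot_has_data, might look like:

```shell
# Hypothetical helper: succeed if the snapshot directory exists and contains
# at least one .bson file, which mongodump writes for each dumped collection.
snapshot_has_data() {
  [ -d "$1" ] && [ -n "$(find "$1" -type f -name '*.bson' 2>/dev/null | head -n 1)" ]
}

# Usage (run it where the snapshot directory is visible, for example on the host):
# snapshot_has_data "/home/curateme/backups/mongo/<snapshot-dir>" \
#   || echo "Snapshot looks empty; do not run mongorestore --drop" >&2
```

If the backups are taken with mongodump --gzip, the file pattern would need to be '*.bson.gz' instead.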

Escalation Matrix

Escalation Timeline

Time Elapsed	Action
0 minutes	On-call engineer paged
5 minutes	Secondary on-call paged if no response
15 minutes	Engineering Manager notified
30 minutes	CTO notified
60 minutes	CEO notified (customer-facing impact)
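
During a long incident it is easy to lose track of which escalation step is due. The sketch below encodes the timeline above as a lookup from elapsed minutes to the highest role that should have been engaged; escalation_for is a hypothetical helper name.

```shell
# Hypothetical helper: print the highest escalation level due after N minutes.
escalation_for() {
  local minutes="$1"
  if   [ "$minutes" -ge 60 ]; then echo "CEO"
  elif [ "$minutes" -ge 30 ]; then echo "CTO"
  elif [ "$minutes" -ge 15 ]; then echo "Engineering Manager"
  elif [ "$minutes" -ge 5 ];  then echo "Secondary On-Call"
  else echo "Primary On-Call"
  fi
}

# Usage (INCIDENT_START_EPOCH is an assumed variable holding the start time
# in seconds since the epoch):
# escalation_for $(( ($(date +%s) - INCIDENT_START_EPOCH) / 60 ))
```

The 5-minute step applies only if the primary on-call has not responded, and the 60-minute step only when there is customer-facing impact, so treat the output as a prompt rather than an automatic action.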

Escalation Commands


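# Create a PagerDuty incident on the escalation policy.
# The REST API requires a From header with the email of a valid PagerDuty user;
# $PAGERDUTY_USER_EMAIL is an assumed variable name, set it before running.
# Replace the PXXXXXX placeholders with the real service and policy IDs.
curl -X POST https://api.pagerduty.com/incidents \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "From: $PAGERDUTY_USER_EMAIL" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "type": "incident",
      "title": "Webhook Service P0 Outage",
      "service": {"id": "PXXXXXX", "type": "service_reference"},
      "urgency": "high",
      "escalation_policy": {"id": "PXXXXXX", "type": "escalation_policy_reference"}
    }
  }'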

Communication Templates

Status Page Update (Incident Detected)


[INVESTIGATING] Webhook Delivery Service Issues

We are currently investigating issues with webhook deliveries.
Some webhooks may be delayed or failing.

Our engineering team is investigating and we will provide updates
as more information becomes available.

Posted: <timestamp>
Status: Investigating

Status Page Update (Resolved)


[RESOLVED] Webhook Delivery Service Restored

The webhook delivery issues have been resolved. All services are
operating normally. Delayed webhooks have been processed.

Root cause: [Brief description]
Impacted time: [Start] - [End]
Affected webhooks: [Count]

We apologize for any inconvenience caused.

Posted: <timestamp>
Status: Resolved

Customer Email Template


Subject: [RESOLVED] Webhook Delivery Service Incident

Dear Customer,

We want to inform you about a recent incident affecting our webhook
delivery service:

Incident Summary:
- Start time: [timestamp]
- End time: [timestamp]
- Duration: [duration]
- Impact: Some webhooks may have been delayed or failed

Resolution:
All affected webhooks have been reprocessed and delivered. No action
is required on your part.

Root Cause:
[Brief technical explanation]

Prevention:
We have implemented [preventive measures] to prevent recurrence.

If you have any questions or concerns, please contact our support team.

Thank you for your patience and understanding.

Best regards,
curate-me.ai Platform Team

Postmortem Template

After resolving the incident, complete a postmortem within 48 hours:

Required Sections:

Executive Summary
Timeline of Events
Root Cause Analysis (5 Whys)
Impact Assessment (customers, revenue, SLO breach)
What Went Well
What Went Wrong
Action Items (with owners and deadlines)

Postmortem Location: docs/postmortems/YYYY-MM-DD-webhook-outage.md
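
To make the 48-hour deadline easier to hit, the required sections can be scaffolded into a file at the location above. This is a minimal sketch; the helper name postmortem_skeleton, the title line, and the TODO placeholders are assumptions, not an existing template.

```shell
# Hypothetical helper: print a postmortem skeleton with the required sections.
postmortem_skeleton() {
  local incident_date="$1" section
  printf '# Postmortem: Webhook Outage (%s)\n' "$incident_date"
  for section in \
    "Executive Summary" \
    "Timeline of Events" \
    "Root Cause Analysis (5 Whys)" \
    "Impact Assessment (customers, revenue, SLO breach)" \
    "What Went Well" \
    "What Went Wrong" \
    "Action Items (with owners and deadlines)"; do
    printf '\n## %s\n\nTODO\n' "$section"
  done
}

# Usage:
# d=$(date +%Y-%m-%d)
# postmortem_skeleton "$d" > "docs/postmortems/$d-webhook-outage.md"
```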

Disaster Recovery Testing

Quarterly DR Drills

Schedule and execute DR tests quarterly:

Test Scenarios:

Primary region failover (Q1, Q3)
Redis failure and recovery (Q2, Q4)
Database restore (Q1, Q3)
Complete service rollback (Q2, Q4)

Test Checklist:

Schedule during low-traffic window
Notify stakeholders 1 week in advance
Prepare rollback plan
Document actual RTO/RPO achieved
Update runbook based on findings

Prevention and Monitoring

Preventive Measures

Redundancy:
- Two-VPS architecture (platform + runners)
- Redis persistence (RDB snapshots)
- MongoDB daily backups
Monitoring:
- ./scripts/errors recent — check production errors on demand
- ./scripts/analytics health — full platform health check
- ./scripts/analytics costs today — cost tracking verification
Automated Recovery:
- Health check-based auto-restart (Docker restart policies)
- Circuit breakers prevent cascading failures
- DLQ for failed webhook replay

Early Warning Signs

Monitor these metrics for early detection:

Metric	Warning Threshold	Action
Success rate	< 99%	`./scripts/errors recent` — investigate logs
P95 latency	> 400ms	`./scripts/analytics performance` — check resource utilization
Open circuits	> 3	Check endpoint health
DLQ depth	> 100	Process queue manually
Redis memory	> 80%	Scale up or evict keys

Related Runbooks

Redis Incident: Redis container health, memory diagnostics, connection issues
MongoDB Incident: MongoDB container health, replication, restore procedures
Incident Response: General incident response playbook and escalation procedures
