Runbook: Webhook Disaster Recovery

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV2
Customer impact: Webhook deliveries delayed or lost
Required access: SSH to VPS, MongoDB admin credentials
Related services: curateme-backend-gateway, curateme-celery-worker, curateme-mongo

Covers webhook service recovery, Redis/MongoDB failure recovery, and region failover. Use this when webhook delivery is completely down.

Severity: P0 - Critical | Estimated RTO: 15 minutes | Estimated RPO: 5 minutes


Overview

This runbook provides step-by-step procedures for recovering from catastrophic failures in the webhook delivery system. Use this runbook when:

  • Webhook service is completely down (no webhooks delivering)
  • Database/Redis data loss or corruption
  • Complete region failure
  • Data center outage

Emergency Contacts

Role | Contact | Phone | Slack
Primary On-Call | Platform Team | +1-555-0100 | @oncall-platform
Secondary On-Call | SRE Team | +1-555-0101 | @oncall-sre
Engineering Manager | Jane Smith | +1-555-0102 | @jane-smith
CTO | John Doe | +1-555-0199 | @john-doe

PagerDuty Escalation Policy: platform-critical


Service Dependencies

Dependency | Impact if Down | Recovery Priority
Redis | No webhook queueing, circuit breaker state lost | P0 - Critical
FastAPI Service | No webhook API, no deliveries | P0 - Critical
MongoDB | No subscription management, no audit logs | P1 - High
PagerDuty | No alerting (monitoring blind) | P2 - Medium
Grafana/Prometheus | No observability, alerts delayed | P2 - Medium

Disaster Scenarios

Scenario 1: Complete Service Outage

Symptoms:

  • All webhook deliveries failing
  • Health check endpoint returning 503
  • Zero successful deliveries in past 5 minutes

Immediate Actions (0-5 minutes):

  1. Confirm Outage Scope

    # Check health endpoint
    curl https://api.curate-me.ai/api/v1/health

    # Check recent errors from production
    ./scripts/errors recent

    # Check platform health
    ./scripts/analytics health
  2. Trigger Incident

    • Create PagerDuty incident: P0: Webhook Service Complete Outage
    • Post to Slack: #incident-response
    • Update status page: https://status.curate-me.ai 
  3. Check Infrastructure Status

    # SSH to platform VPS
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP

    # Check container status
    docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

    # Check service logs
    docker logs curateme-backend-gateway --tail 200 --timestamps 2>&1 | grep -iE "error|webhook|fail"
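During an outage it can help to reduce the container listing to only the containers that need attention. The following is a minimal sketch, assuming the input comes from `docker ps -a --format '{{.Names}}\t{{.Status}}'` (without the `table` prefix, so there is no header row). `unhealthy_containers` is a hypothetical helper name and is not part of the repository scripts.

```shell
# Print the names of containers that are not running or report (unhealthy).
# Reads "name<TAB>status" lines on stdin, as produced by:
#   docker ps -a --format '{{.Names}}\t{{.Status}}'
unhealthy_containers() {
  awk -F '\t' '
    # Running containers have a status that starts with "Up ".
    $2 !~ /^Up / || $2 ~ /\(unhealthy\)/ { print $1 }
  '
}

# Example (run on the VPS):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | unhealthy_containers
```

An empty result means every container is up and none reports a failing Docker health check. It does not prove the application inside is serving traffic.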

Recovery Actions (5-15 minutes):

  1. Restart Service (Quick Fix)

    # Restart the backend service
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    cd /opt/curateme && docker compose -f docker-compose.production.yml restart backend

    # Wait for health check
    for i in {1..30}; do
      if curl -f https://api.curate-me.ai/api/v1/health; then
        echo "Service recovered"
        break
      fi
      sleep 2
    done
  2. If Restart Fails — Deploy Last Known Good Version

    # From local machine, deploy last stable build
    ./scripts/deploy-to-vps.sh --backend

    # Or on VPS, roll back to previous image
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP
    cd /opt/curateme && ./deploy/vps/deploy.sh update
  3. Verify Recovery

    # Check health
    curl https://api.curate-me.ai/api/v1/health | jq

    # Check for new errors
    ./scripts/errors recent

    # Verify metrics
    ./scripts/analytics health
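The wait loop in the restart step does not report failure, so it is easy to miss the point where the procedure should move on to redeploying. One possible approach is a small function with an explicit exit status. This is a sketch only: `wait_for_health` and its arguments are hypothetical names, and the 30 attempts and 2 second delay simply mirror the loop above.

```shell
# Poll a health URL until it returns a successful HTTP status.
# Usage: wait_for_health <url> [attempts] [delay_seconds]
# Returns 0 once the endpoint responds, 1 if all attempts fail.
wait_for_health() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503.
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "Service recovered after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Service still unhealthy after $attempts attempts" >&2
  return 1
}

# Example:
#   wait_for_health https://api.curate-me.ai/api/v1/health || echo "Proceed to redeploy"
```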

Post-Recovery (15-60 minutes):

  1. Process Dead Letter Queue (DLQ)
  2. Investigate root cause (check logs, metrics)
  3. Update postmortem document
  4. Schedule blameless postmortem meeting

Scenario 2: Redis Failure / Data Loss

Symptoms:

  • Circuit breaker state lost (all circuits reset to CLOSED)
  • Webhook queue empty
  • Rate limiting not working
  • Session data missing

Immediate Actions:

  1. Check Redis Health

    ssh $DEPLOY_USER@$PLATFORM_VPS_IP

    # Check Redis container
    docker ps -a --filter name=curateme-redis

    # Connect to Redis
    docker exec curateme-redis redis-cli ping

    # Check memory usage
    docker exec curateme-redis redis-cli info memory

    # Check connected clients
    docker exec curateme-redis redis-cli client list

    For deeper Redis diagnostics, see Redis Incident.

  2. If Redis Down — Restart Redis

    docker restart curateme-redis

    # Verify it responds
    docker exec curateme-redis redis-cli ping
  3. If Data Loss — Restore from Backup

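    # Check available backups on VPS
    ls -la /opt/curateme/backups/redis/

    # Stop Redis before copying: a running instance may rewrite dump.rdb
    # on shutdown and overwrite the restored file
    docker stop curateme-redis

    # Restore from latest RDB dump
    docker cp /opt/curateme/backups/redis/latest.rdb curateme-redis:/data/dump.rdb
    docker start curateme-redis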

Data Recovery Priority:

Data Type | Recovery Method | Priority
Circuit breaker state | Re-learn from live traffic (auto-recover) | P1
Webhook queue | Replay from MongoDB audit log | P0
Rate limit counters | Reset (acceptable loss) | P3
Idempotency keys | Replay from last 24h (acceptable duplicates) | P2

Post-Recovery:

  1. Monitor circuit breaker auto-recovery (expect 5-10 minutes for state rebuild)
  2. Manually replay failed webhooks from DLQ
  3. Review Redis persistence configuration (save directives)
  4. Verify cost accumulation resumed: ./scripts/analytics costs today
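When reviewing the save directives, it is useful to know how stale the on-disk snapshot actually is, since that bounds how much Redis data a crash would lose. The sketch below parses the `rdb_last_save_time` field from `redis-cli info persistence`. `rdb_save_age` is a hypothetical helper, and the optional first argument (current Unix time) exists only to make the function easy to test.

```shell
# Print seconds elapsed since the last successful RDB save.
# Reads `redis-cli info persistence` output on stdin.
# $1 is the current Unix time (defaults to `date +%s`).
rdb_save_age() {
  now=${1:-$(date +%s)}
  # INFO lines end in CRLF, so strip carriage returns before parsing.
  last=$(tr -d '\r' | awk -F: '$1 == "rdb_last_save_time" { print $2 }')
  if [ -z "$last" ]; then
    echo "rdb_last_save_time not found" >&2
    return 1
  fi
  echo $((now - last))
}

# Example (run on the VPS):
#   docker exec curateme-redis redis-cli info persistence | rdb_save_age
```

The active save points themselves can be read with `docker exec curateme-redis redis-cli config get save`.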

Scenario 3: Database (MongoDB) Failure

Symptoms:

  • Subscription management fails
  • Audit logs not writing
  • Certificate storage unavailable

Immediate Actions:

  1. Check MongoDB Status

    ssh $DEPLOY_USER@$PLATFORM_VPS_IP

    # Check MongoDB container
    docker ps -a --filter name=curateme-mongo

    # Test connectivity
    docker exec curateme-mongo mongosh --eval "db.runCommand({ ping: 1 })"

    # Check replication status
    docker exec curateme-mongo mongosh --eval "rs.status()"

    For deeper MongoDB diagnostics, see MongoDB Incident.

  2. If MongoDB Down — Restart

    docker restart curateme-mongo

    # Verify it responds
    docker exec curateme-mongo mongosh --eval "db.runCommand({ ping: 1 })"
  3. Degraded Mode Operation

    • Webhook deliveries continue (subscriptions cached in Redis)
    • New subscriptions temporarily disabled
    • Certificate management read-only

Recovery Timeline:

  • Container restart: 1-3 minutes
  • Failover (multi-region): 5-10 minutes
  • Full recovery: 15-30 minutes

Scenario 4: Complete Region Failure

Symptoms:

  • All services in primary region unreachable
  • DNS resolution timeout
  • Hetzner status page shows region outage

Immediate Actions (Failover):

  1. Trigger DR Failover

    # Update DNS to point to secondary VPS
    # (Use your DNS provider dashboard or API)

    # Activate services on secondary VPS (curateme-runners)
    ssh $DEPLOY_USER@$RUNNERS_VPS_IP
    cd /opt/curateme && docker compose -f docker-compose.production.yml up -d
  2. Verify Secondary Region Health

    # Health check on secondary
    curl https://api.curate-me.ai/api/v1/health

    # Check errors
    ./scripts/errors recent

    # Full health check
    ./scripts/analytics health
  3. Communication

    • Update status page: “Failover in progress”
    • Notify customers via email
    • Post to Slack #incidents

Expected Impact:

  • RTO: 15 minutes (DNS propagation time)
  • RPO: 5 minutes (last MongoDB sync)
  • Temporary latency increase for some users
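Because the RTO for this scenario is dominated by DNS propagation, it is worth confirming that the record has actually switched before declaring the failover complete. A minimal sketch, assuming `dig` is installed and the API hostname resolves through a plain A record. `dns_points_to` is a hypothetical helper name.

```shell
# Succeed if any A record returned for host $1 equals IP address $2.
# Queries only the resolver configured on this machine; other resolvers
# may keep serving the old record until its TTL expires.
dns_points_to() {
  # -F fixed string, -x whole-line match, -q quiet (exit status only)
  dig +short "$1" A | grep -Fxq "$2"
}

# Example:
#   until dns_points_to api.curate-me.ai "$RUNNERS_VPS_IP"; do sleep 10; done
```

A successful check from one resolver is a necessary signal, not a sufficient one, so the health check on the secondary should still be run afterwards.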

Rollback Procedures

Rollback Recent Deployment

If a recent deployment caused the outage:

# On VPS: check recent deploy history
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
./deploy/vps/deploy.sh status

# Rollback via deploy script
./deploy/vps/deploy.sh update

# Or from local machine, deploy a specific known-good commit
git checkout <known-good-commit>
./scripts/deploy-to-vps.sh --backend

Rollback Database Migration

If a database migration caused issues:

cd services/backend
poetry run alembic downgrade -1

# Or restore from backup
ssh $DEPLOY_USER@$PLATFORM_VPS_IP

# Check available MongoDB backups
ls -la /opt/curateme/backups/mongo/

# Restore from backup
docker exec curateme-mongo mongorestore --drop /data/backup/latest/

Validation and Testing

Post-Recovery Health Checks

Run these checks to validate recovery:

# 1. Health endpoint
curl https://api.curate-me.ai/api/v1/health | jq

# 2. Test webhook delivery
curl -X POST https://api.curate-me.ai/api/v1/webhooks/deliver \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "webhook_id": "test_recovery",
    "url": "https://webhook.site/test",
    "payload": {"test": "recovery"}
  }'

# 3. Check platform health and recent errors
./scripts/analytics health
./scripts/errors recent

# 4. Verify cost tracking is current
./scripts/analytics costs today

SLO Compliance Verification

Ensure SLOs are met post-recovery:

SLO | Target | Check Method
Success Rate | >= 99.5% | ./scripts/analytics health
P95 Latency | < 300ms | ./scripts/analytics performance
Throughput | >= 100 req/s | ./scripts/analytics health
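The targets above can be encoded as a quick pass/fail check so that the comparison is not done by eye under pressure. This is a sketch under stated assumptions: the output format of `./scripts/analytics` is not documented here, so the measured values are passed in by hand, and `check_slo` is a hypothetical helper name.

```shell
# Compare measured values against the SLO targets listed above.
# Usage: check_slo <success_rate_pct> <p95_latency_ms> <throughput_rps>
# Prints one PASS/FAIL line per SLO; returns non-zero if any target is missed.
check_slo() {
  awk -v sr="$1" -v p95="$2" -v tp="$3" '
    function report(ok, label) {
      status = ok ? "PASS " : "FAIL "
      print status label
      if (!ok) fail = 1
    }
    BEGIN {
      report(sr + 0 >= 99.5, "success rate " sr "%")
      report(p95 + 0 < 300, "p95 latency " p95 "ms")
      report(tp + 0 >= 100, "throughput " tp " req/s")
      exit fail + 0
    }
  '
}

# Example:
#   check_slo 99.7 250 120 && echo "SLOs met"
```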

Backup and Restore Procedures

Redis Backups

Scheduled Backups:

  • Frequency: Every 6 hours
  • Retention: 7 days
  • Location: /opt/curateme/backups/redis/ on VPS

Manual Backup:

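ssh $DEPLOY_USER@$PLATFORM_VPS_IP

# Trigger manual RDB save (BGSAVE runs in the background)
docker exec curateme-redis redis-cli bgsave

# Wait until this reports rdb_bgsave_in_progress:0 before copying
docker exec curateme-redis redis-cli info persistence | grep rdb_bgsave_in_progress

# Copy dump to backup location
docker cp curateme-redis:/data/dump.rdb /opt/curateme/backups/redis/manual-backup-$(date +%Y%m%d-%H%M).rdb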

Restore from Backup:

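# List available backups
ls -la /opt/curateme/backups/redis/

# Stop Redis before copying: a running instance may rewrite dump.rdb
# on shutdown and overwrite the restored file
docker stop curateme-redis

# Restore specific backup
docker cp /opt/curateme/backups/redis/<backup-file>.rdb curateme-redis:/data/dump.rdb
docker start curateme-redis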
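If no file named `latest.rdb` is present, the newest dump can be selected by modification time instead of reading the listing manually. A minimal sketch, assuming backup file names contain no newlines. `latest_rdb_backup` is a hypothetical helper name.

```shell
# Print the path of the most recently modified *.rdb file in directory $1.
# Returns non-zero if the directory contains no .rdb files.
latest_rdb_backup() {
  # ls -t sorts newest first; errors from an unmatched glob are discarded.
  newest=$(ls -t "$1"/*.rdb 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  echo "$newest"
}

# Example (run on the VPS):
#   latest_rdb_backup /opt/curateme/backups/redis
```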

MongoDB Backups

Automatic Backups:

  • Frequency: Daily (configured via cron on VPS)
  • Retention: 30 days
  • Type: mongodump snapshots

Manual Backup:

ssh $DEPLOY_USER@$PLATFORM_VPS_IP
docker exec curateme-mongo mongodump --out /data/backup/manual-$(date +%Y%m%d-%H%M)

Restore from Backup:

# List available backups
ls -la /opt/curateme/backups/mongo/

# Restore to specific snapshot
docker exec curateme-mongo mongorestore --drop /data/backup/<snapshot-dir>/
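To check that the 30-day retention is actually being enforced, the snapshots older than the retention window can be listed. The sketch below only prints candidates and deletes nothing. It assumes a `find` that supports `-mindepth` and `-maxdepth` (GNU, BSD, and BusyBox all do), and `expired_backups` is a hypothetical helper name.

```shell
# List entries directly under backup directory $1 whose modification time
# is more than $2 days old (default 30, matching the retention above).
# Only prints candidates; removing them is left as a deliberate manual step.
expired_backups() {
  find "$1" -mindepth 1 -maxdepth 1 -mtime +"${2:-30}" | sort
}

# Example (run on the VPS):
#   expired_backups /opt/curateme/backups/mongo
```

A non-empty result suggests the cleanup job is not running, which is worth noting before disk space becomes a second incident.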

Escalation Matrix

Escalation Timeline

Time Elapsed | Action
0 minutes | On-call engineer paged
5 minutes | Secondary on-call paged if no response
15 minutes | Engineering Manager notified
30 minutes | CTO notified
60 minutes | CEO notified (customer-facing impact)
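The timeline above can be expressed as a lookup so that an incident bot or a checklist script can state which escalation step is due. This is an illustrative sketch only: `escalation_step` is a hypothetical helper, and it reports the latest step reached without checking whether earlier steps were acknowledged.

```shell
# Print the latest escalation step that is due after $1 minutes,
# following the timeline table above.
escalation_step() {
  minutes=$1
  if   [ "$minutes" -ge 60 ]; then echo "CEO notified"
  elif [ "$minutes" -ge 30 ]; then echo "CTO notified"
  elif [ "$minutes" -ge 15 ]; then echo "Engineering Manager notified"
  elif [ "$minutes" -ge 5 ];  then echo "Secondary on-call paged"
  else                             echo "On-call engineer paged"
  fi
}

# Example:
#   escalation_step 20
```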

Escalation Commands

# Trigger PagerDuty escalation
# The PagerDuty REST API requires a From header (email of a valid account user)
# when creating incidents; $PAGERDUTY_USER_EMAIL is a placeholder variable name.
curl -X POST https://api.pagerduty.com/incidents \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "From: $PAGERDUTY_USER_EMAIL" \
  -d '{
    "incident": {
      "type": "incident",
      "title": "Webhook Service P0 Outage",
      "service": {"id": "PXXXXXX", "type": "service_reference"},
      "urgency": "high",
      "escalation_policy": {"id": "PXXXXXX", "type": "escalation_policy_reference"}
    }
  }'

Communication Templates

Status Page Update (Incident Detected)

[INVESTIGATING] Webhook Delivery Service Issues

We are currently investigating issues with webhook deliveries. Some webhooks may be delayed or failing.

Our engineering team is investigating and we will provide updates as more information becomes available.

Posted: <timestamp>
Status: Investigating

Status Page Update (Resolved)

[RESOLVED] Webhook Delivery Service Restored

The webhook delivery issues have been resolved. All services are operating normally. Delayed webhooks have been processed.

Root cause: [Brief description]
Impacted time: [Start] - [End]
Affected webhooks: [Count]

We apologize for any inconvenience caused.

Posted: <timestamp>
Status: Resolved

Customer Email Template

Subject: [RESOLVED] Webhook Delivery Service Incident

Dear Customer,

We want to inform you about a recent incident affecting our webhook delivery service:

Incident Summary:
- Start time: [timestamp]
- End time: [timestamp]
- Duration: [duration]
- Impact: Some webhooks may have been delayed or failed

Resolution:
All affected webhooks have been reprocessed and delivered. No action is required on your part.

Root Cause:
[Brief technical explanation]

Prevention:
We have implemented [preventive measures] to prevent recurrence.

If you have any questions or concerns, please contact our support team.

Thank you for your patience and understanding.

Best regards,
curate-me.ai Platform Team

Postmortem Template

After resolving the incident, complete a postmortem within 48 hours:

Required Sections:

  1. Executive Summary
  2. Timeline of Events
  3. Root Cause Analysis (5 Whys)
  4. Impact Assessment (customers, revenue, SLO breach)
  5. What Went Well
  6. What Went Wrong
  7. Action Items (with owners and deadlines)

Postmortem Location: docs/postmortems/YYYY-MM-DD-webhook-outage.md


Disaster Recovery Testing

Quarterly DR Drills

Schedule and execute DR tests quarterly:

Test Scenarios:

  1. Primary region failover (Q1, Q3)
  2. Redis failure and recovery (Q2, Q4)
  3. Database restore (Q1, Q3)
  4. Complete service rollback (Q2, Q4)

Test Checklist:

  • Schedule during low-traffic window
  • Notify stakeholders 1 week in advance
  • Prepare rollback plan
  • Document actual RTO/RPO achieved
  • Update runbook based on findings

Prevention and Monitoring

Preventive Measures

  1. Redundancy:

    • Two-VPS architecture (platform + runners)
    • Redis persistence (RDB snapshots)
    • MongoDB daily backups
  2. Monitoring:

    • ./scripts/errors recent — check production errors on demand
    • ./scripts/analytics health — full platform health check
    • ./scripts/analytics costs today — cost tracking verification
  3. Automated Recovery:

    • Health check-based auto-restart (Docker restart policies)
    • Circuit breakers prevent cascading failures
    • DLQ for failed webhook replay

Early Warning Signs

Monitor these metrics for early detection:

Metric | Warning Threshold | Action
Success rate | < 99% | ./scripts/errors recent (investigate logs)
P95 latency | > 400ms | ./scripts/analytics performance (check resource utilization)
Open circuits | > 3 | Check endpoint health
DLQ depth | > 100 | Process queue manually
Redis memory | > 80% | Scale up or evict keys
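The Redis memory threshold can be checked directly from `redis-cli info memory`. The sketch below assumes the 80% figure is relative to the configured `maxmemory` limit, which the table does not state explicitly. `redis_memory_pct` is a hypothetical helper name.

```shell
# Print Redis memory usage as an integer percentage of maxmemory.
# Reads `redis-cli info memory` output on stdin. Returns non-zero when
# maxmemory is 0 (no limit configured), since no percentage exists then.
redis_memory_pct() {
  # INFO lines end in CRLF, so strip carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 <= 0) exit 1
      printf "%d\n", used * 100 / max
    }
  '
}

# Example (run on the VPS):
#   docker exec curateme-redis redis-cli info memory | redis_memory_pct
```

If the function fails because no `maxmemory` limit is set, compare `used_memory` against the host or container memory limit instead.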

Related Runbooks

  • Redis Incident: Redis container health, memory diagnostics, connection issues
  • MongoDB Incident: MongoDB container health, replication, restore procedures
  • Incident Response: General incident response playbook and escalation procedures