Runbook: Deployment Procedure

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV2
Customer impact: Potential downtime during deploy window
Required access: SSH to VPS, git push access
Related services: curateme-backend-b2b, curateme-backend-gateway, curateme-dashboard, curateme-docs, curateme-celery-worker

This runbook covers deploying the Curate-Me platform to production. The platform runs on two Hetzner VPS instances behind Caddy (auto-HTTPS) with Docker Compose. Deployments are triggered from your local machine and use a lock file to prevent concurrent deploys.


Quick Reference

| Item | Value |
| --- | --- |
| Platform VPS | $DEPLOY_USER@$PLATFORM_VPS_IP (curateme-platform) |
| Runners VPS | $DEPLOY_USER@$RUNNERS_VPS_IP (curateme-runners) |
| Compose file | docker-compose.production.yml |
| Env file | .env.production (VPS-local, preserved across deploys) |
| Dashboard | https://dashboard.curate-me.ai |
| API / Gateway | https://api.curate-me.ai |
| Docs | https://docs.curate-me.ai |
| Deploy script | scripts/deploy-to-vps.sh |
| VPS-side script | deploy/vps/deploy.sh |
| Verify script | scripts/post-deploy-verify.sh |
| Typical duration | 2-5 minutes (dashboard-only) / 5-10 minutes (full) |

Pre-Deployment Checklist

Before deploying, confirm the following:

  1. All tests pass locally — run npm run test and check gateway tests with cd services/backend && poetry run pytest tests/gateway/ -x
  2. No uncommitted changes — the deploy script warns but lets you proceed; commit first to avoid confusion
  3. You are on the correct branch — default deploy branch is develop
  4. SSH access works: ssh -o ConnectTimeout=5 $DEPLOY_USER@$PLATFORM_VPS_IP "echo ok"
  5. No other deploy in progress — the script uses /tmp/curateme-deploy.lock to prevent concurrency

# Quick pre-flight
git status
npm run type-check
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && docker compose -f docker-compose.production.yml ps --format 'table {{.Name}}\t{{.Status}}'"
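
Checklist item 5 depends on a PID-based lock file. The following is a minimal sketch of how such a lock can work, not the deploy script's actual code: the function names acquire_lock and release_lock are hypothetical, and only the lock path is taken from this runbook.

```shell
# Sketch of a PID-based deploy lock. Function names are hypothetical;
# the default path is the one named in the checklist above.
LOCK_FILE="${LOCK_FILE:-/tmp/curateme-deploy.lock}"

acquire_lock() {
  local attempt pid
  for attempt in 1 2; do
    # With noclobber set, ">" fails if the file already exists, so the
    # existence check and the creation happen in a single step.
    if ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
      return 0
    fi
    pid="$(cat "$LOCK_FILE" 2>/dev/null)"
    # kill -0 sends no signal; it only reports whether the PID is alive.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
      echo "Deploy already in progress (PID $pid)" >&2
      return 1
    fi
    # The owning process is gone: clear the stale lock and retry once.
    rm -f "$LOCK_FILE"
  done
  return 1
}

release_lock() {
  # Remove the lock only if this shell owns it.
  if [ "$(cat "$LOCK_FILE" 2>/dev/null)" = "$$" ]; then
    rm -f "$LOCK_FILE"
  fi
}

# Typical use at the top of a script:
#   acquire_lock || exit 1
#   trap release_lock EXIT
```

If a deploy is killed without cleaning up, the stale-PID check lets the next run take over the lock instead of requiring someone to delete the file by hand.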

Standard Deployment (Auto-Detect Changed Services)

The default mode detects which services changed since the last push and rebuilds only those. This is the most common deployment path.

./scripts/deploy-to-vps.sh

What happens under the hood:

  1. Acquires deploy lock (PID-based)
  2. Checks for uncommitted changes
  3. Detects changed services by diffing origin/develop..HEAD
  4. Runs Python syntax check on changed files
  5. Pushes to origin/develop
  6. Verifies SSH connectivity
  7. Creates pre-deploy MongoDB backup
  8. Preserves VPS-local Caddyfile and .env.production
  9. Pulls code on VPS (git fetch && git reset --hard)
  10. Restores VPS-local config files
  11. Builds and restarts only changed services
  12. Polls health checks (up to 3 minutes)
  13. Prunes old Docker images
  14. Runs canary check and post-deploy verification
  15. Records deploy event to the System Ops API
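
Step 3 can be pictured as a lookup from changed paths to deploy targets. This is a rough sketch under stated assumptions rather than the script's real logic: the function name targets_for_paths is hypothetical, and the path rules are inferred from the directories and files this runbook mentions.

```shell
# Sketch: map changed file paths (one per line on stdin) to deploy targets.
# The path-to-target rules are assumptions; the real script may differ.
targets_for_paths() {
  local path targets=""
  while IFS= read -r path; do
    case "$path" in
      apps/dashboard/*)   targets="$targets dashboard" ;;
      services/backend/*) targets="$targets backend" ;;
      docker-compose.production.yml|*Dockerfile*)
                          targets="$targets full" ;;
    esac
  done
  # One target per line, with duplicates and blank lines removed.
  printf '%s\n' "$targets" | tr ' ' '\n' | sort -u | sed '/^$/d'
}

# Example (needs a git checkout with an origin/develop ref):
#   git diff --name-only origin/develop..HEAD | targets_for_paths
```

Paths that match no rule produce no target, so a docs-only or README-only change would rebuild nothing under this sketch.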

What to look for: The script outputs [SUCCESS] All N services healthy! when the deploy is complete. If it says Some services may not be fully healthy yet, check container logs immediately.
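
The health polling in step 12 amounts to retrying a probe until it succeeds or a deadline passes. A minimal sketch follows, assuming a generic helper named poll_until (hypothetical) and a 5-second retry interval, which this runbook does not specify.

```shell
# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
poll_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1   # still failing when the deadline passed
    fi
    sleep "$interval"
  done
  return 0
}

# Example: wait up to 3 minutes for the gateway health endpoint.
#   poll_until 180 5 curl -sf -o /dev/null https://api.curate-me.ai/gateway/health
```

Passing the probe as a command keeps the helper reusable for any of the health endpoints listed under Post-Deployment Verification.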


Dashboard-Only Deploy

Use when you have only made frontend changes to apps/dashboard/.

./scripts/deploy-to-vps.sh --dashboard

This rebuilds the dashboard Docker image (Next.js build) and restarts only that container. Typical time: 2-3 minutes. Backend services remain untouched.


Backend-Only Deploy

Use for changes in services/backend/ (gateway, B2B API, Celery workers).

./scripts/deploy-to-vps.sh --backend

This rebuilds and restarts: backend-b2b, backend-gateway, runner-agent, celery-worker, celery-beat.

For gateway-only changes (no B2B API or worker changes):

./scripts/deploy-to-vps.sh --gateway

Full Rebuild

Use after infrastructure changes (Dockerfile, Compose file, Caddyfile, env vars) or when debugging mysterious issues.

./scripts/deploy-to-vps.sh --full

This runs docker compose build --no-cache followed by up -d --remove-orphans --force-recreate. Every container is rebuilt from scratch. Expect 8-12 minutes of downtime.


Post-Deployment Verification

The deploy script automatically runs verification, but you can also run it manually:

# Production verification with Slack alerting
./scripts/post-deploy-verify.sh --production --notify

# Production with verbose output (shows response bodies)
./scripts/post-deploy-verify.sh --production --verbose

# Machine-readable output for CI
./scripts/post-deploy-verify.sh --production --json

Checks performed:

| Check | Endpoint | Expected |
| --- | --- | --- |
| Gateway health | GET /gateway/health | 200, contains “status” |
| Gateway v1 health | GET /v1/health | 200 |
| B2B API health | GET /api/v1/health | 200, contains “status” |
| Dashboard | GET / | 307 (redirect to login) |
| Docs site | GET / | 200 |
| Redis connectivity | Via gateway health JSON | “connected” |
| MongoDB connectivity | Via gateway health JSON | “connected” |
| Gateway proxy | GET /v1/models with API key | 200 (proves auth + proxy) |

What to look for: All checks should show [PASS]. A [WARN] on latency (> 3000ms) after deploy is normal for the first request (cold start). A [FAIL] on Gateway health means the container crashed — check logs immediately.

# Quick manual smoke test
curl -sf https://api.curate-me.ai/gateway/health | jq .
curl -sf https://api.curate-me.ai/v1/health | jq .
curl -sf -o /dev/null -w "%{http_code}" https://dashboard.curate-me.ai
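
To make the PASS / WARN / FAIL rules above concrete, here is a small sketch of how a single check result could be classified. The labels and the 3000 ms latency threshold come from this runbook; the function name report_check and the rest of the output format are assumptions, not the verify script's real output.

```shell
# report_check NAME EXPECTED_STATUS ACTUAL_STATUS LATENCY_MS
# Prints one result line. Only the [PASS]/[WARN]/[FAIL] labels and the
# 3000 ms threshold are from the runbook; the line format is made up.
report_check() {
  local name="$1" expected="$2" actual="$3" latency_ms="$4"
  if [ "$actual" != "$expected" ]; then
    echo "[FAIL] $name: expected HTTP $expected, got $actual"
    return 1
  fi
  if [ "$latency_ms" -gt 3000 ]; then
    echo "[WARN] $name: HTTP $actual in ${latency_ms}ms (slow)"
    return 0
  fi
  echo "[PASS] $name: HTTP $actual in ${latency_ms}ms"
}

# Example wiring with curl (needs network access):
#   code="$(curl -s -o /dev/null -w '%{http_code}' https://dashboard.curate-me.ai)"
#   report_check "Dashboard" 307 "$code" 0
```

A slow but correct response returns success here, matching the guidance that a latency warning on the first request after a deploy is expected.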

Rollback Procedure

If a deploy breaks production, roll back to the previous commit:

# SSH to VPS and roll back
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && \
  git checkout HEAD~1 -- . && \
  docker compose -f docker-compose.production.yml --env-file .env.production build --parallel && \
  docker compose -f docker-compose.production.yml --env-file .env.production up -d --remove-orphans"

For a single-service rollback (e.g., only gateway broke):

ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && \
  git checkout HEAD~1 -- services/backend/ && \
  docker compose -f docker-compose.production.yml --env-file .env.production build backend-gateway && \
  docker compose -f docker-compose.production.yml --env-file .env.production up -d backend-gateway"

After rollback, verify:

./scripts/post-deploy-verify.sh --production

Important: The canary check script has built-in rollback capability. If the canary check fails during a deploy, it runs the rollback automatically.
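
The canary-then-rollback behaviour can be sketched as a small wrapper. This illustrates the pattern only and is not the canary script itself: the name canary_or_rollback, the default of three attempts, and the idea of passing the check and rollback as function names are all assumptions.

```shell
# canary_or_rollback CHECK_FN ROLLBACK_FN [ATTEMPTS]
# Runs CHECK_FN up to ATTEMPTS times (default 3, an assumed value).
# If every attempt fails, runs ROLLBACK_FN and returns 1.
# Both arguments are names of functions or commands supplied by the caller.
canary_or_rollback() {
  local check="$1" rollback="$2" attempts="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    if "$check"; then
      echo "canary passed on attempt $i"
      return 0
    fi
    i=$(( i + 1 ))
  done
  echo "canary failed after $attempts attempts, rolling back" >&2
  "$rollback"
  return 1
}

# Example wiring (both functions are hypothetical):
#   gateway_ok() { curl -sf -o /dev/null https://api.curate-me.ai/gateway/health; }
#   do_rollback() { echo "run the rollback command from this section"; }
#   canary_or_rollback gateway_ok do_rollback
```

Returning non-zero after a rollback lets the calling script report the deploy as failed even though production has been restored.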


Emergency Hotfix Deploy

For critical production issues that need an immediate fix, you can skip the non-essential checks:

# Skip syntax check, backup, canary, and verification for speed
./scripts/deploy-to-vps.sh --backend \
  --skip-syntax-check \
  --no-backup \
  --skip-canary \
  --skip-verify

After the hotfix is live, manually verify:

./scripts/post-deploy-verify.sh --production --verbose

Warning: Only use --skip-syntax-check in genuine emergencies. A syntax error in production is worse than 30 extra seconds of checking.


Viewing Logs

# All services
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && docker compose -f docker-compose.production.yml logs --tail=100"

# Specific service (gateway, dashboard, backend-b2b, celery-worker, etc.)
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs --tail=200 -f curateme-backend-gateway"

# VPS-side deploy script has a shortcut
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && ./deploy/vps/deploy.sh logs"

# Service status overview
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && ./deploy/vps/deploy.sh status"

Communication Plan

| Stage | Action |
| --- | --- |
| Before deploy | Post in #engineering Slack: “Deploying [service] to production - commit abc1234” |
| Deploy complete | Slack notification sent automatically if SLACK_DEPLOY_WEBHOOK is set |
| Verification fails | Alert #engineering with failure details and begin triage |
| Rollback triggered | Post in #engineering: “Rolling back to previous version - investigating” |
| Incident resolved | Post root cause summary in thread |

Escalation

| Severity | Condition | Response |
| --- | --- | --- |
| P0 | Gateway returning 5xx to all orgs | Immediate rollback, page on-call |
| P1 | One service down, others healthy | Targeted rollback of affected service |
| P2 | Elevated latency or intermittent errors | Investigate logs, no immediate rollback |
| P3 | Non-critical service degraded (docs, marketing) | Fix forward in next deploy |

Key contacts:

  • VPS access: Any engineer whose SSH key is authorized for the curateme user
  • Docker / infra issues: Check deploy/vps/deploy.sh health output
  • Database issues: MongoDB backups at /home/curateme/backups/ on VPS

Useful Flags Reference

| Flag | Effect |
| --- | --- |
| --full | Rebuild all services with no cache |
| --dashboard | Dashboard container only |
| --backend | All backend services (B2B + Gateway + workers) |
| --gateway | Gateway container only |
| --runner | Runner-agent service only |
| --docs | Documentation site only |
| --no-push | Deploy whatever is already on remote (skip git push) |
| --no-backup | Skip pre-deploy MongoDB backup |
| --no-prune | Skip Docker image cleanup |
| --skip-canary | Skip canary health check |
| --skip-verify | Skip post-deploy verification |
| --skip-syntax-check | Skip Python syntax validation (emergency only) |
| --branch NAME | Deploy from a specific branch |
| --dry-run | Print what would happen without executing |
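
For readers who want to see how flags like these combine, here is a hedged sketch of parsing a subset of them. It is not taken from scripts/deploy-to-vps.sh: the function name parse_flags and the variables TARGET, BRANCH, DRY_RUN, and DO_BACKUP are hypothetical. The default branch (develop) and the auto-detect default are as described earlier in this runbook.

```shell
# Sketch of parsing a subset of the flags above. All variable names are
# hypothetical; the real script's internals may look different.
parse_flags() {
  TARGET="auto"        # auto-detect changed services by default
  BRANCH="develop"     # default deploy branch
  DRY_RUN=0
  DO_BACKUP=1
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --full|--dashboard|--backend|--gateway|--runner|--docs)
        TARGET="${1#--}" ;;          # strip the leading "--"
      --branch)
        [ "$#" -ge 2 ] || { echo "--branch needs a name" >&2; return 2; }
        BRANCH="$2"
        shift ;;                     # consume the branch name as well
      --dry-run)   DRY_RUN=1 ;;
      --no-backup) DO_BACKUP=0 ;;
      *) echo "unknown flag: $1" >&2; return 2 ;;
    esac
    shift
  done
}

# Example:
#   parse_flags --backend --branch hotfix/login --dry-run
#   echo "$TARGET $BRANCH $DRY_RUN"
```

Rejecting unknown flags outright, rather than ignoring them, avoids a mistyped --skip-canary silently running a different deploy than intended.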