Runbook: Deployment Procedure

Owner: Platform Team Backup owner: On-call engineer Last validated: 2026-05-04 Validation method: Manual review Severity trigger: SEV2 Customer impact: Potential downtime during deploy window Required access: SSH to VPS, git push access Related services: curateme-backend-b2b, curateme-backend-gateway, curateme-dashboard, curateme-docs, curateme-celery-worker

This runbook covers deploying the Curate-Me platform to production. The platform runs on two Hetzner VPS instances behind Caddy (auto-HTTPS) with Docker Compose. Deployments are triggered from your local machine and use a lock file to prevent concurrent deploys.

Quick Reference

Item	Value
Platform VPS	`$DEPLOY_USER@$PLATFORM_VPS_IP` (curateme-platform)
Runners VPS	`$DEPLOY_USER@$RUNNERS_VPS_IP` (curateme-runners)
Compose file	`docker-compose.production.yml`
Env file	`.env.production` (VPS-local, preserved across deploys)
Dashboard	https://dashboard.curate-me.ai
API / Gateway	https://api.curate-me.ai
Docs	https://docs.curate-me.ai
Deploy script	`scripts/deploy-to-vps.sh`
VPS-side script	`deploy/vps/deploy.sh`
Verify script	`scripts/post-deploy-verify.sh`
Typical duration	2-5 minutes (dashboard-only) / 5-10 minutes (full)

Pre-Deployment Checklist

Before deploying, confirm the following:

All tests pass locally — run npm run test and check gateway tests with cd services/backend && poetry run pytest tests/gateway/ -x
No uncommitted changes — the deploy script warns but lets you proceed; commit first to avoid confusion
You are on the correct branch — default deploy branch is develop
SSH access works — ssh -o ConnectTimeout=5 $DEPLOY_USER@$PLATFORM_VPS_IP "echo ok"
No other deploy in progress — the script uses /tmp/curateme-deploy.lock to prevent concurrency


# Quick pre-flight
git status
npm run type-check
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && docker compose -f docker-compose.production.yml ps --format 'table {{.Name}}\t{{.Status}}'"

Standard Deployment (Auto-Detect Changed Services)

The default mode detects which services changed since the last push and rebuilds only those. This is the most common deployment path.


./scripts/deploy-to-vps.sh

What happens under the hood:

Acquires deploy lock (PID-based)
Checks for uncommitted changes
Detects changed services by diffing origin/develop..HEAD
Runs Python syntax check on changed files
Pushes to origin/develop
Verifies SSH connectivity
Creates pre-deploy MongoDB backup
Preserves VPS-local Caddyfile and .env.production
Pulls code on VPS (git fetch && git reset --hard)
Restores VPS-local config files
Builds and restarts only changed services
Polls health checks (up to 3 minutes)
Prunes old Docker images
Runs canary check and post-deploy verification
Records deploy event to the System Ops API

What to look for: The script outputs [SUCCESS] All N services healthy! when the deploy is complete. If it says Some services may not be fully healthy yet, check container logs immediately.

Dashboard-Only Deploy

Use when you have only made frontend changes to apps/dashboard/.


./scripts/deploy-to-vps.sh --dashboard

This rebuilds the dashboard Docker image (Next.js build) and restarts only that container. Typical time: 2-3 minutes. Backend services remain untouched.

Backend-Only Deploy

Use for changes in services/backend/ (gateway, B2B API, Celery workers).


./scripts/deploy-to-vps.sh --backend

This rebuilds and restarts: backend-b2b, backend-gateway, runner-agent, celery-worker, celery-beat.

For gateway-only changes (no B2B API or worker changes):


./scripts/deploy-to-vps.sh --gateway

Full Rebuild

Use after infrastructure changes (Dockerfile, Compose file, Caddyfile, env vars) or when debugging mysterious issues.


./scripts/deploy-to-vps.sh --full

This runs docker compose build --no-cache followed by up -d --remove-orphans --force-recreate. Every container is rebuilt from scratch. Expect 8-12 minutes of downtime.

Post-Deployment Verification

The deploy script automatically runs verification, but you can also run it manually:


# Production verification with Slack alerting
./scripts/post-deploy-verify.sh --production --notify
 
# Production with verbose output (shows response bodies)
./scripts/post-deploy-verify.sh --production --verbose
 
# Machine-readable output for CI
./scripts/post-deploy-verify.sh --production --json

Checks performed:

Check	Endpoint	Expected
Gateway health	`GET /gateway/health`	200, contains “status”
Gateway v1 health	`GET /v1/health`	200
B2B API health	`GET /api/v1/health`	200, contains “status”
Dashboard	`GET /`	307 (redirect to login)
Docs site	`GET /`	200
Redis connectivity	Via gateway health JSON	”connected”
MongoDB connectivity	Via gateway health JSON	”connected”
Gateway proxy	`GET /v1/models` with API key	200 (proves auth + proxy)

What to look for: All checks should show [PASS]. A [WARN] on latency (> 3000ms) after deploy is normal for the first request (cold start). A [FAIL] on Gateway health means the container crashed — check logs immediately.


# Quick manual smoke test
curl -sf https://api.curate-me.ai/gateway/health | jq .
curl -sf https://api.curate-me.ai/v1/health | jq .
curl -sf -o /dev/null -w "%{http_code}" https://dashboard.curate-me.ai

Rollback Procedure

If a deploy breaks production, roll back to the previous commit:


# SSH to VPS and roll back
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && \
  git checkout HEAD~1 -- . && \
  docker compose -f docker-compose.production.yml --env-file .env.production build --parallel && \
  docker compose -f docker-compose.production.yml --env-file .env.production up -d --remove-orphans"

For a single-service rollback (e.g., only gateway broke):


ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && \
  git checkout HEAD~1 -- services/backend/ && \
  docker compose -f docker-compose.production.yml --env-file .env.production build backend-gateway && \
  docker compose -f docker-compose.production.yml --env-file .env.production up -d backend-gateway"

After rollback, verify:


./scripts/post-deploy-verify.sh --production

Important: The canary check script has built-in rollback capability. If canary fails during deploy, it executes the rollback automatically.

Emergency Hotfix Deploy

For critical production issues that need immediate fix (skip non-essential checks):


# Skip syntax check, backup, canary, and verification for speed
./scripts/deploy-to-vps.sh --backend \
  --skip-syntax-check \
  --no-backup \
  --skip-canary \
  --skip-verify

After the hotfix is live, manually verify:


./scripts/post-deploy-verify.sh --production --verbose

Warning: Only use --skip-syntax-check in genuine emergencies. A syntax error in production is worse than 30 extra seconds of checking.

Viewing Logs


# All services
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && docker compose -f docker-compose.production.yml logs --tail=100"
 
# Specific service (gateway, dashboard, backend-b2b, celery-worker, etc.)
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs --tail=200 -f curateme-backend-gateway"
 
# VPS-side deploy script has a shortcut
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && ./deploy/vps/deploy.sh logs"
 
# Service status overview
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && ./deploy/vps/deploy.sh status"

Communication Plan

Stage	Action
Before deploy	Post in #engineering Slack: “Deploying [service] to production — commit `abc1234`”
Deploy complete	Slack notification sent automatically if `SLACK_DEPLOY_WEBHOOK` is set
Verification fails	Alert #engineering with failure details and begin triage
Rollback triggered	Post in #engineering: “Rolling back to previous version — investigating”
Incident resolved	Post root cause summary in thread

Escalation

Severity	Condition	Response
P0	Gateway returning 5xx to all orgs	Immediate rollback, page on-call
P1	One service down, others healthy	Targeted rollback of affected service
P2	Elevated latency or intermittent errors	Investigate logs, no immediate rollback
P3	Non-critical service degraded (docs, marketing)	Fix forward in next deploy

Key contacts:

VPS access: Any engineer with SSH key on the curateme user
Docker / infra issues: Check deploy/vps/deploy.sh health output
Database issues: MongoDB backups at /home/curateme/backups/ on VPS

Useful Flags Reference

Flag	Effect
`--full`	Rebuild all services with no cache
`--dashboard`	Dashboard container only
`--backend`	All backend services (B2B + Gateway + workers)
`--gateway`	Gateway container only
`--runner`	Runner-agent service only
`--docs`	Documentation site only
`--no-push`	Deploy whatever is already on remote (skip git push)
`--no-backup`	Skip pre-deploy MongoDB backup
`--no-prune`	Skip Docker image cleanup
`--skip-canary`	Skip canary health check
`--skip-verify`	Skip post-deploy verification
`--skip-syntax-check`	Skip Python syntax validation (emergency only)
`--branch NAME`	Deploy from a specific branch
`--dry-run`	Print what would happen without executing

Incident Response — follow this playbook if a deploy causes a production incident
Runner Operations — runner-specific deployment and lifecycle procedures