Runbook: Deployment Procedure
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV2
Customer impact: Potential downtime during deploy window
Required access: SSH to VPS, git push access
Related services: curateme-backend-b2b, curateme-backend-gateway, curateme-dashboard, curateme-docs, curateme-celery-worker
This runbook covers deploying the Curate-Me platform to production. The platform runs on two Hetzner VPS instances behind Caddy (auto-HTTPS) with Docker Compose. Deployments are triggered from your local machine and use a lock file to prevent concurrent deploys.
Quick Reference
| Item | Value |
|---|---|
| Platform VPS | $DEPLOY_USER@$PLATFORM_VPS_IP (curateme-platform) |
| Runners VPS | $DEPLOY_USER@$RUNNERS_VPS_IP (curateme-runners) |
| Compose file | docker-compose.production.yml |
| Env file | .env.production (VPS-local, preserved across deploys) |
| Dashboard | https://dashboard.curate-me.ai |
| API / Gateway | https://api.curate-me.ai |
| Docs | https://docs.curate-me.ai |
| Deploy script | scripts/deploy-to-vps.sh |
| VPS-side script | deploy/vps/deploy.sh |
| Verify script | scripts/post-deploy-verify.sh |
| Typical duration | 2-5 minutes (dashboard-only) / 5-10 minutes (full) |
Pre-Deployment Checklist
Before deploying, confirm the following:
- All tests pass locally: run npm run test, then check gateway tests with cd services/backend && poetry run pytest tests/gateway/ -x
- No uncommitted changes: the deploy script warns but lets you proceed; commit first to avoid confusion
- You are on the correct branch: the default deploy branch is develop
- SSH access works: ssh -o ConnectTimeout=5 $DEPLOY_USER@$PLATFORM_VPS_IP "echo ok"
- No other deploy in progress: the script uses /tmp/curateme-deploy.lock to prevent concurrency
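The lock file mentioned in the last checklist item can be pictured with a short sketch. This is not the code in scripts/deploy-to-vps.sh; the function names acquire_lock and release_lock are hypothetical, and the stale-lock handling assumes the lock holder runs as the same user (kill -0 cannot probe another user's process).

```shell
# Hypothetical sketch of a PID-based deploy lock; the real script may differ.
LOCK_FILE="${LOCK_FILE:-/tmp/curateme-deploy.lock}"

acquire_lock() {
  # noclobber makes the redirect fail if the file already exists,
  # so creating the lock is a single atomic step.
  if ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
    return 0
  fi
  local pid
  pid="$(cat "$LOCK_FILE" 2>/dev/null)"
  # kill -0 sends no signal; it only tests whether the process exists.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "deploy already running (PID $pid)" >&2
    return 1
  fi
  # The recorded holder is gone: treat the lock as stale and take it over.
  echo "$$" > "$LOCK_FILE"
}

release_lock() {
  # Remove the lock only if this process owns it.
  if [ -f "$LOCK_FILE" ] && [ "$(cat "$LOCK_FILE")" = "$$" ]; then
    rm -f "$LOCK_FILE"
  fi
}
```

A script using this pattern would typically call acquire_lock at the start and register release_lock with trap ... EXIT so the lock is cleared even when the deploy fails.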
# Quick pre-flight
git status
npm run type-check
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && docker compose -f docker-compose.production.yml ps --format 'table {{.Name}}\t{{.Status}}'"
Standard Deployment (Auto-Detect Changed Services)
The default mode detects which services changed since the last push and rebuilds only those. This is the most common deployment path.
./scripts/deploy-to-vps.sh
What happens under the hood:
- Acquires deploy lock (PID-based)
- Checks for uncommitted changes
- Detects changed services by diffing origin/develop..HEAD
- Runs Python syntax check on changed files
- Pushes to origin/develop
- Verifies SSH connectivity
- Creates pre-deploy MongoDB backup
- Preserves VPS-local Caddyfile and .env.production
- Pulls code on VPS (git fetch && git reset --hard)
- Restores VPS-local config files
- Builds and restarts only changed services
- Polls health checks (up to 3 minutes)
- Prunes old Docker images
- Runs canary check and post-deploy verification
- Records deploy event to the System Ops API
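The change-detection step in the list above can be sketched as a mapping from changed paths to deploy scopes. This is an illustration only: the function name scopes_for_paths is hypothetical, and the path rules are inferred from the sections of this runbook (apps/dashboard/ for the dashboard, services/backend/ for backend services, Dockerfile, Compose file, and Caddyfile changes for a full rebuild). The actual script may use different rules.

```shell
# Hypothetical sketch: read changed file paths on stdin and print
# one deploy scope per line, de-duplicated.
scopes_for_paths() {
  while IFS= read -r path; do
    case "$path" in
      # Infrastructure files come first so they always force a full rebuild.
      docker-compose.production.yml|Caddyfile|*Dockerfile*) echo "full" ;;
      apps/dashboard/*)   echo "dashboard" ;;
      services/backend/*) echo "backend" ;;
      # Anything else (README, tests outside these trees) maps to no scope.
    esac
  done | sort -u
}

# Intended use inside a git checkout (not executed here):
#   git diff --name-only origin/develop..HEAD | scopes_for_paths
```

If the output contains full, a caller would presumably ignore the narrower scopes and rebuild everything.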
What to look for: The script outputs [SUCCESS] All N services healthy! when the deploy is complete. If it says Some services may not be fully healthy yet, check container logs immediately.
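When the script reports that services are not fully healthy yet, it can help to keep polling by hand. The sketch below is one way to poll a check with a timeout; wait_healthy is a hypothetical helper, not part of the deploy script, and the 180-second budget in the example simply mirrors the 3-minute limit listed above.

```shell
# Hypothetical helper: wait_healthy TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD repeatedly until it succeeds or the time budget is used up.
wait_healthy() {
  local timeout="$1" interval="$2"
  shift 2
  local elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    # Only sleep time is counted, so slow checks stretch the real wall time.
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Example (not executed here):
#   wait_healthy 180 5 curl -sf https://api.curate-me.ai/gateway/health
```

A non-zero return after the timeout is the cue to check container logs.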
Dashboard-Only Deploy
Use when you have only made frontend changes to apps/dashboard/.
./scripts/deploy-to-vps.sh --dashboard
This rebuilds the dashboard Docker image (Next.js build) and restarts only that container. Typical time: 2-3 minutes. Backend services remain untouched.
Backend-Only Deploy
Use for changes in services/backend/ (gateway, B2B API, Celery workers).
./scripts/deploy-to-vps.sh --backend
This rebuilds and restarts: backend-b2b, backend-gateway, runner-agent, celery-worker, celery-beat.
For gateway-only changes (no B2B API or worker changes):
./scripts/deploy-to-vps.sh --gateway
Full Rebuild
Use after infrastructure changes (Dockerfile, Compose file, Caddyfile, env vars) or when debugging mysterious issues.
./scripts/deploy-to-vps.sh --full
This runs docker compose build --no-cache followed by up -d --remove-orphans --force-recreate. Every container is rebuilt from scratch. Expect 8-12 minutes of downtime.
Post-Deployment Verification
The deploy script automatically runs verification, but you can also run it manually:
# Production verification with Slack alerting
./scripts/post-deploy-verify.sh --production --notify
# Production with verbose output (shows response bodies)
./scripts/post-deploy-verify.sh --production --verbose
# Machine-readable output for CI
./scripts/post-deploy-verify.sh --production --json
Checks performed:
| Check | Endpoint | Expected |
|---|---|---|
| Gateway health | GET /gateway/health | 200, contains “status” |
| Gateway v1 health | GET /v1/health | 200 |
| B2B API health | GET /api/v1/health | 200, contains “status” |
| Dashboard | GET / | 307 (redirect to login) |
| Docs site | GET / | 200 |
| Redis connectivity | Via gateway health JSON | “connected” |
| MongoDB connectivity | Via gateway health JSON | “connected” |
| Gateway proxy | GET /v1/models with API key | 200 (proves auth + proxy) |
What to look for: All checks should show [PASS]. A [WARN] on latency (> 3000ms) after deploy is normal for the first request (cold start). A [FAIL] on Gateway health means the container crashed — check logs immediately.
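The PASS / WARN / FAIL rules described above can be made concrete with a small sketch. The names classify_check and probe_check are hypothetical and this is not the logic of scripts/post-deploy-verify.sh; it only assumes that a wrong status code is a failure and that latency above 3000 ms downgrades a pass to a warning.

```shell
# Hypothetical sketch: classify_check EXPECTED_STATUS ACTUAL_STATUS LATENCY_MS
classify_check() {
  local expected="$1" actual="$2" latency_ms="$3"
  if [ "$actual" != "$expected" ]; then
    echo "FAIL"
  elif [ "$latency_ms" -gt 3000 ]; then
    echo "WARN"
  else
    echo "PASS"
  fi
}

# Hypothetical wrapper: probe_check URL EXPECTED_STATUS
# curl's --write-out reports the status code and total time in seconds.
probe_check() {
  local url="$1" expected="$2" result code secs ms
  result="$(curl -s -o /dev/null -w '%{http_code} %{time_total}' "$url")"
  code="${result%% *}"
  secs="${result#* }"
  # Convert fractional seconds to whole milliseconds.
  ms="$(awk -v s="$secs" 'BEGIN { printf "%d", s * 1000 }')"
  classify_check "$expected" "$code" "$ms"
}

# Example (not executed here):
#   probe_check https://api.curate-me.ai/gateway/health 200
```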
# Quick manual smoke test
curl -sf https://api.curate-me.ai/gateway/health | jq .
curl -sf https://api.curate-me.ai/v1/health | jq .
curl -sf -o /dev/null -w "%{http_code}" https://dashboard.curate-me.ai
Rollback Procedure
If a deploy breaks production, roll back to the previous commit:
# SSH to VPS and roll back
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && \
git checkout HEAD~1 -- . && \
docker compose -f docker-compose.production.yml --env-file .env.production build --parallel && \
docker compose -f docker-compose.production.yml --env-file .env.production up -d --remove-orphans"
For a single-service rollback (e.g., only gateway broke):
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && \
git checkout HEAD~1 -- services/backend/ && \
docker compose -f docker-compose.production.yml --env-file .env.production build backend-gateway && \
docker compose -f docker-compose.production.yml --env-file .env.production up -d backend-gateway"
After rollback, verify:
./scripts/post-deploy-verify.sh --production
Important: The canary check script has built-in rollback capability. If canary fails during deploy, it executes the rollback automatically.
Emergency Hotfix Deploy
For critical production issues that need an immediate fix, skip the non-essential checks:
# Skip syntax check, backup, canary, and verification for speed
./scripts/deploy-to-vps.sh --backend \
--skip-syntax-check \
--no-backup \
--skip-canary \
--skip-verify
After the hotfix is live, manually verify:
./scripts/post-deploy-verify.sh --production --verbose
Warning: Only use --skip-syntax-check in genuine emergencies. A syntax error in production is worse than 30 extra seconds of checking.
Viewing Logs
# All services
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && docker compose -f docker-compose.production.yml logs --tail=100"
# Specific service (gateway, dashboard, backend-b2b, celery-worker, etc.)
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs --tail=200 -f curateme-backend-gateway"
# VPS-side deploy script has a shortcut
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && ./deploy/vps/deploy.sh logs"
# Service status overview
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "cd ~/platform && ./deploy/vps/deploy.sh status"
Communication Plan
| Stage | Action |
|---|---|
| Before deploy | Post in #engineering Slack: “Deploying [service] to production — commit abc1234” |
| Deploy complete | Slack notification sent automatically if SLACK_DEPLOY_WEBHOOK is set |
| Verification fails | Alert #engineering with failure details and begin triage |
| Rollback triggered | Post in #engineering: “Rolling back to previous version — investigating” |
| Incident resolved | Post root cause summary in thread |
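For the manual posts in the table above, a webhook call can stand in for typing the message into Slack. The sketch below assumes SLACK_DEPLOY_WEBHOOK holds a Slack incoming-webhook URL, which accepts a JSON body with a text field. The helper names are hypothetical, and the escaping only covers single-line messages containing quotes or backslashes.

```shell
# Hypothetical helpers for posting a deploy message to Slack.

# Escape backslashes and double quotes so the message fits in a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON body expected by a Slack incoming webhook.
slack_payload() {
  printf '{"text":"%s"}' "$(json_escape "$1")"
}

notify_deploy() {
  # No webhook configured: skip quietly instead of failing the deploy.
  [ -n "${SLACK_DEPLOY_WEBHOOK:-}" ] || return 0
  curl -sf -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_DEPLOY_WEBHOOK" >/dev/null
}

# Example (not executed here):
#   notify_deploy "Deploying gateway to production, commit abc1234"
```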
Escalation
| Severity | Condition | Response |
|---|---|---|
| P0 | Gateway returning 5xx to all orgs | Immediate rollback, page on-call |
| P1 | One service down, others healthy | Targeted rollback of affected service |
| P2 | Elevated latency or intermittent errors | Investigate logs, no immediate rollback |
| P3 | Non-critical service degraded (docs, marketing) | Fix forward in next deploy |
Key contacts:
- VPS access: Any engineer with SSH key on the curateme user
- Docker / infra issues: Check deploy/vps/deploy.sh health output
- Database issues: MongoDB backups at /home/curateme/backups/ on VPS
Useful Flags Reference
| Flag | Effect |
|---|---|
| --full | Rebuild all services with no cache |
| --dashboard | Dashboard container only |
| --backend | All backend services (B2B + Gateway + workers) |
| --gateway | Gateway container only |
| --runner | Runner-agent service only |
| --docs | Documentation site only |
| --no-push | Deploy whatever is already on remote (skip git push) |
| --no-backup | Skip pre-deploy MongoDB backup |
| --no-prune | Skip Docker image cleanup |
| --skip-canary | Skip canary health check |
| --skip-verify | Skip post-deploy verification |
| --skip-syntax-check | Skip Python syntax validation (emergency only) |
| --branch NAME | Deploy from a specific branch |
| --dry-run | Print what would happen without executing |
Related Runbooks
- Incident Response — follow this playbook if a deploy causes a production incident
- Runner Operations — runner-specific deployment and lifecycle procedures