Rollout & upgrades
What happens to existing tenant deployments when you ship a new release.
Cutting a new release (Versioning & releases) does nothing to existing deployments. Each tenant's instance is pinned to the manifest it deployed against. Upgrades are explicit — either you initiate them, or the tenant does. This page covers what an upgrade actually does, how to roll one out, and how to roll back.
What an upgrade workflow does
The deploy worker has a dedicated UpgradeWorkflow (cloudflare/deploy-worker/src/workflows/upgrade.ts) that:
- fetch_deployment: loads the current deployment row.
- resolve_upgrade_env (or fetch_upgrade_image_services for image-only bumps): re-resolves env vars so new placeholders or integration bindings get picked up.
- snapshot_current: writes deployments.previousTemplateVersion = the current bundle_version so a rollback has somewhere to land.
- upgrade_compute: calls serviceInstanceUpdate({ source: { image } }) on each Railway service and triggers a redeploy.
- await_ready_*: polls the readiness endpoint until it returns 2xx.
- finalize: flips the deployment back to status='running' and progress: 100.
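The step order above can be sketched as a plain TypeScript function. This is a minimal sketch, not the code in upgrade.ts: the DeploymentRow and UpgradeOps shapes, the "upgrading" interim status, and the maxPolls limit are all hypothetical, and the Railway call and readiness poll are injected so the sketch runs without infrastructure. It does show why snapshot_current runs before upgrade_compute: the rollback target is recorded before compute is touched.

```typescript
// Hypothetical subset of the deployments row. Field names mirror this page;
// the "upgrading" status is an assumption (the page only says finalize flips the row back to running).
interface DeploymentRow {
  id: string;
  status: "running" | "upgrading" | "failed";
  progress: number;
  bundleVersion: string;
  previousTemplateVersion: string | null;
  image: string;
}

// Hypothetical side effects, injected so the sketch runs without Railway or a database.
interface UpgradeOps {
  resolveEnv(row: DeploymentRow): void; // resolve_upgrade_env
  updateServiceImage(row: DeploymentRow, image: string): void; // serviceInstanceUpdate + redeploy
  isReady(row: DeploymentRow): boolean; // one readiness poll; true means a 2xx came back
}

function runUpgrade(row: DeploymentRow, targetImage: string, ops: UpgradeOps, maxPolls = 10): DeploymentRow {
  // fetch_deployment: work on a copy of the loaded row.
  const next: DeploymentRow = { ...row, status: "upgrading" };
  // resolve_upgrade_env: pick up new placeholders and integration bindings.
  ops.resolveEnv(next);
  // snapshot_current: remember the version a rollback should land on.
  next.previousTemplateVersion = row.bundleVersion;
  // upgrade_compute: swap the image tag, which is what makes the redeploy pull new code.
  ops.updateServiceImage(next, targetImage);
  next.image = targetImage;
  // await_ready_*: a real workflow would sleep between polls; this sketch polls back to back.
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    if (ops.isReady(next)) {
      // finalize: back to running at 100%.
      return { ...next, status: "running", progress: 100 };
    }
  }
  // Readiness never returned 2xx: mark the row failed.
  return { ...next, status: "failed" };
}
```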
The web API writes the target version to deployments.bundle_version before firing the workflow, so while the upgrade is in flight the version pinned on the row, which is what tenants see, is already the target. deployments.manifest is also updated to the new snapshot. There is no tool_release_id column; pinning is via deployments.deployed_template_version (varchar) and the manifest JSONB.
If the health check fails after the image swap, the deployment is marked failed — Railway keeps the old running revision because redeploy is not destructive.
Without the image swap, redeploying just reruns the previously set image. The image tag has to change for the upgrade to actually pull new code.
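A pre-flight check along these lines can catch a no-op upgrade before it reaches Railway. It is a hypothetical helper, assuming tag-style references such as ghcr.io/me/my-tool:1.0.3; digest references are rejected rather than parsed, and an untagged reference is treated as :latest, following the usual Docker convention.

```typescript
// Splits "registry/name:tag" into name and tag (hypothetical helper).
function splitImageRef(ref: string): { name: string; tag: string } {
  if (ref.includes("@")) {
    throw new Error(`digest references are not handled in this sketch: ${ref}`);
  }
  const lastSlash = ref.lastIndexOf("/");
  const lastColon = ref.lastIndexOf(":");
  // A colon before the last slash belongs to a registry port (e.g. "localhost:5000/app").
  if (lastColon > lastSlash) {
    return { name: ref.slice(0, lastColon), tag: ref.slice(lastColon + 1) };
  }
  // Docker convention: an untagged reference means ":latest".
  return { name: ref, tag: "latest" };
}

// True when switching from currentRef to targetRef changes the image reference,
// i.e. the redeploy will not simply rerun the previously set image.
function pullsNewCode(currentRef: string, targetRef: string): boolean {
  const cur = splitImageRef(currentRef);
  const tgt = splitImageRef(targetRef);
  return cur.name !== tgt.name || cur.tag !== tgt.tag;
}
```

Comparing tags is only a proxy: re-pushing new code under an unchanged tag would slip past this check, which is one more reason to publish a fresh tag per release.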
Tenant-driven vs platform-driven vs auto-update
Three upgrade paths exist, only one of which the tenant initiates:
- Tenant-driven upgrade. The dashboard shows an "Update available" banner on deployments whose deployments.deployed_template_version is older than the active tool_releases.template_version. The tenant clicks Update, which POSTs to /api/deployments/[id]/upgrade and kicks off UpgradeWorkflow. This is the default path.
- Platform-driven upgrade. For a security patch or hotfix, a Vendo operator (or a tool author with admin access) initiates the upgrade on every affected deployment without waiting for the tenant.
- Auto-update (customize-enabled tools). workers/update-watcher runs hourly (cron "0 * * * *"). For deployments backed by a customize-enabled tool, it detects upstream commits, queues a pending_updates row, and fires UpdateWorkflow in the deploy worker (resolver → judge → gate). Depending on the gate's decision, the patch is either auto-applied or surfaced to the tenant for confirmation. The watcher honors a weekly cap (default 5/week) and the per-deployment auto_update_paused_at. See Updating a tool for the full description.
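The banner condition for tenant-driven upgrades boils down to comparing two version strings. A sketch, assuming template versions are plain dotted numbers such as 1.0.0 and 1.1.0 (the examples this page uses); compareVersions and updateAvailable are hypothetical names. Comparing segment by segment matters: a plain string comparison would rank 1.10.0 below 1.9.0.

```typescript
// Compares dotted numeric versions ("1.0.3" vs "1.1.0"); returns <0, 0, or >0.
// Assumes MAJOR.MINOR.PATCH-style strings; pre-release or build suffixes are rejected.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  if ([...pa, ...pb].some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`unsupported version string: ${a} / ${b}`);
  }
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    // Missing segments count as 0, so "1.0" equals "1.0.0".
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Hypothetical banner predicate: show "Update available" when the deployment's
// pinned template version is older than the tool's active release.
function updateAvailable(deployedTemplateVersion: string, activeTemplateVersion: string): boolean {
  return compareVersions(deployedTemplateVersion, activeTemplateVersion) < 0;
}
```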
The manifest-driven release-upgrade path (the focus of this page) never auto-applies — only the customize/update-watcher path does, and it only touches tools that explicitly opt in.
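The watcher's two brakes on the auto-update path can be expressed as a small predicate. This is a sketch under stated assumptions: it applies the weekly cap per deployment over a rolling seven-day window, treats any non-null auto_update_paused_at as paused, and leaves the resolver, judge, and gate out entirely. WatchedDeployment and mayQueueUpdate are hypothetical names, and the real watcher may count differently.

```typescript
// Hypothetical view of a deployment as the watcher sees it.
interface WatchedDeployment {
  id: string;
  customizeEnabled: boolean;
  autoUpdatePausedAt: Date | null; // mirrors auto_update_paused_at
}

const DEFAULT_WEEKLY_CAP = 5;
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Decides whether the watcher may queue another pending_updates row.
// `recentUpdates` holds timestamps of updates already queued for this deployment.
function mayQueueUpdate(
  dep: WatchedDeployment,
  recentUpdates: Date[],
  now: Date,
  weeklyCap: number = DEFAULT_WEEKLY_CAP,
): boolean {
  if (!dep.customizeEnabled) return false; // only opted-in tools auto-update
  if (dep.autoUpdatePausedAt !== null) return false; // tenant paused auto-updates
  const windowStart = now.getTime() - WEEK_MS;
  const inWindow = recentUpdates.filter((t) => t.getTime() > windowStart).length;
  return inWindow < weeklyCap;
}
```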
Rolling out a patch (mass upgrade)
For a security patch or fix where every existing deployment should move:
- Cut the new release.
- Query for deployments still on old versions:

  ```sql
  SELECT d.id, d.tenant_id, d.deployed_template_version, d.status
  FROM deployments d
  JOIN apps_catalog t ON t.slug = d.tool_slug
  JOIN tool_releases tr ON tr.tool_id = t.id AND tr.status = 'active'
  WHERE d.tool_slug = 'my-tool'
    AND d.status = 'running'
    AND d.deployed_template_version <> tr.template_version;
  ```

- For each row, POST /api/deployments/[id]/upgrade via an admin script. Rate-limit so you don't saturate Railway.
- Watch deploy_logs for failures and retry as needed.
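The POST-and-rate-limit step can be sketched as a sequential loop with a fixed gap between requests. The bearer-token header, the base URL, and the delay are assumptions, and rolloutUpgrades is a hypothetical name. The post parameter accepts the global fetch but is injectable so the loop can be tested offline. A successful POST only means the workflow started, so failures still have to be chased in deploy_logs.

```typescript
// Minimal request function shape; the global fetch satisfies it.
type PostFn = (
  url: string,
  init: { method: "POST"; headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number }>;

interface RolloutResult {
  started: string[]; // upgrade workflow accepted
  failed: { id: string; status: number | null }[]; // null = network error
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function rolloutUpgrades(
  ids: string[],
  opts: { baseUrl: string; adminToken: string; delayMs: number; post: PostFn },
): Promise<RolloutResult> {
  const result: RolloutResult = { started: [], failed: [] };
  for (let i = 0; i < ids.length; i++) {
    const id = ids[i];
    if (i > 0) await sleep(opts.delayMs); // crude rate limit: one request per delayMs
    const url = `${opts.baseUrl}/api/deployments/${encodeURIComponent(id)}/upgrade`;
    try {
      // Assumed auth scheme: an admin bearer token.
      const res = await opts.post(url, { method: "POST", headers: { Authorization: `Bearer ${opts.adminToken}` } });
      if (res.ok) result.started.push(id);
      else result.failed.push({ id, status: res.status });
    } catch {
      result.failed.push({ id, status: null });
    }
  }
  return result;
}
```

In practice you would call it with post set to fetch and a delay of a few seconds, then re-run it with the ids from result.failed once the underlying problem is fixed.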
Vendo doesn't ship a one-click "upgrade everyone" UI — it would be too easy to take down every tenant of a tool simultaneously. A patch rollout is a deliberate, scriptable operation.
Rolling back
Rollback is just an upgrade aimed backwards:
- Cut a new release pointing at the previous version. Don't reactivate the old row; keep created_at ordering meaningful.
- Upgrade affected deployments to the new (older-content) release.
Old manifest versions stay in R2 indefinitely. The image tag you reference has to still exist in its registry — if you garbage-collected ghcr.io/me/my-tool:1.0.3, rollback fails.
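Before rolling back, it is worth confirming the old tag still exists. A sketch using the OCI Distribution API, which answers HEAD /v2/<name>/manifests/<tag> with 200 when the manifest exists and 404 when it does not. It assumes references include an explicit registry host and tag; manifestUrl and imageTagExists are hypothetical names, authentication is left to the caller, and ghcr.io generally expects a bearer token even for public images.

```typescript
// "ghcr.io/me/my-tool:1.0.3" -> "https://ghcr.io/v2/me/my-tool/manifests/1.0.3"
function manifestUrl(ref: string): string {
  const firstSlash = ref.indexOf("/");
  const lastColon = ref.lastIndexOf(":");
  if (firstSlash < 0 || lastColon < ref.lastIndexOf("/")) {
    throw new Error(`expected registry/name:tag, got ${ref}`);
  }
  const registry = ref.slice(0, firstSlash);
  const name = ref.slice(firstSlash + 1, lastColon);
  const tag = ref.slice(lastColon + 1);
  return `https://${registry}/v2/${name}/manifests/${tag}`;
}

// Minimal request shape; the global fetch satisfies it.
type HeadFn = (url: string, init: { method: "HEAD"; headers: Record<string, string> }) => Promise<{ status: number }>;

// Accept both OCI and Docker manifest media types so the registry doesn't 404 on a type mismatch.
const MANIFEST_ACCEPT = [
  "application/vnd.oci.image.index.v1+json",
  "application/vnd.oci.image.manifest.v1+json",
  "application/vnd.docker.distribution.manifest.list.v2+json",
  "application/vnd.docker.distribution.manifest.v2+json",
].join(", ");

async function imageTagExists(ref: string, head: HeadFn, bearerToken?: string): Promise<boolean> {
  const headers: Record<string, string> = { Accept: MANIFEST_ACCEPT };
  if (bearerToken) headers.Authorization = `Bearer ${bearerToken}`;
  const res = await head(manifestUrl(ref), { method: "HEAD", headers });
  if (res.status === 200) return true;
  if (res.status === 404) return false;
  // 401/403 and friends mean "couldn't tell", not "missing".
  throw new Error(`registry returned ${res.status} for ${ref}; check credentials`);
}
```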
Deployment management endpoints
| Endpoint | What it does | When to use |
|---|---|---|
| POST /api/deployments/[id]/restart | Pushes every deployment_env_vars row into Railway via variableUpsert and redeploys each service. Status flows running → restarting → running. | Env var edits, transient crash loops, or a fresh boot without a new image. |
| POST /api/deployments/[id]/upgrade | Runs UpgradeWorkflow: re-resolves env vars, swaps the image tag, redeploys, and polls readiness. Updates deployments.manifest, bundle_version, and deployed_template_version. | Shipping a new release. |
| POST /api/deployments/[id]/retry | Spawns a fresh DeployWorkflow that re-runs the idempotent phases on a failed row. Reuses the original vendo_api_key, admin password, and user env vars, so no state is lost. | A previous deploy died midway. |
| POST /api/deployments/[id]/rollback | Swaps bundle_version ↔ previousTemplateVersion and fires the deploy worker's /upgrade endpoint with the previous version. | Undoing a release that's already in production. |
| POST /api/deployments/[id]/suspend | Runs SuspendWorkflow: Railway deploymentRemove, Neon compute pause, and the app proxy flips to status-page mode. | Pausing a deployment without losing state. |
| POST /api/deployments/[id]/resume | Runs ResumeWorkflow: databases first, then compute, then readiness. | Bringing a suspended deployment back. |
| POST /api/deployments/[id]/destroy | Runs TeardownWorkflow. Destructive; see Teardown. | Permanent decommission. |
All seven endpoints are tenant-callable from the dashboard. None require operator intervention.
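A typed helper keeps admin scripts from calling an endpoint that doesn't exist. A sketch: DeploymentAction, actionPath, and swapForRollback are hypothetical names. The seven actions come from the table above, and swapForRollback only models the bundle_version ↔ previousTemplateVersion swap described for rollback; refusing to roll back when no snapshot exists is an assumption of this sketch.

```typescript
// The seven actions from the endpoint table.
type DeploymentAction = "restart" | "upgrade" | "retry" | "rollback" | "suspend" | "resume" | "destroy";

// Path relative to wherever the dashboard API is served.
function actionPath(id: string, action: DeploymentAction): string {
  return `/api/deployments/${encodeURIComponent(id)}/${action}`;
}

// The two version pins the rollback endpoint swaps.
interface VersionPins {
  bundleVersion: string;
  previousTemplateVersion: string | null;
}

function swapForRollback(pins: VersionPins): VersionPins {
  if (pins.previousTemplateVersion === null) {
    throw new Error("nothing to roll back to: previousTemplateVersion is empty");
  }
  return { bundleVersion: pins.previousTemplateVersion, previousTemplateVersion: pins.bundleVersion };
}
```

Because the swap is symmetric, rolling back twice lands on the original pins.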
Suspend/resume during rollout
If you upgrade a suspended deployment, the upgrade resumes it first. Resumed deployments inherit the new version. This is usually what you want — sleeping deployments should not stay on a known-bad version.
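One way to order the calls when the target is suspended is sketched here. This is an assumed ordering, not the workflow's actual code: it resumes databases, then compute, then waits for readiness (the ResumeWorkflow order from the endpoint table) before upgrading. The real workflow may instead bring the deployment straight up on the new version. LifecycleOps and upgradeWithResume are hypothetical names.

```typescript
type Status = "running" | "suspended" | "failed";

// Hypothetical operations standing in for ResumeWorkflow and UpgradeWorkflow phases.
interface LifecycleOps {
  resumeDatabases(id: string): Promise<void>;
  resumeCompute(id: string): Promise<void>;
  awaitReady(id: string): Promise<void>;
  upgrade(id: string, targetVersion: string): Promise<void>;
}

async function upgradeWithResume(
  dep: { id: string; status: Status },
  targetVersion: string,
  ops: LifecycleOps,
): Promise<void> {
  if (dep.status === "suspended") {
    // Databases first, so compute doesn't boot against a paused Neon endpoint.
    await ops.resumeDatabases(dep.id);
    await ops.resumeCompute(dep.id);
    await ops.awaitReady(dep.id);
  }
  // Running (or freshly resumed) deployments go straight to the upgrade.
  await ops.upgrade(dep.id, targetVersion);
}
```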
What can break
- Env var rename across versions. If 1.0.0 reads BOT_TOKEN and 1.1.0 reads TELEGRAM_BOT_TOKEN, existing deployments still have the old name in deployment_env_vars until the upgrade re-resolves. Use the provider's connectionEnvVars registry to keep names stable across releases.
- Database schema migrations. Your tool owns its Postgres. If 1.1.0 needs a new column, your container's startup must run the migration before serving traffic. The readiness probe is your gate.
- Volume layout changes. Volumes are preserved across upgrades. Moving data from /data/v1 to /data/v2 is your job, not Vendo's.
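For schema migrations, the pattern is to keep the readiness endpoint failing until startup migrations finish. A minimal sketch, assuming your tool exposes a readiness handler that can return an arbitrary HTTP status; ReadinessGate is a hypothetical name, and the migrate function is whatever migration runner your tool already uses.

```typescript
// Lives inside your tool's container, not in Vendo.
class ReadinessGate {
  private ready = false;

  // HTTP status your readiness handler should return right now.
  status(): number {
    return this.ready ? 200 : 503;
  }

  // Run migrations before flipping to ready. On failure the gate stays at 503,
  // so the upgrade's readiness poll never sees a 2xx from a half-migrated instance.
  async runStartup(migrate: () => Promise<void>): Promise<void> {
    await migrate();
    this.ready = true;
  }
}
```

Wire status() into the handler the readiness probe hits. If the migration throws, the instance never reports ready, so the upgrade is marked failed and Railway keeps the previous revision running.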
Next: Operate — running a tool after it's deployed.