Healthchecks
The healthcheck contract, what counts as a failure, and what happens to a deployment that fails it.
Vendo polls every running deployment with an HTTP healthcheck. If you ship a tool that doesn't answer correctly, it ends up tagged "degraded" in the dashboard, even when it's functionally working. This page tells you exactly what the contract is.
The contract
Your tool MUST serve a healthcheck endpoint that:
- Responds to GET over plain HTTP (no auth, no cookies).
- Returns a 2xx status when the tool is healthy (200 and 204 both pass; the probe checks response.ok).
- Returns within 10 seconds.
The default path is /healthz. Your template manifest can override this with the healthEndpoint field — pick a path that doesn't collide with your app routes (/health, /api/health, /_healthz are all fine).
Anything outside 2xx — 3xx, 4xx, 5xx, or a timeout — counts as unhealthy.
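As a rough illustration of the contract above, here is a minimal endpoint built only on Python's standard library. The names HealthHandler and make_server, the "ok" body, and port 8080 are assumptions made for this sketch, not part of Vendo's contract; in a real tool you would add the equivalent route in whatever framework you already use.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Default path from the contract; change it if your manifest sets healthEndpoint.
HEALTH_PATH = "/healthz"


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers the healthcheck, 404s everything else."""

    def do_GET(self):
        if self.path == HEALTH_PATH:
            body = b"ok"
            self.send_response(200)  # any 2xx passes the probe
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr in this sketch


def make_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Build the server; the caller decides when to start serving."""
    return ThreadingHTTPServer((host, port), HealthHandler)


# To run it standalone:
#   make_server().serve_forever()
```

The handler does no I/O beyond writing the response, so it stays far inside the 10-second limit.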
What "healthy" should mean
Two reasonable choices:
- Liveness only. Return 200 if the process is up. Cheap, never lies about external dependencies. Good default.
- Liveness + critical deps. Return 200 only if the process is up AND your Postgres is reachable AND your Redis is reachable. More accurate, but you can flap if a dep has a transient blip.
Avoid heavy work in the healthcheck path. No DB writes, no external API calls (you'll burn credits on every probe), no long queries. The probe runs every 15 minutes by default; if it takes 8 seconds, you've made your tool slower and your bill bigger.
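One way to keep the "liveness + critical deps" variant cheap is to limit each dependency check to a short TCP connect instead of a query. The sketch below assumes that approach. The helper names tcp_reachable and health_status, the 1-second timeout, and the 503 status for a failed dependency are all illustrative choices (any non-2xx status would count as unhealthy).

```python
import socket
from typing import Callable, Dict, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Cheap reachability check: open and close a TCP connection, no queries."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Return (HTTP status, per-dependency results).

    An empty `checks` dict is the liveness-only case and always yields 200.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as a failed dependency
    status = 200 if all(results.values()) else 503
    return status, results


# Example wiring (host names are placeholders):
#   health_status({
#       "postgres": lambda: tcp_reachable("db.internal", 5432),
#       "redis": lambda: tcp_reachable("cache.internal", 6379),
#   })
```

Because the checks run one after another, the worst case is roughly the sum of the timeouts, so keep the list short enough to stay well under the 10-second limit.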
The cadence
Vendo's health-monitor worker probes every running deployment every 15 minutes. The interval is deliberate — Neon Postgres can scale to zero after ~10 minutes idle, and probing more often would wake the database back up on every tick.
After 3 consecutive failures (~45 minutes), your deployment's health_status flips to degraded. A single successful probe resets the failure counter and flips status back to healthy immediately — there's no recovery debounce. If you flap, the dashboard flaps with you.
Every probe writes a row to the health_checks table (status, status code, response time, error message). It's the canonical debug surface for "when did this start failing."
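The failure counting described in this section can be modeled as a small state machine. This is a sketch of the documented behavior, not Vendo's actual worker code; the class name HealthTracker and the initial "healthy" status are assumptions.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failures before the status flips to degraded


@dataclass
class HealthTracker:
    """Hypothetical model of one deployment's health_status transitions."""

    consecutive_failures: int = 0
    status: str = "healthy"  # assumed starting state for this sketch

    def record(self, probe_ok: bool) -> str:
        """Apply one probe result and return the resulting status."""
        if probe_ok:
            # A single success resets the counter and recovers immediately.
            self.consecutive_failures = 0
            self.status = "healthy"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.status = "degraded"
        return self.status
```

Feeding it the sequence fail, fail, fail, success yields healthy, healthy, degraded, healthy, which is the flapping behavior the dashboard will show.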
What "degraded" actually does
Today, degraded is a dashboard signal only. It surfaces in the deployment list and on the detail page so the user knows something's off. It does not:
- Restart your container.
- Suspend the deployment.
- Block requests through the app-proxy.
- Page the user.
Your real users keep hitting your tool's URL. A failing healthcheck and a failing app are two different things — Vendo trusts you to know the difference.
When healthchecks bite you
There are three patterns that consistently cause unhealthy-but-working deployments. None of them are bugs in the platform.
- No /healthz route, no override. Your manifest didn't declare healthEndpoint, your app doesn't have /healthz, and now every probe is a 404. Permanently degraded. Fix: add the route, or override the path in the manifest.
- Healthcheck behind auth. You added /healthz but it's behind your app's session middleware. The probe sends no cookies; every probe is a 401 or 302. Fix: exclude the healthcheck path from auth middleware.
- Healthcheck depends on a slow upstream. Your /healthz calls an LLM "to verify connectivity". Probes time out at 10s, healthcheck fails, you burn credits on every tick. Fix: return liveness only, never call external APIs from the healthcheck.
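For the second pattern, the fix usually amounts to an allowlist check at the top of the auth middleware. The sketch below shows the idea as plain WSGI so it stays framework-neutral. The require_session wrapper, the toy app, and the cookie test are hypothetical stand-ins for whatever session handling your tool really uses.

```python
from typing import Callable, Iterable

# Add your healthEndpoint override here if your manifest declares one.
EXEMPT_PATHS = {"/healthz"}


def require_session(app: Callable, exempt_paths: Iterable[str] = EXEMPT_PATHS) -> Callable:
    """Hypothetical WSGI auth middleware that skips the healthcheck path."""
    exempt = set(exempt_paths)

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path in exempt:
            # Probes carry no cookies, so let them through untouched.
            return app(environ, start_response)
        # Simplistic stand-in for a real session lookup.
        if "session=" not in environ.get("HTTP_COOKIE", ""):
            start_response("401 Unauthorized", [("Content-Type", "text/plain")])
            return [b"unauthorized"]
        return app(environ, start_response)

    return middleware


def app(environ, start_response):
    """Toy application: every route returns 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


protected = require_session(app)
```

With this wrapper, a cookie-less GET to /healthz reaches the app and returns 200, while a cookie-less request to any other path still gets a 401.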
Deploy-time vs ongoing healthchecks
The above describes the ongoing healthcheck. There's also a one-time check Vendo runs during deploy, before flipping the deployment to running. Same path, same expectations, but a different worker. If it fails, your deploy fails outright and the deployment lands in failed rather than running.
For Railway tools, this means: if you can't pass your own healthcheck locally, you won't get past the deploy pipeline.
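To check this locally before deploying, you can imitate the probe with a few lines of standard-library Python. This is an approximation of the documented rules (GET, no cookies, 2xx only, 10-second timeout), not Vendo's actual probe. The function name probe is an assumption, and urllib's timeout applies to each blocking socket operation, not to the request as a whole.

```python
import http.client
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses as errors instead of following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise HTTPError for the 3xx


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True only for a 2xx answer; 3xx, 4xx, 5xx, errors, and timeouts are False."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # HTTPError (non-2xx), URLError (e.g. connection refused), and socket
        # timeouts are all OSError subclasses.
        return False


# Example against a locally running tool (port is a placeholder):
#   probe("http://127.0.0.1:8080/healthz")
```

If this returns False against your local instance, expect the deploy-time check to fail for the same reason.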
Related
- Logs and observability — where to look when probes start failing.
- Scaling and limits — what happens at the platform layer when things go wrong.