Per-call costs
How the proxy meters each call, where the margin sits, and how to estimate cost before you deploy.
Every call your tool makes through {provider}-proxy.vendo.run is metered and billed to the tenant's wallet. This page covers exactly how that number is calculated, where Vendo's markup sits, and how to estimate the bill your tool will produce before you ship it.
The formula
Cost is computed per event, in arbitrary-precision decimal, rounded to integer micro-USD at the end with banker's rounding:
cost_micros = round_half_to_even(
quantity × unit_cost × (1 + margin_pct / 100) × 1_000_000
)

| Variable | Where it comes from |
|---|---|
| unit_cost | The rates table. Per-provider, per-meter, per-dimension. Refreshed hourly for OpenRouter from the upstream price feed; manual for others. |
| quantity | The upstream provider's usage payload (input tokens, output tokens, characters, audio seconds). |
| margin_pct | resolve_margin(tenant, meter, at_time), which cascades from tenant-specific overrides down to a global default of 20%. |
Everything after this is integer micro-USD until the dashboard renders it.
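The formula above can be sketched in TypeScript using BigInt fixed-point arithmetic as a stand-in for arbitrary-precision decimal. The function names, the string inputs, and the 12-digit scale are assumptions made for illustration; this is not the proxy's actual code.

```typescript
// Hypothetical sketch, not the proxy's implementation.
const SCALE = 12; // assumed number of decimal places kept per operand

// Parse a non-negative decimal string ("0.00000015") into an integer scaled by 10^SCALE.
function toScaled(value: string): bigint {
  const [whole, frac = ""] = value.split(".");
  if (frac.length > SCALE) throw new Error(`too many decimal places: ${value}`);
  return BigInt(whole + frac.padEnd(SCALE, "0"));
}

// Integer division of non-negative values; exact ties go to the even quotient.
function divRoundHalfEven(n: bigint, d: bigint): bigint {
  const q = n / d;
  const twiceRemainder = (n % d) * 2n;
  const roundUp = twiceRemainder > d || (twiceRemainder === d && q % 2n === 1n);
  return roundUp ? q + 1n : q;
}

// cost_micros = round_half_to_even(quantity × unit_cost × (1 + margin_pct / 100) × 1_000_000)
function costMicros(quantity: string, unitCostUsd: string, marginPct: string): bigint {
  // (1 + margin_pct / 100) is kept as (100 + margin_pct) / 100 so everything stays integral.
  const hundredPlusMargin = toScaled("100") + toScaled(marginPct);
  const numerator = toScaled(quantity) * toScaled(unitCostUsd) * hundredPlusMargin * 1000000n;
  // Three scaled operands were multiplied, plus the /100 from the percentage.
  const denominator = 10n ** BigInt(3 * SCALE) * 100n;
  return divRoundHalfEven(numerator, denominator);
}
```

For example, 1,000 input tokens at $0.15 per 1M tokens (0.00000015 USD per token) with the default 20% margin is `costMicros("1000", "0.00000015", "20")`, which evaluates to 180 micro-USD.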
If unit_cost lookup returns no active row for the provider-resolved model, the proxy returns 402 before forwarding upstream — no upstream call, no cost, no usage_events row. This is the rate-miss block: it keeps Vendo from paying upstream costs Vendo can't price back to the tenant. Each miss is also persisted to rate_miss_events (supabase/migrations/192_rate_miss_events.sql), bucketed hourly by (tenant, integration, normalized_model), so operators can find them via proxy/src/lib/rate-miss.ts or the admin dashboard.
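The order of operations in the rate-miss block (look up the rate, block with 402, record the miss, and only otherwise forward) might look roughly like the sketch below. The `RateRow` shape, `checkRate`, and the `recordMiss` callback are hypothetical names; the real logic lives in the proxy and is not shown on this page.

```typescript
// Hypothetical shapes for illustration only.
interface RateRow {
  provider: string;
  model: string;
  unitCostUsd: string;
  active: boolean;
}

type RateDecision =
  | { kind: "forward"; rate: RateRow }
  | { kind: "block"; status: 402 };

// Decide before any upstream call is made whether the request can be priced.
function checkRate(
  rates: RateRow[],
  provider: string,
  model: string,
  recordMiss: (provider: string, model: string) => void,
): RateDecision {
  const rate = rates.find((r) => r.active && r.provider === provider && r.model === model);
  if (rate === undefined) {
    // No active rate row: persist the miss and block without forwarding upstream.
    recordMiss(provider, model);
    return { kind: "block", status: 402 };
  }
  return { kind: "forward", rate };
}
```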
What's metered
| Integration | Metric | Pre-margin cost (illustrative) |
|---|---|---|
| OpenRouter | per token, by model | rates refreshed hourly from upstream /api/v1/models |
| Anthropic | per 1M tokens, by model | claude-sonnet-4-5: $3 input / $15 output; opus tier (claude-opus-4-*): $15 input / $75 output (seed: 213_anthropic_rates_4_x_versioned.sql) |
| OpenAI | per 1M tokens, by model | gpt-4o-mini: $0.15 input / $0.60 output; gpt-4o: $2.50 input / $10 output (seed: 155_openai_rates_seed.sql) |
| ElevenLabs TTS | per 1,000 chars | $0.06 (Flash/Turbo), $0.12 (Multilingual) |
| ElevenLabs STT | per audio hour | $0.39 |
| AssemblyAI | per audio hour | $0.75 (ceiling, all features) |
| muapi (image/video) | per submit call | model-dependent |
| Telegram | not metered | adapter returns costUsd: 0 (packages/integrations/telegram/adapter.ts); bandwidth metering is a follow-up |
| Notion | not metered through proxy | OAuth-only install, no *-proxy subdomain |
Provider rates change. The live, authoritative numbers are at vendo.run/pricing, which reads directly from the rates table and serves through a 1-hour cache (web/src/app/(marketing)/pricing/page.tsx — revalidate = 3600).
Where the margin sits
Margin is applied at the proxy layer, on top of the upstream's unit cost. The global default is 20%; the effective margin is resolved per (tenant, meter) by checking these scopes from most to least specific:
- Tenant + meter ({tenant_id, meter_id}): specific overrides for a tenant on a specific meter.
- Tenant + integration ({tenant_id, integration_slug}): a tenant-wide deal on one provider.
- Tenant ({tenant_id}): a flat per-tenant override.
- Meter ({meter_id}): a per-meter default.
- Integration ({integration_slug}): a per-integration default.
- Global ({}): seeded at 20%.
First match wins. Old usage events resolve to the rule effective at their timestamp — margin changes never alter historical bills.
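A minimal sketch of the cascade, assuming each rule carries an `effectiveFrom` timestamp and that the newest rule in the most specific matching scope wins. The `MarginRule` and `Lookup` shapes and the `resolveMargin` function are hypothetical; only the scope order comes from the list above.

```typescript
// Hypothetical rule shape; field names are assumptions for illustration.
interface MarginRule {
  tenantId?: string;
  meterId?: string;
  integrationSlug?: string;
  marginPct: number;
  effectiveFrom: Date;
}

interface Lookup {
  tenantId: string;
  meterId: string;
  integrationSlug: string;
  at: Date; // timestamp of the usage event
}

type Key = "tenantId" | "meterId" | "integrationSlug";
const ALL_KEYS: Key[] = ["tenantId", "meterId", "integrationSlug"];

// Scope order from the list above, most specific first; [] is the global rule.
const CASCADE: Key[][] = [
  ["tenantId", "meterId"],
  ["tenantId", "integrationSlug"],
  ["tenantId"],
  ["meterId"],
  ["integrationSlug"],
  [],
];

function resolveMargin(rules: MarginRule[], q: Lookup): number {
  for (const scope of CASCADE) {
    const candidates = rules
      // Only rules already in effect at the event's timestamp.
      .filter((r) => r.effectiveFrom.getTime() <= q.at.getTime())
      // A rule belongs to this scope if it sets exactly the scope's keys, with matching values.
      .filter((r) => ALL_KEYS.every((k) => (scope.includes(k) ? r[k] === q[k] : r[k] === undefined)))
      // Newest first, so a later change supersedes an earlier one in the same scope.
      .sort((a, b) => b.effectiveFrom.getTime() - a.effectiveFrom.getTime());
    if (candidates.length > 0) return candidates[0].marginPct;
  }
  throw new Error("no margin rule matched; a global rule is expected to exist");
}
```

Because the lookup filters on the event's own timestamp, re-resolving an old usage event returns the margin that applied when it happened, which is the property that keeps historical bills stable.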
As a tool author, you don't set or see margin directly. It's folded into the price the tenant pays. The number you can quote to a prospective tenant is upstream_unit_cost × (1 + 0.20) for the default case.
Margin is uniform across the call: the same percentage on input and output tokens, the same percentage at every model tier. There's no "free tier" subsidy or "bulk discount" — the math is linear all the way through.
Holds and refunds (reserve / settle)
When your tool fires a request, the proxy doesn't know the exact cost until the upstream responds. To stop concurrent requests from over-spending under a single cached balance, it deducts a fixed hold up front, then refunds the difference after settling:
Tool → Proxy: request
Proxy → KV: reserve(tenant, hold_cap) # cache deducted by hold
Proxy → Upstream: forward
Upstream → Proxy: response (+ usage)
Proxy → Tool: pass-through
Proxy → DB: record_billed_usage # write the real cost row
Proxy → KV: settle(tenant, hold, actual) # refund (hold - actual)

Hold caps are upper bounds, not averages (source: proxy/src/balance/holds.ts):

| Adapter | Hold cap |
|---|---|
| openai, anthropic, openrouter | $1.00 |
| elevenlabs, muapi | $0.50 |
| assemblyai | $0.20 |
| telegram | $0.001 |
| Any other adapter not in the map (e.g. gemini) | $1.00 (HOLD_CAP_DEFAULT_USD) |
A request with a true cost of $0.003 still trips a $1.00 hold during the call, then refunds $0.997 once the upstream usage is known. This is invisible to your tool — balance() will momentarily show the lower value if you read it mid-call, but it self-heals within the response window.
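The reserve/settle arithmetic can be sketched with an in-memory balance in integer micro-USD. The hold caps are the ones from the table above; the `CachedBalance` class and `holdCapMicros` helper are hypothetical stand-ins for the KV-backed logic and do not model concurrency or insufficient-balance handling.

```typescript
// Hold caps from the table above, converted to micro-USD.
const HOLD_CAP_DEFAULT_MICROS = 1_000_000;
const HOLD_CAPS_MICROS: Record<string, number> = {
  openai: 1_000_000,
  anthropic: 1_000_000,
  openrouter: 1_000_000,
  elevenlabs: 500_000,
  muapi: 500_000,
  assemblyai: 200_000,
  telegram: 1_000,
};

// Adapters missing from the map fall back to the default cap.
function holdCapMicros(adapter: string): number {
  return HOLD_CAPS_MICROS[adapter] ?? HOLD_CAP_DEFAULT_MICROS;
}

// Hypothetical in-memory stand-in for the cached balance.
class CachedBalance {
  private micros: number;

  constructor(initialMicros: number) {
    this.micros = initialMicros;
  }

  // Deduct the full hold up front, before the upstream call.
  reserve(holdMicros: number): void {
    this.micros -= holdMicros;
  }

  // Once the actual cost is known, refund the unused part of the hold.
  settle(holdMicros: number, actualMicros: number): void {
    this.micros += holdMicros - actualMicros;
  }

  read(): number {
    return this.micros;
  }
}
```

With a $5.00 balance and an openai call that really costs $0.003, `reserve` drops the cached balance to $4.00 for the duration of the call, and `settle` brings it back to $4.997.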
Free routes
Some routes the proxy explicitly does not meter:
| Integration | Route |
|---|---|
openai | GET /v1/models, GET /v1/models/:id |
anthropic | POST /v1/messages/count_tokens, GET /v1/models, GET /v1/models/:id |
These work at zero balance — handy for tools that list models in their UI or count tokens before deciding whether to fire a real call. Other adapters route every request through recordUsage (routes: "passthrough") and are billed regardless.
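If your tool wants to know locally whether a request will be billed, the table above can be encoded as a small matcher. This is a hypothetical client-side helper (`FREE_ROUTES` and `isFreeRoute` are illustrative names), not the proxy's routing code, and it only knows the routes listed on this page.

```typescript
// Free routes from the table above; everything else is assumed to be metered.
const FREE_ROUTES: Record<string, Array<{ method: string; pattern: RegExp }>> = {
  openai: [
    // GET /v1/models and GET /v1/models/:id
    { method: "GET", pattern: /^\/v1\/models(\/[^\/]+)?$/ },
  ],
  anthropic: [
    { method: "POST", pattern: /^\/v1\/messages\/count_tokens$/ },
    { method: "GET", pattern: /^\/v1\/models(\/[^\/]+)?$/ },
  ],
};

function isFreeRoute(integration: string, method: string, path: string): boolean {
  const routes = FREE_ROUTES[integration] ?? [];
  return routes.some((r) => r.method === method.toUpperCase() && r.pattern.test(path));
}
```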
Compute is metered too
If your tool is tool_type: deployment, you're also paying for Railway compute and the per-tenant Neon database. These don't go through the proxy — instead, the hourly billing-rollup worker samples Railway's GraphQL usage feed and Neon's metrics, applies the same margin rules, and writes a single deduction per tenant per hour per resource.
Typical numbers (pre-margin; meter units shown match the meters table in supabase/migrations/071_pricing_seed_data.sql):
| Resource | Meter unit | Pre-margin rate | At idle (1 tenant) |
|---|---|---|---|
| Railway vCPU | cpu_vcpu_minutes | $0.000463 / vCPU-minute (~$0.0000077 / s) | minimal — most idle services pin near 0% CPU |
| Railway memory | memory_gb_minutes | $0.000231 / GB-minute (~$0.000004 / GB-s) | ~$10/GB-month — usually the biggest line item |
| Railway volume storage | volume_gb_month | $0.15 / GB-month | small unless you store data on disk |
| Railway egress | egress_gb | $0.05 / GB | usually small |
| Neon compute | compute_cu_hour | $0.106 / CU-hour | ~$5/month at 0.25 CU |
| Neon storage | storage_gb_month | $0.35 / GB-month | small unless you have a heavy database |
Cloudflare Workers and Pages (the proxy, app-proxy, web app) are not metered to tenants. Cloudflare Workers and R2 meter rows do exist in the schema (see supabase/migrations/071_pricing_seed_data.sql) at unit_cost = 0 for observational rollup; Pages has no meters at all. If a usage_events row shows up against a Cloudflare meter, it's still $0 to the tenant.
The practical takeaway: a typical deployment-type tool at idle costs the tenant somewhere around $12–15/month in compute, plus whatever they spend on proxied API calls. App sleeping (services with no outbound activity for 10 minutes) cuts ~$5/month per idle tenant — but the sleeping mechanism is broken by anything that does outbound work (cron heartbeats, persistent DB connections, WebSockets), so most real tools don't get the benefit.
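The idle figures can be reproduced from the per-unit rates in the table. The sketch below uses a 30-day month and treats the number of hours Neon compute is active as an input, since that value is not stated on this page; the function names are illustrative.

```typescript
// Pre-margin rates copied from the table above (USD).
const COMPUTE_RATES = {
  memoryGbMinute: 0.000231,
  neonCuHour: 0.106,
};

const MINUTES_PER_30_DAYS = 30 * 24 * 60; // 43,200

// Railway memory is billed per GB-minute for as long as the service is up.
function memoryMonthlyUsd(gb: number): number {
  return gb * COMPUTE_RATES.memoryGbMinute * MINUTES_PER_30_DAYS;
}

// Neon compute is billed per CU-hour; activeHours is an assumption you supply.
function neonComputeMonthlyUsd(cu: number, activeHours: number): number {
  return cu * COMPUTE_RATES.neonCuHour * activeHours;
}
```

`memoryMonthlyUsd(1)` comes to about $9.98, matching the ~$10/GB-month in the table. For Neon, 0.25 CU held active for all 720 hours of a 30-day month would come to about $19.08 pre-margin, so the ~$5/month idle figure presumably assumes the compute is suspended for most of the month.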
Estimating cost before you deploy
Before shipping a tool to the public catalog, do this math:
- Per call: pick the median operation your tool performs (e.g. "summarize a CRM note with gpt-4o-mini"). Multiply the input and output token counts by the model's respective rates, add them, then multiply by 1.20 for the default margin. That is the price the tenant pays per operation. Keep sub-cent precision here, since the proxy bills in micro-USD rather than whole cents.
- Per user-day: estimate how many such operations a typical user does per day. Multiply by per-call cost. This is your variable bill per active user.
- Per month, idle: assume $12–15/month for compute if you're a deployment tool. This is the floor: the tenant pays it even at zero usage.
- Per month, active: sum the above. A solo user running 30 ops/day at $0.005/op for 30 days = $4.50 in API + $13 compute = ~$17.50/month.
Put a realistic range in your marketing.monthlyCost field. Tenants quoted "free" who get a $30 bill churn. Tenants quoted "$15-25/mo" who get a $20 bill don't.
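The estimation steps above can be turned into a small calculator. This is a rough sketch for planning purposes: the function names are illustrative, it uses plain floating-point and `Math.round` rather than the proxy's exact half-to-even decimal math, and it assumes the default 20% margin unless told otherwise.

```typescript
const MICROS_PER_USD = 1_000_000;

// Per-call price in micro-USD for a token-priced model.
// Rates are pre-margin USD per 1M tokens, as listed in the "What's metered" table.
function perCallMicros(
  inputTokens: number,
  outputTokens: number,
  inputUsdPerMillion: number,
  outputUsdPerMillion: number,
  marginPct: number = 20,
): number {
  const preMarginUsd =
    (inputTokens * inputUsdPerMillion + outputTokens * outputUsdPerMillion) / 1_000_000;
  return Math.round(preMarginUsd * (1 + marginPct / 100) * MICROS_PER_USD);
}

// Monthly bill in USD: variable API spend plus the idle compute floor.
function monthlyEstimateUsd(
  callMicros: number,
  opsPerDay: number,
  idleComputeUsd: number,
  days: number = 30,
): number {
  return (callMicros * opsPerDay * days) / MICROS_PER_USD + idleComputeUsd;
}
```

For instance, a gpt-4o-mini call with 2,000 input and 500 output tokens is `perCallMicros(2000, 500, 0.15, 0.60)`, or 720 micro-USD. The worked example from the list, `monthlyEstimateUsd(5000, 30, 13)`, gives 17.5.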
Reading per-call cost from your code
The SDK gives you read access to the tenant's balance but not to historical per-call costs — usage events are exposed through the dashboard, not the SDK. Read the balance to display an estimated runway; show the tenant the dashboard for line-by-line ledger detail.
import { vendo } from "@vendodev/sdk";
const balanceUsd = await vendo.billing.balance();

If you need to gate an expensive operation on remaining budget, do it in your tool's code: read the balance, estimate the call cost using known input size, refuse if the call would obviously fail. The proxy will 402 you regardless if you guess wrong; this is just UX polish to show the user a friendly message before they wait for a long operation to die.
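A minimal sketch of such a gate. The balance reader is injected so the helper does not depend on the SDK directly; in a real tool you would pass something like `() => vendo.billing.balance()`. The `gateOnBudget` name, the `GateResult` shape, and the assumption that the balance resolves to a plain number in USD are all illustrative.

```typescript
// Hypothetical helper; assumes the balance reader resolves to a number in USD.
type GateResult = { allowed: true } | { allowed: false; message: string };

async function gateOnBudget(
  readBalanceUsd: () => Promise<number>,
  estimatedCostUsd: number,
): Promise<GateResult> {
  const balanceUsd = await readBalanceUsd();
  if (balanceUsd < estimatedCostUsd) {
    // Fail fast with a friendly message instead of letting the proxy return 402 mid-operation.
    return {
      allowed: false,
      message:
        `This operation needs about $${estimatedCostUsd.toFixed(2)}, ` +
        `but only $${balanceUsd.toFixed(2)} is available.`,
    };
  }
  return { allowed: true };
}
```

The estimate only needs to be good enough to catch obvious failures; the proxy remains the source of truth for whether a call is actually allowed.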