LLM routing in Rails: OpenRouter with fallback chains and budget caps


Every LLM-backed Rails feature I’ve shipped eventually hits the same wall: the model you picked is down, or rate-limited, or just got three times more expensive overnight. You wrote Model.chat(...) against one provider, and now that one provider is a single point of failure wired straight into a user-facing request. This post is about the layer that fixes that — routing through OpenRouter with an explicit fallback chain and a budget cap, in plain Rails. I’ll give you a decision rubric for ordering the chain, the pitfalls that quietly burn money, and a conceptual worked example you can adapt. No gem required. The whole thing is a service object and a config file, and that’s the point.

What “routing” actually means here

Routing is the decision of which model handles a given request, made at call time rather than baked into your code. OpenRouter is a useful substrate for it because it exposes hundreds of models behind one OpenAI-compatible API and one API key. You change a string, you change the model.
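To make “you change a string, you change the model” concrete, here is a minimal sketch that builds a chat request with Ruby’s standard library and nothing else. The helper name `build_chat_request` and the placeholder API key are assumptions for illustration; the endpoint is OpenRouter’s OpenAI-compatible chat completions URL.

```ruby
require "json"
require "net/http"
require "uri"

# OpenRouter's OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = URI("https://openrouter.ai/api/v1/chat/completions")

# Builds (but does not send) a chat request.
# `build_chat_request` is a hypothetical helper name, not part of any library.
def build_chat_request(model:, messages:, api_key:)
  request = Net::HTTP::Post.new(OPENROUTER_URL)
  request["Authorization"] = "Bearer #{api_key}"
  request["Content-Type"]  = "application/json"
  request.body = JSON.generate(model: model, messages: messages)
  request
end

messages = [{ role: "user", content: "Summarize this ticket." }]

# Same payload, different model: only the slug changes.
primary = build_chat_request(
  model: "anthropic/claude-sonnet-4.5", messages: messages, api_key: "sk-placeholder"
)
floor = build_chat_request(
  model: "meta-llama/llama-3.3-70b-instruct", messages: messages, api_key: "sk-placeholder"
)
```

Sending is one more step, for example `Net::HTTP.start(OPENROUTER_URL.host, OPENROUTER_URL.port, use_ssl: true) { |http| http.request(primary) }`, with timeouts set to suit your feature.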

Three concerns live in this layer, and they’re worth separating:

Selection — which model do I want for this task?
Fallback — what do I try when the first choice fails?
Budget — when do I stop spending, regardless of selection?

Most “AI in Rails” tutorials cover selection and stop. The other two are where production bites you. A model being temporarily unavailable is not an edge case; it’s a Tuesday.

Scope for this post: synchronous, single-turn or short-turn calls from a Rails request or a background job. Streaming and multi-agent orchestration are different animals.

The core mechanic: an ordered chain with a budget gate

The pattern is a list of candidates tried in order, each call wrapped so a failure falls through to the next, with a spend check before the whole thing runs. The chain is data, not code.

Order the chain by this rubric. Each tier answers a different question:

| Tier | Question it answers | Typical choice |
| --- | --- | --- |
| Primary | What gives the best result for this task? | A capable mid/large model |
| Secondary | What’s nearly as good if the first is down? | A different vendor’s comparable model |
| Floor | What will always answer, cheaply? | A small, cheap, fast model |

The non-obvious rule: your secondary should be a different vendor than your primary. If your primary is an Anthropic model and OpenRouter’s Anthropic upstream is degraded, falling back to another Anthropic model buys you nothing. Cross-vendor fallback is the entire insurance policy.
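Because the chain is data, the cross-vendor rule can be checked mechanically, for example in a test or an initializer. This is a rough sketch: `vendor_of` and `cross_vendor_fallback?` are hypothetical helper names, and it assumes the segment before the slash in a model slug is a good enough proxy for the vendor.

```ruby
# Vendor is taken to be the segment before the first slash in the slug.
# That is an assumption about slug layout, not a guarantee about upstreams.
def vendor_of(model_slug)
  model_slug.split("/", 2).first
end

# True when the first two candidates in the chain come from different vendors.
def cross_vendor_fallback?(chain)
  primary, secondary = chain.first(2)
  return false if secondary.nil? # a one-model chain has no fallback at all

  vendor_of(primary) != vendor_of(secondary)
end

chain = %w[
  anthropic/claude-sonnet-4.5
  google/gemini-2.5-flash
  meta-llama/llama-3.3-70b-instruct
]
cross_vendor_fallback?(chain) # => true
```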

A minimal router reads as configuration plus a loop:

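class LlmRouter
  # Raised before any model is called once the budget is spent.
  BudgetExceeded = Class.new(StandardError)

  # Chains are data: ordered model slugs, first choice first.
  CHAINS = {
    default: %w[
      anthropic/claude-sonnet-4.5
      google/gemini-2.5-flash
      meta-llama/llama-3.3-70b-instruct
    ]
  }.freeze

  def initialize(chain: :default, budget:)
    @models = CHAINS.fetch(chain)
    @budget = budget
  end

  def call(messages:)
    raise BudgetExceeded if @budget.exhausted?

    @models.each_with_index do |model, i|
      # `request` is the HTTP call to OpenRouter, omitted here.
      return request(model, messages)
    # Faraday::Error is broad and also covers 4xx client errors; narrow it as
    # the pitfalls section describes. OpenRouter::ServerError assumes your
    # client library defines that class.
    rescue Faraday::Error, OpenRouter::ServerError => e
      raise if i == @models.size - 1 # last candidate: nothing left to try

      Rails.logger.warn("[llm] #{model} failed: #{e.class}, falling through")
    end
  end
end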

The BudgetExceeded check at the top is the gate. The rescue/next is the chain. Everything else is detail.

Pitfalls that quietly cost you

Retrying non-retryable failures. A 429 or a 503 is worth a fallback. A 400 (malformed request, context too long, content filtered) will usually fail the same way on every model in your chain. On a three-model chain, walking the whole chain on a bad request triples your latency and turns one failed call into three, for nothing. Branch on the failure class before you fall through.
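One way to branch on the failure class is to key the decision off the HTTP status. The sketch below is an assumption-heavy illustration: `LlmRequestError` and `first_success` are hypothetical names, and the retryable list beyond 429 and 503 (408, 500, 502, 504) is my guess at what counts as transient, so adjust it to what your gateway actually returns.

```ruby
# Statuses treated as transient. 429 and 503 come from the text above;
# the rest are an assumption, not a documented contract.
RETRYABLE_STATUSES = [408, 429, 500, 502, 503, 504].freeze

def fall_through?(status)
  RETRYABLE_STATUSES.include?(status)
end

# Hypothetical error that carries the HTTP status of a failed model call.
class LlmRequestError < StandardError
  attr_reader :status

  def initialize(status)
    super("LLM request failed with status #{status}")
    @status = status
  end
end

# Yields each model in order and returns the first successful result.
# Non-retryable failures and a failure on the last model are re-raised.
def first_success(models)
  models.each_with_index do |model, index|
    return yield(model)
  rescue LlmRequestError => e
    last_candidate = index == models.size - 1
    raise if last_candidate || !fall_through?(e.status)
  end
end
```

With this shape a 400 surfaces after a single attempt, while a 429 on the primary moves on to the secondary.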

Treating budget as a single global number. One counter for your whole app means a runaway background job can starve your interactive chat of its budget. Scope budgets the way you scope everything else in a multi-tenant Rails app — per org, per feature, per environment. (If you’re thinking about tenant boundaries generally, row-scoping vs schema vs database-per-tenant is the longer conversation.)
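As a rough illustration of scoping, the sketch below keys each budget by organization, feature, and environment. It is an in-memory stand-in, not the Postgres-backed version; `ScopedBudget`, `BudgetLedger`, and `budget_for` are hypothetical names, and the cap is a placeholder value.

```ruby
# In-memory stand-in for a budget row: a cap and a running total.
# A real version would live in the database and use a decimal column.
class ScopedBudget
  attr_reader :cap_usd, :spent_usd

  def initialize(cap_usd:)
    @cap_usd = cap_usd
    @spent_usd = 0.0
  end

  def exhausted?
    spent_usd >= cap_usd
  end

  def charge!(amount_usd)
    @spent_usd += amount_usd
  end
end

# Hands out one budget per (org, feature, environment) scope.
class BudgetLedger
  def initialize(default_cap_usd:)
    @default_cap_usd = default_cap_usd
    @budgets = {}
  end

  def budget_for(org_id:, feature:, environment:)
    key = [org_id, feature, environment]
    @budgets[key] ||= ScopedBudget.new(cap_usd: @default_cap_usd)
  end
end

ledger = BudgetLedger.new(default_cap_usd: 10.0) # placeholder cap
batch  = ledger.budget_for(org_id: 1, feature: :batch_summaries, environment: "production")
chat   = ledger.budget_for(org_id: 1, feature: :chat, environment: "production")

batch.charge!(10.0) # the runaway job burns its own budget...
chat.exhausted?     # ...and the interactive chat is unaffected
```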

Silent fallback with no signal. If you drop from your primary to your floor model and the user never knows and you never log it, you’ve hidden a quality regression behind a green checkmark. Log every hop with the reason. Emit a metric. You want to notice when your primary has been down for an hour.
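A minimal sketch of “log every hop, emit a metric” using only the standard library. `HopRecorder` is a hypothetical name, and the in-memory counter stands in for whatever metrics client you already run; in a Rails app you would pass `Rails.logger` instead of the `StringIO` logger used here.

```ruby
require "logger"
require "stringio"

# Records each fallback hop: one log line plus one counter increment.
class HopRecorder
  attr_reader :counts

  def initialize(logger)
    @logger = logger
    @counts = Hash.new(0) # stand-in for a real metrics client
  end

  def record(from:, to:, reason:)
    @counts[[from, reason]] += 1
    @logger.warn("[llm] fallback from=#{from} to=#{to} reason=#{reason}")
  end
end

log_output = StringIO.new
recorder = HopRecorder.new(Logger.new(log_output))

recorder.record(
  from: "anthropic/claude-sonnet-4.5",
  to: "google/gemini-2.5-flash",
  reason: "http_429"
)
```

Counting by model and reason is what lets an alert distinguish “the primary blipped once” from “the primary has been down for an hour.”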

Putting the chain in the hot path of a web request. Three sequential model calls, each timing out at 30 seconds, is a 90-second request that holds a Puma thread hostage. Long or fallible LLM work belongs in a job. The Sidekiq vs Solid Queue tradeoff matters more once LLM calls are your dominant job type — they’re slow and they fail, which stresses a queue differently than a mailer does.
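If some calls have to stay in a request, one mitigation is a shared deadline for the whole chain, so each attempt gets only what is left rather than a fresh timeout. This is a sketch under my own assumptions: `Deadline` is a hypothetical class, and the injectable clock exists only to make it testable.

```ruby
# Tracks one overall time allowance shared by every attempt in the chain.
class Deadline
  def initialize(total_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @expires_at = clock.call + total_seconds
  end

  # Seconds left before the overall deadline, never negative.
  def remaining
    [@expires_at - @clock.call, 0.0].max
  end

  def expired?
    remaining <= 0
  end

  # Timeout for the next attempt: the per-attempt cap or whatever is left,
  # whichever is smaller.
  def attempt_timeout(per_attempt_cap)
    [per_attempt_cap, remaining].min
  end
end
```

A router would check `expired?` before each hop and pass `attempt_timeout(30)` to the HTTP client as its read timeout, so three candidates cannot add up to three full timeouts.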

How the pattern looks in practice

Picture a SaaS feature that drafts a reply to a customer message. Here’s how I’d wire the routing, conceptually — no client, no invented numbers, just the shape.

The feature calls a job, not the model. The job instantiates the router with a tenant-scoped budget and the :default chain:

class DraftReplyJob < ApplicationJob
  def perform(message_id)
    message = Message.find(message_id)
    budget  = LlmBudget.for(message.organization)

    router = LlmRouter.new(chain: :default, budget: budget)
    draft  = router.call(messages: build_prompt(message))

    message.update!(suggested_reply: draft.content)
    budget.charge!(draft.cost_usd)
  end
end

Three things earn their place here.

The budget is fetched per organization, so a noisy tenant can’t drain a quiet one. LlmBudget.for is just a row with a monthly cap and a running total — a Postgres counter, nothing exotic.

The job charges the budget with the resolved cost from the response, and only after the call succeeds, so the running total tracks what was actually spent.

And the whole thing runs in a job, so a slow fallback degrades a background draft, not a page load. The user sees “drafting…” a little longer, not a spinner that hangs their browser.

The router is the cheapest insurance I write. It’s a config array and a rescue clause, and it’s the difference between “the AI feature is down” and “the AI feature is a little dumber for an hour.”

— Self note

If you’re adding this to an app that already has LLM calls scattered around, the migration is mechanical: find every direct provider call, route it through the one service object, delete the duplicated retry logic. My checklist for dropping AI into an existing Rails app covers the surrounding work.

What done looks like

You have a router worth shipping when these are all true:

Every model call in the app goes through one object. No stray HTTP.post to a provider in a controller.
The chain spans at least two vendors, so a single upstream outage can’t take the feature down.
A spend check runs before the call, scoped to a sensible unit (tenant, feature), and the budget is decremented by resolved cost after.
Every fallback hop is logged with a reason and surfaced as a metric you’d actually look at.
Non-retryable failures (4xx that aren’t 429) short-circuit instead of walking the chain.
The whole thing runs off the web request’s hot path when latency or failure is plausible.

Notice none of these require a gem, a framework, or vendor lock-in deeper than “we use OpenRouter as the gateway.” That’s deliberate. The router is small enough to read in one sitting, which means it’s small enough to debug at 2 a.m.

When to skip this

If you call an LLM once, in an internal admin tool, and a failure just means you retry by hand — don’t build a chain. A single call with a timeout is fine. The routing layer earns its keep when the call is user-facing, frequent, or running unattended in a job.

Likewise, if you’re committed to one vendor’s ecosystem for reasons beyond model quality — data residency, a contract, a feature only they ship — cross-vendor fallback is moot, and OpenRouter is just an extra hop. Route directly. The pattern is insurance; skip it when there’s nothing to insure.

The falsifiable bit

Here’s a claim you can test against your own app: of the LLM failures you’ll see in a month of production, the large majority will be transient — rate limits and upstream timeouts — and a cross-vendor fallback chain will absorb them without a human noticing. The minority that won’t be absorbed are the 4xx bad-request errors, and those are your bug, not the provider’s. If your error budget is being eaten by 400s, no chain will save you, and you should go read your prompts instead.