The question isn't whether your LLM workflow needs fallback logic. It does. Every production system that calls an external model API will eventually encounter a 429, a 503, or a latency spike that makes the primary model unusable. The real question is: where does the fallback logic live, and who owns it?
In most systems today, the answer is: it's scattered. The customer support agent has a try/except that retries with exponential backoff. The document extraction agent has a different retry strategy — one someone copied from a Stack Overflow answer in 2023. The summarization agent has no fallback at all, because the engineer who built it was moving fast. And when the primary provider had a 45-minute partial outage last quarter, all three agents failed differently — one silently dropped requests, one queued them until it ran out of memory, one raised an exception that bubbled up to users as a 500.
This post is about the alternative: defining the fallback chain declaratively, once, in a place that applies it consistently across all invocations that need it.
What a fallback chain is
A fallback chain is an ordered sequence of model configurations that a routing system will attempt in sequence when the current tier fails. "Fails" can mean several things:
- The provider returned a 4xx or 5xx status code
- The invocation latency exceeded a configured threshold (p99 timeout)
- The per-invocation cost exceeded a configured cap
- The response quality score fell below a configured threshold
A three-tier fallback chain for a production workflow might look like this in declarative config:
fallback:
chain:
- tier: primary
model: gpt-4o
provider: openai
- tier: secondary
model: claude-3-5-sonnet
provider: anthropic
- tier: emergency
model: mistral-7b
provider: ollama
endpoint: http://local-gpu:11434
triggers:
on_status_code: [429, 500, 502, 503]
on_latency_p99_ms: 9000
on_cost_per_invocation_usd: 0.08
emit_telemetry: true
The semantics are: try GPT-4o first. If it returns a 429, 500, 502, or 503; or if latency at p99 exceeds 9 seconds; or if the per-invocation cost exceeds $0.08 — fall through to Claude. If Claude also fails under the same conditions, fall through to the local Mistral instance. Log a structured telemetry event every time a tier transition happens.
Three design decisions you'll face
1. Homogeneous vs heterogeneous fallback
A homogeneous fallback chain uses the same model family at lower capability tiers: GPT-4o → GPT-4o-mini → GPT-3.5-turbo. The advantage is predictable output format — all three models share the same API contract and prompt behavior. The disadvantage is that if the provider has a broad outage, all tiers may fail together.
A heterogeneous fallback chain crosses providers: OpenAI → Anthropic → local model. The advantage is provider-level redundancy — an OpenAI outage doesn't affect the Anthropic tier. The disadvantage is that models from different providers may respond differently to the same prompt, which can cause output format inconsistencies if your application parses structured responses.
For most production systems, a hybrid approach works best: GPT-4o-mini as the primary (cost-optimized), GPT-4o as the first fallback (quality upgrade on primary failure), Claude-3-5-Sonnet as the second fallback (provider-level redundancy), local Mistral as the emergency tier (no external dependency). The local tier means you have a guaranteed path that doesn't depend on any external provider's availability — critical for SLA commitments.
2. Trigger conditions and activation thresholds
Most teams default to triggering fallback on HTTP error codes (429, 5xx). This is correct but insufficient. You also want:
Latency-based triggers. A provider can be "available" — returning 200s — but degraded to the point where your p99 latency is 12 seconds. If your SLO is a 3-second response time, a 12-second p99 is worse than a 503 in terms of user impact. Set a latency threshold that triggers fallback before your SLO is violated. A value around 1.5x your target p99 is reasonable (if your SLO is 3s, set a trigger at 4.5s).
Cost-based triggers. For workflows with tight cost budgets, triggering fallback when cost-per-invocation exceeds a ceiling routes expensive queries to a cheaper tier automatically. This is especially useful for prompt-heavy workflows where context length varies significantly — a long-context invocation on GPT-4o can be 20x more expensive than a short one, and you may want to route those to a cheaper model automatically.
3. Preserving output contract across tiers
If your application expects structured JSON output, you need to verify that every model in your fallback chain can reliably produce that format — not just that you can get a response from it. A model that returns prose where you expect JSON is worse than a model that returns an error, because the error fails fast and the prose fails slowly and corrupts downstream processing.
Test every tier of your fallback chain explicitly with your production prompts before shipping. The emergency tier especially — local models via Ollama often have weaker instruction-following capability, and you may need to add explicit output format constraints to their system prompts that aren't necessary for the primary model.
Why the definition belongs in config, not code
The imperative version of fallback logic typically looks like this:
def invoke_with_fallback(prompt):
try:
return openai_client.invoke(prompt, model="gpt-4o", timeout=9)
except (RateLimitError, ServiceUnavailableError):
try:
return anthropic_client.invoke(prompt, model="claude-3-5-sonnet", timeout=9)
except Exception:
return local_client.invoke(prompt, model="mistral-7b")
This function works. It also has several problems at scale:
- Changing the fallback order requires a code change and a deployment
- Changing the latency threshold requires a code change and a deployment
- There's no telemetry — you can't observe which tier activated and when
- If you have 12 agents, this logic is either copied 12 times or in a shared library that all 12 agents need to be updated to use correctly
- There's no audit history of changes — only git blame, which shows you who changed it, not why
When the fallback chain is declared in config, changing the tier order or the trigger threshold is a config update — no deployment required. The change has a git diff. It's reviewable as a pull request. It applies immediately to all workflows that reference that chain without any agent code changes.
Telemetry as first-class output
Every fallback tier transition should emit a structured event. At minimum, the event should contain:
- The invocation ID
- The workflow name
- The tier that was attempted (primary, secondary, emergency)
- The trigger condition that caused the transition (status code, latency, cost)
- The tier that ultimately responded successfully
- The latency of each attempt
- The timestamp
This telemetry gives you something you can't get from application logs: a clear view of how often each tier is activated, which trigger conditions fire most, and whether your primary provider reliability is degrading over time. Tier transition frequency is a leading indicator of provider health that most teams don't have instrumentation for.
When you're in an incident and trying to understand whether the system is recovering, "fallback activation rate dropped from 40% back to 2% in the last five minutes" is exactly the signal you need. You get that only if your control plane emits structured telemetry on every tier transition.
The operational benefit
One of the teams we worked with had fallback logic copy-pasted across eight agent functions. Each one had slightly different timeout values, different error handling, and different retry behavior. When their primary provider had a partial degradation last year, debugging the incident took three hours — not because the system was broken, but because no one could quickly determine which agents were routing to which fallback tier and which were silently dropping requests.
After extracting that logic to a declarative fallback chain in a control plane, their next provider degradation event took 12 minutes to diagnose and resolve. The difference was not better fallback logic — it was observable fallback logic.