LLM Fallback Routing: Survive an AI Model Recall (2026)
June 17, 2026
TL;DR. On June 12, 2026 the US Commerce Department ordered Anthropic to suspend foreign nationals' access to Claude Fable 5 and Mythos 5, and Anthropic disabled both models for every customer to comply.[1][2] Apps wired to a single model returned errors within minutes; apps with multi-provider fallback routing kept serving. This guide shows how to build that fallback layer in TypeScript: a normalized provider interface, an automatic fallback chain, a circuit breaker, and a rule for picking a compatible backup model so quality doesn't silently collapse on failover. Total time: ~30 minutes.
An LLM fallback chain is a routing layer that retries a request against the next provider in an ordered list whenever the primary returns a retryable failure: a 429 rate limit, a 5xx, a timeout, or a model-not-found / access-revoked error.[3] Until this month, most teams treated "the model is gone" as an impossible state. The Fable 5 recall made it a Tuesday.
What you'll learn
- Why a single-provider LLM integration is now a regulatory availability risk, not just an uptime one.
- How to define one normalized request/response interface across Anthropic, OpenAI, Google, and an open-weight model.
- How to build an automatic fallback chain that trips on rate limits, server errors, timeouts, and access-revoked errors.
- How to add a circuit breaker so a dead provider stops eating latency on every request.
- How to choose a compatible backup model so failover doesn't silently degrade output quality.
- When to stop hand-rolling this and put a gateway like LiteLLM in front instead.
What actually happened to Fable 5
Anthropic launched Claude Fable 5, its Mythos-class public model, on June 9, 2026.[4] Three days later, on June 12, the US Commerce Department issued an export-control directive citing national-security authorities. The order suspended access to Fable 5 and Mythos 5 for any foreign national, whether inside or outside the United States.[2] Rather than build per-user nationality gating overnight, Anthropic disabled both models entirely for all customers while it worked through compliance.[1] The trigger, per the directive, was awareness of a method of "jailbreaking" the models. Anthropic publicly characterized the technique as narrow, already known, and present in rival models too; it called the situation a misunderstanding and said it was "working to restore access."[5][6]
The operational facts that matter for your architecture: there was no advance notice, no firm restoration date, and no automatic quality-preserving fallback from Anthropic's side. Sessions that had been routing to Fable 5 began erroring or silently dropping to older models.[7] Claude Opus 4.8 and the rest of Anthropic's lineup stayed online,[5] but if your code hard-coded the Fable 5 model string, "the rest of the lineup is fine" didn't help you.
This is the part worth internalizing: model availability is no longer a constant you can assume. It is a risk variable that can change by government directive, not just by an outage dashboard. Anthropic had already spent the spring in litigation with the Pentagon, which designated it a "supply chain risk" in early March 2026, making it the first US company to receive a label historically reserved for foreign adversaries; Anthropic sued in two federal courts on March 9.[8][9] The regulatory surface around frontier models is live and contested. Plan for it the way you plan for a region going dark.
Why one provider is now a single point of failure
The classic argument for picking one model and committing was simplicity: one SDK, one set of error codes, one billing relationship. The Fable 5 recall reprices that simplicity. A single-provider integration now concentrates three independent failure modes into one dependency:
- Everyday failures, which haven't gone anywhere: rate limits under load, 5xx blips during a provider's own incident, and timeouts on long generations.[3]
- Policy risk: a model can be pulled by directive with no notice.
- Capability risk: even when a model exists, a provider can quietly reroute you to a weaker one, changing your output distribution without changing your code.[7]
A fallback chain addresses all three with the same mechanism. The goal isn't to chase five-nines on any one model — it's to make your product survive the loss of any one model.
Step 1: Normalize the provider interface
Every provider's API has a slightly different request shape, response shape, and error format, so the first job is to hide those differences behind one interface.[10] Define a minimal contract: take messages plus a max-token cap, return text plus the model that actually served the request.
```typescript
// llm/types.ts
export interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

export interface CompletionResult {
  text: string;
  servedBy: string; // which provider:model actually answered
}

// A provider knows how to call exactly one model and
// how to classify its own failures.
export interface LLMProvider {
  id: string; // e.g. "anthropic:claude-opus-4-8"
  complete(messages: ChatMessage[], maxTokens: number): Promise<CompletionResult>;
}

// Errors we are willing to fail OVER on. Anything else
// (e.g. a 400 for a malformed request) should NOT trigger
// fallback: it will fail identically on every provider.
export class RetryableProviderError extends Error {
  constructor(public providerId: string, public cause: string) {
    super(`${providerId} failed (retryable): ${cause}`);
  }
}
```
The RetryableProviderError distinction is the one people skip and regret. Failing over on a 400 Bad Request just burns money calling four providers in a row to get the same rejection. Fail over only on transient or availability errors; let genuine client errors surface immediately.
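As a minimal sketch of that distinction, the decision can live in one small function instead of being repeated in every provider. The name `classifyStatus` and the `FailureKind` type are hypothetical; the status set mirrors the one this guide uses in its provider classes, and you may want a narrower or wider set for your own traffic.

```typescript
// Hypothetical helper: decides whether an HTTP status should trigger failover.
export type FailureKind = "retryable" | "fatal";

// Assumed set, matching the provider classes in this guide:
// 403/404 (possible access revocation), 408/425 (timing), 409 (conflict),
// 429 (rate limit), and the common 5xx server errors.
const RETRYABLE: ReadonlySet<number> = new Set([
  403, 404, 408, 409, 425, 429, 500, 502, 503, 504,
]);

export function classifyStatus(httpStatus: number): FailureKind {
  return RETRYABLE.has(httpStatus) ? "retryable" : "fatal";
}

console.log(classifyStatus(429)); // "retryable": another provider may succeed
console.log(classifyStatus(400)); // "fatal": every provider would reject it
```

Keeping the rule in one place means a later change, such as dropping 403 from the set, applies to every provider at once.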
Step 2: Write two concrete providers
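Here is an Anthropic provider and an OpenAI provider against their HTTP APIs. The key work is in the error handling: map status codes to either a RetryableProviderError (fail over) or a plain Error (give up). The status codes worth failing over on are 429 (rate limit), 5xx (server error), and, new to the 2026 threat model, 403/404 on a model that used to exist, which is what an access revocation looks like from the client side.[3] The set below also includes 408, 409, and 425, which signal timeouts or transient conflicts. Note that a blanket 403/404 rule also fails over on a mistyped model name or a bad API key, so alert on those responses rather than letting the chain hide them.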
```typescript
// llm/anthropic.ts
import { ChatMessage, CompletionResult, LLMProvider, RetryableProviderError } from "./types";

const RETRYABLE_STATUS = new Set([403, 404, 408, 409, 425, 429, 500, 502, 503, 504]);

export class AnthropicProvider implements LLMProvider {
  constructor(
    private model: string,
    private apiKey = process.env.ANTHROPIC_API_KEY!,
  ) {}

  get id() {
    return `anthropic:${this.model}`;
  }

  async complete(messages: ChatMessage[], maxTokens: number): Promise<CompletionResult> {
    // Anthropic takes the system prompt as a top-level field, not as a message.
    const system = messages.find((m) => m.role === "system")?.content;
    const turns = messages.filter((m) => m.role !== "system");
    let res: Response;
    try {
      res = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
        headers: {
          "x-api-key": this.apiKey,
          "anthropic-version": "2023-06-01",
          "content-type": "application/json",
        },
        body: JSON.stringify({ model: this.model, system, messages: turns, max_tokens: maxTokens }),
        signal: AbortSignal.timeout(60_000),
      });
    } catch (e) {
      // Network error or the 60s timeout firing → retryable.
      throw new RetryableProviderError(this.id, (e as Error).name);
    }
    if (!res.ok) {
      if (RETRYABLE_STATUS.has(res.status)) {
        throw new RetryableProviderError(this.id, `HTTP ${res.status}`);
      }
      throw new Error(`${this.id} hard error: HTTP ${res.status}`);
    }
    let data: any;
    try {
      data = await res.json();
    } catch (e) {
      // The timeout can still fire while the body is being read → retryable.
      throw new RetryableProviderError(this.id, (e as Error).name);
    }
    return { text: data.content?.[0]?.text ?? "", servedBy: this.id };
  }
}
```
```typescript
// llm/openai.ts
import { ChatMessage, CompletionResult, LLMProvider, RetryableProviderError } from "./types";

const RETRYABLE_STATUS = new Set([403, 404, 408, 409, 425, 429, 500, 502, 503, 504]);

export class OpenAIProvider implements LLMProvider {
  constructor(
    private model: string,
    private apiKey = process.env.OPENAI_API_KEY!,
  ) {}

  get id() {
    return `openai:${this.model}`;
  }

  async complete(messages: ChatMessage[], maxTokens: number): Promise<CompletionResult> {
    let res: Response;
    try {
      res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: { authorization: `Bearer ${this.apiKey}`, "content-type": "application/json" },
        body: JSON.stringify({ model: this.model, messages, max_completion_tokens: maxTokens }),
        signal: AbortSignal.timeout(60_000),
      });
    } catch (e) {
      // Network error or the 60s timeout firing → retryable.
      throw new RetryableProviderError(this.id, (e as Error).name);
    }
    if (!res.ok) {
      if (RETRYABLE_STATUS.has(res.status)) {
        throw new RetryableProviderError(this.id, `HTTP ${res.status}`);
      }
      throw new Error(`${this.id} hard error: HTTP ${res.status}`);
    }
    let data: any;
    try {
      data = await res.json();
    } catch (e) {
      // The timeout can still fire while the body is being read → retryable.
      throw new RetryableProviderError(this.id, (e as Error).name);
    }
    return { text: data.choices?.[0]?.message?.content ?? "", servedBy: this.id };
  }
}
```
Because both providers satisfy the same LLMProvider interface, the router that calls them doesn't need to know which one it's holding. The same pattern extends to a Google Gemini provider or — importantly — an open-weight model you host yourself, which no directive can recall.
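A self-hosted provider could look roughly like the sketch below. It assumes your inference server exposes an OpenAI-compatible `/v1/chat/completions` route (vLLM's server does); the class name `SelfHostedProvider`, the default `baseUrl`, the `selfhosted:` id prefix, and the injectable `fetchImpl` parameter are all illustrative choices, not part of any library. The types are repeated from `llm/types.ts` so the block runs on its own.

```typescript
// Types repeated from llm/types.ts so this sketch is self-contained.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }
interface CompletionResult { text: string; servedBy: string; }
class RetryableProviderError extends Error {
  constructor(public providerId: string, public cause: string) {
    super(`${providerId} failed (retryable): ${cause}`);
  }
}

// Narrow fetch signature so a fake can be injected in tests (hypothetical).
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

export class SelfHostedProvider {
  constructor(
    private model: string,
    private baseUrl = "http://localhost:8000/v1", // assumed address of your server
    private fetchImpl: FetchLike = (url, init) => fetch(url, init),
  ) {}

  get id() {
    return `selfhosted:${this.model}`;
  }

  async complete(messages: ChatMessage[], maxTokens: number): Promise<CompletionResult> {
    let res: Awaited<ReturnType<FetchLike>>;
    try {
      res = await this.fetchImpl(`${this.baseUrl}/chat/completions`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ model: this.model, messages, max_tokens: maxTokens }),
        signal: AbortSignal.timeout(60_000),
      });
    } catch (e) {
      // Server unreachable or timed out: let the chain move on.
      throw new RetryableProviderError(this.id, (e as Error).name);
    }
    if (!res.ok) {
      // Assumption: on a box you control, only overload (429) and 5xx are
      // transient; a 403/404 here points at your own configuration.
      if (res.status === 429 || res.status >= 500) {
        throw new RetryableProviderError(this.id, `HTTP ${res.status}`);
      }
      throw new Error(`${this.id} hard error: HTTP ${res.status}`);
    }
    const data = await res.json();
    return { text: data.choices?.[0]?.message?.content ?? "", servedBy: this.id };
  }
}
```

Because the class has the same `id` and `complete` members as the other providers, it can sit at the end of the same ordered list.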
Step 3: The fallback chain with a circuit breaker
Now the router. It walks an ordered list of providers, returns the first success, and only advances on a RetryableProviderError. To avoid paying the full timeout on a provider that is already down, each provider gets a circuit breaker: after a threshold of consecutive failures it is skipped for a cooldown window, then let back in on trial, where a single further failure re-opens the breaker and a success resets it.[11]
```typescript
// llm/router.ts
import { ChatMessage, CompletionResult, LLMProvider, RetryableProviderError } from "./types";

interface BreakerState {
  failures: number;
  openedAt: number | null; // timestamp when the breaker tripped
}

export class FallbackRouter {
  private breakers = new Map<string, BreakerState>();

  constructor(
    private providers: LLMProvider[], // ordered: most-preferred first
    private threshold = 3, // consecutive failures before opening
    private cooldownMs = 30_000, // how long to skip an open provider
  ) {}

  private isOpen(id: string): boolean {
    const b = this.breakers.get(id);
    if (!b || b.openedAt === null) return false;
    if (Date.now() - b.openedAt >= this.cooldownMs) {
      // Half-open: let traffic through again. `failures` is still at the
      // threshold, so one more failure re-opens the breaker immediately.
      b.openedAt = null;
      return false;
    }
    return true;
  }

  private recordFailure(id: string) {
    const b = this.breakers.get(id) ?? { failures: 0, openedAt: null };
    b.failures += 1;
    if (b.failures >= this.threshold) b.openedAt = Date.now();
    this.breakers.set(id, b);
  }

  private recordSuccess(id: string) {
    this.breakers.set(id, { failures: 0, openedAt: null });
  }

  async complete(messages: ChatMessage[], maxTokens: number): Promise<CompletionResult> {
    const errors: string[] = [];
    for (const provider of this.providers) {
      if (this.isOpen(provider.id)) {
        errors.push(`${provider.id}: circuit open`);
        continue;
      }
      try {
        const result = await provider.complete(messages, maxTokens);
        this.recordSuccess(provider.id);
        return result;
      } catch (e) {
        if (e instanceof RetryableProviderError) {
          this.recordFailure(provider.id);
          errors.push(e.message);
          continue; // try the next provider
        }
        throw e; // hard error → don't waste the rest of the chain
      }
    }
    throw new Error(`All providers exhausted:\n${errors.join("\n")}`);
  }
}
```
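If you want to unit-test the breaker without waiting out real cooldowns, one option is to pull it into its own class with an injectable clock. The sketch below is an illustration, not the router's actual implementation: the `CircuitBreaker` name, its method names, and the `now` parameter are hypothetical, and unlike the inline version it admits strictly one trial request at a time while half-open.

```typescript
type BreakerPhase = "closed" | "open" | "half-open";

export class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private trialInFlight = false;

  constructor(
    private threshold = 3, // consecutive failures before opening
    private cooldownMs = 30_000, // how long to stay open
    private now: () => number = () => Date.now(), // injectable clock for tests
  ) {}

  phase(): BreakerPhase {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Returns true if the caller may send a request right now.
  allow(): boolean {
    const phase = this.phase();
    if (phase === "closed") return true;
    if (phase === "open") return false;
    // Half-open: admit one trial and hold the rest until it reports back.
    if (this.trialInFlight) return false;
    this.trialInFlight = true;
    return true;
  }

  success(): void {
    this.failures = 0;
    this.openedAt = null;
    this.trialInFlight = false;
  }

  failure(): void {
    this.trialInFlight = false;
    this.failures += 1;
    // At or past the threshold, (re)open and restart the cooldown.
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}

// Usage with a fake clock: two failures open it, the cooldown half-opens it.
let fakeNow = 0;
const breaker = new CircuitBreaker(2, 1_000, () => fakeNow);
breaker.failure();
breaker.failure();
console.log(breaker.phase()); // "open"
fakeNow = 1_000;
console.log(breaker.phase()); // "half-open"
```

The FallbackRouter above keeps the simpler inline bookkeeping, which is enough for a single process with modest concurrency.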
Wiring it up reads like a priority list — and a model recall is now just another entry that trips the breaker and moves on:
```typescript
// llm/index.ts
import { AnthropicProvider } from "./anthropic";
import { OpenAIProvider } from "./openai";
import { FallbackRouter } from "./router";

export const router = new FallbackRouter([
  new AnthropicProvider("claude-opus-4-8"), // primary
  new OpenAIProvider("gpt-5.5"), // cross-vendor backup
  new AnthropicProvider("claude-sonnet-4-6"), // cheaper same-vendor backup
  // new SelfHostedProvider("kimi-k2.7-code"), // un-recallable last resort
]);

const answer = await router.complete(
  [{ role: "user", content: "Summarize this changelog in three bullets." }],
  1024,
);
console.log(answer.servedBy, answer.text);
```
Log servedBy on every response. The day your primary gets pulled, that field is the difference between "we noticed in Grafana" and "a customer noticed for us."
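One way to turn that field into a signal is to count responses per provider and watch the share that did not come from the primary. This is a sketch under assumptions: `ServedByStats` and its methods are hypothetical names, and a real deployment would export these counts to its metrics system rather than hold them in process memory.

```typescript
// Hypothetical in-memory counter for the servedBy field.
export class ServedByStats {
  private total = 0;
  private byProvider = new Map<string, number>();

  constructor(private primaryId: string) {}

  // Call once per successful response with result.servedBy.
  record(servedBy: string): void {
    this.total += 1;
    this.byProvider.set(servedBy, (this.byProvider.get(servedBy) ?? 0) + 1);
  }

  // Fraction of responses answered by anything other than the primary.
  fallbackRate(): number {
    if (this.total === 0) return 0;
    const primary = this.byProvider.get(this.primaryId) ?? 0;
    return (this.total - primary) / this.total;
  }

  // Plain object of counts, convenient for logging.
  snapshot(): Record<string, number> {
    return Object.fromEntries(this.byProvider);
  }
}

const stats = new ServedByStats("anthropic:claude-opus-4-8");
stats.record("anthropic:claude-opus-4-8");
stats.record("anthropic:claude-opus-4-8");
stats.record("anthropic:claude-opus-4-8");
stats.record("openai:gpt-5.5");
console.log(stats.fallbackRate()); // 0.25
```

A fallback rate that jumps from near zero to near one is what a pulled primary looks like from the inside, so it is a reasonable thing to alert on.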
Step 4: Pick a compatible backup, not just any backup
A fallback chain that swaps a frontier reasoning model for a much weaker one keeps you up while quietly making you worse: the silent-drift trap.[12] The fix is to choose backups whose capability and behavior are close enough that a failover is a degradation you've accepted, not a surprise. Here is how the realistic mid-2026 options line up for a coding/agent workload.
| Provider:model | Availability | Recall risk | Notes for fallback |
|---|---|---|---|
| Claude Opus 4.8 | API, GA | Same vendor as Fable 5 | $5/$25 per M tokens standard; 1M context, no long-context surcharge [13] |
| GPT-5.5 | API since Apr 24, 2026 | Different vendor | True cross-vendor isolation; behavior differs, so test prompts both ways [14] |
| Gemini 3.1 Pro | API (preview) | Different vendor | Strong multimodal; separate jurisdiction exposure [15] |
| Kimi K2.7 Code (self-hosted) | Open weights, Modified MIT | None: you hold the weights | 1T-param MoE, 256K context; serve via vLLM/SGLang; ~600GB even at INT4, so real hardware required [16][17] |
Two rules fall out of that table. First, at least one link in your chain should be a different vendor, because a same-vendor backup shares the policy and legal exposure that took your primary down. Second, for the highest-stakes workloads, the only fallback with zero recall risk is a model whose weights you possess: an open-weight model like Kimi K2.7 Code can be served locally via vLLM or SGLang and cannot be pulled by anyone's directive.[16] That independence is why "run local models" went from a data-sovereignty talking point to a procurement line item the week Fable 5 went dark.[7] The catch is honest: a 1-trillion-parameter MoE runs to roughly 600GB even quantized to INT4, so for most teams the open-weight tier is a deliberate last resort, not the default.[17]
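Both rules are easy to check mechanically at startup. The sketch below assumes provider ids follow the `vendor:model` convention used in this guide and that self-hosted providers use a `selfhosted` prefix; `vendorOf` and `checkChain` are hypothetical helper names.

```typescript
// Extracts the vendor part of an id such as "anthropic:claude-opus-4-8".
export function vendorOf(providerId: string): string {
  const sep = providerId.indexOf(":");
  return sep === -1 ? providerId : providerId.slice(0, sep);
}

// Returns a list of warnings; an empty list means the chain passes.
export function checkChain(
  providerIds: string[],
  requireSelfHosted = false, // turn on for the highest-stakes workloads
  selfHostedVendor = "selfhosted", // assumed id prefix for local models
): string[] {
  const warnings: string[] = [];
  const vendors = new Set(providerIds.map(vendorOf));

  // Rule 1: at least one link must come from a different vendor.
  if (vendors.size < 2) {
    warnings.push("single vendor: a recall or policy action can take out the whole chain");
  }
  // Rule 2 (optional): at least one link whose weights you hold.
  if (requireSelfHosted && !vendors.has(selfHostedVendor)) {
    warnings.push("no self-hosted tier: every link depends on a third party");
  }
  return warnings;
}

console.log(checkChain(["anthropic:claude-opus-4-8", "anthropic:claude-sonnet-4-6"]).length); // 1
console.log(checkChain(["anthropic:claude-opus-4-8", "openai:gpt-5.5"]).length); // 0
```

This only checks vendor diversity. Whether a backup's output quality is acceptable still has to be established by running your own prompts against it.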
When to stop hand-rolling and use a gateway
The router above is a couple hundred lines and worth understanding before you outsource it. But once you need per-team budgets, key management, cost tracking, and fallbacks across a dozen models, normalize all of that behind a gateway instead of threading it through every service. LiteLLM Proxy is the open-source default: an MIT-licensed, self-hosted gateway that fronts 100+ provider APIs behind one OpenAI-compatible endpoint with built-in fallback routing, virtual keys, and budgets.[18] We have a full production walkthrough in our LiteLLM Proxy production tutorial, and the same fallback list you saw above maps directly onto its model_list config. A gateway also gets you one more thing the hand-rolled version can't: a single place to change the routing order during an incident, without redeploying every service that talks to an LLM.
The bottom line
The Fable 5 recall turned an abstract risk into a dated incident: on June 12, 2026, a top model vanished for every customer with no notice and no restoration date.[1][2] You can't prevent that. You can make your product indifferent to it. A normalized provider interface, a fallback chain that trips on availability errors, a circuit breaker so dead providers stop costing you latency, and at least one cross-vendor, ideally self-hostable, backup turn "our model got pulled" from an outage into a log line. Build the hand-rolled version to understand it, then put a gateway in front when the operational surface grows. Availability is now a variable. Architect like it.
Footnotes
[1] CNBC, "Anthropic disables access to Fable 5 and Mythos 5 to comply with government directive" (June 12, 2026). https://www.cnbc.com/2026/06/12/anthropic-disables-access-to-fable-5-and-mythos-5-to-comply-with-government-directive.html
[2] Bloomberg, "Anthropic Says US Orders Halt to Foreign Access for Fable 5, Mythos 5 AI Models" (June 13, 2026). https://www.bloomberg.com/news/articles/2026-06-13/anthropic-says-us-limits-foreign-access-to-fable-5-mythos-5
[3] Portkey, "Failover routing strategies for LLMs in production." https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/
[4] Anthropic, "Claude Fable 5 and Claude Mythos 5." https://www.anthropic.com/news/claude-fable-5-mythos-5
[5] Anthropic, "Statement on the US government directive to suspend access to Fable 5 and Mythos 5." https://www.anthropic.com/news/fable-mythos-access
[6] VentureBeat, "Anthropic blocks all public access to Claude Fable 5, Mythos 5 following US government order — what enterprises should do" (June 13, 2026). https://venturebeat.com/technology/anthropic-blocks-all-public-access-to-claude-fable-5-mythos-5-following-us-government-order-what-enterprises-should-do
[7] Cosmic, "Fable 5 and Mythos 5 are gone: a developer action plan" (June 14, 2026). https://www.cosmicjs.com/blog/fable-5-mythos-5-suspended-developer-action-plan
[8] NPR, "Anthropic sues the Trump administration over 'supply chain risk' label" (March 9, 2026). https://www.npr.org/2026/03/09/nx-s1-5742548
[9] Pearl Cohen, "Anthropic Sues Department of Defense Over Supply Chain Risk Designation." https://www.pearlcohen.com/anthropic-sues-department-of-defense-over-supply-chain-risk-designation/
[10] Statsig, "Provider fallbacks: ensuring LLM availability." https://www.statsig.com/perspectives/providerfallbacksllmavailability
[11] Portkey, "Retries, fallbacks, and circuit breakers in LLM apps: what to use when." https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
[12] OpenDirective, "Multi-provider LLM resilience: failover, quotas, and drift." https://opendirective.net/multi-provider-llm-resilience-failover-quotas-and-drift
[13] Finout, "Claude Opus 4.8 pricing 2026" (standard $5/$25 per million input/output tokens; 1M context with no long-context surcharge). https://www.finout.io/blog/claude-opus-4.8-pricing-2026-everything-you-need-to-know
[14] TechCrunch, "OpenAI releases GPT-5.5, bringing company one step closer to an AI 'super app'" (April 23, 2026; GPT-5.5 and GPT-5.5 Pro available in the API as of April 24, 2026). https://techcrunch.com/2026/04/23/openai-chatgpt-gpt-5-5-ai-model-superapp/
[15] NxCode, "Gemini 3.1 Pro complete guide 2026: benchmarks, pricing, API" (released February 19, 2026; available via the Gemini API, still in preview, with the newer Gemini 3.5 Pro announced at Google I/O on May 19, 2026 and targeted for GA in June). https://www.nxcode.io/resources/news/gemini-3-1-pro-complete-guide-benchmarks-pricing-api-2026
[16] Codersera, "Kimi K2.7 Code: the complete guide — benchmarks, pricing & how to use (2026)" (released June 12, 2026 by Moonshot AI; open weights, Modified MIT; 1T total parameters, 32B active, 256K context; serve via vLLM/SGLang). https://codersera.com/blog/kimi-k2-7-complete-guide-2026/
[17] Spheron, "Deploy Kimi K2.7 Code on GPU Cloud: self-host Moonshot's 1T-parameter agentic coding model (2026)" (INT4 weights ~630GB; the Hugging Face repository is ~595GB on disk). https://www.spheron.network/blog/deploy-kimi-k2-7-code-gpu-cloud/
[18] LiteLLM, MIT-licensed self-hosted LLM gateway exposing 100+ provider APIs behind an OpenAI-compatible endpoint with fallback routing, virtual keys, and budgets. https://github.com/BerriAI/litellm