Routing | August 19, 2025 | Sarah Chen

Routing Strategies for Cost vs Quality Tradeoffs

Not all agent invocations need the most powerful model. This post explores routing strategies for managing the cost-quality tradeoff across different workload types and user tiers.

[Figure: abstract diagram of a routing decision tree balancing cost and quality signals]

The default behavior of most AI-powered systems is: use the best model available, always. It's the default because it's safe, it's simple, and it feels like the responsible choice when you're building something new. The problem is that "best model, always" can cost 5-10x more than a well-designed routing strategy, and the quality difference, measured against the actual requirements of each workload, is often negligible.

This post covers how to think about routing strategy for production AI systems: what inputs drive routing decisions, how to define quality thresholds that are actually measurable, and the operational concerns that constrain what routing approaches are feasible.

The cost-quality curve is not linear

A common mental model is that more expensive models are proportionally better. This isn't true. The relationship between cost and quality is non-linear and highly task-dependent.

For classification tasks — intent detection, sentiment analysis, category labeling — the quality difference between GPT-4o-mini and GPT-4o is often less than 3 percentage points on accuracy benchmarks, at a cost difference of roughly 15-20x. If your classification accuracy requirement is 93%, GPT-4o-mini likely hits it. GPT-4o hits 95%. You're paying 15x for 2 points of accuracy.

For complex synthesis tasks — multi-document summarization, code generation from natural language, long-form analysis — the quality difference is much larger. The cheaper models produce output that fails more often in ways that matter: missed nuance, incorrect reasoning steps, structural problems in generated code.

The implication for routing strategy: the cost-quality tradeoff is not a global setting. It's a per-workload-type setting. Routing should be defined at the workload level, not the system level.

Workload-type routing

The most impactful routing decision you can make is to define your workload types and assign models to each. A document analysis pipeline might have four distinct workload types:

classification — route to GPT-4o-mini (fast, cheap, sufficient accuracy)
extraction — route to Claude-3-Haiku (structured output, reliable JSON format)
synthesis — route to GPT-4o (quality requirement is high; this is the customer-visible output)
fact-check — route to GPT-4o or Claude-3-5-Sonnet (accuracy critical, A/B split for quality comparison)

In a routing policy config, this looks like:

routing:
  policy: workload_type
  rules:
    - workload: classification
      model: gpt-4o-mini
      max_latency_ms: 2000

    - workload: extraction
      model: claude-3-haiku
      max_latency_ms: 3000

    - workload: synthesis
      model: gpt-4o
      quality_threshold: 0.88

    - workload: fact-check
      ab_split:
        - model: gpt-4o
          weight: 60
        - model: claude-3-5-sonnet
          weight: 40

This declaration is the complete routing policy for the pipeline. A cost optimization that would previously require code changes across multiple agent functions is now a config update: change the model for the classification workload from gpt-4o-mini to a cheaper model (say, a hypothetical gpt-4o-nano) once one is available, and the change applies to every invocation as soon as the routing engine picks up the new config.
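As a rough illustration of how a routing engine might resolve a policy like this, the sketch below looks up a workload's rule and samples the A/B split by weight. `POLICY` and `resolve_model` are hypothetical names, not the API of any particular routing product; the dict simply mirrors the YAML above after parsing.

```python
import random
from typing import Optional

# Hypothetical in-memory form of the policy above (for example, after parsing the YAML).
POLICY = {
    "classification": {"model": "gpt-4o-mini", "max_latency_ms": 2000},
    "extraction": {"model": "claude-3-haiku", "max_latency_ms": 3000},
    "synthesis": {"model": "gpt-4o", "quality_threshold": 0.88},
    "fact-check": {
        "ab_split": [
            {"model": "gpt-4o", "weight": 60},
            {"model": "claude-3-5-sonnet", "weight": 40},
        ]
    },
}


def resolve_model(workload: str, policy: dict = POLICY,
                  rng: Optional[random.Random] = None) -> str:
    """Return the model for a workload, sampling the A/B split if one is defined."""
    rule = policy.get(workload)
    if rule is None:
        raise KeyError(f"no routing rule for workload {workload!r}")
    if "ab_split" in rule:
        models = [variant["model"] for variant in rule["ab_split"]]
        weights = [variant["weight"] for variant in rule["ab_split"]]
        # random.choices draws one model with probability proportional to its weight.
        return (rng or random).choices(models, weights=weights, k=1)[0]
    return rule["model"]


if __name__ == "__main__":
    print(resolve_model("classification"))
    print(resolve_model("fact-check"))
```

Because agent code only ever calls something like `resolve_model("classification")`, swapping the model behind a workload does not touch the callers.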

Dynamic routing: quality scores as routing signals

Static workload-type routing is a good starting point. The next level is dynamic routing: using quality signals from the invocation itself to decide whether to route to a better model.

The pattern is: attempt the cheaper model first, evaluate the output quality, and retry on the more expensive model if quality falls below threshold. This is sometimes called "cascading" or "quality-gated fallback" — distinct from failure-based fallback (which handles error conditions) in that it handles quality conditions.

routing:
  policy: quality_cascade
  primary:
    model: gpt-4o-mini
    quality_evaluator: internal_eval_v2
    quality_threshold: 0.82
    fallback_on_below_threshold: true

  fallback:
    model: gpt-4o
    on_trigger: quality_below_threshold

The quality evaluator can be a second model call (evaluate the output, not generate it), a deterministic function (check for required fields, validate JSON schema, run assertions on the output structure), or a combination. The key design constraint is that the evaluation must be cheap enough that the two-call strategy (cheap model + evaluation) is still less expensive than the direct expensive model call in the cases where quality is sufficient.
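A minimal sketch of the cascade with a deterministic evaluator follows. It assumes the model calls are plain functions from prompt to raw text, and that "quality" means the share of required JSON fields present; `json_field_score`, `quality_cascade`, and `CascadeResult` are illustrative names, and `internal_eval_v2` from the config above is not modeled here.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence

ModelCall = Callable[[str], str]  # prompt -> raw model output (assumed interface)


def json_field_score(output: str, required: Sequence[str]) -> float:
    """Deterministic evaluator: fraction of required fields present in a JSON object."""
    if not required:
        return 1.0
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return sum(field in data for field in required) / len(required)


@dataclass
class CascadeResult:
    output: str
    served_by: str  # "primary" or "fallback"
    score: float
    fell_back: bool


def quality_cascade(prompt: str, primary: ModelCall, fallback: ModelCall,
                    evaluate: Callable[[str], float],
                    threshold: float = 0.82) -> CascadeResult:
    """Try the cheap model first; retry on the expensive one if the score is too low."""
    output = primary(prompt)
    score = evaluate(output)
    if score >= threshold:
        return CascadeResult(output, "primary", score, False)
    output = fallback(prompt)
    return CascadeResult(output, "fallback", evaluate(output), True)


if __name__ == "__main__":
    # Stub model calls standing in for real provider clients.
    cheap = lambda prompt: '{"intent": "refund"}'
    strong = lambda prompt: '{"intent": "refund", "confidence": 0.97}'
    check = lambda out: json_field_score(out, ("intent", "confidence"))
    print(quality_cascade("classify this ticket", cheap, strong, check))
```

In the stubbed run the cheap output is missing one of two required fields, scores 0.5, and triggers the fallback. Recording `fell_back` on every result gives you the fallback rate that the cost calculation depends on.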

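In practice: every cascaded invocation pays for the cheap call and the evaluation, and only the fraction that fails evaluation also pays for the expensive call. Measured in units of one GPT-4o-mini call, if GPT-4o costs 15x as much and the evaluation is close to free, the expected cost is 1 + 15p against 15 for calling GPT-4o directly, so the cascade stays cheaper on cost alone until the fallback rate p reaches 14/15 (about 93%). A model-based evaluator pulls that break-even point down: if the evaluation costs as much as a GPT-4o-mini call, it drops to 13/15 (about 87%). Cost is not the only term, though: every fallback also pays the latency of two model calls plus the evaluation. Run this calculation for your actual workload before implementing, because the math doesn't always favor cascading.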
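The break-even point can be computed directly. The sketch below assumes a simple cost model: every invocation pays for one cheap call and one evaluation, and a fraction `fallback_rate` additionally pays for one expensive call. The function names are illustrative, and costs can be in any consistent unit.

```python
def cascade_cost(cheap_cost: float, eval_cost: float,
                 expensive_cost: float, fallback_rate: float) -> float:
    """Expected cost per invocation under the cascade."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return cheap_cost + eval_cost + fallback_rate * expensive_cost


def break_even_fallback_rate(cheap_cost: float, eval_cost: float,
                             expensive_cost: float) -> float:
    """Fallback rate at which the cascade costs the same as always calling the expensive model.

    Solves cheap + eval + p * expensive = expensive for p.
    A result at or below zero means the cascade is never cheaper.
    """
    return 1.0 - (cheap_cost + eval_cost) / expensive_cost


if __name__ == "__main__":
    # Costs in units of one cheap-model call, with the 15x ratio used in this post.
    print(break_even_fallback_rate(1, 0, 15))  # free, deterministic evaluator
    print(break_even_fallback_rate(1, 1, 15))  # evaluator as expensive as a cheap call
    print(cascade_cost(1, 0, 15, 0.07))        # expected cost at a 7% fallback rate
```

With a 15x price ratio and a free evaluator the break-even fallback rate is about 0.93; with an evaluator that costs as much as the cheap call it is about 0.87. At a 7% fallback rate the expected cost is about 2.05 units per invocation, against 15 for the direct call. This covers cost only and says nothing about the extra latency of a fallback.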

User tier routing

A common routing dimension for consumer-facing AI products is user tier. Free tier users get routed to GPT-4o-mini. Pro tier users get routed to GPT-4o. Enterprise tier users get GPT-4o with priority rate limits and a zero-fallback policy.

This strategy is intuitive but has an operational complexity: the tier information has to propagate from your application layer into the routing decision. This means the invocation context must carry tier metadata that the routing engine can evaluate.

routing:
  policy: user_tier
  rules:
    - tier: free
      model: gpt-4o-mini
      budget_ceiling_tokens: 2000

    - tier: pro
      model: gpt-4o
      budget_ceiling_tokens: 8000

    - tier: enterprise
      model: gpt-4o
      budget_ceiling_tokens: 32000
      fallback_policy: none  # enterprise users don't degrade to cheaper models

The fallback_policy: none for enterprise is an important detail. Enterprise customers often have contractual SLAs that specify the model they're using. Silently downgrading to a cheaper model during a provider incident would violate those SLAs. The routing policy encodes this constraint explicitly.
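One way to picture the propagation and the no-fallback constraint is the sketch below. The `InvocationContext`, `TierRule`, and `route` names are hypothetical, as is the assumption that free and pro traffic may degrade to `gpt-4o-mini` during an incident; the policy above only states that enterprise traffic must not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierRule:
    model: str
    budget_ceiling_tokens: int
    allow_fallback: bool = True  # False mirrors fallback_policy: none


@dataclass(frozen=True)
class InvocationContext:
    """Metadata the application layer attaches to each invocation."""
    invocation_id: str
    tier: str
    requested_tokens: int


TIER_RULES = {
    "free": TierRule("gpt-4o-mini", 2000),
    "pro": TierRule("gpt-4o", 8000),
    "enterprise": TierRule("gpt-4o", 32000, allow_fallback=False),
}


class ModelUnavailable(RuntimeError):
    """Raised when no model permitted by the tier's policy can serve the call."""


def route(ctx: InvocationContext, healthy_models: set,
          degraded_model: str = "gpt-4o-mini") -> str:
    """Pick a model for the invocation, honoring the tier's budget and fallback policy."""
    rule = TIER_RULES[ctx.tier]
    if ctx.requested_tokens > rule.budget_ceiling_tokens:
        raise ValueError(f"{ctx.invocation_id}: request exceeds the {ctx.tier} token budget")
    if rule.model in healthy_models:
        return rule.model
    if rule.allow_fallback and degraded_model in healthy_models:
        return degraded_model
    # Fail loudly instead of silently serving a model the contract does not allow.
    raise ModelUnavailable(f"{ctx.invocation_id}: {rule.model} is unavailable for tier {ctx.tier}")


if __name__ == "__main__":
    incident = {"gpt-4o-mini"}  # gpt-4o is down in this scenario
    print(route(InvocationContext("inv-1", "pro", 500), incident))
```

During the simulated incident the pro request degrades to `gpt-4o-mini`, while the same call with an enterprise context raises `ModelUnavailable`, which the application can surface as an explicit error.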

A/B routing for model evaluation

Before committing to a new model for a production workload, you want empirical quality data from your actual production traffic — not benchmark numbers from the model provider's marketing page. A/B routing lets you split traffic between your current model and a candidate model, collect quality signals on real invocations, and make the migration decision based on production data.

The operational requirements for this to work:

The routing decision (which variant each invocation was sent to) must be logged as structured telemetry, with a stable invocation ID you can use to join against downstream quality signals
The split must be adjustable without a deployment — start at 5% to the candidate, increase to 50% if quality looks good, then complete the migration
The two variants must return responses in the same format, or the downstream system must handle format differences

The last point is often the constraint that determines whether A/B routing is feasible for a given workload. If your extraction agent parses structured JSON from the model response, and the candidate model produces different field names or value formats, the A/B test will break downstream processing for the split traffic.
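The first two requirements can be sketched with a hash-based assignment, shown below under a few assumptions: the invocation ID is a stable string, the split is read from config on every call, and telemetry is a JSON line. `assign_variant`, `routing_record`, and the experiment name are illustrative, not part of any specific product.

```python
import hashlib
import json
import time


def assign_variant(invocation_id: str, candidate_percent: float,
                   experiment: str = "fact-check-ab") -> str:
    """Deterministically map an invocation to 'control' or 'candidate'.

    Hashing the ID gives each invocation a fixed bucket in [0, 1), so raising
    candidate_percent only moves invocations from control to candidate.
    """
    digest = hashlib.sha256(f"{experiment}:{invocation_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_percent / 100 else "control"


def routing_record(invocation_id: str, variant: str, model: str) -> str:
    """Structured telemetry line that can later be joined on invocation_id."""
    return json.dumps(
        {
            "invocation_id": invocation_id,
            "variant": variant,
            "model": model,
            "logged_at": time.time(),
        },
        sort_keys=True,
    )


if __name__ == "__main__":
    models = {"control": "gpt-4o", "candidate": "claude-3-5-sonnet"}
    for invocation_id in ("inv-001", "inv-002", "inv-003"):
        variant = assign_variant(invocation_id, candidate_percent=5)
        print(routing_record(invocation_id, variant, models[variant]))
```

Hashing rather than drawing a random number means a retried invocation with the same ID lands on the same variant, which keeps the join against downstream quality signals clean.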

What routing strategy cannot solve

Routing strategy is a tool for optimizing cost and quality. It has limits worth stating explicitly:

Routing cannot compensate for a fundamentally bad prompt. If your prompt produces poor results on all models, routing to a better model improves the results incrementally — it doesn't fix the underlying prompt problem.

Routing cannot provide perfect latency predictability. Even with latency-based triggers and fallback chains, model provider latency has genuine variance that routing policy cannot eliminate — only route around.

Routing does not replace evals. Empirical quality measurement — running your production workloads against a quality benchmark — is how you know your routing thresholds are set correctly. The routing policy and the evaluation infrastructure are complementary, not substitutes.

What routing strategy does well: it prevents the default behavior of "most expensive model, always" from unnecessarily inflating your infrastructure cost; it makes the cost-quality tradeoff an explicit policy decision that's reviewable and adjustable; and it gives you the instrumentation to understand your workload quality distribution across models — which is the foundation for making better routing decisions over time.
