In 2023, the team that first handed me a "working" multi-agent prototype was proud of what they'd built. The system coordinated three specialized agents — one for retrieval, one for drafting, one for fact-checking — and produced genuinely useful output. The code was around 400 lines. It ran fine in development. I told them to push it to staging. That's when the rot started to show.
Within a week, staging had a different timeout value from development. The retry logic in the drafting agent and the retry logic in the retrieval agent had drifted — they'd been copy-pasted from each other and then independently modified. The fact-checking agent had no fallback at all when its model provider returned a 429. There was no way to ask "what model did invocation abc123 actually use?" because nobody had instrumented anything. Token spend was a post-hoc reconstruction from provider invoices.
The team hadn't done anything wrong. They'd done what you do when building prototypes: ship the thing that works. The problem is that multi-agent AI systems have a class of operational requirements that don't reveal themselves during prototyping — and by the time they do, the orchestration logic is already distributed across every corner of the codebase.
The orchestration problem is infrastructure, not application logic
When you build a distributed backend service, you don't implement mTLS inside each service. You don't implement retry-with-jitter in every service. You don't implement circuit breaking in every service. Those concerns are cross-cutting — they apply to every service boundary — so you put them in infrastructure: a service mesh, a sidecar proxy, a load balancer. The application code is clean precisely because the infrastructure layer absorbed the complexity.
Multi-agent AI systems have exactly the same structure. The concerns that matter in production — which model to route to, what happens when that model is unavailable, how much token budget this workflow is allowed to consume, whether a human needs to review this output before it proceeds — are cross-cutting. They apply to every agent invocation. Putting them inside agent code is like implementing circuit breaking inside every microservice: it works until you have eight agents with eight different retry implementations.
The pattern that has emerged from the teams we've worked with is that orchestration logic tends to move through three phases:
- Phase 1 (Prototype): Zero orchestration infrastructure. One agent. Direct model API calls. Works fine.
- Phase 2 (Naive scaling): More agents. Copy-pasted retry logic. Hardcoded model names. Budget alerts in Slack. Human-in-the-loop (HITL) approval via a Slack bot someone wrote in an afternoon.
- Phase 3 (Crisis): A model provider has an outage. Every agent fails differently. On-call gets paged at 2am. Three different engineers are debugging three different failure modes that all trace back to the same root cause.
Phase 3 is usually when the team decides to "clean up the orchestration layer." By that point, it's a refactor, not a feature — and refactors of production AI systems are expensive and risky.
What a control plane actually does
A control plane for multi-agent AI workflows is conceptually identical to a service mesh: it sits between your application code and the underlying infrastructure (model providers), and it handles the cross-cutting operational concerns so your application code doesn't have to.
In practice, this means four capabilities that should live in the control plane rather than in agent code:
Routing policy. Which model gets this invocation? The naive answer is "the best one." The production answer is a policy: use the cheaper model for classification tasks; use the more capable model when the required quality score is above 0.85; send 30% of traffic to the new model to validate it before full rollout. That policy should be declared in config, not embedded as conditionals in eight different agent functions.
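To make the routing example concrete, here is a minimal sketch of that policy expressed as data plus one small evaluator. Everything named here is an assumption for illustration: the `Rule` fields, the model names (`small-model`, `large-model`, `candidate-model`), and the `route` function are hypothetical, not any product's actual schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    model: str
    task: Optional[str] = None           # match only this task type, if set
    min_quality: Optional[float] = None  # match only when required quality is above this


# Hypothetical policy, evaluated top to bottom; the first matching rule wins.
POLICY = (
    Rule(model="small-model", task="classification"),
    Rule(model="large-model", min_quality=0.85),
)
DEFAULT_MODEL = "small-model"
CANARY_MODEL = "candidate-model"


def in_canary(invocation_id: str, percent: int) -> bool:
    """Deterministically bucket an invocation by hashing its ID into 0..99."""
    digest = hashlib.sha256(invocation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def route(invocation_id: str, task: str, required_quality: float,
          canary_percent: int = 30, policy: Sequence[Rule] = POLICY) -> str:
    """Pick a model for one invocation from the declared policy."""
    # The canary split is applied first, so the new model sees a slice of all traffic.
    if in_canary(invocation_id, canary_percent):
        return CANARY_MODEL
    for rule in policy:
        if rule.task is not None and rule.task != task:
            continue
        if rule.min_quality is not None and required_quality <= rule.min_quality:
            continue
        return rule.model
    return DEFAULT_MODEL
```

Hashing the invocation ID rather than drawing a random number is one possible design choice: the same invocation always lands in the same bucket, which makes a routing decision reproducible after the fact.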
Fallback chains. When your primary model returns a 429 or a 503, what happens? The wrong answer is: each agent handles it differently. The right answer is: the fallback chain is declared once, in the control plane, and activates automatically on any invocation that triggers the conditions. The chain can cascade: OpenAI → Anthropic → local model via Ollama — and you can define different chains for different workflows.
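A fallback chain declared once might look roughly like the sketch below. It assumes each provider is wrapped in a plain callable that raises a `ProviderError` carrying the HTTP status; that exception class, the `call_with_fallback` helper, and the two stub providers are all hypothetical stand-ins rather than a real client library.

```python
from typing import Callable, Optional, Sequence, Tuple

# Status codes that should trigger the next provider in the chain.
RETRYABLE_STATUS = frozenset({429, 503})


class ProviderError(Exception):
    """Hypothetical error raised by a provider wrapper on a failed call."""

    def __init__(self, provider: str, status: int):
        super().__init__(f"{provider} returned HTTP {status}")
        self.provider = provider
        self.status = status


Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    last_error: Optional[ProviderError] = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # e.g. a 400 is a caller bug, not an availability problem
            last_error = err
    raise RuntimeError("all providers in the fallback chain failed") from last_error


# Stub providers standing in for real clients.
def rate_limited(prompt: str) -> str:
    raise ProviderError("primary", 429)


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


# One chain per workflow, declared in one place.
DRAFTING_CHAIN = [("primary", rate_limited), ("secondary", healthy)]
```

With these stubs, `call_with_fallback("hi", DRAFTING_CHAIN)` skips the rate-limited primary and returns the secondary's answer. No agent needs to know that a fallback happened.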
Token budget enforcement. Soft limits ("alert when spend exceeds $X") are advisory: they tell you after the fact. Hard enforcement means the control plane intercepts any invocation that would exceed a configured budget ceiling before it reaches the model. The invocation doesn't happen. The budget is respected. This is the difference between a guardrail and a speed bump.
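One way to sketch hard enforcement is a reserve-then-settle check around each call. This assumes the caller can estimate token usage up front and that the model call reports actual usage afterwards; `TokenBudget`, `guarded_invoke`, and `BudgetExceeded` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised before the model call when the budget ceiling would be crossed."""


@dataclass
class TokenBudget:
    ceiling: int
    spent: int = 0

    def reserve(self, estimated_tokens: int) -> None:
        """Reject the invocation up front if it would push spend past the ceiling."""
        if self.spent + estimated_tokens > self.ceiling:
            raise BudgetExceeded(
                f"{self.spent} spent + {estimated_tokens} estimated > ceiling {self.ceiling}"
            )
        self.spent += estimated_tokens

    def settle(self, estimated_tokens: int, actual_tokens: int) -> None:
        """Swap the reserved estimate for the usage actually reported."""
        self.spent += actual_tokens - estimated_tokens


def guarded_invoke(budget: TokenBudget, estimated_tokens: int,
                   call: Callable[[], Tuple[str, int]]) -> str:
    """Run `call` only if the budget allows it; `call` returns (response, tokens_used)."""
    budget.reserve(estimated_tokens)  # raises before the model is ever contacted
    try:
        response, actual_tokens = call()
    except Exception:
        # Simplification: treat a failed call as free and release the reservation.
        budget.settle(estimated_tokens, 0)
        raise
    budget.settle(estimated_tokens, actual_tokens)
    return response
```

The check runs on an estimate, so a real ceiling would probably need some headroom for calls that use more tokens than predicted.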
Human-in-the-loop gates. When should a human review before an agent's output proceeds? This question has a compliance-driven answer for regulated industries (financial decisions, PII handling, medical triage), and a quality-driven answer for everyone else (confidence below threshold, unusual output detected). Either way, the gate definition belongs in config, not in ad hoc Slack bots.
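As an illustration, a gate definition can be a small piece of data evaluated in one place. The category names below come from the examples above, but the 0.7 confidence threshold and the `GatePolicy`, `AgentOutput`, and `review_reason` names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class GatePolicy:
    # Compliance-driven: these output categories always require review.
    review_categories: FrozenSet[str] = frozenset(
        {"financial_decision", "pii", "medical_triage"}
    )
    # Quality-driven: review anything below this confidence (assumed threshold).
    min_confidence: float = 0.7


@dataclass(frozen=True)
class AgentOutput:
    text: str
    category: str
    confidence: float


def review_reason(output: AgentOutput, policy: GatePolicy) -> Optional[str]:
    """Return why a human must review this output, or None if it may proceed."""
    if output.category in policy.review_categories:
        return f"compliance: category '{output.category}' requires review"
    if output.confidence < policy.min_confidence:
        return (
            f"quality: confidence {output.confidence:.2f} "
            f"is below {policy.min_confidence:.2f}"
        )
    return None
```

Returning the reason rather than a bare boolean means the reviewer and the audit trail can both see which rule opened the gate.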
The declarative argument
One subtle but important property of control plane architecture is that behavior becomes declarative. Your orchestration policy is a YAML file. It has a git history. You can diff it, review it in a pull request, and roll it back if something goes wrong — without redeploying agent code.
With imperative orchestration — logic embedded in agent functions — changing the fallback policy requires finding every place that policy is expressed, modifying each one consistently, testing each agent individually, and deploying each service. With declarative orchestration in a control plane, you edit one file and apply it. The agents are unaware the policy changed.
This distinction matters most during incidents. When a model provider degrades, you want to update your fallback configuration in under two minutes — not coordinate a multi-service deployment at 3am.
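A rough sketch of the "edit one file and apply it" idea follows. JSON stands in for YAML here only so the example stays within the Python standard library; the `Policy` fields, the `ControlPlane` class, and the lowercase provider labels are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Policy:
    version: str
    fallback_chain: Tuple[str, ...]


def load_policy(text: str) -> Policy:
    """Parse a policy document (JSON here; the same shape could be YAML)."""
    raw = json.loads(text)
    return Policy(version=raw["version"], fallback_chain=tuple(raw["fallback_chain"]))


class ControlPlane:
    """Holds the active policy and keeps earlier versions for rollback."""

    def __init__(self, policy: Policy):
        self._history: List[Policy] = [policy]

    @property
    def policy(self) -> Policy:
        return self._history[-1]

    def apply(self, policy: Policy) -> None:
        self._history.append(policy)

    def rollback(self) -> None:
        if len(self._history) > 1:
            self._history.pop()


def drafting_agent(plane: ControlPlane) -> str:
    # The agent asks the control plane which provider to use; it never names one.
    return plane.policy.fallback_chain[0]


plane = ControlPlane(load_policy(
    '{"version": "v1", "fallback_chain": ["openai", "anthropic", "ollama"]}'
))
# During an incident: apply a new policy. The agent code is untouched.
plane.apply(load_policy('{"version": "v2", "fallback_chain": ["anthropic", "ollama"]}'))
```

After the `apply` call, `drafting_agent(plane)` resolves to the new first provider, and `plane.rollback()` restores the previous policy without a redeploy.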
The observability argument
One of the consistently underappreciated benefits of centralizing orchestration is what it does for observability. When routing logic is distributed across agents, you can observe each agent in isolation — you can see that agent A made a call and got a response — but you can't easily answer: what was the routing decision for invocation abc123? Did the fallback activate? Which tier? Was the budget enforcement event triggered before or after the primary attempt?
A control plane is a natural instrumentation point. Every invocation passes through it. Every routing decision, every fallback activation, every budget enforcement event, every HITL gate open/close — these are all observable from one place, with consistent structure. The telemetry that comes out of a control plane is the kind of telemetry your on-call engineer can actually use during an incident to understand what the system is doing.
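The sketch below shows one possible shape for that telemetry: a single event type recorded at the one place every invocation passes through. The `DecisionEvent` fields, the event kinds, and the `DecisionLog` class are assumptions for illustration, not a defined schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class DecisionEvent:
    invocation_id: str
    kind: str  # e.g. "routing", "fallback", "budget", "hitl_gate"
    detail: Dict[str, Any]
    timestamp: float = field(default_factory=time.time)


class DecisionLog:
    """Collects every control-plane decision with one consistent structure."""

    def __init__(self) -> None:
        self._events: List[DecisionEvent] = []

    def record(self, invocation_id: str, kind: str, **detail: Any) -> None:
        self._events.append(DecisionEvent(invocation_id, kind, detail))

    def for_invocation(self, invocation_id: str) -> List[DecisionEvent]:
        """Everything the control plane decided for one invocation, in order."""
        return [e for e in self._events if e.invocation_id == invocation_id]

    def to_json_lines(self) -> str:
        """One JSON object per line, suitable for shipping to a log pipeline."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)


log = DecisionLog()
log.record("abc123", "routing", model="large-model")
log.record("abc123", "fallback", tier=2, trigger_status=429)
log.record("def456", "budget", allowed=False)
```

Here `log.for_invocation("abc123")` returns the routing decision followed by the fallback activation, which is the question distributed agent logs cannot easily answer.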
When to do this
The common objection is: "We're too early for infrastructure." Sometimes that's true. If you have one agent and one model and you're still figuring out if the use case works, a control plane is premature. Build the prototype. Validate the concept.
The inflection point is when you have more than two agents, or when you're deploying to a real production environment, or when you have a compliance requirement (HITL, audit trail, budget cap). At that point, the cost of not having a control plane is already accumulating — in the form of inconsistent retry logic, shadow budgets, and HITL approval flows that live in no one's runbook.
In our experience, the teams that add orchestration infrastructure early tend to report fewer incidents and faster resolution when things do go wrong. The teams that wait until Phase 3 tend to report that the refactor to extract orchestration logic from application code is one of the most expensive engineering projects they've run.
OrchVynt is the control plane we wished existed when we were the team in Phase 3. The four primitives — routing, fallback chains, token budgets, HITL gates — are the four things that belong in infrastructure, not in your agent code.