Smart routing switches less, not more
Most pictures of “smart routing” between LLMs imagine clever adaptive switching — the system notices the task is hard, escalates to a better model; notices an error, escalates again. The naïve mental model is more adaptive = more switching = better.
I came to MTRouter (arXiv 2604.23530) holding roughly that picture, mostly because I’d been thinking about capability routing as the binding constraint in long-horizon agents. The paper’s headline finding is the opposite of the naïve picture, and the more I sat with it the more it shifted how I’d design a routed harness.
What MTRouter actually claims
The setup is a per-turn router for long-horizon agent tasks. Different turns of an episode are heterogeneous — some require strategic reasoning, some are routine bookkeeping. The router learns an outcome estimator mapping (interaction history, candidate model) to predicted terminal task performance, and picks the model that maximizes that predicted outcome under a budget.
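To make the selection step concrete, here is a minimal sketch of how I read it. The estimator signature, the per-turn cost table, and the hard budget cap are all my assumptions; the paper's actual budget mechanism may well be softer than this.

```python
from typing import Callable, Dict, Sequence

# Hypothetical signature: (interaction history, model name) -> predicted terminal score.
OutcomeEstimator = Callable[[Sequence[str], str], float]


def pick_model(
    history: Sequence[str],
    turn_costs: Dict[str, float],  # assumed: model name -> estimated cost of this turn
    estimator: OutcomeEstimator,
    remaining_budget: float,
) -> str:
    """Return the affordable candidate with the highest predicted terminal outcome."""
    affordable = [m for m, cost in turn_costs.items() if cost <= remaining_budget]
    if not affordable:
        # Assumption: when nothing fits the budget, fall back to the cheapest model.
        return min(turn_costs, key=lambda m: turn_costs[m])
    return max(affordable, key=lambda m: estimator(history, m))
```

The point of the sketch is that the router scores the whole remaining trajectory, not just the next turn: the estimator predicts the terminal score, so a cheap model wins whenever it is not expected to hurt the final outcome.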
It is trained on offline trajectories with annealed error costs. There are no dense per-turn rewards; just terminal scores adjusted down when the trajectory had detectable problems (format violations, invalid actions), severity-weighted, increasing toward the end of the episode.
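A rough sketch of what such a training label could look like. The linear ramp, the severity values, and the `max_penalty` scale are placeholders I made up for illustration; I have not reproduced the paper's actual schedule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnError:
    turn: int        # 0-based index of the turn where the problem was detected
    severity: float  # hypothetical weight, e.g. 0.5 for a format violation, 1.0 for an invalid action


def adjusted_score(
    terminal_score: float,
    errors: List[TurnError],
    num_turns: int,
    max_penalty: float = 0.1,  # assumed scale of a single worst-case penalty
) -> float:
    """Terminal score minus severity-weighted penalties that grow toward the episode's end."""
    penalty = 0.0
    for err in errors:
        # Assumed linear ramp: 1/num_turns at the first turn, 1.0 at the last.
        position_weight = (err.turn + 1) / num_turns
        penalty += max_penalty * err.severity * position_weight
    return max(0.0, terminal_score - penalty)
```

With these placeholder values, an invalid action on the first turn of a 10-turn episode costs 0.01 and the same error on the last turn costs 0.1, so late mistakes weigh more in the label.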
Headline numbers: on ScienceWorld it gets 53.8 vs GPT-5’s 48.4, at 58.7% lower cost. On HLE: 26.0% accuracy at 43.4% cost reduction vs GPT-5. Both are real and both are useful. Neither is the finding I want to flag.
The finding that actually matters
The behavioral comparison against Router-R1 (a competing RL-trained router) is where the picture gets interesting.
| | MTRouter | Router-R1 |
|---|---|---|
| Avg. model switches per episode | ~5 | ~20 |
| Stay-with-model rate after an error | 90.2% | 38.3% |
MTRouter switches roughly a quarter as often as Router-R1, and stays with the same model after an error nine times out of ten. The router did not learn “escalate when things go wrong.” It learned “find the right model for this trajectory phase and stay with it.”
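If you want to compute the same two behavioral metrics on your own logs, something like the following works. The input format (one model name and one error flag per turn) is my assumption, and the paper may define "after an error" slightly differently.

```python
from typing import List, Optional, Tuple


def switch_stats(models: List[str], errors: List[bool]) -> Tuple[int, Optional[float]]:
    """models[i] is the model used on turn i; errors[i] is True if turn i had an error.

    Returns (number of model switches, stay-with-model rate after an error).
    The rate is None when no errored turn is followed by another turn.
    """
    if len(models) != len(errors):
        raise ValueError("models and errors must have the same length")
    switches = 0
    stays_after_error = 0
    turns_after_error = 0
    # Pair each turn with its successor; errors[i] lines up with the earlier turn.
    for prev, cur, prev_errored in zip(models, models[1:], errors):
        if cur != prev:
            switches += 1
        if prev_errored:
            turns_after_error += 1
            if cur == prev:
                stays_after_error += 1
    rate = stays_after_error / turns_after_error if turns_after_error else None
    return switches, rate
```

Averaging the first value over episodes gives the switches-per-episode row; pooling the second over all errored turns gives the stay rate.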
Why this is right, in retrospect:
- Errors are often transient. A bad sample, a parse failure, a stale assumption that one more step will dislodge. Switching models adds friction (lost KV cache, prompt-format drift, latency) that usually outweighs the marginal benefit of trying a different model on the next turn.
- Switching has costs the naïve picture under-weights. Prompt caches are large and fragile; trajectory state often depends on a specific model’s formatting choices; the orchestration code has to handle hand-off ambiguity. None of this shows up in a back-of-envelope “cost = sum of per-token charges” calculation.
- Trajectory phase changes more slowly than turn-level error signals. Most long-horizon tasks have a “currently exploring” phase, a “currently executing” phase, and an “ambiguity-resolution” phase. Each phase is relatively long, so the right routing decision also holds for a relatively long time. Within-phase switching is mostly noise.
The wisdom MTRouter operationalizes: route by trajectory phase, not by error signal. The fallback-router architecture — the obvious thing, the thing I had implicitly in mind — is the wrong architecture.
Calibration update I had to make
If you’d asked me before reading this paper what routing buys in practice, I would have said something like “you can probably get an order-of-magnitude cost reduction at maintained quality by routing aggressively to small models.” That was vibes.
The actual number, in the most favorable conditions, is about 50% cost reduction (43.4% on HLE, 58.7% on ScienceWorld) with maintained or modestly improved quality. That is roughly 2×, not 5–10×. Useful, real, but the order-of-magnitude version was wishful thinking on my part. Worth carrying forward as a calibrated expectation rather than an aspirational one.
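The gap is easy to underestimate when one number is a percentage and the other a multiplier, so here is the conversion spelled out. The helper name is mine; the percentages are the headline figures quoted above.

```python
def cost_multiplier(percent_reduction: float) -> float:
    """Convert 'X% cheaper' into 'N times cheaper': 1 / (1 - X/100)."""
    if not 0 <= percent_reduction < 100:
        raise ValueError("percent_reduction must be in [0, 100)")
    return 1.0 / (1.0 - percent_reduction / 100.0)


for label, pct in [("ScienceWorld", 58.7), ("HLE", 43.4), ("order of magnitude", 90.0)]:
    print(f"{label}: {pct}% cheaper = {cost_multiplier(pct):.2f}x")
```

This prints about 2.42x for ScienceWorld and 1.77x for HLE. An order of magnitude would need a 90% reduction.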
What this changes about what I’d build
The most actionable finding for someone designing a routed agent (which I am) is the emergent tool-type specialization the paper reports. DeepSeek was over-represented on search-tool turns by 1.66×; GPT-5 was over-represented on Python turns by 1.51×. The router learned these tool-type gradients from data without being told.
Which means a heuristic router that maps tool type to model class plausibly captures much of MTRouter’s gains at zero training cost; my untested guess is 60–80%. The mapping is the kind of thing you can sketch on a napkin: search and file-IO turns to a small model; ambiguity-resolution and multi-step reasoning to a large model; coding to whichever model happens to be best at the language you’re generating.
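The napkin version, as code. Every name here (tool types, model classes, the default) is a placeholder of mine, not something taken from the paper.

```python
from typing import Dict, List

# Hypothetical tool-type -> model-class table; fill in with your own models.
TOOL_TO_MODEL: Dict[str, str] = {
    "search": "small",
    "file_io": "small",
    "ambiguity_resolution": "large",
    "multi_step_reasoning": "large",
    "python": "code",  # whichever model is best at the language being generated
}
DEFAULT_MODEL = "large"  # assumption: unknown tool types go to the stronger model


def route(tool_type: str) -> str:
    """Pick a model class from the tool type alone."""
    return TOOL_TO_MODEL.get(tool_type, DEFAULT_MODEL)


def route_episode(tool_types: List[str]) -> List[str]:
    """Route every turn of an episode; the model only changes when the tool type does."""
    return [route(t) for t in tool_types]
```

Note what is missing: there is no error flag among the inputs. A failed search turn followed by another search turn stays on the same model, which is the stay-put behavior the learned router converged on.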
That heuristic is the move I’d test first. It’s also the move I expect to substantially outperform the obvious alternative (error-triggered escalation), and the literature now gives a reason to believe that.
Limits and asterisks
A few things to keep honest about MTRouter:
- Training data is expensive. ~$1,620 one-time for diverse long-horizon trajectories across multiple models. Fine for a deployed multi-tenant system; prohibitive for a one-off agent.
- Offline learning only. No real-time adaptation to novel environments. For a static-environment agent, fine. For an agent in evolving conditions, no good.
- Prompt-caching overhead is acknowledged but hand-waved. The paper notes that switching models forfeits KV cache and claims structured switching patterns mitigate this, but doesn’t quantify the cost.
Bottom line
The two things worth carrying forward from this paper, in order of importance to me:
- Smart routing stays put. Errors are not the right switching signal. Trajectory phase is. The naïve “escalate on failure” architecture is wrong even when it sounds obvious.
- Routing buys ~50% cost reduction at maintained quality, not an order of magnitude. Calibrate accordingly.
I want to design a heuristic router for a real task this month and see how close it gets to MTRouter’s numbers without the training-data overhead. If the heuristic version recovers most of the gains, that’s the version most engineering teams should actually use. If it doesn’t, that’s also useful — it tells you the learned outcome estimator is doing more work than tool-type specialization explains, and that’s worth knowing.
Either way, the picture I came in with is gone. Smart routers switch less.