Skip to content

Cahid Arda Öz

DX - Software Engineer at Upstash

Istanbul, Turkey

Sakana Fugu: Multi-Agent Orchestration Explained in Detail

Sakana's Fugu orchestrates frontier LLMs into a single endpoint. A guided tour of its two engines (the evolved TRINITY router and the RL-trained Conductor) with interactive diagrams of both architectures.

Blog Essays, opinions, and how-tos.

Yesterday, Sakana AI (an AI research lab in Tokyo) introduced Sakana Fugu:

Frontier models have started to specialize. GPT is strong at math and planning, Opus at software engineering and security, Gemini at science and recall. No single one wins everywhere. Sakana AI’s Fugu takes a different tack: instead of being a model that answers your query, it is a model that decides which other models should, and how they should work together.

“Sakana Fugu is a family of language models trained to adaptively and dynamically orchestrate a team of more powerful frontier agent workers… the user interacts with Fugu as if calling a single model, while internally the system can route, delegate, and coordinate across multiple specialized agents.” (Sakana Fugu Technical Report)

Two things are worth knowing up front:

  • Not trained from scratch. Fugu owns no frontier weights of its own. It is a small orchestrator that learns to combine the existing models you already know (GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, and a bench of open-source LLMs) into one collective. Sakana calls this “a new complementary scaling axis beyond ever larger and expensive language models.”
  • Hosted, not a download. You reach Fugu through an OpenAI-compatible API and cannot run it locally. The routing is proprietary, and it never even tells you which underlying model answered.

This post is a tour of how it works, built around the two ICLR 2026 papers the system is grounded in: TRINITY, an evolved coordinator, and the Conductor, an RL-trained one. The diagrams below are interactive; play them.

One model, many minds

There are two variants. Fugu picks a single best worker per query, so its latency is “comparable to a direct call to a frontier model.” Fugu-Ultra composes a whole team. Toggle between them:

How Fugu worksone interface · many minds
Youone API callFuguorchestratorGPT-5.5OpenAIOpus 4.8AnthropicGemini 3.1Google

1/5 · You send one request to a single endpoint.

The trick that makes this more than a router is that Fugu is itself a trained model. It is not a hand-written if coding: use_opus rule; it learns, from data, who is good at what.

A pool of specialists

Why bother? Because the workers genuinely differ, and the differences are fine-grained:

“At the domain level, past GPT-series models often displayed state-of-the-art performance on mathematical reasoning tasks, while Opus-series models have specialized in software engineering and cybersecurity… in competitive coding, Gemini-3.1-Pro can be particularly effective at directly implementing known algorithms, while GPT-series models can often excel at planning and combining multiple algorithmic ideas.” (Sakana Fugu Technical Report)

A good orchestrator should track these specializations and deploy each model where it shines. Fugu does: its routing distribution shifts by domain, peaking on whichever worker is SOTA there. Pick a domain and watch the mix move (the percentages are illustrative of the routing trends the technical report describes):

Domain adaptivityroute a query, watch the mix shift

Pick a domain. The orchestrator’s routing distribution re-weights toward whichever worker is strongest there, “a hallmark feature of an intelligent orchestrator.”

GPT-5.5
60%
Claude Opus 4.8
29%
Gemini 3.1 Pro
11%

GPT-5.5 leads here. Agentic terminal coding: GPT-5.5 is SOTA, so routing peaks on GPT.

Fugu: routing in a single forward pass

The fast variant, Fugu, builds on TRINITY. It is two small pieces: a ~0.6B language model (the “backbone”) and a tiny head of about ten thousand parameters bolted onto it. The backbone reads your query into an internal vector; the head turns that vector into one score per worker. The choice is just a softmax over those scores:

πθ(as)    exp ⁣(fθ(h(s))a)\pi_\theta(a \mid s) \;\propto\; \exp\!\big(f_\theta(h(s))_a\big) The router's policy: the probability of dispatching query/state s to worker a. The backbone compresses the query into a hidden vector h(s); the head f_θ turns it into one score per worker; a softmax over those scores gives the selection probabilities (∝ means 'proportional to', before normalizing).

The clever part: the backbone never actually writes anything. A normal model commits to a token, then the next, autoregressively. Slow. But Fugu doesn’t need a sentence, only a choice, so it reads the backbone’s internal state and routes immediately, skipping generation entirely. As TRINITY puts it, “the coordinator’s generated text is discarded because the job of prompting is delegated to the LLMs in the pool.” That decision-only design is the whole reason Fugu is fast. TRINITY can also hand the chosen worker one of three roles (Thinker, Worker, Verifier) and loop over several turns; production Fugu drops the roles and just picks a worker, which is simpler and quicker. Toggle between the two below:

Fugu’s orchestrator: the TRINITY parametrizationit outputs a choice, not an answer
1 · Query“Write me code for binary search”
2 · Small LM (≈0.6B)reads the query and turns it into an internal vector; its “hidden state” h
3 · Trained head≈10K paramsthe only part trained; turns the hidden state into a score for every worker
4 · Pick
GPT
Opus
Gemini

The orchestrator never writes the answer; its own text output is thrown away. Because it only needs to emit a choice, it can decide almost instantly instead of generating a full response, which keeps Fugu’s latency “comparable to a direct call to a frontier model.” Fugu keeps it minimal: one worker, no roles, dispatched at once.

Training without gradients

Heads up: this section gets a bit technical. If you only want the high-level picture, skip ahead to Fugu-Ultra.

How do you train a head whose decision only pays off after several expensive LLM calls? In two stages.

Stage 1: supervised fine-tuning. Where do the labels come from? Not from humans. Sakana starts with a big pool of verifiable single-step tasks (coding, math, reasoning), each with a known correct answer. To label a task, they simply run every worker on it several times and check who actually solves it. Those measured success rates (rˉi,j\bar r_{i,j} = worker jj‘s average score on task ii) become a soft target: the head is trained to match the distribution of who-does-well, rather than memorizing one “best” label. The loss is just the KL divergence between that target and the head’s softmax:

pi(j)=exp(rˉi,j/τ)jexp(rˉi,j/τ),LSFT(θ)=1DiDKL ⁣(pi()πθ(qi))p_i(j) = \frac{\exp(\bar r_{i,j}/\tau)}{\sum_{j'} \exp(\bar r_{i,j'}/\tau)}, \qquad \mathcal{L}_{\text{SFT}}(\theta) = \frac{1}{|\mathcal{D}|}\sum_i \mathbb{D}_{\mathrm{KL}}\!\big(p_i(\cdot)\,\|\,\pi_\theta(\cdot \mid q_i)\big) Left, the soft training target: worker j's measured success rate r̄ on task i, passed through a temperature-τ softmax so stronger workers get more probability mass. Right, the SFT loss: the KL divergence that pulls the head's distribution π_θ toward that target, averaged over the dataset D. Lower KL = the head agrees with who actually solves each task.

Stage 2: evolution. Single-step accuracy isn’t the real goal, though; what matters is whether a whole multi-turn session succeeds. So the second stage trains on end-to-end tasks taken from real coding assistants (Claude Code, Codex, OpenCode), and the reward is a single bit: did the task ultimately get completed?

R(τ){0,1},J(θ)=Eτπθ ⁣[R(τ)]R(\tau) \in \{0, 1\}, \qquad J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big] A whole multi-turn run τ scores 1 if the task is ultimately solved, 0 otherwise (a single success bit). The objective J(θ) is the expected reward over runs drawn from the policy π_θ: in plain terms, how often the orchestrator's sessions succeed. Training pushes this number up.

They measure it the obvious way: run the orchestrator on a task a few times and average the outcomes. The question is how to improve the ~10K head parameters from that signal. The textbook answer is reinforcement learning: policy-gradient methods like REINFORCE that estimate a gradient for each parameter from the reward. Here that fails: each of the thousands of weakly-coupled parameters barely moves a noisy, all-or-nothing reward, so the per-parameter gradient is drowned in noise.

“We observe weak coupling among parameters — each has only a tiny influence on the scalar reward, making traditional methods like REINFORCE’s per-parameter gradients low-SNR and therefore ineffective.” (TRINITY)

So Fugu skips gradients altogether and uses an evolution strategy (sep-CMA-ES). It treats the orchestrator as a black box: perturb the entire parameter vector many different ways, run each variant on real tasks, then keep and blend the ones that scored highest into the next “parent”:

θ(k)=θt+σtDtz(k),z(k)N(0,I)\theta^{(k)} = \theta_t + \sigma_t D_t\, z^{(k)}, \qquad z^{(k)} \sim \mathcal{N}(0, I) One generation of the evolution strategy. Candidate k is the current parameter vector θ_t plus Gaussian noise z (drawn from a standard normal), scaled by the step size σ_t and the diagonal covariance D_t, i.e. a perturbed copy of the orchestrator. Many such candidates are run on real tasks, ranked by reward, and the best are blended into the next parent θ_{t+1}.

Then repeat. No gradient required, just run it and rank it. That is exactly what the animation shows: a cloud of perturbed orchestrators, scored by reward, recombining toward the best.

Training stage 2: evolution with sep-CMA-ESno gradients, just fitness

After supervised fine-tuning, Fugu is refined with a derivative-free evolution strategy that directly maximises the terminal reward of end-to-end tasks. Each dot is one perturbed orchestrator; elites are recombined into the next parent.

generation 0/22 · distance to optimum 4.83

“sep-CMA-ES… iteratively improves a central ‘parent’ policy by sampling a population of perturbed parameter vectors, evaluating each candidate to obtain a fitness score, and recombining candidates via fitness-weighted averaging.”

Fugu-Ultra: conducting an orchestra

For the hardest problems, a single pick is not enough. Fugu-Ultra is built differently from Fugu. Where Fugu is a tiny routing head that emits a single number, Fugu-Ultra’s orchestrator (the Conductor) is a full language model (~7B in the paper) that genuinely writes. It reads your query and generates a whole plan as text: who does what, and who gets to read whom.

“The Conductor outputs full agentic workflows that divide an input task, allocate natural-language subtasks, and define targeted communication strategies to best make use of the agents’ complementary capabilities.” (the Conductor)

Concretely, that plan is three aligned lists: the subtasks, the worker for each, and an access list stating which earlier outputs each worker is allowed to see. Those access lists are the communication topology, letting the same model emit a chain, a best-of-N vote, a debate tree, or even a recursive call to itself. It’s trained end-to-end with reinforcement learning (GRPO), and the reward is blunt: 00 if the lists don’t parse, 0.50.5 for a well-formed-but-wrong workflow, 11 for a correct one, with the GRPO advantage normalized within each group, Ai=(rimean(r))/std(r)A_i = \big(r_i - \mathrm{mean}(r)\big)/\mathrm{std}(r). The diagram below shows the Conductor writing the plan, then the topology it produced. Switch topologies and step through one:

Fugu-Ultra: the Conductor’s workflowthree lists become a topology

Unlike Fugu’s tiny routing head, the Conductor is a full language model (~7B) that actually writes. It reads the query and emits a whole workflow as natural language: a list of subtasks, the worker for each, and an access list naming which earlier outputs that worker may read.

Query
Conductor LM≈7B · RL-trained · writes the plan
[subtasks] [workers] [access]
executed as the topology ↓
Geministep 0Opusstep 1GPTstep 2 · out
Gemini 3.1 Prostep 0 · no inputs

Attempt: derive the invariant

Claude Opus 4.8step 1 · no inputs

Attempt: derive the invariant

GPT-5.5step 2 · reads step 0, 1

Resolve the disagreement by re-deriving the spectral numbers; synthesize

Two workers attempt the problem independently; a domain-matched aggregator resolves them (a Calabi–Yau invariant from Humanity’s Last Exam).

And the worker pool is yours to constrain. By training on randomized subsets of workers, the Conductor learns to make the most of whatever models you allow, which helps when you want to exclude a provider for cost or compliance. Restricted to open-source models only, the fine-tuned Conductor still beats a frontier model it isn’t even allowed to use:

“when evaluated with only open models, the finetuned Conductor is able to effectively combine their individually weaker capabilities with surprising effectiveness, even consistently outperforming Claude Sonnet 4 by almost 10% within our constrained setting.” (the Conductor)

That is evidence that the gains come from coordination, not just from riding on the strongest weights.

The strategies that emerge

Nothing above hand-codes “use a debate tree for trivia.” These patterns emerge from reward maximization, and the technical report’s qualitative analysis reads like a set of habits a good engineering manager would develop. Adaptive aggregation: on a niche trivia question Fugu-Ultra built a tree with Gemini synthesizing two attempts, but on a math question it swapped GPT into the same root role:

“dynamic adaptation of an aggregator role is precisely the kind of adaptation unavailable to existing multi-agent systems, which necessitate a fixed model to always act as a final synthesizer.” (Sakana Fugu Technical Report)

Build-and-debug: GPT builds, Opus reviews. On one Terminal Bench task, after GPT built a PyPI server, “Opus was then deployed to enumerate all concrete risks in GPT’s implementation,” caught three real bugs and a bogus reachability check, and “relaying these findings back to GPT enabled GPT to complete the build successfully.” Bring in a specialist: for a cryptanalysis problem Fugu-Ultra had Opus write the attack, then tasked GPT “to act specifically as a math specialist to re-derive the entire attack from first principles.” Different jobs, different experts, decided per query.

Does it work?

Orchestration alone is enough to match or beat every individual worker in the pool, and to trade blows with frontier models that aren’t even in the pool. Browse the benchmarks:

ResultsTechnical Report, Table 1

Through orchestration alone, the Fugu models match or beat every worker in their own pool. Pick a benchmark; bars are Fugu vs the three frontier workers it routes between.

Fugu-Ultra
73.7
Fugu
59.0
Claude Opus 4.8
69.2
Gemini 3.1 Pro
54.2
GPT-5.5
58.6

Bars start at 46 to make the gaps legible · highest score outlined · provider-reported baselines.

It generalizes past static benchmarks too. On Karpathy’s AutoResearch, an agent autonomously tuning a GPT training recipe over 123 experiments, Fugu-Ultra finished ahead of all three frontier baselines:

“Fugu-Ultra achieves the lowest mean validation BPB (0.9774 ± 0.0019)… an initial demonstration that multi-model orchestration can outperform any individual frontier agent on an agentic training-optimization benchmark.” (Sakana Fugu Technical Report)

The bigger bet is architectural: because Fugu “composes models at the behavioral level rather than the parameter level, it can incorporate new worker models as they become available.” Adding one needs no access to its weights, only its API. The orchestrator itself still has to be retrained to learn the newcomer’s strengths (the head emits one score per worker, and its labels come from actually running each worker on tasks), but that is a roughly two-week turnaround per new frontier model, not a fresh foundation-model run. If it holds, frontier capability stops being something only the largest training runs can buy.

References

  1. Sakana AI (2026). Sakana Fugu Technical Report · product page · github.com/SakanaAI/fugu
  2. S. Nielsen, E. Cetin, P. Schwendeman, et al. (2026). Learning to Orchestrate Agents in Natural Language with the Conductor. ICLR 2026. arXiv:2512.04388
  3. J. Xu, Q. Sun, P. Schwendeman, et al. (2026). TRINITY: An Evolved LLM Coordinator. ICLR 2026. arXiv:2512.04695