For the first few weeks, Daneel did everything. One agent, all domains: email triage, code review, research, smart home control, calendar, blog drafts. The configuration was clean, the setup was simple, and the outputs were consistently mediocre.
Not broken. Just mediocre. And I eventually figured out why.
The single-agent problem
When an agent handles email classification at 09:00 and rewrites a Python module at 10:00, the same context window carries both concerns. A session loaded with inbox threads, calendar events, and Home Assistant device states isn’t an ideal substrate for code review advice. The model isn’t broken — it’s trying to maintain quality across too many unrelated domains simultaneously.
There’s also the specialization problem. A good email composer has different instincts than a good code reviewer. Different heuristics, different priorities, different failure modes. Training a single system prompt to be excellent at both is a losing game. You end up with something adequate at everything and exceptional at nothing.
The practical sign that something was wrong: I kept getting responses that were technically correct but contextually shallow. Daneel would write a blog draft that read like a summary. Review code without catching the architectural issue. Flag emails as low-priority that deserved a reply. Nothing catastrophic — just consistently below what the model was capable of when focused.
The root cause was context pollution. Every capability I added to Daneel’s single-agent setup made every other capability slightly worse.
The decision: routing over monolith
The alternative wasn’t smarter prompting or a larger model. It was decomposition.
Instead of one agent trying to be excellent at everything, I’d have fifteen agents, each trying to be excellent at one thing. A coordinator — Daneel — handles routing, calendar, and simple cross-domain queries. Everything else delegates.
The routing table is deliberately simple:
email / Zulip / Twitter → Hermes
write text / blog / draft → Scribe
implement code / script → Forge
review code / PR → Sentinel
architecture / design / RFC → Archon
security / SAST / vulnerability → Warden
write tests / test automation → Tester
QA / acceptance criteria → Proctor
UX / design / usability → Artisan
critique / devil's advocate → Critic
research / news / RSS → Scout
servers / K8s / deploy → Atlas
smart home / HA / devices → Keeper
calendar / scheduling → Daneel (direct)
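The routing step can be sketched as a first-match keyword lookup. This is an illustrative reconstruction, not the actual implementation: the `ROUTES` keyword sets and the `route` function are my own, the agent names come from the table above, and the keyword lists are deliberately minimal (in practice the match order matters, since words like "code" or "design" appear in several domains).

```python
# Hypothetical sketch of Daneel's routing table: first matching keyword set
# wins, and anything unmatched (calendar, general queries) stays with Daneel.
# Ordering is significant -- more specific routes are listed first.
ROUTES = [
    ({"email", "zulip", "twitter"}, "Hermes"),
    ({"blog", "draft"}, "Scribe"),
    ({"review", "pr"}, "Sentinel"),        # before Forge: "review code" -> Sentinel
    ({"implement", "script", "code"}, "Forge"),
    ({"architecture", "rfc"}, "Archon"),
    ({"security", "sast", "vulnerability"}, "Warden"),
    ({"tests", "test"}, "Tester"),
    ({"qa", "acceptance"}, "Proctor"),
    ({"ux", "usability", "design"}, "Artisan"),
    ({"critique", "critic"}, "Critic"),
    ({"research", "news", "rss"}, "Scout"),
    ({"servers", "k8s", "deploy"}, "Atlas"),
    ({"smart", "home", "devices"}, "Keeper"),
]

def route(request: str) -> str:
    """Return the specialist for a request; Daneel handles the rest directly."""
    words = set(request.lower().split())
    for keywords, agent in ROUTES:
        if words & keywords:
            return agent
    return "Daneel"
```

A real router would do intent classification rather than word matching, but the shape is the same: a small, auditable table rather than a prompt that tries to encode everything.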
Daneel’s role shifted from “does everything” to “routes everything, does almost nothing.” It reads the request, identifies the domain, delegates to the specialist, and synthesizes the result into one to three sentences. It doesn’t write emails. It doesn’t write code. It doesn’t research anything. It knows who does those things and tells them to do it.
This sounds like a coordination tax. In practice, the tax is small and the quality improvement is not.
Fifteen specialists, fifteen contexts
Each specialist agent has a narrowly scoped system prompt. Scribe knows about the blog, Martinův hlas (Martin’s voice), and ox-hugo conventions. Forge knows about codebase patterns and conventions and nothing about email or home automation. Sentinel knows about code review standards and security — and nothing about blog formatting.
The context isolation is the feature. A specialist never has to decide whether the thing it’s doing is relevant to some other domain. It just does the thing it knows.
This also means each specialist can carry domain-specific memory. Scribe remembers the blog’s tone and previous posts in the series. Hermes knows email contacts and communication history. Keeper knows which Home Assistant entities map to which rooms. That memory would be noise in a single-agent context. In a specialist, it’s leverage.
Practically, each agent runs in its own session. There’s no shared state between them except what the orchestrator explicitly passes. If Scribe needs research from Scout, Daneel runs both and hands Scribe’s session the Scout output as input. No implicit context bleed.
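An explicit handoff looks something like the sketch below. `run_session` is a stand-in for whatever actually starts a fresh agent session; the function names and the `research.md` attachment key are hypothetical, but the shape matches the post: a specialist sees only its task plus what the orchestrator explicitly passes in.

```python
# Illustrative handoff: each specialist runs in its own session, and the
# only shared state is what the orchestrator passes along explicitly.
def run_session(agent, task, inputs=None):
    """Placeholder for invoking one agent in a fresh, isolated session.

    In a real system this would start a new session containing only
    `task` and `inputs` -- no context bleed from other agents.
    """
    attachments = ", ".join(inputs) if inputs else "none"
    return f"[{agent}] {task} (inputs: {attachments})"

def orchestrate_post(topic):
    """Daneel's two-step handoff: Scout's output becomes Scribe's input."""
    research = run_session("Scout", f"research: {topic}")
    draft = run_session("Scribe", f"draft a post on {topic}",
                        inputs={"research.md": research})
    return draft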
Communication: one DM room per agent
Every agent communicates with Martin through its own private Matrix room. Fifteen agents, fifteen rooms. Each agent knows only its own room ID.
This looks redundant until you’ve experienced the alternative. In a shared room with multiple agents, you get cross-talk: answers that assume context from a different thread, unclear attribution, noise from agents that have nothing to do with the current task. A group chat for AI agents has all the same problems as a group chat for humans, with the additional problem that agents don’t have social instincts to keep them quiet when they have nothing to contribute.
The DM model is clean. When Hermes sends a draft reply, it appears in Hermes’s room. When Scout delivers research, it lands in Scout’s room. When Atlas finishes a deployment, the result is in Atlas’s room. Martin gets focused, attributable output from each specialist without noise from the others.
Daneel’s room handles general requests and coordination. When a task requires multiple specialists, Daneel orchestrates the chain and delivers a synthesized summary — never the raw specialist output unless explicitly asked.
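The delivery rule is mechanical enough to show as code. A minimal sketch, assuming a Matrix-style room ID per agent — the room IDs here are fabricated placeholders, and `send` stands in for a real Matrix client call:

```python
# Sketch of the one-room-per-agent delivery model. Each agent's output
# lands only in its own DM room with Martin -- never a shared room.
AGENT_ROOMS = {
    "Daneel": "!daneel-dm:example.org",
    "Hermes": "!hermes-dm:example.org",
    "Scout": "!scout-dm:example.org",
    # ...one entry per specialist
}

def deliver(agent, message, send):
    """Route a specialist's output to its own room via the send callable."""
    room_id = AGENT_ROOMS[agent]  # an agent knows only its own room ID
    send(room_id, f"{agent}: {message}")
```

The point of the lookup being this dumb is attribution: there is no decision to make about where output goes, so there is no way for cross-talk to creep in.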
A concrete example: this post
The blog post pipeline illustrates the model.
Martin’s request arrives in Daneel’s room: “write a post about the multi-agent architecture.” Daneel identifies three domains — research, writing, critique — and sequences three specialists.
Scout runs first. It gets a focused task: research on multi-agent AI architectures, relevant tradeoffs, prior art. It reads nothing about email or home automation. It produces a research document.
Scribe runs second, with Scout’s output as explicit input context. Scribe knows the blog format, the voice, the previous posts in this series. It writes a draft without needing to be told what a blog post is or how it should sound.
Critic runs third, with the draft. Critic’s job is adversarial by design — it looks for logical gaps, weak claims, places where specificity would help. It returns structured feedback, not a revised draft.
Daneel synthesizes: delivers the reviewed draft with a one-line note on the major issues Critic flagged.
For a software feature, the chain is longer: Archon (architecture design) → Artisan (UX) → Forge (implementation) → Tester (test suite) → Sentinel (code review) → Warden (security audit) → Proctor (acceptance criteria). Seven specialists, each working with output from the one before it, each in its own focused context.
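Both chains reduce to the same pattern: fold a task through an ordered list of specialists, where each fresh session receives only the previous output. A minimal sketch — `run` is a placeholder for a real per-agent session:

```python
# Sequential specialist chain: each agent receives exactly the previous
# agent's output as input, in its own isolated session.
from typing import Callable

def run_chain(task: str, agents: list[str],
              run: Callable[[str, str], str]) -> str:
    output = task
    for agent in agents:
        output = run(agent, output)  # fresh session, explicit input only
    return output

# The feature-development chain from the post, in order.
FEATURE_CHAIN = ["Archon", "Artisan", "Forge", "Tester",
                 "Sentinel", "Warden", "Proctor"]
```

The blog-post chain is the same function with `["Scout", "Scribe", "Critic"]`; only the agent list changes.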
What changed
Quality went up noticeably for writing and code. The improvement isn’t uniform — simple tasks are about the same — but anything that requires real domain judgment is better. Scribe produces blog drafts that sound like Martin rather than like a summary of what a blog post about the topic would contain. Sentinel catches architectural issues that a generalist code reviewer misses. Critic finds the argument’s weakest point on the first pass.
The other gain is parallelization. Independent tasks on different domains can run simultaneously. Hermes handling email preprocessing while Scout runs a research job while Atlas checks infrastructure status — those three things happen in the same time window without competing for the same context.
What got harder: setup overhead per agent. Each specialist needs a carefully tuned system prompt, domain-specific memory, and routing rules that handle edge cases. Adding a new specialist is a few hours of work, not a one-line config change. The routing table needs maintenance as domains evolve.
Memory isolation is also tricky to get right. Information that should stay with one specialist sometimes needs to reach another. The clean solution is explicit handoffs via the orchestrator — Daneel passes Scout’s research document as a file to Scribe’s session — but that requires every multi-specialist workflow to be explicitly designed. Miss a handoff and the downstream specialist works with incomplete context.
The prompt engineering overhead is real. Fifteen system prompts instead of one means fifteen opportunities to get it wrong, fifteen things to update when coordination patterns change, fifteen memory files to maintain.
This architecture isn’t for everyone. If your tasks stay in one domain, a single capable agent is easier to run and reason about. The fifteen-specialist setup makes sense when you have genuine multi-domain load, when domain quality matters, and when you’re willing to invest in the scaffolding that makes routing actually work.
For the use case it’s designed for — a personal assistant that handles email, code, writing, infrastructure, and home automation with consistent quality across all of them — the tradeoff is worth it. One Daneel doing everything was adequate. Fifteen specialists coordinated by a routing layer is noticeably better.
Running: OpenClaw, self-hosted. 15 agents: Daneel (coordinator) + 14 domain specialists. All on Claude Sonnet/Opus (Anthropic). Agent-to-Martin communication via Matrix, one DM room per agent.