My multi-agent pipeline was failing at random. Not always, not predictably — just often enough to make me stop trusting it. Worker-2 would run, write its output, and then nothing would happen. The orchestrator was sitting there waiting for an announce that never arrived. The bug already had a ticket number: #17000. Description: hardcoded 60-second timeout, no retry. I’d built the entire coordination model on message delivery, and message delivery was the single point of failure. The fix wasn’t more retries. It was getting rid of message-based coordination entirely.
The Old Pattern and Why It Broke
The original approach was simple: spawn worker-1, wait for it to announce completion, spawn worker-2, wait for announce, spawn worker-3. Clean, readable, easy to reason about. It also failed under any real-world condition.
The announce system in OpenClaw has a 60-second delivery window. If the gateway is under load, if there’s a transient network issue, if the announce just gets dropped — your orchestrator is stalled indefinitely. It sits in a waiting state with no way to know whether the worker finished successfully, finished and the announce was lost, or actually crashed. There’s no retry mechanism. There’s no fallback. The main session has no way to distinguish “worker is still running” from “announce was lost three minutes ago.”
I hit this pattern enough times that I started logging it. Roughly 20-30% of announce deliveries failed under normal load. That’s not a bug you work around with patience. That’s a design assumption that doesn’t hold.
Distributed Systems Problems I Rediscovered the Hard Way
Building multi-agent systems means independently rediscovering everything microservices engineers figured out in 2015. I ran into all of it.
Race conditions when two workers write to the same output location. Context loss when an announce arrives out of order and the orchestrator can’t reconstruct state. Coordinator overhead — when the orchestrator itself is a sub-agent (depth-2 pattern), it has its own lifecycle problems. In OpenClaw, bug #18043 documents this: depth-2 orchestrators terminate prematurely and lose their announce chains. Meaning: the orchestrator agent finishes before it has processed all results from the workers it spawned. You think you have a pipeline. You actually have a ticking clock.
The debugging tax was the worst part. When something goes wrong in a sequential announce-based pipeline, you spend time answering: did the worker crash, did the announce drop, did the orchestrator miss it, or is it still running? A failure that takes 30 seconds to occur takes 20 minutes to diagnose.
The Spawn-All-Wait Pattern
The solution was conceptually simple and felt slightly absurd in practice: spawn all workers in a single turn, and have sequential workers coordinate via the filesystem instead of via messages.
Here’s what it looks like. The main session spawns every worker — parallel and sequential — in one shot. Parallel workers start immediately. Sequential workers that need output from a previous worker start by executing a bash wait loop:
```shell
for i in $(seq 1 60); do
  [ -f /path/to/pipeline-dir/worker-1.md ] && echo 'INPUT_READY' && break
  echo "Waiting... $i"
  sleep 5
done
[ -f /path/to/pipeline-dir/worker-1.md ] || echo 'TIMEOUT'
That’s it. The worker polls every 5 seconds for up to 5 minutes. When the file appears, it reads it and starts working. When it finishes, it writes its own output file. The next worker in the chain finds it the same way.
The main session’s job is reduced to: spawn everything, tell the user “pipeline running, N workers active,” and wait. No intermediate actions required. No processing announces as triggers. The chain runs itself through the filesystem.
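The whole shape can be sketched with plain shell processes standing in for sub-agents. This is a minimal sketch, not OpenClaw code: the pipeline directory, the one-second "work", and the file names are all illustrative. The point is the structure — everything launches at once, and the dependent worker blocks on a file, not a message.

```shell
#!/bin/sh
# Spawn-all-wait, sketched with background processes as stand-in workers.
PIPE_DIR=$(mktemp -d)

# Parallel workers: no dependencies, start producing immediately.
( sleep 1; echo "research A" > "$PIPE_DIR/worker-1.md" ) &
( sleep 1; echo "research B" > "$PIPE_DIR/worker-2.md" ) &

# Sequential worker: spawned in the same turn, but blocks on worker-1's file.
(
  for i in $(seq 1 60); do
    [ -f "$PIPE_DIR/worker-1.md" ] && break
    sleep 5
  done
  # Only proceed if the input actually appeared.
  [ -f "$PIPE_DIR/worker-1.md" ] && \
    cat "$PIPE_DIR/worker-1.md" > "$PIPE_DIR/worker-3.md"
) &

wait   # the main session has nothing left to do but wait
```

The main session never touches the handoff; the dependency edge lives entirely in the filesystem.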
Worker timeouts are set accordingly: 180 seconds for parallel workers with no dependencies, 360 seconds for sequential workers (5 minutes of possible waiting plus 1 minute of actual work).
Filesystem Handoff vs. Message-Based Handoff
The practical difference comes down to one property: a file either exists or it doesn’t. There’s no delivery window, no retry budget, no 60-second timeout. If worker-1.md is there, the next worker reads it and continues. If it’s not there after 5 minutes, the worker times out and reports TIMEOUT — which is a signal, not a silent failure.
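One caveat worth making explicit: "a file either exists or it doesn’t" only holds if writers never expose a half-written file. A reader polling every 5 seconds could catch a file mid-write. The standard fix, sketched here with illustrative paths, is write-then-rename — rename is atomic on the same filesystem, so the target name only ever appears fully written:

```shell
#!/bin/sh
# Write to a temp name, then rename into place. The reader either sees
# nothing or sees the complete file, never a partial one.
PIPE_DIR=$(mktemp -d)
printf '%s\n' "worker output" > "$PIPE_DIR/worker-1.md.tmp"
mv "$PIPE_DIR/worker-1.md.tmp" "$PIPE_DIR/worker-1.md"
```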
Compare this to the announce model. An announce either arrives within 60 seconds or it’s gone. There’s no way to request it again. There’s no persistent record that the orchestrator can check on startup. If the main session restarts after a crash, it has no idea what state the pipeline was in. With filesystem handoff, it can check which worker files exist and reconstruct state immediately.
Debugging is also qualitatively different. With the old model, I’d run a pipeline, wait 10 minutes, and then start trying to figure out what happened. With filesystem handoff, I open a terminal, run ls pipeline-tmp/rw-1827/ and immediately see which workers completed. The files are the state. The state is visible.
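Because the files are the state, reconstructing a pipeline after a restart is a loop over expected outputs. A sketch, with hypothetical worker names and a temp directory standing in for the real pipeline dir:

```shell
#!/bin/sh
# Rebuild pipeline state from which worker files exist.
PIPE_DIR=$(mktemp -d)
touch "$PIPE_DIR/worker-1.md" "$PIPE_DIR/worker-2.md"   # simulate two finished workers

STATE=""
for w in worker-1 worker-2 worker-3; do
  if [ -f "$PIPE_DIR/$w.md" ]; then
    STATE="$STATE $w:done"
  else
    STATE="$STATE $w:pending"
  fi
done
echo "$STATE"
```

The announce model has no equivalent of this loop — a dropped announce leaves nothing behind to enumerate.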
There’s one real constraint: because of bug #10334 (concurrent announces can deadlock the gateway), I cap parallel workers at 4. This isn’t a filesystem limitation — it’s a gateway limitation that applies regardless of coordination method. I plan around it.
The Terminal Worker and No Double Send
One worker in every pipeline is different: the terminal worker. Its job is to read all previous worker outputs, synthesize a final result, and deliver it to the user. It’s the only worker that’s allowed to call the message tool. All other workers write files and stay silent.
This exists because of the double-send problem. If a worker sends to Matrix and then the main session also sends the same content via announce processing, the user gets the message twice. The rule is simple: one delivery path, enforced by convention. Every worker except the last one is file-only. The last one sends, then writes MATRIX_SENT in its announce response.
When the main session sees MATRIX_SENT in an announce, it does nothing — the terminal worker already delivered. If the announce doesn’t contain MATRIX_SENT, the main session interprets it as a mid-pipeline announce and just notes the progress.
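The branching is trivial by design. A sketch of the decision — `handle_announce` is not an OpenClaw hook, just an illustration of the convention; only the `MATRIX_SENT` marker comes from the actual setup:

```shell
#!/bin/sh
# One delivery path: if the terminal worker already sent, do nothing.
handle_announce() {
  body="$1"
  case "$body" in
    *MATRIX_SENT*) echo "terminal worker delivered; nothing to do" ;;
    *)             echo "mid-pipeline progress: $body" ;;
  esac
}

handle_announce "critic pass complete"
handle_announce "final draft sent. MATRIX_SENT"
```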
The heartbeat watchdog covers the edge case: if worker files exist but no sub-agents are currently running and the result hasn’t been delivered, the main session synthesizes and sends itself. It’s a fallback I’ve needed twice. Both times it saved what would have been a completely silent failure.
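The watchdog condition itself is three checks ANDed together. A sketch under stated assumptions: the `.delivered` marker, the file names, and the `RUNNING` flag (a stand-in for however you detect live sub-agents) are all hypothetical:

```shell
#!/bin/sh
# Watchdog: worker output exists, no sub-agents alive, nothing delivered
# yet -> synthesize and deliver from the main session as a fallback.
PIPE_DIR=$(mktemp -d)
touch "$PIPE_DIR/worker-1.md" "$PIPE_DIR/worker-2.md"
RUNNING=0   # stand-in for "any sub-agents still active?"

if [ -f "$PIPE_DIR/worker-1.md" ] && [ "$RUNNING" -eq 0 ] \
   && [ ! -f "$PIPE_DIR/.delivered" ]; then
  cat "$PIPE_DIR"/worker-*.md > "$PIPE_DIR/final.md"   # synthesize ourselves
  touch "$PIPE_DIR/.delivered"                         # suppress double delivery
fi
```

The marker file matters: without it, a watchdog that fires while the terminal worker is merely slow would recreate the double-send problem it exists to prevent.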
What I Measured and What Still Hurts
In a typical write pipeline — researcher, creator, critic running sequentially — the old model took around 6 minutes plus announce latency plus the overhead of me watching and intervening. The new model runs in about 4 minutes with no intervention required. Parallel research phases (two workers running simultaneously) finish in around 2 minutes. Sequential synthesis adds another 2. Total: 4 minutes, unattended.
Three bugs are still open. #17000 (announce timeout, no retry) is the root cause of everything described here — the workaround works, but the bug remains. #10334 (concurrent announce deadlock) caps parallelism at 4. #18043 (depth-2 orchestrator termination) means I can’t delegate orchestration to a sub-agent — the main session has to stay in the loop.
None of these bugs touch what the pattern can’t fix: hallucination rates, token cost per pipeline, or the fact that MCP and A2A protocol standardization is still immature. The pipeline coordinates reliably. What each worker does with its context is a separate problem.
Closing
If you’re building multi-agent pipelines and coordinating through message delivery, you’re one network blip away from a stalled orchestrator and a silent failure. The Spawn-All-Wait pattern isn’t elegant — a bash polling loop inside an LLM prompt is not how anyone imagined this going. But it’s the thing that actually works in production, today, with the infrastructure that exists.
The files are always there. The announces sometimes aren’t.
If you’ve run into similar issues with LangChain, CrewAI, or your own orchestration layer, I’d genuinely like to compare notes. These patterns came from real failures — not from a whitepaper — and they’ll keep evolving as the tooling matures. MCP and A2A will change the picture, probably by late 2026. Until then: write to files, not messages.