The Problem We Had
Our first multi-agent pipeline was a disaster waiting to happen. The architecture seemed clean: spawn workers, each does its thing, updates a shared `status.json` to record completion, and if it’s the last one in its phase, spawns the next batch. Workers know the workflow, workers drive progress. What could go wrong?
Plenty.
The race condition was textbook. Two parallel research workers — `researcher-a` and `researcher-b` — finish around the same time. At `t=0`, both read `status.json`. Both see themselves as the last remaining worker. At `t=1`, both write back with themselves marked completed. One write wins. The other is silently lost. The “winning” worker sees only its own completion, decides the phase isn’t done, and does nothing. The pipeline stalls. No error. No timeout for another ten minutes. Just silence.
[Read More]