The Problem We Had
Our first multi-agent pipeline was a disaster waiting to happen. The architecture seemed clean: spawn workers, each does its thing, updates a shared `status.json` to record completion, and if it’s the last one in its phase, spawns the next batch. Workers know the workflow, workers drive progress. What could go wrong?
Plenty.
The race condition was textbook. Two parallel research workers — `researcher-a` and `researcher-b` — finish around the same time. At `t=0`, both read `status.json`. Both see themselves as the last remaining worker. At `t=1`, both write back with themselves marked completed. One write wins. The other is silently lost. The “winning” worker sees only its own completion, decides the phase isn’t done, and does nothing. The pipeline stalls. No error. No timeout for another ten minutes. Just silence.
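The lost update is easy to reproduce without any agents at all. A minimal sketch, with hypothetical worker names, where both writers start from the same stale snapshot:

```python
import json
import os
import tempfile

def read_status(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def write_completed(path: str, snapshot: dict, worker: str) -> None:
    # Read-modify-write from a stale snapshot: the classic lost update.
    updated = dict(snapshot, **{worker: "completed"})
    with open(path, "w") as f:
        json.dump(updated, f)

path = os.path.join(tempfile.mkdtemp(), "status.json")
with open(path, "w") as f:
    json.dump({"researcher-a": "running", "researcher-b": "running"}, f)

snap_a = read_status(path)  # t=0: researcher-a reads
snap_b = read_status(path)  # t=0: researcher-b reads the same state
write_completed(path, snap_a, "researcher-a")  # t=1: a writes
write_completed(path, snap_b, "researcher-b")  # t=1: b overwrites a's write

final = read_status(path)
# final still shows researcher-a as "running" — its completion was lost
```

No locks, no crash, no error message: the second write simply erases the first.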
That was the obvious failure. The subtle one was worse: state trapped in the agent’s context window.
When a worker gets killed mid-task — OOM, timeout, platform restart — the in-progress state dies with it. Nothing in `status.json` says “this worker was halfway through step 3 of 7.” There’s no way to resume. You either restart the whole pipeline or manually reconstruct what happened from logs.
We looked at alternatives. LangChain and LangGraph are elegant for small pipelines, but their state lives in memory — restart the process and you start over. CrewAI puts LLM reasoning in the control plane: agents decide what to do next, which sounds powerful until you realize your orchestration is non-deterministic. AutoGen is similar — control flow emerges from conversation, making it genuinely hard to reason about edge cases. Prefect and Airflow are solid but not built for LLM agent workflows. None gave us what we needed: a simple, external, inspectable state machine that survives restarts and eliminates race conditions by construction.
So we built one.
What FSA Actually Is
A finite state automaton formalizes something you already know: a system with a fixed set of states, a fixed set of events, and a table mapping (state, event) → next state + action.
Think of a traffic light. Three states: RED, YELLOW, GREEN. Deterministic transitions: GREEN → timer expires → YELLOW → timer expires → RED → timer expires → GREEN. No traffic light “decides” anything. It doesn’t reason about traffic density or consult a language model. It reads its current state, checks which event fired, looks up the table, and acts.
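The traffic light fits in a few lines as a literal lookup table (a sketch; the state and event names are just the ones from the analogy above):

```python
# A traffic light as a transition table: (state, event) -> next state.
# Purely mechanical: no reasoning, just a dictionary lookup.
TRANSITIONS = {
    ("GREEN", "timer"): "YELLOW",
    ("YELLOW", "timer"): "RED",
    ("RED", "timer"): "GREEN",
}

def step(state: str, event: str) -> str:
    return TRANSITIONS[(state, event)]

state = "GREEN"
for _ in range(3):
    state = step(state, "timer")
# Three timer events bring the light back to GREEN
```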
That’s the key insight: the orchestrator has no opinions. It reads `(current_state + event)`, looks up the table, and executes the action. The intelligence lives in the table definition, written by humans at design time. Runtime execution is mechanical.
For multi-agent pipelines, this translates directly. “States” are phase statuses: `pending`, `running`, `completed`, `failed`, `paused`. “Events” are things like “worker output file appeared” or “timeout exceeded.” The “table” is a decision matrix the orchestrator consults on every tick. No LLM in the loop. No ambiguity.
The New Architecture
The redesigned system has exactly three components:
`workflows.json` — static definition. Describes every pipeline type: phases, ordering (sequential or parallel), workers per phase, models, timeouts, and input file dependencies. Never changes at runtime. It’s the blueprint.
`status.json` — runtime state. One file per pipeline run, created at launch, updated only by the orchestrator (main session). Tracks current phase, worker statuses, session IDs, retry counts, and delivery state. This is the single source of truth.
Workers — pure executors. A worker receives a task prompt with the topic, input files, and an explicit output path. It does its work, writes the output file, and exits. That’s the entire contract. Workers never touch `status.json`. Workers never spawn other workers. Workers don’t know what phase they’re in or what comes next.
The orchestrator runs a reconciliation loop on every trigger — worker completion announce, heartbeat, user message. Each time, it does the same thing: check which output files exist, update `status.json` to reflect detected completions, then consult the decision table:
| State | Action |
|---|---|
| All workers done + next pending | Spawn next phase workers |
| All workers done + `pause_after` | Summarize to user, wait |
| Final phase completed | Deliver `final.md` to user, archive |
| Phase running > timeout + 120s | Mark failed, notify user |
| Phase running, within limit | Wait (nothing to do) |
| `result_delivered: true` | Archive |
File existence as completion signal is the key to idempotency. The orchestrator doesn’t rely on receiving a message from the worker. It checks: does `researcher-a.md` exist? If yes, that worker is done — regardless of what `status.json` currently says. You can kill and restart the orchestrator at any point; it will reconstruct correct state from the filesystem. No lost updates. No ghost workers.
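One tick of that loop can be sketched in plain Python. The status layout mirrors the `status.json` examples in this post; the action strings and the `elapsed`/`timeout` parameters are illustrative, not a real API:

```python
import os
import tempfile

def reconcile(status: dict, phase_dir: str, elapsed: float, timeout: float) -> str:
    """Derive completions from the filesystem, then consult the decision table."""
    phase = status["phases"][status["current_phase"]]
    # File existence is the completion signal: idempotent and restart-safe.
    for role, worker in phase["workers"].items():
        if os.path.exists(os.path.join(phase_dir, f"{role}.md")):
            worker["status"] = "completed"
    if all(w["status"] == "completed" for w in phase["workers"].values()):
        phase["status"] = "completed"
        last = status["current_phase"] == len(status["phases"]) - 1
        return "deliver_and_archive" if last else "spawn_next_phase"
    if elapsed > timeout + 120:
        phase["status"] = "failed"
        return "mark_failed_notify_user"
    return "wait"

run_dir = tempfile.mkdtemp()
status = {
    "current_phase": 0,
    "phases": [
        {"id": "collect", "status": "running",
         "workers": {"researcher-a": {"status": "running"},
                     "researcher-b": {"status": "running"}}},
        {"id": "synthesis", "status": "pending",
         "workers": {"synthesizer": {"status": "pending"}}},
    ],
}
first = reconcile(status, run_dir, elapsed=10, timeout=600)   # no output files yet
for role in ("researcher-a", "researcher-b"):
    open(os.path.join(run_dir, f"{role}.md"), "w").close()   # workers "finish"
second = reconcile(status, run_dir, elapsed=20, timeout=600)  # both files exist
```

Because completions are derived from files rather than messages, calling `reconcile` twice in a row is harmless: the second call reconstructs the same state.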
Concrete Example: Research Pipeline
Here’s a real pipeline definition — two parallel researchers followed by a synthesis pass:
```json
{
  "research": {
    "description": "Pure research + analysis",
    "phases": [
      {
        "id": "collect",
        "mode": "parallel",
        "workers": [
          { "role": "researcher-a", "model": "sonnet", "timeout": 600, "task": "Research perspective A: main sources, facts, current state" },
          { "role": "researcher-b", "model": "sonnet", "timeout": 600, "task": "Research perspective B: alternative views, criticism, edge cases" }
        ]
      },
      {
        "id": "synthesis",
        "mode": "sequential",
        "workers": [
          { "role": "synthesizer", "model": "opus", "timeout": 420, "final": true, "reads": ["researcher-a.md", "researcher-b.md"], "task": "Synthesize research from both researchers" }
        ]
      }
    ]
  }
}
```
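The definition is consumed mechanically. A small sketch of parsing it, assuming (this is a convention of the sketch, not stated in the definition) that a phase's deadline is the longest worker timeout in it:

```python
import json

# A trimmed copy of the definition above, parsed the way an orchestrator would.
WORKFLOWS = json.loads("""
{ "research": { "phases": [
    { "id": "collect", "mode": "parallel",
      "workers": [ { "role": "researcher-a", "timeout": 600 },
                   { "role": "researcher-b", "timeout": 600 } ] },
    { "id": "synthesis", "mode": "sequential",
      "workers": [ { "role": "synthesizer", "timeout": 420, "final": true,
                     "reads": ["researcher-a.md", "researcher-b.md"] } ] }
] } }
""")

def phase_timeout(phase: dict) -> int:
    # Assumption: the phase deadline is the longest worker timeout in the phase.
    return max(w["timeout"] for w in phase["workers"])

phases = WORKFLOWS["research"]["phases"]
```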
The Walkthrough
Step 1. User triggers `/pipeline research FSA architecture`. Orchestrator reads `workflows.json`, creates `pipeline-tmp/research-180141/`, initializes `status.json`:
```json
{
  "pipeline": "research", "dir": "research-180141", "topic": "FSA architecture",
  "current_phase": 0, "retry_count": 0,
  "phases": [
    { "id": "collect", "status": "running", "workers": {
      "researcher-a": { "status": "running", "session": "agent:main:subagent:abc123" },
      "researcher-b": { "status": "running", "session": "agent:main:subagent:def456" }
    }},
    { "id": "synthesis", "status": "pending", "workers": {
      "synthesizer": { "status": "pending", "session": "" }
    }}
  ],
  "result_delivered": false
}
```
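Building that initial state from a workflow definition is a pure function of the blueprint. A sketch (session IDs are filled in as workers are actually spawned, so they start empty here):

```python
def init_status(pipeline: str, topic: str, run_dir: str, workflow: dict) -> dict:
    """Build initial status.json contents: phase 0 running, the rest pending."""
    phases = []
    for i, p in enumerate(workflow["phases"]):
        state = "running" if i == 0 else "pending"
        phases.append({
            "id": p["id"],
            "status": state,
            "workers": {w["role"]: {"status": state, "session": ""}
                        for w in p["workers"]},
        })
    return {
        "pipeline": pipeline, "dir": run_dir, "topic": topic,
        "current_phase": 0, "retry_count": 0,
        "phases": phases, "result_delivered": False,
    }

workflow = {"phases": [
    {"id": "collect", "workers": [{"role": "researcher-a"}, {"role": "researcher-b"}]},
    {"id": "synthesis", "workers": [{"role": "synthesizer"}]},
]}
status = init_status("research", "FSA architecture", "research-180141", workflow)
```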
Step 2. Orchestrator spawns `researcher-a` and `researcher-b` in parallel. Both get a task prompt with an explicit output path. The orchestrator tells the user: “Pipeline running, 2 workers in phase 1.”
Step 3. `researcher-a` finishes first. Writes `researcher-a.md` and exits.
Step 4. Orchestrator trigger fires. Reconcile checks the filesystem, sees `researcher-a.md`, updates status:
```json
{
  "current_phase": 0,
  "phases": [
    { "id": "collect", "status": "running", "workers": {
      "researcher-a": { "status": "completed", "session": "agent:main:subagent:abc123" },
      "researcher-b": { "status": "running", "session": "agent:main:subagent:def456" }
    }},
    { "id": "synthesis", "status": "pending", "workers": {
      "synthesizer": { "status": "pending", "session": "" }
    }}
  ]
}
```
Decision table: phase 0 still has a running worker within timeout → Wait.
Step 5. `researcher-b` finishes. Writes `researcher-b.md`, exits.
Step 6. Orchestrator trigger fires. Both output files exist. Updates both workers to `completed`, marks phase 0 `completed`. Decision table: all workers done, next phase pending → Spawn next phase. Spawns `synthesizer` with both research files in its prompt. Updates `status.json`:
```json
{
  "current_phase": 1,
  "phases": [
    { "id": "collect", "status": "completed", "workers": {
      "researcher-a": { "status": "completed", "session": "agent:main:subagent:abc123" },
      "researcher-b": { "status": "completed", "session": "agent:main:subagent:def456" }
    }},
    { "id": "synthesis", "status": "running", "workers": {
      "synthesizer": { "status": "running", "session": "agent:main:subagent:ghi789" }
    }}
  ]
}
```
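The "spawn next phase" transition is a mechanical update to that structure. A sketch, with the status layout taken from the examples above:

```python
def advance_phase(status: dict) -> list[str]:
    """Move to the next phase, mark it running, return the roles to spawn."""
    status["current_phase"] += 1
    phase = status["phases"][status["current_phase"]]
    phase["status"] = "running"
    for worker in phase["workers"].values():
        worker["status"] = "running"
    return list(phase["workers"])

status = {
    "current_phase": 0,
    "phases": [
        {"id": "collect", "status": "completed",
         "workers": {"researcher-a": {"status": "completed"},
                     "researcher-b": {"status": "completed"}}},
        {"id": "synthesis", "status": "pending",
         "workers": {"synthesizer": {"status": "pending"}}},
    ],
}
to_spawn = advance_phase(status)
```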
Step 7. `synthesizer` reads both research files, writes `synthesizer.md`, exits. It has `"final": true` in the workflow definition.
Step 8. Orchestrator detects `synthesizer.md`, phase 1 complete, final phase → Deliver final.md to user, archive. Sends the synthesis to the user. Sets `result_delivered: true`. Moves `pipeline-tmp/research-180141/` to `memory/pipelines/`.
At no point did any worker touch `status.json`. At no point did any worker decide what comes next. Every control decision came from reading state and consulting the table.
Tradeoffs and Limitations
This architecture earns its complexity in production pipelines with predictable structure: content generation, research workflows, code review, multi-stage analysis. Anywhere you’ve been burned by race conditions, lost state on restart, or non-deterministic orchestration — FSA fixes all three by construction.
It’s not the right tool for genuinely dynamic multi-agent conversations where agents negotiate task structure on the fly. If your workflow can’t be expressed as phases + transitions at design time, FSA forces you into contortions. Use something else.
There’s also a rigidity cost. Adding a new pipeline type means editing `workflows.json`, defining phases, specifying worker roles and models. That’s deliberate friction — it forces you to think about structure before you run anything — but it does mean you can’t just say “figure it out” and hope for the best. Every workflow needs to be designed, not discovered.
The pattern demands discipline: workers must respect their contract (write output, exit, touch nothing else). One worker that “helps” by updating `status.json` breaks the single-writer guarantee and reintroduces every race condition you just eliminated. Enforce the contract at the prompt level and audit it at every pipeline change.
Error handling is minimal by design. A failed worker gets marked `failed`, the orchestrator notifies the user, and that’s it. There’s no automatic retry with modified prompts, no fallback to a different model, no sophisticated error recovery. You could build those features on top of the FSA — the decision table is extensible — but out of the box, the system assumes that most failures are better surfaced to a human than papered over by automation.
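As one illustration of that extensibility, a bounded-retry rule is a single extra row in the table. A hedged sketch; `MAX_RETRIES` and the action strings are hypothetical, not part of the system described here:

```python
MAX_RETRIES = 2

def on_worker_failed(status: dict) -> str:
    # Retry a bounded number of times, then fall back to the base behavior.
    if status["retry_count"] < MAX_RETRIES:
        status["retry_count"] += 1
        return "respawn_worker"
    return "mark_failed_notify_user"

status = {"retry_count": 0}
actions = [on_worker_failed(status) for _ in range(3)]
# ["respawn_worker", "respawn_worker", "mark_failed_notify_user"]
```

Because it is still a deterministic table lookup, the retry policy stays as inspectable as the rest of the orchestrator.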
The payoff is a system you can debug by reading two files, resume after any failure, and reason about without running it. In production multi-agent systems, that’s not a nice-to-have. It’s the difference between something you can operate and something that operates you.