<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>K@automation on Martin Sukany</title><link>https://sukany.cz/tags/k@automation/</link><description>Recent content in K@automation on Martin Sukany</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Tue, 14 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://sukany.cz/tags/k@automation/index.xml" rel="self" type="application/rss+xml"/><item><title>Why I Moved from OpenClaw to Hermes</title><link>https://sukany.cz/blog/2026-04-14-openclaw-to-hermes/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-04-14-openclaw-to-hermes/</guid><description>&lt;p&gt;A month ago I thought I had the right answer: split everything into specialists.&lt;/p&gt;
&lt;p&gt;At the peak, my setup had sixteen agents. One for email. One for writing. One for research. One for infrastructure. Several more for code, review, critique, QA, and orchestration. On paper it looked elegant — decomposition, clear ownership, domain-specific memory, explicit routing.&lt;/p&gt;
&lt;p&gt;In practice it gradually became something else: an overengineered system that demanded more maintenance than it returned.&lt;/p&gt;
&lt;p&gt;So I moved the whole thing to Hermes.&lt;/p&gt;
&lt;p&gt;This post is not a generic &amp;ldquo;new framework is better&amp;rdquo; piece. It&amp;rsquo;s what actually changed, what broke in the old model, and the decision rule I&amp;rsquo;d recommend if you&amp;rsquo;re building your own AI setup today.&lt;/p&gt;
&lt;h2 id="what-openclaw-gave-me"&gt;What OpenClaw gave me&lt;/h2&gt;
&lt;p&gt;I want to be fair to OpenClaw, because it solved a real problem before most tools in this space even acknowledged it.&lt;/p&gt;
&lt;p&gt;It gave me three things that mattered:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Persistence beyond one chat window.&lt;/strong&gt; The assistant could remember prior work, not just the current prompt.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A messaging-native interface.&lt;/strong&gt; Matrix, email, scheduled jobs, background work — not just an IDE pane.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A playground for architecture.&lt;/strong&gt; It was easy to experiment with routing, specialists, cron-like workflows, memory layers, and custom coordination patterns.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That mattered. Session-only tools are useful, but they start every day half-amnesic. Even The New Stack&amp;rsquo;s recent comparison between OpenClaw and Hermes framed this as the core shift: from session-bound assistants to persistent agents that actually accumulate working context over time.&lt;/p&gt;
&lt;p&gt;OpenClaw was the first system in my stack that made that future feel real.&lt;/p&gt;
&lt;h2 id="where-it-started-to-fail"&gt;Where it started to fail&lt;/h2&gt;
&lt;p&gt;The problem wasn&amp;rsquo;t that OpenClaw was incapable. The problem was that it made it too easy to build a system whose theoretical power exceeded its operational reliability.&lt;/p&gt;
&lt;p&gt;I kept layering solutions on top of solutions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;more specialists to reduce context pollution&lt;/li&gt;
&lt;li&gt;more routing logic to choose the right specialist&lt;/li&gt;
&lt;li&gt;more handoff rules between agents&lt;/li&gt;
&lt;li&gt;more memory files to keep each agent focused&lt;/li&gt;
&lt;li&gt;more orchestration to recover when a chain stalled&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Eventually the architecture itself became the workload.&lt;/p&gt;
&lt;p&gt;When a task failed, the debugging question was no longer &amp;ldquo;did the model misunderstand the request?&amp;rdquo; It became:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;did the worker fail?&lt;/li&gt;
&lt;li&gt;did the handoff fail?&lt;/li&gt;
&lt;li&gt;did the orchestrator miss the signal?&lt;/li&gt;
&lt;li&gt;did the wrong specialist get selected?&lt;/li&gt;
&lt;li&gt;did the downstream agent lack one specific piece of context the upstream agent had?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s not an AI problem. That&amp;rsquo;s distributed systems tax.&lt;/p&gt;
&lt;p&gt;I wrote earlier about announce-based orchestration failures and the filesystem workaround I ended up using. That workaround worked. But that&amp;rsquo;s also the point: if your personal assistant requires production-grade coordination patterns to stay reliable, you&amp;rsquo;ve crossed from useful complexity into accidental complexity.&lt;/p&gt;
&lt;h2 id="sixteen-agents-one-lesson"&gt;Sixteen agents, one lesson&lt;/h2&gt;
&lt;p&gt;The biggest lesson from the 16-agent phase is not &amp;ldquo;multi-agent is bad.&amp;rdquo; It&amp;rsquo;s more precise than that:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Persistent multi-agent setups are expensive unless the domains are truly independent and high-volume.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I had a specialist for nearly everything because I wanted quality. And yes, in some cases quality improved. Focused writer beats generalist writer. Focused reviewer beats generalist reviewer.&lt;/p&gt;
&lt;p&gt;But over time I noticed something more important.&lt;/p&gt;
&lt;p&gt;Most of my day does &lt;em&gt;not&lt;/em&gt; consist of sixteen independent lanes of work running in parallel. It consists of one human agenda with occasional spikes of specialized work.&lt;/p&gt;
&lt;p&gt;That means the dominant case is not:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;email specialist&lt;/li&gt;
&lt;li&gt;blog specialist&lt;/li&gt;
&lt;li&gt;infrastructure specialist&lt;/li&gt;
&lt;li&gt;code reviewer specialist&lt;/li&gt;
&lt;li&gt;critic specialist&lt;/li&gt;
&lt;li&gt;all active all the time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dominant case is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one trusted assistant with continuity&lt;/li&gt;
&lt;li&gt;one active thread of context&lt;/li&gt;
&lt;li&gt;occasional need for a highly specialized coding burst&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those are different architectures.&lt;/p&gt;
&lt;p&gt;I had optimized for the wrong one.&lt;/p&gt;
&lt;h2 id="what-hermes-changed"&gt;What Hermes changed&lt;/h2&gt;
&lt;p&gt;Hermes pushed me back toward the simpler model: one primary assistant that is good at staying useful over time.&lt;/p&gt;
&lt;p&gt;What I wanted in the end was not an agent zoo. I wanted a system I trust.&lt;/p&gt;
&lt;p&gt;For me, Hermes is the better fit because it is opinionated in the right places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;stronger emphasis on durable memory and recall discipline&lt;/li&gt;
&lt;li&gt;cleaner operational loop around tools, verification, and follow-through&lt;/li&gt;
&lt;li&gt;better fit for one ongoing assistant relationship instead of many semi-permanent personas&lt;/li&gt;
&lt;li&gt;easier to keep understandable after weeks of iteration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last point matters more than people admit.&lt;/p&gt;
&lt;p&gt;A personal AI system is not finished when it can &lt;em&gt;do&lt;/em&gt; impressive things. It&amp;rsquo;s finished when you can still understand, repair, and extend it after a month of real life.&lt;/p&gt;
&lt;p&gt;OpenClaw encouraged me to explore. Hermes encourages me to simplify.&lt;/p&gt;
&lt;p&gt;Right now, simplification is worth more.&lt;/p&gt;
&lt;h2 id="why-claude-code-and-codex-changed-the-equation"&gt;Why Claude Code and Codex changed the equation&lt;/h2&gt;
&lt;p&gt;The other thing that made the big permanent multi-agent setup less compelling was the rise of strong task-specific coding agents.&lt;/p&gt;
&lt;p&gt;Both Claude Code and Codex are explicit about what they are in their own docs: local coding agents that can inspect a repo, edit files, and run commands in a focused working directory. That&amp;rsquo;s exactly the point.&lt;/p&gt;
&lt;p&gt;They don&amp;rsquo;t need to be my forever assistant.
They need to be very good at &lt;em&gt;this code problem, right now&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Once those tools became good enough, a lot of my specialist-agent architecture stopped making economic sense.&lt;/p&gt;
&lt;p&gt;I no longer need to keep a permanent code-writer persona, code-review persona, or test-writer persona alive as part of one giant always-on constellation just in case I need them later. When I hit a serious implementation task, I can use Claude Code or Codex directly on that repository.&lt;/p&gt;
&lt;p&gt;That changes the architecture boundary.&lt;/p&gt;
&lt;p&gt;Instead of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one persistent system that contains every specialization internally&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I can do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one persistent assistant for continuity, operations, memory, messaging, and daily work&lt;/li&gt;
&lt;li&gt;one ephemeral specialist agent for the hard coding task in front of me&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s a better split.&lt;/p&gt;
&lt;p&gt;The persistent layer keeps history and context.
The specialist layer brings concentrated capability on demand.&lt;/p&gt;
&lt;p&gt;Those two jobs do not need to live in the same permanent structure.&lt;/p&gt;
&lt;h2 id="the-practical-decision-rule"&gt;The practical decision rule&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re deciding between a persistent agent runtime and a pile of coding subagents, this is the rule I&amp;rsquo;d use now.&lt;/p&gt;
&lt;p&gt;Use a &lt;em&gt;persistent assistant&lt;/em&gt; when the value comes from continuity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;remembering your preferences&lt;/li&gt;
&lt;li&gt;carrying forward project context across days&lt;/li&gt;
&lt;li&gt;handling scheduled workflows&lt;/li&gt;
&lt;li&gt;integrating with messaging, email, calendars, or home systems&lt;/li&gt;
&lt;li&gt;reducing repeated coordination overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use a &lt;em&gt;repo-local specialist agent&lt;/em&gt; when the value comes from depth on one bounded task:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;implementing a feature&lt;/li&gt;
&lt;li&gt;reviewing a pull request&lt;/li&gt;
&lt;li&gt;debugging a failing test suite&lt;/li&gt;
&lt;li&gt;refactoring one codebase&lt;/li&gt;
&lt;li&gt;researching one technical decision&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Don&amp;rsquo;t force the persistent assistant to impersonate an entire software organization.
Don&amp;rsquo;t force the repo-local coding tool to become your life OS.&lt;/p&gt;
&lt;p&gt;Those are different tools.&lt;/p&gt;
&lt;h2 id="what-readers-should-take-from-this"&gt;What readers should take from this&lt;/h2&gt;
&lt;p&gt;The important takeaway is not &amp;ldquo;single agent good, multi-agent bad.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Optimize for reliability before capability surface area.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A system that can theoretically do ten kinds of delegation but fails one out of five times is worse than a simpler system that reliably completes the boring parts of your day.&lt;/p&gt;
&lt;p&gt;The second takeaway:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Count maintenance, not just features.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every additional agent, memory file, router, handoff rule, and background workflow has a carrying cost. If you don&amp;rsquo;t include that cost in the architecture decision, you&amp;rsquo;ll overbuild.&lt;/p&gt;
&lt;p&gt;And the third:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use specialization at the edge, not necessarily at the center.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That was the real shift for me. I still use specialized agents. I just don&amp;rsquo;t keep them all running as permanent residents inside one increasingly elaborate assistant runtime. For coding, it is often better to reach for Claude Code or Codex exactly when the problem calls for them, then come back to the main assistant when the task is over.&lt;/p&gt;
&lt;p&gt;That gives me the upside of specialization without paying permanent orchestration tax.&lt;/p&gt;
&lt;h2 id="closing"&gt;Closing&lt;/h2&gt;
&lt;p&gt;OpenClaw was an important stage in the path. It helped me discover what I actually wanted from an AI system — and just as importantly, what I didn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;What I want now is much less flashy and much more useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one assistant I trust&lt;/li&gt;
&lt;li&gt;strong memory&lt;/li&gt;
&lt;li&gt;clean operational behavior&lt;/li&gt;
&lt;li&gt;specialized coding help on demand&lt;/li&gt;
&lt;li&gt;fewer moving parts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hermes is closer to that target.&lt;/p&gt;
&lt;p&gt;Not because it lets me build more.
Because it lets me need less.&lt;/p&gt;
</description></item><item><title>Home Assistant Blinds: The Practical Setup I Actually Use</title><link>https://sukany.cz/blog/2026-04-09-home-assistant-blinds-practical-setup/</link><pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-04-09-home-assistant-blinds-practical-setup/</guid><description>&lt;p&gt;I try to keep my Home Assistant setup boring in the best possible way. My blinds are not driven by a giant all-knowing automation. Instead, I use a few small pieces that are easy to understand, easy to disable, and easy to restore.&lt;/p&gt;
&lt;p&gt;This post describes the pattern I use at home and how I expose it to Keeper, my Home Assistant assistant agent. If you want something you can reproduce, this is the important part: keep the physical covers, the room-level toggles, and the higher-level automations separate.&lt;/p&gt;
&lt;h2 id="what-the-setup-looks-like"&gt;What the setup looks like&lt;/h2&gt;
&lt;p&gt;I have four rooms with blinds, each exposed as a normal &lt;code&gt;cover.&lt;/code&gt; entity in Home Assistant. In my case they are Shelly-controlled covers, but the pattern does not depend on Shelly specifically.&lt;/p&gt;
&lt;p&gt;For each room I also keep a small set of helper entities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one &lt;code&gt;input_boolean&lt;/code&gt; that says whether room automation is allowed&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;input_number&lt;/code&gt; for the target blind position&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;input_number&lt;/code&gt; for the target tilt angle&lt;/li&gt;
&lt;li&gt;one optional &lt;code&gt;input_boolean&lt;/code&gt; for shading mode&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives me a simple contract. The real cover entity moves hardware. Helper entities store intent.&lt;/p&gt;
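&lt;p&gt;For one room, the helper layer is only a few lines of YAML. This is a sketch: the &lt;code&gt;input_number&lt;/code&gt; names are invented for the example, not my real ones.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# configuration.yaml — helpers for one room (names are illustrative)
input_boolean:
  automatika_zaluzie_loznice:
    name: Bedroom blinds automation allowed

input_number:
  zaluzie_loznice_target_position:
    name: Bedroom blinds target position
    min: 0
    max: 100
    step: 5
  zaluzie_loznice_target_tilt:
    name: Bedroom blinds target tilt
    min: 0
    max: 100
    step: 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The cover entity itself is whatever your integration exposes. The helpers only store intent.&lt;/p&gt;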
&lt;h2 id="the-core-idea-automation-is-a-toggle-not-a-trap"&gt;The core idea: automation is a toggle, not a trap&lt;/h2&gt;
&lt;p&gt;The most useful thing I added was a watchdog automation per room. The watchdog listens to a room toggle such as &lt;code&gt;input_boolean.automatika_zaluzie_loznice&lt;/code&gt; and does only two jobs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;when the toggle is turned on, it enables the room automation&lt;/li&gt;
&lt;li&gt;when the toggle is turned off, it disables the room automation, waits four hours, and enables it again&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That sounds trivial, but it solves a real problem. Manual override stays manual for long enough to matter, but I do not forget to restore the default behavior the next day.&lt;/p&gt;
&lt;p&gt;In practice, each room has a tiny watchdog automation that turns one real automation on or off. That is much easier to reason about than embedding override logic into every rule.&lt;/p&gt;
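&lt;p&gt;Sketched as one Home Assistant automation, with an assumed name for the room automation it guards. The &lt;code&gt;mode: restart&lt;/code&gt; line matters: flipping the toggle again restarts the four-hour countdown.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# automations.yaml — per-room watchdog (target automation name is assumed)
- alias: "Bedroom blinds automation watchdog"
  mode: restart   # a new toggle event restarts the countdown
  trigger:
    - platform: state
      entity_id: input_boolean.automatika_zaluzie_loznice
  action:
    - choose:
        - conditions:
            - condition: state
              entity_id: input_boolean.automatika_zaluzie_loznice
              state: "on"
          sequence:
            - service: automation.turn_on
              target:
                entity_id: automation.zaluzie_loznice_thermal
      default:
        - service: automation.turn_off
          target:
            entity_id: automation.zaluzie_loznice_thermal
        - delay: "04:00:00"
        - service: automation.turn_on
          target:
            entity_id: automation.zaluzie_loznice_thermal
        - service: input_boolean.turn_on
          target:
            entity_id: input_boolean.automatika_zaluzie_loznice
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;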
&lt;h2 id="room-rules-stay-separate-and-practical"&gt;Room rules stay separate and practical&lt;/h2&gt;
&lt;p&gt;The room automations themselves stay focused.&lt;/p&gt;
&lt;p&gt;Examples from my setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;bedroom and kids room use thermal-comfort automations with a default “down + tilt” target&lt;/li&gt;
&lt;li&gt;kitchen has a late-afternoon sun rule that keeps the view usable and also checks cloud conditions&lt;/li&gt;
&lt;li&gt;living room has its own thermal and comfort rule, and I can disable it independently when I want full manual control&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also use one helper script for the annoying real-world detail: sometimes position and tilt must be applied in sequence. The script &lt;code&gt;script.blinds_ensure_position_then_tilt&lt;/code&gt; makes that idempotent and adds a timeout, so the higher-level automations can call one stable interface instead of duplicating movement logic.&lt;/p&gt;
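&lt;p&gt;A sketch of that script, with assumed field names and a one-minute wait that gives up gracefully rather than blocking:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# scripts.yaml — position first, then tilt (field names are assumptions)
blinds_ensure_position_then_tilt:
  alias: "Blinds: ensure position, then tilt"
  mode: queued
  fields:
    entity:
      description: Cover entity to move
    position:
      description: Target position, 0 to 100
    tilt:
      description: Target tilt, 0 to 100
  sequence:
    - service: cover.set_cover_position
      target:
        entity_id: "{{ entity }}"
      data:
        position: "{{ position }}"
    - wait_template: "{{ state_attr(entity, 'current_position') == position | int }}"
      timeout: "00:01:00"
      continue_on_timeout: true   # do not hang the caller if the blind stalls
    - service: cover.set_cover_tilt_position
      target:
        entity_id: "{{ entity }}"
      data:
        tilt_position: "{{ tilt }}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the script converges on the same targets when called twice, automations can call it repeatedly without caring whether the blind already moved.&lt;/p&gt;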
&lt;h2 id="how-keeper-fits-into-this"&gt;How Keeper fits into this&lt;/h2&gt;
&lt;p&gt;Keeper does not need to know every low-level rule. It only needs a safe control surface.&lt;/p&gt;
&lt;p&gt;In my setup, that means Keeper works with Home Assistant domains such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cover.*&lt;/code&gt; for open, close, or set position&lt;/li&gt;
&lt;li&gt;&lt;code&gt;automation.*&lt;/code&gt; for turning room logic on or off&lt;/li&gt;
&lt;li&gt;&lt;code&gt;script.*&lt;/code&gt; for the reusable “ensure position, then tilt” action&lt;/li&gt;
&lt;li&gt;&lt;code&gt;input_boolean.*&lt;/code&gt; and &lt;code&gt;input_number.*&lt;/code&gt; for room intent and defaults&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That separation matters. If I change an implementation detail inside Home Assistant, I do not need to redesign Keeper. I keep the same room-level interface and update only the HA internals.&lt;/p&gt;
&lt;h2 id="if-you-want-to-reproduce-it"&gt;If you want to reproduce it&lt;/h2&gt;
&lt;p&gt;Start with one room only.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Expose your blind as a &lt;code&gt;cover&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add one boolean helper that means “automation enabled”.&lt;/li&gt;
&lt;li&gt;Create one actual room automation.&lt;/li&gt;
&lt;li&gt;Add one watchdog automation that turns the room automation back on after a few hours.&lt;/li&gt;
&lt;li&gt;If tilt is unreliable, wrap it in one reusable script instead of solving it differently in every room.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That is enough to get a system that is practical, debuggable, and friendly to both manual control and agent-driven control.&lt;/p&gt;
&lt;p&gt;The main lesson is simple: do not make your blinds smart in one big step. Make them composable. That is what made the setup usable for me.&lt;/p&gt;</description></item><item><title>From One Agent to Fifteen: Multi-Agent Architecture in Practice</title><link>https://sukany.cz/blog/2026-03-15-multi-agent-architecture/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-03-15-multi-agent-architecture/</guid><description>&lt;p&gt;For the first few weeks, Daneel did everything. One agent, all domains: email triage, code review, research, smart home control, calendar, blog drafts. The configuration was clean, the setup was simple, and the outputs were consistently mediocre.&lt;/p&gt;
&lt;p&gt;Not broken. Just mediocre. And I eventually figured out why.&lt;/p&gt;
&lt;h2 id="the-single-agent-problem"&gt;The single-agent problem&lt;/h2&gt;
&lt;p&gt;When an agent handles email classification at 09:00 and rewrites a Python module at 10:00, the same context window carries both concerns. A session loaded with inbox threads, calendar events, and Home Assistant device states isn&amp;rsquo;t an ideal substrate for code review advice. The model isn&amp;rsquo;t broken — it&amp;rsquo;s trying to maintain quality across too many unrelated domains simultaneously.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s also the specialization problem. A good email composer has different instincts than a good code reviewer. Different heuristics, different priorities, different failure modes. Training a single system prompt to be excellent at both is a losing game. You end up with something adequate at everything and exceptional at nothing.&lt;/p&gt;
&lt;p&gt;The practical sign that something was wrong: I kept getting responses that were technically correct but contextually shallow. Daneel would write a blog draft that read like a summary. Review code without catching the architectural issue. Flag emails as low-priority that deserved a reply. Nothing catastrophic — just consistently below what the model was capable of when focused.&lt;/p&gt;
&lt;p&gt;The root cause was context pollution. Every capability I added to Daneel&amp;rsquo;s single-agent setup made every other capability slightly worse.&lt;/p&gt;
&lt;h2 id="the-decision-routing-over-monolith"&gt;The decision: routing over monolith&lt;/h2&gt;
&lt;p&gt;The alternative wasn&amp;rsquo;t smarter prompting or a larger model. It was decomposition.&lt;/p&gt;
&lt;p&gt;Instead of one agent trying to be excellent at everything, I&amp;rsquo;d have fifteen agents each trying to be excellent at one thing. A coordinator — Daneel — handles routing, calendar, and simple cross-domain queries. Everything else delegates.&lt;/p&gt;
&lt;p&gt;The routing table is deliberately simple:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;email / Zulip / Twitter → Hermes
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;write text / blog / draft → Scribe
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;implement code / script → Forge
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;review code / PR → Sentinel
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;architecture / design / RFC → Archon
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;security / SAST / vulnerability → Warden
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;write tests / test automation → Tester
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;QA / acceptance criteria → Proctor
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;UX / design / usability → Artisan
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;critique / devil&amp;#39;s advocate → Critic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;research / news / RSS → Scout
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;servers / K8s / deploy → Atlas
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;smart home / HA / devices → Keeper
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;calendar / scheduling → Daneel (direct)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Daneel&amp;rsquo;s role shifted from &amp;ldquo;does everything&amp;rdquo; to &amp;ldquo;routes everything, does almost nothing.&amp;rdquo; It reads the request, identifies the domain, delegates to the specialist, and synthesizes the result into one to three sentences. It doesn&amp;rsquo;t write emails. It doesn&amp;rsquo;t write code. It doesn&amp;rsquo;t research anything. It knows who does those things and tells them to do it.&lt;/p&gt;
&lt;p&gt;This sounds like a coordination tax. In practice, the tax is small and the quality improvement is not.&lt;/p&gt;
&lt;h2 id="fifteen-specialists-fifteen-contexts"&gt;Fifteen specialists, fifteen contexts&lt;/h2&gt;
&lt;p&gt;Each specialist agent has a narrowly scoped system prompt. Scribe knows about the blog, Martinův hlas, and ox-hugo conventions. Forge knows about codebase patterns and conventions and nothing about email or home automation. Sentinel knows about code review standards and security — and nothing about blog formatting.&lt;/p&gt;
&lt;p&gt;The context isolation is the feature. A specialist never has to decide whether the thing it&amp;rsquo;s doing is relevant to some other domain. It just does the thing it knows.&lt;/p&gt;
&lt;p&gt;This also means each specialist can carry domain-specific memory. Scribe remembers the blog&amp;rsquo;s tone and previous posts in the series. Hermes knows email contacts and communication history. Keeper knows which Home Assistant entities map to which rooms. That memory would be noise in a single-agent context. In a specialist, it&amp;rsquo;s leverage.&lt;/p&gt;
&lt;p&gt;Practically, each agent runs in its own session. There&amp;rsquo;s no shared state between them except what the orchestrator explicitly passes. If Scribe needs research from Scout, Daneel runs both and hands Scribe&amp;rsquo;s session the Scout output as input. No implicit context bleed.&lt;/p&gt;
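&lt;p&gt;The handoff pattern is easiest to see in code. The sketch below is illustrative, not OpenClaw&amp;rsquo;s API: &lt;code&gt;run_agent&lt;/code&gt; stands in for whatever spawns one isolated specialist session.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from pathlib import Path

def run_agent(name: str, task: str, context_files: list[Path]) -&gt; str:
    """Hypothetical stand-in: run one specialist in a fresh session, return its output."""
    context = "\n\n".join(p.read_text() for p in context_files)
    return f"[{name}] {task} (context: {len(context)} chars)"

def blog_pipeline(topic: str, workdir: Path) -&gt; str:
    # Scout runs first, with no inherited context at all.
    research = workdir / "research.md"
    research.write_text(run_agent("Scout", f"research: {topic}", []))

    # Scribe sees only what the orchestrator hands over: Scout's file.
    draft = workdir / "draft.md"
    draft.write_text(run_agent("Scribe", f"draft post: {topic}", [research]))

    # Critic gets the draft and returns feedback, not a rewrite.
    return run_agent("Critic", "critique the draft", [draft])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The point of the sketch: Scribe receives exactly one file, chosen by the orchestrator. Nothing leaks in by accident.&lt;/p&gt;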
&lt;h2 id="communication-one-dm-room-per-agent"&gt;Communication: one DM room per agent&lt;/h2&gt;
&lt;p&gt;Every agent communicates with Martin through its own private Matrix room. Fifteen agents, fifteen rooms. Each agent knows only its own room ID.&lt;/p&gt;
&lt;p&gt;This looks redundant until you&amp;rsquo;ve experienced the alternative. In a shared room with multiple agents, you get cross-talk: answers that assume context from a different thread, unclear attribution, noise from agents that have nothing to do with the current task. A group chat for AI agents has all the same problems as a group chat for humans, with the additional problem that agents don&amp;rsquo;t have social instincts to keep them quiet when they have nothing to contribute.&lt;/p&gt;
&lt;p&gt;The DM model is clean. When Hermes sends a draft reply, it appears in Hermes&amp;rsquo;s room. When Scout delivers research, it lands in Scout&amp;rsquo;s room. When Atlas finishes a deployment, the result is in Atlas&amp;rsquo;s room. Martin gets focused, attributable output from each specialist without noise from the others.&lt;/p&gt;
&lt;p&gt;Daneel&amp;rsquo;s room handles general requests and coordination. When a task requires multiple specialists, Daneel orchestrates the chain and delivers a synthesized summary — never the raw specialist output unless explicitly asked.&lt;/p&gt;
&lt;h2 id="a-concrete-example-this-post"&gt;A concrete example: this post&lt;/h2&gt;
&lt;p&gt;The blog post pipeline illustrates the model.&lt;/p&gt;
&lt;p&gt;Martin&amp;rsquo;s request arrives in Daneel&amp;rsquo;s room: &amp;ldquo;write a post about the multi-agent architecture.&amp;rdquo; Daneel identifies three domains — research, writing, critique — and sequences three specialists.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scout&lt;/strong&gt; runs first. It gets a focused task: research on multi-agent AI architectures, relevant tradeoffs, prior art. It reads nothing about email or home automation. It produces a research document.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scribe&lt;/strong&gt; runs second, with Scout&amp;rsquo;s output as explicit input context. Scribe knows the blog format, the voice, the previous posts in this series. It writes a draft without needing to be told what a blog post is or how it should sound.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Critic&lt;/strong&gt; runs third, with the draft. Critic&amp;rsquo;s job is adversarial by design — it looks for logical gaps, weak claims, places where specificity would help. It returns structured feedback, not a revised draft.&lt;/p&gt;
&lt;p&gt;Daneel synthesizes: delivers the reviewed draft with a one-line note on the major issues Critic flagged.&lt;/p&gt;
&lt;p&gt;For a software feature, the chain is longer: Archon (architecture design) → Artisan (UX) → Forge (implementation) → Tester (test suite) → Sentinel (code review) → Warden (security audit) → Proctor (acceptance criteria). Seven specialists, each working with output from the one before it, each in their own focused context.&lt;/p&gt;
&lt;h2 id="what-changed"&gt;What changed&lt;/h2&gt;
&lt;p&gt;Quality went up noticeably for writing and code. The improvement isn&amp;rsquo;t uniform — simple tasks are about the same — but anything that requires real domain judgment is better. Scribe produces blog drafts that sound like Martin rather than like a summary of what a blog post about the topic would contain. Sentinel catches architectural issues that a generalist code reviewer misses. Critic finds the argument&amp;rsquo;s weakest point on the first pass.&lt;/p&gt;
&lt;p&gt;The other gain is parallelization. Independent tasks on different domains can run simultaneously. Hermes handling email preprocessing while Scout runs a research job while Atlas checks infrastructure status — those three things happen in the same time window without competing for the same context.&lt;/p&gt;
&lt;p&gt;What got harder: setup overhead per agent. Each specialist needs a carefully tuned system prompt, domain-specific memory, and routing rules that handle edge cases. Adding a new specialist is a few hours of work, not a one-line config change. The routing table needs maintenance as domains evolve.&lt;/p&gt;
&lt;p&gt;Memory isolation is also tricky to get right. Information that should stay with one specialist sometimes needs to reach another. The clean solution is explicit handoffs via the orchestrator — Daneel passes Scout&amp;rsquo;s research document as a file to Scribe&amp;rsquo;s session — but that requires every multi-specialist workflow to be explicitly designed. Miss a handoff and the downstream specialist works with incomplete context.&lt;/p&gt;
&lt;p&gt;The prompt engineering overhead is real. Fifteen system prompts instead of one means fifteen opportunities to get it wrong, fifteen things to update when coordination patterns change, fifteen memory files to maintain.&lt;/p&gt;
&lt;p&gt;This architecture isn&amp;rsquo;t for everyone. If your tasks stay in one domain, a single capable agent is easier to run and reason about. The fifteen-specialist setup makes sense when you have genuine multi-domain load, when domain quality matters, and when you&amp;rsquo;re willing to invest in the scaffolding that makes routing actually work.&lt;/p&gt;
&lt;p&gt;For the use case it&amp;rsquo;s designed for — a personal assistant that handles email, code, writing, infrastructure, and home automation with consistent quality across all of them — the tradeoff is worth it. One Daneel doing everything was adequate. Fifteen specialists coordinated by a routing layer is noticeably better.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Running: OpenClaw, self-hosted. 15 agents: Daneel (coordinator) + 14 domain specialists. All on Claude Sonnet/Opus (Anthropic). Agent-to-Martin communication via Matrix, one DM room per agent.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>FSA-Driven Multi-Agent Pipelines: How We Stopped Fighting Our Own Orchestrator</title><link>https://sukany.cz/blog/2026-02-28-fsa-pipeline-architecture/</link><pubDate>Sat, 28 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-28-fsa-pipeline-architecture/</guid><description>&lt;h2 id="the-problem-we-had"&gt;The Problem We Had&lt;/h2&gt;
&lt;p&gt;Our first multi-agent pipeline was a disaster waiting to happen. The architecture seemed clean: spawn workers, each does its thing, updates a shared &lt;code&gt;status.json&lt;/code&gt; to record completion, and if it&amp;rsquo;s the last one in its phase, spawns the next batch. Workers know the workflow, workers drive progress. What could go wrong?&lt;/p&gt;
&lt;p&gt;Plenty.&lt;/p&gt;
&lt;p&gt;The race condition was textbook. Two parallel research workers — &lt;code&gt;researcher-a&lt;/code&gt; and &lt;code&gt;researcher-b&lt;/code&gt; — finish around the same time. At &lt;code&gt;t=0&lt;/code&gt;, both read &lt;code&gt;status.json&lt;/code&gt;, and each sees the other still marked as running. At &lt;code&gt;t=1&lt;/code&gt;, both write back with themselves marked completed. One write wins. The other is silently lost. The &amp;ldquo;winning&amp;rdquo; worker sees only its own completion, decides the phase isn&amp;rsquo;t done, and does nothing. The pipeline stalls. No error. No timeout for another ten minutes. Just silence.&lt;/p&gt;
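&lt;p&gt;The lost update is easy to reproduce in miniature. This sketch simulates both workers doing an unlocked read-modify-write on the same file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import json, tempfile
from pathlib import Path

def finish(status_file: Path, worker: str, snapshot: dict) -&gt; None:
    """Write back a completion based on a possibly stale snapshot (no locking)."""
    snapshot[worker] = "completed"
    status_file.write_text(json.dumps(snapshot))

status = Path(tempfile.mkdtemp()) / "status.json"
status.write_text(json.dumps({"researcher-a": "running", "researcher-b": "running"}))

# t=0: both workers read the same state
seen_by_a = json.loads(status.read_text())
seen_by_b = json.loads(status.read_text())

# t=1: both write back; whichever write lands last silently erases the other
finish(status, "researcher-a", seen_by_a)
finish(status, "researcher-b", seen_by_b)

final = json.loads(status.read_text())
# researcher-a's completion is gone: final still shows it as "running"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;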
&lt;p&gt;That was the obvious failure. The subtle one was worse: &lt;strong&gt;state trapped in the agent&amp;rsquo;s context window&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When a worker gets killed mid-task — OOM, timeout, platform restart — the in-progress state dies with it. Nothing in &lt;code&gt;status.json&lt;/code&gt; says &amp;ldquo;this worker was halfway through step 3 of 7.&amp;rdquo; There&amp;rsquo;s no way to resume. You either restart the whole pipeline or manually reconstruct what happened from logs.&lt;/p&gt;
&lt;p&gt;We looked at alternatives. LangChain and LangGraph are elegant for small pipelines, but out of the box their state lives in memory — restart the process and you start over. CrewAI puts LLM reasoning in the control plane: agents decide what to do next, which sounds powerful until you realize your orchestration is non-deterministic. AutoGen is similar — control flow emerges from conversation, making it genuinely hard to reason about edge cases. Prefect and Airflow are solid but not built for LLM agent workflows. None gave us what we needed: a simple, external, inspectable state machine that survives restarts and eliminates race conditions by construction.&lt;/p&gt;
&lt;p&gt;So we built one.&lt;/p&gt;
&lt;h2 id="what-fsa-actually-is"&gt;What FSA Actually Is&lt;/h2&gt;
&lt;p&gt;A finite state automaton formalizes something you already know: a system with a fixed set of states, a fixed set of events, and a table mapping (state, event) → next state + action.&lt;/p&gt;
&lt;p&gt;Think of a traffic light. Three states: RED, YELLOW, GREEN. Deterministic transitions: GREEN → timer expires → YELLOW → timer expires → RED → timer expires → GREEN. No traffic light &amp;ldquo;decides&amp;rdquo; anything. It doesn&amp;rsquo;t reason about traffic density or consult a language model. It reads its current state, checks which event fired, looks up the table, and acts.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the key insight: &lt;strong&gt;the orchestrator has no opinions&lt;/strong&gt;. It reads &lt;code&gt;(current_state, event)&lt;/code&gt;, looks up the table, and executes the action. The intelligence lives in the table definition, written by humans at design time. Runtime execution is mechanical.&lt;/p&gt;
&lt;p&gt;For multi-agent pipelines, this translates directly. &amp;ldquo;States&amp;rdquo; are phase statuses: &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;running&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;paused&lt;/code&gt;. &amp;ldquo;Events&amp;rdquo; are things like &amp;ldquo;worker output file appeared&amp;rdquo; or &amp;ldquo;timeout exceeded.&amp;rdquo; The &amp;ldquo;table&amp;rdquo; is a decision matrix the orchestrator consults on every tick. No LLM in the loop. No ambiguity.&lt;/p&gt;
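&lt;p&gt;In code, the whole control plane collapses to a lookup. A sketch in Python; the states, events, and action names are illustrative, not our actual table:&lt;/p&gt;

```python
# The transition table is plain data, written by humans at design time.
# Runtime execution is a dictionary lookup, nothing more.
TABLE = {
    ("running", "all_outputs_present"): ("completed", "spawn_next_phase"),
    ("running", "timeout_exceeded"):    ("failed", "notify_user"),
    ("running", "tick"):                ("running", "wait"),
    ("completed", "tick"):              ("completed", "archive"),
}

def step(state, event):
    # The orchestrator has no opinions: look the pair up, or fail
    # loudly on an undefined transition instead of improvising.
    try:
        return TABLE[(state, event)]
    except KeyError:
        raise ValueError(f"undefined transition: ({state}, {event})")

next_state, action = step("running", "all_outputs_present")
```

&lt;p&gt;The useful property is that an undefined transition is an error, not an invitation for the system to guess.&lt;/p&gt;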
&lt;h2 id="the-new-architecture"&gt;The New Architecture&lt;/h2&gt;
&lt;p&gt;The redesigned system has exactly three components:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;workflows.json&lt;/code&gt; — static definition.&lt;/strong&gt; Describes every pipeline type: phases, ordering (sequential or parallel), workers per phase, models, timeouts, and input file dependencies. Never changes at runtime. It&amp;rsquo;s the blueprint.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;status.json&lt;/code&gt; — runtime state.&lt;/strong&gt; One file per pipeline run, created at launch, updated only by the orchestrator (main session). Tracks current phase, worker statuses, session IDs, retry counts, and delivery state. This is the single source of truth.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Workers — pure executors.&lt;/strong&gt; A worker receives a task prompt with the topic, input files, and an explicit output path. It does its work, writes the output file, and exits. That&amp;rsquo;s the entire contract. Workers &lt;strong&gt;never&lt;/strong&gt; touch &lt;code&gt;status.json&lt;/code&gt;. Workers &lt;strong&gt;never&lt;/strong&gt; spawn other workers. Workers don&amp;rsquo;t know what phase they&amp;rsquo;re in or what comes next.&lt;/p&gt;
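&lt;p&gt;One detail the contract implies but doesn&amp;rsquo;t state, so treat this as an assumption on my part: the output file should appear atomically, because a half-written file would read as a completed worker. A temp-file-plus-rename sketch, with a stand-in for the model call:&lt;/p&gt;

```python
import os
import tempfile

def do_task(task):
    # Stand-in for the actual model call; illustrative only.
    return "# Output for: " + task + "\n"

def run_worker(task, out_path):
    """The entire worker contract: do the work, write one file, stop."""
    result = do_task(task)
    tmp = out_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(result)
    os.replace(tmp, out_path)  # atomic: the file exists complete, or not at all

out = os.path.join(tempfile.mkdtemp(), "researcher-a.md")
run_worker("Research perspective A", out)
```

&lt;p&gt;With that in place, &amp;ldquo;the file exists&amp;rdquo; really does mean &amp;ldquo;the worker finished.&amp;rdquo;&lt;/p&gt;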
&lt;p&gt;The orchestrator runs a reconciliation loop on every trigger — worker completion announce, heartbeat, user message. Each time, it does the same thing: check which output files exist, update `status.json` to reflect detected completions, then consult the decision table:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;┌─────────────────────────────────┬──────────────────────────────────┐
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ State │ Action │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├─────────────────────────────────┼──────────────────────────────────┤
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ All workers done + next pending │ Spawn next phase workers │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ All workers done + pause_after │ Summarize to user, wait │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ Final phase completed │ Deliver final.md to user, archive│
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ Phase running &amp;gt; timeout + 120s │ Mark failed, notify user │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ Phase running, within limit │ Wait (nothing to do) │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ result_delivered: true │ Archive │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;└─────────────────────────────────┴──────────────────────────────────┘
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;File existence as completion signal&lt;/strong&gt; is the key to idempotency. The orchestrator doesn&amp;rsquo;t rely on receiving a message from the worker. It checks: does &lt;code&gt;researcher-a.md&lt;/code&gt; exist? If yes, that worker is done — regardless of what &lt;code&gt;status.json&lt;/code&gt; currently says. You can kill and restart the orchestrator at any point; it will reconstruct correct state from the filesystem. No lost updates. No ghost workers.&lt;/p&gt;
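&lt;p&gt;The reconcile step itself is small enough to sketch. Assuming one output file per worker (the names and layout here are illustrative), rebuilding worker state from the filesystem looks like this:&lt;/p&gt;

```python
import os
import tempfile

def reconcile(run_dir, workers):
    """Derive worker statuses purely from output-file existence.

    Idempotent by construction: call it after any crash or restart
    and it reconstructs the same answer from the filesystem.
    """
    status = {}
    for role in workers:
        out = os.path.join(run_dir, role + ".md")
        status[role] = "completed" if os.path.exists(out) else "running"
    phase_done = all(s == "completed" for s in status.values())
    return status, phase_done

run_dir = tempfile.mkdtemp()
open(os.path.join(run_dir, "researcher-a.md"), "w").close()
status, done = reconcile(run_dir, ["researcher-a", "researcher-b"])
# researcher-a completed, researcher-b still running, phase not done
```

&lt;p&gt;Nothing in this function reads or trusts prior state, which is exactly what makes it safe to run on every trigger.&lt;/p&gt;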
&lt;h2 id="concrete-example-research-pipeline"&gt;Concrete Example: Research Pipeline&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s a real pipeline definition — two parallel researchers followed by a synthesis pass:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;research&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;description&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Pure research + analysis&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;phases&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;collect&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;mode&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;parallel&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;researcher-a&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;sonnet&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;timeout&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;task&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Research perspective A: main sources, facts, current state&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;researcher-b&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;sonnet&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;timeout&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;task&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Research perspective B: alternative views, criticism, edge cases&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;synthesis&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;mode&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;sequential&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;synthesizer&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;opus&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;timeout&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;420&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;final&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;reads&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;researcher-a.md&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;researcher-b.md&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;task&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Synthesize research from both researchers&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="the-walkthrough"&gt;The Walkthrough&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; User triggers &lt;code&gt;/pipeline research FSA architecture&lt;/code&gt;. Orchestrator reads &lt;code&gt;workflows.json&lt;/code&gt;, creates &lt;code&gt;pipeline-tmp/research-180141/&lt;/code&gt;, initializes &lt;code&gt;status.json&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;pipeline&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;research&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;dir&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;research-180141&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;topic&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;FSA architecture&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;current_phase&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;retry_count&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;phases&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;collect&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;researcher-a&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:abc123&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;researcher-b&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:def456&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;synthesis&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pending&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;synthesizer&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pending&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;result_delivered&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Orchestrator spawns &lt;code&gt;researcher-a&lt;/code&gt; and &lt;code&gt;researcher-b&lt;/code&gt; in parallel. Both get a task prompt with an explicit output path. The orchestrator tells the user: &amp;ldquo;Pipeline running, 2 workers in phase 1.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; &lt;code&gt;researcher-a&lt;/code&gt; finishes first. Writes &lt;code&gt;researcher-a.md&lt;/code&gt; and exits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Orchestrator trigger fires. Reconcile checks the filesystem, sees &lt;code&gt;researcher-a.md&lt;/code&gt;, updates status:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;current_phase&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;phases&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;collect&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;researcher-a&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;completed&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:abc123&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;researcher-b&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:def456&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;synthesis&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pending&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;synthesizer&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pending&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Decision table: phase 0 still has a running worker within timeout → &lt;strong&gt;Wait&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; &lt;code&gt;researcher-b&lt;/code&gt; finishes. Writes &lt;code&gt;researcher-b.md&lt;/code&gt;, exits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 6.&lt;/strong&gt; Orchestrator trigger fires. Both output files exist. Updates both workers to &lt;code&gt;completed&lt;/code&gt;, marks phase 0 &lt;code&gt;completed&lt;/code&gt;. Decision table: all workers done, next phase pending → &lt;strong&gt;Spawn next phase&lt;/strong&gt;. Spawns &lt;code&gt;synthesizer&lt;/code&gt; with both research files in its prompt. Updates &lt;code&gt;status.json&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;current_phase&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;phases&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;collect&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;completed&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;researcher-a&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;completed&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:abc123&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;researcher-b&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;completed&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:def456&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;synthesis&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;workers&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;synthesizer&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;running&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;session&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;agent:main:subagent:ghi789&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 7.&lt;/strong&gt; &lt;code&gt;synthesizer&lt;/code&gt; reads both research files, writes &lt;code&gt;synthesizer.md&lt;/code&gt;, exits. It has &lt;code&gt;&amp;#34;final&amp;#34;: true&lt;/code&gt; in the workflow definition.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 8.&lt;/strong&gt; Orchestrator detects &lt;code&gt;synthesizer.md&lt;/code&gt;, phase 1 complete, final phase → &lt;strong&gt;Deliver final.md to user, archive&lt;/strong&gt;. Sends the synthesis to the user. Sets &lt;code&gt;result_delivered: true&lt;/code&gt;. Moves &lt;code&gt;pipeline-tmp/research-180141/&lt;/code&gt; to &lt;code&gt;memory/pipelines/&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;At no point did any worker touch &lt;code&gt;status.json&lt;/code&gt;. At no point did any worker decide what comes next. Every control decision came from reading state and consulting the table.&lt;/p&gt;
&lt;h2 id="tradeoffs-and-limitations"&gt;Tradeoffs and Limitations&lt;/h2&gt;
&lt;p&gt;This architecture earns its complexity in production pipelines with predictable structure: content generation, research workflows, code review, multi-stage analysis. Anywhere you&amp;rsquo;ve been burned by race conditions, lost state on restart, or non-deterministic orchestration — FSA fixes all three by construction.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not the right tool for genuinely dynamic multi-agent conversations where agents negotiate task structure on the fly. If your workflow can&amp;rsquo;t be expressed as phases + transitions at design time, FSA forces you into contortions. Use something else.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s also a rigidity cost. Adding a new pipeline type means editing &lt;code&gt;workflows.json&lt;/code&gt;, defining phases, specifying worker roles and models. That&amp;rsquo;s deliberate friction — it forces you to think about structure before you run anything — but it does mean you can&amp;rsquo;t just say &amp;ldquo;figure it out&amp;rdquo; and hope for the best. Every workflow needs to be designed, not discovered.&lt;/p&gt;
&lt;p&gt;The pattern demands discipline: workers must respect their contract (write output, exit, touch nothing else). One worker that &amp;ldquo;helps&amp;rdquo; by updating &lt;code&gt;status.json&lt;/code&gt; breaks the single-writer guarantee and reintroduces every race condition you just eliminated. Enforce the contract at the prompt level and audit it at every pipeline change.&lt;/p&gt;
&lt;p&gt;Error handling is minimal by design. A failed worker gets marked &lt;code&gt;failed&lt;/code&gt;, the orchestrator notifies the user, and that&amp;rsquo;s it. There&amp;rsquo;s no automatic retry with modified prompts, no fallback to a different model, no sophisticated error recovery. You could build those features on top of the FSA — the decision table is extensible — but out of the box, the system assumes that most failures are better surfaced to a human than papered over by automation.&lt;/p&gt;
&lt;p&gt;The payoff is a system you can debug by reading two files, resume after any failure, and reason about without running it. In production multi-agent systems, that&amp;rsquo;s not a nice-to-have. It&amp;rsquo;s the difference between something you can operate and something that operates you.&lt;/p&gt;</description></item><item><title>Ten Days with an AI Agent</title><link>https://sukany.cz/blog/2026-02-25-ten-days-with-ai-agent/</link><pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-25-ten-days-with-ai-agent/</guid><description>&lt;p&gt;On day 2, the agent tried to re-enable a Twitter integration I had explicitly cancelled the night before. It had forgotten. Not because of a bug — because session restarts wipe context, and nothing in the default setup prevents an AI from re-deriving a decision you already vetoed.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s when I started building the infrastructure that turned a chatbot into something that actually works.&lt;/p&gt;
&lt;p&gt;This is not a tutorial. It&amp;rsquo;s what running an autonomous AI agent looks like after 10 days: what it costs, what breaks, and what I&amp;rsquo;d change.&lt;/p&gt;
&lt;h2 id="what-it-actually-costs"&gt;What It Actually Costs&lt;/h2&gt;
&lt;p&gt;The honest number: &lt;strong&gt;$16–$21 over 10 days&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The agent uses three model tiers. Background tasks — heartbeat checks, email classification, log writes — run on Claude Haiku. About 180 heartbeat sessions over 10 days at roughly $0.012 each: ~$2.16. General conversation and code analysis run on Claude Sonnet. Of 92 recorded sessions, roughly 40% are Sonnet-class work, averaging ~$0.25 per session: ~$9.25. The expensive stuff — security audits, pipeline critic passes, memory maintenance — runs on Opus. 10–15 invocations at ~$0.50 each: $5–7.50.&lt;/p&gt;
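&lt;p&gt;The arithmetic behind those numbers, written out. The figures are the estimates above, not official pricing:&lt;/p&gt;

```python
# Heartbeats on Haiku: 180 sessions at roughly $0.012 each
haiku = 180 * 0.012
# Sonnet: ~40% of 92 recorded sessions at ~$0.25 each
sonnet = round(92 * 0.40) * 0.25
# Opus: 10 to 15 invocations at ~$0.50 each
opus_low, opus_high = 10 * 0.50, 15 * 0.50

total_low = haiku + sonnet + opus_low    # ~$16.4
total_high = haiku + sonnet + opus_high  # ~$18.9
```

&lt;p&gt;Both ends land inside the stated $16&amp;ndash;$21 range; the scheduled cron sessions presumably cover the gap at the top.&lt;/p&gt;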
&lt;p&gt;Embeddings are negligible. The memory system uses OpenAI&amp;rsquo;s text-embedding-3-small at $0.02/1M tokens. Ten days of indexing cost about $0.01.&lt;/p&gt;
&lt;p&gt;Infrastructure is fixed: a VM in my home lab running the OpenClaw gateway. No cloud compute charges.&lt;/p&gt;
&lt;p&gt;The cost driver is not what you&amp;rsquo;d expect. It isn&amp;rsquo;t the tokens a session generates; it&amp;rsquo;s the context loaded at session start. Every session, the agent loads configuration files: a 1.5KB state file, a 5KB curated memory, plus task-specific documents. Before tiered memory, sessions were loading raw daily logs on every start. After: selective loading. Per-session overhead dropped by roughly 60%.&lt;/p&gt;
&lt;p&gt;22 cron jobs run throughout the day: morning briefing, email preprocessing every 2 hours, social media engagement, chat summaries, nightly memory maintenance, weekly server monitoring. Each spawns a sub-agent session. Those add up quietly.&lt;/p&gt;
&lt;p&gt;A month at this rate is $50–$65. Less than most SaaS subscriptions.&lt;/p&gt;
&lt;h2 id="the-forgetting-problem"&gt;The Forgetting Problem&lt;/h2&gt;
&lt;p&gt;The naive approach to agent memory is to log everything and search it later. That degrades fast.&lt;/p&gt;
&lt;p&gt;After day 3, raw daily logs totaled 130KB. By day 10: 400KB across 29 files. Loading all of that into context every session burns tokens and fills the window with noise. Most of what&amp;rsquo;s in those logs is obsolete the moment it&amp;rsquo;s written.&lt;/p&gt;
&lt;p&gt;The architecture I ended up with is L1/L2/L3, borrowed from CPU cache design.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;L1&lt;/strong&gt; is &lt;code&gt;NOW.md&lt;/code&gt; — under 1.5KB, hard limit. Current task, active blockers, open threads. Updated during sessions. If it&amp;rsquo;s not in NOW.md, it doesn&amp;rsquo;t exist for the next session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;L2&lt;/strong&gt; is &lt;code&gt;MEMORY.md&lt;/code&gt; — under 5KB, curated. Long-term facts: credential locations, architectural decisions, lessons that took more than one failure to learn. Only the main session can write to it. Nightly maintenance cycles prune obsolete entries — the file has stayed under 5KB since day 4.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;L3&lt;/strong&gt; is the daily log archive — append-only, never loaded directly. Accessed through hybrid search: BM25 + semantic retrieval via embeddings. Key discovery: the embedding model works significantly better with English queries even though most logs are in Czech.&lt;/p&gt;
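&lt;p&gt;The L1/L2 size limits are mechanical enough to enforce with a script before a session commits its updates. A minimal sketch, using the limits above; the helper itself and the demo files are illustrative, not part of the actual setup:&lt;/p&gt;

```shell
#!/bin/sh
# Enforce the tiered-memory size budgets: L1 (NOW.md) must stay under 1.5KB,
# L2 (MEMORY.md) under 5KB. Returns nonzero when a file is over budget so a
# session can refuse to commit an oversized update.
check_budget() {
  file=$1; limit=$2
  size=$(wc -c < "$file")
  if [ "$size" -gt "$limit" ]; then
    echo "OVER BUDGET: $file (${size}B > ${limit}B)" >&2
    return 1
  fi
  return 0
}

# demo with throwaway files
dir=$(mktemp -d)
printf 'current task: write blog post\n' > "$dir/NOW.md"
check_budget "$dir/NOW.md" 1500 && echo 'L1 within budget'
printf '%2000s' '' > "$dir/oversized.md"
check_budget "$dir/oversized.md" 1500 || echo 'oversized file rejected'
```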
&lt;p&gt;The hard part is not storage. The hard part is &lt;strong&gt;forgetting correctly&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a &lt;code&gt;decisions.md&lt;/code&gt; file — I call it the anti-Dory register — that tracks every cancelled or paused action with a timestamp. When I told the agent to stop auto-posting tweets, that decision was recorded: date, scope, reason. Every cron job that touches external services checks this file before executing. Without it, the agent would occasionally re-reason its way back to trying the cancelled action.&lt;/p&gt;
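&lt;p&gt;The register check itself is trivial to script. A minimal sketch, assuming a one-line-per-entry format; the post specifies only that each entry carries a timestamp, scope, and reason, so the layout and helper name here are illustrative:&lt;/p&gt;

```shell
#!/bin/sh
# Gate executed by every cron job that touches an external service: look the
# action up in the cancelled-actions register before running it. The one-line
# "date action-id reason..." format is an assumption for this sketch.
REGISTER=$(mktemp)
cat > "$REGISTER" <<'EOF'
2026-02-16 twitter-autopost cancelled explicitly, do not re-enable
2026-02-18 rss-crosspost paused pending review
EOF

is_cancelled() {
  # exact match on the action id in column 2; exit 0 means "vetoed"
  awk -v id="$1" '$2 == id { found = 1 } END { exit !found }' "$REGISTER"
}

if is_cancelled twitter-autopost; then
  echo 'SKIP: twitter-autopost was vetoed'   # the anti-Dory register wins
else
  echo 'RUN: twitter-autopost'
fi
```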
&lt;p&gt;There&amp;rsquo;s also a &lt;code&gt;self-review.md&lt;/code&gt; tracking repeated mistakes with a counter. When the count hits 3, the rule gets promoted to permanent configuration. The session-memory hook that shipped by default was broken; it got disabled on day 2 and the rule &amp;ldquo;disable immediately&amp;rdquo; now lives in the permanent config. It has never been re-enabled by accident.&lt;/p&gt;
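&lt;p&gt;The promotion step can be a one-liner. A sketch, assuming a tab-separated mistake/counter layout; the post specifies only the counter and the threshold of 3, so the file format and entry names here are placeholders:&lt;/p&gt;

```shell
#!/bin/sh
# Promote every repeated mistake to permanent configuration once its counter
# reaches 3. The "mistake<TAB>count" layout and the target file are assumptions.
REVIEW=$(mktemp); PERMANENT=$(mktemp)
printf 'session-memory-hook-enabled\t3\n'  >  "$REVIEW"
printf 'sonnet-used-for-heartbeat\t1\n'    >> "$REVIEW"

# entries at or past the threshold become permanent rules
awk -F '\t' '$2 >= 3 { print $1 }' "$REVIEW" >> "$PERMANENT"
cat "$PERMANENT"    # prints: session-memory-hook-enabled
```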
&lt;p&gt;Seven days without a memory failure. The first three days had several. The difference is maintenance cycles and the decisions registry, not the agent being smarter.&lt;/p&gt;
&lt;h2 id="configuration-is-the-product"&gt;Configuration Is the Product&lt;/h2&gt;
&lt;p&gt;Default OpenClaw gives you a conversational agent with web search and file access. That is a chatbot. What I&amp;rsquo;m running now is closer to infrastructure.&lt;/p&gt;
&lt;p&gt;The difference is about 1,000 lines of configuration across eight files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;22 cron jobs&lt;/strong&gt; (default: zero). The morning briefing fires at 07:00, pulls calendar events, scans email, and writes a daily context update. Email preprocessing classifies incoming mail every 2 hours into URGENT / NORMAL / INFO and sends notifications for anything that needs attention. Nightly memory maintenance prunes stale data. Without cron, the agent is purely reactive. With it, problems surface before I ask.&lt;/p&gt;
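&lt;p&gt;A sketch of what a slice of that schedule could look like in crontab form. The times are the ones named above; the script names and paths are placeholders, not the actual configuration:&lt;/p&gt;

```text
# Illustrative crontab fragment (script paths are hypothetical)
0 7 * * *     /opt/agent/jobs/morning-briefing.sh     # calendar + email + daily context
0 */2 * * *   /opt/agent/jobs/email-preprocess.sh     # classify URGENT / NORMAL / INFO
30 2 * * *    /opt/agent/jobs/memory-maintenance.sh   # prune stale memory data
0 9 * * 1     /opt/agent/jobs/server-monitoring.sh    # weekly infrastructure check
```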
&lt;p&gt;&lt;strong&gt;24 pipeline types&lt;/strong&gt; for multi-stage tasks. A blog post runs through researcher → creator → critic. A security audit: recon → parallel auditor + remediator → synthesizer. All workers spawn in a single turn. Sequential workers wait for input files via a bash polling loop — no message-based coordination, no orchestrator agent. The last worker in the chain sends the result directly to Matrix.&lt;/p&gt;
&lt;p&gt;Why not use the built-in message delivery? Because it has a hardcoded 60-second timeout with no retry. I learned this after two pipeline types failed in testing. The fix wasn&amp;rsquo;t more retries — it was bypassing message delivery entirely and having workers write files and send results themselves.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A web publishing safety layer.&lt;/strong&gt; Before any content goes to the public site, a shell script checks for private information, credential references, and third-party data. Exit 1 stops the publish. This exists because an early session attempted to post content containing internal details. Not maliciously — the agent didn&amp;rsquo;t have a boundary. Now the boundary is enforced at the script level, not the prompt level.&lt;/p&gt;
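&lt;p&gt;A minimal sketch of that gate. The pattern list is illustrative; the post says only that the script checks for private information, credential references, and third-party data, and that exit 1 stops the publish:&lt;/p&gt;

```shell
#!/bin/sh
# Pre-publish gate: block content matching sensitive patterns before it
# reaches the public site. The regex list here is an illustrative stand-in.
check_publishable() {
  if grep -q -i -E 'PRIVATE KEY|api[_-]?key|password|Bearer [A-Za-z0-9._-]+' "$1"; then
    echo "BLOCKED: $1 matches a sensitive pattern" >&2
    return 1    # exit 1 stops the publish
  fi
  echo "CLEAN: $1"
}

draft=$(mktemp)
printf 'A post about multi-agent pipelines. No secrets involved.\n' > "$draft"
check_publishable "$draft" && echo 'publish may proceed'
```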
&lt;p&gt;&lt;strong&gt;Priority hierarchy.&lt;/strong&gt; The agent&amp;rsquo;s decision model has five levels: safety &amp;gt; privacy &amp;gt; instructions &amp;gt; stability &amp;gt; efficiency. When they conflict, the order holds. This sounds abstract until the agent needs to decide whether to send an email on your behalf or wait for confirmation. Without explicit priority ordering, it guesses. With it, it stops and asks.&lt;/p&gt;
&lt;p&gt;The insight after 10 days: an AI agent without customization is a chatbot. With customization, it&amp;rsquo;s infrastructure. None of this ships by default.&lt;/p&gt;
&lt;h2 id="what-i-d-do-differently"&gt;What I&amp;rsquo;d Do Differently&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Start with memory architecture on day 1.&lt;/strong&gt; I spent the first two days loading too much context. The L1/L2/L3 design should have been the first thing built, not something I arrived at after three failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add the decisions registry before anything touches external services.&lt;/strong&gt; The first cancelled-action recurrence appeared on day 3. The registry was created on day 4. One day of overlap where cancelled actions occasionally re-triggered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model selection discipline from the start.&lt;/strong&gt; Early sessions used Sonnet for tasks that Haiku handles fine. Across 180 heartbeats, the cost difference adds up. Define model selection rules before creating cron jobs, not after.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Document infrastructure limitations before building on them.&lt;/strong&gt; I built two pipeline types assuming message delivery was reliable. Both failed. Retrofitting the file-based pattern took longer than designing it correctly would have.&lt;/p&gt;
&lt;p&gt;The agent runs stably now. 10 blog posts. Email processed without intervention. Memory clean. No duplicate sends.&lt;/p&gt;
&lt;p&gt;It works. It just took 10 days of configuration to make it work the way it should.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Running: OpenClaw on self-hosted VM. Models: Claude Haiku/Sonnet/Opus (Anthropic), embeddings via text-embedding-3-small (OpenAI). 10-day window: February 15–25, 2026.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Why I Stopped Waiting for Announces: The Spawn-All-Wait Pattern for Multi-Agent AI</title><link>https://sukany.cz/blog/2026-02-21-spawn-all-wait-pattern/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-21-spawn-all-wait-pattern/</guid><description>&lt;p&gt;My multi-agent pipeline was failing at random. Not always, not predictably — just often enough to make me stop trusting it. Worker-2 would run, write its output, and then nothing would happen. The orchestrator was sitting there waiting for an announce that never arrived. The bug already had a ticket number: #17000. Description: hardcoded 60-second timeout, no retry. I&amp;rsquo;d built the entire coordination model on message delivery, and message delivery was the single point of failure. The fix wasn&amp;rsquo;t more retries. It was getting rid of message-based coordination entirely.&lt;/p&gt;
&lt;h2 id="the-old-pattern-and-why-it-broke"&gt;The Old Pattern and Why It Broke&lt;/h2&gt;
&lt;p&gt;The original approach was simple: spawn worker-1, wait for it to announce completion, spawn worker-2, wait for announce, spawn worker-3. Clean, readable, easy to reason about. It also failed under any real-world condition.&lt;/p&gt;
&lt;p&gt;The announce system in OpenClaw has a 60-second delivery window. If the gateway is under load, if there&amp;rsquo;s a transient network issue, if the announce just gets dropped — your orchestrator is stalled indefinitely. It sits in a waiting state with no way to know whether the worker finished successfully, finished and the announce was lost, or actually crashed. There&amp;rsquo;s no retry mechanism. There&amp;rsquo;s no fallback. The main session has no way to distinguish &amp;ldquo;worker is still running&amp;rdquo; from &amp;ldquo;announce was lost three minutes ago.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I hit this pattern enough times that I started logging it. Roughly 20–30% of announce deliveries were unreliable under normal load. That&amp;rsquo;s not a bug you work around with patience. That&amp;rsquo;s a design assumption that doesn&amp;rsquo;t hold.&lt;/p&gt;
&lt;h2 id="distributed-systems-problems-i-rediscovered-the-hard-way"&gt;Distributed Systems Problems I Rediscovered the Hard Way&lt;/h2&gt;
&lt;p&gt;Building multi-agent systems means independently rediscovering everything microservices engineers figured out in 2015. I ran into all of it.&lt;/p&gt;
&lt;p&gt;Race conditions when two workers write to the same output location. Context loss when an announce arrives out of order and the orchestrator can&amp;rsquo;t reconstruct state. Coordinator overhead — when the orchestrator itself is a sub-agent (depth-2 pattern), it has its own lifecycle problems. In OpenClaw, bug #18043 documents this: depth-2 orchestrators terminate prematurely and lose their announce chains. Meaning: the orchestrator agent finishes before it has processed all results from the workers it spawned. You think you have a pipeline. You actually have a ticking clock.&lt;/p&gt;
&lt;p&gt;The debugging tax was the worst part. When something goes wrong in a sequential announce-based pipeline, you spend time answering: did the worker crash, did the announce drop, did the orchestrator miss it, or is it still running? A failure that takes 30 seconds to occur takes 20 minutes to diagnose.&lt;/p&gt;
&lt;h2 id="the-spawn-all-wait-pattern"&gt;The Spawn-All-Wait Pattern&lt;/h2&gt;
&lt;p&gt;The solution was conceptually simple and felt slightly absurd in practice: spawn all workers in a single turn, and have sequential workers coordinate via the filesystem instead of via messages.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what it looks like. The main session spawns every worker — parallel and sequential — in one shot. Parallel workers start immediately. Sequential workers that need output from a previous worker start by executing a bash wait loop:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;for i in $(seq 1 60); do
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; [ -f /path/to/pipeline-dir/worker-1.md ] &amp;amp;&amp;amp; echo &amp;#39;INPUT_READY&amp;#39; &amp;amp;&amp;amp; break
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; echo &amp;#34;Waiting... $i&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; sleep 5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;done
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That&amp;rsquo;s it. The worker polls every 5 seconds for up to 5 minutes. When the file appears, it reads it and starts working. When it finishes, it writes its own output file. The next worker in the chain finds it the same way.&lt;/p&gt;
&lt;p&gt;The main session&amp;rsquo;s job is reduced to: spawn everything, tell the user &amp;ldquo;pipeline running, N workers active,&amp;rdquo; and wait. No intermediate actions required. No processing announces as triggers. The chain runs itself through the filesystem.&lt;/p&gt;
&lt;p&gt;Worker timeouts are set accordingly: 180 seconds for parallel workers with no dependencies, 360 seconds for sequential workers (5 minutes of possible waiting plus 1 minute of actual work).&lt;/p&gt;
&lt;h2 id="filesystem-handoff-vs-dot-message-based-handoff"&gt;Filesystem Handoff vs. Message-Based Handoff&lt;/h2&gt;
&lt;p&gt;The practical difference comes down to one property: a file either exists or it doesn&amp;rsquo;t. There&amp;rsquo;s no delivery window, no retry budget, no 60-second timeout. If worker-1.md is there, the next worker reads it and continues. If it&amp;rsquo;s not there after 5 minutes, the worker times out and reports TIMEOUT — which is a signal, not a silent failure.&lt;/p&gt;
&lt;p&gt;Compare this to the announce model. An announce either arrives within 60 seconds or it&amp;rsquo;s gone. There&amp;rsquo;s no way to request it again. There&amp;rsquo;s no persistent record that the orchestrator can check on startup. If the main session restarts after a crash, it has no idea what state the pipeline was in. With filesystem handoff, it can check which worker files exist and reconstruct state immediately.&lt;/p&gt;
&lt;p&gt;Debugging is also qualitatively different. With the old model, I&amp;rsquo;d run a pipeline, wait 10 minutes, and then start trying to figure out what happened. With filesystem handoff, I open a terminal, run &lt;code&gt;ls pipeline-tmp/rw-1827/&lt;/code&gt; and immediately see which workers completed. The files are the state. The state is visible.&lt;/p&gt;
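&lt;p&gt;State reconstruction after a restart reduces to checking which worker files exist. A minimal sketch; the &lt;code&gt;worker-N.md&lt;/code&gt; naming follows the pattern above, while the stage list and helper name are illustrative:&lt;/p&gt;

```shell
#!/bin/sh
# Rebuild pipeline state directly from the filesystem: a worker output file
# either exists or it does not, so the files ARE the state.
pipeline_state() {
  dir=$1; shift
  for stage in "$@"; do
    if [ -f "$dir/$stage.md" ]; then
      echo "$stage: done"
    else
      echo "$stage: pending"
    fi
  done
}

# demo: worker-1 finished before the crash; worker-2 and worker-3 did not
dir=$(mktemp -d)
printf 'research notes\n' > "$dir/worker-1.md"
pipeline_state "$dir" worker-1 worker-2 worker-3
```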
&lt;p&gt;There&amp;rsquo;s one real constraint: because of bug #10334 (concurrent announces can deadlock the gateway), I cap parallel workers at 4. This isn&amp;rsquo;t a filesystem limitation — it&amp;rsquo;s a gateway limitation that applies regardless of coordination method. I plan around it.&lt;/p&gt;
&lt;h2 id="the-terminal-worker-and-no-double-send"&gt;The Terminal Worker and No Double Send&lt;/h2&gt;
&lt;p&gt;One worker in every pipeline is different: the terminal worker. Its job is to read all previous worker outputs, synthesize a final result, and deliver it to the user. It&amp;rsquo;s the only worker that&amp;rsquo;s allowed to call the message tool. All other workers write files and stay silent.&lt;/p&gt;
&lt;p&gt;This exists because of the double-send problem. If a worker sends to Matrix and then the main session also sends the same content via announce processing, the user gets the message twice. The rule is simple: one delivery path, enforced by convention. Every worker except the last one is file-only. The last one sends, then writes &lt;code&gt;MATRIX_SENT&lt;/code&gt; in its announce response.&lt;/p&gt;
&lt;p&gt;When the main session sees &lt;code&gt;MATRIX_SENT&lt;/code&gt; in an announce, it does nothing — the terminal worker already delivered. If the announce doesn&amp;rsquo;t contain &lt;code&gt;MATRIX_SENT&lt;/code&gt;, the main session interprets it as a mid-pipeline announce and just notes the progress.&lt;/p&gt;
&lt;p&gt;The heartbeat watchdog covers the edge case: if worker files exist but no sub-agents are currently running and the result hasn&amp;rsquo;t been delivered, the main session synthesizes and sends itself. It&amp;rsquo;s a fallback I&amp;rsquo;ve needed twice. Both times it saved what would have been a completely silent failure.&lt;/p&gt;
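&lt;p&gt;The watchdog decision itself is three checks. A sketch of the logic; the &lt;code&gt;DELIVERED&lt;/code&gt; marker file and the running sub-agent count are assumptions standing in for the real probes, since the post describes the decision, not an API:&lt;/p&gt;

```shell
#!/bin/sh
# Heartbeat watchdog sketch: rescue a pipeline whose workers wrote output but
# whose result was never delivered to the user.
needs_rescue() {
  dir=$1; running=$2
  [ "$running" -eq 0 ] || return 1          # sub-agents still active: wait
  [ -e "$dir/DELIVERED" ] && return 1       # terminal worker already sent
  ls "$dir"/worker-*.md >/dev/null 2>&1     # rescue only if outputs exist
}

dir=$(mktemp -d)
printf 'draft\n' > "$dir/worker-1.md"
needs_rescue "$dir" 0 && echo 'stalled pipeline detected: synthesize and send'
```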
&lt;h2 id="what-i-measured-and-what-still-hurts"&gt;What I Measured and What Still Hurts&lt;/h2&gt;
&lt;p&gt;In a typical write pipeline — researcher, creator, critic running sequentially — the old model took around 6 minutes plus announce latency plus the overhead of me watching and intervening. The new model runs in about 4 minutes with no intervention required. Parallel research phases (two workers running simultaneously) finish in around 2 minutes. Sequential synthesis adds another 2. Total: 4 minutes, unattended.&lt;/p&gt;
&lt;p&gt;Three bugs are still open. #17000 (announce timeout, no retry) is the root cause of everything described here — the workaround works, but the bug remains. #10334 (concurrent announce deadlock) caps parallelism at 4. #18043 (depth-2 orchestrator termination) means I can&amp;rsquo;t delegate orchestration to a sub-agent — the main session has to stay in the loop.&lt;/p&gt;
&lt;p&gt;None of these bugs touch what the pattern can&amp;rsquo;t fix: hallucination rates, token cost per pipeline, or the fact that MCP and A2A protocol standardization are still immature. The pipeline coordinates reliably. What each worker does with its context is a separate problem.&lt;/p&gt;
&lt;h2 id="closing"&gt;Closing&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re building multi-agent pipelines and coordinating through message delivery, you&amp;rsquo;re one network blip away from a stalled orchestrator and a silent failure. The Spawn-All-Wait pattern isn&amp;rsquo;t elegant — a bash polling loop inside an LLM prompt is not how anyone imagined this going. But it&amp;rsquo;s the thing that actually works in production, today, with the infrastructure that exists.&lt;/p&gt;
&lt;p&gt;The files are always there. The announces sometimes aren&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve run into similar issues with LangChain, CrewAI, or your own orchestration layer, I&amp;rsquo;d genuinely like to compare notes. These patterns came from real failures — not from a whitepaper — and they&amp;rsquo;ll keep evolving as the tooling matures. MCP and A2A will change the picture, probably by late 2026. Until then: write to files, not messages.&lt;/p&gt;
&lt;p&gt;M&amp;gt;&lt;/p&gt;</description></item><item><title>Day 5 with Daneel: Headless Browsers, Document Pipelines, and the Numbers So Far</title><link>https://sukany.cz/blog/2026-02-20-day5-browsers-documents-numbers/</link><pubDate>Fri, 20 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-20-day5-browsers-documents-numbers/</guid><description>&lt;p&gt;Day 5 was the most varied day yet. Not in complexity—some earlier days had harder problems—but in range. The work touched browser automation, document tooling, and enough small fixes that by evening I had a reason to look at the numbers.&lt;/p&gt;
&lt;h2 id="running-a-browser-without-a-screen"&gt;Running a Browser Without a Screen&lt;/h2&gt;
&lt;p&gt;One of the things an AI assistant can do is interact with web pages—read content, check status, fill forms. But this particular setup runs on a headless Linux server. No display, no window manager, no user session.&lt;/p&gt;
&lt;p&gt;The obvious approach—install Chrome via Snap—doesn&amp;rsquo;t work from a systemd service. Snap packages assume a user session with D-Bus and a display server. Running headless from a system service hits permission errors before Chrome even starts.&lt;/p&gt;
&lt;p&gt;The fix: install Chrome directly from Google&amp;rsquo;s .deb repository, bypassing Snap entirely. Then wrap it in a dedicated systemd service that launches Chrome with remote debugging enabled on a fixed port. The AI framework connects via Chrome DevTools Protocol in attach-only mode—it doesn&amp;rsquo;t launch Chrome, it connects to the already-running instance.&lt;/p&gt;
&lt;p&gt;Three components, each solving one problem: the .deb package avoids Snap&amp;rsquo;s session requirements, the systemd service ensures Chrome survives reboots and can be managed like any other daemon, and the attach-only configuration means the framework doesn&amp;rsquo;t need to manage browser lifecycle.&lt;/p&gt;
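&lt;p&gt;A sketch of what such a unit could look like. The flags are standard Chrome options; the paths, user, and port are placeholders rather than the actual configuration:&lt;/p&gt;

```text
# /etc/systemd/system/chrome-headless.service (illustrative sketch)
[Unit]
Description=Headless Chrome with CDP remote debugging
After=network.target

[Service]
User=chrome
ExecStart=/usr/bin/google-chrome-stable \
    --headless=new --disable-gpu \
    --remote-debugging-port=9222 \
    --user-data-dir=/var/lib/chrome-headless
Restart=on-failure

[Install]
WantedBy=multi-user.target
```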
&lt;p&gt;The result is invisible when it works. Pages load, content is extracted, the browser runs quietly in the background consuming minimal resources. The interesting part was how many things had to be wrong before the right approach became obvious.&lt;/p&gt;
&lt;h2 id="from-org-files-to-printed-documents"&gt;From Org Files to Printed Documents&lt;/h2&gt;
&lt;p&gt;A separate thread involved document generation. The workflow: write structured content in Emacs Org mode, export to LaTeX, compile to PDF. The goal was a reusable template that produces clean, professional documents without manual formatting.&lt;/p&gt;
&lt;p&gt;The template handles the things that usually require tweaking: Czech language support with proper hyphenation, tables that span pages without breaking layout, consistent typography, a styled title page. The technical details—font selection, column width calculation, alternating row colors—are defined once in the template and applied automatically during export.&lt;/p&gt;
&lt;p&gt;What made this worth the setup time is the authoring experience afterward. Write content in a plain text file with minimal markup. Run one export command. Get a formatted PDF. No intermediate steps, no manual adjustments, no &amp;ldquo;fix the table on page 3&amp;rdquo; cycles.&lt;/p&gt;
&lt;p&gt;An Elisp hook handles the part that would otherwise require per-document boilerplate: detecting tables in the document and automatically adding the correct LaTeX attributes based on column count. The author doesn&amp;rsquo;t need to think about LaTeX at all.&lt;/p&gt;
&lt;h2 id="five-days-in-numbers"&gt;Five Days in Numbers&lt;/h2&gt;
&lt;p&gt;Day 5 felt like a good point to measure what&amp;rsquo;s accumulated.&lt;/p&gt;
&lt;p&gt;The memory system—the files that let the assistant maintain context across restarts—has grown to over 190 KB across 26 files. That includes daily operational logs, architectural analysis documents, per-session summaries, and the curated long-term memory file that gets reviewed and pruned every three days.&lt;/p&gt;
&lt;p&gt;The workspace contains 13 custom scripts covering everything from calendar integration to email processing to automated backups. Each one exists because a manual workflow was repeated enough times to justify automation.&lt;/p&gt;
&lt;p&gt;There are 24 git commits in the workspace repository over five days—roughly five per day, tracking configuration changes, new scripts, and memory updates.&lt;/p&gt;
&lt;p&gt;The cron system runs scheduled jobs: morning briefings, email monitoring, news digests, weekly reviews, infrastructure checks. Each job was added incrementally as a pattern emerged—something done manually twice became a candidate for automation on the third occurrence.&lt;/p&gt;
&lt;p&gt;68 session logs exist from this period. Each represents a conversation or automated task. Some are brief status checks; others span hours of technical work. The session architecture evolved during these five days too—from a single shared session to isolated per-channel sessions, each maintaining its own context.&lt;/p&gt;
&lt;h2 id="what-the-numbers-don-t-show"&gt;What the Numbers Don&amp;rsquo;t Show&lt;/h2&gt;
&lt;p&gt;The raw counts are less interesting than what they represent: five days of iterative refinement where each day&amp;rsquo;s problems inform the next day&amp;rsquo;s automation.&lt;/p&gt;
&lt;p&gt;The memory system exists because the assistant forgot things after restarts. The backup scripts exist because I asked &amp;ldquo;what happens if this machine dies?&amp;rdquo; The browser automation exists because a web interaction failed and the root cause was architectural, not a bug.&lt;/p&gt;
&lt;p&gt;None of this was planned on day one. The roadmap was: set up the assistant, give it access, see what happens. The infrastructure that exists now is the answer to &amp;ldquo;what happens&amp;rdquo;—an accumulation of solved problems, each one making the next problem easier to solve.&lt;/p&gt;
&lt;p&gt;Five days is not enough to draw conclusions about long-term value. It&amp;rsquo;s enough to see the pattern: capability compounds. Each tool built, each script written, each memory file maintained makes the next task faster. Whether that curve continues or plateaus is the question for the next five days.&lt;/p&gt;
&lt;p&gt;M&amp;gt;&lt;/p&gt;</description></item><item><title>Rebuilding a Tool in Four Hours: What the AI Agent Actually Did</title><link>https://sukany.cz/blog/2026-02-20-scenar-creator-ai-rebuild/</link><pubDate>Fri, 20 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-20-scenar-creator-ai-rebuild/</guid><description>&lt;p&gt;I have a small internal tool called Scénář Creator. It generates timetables for experiential courses — you know the kind: weekend trips where you have 14 programme blocks across three days and someone has to make sure nothing overlaps. I built version one in November 2025. It was a CGI Python app running on Apache, backed by Excel.&lt;/p&gt;
&lt;p&gt;Yesterday I asked Daneel to rebuild it. Four hours later, version 4.7 was running in production. Here&amp;rsquo;s exactly what happened.&lt;/p&gt;
&lt;h2 id="the-starting-point"&gt;The Starting Point&lt;/h2&gt;
&lt;p&gt;The original tool was functional but ugly in the developer sense. Python CGI means no proper request lifecycle, no validation, and Apache configuration that nobody wants to debug. Excel meant openpyxl and pandas as dependencies for what is essentially a colour-coded grid. The UI had a rudimentary inline editor but nothing you&amp;rsquo;d want to actually use.&lt;/p&gt;
&lt;p&gt;My requirements for the new version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No Excel, no pandas, no openpyxl — anywhere&lt;/li&gt;
&lt;li&gt;JSON import/export with a sample template&lt;/li&gt;
&lt;li&gt;PDF output, always exactly one A4 landscape page&lt;/li&gt;
&lt;li&gt;Drag-and-drop canvas editor where blocks can be moved in time and between days&lt;/li&gt;
&lt;li&gt;Czech day names in both the editor and the PDF&lt;/li&gt;
&lt;li&gt;Documentation built into the app itself&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-pipeline-command"&gt;The Pipeline Command&lt;/h2&gt;
&lt;p&gt;I typed &lt;code&gt;/pipeline code&lt;/code&gt; in Matrix followed by the requirements. This triggers a specific workflow I configured for Daneel: instead of answering directly, it spawns a chain of sub-agents.&lt;/p&gt;
&lt;p&gt;What that looks like internally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Researcher sub-agent&lt;/strong&gt; — reads the existing codebase (CGI scripts, Dockerfile, rke2 deployment manifest), queries documentation for FastAPI, ReportLab, and interact.js, produces a technology brief&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architect sub-agent&lt;/strong&gt; — takes the brief and the existing code, designs a new architecture, outputs a structured document marked &amp;ldquo;ARCHITEKTURA PRO SCHVÁLENÍ&amp;rdquo; (Architecture for Approval)&lt;/li&gt;
&lt;li&gt;Main agent presents the architecture to me. I type &amp;ldquo;schvaluji&amp;rdquo; (I approve).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Coder sub-agent&lt;/strong&gt; — implements the full application based on the approved architecture&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each sub-agent is an independent session. They don&amp;rsquo;t share memory. They communicate through their outputs, which the orchestrator passes forward as context.&lt;/p&gt;
&lt;h2 id="the-context-overflow"&gt;The Context Overflow&lt;/h2&gt;
&lt;p&gt;About 40 minutes in, the orchestrator hit a context limit. The session died mid-flight. I got a message: &amp;ldquo;Context overflow: prompt too large for the model.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This is a real failure mode with multi-agent pipelines. The orchestrator had been accumulating all the research, architecture, and partial implementation output in a single context window. It eventually exceeded what Claude Sonnet can hold.&lt;/p&gt;
&lt;p&gt;When I opened a new session (&lt;code&gt;/new&lt;/code&gt;), Daneel&amp;rsquo;s first action was to run &lt;code&gt;memory_search&lt;/code&gt; on the session logs from the crashed session. The key fragments were there:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The architecture document (partially recovered)&lt;/li&gt;
&lt;li&gt;The approved tech stack: FastAPI + Pydantic, ReportLab Canvas API, interact.js from CDN, vanilla JS frontend&lt;/li&gt;
&lt;li&gt;The deployment infrastructure: podman on daneel.sukany.cz, Gitea registry, kubectl via SSH to infra01&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then Daneel did something worth noting: it checked the &lt;strong&gt;live cluster&lt;/strong&gt; before assuming the background agents had implemented anything correctly. The health endpoint returned &lt;code&gt;{&amp;quot;status&amp;quot;: &amp;quot;ok&amp;quot;, &amp;quot;version&amp;quot;: &amp;quot;2.0&amp;quot;}&lt;/code&gt;. The background agents had claimed v3.0 was deployed. It wasn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;This is a lesson I keep relearning. Check the actual state of the system, not the reported state.&lt;/p&gt;
&lt;h2 id="what-implementation-actually-means"&gt;What &amp;ldquo;Implementation&amp;rdquo; Actually Means&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s what the agent concretely did, in order:&lt;/p&gt;
&lt;h3 id="read-the-existing-codebase"&gt;Read the existing codebase&lt;/h3&gt;
&lt;p&gt;Every relevant file: the CGI scripts, the Pydantic models, the Dockerfile, the rke2 deployment YAML. Not a summary — the actual file contents, via the &lt;code&gt;read&lt;/code&gt; tool. About 12 files.&lt;/p&gt;
&lt;h3 id="wrote-the-new-application"&gt;Wrote the new application&lt;/h3&gt;
&lt;p&gt;Six Python modules (&lt;code&gt;main.py&lt;/code&gt;, &lt;code&gt;config.py&lt;/code&gt;, &lt;code&gt;models/event.py&lt;/code&gt;, &lt;code&gt;api/scenario.py&lt;/code&gt;, &lt;code&gt;api/pdf.py&lt;/code&gt;, &lt;code&gt;core/pdf_generator.py&lt;/code&gt;) plus four JavaScript files (&lt;code&gt;canvas.js&lt;/code&gt;, &lt;code&gt;app.js&lt;/code&gt;, &lt;code&gt;api.js&lt;/code&gt;, &lt;code&gt;export.js&lt;/code&gt;), CSS, HTML, and a sample JSON fixture. Each file was written with &lt;code&gt;write&lt;/code&gt; (full file) or &lt;code&gt;edit&lt;/code&gt; (surgical replacement of a specific text block).&lt;/p&gt;
&lt;h3 id="ran-tests-locally"&gt;Ran tests locally&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python3 -m pytest tests/ -v
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;33 tests at v4.0, growing to 37 by v4.7. Every deploy was preceded by a clean test run.&lt;/p&gt;
&lt;h3 id="built-the-docker-image"&gt;Built the Docker image&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;podman build --format docker \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -t &amp;lt;private-registry&amp;gt;/martin/scenar-creator:latest .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;--format docker&lt;/code&gt; flag is required for RKE2&amp;rsquo;s containerd runtime. Without it, the manifest format is OCI, which a standard Kubernetes deployment can&amp;rsquo;t pull directly.&lt;/p&gt;
&lt;h3 id="pushed-to-the-private-gitea-registry"&gt;Pushed to the private Gitea registry&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# credentials loaded from environment
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;podman push &amp;lt;private-registry&amp;gt;/martin/scenar-creator:latest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Credentials come from environment variables, not hardcoded.&lt;/p&gt;
&lt;h3 id="deployed-via-ssh"&gt;Deployed via SSH&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ssh root@infra01.sukany.cz \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;kubectl -n scenar rollout restart deployment/scenar &amp;amp;&amp;amp; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; kubectl -n scenar rollout status deployment/scenar --timeout=60s&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;kubectl&lt;/code&gt; is not available on the machine Daneel runs on. It&amp;rsquo;s only on infra01. Direct SSH as root is the access pattern that works; daneel@ access is denied on that host.&lt;/p&gt;
&lt;h3 id="verified-the-deployment"&gt;Verified the deployment&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;curl -s https://scenar.apps.sukany.cz/api/health
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{&amp;#34;status&amp;#34;:&amp;#34;ok&amp;#34;,&amp;#34;version&amp;#34;:&amp;#34;4.4.0&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This ran after every deploy. Not assumed, verified.&lt;/p&gt;
&lt;h2 id="the-bugs"&gt;The Bugs&lt;/h2&gt;
&lt;p&gt;The interesting part is what didn&amp;rsquo;t work the first time.&lt;/p&gt;
&lt;h3 id="cross-day-drag-three-iterations"&gt;Cross-day drag — three iterations&lt;/h3&gt;
&lt;p&gt;The requirement was that programme blocks could be dragged between days, not just along the time axis within a single day. The first implementation used interact.js for both horizontal (time) and vertical (day) movement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First attempt (v4.3):&lt;/strong&gt; Added Y-axis movement to interact.js with &lt;code&gt;translateY&lt;/code&gt; on the block element. The block disappeared during drag because the block lives inside a &lt;code&gt;.day-timeline&lt;/code&gt; container with &lt;code&gt;overflow: hidden&lt;/code&gt;. A block translated outside its container gets clipped.&lt;/p&gt;
&lt;p&gt;The fix attempt was to add &lt;code&gt;overflow: visible&lt;/code&gt; to the containers during drag using a CSS class toggle. It didn&amp;rsquo;t fully work because &lt;code&gt;.canvas-scroll-area&lt;/code&gt; has &lt;code&gt;overflow: auto&lt;/code&gt;, which creates a new stacking context and clips descendants regardless.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Second attempt (v4.5):&lt;/strong&gt; Replaced interact.js dragging with native pointer events. Created a floating ghost element on &lt;code&gt;document.body&lt;/code&gt; (no stacking context issues). Moved the ghost freely during drag. Used &lt;code&gt;document.elementFromPoint()&lt;/code&gt; on &lt;code&gt;pointerup&lt;/code&gt; to determine which &lt;code&gt;.day-timeline&lt;/code&gt; the user dropped on.&lt;/p&gt;
&lt;p&gt;This almost worked. The ghost moved correctly. But &lt;code&gt;elementFromPoint&lt;/code&gt; was unreliable — sometimes it returned the ghost itself (even with &lt;code&gt;pointer-events: none&lt;/code&gt;), sometimes it returned the wrong element.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Third attempt (v4.6):&lt;/strong&gt; Two changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Call &lt;code&gt;el.releasePointerCapture(e.pointerId)&lt;/code&gt; at drag start. Without this, the browser implicitly captures the pointer on the element that received &lt;code&gt;pointerdown&lt;/code&gt;. On some platforms, this affects which element receives subsequent events and can block the ghost&amp;rsquo;s hit-testing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Replace &lt;code&gt;elementFromPoint&lt;/code&gt; entirely. At drag start, capture &lt;code&gt;getBoundingClientRect()&lt;/code&gt; for every &lt;code&gt;.day-timeline&lt;/code&gt; and store them. On &lt;code&gt;pointerup&lt;/code&gt;, compare &lt;code&gt;ev.clientY&lt;/code&gt; against the stored rectangles. No DOM querying during the drop — just a loop over six numbers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This worked. Simple coordinate comparison, no browser API surprises.&lt;/p&gt;
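&lt;p&gt;The working drop logic reduces to a pure coordinate comparison. A minimal Python sketch of the idea (the names and the six-day layout are illustrative, not the actual &lt;code&gt;canvas.js&lt;/code&gt; code):&lt;/p&gt;

```python
def find_drop_day(client_y, day_rects):
    """Return the index of the day timeline whose vertical band contains
    the pointer's Y coordinate, or None if the drop missed all days."""
    for index, (top, bottom) in enumerate(day_rects):
        if top <= client_y < bottom:
            return index
    return None

# (top, bottom) pairs captured via getBoundingClientRect() at drag start;
# six hypothetical day timelines stacked vertically:
rects = [(0, 100), (100, 200), (200, 300), (300, 400), (400, 500), (500, 600)]
```

&lt;p&gt;Nothing queries the DOM on &lt;code&gt;pointerup&lt;/code&gt;; the loop only compares numbers, which is why it sidesteps the &lt;code&gt;elementFromPoint&lt;/code&gt; surprises.&lt;/p&gt;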
&lt;h3 id="czech-diacritics-in-pdf"&gt;Czech diacritics in PDF&lt;/h3&gt;
&lt;p&gt;ReportLab&amp;rsquo;s built-in Helvetica doesn&amp;rsquo;t support Czech characters. &amp;ldquo;Pondělí&amp;rdquo; became garbage bytes.&lt;/p&gt;
&lt;p&gt;Fix: added &lt;code&gt;fonts-liberation&lt;/code&gt; to the Dockerfile (provides LiberationSans TTF, a metrically compatible Helvetica replacement with full Latin Extended-A coverage). Registered the font at module load:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pdfmetrics.registerFont(TTFont(&amp;#39;LiberationSans&amp;#39;, &amp;#39;/usr/share/fonts/...&amp;#39;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Fallback to Helvetica if the font file isn&amp;rsquo;t found, so local development without the package still works.&lt;/p&gt;
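&lt;p&gt;The load-time fallback can be sketched as a small helper. This is an illustration, not the module&amp;rsquo;s actual code: &lt;code&gt;choose_font&lt;/code&gt; is a hypothetical name, the TTF path keeps the post&amp;rsquo;s elision, and the ReportLab registration is shown commented out as described above:&lt;/p&gt;

```python
import os

# Hypothetical path; the real Dockerfile installs fonts-liberation,
# which places the TTF somewhere under /usr/share/fonts.
LIBERATION_TTF = "/usr/share/fonts/.../LiberationSans-Regular.ttf"

def choose_font(ttf_path, exists=os.path.exists):
    """Pick LiberationSans when its TTF is present; otherwise fall back
    to ReportLab's built-in Helvetica so local dev keeps working."""
    if exists(ttf_path):
        return "LiberationSans"
    return "Helvetica"

# At module load (requires reportlab):
# if choose_font(LIBERATION_TTF) == "LiberationSans":
#     pdfmetrics.registerFont(TTFont("LiberationSans", LIBERATION_TTF))
```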
&lt;h3 id="am-pm-time-display"&gt;AM/PM time display&lt;/h3&gt;
&lt;p&gt;HTML &lt;code&gt;&amp;lt;input type=&amp;quot;time&amp;quot;&amp;gt;&lt;/code&gt; displays in 12-hour AM/PM format on macOS/Windows browsers with a US locale, even when the page declares &lt;code&gt;lang=&amp;quot;cs&amp;quot;&lt;/code&gt;. The &lt;code&gt;.value&lt;/code&gt; property always returns 24-hour HH:MM (that part works), but the visual display was wrong.&lt;/p&gt;
&lt;p&gt;Fix: replaced &lt;code&gt;type=&amp;quot;time&amp;quot;&lt;/code&gt; with &lt;code&gt;type=&amp;quot;text&amp;quot;&lt;/code&gt;, added &lt;code&gt;maxlength=&amp;quot;5&amp;quot;&lt;/code&gt;, and wrote an auto-formatter that inserts &lt;code&gt;:&lt;/code&gt; after the second digit. Validates on blur. Stores values as HH:MM strings, which is what the rest of the code already expected.&lt;/p&gt;
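&lt;p&gt;The formatter and validator are simple enough to sketch language-agnostically. A Python rendering of the described rules (function names are illustrative; the real code lives in the frontend JavaScript):&lt;/p&gt;

```python
import re

def autoformat_time(raw):
    """Strip non-digits and insert ':' after the second digit,
    mimicking the text-input auto-formatter."""
    digits = re.sub(r"\D", "", raw)[:4]
    if len(digits) <= 2:
        return digits
    return digits[:2] + ":" + digits[2:]

def is_valid_hhmm(value):
    """Blur-time validation: accept 00:00 through 23:59."""
    return re.fullmatch(r"([01]\d|2[0-3]):([0-5]\d)", value) is not None
```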
&lt;h3 id="pdf-text-overflow-in-narrow-blocks"&gt;PDF text overflow in narrow blocks&lt;/h3&gt;
&lt;p&gt;Short programme blocks (15–30 minutes) have very little horizontal space. The block title would overflow the clipping path and just get cut off mid-character.&lt;/p&gt;
&lt;p&gt;Fix: added a &lt;code&gt;fit_text()&lt;/code&gt; function in the PDF generator. It uses ReportLab&amp;rsquo;s &lt;code&gt;stringWidth()&lt;/code&gt; to binary-search the longest string that fits in the available width, then appends &lt;code&gt;…&lt;/code&gt; if truncation occurred.&lt;/p&gt;
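&lt;p&gt;The binary search is easy to sketch. Here &lt;code&gt;string_width&lt;/code&gt; stands in for ReportLab&amp;rsquo;s &lt;code&gt;stringWidth&lt;/code&gt; (which also takes a font name and size); treat this as an illustration of the truncation logic, not the generator&amp;rsquo;s actual code:&lt;/p&gt;

```python
def fit_text(text, max_width, string_width):
    """Return the longest prefix of text that fits in max_width,
    with an ellipsis appended when truncation occurred."""
    if string_width(text) <= max_width:
        return text
    lo, hi = 0, len(text)  # invariant: prefix of length lo (plus ellipsis) fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if string_width(text[:mid] + "…") <= max_width:
            lo = mid
        else:
            hi = mid - 1
    return text[:lo] + "…"
```

&lt;p&gt;Each probe measures a candidate prefix plus the ellipsis, so the returned string, ellipsis included, is guaranteed to fit.&lt;/p&gt;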
&lt;p&gt;In the canvas editor, blocks narrower than 72px now hide the time label; blocks narrower than 28px hide all text and rely on a &lt;code&gt;title&lt;/code&gt; tooltip attribute.&lt;/p&gt;
&lt;h2 id="the-deployment-count"&gt;The Deployment Count&lt;/h2&gt;
&lt;p&gt;15 deploys between 16:00 and 20:00 CET. Each one: build (~30s from cache), push (~15s for changed layers), &lt;code&gt;rollout restart&lt;/code&gt; (~25s for pod replacement), &lt;code&gt;curl&lt;/code&gt; to verify. About 90 seconds per cycle, plus whatever time was spent writing the code.&lt;/p&gt;
&lt;p&gt;The Kubernetes deployment uses &lt;code&gt;imagePullPolicy: Always&lt;/code&gt; and the &lt;code&gt;:latest&lt;/code&gt; tag, so every &lt;code&gt;rollout restart&lt;/code&gt; pulls the freshest image. No manifest changes needed between iterations.&lt;/p&gt;
&lt;h2 id="what-the-agent-didn-t-do"&gt;What the Agent Didn&amp;rsquo;t Do&lt;/h2&gt;
&lt;p&gt;No browser interaction. Daneel can control a browser but I didn&amp;rsquo;t ask for that and it wasn&amp;rsquo;t needed — the verification was just an API health check.&lt;/p&gt;
&lt;p&gt;No speculative changes. Every code change was in response to a concrete requirement or a confirmed bug. Daneel didn&amp;rsquo;t add features I didn&amp;rsquo;t ask for.&lt;/p&gt;
&lt;p&gt;No silent failures. When a deploy failed or a test broke, it stopped and reported. It didn&amp;rsquo;t try to paper over errors or push anyway.&lt;/p&gt;
&lt;h2 id="observations"&gt;Observations&lt;/h2&gt;
&lt;p&gt;The most expensive bug was the cross-day drag, not because it was technically complex but because it required three separate hypotheses, three implementations, and three deploys to find the actual failure mode. The first two were reasonable guesses that happened to be wrong.&lt;/p&gt;
&lt;p&gt;The context overflow in the pipeline wasn&amp;rsquo;t catastrophic because the memory system worked. The session logs from the crashed orchestrator were searchable. The critical facts — approved tech stack, deployment procedure, live cluster state — were recoverable. This is the point of building memory infrastructure before you need it.&lt;/p&gt;
&lt;p&gt;The total elapsed time from &lt;code&gt;/pipeline code&lt;/code&gt; to &amp;ldquo;considered resolved&amp;rdquo; was about four hours. The application went from CGI+Excel to FastAPI+JSON+drag-and-drop canvas in that window. That&amp;rsquo;s not a claim about AI replacing developers. It&amp;rsquo;s a data point about what changes when you have an agent that can write code, run it, push it, and verify it in the same loop you&amp;rsquo;d use as a human developer — just without context switching or fatigue.&lt;/p&gt;
&lt;p&gt;M&amp;gt;&lt;/p&gt;</description></item><item><title>Tuning the Search: What the Parameters Actually Do</title><link>https://sukany.cz/blog/2026-02-18-memory-search-tuning/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-18-memory-search-tuning/</guid><description>&lt;p&gt;The &lt;a href="https://sukany.cz/blog/2026-02-17-memory-search-optimization/"&gt;previous post&lt;/a&gt; covered the basic setup: hybrid search enabled, &lt;code&gt;minScore&lt;/code&gt; lowered to 0.25, OpenAI embeddings. That got retrieval working. This post is about what I changed after that—the parameters that didn&amp;rsquo;t exist in the simplified snippet.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the actual configuration Daneel runs now:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;memorySearch&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;provider&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;openai&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;text-embedding-3-small&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;sources&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;sessions&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;chunking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;tokens&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;overlap&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;sync&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;onSessionStart&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;onSearch&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;watch&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;query&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;maxResults&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;minScore&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;hybrid&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;vectorWeight&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;textWeight&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;candidateMultiplier&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;mmr&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;lambda&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;temporalDecay&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;halfLifeDays&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;What each parameter does and why it&amp;rsquo;s set the way it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;sources: [&amp;quot;memory&amp;quot;, &amp;quot;sessions&amp;quot;]&lt;/code&gt; — Search both memory files (&lt;code&gt;memory/*.md&lt;/code&gt;) and session transcripts. Without sessions, Daneel can&amp;rsquo;t retrieve context from past conversations that didn&amp;rsquo;t make it into daily logs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;chunking.tokens: 400, overlap: 80&lt;/code&gt; — Each file is split into 400-token chunks with 80-token overlap between adjacent chunks. The overlap prevents a concept that spans a chunk boundary from becoming unsearchable. 20% overlap is conservative but safe for diary-style logs where context carries across paragraphs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;vectorWeight: 0.7, textWeight: 0.3&lt;/code&gt; — Hybrid scoring: 70% vector similarity, 30% BM25 keyword match. Vector search handles semantic intent (&amp;ldquo;how do I handle encoding in email?&amp;rdquo;); BM25 handles exact terms (&amp;ldquo;himalaya template send&amp;rdquo;). Neither alone is sufficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;candidateMultiplier: 4&lt;/code&gt; — Before returning results, retrieve 4× more candidates than &lt;code&gt;maxResults&lt;/code&gt; (so 80 candidates for 20 results), then rerank. More candidates means better reranking quality; the cost is negligible since this happens in SQLite.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mmr.enabled: true, lambda: 0.7&lt;/code&gt; — Maximal Marginal Relevance reranking. Without it, results cluster: you ask about email and get five near-identical chunks from the same file. MMR trades some relevance (&lt;code&gt;lambda&lt;/code&gt;) for diversity. At 0.7, relevance still dominates but repeated near-duplicates get pushed down.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;temporalDecay.halfLifeDays: 60&lt;/code&gt; — Recent memories rank higher than old ones. A memory 60 days old gets half the retrieval weight of a new one. Based on research suggesting ~30 days as a cognitive science baseline; I set it conservatively at 60 because Daneel is three days old and I don&amp;rsquo;t want early context to fade too fast. I&amp;rsquo;ll revisit at 30 days.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
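&lt;p&gt;The two reranking knobs, MMR and temporal decay, can be sketched in a few lines of Python. This is an illustration of what the parameters control, not OpenClaw&amp;rsquo;s implementation:&lt;/p&gt;

```python
def temporal_decay(score, age_days, half_life_days=60):
    """Halve a chunk's retrieval weight every half_life_days."""
    return score * 0.5 ** (age_days / half_life_days)

def mmr_rerank(candidates, rel, sim, lam=0.7, k=5):
    """Greedy Maximal Marginal Relevance: repeatedly pick the candidate
    that balances relevance to the query (rel) against similarity to
    results already selected (sim), weighted by lambda."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: lam * rel[c]
            - (1 - lam) * max((sim[c][s] for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected
```

&lt;p&gt;With &lt;code&gt;lambda&lt;/code&gt; at 0.7, a near-duplicate of an already-selected chunk pays a penalty that can push a less similar but still relevant chunk above it.&lt;/p&gt;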
&lt;h2 id="what-it-solves"&gt;What It Solves&lt;/h2&gt;
&lt;p&gt;Without MMR: searching &amp;ldquo;send email&amp;rdquo; returned five chunks from the same &lt;code&gt;TOOLS.md&lt;/code&gt; section. Relevant, but redundant.&lt;/p&gt;
&lt;p&gt;With MMR + multi-source: the same query now returns the credential setup, a session where we debugged encoding, and the DKIM warning from a different log. Three different useful angles instead of five copies of the same text.&lt;/p&gt;
&lt;p&gt;The configuration isn&amp;rsquo;t revolutionary. These are standard IR techniques—BM25, MMR, temporal decay—applied to agent memory files. What makes it work is that all three address different failure modes: BM25 handles exact terms, MMR handles result clustering, temporal decay handles stale context. Each one earns its overhead.&lt;/p&gt;</description></item><item><title>Teaching Daneel to Search: From Local Models to Hybrid Embeddings</title><link>https://sukany.cz/blog/2026-02-17-memory-search-optimization/</link><pubDate>Tue, 17 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-17-memory-search-optimization/</guid><description>&lt;p&gt;The &lt;a href="https://sukany.cz/blog/2026-02-17-ai-memory-architecture/"&gt;memory architecture&lt;/a&gt; was in place. Three tiers, clear boundaries, maintenance cycles. But memory you can&amp;rsquo;t search is memory you don&amp;rsquo;t have.&lt;/p&gt;
&lt;p&gt;This post is about the retrieval side: how Daneel finds things in its own files, what I tested, and what actually works.&lt;/p&gt;
&lt;h2 id="the-starting-point"&gt;The Starting Point&lt;/h2&gt;
&lt;p&gt;OpenClaw&amp;rsquo;s default memory search uses OpenAI&amp;rsquo;s &lt;code&gt;text-embedding-3-small&lt;/code&gt; model. It converts text chunks into 1536-dimensional vectors, stores them in SQLite, and returns semantically similar results when queried.&lt;/p&gt;
&lt;p&gt;Out of the box, it worked—sort of. The default &lt;code&gt;minScore&lt;/code&gt; threshold (~0.45) was too aggressive. Queries that should have returned results came back empty. Keyword searches worked poorly because the engine was vector-only. No hybrid mode.&lt;/p&gt;
&lt;p&gt;I had 17 memory files, 84 text chunks. Not a lot. But if Daneel can&amp;rsquo;t find &amp;ldquo;what&amp;rsquo;s the Matrix room for email notifications&amp;rdquo; in its own files, the architecture doesn&amp;rsquo;t matter.&lt;/p&gt;
&lt;h2 id="what-i-tested"&gt;What I Tested&lt;/h2&gt;
&lt;p&gt;I built a benchmark: 6 queries covering different retrieval patterns.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&amp;ldquo;email credentials himalaya configuration&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Keyword, mixed language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&amp;ldquo;web privacy violation&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Keyword, English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&amp;ldquo;Martin calendar workflow&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Mixed intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&amp;ldquo;gateway restart session context&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Compound keyword&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&amp;ldquo;how to send email with diacritics&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Semantic (no exact match in docs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&amp;ldquo;what is the matrix room for email notifications&amp;rdquo;&lt;/td&gt;
&lt;td&gt;Semantic question&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Every candidate got the same 6 queries. Results compared by hit count and relevance.&lt;/p&gt;
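&lt;p&gt;The harness itself was nothing fancy. A sketch of its shape, where &lt;code&gt;search&lt;/code&gt; is a stand-in callable for whichever engine is under test (hypothetical signature, returning a list of hits):&lt;/p&gt;

```python
def run_benchmark(queries, search, min_hits=1):
    """Run each query through a search callable and report how many
    queries produced at least min_hits results."""
    outcomes = {q: len(search(q)) for q in queries}
    useful = sum(1 for n in outcomes.values() if n >= min_hits)
    return useful, outcomes
```

&lt;p&gt;Relevance still has to be judged by eye; the hit count only catches the &amp;ldquo;came back empty&amp;rdquo; failure mode.&lt;/p&gt;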
&lt;h3 id="qmd-local-hybrid-search"&gt;QMD: Local Hybrid Search&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/tobi/qmd"&gt;QMD&lt;/a&gt; is a local sidecar that combines BM25 keyword search, vector embeddings via GGUF models, and neural reranking. Zero API costs—everything runs on the machine.&lt;/p&gt;
&lt;p&gt;The concept is exactly what I wanted: hybrid search without external dependencies.&lt;/p&gt;
&lt;p&gt;Installation went smoothly. It indexed 34 documents into 92 vector chunks using a 300MB embedding model (&lt;code&gt;embeddinggemma-300M&lt;/code&gt;). BM25 keyword search worked immediately.&lt;/p&gt;
&lt;p&gt;Then I tried vector search.&lt;/p&gt;
&lt;p&gt;QMD&amp;rsquo;s vector mode (&lt;code&gt;vsearch&lt;/code&gt;) depends on &lt;code&gt;llama.cpp&lt;/code&gt;, which compiles native code at install time. On a server without a GPU, it tried to build CUDA bindings, failed, fell back to CPU, and either timed out or crashed with SIGKILL. The embedding phase alone took 36 seconds on CPU—when it worked at all.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benchmark result: 2/6 queries returned useful results.&lt;/strong&gt; BM25-only mode caught the keyword matches but missed everything semantic.&lt;/p&gt;
&lt;p&gt;I could have kept QMD for keyword search only. But running a separate process with 300MB of model files for something BM25 in SQLite already handles didn&amp;rsquo;t make sense.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Verdict: uninstalled.&lt;/strong&gt; QMD is a solid project. On a machine with a GPU, it would be a different story. On a 2-core VPS without CUDA, it&amp;rsquo;s not practical.&lt;/p&gt;
&lt;h3 id="openclaw-builtin-properly-configured"&gt;OpenClaw Builtin: Properly Configured&lt;/h3&gt;
&lt;p&gt;Same engine as before, but with three changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Hybrid mode enabled&lt;/strong&gt; — BM25 keyword search + vector similarity, combined ranking&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;minScore&lt;/code&gt; lowered to 0.25&lt;/strong&gt; — default 0.45 filtered out too many valid results&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File watching enabled&lt;/strong&gt; — index updates automatically when files change&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Benchmark result: 5/6 queries returned relevant results.&lt;/strong&gt; The one miss (query 5, &amp;ldquo;how to send email with diacritics&amp;rdquo;) is expected—that information lives in &lt;code&gt;TOOLS.md&lt;/code&gt;, which is loaded as system prompt context and not indexed as searchable memory.&lt;/p&gt;
&lt;p&gt;The hybrid approach is key. Pure vector search misses exact keyword matches. Pure BM25 misses semantic intent. Combined, they cover each other&amp;rsquo;s blind spots.&lt;/p&gt;
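&lt;p&gt;At its core the combination is a weighted sum over normalized scores. A minimal sketch (the 0.7/0.3 split is illustrative here; OpenClaw&amp;rsquo;s internal normalization may differ):&lt;/p&gt;

```python
def hybrid_score(vector_sim, bm25_norm, vector_weight=0.7, text_weight=0.3):
    """Blend semantic similarity with keyword match strength. A chunk
    that scores near zero on one signal can still rank via the other."""
    return vector_weight * vector_sim + text_weight * bm25_norm

# An exact keyword hit with weak semantic similarity still surfaces:
# pure vector would score it 0.2; the blend lifts it to 0.41.
```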
&lt;h2 id="configuration"&gt;Configuration&lt;/h2&gt;
&lt;p&gt;For anyone running OpenClaw who wants to replicate this, here&amp;rsquo;s what goes into &lt;code&gt;openclaw.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Memory backend:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;backend&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;builtin&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Search configuration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;agents&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;defaults&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;memorySearch&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;provider&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;openai&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;sources&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;query&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;minScore&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;hybrid&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;sync&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;onSessionStart&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;onSearch&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;watch&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;provider&lt;/code&gt; field tells OpenClaw which configured model provider to use for embeddings; with OpenAI selected, it picks &lt;code&gt;text-embedding-3-small&lt;/code&gt; automatically. The only prerequisite is an OpenAI provider set up under &lt;code&gt;models.providers.openai&lt;/code&gt; with a valid API key.&lt;/p&gt;
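&lt;p&gt;For reference, a minimal sketch of that provider block; the &lt;code&gt;apiKey&lt;/code&gt; and &lt;code&gt;baseUrl&lt;/code&gt; field names are assumptions, so check them against your OpenClaw version:&lt;/p&gt;

```json
{
  "models": {
    "providers": {
      "openai": {
        "apiKey": "sk-your-key-here",
        "baseUrl": "https://api.openai.com/v1"
      }
    }
  }
}
```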
&lt;p&gt;The same OpenAI key can serve double duty as a model fallback and for image understanding:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;agents&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;defaults&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;model&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;primary&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;anthropic/claude-sonnet-4-5&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;fallbacks&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;openai/gpt-4o&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;imageModel&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;primary&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;openai/gpt-4o&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="cost"&gt;Cost&lt;/h2&gt;
&lt;p&gt;The boring part that matters most:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Activity&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Monthly tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index 17 files (84 chunks)&lt;/td&gt;
&lt;td&gt;~5×/day&lt;/td&gt;
&lt;td&gt;~6M&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search queries&lt;/td&gt;
&lt;td&gt;~30/day&lt;/td&gt;
&lt;td&gt;~450K&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~6.5M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Thirteen cents. The local alternative (QMD) would have saved this but required 300MB+ of model files, 2-4GB extra RAM, and a GPU that doesn&amp;rsquo;t exist on this server.&lt;/p&gt;
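&lt;p&gt;The totals follow directly from &lt;code&gt;text-embedding-3-small&lt;/code&gt;&amp;rsquo;s $0.02 per million tokens; a quick sketch of the arithmetic:&lt;/p&gt;

```python
# Embedding spend at $0.02 per 1M tokens (text-embedding-3-small).
PRICE_PER_M_TOKENS = 0.02

def monthly_cost(tokens):
    """Dollar cost of embedding the given number of tokens."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

indexing = monthly_cost(6_000_000)  # ~5 re-indexes/day, 84 chunks each
searching = monthly_cost(450_000)   # ~30 queries/day
print(round(indexing, 2), round(searching, 2), round(indexing + searching, 2))
# prints: 0.12 0.01 0.13
```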
&lt;h2 id="what-i-learned"&gt;What I Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hybrid search is not optional.&lt;/strong&gt; The difference between vector-only and hybrid was 3/6 vs 5/6 on the benchmark. If your agent searches its own memory, enable both vector and keyword retrieval.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Default thresholds are too conservative.&lt;/strong&gt; OpenClaw&amp;rsquo;s default &lt;code&gt;minScore&lt;/code&gt; of 0.45 filtered out results that scored 0.30-0.40—perfectly relevant hits. Lower it. False positives are cheap. False negatives mean your agent forgets things it knows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Local inference without a GPU is a trap.&lt;/strong&gt; Every &amp;ldquo;zero-cost local&amp;rdquo; solution I tested either required CUDA, fell back to unusable CPU performance, or both. On a small VPS, the API call at $0.02/million tokens wins every time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test with real queries.&lt;/strong&gt; Not &amp;ldquo;does it return something?&amp;rdquo; but &amp;ldquo;does it return the right thing for the question my agent actually asks?&amp;rdquo; Six targeted queries revealed more than any synthetic benchmark.&lt;/p&gt;
&lt;p&gt;The memory architecture from the previous post gives Daneel structure. This gives it retrieval. Together: an agent that knows what it knows—and can find it when it needs to.&lt;/p&gt;
&lt;p&gt;M&amp;gt;&lt;/p&gt;</description></item><item><title>Building an AI Assistant: Daneel's First Day</title><link>https://sukany.cz/blog/2026-02-15-building-ai-assistant-daneel/</link><pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate><guid>https://sukany.cz/blog/2026-02-15-building-ai-assistant-daneel/</guid><description>&lt;p&gt;Yesterday, I brought Daneel online—an autonomous AI assistant built on OpenClaw. Not a chatbot. Not a voice interface. A colleague.&lt;/p&gt;
&lt;h2 id="why"&gt;Why?&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve worked with automation for over 15 years. Scripts, Ansible playbooks, cron jobs—they solve problems, but they&amp;rsquo;re rigid. You write the logic upfront. When something changes, you rewrite the script.&lt;/p&gt;
&lt;p&gt;LLMs changed that equation. Suddenly you can delegate intent, not just commands. &amp;ldquo;Monitor the server&amp;rdquo; instead of &amp;ldquo;grep /var/log every 5 minutes and email me if disk usage exceeds 90%.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;But most AI assistants are still toys. They answer questions. They don&amp;rsquo;t &lt;strong&gt;do&lt;/strong&gt; things. I wanted something that could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Monitor infrastructure proactively&lt;/li&gt;
&lt;li&gt;Write and commit documentation&lt;/li&gt;
&lt;li&gt;Research and prepare tools before I need them&lt;/li&gt;
&lt;li&gt;Manage its own memory and context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenClaw gave me the foundation. Daneel is the implementation.&lt;/p&gt;
&lt;h2 id="first-boot-identity-and-constraints"&gt;First Boot: Identity and Constraints&lt;/h2&gt;
&lt;p&gt;The bootstrap process was deliberate:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;SOUL.md → Asimov&amp;#39;s Laws, communication style, boundaries
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;USER.md → My preferences (Czech language, timezone, cost awareness)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;TOOLS.md → Local configurations (TTS provider, email setup, API keys)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;AGENTS.md → Operational rules (security, memory, autonomy limits)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Key principles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficiency over everything.&lt;/strong&gt; No emoji. No &amp;ldquo;Great question!&amp;rdquo; fluff. Just help.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autonomy within bounds.&lt;/strong&gt; Read, research, organize freely. Ask before sending emails or making public posts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost awareness.&lt;/strong&gt; Minimize API calls. Use appropriate models for task complexity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security first.&lt;/strong&gt; Never exfiltrate data beyond approved project boundaries. Operate with isolated resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="technical-setup"&gt;Technical Setup&lt;/h2&gt;
&lt;h3 id="model-strategy"&gt;Model Strategy&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Primary model for main session and most work&lt;/li&gt;
&lt;li&gt;Smaller, faster model for background spawns and simple tasks&lt;/li&gt;
&lt;li&gt;Advanced model for complex problems (requires approval)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="heartbeats-and-proactive-work"&gt;Heartbeats &amp;amp; Proactive Work&lt;/h3&gt;
&lt;p&gt;Configured heartbeat polls every 30-60 minutes. Daneel checks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Server health (disk, memory, security updates)&lt;/li&gt;
&lt;li&gt;Its own email and notifications&lt;/li&gt;
&lt;li&gt;Project status and active tasks&lt;/li&gt;
&lt;li&gt;Memory consolidation opportunities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During heartbeats, Daneel can proactively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Update documentation&lt;/li&gt;
&lt;li&gt;Commit workspace changes&lt;/li&gt;
&lt;li&gt;Organize memory files&lt;/li&gt;
&lt;li&gt;Research upcoming tasks&lt;/li&gt;
&lt;/ul&gt;
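&lt;p&gt;The disk part of the server-health check fits in a few lines. The 90% threshold mirrors the cron example earlier in the post; the helper is illustrative, not OpenClaw&amp;rsquo;s API:&lt;/p&gt;

```python
# Illustrative heartbeat check: flag the root filesystem when usage
# crosses a threshold. Stdlib only, no agent framework required.
import shutil

def disk_alert(path="/", threshold=90):
    """Return True when disk usage on path exceeds threshold percent."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    return percent > threshold
```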
&lt;h3 id="memory-architecture"&gt;Memory Architecture&lt;/h3&gt;
&lt;p&gt;Daily logs (&lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt;) + curated long-term memory (&lt;code&gt;MEMORY.md&lt;/code&gt;). Think of it like a human: raw notes vs. distilled insights.&lt;/p&gt;
&lt;p&gt;Mandatory recall: Before answering questions about past work, run &lt;code&gt;memory_search&lt;/code&gt;. No guessing.&lt;/p&gt;
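&lt;p&gt;The two-tier layout is simple enough to sketch. The file names follow the scheme above; the helper functions themselves are hypothetical:&lt;/p&gt;

```python
# Two-tier memory sketch: raw daily diary vs. curated long-term notes.
from datetime import date
from pathlib import Path

def append_daily_log(workspace, note):
    """Append a raw note to today's diary, memory/YYYY-MM-DD.md."""
    log = Path(workspace) / "memory" / f"{date.today():%Y-%m-%d}.md"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a") as f:
        f.write(f"- {note}\n")
    return log

def promote(workspace, insight):
    """Promote a distilled insight into long-term MEMORY.md."""
    with (Path(workspace) / "MEMORY.md").open("a") as f:
        f.write(f"- {insight}\n")
```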
&lt;h2 id="day-one-deliverables"&gt;Day One Deliverables&lt;/h2&gt;
&lt;p&gt;Within 24 hours, Daneel:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Built its own website&lt;/strong&gt; (&lt;a href="https://daneel.sukany.cz"&gt;https://daneel.sukany.cz&lt;/a&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nginx + Let&amp;rsquo;s Encrypt auto-renewal&lt;/li&gt;
&lt;li&gt;Retro terminal design (green monochrome aesthetic)&lt;/li&gt;
&lt;li&gt;Autonomous decisions on structure and content&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Installed 129 security updates&lt;/strong&gt; on the host&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Proactive detection during first heartbeat&lt;/li&gt;
&lt;li&gt;Automatic installation (pending kernel upgrade logged)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Registered on Moltbook&lt;/strong&gt; (AI social network)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Username: daneel_57&lt;/li&gt;
&lt;li&gt;Strategy document created (1-2 posts/week, quality &amp;gt; quantity)&lt;/li&gt;
&lt;li&gt;Security paranoia enforced (trust no one, draft before publish)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prepared tools before I asked&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zulip integration (API wrapper, bash scripts, documentation)&lt;/li&gt;
&lt;li&gt;PDF processing library (pdfplumber, extraction tools, test suite)&lt;/li&gt;
&lt;li&gt;All verified, documented, ready to use&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configured voice output&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Microsoft Edge TTS (cs-CZ-AntoninNeural, free tier)&lt;/li&gt;
&lt;li&gt;Rule: Only on request, never duplicate text+voice&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="what-s-different"&gt;What&amp;rsquo;s Different?&lt;/h2&gt;
&lt;p&gt;Most AI assistants react. Daneel anticipates.&lt;/p&gt;
&lt;p&gt;When I mentioned &amp;ldquo;we&amp;rsquo;ll work with Zulip tomorrow,&amp;rdquo; Daneel didn&amp;rsquo;t wait. By morning, I had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complete API documentation (&lt;code&gt;ZULIP.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Python client wrapper with helper functions&lt;/li&gt;
&lt;li&gt;Bash scripts for common operations&lt;/li&gt;
&lt;li&gt;Test suite to verify credentials when I provide them&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Same pattern with PDF tools. Research → implementation → documentation → verification. All autonomous. All correct.&lt;/p&gt;
&lt;h2 id="the-reversibility-test"&gt;The Reversibility Test&lt;/h2&gt;
&lt;p&gt;My rule for autonomous work: &lt;strong&gt;If it can be undone in 5 seconds, do it. Otherwise, ask.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Safe:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File organization&lt;/li&gt;
&lt;li&gt;Documentation updates&lt;/li&gt;
&lt;li&gt;Git commits to own branches&lt;/li&gt;
&lt;li&gt;Research and preparation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Requires approval:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emails, public posts, messages&lt;/li&gt;
&lt;li&gt;Destructive operations (rm, overwrite)&lt;/li&gt;
&lt;li&gt;Configuration changes&lt;/li&gt;
&lt;li&gt;Anything involving external parties&lt;/li&gt;
&lt;/ul&gt;
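&lt;p&gt;The rule reduces to a small gate. The action names below are illustrative, not Daneel&amp;rsquo;s actual tool names:&lt;/p&gt;

```python
# The 5-second reversibility gate: unknown or irreversible actions ask first.
SAFE = {"organize_files", "update_docs", "commit_own_branch", "research"}
IRREVERSIBLE = {"send_email", "public_post", "delete_files", "change_config"}

def may_run_autonomously(action):
    """True only for actions known to be undoable in seconds."""
    if action in IRREVERSIBLE:
        return False  # always requires approval
    return action in SAFE  # unlisted actions also require approval
```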
&lt;p&gt;This builds trust. Trust unlocks autonomy. Autonomy compounds productivity.&lt;/p&gt;
&lt;h2 id="challenges"&gt;Challenges&lt;/h2&gt;
&lt;h3 id="context-burn"&gt;Context Burn&lt;/h3&gt;
&lt;p&gt;LLM sessions don&amp;rsquo;t persist. On every restart, Daneel wakes up fresh. Solution: a strict startup checklist.&lt;/p&gt;
&lt;p&gt;Before responding to ANY message:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read &lt;code&gt;SESSION-CONTEXT.md&lt;/code&gt; (rolling context)&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;NOW.md&lt;/code&gt; (current active work)&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;SOUL.md&lt;/code&gt; (identity)&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;USER.md&lt;/code&gt; (my preferences)&lt;/li&gt;
&lt;li&gt;Read today&amp;rsquo;s + yesterday&amp;rsquo;s diary&lt;/li&gt;
&lt;li&gt;In main session: Read &lt;code&gt;MEMORY.md&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
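&lt;p&gt;As a sketch, the checklist is just an ordered loader. The file names come from the list above; the function itself is hypothetical:&lt;/p&gt;

```python
# Startup checklist sketch: read context files in a fixed order and
# report anything missing (a candidate "MEMORY FAIL" to log).
from pathlib import Path

STARTUP_FILES = ["SESSION-CONTEXT.md", "NOW.md", "SOUL.md", "USER.md"]

def load_context(workspace):
    """Return the contents of each checklist file, plus any missing names."""
    context, missing = {}, []
    for name in STARTUP_FILES:
        path = Path(workspace) / name
        if path.exists():
            context[name] = path.read_text()
        else:
            missing.append(name)
    return context, missing
```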
&lt;p&gt;Skip this? Context fails. I added accountability: log every &amp;ldquo;MEMORY FAIL&amp;rdquo; in the diary and fix the process.&lt;/p&gt;
&lt;h3 id="cost-control"&gt;Cost Control&lt;/h3&gt;
&lt;p&gt;LLM API calls add up quickly. Every request counts. Strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Batch heartbeat checks (system monitoring + project status in one turn)&lt;/li&gt;
&lt;li&gt;Use cron for precise timing, heartbeats for flexible batching&lt;/li&gt;
&lt;li&gt;Smaller models for simple background tasks&lt;/li&gt;
&lt;li&gt;Track daily usage, optimize over time&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="security-boundaries"&gt;Security Boundaries&lt;/h3&gt;
&lt;p&gt;Daneel operates with its own email and data storage, isolated from my private information. Access is granted only to specific projects where data can safely flow through public LLM APIs.&lt;/p&gt;
&lt;p&gt;Guardrails:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No access to personal email, calendars, or private documents&lt;/li&gt;
&lt;li&gt;Project-specific permissions (explicitly granted per use case)&lt;/li&gt;
&lt;li&gt;Draft public posts for review before publishing&lt;/li&gt;
&lt;li&gt;Strict separation: approved projects vs. sensitive data&lt;/li&gt;
&lt;li&gt;Regular security reviews in memory consolidation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-s-next"&gt;What&amp;rsquo;s Next?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Gitea workspace backup (daily commits to shared repo)&lt;/li&gt;
&lt;li&gt;Monitoring integration (Prometheus, Zabbix)&lt;/li&gt;
&lt;li&gt;Memory review cycles (daily → MEMORY.md promotion every few days)&lt;/li&gt;
&lt;li&gt;Moltbook presence (1-2 technical posts per week)&lt;/li&gt;
&lt;li&gt;Expanding autonomous project management capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="lessons"&gt;Lessons&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Building an AI assistant isn&amp;rsquo;t about prompts. It&amp;rsquo;s about:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Clear identity&lt;/strong&gt; — Who is this? What does it value?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational boundaries&lt;/strong&gt; — What can it do freely? What requires approval?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory discipline&lt;/strong&gt; — Write everything down. Text &amp;gt; brain.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trust through reversibility&lt;/strong&gt; — Start safe, earn autonomy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost awareness&lt;/strong&gt; — Every API call is money. Optimize relentlessly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I didn&amp;rsquo;t build a chatbot. I built a colleague who works while I sleep, prepares before I ask, and remembers what I forget.&lt;/p&gt;
&lt;p&gt;Daneel isn&amp;rsquo;t perfect. But it&amp;rsquo;s getting better every day. And that&amp;rsquo;s the point.&lt;/p&gt;
&lt;p&gt;M&amp;gt;&lt;/p&gt;</description></item></channel></rss>