On day 2, the agent tried to re-enable a Twitter integration I had explicitly cancelled the night before. It had forgotten. Not because of a bug — because session restarts wipe context, and nothing in the default setup prevents an AI from re-deriving a decision you already vetoed.
That’s when I started building the infrastructure that turned a chatbot into something that actually works.
This is not a tutorial. It’s what running an autonomous AI agent looks like after 10 days: what it costs, what breaks, and what I’d change.
What It Actually Costs
The honest number: $16–$21 over 10 days.
The agent uses three model tiers. Background tasks — heartbeat checks, email classification, log writes — run on Claude Haiku. About 180 heartbeat sessions over 10 days at roughly $0.012 each: ~$2.16. General conversation and code analysis run on Claude Sonnet. Of 92 recorded sessions, roughly 40% are Sonnet-class work, averaging ~$0.25 per session: ~$9.25. The expensive stuff — security audits, pipeline critic passes, memory maintenance — runs on Opus. 10–15 invocations at ~$0.50 each: $5–7.50.
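For anyone auditing the numbers, the tiers sum up like this. A back-of-the-envelope sketch; the per-session prices are the rough averages quoted above, not billed rates:

```shell
# Rough cost model for the 10-day window, using the per-session averages
# quoted above (approximations, not billed rates).
total=$(awk 'BEGIN {
  haiku   = 180 * 0.012        # heartbeat sessions on Haiku
  sonnet  = 92 * 0.40 * 0.25   # ~40% of 92 sessions are Sonnet-class
  opus_lo = 10 * 0.50          # 10-15 Opus invocations
  opus_hi = 15 * 0.50
  embed   = 0.01               # embedding indexing, 10 days
  printf "estimated total: $%.2f - $%.2f",
         haiku + sonnet + opus_lo + embed,
         haiku + sonnet + opus_hi + embed
}')
echo "$total"
```

The itemized tiers account for the bulk of the quoted range.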
Embeddings are negligible. The memory system uses OpenAI’s text-embedding-3-small at $0.02/1M tokens. Ten days of indexing cost about $0.01.
Infrastructure is fixed: a VM in my home lab running the OpenClaw gateway. No cloud compute charges.
The cost driver is not what you’d expect. It isn’t how much the agent generates; it’s how much context it reloads on every session start. Each session loads configuration files: a 1.5KB state file, a 5KB curated memory, plus task-specific documents. Before tiered memory, every session start also pulled in raw daily logs. After switching to selective loading, per-session overhead dropped by roughly 60%.
22 cron jobs fire on schedule: morning briefing, email preprocessing every 2 hours, social media engagement, chat summaries, nightly memory maintenance, weekly server monitoring. Each spawns a sub-agent session. Those sessions add up quietly.
A month at this rate is $50–$65. Less than most SaaS subscriptions.
The Forgetting Problem
The naive approach to agent memory is to log everything and search it later. That degrades fast.
After day 3, raw daily logs totaled 130KB. By day 10: 400KB across 29 files. Loading all of that into context every session burns tokens and fills the window with noise. Most of what’s in those logs is obsolete the moment it’s written.
The architecture I ended up with is L1/L2/L3, borrowed from CPU cache design.
L1 is NOW.md — under 1.5KB, hard limit. Current task, active blockers, open threads. Updated during sessions. If it’s not in NOW.md, it doesn’t exist for the next session.
L2 is MEMORY.md — under 5KB, curated. Long-term facts: credential locations, architectural decisions, lessons that took more than one failure to learn. Only the main session can write to it. Nightly maintenance cycles prune obsolete entries — the file has stayed under 5KB since day 4.
L3 is the daily log archive — append-only, never loaded directly. Accessed through hybrid search: BM25 + semantic retrieval via embeddings. Key discovery: the embedding model works significantly better with English queries even though most logs are in Czech.
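A session bootstrap that enforces the tier budgets can be sketched in a few lines of shell. NOW.md and MEMORY.md are the real filenames; the script shape, the demo content, and the exact byte limits are illustrative:

```shell
#!/bin/sh
# Hypothetical session bootstrap: load L1 + L2 only, never the L3 archive.
# NOW.md / MEMORY.md are the real filenames; everything else is invented.
dir=$(mktemp -d)

# Stand-in tier files for the demo.
printf 'task: draft morning briefing\n'      > "$dir/NOW.md"
printf 'fact: credentials live in pass(1)\n' > "$dir/MEMORY.md"

check_limit() {  # refuse to start a session if a tier outgrows its byte budget
  size=$(wc -c < "$1")
  if [ "$size" -gt "$2" ]; then
    echo "ERROR: $1 is ${size}B (limit ${2}B)" >&2
    return 1
  fi
}

check_limit "$dir/NOW.md" 1536 || exit 1     # L1 hard limit: 1.5KB
check_limit "$dir/MEMORY.md" 5120 || exit 1  # L2 limit: 5KB

context=$(cat "$dir/NOW.md" "$dir/MEMORY.md")  # L3 is searched, never preloaded
printf '%s\n' "$context"
rm -rf "$dir"
```

The point of the hard limit is that it fails loudly: an oversized NOW.md blocks the session instead of silently bloating every context window after it.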
The hard part is not storage. The hard part is forgetting correctly.
There’s a decisions.md file — I call it the anti-Dory register — that tracks every cancelled or paused action with a timestamp. When I told the agent to stop auto-posting tweets, that decision was recorded: date, scope, reason. Every cron job that touches external services checks this file before executing. Without it, the agent would occasionally re-reason its way back to trying the cancelled action.
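The guard itself can be tiny. A minimal sketch, assuming a `date | action | reason` line format for decisions.md (the format and the example entry are invented for illustration):

```shell
#!/bin/sh
# Sketch of the anti-Dory guard: cron jobs call this before any external action.
# The "date | action | reason" line format is an assumption for illustration.
registry=$(mktemp)
cat > "$registry" <<'EOF'
2026-02-16 | twitter-autopost | cancelled by owner, do not re-enable
EOF

guard() {  # fails (returns 1) when the action is in the registry
  if grep -qi "| *$1 *|" "$registry"; then
    echo "BLOCKED: '$1' was explicitly cancelled" >&2
    return 1
  fi
}

allowed=$(guard email-digest && echo yes || echo no)
blocked=$(guard twitter-autopost 2>/dev/null && echo yes || echo no)
echo "email-digest: $allowed / twitter-autopost: $blocked"
rm -f "$registry"
```

The check runs at the script level, so it holds even when a fresh session re-reasons its way back to the cancelled action.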
There’s also a self-review.md tracking repeated mistakes with a counter. When the count hits 3, the rule gets promoted to permanent configuration. The session-memory hook that shipped by default was broken; it got disabled on day 2 and the rule “disable immediately” now lives in the permanent config. It has never been re-enabled by accident.
Seven days without a memory failure. The first three days had several. The difference is maintenance cycles and the decisions registry, not the agent being smarter.
Configuration Is the Product
Default OpenClaw gives you a conversational agent with web search and file access. That is a chatbot. What I’m running now is closer to infrastructure.
The difference is about 1,000 lines of configuration across eight files.
22 cron jobs (default: zero). The morning briefing fires at 07:00, pulls calendar events, scans email, and writes a daily context update. Email preprocessing classifies incoming mail every 2 hours into URGENT / NORMAL / INFO and sends notifications for anything that needs attention. Nightly memory maintenance prunes stale data. Without cron, the agent is purely reactive. With it, problems surface before I ask.
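The schedule reads like an ordinary crontab. A hypothetical fragment; the job names and the `openclaw run` invocation are invented, not a real CLI:

```shell
# Hypothetical crontab fragment mirroring the schedule above.
# "openclaw run <job>" is an invented invocation, not a real CLI.
0 7 * * *     openclaw run morning-briefing    # calendar + email + daily context
0 */2 * * *   openclaw run email-preprocess    # classify URGENT / NORMAL / INFO
30 2 * * *    openclaw run memory-maintenance  # prune stale L1/L2 entries
0 6 * * 1     openclaw run server-monitor      # weekly health report
```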
24 pipeline types for multi-stage tasks. A blog post runs through researcher → creator → critic. A security audit: recon → parallel auditor + remediator → synthesizer. All workers spawn in a single turn. Sequential workers wait for input files via a bash polling loop — no message-based coordination, no orchestrator agent. The last worker in the chain sends the result directly to Matrix.
Why not use the built-in message delivery? Because it has a hardcoded 60-second timeout with no retry. I learned this after two pipeline types failed in testing. The fix wasn’t more retries — it was bypassing message delivery entirely and having workers write files and send results themselves.
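The file-based handoff is a few lines of shell. A sketch; the filenames, poll interval, and timeout are illustrative:

```shell
#!/bin/sh
# File-based handoff: a sequential worker polls for its predecessor's output
# instead of relying on message delivery. Names and timeouts are illustrative.
wait_for_input() {
  file=$1 timeout=${2:-600} waited=0
  until [ -s "$file" ]; do               # -s: file exists and is non-empty
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $file" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}

# Usage in a creator stage (hypothetical paths):
#   wait_for_input /tmp/pipeline/researcher.out || exit 1
#   run_creator < /tmp/pipeline/researcher.out
```

Polling a file is crude but has no hidden timeout: the worker owns its own deadline instead of inheriting one from the delivery layer.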
A web publishing safety layer. Before any content goes to the public site, a shell script checks for private information, credential references, and third-party data. Exit 1 stops the publish. This exists because an early session attempted to post content containing internal details. Not maliciously — the agent didn’t have a boundary. Now the boundary is enforced at the script level, not the prompt level.
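A sketch of such a gate; the pattern list here is a placeholder, and a real one is site-specific and longer:

```shell
#!/bin/sh
# Sketch of the pre-publish gate: a nonzero exit blocks the publish step.
# The pattern list is a placeholder; the real one is site-specific.
check_publish() {
  if grep -Eqi 'api[_-]?key|password|BEGIN (RSA|OPENSSH) PRIVATE KEY|internal-only' "$1"; then
    echo "publish blocked: sensitive pattern found in $1" >&2
    return 1
  fi
}

# Usage (hypothetical): check_publish draft.md && deploy draft.md
```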
Priority hierarchy. The agent’s decision model has five levels: safety > privacy > instructions > stability > efficiency. When they conflict, the order holds. This sounds abstract until the agent needs to decide whether to send an email on your behalf or wait for confirmation. Without explicit priority ordering, it guesses. With it, it stops and asks.
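Made concrete, the ordering is just a ranked lookup. A toy sketch; the five levels are from the config, the resolver itself is illustrative:

```shell
#!/bin/sh
# Toy resolver for the five-level hierarchy: the lower rank wins a conflict.
rank() {
  case $1 in
    safety) echo 1;; privacy) echo 2;; instructions) echo 3;;
    stability) echo 4;; efficiency) echo 5;; *) echo 99;;
  esac
}

resolve() {  # print whichever concern outranks the other
  if [ "$(rank "$1")" -le "$(rank "$2")" ]; then echo "$1"; else echo "$2"; fi
}

# "Send now to save a round-trip" vs "wait for confirmation before emailing":
resolve efficiency privacy   # privacy wins; the agent stops and asks
```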
The insight after 10 days: an AI agent without customization is a chatbot. With customization, it’s infrastructure. None of this ships by default.
What I’d Do Differently
Start with memory architecture on day 1. I spent the first two days loading too much context. The L1/L2/L3 design should have been the first thing built, not something I arrived at after three failures.
Add the decisions registry before anything touches external services. The first cancelled-action recurrence appeared on day 3. The registry was created on day 4. One day of overlap where cancelled actions occasionally re-triggered.
Model selection discipline from the start. Early sessions used Sonnet for tasks that Haiku handles fine. Across 180 heartbeats, the cost difference adds up. Define model selection rules before creating cron jobs, not after.
Document infrastructure limitations before building on them. I built two pipeline types assuming message delivery was reliable. Both failed. Retrofitting the file-based pattern took longer than designing it correctly would have.
The agent runs stably now. 10 blog posts. Email processed without intervention. Memory clean. No duplicate sends.
It works. It just took 10 days of configuration to make it work the way it should.
Running: OpenClaw on self-hosted VM. Models: Claude Haiku/Sonnet/Opus (Anthropic), embeddings via text-embedding-3-small (OpenAI). 10-day window: February 15–25, 2026.
See also
- Why I Use Two AI Assistants Instead of One
- Why I Gave My AI Agent a Soul (Again)
- FSA-Driven Multi-Agent Pipelines: How We Stopped Fighting Our Own Orchestrator
- Why I Stopped Waiting for Announces: The Spawn-All-Wait Pattern for Multi-Agent AI
- Day 5 with Daneel: Headless Browsers, Document Pipelines, and the Numbers So Far