I have a small internal tool called Scénář Creator. It generates timetables for experiential courses — you know the kind: weekend trips where you have 14 programme blocks across three days and someone has to make sure nothing overlaps. I built version one in November 2025. It was a CGI Python app running on Apache, backed by Excel.
Yesterday I asked Daneel to rebuild it. Four hours later, version 4.7 was running in production. Here’s exactly what happened.
The Starting Point
The original tool was functional but ugly in the developer sense. Python CGI means no proper request lifecycle, no validation, and Apache configuration that nobody wants to debug. Excel meant openpyxl and pandas as dependencies for what is essentially a colour-coded grid. The UI had a rudimentary inline editor but nothing you’d want to actually use.
My requirements for the new version:
- No Excel, no pandas, no openpyxl — anywhere
- JSON import/export with a sample template
- PDF output, always exactly one A4 landscape page
- Drag-and-drop canvas editor where blocks can be moved in time and between days
- Czech day names in both the editor and the PDF
- Documentation built into the app itself
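To make the JSON import/export requirement concrete, here is a minimal sketch of what the scenario data shape might look like. This is a stdlib-only illustration using dataclasses (the real app models this with Pydantic), and every field name here is an assumption, not the actual schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical data shape for the JSON template.
# Field names are illustrative; the real app defines these with Pydantic.
@dataclass
class Block:
    title: str
    day: int        # 0-based day index (e.g. 0 = Friday)
    start: str      # 24-hour "HH:MM"
    end: str        # 24-hour "HH:MM"

@dataclass
class Scenario:
    name: str
    days: list[str]         # Czech day names, e.g. ["Pátek", "Sobota", "Neděle"]
    blocks: list[Block]

def export_json(s: Scenario) -> str:
    # ensure_ascii=False keeps Czech diacritics readable in the exported file
    return json.dumps(asdict(s), ensure_ascii=False, indent=2)

def import_json(raw: str) -> Scenario:
    data = json.loads(raw)
    data["blocks"] = [Block(**b) for b in data["blocks"]]
    return Scenario(**data)
```

The round-trip property (import of an export reproduces the scenario) is what makes a sample template useful as documentation.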
The Pipeline Command
I typed /pipeline code in Matrix followed by the requirements. This triggers a specific workflow I configured for Daneel: instead of answering directly, it spawns a chain of sub-agents.
What that looks like internally:
- Researcher sub-agent — reads the existing codebase (CGI scripts, Dockerfile, rke2 deployment manifest), queries documentation for FastAPI, ReportLab, and interact.js, produces a technology brief
- Architect sub-agent — takes the brief and the existing code, designs a new architecture, outputs a structured document marked “ARCHITEKTURA PRO SCHVÁLENÍ” (Architecture for Approval)
- Main agent presents the architecture to me. I type “schvaluji” (I approve).
- Coder sub-agent — implements the full application based on the approved architecture
Each sub-agent is an independent session. They don’t share memory. They communicate through their outputs, which the orchestrator passes forward as context.
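The orchestration pattern above can be sketched in a few lines. This is a toy illustration, not Daneel's actual implementation: each stage is an isolated call, and only the accumulated textual outputs are forwarded as context to the next stage:

```python
# Toy sketch of the pipeline pattern: each sub-agent is an isolated call
# (a stand-in for an LLM session), and only its textual output is forwarded.
from typing import Callable

Agent = Callable[[str], str]

def pipeline(task: str, stages: list[tuple[str, Agent]]) -> dict[str, str]:
    context = task
    outputs: dict[str, str] = {}
    for name, agent in stages:
        result = agent(context)   # fresh "session": sees only forwarded context
        outputs[name] = result
        context = f"{context}\n\n[{name} output]\n{result}"  # pass forward
    return outputs
```

Note that the forwarded context grows with every stage, which is exactly the failure mode the next section describes.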
The Context Overflow
About 40 minutes in, the orchestrator hit a context limit. The session died mid-flight. I got a message: “Context overflow: prompt too large for the model.”
This is a real failure mode with multi-agent pipelines. The orchestrator had been accumulating all the research, architecture, and partial implementation output in a single context window. It eventually exceeded what Claude Sonnet can hold.
When I opened a new session (/new), Daneel’s first action was to run memory_search on the session logs from the crashed session. The key fragments were there:
- The architecture document (partially recovered)
- The approved tech stack: FastAPI + Pydantic, ReportLab Canvas API, interact.js from CDN, vanilla JS frontend
- The deployment infrastructure: podman on daneel.sukany.cz, Gitea registry, kubectl via SSH to infra01
Then Daneel did something worth noting: it checked the live cluster before assuming the background agents had implemented anything correctly. The health endpoint returned {"status": "ok", "version": "2.0"}. The background agents had claimed v3.0 was deployed. It wasn’t.
This is a lesson I keep relearning. Check the actual state of the system, not the reported state.
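The check itself is trivial to express. A minimal sketch, assuming the health endpoint's JSON shape shown above; the comparison is a pure function so the network fetch stays separate:

```python
# Sketch of the "trust the endpoint, not the report" check.
import json
from urllib.request import urlopen

def fetch_health(url: str) -> dict:
    # e.g. url = "https://scenar.apps.sukany.cz/api/health"
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())

def check(claimed: str, health_payload: dict) -> bool:
    # Compare the version the agents *claimed* against what the
    # live endpoint actually reports.
    return health_payload.get("version") == claimed
```

With the payload from the crashed session, `check("3.0", {"status": "ok", "version": "2.0"})` is exactly the mismatch Daneel caught.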
What “Implementation” Actually Means
Here’s what the agent concretely did, in order:
Read the existing codebase
Every relevant file: the CGI scripts, the Pydantic models, the Dockerfile, the rke2 deployment YAML. Not a summary — the actual file contents, via the read tool. About 12 files.
Wrote the new application
Six Python modules (main.py, config.py, models/event.py, api/scenario.py, api/pdf.py, core/pdf_generator.py) plus four JavaScript files (canvas.js, app.js, api.js, export.js), CSS, HTML, and a sample JSON fixture. Each file was written with write (full file) or edit (surgical replacement of a specific text block).
Ran tests locally
python3 -m pytest tests/ -v
33 tests at v4.0, growing to 37 by v4.7. Every deploy was preceded by a clean test run.
Built the Docker image
podman build --format docker \
-t <private-registry>/martin/scenar-creator:latest .
The --format docker flag is required for RKE2’s containerd runtime. Without it, the manifest format is OCI, which a standard Kubernetes deployment can’t pull directly.
Pushed to the private Gitea registry
# credentials loaded from environment
podman push <private-registry>/martin/scenar-creator:latest
Credentials come from environment variables, not hardcoded.
Deployed via SSH
ssh root@infra01.sukany.cz \
"kubectl -n scenar rollout restart deployment/scenar && \
kubectl -n scenar rollout status deployment/scenar --timeout=60s"
kubectl is not available on the machine Daneel runs on. It’s only on infra01. Direct SSH as root is the access pattern that works; daneel@ access is denied on that host.
Verified the deployment
curl -s https://scenar.apps.sukany.cz/api/health
{"status":"ok","version":"4.4.0"}
This ran after every deploy. Not assumed, verified.
The Bugs
The interesting part is what didn’t work the first time.
Cross-day drag — three iterations
The requirement was that programme blocks could be dragged between days, not just along the time axis within a single day. The first implementation used interact.js for both horizontal (time) and vertical (day) movement.
First attempt (v4.3): Added Y-axis movement to interact.js with translateY on the block element. The block disappeared during drag: it lives inside a .day-timeline container with overflow: hidden, so a block translated outside its container gets clipped.
The fix attempt was to add overflow: visible to the containers during drag using a CSS class toggle. It didn’t fully work because .canvas-scroll-area has overflow: auto, which makes it a scroll container that clips its descendants regardless of what the inner containers declare.
Second attempt (v4.5): Replaced interact.js dragging with native pointer events. Created a floating ghost element on document.body (no stacking context issues). Moved the ghost freely during drag. Used document.elementFromPoint() on pointerup to determine which .day-timeline the user dropped on.
This almost worked. The ghost moved correctly. But elementFromPoint was unreliable — sometimes it returned the ghost itself (even with pointer-events: none), sometimes it returned the wrong element.
Third attempt (v4.6): Two changes:
- Call el.releasePointerCapture(e.pointerId) at drag start. Without this, the browser implicitly captures the pointer on the element that received pointerdown. On some platforms, this affects which element receives subsequent events and can block the ghost’s hit-testing.
- Replace elementFromPoint entirely. At drag start, capture getBoundingClientRect() for every .day-timeline and store the rectangles. On pointerup, compare ev.clientY against the stored rectangles. No DOM querying during the drop, just a loop over six numbers.
This worked. Simple coordinate comparison, no browser API surprises.
Czech diacritics in PDF
ReportLab’s built-in Helvetica doesn’t support Czech characters. “Pondělí” became garbage bytes.
Fix: added fonts-liberation to the Dockerfile (provides LiberationSans TTF, a metrically compatible Helvetica replacement with full Latin Extended-A coverage). Registered the font at module load:
pdfmetrics.registerFont(TTFont('LiberationSans', '/usr/share/fonts/...'))
Fallback to Helvetica if the font file isn’t found, so local development without the package still works.
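The registration-with-fallback can be sketched like this. The font path is an assumption (the typical fonts-liberation location on Debian-based images); the fallback fires when either ReportLab or the font file is missing:

```python
import os

def pick_font() -> str:
    """Register LiberationSans for full Czech coverage if available;
    fall back to ReportLab's built-in Helvetica otherwise."""
    # Assumed path from the Debian fonts-liberation package.
    font_path = "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf"
    try:
        from reportlab.pdfbase import pdfmetrics
        from reportlab.pdfbase.ttfonts import TTFont
        if os.path.exists(font_path):
            pdfmetrics.registerFont(TTFont("LiberationSans", font_path))
            return "LiberationSans"
    except ImportError:
        pass  # local dev without reportlab/font: degrade gracefully
    return "Helvetica"
```

The PDF generator then uses whatever name pick_font() returned, so a missing font degrades diacritics instead of crashing the deploy.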
AM/PM time display
HTML <input type="time"> displays in 12-hour AM/PM format on macOS/Windows browsers with a US locale, even when the page has lang="cs". The .value property always returns 24-hour HH:MM (that part works), but the visual display was wrong.
Fix: replaced type="time" with type="text" with maxlength="5" and an auto-formatter that inserts : after the second digit. Validates on blur. Stores values as HH:MM strings, which is what the rest of the code already expected.
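The actual formatter is client-side JavaScript; here is the same logic sketched in Python: strip non-digits, insert the colon after the second digit, and validate the final HH:MM on blur:

```python
import re

def format_time_input(raw: str) -> str:
    """Auto-format as the user types: keep digits, insert ':' after the second."""
    digits = re.sub(r"\D", "", raw)[:4]   # at most 4 digits (HHMM)
    if len(digits) > 2:
        return digits[:2] + ":" + digits[2:]
    return digits

def is_valid_hhmm(value: str) -> bool:
    """Blur-time validation: 24-hour HH:MM only."""
    return re.fullmatch(r"([01]\d|2[0-3]):([0-5]\d)", value) is not None
```

Storing the canonical 24-hour string sidesteps the locale problem entirely: there is no browser-rendered time widget left to disagree with.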
PDF text overflow in narrow blocks
Short programme blocks (15–30 minutes) have very little horizontal space. The block title would overflow the clipping path and just get cut off mid-character.
Fix: added a fit_text() function in the PDF generator. It uses ReportLab’s stringWidth() to binary-search the longest string that fits in the available width, then appends … if truncation occurred.
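The approach can be sketched as follows. The width function is a parameter here so the sketch is self-contained; the real implementation passes ReportLab's pdfmetrics.stringWidth with the active font and size:

```python
from typing import Callable

def fit_text(text: str, max_width: float, width_of: Callable[[str], float]) -> str:
    """Binary-search the longest prefix that fits in max_width;
    append an ellipsis if truncation occurred."""
    if width_of(text) <= max_width:
        return text
    lo, hi = 0, len(text)   # invariant: prefix of length lo fits (with ellipsis)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if width_of(text[:mid] + "…") <= max_width:
            lo = mid
        else:
            hi = mid - 1
    return text[:lo] + "…"
```

Binary search matters here because stringWidth is called once per probe, and the generator runs this for every block on the page.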
In the canvas editor, blocks narrower than 72px now hide the time label; blocks narrower than 28px hide all text and rely on a title tooltip attribute.
The Deployment Count
15 deploys between 16:00 and 20:00 CET. Each one: build (~30s from cache), push (~15s for changed layers), rollout restart (~25s for pod replacement), curl to verify. About 90 seconds per cycle, plus whatever time was spent writing the code.
The Kubernetes deployment uses imagePullPolicy: Always and the :latest tag, so every rollout restart pulls the freshest image. No manifest changes needed between iterations.
What the Agent Didn’t Do
No browser interaction. Daneel can control a browser but I didn’t ask for that and it wasn’t needed — the verification was just an API health check.
No speculative changes. Every code change was in response to a concrete requirement or a confirmed bug. Daneel didn’t add features I didn’t ask for.
No silent failures. When a deploy failed or a test broke, it stopped and reported. It didn’t try to paper over errors or push anyway.
Observations
The most expensive bug was the cross-day drag, not because it was technically complex but because it required three separate hypotheses, three implementations, and three deploys to find the actual failure mode. The first two were reasonable guesses that happened to be wrong.
The context overflow in the pipeline wasn’t catastrophic because the memory system worked. The session logs from the crashed orchestrator were searchable. The critical facts — approved tech stack, deployment procedure, live cluster state — were recoverable. This is the point of building memory infrastructure before you need it.
The total elapsed time from /pipeline code to “considered resolved” was about four hours. The application went from CGI+Excel to FastAPI+JSON+drag-and-drop canvas in that window. That’s not a claim about AI replacing developers. It’s a data point about what changes when you have an agent that can write code, run it, push it, and verify it in the same loop you’d use as a human developer — just without context switching or fatigue.