Sprint 29 — The app audits itself


Sprint 29 was supposed to be a quiet infrastructure week: stand up Sentry, keep dogfooding, fix whatever I tripped over. Sentry got done — and then the sprint produced two threads I didn’t plan for, both of which I’ll remember longer than the observability wiring. One was a real bug that turned out to be an architecture problem. The other was the app sitting down and grading its own homework.

The 4 MB wall wasn’t a number

I went to upload a multi-page PDF — exactly the kind of messy real-life artifact Domi exists to eat — and it bounced. Anything past roughly 4 MB failed. My first instinct was the lazy one: find the limit, bump the number, move on.

There was no number to bump.

The upload was flowing through a Vercel serverless function: browser → function → Cloudflare R2. And Vercel functions have a hard 4.5 MB request-body cap. The “limit” wasn’t a config value I’d set too conservatively. It was the shape of the data path. Every byte of every document was being relayed through compute that was never meant to be a file conduit, and that compute had a ceiling baked into the platform.

So the fix wasn’t a constant — it was removing the function from the path entirely. Presigned direct-to-R2 uploads (#549): the browser asks the server for a short-lived signed URL, then PUTs the file straight to R2, bypassing the serverless function completely. The function only ever sees a tiny “here’s the key, here’s the metadata” request. The real ceiling is now 50 MB, which covers basically every household document I’m going to throw at it.
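A minimal sketch of that two-step flow, under stated assumptions: the endpoint path `/api/uploads/presign`, the `PresignResponse` shape, and the helper names are hypothetical and not taken from #549, and 50 MB is read as 50 × 1024 × 1024 bytes. The HTTP client is injected so the same function can run against `fetch` in the browser or a stub in a test.

```typescript
// Hypothetical names throughout; this illustrates the pattern, not the actual #549 code.
const MAX_UPLOAD_BYTES = 50 * 1024 * 1024; // assumed reading of the 50 MB ceiling

interface UploadMeta { filename: string; contentType: string; size: number; }
interface PresignResponse { url: string; key: string; }

// Minimal HTTP surface so the sketch does not depend on DOM typings.
interface HttpResponse { ok: boolean; status: number; json(): Promise<unknown>; }
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: unknown },
) => Promise<HttpResponse>;

// Returns a human-readable problem, or null when the size is acceptable.
function checkUploadSize(size: number): string | null {
  if (!Number.isFinite(size) || size <= 0) return "empty or invalid file";
  if (size > MAX_UPLOAD_BYTES) return "file exceeds the 50 MB limit";
  return null;
}

async function uploadDirect(
  meta: UploadMeta,
  body: unknown,
  http: HttpFn,
  presignEndpoint = "/api/uploads/presign", // hypothetical route
): Promise<string> {
  const problem = checkUploadSize(meta.size);
  if (problem !== null) throw new Error(problem);

  // Step 1: a tiny JSON request goes through the serverless function.
  const presignRes = await http(presignEndpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(meta),
  });
  if (!presignRes.ok) throw new Error(`presign failed: ${presignRes.status}`);
  const { url, key } = (await presignRes.json()) as PresignResponse;

  // Step 2: the file bytes go straight to object storage, never through the function.
  const putRes = await http(url, {
    method: "PUT",
    headers: { "Content-Type": meta.contentType },
    body,
  });
  if (!putRes.ok) throw new Error(`upload failed: ${putRes.status}`);
  return key;
}
```

In the browser, `http` would be a thin wrapper around `fetch` and `body` the `File` itself. The server side of step 1 would generate the signed URL with an S3-compatible client, which is omitted here.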

It’s not free of plumbing. Direct browser-to-R2 means R2 has to accept cross-origin PUTs, so it needed a CORS policy on the domi-documents bucket — a piece of config that lives outside the repo and would absolutely bite a future me who forgot it existed, so it went into the DR runbook as an addendum. But the lesson is the one worth keeping: when a limit can’t be raised, ask whether it’s a number or an architecture. This one was an architecture wearing a number’s clothes.
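For the runbook, a sketch of what such a policy could look like. The field names follow the S3-style CORS JSON that R2 accepts; the origin `https://app.example.com` and the specific header and max-age choices are placeholders, not the values actually applied to the bucket. The `corsAllows` helper is a deliberately simplified model of rule matching, useful only as a sanity check before pasting the JSON into the bucket settings.

```typescript
// S3-style CORS rule shape. Values below are illustrative assumptions.
interface CorsRule {
  AllowedOrigins: string[];
  AllowedMethods: string[];
  AllowedHeaders?: string[];
  ExposeHeaders?: string[];
  MaxAgeSeconds?: number;
}

const documentsBucketCors: CorsRule[] = [
  {
    AllowedOrigins: ["https://app.example.com"], // placeholder app origin
    AllowedMethods: ["PUT"],                      // direct uploads only need PUT
    AllowedHeaders: ["Content-Type"],
    ExposeHeaders: ["ETag"],
    MaxAgeSeconds: 3600,
  },
];

// Simplified check: exact origin match (or "*") plus method match.
// Real CORS evaluation also considers request headers and preflight details.
function corsAllows(rules: CorsRule[], origin: string, method: string): boolean {
  return rules.some(
    (rule) =>
      (rule.AllowedOrigins.indexOf("*") !== -1 || rule.AllowedOrigins.indexOf(origin) !== -1) &&
      rule.AllowedMethods.indexOf(method) !== -1,
  );
}

// JSON form, ready to paste into the bucket's CORS settings.
const corsJson = JSON.stringify(documentsBucketCors, null, 2);
```

Keeping the policy as a typed constant next to the runbook entry means the out-of-repo config at least has an in-repo copy to diff against.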

A couple of related upload bugs fell out of the same area — a document-lookup path that wasn’t wrapped in the tenant context and threw a 500 on ingest (#543), and a friendlier error envelope on the chat route so an unhandled failure surfaces something human instead of a raw 500 (#542). Both now report to Sentry, which is the whole point of having wired it up.
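The error-envelope idea can be sketched roughly like this. The envelope shape, the `withErrorEnvelope` name, and the `Reporter` type are hypothetical; the reporter stands in for whatever call forwards the exception to Sentry, and is injected so the sketch stays self-contained.

```typescript
// Hypothetical shapes; the real route in #542 may differ.
interface ErrorEnvelope {
  error: { code: string; message: string; requestId: string };
}
interface RouteResult { status: number; body: unknown; }
type Reporter = (err: unknown, context: { requestId: string }) => void;

// Wraps a route handler so an unhandled failure is reported once and
// returned as a readable envelope instead of a bare 500.
async function withErrorEnvelope(
  requestId: string,
  handler: () => Promise<RouteResult>,
  report: Reporter,
): Promise<RouteResult> {
  try {
    return await handler();
  } catch (err) {
    report(err, { requestId }); // stand-in for the Sentry capture call
    const envelope: ErrorEnvelope = {
      error: {
        code: "internal_error",
        message: "Something went wrong on our side. Please try again.",
        requestId,
      },
    };
    return { status: 500, body: envelope };
  }
}
```

The status code is still 500; what changes is that the body carries a message a person can read and a request ID that can be matched to the reported event.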

Making Domi grade itself against the book

The second thread started as housekeeping and turned into the most useful exercise I’ve run all year.

A while back I distilled a set of working principles out of Chip Huyen’s AI Engineering (2024) — the canonical text for the kind of app Domi is. This sprint I finished that work: four principles docs, including a new one drawn from the book’s Chapter 7 on model adaptation. That chapter lays out the escalation ladder — prompt engineering, then RAG, then finetuning — and writing the principle forced me to say out loud why finetuning is off the table for V1. Domi is tenant-isolated and Law 25-bound. Finetuning a shared model on user data would either leak one household’s facts into another’s completions or require per-tenant models nobody can afford. Prompt + RAG is not a compromise here; it’s the correct ceiling for a privacy-first, multi-tenant app. Good to have that written down instead of re-litigating it every few weeks.
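To make the isolation argument concrete, here is a toy sketch of why prompt + RAG fits a multi-tenant app: each household's facts enter the model only through the per-request context, so scoping is a filter rather than something baked into shared weights. The `Fact` shape, tenant IDs, and function names are illustrative assumptions; in a real system the scoping would be enforced at the database layer (RLS), not by an in-memory filter.

```typescript
// Illustrative types only; not Domi's actual schema.
interface Fact { tenantId: string; text: string; sourceDocId: string; }

// Select only the requesting tenant's facts, each tagged with its source document.
function buildContext(allFacts: Fact[], tenantId: string, limit = 5): string {
  return allFacts
    .filter((fact) => fact.tenantId === tenantId)
    .slice(0, limit)
    .map((fact) => `- ${fact.text} [doc:${fact.sourceDocId}]`)
    .join("\n");
}

// The shared model sees tenant data only inside this per-request prompt.
function buildPrompt(question: string, context: string): string {
  return [
    "Answer using only the facts below. Cite the doc id for each fact used.",
    "Facts:",
    context,
    `Question: ${question}`,
  ].join("\n");
}
```

Nothing about one household persists in the model between requests, which is the property a shared finetuned model could not guarantee.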

Then I did the thing that surprised me: I turned the principles into a rubric and pointed the codebase at itself.

A 10-dimension audit, run by 21 agents, burning about 3.3 million tokens. Each dimension (eval depth, AI-safety / prompt-injection hardening, observability completeness, prompt management, the role/gateway abstraction, provenance, RLS, cost discipline, and so on) got worked over, and crucially, findings got adversarially verified by a second pass instead of taken at face value. That verification step was the difference between a useful report and a noisy one: a meaningful chunk of the 100-odd first-pass “gaps” were false positives, things the auditor thought were missing that were actually implemented somewhere it hadn’t looked. Killing those left 65 verified gaps and a prioritized roadmap.
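The generate-then-verify shape can be sketched as a small pipeline. Everything here is a stand-in: in the audit both passes were agents, whereas this sketch models the second pass as a plain function, and the `Finding` fields and severity scale are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration.
interface Finding { id: string; dimension: string; claim: string; severity: 1 | 2 | 3; }
interface Verdict { confirmed: boolean; evidence: string; }
type Verifier = (finding: Finding) => Verdict;
type ReviewedFinding = Finding & { evidence: string };

// A first-pass finding only counts once the second pass confirms it.
// Confirmed findings come back sorted by severity, highest first.
function verifyFindings(
  findings: Finding[],
  verify: Verifier,
): { verified: ReviewedFinding[]; rejected: ReviewedFinding[] } {
  const verified: ReviewedFinding[] = [];
  const rejected: ReviewedFinding[] = [];
  for (const finding of findings) {
    const verdict = verify(finding);
    const reviewed: ReviewedFinding = { ...finding, evidence: verdict.evidence };
    if (verdict.confirmed) {
      verified.push(reviewed);
    } else {
      rejected.push(reviewed);
    }
  }
  verified.sort((a, b) => b.severity - a.severity);
  return { verified, rejected };
}
```

Keeping the rejected list, with the evidence that killed each claim, is what makes the false-positive rate visible instead of silently discarded.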

The verdict was honestly reassuring. The foundations I’ve spent 29 sprints on held up: the role-based LLM gateway (never call a provider directly), provenance on every extracted fact, RLS on every tenant table, the eval discipline. Where the gaps clustered was predictable in hindsight: eval depth (I have evals, but thin ones), AI-safety / prompt-injection hardening (the document-RAG and chat surfaces are the obvious attack surface and I’ve under-invested there), observability completeness (Sentry’s in now, but coverage isn’t uniform), and prompt management (prompts are scattered, not versioned as first-class artifacts). Sprint 30 will implement the top eight.

What I learned

Two things. First — and I keep relearning this — the most expensive bugs aren’t bugs, they’re architectures that haven’t been noticed yet. The 4 MB wall looked like a one-line fix and was actually a data-path decision.

Second: having an AI app audit itself against the field’s canonical text is genuinely valuable, but only because of the adversarial verification pass. A single agent enumerating “what’s missing” produces a confident, plausible, partly wrong list. Making a second agent prove each claim before it counts is what turned 100-odd assertions into 65 real ones. That pattern (generate, then verify with something incentivized to disagree) is going straight into how I run evals next sprint.

Sentry’s up. The upload wall is gone. And for the first time I have a ranked, verified list of exactly where Domi is weakest. That’s a good place to start Sprint 30.