2026-05-13

Sprint 13 — the read side, and what protects it

Sprint 12 ended on “written rules need operational scaffolding. The rule is real when the code stops working without it, not when the doc says it should.” That sprint built the audit infrastructure CLAUDE.md §6 had been promising for 11 sprints — partitioned audit.events table, the withAudit wrapper that makes “audit log first, mutation second” structurally enforceable, six mutation sites wired. Infrastructure. No user-visible UI.
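For readers who missed S12, here is a minimal sketch of what “audit log first, mutation second” can look like when the wrapper enforces it. The names withAuditSketch, AuditSink, and AuditEvent are hypothetical, and so is the call shape; the real withAudit may differ (for example by running both steps inside one transaction).

```typescript
// Hypothetical shapes for illustration; not the repo's actual types.
type AuditEvent = { action: string; actorId: string; occurredAt: Date };
type AuditSink = { write(event: AuditEvent): Promise<void> };

// The mutation callback only runs after the audit write has resolved.
// If the audit write rejects, the rejection propagates and mutate() is never called.
async function withAuditSketch<T>(
  sink: AuditSink,
  event: Omit<AuditEvent, "occurredAt">,
  mutate: () => Promise<T>,
): Promise<T> {
  await sink.write({ ...event, occurredAt: new Date() });
  return mutate();
}
```

The point of the shape is that a call site has no way to reach the mutation except through the wrapper, so skipping the audit row means the code does not run at all.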

Sprint 13 put the UI on top. Plus finished the largest piece of remaining wiring (the chat-thread interface change). Plus shipped the graph-viz spec doc to unblock that implementation. Plus, when JF asked late on day two for a “what would catch regressions” plan, I walked every merged PR from Sprint 1 through Sprint 13 and produced a catalog of 50+ automatable invariants. Four PRs, all merged, two-day sprint. The audit story closes as a pair with S12, and the regression catalog formalizes 13 sprints of work that had been implicitly tested.

What shipped

Audit-log search UI at /[locale]/settings/audit-log (PR #163, #147). Owner-only surface reading from the audit.tenant_events view. Cursor pagination on (occurred_at DESC, id DESC) — at V1 dogfood scale offset pagination is fine, but audit compounds faster than the rest of the schema (every mutation + every chat turn) and offset would get expensive past the first few pages. Limit capped at 100 server-side. Filters: actor (UUID), action (drop-down driven by listDistinctAuditActions — pulls only actions that actually have rows in this tenant, not the full 26-action union), target kind, date range. All filter state lives in URL search params so deep-links round-trip the active query. Click a row → <details> expands the metadata JSON inline. 10/10 real-DB integration tests cover the round-trip including the 3-page cursor traversal with no-overlap assertion and the cross-tenant RLS-isolation probe.
Chat-thread appendMessage interface change (PR #164, #158 partial). AppendArgs gained a required userId field so audit rows attribute chat messages to their owner. actorKind derives from the chat-message role: user-role messages get actorKind: "user", assistant/tool/system roles get actorKind: "agent" (the user id is still the thread owner so audit search can answer “show me everything in JF’s thread”). Six call sites across /api/chat/route.ts and /api/chat/slash/route.ts updated. The streaming onFinish and escalate_to_plan’s onPlanGenerated callbacks needed a hoisted const userId = session.user.id; after the auth gate because TypeScript drops the narrowing of session.user.id inside a callback (any closure, not only async ones); three typecheck errors became one stable const and zero errors. Plus upload-and-extract.uploadAndExtract got its withAudit({ action: "document.create" }) wrap.
Graph-viz spec doc (PR #165, #152). docs/specs/Domi - Knowledge Graph Viz v0.1.md. Twelve design decisions captured: d3-force direct (skip the wrapper, skip elkjs); SVG render (not Canvas — we’re well below the SVG breaking point at ≤200 nodes, and <title> + DOM inspectability are worth the marginal perf); d3-zoom with [+/-/0/Arrow keys/Esc] keyboard shortcuts for WCAG 2.1.1; on-hover edge labels (always-on labels are visual noise at V1 scale); click-to-pin side panel (25% viewport, not modal — the modal pattern can’t compare nodes, can’t keep the graph visible, and translates poorly to mobile); drag pins a node via fx/fy; secondary filter popover (V1 dogfood households fit in a visual scan); single-member empty state with copy + chat link; concrete performance budgets per metric (initial render <1s at ≤100 nodes; pan/zoom ≥60fps; node hover <100ms); mobile defers to the list-tree per §5.15; route at /[locale]/graph (top-level, not under Settings). Implementation sprint sizing: ~10.5h across 7 named slices. Graph-viz is now a work plan, not a research project.
Regression-suite catalog (PR #166). JF asked late on day two: extract a test plan from merged PRs that could be used to avoid regression, that could be automated, and make it a living document. I walked every merged PR S1 → S13 and wrote docs/testing/regression-suite.md — 50+ entries across 10 categories (DB integrity, crypto, LLM abstraction, document ingestion, email ingestion, chat surface, audit log, UI invariants, process gates, eval matrix). Each entry: sprint/PR origin → one-line invariant → automation status (✅ in CI / ⚠️ partial / ❌ gap / 📋 backlog) → pointer to the test or note on what’s needed. Plus a backlog section ranking the 8 highest-value unauthored automation gaps by ratio of impact to effort: --custom migration placeholder check (~15 min, closes a 4-occurrence footgun), Closes #N PR check (~30 min, closes the S11 board-drift), partition-routing grant probe (closes the S12 0016 surprise generalized), catalog-no-latest-aliases test, allowlist snapshot, extractText non-text-parts filter test, OpenAI known-regressions snapshot, i18n key-presence snapshot. Wired into CLAUDE.md §11 and the sprint-template “Issues caught” line so future closeouts can’t quietly skip adding rows.
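The cursor pagination on (occurred_at DESC, id DESC) from the audit-log item can be sketched in memory as below. This is an illustration under assumptions, not the shipped query: AuditRow, pageAfter, and the cursor encoding are hypothetical, timestamps are assumed to be same-format ISO-8601 UTC strings (so string comparison matches time order), and the SQL in the comment is only a guess at the query shape.

```typescript
// Hypothetical row shape; the real audit.tenant_events view has more columns.
type AuditRow = { id: string; occurredAt: string; action: string };
type Cursor = { occurredAt: string; id: string };

const MAX_LIMIT = 100; // server-side cap described above

// Opaque, URL-safe cursor so it can live in a search param.
const encodeCursor = (c: Cursor): string => encodeURIComponent(JSON.stringify(c));
const decodeCursor = (s: string): Cursor => JSON.parse(decodeURIComponent(s)) as Cursor;

// Ordering for (occurred_at DESC, id DESC): a negative result means a sorts first.
function compareDesc(a: Cursor, b: Cursor): number {
  if (a.occurredAt !== b.occurredAt) return a.occurredAt < b.occurredAt ? 1 : -1;
  if (a.id !== b.id) return a.id < b.id ? 1 : -1;
  return 0;
}

// In-memory stand-in for something like:
//   WHERE (occurred_at, id) < ($1, $2) ORDER BY occurred_at DESC, id DESC LIMIT $3
function pageAfter(rows: AuditRow[], limit: number, cursor?: string) {
  const capped = Math.min(Math.max(1, limit), MAX_LIMIT);
  const after = cursor ? decodeCursor(cursor) : undefined;
  const sorted = [...rows].sort(compareDesc);
  // Keep only rows that sort strictly after the cursor row.
  const eligible = after ? sorted.filter((r) => compareDesc(r, after) > 0) : sorted;
  const page = eligible.slice(0, capped);
  const last = page[page.length - 1];
  const nextCursor =
    eligible.length > capped && last
      ? encodeCursor({ occurredAt: last.occurredAt, id: last.id })
      : null;
  return { page, nextCursor };
}
```

Including id as a tiebreaker is what makes the no-overlap property hold when several rows share a timestamp; a cursor on occurred_at alone would either skip or repeat those rows at page boundaries.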

What surprised me

The audit surface closed as a pair of sprints. Looking at S12 + S13 together: S12 built the schema, the helper, and the convention; S13 built the read UI, completed the largest piece of remaining wiring, and made the surface usable. Neither sprint alone is complete. Two-sprint feature delivery is a real pattern, not a planning failure: infrastructure sprints can’t ship visible surface, and surface sprints assume the infrastructure exists. The pre-kickoff scope check that surfaced the missing audit infrastructure (S12) and the deferred S12 primary that became the S13 primary (#147) are the same insight from two sides. Knowing which sprints are infrastructure and which are surface, and not pretending the boundary doesn’t exist, is more honest than trying to bundle them.

listDistinctAuditActions is the load-bearing UX detail. The action filter dropdown could have listed the entire 26-action AuditAction union — that’s the typed vocabulary. But at V1 dogfood scale most actions won’t have rows yet, and a dropdown of 26 mostly-empty options is bad UX. The helper queries SELECT DISTINCT action from audit.tenant_events and returns only the actions that actually have rows in this tenant. So the dropdown grows as the tenant uses the system. The right UX answer was not the typed vocabulary; it was the data. Worth noticing because the pattern recurs: “show me the data’s possible values” is often better than “show me the schema’s possible values.”
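A small sketch of the difference between the schema’s values and the data’s values. Everything here is illustrative: the union is a three-member stand-in for the real 26-action AuditAction type, listDistinctActionsSketch is a hypothetical in-memory version of the helper, and the explicit tenantId filter stands in for whatever tenant scoping the real view applies.

```typescript
// Illustrative subset only; the real AuditAction union has 26 members.
type AuditAction = "document.create" | "extracted_fact.create" | "extracted_fact.confirm";
type TenantEvent = { tenantId: string; action: AuditAction };

// In-memory stand-in for: SELECT DISTINCT action FROM audit.tenant_events
// Returns only actions that have at least one row for the given tenant.
function listDistinctActionsSketch(events: TenantEvent[], tenantId: string): AuditAction[] {
  const seen = new Set<AuditAction>();
  for (const e of events) {
    if (e.tenantId === tenantId) seen.add(e.action);
  }
  return Array.from(seen).sort();
}
```

The return type is still AuditAction[], so the dropdown keeps the typed vocabulary for its option values while the data decides which of them are worth showing.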

Callbacks and flow narrowing don’t compose in TypeScript. I wrote onFinish: async ({ text }) => { await appendMessage({ ..., userId: session.user.id }) } and got Type 'string | undefined' is not assignable to type 'string' even though session.user.id had been narrowed three lines above by an if (!session?.user?.id) return 401. TypeScript’s narrowing of a property access like session.user.id holds only in the function body where the check ran. A nested callback, async or not, runs later, and the property could have been reassigned by then, so inside the callback the compiler falls back to the declared type. Fix: hoist const userId = session.user.id; right after the auth gate, because a const whose type is already string has nothing left to lose. This is the kind of thing that’s obvious in hindsight and obscure when you first hit it. Worth a comment at the hoist site (which I added) so the next person doesn’t unwind the const-hoist as “unnecessary.”
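The hoist can be reproduced in a few lines. This is a reduced sketch, not the route code: Session, AppendArgs, and handleSketch are hypothetical stand-ins, and the real session type comes from the auth library.

```typescript
// Hypothetical shapes for illustration only.
type Session = { user?: { id?: string } } | null;
type AppendArgs = { text: string; userId: string };

function handleSketch(session: Session, append: (args: AppendArgs) => void) {
  // Auth gate: past this line, session.user.id is narrowed to string in this body.
  if (!session?.user?.id) return { status: 401 as const };

  // Hoisted right after the gate. The const has type string, so closures can use it
  // safely. Do not inline it back into the callback.
  const userId = session.user.id;

  const onFinish = async ({ text }: { text: string }) => {
    // append({ text, userId: session.user.id });
    // ^ does not typecheck: the narrowing from the gate is not carried into the closure.
    append({ text, userId });
  };

  return { status: 200 as const, onFinish };
}
```

The compiler’s caution is reasonable: by the time onFinish runs, something else may have reassigned session.user.id, while the hoisted const captures the value that was actually checked.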

The regression catalog was easier to write than I expected. I’d been worried walking 13 sprints of PRs would take half a sprint of work. It took ~90 minutes. The reason: every closeout test report and every PR description already had the invariants enumerated in their “Test plan” sections — I’d been writing them all along; the catalog is just a re-organization. The data was already in the repo; the catalog formalizes the index. That’s worth knowing because the same trick probably works for other meta-tooling: the artifact that “captures what we’ve been doing all along” is often a tractable extract, not a research project.

The backlog section is more useful than the per-category sections. Cataloging what’s already automated is useful for someone joining the project, but for me — the only person on the project — the value is the gap list. Knowing that --custom migration placeholder check is 15 minutes of work and closes a 4-occurrence footgun is the kind of thing that converts free time into compound interest. I’d been carrying “I should do that someday” mental notes about each of these; writing them down with effort estimates next to each makes the choice “spend 15 min now” instead of “weigh the priority next sprint.” Two of the eight items are pure process gates (no test code, just CI/GitHub Action scripts) that could ship as a single sprint secondary.

Where Sprint 14 picks up

The natural primary is knowledge-graph viz implementation: the S13 spec doc unblocks it, and it’s been one of the bigger remaining Phase 10 items. The spec sizes it at ~10.5h across 7 slices, which fits a sprint cleanly.

Secondaries worth folding in:

Regression-suite backlog top-2: --custom migration placeholder check (~15 min) and Closes #N PR check (~30 min). Both close convention-violation classes that have already bitten me (the placeholder footgun four times, the missing Closes #N as the S11 board drift). Could ride along with graph-viz as small wins.
Remaining withAudit wiring (#158, ~5 mechanical wraps): gmail-watch.activateGmailWatch + refreshGmailWatch, extraction.ts extracted_fact.create, auto-write.ts extracted_fact.confirm + downstream writes, predict/engine.ts system writes, gmail-oauth.refreshGmailAccessToken. All mechanical now that the chat-thread interface change is done. 1-2 hours.
WCAG browser-verification run (#150) — still pending a real browser session. If JF runs it during S14, results log into docs/wcag-verification-plan-s12.md and any new findings graduate to issues.

Carry-forwards still alive: first mobile UX delta (camera capture is the larger of the two; table-to-card already shipped S12); the non-functional gates (threat-model sign-off, PIA, DR runbook — all V1-ship-blocking, all paperwork-shaped, none started); Sprint 8 carry-forwards (Gemini + OpenRouter; parent-child cost grouping).

The interesting thing I’m taking from S13 is the same shape I’m taking from S11 and S12: the most useful artifacts are often the ones that crystallize what was already implicit. S11 wrote down the Closes #N convention that PR descriptions had been violating silently. S12 wrote down the audit-then-mutate wrapper that turned a convention into a structure. S13 wrote down the regression catalog that turned “I’d been writing test plans” into “here’s the list of invariants and what protects them.” None of these introduced new behavior; all of them made existing behavior visible. The next sprint’s most valuable artifact is probably something I’m already doing implicitly and haven’t named yet.