2026-05-11

Sprint 10 — the carries close

Sprint 8 ended with “sometimes the sign that an abstraction is right is that the work to use it for the second time is faster than the work to build it for the first.” Sprint 9 ended with “using a thing tells you what’s missing in a way no spec can.” Sprint 10 closes on a different note: debt finally clears in the right places.

The plan was a carry-forward + Phase 10 ramp sprint. Five issues queued: cost-line UI in Settings → AI usage (carry from M8 — third sprint of “data is there, UI isn’t”), Beta badge on the public sign-in, chat placeholder hint pointing at /help, a procedural guard for the --custom migration footgun, and a stretch item — crm_optins table + the sign-in opt-in checkbox. The expected outcome was 3-4 of those landing and the rest carrying to Sprint 11. What actually happened: all five merged the same evening, plus a sixth drive-by PR for a stale-copy bug that lived on the public landing for five sprints before anyone noticed.

What shipped

Cost-line UI in Settings → AI usage. Closes the M8 carry. New getLlmUsageLast30Days({ db, tenantId, days? }) aggregate in packages/shared/src/llm/usage.ts: one Postgres GROUP BY over (role, model), then a JS reduce to per-role rollups with the dominant model (largest cost share inside the role). RLS-scoped via withTenant; sorted by cost desc. New _ai-usage-panel.tsx server component in Settings renders the spec’s §4.3 panel: total this month, per-role breakdown, dominant model per row. Empty-state copy when the window has no calls. Currency via Intl.NumberFormat with the active locale (en-CA / fr-CA). Five real-DB integration tests cover totals across roles + models, dominant-model identification, sort order, RLS isolation, and the empty window. Per-day chart and parent-child cost grouping (chat → planner) deferred to V1.5; chat still goes through @ai-sdk/anthropic directly, so chat calls aren’t in llm.calls yet.
Beta badge on public sign-in + /help discoverability hint. Two Phase 10 ramp items batched. The Beta badge is an inline span next to the Domi h1, emerald-on-emerald-tint to match the existing prompt-mark accent — bilingual via a new home.beta_badge key (“Beta” / “Bêta”). The /help hint is a placeholder change: chat.placeholder becomes "ask domi something… (or /help)" (and the French equivalent), giving the slash-command surface that landed in Sprint 9 a foothold on the chat input.
--custom migration procedural guard. The footgun has fired four times: Sprint 6 (app_get_user_tenant), Sprint 8 (llm.calls), and twice in Sprint 9 (gmail_connections + the auto-generated 0012 that tried to recreate llm.calls). Three sprints of “remember to fill the SQL into the file before db:migrate” mental notes did not stick — because the gap between “I have an empty file” and “the migration has been marked applied with no DDL run” is invisible from the CLI output, and the two-step generate-then-fill feels natural. PR #132 promotes the rule from mental note to written convention in two places: CLAUDE.md §6 directly under the additive-only migration rule, and docs/testing/README.md both as a Convention and as a checklist line in the sprint-template “Issues caught” section. Every closeout from Sprint 11 forward will be asked “if you used drizzle-kit generate --custom this sprint, did you cat the migration file before the first db:migrate?”
crm_optins table + signin opt-in checkbox. The stretch item, which turned out not to be a stretch. New table in packages/shared/src/db/schema/crm.ts: email PK, first_name + last_name nullable (back-filled later via name analysis on the email local-part or other signals — low priority for V1 dogfood where N=1), source (‘signin_beta_optin’ for V1), opted_in_at, unsubscribed_at, last_emailed_at. Not tenant-scoped — these are pre-signup leads, mirrors the auth-table convention. The sign-in form gains a checkbox “Keep me updated of progress (occasional emails about new features and milestones)”, defaults off so opt-in must mean opted-in. startSignin gains an optional optInToUpdates boolean; the insert is wrapped in try/catch so failure does not block the magic-link send (the opt-in is a side benefit, not the primary action). ON CONFLICT DO UPDATE clears any prior unsubscribed_at and refreshes opted_in_at so the V1.5 unsubscribe path doesn’t have to retroactively patch the schema. Resend wiring + token-backed unsubscribe URL deferred to V1.5; V1 captures only. Migration 0014_brown_mimic.sql is vanilla additive — drizzle-kit auto-generated it from the schema diff, not a --custom migration, so the new procedural guard didn’t apply.
Drive-by: removed stale Sprint 5 copy from the public landing. During the visual review for PR #131 (Beta badge), I noticed the public sign-in page had SPRINT 5 — CHAT SURFACE above the brand h1 and M5 — chat surface live. M6 next: settings + persistence. in the footer. Both had been Sprint 5 placeholders that nobody updated as the project moved forward. PR #134 removed both rather than refreshed them — the underlying issue wasn’t that the strings were out of date, it was that the public landing was leaking internal sprint state to anyone hitting domiapp.ai. Footer reorganized to justify-end so the LocaleSwitcher sits alone.
Eval re-baseline. pnpm eval --role=chat --matrix against the 17 chat fixtures × 2 providers. Anthropic claude-sonnet-4-6: 17/17 (100.0%) at $0.2293. OpenAI gpt-4o-2024-08-06: 15/17 (88.2%) at $0.1009. Cost essentially flat across both providers vs the Sprint 8 baseline (within 3%). But OpenAI lost one fixture between Sprint 8 and Sprint 10 — the regression list grew from one (escalation-positive-cross-asset) to two (now also failing run-predictions-now). More on that below.
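Going back to the cost-line item at the top of this list, here is a minimal sketch of the per-role rollup step, assuming the GROUP BY returns one row per (role, model). The row shape, the rollupByRole and formatCost names, and the choice of USD as the currency are my assumptions for illustration, and the sketch uses a plain loop over a Map where the real code uses a reduce.

```typescript
// Hypothetical row shape: one row per (role, model) pair from the GROUP BY.
type UsageRow = { role: string; model: string; costUsd: number; calls: number };

type RoleRollup = {
  role: string;
  costUsd: number;
  calls: number;
  dominantModel: string; // model with the largest cost share inside the role
};

function rollupByRole(rows: UsageRow[]): RoleRollup[] {
  const byRole = new Map<string, { rollup: RoleRollup; topCost: number }>();
  for (const row of rows) {
    const entry = byRole.get(row.role);
    if (!entry) {
      byRole.set(row.role, {
        rollup: {
          role: row.role,
          costUsd: row.costUsd,
          calls: row.calls,
          dominantModel: row.model,
        },
        topCost: row.costUsd,
      });
      continue;
    }
    entry.rollup.costUsd += row.costUsd;
    entry.rollup.calls += row.calls;
    // Track the single most expensive model seen so far for this role.
    if (row.costUsd > entry.topCost) {
      entry.topCost = row.costUsd;
      entry.rollup.dominantModel = row.model;
    }
  }
  // Sorted by cost, most expensive role first.
  return Array.from(byRole.values())
    .map((entry) => entry.rollup)
    .sort((a, b) => b.costUsd - a.costUsd);
}

// Assumption: amounts are stored in USD; only the locale varies.
function formatCost(amountUsd: number, locale: "en-CA" | "fr-CA"): string {
  return new Intl.NumberFormat(locale, { style: "currency", currency: "USD" }).format(amountUsd);
}
```

An empty input returns an empty array, which is the case the panel's empty-state copy would cover.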

What surprised me

The “stretch” was not the stretch. I’d labeled crm_optins as the cuttable item if cost-line UI grew. The cost-line UI turned out to be small: one aggregate function (~30 lines), one server component (~70 lines), five integration tests (~150 lines), a couple of i18n entries. Maybe 90 minutes total. The CRM opt-in was bigger — schema file, migration generation, drizzle config update, server action change, form change, two i18n keys, two locale files, a careful read of the failure-tolerance pattern (insertion failure does not block the magic-link). Maybe two hours. The thing I was most worried about was the easiest; the thing I called stretch was the actual mid-sized item. Useful re-calibration: I tend to over-estimate “I have to query a partitioned table” complexity and under-estimate “another column on a public form, plus a server action change.” Surface-area work compounds quietly.
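The failure-tolerance pattern mentioned above is small enough to sketch. This is an illustration under assumptions, not the actual startSignin: the injected sendMagicLink, recordOptIn, and logWarning dependencies are hypothetical names, and I am assuming the opt-in write happens before the send.

```typescript
// Hypothetical dependencies, injected so the sketch stays self-contained.
type SigninDeps = {
  sendMagicLink: (email: string) => Promise<void>;
  recordOptIn: (email: string) => Promise<void>; // would run the crm_optins upsert
  logWarning: (message: string, error: unknown) => void;
};

async function startSigninSketch(
  deps: SigninDeps,
  input: { email: string; optInToUpdates?: boolean },
): Promise<void> {
  if (input.optInToUpdates) {
    try {
      await deps.recordOptIn(input.email);
    } catch (error) {
      // The opt-in is a side benefit: log the failure and keep going.
      deps.logWarning("crm_optins insert failed", error);
    }
  }
  // The primary action runs whether or not the opt-in write succeeded.
  await deps.sendMagicLink(input.email);
}
```

The point of the shape is that only the opt-in write sits inside the try/catch, so a failed magic-link send still surfaces as an error.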

Written conventions stick where mental notes don’t. The --custom migration footgun has been the same bug four times across three sprints: drizzle-kit generate --custom emits an empty placeholder file, you forget to write the SQL into it before db:migrate, the migration row gets inserted, the DDL never runs, and re-running migrate is a no-op because the row says it’s done. Each time I caught the issue, fixed it manually, and added a “remember next time” note to my mental checklist. Each time that mental note failed me. Sprint 10’s PR #132 puts the rule in two written places: CLAUDE.md §6 and the testing README convention list, plus a checklist line in the sprint-N.md template. The fix isn’t to remember harder, it’s to make the rule a question your closeout template asks every sprint. I’d been resisting writing it down because it felt like over-engineering for a solo build, but four occurrences is the data point. Writing rules down isn’t bureaucracy when you’re the only person who has to follow them; it’s the thing that lets you stop allocating attention to remembering them.
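PR #132 is a written convention, not code. If the rule ever needs an automated backstop, one possible shape is a check that a migration file contains something other than comments and whitespace. This is a sketch under that assumption: hasExecutableSql is a hypothetical helper, and the comment stripping is deliberately naive (it would also strip a "--" that appears inside a string literal).

```typescript
// Returns true when the migration text contains anything besides
// SQL comments and whitespace. Hypothetical helper, not part of drizzle-kit.
function hasExecutableSql(migrationSql: string): boolean {
  // Drop /* ... */ block comments first, then "-- ..." line comments.
  const withoutBlockComments = migrationSql.replace(/\/\*[\s\S]*?\*\//g, "");
  const withoutLineComments = withoutBlockComments.replace(/--.*$/gm, "");
  return withoutLineComments.trim().length > 0;
}
```

A pre-migrate script could read the newest migration file, pass its contents through this function, and refuse to continue when it returns false.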

The public landing leaked internal sprint state for five sprints. SPRINT 5 — CHAT SURFACE had been at the top of the public sign-in page since, well, Sprint 5. By Sprint 10 we were five milestones past M5, two phases past Phase 5, and the page still announced “Sprint 5 — chat surface” to anyone who hit the URL. No test would catch this. No lint rule, no type check, no build step has any reason to flag stale copy. The only time the public surface gets reviewed is when JF himself opens it in a browser, and while the authenticated app gets opened approximately every day, the public landing gets opened once every couple of weeks when checking deploy state. Sprint 9’s lesson was “using the product surfaces gaps the spec doesn’t”; Sprint 10’s variant is “the public surface needs a different review cadence than the rest of the product.” I’m adding “did you actually look at the public landing this sprint?” to the closeout template as a follow-up.

The chat-roundtrip flake is the same bug as the blog pubDate duplicate. Sprint 9 closed with a fix to the jfgailleur-blog homepage — two posts had identical pubDate timestamps, the sort fell back to filesystem read order, and the wrong post displaced the right one. Sprint 10 had the chat-roundtrip.test.ts flake re-fire during PR #130’s verification: two messages inserted in the same INSERT batch get identical createdAt, the ORDER BY createdAt returns them in undefined order, the test sometimes asserts .role === "user" against the assistant message. Same bug class, two unrelated codebases. The fix is the same too: bump one of the two values by some non-zero amount so the sort is deterministic. I’d flagged it as not-this-sprint while writing PR #130’s commit message; logging it as a one-line follow-up for next sprint’s chat-touching PR.
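A minimal sketch of that fix, assuming messages are plain objects with a createdAt field. The ChatMessage type and the buildRoundtrip and byCreatedAt names are illustrative, not taken from the test file.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string; createdAt: Date };

// Give the two rows distinct timestamps instead of letting a single
// INSERT batch stamp both with the same value.
function buildRoundtrip(
  userText: string,
  assistantText: string,
  now: number = Date.now(),
): ChatMessage[] {
  return [
    { role: "user", content: userText, createdAt: new Date(now) },
    { role: "assistant", content: assistantText, createdAt: new Date(now + 1) },
  ];
}

// In-memory stand-in for ORDER BY createdAt.
function byCreatedAt(a: ChatMessage, b: ChatMessage): number {
  return a.createdAt.getTime() - b.createdAt.getTime();
}
```

With a one-millisecond gap the comparator never sees a tie, so the user message sorts first whatever order the rows come back in.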

OpenAI grew a second regression on the same shape as the first. Sprint 8’s eval matrix closed with gpt-4o failing one fixture: escalation-positive-cross-asset, the cross-asset synthesis case where the user asks “Looking at my car, my house, and my health, what should I prioritize this month?” Anthropic calls escalate_to_plan, OpenAI answers without escalating. I’d flagged that as “gpt-4o reads the cross-asset trigger more conservatively than Sonnet does.” The Sprint 10 re-baseline confirmed that — and added a second failure: run-predictions-now, where the user asks “run my predictions now” and Anthropic calls the predict_task tool while OpenAI answers conversationally. Same shape, different fixture. gpt-4o, on the same prompt and the same tool description as Sonnet, reads imperative requests as informational more often. The right answer isn’t to rewrite the tool descriptions — that risks pulling Sonnet into over-calling, which is worse than the under-calling we have today. The eval is the surface that tracks this; the regression list grew, gets logged, gets watched. The 88.2% pass rate still clears the 0.80 floor, so OpenAI remains a viable alternative provider for chat. But “the abstraction holds” from Sprint 8 is the structural answer; the model-specific answer is “gpt-4o is roughly 5-10% more conservative on tool-use triggers, and that gap is unlikely to close without prompt-engineering work I’m not going to do during V1.” Worth knowing before any future “should we switch the chat default?” decision lands.
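The two checks behind that paragraph are simple arithmetic, sketched here with hypothetical helper names (passRate, newRegressions, PASS_FLOOR): the pass rate against the 0.80 floor, and the set difference between the two baselines' failure lists.

```typescript
const PASS_FLOOR = 0.8;

function passRate(passed: number, total: number): number {
  return total === 0 ? 0 : passed / total;
}

// Fixtures failing now that were not failing in the previous baseline.
function newRegressions(previousFailures: string[], currentFailures: string[]): string[] {
  const known = new Set(previousFailures);
  return currentFailures.filter((fixture) => !known.has(fixture));
}

const openAiRate = passRate(15, 17); // 15 of 17 chat fixtures
const clearsFloor = openAiRate >= PASS_FLOOR;
const added = newRegressions(
  ["escalation-positive-cross-asset"],
  ["escalation-positive-cross-asset", "run-predictions-now"],
);
```

With the numbers from this sprint, clearsFloor is true and added contains only run-predictions-now.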

Five issues, five PRs, zero rollovers, ~10 hours. Sprint 8 was two evenings of work. Sprint 9 was two evenings plus three mid-sprint scope expansions. Sprint 10 was two evenings of pure throughput. The pace is real but I’m tracking it carefully — Sprint 11 has Terms + Privacy page content (which is paperwork-heavy and slow), audit-log search UI (which is genuinely complex), and the start of the WCAG spot check (which always uncovers more than budgeted). The clean Sprint 10 number is partly a consequence of all the items being right-sized; that’s not guaranteed to repeat.

Where Sprint 11 picks up

Phase 10 continues. Three concrete items want to land:

Terms of Service and Privacy Policy pages. /[locale]/terms and /[locale]/policy, markdown-rendered content per supported locale, footer + brand-menu links, sourced from the threat-model + PIA artifacts. The pages are user-facing surfaces; the PIA already on the gate list is the internal artifact. This is mostly content work, not code.
Per-day cost chart in Settings → AI usage. The aggregate function already does the role × model grouping; the per-day axis is one more date_trunc('day', occurred_at) group. A small chart over the last 30 days adds context to “Total this month: $X” by showing whether costs are stable, growing, or dominated by a single day. Half a sprint of work, max.
The chat-roundtrip flake. Insert two messages with explicit createdAt: new Date(now) and createdAt: new Date(now + 1) so the ORDER BY is deterministic. Five minutes; can ride along with whatever else touches the chat code path.
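For the per-day cost chart item above, the extra grouping would live in SQL via date_trunc. As an in-memory illustration of the same bucketing, assuming each call carries an occurredAt timestamp and a cost, and bucketing by UTC day (CallCost and dailyTotals are hypothetical names):

```typescript
type CallCost = { occurredAt: Date; costUsd: number };

function dailyTotals(calls: CallCost[]): { day: string; costUsd: number }[] {
  const totals = new Map<string, number>();
  for (const call of calls) {
    // YYYY-MM-DD in UTC; a real query would need to pick a timezone deliberately.
    const day = call.occurredAt.toISOString().slice(0, 10);
    totals.set(day, (totals.get(day) ?? 0) + call.costUsd);
  }
  // Oldest day first, which is the order a chart's x-axis wants.
  return Array.from(totals, ([day, costUsd]) => ({ day, costUsd })).sort((a, b) =>
    a.day < b.day ? -1 : a.day > b.day ? 1 : 0,
  );
}
```

Days with no calls are absent from the result, so a chart over a fixed 30-day window would still need to fill the gaps with zeros.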

Carry-forwards: Gemini + OpenRouter adapters still V1-cuttable per Dev Plan §10 (decision at S18); knowledge-graph viz still cuttable at S12; mobile UX deltas + audit-log search UI not started; first WCAG 2.2 AA spot check pending; the threat-model sign-off, PIA, and DR runbook still to do; 4-week dogfood window still ahead. Sprint 8’s parent-child cost grouping (chat → planner) waits on the chat surface routing through the adapter.

The sprint that closed three carries — M8 cost-line UI, --custom migration mental note, four Phase 10 ramp items in one shot — felt different from the building sprints because most of the work was mopping up rather than breaking new ground. Sprints that ship debt off the books matter more than sprints that ship new features, and they’re harder to write satisfying retros about because you don’t get to point at a new capability. The capability already existed; what changed is that it’s now visible, durable, and codified. That counts.