2026-05-26

Sprint 26 — the loop measures itself

S25 was 47 PRs in 7 days of reactive dogfood. S26 was the inverse shape — 7 product PRs in one calendar day, almost all of them about closing the observability gap that S25’s papercut sprint left wide open. The two perf-side Requirements §11 indicators both PASS in the first measurement, but the headline lesson is that the measurement itself was wrong until we fixed it mid-sprint.

What shipped

The observability arc. Three PRs.

#423 — @vercel/speed-insights in the locale layout. Auto-disabled in dev, no-op on non-Vercel hosts, renders nothing visible. +1 KB on the shared first-load JS. Closes the “what are my actual Core Web Vitals?” question.
#433 — perf measurement methodology + script + week-1 doc. scripts/measure-perf.mjs queries the production DB for the two §11 perf-side indicators: upload→first-fact latency (single-page, < 30s avg bar), and operating cost (≤ $100/month bar). Output is a markdown block ready to paste into the weekly doc. docs/perf/README.md documents which SQL, which rows, what each indicator captures vs. what it doesn’t.
#435 — chat-route cost telemetry + rename Opex → COS. While writing #433 I noticed the chat route doesn’t actually record llm.calls rows for its main generateText / streamText calls — only the extraction pipeline and the escalation planner do. The week-1 COS projection therefore understated real spend by ~$13-50/month (chat is the dominant cost surface). #435 wires recordChatLlmCall into both the multipart generateText (post-await) and the streaming streamText.onFinish (where aggregate usage across all steps lives). Anthropic prompt-caching honored — cached input tokens charge at the cheaper cachedInputUsdPerMillion rate per config/pricing.ts. Bonus rename: “Opex” → “COS” (Cost of Sales) because every LLM call directly serves a household interaction, so it’s marginal cost to deliver the service, not general operating cost.
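
The cached-token billing rule in #435 can be sketched as below. This is a minimal illustration, assuming cached input tokens are a subset of total input tokens. Only cachedInputUsdPerMillion is named in this post; the other field names, both interfaces, and the example rates are hypothetical and are not the values in config/pricing.ts.

```typescript
// Hypothetical pricing shape; only `cachedInputUsdPerMillion` is named in the post.
interface ModelPricing {
  inputUsdPerMillion: number;
  cachedInputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

// Hypothetical usage shape. Assumption: `cachedInputTokens` is a subset of `inputTokens`.
interface TokenUsage {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

function costUsd(usage: TokenUsage, pricing: ModelPricing): number {
  // Clamp so a malformed usage report can never yield a negative uncached count.
  const cached = Math.min(Math.max(usage.cachedInputTokens, 0), usage.inputTokens);
  const uncached = usage.inputTokens - cached;
  return (
    (uncached * pricing.inputUsdPerMillion +
      cached * pricing.cachedInputUsdPerMillion +
      usage.outputTokens * pricing.outputUsdPerMillion) /
    1_000_000
  );
}

// Illustrative rates only, not the project's pricing table.
const examplePricing: ModelPricing = {
  inputUsdPerMillion: 3,
  cachedInputUsdPerMillion: 0.3,
  outputUsdPerMillion: 15,
};

const exampleCost = costUsd(
  { inputTokens: 5_000, cachedInputTokens: 4_000, outputTokens: 500 },
  examplePricing,
);
```

With these illustrative rates, billing the 4,000 cached tokens at the cached rate instead of the full input rate is the difference the post refers to as honoring prompt caching.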

Dogfood papercuts. Four PRs.

#425 — five fixes in one. The container name (containerAssetName) now renders on the asset propose card so the user can verify it pre-confirm. “Tamias home” matches a residence named “Tamias” via suffix-strip. /api/chat/proposals/[id]/confirm calls revalidatePath on the affected pages per toolName, so freshly confirmed entities show in the graph without a manual reload. Tier-3 prefix-match dedup catches “Thalassa” vs “Thalassa Storm Clouds.” New formatYear helper so year_installed: 2018 renders as “2018”, not “2,018”.
#427 — JF’s “fix this once and for all” ask. Audit of the applicator layer found 5 name-resolution sites still on ad-hoc LOWER-equality while 4 others had already switched to the tolerant matcher + LLM Haiku fallback per S25 #404. Brought them all into alignment: container resolution, task assetName, obligation payer, transaction asset + member. Deterministic 4-stage matcher first; LLM fallback only fires when all 4 stages return null (~$0.0001/call).
#429 — invoice for a brand-new asset that doesn’t exist yet was getting orphaned. The upload-time auto-link (matchAssetFromText from S24 PR-C) only fires for existing asset names; when the propose creates a NEW asset, the doc just sits with asset_id=NULL. Fix: propose_asset server-stamps sourceDocumentId when exactly one unlinked doc was uploaded in the turn; applicator links the doc on confirm (insert + merge paths) only when documents.asset_id IS NULL (no silent cross-asset stealing — same rule as PR #415 from S25).
#431 — three changes under one principle: the user owns the description of their documents; admin never sees content. (A) listAllDocumentTypeRegistry redacts description + firstSeenIn{Tenant,Document}Id from the admin catalog query — those were seeded from the first sighter’s actual document text, which is a per-household content leak. The DB columns stay populated for V1.5 attribution work; only the read path redacts. (B) New documents.user_summary column + UserSummaryEditor component — when set, overrides the (often low-confidence) extraction summary on the doc detail page + the documents list headline. Badge becomes “user-confirmed” when set; “awaiting your input” when extraction is low-confidence + unset. Audit logs lengths only, never the user-entered text. (C) Optional description input on the per-asset upload form, plumbed through to seed user_summary at INSERT.
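
The deterministic-first matching in #425 and #427 can be sketched as a staged matcher. The post does not list the four stages, so the stages below are assumptions built from the techniques it does name (case-insensitive equality, suffix-strip, prefix-match), plus a normalization stage added for illustration. The suffix list and all function names are hypothetical.

```typescript
// Hypothetical sketch: stage order and contents are assumptions, not the shipped matcher.
const GENERIC_SUFFIXES = ["home", "house", "residence"]; // illustrative list

function normalize(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Drop one trailing generic word, e.g. "tamias home" -> "tamias".
function stripSuffix(name: string): string {
  const words = name.split(" ");
  const last = words[words.length - 1] ?? "";
  return words.length > 1 && GENERIC_SUFFIXES.includes(last)
    ? words.slice(0, -1).join(" ")
    : name;
}

// Returns the matched candidate, or null when no stage yields a unique hit.
// A null result is where a caller would fall back to the LLM matcher.
function matchName(query: string, candidates: string[]): string | null {
  const q = normalize(query);
  const stages: Array<(candidate: string) => boolean> = [
    (c) => c.toLowerCase() === query.toLowerCase(), // 1. case-insensitive equality
    (c) => normalize(c) === q, // 2. punctuation- and whitespace-insensitive equality
    (c) => stripSuffix(normalize(c)) === stripSuffix(q), // 3. suffix-strip
    (c) => {
      // 4. whole-word prefix match in either direction
      const n = normalize(c);
      return n.startsWith(q + " ") || q.startsWith(n + " ");
    },
  ];
  for (const stage of stages) {
    const hits = candidates.filter(stage);
    if (hits.length === 1) return hits[0] ?? null;
    if (hits.length > 1) return null; // ambiguous: refuse to guess
  }
  return null;
}

const assets = ["Tamias", "Thalassa Storm Clouds"];
const bySuffix = matchName("Tamias home", assets);
const byPrefix = matchName("Thalassa", assets);
const noMatch = matchName("Boat", assets);
```

Returning null on ambiguity rather than picking the first hit is a design choice in this sketch: it keeps the deterministic stages from silently linking to the wrong entity.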

Both perf-side indicators PASS week one

Latency comes in well under bar — 4.8s avg (p50 4.3, p95 7.9, max 8.6) on 20 single-page extractions in the window, vs the 30s bar. That’s a ≈6× margin and includes the vision model call + the DB roundtrip + everything else in the extraction pipeline.
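
The summary statistics above (avg, p50, p95, max) can be computed along these lines. This is a sketch only: it uses the nearest-rank percentile definition, which may differ from what scripts/measure-perf.mjs does, and the sample values are made up for illustration, not the sprint's measurements.

```typescript
interface LatencySummary {
  count: number;
  avg: number;
  p50: number;
  p95: number;
  max: number;
}

// Nearest-rank percentile on an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1] as number;
}

function summarize(latenciesSec: number[]): LatencySummary {
  if (latenciesSec.length === 0) throw new Error("no samples in window");
  const sorted = [...latenciesSec].sort((a, b) => a - b);
  const avg = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    count: sorted.length,
    avg,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1] as number,
  };
}

// The bar in the post is on the average: strictly under 30 seconds.
function passesLatencyBar(summary: LatencySummary, barSec = 30): boolean {
  return summary.avg < barSec;
}

// Made-up sample, in seconds.
const sampleSummary = summarize([5, 3, 8, 4]);
const samplePasses = passesLatencyBar(sampleSummary);
```

With only 20 extractions in the window, p95 under nearest-rank is effectively the second-slowest sample, so it will move noticeably as the sample grows.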

COS comes in at $0.91/month projected on 30-day rolling, vs the $100 bar. Two roles in the breakdown: extract_document at $0.21 and generate_plan at $0.008.

The honest read: the COS number is low by more than an order of magnitude, a known gap until #435 lands and propagates. The chat surface is the dominant cost driver and it wasn’t being recorded. Rough estimate of the missing chunk: ~30 chat turns/day × ~5K input tokens/turn × $3/M input tokens ≈ $0.45/day, or ~$13.50 over a 30-day month, with tool-use-heavy turns pushing toward a $30-50/month upper bound. Even with that correction we’re at $14-50/mo against a $100 bar, still well under. But the projection should be honest, and starting week 2 it will be.
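
The back-of-envelope estimate for the missing chat spend works out as follows. The inputs are the rough figures from the paragraph above; the function name and the 30-day month are assumptions for the sketch.

```typescript
// Projects monthly spend on chat input tokens from per-day usage.
function monthlyInputCostUsd(
  turnsPerDay: number,
  inputTokensPerTurn: number,
  usdPerMillionTokens: number,
  daysPerMonth = 30,
): number {
  const tokensPerDay = turnsPerDay * inputTokensPerTurn; // 30 * 5,000 = 150,000
  const usdPerDay = (tokensPerDay / 1_000_000) * usdPerMillionTokens; // about $0.45
  return usdPerDay * daysPerMonth; // about $13.50
}

const missingChatSpend = monthlyInputCostUsd(30, 5_000, 3);
```

The estimate counts input tokens only, which is why it sits at the low end of the range; multi-step tool-use turns resend context and are what push the figure toward the upper bound.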

The Speed Insights dashboard tells the same story: Real Experience Score 100, all metrics in the “Great” zone, per-route /graph 97, /assets/[id] 100, /tasks 94. But Speed Insights only turned on 2026-05-25, so the sample sizes are tiny (6-26 per route, only a few hours of post-activation data). The week-2 doc gets the first meaningful trend.

What surprised me

The chat route was bypassing cost telemetry for months. The LLM abstraction layer was designed precisely so internal callers invoke by role + the adapter records cost telemetry centrally (CLAUDE.md §6). But the chat route uses the AI SDK directly (generateText / streamText) rather than going through an adapter — which is the right call for AI SDK ergonomics but bypassed the auto-recording. The abstraction is the only thing that guarantees “every provider call gets billed in llm.calls,” and the chat route’s SDK-direct shape broke that invariant silently. Fixed via a small recordChatLlmCall shim that constructs the adapter-shaped result from the SDK’s raw usage. Lesson written into the regression-suite for future SDK-direct paths.
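
The shape of such a shim can be sketched as follows. Everything here is hypothetical: the types are not copied from the codebase or from the AI SDK, the function is named recordChatLlmCallSketch to make clear it is not the shipped recordChatLlmCall, and swallowing telemetry errors is an assumption about the desired failure mode.

```typescript
// Hypothetical shapes for illustration only.
interface RawSdkUsage {
  inputTokens?: number;
  outputTokens?: number;
  cachedInputTokens?: number;
}

interface LlmCallRecord {
  role: string;
  model: string;
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

type LlmCallSink = (record: LlmCallRecord) => void;

// Map raw usage onto the adapter-shaped record, treating missing counters as zero.
function toLlmCallRecord(role: string, model: string, usage: RawSdkUsage): LlmCallRecord {
  return {
    role,
    model,
    inputTokens: usage.inputTokens ?? 0,
    cachedInputTokens: usage.cachedInputTokens ?? 0,
    outputTokens: usage.outputTokens ?? 0,
  };
}

// Assumption: a telemetry failure is reported as `false` instead of failing the chat turn.
function recordChatLlmCallSketch(
  sink: LlmCallSink,
  role: string,
  model: string,
  usage: RawSdkUsage,
): boolean {
  try {
    sink(toLlmCallRecord(role, model, usage));
    return true;
  } catch {
    return false;
  }
}

// In-memory sink standing in for an insert into llm.calls.
const recorded: LlmCallRecord[] = [];
const recordedOk = recordChatLlmCallSketch(
  (r) => {
    recorded.push(r);
  },
  "chat", // hypothetical role name
  "example-model", // placeholder model id
  { inputTokens: 5_000, outputTokens: 400 },
);
```

A regression check for future SDK-direct paths could then assert that every provider call in a test run produces exactly one record in the sink.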

Schema-orthogonal fields aren’t read-path-orthogonal. I documented this in S25 #415 (set-photo applicator auto-link) and S26 #429 ran into the same pattern. documents.asset_id is conceptually orthogonal to assets.primary_photo_document_id, but getAssetDetail joins them in the read path — so a doc set as primary without being linked renders both slots empty. The schema diagram doesn’t surface this dependency. PR #429’s source-doc auto-link is the same shape: a “this asset came from this invoice” relationship needs to be EXPLICITLY linked at write time even though the DB lets you skip it, because every read path that asks “what documents does this asset have?” joins on asset_id.
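
The write-time linking rule from #429 can be sketched with an in-memory stand-in for the documents table. The row shape and function names are hypothetical, and reading "exactly one unlinked doc" as "exactly one of the turn's uploads has no asset yet" is an assumption.

```typescript
interface DocumentRow {
  id: string;
  assetId: string | null;
}

// Propose time: return the single unlinked upload from this turn,
// or null when there are zero or several (no guessing).
function pickSourceDocumentId(turnUploads: DocumentRow[]): string | null {
  const unlinked = turnUploads.filter((d) => d.assetId === null);
  const [only] = unlinked;
  return unlinked.length === 1 && only ? only.id : null;
}

// Confirm time: in-memory equivalent of
//   UPDATE documents SET asset_id = $1 WHERE id = $2 AND asset_id IS NULL
// Returns true only when a link was actually written.
function linkDocumentIfUnlinked(
  docs: Map<string, DocumentRow>,
  documentId: string,
  assetId: string,
): boolean {
  const doc = docs.get(documentId);
  if (!doc || doc.assetId !== null) return false; // never steal from another asset
  doc.assetId = assetId;
  return true;
}

const docTable = new Map<string, DocumentRow>([
  ["doc-1", { id: "doc-1", assetId: null }],
  ["doc-2", { id: "doc-2", assetId: "asset-9" }],
]);
const sourceId = pickSourceDocumentId([...docTable.values()]);
const linked = sourceId !== null && linkDocumentIfUnlinked(docTable, sourceId, "asset-new");
const stolen = linkDocumentIfUnlinked(docTable, "doc-2", "asset-new");
```

Putting the IS NULL check in the write itself, rather than in a separate read, keeps the rule intact even if two confirms race.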

Admin-readable catalogs leak more than they look like they leak. PR #431. The document_type_registry is the shared catalog of doc-type schemas — across-tenant, by design, so admin can curate the kinds. But its description field was seeded from the first sighter’s actual extraction text, which means a household’s invoice content was visible to admin via the catalog. Same pattern with firstSeenInTenantId + firstSeenInDocumentId — those identify which household + document first introduced each type, which is its own privacy concern. Lesson: when a cross-tenant table has a column that gets seeded from per-tenant data, the read path needs an explicit redaction step. Adding more cross-tenant catalogs in the future (V1.5 kind_registry unification, for example) needs the same review.
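
A read-path redaction step of this kind can be sketched as an allow-list projection. The row shape beyond description, firstSeenInTenantId, and firstSeenInDocumentId is hypothetical, and dropping the fields (rather than nulling them) is an assumption about how listAllDocumentTypeRegistry redacts.

```typescript
// Hypothetical stored row; `id` and `kind` are illustrative field names.
interface DocumentTypeRegistryRow {
  id: string;
  kind: string;
  description: string | null; // seeded from a household's document text
  firstSeenInTenantId: string | null;
  firstSeenInDocumentId: string | null;
}

// What the admin catalog is allowed to see.
interface AdminRegistryView {
  id: string;
  kind: string;
}

// Allow-list projection: a column added later stays hidden until someone opts it in.
function toAdminView(row: DocumentTypeRegistryRow): AdminRegistryView {
  return { id: row.id, kind: row.kind };
}

function listForAdmin(rows: DocumentTypeRegistryRow[]): AdminRegistryView[] {
  return rows.map(toAdminView);
}

const adminRows = listForAdmin([
  {
    id: "t1",
    kind: "invoice",
    description: "text lifted from a household document",
    firstSeenInTenantId: "tenant-1",
    firstSeenInDocumentId: "doc-1",
  },
]);
```

An allow-list is the stricter of the two options; a deny-list that removes only the three known fields would need revisiting every time the table gains a column seeded from per-tenant data.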

The dashboard tells you the route is fast even when the user says it’s slow. JF reported in S26 #422 that the UI feels slow on navigation chat → tasks → graph. Speed Insights came back showing all routes RES > 90, all metrics in the “Great” zone. Two reasons not to dismiss the user report: (1) tiny sample size — the dashboard only activated this sprint, so the “felt slow” episodes from before activation aren’t measured at all; (2) the metric measures the page itself, not the SPA-like client transitions that happen between routes (RSC streaming, fetch-during-navigation, etc.) which Speed Insights doesn’t capture cleanly. The week-2 measurement gets the first meaningful sample; if perception still doesn’t match the dashboard, the next dig is Sentry traces (S12) for per-request server timing.

Decisions made

Insurance gets typed FKs on obligations for V1, not the general relationships table from spec §7.5. The spec lays out a principled cross-entity edges table (member ↔ asset ↔ obligation ↔ contact ↔ transaction). For V1 dogfood scope JF wants insurance visible in the graph this week; option B (typed FKs insured_asset_ids[] + insured_member_ids[] + graph view extension) ships in days, while option C (general relationships table) is a meaningful refactor touching every existing edge. Option B’s data shape is forward-compatible with option C (same logical edges, different storage), so the V1.5 migration is mechanical. Decision: S28 ships option B; V1.5 migrates to option C if dogfood reveals other cross-entity edges that want the same shape.
V1 paperwork pulled forward from “post-dogfood” to S27 (dogfood week 3). Original sequencing per JF direction 2026-05-15: write threat model + PIA + DR runbook AFTER dogfood reveals actual data flows + perf posture, so the docs describe verified behavior rather than design hypothesis. Updated decision 2026-05-25: dogfood has already surfaced enough actual behavior to make the docs honest — kind_registry, obligations/contacts/transactions, member relationships, asset hierarchy, address jsonb, user_summary all shipped + dogfooded. Pull paperwork forward to S27; refine in S28 with final week of dogfood data.
“Opex” renamed to “COS” everywhere except the Requirements §11 spec. Per JF: every LLM call directly serves a household interaction, so it’s marginal cost-of-sales not general operating cost. The §11 spec language (“operating cost”) stays as-is for traceability; everywhere else (the script, the README, the weekly docs) uses COS. The pricing table + recordLlmCall shape unchanged — just the user-facing labels.
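
On the insurance decision above, the claim that option B is forward-compatible with option C can be illustrated by deriving generic edges from the typed FK arrays. The post names insured_asset_ids[] and insured_member_ids[]; the camelCase fields, the edge shape, and the "insures" relation label are hypothetical.

```typescript
// Option B: typed FK arrays on the obligation row.
interface InsuranceObligation {
  id: string;
  insuredAssetIds: string[];
  insuredMemberIds: string[];
}

// Option C: one row per edge in a general relationships table (hypothetical shape).
interface RelationshipEdge {
  fromType: "obligation";
  fromId: string;
  toType: "asset" | "member";
  toId: string;
  relation: "insures";
}

// Each array element becomes exactly one edge, so the migration is a plain expansion.
function toEdges(obligation: InsuranceObligation): RelationshipEdge[] {
  const edge = (toType: "asset" | "member", toId: string): RelationshipEdge => ({
    fromType: "obligation",
    fromId: obligation.id,
    toType,
    toId,
    relation: "insures",
  });
  return [
    ...obligation.insuredAssetIds.map((id) => edge("asset", id)),
    ...obligation.insuredMemberIds.map((id) => edge("member", id)),
  ];
}

const policyEdges = toEdges({
  id: "obl-1",
  insuredAssetIds: ["asset-1", "asset-2"],
  insuredMemberIds: ["member-1"],
});
```

Because the expansion loses nothing, the graph view could render from either storage shape through the same edge type.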

Where Sprint 27 picks up

S27 is the V1 paperwork sprint. Three docs to draft:

docs/specs/Domi - Threat Model v0.2.md — refresh v0.1 against the actual shipped surface. New entities to cover: kind_registry, obligations, contacts, transactions, member_relationships, asset hierarchy (container_asset_id), structured address jsonb, user_summary, llm.calls cost telemetry. Actual auth/RLS posture confirmed (every domain table has RLS, every mutation goes through withAudit). Crypto posture confirmed (libsodium wrapper-mode for filename + sensitive-overflow bytea; per-tenant DEK still on the V1.5 roadmap, libsodium-portable shape kept for the zero-knowledge migration path per Requirements §17). 35 threats from v0.1 → re-evaluated against shipped reality + new threats for the new entities.
docs/specs/Domi - Privacy Impact Assessment v0.1.md — Law 25 + GDPR-shape PIA. Data flows actually shipped (chat → extraction → audit; upload → R2 → extraction → asset link; Gmail watch → push; Law 25 self-serve deletion #265). Retention actually configured. Sub-processor list actually accurate (Vercel/Neon/R2/Anthropic/Resend, with the ca-central-1 Vercel region confirmed in S24 #314). Document the V1 admin-side read-path redactions (S26 #431) as the privacy invariant.
docs/specs/Domi - Disaster Recovery Runbook v0.1.md — actual Neon branch + R2 + Vercel restore procedure, tested against the single-env dogfood deployment per docs/environments.md. No prod/staging split yet (deliberately deferred per CLAUDE.md §11), so the runbook describes single-env recovery + the migration path to a proper split (which is itself V1.5).

Plus reactive dogfood PRs through the rest of the sprint week — S26 had fewer than S25 mostly because the easy targets got smoothed. The harder ones (graph entity panel UX, multi-doc invoice handling, predictive engine tuning) will show up as dogfood week 3 gets enough data to surface them.

S28 then picks up insurance entity (option B + graph-render) + week-4 perf measurement (with honest chat-route COS) + the §11 dogfood-side success-criteria roll-up (≥20 predicted tasks, ≥12 documents, 100% members onboarded — the ones that need a real 4-week window to evaluate).

S26 was the sprint that made the dogfood loop measure itself. S27 makes the system document itself.