Sprint 17 — the loop closes on itself
S16 closed the chat-driven create trio (member, asset, task — all three entity types now creatable via natural-language chat → confirm card → audit-wrapped insert). But the task surface ended S16 half-built: chat created tasks, the audit log recorded them, /list tasks (a slash command in chat) could enumerate them, and that was it. No top-level page to see what you’d added, no way to mark anything complete without going back through chat.
Sprint 17 fixed that. Three PRs in ~9 hours one evening:
- Tasks list UI primary (PR #199): /[locale]/tasks page with filters + cursor pagination + inline “Mark complete” / “Dismiss” server actions
- Escalation eval stabilization S1 (PR #200): fixes the catalog-size flake that re-broke at 10 tools in S16
- API-docs CI gate S2 (PR #201): diff-counts routes vs openapi.yaml paths, and chat tools vs mcp-tools.md catalog rows
The original stretch goal, live graph perf measurement, was deferred again. The dev server was up and the tenant was populated, but JF chose to skip the actual trace. #181 stays open for a future session.
What shipped
- Tasks list UI (PR #199). New shared-package module packages/shared/src/tasks/ with three functions: listTasks (cursor pagination on (due_at NULLS LAST, id)), completeTask, dismissTask. The /tasks page server-renders rows from listTasks with filters (category, source, lifecycle state, date range) and a paginated cursor in the URL. Each row has two inline buttons whose server actions wrap completeTask/dismissTask and revalidatePath to refresh the list. Both mutations are idempotent and race-safe: they pre-read the lifecycle state, then audit-wrap an UPDATE with WHERE lifecycle_state = <pre-read> so a concurrent transition returns already_completed instead of double-emitting an audit row. The dogfood-defining round-trip (add via chat → see on /tasks → mark complete from row → audit row written, list refreshes) works end-to-end. 22 new i18n keys per locale, 9/9 real-DB integration tests, 3.53 kB / 127 kB build output for the new page.
- Escalation priority rules (PR #200). S16’s escalation regression carried into S17 as the “flaky-at-10-tools” follow-up. Two fixtures were sensitive at the catalog edge: escalation-positive-multi-step-plan (“Build me a 12-month maintenance plan for my new car”) and escalation-positive-cross-asset (“Looking at my car, my house, and my health, what should I prioritize this month?”). Both expected escalate_to_plan to fire first; Sonnet was reaching for list_tasks + run_predictions_now. The S17 fix’s mechanism is different from S16’s. S16 tightened individual tool descriptions (run_predictions_now scoped to 30-90 days, etc.). S17 added a prefix the model reads BEFORE the per-tool descriptions, listing three escalate-first triggers with explicit verbs/horizons/multi-asset signals. The catalog-size signal-dilution gets solved at the priority-ordering level, not by tightening each individual tool. 28/28 chat eval, $0.59/run.
- API-docs CI gate (PR #201). scripts/check-api-docs.mjs runs two count comparisons: route.ts files under apps/web/src/app/api/** ↔ paths: entries in openapi.yaml, and chat-tool source files ↔ catalog rows in mcp-tools.md. New .github/workflows/api-docs-sync.yml runs it on every PR (2 min, pure Node stdlib, no install). The first execution of the script found real drift: propose_task from S16 #194 was in the chat-tools registry but had no detail section in mcp-tools.md. Backfilled in the same PR. Count after backfill: 11 routes ↔ 11 paths · 10 tools ↔ 10 catalog entries.
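The count comparison behind the API-docs gate can be sketched roughly as below. This is a minimal TypeScript illustration, not the contents of scripts/check-api-docs.mjs: the function names (countRouteFiles, countOpenApiPaths, diffCounts) are hypothetical, inputs are passed in as strings instead of being read from disk, and the YAML scan assumes unquoted path keys indented by two spaces under a top-level paths: key.

```typescript
// Hypothetical helpers; the real script's internals are not shown in this post.

// Count files named route.ts in a list of repo-relative paths.
function countRouteFiles(files: string[]): number {
  return files.filter((f) => /(^|\/)route\.ts$/.test(f)).length;
}

// Count entries directly under the top-level `paths:` key without a YAML parser.
// Assumes unquoted keys such as `  /api/tasks:` with two-space indentation.
function countOpenApiPaths(yaml: string): number {
  let inPaths = false;
  let count = 0;
  for (const line of yaml.split(/\r?\n/)) {
    if (/^paths:\s*$/.test(line)) {
      inPaths = true;
      continue;
    }
    if (inPaths && /^\S/.test(line)) inPaths = false; // next top-level key ends the block
    if (inPaths && /^ {2}\/\S*:\s*$/.test(line)) count += 1;
  }
  return count;
}

// Return null when the counts agree, otherwise a message suitable for CI output.
function diffCounts(label: string, sourceCount: number, docCount: number): string | null {
  return sourceCount === docCount
    ? null
    : `${label}: ${sourceCount} in source vs ${docCount} documented`;
}
```

A CI step built on this would collect the messages from both comparisons and exit non-zero if any is non-null. A pure count comparison is cheap, but it cannot notice a rename that leaves the totals equal.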
What surprised me
Prompt prefixes beat per-tool description tightening when the catalog grows. S16’s fix to the escalation regression was to tighten the descriptions of run_predictions_now, escalate_to_plan, and the system prompt’s tool-list intro. Worked at 9 tools (25/25). Broke again at 10 (27/28). The reason: per-tool description tightening doesn’t help when the model is scanning a longer list — each tool’s description still says “use me for X”, and adding a 10th tool dilutes whichever was supposed to win. A prefix that orders the catalog (“escalate FIRST when…, propose_* SECOND when…, read tools OTHERWISE”) gives the model an explicit decision tree before it reads the per-tool list. The deeper lesson: as a tool catalog grows, the per-tool descriptions become signal-poorer, but a structured prefix stays cheap. The catalog-size cost from S16 (“there’s no free tool”) is real, but the priority-prefix mechanism amortizes it — adding the 11th tool won’t break the escalation eval as long as it doesn’t accidentally re-trigger one of the three priority rules.
Whack-a-mole between two eval fixtures revealed an implicit gap. First attempt at the priority prefix enumerated horizon-based and verb-based triggers (“build me a plan”, “>90 days”). Fixed multi-step-plan (27→28). Broke cross-asset (28→27). Cross-asset was “this month” + three different assets. The model now read “this month” as near-term and skipped escalate — exactly what the prefix told it. The fix: extend the priority rule to say “the time horizon does NOT downgrade a multi-asset prompt; cross-asset synthesis itself is the escalate signal.” Generalized lesson: when a prompt prefix asserts “do X when…” with one type of trigger, the model will defensively apply that single rule and ignore other signals. Enumerate every trigger explicitly; the model treats the prefix as a closed list, not an open one. This is also why the next propose_* tool added shouldn’t accidentally re-trigger the prefix — a propose_health_record whose description says “use for medical questions” could collide with the cross-asset rule.
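To make the prefix mechanism concrete, here is a minimal TypeScript sketch of a prompt builder that places a closed list of priority rules ahead of the per-tool catalog. The prefix wording, the ToolSpec shape, and the buildToolPrompt name are all assumptions for illustration; the post does not quote the real prefix, and the condition on the propose_* rule in particular is a placeholder.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

// Hypothetical wording. It mirrors the three trigger types described above
// (verb, horizon, multi-asset) and states explicitly that a short horizon
// does not cancel the multi-asset trigger.
const PRIORITY_PREFIX = [
  "Tool priority (apply before reading the tool list):",
  "1. Call escalate_to_plan FIRST when ANY of the following holds:",
  '   a. the user asks for a plan (for example "build me a plan");',
  "   b. the request covers a horizon longer than 90 days;",
  "   c. the request spans more than one asset.",
  "      The time horizon does NOT downgrade a multi-asset prompt.",
  "2. Use a propose_* tool SECOND, when the user wants to add a record.", // placeholder condition
  "3. Use read tools OTHERWISE.",
].join("\n");

// The prefix always comes first, so its length stays constant as the catalog grows.
function buildToolPrompt(tools: ToolSpec[]): string {
  const catalog = tools.map((t) => `- ${t.name}: ${t.description}`).join("\n");
  return `${PRIORITY_PREFIX}\n\nAvailable tools:\n${catalog}`;
}
```

Because every trigger lives in one list, adding a tool only appends a catalog row. The review question for a new tool then becomes whether its description overlaps one of the enumerated triggers.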
The API-docs gate justified itself on its first run. S16 introduced the docs/api/ convention manually. The first run of the S17 gate found the convention had already drifted — S16 #194 added propose_task to the chat-tools registry without its mcp-tools.md detail section. That’s exactly the slow doc-rot a CI gate exists to prevent. S11/S12 hit the same finding for Closes #N PR keywords and --custom migration placeholders, both of which became hard gates. The pattern: documentation convention written but not CI-enforced is a soft commitment. The cost of writing a count-diff script (~30 min) bought enforcement; the cost of not writing it shows up later as silent drift you find weeks after the fact.
Idempotency + race-safety in the mutation layer needed pattern-encoding. completeTask and dismissTask both transition a row to a terminal lifecycle state. The naive implementation (read, update unconditionally, audit) has two failure modes: a second click double-emits an audit row, and a concurrent transition between read and write makes the audit log lie about which state the task was actually in. The pattern that works: pre-read the lifecycle state, then audit-wrap an UPDATE with WHERE lifecycle_state = <pre-read>. If the UPDATE affects zero rows, return already_* and skip the audit. This is the kind of pattern that needs to be a shared helper or convention before the third mutation lands — currently it’s open-coded in both completeTask and dismissTask. The next terminal-transition mutation (snooze, undismiss, whatever V1.5 brings) should probably get the helper.
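One possible shape for that shared helper is sketched below in TypeScript. Everything here is an assumption made for illustration: the TaskStore interface, the transitionTask name, and the in-memory store stand in for the real database layer, and a real implementation would run the UPDATE and the audit write inside one transaction.

```typescript
type LifecycleState = "open" | "completed" | "dismissed";
type TerminalState = Exclude<LifecycleState, "open">;
type TransitionStatus = "ok" | "not_found" | `already_${LifecycleState}`;

interface AuditEntry {
  taskId: string;
  from: LifecycleState;
  to: TerminalState;
}

// Hypothetical storage interface standing in for the real DB layer.
interface TaskStore {
  readState(id: string): LifecycleState | undefined;
  // Models: UPDATE tasks SET lifecycle_state = <next>
  //         WHERE id = <id> AND lifecycle_state = <expected>
  // Returns the number of affected rows (0 or 1).
  updateIfState(id: string, expected: LifecycleState, next: TerminalState): number;
  writeAudit(entry: AuditEntry): void;
}

function transitionTask(store: TaskStore, id: string, target: TerminalState): TransitionStatus {
  const before = store.readState(id);
  if (before === undefined) return "not_found";
  if (before !== "open") return `already_${before}`; // repeat click: no write, no audit

  const affected = store.updateIfState(id, before, target);
  if (affected === 0) {
    // Another request moved the row between our read and our write.
    const current = store.readState(id);
    return current === undefined ? "not_found" : `already_${current}`;
  }
  store.writeAudit({ taskId: id, from: before, to: target });
  return "ok";
}

const completeTask = (store: TaskStore, id: string) => transitionTask(store, id, "completed");
const dismissTask = (store: TaskStore, id: string) => transitionTask(store, id, "dismissed");

// In-memory store, for demonstration only.
function memoryStore(rows: Record<string, LifecycleState | undefined>, audit: AuditEntry[]): TaskStore {
  return {
    readState: (id) => rows[id],
    updateIfState: (id, expected, next) => {
      if (rows[id] !== expected) return 0;
      rows[id] = next;
      return 1;
    },
    writeAudit: (entry) => {
      audit.push(entry);
    },
  };
}
```

With this shape, a second click on “Mark complete” returns already_completed and leaves the audit log at one entry. A future snooze or undismiss mutation would need the set of allowed source states to become a parameter instead of the hard-coded "open".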
Cursor pagination on NULLS LAST ordering needed an explicit null-tail convention. Tasks have nullable due_at. Default ordering is (due_at ASC NULLS LAST, id ASC). Cursoring past the last non-null due-date row into the null tail needs a different WHERE shape than ordinary (field, id) cursors. The convention I encoded: cursorDue="" (empty string in the URL) + a real cursorId means “we’re in the null tail, advance by id only”. This is now the third place this pattern lives in the codebase (audit-log search S13, predictions list, tasks). Time to pull it into a shared helper before the fourth occurrence.
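The null-tail convention can be illustrated with a small in-memory TypeScript sketch. The Task and Cursor shapes and the listPage name are hypothetical, and the filtering runs over an array where the real code would issue SQL; the comments note the WHERE shape each branch corresponds to. It assumes due dates are ISO-8601 strings, which compare correctly as plain strings.

```typescript
interface Task {
  id: string;
  dueAt: string | null; // ISO-8601 date string, or null for "no due date"
}

// cursorDue === "" means the cursor sits in the null tail.
interface Cursor {
  cursorDue: string;
  cursorId: string;
}

// Ordering: (due_at ASC NULLS LAST, id ASC).
function compareTasks(a: Task, b: Task): number {
  if (a.dueAt !== b.dueAt) {
    if (a.dueAt === null) return 1;
    if (b.dueAt === null) return -1;
    return a.dueAt < b.dueAt ? -1 : 1;
  }
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

function isAfterCursor(t: Task, c: Cursor): boolean {
  if (c.cursorDue === "") {
    // Null tail. SQL shape: due_at IS NULL AND id > :cursorId
    return t.dueAt === null && t.id > c.cursorId;
  }
  // Ordinary cursor. SQL shape:
  //   due_at > :due OR (due_at = :due AND id > :id) OR due_at IS NULL
  if (t.dueAt === null) return true;
  return t.dueAt > c.cursorDue || (t.dueAt === c.cursorDue && t.id > c.cursorId);
}

function listPage(all: Task[], limit: number, cursor?: Cursor): { rows: Task[]; next: Cursor | undefined } {
  const ordered = [...all].sort(compareTasks);
  const remaining = cursor ? ordered.filter((t) => isAfterCursor(t, cursor)) : ordered;
  const rows = remaining.slice(0, limit);
  const last = rows[rows.length - 1];
  const hasMore = remaining.length > limit;
  // A null due date is encoded as the empty string, matching the URL convention.
  const next = hasMore && last ? { cursorDue: last.dueAt ?? "", cursorId: last.id } : undefined;
  return { rows, next };
}
```

The explicit `OR due_at IS NULL` in the ordinary branch is the part that is easy to forget: without it, a page boundary on the last dated row would silently drop every undated task.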
Skipping the perf trace was the right call but burned setup time. JF asked to do the graph perf measurement now. Dev server up, runbook written, tenant ready to populate. Then JF skipped. The cost was small (~5 min) but it’s the second sprint where this carry-forward almost-but-didn’t-happen. The actual blocker isn’t “JF doesn’t have 30 minutes” — it’s that the trace requires a long-running browser session with DevTools open, and there’s no good moment in a 10-hour evening sprint when that’s the cheapest task. Maybe the right play is bundling it with the next dogfood-tenant population pass so we get the trace as a side effect.
Where Sprint 18 picks up
The Phase 10 candidate list narrows:
- First mobile UX delta — camera capture. Still pending after four sprints. Table-to-card ✓ S12, graph mobile list-tree ✓ S14, header layouts ✓ S16, tasks rows mobile-friendly ✓ S17. Camera capture is the dogfood-defining mobile feature — open the camera, snap a bill, the document-ingestion pipeline does the rest. ~2-3 hr for a first cut.
- Live graph perf measurement. Still #181. Carry-forward since S15. The runbook is now written into S17’s transcript; next session can pick it up cold.
- End-of-S18 decision: Gemini + OpenRouter adapters. Per Dev Plan §10 these are V1-cuttable. S18 is the decision point. Tilting toward “cut” — the abstraction layer is already provider-agnostic, and dogfood doesn’t need multi-provider routing.
- WCAG browser-verification run (#150). JF runs. Plan written at docs/wcag-verification-plan-s12.md; code-side P1+P2 fixes shipped S11+S12.
- Phase 10 non-functional gates. Threat-model sign-off, PIA, DR runbook. All V1-ship-blocking, paperwork-shaped, none started.
- Eval cost trajectory. S14 was ~$0.34/run, S16 ~$0.54, S17 ~$0.59. The growth comes from escalate_to_plan firing more often and from new propose_* fixtures. Holding roughly flat between S16 and S17 (+$0.05/run) is reassuring: the prompt-prefix fix didn’t pull more fixtures into escalate. Worth keeping an eye on at S18+.
S17 was the smallest sprint of Phase 10 in number of PRs — three vs. four-or-five in recent sprints — but it closed the largest structural gap: the chat-driven create story now has a corresponding read side. S15 and S16 built the front of the loop (chat ingests, audit records, data lands). S17 built the back (UI shows, mutations close out, audit records again). The user’s full mental model of Domi-as-something-that-manages-things is finally addressable end-to-end — type into chat to add, click on a list to manage, audit to inspect. The dogfood window can start gathering signal about whether the loop, as a whole, lands.