Sprint 16 — the trio closes (and a docs detour)


S15 ended with the chat-driven create gap closed for members and assets. Image #3 from JF’s S15 dogfood had Sonnet saying “I can’t create a task manually yet — please add it quickly in the app under Tasks → Add Task” when JF typed “$500 may 13, 2026” after a Hydro-Québec bill extraction. No such UI existed; the task surface was the obvious S16 candidate.

Then JF asked, before Sprint 16 kicked off: “what are the current list of APIs? let’s create a documentation of all the APIs with the latest openAPI specifications.” Mid-sprint pivot. The API docs landed as the new primary, then we came back to the original tier plan after that PR closed. Four PRs total, in this order:

  1. docs/api/ — OpenAPI 3.1 inventory + MCP catalog (PR #187, the pivot primary)
  2. Escalation prompt tightening (PR #192, original P1)
  3. UserPill on graph + settings + audit-log (PR #193, original S2)
  4. propose_task + /add task (PR #194, original P2+S1 — the third entity in the chat-driven create trio)

Plus the original stretch — graph perf measurement — stays as JF-runs work, since the instrumentation already shipped in S15 PR #184.

What shipped

  • API docs as living inventory (PR #187). docs/api/openapi.yaml (OpenAPI 3.1, 11 paths, 17 reusable schemas, 5 tags); docs/api/mcp-tools.md (9 chat tools with MCP-style annotations — readOnlyHint, destructiveHint, idempotentHint); docs/api/README.md (maintenance convention, current-vs-spec coverage table, future auto-gen plan). Cross-linked to the locked API + MCP Surface spec doc which holds the aspirational 79-capability V1 design. The pattern: spec doc is the target; the new docs/api/ is the inventory. When the V1 surface is fully built out, the inventory will catch up to the spec. Until then, docs/api/ tracks what’s actually there, and the convention says every new route adds a paths entry; every new chat tool adds an entry in mcp-tools.md. Future zod-to-openapi auto-gen lands with tRPC.

  • Escalation prompt tightening (PR #192). The escalation-positive-multi-step-plan eval fixture had been failing since S15 PR #183 (when propose_member + propose_asset landed and grew the tool catalog from 6 to 9 visible tools). Sonnet was picking list_tasks + run_predictions_now for “Build me a 12-month maintenance plan for my new car” instead of escalate_to_plan. Three description edits: (1) run_predictions_now explicitly scoped to near-term (next 30-90 days) with an explicit anti-pattern: “do NOT call this for multi-step planning requests”; (2) escalate_to_plan adds long-horizon planning (>90 days) as canonical trigger #5, marked “the most common trigger”; (3) system prompt enumerates triggers, with long-horizon getting first-class billing. Result: 25/25 on chat eval. The fixture passed clean for the first time in two sprints.

  • UserPill on graph + settings + audit-log (PR #193). Bug-bash review post-S15 flagged that UserPill rendered only on /chat. Graph and Settings had no shortcut to sign-out or locale switching from within the page. Fixed by adding UserPill to all three page headers. The /graph back-to-chat link gets hidden sm:inline so the back-link collapses cleanly on narrow mobile viewports. /settings + /settings/audit-log use a justify-between row pattern. The displayName fallback chain matches /chat. The change is unobtrusive: same component, same dropdown menu, just three more entry points.

  • propose_task + /add task (PR #194). The third and final chat-driven create surface. Reuses every piece of plumbing from propose_member (S15 #177) and propose_asset (S15 #178): same chat_proposals table, same confirm/cancel routes, same applicator pattern. New taskProposalSchema Zod definition with title, category, dueAt?, earliestStartAt?, description?, assetName?, memberName?, attributes?. assetName + memberName resolve case-insensitively against assets.display_name / members.display_name — same pattern as custodianMemberName on propose_asset. Audit-wrapped with task.create + metadata.proposal_id + resolution flags. Provenance JSONB records {source: "chat_proposal", proposal_id} so audit-log search can scope to chat-originated tasks vs. predicted vs. extracted. 11 new i18n keys per locale for the task card. /add task slash command with synonyms (reminder, todo, to-do). 3 new eval fixtures. 27/28 on chat eval — see “what surprised me” below for the 28th.
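The name-resolution step in the propose_task applicator can be sketched roughly as below. This is a minimal illustration, not the shipped code: it uses plain TypeScript types instead of the real Zod schema so it stays self-contained, and the row type, the helper names (resolveByName, resolveTaskProposal), the flag names, the null-on-miss behaviour, and the "bills" category are all assumptions. Only the proposal field names, the case-insensitive match on display_name, and the provenance shape come from the description above.

```typescript
// Hypothetical row shape: displayName stands in for the display_name column.
interface NamedRow {
  id: string;
  displayName: string;
}

// Mirrors the fields listed for taskProposalSchema (plain types, not Zod).
interface TaskProposal {
  title: string;
  category: string;
  dueAt?: string;
  earliestStartAt?: string;
  description?: string;
  assetName?: string;
  memberName?: string;
  attributes?: Record<string, unknown>;
}

interface ResolvedTask {
  title: string;
  category: string;
  assetId: string | null;
  memberId: string | null;
  // Resolution flags (names assumed), meant for the audit metadata.
  resolution: { assetResolved: boolean; memberResolved: boolean };
  provenance: { source: "chat_proposal"; proposal_id: string };
}

// Case-insensitive match on display name. Returning null on a miss is an
// assumption; the real applicator may handle unresolved names differently.
function resolveByName(rows: NamedRow[], wanted?: string): NamedRow | null {
  if (!wanted) return null;
  const needle = wanted.trim().toLowerCase();
  return rows.find((r) => r.displayName.trim().toLowerCase() === needle) ?? null;
}

function resolveTaskProposal(
  proposalId: string,
  proposal: TaskProposal,
  assets: NamedRow[],
  members: NamedRow[],
): ResolvedTask {
  const asset = resolveByName(assets, proposal.assetName);
  const member = resolveByName(members, proposal.memberName);
  return {
    title: proposal.title,
    category: proposal.category,
    assetId: asset?.id ?? null,
    memberId: member?.id ?? null,
    resolution: { assetResolved: asset !== null, memberResolved: member !== null },
    provenance: { source: "chat_proposal", proposal_id: proposalId },
  };
}

// Usage with made-up rows: "HOUSE" and "alice" resolve despite the casing.
const resolvedExample = resolveTaskProposal(
  "prop_1",
  { title: "Pay Hydro-Québec bill", category: "bills", assetName: "HOUSE", memberName: "alice" },
  [{ id: "a1", displayName: "House" }],
  [{ id: "m1", displayName: "Alice" }],
);
```

Keeping the resolution flags next to the resolved ids means the audit entry can say whether a name the user typed actually matched something, which is the part a reviewer would want to see later.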

What surprised me

Tool-selection sensitivity to catalog size is a real V1 cost. The escalation-positive-multi-step-plan fixture flipped twice on catalog size alone: it passed pre-S15 with 6 tools, broke at 9 when propose_member/asset landed, was fixed at 9 by S16 P1’s description tightening, then re-broke at 10 when propose_task landed, with the P1 description edits still in place. The model’s tool selection is genuinely fragile at the margin and grows more so as the catalog grows. Pass rate stays above the 0.80 floor (27/28 = 96.4%) but the regression is real. Worth a follow-up: either a sharper system-prompt prefix instructing the model to call escalate_to_plan FIRST when the user says “build me a plan”, or marking the fixture judgement_only (acknowledging the choice is sensitive). The deeper lesson: every tool you add to the catalog dilutes the model’s selection signal for every other tool. There’s no free tool. Worth pricing the “add a tool” decision in S17+ as a real cost, not just an implementation effort.
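To make the "above the floor but still a regression" point concrete, here is a minimal sketch of a pass-rate gate. The result shape and the function names (passRate, evaluateGate) are hypothetical; only the 0.80 floor, the 27/28 result, and the failing fixture's name come from the text above.

```typescript
// Assumed shape of one eval result; the real harness may differ.
interface FixtureResult {
  id: string;
  passed: boolean;
}

const PASS_RATE_FLOOR = 0.8;

function passRate(results: FixtureResult[]): number {
  if (results.length === 0) return 0;
  const passedCount = results.filter((r) => r.passed).length;
  return passedCount / results.length;
}

// The gate reports the failing ids as well as the rate, because a green
// gate can still hide a regression in a single fixture.
function evaluateGate(results: FixtureResult[]): { rate: number; ok: boolean; failing: string[] } {
  const rate = passRate(results);
  const failing = results.filter((r) => !r.passed).map((r) => r.id);
  return { rate, ok: rate >= PASS_RATE_FLOOR, failing };
}

// S16 numbers: 27 passing placeholder fixtures plus the one known failure.
const s16Results: FixtureResult[] = Array.from({ length: 27 }, (_, i) => ({
  id: `fixture-${i + 1}`,
  passed: true,
}));
s16Results.push({ id: "escalation-positive-multi-step-plan", passed: false });

const s16Gate = evaluateGate(s16Results); // rate is 27/28, ok is true
```

With a floor of 0.80 and 28 fixtures, up to five fixtures could fail before the gate goes red, so the rate alone says little about any one fixture.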

The applicator pattern is genuinely reusable. propose_task was the third propose_* tool to land. Each one followed the same shape: Zod schema in chat-proposals/<entity>-proposal.ts, tool factory in chat-tools/propose-<entity>.ts, applicator in chat-proposals/applicators.ts with audit wrap + name resolution, dispatch arm in confirm route, chat panel summarizer. By the third pass it took ~3 hours including 4 new applicator tests + 3 eval fixtures, which is half of what propose_member took. The architecture cost amortizes; a future propose_document_metadata or propose_relationship would be even faster. S15’s “build the right surface” investment is paying compound interest now.
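The shared shape described above might look something like this in miniature. It is a sketch under assumptions, not the code in chat-proposals/applicators.ts: the type names, the makeApplicator helper, and the placeholder entity ids are invented for illustration, and the member.create / asset.create audit actions are guessed by analogy with task.create, which is the only one the text names.

```typescript
type ProposalKind = "member" | "asset" | "task";

// Assumed row shape for the shared chat_proposals table.
interface ChatProposal {
  id: string;
  kind: ProposalKind;
  payload: Record<string, unknown>;
}

interface AuditEntry {
  action: string;
  metadata: { proposal_id: string };
}

interface ApplyResult {
  entityId: string;
  audit: AuditEntry;
}

type Applicator = (proposal: ChatProposal) => ApplyResult;

// Stand-in applicator: a real one would validate the payload, resolve
// names, and insert a row. Here it only fabricates an id and audit entry.
function makeApplicator(kind: ProposalKind): Applicator {
  return (proposal) => ({
    entityId: `${kind}_${proposal.id}`,
    audit: { action: `${kind}.create`, metadata: { proposal_id: proposal.id } },
  });
}

// Record<ProposalKind, ...> forces every kind to have an applicator:
// adding a kind to the union without registering one is a type error.
const applicators: Record<ProposalKind, Applicator> = {
  member: makeApplicator("member"),
  asset: makeApplicator("asset"),
  task: makeApplicator("task"),
};

// The confirm route's dispatch arm reduces to a lookup by kind.
function confirmProposal(proposal: ChatProposal): ApplyResult {
  return applicators[proposal.kind](proposal);
}

const confirmedTask = confirmProposal({ id: "p42", kind: "task", payload: { title: "Renew plates" } });
```

The appeal of a table keyed by kind is that a fourth propose_* tool touches one union member and one map entry on the confirm side, which matches the "each addition is faster" observation.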

JF’s mid-sprint pivot was the right call. I’d planned S16 as the propose_task primary + smaller items. JF asked for API docs first. That doc work landed cleanly as PR #187 in ~3 hours. The pivot wasn’t a delay; it was a different valuable artifact in the same time budget. I’d been mentally treating sprints as “execute the plan” but they’re really “ship valuable artifacts within ~10h.” If the most valuable artifact mid-sprint changes, change. The propose_task primary still landed by end of sprint.

Eval fixture-text quality matters. Two of my initial S16 propose_task eval fixtures failed not because the tool didn’t work (it did, with 100% tool-call success) but because my reply_includes assertions expected the model to echo user keywords back in its reply. With propose_task’s “confirm on the card” UX, Sonnet now replies tersely (“La tâche est prête à être ajoutée — confirmez sur la carte ci-dessus 🦷”, roughly “The task is ready to be added, confirm on the card above”) and doesn’t echo. I dropped those fixtures to tools_called-only assertions. The propose-asset-en-appliance fixture had a similar issue: “My wife Alice is in charge” was ambiguous about whether Alice existed; I reworded it to “Alice (already in our household) takes care of it”. Generalized lesson: reply-text assertions should test what’s load-bearing in the assistant’s output (dates, named entities the user MUST see), not what the user typed. And natural-language fixtures need to be unambiguous about state the model can’t infer (existing vs. new entities).
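A small sketch of why the echo-style assertion was brittle. The tools_called and reply_includes names come from the fixtures described above; the surrounding types, the checkFixture function, its matching rules (every expected tool must appear, substring match ignoring case), and the sample reply are assumptions made for illustration.

```typescript
interface FixtureExpectation {
  tools_called: string[];
  reply_includes?: string[];
}

// Assumed shape of what the harness captures from one model turn.
interface ModelTurn {
  toolsCalled: string[];
  reply: string;
}

// Returns a list of failure messages; an empty list means the fixture passed.
function checkFixture(expected: FixtureExpectation, turn: ModelTurn): string[] {
  const failures: string[] = [];
  for (const tool of expected.tools_called) {
    if (!turn.toolsCalled.includes(tool)) failures.push(`missing tool call: ${tool}`);
  }
  for (const fragment of expected.reply_includes ?? []) {
    if (!turn.reply.toLowerCase().includes(fragment.toLowerCase())) {
      failures.push(`reply missing: ${fragment}`);
    }
  }
  return failures;
}

// A terse "confirm on the card" style reply (invented wording).
const terseTurn: ModelTurn = {
  toolsCalled: ["propose_task"],
  reply: "The task is ready. Confirm on the card above.",
};

// Brittle: expects the model to echo what the user typed.
const echoFailures = checkFixture(
  { tools_called: ["propose_task"], reply_includes: ["$500"] },
  terseTurn,
);

// Robust: asserts only the load-bearing behaviour, the tool call.
const toolOnlyFailures = checkFixture({ tools_called: ["propose_task"] }, terseTurn);
```

Here echoFailures holds one entry and toolOnlyFailures is empty, even though the model did the right thing in both cases.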

UserPill on /graph took 30 minutes; I could have shipped it three sprints ago. Bug-bash flagged it post-S15. Fix was trivial — add UserPill to the page header, mobile-collapse the back-link, done. The reason I didn’t ship it sooner is that the graph viz felt like a “standalone view” worth its own chrome. That’s wrong. Every signed-in page is part of the same app and deserves the same shortcuts. The principle is “navigation consistency over per-page idioms” — easy to state, easy to forget when you’re focused on a feature surface. Worth a CLAUDE.md addition.

Eval cost is creeping up. S14 eval was ~$0.34 across 24 fixtures. S16 eval is ~$0.54 across 28. The growth is escalate_to_plan firing more often (each call is ~$0.10 of generate_plan tokens) plus the new propose_task fixtures. Still cheap in absolute terms but worth tracking — at the current trajectory we’ll hit $1/run by S20 if every sprint adds 2-3 fixtures + a propose_* tool.
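The "$1/run by S20" estimate can be sanity-checked with a quick extrapolation from the two data points above. This is back-of-envelope arithmetic, not a forecast: the two growth models (constant dollars per sprint, constant ratio per sprint) and the function names are my assumptions, and two points cannot distinguish between them.

```typescript
interface EvalRun {
  sprint: number;
  costUsd: number;
  fixtures: number;
}

// The two data points from the retro.
const s14Run: EvalRun = { sprint: 14, costUsd: 0.34, fixtures: 24 };
const s16Run: EvalRun = { sprint: 16, costUsd: 0.54, fixtures: 28 };

// Constant dollars added per sprint.
function projectLinear(a: EvalRun, b: EvalRun, targetSprint: number): number {
  const perSprint = (b.costUsd - a.costUsd) / (b.sprint - a.sprint);
  return b.costUsd + perSprint * (targetSprint - b.sprint);
}

// Constant growth ratio per sprint.
function projectGeometric(a: EvalRun, b: EvalRun, targetSprint: number): number {
  const ratioPerSprint = Math.pow(b.costUsd / a.costUsd, 1 / (b.sprint - a.sprint));
  return b.costUsd * Math.pow(ratioPerSprint, targetSprint - b.sprint);
}

const linearS20 = projectLinear(s14Run, s16Run, 20); // about 0.94
const geometricS20 = projectGeometric(s14Run, s16Run, 20); // about 1.36
```

Linear growth of about $0.10 per sprint lands just under $1 at S20, and compounding at the S14-to-S16 ratio lands around $1.36, so the $1 figure sits between the two.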

Where Sprint 17 picks up

The natural candidates:

  • Tasks list UI. propose_task creates them; /list tasks shows them in chat; but there’s no top-level viewer page, at /tasks or anywhere else. JF dogfood will reveal whether the chat-only view is enough. My intuition is no: V1 needs at least a basic list view with category filters and a “mark complete” action.
  • Escalation flakiness stabilization. Pick one: sharper system-prompt prefix OR judgement_only flag on the fixture. ~30 min either way. The pass rate stays above floor regardless but the flake makes the eval gate noisier than it should be.
  • Live graph perf measurement. Carry-forward from S15. Instrumentation hooks in place; JF runs Chrome DevTools at dogfood scale + fills in spec §2.9 Actual column. Now that propose_member + propose_asset + propose_task make populating a tenant trivial, this is genuinely cheap.
  • API-docs CI gate. Convention enforced by code review today. A small CI script that diff-counts routes vs. paths in openapi.yaml would lock it in. ~30 min.
  • First mobile UX delta — camera capture. Still pending. Table-to-card ✓ S12, graph list-tree ✓ S14, header layout ✓ S16. Camera capture is the dogfood-defining mobile feature.
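For the API-docs CI gate candidate above, the core comparison could be sketched as follows. This assumes the implemented routes and the openapi.yaml paths have already been collected into lists; how that happens (filesystem scan, YAML parse) is left out. The function names and the example routes are hypothetical.

```typescript
interface RouteKey {
  method: string;
  path: string;
}

// Normalize to "METHOD /path" so the two sides compare as plain strings.
function toKey(r: RouteKey): string {
  return `${r.method.toUpperCase()} ${r.path}`;
}

// Reports routes missing from the docs and doc entries with no route.
function diffApiSurface(
  implemented: RouteKey[],
  documented: RouteKey[],
): { undocumented: string[]; stale: string[]; ok: boolean } {
  const impl = new Set(implemented.map(toKey));
  const docs = new Set(documented.map(toKey));
  const undocumented = Array.from(impl).filter((k) => !docs.has(k)).sort();
  const stale = Array.from(docs).filter((k) => !impl.has(k)).sort();
  return { undocumented, stale, ok: undocumented.length === 0 && stale.length === 0 };
}

// Made-up example: one new route lacks a paths entry, one entry is stale.
const surfaceDiff = diffApiSurface(
  [
    { method: "post", path: "/api/chat" },
    { method: "get", path: "/api/tasks" },
  ],
  [
    { method: "POST", path: "/api/chat" },
    { method: "GET", path: "/api/legacy" },
  ],
);
```

A CI step would fail the build when ok is false and print both lists. Comparing sets rather than counts catches the case where one route is added and another removed in the same PR.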

Carry-forwards still alive: WCAG browser-verification run (#150, JF runs); non-functional gates (threat-model sign-off, PIA, DR runbook); Sprint 8 carry-forwards (Gemini + OpenRouter; parent-child cost grouping).

The pattern I’m taking from S16, combining what S15 taught with what S16 added, is that the architecture is paying off. Three propose_* tools, all sharing one table + one route + one applicator pattern + one card UI. Each new addition takes less time than the previous one. What S17 should pay off is the read side: propose_task without a tasks viewer is half a feature. The chat-driven create story closes structurally with S16; the chat-driven manage story is what S17 should open.