2026-05-09

Sprint 7 — depth on demand

The framing for Sprint 7 came from Requirements §5.13.9: “stay light by default, lean in when it matters.” The chat surface that shipped in Sprint 5 grounds answers in tools (list_tasks, list_documents, run_predictions_now) — cheap, fast, factual. The hard turns — “build me a maintenance plan for the new car,” “design a weekly cleaning routine,” “if I get a second car next year, how should the maintenance schedule change?” — those need a different model and a different mode. M7’s job was to wire that without making everyone pay the premium-reasoning bill on every turn.

Two evenings, five PRs, all merged by Friday night. Three total LLM calls per escalated turn (chat call + tool result + plan call), about $0.02 per escalation on Sonnet 4.6. The chat eval extended from 5 fixtures at $0.04 to 17 at $0.22; most of the cost increase is the escalation positives actually generating plans.

What shipped

generate_plan role on AnthropicAdapter. The role had been declared in packages/shared/src/llm/types.ts since Sprint 2; the catalog had claude-sonnet-4-6 mapped to it; the router enforced its tool_use capability gate. What had been missing was an actual adapter method to back it. PR #93 added AnthropicAdapter.generate_plan(input, options), following the same shape as chat / classify / extract_document: typed input/output, versioned prompt, routed through callMessages + buildResult. Live smoke against the API: a 12-month maintenance plan for a 2023 Mazda CX-5 + 2018 Honda Civic in Quebec. The plan referenced winter-tire deadlines, SAAQ inspection nuances, and a Transport Canada recall lookup for the high-mileage Civic, at $0.014 for 312 in / 865 out tokens. First non-trivial use of the role-routed LLM abstraction since it was scaffolded in Sprint 2.
escalate_to_plan chat tool. The chat model decides when to call this. The tool’s description — and this is the load-bearing piece — embeds the §5.13.9 trigger heuristics verbatim: 5 SHOULD-call categories (multi-step plan, cross-asset synthesis, conditional reasoning, routine generation, explicit phrasing) and 4 SHOULD-NOT categories (factual questions, single-step actions, app-meta, status checks). Calling the tool routes through getPlanner() → AnthropicAdapter.generate_plan → returns the plan to the chat model, which paraphrases it into its reply. The result persists as a chat_messages row with role='tool' so follow-up turns can reason against the full plan rather than the chat assistant’s summary of it.
“Domi is thinking harder…” 1.5s indicator. Non-blocking inline indicator that surfaces when an escalate_to_plan tool call has been outstanding for more than 1.5 seconds. Below 1.5s the UI still feels responsive; above 1.5s the user wants signal that something is happening. CSS-only emerald spinner, role="status" for screen readers, auto-dismisses when the response renders. PR #95 was built off main ahead of #94 landing — the detection was a silent no-op until the tool fired, so the two PRs were genuinely independent and could land in either order.
/think and /quick slash overrides + global toggle. Per-turn user control: /think your-message strips the prefix client-side and sends a force override; the server adds a system-prompt nudge pushing toward escalation. /quick your-message sends suppress; the server filters escalate_to_plan out of the tools array for that turn. Plus a global toggle in Settings (Chat preferences, the third Settings IA section): “Allow chat to escalate to premium reasoning when needed,” default ON. Precedence: per-turn slash always wins; global OFF + no slash = tool not exposed; global OFF + /think = escalation still happens (slash is explicit user intent).
Chat eval extended with 12 escalation fixtures. Five positives (one per SHOULD-call category — must call escalate_to_plan), four negatives (one per SHOULD-NOT category — must NOT call), three ambiguous cases logged but not graded (per spec §14.2 “judged by reviewer”). The runner gained a tools_must_not_call predicate — first time the eval asserts a tool wasn’t called. 17/17 pass on Sonnet 4.6 at $0.224 total. Baseline written; future prompt edits get caught by the eval before they regress.
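The precedence rules for the slash overrides and the global toggle can be sketched as a small pure function. This is a minimal sketch, not the shipped code: the names EscalationOverride, resolveEscalation, and toolsForTurn are hypothetical, and only the force/suppress values, the escalate_to_plan tool name, and the precedence order come from the notes above.

```typescript
// Hypothetical types and names, for illustration only.
type EscalationOverride = "force" | "suppress" | null;

interface EscalationDecision {
  exposeTool: boolean; // keep escalate_to_plan in the tools array this turn?
  addNudge: boolean;   // append the system-prompt nudge toward escalation?
}

// A per-turn slash override always wins; the global toggle only
// decides the outcome when no slash prefix was sent.
function resolveEscalation(
  override: EscalationOverride,
  globalAllow: boolean,
): EscalationDecision {
  if (override === "force") return { exposeTool: true, addNudge: true };
  if (override === "suppress") return { exposeTool: false, addNudge: false };
  return { exposeTool: globalAllow, addNudge: false };
}

// Filter the tool names for one turn according to the decision.
function toolsForTurn(allTools: string[], decision: EscalationDecision): string[] {
  return decision.exposeTool
    ? allTools
    : allTools.filter((name) => name !== "escalate_to_plan");
}
```

Keeping the decision separate from the filtering means all four precedence cases can be covered in a table-style unit test without touching the chat route.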

What surprised me

The tool description is the prompt-engineering surface. Half the work of getting escalate_to_plan to behave was deciding what to write in its description string. The 5 SHOULD-call + 4 SHOULD-NOT categories from §5.13.9 are pasted verbatim. I’d have expected to need a separate prompt-engineering doc, an iterative tuning pass, and probably a couple of cases where the model would surprise me. The eval ran 17/17 first try. Sometimes the spec is doing the work; you just have to copy it into the right place.

Sprint 6’s parallel-branch lessons actually stuck. Sprint 6 wrapped up with three different flavors of merge-friction pain — Drizzle snapshot collisions, the empty-body --custom migration, the main-merge regression. Going into Sprint 7 I designed the dependency graph upfront: #88 (foundational) → #89 (depends on #88) → #90 (independent of #89, can run parallel against main) → #91 (conflicts with #89, wait for it) → #92 (needs the route live, wait for #91 OR build off it). Worked exactly as planned. Zero conflicts, no rebases beyond standard “pull main, merge in.” The “verify the PR is still open before pushing follow-ups” mental check that I added to the post-merge habit list at the end of Sprint 6 — that one apparently took.

The “first eval that asserts in two directions” felt like a small thing and is actually a big one. Earlier evals (classify, predict_task, the original chat 5) all assert “the right tool was called.” For the escalation eval, the negative cases needed “the wrong tool was NOT called” — and “wrong” is just escalate_to_plan on a status-check trigger. Adding tools_must_not_call was a 20-line runner change. But it changes what the eval can express. Future tools can now be eval’d both ways: “we want list_tasks here, and we don’t want run_predictions_now firing too.”
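A two-directional assertion of this kind can be sketched in a few lines. This is an illustrative sketch under assumptions: tools_must_not_call is the predicate named above, while tools_must_call, ToolExpectations, and checkToolCalls are hypothetical names and not the runner's actual shape.

```typescript
// Hypothetical fixture shape; only tools_must_not_call is named in the notes.
interface ToolExpectations {
  tools_must_call?: string[];
  tools_must_not_call?: string[];
}

// Returns a list of human-readable failures; an empty list means the fixture passed.
function checkToolCalls(called: string[], expect: ToolExpectations): string[] {
  const failures: string[] = [];
  for (const name of expect.tools_must_call ?? []) {
    if (called.indexOf(name) === -1) {
      failures.push(`expected ${name} to be called`);
    }
  }
  for (const name of expect.tools_must_not_call ?? []) {
    if (called.indexOf(name) !== -1) {
      failures.push(`expected ${name} not to be called`);
    }
  }
  return failures;
}
```

A fixture can set both fields at once, which is how "we want list_tasks here, and we don't want run_predictions_now firing too" would be expressed.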

The carve-out for role='tool' persistence felt nuanced and turned out clean. Sprint 6 said “tool messages are turn-local; the model recomputes them per turn.” That works for list_tasks because the answer to “what tasks do I have?” depends on the current DB state and should be re-fetched each turn. It does not work for escalations because the plan is an LLM-generated artifact that the user paid (literally, in tokens) to produce, and recomputing it next turn would be both expensive and incoherent. The schema’s chat_role enum had a 'tool' value we hadn’t been using; now it has exactly one user. The “deliberate exception” pattern is well-known in software but always feels like a smell when you write it; this one feels right because the cost-of-recomputation tradeoff is asymmetric.
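The carve-out can be pictured as an allow-list applied when a turn's rows are written. This is a rough sketch only: ChatMessageRow, PERSISTED_TOOLS, and rowsForTurn are hypothetical names, and the 'user' and 'assistant' enum values are assumed (only 'tool' is confirmed above).

```typescript
// Assumed chat_role values; only 'tool' is confirmed by the notes.
type ChatRole = "user" | "assistant" | "tool";

interface ToolResult {
  toolName: string;
  output: string;
}

interface ChatMessageRow {
  role: ChatRole;
  content: string;
  toolName?: string;
}

// Tools whose results are expensive LLM artifacts and therefore persisted.
// Everything else stays turn-local and is re-fetched on the next turn.
const PERSISTED_TOOLS: string[] = ["escalate_to_plan"];

// Build the rows to store for one completed turn.
function rowsForTurn(
  userText: string,
  toolResults: ToolResult[],
  assistantText: string,
): ChatMessageRow[] {
  const rows: ChatMessageRow[] = [{ role: "user", content: userText }];
  for (const result of toolResults) {
    if (PERSISTED_TOOLS.indexOf(result.toolName) !== -1) {
      rows.push({ role: "tool", content: result.output, toolName: result.toolName });
    }
  }
  rows.push({ role: "assistant", content: assistantText });
  return rows;
}
```

With the exception expressed as a one-element list, adding a second persisted tool later is a data change and not a new code path.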

Per-turn slash overrides were trivially clean once I read the AI SDK API for them. I’d spent a few minutes thinking about how to extend the DefaultChatTransport body builder, plumb the override through transport rebuilds on every turn, etc. Then I read useChat’s sendMessage(message, options) signature: options.body is merged into the request body. Done in five lines. There’s a subtle lesson here: the AI SDK offers more affordances than I give it credit for, and I keep forgetting to read the type definitions before writing custom code. Adding that to the post-PR mental checklist alongside “verify the PR is still open.”
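The client side of the override might look roughly like the following. This is a sketch under assumptions: SendMessage is a narrow stand-in type for the slice of useChat's sendMessage described above (not the SDK's full signature), and sendWithOverride and the escalation body field are hypothetical names.

```typescript
type Override = "force" | "suppress";

// Stand-in for the part of sendMessage(message, options) used here.
type SendMessage = (
  message: { text: string },
  options?: { body?: Record<string, unknown> },
) => void;

// Strip a leading /think or /quick and pass the override via options.body,
// which the notes above describe as merged into the request body.
function sendWithOverride(sendMessage: SendMessage, raw: string): void {
  const match = /^\/(think|quick)(?:\s+|$)/.exec(raw);
  if (!match) {
    sendMessage({ text: raw });
    return;
  }
  const override: Override = match[1] === "think" ? "force" : "suppress";
  sendMessage(
    { text: raw.slice(match[0].length) },
    { body: { escalation: override } }, // hypothetical field name
  );
}
```

Requiring whitespace or end-of-input after the command keeps a message such as "/thinking about tires" from being treated as an override.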

Where Sprint 8 picks up

M8 — multi-provider adapters + eval matrix. OpenAI, Gemini, OpenRouter adapters behind the existing LlmAdapter interface. Capability gates (vision, tool_use, privacy_approved) enforced at strategy save time so a user can’t map extract_document to a non-vision model and ship it. Cross-provider eval matrix for at least one role to prove the abstraction holds — the test is whether chat works on OpenAI’s GPT-4o-mini with the same prompts and tools, or whether the implicit assumptions about Anthropic’s tool-calling protocol leak through.
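A save-time capability check could be sketched as below. Everything here is an assumption for illustration: the REQUIRED table, ModelEntry, and validateStrategy are hypothetical names; the requirement for generate_plan (tool_use) and extract_document (vision) follows the notes, while the chat and classify entries are guesses, and privacy_approved is left unmodeled.

```typescript
type Capability = "vision" | "tool_use" | "privacy_approved";
type Role = "chat" | "classify" | "extract_document" | "generate_plan";

// Assumed per-role requirements; chat and classify are guesses.
const REQUIRED: Record<Role, Capability[]> = {
  chat: ["tool_use"],
  classify: [],
  extract_document: ["vision"],
  generate_plan: ["tool_use"],
};

interface ModelEntry {
  id: string;
  capabilities: Capability[];
}

// Returns validation errors; an empty list means the strategy may be saved.
function validateStrategy(
  strategy: Partial<Record<Role, string>>,
  catalog: ModelEntry[],
): string[] {
  const errors: string[] = [];
  for (const role of Object.keys(strategy) as Role[]) {
    const modelId = strategy[role];
    const model = catalog.filter((m) => m.id === modelId)[0];
    if (!model) {
      errors.push(`${role}: unknown model ${modelId}`);
      continue;
    }
    for (const cap of REQUIRED[role]) {
      if (model.capabilities.indexOf(cap) === -1) {
        errors.push(`${role}: ${model.id} lacks ${cap}`);
      }
    }
  }
  return errors;
}
```

Rejecting at save time, and not at call time, means a bad mapping never reaches the router in the first place.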

Sprint 8 has a decision point at the end (per Dev Plan §8): “if behind, cut OpenRouter and Gemini for V1, ship Anthropic + OpenAI only.” I’m going into M8 betting we land at least Anthropic + OpenAI and at least one of the other two. The interface-versus-implementation cost ratio favors shipping the third adapter once the second is done, because the second is what proves the interface. We’ll see.

The other thing I’m carrying into M8: cost-line UX with parent-child linking (per spec §5.13.6). The escalation flow already records the parent_call_id shape in its provenance JSON; what’s missing is the llm.calls partitioned table to put it in. That table belongs with M8 too, since multi-provider eval is meaningless without per-call cost telemetry.