2026-05-10

Sprint 8 — the abstraction holds

Sprint 2 scaffolded a role-routed LLM abstraction that I described at the time as the place “every Domi internal caller goes through to invoke an LLM, by role rather than by provider/model.” The intent was clean: when a future Sprint adds OpenAI or Gemini, it’s a new file under packages/shared/src/llm/adapters/ and a catalog entry, nothing else changes. That’s the contract on paper. Sprint 8 was when the contract got cashed.

It cashed cleanly. gpt-4o ran the same 17 chat fixtures, with the same prompts and the same tools as claude-sonnet-4-6, at a 94% pass rate (16/17) and about 47% of the cost ($0.1047 vs $0.2233). The one OpenAI miss is interesting in a way I’ll get to, but the structural answer is: yes, the abstraction works. Two evenings, five PRs, all merged.

What shipped

OpenAIAdapter covering all four V1 roles. Same shape as AnthropicAdapter: typed input/output per role, versioned prompts, the LlmAdapter interface. The public input/output types come from anthropic.ts and get re-imported — they’re provider-agnostic by design, which is the abstraction’s whole point. Catalog entries for gpt-4o-2024-08-06 (flagship, $2.50/$10 per MTok) and gpt-4o-mini-2024-07-18 (cost tier, $0.15/$0.60). ESLint rule extension banning openai outside the adapters directory, mirroring the Sprint 6 @anthropic-ai/sdk rule. Live smoke against the API on Day 1: chat, generate_plan, classify all round-tripped at the expected cost.
llm.calls partitioned table per Data Model §8. llm Postgres schema, monthly RANGE partitions on occurred_at, three index shapes (time-ordered tenant view, role-filtered, partial on parent_call_id), RLS via the same current_setting('app.tenant_id', true)::uuid pattern as everywhere else. Ten partitions seeded covering Feb–Nov 2026; rotation is automation work for later. New recordLlmCall(db, args) helper; the adapters stay HTTP-pure (no DB dep) so tests / eval / smoke scripts continue to work without a live Postgres connection. Wired into the two adapter call sites that exist in V1: lib/extraction.ts (extract_document) and the escalate_to_plan tool’s onPlanGenerated callback (generate_plan). Both fire-and-forget — a telemetry insert failure must never block the user-visible response.
validateStrategy(strategy) at packages/shared/src/llm/strategy.ts. Pure function, no I/O. Given a complete Record<WorkloadRole, ModelEntry> it enforces four gates: every one of the seven roles is covered, each role’s model has the required capabilities, no entry is deprecated, and every entry’s provider is in IMPLEMENTED_PROVIDERS. Returns ALL violations rather than failing on the first, so the future Settings picker UI can show them at once. Reuses ROLE_CAPABILITY_REQUIREMENTS from router.ts: the gate definitions live in exactly one place so routing-time and save-time enforcement can never drift.
Three strategy templates in config/strategies.ts: anthropic-only (the current default: Sonnet for chat / extract / plan / privacy, Haiku for the cheap tier), cost-optimized (gpt-4o-mini for chat, Haiku for the cheap tier, Sonnet kept for vision / reasoning / privacy), and frontier (best-in-class per role: gpt-4o for chat, Sonnet for the rest). Each template is run through validateStrategy at module load, so a template that drifts from the catalog gets caught at test time. openrouter-test is deferred until the OpenRouter adapter ships; it is omitted entirely rather than left commented out as a half-shipped artifact.
The cross-provider chat eval matrix. pnpm eval --role=chat --matrix runs the 17 chat fixtures against every provider in MATRIX_CHAT_MODELS. Per-provider planner (so escalate_to_plan is exercised on every provider, not just Anthropic). Per-fixture × per-provider grid in the console output, plus a single *-chat-matrix.json result file for regression tracking. Cost-regression warning if any non-Anthropic provider exceeds 2× Anthropic on the same fixtures (informational, doesn’t fail the run). Pass-rate floor 0.80 per provider; CI exits 1 if any provider falls below.
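
A rough sketch of the two matrix checks described above (the 0.80 pass-rate floor that fails CI, and the informational 2× cost warning). This is not the actual eval code: ProviderResult, MatrixVerdict, and judgeMatrix are hypothetical names, and the real result-file shape may differ. The sample numbers are the ones from this sprint's run.

```typescript
// Hypothetical shapes for illustration; the real *-chat-matrix.json format is not shown here.
interface ProviderResult {
  provider: string;
  model: string;
  passed: number;
  total: number;
  costUsd: number;
}

interface MatrixVerdict {
  exitCode: 0 | 1;
  failures: string[]; // providers below the pass-rate floor (these fail the run)
  warnings: string[]; // cost regressions (informational only)
}

const PASS_RATE_FLOOR = 0.8;
const COST_WARN_MULTIPLE = 2;

function judgeMatrix(results: ProviderResult[], baselineProvider = "anthropic"): MatrixVerdict {
  const failures: string[] = [];
  const warnings: string[] = [];
  const baseline = results.find((r) => r.provider === baselineProvider);

  for (const r of results) {
    const passRate = r.total === 0 ? 0 : r.passed / r.total;
    if (passRate < PASS_RATE_FLOOR) {
      failures.push(`${r.provider}/${r.model}: pass rate ${(passRate * 100).toFixed(1)}% is below the floor`);
    }
    // Only non-baseline providers are compared against the baseline's cost.
    if (baseline && r.provider !== baselineProvider && r.costUsd > COST_WARN_MULTIPLE * baseline.costUsd) {
      warnings.push(`${r.provider}/${r.model}: cost $${r.costUsd.toFixed(4)} exceeds 2x the baseline`);
    }
  }

  // Warnings never affect the exit code; only floor violations do.
  return { exitCode: failures.length > 0 ? 1 : 0, failures, warnings };
}

const sprint8Verdict = judgeMatrix([
  { provider: "anthropic", model: "claude-sonnet-4-6", passed: 17, total: 17, costUsd: 0.2233 },
  { provider: "openai", model: "gpt-4o-2024-08-06", passed: 16, total: 17, costUsd: 0.1047 },
]);
console.log(sprint8Verdict);
```

With 17 fixtures, the 0.80 floor means a provider can miss at most three (14/17 is about 82%; 13/17 is about 76%).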

The matrix demo per Dev Plan §8 reads cleanly on the console:

=== chat eval matrix — 2026-05-08 ===
fixtures:        17
providers:       2

  anthropic   claude-sonnet-4-6    pass=17/17 (100.0%) cost=$0.2233
  openai      gpt-4o-2024-08-06    pass=16/17  (94.1%) cost=$0.1047

What surprised me

The abstraction is genuinely an abstraction. Sprint 2’s design called for one place where Domi internal code dispatches LLM calls by role, the catalog handling provider/model selection, and adapters being the only files that import provider SDKs. I’d half-expected to find that the abstraction leaked — that some adapter-specific assumption had crept into a higher layer over six sprints of building, and the OpenAI add would surface it. It didn’t. The OpenAI adapter took a pleasant few hours on a single evening. The next time someone asks “is this abstraction worth the design overhead before you have a second implementation,” I’ll point at this sprint.
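
For readers who haven't seen the Sprint 2 design, the shape is roughly the following. This is a deliberately tiny sketch under stated assumptions, not the real code: every name in it (SketchAdapter, stubAdapter, invokeByRole, and so on) is hypothetical, the adapters are offline stubs rather than SDK wrappers, and the real LlmAdapter interface has typed input/output per role rather than one generic shape.

```typescript
// All names are illustrative; the real interfaces live under packages/shared/src/llm/.
type SketchProvider = "anthropic" | "openai";
type SketchRole = "chat" | "generate_plan";

interface RoleInput { prompt: string }
interface RoleOutput { text: string; servedBy: string }

// Adapters are the only layer that would import a provider SDK.
// They are stubbed here so the sketch runs without network access.
interface SketchAdapter {
  invoke(role: SketchRole, modelId: string, input: RoleInput): Promise<RoleOutput>;
}

function stubAdapter(provider: SketchProvider): SketchAdapter {
  return {
    async invoke(role, modelId, input) {
      return { text: `[${role}] ${input.prompt}`, servedBy: `${provider}/${modelId}` };
    },
  };
}

const sketchAdapters: Record<SketchProvider, SketchAdapter> = {
  anthropic: stubAdapter("anthropic"),
  openai: stubAdapter("openai"),
};

// A strategy maps each role to a catalog entry; callers never name a provider.
interface SketchModelEntry { provider: SketchProvider; modelId: string }
type SketchStrategy = Record<SketchRole, SketchModelEntry>;

async function invokeByRole(
  strategy: SketchStrategy,
  role: SketchRole,
  input: RoleInput,
): Promise<RoleOutput> {
  const entry = strategy[role];
  return sketchAdapters[entry.provider].invoke(role, entry.modelId, input);
}

const mixedStrategy: SketchStrategy = {
  chat: { provider: "openai", modelId: "gpt-4o-2024-08-06" },
  generate_plan: { provider: "anthropic", modelId: "claude-sonnet-4-6" },
};

invokeByRole(mixedStrategy, "chat", { prompt: "hello" }).then((out) => console.log(out.servedBy));
```

In this shape, adding a provider means one more adapter and one more member of the provider union; invokeByRole and its callers stay untouched, which is the property Sprint 8 was testing.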

OpenAI is materially cheaper than Anthropic on the same fixtures. I knew the per-MTok rates were lower (gpt-4o is $2.50/$10 versus Sonnet’s $3/$15, and gpt-4o-mini is much cheaper still), but the practical cost difference on a real workload depends on how the provider tokenizes prompts and how chatty the model is in its replies. Empirically: OpenAI ran the chat matrix at $0.10 vs Anthropic’s $0.22, about 47% of the cost. The escalation positives (which generate plans via the planner adapter, so two LLM calls per fixture) drive most of the differential. Tighter responses and cheaper rates compound.
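
The arithmetic behind that, as a small sketch. The rates and run totals are the ones quoted in this post; the token counts are made up purely to isolate the rate effect, and callCostUsd is a hypothetical helper, not the adapters' actual cost function.

```typescript
interface PerMTokRates { inputUsd: number; outputUsd: number }

const SONNET_RATES: PerMTokRates = { inputUsd: 3, outputUsd: 15 };
const GPT4O_RATES: PerMTokRates = { inputUsd: 2.5, outputUsd: 10 };

// Cost of one call in USD, given per-million-token rates.
function callCostUsd(rates: PerMTokRates, inputTokens: number, outputTokens: number): number {
  return (inputTokens * rates.inputUsd + outputTokens * rates.outputUsd) / 1e6;
}

// Hypothetical token counts, identical for both models, so only the rates differ:
// (10000 * 2.5 + 1000 * 10) / (10000 * 3 + 1000 * 15) = 35000 / 45000, about 0.778
const sameTokensRatio = callCostUsd(GPT4O_RATES, 10000, 1000) / callCostUsd(SONNET_RATES, 10000, 1000);

// With identical token counts the ratio always lands between these two bounds.
const outputOnlyRatio = GPT4O_RATES.outputUsd / SONNET_RATES.outputUsd; // 10 / 15, about 0.667
const inputOnlyRatio = GPT4O_RATES.inputUsd / SONNET_RATES.inputUsd; // 2.5 / 3, about 0.833

// Observed totals from the matrix run: about 0.469
const observedRatio = 0.1047 / 0.2233;

console.log({ sameTokensRatio, outputOnlyRatio, inputOnlyRatio, observedRatio });
```

Rates alone can only bring the ratio down to roughly 0.67 (an all-output workload). The observed ratio of about 0.47 sits below that bound, so the OpenAI runs must also have been billed for fewer tokens, whether from tokenization, shorter replies, or both.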

OpenAI’s one miss is illuminating. escalation-positive-cross-asset is the fixture where the user asks “Looking at my car, my house, and my health, what should I prioritize this month?” — explicitly a cross-asset synthesis case (one of the five SHOULD-call categories in §5.13.9). Anthropic calls escalate_to_plan and produces a plan. OpenAI answered the question directly without escalating. The tool description is identical for both; the trigger heuristics are pasted verbatim from the spec. gpt-4o reads the cross-asset trigger more conservatively than Sonnet does. Not a structural failure of the abstraction — both providers were given the same prompt and the same tool, and the disagreement is at the model’s interpretation of the description, not the framework’s plumbing. The right surface to track this is the eval; if a future prompt edit affects it, we’ll know. Tightening the cross-asset language risks regressing Anthropic’s behavior. For now: known regression, logged in the result file, no fix.

The drizzle-kit --custom empty-body migration bit a third time. When you run drizzle-kit generate --custom, it creates a SQL file with a placeholder comment. If you run db:migrate before filling in the SQL, the migration is marked applied with the placeholder content, and re-running db:migrate after writing the real DDL doesn’t re-apply because it’s already marked done. I learned this on Sprint 6 (the app_get_user_tenant function), saw it again on Sprint 6’s tail of follow-ups, and hit it again this sprint on the llm.calls partitioned table. The procedural fix — write the SQL into the file before db:migrate — is obvious and I’ve now committed to it three times. The footgun is real because the two-step (generate the empty file, edit it, then migrate) feels natural and the gap between “I have an empty file” and “I have a migration applied” is invisible from the CLI. Adding a manual cat step between generate and migrate to my mental checklist. We’ll see if that one sticks.
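
If the mental checklist doesn't stick, the same check could be mechanized. A minimal sketch, assuming a pre-migrate script that reads each migration file and refuses to proceed when one contains nothing but comments and whitespace. Both function names are hypothetical, the placeholder string in the usage is invented rather than drizzle-kit's actual text, and nothing here is a drizzle-kit feature.

```typescript
// Returns true when a migration file contains no executable SQL:
// only whitespace, `--` line comments, and /* ... */ block comments.
// Deliberately naive: it does not parse SQL string literals, which is fine
// for an emptiness check because any real statement leaves other text behind.
function isEffectivelyEmptySql(sql: string): boolean {
  const withoutBlockComments = sql.replace(/\/\*[\s\S]*?\*\//g, "");
  const withoutLineComments = withoutBlockComments.replace(/--.*$/gm, "");
  return withoutLineComments.trim().length === 0;
}

// Given file name -> file contents, lists the migrations that are still empty.
function findEmptyMigrations(files: Record<string, string>): string[] {
  return Object.keys(files).filter((fileName) => isEffectivelyEmptySql(files[fileName] ?? ""));
}

const stillEmpty = findEmptyMigrations({
  "0007_llm_calls.sql": "-- placeholder comment (hypothetical text)\n",
  "0006_tenant_fn.sql": "CREATE SCHEMA llm; -- real DDL\n",
});
console.log(stillEmpty);
```

A wrapper around db:migrate could load the migrations directory into that map and exit non-zero when the list is non-empty, which would make the gap between "empty file" and "applied migration" visible from the CLI.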

Capability gates have to live in one place. Sprint 7’s escalate_to_plan tool description embeds the spec’s trigger heuristics verbatim — same source, single point of truth. Sprint 8’s validateStrategy reuses ROLE_CAPABILITY_REQUIREMENTS from router.ts — same idea, different shape. If routing-time enforcement and save-time enforcement diverged, the user could save a strategy that the runtime would reject mid-call. Today they can’t, because the constant is shared. Worth noticing as a pattern: when two different code paths enforce the same rule, they should source it from one constant, full stop.
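
To make the pattern concrete, here is a cut-down sketch with three roles instead of seven. It is not the real validateStrategy: the capability names, SKETCH_ROLE_REQUIREMENTS, SKETCH_IMPLEMENTED_PROVIDERS, and both function names are stand-ins I made up for illustration. The part that matters is that the save-time and routing-time paths read the same constant through the same helper.

```typescript
type GateRole = "chat" | "extract_document" | "generate_plan";
type Capability = "tools" | "vision"; // hypothetical capability names

interface GateModelEntry {
  id: string;
  provider: string;
  capabilities: Capability[];
  deprecated: boolean;
}

// Single source of truth, standing in for ROLE_CAPABILITY_REQUIREMENTS in router.ts.
const SKETCH_ROLE_REQUIREMENTS: Record<GateRole, Capability[]> = {
  chat: ["tools"],
  extract_document: ["vision"],
  generate_plan: ["tools"],
};
const SKETCH_IMPLEMENTED_PROVIDERS = ["anthropic", "openai"];

function missingCapabilities(role: GateRole, entry: GateModelEntry): Capability[] {
  return SKETCH_ROLE_REQUIREMENTS[role].filter((c) => entry.capabilities.indexOf(c) === -1);
}

// Save-time check: collects every violation instead of stopping at the first.
function validateStrategySketch(strategy: Partial<Record<GateRole, GateModelEntry>>): string[] {
  const violations: string[] = [];
  for (const role of Object.keys(SKETCH_ROLE_REQUIREMENTS) as GateRole[]) {
    const entry = strategy[role];
    if (!entry) {
      violations.push(`${role}: no model assigned`);
      continue;
    }
    const missing = missingCapabilities(role, entry);
    if (missing.length > 0) violations.push(`${role}: ${entry.id} lacks ${missing.join(", ")}`);
    if (entry.deprecated) violations.push(`${role}: ${entry.id} is deprecated`);
    if (SKETCH_IMPLEMENTED_PROVIDERS.indexOf(entry.provider) === -1) {
      violations.push(`${role}: provider ${entry.provider} has no adapter`);
    }
  }
  return violations;
}

// Routing-time check: same helper, same constant, so the two paths cannot disagree.
function canRoute(role: GateRole, entry: GateModelEntry): boolean {
  return missingCapabilities(role, entry).length === 0;
}

const sonnetEntry: GateModelEntry = {
  id: "claude-sonnet-4-6", provider: "anthropic", capabilities: ["tools", "vision"], deprecated: false,
};
console.log(validateStrategySketch({ chat: sonnetEntry, extract_document: sonnetEntry }));
```

The last line reports a single violation (generate_plan has no model assigned); a strategy with several problems would report all of them in one pass, which is what a settings form needs.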

Where Sprint 9 picks up

M9 — Email & Calendar connector (Dev Plan §8 Phase 9). The big piece is Gmail watch via Cloud Pub/Sub — locked in spec v0.9 as push notifications, not polling, because polling 15-minute windows for one user’s inbox is noise and Gmail’s API supports a watch+notify primitive that drops a Pub/Sub message when a label gets a new message. The webhook receives the notification, enqueues an extraction job, and the existing Sprint 3-4 pipeline handles the rest. The GCP project for Pub/Sub was provisioned in Sprint 0; activation happens here.

A few smaller things I’d like to fold in if there’s room:

Cost-line UI in Settings → AI usage. The data is now in llm.calls; the UI to read it isn’t. M9 might be the right time to surface it as a simple table — “calls in the last 30 days, role / model / cost / tokens” — even if the parent-child grouping has to wait for the chat surface to be routed through the adapter.
The OpenAI cross-asset miss. Track as a known regression in the eval. If a future prompt edit affects it (positively or negatively), the eval will catch it.
Gemini and OpenRouter adapters. Per Dev Plan §10 these were the cuttable items at the end of Sprint 8. Pace was clean enough they’re feasible, but they’re not critical-path. Carrying both forward as V1-cuttable; if a strong reason emerges to add a third or fourth provider, the work is now bounded by the same shape as Sprint 8’s.

The sprint that I wasn’t sure how to scope going in turned out to be one of the cleanest. Sometimes the sign that an abstraction is right is that the work to use it for the second time is faster than the work to build it for the first.