2026-05-07

Sprint 4 — the engine fires

Sprint 4 closed today, the day after Sprint 3. Five one-week sprints, in five calendar days. The predictive engine now fires: given a vehicle owned by a tenant and a Quebec region pack with twenty-nine cadence rules, one function call returns the right tasks for the right dates with full provenance. For the realized rule (vehicle.tire.winter_swap, “install winter tires”), the database row has the regulated source (Quebec law: winter tires mandatory from December 1) pinned to the predicted task. This is the first piece of the system that turns “living in Quebec with a Mazda” into “Domi tells you to swap your tires by November 15.”

Five fixtures, all deterministic, total cost $0.

What shipped

**members + assets schemas with RLS.** The graph-side foundation. assets.kind enum (member|vehicle|boat|residence|appliance), members.kind enum (adult|teen|child|caregiver|helper|other), attributes JSONB on both for kind-specific extras (vehicle: make, model, year, vin, …; residence: address, bedrooms, …), self-referential guardian_member_id for the §7.3 minor-needs-guardian case, GIN index on assets.attributes for “vehicles where attributes->>'make' = 'Mazda'” rule predicates. Plain Postgres, plain Drizzle, all the indexes the spec lists.
**tasks schema.** One table for predicted, extracted, and manual tasks; source and lifecycle_state enums distinguish them. provenance JSONB carries the mandatory (call_id, model, prompt_version, region_pack_version) tuple per CLAUDE.md §6, plus rule-specific extras for predicted tasks. created_by stored as plain text rather than a users FK so 'system' is representable for predictions without a sentinel users row.
**region_packs table + Quebec pack as data.** No tenant_id, no RLS — region packs are reference data shared across tenants; the region_pack_version recorded on each row is what attributes a fact to the version that produced it. The CA-QC v2026.05.0 pack is hand-translated from the spec into a typed TS module (packages/shared/src/region-pack/seeds/ca-qc-2026-05-0.ts), validated against a Zod schema, then JSON-stringified into a dollar-quoted Postgres INSERT inlined in the migration so the seed lands wherever the migration runs. Twenty-nine cadence rules across vehicles / boats / residences / appliances / member health, eleven holidays.
**The engine itself.** Two layers. (a) Pure decision function evaluateSeasonalWindow({ rule, asset, now, regionPack, languagePref }) → TaskCandidate | null. No DB, no clock: the caller passes now. Fourteen vitest cases cover in-window / pre-window / post-window / acquired-after-window-opens / kind-mismatch / archived / sold / retired / start-boundary / end-boundary / year rollover / language-pref selection / wrong-schedule-kind guard. (b) Side-effecting runPredictionsForTenant(db, tenantId, options?) loads the active region pack, reads non-archived assets inside withTenant(), iterates rules × assets, dedupes against existing tasks via provenance->>'rule_key' = ? AND asset_id = ? AND provenance->>'predicted_year' = ?, then inserts. Returns a summary so the dev route + eval set can assert on counts directly.
**First predict_task eval baseline.** Five for five. Zero dollars. Each fixture creates a fresh tenant, inserts seeded assets, fires the engine with the fixture’s now, asserts the emitted task set, then DELETE FROM tenants WHERE id = ? cascades child rows. Zero residual rows in staging post-run, verified. Pass-rate floor for predict_task is 1.0: the engine is deterministic, so anything less is a regression, not flakiness. The five fixtures cover the four meaningful states (in-window / before / after / acquired-after-window) plus a dedup case (run prediction twice, second run is 0 created + 1 skip).
**Dev surfaces for dogfood.** POST /api/dev/predict (auth + ENABLE_DEV_ROUTES=1 gated) and a pnpm --filter @domi/shared run predict CLI. Both thin wrappers around runPredictionsForTenant so the engine code stays in one place. Lets me seed my own Mazda + watch the tire-swap task appear from a browser or terminal.
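
To make the pure layer concrete, here is a minimal sketch of what a seasonal-window decision function could look like. It is not the shipped evaluateSeasonalWindow: the rule, asset, and candidate shapes are hypothetical and simplified, regionPack is dropped from the arguments, the acquired-after-window-opens case is omitted, the window dates in the example are illustrative rather than copied from the CA-QC pack, and "due when the window closes" is an assumption.

```typescript
// Hypothetical, simplified shapes: the real rule, asset, and candidate types are not shown in this post.
type AssetKind = "member" | "vehicle" | "boat" | "residence" | "appliance";
type Lang = "en" | "fr";

interface MonthDay {
  month: number; // 1-12
  day: number;
}

interface SeasonalRule {
  key: string;
  assetKind: AssetKind;
  schedule: { kind: "seasonal_window"; opens: MonthDay; closes: MonthDay };
  title: Record<Lang, string>;
}

interface Asset {
  id: string;
  kind: AssetKind;
  archivedAt: Date | null;
}

interface TaskCandidate {
  ruleKey: string;
  assetId: string;
  title: string;
  dueAt: Date;
  predictedYear: number;
}

const utcMidnight = (year: number, md: MonthDay): Date =>
  new Date(Date.UTC(year, md.month - 1, md.day));

// Pure: no DB access and no clock reads; the caller supplies `now`.
function evaluateSeasonalWindow(args: {
  rule: SeasonalRule;
  asset: Asset;
  now: Date;
  languagePref: Lang;
}): TaskCandidate | null {
  const { rule, asset, now, languagePref } = args;
  if (asset.kind !== rule.assetKind) return null; // kind mismatch
  if (asset.archivedAt !== null) return null; // archived asset

  const { opens, closes } = rule.schedule;
  // A window such as Dec 1 to Mar 15 closes in the year after it opens.
  const wraps =
    closes.month < opens.month ||
    (closes.month === opens.month && closes.day < opens.day);

  // Check the window that opens this year, then the one that opened last year.
  const thisYear = now.getUTCFullYear();
  const t = now.getTime();
  for (const openYear of [thisYear, thisYear - 1]) {
    const start = utcMidnight(openYear, opens);
    const end = utcMidnight(wraps ? openYear + 1 : openYear, closes);
    // Simplification: both boundaries are the UTC-midnight instants, inclusive.
    if (t >= start.getTime() && t <= end.getTime()) {
      return {
        ruleKey: rule.key,
        assetId: asset.id,
        title: rule.title[languagePref],
        dueAt: end, // assumption: the task is due when the window closes
        predictedYear: openYear, // assumption: keyed by the year the window opens
      };
    }
  }
  return null; // pre-window or post-window
}

// Illustrative data only: these window dates are NOT copied from the CA-QC pack.
const winterSwap: SeasonalRule = {
  key: "vehicle.tire.winter_swap",
  assetKind: "vehicle",
  schedule: {
    kind: "seasonal_window",
    opens: { month: 10, day: 1 },
    closes: { month: 11, day: 15 },
  },
  title: { en: "Install winter tires", fr: "Installer les pneus d'hiver" },
};
const mazda: Asset = { id: "asset-1", kind: "vehicle", archivedAt: null };

const candidate = evaluateSeasonalWindow({
  rule: winterSwap,
  asset: mazda,
  now: new Date("2026-10-20T12:00:00Z"),
  languagePref: "en",
});
console.log(candidate?.dueAt.toISOString()); // 2026-11-15T00:00:00.000Z
```

Because now is an argument, each test case is just a different timestamp passed to the same function, which is what keeps the in-window, boundary, and year-rollover cases cheap to enumerate.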

That’s M4 — predictive engine — substantively done.

What I cut

The LLM-augmented predict_task role. The cadence rules are JSON; the engine that turns Quebec winter tire law + my Mazda + 2026-10-20 into a task row is deterministic window math. Spec calls for a Haiku layer on top — refining the copy (“hey, your car’s been parked since September, take it to Pneus Beaumont before mid-November”), bundling related tasks, prioritizing by season, interpreting edge cases. None of that is load-bearing for “the right tasks appear at the right time,” which is what the deterministic baseline does. The Haiku layer becomes a clean A/B against the existing eval set when it lands.

Of the four schedule kinds in the schema (seasonal_window, mileage_based, date_based, event_triggered), only the first is realized. The other three are silent no-ops in the dispatcher — the engine intentionally returns silently on unsupported kinds so the same code path runs against the full pack. Nine of twenty-nine rules are reachable today (every seasonal_window rule). The next-most-natural one — vehicle.registration_renewal, annual on the asset’s acquired_at anniversary — needs date_based, which lands when I want the rule to fire.
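
The silent no-op behaviour can be sketched as a lookup table of evaluators keyed by schedule kind. This is an illustration under assumptions, not the actual dispatcher: CadenceRule, Evaluator, dispatchRules, and the unsupported counter are hypothetical names, and the seasonal_window evaluator is a stand-in that always emits.

```typescript
// Hypothetical shapes: only the four schedule kind names come from the schema described above.
type ScheduleKind =
  | "seasonal_window"
  | "mileage_based"
  | "date_based"
  | "event_triggered";

interface CadenceRule {
  key: string;
  schedule: { kind: ScheduleKind };
}

type Evaluator = (rule: CadenceRule, now: Date) => string | null;

// Only realized kinds get an entry; adding date_based later is one more key here.
const evaluators: Partial<Record<ScheduleKind, Evaluator>> = {
  // Stand-in evaluator that always emits; a real one would do the window math.
  seasonal_window: (rule) => `task:${rule.key}`,
};

interface DispatchSummary {
  emitted: string[];
  unsupported: number; // assumption: counted here only to make the no-ops visible
}

function dispatchRules(rules: CadenceRule[], now: Date): DispatchSummary {
  const summary: DispatchSummary = { emitted: [], unsupported: 0 };
  for (const rule of rules) {
    const evaluate = evaluators[rule.schedule.kind];
    if (evaluate === undefined) {
      summary.unsupported += 1; // unsupported kind: skip without throwing
      continue;
    }
    const task = evaluate(rule, now);
    if (task !== null) summary.emitted.push(task);
  }
  return summary;
}

const samplePack: CadenceRule[] = [
  { key: "vehicle.tire.winter_swap", schedule: { kind: "seasonal_window" } },
  { key: "vehicle.registration_renewal", schedule: { kind: "date_based" } },
];
console.log(dispatchRules(samplePack, new Date("2026-10-20T00:00:00Z")));
```

With this shape the full pack can be fed through one code path today, and realizing a new kind does not touch the loop.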

What surprised me

Per-fixture tenant isolation is the right pattern. First eval design instinct was to wrap each fixture in a Postgres transaction with rollback at the end, so nothing persists. Realized partway in: the engine itself uses withTenant() which opens its own transactions, and Postgres savepoint semantics under nested transactions are different enough between Drizzle’s db.transaction and the engine’s path that “wrap the test in a transaction” stops being a clean abstraction. Switched to “create a tenant per fixture, run, assert, drop the tenant — DELETE FROM tenants WHERE id = ? cascades through every child table.” Five fixtures, ~3-4 seconds end to end, zero residual rows. Idiom worth keeping for any future eval that needs real DB state.
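
A minimal sketch of the tenant-per-fixture idiom, assuming a small store interface in place of the real database. TenantStore, InMemoryTenantStore, and withFreshTenant are hypothetical names; in the real setup deleteTenant would be the DELETE FROM tenants WHERE id = ? that cascades to child tables. The try/finally is an assumption of this sketch, added so a failing assertion still drops the tenant.

```typescript
// Hypothetical interface standing in for the real DB layer.
interface TenantStore {
  createTenant(): Promise<string>;
  // In Postgres this would be a cascading DELETE on the tenants table.
  deleteTenant(tenantId: string): Promise<void>;
}

// In-memory fake so the sketch runs without a database.
class InMemoryTenantStore implements TenantStore {
  readonly tenants = new Set<string>();
  private nextId = 1;

  async createTenant(): Promise<string> {
    const id = `tenant-${this.nextId++}`;
    this.tenants.add(id);
    return id;
  }

  async deleteTenant(tenantId: string): Promise<void> {
    this.tenants.delete(tenantId);
  }
}

// Runs one fixture against a fresh tenant and always drops the tenant afterwards.
async function withFreshTenant<T>(
  store: TenantStore,
  fixture: (tenantId: string) => Promise<T>,
): Promise<T> {
  const tenantId = await store.createTenant();
  try {
    return await fixture(tenantId);
  } finally {
    await store.deleteTenant(tenantId); // cleanup runs on success and on failure
  }
}
```

A fixture then reads as `await withFreshTenant(store, async (tenantId) => { /* seed, run, assert */ })`, and no transaction wrapping is needed around the code under test.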

JavaScript’s Date.toISOString() always emits .sssZ. First eval run failed two fixtures because the JSON expected 2026-11-15T00:00:00Z (no millis) and the engine emits 2026-11-15T00:00:00.000Z. Fix: the per-task date matcher compares by epoch milliseconds, so it is format-tolerant. Same shape as the extract_document grader’s date comparator from Sprint 3: date-format normalization keeps reappearing as a small but recurring eval-side concern. Worth pulling into a shared comparator helper at some point.
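
A format-tolerant comparator along those lines might look like the sketch below. sameInstant is a hypothetical name, not the grader's actual helper; it relies only on Date.parse, which accepts ISO 8601 timestamps with or without the milliseconds field.

```typescript
// Compares two ISO 8601 timestamps by the instant they denote, not by their text.
// Hypothetical helper name; unparseable input is treated as a mismatch.
function sameInstant(expected: string, actual: string): boolean {
  const expectedMs = Date.parse(expected);
  const actualMs = Date.parse(actual);
  if (Number.isNaN(expectedMs) || Number.isNaN(actualMs)) return false;
  return expectedMs === actualMs;
}

console.log(sameInstant("2026-11-15T00:00:00Z", "2026-11-15T00:00:00.000Z")); // true
```

Comparing parsed instants also absorbs other textual differences between equivalent timestamps, which is why the same helper could serve both graders.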

The deterministic baseline ships immediately useful. I’d budgeted for the Haiku layer to be where the rules “really came alive” with personalization, region-aware copy, bundling. But the deterministic engine alone is already useful: it turns a region pack rule into a typed graph row with full provenance, surfaceable through any UI that reads the tasks table. The Haiku layer becomes refinement, not the load-bearing primitive. That’s a smaller-than-expected gap between “shipped” and “useful” — same shape as the Sprint 3 surprise where Sonnet’s PDF support meant zero rasterization glue code.

On the pace

Five sprints in five days. The naive extrapolation puts V1 in mid-June; the rest of the actual schedule (chat surface, MCP server, predictive scheduler, calendar integration, ingestion connectors) genuinely is harder than what’s been done so far. But three of the next four milestones — M5 (chat surface), M6 (settings IA), M7 (MCP server) — have shipped reference implementations (Vercel AI SDK, shadcn, Node MCP SDK) that should keep pace high. M8 (Gmail / Drive / Dropbox connectors) is where I expect the rate to slow.

I’m still not re-scoping. Going to finish M5 and decide at the end of next sprint.

Where Sprint 5 picks up

M5 — chat surface. First user-visible feature on domiapp.ai. Vercel AI SDK chat route, shadcn-themed UI, tool-use over the existing read paths (documents, extracted_facts, tasks, utility_bills, confirm_prompts). Document upload from chat. Ad-hoc runPredictions trigger from chat. The natural sentence “what bills do I have?” reads from extracted_facts; “anything coming up?” reads from tasks; “I just got a new utility bill” routes to extract_document against the upload. Every backend feature surfaces through one UI naturally — that’s why the chat surface is M5 and not M9.

Three sprints of backend work were inspectable via Drizzle Studio + curl. Sprint 5’s work will be visible to anyone visiting domiapp.ai. That’s the trade I made by sequencing the foundations first; the cash-out happens this week.