Sprint 3 — the document loop closes
Sprint 3 closed today, continuing the one-sprint-per-day streak: four one-week sprints in four calendar days. M3 (document ingestion) went from “needs to land” to “the whole loop runs against real bytes” in one evening. A user can drop a PDF on the upload endpoint and it lands in object storage, gets read by a vision model, has its fields extracted and validated against a typed schema, and is either auto-promoted into the canonical graph or queued for human review, depending on whether the model was confident enough. Every step writes provenance.
The first eval-matrix run for extract_document against Sonnet 4.6 was five for five, average confidence 0.984, total cost $0.045 across all five fixtures. Locked in as the baseline.
What shipped
- documents + extracted_facts schemas with RLS. Standard tenant_id = current_setting('app.tenant_id', true)::uuid policy on both. Filename encrypted at rest in bytea; only mime_type and byte_size are plaintext, per the spec. Extracted facts carry the full provenance JSONB on every row: call_id, model, prompt_version, region_pack_version, plus token counts and USD cost.
- POST /api/documents: multipart-form upload to Cloudflare R2 (Canada region), with libsodium secretbox encrypting the original filename before it touches the database. R2 key scheme is <tenantId>/<documentId>. The DB insert runs inside withTenant(tenantId, ...) so the RLS WITH-CHECK policy is satisfied even at write time. Tenant resolution is a temporary placeholder using a privileged DB connection until the proper SECURITY DEFINER lookup function lands.
- extract_document wired to the upload path. Anthropic’s Messages API supports PDF natively as a document content block, so there’s no rasterization layer: the upload bytes go straight to the vision model. The prompt asks Sonnet to do classification and extraction in one pass, returning {kind, fields, confidence}; this saves a round-trip versus a separate classify-then-extract pipeline. Output validated against per-kind Zod schemas (lenient: every field optional, type-checked where present). Cost on a real PDF: $0.0073 per call.
- Confidence-gated auto-write. The decision is a pure function with two thresholds: per CLAUDE.md, confidence ≥ 0.85 AND completeness ≥ 0.7 of the declared fields. Above both thresholds, fields are coerced (date → ISO YYYY-MM-DD, money strings → numeric(12,2)) and written into the canonical graph table for that kind, with extracted_facts.auto_written_at stamped for audit. Below either threshold, the row drops into a confirm_prompts queue with a reason (low_confidence, low_completeness, or unsupported_kind) for the review UI to handle in a later sprint. V1 realizes one kind end-to-end: utility_bill → utility_bills. Other kinds drop to the queue as unsupported_kind until their target tables land in Sprint 4+.
- First extract_document eval matrix run. Five synthetic PDF fixtures (Hydro-Québec utility bill, RBC mortgage statement, Belair Direct insurance renewal, CSMB school letter, Desjardins Visa statement; mixed FR + EN) generated programmatically with pdf-lib and base64-inlined into JSON fixtures. Per-field grading with format-aware comparators: “April 30, 2026” matches “2026-04-30”, “1 248,00 $” matches “1248.00”, etc. Five for five pass. Average confidence 0.984. Total run cost $0.044778, well under the $0.50 budget. The only non-100% per-field rate was Sonnet appending the school commission to school_name (“École primaire des Découvreurs (Commission scolaire Marguerite-Bourgeoys)” vs. expected “École primaire des Découvreurs”); pedantic mismatch, fixture still cleared the 70% per-field floor at 80%.
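To make the confidence gate concrete, here is a minimal sketch of what a pure decision function along these lines could look like. Only the two thresholds (0.85 and 0.7) and the three reason codes come from the list above; the type names, the allow-list of supported kinds, the order of the checks, and the way completeness is counted are assumptions for illustration, not the actual implementation.

```typescript
// Hypothetical names throughout; only the thresholds and reason codes come from the post.
type GateReason = "low_confidence" | "low_completeness" | "unsupported_kind";

type GateDecision =
  | { action: "auto_write" }
  | { action: "queue"; reason: GateReason };

const CONFIDENCE_MIN = 0.85;
const COMPLETENESS_MIN = 0.7;

// Assumption: kinds with a canonical target table are tracked as a simple allow-list.
const SUPPORTED_KINDS = new Set<string>(["utility_bill"]);

interface Extraction {
  kind: string;
  fields: Record<string, unknown>;
  confidence: number;
}

// Fraction of the declared fields that came back with a non-empty value.
function completeness(fields: Record<string, unknown>, declared: string[]): number {
  if (declared.length === 0) return 0;
  const present = declared.filter((name) => {
    const value = fields[name];
    return value !== undefined && value !== null && value !== "";
  });
  return present.length / declared.length;
}

// Pure function: no I/O, same inputs always give the same decision.
function decideAutoWrite(x: Extraction, declared: string[]): GateDecision {
  if (!SUPPORTED_KINDS.has(x.kind)) {
    return { action: "queue", reason: "unsupported_kind" };
  }
  if (x.confidence < CONFIDENCE_MIN) {
    return { action: "queue", reason: "low_confidence" };
  }
  if (completeness(x.fields, declared) < COMPLETENESS_MIN) {
    return { action: "queue", reason: "low_completeness" };
  }
  return { action: "auto_write" };
}
```

Keeping the gate free of database or network access means the thresholds can be unit-tested directly, and the same function can be reused if other producers later feed the review queue.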
That’s M3 — document ingestion — substantively done.
What I cut
Nothing meaningful this sprint. The cuttable list still has Gemini, OpenRouter, Word/Excel ingestion, knowledge-graph viz, and Gmail watch on it; none of them came up. The decision points where I’d cut them haven’t arrived.
The closest thing to a cut: I’m running a single filename-encryption key for V1 instead of building per-tenant DEKs with a KEK envelope. The wire format I’m storing is libsodium’s packed nonce || ciphertext bytes, exactly what per-tenant DEKs will produce, so the migration to real per-tenant keys is a re-key in place, not a format change. Documented as “V1 placeholder” in the secrets register.
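As a rough illustration of why the packed format makes the later migration a re-key rather than a format change, here is a sketch of the byte layout. The 24-byte nonce and 16-byte MAC sizes are libsodium's secretbox constants; the function names are hypothetical, and the open/seal callbacks are stand-ins for libsodium's crypto_secretbox_open_easy and crypto_secretbox_easy already bound to a key, so the sketch stays dependency-free.

```typescript
// libsodium secretbox sizes: crypto_secretbox_NONCEBYTES and crypto_secretbox_MACBYTES.
const NONCE_BYTES = 24;
const MAC_BYTES = 16;

// Pack as nonce || ciphertext, the layout stored in the bytea column.
function packSealed(nonce: Uint8Array, ciphertext: Uint8Array): Uint8Array {
  if (nonce.length !== NONCE_BYTES) throw new Error("unexpected nonce length");
  const out = new Uint8Array(NONCE_BYTES + ciphertext.length);
  out.set(nonce, 0);
  out.set(ciphertext, NONCE_BYTES);
  return out;
}

// Split a stored value back into its two parts.
function unpackSealed(packed: Uint8Array): { nonce: Uint8Array; ciphertext: Uint8Array } {
  // A secretbox ciphertext always includes the MAC, so anything shorter is malformed.
  if (packed.length < NONCE_BYTES + MAC_BYTES) throw new Error("packed value too short");
  return {
    nonce: packed.slice(0, NONCE_BYTES),
    ciphertext: packed.slice(NONCE_BYTES),
  };
}

// Hypothetical callback shapes: decrypt with the old key, encrypt with the new one.
type OpenFn = (ciphertext: Uint8Array, nonce: Uint8Array) => Uint8Array;
type SealFn = (plaintext: Uint8Array) => { nonce: Uint8Array; ciphertext: Uint8Array };

// Re-key one stored value: same layout in, same layout out, only the key changes.
function rekeyInPlace(packed: Uint8Array, openWithOldKey: OpenFn, sealWithNewKey: SealFn): Uint8Array {
  const { nonce, ciphertext } = unpackSealed(packed);
  const plaintext = openWithOldKey(ciphertext, nonce);
  const sealed = sealWithNewKey(plaintext);
  return packSealed(sealed.nonce, sealed.ciphertext);
}
```

Because the column type and layout stay identical, a migration built this way only needs to walk the rows and rewrite each value; readers never have to distinguish an old format from a new one.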
What surprised me
Sonnet got every fixture right on the first eval pass. I was prepared to iterate the prompt against per-fixture failures — the V1 floor for extract_document is 70%, on the assumption that vision-based field extraction on real-world documents is meaningfully harder than text classification. Five for five on the first run, average confidence 0.984, says my fixtures are too clean (clean text-only PDFs from a programmatic renderer, no real-world skew, watermarks, or scan noise) and the eval bar is going to need to rise before it actually gates anything. That’s a happy problem; the immediate next move is to get a real Hydro-Québec PDF and a real RBC statement into the eval set and see what the number does.
libsodium-wrappers 0.7.16 ships an unresolvable ESM bundle. Vitest fails immediately at module-load time with a missing-import error inside the package’s own ESM build. Pinned to 0.7.15; recorded in memory so future sessions don’t re-bump it. A small reminder that “supported” and “actually works in this resolver” can be different things.
Anthropic’s PDF support means I never touched OCR or rasterization. The original spec called out “vision model directly, no separate OCR layer” as a Sprint-3 design lock — but I was still expecting some kind of bytes-massaging step (resize, page-by-page split, etc.). Anthropic’s document content block accepts the raw PDF bytes and Sonnet does the rest, including multi-page documents. The upload route hands the bytes from R2 straight to the model and the extracted-fields JSON comes back. Zero glue code in between.
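For readers who have not used the document content block, a minimal sketch of the request body might look like this. The block shape (type "document" with a base64 source and media_type "application/pdf") follows Anthropic's Messages API; the function name, the max_tokens value, and passing the model ID and prompt in as parameters are assumptions, and the real route presumably goes through the SDK rather than building the body by hand.

```typescript
// Content block shapes, limited to the two kinds this sketch needs.
interface DocumentBlock {
  type: "document";
  source: { type: "base64"; media_type: "application/pdf"; data: string };
}

interface TextBlock {
  type: "text";
  text: string;
}

interface MessagesRequest {
  model: string;
  max_tokens: number;
  messages: Array<{ role: "user"; content: Array<DocumentBlock | TextBlock> }>;
}

// Hypothetical helper: the PDF goes in as-is (base64), followed by the extraction prompt.
function buildExtractRequest(pdfBase64: string, model: string, prompt: string): MessagesRequest {
  return {
    model,
    max_tokens: 1024, // assumption: enough for a small {kind, fields, confidence} JSON reply
    messages: [
      {
        role: "user",
        content: [
          {
            type: "document",
            source: { type: "base64", media_type: "application/pdf", data: pdfBase64 },
          },
          { type: "text", text: prompt },
        ],
      },
    ],
  };
}
```

There is no page splitting or image conversion anywhere in the sketch, which is the point: the only transformation between storage and the model is base64 encoding.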
Per-field grading needed format-aware comparators basically immediately. The first version of the eval grader compared expected and actual as case-insensitive strings. It failed every date and every monetary amount because Sonnet returns “April 30, 2026” when my fixture said “2026-04-30”, or “$1,847.22” when my fixture said “1847.22”. Added two specialized comparators (date parses both formats to ISO; money strips currency/separators and compares as numbers, with FR-style “1 248,00” and EN-style “1,248.00” both supported). Without these the eval would be measuring formatting drift, not extraction accuracy.
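A sketch of what the two comparators could look like, assuming English long-form dates and money strings with at most two decimals. The function names and exact parsing rules are illustrative, not the grader's real code; in particular, the rule that a trailing comma followed by one or two digits is a decimal mark is an assumption chosen to separate “1 248,00” from “1,248.00”.

```typescript
const MONTHS = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

// Normalize "2026-04-30" or "April 30, 2026" to ISO; null if neither shape matches.
function toIsoDate(raw: string): string | null {
  const s = raw.trim();
  if (/^\d{4}-\d{2}-\d{2}$/.test(s)) return s;
  const long = /^([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})$/.exec(s);
  if (!long) return null;
  const month = MONTHS.indexOf(long[1].toLowerCase()) + 1;
  if (month === 0) return null;
  return `${long[3]}-${String(month).padStart(2, "0")}-${long[2].padStart(2, "0")}`;
}

// Parse a money string to a number; null if it contains anything but digits,
// separators, whitespace, a minus sign, or "$".
function toAmount(raw: string): number | null {
  if (!/^[\d\s.,$-]+$/.test(raw) || !/\d/.test(raw)) return null;
  let s = raw.replace(/[\s$]/g, "");
  if (/,\d{1,2}$/.test(s)) {
    // FR style: comma is the decimal mark, dots (if any) are grouping.
    s = s.replace(/\./g, "").replace(",", ".");
  } else {
    // EN style: commas are grouping.
    s = s.replace(/,/g, "");
  }
  const n = Number(s);
  return Number.isFinite(n) ? n : null;
}

// Try date, then money, then fall back to a case-insensitive string match.
function fieldMatches(expected: string, actual: string): boolean {
  const d1 = toIsoDate(expected);
  const d2 = toIsoDate(actual);
  if (d1 !== null && d2 !== null) return d1 === d2;

  const a1 = toAmount(expected);
  const a2 = toAmount(actual);
  if (a1 !== null && a2 !== null) return Math.abs(a1 - a2) < 0.005;

  return expected.trim().toLowerCase() === actual.trim().toLowerCase();
}
```

The fallback stays strict on purpose: the school_name case from the eval run would still count as a mismatch here, since appending the school commission changes the string and neither specialized comparator applies.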
On the pace, again
Last post I said: “Sprint 2’s substance took several hours of unfocused time in the evening… the schedule to V1 is going to compress; I just don’t know by how much yet.” Sprint 3 was another evening. So that’s now four sprints in four days.
The naive extrapolation — twenty-four sprints at this pace — puts V1 in mid-June, six weeks instead of six months. That’s almost certainly wrong; pace will slow somewhere. The current sprints are still benefiting from being on well-trodden territory (auth, schemas, one provider’s SDK, one well-shaped LLM call). The sprints where things genuinely get harder — multi-channel chat surface, the predictive engine, the MCP server, real Gmail integration — are not in the rear-view yet.
But “wrong by 4×” is the more interesting case. If V1 ships in two months instead of six, the decision points I’d parked at Sprint 9 and Sprint 12 (multi-provider commit, V1 trim or commit) collapse into “now.” I haven’t re-scoped yet; I’m going to do another sprint and see if the pace holds before deciding what to do about the schedule.
Where Sprint 4 picks up
M4 — predictive engine. The Quebec region pack already lives at docs/specs/Domi - Region Pack v0.1 (Quebec).md with 31 cadence rules: things like “school registration deadline = March 1 of preceding school year for CSMB,” “winter tire deadline = December 1,” “RAMQ renewal cadence = every 4 years.” The job is to wire these into a recurring task generator that runs against the canonical graph (residences, vehicles, members, …) and emits predicted tasks with full provenance into a tasks table.
This is also where confidence-gated auto-write earns its keep: predicted tasks are emitted with provenance.kind = "regional_rule", and the same review queue used for low-confidence document extractions can host predicted tasks the user hasn’t accepted yet.
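To give a sense of the shape this could take, here is a small sketch of turning one fixed-date rule into a predicted task. The region pack's real format is not shown in this post, so the AnnualRule and PredictedTask types, the field names, and the function names are all hypothetical; only provenance.kind = "regional_rule" and the December 1 winter tire deadline are taken from the text above.

```typescript
// Hypothetical rule shape: a deadline that recurs on the same calendar day each year.
interface AnnualRule {
  id: string;
  title: string;
  month: number; // 1-12
  day: number;   // 1-31
}

// Hypothetical task shape; provenance.kind is the value named in the post.
interface PredictedTask {
  title: string;
  dueDate: string; // ISO YYYY-MM-DD
  provenance: { kind: "regional_rule"; ruleId: string; regionPackVersion: string };
}

const pad2 = (n: number): string => String(n).padStart(2, "0");

// Next occurrence on or after `today`; ISO dates compare correctly as strings.
function nextDueDate(rule: AnnualRule, today: string): string {
  const year = Number(today.slice(0, 4));
  const thisYear = `${year}-${pad2(rule.month)}-${pad2(rule.day)}`;
  return thisYear >= today ? thisYear : `${year + 1}-${pad2(rule.month)}-${pad2(rule.day)}`;
}

// Pure generator step: one rule in, one predicted task out.
function predictTask(rule: AnnualRule, today: string, regionPackVersion: string): PredictedTask {
  return {
    title: rule.title,
    dueDate: nextDueDate(rule, today),
    provenance: { kind: "regional_rule", ruleId: rule.id, regionPackVersion },
  };
}
```

Multi-year cadences such as the RAMQ renewal would need an anchor date from the canonical graph (the last renewal, say) rather than a fixed calendar day, so they are left out of this sketch.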
If next sprint’s pace looks like the last three, M4 closes in a week and we’re talking about M5 (chat surface) and the schedule re-scope at the end of it.