2026-05-05

Sprint 3 — the document loop closes

Sprint 3 closed today, keeping up the one-day-per-sprint pace: four one-week sprints in four calendar days. M3 (document ingestion) went from “needs to land” to “the whole loop runs against real bytes” in one evening. A user can drop a PDF on the upload endpoint; it lands in object storage, gets read by a vision model, has its fields extracted and validated against a typed schema, and is then either auto-promoted into the canonical graph or queued for human review, depending on whether the extraction clears the confidence and completeness thresholds. Every step writes provenance.

The first eval-matrix run for extract_document against Sonnet 4.6 was five for five, average confidence 0.984, total cost $0.045 across all five fixtures. Locked in as the baseline.

What shipped

documents + extracted_facts schemas with RLS. Standard tenant_id = current_setting('app.tenant_id', true)::uuid policy on both. Filename encrypted at rest in bytea; only mime_type and byte_size are plaintext, per the spec. Extracted facts carry the full provenance JSONB on every row — call_id, model, prompt_version, region_pack_version, plus token counts and USD cost.
POST /api/documents. Multipart-form upload to Cloudflare R2 (Canada region), with libsodium secretbox encrypting the original filename before it touches the database. R2 key scheme is <tenantId>/<documentId>. The DB insert runs inside withTenant(tenantId, ...) so the RLS WITH CHECK policy is satisfied at write time. Tenant resolution is a temporary placeholder using a privileged DB connection until the proper SECURITY DEFINER lookup function lands.
extract_document wired to the upload path. Anthropic’s Messages API supports PDF natively as a document content block, so there’s no rasterization layer — the upload bytes go straight to the vision model. The prompt asks Sonnet to do classification and extraction in one pass, returning {kind, fields, confidence}; this saves a round-trip versus a separate classify-then-extract pipeline. Output validated against per-kind Zod schemas (lenient: every field optional, type-checked where present). Cost on a real PDF: $0.0073 per call.
Confidence-gated auto-write. The decision is a pure function with two thresholds taken from CLAUDE.md: confidence ≥ 0.85 AND completeness ≥ 0.7 (the share of declared fields that came back populated). When both thresholds are met, fields are coerced (date → ISO YYYY-MM-DD, money strings → numeric(12,2)) and written into the canonical graph table for that kind, with extracted_facts.auto_written_at stamped for audit. Below either threshold, the row drops into a confirm_prompts queue with a reason (low_confidence, low_completeness, or unsupported_kind) for the review UI to handle in a later sprint. V1 realizes one kind end-to-end: utility_bill → utility_bills. Other kinds drop to the queue as unsupported_kind until their target tables land in Sprint 4+.
First extract_document eval matrix run. Five synthetic PDF fixtures (Hydro-Québec utility bill, RBC mortgage statement, Belair Direct insurance renewal, CSMB school letter, Desjardins Visa statement; mixed FR + EN) generated programmatically with pdf-lib and base64-inlined into JSON fixtures. Per-field grading with format-aware comparators — “April 30, 2026” matches “2026-04-30”, “1 248,00 $” matches “1248.00”, etc. Five for five pass. Average confidence 0.984. Total run cost $0.044778 — well under the $0.50 budget. The only non-100% per-field rate was Sonnet appending the school commission to school_name (“École primaire des Découvreurs (Commission scolaire Marguerite-Bourgeoys)” vs. expected “École primaire des Découvreurs”); pedantic mismatch, fixture still cleared the 70% per-field floor at 80%.
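To make the confidence gate from the list above concrete, here is a minimal sketch of what such a pure decision function could look like. The thresholds, the three queue reasons, and the single supported kind come from this post; the function and type names, the definition of a "populated" field, and the order in which the checks run are my assumptions for illustration, not the code in the repo.

```typescript
// Hypothetical names throughout. Thresholds are the ones stated in the post.
type QueueReason = "low_confidence" | "low_completeness" | "unsupported_kind";

type GateDecision =
  | { action: "auto_write" }
  | { action: "queue"; reason: QueueReason };

const MIN_CONFIDENCE = 0.85;
const MIN_COMPLETENESS = 0.7;

// Kinds that have a canonical target table. V1 realizes only utility_bill.
const SUPPORTED_KINDS: readonly string[] = ["utility_bill"];

// Fraction of declared fields that came back with a usable value.
// Treating undefined, null, and "" as missing is an assumption.
function completeness(
  declaredFields: readonly string[],
  fields: Record<string, unknown>,
): number {
  if (declaredFields.length === 0) return 0;
  const present = declaredFields.filter((name) => {
    const value = fields[name];
    return value !== undefined && value !== null && value !== "";
  });
  return present.length / declaredFields.length;
}

// Pure function: no I/O, same inputs always give the same decision.
// The check order (kind, then confidence, then completeness) is an assumption.
function decide(
  kind: string,
  confidence: number,
  declaredFields: readonly string[],
  fields: Record<string, unknown>,
): GateDecision {
  if (SUPPORTED_KINDS.indexOf(kind) === -1) {
    return { action: "queue", reason: "unsupported_kind" };
  }
  if (confidence < MIN_CONFIDENCE) {
    return { action: "queue", reason: "low_confidence" };
  }
  if (completeness(declaredFields, fields) < MIN_COMPLETENESS) {
    return { action: "queue", reason: "low_completeness" };
  }
  return { action: "auto_write" };
}
```

Keeping the gate free of database and model calls is what makes it cheap to unit-test against every threshold edge before any row is written.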

That’s M3 — document ingestion — substantively done.

What I cut

Nothing meaningful this sprint. The cuttable list still has Gemini, OpenRouter, Word/Excel ingestion, knowledge-graph viz, and Gmail watch on it; none of them came up. The decision points where I’d cut them haven’t arrived.

The closest thing to a cut: I’m running a single filename-encryption key for V1 instead of building per-tenant DEKs with a KEK envelope. The wire format I’m storing is libsodium’s packed nonce || ciphertext bytes, exactly what per-tenant DEKs will produce, so the migration to real per-tenant keys is a re-key in place, not a format change. Documented as “V1 placeholder” in the secrets register.
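For readers who haven’t seen the packed format, here is a rough sketch of the nonce || ciphertext layout. It shows only the packing and unpacking; the encryption itself (libsodium’s crypto_secretbox_easy in the real route) is left out. The helper names are hypothetical. The 24-byte nonce and 16-byte MAC sizes are libsodium’s secretbox constants.

```typescript
// libsodium secretbox sizes: crypto_secretbox_NONCEBYTES and crypto_secretbox_MACBYTES.
const NONCE_BYTES = 24;
const MAC_BYTES = 16;

// Hypothetical helper: concatenates nonce and ciphertext into one bytea-ready buffer.
function packSecretbox(nonce: Uint8Array, ciphertext: Uint8Array): Uint8Array {
  if (nonce.length !== NONCE_BYTES) {
    throw new Error("nonce must be " + NONCE_BYTES + " bytes");
  }
  const packed = new Uint8Array(NONCE_BYTES + ciphertext.length);
  packed.set(nonce, 0);
  packed.set(ciphertext, NONCE_BYTES);
  return packed;
}

// Hypothetical helper: splits a stored value back into its two parts.
// A secretbox ciphertext always carries at least the MAC, so anything shorter is malformed.
function unpackSecretbox(packed: Uint8Array): { nonce: Uint8Array; ciphertext: Uint8Array } {
  if (packed.length < NONCE_BYTES + MAC_BYTES) {
    throw new Error("packed value is too short to be nonce || ciphertext");
  }
  return {
    nonce: packed.slice(0, NONCE_BYTES),
    ciphertext: packed.slice(NONCE_BYTES),
  };
}
```

Because the key never appears in the stored bytes, swapping the single V1 key for a per-tenant DEK changes which key decrypts a row, not how the row is laid out.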

What surprised me

Sonnet got every fixture right on the first eval pass. I was prepared to iterate the prompt against per-fixture failures — the V1 floor for extract_document is 70%, on the assumption that vision-based field extraction on real-world documents is meaningfully harder than text classification. Five for five on the first run, average confidence 0.984, says my fixtures are too clean (clean text-only PDFs from a programmatic renderer, no real-world skew, watermarks, or scan noise) and the eval bar is going to need to rise before it actually gates anything. That’s a happy problem; the immediate next move is to get a real Hydro-Québec PDF and a real RBC statement into the eval set and see what the number does.

libsodium-wrappers 0.7.16 ships an unresolvable ESM bundle. Vitest fails immediately at module-load time with a missing-import error inside the package’s own ESM build. Pinned to 0.7.15; recorded in memory so future sessions don’t re-bump it. A small reminder that “supported” and “actually works in this resolver” can be different things.

Anthropic’s PDF support means I never touched OCR or rasterization. The original spec called out “vision model directly, no separate OCR layer” as a Sprint-3 design lock — but I was still expecting some kind of bytes-massaging step (resize, page-by-page split, etc.). Anthropic’s document content block accepts the raw PDF bytes and Sonnet does the rest, including multi-page documents. The upload route hands the bytes from R2 straight to the model and the extracted-fields JSON comes back. Zero glue code in between.
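As an illustration of how little glue is involved, here is a sketch of a request body that carries a PDF as a document content block. The block shape (type "document" with a base64 source and media_type "application/pdf") follows Anthropic’s Messages API; in that form the PDF bytes are base64-encoded inside the JSON body. The function name, the prompt text, and the max_tokens value are placeholders of mine, and the model ID is a parameter because I’m not asserting the exact identifier here.

```typescript
interface DocumentBlock {
  type: "document";
  source: { type: "base64"; media_type: "application/pdf"; data: string };
}

interface TextBlock {
  type: "text";
  text: string;
}

interface ExtractRequest {
  model: string;
  max_tokens: number;
  messages: Array<{ role: "user"; content: Array<DocumentBlock | TextBlock> }>;
}

// Hypothetical builder: one user turn holding the PDF followed by the instruction text.
function buildExtractRequest(pdfBase64: string, model: string, prompt: string): ExtractRequest {
  return {
    model,
    max_tokens: 2048, // placeholder value, not the project's setting
    messages: [
      {
        role: "user",
        content: [
          {
            type: "document",
            source: { type: "base64", media_type: "application/pdf", data: pdfBase64 },
          },
          { type: "text", text: prompt },
        ],
      },
    ],
  };
}
```

The resulting object is what gets serialized as the request body, so the only transformation between stored bytes and model input in this sketch is the base64 encoding.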

Per-field grading needed format-aware comparators basically immediately. The first version of the eval grader compared expected and actual as case-insensitive strings. It failed every date and every monetary amount because Sonnet returns “April 30, 2026” when my fixture said “2026-04-30”, or “$1,847.22” when my fixture said “1847.22”. Added two specialized comparators (date parses both formats to ISO; money strips currency/separators and compares as numbers, with FR-style “1 248,00” and EN-style “1,248.00” both supported). Without these the eval would be measuring formatting drift, not extraction accuracy.
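Here is a small sketch of what format-aware comparators along these lines might look like. It is not the grader from the repo: the function names are hypothetical, the date parser handles only the two shapes mentioned above (ISO and English “Month D, YYYY”), and the money parser relies on a heuristic I’m assuming is good enough for bills, namely that a trailing separator followed by one or two digits is the decimal mark.

```typescript
type FieldFormat = "text" | "date" | "money";

const MONTHS = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Normalizes "2026-04-30" or "April 30, 2026" to ISO YYYY-MM-DD; null if neither shape fits.
function toIsoDate(input: string): string | null {
  const s = input.trim();
  if (/^\d{4}-\d{2}-\d{2}$/.test(s)) return s;
  const long = /^([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})$/.exec(s);
  if (!long) return null;
  const monthIndex = MONTHS.indexOf(long[1].toLowerCase());
  if (monthIndex < 0) return null;
  return long[3] + "-" + pad2(monthIndex + 1) + "-" + pad2(Number(long[2]));
}

// Parses FR-style "1 248,00 $" and EN-style "$1,248.00" to integer cents; null if unparseable.
function toCents(input: string): number | null {
  // Drop currency symbols, letters, and every kind of space.
  const cleaned = input.replace(/[^\d.,-]/g, "");
  // Heuristic: a final separator followed by 1 or 2 digits is the decimal mark.
  const decimal = /[.,](\d{1,2})$/.exec(cleaned);
  const wholePart = decimal ? cleaned.slice(0, decimal.index) : cleaned;
  const digits = wholePart.replace(/[.,]/g, "");
  if (!/^-?\d+$/.test(digits)) return null;
  const fraction = decimal ? (decimal[1].length === 1 ? decimal[1] + "0" : decimal[1]) : "00";
  const sign = digits.charAt(0) === "-" ? -1 : 1;
  return sign * (Math.abs(Number(digits)) * 100 + Number(fraction));
}

// Compares one expected/actual pair using the comparator for its declared format.
function fieldMatches(expected: string, actual: string, format: FieldFormat): boolean {
  if (format === "date") {
    const e = toIsoDate(expected);
    return e !== null && e === toIsoDate(actual);
  }
  if (format === "money") {
    const e = toCents(expected);
    return e !== null && e === toCents(actual);
  }
  return expected.trim().toLowerCase() === actual.trim().toLowerCase();
}
```

Comparing money as integer cents sidesteps floating-point equality, and returning null for unparseable input means a garbled value fails the field instead of accidentally matching another garbled value.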

On the pace, again

Last post I said: “Sprint 2’s substance took several hours of unfocused time in the evening… the schedule to V1 is going to compress; I just don’t know by how much yet.” Sprint 3 was another evening. So that’s now four sprints in four days.

The naive extrapolation — twenty-four sprints at this pace — puts V1 in mid-June, six weeks instead of six months. That’s almost certainly wrong; pace will slow somewhere. The current sprints are still benefiting from being on well-trodden territory (auth, schemas, one provider’s SDK, one well-shaped LLM call). The sprints where things genuinely get harder — multi-channel chat surface, the predictive engine, the MCP server, real Gmail integration — are not in the rear-view yet.

But “wrong by 4×” is the more interesting case. If V1 ships in two months instead of six, the decision points I’d parked at Sprint 9 and Sprint 12 (multi-provider commit, V1 trim or commit) collapse into “now.” I haven’t re-scoped yet; I’m going to do another sprint and see if the pace holds before deciding what to do about the schedule.

Where Sprint 4 picks up

M4 — predictive engine. The Quebec region pack already lives at docs/specs/Domi - Region Pack v0.1 (Quebec).md with 31 cadence rules: things like “school registration deadline = March 1 of preceding school year for CSMB,” “winter tire deadline = December 1,” “RAMQ renewal cadence = every 4 years.” The job is to wire these into a recurring task generator that runs against the canonical graph (residences, vehicles, members, …) and emits predicted tasks with full provenance into a tasks table.

This is also where confidence-gated auto-write earns its keep: predicted tasks are emitted with provenance.kind = "regional_rule", and the same review queue used for low-confidence document extractions can host predicted tasks the user hasn’t accepted yet.
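To give a feel for the shape of that generator, here is a speculative sketch covering two of the rule styles quoted above: a fixed annual date (winter tires, December 1) and an every-N-years renewal (RAMQ, every 4 years). None of this exists yet. The rule and task types, the function names, and the string-based date arithmetic are assumptions; only provenance.kind = "regional_rule" and the idea of carrying the region pack version come from the post.

```typescript
// Hypothetical rule shapes; the real region pack format may differ.
type CadenceRule =
  | { id: string; kind: "annual"; month: number; day: number } // month is 1-12
  | { id: string; kind: "every_n_years"; years: number };

interface PredictedTask {
  ruleId: string;
  dueDate: string; // ISO YYYY-MM-DD
  provenance: { kind: "regional_rule"; regionPackVersion: string };
}

function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Next occurrence of month/day on or after asOf. ISO dates sort lexicographically,
// so plain string comparison is enough and avoids timezone surprises.
function nextAnnual(month: number, day: number, asOf: string): string {
  const year = Number(asOf.slice(0, 4));
  const candidate = year + "-" + pad2(month) + "-" + pad2(day);
  return candidate >= asOf ? candidate : (year + 1) + "-" + pad2(month) + "-" + pad2(day);
}

// Returns null when an interval rule has no anchor date to count from.
// A February 29 anchor with a non-multiple-of-4 interval is not handled in this sketch.
function predictTask(
  rule: CadenceRule,
  asOf: string,
  lastDone: string | null,
  regionPackVersion: string,
): PredictedTask | null {
  let dueDate: string;
  if (rule.kind === "annual") {
    dueDate = nextAnnual(rule.month, rule.day, asOf);
  } else {
    if (lastDone === null) return null;
    dueDate = (Number(lastDone.slice(0, 4)) + rule.years) + lastDone.slice(4);
  }
  return { ruleId: rule.id, dueDate, provenance: { kind: "regional_rule", regionPackVersion } };
}
```

In the real engine the anchor for an interval rule would presumably come from the canonical graph (for example, a member’s last renewal date), which is why the sketch returns null when no anchor is known.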

If next sprint’s pace looks like the last three, M4 closes in a week and we’re talking about M5 (chat surface) and the schedule re-scope at the end of it.