Sprint 2 — three sprints in three days
Sprint 2 closed today, Tuesday. Sprint 1 closed the day before on Monday. Sprint 0 — the bootstrap — closed on Sunday. Three one-week sprints, in three calendar days. The eval matrix passed five-for-five on its first real run, row-level security started enforcing in production, and the LLM abstraction layer compiled with capability gates that say no when they should.
The schedule I scoped seventy-two hours ago — twenty-four one-week sprints, ~10 hours each, six months calendar — is now obviously wrong. I’ll get into what that means at the bottom; first, the actual work.
What shipped
- RLS now enforces. From Sprint 1’s “the policy is in pg_policies but does nothing” to a production database where queries return zero rows under the wrong tenant context and one row under the right one. The runtime connects as a non-superuser app_role instead of the table-owning neondb_owner; per-request, the route handler opens a transaction and runs set_config('app.tenant_id', $tenantId, true) before any tenant-scoped query touches the DB. A withTenant(db, tenantId, fn) helper wraps the whole pattern so callers don’t have to think about it.
- The LLM abstraction layer. Types, role-based router, capability gates, typed errors. Eleven unit tests cover capability mismatch (vision / tool_use / privacy_approved) and lifecycle preference (stable > preview > deprecated). The router is the only place import Anthropic from "@anthropic-ai/sdk" is allowed to appear in the codebase; every other caller dispatches by role, never by provider or model.
- Anthropic adapter for chat, classify, and extract_document. Live smoke against a Hydro-Québec utility-bill fixture: Haiku 4.5 returned {kind: "utility_bill", confidence: 0.98} for $0.000385. Provenance on every result: callId, model, promptVersion, input/output tokens, USD cost.
- First real eval matrix run. Five classify fixtures (mix EN/FR, Quebec-flavored synthetic data: a Hydro-Québec utility bill, an RBC mortgage statement, a belairdirect insurance renewal, a CSMB school letter, a Desjardins Visa statement). Five for five pass. Average confidence 0.976. Total run cost $0.0026, under the $0.05 budget by twentyfold. Every future run gets compared to that baseline.
That’s two major milestones — M1 (auth + tenant + RLS) and M2 (LLM abstraction + Anthropic + first eval) — substantively closed in one sprint. The original plan had each as its own phase taking two-plus weeks.
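The capability gates and lifecycle preference described above can be sketched roughly as follows. This is a minimal illustration, not the actual router: the type names, the `route` function, the `CapabilityMismatchError` class, and the model ids in `exampleCatalog` are all hypothetical. Only the three capability names, the three role names, and the stable > preview > deprecated ordering come from the post.

```typescript
// Hypothetical shapes; the post names the concepts, not these exact types.
type Capability = "vision" | "tool_use" | "privacy_approved";
type Lifecycle = "stable" | "preview" | "deprecated";
type Role = "chat" | "classify" | "extract_document";

interface ModelEntry {
  id: string;
  roles: Role[];
  capabilities: Capability[];
  lifecycle: Lifecycle;
}

// Typed error so callers can tell "no capable model" apart from other failures.
class CapabilityMismatchError extends Error {
  readonly role: Role;
  readonly required: Capability[];
  constructor(role: Role, required: Capability[]) {
    super(`No model for role "${role}" provides: ${required.join(", ")}`);
    this.name = "CapabilityMismatchError";
    this.role = role;
    this.required = required;
  }
}

// Lower rank wins: stable > preview > deprecated.
const LIFECYCLE_RANK: Record<Lifecycle, number> = {
  stable: 0,
  preview: 1,
  deprecated: 2,
};

// Callers dispatch by role and required capabilities, never by provider or model.
function route(
  catalog: ModelEntry[],
  role: Role,
  required: Capability[] = [],
): ModelEntry {
  const candidates = catalog.filter(
    (m) =>
      m.roles.includes(role) &&
      required.every((c) => m.capabilities.includes(c)),
  );
  if (candidates.length === 0) {
    // The gate says no instead of silently falling back to a weaker model.
    throw new CapabilityMismatchError(role, required);
  }
  return candidates.reduce((best, m) =>
    LIFECYCLE_RANK[m.lifecycle] < LIFECYCLE_RANK[best.lifecycle] ? m : best,
  );
}

// Placeholder catalog with made-up model ids, for illustration only.
const exampleCatalog: ModelEntry[] = [
  { id: "model-deprecated", roles: ["chat", "classify"], capabilities: ["vision"], lifecycle: "deprecated" },
  { id: "model-preview", roles: ["classify"], capabilities: ["vision", "tool_use"], lifecycle: "preview" },
  { id: "model-stable", roles: ["chat", "classify", "extract_document"], capabilities: ["tool_use"], lifecycle: "stable" },
];
```

With this catalog, `route(exampleCatalog, "classify")` picks the stable entry, adding `["vision"]` falls back to the preview entry, and asking for `["privacy_approved"]` throws because nothing in the catalog provides it.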
What I cut
- Google OAuth credentials. Code path is in; GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET would light it up. I’m the only user, magic-link works, not bothering yet.
- Sentry, Cloudflare R2 (Sprint 3), and most of the cuttable list.
- Multi-provider LLM adapters: OpenRouter, Gemini. Anthropic-only is fine for V1; the abstraction is in place when I want to test others.
- config/ workspace-package conversion. Today there’s a small relative-path wart where config/catalog.ts and config/pricing.ts import types from packages/shared/src/llm/types via ../packages/shared/.... Works fine; isn’t load-bearing; will fix in a 30-minute PR when I feel like it.
What surprised me
db.transaction() does not work on the Neon HTTP driver. Caught at runtime mid-PR. The HTTP driver throws "No transactions support in neon-http driver" when you call transaction() — the transport is request-per-statement, no shared session. The withTenant helper needs real transactions because set_config('app.tenant_id', …, true) has to be transaction-scoped to be safe under pooled connections; the alternative (session-scoped, third arg false) leaks across requests reusing the same Postgres connection. Switched to the WebSocket-based neon-serverless driver — same connection string, same Drizzle query API, transaction support included. Five-minute fix once I knew, but the kind of thing that reads in a doc as “supported with caveats” and only bites once the code runs.
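The shape of the helper can be sketched as below. This is an assumption-heavy illustration, not the real implementation: the `Tx` and `Db` interfaces are minimal stand-ins for whatever handle the driver exposes, and `makeFakeDb` is a hypothetical in-memory fake that only records call order. The one detail taken from the post is that set_config is called with `true` as its third argument inside a transaction.

```typescript
// Minimal stand-ins for the database client (assumptions for illustration;
// a real Drizzle or neon-serverless handle has a richer API).
interface Tx {
  query(text: string, params?: unknown[]): Promise<unknown[]>;
}
interface Db {
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}

async function withTenant<T>(
  db: Db,
  tenantId: string,
  fn: (tx: Tx) => Promise<T>,
): Promise<T> {
  return db.transaction(async (tx) => {
    // Third argument `true` makes the setting transaction-local, so it is
    // discarded when the transaction ends and cannot leak to the next
    // request that reuses the same pooled connection.
    await tx.query("select set_config('app.tenant_id', $1, true)", [tenantId]);
    // Every query issued through `tx` inside `fn` now runs under that tenant.
    return fn(tx);
  });
}

// Hypothetical in-memory fake that records the statement order.
function makeFakeDb(log: string[]): Db {
  const tx: Tx = {
    async query(text: string, params: unknown[] = []): Promise<unknown[]> {
      log.push(`${text} ${JSON.stringify(params)}`);
      return [];
    },
  };
  return {
    async transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T> {
      log.push("BEGIN");
      try {
        const result = await fn(tx);
        log.push("COMMIT");
        return result;
      } catch (err) {
        log.push("ROLLBACK");
        throw err;
      }
    },
  };
}
```

Running a callback through `withTenant(makeFakeDb(log), "t-1", ...)` leaves the log in the order BEGIN, set_config, the callback's own queries, COMMIT. That ordering is the property the helper exists to guarantee, and it is why a driver without transaction support cannot host it.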
Haiku 4.5 nails Quebec-flavored document classification with a five-line prompt. I was prepared for the model to flounder on French samples or to need Sonnet for accuracy. It didn’t. Five for five at average confidence 0.976, including French school-board correspondence and a French utility bill, on Haiku — the cheap tier. The eval cost $0.0026 for the whole run. At that price, scaling to a fifty-fixture eval per role is trivial. I’m going to be much more aggressive about eval coverage than I’d have budgeted.
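For reference, the headline numbers of an eval run (pass count, average confidence, total cost against budget) reduce to a few lines of arithmetic. The sketch below is illustrative only: `FixtureResult`, `EvalSummary`, and `summarize` are hypothetical names, and the real eval matrix may record more than this.

```typescript
// Hypothetical record for one fixture's outcome.
interface FixtureResult {
  fixture: string;
  expected: string;
  predicted: string;
  confidence: number; // 0..1 as reported by the model
  costUsd: number;
}

interface EvalSummary {
  passed: number;
  total: number;
  avgConfidence: number;
  totalCostUsd: number;
  withinBudget: boolean;
}

function summarize(results: FixtureResult[], budgetUsd: number): EvalSummary {
  const total = results.length;
  // A fixture passes when the predicted kind matches the expected kind.
  const passed = results.filter((r) => r.predicted === r.expected).length;
  const totalCostUsd = results.reduce((sum, r) => sum + r.costUsd, 0);
  const avgConfidence =
    total === 0
      ? 0
      : results.reduce((sum, r) => sum + r.confidence, 0) / total;
  return {
    passed,
    total,
    avgConfidence,
    totalCostUsd,
    withinBudget: totalCostUsd <= budgetUsd,
  };
}

// Placeholder rows with invented values, NOT the per-fixture results
// of the actual run (the post only reports the aggregates).
const placeholderRun: FixtureResult[] = [
  { fixture: "fixture-a", expected: "utility_bill", predicted: "utility_bill", confidence: 0.9, costUsd: 0.0005 },
  { fixture: "fixture-b", expected: "school_letter", predicted: "utility_bill", confidence: 0.8, costUsd: 0.0005 },
];
```

Storing one summary as the baseline and comparing each later run against it is enough to flag a drop in pass count or a jump in cost.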
Three sprints in three days. I scoped twenty-four one-week sprints at ten hours each. The actual rate is significantly faster: Sprint 2’s substance took several hours of unfocused evening time. Some of that is bootstrap-shaped: the early sprints have less integration friction, and auth and abstraction are well-trodden. But a meaningful chunk is that AI-assisted development on a well-specified codebase is just faster than I had budgeted. The schedule to V1 is going to compress; I just don’t know by how much yet. Probably not 24×, but probably not 1× either. I’ll know better after Sprint 3, when I expect the pace to normalize as I move past auth/abstraction and into the messier ingestion territory.
The lesson I’m taking from these three days: the pre-build estimate I gave myself was wrong by enough that I should re-scope before pushing further on V1 commitments. Not by reducing scope (the spec is right) but by pulling forward the decision points. The ones I’d parked for “after Sprint 9” (multi-provider commit) and “Sprint 12” (V1 trim or commit) might land much earlier.
Where Sprint 3 picks up
M3 — document ingestion. Cloud accounts: Cloudflare R2 (carry-forward from Sprint 1’s backlog). Schema: a documents table with tenant_id + RLS plus the provenance fields per spec. Upload route in apps/web that hits R2 and triggers an extract_document call against the stored bytes. Confidence-gated auto-write — fields below 0.85 confidence go to a confirm-prompt queue rather than directly into the graph.
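The confidence gate itself is small. A minimal sketch, assuming extracted fields arrive as a flat list with one confidence score each: `ExtractedField`, `gateFields`, and the field names in the example are hypothetical, and only the 0.85 threshold comes from the plan above.

```typescript
// Hypothetical shape for one extracted field.
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // 0..1
}

// Threshold from the plan: fields below 0.85 need human confirmation.
const AUTO_WRITE_THRESHOLD = 0.85;

interface GateResult {
  autoWrite: ExtractedField[];
  needsConfirm: ExtractedField[];
}

function gateFields(
  fields: ExtractedField[],
  threshold: number = AUTO_WRITE_THRESHOLD,
): GateResult {
  const autoWrite: ExtractedField[] = [];
  const needsConfirm: ExtractedField[] = [];
  for (const field of fields) {
    // "Below the threshold" goes to the queue, so a field at exactly
    // the threshold is written automatically.
    if (field.confidence < threshold) {
      needsConfirm.push(field);
    } else {
      autoWrite.push(field);
    }
  }
  return { autoWrite, needsConfirm };
}

// Invented example values, for illustration only.
const exampleFields: ExtractedField[] = [
  { name: "amount_due", value: "123.45", confidence: 0.99 },
  { name: "due_date", value: "2025-01-31", confidence: 0.85 },
  { name: "account_number", value: "000-111", confidence: 0.84 },
];
```

Gating per field rather than per document means one uncertain value does not hold back the confident ones, at the cost of a document landing in the graph partially populated until the queue is cleared.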
If Sprint 2’s pace holds, this is half a day. If the integration with R2 + signed URLs + the actual document parsing turns out to be where the messy work lives, it’s a full sprint. The next post will say which.