Sprint 2 — three sprints in three days


Sprint 2 closed today, Tuesday. Sprint 1 closed the day before on Monday. Sprint 0 — the bootstrap — closed on Sunday. Three one-week sprints, in three calendar days. The eval matrix passed five-for-five on its first real run, row-level security started enforcing in production, and the LLM abstraction layer compiled with capability gates that say no when they should.

The schedule I scoped seventy-two hours ago — twenty-four one-week sprints, ~10 hours each, six months calendar — is now obviously wrong. I’ll get into what that means at the bottom; first, the actual work.

What shipped

  • RLS now enforces. From Sprint 1’s “the policy is in pg_policies but does nothing” to a production database where queries return zero rows under the wrong tenant context and one row under the right one. The runtime connects as a non-superuser app_role instead of the table-owning neondb_owner; per-request, the route handler opens a transaction and runs set_config('app.tenant_id', $tenantId, true) before any tenant-scoped query touches the DB. A withTenant(db, tenantId, fn) helper wraps the whole pattern so callers don’t have to think about it.
  • The LLM abstraction layer. Types, role-based router, capability gates, typed errors. Eleven unit tests cover capability mismatch (vision / tool_use / privacy_approved) and lifecycle preference (stable > preview > deprecated). The router is the only place import Anthropic from "@anthropic-ai/sdk" is allowed to appear in the codebase — every other caller dispatches by role, never by provider or model.
  • Anthropic adapter for chat, classify, and extract_document. Live smoke against a Hydro-Québec utility-bill fixture: Haiku 4.5 returned {kind: "utility_bill", confidence: 0.98} for $0.000385. Provenance on every result — callId, model, promptVersion, input/output tokens, USD cost.
  • First real eval matrix run. Five classify fixtures (mix EN/FR, Quebec-flavored synthetic data — Hydro-Québec utility bill, an RBC mortgage statement, a belairdirect insurance renewal, a CSMB school letter, a Desjardins Visa statement). Five for five pass. Average confidence 0.976. Total run cost $0.0026 — under the $0.05 budget by twentyfold. Every future run gets compared to that baseline.

That’s two major milestones — M1 (auth + tenant + RLS) and M2 (LLM abstraction + Anthropic + first eval) — substantively closed in one sprint. The original plan had each as its own phase taking two-plus weeks.

What I cut

  • Google OAuth credentials. Code path is in; GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET would light it up. I’m the only user, magic-link works, not bothering yet.
  • Sentry, Cloudflare R2 (Sprint 3), and most of the cuttable list.
  • Multi-provider LLM adapters — OpenRouter, Gemini. Anthropic-only is fine for V1; the abstraction is in place when I want to test others.
  • config/ workspace-package conversion. Today there’s a small relative-path wart where config/catalog.ts and config/pricing.ts import types from packages/shared/src/llm/types via ../packages/shared/.... Works fine; isn’t load-bearing; will fix in a 30-minute PR when I feel like it.

What surprised me

db.transaction() does not work on the Neon HTTP driver. Caught at runtime mid-PR. The HTTP driver throws "No transactions support in neon-http driver" when you call transaction() — the transport is request-per-statement, no shared session. The withTenant helper needs real transactions because set_config('app.tenant_id', …, true) has to be transaction-scoped to be safe under pooled connections; the alternative (session-scoped, third arg false) leaks across requests reusing the same Postgres connection. Switched to the WebSocket-based neon-serverless driver — same connection string, same Drizzle query API, transaction support included. Five-minute fix once I knew, but the kind of thing that reads in a doc as “supported with caveats” and only bites once the code runs.

Haiku 4.5 nails Quebec-flavored document classification with a five-line prompt. I was prepared for the model to flounder on French samples or to need Sonnet for accuracy. It didn’t. Five for five at average confidence 0.976, including French school-board correspondence and a French utility bill, on Haiku — the cheap tier. The eval cost $0.0026 for the whole run. At that price, scaling to a fifty-fixture eval per role is trivial. I’m going to be much more aggressive about eval coverage than I’d have budgeted.

Three sprints in three days. I scoped twenty-four one-week sprints at ten hours each. The actual rate is significantly faster — Sprint 2’s substance took several hours of unfocused time in the evening. Some of that is bootstrap-shaped: the early sprints have less integration friction; auth and abstraction are well-trodden. But a meaningful chunk is that AI-assisted development on a well-specified codebase is just faster than I had budgeted. The schedule to V1 is going to compress; I just don’t know by how much yet. Probably not 24×, but probably not 1× either. I’ll know better after Sprint 3, where the pace will normalize once I’m past auth/abstraction and into the messier ingestion territory.

The lesson I’m taking from these three days: the pre-build estimate I gave myself was wrong by enough that I should re-scope before pushing further on V1 commitments. Not by reducing scope — the spec is right — but by accelerating which decision points come up. The ones I’d parked for “after Sprint 9” (multi-provider commit) and “Sprint 12” (V1 trim or commit) might land much earlier.

Where Sprint 3 picks up

M3 — document ingestion. Cloud accounts: Cloudflare R2 (carry-forward from Sprint 1’s backlog). Schema: a documents table with tenant_id + RLS plus the provenance fields per spec. Upload route in apps/web that hits R2 and triggers an extract_document call against the stored bytes. Confidence-gated auto-write — fields below 0.85 confidence go to a confirm-prompt queue rather than directly into the graph.

If Sprint 2’s pace holds, this is half a day. If the integration with R2 + signed URLs + the actual document parsing turns out to be where the messy work lives, it’s a full sprint. The next post will say which.