Sprint 21 — a green CI that was lying


S21 was the sprint where the test harness grew teeth. It shipped real user-facing features (Law 25 self-serve account deletion, co-owned assets with multiple custodians, an admin console index), but the thing that mattered most wasn’t a feature. It was finally making the real-database integration suites run in CI instead of only on my laptop. The moment that gate went live, it caught a prod incident, two latent test bugs, and a silent RLS bypass in the test harness itself. A green CI had been lying for months. It isn’t anymore.

What shipped

  • Law 25 self-serve tenant deletion (#265). The spec and the privacy policy had promised a right-to-erasure flow since v0.4; the code had a raw DELETE FROM tenants that orphaned every R2 object and left audit PII intact. Now there’s a real deletion_requests table (migration 0020), an enumerate → R2-purge → cascade → orphaned-auth-cleanup → audit-redaction service, a CRON_SECRET-gated daily sweep with a 30-day grace period, owner-only deletion from Settings, and a hidden Meta-key “delete now” for test/dogfood. The privacy policy (en + fr) now matches what the code does.
  • Multiple custodians / co-owners (#279). Assets had a single custodian_member_id. A house with two owners couldn’t be represented. Now asset_custodians is a many-to-many join table (migration 0022 + backfill); the knowledge graph draws an edge to each owner; chat propose_asset accepts custodianMemberNames[]; the graph entity panel has a custodian multi-select.
  • DB-integration CI (#281) — the headline. See below.
  • Plus: db-state-sync pre-merge gate + test-isolation hardening (#285), the asset_custodians GRANT prod hotfix (#283), a co-owns flaky fix (#287), and the earlier-S21 batch (doctype registry, composer focus, Recents rename/archive, tax-doc fields, user menu).
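
The 30-day grace period in the deletion sweep comes down to a timestamp comparison. Here is a minimal sketch of that check, not the shipped service: the DeletionRequest shape, the cancelledAt field, and the dueForPurge name are all assumptions, since the real deletion_requests columns from migration 0020 aren’t reproduced in this write-up.

```typescript
// Hypothetical row shape; the real deletion_requests columns are not shown here.
interface DeletionRequest {
  tenantId: string;
  requestedAt: Date;
  cancelledAt: Date | null; // assumption: a request can be withdrawn during the grace period
}

const GRACE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the requests that were not cancelled and whose grace period has fully elapsed.
function dueForPurge(requests: DeletionRequest[], now: Date): DeletionRequest[] {
  return requests.filter(
    (r) =>
      r.cancelledAt === null &&
      now.getTime() - r.requestedAt.getTime() >= GRACE_DAYS * MS_PER_DAY,
  );
}
```

Passing now in as a parameter, rather than reading the clock inside the function, keeps the daily sweep testable without waiting 30 days.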

The story: a green CI that was lying

Domi’s packages/shared real-DB suites (RLS isolation, audit-before-mutation, tenant-deletion cascade, telemetry/usage rollups) are all skipUnlessConfigured: they only run when DATABASE_URL is set. ci.yml has no database. So for months, every one of those suites silently skipped in CI while the regression-suite doc proudly marked them “✅ in CI.” They passed on my laptop and nowhere else. The whole DB-backed safety net was theatre.
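
One way to keep a skip from passing as a pass is to make the decision explicit and treat a missing database in CI as a failure. The sketch below is a hypothetical complement to skipUnlessConfigured, not its actual implementation: dbSuiteMode is an invented name, and it assumes the CI provider sets CI=true (GitHub Actions does).

```typescript
type Env = Record<string, string | undefined>;
type SuiteMode = "run" | "skip" | "fail";

// Decide what a DB-backed suite should do in the given environment.
function dbSuiteMode(env: Env): SuiteMode {
  const url = env["DATABASE_URL"];
  const hasDb = url !== undefined && url.trim() !== "";
  if (hasDb) return "run";
  // In CI, a missing database is a configuration error, not a reason to skip.
  return env["CI"] === "true" ? "fail" : "skip";
}
```

A test setup file could call dbSuiteMode(process.env) and throw on "fail", so a laptop without a database still skips quietly while CI goes red.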

CLAUDE.md §11 always said these should run against “Neon ephemeral branches” in CI. It just never got built. S21 built it: a workflow that spins a throwaway Neon branch, migrates it, runs the suite, and deletes the branch — per PR, per push, nightly.

It went live and immediately failed. Three times, three different root causes, in sequence:

  1. relation "asset_custodians" does not exist. Migration 0022 had been generated but never applied to staging Neon, the exact S21 #279 footgun that had already 500’d production. Fix: a hotfix GRANT migration (#283), an actual migration of staging, and a db-state-sync gate (#285) so “schema ahead of DB” can’t merge silently again.

  2. Seven RLS-isolation and usage-rollup suites failing. First hypothesis: a missing GRANT (drizzle-kit doesn’t emit them, a recurring pattern). Applied the grant; no change. Second hypothesis: the ephemeral branch is a copy-on-write clone of a populated single-env DB, so “expect 0 rows for another tenant” assertions see real inherited data. Built a clean ci-base template branch (schema + roles + grants, zero data). Re-ran. Byte-identical failures. That determinism (same numbers regardless of data or parent) was the tell: it was never data or schema.

  3. The actual cause: the harness was bypassing RLS. The Neon branch action’s default connection string is neondb_owner — the table owner. Postgres lets a table owner bypass row-level security unless you FORCE it, and Domi only ever ENABLEs RLS. The workflow had fed that owner URL into DATABASE_URL, so the app/test path in CI ran with RLS effectively off. Tenant isolation wasn’t broken in the product — secrets-register.md had stated the invariant outright the whole time: “RLS only enforces because we’re using app_role.” CI just wasn’t. Fix: connect the app path as app_role, keep the owner only for migrate + the tests’ own deliberate bootstrap bypass. Green. 149 tests.
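
The third failure suggests a cheap guard: refuse to start the app/test path when DATABASE_URL names the table owner. This is a minimal sketch, not something the workflow is known to do. The role names neondb_owner and app_role come from above; connectionRole and assertNotOwner are hypothetical helpers.

```typescript
// Roles known to own the tables (and therefore bypass non-FORCEd RLS).
const OWNER_ROLES = new Set(["neondb_owner"]);

// Extract the user name from a postgres:// or postgresql:// connection URL.
function connectionRole(databaseUrl: string): string | null {
  const match = /^postgres(?:ql)?:\/\/([^:@\/]+)(?::[^@]*)?@/.exec(databaseUrl);
  const user = match?.[1];
  return user === undefined ? null : decodeURIComponent(user);
}

// Throws if the URL would connect as a table owner; returns the role otherwise.
function assertNotOwner(databaseUrl: string): string {
  const role = connectionRole(databaseUrl);
  if (role === null) {
    throw new Error("DATABASE_URL has no parsable role");
  }
  if (OWNER_ROLES.has(role)) {
    throw new Error(`DATABASE_URL connects as table owner "${role}"; RLS would be bypassed`);
  }
  return role;
}
```

Parsing the URL is only a proxy for the role actually in effect. A stricter variant would run SELECT current_user on the live connection and compare that against app_role.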

Two wrong hypotheses before the right one — but neither was wasted. The GRANT was a genuine prod bug. ci-base is correct hygiene we’d have wanted anyway. Ruling them out is what isolated the real cause.
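
The “schema ahead of DB” condition from the first failure is, at its core, a set difference between the migrations in the repo and the migrations the target database has recorded. Here is a sketch of that core under stated assumptions: how #285 actually gathers the two lists isn’t described here, and unappliedMigrations is a hypothetical name.

```typescript
// onDisk: migration tags present in the repo; applied: tags recorded in the target DB.
// Both inputs are assumed to be gathered elsewhere (file listing, migrations table query).
function unappliedMigrations(onDisk: string[], applied: string[]): string[] {
  const done = new Set(applied);
  return onDisk.filter((tag) => !done.has(tag)).sort();
}

// A gate would fail the merge whenever this list is non-empty.
const pending = unappliedMigrations(["0020", "0021", "0022"], ["0020", "0021"]);
```

With those inputs, pending contains only "0022", which is the state staging was in when the gate first ran.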

Decisions made

  • Archive = soft delete for chat threads (recoverable; artifacts untouched) — carried into how tenant-deletion was scoped too.
  • DB-integration CI runs on every PR and nightly, with no path filter: path filters miss cross-cutting breakage (migration ordering, shared helpers), and the per-run cost is negligible at this scale.
  • Branch CI from a clean ci-base, never the live DB — an integration harness must own its starting state.

Lessons

  • Determinism is a diagnostic. Identical failures across changed environments mean config/role, not data or schema. Don’t fix the first plausible thing; let the invariance point at the cause.
  • A skipped test is worse than no test — it carries a ✅ it hasn’t earned. The day you make it run, expect a backlog of latent defects to surface at once. That’s the gate working, not the gate being noisy.
  • The invariant was already written down. The fix was in secrets-register.md before the bug was understood. Read your own docs before theorising.
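
The determinism check in the first lesson can be made mechanical: reduce each run’s failures to an order-independent signature and compare signatures across environments. This is a minimal sketch. The Failure shape and both function names are assumptions, not part of the actual harness.

```typescript
// Hypothetical record of one failing test in one CI run.
interface Failure {
  suite: string;
  test: string;
  message: string;
}

// Order-independent fingerprint of a run's failures.
function failureSignature(failures: Failure[]): string {
  return failures
    .map((f) => `${f.suite}::${f.test}::${f.message}`)
    .sort()
    .join("\n");
}

// True when two runs failed in exactly the same way.
function sameFailures(a: Failure[], b: Failure[]): boolean {
  return failureSignature(a) === failureSignature(b);
}
```

If sameFailures still returns true after the data or the parent branch has changed, the variable that was changed is not the cause, and config or role is the next place to look.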

Where V1 stands

Phase 10 ramp. M10 (V1 launch readiness) still open — it closes at the readiness gate, not per sprint. Remaining, in order: JF-driven browser sessions (graph perf trace #181, WCAG verification #150) → the 4-week personal dogfood with success-criteria measurement → non-functional paperwork last (threat model, PIA, DR runbook), deliberately after dogfood so they describe the system as it actually behaves. The real-DB safety net being genuine now makes the dogfood window trustworthy in a way it wasn’t a sprint ago.