Sprint 21 — a green CI that was lying
S21 was the sprint where the test harness grew teeth. It shipped real user-facing features — Law 25 self-serve account deletion, co-owned assets (multiple custodians), an admin console index — but the thing that mattered most wasn’t a feature. It was finally making the real-database integration suites run in CI instead of only on my laptop. The moment that gate went live it immediately caught a prod incident, two latent test bugs, and a silent RLS-bypass in the test harness itself. A green CI had been lying for months. It isn’t anymore.
What shipped
- Law 25 self-serve tenant deletion (#265). The spec and the privacy policy had promised a right-to-erasure flow since v0.4; the code had a raw DELETE FROM tenants that orphaned every R2 object and left audit PII intact. Now there’s a real deletion_requests table (migration 0020), an enumerate → R2-purge → cascade → orphaned-auth-cleanup → audit-redaction service, a CRON_SECRET-gated daily sweep with a 30-day grace, owner-only Settings deletion, and a hidden Meta-key “delete now” for test/dogfood. The privacy policy (en + fr) now matches what the code does.
- Multiple custodians / co-owners (#279). Assets had a single custodian_member_id. A house with two owners couldn’t be represented. Now asset_custodians is a many-to-many join table (migration 0022 + backfill); the knowledge graph draws an edge to each owner; chat propose_asset accepts custodianMemberNames[]; the graph entity panel has a custodian multi-select.
- DB-integration CI (#281): the headline. See below.
- Plus: db-state-sync pre-merge gate + test-isolation hardening (#285), the asset_custodians GRANT prod hotfix (#283), a co-owns flaky fix (#287), and the earlier-S21 batch (doctype registry, composer focus, Recents rename/archive, tax-doc fields, user menu).
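The db-state-sync gate is described here only by its goal: a schema that is ahead of the database must not merge silently. A minimal sketch of that check, assuming the gate can obtain two lists of migration tags (the ones committed in the repo and the ones recorded as applied in the target database). The function names and the tag format are hypothetical, not Domi's actual implementation.

```typescript
// Hypothetical sketch: both lists are assumed inputs. How they are read
// (migration folder, journal file, a DB query) is out of scope here.
function unappliedMigrations(
  repoTags: readonly string[],
  appliedTags: readonly string[],
): string[] {
  // Keep every repo migration that the database has no record of.
  return repoTags.filter((tag) => appliedTags.indexOf(tag) === -1);
}

// Throws when the schema in the repo is ahead of the database,
// so a CI step built on it fails instead of passing quietly.
function assertDbInSync(
  repoTags: readonly string[],
  appliedTags: readonly string[],
): void {
  const missing = unappliedMigrations(repoTags, appliedTags);
  if (missing.length > 0) {
    throw new Error(
      "schema ahead of DB, unapplied migrations: " + missing.join(", "),
    );
  }
}
```

With repo tags 0020, 0021, 0022 and only 0020 and 0021 applied, the check reports 0022 and throws, which is the state that produced the asset_custodians failure.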
The story: a green CI that was lying
Domi’s packages/shared real-DB suites (RLS isolation, audit-before-mutation, tenant-deletion cascade, telemetry/usage rollups) are all skipUnlessConfigured: they only run when DATABASE_URL is set. ci.yml has no database. So for months, every one of those suites silently skipped in CI while the regression-suite doc proudly marked them “✅ in CI.” They passed on my laptop and nowhere else. The whole DB-backed safety net was theatre.
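One way to keep a skip-when-unconfigured helper from hiding a missing database is to let the CI job declare that the suites are mandatory. The sketch below is an assumption, not the real skipUnlessConfigured: the REQUIRE_DB_SUITES flag and the function name are invented for illustration, and only DATABASE_URL comes from the setup described above.

```typescript
// Environment passed in explicitly so the decision is a pure function.
type Env = Record<string, string | undefined>;

type SuiteMode = "run" | "skip";

// REQUIRE_DB_SUITES is a hypothetical flag a DB-enabled CI job would set.
function resolveSuiteMode(env: Env): SuiteMode {
  const url = env["DATABASE_URL"];
  const hasDatabase = typeof url === "string" && url.trim().length > 0;
  if (hasDatabase) {
    return "run";
  }
  if (env["REQUIRE_DB_SUITES"] === "1") {
    // In CI, a missing database is a failure, not a reason to skip.
    throw new Error(
      "DATABASE_URL is not set but REQUIRE_DB_SUITES=1: refusing to skip real-DB suites",
    );
  }
  // Local runs without a database still skip, as before.
  return "skip";
}
```

The design choice is that skipping stays convenient on a laptop, while the job that claims to run the suites cannot go green without actually running them.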
CLAUDE.md §11 always said these should run against “Neon ephemeral branches” in CI. It just never got built. S21 built it: a workflow that spins a throwaway Neon branch, migrates it, runs the suite, and deletes the branch — per PR, per push, nightly.
It went live and immediately failed. Three times, three different root causes, in sequence:
- relation "asset_custodians" does not exist. Migration 0022 had been generated but never applied to staging Neon, the exact S21 #279 footgun that had already 500’d production. Fix: a hotfix GRANT migration (#283), actually migrate staging, and a db-state-sync gate (#285) so “schema ahead of DB” can’t merge silently again.
- Seven RLS-isolation and usage-rollup suites failing. First hypothesis: missing GRANT (drizzle-kit doesn’t emit them, a recurring pattern). Applied the grant; no change. Second hypothesis: the ephemeral branch is copy-on-write of a populated single-env DB, so “expect 0 rows for another tenant” assertions see real inherited data. Built a clean ci-base template branch (schema + roles + grants, zero data). Re-ran. Byte-identical failures. That determinism (same numbers regardless of data or parent) was the tell: it was never data or schema.
- The actual cause: the harness was bypassing RLS. The Neon branch action’s default connection string is neondb_owner, the table owner. Postgres lets a table owner bypass row-level security unless you FORCE it, and Domi only ever ENABLEs RLS. The workflow had fed that owner URL into DATABASE_URL, so the app/test path in CI ran with RLS effectively off. Tenant isolation wasn’t broken in the product; secrets-register.md had stated the invariant outright the whole time: “RLS only enforces because we’re using app_role.” CI just wasn’t. Fix: connect the app path as app_role, keep the owner only for migrate + the tests’ own deliberate bootstrap bypass. Green. 149 tests.
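The owner-role bypass is cheap to guard against before any test runs. A minimal sketch, assuming the harness can inspect the connection string it is about to use: the function names are hypothetical, and the only role names taken from the story above are neondb_owner and app_role.

```typescript
// Extract the role (user) component from a postgres:// connection string.
function connectionRole(databaseUrl: string): string {
  const match = /^postgres(?:ql)?:\/\/([^:@/]+)(?::[^@]*)?@/.exec(databaseUrl);
  const role = match ? match[1] : undefined;
  if (!role) {
    throw new Error("not a postgres:// URL with a user component");
  }
  return decodeURIComponent(role);
}

// Refuse to run the app/test path as a role that owns the tables,
// because an owner is not subject to RLS that is only ENABLEd.
function assertRlsEnforcingRole(
  databaseUrl: string,
  ownerRoles: readonly string[] = ["neondb_owner"],
): string {
  const role = connectionRole(databaseUrl);
  if (ownerRoles.indexOf(role) !== -1) {
    throw new Error(
      "DATABASE_URL connects as table owner '" + role + "': RLS would be bypassed",
    );
  }
  return role;
}
```

Called with a URL for app_role this returns the role name; called with the owner URL it throws before a single assertion can pass for the wrong reason. The database-side alternative is ALTER TABLE ... FORCE ROW LEVEL SECURITY, which makes the owner subject to policies too, though that would also affect the tests' deliberate bootstrap bypass.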
Two wrong hypotheses before the right one — but neither was wasted. The GRANT was a genuine prod bug. ci-base is correct hygiene we’d have wanted anyway. Ruling them out is what isolated the real cause.
Decisions made
- Archive = soft delete for chat threads (recoverable; artifacts untouched) — carried into how tenant-deletion was scoped too.
- DB-integration CI runs every PR + nightly, no path filter — path filters miss cross-cutting breakage (migration ordering, shared helpers); the per-run cost is negligible at this scale.
- Branch CI from a clean ci-base, never the live DB: an integration harness must own its starting state.
Lessons
- Determinism is a diagnostic. Identical failures across changed environments mean config/role, not data or schema. Don’t fix the first plausible thing; let the invariance point at the cause.
- A skipped test is worse than no test — it carries a ✅ it hasn’t earned. The day you make it run, expect a backlog of latent defects to surface at once. That’s the gate working, not the gate being noisy.
- The invariant was already written down. The fix was in secrets-register.md before the bug was understood. Read your own docs before theorising.
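The determinism lesson can be turned into a mechanical check: record which tests fail under each environment you try, then ask whether the set changed at all. This is an illustrative sketch with hypothetical types and names, not tooling that exists in the repo.

```typescript
// One CI attempt: a label for what was varied, plus the failing test names.
interface RunResult {
  label: string;
  failures: string[];
}

// Order-independent fingerprint of a run's failures.
function failureSignature(run: RunResult): string {
  return run.failures.slice().sort().join("\n");
}

// True when at least two runs were compared and every one failed identically.
// Identical failures across changed data or schema suggest looking at
// config or role instead.
function failuresAreInvariant(runs: readonly RunResult[]): boolean {
  const [first, ...rest] = runs;
  if (!first || rest.length === 0) {
    return false;
  }
  const expected = failureSignature(first);
  return rest.every((run) => failureSignature(run) === expected);
}
```

If the run against the populated parent and the run against the clean ci-base produce the same signature, the variable you changed was not the cause.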
Where V1 stands
Phase 10 ramp. M10 (V1 launch readiness) still open — it closes at the readiness gate, not per sprint. Remaining, in order: JF-driven browser sessions (graph perf trace #181, WCAG verification #150) → the 4-week personal dogfood with success-criteria measurement → non-functional paperwork last (threat model, PIA, DR runbook), deliberately after dogfood so they describe the system as it actually behaves. The real-DB safety net being genuine now makes the dogfood window trustworthy in a way it wasn’t a sprint ago.