2026-06-02

Sprint 31 — A month old, and already a major behind

The day after I posted the one-month-by-the-numbers retrospective — 687 commits, 287 merges, all the lines pointing up — I spent Sprint 31 looking in the opposite direction. Down, at the foundation. No new features shipped this sprint. It was an enabler sprint: pay down the security and dependency debt that had quietly accumulated under the feature work, so the next thirty sprints build on something solid. It started, as these things do, with a red alert.

The CVE that started it

Dependabot’s security tab lit up with four high alerts: drizzle-orm, my ORM — the thing that touches every query in the app — had a SQL-injection CVE (CVE-2026-39356). That’s about as load-bearing as a vulnerability gets when your entire data layer is one library. Bumping to the patched 0.45.2 (#591) was easy; the uncomfortable part was the question it raised. Drizzle had drifted from 0.36 at Sprint 0 to a CVE at 0.45 in a single month, and I only found out because a vulnerability made it light up. I cleared the moderate alerts too (#592) — transitive qs/postcss/esbuild via overrides, plus next-auth — and then went looking for why nobody had told me any of this sooner.

The answer: nothing was set up to tell me. The repo had GitHub’s security alerting on by default, which is reactive (“this version has a CVE”), but no dependency update bot, so nothing proactive ever said “a newer version exists, here’s a PR.” Every library had therefore frozen at whatever was current the day I scaffolded it, and would stay frozen until a CVE forced the issue. I added a Dependabot config (#593) to open version-update PRs. That’s when the surprise arrived.

The surprise

I expected a trickle of patch bumps. What I got was a wall of major version jumps — for a codebase that is one month old. A sampling of what was already behind:

zod 3 → 4 (#662) — the schema layer, 441 call sites, the runtime bridge for every LLM tool
@neondatabase/serverless 0.10 → 1.1 (#657) — the Postgres driver crossed 1.0
@anthropic-ai/sdk 0.40 → 0.100 (#659) — sixty minor releases; pre-1.0, so each is fair game to break
eslint 9 → 10 and @eslint/js 9 → 10 (#652, #668)
vitest 2 → 4 (#598) — two majors
lint-staged 15 → 17 (#597) — two majors
@types/node 22 → 25 (#601) — three majors
lucide-react 0 → 1, dotenv 16 → 17, tailwind-merge 2 → 3, libsodium-wrappers 0.7 → 0.8, @astrojs/starlight 0.30 → 0.39

That list still surprises me written out. A month-old project, and I’m already a major version — sometimes two or three — behind across the toolchain. The lesson is that “new” and “current” are not the same thing. The JavaScript ecosystem moves fast enough that the gap opens the day you pin a version, and pre-1.0 libraries (half the AI and tooling stack) treat a minor bump as a license to break. The scaffolding pinned what was current in early May; the ecosystem shipped majors all month while I built features on top.
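
To make the counting in that list concrete, here is a minimal TypeScript sketch of how such a gap can be measured. The helper name breakingDistance and the rule it encodes (count majors when the major differs; on 0.x, count minors, since each one may break) are assumptions for illustration, not a tool used in the sprint.

```typescript
// Hypothetical helper: how far behind is a pinned version, measured in the
// unit that is allowed to break (majors, or minors while still on 0.x)?
type Gap = { unit: "major" | "minor" | "none"; count: number };

function parseMajorMinor(version: string): [number, number] {
  // Missing parts default to 0, so "22" is read as 22.0.
  const [major = "0", minor = "0"] = version.split(".");
  return [Number(major), Number(minor)];
}

function breakingDistance(from: string, to: string): Gap {
  const [fromMajor, fromMinor] = parseMajorMinor(from);
  const [toMajor, toMinor] = parseMajorMinor(to);
  if (toMajor !== fromMajor) {
    return { unit: "major", count: toMajor - fromMajor };
  }
  // Pre-1.0: treat every minor release as potentially breaking.
  if (fromMajor === 0 && toMinor !== fromMinor) {
    return { unit: "minor", count: toMinor - fromMinor };
  }
  return { unit: "none", count: 0 };
}

console.log(breakingDistance("22", "25"));      // @types/node
console.log(breakingDistance("0.40", "0.100")); // @anthropic-ai/sdk
```

Under these rules the first call reports three majors and the second reports sixty minors, matching the list above.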

Taming the bot

Standing up Dependabot turned out to be its own small project, because the first wave of PRs all failed CI — and almost never because the dependency itself was broken. They failed for infrastructure reasons I had to fix one class at a time: Dependabot can’t read repo secrets, so the real-DB and eval gates errored (#606); the new lint-staged needed a newer Node than .nvmrc pinned (#607); range-keyed pnpm overrides serialized differently and tripped the frozen-lockfile check (#625); AWS SDK packages that share types have to move in lockstep (#658), as do ai + @ai-sdk/* (#626/#627); and a per-directory config was opening five duplicate PRs for every dependency until I collapsed it to a single root (#647). Each failure looked like “the bump is broken” and turned out to be “my plumbing was wrong.”
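
The secrets failure is the easiest of these to picture in code. The sketch below shows one way a secret-dependent gate could decide what to do. It is not the actual change in #606. The function name decideGate and the DATABASE_URL secret name are hypothetical. GITHUB_ACTOR is the variable GitHub Actions sets, and Dependabot runs appear as "dependabot[bot]".

```typescript
// Hypothetical gate decision for a CI step that needs a repo secret.
type GateDecision = "run" | "skip" | "fail";

function decideGate(
  env: Record<string, string | undefined>,
  secretName: string,
): GateDecision {
  const hasSecret = (env[secretName] ?? "") !== "";
  if (hasSecret) return "run";
  // Dependabot-triggered runs do not receive regular repo secrets,
  // so a missing secret there is expected rather than a broken bump.
  if (env["GITHUB_ACTOR"] === "dependabot[bot]") return "skip";
  // Anyone else missing the secret is a real misconfiguration.
  return "fail";
}

// Example with a hand-built environment; in CI you would pass process.env.
console.log(decideGate({ GITHUB_ACTOR: "dependabot[bot]" }, "DATABASE_URL"));
```

The point of the three-way result is that a skipped gate stays distinguishable from a passed one, so the skip does not turn into yet another narrowly-true green checkmark.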

By the end: 45 PRs merged this sprint, and roughly 36 more triaged and closed — the per-directory duplicates, the superseded ones, and the genuinely-can’t-merge ones like a Tailwind v4 jump I deferred to V1.5 and an @opentelemetry/api bump that’s futile because it’s force-pinned via overrides (to keep drizzle-orm from splitting into two type-incompatible trees). Call it ~80 dependency PRs handled in a few days. Zero features. Pure foundation.

zod 4, verified with a real model call

The major I cared most about getting right was zod, because it’s the source of truth for every schema and the bridge that hands tool definitions to the AI SDK. Rather than guess at the blast radius, I ran the whole migration in a throwaway git worktree first: bump, install, typecheck, test, count the damage. The damage was small — seven real breaks (six z.record() calls that now need an explicit key type, and one genuinely subtle one where zod 4 infers a trailing .transform() as a required-but-undefined key instead of an optional one, which broke every conditional-spread that built an address). Everything else was a deprecation cleanup I folded in so the eventual v5 jump is free.
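
The subtle break is easier to see with the types written out. This is a minimal sketch that uses plain TypeScript types as stand-ins for the two inferred shapes. The type names, the unit field, and both builder functions are assumptions for illustration, not the real schema or the code that was fixed.

```typescript
// Stand-ins for the two inferred shapes (assumption: plain types, not zod output).
type AddressOptionalUnit = { street: string; unit?: string };
type AddressRequiredUnit = { street: string; unit: string | undefined };

// Conditional spread: the key is omitted entirely when there is no unit.
function buildWithSpread(street: string, unit?: string): AddressOptionalUnit {
  return { street, ...(unit ? { unit } : {}) };
}

// Explicit key: always present, possibly undefined, so it fits the required shape.
function buildExplicit(street: string, unit?: string): AddressRequiredUnit {
  return { street, unit };
}

// const broken: AddressRequiredUnit = buildWithSpread("1 Main St");
// ^ rejected by the compiler: 'unit' is optional in the source type
//   but required in the target type.

console.log("unit" in buildWithSpread("1 Main St")); // false
console.log("unit" in buildExplicit("1 Main St"));   // true
```

Both builders produce an address without a usable unit, but only the second has the key at all, which is the difference the type checker starts enforcing once the inferred shape changes from optional to required-but-undefined.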

The interesting part was verifying it. Typecheck and unit tests prove the shapes line up, but nothing in CI actually called Anthropic with the new SDK on a schema-only change — the one eval that makes a live model call is path-filtered to the prediction code. So I added a manual workflow_dispatch trigger to that eval (#660) and ran it against the branch: nine fixtures, real round-trips to the model, all green under zod 4. Now any future SDK or schema major can be confirmed with an actual call before it merges, not just inferred from types.

The build that only broke in production

Halfway through, every production deploy started failing while previews stayed green and the live site kept serving an older build. The migrations were fine; next build was getting OOM-killed on the 8 GB build machine. The tell was the asymmetry: previews skip the Sentry webpack wrapper (it’s production-only), so production builds carry source-map generation on top of the compile, and as the bundle grew the combined peak crossed 8 GB. I stopped the crash with a memory-optimization flag (#661) — which worked, but then cold builds crawled, because the flag disables webpack’s module cache; the zod rebuild took thirty minutes. The real fix was moving to Vercel’s Enhanced Build Machines (16 GB): the same commit then built in three minutes, after which I dropped the flag again (#663) since the RAM made it unnecessary. A tidy little arc — crash, workaround, proper fix, remove the workaround.

And a quieter one: the help docs site (help.domiapp.ai) had been silently failing to build for days. A Starlight bump (#653) had pulled it to a version that requires Astro 6 while we’re on Astro 5 — but the docs-site deploy isn’t a required check, so the red mark sat on every PR without blocking anything, and the live docs just stopped updating. I pinned Starlight back to the last Astro-5-compatible line (#664) and added a guard so it won’t drift again. The lesson filed itself: a non-blocking check is one nobody reads, which means it can hide real breakage indefinitely.
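
A guard of this kind can be as small as comparing a package’s declared peer range against the installed version. The sketch below is one possible shape, not the guard that was actually added in #664. The function names are hypothetical, and it deliberately handles only caret ranges on majors 1 and above, without prerelease tags.

```typescript
type Version = [number, number, number];

function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)/.exec(text);
  if (!match) throw new Error(`unsupported version: ${text}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Returns true when `installed` falls inside a caret range such as "^6.0.0".
function satisfiesCaret(range: string, installed: string): boolean {
  if (!range.startsWith("^")) {
    throw new Error(`only caret ranges are handled: ${range}`);
  }
  const [baseMajor, baseMinor, basePatch] = parseVersion(range.slice(1));
  if (baseMajor === 0) {
    throw new Error("0.x caret ranges are out of scope for this sketch");
  }
  const [major, minor, patch] = parseVersion(installed);
  if (major !== baseMajor) return false; // a caret range never crosses a major
  if (minor !== baseMinor) return minor > baseMinor;
  return patch >= basePatch;
}

// A dependency that declares a peer range of ^6.0.0 on a project still on 5.x:
console.log(satisfiesCaret("^6.0.0", "5.1.0")); // false
```

Run as a required check, a failing result here would block the bump itself instead of leaving a red mark on a deploy nobody reads.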

What I learned

The big one is that supply-chain entropy starts on day one. I’d assumed a brand-new codebase would be, almost by definition, up to date — and it was already a major behind across the board, with a live CVE in the most critical library, and nothing watching. “I just built this” buys you nothing against the ecosystem’s clock. The fix isn’t heroics, it’s a bot plus a CI safety net plus the discipline to triage the noise weekly instead of letting it freeze into a security fire drill. The other recurring shape was the narrowly-true green checkmark — the futile override bump that “fails CI,” the non-required docs check nobody reads, the typecheck that passes without ever calling the model. Same lesson as last sprint, different costume: a check is only as good as the question it actually asks.

Next sprint, I get to go back upstairs. The foundation’s current and the alerts are clear, so Sprint 32 returns to the real work of V1 — dogfooding the thing by living in it, fixing what the daily use surfaces, and shipping a few improvements I’ve been wanting. Features again. But on dependencies that, for the first time, I can actually trust to be current.