Release v0.85.0 — Evidence & Resilience (2026-06-11)
Audit remediation release 2 of 2, closing the five high-value findings the 2026-06-10 fresh-pass audit left open after v0.84.x (“operator truth”): render-pool saturation handling (H3), the month-red HA failover check (H4), never-executed render-isolation suites (H5), template format versioning (H7), and the editor publish-gate trust gaps (H8). Shipped across PRs #86–#100. No breaking changes — every contract addition is additive, and all new env vars have backward-compatible defaults.
Highlights
Saturation sheds instead of cascading (H3)
Before: an overloaded render pool queued unboundedly; the child-process dispatcher’s wall-clock deadline then killed workers mid-render, orphaning every sibling render (one slow burst could cascade into total render-path failure). Now:
- The browser-pool queue is bounded (RENDER_MAX_QUEUE_DEPTH, default 2× pool). Beyond it, renders shed immediately with the new render_saturated code. POST /render/pdf, POST /render/preview/pdf, and the sandbox render return 503 + Retry-After: 30 (body carries retryAfter: 30), so clients get an actionable backpressure signal instead of a slow timeout.
- Dispatch is two-phase: queue-wait expiry sheds only the waiting render; a render that has started keeps its full execution budget.
- An API-side admission gate (RENDER_MAX_CONCURRENT_PAGES + RENDER_MAX_QUEUE_DEPTH in flight) sheds before HTML-generation cost is sunk, and bounds concurrent docker run spawns in container mode.
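The shedding rules above can be sketched as a small admission model. This is an illustration only, not the shipped dispatcher: the class, method names, and the explicit `now` clock are hypothetical, and only the `render_saturated` code, the 503 status, and the 30-second retry hint come from these notes.

```typescript
// Illustrative model of the bounded queue and two-phase shedding.
// All identifiers here are hypothetical; only render_saturated, 503 and
// the retryAfter value of 30 are taken from the release notes.
type Admission =
  | { kind: "run" }
  | { kind: "queued"; position: number }
  | { kind: "shed"; status: 503; code: "render_saturated"; retryAfter: 30 };

class RenderAdmission {
  private active = 0;
  private waiting: Array<{ id: string; enqueuedAt: number }> = [];
  private readonly maxConcurrentPages: number;
  private readonly maxQueueDepth: number;
  private readonly queueWaitMs: number;

  constructor(maxConcurrentPages: number, maxQueueDepth: number, queueWaitMs: number) {
    this.maxConcurrentPages = maxConcurrentPages;
    this.maxQueueDepth = maxQueueDepth;
    this.queueWaitMs = queueWaitMs;
  }

  // Run if a slot is free, queue if the queue has room, otherwise shed at once.
  admit(id: string, now: number): Admission {
    if (this.active < this.maxConcurrentPages) {
      this.active += 1;
      return { kind: "run" };
    }
    if (this.waiting.length < this.maxQueueDepth) {
      this.waiting.push({ id, enqueuedAt: now });
      return { kind: "queued", position: this.waiting.length };
    }
    return { kind: "shed", status: 503, code: "render_saturated", retryAfter: 30 };
  }

  // Phase one: queue-wait expiry drops only waiting renders.
  // Renders that already started are not touched here.
  expireWaiting(now: number): string[] {
    const expired = this.waiting.filter((w) => now - w.enqueuedAt >= this.queueWaitMs);
    this.waiting = this.waiting.filter((w) => now - w.enqueuedAt < this.queueWaitMs);
    return expired.map((w) => w.id);
  }

  // A running render finished: hand its slot to the oldest waiter, if any.
  release(): string | undefined {
    const next = this.waiting.shift();
    if (next === undefined) {
      this.active -= 1;
      return undefined;
    }
    return next.id;
  }
}
```

With a queue depth of 0 the middle branch never fires, so every render beyond the pool sheds immediately.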
New env vars (see deployment guide § “Sizing the render pool”):
RENDER_MAX_CONCURRENT_PAGES (1–50, default 5), RENDER_MAX_QUEUE_DEPTH
(0–1000, default 2× pool), RENDER_WORKER_TIMEOUT_MS (5000–600000,
default 65000). RENDER_PREVIEW_RESERVED_SLOTS may now go up to 49,
with a boot-time rule that it leaves at least one batch slot.
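As a rough sketch of how these ranges and defaults could be enforced at boot: the function names below are hypothetical, rejecting out-of-range values by throwing (rather than clamping) is an assumption, and the lower bound and default of 0 for RENDER_PREVIEW_RESERVED_SLOTS are also assumptions, since these notes state only its upper bound.

```typescript
// Hypothetical boot-time validation; not the project's actual config loader.
type Env = Record<string, string | undefined>;

interface RenderPoolConfig {
  maxConcurrentPages: number;
  maxQueueDepth: number;
  workerTimeoutMs: number;
  previewReservedSlots: number;
}

// Read an integer env var, falling back when unset and rejecting bad values.
function readInt(env: Env, name: string, min: number, max: number, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new Error(`${name} must be an integer in ${min}..${max}, got "${raw}"`);
  }
  return value;
}

function loadRenderPoolConfig(env: Env): RenderPoolConfig {
  const maxConcurrentPages = readInt(env, "RENDER_MAX_CONCURRENT_PAGES", 1, 50, 5);
  // Default queue depth is 2x the pool, so an unset env gives 5 and 10.
  const maxQueueDepth = readInt(env, "RENDER_MAX_QUEUE_DEPTH", 0, 1000, 2 * maxConcurrentPages);
  const workerTimeoutMs = readInt(env, "RENDER_WORKER_TIMEOUT_MS", 5000, 600000, 65000);
  // Assumption: minimum 0 and default 0 (only the maximum of 49 is documented).
  const previewReservedSlots = readInt(env, "RENDER_PREVIEW_RESERVED_SLOTS", 0, 49, 0);
  // Boot-time rule: reserved preview slots must leave at least one batch slot.
  if (previewReservedSlots >= maxConcurrentPages) {
    throw new Error("RENDER_PREVIEW_RESERVED_SLOTS must leave at least one batch slot");
  }
  return { maxConcurrentPages, maxQueueDepth, workerTimeoutMs, previewReservedSlots };
}
```

Calling `loadRenderPoolConfig({})` yields the documented defaults (pool of 5, queue of 10, 65000 ms), while a reserved-slot count equal to the pool size fails at boot.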
HA Check 7 fixed at the root (H4)
The single-replica outage failover check had failed every nightly since it
landed (~95.2 % availability vs the 99 % threshold). Root cause: docker compose stop blackholes new connections (no RST); nginx’s default 60 s
proxy_connect_timeout let probes hang until the client aborted, and
client aborts never count toward max_fails, so the dead upstream was
never benched — a serial probe loop ate ~3 failures per outage window.
The LB now fails connects in 1 s, retries the surviving peer on
error/timeout/502/503 (bounded: 2 tries, 4 s), and benches explicitly.
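For orientation, the load-balancer behaviour described above maps onto standard nginx directives roughly as follows. This is a sketch, not the shipped config: the upstream name, hostnames, and ports are hypothetical, and the max_fails and fail_timeout values are assumptions, since these notes give only the connect timeout, the retry conditions, and the retry bounds.

```nginx
# Hypothetical upstream; peer names and ports are placeholders.
upstream app_replicas {
    # Assumed benching policy: one failed attempt benches the peer for 10 s.
    server app-1:3000 max_fails=1 fail_timeout=10s;
    server app-2:3000 max_fails=1 fail_timeout=10s;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_replicas;

        # Fail a blackholed connect in 1 s instead of the 60 s default.
        proxy_connect_timeout 1s;

        # Retry the surviving peer on connect errors, timeouts, 502 and 503.
        proxy_next_upstream error timeout http_502 http_503;

        # Bound the retry: at most 2 tries within 4 s overall.
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 4s;
    }
}
```

One caveat when adapting this: by default nginx does not pass non-idempotent requests such as POST to the next peer once the request has been sent upstream, unless `non_idempotent` is added to `proxy_next_upstream`. Whether the shipped config opts in is not stated here.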
A red nightly can no longer rot silently: any failing check creates or
comments on a single pinned ha-nightly-failure issue.
Render isolation modes are now proven in CI (H5)
The container- and socket-mode streaming suites (env-gated since v0.46.0)
had never executed in CI — RENDER_MODE=container/socket shipped on
unit coverage alone. The push-gated docker job now builds the worker
image, probes a Chromium launch inside it under the dispatcher’s exact
hardening flags, and runs both suites against it: real docker run
dispatch, and a real render-controller on a Unix socket.
The gate caught a production bug on its first execution: Chromium’s
crashpad handler crashed the entire browser launch inside hardened
containers (--read-only + --cap-drop ALL) on common kernels —
container/socket isolation was broken on such hosts despite working on
Docker Desktop. Fixed by disabling the crash machinery in the launch args
and pointing the worker image’s HOME at the tmpfs.
Template format versioning (H7)
Definitions now carry formatVersion: 1, stamped at the save boundary
(absence still means 1, so nothing existing changes; an unknown future
format is rejected loudly instead of misread).
template-compatibility.md documents the
additive-only format promise and the pulp validate upgrade pre-flight.
Two new permanent gates enforce it: a frozen compat corpus (including a
v0.18.0 production-era template, asserted parseable forever) and an
editor↔server parity test (every starter pack must build a definition the
server’s save-boundary schema accepts).
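The versioning rule (absent means 1, unknown future versions are rejected, the version is stamped on save) can be illustrated with a minimal sketch. The type and function names are hypothetical and the real save-boundary schema is certainly richer than this.

```typescript
// Hypothetical sketch of the formatVersion rule; not the server's schema.
const CURRENT_FORMAT_VERSION = 1;

interface TemplateDefinition {
  formatVersion?: number;
  [key: string]: unknown;
}

// Absence still means 1; anything this build does not know is rejected loudly.
function readFormatVersion(def: TemplateDefinition): number {
  const version = def.formatVersion ?? 1;
  if (!Number.isInteger(version) || version < 1 || version > CURRENT_FORMAT_VERSION) {
    throw new Error(`unsupported template formatVersion: ${String(version)}`);
  }
  return version;
}

// Stamp at the save boundary without mutating the caller's object.
function stampOnSave(def: TemplateDefinition): TemplateDefinition {
  readFormatVersion(def);
  return { ...def, formatVersion: CURRENT_FORMAT_VERSION };
}
```

Because stamping returns a copy, a stored definition gains `formatVersion: 1` only when it is next saved, which matches the upgrade behaviour described in these notes.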
Editor publish-gate trust (H8)
- Global shortcuts no longer act on the canvas underneath an open dialog (Delete could remove canvas nodes mid-publish-review), and Escape with a dialog open belongs to the dialog.
- A stale “All checks passed.” verdict can no longer publish unchecked content: if the template mutated after the verdict, Publish Now re-runs the checks against the mutated template.
- A new e2e spec exercises the REAL /render/validate (no route stubs): fail-closed failure path with the machine code surfaced, and a full-format success path through to publish.
- FormApp embed test debt resolved with no silent skips: the ready-message test is un-skipped and green in CI; the submit-path test’s deterministic ubuntu-only failure is tracked in #96.
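The stale-verdict rule can be pictured as a verdict pinned to a template revision. The sketch below is one possible shape, assuming a simple revision counter; the class and its members are hypothetical and are not the editor's actual state model.

```typescript
// Hypothetical model: a verdict is only trusted for the revision it checked.
type CheckFn = (template: string) => boolean;

class PublishGate {
  private template: string;
  private revision = 0;
  private verdict: { passed: boolean; revision: number } | undefined;
  private readonly check: CheckFn;

  constructor(template: string, check: CheckFn) {
    this.template = template;
    this.check = check;
  }

  // Any mutation bumps the revision, which silently invalidates old verdicts.
  edit(next: string): void {
    this.template = next;
    this.revision += 1;
  }

  // Run the checks and record which revision the verdict belongs to.
  runChecks(): boolean {
    const passed = this.check(this.template);
    this.verdict = { passed, revision: this.revision };
    return passed;
  }

  // Publish Now: reuse a fresh verdict, otherwise re-run against current content.
  publishNow(): "published" | "blocked" {
    const v = this.verdict;
    const passed = v !== undefined && v.revision === this.revision ? v.passed : this.runChecks();
    return passed ? "published" : "blocked";
  }
}
```

An edit made after a passing verdict forces a re-check on publish, so unchecked content cannot ride on the earlier result.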
Evidence
| Claim | Evidence |
|---|---|
| Full CI matrix green on the release line, including both isolation suites executing real renders | run 27280878065 (commit 6402f9c) |
| HA nightly green incl. Check 7 — dispatch | run 27278656273: all 4 checks green; Check 7 = 282/282 requests, 100.00 % availability |
| HA nightly green incl. Check 7 — real scheduled run | run 27286219985: all 4 checks green; Check 7 again 282/282, 100.00 % |
| Check 7 fix reproduced locally before CI | 280/280 (100.00 %) against the same stack via pnpm ha:check-7 |
| Saturation contract | Route-level test holds the real pool, asserts 503 + Retry-After: 30 + envelope, then proves recovery with a live Chromium render |
| Crashpad fix | CI probe CHROMIUM_PDF_OK + container suite 3/3 in CI (previously instant launch failure with chrome_crashpad_handler: --database is required) |
Upgrade notes
- No action required. All new env vars default to the previous effective behaviour (pool of 5, bounded queue of 10, 65 s watchdog).
- Operators who want immediate shedding under load can set RENDER_MAX_QUEUE_DEPTH=0; clients should treat 503 + Retry-After on render routes as retryable backpressure.
- RENDER_MODE=container/socket operators should pull the v0.85.0 worker image; older images can fail Chromium launch on kernels where crashpad misbehaves under the hardened flags (see the crashpad fix under “Render isolation modes are now proven in CI” above).
- Stored templates are untouched; formatVersion: 1 appears on definitions the next time they are saved. Restores remain byte-faithful clones.
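On the client side, treating 503 + Retry-After as retryable backpressure might look like the sketch below. It is transport-agnostic on purpose: `send` and `sleep` are injected, the response shape is hypothetical, and the cap of 3 attempts is an assumption. The 30-second fallback mirrors the Retry-After value described in these notes.

```typescript
// Hypothetical client helper; the response shape is an assumption.
interface RenderResponse {
  status: number;
  retryAfterSeconds?: number; // parsed from the Retry-After header or body
}

// How long to wait before retrying, or undefined if this is not backpressure.
function saturationDelayMs(res: RenderResponse, fallbackSeconds = 30): number | undefined {
  if (res.status !== 503) return undefined;
  return (res.retryAfterSeconds ?? fallbackSeconds) * 1000;
}

// Retry a render while the server signals saturation, up to maxAttempts sends.
async function renderWithBackpressure(
  send: () => Promise<RenderResponse>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 3,
): Promise<RenderResponse> {
  let res = await send();
  for (let attempt = 1; attempt < maxAttempts; attempt += 1) {
    const delay = saturationDelayMs(res);
    if (delay === undefined) return res; // success or a non-retryable error
    await sleep(delay);
    res = await send();
  }
  return res; // still saturated after the last attempt; caller decides
}
```

Injecting `sleep` keeps the helper testable without real timers; a production caller would typically pass a timer-based delay and add jitter so that shed clients do not all retry at the same instant.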
Deferred / known items
- npm/PyPI SDK publishing remains deferred (backlog PCR-2); the two publish workflows are disabled until registry credentials exist, so tag pushes no longer produce guaranteed-red runs.
- FormApp submit-path embed test: deterministic ubuntu-only failure under jsdom, tracked in #96 (runtime behaviour unaffected; header wiring is covered by the un-skipped sibling test).
- HA Check 6 rolling-replacement variant and the multi-instance S3 read-after-write check remain deferred (see ha-validation-report.md).